# Spam Detection - Model Testing

- Add the project's root directory (two levels up) to the Python path so that the project modules can be imported even if they aren't in the current working directory:

In [None]:
import sys
import os

project_root = os.path.abspath(os.path.join('..', '..'))
if project_root not in sys.path:
    sys.path.append(project_root)

- Import the required libraries and modules, as well as our utility functions:

In [None]:
import pandas as pd
import joblib

from src.utils import load_config, get_project_root, save_as_csv

- Load the config using the utility function, then build the paths to the folders and files we read from and write to:

In [None]:
config = load_config()

test_path = config['data']['task1']['processed']['test']
model_path = os.path.join(get_project_root(), config['data']['task1']['models'], 'best_model.pkl')
output_path = os.path.join(get_project_root(), config['data']['task1']['results'])

processed_features_test_path = os.path.join(get_project_root(), test_path, "spam_detection_test_processed_features.csv")
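
- As a rough illustration of how these paths are assembled, the sketch below uses a hypothetical config dictionary. The folder values and the stand-in for `get_project_root()` are assumptions made for the example; only the key names and file names come from this notebook:

```python
import os

# Hypothetical config contents; the real values come from load_config().
config = {
    "data": {
        "task1": {
            "processed": {"test": "data/task1/processed/test"},
            "models": "models/task1",
            "results": "results/task1",
        }
    }
}

# Stand-in for get_project_root(); the real helper lives in src.utils.
project_root = os.path.abspath(os.path.join("..", ".."))

task1 = config["data"]["task1"]

# Join the project root with the configured relative folders.
model_path = os.path.join(project_root, task1["models"], "best_model.pkl")
output_path = os.path.join(project_root, task1["results"])
processed_features_test_path = os.path.join(
    project_root,
    task1["processed"]["test"],
    "spam_detection_test_processed_features.csv",
)

print(model_path)
print(processed_features_test_path)
```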

- Load the processed test features from the CSV file into a dataframe, and load the previously trained model so we can run predictions with it:

In [None]:
test_df = pd.read_csv(processed_features_test_path)
model = joblib.load(model_path)

- Drop the text columns (`text` and `clean_text`), since they are not needed for prediction and are not consistently present (hence `errors='ignore'`). Then make predictions on the test data:

In [None]:
X_test = test_df.drop(columns=['clean_text', 'text'], errors='ignore')
predictions = model.predict(X_test)
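
- Before predicting, it can help to confirm that the remaining columns match what the model was trained on. The helper below is a minimal sketch; `diff_columns` and the column names are hypothetical and not part of this project:

```python
def diff_columns(actual, expected):
    """Return (missing, unexpected) column names, preserving input order."""
    actual = [str(c) for c in actual]
    expected = [str(c) for c in expected]
    missing = [c for c in expected if c not in actual]
    unexpected = [c for c in actual if c not in expected]
    return missing, unexpected


# Hypothetical column names, for illustration only.
trained_on = ["msg_length", "num_digits", "has_url"]
test_columns = ["msg_length", "has_url", "num_words"]

missing, unexpected = diff_columns(test_columns, trained_on)
print("missing:", missing)
print("unexpected:", unexpected)
```

In this notebook the call might look like `diff_columns(X_test.columns, model.feature_names_in_)`, assuming the loaded model is a scikit-learn estimator that was fitted on a dataframe, which is when that attribute is available.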

- Add the predictions to the dataframe as a `label` column and move that column to the front, which makes the output easier to read and more consistent with the original formatting:

In [None]:
test_df['label'] = predictions
ordered_cols = ['label'] + [col for col in test_df.columns if col != 'label']
test_df = test_df[ordered_cols]

- Inspect the first few predictions alongside their text as a sanity check:

In [None]:
keep_columns = ['label', 'text']

In [None]:
print(test_df[keep_columns].head(30))

- Reduce the dataframe being saved to the predicted label only. This allows for correct marking, where the expected label for each message is compared with the predicted label:

In [None]:
final_df = test_df[['label']]
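
- Because only the labels are kept, any comparison against the expected labels has to rely on row order. A minimal sketch of such a comparison, using made-up labels rather than real results, could look like this (the `accuracy` helper is hypothetical):

```python
def accuracy(expected, predicted):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("no labels to compare")
    matches = sum(e == p for e, p in zip(expected, predicted))
    return matches / len(expected)


# Made-up labels, for illustration only (not results from this project).
expected_labels = [0, 1, 0, 0, 1]
predicted_labels = [0, 1, 1, 0, 1]

score = accuracy(expected_labels, predicted_labels)
print(f"accuracy: {score:.2f}")
```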

- Save the results dataframe to the specified file location:

In [None]:
save_as_csv(final_df, output_path, "results.csv")

print("Results Saved")
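
- The implementation of `save_as_csv` is not shown in this notebook. As a guess at what such a helper might do, the sketch below creates the output folder if needed and writes the dataframe without the index; the name `save_as_csv_sketch` and its behaviour are assumptions, not the project's actual code:

```python
import os
import tempfile

import pandas as pd


def save_as_csv_sketch(df, folder, filename):
    """Hypothetical stand-in for save_as_csv: write df to folder/filename."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)  # omit the index so only 'label' is written
    return path


# Write a small made-up dataframe to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_df = pd.DataFrame({"label": [0, 1, 0]})
    saved_path = save_as_csv_sketch(
        demo_df, os.path.join(tmp, "results"), "results.csv"
    )
    reloaded = pd.read_csv(saved_path)

print(reloaded["label"].tolist())
```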