# Sentiment Analyzer - Prediction Pipeline

This notebook loads a trained sentiment analysis model and generates predictions for new text data.

<a href="https://colab.research.google.com/github/georgehtliu/sentiment-analyzer/blob/master/submission_createcsv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Libraries

- `joblib`: Loading saved scikit-learn models and vectorizers
- `pandas`: Data manipulation and CSV handling
- `string`: Punctuation characters used in text cleaning
- `re`: Regex-based removal of digits

In [8]:
import joblib
import pandas as pd
import string
import re

## Load Model, Vectorizer, and Input Data

This section sets the file paths and loads the trained model and vectorizer. The dataset to be classified is read at the start of the next section.

In [None]:
# File paths - adjust these if your files are in different locations
model_path = "SentimentNewton_Log.pkl"
vectorizer_path = "Vectorizer.pkl"
judge_data_path = "contestant_judgment.csv"

# For Google Colab, remove the triple quotes around the block below:
"""
from google.colab import drive
drive.mount('/content/drive')

model_path = input('Please enter path to SentimentNewton_Log.pkl: ')
vectorizer_path = input('Please enter path to Vectorizer.pkl: ')
judge_data_path = input("Please enter the path to contestant_judgment.csv: ")
"""

# Load the trained model and vectorizer
print("Loading trained model and vectorizer...")
try:
    clf_log = joblib.load(model_path)
    vectorizer = joblib.load(vectorizer_path)
    print(f"✓ Model loaded from: {model_path}")
    print(f"✓ Vectorizer loaded from: {vectorizer_path}")
except FileNotFoundError as e:
    # e.filename names the path that could not be opened
    print(f"Error: Could not find {e.filename}. Make sure you have run submission_training.ipynb first.")
    raise
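
Since all three paths are set by hand, it can help to check them before loading anything. The sketch below is one possible pre-flight check; `find_missing` is a hypothetical helper that is not part of the original notebook, and it only tests that each path exists as a file, not that the file is a valid model or CSV.

```python
from pathlib import Path


def find_missing(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Same default file names as in the cell above.
required = ["SentimentNewton_Log.pkl", "Vectorizer.pkl", "contestant_judgment.csv"]

missing = find_missing(required)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All input files found.")
```

Reporting every missing file at once avoids fixing one path only to hit the next error on the following run.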

## Data Preprocessing and Vectorization

Apply the same preprocessing steps used during training, then vectorize the text.

In [12]:
# Load the dataset to be classified
print(f"Loading data from: {judge_data_path}")
df_judge = pd.read_csv(judge_data_path)

# Check if 'Text' column exists
if 'Text' not in df_judge.columns:
    print("Available columns:", df_judge.columns.tolist())
    raise ValueError("Dataset must contain a 'Text' column")

print(f"Dataset shape: {df_judge.shape}")
print("\nSample of input data:")
# A bare df_judge.head() is only displayed when it is the last line of a cell
print(df_judge.head())

# Apply the same preprocessing as training
def remove_punct(text):
    """Remove punctuation and numbers from text."""
    if pd.isna(text):
        return ""
    text = "".join([char for char in str(text) if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

print("\nPreprocessing text data...")
df_judge['Text'] = df_judge['Text'].map(remove_punct)

# Vectorize the preprocessed text
print("Vectorizing text data...")
X = df_judge['Text']
X_vectors = vectorizer.transform(X)
print(f"Feature matrix shape: {X_vectors.shape}")
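
To see what the cleaning step does to individual strings, here is a small standalone sketch. `clean_text` is a hypothetical name; it uses `str.translate` in place of the notebook's character loop and regex, which should give the same result for ordinary strings (ASCII punctuation and the digits 0-9 are dropped, everything else is kept). As a simplification, it treats only `None` as missing, whereas `remove_punct` relies on `pd.isna`.

```python
import string

# Translation table that deletes ASCII punctuation and the digits 0-9.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def clean_text(text):
    """Strip punctuation and digits; return "" for a missing value."""
    if text is None:
        return ""
    return str(text).translate(_DROP)


samples = ["Great movie!!! 10/10", "Don't buy this.", None]
for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
```

Whitespace and letter case are left untouched, so the first sample keeps the space that preceded `10/10`, and the apostrophe in `Don't` is removed along with the other punctuation.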

## Generate Predictions and Save Results

Predict the sentiment of every text and save the results to a CSV file.

In [13]:
# Generate predictions
print("Generating predictions...")
df_judge['Sentiment'] = clf_log.predict(X_vectors)

# Display prediction distribution
print("\nPrediction distribution:")
print(df_judge['Sentiment'].value_counts())

# Save results to CSV
csv_path = 'predicted_labels.csv'  # Save to local directory by default

# For Google Colab, uncomment to save to Drive:
# csv_path = '/content/drive/My Drive/predicted_labels.csv'

print(f"\nSaving predictions to: {csv_path}")
df_judge.to_csv(csv_path, index=False)

print("\n✓ Predictions saved successfully!")
print(f"✓ Total predictions: {len(df_judge)}")
print("\nDone!")

Done!
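
As a final check, the saved file can be read back to confirm it has the expected columns. The sketch below uses only the standard `csv` module; `check_predictions` is a hypothetical helper, and the two-row demo file it writes is made-up data for illustration, not output from the model.

```python
import csv
import os
import tempfile


def check_predictions(path):
    """Return the row count after verifying the 'Text' and 'Sentiment' columns."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        for col in ("Text", "Sentiment"):
            if col not in header:
                raise ValueError(f"Missing column: {col}")
        rows = list(reader)
    # Every row should carry a non-empty label.
    empty = sum(1 for r in rows if not r["Sentiment"])
    if empty:
        raise ValueError(f"{empty} rows have no Sentiment value")
    return len(rows)


# Demo on a tiny made-up file; in practice, pass the csv_path used above.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_predictions.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows([["Text", "Sentiment"], ["good film", "1"], ["bad film", "0"]])

n_rows = check_predictions(demo_path)
print(f"{n_rows} labelled rows")
```

The returned count can be compared against the "Total predictions" figure printed by the cell above.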
