# Sentiment Analyzer - Additional Experiments

This notebook documents alternative preprocessing techniques and model architectures that were tested during development. Apart from punctuation removal, which the final model keeps, these approaches were dropped because they did not improve performance.

<a href="https://colab.research.google.com/github/georgehtliu/ignition-hack-2020/blob/master/submission_extras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
import nltk
import string
import re

## Alternative Preprocessing Techniques

The following preprocessing techniques were tested. Apart from simple punctuation removal, each one either lowered F1 scores or duplicated work the vectorizer already does.

### Lemmatization with Part-of-Speech Tagging

**Result:** Significantly increases training time and decreases F1 scores by ~1%.

Not recommended for this use case.

In [None]:
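# Download required NLTK data
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# NLTK 3.9+ looks up the tagger under this name instead
nltk.download('averaged_perceptron_tagger_eng')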

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Get the WordNet part-of-speech tag for a word."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dictionary = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV
    }
    return tag_dictionary.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

def wn_lemmatize(sentence, lemmatizer):
    """Lemmatize words in a sentence using POS tagging."""
    # str.split() never yields empty strings, so no length check is needed.
    return ' '.join(
        lemmatizer.lemmatize(word, get_wordnet_pos(word))
        for word in sentence.split()
    )

# Usage example (not used in final model):
# df["Text"] = df['Text'].apply(lambda sentence: wn_lemmatize(sentence, lemmatizer))
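
`get_wordnet_pos` tags each word in isolation via `nltk.pos_tag([word])`, so the tagger runs once per word and never sees the surrounding sentence. A possible variant, sketched below, tags the whole sentence in a single call. The function name `lemmatize_tagged_sentence` and the stub tagger and lemmatizer are hypothetical, included only so the sketch runs without NLTK data. Its effect on F1 or training time for this dataset was not measured.

```python
# WordNet POS constants are the single letters 'a', 'n', 'v', 'r'.
TAG_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def lemmatize_tagged_sentence(sentence, tag_fn, lemmatize_fn):
    """Tag the whole sentence in one call, then lemmatize each token.

    tag_fn:       list of tokens -> list of (token, treebank_tag) pairs
    lemmatize_fn: (token, wordnet_pos) -> lemma
    """
    tokens = sentence.split()
    tagged = tag_fn(tokens)  # one tagger call per sentence, with context
    return ' '.join(
        lemmatize_fn(token, TAG_TO_WORDNET.get(tag[:1].upper(), "n"))
        for token, tag in tagged
    )

# Stub tagger and lemmatizer (hypothetical, for illustration only).
def stub_tagger(tokens):
    tags = {"dogs": "NNS", "running": "VBG"}
    return [(token, tags.get(token, "NN")) for token in tokens]

def stub_lemmatizer(token, pos):
    lemmas = {("dogs", "n"): "dog", ("running", "v"): "run"}
    return lemmas.get((token, pos), token)

print(lemmatize_tagged_sentence("dogs running", stub_tagger, stub_lemmatizer))  # dog run
```

With NLTK installed and its data downloaded, `nltk.pos_tag` and `lemmatizer.lemmatize` could be passed in place of the two stubs.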

### Name Lemmatization / Generalization

**Result:** Slightly decreases F1 scores.

Strips a leading @mention or #hashtag from each text. Removing it slightly reduces model performance.

In [None]:
def lemmatize_name(text):
    """
    Remove a leading @mention or #hashtag (the first token) from text.

    Note: This approach decreased F1 scores and was not used in the final model.
    """
    if text and text[0] in ('@', '#'):
        # Drop the first token; joining the rest avoids a stray leading space.
        return ' '.join(text.split()[1:])
    return text

# Usage example (not used in final model):
# df['Text'] = df['Text'].map(lambda text: lemmatize_name(text))
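
`lemmatize_name` only handles a handle or hashtag in the first position. If removal anywhere in the text were wanted, a regex-based variant might look like the sketch below. The name `strip_handles` and the pattern are assumptions made for illustration, and this variant was not evaluated against the dataset.

```python
import re

# Matches @mentions and #hashtags anywhere in the text (hypothetical pattern).
HANDLE_PATTERN = re.compile(r'[@#]\w+')

def strip_handles(text):
    """Remove every @mention and #hashtag, then collapse leftover whitespace."""
    without_handles = HANDLE_PATTERN.sub(' ', text)
    return ' '.join(without_handles.split())

print(strip_handles("@bob loving the #sunshine with @alice"))  # loving the with
```

Note that the pattern also matches inside strings such as email addresses, so it is only suitable for tweet-like text.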

### Stop Word Removal

**Result:** Decreases F1 scores by ~1%.

Removing common stop words reduced model performance, likely because they provide context for sentiment analysis.

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

# A set gives constant-time membership checks in remove_stopwords
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """Remove English stop words from text."""
    words = text.split() if isinstance(text, str) else []
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# Usage example (not used in final model):
# df['Text'] = df['Text'].apply(lambda x: remove_stopwords(x))
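
One way to see why stop words can carry sentiment: NLTK's English list includes negations such as "not" and "no". The sketch below uses a small hand-picked subset of that list (`SAMPLE_STOP_WORDS`, a hypothetical name) so it runs without downloading NLTK data.

```python
# Hand-picked subset of NLTK's English stop word list, for illustration only.
SAMPLE_STOP_WORDS = {"i", "am", "not", "no", "this", "is", "the", "with", "at", "all"}

def remove_sample_stopwords(text):
    """Drop tokens found in SAMPLE_STOP_WORDS (case-insensitive)."""
    return ' '.join(
        word for word in text.split() if word.lower() not in SAMPLE_STOP_WORDS
    )

print(remove_sample_stopwords("I am not happy with this"))  # happy
print(remove_sample_stopwords("I am happy with this"))      # happy
```

Both sentences collapse to the same string, so a classifier can no longer tell the negative one from the positive one.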

### Punctuation Removal

**Note:** While the vectorizer has built-in functionality for handling punctuation, we found that explicit preprocessing improved performance in our case.

In [None]:
def remove_punct(text):
    """Remove punctuation and numbers from text."""
    if pd.isna(text):
        return ""
    text = "".join([char for char in str(text) if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

# This preprocessing step IS used in the final model:
# df['Text'] = df['Text'].map(lambda text: remove_punct(text))
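
The same cleanup can be done in a single pass with `str.translate`. The following is a sketch of an equivalent function, not the version used in the final model; the name `remove_punct_translate` is hypothetical. It treats only `None` as missing, whereas `remove_punct` also catches `NaN` through `pd.isna`.

```python
import string

# Translation table that deletes every ASCII punctuation mark and digit.
_DELETE_TABLE = str.maketrans('', '', string.punctuation + string.digits)

def remove_punct_translate(text):
    """Remove ASCII punctuation and digits in one pass."""
    if text is None:
        return ""
    return str(text).translate(_DELETE_TABLE)

# Whitespace is preserved, so removed tokens can leave double spaces behind.
print(repr(remove_punct_translate("Wow!!! 10/10, would buy again.")))
```

Because `string.punctuation` contains the apostrophe, contractions are fused: "don't" becomes "dont".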

### Tokenization

**Note:** Redundant due to vectorizer's built-in tokenization functionality.

In [None]:
def tokenization(text):
    """Tokenize text using regex (not needed - vectorizer handles this)."""
    # Raw string: '\W' in a plain string literal is an invalid escape sequence
    tokens = re.split(r'\W+', text)
    return tokens

# Usage example (not needed - TF-IDF vectorizer handles tokenization):
# df['Text'] = df['Text'].map(lambda text: tokenization(text))
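
To illustrate the difference, the sketch below compares the regex split used in `tokenization` with the default token pattern documented for scikit-learn's text vectorizers, applied here with plain `re` so no fitting is needed. Lowercasing mirrors the vectorizers' default `lowercase=True`. The variable names are illustrative.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer / TfidfVectorizer.
VECTORIZER_PATTERN = r"(?u)\b\w\w+\b"

text = "I love it, a lot!"

split_tokens = re.split(r'\W+', text)
vectorizer_like_tokens = re.findall(VECTORIZER_PATTERN, text.lower())

print(split_tokens)            # ['I', 'love', 'it', 'a', 'lot', '']
print(vectorizer_like_tokens)  # ['love', 'it', 'lot']
```

The split keeps single-character tokens and leaves an empty string after trailing punctuation, while the vectorizer pattern only keeps tokens of two or more word characters.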

## Alternative Classifiers

The following classifiers were tested but did not outperform Logistic Regression for this task.
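
The classifier cells in this section expect `X_train_vectors`, `y_train`, `X_test_vectors`, and `y_test` from the main training notebook, which is not reproduced here. A minimal sketch of how such variables could be produced is shown below. The toy corpus, the split ratio, and the default `TfidfVectorizer` settings are assumptions for illustration, not the project's actual data or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical texts and labels).
texts = [
    "i love this", "what a great day", "so happy right now", "best movie ever",
    "i hate this", "what a terrible day", "so sad right now", "worst movie ever",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vocabulary on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train_vectors = vectorizer.fit_transform(X_train)
X_test_vectors = vectorizer.transform(X_test)

print(X_train_vectors.shape[0], X_test_vectors.shape[0])  # 6 2
```

Fitting the vectorizer on the training split alone keeps test-set vocabulary from leaking into the features.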

### Neural Network (MLPClassifier)

**Result:** Very slow to train, mediocre accuracy compared to Logistic Regression.

In [None]:
# Multi-layer Perceptron (Neural Network)
# Note: Requires X_train_vectors, y_train, X_test_vectors, y_test to be defined
# (from train_test_split in the main training notebook)

clf_nn = MLPClassifier(
    solver='adam', 
    activation='relu', 
    hidden_layer_sizes=(64, 64),
    random_state=42
)
clf_nn.fit(X_train_vectors, y_train)
predictions = clf_nn.predict(X_test_vectors)
print(f"F1 Score: {f1_score(y_test, predictions, average='weighted')}")
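
Every cell in this section reports `f1_score(..., average='weighted')`: the per-class F1 scores averaged with each class weighted by its number of true instances. The sketch below recomputes that metric by hand on made-up labels to show what the number means. The helper `weighted_f1` is illustrative, not part of the project.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true  # weight each class by its support
    return total / len(y_true)

# Made-up labels: class 0 has F1 = 2/3 (support 3), class 1 has F1 = 0.8 (support 5).
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 1]
print(round(weighted_f1(y_true_demo, y_pred_demo), 4))  # 0.75
```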

### Decision Tree Classifier

**Result:** Sub-par accuracy compared to Logistic Regression.

In [None]:
# Decision Tree with GridSearchCV
parameters_dt = {
    'criterion': ('gini', 'entropy'),
    'splitter': ('best', 'random'),
    'max_depth': (None, 4, 100, 1000)
}

dt = DecisionTreeClassifier(random_state=42)
clf_dt = GridSearchCV(dt, parameters_dt, cv=5, scoring='f1_weighted')
clf_dt.fit(X_train_vectors, y_train)

predictions = clf_dt.predict(X_test_vectors)
print(f"Best Parameters: {clf_dt.best_params_}")
print(f"F1 Score: {f1_score(y_test, predictions, average='weighted')}")
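
The cost of this search follows from the grid size. As a quick check, the sketch below enumerates the parameter combinations and counts the model fits, assuming `GridSearchCV`'s default `refit=True`, which adds one final fit on the full training set.

```python
from itertools import product

parameters_dt = {
    'criterion': ('gini', 'entropy'),
    'splitter': ('best', 'random'),
    'max_depth': (None, 4, 100, 1000),
}
cv_folds = 5

# Every combination of the three parameter lists is one candidate model.
candidates = list(product(*parameters_dt.values()))

# One fit per candidate per fold, plus the final refit of the best candidate.
total_fits = len(candidates) * cv_folds + 1
print(len(candidates), total_fits)  # 16 81
```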

### Support Vector Machine (SVM)

**Result:** Does not scale to large datasets, since kernel SVM training time grows at least quadratically with the number of samples. Good accuracy on smaller datasets.

In [None]:
# Support Vector Machine (SVM)
# Note: This requires a smaller subset of data due to memory constraints
# Around 68% accuracy using 8000 of the 1M training examples

clf_svm = SVC(
    kernel='rbf', 
    C=4, 
    decision_function_shape='ovo',
    random_state=42
)
clf_svm.fit(X_train_vectors, y_train)

predictions = clf_svm.predict(X_test_vectors)
print(f"F1 Score: {f1_score(y_test, predictions, average='weighted')}")
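
The SVM cell assumes the training data has already been cut down to a small subset. How the 8000 examples were chosen is not recorded here. One possible approach, sketched under that caveat, is to pick row indices per class so the subset keeps roughly the original class balance. The helper `stratified_sample_indices` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample_indices(labels, n_samples, seed=42):
    """Pick about n_samples row indices, keeping class proportions roughly intact."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    fraction = n_samples / len(labels)
    chosen = []
    for indices in by_class.values():
        # At least one row per class, never more than the class contains.
        k = min(len(indices), max(1, round(len(indices) * fraction)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)

# Toy labels standing in for y_train (hypothetical 70/30 class split).
toy_labels = [0] * 700 + [1] * 300
subset = stratified_sample_indices(toy_labels, n_samples=100)
print(len(subset))  # 100
```

With a SciPy sparse matrix, the matching rows could then be selected as `X_train_vectors[subset]`, with the labels picked by the same indices.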

### Stochastic Gradient Descent (SGD) Classifier

**Result:** Very fast to train, but does not improve much as dataset size increases.

In [None]:
# SGD Classifier with logistic loss
clf_sgd = SGDClassifier(
    loss='log_loss',  # named 'log' before scikit-learn 1.1; the old name was removed in 1.3
    penalty='elasticnet',
    l1_ratio=0.05,
    random_state=42
)
clf_sgd.fit(X_train_vectors, y_train)

predictions = clf_sgd.predict(X_test_vectors)
print(f"F1 Score: {f1_score(y_test, predictions, average='weighted')}")