<pre>
<b>Author:</b> Ashlynn Wimer
<b>Date:</b> 3/5/2024
</pre>

This notebook classifies my samples. In it, we build the final model, explain why we chose it, and do a brief bit of error analysis.

In [1]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, roc_auc_score, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt
import HelperChan  # helper module; provides very_good_ast_literal_eval
import pandas as pd
import numpy as np

import warnings

In [2]:
# Load in labeled data
gpt_labeled_data = pd.concat([
    pd.read_csv('../data/gpt_labeled.csv', index_col='Unnamed: 0'),
    pd.read_csv('../data/gpt_labeled_2.csv', index_col='Unnamed: 0')
], ignore_index=True)
gpt_labeled_data['classification'] = gpt_labeled_data['gptlabels']\
    .apply(lambda x: 1 if x=='Yes' else 0)

hand_labeled_posts = pd.concat(
    [
        pd.read_csv('../data/first_pass_labels.csv', index_col='Unnamed: 0'),
        pd.read_csv('../data/lgbt_week_2_classified.csv', index_col='Unnamed: 0'),
        pd.read_csv('../data/lgbt_week_3_classified.csv', index_col='Unnamed: 0')
    ],
    ignore_index=True
).drop_duplicates() # there are none, but I'm paranoid

# Load in docvecs
docvecs = pd.read_csv('../data/docvecs.csv')
gpt_docvecs = gpt_labeled_data.merge(docvecs, on='id')
hand_docvecs = hand_labeled_posts.merge(docvecs, on='id')

# Make a train-test split on the hand-labeled vectors (400 train, the rest test)
hand_train = hand_docvecs[['docsvecs', 'classification']]\
    .sample(n=400, random_state=42)
hand_test = hand_docvecs.drop(hand_train.index)

# Augment the hand-labeled training set with 200 GPT-labeled posts
modified_train = pd.concat([
    gpt_docvecs[['docsvecs', 'classification']].sample(n=200, random_state=42),
    hand_train
])

modified_train['docsvecs'] = modified_train['docsvecs']\
    .apply(HelperChan.very_good_ast_literal_eval)\
    .apply(np.array)

vec_mod_train = np.stack(modified_train['docsvecs'])
lab_mod_train = modified_train['classification']

# Clean up our testing set
vec_test = hand_test['docsvecs']\
    .apply(HelperChan.very_good_ast_literal_eval)\
    .apply(np.array)
vec_test = np.stack(vec_test)
lab_test = hand_test['classification']
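
The `docsvecs` column is stored in the CSV as text, so each cell has to be parsed back into a numeric vector before it can be stacked into a matrix. `HelperChan.very_good_ast_literal_eval` is not shown in this notebook, so the sketch below assumes the vectors are serialized as Python-style list literals and uses the standard library's `ast.literal_eval` as a stand-in. The toy frame and the `parse_vectors` helper are hypothetical and exist only for illustration.

```python
import ast

import numpy as np
import pandas as pd

# Toy stand-in for the merged frame; the real column comes from docvecs.csv.
toy = pd.DataFrame({
    'docsvecs': ['[0.1, 0.2, 0.3]', '[1.0, 0.0, -1.0]',
                 '[0.5, 0.5, 0.5]', '[2.0, 1.0, 0.0]'],
    'classification': [0, 1, 0, 1],
})

# Sample a training subset, then drop those rows by index to get the test set.
toy_train = toy.sample(n=3, random_state=42)
toy_test = toy.drop(toy_train.index)

def parse_vectors(column):
    """Parse a Series of list-literal strings into a 2-D float array."""
    return np.stack(column.apply(ast.literal_eval).apply(np.array))

X_train = parse_vectors(toy_train['docsvecs'])
X_test = parse_vectors(toy_test['docsvecs'])
```

Dropping by index is what keeps the two sets disjoint, which only works because the frame's index is unique at the time of sampling.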

Hyperparameter and base estimator selection were conducted in a separate scratch notebook (not included in this repo because it is messy) by running GridSearchCV over candidate models until the results were acceptable.
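
For readers without access to the scratch notebook, here is a minimal sketch of what a single GridSearchCV pass could look like. It is not the search that was actually run: the data is synthetic, and the parameter grid, the `f1` scoring choice, and the five folds are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the document vectors and labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical grid; the values actually searched are not recorded here.
param_grid = {'C': [0.1, 0.5, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring='f1',  # assumed metric
    cv=5,          # assumed number of folds
)
search.fit(X, y)

best_params = search.best_params_  # dict holding the winning 'C'
best_score = search.best_score_    # mean cross-validated F1 for that setting
```

The same pattern applies to each candidate base estimator: swap in the estimator and a grid over its own parameters, then compare `best_score_` across searches.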

In [3]:
base_estimators = [
    ('SVC', SVC()),
    ('BaggedLogistics', BaggingClassifier(estimator=LogisticRegression(), 
                                        max_features=.75, 
                                        max_samples=.75, 
                                        bootstrap_features=True, 
                                        n_estimators=10)),
    ('GradientBoost', GradientBoostingClassifier()),
    ('DecisionTree', DecisionTreeClassifier()),
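    # Note: with penalty=None, scikit-learn ignores C and l1_ratio (and warns
    # about it), so this estimator is effectively unregularized.
    ('LogisticRegression', LogisticRegression(C=0.5,
                                l1_ratio=.25,
                                penalty=None,
                                solver='sag',
                                tol=0.0001))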
]

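stack_clf = StackingClassifier(
    estimators=base_estimators,
    final_estimator=MLPClassifier(max_iter=1000)
)

# Silence all warnings raised during fitting.
# (catch_warnings(action=...) requires Python 3.11+.)
with warnings.catch_warnings(action='ignore'):
    stack_clf.fit(vec_mod_train, lab_mod_train)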

In [4]:
print(classification_report(lab_test, stack_clf.predict(vec_test)))

              precision    recall  f1-score   support

           0       0.82      0.87      0.84        61
           1       0.77      0.69      0.73        39

    accuracy                           0.80       100
   macro avg       0.79      0.78      0.79       100
weighted avg       0.80      0.80      0.80       100
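
The report above only scores hard 0/1 predictions. One possible complement for the error analysis is a confusion matrix together with `roc_auc_score` (imported at the top of this notebook), which scores predicted probabilities instead. The sketch below runs on synthetic data with a plain logistic regression standing in for `stack_clf`, so none of its numbers describe the real model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 400 training rows and 100 test rows.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=100, random_state=42)

# Stand-in classifier; any estimator with predict_proba works the same way.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_te, clf.predict(X_te))
tn, fp, fn, tp = cm.ravel()

# AUC uses the predicted probability of class 1, not the hard labels.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

On the real data the equivalent calls would use `stack_clf`, `vec_test`, and `lab_test`. The false negatives (`fn`) are the rows to read by hand first, since class 1 has the lower recall in the report.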



After far too much experimentation, this is likely the best-balanced performance I can hope for, so let's go with it and use this model to classify all of our posts.

In [5]:
# Load in all posts
all_posts = pd.concat(
    [
        pd.read_csv('../data/lgbt_week_1.csv'),
        pd.read_csv('../data/lgbt_week_2.csv'),
        pd.read_csv('../data/lgbt_week_3.csv')
    ], ignore_index=True
)

all_docvecs = all_posts.merge(docvecs, on='id')

all_docvecs['docsvecs'] = all_docvecs['docsvecs']\
    .apply(HelperChan.very_good_ast_literal_eval)\
    .apply(np.array)

# The merge should neither drop nor duplicate posts.
assert len(all_docvecs) == len(all_posts), \
    'Merge changed the row count: some posts are missing or duplicated in docvecs.'

In [6]:
all_docvecs = all_docvecs.assign(
    transRelated = lambda x: stack_clf.predict(np.stack(x['docsvecs']))
)

In [7]:
all_docvecs['transRelated']

0         0
1         0
2         0
3         0
4         1
         ..
187370    0
187371    0
187372    0
187373    0
187374    1
Name: transRelated, Length: 187375, dtype: int64

In [8]:
all_docvecs.drop(columns=['docsvecs']).to_csv('../data/fully_labeled_data.csv')