Steps in this notebook:
1. Ingest data from previous step and load libraries
2. Train a classification model based on the top label in each row from the factor analysis
3. Pickle model for future data refreshes

# Init libraries and data

In [1]:
import pandas as pd
import numpy as np

#random seed for reproducibility
np.random.seed(67)

In [3]:
data = pd.read_csv('sample data files/input_for_step_3b.csv')

In [4]:
emotions_list = ['admiration', 'amusement', 'anger', 'annoyance', 'approval',
'caring', 'confusion', 'curiosity', 'desire',
'disappointment',
'disapproval',
'disgust',
'embarrassment',
'excitement',
'fear',
'gratitude',
'grief',
'joy',
'love',
'nervousness',
'optimism',
'pride',
'realization',
'relief',
'remorse',
'sadness',
'surprise',
'neutral']

# Load SKlearn model

Note: using an SGDC classifier for multi-class classification based on continuous input variables. this is a pretty basic modeling pipeline and could no doubt be improved by future contributors. 😉

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline

## test train split and check accuracy of model

In [6]:
X = data[emotions_list]
y = data.top_factor_label

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [14]:
clf = make_pipeline(StandardScaler(),
                    RandomForestClassifier())

In [15]:
clf.fit(X_train, y_train)

Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestclassifier', RandomForestClassifier())])

In [16]:
y_pred = clf.predict(X_test)

print('accuracy %s' % accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred,target_names=y.unique()))

accuracy 0.9954954954954955
              precision    recall  f1-score   support

     Neutral       1.00      1.00      1.00        89
     Sadness       0.97      1.00      0.98        97
  Excitement       0.99      1.00      1.00       156
      Desire       1.00      1.00      1.00        32
    Approval       0.98      0.98      0.98        66
      Caring       1.00      1.00      1.00        64
  Curiousity       1.00      0.99      1.00       508
   Gratitude       1.00      0.99      0.99        98

    accuracy                           1.00      1110
   macro avg       0.99      1.00      0.99      1110
weighted avg       1.00      1.00      1.00      1110



Notes:
- 99.5% accuracy is pretty good for now but if the model is applied to new data there may be some drift requiring tuning and re-training over time.
- This model applies the results of the factor analysis as labels, but, is based on the output of the HuggingFace model. A few reasons for this.
- To generalize the FA results over time and sustain them as feature columns, a regression model would be needed to predict the FA columns, which would be based on the feature engineering of the HuggingFace model. It would be a lot of modeling on top on modeled data.
- With this approach, the text classification labels 
- Additional, when explaning the results of one tonality label vs another, the 'processed_text' can help inform which specific terms decide the category. This helps with explainability.

These steps are somewhat a matter of personal preference and can be altered as you see fit.

## Exporting the model for classification of future results (e.g. daily/ weekly/ monthly / quarterly) or new data sets

In [17]:
import pickle

pickle.dump(clf, open('model/subject_line_tonality_classifier.pkl','wb'))