In [8]:
from pathlib import Path
import pandas as pd

from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

pd.set_option("display.max_colwidth", None)

In [3]:
data_path = Path("../data")
emotions_1_path = data_path/"goemotions_1.csv"

df = pd.read_csv(emotions_1_path)

In [4]:
df.columns

Index(['text', 'id', 'author', 'subreddit', 'link_id', 'parent_id',
       'created_utc', 'rater_id', 'example_very_unclear', 'admiration',
       'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion',
       'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust',
       'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy',
       'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief',
       'remorse', 'sadness', 'surprise', 'neutral'],
      dtype='object')

In [9]:
# Excitement examples
df[["text", "excitement"]].loc[lambda d: d["excitement"] == 1].sample(5)

Unnamed: 0,text,excitement
34496,"Wow. I have to say, I am relieved. Pretty incredible idea.",1
2395,You should check out my comment at the bottom of this thread of why I somewhat support antifa. This comic is obviously fearmongering conservatives.,1
33550,This is me and my wife. She squeals with delight whenever it snows and I start looking for property in Arizona.,1
15760,I'm so doing this!,1
13877,Simply amazing.,1


In [10]:
# No excitement examples
df[["text", "excitement"]].loc[lambda d: d["excitement"] == 0].sample(5)

Unnamed: 0,text,excitement
46482,I saw one that us taking Greedy William's at 6 with the Jaga taking [NAME] at 7 lol. Cllelin ferrel was at 4,0
53849,I legit wish you never talked to me,0
46034,I think she was trying to demonstrate that such solutions are practical even if politically difficult. But it was a sloppy analogy.,0
39474,Because killing people is wrong?,0
27614,Doesn't state strangers either. Assume some more.,0


In [11]:
df["excitement"].value_counts()

0    68100
1     1900
Name: excitement, dtype: int64

In [12]:
X, y = df["text"], df["excitement"]

pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000)
)

In [13]:
%%time

pipe.fit(X, y)

Wall time: 3.73 s


Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('logisticregression',
                 LogisticRegression(class_weight='balanced', max_iter=1000))])

## Trick 1: Model Uncertainty

This trick relies on a model that outputs probabilities. We treat the model as uncertain about a prediction when its predicted probability falls in the open interval ]0.45, 0.55[, that is, close to the 0.5 decision threshold.

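**Safouane Note**: Using this interval assumes that the model is calibrated. Logistic regression usually produces reasonably well-calibrated probabilities, so we skip recalibration here. Keep in mind, though, that `class_weight="balanced"` shifts the predicted probabilities toward the minority class, so the interval is best read as a heuristic. Had we been using a random forest, for example, this assumption would have been much less safe😅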
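A rough way to sanity-check the calibration assumption is to compare the mean predicted probability with the observed positive rate. The sketch below does this on synthetic data from `make_classification` (an assumption for illustration, not the GoEmotions data), with and without `class_weight="balanced"`. The names `X_demo`, `y_demo`, `plain` and `balanced` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for the real data (assumption: this is
# NOT the GoEmotions dataset, just a toy problem with a rare positive class).
X_demo, y_demo = make_classification(
    n_samples=5000,
    n_features=20,
    weights=[0.95, 0.05],
    class_sep=0.8,
    flip_y=0.05,
    random_state=0,
)

# Same model family as in the notebook, with and without class re-weighting.
plain = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_demo, y_demo)

# Observed share of positives vs. average predicted probability of the positive class.
base_rate = y_demo.mean()
plain_mean = plain.predict_proba(X_demo)[:, 1].mean()
balanced_mean = balanced.predict_proba(X_demo)[:, 1].mean()

print(f"observed positive rate:         {base_rate:.3f}")
print(f"mean predicted (no weighting):  {plain_mean:.3f}")
print(f"mean predicted (balanced):      {balanced_mean:.3f}")
```

If the balanced model's mean prediction sits well above the observed rate, its probabilities overstate how likely the rare class is, and a 0.5-centred interval then marks "hard to separate under re-weighting" rather than "truly 50/50".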

In [14]:
# Column 0 => predicted probability of non-excitement
# Column 1 => predicted probability of excitement
pipe.predict_proba(X)

array([[0.81906852, 0.18093148],
       [0.87337871, 0.12662129],
       [0.99887474, 0.00112526],
       ...,
       [0.95766974, 0.04233026],
       [0.8940276 , 0.1059724 ],
       [0.97989241, 0.02010759]])

In [16]:
# Column 0 holds the predicted probability of no excitement.
# The interval is symmetric around 0.5, so column 1 would select the same rows.
probas = pipe.predict_proba(X)[:, 0]

# See what examples the model is uncertain about
# when it comes to predicting excitement
(
    df
    .loc[(probas > 0.45) & (probas < 0.55)]
    [["text", "excitement"]]
    .head(7)
)

Unnamed: 0,text,excitement
8,that's adorable asf,0
46,"If there’s a pattern, yes.",0
107,My fans on patreon will be rewarded soon,0
154,"Ones with close ties to SA, anyway. An escaped apostate won't exactly be itching to run home.",0
158,I really like this ring so I’m glad to hear that.,0
262,OMG THOSE TINY SHOES! *desire to boop snoot intensifies*,0
362,This. I relate to this. So much. Almost too much.,0


The examples at rows 262 and 362 should have been annotated as excitement.
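Instead of a hard interval, the same idea can be expressed as a ranking: sort examples by how close their probability is to 0.5 and review the closest ones first. The sketch below is a minimal version of that, using a small hand-made array; `rank_by_uncertainty` and `example_probas` are hypothetical names, not part of the notebook above.

```python
import numpy as np

def rank_by_uncertainty(probas, top_k=5):
    """Return the positions of the top_k probabilities closest to 0.5."""
    probas = np.asarray(probas, dtype=float)
    # Distance to the decision threshold: smaller means more uncertain.
    distance = np.abs(probas - 0.5)
    # Stable sort keeps the original order among ties.
    order = np.argsort(distance, kind="stable")
    return order[:top_k]

# Hand-made probabilities for illustration only.
example_probas = np.array([0.02, 0.46, 0.51, 0.58, 0.91])
top_indices = rank_by_uncertainty(example_probas, top_k=3)
print(top_indices)
```

In this notebook, something like `df.iloc[rank_by_uncertainty(probas, top_k=20)]` would list the 20 most uncertain rows, which avoids having to pick interval bounds by hand.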