# Current Approach to Identifying Tweets Needing Support

In [1]:
%%time
!pip3 freeze | grep -E 'boto3|s3fs|scikit-learn|distributed|dask==|dask-m|black==|jupyter-server|pandas'
!conda list -n spark | grep -E 'ipykernel'

black==22.6.0
boto3==1.24.56
dask==2022.8.0
dask-ml==2022.5.27
distributed==2022.8.0
nb-black==1.0.7
pandas==1.4.3
s3fs==0.4.2
scikit-learn==1.1.2
ipykernel                 6.15.1             pyh210e3f2_0    conda-forge
CPU times: user 40.6 ms, sys: 13.4 ms, total: 54 ms
Wall time: 2.38 s


In [2]:
%load_ext lab_black

In [3]:
import os
from glob import glob
from datetime import datetime
import zipfile

import boto3
import dask.dataframe as dd
import numpy as np
import pandas as pd
import sklearn.metrics as skm
from dask_ml.model_selection import train_test_split
from sklearn.model_selection import train_test_split as sk_train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline

## About

This notebook walks through the end-to-end workflow currently used to identify negative sentiment tweets that need to be reviewed by mission team members.

The machine learning scoring metrics identified in `scoping.md` are used to quantitatively evaluate this approach. The business metric, the time wasted reading tweets that do not express negative sentiment, is also estimated here.

For a summary of assumptions made, see the discussion at the end of the notebook.

## User Inputs

In [10]:
label_mapper = {0: "does_not_need_support", 1: "needs_support"}

nrows = 172_000
partition_size = 21_000
frac_negative = 0.045  # for dummy data only

test_split_frac = 0.125

# inference
inference_start_date = "2022-01-10 00:00:00"

In [11]:
n_partitions = int(nrows / partition_size)

val_split_frac = test_split_frac / (1 - test_split_frac)
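
The validation fraction is expressed relative to the rows that remain after the test split is removed, so that the validation and test splits end up the same size. A minimal arithmetic sketch of this, using only the values from the user inputs (the names `n_test`, `n_train_val`, `n_val` and `n_train` are illustrative and are not used elsewhere in the notebook):

```python
nrows = 172_000
test_split_frac = 0.125

# fraction of the remaining (train + validation) rows to hold out for validation
val_split_frac = test_split_frac / (1 - test_split_frac)

n_test = nrows * test_split_frac      # rows held out for testing
n_train_val = nrows - n_test          # rows left for training and validation
n_val = n_train_val * val_split_frac  # rows held out for validation
n_train = n_train_val - n_val         # rows left for training

print(round(val_split_frac, 4), round(n_test), round(n_val), round(n_train))
```

These are expected sizes only. The split sizes reported later in the notebook differ slightly (for example, 21,282 test rows rather than 21,500), which suggests the split is applied approximately rather than exactly.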

## Get Data

In [12]:
%%time
n_negative = int(frac_negative * nrows)  # number of tweets that need support (label 1)
df = pd.DataFrame(
    {
        # majority class (0) first, followed by the minority class (1)
        "label": np.concatenate(
            [np.zeros(nrows - n_negative, dtype=int), np.ones(n_negative, dtype=int)]
        ),
        "text": "A",  # placeholder text for the dummy data
    }
)
ddf = dd.from_pandas(df, npartitions=n_partitions)
display(ddf.head())

Unnamed: 0,label,text
0,0,A
1,0,A
2,0,A
3,0,A
4,0,A


CPU times: user 14.6 ms, sys: 6.74 ms, total: 21.3 ms
Wall time: 22 ms


In [13]:
%%time
X = ddf[['text']]
y = ddf['label']

CPU times: user 1.53 ms, sys: 205 µs, total: 1.74 ms
Wall time: 1.67 ms


## Split Data

In [14]:
%%time
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=test_split_frac, random_state=88, shuffle=True
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=val_split_frac, random_state=88, shuffle=True
)

CPU times: user 9 ms, sys: 0 ns, total: 9 ms
Wall time: 9.22 ms


Show the class distribution of the label (*needs support* vs *does not need support*)

In [15]:
%%time
class_distribution_train = (
    y_train.value_counts()
    .rename("num_tweets")
    .compute()
    .to_frame()
)
class_distribution_train = class_distribution_train.assign(
    frac_tweets=lambda df: df["num_tweets"] / df["num_tweets"].sum(axis=0)
)
class_distribution_train

CPU times: user 70.3 ms, sys: 17.1 ms, total: 87.4 ms
Wall time: 87.5 ms


Unnamed: 0,num_tweets,frac_tweets
0,123292,0.954842
1,5831,0.045158


## (Naive) Model Training

Define the ML pipeline

In [16]:
pipe = Pipeline([("clf", DummyClassifier(strategy="uniform", random_state=88))])
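
With the `uniform` strategy, each class is predicted with equal probability regardless of the input text. The sketch below is a pure-Python stand-in for that behaviour (it is not scikit-learn's implementation, and the row count and seed are arbitrary choices for illustration). It shows what to expect for the minority class under this strategy: recall close to 0.5 and precision close to the minority-class prevalence.

```python
import random

rng = random.Random(88)  # arbitrary seed, for reproducibility only

n_rows = 100_000            # illustrative size, not the notebook's nrows
frac_needs_support = 0.045  # minority-class prevalence, as in the user inputs
n_pos = round(n_rows * frac_needs_support)

# true labels: 1 = needs_support (minority), 0 = does_not_need_support
y_true = [1] * n_pos + [0] * (n_rows - n_pos)
# uniform strategy: predict either class with probability 0.5, ignoring the input
y_pred = [rng.randint(0, 1) for _ in y_true]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # expected to be close to the prevalence
recall = tp / (tp + fn)     # expected to be close to 0.5

print(f"precision={precision:.3f}, recall={recall:.3f}")
```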

Training

In [17]:
%%time
_ = pipe.fit(X_train_val, y_train_val)

CPU times: user 105 ms, sys: 18.2 ms, total: 123 ms
Wall time: 193 ms


Make predictions on the tweets in the test split

In [18]:
%%time
y_test_pred = pd.Series(pipe.predict(X_test), name='label', index=y_test.index.compute())
# ddf_test_pred = dd.from_pandas(y_test_pred.to_frame(), npartitions=y_test.npartitions)
y_test_pred.head().to_frame()

CPU times: user 75.3 ms, sys: 25.6 ms, total: 101 ms
Wall time: 80.8 ms


Unnamed: 0,label
3725,0
6935,0
11244,1
728,1
17922,0


**Notes**
1. The predictions are brought into memory since the pipeline is an in-memory object (`sklearn.pipeline.Pipeline`). Since these are the test split predictions, we will **assume that the number of labels in the test split is small enough to fit into local memory**.

## Model Evaluation

Bring the test split labels into memory

In [19]:
%%time
y_test_computed = y_test.compute()

CPU times: user 37 ms, sys: 6.39 ms, total: 43.4 ms
Wall time: 51.5 ms


### Scoring Metrics

Calculate evaluation metrics on the test split

In [20]:
%%time
metrics_dict = dict(
    accuracy=skm.accuracy_score(y_test_computed, y_test_pred),
    precision=skm.precision_score(y_test_computed, y_test_pred, average='weighted'),
    recall=skm.recall_score(y_test_computed, y_test_pred, average='weighted'),
    f1_score=skm.f1_score(y_test_computed, y_test_pred, average='weighted'),
    f2_score=skm.fbeta_score(y_test_computed, y_test_pred, beta=2, average='weighted'),
)
df_cr = pd.DataFrame(
    skm.classification_report(
        y_test_computed,
        y_test_pred,
        labels=list(label_mapper),
        target_names=list(label_mapper.values()),
        output_dict=True,
    )
).T.iloc[:2].astype({"support": pd.Int32Dtype()})
df_cm = pd.DataFrame(
    skm.confusion_matrix(y_test_computed, y_test_pred, labels=list(label_mapper))
).rename(columns=label_mapper)
df_cm.index = df_cm.index.map(label_mapper)
df_cm = pd.concat(
    [
        df_cm.assign(total=lambda df: df.sum(axis=1)),
        df_cm.sum(axis=0).rename("total").to_frame().T,
    ]
).fillna(len(y_test)).astype({"total": pd.Int32Dtype()})
df_metrics = pd.DataFrame.from_dict(metrics_dict, orient='index').T.assign(split='test')
display(df_metrics)
display(df_cr)
display(df_cm)

Unnamed: 0,accuracy,precision,recall,f1_score,f2_score,split
0,0.497134,0.913809,0.497134,0.627629,0.532078,test


Unnamed: 0,precision,recall,f1-score,support
does_not_need_support,0.95504,0.496628,0.653455,20317
needs_support,0.045722,0.507772,0.08389,965


Unnamed: 0,does_not_need_support,needs_support,total
does_not_need_support,10090,10227,20317
needs_support,475,490,965
total,10565,10717,21282


CPU times: user 125 ms, sys: 0 ns, total: 125 ms
Wall time: 131 ms


**Observations**
1. We identified the F1-score and F2-score as the candidate scoring metrics for this use case. The closer both values are to 1.0, the better the current ML model. The weighted scores are approximately 0.63 and 0.53 respectively, indicating that (strictly from the perspective of an ML metric) improvement is warranted.
2. The classification report (second output above) shows the F1-score dropping by nearly an order of magnitude (from approximately 0.65 to 0.08) for the minority class (negative or neutral sentiment tweets) compared to the majority class (positive sentiment tweets).
3. Ultimately, we identified the F2-score as the primary metric to be used. With the current approach, this score is poor (approximately 0.53).
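
To make the weighted F2-score concrete, the following sketch recomputes it by hand from the confusion matrix counts in the third output above. It uses the standard F-beta definition with per-class scores weighted by class support; the helper `fbeta` and the dictionary layout are illustrative and are not part of the notebook's code.

```python
# confusion-matrix counts copied from the third output above
# outer key = true class, inner key = predicted class
cm = {
    "does_not_need_support": {"does_not_need_support": 10090, "needs_support": 10227},
    "needs_support": {"does_not_need_support": 475, "needs_support": 490},
}


def fbeta(precision, recall, beta):
    """F-beta score: recall is treated as beta times as important as precision."""
    b2 = beta**2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


total = sum(sum(row.values()) for row in cm.values())
per_class_f2 = {}
weighted_f2 = 0.0
for cls in cm:
    tp = cm[cls][cls]                        # correctly predicted members
    support = sum(cm[cls].values())          # true members of this class
    predicted = sum(cm[t][cls] for t in cm)  # rows predicted as this class
    f2 = fbeta(tp / predicted, tp / support, beta=2)
    per_class_f2[cls] = f2
    weighted_f2 += f2 * support / total      # weight by class support

print({k: round(v, 3) for k, v in per_class_f2.items()}, round(weighted_f2, 3))
```

With these counts the weighted F2-score reproduces the reported value of about 0.53, while the F2-score for `needs_support` alone is about 0.17. This suggests the weighted average is dominated by the majority class.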

### Business Metrics

Next, we will calculate a business metric: the amount of time spent unnecessarily reading tweets. This refers to tweets with a positive sentiment that are read by the mission's social media support team.

Use the third output above (confusion matrix) to summarize the number of unnecessarily read tweets below.

In [21]:
num_tweets_unnecessarily_read = (
    df_cm.loc["total", "needs_support"] - df_cm.loc["needs_support", "total"]
)

df_reading_time_summary = (
    pd.Series(
        [
            num_tweets_unnecessarily_read,
            len(y_test),
            num_tweets_unnecessarily_read / len(y_test),
        ],
        index=[
            "number tweets unnecessarily read",
            "total number tweets",
            "fraction tweets unnecessarily read",
        ],
    )
    .to_frame()
    .T.astype(
        {
            "number tweets unnecessarily read": pd.Int32Dtype(),
            "total number tweets": pd.Int32Dtype(),
        }
    )
)
df_reading_time_summary

Unnamed: 0,number tweets unnecessarily read,total number tweets,fraction tweets unnecessarily read
0,9752,21282,0.458228


**Observations**
1. Out of the 21,282 tweets on which the currently used (naive) ML model is being evaluated, the model predicted that
   - 10,717 tweets need support
   - 10,565 tweets do not need support

   In reality
   - 965 tweets need support
   - 20,317 tweets do not need support

   This means 9,752 (10,717 - 965, or approx. 46%) of the available tweets would unnecessarily be read (and responded to) by the mission team members who are acting in a support capacity, in order to mitigate negative sentiment on Twitter. If we **assume an average combined reading and responding time of half a minute per tweet**, then this would amount to approximately 81 hours (9,752 tweets X 0.50 min/tweet X 1 hr/60 min) of time wasted reading tweets that did not express a negative sentiment about the mission on the platform.
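
The time estimate can be reproduced with the short sketch below. It assumes 0.50 minutes per tweet, which is the figure that yields the 81-hour estimate. For comparison, it also reports the false-positive count from the confusion matrix (tweets flagged as needing support that did not need it), which is a stricter way to count unnecessarily read tweets than the difference of totals used above. All variable names here are illustrative.

```python
# counts copied from the confusion matrix in the model evaluation output
predicted_needs_support = 10_717
actual_needs_support = 965
true_positives = 490  # needs-support tweets that were correctly flagged
total_tweets = 21_282

# ASSUMPTION: half a minute of combined reading and responding time per tweet
minutes_per_tweet = 0.50

# estimate used above: predicted minus actual needs-support tweets
net_excess = predicted_needs_support - actual_needs_support
# alternative: flagged tweets that did not need support (false positives)
false_positives = predicted_needs_support - true_positives

for name, n in [("net excess", net_excess), ("false positives", false_positives)]:
    hours = n * minutes_per_tweet / 60
    print(f"{name}: {n} tweets, {n / total_tweets:.1%} of total, {hours:.1f} hours")
```

Under the same time assumption, counting false positives directly gives 10,227 tweets and roughly 85 hours, slightly more than the 81-hour figure.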

## Summary of Assumptions
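1. The number of labels in the validation and test splits is small enough to fit into local memory.
2. The average combined reading and responding time per tweet is a fixed value (see the business metrics observations above).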

<span style="float:left;">
    <a href="./6_nlp_labeling.ipynb"><< 6 - NLP-based Labeling</a>
</span>

<span style="float:right;">
    <a href="./8_analysis.ipynb">8 - Analysis >></a>
</span>