# Text Classification:  Insults with Naive Bayes

In [12]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
# Feature selection, preprocessing, pipelines, and classifiers
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn import preprocessing
from sklearn import pipeline
from sklearn import linear_model
from sklearn.svm import LinearSVC



In this notebook we dive into the general question of feature selection.  In the regression
and classification module notebook **Dimensionality Reduction** we looked at a specific case;
here we compare three scoring functions that scikit-learn provides for ranking features:

$$
\begin{array}[t]{lll}
1. &  \text{f_classif} &  \text{ANOVA F-value between label/feature for classification tasks.}\\
2. &  \text{chi2} & \text{Chi-squared stat for non-negative features for classification tasks.}\\
3. &  \text{mutual_info_classif} &  \text{Mutual information for a discrete target.}\\
\end{array}
$$

Mutual Information, written $\text{I}$, is defined to be:

$$
\text{I}\,(X;\,Y) = \text{D}_{\text{KL}}\,(\, \text{P}_{X,Y} \mid\mid \text{P}_{X}\otimes \text{P}_{Y}\, )
$$


where $\text{D}_{\text{KL}}$ is the Kullback-Leibler divergence (KL divergence) between two distributions,
$\text{P}_{X,Y}$ is the joint distribution of $X$ and $Y$, and $\text{P}_{X}\otimes \text{P}_{Y}$ is
the distribution that assigns $P(X) \times P(Y)$ as the joint probability of $X$ and $Y$.
KL divergence is an information-theoretic measure of how much one distribution
diverges from another, so $\text{I}\,(X;\,Y)$ measures how far
the joint distribution is from the distribution that would hold
if $X$ and $Y$ were completely independent.  If $X$ and $Y$ are independent,
$\text{I}\,(X;\,Y)$ is 0.  The more strongly $X$ depends on $Y$, the further
the joint distribution moves from independence ($\text{P}_{X}\otimes \text{P}_{Y}$) and the larger $\text{I}\,(X;\,Y)$ becomes.

Thus $\text{I}\,(X;\,Y)$ measures the amount of information you gain about the outcome of $Y$
if you know the outcome of $X$.
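
To make the definition concrete, here is a minimal sketch that computes $\text{I}\,(X;\,Y)$ directly as a KL divergence for a small discrete joint distribution. The helper name `mutual_information` and the two example tables are illustrative assumptions, not part of the notebook's pipeline.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats, computed as KL(P_XY || P_X x P_Y) for a discrete joint table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize counts to probabilities
    px = joint.sum(axis=1, keepdims=True)    # marginal of X (rows)
    py = joint.sum(axis=0, keepdims=True)    # marginal of Y (columns)
    indep = px * py                          # the product distribution P_X x P_Y
    nz = joint > 0                           # 0 * log(0) is taken to be 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / indep[nz])))

# X and Y independent: the joint equals the product of the marginals
independent = np.outer([0.5, 0.5], [0.3, 0.7])
# X and Y strongly dependent: most of the mass lies on the diagonal
dependent = np.array([[0.45, 0.05],
                      [0.05, 0.45]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(f"{mi_independent:.4f} {mi_dependent:.4f}")
```

The independent table should give a value that is zero up to rounding error, while the dependent table gives a clearly positive value.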

A chi-square test is a hypothesis-testing method. The variant relevant to our problem
checks whether the observed frequencies of some category match the expected frequencies. In our setting:  for each value v of a feature F and for each class c, we want to
see whether the number of members of class c exhibiting value v for F differs from what we would expect if F and the class were unrelated.


Let's use a toy example to demonstrate the central idea that all 3 feature selectors
implement.

In [None]:
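from sklearn.datasets import make_classification

X, y = make_classification(
        n_samples=100, n_features=10, n_informative=2, n_clusters_per_class=1, 
        shuffle=False, random_state=42)
# chi2 requires non-negative features, so rescale each column to [0, 1] first
X_nonneg = preprocessing.MinMaxScaler().fit_transform(X)
chi2_stats, p_values = chi2(X_nonneg, y)
chi2_stats

The statistics should be largest for the first few features and close to zero for the rest.
Of course we set it up this way when we created `X, y` with `make_classification`  by
setting the parameter `n_informative` to 2 and `shuffle` to False, which places the informative
features first.  Note that `make_classification` also adds `n_redundant=2` features by default;
these are linear combinations of the informative ones, so they can score well too.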

From Wikipedia: "Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. For contingency tables with smaller sample sizes, a Fisher's exact test is used instead."
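
The comparison of observed and expected frequencies can be sketched by hand for a single word. The counts below are made up for illustration (rows are classes, columns are whether a document contains the word). Note that scikit-learn's `chi2` is computed from per-class feature totals rather than from a presence/absence table like this one, so its values need not match this sketch.

```python
import numpy as np

# Hypothetical counts: rows are classes (not insult, insult),
# columns are whether a document contains some word (absent, present).
observed = np.array([[90, 10],
                     [30, 20]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if class and word occurrence were independent
expected = row_totals * col_totals / total

# Pearson's chi-squared statistic: sum of (O - E)^2 / E over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(chi2_stat)   # 18.75
```

The word appears in insults twice as often as independence would predict (20 observed against 10 expected), which is what drives the statistic up.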

## Loading the data

Let's open the CSV file with `pandas`.

In [2]:
import os.path
site = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'text_classification/'
#site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the unicode-encoded contents of the comment.

In [3]:
df[['Insult', 'Comment']].tail()

Unnamed: 0,Insult,Comment
3942,1,"""you are both morons and that is never happening"""
3943,0,"""Many toolbars include spell check, like Yahoo..."
3944,0,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,"""How about Felix? He is sure turning into one ..."
3946,0,"""You're all upset, defending this hipster band..."


In [6]:
# Split the data into training and test sets FIRST
T_train,T_test, y_train,y_test = train_test_split(df['Comment'],df['Insult'])

We imported the `chi2` function:

In [19]:
chi2

<function sklearn.feature_selection._univariate_selection.chi2(X, y)>

Here's one way to use it:

In [273]:
#tf = text.TfidfVectorizer(min_df=2,max_df=.8)
sublinear_tf = True
#sublinear_tf = False
tf = text.TfidfVectorizer(sublinear_tf=sublinear_tf)
# Train your vectorizer ONLY on the training data.
X_train = tf.fit_transform(T_train)
print(*X_train.shape)
# The k features with the highest chi-squared statistics are selected
# chi2 is a function imported above
chi2_features = SelectKBest(chi2, k = 10_000)
X_train_chi = chi2_features.fit_transform(X_train, y_train)

2960 13829


`X_train` is our **term-document** matrix.

In [274]:
X_train.shape,X_train_chi.shape

((2960, 13829), (2960, 10000))

## Training

Now, we are going to train a classifier as usual. We
have already split the data and labels into train and test sets.

We use an **SVM classifier**.

In [275]:
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf_chi = LinearSVC()

clf.fit(X_train, y_train)
clf_chi.fit(X_train_chi, y_train)

And we're done.  How'd we do?  Now we  test on the test set.  Before we can do that we need to
vectorize the test set.  But don't just copy what we did with the training data:

```
X_test = tf.fit_transform(T_test)
```

That would retrain the vectorizer from scratch.  Any words that occurred in the training texts
but not in the test texts would be forgotten!  Plus training the vectorizer
is part of the classifier training pipeline.  If we let the vectorizer see
the test data during its training phase, we'd be compromising the whole
idea of splitting training and test data.  So what we want to do
with the test data is just apply the transform part of vectorizing:

```
X_test = tf.transform(T_test)
```

That is, build a representation of the test data using only the vocabulary you learned
about in training.  Ignore any new words.
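
The difference between fitting and transforming can be seen on a tiny example. The documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["you are a fool", "have a nice day"]
test_docs = ["you are a clown", "nice day"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)     # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)          # reuses them; "clown" is ignored

print(sorted(vec.vocabulary_))
print(X_tr.shape, X_te.shape)

# Refitting on the test docs yields a different vocabulary, so the columns
# would no longer line up with what a classifier saw in training.
refit = TfidfVectorizer().fit(test_docs)
print(sorted(refit.vocabulary_))
```

Because `transform` keeps the training vocabulary, the training and test matrices have the same number of columns, in the same order, which is what the classifier requires.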

In [276]:
X_test = tf.transform(T_test)
X_test_chi = chi2_features.transform(X_test)
clf.score(X_test, y_test),clf_chi.score(X_test_chi, y_test)

(0.8358662613981763, 0.8297872340425532)

A single train/test split is not a reliable comparison.  But at least trimming the model down to 10,000 features didn't seem to hurt it much.

Let's clean this all up a bit by putting everything in a pipeline.

In [281]:
from sklearn import preprocessing
from sklearn import pipeline
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

#from sklearn import decomposition as dec
#from pipeline import make_pipeline
#from sklearn.preprocessing import StandardScaler

#poly = preprocessing.PolynomialFeatures(degree=20, include_bias=True)
#scaler = preprocessing.StandardScaler()
#lin_reg2 = linear_model.LinearRegression()

X,y = df['Comment'].values,df['Insult'].values

def make_tf_feat_selection_clf_pipeline (selection_function=None, k=5_000,
                                         selector = SelectKBest,sublinear_tf=False):
    tf = text.TfidfVectorizer(sublinear_tf=sublinear_tf)
    #chi2_features = SelectKBest(selector, k)
    if selection_function is not None:
        k_best_features = selector(selection_function, k=k)
    else:
        k_best_features = selector(k=k)
    svm_clf = LinearSVC()
    return pipeline.Pipeline([('vect', tf), ('feat_selector', k_best_features), ('svm', svm_clf)])

#pipeline_reg = make_tf_feat_selection_clf_pipeline (chi2,k=5_000)
#pipeline_reg = make_tf_feat_selection_clf_pipeline (mutual_info_classif,k=5_000)
pipeline_reg = make_tf_feat_selection_clf_pipeline (selection_function=f_classif, k=5_000,sublinear_tf=True)

# Train
T_train,T_test, y_train,y_test = train_test_split(X,y)
pipeline_reg.fit(T_train, y_train)
predicted = pipeline_reg.predict(T_test)

# scikit-learn metrics take the true labels first: metric(y_true, y_pred)
accuracy_score(y_test, predicted),\
precision_score(y_test, predicted),\
recall_score(y_test, predicted)

(0.8409321175278622, 0.7857142857142857, 0.5478927203065134)
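
A note on argument order: scikit-learn's metric functions are documented as taking the true labels first and the predictions second. Accuracy is symmetric, so the order does not matter there, but precision and recall trade places if the arguments are swapped. A small sketch with made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0]   # 1 true positive, 1 false positive, 3 false negatives

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 1 / 2
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 1 / 4

# With the arguments swapped the two quantities trade places
p_swapped = precision_score(y_pred, y_true)
r_swapped = recall_score(y_pred, y_true)
print(p, r, p_swapped, r_swapped)     # 0.5 0.25 0.25 0.5
```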

In [282]:
f_classif.__name__

'f_classif'

In [283]:
num_runs = 10
stats = np.zeros((num_runs,3))
#selection_function = chi2
selection_function = f_classif
sublinear_tf=True
print(selection_function.__name__,end="\n===================\n")
for k in (3_000,5_000, 10_000):
    for test_run in range(num_runs):
        pipeline_reg = make_tf_feat_selection_clf_pipeline (selection_function = selection_function,k=k,
                                                           sublinear_tf=sublinear_tf)
        T_train,T_test, y_train,y_test = train_test_split(X,y)
         
        # Train
        pipeline_reg.fit(T_train, y_train)
        # Test
        predicted = pipeline_reg.predict(T_test)
        # scikit-learn metrics take the true labels first: metric(y_true, y_pred)
        stats[test_run] = accuracy_score(y_test, predicted),\
                            precision_score(y_test, predicted),\
                             recall_score(y_test, predicted)

    stats_mn = stats.mean(axis=0)
    a,p,r = stats_mn

    print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

f_classif
k= 3_000 a=0.840 p=0.770 r=0.554
k= 5_000 a=0.834 p=0.762 r=0.541
k=10_000 a=0.832 p=0.731 r=0.586


In [223]:
num_runs = 10
stats = np.zeros((num_runs,3))
selection_function = chi2
#selection_function = f_classif
sublinear_tf=True  # note: not passed to the pipeline factory below, so its default (False) applies

print(selection_function.__name__,end="\n===================\n")
for k in (3_000,5_000, 10_000):
    for test_run in range(num_runs):
        pipeline_reg = make_tf_feat_selection_clf_pipeline (selection_function = selection_function,k=k)
        T_train,T_test, y_train,y_test = train_test_split(X,y)

        # Train
        pipeline_reg.fit(T_train, y_train)
        # Test
        predicted = pipeline_reg.predict(T_test)
        # scikit-learn metrics take the true labels first: metric(y_true, y_pred)
        stats[test_run] = accuracy_score(y_test, predicted),\
                            precision_score(y_test, predicted),\
                             recall_score(y_test, predicted)

    stats_mn = stats.mean(axis=0)
    a,p,r = stats_mn

    print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

chi2
k= 3_000 a=0.837 p=0.774 r=0.555
k= 5_000 a=0.836 p=0.759 r=0.568
k=10_000 a=0.837 p=0.744 r=0.592


Mutual information can be used like the others, but in the context of `SelectKBest`
it works most naturally on discrete-valued features (word counts), so we need a slightly different pipeline.

In [311]:
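# No k is passed, so SelectKBest keeps its default of k=10 features;
# the k printed below is a global left over from an earlier cell.
kb= SelectKBest(mutual_info_classif)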

# You can try these others with this pipeline as well.
#kb= SelectKBest(f_classif)
#kb= SelectKBest(chi2)

T_train,T_test, y_train,y_test = train_test_split(X,y)
         
svm_clf = LinearSVC()

cv = CountVectorizer()
tf = text.TfidfTransformer(sublinear_tf=True)

pipeline_reg=pipeline.Pipeline([('vect', cv), ('feat_selector', kb),
                                ('tfidf',tf), ('svm', svm_clf)])
# Train
pipeline_reg.fit(T_train, y_train)
# Test
predicted = pipeline_reg.predict(T_test)

In [312]:
a,p,r = accuracy_score(y_test,predicted,),\
        precision_score(y_test, predicted),\
            recall_score(y_test, predicted)

# MI k= 3_000 a=0.806 p=0.789 r=0.350
print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.806 p=0.789 r=0.350


We'd like to investigate getting around this limitation so let's try coding this up a
little differently.

### Mutual Information

We will use Mutual Information in two ways:

1.  Pass in a sequence of strings and select a vocab using word counts (operating on a count-vectorized representation of the docs, as we did above).
2.  Operate on a TFIDF term-document matrix (TDM) to do feature selection.

In [256]:
from sklearn.datasets import make_classification

class MutualInfo:
    
    def  __init__(self,k,use_counts=True):
        self.k =k
        self.use_counts= use_counts
        
    def fit_transform(self,X,y):
        """
        If self.use_counts create  cv, a count vectorized version of X,
        and assign feature ranks based mutual info of cv[:,feat_i] and y
        Use the feature ranks to choose a vocab of size self.k.
        
        Otherwise assign feature ranks based mutual info of X[:,feat_i] and y
        Use the feature ranks to choose a vocab of size self.k.
        """
        #X, y = make_classification(
        # n_samples=100, n_features=10, n_informative=2, n_clusters_per_class=1, 
        #shuffle=False, random_state=42)
        if hasattr(X,'toarray'):
            X = X.toarray()
            #print(type(X))
        if self.use_counts:
            self.cv = CountVectorizer()
            XC = self.cv.fit_transform(X)#.toarray()
            ranks = mutual_info_classif(XC, y)
            self.idxs = ranks.argsort()[-self.k:][::-1]
            self.vocabulary_ = [wd for (wd,idx) in self.cv.vocabulary_.items() if idx in self.idxs]
            return self.vocabulary_
        else:
            ranks = mutual_info_classif(X, y)
            self.idxs = ranks.argsort()[-self.k:]
            return self.transform(X)

    
    def transform (self, X):
        if X.ndim == 2:
            return X[:,self.idxs]
        else:
            return X[self.idxs]
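
The class picks its features with the `argsort` slicing idiom. Here is a minimal sketch of that idiom on its own, using made-up scores.

```python
import numpy as np

# Hypothetical relevance scores for six features
scores = np.array([0.02, 0.40, 0.00, 0.15, 0.31, 0.07])
k = 3

ascending = scores.argsort()          # indices from lowest to highest score
top_k = ascending[-k:]                # the k best, still in ascending order
top_k_desc = top_k[::-1]              # best first

print(top_k, top_k_desc)              # [3 4 1] [1 4 3]
```

Either ordering selects the same set of columns; only the column order of the reduced matrix differs.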



Toy test of `MutualInfo` class with use_counts = False

In [210]:
X0, y0 = make_classification(
        n_samples=100, n_features=10, n_informative=2, n_clusters_per_class=1, 
        shuffle=False, random_state=42)
print(X0.shape)
mi = MutualInfo(k=2,use_counts=False)
newX = mi.fit_transform(X0,y0)
print(X0[0])
print(newX[0])
print(newX.shape)
svm_clf_mi = LinearSVC()
svm_clf_mi.fit(newX,y0)
svm_clf = LinearSVC()
svm_clf.fit(X0,y0)

# No loss of info with just two feats (because we set it up that way)
print(svm_clf.score(X0,y0),svm_clf_mi.score(newX,y0))

(100, 10)
[-1.17278867  0.63356167  0.35137231  0.18646621  0.95400176  0.65139125
 -0.31526924  0.75896922 -0.77282521 -0.23681861]
[ 0.35137231 -1.17278867]
(100, 2)
1.0 1.0


In [270]:
from sklearn.feature_extraction.text import CountVectorizer


def split_fit_and_eval_mutual_info (X,y,k,use_counts=True):
    # Split train test
    # X is a seq of strings, not a pd.Series inst.
    # y is a seq of lbls
    T_train,T_test, y_train,y_test = train_test_split(X,y)

    mi = MutualInfo(k=k,use_counts=use_counts)
    ###  Find Mutual Info Vocab in use counts case
    if use_counts:
        print("Training new vocab to classification task with MI")
        vocab = mi.fit_transform(T_train,y_train)
        print(f"{len(vocab)=} {k=}")
    else:
        vocab=None

    # Instantiate Vectorizer with new vocab
    tf = text.TfidfVectorizer(vocabulary=vocab)
    XM = tf.fit_transform(T_train)
    XT = tf.transform(T_test)
    
    if not use_counts:
        XM = mi.fit_transform(XM,y_train)
        X_test = mi.transform(XT)
    else:
        X_test = XT

    # Train clf with TD Matrix fitted to new vocab
    svm_clf = LinearSVC()
    svm_clf.fit(XM,y_train)

    predictions = svm_clf.predict(X_test)
    a,p, r = accuracy_score(y_test,predictions),\
                                precision_score(y_test, predictions),\
                                 recall_score(y_test, predictions)
    return a,p,r

Discrete feature selection.  Do feature selection based on word counts.

In [271]:
X,y = df["Comment"].values,df["Insult"].values
k,use_counts = 3_000,True

a,p,r = split_fit_and_eval_mutual_info (X,y,k,use_counts=use_counts)
#a 830 p=702 r=.579; a=0.823 p=0.725 r=0.560
print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

Training new vocab to classification task with MI
len(vocab)=3000 k=3000
k= 3_000 a=0.828 p=0.749 r=0.510


In [265]:
T_train,T_test, y_train,y_test = train_test_split(X,y)
mi = MutualInfo(k=k,use_counts=True)
vocab = mi.fit_transform(T_train,y_train)
# This function (defined below) implements feature selection by accepting a pre-selected vocab.
a,p,r =  split_fit_and_eval_feature_selection (X,y,vocab=vocab)
#a 830 p=702 r=.579
print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.831 p=0.766 r=0.508


Continuous value feature selection.  Do feature selection based on TFIDF scores.

In [272]:

k,use_counts = 3_000,False

a,p,r = split_fit_and_eval_mutual_info (X,y,k,use_counts=use_counts)
#a=0.810 p=0.726 r=0.504; 
print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.826 p=0.699 r=0.528


In [268]:

# Or if no vocab is supplied, it will implement feature selection via 
# mutual information on a matrix of continuous  TFIDF values 
a,p,r =  split_fit_and_eval_feature_selection (X,y,vocab= None)
#a 830 p=702 r=.579
print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.793 p=0.692 r=0.441


Provisionally, using counts is the winner, except on recall in the first comparison.

####  Refining the results

Feature selection is a hyper-parameter choice, so we do it outside of the usual train/test loop.

That means each result is tied to the single train/test split used for the selection, so we
do feature selection multiple times.  After each feature selection, we run the usual
train/test loop (`num_runs` times).

For multiple runs and `use_counts = True`:

  1.  Do one train/test split to select a vocab with MI.
  2.  Use that vocab to vectorize, train, and test multiple train/test splits.
  
For multiple runs and `use_counts = False`:
  
  1.  Do one train/test split to create the vectorizer.
  2.  Do MI to select a vocab with that vectorizer.

In [266]:
def split_fit_and_eval_feature_selection (X,y,vocab=None):
    """
    Either use a preselected vocab
    or select features using mutual information.    
    """
    
    # Split train test
    # X is a seq of strings, not a pd.Series inst.
    # y is a seq of lbls
    T_train,T_test, y_train,y_test = train_test_split(X,y)

    # Instantiate Vectorizer with new vocab if not None
    # if vocab is None use all vocab
    vectorizer = text.TfidfVectorizer(vocabulary=vocab)
    X_train = vectorizer.fit_transform(T_train)
    X_test = vectorizer.transform(T_test)
    
    if vocab is None:
        mi = MutualInfo(k=k,use_counts=False)
        X_train  = mi.fit_transform(X_train,y_train)
        X_test = mi.transform(X_test)
        

    # Train clf with Term-Doc Matrix pruned to precomputed vocab or 
    # pruned by feature selection
    svm_clf = LinearSVC()
    svm_clf.fit(X_train,y_train)

    predictions = svm_clf.predict(X_test)
    a,p, r = accuracy_score(y_test,predictions),\
                                precision_score(y_test, predictions),\
                                 recall_score(y_test, predictions)
    return a,p,r

Here we use Mutual Information to build a discrete model (based on word counts rather than some continuous significance measure).  Since our features are words,
this amounts to having a vocabulary preselected by Mutual Information (`use_counts=True`).

Each preselected vocabulary is then evaluated with 10 separate split-and-fit rounds.

In [253]:
num_runs = 10
total_exps = num_runs**2
scores = np.zeros((3,))

for i in range(num_runs):
    print(f"\nsplit: {i}")
    T_train,T_test, y_train,y_test = train_test_split(X,y)
    mi = MutualInfo(k=k,use_counts=True)
    vocab = mi.fit_transform(T_train,y_train)
    
    for j in range(num_runs):
        print(f"  expt: {j}")
        scores += split_fit_and_eval_feature_selection (X,y,vocab=vocab)
    print()
    
a,p,r =  scores/total_exps
print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

split: 0
 expt: 0
 ...
 expt: 9
split: 1
 expt: 0
 ...
 expt: 9
...
split: 9
 expt: 0
 ...
 expt: 9
k= 3_000 a

Next we use the continuous version of Mutual Information.  This means doing the MI feature selection on term-document matrices containing TFIDF scores. For continuous features, scikit-learn estimates MI with a nearest-neighbor method (3 neighbors by default) rather than from counts.

The continuous MI feature selection is expensive.  Rather than follow the paradigm above
(where we create 10 reduced feature models and evaluate each reduced model with 10 train/test splits, for 100 experiments), we simply try 10 feature selections, each estimated from a different train/test split.
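
The two estimators can be compared on synthetic data. This is a rough sketch under stated assumptions: the data are simulated Poisson counts (not the troll corpus), and `discrete_features`, `n_neighbors`, and `random_state` are the `mutual_info_classif` parameters that switch between and control the two estimators.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
y_toy = rng.integers(0, 2, size=n)

# Column 0 depends on the label; column 1 is pure noise.
counts = np.column_stack([
    rng.poisson(lam=1 + 3 * y_toy),     # higher counts when y_toy == 1
    rng.poisson(lam=2, size=n),
])

# Discrete estimator: treats each distinct count as a category
mi_discrete = mutual_info_classif(counts, y_toy, discrete_features=True)

# Continuous estimator: nearest-neighbor based, n_neighbors=3 is the default
mi_continuous = mutual_info_classif(counts.astype(float), y_toy,
                                    discrete_features=False,
                                    n_neighbors=3, random_state=0)
print(mi_discrete.round(3), mi_continuous.round(3))
```

Both estimators should rank the informative column above the noise column, though the actual values differ between them.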

In [258]:
num_runs = 10
total_exps = num_runs
scores = np.zeros((3,))

for i in range(1):
    print(f"\nsplit set: {i}")

    #T_train,T_test, y_train,y_test = train_test_split(X,y)
    #tf = text.TfidfVectorizer()
    #X_train = tf.fit_transform(T_train)
    #mi = MutualInfo(k=k,use_counts=False)
    #X_train_T = mi.fit_transform(X_train,y_train)

    for j in range(num_runs):
        print(f"  expt: {j}")
        scores += split_fit_and_eval_feature_selection (X,y,vocab=None)
    print()

a,p,r =  scores/total_exps

print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")


split set: 0
  expt: 0
  expt: 1
  expt: 2
  expt: 3
  expt: 4
  expt: 5
  expt: 6
  expt: 7
  expt: 8
  expt: 9

k= 3_000 a=0.800 p=0.686 r=0.464


The result is a resounding defeat. The continuous models perform worse on all measures than the discrete
models.

##  Using the transformer

We can remove some of the redundancy by simplifying the interface of the Mutual Information class.

It **always** accepts a term-document matrix (TDM) and always outputs a reduced form.

To do feature selection on the basis of counts, we pass in a count-vectorized
TDM and get back a TDM truncated to the selected features.  We then pass that to a TFIDFTransformer
(which accepts a count-vectorized TDM).

Model A
----------

```python
T_train, T_test, y_train, y_test = train_test_split(X, y)
```

vec_pipe = [CountVectorizer =>  Mutual Info => TFIDFTransformer => LinearSVC]

```python
vec_pipe.fit(T_train, y_train)
predicted = vec_pipe.predict(T_test)
```


Model B
-----------

```python
T_train, T_test, y_train, y_test = train_test_split(X, y)
```

vec_pipe = [TFIDFVectorizer =>  Mutual Info => LinearSVC]


```python
vec_pipe.fit(T_train, y_train)
predicted = vec_pipe.predict(T_test)
```

Note we switch from the TFIDFTransformer to the TFIDFVectorizer.  The former accepts a count-vectorized
TDM as input.  The latter wants a sequence of document strings.  Both
output a TDM with TFIDF values.

CountVectorizer => TFIDFTransformer

is equivalent  to

TFIDFVectorizer

The motivation for resorting to the  Transformer is so that we could interpose
Mutual Information feature selection between Count Vectorizing and converting the
counts to TFIDF values.  This allows MI selection to work on probabilities based on counts,
more or less its original intent, and avoids averaging based on nearest neighbors.
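
The equivalence between the one-step and two-step routes can be checked on a toy corpus. The documents are made up for illustration; both routes use the same `sublinear_tf` setting.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

docs = ["you are a fool", "have a nice day", "what a nice fool you are"]

# One step: strings -> TFIDF
direct = TfidfVectorizer(sublinear_tf=True).fit_transform(docs)

# Two steps: strings -> counts -> TFIDF
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer(sublinear_tf=True).fit_transform(counts)

same = np.allclose(direct.toarray(), two_step.toarray())
print(direct.shape, two_step.shape, same)
```

The two-step route is what leaves room for a feature selector between the counting step and the TFIDF weighting step.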


In [290]:
from sklearn.datasets import make_classification

class MutualInfo:
    
    def  __init__(self,k,discrete_features=True):
        self.k =k
        self.discrete_features = discrete_features
        
    def fit_transform(self,X,y):
        """
        Rank features by the mutual information of X[:,feat_i] and y,
        treating the features as discrete if self.discrete_features is True
        and as continuous otherwise.

        Keep the self.k highest-ranked columns of X and return the
        reduced matrix.
        """
        
        if hasattr(X,'toarray'):
            X = X.toarray()
    
        ranks = mutual_info_classif(X, y,discrete_features=self.discrete_features)
        self.idxs = ranks.argsort()[-self.k:]
        return self.transform(X)

    def transform (self, X):
        if X.ndim == 2:
            return X[:,self.idxs]
        else:
            return X[self.idxs]


sublinear_tf,k=True,3000

In [294]:
### Model A

def make_model_A ():
    cv = CountVectorizer()
    mi = MutualInfo(k=k,discrete_features=True)
    tf = text.TfidfTransformer(sublinear_tf=sublinear_tf)
    svm_clf = LinearSVC()

    return pipeline.Pipeline([('vect', cv), 
                              ('feat_selector', mi), 
                              ('tfidf_vect',tf), 
                              ('svm', svm_clf)])

# Data 
X,y = df["Comment"].values,df["Insult"].values
T_train,T_test, y_train,y_test = train_test_split(X,y)

#Train
vec_pipe = make_model_A ()
vec_pipe.fit(T_train, y_train)
#Test
predicted = vec_pipe.predict(T_test)

a,p, r = accuracy_score(y_test,predicted),\
           precision_score(y_test, predicted),\
              recall_score(y_test, predicted)

print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.817 p=0.765 r=0.533


In [295]:
# Model B

def make_model_B():
    mi = MutualInfo(k=k, discrete_features=False)
    tf = text.TfidfVectorizer(sublinear_tf=sublinear_tf)
    svm_clf = LinearSVC()

    return pipeline.Pipeline([('tfidf_vect',tf),  ('feat_selector', mi), ('svm', svm_clf)])

# Data 
X,y = df["Comment"].values,df["Insult"].values
T_train,T_test, y_train,y_test = train_test_split(X,y)

#Train
vec_pipe =  make_model_B()
vec_pipe.fit(T_train, y_train)
#Test
predicted = vec_pipe.predict(T_test)

a,p, r = accuracy_score(y_test, predicted),\
           precision_score(y_test, predicted),\
              recall_score(y_test, predicted)

print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.785 p=0.678 r=0.482


### Grid Search + Pipeline

The basic idea of dimensionality reduction is to replace the original features with a smaller number of derived ones before classifying.

Model

[TFIDF -> TruncatedSVD -> SVM]
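
As a quick illustration of the reduction step alone, the sketch below applies `TruncatedSVD` to the TFIDF matrix of a toy corpus (the documents are invented). Unlike feature selection, the resulting columns are combinations of words rather than a subset of the original word columns.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["you are a fool", "have a nice day", "what a nice fool you are",
        "nice weather today", "you fool", "a nice day today"]

X_tfidf = TfidfVectorizer().fit_transform(docs)      # sparse, one column per word
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_tfidf)               # dense, one column per component

print(X_tfidf.shape, X_reduced.shape)
```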



In [8]:
from sklearn.decomposition import NMF, PCA, SparsePCA, TruncatedSVD
# SparsePCA is imported only for the optional branch below; it seems ill-behaved, so it is off by default
from datetime import datetime

## Data 
X,y = df["Comment"].values,df["Insult"].values
T_train,T_test, y_train,y_test = train_test_split(X,y)

## Reduction params
k,sparse_pca=500,False
#red = TruncatedSVD(n_components=k)
#red = NMF(n_components=k)

if sparse_pca:
    print("Using sparse PCA for reduction")
    red = SparsePCA(n_components=k)
else:
    print("Using Trucated SVD for reduction")
    red = TruncatedSVD(n_components=k)
# Data 

#TSVD k= 500 a=0.823 p=0.716 r=0.580

######  Text -> TFIDF
vectorizer = text.TfidfVectorizer()
X_train = vectorizer.fit_transform(T_train)
X_test = vectorizer.transform(T_test)

if sparse_pca:
    X_train = X_train.toarray()
    X_test = X_test.toarray()

#########

###### TFIDF -> reduced
print(f"Beginning reduction fit print {datetime.now()} ")
X_train_red = red.fit_transform(X_train)
print(f"Reduction fit completed {datetime.now()} ")
X_test_red = red.transform(X_test)
############

### Reduced =>  Class

svm_clf = LinearSVC()
svm_clf.fit(X_train_red,y_train)
predicted = svm_clf.predict(X_test_red)

### Eval
a,p, r = accuracy_score(y_test, predicted),\
           precision_score(y_test, predicted),\
              recall_score(y_test, predicted)

print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

Using Trucated SVD for reduction
Beginning reduction fit print 2024-04-19 15:04:00.292262 
Reduction fit completed 2024-04-19 15:04:01.904266 
k=   500 a=0.839 p=0.698 r=0.615


###  Putting it all together:  The Grid Search

We have two different sets of techniques for reducing the size of document representations
and hopefully improving classifier performance: dimensionality reduction and feature selection.
But which reduction technique should we use, and how many features should we keep?

Answering questions like these is what the scikit-learn grid search package was designed for.
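
Before running a search it helps to see how a grid is enumerated. The sketch below uses `ParameterGrid` on a toy grid; the string values such as `"svd"` are placeholders standing in for real estimators, and the numbers are arbitrary.

```python
from sklearn.model_selection import ParameterGrid

# A toy grid (hypothetical values) with the same two-part shape as a
# "reduction method + classifier setting" search
toy_grid = [
    {"reduce_dim": ["svd"], "reduce_dim__n_components": [50, 100],
     "classify__C": [1, 10]},
    {"reduce_dim": ["kbest_chi2", "kbest_f"], "reduce_dim__k": [50, 100],
     "classify__C": [1, 10]},
]

points = list(ParameterGrid(toy_grid))
print(len(points))        # 1*2*2 + 2*2*2 = 12 grid points
for p in points[:3]:
    print(p)
```

Each dictionary in the list is expanded separately, and in a cross-validated search every grid point is fit once per fold.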

In [None]:
pipe = pipeline.Pipeline(
    [
        ("vectorizer", text.TfidfVectorizer()),
        # the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC()),
    ]
)

N_FEATURES_OPTIONS = [5_000, 3_000, 500]
C_OPTIONS = [1, 10, 100]
#DIM_REDUCERS = [TruncatedSVD(), NMF(max_iter=1_000)]
DIM_REDUCERS = [TruncatedSVD()]

param_grid = [
    {
        "reduce_dim": DIM_REDUCERS,
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {
        "reduce_dim": [SelectKBest(chi2),SelectKBest(f_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]

# This is the default
nfolds=5

grid = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid, scoring="f1",cv=nfolds)
grid.fit(X, y)

In [None]:
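import pandas as pd

# ParameterGrid walks through each dict of param_grid in turn, so a single
# reshape of mean_test_score would mix reducers; collect labeled rows instead.
rows = []
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    reducer = params["reduce_dim"]
    if isinstance(reducer, SelectKBest):
        label = f"KBest({reducer.score_func.__name__})"
        n_feats = params["reduce_dim__k"]
    else:
        label = "SVD"
        n_feats = params["reduce_dim__n_components"]
    rows.append((label, n_feats, score))
scores_df = pd.DataFrame(rows, columns=["reducer", "n_features", "score"])

# select score for best C, then lay out as n_features x reducer to ease plotting
mean_scores = scores_df.groupby(["n_features", "reducer"])["score"].max().unstack()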

ax = mean_scores.plot.bar()
ax.set_title("Comparing feature reduction techniques")
ax.set_xlabel("Reduced number of features")
ax.set_ylabel("F-Score")
ax.set_ylim((0, 1))
ax.legend(loc="upper left")

plt.show()

Discussion questions

1.  How many experiments did the grid search do?  Note this depends both on the param grid and the number of "folds" in the cross validation strategy.
2. How many distinct grid points are represented in the plot?  
3. Where did the others go?
4. What feature has been left out of plot? 
5. What is the best combination of features?