# Text Classification:  Insults with Naive Bayes

In [21]:
a[:4],a[4:]

(array([0, 1, 2, 3]), array([ 4,  5,  6,  7,  8,  9, 10, 11]))

In [20]:
a = np.arange(12)
b=np.arange(12,16)
a

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [17]:
bp = b.copy()

In [18]:
bp[-1] = 2
bp

array([12, 13, 14,  2])

In [19]:
b

array([12, 13, 14, 15])

In [14]:
c = np.concatenate([a,b,b])
c

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 12,
       13, 14, 15])

In [5]:
sklearn.__version__

'1.2.1'

In [1]:
import numpy as np
import pandas as pd
import sklearn
from datetime import datetime
import time

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
# Load libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif
from sklearn import preprocessing
from sklearn import pipeline
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.decomposition import NMF, PCA, TruncatedSVD



In this notebook we dive into the general question of feature selection for classifiers. In the regression
and classification module notebook **Dimensionality Reduction** we looked at 
applying dimensionality reduction to classifiers; here we turn to feature selection, another
technique that generally results in more compact representations and hopefully better performing
classifiers.

The general approach in this notebook is to use some metric
to quantify the usefulness of each feature and then
discard all but the top $k$ features.  We'll look at the following feature selection functions in scikit-learn.

$$
\begin{array}[t]{lll}
1. &  \text{mutual_info_classif} &  \text{Mutual information for a discrete target.}\\
2. &  \text{chi2} & \text{Chi-squared stat for non-negative features for classification tasks.}\\
3. &  \text{f_classif} &  \text{ANOVA F-value between label/feature for classification tasks.}\\
\end{array}
$$

Mutual Information, written $\text{I}$, is defined to be:

$$
\text{I}\,(X;\,Y) = \text{D}_{\text{KL}}\,(\, \text{P}_{X,Y} \mid\mid \text{P}_{X}\otimes \text{P}_{Y}\, )
$$


where $\text{D}_{\text{KL}}$ is the Kullback-Leibler Divergence (KL-Divergence) of two distributions,
$\text{P}_{X,Y}$ is the joint distribution of X and Y, and $\text{P}_{X}\otimes \text{P}_{Y}$ is 
the distribution that gives $P(X) \times P(Y)$ as the joint probability of X and Y.
KL Divergence is an information-theoretic measure of the divergence
between two distributions, so $\text{I}\,(X;\,Y)$ measures how far
the joint distribution is from the distribution that would obtain
if X and Y were completely independent.  If X and Y are independent,
$\text{I}\,(X;\,Y)$ is 0.  The more X depends on Y, the larger
the difference between the joint distribution and independence ($\text{P}_{X}\otimes \text{P}_{Y}$).

Thus $\text{I}\,(X;\,Y)$ measures the amount of information you gain about the outcome of $Y$
if you know the outcome of $X$.  If Y is a class assignment and X is a feature column
in our data matrix, $\text{I}\,(X;\,Y)$  measures how much knowing the value of feature $X$ 
tells us about the class Y.  Hence, it makes sense to use $\text{I}\,(F;\,c)$ to measure how useful
feature F is for a classification problem with class variable $c$.  
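
To make the definition concrete, here is a minimal sketch that computes $\text{I}\,(X;\,Y)$ for two discrete variables directly from the KL-divergence formula above. The helper name `mutual_information` and the toy arrays are illustrative assumptions, not part of scikit-learn, and the result is in nats because the natural log is used.

```python
import numpy as np

def mutual_information(x, y):
    """I(X;Y) in nats for two discrete 1-D arrays, as KL(P_XY || P_X (x) P_Y)."""
    _, x_idx = np.unique(x, return_inverse=True)
    _, y_idx = np.unique(y, return_inverse=True)
    # Empirical joint distribution P_XY as a contingency table
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    # Product of the marginals: the joint we would see under independence
    indep = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # cells with zero probability contribute nothing
    return float(np.sum(joint[nz] * np.log(joint[nz] / indep[nz])))

y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x_informative = y_toy.copy()                          # tells us everything about y
x_independent = np.array([0, 1, 0, 1, 0, 1, 0, 1])    # tells us nothing about y

mi_informative = mutual_information(x_informative, y_toy)
mi_independent = mutual_information(x_independent, y_toy)
```

For the fully informative feature the value is $\log 2 \approx 0.693$ nats (one bit); for the independent feature it is 0.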

The chi2 classification function takes the feature matrix and the class labels as arguments
and for each feature computes the $\chi^{2}$ statistic representing the strength
of the feature's association with the class.

A note of caution:  The test really only makes sense for categorical variables, but
the scikit learn implementation is built in such a way that it can be used for
continuous variables.  We will try both.
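
As a rough illustration of why the statistic extends to any non-negative features, the sketch below computes a $\chi^{2}$ value per column by comparing per-class feature totals with the totals expected if feature and class were independent. To my understanding this mirrors the computation behind scikit-learn's `chi2`, but the helper `chi2_by_hand` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def chi2_by_hand(X, y):
    """Chi-squared statistic for each column of a non-negative matrix X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes)
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                            # per-class total of each feature
    feature_total = X.sum(axis=0, keepdims=True)  # overall total of each feature
    class_prob = Y.mean(axis=0, keepdims=True)    # share of samples in each class
    expected = class_prob.T @ feature_total       # totals expected under independence
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Column 0 occurs only in class 0; column 1 is spread evenly over both classes.
X_toy = np.array([[3, 1], [2, 1], [0, 1], [0, 1]])
y_toy = np.array([0, 0, 1, 1])
toy_stats = chi2_by_hand(X_toy, y_toy)
```

Here column 0 gets a statistic of 5.0 and column 1 gets 0.0. Nothing in the arithmetic requires integer counts, which is why the function can also be applied to continuous non-negative values such as TFIDF weights.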

In [4]:
X, y = make_classification(
        n_samples=100, n_features=10, n_informative=2, n_clusters_per_class=1, 
        shuffle=False, random_state=42,shift=5)
chi2_stats, p_values = chi2(X, y)
chi2_stats

array([2.25415508e+01, 7.35179236e-02, 7.25973700e-02, 1.01320229e-02,
       1.08125275e+00, 5.54434465e-02, 1.13279935e-02, 1.09136519e-01,
       1.32228682e-01, 1.66505941e-02])

The numbers show that all but 2 of the 10 features have little connection with the classification
problem.  Of course we set it up this way when we created `X, y` with `make_classification`,  by
setting the parameter `n_informative` to 2.

Here are the column numbers arranged from least significant to most significant:

In [10]:
chi2_stats.argsort()

array([3, 6, 9, 5, 2, 1, 7, 8, 4, 0])

Hence, the following code makes a new feature matrix containing only the two
most significant features (the columns indexed 0 and 4).

In [11]:
X_reduced = X[:, chi2_stats.argsort()[-2:]]
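
The same selection can be done with `SelectKBest`, which is what the rest of this notebook uses. The sketch below uses synthetic non-negative data (an assumption made so the example is self-contained) and checks that `SelectKBest` keeps the same columns as the `argsort` approach. One difference: `SelectKBest` returns the kept columns in their original order rather than ordered by score.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
X_demo = rng.random((100, 5))      # non-negative noise features
X_demo[:, 1] += 2 * y_demo         # strongly class-dependent column
X_demo[:, 3] += y_demo             # moderately class-dependent column

# Manual selection: indices of the two highest chi-squared scores
scores, _ = chi2(X_demo, y_demo)
top2 = np.sort(scores.argsort()[-2:])

# The same selection with SelectKBest
selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)
kept = selector.get_support(indices=True)
X_demo_reduced = selector.transform(X_demo)
```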

Finally the ANOVA F-value (`f_classif` in scikit-learn) is the ratio of the explained
variance to the unexplained variance. If the null hypothesis is true, you expect F to have a value close to 1.0 most of the time. A large F ratio means that the variation among group means is more than you'd expect to see by chance.

## Loading the data

Let's open the CSV file with `pandas`.

In [2]:
import os.path
site = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'text_classification/'
#site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the unicode-encoded contents of the comment.

In [3]:
df[['Insult', 'Comment']].tail()

Unnamed: 0,Insult,Comment
3942,1,"""you are both morons and that is never happening"""
3943,0,"""Many toolbars include spell check, like Yahoo..."
3944,0,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,"""How about Felix? He is sure turning into one ..."
3946,0,"""You're all upset, defending this hipster band..."


In [6]:
# Split the data into training and test sets FIRST
T_train,T_test, y_train,y_test = train_test_split(df['Comment'],df['Insult'])

### Demoing $\chi^{2}$

We imported the `chi2` function:

In [19]:
chi2

<function sklearn.feature_selection._univariate_selection.chi2(X, y)>

Here's one way to use it:

In [273]:
#tf = text.TfidfVectorizer(min_df=2,max_df=.8)
sublinear_tf = True
#sublinear_tf = False
tf = text.TfidfVectorizer(sublinear_tf=sublinear_tf)
# Train your vectorizer ONLY on the training data.
X_train = tf.fit_transform(T_train)
print(*X_train.shape)
# The k features with the highest chi-squared statistics are selected
# chi2 is a function imported above
chi2_features = SelectKBest(chi2, k = 10_000)
X_train_chi = chi2_features.fit_transform(X_train, y_train)

2960 13829


`X_train` is our **term-document** matrix.

In [274]:
X_train.shape,X_train_chi.shape

((2960, 13829), (2960, 10000))

## Preliminary experiments

Now, we are going to train a classifier as usual. We
have already split the data and labels into train and test sets.

We use an **SVM classifier** and try it out on the two representations of our
data: the full TFIDF matrix and the $\chi^{2}$-reduced one.

In [None]:
from sklearn.svm import LinearSVC

clf = LinearSVC()
clf_chi = LinearSVC()

# Unreduced
clf.fit(X_train, y_train)

In [None]:
# Reduced
clf_chi.fit(X_train_chi, y_train)

And we're done.  How'd we do?  Now we  test on the test set.  Before we can do that we need to
vectorize the test set.  But don't just copy what we did with the training data:

```
X_test = tf.fit_transform(T_test)
```

That would retrain the vectorizer from scratch.  Any words that occurred in the training texts
but not in the test texts would be forgotten!  Plus training the vectorizer
is part of the classifier training pipeline.  If we let the vectorizer see
the test data during its training phase, we'd be compromising the whole
idea of splitting training and test data.  So what we want to do
with the test data is just apply the transform part of vectorizing:

```
X_test = tf.transform(T_test)
```

That is, build a representation of the test data using only the vocabulary you learned
about in training.  Ignore any new words.
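
The difference is easy to see on a toy corpus. In this sketch (the documents are made up for illustration) the test matrix produced by `transform` has the same columns as the training matrix, and a document made only of unseen words becomes an all-zero row. Refitting on the test documents would instead produce a different vocabulary that the trained classifier could not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["you are a fool", "have a nice day"]
test_docs = ["you are nice", "zebra"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)   # learns the vocabulary from training docs
X_te = vec.transform(test_docs)        # reuses that vocabulary; "zebra" is ignored

# What NOT to do: refitting builds a new, incompatible vocabulary
refit = TfidfVectorizer().fit(test_docs)

n_train_cols = X_tr.shape[1]
n_test_cols = X_te.shape[1]
n_refit_cols = len(refit.vocabulary_)
```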

In [276]:
X_test = tf.transform(T_test)
X_test_chi = chi2_features.transform(X_test)
clf.score(X_test, y_test),clf_chi.score(X_test_chi, y_test)

(0.8358662613981763, 0.8297872340425532)

A single train/test split is not a reliable comparison, but at least trimming down the model didn't seem to hurt it.

Let's clean this all up a bit by putting everything in a pipeline.

In [281]:
from sklearn import preprocessing
from sklearn import pipeline
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

#from sklearn import decomposition as dec
#from pipeline import make_pipeline
#from sklearn.preprocessing import StandardScaler

#poly = preprocessing.PolynomialFeatures(degree=20, include_bias=True)
#scaler = preprocessing.StandardScaler()
#lin_reg2 = linear_model.LinearRegression()

X,y = df['Comment'].values,df['Insult'].values

def make_tf_feat_selection_clf_pipeline (selection_function=None, k=5_000,
                                         selector = SelectKBest,sublinear_tf=False):
    tf = text.TfidfVectorizer(sublinear_tf=sublinear_tf)
    #chi2_features = SelectKBest(selector, k)
    if selection_function is not None:
        k_best_features = selector(selection_function, k=k)
    else:
        k_best_features = selector(k=k)
    svm_clf = LinearSVC()
    return pipeline.Pipeline([('vect', tf), ('feat_selector', k_best_features), ('svm', svm_clf)])


pipeline_reg = make_tf_feat_selection_clf_pipeline (selection_function=f_classif, k=5_000,sublinear_tf=True)

# Train
T_train,T_test, y_train,y_test = train_test_split(X,y)
pipeline_reg.fit(T_train, y_train)
predicted = pipeline_reg.predict(T_test)

# scikit-learn metrics take the true labels first, then the predictions
accuracy_score(y_test, predicted),\
precision_score(y_test, predicted),\
recall_score(y_test, predicted)

(0.8409321175278622, 0.7857142857142857, 0.5478927203065134)
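
A note on argument order: scikit-learn's metric functions expect the true labels first and the predictions second. Accuracy is symmetric, so swapping the arguments goes unnoticed there, but precision and recall trade places. The toy labels below are made up to show the effect.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 0]   # one correct positive, two misses

# Correct order: (y_true, y_pred)
acc = accuracy_score(y_true_toy, y_pred_toy)
prec = precision_score(y_true_toy, y_pred_toy)
rec = recall_score(y_true_toy, y_pred_toy)

# Swapped order: accuracy is unchanged, precision and recall are exchanged
acc_swapped = accuracy_score(y_pred_toy, y_true_toy)
prec_swapped = precision_score(y_pred_toy, y_true_toy)
rec_swapped = recall_score(y_pred_toy, y_true_toy)
```

With the correct order precision is 1.0 and recall is 1/3; with the arguments swapped the two values are reversed.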

In [283]:
num_runs = 10
stats = np.zeros((num_runs,3))
#selection_function = chi2
selection_function = f_classif
sublinear_tf=True
print(selection_function.__name__,end="\n===================\n")
for k in (3_000,5_000, 10_000):
    for test_run in range(num_runs):
        pipeline_reg = make_tf_feat_selection_clf_pipeline (selection_function = selection_function,k=k,
                                                           sublinear_tf=sublinear_tf)
        T_train,T_test, y_train,y_test = train_test_split(X,y)
         
        # Train
        pipeline_reg.fit(T_train, y_train)
        # Test
        predicted = pipeline_reg.predict(T_test)
        # scikit-learn metrics take the true labels first, then the predictions
        stats[test_run] = accuracy_score(y_test, predicted),\
                            precision_score(y_test, predicted),\
                             recall_score(y_test, predicted)

    stats_mn = stats.mean(axis=0)
    a,p,r = stats_mn

    print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

f_classif
k= 3_000 a=0.840 p=0.770 r=0.554
k= 5_000 a=0.834 p=0.762 r=0.541
k=10_000 a=0.832 p=0.731 r=0.586


In [223]:
num_runs = 10
stats = np.zeros((num_runs,3))
selection_function = chi2
#selection_function = f_classif
sublinear_tf=True  # not passed to the pipeline below, which therefore uses the default sublinear_tf=False

print(selection_function.__name__,end="\n===================\n")
for k in (3_000,5_000, 10_000):
    for test_run in range(num_runs):
        pipeline_reg = make_tf_feat_selection_clf_pipeline (selection_function = selection_function,k=k)
        T_train,T_test, y_train,y_test = train_test_split(X,y)

        # Train
        pipeline_reg.fit(T_train, y_train)
        # Test
        predicted = pipeline_reg.predict(T_test)
        # scikit-learn metrics take the true labels first, then the predictions
        stats[test_run] = accuracy_score(y_test, predicted),\
                            precision_score(y_test, predicted),\
                             recall_score(y_test, predicted)

    stats_mn = stats.mean(axis=0)
    a,p,r = stats_mn

    print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

chi2
k= 3_000 a=0.837 p=0.774 r=0.555
k= 5_000 a=0.836 p=0.759 r=0.568
k=10_000 a=0.837 p=0.744 r=0.592


Mutual information can be used like the other scoring functions, but in the context of `SelectKBest`
it requires discrete-valued features, so we need a slightly different pipeline.

We will use a CountVectorizer on our doc set to give a term-document matrix (TDM) filled with
counts, run mutual information on that to reduce the features, and then classify in
the usual way.  Note we use a TfidfTransformer rather than a TfidfVectorizer.  This
is because the docs have already been vectorized, so we need a transformer that
maps from a TDM with counts to a TDM with TFIDF values; TfidfTransformer fits
the bill.

The following code also gives us a chance to try out chi2 on counts, which
is what it was originally intended for.

In [312]:


k = 500
kb= SelectKBest(mutual_info_classif,k=k)
# You can try these others with this pipeline as well.
# chi2 and f_classif work better with numbers over 3K
#kb= SelectKBest(f_classif,k=k)
#kb= SelectKBest(chi2, k=k)


T_train,T_test, y_train,y_test = train_test_split(X,y)
         
svm_clf = LinearSVC()

cv = text.CountVectorizer()  # CountVectorizer lives in sklearn.feature_extraction.text (imported as text)
tf = text.TfidfTransformer(sublinear_tf=True)

pipeline_reg=pipeline.Pipeline([('vect', cv), ('feat_selector', kb),
                                ('tfidf',tf), ('svm', svm_clf)])
# Train
pipeline_reg.fit(T_train, y_train)
# Test
predicted = pipeline_reg.predict(T_test)


# Evaluate
a,p,r = accuracy_score(y_test,predicted,),\
        precision_score(y_test, predicted),\
            recall_score(y_test, predicted)

# MI k= 3_000 a=0.806 p=0.789 r=0.350
# chi k= 3_000 a=0.806 p=0.789 r=0.350
print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.806 p=0.789 r=0.350


##  A Mutual Information transformer

Let's implement a Mutual Information Transformer.

Assumption:
The Mutual Information Transformer **always** accepts a term-document matrix (as a 2D numpy array) and always outputs 
a 2D numpy array with fewer columns.

We will use Mutual Information in two ways:

1.  Pass in a sequence of strings and select a vocab using word counts (operating on a Count Vectorized representation of the docs like we did above).
2.  Operate on a TFIDF TDM to do feature selection.


Model A
----------

To do feature selection on the basis of counts, we pass in a count-vectorized 
TDM and get back a TDM with fewer columns.  We then pass that to a TFIDFTransformer
(which accepts a CountVectorized TDM).


`T_Train, T_Test, y_Train, y_Test`

`vec_pipe = [CountVectorizer => Mutual Info => TFIDFTransformer => LinearSVC]`

`vec_pipe.fit(T_Train, y_Train)`

`predicted = vec_pipe.predict(T_Test)`


Model B
-----------

Operate on a TFIDF TDM to do feature selection.

```python
T_Train, T_Test, y_Train, y_Test
```

vec_pipe = [TFIDFVectorizer =>  Mutual Info => LinearSVC]


```python
vec_pipe.fit(T_Train, y_Train)
predicted = vec_pipe.predict(T_Test)
```

Note we switch from the TFIDFTransformer to the TFIDFVectorizer.  The former accepts a count
vectorized TDM as input.  The latter wants a sequence of document strings.  Both
output a TDM with TFIDF values.  

CountVectorizer => TFIDFTransformer

is equivalent  to

TFIDFVectorizer

The motivation for resorting to the Transformer is so that we can interpose
Mutual Information feature selection between Count Vectorizing and converting the
counts to TFIDF values.  This allows MI selection to work on probabilities based on counts,
more or less its original intent, and avoids the nearest-neighbor estimation used for continuous-valued features.
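
The equivalence claimed above can be checked directly. This sketch, on a made-up three-document corpus, fits both routes with the same `sublinear_tf` setting and compares the resulting matrices.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import make_pipeline

docs = ["you are a fool", "have a nice day", "you have a nice dog"]

# Route 1: counts first, then TFIDF weighting
two_step = make_pipeline(CountVectorizer(), TfidfTransformer(sublinear_tf=True))
X_two = two_step.fit_transform(docs)

# Route 2: TFIDF in one step
one_step = TfidfVectorizer(sublinear_tf=True)
X_one = one_step.fit_transform(docs)

same = np.allclose(X_two.toarray(), X_one.toarray())
```

The two-step route is the only one that exposes the raw counts, which is what lets a feature selector sit between the two stages.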


In [290]:
from sklearn.datasets import make_classification

class MutualInfo:
    
    def  __init__(self,k,discrete_features=True):
        self.k =k
        self.discrete_features = discrete_features
        
    def fit_transform(self,X,y):
        """
        Rank each feature (column) of X by its mutual information with y
        and keep the indices of the self.k highest-ranked features.

        If self.discrete_features is True, the columns of X are treated as
        discrete (e.g. a count-vectorized TDM); otherwise they are treated
        as continuous (e.g. TFIDF values).
        """
        
        if hasattr(X,'toarray'):
            X = X.toarray()
    
        ranks = mutual_info_classif(X, y,discrete_features=self.discrete_features)
        self.idxs = ranks.argsort()[-self.k:]
        return self.transform(X)

    def transform (self, X):
        if X.ndim == 2:
            return X[:,self.idxs]
        else:
            return X[self.idxs]


sublinear_tf,k=True,3000

In [294]:
### Model A

def make_model_A ():
    cv = text.CountVectorizer()  # from sklearn.feature_extraction.text (imported as text)
    mi = MutualInfo(k=k,discrete_features=True)
    tf = text.TfidfTransformer(sublinear_tf=sublinear_tf)
    svm_clf = LinearSVC()

    return pipeline.Pipeline([('vect', cv), 
                              ('feat_selector', mi), 
                              ('tfidf_vect',tf), 
                              ('svm', svm_clf)])

# Data 
X,y = df["Comment"].values,df["Insult"].values
T_train,T_test, y_train,y_test = train_test_split(X,y)

#Train
vec_pipe = make_model_A ()
vec_pipe.fit(T_train, y_train)
#Test
predicted = vec_pipe.predict(T_test)

a,p, r = accuracy_score(y_test,predicted),\
           precision_score(y_test, predicted),\
              recall_score(y_test, predicted)

print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.817 p=0.765 r=0.533


In [295]:
# Model B

def make_model_B():
    mi = MutualInfo(k=k, discrete_features=False)
    tf = text.TfidfVectorizer(sublinear_tf=sublinear_tf)
    svm_clf = LinearSVC()

    return pipeline.Pipeline([('tfidf_vect',tf),  ('feat_selector', mi), ('svm', svm_clf)])

# Data 
X,y = df["Comment"].values,df["Insult"].values
T_train,T_test, y_train,y_test = train_test_split(X,y)

#Train
vec_pipe =  make_model_B()
vec_pipe.fit(T_train, y_train)
#Test
predicted = vec_pipe.predict(T_test)

a,p, r = accuracy_score(y_test, predicted),\
           precision_score(y_test, predicted),\
              recall_score(y_test, predicted)

print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

k= 3_000 a=0.785 p=0.678 r=0.482


### Dimensionality reduction

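The basic idea of dimensionality reduction is to replace the original features with a smaller
number of new features, each a combination of the original ones, rather than selecting a subset of them.

Model

[TFIDF -> TruncatedSVD -> SVM]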
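
Before running this on the real data, here is a small sketch on a random matrix (an assumption made purely for illustration) showing what the reduction step does: each reduced column is a linear combination of all the original columns, with the weights stored in `components_`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_toy = rng.random((20, 8))             # 20 "documents", 8 original features

svd = TruncatedSVD(n_components=3, random_state=0)
X_toy_red = svd.fit_transform(X_toy)    # 20 "documents", 3 derived features

# Transforming data is just a projection onto the learned components
X_projected = X_toy @ svd.components_.T
```

Unlike feature selection, none of the three new columns corresponds to a single original feature (for text, a single word).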



In [6]:
from sklearn.decomposition import NMF, PCA, SparsePCA, TruncatedSVD
# SparsePCA is imported only for the optional branch below; it seems ill-behaved, so we avoid it by default
from datetime import datetime
from sklearn.model_selection import cross_val_score, ShuffleSplit

## Data 
X,y = df["Comment"].values,df["Insult"].values
T_train,T_test, y_train,y_test = train_test_split(X,y)

## Reduction params
#k,sparse_pca=500,False
k,sparse_pca=3_000,False
#red = TruncatedSVD(n_components=k)
#red = NMF(n_components=k)

if sparse_pca:
    print("Using sparse PCA for reduction")
    red = SparsePCA(n_components=k)
else:
    print("Using Truncated SVD for reduction")
    red = TruncatedSVD(n_components=k)
# Data 

#TSVD k= 500 a=0.823 p=0.716 r=0.580

######  Text -> TFIDF
vectorizer = text.TfidfVectorizer()
X_train = vectorizer.fit_transform(T_train)
X_test = vectorizer.transform(T_test)

if sparse_pca:
    X_train = X_train.toarray()
    X_test = X_test.toarray()

#########

###### TFIDF -> reduced
print(f"Beginning reduction fit print {datetime.now()} ")
X_train_red = red.fit_transform(X_train)
print(f"Reduction fit completed {datetime.now()} ")
X_test_red = red.transform(X_test)
############

### Reduced =>  Class

svm_clf = LinearSVC()
svm_clf.fit(X_train_red,y_train)
predicted = svm_clf.predict(X_test_red)

### Eval
a,p, r = accuracy_score(y_test, predicted),\
           precision_score(y_test, predicted),\
              recall_score(y_test, predicted)

print(f"{k=:>6_d} {a=:.3f} {p=:.3f} {r=:.3f}")

Using Truncated SVD for reduction
Beginning reduction fit print 2024-04-20 12:35:57.324080 
Reduction fit completed 2024-04-20 12:36:39.164440 
k= 3_000 a=0.830 p=0.750 r=0.600

For comparison, here are the outputs of four runs of the cell above, each on a different random train/test split (the last is the run shown above):


```
Using Truncated SVD for reduction
Beginning reduction fit print 2024-04-20 12:32:33.001275 
Reduction fit completed 2024-04-20 12:33:17.055381 
k= 3_000 a=0.843 p=0.761 r=0.630

Using Truncated SVD for reduction
Beginning reduction fit print 2024-04-20 12:33:51.944394 
Reduction fit completed 2024-04-20 12:34:34.503089 
k= 3_000 a=0.818 p=0.702 r=0.600

Using Truncated SVD for reduction
Beginning reduction fit print 2024-04-20 12:34:54.958394 
Reduction fit completed 2024-04-20 12:35:37.068576 
k= 3_000 a=0.848 p=0.791 r=0.579

Using Truncated SVD for reduction
Beginning reduction fit print 2024-04-20 12:35:57.324080 
Reduction fit completed 2024-04-20 12:36:39.164440 
k= 3_000 a=0.830 p=0.750 r=0.600

```

###  Putting it all together:  The Grid Search

We have two different sets of techniques for reducing the size of document representations
and hopefully improving classifier performance: dimensionality reduction and feature selection.
But which reduction technique should we use, and how many features should we keep?

Answering questions like these is what the scikit-learn grid search package was designed for.

In [None]:
from sklearn.model_selection import RandomizedSearchCV,ShuffleSplit


X,y = df["Comment"].values,df["Insult"].values

pipe = pipeline.Pipeline(
    [
        ("vectorizer", text.TfidfVectorizer()),
        # the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC()),
    ]
)

#N_FEATURES_OPTIONS = [5_000, 3_000, 500]
N_FEATURES_OPTIONS = [500, 3_000]
#C_OPTIONS = np.logspace(0,2,3)
C_OPTIONS = np.logspace(0,1,2)
#DIM_REDUCERS = [TruncatedSVD(), NMF(max_iter=1_000)]
DIM_REDUCERS = [TruncatedSVD()]

from sklearn.base import BaseEstimator, TransformerMixin

class IdentityTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=0):
        self.n_components=n_components
        pass
    
    def fit(self, input_array, y=None):
        return self
    
    def transform(self, input_array, y=None):
        return input_array*1

#DIM_REDUCERS = [TruncatedSVD(), NMF(max_iter=1_000)]
DIM_REDUCERS = [IdentityTransformer(), TruncatedSVD()]

param_grid = [
    {
        "reduce_dim": DIM_REDUCERS,
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
#    {
#        "reduce_dim": [SelectKBest(chi2),SelectKBest(f_classif)],
#        "reduce_dim__k": N_FEATURES_OPTIONS,
#        "classify__C": C_OPTIONS,
#    },
]

# The default is 5-fold cross validation
#cv=3
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
t0 = time.time()
print(datetime.now().strftime("%A %d. %B %Y %H:%M:%S"))

# If a grid search is too costly
#grid = RandomizedSearchCV(
#    estimator=pipe,
#    param_distributions=param_grid,
#    n_iter=10,
#    random_state=0,
#    scoring = "f1",
#    n_jobs=1,
#    verbose=10,
#    cv = cv
#)

from joblib import parallel_backend
grid = GridSearchCV(pipe, param_grid=param_grid, scoring="f1", n_jobs=1, cv=cv, verbose=10)

with parallel_backend('threading'):
    grid.fit(X, y)
print(datetime.now().strftime("%A %d. %B %Y %H:%M:%S"))
secs = time.time() - t0
hrs,mins = secs//3600,(secs%3600)//60
print(f"{hrs} hours {mins} minutes")

Saturday 20. April 2024 12:38:39
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV 1/3; 1/8] START classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=500
[CV 1/3; 1/8] END classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=500;, score=0.656 total time=   0.1s
[CV 2/3; 1/8] START classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=500
[CV 2/3; 1/8] END classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=500;, score=0.667 total time=   0.1s
[CV 3/3; 1/8] START classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=500
[CV 3/3; 1/8] END classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=500;, score=0.660 total time=   0.1s
[CV 1/3; 2/8] START classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=3000
[CV 1/3; 2/8] END classify__C=1.0, reduce_dim=IdentityTransformer(), reduce_dim__n_components=3000;, score=0.656 

In [None]:
datetime.now().strftime("%A %d. %B %Y")

In [12]:
subgrid

{'reduce_dim': [TruncatedSVD()],
 'reduce_dim__n_components': [3000, 500],
 'classify__C': array([ 1., 10.])}

In [16]:
print("Best parameters combination found:")
best_parameters = grid.best_estimator_.get_params()
for subgrid in param_grid:
    for param_name in sorted(subgrid.keys()):
        try:
            print(f"{param_name}: {best_parameters[param_name]}")
        except KeyError:
            print(f"{param_name}: Not found")
            continue

Best parameters combination found:
classify__C: 1.0
reduce_dim: SelectKBest(k=3000)
reduce_dim__n_components: Not found
classify__C: 1.0
reduce_dim: SelectKBest(k=3000)
reduce_dim__k: 3000


In [None]:
import pandas as pd

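# These labels assume the grid searched TruncatedSVD plus the two SelectKBest reducers
# (the commented-out block in param_grid); they must match the reducers actually used.
reducer_labels = ["SVD", "KBest(chi2)","KBest(f_classif)"]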

mean_scores = np.array(grid.cv_results_["mean_test_score"])
# scores are in the order of param_grid iteration, which is alphabetical
mean_scores = mean_scores.reshape(len(C_OPTIONS), -1, len(N_FEATURES_OPTIONS))
# select score for best C
mean_scores = mean_scores.max(axis=0)
# create a dataframe to ease plotting
mean_scores = pd.DataFrame(
    mean_scores.T, index=N_FEATURES_OPTIONS, columns=reducer_labels
)

ax = mean_scores.plot.bar()
ax.set_title("Comparing feature reduction techniques")
ax.set_xlabel("Reduced number of features")
ax.set_ylabel("F-Score")
ax.set_ylim((0, 1))
ax.legend(loc="upper left")

#plt.show()

Discussion questions
-------------------------

1. Call all the features with variable values in the code above
   our **grid features**. And call one particular
   assignment of values to **all** our grid features a **grid point**.  How many
   distinct grid points are there in our grid?
2. How many experiments did the grid search run?  Note this depends 
   both on the number of grid points and on the number of "folds" in the cross validation strategy.
3. What grid points have been left out of our plot? 
4. What is the best combination of features?
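
One way to check your answers to questions 1 and 2 is to let scikit-learn enumerate the grid. The sketch below uses `ParameterGrid` on a stand-in grid with the same shape as the one searched above; the string placeholders for the reducers are an assumption made to keep the example light.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

toy_grid = [
    {
        "reduce_dim": ["identity", "svd"],           # placeholders for the two reducers
        "reduce_dim__n_components": [500, 3_000],
        "classify__C": np.logspace(0, 1, 2),
    }
]

n_points = len(ParameterGrid(toy_grid))  # distinct grid points
n_splits = 3                             # ShuffleSplit(n_splits=3) above
n_fits = n_points * n_splits             # one fit per grid point per split
```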
