## Classification with sklearn


## Loading a dataset

In [1]:
import urllib.request
import os

def download_file(url,local_file, force=False):
    """
    Helper function to download a file and store it locally
    """
    if not os.path.exists(local_file) or force:
        print('Downloading',url,'to',local_file)
        with urllib.request.urlopen(url) as opener, \
             open(local_file, mode='w', encoding='utf-8') as outfile:
                    outfile.write(opener.read().decode('utf-8'))
    else:
        print(local_file,'already downloaded')

In [2]:
train_file = 'news_en_train.txt'
train_url='http://www.esuli.it/demo/data/news_en_train.csv'
test_file = 'news_en_test.txt'
test_url = 'http://www.esuli.it/demo/data/news_en_test.csv'
delimiter = ','

download_file(train_url, train_file)
download_file(test_url, test_file)

Downloading http://www.esuli.it/demo/data/news_en_train.csv to news_en_train.txt
Downloading http://www.esuli.it/demo/data/news_en_test.csv to news_en_test.txt


In [None]:
import csv
x_train = list()
y_train = list()
with open(train_file, encoding='utf-8', newline='') as infile:
    reader = csv.reader(infile, delimiter=delimiter)
    for row in reader:
        x_train.append(row[0])
        y_train.append(row[1])

x_test = list()
y_test = list()
with open(test_file, encoding='utf-8', newline='') as infile:
    reader = csv.reader(infile, delimiter=delimiter)
    for row in reader:
        x_test.append(row[0])
        y_test.append(row[1])
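
One reason to parse the files with the csv module, rather than splitting each line on the delimiter, is that news text often contains commas. The following is a minimal sketch on a made-up two-row sample (the rows are illustrative and are not taken from the dataset), assuming the same text,label column layout used above:

```python
import csv
import io

# Hypothetical sample in the same two-column layout (text, label).
sample = '"Stocks fall, bonds rise",economy\nLocal team wins the cup,sport\n'

# csv.reader keeps the quoted comma inside the first field.
rows = list(csv.reader(io.StringIO(sample), delimiter=','))
texts = [row[0] for row in rows]
labels = [row[1] for row in rows]

print(texts)
print(labels)
```

A plain `line.split(',')` would break the first row into three pieces, putting part of the text where the label is expected.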


In [None]:
len(x_train),len(y_train),len(x_test),len(y_test)

In [None]:
set(y_train)

In [None]:
sample_idx = 10
x_train[sample_idx]

In [None]:
y_train[sample_idx]

# Binary classification

This is a multi-class single-label dataset.
We start with a simpler binary classification problem, e.g., economy vs. not economy.

As the reference (positive) label we arbitrarily pick the label of the example shown in the cell above.

In [None]:
import numpy as np

# numpy implements many useful and powerful vector manipulation tools
# here I'm using it to quickly create a True,False vector corresponding
# to the original values being equal to our label of interest or not
# i.e., binary labels

y_train_bin = np.asarray(y_train)==y_train[sample_idx]
y_test_bin = np.asarray(y_test)==y_train[sample_idx]
y_train_bin,y_test_bin

## Building the pipeline by hand

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

## Tokenization

Try running the following two cells again after removing the min_df parameter, and compare the size of the vocabulary.

In [None]:
vect = CountVectorizer(min_df=5)  # tokenization and frequency count

print('fit')
vect.fit(x_train)
print('transform')
X_train_tok = vect.transform(x_train)
print('done')

# the two steps above can be condensed in a single step that processes train
# data only once.

# print('fit_transform')
# X_train_tok = vect.fit_transform(x_train)
# print('done')

X_test_tok =vect.transform(x_test)

In [None]:
len(vect.vocabulary_)
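
To get a feeling for what min_df does before trying it on the full dataset, here is a small sketch on a made-up three-document corpus (the documents are purely illustrative). With min_df=2 only the tokens that appear in at least two documents are kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
docs = ["the cat sat", "the dog sat", "the bird flew"]

full_vocab = CountVectorizer().fit(docs).vocabulary_
pruned_vocab = CountVectorizer(min_df=2).fit(docs).vocabulary_

print(sorted(full_vocab))    # every distinct token
print(sorted(pruned_vocab))  # only tokens found in at least 2 documents
```

On real data, rare tokens are usually the large majority of the vocabulary, so even a small min_df can shrink the feature space considerably.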

In [None]:
vect.vocabulary_

In [None]:
vect.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

In [None]:
X_train_tok[0,:]

In [None]:
print(X_train_tok[0,:])

Some scikit-learn modules implement an inverse_transform method to reconstruct the input from their output.
Let's print out the feature names for a document. Note that the frequency information is lost.

In [None]:
vect.inverse_transform(X_train_tok[0,:])

Let's attach frequency data to features

In [None]:
for feat,freq in zip(vect.inverse_transform(X_train_tok[0,:])[0],X_train_tok[0,:].data):
  print(feat,freq)

## Feature selection

This is the first step in which we use the labels, because chi2 feature selection is a supervised method.

In [None]:
bin_sel = SelectKBest(chi2, k=5000)  # feature selection
bin_sel.fit(X_train_tok,y_train_bin)
X_train_sel_bin = bin_sel.transform(X_train_tok)
X_test_sel_bin = bin_sel.transform(X_test_tok)
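
To see what the chi2 scoring function measures, the sketch below recomputes it by hand on a toy count matrix and compares the result with sklearn's chi2. The data is made up, and integer 0/1 labels are used here in place of the True/False labels of the notebook. For each feature, the score compares the per-class counts that were observed with those expected if the feature were independent of the class:

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy term-count matrix (4 documents x 3 features) and binary labels.
X = np.array([[3, 0, 1],
              [2, 0, 1],
              [0, 3, 1],
              [0, 2, 1]])
y = np.array([1, 1, 0, 0])

# One-hot class membership: column 0 = negative class, column 1 = positive class.
Y = np.column_stack([1 - y, y]).astype(float)

observed = Y.T @ X                      # per-class total count of each feature
feature_count = X.sum(axis=0)           # overall count of each feature
class_prob = Y.mean(axis=0)             # fraction of documents in each class
expected = np.outer(class_prob, feature_count)

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, p_values = chi2(X, y)

print(manual_scores)
print(sklearn_scores)
```

The third feature occurs equally often in both classes, so its score is zero: it carries no information about the label and would be among the first to be discarded by SelectKBest.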

In [None]:
bin_sel.get_support()

In [None]:
X_train_sel_bin

In [None]:
X_train_sel_bin[0,:]

In [None]:
print(X_train_sel_bin[0,:])

The feature selection module has an inverse_transform method, so we can map the selected features back to the original, larger feature space.

In [None]:
bin_sel.inverse_transform(X_train_sel_bin[0,:])

In [None]:
print(vect.inverse_transform(bin_sel.inverse_transform(X_train_sel_bin[0,:])))

## Weighting

In [None]:
tfidf = TfidfTransformer()  # weighting
tfidf.fit(X_train_sel_bin)
X_train_vec_bin = tfidf.transform(X_train_sel_bin)
X_test_vec_bin =tfidf.transform(X_test_sel_bin)
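
As a rough sketch of what TfidfTransformer computes with its default parameters (smoothed idf and L2 normalization of each document vector), the following cell reproduces the weighting with numpy on a toy count matrix. The matrix is made up for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix: 3 documents x 3 features.
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # document frequency of each feature
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed idf (default smooth_idf=True)
weighted = counts * idf                       # tf * idf
# Default norm='l2': scale every document vector to unit length.
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, reference))
```

Features that appear in many documents get a lower idf, so after weighting they count less than rare features with the same raw frequency.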

In [None]:
print(X_train_vec_bin[0,:])

In [None]:
for feat,weight,freq in zip(vect.inverse_transform(bin_sel.inverse_transform(X_train_vec_bin[0,:]))[0],X_train_vec_bin[0,:].data,X_train_sel_bin[0,:].data):
  print(feat,weight,freq)

## Learning algorithm

In [None]:
svm_bin = LinearSVC()  # linear svm with default parameters
svm_bin_clf = svm_bin.fit(X_train_vec_bin,y_train_bin)
bin_predictions = svm_bin_clf.predict(X_test_vec_bin)

In [None]:
len(bin_predictions)

In [None]:
bin_predictions

## Evaluation of accuracy

In [None]:
correct = 0
for prediction,true_label in zip(bin_predictions, y_test_bin):
    if prediction==true_label:
        correct += 1
print(correct/len(bin_predictions))
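
The loop above can also be written without an explicit loop. A minimal sketch on made-up labels, using either a numpy comparison or sklearn's accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy predictions and true labels, made up for illustration.
y_true = np.array([True, False, True, True])
y_pred = np.array([True, False, False, True])

numpy_accuracy = np.mean(y_pred == y_true)        # fraction of matching entries
sklearn_accuracy = accuracy_score(y_true, y_pred)

print(numpy_accuracy, sklearn_accuracy)
```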

## Using sklearn pipeline object

In [None]:
bin_pipeline = Pipeline([
    ('vect', CountVectorizer()),  # feature extraction
    ('sel', SelectKBest(chi2, k=5000)),  # feature selection
    ('tfidf', TfidfTransformer()),  # weighting
    ('learner', LinearSVC())  # learning algorithm
])

bin_pipeline.fit(x_train,y_train_bin)
bin_predictions = bin_pipeline.predict(x_test)
correct = 0
for prediction,true_label in zip(bin_predictions, y_test_bin):
    if prediction==true_label:
        correct += 1
print(correct/len(bin_predictions))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
print('Classification report:')
print(classification_report(y_test_bin, bin_predictions))
print('Confusion matrix:')
cm = confusion_matrix(y_test_bin, bin_predictions)
print(cm)
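
The numbers in the classification report can be derived from the confusion matrix. The sketch below does this for the positive class on made-up labels (integer 0/1 labels are used here for simplicity), and checks the result against sklearn's own functions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels, made up for illustration: 3 positive and 5 negative examples.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# Rows are true labels, columns are predictions; with labels sorted
# as [0, 1] the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many true positives are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

When the positive class is rare, accuracy alone can look good even if most positive examples are missed, which is why precision and recall are worth checking.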

## Inspecting the pipeline

We can have a look at the parameters of the supervised method of the pipeline to understand how it determines its classification decisions.



In [None]:
tokenizer = bin_pipeline.named_steps['vect']
selector = bin_pipeline.named_steps['sel']
classifier = bin_pipeline.named_steps['learner']

First we look at the feature selection function.
We get the chi^2 score assigned to every feature.

In [None]:
feature_names = tokenizer.get_feature_names_out()
feats_w_score = list()
for index,(selected,score) in enumerate(zip(selector.get_support(),selector.scores_)):
    feats_w_score.append((score,selected,feature_names[index]))
feats_w_score = sorted(feats_w_score)
len(feats_w_score)

These are the 100 least and the 100 most informative features.

In [None]:
feats_w_score[:100],feats_w_score[-100:]

Then we look at the parameters of the linear classification model.
Weights with the highest absolute values are those that contribute the most to the classification decision. Weights close to zero are less important.

In [None]:
feats_w_classifier_weight = list()
for index,weight in enumerate(selector.inverse_transform(classifier.coef_)[0]):
    if weight!=0:
        feats_w_classifier_weight.append((weight,feature_names[index]))
feats_w_classifier_weight = sorted(feats_w_classifier_weight)
len(feats_w_classifier_weight)

These are the features that contribute the most to a positive decision.

In [None]:
feats_w_classifier_weight[-100:]

These are the features that most contribute to a negative decision.

In [None]:
feats_w_classifier_weight[:100]
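
To make the role of these weights concrete, the following sketch shows how a fitted LinearSVC turns them into a decision: the score of a document is the dot product between its feature vector and the weights, plus an intercept, and the sign of the score gives the predicted class. The data is a made-up toy example with 0/1 labels:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: feature 0 is typical of the positive class, feature 1 of the negative one.
X = np.array([[2.0, 0.0], [1.5, 0.2], [0.1, 1.8], [0.0, 2.2]])
y = np.array([1, 1, 0, 0])

clf = LinearSVC().fit(X, y)

# Decision score = dot product of features and weights, plus the intercept.
manual_scores = X @ clf.coef_[0] + clf.intercept_[0]
manual_predictions = (manual_scores > 0).astype(int)

print(np.allclose(manual_scores, clf.decision_function(X)))
print(manual_predictions, clf.predict(X))
```

A feature with a large positive weight pushes the score toward the positive class whenever it occurs, and a feature with a large negative weight pushes it the other way.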

## Testing other classifiers

### Decision tree

In [None]:
dt_bin_pipeline = Pipeline([
    ('vect', CountVectorizer()),  # feature extraction
    ('sel', SelectKBest(chi2, k=5000)),  # feature selection
    ('tfidf', TfidfTransformer()),  # weighting
    ('learner', DecisionTreeClassifier())  # learning algorithm
])

dt_bin_pipeline.fit(x_train,y_train_bin)
bin_predictions = dt_bin_pipeline.predict(x_test)

print('Classification report:')
print(classification_report(y_test_bin, bin_predictions))
print('Confusion matrix:')
cm = confusion_matrix(y_test_bin, bin_predictions)
print(cm)

We can try to visualize the tree, but there are too many dimensions for the structure to be really inspectable (I'm not referring to the font size, but to the number of nodes of the tree!).

Decision tree visualization works well on low-dimensional data (see https://scikit-learn.org/stable/modules/tree.html#classification)

In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(24, 24))
plot_tree(dt_bin_pipeline.named_steps['learner'])
plt.show()
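
One possible way to get something inspectable is to limit the depth of the tree and print it as text. This is only a sketch on a made-up four-document corpus with 0/1 labels; on the real data a shallow tree may of course be less accurate than the unrestricted one:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy corpus and binary labels, made up for illustration.
docs = ["stocks fall", "markets rally", "team wins", "coach resigns"]
labels = [1, 1, 0, 0]

toy_vect = CountVectorizer()
X = toy_vect.fit_transform(docs)

# max_depth caps the number of levels, keeping the tree readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)

# Feature names ordered by their column index in X.
toy_feature_names = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
tree_text = export_text(small_tree, feature_names=toy_feature_names)
print(tree_text)
```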

### Naive Bayes

NB uses a multinomial model based on term frequencies, so we can skip the tfidf module.

In [None]:
nb_bin_pipeline = Pipeline([
    ('vect', CountVectorizer()),  # feature extraction
    ('sel', SelectKBest(chi2, k=5000)),  # feature selection
    ('learner', MultinomialNB())  # learning algorithm
])

nb_bin_pipeline.fit(x_train,y_train_bin)
bin_predictions = nb_bin_pipeline.predict(x_test)

print('Classification report:')
print(classification_report(y_test_bin, bin_predictions))
print('Confusion matrix:')
cm = confusion_matrix(y_test_bin, bin_predictions)
print(cm)

In [None]:
tokenizer = nb_bin_pipeline.named_steps['vect']
selector = nb_bin_pipeline.named_steps['sel']
classifier = nb_bin_pipeline.named_steps['learner']


The NB model stores the log values of priors and likelihoods.

In [None]:
classifier.class_log_prior_,classifier.feature_log_prob_, len(classifier.feature_log_prob_[0])

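In NB a key factor in the decision is the ratio between the likelihoods of a feature under the positive and the negative class.

The next cell uses numpy to perform an element-by-element division of the log probabilities log p(w|class=0) by log p(w|class=1), producing a vector of such ratios. Since log probabilities are negative, a ratio above 1 means that the feature is more likely under the positive class, and a ratio below 1 means that it is more likely under the negative class.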

In [None]:
ratio = classifier.feature_log_prob_[0]/classifier.feature_log_prob_[1]
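
As an alternative to dividing log probabilities, the likelihood ratio itself can be obtained in log space as a difference, since log(a/b) = log(a) - log(b). The sketch below is not what the cell above computes; it shows this alternative on made-up counts with 0/1 labels:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy counts: feature 0 occurs only in positive documents, feature 1 only
# in negative ones, feature 2 equally often in both.
X = np.array([[3, 0, 1],
              [2, 0, 1],
              [0, 3, 1],
              [0, 2, 1]])
y = np.array([1, 1, 0, 0])

nb = MultinomialNB().fit(X, y)   # classes_ is sorted: row 0 = class 0, row 1 = class 1

# log( p(w|class=1) / p(w|class=0) ) = log p(w|class=1) - log p(w|class=0)
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
print(log_ratio)
```

With this formulation the sign is directly readable: positive values favor the positive class, negative values favor the negative class, and values near zero are uninformative.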

In [None]:
feats_w_classifier_weight = list()
feature_names = tokenizer.get_feature_names_out()
for index,weight in enumerate(selector.inverse_transform([ratio])[0]):
    if weight!=0:
        feats_w_classifier_weight.append((weight,feature_names[index]))
feats_w_classifier_weight = sorted(feats_w_classifier_weight)
len(feats_w_classifier_weight)

These are the most relevant features for a positive decision.

In [None]:
feats_w_classifier_weight[-100:][::-1]  # top 100 features, highest ratio first

These are the most relevant features for a negative decision.

In [None]:
feats_w_classifier_weight[:100]

# Multi-class single-label classification

Tokenization does not change from the binary problem, as the dataset is the same.

## Feature selection

Here we use the original multi-class labels.

In [None]:
sel = SelectKBest(chi2, k=5000)  # feature selection
sel.fit(X_train_tok,y_train)
X_train_sel = sel.transform(X_train_tok)
X_test_sel = sel.transform(X_test_tok)

In [None]:
sel.get_support()

In [None]:
X_train_sel

In [None]:
X_train_sel[0,:]

In [None]:
print(X_train_sel[0,:])

The selected features differ from the binary case, as now they have to be informative with respect to a different set of labels.

In [None]:
print(vect.inverse_transform(sel.inverse_transform(X_train_sel[0,:])))

## Weighting

In [None]:
tfidf = TfidfTransformer()  # weighting
tfidf.fit(X_train_sel)
X_train_vec = tfidf.transform(X_train_sel)
X_test_vec =tfidf.transform(X_test_sel)

In [None]:
print(X_train_vec[0,:])

In [None]:
for feat,weight in zip(vect.inverse_transform(sel.inverse_transform(X_train_vec[0,:]))[0],X_train_vec[0,:].data):
  print(feat,weight)

## Learning algorithm

LinearSVC implements multi-class single-label classification using a one-vs-rest approach.

In [None]:
learner = LinearSVC()  # linear svm with default parameters
classifier = learner.fit(X_train_vec,y_train)
predictions = classifier.predict(X_test_vec)
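
A small sketch of what one-vs-rest means in practice: the fitted model holds one weight vector per class, each document gets one score per class, and the predicted label is the class with the highest score. The data and the label names below are made up for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 3-class data: each class is dominated by one feature.
X = np.array([[2.0, 0.1, 0.0], [1.8, 0.0, 0.2],
              [0.1, 2.1, 0.0], [0.0, 1.9, 0.1],
              [0.2, 0.0, 2.0], [0.0, 0.1, 2.2]])
# Hypothetical label names, not necessarily those of the dataset.
y = np.array(["economy", "economy", "sport", "sport", "tech", "tech"])

ovr_clf = LinearSVC().fit(X, y)

print(ovr_clf.coef_.shape)                 # one weight vector per class
scores = ovr_clf.decision_function(X)      # one score per (document, class) pair
manual = ovr_clf.classes_[scores.argmax(axis=1)]
print(np.array_equal(manual, ovr_clf.predict(X)))
```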

In [None]:
len(predictions)

In [None]:
predictions

## Evaluation of accuracy

In [None]:
correct = 0
for prediction,true_label in zip(predictions, y_test):
    if prediction==true_label:
        correct += 1
print(correct/len(predictions))

## Using sklearn pipeline object

In [None]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),  # feature extraction
    ('sel', SelectKBest(chi2, k=5000)),  # feature selection
    ('tfidf', TfidfTransformer()),  # weighting
    ('learner', LinearSVC())  # learning algorithm
])

classifier = pipeline.fit(x_train,y_train)
predictions = classifier.predict(x_test)
correct = 0
for prediction,true_label in zip(predictions, y_test):
    if prediction==true_label:
        correct += 1
print(correct/len(predictions))

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
print('Classification report:')
print(classification_report(y_test, predictions))
print('Confusion matrix:')
cm = confusion_matrix(y_test, predictions)
print(cm)

The scores for the class we used in the binary problem differ from those of the binary classifier we learned earlier, though both models are trained on exactly the same data. Why?

Let's try a linear SVM with a one-vs-one (OvO) model.

LinearSVC does not implement OvO.

We can wrap it into a OneVsOneClassifier, which can be applied to any classifier.

(Note that other classifiers natively implement OvO, e.g., sklearn.svm.SVC)

In [None]:
from sklearn.multiclass import OneVsOneClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer()),  # feature extraction
    ('sel', SelectKBest(chi2, k=5000)),  # feature selection
    ('tfidf', TfidfTransformer()),  # weighting
    ('learner', OneVsOneClassifier(LinearSVC()))  # learning algorithm
])

classifier = pipeline.fit(x_train,y_train)
predictions = classifier.predict(x_test)
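
The difference between the two strategies shows up in the number of binary models that are trained: one per class for one-vs-rest, one per pair of classes for one-vs-one. A minimal sketch on made-up four-class data:

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

# Toy 4-class data: class i is dominated by feature i.
X = np.vstack([np.eye(4) * 2.0, np.eye(4) * 1.5 + 0.1])
y = np.array([0, 1, 2, 3, 0, 1, 2, 3])

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
ovr = LinearSVC().fit(X, y)

n_classes = len(ovo.classes_)
print(len(ovo.estimators_), n_classes * (n_classes - 1) // 2)  # pairwise models
print(ovr.coef_.shape[0], n_classes)                           # one model per class
```

With many classes OvO trains many more models, but each of them sees only the examples of two classes.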

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
print('Classification report:')
print(classification_report(y_test, predictions))
print('Confusion matrix:')
cm = confusion_matrix(y_test, predictions)
print(cm)

# Saving classifiers

Fitted classifiers (both single objects and pipelines), like any scikit-learn object, can be saved and then loaded for later reuse.

NOTE: saving a file on Colab stores it on the temporary virtual machine in the cloud; to get a persistent copy additional code is required, see https://colab.research.google.com/notebooks/io.ipynb

In [None]:
import pickle

In [None]:
with open('news_en_classifier.pkl',mode='bw') as outputfile:
  pickle.dump(pipeline,outputfile)

In [None]:
with open('news_en_classifier.pkl',mode='br') as inputfile:
  pipeline = pickle.load(inputfile)
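
A quick way to check that nothing is lost in the round trip is to compare the predictions of the original and of the restored object. The sketch below does this on a small made-up pipeline, serializing to bytes in memory instead of a file; the documents and label names are illustrative:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data, made up for illustration.
docs = ["stocks fall", "markets rally", "team wins", "coach resigns"]
labels = ["economy", "economy", "sport", "sport"]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('learner', LinearSVC()),
]).fit(docs, labels)

# Serialize to bytes and back, without touching the file system.
restored = pickle.loads(pickle.dumps(toy_pipeline))

new_docs = ["stocks rally", "team resigns"]
same = list(toy_pipeline.predict(new_docs)) == list(restored.predict(new_docs))
print(same)
```

Keep in mind that a pickle should be loaded with the same scikit-learn version that produced it, and that pickle files from untrusted sources should never be loaded, since unpickling can execute arbitrary code.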

In [None]:
pipeline

In [None]:
from google.colab import files

files.download('news_en_classifier.pkl')

In [None]:
files.upload()

In [None]:
with open('news_en_classifier (1).pkl',mode='br') as inputfile:
  pipeline2 = pickle.load(inputfile)
pipeline2