# Exercise Lecture 13: Classification


In this set of exercises, we will train a classifier to assign news articles to one of 5 topics. The dataset consists of 2225 documents in 5 categories: business, entertainment, politics, sport, and technology.


The exercises cover the following points:

* Storing the data in a pandas dataframe and inspecting the data
* Converting the corpus into a tf-idf document-token matrix
* Learning a perceptron model from the data 
* Inspecting the results

## 1. Loading the data

#### Exercise 1.1 Create a pandas dataframe containing the news data

* The data file is in "data/bbc/"
* Use the [load_files](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html) method from sklearn.datasets to load all files 
* load_files returns a dictionary-like object whose keys include "data" (the texts), "target" (the integer labels) and "target_names" (the category names).
* Use [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) method to create a dataframe whose headers are "texts" and "labels" (the text is the data from load_files)
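
The steps above can be tried out on a tiny throwaway corpus before running them on the BBC data. This is only a sketch: the folder names, file names and texts below are made up, and the layout (one sub-folder per category) is the one load_files expects.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_files

# Hypothetical toy corpus: one sub-folder per category, one text file per document.
toy_docs = {
    "sport": ["The team won the match."],
    "business": ["Shares fell sharply.", "Profits rose this quarter."],
}

with tempfile.TemporaryDirectory() as root:
    for category, texts in toy_docs.items():
        folder = Path(root) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")

    # load_files reads every file and derives the integer label from the folder name.
    toy = load_files(root, encoding="utf-8", decode_error="replace", shuffle=False)

toy_df = pd.DataFrame({"texts": toy["data"], "labels": toy["target"]})
print(toy["target_names"])  # folder names, in the order used for the integer labels
print(toy_df)
```

Building the dataframe from a dictionary of columns is equivalent to zipping the two lists together; it just keeps the column names next to the data they describe.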
    


In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [2]:
from sklearn.datasets import load_files

# YOUR CODE HERE

data_path = "data/bbc/"

data = load_files(data_path, encoding="utf-8", decode_error="replace")  

df = pd.DataFrame(list(zip(data['data'],data['target'])), columns=['texts', 'labels'])  

####  Exercise 1.2  - Exploring the data and finding out which label is associated with each domain

* Print out the shape of the dataframe to find out how many BBC news reports there are in the dataframe created in Exercise 1.1
* Print out the dataframe to find out which label (an integer) is associated with which topic.
* Define a dictionary tag_to_ix which maps the integers in the "labels" domain to the corresponding topic in the BBC data (these topics are: Business, Entertainment, Politics, Sport and Technology)
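
A possible shape for the mapping is sketched below. The list of category names is an assumption written out by hand so that the snippet runs on its own; in the notebook it would come from the "target_names" entry returned by load_files. Both directions of the mapping are built, since either can be useful later.

```python
# Assumed category folder names; in practice, read them from data['target_names'].
target_names = ["business", "entertainment", "politics", "sport", "tech"]

# load_files numbers the categories in alphabetical order of their folder names,
# so enumerating the names reproduces the integer labels.
ix_to_tag = dict(enumerate(target_names))
tag_to_ix = {name: ix for ix, name in ix_to_tag.items()}

print(ix_to_tag)
print(tag_to_ix["sport"])
```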

In [3]:
# Print out the shape of the dataframe
print(df.shape) 

 

# We print the different category names and labels. 

print(set(data['target'])) 

categories = pd.DataFrame(list(data['target_names']), columns=['names']) 

display(categories) 

 

# Display a few articles for each label value
# (a cell only shows its last expression automatically, hence display()).

for label in range(5):

    display(df[df['labels'] == label].head())

 

# insert() raises an error if the column already exists, so run this only once per session!

labels = [0,1,2,3,4] 

categories.insert(loc=1, column='labels', value=labels) 

display(categories) 

 

# If you want a merged version (similar to an inner join in a database):

merge = pd.merge(df, categories, on='labels')

merge.drop_duplicates(subset=["labels"], keep="first")[["texts", "names"]].head()

####  Exercise 1.3: Don't forget to shuffle the data 

In [4]:
from sklearn.utils import shuffle 

df = shuffle(df)

df.head() 

## 2.  Vectorizing the input texts

#### Exercise 2.2 

* Extract $X$ and $Y$ from the dataframe
* $Y$ = the category (business, etc.) of each BBC news item
* $X$ = the features used for classification. The features of a news item are the tokens contained in that item. We hope that words can help classify news items into the correct category: business, entertainment, politics, sport or technology.


In [5]:
# Defining X and Y from the dataframe columns created in Exercise 1.1
Y = df['labels']

X = df['texts']

#### Exercise 2.3 Create train and test data

* Use sklearn train_test_split method 
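
A minimal sketch of the split on made-up data is given below. The `stratify` argument is not required by the exercise; it is included as an optional extra that keeps the class proportions similar in both subsets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up toy data: 10 documents, two balanced classes.
texts = [f"document {i}" for i in range(10)]
labels = [0, 1] * 5

# test_size=0.3 keeps 30% of the items for testing; random_state makes the split repeatable.
# stratify is optional (not required by the exercise): it preserves class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(len(X_tr), len(X_te))
print(Counter(y_tr), Counter(y_te))
```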

In [6]:
# Import 'train_test_split'
from sklearn.model_selection import train_test_split

# Shuffle and split the data into training and testing subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

#### Exercise 2.4 Vectorizing the data

Use sklearn [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to turn the news items into a TF-IDF matrix in which each row represents a news item, each column a token, and each cell the tf-idf score of that token in that item.

* Import the TfidfVectorizer method from sklearn
* Create a tf-idf vectorizer. The maximum nb of features should be set to 8000. Set use_idf to True, stop_words to "english" and the tokenizer to nltk.word_tokenize.
* Apply the tfidf_vectorizer.fit_transform method to X_train and the tfidf_vectorizer.transform method to X_test, so that the vocabulary and idf weights are learned from the training data only
* Print out the shape of the training and test data
* Print out the size of the vocabulary (use tfidf_vectorizer.vocabulary_ to extract the vocabulary)
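
The difference between fitting and transforming can be seen on a very small example. This is a sketch on made-up sentences; to stay free of extra dependencies it uses the vectorizer's default tokenizer instead of nltk.word_tokenize.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up mini corpus; the default tokenizer is used to keep the sketch dependency-free.
train_texts = ["the market rose today", "the team won the cup", "shares and markets fell"]
test_texts = ["the cup final is today", "unseen words only"]

vectorizer = TfidfVectorizer(max_features=8000, use_idf=True, stop_words="english")

# fit_transform learns the vocabulary and the idf weights from the training texts only ...
train_matrix = vectorizer.fit_transform(train_texts)
# ... and transform reuses them, so test tokens outside the vocabulary are ignored.
test_matrix = vectorizer.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
print("Vocabulary size:", len(vectorizer.vocabulary_))
```

Both matrices have the same number of columns, because the columns are defined by the training vocabulary. A test document made only of unseen or stop words ends up as an all-zero row.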

In [7]:
import nltk 

from nltk import word_tokenize 

from nltk.corpus import stopwords 

  

# Using the TF-IDF vectorizer to convert words to a vector space

tfidf_vectorizer = TfidfVectorizer(max_features=8000,  

                                   use_idf=True,  

                                   stop_words='english',  

                                   tokenizer=nltk.word_tokenize, 

                                   ngram_range=(1, 3)) 

  

# Fit the vectorizer on the training data, then apply it to the test data

X_train_tf = tfidf_vectorizer.fit_transform(X_train) 

X_test_tf = tfidf_vectorizer.transform(X_test) 

  

# Print the shape of the matrices X_train and X_test and the size of the vocabulary 

print(X_train_tf.shape) 

print(X_test_tf.shape) 

print( "Vocabulary size:", len(tfidf_vectorizer.vocabulary_) ) 

#### Exercise 2.5 - Use the get_feature_names_out method to print out the features

This allows us to spot things that may not be right. E.g., does the vocabulary contain uninformative tokens such as numbers or punctuation signs? We won't work on this here (you have done linguistic preprocessing often enough by now, so it is not that useful as a learning exercise) but it is good practice to look at your vocabulary before launching the machine learning, to avoid learning on noisy data and hence getting poor results from your ML algorithm.

In [8]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features = tfidf_vectorizer.get_feature_names_out()

print(list(features))

## 3. Training a perceptron classifier


#### Exercise 3.1

* Import the [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) module 
* Create an object of the class Perceptron
* Train the model using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron.fit) method
* Test the model using the [predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron.predict) method
* Print out expected values and predictions
* Print out accuracy using [sklearn accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) method
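
The whole train, predict and evaluate cycle can be sketched on a toy corpus as follows. The texts and labels are made up, and `random_state=0` is an added assumption that only serves to make the run repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# Made-up toy data: label 0 = sport, label 1 = business.
train_texts = ["team wins match", "player scores goal", "shares fall sharply", "bank profits rise"]
train_labels = [0, 0, 1, 1]
test_texts = ["team scores goal", "bank shares rise"]
test_labels = [0, 1]

vectorizer = TfidfVectorizer()
X_tr = vectorizer.fit_transform(train_texts)
X_te = vectorizer.transform(test_texts)

# max_iter fixes the number of passes over the data; tol=None disables early stopping.
model = Perceptron(max_iter=5, tol=None, random_state=0)
model.fit(X_tr, train_labels)
predicted = model.predict(X_te)

print("Expected values:", test_labels)
print("Predictions:", list(predicted))
print("Accuracy:", accuracy_score(test_labels, predicted))
```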

In [9]:
from sklearn.linear_model import Perceptron 

from sklearn.metrics import accuracy_score 

  

# Create a Perceptron object 

clf = Perceptron(max_iter=5, tol=None) 

  

# Train the model on the training data 

clf.fit(X_train_tf, Y_train) 

  

# Test the model on the test data 

predictions = clf.predict(X_test_tf) 

  

# Print out the expected values and the predictions 

print('Expected Values:', list(Y_test)) 

print('Predictions:', list(predictions)) 


# Print accuracy 

print( "Accuracy:", accuracy_score(Y_test, predictions)) 

#### Exercise 3.2

* sklearn tfidf_vectorizer creates a vocabulary dictionary {(k,v),} where k is a token and v is an index (integer)
   - Create a dictionary ix_to_tag mapping each index to the corresponding token  and a dictionary tag_to_idx mapping each token to the corresponding index
* The [coef_](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) attribute contains the learned weights for each feature. Its shape is (nb of classes, nb of features).
* Save the feature weights in a dictionary where key = token index, value = weight
* Define a function that derives a sorted list of (tokenIndex, weight) pairs
* For each class, 
   -  get the feature weights for each class
   - Sort the weights
   - Print out the first 6 token:weight pairs (replace token indices by the corresponding token)
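
The steps listed above can be sketched on a three-class toy corpus. Everything in the snippet (texts, class names, `top_n = 3`, `random_state=0`) is made up for illustration; it is one possible way to rank the weights, not the only one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron

# Made-up three-class toy corpus (class names are illustrative).
class_names = ["business", "politics", "sport"]
texts = ["shares profits bank", "bank market shares",
         "election vote minister", "minister party vote",
         "match goal team", "team player goal"]
labels = [0, 0, 1, 1, 2, 2]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = Perceptron(max_iter=5, tol=None, random_state=0).fit(X, labels)

# Invert the token -> column index vocabulary to look tokens up by column.
index_to_token = {index: token for token, index in vectorizer.vocabulary_.items()}

top_n = 3
# One row of coef_ per class; argsort is ascending, so reverse to get the largest weights first.
for class_index, weights in enumerate(model.coef_):
    best = np.argsort(weights)[::-1][:top_n]
    pairs = [(index_to_token[i], round(float(weights[i]), 4)) for i in best]
    print(class_names[class_index], pairs)
```

Note that with only two classes scikit-learn stores a single row in `coef_`, so the per-class loop is meaningful for three or more classes, as in the BBC data.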

In [10]:
# token to index dictionary is already in the tfidf model 

token2idx = tfidf_vectorizer.vocabulary_

# inverse of the dictionary 

idx2token = {v: k for k, v in token2idx.items()} 

  

# clf.coef_ yields matrix with classes as rows and tokens/features as columns 

# we want one weight vector per token, so we iterate over the transpose of this matrix

idx2weight = {i: weight for i, weight in enumerate(clf.coef_.T)} 

# each weight is a vector of size 5 (one value per class), e.g.:

# {0: array([-0.10224227,  0.34115365, -0.23020905, -0.15069405, -0.2137474 ]),
#  1: array([-0.2794715 ,  0.        ,  0.06609557,  0.11862325,  0.06047473]),
#  2: array([ 0.76770035,  0.20680919, -0.54074955, -0.35905334, -0.34783302]),
#  3: array([ 0.60604639, -0.54384953, -0.22093773, -0.51137058, -0.39435784]), ...

 

top_n = 6 

  

# argsort on the clf.coef_ sorts each row (axis=1) increasingly and yields indices instead of the actual values 

argsorted_cls = np.argsort(clf.coef_, axis=1) 

# argsorted_cls: matrix of size C X D (C: number of classes, D: number of features) 

  

# map each class index to its category name (the label order follows data['target_names'])

idx2target = dict(enumerate(data['target_names']))

# we loop over the sorted indices, keeping the row number (representing the class index)

for class_index, sorted_tokens in enumerate(argsorted_cls):

    # using idx2target we can obtain the actual name of the class

    print(f"Class {idx2target[class_index]} ({class_index}) and its top {top_n} tokens:")

     

    # we need to reverse the indices obtained from argsorted_cls to make the order decreasing

    # we are interested in top 6 results 

    for token in sorted_tokens[::-1][:top_n]:  

        # we can use idx2weight to obtain back the token's weight 

        # from this we can check two things:

        #   1) the tokens are really ranked from top 1 to top 6

        #   2) whether the token's largest weight across the 5 classes

        #      is the one for the class in which it is listed

        reformatted_weights = ', '.join([f"{x:.4f}" for x in idx2weight[token].tolist()]) 

        # idx2token allows us to obtain the token's actual name 

        print(f"Token {idx2token[token]} ({token}) has a weight:\n\t[{reformatted_weights}]")

    print() 