### **spacy_text_classification : Exercise**


- In this exercise, you are going to classify whether a given text belongs to one of possible classes ['BUSINESS', 'SPORTS', 'CRIME'].

- you are going to use spacy for pre-processing the text, convert text to numbers and apply different classification algorithms.

In [1]:
#uncomment the below line and run this cell to install the large english model which is trained on wikipedia data

# !python -m spacy download en_core_web_lg

In [1]:
#import spacy and load the language model downloaded
import spacy

nlp = spacy.load('en_core_web_lg')

2022-12-22 16:41:45.080113: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-22 16:41:45.239802: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-12-22 16:41:45.239819: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-12-22 16:41:45.874237: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-

### **About Data: News Category Classifier**

Credits: https://www.kaggle.com/code/hengzheng/news-category-classifier-val-acc-0-65


- This data consists of two columns.
        - Text
        - Category
- Text are the description about a particular topic.
- Category determine which class the text belongs to.
- we have classes mainly of 'BUSINESS', 'SPORTS', 'CRIME' and comes under **Multi-class** classification Problem.

In [2]:
#import pandas library

import pandas as pd

#read the dataset "news_dataset.json" provided and load it into dataframe "df"

df = pd.read_json('word_vector.json')

#print the shape of data
print(df.shape)

#print the top5 rows

df.head(5)

(7500, 2)


Unnamed: 0,text,category
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS
3,This Richard Sherman Interception Literally Sh...,SPORTS
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS


In [3]:
#check the distribution of labels 

df.category.value_counts()

CRIME       2500
SPORTS      2500
BUSINESS    2500
Name: category, dtype: int64

In [4]:
#Add the new column "label_num" which gives a unique number to each of these labels 

df['label'] = df.category.map({
    'CRIME': 0,
    'SPORTS': 1,
    'BUSINESS': 2
})

#check the results with top 5 rows
df.head(5)

Unnamed: 0,text,category,label
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1
3,This Richard Sherman Interception Literally Sh...,SPORTS,1
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2


### **Preprocess the text**

In [5]:
#use this utility function to preprocess the text
#1. Remove the stop words
#2. Convert to base form using lemmatisation

def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return ' '.join(filtered_tokens)

In [6]:
#create a new column "preprocessed_text" which store the clean form of given text [use apply and lambda function]

df['preprocessed_text'] = df.text.apply(preprocess)

In [7]:
#print the top 5 rows
df.head(5)

Unnamed: 0,text,category,label,preprocessed_text
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0,Larry Nassar blame victim say victimize newly ...
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0,woman Beats Cancer die fall horse
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1,Vegas Taxpayers spend Record $ 750 million New...
3,This Richard Sherman Interception Literally Sh...,SPORTS,1,Richard Sherman Interception literally shake W...
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2,7 thing totally kill Weed Legalization Buzz


### **Get the spacy embeddings for each preprocessed text**

In [8]:
#create a new column "vector" that store the vector representation of each pre-processed text

df['vector'] = df.preprocessed_text.apply(lambda x:nlp(x).vector)

In [9]:
#print the top 5 rows
df.head(5)

Unnamed: 0,text,category,label,preprocessed_text,vector
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0,Larry Nassar blame victim say victimize newly ...,"[-0.34700856, 0.03156133, -0.21043956, -0.0024..."
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0,woman Beats Cancer die fall horse,"[-0.23481816, 0.35125497, 0.011834003, -0.0055..."
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1,Vegas Taxpayers spend Record $ 750 million New...,"[0.053351898, 0.08053064, -0.05101806, -0.1991..."
3,This Richard Sherman Interception Literally Sh...,SPORTS,1,Richard Sherman Interception literally shake W...,"[-0.038867258, 0.28459162, 0.071352966, -0.045..."
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2,7 thing totally kill Weed Legalization Buzz,"[-0.20180944, 0.11867001, 0.0036708585, -0.189..."


**Train-Test splitting**

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.vector, 
                                                 df.label, 
                                                 test_size = 0.2, 
                                                 random_state = 2022)


**Reshape the X_train and X_test so as to fit for models**

In [13]:
# import numpy as np

import numpy as np

#reshapes the X_train and X_test using 'stack' function of numpy. Store the result in new variables "X_train_2d" and "X_test_2d"

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

**Attempt 1:**


- use spacy glove embeddings for text vectorization.

- use Decision Tree as the classifier.

- print the classification report.

In [14]:
X_train_2d

array([[-0.12078159,  0.1251742 ,  0.24247845, ..., -0.23317586,
        -0.12946762,  0.14674115],
       [-0.08704231,  0.15871203, -0.09212321, ..., -0.13176163,
         0.10281384,  0.08144095],
       [-0.42832655,  0.23338538,  0.1606854 , ..., -0.3532798 ,
         0.1205477 , -0.01483208],
       ...,
       [-0.166941  , -0.01782378,  0.076621  , ..., -0.14866823,
         0.02668255,  0.17461611],
       [-0.17725699,  0.09010557,  0.04612279, ...,  0.09186578,
        -0.03186679, -0.02291214],
       [-0.14895883,  0.1750557 , -0.0813623 , ...,  0.10252918,
         0.09775136,  0.05753722]], dtype=float32)

In [17]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report

# scaler = MinMaxScaler()
# all_train_embed = scaler.fit_transform(X_train_2d)
# all_test_embed = scaler.fit_transform(X_test_2d)

#1. creating a Decision Tree model object

model = DecisionTreeClassifier()

#2. fit with all_train_embeddings and y_train

model.fit(X_train_2d, y_train)

#3. get the predictions for all_test_embeddings and store it in y_pred

y_pred = model.predict(X_test_2d)

#4. print the classfication report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.76      0.75      0.76       491
           1       0.74      0.70      0.72       506
           2       0.72      0.76      0.74       503

    accuracy                           0.74      1500
   macro avg       0.74      0.74      0.74      1500
weighted avg       0.74      0.74      0.74      1500



**Attempt 2:**


- use spacy glove embeddings for text vectorization.
- use MultinomialNB as the classifier after applying the MinMaxscaler.
- print the classification report.

In [18]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report



#doing scaling because Negative values will not pass into Naive Bayes models

scaler = MinMaxScaler()
all_train_embed = scaler.fit_transform(X_train_2d)
all_test_embed = scaler.fit_transform(X_test_2d)

#1. creating a MultinomialNB model object 

model = MultinomialNB()

#2. fit with all_train_embeddings(scaled) and y_train

model.fit(all_train_embed, y_train)

#3. get the predictions for all_test_embeddings and store it in y_pred

y_pred = model.predict(all_test_embed)

#4. print the classfication report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.92      0.84      0.88       532
           1       0.79      0.96      0.87       392
           2       0.91      0.85      0.88       576

    accuracy                           0.87      1500
   macro avg       0.87      0.88      0.87      1500
weighted avg       0.88      0.87      0.88      1500



**Attempt 3:**


- use spacy glove embeddings for text vectorization.
- use KNeighborsClassifier as the classifier after applying the MinMaxscaler.
- print the classification report.

In [21]:
from  sklearn.neighbors import KNeighborsClassifier


#1. creating a KNN model object

model = KNeighborsClassifier()

#2. fit with all_train_embeddings and y_train

model.fit(X_train_2d, y_train)

#3. get the predictions for all_test_embeddings and store it in y_pred

y_pred = model.predict(X_test_2d)

#4. print the classfication report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.95      0.85      0.90       541
           1       0.88      0.93      0.90       453
           2       0.87      0.92      0.90       506

    accuracy                           0.90      1500
   macro avg       0.90      0.90      0.90      1500
weighted avg       0.90      0.90      0.90      1500



**Attempt 4:**


- use spacy glove embeddings for text vectorization.
- use RandomForestClassifier as the classifier after applying the MinMaxscaler.
- print the classification report.

In [22]:
from sklearn.ensemble import RandomForestClassifier


#1. creating a Random Forest model object

model = RandomForestClassifier()

#2. fit with all_train_embeddings and y_train

model.fit(all_train_embed, y_train)

#3. get the predictions for all_test_embeddings and store it in y_pred

y_pred = model.predict(all_test_embed)

#4. print the classfication report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.91      0.85      0.88       522
           1       0.83      0.92      0.88       433
           2       0.88      0.86      0.87       545

    accuracy                           0.87      1500
   macro avg       0.87      0.88      0.87      1500
weighted avg       0.88      0.87      0.87      1500



**Attempt 5:**


- use spacy glove embeddings for text vectorization.
- use GradientBoostingClassifier as the classifier after applying the MinMaxscaler.
- print the classification report.

In [23]:
from sklearn.ensemble import GradientBoostingClassifier


#1. creating a GradientBoosting model object

model = GradientBoostingClassifier()

#2. fit with all_train_embeddings and y_train

model.fit(all_train_embed, y_train)

#3. get the predictions for all_test_embeddings and store it in y_pred

y_pred = model.predict(all_test_embed)

#4. print the classfication report
print(classification_report(y_pred, y_test))

              precision    recall  f1-score   support

           0       0.91      0.87      0.89       509
           1       0.81      0.95      0.87       405
           2       0.93      0.85      0.89       586

    accuracy                           0.88      1500
   macro avg       0.88      0.89      0.88      1500
weighted avg       0.89      0.88      0.89      1500



**Print the confusion Matrix with the best model got**

In [24]:
#finally print the confusion matrix for the best model: GradientBoostingClassifier

# from sklearn.metrics import confusion_matrix


import gensim


## [**Solution**](./spacy_word_embeddings_solution.ipynb)