### **spacy_text_classification : Exercise**


- In this exercise, you are going to classify whether a given text belongs to one of possible classes ['BUSINESS', 'SPORTS', 'CRIME'].

- you are going to use spacy for pre-processing the text, convert text to numbers and apply different classification algorithms.

In [None]:
#uncomment the below line and run this cell to install the large english model which is trained on wikipedia data

#python -m spacy download en_core_web_lg

In [1]:
#import spacy and load the language model downloaded
import spacy
nlp = spacy.load('en_core_web_lg')

### **About Data: News Category Classifier**

Credits: https://www.kaggle.com/code/hengzheng/news-category-classifier-val-acc-0-65


- This data consists of two columns.
        - Text
        - Category
- Text are the description about a particular topic.
- Category determine which class the text belongs to.
- we have classes mainly of 'BUSINESS', 'SPORTS', 'CRIME' and comes under **Multi-class** classification Problem.

In [2]:
#import pandas library
import pandas as pd


#read the dataset "news_dataset.json" provided and load it into dataframe "df"
df = pd.read_json('news_dataset.json')


#print the shape of data

print(df.shape)
#print the top5 rows

df.head(5)

(7500, 2)


Unnamed: 0,text,category
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS
3,This Richard Sherman Interception Literally Sh...,SPORTS
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS


In [3]:
#check the distribution of labels 
df.category.value_counts()


category
CRIME       2500
SPORTS      2500
BUSINESS    2500
Name: count, dtype: int64

In [4]:
#Add the new column "label_num" which gives a unique number to each of these labels 
df['label_num'] = df.category.map({'CRIME' : 0,'SPORTS' : 1,'BUSINESS' : 2})
#check the results with top 5 rows
df.head(5)

Unnamed: 0,text,category,label_num
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1
3,This Richard Sherman Interception Literally Sh...,SPORTS,1
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2


### **Preprocess the text**

In [5]:
#use this utility function to preprocess the text
#1. Remove the stop words
#2. Convert to base form using lemmatisation

def preprocess(text):
    doc = nlp(text)
    filtered_tokens = []
    for token in doc:
        if token.is_stop or token.is_punct:
            continue
        filtered_tokens.append(token.lemma_)
    return ' '.join(filtered_tokens)

In [6]:
#create a new column "preprocessed_text" which store the clean form of given text [use apply and lambda function]
df['preprocessed_text'] = df.text.apply(preprocess)


In [7]:
#print the top 5 rows
df.head(5)
print(df.shape)

(7500, 4)


### **Get the spacy embeddings for each preprocessed text**

In [8]:
#create a new column "vector" that store the vector representation of each pre-processed text
df['vector'] = df['preprocessed_text'].apply(lambda x : nlp(x).vector)

In [9]:
#print the top 5 rows
df.head(5)

Unnamed: 0,text,category,label_num,preprocessed_text,vector
0,"Larry Nassar Blames His Victims, Says He 'Was ...",CRIME,0,Larry Nassar blame victim say victimize newly ...,"[-0.5585511, -0.29323253, -0.9253956, 0.189389..."
1,"Woman Beats Cancer, Dies Falling From Horse",CRIME,0,woman Beats Cancer die fall horse,"[-0.73039824, -0.43196002, -1.2930516, -1.0628..."
2,Vegas Taxpayers Could Spend A Record $750 Mill...,SPORTS,1,vegas taxpayer spend Record $ 750 million New ...,"[-1.9413117, 0.121578515, -3.2996283, 1.511650..."
3,This Richard Sherman Interception Literally Sh...,SPORTS,1,Richard Sherman Interception literally shake W...,"[-1.4702771, -0.685319, 0.57398, -0.31135806, ..."
4,7 Things That Could Totally Kill Weed Legaliza...,BUSINESS,2,7 thing totally kill Weed Legalization Buzz,"[-1.037173, -1.9495698, -1.7179357, 1.2975286,..."


**Train-Test splitting**

In [10]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(df.vector.values,df['label_num'],test_size = 0.2,random_state = 101)

In [11]:
X_train

array([array([-0.5318781 , -0.599398  ,  0.3271672 ,  1.2804395 ,  1.090432  ,
              -0.4792144 , -0.9844051 ,  1.3348907 , -0.98543066, -0.68468547,
               2.5758688 ,  1.3856196 , -2.0308526 ,  1.0337145 , -1.8755577 ,
               1.5267786 ,  1.8376709 ,  0.80882347, -1.1843305 ,  1.671786  ,
               0.5867218 ,  0.689457  , -1.0907533 ,  0.44135466,  1.058478  ,
              -1.3017744 , -1.4896494 ,  0.36513215, -0.6569299 , -0.31712592,
               0.41081   ,  0.9200926 ,  0.57198703,  0.7181503 , -1.4927808 ,
              -1.6230602 ,  0.9042353 ,  0.09339035, -1.0925941 , -0.4859365 ,
              -0.7622355 ,  0.19805318, -0.6372656 ,  0.24390484, -1.1115562 ,
               0.6806727 ,  1.2991282 , -1.9580548 ,  0.8642943 , -1.128616  ,
              -0.5287813 ,  0.92761236,  0.36389628, -1.6911044 ,  1.1627318 ,
               0.03679959, -1.6881634 ,  0.35766616, -0.92373747, -1.6270107 ,
               1.0956442 ,  0.72303075, -0.62215835,

In [12]:
df.text[6262]

'Women in Business: Stacy Simpson, Chief Communications Officer, SapientNitro Stacy Simpson is the Chief Communications Officer for SapientNitro, part of Publicis Sapient. In this capacity, she oversees global communications, strategic and brand marketing strategies for SapientNitro, as well as the global corporate communications strategy for Sapient.'

In [14]:
X_train_2d

array([[-0.5318781 , -0.599398  ,  0.3271672 , ..., -1.3495964 ,
         0.6478096 ,  2.827019  ],
       [-1.0653182 ,  0.33839583, -0.94524   , ..., -1.0792093 ,
        -0.2966691 ,  0.90708065],
       [-0.7502081 , -0.88332623, -0.13070875, ..., -2.0946796 ,
         1.1505525 ,  0.01130641],
       ...,
       [-0.4052488 ,  0.9787283 , -2.896716  , ..., -1.306215  ,
        -2.0445228 ,  0.46358013],
       [-2.0199819 , -0.3826296 , -1.6088997 , ..., -0.32391703,
         1.4019929 ,  1.1902901 ],
       [-0.09262099,  0.8324502 , -1.875309  , ...,  0.69715655,
        -2.102335  ,  0.702706  ]], dtype=float32)

**Reshape the X_train and X_test so as to fit for models**

In [13]:
# import numpy as np
import numpy as np


#reshapes the X_train and X_test using 'stack' function of numpy. Store the result in new variables "X_train_2d" and "X_test_2d"
X_train_2d=np.stack(X_train)
X_test_2d=np.stack(X_test)

**Attempt 1:**


- use spacy glove embeddings for text vectorization.

- use Decision Tree as the classifier.

- print the classification report.

In [15]:
from sklearn.tree import DecisionTreeClassifier
#import CountVectorizer, RandomForest, pipeline, classification_report from sklearn 

from sklearn.metrics import classification_report


dtc = DecisionTreeClassifier()

#2. fit with X_train and y_train
dtc.fit(X_train_2d,y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = dtc.predict(X_test_2d)

#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.72      0.73      0.73       493
           1       0.74      0.72      0.73       510
           2       0.74      0.76      0.75       497

    accuracy                           0.73      1500
   macro avg       0.73      0.73      0.73      1500
weighted avg       0.73      0.73      0.73      1500



**Attempt 2:**


- use spacy glove embeddings for text vectorization.
- use MultinomialNB as the classifier after applying the MinMaxscaler.
- print the classification report.

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report



#doing scaling because Negative values will not pass into Naive Bayes models

# Initialize MinMaxScaler
scaler = MinMaxScaler()

# Fit scaler to the data and transform the data
scaled_X_train2d = scaler.fit_transform(X_train_2d)
scaled_X_test2d = scaler.fit_transform(X_test_2d)
#1. creating a MultinomialNB model object 

mnb = MultinomialNB()

#2. fit with all_train_embeddings(scaled) and y_train

mnb.fit(scaled_X_train2d,y_train)
#3. get the predictions for all_test_embeddings and store it in y_pred

y_pred = mnb.predict(scaled_X_test2d)

#4. print the classfication report
print(classification_report(y_test, y_pred))



              precision    recall  f1-score   support

           0       0.86      0.88      0.87       493
           1       0.82      0.84      0.83       510
           2       0.85      0.81      0.83       497

    accuracy                           0.84      1500
   macro avg       0.85      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500



**Attempt 3:**


- use spacy glove embeddings for text vectorization.
- use KNeighborsClassifier as the classifier after applying the MinMaxscaler.
- print the classification report.

In [18]:
from  sklearn.neighbors import KNeighborsClassifier


#1. creating a KNN model object

knn = KNeighborsClassifier()

#2. fit with all_train_embeddings and y_train

knn.fit(scaled_X_train2d,y_train)

#3. get the predictions for all_test_embeddings and store it in y_pred
y_pred = knn.predict(scaled_X_test2d)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.90      0.84       493
           1       0.88      0.82      0.85       510
           2       0.89      0.84      0.86       497

    accuracy                           0.85      1500
   macro avg       0.86      0.85      0.85      1500
weighted avg       0.86      0.85      0.85      1500



**Attempt 4:**


- use spacy glove embeddings for text vectorization.
- use RandomForestClassifier as the classifier after applying the MinMaxscaler.
- print the classification report.

In [19]:
from sklearn.ensemble import RandomForestClassifier


#1. creating a Random Forest model object

rfc = RandomForestClassifier()

#2. fit with all_train_embeddings and y_train
rfc.fit(scaled_X_train2d,y_train)


#3. get the predictions for all_test_embeddings and store it in y_pred
y_pred = rfc.predict(scaled_X_test2d)


#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.82      0.85       493
           1       0.76      0.89      0.82       510
           2       0.90      0.78      0.84       497

    accuracy                           0.83      1500
   macro avg       0.84      0.83      0.83      1500
weighted avg       0.84      0.83      0.83      1500



**Attempt 5:**


- use spacy glove embeddings for text vectorization.
- use GradientBoostingClassifier as the classifier after applying the MinMaxscaler.
- print the classification report.

In [20]:
from sklearn.ensemble import GradientBoostingClassifier


#1. creating a GradientBoosting model object

gbc = GradientBoostingClassifier()

#2. fit with all_train_embeddings and y_train

gbc.fit(scaled_X_train2d,y_train)

#3. get the predictions for all_test_embeddings and store it in y_pred

y_pred = gbc.predict(scaled_X_test2d)

#4. print the classfication report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.89      0.87       493
           1       0.84      0.91      0.88       510
           2       0.93      0.82      0.88       497

    accuracy                           0.87      1500
   macro avg       0.88      0.87      0.87      1500
weighted avg       0.88      0.87      0.87      1500



**Print the confusion Matrix with the best model got**

In [22]:
#finally print the confusion matrix for the best model: GradientBoostingClassifier

from sklearn.metrics import confusion_matrix
cm =confusion_matrix(y_test, y_pred)
cm




array([[438,  41,  14],
       [ 32, 463,  15],
       [ 43,  44, 410]])