### NLP Using spaCy


#### 1. Import Relevant Libraries

In [1]:
#loading relevant libraries
import numpy as np
import re
import pickle
import nltk
from nltk.corpus import stopwords
from sklearn.datasets import load_files
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\HP
[nltk_data]     OMEN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### 2. Loading training data

In [2]:
import pandas as pd
df = pd.read_csv("Dataset-SA/Dataset-SA.csv")

In [3]:
df.head()

Unnamed: 0,product_name,product_price,Rate,Review,Summary,Sentiment
0,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,super!,great cooler excellent air flow and for this p...,positive
1,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,best budget 2 fit cooler nice cooling,positive
2,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,the quality is good but the power of air is de...,positive
3,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,1,useless product,very bad product its a only a fan,negative
4,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,ok ok product,neutral


#### 3. Data Preparation & Cleaning

We are interested in the review text (the `Summary` column) and the `Sentiment` column. The sentiment labels need to be encoded as numbers, and for this project we only consider the positive and negative classes.

In [5]:
#let's check for balance of our target values
df['Sentiment'].value_counts()

positive    166581
negative     28232
neutral      10239
Name: Sentiment, dtype: int64

The output above shows that positive reviews (166,581) far outnumber negative (28,232) and neutral (10,239) reviews combined. For this project we drop the neutral reviews and keep an equal number of positive and negative reviews (28,232 each) for our training data.

In [6]:
positive = df[df['Sentiment'] == "positive"][0:28232]

In [7]:
negative = df[df['Sentiment'] == "negative"]

In [8]:
#joining the positive and negative review data
data = pd.concat([positive, negative], axis=0)
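
Taking the first 28,232 positive rows keeps whatever order the CSV file has, and the rows shown by `df.head()` all belong to one product, so the positive sample may over-represent some products. A possible alternative, sketched below on a toy frame, is to draw the majority class at random. The helper name `balance_binary`, the seed and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def balance_binary(df, label_col, pos_label, neg_label, seed=42):
    """Downsample the larger class so both classes have the same number of rows."""
    pos = df[df[label_col] == pos_label]
    neg = df[df[label_col] == neg_label]
    n = min(len(pos), len(neg))
    # sample() draws rows at random instead of taking the first n in file order
    pos = pos.sample(n=n, random_state=seed)
    neg = neg.sample(n=n, random_state=seed)
    return pd.concat([pos, neg], axis=0)

# Toy frame standing in for the review data (hypothetical values)
toy = pd.DataFrame({
    "Summary": ["good", "great", "nice", "fine", "bad", "awful", "meh"],
    "Sentiment": ["positive"] * 4 + ["negative"] * 2 + ["neutral"],
})
balanced = balance_binary(toy, "Sentiment", "positive", "negative")
print(balanced["Sentiment"].value_counts())
```

Rows with any other label (here `neutral`) are dropped, which matches the decision to ignore neutral reviews.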

In [9]:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56464 entries, 0 to 205032
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_name   56464 non-null  object
 1   product_price  56464 non-null  object
 2   Rate           56464 non-null  object
 3   Review         52633 non-null  object
 4   Summary        56462 non-null  object
 5   Sentiment      56464 non-null  object
dtypes: object(6)
memory usage: 3.0+ MB


Machine learning models work with numbers, so we map the sentiment column to 1 for positive reviews and 0 for negative reviews.

In [10]:
data['target'] = data['Sentiment'].map({'positive':1, 'negative':0})

In [11]:
data.head()

Unnamed: 0,product_name,product_price,Rate,Review,Summary,Sentiment,target
0,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,super!,great cooler excellent air flow and for this p...,positive,1
1,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,best budget 2 fit cooler nice cooling,positive,1
2,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,the quality is good but the power of air is de...,positive,1
5,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,the cooler is really fantastic and provides go...,positive,1
6,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,highly recommended,very good product,positive,1


In [12]:
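#as stated earlier we only need our Summary and target features
#.copy() gives an independent DataFrame, which avoids a possible SettingWithCopyWarning when columns are added later
data = data[['Summary', 'target']].copy()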

In [13]:
data['target'].value_counts()

1    28232
0    28232
Name: target, dtype: int64

#### 4. Installing Our spaCy vectorizer & Vectorizing our text

In [None]:
!pip install spacy
#installing the spacy package in our virtual environment 

In [14]:
import spacy

In [None]:
!python -m spacy download en_core_web_lg
#downloading the large English spaCy model, which includes word vectors

In [15]:
#initializing spacy and loading a large model that has word vectors to convert our text into vectors
vectorizer = spacy.load("en_core_web_lg")


In [16]:
doc = vectorizer("I love you")
doc.vector

array([ 5.4856664e-01,  7.5831342e-01, -6.2819333e+00, -5.6275001e+00,
       -8.2746401e+00,  5.3333282e-02,  9.5883322e-01,  2.7246034e+00,
       -7.3242664e+00,  7.4066663e+00,  7.7700334e+00,  2.1358335e+00,
       -6.4710402e+00, -3.2076669e-01,  4.9224668e+00, -5.1614666e+00,
        3.8421535e+00, -8.5877666e+00, -2.6871998e+00,  7.6546663e-01,
        2.9816034e+00,  2.8566668e+00,  7.3543328e-01, -8.5770006e+00,
       -4.8483336e-01, -2.2024300e+00, -1.3333334e+00,  3.5360000e+00,
       -1.6272532e+00,  9.8620337e-01,  4.0916666e-01, -5.2569666e+00,
       -5.1800013e-02,  2.2076566e+00,  5.2794003e+00, -4.9555001e-01,
       -1.8044467e+00, -2.5639334e-01,  6.3855000e+00,  9.8912269e-01,
       -2.0281901e+00,  5.3624001e+00,  2.2340868e+00, -3.1247399e+00,
        3.0864766e+00,  4.1481667e+00, -3.0569334e+00, -5.6166000e+00,
        1.3551198e+00,  2.6157568e+00,  1.2546000e+00, -2.6840966e+00,
        3.9365670e-01, -3.5257666e+00, -5.1567836e+00, -1.6843634e+00,
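
By default spaCy computes `doc.vector` as the average of the vectors of the tokens in the document, so every review is reduced to one 300-dimensional vector regardless of its length. A minimal NumPy sketch of that averaging, using made-up 4-dimensional token vectors in place of the real model vectors:

```python
import numpy as np

# Made-up 4-dimensional vectors for three tokens (the real model uses 300 dimensions)
token_vectors = {
    "I":    np.array([1.0, 0.0, 2.0, -1.0], dtype=np.float32),
    "love": np.array([3.0, 2.0, 0.0,  1.0], dtype=np.float32),
    "you":  np.array([2.0, 4.0, 1.0,  0.0], dtype=np.float32),
}

def average_vector(tokens, vectors):
    """Average the token vectors into a single fixed-length document vector."""
    return np.mean([vectors[t] for t in tokens], axis=0)

doc_vector = average_vector(["I", "love", "you"], token_vectors)
print(doc_vector)        # one vector, same length as each token vector
print(doc_vector.shape)  # (4,)
```

Because of the averaging, word order is lost: two reviews containing the same words in a different order map to the same vector.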
      


We remove all digits from the review text before converting it to vectors.

In [17]:
reviews = []

for review in data['Summary']:
    review = re.sub(r'\d','',str(review))
    reviews.append(review)

In [19]:
#incorporating our cleaned text into our dataframe
data['reviews_cleaned'] = reviews
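
The same cleaning can be written without an explicit loop by using the pandas string methods. This is a sketch on a toy Series, assuming the same rule as above (drop every digit). Two details differ from the loop and are choices made here, not the notebook's: missing summaries become empty strings rather than the text `nan`, and the double spaces left behind by removed digits are collapsed.

```python
import pandas as pd

# Toy Series standing in for data['Summary'] (hypothetical values)
summary = pd.Series(["best budget 2 fit cooler", "5 star product", None])

cleaned = (
    summary.fillna("")                             # missing summaries become empty strings
           .str.replace(r"\d", "", regex=True)     # drop every digit, as in the loop above
           .str.replace(r"\s+", " ", regex=True)   # collapse the gaps the digits leave behind
           .str.strip()
)
print(cleaned.tolist())
```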

In [20]:
#vectorizing our cleaned text
data['rev_vector'] = data['reviews_cleaned'].apply(lambda x: vectorizer(x).vector)

Vectorizing the reviews took a while, since the full spaCy pipeline runs on each of the 56,464 reviews.
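
One likely reason for the slow run is that `apply` calls the pipeline on one review at a time. spaCy pipelines expose a `pipe` method that processes texts in batches; the helper below is a sketch built on it. The function name `vectorize_texts` and the batch size are assumptions, and any speed-up should be measured rather than taken for granted.

```python
import numpy as np

def vectorize_texts(nlp, texts, batch_size=256):
    """Return a 2-D array with one document vector per text.

    `nlp` is a loaded spaCy pipeline (for example the `vectorizer` above);
    `nlp.pipe` streams the texts through the pipeline in batches.
    """
    vectors = [doc.vector for doc in nlp.pipe(texts, batch_size=batch_size)]
    return np.stack(vectors)

# Usage with the objects defined in this notebook (not executed here):
# X = vectorize_texts(vectorizer, data['reviews_cleaned'].tolist())
```

The returned array already has shape `(n_texts, n_dimensions)`, so it can be passed to scikit-learn without further conversion.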

In [21]:
data.head()

Unnamed: 0,Summary,target,reviews_cleaned,rev_vector
0,great cooler excellent air flow and for this p...,1,great cooler excellent air flow and for this p...,"[-0.54611576, 0.3405574, -4.4681125, -0.504579..."
1,best budget 2 fit cooler nice cooling,1,best budget fit cooler nice cooling,"[-1.8262001, 1.5094519, -5.0645423, -0.1739227..."
2,the quality is good but the power of air is de...,1,the quality is good but the power of air is de...,"[-1.9831436, 1.3934054, -1.6227155, 1.8680637,..."
5,the cooler is really fantastic and provides go...,1,the cooler is really fantastic and provides go...,"[-0.8724725, 0.3050531, -2.2475398, 0.5829884,..."
6,very good product,1,very good product,"[-0.7020133, -1.0288533, -1.9058, -1.28308, 3...."


In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.rev_vector.values, data.target, test_size=0.2, random_state=20, shuffle=True)

In [24]:
X_train

array([array([-9.1961497e-01,  3.6443356e-02, -3.7814999e+00,  1.5567299e+00,
               2.4409406e+00, -1.1354750e+00, -1.6675831e-01,  2.4436133e+00,
              -1.5548867e+00, -7.1410507e-01,  5.3053837e+00,  7.5602335e-01,
              -3.5840166e+00,  1.9001135e+00,  1.1566831e+00,  4.2606637e-02,
               5.5920500e-01, -4.6051118e-01,  1.5931234e+00, -1.2741717e+00,
              -1.4499824e-04,  1.0891050e+00,  1.4009150e+00, -1.4086499e+00,
               4.1975832e-01, -1.3474833e+00, -3.1877499e+00, -1.3787133e+00,
              -1.2353767e+00,  1.9199901e+00,  1.4492531e+00, -1.9669000e+00,
              -6.2577671e-01, -1.8484601e+00,  1.9008499e+00, -9.6329993e-01,
              -3.8371840e-01,  6.2569326e-01,  8.3566666e-02,  1.8001183e+00,
               1.5576665e-01, -5.2661818e-02,  4.1774082e+00, -9.8012000e-01,
               2.7623335e-01,  2.2943833e+00,  1.2464992e-01, -1.2783790e-01,
              -8.9134330e-01,  1.0804365e+00,  1.3411299e+00, -3

In [25]:
#X_train is a 1-D array whose elements are 300-dimensional vectors, not a 2-D NumPy array, so we stack it into shape (n_samples, 300) before training our models

X_train_2d = np.stack(X_train)
X_test_2d = np.stack(X_test)

X_train_2d

array([[-0.919615  ,  0.03644336, -3.7814999 , ..., -0.360715  ,
        -2.93429   ,  0.04947497],
       [-0.72953326,  2.4088235 , -2.6129668 , ...,  2.3154333 ,
        -1.2391833 ,  7.5864334 ],
       [ 0.07358   , -0.88712996, -2.3459    , ...,  2.4633    ,
        -6.0582    ,  4.4717503 ],
       ...,
       [-1.3278867 , -1.1612867 , -2.0342333 , ...,  1.7158667 ,
        -3.3866832 ,  5.1788    ],
       [-1.545145  , -0.570495  , -0.81775   , ..., -1.8491999 ,
        -0.57046497,  0.52043504],
       [-1.1637334 ,  1.4053568 , -0.77704   , ..., -0.47794333,
        -3.104623  ,  6.2532334 ]], dtype=float32)
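
The reason for the conversion: `data.rev_vector.values` is a one-dimensional array of dtype `object` whose elements are themselves arrays, whereas scikit-learn estimators expect a two-dimensional numeric array of shape `(n_samples, n_features)`. A small sketch with made-up 3-dimensional vectors:

```python
import numpy as np

# Build a 1-D object array holding three separate vectors,
# similar to what a DataFrame column of arrays returns from .values
nested = np.empty(3, dtype=object)
for i in range(3):
    nested[i] = np.arange(3, dtype=np.float32) + i

print(nested.shape, nested.dtype)    # (3,) object

# np.stack joins the inner arrays along a new first axis
stacked = np.stack(nested)
print(stacked.shape, stacked.dtype)  # (3, 3) float32
```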

#### 5. Model Building

In [27]:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

In [28]:
#MultinomialNB does not accept negative feature values, so we rescale the data with MinMaxScaler (fitted on the training set only)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_2d)
X_test_scaled = scaler.transform(X_test_2d)
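
`MultinomialNB` is designed for count-like, non-negative features and refuses negative inputs when fitting, which is why the word vectors are rescaled first. The sketch below shows both behaviours on random data; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))      # contains negative values, like word vectors
y_demo = np.array([0, 1] * 10)

try:
    MultinomialNB().fit(X_demo, y_demo)
    raised = False
except ValueError as err:
    raised = True
    print("MultinomialNB rejected the raw data:", err)

# After min-max scaling every training feature lies in [0, 1], so fitting succeeds
demo_scaler = MinMaxScaler()
X_demo_scaled = demo_scaler.fit_transform(X_demo)
nb = MultinomialNB().fit(X_demo_scaled, y_demo)
print(X_demo_scaled.min(), X_demo_scaled.max())
```

A Gaussian naive Bayes model (`sklearn.naive_bayes.GaussianNB`) accepts real-valued features directly and may be a more natural fit for dense embeddings; that is a suggestion, not something evaluated in this notebook.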

In [29]:
clf = MultinomialNB()
clf.fit(X_train_scaled, y_train)

In [30]:
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.8218365359071992
              precision    recall  f1-score   support

           0       0.75      0.97      0.84      5585
           1       0.96      0.68      0.79      5708

    accuracy                           0.82     11293
   macro avg       0.85      0.82      0.82     11293
weighted avg       0.85      0.82      0.82     11293



In [31]:
clf_KNN = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
clf_KNN.fit(X_train_scaled, y_train)

In [32]:
y_pred_KNN = clf_KNN.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred_KNN))
print(classification_report(y_test, y_pred_KNN))

0.8990525104046755
              precision    recall  f1-score   support

           0       0.92      0.87      0.89      5585
           1       0.88      0.93      0.90      5708

    accuracy                           0.90     11293
   macro avg       0.90      0.90      0.90     11293
weighted avg       0.90      0.90      0.90     11293



Using GridSearchCV to compare models and find the best hyperparameters:

In [34]:
model_params = {
    
    'logistic_regression': {
        'model': LogisticRegression(solver='liblinear', multi_class='auto'),
        'params':{
            'C': [1,5,7,10]
        }
    },
    'KNN':{
        'model':KNeighborsClassifier(),
        'params':{
            'n_neighbors' : [3,5,7],
            'metric': ['euclidean']
        }
    }
}

In [35]:
scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(X_train_scaled, y_train)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

In [36]:
scores

[{'model': 'logistic_regression',
  'best_score': 0.9487946402810314,
  'best_params': {'C': 7}},
 {'model': 'KNN',
  'best_score': 0.8939806344328668,
  'best_params': {'metric': 'euclidean', 'n_neighbors': 5}}]
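
In the search above the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. The effect is probably small for min-max scaling, but one way to avoid it is to put the scaler and the classifier in a `Pipeline` and search over that. A sketch on synthetic data (the data from `make_classification` and the step names `scale` and `clf` are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the review vectors
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                       # refitted inside every CV fold
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [1, 5, 7, 10]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

The fitted `search.best_estimator_` is itself a pipeline, so it scales and predicts in one call.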

From the above we can see that logistic regression is the most accurate model so far, with a mean cross-validated accuracy of about 0.949 (C = 7) against about 0.894 for KNN.

In [37]:
model = LogisticRegression(solver='liblinear', multi_class = 'auto', C = 7)
model.fit(X_train_scaled, y_train)

In [38]:
y_pred_log = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred_log))

              precision    recall  f1-score   support

           0       0.94      0.95      0.95      5585
           1       0.95      0.94      0.95      5708

    accuracy                           0.95     11293
   macro avg       0.95      0.95      0.95     11293
weighted avg       0.95      0.95      0.95     11293



In [43]:
model.predict(scaler.transform(vectorizer('trash').vector.reshape(1,300))).item()

0
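
For repeated use, the one-liner can be wrapped in a helper that applies the same steps as training: remove digits, vectorize, scale, predict. The function below is a sketch; the name `predict_sentiment` and its argument names are assumptions, and it expects the fitted `vectorizer`, `scaler` and `model` objects from this notebook.

```python
import re
import numpy as np

def predict_sentiment(text, nlp, scaler, model):
    """Return 1 (positive) or 0 (negative) for a single review text."""
    cleaned = re.sub(r"\d", "", str(text))                     # same digit removal as in training
    vector = np.asarray(nlp(cleaned).vector).reshape(1, -1)    # one row, n_features columns
    return int(model.predict(scaler.transform(vector))[0])

# Usage with the objects defined in this notebook (not executed here):
# predict_sentiment("trash", vectorizer, scaler, model)
```

Using `reshape(1, -1)` instead of a hard-coded 300 keeps the helper working if a model with a different vector size is loaded.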

In [40]:
#pickling our model
with open('model_spc.pickle', 'wb') as f:
    pickle.dump(model, f)

In [41]:
#pickling our vectorizer
with open('spacy_vec.pickle','wb') as f:
    pickle.dump(vectorizer,f)

In [42]:
#pickling our scaler
with open('mm_scaler.pickle', 'wb') as f:
    pickle.dump(scaler, f)
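
To reuse the saved objects later, they are loaded back with `pickle.load`. The sketch below round-trips a small scaler and model through a temporary directory to show the pattern; the synthetic data, variable names and file names are assumptions. Pickles should only be loaded from trusted sources, and scikit-learn objects are generally expected to be loaded with the same library version that saved them. Pickling the whole spaCy pipeline also produces a large file; calling `spacy.load("en_core_web_lg")` again at load time is a possible alternative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Small synthetic problem standing in for the review vectors
rng = np.random.default_rng(0)
X_small = rng.normal(size=(40, 4))
y_small = (X_small[:, 0] > 0).astype(int)

small_scaler = MinMaxScaler().fit(X_small)
small_model = LogisticRegression(solver="liblinear").fit(
    small_scaler.transform(X_small), y_small
)

with tempfile.TemporaryDirectory() as tmp:
    # Save both objects with pickle.dump, as in the cells above
    for name, obj in [("scaler.pickle", small_scaler), ("model.pickle", small_model)]:
        with open(os.path.join(tmp, name), "wb") as f:
            pickle.dump(obj, f)

    # Load them back into fresh variables
    with open(os.path.join(tmp, "scaler.pickle"), "rb") as f:
        loaded_scaler = pickle.load(f)
    with open(os.path.join(tmp, "model.pickle"), "rb") as f:
        loaded_model = pickle.load(f)

original_pred = small_model.predict(small_scaler.transform(X_small))
loaded_pred = loaded_model.predict(loaded_scaler.transform(X_small))
print((original_pred == loaded_pred).all())
```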