**word2vec Word Embedding**

In [None]:
!python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [None]:
# !pip install spacy

Word embedding is one of the most popular representations of document vocabulary. It can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, and so on.


Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network. It was developed by Tomas Mikolov and colleagues at Google in 2013.
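To make the "shallow neural network" idea a little more concrete, here is a minimal sketch of how skip-gram training pairs can be generated: each word is paired with the words that fall inside a small window around it, and the network is then trained to predict the context word from the target word. The function name `skipgram_pairs` and the `window` parameter are assumptions made for illustration; this is not the code used by any particular Word2Vec implementation.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                  # left edge of the context window
        hi = min(len(tokens), i + window + 1)    # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs("the quick brown fox".split(), window=1)
print(pairs)
```

Words that share many context words end up with similar vectors, which is what the similarity scores later in this notebook rely on.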

**Word Vector**

https://github.com/explosion/spacy.


https://spacy.io/usage/vectors-similarity.

Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word.

python -m spacy download en_core_web_lg

In [None]:
import spacy
# nlp = en_core_web_lg.load()  # another way to load model - en_core_web_lg
# nlp

In [None]:
# another way to load model - en_core_web_lg

import spacy.cli                                  
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [None]:
print(spacy.__version__) 

2.2.4


In [None]:
nlp = spacy.load('en_core_web_lg')

In [None]:
x = 'dog cat lion dsfaf'
doc = nlp(x)

In [None]:
for token in doc:
  print(token.text, token.has_vector, token.vector_norm)

dog True 7.0336733
cat True 6.6808186
lion True 6.5120897
dsfaf False 0.0


**Semantic Similarity**

spaCy can compare two objects and predict how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates.

For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one.

Each Doc, Span and Token comes with a similarity() method that lets you compare it with another object, and determine the similarity.
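By default, spaCy's similarity() is the cosine similarity of the two vectors. The sketch below reproduces that calculation with NumPy on toy 3-dimensional vectors. The vectors and the helper name `cosine_similarity` are invented for illustration, and returning 0.0 for a zero vector is an assumption that mirrors how an out-of-vocabulary token behaves in this notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


# Toy vectors, invented for illustration (en_core_web_lg vectors have 300 dimensions).
v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 4.0, 0.0])   # same direction as v1
v3 = np.array([0.0, 0.0, 5.0])   # orthogonal to v1
oov = np.zeros(3)                # stand-in for a token without a vector

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
print(cosine_similarity(v1, oov))
```

Because only the angle matters, two vectors of very different length can still have a similarity close to 1.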

In [None]:
x

'dog cat lion dsfaf'

In [None]:
doc = nlp(x)

In [None]:
for token1 in doc:
  for token2 in doc:
    print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.80168545
dog lion 0.47424486
dog dsfaf 0.0
cat dog 0.80168545
cat cat 1.0
cat lion 0.5265438
cat dsfaf 0.0
lion dog 0.47424486
lion cat 0.5265438
lion lion 1.0
lion dsfaf 0.0
dsfaf dog 0.0
dsfaf cat 0.0
dsfaf lion 0.0
dsfaf dsfaf 1.0


**Model Building for word2vec**

**Data Preparation**

In [None]:
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

Collecting git+https://github.com/laxmimerit/preprocess_kgptalkie.git
  Cloning https://github.com/laxmimerit/preprocess_kgptalkie.git to /tmp/pip-req-build-6vm5vf2r
  Running command git clone -q https://github.com/laxmimerit/preprocess_kgptalkie.git /tmp/pip-req-build-6vm5vf2r


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

In [None]:
import preprocess_kgptalkie as ps

In [None]:
df = pd.read_csv('imdb_reviews.txt', sep='\t', header=None)
df.columns = ['reviews','sentiment']

In [None]:
df.head()

Unnamed: 0,reviews,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [None]:
x = 'A vry, very slow-moving, aimlss movie'
ps.spelling_correction(x)

TextBlob("A very, very slow-moving, aimless movie")

In [None]:
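%%time
# Text clean-up: expand contractions, then strip e-mails, HTML tags and URLs
df['reviews'] = df['reviews'].apply(lambda x: ps.cont_exp(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_emails(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_html_tags(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_urls(x))

# Normalisation: special characters, accents, base forms, spelling
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_special_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.remove_accented_chars(x))
df['reviews'] = df['reviews'].apply(lambda x: ps.make_base(x))
# spelling_correction returns a TextBlob rather than a str, which appears to be why
# df.head() shows each review as a sequence of characters and why str(x) is used later
df['reviews'] = df['reviews'].apply(lambda x: ps.spelling_correction(x))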

CPU times: user 3min 14s, sys: 577 ms, total: 3min 14s
Wall time: 3min 22s
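For readers who want to see what a few of these cleaning steps amount to, here is a rough sketch using only the standard `re` module. The function `clean_text` and its regular expressions are assumptions written for illustration; they are not the implementation inside preprocess_kgptalkie and will not match its behaviour exactly.

```python
import re


def clean_text(text):
    """Rough stand-in for a few cleaning steps; not the library's implementation."""
    text = re.sub(r"<[^>]+>", " ", text)                  # drop HTML tags
    text = re.sub(r"\S+@\S+\.\S+", " ", text)             # drop e-mail addresses
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # drop URLs
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)           # drop special characters
    return re.sub(r"\s+", " ", text).strip().lower()      # collapse spaces, lowercase


sample = "<b>Great</b> movie! Mail me at fan@example.com or see http://example.com/review"
cleaned = clean_text(sample)
print(cleaned)
```

The order matters: e-mails and URLs are removed before special characters, otherwise stripping `@`, `:` and `/` first would leave fragments behind.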


In [None]:
df.head()

Unnamed: 0,reviews,sentiment
0,"(a, , v, e, r, y, , v, e, r, y, , v, e, r, ...",0
1,"(n, o, t, , s, u, r, e, , w, h, o, , w, a, ...",0
2,"(a, t, t, e, m, p, t, , a, r, t, l, e, s, s, ...",0
3,"(v, e, r, y, , l, i, t, t, l, e, , m, u, s, ...",0
4,"(t, h, e, , g, o, o, d, , s, c, e, n, e, , ...",1


**ML model Building**

In [None]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [None]:
x = 'cat dog'
doc= nlp(x)

In [None]:
doc.vector.shape

(300,)

In [None]:
doc.vector.reshape(1, -1).shape

(1, 300)

In [None]:
def get_vec(x):
  doc = nlp(x)
  vec = doc.vector
  return vec
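With spaCy's default settings, doc.vector is the average of the token vectors, which is why a text of any length maps to a single 300-dimensional vector. The sketch below shows that averaging with NumPy. The 4-dimensional vectors, the lookup table and the helper `average_vector` are made up for illustration and stand in for the model's real vector table.

```python
import numpy as np

# Hypothetical 4-dimensional token vectors (en_core_web_lg uses 300 dimensions).
token_vectors = {
    "cat": np.array([1.0, 0.0, 2.0, 4.0]),
    "dog": np.array([3.0, 2.0, 0.0, 4.0]),
}


def average_vector(tokens, table, dim=4):
    """Mean of the vectors of `tokens`; unknown tokens contribute a zero vector."""
    vectors = [table.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(vectors, axis=0)


doc_vec = average_vector(["cat", "dog"], token_vectors)
print(doc_vec)                        # one fixed-length vector for the whole text
print(doc_vec.reshape(1, -1).shape)   # single-row 2-D array, the shape scikit-learn expects
```

Averaging discards word order, so "not good" and "good not" get the same vector; that is one limitation of this feature representation.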

In [None]:
df['vec'] = df['reviews'].apply(lambda x: get_vec(str(x)))

In [None]:
df.head()

Unnamed: 0,reviews,sentiment,vec
0,"(a, , v, e, r, y, , v, e, r, y, , v, e, r, ...",0,"[-0.074153, 0.11350991, -0.23838478, 0.1394247..."
1,"(n, o, t, , s, u, r, e, , w, h, o, , w, a, ...",0,"[0.062192187, 0.1952087, -0.14579107, -0.00481..."
2,"(a, t, t, e, m, p, t, , a, r, t, l, e, s, s, ...",0,"[-0.19790795, 0.015133962, -0.107922316, -0.06..."
3,"(v, e, r, y, , l, i, t, t, l, e, , m, u, s, ...",0,"[-0.09093174, 0.25162372, -0.25681874, 0.15846..."
4,"(t, h, e, , g, o, o, d, , s, c, e, n, e, , ...",1,"[0.064886056, 0.13270056, -0.15480983, -0.0207..."


In [None]:
df.shape

(748, 3)

In [None]:
x = df['vec'].to_numpy()
x = x.reshape(-1,1)

In [None]:
x = np.concatenate(np.concatenate(x, axis=0), axis = 0).reshape(-1, 300)

In [None]:
x.shape

(748, 300)
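The reshape and double concatenate above turn a column of per-row vectors into one 2-D feature matrix. A shorter way to get the same kind of matrix is np.stack, sketched here on a small stand-in for `df['vec'].to_numpy()`. The five random rows are invented for illustration, so the shape is (5, 300) rather than (748, 300).

```python
import numpy as np

# Stand-in for df['vec'].to_numpy(): an object array holding one vector per row.
rng = np.random.default_rng(0)
vec_column = np.empty(5, dtype=object)
for i in range(5):
    vec_column[i] = rng.random(300).astype(np.float32)

# Stack the per-row vectors along a new first axis: one row per review.
features = np.stack(list(vec_column))
print(features.shape)
```

Each row of `features` is exactly the vector stored for that review, so row order and labels stay aligned.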

In [None]:
y = df['sentiment']

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0, stratify=y)

In [None]:
x_train.shape, x_test.shape

((598, 300), (150, 300))

**ML Model Training and Testing**

In [None]:
clf = LogisticRegression(solver='liblinear')

In [None]:
clf.fit(x_train, y_train)

LogisticRegression(solver='liblinear')

In [None]:
y_pred = clf.predict(x_test)
y_pred

array([1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.82      0.81        73
           1       0.82      0.79      0.81        77

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150
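The columns of the report can be computed by hand from counts of true positives, false positives and false negatives. The sketch below does that in plain Python. The toy labels and the helper name `precision_recall_f1` are assumptions for illustration and are unrelated to the IMDB predictions above.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels, invented for illustration.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision_recall_f1(y_true_toy, y_pred_toy, positive=1))
```

"support" in the report is simply the number of true samples of each class in the test set.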



In [None]:
import pickle

In [None]:
with open('w2v_sentiment.pkl', 'wb') as f:   # the file is closed automatically
    pickle.dump(clf, f)
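A saved model is only useful if it can be loaded again. Below is a minimal sketch of the round trip with pickle. A plain dict stands in for the fitted classifier so that the example runs without scikit-learn, and the temporary directory is an assumption made so that nothing in the working directory is overwritten.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted classifier (assumption for illustration).
model_stand_in = {"solver": "liblinear", "classes": [0, 1]}

path = os.path.join(tempfile.mkdtemp(), "w2v_sentiment.pkl")

with open(path, "wb") as f:        # write the object to disk
    pickle.dump(model_stand_in, f)

with open(path, "rb") as f:        # read it back
    restored = pickle.load(f)

print(restored == model_stand_in)
```

With a real classifier, the restored object can call predict() directly, as long as new text is vectorized the same way as the training data. Only unpickle files you trust, since loading a pickle can execute arbitrary code.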

**Support Vector Machine on word2vec**

In [None]:
from sklearn.svm import LinearSVC

In [None]:
clf = LinearSVC()

In [None]:
clf.fit(x_train, y_train)

LinearSVC()

In [None]:
y_pred = clf.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.81      0.79        73
           1       0.81      0.77      0.79        77

    accuracy                           0.79       150
   macro avg       0.79      0.79      0.79       150
weighted avg       0.79      0.79      0.79       150



**Grid Search Cross Validation for Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
logit = LogisticRegression(solver = 'liblinear')

In [None]:
hyperparameter = {
    'penalty' : ['l1', 'l2'],          # l1 = lasso-style penalty, l2 = ridge-style penalty
    'C' : (1,2,3,4)                    # inverse of regularization strength (smaller = stronger)
}

In [None]:
clf = GridSearchCV(logit, hyperparameter, n_jobs=-1, cv=5)    # n_jobs=-1 uses all available CPU cores; cv=5 runs 5-fold cross-validation (about 80% accuracy on the test data in this run)
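A grid search simply tries every combination in the parameter grid. The sketch below enumerates those combinations with itertools.product to show how much work the search involves. It copies the grid from this notebook; the variable names `candidates` and `total_fits` are introduced only for illustration.

```python
from itertools import product

hyperparameter = {
    'penalty': ['l1', 'l2'],
    'C': (1, 2, 3, 4),
}
cv_folds = 5

names = sorted(hyperparameter)   # ['C', 'penalty']
candidates = [
    dict(zip(names, values))
    for values in product(*(hyperparameter[name] for name in names))
]
total_fits = len(candidates) * cv_folds   # one fit per candidate per fold

for params in candidates:
    print(params)
print(total_fits)
```

Two penalties times four values of C give 8 candidates, so 5-fold cross-validation means 40 model fits, plus one final refit of the best candidate on the full training set when refit=True.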

In [None]:
help(GridSearchCV)

Help on class GridSearchCV in module sklearn.model_selection._search:

class GridSearchCV(BaseSearchCV)
 |  GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
 |  
 |  Exhaustive search over specified parameter values for an estimator.
 |  
 |  Important members are fit, predict.
 |  
 |  GridSearchCV implements a "fit" and a "score" method.
 |  It also implements "predict", "predict_proba", "decision_function",
 |  "transform" and "inverse_transform" if they are implemented in the
 |  estimator used.
 |  
 |  The parameters of the estimator used to apply these methods are optimized
 |  by cross-validated grid-search over a parameter grid.
 |  
 |  Read more in the :ref:`User Guide <grid_search>`.
 |  
 |  Parameters
 |  ----------
 |  estimator : estimator object.
 |      This is assumed to implement the scikit-learn estimator interface.
 |      Either e

In [None]:
%%time
clf.fit(x_train, y_train)

CPU times: user 210 ms, sys: 144 ms, total: 354 ms
Wall time: 3.45 s


GridSearchCV(cv=5, estimator=LogisticRegression(solver='liblinear'), n_jobs=-1,
             param_grid={'C': (1, 2, 3, 4), 'penalty': ['l1', 'l2']})

In [None]:
clf.best_params_

{'C': 1, 'penalty': 'l2'}

In [None]:
clf.best_score_

0.8294117647058823
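best_score_ is the mean cross-validated score of the best candidate, measured on validation folds carved out of the training data. That is why it can differ from the accuracy later reported on the held-out test set. The sketch below shows how 598 training rows could be divided into 5 folds. It is a plain, unstratified split written with NumPy under that assumption, whereas scikit-learn uses a stratified split by default for classifiers; `kfold_indices` is a hypothetical helper.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into folds."""
    folds = np.array_split(np.arange(n_samples), n_splits)
    for k in range(n_splits):
        val_idx = folds[k]   # fold k is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, val_idx


splits = list(kfold_indices(598, n_splits=5))
val_sizes = [len(val_idx) for _, val_idx in splits]
print(val_sizes)
```

Each candidate is trained five times, each time on four folds and scored on the fifth, and the five scores are averaged.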

In [None]:
y_pred = clf.predict(x_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.82      0.81        73
           1       0.82      0.79      0.81        77

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150



**Test Every Machine Learning Model** 

https://pypi.org/project/lazypredict/

In [None]:
!pip install lazypredict



In [None]:
!pip install xgboost
!pip install lightgbm



In [None]:
from lazypredict.Supervised import LazyClassifier

In [None]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)

In [None]:
%%time
model, predictions = clf.fit(x_train, x_test, y_train, y_test)

100%|██████████| 29/29 [00:08<00:00,  3.56it/s]


In [None]:
model

Model,Accuracy,Balanced Accuracy,ROC AUC,F1 Score,Time Taken
SVC,0.78,0.78,0.78,0.78,0.19
NuSVC,0.78,0.78,0.78,0.78,0.2
LogisticRegression,0.74,0.74,0.74,0.74,0.11
ExtraTreesClassifier,0.73,0.73,0.73,0.73,0.24
CalibratedClassifierCV,0.72,0.72,0.72,0.72,0.86
RandomForestClassifier,0.72,0.72,0.72,0.72,0.56
AdaBoostClassifier,0.72,0.72,0.72,0.72,0.97
GaussianNB,0.71,0.72,0.72,0.71,0.03
LGBMClassifier,0.71,0.71,0.71,0.71,1.71
NearestCentroid,0.71,0.71,0.71,0.71,0.03
