# `word2vec` Word Embedding


In [1]:
#!python -m spacy download en_core_web_lg

Word embeddings are one of the most popular representations of document vocabulary. They can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, and more.

## Word Vectors with Spacy

https://github.com/explosion/spaCy

https://spacy.io/usage/vectors-similarity

Similarity is determined by comparing word vectors or “word embeddings”, multi-dimensional meaning representations of a word. 

`python -m spacy download en_core_web_lg`

A larger option is `en_vectors_web_lg`, which includes over 1 million unique vectors.

In [2]:
import spacy

In [3]:
nlp = spacy.load('en_core_web_lg')
nlp2 = spacy.load('en_core_web_lg')

In [4]:
x = 'dog cat lion dsfaf'
doc = nlp(x)

In [5]:
for token in doc:
    print(token.text, token.has_vector, token.vector_norm)

dog True 7.0336733
cat True 6.6808186
lion True 6.5120897
dsfaf False 0.0


In [6]:
sentence2 = 'ali eve git dershane hastane okul'
doc2 = nlp2(sentence2)

for token in doc2:
    print(token.text, token.has_vector, token.vector_norm)

ali True 7.0754066
eve True 5.777626
git True 7.654205
dershane False 0.0
hastane False 0.0
okul False 0.0


## Semantic Similarity 

spaCy is able to compare two objects, and make a prediction of how similar they are. Predicting similarity is useful for building recommendation systems or flagging duplicates. 

For example, you can suggest a user content that’s similar to what they’re currently looking at, or label a support ticket as a duplicate if it’s very similar to an already existing one.

Each `Doc`, `Span` and `Token` comes with a `.similarity()` method that lets you compare it with another object and determine the similarity.

In [7]:
x

'dog cat lion dsfaf'

In [8]:
doc = nlp(x)

In [9]:
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))

dog dog 1.0
dog cat 0.80168545
dog lion 0.47424486
dog dsfaf 0.0
cat dog 0.80168545
cat cat 1.0
cat lion 0.52654374
cat dsfaf 0.0
lion dog 0.47424486
lion cat 0.52654374
lion lion 1.0
lion dsfaf 0.0
dsfaf dog 0.0
dsfaf cat 0.0
dsfaf lion 0.0
dsfaf dsfaf 1.0
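
spaCy's documentation describes the default similarity as the cosine similarity between two vectors. The following is a minimal sketch of that measure in plain Python, not spaCy's own implementation. The three-dimensional vectors are invented for illustration, and the function returns 0.0 whenever either vector is all zeros, so it will not reproduce the 1.0 shown for `dsfaf dsfaf` above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration (the spaCy vectors in this notebook have 300 dimensions).
dog = [1.0, 2.0, 0.0]
cat = [1.0, 1.5, 0.5]
unknown = [0.0, 0.0, 0.0]  # stands in for an out-of-vocabulary token

print(cosine_similarity(dog, dog))      # a vector compared with itself
print(cosine_similarity(dog, cat))      # two similar directions
print(cosine_similarity(dog, unknown))  # zero vector: defined as 0.0 in this sketch
```

Because the measure depends only on direction, scaling a vector does not change its similarity to another vector.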




# Model Building for `word2vec` 

## Data Preparation 

In [25]:
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

Collecting git+https://github.com/laxmimerit/preprocess_kgptalkie.git
  Cloning https://github.com/laxmimerit/preprocess_kgptalkie.git to c:\users\ertug\appdata\local\temp\pip-req-build-dvvq5x86


  ERROR: Error [WinError 2] The system cannot find the file specified while executing command git clone -q https://github.com/laxmimerit/preprocess_kgptalkie.git 'C:\Users\ertug\AppData\Local\Temp\pip-req-build-dvvq5x86'
ERROR: Cannot find command 'git' - do you have 'git' installed and in your PATH?


In [10]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

In [11]:
import preprocess_kgptalkie as ps

In [12]:
df = pd.read_csv('data/imdb_reviews.txt', sep = '\t', header = None)
df.columns = ['reviews', 'sentiment']

In [13]:
df.head()

|   | reviews | sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [37]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ertug\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [14]:
x = "A very, very, very slow-moving, aimlss movie"
ps.spelling_correction(x).raw_sentences[0]

'A very, very, very slow-moving, aimless movie'

In [15]:
%%time
# Clean the review text step by step with the preprocess_kgptalkie helpers.
df['reviews'] = df['reviews'].apply(ps.cont_exp)               # expand contractions
df['reviews'] = df['reviews'].apply(ps.remove_emails)
df['reviews'] = df['reviews'].apply(ps.remove_html_tags)
df['reviews'] = df['reviews'].apply(ps.remove_urls)

df['reviews'] = df['reviews'].apply(ps.remove_special_chars)
df['reviews'] = df['reviews'].apply(ps.remove_accented_chars)
df['reviews'] = df['reviews'].apply(ps.make_base)              # reduce words to a base form
# Spelling correction is slow; only the first corrected sentence is kept.
df['reviews'] = df['reviews'].apply(lambda x: ps.spelling_correction(x).raw_sentences[0])

KeyboardInterrupt: 

In [16]:
df.head()

|   | reviews | sentiment |
|---|---|---|
| 0 | a very very very slowmoving aimless movthat is... | 0 |
| 1 | not syoure who was more lose the flat characte... | 0 |
| 2 | attempointe aretweetiness with black white and... | 0 |
| 3 | very little myousi see or anything to speak of | 0 |
| 4 | the good scene in the movthat is was when Gera... | 1 |


## ML Model Building 

In [17]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [24]:
x = 'cat dog school play game orlando magic'
doc = nlp(x)

In [25]:
doc.vector.shape

(300,)

In [28]:
#doc.vector

In [26]:
doc.vector.reshape(1, -1).shape

(1, 300)

In [83]:
for token in doc:
    print(token,token.vector.shape)

# Each token (cat, dog, ...) has its own distinct vector of shape (300,).

cat (300,)
dog (300,)
school (300,)
play (300,)
game (300,)
orlando (300,)
magic (300,)
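
In spaCy, `doc.vector` defaults to the average of the token vectors, which is why the document vector has the same `(300,)` shape as each token vector. The sketch below illustrates that averaging with NumPy only. The four-dimensional vectors, the `token_vectors` table, and the `average_vector` helper are all hypothetical and stand in for the real 300-dimensional embeddings.

```python
import numpy as np

# Toy 4-dimensional "word vectors", invented for illustration only.
token_vectors = {
    "cat": np.array([1.0, 0.0, 2.0, 0.0]),
    "dog": np.array([3.0, 0.0, 0.0, 2.0]),
    "school": np.array([2.0, 3.0, 1.0, 1.0]),
}

def average_vector(tokens, table, dim=4):
    """Mean of the token vectors; unknown tokens contribute a zero vector."""
    rows = [table.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(rows, axis=0)

doc_vec = average_vector(["cat", "dog", "school"], token_vectors)
print(doc_vec)        # element-wise mean of the three vectors
print(doc_vec.shape)  # same dimensionality as a single token vector
```

Averaging discards word order, so two reviews containing the same words in a different order would get the same vector under this scheme.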


In [35]:
def get_vec(x):
    doc = nlp(x)
    vec = doc.vector
    return vec

In [36]:
df['vec'] = df['reviews'].apply(get_vec)

In [37]:
df.head()

|   | reviews | sentiment | vec |
|---|---|---|---|
| 0 | a very very very slowmoving aimless movthat is... | 0 | [-0.057691786, 0.12695377, -0.122108765, 0.096... |
| 1 | not syoure who was more lose the flat characte... | 0 | [0.054727968, 0.14161739, -0.10472363, -0.0142... |
| 2 | attempointe aretweetiness with black white and... | 0 | [-0.13334763, 0.037903063, -0.08646696, -0.070... |
| 3 | very little myousi see or anything to speak of | 0 | [-0.10341489, 0.16605233, -0.3033911, 0.120629... |
| 4 | the good scene in the movthat is was when Gera... | 1 | [0.04783881, 0.16731454, -0.10701978, -0.02907... |


In [38]:
df.shape

(748, 3)

In [39]:
df['vec'].shape

(748,)

In [42]:
X = df['vec'].to_numpy()
print(X[0].shape)
print(X.shape)
X = X.reshape(-1, 1)
print(X.shape)

(300,)
(748,)
(748, 1)


In [50]:
np.concatenate(X.flatten(), axis = 0).shape

(224400,)

In [51]:
X.flatten().shape

(748,)

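`X` has 748 rows, and each row has a single column holding an array of shape `(300,)`. It is therefore an array of array objects, not a true two-dimensional NumPy array of shape `(748, 300)`, which is what scikit-learn expects.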

In [52]:
X = np.concatenate(np.concatenate(X, axis = 0), axis = 0).reshape(-1, 300)

In [53]:
X.shape

(748, 300)
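
As an alternative to the nested `np.concatenate` call, `np.stack` can turn a one-dimensional object array of equal-length vectors into a regular 2-D matrix. This is a minimal sketch on stand-in data: `as_object` and `X_demo` are hypothetical names, and the three length-4 rows replace the notebook's 748 vectors of length 300.

```python
import numpy as np

# Stand-in for df['vec'].to_numpy(): a 1-D object array whose elements are arrays.
as_object = np.empty(3, dtype=object)
for i in range(3):
    as_object[i] = np.arange(4, dtype=float) + i

print(as_object.dtype, as_object.shape)  # object dtype, one dimension

# np.stack joins the element arrays along a new first axis.
X_demo = np.stack(as_object)
print(X_demo.dtype, X_demo.shape)        # numeric dtype, two dimensions
```

The same call would apply to the notebook's column as `np.stack(df['vec'].to_numpy())`, assuming every row holds a vector of the same length.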

In [54]:
y = df['sentiment']

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

In [56]:
X_train.shape, X_test.shape

((598, 300), (150, 300))

## ML Model Training and Testing 

In [57]:
clf = LogisticRegression(solver='liblinear')

In [58]:
clf.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [59]:
y_pred = clf.predict(X_test)

In [60]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.73      0.71        73
           1       0.73      0.69      0.71        77

    accuracy                           0.71       150
   macro avg       0.71      0.71      0.71       150
weighted avg       0.71      0.71      0.71       150



In [62]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.7066666666666667


In [None]:
import pickle 

In [None]:
with open('w2v_sentiment.pkl', 'wb') as f:  # the context manager closes the file
    pickle.dump(clf, f)
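
A saved model is only useful if it can be loaded back and gives the same predictions. The sketch below shows a pickle round trip on a small synthetic dataset rather than the review vectors. The data, the temporary file name `demo_model.pkl`, and the variable names are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic data standing in for the 300-dimensional review vectors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should predict exactly what the original does.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from trusted sources, and it is safest to load them with the same scikit-learn version that wrote them.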

## Support Vector Machine on `word2vec`

In [63]:
from sklearn.svm import LinearSVC

In [64]:
clf = LinearSVC()

In [65]:
clf.fit(X_train, y_train)

LinearSVC()

In [66]:
y_pred = clf.predict(X_test)

In [67]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.69      0.71      0.70        73
           1       0.72      0.70      0.71        77

    accuracy                           0.71       150
   macro avg       0.71      0.71      0.71       150
weighted avg       0.71      0.71      0.71       150



In [68]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.7066666666666667


## Grid Search Cross Validation for Hyperparameter Tuning

In [69]:
from sklearn.model_selection import GridSearchCV

In [70]:
logit = LogisticRegression(solver = 'liblinear')

In [71]:
hyperparameters = {
    'penalty': ['l1', 'l2'],
    'C': (1, 2, 3, 4)
}

In [72]:
clf = GridSearchCV(logit, hyperparameters, n_jobs=-1, cv = 5)

In [73]:
%%time
clf.fit(X_train, y_train)

Wall time: 2.57 s


GridSearchCV(cv=5, estimator=LogisticRegression(solver='liblinear'), n_jobs=-1,
             param_grid={'C': (1, 2, 3, 4), 'penalty': ['l1', 'l2']})

In [74]:
clf.best_params_

{'C': 2, 'penalty': 'l2'}

In [75]:
clf.best_score_

0.7926190476190476

In [76]:
y_pred = clf.predict(X_test)

In [77]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.67      0.70      0.68        73
           1       0.70      0.68      0.69        77

    accuracy                           0.69       150
   macro avg       0.69      0.69      0.69       150
weighted avg       0.69      0.69      0.69       150



In [78]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

0.6866666666666666
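
Note that `best_score_` is the mean cross-validated score of the best parameter combination, computed on folds of the training data, so it can differ from the accuracy on the held-out test set (0.79 versus 0.69 above). The sketch below makes the two numbers explicit on synthetic data from `make_classification`. The dataset and the names `search`, `cv_score`, and `test_score` are assumptions for illustration, and the printed values will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the review vectors.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

param_grid = {'penalty': ['l1', 'l2'], 'C': (1, 2, 3, 4)}
search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
search.fit(X_tr, y_tr)

# Mean accuracy over the 5 validation folds for the best parameters.
cv_score = search.best_score_
# Accuracy of the refitted best model on data it never saw during the search.
test_score = accuracy_score(y_te, search.predict(X_te))

print(search.best_params_)
print(cv_score, test_score)
```

A large gap between the two scores on a small dataset such as this one (150 test rows in the notebook) may simply reflect sampling noise rather than a problem with the search.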


## Test Every Machine Learning Model 

https://pypi.org/project/lazypredict/

In [79]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.7-py2.py3-none-any.whl (11 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.7


In [80]:
# !pip install xgboost
# !pip install lightgbm
# If these fail in the notebook, install them from a terminal opened in admin mode.

In [81]:
from lazypredict.Supervised import LazyClassifier



ModuleNotFoundError: No module named 'xgboost'

In [None]:
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)

In [None]:
%%time
models, predictions = clf.fit(X_train, X_test,  y_train, y_test)

In [None]:
models