# Capstone Fall Quarter : Natural Language Processing  

## Objective - Our model should convert any input string into a vector as accurately as possible. It will act as a pipeline for our further NLP analysis.   

### Chosen Dataset - Covid-19 related news collected from various articles. The dataset consists of the following columns: authors, title, publish_date, description, text and URL. We concentrate mainly on the text column for our NLP analysis.

### Data Cleaning Steps:

In [1]:
# import the necessary packages

import pandas as pd
import numpy as np

In [2]:
# Loading our dataset as a dataframe using pandas 
df = pd.read_csv('news.csv', delimiter =',')

In [3]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,authors,title,publish_date,description,text,url
0,0,['Cbc News'],Coronavirus a 'wake-up call' for Canada's pres...,2020-03-27 08:00:00,Canadian pharmacies are limiting how much medi...,Canadian pharmacies are limiting how much medi...,https://www.cbc.ca/news/health/covid-19-drug-s...
1,1,['Cbc News'],Yukon gov't names 2 possible sources of corona...,2020-03-27 01:45:00,The Yukon government has identified two places...,The Yukon government has identified two places...,https://www.cbc.ca/news/canada/north/yukon-cor...
2,2,['The Associated Press'],U.S. Senate passes $2T coronavirus relief package,2020-03-26 05:13:00,The Senate has passed an unparalleled $2.2 tri...,The Senate late Wednesday passed an unparallel...,https://www.cbc.ca/news/world/senate-coronavir...


In [4]:
#shape of the dataset

df.shape

(3566, 7)

In [5]:
df.describe()

Unnamed: 0.1,Unnamed: 0
count,3566.0
mean,2455.649748
std,1298.52945
min,0.0
25%,1473.25
50%,2496.5
75%,3569.75
max,4608.0


In [6]:
# Dropping unwanted columns from the dataset
df.drop(["Unnamed: 0",'publish_date','url','title','description'], axis = 1, inplace = True)

In [7]:
df.head()

Unnamed: 0,authors,text
0,['Cbc News'],Canadian pharmacies are limiting how much medi...
1,['Cbc News'],The Yukon government has identified two places...
2,['The Associated Press'],The Senate late Wednesday passed an unparallel...
3,['Cbc News'],Scientists around the world are racing to find...
4,['Cbc News'],Trudeau says rules of Quarantine Act will ...


In [8]:
# Strip brackets, quotes and other special characters from the authors column

df['authors'] = df['authors'].str.strip('[]')
df['authors'] = df['authors'].str.strip(' ')
# regex=True is required for the character class to be treated as a pattern
df['authors'] = df['authors'].str.replace(r"[({':]", "", regex=True)
df['authors'] = df['authors'].str.lower()

In [9]:
df.head()

Unnamed: 0,authors,text
0,cbc news,Canadian pharmacies are limiting how much medi...
1,cbc news,The Yukon government has identified two places...
2,the associated press,The Senate late Wednesday passed an unparallel...
3,cbc news,Scientists around the world are racing to find...
4,cbc news,Trudeau says rules of Quarantine Act will ...


In [10]:
df['authors'].nunique()

261

In [11]:
#df['authors'].unique()

#### Cleaning Authors Column

In [12]:
import re

# Collapse the many spellings of each source into a single label.
# Assigning the result back avoids relying on inplace=True on a column selection.
df['authors'] = df['authors'].replace(to_replace=[r'cbcs?\b.*', r'.*\bcbcs?', r'.*cbcnews.*'], value='cbc', regex=True)
df['authors'] = df['authors'].replace(to_replace=['the associated press'], value='associated press')
df['authors'] = df['authors'].replace(to_replace=[r'canadian?\b.*', r'.*\bcanadian?'], value='canadian', regex=True)
df['authors'] = df['authors'].replace(to_replace=[r'freelancer?\b.*', r'.*\bfreelancer?'], value='freelancer', regex=True)


In [13]:
df['authors'].nunique()

36
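
The effect of these patterns can be checked on a few strings without pandas. The sketch below applies the same regular expressions one after another with `re.sub`. The helper name `normalize_author` and the sample strings are hypothetical, and applying the rules strictly in sequence is our assumption about the behaviour, not a description of pandas internals.

```python
import re

# (pattern, replacement) pairs copied from the cleaning cell above
AUTHOR_RULES = [
    (r'cbcs?\b.*', 'cbc'),
    (r'.*\bcbcs?', 'cbc'),
    (r'.*cbcnews.*', 'cbc'),
    (r'canadian?\b.*', 'canadian'),
    (r'.*\bcanadian?', 'canadian'),
    (r'freelancer?\b.*', 'freelancer'),
    (r'.*\bfreelancer?', 'freelancer'),
]

def normalize_author(name):
    """Hypothetical helper: apply each rule in order to one author string."""
    if name == 'the associated press':
        return 'associated press'
    for pattern, replacement in AUTHOR_RULES:
        name = re.sub(pattern, replacement, name)
    return name

# Made-up sample values, already lower-cased like the authors column
samples = ['cbc news', 'john smith, cbc', 'the canadian press', 'the associated press']
print([normalize_author(s) for s in samples])
```

Each pair of patterns works from both sides: the first drops everything after the keyword, the second drops everything before it, so only the keyword survives.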

### NLP Basics: Implementing a pipeline to clean text

### Pre-processing text data

Cleaning up the text data is necessary to highlight the attributes that we want our machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:
1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. **Lemmatize/Stem**

In [14]:
df.head()

Unnamed: 0,authors,text
0,cbc,Canadian pharmacies are limiting how much medi...
1,cbc,The Yukon government has identified two places...
2,associated press,The Senate late Wednesday passed an unparallel...
3,cbc,Scientists around the world are racing to find...
4,cbc,Trudeau says rules of Quarantine Act will ...


### Remove punctuation

In [15]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [16]:
def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

In [17]:
df['body_text_clean'] = df['text'].apply(lambda x: remove_punct(x))

In [18]:
df.head()

Unnamed: 0,authors,text,body_text_clean
0,cbc,Canadian pharmacies are limiting how much medi...,Canadian pharmacies are limiting how much medi...
1,cbc,The Yukon government has identified two places...,The Yukon government has identified two places...
2,associated press,The Senate late Wednesday passed an unparallel...,The Senate late Wednesday passed an unparallel...
3,cbc,Scientists around the world are racing to find...,Scientists around the world are racing to find...
4,cbc,Trudeau says rules of Quarantine Act will ...,Trudeau says rules of Quarantine Act will ...
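
Removing punctuation character by character also joins hyphenated and apostrophized words, which is why tokens such as `covid19`, `30day` and `dont` appear later in the notebook. The sketch below shows this with an alternative implementation based on `str.translate`; the name `remove_punct_fast` is hypothetical and the function is offered only as an equivalent option, not as the method used above.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punct_fast(text):
    """Hypothetical alternative to remove_punct using str.translate."""
    return text.translate(PUNCT_TABLE)

print(remove_punct_fast("COVID-19: a 30-day supply, so drugs don't run short."))
```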


### Test Tokenization

In [19]:
import re

In [20]:
def tokenize(text):
    tokens = re.split('\\W+', text)
    return tokens

In [21]:
df['body_text_tokenized'] = df['body_text_clean'].apply(lambda x: tokenize(x.lower()))

In [22]:
df.head()

Unnamed: 0,authors,text,body_text_clean,body_text_tokenized
0,cbc,Canadian pharmacies are limiting how much medi...,Canadian pharmacies are limiting how much medi...,"[canadian, pharmacies, are, limiting, how, muc..."
1,cbc,The Yukon government has identified two places...,The Yukon government has identified two places...,"[the, yukon, government, has, identified, two,..."
2,associated press,The Senate late Wednesday passed an unparallel...,The Senate late Wednesday passed an unparallel...,"[the, senate, late, wednesday, passed, an, unp..."
3,cbc,Scientists around the world are racing to find...,Scientists around the world are racing to find...,"[scientists, around, the, world, are, racing, ..."
4,cbc,Trudeau says rules of Quarantine Act will ...,Trudeau says rules of Quarantine Act will ...,"[, trudeau, says, rules, of, quarantine, act, ..."
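
Row 4 starts with an empty token because `re.split` returns an empty string whenever the text begins or ends with a separator. A small sketch of the effect and of one possible way around it is shown below; `tokenize_nonempty` is a hypothetical name and the sample string is made up.

```python
import re

def tokenize_nonempty(text):
    """Hypothetical variant of tokenize that never returns empty strings."""
    return re.findall(r'\w+', text)

sample = " trudeau says rules "
print(re.split(r'\W+', sample))    # split keeps '' at both ends
print(tokenize_nonempty(sample))   # findall returns only the words
```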


### Remove stopwords

In [23]:
import nltk

#from nltk.corpus import stopwords
#stopwords.words('english')

In [24]:
stopword = nltk.corpus.stopwords.words('english')

In [25]:
#stopword

In [26]:
def remove_stopwords(tokenized_list):
    text = [word for word in tokenized_list if word not in stopword]
    return text

In [27]:
df['body_text_nostop'] = df['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

### Supplemental Data Cleaning: Using Stemming

In [28]:
import nltk

ps = nltk.PorterStemmer()   

In [29]:
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

In [30]:
df['body_text_stemmed'] = df['body_text_nostop'].apply(lambda x: stemming(x))

In [31]:
df.head()

Unnamed: 0,authors,text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed
0,cbc,Canadian pharmacies are limiting how much medi...,Canadian pharmacies are limiting how much medi...,"[canadian, pharmacies, are, limiting, how, muc...","[canadian, pharmacies, limiting, much, medicat...","[canadian, pharmaci, limit, much, medic, dispe..."
1,cbc,The Yukon government has identified two places...,The Yukon government has identified two places...,"[the, yukon, government, has, identified, two,...","[yukon, government, identified, two, places, w...","[yukon, govern, identifi, two, place, whitehor..."
2,associated press,The Senate late Wednesday passed an unparallel...,The Senate late Wednesday passed an unparallel...,"[the, senate, late, wednesday, passed, an, unp...","[senate, late, wednesday, passed, unparalleled...","[senat, late, wednesday, pass, unparallel, 22,..."
3,cbc,Scientists around the world are racing to find...,Scientists around the world are racing to find...,"[scientists, around, the, world, are, racing, ...","[scientists, around, world, racing, find, nove...","[scientist, around, world, race, find, novel, ..."
4,cbc,Trudeau says rules of Quarantine Act will ...,Trudeau says rules of Quarantine Act will ...,"[, trudeau, says, rules, of, quarantine, act, ...","[, trudeau, says, rules, quarantine, act, enfo...","[, trudeau, say, rule, quarantin, act, enforc,..."


### Supplemental Data Cleaning: Using a Lemmatizer

In [32]:
# nltk.download()
import nltk

# https://wordnet.princeton.edu/
wn = nltk.WordNetLemmatizer()   
ps = nltk.PorterStemmer()

In [33]:
def lemmatizing(tokenized_text):
    text = [wn.lemmatize(word) for word in tokenized_text]
    return text

In [34]:
df['body_text_lemmatized'] = df['body_text_nostop'].apply(lambda x: lemmatizing(x))

In [35]:
df.head(2)

Unnamed: 0,authors,text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized
0,cbc,Canadian pharmacies are limiting how much medi...,Canadian pharmacies are limiting how much medi...,"[canadian, pharmacies, are, limiting, how, muc...","[canadian, pharmacies, limiting, much, medicat...","[canadian, pharmaci, limit, much, medic, dispe...","[canadian, pharmacy, limiting, much, medicatio..."
1,cbc,The Yukon government has identified two places...,The Yukon government has identified two places...,"[the, yukon, government, has, identified, two,...","[yukon, government, identified, two, places, w...","[yukon, govern, identifi, two, place, whitehor...","[yukon, government, identified, two, place, wh..."


In [36]:
len(df['body_text_lemmatized'])

3566

In [37]:
print("Total words in row 0:",len(df['body_text_lemmatized'][0]))
print("Total words in row 1:",len(df['body_text_lemmatized'][1]))

Total words in row 0: 242
Total words in row 1: 179


## Doc2Vec using Infer Vector

In [38]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [39]:
common_texts

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

In [40]:
df.head(1)

Unnamed: 0,authors,text,body_text_clean,body_text_tokenized,body_text_nostop,body_text_stemmed,body_text_lemmatized
0,cbc,Canadian pharmacies are limiting how much medi...,Canadian pharmacies are limiting how much medi...,"[canadian, pharmacies, are, limiting, how, muc...","[canadian, pharmacies, limiting, much, medicat...","[canadian, pharmaci, limit, much, medic, dispe...","[canadian, pharmacy, limiting, much, medicatio..."


### USE The Lemmatized Column 

In [41]:
tagged_data = [TaggedDocument(doc, [i]) for i, doc in enumerate(df['body_text_lemmatized'])]

In [42]:
tagged_data[0:2]

[TaggedDocument(words=['canadian', 'pharmacy', 'limiting', 'much', 'medication', 'dispensed', 'try', 'prevent', 'shortage', 'recognizing', 'active', 'ingredient', 'drug', 'come', 'india', 'china', 'medical', 'supply', 'chain', 'disrupted', 'spread', 'covid19', 'provincial', 'regulatory', 'college', 'complying', 'canadian', 'pharmacist', 'association', 'call', 'limit', 'amount', 'medication', 'given', 'patient', '30day', 'supply', 'goal', 'stop', 'people', 'refilling', 'prescription', 'early', 'ensure', 'lifesaving', 'drug', 'dont', 'run', 'short', 'supply', 'chain', 'vulnerable', 'mina', 'tadrous', 'pharmacist', 'researcher', 'toronto', 'monitor', 'pharmaceutical', 'supply', 'worried', 'canadian', 'start', 'stockpiling', 'drug', 'watching', 'unfolding', 'u', 'region', 'virus', 'spread', 'said', 'pharmacist', 'concerned', 'drug', 'lifesaving', 'inhaler', 'people', 'might', 'stockpile', 'based', 'misinformation', 'circulating', 'potential', 'treatment', 'covid19', 'relationship', 'people

# Manual Approach For Tuning The Model

* Uses a nested for-loop approach.

#### Step 1: Take sample data from wordsExcel.xlsx and pass the synonyms list to the String2vec function.

In [None]:
# Sample data

data = ["the quick brown fox",
        "Corona is spreading"]

In [None]:
# or

In [49]:
df_phrases = pd.read_excel('wordsExcel.xlsx')
data = df_phrases['Synonyms_List']

In [52]:
data.head()

0    Government, Administration, executive, regime,...
1    Healthy, Hefty, alright, in good shape, salubr...
2                  Family, menage, household, ancestry
3    Symptoms, Manifestation, indication, indicator...
4    Home, place, dwelling_house, menage, household...
Name: Synonyms_List, dtype: object

#### Step 2: User-defined function: String2vec()

    1. We use the Doc2Vec model to convert each sentence to a vector.
    2. The function takes the Excel data and computes similarity scores between every pair of sentences.
    3. We then display the hyperparameters of the best model, i.e. the one with the highest score that the
       String2vec() function returns.
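
The similarity score between two vectors u and v is the cosine of the angle between them, cos θ = (u · v) / (‖u‖ ‖v‖), where the denominator uses the norm of each of the two vectors. The sketch below works through that formula in plain Python; the function names are hypothetical and the vectors are made-up examples, not output of the model.

```python
import math

def cosine_similarity_pair(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain
    cos_theta = max(-1.0, min(1.0, cosine_similarity_pair(u, v)))
    return math.degrees(math.acos(cos_theta))

print(angle_degrees([1.0, 0.0], [0.0, 1.0]))   # perpendicular vectors
print(angle_degrees([1.0, 2.0], [2.0, 4.0]))   # parallel vectors
```

With this normalisation every score lies between -1 and 1, so a sum of squares over n × n pairs can never exceed n².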


In [55]:
from sklearn.metrics.pairwise import cosine_similarity
import math

def clean_cos(cos_angle):
    return min(1,max(cos_angle,-1))

def String2vec(data, new_model):
    
    print("\n Started") 
    model = new_model
    
    vectors = []
    pair_list = []
    cosine_degree = []
    dot_product_score = []

    #to find the vector of a document which is not in training data
    for i in data:
        test_data = word_tokenize(i.lower())
        #print("Tokenized data:  ",test_data)
        vec = model.infer_vector(test_data)  ### here we are using model and calling the test data(which is 50 excel words) 
        vectors.append(vec)
    
    for i in range(len(vectors)):
        for j in range(len(vectors)):
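            vec1 = vectors[i]
            vec2 = vectors[j]
            # Cosine similarity: the dot product divided by the product of the two vector magnitudes
            # (the outputs recorded below were produced with norm(vec2) used twice, so their scores are not bounded by 1)
            sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))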
            #print("Similarity of \"{}\" and \"{}\" is {}" .format(data[i],data[j],sim))
            dot_product_score.append(sim)
            
            pair = data[i],data[j]
            pair_list.append(pair)
            
            cos_sim = clean_cos(sim)
            angle_in_radians = math.acos(cos_sim)
            #print("Degrees: ",math.degrees(angle_in_radians))
            cosine_degree.append(math.degrees(angle_in_radians))

    
    print("\n")
    print(pd.DataFrame({'pair': pair_list, 'similarity_degrees': cosine_degree, 'dot_product_score': dot_product_score}))
    
    res = sum(map(lambda i : i * i, dot_product_score))
    print("Sum of squares scores: ", res)
    
    return (res)
            

### Step 3: FOR LOOP or Manual Approach For Model Building

    * Consider two hyperparameters, vector_size and min_alpha, and choose several candidate values for each.
    * Create a model for each (vector_size, min_alpha) pair and pass it to the String2vec() function.
    * Identify the vector_size and min_alpha of the model that has the highest score.
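
The search described above can be sketched independently of Doc2Vec. In the example below, `best_combination` and `toy_score` are hypothetical names; `toy_score` is a stand-in that we assume in place of training a model and calling String2vec(), so the numbers it produces mean nothing for the real data.

```python
from itertools import product

def best_combination(vector_sizes, min_alphas, score_fn):
    """Return (best_score, vector_size, min_alpha) over every pair of values."""
    best = None
    for vec_size, alpha in product(vector_sizes, min_alphas):
        score = score_fn(vec_size, alpha)
        if best is None or score >= best[0]:
            best = (score, vec_size, alpha)
    return best

# Stand-in scorer (an assumption); in the notebook this would train a Doc2Vec
# model with the given values and return String2vec(data, model).
def toy_score(vec_size, alpha):
    return -(vec_size - 10) ** 2 - (alpha - 0.2) ** 2

print(best_combination([2, 5, 10, 20], [0.50, 0.075, 0.1, 0.2], toy_score))
```

`itertools.product` visits the same 4 × 4 = 16 combinations as two nested loops, and starting from `best = None` avoids assuming that every score is non-negative.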


In [57]:
# #for loop - one iteration of this loop is one iteration of GridSearch for parameter tuning
# 4 * 4 values => 16 iterations . 0 to 15

min_alpha = [0.50, 0.075, 0.1, 0.2]
vector_size = [2,5,10,20]               # 50, 150, 250, 300

#min_alpha = [0.1, 0.2]
#vector_size = [10,20]

max_square = 0
Iteration_number = 0

# Change variable names.
for i in vector_size:
    for j in min_alpha:
        new_model = Doc2Vec(tagged_data, vector_size=i, min_alpha= j, window=2, min_count=4, workers=5, epochs =10)

        print("\n Iteration", Iteration_number)
        
        sum_of_squares= String2vec(data, new_model)  # Function call
        
        print("********* Hyperparameter values, sum & Mean metrics for this model are ******** ")
        print("Vector Size: ", i)
        print("min_alpha: ", j)
        print("Sum of Squares: ", sum_of_squares)
        
        Iteration_number = Iteration_number+1
        
        
        #To find the best parameters for the most accurate model
        if (sum_of_squares >= max_square):
            max_square = sum_of_squares
            vecsize = i
            alpha_value = j
            
        print("##########  Next loop    ###############")

print("\n All Iterations Completed")            
print("Best sum of squares : ",max_square)
print("Best vec size: ",vecsize)
print("Best Alpha value: ",alpha_value)
print("\n")


 Iteration 0

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.019782   
1     (Government, Administration, executive, regime...            0.000000   
2     (Government, Administration, executive, regime...           58.434756   
3     (Government, Administration, executive, regime...          180.000000   
4     (Government, Administration, executive, regime...            0.000000   
5     (Government, Administration, executive, regime...           30.823314   
6     (Government, Administration, executive, regime...            0.000000   
7     (Government, Administration, executive, regime...            0.000000   
8     (Government, Administration, executive, regime...            0.000000   
9     (Government, Administration, executive, regime...            0.000000   
10    (Government, Administration, executive, regime...           85.089871   
11    (Government, Adminis


 Iteration 2

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.019782   
1     (Government, Administration, executive, regime...           46.494851   
2     (Government, Administration, executive, regime...           39.520324   
3     (Government, Administration, executive, regime...            0.000000   
4     (Government, Administration, executive, regime...            0.000000   
5     (Government, Administration, executive, regime...            0.000000   
6     (Government, Administration, executive, regime...          180.000000   
7     (Government, Administration, executive, regime...            0.000000   
8     (Government, Administration, executive, regime...           89.185086   
9     (Government, Administration, executive, regime...            0.000000   
10    (Government, Administration, executive, regime...           30.542771   
11    (Government, Adminis


 Iteration 4

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.027976   
1     (Government, Administration, executive, regime...           92.507796   
2     (Government, Administration, executive, regime...            0.000000   
3     (Government, Administration, executive, regime...           17.758255   
4     (Government, Administration, executive, regime...           29.567227   
5     (Government, Administration, executive, regime...            0.000000   
6     (Government, Administration, executive, regime...          141.684917   
7     (Government, Administration, executive, regime...            0.000000   
8     (Government, Administration, executive, regime...           61.884070   
9     (Government, Administration, executive, regime...           60.738541   
10    (Government, Administration, executive, regime...          107.283794   
11    (Government, Adminis


 Iteration 6

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.000000   
1     (Government, Administration, executive, regime...           76.495815   
2     (Government, Administration, executive, regime...           36.346388   
3     (Government, Administration, executive, regime...           46.515861   
4     (Government, Administration, executive, regime...           55.798857   
5     (Government, Administration, executive, regime...           27.303933   
6     (Government, Administration, executive, regime...           80.845005   
7     (Government, Administration, executive, regime...           55.663655   
8     (Government, Administration, executive, regime...           71.793347   
9     (Government, Administration, executive, regime...          115.731658   
10    (Government, Administration, executive, regime...           32.515095   
11    (Government, Adminis


 Iteration 8

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.000000   
1     (Government, Administration, executive, regime...          107.997871   
2     (Government, Administration, executive, regime...           84.641221   
3     (Government, Administration, executive, regime...           89.837773   
4     (Government, Administration, executive, regime...           89.693130   
5     (Government, Administration, executive, regime...           89.838442   
6     (Government, Administration, executive, regime...           90.665948   
7     (Government, Administration, executive, regime...           80.690920   
8     (Government, Administration, executive, regime...           90.022510   
9     (Government, Administration, executive, regime...           89.710079   
10    (Government, Administration, executive, regime...           98.672009   
11    (Government, Adminis


 Iteration 10

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.000000   
1     (Government, Administration, executive, regime...           65.379294   
2     (Government, Administration, executive, regime...           37.103611   
3     (Government, Administration, executive, regime...           91.407044   
4     (Government, Administration, executive, regime...           58.700441   
5     (Government, Administration, executive, regime...           65.989086   
6     (Government, Administration, executive, regime...           66.855612   
7     (Government, Administration, executive, regime...           59.497934   
8     (Government, Administration, executive, regime...           77.015568   
9     (Government, Administration, executive, regime...          114.092071   
10    (Government, Administration, executive, regime...           60.984131   
11    (Government, Admini


 Iteration 12

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.000000   
1     (Government, Administration, executive, regime...          107.593531   
2     (Government, Administration, executive, regime...           95.691158   
3     (Government, Administration, executive, regime...          102.793533   
4     (Government, Administration, executive, regime...           86.648190   
5     (Government, Administration, executive, regime...           86.357035   
6     (Government, Administration, executive, regime...           89.857966   
7     (Government, Administration, executive, regime...           89.755834   
8     (Government, Administration, executive, regime...           89.892782   
9     (Government, Administration, executive, regime...           81.798999   
10    (Government, Administration, executive, regime...           89.954312   
11    (Government, Admini


 Iteration 14

 Started


                                                   pair  similarity_degrees  \
0     (Government, Administration, executive, regime...            0.027976   
1     (Government, Administration, executive, regime...           72.021744   
2     (Government, Administration, executive, regime...            0.000000   
3     (Government, Administration, executive, regime...           91.649723   
4     (Government, Administration, executive, regime...           65.347660   
5     (Government, Administration, executive, regime...           62.386526   
6     (Government, Administration, executive, regime...           70.560552   
7     (Government, Administration, executive, regime...           85.669090   
8     (Government, Administration, executive, regime...           81.663857   
9     (Government, Administration, executive, regime...          103.112008   
10    (Government, Administration, executive, regime...           41.457379   
11    (Government, Admini

Total : 16 Iterations.  All Iterations Completed
    
Best sum of squares :  265271.4051018052
Best vec size:  10
Best Alpha value:  0.5  
    

# GRID SEARCH METHOD FOR TUNING THE MODEL


Reference Links:

https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/
    
grid_search: run a function iteratively using a grid search approach (available in R)
https://rdrr.io/cran/paramtest/man/grid_search.html

https://stackoverflow.com/questions/38064637/pass-estimator-to-custom-score-function-via-sklearn-metrics-make-scorer

Feed parameters to the scoring function



In [64]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [None]:
# Created a separate function mb_gridsearch() to build a model.

In [66]:
def mb_gridsearch(vec_size, min_alpha=0.5):
    # Build one Doc2Vec model for the given hyperparameter values
    print("\n Entered the function to build the model for each set of hyperparameters: ")
    print("Vector size:", vec_size, "min_alpha:", min_alpha)
    
    model = Doc2Vec(tagged_data, vector_size=vec_size, min_alpha=min_alpha, window=2, min_count=4, workers=5, epochs=3)
    
    return model

In [67]:
vector_size = [2,5,10]

# param_grid = dict(vector_size=[2,5,10])

param_grid = {"min_alpha": [0.50, 0.075, 0.1, 0.2],
              "vector_size": [2, 5, 10, 20]}

In [70]:
gs = GridSearchCV(
        estimator= mb_gridsearch, # mb_gridsearch(param_grid)
        param_grid = param_grid,
        cv=5, 
        n_jobs=-1, 
        scoring=String2vec_new(data),  # The model returned by the estimator has to be used in the String2vec function.
        verbose=2
    )




 Started


NameError: name 'model' is not defined

* Currently stuck at creating a different model for each combination of vector size and alpha value in the grid search.

* Each model built from those hyperparameters is used in the String2vec() function to convert the Excel data or random data to vectors using the infer_vector method.
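
One likely cause of the NameError is that `scoring=String2vec_new(data)` calls the function immediately, before any model exists, instead of handing GridSearchCV something it can call later. As far as we can tell, GridSearchCV also expects a scikit-learn style estimator object rather than a plain function. A possible way forward, sketched below under those assumptions, is to keep the search outside scikit-learn and pass each model to the scorer explicitly, as String2vec(data, new_model) already does. All names here (`expand_grid`, `search`, `toy_build`, `toy_score`) are hypothetical stand-ins so the example runs on its own.

```python
from itertools import product

def expand_grid(param_grid):
    """Hypothetical helper: yield one dict per combination in param_grid."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def search(build_model, score_model, param_grid, data):
    """Build one model per combination and keep the best-scoring one."""
    best_score, best_params = None, None
    for params in expand_grid(param_grid):
        model = build_model(**params)       # the model is created here ...
        score = score_model(data, model)    # ... and handed to the scorer
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Stand-ins so the sketch runs on its own (assumptions, not the real model)
def toy_build(vector_size, min_alpha):
    return {'vector_size': vector_size, 'min_alpha': min_alpha}

def toy_score(data, model):
    return len(data) * model['vector_size'] - model['min_alpha']

grid = {'min_alpha': [0.50, 0.075, 0.1, 0.2], 'vector_size': [2, 5, 10, 20]}
print(search(toy_build, toy_score, grid, data=['a', 'b']))
```

In the notebook, `toy_build` would correspond to a function that returns a trained Doc2Vec model and `toy_score` to String2vec().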



In [71]:
def String2vec_new(data):
   
    print("\n Started") 
    
    #model = 
    
    vectors = []
    pair_list = []
    cosine_degree = []
    dot_product_score = []

    for i in data:
        test_data = word_tokenize(i.lower())
    
        vec = model.infer_vector(test_data)  ### here we are using model and calling the test data(which is 50 excel words) 
        vectors.append(vec)
    
    for i in range(len(vectors)):
        for j in range(len(vectors)):
            vec1 = vectors[i]
            vec2 = vectors[j]
            # Cosine similarity: the dot product divided by the product of the two vector magnitudes
            sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
            #print("Similarity of \"{}\" and \"{}\" is {}" .format(data[i],data[j],sim))
            dot_product_score.append(sim)
            
            pair = data[i],data[j]
            pair_list.append(pair)
            
            cos_sim = clean_cos(sim)
            angle_in_radians = math.acos(cos_sim)
            #print("Degrees: ",math.degrees(angle_in_radians))
            cosine_degree.append(math.degrees(angle_in_radians))

    
    print("\n")
    print(pd.DataFrame({'pair': pair_list, 'similarity_degrees': cosine_degree, 'dot_product_score': dot_product_score}))
    
    res = sum(map(lambda i : i * i, dot_product_score))
    print("Sum of squares scores: ", res)
    
    return res

### TRIALS

In [65]:
log_grid = GridSearchCV(Doc2Vec(tagged_data),
                        param_grid= { "min_alpha" : [0.50, 0.075, 0.1, 0.2],
                                      "vector_size" : [2,5,10,20] },
                        scoring=String2vec_new(data), 
                        cv =5)

# How to pass the model to the String2vec_new function?


 Started


NameError: name 'model' is not defined

In [None]:
log_grid.scoring  # Should display the sum of squares score :  265271.4051018052

In [None]:
log_grid.score

In [None]:
# Best parameters
print("Best Parameters: {}\n".format(log_grid.best_params_))
print("Best accuracy: {}\n".format(log_grid.best_score_))
print("Finished.")