**Semantic Similarity Scores**

---



This is a colab notebook to obtain the cosine similarity scores for 200 random words, chosen from words from the Google News Dataset. 

- Setting up the gensim module from NLTK




In [None]:
from nltk.test.gensim_fixt import setup_module
setup_module()

- Importing all the necessary modules

In [None]:
import numpy as np
import pandas as pd
import gensim
import csv
import random

In this notebook, we utilise a pre-trained model based on 10 billion words from the google news dataset

- After loading the pre-trained model, we obtain a list of the unique words using the dot vocab method

In [None]:
from nltk.data import find
import nltk
nltk.download('word2vec_sample')
word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample, binary=False)

# contains the list of all unique words in pre-trained word2vec vectors
w2v_vocabulary = word2vec.vocab

[nltk_data] Downloading package word2vec_sample to /root/nltk_data...
[nltk_data]   Unzipping models/word2vec_sample.zip.


- Having obtained this list, we then move to choosing the first 200 words to implement our methods on them

In [None]:
count=0
w2v_chosenwords=[]
for i in w2v_vocabulary:
  w2v_chosenwords.append(i)
  count+=1
  if count>200:
    break
print("The chosen words:",w2v_chosenwords)

The chosen words: ['fawn', 'deferment', 'Debts', 'Poetry', 'woods', 'clotted', 'hanging', 'hastily', 'comically', 'localized', 'spidery', 'disobeying', 'Adjusting', 'originality', 'Journey', 'mutinies', 'Western', 'alphabetic', 'Gravesend', 'Elec', 'slothful', 'wracked', 'Valle', 'Famed', 'stipulate', 'pigment', 'appropriation', 'rawhide', 'strictest', 'screaming', 'wooded', 'liaisons', 'broiler', 'wooden', 'Pergamon', 'Loeb', 'Sack', 'broiled', 'circuitry', 'deferments', 'resounds', 'Colonialism', 'gaskets', 'scrapes', 'precocity', 'Shocked', 'feasibility', 'miniatures', 'deadheads', 'mortgages', 'sustaining', 'consenting', 'Honorable', 'Pampa', 'scraped', 'snuggled', 'inanimate', 'errors', 'semicircular', 'tiered', 'Initially', 'cooking', 'Hamilton', 'outfielders', 'Niagara', 'hallucinating', 'succumb', 'shocks', 'crouch', 'chins', 'Foundation', 'jubilantly', 'zlotys', 'mailings', 'perforations', 'affiliates', 'perfunctorily', 'china', 'affiliated', 'Footnotes', 'confronts', 'doldrum

*Information to note: Dimension of the word vectors of the words is 300, which needs to be noted*

Defining a few functions, which are necessary for the problem:
- Mean & Standard Deviation calculator
- Random Pair Generator

In [None]:
def mean_var(arr):
  temp=arr.shape
  if len(temp)==1:
    sum=0
    for i in range(temp[0]):
      sum+=arr[i]
    mean=sum/temp[0]
    var=0
    for i in range(temp[0]):
      var+=(arr[i]-mean)**2
    std=(var/temp[0])**0.5
    return (mean,std)

In [None]:
def random_pairs(arr):
    x=[] 
    for j in range(int(len(arr)/2)):
      x.append([arr[i] for i in random.sample(range(len(arr)), 2)])
    return x

Beyond this, we need to generate a dataframe or a csv file with columns as -
- Serial Number
- word2vec similarity
- Dot Product calculated similarity
- PCA value similarity(dimension=4)
- t-SNE value similarity
- Newly Proposed method similarity

In [None]:
dataframe = pd.DataFrame(columns =['Serial Number'], dtype = int) 

In [None]:
dataframe

Unnamed: 0,Serial Number


*To choose n/2 pairs randomly from the chosen n sample points*

In [None]:
w2v_randompairs=random_pairs(w2v_chosenwords)

In [None]:
print(w2v_randompairs)

[['precocity', 'tomes'], ['natures', 'Valle'], ['Hoping', 'consenting'], ['Abbott', 'slothful'], ['Isles', 'catchy'], ['Hoping', 'chins'], ['Matunuck', 'dinosaurs'], ['Loeb', 'Tippecanoe'], ['Footnotes', 'Shocked'], ['Colombia', 'Sack'], ['morally', 'Famed'], ['secede', 'leisurely'], ['music', 'succumb'], ['wooden', 'wickedly'], ['fig', 'Archuleta'], ['leadings', 'modest'], ['hallucinating', 'Copp'], ['spotty', 'liaisons'], ['Robertson', 'His'], ['broiled', 'natures'], ['Western', 'Willa'], ['woods', 'secede'], ['Pampa', 'Archuleta'], ['Elec', 'gaskets'], ['Tippecanoe', 'Farther'], ['anticipations', 'affiliates'], ['Valle', 'sermons'], ['Abbott', 'Harvey'], ['kids', 'tomes'], ['sermons', 'natures'], ['Foundation', 'raggedness'], ['Debts', 'hallucinating'], ['turnouts', 'Sack'], ['raggedness', 'playback'], ['dinosaurs', 'kingdoms'], ['consenting', 'Famed'], ['jubilantly', 'appropriation'], ['Harvey', 'dinosaurs'], ['shocks', 'stipulate'], ['exuberantly', 'Niagara'], ['modest', 'sentenci

Similarity Values are values calculated as cosine similarity between the vectors of two words, which means a total of 100 Similarity values will be generated from the 100 pairs

In [None]:
serial=[]
for i in range(1,101):
  serial.append(i)
dataframe['Serial Number']=serial

In [None]:
dataframe

Unnamed: 0,Serial Number
0,1
1,2
2,3
3,4
4,5
...,...
95,96
96,97
97,98
98,99


**Method 1:**
The first column is to be filled with the recorded word2vec values, which will be used as a standard for comparison

In [None]:
w2v_values=[]
for i in w2v_randompairs:
  w2v_values.append(word2vec.wv.similarity(i[0],i[1]))
dataframe['w2v_idealvalues']=w2v_values

  w2v_values.append(word2vec.wv.similarity(i[0],i[1]))


In [None]:
dataframe

Unnamed: 0,Serial Number,w2v_idealvalues
0,1,0.244612
1,2,-0.008264
2,3,-0.073541
3,4,0.023472
4,5,0.052351
...,...,...
95,96,0.219348
96,97,0.125567
97,98,0.038487
98,99,0.109628


**Method 2:**
The second column is to be filled with the dot products of two word vectors, to compare the closeness of the values

In [None]:
dotprod_values=[]
for i in w2v_randompairs:
  vector_1=word2vec.wv[i[0]]
  vector_2=word2vec.wv[i[1]]
  value_calc=np.dot(vector_1,vector_2)
  dotprod_values.append(value_calc)
dataframe['dotprod_calcvalues']=dotprod_values

  vector_1=word2vec.wv[i[0]]
  vector_2=word2vec.wv[i[1]]


In [None]:
dataframe

Unnamed: 0,Serial Number,w2v_idealvalues,dotprod_calcvalues
0,1,0.244612,0.244612
1,2,-0.008264,-0.008264
2,3,-0.073541,-0.073541
3,4,0.023472,0.023472
4,5,0.052351,0.052351
...,...,...,...
95,96,0.219348,0.219348
96,97,0.125567,0.125567
97,98,0.038487,0.038487
98,99,0.109628,0.109628


**Method 3:**
The third column is to be filled with values obtained from the word vectors after **Principal Component Analysis**, with reduced dimensions being restricted to 4

In [None]:
count = 0
max_count = 200
X = np.zeros(shape=(300,max_count))
X=X.T

In [None]:
c=0
for i in w2v_randompairs:
  X[c,:]=(word2vec.wv[i[0]])
  X[c+1,:]=(word2vec.wv[i[1]])
  c+=1
  if c>198:
    break
print(X)
X.shape

[[ 0.0984045   0.0250404   0.0462735  ... -0.0164007  -0.0436377
   0.0401233 ]
 [ 0.0217004   0.0548711   0.0385958  ...  0.0465009   0.0375108
   0.00484385]
 [ 0.0894125   0.15895499  0.0601927  ...  0.0363786   0.0312651
   0.0382779 ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]


  X[c,:]=(word2vec.wv[i[0]])
  X[c+1,:]=(word2vec.wv[i[1]])


(200, 300)

- Implement PCA from the scikit library

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
X_4 = pca.fit_transform(X)

In [None]:
X_4.shape

(200, 4)

In [None]:
pca_values=[]
for i in range(0,X_4.shape[0]-1,2):
  pca_vector1=(X_4[i,:])
  pca_vector2=(X_4[i+1,:])
  pca_value=np.dot(pca_vector1,pca_vector2)
  pca_values.append(pca_value)

In [None]:
dataframe['pca_values']=pca_values

In [None]:
dataframe

Unnamed: 0,Serial Number,w2v_idealvalues,dotprod_calcvalues,pca_values
0,1,0.244612,0.244612,0.168238
1,2,-0.008264,-0.008264,-0.002253
2,3,-0.073541,-0.073541,-0.026221
3,4,0.023472,0.023472,0.033245
4,5,0.052351,0.052351,0.008576
...,...,...,...,...
95,96,0.219348,0.219348,0.014619
96,97,0.125567,0.125567,0.014619
97,98,0.038487,0.038487,0.014619
98,99,0.109628,0.109628,0.014619


**Method 4:**
The fourth column is to be filled with values obtained from the word vectors after **t-SNE** is applied, with reduced dimensions being restricted to 3

In [None]:
from sklearn.manifold import TSNE
t_sne = TSNE(n_components=3, learning_rate='auto',init='random')
X_embedded= t_sne.fit_transform(X)

In [None]:
X_embedded.shape

(200, 3)

In [None]:
tSNE_values=[]
for i in range(0,X_embedded.shape[0]-1,2):
  tSNE_vector1=(X_embedded[i,:])
  tSNE_vector2=(X_embedded[i+1,:])
  tSNE_value=np.dot(tSNE_vector1,tSNE_vector2)
  tSNE_values.append(tSNE_value)

In [None]:
dataframe['tSNE_values']=tSNE_values
dataframe

Unnamed: 0,Serial Number,w2v_idealvalues,dotprod_calcvalues,pca_values,tSNE_values
0,1,0.244612,0.244612,0.168238,-6818.879883
1,2,-0.008264,-0.008264,-0.002253,-1308.133423
2,3,-0.073541,-0.073541,-0.026221,1580.192749
3,4,0.023472,0.023472,0.033245,3360.744873
4,5,0.052351,0.052351,0.008576,6854.895020
...,...,...,...,...,...
95,96,0.219348,0.219348,0.014619,2119.205078
96,97,0.125567,0.125567,0.014619,2485.828613
97,98,0.038487,0.038487,0.014619,3684.112305
98,99,0.109628,0.109628,0.014619,6384.616699


*Temporary Checkpoint - downloading the dataframe*

In [None]:
dataframe.to_csv('tablecheck.csv')

**Method 5:**
The fifth column is to be filled with values obtained from the word vectors after the proposed method is applied.

Consideration:
- **X has dimensions (200,300)** following which:
  - Calculate Mean and Standard Deviation, of each column as C(mu,j) and C(sigma,j).
  - Convert each element C(ij) of a column to a normalised form 
     C'(ij)= [C(ij)-C(mu,j)]/C(sigma,j)

In [None]:
mean_varlist=[]
for i in range(300):
  arr=X[:,i]
  mean_varlist.append(mean_var(arr))

print(mean_varlist)

[(0.008380936817266047, 0.03995530227615192), (0.00665860119392164, 0.04262847032750476), (-0.0032147553423419596, 0.03944598559422028), (0.011306474330995115, 0.03803074488813927), (-0.00848831161230919, 0.0394444042804869), (-0.005759209812385962, 0.04070623090277321), (0.002803726104903035, 0.03870883350858176), (-0.005511106261983514, 0.04010234676472784), (0.0045630796113982795, 0.03835901814874673), (0.01219394372048555, 0.04100669472776521), (-0.0035543095745379106, 0.0394499975971345), (-0.01313124586304184, 0.04152263353548587), (-0.005585998305723479, 0.03698157922757588), (0.011893395003862679, 0.041797032453394466), (-0.014396477361151483, 0.04084891258633147), (0.007945007443286158, 0.04536329916816473), (0.0008274009672459215, 0.043023555276288226), (0.01952604287333088, 0.0439566854812804), (-0.0011458639305783436, 0.04124196278294247), (-0.008616843527997844, 0.045406221629927176), (0.010854613361880183, 0.03821572001332416), (0.005531250337771781, 0.04343760081030745),

In [None]:
len(mean_varlist)

300

In [None]:
copy=X.copy()

In [None]:
for i in range(300):
  arr=X[:,i]
  mean_vartup=mean_varlist[i]
  arr=(arr-mean_vartup[0])/(mean_vartup[1])
  copy[:,i]=arr
print(copy)
print(copy.shape)

[[ 2.25310672  0.43120942  1.2545828  ... -0.19344246 -1.36156373
   0.68968423]
 [ 0.33335911  1.13099296  1.05994448 ...  1.4279985   0.74289443
  -0.19147851]
 [ 2.02805539  3.57264501  1.60745016 ...  1.16707174  0.580922
   0.64359238]
 ...
 [-0.20975781 -0.1562008   0.08149766 ...  0.22932528 -0.22988893
  -0.31246171]
 [-0.20975781 -0.1562008   0.08149766 ...  0.22932528 -0.22988893
  -0.31246171]
 [-0.20975781 -0.1562008   0.08149766 ...  0.22932528 -0.22988893
  -0.31246171]]
(200, 300)


In [None]:

type(X)

numpy.ndarray

In [None]:
print(X)

[[ 0.0984045   0.0250404   0.0462735  ... -0.0164007  -0.0436377
   0.0401233 ]
 [ 0.0217004   0.0548711   0.0385958  ...  0.0465009   0.0375108
   0.00484385]
 [ 0.0894125   0.15895499  0.0601927  ...  0.0363786   0.0312651
   0.0382779 ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]]


copy is the normalized X matrix

- Calculation of Moments is necessary:
  - For an n-th moment, each element along a row, should be raised to the power n.
  - Beyond this, all the rows elements are added along a column, followed by average of the 200 elements. 



In [None]:
def moment(n,copy):
  copy_new=copy.copy()
  for i in range(200):
    for j in range(300):
      copy_new[i,j]=copy[i,j]**n
  temp=[]
  for i in range(300):
    temp.append(np.sum(copy_new[:,i])/200)
  return temp

In [None]:
moment_0=moment(0,copy)
print(moment_0)

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,

In [None]:
moment_1=moment(1,copy)
print(moment_1)

[-3.552713678800501e-17, -8.881784197001253e-18, 6.661338147750939e-18, 0.0, -5.329070518200751e-17, 1.7763568394002505e-17, -4.4408920985006264e-17, 5.329070518200751e-17, 1.7763568394002505e-17, 1.7763568394002505e-17, 1.3322676295501878e-17, 3.552713678800501e-17, 2.6645352591003756e-17, 3.552713678800501e-17, 0.0, 0.0, -3.1086244689504386e-17, 7.105427357601002e-17, -1.9984014443252817e-17, -1.7763568394002505e-17, 7.105427357601002e-17, -1.7763568394002505e-17, 0.0, 0.0, -1.7763568394002505e-17, 8.881784197001253e-18, 7.105427357601002e-17, 0.0, 1.7763568394002505e-17, 8.881784197001253e-18, 1.7763568394002505e-17, 3.552713678800501e-17, 0.0, -4.4408920985006264e-17, 4.4408920985006264e-17, 0.0, -3.552713678800501e-17, -2.55351295663786e-17, -4.4408920985006264e-17, -8.881784197001253e-18, -5.329070518200751e-17, -8.881784197001253e-18, -1.7763568394002505e-17, 2.7755575615628914e-17, -1.3322676295501878e-17, -1.7763568394002505e-17, -3.552713678800501e-17, 0.0, 1.3877787807814457

In [None]:
moment_2=moment(2,copy)
print(moment_2)

[0.9999999999999973, 1.000000000000006, 0.9999999999999933, 0.999999999999992, 0.9999999999999993, 0.9999999999999932, 0.9999999999999984, 1.0000000000000067, 0.9999999999999993, 0.9999999999999991, 1.0000000000000038, 1.0000000000000049, 1.0000000000000095, 1.0000000000000033, 0.9999999999999991, 0.9999999999999954, 1.0000000000000058, 0.9999999999999943, 0.9999999999999979, 1.000000000000005, 1.0000000000000067, 1.0000000000000018, 0.9999999999999951, 0.9999999999999969, 1.0000000000000038, 1.0000000000000062, 1.0000000000000062, 1.0000000000000064, 1.0000000000000058, 1.0000000000000062, 0.9999999999999963, 0.9999999999999923, 0.9999999999999991, 1.0000000000000044, 0.9999999999999923, 0.9999999999999978, 1.0000000000000007, 1.0000000000000109, 0.9999999999999966, 0.9999999999999953, 1.0000000000000058, 0.9999999999999987, 1.0000000000000038, 0.9999999999999951, 0.9999999999999987, 0.9999999999999949, 1.0000000000000029, 0.9999999999999943, 0.9999999999999987, 0.9999999999999944, 1.

In [None]:
moment_3=moment(3,copy)
print(moment_3)

[-0.2835414891120153, 0.49432589254468146, -0.13225382918151904, 0.7279778061340716, -0.5243486712160093, -1.4351038848434516, 0.6212183996194699, -0.42332981421639615, 0.2293591785502681, 1.177901941041845, -0.9894541731442302, -0.5046259108569771, -0.46457935469681816, 0.3855678064328819, -0.7389101454095908, 1.2889767101854803, 0.4297734736397492, 1.4616319391517072, -1.2128818141415212, -0.6527051226167069, 0.24292684875081935, 0.1369453800404718, 1.639780130110642, 0.7975270878561096, 0.0006641147250644508, -0.43391776852573855, -0.96188971210814, 0.6377397444318462, -0.14565221857371224, 0.04265823960211407, -0.9187313657727101, -0.17828131573734818, -0.5064012190261609, 0.4929049548779748, -0.7153449455103036, -0.4111145284565515, 0.9032119559068035, -0.12674856691615285, 0.42601048718715534, 0.33962693048298415, 0.6881103288744964, -0.23121563335787132, 1.0329298824072164, -0.5031418894107781, -0.4149089588055205, -0.14084066766324155, -0.9114432998120807, -0.444409327536429, -

In [None]:
moment_4=moment(4,copy)
print(moment_4)

[6.087343789007151, 5.92209524489574, 5.115434864445599, 5.251530592259645, 6.743850272015532, 8.315992317987691, 5.800507333524332, 4.939963650202792, 6.100705826000743, 4.771309505008793, 6.679516709778978, 4.0887931965608715, 4.559242172658963, 4.888866308324492, 4.3247965470833005, 6.472834215353412, 5.796534866423837, 5.667716252723444, 7.635405428283494, 5.1965920021717125, 6.452034061713193, 6.051549940809083, 10.721804096099756, 6.728196641584597, 4.729711680777388, 6.3521125679490105, 3.9313800751569987, 3.733799541850544, 5.160169945506702, 5.322586683391766, 6.177808905854128, 4.331388781339455, 7.013047632480591, 5.017875363781007, 5.484129064980964, 4.634095709456786, 5.0497314974151495, 5.841395475471685, 5.5544018145423415, 5.253444098101474, 4.357783532397097, 5.561794448523817, 5.259787462317336, 4.715760126464284, 7.435999480457375, 5.267365800779171, 5.847232637807639, 5.193744224494107, 5.521821193442769, 4.2495076496725614, 5.797044425609511, 4.865030308251156, 5.2

In [None]:
moment_5=moment(5,copy)
print(moment_5)

[-10.816086496904559, 6.093208013635609, -0.21394164230140347, 6.777161854607471, -1.5143453642678415, -28.494243147259564, 9.043644156516997, -5.468345486902541, -0.43735576568051243, 11.262671183819169, -14.534463518766602, -1.259627820302759, -3.221914987865949, 1.1442576741612032, -5.552587497752878, 17.520508555265994, 7.156981630512777, 15.888437550024896, -22.807504788372295, -4.550723884461002, -0.7671078137982079, 1.960418501954049, 42.31126026354589, 15.704312757438736, -1.3898478048279295, 5.75166408553467, -5.842343948752954, 3.819476079907261, -4.907859852934383, 0.061656472962503414, -12.973114953964473, 0.7396686979459126, -5.815184433565939, 4.746340098342995, -7.911410654700288, -2.5703315428428537, 7.917095565157089, -5.152666127322221, 1.9555648749910461, 3.211417974157428, 6.560592526018072, -3.902892123329447, 10.148834810181793, -5.2508502102378305, -8.06465780737825, -1.8927545051110277, -11.59003140398758, -5.737273792818144, -5.2017032300442985, -0.669647742893

In [None]:
moment_6=moment(6,copy)
print(moment_6)

[66.46260658512686, 53.78449987218898, 39.70207928862371, 45.49730179204669, 72.70853900565649, 147.30251825157072, 52.42010371951245, 36.57380677153351, 55.98922745050079, 41.757996672599695, 76.10848616266216, 22.71161483192961, 31.86368411574503, 39.91263382644186, 30.3237087864687, 73.05920683481412, 50.31218159908913, 57.112065385767764, 109.26745114231407, 37.863234510506665, 84.00743926835668, 57.18900482596776, 236.9445242672751, 87.07054176320077, 33.856589063273745, 80.00757168459481, 22.84209705024469, 21.488982007908476, 41.914386766472695, 38.81405458949293, 63.221655478287815, 27.713289390218925, 80.22055520593493, 35.23133217938317, 43.19074105285611, 28.996081401611946, 41.5047151032397, 55.34623932079018, 49.78794455298757, 39.94110248849074, 30.945413879570868, 52.8979210464618, 44.99576450521312, 33.008701548339474, 91.94921936814666, 41.11455408797465, 53.7275586176602, 38.60867082940602, 46.10954683614728, 24.321788474132486, 53.72856382064197, 33.74323377770009, 3

In [None]:
moment_7=moment(7,copy)
print(moment_7)

[-201.27150758402024, 83.28483347641945, 10.193463652997888, 78.84426681069446, 23.419262598124373, -690.3097984217314, 130.27526351474523, -71.67925126093502, -34.67102184485708, 136.40573218629592, -243.32745591611513, 0.9401408453029547, -33.34330140211491, -1.5611668927207896, -55.92601050067895, 260.22756782728686, 98.85964577886361, 191.128136918977, -440.5796201739659, -35.753690010867864, -33.517101464413386, 41.676449755137185, 1157.6789214956582, 305.30374992851523, -15.77696160706477, 230.4101341476393, -39.58148780039644, 30.762821042078702, -77.3266717633027, -8.023663877739427, -189.2075429622956, 21.08769458599113, -85.25074408019931, 42.746164091155606, -82.49492439388067, -19.367658826221255, 84.7917390985603, -113.88106804977353, 7.118620492134832, 37.93493879934737, 71.4450219876205, -78.29387912543709, 119.48796477171392, -55.69293357588773, -165.9571737699992, -36.927145919430416, -155.20703092099612, -65.16550463649028, -84.5280571681061, -12.360573417885842, -97.

In [None]:
moment_8=moment(8,copy)
print(moment_8)

[937.533025831931, 579.1327405539114, 381.64251482982286, 500.21028628845573, 945.2377923986948, 3605.792644655953, 601.636858896666, 350.5443531866297, 618.6920767441474, 520.7514555065242, 1166.602589123164, 144.81365945925882, 284.19354980943297, 429.56113046869234, 273.1269887302161, 1055.245422643844, 523.8551276265381, 690.6356398848417, 2030.8641352484701, 319.5294023049375, 1422.0093054264885, 649.9344153126455, 6229.449561166445, 1456.736167254356, 310.25247936494407, 1456.6545441469616, 159.39354889851927, 152.18599876233046, 431.0458839175185, 321.60265951692816, 822.2396353706177, 220.9719991646877, 1069.6040597701476, 284.1196654232072, 393.6675452521927, 206.24627407408786, 419.5927261919627, 718.0334527785517, 530.7159023082888, 367.0700308389434, 289.06185439177096, 669.2288793847729, 496.32960332592927, 287.61071175427304, 1394.3815167819528, 397.17186275294245, 627.9728992095337, 355.16675342834054, 504.6051754180818, 165.91424149255405, 586.8831798090023, 269.0211731

In [None]:
moment_9=moment(9,copy)
print(moment_9)

[-3410.5138555507715, 1137.823839075292, 257.3229270034854, 1032.1599585925928, 751.1185304289052, -18581.608957709042, 1874.480963958045, -942.6603252803859, -739.4824990238661, 1956.7271875065737, -4514.704994214555, 54.852728819459436, -420.85941703025907, -110.22874577604769, -648.1612824583501, 4133.582379813232, 1293.5522640016543, 2463.669265721868, -9015.616424411231, -322.3053172534818, -768.9225492989142, 809.1798927226944, 32114.46022283356, 5884.173706964898, -146.70837351110677, 5879.986723769334, -297.8970938208422, 283.3624545776675, -1100.54464145663, -146.26757947604028, -2900.879501282424, 321.2995075027336, -1343.3089579022478, 396.7590647574642, -862.3629793826179, -162.82870825286867, 1016.6913067447059, -2204.142504846587, -38.50196812758887, 491.78000887860486, 828.7752051493703, -1542.294226149627, 1562.9952953833358, -618.4688967968789, -3313.701475586209, -638.507188778383, -2141.0617437757273, -780.7033734944349, -1331.7041194355525, -163.66379746937997, -105

In [None]:
moment_10=moment(10,copy)
print(moment_10)

[14563.630634932837, 6714.253426843875, 4113.800690267108, 6184.24841560909, 13452.594939310995, 98424.46366673287, 7789.407403859356, 3970.498142253202, 7474.237148990524, 7737.463523720785, 20949.170865104894, 1005.3859003892139, 2895.775140421303, 5294.100783304784, 2856.0779204128603, 16845.393639057413, 5968.95671449707, 9040.03801864692, 41644.878885875994, 2924.111876377816, 26141.25149159208, 8126.435057610512, 170691.17058609283, 26621.054341368865, 3229.3749537371173, 30647.105019438968, 1234.2085520965477, 1215.5550834501887, 5061.555821737843, 2847.7935709448266, 12138.131318192569, 2060.018067091669, 15029.964570327436, 2442.6649362257654, 3834.245098597992, 1573.4641980498823, 4754.933087911601, 11135.447673565433, 6095.666772743518, 3794.5377635386844, 3115.267553583521, 9808.581257741658, 6236.864396205292, 2832.9177053864346, 22955.810941325486, 4415.79499828504, 8281.285390459228, 3779.931499236236, 6518.06505982873, 1279.5403752241184, 6976.05507487376, 2294.31350876

In [None]:
moment_matrix=np.zeros(shape=(300,11))

In [None]:
moment_array=[moment_0,moment_1,moment_2,moment_3,moment_4,moment_5,moment_6,moment_7,moment_8,moment_9,moment_10]

In [None]:
for i in range(11):
  moment_matrix[:,i]=moment_array[i]
print(moment_matrix)

[[ 1.00000000e+00 -3.55271368e-17  1.00000000e+00 ...  9.37533026e+02
  -3.41051386e+03  1.45636306e+04]
 [ 1.00000000e+00 -8.88178420e-18  1.00000000e+00 ...  5.79132741e+02
   1.13782384e+03  6.71425343e+03]
 [ 1.00000000e+00  6.66133815e-18  1.00000000e+00 ...  3.81642515e+02
   2.57322927e+02  4.11380069e+03]
 ...
 [ 1.00000000e+00 -1.77635684e-17  1.00000000e+00 ...  2.68057116e+03
  -1.33355449e+04  6.88855446e+04]
 [ 1.00000000e+00  0.00000000e+00  1.00000000e+00 ...  2.62376606e+02
  -3.77410015e+02  2.33684799e+03]
 [ 1.00000000e+00  3.55271368e-17  1.00000000e+00 ...  1.93799707e+02
   2.94700034e+02  1.46384313e+03]]


Hence, the matrix consisting of moments from 0 to 10, are obtained. This is known as **basis matrix**

In [None]:
mean_varlist_new=[]
for i in range(11):
  arr_new=moment_matrix[:,i]
  mean_varlist_new.append(mean_var(arr_new))

print(mean_varlist_new)
copy_3=moment_matrix.copy()

[(1.0, 0.0), (1.0140036958243095e-18, 2.816939767609146e-17), (1.0, 4.942533683990113e-15), (-0.061757343827472555, 0.6185310888093842), (5.4825578038786675, 0.9880417398625744), (-0.925247369550875, 8.50188565188367), (49.71644861833631, 27.30561056802581), (-16.639502656823993, 171.1022798033202), (622.016064063785, 731.1049137903519), (-350.47636095860565, 4193.914771787211), (9901.254968208059, 20247.52989969493)]


In [None]:
for i in range(10):
  arr_one=moment_matrix[:,i+1]
  mean_vartupnew=mean_varlist_new[i+1]
  arr_one=(arr_one-mean_vartupnew[0])/(mean_vartupnew[1])
  copy_3[:,i+1]=arr_one
print(copy_3)
print(copy_3.shape)

[[ 1.         -1.29719282 -0.53910311 ...  0.43156181 -0.7296375
   0.23026886]
 [ 1.         -0.35129569  1.21298199 ... -0.0586555   0.35487135
  -0.15740199]
 [ 1.          0.20047764 -1.34775776 ... -0.3287812   0.14492409
  -0.28583508]
 ...
 [ 1.         -0.66659473  0.49417785 ...  2.81567673 -3.09616891
   2.91315978]
 [ 1.         -0.03599664 -0.33693944 ... -0.49191224 -0.00642208
  -0.37359653]
 [ 1.          1.22519954  1.25790725 ... -0.58571123  0.15383632
  -0.41671314]]
(300, 11)


The next step:
- W which is the compressed word vector, needs to be obtained for every word, using the pseudo inverse matrix method.
 ================> W = ((X.T)X)^(-1)) (X.T)Y
- Y is the word vector of each word and X is the matrix of every word

- **moment_matrix** - (300,11)
- **Y** - (300,1)

In [None]:
word_vecmatrix=[]
for i in range(200):
  Y=X[i,:]
  moment_matrixsquare=np.matmul(copy_3.T,copy_3)
  moment_inverse=np.linalg.inv(moment_matrixsquare)
  pre_mult=np.matmul(moment_inverse,copy_3.T)
  W=np.matmul(pre_mult,Y)
  word_vecmatrix.append(W)
print(word_vecmatrix)

[array([ 0.00265442, -0.00545432, -0.00033772,  0.09907814, -0.01699903,
       -0.24550238,  0.05462581,  0.30087496, -0.02999613, -0.14062431,
       -0.00896083]), array([ 0.00498383,  0.00287813, -0.00649718,  0.06245093,  0.02437514,
       -0.09043137, -0.10258911,  0.07416134,  0.14621661, -0.02828818,
       -0.06751288]), array([ 0.00398567, -0.00356896, -0.00178434,  0.03332091, -0.00745163,
       -0.06444686,  0.00517828,  0.05695618,  0.00020023, -0.017717  ,
        0.00514057]), array([-0.00268274,  0.00372781, -0.00277634,  0.08431643,  0.04444354,
       -0.20819347, -0.13551178,  0.23280084,  0.16991964, -0.09453663,
       -0.07820782]), array([ 0.0013373 ,  0.00111146,  0.00176722,  0.08636536,  0.02920693,
       -0.23004276, -0.13135096,  0.29579568,  0.23308317, -0.14196766,
       -0.134343  ]), array([ 0.00398567, -0.00356896, -0.00178434,  0.03332091, -0.00745163,
       -0.06444686,  0.00517828,  0.05695618,  0.00020023, -0.017717  ,
        0.00514057]), arr

In [None]:
len(word_vecmatrix)

200

In [None]:
newmethod_values=[]
count=0
for i in range(0,199,2):
  newmethod_vec1=word_vecmatrix[i]
  newmethod_vec2=word_vecmatrix[i+1]
  newmethod_value=np.dot(newmethod_vec1,newmethod_vec2)
  newmethod_values.append(newmethod_value)
print(newmethod_values)

[0.04488033156335742, 0.029741318368488874, 0.03552240593407768, 0.6191467978973859, 0.11962634225721558, 0.06349786855003627, 0.03101528828045038, 0.33105675596101286, 0.2643680241160498, 0.04175299150763918, 0.19488226197102318, 0.30758048006022487, 0.10768563082173105, 0.15914351102166568, 0.14728745751967165, -0.05781205976479433, -0.3337252597712402, -0.018982926657178687, -0.018290982664784905, -0.07760814286147663, 0.07039305342658336, 0.08619246643410547, -0.1847730632037119, 0.13863212275546247, -0.0028797308253496598, 0.06763230273568432, 0.11389602382930865, -0.08413425888308189, 0.060222402492958745, -0.058292318908583954, -0.19296771077466135, 0.034197698391274886, 0.18428308568176371, 0.016519720898209803, 0.06667797872684429, 0.20816463735767715, 0.020721370531486576, -0.030190854731879, 0.08197109710129177, 0.020255790583568276, 0.10421641963554626, -0.1682309604505695, 0.017162708730660073, 0.14936477647228874, 0.056265332175427854, 0.05284793884364329, -0.172570425305

In [None]:
print(len(newmethod_values))

100


In [None]:
dataframe['newmethod_values']=newmethod_values

In [None]:
dataframe

Unnamed: 0,Serial Number,w2v_idealvalues,dotprod_calcvalues,pca_values,tSNE_values,newmethod_values
0,1,0.244612,0.244612,0.168238,-6818.879883,0.044880
1,2,-0.008264,-0.008264,-0.002253,-1308.133423,0.029741
2,3,-0.073541,-0.073541,-0.026221,1580.192749,0.035522
3,4,0.023472,0.023472,0.033245,3360.744873,0.619147
4,5,0.052351,0.052351,0.008576,6854.895020,0.119626
...,...,...,...,...,...,...
95,96,0.219348,0.219348,0.014619,2119.205078,0.000000
96,97,0.125567,0.125567,0.014619,2485.828613,0.000000
97,98,0.038487,0.038487,0.014619,3684.112305,0.000000
98,99,0.109628,0.109628,0.014619,6384.616699,0.000000


*Final Checkpoint- Table consists of similarity values of all methods*

In [None]:
dataframe.to_csv('finaltable_new.csv')

Plot the moment_matrix before normalising
- Calculate 7th column with new method without normalising
