<a href="https://colab.research.google.com/github/austinlasseter/pca_word_vectors/blob/main/starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word Vectors and Principal Component Analysis (PCA)

In [13]:
# import the PCA module from sklearn


Word vectors are a significant leap forward in our ability to analyze relationships across words, sentences, and documents. They help make technologies such as speech recognition and machine translation possible. **Word vectors are simply vectors of numbers that represent the meaning of a word.**  
http://jalammar.github.io/illustrated-word2vec/

![](https://github.com/austinlasseter/pca_word_vectors/blob/main/word2vec.png?raw=1)

This is a word embedding for the word “king” (GloVe vector trained on Wikipedia):
```
[ 0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666, 1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 , -0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 , 0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 , -0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042 ]
```

It’s a list of 50 numbers. We can’t tell much by looking at the values. But let’s visualize it a bit so we can compare it to other word vectors. Let’s color code the cells based on their values (red if they’re close to 2, white if they’re close to 0, blue if they’re close to -2).
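
A rough way to reproduce this color coding for the "king" vector is sketched below. It assumes `matplotlib` is available and picks the `bwr` colormap (blue, white, red) as a stand-in for whatever palette the original figure used; the figure size and labels are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# The 50-dimensional GloVe vector for "king" quoted above.
king = np.array([
    0.50451, 0.68607, -0.59517, -0.022801, 0.60046, -0.13498, -0.08813,
    0.47377, -0.61798, -0.31012, -0.076666, 1.493, -0.034189, -0.98173,
    0.68229, 0.81722, -0.51874, -0.31503, -0.55809, 0.66421, 0.1961,
    -0.13495, -0.11476, -0.30344, 0.41177, -2.223, -1.0756, -1.0783,
    -0.34354, 0.33505, 1.9927, -0.04234, -0.64319, 0.71125, 0.49159,
    0.16754, 0.34344, -0.25663, -0.8523, 0.1661, 0.40102, 1.1685,
    -1.0137, -0.21585, -0.15155, 0.78321, -0.91241, -1.6106, -0.64426,
    -0.51042,
])

fig, ax = plt.subplots(figsize=(12, 1.5))
# Draw the vector as a single row of colored cells. Pinning the range to
# [-2, 2] keeps white at 0; values beyond the range are clipped to the end colors.
image = ax.imshow(king.reshape(1, -1), cmap="bwr", vmin=-2, vmax=2, aspect="auto")
ax.set_yticks([0])
ax.set_yticklabels(["king"])
ax.set_xticks([])
fig.colorbar(image, ax=ax, label="value")
plt.show()
```

Stacking several vectors as rows of the same array would give a side-by-side comparison like the one in the image further down.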

See how “Man” and “Woman” are much more similar to each other than either of them is to “king”? This tells you something. These vector representations capture quite a bit of the information/meaning/associations of these words.

![](https://github.com/austinlasseter/pca_word_vectors/blob/main/king-man-woman-embedding.png?raw=1)
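
The comparison above is visual. One common way to put a number on "similar" is cosine similarity, sketched here with NumPy. The three vectors are made-up 4-dimensional toy values chosen only for illustration; they are not real embeddings, and the function name `cosine_similarity` is our own.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors, NOT real word embeddings.
toy_man = [0.9, 0.1, 0.3, -0.2]
toy_woman = [0.8, 0.2, 0.4, -0.1]
toy_king = [-0.3, 0.9, 0.1, 0.7]

print("man vs woman:", cosine_similarity(toy_man, toy_woman))
print("man vs king: ", cosine_similarity(toy_man, toy_king))
```

With real embeddings, the same function could be applied to any two rows of a word-vector table.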

Now let's try working with some word vectors of our own, and see how we can combine them with a technique called Principal Component Analysis (PCA) in order to learn something about their meanings.

In [14]:
# read in a dataset of 10 word vectors (the output of an NLP pipeline using Word2Vec)
url = 'https://raw.githubusercontent.com/austinlasseter/learnspacy/master/words300df.csv'


In [15]:
# create the index of the dataframe
words = ['car', 'truck', 'suv', 'elves', 'dragon', 'sword', 'king', 'queen', 'prince',  'potato']

In [16]:
# Take a look at the df
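
One possible way to fill in the loading cells is sketched below. The layout of the CSV is an assumption: we guess that it has a header row, one row per word in the same order as `words`, and no separate index column. The helper name `load_word_vectors` is hypothetical, and a tiny in-memory CSV stands in for the real file so the sketch runs offline.

```python
import io
import pandas as pd

words = ['car', 'truck', 'suv', 'elves', 'dragon', 'sword', 'king', 'queen', 'prince', 'potato']

def load_word_vectors(source, index_labels):
    """Read one vector per row from a CSV and label the rows with the given words."""
    df = pd.read_csv(source)      # assumes a header row and no index column
    df.index = index_labels       # assumes rows are in the same order as the labels
    return df

# Stand-in data: 10 rows, 3 columns (the real file is expected to have 300 columns).
stand_in = io.StringIO(
    "d0,d1,d2\n" + "\n".join(f"{i},{i + 1},{i + 2}" for i in range(10))
)
df = load_word_vectors(stand_in, words)
print(df.head())
```

In the notebook, passing `url` instead of `stand_in` should read the real file. If the file turns out to carry its own index column, `pd.read_csv(url, index_col=0)` would be the adjustment to try.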


Why do the two most popular neural models, Word2Vec and GloVe, consistently use 300-D word vectors?

https://code.google.com/archive/p/word2vec/


"having a lower number of parameters leads to better generalization. It is found that 300-dimensional word embeddings perform much better than, say, 3000-dimensional ones."
https://medium.com/explorations-in-language-and-learning/understanding-word-vectors-f5f9e9fdef98

In [17]:
# let's look at our first word and its 300D word vector


In [18]:
# initialise the PCA model


In [19]:
# fit the PCA model to our 300-dimensional data. This works out the directions of
# greatest variance, i.e. the 2D projection that preserves as much of the spread
# between data points as possible, and stores these instructions for transforming the data.


In [20]:
# Tell our (fitted) pca model to transform our 300D data down onto 2D using the 
# instructions it learnt during the fit phase.
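
The initialise, fit and transform steps described in the cells above can be sketched as follows. Random numbers stand in for the real 300-dimensional word vectors (an assumption made so the example runs on its own), and the variable names `X`, `pca` and `X_2d` are our own.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real data: 10 "words", each with a 300-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 300))

# 1. Initialise the model, asking for 2 output dimensions.
pca = PCA(n_components=2)

# 2. Fit: learn the two directions of greatest variance in the data.
pca.fit(X)

# 3. Transform: project every 300D vector onto those two directions.
X_2d = pca.transform(X)

print(X_2d.shape)                      # (10, 2)
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```

In the notebook, the dataframe of word vectors would take the place of `X`. `pca.fit_transform(X)` does steps 2 and 3 in one call.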


In [21]:
# let's take a look at our first word and its new 2D word vector


In [22]:
# each word and vector pair are actually just X-Y coordinates


In [23]:
# Make that into a dataframe


In [24]:
# create a plot
# plt.figure(figsize=(10,5))
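
The dataframe and plotting cells above could look something like the sketch below. Random coordinates stand in for the real PCA output, and the names `coords`, `df_2d` and the column labels `x` and `y` are assumptions, not taken from the original notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

words = ['car', 'truck', 'suv', 'elves', 'dragon', 'sword', 'king', 'queen', 'prince', 'potato']

# Stand-in for the 2D PCA output: one (x, y) pair per word.
rng = np.random.default_rng(0)
coords = rng.normal(size=(len(words), 2))

# Each word and its 2D vector become one labeled row.
df_2d = pd.DataFrame(coords, index=words, columns=["x", "y"])

fig, ax = plt.subplots(figsize=(10, 5))
ax.scatter(df_2d["x"], df_2d["y"])
# Write each word next to its point so the clusters can be read off the plot.
for word, (x, y) in df_2d.iterrows():
    ax.annotate(word, (x, y), xytext=(5, 5), textcoords="offset points")
ax.set_xlabel("first principal component")
ax.set_ylabel("second principal component")
plt.show()
```

With the real word vectors, one would hope to see related words such as 'car', 'truck' and 'suv' land near each other.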


## In 3 Dimensions

Plotly graph resources:

- Plotly website: https://plot.ly/python/
- Plotly forum: https://community.plot.ly/
- GitHub repository: https://github.com/austinlasseter/plotly_dash_tutorial

In [25]:
import plotly
import plotly.graph_objs as go

In [26]:
# for colab notebooks:
def enable_plotly_in_cell():
    # import display explicitly so the function does not rely on notebook globals
    from IPython.display import display, HTML
    from plotly.offline import init_notebook_mode
    display(HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
    init_notebook_mode(connected=False)

In [27]:
# initialise the PCA model for 3 dimensions


In [28]:
# fit the pca model to our 300-dimensional data


In [29]:
# Tell our (fitted) PCA model to transform our 300D data down onto 3D using the 
# instructions it learnt during the fit phase.


In [30]:
# each word and vector pair are coordinates


In [31]:
# convert to a 3D dataframe


In [None]:
# for each word and coordinate pair: 
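
The 3D steps in the cells above can be sketched in one go. As before, random numbers stand in for the real word vectors, and the names `pca_3d`, `X_3d` and `df_3d` are our own. The lists `x_values`, `y_values` and `z_values` match the names the Plotly cell expects.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

words = ['car', 'truck', 'suv', 'elves', 'dragon', 'sword', 'king', 'queen', 'prince', 'potato']

# Stand-in for the real 300-dimensional word vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(len(words), 300))

# Initialise a 3-component model, then fit and transform in a single call.
pca_3d = PCA(n_components=3)
X_3d = pca_3d.fit_transform(X)

# Each word now has an (x, y, z) coordinate.
df_3d = pd.DataFrame(X_3d, index=words, columns=["x", "y", "z"])

# Pull the three coordinate columns out as plain lists for plotting.
x_values = df_3d["x"].tolist()
y_values = df_3d["y"].tolist()
z_values = df_3d["z"].tolist()

print(df_3d.head())
```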


In [None]:
# now with plotly create a dictionary of values using the 'go.Scatter3d' class
mydata = go.Scatter3d(x = x_values, 
                      y = y_values, 
                      z = z_values, 
                      mode='markers', 
                      hovertext=words,
                      marker=dict(
                            size=12,
                            color=z_values,                # set color to an array/list of desired values
                            colorscale='Viridis',   # choose a colorscale
                            opacity=0.8
                        )
                     )

In [None]:
enable_plotly_in_cell() # don't forget to include this! It's just for colab notebooks.
# wrap that dictionary into a list and display using 'go.Figure' class
