# 3. Analysing words

This notebook will introduce you to the basics of analysing words. 
You'll learn how to preprocess and represent text.








Legend of symbols:

- 🤓: Tips

- 🤖📝: Your turn

- ❓: Question

- 💫: Extra exercise 

## 3.1. Word vectorization

In this section, we'll learn how to transform words into vectors. Let's start with one-hot encodings.

### 3.1.1. One-hot encoding

The library **<tt>sklearn</tt>** provides classes that transform categorical features into integer labels and one-hot vectors:
    
🌍 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html 

🌍 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

First, we will import the functions we need.

In [None]:
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [None]:
sent1 = "Can I eat the Pizza".lower()
sent2 = "You can eat the Pizza".lower()

doc1 = sent1.split()
doc2 = sent2.split()

doc1_array = array(doc1)
doc2_array = array(doc2)

doc3 = doc1+doc2
data = list(doc3)

values = array(data)
print(values)

❓ What does this code do?

After that, we will map each word to an integer. To do so, we will use **<tt>LabelEncoder()</tt>**, which assigns the integers according to the alphabetical order of the distinct words.

In [None]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

In [None]:
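# binary encode
# Note: `sparse_output` replaced the `sparse` argument in scikit-learn 1.2;
# on older versions, use OneHotEncoder(sparse=False) instead.
onehot_encoder = OneHotEncoder(sparse_output=False)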
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

In [None]:
# The order of the words here matches the order of the columns in the one-hot matrix
list(label_encoder.classes_)
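
To see what the two encoders compute, an equivalent matrix can be built by hand with NumPy. This is only a minimal sketch, not how scikit-learn implements it; the names `tokens`, `vocab`, `indices` and `one_hot` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# The same ten words as in the example above
tokens = "can i eat the pizza you can eat the pizza".split()

# np.unique returns the sorted vocabulary and, for every token,
# the index of that token in the vocabulary (the "integer encoding")
vocab, indices = np.unique(tokens, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for vocabulary index i,
# so indexing the identity matrix with `indices` yields one row per token
one_hot = np.eye(len(vocab), dtype=int)[indices]

print(vocab)
print(indices)
print(one_hot)
```

Each row contains a single 1, placed in the column of the corresponding word in the sorted vocabulary.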

### 🤖📝 **Your turn**

Load the news dataset and calculate the one-hot encoding of the first news article.

In [None]:
import pandas as pd

_______

### 3.1.2. Word embeddings (word2vec)

Training a word2vec model in Python is scarily easy with **<tt>gensim</tt>**.
    
🌍 https://radimrehurek.com/gensim/

In [None]:
from gensim.models import Word2Vec

Now, we will train the word2vec model. Take a look at its parameters first:

🌍 https://radimrehurek.com/gensim/models/word2vec.html

First, we will prepare the data. The model expects a list of sentences, where each sentence is itself a list of words:

In [None]:
sent1 = "Can I eat the Pizza".lower()
sent2 = "You can eat the Pizza".lower()

doc1 = sent1.split()
doc2 = sent2.split()

In [None]:
doc3 = [doc1, doc2]

In [None]:
# Note: in gensim 4 the `size` argument was renamed to `vector_size`
model = Word2Vec(doc3, vector_size=300, window=3, min_count=1, workers=4)

In [None]:
print(model)

Now, we can inspect the vocabulary of this word2vec model:

In [None]:
# Note: in gensim 4, `wv.vocab` was replaced by `wv.key_to_index`
print(model.wv.key_to_index)

We can inspect the embedding of a word by looking it up in the model:

In [None]:
# Word vectors are accessed through `model.wv`
model.wv['pizza']

Using word2vec, we can analyse similarities between words:

In [None]:
model.wv.most_similar(positive=['pizza'], topn=1)

In [None]:
model.wv.most_similar(negative=['pizza'], topn=1)

And the similarity between two specific words:

In [None]:
print(model.wv.similarity('pizza', 'eat'))

🤓 Note that this model was trained on only two short sentences, so the similarities it returns are not meaningful.
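
The similarity score used above is the cosine similarity between the two word vectors. The sketch below computes it by hand with NumPy. The three-dimensional vectors are made up for illustration and are not taken from any trained model, and `cosine_similarity` is a helper defined here, not a gensim function.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (ranges from -1 to 1)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up toy vectors standing in for word embeddings
pizza = np.array([0.2, 0.7, 0.1])
eat = np.array([0.3, 0.6, 0.0])

print(cosine_similarity(pizza, eat))
```

A value close to 1 means the vectors point in almost the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.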



### 🤖📝 **Your turn**

Train a word2vec embedding with the news corpus and extract the top 10 most similar words of *ultraviolet*.

Help to prepare the input for the model:

In [9]:
# `df` is the news dataframe loaded in the previous exercise.
# Iterate over its rows and split each article into a list of lowercase words.
news_vec = []

for index, row in df.iterrows():
    sent = row['corpus'].lower()
    sent = sent.split()
    news_vec.append(sent)    
 
# Print the first element of the list:
print(news_vec[0])

['the', 'reindeer', 'is', 'the', 'emblematic', 'christmas', 'animal', 'and,', 'while', 'not', 'exactly', 'magical,', 'it', 'is', 'among', 'the', 'best', 'adapted', 'to', 'snowy', 'conditions.for', 'a', 'start,', 'a', 'reindeer’s', 'feet', 'have', 'four', 'toes', 'with', 'dewclaws', 'that', 'spread', 'out', 'to', 'distribute', 'its', 'weight', 'like', 'snowshoes,', 'and', 'are', 'equipped', 'with', 'sharp', 'hooves', 'for', 'digging', 'in', 'snow.a', 'reindeer’s', 'nose', 'warms', 'the', 'air', 'on', 'its', 'way', 'to', 'the', 'lungs,', 'cooling', 'it', 'again', 'before', 'it', 'is', 'exhaled.', 'as', 'well', 'as', 'retaining', 'heat,', 'this', 'helps', 'prevent', 'water', 'from', 'being', 'lost', 'as', 'vapour.', 'this', 'is', 'why', 'reindeer', 'breath', 'does', 'not', 'steam', 'like', 'human', 'and', 'horse', 'breath.a', 'reindeer’s', 'thick', 'double-layered', 'coat', 'is', 'so', 'efficient', 'that', 'it', 'is', 'more', 'likely', 'to', 'overheat', 'than', 'get', 'too', 'cold,', 'esp
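
In the output above, splitting on whitespace leaves punctuation attached to some words (`'and,'`, `'magical,'`) and glues sentences together (`'conditions.for'`). One possible alternative, sketched here under the assumption that only lowercase letters, digits, apostrophes and hyphens should be kept, is a small regular-expression tokenizer. `TOKEN_PATTERN` and `tokenize` are hypothetical names introduced for this example.

```python
import re

# A token is a run of letters/digits, optionally continued by an
# apostrophe or hyphen followed by more letters/digits (reindeer’s, double-layered)
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['’-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return the list of tokens matched by TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Not exactly magical, it is adapted to snowy conditions.For a start, "
               "a reindeer’s coat is double-layered."))
```

To try it on the news corpus, `sent.split()` in the loop above could be swapped for `tokenize(row['corpus'])`.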

#### 💫 Extra

- Extract the most similar word to *climate*.
- Calculate the similarity between *climate* and *weather*.
- Calculate the most similar word to *humanitarian* + *climate* - *drought*.

Do the results make sense?

## 3.2. Word preprocessing

### Resources

📕 Hovy, D. (2020). Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press.

🌍 https://medium.com/zero-equals-false/one-hot-encoding-129ccc293cda

🌍 https://markroxor.github.io/gensim/static/notebooks/word2vec.html
