# 3. Analysing words

This notebook will introduce you to the basics of analysing words. 
You'll learn how to preprocess and represent words.








Legend of symbols:

- 🤓: Tips

- 🤖📝: Your turn

- ❓: Question

- 💫: Extra exercise 

## 3.1. Word vectorization

In this section, we'll learn how to transform words into vectors. Let's start with one-hot encodings.

### 3.1.1. One-hot encoding

The library **<tt>sklearn</tt>** has a function that transforms categorical features to one-hot vectors:
    
🌍 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html 

🌍 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

First, we will import the functions we need.

In [1]:
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [25]:
sent1 = "Can I eat the Pizza".lower()
sent2 = "You can eat the Pizza".lower()

doc1 = sent1.split()
doc2 = sent2.split()

doc1_array = array(doc1)
doc2_array = array(doc2)

doc3 = doc1+doc2
data = list(doc3)

values = array(data)
print(values)

['can' 'i' 'eat' 'the' 'pizza' 'you' 'can' 'eat' 'the' 'pizza']


❓ What does this code do?

After that, we will map each word to an integer. To do so, we will use **<tt>LabelEncoder()</tt>**, which numbers the unique words in alphabetical order (not in order of appearance).

In [3]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

[0 2 1 4 3 5 0 1 4 3]


In [4]:
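# binary encode
# OneHotEncoder returns a sparse matrix by default; .toarray() makes it dense.
# (The sparse=False argument was renamed sparse_output in scikit-learn 1.2.)
onehot_encoder = OneHotEncoder()
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()
print(onehot_encoded)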

[[1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]]


In [5]:
# This shows the order of words in the matrix
list(label_encoder.classes_)

['can', 'eat', 'i', 'pizza', 'the', 'you']
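
To see what the two encoders compute, the same result can be reproduced in plain Python. This is only an illustrative sketch, not how scikit-learn implements it; the names `vocab`, `word_to_index`, `integer_codes` and `onehot_rows` are made up for this example.

```python
# Sketch: one-hot encoding without scikit-learn
words = "can i eat the pizza you can eat the pizza".split()

# Sorted unique words: the same ordering as label_encoder.classes_
vocab = sorted(set(words))
word_to_index = {word: index for index, word in enumerate(vocab)}

# Integer encoding: position of each word in the sorted vocabulary
integer_codes = [word_to_index[word] for word in words]

# One-hot encoding: a row of zeros with a single 1.0 at the word's index
onehot_rows = [
    [1.0 if column == code else 0.0 for column in range(len(vocab))]
    for code in integer_codes
]

print(vocab)
print(integer_codes)
print(onehot_rows[1])  # vector for 'i'
```

Each row has exactly one non-zero entry, and the row length equals the vocabulary size, which is why one-hot matrices grow quickly on real corpora.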

### 🤖📝 **Your turn**

Load the news dataset and compute the one-hot encoding of the first news article.

In [6]:
import pandas as pd
df = pd.read_csv('../data/news.csv')

In [26]:
values = array(df['corpus'][0].lower().split())

In [28]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)

In [29]:
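# binary encode
# OneHotEncoder returns a sparse matrix by default; .toarray() makes it dense.
# (The sparse=False argument was renamed sparse_output in scikit-learn 1.2.)
onehot_encoder = OneHotEncoder()
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded).toarray()
print(onehot_encoded)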

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### 3.1.2. Word embeddings (word2vec)

Training a word2vec model in Python is scarily easy with **<tt>gensim</tt>**.
    
🌍 https://radimrehurek.com/gensim/

In [30]:
from gensim.models import Word2Vec

Now we will train the word2vec model. Take a look at the parameters it accepts:

🌍 https://radimrehurek.com/gensim/models/word2vec.html

First, we will prepare the data. The model expects a list of sentences, where each sentence is itself a list of tokens:

In [31]:
sent1 = "Can I eat the Pizza".lower()
sent2 = "You can eat the Pizza".lower()

doc1 = sent1.split()
doc2 = sent2.split()

In [32]:
doc3 = [doc1, doc2]

In [33]:
# Note: gensim >= 4.0 renames the size argument to vector_size
model = Word2Vec(doc3, size=300, window=3, min_count=1, workers=4)

In [34]:
print(model)

Word2Vec(vocab=6, size=300, alpha=0.025)
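
To get a feel for the `window` parameter, the sketch below lists the (centre word, context word) pairs that fall within a given distance of each other. It is a simplification written for this notebook: `context_pairs` is a hypothetical helper, not part of gensim, and the real training loop does more than this (for example, it varies the effective window size).

```python
# Sketch: which word pairs does a context window of a given size produce?
def context_pairs(tokens, window):
    """Return (centre, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

pairs = context_pairs(["can", "i", "eat", "the", "pizza"], window=1)
print(pairs)
```

With `window=1` only direct neighbours are paired; a larger window lets more distant words influence each other's vectors.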


Now, we can inspect the vocabulary of this word2vec model:

In [35]:
# Note: gensim >= 4.0 removed .vocab; use model.wv.key_to_index there
print(model.wv.vocab)

{'can': <gensim.models.keyedvectors.Vocab object at 0x7f8f48567340>, 'i': <gensim.models.keyedvectors.Vocab object at 0x7f8f48567370>, 'eat': <gensim.models.keyedvectors.Vocab object at 0x7f8f485673d0>, 'the': <gensim.models.keyedvectors.Vocab object at 0x7f8f48567430>, 'pizza': <gensim.models.keyedvectors.Vocab object at 0x7f8f48567490>, 'you': <gensim.models.keyedvectors.Vocab object at 0x7f8f485674f0>}


We can inspect the embedding of a word by indexing the model's word vectors:

In [36]:
model.wv['pizza']  # indexing the model directly is deprecated; go through model.wv


array([ 3.21682659e-04, -3.14061675e-04, -1.22502475e-04, -6.77091593e-04,
        4.40915865e-05, -1.07827841e-03, -2.13238396e-04, -1.36438129e-03,
        6.36198558e-04,  1.65240408e-03,  1.22291676e-03, -1.22182979e-03,
        1.65365881e-03,  1.31142489e-03,  1.43662468e-03, -2.89088086e-04,
       -7.97189830e-04, -2.57448759e-04,  6.80498779e-04,  8.74991994e-04,
        8.59063934e-04,  1.98715148e-04,  3.23505665e-04, -4.19695774e-04,
        7.28339481e-04,  9.47904889e-04, -1.20824773e-03,  4.65748017e-04,
        9.54837073e-04,  3.14157922e-04, -9.68350214e-04, -1.46815274e-03,
       -8.42247391e-04, -1.57231512e-03,  5.23636001e-04, -5.77613304e-04,
        1.55919068e-03, -3.12595475e-05, -6.59398735e-04,  1.42673869e-03,
        4.94487875e-04, -1.11329160e-03, -1.07714883e-03,  1.42718409e-03,
       -1.36631844e-03,  8.11703270e-04,  8.10952857e-04, -4.45070764e-04,
       -1.12503232e-03,  1.06437982e-03,  1.14634447e-03, -1.77222391e-04,
       -1.32265943e-03,  

Using word2vec, we can analyse similarities across words:

In [37]:
model.wv.most_similar(positive=['pizza'], topn=1)


[('the', 0.07776791602373123)]

In [38]:
model.wv.most_similar(negative=['pizza'], topn=1)


[('eat', 0.09031939506530762)]

And compute the cosine similarity between two words:

In [39]:
print(model.wv.similarity('pizza', 'eat'))

-0.09031941
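
The score above is a cosine similarity: the cosine of the angle between the two word vectors, ranging from -1 (opposite directions) to 1 (same direction). Here is a minimal sketch of the computation; the two-dimensional vectors and the words in `toy_vectors` are invented for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional "embeddings", chosen by hand
toy_vectors = {
    "pizza": [1.0, 0.0],
    "pasta": [1.0, 1.0],
    "car": [-1.0, 0.0],
}

sim_close = cosine_similarity(toy_vectors["pizza"], toy_vectors["pasta"])
sim_opposite = cosine_similarity(toy_vectors["pizza"], toy_vectors["car"])
print(sim_close, sim_opposite)
```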



🤓 Note that this model was trained on only two short sentences, so the similarities it returns are not meaningful.



### 🤖📝 **Your turn**

Train a word2vec embedding on the news corpus and extract the 10 words most similar to *ultraviolet*.

The following code helps you prepare the input for the model:

In [40]:
# This is a loop that iterates over the dataframe
news_vec = []

for index, row in df.iterrows():
    sent = row['corpus'].lower()
    sent = sent.split()
    news_vec.append(sent)    
 
# Print the first element of the list:
print(news_vec[0])

['the', 'reindeer', 'is', 'the', 'emblematic', 'christmas', 'animal', 'and,', 'while', 'not', 'exactly', 'magical,', 'it', 'is', 'among', 'the', 'best', 'adapted', 'to', 'snowy', 'conditions.for', 'a', 'start,', 'a', 'reindeer’s', 'feet', 'have', 'four', 'toes', 'with', 'dewclaws', 'that', 'spread', 'out', 'to', 'distribute', 'its', 'weight', 'like', 'snowshoes,', 'and', 'are', 'equipped', 'with', 'sharp', 'hooves', 'for', 'digging', 'in', 'snow.a', 'reindeer’s', 'nose', 'warms', 'the', 'air', 'on', 'its', 'way', 'to', 'the', 'lungs,', 'cooling', 'it', 'again', 'before', 'it', 'is', 'exhaled.', 'as', 'well', 'as', 'retaining', 'heat,', 'this', 'helps', 'prevent', 'water', 'from', 'being', 'lost', 'as', 'vapour.', 'this', 'is', 'why', 'reindeer', 'breath', 'does', 'not', 'steam', 'like', 'human', 'and', 'horse', 'breath.a', 'reindeer’s', 'thick', 'double-layered', 'coat', 'is', 'so', 'efficient', 'that', 'it', 'is', 'more', 'likely', 'to', 'overheat', 'than', 'get', 'too', 'cold,', 'esp

#### 💫 Extra

- Extract the most similar word to *climate*.
- Calculate the similarity between *climate* and *weather*.
- Calculate the most similar word to *humanitarian* + *climate* - *drought*.

Do the results make sense?

## 3.2. Word preprocessing

Let's replicate the examples we have seen previously in the lecture.

### 3.2.1. Tokenization

The process of separating symbols (words, numbers, punctuation) by introducing extra white space is called **tokenization**.

In [41]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [42]:
documents = "I've been 2 times to New York in 2011, but did not have the constitution for it. It DIDN'T appeal to me. I preferred Los Angeles."
tokens = [[token.text for token in sentence] for sentence in nlp(documents).sents]

❓ What does this code do?

In [43]:
print(tokens)

[['I', "'ve", 'been', '2', 'times', 'to', 'New', 'York', 'in', '2011', ',', 'but', 'did', 'not', 'have', 'the', 'constitution', 'for', 'it', '.'], ['It', "DIDN'T", 'appeal', 'to', 'me', '.'], ['I', 'preferred', 'Los', 'Angeles', '.']]
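
For comparison, here is a rough tokenizer built from a single regular expression. It is a simplified sketch and not the algorithm spaCy uses: the pattern is an assumption chosen for this example, and it keeps contractions such as "DIDN'T" as one token while splitting punctuation off.

```python
import re

text = "It DIDN'T appeal to me. I preferred Los Angeles."

# A token is either a run of word characters (optionally followed by an
# apostrophe and more word characters) or a single punctuation mark.
token_pattern = r"\w+(?:'\w+)?|[^\w\s]"
simple_tokens = re.findall(token_pattern, text)
print(simple_tokens)
```

Unlike `str.split()`, this separates the full stops from the preceding words, so "me." and "me" are no longer treated as different tokens.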


### 3.2.2. Lemmatization

The process of reducing words to their dictionary base form (lemma) is called **lemmatization**.

In [44]:
lemmas = [[token.lemma_ for token in sentence] for sentence in nlp(documents).sents]

In [45]:
print(lemmas)

[['-PRON-', 'have', 'be', '2', 'time', 'to', 'New', 'York', 'in', '2011', ',', 'but', 'do', 'not', 'have', 'the', 'constitution', 'for', '-PRON-', '.'], ['-PRON-', "DIDN'T", 'appeal', 'to', '-PRON-', '.'], ['-PRON-', 'prefer', 'Los', 'Angeles', '.']]
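
Conceptually, a lemmatizer maps each inflected form to its dictionary entry. The sketch below mimics that with a tiny hand-written lookup table; `lemma_table` and `lookup_lemma` are hypothetical names, and real lemmatizers such as spaCy's also rely on rules and part-of-speech information rather than a fixed table.

```python
# Sketch: lemmatization as a dictionary lookup (toy table, written by hand)
lemma_table = {
    "'ve": "have",
    "been": "be",
    "times": "time",
    "did": "do",
    "preferred": "prefer",
}

def lookup_lemma(token):
    """Return the lemma if the lowercased token is in the table, else the token itself."""
    return lemma_table.get(token.lower(), token)

toy_lemmas = [lookup_lemma(token) for token in ["I", "preferred", "Los", "Angeles", "."]]
print(toy_lemmas)
```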


### 3.2.3. Stemming

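The process of reducing words to their stem is called **stemming**.

This process is more radical than lemmatization: suffixes are stripped by rule, so the result need not be a dictionary word (for example, *constitution* becomes *constitut* below).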

In [46]:
from nltk import SnowballStemmer
stemmer = SnowballStemmer('english')

stems = [[stemmer.stem(token) for token in sentence] for sentence in tokens]

In [47]:
print(stems)

[['i', 've', 'been', '2', 'time', 'to', 'new', 'york', 'in', '2011', ',', 'but', 'did', 'not', 'have', 'the', 'constitut', 'for', 'it', '.'], ['it', "didn't", 'appeal', 'to', 'me', '.'], ['i', 'prefer', 'los', 'angel', '.']]
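
To illustrate why stemming is cruder than lemmatization, here is a deliberately naive suffix stripper. It is a toy sketch under simple assumptions, not the Snowball algorithm: `naive_stem` and its suffix list are invented for this example.

```python
# Sketch: a naive rule-based stemmer (much simpler than Snowball)
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Lowercase the word and strip the first matching suffix,
    provided at least three characters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

toy_stems = [naive_stem(w) for w in ["times", "preferred", "Angeles", "digging"]]
print(toy_stems)
```

The rules know nothing about the words themselves, so a proper noun like "Angeles" gets truncated just like a plural.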


### 3.2.4. Part of speech

**Part-of-speech (POS) tagging** is the process of assigning each word its grammatical category: noun, verb, adjective, etc.

In [48]:
pos = [[token.pos_ for token in sentence] for sentence in nlp(documents).sents]

In [49]:
print(pos)

[['PRON', 'AUX', 'AUX', 'NUM', 'NOUN', 'ADP', 'PROPN', 'PROPN', 'ADP', 'NUM', 'PUNCT', 'CCONJ', 'AUX', 'PART', 'AUX', 'DET', 'NOUN', 'ADP', 'PRON', 'PUNCT'], ['PRON', 'PROPN', 'NOUN', 'ADP', 'PRON', 'PUNCT'], ['PRON', 'VERB', 'PROPN', 'PROPN', 'PUNCT']]
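
Once every token has a tag, the tags can be counted or used as a filter. The sketch below works on the tokens and tags of the last sentence, copied by hand from the outputs above so that it runs without spaCy; the variable names are made up for this example.

```python
from collections import Counter

# Tokens and POS tags of the third sentence, copied from the printed outputs
sentence_tokens = ['I', 'preferred', 'Los', 'Angeles', '.']
sentence_pos = ['PRON', 'VERB', 'PROPN', 'PROPN', 'PUNCT']

# How often does each category occur?
tag_counts = Counter(sentence_pos)

# Keep only the proper nouns
proper_nouns = [tok for tok, tag in zip(sentence_tokens, sentence_pos) if tag == 'PROPN']

print(tag_counts)
print(proper_nouns)
```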


### 3.2.5. Stop words

**Stop word removal** is the process of removing words that contribute little to the analysis, such as determiners.

In [50]:
content = [[token.text for token in sentence if token.pos_ in {'NOUN', 'VERB', 'PROPN', 'ADJ', 'ADV'} and not token.is_stop]
for sentence in nlp(documents).sents]

In [51]:
print(content)

[['times', 'New', 'York', 'constitution'], ["DIDN'T", 'appeal'], ['preferred', 'Los', 'Angeles']]


Another alternative using **<tt>nltk</tt>** is:

In [56]:
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
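
A stop word list like this one is typically used as a simple filter. The sketch below uses a small hand-picked subset of the list (so it runs without downloading anything) and also drops punctuation; `stop_subset` is a name invented for this example.

```python
# Sketch: filtering tokens against a stop word list
# A small subset of the nltk list above, typed by hand for illustration
stop_subset = {"i", "the", "to", "in", "but", "did", "not", "have", "for", "it", "been"}

sample_tokens = ["I", "preferred", "Los", "Angeles", "."]

# Compare in lowercase, because the list is lowercase; isalpha() drops punctuation
kept_tokens = [t for t in sample_tokens if t.lower() not in stop_subset and t.isalpha()]
print(kept_tokens)
```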

### 3.2.6. Parsing

**Parsing** is the process of identifying the syntactic role of each word in a sentence: every word is linked to its head by a labelled dependency relation.

In [57]:
[[(c.text, c.head.text, c.dep_) for c in nlp(sentence.text)]
 for sentence in nlp(documents).sents]

[[('I', 'been', 'nsubj'),
  ("'ve", 'been', 'aux'),
  ('been', 'been', 'ROOT'),
  ('2', 'times', 'nummod'),
  ('times', 'been', 'npadvmod'),
  ('to', 'been', 'prep'),
  ('New', 'York', 'compound'),
  ('York', 'to', 'pobj'),
  ('in', 'been', 'prep'),
  ('2011', 'in', 'pobj'),
  (',', 'been', 'punct'),
  ('but', 'been', 'cc'),
  ('did', 'have', 'aux'),
  ('not', 'have', 'neg'),
  ('have', 'been', 'conj'),
  ('the', 'constitution', 'det'),
  ('constitution', 'have', 'dobj'),
  ('for', 'constitution', 'prep'),
  ('it', 'for', 'pobj'),
  ('.', 'been', 'punct')],
 [('It', "DIDN'T", 'nsubj'),
  ("DIDN'T", "DIDN'T", 'ROOT'),
  ('appeal', "DIDN'T", 'ccomp'),
  ('to', 'appeal', 'prep'),
  ('me', 'to', 'pobj'),
  ('.', "DIDN'T", 'punct')],
 [('I', 'preferred', 'nsubj'),
  ('preferred', 'preferred', 'ROOT'),
  ('Los', 'Angeles', 'compound'),
  ('Angeles', 'preferred', 'dobj'),
  ('.', 'preferred', 'punct')]]
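
Each triple above reads as (word, head, relation). As a small sketch of how such output can be used, the code below takes the triples of the last sentence, copied by hand so that it runs without spaCy, and picks out the root verb with its subject and object. The variable names are hypothetical.

```python
# (word, head, dependency label) triples of the third sentence, copied from above
triples = [
    ('I', 'preferred', 'nsubj'),
    ('preferred', 'preferred', 'ROOT'),
    ('Los', 'Angeles', 'compound'),
    ('Angeles', 'preferred', 'dobj'),
    ('.', 'preferred', 'punct'),
]

# The root is the word labelled ROOT (it is its own head)
root = next(word for word, head, dep in triples if dep == 'ROOT')

# Subject and direct object are the words attached to the root with those labels
arguments = {dep: word for word, head, dep in triples
             if head == root and dep in {'nsubj', 'dobj'}}

print(root, arguments)
```

Note that the object is reported as "Angeles" only: "Los" hangs off "Angeles" as a `compound`, so recovering the full phrase means following that link as well.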

### 3.2.7. Named Entity Recognition (NER)

**Named Entity Recognition** is the process of finding the spans of text that name an entity and classifying each one by type (PERSON, FACILITY, ORGANIZATION, GEOPOLITICAL ENTITY, etc.).

In [58]:
entities = [[(entity.text, entity.label_) for entity in nlp(sentence.text).ents] for sentence in nlp(documents).sents]

In [59]:
print(entities)

[[('2', 'CARDINAL'), ('New York', 'GPE'), ('2011', 'DATE')], [], [('Los Angeles', 'GPE')]]
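
A common next step is to group the recognised entities by label. The sketch below does this for the output above, copied by hand so that it runs without spaCy; `entities_by_label` is a name invented for this example.

```python
from collections import defaultdict

# Entities per sentence, copied from the printed output above
entities = [[('2', 'CARDINAL'), ('New York', 'GPE'), ('2011', 'DATE')],
            [],
            [('Los Angeles', 'GPE')]]

# Collect the entity texts under their label, across all sentences
entities_by_label = defaultdict(list)
for sentence_entities in entities:
    for text, label in sentence_entities:
        entities_by_label[label].append(text)

print(dict(entities_by_label))
```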


### 🤖📝 **Your turn**

Apply the seven preprocessing methods above to the first row of the news dataset.

### Resources

📕 Hovy, D. (2020). Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press.

🌍 https://medium.com/zero-equals-false/one-hot-encoding-129ccc293cda

🌍 https://markroxor.github.io/gensim/static/notebooks/word2vec.html
