# 3. Analysing words

This notebook will introduce you to the basics of analysing words. 
You'll learn how to preprocess and represent words.








Legend of symbols:

- 🤓: Tips

- 🤖📝: Your turn

- ❓: Question

- 💫: Extra exercise 

## 3.1. Word vectorization

In this section, we'll learn how to transform words into vectors. Let's start with one-hot encodings.

### 3.1.1. One-hot encoding

The library **<tt>sklearn</tt>** provides two classes that together transform categorical features (here, words) into one-hot vectors:
    
🌍 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html 

🌍 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

First, we will import the functions we need.

In [1]:
from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
sent1 = "Can I eat the Pizza".lower()
sent2 = "You can eat the Pizza".lower()

In [5]:
sent1

'can i eat the pizza'

In [6]:
print(sent2)

you can eat the pizza


In [7]:
doc1 = sent1.split()
doc2 = sent2.split()

In [9]:
type(sent1)

str

In [8]:
doc1

['can', 'i', 'eat', 'the', 'pizza']

In [10]:
type(doc1)

list

In [11]:
doc1_array = array(doc1)
doc2_array = array(doc2)

In [12]:
doc1_array

array(['can', 'i', 'eat', 'the', 'pizza'], dtype='<U5')

In [13]:
doc3 = doc1+doc2

In [14]:
doc3

['can', 'i', 'eat', 'the', 'pizza', 'you', 'can', 'eat', 'the', 'pizza']

In [17]:
type(doc3)

list

In [15]:
data = list(doc3)

In [16]:
values = array(data)
print(values)

['can' 'i' 'eat' 'the' 'pizza' 'you' 'can' 'eat' 'the' 'pizza']


❓ What does this code do?

This code transforms string sentences into a list and an array of words that we can manipulate later.

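After that, we will map each word to an integer. **<tt>LabelEncoder()</tt>** sorts the unique words alphabetically and assigns each one its index in that sorted list, so the integer does not depend on where the word appears in the sentence.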

In [32]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

[0 2 1 4 3 5 0 1 4 3]


The variable **<tt>integer_encoded</tt>** is a one-dimensional array with 10 elements, one per word. Viewed as a matrix, it would be a single row of 10 columns (1 x 10).

In [34]:
len(integer_encoded)

10

In [19]:
type(integer_encoded)

numpy.ndarray

In [35]:
# binary encode
# In scikit-learn >= 1.2 this argument is called sparse_output (sparse was removed in 1.4)
onehot_encoder = OneHotEncoder(sparse_output=False)

In [36]:
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)


The reshape call above turns the one-dimensional integer-encoded array into a matrix of 10 rows and 1 column (10 x 1), which is the two-dimensional input shape that **<tt>OneHotEncoder</tt>** expects.

In [37]:
integer_encoded

array([[0],
       [2],
       [1],
       [4],
       [3],
       [5],
       [0],
       [1],
       [4],
       [3]])
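The effect of the reshape is easier to see on a tiny array. The sketch below uses a made-up three-element array (not the notebook's data) and compares the shapes before and after; `reshape(-1, 1)` is an equivalent spelling that lets NumPy infer the number of rows.

```python
import numpy as np

# Hypothetical three-word integer encoding, for illustration only
flat = np.array([0, 2, 1])
print(flat.shape)      # a single axis with 3 elements

# Same data arranged as a column: 3 rows, 1 column
column = flat.reshape(len(flat), 1)
print(column.shape)

# -1 asks NumPy to infer that dimension from the array's size
same_column = flat.reshape(-1, 1)
print((column == same_column).all())
```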

In [39]:
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

[[1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0.]]


In [41]:
# This shows the order of words in the matrix
list(label_encoder.classes_)

['can', 'eat', 'i', 'pizza', 'the', 'you']
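To make the two encoding steps less of a black box, the same result can be reproduced in a few lines of plain Python. This is only a sketch of the idea, not how sklearn implements its encoders, and the variable names are chosen for illustration.

```python
# Words from the two example sentences, in order of appearance
words = "can i eat the pizza you can eat the pizza".split()

# Vocabulary: unique words sorted alphabetically (mirrors label_encoder.classes_)
vocab = sorted(set(words))
word_to_index = {word: i for i, word in enumerate(vocab)}

# Integer encoding: replace each word with its vocabulary index
integer_encoded = [word_to_index[word] for word in words]

# One-hot encoding: one row per word, one column per vocabulary entry
onehot = [
    [1.0 if col == idx else 0.0 for col in range(len(vocab))]
    for idx in integer_encoded
]

print(vocab)
print(integer_encoded)
print(onehot[0])
```

Each row contains exactly one 1, placed in the column of the word's vocabulary index.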

### 🤖📝 **Your turn**

Load the news dataset and calculate the one-hot encoding of the first news article.

In [42]:
import pandas as pd

# Import data
df = pd.read_csv('../data/news.csv')

In [43]:
df.head()

Unnamed: 0,topic,media,corpus,headline,link
0,climatic,The Guardian,The reindeer is the emblematic Christmas anima...,Weatherwatch: reindeer adapted to snow but not...,https://www.theguardian.com/world/2019/dec/23/...
1,climatic,The Guardian,The European parliament is split over whether ...,European parliament split on declaring climate...,https://www.theguardian.com/world/2019/nov/26/...
2,climatic,The Guardian,Fisayo Soyombo was eating an evening snack in ...,‘Climate of fear’: Nigeria intensifies crackdo...,https://www.theguardian.com/world/2019/nov/14/...
3,climatic,The Guardian,The European Union considers itself as a leade...,EU's soaring climate rhetoric not always match...,https://www.theguardian.com/world/2019/dec/11/...
4,climatic,The Guardian,"Good morning, we’re now exactly two weeks out ...",Thursday briefing: Political climate too hot f...,https://www.theguardian.com/world/2019/nov/28/...


In [44]:
first_new = df.iloc[0,2]

In [45]:
first_new

'The reindeer is the emblematic Christmas animal and, while not exactly magical, it is among the best adapted to snowy conditions.For a start, a reindeer’s feet have four toes with dewclaws that spread out to distribute its weight like snowshoes, and are equipped with sharp hooves for digging in snow.A reindeer’s nose warms the air on its way to the lungs, cooling it again before it is exhaled. As well as retaining heat, this helps prevent water from being lost as vapour. This is why reindeer breath does not steam like human and horse breath.A reindeer’s thick double-layered coat is so efficient that it is more likely to overheat than get too cold, especially when running. When this happens, reindeer pant like dogs to cool down, bypassing the nasal heat exchanger.Snowfields may be featureless to human eyes, but reindeer are sensitive to ultraviolet light, an evolutionary development that only occurred after the animals moved to Arctic regions. Snow reflects ultraviolet, so this ultravi

In [46]:
first_new_low = first_new.lower()

In [47]:
first_new_low

'the reindeer is the emblematic christmas animal and, while not exactly magical, it is among the best adapted to snowy conditions.for a start, a reindeer’s feet have four toes with dewclaws that spread out to distribute its weight like snowshoes, and are equipped with sharp hooves for digging in snow.a reindeer’s nose warms the air on its way to the lungs, cooling it again before it is exhaled. as well as retaining heat, this helps prevent water from being lost as vapour. this is why reindeer breath does not steam like human and horse breath.a reindeer’s thick double-layered coat is so efficient that it is more likely to overheat than get too cold, especially when running. when this happens, reindeer pant like dogs to cool down, bypassing the nasal heat exchanger.snowfields may be featureless to human eyes, but reindeer are sensitive to ultraviolet light, an evolutionary development that only occurred after the animals moved to arctic regions. snow reflects ultraviolet, so this ultravi

In [48]:
first_new_low_list = first_new_low.split()

In [49]:
first_new_low_list

['the',
 'reindeer',
 'is',
 'the',
 'emblematic',
 'christmas',
 'animal',
 'and,',
 'while',
 'not',
 'exactly',
 'magical,',
 'it',
 'is',
 'among',
 'the',
 'best',
 'adapted',
 'to',
 'snowy',
 'conditions.for',
 'a',
 'start,',
 'a',
 'reindeer’s',
 'feet',
 'have',
 'four',
 'toes',
 'with',
 'dewclaws',
 'that',
 'spread',
 'out',
 'to',
 'distribute',
 'its',
 'weight',
 'like',
 'snowshoes,',
 'and',
 'are',
 'equipped',
 'with',
 'sharp',
 'hooves',
 'for',
 'digging',
 'in',
 'snow.a',
 'reindeer’s',
 'nose',
 'warms',
 'the',
 'air',
 'on',
 'its',
 'way',
 'to',
 'the',
 'lungs,',
 'cooling',
 'it',
 'again',
 'before',
 'it',
 'is',
 'exhaled.',
 'as',
 'well',
 'as',
 'retaining',
 'heat,',
 'this',
 'helps',
 'prevent',
 'water',
 'from',
 'being',
 'lost',
 'as',
 'vapour.',
 'this',
 'is',
 'why',
 'reindeer',
 'breath',
 'does',
 'not',
 'steam',
 'like',
 'human',
 'and',
 'horse',
 'breath.a',
 'reindeer’s',
 'thick',
 'double-layered',
 'coat',
 'is',
 'so',
 'ef

In [50]:
first_new_low_list_array = array(first_new_low_list)

In [51]:
first_new_low_list_array

array(['the', 'reindeer', 'is', 'the', 'emblematic', 'christmas',
       'animal', 'and,', 'while', 'not', 'exactly', 'magical,', 'it',
       'is', 'among', 'the', 'best', 'adapted', 'to', 'snowy',
       'conditions.for', 'a', 'start,', 'a', 'reindeer’s', 'feet', 'have',
       'four', 'toes', 'with', 'dewclaws', 'that', 'spread', 'out', 'to',
       'distribute', 'its', 'weight', 'like', 'snowshoes,', 'and', 'are',
       'equipped', 'with', 'sharp', 'hooves', 'for', 'digging', 'in',
       'snow.a', 'reindeer’s', 'nose', 'warms', 'the', 'air', 'on', 'its',
       'way', 'to', 'the', 'lungs,', 'cooling', 'it', 'again', 'before',
       'it', 'is', 'exhaled.', 'as', 'well', 'as', 'retaining', 'heat,',
       'this', 'helps', 'prevent', 'water', 'from', 'being', 'lost', 'as',
       'vapour.', 'this', 'is', 'why', 'reindeer', 'breath', 'does',
       'not', 'steam', 'like', 'human', 'and', 'horse', 'breath.a',
       'reindeer’s', 'thick', 'double-layered', 'coat', 'is', 'so',
       

In [52]:
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(first_new_low_list_array)
print(integer_encoded)

[123 105  72 123  46  27  10   9 146  89  50  83  73  72   6 123  19   1
 128 115  33   0 119   0 106  55  63  57 129 148  37 122 118  95 128  39
  75 141  78 114   8  14  47 148 110  67  56  38  70 113 106  88 137 123
   4  92  75 139 128 123  81  35  73   3  17  73  72  52  15 142  15 107
  65 126  66 101 138  59  18  80  15 136 126  72 147 105  20  40  89 120
  78  69   8  68  21 106 125  42  30  72 116  45 122  73  72  85  79 128
  96 121  60 130  31  48 143 108 143 126  62 105  97  78  41 128  34  43
  24 123  87  64  51  84  16  54 128  69  53  22 105  14 109 128 132  77
   7  49  36 122  93  90   2 123  11  86 128  13 104 112 103 133 116 126
 134   5 105 128 117  12  82  92  74  70  98  76 145 124  44   8 131  91
 135 111 144  94 105  63  99 146 105 127  70  28 140 124  14  71  25  23
  29  26   8 123  58  32 122 102 100  61]


In [53]:
type(integer_encoded)

numpy.ndarray

In [54]:
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

In [55]:
integer_encoded

array([[123],
       [105],
       [ 72],
       [123],
       [ 46],
       [ 27],
       [ 10],
       [  9],
       [146],
       [ 89],
       [ 50],
       [ 83],
       [ 73],
       [ 72],
       [  6],
       [123],
       [ 19],
       [  1],
       [128],
       [115],
       [ 33],
       [  0],
       [119],
       [  0],
       [106],
       [ 55],
       [ 63],
       [ 57],
       [129],
       [148],
       [ 37],
       [122],
       [118],
       [ 95],
       [128],
       [ 39],
       [ 75],
       [141],
       [ 78],
       [114],
       [  8],
       [ 14],
       [ 47],
       [148],
       [110],
       [ 67],
       [ 56],
       [ 38],
       [ 70],
       [113],
       [106],
       [ 88],
       [137],
       [123],
       [  4],
       [ 92],
       [ 75],
       [139],
       [128],
       [123],
       [ 81],
       [ 35],
       [ 73],
       [  3],
       [ 17],
       [ 73],
       [ 72],
       [ 52],
       [ 15],
       [142],
       [ 15],
      

In [56]:
# In scikit-learn >= 1.2 this argument is called sparse_output (sparse was removed in 1.4)
onehot_encoder = OneHotEncoder(sparse_output=False)

In [57]:
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [58]:
list(label_encoder.classes_)

['a',
 'adapted',
 'after',
 'again',
 'air',
 'allows',
 'among',
 'an',
 'and',
 'and,',
 'animal',
 'animals',
 'anything',
 'arctic',
 'are',
 'as',
 'be',
 'before',
 'being',
 'best',
 'breath',
 'breath.a',
 'but',
 'by',
 'bypassing',
 'challenged',
 'change',
 'christmas',
 'christmas-card',
 'climate',
 'coat',
 'cold,',
 'conditions',
 'conditions.for',
 'cool',
 'cooling',
 'development',
 'dewclaws',
 'digging',
 'distribute',
 'does',
 'dogs',
 'double-layered',
 'down,',
 'eat,',
 'efficient',
 'emblematic',
 'equipped',
 'especially',
 'evolutionary',
 'exactly',
 'exchanger.snowfields',
 'exhaled.',
 'eyes,',
 'featureless',
 'feet',
 'for',
 'four',
 'freeze-thaw',
 'from',
 'get',
 'grazing.',
 'happens,',
 'have',
 'heat',
 'heat,',
 'helps',
 'hooves',
 'horse',
 'human',
 'in',
 'increasingly',
 'is',
 'it',
 'it,',
 'its',
 'lichen,',
 'light,',
 'like',
 'likely',
 'lost',
 'lungs,',
 'lying',
 'magical,',
 'may',
 'more',
 'moved',
 'nasal',
 'nose',
 'not',
 '

❓ What is the number of unique words in that news article?

In [68]:
len(label_encoder.classes_)

149

🤓 In that matrix (onehot_encoded), the number of columns represents the total number of **unique** words within the text, while the number of rows represents the total number of words (duplicated or not) within the text.

❓ What is the one-hot representation of *adapted*?

In [69]:
onehot_encoded[:,1]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0.])

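🤓 *adapted* has index 1 in **<tt>label_encoder.classes_</tt>**, so it owns the second column (position 1) of the one-hot matrix. That column contains a 1 at every position in the text where *adapted* occurs (here, position 17). The one-hot vector of the word itself is the matching row, **<tt>onehot_encoded[17]</tt>**, which has a single 1 in column 1.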

In [72]:
# One hot encoding of 'arctic'
onehot_encoded[:,13]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0.])

🤓 Finally, if we sum a word's column of the one-hot matrix, we obtain the frequency of that word in the text.

In [70]:
sum(onehot_encoded[:,1])

1.0
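The link between column sums and word frequencies can be checked on the small pizza example. The following is a plain-Python sketch with illustrative variable names; it compares the column sums with the counts from `collections.Counter`.

```python
from collections import Counter

words = "can i eat the pizza you can eat the pizza".split()
vocab = sorted(set(words))

# One row per word in the text, one column per vocabulary entry
onehot = [[1 if word == entry else 0 for entry in vocab] for word in words]

# Summing each column counts how often that vocabulary entry occurs
column_sums = [sum(row[col] for row in onehot) for col in range(len(vocab))]
frequencies = dict(zip(vocab, column_sums))

print(frequencies)
print(frequencies == dict(Counter(words)))
```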

### 3.1.2. Word embeddings (word2vec)

Training a word2vec model in Python is scarily easy with **<tt>gensim</tt>**.
    
🌍 https://radimrehurek.com/gensim/

In [None]:
from gensim.models import Word2Vec

Now we will train the word2vec model. Take a look at its parameters:

🌍 https://radimrehurek.com/gensim/models/word2vec.html

First, we will prepare the data. The model expects a list of sentences, where each sentence is itself a list of words:

In [None]:
sent1 = "Can I eat the Pizza".lower()
sent2 = "You can eat the Pizza".lower()

doc1 = sent1.split()
doc2 = sent2.split()

In [None]:
doc3 = [doc1, doc2]

In [None]:
# gensim >= 4.0 uses vector_size; older versions called this parameter size
model = Word2Vec(doc3, vector_size=300, window=3, min_count=1, workers=4)

In [None]:
print(model)

Now, we can analyse the vocabulary of this word2vec model

In [None]:
# gensim >= 4.0: the vocabulary lives in model.wv.key_to_index (model.wv.vocab was removed)
print(model.wv.key_to_index)

We can look up the embedding of a word with:

In [None]:
model.wv['pizza']

Using word2vec, we can analyse similarities across words:

In [None]:
model.wv.most_similar(positive=['pizza'], topn=1)

In [None]:
model.wv.most_similar(negative=['pizza'], topn=1)

And relations between words:

In [None]:
print(model.wv.similarity('pizza', 'eat'))

🤓 Note that this model was trained on only two short sentences, so the similarities it returns are not meaningful.
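The similarity score that gensim reports for two words is the cosine similarity of their vectors. As a minimal sketch of that measure, here it is computed by hand on made-up 3-dimensional vectors (real embeddings would come from the trained model); `cosine_similarity` is a helper written for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only
pizza = [1.0, 2.0, 0.0]
pasta = [2.0, 4.0, 0.0]   # points in the same direction as pizza
car = [0.0, 0.0, 3.0]     # orthogonal to pizza

print(cosine_similarity(pizza, pasta))  # close to 1: same direction
print(cosine_similarity(pizza, car))    # 0: nothing in common
```

Because only the direction of the vectors matters, *pasta* scores as maximally similar to *pizza* even though it is twice as long.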



### 🤖📝 **Your turn**

Train a word2vec embedding on the news corpus and extract the 10 words most similar to *ultraviolet*.

Here is some help preparing the input for the model:

In [None]:
# This is a loop that iterates over the dataframe
news_vec = []

for index, row in df.iterrows():
    sent = row['corpus'].lower()
    sent = sent.split()
    news_vec.append(sent)    
 
# Print the first element of the list:
print(news_vec[0])
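Splitting on whitespace leaves punctuation attached to words, as in the tokens 'and,' and 'conditions.for' seen earlier. One possible alternative, sketched here with the standard `re` module, keeps only runs of letters and digits (optionally joined by an apostrophe or hyphen). The pattern and the `tokenize` helper are assumptions made for this illustration, and the pattern drops any non-ASCII letters.

```python
import re

# Letters/digits, optionally joined by an apostrophe or hyphen
# (keeps forms such as reindeer’s and double-layered in one piece)
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['’-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return word tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(text.lower())

sample = "snowy conditions.For a start, a reindeer’s feet"
print(tokenize(sample))
```

To try it in the loop above, `sent.split()` could be swapped for `tokenize(row['corpus'])`.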

#### 💫 Extra

- Extract the word most similar to *climate*.
- Calculate the similarity between *climate* and *weather*.
- Calculate the word most similar to *humanitarian* + *climate* - *drought*.

Do the results make sense?

## 3.2. Word preprocessing

Let's replicate the examples we have seen previously in the lecture.

### 3.2.1. Tokenization

The process of separating symbols by introducing extra white space is called **tokenization**.

In [74]:
! pip install spacy
import spacy
!python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')



In [75]:
documents = "I've been 2 times to New York in 2011, but did not have the constitution for it. It DIDN'T appeal to me. I preferred Los Angeles."

In [76]:
tokens = [[token.text for token in sentence] for sentence in nlp(documents).sents]

In [77]:
tokens

[['I',
  "'ve",
  'been',
  '2',
  'times',
  'to',
  'New',
  'York',
  'in',
  '2011',
  ',',
  'but',
  'did',
  'not',
  'have',
  'the',
  'constitution',
  'for',
  'it',
  '.'],
 ['It', "DIDN'T", 'appeal', 'to', 'me', '.'],
 ['I', 'preferred', 'Los', 'Angeles', '.']]

❓ What does this code do?

In [83]:
print(tokens)

[['I', "'ve", 'been', '2', 'times', 'to', 'New', 'York', 'in', '2011', ',', 'but', 'did', 'not', 'have', 'the', 'constitution', 'for', 'it', '.'], ['It', "DIDN'T", 'appeal', 'to', 'me', '.'], ['I', 'preferred', 'Los', 'Angeles', '.']]


### 3.2.2. Lemmatization

The process of reducing words to their dictionary base form (lemma) is called **lemmatization**.

In [84]:
lemmas = [[token.lemma_ for token in sentence] for sentence in nlp(documents).sents]

In [85]:
print(lemmas)

[['-PRON-', 'have', 'be', '2', 'time', 'to', 'New', 'York', 'in', '2011', ',', 'but', 'do', 'not', 'have', 'the', 'constitution', 'for', '-PRON-', '.'], ['-PRON-', "DIDN'T", 'appeal', 'to', '-PRON-', '.'], ['-PRON-', 'prefer', 'Los', 'Angeles', '.']]


### 3.2.3. Stemming

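The process of reducing words to their stem is called **stemming**.

This process is more aggressive than lemmatization: it strips suffixes by rule, so the result is not always a dictionary word (for example, *constitution* becomes *constitut*).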

In [86]:
!pip install nltk
from nltk import SnowballStemmer
stemmer = SnowballStemmer('english')

stems = [[stemmer.stem(token) for token in sentence] for sentence in tokens]



In [87]:
print(stems)

[['i', 've', 'been', '2', 'time', 'to', 'new', 'york', 'in', '2011', ',', 'but', 'did', 'not', 'have', 'the', 'constitut', 'for', 'it', '.'], ['it', "didn't", 'appeal', 'to', 'me', '.'], ['i', 'prefer', 'los', 'angel', '.']]
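To get a feel for why stems are not always real words, here is a deliberately naive suffix-stripping sketch. It is not the Snowball algorithm, which applies a much richer set of rules; the suffix list and the `naive_stem` helper are made up for this illustration.

```python
# Hypothetical suffix list; real stemmers use many more rules and conditions
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["times", "preferred", "digging", "is"]:
    print(word, "->", naive_stem(word))
```

The outputs for *preferred* and *digging* differ from what Snowball would return, which shows how much the extra rules of a real stemmer matter.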


### 3.2.4. Part of speech

**Part-of-speech (POS) tagging** is the process of assigning each word its grammatical category: noun, verb, adjective, etc.

In [88]:
pos = [[token.pos_ for token in sentence] for sentence in nlp(documents).sents]

In [89]:
print(pos)

[['PRON', 'AUX', 'AUX', 'NUM', 'NOUN', 'ADP', 'PROPN', 'PROPN', 'ADP', 'NUM', 'PUNCT', 'CCONJ', 'AUX', 'PART', 'AUX', 'DET', 'NOUN', 'ADP', 'PRON', 'PUNCT'], ['PRON', 'PROPN', 'VERB', 'ADP', 'PRON', 'PUNCT'], ['PRON', 'VERB', 'PROPN', 'PROPN', 'PUNCT']]


### 3.2.5. Stop words

**Stop word removal** is the process of removing words that contribute little to the analysis, such as determiners.

In [90]:
content = [[token.text for token in sentence if token.pos_ in {'NOUN', 'VERB', 'PROPN', 'ADJ', 'ADV'} and not token.is_stop]
for sentence in nlp(documents).sents]

In [91]:
print(content)

[['times', 'New', 'York', 'constitution'], ["DIDN'T", 'appeal'], ['preferred', 'Los', 'Angeles']]


An alternative using **<tt>nltk</tt>** is:

In [101]:
import nltk
nltk.download('stopwords')  # only needed the first time, to fetch the stop word lists
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
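Once a stop word list is available, removing stop words is a simple membership test. The sketch below uses a small hand-picked set as a stand-in for the full nltk list, and reuses two of the tokenized sentences from above.

```python
# Small hand-picked stop list standing in for stopwords.words('english')
stop_words = {"i", "it", "to", "me", "the", "in", "but", "did", "not", "have", "for"}

sentences = [
    ["It", "DIDN'T", "appeal", "to", "me", "."],
    ["I", "preferred", "Los", "Angeles", "."],
]

# Keep purely alphabetic tokens whose lowercase form is not a stop word
filtered = [
    [tok for tok in sentence if tok.isalpha() and tok.lower() not in stop_words]
    for sentence in sentences
]
print(filtered)
```

Note that the `isalpha()` check also drops punctuation and tokens containing an apostrophe, such as "DIDN'T", which may or may not be what you want.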

In [99]:
# Sentiment score of each token (the small English model does not seem to provide sentiment scores, so every value below is 0.0)

In [97]:
# Stored under its own name so the tokens list from 3.2.1 is not overwritten
sentiments = [[token.sentiment for token in sentence] for sentence in nlp(documents).sents]

In [98]:
print(sentiments)

[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0, 0.0]]


### 3.2.6. Parsing

**Parsing** is the process of determining the syntactic structure of a sentence: each word is linked to its head word and labelled with a dependency relation.

In [103]:
# Stored under its own name so the tokens list from 3.2.1 is not overwritten
dependencies = [[(c.text, c.head.text, c.dep_) for c in nlp(sentence.text)] for sentence in nlp(documents).sents]

In [104]:
print(dependencies)

[[('I', 'been', 'nsubj'), ("'ve", 'been', 'aux'), ('been', 'been', 'ROOT'), ('2', 'been', 'attr'), ('times', '2', 'quantmod'), ('to', 'been', 'prep'), ('New', 'York', 'compound'), ('York', 'to', 'pobj'), ('in', 'been', 'prep'), ('2011', 'in', 'pobj'), (',', 'been', 'punct'), ('but', 'been', 'cc'), ('did', 'have', 'aux'), ('not', 'have', 'neg'), ('have', 'been', 'conj'), ('the', 'constitution', 'det'), ('constitution', 'have', 'dobj'), ('for', 'constitution', 'prep'), ('it', 'for', 'pobj'), ('.', 'been', 'punct')], [('It', 'appeal', 'nsubj'), ("DIDN'T", 'appeal', 'intj'), ('appeal', 'appeal', 'ROOT'), ('to', 'appeal', 'prep'), ('me', 'to', 'pobj'), ('.', 'appeal', 'punct')], [('I', 'preferred', 'nsubj'), ('preferred', 'preferred', 'ROOT'), ('Los', 'Angeles', 'compound'), ('Angeles', 'preferred', 'dobj'), ('.', 'preferred', 'punct')]]
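Each triple in that output reads as (word, head word, dependency label). As a small sketch of how such triples can be used, the code below takes the triples printed for the third sentence and picks out the root verb together with its subject and direct object. The variable names are illustrative.

```python
# (word, head, dependency label) triples for "I preferred Los Angeles."
triples = [
    ("I", "preferred", "nsubj"),
    ("preferred", "preferred", "ROOT"),
    ("Los", "Angeles", "compound"),
    ("Angeles", "preferred", "dobj"),
    (".", "preferred", "punct"),
]

# The root is the word whose label is ROOT (it is its own head)
root = next(text for text, head, dep in triples if dep == "ROOT")

# Subjects and direct objects attached to the root
subjects = [text for text, head, dep in triples if dep == "nsubj" and head == root]
objects = [text for text, head, dep in triples if dep == "dobj" and head == root]

print(root, subjects, objects)
```

The object comes out as 'Angeles' alone because 'Los' attaches to 'Angeles' as a compound rather than directly to the verb.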


### 3.2.7. Named Entity Recognition (NER)

**Named Entity Recognition** is the process of finding the spans of text that name an entity and classifying each one by type (PERSON, FACILITY, ORGANIZATION, GEOPOLITICAL ENTITY, etc.).

In [105]:
entities = [[(entity.text, entity.label_) for entity in nlp(sentence.text).ents] for sentence in nlp(documents).sents]

In [106]:
print(entities)

[[('2', 'CARDINAL'), ('New York', 'GPE'), ('2011', 'DATE')], [], [('Los Angeles', 'GPE')]]
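The nested per-sentence list is often easier to work with when grouped by entity type. A minimal sketch, starting from the output printed above and using illustrative variable names:

```python
from collections import defaultdict

# Entities per sentence, copied from the output above
entities = [
    [("2", "CARDINAL"), ("New York", "GPE"), ("2011", "DATE")],
    [],
    [("Los Angeles", "GPE")],
]

# Collect entity texts under their label, across all sentences
by_label = defaultdict(list)
for sentence_entities in entities:
    for text, label in sentence_entities:
        by_label[label].append(text)

print(dict(by_label))
```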


### 🤖📝 **Your turn**

Apply the 7 different preprocessing methods to the first row of the news dataset.

### Resources

📕 Hovy, D. (2020). Text Analysis in Python for Social Scientists: Discovery and Exploration. Cambridge University Press.

🌍 https://medium.com/zero-equals-false/one-hot-encoding-129ccc293cda

🌍 https://markroxor.github.io/gensim/static/notebooks/word2vec.html
