<img src = "Images/logo-eki.png" style = "width:20%"/>

version : 1.0<br>
last updated date : 2017/03/17

In [1]:
import nltk
import pandas as pd

<h1 style = "color:#1E718B"> Introduction to Natural Language Processing</h1><br>

<h2><u>Libraries</u></h2>

### NLTK

### SpaCy

***
<h1 style = "color:#1E718B">The basics of preprocessing textual data</h1>


<img src = "Images/sentence1.png"/>

<h2><u>Tokenization & uniformisation</u></h2>

### Tokenization
<img src = "Images/tokenization.png"/>

Tokenization is usually the first step of any NLP task, and arguably the most important one. <br>
A **token** in NLP represents one semantic entity, which can be : 
- A word
- A punctuation symbol
- A number

Indeed, the simplest representation of a text is the list of its elements taken independently.

In [24]:
sentence = "Ekimetrics is a consulting firm, with offices around the world."

#### Native method with the split function

In [25]:
print(sentence.split(" "))

['Ekimetrics', 'is', 'a', 'consulting', 'firm,', 'with', 'offices', 'around', 'the', 'world.']


#### Advanced method with nltk

In [26]:
sentence_as_list_of_tokens = nltk.wordpunct_tokenize(sentence)
print(sentence_as_list_of_tokens)

['Ekimetrics', 'is', 'a', 'consulting', 'firm', ',', 'with', 'offices', 'around', 'the', 'world', '.']


The two examples above differ in one important way : **the punctuation**<br>
Splitting a text on spaces ignores the subtleties of language, and punctuation is the easiest place to see it.<br>
- In the first example, the token "firm," keeps its comma, and the last token "world." keeps its period
- In the second example, the function treats "," as a token of its own and splits around it

<br> You could also split on punctuation yourself, but you would then have to handle more complex cases : 
- "it's" should be split into ["it","'s"] and not ["it","'","s"] or ["it's"]
- "www.google.com" should not be split

<br> The more advanced NLTK and SpaCy tokenizers handle most of these cases. Note that `wordpunct_tokenize`, used above, is a simple regular-expression tokenizer : it still splits "it's" into ["it","'","s"], whereas `nltk.word_tokenize` returns ["it","'s"].
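
To make these cases concrete, here is a minimal sketch of a regular-expression tokenizer that uses only the standard library. The pattern and the name `simple_tokenize` are illustrative assumptions, not the rules NLTK or SpaCy apply internally; the sketch only keeps dotted names whole and splits off clitics such as "'s".

```python
import re

# Illustrative pattern (an assumption, not NLTK's or SpaCy's actual rules).
# Alternatives are tried left to right, so the most specific ones come first.
TOKEN_PATTERN = re.compile(r"""
    \w+(?:\.\w+)+      # dotted names such as www.google.com
  | '\w+               # clitics such as 's
  | \w+                # plain words and numbers
  | [^\w\s]            # any single punctuation symbol
""", re.VERBOSE)

def simple_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order of appearance."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("It's on www.google.com."))
```

A real tokenizer covers many more cases (negations such as "n't", abbreviations, emoticons), which is why the library functions are usually the better choice.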

### Uniformisation
If your goal is to compare semantic entities and extract information from textual data, <br> it is usually important to work with a uniform set of tokens.
<br>
#### Lowercase
The simplest way to do that is Python's built-in string method : **lower()**
<img src = "Images/sentence2.png"/>

In [27]:
"HELLO LONDON".lower()

'hello london'

In [28]:
sentence.lower()

'ekimetrics is a consulting firm, with offices around the world.'

#### Other uniformisation processes
Other normalisations can be applied to textual data : 
- Mapping British and American spellings to a single form
- Mapping synonyms to a single term
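
As a rough illustration of these two ideas, the sketch below applies lookup tables after lowercasing. Both tables and the function name `uniformise` are hypothetical and deliberately tiny; a real project would need far larger resources.

```python
# Hypothetical lookup tables, kept tiny on purpose for the example.
BRITISH_TO_AMERICAN = {"colour": "color", "organisation": "organization", "centre": "center"}
SYNONYMS = {"firm": "company", "enterprise": "company"}

def uniformise(tokens):
    """Lowercase each token, then apply the spelling and synonym tables."""
    result = []
    for token in tokens:
        token = token.lower()
        token = BRITISH_TO_AMERICAN.get(token, token)  # unknown tokens are left unchanged
        token = SYNONYMS.get(token, token)
        result.append(token)
    return result

print(uniformise(["The", "Organisation", "is", "a", "consulting", "firm"]))
```

Lowercasing comes first so that the tables only need lowercase keys.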

<h2><u>Stop words and characters</u></h2><br>
**Some words carry more semantic meaning than others**.<br> In the sentence *the dog chased a cat.* the important tokens are **dog, chased, cat**<br>
The other tokens are called **stop words** or **stop characters**

<img src = "Images/sentence3.png"/>

Several entities here bring no semantic value to the analysis : 
- *the*
- *a*
- *.* at the end of the sentence
<br>

Most of the time, an NLP task requires removing all those stop tokens. They are mostly : 
- Pronouns and articles
- Punctuation
- Simple verbs (be, have)

<br>

**There are two main ways to remove stop words :**

#### Using dictionaries of stop words and characters
Indeed, the most straightforward approach is to build a list of all the words you would like to remove

##### Building a custom list
This is very useful when you need to remove custom characters or words.
- **Define the list**
- **Filter your list of tokens** against it

In [29]:
stop_words_list = ["the","a","is","was","has","have","with"]
stop_characters_list = [",","."]

In [30]:
print(sentence_as_list_of_tokens)
[token for token in sentence_as_list_of_tokens if token not in stop_words_list + stop_characters_list]

['Ekimetrics', 'is', 'a', 'consulting', 'firm', ',', 'with', 'offices', 'around', 'the', 'world', '.']


['Ekimetrics', 'consulting', 'firm', 'offices', 'around', 'world']

##### Using the libraries
NLTK and SpaCy ship with large lists of stop words in several languages 

In [31]:
from nltk.corpus import stopwords
print(stopwords.words("english")[:70])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about']


#### Considering the length of the token
Another method is to **remove short tokens**. Indeed, most of the time (but not always) shorter tokens carry less meaning. 
<br> This method is also much faster and simpler, and it is sometimes enough 

In [32]:
print(sentence_as_list_of_tokens)
[token for token in sentence_as_list_of_tokens if len(token) > 3]

['Ekimetrics', 'is', 'a', 'consulting', 'firm', ',', 'with', 'offices', 'around', 'the', 'world', '.']


['Ekimetrics', 'consulting', 'firm', 'with', 'offices', 'around', 'world']
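
Because short tokens are not always meaningless, one possible refinement is to combine the length filter with a small whitelist. This is only a sketch: the whitelist `KEEP_SHORT` and the function `filter_by_length` are hypothetical names introduced for the example.

```python
# Hypothetical whitelist of short tokens that still carry meaning.
KEEP_SHORT = {"uk", "ai", "tv"}

def filter_by_length(tokens, min_length=4, keep=KEEP_SHORT):
    """Keep tokens with at least min_length characters, plus whitelisted short ones."""
    return [
        token for token in tokens
        if len(token) >= min_length or token.lower() in keep
    ]

tokens = ["The", "TV", "stand", "was", "made", "in", "the", "UK", "."]
print(filter_by_length(tokens))
```

With `min_length=4` the length test matches the `len(token) > 3` condition used above.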

<h2><u>Lemmatization & stemming</u></h2>

### Stemming
Stemming strips word endings with heuristic rules. It is fast, but the resulting stem is not always a real word, as "cri" shows in the output of the example.

In [33]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

In [34]:
print(porter_stemmer.stem("saying"))
print(porter_stemmer.stem("crying"))
print(porter_stemmer.stem("trains"))
print(porter_stemmer.stem("string"))

say
cri
train
string


### Lemmatization
Lemmatization maps a word to its dictionary form (its lemma). The WordNet lemmatizer treats every word as a noun unless a part of speech is passed with `pos`, which is why "are" and "is" are only reduced to "be" once `pos = "v"` is given.

In [35]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [36]:
print(wordnet_lemmatizer.lemmatize("saying"))
print(wordnet_lemmatizer.lemmatize("crying"))
print(wordnet_lemmatizer.lemmatize("trains"))
print(wordnet_lemmatizer.lemmatize("string"))

saying
cry
train
string


In [37]:
print(wordnet_lemmatizer.lemmatize("are"))
print(wordnet_lemmatizer.lemmatize("is"))

are
is


In [38]:
print(wordnet_lemmatizer.lemmatize("are",pos = "v"))
print(wordnet_lemmatizer.lemmatize("is",pos = "v"))

be
be


<h2 style = "color:#BF0000"><u>Exercise 1</u></h2><br>
We have covered the main NLP preprocessing tasks : 
1. Tokenization
2. Uniformisation
3. Stop word and stop character removal
4. Lemmatization or stemming

Let's create a **preprocessing function** that performs all these tasks, <br>
taking a text as input and returning the filtered tokens

**Hint** : you can also use the following variable : 

In [5]:
from ekimetrics.nlp.utils import stop_characters_punctuation
print(stop_characters_punctuation[:20])

["'", '.', ';', ',', ':', '?', '!', "'", '"', '(', ')', '&', '”', '“', '‘', '’', ' ', '<', '>', '►']


In [10]:
def preprocessing(text,lemmatization = True):
    
    # TOKENIZATION
    tokens = nltk.wordpunct_tokenize(text)
    
    # LOWERCASE
    tokens = [token.lower() for token in tokens]
    
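    # STOP WORDS AND PUNCTUATION
    # Build the exclusion set once instead of reloading the stop word list for every token
    stop_tokens = set(stopwords.words("english")) | set(stop_characters_punctuation)
    tokens = [token for token in tokens if token not in stop_tokens]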
    
    # LEMMATIZATION
    if lemmatization:
        tokens = [wordnet_lemmatizer.lemmatize(token) for token in tokens]
        
    return tokens


preprocessing(sentence)

['ekimetrics', 'consulting', 'firm', 'office', 'around', 'world']

***
<h1 style = "color:#1E718B">Textual data representation </h1>

**The Yelp reviews dataset**

In [39]:
import pandas as pd
reviews = pd.read_pickle("Data/yelp_reviews.pkl")

In [40]:
reviews.iloc[0]["text"][:200] + "..."

'If you enjoy service by someone who is as competent as he is personable, I would recommend Corey Kaplan highly. The time he has spent here has been very productive and working with him educational and...'

In [13]:
texts = list(reviews["text"])

In [14]:
print(preprocessing(texts[0]))

['enjoy', 'service', 'someone', 'competent', 'personable', 'would', 'recommend', 'corey', 'kaplan', 'highly', 'time', 'spent', 'productive', 'working', 'educational', 'enjoyable', 'hope', 'need', 'though', 'highly', 'unlikely', 'knowing', 'nice', 'way', 'el', 'centro', 'ca', 'scottsdale', 'az']


In [42]:
# Lowercased tokens of the fourth review, used in the word count examples
example_tokens = [token.lower() for token in nltk.wordpunct_tokenize(texts[3])]

<h2><u>Word count & bag of words representation</u></h2>

### Word count

#### First method : dictionaries

In [43]:
count = {}
for token in example_tokens:
    if token in count:
        count[token] += 1
    else:
        count[token] = 1
print(count)

{'today': 1, 'i': 3, 'the': 3, 'small': 1, '!': 1, 'retro': 1, 'going': 1, 'use': 1, 'items': 1, 'highly': 1, 'went': 1, 'definitely': 1, 'great': 1, 'found': 2, 'other': 1, 'dresser': 1, "'": 1, 'fair': 1, 'delivered': 1, 'up': 1, 'looking': 1, 'a': 3, 'did': 1, 'piece': 1, 'm': 1, 'tv': 1, 'keep': 1, 'decor': 1, 'good': 1, 'in': 1, 'was': 1, 'perfect': 1, 'be': 1, 'work': 1, '.': 7, 'job': 1, 'and': 1, 'to': 1, 'look': 1, 'some': 1, 'recommended': 1, 'will': 1, 'stand': 1, 'for': 3, 'back': 1, 'yesterday': 1, 'they': 1, 'price': 1, 'shawn': 1, 'as': 1}


#### Second method : the Counter class

In [44]:
from collections import Counter
count = Counter(example_tokens)
print(count)

Counter({'.': 7, 'i': 3, 'the': 3, 'a': 3, 'for': 3, 'found': 2, 'today': 1, 'small': 1, '!': 1, 'retro': 1, 'going': 1, 'use': 1, 'items': 1, 'highly': 1, 'went': 1, 'definitely': 1, 'great': 1, 'other': 1, 'dresser': 1, "'": 1, 'fair': 1, 'delivered': 1, 'up': 1, 'looking': 1, 'did': 1, 'piece': 1, 'm': 1, 'tv': 1, 'keep': 1, 'decor': 1, 'good': 1, 'in': 1, 'was': 1, 'perfect': 1, 'be': 1, 'work': 1, 'job': 1, 'and': 1, 'to': 1, 'look': 1, 'some': 1, 'recommended': 1, 'will': 1, 'stand': 1, 'back': 1, 'yesterday': 1, 'they': 1, 'price': 1, 'shawn': 1, 'as': 1})


In [45]:
count.most_common(10)

[('.', 7),
 ('i', 3),
 ('the', 3),
 ('a', 3),
 ('for', 3),
 ('found', 2),
 ('today', 1),
 ('small', 1),
 ('!', 1),
 ('retro', 1)]
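
The counts above describe a single document. A bag of words representation applies the same idea to several documents over a shared vocabulary, giving one row of counts per document. The sketch below is a minimal pure-Python version; the function name `bag_of_words` and the two toy documents are assumptions made for the example.

```python
from collections import Counter

def bag_of_words(tokenized_documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in document i."""
    counts = [Counter(tokens) for tokens in tokenized_documents]
    # Shared vocabulary: every distinct token seen in any document, in sorted order
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing keys, so absent words need no special case
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix

documents = [
    ["great", "price", "great", "service"],
    ["fair", "price"],
]
vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
print(matrix)
```

On a real corpus most entries are zero, so a sparse matrix is usually preferred over nested lists.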

<h2 style = "color:#BF0000"><u>Exercise 2 : playing with the dataset</u></h2><br>
- Find the most common words across all the texts
- Fully preprocess the texts before the analysis
- Find the most common words for each Yelp rating

<h2><u>Tf-idf representation</u></h2>
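
As a starting point, here is a minimal sketch of one common tf-idf variant, written with the standard library only. The exact weighting is an assumption (term frequency normalised by document length, idf as the logarithm of the inverse document frequency); libraries often use smoothed variants, so their numbers can differ.

```python
import math
from collections import Counter

def tf_idf(tokenized_documents):
    """Return one {token: weight} dict per document.

    tf  = occurrences of the token in the document / document length
    idf = log(number of documents / number of documents containing the token)
    """
    n_documents = len(tokenized_documents)
    # Document frequency: in how many documents does each token appear?
    document_frequency = Counter()
    for tokens in tokenized_documents:
        document_frequency.update(set(tokens))

    weights = []
    for tokens in tokenized_documents:
        counts = Counter(tokens)
        weights.append({
            token: (count / len(tokens)) * math.log(n_documents / document_frequency[token])
            for token, count in counts.items()
        })
    return weights

documents = [
    ["great", "price", "great", "service"],
    ["fair", "price"],
]
for weights in tf_idf(documents):
    print(weights)
```

In this toy corpus "price" appears in every document, so its weight is zero: tf-idf favours words that are frequent in one document but rare across the corpus.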

<h2><u>Analysis using Ekimetrics library</u></h2>

In [17]:
from ekimetrics.nlp import spacy_models,models,utils

### Loading the spaCy engine

In [16]:
nlp = spacy_models.spacy_brain()

>> Loading spaCy NLP brain
... Process "loading spaCy" finished in 0m10s


### Loading the advanced corpus model with spaCy

In [18]:
spacy_corpus = spacy_models.Spacy_Corpus(json_path = "Data/texts_full_cleaned.json",nlp = nlp,max_documents=100)

[100/100] Spacy NLP analysis ... finished in 0m2s


### Cleaning the corpus

In [19]:
spacy_corpus.clean()

>> Cleaning the corpus
[100/100] Filtering unwanted tokens ... finished in 0m1s
[100/100] Lemmatizing tokens ... finished in 0m1s
[100/100] Filtering unwanted tokens ... finished in 0m1s
[100/100] Applying token collocation model on documents ... finished in 0m5s
[100/100] Applying token collocation model on documents ... finished in 0m5s


### Clustering the documents

In [20]:
spacy_corpus.clustering()
spacy_corpus.describe_clusters()

>> Clustering on the corpus ...
... Process "clustering" finished in 0m0s
>> Cluster 0 - 27 documents - top words : energy, service, contact, technology, solution
>> Cluster 1 - 8 documents - top words : solution, service, product, support, ice
>> Cluster 2 - 8 documents - top words : content, website, use, liability, party
>> Cluster 3 - 7 documents - top words : group, scheme, member, technology, water
>> Cluster 4 - 11 documents - top words : solar, energy, home, installation, year
>> Cluster 5 - 8 documents - top words : sensor, wind, technology, turbine, monitor
>> Cluster 6 - 2 documents - top words : reason, domain, configure, asap, visit
>> Cluster 7 - 2 documents - top words : transformer, special, substation, choke, smit
>> Cluster 8 - 5 documents - top words : ups, marine, power, severn, innovation
>> Cluster 9 - 22 documents - top words : product, contact, solution, power, engine


### Latent Dirichlet Allocation

In [21]:
lda = spacy_corpus.LDA(save = "lda_model",n_topics=20)

>> Computing the Latent Dirichlet Allocation algorithm on the corpus
... Indexing the tokens in the corpus
... Creating a Bag of Words representation of the corpus as a sparse matrix
... Applying the Latent Dirichlet Algorithm to detect the topics
... finished in 0m8s


In [22]:
import pyLDAvis
pyLDAvis.display(lda)

### Finding entities

In [23]:
spacy_corpus.find_entities().head()

| | document | entity | type |
|---|---|---|---|
| 0 | http://www.comatrol.com | qiao jin | PERSON |
| 1 | http://www.bcs.org | karen burt | PERSON |
| 2 | http://www.faradion.co.uk/ | merc | ORG |
| 3 | http://www.ssc-balticwind.com | tim petersen | PERSON |


***
<h1 style = "color:#1E718B"> Advanced</h1>

<h2><u>N-Grams models</u></h2>

<h2><u>Part Of Speech Tagging (POS Tagging)</u></h2>

<h2><u>Named Entity Recognition (NER)</u></h2>

### With NLTK

### With SpaCy

<h2><u>Sentiment Analysis</u></h2>

<h2><u>Word2Vec</u></h2>

<h2><u>Latent Dirichlet Allocation</u></h2>

<h2><u>Clustering</u></h2>

<h2><u>Deep Learning ?</u></h2>