
**A text is a sequence of:**


1. characters
2. words
3. sentences
4. paragraphs
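
These levels can be illustrated with a rough sketch in plain Python. It assumes paragraphs are separated by blank lines and sentences end in `.`, `!` or `?`, which real text often violates. The sample string is invented for illustration.

```python
import re

# Invented sample: two paragraphs separated by a blank line.
sample = "I am hungry. I cannot find food.\n\nThe shop is closed."

# Paragraphs: split on blank lines (an assumption about the input format).
paragraphs = sample.split("\n\n")

# Sentences: naive split after sentence-final punctuation.
sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s]

# Words: naive whitespace split, so punctuation stays attached.
words = [w for s in sentences for w in s.split()]

# Characters: every character of the raw text.
characters = list(sample)

print(len(paragraphs), len(sentences), len(words), len(characters))
```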



# Tokenization

In [0]:
from nltk.tokenize import WhitespaceTokenizer, WordPunctTokenizer, TreebankWordTokenizer

In [0]:
text = "I am hungry, I cann't find food."

In [0]:
ws_tokenizer = WhitespaceTokenizer()
print(ws_tokenizer.tokenize(text))

['I', 'am', 'hungry,', 'I', "cann't", 'find', 'food.']


In [0]:
wp_tokenizer = WordPunctTokenizer()
print(wp_tokenizer.tokenize(text))

['I', 'am', 'hungry', ',', 'I', 'cann', "'", 't', 'find', 'food', '.']


In [0]:
tbw_tokenizer = TreebankWordTokenizer()
print(tbw_tokenizer.tokenize(text))

['I', 'am', 'hungry', ',', 'I', 'can', "n't", 'find', 'food', '.']
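
The first two tokenizers can be approximated in a few lines of plain Python, which makes the differences in their output easier to see. This is a rough sketch, not NLTK's implementation: it assumes whitespace tokenization amounts to `str.split()` and that word/punctuation tokenization follows the regular expression `\w+|[^\w\s]+`. The helper names are hypothetical.

```python
import re

text = "I am hungry, I cann't find food."

def whitespace_tokens(s):
    # Split on runs of whitespace; punctuation stays attached to words.
    return s.split()

def wordpunct_tokens(s):
    # Alternate between runs of word characters and runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", s)

print(whitespace_tokens(text))
print(wordpunct_tokens(text))
```

The Treebank tokenizer is not reproduced here because it relies on a longer list of rules, including the contraction handling that splits off `n't`.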


# Normalization
1. stemming
2. lemmatization

Try both and choose the one that gives the best results on your task.
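
The difference between the two approaches can be shown with a toy example: stemming strips suffixes by rule, so the result may not be a real word, while lemmatization maps each word to its dictionary form. The suffix list and lookup table are hypothetical and far smaller than what a real stemmer or lemmatizer uses.

```python
# Toy illustration only: real stemmers apply many more rules, and real
# lemmatizers use a full dictionary plus part-of-speech information.
TOY_SUFFIXES = ("ing", "ed", "s")
TOY_LEMMAS = {
    "playing": "play",
    "was": "be",
    "dogs": "dog",
    "called": "call",
    "understood": "understand",
}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least 3 characters.
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up; unknown words are returned unchanged.
    return TOY_LEMMAS.get(word, word)

words = ["playing", "was", "dogs", "called", "understood"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

Irregular forms such as "was" and "understood" are where the two approaches diverge: no suffix rule applies, so only a dictionary lookup recovers the base form.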

In [0]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

In [0]:
text = "playing was missunderstands extractions dogs stations beautiful limitations called understood"
tokens = tbw_tokenizer.tokenize(text)

In [0]:
stemmer = PorterStemmer()
res = [stemmer.stem(token) for token in tokens]
print(res)

['play', 'wa', 'missunderstand', 'extract', 'dog', 'station', 'beauti', 'limit', 'call', 'understood']


In [0]:
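lemmatizer = WordNetLemmatizer()
# lemmatize() treats each token as a noun unless a pos argument is given,
# so verb forms such as 'playing' and 'called' are returned unchanged.
res = [lemmatizer.lemmatize(token) for token in tokens]
print(res)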

['playing', 'wa', 'missunderstands', 'extraction', 'dog', 'station', 'beautiful', 'limitation', 'called', 'understood']


In [0]:
import stanfordnlp
#stanfordnlp.download('en', force=True)

In [0]:
nlp = stanfordnlp.Pipeline(lang='en', processors='tokenize,lemma', use_gpu=False)
doc = nlp(text)
for sentence in doc.sentences:
  for word in sentence.words:
    print(word.text, word.lemma)

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
Done loading processors!
---
playing play
was be
missunderstands missunderstand
extractions extraction
dogs dog
stations station
beautiful beautiful
limitations limitation
called call
understood understand


# Vectorization (feature extraction)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

In [0]:
corpus = [
          "good movie",
          "not a good movie",
          "did not like",
          "I like it",
          "good one",
]

## Bag of words
Represent each text by counting how often each token or n-gram occurs in it.

In [0]:
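# Keep unigrams and bigrams that occur in at least 2 documents
# but in no more than 50% of them.
count_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.5)
x = count_vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
cols = count_vectorizer.get_feature_names_out()
rows = x.toarray()
features = pd.DataFrame(columns=cols, data=rows)
print(features.shape)
features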

(5, 4)


|   | good movie | like | movie | not |
|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 | 1 |
| 2 | 0 | 1 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 |
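
The four surviving features can be reproduced by hand, which shows what `min_df` and `max_df` do. This is a minimal sketch, assuming scikit-learn's default tokenization (lowercasing, tokens of two or more word characters). The `ngrams` helper is hypothetical and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ["good movie", "not a good movie", "did not like", "I like it", "good one"]

def ngrams(text, n_max=2):
    # Lowercase and keep tokens of at least two word characters,
    # which drops "a" and "I".
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

doc_grams = [ngrams(doc) for doc in corpus]

# Document frequency: the number of documents containing each n-gram.
doc_freq = Counter(g for grams in doc_grams for g in set(grams))
n_docs = len(corpus)

# min_df=2 keeps n-grams found in at least 2 documents;
# max_df=0.5 drops n-grams found in more than 50% of the documents.
vocab = sorted(g for g, df in doc_freq.items() if df >= 2 and df / n_docs <= 0.5)
counts = [[grams.count(term) for term in vocab] for grams in doc_grams]

print(vocab)
print(counts)
```

"good" occurs in 3 of the 5 documents (60%), so `max_df=0.5` removes it, while every n-gram that occurs in only one document falls below `min_df=2`.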


## TF-IDF (term frequency-inverse document frequency)




---


**$idf$**

---


*   $N = |D|$: total number of documents
*   $|\{d \in D: t\in d\}|$: number of documents in which term $t$ appears
*   $idf(t, D) = \log\left(\frac{N}{|\{d \in D: t\in d\}|}\right)$


---



**$tfidf$**

---


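*   $tf(t,d)= \begin{cases}\frac{f_{t,d}}{\sum_w f_{w,d}} & \text{(relative frequency)}\\ 1 + \log(f_{t,d}) & \text{(log-scaled count)}\end{cases}$, where $f_{t,d}$ is the number of times term $t$ occurs in document $d$ and the two cases are alternative definitions
*   $tfidf(t,d,D)=tf(t,d) \times idf(t, D)$
*   Terms that appear frequently in a small number of documents get a high $tfidf$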
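
The definitions above translate directly into code. This is a minimal sketch using the relative-frequency variant of $tf$ on a small, pre-tokenized toy corpus. The function names are hypothetical, and `idf` assumes the term occurs in at least one document (otherwise it divides by zero).

```python
import math

# Toy corpus, already tokenized (illustrative only).
docs = [
    ["good", "movie"],
    ["not", "good", "movie"],
    ["did", "not", "like"],
]

def tf(term, doc):
    # Relative frequency of the term within one document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / number of documents containing the term).
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "did" occurs in one document, "not" in two, so "did" scores higher.
print(tfidf("did", docs[2], docs))
print(tfidf("not", docs[2], docs))
```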


In [0]:
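# An integer max_df is an absolute count: drop terms found in more than 2 documents.
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=2)
x = tfidf_vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
cols = tfidf_vectorizer.get_feature_names_out()
rows = x.toarray()
features = pd.DataFrame(columns=cols, data=rows)
print(features.shape)
features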


(5, 4)


|   | good movie | like | movie | not |
|---|---|---|---|---|
| 0 | 0.707107 | 0.0 | 0.707107 | 0.0 |
| 1 | 0.57735 | 0.0 | 0.57735 | 0.57735 |
| 2 | 0.0 | 0.707107 | 0.0 | 0.707107 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
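
These values do not follow the textbook formula exactly. With its default settings, `TfidfVectorizer` uses raw counts for $tf$, a smoothed $idf = \ln\frac{1 + N}{1 + df} + 1$, and then scales each row to unit L2 norm. The following sketch recomputes row 1 ("not a good movie") by hand under those assumptions. The document frequencies and counts are copied from the bag-of-words example.

```python
import math

n_docs = 5
# Document frequency of each surviving feature in the corpus.
doc_freq = {"good movie": 2, "like": 2, "movie": 2, "not": 2}
# Raw counts of the features in "not a good movie".
counts = {"good movie": 1, "movie": 1, "not": 1}

# Smoothed idf, as used when smooth_idf=True (the default).
idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

# Raw tf-idf, then L2 normalization of the row (norm='l2', the default).
raw = {t: c * idf[t] for t, c in counts.items()}
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {t: v / norm for t, v in raw.items()}

print(weights)
```

Because all four features have the same document frequency, their idf values are equal, and normalization leaves each of the three non-zero entries at $1/\sqrt{3} \approx 0.57735$.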
