# Day22 - NLP

**Tokenization** is the process of splitting the initial string into smaller units, called tokens. These tokens can be words, digits or punctuation. This structure lets us analyze each token separately and decide which ones to keep. It can be done easily with _word_tokenize_ from NLTK (Natural Language Toolkit).

In [50]:
from nltk import word_tokenize
tokens = word_tokenize("Hi, my name is Francesco! :)")
tokens

['Hi', ',', 'my', 'name', 'is', 'Francesco', '!', ':', ')']
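
To make "splitting into tokens" more concrete, here is a minimal sketch of a regex-based tokenizer that uses only the standard library. It is not how _word_tokenize_ works internally; the name `simple_tokenize` and the pattern are assumptions made purely for illustration.

```python
import re

def simple_tokenize(text):
    # hypothetical helper, not part of NLTK:
    # a token is either a run of word characters (letters, digits, underscore)
    # or a single character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

simple_tokens = simple_tokenize("Hi, my name is Francesco! :)")
print(simple_tokens)
```

On this particular sentence the sketch yields the same tokens as the NLTK output above, but the two diverge on other inputs, for example contractions such as "don't".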

In most analyses we are not interested in the most common words, so we can drop the **stopwords** (very frequent words such as "I", "me", "you").
We can filter out stopwords and punctuation simply by iterating over the tokens obtained so far.

In [58]:
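import string
from nltk.corpus import stopwords

# build the stopword set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))

cleaned_tokens = []
for token in tokens:
    # remove stopwords (the NLTK list is lowercase) and punctuation
    if token.lower() not in english_stopwords and \
       token not in string.punctuation:
        cleaned_tokens.append(token)

cleaned_tokens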

['Hi', 'name', 'Francesco']
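
The same filter can be wrapped in a small reusable function. The sketch below is self-contained so it runs without the NLTK corpora: `remove_stopwords` and the tiny `DEMO_STOPWORDS` set are hypothetical stand-ins, and in practice you would probably pass `set(stopwords.words('english'))` instead.

```python
import string

# assumption: a tiny hand-written stopword set, used only for this demo
DEMO_STOPWORDS = {"i", "me", "you", "my", "is"}

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are neither stopwords nor pure punctuation."""
    kept = []
    for token in tokens:
        # compare in lowercase, since stopword lists are usually lowercase
        is_stopword = token.lower() in stop_words
        # True when every character of the token is a punctuation mark
        is_punctuation = all(ch in string.punctuation for ch in token)
        if not is_stopword and not is_punctuation:
            kept.append(token)
    return kept

demo_tokens = ['Hi', ',', 'my', 'name', 'is', 'Francesco', '!', ':', ')']
print(remove_stopwords(demo_tokens, DEMO_STOPWORDS))
```

Checking every character of the token, rather than testing membership in the `string.punctuation` string, also drops multi-character punctuation tokens such as "...".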

We can also remove some elements using **regular expressions** (the `re` module). In the following example we remove digits and "#", which is useful when analyzing tweets and similar text.

In [60]:
import re  # regular expressions
text_with_digits = "Today I feel gr8! #GGWP"

cleaned_text = re.sub(r"[0-9]", "", text_with_digits)  # drop every digit
cleaned_text = re.sub(r"#", "", cleaned_text)          # drop the "#" symbol
word_tokenize(cleaned_text)

['Today', 'I', 'feel', 'gr', '!', 'GGWP']
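
The two substitutions can also be collapsed into a single pass by putting both targets in one character class. This is only a sketch of one possible variant; `strip_digits_and_hashes` is a hypothetical helper name, not something the example above defines.

```python
import re

# one character class that matches any digit or the "#" symbol
DIGITS_AND_HASHES = re.compile(r"[0-9#]")

def strip_digits_and_hashes(text):
    # hypothetical helper: delete every matched character in a single pass
    return DIGITS_AND_HASHES.sub("", text)

single_pass_text = strip_digits_and_hashes("Today I feel gr8! #GGWP")
print(single_pass_text)
```

As in the example above, the digit is removed from inside the word, so "gr8" becomes "gr"; whether that is desirable depends on the analysis.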