## Introduction to Natural Language Processing

In recent years, advances in natural language processing have shown how much this family of algorithms can affect everyday life. Natural language processing (NLP) can be defined as a subset of machine learning that deals with the analysis and understanding of human language. It enables algorithms to understand, interpret, and generate natural language.

## NLP Preprocessing Tasks & Tokenization

Before diving into the nuances of training NLP models, it’s important to discuss why natural language data must be preprocessed. Machines are great at working with numbers, but not so great at working with words. Because of this, every implementation of a natural language processing model involves converting text to numbers, although models differ in the techniques they use for this conversion. While the techniques vary from model to model, it’s important to understand the fundamentals.


One of the first concepts in NLP data cleaning is tokenization. Tokenization is the process of separating text into individual elements such as words, phrases, and symbols. Each segment is referred to as a token. These tokens can be recombined to create different versions of the text, which are used downstream for embeddings.
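To make the idea of turning text into numbers concrete, here is a minimal sketch that uses only the Python standard library. The regular expression, the `simple_tokenize` and `build_vocab` helper names, and the sample sentence are illustrative assumptions and not part of NLTK; real pipelines typically rely on a library tokenizer and a model-specific vocabulary.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

sample_text = "The store is near the park."
sample_tokens = simple_tokenize(sample_text)
sample_vocab = build_vocab(sample_tokens)

# Replace every token with its integer id; repeated tokens share an id
sample_ids = [sample_vocab[token] for token in sample_tokens]

print(sample_tokens)
print(sample_ids)
```

Both occurrences of "the" map to the same id, which is the basic property any text-to-number conversion needs before embeddings can be learned.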

## Using NLTK 

To get started, this notebook uses the Natural Language Toolkit (NLTK) library. NLTK is a leading platform for building Python programs that work with human language data.

In [1]:
pip install nltk 

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pprint
import nltk
nltk.download('punkt')  # sentence tokenizer models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # stop word lists used later in this notebook

[nltk_data] Downloading package punkt to /Users/chautong/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/chautong/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

We'll start by tokenizing a simple sentence:

In [3]:
corpus = "Hey! I'm new in town. Can you please point me in the direction of the groccery store"

In [4]:
pprint.pprint(corpus)

("Hey! I'm new in town. Can you please point me in the direction of the "
 'groccery store')


There are many ways to tokenize the sample text. The first is to tokenize by sentence. The `sent_tokenize` function breaks the corpus up based on end-of-sentence punctuation.

![Tokenization](docs/images/tokenization.png)

In [5]:
from nltk.tokenize import sent_tokenize

# Tokenize the corpus into sentences

sent_tokenize(corpus, language="english")

['Hey!',
 "I'm new in town.",
 'Can you please point me in the direction of the groccery store']
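For intuition about what sentence tokenization involves, the sketch below splits on whitespace that follows `.`, `!`, or `?`, using only the standard library. The `naive_sent_tokenize` helper is a hypothetical name used for illustration and is not how `sent_tokenize` works internally. A rule this simple also breaks on abbreviations such as "Dr.", which is one reason to prefer a trained sentence tokenizer.

```python
import re

def naive_sent_tokenize(text):
    """Split text wherever whitespace follows '.', '!' or '?'."""
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]

# Works for simple punctuation-delimited sentences
simple_result = naive_sent_tokenize("Hey! I'm new in town. Where is the store?")
print(simple_result)

# Fails on abbreviations: "Dr." is wrongly treated as a sentence end
abbrev_result = naive_sent_tokenize("Dr. Smith is new in town.")
print(abbrev_result)
```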

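Another way to tokenize is by words. The `word_tokenize` function splits the text into individual words and punctuation marks. Note in the output that the contraction "I'm" is split into the two tokens `I` and `'m`.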

In [6]:
from nltk.tokenize import word_tokenize 

# Tokenize by word

word_tokenize(corpus, language="english", preserve_line=False)

['Hey',
 '!',
 'I',
 "'m",
 'new',
 'in',
 'town',
 '.',
 'Can',
 'you',
 'please',
 'point',
 'me',
 'in',
 'the',
 'direction',
 'of',
 'the',
 'groccery',
 'store']

In [8]:
# Tokenize each sentence into words separately

[
    word_tokenize(sentence, language="english", preserve_line=False)
    for sentence in sent_tokenize(corpus, language="english")
]

[['Hey', '!'],
 ['I', "'m", 'new', 'in', 'town', '.'],
 ['Can',
  'you',
  'please',
  'point',
  'me',
  'in',
  'the',
  'direction',
  'of',
  'the',
  'groccery',
  'store']]

In [9]:
from nltk.tokenize import wordpunct_tokenize

# Tokenize by word and punctuation 

wordpunct_tokenize(corpus)

['Hey',
 '!',
 'I',
 "'",
 'm',
 'new',
 'in',
 'town',
 '.',
 'Can',
 'you',
 'please',
 'point',
 'me',
 'in',
 'the',
 'direction',
 'of',
 'the',
 'groccery',
 'store']
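The difference from `word_tokenize` shows up at the contraction: `wordpunct_tokenize` splits "I'm" into three tokens instead of two. NLTK documents `WordPunctTokenizer` as being based on the regular expression `\w+|[^\w\s]+`, so its behavior can be approximated with the standard library. The `regex_wordpunct` helper below is a hypothetical name for that approximation, offered as a sketch and not as NLTK's actual implementation.

```python
import re

def regex_wordpunct(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

wordpunct_result = regex_wordpunct("Hey! I'm new in town.")
print(wordpunct_result)
```

Because the apostrophe is neither a word character nor whitespace, it becomes a token of its own, which matches the `"'"` and `'m'` entries in the output above.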

## Stopwords 
 
 
Once the text is tokenized, the next step is to identify and remove stop words. Stop words are common words in a language that aren’t considered helpful for understanding text. In English, these are words like “the”, “a”, “and”, and “in”. Because these words carry so little information, they are often removed so that the model only uses words that carry meaning.

In [10]:
from nltk.corpus import stopwords

# NLTK ships with a built-in list of English stop words
stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [12]:
# Remove stop words from the sentence

english_stop_words = stopwords.words("english")
tokens = word_tokenize(corpus.lower())  # lowercase so tokens match the lowercase stop word list
filtered_sentence = [word for word in tokens if word not in english_stop_words]

print(filtered_sentence)

['hey', '!', "'m", 'new', 'town', '.', 'please', 'point', 'direction', 'groccery', 'store']


In [13]:
# Say we wanted to add `please` as a stop word. The list returned by
# NLTK is a regular Python list, so it can be extended.

english_stop_words.append("please")

In [14]:
filtered_sentence = [word for word in tokens if word not in english_stop_words]

print(filtered_sentence)

['hey', '!', "'m", 'new', 'town', '.', 'point', 'direction', 'groccery', 'store']
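The filtered output still contains punctuation tokens, and the list comprehension scans the whole stop word list for every token. One possible refinement, sketched below, wraps the step in a function that converts the stop words to a `set` for faster lookups and optionally drops tokens made up entirely of punctuation. The `remove_stop_words` name and the small hand-written stop word list are assumptions for illustration; in this notebook, `english_stop_words` could be passed instead.

```python
import string

def remove_stop_words(tokens, stop_words, drop_punctuation=True):
    """Return tokens that are not stop words and, optionally, not pure punctuation."""
    stop_set = set(stop_words)  # set lookup avoids scanning a list for every token
    kept = []
    for token in tokens:
        if token in stop_set:
            continue
        if drop_punctuation and all(ch in string.punctuation for ch in token):
            continue
        kept.append(token)
    return kept

demo_tokens = ["hey", "!", "i", "'m", "new", "in", "town", "."]
demo_stop_words = ["i", "in", "the", "a"]  # illustrative subset, not NLTK's list

demo_filtered = remove_stop_words(demo_tokens, demo_stop_words)
print(demo_filtered)
```

The token `'m` survives because it is neither in the illustrative stop word list nor pure punctuation; whether to drop such contraction fragments is a design choice that depends on the downstream model.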


### Stopwords in Alternative Languages 


The example above uses English stop words; however, NLTK supports several other languages.

In [15]:
stopwords.words("french")

['au',
 'aux',
 'avec',
 'ce',
 'ces',
 'dans',
 'de',
 'des',
 'du',
 'elle',
 'en',
 'et',
 'eux',
 'il',
 'ils',
 'je',
 'la',
 'le',
 'les',
 'leur',
 'lui',
 'ma',
 'mais',
 'me',
 'même',
 'mes',
 'moi',
 'mon',
 'ne',
 'nos',
 'notre',
 'nous',
 'on',
 'ou',
 'par',
 'pas',
 'pour',
 'qu',
 'que',
 'qui',
 'sa',
 'se',
 'ses',
 'son',
 'sur',
 'ta',
 'te',
 'tes',
 'toi',
 'ton',
 'tu',
 'un',
 'une',
 'vos',
 'votre',
 'vous',
 'c',
 'd',
 'j',
 'l',
 'à',
 'm',
 'n',
 's',
 't',
 'y',
 'été',
 'étée',
 'étées',
 'étés',
 'étant',
 'étante',
 'étants',
 'étantes',
 'suis',
 'es',
 'est',
 'sommes',
 'êtes',
 'sont',
 'serai',
 'seras',
 'sera',
 'serons',
 'serez',
 'seront',
 'serais',
 'serait',
 'serions',
 'seriez',
 'seraient',
 'étais',
 'était',
 'étions',
 'étiez',
 'étaient',
 'fus',
 'fut',
 'fûmes',
 'fûtes',
 'furent',
 'sois',
 'soit',
 'soyons',
 'soyez',
 'soient',
 'fusse',
 'fusses',
 'fût',
 'fussions',
 'fussiez',
 'fussent',
 'ayant',
 'ayante',
 'ayantes',


In [None]:
stopwords.words("spanish")

In [None]:
stopwords.words("italian")