<a href="https://colab.research.google.com/github/cwmarris/pull-request-monitor/blob/master/OH_Introduction_To_NLP_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP 01: Overview

* Author: Amy Zhuang
* Date: July 2020

## What is NLP?

Natural Language Processing (NLP) is a field of data science that gives machines the ability to read, understand, and derive meaning from human language.

## What are NLP's use cases?

* Sentiment Analysis
* Topic Modeling (unsupervised)
* Named Entity Recognition (NER)
* Part-of-Speech (POS) Tagging
* Language Translation
* Language Generation
* Text Summarization
* Text Classification (supervised)
* Text Segmentation (unsupervised)
* Speech to Text and Text to Speech
* Chatbots


## NLP Terminology

* Stop Words
* Tokenization
* Stemming
* Lemmatization
* Count Vectorization
* TF-IDF: stands for term frequency-inverse document frequency. A term's importance increases in proportion to the number of times it appears in a document, but is offset by how frequently the term appears across the corpus.
  TF-IDF(t) = TF(t) * IDF(t)
 * TF: Term Frequency, which measures how often a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in long documents than in short ones. The term frequency is therefore often divided by the document length (i.e., the total number of terms in the document) as a form of normalization.
 * IDF: Inverse Document Frequency, which measures how important a term is. When computing TF, all terms are treated as equally important. However, certain terms such as "is", "of", and "that" may appear many times yet carry little importance. We therefore weigh down frequent terms and scale up rare ones by computing IDF(t) = log(total number of documents / number of documents containing term t).

* Bag of Words
* Word Embedding: words are represented as dense numeric vectors whose geometry reflects meaning; for example, king - man + woman ≈ queen
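
Returning to the TF-IDF entry in the list above, the TF and IDF definitions can be sketched in plain Python. This is a minimal illustration, assuming the common textbook form IDF(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The function names and the two-sentence corpus are made up for the example, and libraries such as scikit-learn use a smoothed variant, so their numbers will differ.

```python
import math

def term_frequency(term, doc_tokens):
    """Share of the document's tokens that equal `term` (length-normalized TF)."""
    return doc_tokens.count(term) / len(doc_tokens)

def inverse_document_frequency(term, corpus_tokens):
    """ln(N / df): N documents in total, df of them contain `term`."""
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df)  # assumes the term occurs somewhere (df > 0)

def tf_idf(term, doc_tokens, corpus_tokens):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus_tokens)

# Toy corpus, already lowercased and split on whitespace (illustrative only).
corpus = [
    "data science is fun".split(),
    "data science helps us to make data driven decisions".split(),
]

print(tf_idf("data", corpus[0], corpus))  # 0.0, because "data" appears in every document
print(tf_idf("fun", corpus[0], corpus))   # 0.25 * ln(2), because "fun" appears in one of two documents
```

A term found in every document gets an IDF of zero under this form, which is exactly the "weigh down frequent terms" effect described above.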

## Hands-on Exercise

### Stopwords

In [None]:
text = 'HBAP students benefit from world-class instruction in courses designed by esteemed Harvard faculty and collaborate with diverse peers in highly interactive online classes. '

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stop word list and the Punkt models that word_tokenize relies on.
# Newer NLTK releases may also need nltk.download('punkt_tab').
nltk.download('stopwords')
nltk.download('punkt')

STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
# Tokenization
tokens = word_tokenize(text)
tokens

['HBAP',
 'students',
 'benefit',
 'from',
 'world-class',
 'instruction',
 'in',
 'courses',
 'designed',
 'by',
 'esteemed',
 'Harvard',
 'faculty',
 'and',
 'collaborate',
 'with',
 'diverse',
 'peers',
 'in',
 'highly',
 'interactive',
 'online',
 'classes',
 '.']
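
To make the idea of tokenization concrete without NLTK, here is a rough sketch using only the standard library. The regular expression and the name `simple_tokenize` are illustrative assumptions, not what `word_tokenize` does internally; NLTK's tokenizer handles abbreviations, contractions, and many other cases that this pattern ignores.

```python
import re

# A word (optionally with internal hyphens or apostrophes) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens using the pattern above."""
    return TOKEN_PATTERN.findall(text)

sample = "HBAP students benefit from world-class instruction."
print(simple_tokenize(sample))
# ['HBAP', 'students', 'benefit', 'from', 'world-class', 'instruction', '.']
```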

In [None]:
# Remove stop words (NLTK's list is lowercase, so compare lowercased tokens)
text_no_stopwords = [w for w in tokens if w.lower() not in STOPWORDS]
text_no_stopwords

['HBAP',
 'students',
 'benefit',
 'world-class',
 'instruction',
 'courses',
 'designed',
 'esteemed',
 'Harvard',
 'faculty',
 'collaborate',
 'diverse',
 'peers',
 'highly',
 'interactive',
 'online',
 'classes',
 '.']
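
The filtered list above still contains the `'.'` token. A common follow-up step, sketched here with the standard library only, is to drop punctuation along with the stop words. The small `STOP_WORDS` set is a hand-picked stand-in for NLTK's English list, and `remove_stop_words` is a hypothetical helper name.

```python
import string

# Hand-picked stand-in for NLTK's English stop word list (assumption).
STOP_WORDS = {"from", "in", "by", "and", "with"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens that are neither stop words (case-insensitive) nor pure punctuation."""
    return [
        t for t in tokens
        if t.lower() not in stop_words
        and not all(ch in string.punctuation for ch in t)
    ]

tokens = ["HBAP", "students", "benefit", "from", "world-class", "instruction", "In", "."]
print(remove_stop_words(tokens))
# ['HBAP', 'students', 'benefit', 'world-class', 'instruction']
```

Hyphenated tokens such as `world-class` survive because only tokens made up entirely of punctuation are removed.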

### Stemming

In [None]:
# Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()  # create the stemmer once and reuse it for every token
text_stemmed = [stemmer.stem(w) for w in text_no_stopwords]
text_stemmed

['hbap',
 'student',
 'benefit',
 'world-class',
 'instruct',
 'cours',
 'design',
 'esteem',
 'harvard',
 'faculti',
 'collabor',
 'divers',
 'peer',
 'highli',
 'interact',
 'onlin',
 'class',
 '.']

### Lemmatization

In [None]:
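# Lemmatization
# lemmatize() treats every word as a noun unless pos= is given,
# which is why verb forms such as 'designed' are returned unchanged.
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()
text_lemma = [wn.lemmatize(w) for w in text_no_stopwords]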
text_lemma

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


['HBAP',
 'student',
 'benefit',
 'world-class',
 'instruction',
 'course',
 'designed',
 'esteemed',
 'Harvard',
 'faculty',
 'collaborate',
 'diverse',
 'peer',
 'highly',
 'interactive',
 'online',
 'class',
 '.']

Tip: Choose stemming for speed and lemmatization for accuracy.
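
The trade-off in the tip can be illustrated with a toy comparison. This is a sketch, not the Porter algorithm or WordNet: `naive_stem` strips a few illustrative suffixes by rule, while `lookup_lemma` consults a tiny hand-written dictionary. Rules are cheap to apply but can produce non-words; a dictionary returns real words but has to be built and searched.

```python
# Suffixes tried in order; an illustrative list, far cruder than Porter's rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny hand-written lemma dictionary standing in for WordNet (assumption).
LEMMAS = {"courses": "course", "classes": "class", "designed": "design", "peers": "peer"}

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Return the dictionary form if the word is known, else the word unchanged."""
    return LEMMAS.get(word.lower(), word)

for w in ["courses", "classes", "designed", "peers"]:
    print(w, naive_stem(w), lookup_lemma(w))
# courses cours course
# classes class class
# designed design design
# peers peer peer
```

Note how the rule-based version turns `courses` into the non-word `cours`, while the lookup returns `course`.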

### Count Vectorization

In [None]:
# Count Vectorization
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [None]:
text1 = ['Data science is fun.', 'Data science helps us to make data driven decisions.']

In [None]:
vectorizer.fit(text1)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [None]:
print('Vocabulary: ')
print(vectorizer.vocabulary_)

Vocabulary: 
{'data': 0, 'science': 7, 'is': 5, 'fun': 3, 'helps': 4, 'us': 9, 'to': 8, 'make': 6, 'driven': 2, 'decisions': 1}


In [None]:
vector = vectorizer.transform(text1)

In [None]:
print('Full vector: ')
print(vector.toarray())

Full vector: 
[[1 0 0 1 0 1 0 1 0 0]
 [2 1 1 0 1 0 1 1 1 1]]
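
To see what `CountVectorizer` is doing, the same matrix can be rebuilt with the standard library. This sketch assumes the default settings shown above (lowercasing, and the token pattern `(?u)\b\w\w+\b`, which keeps tokens of two or more word characters) and an alphabetically sorted vocabulary. `count_vectorize` is a hypothetical helper name, and unlike scikit-learn it returns dense lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Tokens of two or more word characters; str patterns are Unicode-aware in Python 3.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def count_vectorize(docs):
    """Return (vocabulary, rows): a term-to-column mapping and one count row per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = {term: i for i, term in enumerate(sorted(set().union(*tokenized)))}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])  # dicts preserve insertion order
    return vocabulary, rows

text1 = ["Data science is fun.", "Data science helps us to make data driven decisions."]
vocabulary, rows = count_vectorize(text1)
print(vocabulary)
for row in rows:
    print(row)
# [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]
# [2, 1, 1, 0, 1, 0, 1, 1, 1, 1]
```

Each column corresponds to one vocabulary term, so the `2` in the second row is the count of `data` in the second sentence.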


### TF-IDF

In [None]:
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()

In [None]:
tfidf.fit(text1)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [None]:
print('Vocabulary: ')
print(tfidf.vocabulary_)

Vocabulary: 
{'data': 0, 'science': 7, 'is': 5, 'fun': 3, 'helps': 4, 'us': 9, 'to': 8, 'make': 6, 'driven': 2, 'decisions': 1}


In [None]:
vector_tfidf = tfidf.transform(text1)
print('Full vector: ')
print(vector_tfidf.toarray())

Full vector: 
[[0.40993715 0.         0.         0.57615236 0.         0.57615236
  0.         0.40993715 0.         0.        ]
 [0.48719673 0.342369   0.342369   0.         0.342369   0.
  0.342369   0.24359836 0.342369   0.342369  ]]
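
These numbers differ from a plain TF × IDF product because `TfidfVectorizer`, with the defaults shown above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), uses raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit Euclidean length. The sketch below re-derives the matrix by hand with the standard library, under those assumptions; the helper name `tfidf_rows` is hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # same default token pattern as above

def tfidf_rows(docs):
    """Return (vocab, rows) using raw-count TF, smoothed IDF, and L2 row normalization."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocab = sorted(set().union(*tokenized))
    n = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing the term.
    idf = {
        t: math.log((1 + n) / (1 + sum(t in doc for doc in tokenized))) + 1
        for t in vocab
    }
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))  # L2 norm; assumes a non-empty document
        rows.append([x / norm for x in raw])
    return vocab, rows

text1 = ["Data science is fun.", "Data science helps us to make data driven decisions."]
vocab, rows = tfidf_rows(text1)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:10s} {weight:.8f}")
```

For the first sentence, `data` and `science` appear in both documents and should come out near 0.40993715, while `fun` and `is` appear in only one and should come out near 0.57615236, matching the first row printed above.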


## NLP Learning Materials

* NLP with Deep Learning from Stanford: https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
* NLP with Python: https://www.udemy.com/course/nlp-natural-language-processing-with-python/
* spaCy documentation: https://spacy.io/api/doc
* NLTK documentation: https://www.nltk.org/
