# Natural Language Processing with Python

Natural language processing (NLP) is the area of computer science concerned with enabling computers to understand human language, whether spoken or written. Common NLP applications include:

- Sentiment Analysis - Determine the tone of text
- Speech Recognition - Transcribe a sound clip to text
- Predictive Text - Complete sentences based on a few words

In this course, we will walk through the basics of NLP with Python libraries such as pandas, spaCy, and scikit-learn. We will cover the following topics:
- Preprocessing
- Term Frequency
- Part of Speech Tagging
- Named Entity Recognition
- Text Similarity
- Dependency Parsing

This course assumes you have beginner-to-intermediate knowledge of Python.

## Downloading PyCharm
We will use PyCharm as the primary integrated development environment (IDE) for this tutorial, but feel free to use your own IDE. To install PyCharm, select the Community Edition from [this link](https://www.jetbrains.com/pycharm/download/) (it's free!).

## spaCy
Throughout this tutorial, we will be using spaCy, a popular open-source library for NLP in Python. The library is designed to help you create applications that process and understand text. spaCy offers several pre-made text processing pipelines on its site. The pipelines are packaged as [models](https://spacy.io/models/en) which can be downloaded. For this demo, we will download the small English model trained on text from blogs, news, and comments.

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import pandas as pd
import spacy
from spacy import displacy
import string
from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer
from pathlib import Path
nlp = spacy.load('en_core_web_sm')


## The Data
In this course, we will analyze a sample of 500 Amazon Home and Kitchen product reviews. [The data](http://jmcauley.ucsd.edu/data/amazon/links.html) is provided by Julian McAuley at the University of California, San Diego and contains reviews from May 1996 - July 2014. In addition to reviews (ratings, text, helpfulness votes), McAuley provides product metadata (descriptions, category information, price, brand, and image features) and links (also viewed/also bought graphs). 

For this course, we will focus on the review data only. This data is a great example of how people typically communicate through text: it includes reviews with typos, run-on sentences, and grammatical errors.

McAuley provides the following functions to parse the JSON dataset and save it as a dataframe.

In [None]:
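def parse(path):
    # Yield one review record per line of the file.
    # eval() executes arbitrary code, so only use it on files you trust.
    with open(path, 'rb') as f:
        for line in f:
            yield eval(line)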


def getDF(path):
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
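
If a file in this format happens to be strict JSON (one object per line), the standard `json` module is a safer alternative to `eval`. The sketch below is only an illustration, not McAuley's code: `parse_json_lines` is a hypothetical helper, the two sample records are made up, and the `overall` field name is an assumption. It reads from an in-memory buffer so it runs without the data set.

```python
import io
import json


def parse_json_lines(lines):
    """Yield one dict per non-empty line, assuming each line is strict JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up records standing in for the review file
sample_file = io.StringIO(
    '{"reviewText": "Great pan, heats evenly.", "overall": 5.0}\n'
    '{"reviewText": "Handle broke after a week.", "overall": 2.0}\n'
)

records = list(parse_json_lines(sample_file))
print(len(records), records[0]["reviewText"])
```

Passing `records` to `pd.DataFrame(records)` would then build the dataframe in one step. If the lines are not strict JSON, `json.loads` raises an error and this approach does not apply.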

In [None]:
df = getDF('./large_files/Home_and_Kitchen_5.json')
df_sample = df.sample(n=500, random_state=1)

In [None]:
df_sample.head()

## Text Preprocessing
Text preprocessing applies a series of steps that clean or transform text so that it is easier for a computer to analyze. There are several common preprocessing steps. Let's take an example sentence and apply these steps to it.

Sentence: "She was offered the job 11 months ago."

- Lowercase 
    - "she was offered the job 11 months ago."
- Remove punctuation 
    - "she was offered the job 11 months ago"
- Remove numbers 
    - "she was offered the job months ago"
- Remove stop words - remove words that are very common in the English language 
    - "offered job months ago"
- Tokenization - splitting the sentence up into tokens
    - "offered", "job", "months", "ago"
- Stemming / lemmatization - transforming the token into its root form 
    - "offer", "job", "month", "ago"

While these are all very popular preprocessing steps, they may not all be used on every project, or even in this same order. The data you have and the problem you're trying to solve determine which of these steps (and others) belong in your preprocessing. For example, if you want to see how many sentences are in the average Amazon review, you shouldn't remove punctuation.
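
The first five steps above can be sketched in plain Python to make the order of operations concrete. This is a minimal illustration rather than how spaCy works internally: `basic_preprocess` is a hypothetical helper, and `STOP_WORDS` is a deliberately tiny, made-up stop-word list. Lemmatization is left out because it needs a linguistic model.

```python
import string

# Hypothetical, tiny stop-word list for illustration only;
# real lists (spaCy's, scikit-learn's) contain hundreds of words.
STOP_WORDS = {"she", "was", "the", "a", "an", "is"}


def basic_preprocess(sentence):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    no_digits = "".join(ch for ch in no_punct if not ch.isdigit())
    tokens = no_digits.split()  # whitespace tokenization
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(basic_preprocess("She was offered the job 11 months ago."))
# → ['offered', 'job', 'months', 'ago']
```

Changing the order changes the result: for instance, filtering stop words before lowercasing would leave "She" in place, because the list only contains lowercase entries.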

spaCy allows us to apply all of the preprocessing steps above in a single line of code by using [token attributes](https://spacy.io/api/token#attributes).
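- `token.lemma_` - the lemmatized (root) form of the token
- `token.is_alpha` - `True` if the token contains only alphabetic characters; filtering on it removes punctuation and numbers
- `token.is_stop` - `True` if the token is a stop word; filtering on it removes stop words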

In [None]:
# Try out text preprocessing on sample text
text = "She was offered the job 11 months ago."
doc = nlp(text)
text_clean = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]

In [None]:
print(doc)

In [None]:
print(text_clean)

In [None]:
# Preprocess the entire dataframe
def preprocess_text(spacy_doc: spacy.tokens.doc.Doc) -> str:
    """
    Preprocess a spacy Doc by lemmatizing, removing stop words, and removing non-alphabetical characters.
    
    Parameters
    ----------
    spacy_doc: spacy.tokens.doc.Doc
        A spacy Doc object, i.e. a sequence of Token objects

    Returns
    -------
    str
        The cleaned text

    """
    text_clean = [token.lemma_ for token in spacy_doc if token.is_alpha and not token.is_stop]
    return ' '.join(text_clean)


df_sample['spacy_doc'] = df_sample['reviewText'].apply(lambda x: nlp(x))
df_sample['review_text_clean'] = df_sample['spacy_doc'].apply(lambda x: preprocess_text(x))

In [None]:
print(df_sample['spacy_doc'].head())

In [None]:
print(df_sample['review_text_clean'].head())

## Term Frequency
With hundreds of reviews in our sample (and many more in the full data set), we don't have time to read through each individual review to learn more about the data. One way we can summarize the data is through the most popular words in reviews (term frequency). To do this, we use `CountVectorizer` from `scikit-learn`.

CountVectorizer converts text into a matrix of token counts. It has a variety of [parameters](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) you can use to customize your results, but we will focus on `stop_words` and `ngram_range`.
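
To make "a matrix of token counts" concrete, here is a plain-Python sketch of the idea. It is not scikit-learn's implementation: `count_matrix` is a hypothetical helper, the two texts are made up, and it splits on whitespace only, whereas `CountVectorizer` applies its own tokenization rules.

```python
def count_matrix(texts):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in text i."""
    tokenized = [text.lower().split() for text in texts]
    # One column per distinct term, in alphabetical order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in texts]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            matrix[i][column[tok]] += 1
    return vocabulary, matrix


vocabulary, matrix = count_matrix(["good pan good price", "bad pan"])
print(vocabulary)  # → ['bad', 'good', 'pan', 'price']
print(matrix)      # → [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each row is one document and each column is one term, so summing a column gives that term's total frequency across all documents.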

Although we previously removed stop words in our preprocessing steps, scikit-learn uses a different set of stop words than spaCy. Oftentimes it is beneficial to combine multiple stop words lists or create your own custom list to exclude common words that don't add value.

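N-grams break text up into chunks of n consecutive tokens. An example of a 1-gram is "hello", and an example of a 2-gram is "hello there". The `ngram_range` parameter of `CountVectorizer` takes a `(min_n, max_n)` pair: `(1, 1)` counts single words only, `(1, 2)` counts both single words and two-word phrases, and `(3, 3)`, used below, counts only three-word phrases. Modifying it allows us to see the most popular words and the most popular phrases.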
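
The following plain-Python sketch shows how n-grams over a range of sizes can be generated and counted. It is an illustration of the concept only: `ngrams` and `ngram_counts` are hypothetical helpers, the sample texts are made up, and tokens are split on whitespace, so the counts may differ from what `CountVectorizer` produces on the same input.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(texts, min_n, max_n):
    """Count all n-grams with min_n <= n <= max_n across the texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        for n in range(min_n, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


counts = ngram_counts(["great coffee maker", "great coffee"], 1, 2)
print(counts["great coffee"], counts["coffee maker"], counts["maker"])
# → 2 1 1
```

A text with fewer than n tokens contributes no n-grams of that size, which is why very short reviews disappear from a trigram-only count.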

In [None]:
vectorizer = CountVectorizer(stop_words='english', ngram_range=(3, 3))
docs = vectorizer.fit_transform(df_sample['review_text_clean'])
features = vectorizer.get_feature_names_out()

Next, we plot the most frequent n-grams with `FreqDistVisualizer` from the `yellowbrick` library.

In [None]:
visualizer = FreqDistVisualizer(features=features, size=(1080, 720))
visualizer.fit(docs)
visualizer.show()

In [None]:
df_sample['token_count_all'] = df_sample['spacy_doc'].apply(lambda x: len(x))
df_sample['token_count_clean'] = df_sample['review_text_clean'].apply(lambda x: len(x.split()))

In [None]:
df_sample['token_count_all'].value_counts().sort_index().plot.bar(figsize=(40,5), 
                                                                    title='All Tokens per Review',
                                                                    xlabel='Tokens',
                                                                    ylabel='Number of Reviews')

In [None]:
df_sample['token_count_clean'].value_counts().sort_index().plot.bar(figsize=(17,5), 
                                                                    title='Clean Tokens per Review',
                                                                    xlabel='Tokens',
                                                                    ylabel='Number of Reviews')

In [None]:
df_sample[df_sample['token_count_clean'] == 271]['reviewText'].values[0]

## Named Entity Recognition
Named Entity Recognition (NER) identifies real-world objects such as people, places, or things in text. NER is useful in many scenarios, for example identifying and masking sensitive information like the names of people. spaCy recognizes [several different types of entities](https://v2.spacy.io/api/annotation#named-entities) and has a nice visualization to highlight all entities it recognized in text.
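
As a rough illustration of the masking use case, the sketch below replaces entity spans with their labels. The character offsets are written by hand in the `(start_char, end_char, label)` shape that spaCy entities expose; they are assumptions for this example, not output from a model, and `mask_entities` is a hypothetical helper.

```python
def mask_entities(text, entities, labels_to_mask=("PERSON",)):
    """Replace each entity span whose label is in labels_to_mask with [LABEL]."""
    masked = text
    # Work from right to left so earlier character offsets stay valid
    for start, end, label in sorted(entities, reverse=True):
        if label in labels_to_mask:
            masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


text = "Maria bought this blender from Amazon in 2013."
# Hand-written spans, assumed to resemble what an NER model might return
entities = [(0, 5, "PERSON"), (31, 37, "ORG"), (41, 45, "DATE")]

print(mask_entities(text, entities))
# → [PERSON] bought this blender from Amazon in 2013.
```

Passing `labels_to_mask=("PERSON", "ORG")` would mask the organization as well.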

In [None]:
doc = df_sample['spacy_doc'][271416]
print(doc)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

In [None]:
displacy.render(df_sample['spacy_doc'][271416], style="ent")

In [None]:
# Recognize all entities in all Amazon reviews
# Collect one record per entity, then build the dataframe once
# (faster than growing the dataframe one cell at a time)
records = []
for row in df_sample.itertuples():
    for ent in row.spacy_doc.ents:
        records.append({
            'index': row.Index,
            'spacy_doc': row.spacy_doc,
            'entity_text': ent.text,
            'entity_label': ent.label_,
            'entity_start': ent.start_char,
            'entity_end': ent.end_char,
        })

df_entities = pd.DataFrame(records, columns=['index', 'spacy_doc', 'entity_text', 'entity_label', 'entity_start', 'entity_end'])

In [None]:
df_entities.head()

In [None]:
# See the most popular entity labels recognized
df_entities['entity_label'].value_counts()

In [None]:
# See which organizations are recognized
# (swap in PRODUCT, PERSON, or WORK_OF_ART to explore other entity labels)
df_filtered = df_entities[df_entities['entity_label'] == 'ORG']

In [None]:
df_filtered['entity_text'].value_counts()

## Part of Speech Tagging
Part-of-speech (POS) tagging determines which [part of speech](https://en.wikipedia.org/wiki/Part_of_speech) each token is. This usually occurs behind the scenes before lemmatization, since many words can serve as multiple parts of speech and may be lemmatized differently depending on which one applies. Additionally, POS tagging is used as a foundation for NER and many other text processing steps. One real-world application of POS tagging is to distinguish between words with the same spelling but different meanings for translation. For example, if a computer were translating "Can you throw this can in the trash?" to Spanish, it would need to know that "can" has two different parts of speech in this sentence.
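
The "can" example can be sketched as a lookup keyed on both the word and its tag. Everything here is a toy assumption: the tags are assigned by hand rather than produced by a tagger, and `TRANSLATIONS` and `translate_word` are hypothetical names holding just the two entries needed for this sentence.

```python
# Hand-assigned (token, POS) pairs; a real tagger would produce these
tagged_sentence = [
    ("Can", "AUX"), ("you", "PRON"), ("throw", "VERB"), ("this", "DET"),
    ("can", "NOUN"), ("in", "ADP"), ("the", "DET"), ("trash", "NOUN"),
    ("?", "PUNCT"),
]

# Toy dictionary: the same spelling maps to different Spanish words
TRANSLATIONS = {
    ("can", "AUX"): "poder",   # to be able to
    ("can", "NOUN"): "lata",   # a metal container
}


def translate_word(tagged_tokens, word):
    """Return the translation chosen for each occurrence of word."""
    return [
        TRANSLATIONS[(token.lower(), pos)]
        for token, pos in tagged_tokens
        if token.lower() == word
    ]


print(translate_word(tagged_sentence, "can"))
# → ['poder', 'lata']
```

Without the tag, both occurrences would share a single dictionary key and one of them would be translated incorrectly.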

In [None]:
doc = df_sample['spacy_doc'][271416]
for token in doc:
    print(token.text, token.pos_, token.dep_)

In [None]:
# Dependency Parsing

displacy.render(df_sample['spacy_doc'][271416], style="dep")

In [None]:
doc = df_sample['spacy_doc'][271416]

In [None]:
for token in doc:
    if token.pos_ == 'ADJ' or token.pos_ == 'ADV':
        print(token.text, token.pos_, token.dep_)

In [None]:
def count_adverbs_adjectives(spacy_doc: spacy.tokens.doc.Doc) -> int:
    """
    Count the number of adjectives and adverbs in the text
    
    Parameters
    ----------
    spacy_doc: spacy.tokens.doc.Doc
        A spacy Doc object, i.e. a sequence of Token objects

    Returns
    -------
    int
        The number of adverbs and adjectives in the text

    """
    # Count the tokens tagged as adjectives or adverbs
    return sum(1 for token in spacy_doc if token.pos_ in ('ADJ', 'ADV'))

In [None]:
df_sample['count_adj_adv'] = df_sample['spacy_doc'].apply(lambda x: count_adverbs_adjectives(x))

In [None]:
df_sample['count_adj_adv'].value_counts().sort_index().plot.bar(figsize=(15,5))

In [None]:
df_sample[df_sample['count_adj_adv'] == 119]['reviewText'].values[0]