## Natural Language Processing

First, think about C-3PO, Luke Skywalker's robot sidekick in Star Wars. C-3PO is a fantasized version of human-computer interaction in the distant future. However, humans interacting with machines is already an everyday reality for us. Your home or car smart assistant (such as Alexa), customer service on websites or phone lines, autocorrect features, etc. are all examples of Natural Language Processing.

Natural Language Processing (NLP) is the field of deriving meaningful information from human language. NLP is a branch of computer science, or more specifically of artificial intelligence, concerned with giving computers the ability to understand human language in either written or spoken form.






### There are many types of NLP
There are many different varieties of natural language processing. These are just a few real world examples of these techniques to give an idea of how this is used today.

#### Sentiment Analysis
This is most of what we will be doing today. Sentiment analysis examines text in order to identify the general "feeling" of the text. Take this example...
Businesses are using sentiment analysis today to monitor and evaluate customer service. Does this customer seem satisfied?

![airbnb tweet](./images/airbnb.png)

This person is not happy. By running sentiment analysis on customer support chats, tweets, etc., a company can get insights into where its service model is not working.
    
#### Topic Modeling
Topic modeling is an unsupervised machine learning technique that is capable of scanning a set of documents, detecting patterns within them, and automatically clustering word groups or similar expressions that characterize the documents.  An example...
Imagine that you work at a legal firm and someone at a company has embezzled money. You need to figure out who that person is, and you are monitoring company emails from the last six months. There are probably thousands of emails and you don't want to waste time reading all of them. In this case, you can have a computer read the text of the emails and identify the ones that are relevant to the topic of money, narrowing down the number of emails you need to read.
    
![email examples](./images/topicmodeling.png)
    
#### Text Generation
Text generation is simply the task of producing new text. A very common example of this is autocomplete or autofill features, such as when texting or in a search engine. Take the following example. We all use this every day, right?

![google autofill](./images/google.png)

The code simply predicts what you might type next.
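To make that idea concrete, here is a minimal sketch of next-word prediction based on counting word pairs (bigrams). The tiny corpus and the `predict_next` helper are made up for illustration; real autocomplete systems use far larger models and much more data.

```python
from collections import Counter, defaultdict

# A tiny, made-up corpus (hypothetical; real systems train on far more text)
corpus = "how to cook rice . how to cook pasta . how to cook rice fast . how to tie a tie"

# Count which word follows each word
following = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the word seen most often after `word`, or None if the word is unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(predict_next("cook"))   # most frequent follower of "cook" in this corpus
print(predict_next("to"))     # most frequent follower of "to" in this corpus
```

The prediction is only as good as the text it has seen, which is why your phone's suggestions improve as it learns your own typing habits.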

## Example 1: VADER sentiment scoring

VADER stands for Valence Aware Dictionary and sEntiment Reasoner. VADER is a model used for text sentiment analysis that is sensitive to both polarity (positive/negative) and intensity of emotion. This model does not account for relationships between words. This is the "bag of words" approach. All words in the text are thrown into a bag and scored. The cumulative score determines the final rating. More on this later.

## Data Gathering

The first part of most projects like this is getting data. This could be another workshop in itself so for the sake of our meeting today we will be using some sample data that I provide about product reviews on Amazon. 

There are many potential sources of data. It may be available via an API. You might have to grab it using a web scraper. You may be lucky and someone has already gathered it for you. You can process basically any textual data from many file formats.
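As a rough illustration of that last point, the sketch below reads the same two reviews from CSV and from JSON using only Python's standard library. The sample records and field values are invented for this example; with real files you would read them from disk instead of from strings.

```python
import csv
import io
import json

# Invented sample data standing in for files on disk
csv_text = "Id,Score,Text\n1,5,Great coffee\n2,1,Arrived stale\n"
json_text = '[{"Id": 1, "Score": 5, "Text": "Great coffee"}, {"Id": 2, "Score": 1, "Text": "Arrived stale"}]'

# CSV: each row becomes a dict keyed by the header row (all values are strings)
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: the list of objects becomes a list of dicts (numbers stay numbers)
json_rows = json.loads(json_text)

# Either way, we end up with the review text ready for analysis
csv_reviews = [row["Text"] for row in csv_rows]
json_reviews = [row["Text"] for row in json_rows]
print(csv_reviews)
print(json_reviews)
```

Notice that CSV gives you every value as a string, so numeric columns like the score need converting before you do math with them.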

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk     #natural language toolkit

### Import Sample Data

For starters, we will use some sample data that I am providing to you. This is data from Amazon.com for product reviews.

There are more than 500,000 rows of data in this dataset so it is big! There are 10 columns for each review but today we will be concerned really only with the "Score", "Summary", and "Text" columns.

In [None]:
data = pd.read_csv('/Users/ep9k/Desktop/sentiment_analysis/Reviews.csv')

In [None]:
print(data)

Let's see the text of just the first row of the dataset

In [None]:
# show the text of just the first review
print(data['Text'][0])


Here you can see the size and shape of the data. There are over 500,000 reviews in this dataset so let's make it smaller, just for the purposes of our workshop today.

In [None]:
data.shape


In [None]:
# I am simplifying the dataset to just the first 500 reviews
data = data.head(500)

# Exploratory Data Analysis (EDA)
Let's play with the data to see what is in it. There are a lot of reasons for doing this, but basically EDA's main purpose is to explore the data and understand it more before making assumptions about it. This might also help you identify outliers, and find interesting relationships between the variables.

Products have a score of 1-5. This is basically a star rating. Let's see how many times each score occurs

In [None]:
data['Score'].value_counts()

Let's use Matplotlib to visualize the data in a plot

In [None]:
ax = data['Score'].value_counts().sort_index().plot(kind='bar',
                                                    title='Count of Reviews by Stars',
                                                    figsize=(10, 5))
ax.set_xlabel('Review Rating')
plt.show()

It looks like there are a lot of 5 star reviews. I assume this means the corresponding 'text' for each review will be positive. We will test this assumption later. First we need to process the text some more.

# NLTK Basics

NLTK (Natural Language Toolkit) is a Python library for working with human language data. It is just one of many libraries which you can use for Natural Language Processing. A lot of the data you might be analyzing is unstructured data (such as free-form human text). Before you can analyze data programmatically, you need to do some pre-processing.

Let's start by processing the text of one review

In [None]:
# getting text of one review
example_text = data['Text'][50]
print(example_text)

## Tokenization

Tokenization is the process of breaking textual data into words, terms, sentences, or some other meaningful chunks as discrete elements. After data gathering and maybe some EDA, tokenization is often the next step in the NLP workflow. The effect of this process is that it breaks the text into a data structure that the computer can interpret.
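Before reaching for a library, it can help to see how little a bare-bones tokenizer needs. This is only a sketch using a regular expression; the `simple_tokenize` name and the sample sentence are my own, and a library tokenizer applies more careful rules (for example around contractions).

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

sample = "This oatmeal is not good. It's mushy!"
print(simple_tokenize(sample))
```

Each word and each punctuation mark comes back as its own list element, which is exactly the kind of structure later steps can loop over.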

NLTK allows tokenization out of the box with word_tokenize(). However, it is a little bit messy

In [None]:
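# word_tokenize needs NLTK's tokenizer data; if this raises a LookupError,
# run nltk.download('punkt') first (newer NLTK releases ask for 'punkt_tab')
tokens = nltk.word_tokenize(example_text)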
print(tokens)

## Part of Speech
We can find the part of speech for each token. [Here](https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/) is a (complete?) list of NLTK's parts of speech. 

We don't really need the part of speech for our later exercises today but this is useful and you might want this in the future.

In [None]:
tagged = nltk.pos_tag(tokens)
tagged

### VADER Sentiment Scoring

I introduced this model of sentiment analysis earlier; now for more details. The VADER model uses the "bag of words" approach to produce a sentiment score. It looks up each word in your sentence or corpus and assigns it a positive, negative, or neutral score. It then combines those scores, and the result is the overall sentiment of the text. Stop words, which are common words like "the", "and", and "a/an" that don't contribute to the sentiment of a phrase or sentence, do not add to the score. Keep in mind that this is a relatively simplistic way of performing sentiment analysis and it largely ignores the relationships between words.
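Here is a toy version of the bag-of-words idea, just to show the mechanics. The mini-lexicon and its scores are invented for illustration; the real VADER lexicon is far larger and the library layers extra rules on top, so treat this as a sketch of the summing step only.

```python
# Hypothetical mini-lexicon: word -> sentiment score (values invented for illustration)
toy_lexicon = {"happy": 2.0, "love": 3.0, "good": 1.5, "worst": -3.0, "bad": -2.5}

def bag_of_words_score(text):
    """Sum the lexicon scores of every word, ignoring word order."""
    words = text.lower().split()
    # Words missing from the lexicon (including stop words like "the") contribute 0
    return sum(toy_lexicon.get(word.strip(".,!?"), 0.0) for word in words)

print(bag_of_words_score("I am so happy!"))
print(bag_of_words_score("This is the worst!"))
print(bag_of_words_score("happy so am I"))   # same words, different order, same score
```

The last line is the key point: shuffle the words and the score does not change, because the "bag" has no notion of word order.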

In [None]:
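# the analyzer needs the VADER lexicon; if you get a LookupError below,
# run nltk.download('vader_lexicon') once
from nltk.sentiment import SentimentIntensityAnalyzer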

In [None]:
sia = SentimentIntensityAnalyzer()

A few quick examples of the Sentiment Intensity Analyzer in action. First, a couple of individual words followed by a sentence.

In [None]:
sia.polarity_scores("I")   # no score

In [None]:
sia.polarity_scores("am")

In [None]:
sia.polarity_scores("so")

In [None]:
sia.polarity_scores("happy!")

All together now

In [None]:
sia.polarity_scores('I am so happy!')

Punctuation also has an impact. The same sentence without the '!' is scored less positively

In [None]:
sia.polarity_scores("I am so happy")

The compound score has a range of values from -1 to +1 to rate how positive (+1) or negative (-1) a statement is. 

In [None]:
sia.polarity_scores('This is the worst!')
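Where does the -1 to +1 range of the compound score come from? As far as I understand the reference VADER implementation, the summed word scores are squashed with x / sqrt(x² + α), with α = 15 by default. The sketch below reimplements that formula on its own; the function name and constant reflect my reading of the library, not an official API.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded sum of word scores into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Larger raw sums get closer and closer to -1 or +1 but never pass them
for raw in (-10, -1, 0, 1, 10):
    print(raw, round(normalize(raw), 4))
```

This is why a very long, very positive review does not score "more than 1": the formula flattens out as the raw sum grows.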

Now let's run this on our example text from earlier

In [None]:
print(example_text)
sia.polarity_scores(example_text)

Now let's run it on our entire dataset

In [None]:
# makes a dictionary which holds the polarity score of each review

results = {}

for i, row in data.iterrows():
    text = row['Text']
    myId = row['Id']
    results[myId] = sia.polarity_scores(text)
    

In [None]:
vaders = pd.DataFrame(results).T

# move the review ids out of the index and into a regular column named 'Id'
vaders = vaders.reset_index().rename(columns={'index':'Id'})

vaders


Now let's merge our vaders sentiment scores with our original dataframe. Now we have sentiment score and metadata added to our original data. 

In [None]:
# merge on the shared 'Id' column; a left merge keeps every row of the original data
data = data.merge(vaders, on='Id', how='left')

data.head()
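If the reshape-then-merge steps feel a bit opaque, this toy version walks through the same pattern on three made-up reviews. All of the values are invented for illustration, and one review deliberately has no score so you can see what a left merge does with it.

```python
import pandas as pd

# Toy stand-ins for the real data (all values invented for illustration)
reviews = pd.DataFrame({"Id": [1, 2, 3], "Score": [5, 1, 4]})
scores = {1: {"neg": 0.0, "pos": 0.6, "compound": 0.8},
          2: {"neg": 0.5, "pos": 0.0, "compound": -0.7}}

# Dict of dicts -> one row per Id, then turn the index into an 'Id' column
scores_df = pd.DataFrame(scores).T.reset_index().rename(columns={"index": "Id"})

# A left merge keeps every review, even one without a score (Id 3 gets NaN)
merged = reviews.merge(scores_df, on="Id", how="left")
print(merged)
```

Naming the key with `on="Id"` is optional when the two tables share only that column, but spelling it out makes the intent easier to read later.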

### Testing Assumptions

Let's now test some of our assumptions. I would assume that if a reviewer gave a product a 5 star review, then the text would have a positive sentiment. Accordingly, a one star review would have text with a negative sentiment. 

To start, I'll look at the sentiments of 5 star and 1 star reviews. Then we will visualize the data.

In [None]:
five_stars = data.loc[data['Score'] == 5]
# limit results to just the first 10
five_stars = five_stars.head(10)

five_stars

From looking at the five star results, it does indeed look like they have positive sentiment scores. In fact, most of them have a very positive sentiment score. 

In [None]:
one_stars = data.loc[data['Score'] == 1]
# limit results to just the first 10
one_stars = one_stars.head(10)

one_stars

### Data Visualization

I have already used a basic plotting library, Matplotlib, to make a bar plot. Now I'll use an alternative called Seaborn. Seaborn is built on top of Matplotlib and allows more sophisticated statistical graphics, but it also looks nice for simple plots.

In [None]:
#overall compound score of each review
# ci=None turns off the confidence interval bars (newer seaborn versions prefer errorbar=None)
ax = sns.barplot(data=data, x='Score', y='compound', ci=None)
ax.set_title('Compound Score by Amazon Stars')
plt.show()

Here you see a positive relationship between the 'Score' and 'pos' columns. The positivity score increases as the star rating increases. This means that one star reviews have less positive sentiment.

In [None]:
#positive sentiment score of each review
ax = sns.barplot(data=data, x='Score', y='pos', ci=None)

Here we have a negative relationship between the 'Score' and 'neg' columns. The negativity score decreases as the star rating increases. This means that 5 star reviews have less negative sentiment.

In [None]:
#negative sentiment score of each review
ax = sns.barplot(data=data, x='Score', y='neg', ci=None)

Or you could be fancy and do them all together in one plot.

In [None]:

fig, axs = plt.subplots(1, 2, figsize=(12, 3))
sns.barplot(data=data, x='Score', y='pos', ax=axs[0], ci=None)
sns.barplot(data=data, x='Score', y='neg', ax=axs[1], ci=None)
axs[0].set_title('Positive')
axs[1].set_title('Negative')
plt.show()

## Example 2: TextBlob. A rules-based approach to scoring sentiment

TextBlob is a library built on top of NLTK. It provides some additional functionality such as rules-based sentiment scores.

In [None]:
!conda install -c conda-forge textblob

In [None]:
from textblob import TextBlob

example1 = TextBlob("I love winter").sentiment

#again, polarity is measured between -1 and 1
#subjectivity is measured between 0 and 1. This is a measure of how opinionated something is. 
example1

### More about this module

Linguist [Tom De Smedt](https://scholar.google.com/citations?user=8VBuRDwAAAAJ&hl=cs) has manually labeled a large set of English words ([from WordNet](https://wordnet.princeton.edu/)) with their sentiment as "positive", "negative", etc. Let's take the word 'great' as an example.

![Great lexicon](./images/great.png)

In [None]:
print(TextBlob("great").sentiment) 

Because "great" has several meanings, how do we know which one to use and which polarity/subjectivity score to assign to this word? TextBlob gets the results above by simply averaging the polarity and subjectivity scores of all the potential uses of "great".
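The averaging itself is nothing fancy. In this sketch, the four (polarity, subjectivity) pairs are numbers I invented to stand in for the different senses of a word; they are not copied from TextBlob's lexicon.

```python
from statistics import mean

# Hypothetical (polarity, subjectivity) pairs for four senses of one word.
# These numbers are invented for illustration, not taken from TextBlob's lexicon.
senses = [(1.0, 1.0), (1.0, 1.0), (0.4, 0.2), (0.8, 0.8)]

# One overall score per measure: the plain average across all senses
polarity = mean(p for p, _ in senses)
subjectivity = mean(s for _, s in senses)
print(polarity, subjectivity)
```

The trade-off is that a word used in its strongest sense gets the same score as the same word used in its mildest sense.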

In [None]:
print(TextBlob("not great").sentiment) 

"Not great" has a polarity score of -0.4, while the subjectivity remains unchanged. In this case, when TextBlob sees 'not' in front of something, it multiplies the polarity score of that word by -0.5. 

In [None]:
print(TextBlob("very great").sentiment) 

If a word is preceded by "very", both the polarity and subjectivity scores are multiplied by 1.3, with a cap score of 1.
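The "not" and "very" rules described above are simple enough to write out by hand. This is my own sketch of those two rules, not TextBlob's actual code; the polarity of 0.8 for "great" follows from the -0.4 result above, while the subjectivity of 0.75 is an assumed value for illustration.

```python
def clamp(value):
    """Keep a score inside the range -1.0 to 1.0."""
    return max(-1.0, min(1.0, value))

def apply_modifier(modifier, polarity, subjectivity):
    """Apply the two rules described above to a word's (polarity, subjectivity)."""
    if modifier == "not":
        # Negation: flip and halve the polarity; subjectivity is left alone
        return polarity * -0.5, subjectivity
    if modifier == "very":
        # Intensifier: scale both scores by 1.3, capped at 1.0
        return clamp(polarity * 1.3), clamp(subjectivity * 1.3)
    return polarity, subjectivity

great = (0.8, 0.75)  # polarity implied by the text above; subjectivity is an assumption
print(apply_modifier("not", *great))
print(apply_modifier("very", *great))
```

Note how the cap kicks in for "very great": 0.8 x 1.3 would be 1.04, so the polarity is held at 1.0.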

In [None]:
print(TextBlob("I am great.").sentiment) 

"I am great" has the same score as our first example, because "I" and "am" do not affect "great". 

In [None]:
print(TextBlob("I am great!").sentiment)

Punctuation also affects the scores. Here you see that an "!" increases the polarity (though I don't know by how much).

### TextBlob Summary
TextBlob finds all of the words and phrases that it can assign a polarity and subjectivity to and averages them all together to get final scores. 

### Example with real text
For this example, we are going to use TextBlob to analyze the sentiment of the Harry Potter book series. I found the text of all 7 Harry Potter books in [this GitHub repo](https://github.com/formcept/whiteboard/tree/master/nbviewer/notebooks/data/harrypotter).

In [None]:
# read in the text of all seven Harry Potter books
harry_potter1 = open('../SentimentAnalysis_NLP-main/transcripts/HarryPotterPhilosophersStone.txt','r').read()
harry_potter2 = open('../SentimentAnalysis_NLP-main/transcripts/HarryPotterChamberOfSecrets.txt','r').read()
harry_potter3 = open('../SentimentAnalysis_NLP-main/transcripts/HarryPotterPrisonerOfAzkaban.txt','r').read()
harry_potter4 = open('../SentimentAnalysis_NLP-main/transcripts/HarryPotterGobletOfFire.txt','r').read()
harry_potter5 = open('../SentimentAnalysis_NLP-main/transcripts/HarryPotterOrderOfThePhoenix.txt','r').read()
harry_potter6 = open('../SentimentAnalysis_NLP-main/transcripts/HarryPotterHalfBloodPrince.txt','r').read()
harry_potter7 = open('../SentimentAnalysis_NLP-main/transcripts/HarryPotterDeathlyHallows.txt','r').read()

book_texts = [harry_potter1, harry_potter2, harry_potter3, harry_potter4, harry_potter5, harry_potter6, harry_potter7]

book_names = ["Sorcerer's Stone", "Chamber of Secrets", "Prisoner of Azkaban", "Goblet of Fire", "Order of the Phoenix", "Half Blood Prince", "Deathly Hallows"]

Here I am creating a dictionary of the name and text of each book, then converting that to a pandas dataframe

In [None]:
book_data = {}

for i, book in enumerate(book_names):
    book_data[book] = book_texts[i]
    
# create pandas dataframe with this data
data_df = pd.DataFrame(book_data.items(), columns=['BookName', 'BookText'])

data_df

### Cleaning the data

As we saw earlier, there are usually some pre-processing steps before we can perform sentiment analysis on the data. We are only going to do one round of data cleaning for the purposes of this workshop.

Common data cleaning steps on all text:

- Make text all lower case
- Remove punctuation
- Remove numerical values
- Remove common nonsensical text (\n)
- Tokenize text
- Remove stop words

<b> lambda functions: </b> small anonymous functions, meaning functions that are not named
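If lambdas are new to you, here is a quick side-by-side of a named function and an equivalent lambda. The function names and sample words are just for illustration.

```python
# A named function and an equivalent lambda (names are illustrative)
def shout(text):
    return text.upper() + "!"

shout_lambda = lambda text: text.upper() + "!"

print(shout("hello"))
print(shout_lambda("hello"))

# Lambdas are handy as short, throwaway arguments to other functions
words = ["winter", "is", "coming"]
print(sorted(words, key=lambda w: len(w)))   # sort by word length
```

Both versions do the same thing; the lambda just saves you from naming a function you only need once, which is why you will often see it passed straight into pandas' `apply`.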

In [None]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
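    '''Make text lowercase, remove text in square brackets, remove punctuation, remove words containing numbers and replace newlines with spaces.'''
    text = text.lower()
    # raw strings (r'...') keep Python from misreading the regex backslashes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # use a space so words on either side of a line break are not glued together
    text = re.sub(r'\n', ' ', text)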
    return text

round1 = lambda x: clean_text_round1(x)

harry_potter_df = pd.DataFrame(data_df.BookText.apply(round1))

harry_potter_df

## Document-Term Matrix
For many Natural Language Processing techniques, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, with CountVectorizer, we can remove stop words. 
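To see what a document-term matrix actually looks like, here is a hand-rolled miniature built with only the standard library. The three "documents" and the stop word list are made up for illustration; CountVectorizer does the same kind of counting at scale with its own built-in stop word list.

```python
from collections import Counter

# Three tiny made-up "documents" and a hand-picked stop word list (both hypothetical)
docs = ["the owl delivered the letter", "the letter was from hogwarts", "an owl is a bird"]
stop_words = {"the", "was", "from", "an", "is", "a"}

# Count the non-stop words in each document
counts = [Counter(w for w in doc.split() if w not in stop_words) for doc in docs]

# The sorted vocabulary becomes the columns; each document becomes a row of counts
vocabulary = sorted(set().union(*counts))
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries are zero because each document uses only a small part of the vocabulary, and that gets even more extreme with full-length books.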

In [None]:
# we will create a document-term matrix using CountVectorizer and exclude common English stop words

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(harry_potter_df['BookText'])
# get_feature_names() was removed in newer scikit-learn versions; get_feature_names_out() replaces it
harry_potter_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
harry_potter_dtm.index = ["Sorcerer's Stone", "Chamber of Secrets", "Prisoner of Azkaban", "Goblet of Fire", "Order of the Phoenix", "Half Blood Prince", "Deathly Hallows"]
harry_potter_dtm

## Sentiment of Each Book

We could look at the overall sentiment of each book (and we will), but let's do something a little more interesting. In most stories, there is an arc to the plot. Stories alternate between positive and negative events and usually reach a positive outcome in the end. Is this true of the Harry Potter books?

Start by getting the overall polarity and subjectivity for each book

In [None]:
# apply a lambda function to find the polarity and subjectivity of each story
pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

harry_potter_df['polarity'] = harry_potter_df['BookText'].apply(pol)
harry_potter_df['subjectivity'] = harry_potter_df['BookText'].apply(sub)

harry_potter_df

## Sentiment over the Course of Each Book

Most storylines have an arc. What is the arc of the Harry Potter books?
To get this, we will split the text of each book into 10 chunks (an arbitrary choice) and get the polarity score for each chunk.

In [None]:
import math

def split_text(text, n=10):
    '''Takes in a string of text and splits into n equal parts, with a default of 10 equal parts.'''

    # Calculate length of text, the size of each chunk of text and the starting points of each chunk of text
    length = len(text)
    size = math.floor(length / n)
    start = np.arange(0, length, size)
    
    # Pull out equally sized pieces of text and put it into a list
    split_list = []
    for piece in range(n):
        split_list.append(text[start[piece]:start[piece]+size])
    return split_list

# Let's create a list to hold all of the pieces of text
list_pieces = []
for t in harry_potter_df.BookText:
    split = split_text(t)
    list_pieces.append(split)
    


As a quick check, this shows there are indeed 7 books.

In [None]:
len(list_pieces)

And this shows that there are 10 chunks in the first book's text

In [None]:
len(list_pieces[0])

Calculate the polarity score for each section of text in each book. In the end there should be 70 scores. 7 books x 10 sections per book = 70

In [None]:
# Calculate the polarity for each piece of text in each book

polarity_transcript = []
for lp in list_pieces:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)
    
polarity_transcript

Show the polarity of sections in the first book

In [None]:
# Show the plot for the first book: Harry Potter and the Sorcerer's Stone
harry_potter_df['BookTitle'] = ["Sorcerer's Stone", "Chamber of Secrets", "Prisoner of Azkaban", "Goblet of Fire", "Order of the Phoenix", "Half Blood Prince", "Deathly Hallows"]

plt.plot(polarity_transcript[0])
plt.title(harry_potter_df['BookTitle'][0])
plt.show()

Lastly, show the polarity scores for all 7 books in one plot. This will show the 'arc' of each book. One thing that caught my eye is that the 6th book (Half Blood Prince) ends on a negative note while the 7th book (Deathly Hallows) ends on a very positive one.

In [None]:
# Show all books in one plot

plt.rcParams['figure.figsize'] = [16, 12]

for index, book in enumerate(harry_potter_df.index):
    plt.subplot(3, 4, index+1)                             # gives each book its own plot                  
    plt.plot(polarity_transcript[index])                   # plotting polarity score of each book  
    plt.xlabel('Book Segment')                             # x axis label
    plt.ylabel('Polarity')                                 # y axis label
    plt.xticks(np.arange(0, 10, 1.0))                      # adds extra ticks on x axis
    plt.title(harry_potter_df['BookTitle'][index])         # title of each plot
    plt.tight_layout()                                     # spaces out plots
    
plt.show()

## Sources / Self Help

I like to end every workshop with a self-help section on where to go for more help. As always, this is just an introduction to these topics.

First, here are a couple of resources I used to put this video together

- [pyOhio Natural Language Processing Workshop](https://www.youtube.com/watch?v=xvqsFTUsOmc)

- [Sentiment Analysis in Python](https://www.youtube.com/watch?v=QpzMWQvxXWk)

### UVA Resources
Remember, you can always reach out to me or others here at UVA for more help!

[UVA Statlab](https://data.library.virginia.edu/statlab/)

[UVA Research Computing](https://www.rc.virginia.edu/)

[UVA Digital Humanities](https://dh.library.virginia.edu/)

