# Text Analysis 03: Document-Term Matrix and Sentiment Analysis


---
<img src="data/word_magnets.jpg" style="width: 400px; height: 300px;" />
 *Photo by [Steve Johnson](https://www.flickr.com/photos/artbystevejohnson/4654424717)* 

### Professor Crystal Chang

This notebook will extend the word count approach from the last module to multiple documents with Document Term Matrices. We will also introduce sentiment analysis.

*Estimated Time: 60 minutes*

---


### Table of Contents

[The Data](#section data)<br>
[Context](#section context)<br>
1 - [Working with Multiple Documents](#section 1)<br>
2 - [The Term-Document Matrix](#section 2)<br>
3 - [Sentiment Analysis](#section 3)<br>

**Dependencies:**

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
!pip install textblob
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfTransformer



In [2]:
# Run this cell to set up your notebook
import csv
import matplotlib.pyplot as plt

# Ensure that Pandas shows at least 280 characters in columns, so we can see full tweets
pd.set_option('max_colwidth', 280)

%matplotlib inline
plt.style.use('fivethirtyeight')
import re

## Introduction

One of the powerful things about text analysis with Python is the ability to work with a large number of documents simultaneously. In this notebook, we'll expand Term Frequency analysis to cover multiple text documents in a **Document-Term Matrix** (also called a term-document matrix).

The term-document model is also referred to as the "bag-of-words" model, a name sometimes used dismissively by its critics. The model treats each document as an individual communicative effort that contains one or more tokens. The kind and number of tokens in a document tell you something about what the author is trying to communicate, and the order of those tokens is ignored. It remains one of the most widely used approaches in text analysis.
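To make "the order of those tokens is ignored" concrete, here is a minimal sketch using only the standard library. The two sentences are made up for illustration and are not from the dataset.

```python
from collections import Counter

# Two made-up sentences that use the same words in a different order
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# A bag of words keeps only which tokens occur and how often
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a)
print(bag_a == bag_b)  # True: the two bags are identical even though the meanings differ
```

This is both the strength and the weakness of the model: it is simple to compute, but two documents with opposite meanings can look the same.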

We'll also learn another method of text analysis: **sentiment analysis**. We'll look at the tools for conducting it and explore the pros and cons of the approach.

---

# 1. Working with Multiple Documents <a id='section 1'></a>

In order to actually turn our text into a bag of words, we'll have to do some preprocessing. This is a crucial step at the beginning of any NLP project, and much of this first section will involve it.

To start with, let's load some data: tweets from @realDonaldTrump. 

In [3]:
# read in the data from a file
trump = pd.read_csv('data/trumptweets.csv', header=0, index_col=0)

# display the table of data
trump.head()

| id | retweet_count | source | text | est_time |
|---|---|---|---|---|
| 1049473255151755264 | 9898 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | https://t.co/4ySIkmfllE | 2018-10-08 20:34:56-05:00 |
| 1049445228694962176 | 11000 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | https://t.co/k2bOxapRtR | 2018-10-08 18:43:34-05:00 |
| 1049385141557030912 | 8549 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | Great to see @AGPamBondi launch a cutting-edge statewide school safety APP in Florida today - named by Parkland Survivors. BIG PRIORITY and Florida is getting it done! #FortifyFL | 2018-10-08 14:44:49-05:00 |
| 1049383326975373312 | 11120 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | Every day, our police officers race into darkened allies, deserted streets, &amp; onto the doorsteps of the most hardened criminals. They see the worst of humanity &amp; they respond with the best of the American Spirit. America’s LEOs have earned the everlasting gratitude of... | 2018-10-08 14:37:36-05:00 |
| 1049380830395609090 | 12174 | `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>` | We thank you. We salute you. We honor you. And we promise you: we will ALWAYS have your BACK – now and FOREVER! #IACP2018 https://t.co/nvUUIuvouj | 2018-10-08 14:27:41-05:00 |


This is the same dataframe we generated in the demo! Data isn't always pretty, but there are ways to process and clean it up.

For now, let's just focus on the `text` column. We can still accomplish a lot by extracting the text of the tweets as a list.

In [None]:
tweets = list(trump['text'])
tweets

## Pre-processing

Just as in the last module, when we worked with a single text document, we also want to do some **pre-processing** on this text to do better analysis. Here, the text analysis tools we're using are more sophisticated: they can automatically do things like lower the case of text and remove stop words! But, we do have some noise in our data we want to remove. 

Take another look at two tweets in particular:

In [None]:
print(tweets[4])
print()
print(tweets[9])

We can see that the fifth tweet (like many tweets, in fact) contains a link, which probably won't tell us much about Trump's word usage. We can also see that some tweets contain characters that aren't from the English alphabet. It would be helpful to filter these out before we look at things like word frequency.



## Regular Expressions 

### Overview

Regular expressions (regex or regexp for short) are special sequences of characters that define patterns
to search for in text. They're often used in find-and-replace operations, or to add up the number of words
or phrases matching a particular pattern.

Regular expressions are useful in a variety of applications, and can be used in different programs and
programming languages. We will start by learning the general components of regular expressions, using a
simple online tool, RegExr. We'll also demonstrate how to use them in Python.

To get started:

1. Go to this site: [http://regexr.com](http://regexr.com).
2. Copy and paste the two tweets we just printed into the __Text__ field.
3. Delete what you see in the __Expression__ field. This is where we'll insert our own regular expressions
to find sequences in the tweets below.

~~~ {.input}
We thank you. We salute you. We honor you. And we promise you: we will ALWAYS have your BACK – now and FOREVER! #IACP2018 https://t.co/nvUUIuvouj

RT @FLOTUS: Thank you Kenya 🇰🇪 🇺🇸 https://t.co/JrHncob8Qp
 
~~~

### a. Special Characters

Strings are composed of characters, and we are writing patterns to match specific sequences of characters.
Various characters have special meaning in regular expressions. When we use these characters in an expression,
we aren't matching the identical character, we're using the character as a placeholder for some other character(s)
or part(s) of a string.

If you want to match a character that happens to be a special character, you have to escape it with a backslash
`\`. Try typing the following special characters into the __Expression__ field on the regexr.com site. What happens
when you type `We` vs. `^We`? How about `.`, `\.`, or `\.$`?

~~~ {.input}
.         any single character
^         start of string
$         end of string
\n        new line
\r        carriage return
\t        tab
~~~
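The same experiments can be run in Python with the `re` module. This is a small sketch on a made-up sample string (loosely modeled on the first tweet), not on the dataset itself.

```python
import re

# Short made-up sample, loosely modeled on the first tweet
text = "We thank you. We salute you."

print(re.findall(r"We", text))       # every occurrence of "We"
print(re.findall(r"^We", text))      # only the "We" at the start of the string
print(len(re.findall(r".", text)))   # an unescaped dot matches every character here
print(re.findall(r"\.", text))       # an escaped dot matches literal periods only
print(re.findall(r"\.$", text))      # a period at the very end of the string
```

Note the `r"..."` raw-string prefix: it keeps Python from interpreting the backslashes before the regex engine sees them.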

### b. Quantifiers

Some special characters refer to optional characters, to a specific number of characters, or to an open-ended
number of characters matching the preceding pattern. Try looking for the letter 'e' followed by a number of 's's:
what happens if you type `es`, `es*`, `es+`, `es{1}`, `es{1,2}`?

~~~ {.input}
*        0 or more of the preceding character/expression
+        1 or more of the preceding character/expression
?        0 or 1 of the preceding character/expression
{n}      n copies of the preceding character/expression 
{n,m}    n to m copies of the preceding character/expression 
~~~
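Here is how those quantifiers behave in Python on a made-up string containing an "e" followed by zero, one, two, and three "s" characters.

```python
import re

# Made-up string: "e" followed by 0, 1, 2, and 3 copies of "s"
sample = "e es ess esss"

print(re.findall(r"es", sample))       # ['es', 'es', 'es']
print(re.findall(r"es*", sample))      # ['e', 'es', 'ess', 'esss']
print(re.findall(r"es+", sample))      # ['es', 'ess', 'esss']
print(re.findall(r"es?", sample))      # ['e', 'es', 'es', 'es']
print(re.findall(r"es{1}", sample))    # ['es', 'es', 'es']
print(re.findall(r"es{1,2}", sample))  # ['es', 'ess', 'ess']
```

In each case the quantifier applies only to the `s`, the single character immediately before it.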

### c. Sets

Regular expressions also allow you to define sets of characters. Within a set of square brackets, you may list
characters individually, e.g. `[aeiou]`, or in a range, e.g. `[A-Z]` (note that regular expressions are case
sensitive by default).

You can also create a complement set by excluding certain characters, using `^` as the first character
in the set. The set `[^A-Za-z]` will match any character except a letter. Most other special characters lose
their special meaning inside a set, so the set `[.?]` will look for a literal period or question mark.

The set will match only one character contained within that set, so to find sequences of multiple characters from
the same set, use a quantifier like `+` or a specific number or number range `{n,m}`.

~~~ {.input}
[0-9]        any numeric character
[a-z]        any lowercase alphabetic character
[A-Z]        any uppercase alphabetic character
[aeiou]      any vowel (i.e. any character within the brackets)
[0-9a-z]     to combine sets, list them one after another 
[^...]       exclude specific characters
~~~
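A short Python sketch of sets, again on a made-up string rather than a real tweet:

```python
import re

# Made-up sample string
sample = "Great BIG news in 2018!"

print(re.findall(r"[aeiou]", sample))      # lowercase vowels, one character per match
print(re.findall(r"[A-Z]+", sample))       # runs of uppercase letters
print(re.findall(r"[0-9]+", sample))       # runs of digits
print(re.findall(r"[^A-Za-z ]+", sample))  # runs of anything that is not a letter or a space
print(re.findall(r"[.?!]", sample))        # literal punctuation: no escaping needed inside a set
```

Without the `+`, each set matches exactly one character, which is why the vowels come back one at a time.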

### d. Special sequences

Several special characters denote special sequences. These begin with a `\` followed by a letter.
Note that the uppercase version is usually the complement of the lowercase version.

~~~ {.input}
\d        Any digit
\D        Any non-digit character
\w        Any alphanumeric character [0-9a-zA-Z_] 
\W        Any non-alphanumeric character
\s        Any whitespace (space, tab, new line)
\S        Any non-whitespace character
\b        Matches the beginning or end of a word (does not consume a character)
\B        Matches only when the position is not the beginning or end of a word (does not consume a character)
~~~
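The special sequences can be tried out the same way. The sample below is made up; note that for Python 3 text strings, `\w` also matches non-English letters by default, so the `[0-9a-zA-Z_]` shorthand above is a simplification.

```python
import re

# Made-up sample string
sample = "RT @FLOTUS: 2 flags, 1 link"

print(re.findall(r"\d", sample))        # individual digits
print(re.findall(r"\w+", sample))       # runs of word characters (punctuation is dropped)
print(re.findall(r"\S+", sample))       # runs of non-whitespace (punctuation is kept)
print(len(re.findall(r"\s", sample)))   # number of whitespace characters
print(re.findall(r"\blink\b", sample))  # "link" as a whole word
print(re.findall(r"\blag", sample))     # "lag" at the start of a word: no match
print(re.findall(r"\Blag", sample))     # "lag" inside a word: found in "flags"
```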

### e. Groups and Logical OR

Parentheses are used to designate groups of characters, to aid in logical conditions, and to be able to retrieve the
contents of certain groups separately.

The pipe character `|` serves as a logical OR operator, to match the expression before or after the pipe. Group parentheses
can be used to indicate which elements of the expression are being operated on by the `|`.

~~~ {.input}
|            Logical OR operator
(...)        Matches whatever regular expression is inside the parentheses, and notes the start and end of a group
(this|that)  Matches the expression "this" or the expression "that"
~~~
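A brief Python sketch of groups and the OR operator, using a made-up sample based on the wording of the first tweet:

```python
import re

# Made-up sample based on the wording of the first tweet
sample = "We thank you. We salute you. We honor you."

# Without parentheses, | splits the whole expression in two
print(re.findall(r"We thank|honor you", sample))     # ['We thank', 'honor you']

# With parentheses, | only chooses between the words inside the group
print(re.findall(r"We (thank|honor) you", sample))   # ['thank', 'honor']

# A match object lets you retrieve the whole match or just the group
m = re.search(r"We (thank|honor) you", sample)
print(m.group(0))   # We thank you
print(m.group(1))   # thank
```

When a pattern contains a group, `re.findall` returns the group contents rather than the full match, which is why the second result lists only the verbs.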

## Regex in Python

Important functions in the `re` module:

In [None]:
re.compile?

In [None]:
re.search?

In [None]:
re.match?

In [None]:
re.sub?

In [None]:
re.findall?

In [None]:
re.split?
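After reading the docstrings, it can help to see each function once on a small input. This sketch uses a made-up string; the variable names are only for illustration.

```python
import re

# Made-up sample string (not from the dataset)
text = "BIG news today, big plans tomorrow"

# re.compile builds a reusable pattern object; flags such as IGNORECASE go here
pattern = re.compile(r"big", re.IGNORECASE)
print(pattern.findall(text))              # all matches, as a list of strings

print(re.search(r"news", text).start())   # search() finds the first match anywhere in the string
print(re.match(r"news", text))            # match() only tries the start of the string, so this is None
print(re.sub(r"\s+", "_", text))          # sub() replaces every match with the given string
print(re.split(r",\s*", text))            # split() cuts the string wherever the pattern matches
```

The difference between `re.search` and `re.match` is a common source of confusion: `re.match` behaves as if the pattern began with `^`.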

### Challenge
Find a regex pattern that will match a url. Then, remove the url from `url_tweet`.

In [None]:

url_tweet = tweets[4]

# your code here


### Challenge

Ultimately, we want to remove urls and non-English characters from *all* tweets in `tweets`. Instead of going through each tweet individually, let's use another **list comprehension**.

Remember, the syntax is:

`[<do_something(item)> for <item> in <sequence> if <condition>]`

In this case, though, there's no condition: we want to change every tweet. So, our code will look like this:

`[<do_something(item)> for <item> in <sequence>]`
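As a quick refresher, here are both forms on a made-up list of words (unrelated to the cleaning task itself):

```python
# Made-up list of words
words = ["BIG", "priority", "Florida"]

# With a condition: only items that pass the test are transformed and kept
with_condition = [w.lower() for w in words if len(w) > 3]

# Without a condition: every item is transformed
no_condition = [w.lower() for w in words]

print(with_condition)   # ['priority', 'florida']
print(no_condition)     # ['big', 'priority', 'florida']
```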

In the first code cell below, replace the ellipses with expressions to remove the urls and non-English characters from the tweets. 
- The first line creates a new list called `no_urls` by removing the urls from each tweet in `tweets`
- The second line creates a new list called  `no_urls_all_engl` by removing all non-English characters (more precisely, all non-ASCII characters) from the tweets in `no_urls`. Hint: the regex for matching non-English characters is `'[^\x00-\x7F]+'`

Then run the second cell to make sure it worked:

In [None]:
no_urls = [... for x in tweets] 
no_urls_all_engl = [... for x in no_urls] 

In [None]:
print(no_urls_all_engl[4])
print()
print(no_urls_all_engl[9])

### Apply to `text` column

Now that we have a list comprehension to remove URLs, let's apply this to the `text` column. 

In [None]:
trump['text'] = [... for x in trump['text']] #Your code here

trump.head()

### **More regex**

Take a look at one of the values in the `source` column. 
 `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>`
 
Let's use regex to clean this up and only include the text between the tags.
 
**Use list comprehension to clean the source**
 
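hint: The regex `^[^>]*>` matches the opening tag (everything up to and including the first `>`). You will also need to remove the closing `</a>` tag to be left with only the text between the tags.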
 

In [None]:
trump['source'] = [... for x in trump['source']]

trump.head()

# 2. The Term Document Matrix <a id='section 2'></a>

If we plan to compare word frequencies across texts, we could collate these `Counter` dictionaries for each tweet in `tweets`. But we don't want to write all that code! scikit-learn provides a class called `CountVectorizer` that streamlines the process.

Let's look at the docstring:


In [None]:
CountVectorizer?

Cool. So we'll create the `CountVectorizer` object, then fit it to our `list` of documents and transform them: the tweets in  `no_urls_all_engl`. We can give `CountVectorizer` a list of stop words to automatically exclude them from our matrix.

In [None]:
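# in current scikit-learn releases the stop word list lives in feature_extraction.text
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# recent scikit-learn versions expect a list (or the string 'english'), not a frozenset
stop_words = list(ENGLISH_STOP_WORDS)

cv = CountVectorizer(stop_words=stop_words)
dtm = cv.fit_transform(no_urls_all_engl)
dtm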

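What's this? A sparse matrix is a table in which most cells are zero, so only the non-zero cells are actually stored. Why are there so many zeros? Because each tweet uses only a tiny fraction of the vocabulary found across all the tweets! Let's convert it to a regular (dense) array to demonstrate this.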


In [None]:
# de-sparsify
desparse = dtm.toarray()

# create labels for columns
# (get_feature_names_out() needs scikit-learn >= 1.0; older versions used get_feature_names())
word_list = cv.get_feature_names_out()

# create a new table
dtm_df = pd.DataFrame(columns=word_list, data=desparse)
dtm_df.head()

Welcome to the ***Document Term Matrix***. This is a core concept in NLP and text analysis. It's not that complicated!

We have a column for each word *in the entire corpus*, and each *row* corresponds to one *document*. In our case, the documents are tweets. The values are the counts of that word in the corresponding document. Note that there are many 0s: most words just don't show up in any given tweet!
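To see the structure without any library machinery, here is a hand-built document-term matrix for three made-up "documents", using the `Counter` approach from the last module. It is only a sketch of the idea: `CountVectorizer` also lowercases text, handles punctuation, and stores the result sparsely.

```python
from collections import Counter

# Three made-up "documents" standing in for tweets
docs = ["fake news", "great news great day", "great day"]

# One Counter per document
counts = [Counter(doc.split()) for doc in docs]

# The columns: every distinct word in the whole corpus, in alphabetical order
vocab = sorted(set().union(*counts))

# The rows: one per document, with a 0 wherever the word does not appear
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Even in this tiny corpus, 5 of the 12 cells are zero; with thousands of tweets and thousands of distinct words, nearly all of them are.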

We can call up frequencies for a given word for each tweet easily, since they are the column names:

In [None]:
dtm_df['news'][3225:3235]

And we can see the total usage of a given word using `sum`.

In [None]:
dtm_df['news'].sum()

## Normalization

Let's take this another step further. In order to make apples-to-apples comparisons across tweets, we can normalize our values by dividing each word count by the total number of words in its tweet. To do that, we'll need to `sum` on `axis=1`, which means summing the row (number of words in that tweet), as opposed to summing the column.

Once we have the total number of words in that tweet, we can get the percentage of words that one particular word accounts for, and we can do that for every word across the matrix!

Note: normalization is not a huge deal for tweets, since each tweet is limited to 280 characters. But, if you're comparing documents with very different lengths, normalization is key to getting an accurate picture of word usage.

In [None]:

row_sums = np.sum(desparse, axis=1)
normed = desparse/row_sums[:,None]
dtm_df = pd.DataFrame(columns=word_list, data=normed)
dtm_df.head()
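The `row_sums[:, None]` indexing in the cell above turns the 1-D array of row totals into a column, so that NumPy divides every row by its own total. One thing to watch for: a tweet with no counted words (for example, one that contained only a URL) has a row sum of 0, and plain division then produces `NaN` values. The sketch below shows both points on made-up counts; the `toy_` names are hypothetical, and the zero-row handling is one possible choice rather than part of the original notebook.

```python
import numpy as np

# Made-up counts: 3 documents (rows) x 4 words (columns)
toy_counts = np.array([[0, 1, 0, 1],
                       [1, 0, 2, 1],
                       [0, 0, 0, 0]])   # last row: a document with no counted words

toy_row_sums = toy_counts.sum(axis=1)    # shape (3,): one total per document
toy_col = toy_row_sums[:, None]          # shape (3, 1): a column that broadcasts across each row

# Divide only where the row total is non-zero; empty rows stay all zeros
toy_normed = np.divide(
    toy_counts, toy_col,
    out=np.zeros(toy_counts.shape, dtype=float),
    where=toy_col != 0,
)

print(toy_normed)
print(toy_normed.sum(axis=1))   # each non-empty row now sums to 1
```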

We can still get the normalized frequencies of the word 'news' for each tweet:

In [None]:
dtm_df['news'][3225:3235]

## Streamlining

That was a lot of work; if this is such a common task, hasn't someone streamlined it? In fact, `CountVectorizer` already drops the stop words for us, and another class, `TfidfTransformer`, handles the normalization: with `norm='l1'` and `use_idf=False`, it divides each row by its total, just as we did by hand.

In [None]:
cv = CountVectorizer(stop_words=stop_words)
dtm = cv.fit_transform(no_urls_all_engl)
tt = TfidfTransformer(norm='l1',use_idf=False)
dtm_tf = tt.fit_transform(dtm)
dtm_tf

---

# 3. Sentiment <a id='section 3'></a>

Frequently, we are interested in text to learn something about the person who is speaking. One of these things we've talked about already - linguistic diversity. A similar metric was used a couple of years ago to settle the question of who has the [largest vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html).

> Unsurprisingly, top spots go to Canibus, Aesop Rock, and the Wu Tang Clan. E-40 is also in the top 20, but mostly because he makes up a lot of words; as are OutKast, who print their lyrics with words slurred in the actual typography

Another thing we can learn is about how the speaker is feeling, with a process called sentiment analysis. Before we start, be forewarned that this is not a robust method by any stretch of the imagination. Sentiment classifiers are often trained on product reviews, which limits their ecological validity.

We're going to use TextBlob because it's an easy way to work with text data, and has a built in sentiment classifier.

In [None]:
blob = TextBlob(no_urls_all_engl[11])
blob.sentences[:10]

To check the polarity of a sentence, we can just use `.polarity`. Polarity is a number between -1 and 1, where -1 is considered 'negative', 0 is 'neutral' and 1 is 'positive'. 

In [None]:
blob.sentences[1].polarity

And to get the polarity of a tweet, we can use `.polarity` on the blob itself.

In [None]:
blob.polarity

What's happening behind the scenes? While there are new algorithms for sentiment analysis emerging (cf. `VADER`), most algorithms currently rely only on a `dictionary` of words and a corresponding `positive`, `negative`, or `neutral` label. Based on all the words in a sentence, a value is calculated for the sentence as a whole. And, polarity for a tweet is calculated as the average polarity of all sentences in the tweet. Not super fancy, I know. Of course, you can change the `dictionary` used in the library itself, or opt for more advanced algorithms that aim to capture context.
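To illustrate the dictionary idea, here is a deliberately tiny sketch. The lexicon, its scores, and the `toy_polarity` function are all invented for this example; this is not TextBlob's actual lexicon or algorithm, which also accounts for things like negation and intensifiers.

```python
# Hypothetical mini-lexicon: the words and scores are invented for illustration
toy_lexicon = {"great": 0.8, "best": 1.0, "worst": -1.0, "fake": -0.5}

def toy_polarity(sentence):
    """Average the scores of the words found in the lexicon (0.0 if none are found)."""
    words = sentence.lower().split()
    scores = [toy_lexicon[w] for w in words if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("Great news and the best people"))  # positive: two positive words
print(toy_polarity("the worst fake story"))            # negative: two negative words
print(toy_polarity("we will see"))                     # 0.0: no word is in the lexicon
```

The sketch also shows the main weakness of the approach: a sentence like "not great" would still score as positive, because each word is looked up in isolation.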

We can also get the **subjectivity** of a tweet. Subjectivity ranges from 0 ('objective') to 1 ('subjective') and is related to the different possible meanings a word can take.

In [None]:
blob.subjectivity

And you can get both at once using `.sentiment`.

In [None]:
blob.sentiment

### Challenge
Write a list comprehension to calculate the sentiment of all tweets in `no_urls_all_engl`. 

Hint: it may be easier to write two list comprehensions- one to convert each tweet into a TextBlob, and one to calculate the sentiment for each blob in your TextBlob list. However, you can do it in a single list comprehension!


In [1]:
# your code here - make sure to name the final result as "sentiments"



Now, let's add the net polarities to our original table.

In [None]:
trump['polarity'] = [s[0] for s in sentiments]
trump['subjectivity'] = [s[1] for s in sentiments]

trump.head()

# 4. Further Resources <a id='section 4'></a>

These modules have covered just a few of the text analysis methods out there today. If you'd like to learn more, or if you're interested in using these techniques in your own text analysis project, here are some resources you might find helpful.

- Located on the first floor of Moffitt Library, the [Data + Digital Research Help](https://data.berkeley.edu/education/data-digital-research-help) service provides students with a hub for support in their co-curricular data science projects. 

- [D-Lab](http://dlab.berkeley.edu/calendar-node-field-date) provides a variety of free workshop trainings for those interested in learning Python, R, Stata, Excel, Geospatial Mapping, Qualitative Methods, etc.

- The [Data Lab](http://www.lib.berkeley.edu/libraries/data-lab) offers consultations to current UC Berkeley students, staff and faculty on research involving numeric data, including finding and recommending data sources and advising on technical data issues such as file format conversion, web scraping, and basic data analysis assistance. 
- D-Lab also hosts a robust list of [campus data-related resources](http://dlab.berkeley.edu/dlab-campus-resources), including places to get data set and data analysis support


If you're up for a challenge, you can look at more text analysis examples:
- https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk (Datacamp is also a great resource to learn Data Science online.)
- A more in-depth look at Bag of Words: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- Tokenizing words with NLTK: https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/

There are tons of resources online!

---

### [Feedback survey](https://docs.google.com/forms/d/1qk4za1wvVSvug4yHvUtXFMTJHhbkIUjxOl8TwMMCDl4/edit?ts=5a8de458)




---

## Bibliography

- Regex section taken from materials by Chris Hench. https://github.com/henchc/textxd-2017/blob/master/03-regex.ipynb
- Document Term Matrix section adapted from materials by Chris Hench: https://github.com/henchc/textxd-2017/blob/master/06-DTM.ipynb
- Sentiment section adapted from the D-Lab's "Intro to Text Analysis" workshop: https://github.com/dlab-berkeley/python-text-analysis/blob/master/Intro_to_TextAnalysis/Intro_to_TextAnalysis.ipynb
  


---
Notebook developed by: Keeley Takimoto  
Modified by: Tina Nguyen

Data Science Modules: http://data.berkeley.edu/education/modules
