# Text Analysis 03: Document-Term Matrix and Sentiment Analysis


---
<img src="data/word_magnets.jpg" style="width: 400px; height: 300px;" />
 *Photo by [Steve Johnson](https://www.flickr.com/photos/artbystevejohnson/4654424717)* 

### Professor Crystal Chang

This notebook will extend the word count approach from the last module to multiple documents with Document Term Matrices. We will also introduce sentiment analysis.

*Estimated Time: 60 minutes*

---


### Table of Contents

[The Data](#section data)<br>

[Context](#section context)<br>

1 - [Working with Multiple Documents](#section 1)<br>

2 - [The Term-Document Matrix](#section 2)<br>

3 - [Sentiment Analysis](#section 3)<br>



**Dependencies:**

In [53]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer


## Introduction

One of the powerful things about text analysis with Python is the ability to work with a large number of documents simultaneously. In this module, we'll expand the bag-of-words approach to cover multiple text documents in a **Term Document Matrix**. We'll also explore another way of analysing texts by sentiment.



# 1. Working with Multiple Documents <a id='section 1'></a>

The term-document model is also sometimes referred to as "bag-of-words" by those who don't think very highly of it. The term-document model looks at language as individual communicative efforts that contain one or more tokens. The kind and number of tokens in a document tell you something about what the document is trying to communicate; the order of those tokens is ignored.
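To make the "order is ignored" point concrete, here is a minimal sketch using Python's built-in `Counter`. The two sentences are made up for illustration, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter

# Two made-up sentences containing the same tokens in a different order.
doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# Split on whitespace and count each token; word order is discarded.
bag_a = Counter(doc_a.split())
bag_b = Counter(doc_b.split())

print(bag_a)
print(bag_a == bag_b)
```

Although the two sentences mean different things, their bags of words compare as equal, which is exactly the information this model gives up.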

This is the primary method still used for most text analysis, although models utilizing word embeddings are beginning to take hold. We will discuss word embeddings briefly at the end.

In order to actually turn our text into a bag of words, we'll have to do some preprocessing. This is a crucial step at the beginning of any NLP project, and much of this first section will involve it.

To start with, let's load some data: tweets from @realDonaldTrump.

In [4]:
# read in the data from a file
trump = pd.read_csv('data/trump_tweets.csv', header=0, index_col=0)

# display the table of data
trump.head()

| id | retweet_count | source | text | time |
|---|---|---|---|---|
| 966006815745040384 | 6945 | Twitter for iPhone | Main Street is BOOMING thanks to our incredibl... | 2018-02-20 17:49:07 |
| 965971586913374208 | 11350 | Twitter for iPhone | ....cameras running. Another False Accusation.... | 2018-02-20 15:29:07 |
| 965968309358333952 | 12435 | Twitter for iPhone | A woman I don’t know and, to the best of my kn... | 2018-02-20 15:16:06 |
| 965943827931549696 | 10989 | Twitter for iPhone | I have been much tougher on Russia than Obama,... | 2018-02-20 13:38:49 |
| 965937068907073536 | 13746 | Twitter for iPhone | Hope Republicans in the Great State of Pennsyl... | 2018-02-20 13:11:58 |


The cell above uses an object we haven't worked with before, called a **DataFrame**. DataFrames organize data in a table and come in handy for:
- comparing many documents at a time
- doing analysis involving **metadata** (data about the data, like when it was produced or where it came from).
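As a rough illustration of why metadata is useful, the sketch below builds a small, made-up DataFrame and uses one metadata column to select a subset of documents. The column names and values are hypothetical and are not taken from the tweet data.

```python
import pandas as pd

# Hypothetical toy table: three made-up documents plus metadata columns.
docs = pd.DataFrame({
    "text": ["good morning", "see you tonight", "good night"],
    "source": ["phone", "web", "phone"],
    "year": [2017, 2018, 2018],
})

# Use a metadata column to pick out a subset of the documents.
from_phone = docs[docs["source"] == "phone"]

# Pull the text of that subset out as a plain Python list.
phone_texts = list(from_phone["text"])
print(phone_texts)
```

The same pattern (filter on a metadata column, then extract the text column) applies to any table that stores documents alongside information about them.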

For now, we can still accomplish a lot by extracting the text of the tweets as a list.

In [32]:
tweets = list(trump['text'])
tweets

['Main Street is BOOMING thanks to our incredible TAX CUT and Reform law. "This shows small-business owners are more than just optimistic, they are ready to grow their businesses." https://t.co/w9aw68UwOj',
 '....cameras running. Another False Accusation. Why doesn’t @washingtonpost report the story of the women taking money to make up stories about me? One had her home mortgage paid off. Only @FoxNews so reported...doesn’t fit the Mainstream Media narrative.',
 'A woman I don’t know and, to the best of my knowledge, never met, is on the FRONT PAGE of the Fake News Washington Post saying I kissed her (for two minutes yet) in the lobby of Trump Tower 12 years ago. Never happened! Who would do this in a public space with live security......',
 'I have been much tougher on Russia than Obama, just look at the facts. Total Fake News!',
 'Hope Republicans in the Great State of Pennsylvania challenge the new “pushed” Congressional Map, all the way to the Supreme Court, if necessary. Your Orig

### For Loops
When we're dealing with a collection of documents, we often want to perform the same operations on each item in the collection. Instead of copying and pasting the same calls over and over again, it would be better to use a **for loop**.

A basic **for loop** is written in the following order:
- The word "for"
- A name we want to give each item in a sequence
- The word "in"
- A sequence (e.g. `range(100)` to go through the numbers 0-99)

For example, to greet someone ten times, we could write:

In [6]:
# Run me to see "hello!" printed ten times!
for i in range(10):
    print("hello!")

hello!
hello!
hello!
hello!
hello!
hello!
hello!
hello!
hello!
hello!


You can also cycle through a non-numerical sequence.

In [52]:
for word in ['Never', 'gonna', 'give', 'you', 'up']:
    print(word)

Never
gonna
give
you
up


In this way, for loops let us apply the same operation to every item in a sequence without writing redundant code.

#### Challenge

Using a for loop, write an expression to print out the first five tweets in the `tweets` list. Think about:
- how to make a list of only the first 5 tweets (a 'slice')
- how to cycle through that smaller list

In [None]:
# your code here

## Pre-processing

Just as in the last module, when we worked with a single text document, we want to do some **pre-processing** on this text to improve our analysis. Here, the text analysis tools we're using are more sophisticated: they can automatically do things like lower the case of text and remove stop words! But we do have some noise in our data that we want to remove.

Take another look at two tweets in particular:

In [51]:
print(tweets[0])
print()
print(tweets[710])

Main Street is BOOMING thanks to our incredible TAX CUT and Reform law. "This shows small-business owners are more than just optimistic, they are ready to grow their businesses." https://t.co/w9aw68UwOj

RT @IvankaTrump: Touched by the warm hospitality of Prime Minister Abe and the Japanese people. ありがとうございます [Thank you]! Until next time 🇯🇵…


We can see that the first tweet (like many tweets, in fact) contains a link, which probably won't tell us much about Trump's word usage. We can also see that some tweets contain characters that aren't from the English alphabet. It would be helpful to filter these out before we look at things like word frequency.

### Challenge
Clean the tweets by removing links and non-English characters. Hint:
- all links begin with 'http'
- the regex pattern `'[^\x00-\x7F]+'` matches any run of non-ASCII characters, which includes the non-English characters we want to remove
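As a hedged sketch of how these two hints behave, the example below applies `re.sub` to a single made-up string (not a tweet from the dataset). The link pattern `http\S+` is an assumption: it treats a link as `http` followed by everything up to the next whitespace.

```python
import re

# A made-up string in the style of a tweet (not taken from the dataset).
sample = "Great day! 🎉 Read more at https://example.com/story now"

# Remove links: 'http' followed by any run of non-whitespace characters.
no_links = re.sub(r"http\S+", "", sample)

# Remove any run of characters outside the ASCII range (\x00-\x7F).
ascii_only = re.sub(r"[^\x00-\x7F]+", "", no_links)

print(ascii_only)
```

Note that the ASCII pattern also strips characters such as curly quotes and emoji, so it removes a little more than "non-English" text. Applying the same two substitutions to every item in `tweets` is left for the challenge.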

In [55]:
# your code here

Now, let's make sure our pre-processing worked.

In [56]:
print(tweets[0])
print()
print(tweets[710])

Main Street is BOOMING thanks to our incredible TAX CUT and Reform law. "This shows small-business owners are more than just optimistic, they are ready to grow their businesses." 

RT @IvankaTrump: Touched by the warm hospitality of Prime Minister Abe and the Japanese people.   [Thank you]! Until next time  


# 2. The Term Document Matrix <a id='section 2'></a>

Today we'll see our first, admittedly primitive, computational model of language called "Bag of Words". This model was very popular in early text analysis, and continues to be used today. In fact, the models that have replaced it are still very difficult to actually interpret, giving the BoW approach a slight advantage if we want to understand why the model makes certain decisions.

Getting into the model, we'll have to revisit Term Frequency (think `Counter`). We'll then see the Document-Term Matrix (DTM), which we've discussed briefly before. We'll have to normalize these counts if we want to compare documents. Then we'll look at the available Python libraries to streamline this process.


If we plan to compare word frequencies across texts, we could collate these `Counter` dictionaries for each tweet in `tweets`. But we don't want to write all that code! scikit-learn provides a class called `CountVectorizer` that streamlines the process.
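To see what "collating `Counter` dictionaries" would involve, here is a simplified plain-Python sketch of a document-term matrix built from three made-up documents. It is not scikit-learn's implementation: it tokenizes by splitting on whitespace, whereas `CountVectorizer` by default also strips punctuation and ignores single-character tokens.

```python
from collections import Counter

# Three made-up documents (not from the tweet data).
docs = ["the cat sat", "the dog sat down", "the cat saw the dog"]

# Count the tokens in each document separately.
counts = [Counter(doc.lower().split()) for doc in docs]

# The vocabulary is every distinct token, sorted so columns have a fixed order.
vocab = sorted(set(token for c in counts for token in c))

# One row per document, one column per vocabulary word.
# A Counter returns 0 for words a document does not contain.
dtm_rows = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in dtm_rows:
    print(row)
```

Most entries in a real matrix of this kind are zero, because any one document uses only a small part of the whole vocabulary.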

Let's look at the docstring:

In [57]:
CountVectorizer?


In [7]:


# learn the vocabulary and count how often each word appears in each tweet
cv = CountVectorizer()
dtm = cv.fit_transform(tweets)
dtm

<3227x8179 sparse matrix of type '<class 'numpy.int64'>'
	with 66023 stored elements in Compressed Sparse Row format>

In [8]:
# de-sparsify
desparse = dtm.toarray()

# create labels for columns
# (get_feature_names_out requires scikit-learn >= 1.0; older versions used get_feature_names)
word_list = cv.get_feature_names_out()

# create a new table
dtm_df = pd.DataFrame(columns=word_list, data=desparse)
dtm_df.head()

Unnamed: 0,00,000,00am,00ame,00mao6vk7r,00pm,00pme,03e4ybiwr0,05dtfjaubx,075,...,ありがとうございます,そして,アジア歴訪の大成功をお祈りしています,トランプ大統領による,ドナルド,初の,日米同盟の揺るぎない絆を世界に示すことができました,本当にありがとう,歴史的な日本訪問は,間違いなく
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
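# the old sklearn.feature_extraction.stop_words module is no longer available in current scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# recent versions of CountVectorizer expect a list rather than a frozenset
stop_words = list(ENGLISH_STOP_WORDS)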

In [12]:
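# rebuild the document-term matrix, this time dropping stop words
cv = CountVectorizer(stop_words=stop_words)
dtm = cv.fit_transform(tweets)
desparse = dtm.toarray()
# (get_feature_names_out requires scikit-learn >= 1.0)
word_list = cv.get_feature_names_out()

# divide each row by its total so tweets of different lengths are comparable
# (a tweet with no remaining words has a total of 0 and yields NaN)
row_sums = np.sum(desparse, axis=1)
normed = desparse / row_sums[:, None]
dtm_df = pd.DataFrame(columns=word_list, data=normed)
dtm_df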

Unnamed: 0,00,000,00am,00ame,00mao6vk7r,00pm,00pme,03e4ybiwr0,05dtfjaubx,075,...,ありがとうございます,そして,アジア歴訪の大成功をお祈りしています,トランプ大統領による,ドナルド,初の,日米同盟の揺るぎない絆を世界に示すことができました,本当にありがとう,歴史的な日本訪問は,間違いなく
0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
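The normalization step in the cell above divides each row of counts by that row's total, so each value becomes a word's share of the tweet rather than a raw count. The plain-Python sketch below shows the same arithmetic on made-up counts. Keeping an all-zero row as zeros is an assumption about how one might want to handle empty documents; the NumPy division above produces NaN for such rows instead.

```python
# Made-up count rows; the last one is all zeros, as happens when a
# document contains nothing but stop words.
count_rows = [[2, 1, 1], [0, 3, 1], [0, 0, 0]]

normed_rows = []
for row in count_rows:
    total = sum(row)
    if total == 0:
        # Avoid dividing by zero: keep an all-zero row as zeros.
        normed_rows.append([0.0 for _ in row])
    else:
        # Each count becomes a fraction of the row's total.
        normed_rows.append([count / total for count in row])

print(normed_rows)
```

After this step every non-empty row sums to 1, which is what makes a long tweet and a short tweet comparable.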


---

# 3. Sentiment <a id='section 3'></a>

Frequently, we are interested in text to learn something about the person who is speaking. One of these things we've talked about already - linguistic diversity. A similar metric was used a couple of years ago to settle the question of who has the [largest vocabulary in Hip Hop](http://poly-graph.co/vocabulary.html).

> Unsurprisingly, top spots go to Canibus, Aesop Rock, and the Wu Tang Clan. E-40 is also in the top 20, but mostly because he makes up a lot of words; as are OutKast, who print their lyrics with words slurred in the actual typography

Another thing we can learn is about how the speaker is feeling, with a process called sentiment analysis. Before we start, be forewarned that this is not a robust method by any stretch of the imagination. Sentiment classifiers are often trained on product reviews, which limits their ecological validity.

We're going to use TextBlob because it's an easy way to work with text data, and has a built in sentiment classifier.


In [14]:
!pip install textblob
from textblob import TextBlob
blob = TextBlob(tweets[0])
blob.sentences[:10]



[Sentence("Main Street is BOOMING thanks to our incredible TAX CUT and Reform law."),
 Sentence(""This shows small-business owners are more than just optimistic, they are ready to grow their businesses.""),
 Sentence("https://t.co/w9aw68UwOj")]

To check the polarity of a string, we can just iterate through the tweet's sentences. TextBlob will calculate the polarity of each sentence with `sentiment.polarity`, and we can just add it to our accumulator variable `net_pol`.

In [30]:
net_pol = 0
for sentence in blob.sentences:
    pol = sentence.sentiment.polarity
    print(pol, sentence)
    net_pol = net_pol + pol
print()
print("Net polarity of tweet: ", net_pol)

0.4222222222222222 Main Street is BOOMING thanks to our incredible TAX CUT and Reform law.
0.35 "This shows small-business owners are more than just optimistic, they are ready to grow their businesses."
0.0 https://t.co/w9aw68UwOj

Net polarity of tweet:  0.7722222222222221


What's happening behind the scenes? While there are new algorithms for sentiment analysis emerging (cf. `VADER`), most algorithms currently rely only on a `dictionary` of words, each with a corresponding `positive`, `negative`, or `neutral` label. Based on all the words in a sentence, a value is calculated for the sentence as a whole. Not super fancy, I know. Of course, you can change the `dictionary` used in the library itself, or opt for more advanced algorithms that aim to capture context.
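The dictionary idea can be sketched in a few lines. Everything below is hypothetical: the lexicon and its scores are invented for illustration and are not the dictionary that TextBlob ships with, and averaging the matched scores is just one simple way to combine them.

```python
# Hypothetical toy lexicon: words and scores invented for illustration.
# This is NOT the dictionary that TextBlob uses.
toy_lexicon = {"great": 0.8, "incredible": 0.9, "fake": -0.5, "sad": -0.6}

def toy_polarity(sentence):
    """Average the scores of the words in `sentence` found in the lexicon."""
    words = sentence.lower().split()
    scores = [toy_lexicon[w] for w in words if w in toy_lexicon]
    if not scores:
        # No known words: treat the sentence as neutral.
        return 0.0
    return sum(scores) / len(scores)

print(toy_polarity("What a great and incredible day"))
print(toy_polarity("Total fake news"))
print(toy_polarity("The meeting is at noon"))
```

A scorer like this has no notion of negation or sarcasm ("not great" still counts as positive), which is one reason to treat dictionary-based polarity scores with caution.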

### Challenge
Write a function that will calculate the net polarity of a tweet. Then, use the function to calculate the net polarity of all tweets in `tweets`. We've given you some skeleton code to help you get started.

Hint: use the code in the previous cell as a guide, but remember: we probably don't want to print every sentence for all 3000-odd tweets!

In [26]:
def get_net_polarity(tweet):
    '''Returns the net polarity of the tweet.'''
    blob = ...
    #your code here
    return ...

In [27]:
# makes an empty list
all_tweet_polarity = []

# cycle through each tweet in tweets
for tweet in tweets:
    # add the net polarity to the end of the list
    all_tweet_polarity.append(...)
    
all_tweet_polarity

[0.7722222222222221,
 -0.20000000000000007,
 0.3181818181818182,
 -0.11249999999999999,
 0.5308712121212121,
 0.1875,
 0.3,
 0.14047619047619045,
 -0.04166666666666667,
 1.15,
 1.15,
 1.08,
 1.5236111111111112,
 0.0,
 0.525,
 -0.225,
 1.2571428571428571,
 0.875,
 1.325,
 -0.25,
 0.26785714285714285,
 0.0,
 0.7875000000000001,
 0.0,
 0.2857142857142857,
 0.625,
 -0.7333333333333333,
 0.525,
 0.0,
 0.12222222222222223,
 -0.48958333333333337,
 -0.075,
 -0.25,
 -0.7,
 0.3,
 0.31666666666666665,
 -0.675,
 -1.1666666666666665,
 0.5,
 0.0,
 -0.6499999999999999,
 -0.3,
 0.0,
 0.0,
 0.24333333333333332,
 -0.09999999999999994,
 -1.0,
 -1.0,
 0.0,
 0.3333333333333333,
 0.43333333333333335,
 1.1,
 0.0,
 0.08437499999999987,
 -0.39999999999999997,
 0.3375,
 0.8172619047619047,
 0.2095238095238095,
 -0.11174242424242424,
 0.590625,
 0.0,
 0.3,
 -0.5,
 0.0,
 -0.8693181818181819,
 -0.15555555555555559,
 0.05,
 0.0,
 0.3814814814814814,
 0.3325,
 0.3,
 0.125,
 0.4,
 0.29285714285714287,
 0.0,
 0.0,
 1.

Now, let's add the net polarities to our original table.

In [28]:
trump['net_polarity'] = all_tweet_polarity

trump.head()

| id | retweet_count | source | text | time | net_polarity |
|---|---|---|---|---|---|
| 966006815745040384 | 6945 | Twitter for iPhone | Main Street is BOOMING thanks to our incredibl... | 2018-02-20 17:49:07 | 0.772222 |
| 965971586913374208 | 11350 | Twitter for iPhone | ....cameras running. Another False Accusation.... | 2018-02-20 15:29:07 | -0.2 |
| 965968309358333952 | 12435 | Twitter for iPhone | A woman I don’t know and, to the best of my kn... | 2018-02-20 15:16:06 | 0.318182 |
| 965943827931549696 | 10989 | Twitter for iPhone | I have been much tougher on Russia than Obama,... | 2018-02-20 13:38:49 | -0.1125 |
| 965937068907073536 | 13746 | Twitter for iPhone | Hope Republicans in the Great State of Pennsyl... | 2018-02-20 13:11:58 | 0.530871 |


## 4. Further Resources <a id='section 4'></a>

These modules have covered just a few of the text analysis methods out there today. If you'd like to learn more, or if you're interested in using these techniques in your own text analysis project, here are some resources you might find helpful.

- Located on the first floor of Moffitt Library, the [Data + Digital Research Help](https://data.berkeley.edu/education/data-digital-research-help) service provides students with a hub for support in their co-curricular data science projects.

- [D-Lab](http://dlab.berkeley.edu/calendar-node-field-date) provides a variety of free workshop trainings for those interested in learning Python, R, Stata, Excel, Geospatial Mapping, Qualitative Methods, etc.

- The [Data Lab](http://www.lib.berkeley.edu/libraries/data-lab) offers consultations to current UC Berkeley students, staff and faculty on research involving numeric data, including finding and recommending data sources and advising on technical data issues such as file format conversion, web scraping, and basic data analysis assistance. 
- D-Lab also hosts a robust list of [campus data-related resources](http://dlab.berkeley.edu/dlab-campus-resources), including places to get data sets and data analysis support.

---

## Bibliography

- For loops section from DS-Modules Core Resources: https://github.com/ds-modules/core-resources/blob/master/intro/control-for-loops.ipynb
- Document Term Matrix section adapted from materials by Chris Hench: https://github.com/henchc/textxd-2017/blob/master/06-DTM.ipynb
- Sentiment section adapted from the D-Lab's "Intro to Text Analysis" workshop: https://github.com/dlab-berkeley/python-text-analysis/blob/master/Intro_to_TextAnalysis/Intro_to_TextAnalysis.ipynb


---
Notebook developed by: Keeley Takimoto

Data Science Modules: http://data.berkeley.edu/education/modules
