# Turning Text into Numbers

Learning Objectives:
- Introduce bag-of-words representation for text
- Apply bag-of-words approach to sentiment analysis task

One of the key benefits of data wrangling and analysis is pulling out patterns from complex data. One rich source of data is text, ranging from social media posts to novels to news corpora.

The key question surrounding text data is: how do we go from words to numbers that are compatible with models and statistical analysis? Today we will go through an example using movie review data and sentiment analysis for text.




## Movie Review Dataset and Sentiment Analysis


Today we'll use a [data set](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download) that contains reviews of movies from IMDB. Each review reflects its author's **sentiment**, or attitude, towards the movie being reviewed, be it positive or negative.

Let's load in the dataset and take a look at a few examples.


In [29]:
import pandas as pd

# Load the cleaned IMDB reviews and keep the first 10,000 rows
df = pd.read_csv("https://raw.githubusercontent.com/emilygrabowski/demos-tutorials/main/text_analysis/IMDB_reviews_cleaned.csv")[:10000]
text = df['text'].values            # review text
sentiment = df['sentiment'].values  # sentiment labels

**Question:** Are the reviews below negative or positive? What specific words or phrases from the text helped you to make that judgment?

In [31]:
print("Example 1:")
text[14]

Example 1:


"This a fantastic movie of three prisoners who become famous. One of the actors is george clooney and I'm not a fan but this roll is not bad. Another good thing about the movie is the soundtrack (The man of constant sorrow). I recommand this movie to everybody. Greetings Bart"

In [32]:
print("Example 2:")
text[19]

Example 2:


"An awful film! It must have been up against some real stinkers to be nominated for the Golden Globe. They've taken the story of the first famous female Renaissance painter and mangled it beyond recognition. My complaint is not that they've taken liberties with the facts; if the story were good, that would perfectly fine. But it's simply bizarre -- by all accounts the true story of this artist would have made for a far better film, so why did they come up with this dishwater-dull script? I suppose there weren't enough naked people in the factual version. It's hurriedly capped off in the end with a summary of the artist's life -- we could have saved ourselves a couple of hours if they'd favored the rest of the film with same brevity."

In [34]:
print('Example 3:')
text[7]

Example 3:


"This show was an amazing, fresh & innovative idea in the 70's when it first aired. The first 7 or 8 years were brilliant, but things dropped off after that. By 1990, the show was not really funny anymore, and it's continued its decline further to the complete waste of time it is today.<br /><br />It's truly disgraceful how far this show has fallen. The writing is painfully bad, the performances are almost as bad - if not for the mildly entertaining respite of the guest-hosts, this show probably wouldn't still be on the air. I find it so hard to believe that the same creator that hand-selected the original cast also chose the band of hacks that followed. How can one recognize such brilliance and then see fit to replace it with such mediocrity? I felt I must give 2 stars out of respect for the original cast that made this show such a huge success. As it is now, the show is just awful. I can't believe it's still on the air."

## Bag-of-words


Now that we've taken a look at the dataset, let's talk about transforming the text to numbers. We need a framework for assigning meaningful numeric values to each review. One way to simplify this problem is to treat the text as a collection of unordered words, which removes the need to account for things like word order.

For example, in judging the sentiment of each of those reviews, there are certain keywords and phrases that are helpful for identifying sentiment. 

1. Positive sentiment: Good, happy, best, awesome
2. Negative sentiment: Worst, terrible, horrible, bad

The idea that individual words are informative as to the content of a text is the logic underpinning the bag-of-words approach. You take the words in a review, put them in a bag, shake it up, and count the occurrences of each word in the review. And now we have numbers!

![Bag-of-words illustration](https://dudeperf3ct.github.io/images/lstm_and_gru/bag-of-words.png)


[Image source](https://dudeperf3ct.github.io/images/lstm_and_gru/)
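
To make the "shake it up" step concrete, here is a minimal sketch using only the Python standard library. The example sentence is made up for illustration, and splitting on whitespace is a simplifying assumption rather than how a real tokenizer works.

```python
import random
from collections import Counter

# A made-up sentence, split into words on whitespace (a simplification)
words = "a great cast and a great soundtrack".split()

# The "bag": each unique word mapped to how often it occurs
bag = Counter(words)

# Shake the bag: shuffle the word order, then count again
shuffled = words[:]
random.shuffle(shuffled)

print(bag)
print(Counter(shuffled) == bag)  # the counts do not depend on word order
```

However the words are shuffled, the counts stay the same, which is exactly the information a bag-of-words representation keeps.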


**Question:** What are some cases where a bag-of-words approach might not work? (i.e. where context changes the meaning of words)

## Bag-of-words in Python

The steps for generating this result are as follows:
1. Preprocess the text (lower case, remove punctuation and formatting)
2. Identify all of the unique words in the text
3. Count the occurrence of each unique word in each review (which can be 0)
4. Make a table of reviews x vocabulary 


This is all possible to do in base Python, but the more efficient and common way is to use a package developed for this purpose which will roll all of these steps together. 
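
For reference, the four steps above could be sketched in base Python roughly as follows. The helper names `tokenize` and `bag_of_words` are hypothetical, the toy reviews are made up, and the tokenization rule is a simplifying assumption that will not match a dedicated package exactly.

```python
import re
from collections import Counter

def tokenize(review):
    """Step 1: lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", review.lower())

def bag_of_words(reviews):
    """Return (vocabulary, table), where table[i][j] counts word j in review i."""
    token_lists = [tokenize(review) for review in reviews]
    # Step 2: collect every unique word across all reviews
    vocabulary = sorted({token for tokens in token_lists for token in tokens})
    # Step 3: count each word within each review (missing words count as 0)
    counts = [Counter(tokens) for tokens in token_lists]
    # Step 4: build the reviews x vocabulary table
    table = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, table

toy_reviews = ["Good movie, good cast!", "Bad movie."]
vocabulary, table = bag_of_words(toy_reviews)
print(vocabulary)
print(table)
```

Each row of `table` corresponds to one review and each column to one vocabulary word.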

Today, we will use a class called [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which will *count* up all of the occurrences of each word in the text and make a *vector*. We won't get into the details of the whole package that this comes from (scikit-learn, imported as `sklearn`, which is an excellent machine learning package) today, but I'll walk through the example code below:

In [36]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # Create the bag-of-words vectorizer
X = vectorizer.fit_transform(text).toarray()  # Build the bag-of-words representation and convert it to a dense matrix


print("Number of items in the vocabulary:", X.shape[1])
dtm = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())  # Add vocabulary labels to the matrix
dtm


Number of items in the vocabulary: 52640


Unnamed: 0,00,000,00001,0069,007,00am,00s,01,0126,01pm,...,être,ís,ísnt,île,ïn,óli,önsjön,über,überwoman,ünfaithful
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, let's answer some questions about the table: 
1. How many reviews are in the table?
2. How many unique words are in the table?
3. What does each of the numbers stand for? What does a row correspond to?

Now let's take a look at the most common words in the data. Again, we won't worry too much about the code and will focus on the output of the cell below:

In [37]:
print("Average occurrence of words in the dataset:")
dtm.mean().sort_values(ascending=False)

Average occurrence of words in the dataset:


the           13.3888
and            6.4880
of             5.7838
to             5.3343
is             4.2572
               ...   
fume           0.0001
fumiko         0.0001
quinlan        0.0001
quine          0.0001
ünfaithful     0.0001
Length: 52640, dtype: float64

Notice that in the tables above, there are many vocabulary words that might not be informative for our model (i.e. a lot of zeros in the bag-of-words representation).  We can add a few arguments in order to help make the results more effective. First, let's set the `stop_words` argument to remove common function words in English like 'to', 'the', 'and', etc. We can also restrict the overall number of words in the vocabulary by changing the `max_features` argument. Try out a few different values for `max_features`. How does it change the results?


In [38]:
vectorizer2 = CountVectorizer(max_features=500, stop_words='english')  # Keep the 500 most frequent non-stop words
X = vectorizer2.fit_transform(text).toarray()  # Make the bag-of-words representation

print("Number of items in the vocabulary:", X.shape[1])
term_matrix = pd.DataFrame(X, columns=vectorizer2.get_feature_names_out())  # Make the matrix

print("Most common words in the dataset:")
term_matrix.mean().sort_values(ascending=False)

Number of items in the vocabulary: 500
Most common words in the dataset:


br           4.0964
movie        1.7778
film         1.5658
like         0.8101
just         0.7197
              ...  
enjoyable    0.0343
thriller     0.0339
events       0.0339
actual       0.0339
talking      0.0338
Length: 500, dtype: float64

## **Demo**: Regression for sentiment analysis

Now that we have a numerical representation, we can use models to pull out patterns in the data. In this case, we will use **logistic regression**, a common type of regression used for binary outcome variables (0=negative, 1=positive). 
Logistic regression uses an S-shaped curve to calculate the probability that the outcome is 1 rather than 0.

![](https://pimages.toolbox.com/wp-content/uploads/2022/04/11040522/46-4.png)



The main intuition for reading a logistic regression is that words with negative coefficients push a review toward negative sentiment, and words with positive coefficients push it toward positive sentiment.
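
To see how coefficients turn into a probability, here is a small sketch of the S-curve, p = 1 / (1 + e^(-z)), where z is the sum of each word's count times its coefficient. The coefficients below are hypothetical round numbers chosen for illustration, not fitted values, and the sketch assumes a model with no intercept term.

```python
import math

def sigmoid(z):
    """The S-shaped curve: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_positive(word_counts, coefficients):
    """Probability of positive sentiment from word counts (no intercept assumed)."""
    score = sum(coefficients.get(word, 0.0) * n for word, n in word_counts.items())
    return sigmoid(score)

# Hypothetical coefficients, for illustration only
coefs = {"excellent": 1.4, "waste": -2.4}

p_pos = predict_positive({"excellent": 1}, coefs)
p_neg = predict_positive({"waste": 1}, coefs)
p_neutral = predict_positive({"movie": 1}, coefs)  # word without a coefficient adds 0

print(p_pos, p_neg, p_neutral)
```

A score of zero maps to a probability of exactly 0.5, positive scores land above 0.5, and negative scores land below it.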


The code for generating a logistic regression is given below. Are there any surprising results in the most negative and most positive words?

In [40]:
import statsmodels.api as sm

res = sm.Logit(sentiment,term_matrix).fit()
print('Regression coefficients for each word:')
res.params.sort_values()

Optimization terminated successfully.
         Current function value: 0.330388
         Iterations 8
Regression coefficients for each word:


waste       -2.380682
awful       -1.708785
dull        -1.693389
worst       -1.592559
horrible    -1.500726
               ...   
wonderful    1.018785
perfect      1.064582
amazing      1.070463
loved        1.102197
excellent    1.361603
Length: 500, dtype: float64

We can also predict the sentiment for a new text. Now that we are talking about predictions rather than coefficients, a predicted value closer to 1 means more positive, and a predicted value closer to 0 means more negative.

Let's take a couple of examples below.

In [42]:
new_text = ["This is the worst movie so bad, terrible, horrible","This is the best movie ever I'm such a fan, what an excellent performance"]
X = vectorizer2.transform(new_text).toarray() # Make the bag of words representation
term_matrix = pd.DataFrame(X,columns = vectorizer2.get_feature_names_out()) #Make the matrix
res.predict(term_matrix) #predict with the logistic regression


0    0.007466
1    0.874598
dtype: float64
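
Note that the new text is passed through `transform`, not `fit_transform`, so that it is counted against the vocabulary the model was trained on. One way to keep those two steps in sync is to bundle them together. The sketch below does this with scikit-learn's `make_pipeline` and `LogisticRegression`, which is a different implementation from the statsmodels `Logit` used above (it applies regularization and an intercept by default), trained on a tiny made-up corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: 1 = positive, 0 = negative
train_text = [
    "excellent movie, loved it",
    "wonderful and amazing film",
    "awful movie, a waste of time",
    "dull and horrible film",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier together
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_text, train_labels)

# predict_proba returns one column per class; column 1 is P(positive)
new_reviews = ["an excellent and amazing film", "a dull, awful waste"]
probs = model.predict_proba(new_reviews)[:, 1]
print(probs)
```

With only four training reviews the probabilities will sit fairly close to 0.5, but the direction should match the words each review contains.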

**Exercise:** Add at least three new reviews to the `new_text` list. 

1. Write one review that is predicted to be very positive (near 1), one that is very negative (near 0), and one in between (near 0.5).
2. Try to 'trick' the model: include examples that a human would likely interpret as positive but the model would interpret as negative, or vice versa. What kinds of elements in a review seem to trick the model?

## Wrapping it up

Today we walked through one way to make a numerical representation of text, the **bag-of-words** representation, and used that representation in a logistic model for sentiment analysis. This representation has several shortcomings:

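1. Phrases and negation: "not good" still contributes a count for "good"
2. Tone/sarcasm: the same words can carry the opposite meaning
3. Frequency bias: raw counts are dominated by very common words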

However, this is still a useful representation because of its simplicity and effectiveness, and can be built upon for other representations. With numeric representations for text, we can now apply many useful statistical tests and models to text data and identify interesting patterns in data. This leads into the field of **computational text analysis**, where there is a lot of really interesting work on identifying patterns in text data.

**Final Discussion**: What are some cases where you run into text analysis in your daily life?