# Sentiment Analysis and Topic Models


Consider the sentiment of the following statements:

- Coronet has the best lines of all day cruisers.
- Bertram has a deep V hull and runs easily through seas.
- Pastel-colored 1980s day cruisers from Florida are ugly.
- I dislike old cabin cruisers.

Are they positive, negative, or neutral?  Why?

In [2]:
%matplotlib inline 
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from textblob import TextBlob

### Sentiment Analysis

In [3]:
text = 'Hi, I thought the speech you gave was awful, your hair looked terrible, and your mom would be ashamed.'

In [4]:
analysis = TextBlob(text)

In [5]:
pos_or_neg = analysis.sentiment.polarity

In [13]:
pos_or_neg

-0.6999999999999998

In [113]:
ny = pd.read_csv('data/ny_donors.csv')

In [114]:
sent = ny.project_essay_2[30]

In [115]:
sent

"We currently have 3 outdated desktop computers in our classroom that still run on Windows XP! We like to use reading websites, like Raz Kids, Starfall and News-O-Matic to practice our reading skills. It's difficult to make sure everyone gets a fair turn when there are 24 students sharing 3 computers. Most of them don't have access to technology at home either. \\r\\nThese new Kindle Fires will allow more students to have access to technology at the same time. Students will be able to read ebooks, as well as use other reading apps to practice reading on their own level. They can also use them for research for their writing, publishing their work, and learning how to use different types of  technology. It will encourage reluctant readers to practice reading more if they can have access to different kinds of books on technology."

In [116]:
analysis = TextBlob(sent)

In [117]:
analysis.sentiment.polarity

0.18742424242424244

In [118]:
analysis.sentiment.subjectivity

0.6067845117845116

In [119]:
import spacy

In [120]:
nlp = spacy.load('en')

In [10]:
doc = nlp(sent)

In [11]:
for sent in doc.sents:
    print(sent)

We currently have 3 outdated desktop computers in our classroom that still run on Windows XP!
We like to use reading websites, like Raz Kids, Starfall and News-O-Matic to practice our reading skills.
It's difficult to make sure everyone gets a fair turn when there are 24 students sharing 3 computers.
Most of them don't have access to technology at home either.
\r\nThese new Kindle Fires will allow more students to have access to technology at the same time.
Students will be able to read ebooks, as well as use other reading apps to practice reading on their own level.
They can also use them for research for their writing, publishing their work, and learning how to use different types of  technology.
It will encourage reluctant readers to practice reading more if they can have access to different kinds of books on technology.


In [21]:
from spacy import displacy

In [26]:
displacy.render(doc, style = 'ent', jupyter = True)

In [33]:
from textblob import TextBlob

In [40]:
text = ny.project_essay_2[10]

In [41]:
blob = TextBlob(text)

In [42]:
blob.tags[:5]

[('We', 'PRP'),
 ('are', 'VBP'),
 ('looking', 'VBG'),
 ('to', 'TO'),
 ('add', 'VB')]

In [121]:
blob.tags[0][0]

'We'

In [122]:
blob.tags[0][1]

'PRP'

In [123]:
for sent in blob.sentences:
    print(sent.sentiment.polarity, sent[:10])

0.0 We are loo
0.0 But, we ar
0.7 We believe
0.0 \r\n\r\nA 
0.0 For exampl
0.5 More stude
0.0 We could t
0.3181818181818182 Our robot 
0.0 The possib
0.0 Robotics c
0.0 It allows 
0.08333333333333333 This will 
0.0 Thank you!


In [44]:
blob.words

WordList(['We', 'are', 'looking', 'to', 'add', 'robotics', 'coding', 'and', 'programming', 'to', 'our', 'STEM', 'lab', 'in', 'a', 'rural', 'school', 'But', 'we', 'are', 'currently', 'lacking', 'the', 'technology', 'to', 'accomplish', 'this', 'with', 'our', 'students', 'We', 'believe', 'a', 'set', 'of', 'three', 'GoPiGo', 'Robot', 'Starter', 'Kits', 'and', 'Raspberry', 'Pi', 'computers', 'would', 'be', 'the', 'perfect', 'fit', 'to', 'introduce', 'hands-on', 'innovation', 'r\\n\\r\\nA', 'set', 'of', 'three', 'Raspberry', 'Pi', 'computers', 'to', 'program', 'three', 'GoPiGo', 'Robots', 'can', 'engage', 'a', 'variety', 'of', 'grade', 'levels', 'covering', 'a', 'variety', 'of', 'STEAM', 'topics', 'and', 'disciplines', 'For', 'example', 'younger', 'programmers', 'can', 'be', 'introduced', 'to', 'coding', 'by', 'moving', 'the', 'robots', 'through', 'a', 'maze', 'or', 'competing', 'in', 'a', 'robot', 'soccer', 'match', 'More', 'students', 'can', 'learn', 'about', 'planets', 'and', 'program', '

In [46]:
blob.words.count('robotics')

2

In [48]:
blob.words.count('STEM')

2

In [49]:
blob.words.count('technology')

1

In [50]:
blob.words.count('robot')

5

### Problem

Add columns to our dataframe `ny_donors` that contain scores for sentiment and polarity of the `project_essay_1` and `project_essay_2`.

### Topic Models

Below, we explore two approaches to modeling topics with `scikitlearn`: NMF and LDA.  Before we can use these models, we have to preprocess our data.  Below, we do so for both a basic `CountVectorizer` and a `TfidfVectorizer`.

In [52]:
from sklearn.feature_extraction.text import CountVectorizer

In [54]:
vect = CountVectorizer()

In [55]:
X = vect.fit_transform(ny.project_essay_2)

In [56]:
X

<12157x20418 sparse matrix of type '<class 'numpy.int64'>'
	with 1056010 stored elements in Compressed Sparse Row format>

In [57]:
X2 = X.toarray()

In [59]:
names = vect.get_feature_names()
words = pd.DataFrame(X2, columns=names)

In [60]:
words.head()

Unnamed: 0,00,000,00am,00pm,021,04,05a,06,08,10,...,zoo,zoob,zoology,zoom,zooms,zooplankton,zoos,zucchini,zuma,zumba
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [72]:
vect = CountVectorizer(min_df = 15, stop_words = 'english')
X = vect.fit_transform(ny.project_essay_2)
X2 = X.toarray()
names = vect.get_feature_names()
words = pd.DataFrame(X2, columns=names)

In [73]:
words.head()

Unnamed: 0,00,000,10,100,11,11th,12,13,14,15,...,yes,yoga,york,young,younger,youngest,youngsters,youth,yummy,zone
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [74]:
from sklearn.feature_extraction.text import TfidfTransformer

In [75]:
transformer = TfidfTransformer()
transformer

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)

In [76]:
tfidf = transformer.fit_transform(words)

In [77]:
tfidf

<12157x4175 sparse matrix of type '<class 'numpy.float64'>'
	with 608648 stored elements in Compressed Sparse Row format>

In [78]:
transformer.fit_transform(words[:10]).toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.10751966, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [83]:
words.columns[:10]

Index(['00', '000', '10', '100', '11', '11th', '12', '13', '14', '15'], dtype='object')

In [79]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

no_topic = 10

nmf = NMF(n_components=no_topic).fit(tfidf) 

lda = LatentDirichletAllocation(n_topics=5).fit(tfidf) 

In [81]:
def display_topics(model, feature_names, no_top_words): 
    for topic_idx, topic in enumerate(model.components_): 
        print ("Topic %d:" % (topic_idx) )
        print( " ".join([feature_names[i] 
                        for i in topic.argsort()[:-no_top_words - 1:-1]]) )

In [84]:
no_top_words = 10

display_topics(nmf, words.columns, no_top_words) 

Topic 0:
supplies need students paper year materials pencils classroom school help
Topic 1:
books reading read library book students readers level love classroom
Topic 2:
technology students access chromebooks use computer research computers classroom able
Topic 3:
seating students chairs sit classroom work sitting flexible stools comfortable
Topic 4:
math skills students learning games materials help fun centers practice
Topic 5:
school students healthy snacks day children play music equipment snack
Topic 6:
art create projects paint express creative artists students creativity arts
Topic 7:
printer print ink color work students classroom projects able pictures
Topic 8:
science stem students world hands explore learn learning kits life
Topic 9:
ipad ipads apps use technology students able classroom learning reading


In [85]:
display_topics(lda, words.columns, no_top_words) 

Topic 0:
calculators cd athletes theater puppets graphing debate dna voices calculator
Topic 1:
students help classroom learning reading books school use able work
Topic 2:
fitness physical exercise weight gym equipment health strength active activity
Topic 3:
supplies paper pencils markers art easel erase dry notebooks folders
Topic 4:
basketball soccer basketballs dot privacy dash hoops magna equip engineer


In [96]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(ny.project_essay_1)
print(dtm_tf.shape)

(12157, 3827)


In [100]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(ny.project_essay_1)
print(dtm_tfidf.shape)

(12157, 3827)


In [101]:
# for TF DTM# for T 
lda_tf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_topics=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1, n_topics=20,
             perp_tol=0.1, random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

### Visualizing Topic Models

In [109]:
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [110]:
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

In [111]:
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

### Project Essay Visualizations

What are some topics in the `project_essay_2` column?  Can you determine a way to incorporate these into a `LogisticRegression` model?