# When Rotten Tomatoes Isn't Enough: Twitter Sentiment Analysis with DSE

### Things To Setup
* Create a Twitter Account and get API access: https://developer.twitter.com/en/docs/ads/general/guides/getting-started.html
* Install DSE https://docs.datastax.com/en/install/doc/install60/installTOC.html
* Start DSE Analytics Cluster: dse cassandra -k #Must use -k option for Analytics
* Install Anaconda and Jupyter #Anaconda is not required but will make installing jupyter easier 
* Start Jupyter with DSE to get all environemnt variables: dse exec jupyter notebook
* !pip install cassandra-driver
* !pip install tweepy 
* !pip install pattern 
* Counter-intuitive don't install pyspark!!

#### Add some environment variables to find dse verision of pyspark

In [194]:
# Needed to be able to find pyspark libaries
import sys
sys.path.append("/Users/amanda.moran/cassandra/dse-6.0.1/resources/spark/python/lib/pyspark.zip")
sys.path.append("/Users/amanda.moran/cassandra/dse-6.0.1/resources/spark/python/lib/py4j-0.10.4-src.zip")

#### Import python packages -- all are required

In [195]:
import pandas
import cassandra
import pyspark
import tweepy
import re
from IPython.display import display, HTML
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pattern.en import sentiment, positive

#### Helper function to have nicer formatting of Spark DataFrames

In [196]:
#Helper for pretty formatting for Spark DataFrames
def showDF(df, limitRows =  20, truncate = True):
    if(truncate):
        pandas.set_option('display.max_colwidth', 50)
    else:
        pandas.set_option('display.max_colwidth', -1)
    pandas.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    pandas.reset_option('display.max_rows')

### Creating Tables, Pulling Tweets, and Loading Tables

#### Connect to DSE Analytics Cluster

In [197]:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1']) #If you have a locally installed DSE cluster
session = cluster.connect()

#### Create Demo Keyspace 

In [198]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS dseanalyticsdemo 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

<cassandra.cluster.ResultSet at 0x11b68b8d0>

#### Set keyspace 

In [199]:
session.set_keyspace('dseanalyticsdemo')

#### Set Movie Title variable --Change this to search for different movies!

In [242]:
movieTitle = "mamamia2"

In [243]:
positiveNegative = ["pos", "sad"] 

#### Create two tables in Cassandra for the movie title. One of negative tweets and one for positive tweets. Twitter returns a lot of information with each call but for this demo we will just utilize the twitter id (as our Primary key as it is unique) and the actual tweet. 
#### Is using twitter id the right value to distriubte by? Consider your data model when choosing your primary key. 

In [244]:
for emotion in positiveNegative: 
    
    query = "CREATE TABLE IF NOT EXISTS movie_tweets_%s_%s (twitterid bigint, tweet text, PRIMARY KEY (twitterid))" % (movieTitle, emotion)
    print query
    session.execute(query)


CREATE TABLE IF NOT EXISTS movie_tweets_mamamia2_pos (twitterid bigint, tweet text, PRIMARY KEY (twitterid))
CREATE TABLE IF NOT EXISTS movie_tweets_mamamia2_sad (twitterid bigint, tweet text, PRIMARY KEY (twitterid))


#### Setting up Search Terms for gathering tweets from Twitters API. The happy :) and sad :( face are twitter operators to find positive and negative tweets

In [245]:
searchTermSad= movieTitle + " :("
searchTermPos= movieTitle + " :)"

searchTerms = [searchTermSad, searchTermPos]

#### Function to CleanUp Each Tweet before if is inserted into Cassandra.
#### Removing: 
* emojis 
* flags 
* special characters 
* URL's 
* RT (for Retweet)

In [246]:
#Code from: https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-

def cleanUpTweet(tweet):
    
    emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"
    u"(\ud83c[\udf00-\uffff])|"  
    u"(\ud83d[\u0000-\uddff])|" 
    u"(\ud83d[\ude80-\udeff])|"  
    u"(\ud83c[\udde0-\uddff])" 
    "+", flags=re.UNICODE)

    removeSpecial = re.compile ('[\n|#|@|!|.|?|,|\"]')
    removeHttp = re.compile("http\S+ | https\S+")
    removeRetweet = re.compile("RT")
    
    noemoji = emoji_pattern.sub(r'', tweet)
    nospecial = removeSpecial.sub(r'', noemoji)
    nohttp = removeHttp.sub(r'', nospecial)
    noretweet = removeRetweet.sub(r'', nohttp)
    
    cleanTweet=noretweet
    
    return cleanTweet

#### Required from Twitter: 
* consumer_key= ''
* consumer_secret= ''
* access_token=''
* access_token_secret=''

In [247]:
consumer_key= ''
consumer_secret= ''

access_token=''
access_token_secret=''

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

#### This cell will pull tweets from Twitter. The max number of tweets returned for free at one time is 100. 
#### Run this code a couple of times to get more data! 
#### Once the tweets are collected, loop over the list, clean up each tweet, and then insert it into the table. A large for loops surrounds this to make one call for postive tweets and one call for negative tweets. Happy and sad face have been URL encoded. :) = "%22%20%3A%29" and :( = "%22%20%3A%28"

In [248]:
for emotion in positiveNegative:
    print emotion
    query = "INSERT INTO movie_tweets_%s_%s (twitterid, tweet)" % (movieTitle, emotion)
    query = query + " VALUES (%s, %s)"
    
    if emotion == "pos":
        searchTermPos= movieTitle + "%22%20%3A%29"
        public_tweets = api.search(q=movieTitle, lang="en", count="100")
    if emotion == "sad":
        searchTermPos= movieTitle + "%22%20%3A%28"
        public_tweets = api.search(q=movieTitle, lang="en", count="100")

    for tweet in public_tweets:
        cleanTweet = cleanUpTweet(tweet.text)
        session.execute(query, (tweet.id, cleanTweet))
        print(cleanTweet)

pos
Right Brownie points for me Im off to see The cheese fest that will be MamaMia2
Cher singing Fernando Awful Absolutely terrible- but in any case I am not a fan by any means just a pity…
 hhgarcia41: Wow Thank you China &amp; thank you world In our 2nd week out our skyscrapermovie is not only the 1 movie on the planet be…
I fell in love with Lily James MamaMia2
“Life is short the world is wide I want to make some memories”mamamia2
Off to see MamaMia2
Wow Thank you China &amp; thank you world In our 2nd week out our skyscrapermovie is not only the 1 movie on the p…
On my way to go see MamaMia2 I'm so excited 
If Cher is craving popcorn  I’m gonna buy popcorn MamaMia2
 is_quin: The last 5 minutes of MamaMia2 deserves an Oscar
Just days before the film hits theatres on Friday James revealed that she’s “struggled” with her voice a lot sinc…
That free-spirited Donna  MamaMia2
AM I DEAD OR SOMTHINGGGG THIS ISNT REALLMamaMia2 CANT WAIT
 Krusty101: Tonight more than ever l wished real life 

Right Brownie points for me Im off to see The cheese fest that will be MamaMia2
Cher singing Fernando Awful Absolutely terrible- but in any case I am not a fan by any means just a pity…
 hhgarcia41: Wow Thank you China &amp; thank you world In our 2nd week out our skyscrapermovie is not only the 1 movie on the planet be…
I fell in love with Lily James MamaMia2
“Life is short the world is wide I want to make some memories”mamamia2
Off to see MamaMia2
Wow Thank you China &amp; thank you world In our 2nd week out our skyscrapermovie is not only the 1 movie on the p…
On my way to go see MamaMia2 I'm so excited 
If Cher is craving popcorn  I’m gonna buy popcorn MamaMia2
 is_quin: The last 5 minutes of MamaMia2 deserves an Oscar
Just days before the film hits theatres on Friday James revealed that she’s “struggled” with her voice a lot sinc…
That free-spirited Donna  MamaMia2
AM I DEAD OR SOMTHINGGGG THIS ISNT REALLMamaMia2 CANT WAIT
 Krusty101: Tonight more than ever l wished real life incl

#### Do a select * on each table and verify that the tweets have been inserted into each Cassandra table

In [249]:
for emotion in positiveNegative:
    print emotion
    query = 'SELECT * FROM movie_tweets_%s_%s' % (movieTitle, emotion)
    rows = session.execute(query)
    for user_row in rows:
        print (user_row.twitterid, user_row.tweet)

pos
(1021707917354000384, u' omid9: I LOVE BLOCK CAPITALSTHERE SHOULD BE AN ANNUAL WORLD SHOUTY DAYGO SEE MamaMia2 AND ME marlowetheatre THIS SATURDAY GOO\u2026')
(1021782859315593221, u' is_quin: The last 5 minutes of MamaMia2 deserves an Oscar')
(1021697750432407552, u'Tune in to LadyXsize Fitness Hour on ukhealthradio This week Music &amp; TV sensation Jamelia talks about Mum an\u2026')
(1021663090922938368, u'emmafreud Everyone cheered in Brighton Also clapped at the surprise bit at the very end the people who watch all\u2026')
(1021594480372273152, u'hi cher I\u2019m gay now MamaMia2')
(1021732119205298176, u' samanthabaines: Here I am re-enacting the greatest moments from MnetMAMA MamaMia2 cher (as a rabbit/bear) \U0001f923')
(1019759855576367104, u"why so many kid characters I didn't know millennials are the target audience for Abba songs :P Would've preferred\u2026")
(1020694063572574210, u'Just watched MamaMia2 earlier today - loved it- and did anyone else notice the lovely AB

### Finally time for Apache Spark! 

#### Create a spark session that is connected to Cassandra. From there load each table into a Spark Dataframe and take a count of the number of rows in each.

In [250]:
countTokens = udf(lambda words: len(words), IntegerType())

spark = SparkSession.builder.appName('demo').master("local").getOrCreate()

tableNamePos = "movie_tweets_%s_pos" % (movieTitle.lower())
tableNameSad = "movie_tweets_%s_sad" % (movieTitle.lower())
tablepos = spark.read.format("org.apache.spark.sql.cassandra").options(table=tableNamePos, keyspace="dseanalyticsdemo").load()
tablesad = spark.read.format("org.apache.spark.sql.cassandra").options(table=tableNameSad, keyspace="dseanalyticsdemo").load()

print "Postive Table Count: "
print tablepos.count()
print "Negative Table Count: "
print tablesad.count()


Postive Table Count: 
106
Negative Table Count: 
101


#### Use Tokenizer to break up the sentences into indiviudals words

In [251]:
tokenizerPos = Tokenizer(inputCol="tweet", outputCol="tweetwords")
tokenizedPos = tokenizerPos.transform(tablepos)

dfPos = tokenizedPos.select("tweet", "tweetwords").withColumn("tokens", countTokens(col("tweetwords")))

showDF(dfPos)

tokenizerSad = Tokenizer(inputCol="tweet", outputCol="tweetwords")
tokenizedSad = tokenizerSad.transform(tablesad)

dfSad = tokenizedSad.select("tweet", "tweetwords").withColumn("tokens", countTokens(col("tweetwords")))

showDF(dfSad)

Unnamed: 0,tweet,tweetwords,tokens
0,So I saw MamaMia2 today and omfg I was not pre...,"[so, i, saw, mamamia2, today, and, omfg, i, wa...",27
1,DanteHarker: Off to see MamaMia2 have you see...,"[, danteharker:, off, to, see, mamamia2, have,...",14
2,anyway it's time for a poll regarding the Mama...,"[anyway, it's, time, for, a, poll, regarding, ...",19
3,Box Office: 'Equalizer 2' And 'Mamma Mia 2' Ar...,"[box, office:, 'equalizer, 2', and, 'mamma, mi...",17
4,Andante Andante Lily James MamaMia2,"[andante, andante, , lily, james, , mamamia2]",7
5,I honestly only wish to watch MamaMia2 because...,"[i, honestly, only, wish, to, watch, mamamia2,...",21
6,I feel like my cher impression from MamaMia2 h...,"[i, feel, like, my, cher, impression, from, ma...",18
7,MamaMia2 Great movie feel good factor It is th...,"[mamamia2, great, movie, feel, good, factor, i...",10
8,Sometimes you want life to be as colorful and...,"[sometimes, you, want, life, to, be, as, color...",13
9,Such a beautiful movie that met my expectative...,"[such, a, beautiful, movie, that, met, my, exp...",22


Unnamed: 0,tweet,tweetwords,tokens
0,DanteHarker: Off to see MamaMia2 have you see...,"[, danteharker:, off, to, see, mamamia2, have,...",14
1,anyway it's time for a poll regarding the Mama...,"[anyway, it's, time, for, a, poll, regarding, ...",19
2,Box Office: 'Equalizer 2' And 'Mamma Mia 2' Ar...,"[box, office:, 'equalizer, 2', and, 'mamma, mi...",17
3,Andante Andante Lily James MamaMia2,"[andante, andante, , lily, james, , mamamia2]",7
4,I honestly only wish to watch MamaMia2 because...,"[i, honestly, only, wish, to, watch, mamamia2,...",21
5,I feel like my cher impression from MamaMia2 h...,"[i, feel, like, my, cher, impression, from, ma...",18
6,MamaMia2 Great movie feel good factor It is th...,"[mamamia2, great, movie, feel, good, factor, i...",10
7,Sometimes you want life to be as colorful and...,"[sometimes, you, want, life, to, be, as, color...",13
8,Such a beautiful movie that met my expectative...,"[such, a, beautiful, movie, that, met, my, exp...",22
9,I’m definitely watching MamaMia2 again,"[i’m, definitely, watching, mamamia2, again]",5


#### Using StopWordsRemover to remove all stop words. Interesting to see, people don't use many stop words with twitter!

In [252]:
removerPos = StopWordsRemover(inputCol="tweetwords", outputCol="tweetnostopwords")
removedPos = removerPos.transform(dfPos)

dfPosStop = removedPos.select("tweet", "tweetwords", "tweetnostopwords").withColumn("tokens", countTokens(col("tweetwords"))).withColumn("notokens", countTokens(col("tweetnostopwords")))

showDF(dfPosStop)

removerSad = StopWordsRemover(inputCol="tweetwords", outputCol="tweetnostopwords")
removedSad = removerSad.transform(dfSad)

dfSadStop = removedSad.select("tweet", "tweetwords", "tweetnostopwords").withColumn("tokens", countTokens(col("tweetwords"))).withColumn("notokens", countTokens(col("tweetnostopwords")))

showDF(dfSadStop)

Unnamed: 0,tweet,tweetwords,tweetnostopwords,tokens,notokens
0,So I saw MamaMia2 today and omfg I was not pre...,"[so, i, saw, mamamia2, today, and, omfg, i, wa...","[saw, mamamia2, today, omfg, prepared, feels, ...",27,14
1,DanteHarker: Off to see MamaMia2 have you see...,"[, danteharker:, off, to, see, mamamia2, have,...","[, danteharker:, see, mamamia2, seen, think]",14,6
2,anyway it's time for a poll regarding the Mama...,"[anyway, it's, time, for, a, poll, regarding, ...","[anyway, time, poll, regarding, mamamia2, igno...",19,11
3,Box Office: 'Equalizer 2' And 'Mamma Mia 2' Ar...,"[box, office:, 'equalizer, 2', and, 'mamma, mi...","[box, office:, 'equalizer, 2', 'mamma, mia, 2'...",17,12
4,Andante Andante Lily James MamaMia2,"[andante, andante, , lily, james, , mamamia2]","[andante, andante, , lily, james, , mamamia2]",7,7
5,I honestly only wish to watch MamaMia2 because...,"[i, honestly, only, wish, to, watch, mamamia2,...","[honestly, wish, watch, mamamia2, cher, goodne...",21,10
6,I feel like my cher impression from MamaMia2 h...,"[i, feel, like, my, cher, impression, from, ma...","[feel, like, cher, impression, mamamia2, got, ...",18,11
7,MamaMia2 Great movie feel good factor It is th...,"[mamamia2, great, movie, feel, good, factor, i...","[mamamia2, great, movie, feel, good, factor, h...",10,7
8,Sometimes you want life to be as colorful and...,"[sometimes, you, want, life, to, be, as, color...","[sometimes, want, life, colorful, , cheesy, ma...",13,7
9,Such a beautiful movie that met my expectative...,"[such, a, beautiful, movie, that, met, my, exp...","[beautiful, movie, met, expectatives, much, cr...",22,11


Unnamed: 0,tweet,tweetwords,tweetnostopwords,tokens,notokens
0,DanteHarker: Off to see MamaMia2 have you see...,"[, danteharker:, off, to, see, mamamia2, have,...","[, danteharker:, see, mamamia2, seen, think]",14,6
1,anyway it's time for a poll regarding the Mama...,"[anyway, it's, time, for, a, poll, regarding, ...","[anyway, time, poll, regarding, mamamia2, igno...",19,11
2,Box Office: 'Equalizer 2' And 'Mamma Mia 2' Ar...,"[box, office:, 'equalizer, 2', and, 'mamma, mi...","[box, office:, 'equalizer, 2', 'mamma, mia, 2'...",17,12
3,Andante Andante Lily James MamaMia2,"[andante, andante, , lily, james, , mamamia2]","[andante, andante, , lily, james, , mamamia2]",7,7
4,I honestly only wish to watch MamaMia2 because...,"[i, honestly, only, wish, to, watch, mamamia2,...","[honestly, wish, watch, mamamia2, cher, goodne...",21,10
5,I feel like my cher impression from MamaMia2 h...,"[i, feel, like, my, cher, impression, from, ma...","[feel, like, cher, impression, mamamia2, got, ...",18,11
6,MamaMia2 Great movie feel good factor It is th...,"[mamamia2, great, movie, feel, good, factor, i...","[mamamia2, great, movie, feel, good, factor, h...",10,7
7,Sometimes you want life to be as colorful and...,"[sometimes, you, want, life, to, be, as, color...","[sometimes, want, life, colorful, , cheesy, ma...",13,7
8,Such a beautiful movie that met my expectative...,"[such, a, beautiful, movie, that, met, my, exp...","[beautiful, movie, met, expectatives, much, cr...",22,11
9,I’m definitely watching MamaMia2 again,"[i’m, definitely, watching, mamamia2, again]","[i’m, definitely, watching, mamamia2]",5,4


### Sentiment Analysis using Python package Pattern

#### Convert each Spark Dataframe to a Pandas Dataframe. From there loop over each row and get the sentiment score (anything + is postive and anything - or 0 is negative). The "positive" function will return true if the tweet is postive. For more info on how the scores are calcuated: https://www.clips.uantwerpen.be/pages/pattern-en#sentiment

#### Negative Tweets

In [253]:
pandaSad = dfSadStop.toPandas()
movieScoreSad = 0
countSad = 0

for index, row in pandaSad.iterrows():
    print row['tweet']
    print sentiment(row["tweetnostopwords"])
    print positive(row["tweetnostopwords"])
    if positive(row["tweetnostopwords"]):
        print "This is a negative tweet! Analysis is wrong :("
        countSad = countSad + 1
    scoreSad = sentiment(row['tweetnostopwords'])[0]
    movieScoreSad = scoreSad + movieScoreSad

 DanteHarker: Off to see MamaMia2 have you seen it What did you think
(0.0, 0.0)
False
anyway it's time for a poll regarding the MamaMia2 ignoring who they become in adulthood entirely which young d…
(0.1, 0.4)
True
This is a negative tweet! Analysis is wrong :(
Box Office: 'Equalizer 2' And 'Mamma Mia 2' Are Both Winners This Weekend - Forbes Now via…
(0.0, 0.0)
False
Andante Andante  Lily James  MamaMia2
(0.0, 0.0)
False
I honestly only wish to watch MamaMia2 because of Cher Because my goodness- i cringed through the first one and d…
(0.425, 0.6166666666666667)
True
This is a negative tweet! Analysis is wrong :(
I feel like my cher impression from MamaMia2 hasn't got enough praise 🤣 What do you think sticktothedayjob
(0.0, 0.5)
False
MamaMia2 Great movie feel good factor It is the holidays
(0.75, 0.675)
True
This is a negative tweet! Analysis is wrong :(
Sometimes you want life to be as colorful  and cheesy as MamaMia2
(-0.1, 0.7)
False
Such a beautiful movie that met my expectatives

#### Positive Tweet
#### Also adding up all the sentiment scores of all the tweets

In [254]:
pandaPos = dfPosStop.toPandas()
movieScore = 0
countPos = 0
for index, row in pandaPos.iterrows():
    print row['tweet']
    print sentiment(row["tweetnostopwords"])
    print positive(row["tweetnostopwords"])
    if not positive(row["tweetnostopwords"]):
        print "This is a postive tweet! Analysis is wrong :("
        countPos = countPos + 1
    score = sentiment(row['tweetnostopwords'])[0]
    movieScore = score + movieScore



If you want to see a movie that just makes you smile watch MamaMia2Honestly it's so good Just an all around fee…
(0.5, 0.35000000000000003)
True
 samanthabaines: I feel like my cher impression from MamaMia2 hasn't got enough praise 🤣 What do you think sticktothedayjob
(0.0, 0.5)
False
This is a postive tweet! Analysis is wrong :(
🤗date night tonight with my bestest ever sonmamamia2  nd yes i did have to bribe him with goodies
(0.0, 0.0)
False
This is a postive tweet! Analysis is wrong :(
Saw MamaMia2 within 2 days of release while it took me 2weeks to go see Black Panther To make myself feel better…
(-0.16666666666666666, 0.43333333333333335)
False
This is a postive tweet! Analysis is wrong :(
Cher singing Fernando Awful Absolutely terrible- but in any case I am not a fan by any means just a pity…
(-0.4, 0.95)
False
This is a postive tweet! Analysis is wrong :(
Me jamming to dancing queen MamaMia2
(0.0, 0.0)
False
This is a postive tweet! Analysis is wrong :(
VodafoneUK ODEONCinemas I 

### Alright! Should I see this movie???

In [255]:
posrating = movieScore/dfPos.count()

print "Postive Rating Average Score: " 
print posrating
print "Number of Tweets Twitter Scored Wrong:"
print countPos 
if dfSad.count() != 0:
    sadrating = movieScoreSad/dfSad.count()
else: 
    sadrating = 0
print "Negative Rating Average Score:"
print sadrating
print "Number of Tweets Twitter Scored Wrong:"
print countSad


if posrating > abs(sadrating):
    print "People like this movie!"
elif posrating == abs(sadrating):
    print "People are split on this movie! Take a risk!"
elif posrating < abs(sadrating):
    print "People do not like this movie!"


Postive Rating Average Score: 
0.208630127079
Number of Tweets Twitter Scored Wrong:
52
Negative Rating Average Score:
0.193669179418
Number of Tweets Twitter Scored Wrong:
48
People like this movie!
