# Predicting Popularity: Using Text and Content Analysis to Examine Shared Characteristics of Popular Posts on Twitter

### A CS109 Final Project by Belinda Zeng, Roseanne Feng, Yuqi Hou, and Zahra Mahmood

![caption](https://studentshare.net/content/wp-content/uploads/2015/05/53a0e7d640b31_-_unknown-3-51047042.png)

## Background and Motivation

Twitter (https://twitter.com) is a social network, real-time news media service, and micro-blogging service where users can use text, photos, and videos to express moments or ideas in 140 characters or fewer. These 140-character messages are called "tweets." According to Twitter’s website, millions of tweets are shared in real time every day. Registered users can read and post tweets, favorite or retweet other people’s tweets, and follow other accounts. Unregistered users can read tweets from public accounts.

In today's day and age of Twitter, popularity is measured in hearts, retweets, follows, and follow-backs. What posts get popular over time? What seems to resonate most with people? Do positive or negative sentiments invite more engagement? In this project, we use Twitter's publicly available archive of content to examine some of the shared characteristics of popular posts, including length of post, visual content, positivity, and negativity.

## Related Work

Our idea came from a desire to understand how movements such as #BlackLivesMatter and #Ferguson begin on Twitter as well as a general desire to know what makes a post popular. We chose to focus on Tweets on an individual level and to use natural language processing to be able to understand and predict what makes posts popular.

One paper that is related to our work is a paper from Cornell titled [The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter](https://chenhaot.com/pages/wording-for-propagation.html), which compared pairs of tweets containing the same URL and written by the same user but employing different wording to see which version attracted more retweets. Twitter itself has published research on [What fuels a Tweet’s engagement?](https://blog.twitter.com/2014/what-fuels-a-tweets-engagement) Their research found that adding video, links, and photos all result in an increase in the number of retweets, and it even breaks down those results by industry. Inspired by previous research, we sought to include sentiment analysis in our understanding of what makes a Tweet popular.

## Initial Questions

1. How does the distribution of retweets and hearts vary for a post depending on the time of day when the tweet is created?
2. How do positive and negative sentiment affect popularity?
3. What Tweets do we think will become popular?

## Data

This data is publicly available via the Twitter Static API, which answers queries based on specific parameters. We limited the data set to tweets within a specified period of time. We are storing the data in CSV files for now. To reduce file sizes, we will try to split the data across multiple CSVs so that we don't load too much data into memory. If the data exceeds computer memory, we will consider AWS/SQL database alternatives.
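If the CSVs do grow too large to load at once, one option short of moving to a database is to read them in chunks. The following is a minimal sketch, assuming the `chunksize` argument of pandas' `read_csv`; the in-memory CSV, the two column names, and the `count_english_rows` helper are illustrative stand-ins rather than part of our pipeline.

```python
import io
import pandas as pd

def count_english_rows(csv_source, chunksize=2):
    """Count rows whose 'lang' column is 'en' without loading the whole file."""
    total = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(csv_source, names=["text", "lang"], chunksize=chunksize):
        total += int((chunk["lang"] == "en").sum())
    return total

# Toy data standing in for a large scraped file (hypothetical rows).
toy_csv = io.StringIO("hello,en\nhola,es\nhi there,en\nbonjour,fr\nhey,en\n")
english_rows = count_english_rows(toy_csv)
print(english_rows)
```

Only one chunk is held in memory at a time, so the same loop would work on a file path in place of the `StringIO` object.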

In [18]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

import csv

import ast

from collections import Counter

from scipy.stats import pearsonr

### Scraping

Set up OAuth and an app on Twitter (to get the consumer key and secret and the access token and secret).

In [None]:
# great resource where I got all this 
# http://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/

import tweepy
import json
from tweepy import OAuthHandler


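# Read the credentials from environment variables instead of hard-coding them in the notebook.
# (The variable names below are placeholders, not a Twitter or tweepy convention.)
import os
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']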
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

Our initial approach is to create a random sample that consists of 1% of tweets. This involves using tweepy and the sample call from the Twitter API.

```python
# final, final version 

from tweepy import Stream
from tweepy.streaming import StreamListener

# get retweet status
def try_retweet(status, attribute):
    try:
        if getattr(status, attribute):
            return True
    except AttributeError:
        return None

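# function that tries to get a string attribute from an object and UTF-8 encode it;
# non-string attributes (ints, bools) have no .encode, so they also come back as None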
def try_get(status, attribute):
    try:
        return getattr(status, attribute).encode('utf-8')
    except AttributeError:
        return None

# open csv file
csvFile = open('smallsample.csv', 'a')

# create csv writer
csvWriter = csv.writer(csvFile)

class MyListener(StreamListener):
    
    def on_status(self, status):
        try:
            # save relevant components of the tweet
            
            # get and sanitize hashtags 
            hashtags = status.entities['hashtags']
            hashtag_list = []
            for el in hashtags:
                hashtag_list.append(el['text'])
            hashtag_count = len(hashtag_list)
            
            # get and sanitize urls
            urls = status.entities['urls']
            url_list = []
            for el in urls:
                url_list.append(el['url'])
            url_count = len(url_list)
            
            # get and sanitize user_mentions
            user_mentions = status.entities['user_mentions']
            mention_list = []
            for el in user_mentions:
                mention_list.append(el['screen_name'])
            mention_count = len(mention_list)
            # save it all as a tweet
            tweet = [status.created_at, status.text.encode('utf-8'), status.place, status.lang, status.coordinates, 
              hashtag_list, url_list, mention_list, 
              hashtag_count, url_count, mention_count, 
              try_get(status, 'possibly_sensitive'),
              status.favorite_count, status.favorited, status.retweet_count, status.retweeted, 
              try_retweet(status,'retweeted_status'), 
              try_get(status.user, 'statuses_count'), 
              try_get(status.user, 'favourites_count'), 
              try_get(status.user, 'followers_count'),
              try_get(status.user, 'description'),
              try_get(status.user, 'location')]
            
            # write to csv
            csvWriter.writerow(tweet)
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True
    
    # tell us if there's an error
    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.sample()
```

From this point on, analysis will be done on previously scraped tweets, and there is no need to run the above code block.

In [2]:
tweetdf_small=pd.read_csv("tempdata/smallsample.csv", names=["created_at", "text", "place", "lang", "coordinates",
                                       "hashtags", "urls", "user_mentions", 
                                       "hashtag_count", "url_count", "mention_count",
                                       "possibly_sensitive", 
                                       "favorite_count", "favorited", "retweet_count", "retweeted",
                                       "retweeted_status", "user_statuses_count", "user_favorites_count",
                                       "user_follower_count", "user_description", "user_location"])
tweetdf_small.head(10)

Unnamed: 0,created_at,text,place,lang,coordinates,hashtags,urls,user_mentions,hashtag_count,url_count,mention_count,possibly_sensitive,favorite_count,favorited,retweet_count,retweeted,retweeted_status,user_statuses_count,user_favorites_count,user_follower_count,user_description,user_location
0,2015-11-24 21:24:41,RT @AcWgst: Reminds me of the fairytail~♡ Matt...,,en,,[],[],[u'AcWgst'],0,0,1,,0,False,0,False,True,,,,"I love music, art, history, current events, WD...",California
1,2015-11-24 21:24:41,ドラゴンスラッシュってスマホゲーらしいのだけれどスクリーンショットだけを見てるとほんとヴァニ...,,ja,,[],[],[],0,0,0,,0,False,0,False,,,,,,県北
2,2015-11-24 21:24:41,やられたでや\n石神さん状態w,,ja,,[],[],[],0,0,0,,0,False,0,False,,,,,味楽一条通り店。時々有明 故に我はエロティカセブン,土田 みさと
3,2015-11-24 21:24:41,#UKVOTY1D #MTVStars One Direction (MARO) Nicki...,,sl,,"[u'UKVOTY1D', u'MTVStars']",[],[],2,0,0,,0,False,0,False,,,,,,
4,2015-11-24 21:24:41,"RT @WhylmSingle: I WILL LOOK FOR YOU, I WILL F...",,en,,[],[],[u'WhylmSingle'],0,0,1,,0,False,0,False,True,,,,"Electronic music producer/vocalist, PR/AR mana...","Miami, FL"
5,2015-11-24 21:24:41,RT @higeorgeshelley: If u guys vote enough ton...,,en,,[],[],[u'higeorgeshelley'],0,0,1,,0,False,0,False,True,,,,I wanna runaway...,my own little world
6,2015-11-24 21:24:41,I Been Sleeping On @gilliedakid He 🔥,,en,,[],[],[u'gilliedakid'],0,0,1,,0,False,0,False,,,,,FMOI @quicktriggap Blessed Basketball Bigman S...,Philly
7,2015-11-24 21:24:41,RT @awwmyloueh: @izaynie93 I'm proud of you!,,en,,[],[],"[u'awwmyloueh', u'izaynie93']",0,0,2,,0,False,0,False,True,,,,sometimes I add too little milk to my coffee a...,zquad ◡̈
8,2015-11-24 21:24:41,RT @CaronPeirson: @CleanDropMobile Let's clink...,,en,,[u'DigiBlogChat'],[],"[u'CaronPeirson', u'CleanDropMobile']",1,0,2,,0,False,0,False,True,,,,CleanDrop a mobile app for #foodies who deman...,USA
9,2015-11-24 21:24:41,RT @JosAntonioNez: Dentro de 1 mes todo el mun...,,es,,[],[],[u'JosAntonioNez'],0,0,1,,0,False,0,False,True,,,,Del 97. Hermana de la Macarena. Auxiliar de En...,


In [3]:
tweetdf_small.shape

(52766, 22)

As we can see, however, the retweet count and favorite count are always 0. This is because we're using the live streaming API, so we're scraping tweets as they are posted. At this point, every tweet has a retweet count and favorite count of 0 because it was literally just posted! That is, unless the tweet posted is actually a retweet...

In [4]:
# just found this bug with retweet_count, looking into why this might be the case
tweetdf_missing = tweetdf_small[tweetdf_small['retweet_count'] != 0]

In [5]:
tweetdf_missing.shape

(0, 22)

#### Getting original retweets

The following function updates the way we use the tweepy streaming API. We first detect if the tweet we're looking at is actually a retweet of something. If so, we then get the original tweet and save that to our csv.

```python
# only save information for retweets

from tweepy import Stream
from tweepy.streaming import StreamListener

# get retweet status
def try_retweet(status, attribute):
    try:
        if getattr(status, attribute):
            return True
    except AttributeError:
        return None

# get country status
def try_country(status, attribute):
    place = getattr(status, attribute)
    if place is not None:
        return place.country
    return None

# get city status
def try_city(status, attribute):
    place = getattr(status, attribute)
    if place is not None:
        return place.full_name
    return None

# function that tries to get attribute from object
def try_get(status, attribute):
    try:
        return getattr(status, attribute).encode('utf-8')
    except AttributeError:
        return None

# open csv file
csvFile = open('originalsample.csv', 'a')

# create csv writer
csvWriter = csv.writer(csvFile)

class MyListener(StreamListener):
    
    def on_status(self, status):
        try:
            # if this represents a retweet
            if try_retweet(status,'retweeted_status'):
                status = status.retweeted_status
                
                # get and sanitize hashtags 
                hashtags = status.entities['hashtags']
                hashtag_list = []
                for el in hashtags:
                    hashtag_list.append(el['text'])
                hashtag_count = len(hashtag_list)

                # get and sanitize urls
                urls = status.entities['urls']
                url_list = []
                for el in urls:
                    url_list.append(el['url'])
                url_count = len(url_list)

                # get and sanitize user_mentions
                user_mentions = status.entities['user_mentions']
                mention_list = []
                for el in user_mentions:
                    mention_list.append(el['screen_name'])
                mention_count = len(mention_list)
                
                # save it all as a tweet
                tweet = [status.id, status.created_at, try_country(status, 'place'), try_city(status, 'place'), status.text.encode('utf-8'), status.lang,
                  hashtag_list, url_list, mention_list, 
                  hashtag_count, url_count, mention_count, 
                  try_get(status, 'possibly_sensitive'),
                  status.favorite_count, status.favorited, status.retweet_count, status.retweeted, 
                  status.user.statuses_count, 
                  status.user.favourites_count, 
                  status.user.followers_count,
                  try_get(status.user, 'description'),
                  try_get(status.user, 'location'),
                  try_get(status.user, 'time_zone')]
            
                # write to csv
                csvWriter.writerow(tweet)
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True
    
    # tell us if there's an error
    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.sample()
```
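The retweet check in the listener can be sketched in isolation. This is a minimal illustration that uses `types.SimpleNamespace` objects as stand-ins for tweepy `Status` objects (an assumption made so the snippet runs without API access); `original_of` is a hypothetical helper name.

```python
from types import SimpleNamespace

def original_of(status):
    """Return the original tweet if `status` is a retweet, else None."""
    # Retweets carry the original tweet in `retweeted_status`;
    # plain tweets lack the attribute, so getattr falls back to None.
    return getattr(status, "retweeted_status", None)

# Stand-in objects (not real tweepy Status instances).
original = SimpleNamespace(id=1, text="original tweet", retweet_count=288)
retweet = SimpleNamespace(id=2, text="RT original tweet", retweet_count=0,
                          retweeted_status=original)

# Keep only the originals reached through a retweet, as the listener does.
kept = [original_of(s) for s in (retweet, original) if original_of(s) is not None]
print([s.id for s in kept])
```

Because the original is only reached through one of its retweets, the same tweet can be saved more than once, which is why the ids are de-duplicated later.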

Now we read into pandas.

In [6]:
tweetdf=pd.read_csv("tempdata/originalsample.csv", names=["id", "created_at", "country", "city", "text", "lang",
                                       "hashtags", "urls", "user_mentions", 
                                       "hashtag_count", "url_count", "mention_count",
                                       "possibly_sensitive", 
                                       "favorite_count", "favorited", "retweet_count", "retweeted",
                                       "user_statuses_count", "user_favorites_count",
                                       "user_follower_count", "user_description", "user_location", "user_timezone"])
tweetdf.head(15)

Unnamed: 0,id,created_at,country,city,text,lang,hashtags,urls,user_mentions,hashtag_count,url_count,mention_count,possibly_sensitive,favorite_count,favorited,retweet_count,retweeted,user_statuses_count,user_favorites_count,user_follower_count,user_description,user_location,user_timezone
0,669227044996124673,2015-11-24 18:52:15,,,Yo 💁🏼💟👌🏼 https://t.co/xLMaOl9QD4,und,[],[],[],0,0,0,,270,False,288,False,10726,18927,24429,,"Yucatán, México",Mexico City
1,669328402453626880,2015-11-25 01:35:01,,,読者が生産者に会いに行く!『北海道食べる通信』初の読者ツアー開催 – 北海道ファンマガジン ...,ja,[],[u'https://t.co/w4GkSYLhoz'],[],0,1,0,,1,False,1,False,10176,241,1783,手稲駅南口直結徒歩１分　ハートビル法認定バリアフリーホテル　札幌市福祉のまちづくり条例適合ホ...,北海道札幌市手稲区手稲本町1条4丁目1番５号,Sapporo
2,669335707505201152,2015-11-25 02:04:02,,,"Not 1 shot, not 2 but 16. 16 tax payer purchas...",en,"[u'LaquanMcDonald', u'sickofthesehashtags']",[],[],2,0,0,,25,False,23,False,15349,1590,69865,"Sideways Slipper, ALEKESAM",The Universe,Pacific Time (US & Canada)
3,668550084976578560,2015-11-22 22:02:15,,,#comeeeheree @DooleyFunnyAf !! Had to get yo a...,en,[u'comeeeheree'],[],[u'DooleyFunnyAf'],1,0,1,,24,False,28,False,11124,15210,4085,$quad Original | https://soundcloud.com/vanteb...,,Eastern Time (US & Canada)
4,669304504513380352,2015-11-25 00:00:03,,,Best #Thanksgiving memory? @KnucklePuckIL shar...,en,[u'Thanksgiving'],[u'https://t.co/IYC1jEOTeC'],[u'KnucklePuckIL'],1,1,1,,158,False,33,False,52246,4509,526915,"The nation’s leading voice on underground, alt...","Cleveland, Ohio",Eastern Time (US & Canada)
5,669178757417009152,2015-11-24 15:40:23,,,Muhammad kenapa handsome sangat 😍😂 https://t...,in,[],[],[],0,0,0,,91,False,173,False,58643,8317,733,spread positivity ✨,MY,Kuala Lumpur
6,669348536081850369,2015-11-25 02:55:01,,,rt for 5sos #MTVStars 5 Seconds of Summer,en,[u'MTVStars'],[],[],1,0,0,,3,False,21,False,9816,3,3299,appreciating Michael mostly *ଘ(੭*ˊᵕˋ)੭* ੈ✩‧₊˚ ...,(Liv • Blain• Ant),Eastern Time (US & Canada)
7,669335559425417216,2015-11-25 02:03:27,,,"""Why didn't you do your homework over the holi...",en,[],[],[],0,0,0,,283,False,218,False,53293,341,476765,just stahp. \r\n\r\nyou probably just ignore m...,,Eastern Time (US & Canada)
8,669339114974478336,2015-11-25 02:17:35,,,I think one of my favorite feelings is laughin...,en,[],[],[],0,0,0,,5,False,1,False,10140,6619,953,"⊱✿LIVE. .Like there's no midnight! ✿⊰ idfwu, t...",,
9,669348858418192384,2015-11-25 02:56:18,,,하루치하\r\n같이 팝시다\r\n#프로듀서_트친소 https://t.co/0ARAt...,ko,[u'\ud504\ub85c\ub4c0\uc11c_\ud2b8\uce5c\uc18c'],[],[],1,0,0,,0,False,7,False,22768,1295,159,"아이마스와 하루치하 파는 잉여 유하치 / 쿄애니, 중력폭포, 디즈니 픽사도 파요!!...",,Pacific Time (US & Canada)


In [7]:
tweetdf.shape

(114509, 23)

## Data wrangling

#### Filter for language

In [8]:
df_filtered = tweetdf[tweetdf['lang'] == 'en']

In [9]:
df_filtered.shape

(57079, 23)

#### Filter for unique tweet ids

In [10]:
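# Note: drop_duplicates returns a new DataFrame and the result is not assigned here,
# so df_filtered itself is unchanged (the shape below is still 57079 rows).
# In newer pandas versions, take_last=True is spelled keep='last'.
df_filtered.drop_duplicates(subset='id', take_last=True)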

Unnamed: 0,id,created_at,country,city,text,lang,hashtags,urls,user_mentions,hashtag_count,url_count,mention_count,possibly_sensitive,favorite_count,favorited,retweet_count,retweeted,user_statuses_count,user_favorites_count,user_follower_count,user_description,user_location,user_timezone
2,669335707505201152,2015-11-25 02:04:02,,,"Not 1 shot, not 2 but 16. 16 tax payer purchas...",en,"[u'LaquanMcDonald', u'sickofthesehashtags']",[],[],2,0,0,,25,False,23,False,15349,1590,69865,"Sideways Slipper, ALEKESAM",The Universe,Pacific Time (US & Canada)
3,668550084976578560,2015-11-22 22:02:15,,,#comeeeheree @DooleyFunnyAf !! Had to get yo a...,en,[u'comeeeheree'],[],[u'DooleyFunnyAf'],1,0,1,,24,False,28,False,11124,15210,4085,$quad Original | https://soundcloud.com/vanteb...,,Eastern Time (US & Canada)
4,669304504513380352,2015-11-25 00:00:03,,,Best #Thanksgiving memory? @KnucklePuckIL shar...,en,[u'Thanksgiving'],[u'https://t.co/IYC1jEOTeC'],[u'KnucklePuckIL'],1,1,1,,158,False,33,False,52246,4509,526915,"The nation’s leading voice on underground, alt...","Cleveland, Ohio",Eastern Time (US & Canada)
6,669348536081850369,2015-11-25 02:55:01,,,rt for 5sos #MTVStars 5 Seconds of Summer,en,[u'MTVStars'],[],[],1,0,0,,3,False,21,False,9816,3,3299,appreciating Michael mostly *ଘ(੭*ˊᵕˋ)੭* ੈ✩‧₊˚ ...,(Liv • Blain• Ant),Eastern Time (US & Canada)
8,669339114974478336,2015-11-25 02:17:35,,,I think one of my favorite feelings is laughin...,en,[],[],[],0,0,0,,5,False,1,False,10140,6619,953,"⊱✿LIVE. .Like there's no midnight! ✿⊰ idfwu, t...",,
11,668717375622021120,2015-11-23 09:07:01,,,Me: he told me to calm down\r\n\r\n911: ma'am ...,en,[],[],[],0,0,0,,118,False,76,False,63011,76098,10895,wife of 1 mother of 6 https://twitter.com/sear...,,Pacific Time (US & Canada)
17,669346377789341696,2015-11-25 02:46:26,Canada,"Toronto, Ontario",finding the perfect prom dress is so fucking d...,en,[],[],[],0,0,0,,0,False,1,False,5351,3782,222,a freak and a friend too // love u @arnaudm_19...,"Toronto, Ontario",
31,667168377744416768,2015-11-19 02:31:51,,,Holy shit I never noticed that https://t.co/ZP...,en,[],[],[],0,0,0,,570,False,5764,False,25913,18275,2026,//O2L forever// I'll never forget you ~ B.B //,4/5 +more,
32,669010936842362880,2015-11-24 04:33:31,,,That bomb bomb lol 😛💦💦💦,en,[],[],[],0,0,0,,2,False,1,False,23632,1677,664,,,Quito
41,668977048103223296,2015-11-24 02:18:51,,,.@OHSPatsFootball gets rematch with @Maryville...,en,[],"[u'https://t.co/3fRuw5qRYN', u'https://t.co/ak...","[u'OHSPatsFootball', u'MaryvilleHigh']",0,2,2,,6,False,8,False,105016,8179,9488,High school sports writer at The (Murfreesboro...,"Murfreesboro, Tennessee",Central Time (US & Canada)


In [11]:
df_filtered.shape

(57079, 23)
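The unchanged shape above suggests the de-duplicated frame was displayed but never stored. A minimal sketch of keeping the result, on a toy frame with made-up ids, assuming a pandas version where `drop_duplicates` accepts `keep='last'`:

```python
import pandas as pd

# Toy frame with one repeated id (values are made up for illustration).
toy = pd.DataFrame({"id": [1, 2, 2, 3],
                    "retweet_count": [5, 7, 9, 1]})

# drop_duplicates returns a new DataFrame, so the result must be assigned.
deduped = toy.drop_duplicates(subset="id", keep="last")

print(toy.shape, deduped.shape)
```

Keeping the last occurrence retains the most recently scraped counts for a tweet that was seen several times.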

#### Popularity Score

In [12]:
popularity = [retweets + favs for retweets, favs in zip(df_filtered.retweet_count, df_filtered.favorite_count)]

#### Add popularity column

In [15]:
# df_filtered['popularity']=popularity
df_filtered.loc[:,'popularity']=popularity

In [16]:
df_filtered.shape

(57079, 24)

In [17]:
dftouse = df_filtered.reset_index()
dftouse.head()

Unnamed: 0,index,id,created_at,country,city,text,lang,hashtags,urls,user_mentions,hashtag_count,url_count,mention_count,possibly_sensitive,favorite_count,favorited,retweet_count,retweeted,user_statuses_count,user_favorites_count,user_follower_count,user_description,user_location,user_timezone,popularity
0,2,669335707505201152,2015-11-25 02:04:02,,,"Not 1 shot, not 2 but 16. 16 tax payer purchas...",en,"[u'LaquanMcDonald', u'sickofthesehashtags']",[],[],2,0,0,,25,False,23,False,15349,1590,69865,"Sideways Slipper, ALEKESAM",The Universe,Pacific Time (US & Canada),48
1,3,668550084976578560,2015-11-22 22:02:15,,,#comeeeheree @DooleyFunnyAf !! Had to get yo a...,en,[u'comeeeheree'],[],[u'DooleyFunnyAf'],1,0,1,,24,False,28,False,11124,15210,4085,$quad Original | https://soundcloud.com/vanteb...,,Eastern Time (US & Canada),52
2,4,669304504513380352,2015-11-25 00:00:03,,,Best #Thanksgiving memory? @KnucklePuckIL shar...,en,[u'Thanksgiving'],[u'https://t.co/IYC1jEOTeC'],[u'KnucklePuckIL'],1,1,1,,158,False,33,False,52246,4509,526915,"The nation’s leading voice on underground, alt...","Cleveland, Ohio",Eastern Time (US & Canada),191
3,6,669348536081850369,2015-11-25 02:55:01,,,rt for 5sos #MTVStars 5 Seconds of Summer,en,[u'MTVStars'],[],[],1,0,0,,3,False,21,False,9816,3,3299,appreciating Michael mostly *ଘ(੭*ˊᵕˋ)੭* ੈ✩‧₊˚ ...,(Liv • Blain• Ant),Eastern Time (US & Canada),24
4,7,669335559425417216,2015-11-25 02:03:27,,,"""Why didn't you do your homework over the holi...",en,[],[],[],0,0,0,,283,False,218,False,53293,341,476765,just stahp. \r\n\r\nyou probably just ignore m...,,Eastern Time (US & Canada),501


## Exploratory Analysis

After scraping the tweets from the Twitter API, we can use that data to build a feature list that we use to predict how popular an individual tweet is, measured by a composite score based on the number of retweets and hearts. We will also use metadata to help us analyze trends in the data, for example, whether there is a correlation between time of day and retweets.

### Update: 11/30 - 12/1 (Yuqi)

Initial exploratory analysis regarding popularity score and hashtags done. It seems like we should rethink our current formula for popularity because the histogram gives extremely strange results and the max score is really high. Need to look into why that might be.

All of the correlations that were done between popularity score and other factors came up significant. Could this be due to the large dataset that we are using? Should we be worried about results being labeled as significant not because they actually are meaningful but because there is so much data that small variations become significant?
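One way to see why this worry is reasonable: under the usual test for a Pearson correlation, the test statistic `t = r * sqrt((n - 2) / (1 - r^2))` grows with the square root of the sample size, so a fixed, tiny correlation eventually crosses any significance threshold. A small sketch with a hypothetical correlation of 0.01 (not a value from our data):

```python
import math

def pearson_t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r differs from 0."""
    return r * math.sqrt((n - 2) / (1.0 - r ** 2))

r = 0.01  # hypothetical, very weak correlation
for n in (100, 57079):
    t = pearson_t_statistic(r, n)
    # |t| above roughly 1.96 corresponds to p < 0.05 (two-sided) for large n.
    print(n, round(t, 3), abs(t) > 1.96)
```

With the 57,079 English tweets in our filtered sample, even r = 0.01 would clear the conventional threshold, so the size of each correlation deserves more attention than its p-value.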

Also, noticed that some tweets are longer than 140 characters, and I'm not sure why that is either. Further data wrangling probably needed. 

### Popularity Score Analysis

#### Rethink how popularity is scored?
Huge standard deviation and extreme ranges suggest that we may need to rethink how we score popularity...

In [None]:
dftouse['popularity'].describe()

In [None]:
plt.hist(dftouse['popularity'],bins=100)
plt.title("Distribution of Popularity")
plt.show()

In [None]:
plt.hist(dftouse['retweet_count'],bins=100)
plt.title("Distribution of Retweet Counts")
plt.show()

In [None]:
plt.hist(dftouse['favorite_count'],bins=100)
plt.title("Distribution of Favorite Counts")
plt.show()

In [None]:
dftouse['retweet_count'].describe()

In [None]:
retweet_stats = dftouse['retweet_count'].describe()
retweet_mean = retweet_stats['mean']
retweet_std = retweet_stats['std']

In [None]:
dftouse['favorite_count'].describe()

In [None]:
favorite_stats = dftouse['favorite_count'].describe()
favorite_mean = favorite_stats['mean']
favorite_std = favorite_stats['std']

Given these statistics on retweet_count and favorite_count, we want to standardize the two before using them later on; otherwise, since there are far more retweets than favorites, retweets would be weighted more heavily.

In [None]:
dftouse = dftouse.rename(columns={'retweet_count': 'retweet_unstandardized', 'favorite_count': 'favorite_unstandardized'})

**Create standardized retweet_count and favorite_count**

We standardize retweet count and favorites by subtracting the mean and dividing by the standard deviation.

In [None]:
retweets = [(retweet_count - retweet_mean)/float(retweet_std) for retweet_count in dftouse['retweet_unstandardized']]

In [None]:
favorites = [(favorite_count - favorite_mean)/float(favorite_std) for favorite_count in dftouse['favorite_unstandardized']]

Now we add these as columns to our dftouse.

In [None]:
dftouse.loc[:,'retweet_count']=retweets

In [None]:
dftouse.loc[:,'favorite_count']=favorites
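The same standardization can be expressed as a vectorized pandas operation. A minimal sketch on toy counts (the numbers are made up); note that `Series.std()` uses the sample standard deviation, matching the `std` reported by `describe()`:

```python
import pandas as pd

def standardize(series):
    """Z-score a Series: subtract the mean, divide by the sample std."""
    return (series - series.mean()) / series.std()

toy_counts = pd.Series([0, 1, 2, 5, 92])  # skewed, like retweet counts
z = standardize(toy_counts)

# After standardizing, the mean is 0 and the sample std is 1 (up to rounding).
print(round(z.mean(), 6), round(z.std(), 6))
```

Standardizing puts the two counts on a common scale, but it does not remove the skew: the largest toy value still sits far from the rest.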

Now we recalculate popularity in the same way as before, this time from the standardized columns.

In [None]:
popularity = [retweets + favs for retweets, favs in zip(dftouse.retweet_count, dftouse.favorite_count)]
dftouse.loc[:,'popularity']=popularity

In [None]:
dftouse['popularity'].describe()

### Hashtag Analysis

References: 
- http://stackoverflow.com/questions/1894269/convert-string-representation-of-list-to-list-in-python
- http://stackoverflow.com/questions/10201977/how-to-reverse-tuples-in-python
- http://stackoverflow.com/questions/13925251/python-bar-plot-from-list-of-tuples/34013980#34013980

#### What fraction of tweets in the sample use hashtags?

In [None]:
num_tags_per_tweet = dftouse['hashtag_count']
tags_per_tweet = np.array(num_tags_per_tweet)
tagfrac = float(len(tags_per_tweet[tags_per_tweet>0]))/float(len(tags_per_tweet))
print str(tagfrac)+" of tweets in the sample use one or more hashtags."

In [None]:
plt.hist(tags_per_tweet)
plt.ylabel('Frequency')
plt.title('Histogram of Hashtags Used in Tweets')
plt.show()

#### Top 10 hashtags 

First get a flattened list of all the hashtags used in the sample:

In [None]:
alltags=[] 
for i in dftouse['hashtags']: # grab all the tags and put them into a list
    tag = ast.literal_eval(i) # convert string representation of list to list 
    alltags.append(tag) 
hashtags = [item for sublist in alltags for item in sublist] # flatten out the nested list

Then make a bar plot of the 10 most commonly used hashtags:

In [None]:
hashfreq = Counter(hashtags) # get the frequency of appearing hashtags
commontags = hashfreq.most_common(10) # save the top ten most common hashtags
taglabels = zip(*commontags)[0][::-1] # reverse the tuples to go from most frequent to least frequent 
hashtaglabels = ['#'+i for i in taglabels] # add a pound sign in front of each tag to make it clear that it's a hashtag
y_pos = np.arange(len(hashtaglabels)) 
usefreq = zip(*commontags)[1][::-1] # get the frequency part of the tuple
plt.barh(y_pos, usefreq, align='center') # plot horizontal barplot
plt.yticks(y_pos, hashtaglabels) 
plt.title('Top 10 Occurring Hashtags')
plt.show()
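For reference, a compact version of the same counting logic is sketched below on a few toy rows (the strings mimic how the `hashtags` column is stored in our CSV). Under Python 3, `zip` returns an iterator, so the indexing used above would need `list(zip(...))`; the sketch sidesteps this by unpacking the `(tag, count)` pairs directly.

```python
import ast
from collections import Counter
from itertools import chain

# Toy stand-in for dftouse['hashtags']: each cell is the string repr of a list.
toy_hashtag_column = ["['MTVStars']", "[]", "['Thanksgiving', 'MTVStars']"]

# Parse each cell back into a list, then flatten into a single list of tags.
parsed = (ast.literal_eval(cell) for cell in toy_hashtag_column)
all_tags = list(chain.from_iterable(parsed))

# most_common returns (tag, count) pairs sorted from most to least frequent.
top_tags = Counter(all_tags).most_common(10)
labels = ['#' + tag for tag, count in top_tags]
counts = [count for tag, count in top_tags]
print(labels, counts)
```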

#### List of Hashtags Associated with Highest Popularity Score Tweets

In [None]:
print pearsonr(dftouse['hashtag_count'],dftouse['popularity'])
plt.scatter(dftouse['hashtag_count'],dftouse['popularity'])
plt.ylabel('Popularity Score')
plt.show()

#### Correlation between length of tweet and popularity 

##### More data wrangling possibly needed: Why are some tweets longer than 140 characters? 

In [None]:
tweet_len = [len(text) for text in dftouse['text']]
print pearsonr(tweet_len,dftouse['popularity'])
plt.scatter(tweet_len,dftouse['popularity'])
plt.ylabel('Popularity Score')
plt.show()

### Update 12/4 (Yuqi): Tweets that contain emojis are converted into characters, which throws off the tweet length

In [None]:
tweet_len_array = np.array(tweet_len)
idx = np.where(tweet_len_array > 140)[0].tolist()
df_filtered_by_length = dftouse['text'].filter(idx).copy()
df_over140 = df_filtered_by_length.reset_index()
df_over140['text'][0]
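One plausible mechanism, assuming the lengths above were measured on the UTF-8 encoded text that the scraper wrote to the CSV: an emoji is a single character but takes several bytes once encoded, so byte counts can exceed 140 even when the tweet itself does not. A small Python 3 sketch with an illustrative string:

```python
# A short tweet-like string containing one emoji (illustrative, not from our data).
text = u"He \U0001F525"

char_length = len(text)                  # number of Unicode code points
byte_length = len(text.encode("utf-8"))  # number of bytes after UTF-8 encoding

print(char_length, byte_length)
```

If this is the cause, decoding the text back to Unicode before calling `len` should bring the measured lengths back in line with the character count.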

#### Correlation between presence of image and popularity

Dataframe only has information about links, so not differentiating between images and other urls for now...

#### Correlation between presence of links and popularity

In [None]:
print pearsonr(dftouse['url_count'],dftouse['popularity'])
plt.scatter(dftouse['url_count'],dftouse['popularity'])
plt.ylabel('Popularity Score')
plt.show()

#### Correlation between user mentions and popularity

In [None]:
print pearsonr(dftouse['mention_count'],dftouse['popularity'])
plt.scatter(dftouse['mention_count'],dftouse['popularity'])
plt.ylabel('Popularity Score')
plt.show()

#### Correlation for number of retweets and hearts


In [None]:
print pearsonr(dftouse['retweet_count'],dftouse['favorite_count'])
plt.scatter(dftouse['retweet_count'],dftouse['favorite_count'])
plt.show()

## Update: 12/4 (Yuqi)
Originally we had planned to do exploratory analysis on popular topics that people tweet about by city or state, but after taking a look at our data, we found that only 3.2% of tweets were geo-tagged, so we ultimately chose to forgo this analysis.

#### Fraction of Tweets that are Geo-tagged

In [None]:
totaltweets = float(len(dftouse['country'])) # total number of tweets in sample
countryfrac = float(sum(map(lambda r: int(isinstance(r, str)), dftouse['country'])))/totaltweets
cityfrac = float(sum(map(lambda r: int(isinstance(r, str)), dftouse['city'])))/totaltweets
print str(cityfrac)+" of tweets in the sample are geo-tagged with a city."
print str(countryfrac)+" of tweets in the sample are geo-tagged with a country."
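The same fractions can be computed without the `isinstance` check by counting non-missing values. A minimal sketch on a toy frame (made-up rows), assuming missing places are read in as NaN:

```python
import numpy as np
import pandas as pd

# Toy frame: one geo-tagged tweet out of four (values are illustrative).
toy = pd.DataFrame({"country": ["Canada", np.nan, np.nan, np.nan],
                    "city": ["Toronto, Ontario", np.nan, np.nan, np.nan]})

# notnull() gives a boolean column; its mean is the fraction of True values.
country_frac = toy["country"].notnull().mean()
city_frac = toy["city"].notnull().mean()
print(country_frac, city_frac)
```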

### Post Time

In [None]:
from datetime import datetime
date_objects = [datetime.strptime(each, '%Y-%m-%d %H:%M:%S') for each in dftouse['created_at']]
dir(date_objects[0])

#### When are tweets posted throughout the week?

In [None]:
day_objects = [each.weekday() for each in date_objects]
plt.hist(day_objects)
plt.show()

#### When are tweets posted during the day?

In [None]:
hour_objects = [each.hour for each in date_objects]
# plt.hist(hour_objects)
# plt.show()

In [None]:
print Counter(hour_objects)
sum(hour_objects)


In [None]:
N = 24 # number of bars should be 24 since there are 24 hours in a day
bottom = 4 # determines how big the circle in the middle is 
max_height = 8

theta = np.linspace(0.0, 2 * np.pi, N, endpoint=False)
radii = max_height*np.random.rand(N)
width = (2*np.pi) / N
ax = plt.subplot(111, polar=True)
ax.set_theta_direction(-1)
bars = ax.bar(theta, radii, width=width, bottom=bottom)

# Use custom colors and opacity
for r, bar in zip(radii, bars):
    bar.set_facecolor(plt.cm.jet(r / 10.))
    bar.set_alpha(0.8)

plt.show()

In [None]:
import matplotlib.ticker as tkr

def realign_polar_xticks(ax):
    for theta, label in zip(ax.get_xticks(), ax.get_xticklabels()):
        theta = theta * ax.get_theta_direction() + ax.get_theta_offset()
        theta = np.pi/2 - theta
        y, x = np.cos(theta), np.sin(theta)
        if x >= 0.1:
            label.set_horizontalalignment('left')
        if x <= -0.1:
            label.set_horizontalalignment('right')
        if y >= 0.5:
            label.set_verticalalignment('bottom')
        if y <= -0.5:
            label.set_verticalalignment('top')

def plot_clock(data):
    def hour_formatAM(x, p):
        hour = x * 6 / np.pi
        return '{:0.0f}:00'.format(hour) if x > 0 else '12:00'

    def hour_formatPM(x, p):
        hour = x * 6 / np.pi
        return '{:0.0f}:00'.format(hour + 12) if x > 0 else '24:00'

    def plot(ax, theta, counts, formatter):
        colors = plt.cm.jet(theta / 12.0)
        ax.bar(theta, counts, width=np.pi/6, color=colors, alpha=0.5)
        ax.xaxis.set_major_formatter(tkr.FuncFormatter(formatter))

    bins = np.r_[0, 0.5:12, 12, 12.5:24,  23.99999]
    data = np.array(data) / (60*60)
    counts = np.histogram(data,bins)[0]

    counts[13] += counts[0]
    counts[-1] += counts[13]

    fig, axes = plt.subplots(ncols=2, figsize=(20, 10), subplot_kw=dict(projection='polar'))
    fig.subplots_adjust(wspace=0.5)

    for ax in axes:
        ax.set(theta_offset=np.pi/2, theta_direction=-1,
               xticks=np.arange(0, np.pi*2, np.pi/6),
               yticks=np.arange(1, counts.max()))

    plot(axes[0], bins[1:13] * np.pi / 6, counts[1:13], hour_formatAM)
#     plot(axes[1], bins[14:26] * np.pi / 6, counts[14:26], hour_formatPM)
    return axes

data = [ 10.49531611,  22.49511583,  10.90891806,  18.99525417,
        21.57165972,   6.687755  ,   6.52137028,  15.86534639,
        18.53823556,   6.32563583,  12.99365833,  11.06817056,
        17.29261306,  15.31288556,  19.16236667,  10.38483333,
        14.51442222,  17.01413611,   6.96102278,  15.98508611,
        16.5287    ,  15.26533889,  20.83520278,  17.21952056,
         7.3225775 ,  16.42534361,  14.38649722,  21.63573111,  16.19249444]
data = np.array(data)*60*60
print len(data), data
axes = plot_clock(data)
for ax in axes:
    realign_polar_xticks(ax)
plt.show()

#### Correlation between time of day and tweet popularity



In [None]:
plt.plot_date(date_objects, popularity)
plt.show()

#### The distribution of retweets over time


In [None]:
plt.plot_date(date_objects, dftouse['retweet_count'])
plt.show()

#### The distribution of hearts over time

In [None]:
plt.plot_date(date_objects, dftouse['favorite_count'])
plt.show()

#### User follower count correlated with popularity

In [None]:
user_follower_count = dftouse['user_follower_count'] 
print pearsonr(user_follower_count,popularity)
plt.scatter(user_follower_count,popularity)
plt.show()
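
`pearsonr` returns the Pearson correlation coefficient together with a p-value. The sketch below computes the coefficient in plain Python to show what it measures, and compares raw follower counts with their base-10 logarithm. The log transform is an assumption on our part: it only helps if follower counts are heavily skewed. The function name `pearson_r` and the sample numbers are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length >= 2')
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined for a constant sequence')
    return cov / math.sqrt(var_x * var_y)

# Made-up data: popularity grows with the order of magnitude of followers
followers = [10, 100, 1000, 10000]
popularity_sample = [1, 2, 3, 4]

r_raw = pearson_r(followers, popularity_sample)
r_log = pearson_r([math.log10(f) for f in followers], popularity_sample)
print(r_raw, r_log)
```

In this toy data the relationship is linear in the log of the follower count, so the log-transformed correlation is higher than the raw one.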

#### Trending tweets and trending lists affecting virality 

### Sentiment Analysis

#### Determining positive/negative words



We score each tweet on how positive or negative it is by looking up its words in sentiment dictionaries.

**11/29 - Roseanne**
Started with a basic list of positive and negative words, with no weights or other information beyond the positive/negative label. It appears to miss most tokens (1812/892606 found).

**12/1 - Roseanne**
Tried LabMT, using the code provided. The match rate is a lot better (7016/892606).

**12/4 - Roseanne**
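Realized that the token count used above (892606) was total tokens rather than unique tokens (83093). That is still a lot of unique tokens, but more of them were found than expected: roughly 2.2% with the basic list and 8.4% with LabMT. LabMT is probably the better choice, though.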

In [None]:
#notes: Unicode in texts (probably emoticons? should we find a way to categorize those?)
#df_filtered['text']

#load dicts into lookup, map words to pos or neg value
#current dict: not sure where it's from?
#1812 of 83093 words in lookup.
lookup = {}
with open('positive.txt', 'r') as f:
    for line in f:
        word = line.strip()  # strip() also handles a final line with no trailing newline
        lookup[word] = 1
with open('negative.txt', 'r') as f:
    for line in f:
        word = line.strip()
        lookup[word] = -1

# uses LabMT for scoring, see http://neuro.imm.dtu.dk/wiki/LabMT
# 7016 of 83093 words in LabMT.
url = 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0026752.s001'
labmt = pd.read_csv(url, skiprows=2, sep='\t', index_col=0)

In [None]:
import nltk
# you'll need to download NLTK resource: nltk.download()
# or use terminal: sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

In [None]:
#text = reduce(lambda x,y: x+y, dftouse['text'].apply(lambda x: [x])) # list of strings, functionally identical to dftouse['text']
tweetstext = '\n'.join(dftouse['text']) # string of concatenated texts, all

In [None]:
# filter out stop words, etc
# notice: tokenizer puts punctuation as their own tokens, ex. separates hashtags, etc.
tokens = nltk.word_tokenize(tweetstext.decode('utf-8','ignore'))

In [None]:
print "Number of tokens:", len(tokens)
fdist = nltk.FreqDist(tokens)
utokens = fdist.keys()
print "Unique tokens:", len(utokens)
print "Tokens that appear only once:", len(fdist.hapaxes())
#fdist.most_common(50)
inlookup = []
notfoundlookup = []
inlabmt = []
notfoundlabmt = []
for key in utokens:
    if key in lookup:
        inlookup.append(key)
    else:
        notfoundlookup.append(key)
    if key in labmt.index:
        inlabmt.append(key)
    else:
        notfoundlabmt.append(key)
print "{} of {} words in lookup.".format(len(inlookup), len(utokens))
print inlookup[:10]

print "{} of {} words in LabMT.".format(len(inlabmt), len(utokens))
print inlabmt[:10]
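
The coverage check in the cell above can also be written with set operations. This is a minimal sketch, assuming the lexicon is any container of words (a dict such as `lookup` works); the function name `coverage` and the toy data are hypothetical.

```python
def coverage(tokens, lexicon):
    """Return the unique tokens found in the lexicon and the fraction found."""
    unique = set(tokens)
    found = unique & set(lexicon)  # iterating a dict yields its keys
    rate = len(found) / float(len(unique)) if unique else 0.0
    return found, rate

# Toy tokens and lexicon, for illustration only
toy_tokens = ['happy', 'sad', 'happy', 'http', '#ferguson']
toy_lexicon = {'happy': 1, 'sad': -1, 'great': 1}

found, rate = coverage(toy_tokens, toy_lexicon)
print(sorted(found), rate)
```

Because the rate is computed over unique tokens, a word that appears many times counts once, which matches the unique-token denominator used in the notes above.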

In [None]:
bigrams = dftouse['text'].apply(lambda x: list(nltk.bigrams(nltk.word_tokenize(x.decode('utf-8','ignore')))))
trigrams = dftouse['text'].apply(lambda x: list(nltk.trigrams(nltk.word_tokenize(x.decode('utf-8','ignore')))))
trigrams.head()

**12/4 - Roseanne**

Scoring: build three columns, one that scores the raw text, one that scores the text while ignoring words not in our dictionary, and one that lists the words not in the dictionary.

In [None]:
# average of entire tweet over unigrams
average = labmt.happiness_average.mean()
happiness = (labmt.happiness_average - average).to_dict()
 
def score(text):
    words = nltk.word_tokenize(text.decode('utf-8','ignore'))
    # max(..., 1) avoids a ZeroDivisionError when a tweet yields no tokens
    return sum([happiness.get(word.lower(), 0.0) for word in words]) / max(len(words), 1)

def scoreNoNeutrals(text):
    words = nltk.word_tokenize(text.decode('utf-8','ignore'))
    notscored = [word for word in words if happiness.get(word.lower(), 0.0) == 0.0]
    return sum([happiness.get(word.lower(), 0.0) for word in words]) / max((len(words) - len(notscored)),1)

def notScored(text):
    words = nltk.word_tokenize(text.decode('utf-8','ignore'))
    return [word for word in words if happiness.get(word.lower(), 0.0) == 0.0]


#dftouse['text'].apply(score).mean()
dftouse['sentiment'] = dftouse['text'].apply(score)
dftouse['sentimentnoneutrals'] = dftouse['text'].apply(scoreNoNeutrals)
dftouse['notscored'] = dftouse['text'].apply(notScored)
dftouse[['text','sentiment', 'sentimentnoneutrals', 'notscored']].head()
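
The difference between the two averages is easiest to see on a toy example. The sketch below is a simplified stand-in for `score` and `scoreNoNeutrals`, not the notebook's code: it assumes a made-up three-word lexicon (`toy_happiness`) and splits on whitespace instead of using the NLTK tokenizer.

```python
# Hypothetical centered happiness values, for illustration only
toy_happiness = {'love': 2.0, 'hate': -2.0, 'good': 1.0}

def score_all(words, lexicon):
    """Average over every word; words missing from the lexicon count as 0."""
    return sum(lexicon.get(w.lower(), 0.0) for w in words) / max(len(words), 1)

def score_matched(words, lexicon):
    """Average over only the words that appear in the lexicon."""
    matched = [lexicon[w.lower()] for w in words if w.lower() in lexicon]
    return sum(matched) / max(len(matched), 1)

words = 'I love this good song'.split()
all_avg = score_all(words, toy_happiness)          # 3.0 spread over 5 words
matched_avg = score_matched(words, toy_happiness)  # 3.0 spread over 2 matched words
print(all_avg, matched_avg)
```

Averaging over all words pulls the score toward zero when many words are unknown, while averaging over matched words keeps the magnitude. Note one design difference: this sketch tests membership in the lexicon, whereas the notebook treats any word whose value is exactly 0.0 as unscored.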

**12/4 - Roseanne**

Checking how our lookup and scoring is working.

Sentiment score ranges from approx. -3 to 3, with a mean close to 0.1, or roughly neutral.

Hapaxes (words that appear only once in the Tweets we're analyzing) are a surprisingly large percentage of our tokens (~55000 out of 83000). A lot of them are URLs (19812), which we can probably ignore; others include a Unicode character or formatting that caused the tokenizer to behave oddly. Would it be worth trying to filter out punctuation (e.g. replacing `.`, `!`, and `?` with spaces), or manually adding punctuation tokens such as '...' to our lookup? If we add them, how do we generate a score for them?
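
One way to explore the filtering question is to split tokens into URL fragments, pure punctuation, and everything else. This is a minimal sketch under our own assumptions: the helper names are hypothetical, the URL prefixes are guesses based on the `//t.co` fragments the tokenizer produces, and only ASCII punctuation from `string.punctuation` is recognized, so a Unicode ellipsis would not be caught.

```python
import string

URL_PREFIXES = ('//t.co', 'http://', 'https://')  # assumed prefixes, adjust as needed

def is_url_fragment(token):
    return token.startswith(URL_PREFIXES)

def is_punctuation(token):
    # True only for non-empty tokens made entirely of ASCII punctuation
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

def split_tokens(tokens):
    """Partition tokens into (kept, urls, punct) lists, preserving order."""
    kept, urls, punct = [], [], []
    for tok in tokens:
        if is_url_fragment(tok):
            urls.append(tok)
        elif is_punctuation(tok):
            punct.append(tok)
        else:
            kept.append(tok)
    return kept, urls, punct

# Made-up tokens resembling tokenizer output, for illustration only
sample = ['great', '...', '//t.co/abc123', '!', '#Ferguson', 'http', ':']
kept, urls, punct = split_tokens(sample)
print(kept, urls, punct)
```

Hashtags such as `#Ferguson` are kept because they contain letters. The bare `http` token left behind by the tokenizer is also kept, so it would need its own rule if it should be dropped.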

In [None]:
#dftouse.sentiment.min(), dftouse.sentiment.max(), dftouse.sentiment.mean()
fdist.hapaxes() #lots of links, Unicode included here, is it worth filtering out these/punctuation?

In [None]:
happiness

In [None]:
utokens_ = [x for x in utokens if x[:6] != '//t.co']
urltokens = [x for x in utokens if x[:6] == '//t.co']
print "Non-URL tokens:", len(utokens_)

In [None]:
print dftouse.sentiment.min(), dftouse.sentiment.max(), dftouse.sentiment.mean()
print dftouse.sentimentnoneutrals.min(), dftouse.sentimentnoneutrals.max(), dftouse.sentimentnoneutrals.mean()

In [None]:
dftouse.loc[dftouse.sentimentnoneutrals==dftouse.sentimentnoneutrals.max()]

In [None]:
dftouse.loc[dftouse.sentimentnoneutrals==dftouse.sentimentnoneutrals.min()]

In [None]:
for word, freq in fdist.most_common(50):
    print word, score(word.encode('utf-8'))  # fdist keys are unicode; score() expects a UTF-8 byte string

#### Visual content


#### Length of post

#### Controversy

### Prediction

In [None]:
# import csv
# another example with Cursor get all tweets with a certain hashtag and a certain time frame within past week 
# csvFile = open('tweets.csv', 'a')
# Use csv Writer
# csvWriter = csv.writer(csvFile)

# for tweet in tweepy.Cursor(api.search,q="#PrayForJapan",count=1,\
#                            lang="en",\
#                            since="2015-11-13").items():  # date must be a quoted string; unquoted 2015-11-13 evaluates to the integer 1991
#     print tweet.created_at, tweet.text
#     csvWriter.writerow([tweet.created_at, tweet.text.encode('utf-8')])