# Predicting Popularity: Using Text and Content Analysis to Examine Shared Characteristics of Popular Posts on Twitter

### A CS109 Final Project by Belinda Zeng, Roseanne Feng, Yuqi Hou, and Zahra Mahmood

![caption](https://studentshare.net/content/wp-content/uploads/2015/05/53a0e7d640b31_-_unknown-3-51047042.png)

## Background and Motivation

Twitter (https://twitter.com) is a social network, real-time news media service, and micro-blogging service where users can use text, photos, and videos to express moments or ideas in 140 characters or fewer. These 140-character messages are called "tweets." According to Twitter’s website, millions of tweets are shared in real time, every day. Registered users can read and post tweets, favorite and retweet other people’s tweets, and follow other accounts. Unregistered users can read tweets from public accounts.

In today's day and age of Twitter, popularity is measured in hearts, retweets, follows, and follow-backs. What posts get popular over time? What seems to resonate most with people? Do positive or negative sentiments invite more engagement? In this project, we use Twitter's publicly available archive of content to examine some of the shared characteristics of popular posts, including post length, visual content, and positive or negative sentiment.

## Related Work

Our idea came from a desire to understand how movements such as #BlackLivesMatter and #Ferguson begin on Twitter as well as a general desire to know what makes a post popular. We chose to focus on Tweets on an individual level and to use natural language processing to be able to understand and predict what makes posts popular.

One paper that is related to our work is a paper from Cornell titled [The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter](https://chenhaot.com/pages/wording-for-propagation.html), which compared pairs of tweets containing the same URL and written by the same user but employing different wording to see which version attracted more retweets. Twitter itself has published research on [What fuels a Tweet’s engagement?](https://blog.twitter.com/2014/what-fuels-a-tweets-engagement) Their research found that adding video, links, and photos each increases the number of retweets, and it breaks those results down by industry. Inspired by previous research, we sought to include sentiment analysis in our understanding of what made a Tweet popular.

## Initial Questions

1. How does the distribution of retweets and hearts vary for a post depending on the time of day when the tweet is created?
2. How do positive and negative sentiment affect popularity?
3. What Tweets do we think will become popular?

## Data

This data is publicly available via the Twitter Static API that gets queries based on specific parameters. We limited the data set to look at tweets within a specified period of time. We are storing the data in CSV files for now. To reduce file-sizes, we will try to have multiple CSVs so that we don't load too much data into memory. If data exceeds computer memory, we will consider AWS/SQL database alternatives. 
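One way to keep memory use bounded when a CSV grows large is to read it in chunks and keep only the rows needed. The following is a minimal sketch under stated assumptions: the in-memory `csv_text` and its column names are made up for illustration, and the chunk size of 2 is only there to make the chunking visible.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV on disk (columns are illustrative only)
csv_text = u"id,lang,retweet_count\n1,en,3\n2,fr,0\n3,en,10\n4,en,1\n5,es,7\n"

kept = []
# chunksize makes read_csv yield DataFrames of at most 2 rows each,
# so only one small chunk is held in memory at a time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    kept.append(chunk[chunk['lang'] == 'en'])

# stitch the filtered chunks back together
english = pd.concat(kept, ignore_index=True)
print(english)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path; the filtering step is where most of the memory savings come from.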

### Scraping

In [None]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import csv

In [None]:
from collections import Counter
import ast

Set up OAuth and an app on Twitter (to get the consumer key and secret and the access token and secret).

In [None]:
# great resource where I got all this 
# http://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/

import tweepy
import json
from tweepy import OAuthHandler

%run api-keys.py # run any python script and load all of its data directly into the interactive namespace
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

Our initial approach is to create a random sample that consists of 1% of tweets. This involves using tweepy and the sample call from the Twitter API.

```python
# final, final version 

from tweepy import Stream
from tweepy.streaming import StreamListener

# get retweet status
def try_retweet(status, attribute):
    try:
        if getattr(status, attribute):
            return True
    except AttributeError:
        return None

# function that tries to get attribute from object
def try_get(status, attribute):
    try:
        return getattr(status, attribute).encode('utf-8')
    except AttributeError:
        return None

# open csv file
csvFile = open('smallsample.csv', 'a')

# create csv writer
csvWriter = csv.writer(csvFile)

class MyListener(StreamListener):
    
    def on_status(self, status):
        try:
            # save relevant components of the tweet
            
            # get and sanitize hashtags 
            hashtags = status.entities['hashtags']
            hashtag_list = []
            for el in hashtags:
                hashtag_list.append(el['text'])
            hashtag_count = len(hashtag_list)
            
            # get and sanitize urls
            urls = status.entities['urls']
            url_list = []
            for el in urls:
                url_list.append(el['url'])
            url_count = len(url_list)
            
            # get and sanitize user_mentions
            user_mentions = status.entities['user_mentions']
            mention_list = []
            for el in user_mentions:
                mention_list.append(el['screen_name'])
            mention_count = len(mention_list)
            # save it all as a tweet
            tweet = [status.created_at, status.text.encode('utf-8'), status.place, status.lang, status.coordinates, 
              hashtag_list, url_list, mention_list, 
              hashtag_count, url_count, mention_count, 
              try_get(status, 'possibly_sensitive'),
              status.favorite_count, status.favorited, status.retweet_count, status.retweeted, 
              try_retweet(status,'retweeted_status'), 
              try_get(status.user, 'statuses_count'), 
              try_get(status.user, 'favourites_count'), 
              try_get(status.user, 'followers_count'),
              try_get(status.user, 'description'),
              try_get(status.user, 'location')]
            
            # write to csv
            csvWriter.writerow(tweet)
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True
    
    # tell us if there's an error
    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.sample()
```

From this point on, analysis will be done on previously scraped tweets, and there is no need to run the above scraping code.

In [None]:
tweetdf_small=pd.read_csv("tempdata/smallsample.csv", names=["created_at", "text", "place", "lang", "coordinates",
                                       "hashtags", "urls", "user_mentions", 
                                       "hashtag_count", "url_count", "mention_count",
                                       "possibly_sensitive", 
                                       "favorite_count", "favorited", "retweet_count", "retweeted",
                                       "retweeted_status", "user_statuses_count", "user_favorites_count",
                                       "user_follower_count", "user_description", "user_location"])
tweetdf_small.head(10)

In [None]:
tweetdf_small.shape

In [None]:
# count the tweets that have a nonzero retweet count
tweetdf_missing = tweetdf_small[tweetdf_small['retweet_count'] != 0] 
tweetdf_missing.shape

As we can see, the retweet count and favorite count are always 0. This is because we're using the live streaming API and, as a result, we're scraping the tweets as they are tweeted. At this point, all the tweets have retweet count 0 and favorite count 0 since they were literally just posted! That is, unless the tweet posted is actually a retweet... 

To solve the problem of brand new tweets, we used retweets to get the original tweet. This also ensures that our model isn't thrown off when someone with a huge follower count retweets something. Finally, we made sure not to consider the same tweet text twice.

### Get Original Retweets

The following function updates the way we use the tweepy streaming API. We first detect if the tweet we're looking at is actually a retweet of something. If so, we then get the original tweet and save that to our csv.

```python
# only save information for retweets

from tweepy import Stream
from tweepy.streaming import StreamListener

# get retweet status
def try_retweet(status, attribute):
    try:
        if getattr(status, attribute):
            return True
    except AttributeError:
        return None

# get country status
def try_country(status, attribute):
    if getattr(status, attribute) != None:
        place = getattr(status, attribute)
        return place.country
    return None

# get city status
def try_city(status, attribute):
    if getattr(status, attribute) != None:
        place = getattr(status, attribute)
        return place.full_name
    return None

# function that tries to get attribute from object
def try_get(status, attribute):
    try:
        return getattr(status, attribute).encode('utf-8')
    except AttributeError:
        return None

# open csv file
csvFile = open('originalsample.csv', 'a')

# create csv writer
csvWriter = csv.writer(csvFile)

class MyListener(StreamListener):
    
    def on_status(self, status):
        try:
            # if this represents a retweet
            if try_retweet(status,'retweeted_status'):
                status = status.retweeted_status
                
                # get and sanitize hashtags 
                hashtags = status.entities['hashtags']
                hashtag_list = []
                for el in hashtags:
                    hashtag_list.append(el['text'])
                hashtag_count = len(hashtag_list)

                # get and sanitize urls
                urls = status.entities['urls']
                url_list = []
                for el in urls:
                    url_list.append(el['url'])
                url_count = len(url_list)

                # get and sanitize user_mentions
                user_mentions = status.entities['user_mentions']
                mention_list = []
                for el in user_mentions:
                    mention_list.append(el['screen_name'])
                mention_count = len(mention_list)
                
                # save it all as a tweet
                tweet = [status.id, status.created_at, try_country(status, 'place'), try_city(status, 'place'), status.text.encode('utf-8'), status.lang,
                  hashtag_list, url_list, mention_list, 
                  hashtag_count, url_count, mention_count, 
                  try_get(status, 'possibly_sensitive'),
                  status.favorite_count, status.favorited, status.retweet_count, status.retweeted, 
                  status.user.statuses_count, 
                  status.user.favourites_count, 
                  status.user.followers_count,
                  try_get(status.user, 'description'),
                  try_get(status.user, 'location'),
                  try_get(status.user, 'time_zone')]
            
                # write to csv
                csvWriter.writerow(tweet)
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True
    
    # tell us if there's an error
    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.sample()
```

Now we read into pandas.

In [None]:
tweetdf=pd.read_csv("tempdata/originalsample.csv", names=["id", "created_at", "country", "city", "text", "lang",
                                       "hashtags", "urls", "user_mentions", 
                                       "hashtag_count", "url_count", "mention_count",
                                       "possibly_sensitive", 
                                       "favorite_count", "favorited", "retweet_count", "retweeted",
                                       "user_statuses_count", "user_favorites_count",
                                       "user_follower_count", "user_description", "user_location", "user_timezone"])
tweetdf.head(10)

In [None]:
tweetdf.shape

## Data wrangling

### Filter for English language tweets

In [None]:
df_filtered = tweetdf[tweetdf['lang'] == 'en']

In [None]:
df_filtered.shape

### Filter for unique tweet ids

In [None]:
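# drop_duplicates returns a new frame, so assign the result back
# (keep='last' replaces the older take_last=True argument)
df_filtered = df_filtered.drop_duplicates(subset='id', keep='last')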
df_filtered.head()

In [None]:
df_filtered.shape

# Exploratory Analysis

After scraping tweets from the Twitter Streaming API, we use that data to build a feature list that we use to predict how popular an individual tweet will be, measured by a composite score based on the number of retweets and hearts. We will also use metadata to help us analyze trends in the data, for example, whether there is a correlation between time of day and retweets.

### Updates

**11/30 - 12/1 (Yuqi)**

Initial exploratory analysis regarding popularity score and hashtags done. It seems like we should rethink our current formula for popularity because the histogram gives extremely strange results and the max score is really high. Need to look into why that might be.

All of the correlations that were done between popularity score and other factors came up significant. Could this be due to the large dataset that we are using? Should we be worried about things being labeled as significant not because the effect is actually meaningful but because there is so much data that small variations become significant?

Also, noticed that some tweets are longer than 140 characters, and I'm not sure why that is either. Further data wrangling probably needed. 
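The worry about significance in a large sample can be checked with a small simulation. This is a minimal sketch on synthetic data (not our tweets): the sample size of 200,000 and the slope of 0.02 are assumptions chosen only to illustrate the effect.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)   # fixed seed so the run is repeatable
n = 200000                       # assumed sample size, for illustration only

x = rng.normal(size=n)
# y depends on x only very weakly (assumed slope of 0.02) plus unit noise
y = 0.02 * x + rng.normal(size=n)

r, p = pearsonr(x, y)
print("r = %.4f, r^2 = %.5f, p = %.3g" % (r, r ** 2, p))
```

In a run like this, the p-value is tiny even though the correlation explains well under 1% of the variance, so with a sample this large it is worth reading the size of r alongside the p-value.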

**12/4 (Yuqi)** Tweets that have emoji are converted into encoded characters, which is throwing off tweet length.

**12/7 - Yuqi** Took out analysis on how trending topics affects tweets

**12/9 - Yuqi & Roseanne** Noticed that z-scoring retweet count and favorite count actually makes the standard deviation larger. We also noticed that taking the log of unstandardized popularity gave us the same distribution as taking the log of standardized popularity. It seems like taking the log of our popularity score made a larger impact than standardizing the scores did.

# Popularity Score

This is the response variable that we are trying to predict using various features of a tweet. The score was originally calculated by adding raw retweet count and favorite counts together, but after some exploratory analysis we chose to z-score retweet and favorite counts.

## Raw Popularity Score

In [None]:
popularity = [retweets + favs for retweets, favs in zip(df_filtered.retweet_count, df_filtered.favorite_count)]

In [None]:
# add popularity column to df
df_filtered.loc[:,'popularity']=popularity 
df_filtered.shape

In [None]:
dftouse = df_filtered.reset_index()
dftouse.head()

## Popularity Score Exploratory Analysis

The distribution of popularity is extremely right-skewed. Later we find that this is explained by the distributions of retweet counts and favorite counts, which are also extremely right-skewed.

In [None]:
plt.hist(dftouse['popularity'],bins=100)
plt.title("Distribution of Popularity")
plt.show()

In [None]:
dftouse['popularity'].describe()

## Rethinking how Popularity is Scored
The huge standard deviation and extreme ranges suggest that we may need to rethink how we score popularity. We looked more closely at a statistical summary of retweet count and favorite count to decide if any standardization would be necessary.

## Retweets

In [None]:
plt.hist(dftouse['retweet_count'],bins=100)
plt.title("Distribution of Retweet Counts")
plt.show()

In [None]:
dftouse['retweet_count'].describe()

In [None]:
retweet_stats = dftouse['retweet_count'].describe()
# look up the summary values by label rather than by position
retweet_mean = retweet_stats['mean']
retweet_std = retweet_stats['std']

## Favorites

In [None]:
plt.hist(dftouse['favorite_count'],bins=100)
plt.title("Distribution of Favorite Counts")
plt.show()

In [None]:
dftouse['favorite_count'].describe()

In [None]:
favorite_stats = dftouse['favorite_count'].describe()
# look up the summary values by label rather than by position
favorite_mean = favorite_stats['mean']
favorite_std = favorite_stats['std']

Given these statistics on retweet_count and favorite_count, we want to standardize the two before combining them; otherwise, since there are far more retweets than favorites, retweets would be weighted more heavily.

In [None]:
dftouse = dftouse.rename(columns={'retweet_count': 'retweet_unstandardized', 'favorite_count': 'favorite_unstandardized'})

**Create standardized retweet_count and favorite_count**

We decided to use a z-score to standardize retweet counts and favorite counts before adding them together to create the composite popularity score. Using that method, we standardize retweet count and favorites by subtracting the mean and dividing by the standard deviation.
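The z-score for a count x is z = (x - mean) / std. As a minimal sketch, the same standardization can be written in vectorized pandas form; the small `counts` frame below holds made-up numbers for illustration, and `zscore` is a hypothetical helper name rather than something defined elsewhere in this notebook.

```python
import pandas as pd

# Made-up counts, only to illustrate the arithmetic
counts = pd.DataFrame({
    'retweet_unstandardized': [0, 2, 5, 1200, 40],
    'favorite_unstandardized': [0, 1, 3, 90, 12],
})

def zscore(column):
    # subtract the mean and divide by the sample standard deviation
    # (pandas uses ddof=1, the same value that describe() reports)
    return (column - column.mean()) / column.std()

standardized = counts.apply(zscore)

# composite score: sum of the two standardized columns
popularity = (standardized['retweet_unstandardized']
              + standardized['favorite_unstandardized'])

print(standardized.describe())
```

After this step each standardized column has mean 0 and standard deviation 1, so neither count dominates the sum simply because of its scale.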

In [None]:
retweets = [(retweet_count - retweet_mean)/float(retweet_std) for retweet_count in dftouse['retweet_unstandardized']]

In [None]:
favorites = [(favorite_count - favorite_mean)/float(favorite_std) for favorite_count in dftouse['favorite_unstandardized']]

Now we add these as columns to our dftouse.

In [None]:
dftouse.loc[:,'retweet_count']=retweets

In [None]:
dftouse.loc[:,'favorite_count']=favorites

In [None]:
print dftouse.retweet_count.describe()

In [None]:
print dftouse.favorite_count.describe()

Now we recalculate popularity the same way as before, this time from the standardized counts.

In [None]:
popularity = [retweets + favs for retweets, favs in zip(dftouse.retweet_count, dftouse.favorite_count)]
dftouse.loc[:,'popularity']=popularity

In [None]:
dftouse['popularity'].describe()

## Transforming Popularity Score

The original histogram of raw popularity scores appeared to follow an exponential distribution, so we applied a log transformation to make the relationship between the response variable, popularity, and the explanatory variables (features) more observable. The resulting histogram had reasonable values, so there was no need to further standardize the popularity score.
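To illustrate why a log transformation helps with a long right tail, here is a minimal sketch on synthetic counts (not our tweets). Two things in it are assumptions: the lognormal sample standing in for raw popularity, and the use of `np.log1p` (log of 1 + x) in place of `np.log`, which we swap in here only because a plain log is undefined for zero or negative scores.

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic, heavily right-skewed non-negative counts (illustrative only)
raw_popularity = np.floor(rng.lognormal(mean=3.0, sigma=2.0, size=10000))

# log1p(x) = log(1 + x), so a count of zero maps to 0 instead of -inf
log_popularity = np.log1p(raw_popularity)

def skewness(values):
    # third standardized moment: large positive values mean a long right tail
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

print("skewness before: %.2f" % skewness(raw_popularity))
print("skewness after:  %.2f" % skewness(log_popularity))
```

The transformed values are far less skewed, which is the property that makes relationships with the features easier to see in scatter plots.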

In [None]:
unstandardized_popularity = [retweets + favs for retweets, favs in zip(dftouse.retweet_unstandardized, dftouse.favorite_unstandardized)]
# store the unstandardized sum (not the standardized popularity computed earlier)
dftouse.loc[:,'unstandardized_popularity']=unstandardized_popularity

In [None]:
dftouse['logpopularity']=dftouse['popularity'].apply(np.log)
dftouse['logpopularity'].describe()

In [None]:
plt.hist(dftouse['logpopularity'])
plt.xlabel('Log Popularity Score')
plt.ylabel('Frequency')
plt.title('Distribution of Log Popularity Score')
plt.show()

## Hashtag Analysis

References: 
- http://stackoverflow.com/questions/1894269/convert-string-representation-of-list-to-list-in-python
- http://stackoverflow.com/questions/10201977/how-to-reverse-tuples-in-python
- http://stackoverflow.com/questions/13925251/python-bar-plot-from-list-of-tuples/34013980#34013980

#### What fraction of tweets in the sample use hashtags?

In [None]:
num_tags_per_tweet = dftouse['hashtag_count']
tags_per_tweet = np.array(num_tags_per_tweet)
tagfrac = float(len(tags_per_tweet[tags_per_tweet>0]))/float(len(tags_per_tweet))
print str(tagfrac)+" of tweets in the sample use one or more hashtags."

In [None]:
plt.hist(tags_per_tweet)
plt.ylabel('Frequency')
plt.title('Histogram of Hashtags Used in Tweets')
plt.show()

#### Top 10 hashtags 

First get a flattened list of all the hashtags used in the sample:

In [None]:
alltags=[] 
for i in dftouse['hashtags']: # grab all the tags and put them into a list
    tag = ast.literal_eval(i) # convert string representation of list to list 
    alltags.append(tag) 
hashtags = [item for sublist in alltags for item in sublist] # flatten out the nested list

Then make a bar plot of the 10 most commonly used hashtags:

In [None]:
hashfreq = Counter(hashtags) # get the frequency of appearing hashtags
commontags = hashfreq.most_common(10) # save the top ten most common hashtags
taglabels = zip(*commontags)[0][::-1] # reverse the tuples to go from most frequent to least frequent 
hashtaglabels = ['#'+i for i in taglabels] # add a pound sign in front of each tag to make it clear that it's a hashtag
y_pos = np.arange(len(hashtaglabels)) 
usefreq = zip(*commontags)[1][::-1] # get the frequency part of the tuple
plt.barh(y_pos, usefreq, align='center') # plot horizontal barplot
plt.yticks(y_pos, hashtaglabels) 
plt.title('Top 10 Occurring Hashtags')
plt.show()

In [None]:
top4 = hashfreq.most_common(4)
tagdf = pd.DataFrame(dict(alltags=alltags, popularity=dftouse['logpopularity']))

for hashtag, _ in top4:
    # use a distinct loop name so the flattened `hashtags` list is not overwritten
    tagdf[hashtag] = [hashtag in tags for tags in alltags]

tagdf['populartags']=tagdf[['MTVStars','ThanksgivingClapBack','ALDUBApproval','ThanksgivingWithBlackFamilies']].sum(axis=1)
tagdf.head()

In [None]:
for column in ['MTVStars','ThanksgivingClapBack','ALDUBApproval','ThanksgivingWithBlackFamilies', 'populartags']:
    dftouse[column] = tagdf[column]
dftouse.head()

### Tweeting about "Popular" Topics and Popularity Score

Even if you're tweeting about a popular topic (defined here as a hashtag that occurs frequently in our sample), it doesn't affect popularity all that much. We therefore chose to leave hashtags out of our model.

In [None]:
plt.scatter(tagdf['populartags'], tagdf['popularity'])

#### Number of Hashtags vs. Popularity Score

In [None]:
from scipy.stats import pearsonr

In [None]:
print pearsonr(dftouse['hashtag_count'],dftouse['logpopularity'])
plt.scatter(dftouse['hashtag_count'],dftouse['logpopularity'])
plt.ylabel('Log Popularity Score')
plt.show()

#### Correlation between length of tweet and popularity 

In [None]:
tweet_len = [len(text) for text in dftouse['text']]
print pearsonr(tweet_len,dftouse['logpopularity'])
plt.scatter(tweet_len,dftouse['logpopularity'])
plt.ylabel('Log Popularity Score')
plt.show()

It seems that some tweets are longer than 140 characters because emoji are converted into encoded Unicode characters, which are counted toward tweet length. The cell below shows a tweet that uses emoji. The distribution of tweet length appears fairly uniform, so we decided not to remove emoji text from the analysis. In addition, fewer than 5.6% of our tweets went over the 140-character mark, so we decided that the effect was negligible.
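One plausible mechanism for the extra length, assuming the lengths are computed on the UTF-8 encoded text that the scraper wrote to CSV (`status.text.encode('utf-8')`), is that `len` then counts bytes rather than characters. The sketch below uses a made-up tweet to show the gap.

```python
# -*- coding: utf-8 -*-

# Made-up tweet: 11 ASCII characters followed by two emoji
tweet = u"Great game \U0001F600\U0001F600"

# Each of these emoji takes 4 bytes in UTF-8, while ASCII takes 1 byte each
encoded = tweet.encode('utf-8')

char_count = len(tweet)     # 13 on Python 3 (one code point per emoji)
byte_count = len(encoded)   # 19 = 11 + 2 * 4

print(char_count, byte_count)
```

Under this assumption, a tweet that fits the 140-character limit can still measure longer than 140 once it is encoded, which matches what we see for tweets containing emoji.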

#### Character Length and Emoji

In [None]:
tweet_len_array = np.array(tweet_len)
idx = np.where(tweet_len_array > 140)[0].tolist()
df_filtered_by_length = dftouse['text'].filter(idx).copy()
df_over140 = df_filtered_by_length.reset_index()
df_over140['text'][0]

In [None]:
# fraction of tweets longer than 140 characters
float(len(idx))/float(len(dftouse))

#### Correlation between presence of links and popularity

The dataframe only has information about links, and it would have been too complex to differentiate between image links and other links, so we do not distinguish between images and other URLs here. It appears that having more than one link is correlated with decreased popularity scores.

In [None]:
print pearsonr(dftouse['url_count'],dftouse['logpopularity'])
plt.scatter(dftouse['url_count'],dftouse['logpopularity'])
plt.ylabel('Log Popularity Score')
plt.show()

#### Correlation between user mentions and popularity

It appears as though increasing the number of user mentions is correlated with a decrease in popularity score. 

In [None]:
print pearsonr(dftouse['mention_count'],dftouse['logpopularity'])
plt.scatter(dftouse['mention_count'],dftouse['logpopularity'])
plt.ylabel('Log Popularity Score')
plt.show()

#### Correlation for number of retweets and hearts


There generally appears to be a positive correlation between retweet count and hearts. 

In [None]:
print pearsonr(dftouse['retweet_count'],dftouse['favorite_count'])
plt.scatter(dftouse['retweet_count'],dftouse['favorite_count'])
plt.show()

### Location

**Update 12/4 (Yuqi)** Originally we had planned to do exploratory analysis on popular topics that people tweet about by city or state, but after taking a look at our data, we found that only 3.2% of tweets were geo-tagged, so we ultimately chose to forgo this analysis.

#### Fraction of Tweets that are Geo-tagged

In [None]:
totaltweets = float(len(dftouse['country'])) # total number of tweets in sample
countryfrac = float(sum(map(lambda r: int(isinstance(r, str)), dftouse['country'])))/totaltweets
cityfrac = float(sum(map(lambda r: int(isinstance(r, str)), dftouse['city'])))/totaltweets
print str(cityfrac)+" of tweets in the sample are geo-tagged with a city."
print str(countryfrac)+" of tweets in the sample are geo-tagged with a country."

### Post Time

In [None]:
from datetime import datetime
date_objects = [datetime.strptime(each, '%Y-%m-%d %H:%M:%S') for each in dftouse['created_at']]

#### When are tweets posted throughout the week?

Because we used the Twitter Streaming API, most of the tweets are posted on a Tuesday, since that's when we scraped the tweets. This is also the case for the spike in tweets posted between midnight and 4am (localized time).

In [None]:
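day_objects = [each.weekday() for each in date_objects]  # weekday(): Monday=0 ... Sunday=6
day_counts = Counter(day_objects)
x_pos = list(range(7))  # one bar per weekday, even if a day has no tweets
height = [day_counts.get(day, 0) for day in x_pos]
# labels ordered to match weekday(), which starts the week on Monday
days = ('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
plt.bar(x_pos,height,align='center')
plt.xticks(x_pos, days)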
plt.ylabel("Frequency")
plt.title("Distribution of Tweets Throughout the Week")
plt.show()

#### When are tweets posted during the day?

In [None]:
hour_objects = [each.hour for each in date_objects]
plt.hist(hour_objects)
plt.show()

A histogram is helpful, but a polar histogram could possibly visualize our data in a more intuitive way. 

In [None]:
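import matplotlib.ticker as tkr  # provides FuncFormatter, used in plot_clock below

def main():
    data = hour_objects
    axes = plot_clock(data)
    for ax in axes:
        realign_polar_xticks(ax)
    plt.show()

def realign_polar_xticks(ax):
    # align each tick label according to where it sits around the circle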
    for theta, label in zip(ax.get_xticks(), ax.get_xticklabels()):
        theta = theta * ax.get_theta_direction() + ax.get_theta_offset()
        theta = np.pi/2 - theta
        y, x = np.cos(theta), np.sin(theta)
        if x >= 0.1:
            label.set_horizontalalignment('left')
        if x <= -0.1:
            label.set_horizontalalignment('right')
        if y >= 0.5:
            label.set_verticalalignment('bottom')
        if y <= -0.5:
            label.set_verticalalignment('top')

def plot_clock(data):
    def hour_formatAM(x, p):
        hour = x * 6 / np.pi
        return '{:0.0f}:00'.format(hour) if x > 0 else '12:00'

    def hour_formatPM(x, p):
        hour = x * 6 / np.pi
        return '{:0.0f}:00'.format(hour + 12) if x > 0 else '24:00'

    def plot(ax, theta, counts, formatter):
        colors = plt.cm.jet(theta / 12.0)
        ax.bar(theta, counts, width=np.pi/6, color=colors, alpha=0.5)
        ax.xaxis.set_major_formatter(tkr.FuncFormatter(formatter))

    plt.rcParams['font.size'] = 8

    bins = np.r_[0, 0.5:12, 12, 12.5:24,  23.99999]
    counts = np.histogram(data,bins)[0]

    counts[13] += counts[0]
    counts[-1] += counts[13]

    fig, axes = plt.subplots(ncols=2, figsize=(22, 12), dpi=200,
                             subplot_kw=dict(projection='polar'))
    fig.subplots_adjust(wspace=0.5)

    for ax in axes:
        ax.set(theta_offset=np.pi/2, theta_direction=-1,
               xticks=np.arange(0, np.pi*2, np.pi/6),
               yticks=np.arange(1, counts.max()))

    plot(axes[0], bins[1:13] * np.pi / 6, counts[1:13], hour_formatAM)
    plot(axes[1], bins[14:26] * np.pi / 6, counts[14:26], hour_formatPM)
    return axes

main()

#### Correlation between time of day and tweet popularity



There does not appear to be a clear relationship between the time of day that a tweet is created and its popularity score. There does appear to be some cyclical change, but it's hard to tell based on the correlation.

In [None]:
plt.scatter(hour_objects, dftouse['logpopularity'])
plt.show()

Since there does appear to be a slight relationship, we will include this in the prediction model.


In [None]:
# convert hour ints to strings to make the model evaluate hours as categorical variables 
string_hours = [str(hour) for hour in hour_objects]
dftouse['hour_posted']=string_hours
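Treating the hour as categorical means the model sees one indicator per hour instead of a single number from 0 to 23. As a minimal sketch of that idea, assuming one-hot encoding via `pd.get_dummies` (the modeling code may handle string categories differently), with a made-up `hours` frame:

```python
import pandas as pd

# Made-up hours stored as strings, as in the hour_posted column
hours = pd.DataFrame({'hour_posted': ['0', '13', '13', '23']})

# one indicator column per distinct hour; each row has exactly one column set
hour_dummies = pd.get_dummies(hours['hour_posted'], prefix='hour')

print(hour_dummies)
```

Encoding hours this way lets a model capture cyclical or irregular patterns across the day that a single linear hour term would miss.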

#### Correlation between day of the week posted and popularity

This appears uniform, more so than hour of the day.

In [None]:
plt.scatter(day_objects, dftouse['logpopularity'])
x_pos_x=range(0,7)
plt.xticks(x_pos_x, days)
plt.show()

#### The distribution of retweets and favorites over time


We were unable to plot this graph using the standardized retweet count so we chose to display the relationship using unstandardized values. 

In [None]:
plt.figure()
plt.plot_date(date_objects, dftouse['favorite_unstandardized'], alpha=.1, color='r')
plt.plot_date(date_objects, dftouse['retweet_unstandardized'], alpha=.1, color='b')
plt.show()

#### User's followers correlated with popularity

In [None]:
user_follower_count = dftouse['user_follower_count'] 
print pearsonr(user_follower_count,dftouse['logpopularity'])
plt.scatter(user_follower_count,dftouse['logpopularity'])
plt.show()

# Sentiment Analysis

#### Determining positive/negative words



Using sentiment lookup dictionaries, score tweets based on how positive/negative they are.

**11/29 - Roseanne**
Used a basic list of positive/negative words to begin with, no weights or other information beyond positive/negative. Appears to miss a bunch of tokens (1812/892606 found).

**12/1 - Roseanne**
Tried LabMT, using code provided. Rate is a lot better (7016/892606).

**12/4 - Roseanne**
Realized number of tokens (892606) was total tokens instead of unique tokens (83093). Still a lot but more tokens found than expected. LabMT is probably the better choice, though.

### Initial Analysis

First we tested the completeness of the lookup dictionaries that we originally chose to score text sentiment. The first positive.txt and negative.txt dictionaries are from UNC. Due to the nature of Twitter, many words and phrases tweeted will not be standard English and it is unlikely that we will find ratings for all of them in our dictionaries. We chose to go with LabMT as our dictionary since it was built for Twitter and therefore contained more words that we looked up. 

In [None]:
#notes: Unicode in texts (probably emoticons? should we find a way to categorize those?)

#load dicts into lookup, map words to pos or neg value
#current dict: not sure where it's from?
#1812 of 83093 words in lookup.
lookup = {}
with open('positive.txt', 'r') as f:
    for line in f:
        word = line.strip()  # drop the newline without cutting a final line that lacks one
        lookup[word] = 1
with open('negative.txt', 'r') as f:
    for line in f:
        word = line.strip()
        lookup[word] = -1

# uses LabMT for scoring, see http://neuro.imm.dtu.dk/wiki/LabMT
# 7016 of 83093 words in LabMT.
url = 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0026752.s001'
labmt = pd.read_csv(url, skiprows=2, sep='\t', index_col=0)

We used NLTK to handle much of the text parsing. We also used the NLTK tokenizer which breaks up punctuation and contractions, so words like "can't" are broken up into "ca" "n't", as well as emoticons such as ":)" which become ":", ")". This causes some data to be lost, but ultimately these are neutral or stop words that do not have a significant impact on sentiment.

In [None]:
import nltk
# you'll need to download NLTK resource: nltk.download()
# or use terminal: sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

First, we compiled all the tweets into a format that NLTK could work with...

In [None]:
#text = reduce(lambda x,y: x+y, dftouse['text'].apply(lambda x: [x])) # list of strings, functionally identical to dftouse['text']
tweetstext = '\n'.join(dftouse['text']) # string of concatenated texts, all

In [None]:
# notice: tokenizer puts punctuation as their own tokens, ex. separates hashtags, etc.
tokens = nltk.word_tokenize(tweetstext.decode('utf-8','ignore'))

Then, we looked at how often individual tokens appear and checked how many of them appear in each of our two dictionaries (the UNC word lists and LabMT).

In [None]:
print "Number of tokens:", len(tokens)
fdist = nltk.FreqDist(tokens)
utokens = fdist.keys()
print "Unique tokens:", len(utokens)
print "Tokens that appear only once:", len(fdist.hapaxes())
#fdist.most_common(50)
inlookup = []
notfoundlookup = []
inlabmt = []
notfoundlabmt = []
for key in utokens:
    if key in lookup:  # dict membership test; avoids building a key list on every iteration
        inlookup.append(key)
    else:
        notfoundlookup.append(key)
    if key in labmt.index:
        inlabmt.append(key)
    else:
        notfoundlabmt.append(key)
print "{} of {} words in lookup.".format(len(inlookup), len(utokens))
print inlookup[:10]

print "{} of {} words in LabMT.".format(len(inlabmt), len(utokens))
print inlabmt[:10]

We can see that LabMT found more words, so we chose to go with LabMT in our final analysis. In addition, LabMT rates sentiment on a scale, whereas the UNC dictionaries are binary, so LabMT gives us more information, which we thought would be more useful for our model. We also took into consideration that some tokens would not have sentiment ratings, such as the URL tokens or hapaxes, which are tokens that appear only once in our tweets sample.
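
The coverage comparison can be sketched with plain sets. This is only an illustration: `coverage`, the toy token list, and both toy lexicons (including their numeric values) are made up here and are not the real UNC or LabMT data.

```python
from __future__ import division  # keeps "/" as true division on Python 2

def coverage(unique_tokens, lexicon):
    """Return (found, total, fraction) for the unique tokens present in a lexicon."""
    unique = set(unique_tokens)
    found = unique & set(lexicon)
    return len(found), len(unique), len(found) / max(len(unique), 1)

# Toy stand-ins for the tweet tokens and the two dictionaries (values are invented).
toy_tokens = ['happy', 'sad', 'lol', 'the', '//t.co/abc', 'love', 'lol']
toy_binary = {'happy': 1, 'sad': -1}                               # binary +1/-1 labels
toy_graded = {'happy': 8.3, 'sad': 2.4, 'love': 8.4, 'the': 5.0}  # graded ratings

binary_cov = coverage(toy_tokens, toy_binary)
graded_cov = coverage(toy_tokens, toy_graded)
print("binary lexicon: %d of %d unique tokens" % binary_cov[:2])
print("graded lexicon: %d of %d unique tokens" % graded_cov[:2])
```

Counting over unique tokens rather than all tokens matches the correction noted in the 12/4 journal entry.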

#### Expectations

A lot of words don't appear in our dictionary, but we think that's okay because many of these are the result of formatting in Tweets that causes the tokenizer difficulty in parsing, or are unique words that are unlikely to throw off our scoring, such as URLs. 

In [None]:
fdist.hapaxes()[:10] #lots of links, Unicode included here, is it worth filtering out these/punctuation?

In [None]:
utokens_ = [x for x in utokens if x[:6] != '//t.co']
urltokens = [x for x in utokens if x[:6] == '//t.co']
print "Non-URL tokens:", len(utokens_)
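
As a rough illustration of why URLs dominate the hapaxes, the sketch below counts tokens with `collections.Counter` and separates the once-only tokens that look like shortened links. The toy token list is invented; it only mimics how the tokenizer splits a link into "https", ":", and a "//t.co/..." piece.

```python
from collections import Counter

# Invented tokens, shaped like the tokenizer's output for two tweets with links.
toy_tokens = ['https', ':', '//t.co/abc', 'love', 'love',
              'https', ':', '//t.co/xyz', 'rare']

counts = Counter(toy_tokens)
# Hapaxes are tokens seen exactly once.
hapaxes = sorted(tok for tok, n in counts.items() if n == 1)
url_hapaxes = [tok for tok in hapaxes if tok.startswith('//t.co')]
other_hapaxes = [tok for tok in hapaxes if not tok.startswith('//t.co')]

print("hapaxes: %d, of which URLs: %d" % (len(hapaxes), len(url_hapaxes)))
```

Each shortened link is unique, so every link contributes one hapax, while the "https" and ":" pieces repeat across tweets.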

## Sentiment Scoring

**12/4 - Roseanne**

Scoring - build columns for scoring text, one on the raw text, one on text that ignores words not in our dictionary, and one that shows us which words are not in the dictionary.

In [None]:
# average of entire tweet over unigrams
average = labmt.happiness_average.mean()
happiness = (labmt.happiness_average - average).to_dict()

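# note: each helper below re-tokenizes the tweet, so this is slow on large frames
def score(text):
    words = nltk.word_tokenize(text.decode('utf-8','ignore'))
    # max(..., 1) guards against tweets that produce no tokens
    return sum([happiness.get(word.lower(), 0.0) for word in words]) / max(len(words), 1)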

def scoreNoNeutrals(text):
    words = nltk.word_tokenize(text.decode('utf-8','ignore'))
    notscored = [word for word in words if happiness.get(word.lower(), 0.0) == 0.0]
    return sum([happiness.get(word.lower(), 0.0) for word in words]) / max((len(words) - len(notscored)),1)

def scored(text):
    words = nltk.word_tokenize(text.decode('utf-8','ignore'))
    return [word for word in words if happiness.get(word.lower(), 0.0) != 0.0]

def notScored(text):
    words = nltk.word_tokenize(text.decode('utf-8','ignore'))
    return [word for word in words if happiness.get(word.lower(), 0.0) == 0.0]


dftouse['text'].apply(score).mean()
dftouse['sentiment'] = dftouse['text'].apply(score)
dftouse['sentimentnoneutrals'] = dftouse['text'].apply(scoreNoNeutrals)
dftouse['scored'] = dftouse['text'].apply(scored)
dftouse['notscored'] = dftouse['text'].apply(notScored)
dftouse[['text','sentiment', 'sentimentnoneutrals', 'scored', 'notscored']].head()
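
To make the difference between the two sentiment columns concrete, here is a small worked example. It assumes a toy dictionary of centered scores (`toy_happiness`, with invented values rather than real LabMT ratings) and works on pre-split tokens instead of calling the NLTK tokenizer.

```python
from __future__ import division  # keeps "/" as true division on Python 2

# Invented centered scores (rating minus the lexicon mean); unknown words score 0.
toy_happiness = {'love': 3.0, 'great': 2.0, 'hate': -3.0}

def score_all(words, lexicon):
    """Mean centered score over every token; unknown tokens count as 0."""
    return sum(lexicon.get(w.lower(), 0.0) for w in words) / max(len(words), 1)

def score_no_neutrals(words, lexicon):
    """Mean centered score over only the tokens that have a nonzero score."""
    values = [lexicon.get(w.lower(), 0.0) for w in words]
    n_scored = sum(1 for v in values if v != 0.0)
    return sum(values) / max(n_scored, 1)

tweet = ['I', 'love', 'this', 'great', 'song', '!']
with_neutrals = score_all(tweet, toy_happiness)             # 5.0 spread over 6 tokens
without_neutrals = score_no_neutrals(tweet, toy_happiness)  # 5.0 spread over 2 tokens
print("with neutrals: %.3f, without: %.3f" % (with_neutrals, without_neutrals))
```

Both versions share the same numerator, so they always agree in sign; dropping neutrals only changes the magnitude.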

**12/4 - Roseanne**

Checking how our lookup and scoring is working.

Sentiment score ranges from approx. -3 to 3, with a mean close to 0.1, or roughly neutral.

Hapaxes (words that appear only once in the Tweets we're analyzing) are a surprisingly large percentage of our tokens (~55000 out of 83000). A lot of them are URLs (19812), which we can probably ignore, or include a Unicode character or formatting that caused the tokenizer to behave oddly. Would it be worth it to try to filter out punctuation, or to manually add such tokens to our lookup (e.g. replace .!?s with spaces, or add tokens such as '...')? If we add them, how do we generate a score for them?

In [None]:
print dftouse.sentiment.min(), dftouse.sentiment.max(), dftouse.sentiment.mean()
print dftouse.sentimentnoneutrals.min(), dftouse.sentimentnoneutrals.max(), dftouse.sentimentnoneutrals.mean()

In [None]:
dftouse.loc[dftouse.sentimentnoneutrals==dftouse.sentimentnoneutrals.max()]

In [None]:
dftouse.loc[dftouse.sentimentnoneutrals==dftouse.sentimentnoneutrals.min()]

## 50 Most Common Tokens

We can look at the fifty most common tokens in our tweets to get an idea of what appears most often. Many of these are punctuation marks, which generally do not have a consistent effect on sentiment, or neutral words that do not factor into our sentiment score.

In [None]:
for word, freq in fdist.most_common(50):
    print word, score(word)

# The Ngram Problem

Our approach to sentiment analysis is very simplistic, in large part because we look at words individually. However, words in a sentence are not independent of each other, and their meanings can combine in different ways to affect the sentiment. For example, the phrase "not bad" would be considered positive, but our sentiment would score it as negative because "not" and "bad" are each generally negative. To examine the extent to which this would affect our sentiment scoring, we looked at bigrams and trigrams to understand how often, and for which bigrams/trigrams, the score could differ from the unigram score. We found that the difference wasn't significant enough to include them in our analysis.
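
The "not bad" case can be reproduced in a few lines. The centered scores in `toy_happiness` are invented for illustration, and `unigram_score` is a hypothetical stand-in for our scoring functions.

```python
from __future__ import division  # keeps "/" as true division on Python 2

# Invented centered scores; negative values mean "less happy than average".
toy_happiness = {'not': -1.5, 'bad': -2.5, 'good': 2.0}

def unigram_score(words):
    """Average the per-word scores, ignoring how the words interact."""
    return sum(toy_happiness.get(w.lower(), 0.0) for w in words) / max(len(words), 1)

phrases = ['not bad', 'not good', 'good']
phrase_scores = dict((p, unigram_score(p.split())) for p in phrases)
for p in phrases:
    print("%-8s %+.2f" % (p, phrase_scores[p]))
```

With these toy values "not bad" comes out clearly negative and "not good" slightly positive, the opposite of how a reader would interpret either phrase.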

### Bigrams

We begin by finding the bigrams where the constituent elements are strongly associated with each other, i.e. they often occur together.

In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

In [None]:
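# score each bigram by its raw frequency
# (named scored_bigrams so it does not overwrite the scored() helper defined earlier)
scored_bigrams = finder.score_ngrams(bigram_measures.raw_freq)

In [None]:
print scored_bigrams[:20]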

### 20 Most Strongly Associated Bigrams

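Strongly associated bigrams occur when the elements appear together consistently. For example, if we see a "https" token, the chance that the next token is ":" is high, giving the bigram ("https", ":") a high association score. Note that the raw_freq measure used here ranks bigrams by how often they occur overall, so very common pairs rank highest.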
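
The two ideas in this paragraph, how often a pair occurs and how likely the second token is given the first, can be separated in a small sketch. The token list and both helper functions are hypothetical; `raw_freq` here is a simplified stand-in (pair count over number of bigram positions) and not NLTK's implementation.

```python
from __future__ import division  # keeps "/" as true division on Python 2
from collections import Counter

# Invented tokens, shaped like two tokenized tweets that each contain a link.
toy_tokens = ['https', ':', '//t.co/a', 'so', 'happy',
              'https', ':', '//t.co/b', 'so', 'sad']

bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
first_counts = Counter(toy_tokens[:-1])   # how often each token starts a bigram
n_bigrams = sum(bigram_counts.values())

def raw_freq(bigram):
    """Share of all bigram positions taken up by this pair."""
    return bigram_counts[bigram] / n_bigrams

def next_given_first(bigram):
    """Estimated chance of seeing the second token right after the first."""
    return bigram_counts[bigram] / first_counts[bigram[0]]

for bg in [('https', ':'), ('so', 'happy')]:
    print("%s raw_freq=%.3f P(next|first)=%.2f" % (bg, raw_freq(bg), next_given_first(bg)))
```

In the toy data ":" always follows "https", so the conditional chance is 1.0 even though the pair accounts for only 2 of the 9 bigram positions.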

In [None]:
# finds most associated bigrams
top_bigrams = finder.nbest(bigram_measures.raw_freq, 20)

In [None]:
top_bigrams

### Token Wrangling

At this point, we remove stopwords and punctuation from our list of tokens in order to see more meaningful tokens in our analysis. 

In [None]:
# create bigrams for each tweet
bigrams = dftouse['text'].apply(lambda x: list(nltk.bigrams(nltk.word_tokenize(x.decode('utf-8','ignore')))))

In [None]:
import string
from sklearn.feature_extraction import text 
stopwords=text.ENGLISH_STOP_WORDS
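# all ASCII punctuation except the apostrophe (index 6 of string.punctuation)
punctuation = string.punctuation[:6] + string.punctuation[7:]
# a set makes the membership test below fast
filtered = set(list(punctuation) + ['https','http','//t.co'] + list(stopwords))
# build a list of tokens that aren't punctuation or stopwords
tokens_ = [x for x in tokens if x not in filtered]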

## Distribution of bigram frequencies

In [None]:
# frequency distribution of bigrams over the filtered tokens
bigramfreq = nltk.FreqDist(nltk.bigrams(tokens_))
bigramfreq.most_common(20)
frequencies = [freq for bigram, freq in bigramfreq.items()]
# plot distribution
plt.hist(frequencies, bins=100)
plt.title("Distribution of Bigram Frequencies")
plt.show()

## Bigrams vs Unigrams

**Count important bigrams**

These are considered to be bigrams that show up 50 or more times.

In [None]:
bigrams_sorted = sorted(bigramfreq.items(), key=lambda x: -x[1])
print bigrams_sorted[:20]

In [None]:
# a bigram is important if it occurs 50 or more times
important_bigrams = [(bigram, val) for bigram, val in bigrams_sorted if val >= 50]
len(important_bigrams)

**What percentage of our bigrams are important?**

In [None]:
# divide by the number of distinct bigrams (len(bigrams) would be the number of tweets)
frac = len(important_bigrams) / float(len(bigramfreq))
print frac

This gives us hope that the presence of bigrams won't throw off our calculations too badly.
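
One way to read "what percentage of our bigrams are important" is as a share of the distinct bigrams. The sketch below computes that share on invented tokens; `important_fraction` and the threshold of 3 are illustrative choices, not the notebook's values.

```python
from __future__ import division  # keeps "/" as true division on Python 2
from collections import Counter

def important_fraction(tokens, threshold):
    """Return (important, distinct, fraction) for bigrams seen >= threshold times."""
    counts = Counter(zip(tokens, tokens[1:]))
    important = [bg for bg, n in counts.items() if n >= threshold]
    return len(important), len(counts), len(important) / max(len(counts), 1)

toy_tokens = ['black', 'friday', 'deals', 'black', 'friday', 'sale', 'black', 'friday']
n_important, n_distinct, frac = important_fraction(toy_tokens, threshold=3)
print("%d of %d distinct bigrams are important (%.0f%%)" % (n_important, n_distinct, 100 * frac))
```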

**Does manual scoring differ from our unigram scores?**

In [None]:
# assign scores to what we think is appropriate
manual_scores = bigrams_sorted[:20]
bigrams_tuple = [str(bigram) for bigram, frequency in manual_scores]

We manually score the bigrams to see which are positive and negative.

In [None]:
bigramdf = pd.DataFrame.from_items([('bigrams', bigrams_tuple)])
manual_ratings = ["Pos", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg", "Neg", "Pos", "Pos", "Pos", 
                  "Neg", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Neg"]
bigramdf['manual_ratings']= manual_ratings
bigramdf.head(5)

Now we do some unigram scoring

In [None]:
# scoring function for lists of tokens
def scoreTokens(words):
    return sum([happiness.get(word.lower(), 0.0) for word in words]) / len(words)

def scoreNoNeutralsTokens(words):
    notscored = [word for word in words if happiness.get(word.lower(), 0.0) == 0.0]
    return sum([happiness.get(word.lower(), 0.0) for word in words]) / max((len(words) - len(notscored)),1)


In [None]:
bigrams_text = [[word1, word2] for (word1, word2), frequency in manual_scores]

Our unigram scores including neutrals.

In [None]:
# score each bigram as the mean of its unigram scores (neutral words count as 0)
unigram_scores_neutrals = [scoreTokens(bigram) for bigram in bigrams_text]
# label each score as positive or negative
# (the loop variable is s because Python 2 list comprehensions leak their variable,
#  and a variable named score would overwrite the score() function)
unigram_bool_neutrals = ["Pos" if s > 0 else "Neg" for s in unigram_scores_neutrals]
bigramdf['unigram_ratings_neutrals']= unigram_bool_neutrals

Our unigram scores without neutrals

In [None]:
# pass bigrams to the no-neutrals score function
unigram_scores_no_neutrals = [scoreNoNeutralsTokens(bigram) for bigram in bigrams_text]
# label each score as positive or negative, using the no-neutrals scores
unigram_bool_no_neutrals = ["Pos" if s > 0 else "Neg" for s in unigram_scores_no_neutrals]
bigramdf['unigram_ratings_no_neutrals']= unigram_bool_no_neutrals

Our bigramdf now includes manual_ratings, unigram_ratings with neutrals and unigram_ratings without neutrals.

In [None]:
bigramdf.head(5)

Calculate the accuracy of our unigram scoring

In [None]:
# flag rows where the manual rating matches the unigram rating (with neutrals)
bigramdf['neutral_manual_same'] = [ 1 if manual == neutral else 0 for manual, neutral in zip(
                                    bigramdf['manual_ratings'],
                                    bigramdf['unigram_ratings_neutrals'])]
neutralcount = bigramdf['neutral_manual_same'].sum()
# fraction of bigrams where the two ratings agree
neutralcount/float(len(bigramdf))

This suggests that our unigram scoring is fairly accurate, since 90% of our manual bigram ratings match the unigram ratings. Now we check whether this holds for our no-neutrals scoring as well.

In [None]:
# noneutral_manual_same
bigramdf['noneutral_manual_same'] = [ 1 if manual == neutral else 0 for manual, neutral in zip(
                                    bigramdf['manual_ratings'],
                                    bigramdf['unigram_ratings_no_neutrals'])]
noneutralcount = bigramdf['noneutral_manual_same'].sum()
noneutralcount/float(len(bigramdf))
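
The agreement check boils down to comparing manual labels against the sign of each unigram score. A minimal sketch with invented labels and scores follows; `sign_label` and `agreement` are hypothetical helper names.

```python
from __future__ import division  # keeps "/" as true division on Python 2

def sign_label(score):
    """Mirror the notebook's rule: strictly positive is "Pos", everything else "Neg"."""
    return "Pos" if score > 0 else "Neg"

def agreement(manual_labels, scores):
    """Fraction of items whose sign-based label matches the manual label."""
    predicted = [sign_label(s) for s in scores]
    matches = sum(1 for m, p in zip(manual_labels, predicted) if m == p)
    return matches / max(len(manual_labels), 1)

toy_manual = ["Pos", "Neg", "Pos", "Neg"]
toy_scores = [1.2, -0.4, 0.0, 0.7]
rate = agreement(toy_manual, toy_scores)
print("agreement: %.2f" % rate)
```

Note that a score of exactly 0, which is what an n-gram with no words in the dictionary receives, is labeled "Neg" under this rule, so unscored n-grams count as disagreements whenever the manual rating is "Pos".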

In [None]:
bigramdf

### Trigrams

Now we do the same thing with trigrams.
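
Bigrams and trigrams are the same construction with a different window size. As a sketch, the hypothetical helper `ngrams` below builds either from a token list using only the standard library (the notebook itself uses `nltk.bigrams` and `nltk.trigrams`).

```python
def ngrams(tokens, n):
    """Return the list of n-token tuples that occur consecutively in tokens."""
    # zip over n shifted copies of the list; zip stops at the shortest one
    return list(zip(*[tokens[i:] for i in range(n)]))

toy = ['happy', 'thanksgiving', 'to', 'you']
toy_bigrams = ngrams(toy, 2)
toy_trigrams = ngrams(toy, 3)
print(toy_bigrams)
print(toy_trigrams)
```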

In [None]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(tokens)

In [None]:
# finds most associated trigrams
top_trigrams = finder.nbest(trigram_measures.raw_freq, 20)

In [None]:
# define trigrams
trigrams = dftouse['text'].apply(lambda x: list(nltk.trigrams(nltk.word_tokenize(x.decode('utf-8','ignore')))))

In [None]:
# rebuild the filtered token list (no punctuation or stopwords)
tokens_ = [x for x in tokens if x not in filtered]

**Plot distribution of trigram frequencies**

In [None]:
# frequency distribution of trigrams over the filtered tokens
trigramfreq = nltk.FreqDist(nltk.trigrams(tokens_))
trigramfreq.most_common(20)
frequencies = [freq for trigram, freq in trigramfreq.items()]
# plot distribution
plt.hist(frequencies, bins=100)
plt.title("Distribution of Trigram Frequencies")
plt.show()

## Trigrams vs Unigrams

**Count important trigrams**

In [None]:
trigrams_sorted = sorted(trigramfreq.items(), key=lambda x: -x[1])
print trigrams_sorted[:20]

In [None]:
# a trigram is important if it occurs 50 or more times
important_trigrams = [(trigram, val) for trigram, val in trigrams_sorted if val >= 50]
len(important_trigrams)


As before, the percentage of our trigrams that are important.

In [None]:
# divide by the number of distinct trigrams (len(trigrams) would be the number of tweets)
frac = len(important_trigrams) / float(len(trigramfreq))
print frac

**Compare manual and unigram scores of trigrams.**

In [None]:
# assign scores to what we think is appropriate
manual_scores = trigrams_sorted[:20]
trigrams_tuple = [str(trigram) for trigram, frequency in manual_scores]

Manual scoring

In [None]:
trigramdf = pd.DataFrame.from_items([('trigrams', trigrams_tuple)])
manual_ratings = ["Pos", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg", "Neg", "Pos", "Pos", "Pos", 
                  "Neg", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Neg"]
trigramdf['manual_ratings']= manual_ratings
trigramdf.head(5)

Unigram scoring

In [None]:
trigrams_text = [[word1, word2, word3] for (word1, word2, word3), frequency in manual_scores]

Scoring with neutrals

In [None]:
# score each trigram as the mean of its unigram scores (neutral words count as 0)
unigram_scores_neutrals = [sum([happiness.get(word.lower(), 0.0) for word in trigram]) / len(trigram) for trigram in trigrams_text]
# label each score as positive or negative
unigram_bool_neutrals = ["Pos" if s > 0 else "Neg" for s in unigram_scores_neutrals]
trigramdf['unigram_ratings_neutrals']= unigram_bool_neutrals

Scoring without neutrals

In [None]:
# pass trigrams to the no-neutrals score function
unigram_scores_no_neutrals = [scoreNoNeutralsTokens(trigram) for trigram in trigrams_text]
# label each score as positive or negative, using the no-neutrals scores
unigram_bool_no_neutrals = ["Pos" if s > 0 else "Neg" for s in unigram_scores_no_neutrals]
trigramdf['unigram_ratings_no_neutrals']= unigram_bool_no_neutrals

Accuracy of scoring with and without neutrals

In [None]:
# flag rows where the manual rating matches the unigram rating (with neutrals)
trigramdf['neutral_manual_same'] = [ 1 if manual == neutral else 0 for manual, neutral in zip(
                                    trigramdf['manual_ratings'],
                                    trigramdf['unigram_ratings_neutrals'])]
neutralcount = trigramdf['neutral_manual_same'].sum()
# flag rows where the manual rating matches the unigram rating (no neutrals)
trigramdf['noneutral_manual_same'] = [ 1 if manual == neutral else 0 for manual, neutral in zip(
                                    trigramdf['manual_ratings'],
                                    trigramdf['unigram_ratings_no_neutrals'])]
noneutralcount = trigramdf['noneutral_manual_same'].sum()
# fraction of trigrams where the ratings agree, with and without neutrals
neutralcount/float(len(trigramdf)), noneutralcount/float(len(trigramdf))

### Unigram, Bigram, Trigram comparison

In [None]:
scoredunigrams = [(x, y, scoreNoNeutralsTokens([x])) for x,y in fdist.items() if y >= 20]

posunigrams = [(x,y,z) for x,y,z in scoredunigrams if z >= 1]
negunigrams = [(x,y,z) for x,y,z in scoredunigrams if z <= -1]
neuunigrams = [(x,y,z) for x,y,z in scoredunigrams if z < 1 and z > -1]

print "unigrams positive:", len(posunigrams), "negative:", len(negunigrams), "neutral:", len(neuunigrams)

In [None]:
scoredbigrams = [(x, y, scoreNoNeutralsTokens(x)) for x,y in bigramfreq.items() if y >= 20]

posbigrams = [(x,y, z) for x,y,z in scoredbigrams if z >= 1]
negbigrams = [(x,y, z) for x,y,z in scoredbigrams if z <= -1]
neubigrams = [(x,y, z) for x,y,z in scoredbigrams if z < 1 and z > -1]

print "bigrams positive:", len(posbigrams), "negative:", len(negbigrams), "neutral:", len(neubigrams)

In [None]:
scoredtrigrams = [(x, y, scoreNoNeutralsTokens(x)) for x,y in trigramfreq.items() if y >= 20]

postrigrams = [(x,y,z) for x,y,z in scoredtrigrams if z >= 1]
negtrigrams = [(x,y,z) for x,y,z in scoredtrigrams if z <= -1]
neutrigrams = [(x,y,z) for x,y,z in scoredtrigrams if z < 1 and z > -1]

print "trigrams positive:", len(postrigrams), "negative:", len(negtrigrams), "neutral:", len(neutrigrams)
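
The three groups above are meant to split the scores into non-overlapping bins: at least 1, at most -1, and strictly in between. A minimal sketch of that binning, with a hypothetical `bucket` helper and invented scores:

```python
from collections import Counter

def bucket(score, cutoff=1.0):
    """Assign a centered sentiment score to one of three non-overlapping bins."""
    if score >= cutoff:
        return 'positive'
    if score <= -cutoff:
        return 'negative'
    return 'neutral'

toy_scores = [2.3, 1.0, 0.4, -0.2, -1.0, -2.7]
tally = Counter(bucket(s) for s in toy_scores)
print("positive: %d negative: %d neutral: %d"
      % (tally['positive'], tally['negative'], tally['neutral']))
```

Because every score lands in exactly one bin, the three counts add up to the number of n-grams scored.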

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(30, 30), 
                         tight_layout=True)

grams = [(1, scoredunigrams, 'unigrams'),
        (2, scoredbigrams, 'bigrams'),
        (3, scoredtrigrams, 'trigrams')]

for i, dist, name in grams:
    plt.subplot(3,1,i)
    flattened = reduce(lambda x,y: x+y, [[z]*y for x,y,z in dist])
    #plt.hist(plotpostri+plotnegtri+plotneutri, color=['b','r', 'g'], label=['positive', 'negative', 'neutral'])
    plt.hist(flattened)
    plt.title(name)

In [None]:
plt.figure()
for i, dist, name in grams:
    flattened = reduce(lambda x,y: x+y, [[z]*y for x,y,z in dist])
    plt.hist(flattened, alpha=0.2, label=name)
plt.legend(frameon=True)

In [None]:
color = ['b','r','g']
plt.figure()
for i, dist, name in grams:
    yscores = [z for x,y,z in dist]
    xvals = [i + (np.random.rand() - 0.5)/2. for y in yscores]
    plt.scatter(xvals, yscores, alpha=0.2, label=name, c=color[i-1])
plt.gca().axes.get_xaxis().set_ticks([])
plt.xticks([1.,2.,3.],['unigrams','bigrams','trigrams'])
plt.title('Distribution of sentiment scores over ngrams')
plt.ylabel('Sentiment')

# Prediction

**12/6 - Roseanne**

Working on prediction. Note: HW5 prediction is used to predict if sentiment is positive or negative, but we want to use this to predict how popular something is, not sure how to do that. Xarray below was built on text, but we should probably be doing something like HW2 and using sentiment or sentimentnoneutrals as one of the factors instead. I have been doing too much text analysis.

In [None]:
dftouse.head()

## Model 1

In [None]:
from statsmodels.formula.api import glm, ols
formula = 'logpopularity ~ hashtags + url_count + mention_count + user_follower_count + sentimentnoneutrals'
model = ols(formula, dftouse.head(2000)).fit()
print "R^2:", model.rsquared

## Model 1.5

In [None]:
from statsmodels.formula.api import glm, ols
formula = 'logpopularity ~ hashtags + url_count + mention_count + user_follower_count + sentimentnoneutrals + created_at'
model = ols(formula, dftouse.head(2000)).fit()
print "R^2:", model.rsquared

## Model 2

We model popularity as a function of features of the tweet and the user that posted it. Hashtags, links, mentioned users, sentiment, and time of posting are all features that can affect how many people see the tweet and how likely a user is to retweet it, which in turn expands the number of users who see the tweet. We also use the poster's follower count as a feature, since users with higher follower counts start with a larger initial audience, which can skew their popularity score compared to users with lower follower counts.
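
The hashtag terms in the formula below are named after specific hashtags. Assuming these columns are 0/1 flags marking whether a tweet contains the hashtag (the construction is not shown in this section, so this is an assumption), they could be derived from the text roughly like this. `hashtag_indicators` and `TRACKED` are hypothetical names.

```python
import re

# Hashtag names taken from the Model 2 formula.
TRACKED = ['MTVStars', 'ThanksgivingClapBack', 'ALDUBApproval',
           'ThanksgivingWithBlackFamilies']

def hashtag_indicators(text, tracked=TRACKED):
    """Return {hashtag name: 1 or 0}, matching hashtags case-insensitively."""
    tags = set(tag.lower() for tag in re.findall(r'#(\w+)', text))
    return dict((name, int(name.lower() in tags)) for name in tracked)

row = hashtag_indicators("Voting all day #MTVStars #mtvstars lol")
print(row)
```

Applied to each tweet, the resulting dictionaries could be turned into one DataFrame column per hashtag.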

In [None]:
formula = 'logpopularity ~ MTVStars + ThanksgivingClapBack + ALDUBApproval + ThanksgivingWithBlackFamilies + url_count + mention_count + user_follower_count + sentimentnoneutrals + hour_posted'
model = ols(formula, dftouse.head(2000)).fit()
print "R^2:", model.rsquared

In [None]:
model.summary()

A preliminary test with a basic OLS regression model gives us a fairly low R^2. It's possible to run the following code to see if there's just something wrong in our formula (i.e. can we get a higher R^2 with a different combination of features), but it's likely that all our results are going to have a low R^2 by the nature of the problem.

```python
# try every non-empty combination of features and record each model's R^2
import itertools
formulabase = 'popularity ~ '
variables = ['hashtags', 'url_count', 'mention_count', 'user_follower_count', 'sentimentnoneutrals', 'date_objects', 'created_at']

testresults = {}

for i in range(1, len(variables) + 1):  # + 1 so the full feature set is tried too
    for var in itertools.combinations(variables, i):
        testformula = formulabase + ' + '.join(var)
        print testformula, "running......"
        testmodel = ols(testformula, dftouse).fit()
        testresults[testformula] = testmodel.rsquared
```
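
For reference, the R^2 of a least-squares fit with an intercept is 1 - SS_res / SS_tot, the share of the variance in the response that the model explains. The pure-Python sketch below fits a one-feature line to invented data and computes R^2 by hand; `fit_line` and `r_squared` are hypothetical helpers, not the statsmodels implementation.

```python
from __future__ import division  # keeps "/" as true division on Python 2

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(ys, predicted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [0.0, 1.0, 2.0, 3.0]   # invented feature values
ys = [1.0, 3.0, 2.0, 4.0]   # invented responses
intercept, slope = fit_line(xs, ys)
r2 = r_squared(ys, [intercept + slope * x for x in xs])
print("slope=%.2f intercept=%.2f R^2=%.2f" % (slope, intercept, r2))
```

A low R^2 therefore means that most of the variation in popularity is left in the residuals, whichever features are included.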

In [None]:
# this takes forever to run + overheated my laptop
# sns.lmplot(x="sentimentnoneutrals", y="logpopularity", hue="user_follower_count", data=dftouse.head(), size = 7, aspect=1.2)

## Prediction Results

In [None]:
# the model was fit on dftouse.head(2000), so plot the actual and fitted values for those same rows
plt.scatter(model.model.endog, model.fittedvalues)
plt.plot([-2,16],[-2,16], 'k', label="slope 1", linewidth=2)
plt.xlim(-1.5,16)
plt.ylim(-1.5,16)
plt.xlabel('Actual Log Popularity Score')
plt.ylabel('Predicted Log Popularity Score')
plt.show()