### Language modeling is the task of predicting the next word, given the preceding history.

### Sentiment detection is just a special case of classification

**Data sets with fields:**

*Combined_Comments* := comment_id, author, author_flair, score, comment_name, comment_fullname, comment_is_root, comment_parent, comment_created, comment_created_utc, comment_created_utc_datetime, comment_created_utc_date, comment_created_utc_time, comment_depth, comment_body, submission_id, submission_title, submission_created_utc

*Clean_Game_Data* := index, unnamed: 0, playnum, playid, 'Game Title Date', text, homeWinPercentage, matched_play_by_play_text, matched_play_by_play_index, matched_play_by_play_utc, matched_play_by_play_tweetid, home_team, away_team, awayWinPercentage

*Pickle files in Clean_Game_Data* := author, author_flair, score, comment_id, comment_name, comment_fullname, comment_is_root, comment_parent, comment_approved_at_utc, comment_approved_by, matched_play_by_play_utc, matched_play_by_play_tweetid, home_team, away_team, awayWinPercentage, vader_ss, vader_neg, vader_neu, vader_pos, vader_compound

*Comments_FanOfGame* := comment_body (from Reddit), fan_of_team_playing


**Ideas for data to model**

*-------------1-------------*

*Dependent var* := game state

*Independent vars* := comment_body, fan_of_team_playing

*-------------2-------------*

*Dependent var* := fan_of_team_playing

*Independent vars* := comment_body, game_state

*-------------3-------------*

*Dependent var* := author_flair

*Independent vars* := comment_body, game_fan_state (fan_team_prob_win, fan_team_prob_lose, fan_no_team)

*-------------4-------------*

*Dependent var* := author_game_state or game_fan_state (fan_team_prob_win, fan_team_prob_lose, fan_no_team)

*Independent vars* := comment_body

*-------------5-------------*

*Dependent var* := next word

*Independent vars* := previous word


*--------------------------*

*Next step* := apply language model to each game and examine by game_state, fan_of_team_playing
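
As a rough illustration of idea 5 and the next step, a bigram model estimates the probability of the next word from the previous word alone. The sketch below is a minimal, unsmoothed version on made-up comments; the function names and the `<s>` / `</s>` boundary markers are assumptions for illustration, not part of the project code. Fitting one such table per game_state or fan_of_team_playing group would allow the comparison described above.

```python
from collections import Counter, defaultdict

def train_bigram_counts(token_lists):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for tokens in token_lists:
        # Hypothetical boundary markers so the first and last words are modeled too.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev); empty dict if prev is unseen."""
    total = sum(counts[prev].values()) if prev in counts else 0
    if total == 0:
        return {}
    return {word: c / total for word, c in counts[prev].items()}

# Hypothetical toy comments, already tokenized.
toy_comments = [
    ["what", "a", "game"],
    ["what", "a", "catch"],
    ["what", "a", "game"],
]
counts = train_bigram_counts(toy_comments)
probs = next_word_probs(counts, "a")
print(probs)
```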

**Possible comment label combinations (author_game_state)**

*fan/close*

*fan/blowout*

*notfan/close*

*notfan/blowout*

*fan/lose*

*fan/win*

In [1]:
# __future__ imports must come before any other statement
from __future__ import print_function
from __future__ import division
import numpy as np
import pandas as pd
import re
import pickle
import itertools

# SK-learn libraries for learning.
#from sklearn.pipeline import Pipeline
#from sklearn.grid_search import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# NLTK libs
from nltk.tokenize import TweetTokenizer



## Data Processing

In [2]:
## Load comments by game
files = [
    'Bears_vs_Packers__2017-09-28_comment_sentiment.pickle',
    'Broncos_vs_Chiefs__2017-10-30_comment_sentiment.pickle',
    'Chargers_vs_Cowboys__2017-11-23_comment_sentiment.pickle',
    'Chiefs_vs_Patriots__2017-09-07_comment_sentiment.pickle',
    'Chiefs_vs_Raiders__2017-10-19_comment_sentiment.pickle',
    'Cowboys_vs_Cardinals__2017-09-25_comment_sentiment.pickle',
    'Cowboys_vs_Raiders__2017-12-17_comment_sentiment.pickle',
    'Eagles_vs_Panthers__2017-10-12_comment_sentiment.pickle',
    'Falcons_vs_Buccaneers__2017-12-18_comment_sentiment.pickle',
    'Falcons_vs_Patriots__2017-10-22_comment_sentiment.pickle',
    'Falcons_vs_Seahawks__2017-11-20_comment_sentiment.pickle',
    'Giants_vs_Cowboys__2017-09-10_comment_sentiment.pickle',
    'Jaguars_vs_Patriots__2018-01-21_comment_sentiment.pickle',
    'Lions_vs_Giants__2017-09-18_comment_sentiment.pickle',
    'Lions_vs_Packers__2017-11-06_comment_sentiment.pickle',
    'Packers_vs_Panthers__2017-12-17_comment_sentiment.pickle',
    'Packers_vs_Vikings__2017-10-15_comment_sentiment.pickle',
    'Patriots_vs_Dolphins__2017-12-11_comment_sentiment.pickle',
    'Raiders_vs_Eagles__2017-12-25_comment_sentiment.pickle',
    'Raiders_vs_Redskins__2017-09-24_comment_sentiment.pickle',
    'Rams_vs_49ers__2017-09-21_comment_sentiment.pickle',
    'Redskins_vs_Chiefs__2017-10-02_comment_sentiment.pickle',
    'Redskins_vs_Cowboys__2017-11-30_comment_sentiment.pickle',
    'Redskins_vs_Eagles__2017-10-23_comment_sentiment.pickle',
    'Saints_vs_Falcons__2017-12-07_comment_sentiment.pickle',
    'Saints_vs_Vikings__2017-09-11_comment_sentiment.pickle',
    'Seahawks_vs_Cardinals__2017-11-09_comment_sentiment.pickle',
    'Steelers_vs_Bengals__2017-12-04_comment_sentiment.pickle',
    'Steelers_vs_Lions__2017-10-29_comment_sentiment.pickle',
    'Texans_vs_Bengals__2017-09-14_comment_sentiment.pickle',
    'Vikings_vs_Packers__2017-12-23_comment_sentiment.pickle',
    'Vikings_vs_Panthers__2017-12-10_comment_sentiment.pickle']

In [3]:
path = "/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/"

# Read every game's pickle, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
frames = []
for filename in files:
    print(path+filename)
    frames.append(pd.read_pickle(path+filename))

data = pd.concat(frames)

/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Bears_vs_Packers__2017-09-28_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Broncos_vs_Chiefs__2017-10-30_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Chargers_vs_Cowboys__2017-11-23_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Chiefs_vs_Patriots__2017-09-07_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Chiefs_vs_Raiders__2017-10-19_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Cowboys_vs_Cardinals__2017-09-25_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Cowboys_vs_Raiders__2017-12-17_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Eagles_vs_Panthers__2017-10-12_comment_sentiment.pickle
/Users/chadharness/mids/w266/w266_final_p

In [4]:
data.head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,matched_play_by_play_utc,matched_play_by_play_tweetid,home_team,away_team,awayWinPercentage,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound
0,Street_Spirit_,Raiders,2,dnnfsxx,t1_dnnfsxx,t1_dnnfsxx,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0
1,irishkid46,Bears,1,dnnft5a,t1_dnnft5a,t1_dnnft5a,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0
2,SportsMasterGeneral,Bears,12,dnnft8q,t1_dnnft8q,t1_dnnft8q,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.278, 'neu': 0.722, 'pos': 0.0, 'comp...",0.278,0.722,0.0,-0.5927
3,Street_Spirit_,Raiders,24,dnnftgx,t1_dnnftgx,t1_dnnftgx,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'comp...",0.636,0.364,0.0,-0.5423
4,Fight_For_Tacos,,0,dnnftkp,t1_dnnftkp,t1_dnnftkp,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0


In [5]:
data[data.author == 'Scaryclouds'].comment_body.head()

8784                                Fisher you fat fuck! 
8814    Hmm, looks like a bad spot, I think Smith got ...
8825                           Yea, first down for sure. 
8843    God damn, our o-line is just terrible at run b...
8860      Part of it is injuries to our interior o-line. 
Name: comment_body, dtype: object

In [6]:
list(data.columns.values)

['author',
 'author_flair',
 'score',
 'comment_id',
 'comment_name',
 'comment_fullname',
 'comment_is_root',
 'comment_parent',
 'comment_approved_at_utc',
 'comment_approved_by',
 'comment_created',
 'comment_created_utc',
 'comment_created_utc_datetime',
 'comment_created_utc_date',
 'comment_created_utc_time',
 'comment_banned_at_utc',
 'comment_banned_by',
 'comment_depth',
 'comment_num_reports',
 'comment_body',
 'comment_body_parsed',
 'submission_id',
 'submission_title',
 'submission_created_utc',
 'playId',
 'index',
 'Unnamed: 0',
 'playnum',
 'Game Title Date',
 'text',
 'homeWinPercentage',
 'matched_play_by_play_text',
 'matched_play_by_play_index',
 'matched_play_by_play_utc',
 'matched_play_by_play_tweetid',
 'home_team',
 'away_team',
 'awayWinPercentage',
 'vader_ss',
 'vader_neg',
 'vader_neu',
 'vader_pos',
 'vader_compound']

### Create features for model

In [8]:
# Identify game state
data['win_differential'] = abs(data.homeWinPercentage - data.awayWinPercentage)

# Call it a win for away if away has same or higher win percentage
data['win_team'] = np.where(data.awayWinPercentage >= data.homeWinPercentage, 'away', 'home')

data['game_state'] = np.where(data.win_differential < 0.6, 'close', 'notclose')


In [9]:
# Identify author affiliation to game
data['fan_type'] = np.where(data.away_team == data.author_flair, 'away', 
                            np.where(data.home_team == data.author_flair, 'home', 'nofan'))

data['author_team_state'] = np.where(data.fan_type == 'nofan', 'nopref',
                                     np.where(data.win_team == data.fan_type, 'winning', 'losing'))

data['author_game_state'] = np.where(data.fan_type == 'nofan', 
                                     np.where(data.game_state == 'notclose', 'nofan_notclose', 'nofan_close'),
                                     np.where(data.game_state == 'notclose', 
                                             np.where(data.win_team == data.fan_type, 'fan_win_notclose','fan_lose_notclose'),
                                              np.where(data.win_team == data.fan_type, 'fan_win_close', 'fan_lose_close')))
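
To sanity-check the nested np.where calls, it can help to restate the same rules for a single comment. The function below is a plain-Python sketch of that logic; the function name and the `close_cutoff` parameter are hypothetical, and the home win percentage in the example is assumed to be 1 minus the away value shown in data.head().

```python
def author_game_state(home_pct, away_pct, home_team, away_team, flair, close_cutoff=0.6):
    """Pure-Python mirror of the np.where labeling logic, for one comment."""
    win_differential = abs(home_pct - away_pct)
    win_team = "away" if away_pct >= home_pct else "home"   # ties go to away
    game_state = "close" if win_differential < close_cutoff else "notclose"

    if flair == away_team:      # away is checked first, as in the notebook
        fan_type = "away"
    elif flair == home_team:
        fan_type = "home"
    else:
        return "nofan_" + game_state

    outcome = "win" if win_team == fan_type else "lose"
    return "fan_{}_{}".format(outcome, game_state)

# Packers host Bears with awayWinPercentage = 0.161 (first rows of data.head()).
# homeWinPercentage = 0.839 is an assumption made for this illustration.
print(author_game_state(0.839, 0.161, "Packers", "Bears", "Bears"))
print(author_game_state(0.839, 0.161, "Packers", "Bears", "Raiders"))
```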



### Define helper functions

In [10]:
# Borrowed some functions from the w266 utils.py file
# Miscellaneous helpers
def flatten(list_of_lists):
    """Flatten a list-of-lists into a single list."""
    return list(itertools.chain.from_iterable(list_of_lists))


# Word processing functions
def canonicalize_digits(word):
    if any(c.isalpha() for c in word): return word
    word = re.sub(r"\d", "DG", word)
    if word.startswith("DG"):
        word = word.replace(",", "") # remove thousands separator
    return word

def canonicalize_word(word, wordset=None, digits=True):
    # Replace URL-like substrings with a placeholder token
    word = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?\S",
                  "postedhyperlinkvalue", word)
    #if not word.isupper():
    word = word.lower()
    if digits:
        if (wordset is not None) and (word in wordset): return word
        word = canonicalize_digits(word) # try to canonicalize numbers
    if (wordset is None) or (word in wordset):
        return word
    else:
        # NOTE: `constants` comes from the w266 utils and is not imported in this
        # notebook; this branch is only reached when a wordset is passed.
        return constants.UNK_TOKEN

def canonicalize_words(words, **kw):
    return [canonicalize_word(word, **kw) for word in words]
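
The digit handling explains why tokens such as `#DG-DG` or `+DGDG:DGDG` show up in the feature lists later on. The snippet below repeats `canonicalize_digits` on its own so that it runs standalone, and applies it to a few made-up tokens.

```python
import re

def canonicalize_digits(word):
    """Replace each digit with 'DG' unless the token contains a letter."""
    if any(c.isalpha() for c in word):
        return word
    word = re.sub(r"\d", "DG", word)
    if word.startswith("DG"):
        word = word.replace(",", "")  # remove thousands separator
    return word

# Hypothetical tokens chosen to exercise each branch.
for token in ["28-3", "1,234", "#12", "4th", "lol"]:
    print(token, "->", canonicalize_digits(token))
```

Tokens containing any letter (such as "4th") pass through unchanged, so only purely numeric or symbol-plus-digit tokens are collapsed.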

In [38]:
def make_data(data, only_fans=False, no_empty=True, tail_win_diffs=False, tokenizer=TweetTokenizer(), canonize=True):
    use_data = data
    
    if only_fans:
        # Get rid of non-fans
        use_data = use_data[use_data['fan_type']!='nofan']
    
    if no_empty:
        # Eliminate data with empty comments
        use_data = use_data[pd.notnull(use_data['comment_body'])]
        
    if tail_win_diffs:
        # Eliminate data for games in which the outcome is neither very close nor very clear
        use_data = use_data[(use_data['win_differential'] <= 0.2) | (use_data['win_differential'] >= 0.9)]

    # Separate comments
    comments = use_data.loc[:, 'comment_body']
    
    # Convert to list
    comment_list = comments.values.tolist()
    
    # Tokenize comments
    x_tokens = [tokenizer.tokenize(sentence) for sentence in comment_list]
    
    if canonize:
        # Canonicalize every token of every comment (lowercase, digits -> DG, URLs -> placeholder)
        x_tokens = [canonicalize_words(tokens) for tokens in x_tokens]
    
    return use_data, comments, comment_list, x_tokens


def most_informative_feature_for_binary_classification(vectorizer, classifier, n=10):
    """Print the n lowest- and n highest-coefficient features of a binary linear classifier."""
    class_labels = classifier.classes_
    # Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
    feature_names = vectorizer.get_feature_names()
    # Sort once: the lowest coefficients are printed under class 0, the highest under class 1.
    ranked = sorted(zip(classifier.coef_[0], feature_names))

    for coef, feat in ranked[:n]:
        print(class_labels[0], coef, feat)

    print()

    for coef, feat in reversed(ranked[-n:]):
        print(class_labels[1], coef, feat)

### Prepare the data for modeling

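For classification, we'll drop comments posted while the game was neither clearly decided nor very close, keeping only those with win_differential at most 0.2 or at least 0.9. We'll also restrict ourselves to comments by users whose flair matches one of the two teams playing, i.e. self-identified fans of a team in the game.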

In [12]:
use_data, comments, comment_list, x_tokens = make_data(data, only_fans=True, tail_win_diffs=True)

### Set dependent variable

In [13]:
# Isolate the labels
# target_var = 'author_game_state'
target_var = 'game_state'
# target_var = 'fan_type'

labels = use_data.loc[:, target_var]

counts = {}
for label in np.unique(labels):
    counts[label] = sum(labels == label)

print("Class counts:\n{}".format(counts))

Class counts:
{'close': 41287, 'notclose': 41743}


### Multinomial Naive Bayes

In [29]:
print(len(x_tokens))

83030


In [30]:
# Count or TF-IDF vectorize, removing stop words.
vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', lowercase=False, 
                             tokenizer=lambda text: text)
                             #tokenizer=lambda text: text, min_df=0.00002, max_df=0.005)
spmat = vectorizer.fit_transform(x_tokens)
#vectorizer = CountVectorizer(analyzer='word', stop_words='english', lowercase=False, binary=False)
#spmat = vectorizer.fit_transform(X_data)

In [31]:
# Split into test and train
train_data, test_data, train_labels, test_labels = train_test_split(spmat, labels, test_size=0.10, random_state=42)  

In [32]:
# Train model
#a_values = [x * 0.01 for x in range(1,20)]
#gs_mnb = GridSearchCV(MultinomialNB(), {'alpha': a_values}, cv=5,
#                       scoring='f1_weighted')
clf = MultinomialNB()
clf.fit(train_data, train_labels)
#print(gs_mnb.best_estimator_)
#print(gs_mnb.best_score_)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [33]:
print(clf.class_count_)

[37124. 37603.]


In [34]:
# Get feature names and class labels
feature_names = vectorizer.get_feature_names()
class_labels = clf.classes_
print(class_labels)

['close' 'notclose']


In [35]:
# Create predictions and evaluate
pred_labels = clf.predict(test_data)
acc = metrics.accuracy_score(test_labels, pred_labels)
print("Accuracy on test set: {:.02%}".format(acc))
print('Test Data:')

print(classification_report(test_labels, pred_labels, target_names = class_labels, digits=3))

print("Confusion Matrix...")
confusionMatrix = metrics.confusion_matrix(test_labels, pred_labels)
print(confusionMatrix)

Accuracy on test set: 62.45%
Test Data:
             precision    recall  f1-score   support

      close      0.633     0.597     0.615      4163
   notclose      0.617     0.652     0.634      4140

avg / total      0.625     0.624     0.624      8303

Confusion Matrix...
[[2487 1676]
 [1442 2698]]


In [37]:
# Get most informative features
most_informative_feature_for_binary_classification(vectorizer, clf, n = 15)

close -11.631221128427331 ###bruh
close -11.631221128427331 ###mods
close -11.631221128427331 ##crazyeyes
close -11.631221128427331 ##domcapers
close -11.631221128427331 ##firegarret
close -11.631221128427331 ##mikeshula
close -11.631221128427331 ##questions
close -11.631221128427331 ##superbowl
close -11.631221128427331 ##touchdowwwnnnnn
close -11.631221128427331 #DG-DG
close -11.631221128427331 #DGDGDG
close -11.631221128427331 #DGDGDGDG
close -11.631221128427331 #allplayersmatter
close -11.631221128427331 #amanprovides
close -11.631221128427331 #analysis

notclose -3.7345662013167704 .
notclose -4.480346151869599 ,
notclose -4.653306298418179 ?
notclose -4.938916378626684 !
notclose -5.132103883964338 game
notclose -5.24887573474587 fuck
notclose -5.391043107401303 ...
notclose -5.401783677654522 just
notclose -5.464980463381548 like
notclose -5.470774039206745 DG
notclose -5.537343488204635 lol
notclose -5.589333104013729 fucking
notclose -5.639432134852957 "
notclose -5.6864792326
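
A caveat on reading the list above: in the scikit-learn release used here, `coef_` of a two-class MultinomialNB appears to be the log-probability of each feature under the second class ('notclose'), so the 'close' rows are simply the rarest tokens under 'notclose' and the 'notclose' rows are its most frequent ones. One common alternative is to rank features by the difference in log-probability between the two classes. The sketch below hand-rolls that on raw counts for a made-up corpus (the notebook itself uses TF-IDF weights); the function name is hypothetical. With the fitted model, the analogous quantity would be `clf.feature_log_prob_[1] - clf.feature_log_prob_[0]`.

```python
import math
from collections import Counter

def nb_log_odds(token_lists, labels, positive, alpha=1.0):
    """Smoothed log P(word | positive) - log P(word | other) for a two-class corpus."""
    pos_counts, neg_counts = Counter(), Counter()
    for tokens, label in zip(token_lists, labels):
        (pos_counts if label == positive else neg_counts).update(tokens)
    vocab = sorted(set(pos_counts) | set(neg_counts))
    # Additive (Laplace) smoothing, as in multinomial Naive Bayes.
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)
    return {
        w: math.log((pos_counts[w] + alpha) / pos_total)
           - math.log((neg_counts[w] + alpha) / neg_total)
        for w in vocab
    }

# Hypothetical toy corpus; the labels reuse the notebook's class names.
docs = [["gg", "shutout"], ["gg", "game"], ["penalty", "game"], ["penalty", "drive"]]
labels = ["notclose", "notclose", "close", "close"]

scores = nb_log_odds(docs, labels, positive="notclose")
ranked = sorted(scores, key=scores.get)
print("most 'close':", ranked[:2])
print("most 'notclose':", ranked[-2:])
```

A word such as "game" that is equally likely under both classes scores zero and drops out of both ends of the ranking, which is the behavior one would want from an "informative features" list.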

### Logistic Regression

### Vectorize data and split into train and test

We reuse the data prep, vectorization, and train/test split from the Naive Bayes section.

In [39]:
# Train model
lgreg = LogisticRegression()
lgreg.fit(train_data, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [40]:
#Create predictions and evaluate
pred_labels = lgreg.predict(test_data)
acc = metrics.accuracy_score(test_labels, pred_labels)
print("Accuracy on test set: {:.02%}".format(acc))
print('Test Data:')

print(classification_report(test_labels, pred_labels, target_names = class_labels, digits=3))
#print(classification_report(test_labels, pred_labels, target_names = ['fan_lose_close', 'fan_lose_notclose', 'fan_win_close', 'fan_win_notclose'], digits=3))

print("Confusion Matrix...")
confusionMatrix = metrics.confusion_matrix(test_labels, pred_labels)
print(confusionMatrix)

Accuracy on test set: 62.41%
Test Data:
             precision    recall  f1-score   support

      close      0.622     0.638     0.630      4163
   notclose      0.626     0.610     0.618      4140

avg / total      0.624     0.624     0.624      8303

Confusion Matrix...
[[2657 1506]
 [1615 2525]]


In [41]:
# Get most informative features
most_informative_feature_for_binary_classification(vectorizer, lgreg, n = 15)

close -3.083387276589751 vernon
close -2.916989858090347 foles
close -2.8505547414694385 kalil
close -2.8432073438532317 shazier
close -2.432569510986005 roberts
close -2.426465896113813 ebron
close -2.4129499304235984 lattimore
close -2.3960654682452986 ab
close -2.2481388899302264 bersin
close -2.21636704437944 quin
close -2.1637725459347292 bucs
close -2.1589079406201903 bounds
close -2.1510923780674727 hightower
close -2.143273609819304 apple
close -2.136422235009551 mills

notclose 6.252775706009352 glennon
notclose 6.03267734586043 fog
notclose 5.171237599217078 bears
notclose 4.623512584117515 trevathan
notclose 4.115676250470464 gg
notclose 3.5019565157239745 trubisky
notclose 3.4708362362954452 siemian
notclose 3.4060403898708485 jordy
notclose 3.316606566691688 onside
notclose 3.157677110168431 obj
notclose 3.1377404645144775 mcadoo
notclose 3.0058498218298153 fox
notclose 2.9109852692522082 shutout
notclose 2.7888360139763946 marshall
notclose 2.716301040827279 achilles


### Formulate the problem as a regression problem

### Prepare the data for modeling

We will begin by using the same data that we used for classification. We will later model all of the comments in our corpus. In both cases, our target variable will be win_differential, a continuous variable.

In [93]:
use_data, comments, comment_list, x_tokens = make_data(data, only_fans=True, tail_win_diffs=True)

### Set dependent variable

In [94]:
target_var = 'win_differential'
y_data = use_data.loc[:, target_var]

### Vectorize data and split into train and test

We will vectorize a bit differently for regression: binary indicators mark the presence or absence of each word, rather than the TF-IDF weights we used for classification.

In [108]:
vectorizer = CountVectorizer(analyzer='word', stop_words='english', tokenizer=lambda text: text, 
                             lowercase=False, binary=True)#, min_df=10)
spmat = vectorizer.fit_transform(x_tokens)
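
To make the difference between counts and presence indicators concrete, the sketch below builds both representations by hand for two made-up comments. It is only an approximation of what `CountVectorizer(binary=True)` produces: there is no stop-word removal and the rows are dense lists rather than a sparse matrix. The `vectorize` helper is hypothetical.

```python
from collections import Counter

def vectorize(token_lists, binary=False):
    """Return (vocabulary, rows); each row holds one value per vocabulary word."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        if binary:
            rows.append([1 if counts[w] > 0 else 0 for w in vocab])
        else:
            rows.append([counts[w] for w in vocab])
    return vocab, rows

# Hypothetical pre-tokenized comments.
toy = [["fog", "fog", "fog", "game"], ["game", "gg"]]

vocab, count_rows = vectorize(toy)
_, binary_rows = vectorize(toy, binary=True)
print(vocab)
print(count_rows)
print(binary_rows)
```

With binary features, a comment that repeats "fog" three times contributes the same signal as one that mentions it once.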

In [96]:
# Split into test and train
train_data, test_data, train_labels, test_labels = train_test_split(spmat, y_data, test_size=0.10, random_state=42)  

In [29]:
print(len(x_tokens))

83030


### Linear Regression

In [47]:
# Train model
lr = LinearRegression()
lr.fit(train_data, train_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [48]:
#Create predictions and evaluate
pred_labels = lr.predict(test_data)
print("Training set score: {:.2f}".format(lr.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(lr.score(test_data, test_labels)))

Training set score: 0.39
Test set score: -0.18
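
The scores printed above come from `score()`, which for scikit-learn regressors returns the coefficient of determination R². A negative test value means the model does worse on held-out data than always predicting the mean of the target. The sketch below computes R² by hand on made-up values to show how that can happen; the numbers are illustrative, not taken from the corpus.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical win_differential values and two sets of predictions.
y_true = [0.1, 0.2, 0.9, 1.0]
mean_baseline = [0.55] * 4            # always predict the mean of y_true
overfit_guess = [0.9, 1.0, 0.1, 0.2]  # confidently wrong predictions

print(r_squared(y_true, mean_baseline))
print(r_squared(y_true, overfit_guess))
```

The mean baseline scores approximately zero, while the confidently wrong predictions score well below zero.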


In [109]:
# Look at top scoring words
n = 20
print(lr.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(lr.coef_)[-n:]
print("Top feats:")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(lr.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(25816,)
Top feats:
pairing roadrunner faaaaast schiavo hinders okayish cursory physifally sleuths tri-state yrs kalifa chamionship nintendo va pleaseee baaaaaad comings sported urinatingtree
Bottom feats:
namath's richardwashington toothpaste taunts compulsively preference syrup exhange skeletons described sportsbook nord septum virgil hahahahahahahahahahahahahahahahahahahahahahahahahahahahahhahahahajahahajhahajajajajajjahahahahhahahahahahahahhahahahahhahahahahajhajajajhahahhahah economy sturgill identifiable swayed chevrolet


### Lasso Regression

In [50]:
lasso = Lasso(alpha=0.0002)
lasso.fit(train_data, train_labels)

Lasso(alpha=0.0002, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [51]:
#Create predictions and evaluate
pred_labels = lasso.predict(test_data)
print("Training set score: {:.2f}".format(lasso.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(lasso.score(test_data, test_labels)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

Training set score: 0.06
Test set score: 0.05
Number of features used: 239


In [110]:
# Look at top scoring words
m = 20
n = np.sum(lasso.coef_ != 0)
if m < n:
    n = m
print(lasso.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(lasso.coef_)[-n:]
print("Top feats:")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(lasso.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(25816,)
Top feats:
achilles end hundley angle camera jordy ap fox packers capers onside siemian obj mcadoo adams trevathan bears glennon gg fog
Bottom feats:
foles ebron shazier bounds penalty penalties tackle kamara ab gronk ben drive eagles weak gruden vernon coverage stafford todd dak


### Ridge Regression

In [63]:
rdg = Ridge(alpha=5)
rdg.fit(train_data, train_labels)

Ridge(alpha=5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [64]:
#Create predictions and evaluate
pred_labels = rdg.predict(test_data)
print("Training set score: {:.2f}".format(rdg.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(rdg.score(test_data, test_labels)))
#print("Number of features used: {}".format(np.sum(clf.coef_ != 0)))

Training set score: 0.25
Test set score: 0.09


In [111]:
# Look at top scoring words
n = 20
print(rdg.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(rdg.coef_)[-n:]
print("Top Feats")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(rdg.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(25816,)
Top Feats
naz mcadoo hyde marshall touchback legion postedhyperlinkvalueei=7bqqwq2cbssq_qbb1jtobg&q=28-3+falcons&oq=28-3&gs_l=postedhyperlinkvalue achilles doink siemian shutout trevathan callahan glennon jordy onside gg colt postedhyperlinkvaluev=fr9uj_ayayq&feature=postedhyperlinkvaluet=10s fog
Bottom feats:
vernon kalil mccourty quin hightower pagano roberts bersin foles cart foreman remmers morelli dupree week's strief trufant barry tree jameis


### ElasticNet Regression

In [97]:
elnet = ElasticNet(alpha=0.0001, l1_ratio=0.25)
elnet.fit(train_data, train_labels)

ElasticNet(alpha=0.0001, copy_X=True, fit_intercept=True, l1_ratio=0.25,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

In [98]:
#Create predictions and evaluate
pred_labels = elnet.predict(test_data)
print("Training set score: {:.2f}".format(elnet.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(elnet.score(test_data, test_labels)))
print("Number of features used: {}".format(np.sum(elnet.coef_ != 0)))

Training set score: 0.13
Test set score: 0.09
Number of features used: 2534


In [112]:
# Look at top scoring words
m = 20
n = np.sum(elnet.coef_ != 0)
if m < n:
    n = m
print(elnet.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(elnet.coef_)[-n:]
print("Top Feats")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(elnet.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(25816,)
Top Feats
turkey touchback obj parker marshall van ap bears shutout mcadoo colt postedhyperlinkvaluev=fr9uj_ayayq&feature=postedhyperlinkvaluet=10s siemian achilles trevathan jordy onside glennon gg fog
Bottom feats:
kalil vernon quin hightower foles roberts lattimore ebron shazier mccourty anthem pagano runoff ab bersin barry trufant carrie apple dupree


### Rerun regressions against the full corpus of comments

In [113]:
use_data, comments, comment_list, x_tokens = make_data(data)

In [67]:
print(len(x_tokens))

588378


### Set dependent variable

In [114]:
target_var = 'win_differential'
y_data = use_data.loc[:, target_var]

### Vectorize data and split into train and test

We will require that each word appear in at least five comments, in order to make the problem a bit more tractable. The full, unfiltered vocabulary is ~73,000 words.

In [115]:
vectorizer = CountVectorizer(analyzer='word', stop_words='english', tokenizer=lambda text: text, 
                             lowercase=False, binary=True, min_df=5)
spmat = vectorizer.fit_transform(x_tokens)
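
The `min_df=5` setting filters on document frequency: a word is kept only if it occurs in at least five distinct comments, no matter how often it repeats inside one comment. The sketch below shows the same idea by hand on made-up data, with a threshold of 2 standing in for 5; the helper name is hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(token_lists, min_df):
    """Keep only tokens that occur in at least `min_df` distinct comments."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per comment
    keep = {tok for tok, df in doc_freq.items() if df >= min_df}
    return [[tok for tok in tokens if tok in keep] for tokens in token_lists], keep

# Hypothetical pre-tokenized comments; min_df=2 stands in for the notebook's min_df=5.
toy = [["fog", "game"], ["fog", "fog", "lol"], ["game", "refs"]]
filtered, vocab = filter_by_document_frequency(toy, min_df=2)
print(sorted(vocab))
print(filtered)
```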

In [79]:
print(spmat.shape)

(588378, 20293)


In [80]:
# Split into test and train
train_data, test_data, train_labels, test_labels = train_test_split(spmat, y_data, test_size=0.10, random_state=42)  

### Linear Regression

In [81]:
# Train model
lr2 = LinearRegression()
lr2.fit(train_data, train_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [82]:
#Create predictions and evaluate
pred_labels = lr2.predict(test_data)
print("Training set score: {:.2f}".format(lr2.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(lr2.score(test_data, test_labels)))

Training set score: 0.12
Test set score: 0.04


In [116]:
# Look at top scoring words
n = 20
print(lr2.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(lr2.coef_)[-n:]
print("Top Feats")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(lr2.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(20293,)
Top Feats
ruby postedhyperlinkvaluepostedhyperlinkvalue naz unified sorrow smithfest postedhyperlinkvalueei=7bqqwq2cbssq_qbb1jtobg&q=28-3+falcons&oq=28-3&gs_l=postedhyperlinkvalue mmmphmmgofgpsngfjg panting wilfs recruits try-hard selena mcfucked plagued mcaddo blossoms cbssports suckle postedhyperlinkvaluewiki_political_.2f_religious_comments
Bottom feats:
proverbs addlepated septum ヽ pagano's ʖ kat dissonance +DGDG:DGDG fortnight demean lisa's ving truf gano's alualu wendell utc usain eifert's


### Lasso Regression

In [84]:
lasso2 = Lasso(alpha=0.0002)
lasso2.fit(train_data, train_labels)

Lasso(alpha=0.0002, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [85]:
#Create predictions and evaluate
pred_labels = lasso2.predict(test_data)
print("Training set score: {:.2f}".format(lasso2.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(lasso2.score(test_data, test_labels)))
print("Number of features used: {}".format(np.sum(lasso2.coef_ != 0)))

Training set score: 0.02
Test set score: 0.02
Number of features used: 83


In [117]:
# Look at top scoring words
m = 20
n = np.sum(lasso2.coef_ != 0)
if m < n:
    n = m
print(lasso2.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(lasso2.coef_)[-n:]
print("Top Feats")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(lasso2.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(20293,)
Top Feats
won romo season hundley siemian game angle team rodgers broncos mcadoo giants trevathan fumble adams packers gg bears fog glennon
Bottom feats:
jags watson penalties bortles penalty gronk shazier flag bengals smith refs tackle drive commercial flags anthem hate catch pi half


### Ridge Regression

In [87]:
rdg2 = Ridge(alpha=5)
rdg2.fit(train_data, train_labels)

Ridge(alpha=5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [88]:
#Create predictions and evaluate
pred_labels = rdg2.predict(test_data)
print("Training set score: {:.2f}".format(rdg2.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(rdg2.score(test_data, test_labels)))

Training set score: 0.11
Test set score: 0.06


In [118]:
# Look at top scoring words
n = 20
print(rdg2.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(rdg2.coef_)[-n:]
print("Top Feats")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(rdg2.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(20293,)
Top Feats
vance siemien thumbs laterals joystick trevethan siemian bellamy glennon travathan postedhyperlinkvalueei=7bqqwq2cbssq_qbb1jtobg&q=28-3+falcons&oq=28-3&gs_l=postedhyperlinkvalue shutout gamblers semen shutouts pouncey trevathan paxton postedhyperlinkvaluev=fr9uj_ayayq&feature=postedhyperlinkvaluet=10s fog
Bottom feats:
clowney jurassic jeter mathieu devey usain edp bersin veldheer goldblum ving ldt starz eifert morrow dinosaurs fozzy headbutted sudfeld trufant


### ElasticNet Regression

In [90]:
elnet2 = ElasticNet(alpha=0.0001, l1_ratio=0.25)
elnet2.fit(train_data, train_labels)

ElasticNet(alpha=0.0001, copy_X=True, fit_intercept=True, l1_ratio=0.25,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

In [91]:
#Create predictions and evaluate
pred_labels = elnet2.predict(test_data)
print("Training set score: {:.2f}".format(elnet2.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(elnet2.score(test_data, test_labels)))
print("Number of features used: {}".format(np.sum(elnet2.coef_ != 0)))

Training set score: 0.05
Test set score: 0.05
Number of features used: 1022


In [119]:
# Look at top scoring words
m = 20
n = np.sum(elnet2.coef_ != 0)
if m < n:
    n = m
print(elnet2.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(elnet2.coef_)[-n:]
print("Top Feats")
print(" ".join(feature_names[j] for j in top_feats))

bottom_feats = np.argsort(elnet2.coef_)[:n]
print("Bottom feats:")
print(" ".join(feature_names[j] for j in bottom_feats))

(20293,)
Top Feats
turkey jordy fox xp lambeau semen marshall shutout adams thumbs trubisky lightning onside gg achilles bears siemian trevathan glennon fog
Bottom feats:
ebron vernon clowney kamara watson lattimore larry shazier foles intro carrie legs deuce computer texans anthem palmer cooper trump mixon
