#### Questions:
#### How should the beer style coefficients in the regression model be interpreted? (Treat style as one categorical variable; do not remove individual beer styles.)
#### If we use beer style, should we balance the samples, e.g. by removing the top 5 and bottom 5 beers? (No.)
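
One possible way to read the style coefficients, sketched on made-up data: if one style is dropped from the one-hot encoding, it becomes the reference level, and every remaining coefficient is the expected rating difference relative to that reference with the other features held fixed. The toy ratings and the choice of `drop_first=True` are assumptions for illustration, not the notebook's actual setup.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made up): three styles with mean ratings 3.0, 3.5 and 4.0.
toy = pd.DataFrame({
    'beer/style': ['Altbier'] * 4 + ['Tripel'] * 4 + ['Witbier'] * 4,
    'rating': [3.0] * 4 + [3.5] * 4 + [4.0] * 4,
})

# drop_first=True removes the first level (Altbier), which becomes the reference.
X_toy = pd.get_dummies(toy[['beer/style']], columns=['beer/style'],
                       prefix=['style'], drop_first=True).astype(float)

reg = LinearRegression().fit(X_toy, toy['rating'])
coefs = dict(zip(X_toy.columns, reg.coef_))

# The intercept is the reference style's mean rating; each coefficient is
# that style's mean rating minus the reference style's mean rating.
print(reg.intercept_, coefs)
```

When all style dummies are kept alongside an intercept, the columns are collinear, so individual coefficients are only meaningful relative to each other.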

#### Group beer styles by the type of yeast used? (ale, lager, hybrid)
#### Ales ferment faster and are more aromatic and fruity
#### Lagers ferment more slowly and at lower temperatures to create a "hoppy" taste
#### Hybrids combine ale and lager techniques
#### https://www.beeradvocate.com/beer/style/

#### How should columns be selected for the regression? Lasso?
#### Split review sentiment by sentence: find the sentences that contain synonyms for each rating dimension
#### Create aroma sentiment, appearance sentiment, etc.
#### Add an interaction between age and beer style
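
A minimal sketch of the Lasso idea, assuming synthetic data and an arbitrary `alpha=0.1`: the L1 penalty shrinks the coefficients of uninformative columns to exactly zero, and `SelectFromModel` keeps the rest. In practice `alpha` would need tuning, for example with the `GridSearchCV` imported below.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 6))
# Only columns 0 and 3 drive the synthetic target; the rest are noise.
y_toy = 2.0 * X_toy[:, 0] - 3.0 * X_toy[:, 3] + rng.normal(scale=0.1, size=200)

# Fit a Lasso and keep the columns whose coefficients were not shrunk to zero.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X_toy, y_toy)
kept = np.flatnonzero(selector.get_support())
print(kept)

# Reduced feature matrix containing only the selected columns.
X_selected = selector.transform(X_toy)
```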

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import webcolors

from datetime import datetime
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold

from textblob import TextBlob, Word



Baseline features
1. beer/ABV - the alcohol by volume of the beer
2. beer/style
3. user's gender
4. user's age in years

Extra features from review text
1. sentiment of the review
2. adjectives
3. adverbs
4. verbs
5. colors

#### Filling in missing values
#### birthdays: use the mean birthday timestamp (i.e. the average age)
#### beer style: use the placeholder string 'missing'
#### review/text: use an empty string

In [2]:
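def convertUnixTimeToYears(unixTimes):
    """Convert birthdays given as Unix timestamps to ages in whole years."""
    ageInYears = []
    today = datetime.now()

    for unixTime in unixTimes:
        birthdate = datetime.fromtimestamp(int(unixTime))
        delta = today - birthdate
        # floor division keeps whole years (leap days are ignored)
        years = delta.days // 365
        ageInYears.append(years)

    return ageInYears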
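
A hedged alternative for the age conversion, assuming fractional ages are acceptable: adding a `timedelta` to the epoch avoids `datetime.fromtimestamp`, which can fail on negative timestamps (birthdays before 1970) on some platforms, and dividing by 365.25 accounts for leap years. The function name `unix_time_to_age` and the fixed reference date are made up for the example.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def unix_time_to_age(unix_time, today):
    """Age in fractional years for a birthday given as seconds since the epoch."""
    birthdate = EPOCH + timedelta(seconds=int(unix_time))
    return (today - birthdate).days / 365.25

# A fixed reference date keeps the result reproducible.
reference = datetime(2017, 1, 1)
age = unix_time_to_age(-315619200, reference)  # birthday: 1960-01-01
print(age)
```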

In [3]:
def fill_missing_values(df):
    df['beer/style'] = df['beer/style'].fillna('missing')
    df['review/text'] = df['review/text'].fillna('')
    df['user/birthdayUnix'] = df['user/birthdayUnix'].fillna(np.mean(df['user/birthdayUnix']))

In [4]:
def read_type(filename):
    all_ales = []
    with open(filename, 'r') as f:
        for line in f:
            name = line.lower().replace('/','').replace('(','').replace(')','').strip()
            name = re.sub(' +',' ',name)
            all_ales.append(name)
            
    return all_ales

In [5]:
def assign_beer_category(df):
    all_ales = read_type('ales.txt')
    all_lagers = read_type('lagers.txt')
    all_hybrids = read_type('hybrids.txt')

    category = []

    for style in df['beer/style']:
        style = style.lower().replace('/','').replace('(','').replace(')','')
        style = re.sub(' +',' ',style)

        if style in all_ales:
            category.append('ale')

        elif style in all_lagers or 'oktoberfest' in style or \
        'keller bier zwickel bier' in style:
            category.append('lager')

        elif style in all_hybrids:
            category.append('hybrid')

        else:
            category.append('other')

    df['beer/category'] = category
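
The same string clean-up appears in both `read_type` and `assign_beer_category`. A small sketch of how it could be shared; the helper name `normalize_style` is hypothetical and not part of the notebook.

```python
import re

def normalize_style(name):
    """Lower-case a style name, drop '/', '(' and ')', and collapse runs of spaces."""
    name = name.lower().replace('/', '').replace('(', '').replace(')', '').strip()
    return re.sub(' +', ' ', name)

print(normalize_style('Keller Bier / Zwickel Bier'))
print(normalize_style('Scotch Ale / Wee Heavy'))
```

Storing the normalized names in a `set` rather than a list would also make the `style in all_ales` lookups constant-time.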

In [178]:
# extract average sentiment by sentence
def extract_sentiment(corpus):
    sentiment = []
    
    for text in corpus:
        curr = []
        sentences = TextBlob(text).sentences
        
        for sentence in sentences:
            curr.append(sentence.sentiment.polarity)
            
        if len(curr) == 0:
            curr.append(TextBlob(text).sentiment.polarity)
            
        sentiment.append(np.mean(curr))
    
    return sentiment
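
The per-dimension sentiment idea from the questions at the top could start from something like the sketch below, which groups sentences by the rating dimension they mention. The keyword lists are hypothetical placeholders, and the regex sentence splitter is a stand-in for `TextBlob(text).sentences`.

```python
import re

# Hypothetical keyword lists; the real synonyms would need to be curated.
ASPECT_KEYWORDS = {
    'appearance': {'pours', 'color', 'head', 'lacing'},
    'aroma': {'aroma', 'smell', 'nose'},
    'taste': {'taste', 'flavor', 'finish'},
}

def sentences_by_aspect(text):
    """Group the sentences of a review by the rating dimension they mention."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    grouped = {aspect: [] for aspect in ASPECT_KEYWORDS}
    for sentence in sentences:
        words = set(re.findall(r'[a-z]+', sentence.lower()))
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:
                grouped[aspect].append(sentence)
    return grouped

review = 'Pours a hazy gold with a thick head. The aroma is citrus and pine. Bitter finish.'
grouped = sentences_by_aspect(review)
print(grouped)
```

Each group of sentences could then be scored with `sentiment.polarity`, as `extract_sentiment` does for whole reviews, to obtain one sentiment feature per rating dimension.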

In [198]:
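# fits one linear regression per rating and writes its test predictions into ytest
def format_predictions(X, df, ratings, Xtest, ytest):
    for rating in ratings:
        y = df[rating]
        reg = LinearRegression()
        reg.fit(X, y)
        ytest[rating] = reg.predict(Xtest)

    # the original returned an undefined name (`result`)
    return ytest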

In [122]:
# removes rows containing beer styles from train that are not in test
def remove_different_styles(train, test):
    testStyles = test['beer/style'].unique()
    trainStyles = train['beer/style'].unique()
    diffStyles = np.setdiff1d(trainStyles, testStyles)
    train = train[~train['beer/style'].isin(diffStyles)]
    
    return train

In [164]:
# extracts user's age in years, review polarity, and one-hot style/category features
def extract_features(df, cols_keep):
    fill_missing_values(df)
    # copy the slice so the assignments below do not raise SettingWithCopyWarning
    df = df[cols_keep].copy()
    df['userAgeInYears'] = convertUnixTimeToYears(df['user/birthdayUnix'])
    df['review_polarity'] = extract_sentiment(df['review/text'])
    # placeholder: this duplicates userAgeInYears and is not yet an age x style interaction
    df['style/interaction'] = df['userAgeInYears']
    assign_beer_category(df)

    df = pd.get_dummies(df, columns=["beer/style", "beer/category"], prefix=["style", "category"])
    df = df.drop(['user/birthdayUnix', 'review/text'], axis=1)

    return df

In [87]:
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
df = pd.read_csv('train.csv', index_col=0)
ratings = ['review/appearance','review/aroma','review/overall','review/palate','review/taste']
cols_keep = ['beer/style', 'user/birthdayUnix', 'review/text', 'beer/ABV']
len(df)

37500

In [7]:
fill_missing_values(df)
len(df)

37500

### Convert beer style to numerical features via one-hot encoding
#### 96 features used

In [15]:
X = df[["beer/style", 'beer/ABV', 'userAgeInYears']]
X = pd.get_dummies(X, columns=["beer/style"], prefix=["style"])

In [16]:
X.head()

index,beer/ABV,userAgeInYears,style_Altbier,style_American Adjunct Lager,style_American Amber / Red Ale,style_American Amber / Red Lager,style_American Barleywine,style_American Black Ale,style_American Blonde Ale,style_American Brown Ale,...,style_Scotch Ale / Wee Heavy,style_Scottish Ale,style_Scottish Gruit / Ancient Herbed Ale,style_Smoked Beer,style_Tripel,style_Vienna Lager,style_Weizenbock,style_Wheatwine,style_Winter Warmer,style_Witbier
40163,5.0,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8135,11.0,40,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10529,4.7,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44610,4.4,41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
37062,4.4,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Simple linear regression on each rating with 5-fold cross-validation
#### Results are scored by negative mean squared error (values closer to 0 are better)

In [17]:
results = {}
for rating in ratings:
    reg = LinearRegression()
    scores = cross_val_score(reg, X, base[rating], cv=5, scoring='neg_mean_squared_error')
    results[rating] = np.mean(scores)

print (results)

{'review/appearance': -0.26220314119076804, 'review/taste': -0.36117809090072489, 'review/palate': -0.32917508759985198, 'review/overall': -0.40413812207636352, 'review/aroma': -0.31070561244745604}
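
The scores above are negated MSEs, so they are easier to judge against a reference point. A sketch on synthetic data (not the beer data) that flips the sign back and compares against a mean-only baseline; `cv_mse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 3))
y_toy = X_toy[:, 0] + rng.normal(scale=0.5, size=300)

def cv_mse(model, X, y):
    # scikit-learn maximises scores, so MSE is reported negated; flip the sign back
    scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')
    return -np.mean(scores)

# Baseline that always predicts the training mean of the target.
baseline_mse = cv_mse(DummyRegressor(strategy='mean'), X_toy, y_toy)
model_mse = cv_mse(LinearRegression(), X_toy, y_toy)

# The square root (RMSE) is in the same units as the rating itself.
print(baseline_mse, model_mse, np.sqrt(model_mse))
```

A model is only useful to the extent that its MSE falls below the baseline's.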


#### Add review sentiment to features
#### polarity = how negative (-1) to positive (+1) the review is
#### subjectivity = how opinion-based (1) versus factual (0) the review is; only polarity is added in the next cell

In [19]:
X['review_polarity'] = extract_sentiment(base['review/text'])

In [20]:
X.head()

index,beer/ABV,userAgeInYears,style_Altbier,style_American Adjunct Lager,style_American Amber / Red Ale,style_American Amber / Red Lager,style_American Barleywine,style_American Black Ale,style_American Blonde Ale,style_American Brown Ale,...,style_Scottish Ale,style_Scottish Gruit / Ancient Herbed Ale,style_Smoked Beer,style_Tripel,style_Vienna Lager,style_Weizenbock,style_Wheatwine,style_Winter Warmer,style_Witbier,review_polarity
40163,5.0,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.017014
8135,11.0,40,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.124208
10529,4.7,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.241389
44610,4.4,41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.272917
37062,4.4,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.292449


#### Evaluate model with five-fold cross validation

In [21]:
results = {}
for rating in ratings:
    reg = LinearRegression()
    scores = cross_val_score(reg, X, base[rating], cv=5, scoring='neg_mean_squared_error')
    results[rating] = np.mean(scores)

print (results)

{'review/appearance': -0.24925654455553942, 'review/taste': -0.31783395679399157, 'review/palate': -0.29906281375914767, 'review/overall': -0.35867288499403427, 'review/aroma': -0.28618195900586135}


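#### Add an interaction between age and beer style
#### Note: the next cell only copies the age column as a placeholder; a duplicate column adds no information to a linear model, and a true interaction would multiply age by each style dummy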

In [36]:
X['style/interaction'] = X['userAgeInYears']

In [37]:
results = {}
for rating in ratings:
    reg = LinearRegression()
    scores = cross_val_score(reg, X, base[rating], cv=5, scoring='neg_mean_squared_error')
    results[rating] = np.mean(scores)

print (results)

{'review/appearance': -0.24633153583592393, 'review/taste': -0.31301450369171685, 'review/palate': -0.29438417381976045, 'review/overall': -0.35297125477615449, 'review/aroma': -0.28210070893111244}
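
For comparison, a sketch of what a real age x style interaction could look like on a toy frame: one product column per style dummy. The `age_x_` column prefix and the toy values are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'userAgeInYears': [25, 40, 33],
    'beer/style': ['Altbier', 'Tripel', 'Altbier'],
})
X_toy = pd.get_dummies(toy, columns=['beer/style'], prefix=['style'])

style_cols = [c for c in X_toy.columns if c.startswith('style_')]
for col in style_cols:
    # age x style: equals the age for rows of that style and 0 otherwise
    X_toy['age_x_' + col] = X_toy['userAgeInYears'] * X_toy[col].astype(int)

print(X_toy)
```

With these columns the model can fit a separate age slope per style. Because the interaction columns sum to `userAgeInYears`, one of them is redundant whenever the plain age column is kept.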


#### Assign beer styles to categories: ale, lager, hybrid, or other and add to feature set
#### See whether or not categorizing the beers will improve rating prediction accuracy
#### Adding categories did not improve MSE: every rating changes by less than 0.001

In [38]:
assign_beer_category(df)
X['beer/category'] = df['beer/category']
X = pd.get_dummies(X, columns=["beer/category"], prefix=["style"])
X.head()

index,beer/ABV,userAgeInYears,style_Altbier,style_American Adjunct Lager,style_American Amber / Red Ale,style_American Amber / Red Lager,style_American Barleywine,style_American Black Ale,style_American Blonde Ale,style_American Brown Ale,...,noun_phrases,adverbs,verbs,adjs_sentiment,noun_phrase_sentiment,style/interaction,style_ale,style_hybrid,style_lager,style_other
40163,5.0,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10,7,2,-0.075521,-0.080952,40,0.0,1.0,0.0,0.0
8135,11.0,40,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,13,6,3,0.020833,0.099242,40,1.0,0.0,0.0,0.0
10529,4.7,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,11,6,1,0.175,0.108333,40,1.0,0.0,0.0,0.0
44610,4.4,41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,9,3,3,-0.0125,-0.2,41,0.0,0.0,1.0,0.0
37062,4.4,40,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,10,6,4,0.328869,0.21627,40,1.0,0.0,0.0,0.0


In [39]:
results = {}
for rating in ratings:
    reg = LinearRegression()
    scores = cross_val_score(reg, X, base[rating], cv=5, scoring='neg_mean_squared_error')
    results[rating] = np.mean(scores)

print (results)

{'review/appearance': -0.24631547765124545, 'review/taste': -0.31321905367424724, 'review/palate': -0.29454723779508746, 'review/overall': -0.35299333880445272, 'review/aroma': -0.28216172562757985}


#### Output predictions on the test set

In [197]:
# DataFrame.from_csv was removed in pandas 1.0; read_csv with index_col=0 replaces it
Xtest = pd.read_csv('test.csv', index_col=0)
Xtrain = pd.read_csv('train.csv', index_col=0)

Xtrain = remove_different_styles(Xtrain, Xtest)
ytrain = Xtrain[ratings]

assert len(ytrain) == len(Xtrain)

Xtrain = extract_features(Xtrain, cols_keep) 

ytest = Xtest[ratings]
Xtest = extract_features(Xtest, cols_keep) 

assert len(ytest) == len(Xtest)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [199]:
results = {}
for rating in ratings:
    reg = LinearRegression()
    scores = cross_val_score(reg, Xtrain, ytrain[rating], cv=5, scoring='neg_mean_squared_error')
    results[rating] = np.mean(scores)

print (results)

{'review/appearance': -0.25035521701790786, 'review/taste': -0.32005822439796822, 'review/palate': -0.30060977402601641, 'review/overall': -0.36081921878661805, 'review/aroma': -0.28671190031410027}


In [203]:
format_predictions(Xtrain, ytrain, ratings, Xtest, ytest)
ytest.to_csv('results.csv')
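
`reg.predict(Xtest)` only works when `Xtest` has exactly the columns the model was fitted on. `remove_different_styles` drops training styles that are absent from the test set, but a style that appears only in the test set would still produce an extra dummy column. A hedged sketch of one way to guard against that on toy frames, using `DataFrame.reindex`:

```python
import pandas as pd

train = pd.DataFrame({'beer/ABV': [5.0, 7.2], 'beer/style': ['Altbier', 'Tripel']})
test = pd.DataFrame({'beer/ABV': [4.4, 6.0], 'beer/style': ['Witbier', 'Altbier']})

X_train = pd.get_dummies(train, columns=['beer/style'], prefix=['style'])
X_test = pd.get_dummies(test, columns=['beer/style'], prefix=['style'])

# Give the test matrix exactly the training columns, in the same order:
# styles unseen in training are dropped, styles missing from test become 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(X_test)
```

Rows whose style was dropped this way are predicted from the remaining features only, which is a design trade-off rather than a fix for the missing style.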