Random forest regression is used here for feature selection. By exporting the words (features) and their importances, I could manually build a list of key beer attributes to use later on. The out-of-sample RMSE was also calculated to check whether the random forest was actually a good predictor, but prediction wasn't the main motive. The analysis here starts from a cleaned dataset.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
import math

In [3]:
rev = pd.read_csv("beer_reviews_clean (2).csv")

In [30]:
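#len(value_counts()) counts the distinct non-null values; the "1 +" offset is kept as in the original analysis
print('This dataset contains reviews and scores on', 1 + len(rev['name'].value_counts()), 'unique craft beers that come from', 1 + len(rev['brewery'].value_counts()), 'unique breweries.')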

This dataset contains reviews and scores on 250 unique craft beers that come from 85 unique breweries.


In [24]:
rev[:5]

Unnamed: 0,score,Review,look,smell,taste,feel,name,brewery
0,4.9,Wow. More than lucky to be able to split a bot...,5.0,5.0,4.75,5.0,Kentucky Brunch Brand Stout,Toppling Goliath Brewing Company
1,4.95,L: Dinosaur blood with a chocolate milk head. ...,5.0,5.0,5.0,5.0,Kentucky Brunch Brand Stout,Toppling Goliath Brewing Company
2,5.0,"Latest batch. Most complex beer I've had, hand...",5.0,5.0,5.0,5.0,Kentucky Brunch Brand Stout,Toppling Goliath Brewing Company
3,4.67,Did someone say there was maple in the beer??4...,4.5,5.0,4.5,4.5,Kentucky Brunch Brand Stout,Toppling Goliath Brewing Company
4,4.71,This beer turned out to be everything I wanted...,4.75,5.0,4.5,4.75,Kentucky Brunch Brand Stout,Toppling Goliath Brewing Company


In [4]:
#dropping stopwords and creating a sparse matrix
cv = CountVectorizer(stop_words='english')
cv.fit(rev["Review"].values)
spmx = cv.transform(rev["Review"].values)
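
To make the vectorizing step concrete, here is a minimal sketch on two made-up reviews (the `toy_` names and the sentences are hypothetical, not taken from the dataset). It shows how `CountVectorizer(stop_words='english')` drops common words and turns what is left into a matrix of word counts, one row per review and one column per word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, used only for illustration
toy_reviews = [
    "This is the best of the best",
    "It was decent and amazing",
]

toy_cv = CountVectorizer(stop_words="english")
# fit_transform learns the vocabulary and builds the sparse count matrix in one step
toy_matrix = toy_cv.fit_transform(toy_reviews)

# vocabulary_ maps each kept word to its column index (columns are in alphabetical order)
toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)
print(toy_matrix.toarray())
```

Each column of the matrix later becomes one feature of the random forest, which is why the feature importances can be read as word importances.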

In [7]:
x1, x2, y1, y2 = train_test_split(spmx, rev["score"], test_size = .2, random_state = 42)

In [21]:
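#tuning hyperparameters with GridSearchCV to find the best parameters for the random forest regression
#Note: max_features="auto" was removed in scikit-learn 1.3; for regressors it meant "use all features" (equivalent to 1.0)
parameters = {"max_depth":range(2, 7), "max_features": ["auto", "sqrt", "log2"],
             "n_estimators":[10, 100, 150, 200, 300, 350, 500, 750, 1000]}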

rf = RandomForestRegressor(n_jobs = -1)
clf = GridSearchCV(rf, parameters)
clf.fit(x1, y1)
clf.best_params_

{'max_depth': 6, 'max_features': 'auto', 'n_estimators': 300}
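
As a rough guide to how expensive this search is, the sketch below counts the parameter combinations in the grid with `ParameterGrid`. The number of folds is an assumption: `GridSearchCV` is called without a `cv` argument above, and the default is 5 folds in scikit-learn 0.22 and later (3 in earlier versions).

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
parameters = {
    "max_depth": range(2, 7),
    "max_features": ["auto", "sqrt", "log2"],
    "n_estimators": [10, 100, 150, 200, 300, 350, 500, 750, 1000],
}

# ParameterGrid only enumerates combinations; it does not fit anything
n_candidates = len(ParameterGrid(parameters))

cv_folds = 5  # assumption: the default number of folds in scikit-learn >= 0.22
n_fits = n_candidates * cv_folds

print(n_candidates, "candidate settings,", n_fits, "model fits")
```

Note that the selected `max_depth` of 6 is the upper bound of the searched range, so a wider range might select a deeper tree.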

In [22]:
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_jobs = -1, max_depth=6, max_features="auto", n_estimators = 300)
rf.fit(x1, y1)

y2_pred = rf.predict(x2)

#The RMSE shows that the model is not the most accurate at predicting the overall score on the 5-point scale.
#That's okay, because prediction isn't the goal here. I'm more concerned with which features/words are important.
math.sqrt(mean_squared_error(y2, y2_pred))

0.33283008064106157
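
One way to put an RMSE in context is to compare it with a naive baseline that always predicts the mean training score. The sketch below does this on made-up numbers (the `toy` arrays and the `rmse` helper are hypothetical and are not the notebook's data), so it only illustrates the comparison and says nothing about the result above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical scores on the 5-point scale
y_train_toy = np.array([4.0, 4.5, 5.0, 4.5])
y_test_toy = np.array([4.0, 5.0])

# Baseline: predict the mean of the training scores for every test review
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())
baseline_rmse = rmse(y_test_toy, baseline_pred)

print(baseline_rmse)
```

With the real split, the same idea would compare `rmse(y2, y2_pred)` against `rmse(y2, np.full(len(y2), y1.mean()))`; a model RMSE close to the baseline would suggest the words add little predictive power.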

In [48]:
#Pairing each word with its importance in the random forest and sorting by importance shows which words in the reviews matter most
#Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out() instead
impdf = pd.DataFrame(list(zip(list(rf.feature_importances_), cv.get_feature_names())))
sorted_df = impdf[impdf[0] > 0].sort_values(by=0, ascending = False) #.to_csv("View_of_Importances.csv")
sorted_df.columns = ['Importance', 'Word']
sorted_df[:15]

Unnamed: 0,Importance,Word
1974,0.129566,best
7275,0.120595,infected17
3005,0.116865,charactersdiemilio
5341,0.035925,drain
9212,0.035417,perfect
1428,0.027395,amazing
7274,0.025266,infected
9297,0.021825,pillars
12848,0.021692,zombier
4893,0.020254,decent
