# Predicting beer type

This notebook predicts a beer's style from its measured characteristics: alcohol by volume (ABV) and International Bitterness Units (IBU).

The data come from a publicly available [Kaggle dataset on craft beer](https://www.kaggle.com/nickhould/craft-cans).

# Part I : Data, Wrangling, & EDA

In [2]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_fscore_support

In [3]:
assert pd
assert np
assert SVC
assert confusion_matrix
assert classification_report
assert precision_recall_fscore_support

In [4]:
breweries = pd.read_csv("data/breweries.csv")
beers = pd.read_csv("data/beers.csv")

In [5]:
assert breweries.shape == (558, 4)
assert beers.shape == (2410, 8)

In [6]:
breweries.head()

| | Unnamed: 0 | name | city | state |
|---|---|---|---|---|
| 0 | 0 | NorthGate Brewing | Minneapolis | MN |
| 1 | 1 | Against the Grain Brewery | Louisville | KY |
| 2 | 2 | Jack's Abby Craft Lagers | Framingham | MA |
| 3 | 3 | Mike Hess Brewing Company | San Diego | CA |
| 4 | 4 | Fort Point Beer Company | San Francisco | CA |


In [7]:
beers.head()

| | Unnamed: 0 | abv | ibu | id | name | style | brewery_id | ounces |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.05 | NaN | 1436 | Pub Beer | American Pale Lager | 408 | 12.0 |
| 1 | 1 | 0.066 | NaN | 2265 | Devil's Cup | American Pale Ale (APA) | 177 | 12.0 |
| 2 | 2 | 0.071 | NaN | 2264 | Rise of the Phoenix | American IPA | 177 | 12.0 |
| 3 | 3 | 0.09 | NaN | 2263 | Sinister | American Double / Imperial IPA | 177 | 12.0 |
| 4 | 4 | 0.075 | NaN | 2262 | Sex and Candy | American IPA | 177 | 12.0 |


In [8]:
# Count missing values per column (stored for reference; not displayed).
null_beers = beers.isnull().sum()

In [9]:
# Drop beers that are missing any of the fields used below.
beers = beers.dropna(subset=['style', 'abv', 'ibu'])

In [11]:
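# Left-join brewery details onto each beer. With no `on` argument, pandas joins
# on every shared column name (here "Unnamed: 0" and "name"). Those columns do
# not identify a brewery, so `city` and `state` are empty in the rows shown below.
beer_df = beers.merge(breweries, how="left")
beer_df.head()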

| | Unnamed: 0 | abv | ibu | id | name | style | brewery_id | ounces | city | state |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14 | 0.061 | 60.0 | 1979 | Bitter Bitch | American Pale Ale (APA) | 177 | 12.0 | NaN | NaN |
| 1 | 21 | 0.099 | 92.0 | 1036 | Lower De Boom | American Barleywine | 368 | 8.4 | NaN | NaN |
| 2 | 22 | 0.079 | 45.0 | 1024 | Fireside Chat | Winter Warmer | 368 | 12.0 | NaN | NaN |
| 3 | 24 | 0.044 | 42.0 | 876 | Bitter American | American Pale Ale (APA) | 368 | 12.0 | NaN | NaN |
| 4 | 25 | 0.049 | 17.0 | 802 | Hell or High Watermelon Wheat (2009) | Fruit / Vegetable Beer | 368 | 12.0 | NaN | NaN |
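
The `city` and `state` columns above hold no values because the merge matched on the shared `Unnamed: 0` and `name` columns rather than on a brewery key. The sketch below shows one way to join on the brewery instead. It assumes that `brewery_id` in the beers table refers to the `Unnamed: 0` column of the breweries table, which the data suggest but the notebook does not confirm. The tables and all values in them are small hypothetical stand-ins, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; values are illustrative only.
beers_toy = pd.DataFrame({
    "name": ["Beer A", "Beer B", "Beer C"],
    "brewery_id": [1, 0, 1],
})
breweries_toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "name": ["Brewery Zero", "Brewery One"],
    "city": ["Springfield", "Shelbyville"],
    "state": ["AA", "BB"],
})

# Assumption: "Unnamed: 0" is the brewery key that brewery_id points to.
# Renaming also keeps the brewery name from colliding with the beer name.
breweries_keyed = breweries_toy.rename(
    columns={"Unnamed: 0": "brewery_id", "name": "brewery_name"}
)

# Join explicitly on the brewery key so each beer picks up its brewery's details.
merged_toy = beers_toy.merge(breweries_keyed, how="left", on="brewery_id")
print(merged_toy)
```

Because the model below uses only `abv`, `ibu`, and `style`, the classification results do not depend on which merge is used.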


In [12]:
beer_df.describe()

| | Unnamed: 0 | abv | ibu | id | brewery_id | ounces |
|---|---|---|---|---|---|---|
| count | 1403.0 | 1403.0 | 1403.0 | 1403.0 | 1403.0 | 1403.0 |
| mean | 1241.128297 | 0.059919 | 42.739843 | 1413.88881 | 223.375624 | 13.510264 |
| std | 691.675612 | 0.013585 | 25.962692 | 757.572191 | 150.38751 | 2.254112 |
| min | 14.0 | 0.027 | 4.0 | 1.0 | 0.0 | 8.4 |
| 25% | 681.5 | 0.05 | 21.0 | 771.0 | 95.5 | 12.0 |
| 50% | 1228.0 | 0.057 | 35.0 | 1435.0 | 198.0 | 12.0 |
| 75% | 1864.5 | 0.068 | 64.0 | 2068.5 | 350.0 | 16.0 |
| max | 2408.0 | 0.125 | 138.0 | 2692.0 | 546.0 | 32.0 |


In [13]:
beer_counts = beer_df["style"].value_counts()
print(beer_counts)

American IPA                          301
American Pale Ale (APA)               153
American Amber / Red Ale               77
American Double / Imperial IPA         75
American Blonde Ale                    61
                                     ... 
Roggenbier                              1
Smoked Beer                             1
Euro Pale Lager                         1
Other                                   1
American Double / Imperial Pilsner      1
Name: style, Length: 90, dtype: int64


In [14]:
# Keep only the four most common styles as prediction targets.
styles = list(beer_counts.index[:4])
beer_df = beer_df[beer_df["style"].isin(styles)]

# Part II : Prediction Model

In [15]:
# Sizes for an 80/20 train/test split. The split is taken by row order, not shuffled.
num_training = int(0.8 * beer_df.shape[0])
num_testing = beer_df.shape[0] - num_training

In [16]:
beer_X = beer_df[["abv", "ibu"]]
beer_Y = np.array(beer_df["style"])

In [17]:
# First 80% of rows for training, remaining 20% for testing.
beer_train_X = beer_X.iloc[:num_training]
beer_test_X = beer_X.iloc[num_training:]
beer_train_Y = beer_Y[:num_training]
beer_test_Y = beer_Y[num_training:]
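
Because the split above follows row order, the class mix in the two parts can differ. A shuffled, stratified split is one common alternative. The sketch below uses scikit-learn's `train_test_split` on hypothetical toy data. The `toy_*` names, the feature values, and `random_state=0` are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows for each of two styles.
toy_X = pd.DataFrame({
    "abv": np.linspace(0.04, 0.09, 20),
    "ibu": np.linspace(10, 100, 20),
})
toy_y = np.array(["American IPA"] * 10 + ["American Pale Ale (APA)"] * 10)

# Shuffle before splitting, and stratify so that both parts keep the
# same class proportions as the full data set.
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

print(Counter(toy_train_y))
print(Counter(toy_test_y))
```

With real data, the same call would take `beer_X` and `beer_Y` in place of the toy arrays. The metrics reported later in this notebook come from the row-order split, so they would change under a different split.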

In [18]:
def train_SVM(X, y, kernel='linear'):
    """Fit a support vector classifier on features X and labels y, and return it."""
    clf = SVC(kernel=kernel)
    clf.fit(X, y)

    return clf
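
The two features sit on very different scales (in the summary above, `abv` has a standard deviation near 0.01 and `ibu` near 26), and support vector machines are sensitive to feature scale. One possible variant, sketched here as an assumption rather than as the method used in this notebook, standardizes the features before fitting. The function `train_scaled_SVM` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def train_scaled_SVM(X, y, kernel="linear"):
    """Hypothetical variant of train_SVM that standardizes features first."""
    # The pipeline learns the scaling on X, then fits the SVM on the scaled X.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    clf.fit(X, y)
    return clf


# Hypothetical toy data: columns are (abv, ibu); values are illustrative only.
toy_X = np.array([
    [0.045, 20.0], [0.050, 25.0], [0.048, 30.0],
    [0.070, 65.0], [0.075, 70.0], [0.068, 80.0],
])
toy_y = np.array(["American Pale Ale (APA)"] * 3 + ["American IPA"] * 3)

toy_clf = train_scaled_SVM(toy_X, toy_y)
toy_pred = toy_clf.predict(toy_X)
print(toy_pred)
```

The pipeline applies the same scaling at prediction time, so it can be used anywhere the plain classifier is used. Whether it improves the scores on this dataset is not tested here.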

In [19]:
beer_clf = train_SVM(beer_train_X, beer_train_Y)

In [20]:
beer_predicted_train_Y = beer_clf.predict(beer_train_X)
beer_predicted_test_Y = beer_clf.predict(beer_test_X)

# Part III : Model Assessment

In [21]:
class_report_train = classification_report(beer_train_Y, beer_predicted_train_Y)
print(class_report_train)

                                precision    recall  f1-score   support

      American Amber / Red Ale       0.82      0.45      0.58        69
American Double / Imperial IPA       0.76      0.25      0.37        53
                  American IPA       0.69      0.84      0.76       236
       American Pale Ale (APA)       0.57      0.64      0.60       126

                      accuracy                           0.67       484
                     macro avg       0.71      0.54      0.58       484
                  weighted avg       0.69      0.67      0.65       484



In [22]:
class_report_test = classification_report(beer_test_Y, beer_predicted_test_Y)
print(class_report_test)

                                precision    recall  f1-score   support

      American Amber / Red Ale       0.62      0.62      0.62         8
American Double / Imperial IPA       0.78      0.32      0.45        22
                  American IPA       0.70      0.72      0.71        65
       American Pale Ale (APA)       0.55      0.78      0.65        27

                      accuracy                           0.66       122
                     macro avg       0.66      0.61      0.61       122
                  weighted avg       0.68      0.66      0.64       122



In [23]:
conf_mat_train = confusion_matrix(beer_train_Y, beer_predicted_train_Y)
print(conf_mat_train)

[[ 31   1  10  27]
 [  0  13  40   0]
 [  0   3 198  35]
 [  7   0  38  81]]


In [24]:
conf_mat_test = confusion_matrix(beer_test_Y, beer_predicted_test_Y)
print(conf_mat_test)

[[ 5  0  2  1]
 [ 1  7 14  0]
 [ 0  2 47 16]
 [ 2  0  4 21]]
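
The per-class figures in the test classification report can be recovered from this matrix. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, with classes in sorted label order. That order is the one the report uses: Amber / Red Ale, Double / Imperial IPA, IPA, Pale Ale. The sketch below hard-codes the test matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix copied from the output above
# (rows: true style, columns: predicted style).
conf_mat = np.array([
    [5, 0, 2, 1],
    [1, 7, 14, 0],
    [0, 2, 47, 16],
    [2, 0, 4, 21],
])

# Correct predictions for each class sit on the diagonal.
true_positives = np.diag(conf_mat)

# Recall: correct predictions divided by the number of true members (row sums).
recall = true_positives / conf_mat.sum(axis=1)

# Precision: correct predictions divided by the number of predictions (column sums).
precision = true_positives / conf_mat.sum(axis=0)

# Accuracy: all correct predictions divided by all predictions.
accuracy = true_positives.sum() / conf_mat.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 2))
```

The second row shows where the model struggles most: 14 of the 22 true American Double / Imperial IPAs are predicted as American IPA, which accounts for that class's recall of 0.32.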
