# Workbook : Machine Learning

For our last section workbook (so that next week you can ask questions about and work on your final projects in section), we're going to work with a dataset all about craft beer. We'll work to predict what type of beer each is based on the characteristics of that beer.

**Disclaimer**: Working with data about beer does *NOT* mean that I'm encouraging the drinking of beer by students. In fact, your professor doesn't even like beer (blech). Specifically, individuals under the age of 21 are not legally allowed to consume alcoholic beverages, but lucky for you all, that doesn't stop us from working with data on the topic!

The data we'll use here come from a publicly-available [Kaggle dataset on craft beer](https://www.kaggle.com/nickhould/craft-cans).

# Part I : Data, Wrangling, & EDA

To get started, you'll need to **import the following**:
   * `pandas` as `pd`
   * `numpy` as `np`
   * from `sklearn.svm`: `SVC` 
   * from `sklearn.metrics`: `confusion_matrix`, `classification_report`, `precision_recall_fscore_support` 

In [2]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_fscore_support


In [3]:
assert pd
assert np
assert SVC
assert confusion_matrix
assert classification_report
assert precision_recall_fscore_support

Now that you're set up in Python, **read in the `'breweries.csv'` file from the `data/` directory. Assign this to the variable `breweries`**. Then, **read in the file `beers.csv` from the `data/` directory. Assign this to the variable `beers`.**

In [5]:
import pandas as pd

# Read in the CSV files
breweries = pd.read_csv("data/breweries.csv")
beers = pd.read_csv("data/beers.csv")

# Display the first few rows to confirm successful loading
print(breweries.head())
print(beers.head())


   Unnamed: 0                       name           city state
0           0         NorthGate Brewing     Minneapolis    MN
1           1  Against the Grain Brewery     Louisville    KY
2           2   Jack's Abby Craft Lagers     Framingham    MA
3           3  Mike Hess Brewing Company      San Diego    CA
4           4    Fort Point Beer Company  San Francisco    CA
   Unnamed: 0    abv  ibu    id                 name  \
0           0  0.050  NaN  1436             Pub Beer   
1           1  0.066  NaN  2265          Devil's Cup   
2           2  0.071  NaN  2264  Rise of the Phoenix   
3           3  0.090  NaN  2263             Sinister   
4           4  0.075  NaN  2262        Sex and Candy   

                            style  brewery_id  ounces  
0             American Pale Lager         408    12.0  
1         American Pale Ale (APA)         177    12.0  
2                    American IPA         177    12.0  
3  American Double / Imperial IPA         177    12.0  
4                    American IPA         177    12.0  

In [6]:
assert breweries.shape == (558, 4)
assert beers.shape == (2410, 8)

Run the code below to take a **look at the first few rows of each dataset** to give yourself an idea of what data are included in each. Notice whether the two datasets have any columns in common.

In [7]:
breweries.head()

,Unnamed: 0,name,city,state
0,0,NorthGate Brewing,Minneapolis,MN
1,1,Against the Grain Brewery,Louisville,KY
2,2,Jack's Abby Craft Lagers,Framingham,MA
3,3,Mike Hess Brewing Company,San Diego,CA
4,4,Fort Point Beer Company,San Francisco,CA


In [8]:
beers.head()

,Unnamed: 0,abv,ibu,id,name,style,brewery_id,ounces
0,0,0.05,,1436,Pub Beer,American Pale Lager,408,12.0
1,1,0.066,,2265,Devil's Cup,American Pale Ale (APA),177,12.0
2,2,0.071,,2264,Rise of the Phoenix,American IPA,177,12.0
3,3,0.09,,2263,Sinister,American Double / Imperial IPA,177,12.0
4,4,0.075,,2262,Sex and Candy,American IPA,177,12.0


To get a quick handle on what's going on in these data, **save the number of missing values in each variable of the `beers` dataset to `null_beers`.** Hint: use `.isnull()`

In [11]:
null_beers = None
#Count the number of missing values in each variable of the beers dataset and save it to null_beers.
null_beers = beers.isnull().sum()
print(null_beers)

Unnamed: 0       0
abv             62
ibu           1005
id               0
name             0
style            5
brewery_id       0
ounces           0
dtype: int64


In [None]:
assert null_beers.sum() == 1072

We're going to try to predict the `style` of beer from its alcohol by volume (`abv`) and its international bitterness units (`ibu`). To do this, **remove any beers from our `beers` dataset where data are missing for any of these three values. Store this back into the `beers` dataset.** 

Note that you may not always want to take this approach and removing samples from your dataset will not always be appropriate, but for this example, it's a reasonable approach.
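
To make the note above concrete, here is a minimal sketch on a made-up toy table (the values are invented for illustration and are not from `beers.csv`). It contrasts dropping incomplete rows with one common alternative, filling numeric gaps with the column median. Which option is appropriate depends on the data and the question.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "abv": [0.05, np.nan, 0.07, 0.09],
    "ibu": [20.0, 45.0, np.nan, 90.0],
    "style": ["Lager", "IPA", "IPA", None],
})

# Option 1: drop every row that is missing any of the listed columns
dropped = toy.dropna(subset=["style", "abv", "ibu"])

# Option 2: keep more rows by filling numeric gaps with the column median.
# Rows with a missing label are still dropped, since the label is what we predict.
filled = toy.dropna(subset=["style"]).copy()
numeric_cols = ["abv", "ibu"]
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

print("Rows after dropping:", dropped.shape[0])
print("Rows after filling:", filled.shape[0])
```

Dropping is simpler and avoids inventing values, at the cost of a smaller dataset.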

In [32]:
#Remove Rows with Missing Values
beers = beers.dropna(subset=['style', 'abv', 'ibu'])
print(beers.head())  # Display the first few rows to confirm successful removal
print("Beers DataFrame shape:", beers.shape)
assert beers.shape == (1403, 8)

    Unnamed: 0    abv   ibu    id                                  name  \
14          14  0.061  60.0  1979                          Bitter Bitch   
21          21  0.099  92.0  1036                         Lower De Boom   
22          22  0.079  45.0  1024                         Fireside Chat   
24          24  0.044  42.0   876                       Bitter American   
25          25  0.049  17.0   802  Hell or High Watermelon Wheat (2009)   

                      style  brewery_id  ounces  
14  American Pale Ale (APA)         177    12.0  
21      American Barleywine         368     8.4  
22            Winter Warmer         368    12.0  
24  American Pale Ale (APA)         368    12.0  
25   Fruit / Vegetable Beer         368    12.0  
Beers DataFrame shape: (1403, 8)


In [33]:
assert beers.shape == (1403, 8)

Using the `beers` dataset you've now got, **merge `beers` and `breweries` together using a left join. Assign this to the variable `beer_df`. Be sure to look at the first few rows of `beer_df`.**

In [42]:
# Merge Datasets
print("Breweries DataFrame shape:", breweries.shape)
beer_df = pd.merge(beers, breweries, left_on='brewery_id', right_index=True, how='left')

# Drop the unwanted columns
beer_df = beer_df.drop(columns=['Unnamed: 0_x', 'Unnamed: 0_y'])

print(beer_df.head())  # Display the first few rows to confirm successful merge
print("Merged DataFrame shape:", beer_df.shape)
print("Merged DataFrame columns:", beer_df.columns)
assert beer_df.shape == (1403, 10)

Breweries DataFrame shape: (558, 4)
      abv   ibu    id                                name_x  \
14  0.061  60.0  1979                          Bitter Bitch   
21  0.099  92.0  1036                         Lower De Boom   
22  0.079  45.0  1024                         Fireside Chat   
24  0.044  42.0   876                       Bitter American   
25  0.049  17.0   802  Hell or High Watermelon Wheat (2009)   

                      style  brewery_id  ounces                  name_y  \
14  American Pale Ale (APA)         177    12.0     18th Street Brewery   
21      American Barleywine         368     8.4  21st Amendment Brewery   
22            Winter Warmer         368    12.0  21st Amendment Brewery   
24  American Pale Ale (APA)         368    12.0  21st Amendment Brewery   
25   Fruit / Vegetable Beer         368    12.0  21st Amendment Brewery   

             city state  
14           Gary    IN  
21  San Francisco    CA  
22  San Francisco    CA  
24  San Francisco    CA  
25  San Francisco    CA  

In [43]:
assert beer_df.shape == (1403, 10)
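
The merge above matches each beer's `brewery_id` against the row index of `breweries`. As a rough sketch of how a left join of that kind behaves, here is the same pattern on two tiny made-up tables (all names and values are hypothetical). The `suffixes` argument is an optional extra that is not used in the workbook's own merge.

```python
import pandas as pd

# Toy tables, invented for illustration only
toy_beers = pd.DataFrame({
    "name": ["Beer A", "Beer B", "Beer C"],
    "brewery_id": [1, 0, 5],  # 5 has no matching brewery
})
toy_breweries = pd.DataFrame({
    "name": ["Brewery Zero", "Brewery One"],
    "state": ["CA", "MN"],
})  # the default index (0, 1) plays the role of the brewery id

# Left join: every beer is kept; brewery columns are NaN where no match exists
toy_merged = pd.merge(
    toy_beers, toy_breweries,
    left_on="brewery_id", right_index=True,
    how="left", suffixes=("_beer", "_brewery"),
)
print(toy_merged)
```

Because both tables have a `name` column, pandas adds suffixes to tell them apart, which is why `name_x` and `name_y` appear in `beer_df`.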

**Use and take a look at the output of the `describe()` method to describe the quantitative variables in your `beer_df` dataset.**

**Be sure to look at the output you just generated. What do you learn? Do any values surprise you? Are there any with really big standard deviations? Does this make sense?** (Feel free to edit this cell with any observations/notes)
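
As a small sketch of what `describe()` returns, here it is on a made-up toy table (values invented for illustration). On the real data the call would be `beer_df.describe()`.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "abv": [0.05, 0.06, 0.07, 0.10],
    "ibu": [20.0, 40.0, 60.0, 120.0],
    "style": ["Lager", "APA", "IPA", "IPA"],
})

# By default describe() summarizes only the numeric columns
summary = toy.describe()
print(summary)
```

Note that a standard deviation is only "big" relative to the scale of its column: `ibu` is measured in tens while `abv` is a fraction below 1.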

Now, let's take a look and **see how many different styles of beer we have in our dataset.** The `value_counts` method may help you accomplish this. Assign it to `beer_counts` and print it.

In [None]:
# Count how many beers fall into each style, most common first
beer_counts = beer_df['style'].value_counts()
print(beer_counts)

In [None]:
assert beer_counts.iloc[0] == 301
assert len(beer_counts) == 90

Due to limitations in time here in section, let's just try to predict the four most common `style`s of beer. **Filter your `beer_df` dataset to only include entries from the four most common `style`s of beer.** Store this filtered dataset into `beer_df`.

In [46]:
# Count the number of different styles of beer
beer_counts = beer_df['style'].value_counts()
print(beer_counts)

# Filter the beer_df dataset to only include entries from the four most common styles
most_common_styles = beer_counts.index[:4]
beer_df = beer_df[beer_df['style'].isin(most_common_styles)]

# Print the shape of the DataFrame and the unique styles to check the filtering
print("Filtered DataFrame shape:", beer_df.shape)
print("Unique styles in filtered DataFrame:", beer_df['style'].unique())

# Check the assertion
assert beer_df.shape == (606, 10)
styles = beer_df['style'].value_counts().index.tolist()
assert len(styles) == 4

style
American IPA                          301
American Pale Ale (APA)               153
American Amber / Red Ale               77
American Double / Imperial IPA         75
American Blonde Ale                    61
                                     ... 
Roggenbier                              1
Smoked Beer                             1
Euro Pale Lager                         1
Other                                   1
American Double / Imperial Pilsner      1
Name: count, Length: 90, dtype: int64
Filtered DataFrame shape: (606, 10)
Unique styles in filtered DataFrame: ['American Pale Ale (APA)' 'American IPA' 'American Double / Imperial IPA'
 'American Amber / Red Ale']


In [47]:
assert beer_df.shape == (606, 10)
styles = beer_df['style'].value_counts().index.tolist()
assert len(styles) == 4

# Part II : Prediction Model

Let's start to build our model! To do so, **create a variable `num_training` that includes the number of samples that corresponds to 80% of our total samples in our `beer_df` dataset. Be sure that this is an integer. Also, create a variable `num_testing` including the number corresponding to 20% of our total samples.**

In [91]:
# Calculate the number of training and testing samples
num_samples = beer_df.shape[0]
num_training = int(num_samples * 0.8)  # 80% of the total samples
num_testing = num_samples - num_training  # Remaining 20% of the total samples

print("Number of training samples:", num_training)
print("Number of testing samples:", num_testing)

Number of training samples: 484
Number of testing samples: 122


In [92]:
assert num_training == 484
assert num_testing == 122

To model these data, **split your data into `beer_X`, which includes the `abv` and `ibu` columns from `beer_df` (predictors). This should be a `pandas` DataFrame. The outcome variable will be `style`. Assign the outcome variable to the variable `beer_Y`. This should be a `numpy` array.**

In [93]:
import numpy as np
import pandas as pd

# Split the data into predictors (beer_X) and the outcome variable (beer_Y)
beer_X = beer_df[['abv', 'ibu']]  # Select the abv and ibu columns as predictors
beer_Y = beer_df['style'].values  # Select the style column as the outcome variable and convert to a numpy array

print(beer_X.head())  # Display the first few rows of beer_X
print(beer_Y[:5])     # Display the first few elements of beer_Y to confirm the split

      abv   ibu
14  0.061  60.0
24  0.044  42.0
28  0.070  70.0
29  0.070  70.0
30  0.070  70.0
['American Pale Ale (APA)' 'American Pale Ale (APA)' 'American IPA'
 'American IPA' 'American IPA']


In [94]:
assert type(beer_Y) == np.ndarray
assert beer_Y.shape == (606,)
assert beer_X.shape == (606, 2)

Before running our model, we'll need to **split our data into a training and test set. Use `num_training` (created above) to extract the following variables**: 
* from `beer_X`, generate : `beer_train_X`, `beer_test_X`
* from `beer_Y`, generate: `beer_train_Y`, `beer_test_Y`

In [None]:
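# Positional split: the first num_training rows train the model, the rest test it
beer_train_X = beer_X[:num_training]
beer_test_X = beer_X[num_training:]
beer_train_Y = beer_Y[:num_training]
beer_test_Y = beer_Y[num_training:]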

In [None]:
assert len(beer_train_X) == 484
assert len(beer_test_X) == 122
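
The workbook asks for a positional split using `num_training`. A positional split keeps rows in file order, so if the file happens to be sorted, the test set may not resemble the training set. The sketch below, on made-up toy arrays, contrasts that with a shuffled, stratified split from `sklearn.model_selection.train_test_split`. This is an alternative to consider, not what the assertions in this workbook expect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 10 samples, labels sorted by class
X = np.arange(20).reshape(10, 2)
y = np.array(["a"] * 5 + ["b"] * 5)

# Positional split: the last 20% of rows become the test set
n_train = int(len(X) * 0.8)
X_train_pos, X_test_pos = X[:n_train], X[n_train:]
y_train_pos, y_test_pos = y[:n_train], y[n_train:]

# Shuffled split that preserves the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("Positional test labels:", y_test_pos)
print("Stratified test labels:", y_test)
```

With sorted labels, the positional test set contains only class `"b"`, while the stratified one contains one sample of each class.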

To train our model, we'll use a linear SVM classifier. Here a function has been defined for you. **Run the following cell, but be sure you understand what the function is doing.**

In [89]:
def train_SVM(X, y, kernel='linear'):
    clf = SVC(kernel=kernel)
    clf.fit(X, y)
    
    return clf
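
To check your understanding of `train_SVM`, here is a small sketch that applies the same function to made-up, clearly separable toy points (the data and the `toy_` names are hypothetical). The function creates an `SVC` with the requested kernel, fits it to the predictors and labels, and returns the fitted classifier.

```python
import numpy as np
from sklearn.svm import SVC

def train_SVM(X, y, kernel='linear'):
    clf = SVC(kernel=kernel)  # build a support vector classifier
    clf.fit(X, y)             # learn a decision boundary from X and y
    return clf                # the fitted model can now call .predict()

# Toy data, invented for illustration: two well-separated groups
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array(["low", "low", "low", "high", "high", "high"])

toy_clf = train_SVM(toy_X, toy_y)
toy_pred = toy_clf.predict(np.array([[0.5, 0.5], [5.5, 5.5]]))
print(toy_pred)
```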

Using the `train_SVM` function defined above, **train your model. Assign this output to `beer_clf`.**

In [90]:
# Fit the linear SVM on the training predictors and labels
beer_clf = train_SVM(beer_train_X, beer_train_Y)

In [None]:
assert isinstance(beer_clf, SVC)
assert hasattr(beer_clf, "predict")

Now, **generate predictions from your training and test sets of predictors using the `predict` method. Assign your predictions from the training data to `beer_predicted_train_Y`. Assign your predictions from the test data to `beer_predicted_test_Y`.**

In [None]:
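# Predict styles for the data the model was trained on and for the held-out data
beer_predicted_train_Y = beer_clf.predict(beer_train_X)
beer_predicted_test_Y = beer_clf.predict(beer_test_X)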

In [None]:
assert beer_predicted_train_Y.shape == (484,)
assert beer_predicted_test_Y.shape == (122,)

# Part III : Model Assessment

At this point, you should have built your model and generated predictions using that model for both your training and test datasets. 

Let's determine how our predictor did. **Generate a `classification_report` from sklearn for the predictions generated for your training data relative to the truth (from the original beers dataset). Save the output to `class_report_train` and print it.**

In [87]:
# Compare the true training labels with the SVM's predictions on the training set
class_report_train = classification_report(beer_train_Y, beer_predicted_train_Y)
print(class_report_train)


In [88]:
assert len(class_report_train) == 578

What are precision and recall? What do these numbers represent? How accurate are our predictions?
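
As a worked example of these definitions, the sketch below computes precision and recall by hand for one class on a made-up set of labels (invented for illustration), then checks the result against `precision_recall_fscore_support`. Precision is the share of predictions of a class that were correct; recall is the share of true members of a class that the model found.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for illustration only
y_true = ["IPA", "IPA", "IPA", "APA", "APA", "Amber"]
y_pred = ["IPA", "IPA", "APA", "APA", "IPA", "Amber"]

# Count outcomes for the "IPA" class by hand
pairs = list(zip(y_true, y_pred))
tp = sum(t == "IPA" and p == "IPA" for t, p in pairs)  # correctly called IPA
fp = sum(t != "IPA" and p == "IPA" for t, p in pairs)  # wrongly called IPA
fn = sum(t == "IPA" and p != "IPA" for t, p in pairs)  # IPAs that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# The same quantities from scikit-learn, restricted to the "IPA" class
sk_precision, sk_recall, _, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["IPA"]
)

print("precision:", precision, "recall:", recall, "accuracy:", accuracy)
```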

**Generate a `classification_report` for the predictions generated for your *test* data relative to the truth (from the original beers dataset). Save the output to `class_report_test` and print it.**

In [None]:
# Compare the true test labels with the SVM's predictions on the held-out test set
class_report_test = classification_report(beer_test_Y, beer_predicted_test_Y)
print(class_report_test)

In [None]:
assert len(class_report_test) == 578

How is our model performing? Does this differ between training and test data? Where does it have trouble? Where does it perform well? Do we have thoughts as to why? One way to determine where a model is going wrong is to look at a confusion matrix. **Generate a confusion matrix for the training data predictions as well as the ground truth from the `beer_df` dataset. Save this to `conf_mat_train`**
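
If the layout of a confusion matrix is unfamiliar, this small sketch on made-up labels (invented for illustration) may help. In scikit-learn's `confusion_matrix`, each row is a true class and each column is a predicted class, so the diagonal holds the correct predictions. The explicit `labels` argument is optional and is used here only to fix the row and column order.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, invented for illustration only
y_true = ["IPA", "IPA", "IPA", "APA", "APA", "Amber"]
y_pred = ["IPA", "IPA", "APA", "APA", "IPA", "Amber"]

labels = ["APA", "Amber", "IPA"]  # row and column order of the matrix
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Off-diagonal entries show which styles are mistaken for which:
# cm[0, 2] counts true APAs predicted as IPA, cm[2, 0] the reverse.
```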

In [83]:
# Rows are the true styles, columns are the styles predicted by beer_clf
conf_mat_train = confusion_matrix(beer_train_Y, beer_predicted_train_Y)
print(conf_mat_train)


In [84]:
assert conf_mat_train[0,0] == 31
assert conf_mat_train[-1,-1] == 81
assert conf_mat_train.shape == (4,4)

**Generate a confusion matrix for the testing data. Save this to `conf_mat_test`**

In [None]:
# Same comparison as above, but on the held-out test set
conf_mat_test = confusion_matrix(beer_test_Y, beer_predicted_test_Y)
print(conf_mat_test)

In [None]:
assert conf_mat_test[-1,-1] == 21
assert conf_mat_test.shape == (4,4)
assert conf_mat_test[0,0] == 5

While this is a somewhat small example using a limited dataset for prediction, we hope you have a better understanding of how to approach a machine learning question, knowing specifically what training and test datasets are used for, how to build a model, and how to assess model/prediction performance. **Feel free to try different models, include more beer types in your analysis or ask a completely different prediction question!**
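
If you'd like a starting point for trying different models, here is a rough sketch that compares two `SVC` kernels on synthetic data from `make_blobs` (the data are generated, not the beer data, and the choice of kernels is only an example). With the beer data you would substitute your own training and test sets.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic four-class data standing in for the beer predictors and styles
X, y = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit one classifier per kernel and record its accuracy on the test set
scores = {}
for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

print(scores)
```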