# Beer-Recommendation System: Data Prep
Author: Ashli Dougherty 

# Overview

***

# Business Understanding 

***

# Data Understanding 

>There are two datasets being utilized for this project. 

### Tasting Profiles
The [first dataset](https://www.kaggle.com/datasets/stephenpolozoff/top-beer-information?select=beer_data_set.csv) contains data scraped from the website [BeerAdvocate.com](https://www.beeradvocate.com/). It is a CSV file that contains 5,558 beers in total across 112 styles. Other data in the table include the brewery that produced the beer, a description of the beer, and the beer's overall rating. Because there is no per-user review data, this dataset is best suited to a content-based recommendation system.

### Beer Reviews 
The [second dataset](https://www.kaggle.com/datasets/rdoume/beerreviews) is also sourced from [BeerAdvocate.com](https://www.beeradvocate.com/). It is also a CSV file and contains approximately 1.6 million of the reviews on the website. There are a total of 33,388 unique reviewers and 56,857 unique beers that have been reviewed. There are no descriptions of the beers, just names, breweries, styles, and ratings on a scale of 1 - 5. Because this dataset has both unique user and beer IDs, it lends itself to a collaborative filtering system.
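
To illustrate why user and beer IDs matter for collaborative filtering, here is a minimal sketch that pivots reviews into a user-item rating matrix. The column names follow the `beer_reviews.csv` header, but the rows are invented toy data, and averaging repeated ratings (`aggfunc="mean"`) is an assumption rather than a choice made in this notebook.

```python
import pandas as pd

# Toy stand-in for beer_reviews.csv; rows are invented for illustration only.
toy_reviews = pd.DataFrame({
    "review_profilename": ["ann", "ann", "bob", "cat"],
    "beer_beerid": [1, 2, 1, 3],
    "review_overall": [4.0, 3.5, 5.0, 2.0],
})

# One row per reviewer, one column per beer; unrated pairs stay NaN.
user_item = toy_reviews.pivot_table(
    index="review_profilename",
    columns="beer_beerid",
    values="review_overall",
    aggfunc="mean",  # assumption: average any repeated reviews
)
print(user_item)
```

Most cells in such a matrix are empty, which is the sparsity a collaborative filtering model has to work around.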

Due to the large size of these datasets, I downloaded them locally and saved them to an external repository outside of GitHub.

***

# Data Preparation 

## Imports

Initial cleaning consisted of using pandas methods and functions. Visualizations for EDA were made with the matplotlib library.

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import string

## Load Data

### Tasting Profiles -- content based recs

In [100]:
cd BeerData

[Errno 2] No such file or directory: 'BeerData'
/Users/ashlidougherty/Documents/Flatiron/BeerData


In [101]:
ls

Beer Descriptors Simplified.xlsx  tasting_profiles.csv
beer_reviews.csv


In [102]:
df_tasting = pd.read_csv('tasting_profiles.csv')

In [103]:
df_tasting.head()

Unnamed: 0,Name,key,Style,Style Key,Brewery,Description,ABV,Ave Rating,Min IBU,Max IBU,...,Body,Alcohol,Bitter,Sweet,Sour,Salty,Fruits,Hoppy,Spices,Malty
0,Amber,251,Altbier,8,Alaskan Brewing Co.,"Notes:Richly malty and long on the palate, wit...",5.3,3.65,25,50,...,32,9,47,74,33,0,33,57,8,111
1,Double Bag,252,Altbier,8,Long Trail Brewing Co.,"Notes:This malty, full-bodied double alt is al...",7.2,3.9,25,50,...,57,18,33,55,16,0,24,35,12,84
2,Long Trail Ale,253,Altbier,8,Long Trail Brewing Co.,Notes:Long Trail Ale is a full-bodied amber al...,5.0,3.58,25,50,...,37,6,42,43,11,0,10,54,4,62
3,Doppelsticke,254,Altbier,8,Uerige Obergärige Hausbrauerei,Notes:,8.5,4.15,25,50,...,55,31,47,101,18,1,49,40,16,119
4,Scurry,255,Altbier,8,Off Color Brewing,Notes:Just cause it's dark and German doesn't ...,5.3,3.67,25,50,...,69,10,63,120,14,0,19,36,15,218


> From a first look at the table, it appears that not all the beers actually have a description. Some entries are just 'Notes:' and therefore will not come up as null values even though they don't contain any usable information. I plan to strip this prefix and all punctuation from every description.
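
One way to count the placeholder descriptions is to compare the column against the bare prefix. The sketch below uses an invented three-row frame; only the `Description` column name comes from the dataset.

```python
import pandas as pd

# Toy data: one real description and two placeholder entries.
toy = pd.DataFrame({"Description": ["Notes:Richly malty", "Notes:", "Notes:"]})

# isna() sees nothing missing, because 'Notes:' is a non-empty string.
n_null = toy["Description"].isna().sum()

# Comparing against the bare prefix reveals the placeholder rows.
n_placeholder = (toy["Description"].str.strip() == "Notes:").sum()
print(n_null, n_placeholder)
```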

In [104]:
# check shape
df_tasting.shape

(5558, 21)

In [105]:
# checking types
df_tasting.info()
# all types look as expected

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5558 entries, 0 to 5557
Data columns (total 21 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Name         5556 non-null   object 
 1   key          5558 non-null   int64  
 2   Style        5558 non-null   object 
 3   Style Key    5558 non-null   int64  
 4   Brewery      5558 non-null   object 
 5   Description  5558 non-null   object 
 6   ABV          5558 non-null   float64
 7   Ave Rating   5558 non-null   float64
 8   Min IBU      5558 non-null   int64  
 9   Max IBU      5558 non-null   int64  
 10  Astringency  5558 non-null   int64  
 11  Body         5558 non-null   int64  
 12  Alcohol      5558 non-null   int64  
 13  Bitter       5558 non-null   int64  
 14  Sweet        5558 non-null   int64  
 15  Sour         5558 non-null   int64  
 16  Salty        5558 non-null   int64  
 17  Fruits       5558 non-null   int64  
 18  Hoppy        5558 non-null   int64  
 19  Spices

In [106]:
df_tasting.isna().sum()

Name           2
key            0
Style          0
Style Key      0
Brewery        0
Description    0
ABV            0
Ave Rating     0
Min IBU        0
Max IBU        0
Astringency    0
Body           0
Alcohol        0
Bitter         0
Sweet          0
Sour           0
Salty          0
Fruits         0
Hoppy          0
Spices         0
Malty          0
dtype: int64

In [107]:
df_tasting.duplicated().sum()

0

In [108]:
# There are no duplicates and only 2 missing values. 
# These missing values will be dropped as I need the identity of the beer in order to recommend it. 

In [109]:
df_tasting.dropna(inplace=True)

In [110]:
df_tasting.shape

(5556, 21)
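
Because `Name` is the only column with missing values, a bare `dropna()` removes exactly the two unnamed beers. A slightly more defensive variant, sketched here on invented toy data, limits the check to `Name` so that a gap in another column would not cost a row.

```python
import pandas as pd

# Toy data: one row lacks a Name, another lacks only ABV.
toy = pd.DataFrame({
    "Name": ["Amber", None, "Scurry"],
    "ABV": [5.3, 7.2, None],
})

# Only a missing Name disqualifies a row; the row with a missing ABV is kept.
cleaned = toy.dropna(subset=["Name"]).reset_index(drop=True)
print(cleaned.shape)
```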

#### Cleaning text for description 

In [111]:
df_tasting.Description

0       Notes:Richly malty and long on the palate, wit...
1       Notes:This malty, full-bodied double alt is al...
2       Notes:Long Trail Ale is a full-bodied amber al...
3                                                  Notes:
4       Notes:Just cause it's dark and German doesn't ...
                              ...                        
5553                                               Notes:
5554    Notes:This is the forty-fifth annual Our Speci...
5555                                               Notes:
5556    Notes:Chanukah Beer pours a rich crystal clear...
5557    Notes:The essence of Christmas is captured in ...
Name: Description, Length: 5556, dtype: object

In [112]:
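# remove only the literal leading 'Notes:' (str.strip would trim any of those characters from both ends)
df_tasting['Description'] = df_tasting['Description'].str.replace(r'^Notes:', '', regex=True)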
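
For reference, `str.strip` treats its argument as a set of characters to trim from both ends, not as a literal prefix, so an anchored pattern is the safer way to drop a leading 'Notes:'. The comparison below uses invented strings.

```python
import pandas as pd

toy = pd.Series(["Notes:Smooth notes", "Notes:"])

# strip('Notes:') trims any of the characters N, o, t, e, s, : from both ends,
# so trailing letters of the description can be lost as well.
stripped = toy.str.strip("Notes:")

# An anchored regex removes only the literal leading prefix.
prefixed = toy.str.replace(r"^Notes:", "", regex=True)

print(stripped.tolist())  # ['Smooth n', '']
print(prefixed.tolist())  # ['Smooth notes', '']
```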

In [119]:
import re
#stripping description of punctuation 

In [121]:
df_tasting['Description'] = df_tasting['Description'].replace(r'[^\w\s]', "", regex=True)
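
The pattern `[^\w\s]` deletes every character that is neither a word character nor whitespace, which also merges hyphenated words and can leave doubled spaces behind. The sketch below shows the effect on an invented sentence; the extra whitespace-collapsing step is an optional assumption, not something this notebook does.

```python
import pandas as pd

toy = pd.Series(["Just cause it's dark, full-bodied & malty!"])

# Same pattern as above: drop anything that is not a word character or whitespace.
no_punct = toy.str.replace(r"[^\w\s]", "", regex=True)

# Optional extra step (assumption): collapse runs of whitespace left behind.
tidy = no_punct.str.replace(r"\s+", " ", regex=True).str.strip()

print(tidy.iloc[0])  # Just cause its dark fullbodied malty
```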

#### Binning Styles

In [124]:
df_tasting['Style Key'].nunique()

112

In [116]:
def unique_list(list1):
    '''
    Print each unique value in list1 once, in order of first appearance.
    '''
    seen = []
    for x in list1:
        if x not in seen:
            seen.append(x)
    for x in seen:
        print(x)
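
pandas offers the same order-preserving de-duplication through `Series.unique()`, which avoids the repeated list scan. A minimal sketch on invented values:

```python
import pandas as pd

toy_styles = pd.Series(
    ["Altbier", "Altbier", "Bock - Eisbock", "Altbier", "Cream Ale"]
)

# unique() returns values in order of first appearance.
unique_styles = toy_styles.unique().tolist()
print(unique_styles)  # ['Altbier', 'Bock - Eisbock', 'Cream Ale']
```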

In [117]:
style_list = df_tasting['Style'].tolist()

In [118]:
unique_list(style_list)

Altbier
Barleywine - American
Barleywine - English
Bitter - English Extra Special / Strong Bitter (ESB)
Bitter - English
Bière de Champagne / Bière Brut
Blonde Ale - American
Blonde Ale - Belgian
Bock - Doppelbock
Bock - Eisbock
Bock - Maibock
Bock - Traditional
Bock - Weizenbock
Braggot
Brett Beer
Brown Ale - American
Brown Ale - Belgian Dark
Brown Ale - English
California Common / Steam Beer
Chile Beer
Cream Ale
Dubbel
Farmhouse Ale - Bière de Garde
Farmhouse Ale - Sahti
Farmhouse Ale - Saison
Fruit and Field Beer
Gruit / Ancient Herbed Ale
Happoshu
Herb and Spice Beer
IPA - American
IPA - Belgian
IPA - Black / Cascadian Dark Ale
IPA - Brut
IPA - English
IPA - Imperial
IPA - New England
Kvass
Kölsch
Lager - Adjunct
Lager - American Amber / Red
Lager - American
Lager - European / Dortmunder Export
Lager - European Dark
Lager - European Pale
Lager - European Strong
Lager - Helles
Lager - India Pale Lager (IPL)
Lager - Japanese Rice
Lager - Kellerbier / Zwickelbier
Lager - Light
Lager -

In [147]:
style_map = {
    'Altbier': 'Brown Ale',
    'Barleywine - American': 'Strong Ale',
    'Barleywine - English': 'Strong Ale',
    'Bitter - English Extra Special / Strong Bitter (ESB)': 'Pale Ale',
    'Bitter - English': 'Pale Ale',
    'Bière de Champagne / Bière Brut': 'Hybrid Beer',
    'Blonde Ale - American': 'Pale Ale',
    'Blonde Ale - Belgian': 'Pale Ale',
    'Bock - Doppelbock': 'Bock',
    'Bock - Eisbock': 'Bock',
    'Bock - Maibock': 'Bock',
    'Bock - Traditional': 'Bock',
    'Bock - Weizenbock': 'Bock',
    'Braggot': 'Hybrid Beer',
    'Brett Beer': 'Wild/Sour Beer',
    'Brown Ale - American': 'Brown Ale',
    'Brown Ale - Belgian Dark': 'Brown Ale',
    'Brown Ale - English': 'Brown Ale',
    'California Common / Steam Beer': 'Hybrid Beer',
    'Chile Beer': 'Specialty Beer',
    'Cream Ale': 'Hybrid Beer',
    'Dubbel': 'Dark Ale',
    'Farmhouse Ale - Bière de Garde': 'Pale Ale',
    'Farmhouse Ale - Sahti': 'Specialty Beer',
    'Farmhouse Ale - Saison': 'Pale Ale',
    'Fruit and Field Beer': 'Specialty Beer',
    'Gruit / Ancient Herbed Ale': 'Specialty Beer',
    'Happoshu': 'Specialty Beer',
    'Herb and Spice Beer': 'Specialty Beer',
    'IPA - American': 'India Pale Ale',
    'IPA - Belgian': 'India Pale Ale',
    'IPA - Black / Cascadian Dark Ale': 'India Pale Ale',
    'IPA - Brut': 'India Pale Ale',
    'IPA - English': 'India Pale Ale',
    'IPA - Imperial': 'India Pale Ale',
    'IPA - New England': 'India Pale Ale',
    'Kvass': 'Specialty Beer',
    'Kölsch': 'Pale Ale',
    'Lager - Adjunct': 'Pale Lager',
    'Lager - American Amber / Red': 'Dark Lager',
    'Lager - American': 'Pale Lager',
    'Lager - European / Dortmunder Export': 'Pale Lager',
    'Lager - European Dark': 'Dark Lager',
    'Lager - European Pale': 'Pale Lager',
    'Lager - European Strong': 'Pale Lager',
    'Lager - Helles': 'Pale Lager',
    'Lager - India Pale Lager (IPL)': 'Pale Lager',
    'Lager - Japanese Rice': 'Specialty Beer',
    'Lager - Kellerbier / Zwickelbier': 'Pale Lager',
    'Lager - Light': 'Pale Lager',
    'Lager - Malt Liquor': 'Pale Lager',
    'Lager - Munich Dunkel': 'Dark Lager',
    'Lager - Märzen / Oktoberfest': 'Dark Lager',
    'Lager - Rauchbier': 'Dark Lager',
    'Lager - Schwarzbier': 'Dark Lager',
    'Lager - Vienna': 'Dark Lager',
    'Lambic - Faro': 'Wild/Sour Beer',
    'Lambic - Fruit': 'Wild/Sour Beer',
    'Lambic - Gueuze': 'Wild/Sour Beer',
    'Lambic - Traditional': 'Wild/Sour Beer',
    'Low Alcohol Beer': 'Specialty Beer',
    'Mild Ale - English Dark': 'Brown Ale',
    'Mild Ale - English Pale': 'Pale Ale',
    'Old Ale': 'Strong Ale',
    'Pale Ale - American': 'Pale Ale',
    'Pale Ale - Belgian': 'Pale Ale',
    'Pale Ale - English': 'Pale Ale',
    'Pilsner - Bohemian / Czech': 'Pale Lager',
    'Pilsner - German': 'Pale Lager',
    'Pilsner - Imperial': 'Pale Lager',
    'Porter - American': 'Porter',
    'Porter - Baltic': 'Porter',
    'Porter - English': 'Porter',
    'Porter - Imperial': 'Porter',
    'Porter - Robust': 'Porter',
    'Porter - Smoked': 'Porter',
    'Pumpkin Beer': 'Specialty Beer',
    'Quadrupel (Quad)': 'Strong Ale',
    'Red Ale - American Amber / Red': 'Pale Ale',
    'Red Ale - Imperial': 'Strong Ale',
    'Red Ale - Irish': 'Pale Ale',
    'Rye Beer - Roggenbier': 'Dark Ale',
    'Rye Beer': 'Specialty Beer',
    'Scotch Ale / Wee Heavy': 'Strong Ale',
    'Scottish Ale': 'Dark Ale',
    'Smoked Beer': 'Specialty Beer',
    'Sour - Berliner Weisse': 'Wild/Sour Beer',
    'Sour - Flanders Oud Bruin': 'Wild/Sour Beer',
    'Sour - Flanders Red Ale': 'Wild/Sour Beer',
    'Sour - Gose': 'Wild/Sour Beer',
    'Stout - American Imperial': 'Stout',
    'Stout - American': 'Stout',
    'Stout - English': 'Stout',
    'Stout - Foreign / Export': 'Stout',
    'Stout - Irish Dry': 'Stout',
    'Stout - Oatmeal': 'Stout',
    'Stout - Russian Imperial': 'Stout',
    'Stout - Sweet / Milk': 'Stout',
    'Strong Ale - American': 'Strong Ale',
    'Strong Ale - Belgian Dark': 'Strong Ale',
    'Strong Ale - Belgian Pale': 'Strong Ale',
    'Strong Ale - English': 'Strong Ale',
    'Tripel': 'Strong Ale',
    'Wheat Beer - American Dark': 'Wheat Beer',
    'Wheat Beer - American Pale': 'Wheat Beer',
    'Wheat Beer - Dunkelweizen': 'Wheat Beer',
    'Wheat Beer - Hefeweizen': 'Wheat Beer',
    'Wheat Beer - Kristallweizen': 'Wheat Beer',
    'Wheat Beer - Wheatwine': 'Wheat Beer',
    'Wheat Beer - Witbier': 'Wheat Beer',
    'Wild Ale': 'Wild/Sour Beer',
    'Winter Warmer': 'Dark Ale'
}

In [148]:
df_tasting['Style'] = df_tasting['Style'].map(style_map)
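
`Series.map` with a dictionary returns `NaN` for any value missing from the mapping, so it is worth checking for unmapped styles and for near-duplicate bin labels after binning. The sketch below uses a hypothetical two-entry excerpt of the mapping and invented rows.

```python
import pandas as pd

# Hypothetical excerpt of the style mapping, for illustration only.
toy_map = {"Altbier": "Brown Ale", "Tripel": "Strong Ale"}
toy = pd.DataFrame({"Style": ["Altbier", "Tripel", "Kvass"]})

binned = toy["Style"].map(toy_map)

# Styles absent from the mapping come back as NaN.
unmapped = toy.loc[binned.isna(), "Style"].unique().tolist()

# value_counts() makes stray or misspelled bin labels easy to spot.
counts = binned.value_counts()

print(unmapped)
print(counts)
```

Running the same two checks on the full frame before overwriting `Style` would flag any style left out of `style_map`.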

### Beer Reviews -- collaborative recs

In [12]:
df_reviews = pd.read_csv('beer_reviews.csv')

In [13]:
df_reviews.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883
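
The `review_time` values look like Unix timestamps in seconds. That is an assumption based on their magnitude, not something the dataset description confirms. Under that assumption, the sketch below converts them to datetimes and counts unique reviewers and beers, using values from two of the rows shown above.

```python
import pandas as pd

# Values taken from two of the rows displayed by head().
toy = pd.DataFrame({
    "review_time": [1234817823, 1293735206],
    "review_profilename": ["stcules", "johnmichaelsen"],
    "beer_beerid": [47986, 64883],
})

# Assumption: review_time is seconds since the Unix epoch.
toy["review_date"] = pd.to_datetime(toy["review_time"], unit="s")

# The same nunique() calls on the full frame give the reviewer and beer counts.
n_reviewers = toy["review_profilename"].nunique()
n_beers = toy["beer_beerid"].nunique()

print(toy[["review_time", "review_date"]])
print(n_reviewers, n_beers)
```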
