# RateBeer Analysis

In [1]:
import numpy as np
import pandas as pd

## Data Munging

__The data comes in a single file in which each review is a block of text, separated from the next by an empty line. Each block has multiple fields, one per line. For example:__

beer/name: John Harvards Fancy Lawnmower Beer<br />
beer/beerId: 64125<br />
beer/brewerId: 8481<br />
beer/ABV: 5.4<br />
beer/style: Kolsch<br />
review/appearance: 2/5<br />
review/aroma: 4/10<br />
review/palate: 2/5<br />
review/taste: 4/10<br />
review/overall: 8/20<br />
review/time: 1157587200<br />
review/profileName: hopdog<br />
review/text: On tap at the Springfield, PA location. Poured a lighter golden color with a very small, if any head. Aromas and tastes of grain, very lightly fruity with a light grassy finish. Lively yet thin and watery body. Oh yeah, the person seating me told me this was a new one and was a Pale Ale even though the menu he gave me listed it as a lighter beer brewed in the Kolsh style.<br />
<br />
beer/name: John Harvards Vanilla Black Velvet Stout<br />
beer/beerId: 31544<br />
beer/brewerId: 8481<br />
beer/ABV: -<br />
beer/style: Sweet Stout<br />
review/appearance: 5/5<br />
review/aroma: 8/10<br />
review/palate: 4/5<br />
review/taste: 7/10<br />
review/overall: 16/20<br />
review/time: 1077753600<br />
review/profileName: egajdzis<br />
review/text: Springfield, PA location... Poured an opaque black color with a creamy tan head and nice lacing.  Strong vanilla and roasted malt aroma.  Creamy taste of coffee, chocolate and vanilla. The bartender told me this was an imperial stout at about 8%.  She didn't convince me, there was no alcohol to be found, and it was sweet as hell!  But still good.<br />


**To separate out the fields, I used grep to write the lines for each field to its own file:**

grep "beer/name:" Ratebeer.txt > beerNameCol.txt<br />
grep "beer/beerId:" Ratebeer.txt > beerIDCol.txt<br />
grep "beer/brewerId:" Ratebeer.txt > beerBrewerCol.txt<br />
grep "beer/ABV:" Ratebeer.txt > beerABVCol.txt<br />
grep "beer/style:" Ratebeer.txt > beerStyleCol.txt<br />

grep "review/appearance:" Ratebeer.txt > reviewAppearanceCol.txt<br />
grep "review/aroma:" Ratebeer.txt > reviewAromaCol.txt<br />
grep "review/palate:" Ratebeer.txt > reviewPalateCol.txt<br />
grep "review/taste:" Ratebeer.txt > reviewTasteCol.txt<br />
grep "review/overall:" Ratebeer.txt > reviewOverallCol.txt<br />
grep "review/time:" Ratebeer.txt > reviewTimeCol.txt<br />
grep "review/profileName:" Ratebeer.txt > reviewProfileNameCol.txt<br />
grep "review/text:" Ratebeer.txt > reviewTextCol.txt<br />
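
As an alternative to grep, the same block format can be parsed directly in Python. The following is only a sketch, not what was run for this notebook: `parse_reviews` is a hypothetical helper, and it assumes every field sits on one line as `key: value` and that records are separated by blank lines, as in the sample above.

```python
import io
import pandas as pd

def parse_reviews(lines):
    """Yield one dict per review from an iterable of 'key: value' lines."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record (if there is one).
            if record:
                yield record
                record = {}
            continue
        # Split on the first colon only, so colons inside review text survive.
        key, sep, value = line.partition(":")
        if sep:
            record[key.strip()] = value.strip()
    # Emit the last record when the file does not end with a blank line.
    if record:
        yield record

# Tiny in-memory stand-in for Ratebeer.txt, built from the sample above.
sample = io.StringIO(
    "beer/name: John Harvards Fancy Lawnmower Beer\n"
    "beer/beerId: 64125\n"
    "beer/ABV: 5.4\n"
    "\n"
    "beer/name: John Harvards Vanilla Black Velvet Stout\n"
    "beer/beerId: 31544\n"
    "beer/ABV: -\n"
)
reviews = pd.DataFrame(list(parse_reviews(sample)))
print(reviews)
```

Parsing record by record keeps the fields of one review together, so a missing or malformed line affects only that review instead of shifting a whole column.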

Now we can read each file into a pandas object. We have to trim the field name from the beginning of each value, and then convert some of the columns to numerical values.

In [2]:
beerID=pd.read_csv('original/beerIDCol.txt', header=None)

In [3]:
beerName = pd.read_csv('original/beerNameCol.txt', sep="|", encoding='latin1', header=None)

In [4]:
brewerID=pd.read_csv('original/beerBrewerCol.txt', header=None)

In [5]:
abv=pd.read_csv('original/beerABVCol.txt', header=None)

In [6]:
beerStyle=pd.read_csv('original/beerStyleCol.txt', encoding='latin1', header=None)

In [7]:
reviewAppearance=pd.read_csv('original/reviewAppearanceCol.txt', header=None)

In [8]:
reviewAroma=pd.read_csv('original/reviewAromaCol.txt', header=None)

In [9]:
reviewPalate=pd.read_csv('original/reviewPalateCol.txt', header=None)

In [10]:
reviewTaste=pd.read_csv('original/reviewTasteCol.txt', header=None)

In [11]:
reviewOverall=pd.read_csv('original/reviewOverallCol.txt', header=None)

In [12]:
reviewTime=pd.read_csv('original/reviewTimeCol.txt', header=None)

In [13]:
reviewProfileName=pd.read_csv('original/reviewProfileNameCol.txt', header=None, sep="|", encoding='latin1')

In [14]:
reviewText=pd.read_csv('original/reviewTextCol.txt', header=None, sep="|")

In [15]:
beerID.head()

Unnamed: 0,0
0,beer/beerId: 63836
1,beer/beerId: 63836
2,beer/beerId: 71716
3,beer/beerId: 64125
4,beer/beerId: 64125


In [16]:
beerID['intermediate']=beerID[0].apply(lambda x: x[12:])
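
The hard-coded slice offsets used throughout this notebook (`x[12:]`, `x[10:]`, and so on) depend on counting the characters in each label by hand. A possible alternative, sketched here with a hypothetical helper `strip_label`, removes the label by name and trims the surrounding whitespace in one step:

```python
import re
import pandas as pd

def strip_label(series, field):
    """Remove a leading 'field:' label and surrounding whitespace from each value."""
    pattern = "^" + re.escape(field) + r":\s*"
    return series.str.replace(pattern, "", regex=True).str.strip()

# Example values taken from the outputs shown in this notebook.
raw = pd.Series(["beer/name: John Harvards Simcoe IPA", "beer/brewerId: 8481"])
names = strip_label(raw.iloc[:1], "beer/name")
brewer_ids = strip_label(raw.iloc[1:], "beer/brewerId").astype(int)
print(names.iloc[0], brewer_ids.iloc[0])
```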

In [18]:
# some of the id values have the name of the beer prepended. we should remove that text
# so that the id can be used reliably for identification/categorization
import re
beerID['intermediate2']=beerID['intermediate'].apply(lambda x: re.sub("[^0-9]", "", x) )


In [19]:
beerID['munged']=beerID['intermediate2'].apply(lambda x: int(x))
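
One caveat with removing every non-digit character: if a prepended beer name itself contains digits, they get merged into the ID. A minimal sketch of a stricter variant, assuming the ID is always the last run of digits on the line (the second input line below is made up for illustration):

```python
import pandas as pd

raw_ids = pd.Series([
    "beer/beerId: 63836",
    "beer/beerId: Hypothetical Ale 2112 71716",  # invented malformed line
])
# Capture only the digits at the very end of each line.
trailing = raw_ids.str.extract(r"(\d+)\s*$", expand=False)
beer_ids = trailing.astype(int)
print(beer_ids.tolist())  # [63836, 71716]
```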

In [20]:
beerID.head()

Unnamed: 0,0,intermediate,intermediate2,munged
0,beer/beerId: 63836,63836,63836,63836
1,beer/beerId: 63836,63836,63836,63836
2,beer/beerId: 71716,71716,71716,71716
3,beer/beerId: 64125,64125,64125,64125
4,beer/beerId: 64125,64125,64125,64125


In [21]:
beerName.head()

Unnamed: 0,0
0,beer/name: John Harvards Simcoe IPA
1,beer/name: John Harvards Simcoe IPA
2,beer/name: John Harvards Cristal Pilsner
3,beer/name: John Harvards Fancy Lawnmower Beer
4,beer/name: John Harvards Fancy Lawnmower Beer


In [28]:
# "beer/name:" is 10 characters long; strip() removes the space that follows the colon
beerName['munged']=beerName[0].apply(lambda x: x[10:].strip())

In [29]:
beerName.head()

Unnamed: 0,0,munged
0,beer/name: John Harvards Simcoe IPA,John Harvards Simcoe IPA
1,beer/name: John Harvards Simcoe IPA,John Harvards Simcoe IPA
2,beer/name: John Harvards Cristal Pilsner,John Harvards Cristal Pilsner
3,beer/name: John Harvards Fancy Lawnmower Beer,John Harvards Fancy Lawnmower Beer
4,beer/name: John Harvards Fancy Lawnmower Beer,John Harvards Fancy Lawnmower Beer


In [30]:
brewerID.head()

Unnamed: 0,0
0,beer/brewerId: 8481
1,beer/brewerId: 8481
2,beer/brewerId: 8481
3,beer/brewerId: 8481
4,beer/brewerId: 8481


In [35]:
brewerID['intermediate']=brewerID[0].apply(lambda x: x[14:])

In [36]:
brewerID['munged']=brewerID['intermediate'].apply(lambda x: int(x))

In [37]:
brewerID.head()

Unnamed: 0,0,intermediate,munged
0,beer/brewerId: 8481,8481,8481
1,beer/brewerId: 8481,8481,8481
2,beer/brewerId: 8481,8481,8481
3,beer/brewerId: 8481,8481,8481
4,beer/brewerId: 8481,8481,8481


In [38]:
abv.head()

Unnamed: 0,0
0,beer/ABV: 5.4
1,beer/ABV: 5.4
2,beer/ABV: 5
3,beer/ABV: 5.4
4,beer/ABV: 5.4


In [55]:
# for the moment set all unknown abv values to 200
abv['num']=abv[0].apply(lambda x: x[10:]).apply(lambda x: '200' if x=='-' else x).apply(lambda x: float(x))

In [56]:
abv.head()

Unnamed: 0,0,munged,num
0,beer/ABV: 5.4,5.4,5.4
1,beer/ABV: 5.4,5.4,5.4
2,beer/ABV: 5,5.0,5.0
3,beer/ABV: 5.4,5.4,5.4
4,beer/ABV: 5.4,5.4,5.4


In [57]:
abv[abv['num']<101.0]['num'].mean()

6.6407943347145215

In [58]:
abv[abv['num']<101.0]['num'].median()

6.0

The median and mean ABV both seem fairly high, probably because the craft/artisanal beers that get reviewed most often tend to have a higher ABV. The mean of 6.64% in particular seems intuitively very high, so we will use the median value (6.0) to populate rows with no listed ABV.

In [60]:
abv['munged']=abv['num'].apply(lambda x: 6.0 if x==200.0 else x)

In [64]:
abv.head(10)

Unnamed: 0,0,munged,num
0,beer/ABV: 5.4,5.4,5.4
1,beer/ABV: 5.4,5.4,5.4
2,beer/ABV: 5,5.0,5.0
3,beer/ABV: 5.4,5.4,5.4
4,beer/ABV: 5.4,5.4,5.4
5,beer/ABV: -,6.0,200.0
6,beer/ABV: -,6.0,200.0
7,beer/ABV: 7,7.0,7.0
8,beer/ABV: 7,7.0,7.0
9,beer/ABV: 7,7.0,7.0


In [68]:
# let's create another boolean feature specifying whether the abv was listed,
# in case we need it in the future
abv['abv_listed']=abv['num'].apply(lambda x: False if x==200.0 else True)
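
The sentinel value of 200 works, but the same steps can also be expressed with `NaN` as the missing-value marker. This is a sketch on a few made-up rows in the same format, not a replacement for the cells above; the variable names are illustrative only.

```python
import pandas as pd

raw_abv = pd.Series(["beer/ABV: 5.4", "beer/ABV: -", "beer/ABV: 7", "beer/ABV: 5"])
values = raw_abv.str[len("beer/ABV:"):].str.strip()
# errors="coerce" turns the "-" placeholder into NaN instead of raising.
abv_num = pd.to_numeric(values, errors="coerce")
# True where an ABV was actually listed.
abv_listed = abv_num.notna()
# median() skips NaN by default, so no sentinel filter is needed.
abv_filled = abv_num.fillna(abv_num.median())
print(abv_filled.tolist())
```

With this approach there is no risk of the placeholder leaking into a mean or median if the filter is forgotten.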

In [69]:
abv.head(10)

Unnamed: 0,0,munged,num,abv_listed
0,beer/ABV: 5.4,5.4,5.4,True
1,beer/ABV: 5.4,5.4,5.4,True
2,beer/ABV: 5,5.0,5.0,True
3,beer/ABV: 5.4,5.4,5.4,True
4,beer/ABV: 5.4,5.4,5.4,True
5,beer/ABV: -,6.0,200.0,False
6,beer/ABV: -,6.0,200.0,False
7,beer/ABV: 7,7.0,7.0,True
8,beer/ABV: 7,7.0,7.0,True
9,beer/ABV: 7,7.0,7.0,True


In [70]:
beerStyle.head()

Unnamed: 0,0
0,beer/style: India Pale Ale &#40;IPA&#41;
1,beer/style: India Pale Ale &#40;IPA&#41;
2,beer/style: Bohemian Pilsener
3,beer/style: Kölsch
4,beer/style: Kölsch


In [74]:
beerStyle[0].value_counts().count()

89

In [77]:
# "beer/style:" is 11 characters long; strip() removes the space that follows the colon
beerStyle['munged']=beerStyle[0].apply(lambda x: x[11:].strip())

In [78]:
beerStyle.head()

Unnamed: 0,0,munged
0,beer/style: India Pale Ale &#40;IPA&#41;,India Pale Ale &#40;IPA&#41;
1,beer/style: India Pale Ale &#40;IPA&#41;,India Pale Ale &#40;IPA&#41;
2,beer/style: Bohemian Pilsener,Bohemian Pilsener
3,beer/style: Kölsch,Kölsch
4,beer/style: Kölsch,Kölsch
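
The style names above still contain HTML character references such as `&#40;` and `&#41;`. If plain parentheses are preferred, one option (a sketch, not applied to the data in this notebook) is the standard library's `html.unescape`:

```python
import html
import pandas as pd

styles = pd.Series(["India Pale Ale &#40;IPA&#41;", "Bohemian Pilsener", "Kölsch"])
# Decode numeric and named HTML character references in each value.
decoded = styles.apply(html.unescape)
print(decoded.tolist())  # ['India Pale Ale (IPA)', 'Bohemian Pilsener', 'Kölsch']
```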


In [81]:
reviewAppearance.head()

Unnamed: 0,0
0,review/appearance: 4/5
1,review/appearance: 4/5
2,review/appearance: 4/5
3,review/appearance: 2/5
4,review/appearance: 2/5


In [88]:
reviewAppearance['munged']=reviewAppearance[0].apply(lambda x: x[18:20]).apply(lambda x: int(x))

In [89]:
reviewAppearance.head()

Unnamed: 0,0,munged
0,review/appearance: 4/5,4
1,review/appearance: 4/5,4
2,review/appearance: 4/5,4
3,review/appearance: 2/5,2
4,review/appearance: 2/5,2


In [90]:
reviewAroma.head()

Unnamed: 0,0
0,review/aroma: 6/10
1,review/aroma: 6/10
2,review/aroma: 5/10
3,review/aroma: 4/10
4,review/aroma: 4/10


In [130]:
reviewAroma['munged']=reviewAroma[0].apply(lambda x: x[13:-3]).apply(lambda x: int(x))

In [131]:
reviewAroma.head()

Unnamed: 0,0,munged
0,review/aroma: 6/10,6
1,review/aroma: 6/10,6
2,review/aroma: 5/10,5
3,review/aroma: 4/10,4
4,review/aroma: 4/10,4


In [103]:
reviewOverall['munged']=reviewOverall[0].apply(lambda x: x[15:-3]).apply(lambda x: int(x))

In [104]:
reviewPalate.head()

Unnamed: 0,0
0,review/palate: 3/5
1,review/palate: 4/5
2,review/palate: 3/5
3,review/palate: 2/5
4,review/palate: 2/5


In [110]:
reviewPalate['munged']=reviewPalate[0].apply(lambda x: x[14:16]).apply(lambda x: int(x))

In [111]:
reviewPalate.head()

Unnamed: 0,0,munged
0,review/palate: 3/5,3
1,review/palate: 4/5,4
2,review/palate: 3/5,3
3,review/palate: 2/5,2
4,review/palate: 2/5,2


In [112]:
reviewProfileName.head()

Unnamed: 0,0
0,review/profileName: hopdog
1,review/profileName: TomDecapolis
2,review/profileName: PhillyBeer2112
3,review/profileName: TomDecapolis
4,review/profileName: hopdog


In [116]:
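# "review/profileName:" is 19 characters long; strip() removes the space that follows the colon
reviewProfileName['munged']=reviewProfileName[0].apply(lambda x: x[19:].strip())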

In [117]:
reviewProfileName.head()

Unnamed: 0,0,munged
0,review/profileName: hopdog,hopdog
1,review/profileName: TomDecapolis,TomDecapolis
2,review/profileName: PhillyBeer2112,PhillyBeer2112
3,review/profileName: TomDecapolis,TomDecapolis
4,review/profileName: hopdog,hopdog


In [118]:
reviewTaste.head()

Unnamed: 0,0
0,review/taste: 6/10
1,review/taste: 7/10
2,review/taste: 6/10
3,review/taste: 4/10
4,review/taste: 4/10


In [132]:
reviewTaste['munged']=reviewTaste[0].apply(lambda x: x[13:-3]).apply(lambda x: int(x))
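
The five rating fields all share the `score/scale` shape, so they could be handled by one function instead of five separate slices. A sketch with a hypothetical helper `parse_rating`, which also keeps the scale so that ratings on different scales can be compared:

```python
import pandas as pd

def parse_rating(series):
    """Split 'review/<field>: n/d' strings into integer score and scale columns."""
    parts = series.str.extract(r":\s*(\d+)\s*/\s*(\d+)\s*$")
    parts.columns = ["score", "out_of"]
    return parts.astype(int)

# Example values in the format shown in the sample data.
raw = pd.Series(["review/aroma: 6/10", "review/overall: 16/20", "review/palate: 2/5"])
ratings = parse_rating(raw)
# Normalise each rating to the 0..1 range.
ratings["fraction"] = ratings["score"] / ratings["out_of"]
print(ratings)
```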

In [133]:
reviewTaste.head()

Unnamed: 0,0,munged
0,review/taste: 6/10,6
1,review/taste: 7/10,7
2,review/taste: 6/10,6
3,review/taste: 4/10,4
4,review/taste: 4/10,4


In [137]:
# "review/text:" is 12 characters long; strip() removes the space that follows the colon
reviewText['munged']=reviewText[0].apply(lambda x: x[12:].strip())

In [138]:
reviewText.head()

Unnamed: 0,0,munged
0,"review/text: On tap at the Springfield, PA loc...","On tap at the Springfield, PA location. Poure..."
1,review/text: On tap at the John Harvards in Sp...,On tap at the John Harvards in Springfield PA...
2,"review/text: UPDATED: FEB 19, 2003 Springfield...","UPDATED: FEB 19, 2003 Springfield, PA. I've n..."
3,review/text: On tap the Springfield PA locatio...,On tap the Springfield PA location billed as ...
4,"review/text: On tap at the Springfield, PA loc...","On tap at the Springfield, PA location. Poure..."


In [139]:
reviewTime.head()

Unnamed: 0,0
0,review/time: 1157587200
1,review/time: 1157241600
2,review/time: 958694400
3,review/time: 1157587200
4,review/time: 1157587200


In [145]:
reviewTime['munged']=reviewTime[0].apply(lambda x: x[12:]).apply(lambda x: int(x))
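
The review times are Unix timestamps in seconds. If calendar dates are needed later, `pd.to_datetime` with `unit="s"` converts them. A short sketch on two of the values shown above:

```python
import pandas as pd

unix_seconds = pd.Series([1157587200, 958694400])
# Interpret the integers as seconds since 1970-01-01 00:00:00 UTC.
review_dates = pd.to_datetime(unix_seconds, unit="s")
print(review_dates.dt.strftime("%Y-%m-%d").tolist())  # ['2006-09-07', '2000-05-19']
```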

In [146]:
reviewTime.head()

Unnamed: 0,0,munged
0,review/time: 1157587200,1157587200
1,review/time: 1157241600,1157241600
2,review/time: 958694400,958694400
3,review/time: 1157587200,1157587200
4,review/time: 1157587200,1157587200


In [159]:
ratebeer_data_clean=pd.concat([beerID['munged'].rename('beerID'),
           beerName['munged'].rename('beer_name'),
          abv['munged'].rename('abv'),
          abv['abv_listed'].rename('abv_listed'),
          brewerID['munged'].rename('brewerID'),
          beerStyle['munged'].rename('beer_style'),
           reviewProfileName['munged'].rename('reviewer_username'),
          reviewAppearance['munged'].rename('review_appearance'),
          reviewAroma['munged'].rename('review_aroma'),
           reviewPalate['munged'].rename('review_palate'),
           reviewTaste['munged'].rename('review_taste'),
          reviewOverall['munged'].rename('review_overall'),
          reviewText['munged'].rename('review_text'),
          reviewTime['munged'].rename('review_unix_time')], axis=1)#.reset_index()
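
Because `pd.concat(..., axis=1)` aligns the columns by row index, it relies on every column file having exactly one line per review. If one file had extra or missing lines, the rows would be misaligned without any error. A possible guard, sketched with a hypothetical helper `check_same_length` and demo data:

```python
import pandas as pd

def check_same_length(columns):
    """Return the common row count, or raise if the named Series differ in length."""
    lengths = {name: len(col) for name, col in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()))

# Demo data; in the notebook this would be the munged columns passed to concat.
demo = {
    "beerID": pd.Series([63836, 63836, 71716]),
    "abv": pd.Series([5.4, 5.4, 5.0]),
}
n_rows = check_same_length(demo)
print(n_rows)  # 3
```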

In [162]:
ratebeer_data_clean.to_pickle(path='./ratebeer_clean.pkl')
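
To confirm that a pickled frame can be loaded back unchanged, `pd.read_pickle` can be used for a round trip. The sketch below uses a small stand-in frame and a temporary directory rather than the real `ratebeer_clean.pkl`:

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"beerID": [63836, 71716], "abv": [5.4, 5.0]})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ratebeer_clean.pkl")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes and index.
print(restored.equals(frame))  # True
```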