# Using Pandas and PixieDust to explore canned craft beer in California

We'll use Pandas to clean and combine the initial datasets, then use two other tools ([geopy](https://github.com/geopy/geopy), [Yelp API](https://www.yelp.com/developers)) to enrich the data with coordinates and reviews. We'll then convert the Pandas dataframe to a Spark dataframe so we can visualize it with [PixieDust](https://ibm-cds-labs.github.io/pixiedust/). PixieDust makes it easy to gain insight from your data without needing to know Matplotlib or any other plotting library. Here are a few questions that PixieDust visualizations can help answer:

- Does higher IBU correlate with higher ABV?
- Which city in California brews the hoppiest/most alcoholic beer?
- Does geographical region (Northern, Southern California) affect the type of beer a brewery chooses to brew?
- Which city in California has the highest rated breweries?

In [1]:
import pandas as pd
import numpy as np

# data from https://www.kaggle.com/nickhould/craft-cans (scraped from CraftCans.com in January 2017)
beers = pd.read_csv("beers.csv")
breweries = pd.read_csv("breweries.csv")

In [2]:
beers.head()

,Unnamed: 0,abv,ibu,id,name,style,brewery_id,ounces
0,0,0.05,,1436,Pub Beer,American Pale Lager,408,12.0
1,1,0.066,,2265,Devil's Cup,American Pale Ale (APA),177,12.0
2,2,0.071,,2264,Rise of the Phoenix,American IPA,177,12.0
3,3,0.09,,2263,Sinister,American Double / Imperial IPA,177,12.0
4,4,0.075,,2262,Sex and Candy,American IPA,177,12.0
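
The `Unnamed: 0` column in this output comes from an unlabeled first column in the CSV file, which pandas names automatically. As a rough sketch of an alternative to dropping it afterwards, `index_col=0` tells `read_csv` to use that column as the index. The inline CSV text is a two-row stand-in built from the rows shown above, and its header layout is an assumption about the raw file, not a copy of it.

```python
import io
import pandas as pd

# Stand-in for beers.csv: the header starts with an empty name (assumed layout)
csv_text = (
    ",abv,ibu,id,name,style,brewery_id,ounces\n"
    "0,0.05,,1436,Pub Beer,American Pale Lager,408,12.0\n"
    "1,0.066,,2265,Devil's Cup,American Pale Ale (APA),177,12.0\n"
)

# Default read: the unlabeled column becomes a regular column named 'Unnamed: 0'
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the unlabeled column becomes the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

For `breweries.csv` the same column doubles as the brewery ID, so renaming it to `brewery_id` is still the appropriate choice there.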


In [3]:
breweries.head()

,Unnamed: 0,name,city,state
0,0,NorthGate Brewing,Minneapolis,MN
1,1,Against the Grain Brewery,Louisville,KY
2,2,Jack's Abby Craft Lagers,Framingham,MA
3,3,Mike Hess Brewing Company,San Diego,CA
4,4,Fort Point Beer Company,San Francisco,CA


In [4]:
# remove redundant column, rename columns for clarity
beers = beers.drop('Unnamed: 0',axis=1)
breweries = breweries.rename(columns = {'Unnamed: 0': 'brewery_id', 'name': 'brewery_name'})

# merge dataframes, drop beers with no IBU value, express ABV as a percentage
data = pd.merge(beers,breweries,on='brewery_id',how='inner')
data = data[np.isfinite(data['ibu'])]
data['abv'] = data['abv']*100
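
The merge-and-clean step can also be written with two pandas options that make its intent more explicit. This is a sketch on three sample rows taken from the notebook outputs, not a replacement for the cell above: `validate="many_to_one"` checks that each `brewery_id` appears only once in `breweries`, and `dropna(subset=["ibu"])` drops only the rows with a missing IBU.

```python
import pandas as pd

# Sample rows copied from the notebook outputs (ABV is still a fraction here)
beers = pd.DataFrame({
    "abv": [0.066, 0.061, 0.099],
    "ibu": [float("nan"), 60.0, 92.0],
    "name": ["Devil's Cup", "Bitter Bitch", "Lower De Boom"],
    "brewery_id": [177, 177, 368],
})
breweries = pd.DataFrame({
    "brewery_id": [177, 368],
    "brewery_name": ["18th Street Brewery", "21st Amendment Brewery"],
    "city": ["Gary", "San Francisco"],
    "state": [" IN", " CA"],
})

# validate raises MergeError if a brewery_id is duplicated in breweries
merged = pd.merge(beers, breweries, on="brewery_id", how="inner",
                  validate="many_to_one")

# Drop only the rows whose IBU is missing, then convert ABV to percent
merged = merged.dropna(subset=["ibu"])
merged["abv"] = merged["abv"] * 100

print(merged[["name", "abv", "ibu", "brewery_name"]])
```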

In [5]:
data.head()

,abv,ibu,id,name,style,brewery_id,ounces,brewery_name,city,state
14,6.1,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0,18th Street Brewery,Gary,IN
21,9.9,92.0,1036,Lower De Boom,American Barleywine,368,8.4,21st Amendment Brewery,San Francisco,CA
22,7.9,45.0,1024,Fireside Chat,Winter Warmer,368,12.0,21st Amendment Brewery,San Francisco,CA
24,4.4,42.0,876,Bitter American,American Pale Ale (APA),368,12.0,21st Amendment Brewery,San Francisco,CA
25,4.9,17.0,802,Hell or High Watermelon Wheat (2009),Fruit / Vegetable Beer,368,12.0,21st Amendment Brewery,San Francisco,CA
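
With ABV and IBU in the same dataframe, the first question from the introduction can get a quick numeric cross-check before any plotting. The sketch below computes a Pearson correlation on the five rows shown above only, so the number it prints says nothing reliable about the full dataset.

```python
import pandas as pd

# ABV (%) and IBU for the five rows shown in data.head() above
sample = pd.DataFrame({
    "abv": [6.1, 9.9, 7.9, 4.4, 4.9],
    "ibu": [60.0, 92.0, 45.0, 42.0, 17.0],
})

# Pearson correlation coefficient between the two columns
r = sample["abv"].corr(sample["ibu"])
print(f"Pearson r on {len(sample)} rows: {r:.2f}")
```

In the notebook itself the equivalent call would be `data['abv'].corr(data['ibu'])`.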


In [6]:
data['state'].unique()

array([' IN', ' CA', ' FL', ' MO', ' WA', ' CO', ' LA', ' KY', ' OR',
       ' AK', ' NC', ' MI', ' TX', ' AL', ' MA', ' AZ', ' MN', ' ME',
       ' VA', ' IL', ' TN', ' MT', ' WY', ' NE', ' NY', ' NJ', ' NV',
       ' OK', ' WI', ' OH', ' GA', ' RI', ' IA', ' ID', ' DC', ' KS',
       ' ND', ' VT', ' MD', ' WV', ' CT', ' PA', ' HI', ' NM', ' MS',
       ' AR', ' SC', ' DE', ' UT', ' NH'], dtype=object)

In [7]:
# remove space in front of state names
data['state'] = data['state'].str.slice(1,3)
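
Slicing by position works here because every value has exactly one leading space. As a sketch of a more tolerant alternative, `str.strip()` removes leading and trailing whitespace however much there is; the three sample values are taken from the `unique()` output above.

```python
import pandas as pd

states = pd.Series([" IN", " CA", " FL"])

# Positional slice: correct only when exactly one leading character must go
sliced = states.str.slice(1, 3)

# strip(): removes surrounding whitespace regardless of its length
stripped = states.str.strip()

print(stripped.tolist())
```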

In [8]:
ca = data[data['state'] == 'CA']
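
Selecting rows with a boolean mask, as above, gives a dataframe that pandas may treat as a copy of a slice of `data`. Adding columns to it later is what produces the `SettingWithCopyWarning` messages that appear further down in this notebook. One common way to avoid them, sketched here on a three-row stand-in for `data`, is to take an explicit copy at selection time.

```python
import pandas as pd

# Small stand-in for the merged dataframe
data = pd.DataFrame({
    "city": ["Gary", "San Francisco", "San Francisco"],
    "state": ["IN", "CA", "CA"],
})

# .copy() makes `ca` an independent dataframe rather than a selection of `data`
ca = data[data["state"] == "CA"].copy()

# Adding a column now affects only `ca`
ca["city-state"] = ca["city"] + ", " + ca["state"]

print(ca)
```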

In [9]:
ca['brewery_name'].value_counts()

21st Amendment Brewery               17
Golden Road Brewing                  14
Anderson Valley Brewing Company      14
Modern Times Beer                     8
TailGate Beer                         7
Mike Hess Brewing Company             6
Sierra Nevada Brewing Company         6
Manzanita Brewing Company             5
Ruhstaller Beer Company               5
Black Market Brewing Company          4
Ballast Point Brewing Company         4
Fort Point Beer Company               4
Central Coast Brewing Company         4
Saint Archer Brewery                  4
Hess Brewing Company                  3
Mission Brewery                       3
Devil's Canyon Brewery                3
The Dudes' Brewing Company            3
Firestone Walker Brewing Company      3
Headlands Brewing Company             3
Mavericks Beer Company                3
Hangar 24 Craft Brewery               2
Figueroa Mountain Brewing Company     2
Butcher's Brewing                     1
Mother Earth Brew Company             1


In [10]:
ca['city'].value_counts()

San Diego              35
San Francisco          22
Boonville              14
Los Angeles            14
Chico                   6
Temecula                5
Santee                  5
Sacramento              5
San Luis Obispo         4
Torrance                3
Paso Robles             3
Half Moon Bay           3
Belmont                 3
Mill Valley             3
Redlands                2
Santa Cruz              2
Buellton                2
Claremont               1
Vista                   1
South San Francisco     1
Carlsbad                1
Name: city, dtype: int64

In [11]:
# use geopy to get coordinate data so we can use PixieDust's mapping capabilities
from geopy.geocoders import Nominatim
# recent geopy versions require a custom user_agent; this value is a placeholder
geolocator = Nominatim(user_agent="ca-craft-beer-notebook")

# make it easier for geopy to find coordinates
ca['city-state'] = ca['city']+", "+ca['state']
cities = ca['city-state'].unique()

# geocode each city once, then create dictionaries of latitudes and longitudes
locations = {city: geolocator.geocode(city) for city in cities}
lats = {city: loc.latitude for city, loc in locations.items()}
longs = {city: loc.longitude for city, loc in locations.items()}

ca['latitude'] = ca['city-state'].map(lats)
ca['longitude'] = ca['city-state'].map(longs)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
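
Separately from the warning, the geocoding cell assumes every lookup succeeds. geopy's `geocode` returns `None` when it finds no match, and reading `.latitude` from `None` would then raise an error. The sketch below shows one way to collect misses instead of failing. `geocode_cities` and `fake_geocode` are hypothetical names introduced for illustration, and the offline stand-in geocoder exists only so the example runs without network access.

```python
from collections import namedtuple

# Stand-in for the object a geocoder returns (has .latitude and .longitude)
Location = namedtuple("Location", ["latitude", "longitude"])

def geocode_cities(cities, geocode):
    """Geocode each city once; return (lats, longs, not_found)."""
    lats, longs, not_found = {}, {}, []
    for city in cities:
        location = geocode(city)
        if location is None:
            # No match: remember the city instead of raising AttributeError
            not_found.append(city)
            continue
        lats[city] = location.latitude
        longs[city] = location.longitude
    return lats, longs, not_found

# Hypothetical offline geocoder; returns None for anything it does not know
known = {"San Francisco, CA": Location(37.7792808, -122.4192362)}
fake_geocode = known.get

lats, longs, not_found = geocode_cities(
    ["San Francisco, CA", "Nowhere, CA"], fake_geocode
)
print(lats, longs, not_found)
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`. geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which can wrap the geocode function to space out requests.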


In [12]:
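import pixiedust
from pyspark.sql import SQLContext

# convert Pandas dataframe to Spark dataframe for use with PixieDust
# (assumes the notebook environment already provides the SparkContext `sc`)
sqlContext = SQLContext(sc)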
ca2 = sqlContext.createDataFrame(ca)

display(ca2)

abv,ibu,id,name,style,brewery_id,ounces,brewery_name,city,state,city-state,latitude,longitude
9.9,92.0,1036,Lower De Boom,American Barleywine,368,8.4,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
7.9,45.0,1024,Fireside Chat,Winter Warmer,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
4.4,42.0,876,Bitter American,American Pale Ale (APA),368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
4.9,17.0,802,Hell or High Watermelon Wheat (2009),Fruit / Vegetable Beer,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
4.9,17.0,801,Hell or High Watermelon Wheat (2009),Fruit / Vegetable Beer,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
4.9,17.0,800,21st Amendment Watermelon Wheat Beer (2006),Fruit / Vegetable Beer,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
7.0,70.0,799,21st Amendment IPA (2006),American IPA,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
7.0,70.0,797,Brew Free! or Die IPA (2008),American IPA,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
7.0,70.0,796,Brew Free! or Die IPA (2009),American IPA,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
8.5,52.0,531,Special Edition: Allies Win The War!,English Strong Ale,368,12.0,21st Amendment Brewery,San Francisco,CA,"San Francisco, CA",37.7792808,-122.4192362
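
PixieDust charts are built by choosing keys, values, and an aggregation in the chart options. As a rough pandas equivalent of an average-by-key chart, the sketch below groups the ten rows displayed above by style and averages ABV and IBU. It only illustrates the kind of summary a chart would show and makes no claim about how PixieDust computes it internally.

```python
import pandas as pd

# Style, ABV (%), and IBU for the ten rows displayed above
rows = pd.DataFrame({
    "style": [
        "American Barleywine", "Winter Warmer", "American Pale Ale (APA)",
        "Fruit / Vegetable Beer", "Fruit / Vegetable Beer",
        "Fruit / Vegetable Beer", "American IPA", "American IPA",
        "American IPA", "English Strong Ale",
    ],
    "abv": [9.9, 7.9, 4.4, 4.9, 4.9, 4.9, 7.0, 7.0, 7.0, 8.5],
    "ibu": [92.0, 45.0, 42.0, 17.0, 17.0, 17.0, 70.0, 70.0, 70.0, 52.0],
})

# Average ABV and IBU per style, hoppiest style first
by_style = (
    rows.groupby("style")[["abv", "ibu"]]
    .mean()
    .sort_values("ibu", ascending=False)
)
print(by_style)
```

Grouping by `city` instead of `style` on the full `ca` dataframe would give the per-city view that the introduction's questions ask about.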


In [21]:
# use the Yelp API to get reviews for each brewery in the list
# API tokens have been redacted so this step won't work unless you put your own in

import reviews

def rev(r):
    return reviews.query_api(r['brewery_name'],r['city-state'])

bs = ca[['brewery_name','city-state','latitude','longitude']]
bs = bs.drop_duplicates()
bs['review'] = bs.apply(rev,axis=1)

bs2 = sqlContext.createDataFrame(bs)
display(bs2)

HTTPError: HTTP Error 403: Forbidden
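
A single rejected request aborts the whole `apply` call, as the error above shows. The sketch below wraps the lookup so a failure yields `None` for that brewery instead. It assumes the helper raises `urllib.error.HTTPError` (the Python 3 exception whose message format matches the output above). `safe_review` and `forbidden_query` are hypothetical names, and `forbidden_query` is a stand-in that always fails so the example runs without API tokens.

```python
from urllib.error import HTTPError

def safe_review(row, query):
    """Return query(name, location), or None if the request is rejected."""
    try:
        return query(row["brewery_name"], row["city-state"])
    except HTTPError as err:
        print(f"{row['brewery_name']}: HTTP {err.code}, skipping")
        return None

# Hypothetical stand-in for the real lookup; always fails with a 403
def forbidden_query(name, location):
    raise HTTPError("https://example.invalid", 403, "Forbidden", None, None)

row = {
    "brewery_name": "21st Amendment Brewery",
    "city-state": "San Francisco, CA",
}
result = safe_review(row, forbidden_query)
print(result)
```

In the notebook this could be applied as `bs.apply(lambda r: safe_review(r, reviews.query_api), axis=1)`, given valid API tokens.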