# Craft Breweries per Million by State - With Answers

Now that you're a Pandas expert, we'll use two online data sources to determine which state has the most craft breweries per million people!

If you want to learn how the data is pulled into Python, read along through the next section. If you'd rather start working with the data right away, feel free to jump to the [Calculating Breweries per Million](#Calculating-Breweries-per-Million) section below.

------------------------
## Reading in the Data

First, we'll import the packages we need.

In [None]:
import openbrewerydb #For getting brewery data
import requests #For interacting with Census Bureau API
import pandas as pd 

Awesome! `openbrewerydb` is a very simple Python package with one function: [`.load()`](https://jrbourbeau.github.io/openbrewerydb-python/api.html). Load takes a couple of keyword arguments which work as query parameters, allowing you to specify which data you want to pull from the brewery database. In this instance, however, we'll ignore all of those options (except `verbose`, which prints updates as the data is pulled) in order to pull all data from the database.

In [None]:
#Use OpenBreweryDB to get Brewery Data
brewery_data = openbrewerydb.load(verbose=True)

Let's take a look at our brewery data!

In [None]:
brewery_data.head()

Awesome! It looks like we have data by brewery, with info on each brewery such as brewery type, address, and coordinates (latitude/longitude).

Next, we'll use the Census Bureau Web API to programmatically pull state population data from the web.

The code below uses the `requests` package, which is used for interacting with websites. In this instance, we use the `requests` package to make a GET request, which gets the data from the specified URL (`pop_url`). The data is returned as JSON, or JavaScript Object Notation. You can think about JSON as a collection of Python lists and dictionaries. We can access the JSON using the `json()` method of the `Response` object (in this case, the variable `r`) returned by the `get()` function call.

In [None]:
#Use Census API to get State-Level Population Data
pop_url = r"https://api.census.gov/data/2018/acs/acs1?get=NAME,group(B01003)&for=state"
r = requests.get(pop_url)
pop_json = r.json()
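To make the "lists and dictionaries" idea concrete, here is a minimal sketch that parses a small, made-up JSON string with the standard-library `json` module. The text in `raw_text` is invented for illustration (it only mimics the general shape of the Census response), and `json.loads()` stands in for roughly what `r.json()` does with the body of the response.

```python
import json

# Made-up JSON text shaped like a "header row, then data rows" response.
# The names and numbers are illustrative only, not real Census output.
raw_text = '[["NAME", "B01003_001E", "state"], ["Examplia", "1000", "98"]]'

parsed = json.loads(raw_text)  # JSON arrays become Python lists

print(type(parsed))   # the outer container is a list
print(parsed[0])      # the first inner list holds the column names
print(parsed[1][0])   # nested lists are indexed like any other list
```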

Now that we've retrieved the data, we can look at just the first 5 elements of `pop_json` below.

In [None]:
print(type(pop_json))
print()
pop_json[0:5]

The JSON data is a 2-dimensional list (a list of lists) where the first list contains the column names and the following lists contain the rows of data, one list per row. Luckily, Pandas is used to converting this type of data into DataFrames! To create a DataFrame from this data, we simply split the column names from the data using positional indexing. We then use the `DataFrame` constructor function to create the DataFrame, specifying the columns separately from the data.

*Note that the column name "NAME" shows up twice in the column names. So in the code below, I remove the first "NAME" (which is in the first column position) and replace it with "STATE_NAME" instead.*

In [None]:
column_names = pop_json[0]
column_names = ["STATE_NAME"] + column_names[1:]
pop_data = pop_json[1:]
pop_df = pd.DataFrame(pop_data, columns=column_names)

And voila! We have population data!

In [None]:
pop_df.head()

-------------------------------------

## Calculating Breweries per Million

Now that we have population data and brewery data, we can count the number of breweries in each state! 

Note that, along the way, we'll introduce a few new functions/methods that you haven't seen before. But with your existing Pandas expertise, learning these new functions should be a cinch!

### <ins>Cleaning Up the Data</ins>

First, we'll clean up the population data a bit. The state name is stored in `"STATE_NAME"` and the 2018 population is stored in `"B01003_001E"`. Limit the dataset to just these two variables. Then, rename `"B01003_001E"` to `"State Population"`.

In [None]:
pop_df = pop_df[["STATE_NAME", "B01003_001E"]]
pop_df = pop_df.rename(columns={"B01003_001E" : "State Population"})

Now that the data is read in and limited to just the variables we care about, let's look at the data types of the two variables using the `.dtypes` attribute.

In [None]:
pop_df.dtypes

It looks like the `"State Population"` variable isn't numeric. We'll need to work with this as numeric data later on, so change the data type of `"State Population"` to a `float` using the `astype()` method.

In [None]:
pop_df["State Population"] = pop_df["State Population"].astype(float)
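As an aside, `astype(float)` raises an error if any value cannot be parsed as a number. The sketch below uses a made-up column containing one bad value (`"n/a"`) to show `pd.to_numeric()` with `errors="coerce"`, which converts unparseable entries to `NaN` instead. The toy data is an assumption for illustration; the Census column used above converts cleanly with `astype(float)`.

```python
import pandas as pd

# Made-up population strings, including one value that is not a number
toy = pd.DataFrame({"State Population": ["1000", "2500", "n/a"]})
print(toy.dtypes)  # the column is not numeric yet

# errors="coerce" turns anything that cannot be parsed into NaN
toy["State Population"] = pd.to_numeric(toy["State Population"], errors="coerce")
print(toy.dtypes)  # the column is now float64
print(toy)
```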

Great! Next, we'll turn to the brewery data and aggregate the data, calculating the number of each type of brewery within each state.

To do this, first create a variable `"brewery count"` that takes the value `1`. Then sum this variable within `"state"` and `"brewery_type"`.

In [None]:
brewery_data["brewery count"] = 1

In [None]:
brewery_counts = brewery_data.groupby(["state", "brewery_type"])[["brewery count"]].sum()

You'll notice that after aggregating, `"state"` and `"brewery_type"` became the row indices, creating a multi-indexed DataFrame. We'll need to convert these back to columns before proceeding. You can easily convert `"state"` and `"brewery_type"` back into columns and reset the index to the row number by using the [`reset_index()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html) method. The `reset_index()` syntax looks like the following:

```python
df = df.reset_index()
```

So now, use the `reset_index()` method to convert `"state"` and `"brewery_type"` back to columns and to reset the index.


In [None]:
brewery_counts = brewery_counts.reset_index()
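To see the count-then-flatten pattern on something small, here is a sketch using a made-up four-row DataFrame (the state names are invented). It also shows `groupby(...).size()` as one possible shortcut that gives the same counts without creating a helper column of `1`s.

```python
import pandas as pd

# Made-up brewery rows; the state names are invented for illustration
toy = pd.DataFrame({
    "state": ["Examplia", "Examplia", "Examplia", "Testland"],
    "brewery_type": ["micro", "micro", "brewpub", "micro"],
})
toy["brewery count"] = 1

# Summing the column of 1s within each group counts the rows in that group
grouped = toy.groupby(["state", "brewery_type"])[["brewery count"]].sum()
print(grouped.index.names)  # the grouping keys now live in the row index

flat = grouped.reset_index()  # move the keys back into ordinary columns
print(flat)

# Alternative: size() counts rows per group directly
alt = toy.groupby(["state", "brewery_type"]).size().reset_index(name="brewery count")
print(alt)
```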

Great! Now we have brewery counts by type. However, we want to flip this data to become *wide* rather than *long*. In other words, we want to have columns where the column names are the type of brewery and the column values are the count associated with that brewery type. We can do that in Pandas using the `pivot()` command. The syntax for the `.pivot()` command is the following:

```python
df = df.pivot(index='row_variable', columns='column_variable', values='value_variable')
```

Use the `pivot` command, specifying `"state"` as the `index`, `"brewery_type"` as the `columns`, and `"brewery count"` as the `values`.

In [None]:
brewery_counts = brewery_counts.pivot(index='state', columns='brewery_type', values='brewery count')

Again, the `pivot()` command resulted in a DataFrame where the values of `"state"` are the row indices. To turn `"state"` back into a regular column, run the `reset_index()` function again.

In [None]:
brewery_counts = brewery_counts.reset_index()

This will have reset the row index back to the default row numbers. The column axis, however, still carries the name `"brewery_type"`, which shows up in the top-left corner when the DataFrame is displayed. You can choose to clear that label, but that's not necessary.
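The long-to-wide step can be sketched on a tiny made-up table (state names invented for illustration). The last step uses `rename_axis(None, axis=1)`, which is one way to clear the leftover `"brewery_type"` label on the column axis; it is optional and shown here only for completeness.

```python
import pandas as pd

# Made-up long-format counts: one row per state/type combination
long_df = pd.DataFrame({
    "state": ["Examplia", "Examplia", "Testland"],
    "brewery_type": ["brewpub", "micro", "micro"],
    "brewery count": [1, 2, 1],
})

# One row per state, one column per brewery type
wide_df = long_df.pivot(index="state", columns="brewery_type", values="brewery count")
wide_df = wide_df.reset_index()
print(wide_df.columns.name)  # the column axis is still labeled "brewery_type"

# Optional: clear the label on the column axis
wide_df = wide_df.rename_axis(None, axis=1)
print(wide_df)  # Testland has no brewpub row, so that cell is NaN
```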

### <ins>Merging the Data</ins>

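Great! Now, merge the population data onto the brewery counts by state (stored in `"state"` in the brewery data and in `"STATE_NAME"` in the population data). Use the `validate` argument to make sure the merge is many-to-one and the `indicator` argument to make sure everything merges correctly.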

In [None]:
merged_df = brewery_counts.merge(pop_df, left_on="state", right_on="STATE_NAME", how="outer", indicator=True, validate="m:1")

Unlike with the Girl Scout Cookie data, it's hard to tell if everything merged by just looking at the data. One way to check programmatically is the following, where `merged_df` is my merged DataFrame:

```python
print(merged_df["_merge"].unique())
```

This will print all of the unique values of `"_merge"`. If the only value is `"both"`, everything merged!

In [None]:
merged_df._merge.unique()

Did everything merge? If not, what do the observations that didn't merge look like? Are these observations that should have merged or are they observations that should be dropped? If they are observations that should have merged, modify the data appropriately to make sure the data merges correctly. If they are extra observations, drop them using boolean indexing.

In [None]:
merged_df = merged_df[merged_df._merge == "both"]
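To look at the rows that did not merge before deciding what to do with them, you can filter on `"_merge"`. The sketch below uses two made-up tables (all names and numbers are invented) in which one row exists only on the left and one only on the right.

```python
import pandas as pd

# Made-up inputs: "Nowhere" has no population row, "Otherland" has no breweries
left = pd.DataFrame({"state": ["Examplia", "Testland", "Nowhere"],
                     "micro": [2, 1, 5]})
right = pd.DataFrame({"STATE_NAME": ["Examplia", "Testland", "Otherland"],
                      "State Population": [1000.0, 2500.0, 400.0]})

merged = left.merge(right, left_on="state", right_on="STATE_NAME",
                    how="outer", indicator=True, validate="m:1")

# How many rows fall into each merge outcome?
print(merged["_merge"].value_counts())

# Inspect the rows that did not match on both sides
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)

# Keep only the rows that matched, using boolean indexing
matched = merged[merged["_merge"] == "both"]
```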

Then, drop the `"STATE_NAME"` and `"_merge"` columns.

In [None]:
merged_df = merged_df.drop(columns=["STATE_NAME", "_merge"])

In [None]:
merged_df.head()

### <ins>Analyzing the Data</ins>

Now we have our analysis dataset! Our ultimate goal is to calculate breweries per million population for each brewery type as well as across all brewery types.

First, you'll notice that the `.pivot()` function produced a number of `NaN`s in our data. These `NaN`s result from the fact that no data was present for that brewery type-state combination, i.e. each `NaN` represents the fact that there are 0 breweries in the data for the given brewery type and state.

In order to clean up the data, let's convert all of the `NaN`s to `0`s. This can be done very easily using the `fillna()` function. The syntax of the `fillna()` function looks like the following

```python
df = df.fillna(value)
```

where `value` is the value you wish to replace the `NaN`s with.

Below, use `fillna()` to replace all `NaN`s with `0`s.

In [None]:
analysis_df = merged_df.fillna(0)
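A quick way to confirm that the fill worked is to count the missing values before and after. This is a minimal sketch on a made-up two-row DataFrame; note that `fillna()` returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Made-up wide counts with one missing combination
toy = pd.DataFrame({
    "state": ["Examplia", "Testland"],
    "brewpub": [1.0, float("nan")],
    "micro": [2.0, 1.0],
})

print(toy.isna().sum())  # number of NaNs in each column before filling

filled = toy.fillna(0)
print(filled.isna().sum())  # every column should now report 0
print(filled)
```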

Great! Now, let's change the values of the brewery count variables (`"bar"`, `"brewpub"`, etc.) to be the number of breweries per million population.

To do that, first use the `.columns` attribute of your DataFrame and a list comprehension to create a variable called `brewery_vars` which is a list of just the brewery count columns.

*Hint: `df.columns` is an iterable and the two columns that do **not** denote brewery type counts both share the word `"state"`...*

Note that you could also just list out the columns by hand and store them in the `brewery_vars` list.

In [None]:
brewery_vars = [c for c in analysis_df.columns if "state" not in c.lower()]

Now that you have a list of the brewery columns, loop through the list and use the `"State Population"` data to calculate the counts per million, replacing the column values as you go along.

In [None]:
for c in brewery_vars:
    analysis_df[c] = analysis_df[c] * 1000000 / analysis_df["State Population"]
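The same calculation can also be written without a loop. The sketch below, on made-up numbers, uses the DataFrame methods `mul()` and `div()` with `axis=0` so that each row is divided by that row's population. The names `toy`, `count_cols`, and `per_million` are placeholders for illustration.

```python
import pandas as pd

# Made-up counts and populations
toy = pd.DataFrame({
    "state": ["Examplia", "Testland"],
    "brewpub": [1.0, 0.0],
    "micro": [2.0, 1.0],
    "State Population": [500_000.0, 2_000_000.0],
})
count_cols = ["brewpub", "micro"]

per_million = toy.copy()
# axis=0 aligns the population Series with the rows, so every count column
# in a given row is divided by that row's population
per_million[count_cols] = (
    toy[count_cols].mul(1_000_000).div(toy["State Population"], axis=0)
)
print(per_million)
```

Either approach gives the same numbers; the loop is easier to read step by step, while the vectorized form handles all count columns in one expression.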

Now we're almost done! Lastly, add a `"Total"` column that is the sum of all of the individual brewery counts per 1,000,000 columns. You can do that by either looping over the columns, as we did above, or by using the `.sum()` method. The syntax of the `.sum()` method is as follows

```python
df["sum"] = df[list_of_columns].sum(axis=1)
```

In short, the `sum()` method sums over all of the columns or rows of a DataFrame. If `axis=0`, it sums down the rows (over all of the observations), giving one total per column, while if `axis=1`, it sums across the columns, giving one total per row. In either case, `.sum()` returns a Series with the summations. In order to limit the number of columns that are included in the summation, the DataFrame `df` is limited to just the columns included in `list_of_columns` before the call to the `sum()` method. The result is a Series that is stored in a new column called `"sum"`.

Below, use the `.sum()` method and the `brewery_vars` list to create a new column `"Total"` that is the total number of breweries per 1,000,000.

In [None]:
analysis_df["Total"] = analysis_df[brewery_vars].sum(axis=1)
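The difference between the two `axis` values is easiest to see side by side. Here is a small sketch on made-up per-million values, indexed by invented state names:

```python
import pandas as pd

# Made-up per-million values for two states
toy = pd.DataFrame(
    {"brewpub": [2.0, 0.0], "micro": [4.0, 0.5]},
    index=["Examplia", "Testland"],
)

col_totals = toy.sum(axis=0)  # one total per column (summing down the rows)
row_totals = toy.sum(axis=1)  # one total per row (summing across the columns)

print(col_totals)
print(row_totals)
```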

Finally, we'll clean up the DataFrame. Now that we've used the `"State Population"` variable, drop that column from the DataFrame. After that, loop over the `brewery_vars` variable and the `"state"` column, using the string method `.title()` to proper-case each column name.

In [None]:
analysis_df = analysis_df.drop(columns="State Population")

In [115]:
for c in ["state"] + brewery_vars:
    # columns= is required; a positional mapping would be applied to the row index instead
    analysis_df = analysis_df.rename(columns={c : c.title()})
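One detail worth knowing about `rename()`: a mapping passed positionally is applied to the row labels, so column renames need the `columns=` keyword. The sketch below shows this on a made-up one-row DataFrame, and also shows that `columns=` accepts a function such as `str.title`, which proper-cases every column name in one call.

```python
import pandas as pd

# Made-up one-row frame with lowercase column names
toy = pd.DataFrame({"state": ["Examplia"], "brewpub": [2.0], "Total": [2.0]})

# A positional mapping targets row labels, so the columns are left unchanged
unchanged = toy.rename({"state": "State"})
print(unchanged.columns.tolist())

# With columns=, a function is applied to every column name
renamed = toy.rename(columns=str.title)
print(renamed.columns.tolist())
```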

Then, use the `.sort_values()` method to sort the DataFrame in descending order based on the `"Total"` column. Which state has the most breweries per 1,000,000 people?

In [116]:
analysis_df.sort_values("Total", ascending=False)

|    | State | Bar | Brewpub | Closed | Contract | Large | Micro | Planning | Proprietor | Regional | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 46 | Vermont | 0.0 | 25.546903 | 0.0 | 0.0 | 0.0 | 54.28717 | 6.386726 | 0.0 | 11.17677 | 97.397569 |
| 27 | Montana | 0.0 | 16.002937 | 0.0 | 1.882698 | 0.0 | 57.422303 | 10.354842 | 0.0 | 1.882698 | 87.545479 |
| 20 | Maine | 0.0 | 23.161915 | 0.0 | 1.494317 | 0.0 | 52.301099 | 5.977268 | 0.747159 | 2.241476 | 85.923234 |
| 5 | Colorado | 0.0 | 24.229383 | 0.0 | 1.580177 | 1.229027 | 38.626552 | 8.778762 | 0.702301 | 1.931328 | 77.077529 |
| 38 | Oregon | 0.0 | 25.055402 | 0.0 | 0.715869 | 1.431737 | 35.077563 | 4.533835 | 0.477246 | 3.34072 | 70.632372 |
| 1 | Alaska | 0.0 | 16.272554 | 0.0 | 1.356046 | 0.0 | 33.901155 | 8.136277 | 0.0 | 1.356046 | 61.022079 |
| 51 | Wyoming | 0.0 | 20.770697 | 0.0 | 1.730891 | 0.0 | 24.232479 | 10.385348 | 0.0 | 1.730891 | 58.850307 |
| 48 | Washington | 0.0 | 14.597395 | 0.0 | 0.796222 | 0.530814 | 33.574009 | 3.981108 | 0.530814 | 0.928925 | 54.939287 |
| 30 | New Hampshire | 0.0 | 16.218711 | 0.0 | 0.737214 | 0.737214 | 27.276923 | 7.372141 | 0.0 | 1.474428 | 53.816631 |
| 32 | New Mexico | 0.0 | 20.998097 | 0.0 | 0.477229 | 0.0 | 14.316884 | 2.863377 | 0.477229 | 0.954459 | 40.087276 |
