<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_06_AggregateFunctions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aggregate Functions

## Introduction to the Minnesota Beer Rating Data Set

The data set we'll be working with contains information about various beers brewed in Minnesota. It's a fascinating collection of data that helps us understand the preferences and characteristics of different beers. Let's break down the columns in this data set:

1.  BeerName: The name of the beer.
2.  Brewery: The name of the brewery that produces the beer.
3.  Style: The style of the beer, such as "Imperial IPA" or "Russian Imperial Stout." This categorizes the beer based on its taste, color, aroma, and other characteristics.
4.  ABV (Alcohol By Volume): This represents the alcohol content in the beer, measured as a percentage. For example, a beer with an ABV of 9.2% contains 9.2% alcohol by volume.
5.  NumRating: The number of ratings the beer has received on Beer Advocate, a popular beer rating website.
6.  AvgRating: The average rating of the beer on a scale from 1 to 5, where a higher rating indicates a more favorable review.



In [1]:
# Importing the Pandas library
import pandas as pd

# Loading the Minnesota beer rating data from the provided CSV file
!wget 'https://github.com/brendanpshea/data-science/raw/main/data/minnesota_beers.csv' -q
minnesota_beers_df = pd.read_csv("minnesota_beers.csv")

# Displaying the first few rows of the data to understand its structure
minnesota_beers_df.head()


Unnamed: 0,BeerName,Brewery,Style,ABV,NumRating,AvgRating
0,Nillerzzzzz,Forager Brewing Company,American Imperial Stout,14.0,143,4.58
1,Abrasive Ale,Surly Brewing Company,Imperial IPA,9.2,4828,4.5
2,Barrel-Aged Silhouette,Lift Bridge Brewery,Russian Imperial Stout,11.0,551,4.5
3,Darkness,Surly Brewing Company,Russian Imperial Stout,12.0,4252,4.48
4,Darkness - Bourbon Barrel-Aged,Surly Brewing Company,Russian Imperial Stout,12.0,356,4.46


In [2]:
# Display basic information about the table
minnesota_beers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   BeerName   100 non-null    object 
 1   Brewery    100 non-null    object 
 2   Style      100 non-null    object 
 3   ABV        100 non-null    float64
 4   NumRating  100 non-null    int64  
 5   AvgRating  100 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 4.8+ KB


#### Examples:

-   Nillerzzzzz is an "American Imperial Stout" with an ABV of 14.0%, and it has received 143 ratings with an average rating of 4.58.
-   Abrasive Ale is an "Imperial IPA" with an ABV of 9.2%, and it has received 4828 ratings with an average rating of 4.50.

This data set provides a rich source of information for exploring various aspects of the beer industry in Minnesota, including consumer preferences, beer characteristics, and brewery performance. By using aggregate functions in Pandas, we can analyze this data in many interesting ways, such as finding the highest-rated beers, understanding the popularity of different beer styles, or identifying trends in alcohol content.

In the following sections, we'll learn how to use Pandas to perform these analyses and more, providing you with valuable insights and techniques to work with structured data. So grab your favorite beer (if you're of legal drinking age, of course!) and let's dive into the world of data analysis with Pandas!

## Grouping
In this section, we will explore how to group data by a specific category and then calculate the mean, or average, for each group. The emphasis is on the powerful concept of "grouping," which allows us to segment data into categories and perform calculations on each category separately.
### What is the Average Alcohol Content (ABV) for Each Beer Style?

Our first example will be calculating the average Alcohol By Volume (ABV) for each beer style.

Imagine you are at a beer festival in Minnesota, and you are curious to know how different styles of beer compare in terms of alcohol content. Some styles might be stronger, while others might be lighter. How can we figure this out using data analysis?

The answer lies in grouping the beers by their style and then calculating the average ABV for each group. By doing so, we can gain insights into how different styles compare in terms of alcohol content. Here's how we can achieve this:

1.  Group the Beers by Style: First, we'll use the `groupby` method in Pandas to group the beers by their style. This will create a collection of groups where each group contains all the beers of a particular style.

2.  Calculate the Average ABV: Next, we'll use the `mean` function to calculate the average ABV for each group. The mean function adds up all the ABV values in a group and then divides by the number of values, giving us the average.

3.  Analyze the Results: Finally, we'll examine the results to understand how different beer styles compare in terms of alcohol content. This analysis can lead to interesting insights and discussions about beer culture, preferences, and more. We'll use the Pandas method `sort_values()` to give our results meaning.

We'll now use the Minnesota beer rating data set to calculate the average ABV for each beer style. Here's the code to do it, along with the results:

In [3]:
# Grouping the beers by their style and calculating the average ABV for each group
average_abv_by_style = minnesota_beers_df.groupby('Style')['ABV'].mean()

# Sorting the result for better visibility
average_abv_by_style_sorted = average_abv_by_style.sort_values(ascending=False)

# Displaying the result
average_abv_by_style_sorted.head(10)


Style
English Barleywine         14.400000
American Imperial Stout    12.018182
Wheatwine                  11.500000
Russian Imperial Stout     11.070000
American Barleywine         9.900000
Imperial Porter             9.800000
Smoked Porter               9.500000
Belgian Dark Strong Ale     9.500000
Imperial IPA                8.904762
Fruited Kettle Sour         8.000000
Name: ABV, dtype: float64

By grouping the beers by style and calculating the average ABV, we can clearly see how different styles compare. For example, English Barleywine (14.4%) and American Imperial Stout (about 12.0%) are the strongest styles in terms of average alcohol content, while Imperial IPA (about 8.9%) and Fruited Kettle Sour (8.0%) sit at the bottom of this top-10 list. This analysis provides a valuable perspective for beer enthusiasts, brewers, and even regulators who might be interested in understanding the alcohol content of various beer styles.
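To see what `groupby` followed by `mean` is doing under the hood, here is a minimal sketch that computes the same averages twice: once by hand with a loop, and once with `groupby`. The `toy_beers` DataFrame and its values are made up for illustration and are not part of the Minnesota data set.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Minnesota data set
toy_beers = pd.DataFrame({
    "Style": ["Stout", "IPA", "Stout", "IPA", "Lager"],
    "ABV": [10.0, 7.0, 12.0, 6.0, 5.0],
})

# Manual version: split the rows by style, then average each subset
manual_means = {}
for style in toy_beers["Style"].unique():
    subset = toy_beers[toy_beers["Style"] == style]
    manual_means[style] = subset["ABV"].sum() / len(subset)

# groupby version: the same split, calculate, and combine steps in one line
grouped_means = toy_beers.groupby("Style")["ABV"].mean()

print(manual_means)
print(grouped_means)
```

Both approaches give the same numbers; `groupby` simply saves us from writing the loop and returns the result as a Series indexed by style.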


### How to "Count" Beers by Brewery

If you want to know which breweries produce the most beers, you can group the data by brewery and then count the number of beers for each group. The steps are:

1.  Group Beers by Brewery: Use the `groupby` method to group the beers by the brewery.
2.  Count the Beers: Use the `count` function to count the number of beers for each brewery group.
3.  Analyze the Results: Identify the breweries that produce the most beers. Again, we will use `df.sort_values()`.

We'll now use the Minnesota beer rating data set to count the number of beers produced by each brewery. Here's the code to do it:


In [4]:
# Grouping the beers by brewery and counting the number of beers for each brewery
beers_by_brewery = minnesota_beers_df.groupby('Brewery')['BeerName'].count()

# Sorting the result to find the breweries that produce the most beers
top_breweries_by_beer_count = beers_by_brewery.sort_values(ascending=False)

# Displaying the top 5 breweries
top_breweries_by_beer_count.head(5)

Brewery
Surly Brewing Company         14
Modist Brewing Co.            14
Lupulin Brewing                9
Barrel Theory Beer Company     9
BlackStack Brewing             9
Name: BeerName, dtype: int64
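One detail worth knowing: `count` only counts non-missing values in the selected column, while `size` counts every row in the group. In our data set `BeerName` has no missing values (see the `info()` output above), so both would agree there. The sketch below uses a small made-up DataFrame with one missing name to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing beer name
toy = pd.DataFrame({
    "Brewery": ["A", "A", "B", "B", "B"],
    "BeerName": ["Ale One", np.nan, "Lager X", "Stout Y", "Porter Z"],
})

# count() skips the missing BeerName for brewery A
non_null_counts = toy.groupby("Brewery")["BeerName"].count()

# size() counts every row, missing values included
row_counts = toy.groupby("Brewery").size()

print(non_null_counts)
print(row_counts)
```

If you want "how many rows are in each group" regardless of missing data, `size` is usually the safer choice.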

### Understanding Sorting: The `sort_values()` Function

Sorting is the process of arranging data in a specific order, either ascending (from smallest to largest) or descending (from largest to smallest). In Pandas, the `sort_values()` function is used to perform sorting on a Series or DataFrame. In our analysis of the Minnesota Beer data, we used the `sort_values()` function to organize data in a meaningful way. Here's a closer look at how we used this function:

1.  Sorting by a Specific Column: We can sort the DataFrame by a specific column, such as the average rating. This helps us identify the breweries with the highest or lowest ratings.

2.  Ascending and Descending Order: By setting the `ascending` parameter, we can control the order of sorting. For example, `ascending=False` will sort the values in descending order, showing us the breweries with the highest ratings first.

3.  Creating a New Copy: The `sort_values()` function returns a new sorted Series or DataFrame, leaving the original data unchanged. This allows us to maintain the original dataset while working with a sorted version.

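**Example:** Suppose we have a Series named `average_rating_by_brewery` that holds the average rating for each brewery (we compute it later in this notebook). We can sort it in descending order to find the breweries with the highest average ratings: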

```python
top_breweries_by_avg_rating = average_rating_by_brewery.sort_values(ascending=False)
```

This code creates a new Series with the breweries sorted by their average rating, allowing us to easily identify the top-rated breweries.

Sorting is a fundamental operation in data analysis that helps us organize and interpret data. By understanding how to use the `sort_values()` function in Pandas, we can arrange data in meaningful ways, making it easier to analyze and draw insights. Whether we want to find the highest-rated breweries or understand the distribution of beer styles, sorting provides a powerful tool to enhance our data exploration.
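The examples so far sorted a Series. When sorting a whole DataFrame, we name the column to sort on with the `by` parameter. Here is a minimal sketch using a made-up three-row DataFrame (the beer names and ratings are hypothetical); it also shows that the original DataFrame keeps its order.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "BeerName": ["Alpha", "Bravo", "Charlie"],
    "AvgRating": [4.2, 4.6, 4.4],
})

# Sort the DataFrame by one column, highest rating first
by_rating_desc = toy.sort_values(by="AvgRating", ascending=False)

print(by_rating_desc["BeerName"].tolist())  # sorted copy
print(toy["BeerName"].tolist())             # original order is unchanged
```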

### Exercise

Your task is to find out how many beers are produced for each beer style in the Minnesota beer rating data set. Here's what to do:

1.  Group Beers by Style: Use the `groupby` method to group the beers by their style.
2.  Count the Beers: Use the `count` function to count the number of beers for each style group.
3.  Analyze the Results: Identify the top 3 beer styles based on the number of beers. Write a brief analysis of what you observe.

You should be able to use the code blocks provided above, with only very small changes (to group by style instead of brewery).

In [7]:
# Grouping the beers by style and counting the number of beers
# beers_by_style = ?

# Sorting the result to find the styles with the most beers
# top_styles_by_beer_count = ?

# Displaying the top 3 styles. You'll need to use head()


### Finding the `max()` Rated Brewery

Imagine you are a beer connoisseur, and you want to visit the brewery with the highest average rating in Minnesota. You have a dataset of the most highly rated beers, and you want to analyze this data to make your decision. Here's how you can approach this task:

1.  Calculate the Average Rating for Each Brewery: First, you need to group the beers by brewery and then calculate the average rating for each group. This will give you a clear picture of how each brewery is rated by beer enthusiasts.

2.  Identify the Brewery with the Highest Average Rating: Next, you need to find the maximum average rating among all the breweries. This will help you identify the brewery that is most favored among the reviewers.

3.  Consider the Context: Since the dataset includes only the most highly rated beers, the analysis will reflect the preferences of those who rate beers highly. It's essential to keep this context in mind when interpreting the results.


We'll now use the Minnesota beer rating data set to find the brewery with the highest average rating. Here's the code to do it:

In [6]:
# Finding the average rating for each brewery
average_rating_by_brewery = minnesota_beers_df.groupby('Brewery')['AvgRating'].mean()

# Identifying the brewery with the highest average rating
# idxmax() gets us the actual name of the brewery
highest_avg_rating_brewery = average_rating_by_brewery.idxmax()
# max() gets us the number
highest_avg_rating_value = average_rating_by_brewery.max()

highest_avg_rating_brewery, highest_avg_rating_value


('Lift Bridge Brewery', 4.5)

The brewery with the highest average rating among the most highly rated beers in Minnesota is Lift Bridge Brewery, with an average rating of 4.5. This result tells us that Lift Bridge Brewery is highly favored among reviewers, especially considering that the dataset includes only the most highly rated beers. It's a significant achievement for a brewery to stand out in such a competitive field.

In this example, you'll notice we use two related functions, `max()` and `idxmax()`:

-   `max`: Returns the numerical value of the maximum element. In our example, it gives us the highest average rating as a number.
-   `idxmax`: Returns the index (or label) of the maximum element. In our example, it gives us the name of the brewery that has the highest average rating.

The combination of the `mean`, `idxmax` and `max` functions has allowed us to perform a nuanced analysis of the brewery ratings. By understanding how to calculate the average rating for each brewery and then identify the one with the highest average, we have unlocked insights that could be valuable for consumers, business analysts, and beer enthusiasts.
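One caveat when using `idxmax`: if several entries share the maximum value, it returns only the first matching label. The sketch below uses a made-up Series of ratings (the brewery names and numbers are hypothetical) to show this, along with one way to list every label that ties for the top spot.

```python
import pandas as pd

# Hypothetical average ratings with a tie at the top
ratings = pd.Series({"Brewery A": 4.3, "Brewery B": 4.5, "Brewery C": 4.5})

best_value = ratings.max()     # the highest value
best_label = ratings.idxmax()  # the first label that has the highest value

# All labels whose rating equals the maximum
tied = ratings[ratings == best_value].index.tolist()

print(best_label, best_value)
print(tied)
```

When the answer matters, it can be worth checking for ties like this before declaring a single winner.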

Remember, the concepts of mean and max are not limited to beer ratings. They are fundamental statistical tools that can be applied across various fields and contexts. Whether you're analyzing customer satisfaction, academic performance, or market trends, understanding how to use these functions can help you make informed decisions and uncover meaningful insights.

Now that you've learned how to use the mean and max functions to analyze brewery ratings, think about how you can apply these concepts to other data sets and questions. The world of data analysis is vast and exciting, and these tools are just the beginning of what you can explore and discover!

### Exercise: Finding the `min()`

Take the code block above (which we used to find the highest rated brewery) and use it to find the **lowest** rated brewery. It's basically the same idea, but you'll be using `min()` and `idxmin()` instead of `max()` and `idxmax()`.

In [8]:
# Change the code below to use min() and idxmin() instead of max() and idxmax()
# Finding the average rating for each brewery
average_rating_by_brewery = minnesota_beers_df.groupby('Brewery')['AvgRating'].mean()
highest_avg_rating_brewery = average_rating_by_brewery.idxmax()
highest_avg_rating_value = average_rating_by_brewery.max()
highest_avg_rating_brewery, highest_avg_rating_value

('Lift Bridge Brewery', 4.5)

## Putting It All Together: A Comprehensive Report

Now, let's put together what we've learned and create a comprehensive report on each beer style and its associated alcohol content. We'll also meet the `df.agg()` function, which allows us to apply multiple aggregate functions to our data.

In [16]:
# Grouping by beer style and calculating the required statistics for the 'ABV' column
report_abv_by_style = minnesota_beers_df.groupby('Style')['ABV'].agg(
    min_abv='min',
    max_abv='max',
    mean_abv='mean',
    median_abv='median',
    sd_abv='std',
    count_style = 'count'
)

# Displaying the report
report_abv_by_style


Style,min_abv,max_abv,mean_abv,median_abv,sd_abv,count_style
American Barleywine,9.9,9.9,9.9,9.9,,1
American Brown Ale,5.1,5.5,5.3,5.3,0.282843,2
American IPA,6.0,7.9,6.9,7.0,0.562139,11
American Imperial Stout,10.0,14.0,12.018182,12.5,1.483791,11
American Lager,5.2,6.5,5.85,5.85,0.919239,2
American Pale Ale,7.5,7.5,7.5,7.5,,1
American Porter,5.3,5.3,5.3,5.3,,1
American Stout,6.0,6.0,6.0,6.0,,1
Belgian Dark Strong Ale,9.5,9.5,9.5,9.5,,1
Berliner Weisse,7.2,7.2,7.2,7.2,,1


Here, we use the `agg()` method in Pandas to apply one or more aggregate functions (of the type we've been learning about) in a single call. In the code above, `groupby('Style')` groups the `minnesota_beers_df` DataFrame by the `Style` column, and `agg()` then calculates the minimum, maximum, mean, median, standard deviation, and count of the ABV values for each beer style. The results of the aggregation are stored in the `report_abv_by_style` DataFrame.

Here is a more concise breakdown of the code:

- `minnesota_beers_df.groupby('Style')['ABV']`: This part groups the `minnesota_beers_df` DataFrame by the `Style` column and selects the `ABV` column.

- `.agg(min_abv='min', max_abv='max', mean_abv='mean', median_abv='median', sd_abv='std', count_style='count')`: This part applies the aggregation functions to the `ABV` column. Each keyword (such as `min_abv`) becomes a column name in the result. The names of the functions are passed as strings, but you can also pass the functions themselves.

- `report_abv_by_style = ...`: The assignment stores the results of the aggregation in the `report_abv_by_style` DataFrame.

- `report_abv_by_style`: The final line displays the report. To show only the first 5 rows, you could call `report_abv_by_style.head()` instead.

In the output, you'll notice that the standard deviation is sometimes "not a number" (`NaN`). This is because `std` computes the sample standard deviation, which divides by one less than the number of values, so it isn't defined when a group has only 1 data item (e.g., only one beer of a given style).
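The following sketch illustrates both points on a small made-up DataFrame (the styles and ABV values are hypothetical): a group with a single beer gets `NaN` for `'std'`, and `agg` accepts a function as well as a string. The lambda here computes the population standard deviation (`ddof=0`), which is one possible alternative that returns 0 for a single value; whether that is the right choice depends on your analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stouts and a single porter
toy = pd.DataFrame({
    "Style": ["Stout", "Stout", "Porter"],
    "ABV": [10.0, 12.0, 5.3],
})

report = toy.groupby("Style")["ABV"].agg(
    mean_abv="mean",
    sd_abv="std",                          # sample standard deviation (ddof=1)
    pop_sd_abv=lambda s: s.std(ddof=0),    # a function passed directly
    count_style="count",
)

print(report)
```

Reading `sd_abv` next to `count_style` makes it easy to spot which `NaN` values simply come from groups of size 1.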

## What is an API?

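An API (application programming interface) is a defined way for one program to request data or services from another. Many websites offer web APIs: you send a request to a URL, and the server sends back structured data, often as JSON. In the cells below, we send a GET request to the Open Brewery DB API for breweries in Rochester, Minnesota, check whether the request succeeded, and then load the result into a Pandas DataFrame.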

In [None]:
import json

import requests

# The API endpoint URL
url = "https://api.openbrewerydb.org/v1/breweries?by_state=minnesota&by_city=rochester"

# Make a GET request to the API (the timeout keeps the cell from hanging)
response = requests.get(url, timeout=10)

# Start with an empty list so `data` exists even if the request fails
data = []

# Check the response status code
if response.status_code == 200:
    # The request was successful, so parse the JSON body
    data = response.json()

    # Print the first record, if there is one
    if data:
        print(json.dumps(data[0], indent=2))
else:
    # The request failed, so print the status code and error message
    print(response.status_code)
    print(response.text)

In [None]:
# Convert to pandas
brewery_df = pd.DataFrame(data)
brewery_df.head()
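Once API data is in a DataFrame, the aggregate functions from earlier in this notebook apply in the same way. The sketch below runs offline on a few made-up records standing in for `data`; the field names (`name`, `brewery_type`, `city`) are assumptions about the shape of the response, so check the columns of your own `brewery_df` before relying on them.

```python
import pandas as pd

# Hypothetical records standing in for the API response;
# the field names and values are assumptions for illustration
sample_data = [
    {"name": "Brewery One", "brewery_type": "micro", "city": "Rochester"},
    {"name": "Brewery Two", "brewery_type": "brewpub", "city": "Rochester"},
    {"name": "Brewery Three", "brewery_type": "micro", "city": "Rochester"},
]

sample_df = pd.DataFrame(sample_data)

# Count breweries of each type, most common type first
breweries_by_type = (
    sample_df.groupby("brewery_type")["name"]
    .count()
    .sort_values(ascending=False)
)

print(breweries_by_type)
```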
