<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/Data_Science_06_AggregateFunctions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Aggregate Functions and Data Collection Methods

## Introduction to the Minnesota Beer Rating Data Set

The data set we'll be working with contains information about various beers brewed in Minnesota. It's a fascinating collection of data that helps us understand the preferences and characteristics of different beers. Let's break down the columns in this data set:

1.  BeerName: The name of the beer.
2.  Brewery: The name of the brewery that produces the beer.
3.  Style: The style of the beer, such as "Imperial IPA" or "Russian Imperial Stout." This categorizes the beer based on its taste, color, aroma, and other characteristics.
4.  ABV (Alcohol By Volume): This represents the alcohol content in the beer, measured as a percentage. For example, a beer with an ABV of 9.2% contains 9.2% alcohol by volume.
5.  NumRating: The number of ratings the beer has received on Beer Advocate, a popular beer rating website.
6.  AvgRating: The average rating of the beer on a scale from 1 to 5, where a higher rating indicates a more favorable review.



In [None]:
# Importing the Pandas library
import pandas as pd

# Loading the Minnesota beer rating data from the provided CSV file
!wget 'https://github.com/brendanpshea/data-science/raw/main/data/minnesota_beers.csv' -q
minnesota_beers_df = pd.read_csv("minnesota_beers.csv")

# Displaying the first few rows of the data to understand its structure
minnesota_beers_df.head()


,BeerName,Brewery,Style,ABV,NumRating,AvgRating
0,Nillerzzzzz,Forager Brewing Company,American Imperial Stout,14.0,143,4.58
1,Abrasive Ale,Surly Brewing Company,Imperial IPA,9.2,4828,4.5
2,Barrel-Aged Silhouette,Lift Bridge Brewery,Russian Imperial Stout,11.0,551,4.5
3,Darkness,Surly Brewing Company,Russian Imperial Stout,12.0,4252,4.48
4,Darkness - Bourbon Barrel-Aged,Surly Brewing Company,Russian Imperial Stout,12.0,356,4.46


In [None]:
# Display basic information about the DataFrame (columns, non-null counts, dtypes)
minnesota_beers_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   BeerName   100 non-null    object 
 1   Brewery    100 non-null    object 
 2   Style      100 non-null    object 
 3   ABV        100 non-null    float64
 4   NumRating  100 non-null    int64  
 5   AvgRating  100 non-null    float64
dtypes: float64(2), int64(1), object(3)
memory usage: 4.8+ KB


#### Examples:

-   Nillerzzzzz is an "American Imperial Stout" with an ABV of 14.0%, and it has received 143 ratings with an average rating of 4.58.
-   Abrasive Ale is an "Imperial IPA" with an ABV of 9.2%, and it has received 4828 ratings with an average rating of 4.50.

This data set provides a rich source of information for exploring various aspects of the beer industry in Minnesota, including consumer preferences, beer characteristics, and brewery performance. By using aggregate functions in Pandas, we can analyze this data in many interesting ways, such as finding the highest-rated beers, understanding the popularity of different beer styles, or identifying trends in alcohol content.

In the following sections, we'll learn how to use Pandas to perform these analyses and more, providing you with valuable insights and techniques to work with structured data. So grab your favorite beer (if you're of legal drinking age, of course!) and let's dive into the world of data analysis with Pandas!

## Grouping
In this section, we will explore how to group data by a specific category and then calculate the mean, or average, for each group. The emphasis is on the powerful concept of "grouping," which allows us to segment data into categories and perform calculations on each category separately.
### What is the Average Alcohol Content (ABV) for Each Beer Style?

Our first example will be calculating the average Alcohol By Volume (ABV) for each beer style.

Imagine you are at a beer festival in Minnesota, and you are curious to know how different styles of beer compare in terms of alcohol content. Some styles might be stronger, while others might be lighter. How can we figure this out using data analysis?

The answer lies in grouping the beers by their style and then calculating the average ABV for each group. By doing so, we can gain insights into how different styles compare in terms of alcohol content. Here's how we can achieve this:

1.  Group the Beers by Style: First, we'll use the `groupby` method in Pandas to group the beers by their style. This will create a collection of groups where each group contains all the beers of a particular style.

2.  Calculate the Average ABV: Next, we'll use the `mean` function to calculate the average ABV for each group. The mean function adds up all the ABV values in a group and then divides by the number of values, giving us the average.

3.  Analyze the Results: Finally, we'll examine the results to understand how different beer styles compare in terms of alcohol content. This analysis can lead to interesting insights and discussions about beer culture, preferences, and more. We'll use the Pandas method `sort_values()` to give our results meaning.

We'll now use the Minnesota beer rating data set to calculate the average ABV for each beer style. Here's the code to do it, along with the results:

In [None]:
# Grouping the beers by their style and calculating the average ABV for each group
average_abv_by_style = minnesota_beers_df.groupby('Style')['ABV'].mean()

# Sorting the result for better visibility
average_abv_by_style_sorted = average_abv_by_style.sort_values(ascending=False)

# Displaying the result
average_abv_by_style_sorted.head(10)


Style
English Barleywine         14.400000
American Imperial Stout    12.018182
Wheatwine                  11.500000
Russian Imperial Stout     11.070000
American Barleywine         9.900000
Imperial Porter             9.800000
Smoked Porter               9.500000
Belgian Dark Strong Ale     9.500000
Imperial IPA                8.904762
Fruited Kettle Sour         8.000000
Name: ABV, dtype: float64

By grouping the beers by style and calculating the average ABV, we can clearly see how different styles compare. For example, English Barleywine (14.4%) and American Imperial Stout (about 12.0%) are the strongest styles on average, while Imperial IPA (about 8.9%) and Fruited Kettle Sour (8.0%) sit at the bottom of this top-10 list. This analysis provides a valuable perspective for beer enthusiasts, brewers, and even regulators who might be interested in understanding the alcohol content of various beer styles.
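To make the group-then-average idea concrete, here is a minimal sketch on a tiny, made-up table (the styles and ABV values below are invented for illustration and are not taken from the Minnesota data set). It checks that `groupby(...).mean()` gives the same number as averaging one group by hand.

```python
import pandas as pd

# A tiny, made-up table (illustrative values only)
toy_beers = pd.DataFrame({
    "Style": ["Stout", "Stout", "IPA", "IPA", "IPA"],
    "ABV":   [10.0, 12.0, 6.0, 7.0, 8.0],
})

# Split the rows by Style, apply mean() to each group's ABV, combine into one Series
toy_mean_abv = toy_beers.groupby("Style")["ABV"].mean()

# The same calculation done by hand for a single group
stout_rows = toy_beers[toy_beers["Style"] == "Stout"]
stout_mean_by_hand = stout_rows["ABV"].sum() / len(stout_rows)

print(toy_mean_abv)
print("Stout mean by hand:", stout_mean_by_hand)
```

The result of `groupby(...).mean()` is a Series whose index holds the group labels, which is why a single style can be looked up with `toy_mean_abv["Stout"]`.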


### How to "Count" Beers by Brewery

If you want to know which breweries produce the most beers, you can group the data by brewery and then count the number of beers for each group. The steps are:

1.  Group Beers by Brewery: Use the `groupby` method to group the beers by the brewery.
2.  Count the Beers: Use the `count` function to count the number of beers for each brewery group.
3.  Analyze the Results: Identify the breweries that produce the most beers. Again, we will use `df.sort_values()`.

We'll now use the Minnesota beer rating data set to count the number of beers produced by each brewery. Here's the code to do it:


In [None]:
# Grouping the beers by brewery and counting the number of beers for each brewery
beers_by_brewery = minnesota_beers_df.groupby('Brewery')['BeerName'].count()

# Sorting the result to find the breweries that produce the most beers
top_breweries_by_beer_count = beers_by_brewery.sort_values(ascending=False)

# Displaying the top 5 breweries
top_breweries_by_beer_count.head(5)

Brewery
Surly Brewing Company         14
Modist Brewing Co.            14
Lupulin Brewing                9
Barrel Theory Beer Company     9
BlackStack Brewing             9
Name: BeerName, dtype: int64
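Pandas offers a few closely related ways to count rows per group. The sketch below, on made-up data with one beer name deliberately missing, compares `count()` (which skips missing values), `size()` (which counts every row), and `value_counts()` (which counts and sorts in one step).

```python
import pandas as pd

# Made-up data; one beer name is missing on purpose
toy = pd.DataFrame({
    "Brewery":  ["A", "A", "B", "B", "B"],
    "BeerName": ["Ale 1", None, "Lager 1", "Lager 2", "Stout 1"],
})

count_per_brewery = toy.groupby("Brewery")["BeerName"].count()  # skips missing names
size_per_brewery = toy.groupby("Brewery").size()                # counts every row
rows_per_brewery = toy["Brewery"].value_counts()                # counts and sorts, largest first

print(count_per_brewery)
print(size_per_brewery)
print(rows_per_brewery)
```

In the Minnesota data every column has 100 non-null values (see the `info()` output earlier), so `count()` and `size()` would agree there; the difference only matters when data is missing.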

### Understanding Sorting: The `sort_values()` Function

Sorting is the process of arranging data in a specific order, either ascending (from smallest to largest) or descending (from largest to smallest). In Pandas, the `sort_values()` function is used to perform sorting on a Series or DataFrame. In our analysis of the Minnesota Beer data, we used the `sort_values()` function to organize data in a meaningful way. Here's a closer look at how we used this function:

1.  Sorting by a Specific Column: We can sort the DataFrame by a specific column, such as the average rating. This helps us identify the breweries with the highest or lowest ratings.

2.  Ascending and Descending Order: By setting the `ascending` parameter, we can control the order of sorting. For example, `ascending=False` will sort the values in descending order, showing us the breweries with the highest ratings first.

3.  Creating a New Copy: The `sort_values()` function returns a new sorted Series or DataFrame, leaving the original data unchanged. This allows us to maintain the original dataset while working with a sorted version.

**Example:** Suppose `average_rating_by_brewery` is a Series holding the average rating for each brewery (we compute it later in this chapter). Sorting it in descending order puts the breweries with the highest average ratings first:

```python
top_breweries_by_avg_rating = average_rating_by_brewery.sort_values(ascending=False)
```

This code creates a new Series with the breweries sorted by their average rating, allowing us to easily identify the top-rated breweries.

Sorting is a fundamental operation in data analysis that helps us organize and interpret data. By understanding how to use the `sort_values()` function in Pandas, we can arrange data in meaningful ways, making it easier to analyze and draw insights. Whether we want to find the highest-rated breweries or understand the distribution of beer styles, sorting provides a powerful tool to enhance our data exploration.
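The three points above can be sketched on a small, made-up DataFrame (the beer names and ratings are invented). For a DataFrame, the `by` parameter names the column to sort on, and the original object is left untouched.

```python
import pandas as pd

# Made-up ratings (illustrative values only)
toy = pd.DataFrame({
    "BeerName":  ["Beer A", "Beer B", "Beer C"],
    "AvgRating": [4.1, 4.5, 4.3],
})

# For a DataFrame, `by` names the column to sort on
toy_sorted = toy.sort_values(by="AvgRating", ascending=False)

print(toy_sorted)
print(toy)  # the original keeps its row order

# sort_values keeps each row's original index label; reset_index renumbers from 0
toy_ranked = toy_sorted.reset_index(drop=True)
print(toy_ranked)
```

Notice that the sorted copy still carries the old index labels (1, 2, 0 here). Calling `reset_index(drop=True)` is optional, but it is handy when the row number should reflect the rank.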

### Exercise

Your task is to find out how many beers are produced for each beer style in the Minnesota beer rating data set. Here's what to do:

1.  Group Beers by Style: Use the `groupby` method to group the beers by their style.
2.  Count the Beers: Use the `count` function to count the number of beers for each style group.
3.  Analyze the Results: Identify the top 3 beer styles based on the number of beers. Write a brief analysis of what you observe.

You should be able to use the code blocks provided above, with only very small changes (to group by style instead of brewery).

In [None]:
# Group the beers by style and count the number of beers in each group
# beers_by_style = ?

# Sort the result to find the styles with the most beers
# top_styles_by_beer_count = ?

# Display the top 3 styles. You'll need to use .head(3)


### Finding the `max()` Rated Brewery

Imagine you are a beer connoisseur, and you want to visit the brewery with the highest average rating in Minnesota. You have a dataset of the most highly rated beers, and you want to analyze this data to make your decision. Here's how you can approach this task:

1.  Calculate the Average Rating for Each Brewery: First, you need to group the beers by brewery and then calculate the average rating for each group. This will give you a clear picture of how each brewery is rated by beer enthusiasts.

2.  Identify the Brewery with the Highest Average Rating: Next, you need to find the maximum average rating among all the breweries. This will help you identify the brewery that is most favored among the reviewers.

3.  Consider the Context: Since the dataset includes only the most highly rated beers, the analysis will reflect the preferences of those who rate beers highly. It's essential to keep this context in mind when interpreting the results.


We'll now use the Minnesota beer rating data set to find the brewery with the highest average rating. Here's the code to do it:

In [None]:
# Finding the average rating for each brewery
average_rating_by_brewery = minnesota_beers_df.groupby('Brewery')['AvgRating'].mean()

# Identifying the brewery with the highest average rating
# idxmax() gets us the actual name of the brewery
highest_avg_rating_brewery = average_rating_by_brewery.idxmax()
# max() gets us the number
highest_avg_rating_value = average_rating_by_brewery.max()

highest_avg_rating_brewery, highest_avg_rating_value


('Lift Bridge Brewery', 4.5)

The brewery with the highest average rating among the most highly rated beers in Minnesota is Lift Bridge Brewery, with an average rating of 4.5. This result tells us that Lift Bridge Brewery is highly favored among reviewers, especially considering that the dataset includes only the most highly rated beers. It's a significant achievement for a brewery to stand out in such a competitive field.

In this example, you'll notice we use two related methods, `max()` and `idxmax()`:

-   `max`: Returns the numerical value of the maximum element. In our example, it gives us the highest average rating as a number.
-   `idxmax`: Returns the index (or label) of the maximum element. In our example, it gives us the name of the brewery that has the highest average rating.

The combination of the `mean`, `idxmax`, and `max` functions has allowed us to perform a nuanced analysis of the brewery ratings. By understanding how to calculate the average rating for each brewery and then identify the one with the highest average, we have unlocked insights that could be valuable for consumers, business analysts, and beer enthusiasts.
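The difference between `max()` and `idxmax()` is easiest to see on a tiny Series. The sketch below uses made-up brewery names and ratings, and also shows `nlargest()`, a related Pandas method that returns the top few entries rather than just one.

```python
import pandas as pd

# Made-up average ratings indexed by brewery name (illustrative values only)
toy_ratings = pd.Series(
    [4.2, 4.5, 4.3],
    index=["Brewery A", "Brewery B", "Brewery C"],
    name="AvgRating",
)

best_value = toy_ratings.max()      # the number itself
best_label = toy_ratings.idxmax()   # the index label where that number occurs

# The label can be used to look the value back up
assert toy_ratings.loc[best_label] == best_value

# nlargest returns the top n entries, already sorted in descending order
top_two = toy_ratings.nlargest(2)

print(best_label, best_value)
print(top_two)
```

If several entries tie for the maximum, `idxmax()` returns only the first matching label, so `nlargest()` or a full sort is the safer choice when ties are possible.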

Remember, the concepts of mean and max are not limited to beer ratings. They are fundamental statistical tools that can be applied across various fields and contexts. Whether you're analyzing customer satisfaction, academic performance, or market trends, understanding how to use these functions can help you make informed decisions and uncover meaningful insights.

Now that you've learned how to use the mean and max functions to analyze brewery ratings, think about how you can apply these concepts to other data sets and questions. The world of data analysis is vast and exciting, and these tools are just the beginning of what you can explore and discover!

### Exercise: Finding the `min()`

Take the code block above (which we used to find the highest rated brewery) and use it to find the **lowest** rated brewery. It's basically the same idea, but you'll be using `min()` and `idxmin()` instead of `max()` and `idxmax()`.

In [None]:
# Change the code below to use min() and idxmin() instead of max() and idxmax()
# Finding the average rating for each brewery
average_rating_by_brewery = minnesota_beers_df.groupby('Brewery')['AvgRating'].mean()
highest_avg_rating_brewery = average_rating_by_brewery.idxmax()
highest_avg_rating_value = average_rating_by_brewery.max()
highest_avg_rating_brewery, highest_avg_rating_value

('Lift Bridge Brewery', 4.5)

## Putting It All Together: A Comprehensive Report

Now, let's put together what we've learned and create a comprehensive report on each beer style and its associated alcohol content. We'll also meet the `agg()` method, which allows us to apply multiple aggregate functions to our data in a single call.

In [None]:
# Grouping by beer style and calculating the required statistics for the 'ABV' column
report_abv_by_style = minnesota_beers_df.groupby('Style')['ABV'].agg(
    min_abv='min',
    max_abv='max',
    mean_abv='mean',
    median_abv='median',
    sd_abv='std',
    count_style = 'count'
)

# Displaying the report (one row per beer style)
report_abv_by_style


Style,min_abv,max_abv,mean_abv,median_abv,sd_abv,count_style
American Barleywine,9.9,9.9,9.9,9.9,,1
American Brown Ale,5.1,5.5,5.3,5.3,0.282843,2
American IPA,6.0,7.9,6.9,7.0,0.562139,11
American Imperial Stout,10.0,14.0,12.018182,12.5,1.483791,11
American Lager,5.2,6.5,5.85,5.85,0.919239,2
American Pale Ale,7.5,7.5,7.5,7.5,,1
American Porter,5.3,5.3,5.3,5.3,,1
American Stout,6.0,6.0,6.0,6.0,,1
Belgian Dark Strong Ale,9.5,9.5,9.5,9.5,,1
Berliner Weisse,7.2,7.2,7.2,7.2,,1


Here, we use the `agg()` method in Pandas, which applies one or more aggregate functions (of the type we've been learning about) in a single call. In the code above, `groupby('Style')` groups the `minnesota_beers_df` DataFrame by the `Style` column, and `agg()` then calculates the minimum, maximum, mean, median, standard deviation, and count of the ABV values for each beer style. The results of the aggregation are stored in the `report_abv_by_style` DataFrame.

Here is a step-by-step breakdown of the code:

- `minnesota_beers_df.groupby('Style')['ABV']`: This part groups the `minnesota_beers_df` DataFrame by the `Style` column and selects the `ABV` column.

- `.agg(min_abv='min', max_abv='max', mean_abv='mean', median_abv='median', sd_abv='std', count_style='count')`: This part applies the aggregation functions to the `ABV` column. Each keyword (such as `min_abv`) becomes a column name in the result. The names of the functions are passed as strings, but you can also pass the functions themselves.

- `report_abv_by_style = ...`: The assignment stores the results of the aggregation in the `report_abv_by_style` DataFrame.

- `report_abv_by_style`: Writing the variable name on the last line of the cell displays the report. To show only the first five rows, use `report_abv_by_style.head()` instead.

In the output, you'll notice that standard deviation is sometimes "not a number" (`NaN`). This is because Pandas computes the sample standard deviation by default, which divides by n - 1. When there is only 1 data item (e.g., only one beer of a given style), that divisor is zero, so the result is undefined.
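The `NaN` behavior can be reproduced with three rows. In the sketch below, the ABV values are chosen for illustration (the style labels are shortened), and `ddof` is the Pandas parameter that controls what is subtracted from n in the divisor.

```python
import pandas as pd

# Illustrative ABV values: one style with two beers, one style with a single beer
toy = pd.DataFrame({
    "Style": ["Brown Ale", "Brown Ale", "Barleywine"],
    "ABV":   [5.1, 5.5, 9.9],
})

toy_report = toy.groupby("Style")["ABV"].agg(mean_abv="mean", sd_abv="std", n="count")
print(toy_report)

# The sample standard deviation (ddof=1, the Pandas default) divides by n - 1
single = pd.Series([9.9])
print(single.std())         # NaN: n - 1 is 0
print(single.std(ddof=0))   # 0.0: the population formula divides by n instead
```

Keeping a count column (here `n`) next to the standard deviation makes it easy to tell whether a `NaN` comes from a one-beer group or from some other data problem.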

## Introduction to Data Collection Methods

In the dynamic world of data analysis, the process of gathering information is as vital as the insights derived from it. Data serves as the foundation upon which we build our understanding, make decisions, and uncover trends. Whether we are exploring the preferences of beer enthusiasts, analyzing the market trends of breweries, or studying the impact of different brewing techniques, the way we collect data shapes the quality and depth of our analysis.

Different research questions and objectives require diverse data collection methods. From understanding the global reach of different beer brands to identifying local craft beer trends, various techniques can be employed to gather the information needed. In this section, we will delve into some of the most commonly used data collection methods, each offering unique advantages and applications:

1.  Web Scraping: Extracting data from websites to analyze beer ratings, customer reviews, or price trends.
2.  Public Databases: Accessing comprehensive datasets related to beer production, sales, or regulation from governmental or organizational repositories.
3.  Application Programming Interface (API)/Web Services: Connecting to specialized platforms or services that provide data on beer varieties, brewery locations, or industry statistics.


By understanding and mastering these data collection methods, we can paint a vivid picture of the multifaceted world of beer. Whether a novice homebrewer seeking to understand preferences or a multinational corporation analyzing global trends, these techniques empower us to gather the data necessary to make informed decisions and create impactful strategies.

In the following sections, we will explore each of these data collection methods in detail, understanding their applications, advantages, challenges, and how they can be effectively utilized in the context of the beer industry.

## Web Scraping: Extracting Beer Data from HTML Pages
**Web scraping** is a vital data collection method that involves extracting information from websites. In the context of our beer analysis, the data frame we used earlier was obtained through web scraping from Beer Advocate's Top Rated Beers in Minnesota (https://www.beeradvocate.com/beer/top-rated/us/mn/). This webpage provides a treasure trove of details on highly rated beers, including the beer's name, style, ABV (Alcohol By Volume), and average rating.

The process of web scraping begins with sending an HTTP request to the target URL, in this case, the Beer Advocate page. Upon receiving the HTML content of the webpage from the server, data scientists use libraries like **BeautifulSoup** or **Scrapy** to parse the HTML content and navigate through the elements. By identifying specific HTML tags and attributes, they can extract the desired data, such as beer names, styles, and ratings. Once extracted, this data must be cleaned and structured into a usable format, such as a Pandas DataFrame, by removing unnecessary characters and whitespace. (Note: For the Minnesota Beer dataframe, this is actually pretty involved!).

#### Structure of HTML

Web scraping gets its data from **HTML**, or **HyperText Markup Language**, which is the standard language used to create and design webpages. It defines the structure of a webpage, organizing content into elements such as headings, paragraphs, images, links, and tables. These elements are represented using tags, which encapsulate the content and provide instructions on how it should be displayed in a web browser.


An HTML document is structured as a hierarchy of elements, often visualized as a tree. Here's a basic breakdown of the structure:

1.  Document Type Declaration: `<!DOCTYPE html>` specifies the version of HTML being used.
2.  HTML Element: The root element, `<html>`, contains all other elements on the page.
3.  Head Element: The `<head>` section includes meta information, such as the title and links to stylesheets.
4.  Body Element: The `<body>` section contains the main content of the webpage, organized into various elements.

#### Key Concepts in HTML

-   Tags: HTML uses tags to define elements. Tags usually come in pairs, with an opening tag (`<tag>`) and a closing tag (`</tag>`).
-   Attributes: Elements may have attributes that provide additional information, such as class, id, or style.
-   Hierarchy: Elements are nested within one another, creating a parent-child relationship.

An example of HTML would be as follows:

In [None]:
%%html
<!DOCTYPE html>
<html>
<head>
    <title>Top Rated Beers</title>
</head>
<body>
    <h1>Minnesota's Best Beers</h1>
    <p>Explore the <a href="link">top-rated beers</a> in Minnesota.</p>
    <table>
        <!-- Table rows and columns containing beer data -->
        <!--This table is where we would *scrape* data from-->
         <tr> <th>Name</th> <th>Style</th> <th>ABV</th> <th>Rating</th> </tr>
         <tr> <td>North Star Lager</td> <td>Lager</td> <td>4.8%</td> <td>4.2</td> </tr>
         <tr> <td>Lakeview IPA</td> <td>India Pale Ale</td> <td>6.5%</td> <td>4.5</td> </tr>
        <tr> <td>Timberwolf Stout</td> <td>Stout</td> <td>5.2%</td> <td>4.1</td> </tr>
    </table>
</body>
</html>


Name,Style,ABV,Rating
North Star Lager,Lager,4.8%,4.2
Lakeview IPA,India Pale Ale,6.5%,4.5
Timberwolf Stout,Stout,5.2%,4.1


When web scraping, HTML serves as the primary source of data. By navigating the HTML structure and extracting content from specific tags and attributes, data scientists can retrieve valuable information from webpages. The data extracted can range from text and numbers to links and images, depending on the content of the webpage.
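As a rough illustration of "navigating the HTML structure," the sketch below pulls the rows out of a trimmed copy of the example table and loads them into a DataFrame. It uses Python's built-in `html.parser` module so that it runs without extra installs; in practice a library such as BeautifulSoup would usually make this shorter. The `TableParser` class is a helper written for this example, not part of any library.

```python
from html.parser import HTMLParser

import pandas as pd

# A trimmed copy of the example table (invented beers, not real data)
html_text = """
<table>
  <tr> <th>Name</th> <th>Style</th> <th>ABV</th> <th>Rating</th> </tr>
  <tr> <td>North Star Lager</td> <td>Lager</td> <td>4.8%</td> <td>4.2</td> </tr>
  <tr> <td>Lakeview IPA</td> <td>India Pale Ale</td> <td>6.5%</td> <td>4.5</td> </tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # a new table row begins
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")   # a new, empty cell in the current row

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data  # text found between the cell's tags

parser = TableParser()
parser.feed(html_text)

# The first row holds the column names; the rest are records
header, *records = parser.rows
scraped_df = pd.DataFrame(records, columns=header)

# Cleaning step: scraped values are strings, so strip "%" and convert to numbers
scraped_df["ABV"] = scraped_df["ABV"].str.rstrip("%").astype(float)
scraped_df["Rating"] = scraped_df["Rating"].astype(float)
print(scraped_df)
```

The cleaning step at the end matters: everything scraped from HTML arrives as text, so numeric columns must be converted before aggregate functions like `mean()` will work on them.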

### Web Scraping: Strengths and Weaknesses
Web scraping offers immense value to data scientists, particularly when data is not readily available through public databases or APIs. It allows for the customization of data collection, enabling data scientists to extract only the relevant information. This approach is widely used for purposes ranging from competitive analysis, where businesses scrape data from competitors' websites, to trend analysis, where data scientists identify changes in popularity or pricing over time.

However, web scraping is not without challenges and potential sources of error. The HTML structure of a website may change, causing the scraping code to break and require updates. Legal and ethical considerations must always be taken into account, as scraping might be restricted or prohibited by a website's terms of service. Additionally, the data on webpages may contain human errors or inconsistencies, leading to inaccuracies in the scraped information. Websites might also implement rate limiting, which restricts repeated requests from the same IP address, necessitating responsible scraping practices.
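One common piece of "responsible scraping" is to honor a site's `robots.txt` file and to pause between requests. The sketch below uses Python's built-in `urllib.robotparser` on a made-up `robots.txt`; the bot name and the `example.com` URLs are hypothetical, and no real request is sent.

```python
import time
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real scraper would download the site's own file
# (for example with RobotFileParser.set_url(...) followed by .read())
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 1",
]

rules = RobotFileParser()
rules.parse(robots_lines)

USER_AGENT = "class-project-bot"   # hypothetical name for our scraper
candidate_urls = [
    "https://example.com/beer/top-rated/",
    "https://example.com/private/accounts",
]

# Keep only the URLs that the rules allow this user agent to fetch
allowed = [url for url in candidate_urls if rules.can_fetch(USER_AGENT, url)]
delay = rules.crawl_delay(USER_AGENT) or 1   # fall back to 1 second if none is given

for url in allowed:
    print("OK to request:", url)
    # ... the actual HTTP request would go here ...
    time.sleep(delay)                        # pause so we do not flood the server
```

Following `robots.txt` is a courtesy convention rather than legal permission, so a site's terms of service still need to be read separately.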

In conclusion, web scraping played an indispensable role in our exploration of Minnesota's top-rated beers. It's a technique that, while requiring careful handling and consideration of potential errors, offers a versatile and robust tool for data collection. The insights derived from the data scraped from Beer Advocate's website illustrate how web scraping can fuel research, innovation, and informed decision-making in the world of beer and beyond.

### Questions: Web Scraping
1. Consider the ethical aspects of web scraping. How can you ensure that you're scraping data responsibly and within legal boundaries? What might be the consequences of not adhering to ethical guidelines? For example, our scraping above (a small chunk of data for educational purposes) is allowed. However, how would it change if we (repeatedly) scraped large amounts of data for commercial purposes?

2. Web scraping can present various challenges, especially with websites that change their structure or block scraping. What common obstacles might you encounter when scraping data, and how would you overcome them to ensure accurate data collection?

3. Web scraping is used across different industries for various purposes. How might you apply web scraping in your own projects or interests, such as analyzing beer ratings? Can you think of innovative ways to utilize web scraping in other fields or topics that interest you?

### My Answers: Web Scraping
1.

2.

3.

## Public Databases
Public databases or public data sets are collections of information that are made freely available to anyone. Unlike private or restricted databases, where access might be limited to certain researchers or organizations, public databases are open for all to explore and utilize. These databases can be particularly useful in fields like data science, where the analysis of large volumes of data can lead to valuable insights.

Let's take the example of someone interested in researching alcohol abuse. Public databases would be an invaluable resource for this researcher. Through sites like Data.gov, which is a repository for the U.S. government's open data, they can access statistical information, medical records, survey responses, and more related to alcohol consumption and its effects. These data sets may include demographics, geographical locations, health outcomes, and patterns of behavior. Relevant sample data sets include:

1.  **U.S. Chronic Disease Indicators (CDI):** This dataset by the U.S. Department of Health & Human Services provides a cross-cutting set of 124 indicators that allow states and territories to understand and compare chronic diseases, including those related to alcohol abuse.
2.  **National Survey on Drug Use and Health (NSDUH-2015):** This federal dataset primarily measures the prevalence and correlates of drug and alcohol use in the United States, offering insights into patterns and tendencies.
3.  **National Epidemiologic Survey on Alcohol and Related Conditions-III (NESARC-III):** A nationally representative survey of adult Americans, this dataset specifically focuses on alcohol and related conditions, providing extensive insights into the subject.
4.  **Gender, Mental Illness, and Crime in the United States, 2004:** Offered by the Department of Justice, this dataset examines the gendered effects of depression, drug use, and treatment on crime, which could include the influence of alcohol.


The importance of public databases in data science cannot be overstated. Firstly, they allow researchers from various backgrounds and institutions to access a treasure trove of information without financial barriers. This democratizes the field, allowing for a wider array of perspectives and more robust studies. Secondly, the availability of these data sets encourages transparency and reproducibility in scientific research. Other researchers can verify the results or build upon them, fostering a collaborative and progressive scientific community. Finally, public data sets enable interdisciplinary studies. For example, a sociologist and a medical scientist could jointly analyze the data on alcohol abuse to gain comprehensive insights into both the social and physiological aspects of the problem.

In essence, public databases are like vast, accessible libraries filled with raw facts and figures. They enable researchers to ask new questions, challenge existing assumptions, and contribute to our understanding of complex issues like alcohol abuse. By offering a starting point that is both rich in content and free of charge, sites like Data.gov play a pivotal role in nurturing a culture of open inquiry and innovation.

### Common Data Formats
Public databases come in a variety of formats. Some of the most important are:

1.  **CSV (Comma-Separated Values):** This is a widely used text-based format where values are separated by commas and each row represents a record. CSV files are easy to read and write, and they can be effortlessly imported into Pandas using the `read_csv` function (as we've done many times!)

2.  **JSON (JavaScript Object Notation):** JSON is a lightweight data-interchange format that is easy for humans to read and write. It's often used in web applications to transmit data between a server and a client. Pandas can read a JSON file using the `read_json` function. We'll talk about this in more detail when we come to APIs.

3.  **HTML or XML:** Some datasets are provided in HTML format, which can represent data in tables within web pages. Pandas has a `read_html` function that can scrape tabular data from HTML content and convert it into a DataFrame (see the section on web scraping). Some datasets are provided in a more general markup language called **XML** (Extensible Markup Language), which can also be parsed.

4.  **ArcGIS GeoServices REST API, GeoJSON, KML:** These formats are often related to geographical data. Handling them might require specialized libraries like `geopandas`, which extends Pandas to enable spatial operations.

In general, **most** public data will be provided in one of these formats. However, in order to "parse" it correctly (and load it into Pandas) for analysis, you'll often have to spend some time looking at the data and its structure.
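To see how the first two formats relate, the sketch below writes the same two made-up records once as CSV and once as JSON, then loads each into a DataFrame. The data is held in strings (wrapped in `io.StringIO`) so the example runs without downloading anything; with real files you would pass a filename or URL instead.

```python
import io

import pandas as pd

# The same two made-up records, written once as CSV and once as JSON
csv_text = "Brewery,ABV\nBrewery A,5.0\nBrewery B,6.5\n"
json_text = '[{"Brewery": "Brewery A", "ABV": 5.0}, {"Brewery": "Brewery B", "ABV": 6.5}]'

# io.StringIO makes a string behave like an open file, so no download is needed
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls produce a table with a `Brewery` column and an `ABV` column. The formats differ in how they mark structure: CSV relies on position (commas and line breaks), while JSON repeats the field names in every record.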

### Public Data: Strengths and Weaknesses

Public data sets have become a cornerstone in the ever-expanding field of data science, embodying both a promise of potential and a set of challenges. Their most significant strength lies in their accessibility, as they are freely available to anyone, leveling the playing field and allowing a wide range of researchers to tap into valuable insights. This democratization of data not only fosters inclusivity but also leads to a rich diversity of information covering various domains and geographies. This wide spectrum of information has made interdisciplinary studies more feasible and encouraged a more transparent and collaborative scientific community, where findings can be verified, reproduced, and built upon.

However, the same openness also brings about certain weaknesses. The varying quality and accuracy of public data sets can be a significant concern, where errors or inconsistencies might lead to incorrect conclusions. Finding relevant data sets that match a specific research question's granularity or specificity can be a daunting task. Privacy concerns, though often addressed through anonymization, still linger, and the responsibility to ensure that sensitive information is handled appropriately cannot be overlooked. The lack of standardization across different data sets adds another layer of complexity, requiring additional effort in preprocessing and cleaning. Moreover, the popularity of certain public data sets might lead to their overuse, resulting in redundant studies and a saturation of perspectives on the same subject. Lastly, without adequate context or metadata, there's a risk of misinterpreting the information contained within these data sets.


### Questions: Public Data
1. How can data scientists ensure responsible use of public data sets while balancing the need for detailed information and individual privacy? Where should the line be drawn between openness and confidentiality?

2. How should researchers evaluate the suitability of a public data set for a specific study, such as researching alcohol abuse? What methods can be used to assess accuracy, consistency, and granularity?

3. How can public data sets be leveraged to study complex issues like alcohol abuse across various disciplines? What challenges might arise in integrating data from different fields, and how can they be overcome?

### Answers: Public Data

1.

2.

3.

## Application Programming Interfaces (APIs)
Application Programming Interfaces (APIs) are a pivotal connection point in the digital world, serving as a bridge between different software applications. They enable developers to access specific functionalities or data from a service without having to understand the underlying code. In the context of data science, APIs provide a dynamic and often real-time gateway to valuable data that can be used for analysis, visualization, and insight generation.

Unlike static data sets that may be downloaded and used as-is, APIs offer a more interactive approach. They allow researchers to query specific information, filter results, and even manipulate data on the fly. This real-time interaction can be incredibly powerful, particularly when working with constantly changing or updating information.

Let's consider the example of the Open Brewery DB API (https://api.openbrewerydb.org/v1/breweries). This API provides access to a database of breweries, allowing users to explore details such as brewery types, locations, and contact information. Imagine you are researching the craft beer industry or studying cultural trends related to alcohol consumption. With this API, you could dynamically query information about breweries in a specific state, filter by brewery type, or even track new breweries as they open.
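Queries like these are expressed as filters appended to the endpoint URL. The sketch below assembles such URLs from keyword arguments using Python's standard library, without sending any request. The helper `build_url` is a hypothetical name introduced for illustration, and `by_type` is assumed to be a supported filter; check the API documentation before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openbrewerydb.org/v1/breweries"


def build_url(**filters):
    """Return BASE_URL with the given filters appended as a query string.

    Hypothetical helper for illustration; it only builds text and
    does not contact the API.
    """
    if not filters:
        return BASE_URL
    return f"{BASE_URL}?{urlencode(filters)}"


# Breweries in Rochester, Minnesota.
print(build_url(by_state="minnesota", by_city="rochester"))

# Microbreweries anywhere in Minnesota (by_type is an assumed filter name).
print(build_url(by_state="minnesota", by_type="micro"))
```

Building the URL from a dictionary of filters avoids typos in the `?` and `&` separators and makes it easy to change one filter at a time.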

To see how this API is accessed, we will load the data on breweries in Rochester, MN. The code works as follows:

1. The code begins by importing two essential libraries: `requests`, which is used to make HTTP requests, and `json`, which helps in handling JSON data.

2. The variable `url` contains the **API endpoint**, which is the specific URL where the request is sent. This URL includes the query parameters `by_state` and `by_city`, which filter the data for breweries in Rochester, Minnesota.

3.  The line `response = requests.get(url)` sends a **GET request** to the specified endpoint. GET is an HTTP method used to retrieve data from the server.

4.  The response from the server includes a **status code**, which indicates whether the request was successful. A status code of 200 means "OK," signaling a successful request.

5. If the request is successful, the code extracts the data using `response.json()`. This method parses the **JSON data** returned by the API, converting it into a Python object. We'll say more about JSON data below.

6. The first result is then **pretty-printed** using `json.dumps(data[0], indent=2)`, which formats the JSON data with an indentation of 2 spaces for better readability.

7.  If the status code is anything other than 200, the code prints the status code and error message. This helps in debugging and understanding what went wrong with the request.

8. In the URL, you see specific filters like `by_state` and `by_city`. These are known as **query parameters** and are used to customize the data retrieved from the API. Other parameters, such as `per_page`, limit the number of results returned.

Here's what this looks like. To keep the output manageable, we're just going to display the "top" result.

In [2]:
import requests
import json

# The API endpoint URL
url = "https://api.openbrewerydb.org/v1/breweries?by_state=minnesota&by_city=rochester"

# Make a GET request to the API
response = requests.get(url)

# Check the response status code
if response.status_code == 200:
  # The request was successful, get the data
  data = response.json()

  # Pretty-print only the first result to keep the output short
  print(json.dumps(data[0], indent=2))

else:
  # The request failed, print the error message
  print(response.status_code)
  print(response.text)

{
  "id": "190e7e9c-9e20-4b26-b820-f951c75c27b3",
  "name": "Forager Brewing Company",
  "brewery_type": "brewpub",
  "address_1": "1005 6th St NW",
  "address_2": null,
  "address_3": null,
  "city": "Rochester",
  "state_province": "Minnesota",
  "postal_code": "55901-2741",
  "country": "United States",
  "longitude": "-92.47868101",
  "latitude": "44.02937696",
  "phone": "5072587490",
  "website_url": "http://www.foragerbrewery.com",
  "state": "Minnesota",
  "street": "1005 6th St NW"
}


The output you see is in **JSON**, or **JavaScript Object Notation**, a common format used by APIs. It is often employed to transmit information between a web server and a browser, making it a critical tool in the development of modern web applications. Imagine a digital address book for breweries, where each entry is neatly organized into categories like name, type, address, and contact information. JSON is the ledger that keeps this information in an easily readable and processable form.

Let's take a closer look at the example. The full response is a **JSON array**, indicated by square brackets `[ ]`, in which each element is a **JSON object** describing one brewery. The output above shows only the first of these objects. Within each object, enclosed by curly braces `{ }`, there are **key-value pairs** that represent various attributes of the brewery. The **key** is a string that defines the attribute's name, and the **value** can be a string, number, boolean, `null`, or another object or array. For instance, the key `"name"` has the value `"Forager Brewing Company"`, and the key `"address_2"` has the value `null`, indicating that there's no secondary address information. This structure allows for a hierarchical, flexible, and human-readable representation of data, making JSON an elegant and widely used format in web development and beyond.
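To make the mapping between JSON and Python concrete, here is a small sketch that parses a shortened copy of the brewery record with the standard `json` module. The string `raw` is an abbreviated stand-in for the real API response, not the complete record.

```python
import json

# A shortened copy of the brewery record, wrapped in a JSON array.
raw = """
[
  {
    "name": "Forager Brewing Company",
    "brewery_type": "brewpub",
    "address_2": null,
    "city": "Rochester"
  }
]
"""

breweries = json.loads(raw)  # JSON array  -> Python list
first = breweries[0]         # JSON object -> Python dict

print(type(breweries).__name__, type(first).__name__)
print(first["name"])
print(first["address_2"] is None)  # JSON null -> Python None
```

This is the same conversion that `response.json()` performs on the body of an API response.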

Pandas provides an easy way of loading this sort of JSON data into a DataFrame.

In [4]:
import pandas as pd
brewery_df = pd.DataFrame(data)
brewery_df.head()


Unnamed: 0,id,name,brewery_type,address_1,address_2,address_3,city,state_province,postal_code,country,longitude,latitude,phone,website_url,state,street
0,190e7e9c-9e20-4b26-b820-f951c75c27b3,Forager Brewing Company,brewpub,1005 6th St NW,,,Rochester,Minnesota,55901-2741,United States,-92.47868101,44.02937696,5072587490.0,http://www.foragerbrewery.com,Minnesota,1005 6th St NW
1,0a5882a8-ba47-43a1-b5f5-804fa1ea8f2a,Grand Rounds Brewing Company,brewpub,4 3rd St SW,,,Rochester,Minnesota,55902-3019,United States,-92.46339522,44.0202771,5072921628.0,http://www.grandroundsbrewing.com,Minnesota,4 3rd St SW
2,90d71e53-16cf-4e32-ab36-fd068ab717aa,Kinney Creek Brewery,micro,1016 7th St NW,,,Rochester,Minnesota,55901-2668,United States,-92.47808893,44.0312604,,,Minnesota,1016 7th St NW
3,2688cb2b-a937-42d7-a49f-3e1894a66021,Little Thistle Brewing,planning,,,,Rochester,Minnesota,55901,United States,,,5073584647.0,http://www.littlethistlebeer.com,Minnesota,
4,4d1d38b6-5514-45bc-a020-74efd0c1bf8f,LTS Brewing Company,micro,2001 32nd Ave NW,,,Rochester,Minnesota,55901-8321,United States,-92.5104808,44.0447412,5072268280.0,http://www.ltsbrewing.com,Minnesota,2001 32nd Ave NW
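Once the records are in a DataFrame, the aggregate tools from earlier in this chapter apply directly. The sketch below is one possible follow-up, using three records trimmed from the output above so that it runs without a network connection. The name `sample_df` and the choice of columns are illustrative assumptions.

```python
import pandas as pd

# Three records trimmed from the API output; None plays the role of JSON null.
records = [
    {"name": "Forager Brewing Company", "brewery_type": "brewpub",
     "latitude": "44.02937696", "longitude": "-92.47868101"},
    {"name": "Kinney Creek Brewery", "brewery_type": "micro",
     "latitude": "44.0312604", "longitude": "-92.47808893"},
    {"name": "Little Thistle Brewing", "brewery_type": "planning",
     "latitude": None, "longitude": None},
]
sample_df = pd.DataFrame(records)

# The API returns coordinates as strings; convert them to floats.
# errors="coerce" turns anything unparseable into NaN.
for col in ["latitude", "longitude"]:
    sample_df[col] = pd.to_numeric(sample_df[col], errors="coerce")

# Aggregate: how many breweries of each type are in the sample?
type_counts = sample_df["brewery_type"].value_counts()

print(sample_df.dtypes)
print(type_counts)
```

Converting the coordinate columns matters because numeric aggregates such as `mean()` or `max()` are not meaningful on string columns.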


## Exercise: Call the OpenBreweryDB API
For this exercise, I'd like you to practice calling the OpenBreweryDB API to retrieve data that interests you (something other than "Which breweries are in Rochester, MN?"). Here's what you should do:

1. Read the API documentation here: https://www.openbrewerydb.org/documentation. This will tell you about the "query" format, as well as what sorts of questions can be answered.
2. Once you've figured out the format of the query, enter it below, and run it to see the results.

In [None]:
import requests
import json

# The API endpoint URL. This is what you need to enter.
# url = "https://api.openbrewerydb.org/v1/breweries?by_state=minnesota&by_city=rochester" # Example


### DO NOT CHANGE ANYTHING BELOW THIS LINE

# Make a GET request to the API
response = requests.get(url)

# Check the response status code
if response.status_code == 200:
  # The request was successful, get the data
  data = response.json()

  # Print the data
  print(json.dumps(data, indent=2))

else:
  # The request failed, print the error message
  print(response.status_code)
  print(response.text)