<a href="https://colab.research.google.com/github/coding-integration/Math-Coding-Integration/blob/main/3.%20Introduction%20to%20Data%20Analysis%20in%20Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcoding-integration%2FMath-Coding-Integration&branch=main&subPath=3.%20Introduction%20to%20Data%20Analysis%20in%20Python.ipynb&depth=1"  target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Introduction to Data Analysis in Python <img src="https://i.imgur.com/qUSxW7s.png" height=45 width=45 align=right>

## Learning Goals
- Use Python to investigate a dataset and conduct statistical analysis on one- and two-variable data
- Understand how Python may be used to perform data analysis

## Success Criteria
- I can use Python code to generate boxplots and scatter plots
- I can use Python code to calculate the minimum, maximum, mean, median, and quartiles of a dataset 
- I can analyze the output of code to assess its statistical and mathematical meaning
- I can modify the code to perform statistical analysis



---

# Spreadsheets <img src="https://1000logos.net/wp-content/uploads/2017/03/McDonalds-logo.png" height=115 width=200 align=right>

For this activity, we will be working with nutritional facts data for McDonald's breakfast foods. The data that we need is stored in a **spreadsheet**. A spreadsheet is a table that organizes data into rows and columns.

<img src="https://i.imgur.com/NVwvbxS.png" width=800>

Our dataset consists of the breakfast food items served at McDonald's. The first column contains the names of the items. The other columns consist of some nutritional facts for each item - Serving Size, Calories, Total Fat, and Protein.

To see the dataset in its entirety, <a href="https://docs.google.com/spreadsheets/d/1Zj8Nl-zw4bKIA2Sd7hzEbCAH5NlAp0am4dbmsAFefSY/edit?usp=sharing">view the spreadsheet</a>.
Using Python, we will process and analyze this data by creating graphs and calculating some basic statistics.
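
As a rough illustration of how a spreadsheet maps onto Python, the sketch below builds a tiny table with the pandas library. The column headings follow the ones described above. The two rows and every number in them are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical rows: the item names and every number here are placeholders,
# not values from the real spreadsheet.
table = pd.DataFrame({
    "Item": ["Example Item A", "Example Item B"],
    "Serving Size": [100, 150],
    "Calories": [300, 450],
    "Total Fat": [12, 20],
    "Protein": [10, 15],
})

print(table.shape)          # (number of rows, number of columns)
print(list(table.columns))  # the column headings
print(table["Calories"])    # one column, selected by its heading
```

Each spreadsheet row becomes one row of the table, and each column can be selected by its heading.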



---



## Mean and Median

The mean of a set of numbers is the "average" - **what is the process for calculating the mean?**

The median of a set of numbers is the "middle" value - **what is the process for calculating the median?**

Double click **here** to enter your responses.


Run the code block below. `import pandas as pd` and `import numpy as np` are Python commands that import two libraries (pandas and NumPy) to help us perform some **data analysis** on our dataset. Data analysis can include finding basic statistics such as the mean and median, as well as creating graphs and charts to represent our data.

In [2]:
import pandas as pd
import numpy as np

We will find the mean and median number of calories among McDonald's breakfast foods. Run the code below to find these values:

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/coding-integration/Math-Coding-Integration/main/mcdonalds_breakfast_items.csv").dropna() # open our dataset and drop any rows with missing values
mean = data["Calories"].mean() #calculate the mean of the "Calories" column
median = data["Calories"].median() #calculate the median of the "Calories" column

print("The mean number of calories is", mean)
print("The median number of calories is", median)
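
Once you have written your own answers above, one way to check them is to carry out the same calculations step by step. The sketch below is only an illustration: the four calorie values are hypothetical, and Python's built-in `statistics` module is used just to confirm the hand calculation.

```python
import statistics

# Hypothetical calorie values, chosen only to keep the arithmetic easy to follow
values = [300, 150, 450, 250]

# Mean: add up all the values, then divide by how many there are
mean_by_hand = sum(values) / len(values)

# Median: sort the values, then take the middle one
# (or the average of the two middle ones when the count is even)
ordered = sorted(values)
middle = len(ordered) // 2
if len(ordered) % 2 == 1:
    median_by_hand = ordered[middle]
else:
    median_by_hand = (ordered[middle - 1] + ordered[middle]) / 2

print(mean_by_hand, statistics.mean(values))      # 287.5 287.5
print(median_by_hand, statistics.median(values))  # 275.0 275.0
```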

---
<img src="https://www.eatthis.com/wp-content/uploads/sites/4/2019/08/best-worst-mcdonalds.jpg?quality=82&strip=1" height=200 width=250 align=right>

# Boxplots

One of the ways that we can visualize the distribution of data is with a boxplot (also known as a box-and-whisker plot). The boxplot shows the minimum and maximum values as well as the quartiles of the data. But first, we will calculate these values using Python.

Run the code below to find the minimum and maximum (lowest and highest) number of calories among all McDonald's breakfast food items.



In [None]:
minimum = data["Calories"].min()

# use the minimum value to find out what item has the fewest calories
minItem = data.loc[data["Calories"] == minimum]["Item"].values[0]

maximum = data["Calories"].max()

# use the maximum value to find out what item has the most calories
maxItem = data.loc[data["Calories"] == maximum]["Item"].values[0]

print("The item with the fewest calories is",minItem,"at",minimum,"calories")
print("The item with the highest calories is",maxItem,"at",maximum,"calories")

The item with the fewest calories is Hash Brown at 150.0 calories
The item with the highest calories is Big Breakfast with Hotcakes (Large Biscuit) at 1150.0 calories
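
A possible alternative to filtering on the minimum or maximum value is to ask pandas for the row that holds it, using `idxmin()` and `idxmax()`. The sketch below assumes a small stand-in table: the first and last rows reuse the output shown above, and the middle row is a made-up placeholder.

```python
import pandas as pd

# The first and last rows reuse the output shown above; the middle row is a
# made-up placeholder so the table has more than two items.
sample = pd.DataFrame({
    "Item": ["Hash Brown", "Example Item", "Big Breakfast with Hotcakes (Large Biscuit)"],
    "Calories": [150.0, 400.0, 1150.0],
})

# idxmin/idxmax return the row label of the smallest/largest value
min_item = sample.loc[sample["Calories"].idxmin(), "Item"]
max_item = sample.loc[sample["Calories"].idxmax(), "Item"]

print(min_item)  # Hash Brown
print(max_item)  # Big Breakfast with Hotcakes (Large Biscuit)
```

If several items tie for the minimum or maximum, both approaches report the first one in the table.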


## Quartiles

Recall: **Quartiles** divide the data into 4 parts, each containing 25% of the data. To divide the data, we must arrange the data in order and calculate 3 values (Q1, Q2, and Q3). Calculating these values will tell us more about the data.

Run the code below to calculate the quartiles Q1, Q2, and Q3.

In [None]:
Q1 = data["Calories"].quantile(0.25)
Q2 = data["Calories"].quantile(0.5)
Q3 = data["Calories"].quantile(0.75)

print("Q1:",Q1)
print("Q2:",Q2)
print("Q3:",Q3)
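
The values that `quantile()` returns may differ slightly from quartiles worked out by hand. By default, pandas interpolates between neighbouring data values, while a common classroom method takes the median of the lower half and of the upper half. The sketch below compares the two on a small hypothetical dataset; neither result is "wrong", they simply follow different conventions.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])  # hypothetical data, already sorted

# pandas: interpolates between neighbouring values by default
q1_pandas = values.quantile(0.25)
q3_pandas = values.quantile(0.75)

# "Median of each half" method often used when working by hand
half = len(values) // 2
q1_by_hand = values.iloc[:half].median()   # median of 1, 2, 3, 4
q3_by_hand = values.iloc[-half:].median()  # median of 5, 6, 7, 8

print("pandas: ", q1_pandas, q3_pandas)    # 2.75 6.25
print("by hand:", q1_by_hand, q3_by_hand)  # 2.5 6.5
```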

### Respond to these questions:
1. What do the Q1 and Q3 values tell us? 
2. What is another name for Q2?



Double click **here** to enter your responses.


## Generating a Boxplot

Now that we have calculated the values we need, we will use Python to generate a boxplot. Run the code block below to generate the boxplot.

In [None]:
boxplot = data.boxplot(column=["Calories"], whis=[0,100])
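
In the call above, `whis=[0,100]` asks for the whiskers to be drawn at the 0th and 100th percentiles, that is, at the minimum and maximum. Without it, the plotting library's default is to stop each whisker within 1.5 × IQR of the box and draw anything beyond that as a separate point. The sketch below uses NumPy and hypothetical calorie values to show the five numbers a boxplot is built from and how that 1.5 × IQR rule would flag an unusually large value.

```python
import numpy as np

# Hypothetical calorie values with one unusually large item
values = np.array([150, 300, 350, 400, 450, 1150])

# Five-number summary: minimum, Q1, median, Q3, maximum
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1  # interquartile range: the width of the box

# 1.5 x IQR rule: values outside these fences are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(minimum, q1, median, q3, maximum)  # 150.0 312.5 375.0 437.5 1150.0
print(outliers)                          # [1150]
```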

#### What can this boxplot tell us about the distribution of calories?


Double click **here** to enter your responses.



---


# Analyzing Correlation Between Two Variables

We will create a scatter plot using two variables in our dataset: **calories** and **total fat**.

## Hypothesis

1. What type of correlation will exist between the amount of calories and total fat? Justify your reasoning.
2. Will the amount of calories and total fat be represented by a linear or non-linear trendline? Justify your reasoning.


Double click **here** to enter your responses.

## Generating a Scatter Plot

Run the code below to generate the scatter plot.

In [None]:
scatterplot = data.plot.scatter(x='Total Fat', y='Calories')
# you can change "Total Fat" or "Calories" to other columns in the spreadsheet
x = data['Total Fat'].to_numpy()
y = data['Calories'].to_numpy()

#calculate the trendline
z = np.polyfit(x, y, 1) 
p = np.poly1d(z)

scatterplot.plot(x,p(x),"r--") #plot the trendline

trendline = f"y={z[0]:0.3f}x{z[1]:+0.3f}" #formats the trendline as y=ax+b
print("The trendline for this scatter plot is",trendline)
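
A trendline can also be used to make predictions and to measure how strong the linear relationship is. The sketch below is a minimal illustration under stated assumptions: the five (total fat, calories) pairs are hypothetical, and `np.corrcoef` is used to compute the correlation coefficient r, which is not part of the cell above.

```python
import numpy as np

# Hypothetical (total fat, calories) pairs, not taken from the real dataset
fat = np.array([5, 10, 15, 20, 25])
calories = np.array([100, 190, 260, 370, 430])

# Least-squares line y = slope*x + intercept
slope, intercept = np.polyfit(fat, calories, 1)
predict = np.poly1d([slope, intercept])

# Correlation coefficient r, always between -1 and 1
r = np.corrcoef(fat, calories)[0, 1]

print(f"y={slope:0.3f}x{intercept:+0.3f}")
print("Total fat of 12 (inside the data range):", predict(12))   # interpolation
print("Total fat of 70 (outside the data range):", predict(70))  # extrapolation
print("r =", r)
```

A value of r close to 1 or -1 suggests a strong linear correlation, while a value close to 0 suggests a weak one. Predictions far outside the range of the data rely on the trend continuing unchanged, so treat them with more caution.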

## Consolidation Questions

1. Are there outliers in the data? How could they be explained?
2. Write a statement that describes the trend. Be sure to use proper terminology.
3. Use the scatter plot to interpolate the total fat if there are 200 calories in a food.
4. Use the scatter plot to extrapolate the number of calories if the total fat in the item is 70.
5. Use the trendline equation to verify your interpolation and extrapolation calculations. 
6. Are these interpolations and extrapolations accurate? Why or why not? What factors could influence the accuracy of these predictions?
7. Were your original hypotheses about the trendline and type of correlation correct? Explain how you know.

Double click **here** to enter your responses.



---


# Final Food for Thought...

Respond to the questions below:

1. Try to find foods that have similar serving sizes but different calorie counts. What factors could influence the number of calories an item has?
2. Do you think there would be a difference in the data analysis of other fast food restaurants? Why or why not?
3. Do you think other correlations could exist between other variables? Try changing the code in "Generating a Scatter Plot" to other columns of data.

Double click **here** to enter your responses.