# Class 10: Intro to Data Visualization

Plan for today:
- Review pandas DataFrames
- Discuss joining DataFrames
- Discuss data visualization


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(10)   # get class code    
# YData.download.download_class_code(10, True) # get the code with the answers 

YData.download.download_data("dow.csv")
YData.download.download_data("monthly_egg_prices.csv")
YData.download.download_data("monthly_wheat_prices.csv")
YData.download.download_data("US_Gasoline_Prices_Weekly.csv")
YData.download.download_data("The_Big_Game_Stats_2023.csv")
YData.download.download_data("nba_salaries_2015_16.csv")
YData.download.download_data("nba_position_names.csv")


There are also similar functions to download the homework:

In [None]:
YData.download.download_homework(4)  # downloads the homework 

If you are using Google Colab, you should install the YData package by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Review of processing DataFrames - warm-up problems

To review manipulating pandas DataFrames, let's do a few warm-up exercises. 

In particular, let's look at information about the Super Bowl! 


In [None]:
super_bowl = pd.read_csv("The_Big_Game_Stats_2023.csv")

super_bowl.head(3)

**Problem 1:** To start, create a DataFrame called `super_simpler` that just has the following columns: 
- `Winner`: The name of the team that won the Super Bowl
- `Winner_Pts`: The number of points the winning team scored
- `Loser_Pts`: The number of points the losing team scored


In [None]:
# Get just a subset of columns
super_simpler = super_bowl[['Winner', 'Winner_Pts', 'Loser_Pts']].copy()

super_simpler.head(3)

**Problem 2:** Calculate the mean score of teams that won and the mean score of teams that lost the Super Bowl. 

See if you can do this in a single line of code. 

In [None]:
# what is the mean number of points that super bowl winners and losers have scored? 

super_simpler[["Winner_Pts", "Loser_Pts"]].mean()

**Problem 3:** Now let's look at which teams have won the most Super Bowls. 

To do this create a DataFrame called `winner_counts` that has the number of times each team has won the Super Bowl, and sort this DataFrame in order so that teams that have won the most Super Bowls are on the top. 


Reminder: There are several ways to get multiple statistics by group. Perhaps the most useful way is to use the syntax:

<pre>
my_df.groupby("group_col_name").agg(
   new_col1 = ('col_name', 'statistic_name1'),
   new_col2 = ('col_name', 'statistic_name2'),
   new_col3 = ('col_name', 'statistic_name3')
)
</pre>
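
As a small illustration of this syntax, here is a sketch on a made-up DataFrame. The `toy_games` data and its column names are hypothetical and are not part of the class data.

```python
import pandas as pd

# Hypothetical toy data (not from the class files)
toy_games = pd.DataFrame({
    "team": ["A", "B", "A", "C", "A", "B"],
    "points": [21, 14, 35, 10, 28, 24],
})

# Named aggregation: each keyword argument becomes a column in the result,
# computed from the (column name, statistic name) pair on the right
toy_summary = (toy_games
 .groupby("team")
 .agg(
    num_games = ("points", "count"),
    mean_points = ("points", "mean"),
    max_points = ("points", "max"),
 )
 .sort_values("num_games", ascending = False)
)

print(toy_summary)
```

The result is indexed by the grouping column (`team` here), with one row per group and one column per named statistic.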



In [None]:
winner_counts_alt = (super_simpler
 .groupby("Winner")
 .agg(Num_Wins = ("Winner_Pts", "count"))
 .sort_values("Num_Wins", ascending = False)
)

winner_counts_alt.head(3)

In [None]:
# An alternative way to solve the problem

winner_counts = (super_simpler
 .groupby("Winner")
 .count()
 .sort_values("Winner_Pts", ascending = False)
)


winner_counts = (winner_counts
 .rename(columns = {"Winner_Pts": "Number of Wins"})["Number of Wins"]
 .reset_index()
)

winner_counts.head(3)

## "Joining" DataFrames by Index

To explore joining DataFrames, let's load the egg and wheat prices as DataFrames. 

We will also:
- Rename the Price columns to Egg Price and Wheat Price
- Set the Index to be the date


When two DataFrames have the same Index values, we can use the `.join()` method to join them.
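
To see how the `how` argument changes which rows are kept, here is a minimal sketch using two made-up DataFrames whose Index values only partly overlap. The names `left_df` and `right_df` and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with partially overlapping Index values
left_df = pd.DataFrame({"Left Value": [1, 2, 3]}, index = ["a", "b", "c"])
right_df = pd.DataFrame({"Right Value": [20, 30, 40]}, index = ["b", "c", "d"])

# Try each join type and record how many rows the result has
join_sizes = {}
for how in ["left", "right", "inner", "outer"]:
    joined = left_df.join(right_df, how = how)
    join_sizes[how] = len(joined)
    print(how, list(joined.index))
```

With this toy data, the left and right joins each keep 3 rows, the inner join keeps only the 2 shared Index values, and the outer join keeps all 4. Rows without a match on the other side are filled with `NaN`.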

In [None]:
# load the egg and wheat prices as DataFrames
egg_price_df = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, index_col= "DATE")
egg_price_df = egg_price_df.rename(columns = {"Price":"Egg Price"})
egg_price_df.head(3)

In [None]:
wheat_price_df = pd.read_csv("monthly_wheat_prices.csv", parse_dates=True, index_col= "DATE")
wheat_price_df = wheat_price_df.rename(columns = {"Price":"Wheat Price"})
wheat_price_df.head(3)

In [None]:
# Let's do a left join by setting how = "left"
# This gives the same result as an outer join because egg_price_df contains every index value in wheat_price_df (and more)
left_joined = egg_price_df.join(wheat_price_df, how = "left") 
left_joined

In [None]:
# Let's do a right join by setting how = "right"  
# This gives the same result as an inner join because egg_price_df contains every index value in wheat_price_df (and more)
right_joined = egg_price_df.join(wheat_price_df, how = "right") 
right_joined

### "Merging" DataFrames by column values

If we want to join by value in a column rather than by Index value we can use the `.merge()` method (which is very similar to the `.join()` method). 


In [None]:
egg_price_df2 = egg_price_df.reset_index()
egg_price_df2.head(3)

In [None]:
wheat_price_df2 = wheat_price_df.reset_index()

wheat_price_df2.head(3)

In [None]:
left_joined2 = egg_price_df2.merge(wheat_price_df2, how = "left") 
left_joined2

#### Merging with different column names

What if the columns we want to join on have different names? In that case, we can use the `left_on` and `right_on` arguments to specify which columns (i.e., keys) should be used to align the two DataFrames.
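
Here is a minimal sketch of `left_on` and `right_on` on made-up data. The `orders` and `customers` DataFrames and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: the key column has a different name in each DataFrame
orders = pd.DataFrame({"order_id": [1, 2, 3], "cust": ["x", "y", "z"]})
customers = pd.DataFrame({"customer_code": ["x", "y"], "name": ["Xavier", "Yara"]})

# Align rows where orders["cust"] equals customers["customer_code"]
merged = orders.merge(customers, how = "left",
                      left_on = "cust", right_on = "customer_code")

# Both key columns are kept in the result, so one of them is usually dropped
merged_clean = merged.drop(columns = ["customer_code"])

print(merged_clean)
```

Because this is a left join, the order with no matching customer stays in the result and its `name` is `NaN`.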

In [None]:
egg_price_df3 = egg_price_df2.rename(columns = {"DATE":"Egg DATE"})
wheat_price_df3 = wheat_price_df2.rename(columns = {"DATE": "Wheat DATE"})

wheat_price_df3.head(3)


In [None]:
egg_price_df3.head(3)

In [None]:
left_joined3 = egg_price_df3.merge(wheat_price_df3, how = "left", left_on = "Egg DATE", right_on = "Wheat DATE") 
left_joined3

#### Example: Spelling out NBA position names

As you will recall, our NBA salaries DataFrame had the different positions listed as abbreviations such as "C" and "PG". 

It is often hard to tell what these abbreviations (or codes) mean, so a common use of joining is to attach a lookup table of longer, more meaningful names to a table that only contains the abbreviations. 

Below we load our `nba_salaries` DataFrame along with a `nba_positions` DataFrame which has information about how each position abbreviation maps on to the position's full name.

Let's merge these DataFrames together so that our `nba_salaries` DataFrame has the full position names!
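
One thing to watch for with lookup tables: `.merge()` defaults to an inner join, so rows whose abbreviation has no match in the lookup table are dropped silently. The sketch below shows one way to check for this, using made-up data. The `players` and `lookup` DataFrames, the "XX" code, and the `Position Name` column are hypothetical and may not match the class files.

```python
import pandas as pd

# Hypothetical toy data (not the class files)
players = pd.DataFrame({
    "PLAYER": ["P1", "P2", "P3", "P4"],
    "POSITION": ["C", "PG", "C", "XX"],   # "XX" has no entry in the lookup table
})
lookup = pd.DataFrame({
    "Position Abbreviation": ["C", "PG"],
    "Position Name": ["Center", "Point Guard"],
})

# how = "left" keeps every player, validate checks that each abbreviation
# appears at most once in the lookup table, and indicator adds a "_merge"
# column saying where each row was found
checked = players.merge(lookup, how = "left",
                        left_on = "POSITION",
                        right_on = "Position Abbreviation",
                        validate = "many_to_one",
                        indicator = True)

# Rows found only in the left DataFrame had no matching abbreviation
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

If `unmatched` is empty, an inner join and a left join give the same rows, and the default is safe to use.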



In [None]:
nba_salaries = pd.read_csv("nba_salaries_2015_16.csv")

nba_salaries.head(3)


In [None]:
nba_positions = pd.read_csv("nba_position_names.csv")
nba_positions

In [None]:
# merge the DataFrames together so each player's position is the full position name

nba_improved = nba_salaries.merge(nba_positions, left_on = "POSITION", right_on = "Position Abbreviation")

nba_improved.head(5)

In [None]:
# remove unnecessary columns using the .drop(columns = ) method
nba_improved.drop(columns = ["POSITION", "Position Abbreviation"])

![pandas](https://image.goat.com/transform/v1/attachments/product_template_additional_pictures/images/071/445/310/original/719082_01.jpg.jpeg)

## Data visualization!

Let's go through different ways to visualize data. To do this, let's look again at egg and gas prices.


In [None]:
egg_prices = pd.read_csv("monthly_egg_prices.csv", parse_dates = [0])
gas_prices = pd.read_csv("US_Gasoline_Prices_Weekly.csv", parse_dates = [0])

print(egg_prices.head(3))

gas_prices.head(3)

To start with, let's get a little more practice joining DataFrames by joining the egg and gas prices together into a single DataFrame. 

Let's do an inner join to only keep the dates where we have prices for both eggs and gas. 

In [None]:
# merge the egg and gas prices
prices = egg_prices.merge(gas_prices, how = "inner", left_on = "DATE",
                          right_on = "Week")

prices.head(3)


Let's also clean up our prices data by only keeping the columns we need, and renaming them to more meaningful names.

In [None]:
# only keep the columns we need
prices = prices[["Week", "Price", "DollarsPerGallon"]]

# rename the columns to have more meaningful names
prices = prices.rename(columns = {"Price": "Eggs", "DollarsPerGallon":"Gas"})

prices.head(3)


Now we are ready to start visualizing the data!

Let's start by creating line plots!

In [None]:
# create a line plot of egg prices, and also include a circle marker at each point 
plt.plot(prices["Eggs"], "-o");

In [None]:
# Let's have the x-axis be the actual dates
plt.plot(prices["Week"], prices["Eggs"], "-o");

What is [wrong](https://xkcd.com/833/) with these plots???


In [None]:
# Let's make this better!
plt.plot(prices["Week"], prices["Eggs"], "-o");
plt.ylabel("Price ($)");
plt.xlabel("Date");
plt.title("Egg prices over time");

In [None]:
# Let's compare egg and gas prices on the same plot

plt.plot(prices["Week"], prices["Eggs"], "-o", label = "Eggs");
plt.plot(prices["Week"], prices["Gas"], "-o", label = "Gas");

plt.ylabel("Price ($)");
plt.xlabel("Date");
plt.title("Egg and gas prices over time");
plt.legend();


In [None]:
# Side note: Are there any weeks where Eggs cost more than Gas? 

# Can you use pandas to show this?  

prices2 = prices.copy()

prices2["Price Diff"] = prices2["Gas"] - prices2["Eggs"]

prices2.sort_values("Price Diff").head(7)

### Histograms 

We can create histograms using the `plt.hist()` function. 
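
As a quick sketch of what `plt.hist()` does, the example below uses randomly generated numbers rather than the class data. The `fake_prices` values and the plot labels are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 200 random values centered around 3
rng = np.random.default_rng(0)
fake_prices = rng.normal(loc = 3.0, scale = 0.5, size = 200)

# plt.hist() splits the range of the data into bins and counts the values in each.
# It returns the counts, the bin edges, and the drawn bars.
counts, bin_edges, _ = plt.hist(fake_prices, bins = 20, edgecolor = "black", alpha = .5);
plt.xlabel("Value");
plt.ylabel("Count");
plt.title("Histogram of made-up values");

print(len(counts), len(bin_edges))
```

With `bins = 20` there are 20 counts and 21 bin edges, and the counts add up to the number of data points.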


In [None]:
plt.hist(prices["Gas"], edgecolor = "black", bins = 20, alpha = .5);
plt.xlabel("Price ($)");
plt.ylabel("Count");
plt.title("Weekly US average gas prices");

In [None]:
plt.hist(prices["Gas"], edgecolor = "black", bins = 20, alpha = .5, label = "Gas");
plt.hist(prices["Eggs"], edgecolor = "black", bins = 20, alpha = .5, label = "Eggs");
plt.xlabel("Price ($)");
plt.ylabel("Count");
plt.legend();

### We will continue more with visualizing data next week...

<br>
<br>

<img src="https://imgs.xkcd.com/comics/science_valentine.png">
