# Class 11: Intro to Data Visualization

Plan for today:
- Quick review of joining DataFrames
- Data visualization using matplotlib


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(11)   # get class code    
# YData.download.download_class_code(11, True) # get the code with the answers

YData.download.download_data("dow.csv")
YData.download.download_data("monthly_egg_prices.csv")
YData.download.download_data("monthly_wheat_prices.csv")
YData.download.download_data("US_Gasoline_Prices_Weekly.csv")
YData.download.download_data("nba_salaries_2015_16.csv")
YData.download.download_data("nba_position_names.csv")


There are also similar functions to download the homework:

In [None]:
# YData.download.download_homework(5)  # downloads the homework 

If you are using Google Colab, you should install polars and the YData package by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## "Joining" DataFrames example

Let's go through the exact example I had in my slides so we can understand the difference between the different types of joins.

Below I create the DataFrames "by hand". 

In [None]:
x_df = pd.DataFrame({"key_x": [1, 2, 3], "val_x": ["x1", "x2", "x3"]})

x_df


In [None]:
y_df = pd.DataFrame({"key_y": [1, 2, 4], "val_y": ["y1", "y2", "y3"]})

y_df

In [None]:
# left join keeps all the rows in the left DataFrame and joins on matching rows in the right DataFrame





In [None]:
# right join keeps all the rows in the right DataFrame and joins on matching rows in the left DataFrame




In [None]:
# inner join keeps only the rows where the keys match in both DataFrames




In [None]:
# outer (full) join keeps all rows in both DataFrames



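The four join types described in the comments above can be sketched with `.merge()` as follows. This is a minimal, self-contained sketch that rebuilds the two small DataFrames from above; the result names (`left_joined`, `right_joined`, and so on) are made up here for illustration.

```python
import pandas as pd

# Rebuild the two small DataFrames so this example runs on its own
x_df = pd.DataFrame({"key_x": [1, 2, 3], "val_x": ["x1", "x2", "x3"]})
y_df = pd.DataFrame({"key_y": [1, 2, 4], "val_y": ["y1", "y2", "y3"]})

# The key columns have different names, so we pass left_on and right_on
left_joined = x_df.merge(y_df, how = "left", left_on = "key_x", right_on = "key_y")
right_joined = x_df.merge(y_df, how = "right", left_on = "key_x", right_on = "key_y")
inner_joined = x_df.merge(y_df, how = "inner", left_on = "key_x", right_on = "key_y")
outer_joined = x_df.merge(y_df, how = "outer", left_on = "key_x", right_on = "key_y")

# Rows that have no match on the other side are filled in with missing values (NaN)
print(left_joined)
print(outer_joined)
```

Only keys 1 and 2 appear in both DataFrames, so the inner join has the fewest rows and the outer join has the most.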

#### The .join() method

In [None]:
# if we set the indexes of the DataFrames, we can use the .join() method instead of the .merge() method
x_df2 = x_df.set_index("key_x")

x_df2 

In [None]:
y_df2 = y_df.set_index("key_y")

y_df2

In [None]:
# Using the join method we do not need to specify the left_on and right_on arguments because the key is the index


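As a minimal sketch of the `.join()` approach, the code below rebuilds the indexed DataFrames and joins them on their indexes. The result names (`joined_left`, `joined_inner`) are made up here for illustration.

```python
import pandas as pd

# Rebuild the two small DataFrames so this example runs on its own
x_df = pd.DataFrame({"key_x": [1, 2, 3], "val_x": ["x1", "x2", "x3"]})
y_df = pd.DataFrame({"key_y": [1, 2, 4], "val_y": ["y1", "y2", "y3"]})

# Move the key columns into the index
x_df2 = x_df.set_index("key_x")
y_df2 = y_df.set_index("key_y")

# .join() matches rows using the index; the default is a left join
joined_left = x_df2.join(y_df2)

# The how argument works the same way as it does for .merge()
joined_inner = x_df2.join(y_df2, how = "inner")

print(joined_left)
print(joined_inner)
```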

#### Example: Spelling out NBA position names

As you will recall, our NBA salaries DataFrame had the different positions listed as abbreviations such as "C" and "PG". 

Often it is hard to tell what these abbreviations (or codes) mean, so a common use of joining is to join on to a table a list of longer names that give more meaning to abbreviations. 

Below we load our `nba_salaries` DataFrame along with a `nba_positions` DataFrame which has information about how each position abbreviation maps on to the position's full name.

Let's merge these DataFrames together so that our `nba_salaries` DataFrame has the full position names!



In [None]:
nba_salaries = pd.read_csv("nba_salaries_2015_16.csv")

nba_salaries.head(3)


In [None]:
nba_positions = pd.read_csv("nba_position_names.csv")
nba_positions

In [None]:
# merge the DataFrames together so each player's position is the full position name

nba_improved = nba_salaries.merge(nba_positions, 
                                  how = "left", 
                                  left_on = "POSITION", 
                                  right_on = "Position Abbreviation")

nba_improved.head(5)

In [None]:
# remove unnecessary columns using the .drop(columns = ) method
nba_improved.drop(columns = ["POSITION", "Position Abbreviation"])

![pandas](https://image.goat.com/transform/v1/attachments/product_template_additional_pictures/images/071/445/310/original/719082_01.jpg.jpeg)

## Data visualization!

Let's go through different ways to visualize data. To do this, let's look again at egg and gas prices.


In [None]:
egg_prices = pd.read_csv("monthly_egg_prices.csv", parse_dates = [0])
gas_prices = pd.read_csv("US_Gasoline_Prices_Weekly.csv", parse_dates = [0])

display(egg_prices.head(3))

gas_prices.head(3)

To start with, let's get a little more practice joining DataFrames by joining the egg and gas prices together into a single DataFrame. 

Let's do an inner join to only keep the dates where we have prices for both eggs and gas. 

In [None]:
# merge the egg and gas prices
prices = egg_prices.merge(gas_prices, 
                          how = "inner", 
                          left_on = "DATE",
                          right_on = "Week")

prices.head(3)


Let's also clean up our prices data by only keeping the columns we need, and renaming them to more meaningful names.

In [None]:
# only keep the columns we need
prices = prices[["Week", "Price", "DollarsPerGallon"]]

# rename the columns to have more meaningful names
prices = prices.rename(columns = {"Price": "Eggs", "DollarsPerGallon":"Gas"})

prices.head(3)


Now we are ready to start visualizing the data!

Let's start by creating line plots using the `plt.plot()` function!
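
Here is a minimal sketch of a line plot on a tiny made-up DataFrame. The `example_prices` values are invented for illustration and are not the class data; the column names mirror the ones used above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up prices, used only for illustration (not the class data)
example_prices = pd.DataFrame({
    "Week": pd.to_datetime(["2020-01-06", "2020-02-03", "2020-03-02", "2020-04-06"]),
    "Eggs": [1.46, 1.45, 1.53, 2.02],
})

plt.figure()

# "-o" draws a line with a circle marker at each point;
# passing the dates as the first argument puts them on the x-axis
lines = plt.plot(example_prices["Week"], example_prices["Eggs"], "-o", label = "Eggs")

# Axis labels and a title make the plot readable on its own
plt.xlabel("Date")
plt.ylabel("Price ($)")
plt.title("Example egg prices")
plt.legend()
```

If only the y-values are passed to `plt.plot()`, the x-axis shows the row positions (0, 1, 2, ...) rather than the dates.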

In [None]:
# create a line plot of egg prices, and also include a circle marker at each point 



In [None]:
# Let's have the x-axis be the actual dates



What is [wrong](https://xkcd.com/833/) with these plots???


In [None]:
# Let's make this better!







In [None]:
# Let's compare egg and gas prices on the same plot








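One way to compare two series is to call `plt.plot()` twice before the figure is shown, so both lines are drawn on the same axes. The sketch below uses made-up values (not the class data), and the `label` arguments feed the legend.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up prices, used only for illustration (not the class data)
example_prices = pd.DataFrame({
    "Week": pd.to_datetime(["2020-01-06", "2020-02-03", "2020-03-02", "2020-04-06"]),
    "Eggs": [1.46, 1.45, 1.53, 2.02],
    "Gas": [2.58, 2.46, 2.42, 1.92],
})

plt.figure()

# Each call to plt.plot() adds another line to the current axes
plt.plot(example_prices["Week"], example_prices["Eggs"], "-o", label = "Eggs")
plt.plot(example_prices["Week"], example_prices["Gas"], "-o", label = "Gas")

plt.xlabel("Date")
plt.ylabel("Price ($)")

# The legend uses the label given to each line
legend = plt.legend()
```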

In [None]:
# Side note: Can you use pandas methods to see if there are any weeks where Eggs cost more than Gas?







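One possible approach to the side note is to compare the two columns to get a Boolean mask, and then summarize or filter with that mask. The sketch below uses made-up values (not the class data), and the names `eggs_cost_more` and `weeks_eggs_cost_more` are invented for illustration.

```python
import pandas as pd

# Made-up prices, used only for illustration (not the class data)
example_prices = pd.DataFrame({
    "Week": pd.to_datetime(["2020-01-06", "2020-02-03", "2020-03-02", "2020-04-06"]),
    "Eggs": [1.46, 1.45, 1.53, 2.02],
    "Gas": [2.58, 2.46, 2.42, 1.92],
})

# Boolean mask: True for the rows where eggs cost more than gas
eggs_cost_more = example_prices["Eggs"] > example_prices["Gas"]

# .any() answers the yes/no question; .sum() counts the True values
print(eggs_cost_more.any())
print(eggs_cost_more.sum())

# Use the mask to pull out the matching rows
weeks_eggs_cost_more = example_prices[eggs_cost_more]
print(weeks_eggs_cost_more)
```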

### Histograms 

We can create histograms using the `plt.hist()` function. 
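
As a minimal sketch, the code below draws a histogram of randomly generated values. The data and the choice of 20 bins are assumptions made for illustration; with the class data you would pass a column such as the gas prices instead.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated values, used only for illustration (not the class data)
rng = np.random.default_rng(0)
example_values = rng.normal(loc = 2.5, scale = 0.5, size = 200)

plt.figure()

# bins sets the number of bars; edgecolor draws an outline around each bar
counts, bin_edges, patches = plt.hist(example_values, bins = 20, edgecolor = "black")

plt.xlabel("Value")
plt.ylabel("Count")
```

`plt.hist()` returns the count in each bin and the bin edges, which can be handy for checking what the plot shows.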


### Boxplots

We can create boxplots using the `plt.boxplot()` function. 
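
Here is a minimal sketch of a single boxplot and of side-by-side boxplots. The data are randomly generated for illustration (not the class data), and the tick labels are set with `plt.xticks()`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated values, used only for illustration (not the class data)
rng = np.random.default_rng(1)
example_gas = rng.normal(loc = 2.5, scale = 0.5, size = 100)
example_eggs = rng.normal(loc = 1.5, scale = 0.3, size = 100)

# A single boxplot
plt.figure()
single_box = plt.boxplot(example_gas)
plt.ylabel("Price ($)")

# Side-by-side boxplots: pass a list with one data set per box
plt.figure()
side_by_side = plt.boxplot([example_eggs, example_gas])

# The boxes are drawn at x positions 1, 2, ... so we label those positions
plt.xticks([1, 2], ["Eggs", "Gas"])
plt.ylabel("Price ($)")
```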


In [None]:
# boxplot of gas prices





In [None]:
# creating side-by-side boxplots by passing a list of the different data sets to compare






### Scatter plots

We can create simple scatter plots using: `plt.plot()`

For more complex scatter plots we can use: `plt.scatter()`

Let's start by looking at the simple `plt.plot()`
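
As a minimal sketch, `plt.plot()` makes a scatter plot when the format string names a marker but no line style. The values below are made up for illustration (not the class data).

```python
import matplotlib.pyplot as plt

# Made-up prices, used only for illustration (not the class data)
example_gas = [2.58, 2.46, 2.42, 1.92, 2.10]
example_eggs = [1.46, 1.45, 1.53, 2.02, 1.64]

plt.figure()

# The "." format string draws a marker at each point and no connecting line
points = plt.plot(example_gas, example_eggs, ".")

plt.xlabel("Gas price ($)")
plt.ylabel("Egg price ($)")
```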

In [None]:
# Create a basic scatter plot of Egg prices vs. Gas prices using plt.plot()






Let's now join wheat prices on to our data so we can experiment with plotting additional visual features.


In [None]:
# load the wheat prices
wheat_prices = pd.read_csv("monthly_wheat_prices.csv", parse_dates = [0])

# merge them on to the prices DataFrame
prices2 = (prices
           .merge(wheat_prices, how = "left", left_on = "Week", right_on = "DATE")
           .drop("DATE", axis = 1)
           .rename(columns = {"Price": "Wheat"})
          )

# Add a column called "after2000" which has the value
# "red" for years after 2000 and "green" for years before 2000
prices2["after2000"] = "red"
prices2.loc[12:, "after2000"] = "green"

prices2.head()

In [None]:
# Create a fancier scatter plot of Egg prices vs. Gas prices using plt.scatter()








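Here is a minimal sketch of a fancier scatter plot, where `plt.scatter()` sets the color and size of each point from other columns. The values are made up (not the class data), and building the color column from the year with `np.where()` is one possible alternative to assigning rows by position; treating the year 2000 itself as "after" is an assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up prices, used only for illustration (not the class data)
example_prices = pd.DataFrame({
    "Week": pd.to_datetime(["1997-06-02", "1998-06-01", "1999-06-07",
                            "2001-06-04", "2002-06-03"]),
    "Eggs": [0.95, 1.02, 0.91, 0.93, 1.03],
    "Gas": [1.20, 1.05, 1.11, 1.65, 1.38],
    "Wheat": [3.5, 2.8, 2.5, 2.9, 3.1],
})

# Color column based on the date: "red" from 2000 onward, "green" before 2000
example_prices["after2000"] = np.where(example_prices["Week"].dt.year >= 2000,
                                       "red", "green")

plt.figure()

points = plt.scatter(example_prices["Gas"],
                     example_prices["Eggs"],
                     c = example_prices["after2000"],   # color of each point
                     s = example_prices["Wheat"] * 20,  # size of each point
                     alpha = 0.5)                       # transparency

plt.xlabel("Gas price ($)")
plt.ylabel("Egg price ($)")
```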

### Bar plots and pie charts

We can plot *categorical data* using bar plots and pie charts. 

To create bar plots we can use: `plt.bar()`

To create pie charts we can use: `plt.pie()`
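
As a minimal sketch, the code below draws a bar plot and a pie chart from a small table of counts. The counts are made up for illustration (not the real NBA data); the column names mirror the ones used in this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up counts, used only for illustration (not the real NBA data)
example_counts = pd.DataFrame({
    "POSITION": ["C", "PF", "PG"],
    "num_players": [5, 8, 7],
})

# Bar plot: category names on the x-axis, counts as the bar heights
plt.figure()
bars = plt.bar(example_counts["POSITION"], example_counts["num_players"])
plt.xlabel("Position")
plt.ylabel("Number of players")

# Pie chart: the counts set the wedge sizes, labels name each wedge
plt.figure()
pie_parts = plt.pie(example_counts["num_players"], labels = example_counts["POSITION"])
```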


In [None]:
position_counts = (nba_salaries
                   .groupby("POSITION")
                   .agg(num_players = ("PLAYER", "count"))
                   .reset_index())
position_counts

In [None]:
# Create a bar plot of the number of basketball players at each position




In [None]:
# Create a pie chart of the number of basketball players at each position




### Subplots 

We can create subplots using: `plt.subplot(num_rows, num_cols, curr_plot_num);`
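
Here is a minimal sketch of two plots placed side by side with `plt.subplot()`. The values and the figure size are made up for illustration. Each call to `plt.subplot()` selects which panel the following plotting commands draw on.

```python
import matplotlib.pyplot as plt

# Made-up prices, used only for illustration (not the class data)
example_eggs = [1.46, 1.45, 1.53, 2.02]
example_gas = [2.58, 2.46, 2.42, 1.92]

plt.figure(figsize = (10, 4))

# 1 row, 2 columns, draw on the first panel
left_axes = plt.subplot(1, 2, 1)
plt.plot(example_eggs, "-o")
plt.title("Eggs")

# 1 row, 2 columns, draw on the second panel
right_axes = plt.subplot(1, 2, 2)
plt.plot(example_gas, "-o")
plt.title("Gas")

# Adjust the spacing so the titles and labels do not overlap
plt.tight_layout()
```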



In [None]:
# subplots









![piechart](http://i.imgur.com/wsVTukr.jpg)
