# Class 8: Series and DataFrames

Plan for today:
- Review/continuations of functions
- pandas Series
- pandas DataFrames



## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(8)   # get class code    

# YData.download.download_class_code(8, True) # get the code with the answers 

YData.download.download_data("dow.csv")
YData.download.download_data("monthly_egg_prices.csv")
YData.download.download_data("monthly_wheat_prices.csv")


There are also similar functions to download the homework:

In [None]:
YData.download.download_homework(3)  # downloads homework 3

If you are using Google Colab, you should install the polars and YData packages by uncommenting and running the code below.

In [None]:
# !pip install polars
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Review of functions

Let's review writing our own functions!


Let's start by writing a function `power3(x)` that returns a number taken to the 3rd power.

In [None]:
def power3(x):
    return x**3

In [None]:
power3(2)

Now let's generalize the function to a function called `power_it(x, k)` that takes x to the kth power;
i.e., it computes $x^k$

In [None]:
def power_it(x, k):
    return x**k

In [None]:
power_it(2, 8)

Let's modify the `power_it()` function so that the default power is k = 2, but so that the user can input any power if a second argument is given. 

In [None]:
def power_it(x, k = 2):
    return x**k


print(power_it(2))
print(power_it(2, 8))
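
As a small sketch of how default arguments interact with keyword arguments, the cell below redefines `power_it()` so it is self-contained and calls it a few different ways. The specific input values are chosen only for illustration.

```python
def power_it(x, k=2):
    """Raise x to the kth power (k defaults to 2)."""
    return x**k

# positional and keyword calls are interchangeable
print(power_it(3))         # uses the default k = 2
print(power_it(3, k=4))    # overrides the default by name
print(power_it(k=4, x=3))  # keyword arguments can be given in any order
```

Naming the argument (`k=4`) tends to make calls easier to read when a function has several optional parameters.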

### Multiple return values and tuples

Sometimes we want to write a function that can return multiple values. We can do this by returning a "tuple". 

A tuple is a basic data structure in Python that is similar to a list. However, unlike lists, tuples are "immutable", meaning that once we create a tuple, we cannot modify the values in it.

We create tuples by using values in parentheses separated by commas:

`my_tuple = (10, 20, 30)`

Let's explore tuples now... 



In [None]:
my_tuple = (10, 20, 30)

my_tuple


In [None]:
# we can access elements of the tuple using square brackets (the same as lists)
my_tuple[1]

In [None]:
# unlike a list, we can't reassign values in a tuple (this line raises a TypeError)
my_tuple[1] = 50

In [None]:
# We extract values from tuples into regular names using "tuple unpacking"

val1, val2, val3 = my_tuple


val3
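
One way to see immutability without stopping the notebook is to catch the error. The sketch below assumes the same `my_tuple` values as above; `new_tuple` is a name introduced here purely for illustration.

```python
my_tuple = (10, 20, 30)

try:
    my_tuple[1] = 50          # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# to "change" a value, build a new tuple from pieces of the old one
new_tuple = my_tuple[:1] + (50,) + my_tuple[2:]

print(my_tuple)    # the original tuple is unchanged
print(new_tuple)
```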

Let's create a function `power23(x)` that returns a number squared and a number cubed. 

In [None]:
# create a function that returns a value squared and cubed

def power23(x):
    
    return (x**2, x**3)


In [None]:
power23(2)

In [None]:
# we can use "tuple unpacking" to assign both outputs to different names
squared, cubed = power23(2)  

print(squared)
print(cubed)

### Passing functions as input arguments

We can also pass functions as input arguments to other functions. Let's explore this...


In [None]:
def compute_on_my_array(stat_function):
    
    my_array = np.array([21, 44, 54, 23, 25, 32])
    
    calculated_val = stat_function(my_array)
    
    return calculated_val
    


In [None]:
# apply the np.mean function to my_array
compute_on_my_array(np.mean)

In [None]:
# apply the np.sum function to my_array
compute_on_my_array(np.sum)

In [None]:
# apply power23 to my_array
compute_on_my_array(power23)
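
The function we pass in does not have to come from NumPy. As a rough sketch, the cell below passes in a function we write ourselves; `value_range` is a hypothetical helper that is not part of the class code.

```python
import numpy as np

def compute_on_my_array(stat_function):
    my_array = np.array([21, 44, 54, 23, 25, 32])
    return stat_function(my_array)

# hypothetical helper: the largest value minus the smallest value
def value_range(values):
    return np.max(values) - np.min(values)

print(compute_on_my_array(value_range))

# a lambda works too, since it is just a function without a name
print(compute_on_my_array(lambda arr: arr.max() - arr.min()))
```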

## Pandas 

A pandas Series is a one-dimensional ndarray with axis labels.

A pandas DataFrame is a table of data with labeled rows and columns.
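
To make these two definitions concrete, here is a minimal sketch that builds each object by hand. The prices and labels are made up for illustration and are not taken from the class data files.

```python
import pandas as pd

# a Series: values plus an Index of labels (made-up prices)
prices = pd.Series([1.50, 1.75, 1.60], index=["Jan", "Feb", "Mar"], name="Price")

print(prices.loc["Feb"])   # look up a value by its label
print(prices.iloc[0])      # look up a value by its position

# a DataFrame: a table whose columns share the same Index
table = pd.DataFrame(
    {"Eggs": [1.50, 1.75, 1.60], "Wheat": [4.0, 4.2, 4.1]},
    index=["Jan", "Feb", "Mar"],
)

print(type(table["Eggs"]))  # a single column of a DataFrame is a Series
print(table.shape)          # (number of rows, number of columns)
```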

Let's look at the egg and wheat price data...


In [None]:
egg_price_series = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, index_col= "DATE").squeeze() 

# print the type
print(type(egg_price_series))

# print the shape
print(egg_price_series.shape)

# print the series
egg_price_series


In [None]:
# get a value from the Series by an Index name using .loc
egg_price_series.loc["1980-01-01"]

In [None]:
# get a value from the Series by index number using .iloc
egg_price_series.iloc[0]

In [None]:
# sort the egg prices using the .sort_values() method
egg_price_series.sort_values()

In [None]:
# Let's look at the wheat prices as a series
wheat_price_series = pd.read_csv("monthly_wheat_prices.csv", parse_dates=True, index_col= "DATE").squeeze() 

wheat_price_series


In [None]:
# note that the wheat and egg price series start at different dates 
print(wheat_price_series.index[0])
print(egg_price_series.index[0])

# and they are different lengths 
print(wheat_price_series.shape)
print(egg_price_series.shape)


In [None]:
# when we add two Series together, the addition aligns the Indexes!
# NaNs appear where an Index value is present in only one of the Series 

monthly_spent = egg_price_series + wheat_price_series/2000

monthly_spent      # can remove NaN with .dropna() method
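
Index alignment can be easier to see on a tiny example. The sketch below uses two made-up Series (`a` and `b` are names chosen for illustration) whose Indexes only partly overlap.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

total = a + b            # values are matched by Index label, not by position

print(total)             # "x" has no match in b, so its result is NaN
print(total.dropna())    # drop the entries that had no match
```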

In [None]:
# we can turn the index back into a column using .reset_index()
# this returns a DataFrame! 

monthly_spend_df = monthly_spent.reset_index()

monthly_spend_df


# get the total spent...
# np.sum(monthly_spend_df["Price"])

## DataFrames!

The ability to manipulate data in tables (DataFrames) is one of the most useful skills in Data Science. 

Pandas is the most popular package in Python for manipulating data tables so we will use this package for manipulating tables in this class. The syntax for Pandas can be a little tricky, so try to be patient if you run into errors, and as always, there should be plenty of help available at office hours and on Ed. 

As an example, let's look at data on the closing price of the [Dow Jones Industrial Average](https://www.marketwatch.com/investing/index/djia), which is an index based on the stock prices of 30 large, prominent US corporations.

The code below loads the DOW data into a Pandas DataFrame and displays the first 5 rows using the `head()` method. 


In [None]:
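# note: the dates are not parsed here; parse_dates=True only tries to parse the index,
# and no index_col is given, so "Date" stays a column of strings
dow = pd.read_csv("dow.csv", parse_dates=True)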

dow = dow.set_index("Date")

dow.head()

In [None]:
# The head() method returns the first 5 rows. 
# Let's use the tail() method to get the last 5 rows.
# From looking at the output, can you tell how far back the data goes? 

dow.tail()

In [None]:
# get the number of rows and columns in a DataFrame using the shape property
dow.shape

In [None]:
# get the types of all the columns using .dtypes
dow.dtypes

In [None]:
# get the names of all the columns using .columns
print(dow.columns)

# we can convert these names to a NumPy array using the .to_numpy() method
dow.columns.to_numpy()

In [None]:
# get descriptive statistics on DataFrame using the .describe() method

dow.describe().round()   # round() the values, or can convert them to ints using astype("int")

### Selecting columns from a DataFrame

We can select columns from a DataFrame using the square brackets; e.g., `my_df["my_col"]`

If we'd like to select multiple columns we can pass a list; e.g., `my_df[["col1", "col2"]]`
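
The type of the result depends on which form we use. As a quick sketch on a made-up DataFrame (the column names and values are placeholders):

```python
import pandas as pd

my_df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6], "col3": [7, 8, 9]})

one_col = my_df["col1"]             # a string gives back a Series
two_cols = my_df[["col1", "col2"]]  # a list of names gives back a DataFrame
also_df = my_df[["col1"]]           # a one-item list still gives a DataFrame

print(type(one_col))
print(type(two_cols))
print(type(also_df))
```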


In [None]:
# Get just the DOW close price

close_price = dow["Close"]

close_price.head()  # what is the type of close_price? (use type() and .dtype)


In [None]:
# we can also get a single column using attribute syntax: my_df.col_name 

close_price2 = dow.Close

close_price2.head()

In [None]:
# Get both the open and close price
open_close_price = dow[["Open", "Close"]]

open_close_price # what is the type of open_close_price? (use type() and .dtypes)

### Getting a subset of rows from a DataFrame

Similar to pandas Series, we can get particular rows from a DataFrame using:

- `.loc`:  Get rows by Index values - and by Boolean masks
- `.iloc`:  Get rows by their index number
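
The two accessors can be compared side by side on a small table. This is only a sketch: the DataFrame below is made up, with letters standing in for the date labels used in the DOW data.

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [2021, 2022, 2022], "Close": [100.0, 110.0, 90.0]},
    index=["a", "b", "c"],
)

row_b = df.loc["b"]        # by Index label
first = df.iloc[0]         # by position (row 0)

mask = df["Year"] == 2022  # a Boolean Series, one True/False per row
in_2022 = df.loc[mask]     # keep only the rows where the mask is True

print(in_2022)
print(in_2022["Close"].mean())
```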



In [None]:
# Extract a row based on the Index name "1/25/23"
dow.loc["1/25/23"]

In [None]:
# Extract a row based on the row number (get row 0)
dow.iloc[0]

In [None]:
# We can get multiple rows that meet particular conditions using Boolean masking

booleans_in_2022 = dow["Year"] == 2022

booleans_in_2022

In [None]:
# extract the 2022 values using our Boolean mask
dow.loc[booleans_in_2022]   # actually works even without the .loc

In [None]:
# Can you get the mean DOW close value in 2022? 
data_2022 = dow[dow.Year == 2022]

print(data_2022["Close"].mean())   # using the Series mean() function

np.mean(data_2022["Close"])  # can also use np.mean()



### Sorting values in a DataFrame

We can sort values in a DataFrame using `.sort_values("col_name")`

We can sort from highest to lowest by setting the argument `ascending = False`
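
Note that `.sort_values()` returns a new, sorted DataFrame and leaves the original alone. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Close": [110.0, 90.0, 100.0]}, index=["a", "b", "c"])

lowest_first = df.sort_values("Close")                    # ascending is the default
highest_first = df.sort_values("Close", ascending=False)  # largest values first

print(lowest_first)
print(highest_first)
print(df)   # the original row order is unchanged
```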


In [None]:
# Sort the data by the Close value
dow.sort_values("Close").head()

In [None]:
# What is the highest the DOW has been? 
dow.sort_values("Close", ascending = False).head()

### Adding new columns to a DataFrame

We can add a column to a DataFrame using square brackets. For example: 

- `my_df["new col"] = my_df["col1"] + my_df["col2"]`.




Percent change is defined as: $100 \times \frac{\text{final} - \text{initial}}{\text{initial}}$
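
As a quick check of the formula, here is a small worked example on plain numbers. The function name `percent_change` and the values are made up for illustration.

```python
def percent_change(initial, final):
    """Percent change from an initial value to a final value."""
    return 100 * (final - initial) / initial

print(percent_change(200, 210))   # 5.0  (a rise of 10 on a base of 200)
print(percent_change(200, 190))   # -5.0 (a fall of 10 on a base of 200)
```

The same arithmetic works element by element on pandas Series, which is why it can be applied to whole columns at once.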

Can you add a "Percent change" column to the dow2 data (which is a copy of the dow data) comparing closing and opening prices?  What is the biggest percent change in the DOW? 

In [None]:
# copy the data to dow2
dow2 = dow.copy()

# add percent change column
dow2["Percent change"] = 100 * (dow2["Close"] - dow2["Open"])/dow2["Open"]

# sort the data
dow2.sort_values("Percent change").head()

In [None]:
# sort the data from largest to smallest
dow2.sort_values("Percent change", ascending = False).head() 

# This is actually not historically correct for older dates. 
# See if you can figure out how to calculate the actual largest percent changes. 

### Getting aggregate statistics by group

We can get aggregate statistics by group using the `groupby()` and `agg()` methods with the following syntax:

`my_df.groupby("col_name").agg("agg_function_name")`

Can you get the max values of the DOW each year? 


In [None]:
# What were the max values of the DOW each year? 

dow[["Year", "Close"]].groupby("Year").agg("max")


There are several ways to get multiple statistics by group. Perhaps the most useful way is to use the syntax:

<pre>
my_df.groupby("group_col_name").agg(
   new_col1 = ('col_name', 'statistic_name1'),
   new_col2 = ('col_name', 'statistic_name2'),
   new_col3 = ('col_name', 'statistic_name3')
)
</pre>
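
Here is a minimal sketch of this syntax on a made-up table, small enough to check the results by eye. The output column names (`n_days`, `min_close`, `max_close`) are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2021, 2021, 2022, 2022, 2022],
    "Close": [100.0, 120.0, 90.0, 110.0, 95.0],
})

summary = df.groupby("Year").agg(
    n_days=("Close", "count"),    # number of rows in each group
    min_close=("Close", "min"),   # smallest Close in each group
    max_close=("Close", "max"),   # largest Close in each group
)

print(summary)
```

Each keyword becomes a column in the result, and the grouping column (`Year`) becomes the Index.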


Let's create a DataFrame that has the number of trading days, the minimum and the maximum DOW value for each year. 


In [None]:
dow.groupby('Year').agg(
    countClose = ('Close', 'count'),
    minClose = ('Close', 'min'),
    maxClose=('Close', 'max')
)

We will continue with pandas next class...

![pandas](https://image.goat.com/transform/v1/attachments/product_template_additional_pictures/images/071/445/310/original/719082_01.jpg.jpeg)