# Class 8: Series and DataFrames

Plan for today:
- Review/continuations of functions
- pandas Series
- pandas DataFrames



## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [4]:
import YData

# YData.download.download_class_code(8)   # get class code    

# YData.download.download_class_code(8, True) # get the code with the answers 

YData.download.download_data("dow.csv")
YData.download.download_data("monthly_egg_prices.csv")
YData.download.download_data("monthly_wheat_prices.csv")


The file `dow.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `monthly_egg_prices.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `monthly_wheat_prices.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


There are also similar functions to download the homework:

In [5]:
YData.download.download_homework(3)  # downloads homework 3 

If you are using Google Colab, you should install the polars and YData packages by uncommenting and running the code below.

In [6]:
# !pip install polars
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [7]:
# from google.colab import drive
# drive.mount('/content/drive')

In [8]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Review of functions

Let's review writing our own functions!


Let's start by writing a function `power3(x)` that returns a number taken to the 3rd power.

In [9]:
def power3(x):
    ...

In [10]:
# try the function out


Now let's generalize the function to a function called `power_it(x, k)` that takes x to the kth power;
i.e., it computes $x^k$

In [11]:
def power_it(x, k):
    ...

In [12]:
# try the function out


Let's modify the `power_it()` function so that the default power is k = 2, but so that the user can input any power if a second argument is given. 

In [13]:
def power_it(x, k = 2):
    ...


# try the function out
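
Here is one possible way to fill in the exercises above. It is a sketch, not necessarily the solution shown in class. Because both versions of `power_it` share a name, only the final version with the default argument is shown.

```python
def power3(x):
    """Return x raised to the 3rd power."""
    return x ** 3


def power_it(x, k=2):
    """Return x raised to the kth power; k defaults to 2."""
    return x ** k


print(power3(2))       # 8
print(power_it(3))     # 9, uses the default k = 2
print(power_it(3, 4))  # 81
```

Calling `power_it(3)` with a single argument falls back to the default `k = 2`, while a second argument overrides it.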




### Multiple return values and tuples

Sometimes we want to write a function that can return multiple values. We can do this by returning a "tuple". 

Tuples are a basic data structure in Python that is similar to a list. However, unlike lists, tuples are "immutable", meaning that once we create a tuple, we cannot modify the values in it.

We create tuples by using values in parentheses separated by commas:

`my_tuple = (10, 20, 30)`

Let's explore tuples now... 



In [14]:
# create a tuple




In [15]:
# we can access elements of the tuple using square brackets (the same as lists)


In [16]:
# unlike a list, we can't reassign values in a tuple 


In [17]:
# We extract values from tuples into regular names using "tuple unpacking"
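
The four steps above could be sketched as follows, reusing the `my_tuple` values from the text. The names `first`, `a`, `b`, and `c` are only illustrative.

```python
# create a tuple
my_tuple = (10, 20, 30)

# indexing works the same way as for lists
first = my_tuple[0]
print(first)  # 10

# reassigning an element raises a TypeError because tuples are immutable
try:
    my_tuple[0] = 99
except TypeError as err:
    print("TypeError:", err)

# tuple unpacking assigns each element to its own name
a, b, c = my_tuple
print(a, b, c)  # 10 20 30
```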




Let's create a function `power23(x)` that returns a number squared and a number cubed. 

In [18]:
# create a function that returns a value squared and cubed

def power23(x):
    
    ...


In [19]:
# try the function out

In [20]:
# we can use "tuple unpacking" to assign both outputs to different names
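
One possible version of `power23`, offered as a sketch of the exercise rather than the official answer. Separating two values with a comma in a `return` statement returns them as a tuple.

```python
def power23(x):
    """Return x squared and x cubed as a tuple."""
    return x ** 2, x ** 3


# the function returns a single tuple holding both values
result = power23(3)
print(result)  # (9, 27)

# tuple unpacking assigns each output to its own name
squared, cubed = power23(3)
print(squared, cubed)  # 9 27
```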




### Passing functions as input arguments

We can also pass functions as input arguments to other functions. Let's explore this...


In [21]:
def compute_on_my_array(stat_function):
    
    my_array = np.array([21, 44, 54, 23, 25, 32])
    
    ...
    


In [22]:
# apply the np.mean function to my_array


In [23]:
# apply the np.sum function to my_array


In [24]:
# apply power23 to my_array
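
A minimal sketch of how `compute_on_my_array` might be completed: it simply calls whatever function it is given on `my_array`. The `power23` definition is repeated here (as one possible version) so the block runs on its own.

```python
import numpy as np


def compute_on_my_array(stat_function):
    """Apply the function passed in to a fixed array and return the result."""
    my_array = np.array([21, 44, 54, 23, 25, 32])
    return stat_function(my_array)


def power23(x):
    """Return x squared and x cubed as a tuple."""
    return x ** 2, x ** 3


# note that we pass the function itself (no parentheses after its name)
mean_value = compute_on_my_array(np.mean)
total = compute_on_my_array(np.sum)

# power23 works elementwise on the array, so we get two arrays back
squares, cubes = compute_on_my_array(power23)

print(mean_value)
print(total)  # 199
print(squares)
```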


## Pandas 

A pandas Series is a one-dimensional ndarray with axis labels.

A pandas DataFrame holds table data.

Let's look at the egg and wheat price data...


In [25]:
egg_price_series = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, index_col= "DATE").squeeze() 



In [26]:
# print the type


# print the shape


# print the series



In [27]:
# get a value from the Series by an Index name using .loc



In [28]:
# get a value from the Series by index number using .iloc



In [29]:
# sort the egg prices using the .sort_values() method
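
The Series operations above can be tried on a small stand-in Series. The sketch below does not read the egg price file; instead it builds `toy_series` (a hypothetical name) by hand from the first five wheat prices printed further down in this notebook.

```python
import pandas as pd

# a small stand-in Series built by hand (first five wheat prices from this notebook)
toy_series = pd.Series(
    [167.918579, 160.937271, 156.528030, 159.467529, 149.179291],
    index=pd.to_datetime(
        ["1990-01-01", "1990-02-01", "1990-03-01", "1990-04-01", "1990-05-01"]
    ),
    name="Price",
)
toy_series.index.name = "DATE"

print(type(toy_series))
print(toy_series.shape)  # (5,)

# .loc looks a value up by its Index label
by_label = toy_series.loc[pd.Timestamp("1990-03-01")]

# .iloc looks a value up by its position (0 is the first entry)
by_position = toy_series.iloc[0]

# .sort_values() returns a new Series sorted from lowest to highest
sorted_series = toy_series.sort_values()
print(sorted_series)
```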



In [30]:
# Let's look at the wheat prices as a series
wheat_price_series = pd.read_csv("monthly_wheat_prices.csv", parse_dates=True, index_col= "DATE").squeeze() 

wheat_price_series


DATE
1990-01-01    167.918579
1990-02-01    160.937271
1990-03-01    156.528030
1990-04-01    159.467529
1990-05-01    149.179291
                 ...    
2022-07-01    321.975128
2022-08-01    323.016769
2022-09-01    346.322181
2022-10-01    353.712907
2022-11-01    344.329861
Name: Price, Length: 395, dtype: float64

In [31]:
# note that the wheat and egg price series start at different dates 




# and they are different lengths 




In [32]:
# when we add two series together, the addition aligns the Indexes!
# NaN are added when there are no matches between Indexes 






# can remove NaN with .dropna() method
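
To see the Index alignment on a small scale, here is a sketch using two short Series with made-up prices (these are NOT the real egg or wheat data). The two Series overlap on only two dates.

```python
import pandas as pd

# made-up values for illustration only
eggs = pd.Series(
    [1.0, 1.5, 2.0],
    index=pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"]),
)
wheat = pd.Series(
    [10.0, 20.0, 30.0],
    index=pd.to_datetime(["2000-02-01", "2000-03-01", "2000-04-01"]),
)

# addition matches values by Index label, not by position;
# dates that appear in only one Series become NaN
combined = eggs + wheat
print(combined)

# .dropna() keeps only the dates that were present in both Series
cleaned = combined.dropna()
print(cleaned)
```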

In [33]:
# we can turn the index back into a column using .reset_index()
# this returns a DataFrame! 






# get the total spent...
# np.sum(monthly_spend_df["Price"])
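
A sketch of the `.reset_index()` step, assuming a Series named `Price` with an Index named `DATE` as in the price files. The values are made up, and `monthly_spend` is a hypothetical name for the summed Series.

```python
import numpy as np
import pandas as pd

# made-up values for illustration only
monthly_spend = pd.Series(
    [11.5, 22.0],
    index=pd.to_datetime(["2000-02-01", "2000-03-01"]),
    name="Price",
)
monthly_spend.index.name = "DATE"

# the Index becomes a regular column and the result is a DataFrame
monthly_spend_df = monthly_spend.reset_index()
print(type(monthly_spend_df))
print(monthly_spend_df.columns.tolist())  # ['DATE', 'Price']

# get the total spent
total_spent = np.sum(monthly_spend_df["Price"])
print(total_spent)  # 33.5
```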

## DataFrames!

The ability to manipulate data in tables (DataFrames) is one of the most useful skills in Data Science. 

Pandas is the most popular package in Python for manipulating data tables so we will use this package for manipulating tables in this class. The syntax for Pandas can be a little tricky, so try to be patient if you run into errors, and as always, there should be plenty of help available at office hours and on Ed. 

As an example, let's look at data on the closing price of the [Dow Jones Industrial Average](https://www.marketwatch.com/investing/index/djia), which is a price-weighted index of 30 large, publicly traded US companies.

The code below loads the DOW data into a Pandas DataFrame and displays the first 5 rows using the `head()` method. 


In [34]:
dow = pd.read_csv("dow.csv", parse_dates=True)  # parsing the dates didn't work

dow = dow.set_index("Date")

dow.head()

| Date | Year | Month | Day | Open | High | Low | Close |
|---|---|---|---|---|---|---|---|
| 1/25/23 | 2023 | 1 | Wednesday | 33538.36 | 33773.09 | 33273.21 | 33743.84 |
| 1/24/23 | 2023 | 1 | Tuesday | 33444.72 | 33782.92 | 33310.56 | 33733.96 |
| 1/23/23 | 2023 | 1 | Monday | 33439.56 | 33782.88 | 33316.25 | 33629.56 |
| 1/20/23 | 2023 | 1 | Friday | 33073.46 | 33381.95 | 32948.93 | 33375.49 |
| 1/19/23 | 2023 | 1 | Thursday | 33171.35 | 33227.49 | 32982.05 | 33044.56 |


In [35]:
# The head() method returns the first 5 rows. 
# Let's use the tail() method to get the last 5 rows.
# From looking at the output, can you tell what year the data goes back to? 



In [36]:
# get the number of rows and columns in a DataFrame using the shape property



In [37]:
# get the types of all the columns using .dtypes



In [38]:
# get the names of all the columns using .columns



# we can convert these names to a NumPy array using the .to_numpy() method


In [39]:
# get descriptive statistics on DataFrame using the .describe() method



# round() the values, or can convert them to ints using astype("int")
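
The inspection methods above can be tried without the CSV file. The sketch below rebuilds only the five rows displayed by `dow.head()` into a DataFrame called `dow_head` (a hypothetical name), so the shape and statistics describe those five rows, not the full data set.

```python
import pandas as pd

# only the five rows shown by dow.head() above, typed in by hand
dow_head = pd.DataFrame(
    {
        "Year": [2023, 2023, 2023, 2023, 2023],
        "Month": [1, 1, 1, 1, 1],
        "Day": ["Wednesday", "Tuesday", "Monday", "Friday", "Thursday"],
        "Open": [33538.36, 33444.72, 33439.56, 33073.46, 33171.35],
        "High": [33773.09, 33782.92, 33782.88, 33381.95, 33227.49],
        "Low": [33273.21, 33310.56, 33316.25, 32948.93, 32982.05],
        "Close": [33743.84, 33733.96, 33629.56, 33375.49, 33044.56],
    },
    index=pd.Index(
        ["1/25/23", "1/24/23", "1/23/23", "1/20/23", "1/19/23"], name="Date"
    ),
)

print(dow_head.tail(2))  # the last 2 rows
print(dow_head.shape)    # (5, 7)
print(dow_head.dtypes)   # the type of each column

# column names as a NumPy array
column_names = dow_head.columns.to_numpy()
print(column_names)

# descriptive statistics for the numeric columns, rounded
summary = dow_head.describe().round()
print(summary)
```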




### Selecting columns from a DataFrame

We can select columns from a DataFrame using the square brackets; e.g., `my_df["my_col"]`

If we'd like to select multiple columns we can pass a list; e.g., `my_df[["col1", "col2"]]`


In [40]:
# Get just the DOW close price




In [41]:
# we can also get a single column using the .col_name 




In [42]:
# Get both the open and close price



# what is the type of close_price? (use type() and .dtypes)
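
A sketch of column selection, again using only the five rows displayed by `dow.head()` (stored in the hypothetical `dow_head`). A single column comes back as a Series, while a list of columns comes back as a DataFrame.

```python
import pandas as pd

# Open and Close values from the five rows shown by dow.head()
dow_head = pd.DataFrame(
    {
        "Open": [33538.36, 33444.72, 33439.56, 33073.46, 33171.35],
        "Close": [33743.84, 33733.96, 33629.56, 33375.49, 33044.56],
    },
    index=pd.Index(
        ["1/25/23", "1/24/23", "1/23/23", "1/20/23", "1/19/23"], name="Date"
    ),
)

# a single column, selected with square brackets, is a Series
close_price = dow_head["Close"]

# attribute access returns the same Series (only works for simple column names)
close_again = dow_head.Close

# a list of column names returns a DataFrame
open_close = dow_head[["Open", "Close"]]

print(type(close_price))
print(close_price.dtype)
print(type(open_close))
print(open_close.dtypes)
```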



### Getting a subset of rows from a DataFrame

Similar to pandas Series, we can get particular rows from a DataFrame using:

- `.loc`:  Get rows by Index values - and by Boolean masks
- `.iloc`:  Get rows by their index number



In [43]:
# Extract a row based on the Index name "1/25/23"


In [44]:
# Extract a row based on the row number (get row 0)


In [45]:
# We can get multiple rows that meet particular conditions using Boolean masking




In [46]:
# extract the 2022 values using our Boolean mask



In [47]:
# Can you get the mean DOW close value in 2022? 
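
Here is a sketch of row selection and Boolean masking on a tiny DataFrame. The values in `toy_dow` are made up for illustration (they are NOT real DOW prices); only the column names `Year` and `Close` follow the real data.

```python
import pandas as pd

# made-up values for illustration only
toy_dow = pd.DataFrame(
    {
        "Year": [2023, 2023, 2022, 2022],
        "Close": [110.0, 105.0, 100.0, 90.0],
    },
    index=pd.Index(["1/4/23", "1/3/23", "12/30/22", "12/29/22"], name="Date"),
)

# a row by its Index label
row_by_label = toy_dow.loc["1/4/23"]

# a row by its position (row 0 is the first row)
row_by_number = toy_dow.iloc[0]

# a Boolean mask is a Series of True/False values, one per row
mask_2022 = toy_dow["Year"] == 2022

# passing the mask to .loc keeps only the rows where it is True
toy_dow_2022 = toy_dow.loc[mask_2022]

# the mean close value in 2022
mean_close_2022 = toy_dow_2022["Close"].mean()
print(mean_close_2022)  # 95.0
```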






### Sorting values in a DataFrame

We can sort values in a DataFrame using `.sort_values("col_name")`

We can sort from highest to lowest by setting the argument `ascending = False`


In [48]:
# Sort the data by the Close value



In [49]:
# What is the highest the DOW has been? 
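
A sketch of sorting with `.sort_values()`, using only the Close values from the five rows displayed by `dow.head()`. The "highest" value found here is therefore the highest among those five rows, not the all-time high in the full data.

```python
import pandas as pd

# Close values from the five rows shown by dow.head()
dow_head = pd.DataFrame(
    {"Close": [33743.84, 33733.96, 33629.56, 33375.49, 33044.56]},
    index=pd.Index(
        ["1/25/23", "1/24/23", "1/23/23", "1/20/23", "1/19/23"], name="Date"
    ),
)

# sort from lowest to highest (the default)
sorted_low_to_high = dow_head.sort_values("Close")

# sort from highest to lowest
sorted_high_to_low = dow_head.sort_values("Close", ascending=False)

# the first row of the descending sort holds the highest close
highest_close = sorted_high_to_low["Close"].iloc[0]
highest_date = sorted_high_to_low.index[0]
print(highest_date, highest_close)  # 1/25/23 33743.84
```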



### Adding new columns to a DataFrame

We can add a column to a DataFrame using square brackets. For example: 

- `my_df["new col"] = my_df["col1"] + my_df["col2"]`.




Percent change is defined as: $100 \times \frac{\text{final} - \text{initial}}{\text{initial}}$

Can you add a "Percent change" column to the dow2 data (which is a copy of the dow data) comparing closing and opening prices?  What is the biggest percent change in the dow? 

In [50]:
# copy the data to dow2



# add percent change column



# sort the data



In [51]:
# sort the data from largest to smallest


# This is actually not historically correct for older dates. 
# See if you can figure out how to calculate the actual largest percent changes. 
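
One way the percent change exercise might look, sketched on the five rows displayed by `dow.head()`, taking the Open price as the initial value and the Close price as the final value. Because only five rows are used, the "biggest" change below is the biggest among those rows only.

```python
import pandas as pd

# Open and Close values from the five rows shown by dow.head()
dow_head = pd.DataFrame(
    {
        "Open": [33538.36, 33444.72, 33439.56, 33073.46, 33171.35],
        "Close": [33743.84, 33733.96, 33629.56, 33375.49, 33044.56],
    },
    index=pd.Index(
        ["1/25/23", "1/24/23", "1/23/23", "1/20/23", "1/19/23"], name="Date"
    ),
)

# copy the data so the original DataFrame is left unchanged
dow2 = dow_head.copy()

# percent change = 100 * (final - initial) / initial
dow2["Percent change"] = 100 * (dow2["Close"] - dow2["Open"]) / dow2["Open"]

# sort from largest to smallest percent change
dow2_sorted = dow2.sort_values("Percent change", ascending=False)
print(dow2_sorted)
```

The arithmetic is applied to whole columns at once, so no loop over the rows is needed.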

### Getting aggregate statistics by group

We can get aggregate statistics by group using the `groupby()` and `agg()` methods with the following syntax:

`my_df.groupby("col_name").agg("agg_function_name")`

Can you get the max values of the DOW each year? 


In [52]:
# What were the max values of the DOW each year? 
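
A sketch of the `groupby()` and `agg()` pattern on a tiny DataFrame. The values in `toy_dow` are made up for illustration (they are NOT real DOW prices); only the column names follow the real data.

```python
import pandas as pd

# made-up values for illustration only
toy_dow = pd.DataFrame(
    {
        "Year": [2023, 2023, 2022, 2022],
        "Close": [110.0, 105.0, 100.0, 90.0],
    }
)

# group the rows by Year, then take the max of each remaining column
yearly_max = toy_dow.groupby("Year").agg("max")
print(yearly_max)
```

The grouping column (`Year`) becomes the Index of the result.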



There are several ways to get multiple statistics by group. Perhaps the most useful way is to use the syntax:

<pre>
my_df.groupby("group_col_name").agg(
   new_col1 = ('col_name', 'statistic_name1'),
   new_col2 = ('col_name', 'statistic_name2'),
   new_col3 = ('col_name', 'statistic_name3')
)
</pre>


Let's create a DataFrame that has the number of trading days, the minimum and the maximum DOW value for each year. 
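
The named-aggregation syntax above could be sketched as follows. The values in `toy_dow` are made up for illustration (NOT real DOW prices), and the new column names `n_days`, `min_close`, and `max_close` are hypothetical choices.

```python
import pandas as pd

# made-up values for illustration only
toy_dow = pd.DataFrame(
    {
        "Year": [2023, 2023, 2022, 2022],
        "Close": [110.0, 105.0, 100.0, 90.0],
    }
)

# each keyword names a new column; the tuple gives (source column, statistic)
yearly_stats = toy_dow.groupby("Year").agg(
    n_days=("Close", "count"),
    min_close=("Close", "min"),
    max_close=("Close", "max"),
)
print(yearly_stats)
```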


We will continue with pandas next class...

![pandas](https://image.goat.com/transform/v1/attachments/product_template_additional_pictures/images/071/445/310/original/719082_01.jpg.jpeg)