# Class 7: Functions

Plan for today:
- Review important topics: for loops and Boolean indexing
- Writing functions
- If there is time: pandas Series and DataFrames



## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [9]:
import YData

# YData.download.download_class_code(7)   # get class 7 code    

# YData.download.download_class_code(7, True) # get the code with the answers 

YData.download.download_data("dow.csv")
YData.download.download_data("monthly_egg_prices.csv")
YData.download.download_data("monthly_wheat_prices.csv")


The file `dow.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `monthly_egg_prices.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.
The file `monthly_wheat_prices.csv` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


There are also similar functions to download the homework:

In [10]:
YData.download.download_homework(3)  # downloads the third homework 

The file `homework_03.ipynb` already exists.
If you would like to download a new copy of the file, please rename the existing copy of the file.


If you are using Colab, you should install the polars and YData packages by uncommenting and running the code below.

In [11]:
# !pip install polars
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [12]:
# from google.colab import drive
# drive.mount('/content/drive')

In [13]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Warm up review: lists and for loops

The code below loads the monthly price of a ton of wheat and a dozen eggs. 

Suppose someone bought a dozen eggs and a pound of wheat each month since Jan 1st, 1990 (which is when the data starts). Please answer the following questions:

1. Use a for loop to create a list which has how much they spent each month (i.e., has one value for each month since 1990). 

2. Calculate the total amount of money that was spent. 

Hint: A ton is 2,000 pounds. 
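
The looping pattern these questions call for can be sketched on a few made-up prices. The values and variable names below are illustrative assumptions, not the class data. Each wheat price is divided by 2,000 to convert dollars per ton into dollars per pound.

```python
# Made-up prices for three months (assumed values, not the class data)
toy_egg_prices = [1.0, 2.5, 1.5]          # dollars per dozen eggs
toy_wheat_prices = [200.0, 400.0, 300.0]  # dollars per ton of wheat

POUNDS_PER_TON = 2000

monthly_spending = []        # one value per month
total_spent = 0              # running total over all months
wheat_spent_cheap_eggs = 0   # wheat spending in months where eggs cost under $2

# zip() pairs up the egg and wheat price for each month
for egg_price, wheat_price in zip(toy_egg_prices, toy_wheat_prices):
    wheat_per_pound = wheat_price / POUNDS_PER_TON
    month_cost = egg_price + wheat_per_pound

    monthly_spending.append(month_cost)
    total_spent = total_spent + month_cost

    if egg_price < 2:
        wheat_spent_cheap_eggs = wheat_spent_cheap_eggs + wheat_per_pound

print(monthly_spending)
print(total_spent)
print(wheat_spent_cheap_eggs)
```

The same loop body should carry over to the `egg_prices` and `wheat_prices` lists created in the next cell.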


In [14]:
# load the data
import datetime

egg_prices_df = pd.read_csv("monthly_egg_prices.csv", parse_dates=True) 
egg_prices_df["DATE"] = pd.to_datetime(egg_prices_df['DATE']) 
egg_prices_df = egg_prices_df.set_index("DATE")

wheat_prices_df = pd.read_csv("monthly_wheat_prices.csv", parse_dates=True) 
wheat_prices_df["DATE"] = pd.to_datetime(wheat_prices_df['DATE']) 
wheat_prices_df = wheat_prices_df.set_index("DATE")

prices_df = egg_prices_df.join(wheat_prices_df, lsuffix = "_egg", rsuffix = "_wheat").dropna()


# create lists with the egg and wheat prices
egg_prices = prices_df["Price_egg"].to_list()
wheat_prices = prices_df["Price_wheat"].to_list()

print(egg_prices[0:5])
print(wheat_prices[0:5])

prices_df.head()

[1.223, 1.041, 1.111, 1.092, 0.94]
[167.9185791, 160.9372711, 156.5280304, 159.4675293, 149.1792908]


DATE,Price_egg,Price_wheat
1990-01-01,1.223,167.918579
1990-02-01,1.041,160.937271
1990-03-01,1.111,156.52803
1990-04-01,1.092,159.467529
1990-05-01,0.94,149.179291


In [15]:
# Question 1:
# Use a for loop to calculate the total spent on a dozen eggs and a pound of wheat each month
# (i.e., for 12 eggs and a pound of wheat)








In [16]:
# Question 2: 
# Use a for loop to create a *list* that has how much money would be spent each month 











In [17]:
# Question 3: 
# Suppose someone bought a pound of wheat only in months where eggs cost less than $2.
# What is the total they would have spent on wheat? 
# Again use a for loop to solve this...










## Very quick review of array computations

Often we want to process data that is all of the same type. For example, we might want to do processing on a data set of numbers (e.g., if we were just analyzing salary data). 

When we have data that is all of the same type, there are faster ways to process data than using a list. In Python, the `numpy` package offers ways to store and process data that is all of the same type using a data structure called an `ndarray`. There are also functions that operate on `ndarray` objects that can do computations very efficiently. 

Let's explore this now!
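
As a quick preview of what the cells below work through, here is a minimal sketch of array computations on made-up numbers. The array name and values are assumptions chosen for illustration. Arithmetic and comparisons apply to every element at once, and a Boolean array can be used to pick out elements.

```python
import numpy as np

# Made-up salary data in thousands of dollars (assumed values)
salaries = np.array([40, 55, 70, 90])

doubled = salaries * 2            # arithmetic is applied to every element
is_high = salaries > 60           # a comparison returns a Boolean array
high_salaries = salaries[is_high] # Boolean indexing keeps the True positions

# the comparison and the indexing can also be done in one step
total_high = salaries[salaries > 60].sum()

print(doubled)
print(is_high)
print(high_salaries)
print(total_high)
```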

In [18]:
# import the numpy package
import numpy as np


In [19]:
# create an ndarray of numbers
a_list = [2, 3, 4, 5]
an_array = np.array(a_list)  # create an ndarray from a list

print(an_array)           # print out the array
print(an_array.dtype)     # print out the array's dtype
print(an_array.shape)     # print out the array's shape
print(an_array[0])        # get the first item of the array

[2 3 4 5]
int64
(4,)
2


In [20]:
# create a boolean array
boolean_array = np.array([True, True, False])

print(boolean_array) # print out the array
print(boolean_array.dtype) # print out the array's dtype
print(boolean_array.astype("int")) # convert the array from Booleans to integers

[ True  True False]
bool
[1 1 0]


In [21]:
# use a Boolean array to get elements from another array
an_array = np.array([1, 2, 3, 4, 5])
boolean_array = np.array([True, True, False, True, True])

# get only the elements that are True


In [22]:
# create a boolean array for all values greater than 3


In [23]:
# use boolean array to return the actual values greater than 3


In [24]:
# return the actual values greater than 3 in one step


In [25]:
# Let's again get the total price of wheat in months where eggs were less than $2

egg_price_array = np.array(egg_prices)
wheat_price_array = np.array(wheat_prices)/2000







## Functions!

We have already used many functions in this class that are built into Python or are imported from different modules/packages. 

Let's now write some new functions ourselves! 
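
As a rough sketch of the pieces involved, the hypothetical function `triple` below (not part of the class code) shows a definition, a return value, and local scope: names created inside the function do not change names outside it.

```python
def triple(x):
    # result exists only while the function is running (local scope)
    result = 3 * x
    return result

print(triple(7))      # 21

x = 17
y = triple(x)
print(x, y)           # 17 51 (calling the function did not change x)

x = triple(x)         # x changes only because we reassign it
print(x)              # 51

print(triple("ab"))   # ababab (multiplying a string repeats it)
```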


In [26]:
def double(x):
    ...

In [27]:
# apply it to 7


In [28]:
# apply it to 15/3


In [29]:
# create a number


In [30]:
# apply it to a number


In [31]:
# apply it to a number divided by 8


In [32]:
# will this work applied to an ndarray?


In [33]:
# will this work applied to a string? 


In [34]:
# will this work applied to a Boolean


In [35]:
#"local scope"


In [36]:
# set x to 17


In [37]:
# use the double() function


In [38]:
# what is the value of x


In [39]:
# what if we double x? 


In [40]:
# now what is the value of x? 


### Discussion Question

What does the following function do? 

In [41]:
#What does this function do?
def percents(values):
    return np.round(100 * values / sum(values), 2)

In [42]:
# apply the function to an array


In [43]:
# apply the function to another array


In [44]:
# Can have multiple inputs
def percents(values, places):
    return np.round(values / sum(values) * 100, places)

In [45]:
# try it out setting the second argument
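
One way to try this out, sketched on made-up counts (the array below is an assumption for illustration). The function is repeated so that the block runs on its own.

```python
import numpy as np

def percents(values, places):
    # each value as a percentage of the total, rounded to `places` decimals
    return np.round(values / sum(values) * 100, places)

counts = np.array([1, 2, 3])   # made-up counts

one_decimal = percents(counts, 1)   # roughly 16.7, 33.3 and 50 percent
no_decimals = percents(counts, 0)   # roughly 17, 33 and 50 percent

print(one_decimal)
print(no_decimals)
```

Because this version has two parameters and no default values, calling it with only one argument raises a `TypeError`.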


## Function extras: docstrings

When writing functions that will be used by other people (or your future self) it is important to write some documentation describing how your function works. In Python, this type of documentation is called a "docstring". The text in a docstring is in triple quotes which allows for multi-line comments.

There are a number of [conventions](https://peps.python.org/pep-0257/) for writing docstrings, including: 

- The docstring's first line should begin with a capital letter and end with a period.
- The first line should be a short description.
- If there are more lines in the documentation string, the second line should be blank, visually separating the summary from the rest of the description.
- The following lines should be one or more paragraphs describing the object’s calling conventions, its side effects, etc.
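
A minimal sketch of a docstring that follows these conventions is shown below. The wording of the docstring is only one possible choice.

```python
def double(x):
    """Return twice the value of x.

    The argument x can be any object that supports multiplication by an
    integer, such as a number, a string, or a NumPy array.
    """
    return 2 * x

# help() displays the docstring; it is also stored in the __doc__ attribute
help(double)
print(double.__doc__)
```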


In [46]:
def double(x):
    """ Set the docstring 

    """
    # define the function here
    

In [47]:
# get help on the function 


## Function extras: creating your own modules

If you save a function in a Python file that ends with .py (e.g., in a file called `my_module.py`), you can import your functions as a module.
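
The sketch below shows the idea end to end under some assumptions: it writes a small `my_module.py` into a temporary folder and adds that folder to the import path so the example is self-contained. If the file is saved in the same folder as your notebook, the `sys.path` step should not be needed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# write a tiny module file into a temporary folder (assumed location)
module_dir = Path(tempfile.mkdtemp())
(module_dir / "my_module.py").write_text(
    'def double(x):\n'
    '    """Return twice the value of x."""\n'
    '    return 2 * x\n'
)

# tell Python where to look for the new file, then import it
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

import my_module

print(my_module.double(7))   # 14
```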


In [48]:
# save the function to a .py file 

# we can then import it as a module

# import the function as a module



## Pandas 

pandas Series are: One-dimensional ndarrays with axis labels

pandas DataFrames are: Table data

Let's look at the egg and wheat price data...
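
Before loading the file, here is a small sketch of a Series built by hand from the first five egg prices printed earlier in this notebook. The names `egg_sample` and `egg_table`, and the Series name `"Price"`, are assumptions for illustration.

```python
import pandas as pd

# index labels (dates) and values for the first five months of data
dates = pd.to_datetime(
    ["1990-01-01", "1990-02-01", "1990-03-01", "1990-04-01", "1990-05-01"]
)
egg_sample = pd.Series([1.223, 1.041, 1.111, 1.092, 0.94], index=dates, name="Price")

print(type(egg_sample))
print(egg_sample.shape)

# .loc looks up a value by its index label
feb_price = egg_sample.loc[pd.Timestamp("1990-02-01")]

# .iloc looks up a value by its position (starting at 0)
first_price = egg_sample.iloc[0]

# reset_index() moves the index into a regular column and returns a DataFrame
egg_table = egg_sample.reset_index()

print(feb_price, first_price)
print(type(egg_table))
```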


In [49]:
egg_prices_series = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, index_col= "DATE").squeeze() 

# print the type


# print the shape


# print the series


In [50]:
# get a value from the Series by an Index name using .loc


In [51]:
# get a value from the Series by index number using .iloc


In [52]:
# get egg prices for only 2022 using the .filter function 



# print the length 




In [53]:
# turn the index back into a column using .reset_index()


# get the type


# print the values



## Tables!

The ability to manipulate data in tables is one of the most useful skills in Data Science. 

Pandas is the most popular package in Python for manipulating data tables so we will use this package for manipulating tables in this class. The syntax for Pandas can be a little tricky, so try to be patient if you run into errors, and as always, there should be plenty of help available at office hours and on Ed. 

As an example, let's look at data on the closing price of the [Dow Jones Industrial Average](https://www.marketwatch.com/investing/index/djia) which is an index based on the stock prices of 30 large, prominent companies in the US.

The code below loads the DOW data into a Pandas DataFrame and displays the first 5 rows using the `head()` method. 


In [54]:
dow = pd.read_csv("dow.csv", parse_dates=True)  # parse_dates=True only tries to parse the index, so Date stays a string

dow.head()

Unnamed: 0,Date,Year,Month,Day,Open,High,Low,Close
0,1/25/23,2023,1,Wednesday,33538.36,33773.09,33273.21,33743.84
1,1/24/23,2023,1,Tuesday,33444.72,33782.92,33310.56,33733.96
2,1/23/23,2023,1,Monday,33439.56,33782.88,33316.25,33629.56
3,1/20/23,2023,1,Friday,33073.46,33381.95,32948.93,33375.49
4,1/19/23,2023,1,Thursday,33171.35,33227.49,32982.05,33044.56
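
The methods explored in the cells below can be sketched on a small DataFrame built by hand from a few of the rows shown above. The name `dow_sample` and the choice of columns are assumptions for illustration. The last step converts the `Date` strings with `pd.to_datetime`, assuming the month/day/two-digit-year format seen in the output.

```python
import pandas as pd

# a small table copied from the first three rows shown above
dow_sample = pd.DataFrame({
    "Date": ["1/25/23", "1/24/23", "1/23/23"],
    "Day": ["Wednesday", "Tuesday", "Monday"],
    "Open": [33538.36, 33444.72, 33439.56],
    "Close": [33743.84, 33733.96, 33629.56],
})

print(dow_sample.shape)               # (number of rows, number of columns)
print(dow_sample.dtypes)              # the type of each column
print(dow_sample.columns.to_numpy())  # the column names as a NumPy array
print(dow_sample.tail(2))             # the last 2 rows

# descriptive statistics for the numeric columns, rounded to 1 decimal
summary = dow_sample.describe().round(1)
print(summary)

# convert the Date strings into datetime values
dow_sample["Date"] = pd.to_datetime(dow_sample["Date"], format="%m/%d/%y")
print(dow_sample["Date"].iloc[0])
```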


In [55]:
# The head() method returns the first 5 rows. 
# Let's use the tail() method to get the last 5 rows.
# From looking at the output, can you tell what year the data goes back to? 



In [56]:
# get the number of rows and columns in a DataFrame using the shape property


In [57]:
# get the types of all the columns using .dtypes


In [58]:
# get the names of all the columns using .columns



# we can convert these names to a NumPy array using the .to_numpy() method



In [59]:
# get more info on the data frame using the .info() method


In [60]:
# get descriptive statistics on DataFrame using the .describe() method



# round() the values, or can convert them to ints using astype("int")



More on pandas DataFrames next class!