# Class 7: Functions

Plan for today:
- Review important topics: for loops and Boolean indexing
- Writing functions
- If there is time: pandas Series and DataFrames



## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(7)   # get class 7 code

# YData.download.download_class_code(7, True) # get the code with the answers

YData.download.download_data("dow.csv")
YData.download.download_data("monthly_egg_prices.csv")
YData.download.download_data("monthly_wheat_prices.csv")


There are also similar functions to download the homework:

In [None]:
YData.download.download_homework(3)  # downloads homework 3

If you are using Google Colab, you should install the polars and YData packages by uncommenting and running the code below.

In [None]:
# !pip install polars
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Warm up review: lists and for loops

The code below loads the monthly price of a ton of wheat and a dozen eggs. 

Suppose someone bought a dozen eggs and a pound of wheat each month since Jan 1st, 1990 (which is when the data starts). Please answer the following questions:

1. Calculate the total amount of money that was spent. 

2. Use a for loop to create a list which has how much they spent each month (i.e., has one value for each month in the data). 

Hint: A ton is 2,000 pounds. 


In [None]:
# load the data

egg_prices_df = pd.read_csv("monthly_egg_prices.csv", parse_dates=True) 
egg_prices_df["DATE"] = pd.to_datetime(egg_prices_df['DATE']) 
egg_prices_df = egg_prices_df.set_index("DATE")

wheat_prices_df = pd.read_csv("monthly_wheat_prices.csv", parse_dates=True) 
wheat_prices_df["DATE"] = pd.to_datetime(wheat_prices_df['DATE']) 
wheat_prices_df = wheat_prices_df.set_index("DATE")

prices_df = egg_prices_df.join(wheat_prices_df, lsuffix = "_egg", rsuffix = "_wheat").dropna()


# create lists with the egg and wheat prices
egg_prices = prices_df["Price_egg"].to_list()
wheat_prices = prices_df["Price_wheat"].to_list()

print(egg_prices[0:5])
print(wheat_prices[0:5])

prices_df.head()

In [None]:
# Question 1: 
# Calculate the total spent on a dozen eggs and a pound of wheat each month

total_spent = 0
for i in range(len(egg_prices)):
    
    total_spent = total_spent + egg_prices[i] + wheat_prices[i]/2000

total_spent


In [None]:
# Question 2: 
# Create a list that has how much money would be spent each month (for 12 eggs and a pound of wheat)

monthly_spent = []

for i in range(len(egg_prices)):
    
    monthly_spent.append(egg_prices[i] + wheat_prices[i]/2000)


print(monthly_spent[0:5])   # first five months
print(monthly_spent[-5:])   # last five months



In [None]:
# Question 3: 
# Suppose someone only would buy a pound of wheat in months where eggs were less than $2.
# What is the total they would have spent on wheat? 

total_spent = 0
for i in range(len(egg_prices)):
    
    if egg_prices[i] < 2:
        total_spent = total_spent +  wheat_prices[i]/2000

total_spent
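
The loops above step through both lists by index. As a minimal sketch of the same pattern, the code below uses `zip()` to pair up the two lists instead. The prices and the names `egg_prices_demo`, `wheat_prices_demo`, and `POUNDS_PER_TON` are made up for illustration and are not the class data.

```python
# Hypothetical prices (not the class data):
# eggs in dollars per dozen, wheat in dollars per ton
egg_prices_demo = [1.50, 2.25, 1.75]
wheat_prices_demo = [200.0, 300.0, 250.0]

POUNDS_PER_TON = 2000

# zip() pairs up the i-th egg price with the i-th wheat price
monthly_spent_demo = []
for egg_price, wheat_price in zip(egg_prices_demo, wheat_prices_demo):
    monthly_spent_demo.append(egg_price + wheat_price / POUNDS_PER_TON)

# the total is the sum of the monthly amounts
total_spent_demo = sum(monthly_spent_demo)

print(monthly_spent_demo)
print(total_spent_demo)
```

Either style gives the same answer; `zip()` just avoids having to write `[i]` on each list.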



## Very quick review of array computations

Often we want to process data that is all of the same type. For example, we might want to do processing on a data set of numbers (e.g., if we were just analyzing salary data). 

When we have data that is all of the same type, there are faster ways to process data than using a list. In Python, the `numpy` package offers ways to store and process data that is all of the same type using a data structure called an `ndarray`. There are also functions that operate on `ndarray` objects that can do computations very efficiently. 

Let's explore this now!

In [None]:
# import the numpy package
import numpy as np


In [None]:
# create an ndarray of numbers
a_list = [2, 3, 4, 5]
an_array = np.array(a_list)  # create an ndarray from a list

print(an_array)           # print out the array
print(an_array.dtype)     # print out the array's dtype
print(an_array.shape)     # print out the array's shape
print(an_array[0])        # get the first item of the array

In [None]:
# create a boolean array
boolean_array = np.array([True, True, False])

print(boolean_array) # print out the array
print(boolean_array.dtype) # print out the array's dtype
print(boolean_array.astype("int")) # convert the array from Booleans to integers

In [None]:
# use a Boolean array to get elements from another array
an_array = np.array([1, 2, 3, 4, 5])
boolean_array = np.array([True, True, False, True, True])

# get only the elements that are True
an_array[boolean_array]

In [None]:
# create a boolean array for all values greater than 3
boolean_array = an_array > 3

In [None]:
# use boolean array to return the actual values greater than 3
an_array[boolean_array]


In [None]:
# return the actual values greater than 3 in one step
an_array[an_array > 3]


In [None]:
# Let's again get the total price of wheat in months where eggs were less than $2

egg_price_array = np.array(egg_prices)
wheat_price_array = np.array(wheat_prices)/2000

wheat_price_array[egg_price_array < 2]

np.sum(wheat_price_array[egg_price_array < 2])
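
To check that Boolean indexing really does the same job as the for loop with an `if` statement, here is a small sketch that runs both on the same values. The numbers and the names `egg_demo`, `wheat_demo`, and `cheap_egg_months` are hypothetical and chosen only so the result is easy to verify by hand.

```python
import numpy as np

# Hypothetical prices (not the class data)
egg_demo = np.array([1.50, 2.25, 1.75, 3.10])                  # dollars per dozen
wheat_demo = np.array([200.0, 300.0, 250.0, 400.0]) / 2000     # dollars per pound

# for loop version: add the wheat price only when eggs cost less than $2
loop_total = 0
for i in range(len(egg_demo)):
    if egg_demo[i] < 2:
        loop_total = loop_total + wheat_demo[i]

# Boolean indexing version: build a True/False mask, then sum the selected values
cheap_egg_months = egg_demo < 2
array_total = np.sum(wheat_demo[cheap_egg_months])

print(cheap_egg_months)
print(loop_total, array_total)
```

The mask has one `True`/`False` value per month, and only the wheat prices at the `True` positions are summed.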



## Functions!

We have already used many functions in this class that are built into Python or are imported from different modules/packages. 

Let's now write some new functions ourselves! 


In [None]:
def double(x):
    return x * 2

In [None]:
double(7)

In [None]:
double(15/3)

In [None]:
my_number = 12

In [None]:
double(my_number)

In [None]:
double(my_number / 8)

In [None]:
# will this work?
double(np.array([3, 4, 5]))

In [None]:
# will this work? 
double('data')

In [None]:
# what about this? 
double(True)

In [None]:
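# "local scope": the x inside double() is not visible out here,
# so this raises a NameError unless x was already defined in the notebook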
x

In [None]:
x = 17

In [None]:
double(2)

In [None]:
x

In [None]:
double(x)

In [None]:
x
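
The cells above can be summarized in one short sketch. The function `double_demo` is a hypothetical variant of `double` that reassigns its argument, to make it clear that changing `x` inside the function does not change the `x` outside of it.

```python
def double_demo(x):
    # this x is local: it only exists while the function is running
    x = x * 2
    return x

x = 17                    # this x is global
result = double_demo(x)

print(result)  # 34
print(x)       # 17, the global x is unchanged
```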

### Discussion Question

What does the following function do? 

In [None]:
#What does this function do?
def percents(values):
    return np.round(100 * values / sum(values), 2)

In [None]:
percents(np.array([1, 2, 3, 4]))

In [None]:
percents(np.array([1, 4, 30]))

In [None]:
# Can have multiple inputs
def percents(values, places):
    return np.round(values / sum(values) * 100, places)

In [None]:
percents(np.array([1, 4, 30]), 1)
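
One possible extension, shown here only as a sketch: an argument can be given a default value, so the caller may leave it out. The name `percents_default` is hypothetical and is used so that the `percents` function defined above is not overwritten.

```python
import numpy as np

def percents_default(values, places=2):
    # places is optional; if it is left out, round to 2 decimal places
    return np.round(values / sum(values) * 100, places)

two_places = percents_default(np.array([1, 4, 30]))      # uses places=2
one_place = percents_default(np.array([1, 4, 30]), 1)    # overrides the default

print(two_places)
print(one_place)
```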

## Function extras: docstrings

When writing functions that will be used by other people (or your future self), it is important to write some documentation describing how your function works. In Python, this type of documentation is called a "docstring". The text in a docstring is in triple quotes, which allows the string to span multiple lines.

There are a number of [conventions](https://peps.python.org/pep-0257/) for writing a docstring, including: 

- The docstring should begin with a capital letter and end with a period.
- The first line should be a short description.
- If there are more lines in the documentation string, the second line should be blank, visually separating the summary from the rest of the description.
- The following lines should be one or more paragraphs describing the object’s calling conventions, its side effects, etc.


In [None]:
def double(x):
    """Take a number and double it.
    
    Parameters:
    x (int): A number that should be doubled
    
    Returns:
    int: The doubled number
    
    """
    return x * 2

In [None]:
? double
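
The `?` syntax only works in Jupyter/IPython. As a sketch of how to read a docstring from plain Python, the code below uses the `__doc__` attribute and `inspect.getdoc()`; the built-in `help()` function shows the same text. The function name `double_documented` is hypothetical.

```python
import inspect

def double_documented(x):
    """Double a number.

    Parameters:
    x (int or float): The number to double

    Returns:
    int or float: The doubled number

    """
    return x * 2

# the raw docstring is stored on the function itself
print(double_documented.__doc__)

# inspect.getdoc() returns the same text with the indentation cleaned up
print(inspect.getdoc(double_documented))
```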

## Function extras: creating your own modules

If you save functions in a Python file that ends with .py (e.g., in a file called `my_module.py`), you can import your functions as a module.


In [None]:

import my_module as mm

mm.my_double(123)
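
The import above only works if a `my_module.py` file already exists next to the notebook. The sketch below shows the whole round trip under some assumptions: it writes a hypothetical file called `my_module_demo.py` into a temporary folder (so it cannot overwrite anything of yours), adds that folder to the import path, and then imports it. Normally you would just create the .py file in a text editor in the same folder as your notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# write a tiny module file into a temporary folder (a stand-in for your notebook folder)
module_dir = Path(tempfile.mkdtemp())
(module_dir / "my_module_demo.py").write_text(
    'def my_double(x):\n'
    '    """Double a number."""\n'
    '    return x * 2\n'
)

# tell Python where to look for the new file
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()

# import the module and call the function it defines
import my_module_demo as mmd

print(mmd.my_double(123))  # 246
```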

## Pandas 

A pandas Series is a one-dimensional ndarray with axis labels.

A pandas DataFrame is a table of data.

Let's look at the egg and wheat price data...


In [None]:
egg_prices_series = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, index_col= "DATE").squeeze() 

# print the type
print(type(egg_prices_series))

# print the shape
print(egg_prices_series.shape)

# print the series
egg_prices_series


In [None]:
# get a value from the Series by an Index name using .loc
egg_prices_series.loc["1980-01-01"]

In [None]:
# get a value from the Series by index number using .iloc
egg_prices_series.iloc[0]
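
To see the difference between `.loc` and `.iloc` without loading a file, here is a minimal sketch that builds a small Series by hand. The prices and the name `demo_series` are made up for illustration and are not taken from the egg price data.

```python
import pandas as pd

# Hypothetical prices (not the class data), labeled by month
demo_series = pd.Series(
    [0.88, 0.85, 0.92],
    index=pd.to_datetime(["1980-01-01", "1980-02-01", "1980-03-01"]),
    name="Price",
)

# .loc looks a value up by its Index label
by_label = demo_series.loc["1980-02-01"]

# .iloc looks a value up by its position (counting from 0)
by_position = demo_series.iloc[1]

print(by_label, by_position)
```

Both lines pick out the same value here, because the label "1980-02-01" sits at position 1.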

In [None]:
# get egg prices for only 2022 using the .filter function 
egg_prices_2022 = egg_prices_series.filter(like='2022')

# print the length 
print(len(egg_prices_2022))

egg_prices_2022

In [None]:
# turn the index back into a column using .reset_index()
egg_prices_df = egg_prices_series.reset_index()

# get the type
print(type(egg_prices_df))

# print the values
egg_prices_df


## Tables!

The ability to manipulate data in tables is one of the most useful skills in Data Science. 

Pandas is the most popular package in Python for manipulating data tables so we will use this package for manipulating tables in this class. The syntax for Pandas can be a little tricky, so try to be patient if you run into errors, and as always, there should be plenty of help available at office hours and on Ed. 

As an example, let's look at data on the closing price of the [Dow Jones Industrial Average](https://www.marketwatch.com/investing/index/djia), which is a stock market index that tracks the share prices of 30 large US corporations.

The code below loads the DOW data into a Pandas DataFrame and displays the first 5 rows using the `head()` method. 


In [None]:
dow = pd.read_csv("dow.csv", parse_dates=True)  # parse_dates=True only tries to parse the index, so the date column is not converted

dow.head()
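
One way to get dates parsed is to pass a list of column names to `parse_dates`. The sketch below demonstrates this on a tiny made-up CSV held in a string. The column names `Date` and `Close` and the values are assumptions for illustration; the real `dow.csv` may use different names, so check `dow.columns` first.

```python
import io
import pandas as pd

# Hypothetical CSV text (made-up values; the real dow.csv may look different)
csv_text = "Date,Close\n2023-01-03,100.0\n2023-01-04,101.5\n"

# parse_dates=True only tries to parse the index, so Date is left as text
not_parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=True)

# naming the column explicitly converts it to a datetime type
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

print(not_parsed.dtypes)
print(parsed.dtypes)
```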

In [None]:
# The head() method returns the first 5 rows. 
# Let's use the tail() method to get the last 5 rows.
# From looking at the output, can you tell how far back the data goes? 

dow.tail()

In [None]:
# get the number of rows and columns in a DataFrame using the shape property
dow.shape

In [None]:
# get the types of all the columns using .dtypes
dow.dtypes

In [None]:
# get the names of all the columns using .columns
print(dow.columns)

# we can convert these names to a NumPy array using the .to_numpy() method
dow.columns.to_numpy()

In [None]:
# get more info on the data frame using the .info() method
dow.info()

In [None]:
# get descriptive statistics on DataFrame using the .describe() method

dow.describe().round()   # round() the values, or can convert them to ints using astype("int")

More on pandas DataFrames next class!