# Class 5: Array computations

Today we will focus on array computations, which are particularly useful for processing images.

## Notes on the class Jupyter setup

If you have the *ydata123_2024a* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(5)   # get class 5 code    

# YData.download.download_class_code(5, True)  # get the code with the answers 

There are also similar functions to download the homework:

In [None]:
YData.download.download_homework(2)  # downloads the second homework 

If you are using Google Colab, you should install the polars and YData packages by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

## Review: Creating Arrays

Often we want to process data that is all of the same type. For example, we might want to do processing on a data set of numbers (e.g., if we were just analyzing salary data). 

When we have data that is all of the same type, there are faster ways to process it than using a list. In Python, the `numpy` package offers ways to store and process data that is all of the same type using a data structure called an `ndarray`. There are also functions that operate on `ndarray` objects and can do computations very efficiently.
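
As a minimal sketch of these ideas (the salary numbers below are made up for illustration and are not class data), we can build an `ndarray` from a list, inspect its `dtype` and `shape`, and convert between types:

```python
import numpy as np

# Hypothetical salary values, used only for illustration
salaries = np.array([52000, 61000, 48000, 75000])

print(salaries.dtype)   # an integer dtype (the exact width depends on the platform)
print(salaries.shape)   # (4,)

# A list of mixed values is coerced to one common type (here, strings)
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)

# np.arange excludes the stop value, so this gives the numbers 1 through 9
one_to_nine = np.arange(1, 10)

# astype returns a new array with the elements converted to another type
as_floats = one_to_nine.astype(float)
print(as_floats)
```

Note that every element of an `ndarray` shares one type, which is why the mixed list ends up as an array of strings.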

Let's explore this now!

In [None]:
# import the numpy package
import numpy as np

import matplotlib.pyplot as plt


In [None]:
# create an ndarray of numbers




In [None]:
# we can get the type of elements in an array by accessing the dtype property


In [None]:
# get the size of the array


In [None]:
# create a boolean array



In [None]:
# get the type in the boolean array


In [None]:
# reverse the True and False values


In [None]:
# what happens if we make an array from a list of mixed values?



In [None]:
# get the dtype 


In [None]:
# create sequential numbers 1 to 9



In [None]:
# convert types





## NumPy functions on numerical arrays

The NumPy package has a number of functions that operate very efficiently on numerical ndarrays.

Let's explore these functions by looking at the price of gas!

The data comes from: https://www.eia.gov/opendata/v1/qb.php?category=240692&sdid=PET.EMM_EPM0_PTE_NUS_DPG.W
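
Before turning to the real data, here is a small sketch of the kind of functions we will use. The prices below are hypothetical values chosen for easy arithmetic, not the EIA data:

```python
import numpy as np

# Hypothetical weekly prices in dollars per gallon (not the EIA data)
prices = np.array([3.0, 3.5, 4.0, 3.5])

# Arithmetic with a single number is applied to every element
with_tax = prices + 2          # add a constant $2 to each price
in_cents = prices * 100        # convert each price to cents

total = np.sum(prices)               # total cost of one gallon per week
average = np.mean(prices)            # average weekly price
running_total = np.cumsum(prices)    # cumulative cost after each week
weekly_change = np.diff(prices)      # change from each week to the next
lowest, highest = np.min(prices), np.max(prices)

print(total, average)
print(running_total)
print(weekly_change)
```

`np.diff` returns an array that is one element shorter than its input, since there is no change to report for the first week.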

In [None]:
# download the data
YData.download.download_data('US_Gasoline_Prices_Weekly.csv')

In [None]:
# read in and view the data
import pandas as pd
gas_data = pd.read_csv("US_Gasoline_Prices_Weekly.csv", parse_dates=[0])  # load in the data
gas_data.head()

In [None]:
# Get an ndarray of the gas prices from each week of 2023
# You can ignore this code for now...

gas_data_2023 = gas_data[(gas_data['Week'] > '2023-01-01') & (gas_data['Week'] < '2024-01-01')] 

gas_prices_2023 = gas_data_2023["DollarsPerGallon"].values
gas_dates_2023 = gas_data_2023["Week"].values


In [None]:
# prices for all 52 weeks in 2023



In [None]:
# One dollar is currently 147 Yen. What has a gallon of gas cost in Yen?
# What have gas prices been in Euros? 



In [None]:
# what if there was a constant tax of $2 on each gallon purchased? 



In [None]:
# basic functions of: min, max, etc.



In [None]:
# if you bought one gallon each week, what would you pay over the whole year? 



In [None]:
# what do you pay on average? 



In [None]:
# If you bought one gallon each week, how much would you have paid in total by the end of each week of the year?



In [None]:
# How much does the gas price go up and down each week? 



In [None]:
# plot the gas prices



In [None]:
# plot the gas prices better!






<br>
<br>
<br>
<p>
<center><img src=https://cdn.quotesgram.com/img/69/59/1803591020-high-gas-prices.jpg></center>

## Boolean arrays

We can easily compare all values in an ndarray to a particular value. The result is an ndarray of Booleans.

Since Boolean `True` values are treated as 1's, and Boolean `False` values are treated as 0's, this makes it easy to see how many values in an array meet particular conditions. 
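
A minimal sketch of this idea, using a small made-up array rather than the class data:

```python
import numpy as np

# Hypothetical values, used only for illustration
values = np.array([2, 7, 4, 9, 1, 5])

# Comparing an array to a number gives one Boolean per element
less_than_5 = values < 5
print(less_than_5)

# True counts as 1 and False as 0, so the sum is a count...
count = np.sum(less_than_5)

# ...and the mean is a proportion
proportion = np.mean(less_than_5)

print(count, proportion)
```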

In [None]:
# Test all values in an array that are less than 5




In [None]:
# How many values are less than 5?



In [None]:
# In how many weeks of 2023 (and in what proportion of weeks) were gas prices below $3.50?



### What proportion of NBA players are centers? 

The data from the 2022-2023 season is loaded below, and ndarrays for player positions and salaries are created.

See if you can use this data to calculate the proportion of NBA players that are centers using numpy!

In [None]:
# download the data
import YData
YData.download.download_data("nba_salaries_2022_23.csv")

In [None]:
# Load the NBA data as a pandas data frame
import pandas as pd
nba = pd.read_csv("nba_salaries_2022_23.csv")  # load in the data
nba.head()

# Extract ndarrays for salary, position, team, and player
salary_array = nba["SALARY"].values
position_array = nba["POSITION"].values
team_array = nba["TEAM"].values
player_array = nba["PLAYER"].values

print(salary_array[0:5])
print(position_array[0:5])


In [None]:
# get the proportion of players that are centers





# equivalently we can use the np.mean() function



## Boolean indexing/masking

We can also use Boolean arrays to return values in another array. This is called "Boolean masking" or "Boolean indexing".
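
Here is a minimal sketch of Boolean masking. The positions, teams, and salaries are hypothetical and are not taken from the NBA data:

```python
import numpy as np

# Hypothetical players (not taken from the NBA data set)
positions = np.array(["C", "PG", "C", "SF", "PG"])
teams = np.array(["A", "A", "B", "B", "A"])
salaries = np.array([10.0, 4.0, 6.0, 8.0, 3.0])

# A Boolean array that is True wherever the position is "C"
is_center = positions == "C"

# Indexing with the Boolean array keeps only the True positions
center_salaries = salaries[is_center]
center_mean = np.mean(center_salaries)

# ~ flips True and False, which selects everyone who is not a center
non_center_mean = np.mean(salaries[~is_center])

# Conditions can be combined with & (each condition in parentheses)
center_on_team_a = (positions == "C") & (teams == "A")

print(center_mean, non_center_mean)
print(salaries[center_on_team_a])
```

The mask and the array being indexed need to have the same length, since each Boolean decides whether the element at that position is kept.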


In [None]:
# Calculate the average salary of NBA centers





In [None]:
# Do the other positions have higher average salaries compared to centers? 
# Calculate the average salary of all players who are not Centers





In [None]:
# What are the salaries for centers on the Celtics? 




# print the number of players that are centers on the Celtics



# check who they are



# get their salaries



## Higher dimensional arrays
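
As a rough sketch of what working with a two-dimensional array looks like (the small matrix below is made up for illustration), we can slice out a submatrix, copy and modify it, and sum along each axis:

```python
import numpy as np

# A hypothetical 3 x 4 matrix holding the numbers 0 through 11
matrix = np.arange(12).reshape(3, 4)

# Rows 0-1 and columns 1-2; the end index of each slice is excluded
sub = matrix[0:2, 1:3]

# Copy before modifying so the original matrix is left unchanged
changed = matrix.copy()
changed[0, 0] = 100

total = np.sum(matrix)               # sum of all the values
col_sums = np.sum(matrix, axis=0)    # sum down the rows: one value per column
row_sums = np.sum(matrix, axis=1)    # sum across the columns: one value per row

print(sub)
print(total, col_sums, row_sums)
```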

In [None]:
# slicing to get a submatrix 
# like array slicing, it does not return the value at the end index



In [None]:
# copy the matrix


# set particular index values to 100




In [None]:
# sum all the values



In [None]:
# sum down the rows (axis 0)



In [None]:
# sum across the columns (axis 1)



In [None]:
# what does the following do? 

face_array = np.zeros([100, 100])  # create a matrix of all 0's 

face_array[21:30, 21:30] = 1  # assign particular regions the value of 1
face_array[21:30, 71:80] = 1
face_array[71:80, 21:80] = 1



In [None]:
# convert face_array to a boolean matrix




## Image processing

We can use numerical arrays (and NumPy) to do image processing. Let's explore this now.
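
To see the idea on something small first, here is a sketch that treats a tiny made-up 2 x 2 RGB image as a three-dimensional array. The pixel values are hypothetical, and the array is assumed to be laid out as (rows, columns, color channels):

```python
import numpy as np

# Hypothetical 2 x 2 RGB image with 8-bit values (not the class image)
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [30, 60, 90]]], dtype=np.uint8)

print(img.shape)   # (2, 2, 3)

# Reversing the last axis swaps the red and blue channels
swapped = img[:, :, ::-1]

# Averaging over the channel axis gives one simple grayscale value per pixel
gray = img.mean(axis=2)

# Boolean masking: set every value below 128 to 0
darkened = img.copy()
darkened[darkened < 128] = 0

print(gray)
```

Averaging the channels is only one simple way to make a grayscale image; other weightings of the three channels are also common.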

In [None]:
# download an image of a famous Yale alumnus
YData.download.download_image("burns.jpeg")

In [None]:
# load in an image 

from imageio.v3 import imread

I = imread("burns.jpeg")

plt.imshow(I);

In [None]:
# get the type and shape of the image



In [None]:
# Let's reverse the red and blue channels

r_channel = I[:, :, 0]
g_channel = ...
b_channel = ...

# create new image where color channels will be swapped
rev_rb = np.zeros(I.shape)
print(rev_rb.shape)

# swap channels
rev_rb[:, :, 0] = b_channel
rev_rb[:, :, 1] = g_channel
rev_rb[:, :, 2] = r_channel

# convert to ints
print(rev_rb.dtype)
rev_rb = rev_rb.astype("int")

# show the image
plt.imshow(rev_rb);

In [None]:
# To create a grayscale image - use the average value in all three r, g, b channels





In [None]:
# Image masking - make all dark pixels even darker (set to a value of 0)

darken = I.copy()
darken_mask = darken < 128
print(darken_mask.shape)

darken[darken_mask] = 0
plt.imshow(darken);

## Pandas 

A pandas Series is a one-dimensional ndarray with axis labels.

A pandas DataFrame is a table of data.

Let's look at the egg and wheat price data...
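
As a small sketch of how a Series behaves, the example below uses made-up prices with date strings as index labels. These values are hypothetical and are not the class data:

```python
import pandas as pd

# Hypothetical prices indexed by date strings (not the class data)
prices = pd.Series([2.0, 2.5, 3.0],
                   index=["2022-12-01", "2023-01-01", "2023-02-01"],
                   name="price")

by_label = prices.loc["2023-01-01"]   # look up a value by its index label
by_position = prices.iloc[0]          # look up a value by integer position

# Keep only the entries whose index label contains "2023"
in_2023 = prices.filter(like="2023")

# Turn the index back into a regular column, which gives a DataFrame
as_frame = prices.reset_index()

print(by_label, by_position, len(in_2023))
print(type(as_frame))
```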


In [None]:
YData.download.download_data("monthly_egg_prices.csv");
YData.download.download_data("dow.csv");

In [None]:
import pandas as pd

# reading in a series by parsing the dates, and using .squeeze() to convert to a Series
egg_prices_series = pd.read_csv("monthly_egg_prices.csv", parse_dates=True, date_format="%m/%d/%y", index_col="DATE").squeeze()


# print the type


# print the shape


# print the series


In [None]:
# get a value from the Series by an Index name using .loc



In [None]:
# get a value from the Series by index number using .iloc



In [None]:
# use the .filter() method to get data from dates that contain "2023"


# print the length 




In [None]:
# turn the index back into a column using .reset_index()


# get the type


# print the values



## DataFrames!

The ability to manipulate data in tables is one of the most useful skills in Data Science. 

Pandas is the most popular package in Python for manipulating data tables so we will use this package for manipulating tables in this class. The syntax for Pandas can be a little tricky, so try to be patient if you run into errors, and as always, there should be plenty of help available at office hours and on Ed. 

As an example, let's look at data on the closing price of the [Dow Jones Industrial Average](https://www.marketwatch.com/investing/index/djia), which is an index of the stock prices of 30 large, prominent corporations in the US.

The code below loads the DOW data into a Pandas DataFrame and displays the first 5 rows using the `head()` method. 
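
Before working with the real file, here is a minimal sketch of the DataFrame properties and methods we will try out. The table and its column names (`Date`, `Close`) are hypothetical and may not match the columns in `dow.csv`:

```python
import pandas as pd

# Hypothetical table; the column names and values are invented for illustration
df = pd.DataFrame({
    "Date": ["2024-01-02", "2024-01-03", "2024-01-04"],
    "Close": [100.0, 102.0, 98.0],
})

first_rows = df.head(2)      # first 2 rows (head() with no argument returns 5)
last_rows = df.tail(2)       # last 2 rows

n_rows, n_cols = df.shape    # shape is a property, so no parentheses
column_names = df.columns.to_numpy()   # column names as a NumPy array

print(df.dtypes)             # the type of each column
df.info()                    # a printed summary of the DataFrame
summary = df.describe()      # descriptive statistics for the numeric columns
print(summary)
```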


In [None]:
dow = pd.read_csv("dow.csv", parse_dates=True)  # parse_dates=True only tries to parse the index, so the dates are not parsed here

dow.head()

In [None]:
# The head() method returns the first 5 rows. 
# Let's use the tail() method to get the last 5 rows.
# From looking at the output, can you tell which year the data goes back to?



In [None]:
# get the number of rows and columns in a DataFrame using the shape property



In [None]:
# get the types of all the columns using .dtypes



In [None]:
# get the names of all the columns using .columns


# we can convert these names to a NumPy array using the .to_numpy() method



In [None]:
# get more info on the data frame using the .info() method



In [None]:
# get descriptive statistics on the DataFrame using the .describe() method



More on pandas DataFrames next class!