# Class 5: Array computations

Today we will focus on array computations, which are particularly useful for processing images.

## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(5)   # get class 5 code

# YData.download.download_class_code(5, True) # get the code with the answers

YData.download.download_data("nba_salaries_2015_16.csv")
YData.download.download_data("US_Gasoline_Prices_Weekly.csv")
YData.download.download_image("burns.jpeg")

There are also similar functions to download the homework:

In [None]:
YData.download.download_homework(2)  # downloads the second homework 

If you are using Google Colab, you should install the polars and YData packages by uncommenting and running the code below.

In [None]:
# !pip install polars
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import polars as pl
import statistics
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Review of array computations

Often we want to process data that is all of the same type. For example, we might want to do processing on a data set of numbers (e.g., if we were just analyzing salary data). 

When we have data that is all of the same type, there are faster ways to process it than using a list. In Python, the `numpy` package offers ways to store and process data that is all of the same type using a data structure called an `ndarray`. There are also functions that operate on `ndarrays` and can do computations very efficiently.

Let's explore this now!
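
As a rough preview of the ideas above, here is a minimal sketch that creates a few small arrays and inspects their types. The salary values and the names `salaries`, `flags`, `mixed` and `as_ints` are made up for illustration and are not part of the class data.

```python
import numpy as np

# hypothetical salaries (in millions of dollars), for illustration only
salaries = np.array([1.5, 2.0, 12.25, 0.9])
print(salaries.dtype)   # float64: every element has the same type
print(salaries.shape)   # (4,): a one-dimensional array with 4 elements

# an array of Booleans
flags = np.array([True, False, True])
print(flags.dtype)      # bool

# mixing numbers and strings: NumPy converts every element to a string
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype)      # a Unicode string dtype

# astype() returns a new array with the elements converted to another type
as_ints = np.array(["1", "2", "3"]).astype(int)
print(as_ints)
```

Because an `ndarray` holds a single type, NumPy picks one type that can represent every element, which is why the mixed list ends up as strings.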

In [None]:
# import the numpy package
import numpy as np


In [None]:
# create an ndarray of numbers




In [None]:
# we can get the type of elements in an array by accessing the dtype property


In [None]:
# get the size of the array


In [None]:
# create a boolean array



In [None]:
# get the type in the boolean array


In [None]:
# what happens if we make an array from a list of mixed values



In [None]:
# get the dtype 


In [None]:
# convert types







## NumPy functions on numerical arrays

The NumPy package has a number of functions that operate very efficiently on numerical ndarrays.

Let's explore these functions by looking at the price of gas!

The data comes from: https://www.eia.gov/opendata/v1/qb.php?category=240692&sdid=PET.EMM_EPM0_PTE_NUS_DPG.W
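
Before working with the real data, here is a small sketch of the kind of element-wise arithmetic and summary functions NumPy provides. The `prices` array is a made-up stand-in, not the EIA data, and the 0.92 exchange rate and $2 tax are just example constants.

```python
import numpy as np

# hypothetical weekly prices in dollars per gallon (not the EIA data)
prices = np.array([3.0, 3.5, 4.0, 3.5])

# arithmetic with a single number is applied to every element
prices_in_euros = prices * 0.92   # example exchange rate
prices_with_tax = prices + 2      # example constant tax of $2 per gallon

# summary functions reduce the array to a single number
total = np.sum(prices)            # 14.0
average = np.mean(prices)         # 3.5
middle = np.median(prices)        # 3.5

# cumsum gives a running total; diff gives the change between neighbors
running_total = np.cumsum(prices)   # [3.0, 6.5, 10.5, 14.0]
weekly_change = np.diff(prices)     # [0.5, 0.5, -0.5]

print(total, average, middle)
print(running_total)
print(weekly_change)
```

Note that `np.diff` returns one fewer element than its input, since it reports the change between each pair of consecutive values.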


In [None]:
all_gas_prices = pl.read_csv("US_Gasoline_Prices_Weekly.csv", parse_dates=True)  # load in the data
all_gas_prices.head()

In [None]:
# Get an ndarray of the gas prices from each week of 2022
# You can ignore this code for now...

gas_data = pl.read_csv("US_Gasoline_Prices_Weekly.csv", parse_dates=True)  # load in the data
gas_data = gas_data.with_column(pl.col('Week').str.strptime(pl.Date, fmt='%m/%d/%Y').cast(pl.Datetime))
gas_data_2022 = gas_data.filter(
    pl.col("Week").is_between(datetime(2022, 1, 1), datetime(2022, 12, 31), closed = "both"),
)

gas_prices_2022 = np.array(gas_data_2022["DollarsPerGallon"])
gas_dates_2022 = np.array(gas_data_2022["Week"])

gas_prices_2022

In [None]:
# prices for all 52 weeks in 2022


In [None]:
# One dollar is currently .92 Euros. What has the price of a gallon of gas been in Euros?


In [None]:
# what if there was a constant tax of $2 on each gallon purchased? 


In [None]:
# basic functions of: min, max, sum, mean and median


In [None]:
# if you bought one gallon each week, what would you pay over the whole year? 


In [None]:
# what do you pay on average? 



In [None]:
# If you bought one gallon each week, how much would you have paid in total by the end of each week of the year?


In [None]:
# How much does the gas price go up and down each week? 


In [None]:
# plot the gas prices


## Measuring how long a function takes to run

Jupyter notebooks have a special set of ["magic commands"](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that can be used to add additional functionality to a notebook.

We can use the `%%timeit` magic command to measure how long a piece of code takes to run. In particular, let's compare summing our gas prices using:

1. A for loop
2. Python's standard library sum() function
3. NumPy's np.sum() function 
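
The same three-way comparison can be sketched outside of a notebook with Python's standard `timeit` module, which plays a similar role to the `%%timeit` magic command. The array `values` and the helper `loop_sum` are hypothetical names used only for this illustration, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

# a larger made-up array so the differences are easier to see
values = np.arange(10_000, dtype=float)

def loop_sum(arr):
    """Sum the elements one at a time with a for loop."""
    total = 0.0
    for value in arr:
        total += value
    return total

approaches = [
    ("for loop", lambda: loop_sum(values)),
    ("built-in sum()", lambda: sum(values)),
    ("np.sum()", lambda: np.sum(values)),
]

for name, fn in approaches:
    # run each approach 5 times and report the total elapsed time
    seconds = timeit.timeit(fn, number=5)
    print(f"{name}: {seconds:.5f} seconds for 5 runs")
```

All three approaches compute the same total; only the time taken is expected to differ.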

In [None]:
%%timeit

# use a for loop to sum all the values






In [None]:
%%timeit

# use Python's standard library sum() function


In [None]:
%%timeit

# NumPy's np.sum() function 



There is not a huge difference here because this is such a small data set, but using efficient code can make a huge difference on large data sets!

![gas_prices](https://cdn.quotesgram.com/img/69/59/1803591020-high-gas-prices.jpg)

## Boolean arrays

We can easily compare all values in an ndarray to a particular value. The result is an ndarray of Booleans.

Since Boolean `True` values are treated as 1's, and Boolean `False` values are treated as 0's, this makes it easy to see how many values in an array meet particular conditions. 
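
A minimal sketch of this idea, using a made-up array of temperatures (the name `temps` and the threshold of 70 are purely illustrative):

```python
import numpy as np

# hypothetical temperatures, for illustration only
temps = np.array([61, 75, 58, 80, 72])

# comparing an array to a number gives an array of Booleans
is_warm = temps > 70
print(is_warm)            # [False  True False  True  True]

# True counts as 1 and False as 0, so sum() counts the True values
num_warm = np.sum(is_warm)
print(num_warm)           # 3

# the mean of a Boolean array is the proportion of True values
prop_warm = np.mean(is_warm)
print(prop_warm)
```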

In [None]:
# Test all values in an array that are less than 5
my_array = np.array([12, 4, 6, 3, 4, 3, 7, 4])


In [None]:
# How many values are less than 5?


In [None]:
# In how many (and what proportion) of weeks in 2022 were gas prices below $4?


### What proportion of NBA players are centers? 

The data from the 2015-2016 season is loaded below, and ndarrays for player positions and salaries are created.

See if you can use this data to calculate the proportion of NBA players that are centers...


In [None]:
# Load the NBA data as a polars data frame
nba = pl.read_csv("nba_salaries_2015_16.csv")  # load in the data
nba.head()

In [None]:
# Extract ndarrays for salary and position 
salary_array = np.array(nba["SALARY"].to_list())
position_array = np.array(nba["POSITION"].to_list())

print(salary_array[0:5])
print(position_array[0:5])

In [None]:
# get the proportion of players that are centers




# equivalently we can use the np.mean() function



## Boolean indexing/masking

We can also use Boolean arrays to return values in another array. This is called "Boolean masking" or "Boolean indexing".
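
As a small sketch of Boolean masking, the example below uses two made-up arrays of the same length, `scores` and `teams` (both hypothetical), and uses a condition on one to select values from the other:

```python
import numpy as np

# hypothetical data: one score and one team label per player
scores = np.array([88, 92, 79, 95])
teams = np.array(["A", "B", "A", "B"])

# the Boolean array marks which positions belong to team A
on_team_a = teams == "A"
print(on_team_a)              # [ True False  True False]

# indexing with the Boolean array keeps only the True positions
team_a_scores = scores[on_team_a]
print(team_a_scores)          # [88 79]

# functions can then be applied to just the selected values
print(np.mean(team_a_scores)) # 83.5
```

The mask and the array being indexed need to have the same length so that each Boolean lines up with one value.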


In [None]:
my_array = np.array([12, 4, 6, 3, 4, 3, 7, 4])



In [None]:
# Calculate the average salary of NBA centers






## Higher dimensional arrays
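
The operations explored in this section can be sketched on a small made-up matrix. The names `matrix`, `sub` and `changed` are illustrative only.

```python
import numpy as np

# a 3 x 4 matrix holding the numbers 0 through 11
matrix = np.arange(12).reshape(3, 4)
print(matrix)

# slicing with [rows, columns] gives a submatrix
sub = matrix[0:2, 1:3]
print(sub)                      # [[1 2], [5 6]]

# copy() makes an independent array, so the original is left unchanged
changed = matrix.copy()
changed[0, 0] = 100

# sum everything, or sum along one axis
total = np.sum(matrix)                # 66
column_sums = np.sum(matrix, axis=0)  # sums down the rows: [12 15 18 21]
row_sums = np.sum(matrix, axis=1)     # sums across the columns: [ 6 22 38]
print(total, column_sums, row_sums)
```

Here `axis=0` collapses the rows and leaves one value per column, while `axis=1` collapses the columns and leaves one value per row.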

In [None]:
# slicing to get a submatrix 


In [None]:
# copy the matrix


# set particular index values to 100



In [None]:
# sum all the values


In [None]:
# sum down the rows 


In [None]:
# sum across the columns


In [None]:
# what does the following do? 

face_array = np.zeros([100, 100])  # create a matrix of all 0's 

face_array[21:30, 21:30] = 1  # assign particular regions the value of 1
face_array[21:30, 71:80] = 1
face_array[71:80, 21:80] = 1



## Image processing

We can use numerical arrays (and NumPy) to do image processing. Let's explore this now.
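
To make the idea concrete without needing an image file, here is a minimal sketch on a tiny synthetic 2 x 2 RGB image. The array `img` and its pixel values are made up, and the approach shown (reversing the last axis, averaging the channels, masking with a threshold of 128) is one possible way to do these operations, not necessarily the one used in class.

```python
import numpy as np

# a synthetic 2 x 2 image: shape is (height, width, 3 color channels)
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [30, 60, 90]]], dtype=np.uint8)
print(img.shape, img.dtype)     # (2, 2, 3) uint8

# swap the red and blue channels by reversing the last axis
swapped = img[:, :, ::-1]

# grayscale: average the three channels, then convert back to integers
gray = img.mean(axis=2).astype(np.uint8)
gray_rgb = np.stack([gray, gray, gray], axis=2)

# masking: set every value below 128 to 0 in a copy of the image
darkened = img.copy()
darkened[darkened < 128] = 0

print(swapped[0, 0])    # the pure red pixel becomes pure blue: [  0   0 255]
print(gray)             # [[85 85], [85 60]]
print(darkened[1, 1])   # the dark pixel becomes [0 0 0]
```

Any of these arrays could be passed to `plt.imshow()` to display the result, just like an image loaded from a file.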


In [None]:
# load in an image 

from imageio.v3 import imread

I = imread("burns.jpeg")

plt.imshow(I);

In [None]:
# get the type and shape of the image



In [None]:
# get the min and max values



In [None]:
# Let's reverse the red and blue channels

# put the R, G, and B channels in separate names


# create a tensor of zeros to store the results


# reverse the R and B channels



# print the dtype


# convert the dtype to int


# show the image



In [None]:
# To create a grayscale image - use the average value in all three r, g, b channels

# calculate the average image


# create a tensor of all zeros and add the mean image to the R, G and B channels





# convert the image to ints

# show the image


In [None]:
# Image masking - make all dark pixels even darker (set to a value of 0)

# copy the image



# create a mask for all pixels less than a value of 128


# set all values in the mask to 0 and show the image

