# Week 02

## Citing open-source / found code

Sometimes the citation will be part of the code. Whenever you use the `import` statement, I'll know the code is coming from somewhere else, and it's easy to figure out where.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.plot(np.sin(np.arange(0, 4 * np.pi, .1)))
plt.plot(np.cos(np.arange(0, 4 * np.pi, .1)), c='r')
plt.show()

Other times the citation will have to be a little more explicit.

A link to the original code, repo, or Stack Overflow answer is enough.

In [None]:
import cv2
from scipy import fftpack
from imagehash import ImageHash

# Function for computing the perceptual hash of an image
# Based on code from the vframe project:
#   https://github.com/vframeio/vframe/blob/master/src/vframe/utils/im_utils.py#L37-L48
# which is based on code from the imagehash library:
#   https://github.com/JohannesBuchner/imagehash/blob/master/imagehash.py#L197

def phash(im, hash_size=8, highfreq_factor=4):
  wh = hash_size * highfreq_factor
  im = cv2.resize(im, (wh, wh), interpolation=cv2.INTER_NEAREST)
  if len(im.shape) > 2 and im.shape[2] > 1:
    im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
  mdct = fftpack.dct(fftpack.dct(im, axis=0), axis=1)
  dctlowfreq = mdct[:hash_size, :hash_size]
  med = np.median(dctlowfreq)
  diff = dctlowfreq > med
  return ImageHash(diff)

Ok, back to Week 02

## Setup

Let's import some helper functions and libraries

In [None]:
import random

## Ranges

<img src="./imgs/range.jpg" width="500px" />

Range of integers from 0 up to, but not including, 10:

In [None]:
range(0, 10)

# TODO: take a look at the range values

Range of integers from 0 up to, but not including, 100, skipping by 10s:

In [None]:
range(0, 100, 10)

# TODO: take a look at the range values
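
A `range` doesn't show its values until something iterates over it. As a minimal sketch (the variable names are only for illustration), casting to a `list` is one way to take a look:

```python
# A range is lazy: printing it only shows its start and stop values.
zero_to_ten = range(0, 10)
print(zero_to_ten)          # range(0, 10)

# Casting to a list makes the values visible. The stop value is excluded.
print(list(zero_to_ten))    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# With a step of 10 we get the multiples of 10 below 100.
by_tens = list(range(0, 100, 10))
print(by_tens)              # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```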

## Lists
### Creating lists from sequences of numbers
#### Create a list with all the numbers between 0 and 1000 that end in 91

In [None]:
list_x91 = []

# TODO: for loop
# TODO: comprehension
# TODO: casting

### List indexing

Indexing from the front is normal:

In [None]:
list_x91, list_x91[0], list_x91[2], list_x91[8]

But, Python also lets us index from the back with negative numbers:

In [None]:
list_x91[-1], list_x91[-2], list_x91[-8]
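
To see what negative indexes are doing, here is a small sketch with a made-up list (`letters` is just an example, not part of the exercise):

```python
letters = ["a", "b", "c", "d", "e"]

# Index -1 is the last element, -2 the one before it, and so on.
print(letters[-1], letters[-2])    # e d

# A negative index i points at the same element as len(letters) + i.
print(letters[-2] == letters[len(letters) - 2])    # True
```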

### Create a list with 100 number 2's

In [None]:
list_100_2 = []
# TODO

### Create list of numbers between 0 and 100 that are divisible by 7:

In [None]:
# TODO: probably easier using comprehension
list_100_7 = []

### List functions

Members of each `list` object.

<img src="./imgs/lists00.jpg" width="500px" />

### Create a list of 10 random numbers between 0 and 100

In [None]:
list_of_randoms = []

# TODO: with for loop and append

### Print the numbers and their index

In [None]:
# TODO: with len
# TODO: with enumerate

### Create a list of 100 random numbers between 0 and 1000

In [None]:
# TODO: with for loop
list_of_randoms = []

# TODO: with comprehension
list_of_randoms_c = []

### Create a list with random length

A list of random length, with random numbers between 0 and 1000.


In [None]:
num_elements = 0 # TODO: random len
list_of_randoms = []

# TODO: list

### Addition

Besides the `append()` function we can also add elements to a list by using the `+` operator to *concatenate* two lists.

And just like addition on numbers we have to assign the result to a variable in order to use it later:

In [None]:
big_list_of_randoms = list_of_randoms + list_of_randoms_c

# Or, with the +=
list_of_randoms += list_of_randoms_c
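
One detail worth seeing in action: `+` builds a new list, while `+=` changes the existing list. A short sketch with made-up lists (`first`, `second` and `alias` are illustrative names):

```python
first = [1, 2, 3]
second = [4, 5]
alias = first                # a second name for the same list object

combined = first + second    # + builds a new list; first is unchanged
print(combined, first)       # [1, 2, 3, 4, 5] [1, 2, 3]

first += second              # += extends the existing list in place
print(first, alias)          # [1, 2, 3, 4, 5] [1, 2, 3, 4, 5]
```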

### Find the largest element on a list

Go through all of the elements and compare each element to the largest number seen so far.

Update the `largest` variable if we encounter a larger number.

In [None]:
largest = list_of_randoms[0]

# TODO: find max
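
One possible shape for this loop, sketched on a small hand-written list (`numbers` is a stand-in for `list_of_randoms`, so the result is predictable):

```python
numbers = [42, 7, 93, 15, 68]    # stand-in for list_of_randoms

largest = numbers[0]
for n in numbers:
    # keep n only if it beats the largest value seen so far
    if n > largest:
        largest = n

print(largest)    # 93
```

Finding the smallest element works the same way with the comparison flipped to `<`.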

### Find the smallest element on a list

Go through all of the elements and compare each element to the smallest number seen so far.

Update the `smallest` variable if we encounter a smaller number.

In [None]:
smallest = list_of_randoms[0]

# TODO: find min

### Find the sum of all elements on a list

Go through all of the elements and add their values to an accumulator variable.

In [None]:
my_sum = 0

# TODO: find sum
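
A sketch of the accumulator pattern, again on a small made-up list (`numbers` is a stand-in for `list_of_randoms`):

```python
numbers = [42, 7, 93, 15, 68]    # stand-in for list_of_randoms

my_sum = 0
for n in numbers:
    my_sum += n    # add each element to the accumulator

print(my_sum)      # 225
```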

### Python has built in functions for doing these things

In [None]:
print(min(list_of_randoms), max(list_of_randoms), sum(list_of_randoms))

### Find the 5 largest and 5 smallest numbers on a list

# 🤔

### Python has a function for sorting a list that could help

In [None]:
my_sorted_list = sorted(list_of_randoms)

print(list_of_randoms)
print(my_sorted_list)

### Functions on lists

These are functions that Python gives us to work on lists.

There are functions for sorting, reversing and getting the length of a `list`:

<img src="./imgs/lists01.jpg" width="600px" />

#### Reverse sorting:

In [None]:
my_reversed_sorted_list = [] # TODO

print(list_of_randoms)
print(my_sorted_list)
print(my_reversed_sorted_list)
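
As a hedged sketch of how these functions behave (using a made-up `values` list), note that `sorted()` returns a new list while `.sort()` and `.reverse()` change the list they are called on:

```python
values = [5, 2, 9, 1]

# sorted() returns a new list; reverse=True puts the largest value first
descending = sorted(values, reverse=True)
print(descending)    # [9, 5, 2, 1]
print(values)        # [5, 2, 9, 1]  (the original is untouched)

# .sort() and .reverse() modify the list in place and return None
values.sort()
values.reverse()
print(values, len(values))    # [9, 5, 2, 1] 4
```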

### With a sorted list we can more easily print the 5 smallest and 5 largest elements


In [None]:
print(my_sorted_list[ :5], my_sorted_list[-5: ])

### :W:T:F:?:

### Slicing

Python has a built-in mechanism for getting sub-sections of a list called *slicing*.

Instead of a single index, we specify two values in the square bracket, separated by a `:`, to specify where our slice starts and ends:

<img src="./imgs/slicing.jpg" width="700px" />

One **VERY** important thing to remember is that the second index in the bracket is **NOT** included in the slice.

In [7]:
import random  # needed here in case the Setup cell above was not run

my_list = [random.randint(0, 12) for i in range(0, 20)]
my_list, my_list[0 : 5]

As another example:  
`my_list[4 : 10]` would be used to access $6$ elements starting at position $4$, so ...
<br>elements $4$ - $9$ on the list. The second index in the slice, $10$, is not included.

In [None]:
my_list[4 : 10]
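
Because `my_list` is random, the excluded end index is easier to check on a list where every value equals its own index. A small sketch (`digits` is a made-up name):

```python
digits = list(range(20))    # [0, 1, ..., 19], so each value equals its index

piece = digits[4:10]
print(piece)                   # [4, 5, 6, 7, 8, 9]
print(len(piece) == 10 - 4)    # True: the length is end minus start
```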

And, Python being Python, it tries to be smart and keep us from unnecessary typing:
- if the first index is blank, the slice will start at the first element 
- if the second index is blank, the slice will go until the end of the list

In [None]:
my_list, my_list[0 : 5], my_list[ :5]

In [None]:
my_list[15 : 20], my_list[15: ]

We can use negative indexes to slice from the back:

`my_list[-5 : len(my_list)]` would grab the last 5 elements from the list `my_list`,
<br>but this can be simplified with `my_list[-5: ]`.

In [None]:
my_list[-5 : len(my_list)], my_list[-5: ]
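
The same shortcuts on a predictable list, as a quick sketch (`digits` is a made-up list where each value equals its index):

```python
digits = list(range(20))

print(digits[:5])     # [0, 1, 2, 3, 4]
print(digits[15:])    # [15, 16, 17, 18, 19]
print(digits[-5:])    # [15, 16, 17, 18, 19]

# everything except the last 5, plus the last 5, is the whole list again
print(digits[:-5] + digits[-5:] == digits)    # True
```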

### How would we get the 5 items in the center?

In [None]:
center_index = 0 # TODO
center_5 = [] # TODO: center items

### This should make more sense now:

In [None]:
my_sorted_list[ :5], my_sorted_list[-5: ]

## Objects

### Creating objects

In [None]:
my_info = {
  "name": "thiago",
  "id": 8114,
  "zip": 11001,
  "grades": [90, 80, 60],
  "attendance": [True, True, False, True, True],
  "final grade": "A"
}
my_info

### Accessing values at specific keys

In [None]:
my_info["name"], my_info["grades"]

### Modifying and Adding values

In [None]:
my_info["zip"] = 11202
my_info["course"] = 9103
my_info["section"] = "H"
my_info

### Iterating over keys, values and items

<img src="./imgs/objects.jpg" width="500px" />

In [None]:
# TODO use my_info.keys(), .values() and .items() to print object
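
One way these three methods could be used, sketched on a smaller made-up object (`student` is illustrative and separate from `my_info`):

```python
student = {"name": "thiago", "id": 8114, "grades": [90, 80, 60]}

# keys() gives just the names of the entries
for key in student.keys():
    print("key:", key)

# values() gives just the stored values
for value in student.values():
    print("value:", value)

# items() gives (key, value) pairs that can be unpacked in the loop
for key, value in student.items():
    print(key, "->", value)
```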

## List of objects

### Create a list of 10 objects with random heights and Brooklyn zip codes.

```python
my_data = [
  {"height": [60, 70], "zip": [11200, 11250]},
  {"height": [60, 70], "zip": [11200, 11250]},
  {"height": [60, 70], "zip": [11200, 11250]},
  ...
]
```

In [18]:
import random
my_data = []
# TODO: create list of random objects
for cnt in range(10):
    obj = {
        "height": random.randint(60,70),
        "zip": random.randint(11220, 11250)
    }
    my_data.append(obj)
my_data

[{'height': 61, 'zip': 11240},
 {'height': 66, 'zip': 11240},
 {'height': 70, 'zip': 11236},
 {'height': 70, 'zip': 11233},
 {'height': 60, 'zip': 11249},
 {'height': 66, 'zip': 11224},
 {'height': 60, 'zip': 11248},
 {'height': 64, 'zip': 11222},
 {'height': 64, 'zip': 11229},
 {'height': 65, 'zip': 11244}]

### Let's add a list of 3 grades for each member of the list and another item with their computed average

In [19]:
# TODO: first, add grade list to objects

for obj in my_data: 
    obj["grades"] = [random.randint(70,100) for cnt in range(3)]

my_data

[{'height': 61, 'zip': 11240, 'grades': [77, 89, 99]},
 {'height': 66, 'zip': 11240, 'grades': [81, 97, 86]},
 {'height': 70, 'zip': 11236, 'grades': [93, 86, 92]},
 {'height': 70, 'zip': 11233, 'grades': [80, 85, 85]},
 {'height': 60, 'zip': 11249, 'grades': [79, 72, 74]},
 {'height': 66, 'zip': 11224, 'grades': [94, 100, 72]},
 {'height': 60, 'zip': 11248, 'grades': [77, 96, 80]},
 {'height': 64, 'zip': 11222, 'grades': [91, 79, 81]},
 {'height': 64, 'zip': 11229, 'grades': [100, 79, 84]},
 {'height': 65, 'zip': 11244, 'grades': [90, 97, 74]}]

### Average

<img src="./imgs/average00.jpg" width="500px" />

<img src="./imgs/average01.jpg" width="500px" />

In [23]:
# TODO: compute and store averages
for obj in my_data: 
    obj["avg"] = round(sum(obj["grades"]) / len(obj["grades"]))

my_data

[{'height': 61, 'zip': 11240, 'grades': [77, 89, 99], 'avg': 88},
 {'height': 66, 'zip': 11240, 'grades': [81, 97, 86], 'avg': 88},
 {'height': 70, 'zip': 11236, 'grades': [93, 86, 92], 'avg': 90},
 {'height': 70, 'zip': 11233, 'grades': [80, 85, 85], 'avg': 83},
 {'height': 60, 'zip': 11249, 'grades': [79, 72, 74], 'avg': 75},
 {'height': 66, 'zip': 11224, 'grades': [94, 100, 72], 'avg': 89},
 {'height': 60, 'zip': 11248, 'grades': [77, 96, 80], 'avg': 84},
 {'height': 64, 'zip': 11222, 'grades': [91, 79, 81], 'avg': 84},
 {'height': 64, 'zip': 11229, 'grades': [100, 79, 84], 'avg': 88},
 {'height': 65, 'zip': 11244, 'grades': [90, 97, 74], 'avg': 87}]

### Get lowest and highest average grades


First, get all the average grades, then sort and take the first and last items on the list, or use `min()`/`max()`.

In [25]:
# Collect the average grade from each object
averages = []
for obj in my_data:
    averages.append(obj["avg"])

# min()/max() give the lowest and highest averages directly;
# sorted(averages)[0] and sorted(averages)[-1] would give the same values
min(averages), max(averages)


(75, 90)

### Sort by key values

For example, sort objects by average grade.

We could first get all the average grades and then sort the new list:

In [27]:
# TODO: get list of avg grades
grades = []

by_grade = sorted(grades)

print("original:\n", grades)
print("sorted:\n", by_grade)

original:
 []
sorted:
 []


In [26]:
by_averages = sorted(averages)
print(averages)
print(by_averages)

[88, 88, 90, 83, 75, 89, 84, 84, 88, 87]
[75, 83, 84, 84, 87, 88, 88, 88, 89, 90]


### But now we don't have the other associated information with each grade.

We want to sort the list while keeping the objects together.

Would be nice to be able to do something like this, just like with a `list`:

In [28]:
by_grade = sorted(my_data)
print(by_grade)

TypeError: '<' not supported between instances of 'dict' and 'dict'

### Sorting Objects

For lists of objects we have to tell python which values to compare to determine their order.

We do this by defining a key function.

Key functions receive one argument, which can be an object, a list, a class member, anything... and they return one value that Python knows how to compare, usually a number.

<img src="./imgs/list-of-objects.jpg" width="620px" />
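
As a minimal sketch of the idea, here is a key function on a tiny made-up list (`people` and the function names are illustrative and separate from `my_data`). It also shows a `lambda`, which is a shorter way to write the same kind of function, and a key that returns a string instead of a number:

```python
people = [
    {"name": "ana", "height": 66},
    {"name": "bo", "height": 61},
    {"name": "cy", "height": 70},
]

def height_key(person):
    # receives one item from the list and returns the value to sort by
    return person["height"]

by_height = sorted(people, key=height_key)
print([p["name"] for p in by_height])       # ['bo', 'ana', 'cy']

# The same pattern with a lambda and a string key, largest first.
# Strings compare alphabetically.
by_name_desc = sorted(people, key=lambda p: p["name"], reverse=True)
print([p["name"] for p in by_name_desc])    # ['cy', 'bo', 'ana']
```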

In [37]:
# this key function receives a student-info object with {height, grade, zip, etc}
# and should return just the average grade value
def gradeKey(A):
  return A["avg"]

# then we can just use it when we call sorted()
by_grade = sorted(my_data, key=gradeKey)
by_grade

def height(A):
  return A["height"]

by_height = sorted(my_data, key=height)
by_height

[{'height': 60, 'zip': 11249, 'grades': [79, 72, 74], 'avg': 75},
 {'height': 60, 'zip': 11248, 'grades': [77, 96, 80], 'avg': 84},
 {'height': 61, 'zip': 11240, 'grades': [77, 89, 99], 'avg': 88},
 {'height': 64, 'zip': 11222, 'grades': [91, 79, 81], 'avg': 84},
 {'height': 64, 'zip': 11229, 'grades': [100, 79, 84], 'avg': 88},
 {'height': 65, 'zip': 11244, 'grades': [90, 97, 74], 'avg': 87},
 {'height': 66, 'zip': 11240, 'grades': [81, 97, 86], 'avg': 88},
 {'height': 66, 'zip': 11224, 'grades': [94, 100, 72], 'avg': 89},
 {'height': 70, 'zip': 11236, 'grades': [93, 86, 92], 'avg': 90},
 {'height': 70, 'zip': 11233, 'grades': [80, 85, 85], 'avg': 83}]

In [41]:
# TODO: sort by first assignment grade
def hw01Key(A):
  # TODO
  return A["grades"][0]

by_hw01 = sorted(my_data, key=hw01Key)

by_hw01

[{'height': 61, 'zip': 11240, 'grades': [77, 89, 99], 'avg': 88},
 {'height': 60, 'zip': 11248, 'grades': [77, 96, 80], 'avg': 84},
 {'height': 60, 'zip': 11249, 'grades': [79, 72, 74], 'avg': 75},
 {'height': 70, 'zip': 11233, 'grades': [80, 85, 85], 'avg': 83},
 {'height': 66, 'zip': 11240, 'grades': [81, 97, 86], 'avg': 88},
 {'height': 65, 'zip': 11244, 'grades': [90, 97, 74], 'avg': 87},
 {'height': 64, 'zip': 11222, 'grades': [91, 79, 81], 'avg': 84},
 {'height': 70, 'zip': 11236, 'grades': [93, 86, 92], 'avg': 90},
 {'height': 66, 'zip': 11224, 'grades': [94, 100, 72], 'avg': 89},
 {'height': 64, 'zip': 11229, 'grades': [100, 79, 84], 'avg': 88}]

### `min()`/`max()` functions also work with a `key` argument:

In [42]:
# student with highest average grade
max_by_grade = max(my_data, key=gradeKey)

# student with lowest score on first assignment
min_by_hw01 = min(my_data, key=hw01Key)

print(max_by_grade)
print(min_by_hw01)

{'height': 70, 'zip': 11236, 'grades': [93, 86, 92], 'avg': 90}
{'height': 61, 'zip': 11240, 'grades': [77, 89, 99], 'avg': 88}


## Bigger Lists

## Setup

Include some helper functions and libraries

In [45]:
!wget -q https://github.com/DM-GY-9103-2024F-H/9103-utils/raw/main/src/data_utils.py

In [46]:
import matplotlib.pyplot as plt

from data_utils import object_from_json_url


### Load ANSUR 2 Database

The `JSON` file has a subset of the measurements found [here](https://www.openlab.psu.edu/ansur2/).

In [47]:
ANSUR_JSON_URL = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur.json"
ansur = object_from_json_url(ANSUR_JSON_URL)

# TODO: look at the data

# Answer:
#   - how many rows/records/items ?
#   - tallest height ?
#   - longest ear ?
#   - average ear length ?



### Let's look at simpler versions:

In [None]:
AHW_JSON_URL = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur_age_height_weight_object.json"
ahw_objs = object_from_json_url(AHW_JSON_URL)

# TODO: look at data
# How is it organized ?

In [None]:
AHW_LIST_URL = "https://raw.githubusercontent.com/DM-GY-9103-2024F-H/9103-utils/main/datasets/json/ansur_age_height_weight.json"
ahws = object_from_json_url(AHW_LIST_URL)

# TODO: look at data
# How is it organized ?

# Answer the following:
#   - how many items ?
#   - how do we access the height of a person ?

## List of Lists

Just like we can put lists inside objects, and objects inside lists, we can also put lists inside lists.

If we want to get to a particular value we have to use $2$ indices instead of using just one:
`list[i][j]`

The first index tells Python which of the sub-lists we want, and the second specifies the item on that list.

<img src="./imgs/list-of-lists00.jpg" width="700px" />

<img src="./imgs/list-of-lists01.jpg" width="700px" />

Sometimes we'll refer to the first index as the row index and the second index as the column index.

That's because if we imagine our list of lists as a 2-dimensional matrix of numbers, the first index tells Python which row we want to access and the second tells which column:

<img src="./imgs/list-of-lists02.jpg" width="700px" />

<img src="./imgs/list-of-lists03.jpg" width="700px" />
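
A small sketch of row and column access, using a made-up table of `[age, height, weight]` rows (the values are invented for illustration, not taken from the dataset):

```python
# each inner list is one row: [age, height, weight]
table = [
    [25, 68, 150],
    [31, 72, 180],
    [47, 64, 135],
]

print(table[1])       # [31, 72, 180]  (second row)
print(table[1][2])    # 180            (second row, third column)

# a column is built by taking the same index from every row
heights_column = [row[1] for row in table]
print(heights_column)    # [68, 72, 64]
```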

### Datasets

We'll see this kind of structure a lot.

It's very common for datasets to be organized by rows/columns, where each column specifies a different *property* (or *feature*) and each row is a different *measurement* (or *record*) of those features.

In our example above, our dataset had $3$ *features* (age, height, weight), and one *record* per person.

<img src="./imgs/datasets00.jpg" width="700px" />

### JSON

It's also common to find datasets specified in the JSON format.

Instead of just being a list of lists with values, each *record* is an object that specifies the names and values of its *features*:

<img src="./imgs/datasets01.jpg" width="700px" />

There are advantages and disadvantages to each. We'll soon look at another way to organize datasets that will make it easier to go from one type to the other if we have to.
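
To make the two layouts concrete, here is a hedged sketch that converts between them by hand, assuming we know the feature names and their order (`feature_names`, `rows` and `records` are made-up names and values):

```python
feature_names = ["age", "height", "weight"]
rows = [[25, 68, 150], [31, 72, 180]]

# list of lists -> list of objects: pair each name with the value in its column
records = [dict(zip(feature_names, row)) for row in rows]
print(records[0])    # {'age': 25, 'height': 68, 'weight': 150}

# list of objects -> list of lists: read the values back in a fixed order
rows_again = [[rec[name] for name in feature_names] for rec in records]
print(rows_again == rows)    # True
```

The list of lists is more compact, but it only makes sense if we remember which column is which. The objects carry their feature names with them.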

## Plots

We can use the [matplotlib](https://matplotlib.org/stable/api/pyplot_summary.html) library to visualize our data.

In [None]:
# TODO: get heights
heights = []

plt.plot(heights, 'bo', markersize=2)
plt.show()

In [None]:
# TODO: get weights
weights = []

plt.plot(weights, 'ro', markersize=2)
plt.show()

In [None]:
# TODO: plot ages in green
ages = []

### Sorting data can give a different perspective

In [None]:
sorted_heights = sorted(heights)
plt.plot(sorted_heights, 'bo', markersize=2)
plt.show()

### Histograms

In [None]:
min_height = min(heights)
max_height = max(heights)
plt.hist(heights, bins=range(min_height, max_height + 1))
plt.grid()
plt.show()

## Correlation

Correlation is a measurement of how $2$ variables (features) are related to each other.

<img src="./imgs/correlation.jpg" width="800px" />

They can have *positive* or *direct* correlation, if an increase in one of the variables comes with an increase in the other.

They can have *negative* or *inverse* correlation if an increase in one of the variables is accompanied by a decrease in the other.

Or, there can be *weak* or *NO* correlation, if a change in one variable doesn't seem to be accompanied by a change in the other.
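
One common way to put a number on this is the Pearson correlation coefficient, which is close to $1$ for positive correlation, close to $-1$ for negative correlation and close to $0$ when there is none. The sketch below computes it by hand on made-up lists; the `pearson` function and the data are assumptions for illustration, not something from the course helpers:

```python
def pearson(xs, ys):
    # how far each value is from its mean, multiplied and summed,
    # then scaled by the overall spread of both lists
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

xs = [1, 2, 3, 4, 5]
positive = pearson(xs, [2, 4, 6, 8, 10])    # close to 1
negative = pearson(xs, [10, 8, 6, 4, 2])    # close to -1
neither = pearson(xs, [2, 1, 3, 1, 2])      # close to 0

print(positive, negative, neither)
```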

In [None]:
# use "column" lists from above to plot scatter plot
plt.scatter(ages, heights, marker='o', alpha=0.2)
plt.xlabel("age")
plt.ylabel("height")
plt.show()

In [None]:
# TODO plot other combinations of variables
# TODO: any correlation ?