# Module \#2 - Intermediate Python

## Dictionaries

Dictionaries are a core data type of the Python language. Unlike lists, dictionary values are accessed using keys rather than numerical indices. (Since Python 3.7, dictionaries preserve insertion order, but they still cannot be indexed by position.) This workshop will not describe dictionaries in detail, but you can refer to the W3Schools guide [here](https://www.w3schools.com/python/python_dictionaries.asp) for more info.

In [112]:
# Create a dict using key-value pairs between {}
my_dict = {
    'hello': 1
}
my_dict

{'hello': 1}

In [113]:
# Access the value in a dict using keys
my_dict['hello']

1

In [114]:
# Dictionary keys can be any hashable object, such as numbers, strings, and booleans. Values can be objects of any type.
my_dict = {
    'hello': 1,
    'world': True,
    123: [1, 2, ["Hello world"]],
    True: {
        "New": True
    }
}
my_dict

{'hello': 1, 'world': True, 123: [1, 2, ['Hello world']], True: {'New': True}}

In [115]:
my_dict[True]

{'New': True}
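
As a brief sketch of a few other common dictionary operations (the `inventory` dictionary and its keys are made up for illustration), the block below adds a key, updates a value, looks up a missing key safely with `get()`, and loops over the key-value pairs:

```python
# Hypothetical example data, not part of the workshop material
inventory = {'apples': 3, 'pears': 5}

# Add a new key and update an existing one
inventory['plums'] = 7
inventory['apples'] = 4

# get() returns a default value instead of raising KeyError for a missing key
missing = inventory.get('kiwis', 0)

# Loop over key-value pairs (insertion order is preserved in Python 3.7+)
for fruit, count in inventory.items():
    print(fruit, count)
```

Using `inventory['kiwis']` here would raise a `KeyError`, which is why `get()` is often preferred when a key may be absent.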

## Sets and Tuples

Sets and tuples are also used in python. Discussing them is outside the scope of this workshop. For more information please refer to the following:

1. Sets - [here](https://www.w3schools.com/python/python_sets.asp)
2. Tuples - [here]()

## Numpy arrays

`numpy` is a python package that provides complex data types for performing mathematical operations. In particular, numpy provides the `array` data type which is similar to the `matrix` in R. 

Before arrays can be constructed, it is necessary to load the numpy library in Python. We can accomplish this using an import statement (below). This loads the `numpy` library into Python as an object called `numpy`.

In [80]:
!pip install numpy



In [118]:
import numpy

Libraries (aka "packages") in python are actually a type of object called a `module`.

In [119]:
type(numpy)

module

Just like other objects, they have properties and methods. We typically load modules into Python because we want to use the methods they contain. As a reminder, you can access an object's methods using the `<object>.<method>()` notation. For the `numpy` module, the method we are most interested in is `array()`, which we can use to construct an `array` object.

In [120]:
# Create a 1-dimensional (1d) array holding the values 1, 2, and 3
numpy.array([1, 2, 3])

array([1, 2, 3])

We can create 2-dimensional arrays by adding lists of lists:

In [122]:
# Construct a 2d array
numpy.array([
    [1, 2, 3],
    [4, 5, 6]
])

array([[1, 2, 3],
       [4, 5, 6]])

We can even create a 3-dimension array (and beyond) using lists of lists of lists (etc). 

In [123]:
# Construct a 3d array
numpy.array([
    [
        [1, 2, 3],
        [4, 5, 6]
    ],
    [
        [7, 8, 9],
        [10, 11, 12]
    ]
])

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [125]:
# Import numpy under the alias np (the conventional short name used in the rest of this module)
import numpy as np
np.array([1, 2, 3])

array([1, 2, 3])

In [126]:
# Import only the array() function from numpy so it can be called without a prefix
from numpy import array
array([1, 2, 3])

array([1, 2, 3])

### Numpy array methods

Many methods are available for `array` objects. An exhaustive reference is available [here](https://numpy.org/doc/stable/reference/index.html). For now, we will discuss a few key methods:

1. Creation
2. Shape and dimensions
3. Accessing elements
4. Setting elements
5. any / all 
6. Mathematical operations

TODO: Include short section on using magic blocks

#### Creation

Numpy arrays can be created in multiple ways. The simplest involves the use of lists (shown above):

In [127]:
my_arr = np.array([
    [True, False, False],
    [False, True, False]
])
my_arr

array([[ True, False, False],
       [False,  True, False]])

In [128]:
type(my_arr)

numpy.ndarray

Arrays can also be created using the `arange()` method. This method creates an `array` of evenly spaced values that starts at 0 by default and stops just before the value specified:

In [132]:
# Create a 1d integer array from 0-9
my_arr = np.arange(10)
my_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [131]:
# Create a 1d integer array from 10-19
my_arr = np.arange(10, 20)
my_arr

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [134]:
# Create a 1d integer array from 10-95 in steps of 5
my_arr = np.arange(10, 100, 5)
my_arr

array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90,
       95])

In [133]:
# Create a 1d float array from 0.0-9.0
my_arr = np.arange(10.0)
my_arr

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
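
To make the role of each argument explicit, here is a small sketch (the variable name `evens` is ours) showing that `arange(start, stop, step)` includes the start value but excludes the stop value:

```python
import numpy as np

# arange(start, stop, step): start is included, stop is excluded
evens = np.arange(0, 10, 2)
print(evens)          # [0 2 4 6 8]

# The stop value itself never appears in the result
print(10 in evens)    # False
```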

#### Shape and dimensions

`numpy` arrays have a number of dimensions and a shape. Note that these are properties, not methods. They are accessed using this pattern: `<object>.<property_name>` as follows:

In [135]:
# Construct 2d array
my_arr = np.array([
    [True, False, False],
    [False, True, False]
])

In [136]:
# Get number of dimensions property
my_arr.ndim

2

In [137]:
# Get the shape property (number of rows, number of columns)
my_arr.shape

(2, 3)

In [139]:
# Construct a 3d array
arr_3d = np.array([
    [
        [1, 2, 3],
        [4, 5, 6]
    ],
    [
        [7, 8, 9],
        [10, 11, 12]
    ]
])

# Get the shape (number of 2d arrays, number of rows, number of columns)
arr_3d.shape

(2, 2, 3)

Finally, the shape of an array can be altered using the `reshape()` method. This is particularly useful for quickly constructing arrays of a desired shape:

In [140]:
my_arr = np.arange(15)
my_arr = my_arr.reshape((5, 3))  # reshape() returns a new array; my_arr only changes because we re-assign it using '='

In [141]:
my_arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

The above can be simplified to one line of code:

In [142]:
my_arr = np.arange(15).reshape((5, 3))
my_arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
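
As a short sketch of two details of `reshape()` (the names `flat`, `grid`, and `inferred` are illustrative), the block below shows that the original array keeps its shape and that `-1` can be used to let numpy infer one dimension:

```python
import numpy as np

flat = np.arange(12)

# reshape() returns a new array object; `flat` keeps its original shape
grid = flat.reshape((3, 4))
print(flat.shape)   # (12,)
print(grid.shape)   # (3, 4)

# Passing -1 lets numpy infer that dimension from the total number of elements
inferred = flat.reshape((2, -1))
print(inferred.shape)   # (2, 6)
```

The requested shape must account for every element: `flat.reshape((5, 3))` would raise a `ValueError` because 12 values cannot fill a 5x3 array.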

#### Accessing elements

Elements can be accessed using several approaches. 

1. Numerical
2. Logical

For the **Numerical** approach, numerical indices are utilized using the pattern suited to their shape:

In [143]:
# For a 1D array, similar to list
my_arr = np.array([3, 8, 1, 5])
print(my_arr)
my_arr[1]  # Get the 2nd element of the first (and only) dimension

[3 8 1 5]


8

In [144]:
# For a 2D array, the pattern is array[row_index, column_index]
my_arr = np.array([
    [5, 7, 4, 6],
    [2, 1, 9, 8]
])
print(my_arr)
my_arr[0, 1]  # First row (index 0) and second column (index 1)

[[5 7 4 6]
 [2 1 9 8]]


7

In [176]:
# For an n-dimensional array, the pattern is the same: one index per dimension, outermost dimension first
my_arr = np.arange(125).reshape((5, 5, 5))
print(my_arr)
my_arr[2, 3, 1]  # 3rd matrix (index 2), 4th row (index 3), 2nd column (index 1)

[[[  0   1   2   3   4]
  [  5   6   7   8   9]
  [ 10  11  12  13  14]
  [ 15  16  17  18  19]
  [ 20  21  22  23  24]]

 [[ 25  26  27  28  29]
  [ 30  31  32  33  34]
  [ 35  36  37  38  39]
  [ 40  41  42  43  44]
  [ 45  46  47  48  49]]

 [[ 50  51  52  53  54]
  [ 55  56  57  58  59]
  [ 60  61  62  63  64]
  [ 65  66  67  68  69]
  [ 70  71  72  73  74]]

 [[ 75  76  77  78  79]
  [ 80  81  82  83  84]
  [ 85  86  87  88  89]
  [ 90  91  92  93  94]
  [ 95  96  97  98  99]]

 [[100 101 102 103 104]
  [105 106 107 108 109]
  [110 111 112 113 114]
  [115 116 117 118 119]
  [120 121 122 123 124]]]


66
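
Numerical indexing also accepts slices, in the same way lists do. The following is a minimal sketch using the 2D array from the earlier cell (the names `first_row`, `second_col`, and `block` are ours):

```python
import numpy as np

my_arr = np.array([
    [5, 7, 4, 6],
    [2, 1, 9, 8]
])

first_row = my_arr[0, :]      # every column of row 0
second_col = my_arr[:, 1]     # every row of column 1
block = my_arr[:, 1:3]        # all rows, columns 1 and 2 (the end of a slice is excluded)

print(first_row)    # [5 7 4 6]
print(second_col)   # [7 1]
print(block)
```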

For the **Logical** approach, we can use a boolean array to extract the element(s) of interest:

In [147]:
num_arr = np.array([1, 2, 3])
bool_arr = np.array([False, False, True])
num_arr[bool_arr]  #  We access the element of num_arr for which bool_arr is True

array([3])

This approach is extremely powerful when you can use logical operations to create a boolean array:

In [148]:
# Create a 2D matrix
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
print(dataset)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


In [149]:
# Create a boolean matrix for this dataset to test where values are greater than 3
bools = dataset > 3
print(bools)

[[False False False  True  True]
 [ True  True  True  True  True]]


In [150]:
# Extract the value(s) which satisfy this logical operation
dataset[bools]

array([ 4,  5,  6,  7,  8,  9, 10])

In [151]:
# Create a boolean matrix for this dataset to test where values are equal to 5 using np.equal()
bools = np.equal(dataset, 5)
print(bools)

[[False False False False  True]
 [False False False False False]]


In [152]:
# Extract the value(s) which satisfy this logical operation
dataset[bools]

array([5])

In [153]:
# Create a boolean matrix for this dataset to test where values are > 8 or < 3
bools = np.logical_or(dataset > 8, dataset < 3)
print(bools)
# Subset the data using these booleans
dataset[bools]

[[ True  True False False False]
 [False False False  True  True]]


array([ 1,  2,  9, 10])
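
Boolean arrays can also be combined with the `|` (or) and `&` (and) operators, which work element by element. This is a sketch of an equivalent spelling, not a replacement for `np.logical_or()`; the names `either` and `between` are ours:

```python
import numpy as np

dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

# | is element-wise OR; the parentheses are required because | binds tighter than > and <
either = (dataset > 8) | (dataset < 3)

# & is element-wise AND
between = (dataset > 3) & (dataset < 8)

print(dataset[either])    # [ 1  2  9 10]
print(dataset[between])   # [4 5 6 7]
```

Note that the Python keywords `and` / `or` do not work element-wise on arrays and raise a `ValueError` when used this way.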

Finally, we can use the **where** approach, which is a hybrid of these two methods: `np.where()` takes a boolean condition and returns the numerical indices at which it is `True`.

In [154]:
# Find the numerical indices for values in the dataset > 6
indices = np.where(dataset > 6)
print(indices)
# Subset the data using these indices
dataset[indices]

(array([1, 1, 1, 1], dtype=int64), array([1, 2, 3, 4], dtype=int64))


array([ 7,  8,  9, 10])

#### Setting elements

Just as you can access elements of an array, you can also set them. This can be done with integer and logical indexing. 

Here is an example with simple integer indexing:

In [155]:
# Create a 2D matrix
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
print(dataset)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


In [156]:
# Change row 2, column 5 to the value 100
dataset[1, 4] = 100
dataset

array([[  1,   2,   3,   4,   5],
       [  6,   7,   8,   9, 100]])

You can also use logical indexing to set array values:

In [157]:
# Set every value > 3 to 0
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
dataset[dataset > 3] = 0 
dataset

array([[1, 2, 3, 0, 0],
       [0, 0, 0, 0, 0]])

And, finally, you can use the `where()` method:

In [158]:
# Set all values < 7 to -1
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
dataset[np.where(dataset < 7)] = -1
dataset

array([[-1, -1, -1, -1, -1],
       [-1,  7,  8,  9, 10]])
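
`np.where()` also has a three-argument form, `np.where(condition, value_if_true, value_if_false)`, which builds a new array instead of modifying the original. A minimal sketch (the name `replaced` is ours):

```python
import numpy as np

dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

# Where the condition is True use -1, otherwise keep the value from dataset
replaced = np.where(dataset < 7, -1, dataset)

print(replaced)
print(dataset)   # the original array is unchanged
```

This form can be handy when you want to keep the original data intact.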

#### Any / All

`any()` and `all()` are two methods which determine whether an array satisfies a logical condition. `any()` is `True` if any element in the array satisfies the condition. `all()` is `True` if all elements of the array satisfy the condition. Examples:

In [159]:
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

In [160]:
# Any values equal to 0?
np.any(dataset == 0)

False

In [161]:
# All values NOT equal to 0?
np.all(dataset != 0)

True
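
Both functions also accept an `axis` argument, so the test can be applied per row or per column rather than to the whole array. A short sketch (variable names are ours):

```python
import numpy as np

dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

# Does each row contain at least one value > 5?
row_has_large = np.any(dataset > 5, axis=1)
print(row_has_large)        # [False  True]

# Is every value in each column > 1?
col_all_above_one = np.all(dataset > 1, axis=0)
print(col_all_above_one)    # [False  True  True  True  True]
```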

#### Mathematical methods

Arrays have a large number of built-in mathematical methods. Examples include `sum()` and `mean()`. They can also be used for multi-dimensional algebraic operations, such as matrix multiplication and dot products. Here are a small number of examples:

In [162]:
my_data = np.arange(9).reshape((3,3))
my_data

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [None]:
# Multiplication by scalar
my_data * 3

In [None]:
# Addition by vector
my_vector = np.array([5, 10, 20])
my_data + my_vector

In [None]:
# Sum of values
my_data.sum()
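
As a sketch of what the three operations above produce, the block below repeats them on the same 3x3 array and stores the results (the names `scaled`, `shifted`, and `total` are ours):

```python
import numpy as np

my_data = np.arange(9).reshape((3, 3))

scaled = my_data * 3                          # every element is multiplied by 3
shifted = my_data + np.array([5, 10, 20])     # the vector is added to each row (broadcasting)
total = my_data.sum()                         # sum of all nine values

print(scaled)
print(shifted)
print(total)   # 36
```

In the vector addition, numpy "broadcasts" the 3-element vector across each of the three rows, so column 0 gains 5, column 1 gains 10, and column 2 gains 20.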

In [163]:
# Mean of each row (axis=1 averages across the columns) -- "axis" specifies the dimension index
my_data.mean(axis=1)

array([1., 4., 7.])

In [166]:
# Max of each column (axis=0 takes the max down the rows)
my_data.max(axis=0)

array([6, 7, 8])

In [167]:
# Transposition
my_data.transpose()

array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])

In [168]:
# Make new dataset
my_data2 = np.arange(100, 109).reshape((3,3))  # 3x3 matrix of the values 100-108

# Compute the dot product (for 2D arrays, this is matrix multiplication)
dot_prod = np.dot(my_data, my_data2)
dot_prod

array([[ 315,  318,  321],
       [1242, 1254, 1266],
       [2169, 2190, 2211]])
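
For 2D arrays, the `@` operator gives the same result as `np.dot()`, whereas `*` multiplies element by element. A small sketch contrasting the two (the names `matmul_result` and `elementwise` are ours):

```python
import numpy as np

my_data = np.arange(9).reshape((3, 3))
my_data2 = np.arange(100, 109).reshape((3, 3))

matmul_result = my_data @ my_data2    # matrix multiplication
elementwise = my_data * my_data2      # element-by-element product (NOT a matrix product)

print(np.array_equal(matmul_result, np.dot(my_data, my_data2)))   # True
print(elementwise[0])                                             # [  0 101 204]
```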

In [177]:
# Compute the pearson correlation of two 1d arrays
arr1 = np.array([1, 5, 6, 6, 7, 10])
arr2 = np.array([3, 3, 4, 3, 6, 9])
np.corrcoef(arr1, arr2)  # Correlation is ~.809

array([[1.        , 0.80873372],
       [0.80873372, 1.        ]])

## Pandas Series and DataFrame

`pandas` is arguably the most important library for data science in Python. It provides both the `Series` and `DataFrame` objects, along with a large number of methods for working with them. Under the hood, it uses `numpy`, so many `array` methods also work with `pandas` objects. In this section, we will discuss the `Series` object and the `DataFrame` object, then introduce some core methods for working with them.

### Pandas `Series`

Similar to the 1D array, a pandas `Series` is an array where every element can have a name. See this example:

In [183]:
!pip install pandas



In [184]:
import pandas as pd

In [185]:
my_data = pd.Series(data={
    'one': 1,
    'two': 2,
    'three': 3
})
my_data

one      1
two      2
three    3
dtype: int64

Similar to a dictionary, the values can be accessed using the names:

In [181]:
my_data['two']

2

And similar to an `array`, the values can also be accessed using numbers and booleans:

In [186]:
# Access the first element by position (.iloc makes it explicit that 0 is a position, not a label)
my_data.iloc[0]

1

In [187]:
# Access the element(s) which equals 3
my_data[my_data == 3]

three    3
dtype: int64
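
The three styles of access can be written explicitly with `.loc` (by name) and `.iloc` (by position). The following is a minimal sketch using the same `Series` as above; the variable names `by_label`, `by_position`, and `subset` are ours:

```python
import pandas as pd

my_data = pd.Series(data={'one': 1, 'two': 2, 'three': 3})

by_label = my_data.loc['two']       # label-based access
by_position = my_data.iloc[0]       # position-based access
subset = my_data[my_data > 1]       # a boolean mask keeps the element names

print(by_label, by_position)    # 2 1
print(subset.index.tolist())    # ['two', 'three']
```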

### Pandas `DataFrame`

The `DataFrame` is an extremely powerful datatype in Python, and it is used ubiquitously throughout pythonic data science. A `DataFrame` is always a 2-dimensional, table-like object which contains named columns and rows.

In [188]:
my_df = pd.DataFrame(data={
    'col_one': range(1, 5),
    'col_two': range(11, 15),
    'col_three': range(21, 25)
}, index = [
    'row_one', 'row_two', 'row_three', 'row_four'
])
my_df

           col_one  col_two  col_three
row_one          1       11         21
row_two          2       12         22
row_three        3       13         23
row_four         4       14         24


Methods for `DataFrame` objects are numerous and can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html). In this module, we will only discuss the following:

1. Difference between this and numpy array
2. Indexing / naming
3. Accessing data (iloc vs loc vs []) / setting data
4. Basic plotting
5. Reading / Writing to file
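
As a minimal sketch of the third item, here is how `[]`, `loc`, and `iloc` behave on the `my_df` object from above (rebuilt here so the block is self-contained; the variable names `column`, `by_label`, and `by_position` are ours):

```python
import pandas as pd

my_df = pd.DataFrame(data={
    'col_one': range(1, 5),
    'col_two': range(11, 15),
    'col_three': range(21, 25)
}, index=['row_one', 'row_two', 'row_three', 'row_four'])

column = my_df['col_two']                      # [] selects a column by name
by_label = my_df.loc['row_three', 'col_one']   # loc uses row and column names
by_position = my_df.iloc[2, 0]                 # iloc uses integer positions

# Setting data works through the same accessors
my_df.loc['row_one', 'col_three'] = 0

print(column.tolist())          # [11, 12, 13, 14]
print(by_label, by_position)    # 3 3
```

Here `loc` and `iloc` point at the same cell, once by name and once by position, which is why both return 3.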

<hr>

# Other programming concepts in Python

We will also briefly discuss control flow and functions in python. While these are useful techniques for python programming, they are not necessary for most typical data science activities in python. These are the topics which we will now summarize:

1. If...elif...else
2. Loops
3. Function definitions

## If...elif...else

These statements indicate code blocks that will only be executed given that a logical condition is met.

### If statements

`if` statements in python create a logic gate, such that some code will only execute if a logical condition is met. See an example here:

In [191]:
a = 1
b = 1

if a == b:
    # Execute this code only if a == b is True
    print("a is equal to b!")

a is equal to b!


The above example shows an `if` statement. The code in this statement only executes when the condition (`a == b`) is `True`. **Challenge:** Can you modify the above block so that the code will not execute?

### If...else statements

`else` statements are executed if no previous conditions are satisfied. In other words, the `else` block executes only if none of the preceding conditions are `True`.

In [192]:
a = 1
b = 2

if a == b:
    print("a is equal to b!")
else:
    print("a is NOT equal to b!")

a is NOT equal to b!


### If...elif...else statements

`elif` is a keyword that means "else if". An `elif` condition is tested only if the preceding logical conditions are not satisfied.

In [193]:
grade = 68

if grade > 90:
    # Only executes if grade > 90
    letter_grade = "A"
elif grade > 80:
    # Only executes if grade > 80 and grade <= 90
    letter_grade = "B"
elif grade > 70:
    # Only executes if grade > 70 and grade <= 80
    letter_grade = "C"
elif grade >= 60:
    # Only executes if grade >= 60 and grade <= 70
    letter_grade = "D"
else:
    # Only executes if grade < 60
    letter_grade = "F"
    
print("Student earned a grade of " + letter_grade)

Student earned a grade of D


In the above example, each logical condition is tested in sequence. Only when a condition is not met is the next one tested. If a student has a grade of `68`, then every `elif` statement will be tested. If the student had a `96`, then no `elif` statements would have been tested.
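
To see this sequential testing in action, the sketch below records which conditions are actually evaluated. The `check` helper is a hypothetical function defined only for this illustration (functions are covered later in this module):

```python
tested = []  # records which conditions were actually evaluated

def check(label, condition):
    # Hypothetical helper: remember the label, then pass the condition through
    tested.append(label)
    return condition

grade = 96

if check("grade > 90", grade > 90):
    letter_grade = "A"
elif check("grade > 80", grade > 80):
    letter_grade = "B"
else:
    letter_grade = "C or lower"

print(letter_grade)   # A
print(tested)         # ['grade > 90']
```

Because the first condition is `True`, the `elif` condition is never evaluated. Changing `grade` to 68 would cause both conditions to be recorded.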

## Loops

Loops allow for some code to be applied to every element of an iterable object, such as a list. 

### For loops

For loops are a type of finite loop in python (as opposed to `while` loops which we will not discuss here). A for loop iterates over an iterable object, such as a `list` or `tuple`. For every element of the object, code will be executed in succession. Here is an example:

In [194]:
grades = [85, 98, 45, 73]

# Loop over list of grades and print letter grade
for grade in grades:
    if grade > 90:
        # Only executes if grade > 90
        letter_grade = "A"
    elif grade > 80:
        # Only executes if grade > 80 and grade <= 90
        letter_grade = "B"
    elif grade > 70:
        # Only executes if grade > 70 and grade <= 80
        letter_grade = "C"
    elif grade >= 60:
        # Only executes if grade >= 60 and grade <= 70
        letter_grade = "D"
    else:
        # Only executes if grade < 60
        letter_grade = "F"

    print("Student earned a grade of " + letter_grade)


Student earned a grade of B
Student earned a grade of A
Student earned a grade of F
Student earned a grade of C


Rather than using the list of grades directly, it may be useful to use the numerical indices of list elements. For example:

In [198]:
students = ['alice', 'kevin', 'sara', 'tim']
grades = [85, 98, 45, 73]

for i in range(len(grades)):
    
    grade = grades[i]
    student = students[i]
    
    if grade > 90:
        # Only executes if grade > 90
        letter_grade = "A"
    elif grade > 80:
        # Only executes if grade > 80 and grade <= 90
        letter_grade = "B"
    elif grade > 70:
        # Only executes if grade > 70 and grade <= 80
        letter_grade = "C"
    elif grade >= 60:
        # Only executes if grade >= 60 and grade <= 70
        letter_grade = "D"
    else:
        # Only executes if grade < 60
        letter_grade = "F"

    print(student + " earned a grade of " + letter_grade)


alice earned a grade of B
kevin earned a grade of A
sara earned a grade of F
tim earned a grade of C
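
As an alternative sketch (not the approach used above), Python's built-in `zip()` and `enumerate()` can pair up elements without indexing by hand:

```python
students = ['alice', 'kevin', 'sara', 'tim']
grades = [85, 98, 45, 73]

# zip() yields one (student, grade) pair per iteration
for student, grade in zip(students, grades):
    print(student + " scored " + str(grade))

# enumerate() yields the index together with the element
for i, grade in enumerate(grades):
    print(i, grade)
```

Note that `zip()` stops at the end of the shorter list, so both lists should have the same length.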


## Functions

Functions are objects in Python which take an input, perform computations, and produce an output. (A function that belongs to an object is called a "method".) Functions have arguments, which are the inputs that the function needs to operate correctly:

In [199]:
def square_it(x):
    print(x ** 2)
    
square_it(5)

25


Functions can also return a value. This is more common in python programming than simply printing the value:

In [200]:
def square_it(x):
    return x ** 2
    
result = square_it(5)
print(result)

25
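
Functions can take more than one argument, and arguments can be given default values. The following sketch uses a made-up function, `raise_to_power`, to illustrate this:

```python
def raise_to_power(x, exponent=2):
    """Return x raised to `exponent` (defaults to squaring)."""
    return x ** exponent

print(raise_to_power(5))                 # 25
print(raise_to_power(5, 3))              # 125
print(raise_to_power(2, exponent=10))    # 1024
```

When `exponent` is omitted, the default of 2 is used, so the first call behaves like `square_it()`.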


**Challenge problem:** Create a function with one argument, `grade`. The function should convert `grade` to a letter grade and return this to the user. Then, use this function to simplify the for loop from earlier.

**Challenge problem 2:** Create a function with one argument, `grades`, that is a list of numerical grades. The function should convert all elements of `grades` to letter grades and return this to the user.

TODO: Maybe list comprehension goes here