# Module \#2 - Intermediate Python

Topics to discuss:
1. Recap of lists
2. Control flow (If else, loops)
3. Functions
4. Lambda function
5. List comprehension
6. Dictionaries 
7. Modules
8. Numpy arrays
    - Random
9. Pandas dataframes

Practice problems ideas:
- Let's do some basic fill-in-the-blank problems based on previous exercises
- Need to write a function -- given a radius, give the circumference and area

Challenge problem:
- Generate two random numbers between -1 and 1; pick the points whose distance from the center is less than 1

Challenge problem:
- Given a text file, analyze its contents and create a frequency table for all the words used. 
- Given a text file, create a pair frequency table

Super challenge problem:
- Given a word, use the pair frequency table to generate a stream of 100 words. 

# Lists, continued

Recall that lists are simple to construct using `[]` and can be accessed numerically!

In [3]:
# Construct a list using []
my_lst = ["a", "b", "c"]
my_lst

['a', 'b', 'c']

In [4]:
# Access the first element of a list using the index 0
my_lst[0]

'a'

Recall also that you can use slice notation (`start:stop:step`) which allows you to access a range of elements by an interval. 

In [7]:
# Construct a mixed-type list
my_lst = ["a", "b", "c", 1, 2, 3, True, False]
my_lst

['a', 'b', 'c', 1, 2, 3, True, False]

In [10]:
# Access the 2nd through 4th elements
my_lst[1:4]  # Remember that the slice does NOT include the stop index

['b', 'c', 1]

In [11]:
# Access the 3rd element through the end in intervals of 2
my_lst[2::2]

['c', 2, True]

In [12]:
# Reverse the order of the list
my_lst[::-1]

[False, True, 3, 2, 1, 'c', 'b', 'a']

In [46]:
# Get the length of a list with len()
len(my_lst)

8

## in
Sometimes, you want to test whether a value exists within a list. To do this, we use the `in` operator. 

In [30]:
my_lst = ["Hello world!", 200, int("1"), [True, False, [1, 0, bool(0)]], 2**4, "This is a list!", 2+2j]
my_lst

['Hello world!',
 200,
 1,
 [True, False, [1, 0, False]],
 16,
 'This is a list!',
 (2+2j)]

In [31]:
# 200 is inside of my_lst?
200 in my_lst

True

In [32]:
# "Hello list!" is inside of my_lst
"Hello list!" in my_lst

False

**Challenge problem:** What would the following produce? (Without running it)

```python
22 % 7 in my_lst
```

### Range

`range` allows you to easily construct a list-like object containing the integers between two numbers. 

Ranges follow the pattern `range( start, stop, step )`.

In [34]:
range(0, 100)

range(0, 100)

In [39]:
range(10, 20, 3)

range(10, 20, 3)

In [35]:
20 in range(0, 100)

True

Ranges can be easily converted to a `list`. **Note** that a range does not include its `stop` value. 

In [42]:
list(range(0, 10, 1))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [43]:
list(range(20, 30, 2))

[20, 22, 24, 26, 28]

In [47]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

### String - List Conversion

In python, strings behave much like a list of characters. You can access individual characters using the same index and slice notation as lists:

In [55]:
my_str = "Hello world!"
my_str[4]

'o'

In [72]:
my_str[3::2]

'l ol!'

Additionally, strings containing a *separator* can be broken up into a list of strings using `.split()`. This is useful, for example, when trying to parse the text of a document.

In [56]:
sentence = "It was a dark and stormy night."
sentence.split(sep=" ")

['It', 'was', 'a', 'dark', 'and', 'stormy', 'night.']

You can also rejoin a list into a string using `.join()`. 

Notice that `.join()` is a method belonging to objects of the `str` class. 

In [65]:
words = ['It', 'was', 'a', 'dark', 'and', 'stormy', 'night.']
" ".join(words)

'It was a dark and stormy night.'

Any arbitrary `str` can use the `.join()` method!

In [66]:
"(-_-)".join(words)

'It(-_-)was(-_-)a(-_-)dark(-_-)and(-_-)stormy(-_-)night.'
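As a further illustration, `.split()` and `.join()` can be combined to swap one separator for another. This is only a minimal sketch; the string `csv_line` and the other variable names are made up for this example and are not part of the workshop data.

```python
# Hypothetical comma-separated record (illustration only)
csv_line = "alice,85,kevin,98"

# Break the string into a list of strings on the comma separator
fields = csv_line.split(",")
print(fields)

# Rejoin the same pieces with a different separator
piped_line = " | ".join(fields)
print(piped_line)

# split() always returns strings, so numbers must be converted explicitly
first_grade = int(fields[1])
print(first_grade + 5)
```

Note that `fields[1]` is the string `'85'`, not the number `85`, until it is passed through `int()`.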

<hr>

# Other programming concepts in Python

We will also briefly discuss control flow and functions in python. While these are useful techniques for python programming, they are not necessary for most typical data science activities in python. These are the topics which we will now summarize:

1. If...elif...else
2. Loops
3. Function definitions

## If...elif...else

These statements indicate code blocks that will only be executed given that a logical condition is met.

### If statements

`if` statements in python create a logic gate, such that some code will only execute if a logical condition is met. See an example here:

In [73]:
a = 1
b = 1

if a == b:
    # Execute this code only if a == b is True
    print("a is equal to b!")

a is equal to b!


The above example shows an `if` statement. The code in this statement only executes when the condition (`a == b`) is `True`. **Challenge:** Can you modify the above block so that the code will not execute?

### If...else statements

`else` statements are executed if no previous conditions are satisfied. In other words, only if none of the preceding `if` conditions is `True` will the `else` block execute.

In [75]:
a = 1
b = 2

if a == b:
    print("a is equal to b!")
else:
    print("a is NOT equal to b!")

a is NOT equal to b!


**Challenge question:** What will the following print? (Without running it yourself)

```python
a = [1, 2, 3]
cond = a[1] == 2
if cond:
    print("Yes")
else:
    print("No")
```

In [74]:
a = [1, 2, 3]
cond = a[1] == 2
if cond:
    print("Yes")
else:
    print("No")

Yes


### If...elif...else statements

`elif` is a keyword that means "else if". This means that the statement is tested only if the preceding logical conditions are not satisfied. 

In [None]:
grade = 78

if grade > 90:
    # Only executes if grade > 90
    letter_grade = "A"
elif grade > 80:
    # Only executes if grade > 80 and grade <= 90
    letter_grade = "B"
elif grade > 70:
    # Only executes if grade > 70 and grade <= 80
    letter_grade = "C"
elif grade >= 60:
    # Only executes if grade >= 60 and grade <= 70
    letter_grade = "D"
else:
    # Only executes if grade < 60
    letter_grade = "F"
    
print("Student earned a grade of " + letter_grade)

In the above example, each logical condition is tested in sequence. Only when a condition is not met is the next one tested. If a student has a grade of `68`, then every `elif` statement will be tested. If the student had a `96`, then no `elif` statements would have been tested.
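Because conditions are tested in sequence and the first `True` condition wins, the order of the conditions matters. The sketch below is a simplified, hypothetical variation on the grading example (the variable names `ordered_result` and `reversed_result` are made up) that runs the same two tests in both orders:

```python
grade = 96

# Conditions ordered from most to least restrictive (as in the example above)
if grade > 90:
    ordered_result = "A"
elif grade > 80:
    ordered_result = "B"
else:
    ordered_result = "lower than B"

# Same conditions in the opposite order: grade > 80 is already True for 96,
# so the more restrictive grade > 90 test is never reached
if grade > 80:
    reversed_result = "B"
elif grade > 90:
    reversed_result = "A"
else:
    reversed_result = "lower than B"

print(ordered_result)   # prints A
print(reversed_result)  # prints B
```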

**Challenge question:** What will the following print? (Without running it yourself)

```python

trees = ['pine', 'spruce', 'fir', 'oak', 'cherry']

if 'pi' + ' ne' in trees:
    print("Pine Trees!")
elif ",".join(['fir', 'oak']) == "fir, oak":
    print("Fir Trees!")
elif "".join(['ry', 'cher'][::-1]) in trees:
    print("Cherry Trees!")
else:
    print("No trees!")
             
```

In [76]:
trees = ['pine', 'spruce', 'fir', 'oak', 'cherry']

if 'pi' + ' ne' in trees:
    print("Pine Trees!")
elif ",".join(['fir', 'oak']) == "fir, oak":
    print("Fir Trees!")
elif "".join(['ry', 'cher'][::-1]) in trees:
    print("Cherry Trees!")
else:
    print("No trees!")

Cherry Trees!


In [77]:
",".join(['fir', 'oak'])

'fir,oak'

## Loops

Loops allow for python code to be applied to every element of an iterable object, such as a list.

### For loops

For loops are a type of finite loop in python (as opposed to `while` loops which we will not discuss here). A for loop iterates over an iterable object, such as a `list` or `tuple`. For every element of the object, code will be executed in succession. Here is an example:

In [78]:
# Loop through a list of 1 through 10
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for number in numbers:
    print(number)

1
2
3
4
5
6
7
8
9
10


As the loop iterates, it assigns each element of `numbers` to the variable `number` and then runs the code within the loop. For example, we can add 10 to `number`:

In [80]:
for number in numbers:
    print(number + 10)

11
12
13
14
15
16
17
18
19
20


Loops can be a relatively convenient way to add numbers to a list using the `.append()` method. For example:

In [50]:
new_numbers = list()
for number in numbers:
    new_numbers.append(number + 10)
    
new_numbers

[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

### Combining loops and if / else

In [105]:
# Combining loops and if...else
grades = [85, 98, 45, 73]

# Loop over list of grades and print letter grade
for grade in grades:
    if grade > 90:
        # Only executes if grade > 90
        letter_grade = "A"
    elif grade > 80:
        # Only executes if grade > 80 and grade <= 90
        letter_grade = "B"
    elif grade > 70:
        # Only executes if grade > 70 and grade <= 80
        letter_grade = "C"
    elif grade >= 60:
        # Only executes if grade >= 60 and grade <= 70
        letter_grade = "D"
    else:
        # Only executes if grade < 60
        letter_grade = "F"

    print("Student earned a grade of " + letter_grade)


Student earned a grade of B
Student earned a grade of A
Student earned a grade of F
Student earned a grade of C


**Challenge problem:** What does the following produce? (Without running it)

```python

fruits = ['apple', 'banana', 'pear', 'kiwi', 'orange', 'tomato', 'starfruit', 'grape']
new_fruits = []
for fruit in fruits:
    if "i" in fruit:
        new_fruits.append(fruit)
    elif "g" in fruit:
        if "p" in fruit:
            new_fruits.append(fruit)
            
print(new_fruits)

```

### Integer indices instead of direct for loops

Rather than looping over the elements of a list directly, it may be useful to loop over the numerical indices of the list elements. For example:

In [44]:
# Loop through a list of letters
letters = ["a", "b", "c", "d"]

for letter in letters:
    print(letter)

a
b
c
d


In [45]:
range(len(letters))

range(0, 4)

In [None]:
for i in range(len(letters)):
    letter = letters[i]
    print(letter)

While this may seem more complicated, there are many situations in which this is necessary! For example:

In [None]:
students = ['alice', 'kevin', 'sara', 'tim']
grades = [85, 98, 45, 73]

for i in range(len(grades)):
    
    grade = grades[i]
    student = students[i]
    
    if grade > 90:
        # Only executes if grade > 90
        letter_grade = "A"
    elif grade > 80:
        # Only executes if grade > 80 and grade <= 90
        letter_grade = "B"
    elif grade > 70:
        # Only executes if grade > 70 and grade <= 80
        letter_grade = "C"
    elif grade >= 60:
        # Only executes if grade >= 60 and grade <= 70
        letter_grade = "D"
    else:
        # Only executes if grade < 60
        letter_grade = "F"

    print(student + " earned a grade of " + letter_grade)
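Python's built-in `zip()` and `enumerate()` functions offer an alternative to `range(len(...))` when walking through lists in parallel. The following is a minimal sketch of that alternative, not the approach used in the cell above; the `passed` and `positions` lists are made up for illustration.

```python
students = ['alice', 'kevin', 'sara', 'tim']
grades = [85, 98, 45, 73]

# zip() pairs up the elements of both lists position by position
passed = []
for student, grade in zip(students, grades):
    if grade >= 60:
        passed.append(student)
print(passed)

# enumerate() yields (index, element) pairs, so range(len(...)) is not needed
positions = []
for i, student in enumerate(students):
    positions.append(str(i) + ": " + student)
print(positions)
```

Integer indices are still the right tool when you need to look backwards or forwards in a list (for example, comparing element `i` with element `i - 1`).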


### List comprehension

For loops over large collections of numbers are often slower and more verbose in `Python` than the alternatives, such as the vectorized `numpy` operations covered later in this module. One loop-like pattern that remains very common in data science is the **list comprehension**. 

**List comprehension** is a *pythonic* coding pattern used for building a new list by performing an action on each element of an existing one. It is usually somewhat faster than the equivalent for loop with `.append()` and takes fewer lines to write. 

It usually takes the following form:

```python
[ modify_value(value) for value in values ]
```
This takes every element of `values`, modifies it using the `modify_value()` function, and returns a list of modified values in the same order and of the same length as `values`.

For example we can simply return the `number` for every `number` in `range(1, 11)` like so:

In [107]:
# Build a list of the numbers 1 through 10 with a list comprehension
[number for number in range(1, 11)]  # Returns a list

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

We could also modify these numbers:

In [108]:
[number + 10 for number in numbers]  # Returns a value

[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

**Challenge question:** Write a list comprehension which converts each element of `fruits` to uppercase.

```python
fruits = ['apple', 'banana', 'pear', 'kiwi', 'orange', 'tomato', 'starfruit', 'grape']

# List comprehension goes here...

```


We can even add conditionals! Such as if...else statements:

In [109]:
[-number if number % 2 == 0 else number for number in numbers]  # Returns a value

[1, -2, 3, -4, 5, -6, 7, -8, 9, -10]
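A conditional can appear in two different positions in a list comprehension, and the two positions do different things. The sketch below contrasts them using the same `numbers` list; the names `flipped` and `evens` are made up for illustration.

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# if...else BEFORE "for": every element is kept, but some are transformed
flipped = [-number if number % 2 == 0 else number for number in numbers]

# if AFTER "for" (no else): elements that fail the test are dropped
evens = [number for number in numbers if number % 2 == 0]

print(flipped)
print(evens)
print(len(flipped), len(evens))
```

The first form always returns a list of the same length as the input, while the second form acts as a filter and may return a shorter list.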

**Challenge question:** Write a list comprehension which converts each element of `fruits` to uppercase only if it contains the letter "e".

```python
fruits = ['apple', 'banana', 'pear', 'kiwi', 'orange', 'tomato', 'starfruit', 'grape']

# List comprehension goes here...

```


## Functions

Functions are objects in python which take inputs (called arguments), perform computations, and optionally return an output. For example, we can define a function, `square_it()`, which finds the square of any number:

In [112]:
def square_it(x):
    print(x ** 2)
    
type(square_it)

function

In [113]:
square_it(5)  # Gets 5 ** 2

25


Functions do not have to return a value. Because `square_it()` only prints the result, it doesn't return anything. Any variable assigned the output of `square_it()` will be `None`, a special object which represents the absence of a value. 

In [114]:
result = square_it(5)

25


In [115]:
print(result)

None


Interestingly, `None` is itself an object in Python, with its own type (`NoneType`). This allows you to test for it explicitly (for example with `result is None`), which can make solving certain coding problems easier. 

In [117]:
type(result)  # NoneType objects

NoneType

Functions can also return a value with the `return` statement. This is more common in python programming than simply printing the value:

In [119]:
def square_it(x):
    return x ** 2  # Return a value
    
result = square_it(5)

In [120]:
print(result)

25
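Functions can take more than one argument, and arguments can be given default values. The sketch below uses a hypothetical function, `raise_it()`, which is not part of the workshop code, to show how a default value is used when the caller leaves an argument out.

```python
def raise_it(x, power=2):
    """Return x raised to `power`; `power` defaults to 2 (squaring)."""
    return x ** power

print(raise_it(5))               # uses the default power of 2
print(raise_it(5, 3))            # a positional argument overrides the default
print(raise_it(x=2, power=10))   # arguments can also be passed by name
```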


**Challenge problem:** Create a function with one argument, `grade` (an `int`). The function should convert `grade` to a letter grade and return this to the user.

**Challenge problem 2**: Using list comprehension, convert every element of `grades` with the function from the previous problem. 

Here is `grades`:

```python
grades = [85, 98, 45, 73, 35, 62, 67, 72, 92, 38, 88]
```

<hr>

# Complex objects in python, Continued

Let's continue on with our discussion of complex objects in Python!

Object types:
1. ~~Lists~~
2. Dictionaries
3. Tuples
4. Sets
5. *Modules/Packages*
6. Numpy arrays
7. Pandas DataFrames

## Dictionaries

Dictionaries are one of the core data types of the python language. Unlike lists, dictionaries are accessed using keys rather than numerical indices. (Since Python 3.7, a dictionary remembers the order in which its keys were inserted, but its elements still cannot be accessed by position.) This workshop will not describe dictionaries in detail, but you can refer to the W3 schools guide [here](https://www.w3schools.com/python/python_dictionaries.asp) for more info. 

In [121]:
# Create a dict using key-value pairs between {}
my_dict = {
    'hello': 1
}
my_dict

{'hello': 1}

In [122]:
# Access the value in a dict using keys
my_dict['hello']

1

In [123]:
# Dictionaries can have numerical, string, and boolean keys. They can hold any number of arbitrary object types.
my_dict = {
    'hello': 1,
    'world': True,
    123: [1, 2, ["Hello world"]],
    True: {
        "New": True
    }
}
my_dict

{'hello': 1, 'world': True, 123: [1, 2, ['Hello world']], True: {'New': True}}

In [124]:
my_dict[True]

{'New': True}
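A few more everyday dictionary operations are sketched below: adding and updating keys, testing for a key, reading a key safely, and looping over key-value pairs. The `inventory` dictionary is hypothetical example data, not part of the workshop material.

```python
# Hypothetical example data (illustration only)
inventory = {'apple': 3, 'pear': 0}

# Add a new key or overwrite an existing one with assignment
inventory['kiwi'] = 12
inventory['pear'] = 5

# Test for a key with `in` (this checks keys, not values)
has_apple = 'apple' in inventory

# .get() returns a fallback instead of raising KeyError for a missing key
banana_count = inventory.get('banana', 0)

# Loop over key-value pairs with .items()
for fruit, count in inventory.items():
    print(fruit, count)
```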

## Sets and Tuples

Sets and tuples are also used in python. Discussing them is outside the scope of this workshop. For more information please refer to the following:

1. Sets - [here](https://www.w3schools.com/python/python_sets.asp)
2. Tuples - [here]()

## Modules and Packages

### Modules

Modules are python script files (`*.py`) which are imported into python. Typically, they contain functions and/or classes which are useful for your code. We can import modules like so:

In [132]:
import builtins

`builtins` is a generic python module which contains many core functions such as `print()`. If we check the `type()` of `builtins`, we see that it is a `module` object:

In [133]:
type(builtins)

module

If we import `builtins` as a module, we can then use the functions within that module, like so:

In [134]:
builtins.print("Hello world!")

Hello world!


It might get inconvenient to keep typing `builtins` every time we want to use the `builtins.print()` function. Instead, we can import `builtins` under an alias which is easier to type:

In [137]:
import builtins as btns

btns.print("Hello world!")

Hello world!


Sometimes we only want to use a small number of functions from a module. So, instead of importing the whole module, we might instead just import those functions directly using `from ... import ...`:

In [138]:
from builtins import print

print("Hello world!")

Finally, we can even give an imported function an alias:

In [147]:
from builtins import print as pnt

pnt("Hello world!")

Hello world!
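The same import styles apply to any module. As a slightly more practical sketch, here they are applied to `math`, a module from the Python standard library (chosen here only for illustration):

```python
# Import the whole module and access names through it
import math
print(math.sqrt(16))

# Import the module under a shorter alias
import math as m
print(m.floor(2.7))

# Import a single name directly
from math import pi
print(round(pi, 2))
```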


### Packages and CLI usage in Jupyter

Packages are collections of modules typically based on a shared purpose. They are typically installed using a package manager such as `pip` from the command line, like so:

```shell
pip install <name_of_package>
```

However, we are in a notebook and not on the command line! How do we install packages? Fortunately, Jupyter Notebook allows us to run arbitrary command-line commands by placing the `!` symbol at the beginning of a line. For example, on the CLI, you can write "Hello world!":

In [144]:
!echo "Hello world!"

"Hello world!"


This is equivalent to opening command prompt (windows) or terminal (macOS) and typing:

```shell
echo "Hello world!"
```

This capability is also very useful when you need to install python packages using `pip`, which is typically done from the command line. Instead, we can install packages from within Jupyter like so:

In [145]:
!pip install numpy



If `numpy` is already installed, `pip` reports that the requirement is already satisfied. Otherwise, it downloads and installs the package. 
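It can also be handy to check from within Python whether a package is importable before trying to install it. The following is a minimal sketch using the standard library's `importlib`; the helper name `is_installed()` is made up for this example. Keep in mind that the name used with `import` is not always the same as the name used with `pip install`.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    # find_spec() returns None when no top-level module with that name is found
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))                    # standard library module
print(is_installed("not_a_real_package_xyz"))  # hypothetical name, expected to be missing
```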

## Numpy arrays

`numpy` is a python package that provides complex data types for performing mathematical operations. In particular, numpy provides the `array` data type which is similar to the `matrix` in R. 

Before arrays can be constructed it is necessary to install the numpy library in Python (if you don't already have it) and load it into Python:

In [149]:
import numpy as np

In [150]:
type(np)

module

Just like other objects, modules have properties and methods. We typically load modules into python because we want to use the functions they contain. As a reminder, you can access an object's methods using the `<object>.<method>()` notation. For the `numpy` module, the function we are most interested in is `array()`, which we can use to construct an `array` object.

Arrays are similar to lists, except that they are specifically designed for holding only one type of data, typically numerical data. 

In [154]:
# Create a 1-dimensional (1d) array holding the values 1, 2, and 3
np.array([1, 2, 3])

array([1, 2, 3])
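Because an array holds only one type of data, `numpy` converts mixed inputs to a single common type. The sketch below illustrates this with the `dtype` property; the variable names are made up for the example.

```python
import numpy as np

int_arr = np.array([1, 2, 3])
print(int_arr.dtype)      # an integer dtype (exact width depends on the platform)

# Mixing ints and floats: every element is stored as a float
mixed_arr = np.array([1, 2.5, 3])
print(mixed_arr)
print(mixed_arr.dtype)

# Mixing numbers and strings: every element is stored as a string
str_arr = np.array([1, "two", 3])
print(str_arr)
```

This is a key difference from lists, which can hold elements of different types side by side.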

We can create 2-dimensional arrays by adding lists of lists:

In [155]:
# Construct a 2d array
np.array([
    [1, 2, 3],
    [4, 5, 6]
])

array([[1, 2, 3],
       [4, 5, 6]])

We can even create a 3-dimension array (and beyond) using lists of lists of lists (etc). 

In [156]:
# Construct a 3d array
np.array([
    [
        [1, 2, 3],
        [4, 5, 6]
    ],
    [
        [7, 8, 9],
        [10, 11, 12]
    ]
])

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

### Numpy array methods

Many methods are available for `array` objects. An exhaustive reference is available [here](https://numpy.org/doc/stable/reference/index.html). For now, we will discuss a few key methods:

1. Creation
2. Shape and dimensions
3. Accessing elements
4. Setting elements
5. any / all 
6. Mathematical operations

TODO: Include short section on using magic blocks

#### Creation

Numpy arrays are created in multiple ways. The simplest involves the use of lists (shown above):

In [127]:
my_arr = np.array([
    [True, False, False],
    [False, True, False]
])
my_arr

array([[ True, False, False],
       [False,  True, False]])

In [128]:
type(my_arr)

numpy.ndarray

Arrays can also be created using the `arange()` function. Like `range()`, it follows the pattern `arange(start, stop, step)` and creates a sequential `array` that does not include the `stop` value:

In [132]:
# Create a 1d integer array from 0-9
my_arr = np.arange(10)
my_arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [131]:
# Create a 1d integer array from 10-19
my_arr = np.arange(10, 20)
my_arr

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [134]:
# Create a 1d integer array from 10-95 in steps of 5
my_arr = np.arange(10, 100, 5)
my_arr

array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90,
       95])

In [133]:
# Create a 1d float array from 0.0-9.0
my_arr = np.arange(10.0)
my_arr

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
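A few other creation functions are commonly used alongside `array()` and `arange()`. The sketch below shows `np.zeros()`, `np.ones()`, and `np.linspace()`; the variable names are made up for illustration.

```python
import numpy as np

# 2x3 array filled with zeros (floats by default)
zeros = np.zeros((2, 3))

# 1d array of four ones
ones = np.ones(4)

# Five evenly spaced values from 0 to 1, INCLUDING the end point
spaced = np.linspace(0, 1, 5)

print(zeros)
print(ones)
print(spaced)
```

Unlike `arange()`, `linspace()` takes the number of points rather than the step size, and it includes the stop value by default.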

#### Shape and dimensions

`numpy` arrays have a number of dimensions and a shape. Note that these are properties, not methods. They are accessed using this pattern: `<object>.<property_name>` as follows:

In [135]:
# Construct 2d array
my_arr = np.array([
    [True, False, False],
    [False, True, False]
])

In [136]:
# Get number of dimensions property
my_arr.ndim

2

In [137]:
# Get the shape property (number of rows, number of columns)
my_arr.shape

(2, 3)

In [139]:
# Construct a 3d array
arr_3d = np.array([
    [
        [1, 2, 3],
        [4, 5, 6]
    ],
    [
        [7, 8, 9],
        [10, 11, 12]
    ]
])

# Get the shape (number of 2d arrays, number of rows, number of columns)
arr_3d.shape

(2, 2, 3)

Finally, the shape of an array can be altered using the `reshape()` method. This is particularly useful for quickly constructing arrays of a desired shape:

In [140]:
my_arr = np.arange(15)
my_arr = my_arr.reshape((5, 3))  # reshape() returns the reshaped array rather than modifying my_arr in place, so we re-assign using '='

In [141]:
my_arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

The above can be simplified in 1 line of code:

In [142]:
my_arr = np.arange(15).reshape((5, 3))
my_arr

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
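Two details of `reshape()` are worth knowing: one dimension can be given as `-1`, in which case `numpy` infers it, and the new shape must hold exactly the same number of elements as the original. A minimal sketch (the names `flat` and `three_cols` are made up):

```python
import numpy as np

flat = np.arange(12)

# -1 asks numpy to work out that dimension from the total number of elements
three_cols = flat.reshape((-1, 3))
print(three_cols.shape)

# The new shape must hold exactly the same number of elements
try:
    flat.reshape((5, 3))   # 15 slots for 12 elements
except ValueError as err:
    print("reshape failed:", err)
```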

#### Accessing elements

Elements can be accessed using several approaches. 

1. Numerical
2. Logical

For the **Numerical** approach, numerical indices are utilized using the pattern suited to their shape:

In [143]:
# For a 1D array, similar to list
my_arr = np.array([3, 8, 1, 5])
print(my_arr)
my_arr[1]  # Get the 2nd element of the first (and only) dimension

[3 8 1 5]


8

In [144]:
# For a 2D array, the pattern is array[row_index, column_index]
my_arr = np.array([
    [5, 7, 4, 6],
    [2, 1, 9, 8]
])
print(my_arr)
my_arr[0, 1]  # Row index 0 (1st row) and column index 1 (2nd column)

[[5 7 4 6]
 [2 1 9 8]]


7

In [176]:
# For an n-dimensional array, the pattern is the same, with one index per dimension: array[axis0_index, axis1_index, ..., axisN_index]
my_arr = np.arange(125).reshape((5, 5, 5))
print(my_arr)
my_arr[2, 3, 1]  # Index 2 on axis 0 (3rd matrix), index 3 on axis 1 (4th row), index 1 on axis 2 (2nd column)

[[[  0   1   2   3   4]
  [  5   6   7   8   9]
  [ 10  11  12  13  14]
  [ 15  16  17  18  19]
  [ 20  21  22  23  24]]

 [[ 25  26  27  28  29]
  [ 30  31  32  33  34]
  [ 35  36  37  38  39]
  [ 40  41  42  43  44]
  [ 45  46  47  48  49]]

 [[ 50  51  52  53  54]
  [ 55  56  57  58  59]
  [ 60  61  62  63  64]
  [ 65  66  67  68  69]
  [ 70  71  72  73  74]]

 [[ 75  76  77  78  79]
  [ 80  81  82  83  84]
  [ 85  86  87  88  89]
  [ 90  91  92  93  94]
  [ 95  96  97  98  99]]

 [[100 101 102 103 104]
  [105 106 107 108 109]
  [110 111 112 113 114]
  [115 116 117 118 119]
  [120 121 122 123 124]]]


66
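Numerical indexing also supports the slice notation used for lists, applied separately to each dimension. The sketch below pulls whole rows, whole columns, and a block out of a 2D array; the name `grid` and the other variable names are made up for this example.

```python
import numpy as np

grid = np.array([
    [5, 7, 4, 6],
    [2, 1, 9, 8]
])

first_row = grid[0, :]       # all columns of row index 0
second_col = grid[:, 1]      # all rows of column index 1
block = grid[:, 1:3]         # all rows, column indices 1 and 2

print(first_row)
print(second_col)
print(block)
```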

For the **Logical** approach, we can use a boolean array to extract the element(s) of interest:

In [147]:
num_arr = np.array([1, 2, 3])
bool_arr = np.array([False, False, True])
num_arr[bool_arr]  #  We access the element of num_arr for which bool_arr is True

array([3])

This approach is extremely powerful when you can use logical operations to create a boolean array:

In [148]:
# Create a 2D matrix
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
print(dataset)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


In [149]:
# Create a boolean matrix for this dataset to test where values are greater than 3
bools = dataset > 3
print(bools)

[[False False False  True  True]
 [ True  True  True  True  True]]


In [150]:
# Extract the value(s) which satisfy this logical operation
dataset[bools]

array([ 4,  5,  6,  7,  8,  9, 10])

In [151]:
# Create a boolean matrix for this dataset to test where values are equal to 5 using np.equal()
bools = np.equal(dataset, 5)
print(bools)

[[False False False False  True]
 [False False False False False]]


In [152]:
# Extract the value(s) which satisfy this logical operation
dataset[bools]

array([5])

In [153]:
# Create a boolean matrix for this dataset to test where values are > 8 or < 3
bools = np.logical_or(dataset > 8, dataset < 3)
print(bools)
# Subset the data using these booleans
dataset[bools]

[[ True  True False False False]
 [False False False  True  True]]


array([ 1,  2,  9, 10])
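Boolean arrays can also be combined with the `|` (or), `&` (and), and `~` (not) operators, which are a shorthand alternative to functions such as `np.logical_or()`. A minimal sketch, with made-up variable names:

```python
import numpy as np

dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

# | is element-wise OR; the parentheses are required because | binds tighter than > and <
either = dataset[(dataset > 8) | (dataset < 3)]

# & is element-wise AND
between = dataset[(dataset > 3) & (dataset < 8)]

# ~ negates a boolean array
not_big = dataset[~(dataset > 3)]

print(either)
print(between)
print(not_big)
```

Python's plain `and`, `or`, and `not` keywords do not work element-wise on arrays, which is why these operators (or the `np.logical_*` functions) are needed.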

Finally, we can use the **where** approach that is a hybrid of these two methods. 

In [154]:
# Find the numerical indices for values in the dataset > 6
indices = np.where(dataset > 6)
print(indices)
# Subset the data using these indices
dataset[indices]

(array([1, 1, 1, 1], dtype=int64), array([1, 2, 3, 4], dtype=int64))


array([ 7,  8,  9, 10])

#### Setting elements

Just as you can access elements of an array, you can also set them. This can be done with integer and logical indexing. 

Here is an example with simple integer indexing:

In [155]:
# Create a 2D matrix
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
print(dataset)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]]


In [156]:
# Change row 2, column 5 to the value 100
dataset[1, 4] = 100
dataset

array([[  1,   2,   3,   4,   5],
       [  6,   7,   8,   9, 100]])

You can also use logical indexing to set array values:

In [157]:
# Set every value > 3 to 0
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
dataset[dataset > 3] = 0 
dataset

array([[1, 2, 3, 0, 0],
       [0, 0, 0, 0, 0]])

And, finally, you can use the `where()` function:

In [158]:
# Set all values < 7 to -1
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])
dataset[np.where(dataset < 7)] = -1
dataset

array([[-1, -1, -1, -1, -1],
       [-1,  7,  8,  9, 10]])
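`np.where()` also has a three-argument form, `np.where(condition, value_if_true, value_if_false)`, which builds a new array instead of modifying the original. The sketch below repeats the replacement above in that style; the name `replaced` is made up for illustration.

```python
import numpy as np

dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

# Where the condition is True take -1, otherwise keep the value from dataset
replaced = np.where(dataset < 7, -1, dataset)

print(replaced)
print(dataset)   # the original array is unchanged
```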

#### Any / All

`any()` and `all()` are two methods which determine whether an array satisfies a logical condition. `any()` is `True` if any element in the array satisfies the condition. `all()` is `True` if all elements of the array satisfy the condition. Examples:

In [159]:
dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

In [160]:
# Any values equal to 0?
np.any(dataset == 0)

False

In [161]:
# All values NOT equal to 0?
np.all(dataset != 0)

True
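Both functions accept an `axis` argument, so the test can be applied per row or per column rather than to the whole array. A minimal sketch, with made-up variable names:

```python
import numpy as np

dataset = np.array([
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10]
])

# Does each ROW contain at least one value greater than 5?
row_has_big = np.any(dataset > 5, axis=1)

# Is every value in each COLUMN greater than 1?
col_all_big = np.all(dataset > 1, axis=0)

print(row_has_big)
print(col_all_big)
```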

#### Mathematical methods

Arrays have a large number of built-in mathematical methods. Examples include `sum()` and `mean()`. They can also be used for multi-dimensional algebraic operations, such as matrix multiplication and dot products. Here are a small number of examples:

In [162]:
my_data = np.arange(9).reshape((3,3))
my_data

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [None]:
# Multiplication by scalar
my_data * 3

In [None]:
# Addition by vector
my_vector = np.array([5, 10, 20])
my_data + my_vector

In [None]:
# Sum of values
my_data.sum()
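The three cells above are shown without their output. The sketch below repeats them in one self-contained block so the results can be checked; the names `scaled`, `shifted`, and `total` are made up for illustration. Note that adding a 1d vector of length 3 to a 3x3 array adds the vector to each row (this is known as broadcasting).

```python
import numpy as np

my_data = np.arange(9).reshape((3, 3))
my_vector = np.array([5, 10, 20])

scaled = my_data * 3             # every element multiplied by 3
shifted = my_data + my_vector    # my_vector is added to EACH ROW
total = my_data.sum()            # sum of all nine elements

print(scaled)
print(shifted)
print(total)
```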

In [163]:
# Mean of each row -- "axis" specifies the dimension that is collapsed (axis=1 averages across the columns)
my_data.mean(axis=1)

array([1., 4., 7.])

In [166]:
# Max of each column (axis=0 takes the max across the rows)
my_data.max(axis=0)

array([6, 7, 8])

In [167]:
# Transposition
my_data.transpose()

array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])

In [168]:
# Make new dataset
my_data2 = np.arange(100, 109).reshape((3,3))  # 3x3 matrix of 100 through 108

# Compute dot product (for 2d arrays, this is matrix multiplication)
dot_prod = np.dot(my_data, my_data2)
dot_prod

array([[ 315,  318,  321],
       [1242, 1254, 1266],
       [2169, 2190, 2211]])
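It is easy to confuse matrix multiplication with the element-wise `*` operator. The sketch below contrasts the two on a small, made-up pair of 2x2 arrays, and also shows the `@` operator, which for 2d arrays gives the same result as `np.dot()`.

```python
import numpy as np

a = np.arange(4).reshape((2, 2))   # [[0, 1], [2, 3]]
b = np.array([[1, 0], [0, 2]])

elementwise = a * b          # multiplies matching positions
matrix_prod = a @ b          # matrix multiplication
same_as_dot = np.dot(a, b)   # equivalent to a @ b for 2d arrays

print(elementwise)
print(matrix_prod)
```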

In [177]:
# Compute the pearson correlation of two 1d arrays
arr1 = np.array([1, 5, 6, 6, 7, 10])
arr2 = np.array([3, 3, 4, 3, 6, 9])
np.corrcoef(arr1, arr2)  # Correlation is ~.809

array([[1.        , 0.80873372],
       [0.80873372, 1.        ]])

## Pandas Series and DataFrame

`pandas` is arguably the most important library for data science in python. It provides both the `Series` and `DataFrame` objects, along with a large number of methods for working with them. Under the hood, it uses `numpy` so many `array` methods work with `pandas` objects. In this section, we will discuss the `Series` object and the `DataFrame` object, then introduce some core methods for working with them.

### Pandas `Series`

Similar to the 1D array, a pandas `Series` is an array where every element can have a name. See this example:

In [183]:
!pip install pandas



In [184]:
import pandas as pd

In [185]:
my_data = pd.Series(data={
    'one': 1,
    'two': 2,
    'three': 3
})
my_data

one      1
two      2
three    3
dtype: int64

Similar to a dictionary, the values can be accessed using the names:

In [181]:
my_data['two']

2

And similar to an `array`, the values can also be accessed using numbers and booleans:

In [186]:
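# Access element 1 by position (plain my_data[0] is deprecated for a Series with string labels in recent pandas versions)
my_data.iloc[0]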

1

In [187]:
# Access the element(s) which equals 3
my_data[my_data == 3]

three    3
dtype: int64
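The two explicit accessors, `.loc` (by name) and `.iloc` (by position), make it unambiguous which kind of access is intended. A minimal sketch using the same `Series`; the result variable names are made up for illustration.

```python
import pandas as pd

my_data = pd.Series(data={'one': 1, 'two': 2, 'three': 3})

by_label = my_data.loc['two']      # access by name
by_position = my_data.iloc[0]      # access by integer position
last_two = my_data.iloc[1:]        # slicing by position returns a Series

print(by_label)
print(by_position)
print(last_two)
```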

### Pandas `DataFrame`

The `DataFrame` is an extremely powerful datatype in python, and it is used ubiquitously throughout pythonic data science. A `DataFrame` is always a 2-dimensional, table-like structure which contains named columns and rows. Unlike a `numpy` array, each column can hold a different type of data. 

In [188]:
my_df = pd.DataFrame(data={
    'col_one': range(1, 5),
    'col_two': range(11, 15),
    'col_three': range(21, 25)
}, index = [
    'row_one', 'row_two', 'row_three', 'row_four'
])
my_df

           col_one  col_two  col_three
row_one          1       11         21
row_two          2       12         22
row_three        3       13         23
row_four         4       14         24


Methods for `DataFrame` objects are numerous and can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html). In this module, we will only discuss the following:

1. Difference between this and numpy array
2. Indexing / naming
3. Accessing data (iloc vs loc vs []) / setting data
4. Basic plotting
5. Reading / Writing to file
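As a preview of item 3, the sketch below shows the three access styles on the `my_df` example from above: `[]` for columns, `.loc` for names, and `.iloc` for integer positions. It is a minimal illustration only, and the result variable names are made up.

```python
import pandas as pd

my_df = pd.DataFrame(data={
    'col_one': range(1, 5),
    'col_two': range(11, 15),
    'col_three': range(21, 25)
}, index=['row_one', 'row_two', 'row_three', 'row_four'])

column = my_df['col_two']                      # [] with a column name returns a Series
by_label = my_df.loc['row_two', 'col_three']   # .loc uses row and column NAMES
by_position = my_df.iloc[1, 2]                 # .iloc uses integer POSITIONS

# Setting data works with the same accessors
my_df.loc['row_one', 'col_one'] = 100

print(column)
print(by_label, by_position)
print(my_df)
```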

More topics to cover in this lecture:
1. How to find items within a pandas dataframe -- loc iloc at