# Good morning!

Welcome back to Day 2 of PyCamp!

Yesterday, we learned a lot about the fundamentals of Python: types, data structures, and the general syntax of Python code.

Today, we'll delve into more practical uses of Python. We'll start by reviewing the most important aspects of yesterday's content, then we'll move on to a very important package called `numpy`, which is a core Python data science package that provides tons of new, useful functions.

# Review

## Lists and loops

Yesterday, we learned about our basic data types, including a very important data structure called a **list**.

Recall that lists are defined with square brackets, and that they contain multiple elements.

In [1]:
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
print(weekdays)

# try it out: sort weekdays alphabetically, then print the sorted list
# Remember that the sort method works "in place"

weekdays.sort()
print(weekdays)

['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
['Friday', 'Monday', 'Thursday', 'Tuesday', 'Wednesday']


Most of our work yesterday afternoon concerned `for` loops and lists. We can use `for` loops to run the same code on each item in a list.

Loops are constructed in the following manner:

```
for <ITEM> in <ITERABLE>:
	<EXECUTE CODE HERE>
```

For example, the below loop will print each element of weekdays.

In [2]:
for day in weekdays:
  print(day)

Friday
Monday
Thursday
Tuesday
Wednesday


Notice that `day` is a *placeholder* that we can use in the code to signify each element of `weekdays` as the code within the loop is run. You should always name your placeholder something simple that makes sense to you.

In [3]:
num_list = [5, 9, 2.3, 14, 3, 2, 10]

# try it out: use a for loop to print the squared value of each number in num_list
for number in num_list:
  print(number ** 2)

25
81
5.289999999999999
196
9
4
100


## Defining functions

**Custom functions** are defined using a `def` statement. Just like with `for` loops, functions use *placeholder variables* to indicate inputs that we want to use inside the function's code. The example below shows two inputs, but you can specify as many as you want!

```
def function_name(input1, input2):
  # function code goes here #
```

The code that goes inside the function must be written in an indented block. Generally speaking, you should make sure that you **return** values that you compute or generate within a function. This way, you can feed them into code that you use later on.

In [4]:
# Run this cell: notice that there's no output because we're just defining the function

def list_mean(input_list):
  # the code below will run each time we use this function
  return sum(input_list)/len(input_list)

In [5]:
# try it out: use list_mean on a list of numbers of your own choosing
list_mean([1, 2, 3, 4, 5])

3.0

We've taught you a lot about `print()` functions and how they're used to explicitly display the outputs of functions.

`print()` functions are sometimes useful to put inside functions, as they can print out informative human-readable text. However, the text that `print()` generates **cannot** be used for anything else: it's just for display.

In [6]:
# We're re-defining list_mean now

def list_mean(input_list):
  # the code below will run each time we use this function
  print("The input list was:", input_list)
  return sum(input_list)/len(input_list)

In [7]:
list_one = [1, 4, 10, 2]
list_two = [4, 5, 2, 1]

# try it out: add the averages of list_one and list_two
list_mean(list_one) + list_mean(list_two)

The input list was: [1, 4, 10, 2]
The input list was: [4, 5, 2, 1]


7.25
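
To make the difference between printing and returning concrete, here is a small sketch. The function names `mean_printed` and `mean_returned` are made up for this illustration and are not part of yesterday's material.

```python
def mean_printed(input_list):
    # displays the mean, but hands nothing back to the caller
    print(sum(input_list) / len(input_list))

def mean_returned(input_list):
    # hands the mean back so it can be stored or reused
    return sum(input_list) / len(input_list)

printed_result = mean_printed([1, 4, 10, 2])    # prints 4.25
returned_result = mean_returned([1, 4, 10, 2])  # prints nothing

print(printed_result)       # None: there was nothing to store
print(returned_result + 1)  # 5.25: the returned value can be reused
```

A function without a `return` statement gives back `None`, which is why only the returned version can feed into later calculations.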

## More data structures

At the end of yesterday's session, we directed you to some extra reading and exercises about the remaining three key data structures in base Python: **tuples**, **sets**, and **dictionaries**. If you completed this reading, great! If not, this will be a very simple overview of what you need to know to progress with today's content.

### Tuple
In simplest terms, a tuple is a list with very limited functionality. Tuples sacrifice much of the flexibility of lists in order to gain *efficiency* in data storage.

We define tuples using parentheses instead of square brackets. Like lists, tuples store information in an ordered sequence and can be indexed using square brackets.

In [8]:
string_tuple = ('a', 'b', 'c') # a tuple of strings
num_tuple = (5, 9, 2.3, 14, 3, 2, 10) # a tuple of numerics

print(string_tuple)
print(num_tuple)

#get the third item in the string_tuple
print(string_tuple[2])

#or the second item in the num_tuple (remember, python indexing begins with 0!)
print(num_tuple[1])

('a', 'b', 'c')
(5, 9, 2.3, 14, 3, 2, 10)
c
9


Tuples are interchangeable with lists in almost every aspect, with the exception that they **cannot be altered** once they've been created. This means that all of the methods we used to sort or append values to lists are *not applicable* to tuples.

In [9]:
# try it out: what happens if you try using .sort() on num_tuple?

num_tuple.sort()

AttributeError: 'tuple' object has no attribute 'sort'

Why use tuples then? Well, tuples are the default output for functions that return multiple values. If you use a `return` statement with comma-separated variables, they will automatically be packaged up and returned in a tuple.



In [10]:
def len_and_mean(input_list):
  list_length = len(input_list)
  list_average = sum(input_list)/len(input_list)

  # You can return multiple values by separating them with a comma
  return list_length, list_average

In [11]:
# try it out: use len_and_mean on a list of numerics of your own choosing
len_and_mean([1, 2, 3, 4, 5])

(5, 3.0)
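
Because the result is a tuple, we can also "unpack" it into separate variables in one line. The sketch below repeats `len_and_mean` so that it runs on its own; the variable names `n_items` and `average` are just illustrative choices.

```python
def len_and_mean(input_list):
    # same function as above, repeated so this block runs on its own
    list_length = len(input_list)
    list_average = sum(input_list) / len(input_list)
    return list_length, list_average

# unpack the returned tuple into two separate variables
n_items, average = len_and_mean([1, 2, 3, 4, 5])

print(n_items)   # 5
print(average)   # 3.0
```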

### Sets

A **set** is a structure that only contains unique values. Although you can create sets manually using curly brackets (`{ }`), the most common way to create a set is through the built-in function `set()`.

The most common use of sets is identifying unique values in very long lists.

In [12]:
repeat_list = [1, 5, 2, 11, 2, 6, 6, 9, 1, 2, 0, 6, 11, 2, -3, 1]

# What if we just want unique values?
set(repeat_list)

{-3, 0, 1, 2, 5, 6, 9, 11}

You can use sets to perform useful set operations like intersections, unions, and differences. If these methods are relevant to your work, we recommend that you review this list of set methods [here](https://www.programiz.com/python-programming/methods/set).
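
As a quick taste of those set operations, here is a minimal sketch using two made-up sets of names (the data is hypothetical). The results are passed through `sorted()` only so that they print in a predictable order.

```python
monday_attendees = {'Ana', 'Ben', 'Chen', 'Dee'}   # hypothetical example data
tuesday_attendees = {'Ben', 'Dee', 'Eli'}

both_days = monday_attendees & tuesday_attendees    # intersection
either_day = monday_attendees | tuesday_attendees   # union
monday_only = monday_attendees - tuesday_attendees  # difference

print(sorted(both_days))    # ['Ben', 'Dee']
print(sorted(either_day))   # ['Ana', 'Ben', 'Chen', 'Dee', 'Eli']
print(sorted(monday_only))  # ['Ana', 'Chen']
```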

### Dictionaries
The most important of these structures is the **dictionary**. Dictionaries are powerful because they allow us to associate values with each other in **key-value pairs**. Each **key** must be a unique value, and we can associate each unique key with a defined **value**.

In [13]:
# Dictionaries are defined in key-value pairs
# Below, our dates are keys and the weekday is the value

days_and_dates = {'06/12': 'Monday',
                  '06/13': 'Tuesday',
                  '06/14': 'Wednesday',
                  '06/15': 'Thursday',
                  '06/16': 'Friday'}

days_and_dates

{'06/12': 'Monday',
 '06/13': 'Tuesday',
 '06/14': 'Wednesday',
 '06/15': 'Thursday',
 '06/16': 'Friday'}

This is useful because we now have a *direct association* between one value and another: if we have a key, we can retrieve its associated value.

In [14]:
# This will return the value associated with the key '06/16'.
days_and_dates['06/16']

'Friday'

It can be a bit cumbersome to type out a dictionary. Thankfully, we can easily convert two lists of equal length into a dictionary using the `zip()` and `dict()` functions. We usually pair these functions together, like so:

```
dict(zip(keys_list, values_list))
```

In [15]:
bootcamp_dates = ['06/12', '06/13', '06/14', '06/15', '06/16']
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Use bootcamp_dates as the list of keys, and weekdays as the list of values
bootcamp_dict = dict(zip(bootcamp_dates, weekdays))
print(bootcamp_dict)

{'06/12': 'Monday', '06/13': 'Tuesday', '06/14': 'Wednesday', '06/15': 'Thursday', '06/16': 'Friday'}
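
If you're curious what `zip()` contributes on its own, you can convert its output to a list and look at it. This is only a sketch; `short_dates` and `short_days` are shortened, made-up copies of the lists above.

```python
short_dates = ['06/12', '06/13', '06/14']
short_days = ['Monday', 'Tuesday', 'Wednesday']

# zip() pairs up the elements that share the same position
pairs = list(zip(short_dates, short_days))
print(pairs)
# [('06/12', 'Monday'), ('06/13', 'Tuesday'), ('06/14', 'Wednesday')]

# dict() then treats each pair as (key, value)
short_dict = dict(pairs)
print(short_dict['06/13'])   # Tuesday
```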


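Unlike lists and tuples, dictionaries cannot be indexed by position: values are always looked up by key. (Dictionaries do remember the order in which keys were added, as of Python 3.7, but that order can't be used for indexing.) If we try to index by an integer position, Python looks for a key equal to that integer and raises a `KeyError` when it can't find one.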

In [16]:
bootcamp_dict[1]

KeyError: 1
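
Here is a short sketch of a few safer ways to work with keys. `example_dict` is a made-up, shortened dictionary, and `in`, `.get()`, and `.items()` are standard Python features that we haven't formally covered yet.

```python
example_dict = {'06/12': 'Monday', '06/13': 'Tuesday'}

# check whether a key exists before looking it up
print('06/12' in example_dict)   # True
print(1 in example_dict)         # False

# .get() returns a fallback value instead of raising a KeyError
fallback = example_dict.get('06/20', 'not a bootcamp day')
print(fallback)                  # not a bootcamp day

# loop over keys and values together
for date, day in example_dict.items():
    print(date, 'is a', day)
```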

# More about functions

Yesterday, we showed you a few different flavors of functions:
1. **Built-in functions**: These are short, simple functions that Python provides by default for performing essential operations.
2. **Methods**: These are "shortcut functions" that are used *in reference* to objects. Each type of object has different methods available to it: for example, a list has different methods than a set or a string.
3. **Custom functions**: These are functions that you define on your own using a `def` statement.

Today, we'll introduce you to another source of useful functions: external packages.

## Importing from packages

Python is a very popular programming language for data scientists, and it's become increasingly popular for computational biologists over the last several years. As a result, there are many **[open-source packages](https://en.wikipedia.org/wiki/Open-source_software)** written by researchers and made available to the broader scientific community.

These packages typically need to be installed manually, either using a downloadable installer or using the command line. Fortunately for us, Colab has many of these packages pre-installed, so all we have to do is **import** them into our runtime so we can begin to use them.

We'll start off by importing a package called `numpy`, which is short for Numerical Python. 🔢

In [17]:
# Make sure to run this cell!
import numpy as np

# check that the package was installed by printing the version
np.__version__

'1.23.5'

The `import` statement tells Python that we want to load functions from a certain package: in this case, it's `numpy`. For longer package names, you can use the `as` keyword to provide an **alias** for the package. `numpy` is usually shortened to `np` by convention.

Once you've imported the package, you have access to the functions contained within the package. In this manner, packages are like expansion packs for Python, adding new functions and possibilities for your analyses.

You can use any of the functions contained within the package by prefixing them with the package name or alias. For example, `np.mean()` and `np.median()` are functions that belong to the `numpy` package.

In [18]:
num_list = [5, 9, 2.3, 14, 3, 2, 10]

print('Mean:', np.mean(num_list))
print('Median:', np.median(num_list))

Mean: 6.471428571428571
Median: 5.0


By importing `numpy`, we can use functions like `np.median()` to accomplish operations without having to program them ourselves every time.
> We *did* make you do this with `list_mean()` and `list_median()` yesterday, but from now on, we'll just import functions as needed. 😊

More importantly, we can use these imported functions to write even more powerful and specific custom functions. Once the package is imported, we can use the imported functions freely inside our own custom functions.

In [None]:
# try it out:

# Write a function called summarize_list that will take a list of numerics
# and return a tuple with:
# 1) the length of the list
# 2) the median of the list, using np.median
# 3) the mean of the list, using np.mean

def summarize_list(input_list):
  return len(input_list), np.median(input_list), np.mean(input_list)

In [19]:
# test the function here
summarize_list(num_list)

(7, 5.0, 6.471428571428571)

## Function inputs

Now that we're moving into the world of imported functions, it's important to distinguish the different kinds of inputs that functions can take. This is important because many of the functions we'll use in the coming days have long lists of arguments and parameters that can be adjusted.

* **Required inputs**: These are always required in order for the function to work. For example, `summarize_list()` requires `input_list` to work.
* **Optional inputs**: These are inputs that adjust some aspect of how the function works. Unless otherwise specified using keyword arguments, the function proceeds with a default value/setting.

By convention, required inputs go *first* in the list of your inputs. Any additional optional inputs are given afterwards using **keywords** that specify the optional input.

Below is an example of `np.unique()`, a function that takes a sequence (like a list) and returns its unique values in sorted order.

In [20]:
# For example
mixed_ints = [1, 5, 2, 3, 3, 5, 8, 10, 7, 8]
print(mixed_ints)
print(np.unique(mixed_ints))

[1, 5, 2, 3, 3, 5, 8, 10, 7, 8]
[ 1  2  3  5  7  8 10]


Let's quickly look at the documentation for `np.unique()`. To learn more about a function and its inputs and outputs, you can simply type `?` followed by the name of the function.



In [21]:
?np.unique

`np.unique()` has an optional input called `return_counts`, which specifies whether or not the counts of each value in the input are returned alongside the ordered unique values.

`return_counts` is set to `False` by default: thus, the function works even if we don't provide an argument for `return_counts`. However, we can adjust `return_counts` if we wish to do so:

In [22]:
# Use np.unique on mixed_ints, setting return_counts = True
np.unique(mixed_ints, return_counts = True)

(array([ 1,  2,  3,  5,  7,  8, 10]), array([1, 1, 2, 2, 1, 2, 1]))
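
The same required/optional pattern is available in your own functions: giving an input a default value in the `def` line makes it optional. The sketch below is only an illustration; `rounded_mean` and its `decimals` input are hypothetical names, not functions from `numpy`.

```python
def rounded_mean(input_list, decimals=2):
    # input_list is required; decimals is optional and defaults to 2
    return round(sum(input_list) / len(input_list), decimals)

num_list = [5, 9, 2.3, 14, 3, 2, 10]

print(rounded_mean(num_list))              # uses the default: 6.47
print(rounded_mean(num_list, decimals=4))  # overrides the default: 6.4714
```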

# The `numpy` package

Although we introduced `numpy` in the context of providing convenient functions, `numpy` does much more than that.

`numpy` is a package that provides additional data types and functions, notably a *multi-dimensional array* data type. Simply put, `numpy` allows us to use Python to perform operations in an Excel-like manner, spanning multi-dimensional arrays of data. This immense utility makes `numpy` one of the most commonly used Python packages for scientific work and beyond.

# Intro to arrays

In a moment, we're going to introduce a very powerful new data structure called an **array.** An array is a data structure that incorporates elements of many data structures that we learned about yesterday.

We can create **arrays** by using the function `np.array()` on an existing data structure.

In [23]:
# creating an array from a list
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
weekday_array = np.array(weekdays)

weekday_array

array(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
      dtype='<U9')

At the surface level, arrays aren't so different from lists. We can get their length, index their elements, iterate over them, and perform operations on their contents.

In [24]:
# Get the length of the array
print('Length:', len(weekday_array))
# Retrieve Tuesday by index
print(weekday_array[1])
# Iterate over the array
for day in weekday_array:
  print(day, 'is', len(day), 'characters long.')

Length: 5
Tuesday
Monday is 6 characters long.
Tuesday is 7 characters long.
Wednesday is 9 characters long.
Thursday is 8 characters long.
Friday is 6 characters long.


However, the main features of arrays are for **numerical arrays** that contain only numeric types.

In [25]:
one_by_five = np.array([1, 1, 1, 1, 1])
print('Starting array:\n', one_by_five)

Starting array:
 [1 1 1 1 1]


We can create multi-dimensional arrays by using `np.array()` with nested lists of equal length. `np.array()` will automatically format our nested list into a multi-dimensional array.

In [26]:
# As a nested list: two sub-lists of length 5
print('Nested list:', [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]], '\n')

two_by_five = np.array([[1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1]])
print('2 x 5 array:\n', two_by_five, '\n')

# As a nested list: five sub-lists of length 5
five_by_five = np.array([[1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1],
                         [1, 1, 1, 1, 1]])
print('5 x 5 array:\n', five_by_five)

Nested list: [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1]] 

2 x 5 array:
 [[1 1 1 1 1]
 [1 1 1 1 1]] 

5 x 5 array:
 [[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]]


Arrays are optimized for numerical operations. One of the magical aspects of arrays is that they allow us to **broadcast** operations over each element in the array as if we were working on a single element. This property of arrays can make our code much more concise. This is especially important for **multi-dimensional arrays** (two-dimensional arrays are also known as *matrices*).

As an example, say we want to divide each element in an array by 2, and then square each element. Using Numpy broadcasting, we can perform this on our multi-dimensional array in a single line:

In [27]:
five_by_five_array = np.array([[1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1]])

print((five_by_five_array/2)**2)

[[0.25 0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25 0.25]
 [0.25 0.25 0.25 0.25 0.25]]
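
Broadcasting also works between arrays of different shapes, as long as the shapes are compatible. The sketch below uses a made-up 2 x 3 array called `grid` and a 1D array called `column_offsets` (both hypothetical names) to show a single number and a one-value-per-column array being applied across every row.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

# a single number is applied to every element
print(grid * 10)
# [[10 20 30]
#  [40 50 60]]

# a 1D array with one value per column is applied to every row
column_offsets = np.array([100, 200, 300])
print(grid + column_offsets)
# [[101 202 303]
#  [104 205 306]]
```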


However, if we were adamant about sticking to a nested list, the same operation would require us to use several techniques we learned yesterday:
1. A nested `for` loop
2. Indexing each element of the nested list
3. Updating each element of the nested list
4. The `range()` function.

In [28]:
five_by_five_list = [[1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1],
                     [1, 1, 1, 1, 1]]

#(five_by_five_list/2)**2
def nested_half_square(input_nested_list):
  #a function that does the same thing as above for a nested list
  #we index each sublist, then each item in the list, and then perform our operation

  for sublist in input_nested_list:
    for index in range(len(sublist)):
      #here, the range object is similar to a list or tuple of integers ranging from 0 to 4
      sublist[index] = (sublist[index]/2)**2
  return input_nested_list

nested_half_square(five_by_five_list)

[[0.25, 0.25, 0.25, 0.25, 0.25],
 [0.25, 0.25, 0.25, 0.25, 0.25],
 [0.25, 0.25, 0.25, 0.25, 0.25],
 [0.25, 0.25, 0.25, 0.25, 0.25],
 [0.25, 0.25, 0.25, 0.25, 0.25]]

## Why *not* use arrays?

Unsurprisingly, arrays are much more efficient than lists in both memory (storage) and operation speed, and unlike tuples, items in the array can be changed or replaced. This raises the question of why you would ever use a list instead of an array. Here are two major limitations of arrays:

1. **Arrays cannot be of mixed types.** <br>
  Unlike lists, arrays can only contain values of a single data type.<br>
  (Given a data structure of mixed types, `np.array()` will do its best to pick the "most harmonious" type. This is referred to as "type coercion".)

2. **Arrays must be rectangular.**<br>
  In a nested list, you can have sub-lists of different lengths. However, multi-dimensional `numpy` arrays require that every row must have the same number of columns.

In short, arrays offer a great boost in efficiency and conciseness of code at the cost of some flexibility. However, **for most numerical and/or tabular data, arrays are more efficient than lists.**

We'll spend the next day or so using arrays as our data structure of choice. This will prepare us for Thursday's content, which will focus heavily on a workhorse of a data structure called a DataFrame, which is heavily based on arrays.

In [29]:
# try it out:
# use np.array on a mixed list of:
# 1) integers and strings
# 2) integers and floats
# 3) floats and strings
# 4) integers, floats, and strings

print(np.array([1, 2, 3, 'four']))

print(np.array([1, 2, 3.6, 4.1]))

print(np.array([1.0, 2.0, 3.0, 'four']))

print(np.array([1, 2.2, 3.7, 'four']))

['1' '2' '3' 'four']
[1.  2.  3.6 4.1]
['1.0' '2.0' '3.0' 'four']
['1' '2.2' '3.7' 'four']
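
One way to check which type `np.array()` settled on is the array's `dtype` (the same `dtype` that appeared in the output of `weekday_array` earlier). This is a minimal sketch; the variable names are made up, and the exact string dtype can differ between platforms, so we only look at its general kind.

```python
import numpy as np

ints_and_floats = np.array([1, 2, 3.6, 4.1])
ints_and_strings = np.array([1, 2, 3, 'four'])

# the integers were promoted to floats
print(ints_and_floats.dtype)         # float64

# everything became a string; kind 'U' stands for Unicode text
print(ints_and_strings.dtype.kind)   # U
```

Once numbers have been coerced to strings, numerical operations on the array no longer behave as you'd expect, so it's worth checking the `dtype` when results look odd.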


# Interlude: Using autocomplete

As we progress along today and the remainder of the bootcamp, you'll want to start taking advantage of Colab's autocomplete feature.

Once you've defined a variable or function, or imported a package, Colab will understand that these objects are now available for use in the runtime. Colab will try to be helpful by providing "text prediction" for the variables and functions you reference.

For example, earlier we defined an array called `five_by_five`. Although it doesn't have the longest name, it's a little annoying to type it over and over again.

In [30]:
# Below, just write five_by and then pause. You can also press control+space to trigger auto-complete
five_by_five

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

You should see a small options menu pop up. If the top option is what you want, simply press `Tab` to autocomplete.

In [31]:
# try it out again!
five_by_five

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

# Working with arrays

This section marks the beginning of the more rigorous Python content we'll cover in the bootcamp.

We strongly encourage you to **refer to the cheat sheets as often as possible**, rather than trying to memorize the syntax of all of the operations. Memorization will come with practice, and if you try to force yourself to memorize the syntax of every command, you'll burn out before the end of the bootcamp!

## Attributes

In Python, objects have **attributes** that describe certain properties: for example, the length of a list or array is an attribute.

In the past, we've used built-in functions like `len()` and `type()` to access information about object attributes. However, arrays have special attributes that can't always be accessed with built-in functions. For example, how can we check the number of elements in an array?

In [32]:
print('Array:\n', five_by_five)

print(len(five_by_five)) # this only gives us the number of rows

Array:
 [[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]]
5


Not quite what we wanted, right? Thankfully, we can get information about array attributes by *referencing* our array of interest.

When we use methods, we *reference* the object that the method is associated with by using the `.` as a link between the object and the method. We can use a similar strategy to access attributes of arrays: we'll *reference* the array we want using `.`, then provide the name of the attribute we want to access. For example, we can use `.size` to retrieve the number of elements in the array. You may think of `.` as the possessive 's in the English language (e.g. "Josh's notebook" or `josh.notebook`).

In [33]:
# The size attribute describes the number of elements in the array
# We access it with .size

print('Array:\n', five_by_five)

print('Number of elements:', five_by_five.size)

Array:
 [[1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]
 [1 1 1 1 1]]
Number of elements: 25


The only difference between using an array method and accessing an array's attributes is the absence of a `()` when using `.size`: adding an extra `()` when trying to access attributes is a common mistake.

In [34]:
five_by_five.size()

TypeError: 'int' object is not callable

Beyond `.size`, two other useful attributes to examine are:
* `.shape`: Returns a tuple with the number of rows followed by the number of columns.
* `.ndim`: Returns the number of dimensions that the array contains.

> Click [here](https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-attributes) to view the full Numpy documentation for array attributes.

In [35]:
# try it out: how many rows/columns are in five_by_five?
print(five_by_five.ndim)
print(five_by_five.shape)

2
(5, 5)
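
Since `five_by_five` is square, its shape doesn't reveal which number comes first. A non-square array makes the order clearer. The sketch below rebuilds the `two_by_five` array from earlier so it can run on its own; `n_rows` and `n_cols` are illustrative names.

```python
import numpy as np

two_by_five = np.array([[1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1]])

print(two_by_five.shape)   # (2, 5): 2 rows, 5 columns
print(two_by_five.ndim)    # 2
print(two_by_five.size)    # 10

# .shape is a tuple, so it can be unpacked into separate variables
n_rows, n_cols = two_by_five.shape
print(n_rows, n_cols)      # 2 5
```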


## Indexing and slicing

Indexing and slicing in 1D arrays works exactly the same as it does with lists.

In [36]:
one_to_five = np.array([1, 2, 3, 4, 5])

# try it out:
# print the last value of the array
print(one_to_five[4])

# print the last three values of the array
print(one_to_five[2:])

5
[3 4 5]
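
Negative indices, which count backwards from the end, behave the same way in arrays as in lists. This short sketch assumes you've seen negative indexing with lists; it gives the same results as the cell above without needing to know the array's length.

```python
import numpy as np

one_to_five = np.array([1, 2, 3, 4, 5])

# -1 refers to the last element, whatever the length of the array
print(one_to_five[-1])    # 5

# start three elements from the end and continue to the end
print(one_to_five[-3:])   # [3 4 5]
```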


Things are a little different for multi-dimensional arrays.

In nested lists, we use **hierarchical indexing** to access elements, meaning that you have to first index the sub-list before you can index its elements.

In [37]:
# a nested list: each sub-list has 5 elements
one_to_fifteen_list = [[1, 2, 3, 4, 5],
                      [6, 7, 8, 9, 10],
                      [11, 12, 13, 14, 15]]
print('Nested list:', one_to_fifteen_list)

# the second sub-list is index [1]
print('\nSecond sub-list:')
print(one_to_fifteen_list[1])

# the last element of the second sublist is [1][4]
print('\nSecond sub-list, last element:')
print(one_to_fifteen_list[1][4])

# the last three elements of the second sublist are [1][2:]
print('\nSecond sub-list, last three elements:')
print(one_to_fifteen_list[1][2:])

Nested list: [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]]

Second sub-list:
[6, 7, 8, 9, 10]

Second sub-list, last element:
10

Second sub-list, last three elements:
[8, 9, 10]


In arrays, we use a single index in the format of `[row, column]`.
* *Rows* are equivalent to sub-lists.
* *Columns* are equivalent to elements in a sub-list.

<img src = 'https://github.com/ccbskillssem/pythonbootcamp/raw/main/day_2/indexing.png'>

In [38]:
# array version of above nested list
one_to_fifteen_array = np.array(one_to_fifteen_list)
print('Array:\n', one_to_fifteen_array)

# the second row is index [1]
print('\nSecond row:')
print(one_to_fifteen_array[1])

# the last element of the second row is [1, 4]
print('\nSecond row, last element:')
print(one_to_fifteen_array[1, 4])

# the last three elements of the second row are [1, 2:]
print('\nSecond row, last three elements:')
print(one_to_fifteen_array[1, 2:])

Array:
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]

Second row:
[ 6  7  8  9 10]

Second row, last element:
10

Second row, last three elements:
[ 8  9 10]


This method of indexing is especially convenient for accessing elements *column-wise*, which nested lists cannot do with a single index: you would have to loop over every sub-list.

In [39]:
# print array again for convenient viewing
print('Array:\n', one_to_fifteen_array)

# print the fourth column of the array
print('\nFourth column:')
one_to_fifteen_array[:, 3] # ':' is a placeholder for "all rows"

Array:
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]

Fourth column:


array([ 4,  9, 14])
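
For comparison, here is a sketch of what retrieving the same column looks like with the nested list versus the array. The data is repeated so the block runs on its own, and `fourth_column_list` and `fourth_column_array` are illustrative names.

```python
import numpy as np

one_to_fifteen_list = [[1, 2, 3, 4, 5],
                       [6, 7, 8, 9, 10],
                       [11, 12, 13, 14, 15]]

# nested list: visit every sub-list and collect the element at index 3
fourth_column_list = []
for sublist in one_to_fifteen_list:
    fourth_column_list.append(sublist[3])
print(fourth_column_list)    # [4, 9, 14]

# array: one index expression does the same job
fourth_column_array = np.array(one_to_fifteen_list)[:, 3]
print(fourth_column_array)   # [ 4  9 14]
```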

Above, we simply used the slice (`:`) operator in place of the row index to indicate that we would like to access values from *all rows*. Providing `3` in place of the column index then retrieves the index 3 position from each row. This type of indexing is referred to as indexing across an **axis**.

> In 2D arrays, we have two axes: axis 0 runs down the rows (the first index), and axis 1 runs across the columns (the second index).
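
`numpy` numbers its axes starting from 0, and many array methods accept an `axis` input that chooses the direction in which to work. The sketch below uses `.sum()` as an example and rebuilds the array so it runs on its own: `axis=0` collapses the rows (giving one result per column), while `axis=1` collapses the columns (giving one result per row).

```python
import numpy as np

one_to_fifteen_array = np.array([[1, 2, 3, 4, 5],
                                 [6, 7, 8, 9, 10],
                                 [11, 12, 13, 14, 15]])

# axis=0 collapses the rows: one result per column
print(one_to_fifteen_array.sum(axis=0))   # [18 21 24 27 30]

# axis=1 collapses the columns: one result per row
print(one_to_fifteen_array.sum(axis=1))   # [15 40 65]
```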

We can put these row and column indexing strategies together to be able to slice out portions of arrays that would have been inaccessible to us in nested lists.

In [40]:
# print array again for convenient viewing
print('Array:\n', one_to_fifteen_array)

print('\nMiddle three elements of the first two rows:')
# rows: 0, 1
# columns: 1, 2, 3
one_to_fifteen_array[:2, 1:4]

Array:
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]

Middle three elements of the first two rows:


array([[2, 3, 4],
       [7, 8, 9]])

Lastly, we can use tuples to specify exact columns that we want to slice.

In [41]:
# print array again for convenient viewing
print('Array:\n', one_to_fifteen_array)

print('\nOnly the first, third, and fifth columns')
one_to_fifteen_array[:, (0,2,4)]

Array:
 [[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]

Only the first, third, and fifth columns


array([[ 1,  3,  5],
       [ 6,  8, 10],
       [11, 13, 15]])

# Exercises

**1A**: Create a 3 x 4 array called `even_3x4` that contains sequential even numbers from 2 to 24. You may simply write out array elements literally (2, 4, etc.). It should look like this when printed:

```
[[ 2  4  6  8]
 [10 12 14 16]
 [18 20 22 24]]
```

In [42]:
### write your code below ###
even_3x4 = np.array([[2, 4, 6, 8],
                     [10, 12, 14, 16],
                     [18, 20, 22, 24]])

print(even_3x4)

[[ 2  4  6  8]
 [10 12 14 16]
 [18 20 22 24]]


**1B**: Next, look at the [documentation](https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-attributes) for array attributes.

Which attribute will tell us about the number of bytes consumed by the elements in the array?

In [43]:
# access the relevant attribute for even_3x4

### write your code below ###
even_3x4.nbytes

96

**1C**: Find the column-wise mean of each column in `even_3x4`. The [documentation](https://numpy.org/doc/stable/reference/arrays.ndarray.html#array-methods) for array methods might be helpful.

In [44]:
### write your code below ###
even_3x4.mean(axis=0)

array([10., 12., 14., 16.])

**2A-F**: A tremendously important skill for a programmer (in Python or any other language) is being comfortable with finding answers to common questions using documentation and other online resources. In this exercise, you will learn to use the Numpy documentation and your search engine to find a few utilities that Numpy provides. We're purposefully not telling you where in the documentation to look so that you figure it out yourself.

In [45]:
# a. create a 2x4 array filled with random numbers between 0 and 4 (included)

### write your code below ###
a = np.random.randint(5, size=(2,4))
print(a)

# b. create an array like the one above (i.e. same shape) that is filled with zeros

### write your code below ###
b = np.zeros_like(a)
print(b)

# c. use numpy to answer the following question: are a and b element-wise equal?
# Note: "element-wise equal" means all their elements are equal.

### write your code below ###
c = np.array_equal(a, b)
print(c)

# d. print a 1D array that contains the values in the 2D array below but with no duplicates
d_dups = np.array([[2, 4, 6, 8],
                   [10, 2, 2, 16],
                   [18, 20, 6, 4]])

### write your code below ###
d = np.unique(d_dups)
print(d)

# e. limit the values in the array from part a by constraining them to be in the interval [1,3]
# (i.e. any value less than 1 is replaced with a 1 and any value greater than 3 is replaced with a 3)

### write your code below ###
e = np.clip(a, 1, 3)
print(e)

# f. use numpy to invert the casing in the string array below (uppercase becomes lowercase and vice versa)
f_case = np.array([["ABC", "AbC", "abC", "abc"],
                   ["aBc", "aBC", "Abc", "abC"]])

### write your code below ###
f = np.char.swapcase(f_case)
print(f)

[[0 3 3 3]
 [4 1 2 0]]
[[0 0 0 0]
 [0 0 0 0]]
False
[ 2  4  6  8 10 16 18 20]
[[1 3 3 3]
 [3 1 2 1]]
[['abc' 'aBc' 'ABc' 'ABC']
 ['AbC' 'Abc' 'aBC' 'ABc']]


**Challenge**: Write a function called `off_diagonals()` that takes a square matrix and returns a list containing the off-diagonal elements of the matrix, **using the `np.eye()` function**. Run the test below with your function and make sure it passes.

Terminology:
- A square matrix is a matrix that has the same number of rows and columns.
- The diagonal elements of a square matrix are the numbers that are on the "diagonal" of the square, e.g. `a[0,0]`, `a[1,1]`, and in general `a[i,i]` for i = 0 ... (number of columns - 1).
- The off-diagonal elements of a square matrix are all elements in the matrix that are *not* on the diagonal.

💡 **Hints**:
- What happens if you multiply the output of `np.eye()` with the square matrix?
- You're actually going to want to use the "opposite" of `np.eye()`, i.e. a matrix that has 1's on the off-diagonals. How can you get that? Notice that `1 - 0 = 1`, and `1 - 1 = 0` and remember array broadcasting rules.
- You might want to use `np.nonzero()`.

In [46]:
### write your code below ###
def off_diagonals(mat):
    eye_mat = np.eye(mat.shape[0])
    opposite_eye = 1-eye_mat
    mat_no_diag = opposite_eye * mat
    off_diag_elems = mat_no_diag[np.nonzero(mat_no_diag)]
    return list(off_diag_elems)

In [47]:
# Execute this cell after having defined off_diagonals above and make sure it passes
test_mat = np.array([[1, 2, 5, 12],
                     [5, 8, 10, 3],
                     [15, 2, 21, 4],
                     [8, 9, 11, 3]])

print('Test mat: \n', test_mat, '\n')

print('Off diagonal entries: \n', off_diagonals(test_mat))

assert off_diagonals(test_mat) == [2.0, 5.0, 12.0, 5.0, 10.0, 3.0, 15.0, 2.0, 4.0, 8.0, 9.0, 11.0]

Test mat: 
 [[ 1  2  5 12]
 [ 5  8 10  3]
 [15  2 21  4]
 [ 8  9 11  3]] 

Off diagonal entries: 
 [2.0, 5.0, 12.0, 5.0, 10.0, 3.0, 15.0, 2.0, 4.0, 8.0, 9.0, 11.0]
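
One caveat of the `np.nonzero()` approach is that it selects elements by value, so an off-diagonal entry that happens to be 0 would be left out. A possible alternative, sketched below under the assumption that you want to keep those zeros, selects by position with a boolean mask instead. The names `off_diagonals_keep_zeros` and `zero_mat` are hypothetical and not part of the exercise.

```python
import numpy as np

def off_diagonals_keep_zeros(mat):
    # hypothetical variant: True everywhere except on the diagonal
    off_diag_mask = np.eye(mat.shape[0]) == 0
    # boolean indexing picks elements by position, in row-major order
    return mat[off_diag_mask].tolist()

# made-up matrix with zeros off the diagonal
zero_mat = np.array([[1, 0, 5],
                     [0, 8, 10],
                     [15, 2, 21]])

print(off_diagonals_keep_zeros(zero_mat))
```

Because the mask indexes the original matrix directly, the result keeps the matrix's own data type instead of being converted to floats.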


# [Optional] Challenge question (loops, dictionaries)

This challenge question is representative of code that you would be able to write and deploy without using external packages like `numpy`. Give it a go and see if you can figure it out!

*Challenge*: Write a function called `translate()` that takes in an RNA sequence as input and does the following:
1. Given the input sequence, replaces all instances of `'T'` with `'U'`.
2. Finds the index of the first instance of the `'AUG'` substring.
3. Slices the sequence, starting from the `'AUG'` and continuing until the end of the sequence.
4. Divides the sequence into triplets (three-letter substrings).
5. Iterates over substrings to build a string of single-letter amino acid codes.
  * If the substring is a triplet, identifies the amino acid for the corresponding triplet.
6. Returns the amino acid string.

For example: the sequence `'AAGACAUGGCACUGGAGCGCGGGGUCAGCAGCUACGCUUAA'` would return `'MALERGVSSYA_'`.

> *Note*: The sample string that we provide will have only one start codon, and the ORF will end with a stop codon. We'll leave ORF discovery as an exercise for you to try on your own :)
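
Steps 1 through 4 can be tried on a short string before bringing in the dictionary. The sketch below uses a made-up sequence (`toy_seq`) and a hypothetical helper name (`split_into_triplets`); neither comes from the exercise.

```python
toy_seq = 'AAGACATGGCACTGGA'   # made-up sequence for illustration

rna = toy_seq.replace('T', 'U')   # step 1: swap every 'T' for a 'U'
start = rna.find('AUG')           # step 2: index of the first 'AUG'
orf = rna[start:]                 # step 3: slice from 'AUG' to the end

def split_into_triplets(seq):
    # step 4: take three letters at a time; the last piece may be shorter
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

print(start)
print(split_into_triplets(orf))
```

Keep in mind that `str.find()` returns `-1` when the substring is absent, so a sequence without a start codon would need separate handling.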

In [48]:
### edit the skeleton code below ###

def translate(seq):
  # we've created an amino acid dictionary for you already
  amino_dict = {'AAA': 'K', 'AAC': 'N', 'AAG': 'K', 'AAU': 'N',
           'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACU': 'T',
           'AGA': 'R', 'AGC': 'S', 'AGG': 'R', 'AGU': 'S',
           'AUA': 'I', 'AUC': 'I', 'AUG': 'M', 'AUU': 'I',
           'CAA': 'Q', 'CAC': 'H', 'CAG': 'Q', 'CAU': 'H',
           'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCU': 'P',
           'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGU': 'R',
           'CUA': 'L', 'CUC': 'L', 'CUG': 'L', 'CUU': 'L',
           'GAA': 'E', 'GAC': 'D', 'GAG': 'E', 'GAU': 'D',
           'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCU': 'A',
           'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGU': 'G',
           'GUA': 'V', 'GUC': 'V', 'GUG': 'V', 'GUU': 'V',
           'UAA': '_', 'UAC': 'Y', 'UAG': '_', 'UAU': 'Y',
           'UCA': 'S', 'UCC': 'S', 'UCG': 'S', 'UCU': 'S',
           'UGA': '_', 'UGC': 'C', 'UGG': 'W', 'UGU': 'C',
           'UUA': 'L', 'UUC': 'F', 'UUG': 'L', 'UUU': 'F'}

  ### write your code below ###
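  # step 1: convert any DNA-style 'T' to RNA-style 'U'
  seq = seq.replace('T', 'U')
  # steps 2-3: slice from the first start codon to the end of the sequence
  orf = seq[seq.find('AUG'):]
  translated = ''

  # steps 4-5: walk through the ORF three letters at a time
  for index in range(0, len(orf), 3):
    codon = orf[index:index + 3]
    # skip a trailing fragment that is shorter than three letters
    if len(codon) == 3:
      translated += amino_dict[codon]

  return translated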

In [49]:
# try it out with translation_seq when you're done
translation_seq = 'AACGTCAATGTATCACGCGACTAGCCTCTGCTTAATTTTTGTGCTCAAGGGTTTTGGTCCGCCCGAGCGGTGCAGCCGATTAGGACCATGTAATACATTTGTTACAAGACTTCTTTTAAACACTTAG'

print(translate(translation_seq))
print(translate("AAGACAUGGCACUGGAGCGCGGGGUCAGCAGCUACGCUUAA"))

MYHATSLCLIFVLKGFGPPERCSRLGPCNTFVTRLLLNT_
MALERGVSSYA_


# [Optional] Function scope

Scope is an important programming principle, but we've found that discussing it can be more of a hindrance than a help to people who are new to Python. Covering this section is up to the lecturer's discretion, and we may refer you to this section if you have trouble with the following problems:

1. Trying to call variables that only exist within a function.
2. Using redundant variable names within functions and outside of functions (and then doing #1).
___

So far, we've been using fairly descriptive variable names, and we've tried to avoid reusing them in functions. However, there will likely come a time when you tire of imagining new variable names and you begin to recycle them. In these cases, you'll want to be mindful of what your functions can access and/or modify. This is referred to as the **scope** of a function.

When Python is running our code, it moves in and out of different **environments**. By default, Python starts in what we call the *global* environment, and when executing a function, it moves into a *local* environment (local to the function, that is).

In [50]:
# everything below is instantiated in the global environment

dog_breeds = ['shiba inu', 'corgi', 'pug']
dog_counts = [6, 2, 3]
dog_record = dict(zip(dog_breeds, dog_counts))

We've instantiated (programming-speak for "created") three variables here: `dog_breeds`, `dog_counts`, and `dog_record`. These variables exist in the global environment.

Now, let's create a function that reuses variable names that already exist in the global environment.

In [51]:
def doggy_daycare(breeds, counts):
  dog_breeds = breeds
  dog_counts = counts
  dog_record = dict(zip(breeds, counts))

  return dog_record

When run, `doggy_daycare()` creates variables named `dog_breeds`, `dog_counts`, and `dog_record`. How will these local variables interact with their identically named global counterparts?

In [52]:
print('Global dog_record:', dog_record)
print('dog_record, local to doggy_daycare:', doggy_daycare(['german shepherd', 'husky'], [3, 2]))
print('Global dog_record is unchanged:', dog_record)

Global dog_record: {'shiba inu': 6, 'corgi': 2, 'pug': 3}
dog_record, local to doggy_daycare: {'german shepherd': 3, 'husky': 2}
Global dog_record is unchanged: {'shiba inu': 6, 'corgi': 2, 'pug': 3}


As you can see, it turns out that they don't interact much at all. Let's walk through why:

To start, we have to understand that Python references environments in a hierarchical fashion.

<img src='https://github.com/ccbskillssem/pythonbootcamp/raw/main/day_2/global_enclosed_local.png'></img>

When trying to execute functions, Python will start with the most specific environment first, then go up to more general environments to find what it needs.

Here's how it works, step by step:
1. Python starts in the global environment by default.
2. Python needs to run a function: in this case, `doggy_daycare()`. This function references several different named variables.
3. Python will start by searching the local environment to see if it can find the variable that it needs to execute the function.
<br>**If the variable is found, Python goes on and executes the function and we skip Steps 4-6.**

4. If this variable is not found, then Python moves up to the next environment.
  - If the function happens to be nested inside another function, Python checks the local environment of the enclosing function, called the *enclosed* environment.
  - If the function is not enclosed or Python doesn't find what it needs, Python then checks the global environment.
5. If Python doesn't find what it needs in the global environment, it does one final check of the *built-in* environment, where all of Python's built-in variables and functions exist.
6. If Python really doesn't find what you're asking for in any environment, it throws an error.

For `doggy_daycare()`, we specifically created variables named `dog_breeds`, `dog_counts`, and `dog_record` inside the function's local environment. Thus, when Python searched for those variables while running `doggy_daycare()`, it found the local variables first, in accordance with the hierarchical search. Since Python found what it needed, it continued on with executing the function.
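
The search order described in the steps above can be sketched with one variable name that exists at several levels. Every name here (`treat`, `kennel`, `run`, `nap`, `yard`, `count_treats`) is made up for illustration.

```python
treat = 'global biscuit'            # lives in the global environment

def kennel():
    treat = 'enclosing bone'        # local to kennel(); "enclosed" from the inner functions' view

    def run():
        treat = 'local chew toy'    # local to run(), so it is found first
        return treat

    def nap():
        return treat                # nothing local, so kennel()'s treat is used

    return run(), nap()

def yard():
    return treat                    # nothing local or enclosed, so the global treat is used

def count_treats():
    return len(treat)               # len is not defined in this code, so it comes from the built-ins

print(kennel())
print(yard())
print(count_treats())
```

Each function stops at the first environment where the name is found, which is why `run()` never sees the outer values of `treat`.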

That covers how Python starts from the local environment and progressively searches the outer environments. But what about starting from the outer global environment and trying to look into the local one? It turns out that the arrows in the figure are one-way for a reason.

In [53]:
def daycare_report():
  daycare_greeting = "Have a paw-some day!"
  print("We have", sum(dog_counts), "dogs today.")
  print(daycare_greeting)

daycare_report()
print(daycare_greeting)

We have 11 dogs today.
Have a paw-some day!


NameError: name 'daycare_greeting' is not defined

If Python is sitting in the global environment, it *cannot access* anything inside a function's local environment. It's a bit like a one-way mirror: inside `daycare_report()`, Python can "see" `dog_counts` and use it to execute `daycare_report()`. However, once Python returns to the global environment after executing the function, it can no longer "see" `daycare_greeting`, because it only exists inside the local environment.

This is the very same reason why the `dog_breeds`, `dog_counts`, and `dog_record` inside `doggy_daycare()` don't change `dog_breeds`, `dog_counts`, and `dog_record` in the global environment. What happens inside `doggy_daycare()` stays inside `doggy_daycare()`: if we want to bring anything "outside" of the function, we need to return it and save it to a variable!
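
As a minimal sketch of "return and save", the code below stores the function's return value under a new global name. The name `afternoon_record` is hypothetical, and the function is trimmed down to the one local variable that matters here.

```python
dog_record = {'shiba inu': 6, 'corgi': 2, 'pug': 3}   # global, same values as in the cells above

def doggy_daycare(breeds, counts):
    dog_record = dict(zip(breeds, counts))   # local to doggy_daycare()
    return dog_record

# saving the returned value is what brings it "outside" of the function
afternoon_record = doggy_daycare(['german shepherd', 'husky'], [3, 2])

print(afternoon_record)
print(dog_record)
```

The global `dog_record` keeps its original contents, while `afternoon_record` holds whatever the function returned.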

This is a tough topic, so we encourage you to use the [Python Code Visualizer](http://www.pythontutor.com/visualize.html) to run the following code block and visualize how Python searches different environments.

```
dog_breeds = ['shiba inu', 'corgi', 'pug']
dog_counts = [6, 2, 3]
dog_record = dict(zip(dog_breeds, dog_counts))

def doggy_daycare(breeds, counts):
  dog_breeds = breeds
  dog_counts = counts
  dog_record = dict(zip(breeds, counts))

  return dog_record

print(doggy_daycare(['german shepherd', 'husky'], [3, 2]))
print(dog_record)
```