# Lecture 4 – Arrays and DataFrames
## DSC 10, Winter 2022

### Announcements

- Lab 1 is due **tomorrow (1/11) at 11:59pm**.
    - Don't wait until the last minute to submit!
- Homework 1 is due **Saturday 1/15 at 11:59pm**.
- Discussion sections start today.
    - Monday 6PM and 7PM – attend whichever one you want.
    - Both will be recorded.

### Agenda

- Review: strings and text.
- Lists.
- Arrays.
- Ranges.
- DataFrames.

### Resources

- We're covering **a lot** of content very quickly. If you're overwhelmed, just know that we're here to support you! 
    - Office Hours and Campuswire are your friends 🤝.
- Remember to check the [Resources tab of the course website](https://dsc10.com/resources/) for programming resources.
- Some key links moving forward:
    - [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view).
    - [BabyPandas Documentation](https://babypandas.readthedocs.io/en/latest/index.html).

## Review: strings and text

### Strings

- A string is a snippet of text of any length.
- Enclose a string in either single or double quotes.

In [None]:
'woof'

In [None]:
"woof"

In [None]:
# A string, not an int!
"1998"

### String arithmetic

- When using the `+` symbol between two strings, the operation is called "concatenation".

In [None]:
s1 = 'tiny'
s2 = 'panda'

In [None]:
s1 + s2

In [None]:
s1 + ' ' + s2

In [None]:
s1 * 3

### String methods
* Strings are associated with certain functions called **string methods**.
* Access string methods with a `.` after the string (dot notation).
* e.g. `.upper()`, `.replace()`,...

In [None]:
my_cool_string = 'data science is super cool!'

In [None]:
my_cool_string.upper()

In [None]:
my_cool_string.replace('super cool', '💯' * 3)

In [None]:
# len is not a method, since it doesn't use dot notation
len(my_cool_string)
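
String methods return a new string rather than changing the original. A small sketch, reusing `my_cool_string` from the cells above (`shouted` is a name made up for this example):

```python
my_cool_string = 'data science is super cool!'

# .upper() builds and returns a brand new string...
shouted = my_cool_string.upper()
print(shouted)

# ...while the original string is left unchanged
print(my_cool_string)
```

To keep the result of a string method, assign it to a name, as done with `shouted` here.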

### Special characters in strings
* Apostrophes, quotes, newlines, etc.

In [None]:
# This raises a SyntaxError: the apostrophe in "string's" ends the string early
'my string's full of apostrophes!'

In [None]:
"my string's full of apostrophes!"

In [None]:
# escape the apostrophe with a backslash!
'my string\'s "full" of apostrophes!'

In [None]:
print('my string\'s "full" of apostrophes!')

### Digression: ```print()```
- By default, Jupyter notebooks display the "raw" value of the last expression in a cell.
- The `print` function displays a value as human-readable text every time it is called, no matter where in the cell it appears.

In [None]:
12 # 12 won't be displayed, since Jupyter only shows the value of the last expression in a cell
23

In [None]:
# Note, there is no Out[number] to the left! That only appears when displaying a non-printed value.
# But both 12 and 23 are displayed.
print(12)
print(23)

In [None]:
# '\n' inserts a new line
my_newline_str = 'here is a string with two lines.\nhere is the second line'  
my_newline_str

In [None]:
# Notice the quotes disappear!
print(my_newline_str)  
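
To see both views of the same string side by side, one option is Python's built-in `repr` function, which returns the "raw" form as text. This is a minimal sketch, assuming the raw display Jupyter shows for a string matches what `repr` returns (`raw_form` is a name made up for this example):

```python
my_newline_str = 'here is a string with two lines.\nhere is the second line'

# repr gives the "raw" form: quotes included, and the newline shown as backslash-n
raw_form = repr(my_newline_str)
print(raw_form)

# print shows the human-readable form: no quotes, and the newline becomes a line break
print(my_newline_str)
```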

### Type conversion to and from strings
* Any value can be converted to a string using ```str```.
* Some strings can be converted to ```int``` and ```float```.

In [None]:
str(3)

In [None]:
float('3')

In [None]:
int('4')

In [None]:
# This raises a ValueError: 'bunnies' doesn't look like a whole number
int('bunnies')
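
When a string might not look like a number, the failed conversion can be caught instead of stopping the program. The sketch below uses `try`/`except`, which we haven't covered in lecture, and `to_int_or_none` is a hypothetical helper written just for this example:

```python
def to_int_or_none(text):
    """Hypothetical helper: return int(text), or None if the conversion fails."""
    try:
        return int(text)
    except ValueError:
        # int() raises a ValueError for strings like 'bunnies'
        return None

print(to_int_or_none('4'))
print(to_int_or_none('bunnies'))
```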

### Discussion Question

Assume you have run the following statements:

```py
x = 3
y = '4'
z = '5.6'
```

Choose the expression that will be evaluated **without** an error.

A. `x + y`

B. `x + int(y + z)`

C. `str(x) + int(y)`

D. `str(x) + z`

E. All of them have errors

### To answer, go to **[menti.com](https://menti.com)** and enter the code **3962 2509**.

## Lists

### How do we store *sequences*?

For instance:
- All temperatures in the month of January.
- The age of every user on TikTok.
- The salary of every NBA player.

### Each as its own variable?

In [None]:
temperature_on_jan_01 = 68
temperature_on_jan_02 = 72
temperature_on_jan_03 = 65
temperature_on_jan_04 = 64
temperature_on_jan_05 = 62
temperature_on_jan_06 = 61
temperature_on_jan_07 = 59
temperature_on_jan_08 = 64
temperature_on_jan_09 = 64
temperature_on_jan_10 = 63
temperature_on_jan_11 = 65
temperature_on_jan_12 = 62

```
avg_temperature = 1/12 * (
    temperature_on_jan_01
    + temperature_on_jan_02
    + temperature_on_jan_03
    + ...)
```

It seems like we need a better solution.

## Python's `list`s

- To create a `list`, place commas between things and surround with square brackets:

In [None]:
temperature_list = [68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temperature_list

In [None]:
type(temperature_list)

## `list`s make working with sequences easy

In [None]:
# What's the average temperature?
sum(temperature_list) / len(temperature_list)

### There's a problem...

- Lists are **very slow** for numerical computation.
- This is not a big deal when there aren't many entries, but it's a big problem when there are millions/billions of entries.

## Arrays

### Arrays

- Arrays are like lists, but faster.
- Provided by a package called `numpy` (pronounced "num-pie").
    - Core package for data science and scientific computing.

<center>
<img src='images/numpy.png' width=400>
</center>

- To use `numpy`, we need to import it. It's usually imported as `np`, but it doesn't have to be!

In [None]:
import numpy as np

### Creating arrays

- To create an array, pass a list as input to the `np.array` function.
- Remember the square brackets!

<center>
<img src='images/brackets.png' width=500>
</center>

In [None]:
temperature_array = np.array([68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62])
temperature_array

In [None]:
temperature_list

In [None]:
# No square brackets, because temperature_list is already a list!
np.array(temperature_list)

### Accessing elements in an array

- The things inside of an array are called its *elements*.
- Every element in an array has a position.
- Python, like most programming languages, is 0-indexed. **This means that the position of the first element in an array is 0, not 1.**
- To access the element at position `i` in an array, add `[i]` after the name of the array.
- Everything above applies to lists, too.

In [None]:
temperature_array

In [None]:
temperature_array[0]

In [None]:
temperature_array[1]

In [None]:
temperature_array[3]

In [None]:
# Trying to access the last element in the array.
# This raises an IndexError: the 12 elements occupy positions 0 through 11.
temperature_array[12]

In [None]:
temperature_array[42]
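
Because positions start at 0, the last element of an array sits at position `len(array) - 1`. A short sketch with the same temperatures as above (`last_position` is a name made up for this example); Python also accepts the position `-1` as a shortcut for the last element:

```python
import numpy as np

temperature_array = np.array([68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62])

# 12 elements occupy positions 0 through 11
last_position = len(temperature_array) - 1
print(last_position)

# Two ways to get the last element
print(temperature_array[last_position])
print(temperature_array[-1])
```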

### Array-number arithmetic

- Arrays make it easy to perform the same operation to every element.

In [None]:
temperature_array

In [None]:
# Increase all temperatures by 3 degrees
temperature_array = temperature_array + 3
temperature_array

In [None]:
# Halve all temperatures
temperature_array / 2

In [None]:
# Convert all temperatures to Celsius
(5/9) * (temperature_array - 32)

In [None]:
# In the previous two cells, we didn't re-assign temperature_array
temperature_array

### Array-array arithmetic

- Two arrays of the **same size** can be added, subtracted, multiplied, etc.
- The arithmetic happens *elementwise*.

In [None]:
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, -6])

In [None]:
a1 + a2

In [None]:
a1 - a2

In [None]:
a1 * a2
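
To see why the sizes need to match, here is a sketch of what happens when they don't. `too_short` and `error_message` are names made up for this example, and the `try`/`except` is only there so the cell finishes running instead of stopping at the error:

```python
import numpy as np

a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, -6])

# Same size: multiplication happens elementwise, giving 4, 10, and -18
products = a1 * a2
print(products)

# Different sizes (3 elements vs. 2 elements): numpy raises a ValueError
too_short = np.array([10, 20])
error_message = None
try:
    a1 + too_short
except ValueError as err:
    error_message = str(err)

print(error_message)
```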

### Example: newborn birth weights 👶

Suppose there are four babies with weights 3.405 kg, 3.207 kg, 2.420 kg, and 3.984 kg. The average weight of a newborn baby is 3.300 kg.

**Question:** How far are these four babies' weights from the weight of an average newborn?

In [None]:
g1 = 3.405 
g2 = 3.207
g3 = 2.42
g4 = 3.984

# Array of four weights
four_weights = np.array([g1, g2, g3, g4])
four_weights

In [None]:
average_weight = 3.3

#### Calculate the deviation of weights from the average weight
* Subtracting a number from an array subtracts the number from each element.

In [None]:
abs(four_weights - average_weight)

#### Convert the weights to pounds (2.2 lb/kg)

In [None]:
four_weights_lbs = four_weights * 2.2
four_weights_lbs

#### How many babies are recorded in the array?

- The function `len` returns the length of an array (or list).
- In this case it's obvious that there are 4 elements in the array, but it won't always be so obvious.

In [None]:
len(four_weights_lbs)

### Example: daily temperatures 🌡

Below is an array of daily high temperatures in San Diego from August 2018.

In [None]:
temps = np.array([86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
                   83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80])

Number of days in August for which temperatures were recorded:

In [None]:
len(temps)

#### Temperature statistics (mean, min, max)

- Arrays have handy methods for common tasks.
- Methods are like functions but they use dot notation (e.g. `temps.max()`).

In [None]:
sum(temps) / len(temps) # Built-in functions work on both arrays and lists

In [None]:
temps.sum() / len(temps)  # The sum method for arrays

In [None]:
temps.mean() # The mean method for arrays

In [None]:
max(temps) 

In [None]:
temps.max() # Arrays have their own min/max methods (faster)
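
A rough way to check the speed claims yourself is to time each approach with Python's built-in `timeit` module. This is only a sketch: `big_list` and `big_array` hold made-up data, and the timings depend on your machine, so no particular numbers are promised here.

```python
import timeit
import numpy as np

# Made-up data: one million temperatures
big_list = [68, 72, 65, 64] * 250_000
big_array = np.array(big_list)

# Each approach computes the same total...
list_total = sum(big_list)
array_total = big_array.sum()
print(list_total, array_total)

# ...but they can take very different amounts of time (5 runs each, in seconds)
print('built-in sum on a list: ', timeit.timeit(lambda: sum(big_list), number=5))
print('built-in sum on an array:', timeit.timeit(lambda: sum(big_array), number=5))
print('array .sum() method:     ', timeit.timeit(lambda: big_array.sum(), number=5))
```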

## Ranges

### Motivation

- We often find ourselves needing to make arrays like this:

In [None]:
days_in_january = np.array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 
    23, 24, 25, 26, 27, 28, 29, 30, 31
])

### Ranges
* A **range** is an array of evenly spaced numbers, such as consecutive integers.
* ```np.arange(end)```: An array of increasing integers from 0 up to (and excluding!) end.
* ```np.arange(start, end)```: An array of increasing integers from start up to (excluding!) end.
* ```np.arange(start, end, step)```: An array of numbers from start up to (excluding!) end, with a gap of step between consecutive values.
* The range always includes the start but excludes the end (i.e. a half-open interval): $[\text{start}, \text{end})$.

In [None]:
np.arange(8)

In [None]:
np.arange(3, 9, 1)

In [None]:
np.arange(3, 32, 5)

In [None]:
np.arange(-3, 2, 0.5)

In [None]:
np.arange(8, 5, -1)

In [None]:
np.arange(1, -10, -3)
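
Returning to the motivating example, the `days_in_january` array can be built with a single call. A short sketch (`every_fifth` is a name made up for this example):

```python
import numpy as np

# Start at 1; the end, 32, is excluded, so the last value is 31
days_in_january = np.arange(1, 32)
print(len(days_in_january))
print(days_in_january.min(), days_in_january.max())

# With a step of 5, the values are 3, 8, 13, 18, 23, 28; the end, 32, is never reached
every_fifth = np.arange(3, 32, 5)
print(every_fifth)
```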

### Discussion Question

On the first day of January, you are paid 1 cent. Every day thereafter, your pay doubles: on the 2nd day it is 2 cents, on the 3rd it is 4 cents, on the 4th it is 8 cents, and so on.

January has 31 days.

Which of these expressions calculates the total amount of money you'll make in January, in dollars?

A. `(2**(np.arange(31) * 0.01)).sum()`

B. `(2**(np.arange(32) * 0.01)).sum()`

C. `((2**np.arange(31)) * 0.01).sum()`

D. `((2**np.arange(32)) * 0.01).sum()`

### To answer, go to **[menti.com](https://menti.com)** and enter the code **3962 2509**.

## DataFrames (i.e. Tables)

<center>
<img width=50% src="images/imdb.png"/>
</center>

### How do we store *tabular data*?

- We could have an array for title, another for rating, another for year, etc.
- But this is not convenient.
- Instead, we use something called a *DataFrame*.

### `pandas`

- DataFrames are provided by a package called `pandas`.
- `pandas` is **the** tool for doing data science in Python.

<center>
<img src='images/pandas.png' width=500>
</center>

### But `pandas` is not so cute...

<center>
<img height=100% src="images/angrypanda.jpg"/>
</center>

### Instead!

- We at UCSD have created a smaller, nicer version of `pandas`.
- It keeps the important stuff and throws out the rest.
- It's easier to learn, but is still valid `pandas` code.

### We call it `babypandas` 🐼

<center>
<img height=75% src="images/babypanda.jpg" width=500/>
</center>

### Importing `babypandas`

In [None]:
import babypandas as bpd

### Structure of a DataFrame

- DataFrames (the name for tables in `pandas` and `babypandas`) have *columns* and *rows*.
    - Can think of each column as an array.
- Each column has a label: `"Votes"`, `"Rating"`, etc.
    - This is its name.
    - Column labels are stored as strings.
- Every row has a label too: in this case, 0, 1, 2, 3, 4.

In [None]:
# Reading a DataFrame from a file and keeping just the first 5 rows
movies = bpd.read_csv('data/imdb.csv').take(np.arange(5))
movies

### The index

- Together, the row labels are called the *index*.
    - Think of the index as an array of names, one for each row.
- **The index is not a separate column**!

In [None]:
movies

### Setting a new index

- We can set a better index using `.set_index(column_name)`.
- Row labels should (ideally) be unique identifiers.
    - Remember, the index contains the "name" of each row. Ideally, each row has a different, descriptive name.
- Like most **DataFrame methods**, `.set_index` returns a new DataFrame; it does not modify the original DataFrame.
- The result not only looks nicer, but will be easier to manipulate, as we'll see soon.

In [None]:
movies

In [None]:
movies.set_index('Title')

In [None]:
movies

In [None]:
movies_by_name = movies.set_index('Title')
movies_by_name
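
The same behavior can be checked on a tiny table built by hand. This sketch uses plain `pandas` rather than `babypandas`, on the assumption that `.set_index` behaves the same way in both; the rows in `toy` are made-up placeholders, not the IMDb data, and `toy` and `by_title` are names invented for this example.

```python
import pandas as pd

# Made-up placeholder rows, not the IMDb data from lecture
toy = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Rating': [8.1, 7.4, 9.0],
})

# .set_index returns a new DataFrame; toy itself is not modified
by_title = toy.set_index('Title')

print(list(toy.columns))       # toy still has a 'Title' column
print(list(by_title.columns))  # in by_title, 'Title' is no longer a column...
print(list(by_title.index))    # ...its values are now the row labels
```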

### The index is like an array

In [None]:
movies_by_name

In [None]:
# Technically not an array, but works just like one
movies_by_name.index

### Discussion Question

Which of these will return `"Léon"`?

A. `movies_by_name['Title'][3]`

B. `movies_by_name['Title'][4]`

C. `movies_by_name.index[3]`

D. `movies_by_name.index[4]`

### To answer, go to **[menti.com](https://menti.com)** and enter the code **3962 2509**.

## Summary

### Summary

- Strings are used to store text.
- Lists and arrays are used to store sequences.
    - Arrays are faster and more convenient for numerical operations.
- We will be using the `babypandas` module for working with data.
- Tables in `babypandas` are called DataFrames.
- **Next time:** We will do a deep dive on a single dataset and introduce DataFrame manipulation techniques as necessary.
    - Remember to refer to the resources from the start of lecture!