# Lecture 4 – Arrays and DataFrames
## DSC 10, Fall 2021

### Announcements

- Starting Monday, Janine's lectures (A00 and B00) will be held in RWAC 0121 (near Blue Bowl).
    - Same room as in-person discussion.
- Lab 1 and Homework 1 are due **tomorrow at 11:59pm**.
- Lab 2 is due on **Tuesday at 11:59pm**.
- Click the "[Zoom Links and Office Hours Schedule](https://canvas.ucsd.edu/calendar?include_contexts=course_29590#view_name=month)" link to see the OH schedule.
    - Instructions on how to access in-person OH are embedded within each calendar event.
    - Suraj's OH schedule is now Monday 2-3pm and Wednesday 10:30-11:30am, both in person.

### Agenda

- Review: strings and text.
- Lists.
- Arrays.
- Ranges.
- DataFrames.

### Resources

- We're covering **a lot** of content very quickly. If you're overwhelmed, just know that we're here to support you! 
    - Office Hours and Campuswire are your friends.
- Remember to check the [Resources tab of the course website](https://dsc10.com/resources/) for programming resources.
- Some key links moving forward:
    - [DSC 10 Reference Sheet](https://drive.google.com/file/d/1mQApk9Ovdi-QVqMgnNcq5dZcWucUKoG-/view).
    - [BabyPandas Documentation](https://babypandas.readthedocs.io/en/latest/index.html).

## Review: strings and text

### Strings

- A string is a snippet of text of any length.
- Enclose a string in either single or double quotes.

In [None]:
'oink'

In [None]:
"oink"

In [None]:
# A string,  not a float!
"12.0"

### String arithmetic

- When using the `+` symbol between two strings, the operation is called "concatenation".

In [None]:
s1 = 'baby'
s2 = 'porcupine'

In [None]:
s1 + s2

In [None]:
s1 + ' ' + s2

In [None]:
s1 * 3

### String methods
* Strings are associated with certain functions called **string methods**.
* Access string methods with a `.` after the string (dot notation).
* e.g. `.upper()`, `.replace()`,...

In [None]:
my_cool_string = 'data science is super cool!'

In [None]:
my_cool_string.upper()

In [None]:
my_cool_string.replace('super', 'super-duper')

In [None]:
# len is not a method, since it doesn't use dot notation
len(my_cool_string)

### Special characters in strings
* Apostrophes, quotes, newlines, etc.

In [None]:
# This is an error: the apostrophe in "string's" ends the string early
'my string's full of apostrophes!'

In [None]:
"my string's full of apostrophes!"

In [None]:
# escape the apostrophe with a backslash!
'my string\'s "full" of apostrophes!'

In [None]:
print('my string\'s "full" of apostrophes!')

### Digression: ```print()```
* By default, Jupyter notebooks display the "raw" value of the last expression in a cell.
* The function ```print``` displays a value as human-readable text when it's evaluated.

In [None]:
12 # 12 won't be displayed, since Python only shows the value of the last expression
23

In [None]:
# Note, there is no Out[number] to the left! That only appears when displaying a non-printed value.
# But both 12 and 23 are displayed.
print(12)
print(23)

In [None]:
# '\n' inserts a new line
my_newline_str = 'here is a string with two lines.\nhere is the second line'  
my_newline_str

In [None]:
# Notice the quotes disappear!
print(my_newline_str)  

### Type conversion to and from strings
* Any value can be converted to a string using ```str```.
* Some strings can be converted to ```int``` and ```float```.

In [None]:
str(3)

In [None]:
float('3')

In [None]:
int('4')

In [None]:
# This is an error: 'bunnies' is not the text of a number
int('bunnies')

### Discussion Question

Assume you have run the following statements:

```py
x = 3
y = '4'
z = '5.6'
```

Choose the expression that will be evaluated **without** an error.

A. `x + y`

B. `x + int(y + z)`

C. `str(x) + int(y)`

D. `str(x) + z`

E. All of them have errors

### To answer, go to **[menti.com](https://menti.com)** and enter the code **2309 7224**.

In [None]:
x = 3
y = '4'
z = '5.6'

In [None]:
x + y # Like 3 + "bunnies", this doesn't make sense

In [None]:
x + int(y + z)

In [None]:
str(x) + int(y)

In [None]:
str(x) + z
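
The four options above can also be checked in a single cell. The sketch below is only one way to do it: it uses `try`/`except` and `lambda`, which have not been introduced in this lecture, and the helper name `describe` is made up for illustration. It records either the value of each expression or the kind of error it raises.

```python
x = 3
y = '4'
z = '5.6'

def describe(label, expression):
    """Evaluate a zero-argument function and report its value or the error type."""
    try:
        return f"{label}: {expression()!r}"
    except (TypeError, ValueError) as err:
        return f"{label}: {type(err).__name__}"

results = [
    describe('A', lambda: x + y),            # int + str is not defined
    describe('B', lambda: x + int(y + z)),   # y + z is '45.6', which int() rejects
    describe('C', lambda: str(x) + int(y)),  # str + int is not defined
    describe('D', lambda: str(x) + z),       # str + str is concatenation
]

for line in results:
    print(line)
```

Only option D evaluates without an error, because both operands are strings and `+` concatenates them.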

## Lists

### How do we store *sequences*?

For instance:
- All temperatures in the month of October.
- The age of every user on Facebook.
- The salary of every NBA player.

### Each as its own variable?

In [None]:
temperature_on_oct_01 = 68
temperature_on_oct_02 = 72
temperature_on_oct_03 = 65
temperature_on_oct_04 = 64
temperature_on_oct_05 = 62
temperature_on_oct_06 = 61
temperature_on_oct_07 = 59
temperature_on_oct_08 = 64
temperature_on_oct_09 = 64
temperature_on_oct_10 = 63
temperature_on_oct_11 = 65
temperature_on_oct_12 = 62

```
avg_temperature = 1/12 * (
    temperature_on_oct_01
    + temperature_on_oct_02
    + temperature_on_oct_03
    + ...)
```

- It seems like we need a better solution.

## Python's `list`s

- To create a `list`, place commas between things and surround with square brackets:

In [None]:
temperature_list = [68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62]
temperature_list

In [None]:
type(temperature_list)

## `list`s make working with sequences easy

In [None]:
# compute the average temperature using `sum`
sum(temperature_list) / len(temperature_list)

### There's a problem...

- Lists are **very slow**.
- This is not a big deal when there aren't many entries, but it's a big problem when there are millions/billions of entries.
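
To get a rough feel for the speed difference, the sketch below times summing one million integers stored in a list against the same numbers stored in a `numpy` array (the package introduced in the next section). The variable names and the size `n` are chosen only for illustration, and the timings depend on the machine, so treat the printed numbers as indicative rather than exact.

```python
import timeit
import numpy as np

n = 1_000_000
numbers_list = list(range(n))
# int64 is requested explicitly so the total cannot overflow on any platform
numbers_array = np.arange(n, dtype=np.int64)

# Time 10 repetitions of each approach
list_seconds = timeit.timeit(lambda: sum(numbers_list), number=10)
array_seconds = timeit.timeit(lambda: numbers_array.sum(), number=10)

print(f"list:  {list_seconds:.4f} s")
print(f"array: {array_seconds:.4f} s")
```

Both approaches compute the same total; only the time taken differs.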

## Arrays

### Arrays

* Arrays are like lists, but faster.
* Provided by a package called `numpy` (pronounced "num-pie").
    - Core package for data science and scientific computing.

<center>
<img src='images/numpy.png' width=300>
</center>

In [None]:
import numpy as np

### Creating arrays

- To create an array, pass a list as input to the `np.array` function.
- Remember the square brackets!

<center>
<img src='images/brackets.png' width=500>
</center>

In [None]:
temperature_array = np.array([68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62])
temperature_array

In [None]:
temperature_list

In [None]:
np.array(temperature_list)

### Accessing elements in an array

- The things inside of an array are called its *elements*.
- Every element in an array has a position.
- Python, like most programming languages, is 0-indexed. **This means that the position of the first element in an array is 0, not 1.**
- To access the element at position `i` in an array, add `[i]` after the name of the array.
- Everything above applies to lists, too.

In [None]:
temperature_array

In [None]:
temperature_array[0]

In [None]:
temperature_array[1]

In [None]:
temperature_array[3]

In [None]:
# get the last element of the array?
temperature_array[12]

In [None]:
temperature_array[42]
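
Because positions start at 0, an array with 12 elements has positions 0 through 11, so position 12 does not exist. The sketch below shows two ways to reach the last element; the `try`/`except` block is there only so the cell keeps running after the out-of-bounds access, and the variable names are illustrative.

```python
import numpy as np

temperature_array = np.array([68, 72, 65, 64, 62, 61, 59, 64, 64, 63, 65, 62])

n = len(temperature_array)                      # 12 elements
last_by_position = temperature_array[n - 1]     # the last position is n - 1
last_by_negative_index = temperature_array[-1]  # -1 counts back from the end

try:
    temperature_array[n]                        # position 12 is out of bounds
    out_of_bounds_failed = False
except IndexError:
    out_of_bounds_failed = True

print(last_by_position, last_by_negative_index, out_of_bounds_failed)
```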

### Array-Number arithmetic

- `numpy` arrays make it easy to perform the same operation to every element.

In [None]:
temperature_array

In [None]:
# increase all temperatures by 3 degrees
temperature_array = temperature_array + 3

In [None]:
temperature_array

In [None]:
# halve all temperatures
temperature_array / 2

In [None]:
# convert all temperatures to Celsius
(5/9) * (temperature_array - 32)

### Array-Array arithmetic

- Two arrays of the same size can be added, subtracted, multiplied, etc.
- The arithmetic happens *elementwise*.

In [None]:
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])

In [None]:
a1 + a2

In [None]:
a1 - a2

In [None]:
a1 * a2
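
The "same size" requirement matters. As a small sketch, the cell below repeats the elementwise sum and then tries to add an array with a different number of elements; the array `shorter` is a made-up example, and `try`/`except` is used only to capture the error instead of stopping the cell.

```python
import numpy as np

a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, 6])
shorter = np.array([10, 20])   # hypothetical array with only two elements

elementwise_sum = a1 + a2      # pairs up positions: 1+4, 2+5, 3+6

try:
    a1 + shorter               # three elements cannot be paired with two
    mismatch_error = None
except ValueError as err:
    mismatch_error = type(err).__name__

print(elementwise_sum, mismatch_error)
```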

### Example: newborn birth weights

In [None]:
#: four baby girls with weight in kg: g1 = 3.405, g2 = 3.207, g3 = 2.42, g4 = 3.984

g1 = 3.405 
g2 = 3.207
g3 = 2.42
g4 = 3.984

# average weight of a newborn girl (in kg): 3.3
girl_av_weight = 3.3

#### Load the weights into an array of floats

In [None]:
weights_kg_g = np.array([g1, g2, g3, g4]) 

weights_kg_g

#### Calculate the deviation of weights from the average weight
* Subtracting a number from an array subtracts the number from each element.

In [None]:
weights_kg_g - girl_av_weight

#### Convert the weights to pounds (2.2 lb/kg)

In [None]:
weights_lbs_g = weights_kg_g * 2.2
weights_lbs_g

#### How many babies are recorded in the array?

- The function `len()` returns the length of an array (or list).

In [None]:
len(weights_lbs_g)

### Example: daily temperatures

Below is an array of daily high temperatures in San Diego from August 2018.

In [None]:
temps = np.array([86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
                   83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80])

Number of days in August for which temperatures were collected:

In [None]:
len(temps)

#### Temperature statistics (mean, min, max)

- Arrays have handy methods for common tasks.
- Methods are like functions but they use dot notation (e.g. `temps.max()`).

In [None]:
sum(temps) / len(temps)

In [None]:
temps.sum() / len(temps)  

In [None]:
temps.mean() # the mean method for arrays

In [None]:
max(temps) # built-in functions work on arrays

In [None]:
temps.max() # arrays have their own min/max method (faster)

## Ranges

### Motivation

- We often find ourselves needing to make arrays like this:

In [None]:
days_in_october = np.array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 
    23, 24, 25, 26, 27, 28, 29, 30, 31
])

### Ranges
* A range is an array of consecutive numbers (or evenly spaced numbers).
* ```np.arange(end)```: An array of increasing integers from 0 up to (and excluding!) end.
* ```np.arange(start, end)```: An array of increasing integers from start up to (excluding!) end.
* ```np.arange(start, end, step)```: A range with step between consecutive values.
* The range always includes the start but excludes the end (i.e. a half-open interval): $[~,~)$.

In [None]:
np.arange(8)

In [None]:
np.arange(3, 9, 1)

In [None]:
np.arange(3, 31, 5)

In [None]:
np.arange(-3, 2, 0.5)

In [None]:
np.arange(8, 5, -1)

In [None]:
np.arange(1, -10, -3)
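
Returning to the motivating example, the hand-typed array of October days can be produced with a single call. Since the end is excluded, the end must be 32 to include day 31. The name `days_by_hand` is used here only to compare the two versions.

```python
import numpy as np

days_by_hand = np.array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
    23, 24, 25, 26, 27, 28, 29, 30, 31
])

# start = 1 is included, end = 32 is excluded, so the last value is 31
days_in_october = np.arange(1, 32)

print(np.array_equal(days_by_hand, days_in_october))
print(len(days_in_october))
```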

### Discussion Question

On the first day of October, you are paid 1 cent. Every day thereafter, your pay doubles: on the 2nd day it is 2 cents, on the 3rd it is 4 cents, on the 4th it is 8 cents, and so on.

October has 31 days.

Which of these expressions calculates the total amount of money you'll make in October (in dollars)?

A. `(2**(np.arange(31) * .01)).sum()`

B. `(2**(np.arange(32) * .01)).sum()`

C. `((2**np.arange(31)) * .01).sum()`

D. `((2**np.arange(32)) * .01).sum()`

### To answer, go to **[menti.com](https://menti.com)** and enter the code **2309 7224**.

In [None]:
np.arange(31)

In [None]:
2**np.arange(31) #number-array arithmetic

In [None]:
2**np.arange(31) * .01 

In [None]:
(2**np.arange(31) * .01).sum()
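
As a sanity check, the total can be compared against the formula for a doubling sum: 1 + 2 + 4 + ... + 2^30 = 2^31 - 1 cents. The sketch below computes the total both ways; the variable names are illustrative, and the two results may differ by a tiny floating-point rounding error.

```python
import numpy as np

# Pay on day 1 is 2**0 cents, on day 31 it is 2**30 cents
daily_pay_cents = 2 ** np.arange(31)

total_dollars = (daily_pay_cents * .01).sum()

# Closed form for 1 + 2 + 4 + ... + 2**30, converted from cents to dollars
closed_form_dollars = (2 ** 31 - 1) / 100

print(total_dollars)
print(closed_form_dollars)
```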

## DataFrames (i.e. Tables)

<center>
<img width=50% src="images/imdb.png"/>
</center>

### How do we store *tabular data*?

- Could have an array for title, another for rating, another for year, etc.
- But this is not convenient.
- Instead, we use something called a *DataFrame*.

### `pandas`

- DataFrames are provided by a package called `pandas`.
- `pandas` is **the** tool for doing data science in Python.

<center>
<img src='images/pandas.png' width=500>
</center>

### But `pandas` is not so cute...

<center>
<img height=100% src="images/angrypanda.jpg"/>
</center>

### Instead!

- We at UCSD have created a smaller, nicer version of `pandas`.
- Keeps important stuff, throws out the rest.
- Easier to learn, but is still valid `pandas` code.

### We call it `babypandas` 🐼

<center>
<img height=75% src="images/babypanda.jpg" width=500/>
</center>

### Importing `babypandas`

In [None]:
import babypandas as bpd

### Table structure

- Tables (called DataFrames in `pandas` and `babypandas`) have *columns* and *rows*.
    - Can think of each column as an array.
- Every column has a label: `"Votes"`, `"Rating"`, etc.
    - This is its name.
    - Labels are stored as strings.
- Every row has a label too: 0, 1, 2, 3.

In [None]:
movies = bpd.read_csv('data/imdb.csv').take(np.arange(4))
movies

### The index

- Together, the row labels are called the *index*.
- **The index is not a separate column**!

In [None]:
movies

### Setting a new index

- We can set a better index using `.set_index(column_name)`.
- Row labels should (ideally) be unique identifiers.
- Returns a copy!
- Looks nicer, but also really useful.

In [None]:
movies

In [None]:
movies.set_index('Title')

In [None]:
movies

In [None]:
movies_by_name = movies.set_index('Title')
movies_by_name

### The index is like an array

In [None]:
movies_by_name.index
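
The sketch below illustrates two points from this section on a tiny made-up table: `.set_index` returns a new DataFrame and leaves the original unchanged, and the column used as the index stops being a regular column. It uses `pandas` directly rather than `babypandas`, on the assumption (stated earlier) that `babypandas` code is valid `pandas` code; the titles and ratings are invented and are not from `imdb.csv`.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Rating': [8.1, 7.4, 9.0],
})

toy_by_title = toy.set_index('Title')   # a new DataFrame; toy itself is unchanged

print(list(toy.columns))            # 'Title' is still a column in the original
print(list(toy_by_title.columns))   # in the copy, 'Title' has moved to the index
print(toy_by_title.index[0])        # index labels are accessed by position, like an array
```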

### Discussion Question

Which of these will return `Léon`?

A. `movies_by_name['Title'][3]`

B. `movies_by_name['Title'][4]`

C. `movies_by_name.index[3]`

D. `movies_by_name.index[4]`

### To answer, go to **[menti.com](https://menti.com)** and enter the code **2309 7224**.

## Summary

### Summary

- Strings are used to store text.
- Lists and arrays are used to store sequences.
    - Arrays are faster and more convenient for numerical operations.
- We will be using the `babypandas` module for working with data.
- Tables in `babypandas` are called DataFrames.
- **Next time:** We will do a deep dive on a single dataset and introduce DataFrame manipulation techniques as necessary.
    - Remember to refer to the resources from the start of lecture!