# Lecture 4 – Arrays and DataFrames
## DSC 10, Summer 2022

### Announcements

- Good job on Lab 1 and HW 1!
  - Aiming to have HW 1 graded this week.
- Assignment deadlines are pushed back because of July 4.
  - Labs now due Tues, and HW due Sat.
  - Nothing due this week!
- **Lab 2 released, due Tues July 12 at 11:59pm.**
- HW 2 will be released this week, due Sat July 16 at 11:59pm.

### Agenda

- Strings and text.
- Lists.
- Arrays.
- Ranges.
- DataFrames.

### Resources

- We're covering **a lot** of content very quickly. If you're overwhelmed, just know that we're here to support you! 
    - Office Hours and Campuswire are your friends 🤝.
- Remember to check the [Resources tab of the course website](https://dsc10.com/resources/) for programming resources.

## Strings and text

### Strings

- A string is a snippet of text of any length.
- Enclose a string in either single or double quotes.

In [1]:
'woof'

'woof'

In [2]:
"woof"

'woof'

In [3]:
# A string, not an int!
"1998"

'1998'

### String arithmetic

- When using the `+` symbol between two strings, the operation is called "concatenation".

In [4]:
s1 = 'tiny'
s2 = 'panda'

In [5]:
s1 + s2

'tinypanda'

In [6]:
s1 + ' ' + s2

'tiny panda'

In [7]:
s1 * 3

'tinytinytiny'

### String methods
* Strings are associated with certain functions called **string methods**.
* Access string methods with a `.` after the string (dot notation).
* Examples: `.upper()`, `.replace()`, ...

In [8]:
my_cool_string = 'data science is super cool!'

In [9]:
my_cool_string.upper()

'DATA SCIENCE IS SUPER COOL!'

In [10]:
my_cool_string.replace('super cool', '💯' * 3)

'data science is 💯💯💯!'

In [11]:
# len is not a method, since it doesn't use dot notation
len(my_cool_string)

27
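
String methods return a new string rather than changing the original. The short sketch below illustrates this with the same example string; the variable name `shouted` is made up for illustration.

```python
my_cool_string = 'data science is super cool!'

# .upper() returns a new string; it does not modify my_cool_string
shouted = my_cool_string.upper()

print(shouted)         # the uppercase copy
print(my_cool_string)  # the original, unchanged
```

To keep the result of a string method, assign it to a variable, as done with `shouted` here.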

### Special characters in strings
* Apostrophes, quotes, newlines, etc. need special handling inside strings.

In [12]:
'my string's full of apostrophes!'

SyntaxError: invalid syntax (3472332101.py, line 1)

In [13]:
"my string's full of apostrophes!"

"my string's full of apostrophes!"

In [14]:
# escape the apostrophe with a backslash!
'my string\'s "full" of apostrophes!'

'my string\'s "full" of apostrophes!'

In [15]:
print('my string\'s "full" of apostrophes!')

my string's "full" of apostrophes!


### Digression: ```print()```
- By default, Jupyter notebooks display the "raw" value of the last expression in a cell.
- The `print` function displays a value as human-readable text as soon as it's called.

In [16]:
12 # 12 won't be displayed, since Jupyter only shows the value of the last expression
23

23

In [17]:
# Note, there is no Out[number] to the left! That only appears when displaying a non-printed value.
# But both 12 and 23 are displayed.
print(12)
print(23)

12
23


In [18]:
# '\n' inserts a new line
my_newline_str = 'here is a string with two lines.\nhere is the second line'  
my_newline_str

'here is a string with two lines.\nhere is the second line'

In [19]:
# Notice the quotes disappear!
print(my_newline_str)  

here is a string with two lines.
here is the second line


### Type conversion to and from strings
* Any value can be converted to a string using ```str```.
* Some strings can be converted to ```int``` and ```float```.

In [20]:
str(3)

'3'

In [21]:
float('3')

3.0

In [22]:
int('4')

4

In [23]:
int('bunnies')

ValueError: invalid literal for int() with base 10: 'bunnies'
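
Type conversion comes up whenever text and numbers need to be combined. Here is a minimal sketch, with variable names (`year_text`, `next_year`, `message`) invented for illustration:

```python
year_text = '1998'  # a string, not an int

# Convert to int in order to do arithmetic
next_year = int(year_text) + 1

# Convert back to a string in order to concatenate with other text
message = 'Next year is ' + str(next_year)
print(message)
```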

<div class="menti">
<div>

### Discussion Question

Assume you have run the following statements:

```py
x = 3
y = '4'
z = '5.6'
```

Choose the expression that will be evaluated **without** an error.

A. `x + y`

B. `x + int(y + z)`

C. `str(x) + int(y)`

D. `str(x) + z`

E. All of them have errors
    
</div>
<div>

### To answer, go to **[menti.com](https://www.menti.com/v42ge81t5d)** and enter the code 2863 3386 or use this QR code:

![](images/menti-qr.png)
    
</div>
</div>

## Lists

### How do we store *sequences*?

For instance:
- All temperatures in the month of January. 🌡️
- The age of every user on TikTok. 🤳
- The salary of every NBA player. 🏀

### Each as its own variable?

In [24]:
temperature_on_jan_01 = 68
temperature_on_jan_02 = 72
temperature_on_jan_03 = 65
temperature_on_jan_04 = 64
temperature_on_jan_05 = 62
temperature_on_jan_06 = 61
# and so on

```
avg_temperature = 1/31 * (
    temperature_on_jan_01
    + temperature_on_jan_02
    + temperature_on_jan_03
    + ...)
```

It seems like we need a better solution.

## Python's `list`s

- To create a `list`, place commas between things and surround with square brackets:

In [69]:
temperature_list = [68, 72, 65, 64, 62, 61]

In [70]:
len(temperature_list)

6

## `list`s make working with sequences easy

In [72]:
# What's the average temperature?
sum(temperature_list) / len(temperature_list)

65.33333333333333

### There's a problem...

- Lists are **very slow** for numerical computation.
- This is not a big deal when there aren't many entries, but it's a big problem when there are millions/billions of entries.

## Arrays

### Arrays

- Arrays are like lists, but faster.
- Provided by a package called `numpy` (pronounced "num-pie").
    - Core package for data science and scientific computing.

<center>
<img src='images/numpy.png' width=400>
</center>

- To use `numpy`, we need to import it. It's usually imported as `np` (but doesn't have to be!).

In [73]:
import numpy as np

### Creating arrays

- To create an array, pass a list as input to the `np.array` function.
- Remember the square brackets!

<center>
<img src='images/brackets.png' width=500>
</center>

In [74]:
temperature_array = np.array([68, 72, 65, 64, 62, 61])
temperature_array

array([68, 72, 65, 64, 62, 61])

In [75]:
temperature_list

[68, 72, 65, 64, 62, 61]

In [76]:
# No square brackets, because temperature_list is already a list!
np.array(temperature_list)

array([68, 72, 65, 64, 62, 61])
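
For a rough sense of the claim that arrays are faster than lists, the sketch below times summing one million numbers stored in a list versus an array. The use of Python's built-in `timeit` module and all variable names are assumptions made for illustration. Exact timings depend on your machine, so no specific numbers are claimed here.

```python
import timeit
import numpy as np

n = 1_000_000
numbers_list = list(range(n))  # the integers 0, 1, ..., n - 1 as a list

# dtype=np.int64 avoids overflow on platforms where NumPy's
# default integer type is 32-bit
numbers_array = np.array(numbers_list, dtype=np.int64)

# Time 5 repetitions of each approach (results are in seconds)
list_seconds = timeit.timeit(lambda: sum(numbers_list), number=5)
array_seconds = timeit.timeit(lambda: numbers_array.sum(), number=5)

# Both approaches compute the same total
list_total = sum(numbers_list)
array_total = int(numbers_array.sum())

print('list: ', list_seconds, 'seconds')
print('array:', array_seconds, 'seconds')
print('same total?', list_total == array_total)
```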

### Accessing elements in an array

- The things inside of an array are called its *elements*.
- Every element in an array has a position.
- Python, like most programming languages, is 0-indexed. **This means that the position of the first element in an array is 0, not 1.**
- To access the element at position `i` in an array, add `[i]` after the name of the array.
- Everything above applies to lists, too.

In [77]:
temperature_array

array([68, 72, 65, 64, 62, 61])

In [78]:
temperature_array[0]

68

In [79]:
temperature_array[1]

72

In [80]:
temperature_array[3]

64

In [81]:
# Access last element
temperature_array[5]

61

In [82]:
temperature_array[6]

IndexError: index 6 is out of bounds for axis 0 with size 6
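
Because positions start at 0, the last valid position is always one less than the length. This small sketch makes that explicit; the names `last_position` and `last_temperature` are made up for illustration.

```python
import numpy as np

temperature_array = np.array([68, 72, 65, 64, 62, 61])

# Valid positions run from 0 to len(temperature_array) - 1
last_position = len(temperature_array) - 1
last_temperature = temperature_array[last_position]

print(last_position)     # position of the last element
print(last_temperature)  # the last element itself
```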

### Array-number arithmetic

- Arrays make it easy to perform the same operation on every element.

In [83]:
temperature_array

array([68, 72, 65, 64, 62, 61])

In [84]:
# Increase all temperatures by 3 degrees
temperature_array = temperature_array + 3
temperature_array

array([71, 75, 68, 67, 65, 64])

In [85]:
# Halve all temperatures
temperature_array / 2

array([35.5, 37.5, 34. , 33.5, 32.5, 32. ])

In [86]:
# Convert all temperatures to Celsius
(5/9) * (temperature_array - 32)

array([21.66666667, 23.88888889, 20.        , 19.44444444, 18.33333333,
       17.77777778])

In [87]:
# In the previous two cells, we didn't re-assign temperature_array
temperature_array

array([71, 75, 68, 67, 65, 64])

### Array-array arithmetic

- Two arrays of the **same size** can be added, subtracted, multiplied, etc.
- The arithmetic happens *elementwise*.

In [88]:
a1 = np.array([1, 2, 3])
a2 = np.array([4, 5, -6])

In [89]:
a1 + a2

array([ 5,  7, -3])

In [90]:
a1 - a2

array([-3, -3,  9])

In [91]:
a1 * a2

array([  4,  10, -18])
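
To see why the "same size" requirement matters, the sketch below tries to add a three-element array to a two-element array. The array `short` and the flag `sizes_matched` are hypothetical names, and `try`/`except` is used only so the cell finishes running instead of stopping at the error.

```python
import numpy as np

a1 = np.array([1, 2, 3])
short = np.array([10, 20])  # only two elements

try:
    a1 + short
    sizes_matched = True
except ValueError as error:
    # NumPy cannot pair up elements when the sizes don't line up
    sizes_matched = False
    print('ValueError:', error)
```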

### Example: newborn birth weights 👶

Suppose there are four babies with weights 3.405 kg, 3.207 kg, 2.420 kg, and 3.984 kg. 

The average weight of a newborn baby is 3.300 kg.

**Question:** How far are these four babies' weights from the weight of an average newborn?

In [92]:
g1 = 3.405 
g2 = 3.207
g3 = 2.42
g4 = 3.984

# Array of four weights
four_weights = np.array([g1, g2, g3, g4])
four_weights

array([3.405, 3.207, 2.42 , 3.984])

In [93]:
average_weight = 3.3

#### Calculate the deviation of weights from the average weight
* Subtracting a number from an array subtracts the number from each element.

In [94]:
four_weights - average_weight

array([ 0.105, -0.093, -0.88 ,  0.684])

#### Convert the weights to pounds (2.2 lb/kg)

In [95]:
four_weights_lbs = four_weights * 2.2
four_weights_lbs

array([7.491 , 7.0554, 5.324 , 8.7648])

#### How many babies are recorded in the array?

- The function `len` returns the length of an array (or list).
- In this case it's obvious that there are 4 elements in the array, but it won't always be so obvious.

In [96]:
len(four_weights_lbs)

4

### Example: daily temperatures 🌡️

Below is an array of daily high temperatures in San Diego from August 2018.

In [98]:
temps = np.array([86, 85, 85, 84, 85, 86, 91, 89, 90, 88, 88, 85, 83, 82, 79, 81, 82,
                   83, 82, 79, 81, 83, 83, 79, 80, 80, 79, 80, 82, 82, 80])

Number of days in August for which temperatures were recorded:

In [99]:
len(temps)

31

#### Temperature statistics (mean, min, max)

- Arrays have handy methods for common tasks.
- Methods are like functions but they use dot notation (e.g. `temps.max()`).

In [100]:
sum(temps) / len(temps) # Built-in functions work on both arrays and lists

83.29032258064517

In [101]:
temps.sum() / len(temps)  # The sum method for arrays

83.29032258064517

In [102]:
temps.mean() # The mean method for arrays

83.29032258064517

In [103]:
max(temps) 

91

In [104]:
temps.max() # Arrays have their own min and max methods (faster)

91
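
The `.min()` method works the same way as `.max()`, and the results can be combined with ordinary arithmetic. The sketch below uses only the first seven values of `temps` to stay short; the names `first_week`, `coolest`, `warmest`, `spread`, and `average` are invented for illustration.

```python
import numpy as np

# First seven values of the temps array above
first_week = np.array([86, 85, 85, 84, 85, 86, 91])

coolest = first_week.min()   # smallest element
warmest = first_week.max()   # largest element
spread = warmest - coolest   # difference between the extremes
average = first_week.mean()  # mean of the seven values

print(coolest, warmest, spread, average)
```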

## Ranges

### Motivation

- We often find ourselves needing to make arrays like this:

In [105]:
days_in_january = np.array([
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
    13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 
    23, 24, 25, 26, 27, 28, 29, 30, 31
])

### Ranges
* A **range** is an array of consecutive (or, more generally, evenly spaced) numbers.
* ```np.arange(end)```: An array of increasing integers from 0 up to (and excluding!) `end`.
* ```np.arange(start, end)```: An array of increasing integers from `start` up to (excluding!) `end`.
* ```np.arange(start, end, step)```: An array of numbers from `start` up to (excluding!) `end`, with a gap of `step` between consecutive values.
* The range always includes the start but excludes the end (i.e. a half-open interval): $[\text{start}, \text{end})$.

In [106]:
np.arange(8)

array([0, 1, 2, 3, 4, 5, 6, 7])

In [107]:
np.arange(3, 9, 1)

array([3, 4, 5, 6, 7, 8])

In [108]:
np.arange(3, 32, 5)

array([ 3,  8, 13, 18, 23, 28])

In [109]:
np.arange(-3, 2, 0.5)

array([-3. , -2.5, -2. , -1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5])

In [110]:
np.arange(8, 5, -1)

array([8, 7, 6])

In [111]:
np.arange(1, -10, -3)

array([ 1, -2, -5, -8])
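
Returning to the motivating example, the hand-typed `days_in_january` array can be rebuilt with a single call. This is a sketch of one way to do it; note that the end must be 32 for 31 to be included.

```python
import numpy as np

# start=1 is included, end=32 is excluded, so the last value is 31
days_in_january = np.arange(1, 32)

print(len(days_in_january))                     # number of days
print(days_in_january[0], days_in_january[30])  # first and last day
```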

<div class="menti">
<div>

### Discussion Question

On the first day of January, you are paid 1 cent. Every day thereafter, your pay doubles: on the 2nd day it is 2 cents, on the 3rd it is 4 cents, on the 4th it is 8 cents, and so on.

January has 31 days.

Which of these expressions calculates the total amount of money you'll make in January, in dollars?

A. `(2**(np.arange(31) * 0.01)).sum()`

B. `(2**(np.arange(32) * 0.01)).sum()`

C. `((2**np.arange(31)) * 0.01).sum()`

D. `((2**np.arange(32)) * 0.01).sum()`
    
</div>
<div>

### To answer, go to **[menti.com](https://www.menti.com/v42ge81t5d)** and enter the code 2863 3386 or use this QR code:

![](images/menti-qr.png)
    
</div>
</div>

## DataFrames (i.e. Tables)

<center>
<img width=50% src="images/imdb.png"/>
</center>

### How do we store *tabular data*?

- We could have an array for title, another for rating, another for year, etc.
- But this is not convenient.
- Instead, we use something called a *DataFrame*.

### `pandas`

- DataFrames are provided by a package called `pandas`.
- `pandas` is **the** tool for doing data science in Python.

<center>
<img src='images/pandas.png' width=500>
</center>

### But `pandas` is not so cute...

<center>
<img height=100% src="images/angrypanda.jpg"/>
</center>

### Instead!

- We at UCSD have created a smaller, nicer version of `pandas`.
- It keeps the important stuff and throws out the rest.
- It's easier to learn, but is still valid `pandas` code.

### We call it `babypandas` 🐼

<center>
<img height=75% src="images/babypanda.jpg" width=500/>
</center>

### Importing `babypandas`

In [112]:
import babypandas as bpd

## Summary

### Summary

- Strings are used to store text.
- Lists and arrays are used to store sequences.
    - Arrays are faster and more convenient for numerical operations.
- We will be using the `babypandas` module for working with data.
- Tables in `babypandas` are called DataFrames.
- **Next time:** We will do a deep dive on a single dataset and introduce DataFrame manipulation techniques as necessary.
    - Remember to refer to the resources from the start of lecture!