In [None]:
# You'll start seeing this cell in most lectures.
# It exists to hide all of the import statements and other setup
# code we need in lecture notebooks.
from dsc80_utils import *

# <span style="color:#7b40c7">Pre-Lecture Reading</span> for Lecture 2 – DataFrame Fundamentals

## DSC 80, Winter 2024

<div class="alert alert-success">
<b>Make sure to read this before attending lecture (which, remember, is on <a href="https://ucsd.zoom.us/my/rampure">Zoom</a> today.)</b>

You can also access this notebook by pulling the course GitHub repository and opening <code>lectures/lec02/pre-lec02.ipynb</code>.
</div>

In this reading, we'll review some of the basics of `numpy` and `babypandas` that you're familiar with from [DSC 10](https://dsc-courses.github.io/dsc10-2023-fa). In lecture, we'll build off of this foundation.

## `numpy` arrays

### `numpy` overview

- `numpy` stands for "numerical Python". It is a commonly-used Python module that enables **fast** computation involving arrays and matrices.
- `numpy`'s main object is the **array**. In `numpy`, arrays are:
    - Homogeneous – all values are of the same type.
    - (Potentially) multi-dimensional.
- Computation in `numpy` is fast because:
    - Much of it is implemented in C.
    - `numpy` arrays are stored more efficiently in memory than, say, Python lists. 
- [This site](https://cloudxlab.com/blog/numpy-pandas-introduction/) provides a good overview of `numpy` arrays.
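
To make the "homogeneous" point concrete, here is a minimal sketch (the variable names are ours, chosen for illustration). When we mix integers and a float, `numpy` converts every value to a single common type.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2, 3.5])   # the integers are converted to floats

print(ints.dtype)    # an integer dtype (the exact width depends on your platform)
print(mixed.dtype)   # float64
print(mixed)         # [1.  2.  3.5]
```

Every array has exactly one `dtype`, which is part of what lets `numpy` store its values compactly.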

We used `numpy` in DSC 10 to work with sequences of data:

In [None]:
arr = np.arange(10)
arr

In [None]:
# The shape (10,) means that the array only has a single dimension,
# of size 10.
arr.shape

In [None]:
2 ** arr

Arrays come equipped with several handy methods; some examples are below, but you can read about them all [here](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html).

In [None]:
(2 ** arr).sum()

In [None]:
(2 ** arr).mean()

In [None]:
(2 ** arr).max()

In [None]:
(2 ** arr).argmax()
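
As a quick sketch of how `max` and `argmax` relate (the names `powers`, `largest`, and `position` are ours): `max` returns the largest value, while `argmax` returns the position at which that value occurs.

```python
import numpy as np

powers = 2 ** np.arange(10)    # 1, 2, 4, ..., 512

largest = powers.max()         # the largest value: 512
position = powers.argmax()     # the position of the largest value: 9

# Indexing with the argmax gives back the max.
print(powers[position] == largest)   # True
```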

### ⚠️ The dangers of `for`-loops

- `for`-loops are slow when processing large datasets. **You will rarely write `for`-loops in DSC 80 (except for Lab 1 and Project 1), and may be penalized on assignments for using them when unnecessary!**
- One of the biggest benefits of `numpy` is that it supports **vectorized** operations. 
    - If `a` and `b` are two arrays of the same length, then `a + b` is a new array of the same length containing the element-wise sum of `a` and `b`.
- To illustrate how much faster `numpy` arithmetic is than using a `for`-loop, let's compute the squares of the integers from 0 to 999,999:
    - Using a `for`-loop.
    - Using vectorized arithmetic, through `numpy`.

In [None]:
%%timeit
squares = []
for i in range(1_000_000):
    squares.append(i * i)

In vanilla Python, this takes about 0.04 seconds per loop.

In [None]:
%%timeit
squares = np.arange(1_000_000) ** 2

In `numpy`, this only takes about 0.001 seconds per loop, roughly 40x faster! (Exact timings will vary from machine to machine.) Note that under the hood, `numpy` is also using a `for`-loop, but it's a `for`-loop implemented in C, which is much faster than Python.
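
Here is a small sketch of what "vectorized" means in practice, using made-up arrays `a` and `b` and a smaller input than the timing cells above. It shows element-wise addition and checks that the loop and the vectorized expression produce the same squares.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Element-wise sum: no loop needed.
elementwise_sum = a + b
print(elementwise_sum)   # [11 22 33]

# The same squares, computed two ways on a small input.
n = 1_000
loop_squares = []
for i in range(n):
    loop_squares.append(i * i)

vectorized_squares = np.arange(n) ** 2

# Both approaches give identical values.
same = np.array_equal(loop_squares, vectorized_squares)
print(same)   # True
```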

### Multi-dimensional arrays

While we didn't see these very often in DSC 10, multi-dimensional lists/arrays may have since come up in DSC 20, 30, or 40A (especially in the context of linear algebra).

We'll spend a bit of time talking about 2D (and 3D) arrays here, since in some ways, they behave similarly to DataFrames. 

Below, we create a 2D array from scratch.

In [None]:
nums = np.array([
    [5, 1, 9, 7],
    [9, 8, 2, 3],
    [2, 5, 0, 4]
])

nums

In [None]:
# nums has 3 rows and 4 columns.
nums.shape

We can also create 2D arrays by _reshaping_ other arrays.

In [None]:
# Here, we're asking to reshape np.arange(1, 7)
# so that it has 2 rows and 3 columns.
a = np.arange(1, 7).reshape((2, 3))
a

### Operations along axes

In 2D arrays (and DataFrames), axis 0 refers to the rows (up and down) and axis 1 refers to the columns (left and right).

<center><img src='imgs/axis-sum.png' width=600></center>

In [None]:
a

If we specify `axis=0`, `a.sum` will "compress" along axis 0, adding up the values in each column and returning one sum per column.

In [None]:
a.sum(axis=0)

If we specify `axis=1`, `a.sum` will "compress" along axis 1, adding up the values in each row and returning one sum per row.

In [None]:
a.sum(axis=1)
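
For another look at the same idea, here is a sketch that applies both axes to the `nums` array from earlier (the names `col_sums` and `row_sums` are ours). Notice how the shape of each result tells you which axis was "compressed" away.

```python
import numpy as np

nums = np.array([
    [5, 1, 9, 7],
    [9, 8, 2, 3],
    [2, 5, 0, 4]
])

# axis=0 collapses the 3 rows, leaving one sum per column.
col_sums = nums.sum(axis=0)
print(col_sums, col_sums.shape)   # [16 14 11 14] (4,)

# axis=1 collapses the 4 columns, leaving one sum per row.
row_sums = nums.sum(axis=1)
print(row_sums, row_sums.shape)   # [22 22 11] (3,)
```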

### Selecting rows and columns from 2D arrays

You can use `[`square brackets`]` to **slice** rows and columns out of an array, using the same slicing conventions you saw in DSC 20.

In [None]:
a

In [None]:
# Accesses row 0 and all columns.
a[0, :]

In [None]:
# Same as the above.
a[0]

In [None]:
# Accesses all rows and column 1.
a[:, 1]

In [None]:
# Accesses row 0 and columns 1 and onwards.
a[0, 1:]
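
The same conventions extend to slicing rows and columns at the same time. Here is a sketch using the `nums` array from earlier; the variable names are ours.

```python
import numpy as np

nums = np.array([
    [5, 1, 9, 7],
    [9, 8, 2, 3],
    [2, 5, 0, 4]
])

# Rows 1 and onwards, columns 0 and 1.
bottom_left = nums[1:, :2]
print(bottom_left)
# [[9 8]
#  [2 5]]

# All rows, last column (negative indices count from the end).
last_col = nums[:, -1]
print(last_col)   # [7 3 4]

# Last row, every other column starting from column 0.
every_other = nums[-1, ::2]
print(every_other)   # [2 0]
```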

<div class="alert alert-success">
    <h3>Exercise</h3>
    Try to predict the value of <code>grid[-1, 1:].sum()</code> without running the code below.
</div>

In [None]:
s = (5, 3)
grid = np.ones(s) * 2 * np.arange(1, 16).reshape(s)
# grid[-1, 1:].sum()

## From `babypandas` to `pandas` 🐼

### `babypandas`

In DSC 10, you used `babypandas`, which was a subset of `pandas` designed to be friendly for beginners.
<center><img src='imgs/babypanda.jpg' width=45%></center>

### `pandas`

You're not a beginner anymore – you've taken DSC 20, 30, and 40A. You're ready for the real deal.

<center><img src='imgs/angrypanda.jpg' width=60%></center>

Fortunately, **everything you learned in `babypandas` will carry over!**

### `pandas`

<center><img src='imgs/pandas.png' width=200></center>

- `pandas` is **the** Python library for tabular data manipulation.
- Before `pandas` was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
- Wes McKinney, the original developer of `pandas`, wanted a library which would allow everything to be done in Python.
    - Python is faster to develop in than Java, and is more general-purpose than R.

### `pandas` data structures

There are three key data structures at the core of `pandas`:
- DataFrame: a 2-dimensional table.
- Series: a 1-dimensional array-like object, typically representing a column or row.
- Index: a sequence of column or row labels.

<center>
    <img src='imgs/example-df.png' width=400>
</center>
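
To see all three structures in one place, here is a minimal sketch on a tiny, made-up DataFrame (the data and the names `df` and `ages` are hypothetical, purely for illustration).

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    'name': ['Ava', 'Ben', 'Cy'],
    'age': [20, 21, 19],
})

ages = df.get('age')   # a single column is a Series

print(type(df).__name__)           # DataFrame
print(type(ages).__name__)         # Series
print(type(df.columns).__name__)   # Index
print(type(df.index).__name__)     # RangeIndex, a special kind of Index
```

Both the row labels (`df.index`) and the column labels (`df.columns`) are Index objects.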

### Importing `pandas` and related libraries

`pandas` is almost always imported in conjunction with `numpy`.

In [None]:
import pandas as pd
import numpy as np

### Example: Dog Breeds (woof!) 🐶

We'll provide more context for the dataset we're working with in lecture. For now, all you need to know is that each row corresponds to a different dog breed.

In [None]:
# You'll see the Path(...) / syntax a lot.
# It creates the correct path to your file, 
# whether you're using Windows, macOS, or Linux.
# (Note that macOS and Linux use / to denote separate folders in paths,
# while Windows uses \.)
dog_path = Path('data') / 'dogs43.csv'
dogs = pd.read_csv(dog_path)
dogs
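
If you're curious what `Path(...) /` is doing, here is a short sketch. We're assuming `Path` is the class from Python's built-in `pathlib` module (in this notebook it arrives through `dsc80_utils`). `PurePosixPath` and `PureWindowsPath` are also part of `pathlib`; we use them here only to show how the same path is written on each kind of system.

```python
from pathlib import Path, PurePosixPath, PureWindowsPath

dog_path = Path('data') / 'dogs43.csv'
print(dog_path.name)    # dogs43.csv
print(dog_path.parts)   # ('data', 'dogs43.csv')

# The same two pieces, joined the way each operating system writes paths.
posix_style = str(PurePosixPath('data') / 'dogs43.csv')
windows_style = str(PureWindowsPath('data') / 'dogs43.csv')

print(posix_style)     # data/dogs43.csv
print(windows_style)   # data\dogs43.csv
```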

### Review: `head`, `tail`, `shape`, `index`, `get`, and `sort_values`

To extract the first or last few rows of a DataFrame, use the `head` or `tail` methods.

In [None]:
dogs.head(3)

In [None]:
dogs.tail(2)

The `shape` attribute returns the DataFrame's number of rows and columns.

In [None]:
dogs.shape

In [None]:
# The default index of a DataFrame is 0, 1, 2, 3, ...
dogs.index

We know that we can use `.get()` to select a column or multiple columns...

In [None]:
dogs.get('breed')

In [None]:
dogs.get(['breed', 'kind', 'longevity'])

Most people don't use `.get` in practice; we'll see the more common technique in lecture.
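
One detail worth remembering about `.get` is what type it returns. The sketch below uses a tiny, made-up DataFrame named `toy` (the values are hypothetical, and `'weight'` is deliberately a column that doesn't exist).

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    'breed': ['Breed A', 'Breed B'],
    'kind': ['toy', 'herding'],
    'longevity': [12.0, 11.5],
})

one = toy.get('breed')                   # a string gives a Series
many = toy.get(['breed', 'longevity'])   # a list gives a DataFrame
single = toy.get(['breed'])              # even a one-element list gives a DataFrame
missing = toy.get('weight')              # a missing column gives None, not an error

print(type(one).__name__, type(many).__name__, type(single).__name__, missing)
```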

And lastly, remember that to sort by a column, use the `sort_values` method. Like most DataFrame and Series methods, `sort_values` returns a new DataFrame, and doesn't modify the original.

In [None]:
# Note that the index is no longer 0, 1, 2, ...!
dogs.sort_values('height', ascending=False)

In [None]:
# This sorts by 'height', 
# then breaks ties by 'longevity'.
# Note the difference in the last three rows between
# this DataFrame and the one above.
dogs.sort_values(['height', 'longevity'],
                 ascending=False)

Note that `dogs` itself is unchanged; it is not the sorted DataFrame above. To save our changes, we'd need to say something like `dogs = dogs.sort_values(...)`.

In [None]:
dogs
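
Here is the same behavior on a tiny, made-up DataFrame (the values and the names `toy` and `sorted_toy` are hypothetical). Sorting returns a new DataFrame, and the original keeps its order until we reassign it.

```python
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({
    'height': [10.0, 25.0, 25.0, 15.0],
    'longevity': [14.0, 10.0, 12.0, 13.0],
})

# Sort by 'height', breaking ties by 'longevity', both descending.
sorted_toy = toy.sort_values(['height', 'longevity'], ascending=False)

print(list(sorted_toy.index))   # [2, 1, 3, 0]
print(list(toy.index))          # [0, 1, 2, 3], so the original is untouched

# To keep the sorted version, reassign the name.
toy = toy.sort_values(['height', 'longevity'], ascending=False)
print(list(toy.index))          # [2, 1, 3, 0]
```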

That's all we need to review... we'll pick back up in lecture!