In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

pd.set_option('display.max_rows', 9)
plt.style.use('ggplot')

# Lecture 3 – More DataFrame Fundamentals

## DSC 80, Spring 2023

### Agenda

- Recap: `loc` and `iloc`.
- Adding and modifying columns.
- Axes.
- `pandas` and `numpy`.
- Extra: Data cleaning and `plotly`.

## Recap: `loc` and `iloc`

### Example: Universities in California 📚

Recall that last lecture, we started working with a dataset that contains the name, location, enrollment, and founding date of most UCs and CSUs.

In [None]:
schools_path = os.path.join('data', 'california_universities.csv')
schools = pd.read_csv(schools_path)
schools.head()

### `loc` and `iloc` with the default index

- We use `loc` to access rows by their indexes (labels).
- We use `iloc` to access rows by their integer positions.
- When we load a DataFrame from file, the default index is 0, 1, 2, 3, ...
- In some cases, `loc` and `iloc` behave similarly – but they are **not the same**!

In [None]:
schools.head()

What's the difference between the two DataFrames below?

In [None]:
schools.loc[1:5]

In [None]:
schools.iloc[1:5]

Which of the following two expressions evaluate to the name of the youngest school in `schools`?

In [None]:
schools.sort_values('Founded', ascending=False).iloc[0]['Name']

In [None]:
schools.sort_values('Founded', ascending=False).loc[0]['Name']
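The distinction between labels and positions can be sketched on a small, made-up DataFrame. The names and years below are hypothetical and are not taken from the `schools` dataset.

```python
import pandas as pd

# Hypothetical toy data (not the real `schools` dataset).
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Founded': [1900, 1950, 2005, 1868],
})

# loc slices by label and includes both endpoints...
by_label = toy.loc[1:2]
# ...while iloc slices by position and excludes the stop position.
by_position = toy.iloc[1:2]

# After sorting, labels travel with their rows, so label 0 is no longer first.
sorted_toy = toy.sort_values('Founded', ascending=False)
first_by_position = sorted_toy.iloc[0]['Name']  # row in position 0 after sorting
first_by_label = sorted_toy.loc[0]['Name']      # row whose label is 0
```

Here `by_label` has two rows while `by_position` has one, and the two lookups on `sorted_toy` return different names.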

## Adding and modifying columns

### Adding and modifying columns, using a copy

- To add a new column to a DataFrame, use the `assign` method.
    - To change the values in a column, add a new column with the same name as the existing column.
- Like most `pandas` methods, `assign` returns a new DataFrame.
    - **Pro** ✅: This doesn't inadvertently change any existing variables.
    - **Con** ❌: It is not very space efficient, as it creates a new copy each time it is called.

In [None]:
schools.head()

In [None]:
schools.assign(Age=2023 - schools['Founded'])

In [None]:
schools.head()

As an aside, you should try your best to write **chained** `pandas` code, as follows:

In [None]:
(
    schools
    .assign(Age=(2023 - schools['Founded']))
    .assign(is_UC=schools['Name'].str.contains('University of California'))
)

In [None]:
schools
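In the chain above, each `assign` call refers back to `schools`, which works because both new columns depend only on existing columns. When a later step needs a column created earlier in the same chain, one option is to pass a function to `assign`; `pandas` calls it on the intermediate DataFrame. The following is a minimal sketch on made-up data, with `toy` and `is_old` as hypothetical names.

```python
import pandas as pd

# Hypothetical toy data (not the real `schools` dataset).
toy = pd.DataFrame({'Name': ['A', 'B'], 'Founded': [1900, 2005]})

result = (
    toy
    .assign(Age=lambda df: 2023 - df['Founded'])
    # 'Age' exists only in the intermediate DataFrame, not in `toy`,
    # so the callable form is what lets this step see it.
    .assign(is_old=lambda df: df['Age'] > 100)
)
```

As with the earlier examples, `toy` itself is left unchanged.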

You can also use `assign` when the desired column name has spaces, by unpacking a dictionary into keyword arguments with `**`.

In [None]:
(
    schools
    .assign(**{'Years since Founding': 2023 - schools['Founded']})
)

### Adding and modifying columns, in-place

* You can assign a new column to a DataFrame **in-place** using `[]`.
    - This works like dictionary assignment.
    - This **modifies** the underlying DataFrame, unlike `assign`, which returns a new DataFrame.
* This is the more "common" way of adding/modifying columns. 
    - ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.

In [None]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
schools_copy = schools.copy()
schools_copy.head()

In [None]:
schools_copy['Age'] = 2023 - schools_copy['Founded']

In [None]:
schools_copy['Name'] = schools_copy['Name'].str.replace('University of California,', 'UC')

Note that we never reassigned `schools_copy` in the two cells above – that is, we never wrote `schools_copy = ...` – though it was still modified.

In [None]:
schools_copy.head()

### Mutability

DataFrames, like lists, arrays, and dictionaries, are **mutable**. As you learned in DSC 20, this means that they can be modified after being created. 

Not only does this explain the behavior on the previous slide, but it also explains the following:

In [None]:
schools.head()

In [None]:
def calculate_age(df):
    df['Age'] = 2023 - df['Founded']
    return df

In [None]:
calculate_age(schools)

In [None]:
schools.head()

Note that `schools` was modified, even though we didn't reassign it! These unintended consequences can **influence the behavior of test cases on labs and projects**, among other things! 

To avoid this, it's a good idea to include `df = df.copy()` as the first line in functions that take DataFrames as input.

In [None]:
# How to delete a column?
# schools.drop('Age', axis=1, inplace=True)

In [None]:
def calculate_age(df):
    df = df.copy()
    # Now, the df referenced below is a fresh copy that is unrelated to the df passed in.
    df['Age'] = 2023 - df['Founded']
    return df

In [None]:
calculate_age(schools)

### What about rows?

You can add and modify rows using `loc` and `iloc`. There's also a function that can be used to add rows, called `pd.concat`; we'll see it in a few lectures.

In [None]:
schools_copy.iloc[-1, :] = ['University of California, La Jolla', 
                            'La Jolla', 
                            'San Diego', 
                            '80', 
                            2023, 
                            0]
schools_copy.tail()

In [None]:
schools_copy.loc[-1, :] = ['La Jolla State University', 
                           'La Jolla', 
                           'San Diego', 
                           '10', 
                           2023, 
                           0]
schools_copy.tail()
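As a preview, here is a minimal sketch of adding a row with `pd.concat`, using a made-up DataFrame rather than `schools_copy`. The details will be covered properly in a later lecture.

```python
import pandas as pd

# Hypothetical toy data (not the real `schools` dataset).
toy = pd.DataFrame({'Name': ['A', 'B'], 'Founded': [1900, 2005]})
new_row = pd.DataFrame({'Name': ['C'], 'Founded': [2023]})

# pd.concat returns a new DataFrame; ignore_index=True renumbers the rows 0, 1, 2, ...
combined = pd.concat([toy, new_row], ignore_index=True)
```

Unlike assignment through `loc` or `iloc`, this does not modify `toy`.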

## Axes

### Axes

- Each row and each column of a DataFrame can be accessed as a Series.
- The **axis** specifies the direction of a "slice" of a DataFrame.

<center><img src='imgs/axis.png' width=30%></center>

- Axis 0 refers to the index (rows).
- Axis 1 refers to the columns.

### DataFrame methods with `axis`

Consider the DataFrame `A` defined below using a dictionary.

In [None]:
A = pd.DataFrame({
    'A': [1, 4],
    'B': [2, 5],
    'C': [3, 6]
})
A

If we specify `axis=0`, `A.sum` will "compress" along axis 0, and keep the column labels intact.

In [None]:
A.sum(axis=0)

In [None]:
A.sum(0)

If we specify `axis=1`, `A.sum` will "compress" along axis 1, and keep the row labels (index) intact.

In [None]:
A.sum(axis=1)

<center><img src='imgs/axis-sum.png' width=600></center>

What's the default axis?

In [None]:
A

In [None]:
A.sum()

### DataFrame methods with `axis`

- In addition to `sum`, many other Series methods work on DataFrames.
- In such cases, the DataFrame method usually applies the Series method to every row or column.
- Many of these methods accept an `axis` argument; the default is usually `axis=0`.

In [None]:
schools.head()

In [None]:
# The maximum element in each column.
schools.max()

In [None]:
# The number of unique values in each column.
schools.nunique()

In [None]:
# Why is this meaningless?
schools[['Founded', 'Age']].mean(axis=1)

In [None]:
# describe doesn't accept an axis argument; it works on every numeric column in the DataFrame it is called on.
schools.describe()

### Discussion Question

In **words**, what characteristic do all schools in the following DataFrame share?

```py
schools[schools.nunique(axis=1) != schools.nunique(axis=1).max()]
```

_Hint: What city is SDSU in? What county is it in?_

## `pandas` and `numpy`

<center><img src='imgs/python-stack.png' width=60%></center>

### `numpy`

- NumPy stands for "numerical Python". It is a commonly-used Python module that enables **fast** computation involving arrays and matrices.
- `numpy`'s main object is the **array**. In `numpy`, arrays are:
    - Homogeneous – all values are of the same type.
    - (Potentially) multi-dimensional.
- Computation in `numpy` is fast because:
    - Much of it is implemented in C.
    - `numpy` arrays are stored more efficiently in memory than, say, Python lists. 
- [This site](https://cloudxlab.com/blog/numpy-pandas-introduction/) provides a good overview of `numpy` arrays.

### `pandas` is built upon `numpy`

- A Series in `pandas` is a `numpy` array with an index.
- A DataFrame is like a dictionary of columns, each of which is a `numpy` array.
- Many operations in `pandas` are fast because they use `numpy`'s implementations.
- To access the array underlying a DataFrame or Series, use the `to_numpy` method.
    - ⚠️ Warning: `to_numpy` can return a view of the original object, not a copy! Read more in the [course notes](https://notes.dsc80.com/content/02/data-types.html#copies-and-views-in-pandas).
    - `.values` is an older alternative to `.to_numpy()`; the `pandas` documentation recommends `.to_numpy()`.

In [None]:
arr = np.array([4, 2, 9, 15, -1])
arr

In [None]:
ser = pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
ser

In [None]:
conv = ser.to_numpy()
conv

In [None]:
conv[2] = 100
conv

Even though `conv` appears to be "detached" from `ser`, it is not:

In [None]:
ser
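If you need an array that is guaranteed to be independent of the Series, `to_numpy` accepts a `copy` argument. The sketch below rebuilds the same Series and shows that changes to a copied array do not propagate back; `detached` is a name chosen for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.array([4, 2, 9, 15, -1]), index=['a', 'b', 'c', 'd', 'e'])

# copy=True guarantees the returned array does not share memory with the Series.
detached = ser.to_numpy(copy=True)
detached[2] = 100  # only the copy changes; ser['c'] is still 9
```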

### The dangers of `for`-loops

- `for`-loops are slow when processing large datasets. **You will rarely write `for`-loops in DSC 80, and may be penalized on assignments for using them when unnecessary!**
- One of the biggest benefits of `numpy` is that it supports **vectorized** operations. 
    - If `a` and `b` are two arrays of the same length, then `a + b` is a new array of the same length containing the element-wise sum of `a` and `b`.
- To illustrate how much faster `numpy` arithmetic is than using a `for`-loop, let's compute the distances between the origin $(0, 0)$ and 1000 random points $(x, y)$ in $\mathbb{R}^2$:
    - Using a `for`-loop.
    - Using vectorized arithmetic, through `numpy`.
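As a quick illustration of what "vectorized" means, the sketch below adds two small arrays element-wise and compares the result to an explicit loop. The arrays are made up for the example.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Vectorized: a single expression operates on every element at once.
vectorized = a + b

# Equivalent loop: one Python-level addition per element.
looped = np.array([a[i] + b[i] for i in range(len(a))])
```

Both approaches produce the same values; the difference is in how the work is carried out.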

### Aside: Generating data

- First, we need to create a DataFrame containing 1000 random points in 2D. 
- `np.random.random(N)` returns an array containing `N` numbers selected uniformly at random from the interval $[0, 1)$.

In [None]:
N = 1000
x_arr = np.random.random(N)
y_arr = np.random.random(N)

coordinates = pd.DataFrame({'x': x_arr, 'y': y_arr})
coordinates.head()

In [None]:
coordinates.plot(kind='scatter', x='x', y='y');

Next, let's define a function that takes in a DataFrame like `coordinates` and returns the distances between each point and the origin, using a `for`-loop.

In [None]:
def distances(df):
    hyp_list = []
    for i in df.index:
        dist = (df.loc[i, 'x'] ** 2 + df.loc[i, 'y'] ** 2) ** 0.5
        hyp_list.append(dist)
    return hyp_list

distances(coordinates)[:5]

The `%timeit` magic command can repeatedly run any snippet of code and give us its average runtime.

In [None]:
%timeit distances(coordinates)

Now, using a vectorized approach:

In [None]:
%timeit (coordinates['x'] ** 2 + coordinates['y'] ** 2) ** 0.5

Note that "µs" refers to microseconds, which are one-millionth of a second, whereas "ms" refers to milliseconds, which are one-thousandth of a second.

**Takeaway**: Avoid `for`-loops whenever possible!
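Outside of a notebook, where `%timeit` is unavailable, a rough comparison can be made with `time.perf_counter`. This is a sketch under the assumption that a single run is representative enough; it regenerates its own random points (with an arbitrary seed) and also checks that both approaches agree.

```python
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(80)  # seed chosen arbitrarily for reproducibility
coords = pd.DataFrame({'x': rng.random(1000), 'y': rng.random(1000)})

def distances_loop(df):
    """Distance from the origin for each row, computed one row at a time."""
    out = []
    for i in df.index:
        out.append((df.loc[i, 'x'] ** 2 + df.loc[i, 'y'] ** 2) ** 0.5)
    return out

start = time.perf_counter()
loop_result = distances_loop(coords)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = (coords['x'] ** 2 + coords['y'] ** 2) ** 0.5
vec_seconds = time.perf_counter() - start

# Both approaches should produce the same distances.
same_answer = np.allclose(loop_result, vec_result)
print(f'loop: {loop_seconds:.6f}s, vectorized: {vec_seconds:.6f}s')
```

Exact timings vary by machine, so treat the printed numbers as indicative only.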

### `pandas` data types

- Each Series (column) has a data type, which refers to the type of the values stored within. Access it using the `dtypes` attribute (or `dtype`, on a Series).
- A column's data type determines which operations can be applied to it.
- `pandas` tries to guess the correct data types for a given DataFrame, and is often wrong.
    - This can lead to incorrect calculations and poor memory/time performance.
- As a result, you will often need to explicitly convert between data types.

In [None]:
schools.head()

In [None]:
schools.dtypes

In [None]:
schools['Founded'].dtypes

### `pandas` data types

|Pandas dtype|Python type|NumPy type|SQL type|Usage|
|---|---|---|---|---|
|int64|int|int_, int8,...,int64, uint8,...,uint64|INT, BIGINT| Integer numbers|
|float64|float|float_, float16, float32, float64|FLOAT| Floating point numbers|
|bool|bool|bool_|BOOL|True/False values|
|datetime64[ns]|NA|datetime64[ns]|DATETIME|Date and time values|
|timedelta64[ns]|NA|NA|NA|Differences between two datetimes|
|category|NA|NA|ENUM|Finite list of text values|
|object|str|string, unicode|NA|Text|
|object|NA|object|NA|Mixed types|

[This article](https://www.dataquest.io/blog/pandas-big-data/) details how `pandas` stores different data types under the hood.

What do you think is happening here? 🚰

In [None]:
schools['Founded'] ** 7

Read [this article](https://mortada.net/can-integer-operations-overflow-in-python.html#Can-integers-overflow-in-python?) for a discussion of how `numpy`/`pandas` `int64` operations differ from vanilla `int` operations.
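The difference between fixed-width `int64` arithmetic and Python's arbitrary-precision `int` can be sketched with a single made-up year. The value 2000 is an assumption chosen for illustration, not a value taken from `schools`.

```python
import pandas as pd

year = 2000  # hypothetical value for illustration
exact = year ** 7  # Python ints have arbitrary precision, so this is exact

founded = pd.Series([year], dtype='int64')
# int64 values are limited to 64 bits; results that don't fit wrap around silently.
wrapped = int((founded ** 7).iloc[0])

int64_max = 2 ** 63 - 1
too_big = exact > int64_max  # the exact answer cannot be stored in an int64
```

Because the exact result exceeds the largest `int64`, the value stored in the Series cannot equal it.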

### ⚠️ Warning: `numpy` and `pandas` don't always make the same decisions! 

`numpy` prefers homogeneous data types to optimize memory and read/write speed. This leads to **type coercion**. 

Notice that the array created below contains only strings, even though there was an `int` in the argument list.

In [None]:
np.array(['a', 1])

On the other hand, `pandas` likes correctness and ease-of-use. The Series created below is of type `object`, which preserves the original data types in the argument list.

In [None]:
pd.Series(['a', 1])

In [None]:
pd.Series(['a', 1]).values

You can specify the data type of an array when initializing it by using the `dtype` argument.

In [None]:
np.array(['a', 1], dtype=object)

`pandas` does make some trade-offs for efficiency, however. For instance, a Series consisting of both `int`s and `float`s is coerced to the `float64` data type.

In [None]:
pd.Series([1, 1.0])

### Type conversion

You can change the data type of a Series using the `.astype` Series method.

For instance, we can change the data type of the `'Enrollment'` column in `schools` to be `int64`, once we remove the commas.

In [None]:
schools.head()

In [None]:
schools.dtypes

In [None]:
schools['Enrollment'] = schools['Enrollment'].str.replace(',', '').astype(int)
schools.head()

In [None]:
schools.dtypes
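The same two-step conversion can be sketched on a small, self-contained Series. The enrollment figures below are made up for illustration.

```python
import pandas as pd

# Hypothetical enrollment values stored as text, with thousands separators.
enrollment = pd.Series(['31,842', '9,019', '500'])

# Step 1: remove the commas (still strings). Step 2: convert to integers.
as_int = enrollment.str.replace(',', '').astype(int)

# Arithmetic now behaves numerically rather than as string concatenation.
total = as_int.sum()
```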

### Performance and memory management

As we just discovered,
* `numpy` is optimized for speed and memory consumption.
* `pandas` makes implementation choices that: 
    - are slow and use a lot of memory, but
    - optimize for fast code development.

To demonstrate, let's create a large array in which all of the entries are non-negative integers no larger than 255, meaning that they can be represented with 8 bits (i.e. as `np.uint8`s, where the "u" stands for "unsigned").

In [None]:
data = np.random.choice(np.arange(8), 10 ** 6)

When we tell `pandas` to use a `dtype` of `uint8`, the size of the resulting DataFrame is under a megabyte.

In [None]:
ser1 = pd.Series(data, dtype=np.uint8).to_frame()
ser1.info()

But by default, even though the numbers are only 8-bit, `pandas` uses the `int64` dtype, and the resulting DataFrame is over 7 megabytes large.

In [None]:
ser2 = pd.Series(data).to_frame()
ser2.info()
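To compare sizes directly rather than reading them off of `info()`, one option is the `nbytes` attribute of a Series. The sketch below also uses `pd.to_numeric` with `downcast='unsigned'` as one possible way to shrink an existing integer Series; the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.random.choice(np.arange(8), 10 ** 6)

small = pd.Series(values, dtype=np.uint8)  # 1 byte per value
large = pd.Series(values, dtype=np.int64)  # 8 bytes per value

# nbytes counts the bytes used by the underlying values (not the index).
ratio = large.nbytes / small.nbytes

# Ask pandas to pick the smallest unsigned integer dtype that fits the data.
downcast = pd.to_numeric(large, downcast='unsigned')
```

Since every value here is between 0 and 7, the downcast Series uses the same 8-bit type as `small`.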

### Aside: `std`

To compute the standard deviation of a Series, we can use:
- The `std` method.
- The `np.std` function.

Let's try both. What do you notice?

In [None]:
schools['Founded'].std()

In [None]:
np.std(schools['Founded'])

### Aside: `std`

The two methods/functions use different default values of `ddof` (_delta degrees of freedom_), the quantity subtracted from $n$ in the denominator.

- The `std` method in `pandas` uses `ddof=1` by default (sometimes called the "sample" standard deviation):

$$\text{SD} = \sqrt{\frac{\sum_{i = 1}^n (x_i - \bar{x})^2}{n - 1}}$$

- The `np.std` function in `numpy` uses `ddof=0` by default (sometimes called the "population" standard deviation):

$$\text{SD} = \sqrt{\frac{\sum_{i = 1}^n (x_i - \bar{x})^2}{n}}$$

Be careful!

In [None]:
schools['Founded'].std()

In [None]:
schools['Founded'].std(ddof=1)

In [None]:
schools['Founded'].std(ddof=0)
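The two conventions can be reconciled by passing `ddof` explicitly. The sketch below uses a small made-up Series whose mean is 5 and whose squared deviations sum to 32, so the population SD is $\sqrt{32/8} = 2$ and the sample SD is $\sqrt{32/7}$.

```python
import numpy as np
import pandas as pd

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # made-up values for illustration

pandas_default = x.std()             # ddof=1: divides by n - 1
numpy_default = np.std(x.to_numpy())  # ddof=0: divides by n

# Passing ddof explicitly makes the two libraries agree.
pandas_population = x.std(ddof=0)
numpy_sample = np.std(x.to_numpy(), ddof=1)
```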

## Extra: Data cleaning and `plotly`

_Note: We may not get to these slides in lecture; refer to them for extra examples._

### Example: Universities in California 📚

Let's return to `schools`. Towards the end of the last section, we fixed the data type of the `'Enrollment'` column to be `int64`, which means we can now perform calculations with it.

In [None]:
schools.head()

In [None]:
schools.dtypes

In [None]:
schools['Enrollment'].describe()

### Enrollment vs. year founded

In [None]:
schools.plot(kind='scatter', x='Founded', y='Enrollment', figsize=(10, 5));

### `plotly`

`plotly` is a plotting library that creates interactive graphs. It's not included in your `dsc80` conda environment, so you'll need to `pip install` it.

In [None]:
!pip install plotly

In [None]:
import plotly.express as px

### Enrollment vs. year founded, but interactive

In [None]:
px.scatter(schools, 
           x='Founded', 
           y='Enrollment', 
           hover_name='Name', 
           color=schools['Name'].str.contains('University of California')
           )

You can even create `plotly` plots by default by setting `pandas`' plotting backend to `plotly`:

In [None]:
pd.options.plotting.backend = 'plotly'

In [None]:
schools.plot(kind='scatter', 
             x='Founded', 
             y='Enrollment', 
             hover_name='Name')

## Summary, next time

### Summary, next time

- `pandas` relies heavily on `numpy`. An understanding of how data types work in both will allow you to write more efficient and bug-free code.
- Series and DataFrames share many methods (refer to the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for more details).
- Most `pandas` methods return copies of Series/DataFrames. Be careful when using techniques that modify values in-place.
- Next time: `groupby` and data granularity.