In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)


# Lecture 02 - Pandas 🐼

## DSC 80, Fall 2022

## Today, in DSC 80...

- Remembering basic 🐼 usage
- How 🐼 works
- Writing efficient 🐼 code
- Some useful 🐼 methods to know

## Announcements 📣

- 

In [41]:
# load some data

import util
names_path = util.safe_download('https://www.ssa.gov/oact/babynames/names.zip')

import pathlib

dfs = []
for path in pathlib.Path('data/names/').glob('*.txt'):
    year = int(path.stem[3:])  # file names look like yob1964.txt
    if year >= 1964:
        df = pd.read_csv(path, names=['firstname', 'gender', 'count']).assign(year=year)
        dfs.append(df)
        
names = pd.concat(dfs)

## Introduction to `pandas` 🐼

<center><img src='imgs/babypanda.jpg' width=400></center>

<center><img src='imgs/angrypanda.jpg' width=600></center>

### `pandas`

<center><img src='imgs/pandas.png' width=200></center>

- `pandas` is **the** Python library for tabular data manipulation.
- Before `pandas` was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
- Wes McKinney, the original developer of `pandas`, wanted a library which would allow everything to be done in Python.
    - Python is faster to develop in than Java, and is more general-purpose than R.

### `pandas` data structures

There are three key data structures at the core of `pandas`:
- DataFrame: 2 dimensional tables.
- Series: 1 dimensional (columnar) array.
- Index: immutable sequence of column/row labels.

<center><img src='imgs/example-df.png' width=600></center>
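To make the three structures concrete, here is a small sketch on a made-up table (the labels `'a'`, `'b'` and the columns `'x'`, `'y'` are placeholders, not part of the lecture's data). It checks the type of each piece and shows that an Index rejects in-place changes.

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({'x': [1, 2], 'y': [3.0, 4.0]}, index=['a', 'b'])

print(type(toy).__name__)          # the 2D table
print(type(toy['x']).__name__)     # a single column
print(type(toy.index).__name__)    # the row labels
print(type(toy.columns).__name__)  # the column labels

# An Index is immutable, so assigning to one of its entries fails.
try:
    toy.index[0] = 'z'
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
print(index_is_mutable)
```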

### Importing `pandas` and related libraries

We've already run this at the top of the notebook, so we won't repeat it here. But `pandas` is almost always imported in conjunction with `numpy`:

```py
import pandas as pd
import numpy as np
```

### Series represent columns / rows
* Each row and each column of a DataFrame can be extracted as a `pd.Series`.
* A `pd.Series` object is a one-dimensional sequence with labels (index).

In [None]:
names

In [None]:
names['firstname']

In [None]:
names.iloc[3]

### Initializing a Series

- The function `pd.Series` can create a new Series, given either an existing sequence or dictionary.
- By default, the index will be set to 0, 1, 2, 3,... and the Series will have no "name".
    - You can use optional `index` and `name` arguments to change this behavior.
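As a quick sketch of the optional arguments mentioned above, the following builds a Series from a list while supplying both `index` and `name`. The values and labels are made up for illustration.

```python
import pandas as pd

# The values, labels, and name below are made up for illustration.
people = pd.Series([10, 23, 45], index=['a', 'b', 'c'], name='people')

print(people.index.tolist())  # the custom labels
print(people.name)            # the name of the Series
print(people['b'])            # look up a value by its label
```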

In [None]:
pd.Series([10, 23, 45, 53, 87])

In [None]:
pd.Series({'a': 10, 'b': 23, 'c': 45, 'd': 53, 'e': 87}, name='people')

### Initializing a DataFrame

* `pd.DataFrame` initializes a DataFrame using either: 
    - a list of rows, or
    - a dictionary of columns.
* There are various optional arguments: `index`, `columns`, `dtype`, etc.
    - To see the signature of a function `f`, run `f?` in a cell (e.g. `pd.DataFrame?`).

In [None]:
pd.DataFrame?

### Method 1: Using a list of rows

In [None]:
row_data = [
    ['Granger, Hermione', 'A13245986', 1],
    ['Potter, Harry', 'A17645384', 1],
    ['Weasley, Ron', 'A32438694', 1],
    ['Longbottom, Neville', 'A52342436', 1]
]

row_data

By default, the column names are set to 0, 1, 2, ...

In [None]:
pd.DataFrame(row_data)

You can change that using the `columns` argument.

In [None]:
pd.DataFrame(row_data, columns=['Name', 'PID', 'LVL'])

### Method 2: Using a dictionary of columns

In [None]:
column_dict = {
    'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron', 'Longbottom, Neville'],
    'PID': ['A13245986', 'A17645384', 'A32438694', 'A52342436'],
    'LVL': [1, 1, 1, 1]
}
column_dict

In [None]:
enrollments = pd.DataFrame(column_dict)
enrollments

### DataFrame index and column labels

- Access column labels with the `columns` attribute.
- Access index labels with the `index` attribute.
- The default for both is 0-indexed position (0, 1, 2, ...).

In [None]:
enrollments.columns

In [None]:
enrollments.index

### Axis

- Each row and each column of a DataFrame can be accessed as a Series.
- The **axis** specifies the direction of a slice of a DataFrame.

<center><img src='imgs/axis.png' width=300></center>

- Axis 0 refers to the index.
- Axis 1 refers to the columns.

### DataFrame methods with `axis`

In [None]:
A = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
A

If we specify `axis=0`, `A.sum` will "compress" along axis 0, and keep the column labels intact.

In [None]:
A.sum(axis=0)

If we specify `axis=1`, `A.sum` will "compress" along axis 1, and keep the row labels (index) intact.

In [None]:
A.sum(axis=1)

<center><img src='imgs/axis-sum.png' width=600></center>

## Selecting columns using `[]`

### Throwback to `babypandas` 👶🐼

- In `babypandas`, you accessed columns using the `.get` method.
- `.get` also works in `pandas`, but it is not **idiomatic** – people don't usually use it.

In [None]:
enrollments

In [None]:
enrollments.get('Name')

In [None]:
# Doesn't error; returns None because there is no 'billy' column
enrollments.get('billy')

### Selecting columns with `[]`

- The standard way to access a column in `pandas` is by using the `[]` operator.
    - Think of a DataFrame as a dictionary of arrays!
* Specifying a column name returns the column as a Series.
* Specifying a list of column names returns a DataFrame.

In [None]:
enrollments

In [None]:
# Returns a Series
enrollments['Name']

In [None]:
# Returns a DataFrame
enrollments[['Name', 'PID']]

In [None]:
# 🤔
enrollments[['Name']]

In [None]:
# KeyError
enrollments['billy']

### Selecting columns with attribute notation

- It is also possible to access columns using attribute notation, i.e. `.<column name>`.
- **Don't do this.**
    - What if the column name clashes with a DataFrame method, like `.mean`?
    - What if the column name contains spaces or special characters?
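Both problems are easy to reproduce. The sketch below uses a hypothetical DataFrame whose column names were chosen to cause trouble on purpose.

```python
import pandas as pd

# Hypothetical column names chosen to cause trouble on purpose.
clash = pd.DataFrame({'mean': [1, 2], 'first name': ['Ann', 'Bo']})

# Attribute access finds the DataFrame method, not the column.
print(callable(clash.mean))

# Bracket access always refers to the column.
print(clash['mean'].tolist())

# A name with a space has no attribute spelling at all.
print(clash['first name'].tolist())
```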

In [None]:
enrollments.LVL

In [None]:
enrollments.mean

## Selecting rows with `.loc` and `.iloc`

### Selecting rows with `loc`

If `df` is a DataFrame, then:
* `df.loc[idx]` returns the row whose index label is `idx`, as a Series.
* `df.loc[idx_list]` returns a DataFrame containing the rows whose index labels are in `idx_list`.

In [None]:
enrollments

In [None]:
enrollments.loc[3]

In [None]:
enrollments.loc[[1, 3]]

In [None]:
enrollments.loc[[3]]

### Boolean sequence selection

* The `loc` operator also supports Boolean sequences (lists, arrays, Series) as input. 
* The length of the sequence must be the same as the number of rows in the DataFrame. 
* The result is a filtered DataFrame, containing only the rows in which the sequence contained `True`.

In [None]:
enrollments

In [None]:
bool_arr = [
    False,  # Hermione
    True,   # Harry
    False,  # Ron
    True    # Neville
]

enrollments.loc[bool_arr]

### Querying

- Comparisons with arrays (Series) result in Boolean arrays (Series).
- We can use comparisons along with the `loc` operator to **query** a DataFrame.
- Querying is the act of selecting rows in a DataFrame that satisfy certain condition(s).
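Queries often involve more than one condition. Here is a minimal sketch, on a made-up table, of combining Boolean Series with `&` (and) and `|` (or). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `<` and `>`.

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({
    'Name': ['Ana', 'Bo', 'Raj', 'Zed'],
    'LVL': [1, 2, 2, 1],
})

# Rows where both conditions hold.
both = (toy['Name'] < 'M') & (toy['LVL'] > 1)
print(toy.loc[both, 'Name'].tolist())

# Rows where at least one condition holds.
either = (toy['Name'] < 'M') | (toy['LVL'] > 1)
print(toy.loc[either, 'Name'].tolist())
```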

In [None]:
enrollments

In [None]:
enrollments['Name'].str.contains('on')

In [None]:
# Rows where Name includes 'on'
enrollments.loc[enrollments['Name'].str.contains('on')]

In [None]:
# Rows where the first letter of Name is between A and L
enrollments.loc[enrollments['Name'] < 'M']

When using a Boolean sequence, e.g. `enrollments['Name'] < 'M'`, `loc` is not strictly necessary:

In [None]:
enrollments[enrollments['Name'] < 'M']

### Selecting columns and rows simultaneously

So far, we used `[]` to select columns and `loc` to select rows.

In [None]:
enrollments.loc[enrollments['Name'] < 'M']['PID']

### Selecting columns and rows simultaneously

`loc` can also be used to select both rows and columns. The general pattern is:

```
df.loc[<row selector>, <column selector>]
```

Examples:
- `df.loc[idx_list, col_list]` returns a DataFrame containing the rows in `idx_list` and columns in `col_list`.
- `df.loc[bool_arr, col_list]` returns a DataFrame containing the rows for which `bool_arr` is `True` and columns in `col_list`.
- If `:` is used as the first input, all rows are kept. If `:` is used as the second input, all columns are kept.

In [None]:
enrollments

In [None]:
enrollments.loc[enrollments['Name'] < 'M', 'PID']

In [None]:
enrollments.loc[enrollments['Name'] < 'M', ['PID']]

### Even more ways of selecting rows and columns

In `df.loc[<row selection>, <column selection>]`:

- Both the first and second inputs can be Boolean sequences.
- Both the first and second inputs can be **slices**, which use `:` syntax (e.g. `0:2`, `'Name': 'PID'`).
- If both the first and second inputs are primitives (strings or numbers), the result is a single value, not a DataFrame or Series.
- The first input can be a **function** that takes the DataFrame as input and returns a valid row selector (e.g. a Boolean Series).

There are many, many more – see the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for more.
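A short sketch of a few of these, on a made-up table. When a function is used as the row selector, `loc` calls it on the DataFrame itself and uses whatever it returns (here, a Boolean Series).

```python
import pandas as pd

# Hypothetical toy table, used only for illustration.
toy = pd.DataFrame({
    'Name': ['Ana', 'Bo', 'Raj'],
    'PID': ['A1', 'A2', 'A3'],
    'LVL': [1, 2, 2],
})

# A function as the row selector: it receives the whole DataFrame.
upper = toy.loc[lambda df: df['LVL'] > 1, 'Name']
print(upper.tolist())

# A Boolean sequence as the column selector: one entry per column.
kept_columns = toy.loc[:, [True, False, True]].columns.tolist()
print(kept_columns)

# Two scalars give back a single value.
print(toy.loc[0, 'PID'])
```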

In [None]:
enrollments

In [None]:
enrollments.loc[2, 'LVL']

In [None]:
enrollments.loc[0:2, 'Name': 'PID']

### Don't forget `iloc`!

- `iloc` stands for "integer location".
- `iloc` is like `loc`, but it selects rows and columns based on integer positions only.
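One difference that is easy to miss: slices in `loc` include the endpoint, while slices in `iloc` exclude it, like ordinary Python slices. A small sketch on a made-up table with the default index:

```python
import pandas as pd

# Hypothetical toy table with the default index 0, 1, 2, 3.
toy = pd.DataFrame({'Name': ['Ana', 'Bo', 'Raj', 'Zed']})

# loc slices by label and includes the endpoint: labels 0, 1, and 2.
by_label = toy.loc[0:2]
print(len(by_label))

# iloc slices by position and excludes the endpoint: positions 0 and 1.
by_position = toy.iloc[0:2]
print(len(by_position))
```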

In [None]:
enrollments

In [None]:
enrollments.iloc[2:4, 0:2]

In [None]:
other = enrollments.set_index('Name')
other

In [None]:
other.iloc[2]

In [None]:
# KeyError: the index labels are now names, not integers
other.loc[2]

### Discussion Question

Let's return to the `names` DataFrame.

In [None]:
names

**Question:** How many babies were born with the name `'Billy'` and gender `'M'`?

In [None]:
...

### More Practice

Consider the DataFrame below.

In [None]:
jack = pd.DataFrame({1: ['fee', 'fi'], '1': ['fo', 'fum']})
jack

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself.

In [None]:
# jack[1]

In [None]:
# jack[[1]]

In [None]:
# jack['1']

In [None]:
# jack[[1,1]]

In [None]:
# jack.loc[1]

In [None]:
# jack.loc[jack[1] == 'fo']

In [None]:
# jack[1, ['1', 1]]

In [None]:
# jack.loc[1,1]

## How Pandas Works

### Pandas and NumPy

<center><img src='imgs/python-stack.png' width=800></center>

### NumPy

- NumPy stands for "numerical Python". It is a commonly-used Python module that enables **fast** computation involving arrays and matrices.
- `numpy`'s main object is the **array**. In `numpy`, arrays are:
    - homogeneous (all values are of the same type), and
    - (potentially) multi-dimensional.
- Computation in `numpy` is fast because
    - Much of it is implemented in C.
    - `numpy` arrays are stored more efficiently in memory than, say, Python lists. 
- [This site](https://cloudxlab.com/blog/numpy-pandas-introduction/) provides a good overview of `numpy` arrays.

### `pandas` is built upon `numpy`

- A Series in `pandas` is a `numpy` array with an index.
- A DataFrame is like a dictionary of columns, each of which is a `numpy` array.
- Many operations in `pandas` are fast because they use `numpy`'s implementations.
- To access the array underlying a DataFrame or Series, use the `to_numpy` method.
    - ⚠️ Warning: `to_numpy` can return a view of the original object rather than a copy, so changes to the array may show up in the Series or DataFrame! Read more in the [course notes](https://notes.dsc80.com/content/02/data-types.html#copies-and-views-in-pandas).
    - `.values` is an older alternative to `.to_numpy()`; the `pandas` documentation recommends `.to_numpy()`.
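If you need an array that is safe to modify, one option is to ask for a copy explicitly. The sketch below uses the `copy` argument of `to_numpy`; the values are arbitrary.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.array([4, 2, 9, 15, -1]), index=list('abcde'))

# copy=True asks for a fresh array that shares no memory with the Series.
safe = ser.to_numpy(copy=True)
safe[2] = 100

print(safe[2])   # the copy changed
print(ser['c'])  # the Series did not
```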

In [4]:
arr = np.array([4, 2, 9, 15, -1])
arr

array([ 4,  2,  9, 15, -1])

In [5]:
ser = pd.Series(arr, index='a b c d e'.split(' '))
ser

a     4
b     2
c     9
d    15
e    -1
dtype: int64

In [6]:
conv = ser.to_numpy()
conv

array([ 4,  2,  9, 15, -1])

In [7]:
conv[2] = 100
conv

array([  4,   2, 100,  15,  -1])

In [8]:
ser

a      4
b      2
c    100
d     15
e     -1
dtype: int64

### The dangers of `for`-loops

- `for`-loops are slow when processing large datasets.
- To illustrate how much faster `numpy` arithmetic is than using a `for`-loop, let's compute the distances between the origin $(0, 0)$ and 2000 random points $(x, y)$ in $\mathbb{R}^2$:
    - Using a `for`-loop.
    - Using vectorized arithmetic (through `numpy`).

### Aside: generating data

- First, we need to create a DataFrame containing 2000 random points in 2D. 
- `np.random.random(N)` returns an array containing `N` numbers selected uniformly at random from the interval $[0, 1)$.

In [9]:
N = 2000
x_arr = np.random.random(N)
y_arr = np.random.random(N)

coordinates = pd.DataFrame({"x": x_arr, "y": y_arr})
coordinates.head()

| |x|y|
|---|---|---|
|0|0.902074|0.160808|
|1|0.499344|0.894245|
|2|0.765865|0.182101|
|3|0.510969|0.847546|
|4|0.901006|0.175181|


Next, let's define a function that takes in a DataFrame like the one above and returns the distances between each point and the origin, using a `for`-loop.

In [10]:
def distances(df):
    hyp_list = []
    for i in df.index:
        dist = (df.loc[i, 'x'] ** 2 + df.loc[i, 'y'] ** 2) ** 0.5
        hyp_list.append(dist)
    return hyp_list

The `%timeit` magic command can repeatedly run any snippet of code and give us its average runtime.

In [11]:
%timeit distances(coordinates)

22.1 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Now, using a vectorized approach:

In [12]:
%timeit (coordinates['x'] ** 2 + coordinates['y'] ** 2) ** 0.5

155 µs ± 1.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Note that "µs" refers to microseconds, which are one-millionth of a second, whereas "ms" refers to milliseconds, which are one-thousandth of a second.

**Takeaway:** avoid `for`-loops whenever possible!
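Speed only matters if both approaches agree. As a sanity check, the sketch below compares the loop and the vectorized computation on a small set of random points. The seeded generator `np.random.default_rng(0)` and the name `pts` are choices made for this sketch, not part of the lecture code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable
pts = pd.DataFrame({'x': rng.random(100), 'y': rng.random(100)})

# Loop version: one Python-level iteration per row.
looped = [(pts.loc[i, 'x'] ** 2 + pts.loc[i, 'y'] ** 2) ** 0.5 for i in pts.index]

# Vectorized version: arithmetic on whole columns at once.
vectorized = (pts['x'] ** 2 + pts['y'] ** 2) ** 0.5

print(np.allclose(looped, vectorized))
```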

### `pandas` data types

- A **data type** in `pandas` refers to the type of values in a column.
- A column's data type determines which operations can be applied to it.
- `pandas` tries to guess the correct data type for a given DataFrame, and is often wrong.
    - This can lead to incorrect calculations and poor memory/time performance.
- As a result, you will often need to explicitly convert between data types.
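A classic case of a wrong guess is a column of identifiers that happen to look like numbers. The sketch below reads a tiny, made-up CSV of ZIP codes, first with the default type inference and then with an explicit `dtype`.

```python
import io
import pandas as pd

# Hypothetical CSV text: ZIP codes are labels, not quantities.
csv_text = "zip\n02139\n92093\n"

guessed = pd.read_csv(io.StringIO(csv_text))
print(guessed['zip'].dtype)    # inferred as an integer type
print(guessed['zip'].iloc[0])  # the leading zero is gone

# Passing dtype tells read_csv to keep the column as text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'zip': str})
print(as_text['zip'].iloc[0])
```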

### `pandas` data types

|Pandas dtype|Python type|NumPy type|SQL type|Usage|
|---|---|---|---|---|
|int64|int|int_, int8,...,int64, uint8,...,uint64|INT, BIGINT| Integer numbers|
|float64|float|float_, float16, float32, float64|FLOAT| Floating point numbers|
|bool|bool|bool_|BOOL|True/False values|
|datetime64[ns]|NA|datetime64[ns]|DATETIME|Date and time values|
|timedelta64[ns]|NA|timedelta64[ns]|NA|Differences between two datetimes|
|category|NA|NA|ENUM|Finite list of text values|
|object|str|string, unicode|NA|Text|
|object|NA|object|NA|Mixed types|

[This article](https://www.dataquest.io/blog/pandas-big-data/) details how `pandas` stores different data types under the hood.

### Type conversion and the underlying `numpy` array(s)
* The `dtypes` attribute (of both Series and DataFrames) describes the data type of each column.
* The `to_numpy` method, when used on a Series, returns an array in which all values are of the data type specified by `dtypes`.
* The `to_numpy` method, when used on a DataFrame, returns a multi-dimensional array of type `object`, unless the columns share a common type (e.g. all columns are numeric).
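To see the difference, compare an all-numeric DataFrame with one that mixes numbers and strings. Both tables below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy tables, used only for illustration.
all_numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})
mixed = pd.DataFrame({'a': [1, 2], 'name': ['x', 'y']})

# Integer and float columns share a common numeric type, so the array is float64.
print(all_numeric.to_numpy().dtype)

# No common type fits both numbers and strings, so the array falls back to object.
print(mixed.to_numpy().dtype)
```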

In [13]:
# Read in file
import os

elections_fp = os.path.join('data', 'elections.csv')
elections = pd.read_csv(elections_fp)
elections.head()

| |Candidate|Party|%|Year|Result|
|---|---|---|---|---|---|
|0|Reagan|Republican|50.7|1980|win|
|1|Carter|Democratic|41.0|1980|loss|
|2|Anderson|Independent|6.6|1980|loss|
|3|Reagan|Republican|58.8|1984|win|
|4|Mondale|Democratic|37.6|1984|loss|


In [14]:
elections.dtypes

Candidate     object
Party         object
%            float64
Year           int64
Result        object
dtype: object

In [15]:
elections['Year'].dtypes

dtype('int64')

In [16]:
elections['Year'].to_numpy().dtype

dtype('int64')

In [17]:
elections.to_numpy()

array([['Reagan', 'Republican', 50.7, 1980, 'win'],
       ['Carter', 'Democratic', 41.0, 1980, 'loss'],
       ['Anderson', 'Independent', 6.6, 1980, 'loss'],
       ['Reagan', 'Republican', 58.8, 1984, 'win'],
       ['Mondale', 'Democratic', 37.6, 1984, 'loss'],
       ['Bush', 'Republican', 53.4, 1988, 'win'],
       ['Dukakis', 'Democratic', 45.6, 1988, 'loss'],
       ['Clinton', 'Democratic', 43.0, 1992, 'win'],
       ['Bush', 'Republican', 37.4, 1992, 'loss'],
       ['Perot', 'Independent', 18.9, 1992, 'loss'],
       ['Clinton', 'Democratic', 49.2, 1996, 'win'],
       ['Dole', 'Republican', 40.7, 1996, 'loss'],
       ['Perot', 'Independent', 8.4, 1996, 'loss'],
       ['Gore', 'Democratic', 48.4, 2000, 'loss'],
       ['Bush', 'Republican', 47.9, 2000, 'win'],
       ['Kerry', 'Democratic', 48.3, 2004, 'loss'],
       ['Bush', 'Republican', 50.7, 2004, 'win'],
       ['Obama', 'Democratic', 52.9, 2008, 'win'],
       ['McCain', 'Republican', 45.7, 2008, 'loss'],
       ['O

What do you think is happening here?

In [18]:
elections['Year'] ** 7

0    -9176136658659852288
1    -9176136658659852288
2    -9176136658659852288
3    -8884621300430012416
4    -8884621300430012416
5    -6382217169766957056
6    -6382217169766957056
7    -1459841187188834304
8    -1459841187188834304
9    -1459841187188834304
10    6093279273289170944
11    6093279273289170944
12    6093279273289170944
13   -1957127470578663424
14   -1957127470578663424
15   -6950134928041361408
16   -6950134928041361408
17   -8669840355176218624
18   -8669840355176218624
19   -6898610338587885568
20   -6898610338587885568
21   -1417070411446747136
22   -1417070411446747136
23    7995905371925528576
24    7995905371925528576
Name: Year, dtype: int64
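One explanation consistent with the output above is integer overflow: `int64` can only hold values between $-2^{63}$ and $2^{63} - 1$, and $1980^7$ is far larger than that. The sketch below computes the exact value with Python's arbitrary-precision integers, works out what it wraps around to in 64 bits, and compares that with what `pandas` returns.

```python
import pandas as pd

year = 1980
exact = year ** 7  # Python ints have arbitrary precision, so this is exact

# Simulate 64-bit wraparound: results outside [-2**63, 2**63 - 1] wrap.
wrapped = (exact + 2 ** 63) % 2 ** 64 - 2 ** 63

from_pandas = int((pd.Series([year], dtype='int64') ** 7).iloc[0])

print(exact > 2 ** 63 - 1)     # the exact value does not fit in int64
print(from_pandas == wrapped)  # the Series result is the wrapped value
```

Converting the column first, e.g. with `.astype(float)`, avoids the wraparound at the cost of exactness.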

### ⚠️ Warning: `numpy` and `pandas` don't always make the same decisions! 

`numpy` prefers homogeneous data types to optimize memory and read/write speed. This leads to **type coercion**. Notice that the array created below contains only strings, even though there was an `int` in the argument list.

In [24]:
np.array(['a', 1])

array(['a', '1'], dtype='<U21')

On the other hand, `pandas` likes correctness and ease-of-use. The Series created below is of type `object`, which preserves the original data types in the argument list.

In [25]:
pd.Series(['a', 1])

0    a
1    1
dtype: object

In [26]:
pd.Series(['a', 1]).values

array(['a', 1], dtype=object)

You can specify the data type of an array when initializing it by using the `dtype` argument.

In [27]:
np.array(['a', 1], dtype=object)

array(['a', 1], dtype=object)

`pandas` does make some trade-offs for efficiency, however. For instance, a Series consisting of both `int`s and `float`s is coerced to the `float64` data type.

In [28]:
pd.Series([1, 1.0])

0    1.0
1    1.0
dtype: float64

### Type conversion

You can change the data type of a Series using the `.astype` Series method.

In [29]:
ser = pd.Series(['1', '2', '3', '4'])
ser

0    1
1    2
2    3
3    4
dtype: object

In [30]:
ser.astype(int)

0    1
1    2
2    3
3    4
dtype: int64

In [31]:
ser.astype(float)

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
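`astype` is all-or-nothing: if even one entry can't be converted, the whole call fails. The sketch below shows this on made-up values, along with `pd.to_numeric`, which is one way to turn unparseable entries into `NaN` instead.

```python
import pandas as pd

messy = pd.Series(['1', '2', 'n/a'])  # hypothetical values

# One bad entry makes the whole conversion fail.
try:
    messy.astype(int)
    converted = True
except ValueError:
    converted = False
print(converted)

# errors='coerce' replaces entries that can't be parsed with NaN.
coerced = pd.to_numeric(messy, errors='coerce')
print(coerced.isna().tolist())
```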

### Performance and memory management

As we just discovered,
* `numpy` is optimized for speed and memory consumption.
* `pandas` makes implementation choices that: 
    - can be slower and use more memory, but
    - optimize for fast code development.

To demonstrate, let's create a large array in which all of the entries are non-negative integers no greater than 255, meaning that they can be represented with 8 bits (i.e. as `np.uint8`s, where the "u" stands for "unsigned").

In [32]:
# One million integers drawn uniformly at random from 0, 1, ..., 7
data = np.random.choice(np.arange(8), 10 ** 6)

When we tell `pandas` to use a `dtype` of `uint8`, the size of the resulting DataFrame is under a megabyte.

In [33]:
ser1 = pd.Series(data, dtype=np.uint8).to_frame()
ser1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   0       1000000 non-null  uint8
dtypes: uint8(1)
memory usage: 976.7 KB


But by default, even though the numbers are only 8-bit, `pandas` uses the `int64` dtype, and the resulting DataFrame is over 7 megabytes large.

In [34]:
ser2 = pd.Series(data).to_frame()
ser2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   0       1000000 non-null  int64
dtypes: int64(1)
memory usage: 7.6 MB
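The same comparison can be made directly in bytes with the `memory_usage` method. In this sketch, `int64` is spelled out explicitly so that the result doesn't depend on the platform's default integer type.

```python
import numpy as np
import pandas as pd

n = 10 ** 6
values = np.random.choice(np.arange(8), n)

small = pd.Series(values, dtype=np.uint8)
large = pd.Series(values, dtype=np.int64)

# index=False leaves out the index, so only the values are counted.
print(small.memory_usage(index=False))  # 1 byte per entry
print(large.memory_usage(index=False))  # 8 bytes per entry
```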


## Useful Series and DataFrame methods

### Shared methods and attributes
* The `head`/`tail` methods return the first/last few rows (the default is 5).
* The `shape` attribute returns the number of rows (and columns).
* The `size` attribute returns the number of entries.

In [None]:
elections.head()

In [None]:
elections.shape

In [None]:
elections.size

### Series methods

|Method Name|Description|
|---|---|
|`count`|Returns the number of non-null entries in the Series|
|`unique`|Returns the unique values in the Series|
|`nunique`|Returns the number of unique values in the Series|
|`value_counts`|Returns a Series of counts of unique values|
|`describe`|Returns a Series of descriptive stats of values|

In [None]:
elections.head()

In [None]:
# Distinct candidates
elections['Candidate'].unique()

In [None]:
# Number of distinct candidates
elections['Candidate'].nunique()

In [None]:
# Number of non-null entries (candidates, counting repeats)
elections['Candidate'].count()

In [None]:
# 🤔
republicans = elections.loc[elections['Party'] == 'Republican']
republicans['Result'].value_counts()

In [None]:
republicans['%'].describe()

### DataFrame methods

* DataFrames share *many* of the same methods with Series.
    - In such cases, the DataFrame method applies the Series method to every row or column.
* Some DataFrame methods accept the `axis` keyword argument:
    - `axis=0`: the method is applied across the rows (i.e. to each column).
    - `axis=1`: the method is applied across the columns (i.e. to each row).
* Default value: `axis=0`.

In [None]:
elections.head()

In [None]:
elections[['%', 'Year']].mean()

The following piece of code works, but is meaningless. Why?

In [None]:
elections[['%', 'Year']].mean(axis=1)

### Even more DataFrame methods

|Method Name|Description|
|---|---|
|`sort_values`|Returns a DataFrame sorted by the specified column|
|`drop_duplicates`|Returns a DataFrame with duplicate rows dropped|
|`describe`|Returns descriptive stats of the DataFrame|

In [None]:
elections.sort_values('%', ascending=False).head(4)

In [None]:
# By default, drop_duplicates looks for duplicate entire rows, which elections does not have
elections.drop_duplicates(subset=['Candidate'])

In [None]:
elections.describe()

### Adding and modifying columns, using a copy

* To add a new column to a DataFrame, use the `assign` method.
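* To add new rows to a DataFrame, use `pd.concat`. (Older code uses the `append` method, which was deprecated in `pandas` 1.4 and removed in 2.0.)
* Both `assign` and `pd.concat` return a new DataFrame rather than modifying the original, **which is (often) a good idea!**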
* To change the values in a column, re-assign its name to a sequence of the desired values.

As an aside, you should try your best to write **chained** `pandas` code, as follows:

In [None]:
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .head()
)

You can chain together several steps at a time:

In [None]:
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .assign(Result=elections['Result'].str.upper())
    .head()
)

You can also use `assign` when the desired column name has spaces, by unpacking a dictionary into keyword arguments with `**`.

In [None]:
(
    elections
    .assign(**{'Proportion of Vote': elections['%'] / 100})
    .head()
)

### ⚠️ Warning!

- Adding a single row with `pd.concat` (or with the older `append` method, removed in `pandas` 2.0) has poor time complexity, since each call copies the whole DataFrame.
- Use it sparingly.
- In particular, don't build a DataFrame by adding rows one at a time in a loop.
    - Instead, gather all of your data first and use the `pd.DataFrame` constructor.
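A minimal sketch of that pattern, with made-up records: collect plain Python dictionaries in a list, then build the DataFrame once at the end.

```python
import pandas as pd

# Hypothetical records arriving one at a time.
records = []
for i in range(3):
    records.append({'id': i, 'square': i ** 2})  # cheap: appending to a list

# Build the DataFrame once, after the loop.
squares = pd.DataFrame(records)
print(squares.shape)
```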

### Adding and modifying columns, in-place

* You can assign a new row or column to a DataFrame **in-place** using `loc` or `[]`.
    - Works like dictionary assignment.
    - Unlike `assign`, this **modifies** the underlying DataFrame rather than a copy of it.
* This is the more "common" way of adding/modifying columns. 
    - ⚠️ Warning: Exercise caution when using this approach, since this approach changes the values of existing variables.
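The difference is easy to see on a small, made-up DataFrame: `assign` leaves the original alone, while `[]` assignment changes it.

```python
import pandas as pd

scores = pd.DataFrame({'pct': [50.0, 25.0]})  # hypothetical values

# assign returns a new DataFrame; scores itself is untouched.
with_prop = scores.assign(prop=scores['pct'] / 100)
before = 'prop' in scores.columns
print(before)

# Bracket assignment changes scores itself.
scores['prop'] = scores['pct'] / 100
after = 'prop' in scores.columns
print(after)
```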

In [None]:
# By default, .copy() returns a deep copy of the object it is called on,
# meaning that if you change the copy the original remains unmodified.
mod_elec = elections.copy()
mod_elec.head()

In [None]:
mod_elec['Proportion of Vote'] = mod_elec['%'] / 100
mod_elec.head()

In [None]:
mod_elec['Result'] = mod_elec['Result'].str.upper()
mod_elec.head()

In [None]:
# 🤔
mod_elec.loc[-1, :] = ['Carter', 'Democratic', 50.1, 1976, 'WIN', 0.501]
mod_elec.loc[-2, :] = ['Ford', 'Republican', 48.0, 1976, 'LOSS', 0.48]
mod_elec

In [None]:
mod_elec = mod_elec.sort_index()
mod_elec.head()

In [None]:
# df.reset_index(drop=True) drops the current index 
# of the DataFrame and replaces it with an index of increasing integers
mod_elec.reset_index(drop=True)

## Example: San Diego employee salaries (again)

Note: We probably won't finish looking at all of this code in lecture, but we will leave it here for you as a reference.

### Reading the data

Let's work with the same dataset that we did in Lecture 1, using our new knowledge of `pandas`.

In [None]:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2020.csv')
salaries['Employee Name'] = salaries['Employee Name'].str.split().str[0] + ' xxxxx'

In [None]:
salaries.head()

In [None]:
salaries.info()

### Data cleaning

Current issues with the dataset:

- Some columns have no information (`'Notes'`) or the same value in all rows (`'Year'`, `'Agency'`); let's drop them.
- `'Other Pay'` should be numeric, but it's not currently.



In [None]:
# Dropping useless columns
salaries = salaries.drop(['Year', 'Notes', 'Agency'], axis=1)
salaries.head()

### Fixing the `'Other Pay'` column

In [None]:
salaries['Other Pay'].dtype

In [None]:
salaries['Other Pay'].unique()

It appears that most of the values in the `'Other Pay'` column are strings containing numbers. Which values are not numbers?

In [None]:
# ~ negates a Boolean Series
salaries.loc[~salaries['Other Pay'].str.contains('.00')]

We can keep just the rows where the `'Other Pay'` is numeric, and then convert the `'Other Pay'` column to `float`.

In [None]:
salaries = salaries.loc[salaries['Other Pay'].str.contains('.00') == True]
salaries['Other Pay'] = salaries['Other Pay'].astype(float)
salaries.head()

The cell above is correct, but it raises an error if you run it more than once. Why? 🤔

### Full-time vs. part-time

What happens when we use `normalize=True` with `value_counts`?

In [None]:
salaries['Status'].value_counts()

In [None]:
salaries['Status'].value_counts(normalize=True)

### Salary analysis

In [None]:
# Salary statistics
salaries.describe()

**Question:** Is `'Total Pay'` equal to the sum of `'Base Pay'`, `'Overtime Pay'`, and `'Other Pay'`?

We can answer this by summing the latter three columns and seeing if the resulting Series equals the former column.

In [None]:
salaries.loc[:, ['Base Pay', 'Overtime Pay', 'Other Pay']].sum(axis=1)

In [None]:
salaries['Total Pay']

In [None]:
(salaries.loc[:, ['Base Pay', 'Overtime Pay', 'Other Pay']].sum(axis=1) == salaries['Total Pay']).all()

Similarly, we might ask whether `'Total Pay & Benefits'` is truly the sum of `'Total Pay'` and `'Benefits'`.

In [None]:
(salaries.loc[:, ['Total Pay', 'Benefits']].sum(axis=1) == salaries.loc[:, 'Total Pay & Benefits']).all()
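A caveat when comparing floats with `==`: two sums that are equal on paper can differ in their last few bits because of rounding. The sketch below uses made-up amounts (not the salaries data) to show the issue, with `np.isclose` as one tolerant alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts chosen to expose floating-point rounding.
pay = pd.DataFrame({'Base': [0.1], 'Other': [0.2], 'Total': [0.3]})

summed = pay['Base'] + pay['Other']

# Exact comparison fails: 0.1 + 0.2 is stored as 0.30000000000000004.
exact_match = (summed == pay['Total']).all()
print(exact_match)

# np.isclose compares within a small tolerance instead.
close_match = np.isclose(summed, pay['Total']).all()
print(close_match)
```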

### Visualization

In [None]:
salaries['Total Pay & Benefits'].plot(kind='hist', density=False, bins=20, ec='w');

In [None]:
salaries.plot(kind='scatter', x='Base Pay', y='Overtime Pay');

In [None]:
pd.plotting.scatter_matrix(salaries[['Base Pay', 'Overtime Pay', 'Total Pay']], figsize=(8, 8));

Think of your own questions about the dataset, and try and answer them!

## Next time, in DSC 80...

...a deeper dive into `pandas`.