## Announcements

* Lab 01 Due Monday @ Midnight
* "No loops" on the Labs/Projects
* `pip install otter-grader`

### Part 1

# Tabular Data

## Outline

* Intro to Pandas
    - Series / Dataframe / Indices
    - Basic Operations
* For loops and Numpy 
* Pandas and Numpy: performance and memory management
* Useful Pandas Methods

# 2. Intro to Pandas


![pandas-1.jpg](attachment:pandas-1.jpg)

# Pandas
* Python library for reading "Panel Data"
* Similar to `data.table` in `R`; `dataframe` in `spark`.

![panda.png](attachment:panda.png)

### Pandas History

* Old workflow: Use multiple languages (python, R, java) in single project.
* New workflow: Do everything in python! (Wes McKinney)
    - Faster to develop than java;
    - More production capable than R


### Pandas History

* Pandas was created to fill a gap in data manipulation capabilities.
    - Hodgepodge of features of other successful libraries (e.g. from `R`)
    - Ad-hoc, still evolving design.
    - Great for prototyping.

## Pandas Data Structures:
1. Data Frame: 2 dimensional tables
2. Series: 1 dimensional (columnar) array
3. Index: immutable sequence of column/row labels

![download.png](attachment:download.png)

### Importing Pandas (and related libraries)

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import os

## 1-dimensional slices of tables
* Rows and columns of data frames are represented by `pd.Series`.
* A `pd.Series` object is a one-dimensional array with labels (an index).
* Optional (default) arguments:
    - `index` (does not have to be numeric),  `name`
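A minimal sketch of the two optional arguments named above. The scores and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical scores, labeled by name instead of by 0, 1, 2.
scores = pd.Series([90, 85, 77], index=['ann', 'bob', 'cy'], name='score')

print(scores.index)    # the row labels
print(scores.name)     # the name attached to the Series
print(scores['bob'])   # label-based lookup
```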

In [None]:
row_data = pd.Series([10, 23, 45, 53, 87])
row_data

In [None]:
row_data = pd.Series({'a': 10, 'b': 23, 'c': 45, 'd': 53, 'e': 87})
row_data

# `DataFrame` Constructor

* `pd.DataFrame` creates a data frame from: 
    - a list of rows
    - a dictionary of columns
* Optional (default) arguments:
    - `index`, `columns`, `dtype`
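As a rough illustration of the optional arguments, the sketch below builds a small frame from a list of rows and sets all three. The labels `r0`, `r1`, `a`, `b` are placeholders, not part of the lecture data.

```python
import pandas as pd

rows = [[1, 2], [3, 4]]

# index labels the rows, columns labels the columns,
# and dtype forces every column to the same type.
df = pd.DataFrame(rows, index=['r0', 'r1'], columns=['a', 'b'], dtype=float)

print(df)
print(df.dtypes)
```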

In [None]:
row_data = [
    ['Granger, Hermione', 'A13245986', 1],
    ['Potter, Harry', 'A17645384', 1],
    ['Weasley, Ron', 'A32438694', 1],
    ['Longbottom, Neville', 'A52342436', 1]
]

row_data

In [None]:
enrollments = pd.DataFrame(row_data, columns = ['Name', 'PID','LVL'])
enrollments

In [None]:
column_dict = {
    'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron', 'Longbottom, Neville'],
    'PID': ['A13245986', 'A17645384', 'A32438694', 'A52342436'],
    'LVL': [1, 1, 1, 1]
}
column_dict

In [None]:
enrollments = pd.DataFrame(column_dict)
enrollments

### `DataFrame` index and column labels
* column labels accessed using the `columns` attribute
* index labels accessed using the `index` attribute.
* index/columns default to the row/column position (0-indexed)

In [None]:
enrollments.columns

In [None]:
enrollments.index

## Axis 

The rows and columns of a `pd.DataFrame` are both `pd.Series`.

The *axis* specifies the direction of a slice of a table.

* A slice along the:
    - 0-axis is a Series labeled by the *row* index of the dataframe. (column-wise)
    - 1-axis is a Series labeled by the *columns* (index) of the dataframe. (row-wise)


In [None]:
A = pd.DataFrame([[2, 4], [1, 3]],columns=list('AB'))
A

In [None]:
# what will you get?

A.mean(axis=1)


In [None]:
A.mean(axis=0)


In [None]:
A.mean()

* Axis 0 will act on all the ROWS in each COLUMN
* Axis 1 will act on all the COLUMNS in each ROW
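The two rules above can be checked on the small table `A` from the earlier cells. This is only a sketch, with the expected values written as comments.

```python
import pandas as pd

A = pd.DataFrame([[2, 4], [1, 3]], columns=list('AB'))

col_means = A.mean(axis=0)  # one value per column: A -> 1.5, B -> 3.5
row_means = A.mean(axis=1)  # one value per row:    0 -> 3.0, 1 -> 2.0

# With no argument, mean() behaves like axis=0.
default_means = A.mean()
```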

### Part 2

# Selecting Rows and Columns with `[]` and `loc`

### Selecting columns of a `DataFrame` using `[]`
* Access a column using the `[]` operator
    - A `DataFrame` is roughly a `dict` of arrays!
* Specifying a column name returns the column as a series (an `axis=0` slice).
* Specifying a list of column names returns a data frame.

In [None]:
enrollments

In [None]:
# returns series

enrollments['Name']

In [None]:
# returns a data frame

enrollments[['Name', 'PID']]

In [None]:
# What is the output? Table or array?

enrollments[['Name']]

### Accessing columns with attribute notation

- Can use `.<column_name>`
- I don't like this : (
    - What if column name clashes? e.g., `.mean`
    - What if it contains spaces, special characters?

In [None]:
enrollments.Name
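To make the two objections concrete, here is a small sketch using a hypothetical table whose column names collide with a method name and contain a space.

```python
import pandas as pd

# Hypothetical columns chosen to cause trouble with attribute access.
df = pd.DataFrame({'mean': [1, 2], 'first name': ['a', 'b']})

print(callable(df.mean))          # True: this is the DataFrame.mean method, not the column
print(df['mean'].tolist())        # brackets always reach the column
print(df['first name'].tolist())  # a name with a space cannot be written as an attribute
```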

## Selecting rows with `loc`
* `DataFrame.loc[idx]` returns a Series describing a single row.
* `DataFrame.loc[idx_list]` returns a DataFrame with rows given by `idx_list`

In [None]:
enrollments

In [None]:
enrollments.loc[3]

In [None]:
enrollments.loc[[1,3]]

In [None]:
enrollments.loc[[3]]

## Boolean Array Selection

* The `loc` operator also supports boolean arrays as input. 
* The array must be exactly as long as the number of rows. 
* The result is a filtered data frame, where only rows corresponding to `True` appear.

In [None]:
enrollments

In [None]:
bool_arr = [
    False,  # Hermione
    True,   # Harry
    False,  # Ron
    True    # Neville
]

enrollments.loc[bool_arr]

### Boolean arrays via conditions
* Select all Hogwarts students whose last names begin with A-L

In [None]:
enrollments

In [None]:
# numpy arrays:

n = np.array([1, 2, 3, 4, 5, 6])
n
n < 4

In [None]:
bool_arr = enrollments['Name'] < 'M'
bool_arr

In [None]:
enrollments.loc[bool_arr]

### Selecting groups of rows and columns of a `DataFrame`.
* `DataFrame.loc[idx_list, col_list]` selects rows in `idx_list` and columns in `col_list`.
* `DataFrame.loc[bool_arr, col_list]` selects rows using `bool_arr` and columns in `col_list`.
* Use `DataFrame.loc[:,col_list]` to select all rows, only those columns in `col_list`.
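The three forms above, sketched on a three-row version of the enrollments table (rebuilt here so the example runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron'],
    'PID': ['A13245986', 'A17645384', 'A32438694'],
    'LVL': [1, 1, 1],
})

by_labels = df.loc[[0, 2], ['Name', 'LVL']]   # rows 0 and 2, two columns
by_mask = df.loc[df['Name'] < 'M', ['Name']]  # rows where the condition holds
all_rows = df.loc[:, ['PID', 'LVL']]          # every row, two columns
```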

### Other ways of selecting rows and columns
* `loc[bool_arr, bool_arr]` selects both rows and columns using boolean arrays.
* `loc[predicate]` where `predicate` is a function with boolean output.
* `loc[:,colname_i:colname_j]` column name slicing
* `loc[idx, col]` selects the entry in row `idx` and column `col`.
* `iloc[,]` selects rows/columns by the 'positional index'.


See the documentation for more!
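A short sketch of a few of these forms on a made-up table. The column names `x`, `y`, `z` and the row labels are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {'x': [1, 5, 9], 'y': [2, 4, 6], 'z': [0, 0, 1]},
    index=['a', 'b', 'c'],
)

big_x = df.loc[lambda d: d['x'] > 3]  # the callable receives the frame and returns a boolean Series
x_to_y = df.loc[:, 'x':'y']           # label slices include both endpoints
entry = df.loc['b', 'y']              # a single entry, selected by labels
corner = df.iloc[0, -1]               # first row, last column, selected by position
```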

In [None]:
enrollments

In [None]:
enrollments.loc[:, 'Name':'PID']

### Discussion Questions

For the data frame given below, what do each of the following return?

|1|2|
|---|---|
|`jack[1]`|`jack.loc[1]`|
|`jack[[1]]`|`jack.loc[jack[1] == 'fo']`|
|`jack['1']`|`jack.loc[1,['1', 1]]`|
|`jack[[1,1]]`|`jack.loc[1,1]`|


In [None]:
jack = pd.DataFrame({1: ['fee', 'fi'], '1': ['fo', 'fum']})
jack

In [None]:
#jack[1]
#jack[[1]]
#jack['1']
#jack[[1,1]]
#jack.loc[1]
#jack.loc[jack[1] == 'fo']
#jack.loc[1, ['1', 1]]
#jack.loc[1,1]

### Part 3

# Pandas and NumPy

![image.png](attachment:image.png)

## NumPy

- It is an open source module of Python which provides **fast** mathematical computation on arrays and matrices.
- `NumPy`’s main object is the homogeneous multidimensional array.
    - A single column (typically),
    - A table with a single type (ints, floats, or string).
- `NumPy` provides convenient and optimized C-implementations of essential mathematical operations on vectors.
- [Good overview](https://cloudxlab.com/blog/numpy-pandas-introduction/)

## `Pandas` is built upon `NumPy`

* Pandas Series are Numpy arrays with user-defined indices.
* DataFrames are roughly dictionaries of columns (each a homogeneous Numpy array).
* `Pandas` uses fast `NumPy` operations whenever possible (e.g. column operations).
* To access the underlying numpy array of a DataFrame/Series use `.values` attribute.
    - Hard to predict what is returned: copy or direct access

In [None]:
arr = np.array([0,1,2,3])
ser = pd.Series([0,1,2,3], index='a b c d'.split())

In [None]:
arr

In [None]:
ser

In [None]:
ser.values
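Because it is hard to predict whether `.values` hands back a copy or direct access, one option is to ask for a copy explicitly. The sketch below uses `Series.to_numpy(copy=True)`, which is not used elsewhere in these notes.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0, 1, 2, 3], index='a b c d'.split())

safe = ser.to_numpy(copy=True)  # explicit copy: does not share memory with `ser`
safe[0] = 99                    # editing the copy...
print(ser['a'])                 # ...leaves the Series unchanged
```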

## Danger of the for loops

* 'for loops' have slow execution when processing large datasets.
* Let's compute the distance between the origin (0, 0) and 10,000 random points in $\mathbb{R}^2$:
    - By looping through the rows of the table,
    - By using vectorized arithmetic.

In [None]:
N = 10_000 # Number of records to process
x_list = list(100*(np.random.random(N))+1)
y_list = list(100*(np.random.random(N))+1)

coordinates = pd.DataFrame({"x": x_list, "y": y_list})
coordinates.head()

In [None]:
# Define a function to manually loop over all rows and return a list of distances

def distances(df):
    hyp_list = []
    for i in df.index:
        dist = (df.loc[i, 'x']**2 + df.loc[i, 'y']**2)**0.5
        hyp_list.append(dist)
    return hyp_list

In [None]:
%timeit distances(coordinates)

In [None]:
%timeit (coordinates['x']**2 + coordinates['y']**2)**0.5
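Timing aside, it is worth checking that the loop and the vectorized expression agree. The sketch below does that on a smaller random table (a seeded generator is assumed so the run is repeatable) and also shows `np.hypot` as one more vectorized spelling.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is repeatable
pts = pd.DataFrame({'x': rng.random(1_000), 'y': rng.random(1_000)})

# Row-by-row version, kept only for comparison.
looped = [(pts.loc[i, 'x']**2 + pts.loc[i, 'y']**2) ** 0.5 for i in pts.index]

# Vectorized versions: whole columns at once.
vectorized = (pts['x']**2 + pts['y']**2) ** 0.5
with_hypot = np.hypot(pts['x'], pts['y'])

print(np.allclose(looped, vectorized))
```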

### Pandas data types

* Understanding data types in Pandas:
    - Leads to better memory and time performance!
    - Avoids hard-to-spot computational errors!
* Pandas tries to guess the correct data-type (and is often wrong!)
* You will often need to explicitly convert between data types.

### Data types

* A Pandas **Data Type** is a classification that specifies the type of values in a column.
* A column's data type determines which operations can be applied to it.
* What data types do you know?

### Pandas data types

|Pandas dtype|Python type|NumPy type|SQL type|Usage|
|---|---|---|---|---|
|int64|int|int_, int8,...,int64, uint8,...,uint64|INT, BIGINT| Integer numbers|
|float64|float|float_, float16, float32, float64|FLOAT| Floating point numbers|
|bool|bool|bool_|BOOL|True/False values|
|datetime64|NA|datetime64[ns]|DATETIME|Date and time values|
|timedelta64[ns]|NA|timedelta64[ns]|NA|Differences between two datetimes|
|category|NA|NA|ENUM|Finite list of text values|
|object|str|string, unicode|NA|Text|
|object|NA|object|NA|Mixed types|

### Very useful article

https://www.dataquest.io/blog/pandas-big-data/

### Type conversion and the underlying `NumPy` array(s)
* The `.dtypes` attribute gives the data type of each column.
* `.values` on a column returns an array whose data type matches the column's entry in `.dtypes`.
* `.values` on the dataframe gives an array of mixed type (`object`) -- unless homogeneous

In [None]:
# Read in file
elections_fp = os.path.join('data', 'elections.csv')
elections = pd.read_csv(elections_fp)
elections.head()

In [None]:
elections.dtypes

In [None]:
elections['Year'] * 999_999_999**2

In [None]:
elections['%'].values

In [None]:
elections['%'].values.dtype

In [None]:
elections.values
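The same behavior can be seen without the elections file. The two small frames below are invented for illustration: one mixes integers with text, the other is all integers.

```python
import pandas as pd

mixed = pd.DataFrame({'n': [1, 2], 's': ['a', 'b']})
numeric = pd.DataFrame({'n': [1, 2], 'm': [3, 4]})

print(mixed['n'].values.dtype)  # an integer dtype: a single column keeps its own type
print(mixed.values.dtype)       # object: no single type fits both columns
print(numeric.values.dtype)     # an integer dtype: homogeneous columns share one type
```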

### Caution on conversion of `dtypes`: 

* `NumPy` and `Pandas` don't always guess dtype the same way!
* `Numpy` coerces dtype to optimize memory and read/write speed
* `Pandas` optimizes for "ease of development"

In [None]:
# Numpy likes homogeneous data types for space/speed

np.array(['a', 1])

In [None]:
# Pandas likes correctness and ease of use

pd.Series(['a', 1])

In [None]:
pd.Series(['a', 1]).values

In [None]:
np.array(['a', 1], dtype=object)

In [None]:
# Pandas makes a few trade-offs for efficiency
pd.Series([1, 1.0])

### Converting Types

- Use `.astype` series method:

In [None]:
ser = pd.Series(['1', '2', '3', '4'])
ser

In [None]:
ser.astype(int)

In [None]:
elections['Year'].astype(float)
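`.astype` fails as soon as one entry cannot be converted. A possible alternative for messy columns, not covered in the cells above, is `pd.to_numeric` with `errors='coerce'`. The sketch uses a made-up column with one bad entry.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'n/a', '4'])  # hypothetical column with one bad entry

# raw.astype(int) would raise a ValueError because of 'n/a'.
cleaned = pd.to_numeric(raw, errors='coerce')  # unparseable entries become NaN

print(cleaned)
```

The result is a float column, since `NaN` cannot be stored in a plain integer column.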

## `NumPy` vs `Pandas` performance, memory management

* NumPy is optimized for speed and memory consumption
* Pandas makes implementation choices that: 
    - are slow and use a lot of memory,
    - optimize for fast code development.

In [None]:
import random
data = random.choices(range(8), k=1_000_000)

In [None]:
# we know the data will fit in uint8
ser1 = pd.Series(data, dtype=np.uint8).to_frame()

# by default, pandas liberally stores integer data as int64
ser2 = pd.Series(data).to_frame()

In [None]:
ser1

In [None]:
ser1.info()

In [None]:
ser2.info()
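The difference reported by `.info()` can also be measured directly with `memory_usage`. This sketch uses a smaller array than the cells above so the byte counts are easy to read.

```python
import numpy as np
import pandas as pd

values = np.arange(1_000) % 8  # small integers, all in 0..7

small = pd.Series(values, dtype=np.uint8)
wide = pd.Series(values, dtype=np.int64)

print(small.memory_usage(index=False))  # 1 byte per entry
print(wide.memory_usage(index=False))   # 8 bytes per entry
```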

### Pandas Performance and Memory Management

* Pandas encourages writing 'data processing pipelines':
    - fit together the output of one method to the input of the next method.
    - requires that methods return a *copy*.
* NumPy often works in-place; Pandas **usually** copies the underlying array before transforming it!
    - See [Copies vs Views](https://afraenkel.github.io/practical-data-science/02/data-types.html#copies-and-views-in-pandas).
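A small sketch of what "methods return a copy" means in a pipeline. The one-column frame is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'v': [3, 1, 2]})

# Each method returns a new object, so the steps chain together...
result = df.sort_values('v').reset_index(drop=True)

# ...and the original frame is left as it was.
print(df['v'].tolist())
print(result['v'].tolist())
```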

### Part 4

# Useful `pd.Series` and `pd.DataFrame` methods

### Shared methods and attributes
* `head`/`tail` methods display the first/last few rows.
* `shape` attribute returns number of rows/columns.
* `size` attribute returns the number of entries.

In [None]:
elections.head(7)

In [None]:
elections.shape

In [None]:
elections.size

### `pd.Series` methods

|Method Name|Description|
|---|---|
|`count`|Count the number of non-null entries of a Series|
|`unique`|Returns unique values of Series object|
|`nunique`|Returns number of unique values of Series object|
|`value_counts`|Returns Series of counts of unique values|
|`describe`|Returns Series of descriptive stats of values|

In [None]:
# distinct candidates
elections['Candidate'].unique()

In [None]:
# number of distinct Candidates
elections['Candidate'].nunique()

In [None]:
# number of non-null entries in the Candidate column
elections['Candidate'].count()

In [None]:
# explain the output

republicans = elections[elections['Party'] == 'Republican']
republicans['Result'].value_counts()

In [None]:
republicans['%'].describe()

### `pd.DataFrame` methods

* DataFrames share *many* of the same methods with Series.
    - The DataFrame method applies the Series method to every row/column.
* Some of these methods take the `axis` keyword argument:
    - `axis=0`: the method is applied to series with index given by rows.
    - `axis=1`: the method is applied to series with index given by columns.
* Default value: `axis=0` (apply method to each column).

In [None]:
elections.head()

In [None]:
elections[['%', 'Year']].mean()

In [None]:
# doesn't make sense! why?
elections[['%', 'Year']].mean(axis=1)

### `pd.DataFrame` methods

|Method Name|Description|
|---|---|
|`sort_values`|Returns a DataFrame sorted by specified column|
|`drop_duplicates`|Returns a DataFrame with duplicate values dropped|
|`describe`|Returns descriptive stats of the data|

In [None]:
elections.sort_values('%', ascending=False).head(10)

In [None]:
# by default all columns are compared; `subset` restricts the comparison to the listed columns

elections.drop_duplicates(subset=['Candidate'])

## Adding and Modifying Columns (copy)

* Assign a new column of a DataFrame using `assign` method.
* Assign a new row with `append` (removed in pandas 2.0; use `pd.concat` there).
* Both return a copy of the DataFrame (great feature!)
* Re-assign an existing column to change the value.

In [None]:
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .head()
)

In [None]:
# chain together multiple steps
(
    elections
    .assign(proportion_of_vote=(elections['%'] / 100))
    .assign(Result=elections['Result'].str.upper())
    .head()
)

In [None]:
# If a column name has spaces, unpack a dictionary into keyword arguments with **
(
    elections
    .assign(**{'Proportion of Vote':(elections['%'] / 100)})
    .head()
)

## Warning!

- Adding a row with `.append` has terrible time complexity: every call copies the whole DataFrame.
- Use it sparingly.
- Namely, don't build a dataframe using `.append` in a loop.
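One common way around the problem, sketched under the assumption that rows arrive one at a time: collect plain Python records in a list and build the DataFrame once at the end. The `Year` and `Leap` fields are invented for the example.

```python
import pandas as pd

# Collect plain Python records first...
records = []
for year in [1976, 1980, 1984]:  # hypothetical source producing one row at a time
    records.append({'Year': year, 'Leap': year % 4 == 0})

# ...then build the DataFrame once.
df = pd.DataFrame(records)

# To stack existing DataFrames, pass them all to a single pd.concat call.
stacked = pd.concat([df, df], ignore_index=True)
```

Appending to a Python list is cheap, so the only full copy happens in the final constructor call.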

## Adding and Modifying Columns (in-place modify)

* Assign a new row/column of a dataframe *in-place* using `loc` or `[]`
    - Works like dictionary assignment.
    - Unlike `.assign`, this *modifies* the dataframe
* Re-assign an existing row/column to change the value.
* Changes the value of existing variables (careful!)
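The "careful!" above is about aliasing: two names for the same object both see an in-place change, while a copy does not. A minimal sketch with a made-up two-row table:

```python
import pandas as pd

original = pd.DataFrame({'pct': [50.1, 48.0]})

alias = original          # a second name for the same object
backup = original.copy()  # an independent copy (deep by default)

original['prop'] = original['pct'] / 100  # in-place column assignment

print('prop' in alias.columns)   # True: the alias sees the change
print('prop' in backup.columns)  # False: the copy does not
```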

In [None]:
# deep or shallow?
# what is the difference?

mod_elec = elections.copy()
mod_elec.head()

In [None]:
mod_elec['Proportion of Vote'] = mod_elec['%'] / 100
mod_elec.head()

In [None]:
mod_elec['Result'] = mod_elec['Result'].str.upper()
mod_elec.head()

In [None]:
# predict the output

mod_elec.loc[-1, :] = ['Carter', 'Democratic', 50.1, 1976, 'WIN', 0.501]
mod_elec.loc[-2, :] = ['Ford', 'Republican', 48.0, 1976, 'LOSS', 0.48]
mod_elec

In [None]:
mod_elec = mod_elec.sort_index()
mod_elec.head()

In [None]:
# df.reset_index(drop=True) drops the current index 
# of the DataFrame and replaces it with an index of increasing integers
mod_elec.reset_index(drop=True)


### Example: city of SD employee salaries
* Read data from web
* Look at dataframe basics

In [None]:
salaries = pd.read_csv('https://transcal.s3.amazonaws.com/public/export/san-diego-2017.csv')
salaries['Employee Name'] = salaries['Employee Name'].str.split().str[0] + ' xxxxx'

In [None]:
salaries.head()

In [None]:
salaries.info()

## Light cleaning
* Fix: `Other Pay` column non-numeric
* Drop useless columns (all one value)

In [None]:
salaries['Other Pay'].dtype

In [None]:
salaries['Other Pay'].unique()

In [None]:
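# which rows don't contain a decimal?
# note: str.contains treats '.00' as a regex by default, so '.' matches any character;
# pass regex=False to match a literal '.00'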
salaries.loc[salaries['Other Pay'].str.contains('.00') == False]

In [None]:
# filter out non-numeric entries
salaries = salaries.loc[salaries['Other Pay'].str.contains('.00') == True]
salaries

In [None]:
# convert to float
salaries['Other Pay'] = salaries['Other Pay'].astype(float)

In [None]:
# drop useless columns
salaries = salaries.drop(['Year', 'Notes', 'Agency'], axis=1)
salaries.head()

In [None]:
# proportion of jobs that are FT/PT
# If True then the object returned will contain the relative frequencies of the unique values.


salaries['Status'].value_counts(normalize=True)

In [None]:
# Salary Statistics
salaries.describe()

In [None]:
# are component pays equal to total pay?
# all: return whether all elements are True


(salaries.loc[:, ['Base Pay', 'Overtime Pay', 'Other Pay']].sum(axis=1) == salaries.loc[:,'Total Pay']).all()

In [None]:
# is total pay plus benefits equal to total pay & benefits column?
(salaries.loc[:, ['Total Pay', 'Benefits']].sum(axis=1) == salaries.loc[:, 'Total Pay & Benefits']).all()
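If either check comes back `False`, floating-point rounding is one possible cause, since `==` demands exact equality. A tolerance-based comparison with `np.isclose` is a common alternative. The sketch below uses invented numbers rather than the salary data.

```python
import numpy as np
import pandas as pd

# Hypothetical pay table; 0.1 + 0.2 is a classic rounding case.
pay = pd.DataFrame({'base': [0.1, 10.0], 'other': [0.2, 5.5], 'total': [0.3, 15.5]})

summed = pay[['base', 'other']].sum(axis=1)

exact = summed == pay['total']            # exact comparison, sensitive to rounding
close = np.isclose(summed, pay['total'])  # comparison within a small tolerance

print(exact.all(), close.all())
```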

In [None]:
salaries['Total Pay & Benefits'].plot(kind='hist', bins=10);

In [None]:
salaries.plot(kind='scatter', x='Base Pay', y='Overtime Pay')

In [None]:
pd.plotting.scatter_matrix(salaries[['Base Pay', 'Overtime Pay', 'Total Pay']], figsize=(8,8));

In [None]:
# who makes a lot of overtime?
salaries.loc[salaries['Overtime Pay'] > 100000]