# Lesson 2: Pandas Bootcamp Part 1

[Acknowledgments Page](https://ds100.org/fa23/acks/)

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

### Loading Elections Data Into a DataFrame:

The pandas [read_csv function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is one of the most versatile and useful functions for loading data.
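As a minimal sketch of how `read_csv` behaves, the cell below parses CSV text held in memory instead of a file. The path to the course's elections file is not given in this notebook, so `csv_text` and `sample` are hypothetical stand-ins built from a few rows that appear in the outputs further down.

```python
import io

import pandas as pd

# Hypothetical stand-in for the elections CSV file (rows copied from outputs shown in this notebook)
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
1828,Andrew Jackson,Democratic,642806,win,56.203927
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # column types inferred while parsing
```

With a real file, the first argument would be the file's path or URL instead of the `StringIO` object.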

**Practice:  Load the elections data**

In [None]:
elections = ...


### `DataFrame` attributes: `index`, `columns`

In [None]:
elections.index

In [None]:
elections.columns

The index can be reset to the default integer labels (0, 1, 2, ...) by calling `reset_index()` on a `DataFrame`. By default, the old index is moved back into a regular column.
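A small sketch of this round trip, using a hypothetical two-row `df` rather than the full elections data:

```python
import pandas as pd

# Hypothetical toy DataFrame, not the course dataset
df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Year": [1828, 1828],
})

by_candidate = df.set_index("Candidate")  # the index now holds the Candidate labels
restored = by_candidate.reset_index()     # back to 0, 1, ...; Candidate is a column again

print(by_candidate.index)
print(restored.index)
```

Both `set_index` and `reset_index` return a new `DataFrame`; the original is left unchanged unless the result is assigned back.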

# Extraction:

One of the most basic tasks for manipulating a DataFrame is to extract rows and columns of interest.   


### Label-Based Extraction Using `loc`

`loc` selects items by row and column *label*.  

`df.loc[row_labels, column_labels]`

"Labels" are the bolded text at the top (column labels) and left (row labels) of a displayed DataFrame.




Arguments to `.loc` can be:
1. A single label (one for the row and one for the column).
2. A list of labels.
3. A slice of labels (syntax is inclusive of the right-hand side of the slice).

In [6]:
# Here's how we can select all rows and just the Year and Party columns from the elections dataframe.
# Note we use a colon (:) in the first entry because we want to select all rows

elections.loc[:,["Year","Party"]]

,Year,Party
0,2020,Democratic
1,2020,Republican
2,2020,Libertarian
3,2020,Green
4,2016,Constitution
...,...,...
177,1832,Anti-Masonic
178,1828,Democratic
179,1828,National Republican
180,1824,Democratic-Republican


In [None]:
# Selection by a list

elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

In [None]:
# Selection by a list and a slice of columns
elections.loc[[87, 25, 179], "Popular vote":"%"]

In [None]:
# Extracting all rows using a colon
elections.loc[:, ["Year", "Candidate", "Result"]]

In [None]:
# Extracting all columns using a colon
elections.loc[[87, 25, 179], :]

In [None]:
# Selection by a list and a single-column label
elections.loc[[87, 25, 179], "Popular vote"]

In [None]:
# Note that if we pass "Popular vote" in a list, the output will be a DataFrame
elections.loc[[87, 25, 179], ["Popular vote"]]

In [None]:
# Selection by a row label and a column label
elections.loc[0, "Candidate"]

#### Integer-Based Extraction Using `iloc`

`iloc` selects items by row and column *integer* position.

Arguments to `.iloc` can be:
1. A list.
2. A slice (syntax is exclusive of the right hand side of the slice).
3. A single value.
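To contrast the two slice conventions, here is a short sketch on a hypothetical four-row `df` with the default integer index. The same slice `0:2` returns a different number of rows depending on the accessor.

```python
import pandas as pd

# Hypothetical toy DataFrame with the default integer index 0..3
df = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828],
    "Result": ["loss", "win", "win", "loss"],
})

by_label = df.loc[0:2, "Year"]   # label slice: includes the row labeled 2
by_position = df.iloc[0:2, 0]    # position slice: stops before position 2

print(len(by_label), len(by_position))
```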


In [None]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]

In [None]:
# Index-based extraction using a list of rows and a slice of column indices
elections.iloc[[1, 2, 3], 0:3]

In [None]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]

In [None]:
elections.iloc[[1, 2, 3], 1]

In [None]:
# Extracting the value at row 0 and the second column
elections.iloc[0,1]

#### Context-dependent Extraction using `[]`

We could technically do anything we want using `loc` or `iloc`. However, in practice, the `[]` operator is often used instead to yield more concise code.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it achieves essentially the same functionality. The difference is that `[]` is *context-dependent*.

`[]` only takes one argument, which may be:
1. A slice of row integers.
2. A list of column labels.
3. A single column label.


If we provide a slice of row numbers, we get the numbered rows.

In [None]:
elections[3:7]

If we provide a list of column names, we get the listed columns.

In [None]:
elections[["Year", "Candidate", "Result"]]

And if we provide a single column name we get back just that column, stored as a `Series`.

In [None]:
elections["Candidate"]

### Multi-indexed DataFrames

You can also define multiple indexes for the same DataFrame.  This is useful when you need more than one column to specify the granularity of the data.  
For example, if we wanted to use both `Year` and `Party` as our indices we would do this as follows:

In [3]:
elections_multindex = elections.set_index(["Year","Party"])

In [4]:
elections_multindex.head()

Year,Party,Candidate,Popular vote,Result,%
1824,Democratic-Republican,Andrew Jackson,151271,loss,57.210122
1824,Democratic-Republican,John Quincy Adams,113142,win,42.789878
1828,Democratic,Andrew Jackson,642806,win,56.203927
1828,National Republican,John Quincy Adams,500897,loss,43.796073
1832,Democratic,Andrew Jackson,702735,win,54.574789


### Accessing Data in Multi-indexed DataFrames:

Now, to access data we can use `.loc` where the first entry is a tuple: (year, party):


In [6]:
elections_multindex.loc[(1828,"Democratic"),:]



Year,Party,Candidate,Popular vote,Result,%
1828,Democratic,Andrew Jackson,642806,win,56.203927


Notice that the cell above produced a warning. It means that the index is not sorted. pandas relies on a sorted index (here lexicographically, since `Party` holds string values) for efficient search and retrieval. A quick fix is to sort the DataFrame in advance using `DataFrame.sort_index`. This is especially worthwhile for performance if you plan to run several such queries in a row:

In [7]:
elections_multindex = elections_multindex.sort_index()
elections_multindex.loc[(1828,"Democratic"),:]

Year,Party,Candidate,Popular vote,Result,%
1828,Democratic,Andrew Jackson,642806,win,56.203927


## Setting a New Index:

Suppose we want to know how many elections Andrew Jackson ran in.

**Practice:** Set the elections index to be Candidate.

In [None]:
elections = ...
elections

**Practice:** Select only the rows where Andrew Jackson was the candidate.

In [None]:
...

**Practice:  Reset the index (to the default integer indices)**

In [None]:
elections = ...

**Practice:  Create a new dataframe that is just the first 10 rows of the elections dataframe**

In [None]:
elections_first_10 = ...
elections_first_10

## Boolean Arrays

In [None]:
a = np.array([True, False, True, False, True, False, False, False, False, False])

In [None]:
# What happens when you sum a boolean array?
a.sum()

In [None]:
# What happens if you put a boolean array as an input to the .loc or [] operator?

In [None]:
elections_first_10[a]

## Conditional Selection

By passing in a sequence (list, array, or `Series`) of boolean values, we can extract a subset of the rows in a `DataFrame`. We will keep *only* the rows that correspond to a boolean value of `True`.
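A minimal sketch of the idea on a hypothetical four-row `df`, filtering on the `Result` column:

```python
import pandas as pd

# Hypothetical toy DataFrame, not the course dataset
df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams",
                  "Andrew Jackson", "John Quincy Adams"],
    "Year": [1824, 1824, 1828, 1828],
    "Result": ["loss", "win", "win", "loss"],
})

is_win = df["Result"] == "win"  # boolean Series with one entry per row
winners = df[is_win]            # keeps only the rows where is_win is True

print(is_win.sum())  # True counts as 1, so this counts the matching rows
print(winners)
```

The comparison runs elementwise, so `is_win` has the same index as `df`, and pandas aligns the two when filtering.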


**Practice:  Use Conditional Selection to Extract all rows from the elections DataFrame where the percentage of popular votes was greater than 50%**

In [None]:
# First, use a logical condition to generate a boolean array
logical_operator = ...

logical_operator

In [None]:
# Then, use this boolean array to filter the DataFrame
elections[logical_operator]


### Bitwise Operators

To filter on multiple conditions, we combine boolean conditions using **bitwise operators**.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
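The operators in the table can be sketched on a hypothetical four-row `df` as follows. Each condition is wrapped in parentheses because `&` and `|` bind more tightly than comparisons such as `==`.

```python
import pandas as pd

# Hypothetical toy DataFrame, not the course dataset
df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams",
                  "Andrew Jackson", "John Quincy Adams"],
    "Year": [1824, 1824, 1828, 1828],
    "Result": ["loss", "win", "win", "loss"],
})

# AND: both conditions must hold
jackson_wins = df[(df["Candidate"] == "Andrew Jackson") & (df["Result"] == "win")]

# OR: at least one condition must hold
early_or_loss = df[(df["Year"] == 1824) | (df["Result"] == "loss")]

# NOT: flips every True/False value
not_wins = df[~(df["Result"] == "win")]

print(jackson_wins)
print(early_or_loss)
print(not_wins)
```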

### Another Selection Option:  Query

**Practice: Read the documentation for `query` and use it to extract all rows from the elections DataFrame where the percentage of popular votes was greater than 50%**
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
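As a rough illustration of the `query` syntax, here is a sketch on a hypothetical four-row `df` that uses conditions different from the practice question. Column names containing spaces must be wrapped in backticks inside the query string.

```python
import pandas as pd

# Hypothetical toy DataFrame, not the course dataset
df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams",
                  "Andrew Jackson", "John Quincy Adams"],
    "Year": [1824, 1824, 1828, 1828],
    "Popular vote": [151271, 113142, 642806, 500897],
    "Result": ["loss", "win", "win", "loss"],
})

# Conditions are written as a string; `and` / `or` combine them
late_wins = df.query("Year > 1824 and Result == 'win'")

# Backticks quote a column name that contains a space
large_counts = df.query("`Popular vote` > 500000")

print(late_wins)
print(large_counts)
```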

In [None]:
...

## Adding, Removing, and Modifying Columns

### Adding a Column
To add a column, use `.assign()`

Syntax:

`df = df.assign(new_col_name = new_col_values)`
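A short sketch of this syntax on a hypothetical two-row `df`. The column name `Votes_thousands` is made up for illustration and is not part of the elections data.

```python
import pandas as pd

# Hypothetical toy DataFrame, not the course dataset
df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Popular vote": [151271, 113142],
})

# The keyword becomes the new column name; the value supplies its entries
with_thousands = df.assign(Votes_thousands=df["Popular vote"] / 1000)

print(with_thousands)
```

`.assign()` returns a new `DataFrame` and leaves `df` unchanged, which is why the syntax above assigns the result back to a variable.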


**Practice:  Add a new column to elections called "Total_Voters" that gives the total number of people who voted in that particular election**

In [None]:
...

### Modify a Column
To modify a column, use `.assign()` with the name of an existing column; the new values overwrite the old ones.

### Rename a Column Name
Rename a column using the `.rename()` method.

### Delete a Column
Remove a column using `.drop()`.
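The three operations above might look like this on a hypothetical two-row `df`. The new name `Percent` is chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy DataFrame, not the course dataset
df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Result": ["loss", "win"],
    "%": [57.210122, 42.789878],
})

# Modify: assigning to an existing column name overwrites that column
modified = df.assign(Result=df["Result"].str.upper())

# Rename: map old column names to new ones
renamed = modified.rename(columns={"%": "Percent"})

# Delete: drop the listed columns
dropped = renamed.drop(columns=["Percent"])

print(dropped)
```

Each method returns a new `DataFrame`, so the result must be assigned to a variable to keep the change.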

## Useful Utility Functions

### `NumPy`

`NumPy` functions are compatible with objects in `pandas`. 

In [None]:


# Mean popular vote across all rows (the column label is "Popular vote")
np.mean(elections["Popular vote"])

In [None]:
# Largest popular vote received by any single candidate

max(elections["Popular vote"])

### Built-In `pandas` Methods

There are many, *many* utility functions built into `pandas`, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the `pandas` [documentation](https://pandas.pydata.org/docs/reference/index.html).

#### Useful DataFrame Utility Functions

`df.info()`

`df.shape`

`df.describe()`

`df.isna()`

`df.value_counts()`

`df.sort_values()`

#### Useful Series Utility Functions

`series.unique()`

`series.sort_values()`

**Practice:  Practice with each of the utility functions above and/or read their pandas documentation.  Then explain what each one does.**

In [None]:
# Practice with the utility functions above.   Then explain what each method does.

...

...

...



**Practice: Using the applicable utility function(s), determine how many unique years of election data we have in this dataset, and when it begins and ends.**

In [19]:
# Hint: We can select ONE column from a DataFrame as a Series using the following:
elections["Party"]

0                 Democratic
1                 Republican
2                Libertarian
3                      Green
4               Constitution
               ...          
177             Anti-Masonic
178               Democratic
179      National Republican
180    Democratic-Republican
181    Democratic-Republican
Name: Party, Length: 182, dtype: object

**Practice:  Are there any missing or unexpected data values?  Explain.**

,Year,Popular vote,%
count,182.0,182.0,182.0
mean,1934.087912,12353640.0,27.47035
std,57.048908,19077150.0,22.968034
min,1824.0,100715.0,0.098088
25%,1889.0,387639.5,1.219996
50%,1936.0,1709375.0,37.677893
75%,1988.0,18977750.0,48.354977
max,2020.0,81268920.0,61.344703


### Practice Exercise:
**In how many different elections was the percentage of popular votes greater than 50%?**