# Lessons 1-2: Pandas Bootcamp Part 1
[Acknowledgments Page](https://ds100.org/fa23/acks/)

In [9]:
# import numpy
import numpy as np


<br/><br/>
<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />




## Dataset: United States Presidential Election Data
For our first analysis we will look at some data from US Presidential elections. This data is already stored in `data/elections.csv`.

Ideally, columns in a dataset are named in a way that clearly explains what they represent. If not, you will want to refer to the data's codebook (if one exists).

For this particular dataset the columns represent the following:



|Column|Description|
| --- | --- |
|Year| Year of the election | 
|Candidate| Candidate who ran| 
|Party | Party of candidate|
|Popular vote | Number of popular votes candidate received |
|Result | Whether the candidate won or lost the election |
|% | The percentage of popular votes the candidate received|



<br>

---

## Exploring CSV Files


We can explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
1. Opening the CSV directly in Excel, Google Sheets, etc.
2. Using the JupyterLab file explorer to look at the data
3. Loading it into pandas with `pd.read_csv()`

## Pandas module:
***
**Pandas** is an open source $\color{red}{\text{data analysis module}}$ in Python used for storing, cleaning, wrangling, and analyzing data.   (Fun fact: It was named as a shortcut for the term "$\textbf{pan}$el  $\textbf{da}$ta", a common term for multidimensional data sets encountered in statistics and econometrics.)





First, let's import the Pandas module. It's customary in data science to import Pandas with the alias $\texttt{pd}$. We can then access any function in the Pandas library by prefixing the function name with $\texttt{pd.}$

In [13]:
import pandas as pd

### $\color{red}{\textbf{Pandas}}$ Data Structures




Pandas has three types of data structures: 
- **Series**: A one-dimensional array with labeled indices (can hold mixed data types).
-  **DataFrame**: 2D tabular data structure with both row and column labels.  $\color{red}{\text{Rows}}$ have a specific index to access them, which can be $\color{red}{\text{any name or value}}$. The $\color{blue}{\text{columns}}$ are just $\color{blue}{\text{Pandas Series}}$. The Pandas DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. 
-  **Index**:  A sequence of row/column labels
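
As a minimal sketch of how these three structures relate, the snippet below builds a small `Series` and `DataFrame` from made-up values and checks the type of each piece. The labels `a`, `b`, `c` and the column names are hypothetical and are not taken from `elections.csv`.

```python
import pandas as pd

# Hypothetical toy values, used only for illustration
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="votes")

df = pd.DataFrame(
    {"votes": [10, 20, 30], "result": ["win", "loss", "win"]},
    index=["a", "b", "c"],
)

# A single column of a DataFrame is a Series
column_type = type(df["votes"]).__name__

# Row labels and column labels are both stored as Index objects
row_label_type = type(df.index).__name__
column_label_type = type(df.columns).__name__

print(column_type, row_label_type, column_label_type)
```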


![pandas-DataStructure.jpg](attachment:pandas-DataStructure.jpg)

# Data Wrangling With Pandas CheatSheet

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

# Functions vs. Methods vs. Attributes

An important characteristic of the Python language is the consistency of its object model. Every number, string, data structure, function, class, module, and so on exists in the Python interpreter in its own “box,” which is referred to as a Python object. Each object has an associated type (e.g., integer, string, or function) and internal data. In practice this makes the language very flexible, as even functions can be treated like any other object.

There are some distinctions between Functions, Methods and Attributes:

## Functions

Functions are either built-in, or user-defined.  

Here is a summary of top-level functions you call directly from pandas:  https://pandas.pydata.org/docs/reference/general_functions.html


#### Pandas Function for Loading Data Into a DataFrame:

The pandas [read_csv function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is one of the most versatile and useful functions for managing data.

Since we're loading a CSV file, the data is already in tabular format: each row represents a record of election data for a specific party in a given year. We therefore don't need to pass any additional arguments to the function for this file:

In [15]:
elections = pd.read_csv("data/elections.csv")

elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
182,2024,Donald Trump,Republican,77303568,win,49.808629
183,2024,Kamala Harris,Democratic,75019230,loss,48.336772
184,2024,Jill Stein,Green,861155,loss,0.554864
185,2024,Robert Kennedy,Independent,756383,loss,0.487357


**Poll:  What are 2 questions you could answer with this dataset?**

## Attributes

Attributes are variables that belong to an object. Attributes are defined within the class from which an object is instantiated. They hold information about the object and are accessed by referencing the object and then naming the attribute (`obj.attribute_name`, with no parentheses). Note that many attributes of built-in Python classes are "dundered", meaning that the name is prefixed and suffixed with a double underscore (e.g. `plus_some.__name__`).

### Useful DataFrame attributes:


1. **`df.shape`** – the number of rows and columns in the DataFrame
2. **`df.columns`** – the names of all columns
3. **`df.dtypes`** – the data type of each column
4. **`df.index`** – the labels used to identify rows
5. **`df.empty`** – whether the DataFrame is empty, i.e. has zero rows or zero columns (`True` or `False`)
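
The sketch below shows what each of these attributes returns on a small made-up DataFrame. The name `toy` and its values are hypothetical, so the same calls on `elections` will give different results.

```python
import pandas as pd

# Hypothetical toy data, not the real elections dataset
toy = pd.DataFrame(
    {
        "Year": [2000, 2000, 2004],
        "Candidate": ["A", "B", "C"],
        "%": [48.5, 47.9, 50.7],
    }
)

n_rows, n_cols = toy.shape          # a (rows, columns) tuple
column_names = list(toy.columns)    # the column labels
year_dtype = toy.dtypes["Year"]     # dtype of one column
row_labels = list(toy.index)        # default labels are 0, 1, 2, ...
is_empty = toy.empty                # False, since the frame has data

print(n_rows, n_cols, column_names, year_dtype, row_labels, is_empty)
```

Note that none of these are followed by parentheses: they are attributes, not methods.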


In [23]:
# Practice calling the attributes on the elections dataset
...


False

## Methods


A method is a function associated with a specified object (and, by extension, with the type of that object). That is, a method corresponds to a data-type operation.  Methods are defined within the class from which an object is instantiated. 


We can call (or invoke) a method as follows:  `obj.method_name()`
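
To make the distinction concrete, here is a small sketch that uses a function, an attribute, and a method on the same object. The DataFrame `toy` is a hypothetical example.

```python
import pandas as pd

toy = pd.DataFrame({"x": [3, 1, 2]})  # hypothetical toy data

n = len(toy)                      # function: called with the object as input
shape = toy.shape                 # attribute: accessed, no parentheses
top = toy.head(2)                 # method: called on the object
cls_name = type(toy).__name__     # a "dundered" attribute of the class

# Calling an attribute as if it were a method fails,
# because the tuple stored in .shape is not callable.
error_message = ""
try:
    toy.shape()
except TypeError as err:
    error_message = str(err)

print(n, shape, len(top), cls_name, error_message)
```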



**DataFrame methods:**
https://pandas.pydata.org/docs/reference/frame.html

**Series methods:**
https://pandas.pydata.org/docs/reference/series.html

### Useful DataFrame Utility Methods:


1. **`df.head()`** – shows the first few rows to quickly inspect the data
   [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)

2. **`df.info()`** – displays column names, data types, and non-null counts (from which missing values can be inferred)
   [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html)

3. **`df.sort_values(by = 'col_name', ascending = True)`** – reorders rows based on one or more columns
   [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

4. **`df.describe()`** – generates summary statistics for numeric columns
   [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html)

   
5. **`df.drop_duplicates(subset = "col", keep = "first")`** – removes repeated rows, optionally based on specific columns
   [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)


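6. **`df.value_counts()`** – counts how often each unique row (combination of column values) appears; on a single column (a Series), it counts each unique value
   [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html)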
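
A minimal sketch of several of these methods on a made-up DataFrame follows. The column names and values are hypothetical. One thing worth noticing is that each method returns a new object, and the original DataFrame is left unchanged unless the result is assigned back to it.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "Year": [2004, 2000, 2000, 2004],
        "Candidate": ["C", "A", "B", "A"],
        "Votes": [50, 40, 60, 45],
    }
)

first_two = toy.head(2)                         # first 2 rows
sorted_toy = toy.sort_values(by="Votes")        # ascending by default
unique_candidates = toy.drop_duplicates(subset="Candidate", keep="first")
candidate_counts = toy["Candidate"].value_counts()
summary = toy.describe()                        # numeric columns only

# The original row order is untouched by sort_values
print(list(toy.index), list(sorted_toy.index))
print(candidate_counts)
```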

---






**Practice: Select the first 8 rows of the elections DataFrame:**

In [28]:
# Select the first 8 rows of the DataFrame:
...

**Practice:  Are there any columns with missing data in this DataFrame?**

In [139]:

...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188 entries, 0 to 187
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Year          188 non-null    int64  
 1   Candidate     188 non-null    object 
 2   Party         188 non-null    object 
 3   Popular vote  188 non-null    int64  
 4   Result        188 non-null    object 
 5   %             188 non-null    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 8.9+ KB


**Practice: Call the .describe() method on the elections DataFrame.  Reading the output from this method, what is the first year of election data contained in this dataset?  What is the last year?**

In [None]:
...

first_year = ...

last_year = ...

**Practice: Use `.value_counts()` to find the candidate that has run in the most elections**

In [54]:
...

Candidate
Norman Thomas         5
Franklin Roosevelt    4
Eugene V. Debs        4
Ralph Nader           4
Andrew Jackson        3
                     ..
Silas C. Swallow      1
Alton B. Parker       1
John G. Woolley       1
Joshua Levering       1
Chase Oliver          1
Name: count, Length: 135, dtype: int64

**Practice: Sort values in the elections DataFrame by Candidate (ascending)**

In [None]:
...

**Practice: Sort values in the elections DataFrame by Year (ascending) and by Popular vote (descending)**

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
186,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
154,1876,Samuel J. Tilden,Democratic,4288546,loss,51.528376
145,1888,Grover Cleveland,Democratic,5534488,loss,48.656799
32,2000,Al Gore,Democratic,50999897,loss,48.491813
14,2016,Hillary Clinton,Democratic,65788564,loss,48.238558


**Practice: Run the cell below.  What has happened to your sorted values?  Why do you think this is the case?**

In [150]:
elections

Year  Candidate               Party                  Popular vote  Result  %        
1824  Andrew Jackson          Democratic-Republican  151271        loss    57.210122    1
1980  Jimmy Carter            Democratic             35480115      loss    41.132848    1
1972  Richard Nixon           Republican             47168710      win     60.907806    1
1976  Eugene McCarthy         Independent            740460        loss    0.911649     1
      Gerald Ford             Republican             39148634      loss    48.199499    1
                                                                                       ..
1908  William Jennings Bryan  Democratic             6408979       loss    43.414640    1
      William Taft            Republican             7678335       win     52.013300    1
1912  Eugene V. Debs          Socialist              901551        loss    6.004354     1
      Eugene W. Chafin        Prohibition            208156        loss    1.386325     1
2024  Robert F.

**Practice: Try using `.drop_duplicates(subset = "col_name", keep = "first")` to only keep the rows with candidates who had the largest popular vote in a given year.  (Hint: You will need to sort your dataframe first)**

In [None]:
...

# Common Task:  Extraction:

One of the most basic tasks for manipulating a DataFrame is to extract rows and columns of interest.   


### Label-Based Extraction Using `loc`

`loc` selects items by row and column *label*.  

`df.loc[row_labels, column_labels]`

We describe "labels" as the bolded text at the top and left of a DataFrame.




Arguments to `.loc` can be:
1. A single label.
2. A list of labels.
3. A slice of labels (syntax is inclusive of the right-hand side of the slice).
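
The three argument forms can be sketched on a small made-up DataFrame. The row labels `r0` to `r3` and the column names are hypothetical, chosen so that labels are clearly distinct from integer positions.

```python
import pandas as pd

# Hypothetical toy data with string row labels
toy = pd.DataFrame(
    {
        "Year": [2000, 2000, 2004, 2004],
        "Candidate": ["A", "B", "C", "D"],
        "Votes": [40, 60, 50, 45],
    },
    index=["r0", "r1", "r2", "r3"],
)

# Single row label and single column label: one value
single = toy.loc["r1", "Votes"]

# Lists of labels: a smaller DataFrame
listed = toy.loc[["r0", "r2"], ["Candidate", "Votes"]]

# Slices of labels: BOTH endpoints are included
sliced = toy.loc["r1":"r3", "Year":"Candidate"]

print(single)
print(listed)
print(sliced)
```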

**Practice: Select the column "Popular vote" and the rows with labels 1, 3, and 10**

In [None]:
...

### Integer-Based Extraction Using `iloc`

`iloc` selects items by row and column *integer* position.

Arguments to `.iloc` can be:
1. A list.
2. A slice (syntax is exclusive of the right hand side of the slice).
3. A single value.
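
Here is a short sketch of the same three forms with `iloc`, using a hypothetical toy DataFrame. Unlike `loc`, the stop position of a slice is excluded.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "Year": [2000, 2000, 2004, 2004],
        "Candidate": ["A", "B", "C", "D"],
        "Votes": [40, 60, 50, 45],
    }
)

# Single positions: row 2, column 0
single = toy.iloc[2, 0]

# Lists of positions: rows 0 and 3, columns 1 and 2
listed = toy.iloc[[0, 3], [1, 2]]

# Slices of positions: rows 1 and 2, columns 0 and 1 (stop is excluded)
sliced = toy.iloc[1:3, 0:2]

print(single)
print(listed)
print(sliced)
```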


In [211]:
# Select the rows at integer positions 1, 2, and 3.
# Select the columns at integer positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
...


In [209]:
# Extract the value in the first row and the second column
...

### Context-Dependent Extraction Using `[]`

We could technically do anything we want using `loc` or `iloc`. However, in practice, the `[]` operator is often used instead to yield more concise code.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it achieves essentially the same functionality. The difference is that `[]` is *context-dependent*.

`[]` only takes one argument, which may be:
1. A slice of row integers.
2. A list of column labels.
3. A single column label.
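
As a rough sketch, the three kinds of argument behave as follows on a hypothetical toy DataFrame. The type of the result depends on what is passed in.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "Year": [2000, 2000, 2004, 2004],
        "Candidate": ["A", "B", "C", "D"],
        "Votes": [40, 60, 50, 45],
    }
)

rows = toy[1:3]                      # slice of row positions -> DataFrame
cols = toy[["Votes", "Candidate"]]   # list of column labels -> DataFrame
col = toy["Votes"]                   # single column label -> Series

print(type(rows).__name__, type(cols).__name__, type(col).__name__)
```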


If we provide a slice of row numbers, [start:stop], we get all rows with those integer positions.  While the element at the start index is included, the stop index is not included, so that the number of elements in the result is stop - start. 

In [None]:
# Select rows at integer positions 3 through 6
...

If we provide a list of column names, we get the listed columns:


In [None]:
# Select the columns Party, Candidate and Result in that order
...

And if we provide a single column name we get back just that column, stored as a `Series`.

In [238]:
# Select just the column "Candidate"
...

### Multi-indexed DataFrames

You can also define multiple indexes for the same DataFrame.  This is useful when you need more than one column to specify the granularity of the data.  
For example, if we wanted to use both `Year` and `Party` as our indices we would do this as follows:

In [None]:
elections_multindex = elections.set_index(["Year","Party"])

In [None]:
elections_multindex.head()

### Accessing Data in Multi-indexed DataFrames:

Now, to access data we can use `.loc` where the first entry is a tuple: (year, party):


In [None]:
elections_multindex.loc[(1828,"Democratic"),:]

Notice that we got a warning above. This just means that the index is not sorted. pandas relies on the index being sorted (in this case lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix is to sort the DataFrame in advance using `DataFrame.sort_index`. This is especially desirable from a performance standpoint if you plan to run several such queries in a row:

In [None]:
elections_multindex = elections_multindex.sort_index()
elections_multindex.loc[(1828,"Democratic"),:]
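
One way to check whether a multi-index is sorted is the `is_monotonic_increasing` property of the index. The sketch below uses a hypothetical toy DataFrame rather than the elections data.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order
toy = pd.DataFrame(
    {
        "Year": [2004, 2000, 2004, 2000],
        "Party": ["Rep", "Dem", "Dem", "Rep"],
        "Votes": [50, 40, 45, 60],
    }
)

multi = toy.set_index(["Year", "Party"])
was_sorted = multi.index.is_monotonic_increasing        # False

multi_sorted = multi.sort_index()
now_sorted = multi_sorted.index.is_monotonic_increasing  # True

# Look up one (Year, Party) pair in the sorted frame
votes_2000_dem = multi_sorted.loc[(2000, "Dem"), "Votes"]

print(was_sorted, now_sorted, votes_2000_dem)
```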

## Setting a New Index:

Suppose we want to know how many elections Andrew Jackson ran in.

**Practice:** Set the elections index to be Candidate.

In [None]:
...
elections

**Practice: Select only the rows when Andrew Jackson ran in an election**

In [None]:
...

**Practice:  Reset the index (to the default integer indices)**

In [None]:
...

**Practice:  Create a new dataframe that is just the first 10 rows of the elections dataframe**

In [None]:
elections_first_10 = ...


## Boolean Arrays

In [None]:
a = np.array([True, False, True, False, True, False, False, False, False, False])

In [None]:
# What happens when you sum a boolean array?
...

In [None]:
# What happens if you put a boolean array as an input to the .loc or [] operator?
...

## Common Task:  Conditional Selection


### **Option 1: Boolean Indexing**

By passing in a sequence (list, array, or `Series`) of boolean values, we can extract a subset of the rows in a `DataFrame`. We will keep *only* the rows that correspond to a boolean value of `True`.



```python
elections[
    (elections["Result"] == "win") &
    (elections["Popular vote"] > 70000000)
]
```

**How it works**

* Uses **explicit column access** (`df["col"]`)
* Conditions are combined with `&`, `|`, `~`
* Very explicit and Pythonic

**Pros**

* Clear what’s happening under the hood
* Works in all situations
* Easy to debug step by step

**Cons**

* Verbose
* Lots of brackets and parentheses
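
To see what happens "under the hood", the filter can be built one step at a time. This is a sketch on hypothetical toy data, and the intermediate names `won`, `big`, and `mask` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "Candidate": ["A", "B", "C", "D"],
        "Result": ["win", "loss", "win", "loss"],
        "Votes": [40, 60, 50, 45],
    }
)

won = toy["Result"] == "win"   # boolean Series: True, False, True, False
big = toy["Votes"] > 45        # boolean Series: False, True, True, False
mask = won & big               # True only where both are True

n_matches = mask.sum()         # each True counts as 1
selected = toy[mask]           # keeps only the rows where mask is True

print(n_matches)
print(selected)
```

Inspecting `won`, `big`, and `mask` separately is a convenient way to debug a filter that returns unexpected rows.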

---


**Practice:  Use Boolean Indexing to Extract all rows from the elections DataFrame in which Norman Thomas was a candidate**

In [None]:
...


### Bitwise Operators

To filter on multiple conditions, we combine boolean conditions using **bitwise operators**.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
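
The table can be checked directly on two small boolean Series. The names `p` and `q` follow the table, and the values are made up. The sketch also shows that Python's `and` keyword does not work element-wise on Series.

```python
import pandas as pd

p = pd.Series([True, True, False, False])
q = pd.Series([True, False, True, False])

not_p = ~p       # negation
p_or_q = p | q   # OR
p_and_q = p & q  # AND
p_xor_q = p ^ q  # exclusive OR

# The keyword `and` asks for a single truth value of a whole Series,
# which is ambiguous, so pandas raises a ValueError.
and_failed = False
try:
    p and q
except ValueError:
    and_failed = True

print(list(not_p), list(p_or_q), list(p_and_q), list(p_xor_q), and_failed)
```

When combining comparisons, wrap each one in parentheses, as in `(df["a"] > 1) & (df["b"] < 5)`, because `&` and `|` bind more tightly than `>`, `<`, and `==`.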

**Practice: Extract all rows from the elections DataFrame when Andrew Jackson was elected president**

In [None]:

...

**Practice:  Extract all rows from the elections DataFrame where the percentage of popular votes was greater than 50 AND the candidate lost**

In [None]:
...

### **Option 2: `query()`**

```python
elections.query("Result == 'win' and `Popular vote` > 70000000")
```

**How it works**

* Conditions written as a **string**
* Column names behave like **variables**
* Uses `and`, `or`, `not`

**Pros**

* More readable
* Looks like English / SQL
* Fewer brackets

**Cons**

* Column names with spaces need backticks
* Harder to debug
* Less flexible in complex logic

---

Documentation for query:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
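
The sketch below compares `query()` with the equivalent boolean mask on hypothetical toy data. It also uses the `@` prefix, which lets the query string refer to a Python variable (here a made-up `threshold`).

```python
import pandas as pd

# Hypothetical toy data; note the space in "Popular vote"
toy = pd.DataFrame(
    {
        "Candidate": ["A", "B", "C", "D"],
        "Result": ["win", "loss", "win", "loss"],
        "Popular vote": [40, 60, 50, 45],
    }
)

threshold = 45  # hypothetical cutoff

# Backticks around the column name with a space; @ refers to a Python variable
via_query = toy.query("Result == 'win' and `Popular vote` > @threshold")

# The same filter written with boolean indexing
via_mask = toy[(toy["Result"] == "win") & (toy["Popular vote"] > threshold)]

print(via_query)
print(via_query.equals(via_mask))
```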



**Practice: Select all rows in which a candidate won the popular vote but lost the election.  How/why can this occur?**

## Common Task: Adding, Removing, and Modifying Columns

### Adding or Modifying a Column
To add a new column (or modify an existing one), use `.assign()`.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html 


Syntax:

`df = df.assign(new_col_name = new_col_values)`
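
A minimal sketch of this syntax on hypothetical toy data is shown below. The column names `votes_thousands` and `Total votes` are made up for illustration. Since `.assign()` returns a new DataFrame, the original is only updated if the result is assigned back.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Candidate": ["A", "B"], "Votes": [4000, 6000]})

# New column computed from an existing one
with_col = toy.assign(votes_thousands=toy["Votes"] / 1000)

# Column names that are not valid Python identifiers (e.g. with a space)
# can be passed by unpacking a dictionary
with_total = toy.assign(**{"Total votes": toy["Votes"].sum()})

print(list(toy.columns))        # original is unchanged
print(list(with_col.columns))
print(with_total)
```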


In [None]:
# Add a column called frac_voters with the fraction of voters who voted in each election
...

**Practice:  Add a new column to elections called "TotVoters" that gives the total number of people who voted in that particular election**

In [None]:
...

elections

### Rename a Column Name
Rename a column using the `.rename()` method.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html
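
As a sketch, `.rename()` accepts a dictionary mapping old names to new names through its `columns` parameter. The toy DataFrame and the names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Candidate": ["A", "B"], "Votes": [4000, 6000]})

# Map old column name -> new column name; unlisted columns are kept as-is
renamed = toy.rename(columns={"Votes": "Popular vote"})

print(list(toy.columns))      # original is unchanged
print(list(renamed.columns))
```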


Rename "TotVoters" to "Total_Voters":

In [None]:
...

### Delete a Column
Remove a column using `.drop()`
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html 
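
A short sketch of `.drop()` with its `columns` parameter follows, again on hypothetical toy data. Passing a list removes several columns at once.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "Candidate": ["A", "B"],
        "Votes": [4000, 6000],
        "extra_1": [1, 2],
        "extra_2": [3, 4],
    }
)

# Returns a new DataFrame without the listed columns
trimmed = toy.drop(columns=["extra_1", "extra_2"])

print(list(toy.columns))      # original is unchanged
print(list(trimmed.columns))
```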


Drop the columns "frac_voters" and "Total_Voters":

In [None]:
...

# Other Useful Functions



## NumPy Functions Commonly Used with DataFrames and Series

1. **`np.mean(obj)`** – computes the average of values (column-wise for DataFrames)
   [https://numpy.org/doc/stable/reference/generated/numpy.mean.html](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)

2. **`np.median(obj)`** – computes the median of values
   [https://numpy.org/doc/stable/reference/generated/numpy.median.html](https://numpy.org/doc/stable/reference/generated/numpy.median.html)

3. **`np.std(obj)`** – computes the standard deviation of values
   [https://numpy.org/doc/stable/reference/generated/numpy.std.html](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

4. **`np.sum(obj)`** – sums values (similar to `sum(obj)` but with NumPy behavior)
   [https://numpy.org/doc/stable/reference/generated/numpy.sum.html](https://numpy.org/doc/stable/reference/generated/numpy.sum.html)
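
The sketch below applies these functions to a small made-up Series. For the standard deviation it converts to a NumPy array first, because `np.std` on an array uses the population formula (`ddof=0`) while the pandas method `Series.std()` defaults to the sample formula (`ddof=1`), so the two numbers differ.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 6.0])  # hypothetical values

mean_value = np.mean(s)      # 3.0
median_value = np.median(s)  # 2.5
total = np.sum(s)            # 12.0

population_std = np.std(s.to_numpy())  # divides by n
sample_std = s.std()                   # divides by n - 1

print(mean_value, median_value, total, population_std, sample_std)
```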


In [None]:
# Examples:

np.mean(elections["Popular vote"])

In [None]:
# Max 

np.max(elections["Popular vote"])



## Python Functions Commonly Used with DataFrames and Series

1. **`len(obj)`** – returns the number of rows in a DataFrame or elements in a Series
   [https://docs.python.org/3/library/functions.html#len](https://docs.python.org/3/library/functions.html#len)

2. **`type(obj)`** – shows the object’s type (e.g., `DataFrame`, `Series`)
   [https://docs.python.org/3/library/functions.html#type](https://docs.python.org/3/library/functions.html#type)

3. **`sum(obj)`** – adds the values in a Series. Iterating over a DataFrame yields its column labels, so use `df.sum()` for column-wise totals
   [https://docs.python.org/3/library/functions.html#sum](https://docs.python.org/3/library/functions.html#sum)

4. **`min(obj)`** – returns the smallest value in a Series. On a DataFrame it compares column labels, so use `df.min()` for column-wise minimums
   [https://docs.python.org/3/library/functions.html#min](https://docs.python.org/3/library/functions.html#min)

5. **`max(obj)`** – returns the largest value in a Series. On a DataFrame it compares column labels, so use `df.max()` for column-wise maximums
   [https://docs.python.org/3/library/functions.html#max](https://docs.python.org/3/library/functions.html#max)
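
Here is a brief sketch of these built-ins on a hypothetical Series and DataFrame. It also checks what Python sees when it iterates over a DataFrame, which is worth keeping in mind before passing a whole DataFrame to `sum`, `min`, or `max`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"a": [3, 1, 2], "b": [10, 30, 20]})
s = toy["a"]

n_rows = len(toy)              # number of rows in the DataFrame
n_items = len(s)               # number of elements in the Series
kind = type(s).__name__        # "Series"

total = sum(s)                 # 3 + 1 + 2
smallest = min(s)
largest = max(s)

# Iterating over a DataFrame yields its column labels
labels = list(toy)

# Column-wise results come from the DataFrame methods instead
column_max = toy.max()

print(n_rows, n_items, kind, total, smallest, largest, labels)
print(column_max)
```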





#### Useful Python Function:  len()

In [None]:
len(elections["Party"])

In [None]:
len(elections)

##  Useful Series Methods 

1. **`s.value_counts()`** – counts how often each unique value appears
   [https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html)

2. **`s.unique()`** – returns the unique values in the Series
   [https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html)

3. **`s.sort_values()`** – sorts the Series by its values
   [https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html)

4. **`s.isna()`** – returns a boolean Series indicating missing values
   [https://pandas.pydata.org/docs/reference/api/pandas.Series.isna.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.isna.html)



In [None]:
elections["Party"].unique()

In [None]:
len(elections["Party"].unique())

In [None]:
elections["Candidate"].sort_values()

In [None]:
elections["Candidate"].sort_values(ascending=False)

In [None]:
elections["Candidate"].value_counts()

In [69]:
elections[["Candidate", "Party"]].isna()

Unnamed: 0,Candidate,Party
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
...,...,...
182,False,False
183,False,False
184,False,False
185,False,False


In [None]:
elections["Candidate"].isna()

In [None]:
elections[elections["Candidate"].isna()]