## Intro to Pandas Data Structures

In [1]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

### $\color{red}{\textbf{Pandas}}$ Data Structures




Pandas has three types of data structures: 
- **Series**: A one dimensional array with labeled indices (can be mixed data types). 
-  **DataFrame**: 2D tabular data structure with both row and column labels.  $\color{red}{\text{Rows}}$ have a specific index to access them, which can be $\color{red}{\text{any name or value}}$. The $\color{blue}{\text{columns}}$ are just $\color{blue}{\text{Pandas Series}}$. The Pandas DataFrame data structure can be seen as a spreadsheet, but it is much more flexible. 
-  **Index**:  A sequence of row/column labels
 

Series, DataFrames, and Indices are fundamental `pandas` data structures for storing tabular data and processing the data using vectorized operations.
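As a rough illustration of how the three structures relate, the sketch below builds a small `DataFrame` and checks that a single column is a `Series` and that the row and column labels are both `Index` objects. The `fruits` table and its row labels are made up for this example.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the three structures
fruits = pd.DataFrame(
    {"Fruit": ["Strawberry", "Orange"], "Price": [5.49, 3.99]},
    index=["r1", "r2"],
)

price = fruits["Price"]       # a single column is a Series
row_labels = fruits.index     # row labels are stored in an Index
col_labels = fruits.columns   # column labels are stored in an Index as well

print(type(price).__name__)
print(type(row_labels).__name__)
print(list(col_labels))
```

Note that the column `Series` shares its row labels with the parent `DataFrame`.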


### Series

A `Series` is a 1-D labeled array of data. We can think of it as columnar data. 

#### Creating a new `Series` object
Below, we create a `Series` object and look at its two components: 1) values and 2) index.

In [None]:
s = pd.Series(["welcome", "to", "CSPB 3022"])

s

In [None]:
s.values

In [None]:
s.index

In the example above, `pandas` automatically generated an `Index` of integer labels. We can also create a `Series` object by providing a custom `Index`.

In [None]:
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s

In [None]:
s.values

In [None]:
s.index

After it has been created, we can reassign the Index of a `Series` to a new Index.

In [None]:
s.index = ["first", "second", "third"]
s

#### Selection in Series
We can select a single value or a set of values in a `Series` using:
- A single label
- A list of labels
- A filtering condition

In [None]:
s = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
s

**Selection using one or more label(s)**

In [None]:
# Selection using a single label
# Notice how the return value is a single array element
s["a"]

In [None]:
# Selection using a list of labels
# Notice how the return value is another Series
s[["a", "c"]]

**Selection using a filter condition**

In [None]:
# Filter condition: select all elements greater than 0
s>0

In [None]:
# Use the Boolean filter to select data from the original Series
s[s>0]
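Filter conditions can also be combined. The sketch below, which reuses the same values as `s` above, uses `&` (and) and `|` (or) on Boolean `Series`; the variable names `mask`, `selected`, and `either` are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series([4, -2, 0, 6], index=["a", "b", "c", "d"])

# Each comparison needs its own parentheses, because & and | bind
# more tightly than > and <
mask = (s > 0) & (s < 5)
selected = s[mask]               # elements strictly between 0 and 5

either = s[(s < 0) | (s > 5)]    # elements below 0 or above 5

print(selected)
print(either)
```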

**Vectorized Operations Using Series**

In [None]:
s

In [None]:
#Raising to a power
s**2

In [None]:
# Dividing by a constant; the result is stored in `a`, and `s` itself is unchanged
a=s/4

In [None]:
s

In [None]:
s2=pd.Series([-2, 10, 100, 2], index=["a", "b", "c", "d"])
s2

In [None]:
s+s2

In [None]:
s/s2

In [None]:
s2/s

In [None]:
s2

In [None]:
s3=pd.Series([20, 5, -10, 200], index=["b", "c", "d", "e"])
s3

In [None]:
s2+s3
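Arithmetic between two `Series` is aligned on index labels rather than on position. In `s2 + s3`, the labels `"a"` and `"e"` each appear in only one operand, so the result holds `NaN` there. The sketch below repeats that sum and, as one possible alternative, uses `Series.add` with `fill_value=0` to treat a missing partner as zero; the names `aligned` and `filled` are introduced only for this example.

```python
import pandas as pd

s2 = pd.Series([-2, 10, 100, 2], index=["a", "b", "c", "d"])
s3 = pd.Series([20, 5, -10, 200], index=["b", "c", "d", "e"])

# Labels are matched first; unmatched labels ("a" and "e") become NaN
aligned = s2 + s3

# Same sum, but a label missing from one operand is treated as 0
filled = s2.add(s3, fill_value=0)

print(aligned)
print(filled)
```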





### DataFrame

A `DataFrame` is a 2-D tabular data structure with both row and column labels. In this lecture, we will see how a `DataFrame` can be created from scratch or loaded from a file. 

### Creating a new `DataFrame` object
We can create a `DataFrame` in a variety of ways. Here, we cover the following:
1. From a CSV file
2. Using a list and column names
3. From a dictionary
4. From a `Series`


#### Creating a `DataFrame` from a CSV file
For loading data into a `DataFrame`, `pandas` has a number of very useful file reading tools. We'll be using `read_csv` today to load data from a CSV file into a `DataFrame` object. 

In [2]:
elections = pd.read_csv("data/elections.csv")
elections

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789
...,...,...,...,...,...,...
177,2016,Jill Stein,Green,1457226,loss,1.073699
178,2020,Joseph Biden,Democratic,81268924,win,51.311515
179,2020,Donald Trump,Republican,74216154,loss,46.858542
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


By passing a column name to the `index_col` parameter of `read_csv`, the `Index` can be defined when the data is loaded.

In [3]:
elections = pd.read_csv("data/elections.csv",index_col ="Party" )
elections

Party,Year,Candidate,Popular vote,Result,%
Democratic-Republican,1824,Andrew Jackson,151271,loss,57.210122
Democratic-Republican,1824,John Quincy Adams,113142,win,42.789878
Democratic,1828,Andrew Jackson,642806,win,56.203927
National Republican,1828,John Quincy Adams,500897,loss,43.796073
Democratic,1832,Andrew Jackson,702735,win,54.574789
...,...,...,...,...,...
Green,2016,Jill Stein,1457226,loss,1.073699
Democratic,2020,Joseph Biden,81268924,win,51.311515
Republican,2020,Donald Trump,74216154,loss,46.858542
Libertarian,2020,Jo Jorgensen,1865724,loss,1.177979
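To try `index_col` without the course data file, the sketch below reads a small in-memory CSV through `io.StringIO`. The two rows are copied from the table above with some columns left out, and `csv_text`, `by_party`, and `same` are names made up for this example. It also shows that loading first and calling `set_index` afterwards should give an equivalent result.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk
csv_text = """Year,Candidate,Party,Result
1828,Andrew Jackson,Democratic,win
1828,John Quincy Adams,National Republican,loss
"""

# Define the index while loading
by_party = pd.read_csv(io.StringIO(csv_text), index_col="Party")

# Load with the default integer index, then set the index afterwards
same = pd.read_csv(io.StringIO(csv_text)).set_index("Party")

print(by_party)
print(by_party.equals(same))
```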


#### Creating a `DataFrame` using a list and column names

In [4]:
# Creating a single-column DataFrame using a list
df_list_1 = pd.DataFrame([1, 2, 3], 
                         columns = ["Number"])
display(df_list_1)

Unnamed: 0,Number
0,1
1,2
2,3


In [5]:
# Creating a multi-column DataFrame using a list of lists
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], 
                         columns = ["Number", "Description"])
df_list_2

Unnamed: 0,Number,Description
0,1,one
1,2,two


#### Creating a `DataFrame` from a dictionary

In [6]:
# Creating a DataFrame from a dictionary of columns
df_dict_1 = pd.DataFrame({"Fruit":["Strawberry", "Orange"], 
                          "Price":[5.49, 3.99]})
df_dict_1

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99


In [7]:
# Creating a DataFrame from a list of row dictionaries
df_dict_2 = pd.DataFrame([{"Fruit":"Strawberry", "Price":5.49}, 
                          {"Fruit":"Orange", "Price":3.99}])
df_dict_2

Unnamed: 0,Fruit,Price
0,Strawberry,5.49
1,Orange,3.99


#### Creating a `DataFrame` from a `Series`

In [8]:
# In the examples below, we create a DataFrame from a Series

s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])

In [9]:
# Passing Series objects for columns
df_ser = pd.DataFrame({"A-column":s_a, "B-column":s_b})
df_ser

Unnamed: 0,A-column,B-column
r1,a1,b1
r2,a2,b2
r3,a3,b3


In [10]:
# Passing a Series to the DataFrame constructor to make a one-column dataframe
df_ser = pd.DataFrame(s_a)
df_ser

Unnamed: 0,0
r1,a1
r2,a2
r3,a3


In [11]:
# Using to_frame() to convert a Series to DataFrame
ser_to_df = s_a.to_frame()
ser_to_df

Unnamed: 0,0
r1,a1
r2,a2
r3,a3
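In both outputs above, the single column is labeled `0` because the `Series` has no name. As a small sketch, `to_frame` accepts a `name` argument that sets the column label; the label `"A-column"` is reused from the earlier example, and `unnamed` and `named` are illustrative variable names.

```python
import pandas as pd

s_a = pd.Series(["a1", "a2", "a3"], index=["r1", "r2", "r3"])

unnamed = s_a.to_frame()                 # column label defaults to 0
named = s_a.to_frame(name="A-column")    # column label chosen explicitly

print(unnamed.columns.tolist())
print(named.columns.tolist())
```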


### Setting/Resetting the index

In [None]:
elections.reset_index(inplace = True) # Reset the index so that 'Party' is a regular DataFrame column again
elections.set_index("Party", inplace=True) # This sets the index to the "Party" column
elections

### `DataFrame` attributes: `index`, `columns`

In [None]:
elections.index

In [None]:
elections.columns

The `Index` column can be set to the default list of integers by calling `reset_index()` on a `DataFrame`.

In [None]:
elections.reset_index(inplace=True) # Revert the index back to its default numeric labeling
elections
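The cells above modify `elections` in place. As an alternative sketch, `set_index` and `reset_index` return new `DataFrame` objects when `inplace` is not passed, which leaves the original untouched. The small `df` below is a made-up miniature of the elections table.

```python
import pandas as pd

# Hypothetical miniature of the elections table
df = pd.DataFrame({
    "Year": [1828, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams"],
    "Party": ["Democratic", "National Republican"],
})

indexed = df.set_index("Party")     # new DataFrame; df keeps its "Party" column
restored = indexed.reset_index()    # "Party" moves back to a regular column

print(indexed)
print(restored)
```

After `reset_index`, the former index is inserted as the first column, so the column order can differ from the original.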

### Slicing in `DataFrame`s

We can use `.head` to return only a few rows of a dataframe.

In [None]:
# Loading DataFrame again to keep the original ordering of columns
elections = pd.read_csv("data/elections.csv")

elections.head(15) # Show the first 15 rows; with no argument, .head() shows the first 5

In [None]:
elections.head(13)

We can also use `.tail` to return the last few rows.

In [None]:
elections.tail(9)

#### Label-Based Extraction Using `loc`

Arguments to `.loc` can be:
1. A list.
2. A slice (syntax is inclusive of the right-hand side of the slice).
3. A single value.


`loc` selects items by row and column *label*.

In [None]:
# Selection by a list
elections.loc[[87, 25, 179], ["Year", "Candidate", "Result"]]

In [None]:
# Selection by a list and a slice of columns
elections.loc[[87, 25, 179], "Popular vote":"%"]

In [None]:
# Extracting all rows using a colon
elections.loc[:, ["Year", "Candidate", "Result"]]

In [None]:
# Extracting all columns using a colon
elections.loc[[87, 25, 179], :]

In [None]:
# Selection by a list and a single-column label
elections.loc[[87, 25, 179], "Popular vote"]

In [None]:
# Note that if we pass "Popular vote" in a list, the output will be a DataFrame
elections.loc[[87, 25, 179], ["Popular vote"]]

In [None]:
# Selection by a row label and a column label
elections.loc[0, "Candidate"]

#### Integer-Based Extraction Using `iloc`

`iloc` selects items by row and column *integer* position.

Arguments to `.iloc` can be:
1. A list.
2. A slice (syntax is exclusive of the right hand side of the slice).
3. A single value.


In [None]:
# Select the rows at positions 1, 2, and 3.
# Select the columns at positions 0, 1, and 2.
# Remember that Python indexing begins at position 0!
elections.iloc[[1, 2, 3], [0, 1, 2]]

In [None]:
# Index-based extraction using a list of rows and a slice of column indices
elections.iloc[[1, 2, 3], 0:3]

In [None]:
# Selecting all rows using a colon
elections.iloc[:, 0:3]

In [None]:
elections.iloc[[1, 2, 3], 1]

In [None]:
# Extracting the value at row 0 and the second column
elections.iloc[0,1]
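The difference between inclusive `loc` slices and exclusive `iloc` slices is easiest to see when the same slice is written both ways. The sketch below uses a made-up four-row `df` with the default integer index, so labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical four-row table with the default index 0, 1, 2, 3
df = pd.DataFrame({
    "Year": [1824, 1824, 1828, 1828],
    "Result": ["loss", "win", "win", "loss"],
})

by_label = df.loc[1:3, "Year"]     # labels 1, 2, 3: right end is included
by_position = df.iloc[1:3, 0]      # positions 1, 2: right end is excluded

print(len(by_label), len(by_position))
```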

#### Context-dependent Extraction using `[]`

We could technically do anything we want using `loc` or `iloc`. However, in practice, the `[]` operator is often used instead to yield more concise code.

`[]` is a bit trickier to understand than `loc` or `iloc`, but it achieves essentially the same functionality. The difference is that `[]` is *context-dependent*.

`[]` only takes one argument, which may be:
1. A slice of row integers.
2. A list of column labels.
3. A single column label.


If we provide a slice of row numbers, we get the numbered rows.

In [None]:
elections[3:7]

If we provide a list of column names, we get the listed columns.

In [None]:
elections[["Year", "Candidate", "Result"]]

And if we provide a single column name we get back just that column, stored as a `Series`.

In [None]:
elections["Candidate"]
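The three behaviors of `[]` can be compared side by side on a small table. The sketch below uses a made-up three-row `df` and checks what kind of object each form returns.

```python
import pandas as pd

# Hypothetical miniature of the elections table
df = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Result": ["loss", "win", "win"],
})

rows = df[1:3]                   # slice: rows at positions 1 and 2
cols = df[["Year", "Result"]]    # list of labels: DataFrame with those columns
one_col = df["Candidate"]        # single label: that column as a Series

print(type(rows).__name__, rows.shape)
print(type(cols).__name__, cols.shape)
print(type(one_col).__name__, one_col.shape)
```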

### Multi-indexed DataFrames

You can also define multiple indexes for the same DataFrame.  This is useful when you need more than one column to specify the granularity of the data.  
For example, if we wanted to use both `Year` and `Party` as our indices we would do this as follows:

In [3]:
elections_multindex = elections.set_index(["Year","Party"])

In [4]:
elections_multindex.head()

Year,Party,Candidate,Popular vote,Result,%
1824,Democratic-Republican,Andrew Jackson,151271,loss,57.210122
1824,Democratic-Republican,John Quincy Adams,113142,win,42.789878
1828,Democratic,Andrew Jackson,642806,win,56.203927
1828,National Republican,John Quincy Adams,500897,loss,43.796073
1832,Democratic,Andrew Jackson,702735,win,54.574789


### Accessing Data in Multi-indexed DataFrames:

To access data, we can use `.loc` with a tuple `(year, party)` as the first entry:


In [6]:
elections_multindex.loc[(1828,"Democratic"),:]



Year,Party,Candidate,Popular vote,Result,%
1828,Democratic,Andrew Jackson,642806,win,56.203927


Notice that the cell above produced a warning (in pandas this is typically a `PerformanceWarning` about indexing past the lexsort depth). It means that the index is not sorted. `pandas` relies on the index being sorted (here lexicographically, since the `Party` level holds strings) for efficient search and retrieval. A quick fix is to sort the DataFrame in advance using `DataFrame.sort_index`. This is especially worthwhile for performance if you plan to run several such queries:

In [7]:
elections_multindex = elections_multindex.sort_index()
elections_multindex.loc[(1828,"Democratic"),:]

Year,Party,Candidate,Popular vote,Result,%
1828,Democratic,Andrew Jackson,642806,win,56.203927
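One way to check whether a multi-index is sorted is the `is_monotonic_increasing` property of the index. The sketch below uses a made-up, deliberately unordered three-row table (not the real `elections` data) and checks the property before and after `sort_index`.

```python
import pandas as pd

# Hypothetical miniature of the elections table, deliberately out of order
df = pd.DataFrame({
    "Year": [1828, 1824, 1828],
    "Party": ["National Republican", "Democratic-Republican", "Democratic"],
    "Candidate": ["John Quincy Adams", "Andrew Jackson", "Andrew Jackson"],
})

multi = df.set_index(["Year", "Party"])
sorted_before = multi.index.is_monotonic_increasing   # index is not sorted yet

multi = multi.sort_index()                            # sort by Year, then Party
sorted_after = multi.index.is_monotonic_increasing

# With a sorted, unique index this lookup returns a single value
candidate = multi.loc[(1828, "Democratic"), "Candidate"]

print(sorted_before, sorted_after, candidate)
```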


### Practice Exercises

In [None]:
example = pd.Series([4, 5, 6], index=["one", "two", "three"])


In [None]:
df = pd.DataFrame({"c1":[1, 2, 3, 4], "c2":[2, 4, 6, 8]})


In [None]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
