# Pandas I: Data Structures

Let’s dive right in by exploring and manipulating real-world data. To do so, we’ll introduce `pandas`, a popular Python library for interacting with tabular data.

`pandas` is generally accepted in the data science community as the industry- and academia-standard tool for manipulating tabular data.

## Tabular Data

Data scientists work with data stored in a variety of formats. The primary focus of this class is on understanding _tabular data_: data that is stored in a table.

Tabular data is one of the most common systems that data scientists use to organize data. This is in large part due to the simplicity and flexibility of tables. Tables allow us to represent each **observation**, or instance of collecting data from an individual, as its own row. We can record distinct characteristics, or **features**, of each observation in separate columns.

To see this in action, we’ll explore the `elections` dataset, which stores information about political candidates who ran for president of the United States in various years.

This dataset is stored in **Comma Separated Values** (CSV) format. Because of their simplicity and readability, CSV files are one of the most common ways to store tabular data. Each line in a CSV file (file extension: `.csv`) represents a row in the table. In other words, rows are separated by the newline character `\n`. Within each row, columns are separated by a comma `,`, hence the name Comma Separated Values.

The first few rows of the `elections` dataset in CSV format are as follows:

```{code} csv
Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204
1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796
1828,Andrew Jackson,Democratic,642806,win,56.20392707
1828,John Quincy Adams,National Republican,500897,loss,43.79607293
1832,Andrew Jackson,Democratic,702735,win,54.57478905
```
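To make the row and column separators concrete, here is a minimal sketch that splits the first few lines of the listing above by hand. The variable names (`csv_text`, `header`, `rows`) are made up for illustration, and this naive splitting is only an assumption that works for simple files like this one.

```python
# The first rows of the elections CSV, copied from the listing above
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204
1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796
1828,Andrew Jackson,Democratic,642806,win,56.20392707"""

# Rows are separated by newlines; fields within a row are separated by commas
lines = csv_text.split("\n")
header = lines[0].split(",")
rows = [line.split(",") for line in lines[1:]]

print(header)   # the column names
print(rows[0])  # the first data row; every field is still a string
```

Splitting on commas breaks down as soon as a field itself contains a comma inside quotes, which is one reason to rely on a real CSV parser such as `pd.read_csv` in practice.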

To begin our studies in `pandas`, we must first import the library into our Python environment. This will allow us to use `pandas` data structures and methods in our code.

CSV files can be read into `pandas` using `read_csv`. The following code cell imports `pandas` as `pd`, the conventional alias for the library, and then reads the `elections.csv` file.

In [1]:
# `pd` is the conventional alias for Pandas
import pandas as pd

url = "https://raw.githubusercontent.com/fahadsultan/datascience_ml/main/data/elections.csv"
pd.read_csv(url)

|     | Year | Candidate | Party | Popular vote | Result | % |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1824 | Andrew Jackson | Democratic-Republican | 151271 | loss | 57.210122 |
| 1 | 1824 | John Quincy Adams | Democratic-Republican | 113142 | win | 42.789878 |
| 2 | 1828 | Andrew Jackson | Democratic | 642806 | win | 56.203927 |
| 3 | 1828 | John Quincy Adams | National Republican | 500897 | loss | 43.796073 |
| 4 | 1832 | Andrew Jackson | Democratic | 702735 | win | 54.574789 |
| ... | ... | ... | ... | ... | ... | ... |
| 177 | 2016 | Jill Stein | Green | 1457226 | loss | 1.073699 |
| 178 | 2020 | Joseph Biden | Democratic | 81268924 | win | 51.311515 |
| 179 | 2020 | Donald Trump | Republican | 74216154 | loss | 46.858542 |
| 180 | 2020 | Jo Jorgensen | Libertarian | 1865724 | loss | 1.177979 |


In the `elections` dataset, each row represents one instance of a candidate running for president in a particular year. For example, the first row represents Andrew Jackson running for president in the year 1824. Each column represents one characteristic piece of information about each presidential candidate. For example, the column named `Result` stores whether or not the candidate won the election.
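The row and column reading described above can be checked on a small scale. The sketch below parses three rows copied from the CSV listing (rather than the full file) from an in-memory string; the name `sample` is made up for illustration, and `.iloc[0]` and `sample["Result"]` are used here only as a quick way to pull out one row and one column.

```python
import io
import pandas as pd

# Three rows copied from the CSV listing shown earlier
csv_text = """Year,Candidate,Party,Popular vote,Result,%
1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204
1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796
1828,Andrew Jackson,Democratic,642806,win,56.20392707
"""

# read_csv accepts a file-like object as well as a path or URL
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)      # (number of rows, number of columns)
print(sample.iloc[0])    # the first row: one candidate in one year
print(sample["Result"])  # one column: the same feature for every row
```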


## DataFrames, Series and Indices

There are three fundamental data structures in `pandas`:

1. **Series**: 1D labeled array data; best thought of as columnar data
2. **DataFrame**: 2D tabular data with rows and columns
3. **Index**: A sequence of row/column labels

DataFrames, Series, and Indices can be represented visually in the following diagram, which considers the first few rows of the `elections` dataset.

```{figure} https://fahadsultan.com/datascience_ml/_images/data_structure.png
---
width: 100%
align: center
---
Three fundamental `pandas` data structures: **Series**, **DataFrame**, **Index**
``` 

Notice how the **DataFrame** is a two-dimensional object: it contains both rows and columns. The **Series** above is a single column of this **DataFrame**, namely the `Result` column. Both contain an **Index**, or a shared list of row labels (here, the integers from 0 to 4, inclusive).

```{figure} https://raw.githubusercontent.com/fahadsultan/datascience_ml/main/assets/DataFrameSeries.png
---
width: 100%
align: center
---
Schematic of a `pandas` **DataFrame** and  **Series**
``` 
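As a rough illustration of how the three structures relate, the sketch below builds a tiny hand-typed excerpt of the elections data (the variable names `df` and `result` are made up) and inspects the type of each piece.

```python
import pandas as pd

# A small hand-typed excerpt of the elections data (illustrative only)
df = pd.DataFrame({
    "Year": [1824, 1824, 1828],
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Result": ["loss", "win", "win"],
})

result = df["Result"]  # a single column of the DataFrame

print(type(df).__name__)               # DataFrame
print(type(result).__name__)           # Series
print(isinstance(df.index, pd.Index))  # True: the row labels form an Index
print(isinstance(df.columns, pd.Index))  # True: so do the column labels
print(result.index.equals(df.index))   # True: the column shares the row labels
```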


## Series

A Series represents a column of a DataFrame; more generally, it can be any 1-dimensional array-like object containing values of the same type with associated data labels, called its index. In the cell below, we create a Series named `s`.

In [2]:
s = pd.Series([-1, 10, 2])
s

0    -1
1    10
2     2
dtype: int64

In [3]:
s.values # Data contained within the Series

array([-1, 10,  2])

In [4]:
s.index # The Index of the Series

RangeIndex(start=0, stop=3, step=1)

By default, the Index of a Series is a sequential list of integers beginning from 0. 

Optionally, a manually specified list of index labels can be passed to the `index` argument.



In [5]:
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s

a    -1
b    10
c     2
dtype: int64

Indices can also be changed after initialization.

In [6]:
s.index = ["first", "second", "third"]
s

first     -1
second    10
third      2
dtype: int64
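One practical consequence of these labels is that values can be looked up by label, and one consequence of the "same type" rule is that mixed numeric input gets converted to a common type. The following is a small sketch of both; the Series `mixed` is a made-up example.

```python
import pandas as pd

s = pd.Series([-1, 10, 2], index=["first", "second", "third"])

print(s["second"])    # 10: the value stored under the label "second"
print(list(s.index))  # ['first', 'second', 'third']
print(s.dtype)        # int64

# Integers and floats together are stored as a single common type
mixed = pd.Series([1, 2.5])
print(mixed.dtype)    # float64
```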

## DataFrame

With our new understanding of `pandas` in hand, let’s return to the `elections` dataset from before. Now, we recognize that it is represented as a `pandas` DataFrame.

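In [7]:
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
import pandas as pd

# Read from the same URL used earlier; a local path such as "data/elections.csv" works the same way
url = "https://raw.githubusercontent.com/fahadsultan/datascience_ml/main/data/elections.csv"
elections = pd.read_csv(url)

elections

The output is the same table displayed earlier.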


Let’s dissect the code above.

1. We first import the `pandas` library into our Python environment, using the alias `pd`.
 `import pandas as pd`

2. There are a number of ways to read data into a DataFrame. In this course, our datasets are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a DataFrame by passing its path (or URL) as an argument to the following `pandas` function.
 `pd.read_csv("data/elections.csv")`

This code stores our DataFrame object in the `elections` variable. Our `elections` DataFrame has 182 rows and 6 columns (`Year`, `Candidate`, `Party`, `Popular vote`, `Result`, `%`). Each row represents a single record: in our example, a presidential candidate from a particular year. Each column represents a single attribute, or feature, of the record.

In the example above, we constructed a DataFrame object using data from a CSV file. As we’ll explore in the next section, we can also create a DataFrame with data of our own.


```{figure} https://fahadsultan.com/datascience_ml/_images/df_cols.png
---
width: 100%
align: center
---
Each column of a `pandas` **DataFrame** `df` is a **Series** `s` where `s.index == df.index`
``` 


```{figure} https://fahadsultan.com/datascience_ml/_images/df_rows.png
---
width: 100%
align: center
---
Each row of a `pandas` **DataFrame** `df` is a **Series** `s` where `s.index == df.columns`
``` 
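The two captions above can be verified directly. This is a minimal sketch on a made-up two-by-two DataFrame; the row labels `x` and `y` are arbitrary, and `.loc` is used here only to pull out a single row by its label.

```python
import pandas as pd

# A tiny made-up DataFrame with explicit row labels
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

col = df["A"]      # a column, returned as a Series
row = df.loc["x"]  # a row, also returned as a Series

print(col.index.equals(df.index))    # True: a column is labeled by the row labels
print(row.index.equals(df.columns))  # True: a row is labeled by the column names
```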


### Creating a DataFrame

There are many ways to create a DataFrame. Here, we will cover the most popular approaches.

1. Using a list and column names
2. From a dictionary
3. From a Series

#### Using a List and Column Names

Consider the following examples. The first code cell creates a DataFrame with a single column `Numbers`. The second creates a DataFrame with the columns `Number` and `Description`. Notice how a 2D list of values is required to initialize the second DataFrame: each nested list represents a single row of data.

In [13]:
df_list_1 = pd.DataFrame([1, 2, 3], columns=["Numbers"])
df_list_1

|     | Numbers |
| --- | --- |
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


In [14]:
df_list_2 = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list_2

|     | Number | Description |
| --- | --- | --- |
| 0 | 1 | one |
| 1 | 2 | two |


#### From a Dictionary

A second (and more common) way to create a DataFrame is with a dictionary. The dictionary keys represent the column names, and the dictionary values represent the column values.

In [15]:
df_dict = pd.DataFrame({"Fruit": ["Strawberry", "Orange"], "Price": [5.49, 3.99]})
df_dict

|     | Fruit | Price |
| --- | --- | --- |
| 0 | Strawberry | 5.49 |
| 1 | Orange | 3.99 |
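The list-based and dictionary-based constructors describe the same table from two directions: row by row versus column by column. As a sketch (the names `from_rows` and `from_dict` are made up), both calls below should produce the same columns and values.

```python
import pandas as pd

# Row-oriented: each nested list is one row
from_rows = pd.DataFrame(
    [["Strawberry", 5.49], ["Orange", 3.99]],
    columns=["Fruit", "Price"],
)

# Column-oriented: each dictionary value is one column
from_dict = pd.DataFrame({"Fruit": ["Strawberry", "Orange"], "Price": [5.49, 3.99]})

# Compare the contents column by column
print(from_rows.to_dict("list") == from_dict.to_dict("list"))  # True
```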


#### From a Series

Earlier, we noted that a Series is usually thought of as a column in a DataFrame. It follows then, that a DataFrame is equivalent to a collection of Series, which all share the same index.

In fact, we can initialize a DataFrame by combining two or more Series.

In [16]:
# Notice how our indices, or row labels, are the same

s_a = pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
s_b = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"])

pd.DataFrame({"A-column": s_a, "B-column": s_b})

|     | A-column | B-column |
| --- | --- | --- |
| r1 | a1 | b1 |
| r2 | a2 | b2 |
| r3 | a3 | b3 |
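The example above uses Series with identical indices. As a hedged sketch of what happens otherwise, the made-up Series below only partly overlap: `pandas` lines the values up by label and leaves a missing value wherever a Series has no entry for a row.

```python
import pandas as pd

# Two Series whose row labels only partly overlap (made-up data)
s_a = pd.Series(["a1", "a2"], index=["r1", "r2"])
s_b = pd.Series(["b2", "b3"], index=["r2", "r3"])

df = pd.DataFrame({"A-column": s_a, "B-column": s_b})

print(df)              # values are matched by label, not by position
print(list(df.index))  # ['r1', 'r2', 'r3']
print(df.isna().sum().sum())  # 2: one gap in each column
```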


## Indices

The major takeaway: we can think of a **DataFrame** as a collection of **Series** that all share the same **Index**.

On a more technical note, the labels in an Index don’t have to be integers, nor do they have to be unique. For example, we can set the index of the `elections` DataFrame to be the names of the presidential candidates.

In [19]:
# This sets the index to be the "Candidate" column
elections.set_index("Candidate", inplace=True)
elections.index

Index(['Andrew Jackson', 'John Quincy Adams', 'Andrew Jackson',
       'John Quincy Adams', 'Andrew Jackson', 'Henry Clay', 'William Wirt',
       'Hugh Lawson White', 'Martin Van Buren', 'William Henry Harrison',
       ...
       'Darrell Castle', 'Donald Trump', 'Evan McMullin', 'Gary Johnson',
       'Hillary Clinton', 'Jill Stein', 'Joseph Biden', 'Donald Trump',
       'Jo Jorgensen', 'Howard Hawkins'],
      dtype='object', name='Candidate', length=182)

And, if we’d like, we can revert the index back to the default list of integers.

In [20]:
# This resets the index to be the default list of integers
elections.reset_index(inplace=True) 
elections.index

RangeIndex(start=0, stop=182, step=1)
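The same round trip can be sketched on a small hand-typed excerpt without modifying the original object. Here `set_index` and `reset_index` are called without `inplace=True`, so each returns a new DataFrame; the names `by_name` and `restored` are made up for illustration.

```python
import pandas as pd

# A small hand-typed excerpt (illustrative only)
df = pd.DataFrame({
    "Candidate": ["Andrew Jackson", "John Quincy Adams", "Andrew Jackson"],
    "Year": [1824, 1824, 1828],
})

by_name = df.set_index("Candidate")   # returns a new DataFrame; df is unchanged
print(by_name.index.is_unique)        # False: "Andrew Jackson" appears twice
print(by_name.loc["Andrew Jackson"])  # a repeated label selects every matching row

restored = by_name.reset_index()      # back to the default integer labels
print(list(restored.columns))         # ['Candidate', 'Year']
```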

## Arithmetic and Data Alignment

`pandas` makes it much simpler to work with objects that have different indexes. For example, when you add two objects whose indexes do not match, the index of the result is the union of the two indexes. Let’s look at an example:



In [5]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [4]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [6]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t overlap. Missing values will then propagate in further arithmetic computations.
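The sketch below illustrates that propagation using the same `s1` and `s2`, and shows one possible alternative: the `add` method with `fill_value=0`, which treats a label missing from one side as 0 before adding. Whether 0 is a sensible stand-in depends on the data, so this is an assumption rather than a recommendation.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

total = s1 + s2
doubled = total * 2                # NaN stays NaN at "d", "f", and "g"
print(doubled.isna().sum())        # 3

filled = s1.add(s2, fill_value=0)  # a label missing on one side counts as 0
print(filled.isna().sum())         # 0
print(filled["d"], filled["f"])    # 3.4 4.0
```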

In the case of DataFrames, alignment is performed on both rows and columns:

In [11]:
df1 = pd.DataFrame({"A": [1, 2], "B":[3, 4]})
df1

|     | A | B |
| --- | --- | --- |
| 0 | 1 | 3 |
| 1 | 2 | 4 |


In [12]:
df2 = pd.DataFrame({"B": [5, 6], "D":[7, 8]})
df2

|     | B | D |
| --- | --- | --- |
| 0 | 5 | 7 |
| 1 | 6 | 8 |


In [13]:
df1 + df2

|     | A | B | D |
| --- | --- | --- | --- |
| 0 | NaN | 8 | NaN |
| 1 | NaN | 10 | NaN |
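In the example above the two DataFrames share the same row labels, so only the column alignment is visible. The following sketch uses made-up row labels that only partly overlap, so that alignment on both axes shows up in the result.

```python
import pandas as pd

# Same columns as above, but the row labels now only overlap at 1
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({"B": [5, 6], "D": [7, 8]}, index=[1, 2])

total = df1 + df2
print(total)

# Only row 1, column "B" exists in both inputs: 4 + 5
print(total.loc[1, "B"])        # 9.0
print(total.isna().sum().sum()) # 8: every other cell is missing
```

The result has the union of the row labels (0, 1, 2) and the union of the column labels (`A`, `B`, `D`), with a value only where both inputs had one.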
