<a href="https://colab.research.google.com/github/csaatechnicalarts/ML_Bootcamp/blob/main/Pandas_03_DataFrameIntro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 03 Pandas Data Structure: Data Frame

## Content Outline

* Introduction
* Relationship between data frames and series
* Taking slices of rows and columns
* Filtering rows using boolean logic
* Vector operations on data frames
* Handling date and time types
* Quick note about NumPy



## Introduction
In Pandas, a *data frame* is a two-dimensional structure analogous to a table of rows and columns. The header of each data frame column is known as the *column* label, while the header of each data frame row is known as an *index* label.

In [171]:
import pandas as pd

In [172]:
df = pd.read_csv("sample_data/ww2_leaders.csv")
print(f"Type of df: {type(df)}\n")
df

Type of df: <class 'pandas.core.frame.DataFrame'>



Unnamed: 0,Name,Born,Died,Age,Title,Country
0,Franklin Roosevelt,1882-01-30,1945-04-12,63,President,United States
1,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
5,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister,United Kingdom
6,Manuel Camacho,1897-04-24,1955-10-13,58,President,Mexico
7,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister,South Africa
8,Ibn Saud,1875-01-15,1953-11-09,78,King,Saudi Arabia
9,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister,Thailand


In our input file of heads of state during World War 2, **["Name", "Born", "Died", "Age", "Title", "Country"]** make up the list of data frame columns. By default Pandas supplies the index, labeling the rows with integers from zero onwards.
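
To make the two kinds of header concrete, here is a minimal sketch on a small hand-built frame. The name `tiny_df` and its three rows are a stand-in assumed for illustration, not the CSV file itself. It lists the column labels and the default row labels.

```python
import pandas as pd

# Hypothetical three-row stand-in for the leaders file.
tiny_df = pd.DataFrame({
    "Name": ["Franklin Roosevelt", "Joseph Stalin", "Winston Churchill"],
    "Age": [63, 74, 90],
})

column_labels = list(tiny_df.columns)  # the column headers
row_labels = list(tiny_df.index)       # the default integer row labels

print(column_labels)
print(row_labels)
```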

Pandas allows us to isolate specific rows and columns.

In [173]:
df['Name']

Unnamed: 0,Name
0,Franklin Roosevelt
1,Joseph Stalin
2,Adolph Hitler
3,Michinomiya Hirohito
4,Charles de Gaulle
5,Winston Churchill
6,Manuel Camacho
7,Jan Smuts
8,Ibn Saud
9,Plaek Phibunsongkhram


In [174]:
df.loc[10]

Unnamed: 0,10
Name,John Curtin
Born,1885-01-08
Died,1945-07-05
Age,60
Title,Prime Minister
Country,Australia


## Relationship Between Data Frames and Series
In most Pandas tutorials, an individual column in a data frame is described as a Pandas *series*. But we have to be more flexible with that definition, as Pandas reports both the type of a row and the type of a column as *series*. It's probably best to think of a series as a vector of data (typically heterogeneous in a row, homogeneous in a column) with an attached set of labels: a row series is labeled by the column names, and a column series is labeled by the row index.

In [175]:
print(type(df['Name']))

<class 'pandas.core.series.Series'>


In [176]:
print(type(df.loc[0]))

<class 'pandas.core.series.Series'>
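
The difference between a row series and a column series shows up in their *dtype* and in their labels. The following is a small sketch on a hand-built frame (`tiny_df` is an assumed stand-in for the leaders data), not output from the notebook above.

```python
import pandas as pd

tiny_df = pd.DataFrame({
    "Name": ["Franklin Roosevelt", "Joseph Stalin"],
    "Age": [63, 74],
})

age_column = tiny_df["Age"]  # a column: every value shares one dtype
first_row = tiny_df.loc[0]   # a row: mixed values, so the dtype is object

print(age_column.dtype)       # an integer dtype
print(first_row.dtype)        # object
print(list(first_row.index))  # a row series is labeled by the column names
```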


We can create a Pandas data frame from a Python dictionary. Conceptually, each dictionary key becomes a data frame column label and each dictionary value becomes the collection of values for that column. In practice, we use a Pandas series to provision a column, with its data and (optionally) the index labels that come with it.

In [177]:
leader_name_series = pd.Series(["Franklin Roosevelt", "Joseph Stalin", "Adolph Hitler"])
leader_age_series = pd.Series([63, 74, 56])
leader_country_series = pd.Series(["United States", "Soviet Union", "Germany"])
leader_df = pd.DataFrame({"Name": leader_name_series, "Age": leader_age_series, "Country": leader_country_series})
leader_df

Unnamed: 0,Name,Age,Country
0,Franklin Roosevelt,63,United States
1,Joseph Stalin,74,Soviet Union
2,Adolph Hitler,56,Germany


In [178]:
print(f"Type of leader_df: {type(leader_df)}")

Type of leader_df: <class 'pandas.core.frame.DataFrame'>
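
The dictionary values do not have to be series. As a minimal sketch, plain Python lists also work, and row labels can be passed separately through the `index` argument. The labels `"us"`, `"su"` and `"de"` are made up for this illustration.

```python
import pandas as pd

leader_dict = {
    "Name": ["Franklin Roosevelt", "Joseph Stalin", "Adolph Hitler"],
    "Age": [63, 74, 56],
}

# Hypothetical row labels, supplied once for the whole frame.
leader_list_df = pd.DataFrame(leader_dict, index=["us", "su", "de"])

print(leader_list_df)
print(leader_list_df.loc["su", "Age"])
```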


Pandas handles the case where the columnar series do not line up. In our example below, the two series share index labels, but **col02** has no entries for **row01** and **row04**. When creating the data frame, Pandas aligns the series on their labels and makes up for the gaps by slotting **NaN** into the missing cells of **col02**. Because **NaN** is a floating-point value, the integers in **col02** are shown as floats.

In [179]:
col01_series = pd.Series(["aaa", "bbb", "ccc", "ddd", "eee"], index = ["row01", "row02", "row03", "row04", "row05"])
col02_series = pd.Series([100, 200, 300], index = ["row02", "row03", "row05"])
string_int_df = pd.DataFrame({"col01": col01_series, "col02": col02_series})
string_int_df

Unnamed: 0,col01,col02
row01,aaa,
row02,bbb,100.0
row03,ccc,200.0
row04,ddd,
row05,eee,300.0
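
Once **NaN** values are in a frame, we usually want to find or drop them. The sketch below uses a shortened version of the example (the variable names are assumptions for illustration) to show `isna()` and `dropna()`, along with the float *dtype* that the missing values force on the integer column.

```python
import pandas as pd

col01 = pd.Series(["aaa", "bbb", "ccc"], index=["row01", "row02", "row03"])
col02 = pd.Series([100, 300], index=["row01", "row03"])
aligned_df = pd.DataFrame({"col01": col01, "col02": col02})

missing_mask = aligned_df["col02"].isna()  # True where no value was supplied
complete_rows = aligned_df.dropna()        # keep only fully populated rows

print(aligned_df["col02"].dtype)  # float64, because NaN is a float
print(missing_mask.tolist())
print(list(complete_rows.index))
```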


## Taking Slices of Rows and Columns
Let's head back to our World War 2 leaders data frame.

---



In [180]:
df = pd.read_csv("sample_data/ww2_leaders.csv")
print(f"Type of df: {type(df)}\n")
df

Type of df: <class 'pandas.core.frame.DataFrame'>



Unnamed: 0,Name,Born,Died,Age,Title,Country
0,Franklin Roosevelt,1882-01-30,1945-04-12,63,President,United States
1,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
5,Winston Churchill,1874-11-30,1965-01-24,90,Prime Minister,United Kingdom
6,Manuel Camacho,1897-04-24,1955-10-13,58,President,Mexico
7,Jan Smuts,1870-05-24,1950-09-11,80,Prime Minister,South Africa
8,Ibn Saud,1875-01-15,1953-11-09,78,King,Saudi Arabia
9,Plaek Phibunsongkhram,1897-07-14,1965-06-11,66,Prime Minister,Thailand


Using the column labels, we can take a series, a single column of the data frame.

In [181]:
df['Title']

Unnamed: 0,Title
0,President
1,Great Leader
2,Fuhrer
3,Emperor
4,President
5,Prime Minister
6,President
7,Prime Minister
8,King
9,Prime Minister
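
Passing a list of labels instead of a single label selects several columns at once, and the result is a data frame rather than a series. A minimal sketch on an assumed stand-in frame:

```python
import pandas as pd

tiny_df = pd.DataFrame({
    "Name": ["Franklin Roosevelt", "Winston Churchill"],
    "Age": [63, 90],
    "Title": ["President", "Prime Minister"],
})

title_only = tiny_df["Title"]            # single label -> Series
name_title = tiny_df[["Name", "Title"]]  # list of labels -> DataFrame

print(type(title_only))
print(type(name_title))
```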


We can also take slices (ranges) of data frame rows using the *loc[ ]* or *iloc[ ]* operators. What is the difference between the two?

Every row in a data frame can be addressed in two ways: by its zero-based position and by its index label. If we supply labels when the data frame is created, the rows carry those labels in addition to their positions. If we skip the labels, Pandas generates a default index whose labels are the integers 0, 1, 2 and so on, so that labels and positions coincide.

The *loc[ ]* operator is easier to grasp but no less idiosyncratic. It isolates rows based on labels and **includes the last element of the range passed in it**, which is very unlike ordinary Python slicing. Meanwhile the *iloc[ ]* operator relies on the zero-based positions and, like ordinary Python slicing, does not include the last element of the range passed in it.
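
The inclusive versus exclusive behavior is easiest to see when the labels are not integers. The sketch below builds a small frame with made-up string labels (an assumption for illustration) and takes the "same" range both ways.

```python
import pandas as pd

# Hypothetical string labels so that labels and positions differ.
labeled_df = pd.DataFrame(
    {"Age": [63, 74, 56, 87]},
    index=["fdr", "stalin", "hitler", "hirohito"],
)

by_label = labeled_df.loc["stalin":"hitler"]  # end label is included
by_position = labeled_df.iloc[1:2]            # end position is excluded

print(list(by_label.index))
print(list(by_position.index))
```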

In [182]:
df.loc[1]

Unnamed: 0,1
Name,Joseph Stalin
Born,1878-12-06
Died,1953-03-05
Age,74
Title,Great Leader
Country,Soviet Union


In [184]:
df.loc['1':'1']

Unnamed: 0,Name,Born,Died,Age,Title,Country
1,Joseph Stalin,1878-12-06,1953-03-05,74,Great Leader,Soviet Union


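Because our data frame of World War 2 leaders did not come with row labels, Pandas supplied default labels that match the zero-based positions. Note that these default labels are integers, not strings, which is why **loc['1']** results in a *KeyError* exception: no row is labeled with the string '1'. The slice **loc['1':'1']** returned the row in this notebook, but matching the label type, as in **loc[1:1]**, is the more dependable form.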
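
As a small sketch of how the label type matters, the following uses an assumed stand-in frame with the default integer index. It looks up a row with integer labels and then shows that a string label is rejected.

```python
import pandas as pd

tiny_df = pd.DataFrame(
    {"Name": ["Franklin Roosevelt", "Joseph Stalin", "Adolph Hitler"]}
)

one_row_frame = tiny_df.loc[1:1]  # integer label slice -> one-row DataFrame
one_row_series = tiny_df.loc[1]   # single integer label -> Series

try:
    tiny_df.loc["1"]              # no row is labeled with the string "1"
    string_lookup_failed = False
except KeyError:
    string_lookup_failed = True

print(one_row_frame)
print(string_lookup_failed)
```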

As the following examples show, with the *iloc[ ]* operator we're on more familiar ground. Its syntax for isolating a data frame row and for taking ranges of rows is closer to canonical Python.

In [185]:
df.iloc[2]

Unnamed: 0,2
Name,Adolph Hitler
Born,1889-04-20
Died,1945-04-30
Age,56
Title,Fuhrer
Country,Germany


In [186]:
df.iloc[2:5]

Unnamed: 0,Name,Born,Died,Age,Title,Country
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France


In [188]:
df.iloc[4:1:-1]

Unnamed: 0,Name,Born,Died,Age,Title,Country
4,Charles de Gaulle,1890-11-22,1970-11-09,79,President,France
3,Michinomiya Hirohito,1901-04-29,1989-01-07,87,Emperor,Japan
2,Adolph Hitler,1889-04-20,1945-04-30,56,Fuhrer,Germany


## Filtering Rows Using Boolean Logic
We can pick a column to filter the data frame rows. For example, say we want to isolate the rows of leaders whose age is less than or equal to the average age for the whole data set. We again use the **[ ]** operator, this time passing a filtering expression inside it.

In [None]:
df.describe()

In [None]:
df[df['Age'] <= df['Age'].mean()]

Behind the scenes the statement **df['Age'] <= df['Age'].mean()** evaluates to a series of boolean values, one per row, which Pandas uses to keep only the rows marked *True*.
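
To see the intermediate boolean series, it can help to store it in a variable before filtering. This is a minimal sketch on an assumed four-row stand-in frame, not the full data set.

```python
import pandas as pd

tiny_df = pd.DataFrame({
    "Name": ["Franklin Roosevelt", "Joseph Stalin",
             "Adolph Hitler", "Winston Churchill"],
    "Age": [63, 74, 56, 90],
})

mean_age = tiny_df["Age"].mean()       # 70.75 for these four rows
age_mask = tiny_df["Age"] <= mean_age  # boolean Series, one value per row
younger = tiny_df[age_mask]            # keeps only the rows marked True

print(age_mask.tolist())
print(younger["Name"].tolist())
```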

We can manually create the vector of booleans ourselves to filter the data frame without referencing any column.

In [None]:
potentates = [False, True, True, True, False, False, False, False, True, False, False, True]
df[potentates]

In [None]:
# Invert every flag to select the remaining rows.
democratic_leaders = [not flag for flag in potentates]
df[democratic_leaders]
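
When the mask is a Pandas series rather than a Python list, the `~` operator inverts it in one step. The sketch below assumes a small stand-in frame and a mask derived from the **Title** column.

```python
import pandas as pd

tiny_df = pd.DataFrame({
    "Name": ["Franklin Roosevelt", "Joseph Stalin", "Winston Churchill"],
    "Title": ["President", "Great Leader", "Prime Minister"],
})

is_president = tiny_df["Title"] == "President"  # boolean Series
not_president = ~is_president                   # element-wise negation

print(tiny_df[not_president]["Name"].tolist())
```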

## Vector Operations on Data Frames
We can also perform vectorized operations on data frames. In the following example, multiplying by two doubles the value in every numeric cell. String cells follow the Python rule for multiplying a string by an integer, so each string is repeated, that is, concatenated with itself.

In [None]:
df * 2
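
Here is a small sketch of the same operation on an assumed two-row frame, so the effect on each kind of column is easy to check. The **Name** column is created with `dtype=object` so that it holds plain Python strings, as in the frame loaded above.

```python
import pandas as pd

tiny_df = pd.DataFrame({
    "Name": pd.Series(["Ibn Saud", "Jan Smuts"], dtype=object),
    "Age": [78, 80],
})

doubled = tiny_df * 2          # applied to every cell
older = tiny_df["Age"] + 1     # the same idea on a single column

print(doubled)
print(older.tolist())
```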

## Handling Date and Time Types
Our information about World War 2 leaders presents a good opportunity to discuss date and time types in Pandas. Right now the **Born** and **Died** columns are of type *object*, which here means they hold plain strings. To carry out date arithmetic on them, we first need to convert them to *Timestamp* values. We accomplish this using the Pandas *to_datetime()* function.

In [None]:
born_dt = pd.to_datetime(df.Born, format = '%Y-%m-%d')
print(f"Type of born_dt: {type(born_dt)}\nType of each born_dt element: {type(born_dt[0])}")

The *to_datetime()* function returns a series of time stamps. Its *dtype* is *datetime64*, which allows us to apply arithmetic operations on these time stamps.

In [None]:
born_dt
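
With a *datetime64* series, the `.dt` accessor exposes calendar fields, and subtracting two time stamps yields a *Timedelta*. A minimal sketch on two assumed dates taken from the table above:

```python
import pandas as pd

born = pd.to_datetime(
    pd.Series(["1882-01-30", "1874-11-30"]), format="%Y-%m-%d"
)

birth_years = born.dt.year  # calendar year of each time stamp
gap = born[0] - born[1]     # difference between two time stamps

print(birth_years.tolist())
print(type(gap))
```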

Up till now we haven't shown how to update data frames. Let's now recalculate the age of each leader and append the result as a new column to our data frame. To preserve our original data frame, let's make a distinct copy before we proceed.

In [None]:
df_copy = df.copy()
df_copy['Born_DT'] = born_dt
died_dt = pd.to_datetime(df.Died, format = '%Y-%m-%d')
df_copy['Died_DT'] = died_dt
df_copy

Just as with adding new entries to a Python dictionary, we've now extended our data frame with two new columns of *Timestamp* values. From here we can calculate each leader's age and store it in a third new column, first as days, then converted to years by dividing by 365 days (an approximation that ignores leap days).

In [None]:
age_dt = died_dt - born_dt
print(age_dt)

In [None]:
df_copy["Age_DT"] = age_dt // pd.Timedelta('365 days')
df_copy
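
Dividing by 365 days can overshoot by a year when the extra leap days add up. As a hedged alternative, the sketch below compares calendar fields instead. The first pair of dates is invented to show a case where the two methods differ, and the variable names are assumptions for illustration.

```python
import pandas as pd

# First row is a made-up example; second row is Roosevelt's dates.
born = pd.to_datetime(pd.Series(["1900-01-01", "1882-01-30"]))
died = pd.to_datetime(pd.Series(["1980-12-25", "1945-04-12"]))

# Approach used above: whole multiples of 365 days.
approx_years = (died - born) // pd.Timedelta("365 days")

# Calendar approach: subtract one if the final birthday was not reached.
before_birthday = (died.dt.month < born.dt.month) | (
    (died.dt.month == born.dt.month) & (died.dt.day < born.dt.day)
)
calendar_years = died.dt.year - born.dt.year - before_birthday.astype(int)

print(approx_years.tolist())
print(calendar_years.tolist())
```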

## Quick Note About NumPy
Finally, we can create Pandas data frames from a NumPy array.

In [None]:
import numpy as np
np_data = np.random.random((4,5))
df = pd.DataFrame(np_data, index = ['Sample_01', 'Sample_02', 'Sample_03', 'Sample_04'], columns = np.arange(1,6))
df
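
The conversion also works in the other direction. As a small sketch with deterministic values (the names `np_grid` and `grid_df` are assumptions for illustration), `to_numpy()` returns the frame's values as a NumPy array.

```python
import numpy as np
import pandas as pd

np_grid = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
grid_df = pd.DataFrame(
    np_grid, index=["Sample_01", "Sample_02"], columns=["a", "b", "c"]
)

round_trip = grid_df.to_numpy()       # back to a plain NumPy array

print(grid_df)
print(round_trip)
```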