# Data Manipulation with Pandas
This week, we will cover basic data manipulation using Pandas.
Pandas is an open-source data analysis and manipulation tool built on top of the Python programming language.

Since we've covered the fundamentals of Python, it will be fairly easy to pick up Pandas.



## Table of Contents
- Connecting to Google Drive
- Importing data
- Getting info about the dataset
- Removing NaN (None) values
- Selecting columns to work with
- Filtering dataset based on criteria
- Aggregation functions

## Connecting to Your Google Drive


In [None]:
# Start by mounting Google Drive in Google Colab

from google.colab import drive

drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [None]:
!ls "/content/gdrive/My Drive/DigitalHistory"

'List Operations.ipynb'
'Meeting Minutes'
'String Operations.ipynb'
 tmp
'Topics For Later Sections.gdoc'
 Week_1
 Week_2
 Week_3
 Week_4
 Week8-PROJECT-Analyze-Trans-Atlantic-Slave-Trade.ipynb


In [None]:
cd "/content/gdrive/My Drive/DigitalHistory/Week_2"


/content/gdrive/.shortcut-targets-by-id/1m-IVNIRZmHM3YwHOGFHd_tUQuI298irS/DigitalHistory/Week_2


In [None]:
ls

[0m[01;34mdata[0m/  Week2-Introduction-to-Data.ipynb


## Import Libraries and unpack file



In [None]:
import pandas as pd
import zipfile


In [None]:
file_location = 'data/titanic-dataset.zip'
# file_location = 'tmp/trans-atlantic-slave-trade'

# the with-block closes the archive automatically, even if extraction fails
with zipfile.ZipFile(file_location, 'r') as zip_ref:
    zip_ref.extractall('data')

## Load file

In [None]:
df = pd.read_csv('data/titanic-dataset.csv')

print(df)

     PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0              1         0       3  ...   7.2500   NaN         S
1              2         1       1  ...  71.2833   C85         C
2              3         1       3  ...   7.9250   NaN         S
3              4         1       1  ...  53.1000  C123         S
4              5         0       3  ...   8.0500   NaN         S
..           ...       ...     ...  ...      ...   ...       ...
886          887         0       2  ...  13.0000   NaN         S
887          888         1       1  ...  30.0000   B42         S
888          889         0       3  ...  23.4500   NaN         S
889          890         1       1  ...  30.0000  C148         C
890          891         0       3  ...   7.7500   NaN         Q

[891 rows x 12 columns]


Now the dataset is loaded into a DataFrame named 'df'.
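
As a side note, read_csv can usually read a zipped CSV directly, without extracting it first, as long as the archive contains exactly one data file. The sketch below builds a tiny stand-in archive so that it runs anywhere; the file name sample.zip and its two rows are made up for illustration and are not part of the course data.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a tiny stand-in archive (hypothetical file names and rows).
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "sample.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("sample.csv", "PassengerId,Survived\n1,0\n2,1\n")

# read_csv infers zip compression from the ".zip" extension.
# This only works when the archive holds a single data file.
sample_df = pd.read_csv(zip_path)
print(sample_df.shape)
```

Extracting the archive first, as done above in this notebook, is still the safer choice when a zip file holds several files.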

## Basic info about the dataset


### head()
Let's check which columns this file has by calling the 'head()' function.
It returns the first n rows, which is useful for seeing the dataset at a quick glance.

By default, the head() function returns the first 5 rows.

You can specify the number of rows to display by calling df.head(number), for example df.head(10).

In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
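
To illustrate passing a row count, here is a minimal sketch on a small made-up table (sample_df is a stand-in, not the Titanic data). It also shows tail(), the counterpart of head() that returns the last rows.

```python
import pandas as pd

# A small made-up table standing in for the Titanic data.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7],
    "Survived": [0, 1, 1, 1, 0, 0, 0],
})

first_three = sample_df.head(3)  # first 3 rows
last_two = sample_df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```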


### info()
This prints every column name along with its data type and its number of non-null values. It is useful for understanding the overall structure of the dataframe.


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
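
The Non-Null Count column above already hints at where values are missing (for example, 'Age' has 714 non-null values out of 891). One way to count missing values per column directly is isna().sum(). The sketch below uses a small made-up table, so its counts are not those of the Titanic data.

```python
import pandas as pd

# Made-up rows; float("nan") and None both count as missing.
sample_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, float("nan"), 26.0, float("nan")],
    "Cabin": [None, "C85", None, None],
})

# isna() marks each missing cell as True; sum() counts the Trues per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```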


### describe()
describe() shows summary statistics (count, mean, standard deviation, min, quartiles, and max) for the numeric columns. This gives you a general idea of the dataset.

In [None]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
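
By default, describe() skips text columns such as 'Sex'. Passing include="all" asks for a summary of every column. The sketch below uses a small made-up table, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up rows with one text column and one numeric column.
sample_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, float("nan")],
})

# Text columns get count/unique/top/freq; numeric columns get mean, std, etc.
# Statistics that do not apply to a column are shown as NaN.
summary = sample_df.describe(include="all")
print(summary)
```

Here 'top' is the most frequent value in a text column and 'freq' is how often it occurs.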


### shape
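To see the size of the dataset, we can use the shape attribute, which returns the number of rows and columns as a tuple in the form (#rows, #columns). Because shape is an attribute rather than a function, it is written without parentheses.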

In [None]:
df.shape

(891, 12)

## Remove NaN values

Before we dive into the dataset, let's learn how to remove NaN (missing) values. Each call below returns a new dataframe and leaves df unchanged.
* df.dropna(): drop the rows where at least one element is missing.
* df.dropna(how='all'): drop the rows where all of the elements are missing.
* df.dropna(subset=['Age', 'Cabin']): look for missing values only in the listed columns.

In [None]:
# if we drop the rows with at least one missing element.
df.dropna().shape

(183, 12)

In [None]:
# if we drop the rows with all elements missing.
df.dropna(how='all').shape

(891, 12)

In [None]:
# define in which columns to look for missing values.
df.dropna(subset=['PassengerId', 'Survived', 'Name']).shape

(891, 12)

In the Titanic dataset, every row has a value for 'PassengerId', 'Survived', and 'Name': the row count is still 891 after dropping rows with missing values in those columns.
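
Because the Titanic data has no completely empty rows, the three variants are easier to tell apart on a small made-up table. The following sketch assumes a hypothetical sample_df in which one row is complete, two rows are partly missing, and one row is entirely missing.

```python
import pandas as pd

nan = float("nan")

# Row 0 is complete, rows 1 and 3 are partly missing, row 2 is entirely missing.
sample_df = pd.DataFrame({
    "Name": ["A", "B", None, "D"],
    "Age": [22.0, nan, nan, 35.0],
    "Cabin": ["C85", None, None, None],
})

any_missing_dropped = sample_df.dropna()                  # keeps row 0 only
all_missing_dropped = sample_df.dropna(how="all")         # drops row 2 only
subset_dropped = sample_df.dropna(subset=["Name", "Age"]) # ignores 'Cabin'

print(any_missing_dropped.shape)
print(all_missing_dropped.shape)
print(subset_dropped.shape)
```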

## Select columns to work with

If you only need a few columns for your analysis, you can select a subset of columns using one of two methods:

1. by index location
2. by column names

In [None]:
# by index location (iloc)
# .copy() makes df_index independent of df, so later changes to one do not affect the other
df_index = df.iloc[:, [0, 1, 2, 3]].copy()
df_index

Unnamed: 0,PassengerId,Survived,Pclass,Name
0,1,0,3,"Braund, Mr. Owen Harris"
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,1,3,"Heikkinen, Miss. Laina"
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,5,0,3,"Allen, Mr. William Henry"
...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas"
887,888,1,1,"Graham, Miss. Margaret Edith"
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie"""
889,890,1,1,"Behr, Mr. Karl Howell"


In [None]:
# by column names
df_col_names = df[['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age']]
df_col_names

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0
...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0
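
A few related selection forms are sketched below on a small made-up table: a position slice with iloc, label-based selection with loc, and the difference between single and double brackets. The variable names are hypothetical.

```python
import pandas as pd

# Made-up rows using a few of the Titanic column names.
sample_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Braund", "Cumings", "Heikkinen"],
})

# Position slice: columns 0 and 1 (the end position 2 is excluded).
by_position = sample_df.iloc[:, 0:2]

# Label-based selection of all rows and two named columns.
by_label = sample_df.loc[:, ["Survived", "Name"]]

# Single brackets return a Series; double brackets return a DataFrame.
name_series = sample_df["Name"]
name_frame = sample_df[["Name"]]

print(by_position.columns.tolist())
print(by_label.columns.tolist())
```

Selecting by name is usually more robust than selecting by position, because it keeps working if the column order changes.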


## Filter Dataset based on criteria

Often, we are interested in working with only the rows that meet certain criteria.

If we only want to look at passengers with Age > 30, we can pass the condition to the 'loc' indexer.

In [None]:
df_over_30yrs = df.loc[df['Age'] > 30]
df_over_30yrs

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q


Now, let's filter the dataset using two criteria: "Age" is greater than 30 AND "Survived" equals 1.

'&' is equivalent to '*AND*' and '|' is equivalent to '*OR*' when combining conditions on a dataframe.

When filtering with multiple conditions, make sure to wrap each condition in **()**, because '&' and '|' bind more tightly than comparison operators such as '>' and '=='.

In [None]:
df_over_30yrs_survived = df.loc[(df['Age'] > 30) & (df['Survived'] == 1)]
df_over_30yrs_survived

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,D56,S
...,...,...,...,...,...,...,...,...,...,...,...,...
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,E17,S
862,863,1,1,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",female,48.0,0,0,17466,25.9292,D17,S
865,866,1,2,"Bystrom, Mrs. (Karolina)",female,42.0,0,0,236852,13.0000,,S
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S


Let's check how many passengers survived among the ones whose age was over 30.

In [None]:
print("# of passengers whose age was over 30: ", df_over_30yrs.shape[0])
print("# of survived passengers whose age was over 30: ", df_over_30yrs_survived.shape[0])

# of passengers whose age was over 30:  305
# of survived passengers whose age was over 30:  124
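
Each condition is a Series of True/False values (a boolean mask), so masks can be stored in variables, combined with '&', '|', and '~' (NOT), and counted with sum(). A minimal sketch on made-up rows, with hypothetical variable names:

```python
import pandas as pd

# Made-up rows; not the real Titanic data.
sample_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Survived": [0, 1, 1, 1, 0],
})

over_30 = sample_df["Age"] > 30        # boolean mask, one value per row
survived = sample_df["Survived"] == 1

both = sample_df.loc[over_30 & survived]    # AND: both conditions hold
either = sample_df.loc[over_30 | survived]  # OR: at least one holds
not_over_30 = sample_df.loc[~over_30]       # NOT: condition does not hold

# True counts as 1, so summing a mask counts the matching rows.
n_both = int((over_30 & survived).sum())

print(len(both), len(either), len(not_over_30), n_both)
```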


## Aggregation

Aggregation is the process of combining many values into a single summary value, such as a total or an average. It's useful for understanding the overall properties of a dataset.

Some examples of aggregation are sum(), min(), max(), count(), mean(), std(), etc.

### Sum
Let's calculate the total fares in the dataset using the sum() function.

In [None]:
df['Fare'].sum()

28693.9493

If we want to save the total fares in a variable and print the rounded value, we can try:

In [None]:
total_fares = df['Fare'].sum()
print("Total fares: ", round(total_fares))

Total fares:  28694.0


We can also count the number of passengers that survived by summing up the 'Survived' column.

In [None]:
survived_passengers = df['Survived'].sum()
survived_passengers

342
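
Several aggregations can also be computed in one call with agg(), which takes a list of function names. The sketch below uses only the first five fares shown earlier in this notebook, so the results do not match the full dataset.

```python
import pandas as pd

# The first five 'Fare' values from the Titanic data shown above.
sample_df = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925, 53.1, 8.05]})

# agg() returns a Series indexed by the function names.
fare_stats = sample_df["Fare"].agg(["sum", "min", "max", "mean", "count"])
print(fare_stats)
```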

### Mean

Now let's move on to the mean.

We will calculate the survival rate by *age group*.
We can apply the filtering that we just learned to select the passengers whose age was over 30 and those whose age was 30 or under.

In each group, we will calculate the mean of the 'Survived' column. Because 'Survived' is 1 for survivors and 0 otherwise, its mean is the survival rate.


In [None]:
# filtering
df_over_30yrs = df.loc[df['Age'] > 30]
df_under_30yrs = df.loc[df['Age'] <= 30]

# calculating mean of 'Survived' for each group
mean_over_30 = df_over_30yrs['Survived'].mean()
mean_under_30 = df_under_30yrs['Survived'].mean()

# printing the mean survival rates for each group
print("The survival rate of passengers whose age was over 30: ", mean_over_30)
print("The survival rate of passengers whose age was 30 or under: ", mean_under_30)

The survival rate of passengers whose age was over 30:  0.4065573770491803
The survival rate of passengers whose age was 30 or under:  0.4058679706601467


There's not much difference between the two groups.
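
One caveat: 'Age' has only 714 non-null values out of 891, and a comparison against NaN evaluates to False. Passengers with a missing age therefore fall into neither group. The sketch below shows this behavior on made-up rows with hypothetical variable names.

```python
import pandas as pd

nan = float("nan")

# Made-up rows; two of the five ages are missing.
sample_df = pd.DataFrame({
    "Age": [22.0, 38.0, nan, 35.0, nan],
    "Survived": [0, 1, 1, 1, 0],
})

over_30 = sample_df["Age"] > 30      # False where Age is NaN
at_most_30 = sample_df["Age"] <= 30  # also False where Age is NaN
unknown_age = sample_df["Age"].isna()

# The three masks together account for every row.
print(int(over_30.sum()), int(at_most_30.sum()), int(unknown_age.sum()))
```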


Let's calculate the survival rate by *sex* to see if there's a significant difference.

Here, we use the 'groupby' function, which groups the rows of the dataset by the values in a column (here, 'Sex').

We want the mean of the 'Survived' column, so we select that column with [ ] after the groupby call and then call mean().

In [None]:
# group by 'Sex' and calculate mean of 'Survived'
df.groupby(['Sex'])['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

There was a large difference in the survival rate by *sex*: about 74% of female passengers survived, compared with about 19% of male passengers.

Some of us might be curious:

Is a lower Pclass number (1 is first class) more expensive, or a higher one?

We can answer the question by calculating the mean fares for each class.

In [None]:
df.groupby(['Pclass'])['Fare'].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

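Lower Pclass numbers were a lot more expensive: the mean fare was about 84 in class 1, compared with about 21 in class 2 and about 14 in class 3.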
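
A group mean is easier to interpret next to the group size. One possible way to get both at once is to combine groupby with agg(). The sketch below uses the first five Pclass and Fare values shown earlier in this notebook, so its numbers differ from the full-dataset results.

```python
import pandas as pd

# The first five 'Pclass' and 'Fare' values from the Titanic data shown above.
sample_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# One row per class, one column per aggregation.
fare_by_class = sample_df.groupby("Pclass")["Fare"].agg(["mean", "count"])
print(fare_by_class)
```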


## Takeaways from this tutorial
Using DataFrames and aggregate functions in Pandas, we can answer many questions about a dataset!

In later weeks, we will take a deeper dive into Pandas to further analyze data.