# Data Manipulation with Pandas
This week, we will cover basic data manipulation using Pandas.
Pandas is an open source data analysis and manipulation tool, built on top of the Python programming language.

Since we've covered the fundamentals of Python, it will be fairly easy to pick up Pandas.

## Connecting to Your Google Drive


In [None]:
# Start by mounting Google Drive in Google Colab

from google.colab import drive

drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [None]:
!ls "/content/gdrive/My Drive/DigitalHistory"

'Meeting Minutes'   Week_3
 tmp		    Week_4
 Week_1		    Week8-PROJECT-Analyze-Trans-Atlantic-Slave-Trade.ipynb
 Week_2


In [None]:
cd "/content/gdrive/My Drive/DigitalHistory/tmp/trans-atlantic-slave-trade"


/content/gdrive/.shortcut-targets-by-id/1m-IVNIRZmHM3YwHOGFHd_tUQuI298irS/DigitalHistory/tmp/trans-atlantic-slave-trade


In [None]:
ls

[0m[01;34m__MACOSX[0m/  trans-atlantic-slave-trade.csv  trans-atlantic-slave-trade.csv.zip


## Import Libraries and unpack file



In [None]:
import pandas as pd
import zipfile


In [None]:
file_location = 'trans-atlantic-slave-trade.csv.zip'

# Extract the archive; the 'with' block closes the file automatically
with zipfile.ZipFile(file_location, 'r') as zip_ref:
    zip_ref.extractall('tmp/trans-atlantic-slave-trade')
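Before extracting an archive, it can help to list what it contains. The sketch below is only an illustration: it builds a small throwaway archive in a temporary folder so that it runs anywhere, and the names 'example.zip' and 'example.csv' are hypothetical, not files from this course.

```python
import os
import tempfile
import zipfile

# Build a tiny throwaway archive so the example runs anywhere.
# 'example.zip' and 'example.csv' are hypothetical names.
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'example.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('example.csv', 'a,b\n1,2\n')

# List the contents first, then extract everything.
# The 'with' block closes the archive automatically.
with zipfile.ZipFile(zip_path, 'r') as zf:
    names = zf.namelist()
    zf.extractall(os.path.join(work_dir, 'unpacked'))

extracted = os.path.exists(os.path.join(work_dir, 'unpacked', 'example.csv'))
print(names)
print(extracted)
```

Listing the names first shows which files and folders the archive will create before anything is written to disk.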

## Load file

In [None]:
df = pd.read_csv('tmp/trans-atlantic-slave-trade/trans-atlantic-slave-trade.csv')

print(df)

       Voyage ID   Vessel name  ... Slaves arrived at 1st port   Captain's name
0          81711        Hannah  ...                      390.0     Smith, Bryan
1          81712        Hannah  ...                      351.0  Wilson, Charles
2          81713        Hannah  ...                      303.0  Wilson, Charles
3          81714        Hannah  ...                      316.0   Young, William
4          81715        Hannah  ...                      331.0   Young, William
...          ...           ...  ...                        ...              ...
36105      80358         Ariel  ...                        NaN            Young
36106      81265         Ellis  ...                      263.0    Soutar, James
36107      81266         Ellis  ...                      303.0      Roach, John
36108      83426  Royal Edward  ...                      396.0  Bushell, Thomas
36109      83427  Royal Edward  ...                      280.0  Griffiths, John

[36110 rows x 8 columns]


## Basic info about the dataset
Now the dataset is loaded as a dataframe named 'df'.

### head()
Let's check what columns this file has by calling the 'head()' function.
It returns the first n rows, which is useful for seeing the dataset at a quick glance.

By default, head() returns the first 5 rows.

You can specify the number of rows to display by calling df.head(n); for example, df.head(10) returns the first 10 rows.


In [None]:
df.head()

Unnamed: 0,Voyage ID,Vessel name,Voyage itinerary imputed port where began (ptdepimp) place,Voyage itinerary imputed principal place of slave purchase (mjbyptimp),Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place,Year of arrival at port of disembarkation,Slaves arrived at 1st port,Captain's name
0,81711,Hannah,Liverpool,Calabar,"St. Vincent, port unspecified",1787,390.0,"Smith, Bryan"
1,81712,Hannah,Liverpool,New Calabar,"Grenada, port unspecified",1789,351.0,"Wilson, Charles"
2,81713,Hannah,Liverpool,"Bight of Biafra and Gulf of Guinea Islands, po...",Kingston,1789,303.0,"Wilson, Charles"
3,81714,Hannah,Liverpool,Bonny,"St. Vincent, port unspecified",1791,316.0,"Young, William"
4,81715,Hannah,Liverpool,Congo River,"Grenada, port unspecified",1792,331.0,"Young, William"
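As a small illustration of passing a number to head(), here is a sketch on a made-up dataframe. The dataframe 'sample' and its values are hypothetical stand-ins, not the voyage dataset. The tail() function works the same way from the end of the dataframe.

```python
import pandas as pd

# Hypothetical sample data standing in for the voyage dataset.
sample = pd.DataFrame({
    'Voyage ID': [1, 2, 3, 4, 5, 6, 7],
    'Vessel name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
})

first_three = sample.head(3)   # first 3 rows
default_rows = sample.head()   # first 5 rows (the default)
last_two = sample.tail(2)      # last 2 rows

print(first_three)
print(len(default_rows))
print(last_two)
```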


### info()
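info() prints a summary of the dataframe: every column name, the number of non-null values in each column, and each column's data type (dtype). It is a quick way to see how the dataframe is structured and where values are missing.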


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36110 entries, 0 to 36109
Data columns (total 8 columns):
 #   Column                                                                             Non-Null Count  Dtype  
---  ------                                                                             --------------  -----  
 0   Voyage ID                                                                          36110 non-null  int64  
 1   Vessel name                                                                        36108 non-null  object 
 2   Voyage itinerary imputed port where began (ptdepimp) place                         31646 non-null  object 
 3   Voyage itinerary imputed principal place of slave purchase (mjbyptimp)             34420 non-null  object 
 4   Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place  31846 non-null  object 
 5   Year of arrival at port of disembarkation                                          36110 non-null  int6

### describe()
describe() returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns. This gives you a general idea of how the values in the dataset are distributed.

In [None]:
df.describe()

Unnamed: 0,Voyage ID,Year of arrival at port of disembarkation,Slaves arrived at 1st port
count,36110.0,36110.0,18364.0
mean,42861.336278,1764.321878,275.718743
std,72705.501393,59.470974,159.10232
min,1.0,1514.0,0.0
25%,16135.25,1732.0,157.0
50%,32540.5,1773.0,253.0
75%,50322.75,1806.0,370.25
max,900237.0,1866.0,1700.0
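Note that the 'count' row only counts non-missing values, which is why 'Slaves arrived at 1st port' shows a count of 18364 even though the dataset has 36110 rows. The sketch below shows the same behavior on a small made-up dataframe. The column names 'Year' and 'Arrived' are shortened, hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical sample: one of the three 'Arrived' values is missing.
sample = pd.DataFrame({
    'Year': [1800, 1801, 1802],
    'Arrived': [100.0, 200.0, None],
})

summary = sample.describe()
arrived_count = summary.loc['count', 'Arrived']  # missing value is not counted
arrived_mean = summary.loc['mean', 'Arrived']    # mean of 100 and 200 only

print(summary)
print(arrived_count, arrived_mean)
```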


### shape
To see the size of the dataset, we can use the shape attribute, which holds the number of rows and columns as a tuple in the format (#rows, #columns). Because shape is an attribute and not a function, it is written without parentheses.

In [None]:
df.shape

(36110, 8)
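Since shape is a tuple, its two values can be unpacked into separate variables. A minimal sketch on a made-up dataframe (the names 'sample', 'n_rows', and 'n_cols' are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe with 3 rows and 2 columns.
sample = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Unpack the (rows, columns) tuple into two variables.
n_rows, n_cols = sample.shape

print(n_rows, n_cols)
print(len(sample))  # len() also gives the number of rows
```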

### Remove NaN values

Before we dive into the dataset, let's learn how to remove NaN (Null) values.
* df.dropna(): drop the rows where at least one of the elements is missing.
* df.dropna(how='all'): drop the rows where all of the elements are missing.
* df.dropna(subset=['Voyage ID', 'Vessel name']): define in which columns to look for missing values.
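The three options above can be compared side by side on a small made-up dataframe. This is only a sketch: 'sample' and its values are hypothetical, and 'Arrived' is a shortened stand-in for a numeric column. The isna().sum() call is an extra step that counts the missing values in each column.

```python
import pandas as pd

# Hypothetical sample: the last row is missing every value.
sample = pd.DataFrame({
    'Voyage ID': [1, 2, 3, None],
    'Vessel name': ['Hannah', None, 'Ellis', None],
    'Arrived': [390.0, 351.0, None, None],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

any_missing = sample.dropna()            # drops rows with at least one NaN
all_missing = sample.dropna(how='all')   # drops rows where every value is NaN
subset_missing = sample.dropna(subset=['Voyage ID', 'Vessel name'])

print(missing_per_column)
print(any_missing.shape, all_missing.shape, subset_missing.shape)
```

None of these calls changes 'sample' itself; each one returns a new dataframe.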

In [None]:
# if we drop the rows with at least one missing element.
df.dropna().shape

(15299, 8)

In [None]:
# if we drop the rows with all elements missing.
df.dropna(how='all').shape

(36110, 8)

In [None]:
# define in which columns to look for missing values.
df.dropna(subset=['Voyage ID', 'Vessel name']).shape

(36108, 8)

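In the trans-atlantic-slave-trade dataset, only two rows are missing 'Voyage ID' or 'Vessel name' (36110 - 36108 = 2). The info() output above shows that 'Voyage ID' has no missing values, so both of these rows are missing 'Vessel name'.

If you want to update the current dataframe so that it keeps the valid entries only, you can pass inplace=True to dropna().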

The code below will keep the updated dataset in the same variable.

In [None]:
df.dropna(subset=['Voyage ID', 'Vessel name'], inplace=True)
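An equivalent approach is to assign the result of dropna() back to the same variable. The sketch below uses a small made-up dataframe ('sample' is hypothetical). The reset_index step is optional and is shown only because dropping rows leaves gaps in the row labels.

```python
import pandas as pd

# Hypothetical sample with one missing vessel name.
sample = pd.DataFrame({
    'Voyage ID': [1, 2, 3],
    'Vessel name': ['Hannah', None, 'Ellis'],
})

# dropna() returns a new dataframe; assigning it back to the same
# name keeps only the valid rows, like inplace=True does.
sample = sample.dropna(subset=['Voyage ID', 'Vessel name'])
labels_after_drop = list(sample.index)   # row label 1 is now missing

# Optional: renumber the rows from 0 again.
sample = sample.reset_index(drop=True)

print(labels_after_drop)
print(sample)
```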

## Select columns to work with

If you only need a few columns for your analysis, you can select a subset of columns in two ways:

1. by index location
2. by column names

In [None]:
# by index location
df_index = df.iloc[: , [0,1,2,3]].copy()
df_index

Unnamed: 0,Voyage ID,Vessel name,Voyage itinerary imputed port where began (ptdepimp) place,Voyage itinerary imputed principal place of slave purchase (mjbyptimp)
0,81711,Hannah,Liverpool,Calabar
1,81712,Hannah,Liverpool,New Calabar
2,81713,Hannah,Liverpool,"Bight of Biafra and Gulf of Guinea Islands, po..."
3,81714,Hannah,Liverpool,Bonny
4,81715,Hannah,Liverpool,Congo River
...,...,...,...,...
36105,80358,Ariel,Liverpool,"Africa., port unspecified"
36106,81265,Ellis,Liverpool,Bance Island (Ben's Island)
36107,81266,Ellis,Liverpool,"West Central Africa and St. Helena, port unspe..."
36108,83426,Royal Edward,Liverpool,Bonny


In [None]:
# by column names
df_col_names = df[['Voyage ID', 'Vessel name', 'Slaves arrived at 1st port']]
df_col_names

Unnamed: 0,Voyage ID,Vessel name,Slaves arrived at 1st port
0,81711,Hannah,390.0
1,81712,Hannah,351.0
2,81713,Hannah,303.0
3,81714,Hannah,316.0
4,81715,Hannah,331.0
...,...,...,...
36105,80358,Ariel,
36106,81265,Ellis,263.0
36107,81266,Ellis,303.0
36108,83426,Royal Edward,396.0
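Both methods can return the same columns. The sketch below compares them on a small made-up dataframe ('sample' and the shortened column name 'Year' are hypothetical). It also shows that single brackets with one column name return a Series, not a dataframe.

```python
import pandas as pd

# Hypothetical sample with three columns.
sample = pd.DataFrame({
    'Voyage ID': [81711, 81712],
    'Vessel name': ['Hannah', 'Hannah'],
    'Year': [1787, 1789],
})

# By index location: all rows, columns at positions 0 and 1
# (the end of the slice, 2, is excluded).
by_position = sample.iloc[:, 0:2]

# By column names: note the double brackets (a list of names).
by_name = sample[['Voyage ID', 'Vessel name']]

same = by_position.equals(by_name)

# Single brackets with one name return a Series.
single = sample['Vessel name']

print(same)
print(type(single).__name__)
```

Selecting by name is usually easier to read and keeps working if the column order changes.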


## Filter Dataset using criteria

Often, you only want to work with the rows that meet certain criteria.

Let's say that we only want to work with the voyages that arrived after 1800 (that is, from 1801 onward).

Inside 'loc', write the condition as a comparison on a column. The comparison is evaluated for every row, and 'loc' keeps only the rows where it is true.

In [None]:
df_since_1800 = df.loc[df['Year of arrival at port of disembarkation'] > 1800]
df_since_1800

Unnamed: 0,Voyage ID,Vessel name,Voyage itinerary imputed port where began (ptdepimp) place,Voyage itinerary imputed principal place of slave purchase (mjbyptimp),Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place,Year of arrival at port of disembarkation,Slaves arrived at 1st port,Captain's name
12,81727,Harlequin,Liverpool,"Africa., port unspecified",Havana,1802,143.0,"Brade, Thomas"
17,81732,Harmonie,Liverpool,"West Central Africa and St. Helena, port unspe...",Cumingsberg,1806,225.0,"Ainsworth, John"
20,81735,Harriott,London,"Gold Coast, port unspecified",Suriname,1804,260.0,"Clarke, James<br/> Sutherland"
21,81736,Harriott,London,Accra,Demerara,1805,273.0,"Sutherland, Stewart"
22,81737,Harriott,London,"Gold Coast, port unspecified",Demerara,1806,275.0,"Clark, James<br/> Duncan"
...,...,...,...,...,...,...,...,...
36101,47527,S Antônio Rei,"Bahia, port unspecified","West Central Africa and St. Helena, port unspe...","Bahia, port unspecified",1807,70.0,"Maria, José<br/> Gonçalves, João<br/> Guimarãe..."
36102,49000,NS da Conceição e Sr dos Passos,"Bahia, port unspecified","West Central Africa and St. Helena, port unspe...","Bahia, port unspecified",1817,410.0,"Medões, Bernardo da Silva<br/> Queiróz, Manoel..."
36106,81265,Ellis,Liverpool,Bance Island (Ben's Island),Suriname,1802,263.0,"Soutar, James"
36107,81266,Ellis,Liverpool,"West Central Africa and St. Helena, port unspe...",Suriname,1805,303.0,"Roach, John"


Now let's select the rows where the 'Year of arrival at port of disembarkation' is exactly 1800.

In [None]:
df_in_1800 = df.loc[df['Year of arrival at port of disembarkation'] == 1800]
df_in_1800

Unnamed: 0,Voyage ID,Vessel name,Voyage itinerary imputed port where began (ptdepimp) place,Voyage itinerary imputed principal place of slave purchase (mjbyptimp),Voyage itinerary imputed principal port of slave disembarkation (mjslptimp) place,Year of arrival at port of disembarkation,Slaves arrived at 1st port,Captain's name
11,81726,Harlequin,Liverpool,"West Central Africa and St. Helena, port unspe...",Demerara,1800,268.0,"Maginnis, John"
2286,19101,S João Nepomuceno,Rio de Janeiro,Luanda,Rio de Janeiro,1800,489.0,"Garcia, Francisco Correa<br/> Firme, Manoel Ca..."
2287,19102,NS de Guadalupe e Sr Bom Jesus dos Navegantes ...,"Southeast Brazil, port unspecified",Benguela,Rio de Janeiro,1800,452.0,"Couto, Antônio Xavier do"
2288,19103,Boa Sociedade Ativo,"Southeast Brazil, port unspecified",Luanda,Rio de Janeiro,1800,367.0,"Viana, Paulo José"
2289,19104,Flora,Lisbon,Benguela,Rio de Janeiro,1800,478.0,"Firme, Antônio Caetano"
...,...,...,...,...,...,...,...,...
35735,80745,Castle Douglas,London,"Gold Coast, port unspecified",Demerara,1800,304.0,"Clark, James"
35745,80757,Catherine,Liverpool,New Calabar,,1800,,"Morrison, John"
35784,80802,Chance,Liverpool,"Africa., port unspecified",West Indies (colony unspecified),1800,,"Crooker, Thomas"
35825,900040,Balsemão,Lisbon,Mozambique,,1800,,
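Conditions can also be combined. The sketch below is an illustration on a small made-up dataframe ('sample', 'Year', and 'Arrived' are hypothetical stand-ins for the long column names). It uses & for "and", >= to include the year 1800 itself, and between() for a range that includes both ends.

```python
import pandas as pd

# Hypothetical sample; the last row has no recorded arrival count.
sample = pd.DataFrame({
    'Vessel name': ['Hannah', 'Ellis', 'Flora', 'Chance'],
    'Year': [1791, 1802, 1800, 1805],
    'Arrived': [316.0, 263.0, 478.0, None],
})

# & means "and", | means "or"; each condition needs its own parentheses.
after_1800_large = sample.loc[(sample['Year'] > 1800) & (sample['Arrived'] > 250)]

# >= keeps the year 1800 as well.
from_1800 = sample.loc[sample['Year'] >= 1800]

# between() includes both endpoints by default.
in_range = sample.loc[sample['Year'].between(1800, 1802)]

print(after_1800_large)
print(len(from_1800), len(in_range))
```

A comparison with a missing value is false, so the row without an arrival count is left out of the first result.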


## Aggregation

Aggregation is the process of combining many values into a single summary value.

Some examples of aggregation are sum, minimum, maximum, count, average, standard deviation, etc.
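As a quick illustration, several of these aggregations can be computed at once with agg(). This is a sketch on a made-up Series ('arrived' and its values are hypothetical). Missing values are skipped.

```python
import pandas as pd

# Hypothetical values; the missing value is ignored by every aggregation.
arrived = pd.Series([100.0, 200.0, 300.0, None], name='Arrived')

# Pass a list of aggregation names to compute them together.
summary = arrived.agg(['sum', 'min', 'max', 'count', 'mean', 'std'])

print(summary)
```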

### Sum
Let's calculate the total number of slaves recorded as arriving at the first port, using the sum() function. Missing (NaN) values are skipped.

In [None]:
df['Slaves arrived at 1st port'].sum()

5063299.0

If you want to save the total number of slaves in a variable, then you can try:

In [None]:
total_slaves = df['Slaves arrived at 1st port'].sum()
print("total slaves: ", total_slaves)

total slaves:  5063299.0


### Mean

Now let's move on to the mean!
We will calculate the average number of slaves who arrived per voyage, grouped by year.

Here, we use the 'groupby' method, which splits the dataset into groups that share the same value in a column (here, 'Year of arrival at port of disembarkation').

We then select the 'Slaves arrived at 1st port' column with [ ] and call mean() to compute the average within each year. A year shows NaN when none of its voyages has a recorded value.

In [76]:
df.groupby(['Year of arrival at port of disembarkation'])['Slaves arrived at 1st port'].mean()

Year of arrival at port of disembarkation
1514           NaN
1516           NaN
1520     44.000000
1525           NaN
1526     57.500000
           ...    
1862    671.000000
1863    612.636364
1864    471.142857
1865    397.500000
1866    700.000000
Name: Slaves arrived at 1st port, Length: 335, dtype: float64
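To see how the grouping works on a scale that is easy to check by hand, here is a sketch on a small made-up dataframe ('sample', 'Year', and 'Arrived' are hypothetical stand-ins). The year with only a missing value produces NaN, and the year with two voyages produces their average.

```python
import pandas as pd

# Hypothetical sample: 1514 has no recorded value, 1520 has two voyages.
sample = pd.DataFrame({
    'Year': [1514, 1520, 1520, 1526],
    'Arrived': [None, 40.0, 48.0, 57.5],
})

# Group the rows by year, pick one column, and average it within each group.
mean_by_year = sample.groupby('Year')['Arrived'].mean()

print(mean_by_year)
```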

### Count
How do we count the number of unique values in a column?

Let's count the number of unique values in the 'Vessel name' column.

There are two approaches to counting the unique values:

1. the nunique() function
2. the set() and len() functions

The first approach uses the nunique() function, which returns the number of unique values.

We use the 'groupby' method to indicate which column we want to group the dataset by. The result has one row for each distinct vessel name.

As we learned earlier, the shape attribute returns (#rows, #columns). By indexing the 0th element, shape[0] gives the number of rows, which here is the number of distinct vessel names.

In [73]:
vessel_name_cnt = df.groupby(['Vessel name']).nunique()
vessel_name_cnt.shape[0]

9448

The second approach uses the set() and len() functions.

A set can be written by placing items inside curly braces {}, separated by commas, or created from a list by calling set(). It is similar to a list in Python.

But unlike a list, a set contains no duplicates, so converting the column to a set keeps each vessel name once. len() then returns the number of items in the set.

In [74]:
vessel_name_cnt_2 = len(set(df['Vessel name'].tolist()))
vessel_name_cnt_2

9448
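A shorter option, offered here as an alternative sketch, is to call nunique() directly on the column, without groupby. The example uses a small made-up Series ('names' is hypothetical). It also shows value_counts(), which reports how many times each value appears.

```python
import pandas as pd

# Hypothetical list of vessel names with repeats.
names = pd.Series(['Hannah', 'Hannah', 'Ellis', 'Ariel', 'Ellis'], name='Vessel name')

# nunique() on a single column returns the number of distinct values.
n_unique = names.nunique()

# The set() and len() approach gives the same answer here.
n_unique_set = len(set(names.tolist()))

# value_counts() shows how often each name appears.
counts = names.value_counts()

print(n_unique, n_unique_set)
print(counts)
```

By default nunique() ignores missing values, so the two approaches can differ when a column contains NaN.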