<a href="https://colab.research.google.com/github/bitprj/BitUniversity/blob/master/Digital_History/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/Week2_Working_with_data_using_Pandas_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/ShayanRiyaz/DigitalHistory/blob/Shayan/Week5-Lab-Visualizing-the-Translatlantic-Slave-Trade/assets/icons/bitproject.png?raw=1" width="200" align="left"> 
<img src="https://github.com/ShayanRiyaz/DigitalHistory/blob/Shayan/Week5-Lab-Visualizing-the-Translatlantic-Slave-Trade/assets/icons/data-science.jpg?raw=1" width="300" align="right">

# <div align="center">Working with Data Using the Pandas Library</div>

## What is Pandas?
This week, we will cover topics of basic data manipulation using the Pandas library.
Some key points:
1. Pandas is an open source data analysis and manipulation tool and is widely used both in academia and in the industry.
2. It is built on top of the Python programming language. 
3. It offers data structures and operations for manipulating numerical tables and time series.

### Data structures in Pandas
Data structures are ways to organize and store data. Pandas provides two main data structures: Series and DataFrame. (Older versions also included a 3-dimensional **Panel**, which has since been removed from the library.)

1. A **Series** is a 1-dimensional labelled array, comparable to a single column in Excel. It can hold data of **any type** (integers, strings, Python objects, etc.), and its labels are called the index.
2. A **DataFrame** is a 2-dimensional labelled data structure with both rows and columns, comparable to a table or spreadsheet.

For this module, we will be covering the **DataFrame** data structure.
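
As a rough illustration of how the two structures relate, the sketch below builds a small Series and a small DataFrame by hand. The names and values are made up for the example and are not loaded from the Titanic file.

```python
import pandas as pd

# A Series: one labelled column of values (the labels 1, 2, 3 are hypothetical IDs).
ages = pd.Series([22.0, 38.0, 26.0], index=[1, 2, 3], name="Age")

# A DataFrame: several columns that share the same row labels.
passengers = pd.DataFrame(
    {"Name": ["Owen", "Florence", "Laina"], "Age": [22.0, 38.0, 26.0]},
    index=[1, 2, 3],
)

print(ages.index.tolist())       # the row labels of the Series
print(passengers.shape)          # (number of rows, number of columns)
print(type(passengers["Age"]))   # a single column of a DataFrame is a Series
```

Selecting one column from a DataFrame gives back a Series, which is why the two structures are usually learned together.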

## About the dataset:

## Titanic
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/titanic.png?raw=1" width="300" align="right"> 
To demonstrate how to use Pandas and dataframes, we will use a dataset about the Titanic. The Titanic was a British passenger liner that sank in the North Atlantic Ocean in 1912 after striking an iceberg during her maiden voyage. The ship was traveling from Southampton to New York City and carried an estimated 2,224 passengers and crew, of whom more than 1,500 died.


This dataset does not include data for all of the passengers, but has the following information available for approximately a third of all passengers aboard: name, age, gender, ticket price, and most importantly, whether or not they survived. 
For this exercise, this portion of data will be sufficient for our analysis.

In the previous module, we learned about the different **data types** in Python, including *strings*, *integers*, *floats*, and *booleans*. Each column of a dataframe also has a **data type**. In our data, we can recognize the data types as follows:
- Columns describing text use strings.
- Columns describing numbers use integers and floats.
- Columns with `True` or `False` values use booleans.
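
To see how these types show up in a dataframe, here is a minimal sketch on a hand-made table (the values are illustrative, not read from the dataset). The `dtypes` attribute lists the type pandas assigned to each column; how text columns are labelled depends on the pandas version.

```python
import pandas as pd

sample = pd.DataFrame({
    "Name": ["Owen", "Florence"],   # text
    "Pclass": [3, 1],               # whole numbers
    "Fare": [7.25, 71.2833],        # decimal numbers
    "Survived": [False, True],      # True / False values
})

# One data type per column
print(sample.dtypes)
```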


## Grading

In order to work on the NTT sections and submit them for grading, you'll need to run the code block below. It will ask for your student ID number and then create a folder that will house your answers for each question. At the very end of the notebook, there is a code section that will download this folder as a zip file to your computer. This zip file will be your final submission.

In [None]:
import os
import shutil

!rm -rf sample_data

student_id = input('Please Enter your Student ID: ') # Enter Student ID.

while len(student_id) != 9:
  student_id = input('Please Enter your Student ID: ')  # ask again until the ID has 9 characters
  
folder_location = f'{student_id}/Week_Two/Now_Try_This'
if not os.path.exists(folder_location):
  os.makedirs(folder_location)
  print('Successfully Created Directory, Lets get started')
else:
  print('Directory Already Exists')

## Import Pandas library

In order to load and read data from a file, we will need to import the **Pandas** library.

When we use the command ```import pandas as pd```, we import the pandas library and specify that we will use ```pd``` to reference it for future use. 
***NOTE:*** Technically, we could import the library under any name we like, but the convention is to use ```import pandas as pd```.
With the **Pandas** library imported, we can use all of the functions and properties available in Pandas.

In [1]:
import pandas as pd

To read the CSV file containing our data, we will use the ```read_csv()``` function and assign the result to a variable, which stores the data as a DataFrame.

In this exercise, we are storing the data into a Dataframe named `df`. For future cases, it is preferred to give a more descriptive name for your dataframe.

In [3]:
url = 'https://raw.githubusercontent.com/bitprj/BitUniversity/master/Digital_History/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/data/titanic-dataset.csv'
df = pd.read_csv(url)

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Perfect! Now when we call `df`, it displays the dataframe above.

## Basic Information About the Dataset

Now, we can use built-in functions to get basic information about the dataframe. We will explore the following functions: 
- head()
- tail()
- describe()
- info()

### head() / tail()
The `head()` and `tail()` functions are useful to see the contents of the dataset at a quick glance. These functions return the first and last n rows of the dataset, respectively.

By default, `head()` returns the first 5 rows and `tail()` returns the last 5 rows.


In [None]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
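
Both functions also accept a number of rows. The sketch below uses a hypothetical ten-row table so the result is easy to check by eye.

```python
import pandas as pd

# Hypothetical table with PassengerId values 1 through 10
toy = pd.DataFrame({"PassengerId": range(1, 11)})

first_three = toy.head(3)   # first 3 rows instead of the default 5
last_two = toy.tail(2)      # last 2 rows instead of the default 5

print(first_three["PassengerId"].tolist())
print(last_two["PassengerId"].tolist())
```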


Most of the columns are straightforward, but further explanation on more ambiguous columns is available below: 

- `Pclass`: Ticket class (1 being first, 2 being second, and 3 being third)
- `SibSp`: Number of Siblings/Spouses Aboard the Titanic
- `Parch`: Number of Parents/Children Aboard the Titanic
- `Embarked`: Port of Embarkation
  - C = Cherbourg
  - Q = Queenstown
  - S = Southampton

### info()

In order to summarize the information available in the dataset, we can use the `info()` function.

This function is useful because it lists every **column name** in the dataframe along with its **data type**.

This function also includes the **Non-Null** counts. 
**Non-Null** means that there is a value available. On the other hand, if there is no value available for a column, this is considered a **Null** value. Thus, the **Non-Null** value counts show how many valid data values there are for each column.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
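
If we want the missing-value counts directly rather than reading them off the `info()` report, one possible approach is to combine `isna()` with `sum()`. The sketch below runs on a small made-up table, not on the Titanic data.

```python
import pandas as pd

# Hypothetical table; None marks a missing value
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_per_column = toy.isna().sum()     # number of Null values in each column
non_null_per_column = toy.notna().sum()   # the "Non-Null Count" that info() reports

print(missing_per_column)
```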


### describe()

The ```describe()``` function is used to view **summary statistics** of numeric columns. This helps us to have a general idea of the dataset. It gives us *count*, *mean*, *standard deviation*, *minimum*, *maximum*, and the values for *25th*, *50th*, and *75th* percentile.


In [None]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292
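
Note that the *count* row only counts non-null values, which is why `Age` shows 714 rather than 891. A small sketch on a made-up Series shows the same behaviour, and also how to pick a single statistic out of the summary.

```python
import pandas as pd

# Hypothetical ages; None marks a missing value
ages = pd.Series([20.0, 30.0, None, 40.0], name="Age")

summary = ages.describe()

print(summary["count"])   # missing values are not counted
print(summary["mean"])
print(summary["50%"])     # the 50th percentile is the median
```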


### shape
We can use the ```shape``` attribute to see the dimensions of the dataset. Because it is an attribute rather than a function, it is written without parentheses. It returns the number of rows and columns in the format `(# rows, # columns)`.

For this example, we can run `df.shape` to see that this dataset has 891 rows (passengers) and 12 columns.

In [None]:
df.shape

(891, 12)
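
Because `shape` is a plain tuple, it can be unpacked into two variables. A minimal sketch on a hypothetical table:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape   # unpack the (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy))              # len() gives the number of rows only
```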

## Working with Indexes
### Setting an Index

For this exercise, `PassengerId` is a unique key for each passenger in the dataframe, so we want to use this value as the index instead of the default index.

We can use the ```set_index()``` function and specify the column to use as the index with ```keys=```.

In [None]:
df.set_index(keys='PassengerId')

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Now, we can see that we have set `PassengerId` as the index.

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


However, notice that PassengerId is not an index when we call `df` again. This is because we didn't make the change in place. If we want to make the change permanently, we need to include `inplace = True`.

In [None]:
df.set_index(keys='PassengerId', inplace=True)

In [None]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Now, when we check again, we see that `PassengerId` has permanently changed to become the index!
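
An alternative to `inplace=True` is to assign the result of `set_index()` to a variable, which keeps the original dataframe around. The sketch below uses a tiny hypothetical table and also shows `reset_index()`, which turns the index back into a regular column.

```python
import pandas as pd

toy = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

indexed = toy.set_index("PassengerId")   # returns a new dataframe; toy is untouched
print(toy.columns.tolist())              # PassengerId is still a column in toy
print(indexed.index.name)                # PassengerId is the index of the new dataframe

restored = indexed.reset_index()         # move the index back into a column
print(restored.columns.tolist())
```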

## Indexing

Now that we have set an index, we can look at `indexing`, which refers to selecting specific subsets of the data.

We can do this using two methods:

1. Selection by index numbers
2. Selection by column names

### 1. Selection by index numbers
We can select specific subsets of data using ```iloc[rows_index, columns_index]```.

Column positions are numbered from left to right, beginning with `0`. That is, the first column on the left has position 0, the second column has position 1, and so on. Row positions are numbered the same way, from top to bottom.

As we learned last week, we can use slicing (```[:]```) to select specifics in a list or string in Python. 

Similarly, we can also use ```[:]``` to select rows or columns in a dataframe. 

Let's try an example: we want to select the data for all passengers, but only for the columns of `Survived`, `Pclass`, and `Name`. The index numbers for these columns are 0, 1, and 2, respectively. (Note that the index `PassengerId` is not included as part of the index numbers.)

In [None]:
df.iloc[: , [0,1,2]]

PassengerId,Survived,Pclass,Name
1,0,3,"Braund, Mr. Owen Harris"
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
3,1,3,"Heikkinen, Miss. Laina"
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
5,0,3,"Allen, Mr. William Henry"
...,...,...,...
887,0,2,"Montvila, Rev. Juozas"
888,1,1,"Graham, Miss. Margaret Edith"
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie"""
890,1,1,"Behr, Mr. Karl Howell"
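
Slices work for both rows and columns inside `iloc`. The following sketch, on a hypothetical four-passenger table, selects a block of rows and columns, a single row, and a single value by position.

```python
import pandas as pd

# Hypothetical table indexed by PassengerId
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1],
        "Pclass": [3, 1, 3, 1],
        "Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
        "Sex": ["male", "female", "female", "female"],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

first_two_rows = toy.iloc[0:2, 0:3]   # rows at positions 0-1, columns at positions 0-2
last_row = toy.iloc[-1]               # negative positions count from the end
single_value = toy.iloc[1, 2]         # second row, third column

print(first_two_rows)
print(single_value)
```

Note that `iloc` uses positions, not index labels: position 0 is the passenger whose `PassengerId` is 1.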


### 2. Selecting by column names

We can also select columns by specifying the column names. We can do this by using double brackets ```[[column_names]]``` instead of referencing index locations.

Let's say we are interested in the columns for `Survived`, `Sex` and `Age`.
We can do the following line of code to select these specific columns:

In [None]:
df[['Survived', 'Sex', 'Age']]

PassengerId,Survived,Sex,Age
1,0,male,22.0
2,1,female,38.0
3,1,female,26.0
4,1,female,35.0
5,0,male,35.0
...,...,...,...
887,0,male,27.0
888,1,female,19.0
889,0,female,
890,1,male,26.0
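
The number of brackets matters. As a small sketch on made-up data: single brackets with one column name return a Series, while double brackets return a DataFrame, even when only one column is listed.

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
})

one_column = toy["Age"]           # single brackets: a Series
one_column_frame = toy[["Age"]]   # double brackets: a DataFrame with one column

print(type(one_column))
print(type(one_column_frame))
```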


## Dropping rows / columns

### Dropping rows

There are two ways of dropping rows.
The first way is used when we know exactly which row indexes we want to drop from the dataframe.
The second way to drop rows is by specifying a condition or criteria that certain rows satisfy.

1. ```df.drop(index #)```
2. ```df.drop(df[<some boolean condition>].index)```


For our exercise, the first method is not the most appropriate, since the dataframe has a large number of rows to consider.

The first method is better suited to something like class survey data. For instance, if a certain student dropped out, we could look up that student's index label and drop that specific row.

To illustrate, let's look at what happens if we drop the row whose index label is `3` (the passenger with `PassengerId` 3).

In [None]:
df.drop(3)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Let's check if `df` got updated!

In [None]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Even though we dropped the index #3, `df` remains unchanged. 

We can update the dataframe permanently once we drop the rows by setting ```inplace=True```.

In [None]:
df.drop(3, inplace=True)

In [None]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Let's try to work with the second method of dropping rows. In this case, instead of naming the specific index number to drop, we can specify the condition and remove any rows that satisfy that condition.

For example, let's say we wanted to remove the passengers whose `Fare` was 0.

In [None]:
df.drop(df[df['Fare'] == 0].index)

Notice that the number of rows in the dataframe has changed from `890` to `875`, which tells us that 15 rows were removed. This means that there were 15 passengers whose `Fare` was equal to `0`.

(Similar to the previous method, if we want to update the dataframe permanently, we can add `inplace=True` inside the parentheses.)
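
To make the row arithmetic concrete, here is a minimal sketch on a hypothetical four-row table. It drops the zero-fare rows by condition, counts how many were removed, and shows that keeping the rows with a non-zero fare gives the same result.

```python
import pandas as pd

# Hypothetical fares; two passengers paid nothing
toy = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Fare": [7.25, 0.0, 13.0, 0.0]},
    index=[1, 2, 3, 4],
)

free_ticket_rows = toy[toy["Fare"] == 0].index   # index labels of rows to remove
paid = toy.drop(free_ticket_rows)                # new dataframe without those rows

rows_removed = len(toy) - len(paid)
print(rows_removed)

# Equivalent here: keep the rows where the condition is NOT met
paid_via_mask = toy[toy["Fare"] != 0]
print(paid.equals(paid_via_mask))
```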

### Dropping Columns

There are two ways of dropping columns.

The first way is similar to the way to drop the rows using the `.drop()` function. This change is temporary unless we specify `inplace=True`.
The second way deletes the column from the dataframe immediately using the `del` command.

1. ```df.drop(columns=[column_name])```
2. ```del df[column_name]```




Before we dive into the data analysis, let's see if there are any more columns we want to remove. 
We will check by calling the `.info()` function.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 890 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  890 non-null    int64  
 1   Pclass    890 non-null    int64  
 2   Name      890 non-null    object 
 3   Sex       890 non-null    object 
 4   Age       713 non-null    float64
 5   SibSp     890 non-null    int64  
 6   Parch     890 non-null    int64  
 7   Ticket    890 non-null    object 
 8   Fare      890 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  888 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 123.4+ KB


Notice that the `Cabin` column has only 204 non-null values. That means 890 - 204 = 686 rows have no value in this column.
With so much data missing, this column likely won't give us as much meaningful insight as the others, so we will remove the `Cabin` column.
We can do this by using the `.drop()` function from the first method.

In [None]:
df.drop(columns=['Cabin'])

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,Q
...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C


Remember that the first method doesn't drop the column permanently unless we specify `inplace=True`.

In [None]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


We can note that `df` still has `Cabin` column even though we dropped it.

Now, if we use the second method, it will drop the column immediately.
So, be careful when you use the `del` statement, as it deletes the column from the dataframe right away, with no `inplace=True` needed.

In [None]:
del df['Cabin']

In [None]:
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,Q
...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C
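
The difference between the two methods can be checked on a small made-up table. In this sketch, the variable `had_cabin_after_drop` is just a helper name chosen for the example.

```python
import pandas as pd

toy = pd.DataFrame({"Fare": [7.25, 71.28], "Cabin": [None, "C85"]})

# Method 1: drop() returns a new dataframe and leaves toy unchanged
without_cabin = toy.drop(columns=["Cabin"])
had_cabin_after_drop = "Cabin" in toy.columns
print(had_cabin_after_drop)

# Method 2: del removes the column from toy itself
del toy["Cabin"]
print(toy.columns.tolist())
```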


## Dropping NaN values

Oftentimes, when we work with large datasets, we will encounter cases where there are lots of missing elements (NaN / null) in a dataset.

Removing NaN values will allow us to work with clean datasets.

**Pandas** allows us to use the `dropna()` function in the following ways:

* ```df.dropna()```: drop the rows where at least one of the elements is missing.
* ```df.dropna(how='all')```: drop the rows where all of the elements are missing.
* ```df.dropna(subset=[columns])```: define in which columns to look for missing values.


We will practice running each of these options:

In [None]:
df.dropna()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
...,...,...,...,...,...,...,...,...,...,...
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,Q
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C


In [None]:
df.dropna(how='all')

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,Q
...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C


For this example, suppose we want to drop the rows where `Age` is NaN. To do this, we can specify the column name in the `subset` parameter.

In [None]:
df.dropna(subset=['Age'])

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
...,...,...,...,...,...,...,...,...,...,...
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,Q
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C
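
On the Titanic data the three variants are hard to tell apart at a glance, so here is a sketch on a hypothetical four-row table where each option keeps a different set of rows.

```python
import pandas as pd

# Row 0: complete, row 1: Age missing, row 2: Embarked missing, row 3: everything missing
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, None],
    "Embarked": ["S", "C", None, None],
})

any_missing = toy.dropna()                 # drops rows 1, 2 and 3
all_missing = toy.dropna(how="all")        # drops only row 3
age_missing = toy.dropna(subset=["Age"])   # drops rows 1 and 3

print(any_missing.index.tolist())
print(all_missing.index.tolist())
print(age_missing.index.tolist())
```

Like `drop()`, each call returns a new dataframe and leaves `toy` unchanged.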


## Analysis of unique values

### Getting unique values
We can use the `.drop_duplicates()` function to find the unique values in a column. The function is called as so:

```df[[column_name]].drop_duplicates()```

For example, let's find all the unique values from the `Parch` column, and disregard all duplicate values.

In [None]:
df[['Parch']].drop_duplicates()

PassengerId,Parch
1,0
8,1
9,2
14,5
87,3
168,4
679,6


This shows that the number of parents and children a passenger had aboard ranged from 0 to 6.
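
Another option for a single column is the `unique()` method, which returns just the distinct values without the index. A minimal sketch on made-up `Parch` values:

```python
import pandas as pd

toy = pd.DataFrame({"Parch": [0, 1, 0, 2, 1, 0]})

# Distinct values, in order of first appearance
distinct = toy["Parch"].unique().tolist()
print(distinct)

# drop_duplicates() keeps the same values, but as a dataframe with its index
deduplicated = toy[["Parch"]].drop_duplicates()
print(deduplicated["Parch"].tolist())
```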

### Counting unique values

The **Pandas** library also includes the ```groupby()``` and ```size()``` functions, which can be used to count the number of occurrences of a unique value in a specific column.

For example, let's group the dataframe by the value of the column `Embarked` and get the size of each group.

In [None]:
df.groupby(['Embarked']).size()

Embarked
C    168
Q     77
S    643
dtype: int64

If you want to sort by the count, you can use `sort_values()`. This makes it easier to identify which values occur most often in the dataset.

In [None]:
df.groupby(['Embarked']).size().sort_values()

Embarked
Q     77
C    168
S    643
dtype: int64

We can note that most of the passengers embarked from the port of Southampton.
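
The three counts add up to 888 rather than 890 because rows with a missing `Embarked` value are left out of the groups by default. The sketch below reproduces this on a hypothetical column and also shows `value_counts()`, a shortcut that counts and sorts in one step.

```python
import pandas as pd

# Hypothetical ports; None marks a missing value
toy = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S", "C"]})

by_group = toy.groupby("Embarked").size()     # missing values are not counted
by_counts = toy["Embarked"].value_counts()    # same counts, sorted from most to least common

print(by_group.sum(), len(toy))
print(by_counts)
```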

### Number of unique values

We can get the number of unique values in a column using the `nunique()` function.

Let's find the number of unique ages of passengers. We can look at the values in the `Age` column.

In [None]:
df["Age"].nunique()

88
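
By default, `nunique()` does not count missing values as a separate value. A short sketch on made-up ages shows the difference when `dropna=False` is passed:

```python
import pandas as pd

# Hypothetical ages; None marks a missing value
ages = pd.Series([22.0, 38.0, 22.0, None])

distinct_ages = ages.nunique()                        # NaN is ignored
distinct_with_nan = ages.nunique(dropna=False)        # NaN counts as one more value

print(distinct_ages, distinct_with_nan)
```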

## Selecting rows based on criteria

Often, we are interested in working only with the rows that meet certain criteria. There are two ways to select such rows.

1. `df.where(condition)`
2. `df.loc[condition]`

The `where()` method checks every row of the dataframe against the condition and returns a result with the same shape as the original. Rows that do not satisfy the condition are filled with NaN values.

If you only want the rows that meet the condition, use the `loc` indexer instead, which leaves out the rows that do not meet the condition.



### Using `where` clause

Let's use a filter to select the rows that have `Age >= 30` and apply the filter to our dataframe.

We store the filtered result in a dataframe named `older_than_30`. Now, when we call `older_than_30`, it will display our filtered data.

In [None]:
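# Boolean Series: True where Age is 30 or more (avoids shadowing Python's built-in `filter`)
age_filter = df['Age'] >= 30
older_than_30 = df.where(age_filter)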

In [None]:
older_than_30

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1.0 | 1.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C |
| 4 | 1.0 | 1.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | S |
| 5 | 0.0 | 3.0 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | S |
| 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 888 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 889 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 890 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


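Note that the number of rows is the same, but the rows that were filtered out have NaN values in every column. Integer columns such as `Survived` are also displayed as floats (`1.0`), because NaN cannot be stored in a regular integer column.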

### Using the `loc` indexer

If we only want the data points with `Age >= 30`, we can pass the criteria to the ```loc``` indexer.

In [None]:
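# Boolean Series: True where Age is 30 or more (avoids shadowing Python's built-in `filter`)
age_filter = df['Age'] >= 30
df.loc[age_filter]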

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S |
| 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | S |
| 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | S |
| 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C |
| 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | S |
| 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | Q |
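
To compare the two approaches side by side, here is a minimal sketch on a hypothetical three-row dataframe (the names and values are made up for illustration). It applies the same condition with `where` and with `loc` so the difference in shape and data types is easy to see:

```python
import pandas as pd

# Hypothetical toy data, indexed by PassengerId like the Titanic dataframe.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [25, 40, 31], "Survived": [1, 0, 1]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

# Boolean Series: True for passengers aged 30 or more.
is_30_or_older = toy["Age"] >= 30

# where(): same number of rows; rows failing the condition become NaN.
masked = toy.where(is_30_or_older)

# loc[]: only the rows that satisfy the condition are returned.
selected = toy.loc[is_30_or_older]

print(masked)
print(selected)
```

In this sketch `masked` still has three rows, while `selected` has two. The integer columns in `masked` are converted to floats so that they can hold NaN.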


## Basic Analysis using Aggregate functions

Let's learn how to perform basic data analysis. Before we dive in, what exactly is an aggregate function?

An aggregate function combines many values into a single summary value. Aggregation is useful for understanding the overall properties of a dataset.

Some examples of aggregate functions are ```sum()```, ```count()```, ```min()```, ```max()```,  ```mean()```, ```std()```, etc.
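
As a quick preview, the sketch below applies all of these functions at once to a small, made-up Series of fares (hypothetical values, not taken from the dataset) using `agg()`:

```python
import pandas as pd

# Hypothetical fares used only for illustration.
fares = pd.Series([10.0, 20.0, 30.0, 40.0], name="Fare")

# agg() with a list of function names returns one summary value per function.
summary = fares.agg(["sum", "count", "min", "max", "mean", "std"])
print(summary)
```

Note that `std()` in Pandas computes the sample standard deviation (it divides by `n - 1`) by default.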

### Aggregate functions
#### Sum
##### 1. Total fares

In [None]:
df['Fare'].sum()

28686.024299999997

We can save this value to a variable using variable assignment.

In [None]:
total_fares = df['Fare'].sum()
print(total_fares)

28686.024299999997


##### 2. Total number of survived passengers
Because the `Survived` column contains 1 for passengers who survived and 0 for those who did not, we can count the survivors by summing up that column.

In [None]:
survived_passengers = df['Survived'].sum()
survived_passengers

341

#### Max / Min

Let's calculate the maximum and minimum of the `Fare` column.

In [None]:
df['Fare'].max()

512.3292

In [None]:
df['Fare'].min()

0.0

#### Mean
##### 1. Average survival rate

In [None]:
df['Survived'].mean()

0.3831460674157303

##### 2. Average age

In [None]:
df['Age'].mean()

29.704305750350628

##### 3. Average survival rate by age group

Now, we will tackle a more complex problem.

Let's calculate the survival rate for each *age* group.
We can apply the filtering that we just learned to split the passengers into two groups based on age.

Then we will calculate the survival rate within each group.

In [None]:
filter_over_30 = df['Age'] >= 30
filter_under_30 = df['Age'] < 30

df_over_30yrs = df.loc[filter_over_30]
df_under_30yrs = df.loc[filter_under_30]

Now that we have split the data points by age, we will apply `mean()` to get the mean of the `Survived` column for each group.

In [None]:
df_over_30yrs['Survived'].mean()

0.40606060606060607

In [None]:
df_under_30yrs['Survived'].mean()

0.4046997389033943

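There's not much difference between the two groups. Note that passengers with a missing `Age` fall into neither group, because any comparison with NaN evaluates to False.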
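
The following sketch, on hypothetical toy data, shows what happens to rows with a missing age when we split by a threshold. Such rows match neither `>= 30` nor `< 30`, so they need to be selected separately with `isna()` if we want to account for them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age.
toy = pd.DataFrame(
    {"Age": [22.0, 45.0, np.nan, 31.0, 8.0], "Survived": [1, 0, 1, 1, 0]}
)

over = toy.loc[toy["Age"] >= 30]      # ages 45 and 31
under = toy.loc[toy["Age"] < 30]      # ages 22 and 8
unknown = toy.loc[toy["Age"].isna()]  # the row with a missing age

# The three groups together cover every row exactly once.
print(len(over), len(under), len(unknown))
print(over["Survived"].mean(), under["Survived"].mean())
```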

### Groupby

`df.groupby(['column_name']).aggregate_function()`

Now, we will group by `Sex` to see if there's any difference between female and male passengers.

Here, `groupby` splits the dataset into groups based on the specified column, `Sex`, and the aggregate function is then applied to each group.

In [None]:
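# numeric_only=True skips text columns such as Name and Ticket (needed in pandas 2.0 and later)
df.groupby(['Sex']).mean(numeric_only=True)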

| Sex | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| female | 0.741214 | 2.15655 | 27.923077 | 0.696486 | 0.651757 | 44.596606 |
| male | 0.188908 | 2.389948 | 30.726645 | 0.429809 | 0.235702 | 25.523893 |


We can see that the average survival rate for female passengers is much higher than that for male passengers.

If we are interested in the survival rate only, we can select that column with `[ ]` after the `groupby` call, so the mean is computed for that column only.

In [None]:
# group by 'Sex' and calculate mean of 'Survived'
df.groupby(['Sex'])['Survived'].mean()

Sex
female    0.741214
male      0.188908
Name: Survived, dtype: float64

We can also apply `groupby` on multiple columns.

In [None]:
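# numeric_only=True skips text columns such as Name and Ticket (needed in pandas 2.0 and later)
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)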

| Sex | Pclass | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| female | 1 | 0.968085 | 34.611765 | 0.553191 | 0.457447 | 106.125798 |
| female | 2 | 0.921053 | 28.722973 | 0.486842 | 0.605263 | 21.970121 |
| female | 3 | 0.496503 | 21.707921 | 0.902098 | 0.804196 | 16.176109 |
| male | 1 | 0.368852 | 41.281386 | 0.311475 | 0.278689 | 67.226127 |
| male | 2 | 0.157407 | 30.740707 | 0.342593 | 0.222222 | 19.741782 |
| male | 3 | 0.135447 | 26.507589 | 0.498559 | 0.224784 | 12.661633 |


In both groups (female and male), the survival rate was highest in Pclass 1 and lowest in Pclass 3!
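
To see the grouping mechanics on something small enough to check by hand, here is a sketch on a hypothetical six-row dataframe (all names and values are made up). It groups by one column and then by two, and uses `numeric_only=True` so the text column `Name` is left out of the averages:

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns.
toy = pd.DataFrame(
    {
        "Name": ["Ann", "Bea", "Carl", "Dan", "Ed", "Fay"],
        "Sex": ["female", "female", "male", "male", "male", "female"],
        "Pclass": [1, 3, 1, 3, 3, 1],
        "Survived": [1, 0, 1, 0, 0, 1],
    }
)

# One grouping column, one selected column: a Series indexed by Sex.
by_sex = toy.groupby(["Sex"])["Survived"].mean()

# Two grouping columns: one row per (Sex, Pclass) combination.
by_sex_class = toy.groupby(["Sex", "Pclass"]).mean(numeric_only=True)

print(by_sex)
print(by_sex_class)
```

Only combinations that actually occur in the data appear in the result, so this toy example has four rows rather than six.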

### Now Try This

At this point, some of us might be curious:

Which tickets were more expensive: those in the lower-numbered classes (such as Pclass 1) or those in the higher-numbered classes (such as Pclass 3)?
We can answer this question by calculating the mean fare for each class.

**Calculate the mean fares for each Pclass!**

### Let's pause and think!
Did you see any correlation between Pclass, fare, and survival rate? Briefly describe what you think about this correlation.

## Takeaways

Using DataFrames and aggregate functions in Pandas, we can filter and summarize our data in many different ways and draw conclusions from it.

In later weeks, we will dive deeper into Pandas and use it to analyze a more complicated dataset.

## Resources
- [About Titanic](https://en.wikipedia.org/wiki/Titanic)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Pandas API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)
- [A Gentle Introduction to Pandas](https://medium.com/@wbusaka/a-gentle-introduction-to-pandas-5ed17421a59d)

## Appendix

### How to download a dataset from Kaggle

If you want to work with different datasets, check out Kaggle! Kaggle is the world's largest data science community and features many different datasets for public use. 

#### Step 1: Go to [Kaggle.com](https://Kaggle.com).

#### Step 2: Click the Data tab.

![Kaggle Data](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/Kaggle_data.png?raw=1)

#### Step 3: Search for a dataset that interests you or explore the most popular datasets on the main page.

![Search page](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/search_page.png?raw=1)

#### Step 4: Once you select a dataset, you can read about its context and download it. If you click the Download button at the top, all of the files in the dataset will be downloaded.

![US Census main](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/US_Census_main.png?raw=1)

#### Step 5: If you want to download only a specific file, hit the download icon for that file in the selected dataset.

![Download specific](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/download_specific.png?raw=1)

#### Step 6: If you'd like to see all columns, check ```select all``` to show every column in the dataset.

![View columns](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/view_columns.png?raw=1)

#### Step 7: If your dataset is a CSV file, use ```read_csv()``` to load it. If your dataset is an Excel file, use ```read_excel()``` instead.
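
As a minimal sketch of the loading step, the example below reads CSV text from an in-memory string so that it runs without a downloaded file. The column names and values are hypothetical; with a real download you would pass the path of your file to `read_csv()` instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a downloaded file.
csv_text = "PassengerId,Name,Age\n1,Ann,25\n2,Bob,40\n"

# index_col tells Pandas which column to use as the row labels.
df_example = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")
print(df_example)
```

`read_excel()` is called in a similar way, but it typically needs an additional package such as `openpyxl` to be installed for `.xlsx` files.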