<a href="https://colab.research.google.com/github/bitprj/DigitalHistory/blob/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/Week3-Introduction-to-Data-Manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/bitproject.png?raw=1" width="200" align="left"> 
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/data-science.jpg?raw=1" width="300" align="right">

# <div align="center">Data Manipulation with Pandas</div>




## Table of Contents
- Why, Where and How we use Pandas
  - What is Pandas?
  - Data structures in Pandas
- What we will be learning today
  - About the dataset
  - Goals
- Importing Pandas library
- Loading a file
- Setting an index
- Getting info about the dataset
- Removing NaN (None) values
  - **1.0 - Now Try This**
- Removing a column
- Selecting subsets of data
  - **2.0 - Now Try This**
  - **3.0 - Now Try This**
  - **4.0 - Now Try This**
- Filtering dataset based on criteria
  - **5.0 - Now Try This**
  - **6.0 - Now Try This**
- Aggregation functions
  - Sum
  - Min / Max
  - Mean
  - **7.0 - Now Try This**
- Practical Exercise
  - About the dataset
  - Setting an index
  - **8.0 - Now Try This**
  - Aggregate functions
  - **9.0 - Now Try This**
  - **10.0 - Now Try This**
  - **11.0 - Now Try This**

## Why, Where and How we use Pandas
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/pandas.png?raw=1" width="200" align="center"> 

### What is Pandas?
This week, we will cover basic data manipulation with Pandas.
1. Pandas is an open source data analysis and manipulation tool that is widely used in both academia and industry.
2. It is built on top of the Python programming language. 
3. It offers data structures and operations for manipulating numerical tables and time series.

### Data structures in Pandas
Pandas provides two main data structures: Series and DataFrame. Older versions also included a third one, Panel.

1. A Series is a 1-dimensional labelled array, comparable to a single column in Excel. It can hold data of **any type** (integers, strings, Python objects, etc.), and its labels are called the index.
2. A DataFrame is a 2-dimensional labelled data structure with both rows and columns, comparable to a table or spreadsheet of tabular data.
3. A Panel was a 3-dimensional structure. It has been removed from recent versions of Pandas.

This week, we will focus on DataFrame, and we will learn about Series in later weeks. We will not cover Panel this semester, since it is no longer part of Pandas.
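
To make the difference between the two structures concrete, here is a minimal sketch. The values and the names `ages` and `passengers` are made up for illustration and are not part of the Titanic dataset.

```python
import pandas as pd

# A Series: one labelled column of values (hypothetical ages)
ages = pd.Series([22.0, 38.0, 26.0], index=[1, 2, 3], name="Age")

# A DataFrame: several columns that share the same row labels
passengers = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]},
    index=[1, 2, 3],
)

print(ages.index.tolist())      # the row labels of the Series
print(type(passengers["Age"]))  # a single DataFrame column is a Series
print(passengers.shape)         # (number of rows, number of columns)
```

Selecting one column of a DataFrame gives back a Series, which is why the two structures are usually learned together.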


Since we've covered the fundamentals of Python, it will be fairly easy to pick up Pandas.

## What we will be learning today
### Goals:
- Getting a quick overview of the dataset 
- Removing columns / rows with NaN values
- Selecting and filtering based on criteria
- Analyzing the survival rates in the Titanic dataset

## About the dataset:

## Titanic
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/titanic.png?raw=1" width="300" align="right"> 
To begin with Pandas and dataframes, we will use a dataset about the Titanic. The Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died.


This dataset does not cover everyone aboard. It contains 891 passengers, with the following information for each: name, age, gender, ticket price, and, most importantly, whether or not they survived.

Since each person has a unique PassengerId, each row represents one passenger.





## Import Pandas library

To read / load a file, we first need to import Pandas.

By convention, the Pandas library is imported with ```import pandas as pd```.

In [1]:
import pandas as pd

Once we've imported Pandas, we can use ```pd``` to call any function in Pandas.



## Load file

To read the CSV file with our data, we will use the ```read_csv``` function.

Since we are working with only one dataset, we will simply name the dataframe ```df```.

If we were working with several dataframes, it would be better to give each one a meaningful name (e.g. titanic_data, passenger_info, etc.)

In [4]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/data/titanic-dataset.csv'
df = pd.read_csv(url)
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


#### Set index

Now, the dataset is loaded as a dataframe named ```df```.

The first column of the output is the index. By default, Pandas numbers the rows starting from 0.

But, as you can tell, PassengerId already identifies each row uniquely. So, let's set PassengerId as the index.

We can call the ```set_index``` method and specify the index column using ```keys=```

In [None]:
df.set_index(keys='PassengerId')

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


The code above seems to have worked! PassengerId is displayed as the new index.

Let's call df one more time to make sure that df has been updated to reflect the change.

In [None]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


IMPORTANT: df has NOT been updated. Do you know why?

```df.set_index(keys='PassengerId')```: this call returns a NEW dataframe with PassengerId as the index. It does not modify ```df``` itself.

Since we didn't save the result of the call, df has NOT been updated.

There are two ways to save the change.
1. ```df = df.set_index(keys='PassengerId')```
2. ```df.set_index(keys='PassengerId', inplace=True)```

The first call reassigns the variable ```df``` to the updated dataframe, and the second call makes the change in-place.

In [None]:
df.set_index(keys='PassengerId', inplace=True)
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Okay, now PassengerId is set as the index!
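
The difference between calling a method and saving its result can be checked on a small example. This is a minimal sketch with a made-up three-row dataframe named `toy`, not the Titanic data.

```python
import pandas as pd

toy = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})

# set_index returns a new DataFrame; `toy` itself is unchanged
indexed = toy.set_index(keys="PassengerId")
print(toy.index.name)      # None
print(indexed.index.name)  # PassengerId

# Reassigning keeps the result under the same name
toy = toy.set_index(keys="PassengerId")
print(list(toy.columns))   # ['Survived']
```

After the reassignment, PassengerId is no longer listed among the columns, because it has moved into the index.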

## Basic info about the dataset

Now, let's get basic information about the dataframe.
- head()
- describe()
- info()

### head()
The ```head()``` function is useful for seeing the dataset at a quick glance, as it returns the first n rows.

Let's check what columns this file has by calling ```head()``` function.

By default, ```head()``` returns the first 5 rows.

In [None]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


You can specify the number of rows to display by calling ```df.head(number)```

In [None]:
df.head(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


### info()

Now, we know what's in the dataset and what it looks like.

To summarize what information is available in the dataset, we can use the info() function.

This function is useful because it lists all of the **column names** and **their types**, as well as the **non-null** count of each column.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


We can tell that "Age" and "Cabin" have lots of missing values; the dataset only has data for 714 ages and 204 cabins for the 891 passengers.

If we take a closer look at dtypes in the second to last row, there are three dtypes: ```int64```, ```float64```, and ```object```.

We covered ```int``` and ```float``` last week in Python, but what is an object?

- ```int64```: integer numbers
- ```float64```: floating point numbers
- ```object```: string or mixed numeric and non-numeric values.

That's why the dtype of "Name", "Sex", and "Embarked" is ```object```: they hold strings.

"Ticket" and "Cabin" are also ```object``` columns, because their values are either digits stored as text or a mix of letters and digits (e.g. A/5 21171, C85)
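
The dtype of each column can also be inspected directly. The sketch below builds a small, made-up dataframe named `toy` with one integer, one float, and one text column. The text column is reported as ```object``` in the output above; newer Pandas releases may report a dedicated string type instead, so the example only checks whether each column is numeric.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 3],
        "Fare": [7.25, 71.2833, 7.925],
        "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    }
)

print(toy.dtypes)  # one dtype per column

# is_numeric_dtype tells us whether arithmetic such as mean() makes sense
for column in toy.columns:
    print(column, pd.api.types.is_numeric_dtype(toy[column]))
```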



### describe()

```describe()``` is used to view summary statistics of numeric columns. This helps us to have a general idea of the dataset.

```count```, ```mean```, ```min```, and ```max``` are straightforward.

Let's refresh our memory with statistical concepts.

- ```std```: standard deviation - measures the dispersion of a dataset relative to its mean. If the data points are further from the mean, there is a higher deviation within the dataset. The more spread out the data, the higher the standard deviation.
- ```25%```: the value below which 25% of the observations may be found. 
- ```50%```: the value below which 50% of the observations may be found. 
- ```75%```: the value below which 75% of the observations may be found. 

For example, the 25th percentile of age is 20.125 and the 75th percentile of age is 38. This means that 25% of the passengers with a recorded age were younger than 20.125, and 75% were younger than 38.

In [None]:
df.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292
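
The percentile rows reported by ```describe()``` can be reproduced one at a time with ```quantile()```. This is a minimal sketch on a made-up Series of eight ages (one of them missing), not on the Titanic data.

```python
import pandas as pd

# Hypothetical ages; None becomes NaN and is ignored by the statistics
ages = pd.Series([2.0, 14.0, 22.0, 26.0, 35.0, 38.0, 54.0, None])

print(ages.count())          # 7, the missing value is not counted
print(ages.quantile(0.25))   # 18.0, halfway between 14 and 22
print(ages.quantile(0.50))   # 26.0, the median
print(ages.describe()["50%"])  # 26.0, same value as reported by describe()
```

This also explains why the count for "Age" in the table above is 714 rather than 891: missing values are left out.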


### shape
To see the size of the dataset, we can use the ```shape``` attribute, which returns the number of rows and columns in the format (# rows, # columns)

This dataset has 891 rows (entities) and 11 columns. PassengerId is not counted as a column, because it is now the index.

In [None]:
df.shape

(891, 11)

## Remove NaN values

Often, when we work with large datasets, we will encounter lots of missing elements (NaN / null) in the dataset.

Dropping the rows that contain NaN values allows us to work with a clean dataset.


---


Let's remove the rows that do not provide meaningful information.

When we know a "unique key" of the dataset (PassengerID in this dataset), we can check whether all elements have PassengerID. If any of the rows are missing PassengerID, then we can drop that entity.

* ```df.dropna()```: drop the rows where at least one of the elements is missing.
* ```df.dropna(how='all')```: drop the rows where all of the elements are missing.
* ```df.dropna(subset=[columns])```: define in which columns to look for missing values.

If we want to drop the rows with **at least** one missing element:

In [None]:
df.dropna()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


If we want to drop the rows with **all** elements missing:

In [None]:
df.dropna(how='all')

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


If we want to drop the rows that are missing a 'Survived' value and check the size of the result:

In [None]:
df.dropna(subset=['Survived']).shape

(891, 11)

Yay, we've confirmed that all of the passengers have a 'Survived' value, since the number of rows remains the same.

If we want to update the dataset after dropping rows, we can use ```inplace=True```

In [None]:
df.dropna(subset=['Survived'], inplace=True)
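
The three ```dropna``` variants are easier to compare on a dataframe that actually has gaps in different places. Here is a minimal sketch, assuming a made-up four-row dataframe named `toy`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Survived": [0, 1, np.nan, np.nan],
        "Age": [22.0, np.nan, 26.0, np.nan],
    },
    index=[1, 2, 3, 4],
)

any_missing = toy.dropna()                   # keeps row 1 only
all_missing = toy.dropna(how="all")          # drops row 4 only
by_column = toy.dropna(subset=["Survived"])  # keeps rows 1 and 2

print(len(any_missing), len(all_missing), len(by_column))  # 1 3 2
print(len(toy))  # 4, none of the calls above modified `toy`
```

As with ```set_index```, each call returns a new dataframe unless ```inplace=True``` is passed or the result is assigned back.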

### 1.0 - Now Try This

- Drop the rows that are missing a value in the 'Pclass' column
- Update ```df```

## Removing a column

Before we dive into the data analysis, let's see if there are any columns we want to remove. 

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


In the "Cabin" column, only 204 rows are non-null. That means 891 - 204 = 687 values are missing in the column.

It wouldn't give us as much meaningful insight as the other columns, so let's remove the "Cabin" column by using the ```del``` statement.

In [None]:
del df['Cabin']
df

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C


## Select subsets of data

When we are interested in a few columns to do the data analysis, we can select a specific subset of columns using two methods:

1. by index location
2. by column names

#### 1. by index location
We can select specific subsets of data using ```iloc[rows_index, columns_index]```.

As we learned last week, ```[:]``` selects everything in a list or string in Python. 

Similarly, ```[:]``` will select every row or column depending on where we put it.

Let's select PassengerId, Survived, and Pclass.

In [None]:
# by index location (iloc)
df.iloc[: , [0,1,2]]

PassengerId,Survived,Pclass,Name
1,0,3,"Braund, Mr. Owen Harris"
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
3,1,3,"Heikkinen, Miss. Laina"
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
5,0,3,"Allen, Mr. William Henry"
...,...,...,...
887,0,2,"Montvila, Rev. Juozas"
888,1,1,"Graham, Miss. Margaret Edith"
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie"""
890,1,1,"Behr, Mr. Karl Howell"


We selected 0, 1, and 2 because PassengerId, Survived, and Pclass are the first 3 columns.

Hmm, but 4 columns showed up. Let's look into why!

Because PassengerId is now the index, it is always displayed, and it is not counted as a column.

So, index location 0 refers to the first column right after the index, which is Survived.
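
To see how ```iloc``` counts positions, here is a minimal sketch on a made-up dataframe named `toy` that, like ```df```, uses PassengerId as its index.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

first_two = toy.iloc[:, [0, 1]]   # all rows, first two columns
print(list(first_two.columns))    # ['Survived', 'Pclass']
print(first_two.index.name)       # PassengerId, the index comes along for free

first_row = toy.iloc[[0], :]      # first row, all columns
print(first_row.index.tolist())   # [1]
```

Positions given to ```iloc``` always start at 0, regardless of the labels in the index.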

### 2.0 - Now Try This

Select PassengerId, Survived, and Pclass with all rows.

### 3.0 - Now Try This

Select PassengerId, Survived, and Pclass with all rows.
Please use a colon ```:``` this time.

#### 2. by column names

Let's select subsets of data by column names.

We are interested in PassengerId, Survived, Sex, and Age.

In [None]:
# by column names
df[['PassengerId', 'Survived', 'Sex', 'Age']]

The code above doesn't work. What does the KeyError say?

```['PassengerId'] not in index.```

Remember? PassengerId is no longer a column, so we can't select it by a column name.

### 4.0 - Now Try This

Select PassengerId, Survived, Sex, and Age.



## Filter Dataset based on criteria

Often, we are interested in working only with the rows that meet certain criteria.

If we only want to look at the data with Age > 30, we can specify the criteria within the ```loc``` indexer.

In [None]:
df_over_30yrs=df.loc[df['Age'] > 30]
df_over_30yrs

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,S
...,...,...,...,...,...,...,...,...,...,...
874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C
882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,S
886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,Q


Now, let's select rows using two criteria: "Age" is greater than 30 AND "Survived" is 1.

When filtering a dataframe, ```&``` is equivalent to ```AND``` and ```|``` is equivalent to ```OR```.


In [None]:
df_over_30yrs_survived = df.loc[(df['Age'] > 30) & (df['Survived'] == 1)]
df_over_30yrs_survived

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,S
16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,S
22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,S
...,...,...,...,...,...,...,...,...,...,...
858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,S
863,1,1,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",female,48.0,0,0,17466,25.9292,S
866,1,2,"Bystrom, Mrs. (Karolina)",female,42.0,0,0,236852,13.0000,S
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,S



IMPORTANT: When filtering with multiple conditions, make sure to use ```()``` on each condition. 

Otherwise, you will get an error message that ```The truth value of a Series is ambiguous.```
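
The reason is that ```&``` and ```|``` bind more tightly than comparisons such as ```>``` and ```==``` in Python, so without parentheses the expression is grouped in an unintended way. The sketch below uses a made-up four-row dataframe named `toy` to show what each condition produces.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 58.0, 4.0], "Survived": [0, 1, 1, 1]},
    index=[1, 2, 12, 11],
)

over_30 = toy["Age"] > 30   # a boolean Series, one value per row
print(over_30.tolist())     # [False, True, True, False]

both = toy.loc[(toy["Age"] > 30) & (toy["Survived"] == 1)]
either = toy.loc[(toy["Age"] > 30) | (toy["Survived"] == 1)]
print(both.index.tolist())    # [2, 12]
print(either.index.tolist())  # [2, 12, 11]
```

Each condition is evaluated for every row first, and ```loc``` then keeps only the rows where the combined result is True.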

Let's check how many passengers survived among the ones whose age was over 30.

In [None]:
print("# of passengers whose age was over 30: ", df_over_30yrs.shape[0])
print("# of survived passengers whose age was over 30: ", df_over_30yrs_survived.shape[0])

# of passengers whose age was over 30:  305
# of survived passengers whose age was over 30:  124


### 5.0 - Now Try This

Select the rows that meet the following condition:
- Pclass is not 1

### 6.0 - Now Try This

Select the rows that meet either of the following conditions:
- Age is less than 10 
- OR
- Age is greater than 50

Hint: Don't forget the parentheses!

## Aggregation functions

Aggregation is the process of combining many values into a single summary value, such as a total or an average. It's useful for understanding the overall properties of a dataset and analyzing it.

Some examples of aggregation are ```sum()```, ```count()```, ```min()```, ```max()```,  ```mean()```, ```std()```, etc.

### Sum
#### 1. Total fares

In [None]:
df['Fare'].sum()

28693.9493

If we want to round total fares and save it as a variable, then we can try:

In [None]:
total_fares = df['Fare'].sum()
print("Total fares: ", round(total_fares))

Total fares:  28694.0


#### 2. Survived passengers
We can also count the number of passengers that survived by summing up the 'Survived' column.

In [None]:
survived_passengers = df['Survived'].sum()
survived_passengers

342

### Max / Min

Let's calculate max and min of Fare.

In [None]:
df['Fare'].max()

512.3292

In [None]:
df['Fare'].min()

0.0

### Mean
Let's calculate the survival rate of all passengers.

In [None]:
df['Survived'].mean()

0.3838383838383838
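
Why do ```sum()``` and ```mean()``` give a count and a rate here? Because 'Survived' only contains 0s and 1s. Here is a minimal sketch with a made-up Series of five passengers.

```python
import pandas as pd

# Hypothetical outcomes: 1 = survived, 0 = did not survive
survived = pd.Series([0, 1, 1, 0, 1], name="Survived")

print(survived.sum())   # 3, adding the 1s counts the survivors
print(len(survived))    # 5 passengers in total
print(survived.mean())  # 0.6, the share of survivors (3 out of 5)
```

The same reasoning applies to the full dataset: 342 survivors out of 891 passengers gives the rate of about 0.384 shown above.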

Let's calculate the average age of all passengers.

In [None]:
df['Age'].mean()

29.69911764705882

Now, we will tackle a more complex problem.

Let's calculate the survival rate by *age* group.
We can apply the filtering that we just learned to select two groups: passengers older than 30, and passengers aged 30 or younger.

In each group, we will calculate the mean of 'Survived' column.

In [None]:
# filtering
df_over_30yrs = df.loc[df['Age'] > 30]
df_under_30yrs = df.loc[df['Age'] <= 30]

# calculating mean of 'Survived' for each group
mean_over_30 = df_over_30yrs['Survived'].mean()
mean_under_30 = df_under_30yrs['Survived'].mean()

# printing the mean survival rates for each group
# round(number, decimal_points): round a number to a given precision in decimal_points
print("Survival rate - age over 30: ", round(mean_over_30*100, 3), "%")
print("Survival rate - age under 30: ", round(mean_under_30*100, 3), "%")

Survival rate - age over 30:  40.656 %
Survival rate - age under 30:  40.587 %


There's not much difference between the two groups.

### Groupby

Now, we will group by *sex* to see if there's any difference between female and male.

Here, we use the ```groupby``` method, which lets us group the dataset by a column ('Sex') before applying an aggregate function.

In [None]:
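# numeric_only=True skips text columns such as Name and Ticket, which cannot be averaged
df.groupby(['Sex']).mean(numeric_only=True)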

Sex,Survived,Pclass,Age,SibSp,Parch,Fare
female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893


If we want to sort the aggregated result by a column, we can use ```sort_values(by=column_name)```.

Let's sort the result above by "Survived"

In [None]:
df.groupby(['Sex']).mean(numeric_only=True).sort_values(by="Survived")

Sex,Survived,Pclass,Age,SibSp,Parch,Fare
male,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893
female,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818


If we are interested in the survival rate of each group, we can use ```[ ]``` after the groupby call to specify which column to display.

In [None]:
# group by 'Sex' and calculate mean of 'Survived'
df.groupby(['Sex'])['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

There was a significant difference in the survival rate by *sex*!
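
To check what ```groupby``` does step by step, here is a minimal sketch on a made-up four-row dataframe named `toy`. The text column `Name` is included on purpose, to show why it helps to either select the column of interest or pass ```numeric_only=True```.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Survived": [0, 1, 1, 1],
        "Name": ["A", "B", "C", "D"],
    }
)

# Select one column after grouping, then aggregate it
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex["female"])  # 1.0
print(rate_by_sex["male"])    # 0.5

# Aggregate every numeric column; the text column Name is left out
all_means = toy.groupby("Sex").mean(numeric_only=True)
print(list(all_means.columns))  # ['Survived']

# ascending=False puts the group with the highest rate first
print(rate_by_sex.sort_values(ascending=False).index.tolist())  # ['female', 'male']
```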

We can also apply groupby on multiple columns.

In [None]:
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)

Sex,Pclass,Survived,Age,SibSp,Parch,Fare
female,1,0.968085,34.611765,0.553191,0.457447,106.125798
female,2,0.921053,28.722973,0.486842,0.605263,21.970121
female,3,0.5,21.75,0.895833,0.798611,16.11881
male,1,0.368852,41.281386,0.311475,0.278689,67.226127
male,2,0.157407,30.740707,0.342593,0.222222,19.741782
male,3,0.135447,26.507589,0.498559,0.224784,12.661633


In both groups (female and male), the survival rate was highest in Pclass 1 and lowest in Pclass 3!

### 7.0 - Now Try This

Then, some of us might be curious:

Which is more expensive: a lower Pclass number or a higher Pclass number?

We can answer the question by calculating the mean fares for each class.

**Calculate the mean fares for each Pclass!**

### Let's pause and think!
Did you see any correlation between Pclass, Fare, and Survival rate? Briefly describe what you have found here.

## Takeaways

Using DataFrames and aggregate functions in Pandas, we can answer many of the questions that might come up about a dataset!

In the tutorial section, we will apply what we have learned in Pandas and further analyze a new dataset.

# Practical Exercise

### About the dataset

In the tutorial, we will use the US Census Demographic Data.

The data was collected by the US Census Bureau, and it includes data from the entire country.

This dataset covers lots of areas: state, county, gender, ethnicity, professional working fields, means of transportation to work, and employment.

All of this information is available at a State and County level. There are many questions that we could try to answer with the dataset:
- Unemployment by state
- Professional fields by state and county
- Means of transportation to work by county in CA
- ...

### Objective
Since the dataset covers all of the states in the US, we are going to select the 5 largest states by population. 
Once we've selected the top five states, we will examine the residents' means of transportation to work at a state and county level.

That's our focus in the tutorial, but feel free to play around with it as you'd like.

Let's load our data first!

### Load file

In [None]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/data/acs2017_county_data.csv'
df = pd.read_csv(url)
df

Unnamed: 0,CountyId,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,VotingAgeCitizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001,Alabama,Autauga County,55036,26899,28137,2.7,75.4,18.9,0.3,0.9,0.0,41016,55317,2838,27824,2024,13.7,20.1,35.3,18.0,23.2,8.1,15.4,86.0,9.6,0.1,0.6,1.3,2.5,25.8,24112,74.1,20.2,5.6,0.1,5.2
1,1003,Alabama,Baldwin County,203360,99527,103833,4.4,83.1,9.5,0.8,0.7,0.0,155376,52562,1348,29364,735,11.8,16.1,35.7,18.2,25.6,9.7,10.8,84.7,7.6,0.1,0.8,1.1,5.6,27.0,89527,80.7,12.9,6.3,0.1,5.5
2,1005,Alabama,Barbour County,26201,13976,12225,4.2,45.7,47.8,0.2,0.6,0.0,20269,33368,2551,17561,798,27.2,44.9,25.0,16.8,22.6,11.5,24.1,83.4,11.1,0.3,2.2,1.7,1.3,23.4,8878,74.1,19.1,6.5,0.3,12.4
3,1007,Alabama,Bibb County,22580,12251,10329,2.4,74.6,22.0,0.4,0.0,0.0,17662,43404,3431,20911,1889,15.2,26.6,24.4,17.6,19.7,15.9,22.4,86.4,9.5,0.7,0.3,1.7,1.5,30.0,8171,76.0,17.4,6.3,0.3,8.2
4,1009,Alabama,Blount County,57667,28490,29177,9.0,87.4,1.5,0.3,0.1,0.0,42513,47412,2630,22021,850,15.6,25.4,28.5,12.9,23.3,15.8,19.5,86.8,10.2,0.1,0.4,0.4,2.1,35.0,21380,83.9,11.9,4.0,0.1,4.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3215,72145,Puerto Rico,Vega Baja Municipio,54754,26269,28485,96.7,3.1,0.1,0.0,0.0,0.0,42838,18900,1219,10197,576,43.8,49.4,28.6,20.2,25.9,11.1,14.2,92.0,4.2,0.9,1.4,0.6,0.9,31.6,14234,76.2,19.3,4.3,0.2,16.8
3216,72147,Puerto Rico,Vieques Municipio,8931,4351,4580,95.7,4.0,0.0,0.0,0.0,0.0,7045,16261,2414,11136,1459,36.8,68.2,20.9,38.4,16.4,16.9,7.3,76.3,16.9,0.0,5.0,0.0,1.7,14.9,2927,40.7,40.9,18.4,0.0,12.8
3217,72149,Puerto Rico,Villalba Municipio,23659,11510,12149,99.7,0.2,0.1,0.0,0.0,0.0,18053,19893,1935,10449,1619,50.0,67.9,22.5,21.2,22.7,14.1,19.5,83.1,11.8,0.1,2.1,0.0,2.8,28.4,6873,59.2,30.2,10.4,0.2,24.8
3218,72151,Puerto Rico,Yabucoa Municipio,35025,16984,18041,99.9,0.1,0.0,0.0,0.0,0.0,27523,15586,1467,8672,702,52.4,62.1,27.7,26.0,20.7,9.5,16.0,87.6,9.2,0.0,1.4,1.8,0.1,30.5,7878,62.7,30.9,6.3,0.0,25.4


### 8.0 - Now Try This

CountyId is a unique identifier for each county and state.
- Set CountyId as an index.
- Update df

Hint: Use ```set_index``` and don't forget to update df
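
The "don't forget to update df" part matters because `set_index` returns a new DataFrame by default and leaves the original unchanged. A minimal sketch on a made-up table (the names `toy`, `id`, and `name` are hypothetical):

```python
import pandas as pd

# Made-up table with a unique id column (hypothetical values)
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["A", "B", "C"],
})

# set_index returns a new DataFrame; `toy` itself is not modified here
indexed = toy.set_index("id")
print(toy.columns.tolist())   # 'id' is still a regular column in toy
print(indexed.index.name)     # 'id' is the index of the new DataFrame

# Assigning the result back to the same name is what "updates" the variable
toy = toy.set_index("id")
```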

### Basic info about the dataset


In [None]:
df.head()

Unnamed: 0_level_0,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,VotingAgeCitizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
CountyId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1
1001,Alabama,Autauga County,55036,26899,28137,2.7,75.4,18.9,0.3,0.9,0.0,41016,55317,2838,27824,2024,13.7,20.1,35.3,18.0,23.2,8.1,15.4,86.0,9.6,0.1,0.6,1.3,2.5,25.8,24112,74.1,20.2,5.6,0.1,5.2
1003,Alabama,Baldwin County,203360,99527,103833,4.4,83.1,9.5,0.8,0.7,0.0,155376,52562,1348,29364,735,11.8,16.1,35.7,18.2,25.6,9.7,10.8,84.7,7.6,0.1,0.8,1.1,5.6,27.0,89527,80.7,12.9,6.3,0.1,5.5
1005,Alabama,Barbour County,26201,13976,12225,4.2,45.7,47.8,0.2,0.6,0.0,20269,33368,2551,17561,798,27.2,44.9,25.0,16.8,22.6,11.5,24.1,83.4,11.1,0.3,2.2,1.7,1.3,23.4,8878,74.1,19.1,6.5,0.3,12.4
1007,Alabama,Bibb County,22580,12251,10329,2.4,74.6,22.0,0.4,0.0,0.0,17662,43404,3431,20911,1889,15.2,26.6,24.4,17.6,19.7,15.9,22.4,86.4,9.5,0.7,0.3,1.7,1.5,30.0,8171,76.0,17.4,6.3,0.3,8.2
1009,Alabama,Blount County,57667,28490,29177,9.0,87.4,1.5,0.3,0.1,0.0,42513,47412,2630,22021,850,15.6,25.4,28.5,12.9,23.3,15.8,19.5,86.8,10.2,0.1,0.4,0.4,2.1,35.0,21380,83.9,11.9,4.0,0.1,4.9


Since there are so many columns, ```head()``` doesn't display all of them.

Let's use ```info()```, which lists **ALL** of the **column names** along with their types and non-null counts. 



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3220 entries, 1001 to 72153
Data columns (total 36 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   State             3220 non-null   object 
 1   County            3220 non-null   object 
 2   TotalPop          3220 non-null   int64  
 3   Men               3220 non-null   int64  
 4   Women             3220 non-null   int64  
 5   Hispanic          3220 non-null   float64
 6   White             3220 non-null   float64
 7   Black             3220 non-null   float64
 8   Native            3220 non-null   float64
 9   Asian             3220 non-null   float64
 10  Pacific           3220 non-null   float64
 11  VotingAgeCitizen  3220 non-null   int64  
 12  Income            3220 non-null   int64  
 13  IncomeErr         3220 non-null   int64  
 14  IncomePerCap      3220 non-null   int64  
 15  IncomePerCapErr   3220 non-null   int64  
 16  Poverty           3220 non-null   floa

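There are 36 columns and 3220 rows, one per county.

Also, every column shown above has 3220 non-null values, so there is almost no missing data. That's great! (The ```describe()``` output below reports a count of 3219 for ChildPoverty, which points to a single missing value in that column.)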
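
When the ```info()``` output is long, a more direct check is to count the missing values in each column. The sketch below uses a made-up table with hypothetical column names; `isna()` and `sum()` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value (hypothetical column names)
toy = pd.DataFrame({
    "county": ["A", "B", "C"],
    "poverty": [12.5, np.nan, 20.1],
    "income": [50000, 42000, 61000],
})

# isna() marks missing cells as True; summing counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the columns that actually have missing values
print(missing_per_column[missing_per_column > 0])
```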

Let's view summary statistics of the numeric columns and figure out what we want to get out of this dataset.

In [None]:
df.describe()

Unnamed: 0,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,VotingAgeCitizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
count,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3219.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0
mean,100768.1,49587.81,51180.32,11.296584,74.920186,8.681957,1.768416,1.289379,0.083416,71309.52,48994.96677,3138.61677,25657.03323,1514.442547,16.780776,23.040634,31.479814,18.214286,21.878944,12.59236,15.835745,79.630963,9.851646,0.938975,3.244472,1.598696,4.736894,23.474534,47092.95,74.863323,17.086118,7.772733,0.27882,6.66559
std,324499.6,159321.2,165216.4,19.342522,23.0567,14.333571,7.422946,2.716191,0.709277,210869.1,13877.178398,2405.78695,6667.520452,1156.708587,8.30936,11.891934,6.523912,3.742308,3.167228,4.143504,5.808383,7.6639,2.963054,3.072571,3.89151,1.678232,3.073484,5.687241,155815.9,7.647916,6.390868,3.855454,0.448073,3.772612
min,74.0,39.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,59.0,11680.0,262.0,5943.0,129.0,2.4,0.0,11.4,0.0,4.8,0.0,0.0,4.6,0.0,0.0,0.0,0.0,0.0,5.1,39.0,31.1,4.4,0.0,0.0,0.0
25%,11213.5,5645.5,5553.5,2.1,63.5,0.6,0.1,0.2,0.0,8442.25,40622.0,1729.75,21568.0,832.0,11.475,14.9,27.2,15.8,19.9,9.8,11.5,77.3,8.0,0.1,1.4,0.8,2.9,19.6,4573.0,71.2,12.7,5.2,0.1,4.475
50%,25847.5,12879.0,12993.5,4.1,83.6,2.0,0.3,0.6,0.0,19699.0,47636.5,2587.0,25139.0,1225.0,15.4,21.5,30.5,17.8,22.1,12.1,15.4,81.0,9.5,0.3,2.3,1.3,4.1,23.2,10611.5,76.1,15.9,6.8,0.2,6.1
75%,66608.25,33017.25,33593.75,10.0,92.8,9.5,0.6,1.2,0.1,50365.75,55476.0,3802.0,28997.0,1802.5,19.8,28.6,34.9,20.2,23.9,14.8,19.5,84.1,11.3,0.8,3.825,1.9,5.8,27.0,28747.25,80.2,19.9,9.2,0.3,8.0
max,10105720.0,4979641.0,5126081.0,100.0,100.0,86.9,90.3,41.8,33.7,6218279.0,129588.0,41001.0,69529.0,16145.0,65.2,83.6,69.0,46.4,37.2,36.4,48.7,97.2,29.3,61.8,59.2,43.2,33.0,45.1,4805817.0,88.8,64.8,38.0,8.0,40.9


There are 34 numerical columns out of 36 in this dataset.
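
Rather than counting by hand, the numeric columns can also be counted in code. This is a sketch on a made-up table with hypothetical column names; `select_dtypes` is a standard DataFrame method.

```python
import pandas as pd

# Made-up table: two text columns and two numeric columns
toy = pd.DataFrame({
    "state": ["X", "Y"],
    "county": ["A", "B"],
    "population": [1000, 2500],
    "poverty": [12.5, 20.1],
})

# Keep only the columns whose dtype is numeric (int or float)
numeric_cols = toy.select_dtypes(include="number").columns
print(len(numeric_cols), "numeric columns out of", toy.shape[1])
```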

As we saw in the objective, there are lots of things we can do with this dataset.

But let's pick the 5 states with the largest population and the most employees and work from there!


## Select subsets of data

Before we dive into the data analysis, let's select the columns we want to work with. 
As we learned earlier, we can select a specific subset of columns using two methods:

1. by index location
2. by column names
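
As a quick reminder, the two methods can be compared side by side on a made-up table (the table and its column names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["X", "Y"],
    "county": ["A", "B"],
    "population": [1000, 2500],
    "income": [50000, 42000],
})

# 1. By index location: all rows, columns at positions 0 and 2
by_position = toy.iloc[:, [0, 2]]

# 2. By column names: pass a list of names inside the brackets
by_name = toy[["state", "population"]]

# Both approaches select the same columns
print(by_position.equals(by_name))
```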

As this dataset has so many columns, let's take a look at all of the columns first.

We can use the ```columns``` attribute, which returns all of the column names. ```df.info()``` shows them as well.

In [None]:
df.columns

Index(['State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic', 'White',
       'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen', 'Income',
       'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
       'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
       'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
       'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
       'SelfEmployed', 'FamilyWork', 'Unemployment'],
      dtype='object')

Since there are so many columns, it's hard to count the index location. So we will use the column names to select subsets of data!

As discussed in the objective, our main focus is transportation methods for workers and population. So, we will select the following columns!

In [None]:
# by column names
df_emp = df[['State', 'County', 'TotalPop', 'Income', 'Employed', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome']]
df_emp

Unnamed: 0_level_0,State,County,TotalPop,Income,Employed,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
CountyId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1001,Alabama,Autauga County,55036,55317,24112,86.0,9.6,0.1,0.6,1.3,2.5
1003,Alabama,Baldwin County,203360,52562,89527,84.7,7.6,0.1,0.8,1.1,5.6
1005,Alabama,Barbour County,26201,33368,8878,83.4,11.1,0.3,2.2,1.7,1.3
1007,Alabama,Bibb County,22580,43404,8171,86.4,9.5,0.7,0.3,1.7,1.5
1009,Alabama,Blount County,57667,47412,21380,86.8,10.2,0.1,0.4,0.4,2.1
...,...,...,...,...,...,...,...,...,...,...,...
72145,Puerto Rico,Vega Baja Municipio,54754,18900,14234,92.0,4.2,0.9,1.4,0.6,0.9
72147,Puerto Rico,Vieques Municipio,8931,16261,2927,76.3,16.9,0.0,5.0,0.0,1.7
72149,Puerto Rico,Villalba Municipio,23659,19893,6873,83.1,11.8,0.1,2.1,0.0,2.8
72151,Puerto Rico,Yabucoa Municipio,35025,15586,7878,87.6,9.2,0.0,1.4,1.8,0.1


## Aggregation

### Total population by state

Let's get the total population by state so that we can select the 5 largest states by population.

To calculate this, we need to group by state and take the ```sum``` of ```TotalPop```.

### 9.0 - Now Try This
- Calculate total population by state from ```df_emp```
- Name the dataframe as ```state_pop```

## Total population by state -- sorted

Since there are so many states, it is easier to see which states have the largest population if we sort the result.

```sort_values()``` sorts the data by the aggregated value, and passing ```ascending=False``` puts the largest values first.

In [None]:
state_pop = df_emp.groupby(['State'])['TotalPop'].sum().sort_values(ascending=False)
state_pop

State
California              38982847
Texas                   27419612
Florida                 20278447
New York                19798228
Illinois                12854526
Pennsylvania            12790505
Ohio                    11609756
Georgia                 10201635
North Carolina          10052564
Michigan                 9925568
New Jersey               8960161
Virginia                 8365952
Washington               7169967
Arizona                  6809946
Massachusetts            6789319
Indiana                  6614418
Tennessee                6597381
Missouri                 6075300
Maryland                 5996079
Wisconsin                5763217
Minnesota                5490726
Colorado                 5436519
South Carolina           4893444
Alabama                  4850771
Louisiana                4663461
Kentucky                 4424376
Oregon                   4025127
Oklahoma                 3896251
Connecticut              3594478
Puerto Rico              3468963
Iowa

#### Top 5 states

If we want to see the top 5 results from a sorted DataFrame or Series, we can use ```head(n)``` to display the first n rows.

In [None]:
state_pop.head(5)

#### These are the top 5 largest states by population:
- California             
- Texas                   
- Florida                 
- New York                
- Illinois  
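
As an alternative to sorting and then calling ```head```, a Series also has an `nlargest(n)` method that returns the n largest values in descending order. A minimal sketch with made-up numbers (the Series `pop` and its labels are hypothetical):

```python
import pandas as pd

# Made-up population totals (hypothetical values)
pop = pd.Series(
    {"A": 500, "B": 1200, "C": 300, "D": 900},
    name="population",
)

# Approach used in this tutorial: sort in descending order, then take the first rows
top2_sorted = pop.sort_values(ascending=False).head(2)

# Alternative: ask for the n largest values directly
top2_direct = pop.nlargest(2)
print(top2_direct)
```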

We are going to use ```loc``` because we want to filter the dataset based on a condition.

We could use the line below, which selects rows where 'State' is California, Texas, Florida, New York, or Illinois.

In [None]:
df_emp.loc[(df_emp['State']=='California') | (df_emp['State']=='Texas') | (df_emp['State']=='Florida') | (df_emp['State']== "New York") | (df_emp['State']=='Illinois')]

Unnamed: 0_level_0,State,County,TotalPop,Income,Employed,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
CountyId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
6001,California,Alameda County,1629615,85743,826310,62.0,10.1,14.6,3.7,3.6,6.0
6003,California,Alpine County,1203,63438,374,62.1,13.2,0.0,9.1,1.6,14.0
6005,California,Amador County,37306,60636,13444,80.7,9.2,0.1,2.5,1.1,6.4
6007,California,Butte County,225207,46516,93439,75.0,11.1,1.2,3.2,3.9,5.7
6009,California,Calaveras County,45057,54800,16721,79.1,9.9,1.0,1.2,0.6,8.3
...,...,...,...,...,...,...,...,...,...,...,...
48499,Texas,Wood County,43315,48038,15960,80.6,10.0,0.3,2.5,1.0,5.6
48501,Texas,Yoakum County,8481,62500,3755,78.0,16.8,0.0,3.7,0.2,1.3
48503,Texas,Young County,18166,46351,8248,83.0,9.0,0.0,0.6,1.4,6.1
48505,Texas,Zapata County,14415,34550,5146,69.2,23.8,0.0,4.4,0.6,2.0


But the code above is very lengthy, so let's learn a shortcut!

We can use ```isin(list_of_values)``` to check whether each 'State' value is in a list of states.

This is very similar to ```'a' in ['a','b','c']``` in Python.

In [None]:
five_states = df_emp.loc[df_emp['State'].isin(['California','Texas','Florida','New York','Illinois'])]
five_states

Unnamed: 0_level_0,State,County,TotalPop,Income,Employed,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
CountyId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
6001,California,Alameda County,1629615,85743,826310,62.0,10.1,14.6,3.7,3.6,6.0
6003,California,Alpine County,1203,63438,374,62.1,13.2,0.0,9.1,1.6,14.0
6005,California,Amador County,37306,60636,13444,80.7,9.2,0.1,2.5,1.1,6.4
6007,California,Butte County,225207,46516,93439,75.0,11.1,1.2,3.2,3.9,5.7
6009,California,Calaveras County,45057,54800,16721,79.1,9.9,1.0,1.2,0.6,8.3
...,...,...,...,...,...,...,...,...,...,...,...
48499,Texas,Wood County,43315,48038,15960,80.6,10.0,0.3,2.5,1.0,5.6
48501,Texas,Yoakum County,8481,62500,3755,78.0,16.8,0.0,3.7,0.2,1.3
48503,Texas,Young County,18166,46351,8248,83.0,9.0,0.0,0.6,1.4,6.1
48505,Texas,Zapata County,14415,34550,5146,69.2,23.8,0.0,4.4,0.6,2.0
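
To see that the chained comparison and ```isin``` select the same rows, here is a sketch on a made-up table (the names `toy` and `state_list` are hypothetical). It also shows that `~` negates the condition, which is useful for excluding a list of values.

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["X", "Y", "Z", "X", "W"],
    "population": [10, 20, 30, 40, 50],
})

state_list = ["X", "Z"]

# Long form: one comparison per value, combined with | (or)
chained = toy.loc[(toy["state"] == "X") | (toy["state"] == "Z")]

# Short form: a single membership test against the list
with_isin = toy.loc[toy["state"].isin(state_list)]
print(chained.equals(with_isin))

# ~ flips the True/False mask, keeping rows NOT in the list
others = toy.loc[~toy["state"].isin(state_list)]
print(others)
```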


### Average income by state

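Now that we have selected five states to work with, let's get the average income in each state. Note that this is the mean of the county-level ```Income``` values, so every county counts equally regardless of its population.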

In [None]:
five_states.groupby(['State'])['Income'].mean().sort_values(ascending=False)

State
California    61046.758621
New York      58309.258065
Illinois      52930.382353
Texas         49894.338583
Florida       47144.328358
Name: Income, dtype: float64

California has the highest average county income and Florida has the lowest among these five states.

### Total number of employees by state
### 10.0 - Now Try This
- Calculate the total number of employees by state
- Sort by value
- Write down the state with the highest number of employees and the state with the lowest number of employees.

Hint: use ```groupby```

### Means of transportation to work by State

Let's look at each state's transit modes and see which one is most popular in each state.

We will group by 'State' and take the sum of each transit mode column.

In [None]:
# Pass a list of column names (double brackets) to select several columns after groupby
five_states.groupby(['State'])[['Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome']].sum()


Unnamed: 0_level_0,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
California,4265.3,633.3,170.1,193.5,151.0,386.8
Florida,5349.1,687.6,72.2,98.2,157.0,335.6
Illinois,8420.2,872.8,108.7,248.1,126.7,422.4
New York,4619.3,523.4,406.5,290.6,100.0,260.2
Texas,20455.0,2888.5,79.4,600.6,363.2,1013.9
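
The transit columns appear to be per-county percentages, so their sums grow with the number of counties in a state. One possible refinement is to convert each percentage into an approximate number of workers before summing. This sketch assumes that each percentage is a share of the county's employed workers, which is not confirmed here. The table and its lowercase column names are made up for illustration.

```python
import pandas as pd

# Made-up county rows (hypothetical values and column names)
toy = pd.DataFrame({
    "state": ["X", "X", "Y"],
    "employed": [1000, 3000, 2000],
    "drive": [80.0, 60.0, 90.0],     # assumed: percent of employed workers
    "transit": [20.0, 40.0, 10.0],   # assumed: percent of employed workers
})

modes = ["drive", "transit"]

# Convert percentages to approximate worker counts, row by row
counts = toy[modes].multiply(toy["employed"], axis=0) / 100
counts["state"] = toy["state"]

# Sum the worker counts per state
workers_by_mode = counts.groupby("state")[modes].sum()

# Turn the totals back into each mode's share of the state's workers
share_by_mode = workers_by_mode.div(workers_by_mode.sum(axis=1), axis=0) * 100
print(share_by_mode)
```

With this approach a large county influences the state total more than a small one, which a plain sum of percentages does not capture.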


As California is the largest state by both population and employment, we will work with the California data only.


### 11.0 - Now Try This

#### Step 1: Select only the rows for California and save the result as ```ca_transit```

In [None]:
# Step 1: Select only the rows for California


#### Step 2: Calculate the ```sum``` of all transit modes by ```County``` in California.

In [None]:
# Step 2: Calculate the sum of all transit modes by county in California.


## Takeaways from this tutorial
Whenever you work with tabular data, Pandas gives you the tools to explore it and answer many of your questions!

In later weeks, we will dive deeper into Pandas and analyze more complicated data.

## Resources
- [About Titanic](https://en.wikipedia.org/wiki/Titanic)
- [US Census Demographic Data](https://www.kaggle.com/muonneutrino/us-census-demographic-data)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Pandas API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)
- [A Gentle Introduction to Pandas](https://medium.com/@wbusaka/a-gentle-introduction-to-pandas-5ed17421a59d)

## Appendix

### How to download a dataset from Kaggle

Kaggle is the world's largest data science community with powerful tools and resources. There are lots of datasets you can download from this website.

#### Step 1: go to [Kaggle.com](https://Kaggle.com)

#### Step 2: click the Data tab

![Kaggle Data](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/Kaggle_data.png?raw=1)

#### Step 3: search for a dataset of your interest or explore the most popular datasets on the main page

![Search page](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/search_page.png?raw=1)

#### Step 4: once you select a dataset, you can read its context and download it. If you click the Download button at the top, all of the files in the dataset will be downloaded.

![US Census main](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/US_Census_main.png?raw=1)

#### Step 5: if you want to download a specific dataset, hit the download icon in the selected dataset 

![Download specific](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/download_specific.png?raw=1)

#### Step 6: if you'd like to see all columns, check ```select all``` and it will show all of the columns in the dataset.

![View columns](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/view_columns.png?raw=1)

#### Step 7: if your dataset is a CSV file, use ```read_csv()```. If your dataset is an Excel file, use ```read_excel()``` to load your dataset.
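
A minimal sketch of the loading step follows. So that it runs without a downloaded file, a small made-up CSV is held in memory as a stand-in. With a real download you would pass the file path instead. The file name in the final comment is hypothetical.

```python
import io

import pandas as pd

# Stand-in for a downloaded file: a small CSV held in memory (made-up values)
csv_text = "CountyId,State,TotalPop\n1,X,1000\n2,Y,2500\n"

# read_csv accepts a file path, a URL, or a file-like object;
# index_col sets the index while loading
toy = pd.read_csv(io.StringIO(csv_text), index_col="CountyId")
print(toy)

# For an Excel file the call has the same shape, for example:
# pd.read_excel("my_file.xlsx")  # hypothetical file name; needs an Excel engine such as openpyxl
```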