<a href="https://colab.research.google.com/github/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/Week3_Introduction_to_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Manipulation with Pandas




## Table of Contents
- Why, Where and How we use Pandas
- What we will be learning today
  - About the dataset
  - Goals
- Connecting to Google Drive
- Importing libraries and unpacking a file
- Loading data
- Getting info about the dataset
  - Now Try This
- Removing NaN (None) values
  - Now Try This
- Selecting subsets of data
  - Now Try This
- Filtering dataset based on criteria
  - Now Try This
- Aggregation functions
  - Now Try This

## Why, Where and How we use Pandas

This week, we will cover basic data manipulation using Pandas.
1. Pandas is an open-source data analysis and manipulation tool that is widely used in both academia and industry.
2. It is built on top of the Python programming language.
3. It offers data structures and operations for manipulating numerical tables and time series.

Pandas provides two main data structures: Series and DataFrame.
1. A Series is a 1-dimensional labelled array that can hold data of **any type** (integer, string, float, Python objects, etc.). Its labels are called an index.
2. A DataFrame is a 2-dimensional labelled data structure with both rows and columns.

Older versions of Pandas also included a 3-dimensional structure called Panel, but it has since been removed from the library, so we will not cover it.

This week, we will focus on DataFrame, and we will learn Series in later weeks.
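
To make the difference between the two structures concrete, here is a minimal sketch. The names `ages` and `passengers` and the three sample rows are made up for illustration; they are not loaded from the course dataset.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels (hypothetical sample)
ages = pd.Series([22.0, 38.0, 26.0], index=["Braund", "Cumings", "Heikkinen"])

# A DataFrame: a table whose columns are each a Series sharing one index
passengers = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0], "Survived": [0, 1, 1]},
    index=["Braund", "Cumings", "Heikkinen"],
)

print(ages["Cumings"])          # look up one value by its label
print(type(passengers["Age"]))  # selecting one column gives back a Series
print(passengers.shape)         # (number of rows, number of columns)
```

Selecting a single column from a DataFrame returns a Series, which is why the two structures are usually learned together.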


Since we've covered the fundamentals of Python, it will be fairly easy to pick up Pandas.

## What we will be learning today

### About the dataset:
To get started with Pandas and dataframes, we will use the Titanic dataset. It contains information about the passengers, such as name, age, gender, ticket price, and most importantly whether they survived or not.

As each person has a unique PassengerId, each row is a unique entity / passenger.



### Goals:
Using data manipulation and analysis, we will learn how to get a quick overview of the dataset and how to select and filter specific rows / columns. At the end of the tutorial, we should be able to analyze the survival rates.

## Connecting to Your Google Drive

We will start by connecting Google Drive to Google Colab.

In [1]:
from google.colab import drive

drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


### Change Directory

Now, Google Drive is connected to our Colab notebook.
Let's access the dataset located in Digital History/Week3.

```cd``` is a command used to change the current working directory. Simply put, ```cd``` means 'Change Directory.'

By running the code below, we change our current directory.


In [4]:
cd "/content/gdrive/My Drive/DigitalHistory/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/data"

/content/gdrive/.shortcut-targets-by-id/1m-IVNIRZmHM3YwHOGFHd_tUQuI298irS/DigitalHistory/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/data


### List files
```ls``` is a command used to list all files in the current directory.

After we run it, we will see a 'titanic-dataset.csv' file.

In [5]:
ls

acs2017_county_data.csv  __MACOSX/  titanic-dataset.csv  titanic-dataset.zip


'titanic-dataset.csv' is the file we are going to use for the step-by-step learning session.

### Import Library

In order to read / load a file, we will need to import Pandas.

It's a convention to use ```import pandas as pd``` when importing the Pandas library.

In [6]:
import pandas as pd

Once we've imported Pandas, we can use ```pd``` to call any function in Pandas.



### Load file

We will use the ```read_csv``` function, since we are reading a CSV file.

Since we are working with only one dataset, we will simply name the dataframe df.

But if we are working with lots of dataframes, it's better to give each one a meaningful name (e.g., titanic_data, passenger_info).

In [8]:
df = pd.read_csv('titanic-dataset.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Now, the dataset is loaded as a dataframe 'df'
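
If you want to experiment with ```read_csv``` without the file from Google Drive, the sketch below reads a tiny CSV from a string instead. The three rows in `csv_text` are a hypothetical sample written for this example, and `titanic_sample` is just an illustrative name.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for 'titanic-dataset.csv'
csv_text = """PassengerId,Survived,Name,Age
1,0,"Braund, Mr. Owen Harris",22
2,1,"Heikkinen, Miss. Laina",26
3,0,"Moran, Mr. James",
"""

# read_csv accepts a file path or a file-like object such as StringIO
titanic_sample = pd.read_csv(io.StringIO(csv_text))
print(titanic_sample)
```

Note how the empty Age field in the last row is read as NaN, the same marker for missing values that appears in the full dataset.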

## Basic info about the dataset

Now, let's get the basic information about the dataframe.
- head()
- describe()
- info()

### head()
The head() function is useful for seeing the dataset at a quick glance, as it returns the first n rows.

Let's check what columns this file has by calling 'head()' function.

By default, the head() returns the first 5 rows.

In [9]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


You can specify the number of rows to display by calling df.head(number)

In [None]:
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


### info()

Now, we know what's in the dataset and what it looks like.

The info() function is useful because it returns all of the **column names** and **their types**, as well as the **Non-Null** counts.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We can tell that "Age" and "Cabin" have lots of missing values. 
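
From the non-null counts above, 891 - 714 = 177 Age values and 891 - 204 = 687 Cabin values are missing. Instead of subtracting by hand, the missing values can be counted directly. The sketch below assumes a small made-up dataframe called `sample`; the same two calls should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kind of gaps as the Titanic data
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() then counts the Trues per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```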

If we take a closer look at dtypes in the second-to-last row, there are three dtypes: float64, int64, and object.

We covered floats and integers last week in Python, but what is an object?

- int64: integer numbers
- float64: floating point numbers
- object: strings, or a mix of numeric and non-numeric values.

That's why the dtype of "Name", "Sex", and "Embarked" is object: they are strings.

"Ticket" and "Cabin" are also object, as they contain numbers or a mix of letters and numbers stored as text (e.g., A/5 21171, C85).



### describe()

describe() is used to view summary statistics of numeric columns. This helps us to have a general idea of the dataset.

In [None]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


### shape
To see the size of the dataset, we can use the shape attribute, which returns the number of rows and columns as (# rows, # columns). Because shape is an attribute rather than a function, we write it without parentheses.

This dataset has 891 rows (entities) and 12 columns.

In [None]:
df.shape

(891, 12)

## Remove NaN values

Often, when we work with large datasets, we will encounter cases where lots of elements are missing (NaN / null).

Dropping the rows that contain NaN values lets us work with a clean dataset.


---


Let's remove the rows that do not provide meaningful information.

When we know the "unique key" of the dataset (PassengerId in this dataset), we can check whether every row has a PassengerId. If any row is missing its PassengerId, we can drop that row.

* df.dropna(): drop the rows where at least one of the elements is missing.
* df.dropna(how='all'): drop the rows where all of the elements are missing.
* df.dropna(subset=['PassengerId']): define in which columns to look for missing values.
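
The three variants above can be compared side by side on a small example. This is a sketch on a made-up four-row dataframe (`sample` is a hypothetical name), chosen so that each variant keeps a different number of rows.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "PassengerId": [1, 2, np.nan, np.nan],
    "Age": [22.0, np.nan, 30.0, np.nan],
})

any_missing = sample.dropna()                        # keeps only the fully filled row
all_missing = sample.dropna(how="all")               # drops only the fully empty row
key_missing = sample.dropna(subset=["PassengerId"])  # looks at PassengerId only

print(len(any_missing), len(all_missing), len(key_missing))
print(len(sample))  # dropna() returned new dataframes; sample itself is unchanged
```

Each call returns a new dataframe, so the original keeps all four rows unless the result is assigned back or inplace=True is used.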

If we want to drop the rows with at least one missing element:

In [None]:
df.dropna().shape

(183, 12)

If we want to drop the rows with all elements missing:

In [None]:
df.dropna(how='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


If we want to drop the rows that are missing PassengerId:

In [None]:
df.dropna(subset=['PassengerId']).shape

(891, 12)

If we want to update the dataframe itself after dropping rows, we can use inplace=True:

In [None]:
df.dropna(subset=['PassengerId'], inplace=True)

In the Titanic dataset, we've confirmed that every row has a 'PassengerId', because the number of rows after dropping (891) is the same as in the original dataframe.

### Now Try This

Drop the rows that are missing any of the following columns: 'PassengerId', 'Survived', 'Age'

In [None]:
df.dropna(subset=['PassengerId', 'Survived', 'Age'])

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## Removing a column

Before we dive into the data analysis, let's see if there are any columns we want to remove. 

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


In the "Cabin" column, only 204 rows are non-null. That means 891 - 204 = 687 rows are missing a value in that column.

So it wouldn't give us meaningful insight, and we can remove the "Cabin" column by using the ```del``` statement.

In [None]:
del df['Cabin']
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C
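
An alternative to ```del``` is the ```drop``` method, which returns a new dataframe and leaves the original untouched. The sketch below uses a made-up two-row dataframe; `sample` and `without_cabin` are illustrative names, not part of the notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "PassengerId": [1, 2],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.2833],
})

# drop(columns=...) builds a new dataframe without the listed columns
without_cabin = sample.drop(columns=["Cabin"])

print(list(without_cabin.columns))
print(list(sample.columns))  # the original still has 'Cabin'
```

This can be handy when you want to keep the full dataframe around for later steps.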


## Select subsets of data

When we are interested in a few columns to do the data analysis, we can select a specific subset of columns using two methods:

1. by index location
2. by column names

We can select specific subsets of data using *iloc[rows_index, columns_index]*.

As we learned last week, [:] selects everything in a list or string in Python. 

Similarly, [:] will select every row or column depending on where we put it.

In [None]:
# by index location (iloc)
df_index = df.iloc[: , [0,1,2,3]]
df_index

Unnamed: 0,PassengerId,Survived,Pclass,Name
0,1,0,3,"Braund, Mr. Owen Harris"
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,3,1,3,"Heikkinen, Miss. Laina"
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,5,0,3,"Allen, Mr. William Henry"
...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas"
887,888,1,1,"Graham, Miss. Margaret Edith"
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie"""
889,890,1,1,"Behr, Mr. Karl Howell"


In [None]:
# by column names
df_col_names = df[['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age']]
df_col_names

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0
...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0
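
The two selection methods can be checked against each other. The following sketch assumes a small hypothetical dataframe named `sample` and shows that selecting by position and selecting by name can return the same columns.

```python
import pandas as pd

sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["Braund", "Cumings", "Heikkinen"],
})

by_position = sample.iloc[:, 0:2]              # all rows, columns at positions 0 and 1
by_name = sample[["PassengerId", "Survived"]]  # the same columns, chosen by name
first_two_rows = sample.iloc[0:2, :]           # rows 0 and 1, every column

print(by_position.equals(by_name))
print(first_two_rows.shape)
```

As with Python lists, the end of a slice such as 0:2 is not included, so 0:2 covers positions 0 and 1.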


### Now Try This

Select column index 3 to column index 6 (inclusive) with all rows.

In [None]:
df_index = df.iloc[: , 3:7]
df_index

Unnamed: 0,Name,Sex,Age,SibSp
0,"Braund, Mr. Owen Harris",male,22.0,1
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1
2,"Heikkinen, Miss. Laina",female,26.0,0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1
4,"Allen, Mr. William Henry",male,35.0,0
...,...,...,...,...
886,"Montvila, Rev. Juozas",male,27.0,0
887,"Graham, Miss. Margaret Edith",female,19.0,0
888,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1
889,"Behr, Mr. Karl Howell",male,26.0,0


## Filter Dataset based on criteria

Often, we are interested in working with the specific rows that meet certain criteria.

If we only want to look at the rows with Age > 30, we can specify the criteria inside 'loc'.

In [None]:
df_over_30yrs=df.loc[df['Age'] > 30]
df_over_30yrs

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,S
...,...,...,...,...,...,...,...,...,...,...,...
873,874,0,3,"Vander Cruyssen, Mr. Victor",male,47.0,0,0,345765,9.0000,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C
881,882,0,3,"Markun, Mr. Johann",male,33.0,0,0,349257,7.8958,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,Q


Now, let's select rows using two criteria: "Age" is greater than 30 AND the passenger survived ("Survived" equals 1).

'*&*' is equivalent to '*AND*' and '*|*' is equivalent to '*OR*' when filtering a dataframe.


In [None]:
df_over_30yrs_survived = df.loc[(df['Age'] > 30) & (df['Survived'] == 1)]
df_over_30yrs_survived

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,S
21,22,1,2,"Beesley, Mr. Lawrence",male,34.0,0,0,248698,13.0000,S
...,...,...,...,...,...,...,...,...,...,...,...
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,S
862,863,1,1,"Swift, Mrs. Frederick Joel (Margaret Welles Ba...",female,48.0,0,0,17466,25.9292,S
865,866,1,2,"Bystrom, Mrs. (Karolina)",female,42.0,0,0,236852,13.0000,S
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,S



IMPORTANT: When filtering with multiple conditions, make sure to use **()** on each condition.
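
The parentheses matter because Python evaluates & and | before comparisons such as > and ==, so an unparenthesized filter is grouped differently than intended and usually ends in an error. One way to keep filters readable is to store each condition in its own variable first. This is a sketch on a made-up dataframe; `sample`, `over_30`, and `survived` are illustrative names.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, 38.0, 58.0, np.nan],
    "Survived": [0, 1, 1, 1],
})

over_30 = sample["Age"] > 30        # one True/False per row; a missing Age gives False
survived = sample["Survived"] == 1

both = sample.loc[over_30 & survived]    # AND: both conditions hold
either = sample.loc[over_30 | survived]  # OR: at least one condition holds

print(len(both), len(either))
```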

Let's check how many passengers survived among the ones whose age was over 30.

In [None]:
print("# of passengers whose age was over 30: ", df_over_30yrs.shape[0])
print("# of survived passengers whose age was over 30: ", df_over_30yrs_survived.shape[0])

### Now Try This

Select the rows that meet the following condition:
- Pclass is not 1

In [None]:
df.loc[(df['Pclass'] != 1)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,S
...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S


### Now Try This

Select the rows that meet the following conditions:
- Age is less than 10 
- OR
- Age is greater than 50

Hint: Don't forget the parentheses!

In [None]:
df.loc[(df['Age'] < 10) | (df['Age'] > 50)]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0000,S
...,...,...,...,...,...,...,...,...,...,...,...
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,S
852,853,0,3,"Boulos, Miss. Nourelain",female,9.0,1,1,2678,15.2458,C
857,858,1,1,"Daly, Mr. Peter Denis",male,51.0,0,0,113055,26.5500,S
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,S


## Aggregation

Aggregation is the process of combining many values into a single summary value. It's useful for understanding the overall properties of a dataset and analyzing it.

Some examples of aggregation are sum(), min(), max(), count(), mean(), std(), etc.
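
Several of these aggregations can also be requested in a single call with ```agg```. The sketch below runs on four made-up fare values (`fares` is a hypothetical Series, not the full 'Fare' column).

```python
import pandas as pd

fares = pd.Series([7.25, 71.2833, 7.925, 53.1], name="Fare")

# agg() with a list of function names returns one result per name
summary = fares.agg(["sum", "min", "max", "mean", "count"])
print(summary)
```

The result is itself a Series, so an individual value can be read with, for example, summary["max"].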

### Sum
#### 1. Total fares

In [None]:
df['Fare'].sum()

28693.9493

If we want to round total fares and save it as a variable, then we can try:

In [None]:
total_fares = df['Fare'].sum()
print("Total fares: ", round(total_fares))

Total fares:  28694.0


#### 2. Survived passengers
We can also count the number of passengers that survived by summing up the 'Survived' column.

In [None]:
survived_passengers = df['Survived'].sum()
survived_passengers

342

### Max / Min

Let's calculate max and min of Fare.

In [None]:
df['Fare'].max()

512.3292

In [None]:
df['Fare'].min()

0.0

### Mean
Let's calculate the survival rate of all passengers.

In [None]:
df['Survived'].mean()

0.3838383838383838

Let's calculate the average age of all passengers.

In [None]:
df['Age'].mean()

29.69911764705882

Now, we will tackle a more complex problem.

Let's calculate the survival rate by *age* group.
We can apply the filtering that we just learned to select two groups: passengers older than 30, and passengers aged 30 or younger.

In each group, we will calculate the mean of 'Survived' column.

In [None]:
# filtering
df_over_30yrs = df.loc[df['Age'] > 30]
df_under_30yrs = df.loc[df['Age'] <= 30]

# calculating mean of 'Survived' for each group
mean_over_30 = df_over_30yrs['Survived'].mean()
mean_under_30 = df_under_30yrs['Survived'].mean()

# printing the mean survival rates for each group
# round(number, decimal_points): round a number to a given precision in decimal_points
print("Survival rate - age over 30: ", round(mean_over_30*100, 3), "%")
print("Survival rate - age 30 or under: ", round(mean_under_30*100, 3), "%")

Survival rate - age over 30:  40.656 %
Survival rate - age 30 or under:  40.587 %


There's not much difference between the two groups.
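
One detail worth keeping in mind with this comparison: a comparison against a missing value is always False, so passengers with no recorded Age (891 - 714 = 177 of them, according to info()) fall into neither group. The sketch below illustrates this on a made-up five-row dataframe; all names in it are hypothetical.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Age": [22.0, 45.0, np.nan, 30.0, np.nan],
    "Survived": [1, 0, 1, 0, 0],
})

over_30 = sample.loc[sample["Age"] > 30]
at_most_30 = sample.loc[sample["Age"] <= 30]
age_missing = sample.loc[sample["Age"].isna()]  # rows that neither filter picks up

print(len(over_30), len(at_most_30), len(age_missing))
```

The three counts add up to the total number of rows, which is a quick way to check that no passenger was silently left out of an analysis.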

### Groupby

Now, we will group by *sex* to see if there's any difference between female and male passengers.

Here, we use the 'groupby' function, which lets us group the dataset by a column ('Sex') before applying an aggregate function such as mean().

In [None]:
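# numeric_only=True averages only the numeric columns (text columns such as 'Name' are skipped)
df.groupby(['Sex']).mean(numeric_only=True)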

Sex,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
female,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818
male,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893


If we want to sort the aggregated result by a column, we can use sort_values(by="column_name").

Let's sort the above result by "Survived".

In [None]:
df.groupby(['Sex']).mean(numeric_only=True).sort_values(by="Survived")

Sex,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
male,454.147314,0.188908,2.389948,30.726645,0.429809,0.235702,25.523893
female,431.028662,0.742038,2.159236,27.915709,0.694268,0.649682,44.479818


If we are only interested in the survival rate of each group, we can specify the column by using [ ] after the groupby call.

In [None]:
# group by 'Sex' and calculate mean of 'Survived'
df.groupby(['Sex'])['Survived'].mean()

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64

There was a significant difference in the survival rate by *sex*!
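
A rate is easier to interpret next to the group sizes behind it. As a sketch, ```agg``` can return the mean, the number of survivors, and the group size together. The five rows below are made up for illustration, and `by_sex` is a hypothetical name.

```python
import pandas as pd

sample = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 1, 0],
})

# mean = survival rate, sum = number of survivors, count = group size
by_sex = sample.groupby("Sex")["Survived"].agg(["mean", "sum", "count"])
print(by_sex)
```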

We can also apply groupby on multiple columns.

In [None]:
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)

Sex,Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
female,1,469.212766,0.968085,34.611765,0.553191,0.457447,106.125798
female,2,443.105263,0.921053,28.722973,0.486842,0.605263,21.970121
female,3,399.729167,0.5,21.75,0.895833,0.798611,16.11881
male,1,455.729508,0.368852,41.281386,0.311475,0.278689,67.226127
male,2,447.962963,0.157407,30.740707,0.342593,0.222222,19.741782
male,3,455.51585,0.135447,26.507589,0.498559,0.224784,12.661633


In both groups (female and male), the survival rate was highest in Pclass 1 and lowest in Pclass 3! For men, first class stands out the most (about 37% versus roughly 14-16% in the other classes).

### Now Try This

Then, some of us might be curious:

Is a lower Pclass number (such as 1) more expensive, or a higher one (such as 3)?

We can answer the question by calculating the mean fares for each class.

**Calculate the mean fares for each Pclass!**

In [None]:
df.groupby(['Pclass'])['Fare'].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

### Let's pause and think!
Did you see any correlation between Pclass, Fare, and Survival rate? Briefly describe what you have found here.

## Takeaways

Using dataframes and aggregate functions in Pandas, we can answer many of the questions that come up about a dataset!

In the tutorial section, we will apply what we have learned in Pandas and further analyze a new dataset.

# Tutorial

### About the dataset

In the tutorial, we will be using US Census Demographic Data.

The data were collected by the US Census Bureau and cover the entire country.

This dataset covers lots of areas: state, county, gender, ethnicity, working fields, commuting modes, and employment.

All of this information is available at the state and county level. There are many questions that we could try to answer with the dataset:
- Unemployment by state
- Professional fields by state and county
- Commute transit mode by county in CA
- ...

### Objective
Since the dataset covers all of the states in the US, we are going to select the top 5 states by population.
Once we've selected the top five states, we will calculate the average income and the total employed population.

That's our focus in the tutorial, but feel free to play around with it as you'd like.
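
As a rough sketch of where this objective is heading, the steps can be tried on a toy table first. Everything below is hypothetical: the state names, the numbers, and the choice of the top 2 states instead of 5. Only the column names (State, TotalPop, Income, Employed) are taken from the real dataset.

```python
import pandas as pd

# Hypothetical county rows; the real file has one row per county
counties = pd.DataFrame({
    "State": ["A", "A", "B", "B", "C"],
    "TotalPop": [100, 300, 50, 20, 500],
    "Income": [40000, 60000, 30000, 50000, 55000],
    "Employed": [40, 150, 20, 10, 240],
})

# Step 1: total population per state, then the names of the largest states
state_pop = counties.groupby("State")["TotalPop"].sum()
top_states = state_pop.nlargest(2).index  # top 2 here; the tutorial uses 5

# Step 2: keep only the counties in those states and aggregate per state
top_rows = counties.loc[counties["State"].isin(top_states)]
summary = top_rows.groupby("State").agg(
    avg_income=("Income", "mean"),        # simple average across counties
    total_employed=("Employed", "sum"),
)
print(summary)
```

Note that this averages the county-level Income values without weighting them by population, which is one of several reasonable choices.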

Let's load our data first!

### Load file

In [10]:
df = pd.read_csv('acs2017_county_data.csv')
df

Unnamed: 0,CountyId,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,VotingAgeCitizen,Income,IncomeErr,IncomePerCap,IncomePerCapErr,Poverty,ChildPoverty,Professional,Service,Office,Construction,Production,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001,Alabama,Autauga County,55036,26899,28137,2.7,75.4,18.9,0.3,0.9,0.0,41016,55317,2838,27824,2024,13.7,20.1,35.3,18.0,23.2,8.1,15.4,86.0,9.6,0.1,0.6,1.3,2.5,25.8,24112,74.1,20.2,5.6,0.1,5.2
1,1003,Alabama,Baldwin County,203360,99527,103833,4.4,83.1,9.5,0.8,0.7,0.0,155376,52562,1348,29364,735,11.8,16.1,35.7,18.2,25.6,9.7,10.8,84.7,7.6,0.1,0.8,1.1,5.6,27.0,89527,80.7,12.9,6.3,0.1,5.5
2,1005,Alabama,Barbour County,26201,13976,12225,4.2,45.7,47.8,0.2,0.6,0.0,20269,33368,2551,17561,798,27.2,44.9,25.0,16.8,22.6,11.5,24.1,83.4,11.1,0.3,2.2,1.7,1.3,23.4,8878,74.1,19.1,6.5,0.3,12.4
3,1007,Alabama,Bibb County,22580,12251,10329,2.4,74.6,22.0,0.4,0.0,0.0,17662,43404,3431,20911,1889,15.2,26.6,24.4,17.6,19.7,15.9,22.4,86.4,9.5,0.7,0.3,1.7,1.5,30.0,8171,76.0,17.4,6.3,0.3,8.2
4,1009,Alabama,Blount County,57667,28490,29177,9.0,87.4,1.5,0.3,0.1,0.0,42513,47412,2630,22021,850,15.6,25.4,28.5,12.9,23.3,15.8,19.5,86.8,10.2,0.1,0.4,0.4,2.1,35.0,21380,83.9,11.9,4.0,0.1,4.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3215,72145,Puerto Rico,Vega Baja Municipio,54754,26269,28485,96.7,3.1,0.1,0.0,0.0,0.0,42838,18900,1219,10197,576,43.8,49.4,28.6,20.2,25.9,11.1,14.2,92.0,4.2,0.9,1.4,0.6,0.9,31.6,14234,76.2,19.3,4.3,0.2,16.8
3216,72147,Puerto Rico,Vieques Municipio,8931,4351,4580,95.7,4.0,0.0,0.0,0.0,0.0,7045,16261,2414,11136,1459,36.8,68.2,20.9,38.4,16.4,16.9,7.3,76.3,16.9,0.0,5.0,0.0,1.7,14.9,2927,40.7,40.9,18.4,0.0,12.8
3217,72149,Puerto Rico,Villalba Municipio,23659,11510,12149,99.7,0.2,0.1,0.0,0.0,0.0,18053,19893,1935,10449,1619,50.0,67.9,22.5,21.2,22.7,14.1,19.5,83.1,11.8,0.1,2.1,0.0,2.8,28.4,6873,59.2,30.2,10.4,0.2,24.8
3218,72151,Puerto Rico,Yabucoa Municipio,35025,16984,18041,99.9,0.1,0.0,0.0,0.0,0.0,27523,15586,1467,8672,702,52.4,62.1,27.7,26.0,20.7,9.5,16.0,87.6,9.2,0.0,1.4,1.8,0.1,30.5,7878,62.7,30.9,6.3,0.0,25.4


### Basic info about the dataset


In [None]:
df.head()

Unnamed: 0,CountyId,State,County,TotalPop,Men,Women,Hispanic,White,Black,Native,...,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
0,1001,Alabama,Autauga County,55036,26899,28137,2.7,75.4,18.9,0.3,...,0.6,1.3,2.5,25.8,24112,74.1,20.2,5.6,0.1,5.2
1,1003,Alabama,Baldwin County,203360,99527,103833,4.4,83.1,9.5,0.8,...,0.8,1.1,5.6,27.0,89527,80.7,12.9,6.3,0.1,5.5
2,1005,Alabama,Barbour County,26201,13976,12225,4.2,45.7,47.8,0.2,...,2.2,1.7,1.3,23.4,8878,74.1,19.1,6.5,0.3,12.4
3,1007,Alabama,Bibb County,22580,12251,10329,2.4,74.6,22.0,0.4,...,0.3,1.7,1.5,30.0,8171,76.0,17.4,6.3,0.3,8.2
4,1009,Alabama,Blount County,57667,28490,29177,9.0,87.4,1.5,0.3,...,0.4,0.4,2.1,35.0,21380,83.9,11.9,4.0,0.1,4.9


Since there are so many columns, the head function doesn't display all of them.

Let's use info(), as it returns **ALL** of the **column names**, their types, and Non-Null counts.



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3220 entries, 0 to 3219
Data columns (total 37 columns):
CountyId            3220 non-null int64
State               3220 non-null object
County              3220 non-null object
TotalPop            3220 non-null int64
Men                 3220 non-null int64
Women               3220 non-null int64
Hispanic            3220 non-null float64
White               3220 non-null float64
Black               3220 non-null float64
Native              3220 non-null float64
Asian               3220 non-null float64
Pacific             3220 non-null float64
VotingAgeCitizen    3220 non-null int64
Income              3220 non-null int64
IncomeErr           3220 non-null int64
IncomePerCap        3220 non-null int64
IncomePerCapErr     3220 non-null int64
Poverty             3220 non-null float64
ChildPoverty        3219 non-null float64
Professional        3220 non-null float64
Service             3220 non-null float64
Office              3220 non-nu

There are 37 columns with 3220 counties.

Also, almost none of the columns have missing values: every column shown above has 3220 non-null values except ChildPoverty, which has 3219 (one missing value).

Let's view summary statistics of numeric columns and figure out what we want to get out from this dataset.

In [None]:
df.describe()

Unnamed: 0,CountyId,TotalPop,Men,Women,Hispanic,White,Black,Native,Asian,Pacific,...,Walk,OtherTransp,WorkAtHome,MeanCommute,Employed,PrivateWork,PublicWork,SelfEmployed,FamilyWork,Unemployment
count,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,...,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0,3220.0
mean,31393.60528,100768.1,49587.81,51180.32,11.296584,74.920186,8.681957,1.768416,1.289379,0.083416,...,3.244472,1.598696,4.736894,23.474534,47092.95,74.863323,17.086118,7.772733,0.27882,6.66559
std,16292.078954,324499.6,159321.2,165216.4,19.342522,23.0567,14.333571,7.422946,2.716191,0.709277,...,3.89151,1.678232,3.073484,5.687241,155815.9,7.647916,6.390868,3.855454,0.448073,3.772612
min,1001.0,74.0,39.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.1,39.0,31.1,4.4,0.0,0.0,0.0
25%,19032.5,11213.5,5645.5,5553.5,2.1,63.5,0.6,0.1,0.2,0.0,...,1.4,0.8,2.9,19.6,4573.0,71.2,12.7,5.2,0.1,4.475
50%,30024.0,25847.5,12879.0,12993.5,4.1,83.6,2.0,0.3,0.6,0.0,...,2.3,1.3,4.1,23.2,10611.5,76.1,15.9,6.8,0.2,6.1
75%,46105.5,66608.25,33017.25,33593.75,10.0,92.8,9.5,0.6,1.2,0.1,...,3.825,1.9,5.8,27.0,28747.25,80.2,19.9,9.2,0.3,8.0
max,72153.0,10105720.0,4979641.0,5126081.0,100.0,100.0,86.9,90.3,41.8,33.7,...,59.2,43.2,33.0,45.1,4805817.0,88.8,64.8,38.0,8.0,40.9


describe() reports 35 numerical columns, but CountyId is not really a numerical quantity. It's a unique identifier for each county, so its mean and standard deviation are not meaningful.
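
If the identifier column gets in the way, one option is to drop it before summarizing. This is a minimal sketch on a three-row stand-in (the name `toy` is hypothetical; the values are copied from the first rows shown earlier), not a required step in the tutorial.

```python
import pandas as pd

# Three rows copied from the head() output above, with only a few columns.
toy = pd.DataFrame({
    "CountyId": [1001, 1003, 1005],
    "State": ["Alabama", "Alabama", "Alabama"],
    "TotalPop": [55036, 203360, 26201],
})

# Drop the identifier so describe() only summarizes real measurements.
summary = toy.drop(columns=["CountyId"]).describe()
print(summary)
```

drop() returns a new DataFrame, so the original table still contains CountyId for later lookups.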

As we saw in the objective, there are lots of things we can do with this dataset.

Let's pick the top 5 states by population, look at their employment numbers, and work from there!


## Select subsets of data

Before we dive into the data analysis, let's select the columns we want to work with. 
As we learned earlier, we can select a specific subset of columns using two methods:

1. by index location
2. by column names
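
Both approaches can be sketched on a small stand-in table. The name `toy` is hypothetical and the rows are copied from the head() output above; the point is only that iloc selects by position while double brackets select by name.

```python
import pandas as pd

# Two rows copied from the dataset preview, with only a few columns.
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "County": ["Autauga County", "Baldwin County"],
    "TotalPop": [55036, 203360],
    "Employed": [24112, 89527],
})

# 1. By index location: all rows, columns at positions 0 and 2.
by_position = toy.iloc[:, [0, 2]]

# 2. By column names: pass a list of names inside the brackets.
by_name = toy[["State", "TotalPop"]]

# Both selections contain the same data.
print(by_position.equals(by_name))
```

Selecting by name is usually easier to read and keeps working if the column order changes.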

As this dataset has so many columns, let's take a look at all of the columns first.

We can use df.columns, which returns all of the column names. (We could also use df.info().)

In [None]:
df.columns

Index(['CountyId', 'State', 'County', 'TotalPop', 'Men', 'Women', 'Hispanic',
       'White', 'Black', 'Native', 'Asian', 'Pacific', 'VotingAgeCitizen',
       'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr', 'Poverty',
       'ChildPoverty', 'Professional', 'Service', 'Office', 'Construction',
       'Production', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp',
       'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork',
       'SelfEmployed', 'FamilyWork', 'Unemployment'],
      dtype='object')

Since there are so many columns, it's hard to count the indices. So we will use the column names to select subsets of data!

In [None]:
# by column names
df_emp = df[['CountyId', 'State', 'County', 'TotalPop', 'Income', 'Employed', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome']]
df_emp

Unnamed: 0,CountyId,State,County,TotalPop,Income,Employed,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
0,1001,Alabama,Autauga County,55036,55317,24112,86.0,9.6,0.1,0.6,1.3,2.5
1,1003,Alabama,Baldwin County,203360,52562,89527,84.7,7.6,0.1,0.8,1.1,5.6
2,1005,Alabama,Barbour County,26201,33368,8878,83.4,11.1,0.3,2.2,1.7,1.3
3,1007,Alabama,Bibb County,22580,43404,8171,86.4,9.5,0.7,0.3,1.7,1.5
4,1009,Alabama,Blount County,57667,47412,21380,86.8,10.2,0.1,0.4,0.4,2.1
...,...,...,...,...,...,...,...,...,...,...,...,...
3215,72145,Puerto Rico,Vega Baja Municipio,54754,18900,14234,92.0,4.2,0.9,1.4,0.6,0.9
3216,72147,Puerto Rico,Vieques Municipio,8931,16261,2927,76.3,16.9,0.0,5.0,0.0,1.7
3217,72149,Puerto Rico,Villalba Municipio,23659,19893,6873,83.1,11.8,0.1,2.1,0.0,2.8
3218,72151,Puerto Rico,Yabucoa Municipio,35025,15586,7878,87.6,9.2,0.0,1.4,1.8,0.1


## Aggregation

### Total population by state

In [None]:
state_pop = df_emp.groupby(['State'])['TotalPop'].sum()
state_pop

State
Alabama                  4850771
Alaska                    738565
Arizona                  6809946
Arkansas                 2977944
California              38982847
Colorado                 5436519
Connecticut              3594478
Delaware                  943732
District of Columbia      672391
Florida                 20278447
Georgia                 10201635
Hawaii                   1421658
Idaho                    1657375
Illinois                12854526
Indiana                  6614418
Iowa                     3118102
Kansas                   2903820
Kentucky                 4424376
Louisiana                4663461
Maine                    1330158
Maryland                 5996079
Massachusetts            6789319
Michigan                 9925568
Minnesota                5490726
Mississippi              2986220
Missouri                 6075300
Montana                  1029862
Nebraska                 1893921
Nevada                   2887725
New Hampshire            1331848
New 

### Total population by state -- sorted

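Since there are so many states, it will be easier to see which states have the largest populations if we sort the result.

sort_values() sorts the result by its values, which here are the summed populations. It sorts in ascending order by default, so we pass ascending=False to put the largest states first.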
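
To see the two steps separately, here is a small sketch on four counties copied from the tables in this notebook (the names `toy`, `totals`, `largest_first`, and `smallest_first` are illustrative). It first sums within each state and then sorts the resulting Series in both directions.

```python
import pandas as pd

# Four counties copied from the outputs shown in this notebook.
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "California", "California"],
    "County": ["Autauga County", "Baldwin County",
               "Alameda County", "Alpine County"],
    "TotalPop": [55036, 203360, 1629615, 1203],
})

# Step 1: one total per state (the result is a Series indexed by State).
totals = toy.groupby("State")["TotalPop"].sum()

# Step 2: sort that Series by its values.
largest_first = totals.sort_values(ascending=False)
smallest_first = totals.sort_values()  # ascending order is the default

print(largest_first)
print(smallest_first)
```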

In [None]:
state_pop = df_emp.groupby(['State'])['TotalPop'].sum().sort_values(ascending=False)
state_pop

State
California              38982847
Texas                   27419612
Florida                 20278447
New York                19798228
Illinois                12854526
Pennsylvania            12790505
Ohio                    11609756
Georgia                 10201635
North Carolina          10052564
Michigan                 9925568
New Jersey               8960161
Virginia                 8365952
Washington               7169967
Arizona                  6809946
Massachusetts            6789319
Indiana                  6614418
Tennessee                6597381
Missouri                 6075300
Maryland                 5996079
Wisconsin                5763217
Minnesota                5490726
Colorado                 5436519
South Carolina           4893444
Alabama                  4850771
Louisiana                4663461
Kentucky                 4424376
Oregon                   4025127
Oklahoma                 3896251
Connecticut              3594478
Puerto Rico              3468963
Iowa

Let's select the top FIVE states by population!

#### These are the five states we will work with:
- California             
- Texas                   
- Florida                 
- New York                
- Illinois  
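
Rather than reading the top five off the sorted output by hand, the names could also be pulled out programmatically. The sketch below assumes a Series shaped like state_pop; `state_pop_sample` is a hypothetical stand-in holding seven of the totals printed above, deliberately out of order.

```python
import pandas as pd

# Seven state totals copied from the sorted output above, shuffled on purpose.
state_pop_sample = pd.Series(
    {
        "Ohio": 11609756,
        "Texas": 27419612,
        "Illinois": 12854526,
        "California": 38982847,
        "New York": 19798228,
        "Pennsylvania": 12790505,
        "Florida": 20278447,
    },
    name="TotalPop",
)

# nlargest(5) keeps the five biggest values, largest first.
top_five = state_pop_sample.nlargest(5)

# The index holds the state names, ready to pass to a filter.
top_five_names = top_five.index.tolist()
print(top_five_names)
```

On an already sorted Series, head(5) would give the same five states.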

We are going to use loc, since we want to filter the dataset based on a condition.

We could use the line below, which selects rows where 'State' is California, Texas, Florida, New York, or Illinois.

In [None]:
df_emp.loc[(df_emp['State']=='California') | (df_emp['State']=='Texas') | (df_emp['State']=='Florida') | (df_emp['State']== "New York") | (df_emp['State']=='Illinois')]

Unnamed: 0,CountyId,State,County,TotalPop,Income,Employed,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
186,6001,California,Alameda County,1629615,85743,826310,62.0,10.1,14.6,3.7,3.6,6.0
187,6003,California,Alpine County,1203,63438,374,62.1,13.2,0.0,9.1,1.6,14.0
188,6005,California,Amador County,37306,60636,13444,80.7,9.2,0.1,2.5,1.1,6.4
189,6007,California,Butte County,225207,46516,93439,75.0,11.1,1.2,3.2,3.9,5.7
190,6009,California,Calaveras County,45057,54800,16721,79.1,9.9,1.0,1.2,0.6,8.3
...,...,...,...,...,...,...,...,...,...,...,...,...
2772,48499,Texas,Wood County,43315,48038,15960,80.6,10.0,0.3,2.5,1.0,5.6
2773,48501,Texas,Yoakum County,8481,62500,3755,78.0,16.8,0.0,3.7,0.2,1.3
2774,48503,Texas,Young County,18166,46351,8248,83.0,9.0,0.0,0.6,1.4,6.1
2775,48505,Texas,Zapata County,14415,34550,5146,69.2,23.8,0.0,4.4,0.6,2.0


But the code above is very lengthy, so let's learn a shortcut!

We can use the isin(state_list) method to check whether each 'State' value is in state_list.

This is very similar to 'a' in ['a','b','c'] in Python, applied to every row at once.
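
The equivalence between the chained comparisons and isin() can be checked on a tiny example. The names `toy` and `wanted` are hypothetical; the state and county names are copied from the tables in this notebook.

```python
import pandas as pd

# Four rows with state and county names taken from the outputs above.
toy = pd.DataFrame({
    "State": ["California", "Texas", "Alabama", "Puerto Rico"],
    "County": ["Alameda County", "Wood County",
               "Autauga County", "Vieques Municipio"],
})
wanted = ["California", "Texas"]

# Plain Python membership test on a single value.
print("Texas" in wanted)

# Long form: one comparison per state, joined with | (element-wise OR).
chained = toy.loc[(toy["State"] == "California") | (toy["State"] == "Texas")]

# Short form: isin() runs the membership test on every row.
shortcut = toy.loc[toy["State"].isin(wanted)]

print(chained.equals(shortcut))
```

The list passed to isin() can be as long as needed without making the filter harder to read.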

In [None]:
five_states = df_emp.loc[df_emp['State'].isin(['California','Texas','Florida','New York','Illinois'])]
five_states

Unnamed: 0,CountyId,State,County,TotalPop,Income,Employed,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
186,6001,California,Alameda County,1629615,85743,826310,62.0,10.1,14.6,3.7,3.6,6.0
187,6003,California,Alpine County,1203,63438,374,62.1,13.2,0.0,9.1,1.6,14.0
188,6005,California,Amador County,37306,60636,13444,80.7,9.2,0.1,2.5,1.1,6.4
189,6007,California,Butte County,225207,46516,93439,75.0,11.1,1.2,3.2,3.9,5.7
190,6009,California,Calaveras County,45057,54800,16721,79.1,9.9,1.0,1.2,0.6,8.3
...,...,...,...,...,...,...,...,...,...,...,...,...
2772,48499,Texas,Wood County,43315,48038,15960,80.6,10.0,0.3,2.5,1.0,5.6
2773,48501,Texas,Yoakum County,8481,62500,3755,78.0,16.8,0.0,3.7,0.2,1.3
2774,48503,Texas,Young County,18166,46351,8248,83.0,9.0,0.0,0.6,1.4,6.1
2775,48505,Texas,Zapata County,14415,34550,5146,69.2,23.8,0.0,4.4,0.6,2.0


### Average income by state

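Now that we have selected five states to work with, let's get the average income in each state. Note that mean() averages the county-level Income values, so every county counts equally regardless of its population.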

In [None]:
five_states.groupby(['State'])['Income'].mean().sort_values(ascending=False)

State
California    61046.758621
New York      58309.258065
Illinois      52930.382353
Texas         49894.338583
Florida       47144.328358
Name: Income, dtype: float64

California has the highest average income and Florida has the lowest among the five states.
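
It helps to remember what this average measures: groupby(...).mean() adds up the county values in each state and divides by the number of counties. The sketch below checks that by hand on four counties copied from the tables above (`toy`, `state_mean`, and `manual_ca` are illustrative names).

```python
import pandas as pd

# Four counties copied from the five_states output above.
toy = pd.DataFrame({
    "State": ["California", "California", "Texas", "Texas"],
    "County": ["Alameda County", "Alpine County",
               "Wood County", "Yoakum County"],
    "TotalPop": [1629615, 1203, 43315, 8481],
    "Income": [85743, 63438, 48038, 62500],
})

# Mean of the county-level Income values within each state.
state_mean = toy.groupby("State")["Income"].mean()
print(state_mean)

# The same number for California, computed by hand from its two counties.
manual_ca = (85743 + 63438) / 2
print(manual_ca)
```

In this sample, Alpine County (population 1203) carries the same weight as Alameda County (population 1629615), which is worth keeping in mind when comparing states.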

### Total number of employees by state
### Now Try This
Now, let's look at the employed population in each state.

In [None]:
five_states.groupby(['State'])['Employed'].sum().sort_values(ascending=False)

State
California    17993915
Texas         12689069
New York       9467631
Florida        9018570
Illinois       6181653
Name: Employed, dtype: int64

Write down the state with the highest employed population and the state with the lowest employed population.



### Transit modes by state

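Let's look at how people commute in each state and see which transit mode is most popular.

We will group by 'State' and take the sum of each transit-mode column. These columns hold county-level percentages (each row adds up to roughly 100), so a state's sums grow with its number of counties. Compare the modes within a state rather than the totals across states.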

In [None]:
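# Select several columns with a list (double brackets); recent pandas versions reject the bare comma-separated form
five_states.groupby(['State'])[['Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome']].sum()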

Unnamed: 0_level_0,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
California,4265.3,633.3,170.1,193.5,151.0,386.8
Florida,5349.1,687.6,72.2,98.2,157.0,335.6
Illinois,8420.2,872.8,108.7,248.1,126.7,422.4
New York,4619.3,523.4,406.5,290.6,100.0,260.2
Texas,20455.0,2888.5,79.4,600.6,363.2,1013.9
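
To answer "which mode is most popular in each state" without scanning the table by eye, idxmax(axis=1) returns the column name holding the largest value in each row. The sketch below rebuilds the summed table from the output above; the names `mode_sums`, `most_popular`, and `runner_up` are illustrative.

```python
import pandas as pd

# Summed transit-mode table, copied from the output above.
mode_sums = pd.DataFrame(
    {
        "Drive": [4265.3, 5349.1, 8420.2, 4619.3, 20455.0],
        "Carpool": [633.3, 687.6, 872.8, 523.4, 2888.5],
        "Transit": [170.1, 72.2, 108.7, 406.5, 79.4],
        "Walk": [193.5, 98.2, 248.1, 290.6, 600.6],
        "OtherTransp": [151.0, 157.0, 126.7, 100.0, 363.2],
        "WorkAtHome": [386.8, 335.6, 422.4, 260.2, 1013.9],
    },
    index=pd.Index(
        ["California", "Florida", "Illinois", "New York", "Texas"],
        name="State",
    ),
)

# Column name with the largest value in each row (one answer per state).
most_popular = mode_sums.idxmax(axis=1)
print(most_popular)

# Drop 'Drive' to see which mode comes second in each state.
runner_up = mode_sums.drop(columns=["Drive"]).idxmax(axis=1)
print(runner_up)
```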


### Now Try This

Step 1: Select California only

Step 2: Calculate the sum of all transit modes by county in California.

In [None]:
# Step 1: Select California only
ca_transit = five_states.loc[five_states['State']=='California']
ca_transit

Unnamed: 0,CountyId,State,County,TotalPop,Income,Employed,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
186,6001,California,Alameda County,1629615,85743,826310,62.0,10.1,14.6,3.7,3.6,6.0
187,6003,California,Alpine County,1203,63438,374,62.1,13.2,0.0,9.1,1.6,14.0
188,6005,California,Amador County,37306,60636,13444,80.7,9.2,0.1,2.5,1.1,6.4
189,6007,California,Butte County,225207,46516,93439,75.0,11.1,1.2,3.2,3.9,5.7
190,6009,California,Calaveras County,45057,54800,16721,79.1,9.9,1.0,1.2,0.6,8.3
191,6011,California,Colusa County,21479,56481,9470,76.9,14.2,0.3,2.5,1.6,4.5
192,6013,California,Contra Costa County,1123678,88456,535401,68.1,11.7,10.3,1.7,1.9,6.3
193,6015,California,Del Norte County,27442,41287,8742,73.7,14.5,0.6,3.6,1.9,5.6
194,6017,California,El Dorado County,185015,74885,80373,77.8,8.5,1.8,2.0,1.6,8.3
195,6019,California,Fresno County,971616,48730,392003,78.1,12.2,1.2,1.6,2.5,4.3


In [None]:
# Step 2: Calculate the sum of all transit modes by county in California.

ca_transit.groupby(['County'])[['Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome']].sum()

Unnamed: 0_level_0,Drive,Carpool,Transit,Walk,OtherTransp,WorkAtHome
County,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alameda County,62.0,10.1,14.6,3.7,3.6,6.0
Alpine County,62.1,13.2,0.0,9.1,1.6,14.0
Amador County,80.7,9.2,0.1,2.5,1.1,6.4
Butte County,75.0,11.1,1.2,3.2,3.9,5.7
Calaveras County,79.1,9.9,1.0,1.2,0.6,8.3
Colusa County,76.9,14.2,0.3,2.5,1.6,4.5
Contra Costa County,68.1,11.7,10.3,1.7,1.9,6.3
Del Norte County,73.7,14.5,0.6,3.6,1.9,5.6
El Dorado County,77.8,8.5,1.8,2.0,1.6,8.3
Fresno County,78.1,12.2,1.2,1.6,2.5,4.3


## Takeaways from this tutorial
Whenever we work with tabular data, Pandas lets us load it, inspect it, select subsets, filter rows, and aggregate by group to answer questions about it.

In later weeks, we will dive deeper into Pandas and use it to analyze more complicated data.

## Resources
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Pandas API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)
- [A Gentle Introduction to Pandas](https://medium.com/@wbusaka/a-gentle-introduction-to-pandas-5ed17421a59d)

## Homework

## Appendix