<a href="https://colab.research.google.com/github/bitprj/DigitalHistory/blob/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/Week3-Introduction-to-Data-Manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/bitproject.png?raw=1" width="200" align="left"> 
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/data-science.jpg?raw=1" width="300" align="right">

# <div align="center">Data Manipulation with Pandas</div>




## Table of Contents
- Why, Where and How we use Pandas
  - What is Pandas?
  - Data structures in Pandas
- What we will be learning today
  - About the dataset
  - Goals
- Importing Pandas library
- Loading a file
- Setting an index
- Getting info about the dataset
- Removing NaN (None) values
  - **1.0 - Now Try This**
- Removing a column
- Selecting subsets of data
  - **2.0 - Now Try This**
  - **3.0 - Now Try This**
  - **4.0 - Now Try This**
- Filtering dataset based on criteria
  - **5.0 - Now Try This**
  - **6.0 - Now Try This**
- Aggregation functions
  - Sum
  - Min / Max
  - Mean
  - **7.0 - Now Try This**
- Practical Exercise
  - About the dataset
  - Setting an index
  - **8.0 - Now Try This**
  - Aggregate functions
  - **9.0 - Now Try This**
  - **10.0 - Now Try This**
  - **11.0 - Now Try This**

## Why, Where and How we use Pandas
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/pandas.png?raw=1" width="200" align="center"> 

### What is Pandas?
This week, we will cover basic data manipulation using Pandas.
1. Pandas is an open source data analysis and manipulation tool and it is widely used both in academia and industry.
2. It is built on top of the Python programming language. 
3. It offers data structures and operations for manipulating numerical tables and time series.

### Data structures in Pandas
Pandas provides two main data structures: Series and DataFrame.

1. A Series is a 1-dimensional labelled array, comparable to a single column in Excel. It can hold data of **any type** (integer, string, Python objects, etc.), and its labels are called the index.
2. A DataFrame is a 2-dimensional labelled data structure with both rows and columns, comparable to a table or spreadsheet.

Older versions of Pandas also included a 3-dimensional structure called Panel, but it has since been removed from the library.

This week, we will focus on DataFrame, and we will learn about Series in later weeks.
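As a rough illustration of the two structures, the sketch below builds a tiny Series and a tiny DataFrame by hand. The names (`ages`, `people`) and the values are made up for this example and are not part of the Titanic dataset.

```python
import pandas as pd

# A Series: one labelled column of values (the labels form the index).
ages = pd.Series([22, 38, 26], index=['a', 'b', 'c'], name='Age')

# A DataFrame: a table whose columns share one index.
people = pd.DataFrame(
    {'Age': [22, 38, 26], 'Name': ['Ann', 'Bob', 'Cy']},
    index=['a', 'b', 'c'],
)

print(ages.ndim, people.ndim)   # number of dimensions of each structure
print(type(people['Age']))      # selecting one column gives back a Series
```

Selecting a single column of a DataFrame returns a Series, which is why the two structures are usually learned together.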


Since we've covered the fundamentals of Python, it will be fairly easy to pick up Pandas.

## What we will be learning today
### Goals:
- Getting a quick overview of the dataset 
- Removing columns / rows with NaN values
- Selecting and filtering based on criteria
- Analyzing the survival rates in the Titanic dataset


## Grading

In order to work on the Now Try This (NTT) sections and submit them for grading, you'll need to run the code block below. It will ask for your student ID number and then create a folder that will house your answers for each question. At the very end of the notebook, there is a code section that will download this folder as a zip file to your computer. This zip file will be your final submission.

In [None]:
import os
import shutil

!rm -rf sample_data

student_id = input('Please Enter your Student ID: ') # Enter Student ID.

# Keep asking until the ID has exactly 9 characters
while len(student_id) != 9:
  student_id = input('Please Enter your Student ID: ')
  
folder_location = f'{student_id}/Week_Three/Now_Try_This'
if not os.path.exists(folder_location):
  os.makedirs(folder_location)
  print('Successfully Created Directory, Lets get started')
else:
  print('Directory Already Exists')

## About the dataset:

## Titanic
<img src="https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/icons/titanic.png?raw=1" width="300" align="right"> 
To begin with Pandas and dataframes, we will use a dataset about the Titanic. Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died.


This dataset does not include everyone aboard. It covers 891 passengers and records the following for each one: name, age, sex, ticket price, and, most importantly, whether or not they survived.

Each passenger has a unique PassengerId, so each row represents one passenger.



## Import Pandas library

In order to read / load a file, we will need to import Pandas.

It's a convention to use ```import pandas as pd``` when importing the Pandas library.

In [None]:
import pandas as pd

Once we've imported Pandas, we can use ```pd``` to call any function in Pandas.



## Load file

To read the csv file with our data, we will use the ```read_csv``` function.

Since we are working with only one dataset, we will simply name the dataframe df.

If we were working with several dataframes, it would be better to give each one a meaningful name (e.g., titanic_data, passenger_info).

In [None]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/data/titanic-dataset.csv'
df = pd.read_csv(url)
df

#### Set index

Now, the dataset is loaded as a dataframe named df.

The first column is the index, which starts from 0 by default.

But, as you can tell, PassengerId is already a unique identifier. So, let's set PassengerId as the index.

We can call the ```set_index``` function and specify the index column using ```keys=```

In [None]:
df.set_index(keys='PassengerId')

The code above ran, and the output shows PassengerId as the index.

Let's call df one more time to check whether df itself has been updated to reflect the change.

In [None]:
df

IMPORTANT: df has NOT been updated. Do you know why?

```df.set_index(keys='PassengerId')``` returns a new dataframe with PassengerId as the index; it does not modify df itself.

Since we didn't save the result of the function call, df has NOT been updated.

There are two ways to save the change.
1. ```df = df.set_index(keys='PassengerId')```
2. ```df.set_index(keys='PassengerId', inplace=True)```

The first call reassigns the variable ```df``` to the updated dataframe, and the second call makes the change in place.

In [None]:
df.set_index(keys='PassengerId', inplace=True)
df

Okay, now PassengerId is set as the index!
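The same behaviour can be seen on a tiny, made-up table. This is only a sketch: `toy`, `indexed`, and the column names are hypothetical and chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'Id': [101, 102, 103], 'Score': [7, 9, 8]})

# set_index returns a NEW dataframe; toy itself is left unchanged.
indexed = toy.set_index(keys='Id')
print(toy.index.name)      # None: the original still has the default index
print(indexed.index.name)  # Id

# Reassigning the variable keeps the change.
toy = toy.set_index(keys='Id')
print(list(toy.columns))   # ['Score']
```

Once a column becomes the index, it no longer appears in the list of columns.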

## Basic info about the dataset

Now, let's get basic information about the dataframe.
- head()
- describe()
- info()

### head()
The ```head()``` function is useful for seeing the dataset at a quick glance, as it returns the first n rows.

Let's check what columns this file has by calling the ```head()``` function.

By default, ```head()``` returns the first 5 rows.

In [None]:
df.head()

You can specify the number of rows to display by calling ```df.head(number)```

In [None]:
df.head(10)

### info()

Now, we know what's in the dataset and what it looks like.

To summarize what information is available in the dataset, we can use the info() function.

This function is useful because it shows all of the **column names** and **their types**, as well as the **Non-Null** counts.

In [None]:
df.info()

We can tell that "Age" and "Cabin" have lots of missing values; the dataset only has data for 714 ages and 204 cabins for the 891 passengers.

If we take a closer look at the dtypes in the second-to-last line of the output, there are three dtypes: ```int64```, ```float64```, and ```object```.

We have covered ```int``` and ```float``` last week in Python, but what is an object?

- ```int64```: integer numbers
- ```float64```: floating point numbers
- ```object```: string or mixed numeric and non-numeric values.
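To see how these dtypes arise, here is a small sketch on a made-up three-row table (the values are illustrative only). One detail worth noticing is that a numeric column containing a missing value is stored as ```float64```. The text column is left out of the checks on purpose, since how Pandas labels text columns can depend on the installed version.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Survived': [0, 1, 1],                          # whole numbers, nothing missing
    'Age': [22.0, np.nan, 26.0],                    # a missing value (NaN) gives a float column
    'Ticket': ['A/5 21171', 'PC 17599', '113803'],  # text
})

print(toy.dtypes)               # one dtype per column
print(toy['Age'].isna().sum())  # number of missing ages
```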

The dtype of "Name", "Sex", and "Embarked" is ```object``` because their values are strings.

"Ticket" and "Cabin" are also ```object``` columns, as their values are text that mixes letters and numbers (e.g., A/5 21171, C85).



### describe()

```describe()``` is used to view summary statistics of numeric columns. This helps us to have a general idea of the dataset.

```Count```, ```mean```, ```min```, and ```max``` are straightforward.

Let's refresh our memory with statistical concepts.

- ```std```: standard deviation - measures the dispersion of a dataset relative to its mean. If the data points are further from the mean, there is a higher deviation within the dataset. The more spread out the data, the higher the standard deviation.
- ```25%```: the value below which 25% of the observations may be found. 
- ```50%```: the value below which 50% of the observations may be found. 
- ```75%```: the value below which 75% of the observations may be found. 

For example, the 25th percentile of age is 20.125 and the 75th percentile of age is 38. This means that 25% of the passengers with a recorded age were younger than 20.125 and 75% were younger than 38.

In [None]:
df.describe()
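To connect the percentile rows of ```describe()``` to something concrete, the sketch below uses the whole numbers 1 to 100 instead of real ages. The variable names are made up for this example.

```python
import pandas as pd

values = pd.Series(range(1, 101))   # the numbers 1, 2, ..., 100

q25 = values.quantile(0.25)   # 25th percentile
q50 = values.quantile(0.50)   # 50th percentile (the median)
q75 = values.quantile(0.75)   # 75th percentile
print(q25, q50, q75)

# Share of values that fall below each percentile
print((values < q25).mean(), (values < q75).mean())

# describe() reports the same numbers in its 25% / 50% / 75% rows
print(values.describe()['25%'] == q25)
```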

### shape
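To see the size of the dataset, we can use the ```shape``` attribute, which returns the number of rows and columns in the format (# rows, # columns). Because ```shape``` is an attribute rather than a function, it is written without parentheses.

This dataset has 891 rows (entities) and 11 columns.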

In [None]:
df.shape

## Remove NaN values

Often, when we work with large datasets, we will encounter cases where many elements are missing (NaN / null).

Dropping the rows with NaN values allows us to work with a clean dataset.


---


Let's remove the rows that do not provide meaningful information.

When we know the "unique key" of the dataset (PassengerId in this dataset), we can check whether every row has one. If any row is missing its PassengerId, we can drop that row.

* ```df.dropna()```: drop the rows where at least one of the elements is missing.
* ```df.dropna(how='all')```: drop the rows where all of the elements are missing.
* ```df.dropna(subset=[columns])```: define in which columns to look for missing values.

If we want to drop the rows with **at least** one missing element:

In [None]:
df.dropna()

If we want to drop the rows with **all** elements missing:

In [None]:
df.dropna(how='all')

If we want to drop the rows that are missing a Survived value:

In [None]:
df.dropna(subset=['Survived'])

Yay, we've confirmed that all of the passengers have a 'Survived' value, since the number of rows remains the same.

If we want to update the dataset after dropping rows, we can use ```inplace = True```

In [None]:
df.dropna(subset=['Survived'], inplace=True)
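The three variants of ```dropna``` are easier to compare on a small table where we know exactly which cells are missing. The table below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Survived': [1, 0, np.nan, np.nan],
    'Age': [22.0, np.nan, 30.0, np.nan],
})

any_missing = toy.dropna()                   # keeps only rows with no NaN at all
all_missing = toy.dropna(how='all')          # drops only rows that are entirely NaN
by_column = toy.dropna(subset=['Survived'])  # looks at the 'Survived' column only

print(len(toy), len(any_missing), len(all_missing), len(by_column))
```

None of these calls changes ```toy``` itself, because each one returns a new dataframe.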

### 1.0 - Now Try This

- Drop the rows that are missing a value in the following column: 'Pclass'
- Update ```df```

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/1.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


## Removing a column

Before we dive into the data analysis, let's see if there are any columns we want to remove. 

In [None]:
df.info()

In the "Cabin" column, only 204 rows are non-null. That means 891 - 204 = 687 values are missing from the column.

It wouldn't give us as much insight as the other columns, so let's remove the "Cabin" column using the ```del``` statement.

In [None]:
del df['Cabin']
df
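As an aside, ```del``` is not the only option. The sketch below, on a made-up table, contrasts it with the ```drop``` function, which returns a new dataframe instead of changing the original.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Ann', 'Bob'],
    'Cabin': [None, 'C85'],
    'Fare': [7.25, 71.28],
})

# drop returns a new dataframe without the column; toy still has all three columns.
without_cabin = toy.drop(columns=['Cabin'])
print(list(without_cabin.columns))   # ['Name', 'Fare']
print(list(toy.columns))             # ['Name', 'Cabin', 'Fare']

# del removes the column from toy itself.
del toy['Cabin']
print(list(toy.columns))             # ['Name', 'Fare']
```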

## Select subsets of data

When we only need a few columns for the analysis, we can select a specific subset of columns using two methods:

1. by index location
2. by column names

#### 1. by index location
We can select specific subsets of data using ```iloc[rows_index, columns_index]```.

As we learned last week, ```[:]``` selects everything in a list or string in Python. 

Similarly, ```[:]``` will select every row or column depending on where we put it.

Let's select PassengerId, Survived, and Pclass.

In [None]:
# by index location (iloc)
df.iloc[: , [0,1,2]]

We selected 0, 1, and 2 because PassengerId, Survived, and Pclass are the first 3 columns.

Hmm, but 4 columns showed up. Let's look into why!

Because PassengerId is now the index, it is displayed automatically alongside the selected columns.

So, index location 0 refers to the first column right after the index.
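Here is a small sketch of the same effect on a made-up table (the column names and values are hypothetical). Position 0 refers to the first regular column, not to the index.

```python
import pandas as pd

toy = pd.DataFrame({
    'Id': [1, 2, 3],
    'City': ['Oslo', 'Lima', 'Pune'],
    'Temp': [4, 19, 31],
    'Rain': [60, 1, 12],
}).set_index('Id')

first_two = toy.iloc[:, [0, 1]]   # all rows, columns at positions 0 and 1
print(list(first_two.columns))    # ['City', 'Temp']: the index 'Id' is not position 0

corner = toy.iloc[0:2, 0:2]       # slices work too: first two rows, first two columns
print(corner)
```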

### 2.0 - Now Try This

Select PassengerId, Survived, and Pclass with all rows.

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/2.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


### 3.0 - Now Try This

Select PassengerId, Survived, and Pclass with all rows.
Please use a colon ```:``` this time.

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/3.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


#### 2. by column names

Let's select subsets of data by column names.

We are interested in PassengerId, Survived, Sex, and Age.

In [None]:
# by column names
df[['PassengerId', 'Survived', 'Sex', 'Age']]

The code above doesn't work. What does the KeyError say?

```['PassengerId'] not in index.```

Remember? PassengerId is no longer a column, so we can't select it by a column name.

### 4.0 - Now Try This

Select PassengerId, Survived, Sex, and Age.



In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/4.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


## Filter Dataset based on criteria

Often, we are interested in working with specific rows that meet certain criteria.

If we only want to look at the data with Age > 30, we can specify the criteria within the ```loc``` indexer.

In [None]:
df_over_30yrs=df.loc[df['Age'] > 30]
df_over_30yrs

Now, let's select rows using two criteria: "Age" is greater than 30 AND "Survived" is 1.

```&``` is equivalent to ```AND``` and ```|``` is equivalent to ```OR``` when filtering a dataframe.


In [None]:
df_over_30yrs_survived = df.loc[(df['Age'] > 30) & (df['Survived'] == 1)]
df_over_30yrs_survived


IMPORTANT: When filtering with multiple conditions, make sure to wrap each condition in ```()```.

Otherwise, ```&``` and ```|``` are evaluated before the comparisons, and you will get an error message that ```The truth value of a Series is ambiguous.```

Let's check how many passengers survived among the ones whose age was over 30.

In [None]:
print("# of passengers whose age was over 30: ", df_over_30yrs.shape[0])
print("# of survived passengers whose age was over 30: ", df_over_30yrs_survived.shape[0])
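One way to keep such filters readable is to give each condition a name first. The sketch below does this on a made-up table and also shows the error raised when the parentheses are left out. All names here are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'Height': [150, 172, 181, 165], 'Member': [1, 0, 1, 0]})

tall = toy['Height'] > 170     # a Series of True/False, one value per row
member = toy['Member'] == 1

both = toy.loc[tall & member]      # AND: rows where both conditions are True
either = toy.loc[tall | member]    # OR: rows where at least one condition is True
print(len(both), len(either))

# Without parentheses, & is evaluated before > and ==, which raises a ValueError.
try:
    toy.loc[toy['Height'] > 170 & toy['Member'] == 1]
except ValueError as err:
    print('ValueError:', err)
```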

### 5.0 - Now Try This

Select the rows that meet the following condition:
- Pclass is not 1

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/5.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


### 6.0 - Now Try This

Select the rows that meet either of the following conditions:
- Age is less than 10 
- OR
- Age is greater than 50

Hint: Don't forget parentheses!

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/6.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


## Aggregation functions

Aggregation is the process of combining many values into a single summary value. It's useful for understanding the overall properties of a dataset and analyzing it.

Some examples of aggregation are ```sum()```, ```count()```, ```min()```, ```max()```,  ```mean()```, ```std()```, etc.

### Sum
#### 1. Total fares

In [None]:
df['Fare'].sum()

If we want to round total fares and save it as a variable, then we can try:

In [None]:
total_fares = df['Fare'].sum()
print("Total fares: ", round(total_fares))

#### 2. Survived passengers
We can also count the number of passengers that survived by summing up the 'Survived' column.

In [None]:
survived_passengers = df['Survived'].sum()
survived_passengers

### Max / Min

Let's calculate max and min of Fare.

In [None]:
df['Fare'].max()

In [None]:
df['Fare'].min()

### Mean
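Let's calculate the survival rate of all passengers. Because 'Survived' is 1 for survivors and 0 otherwise, the mean of the column is the share of passengers who survived.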

In [None]:
df['Survived'].mean()

Let's calculate the average age of all passengers.

In [None]:
df['Age'].mean()
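To check by hand what these functions return, here is a sketch on four made-up rows, small enough to verify with mental arithmetic.

```python
import pandas as pd

toy = pd.DataFrame({'Fare': [10.0, 20.0, 60.0, 30.0], 'Survived': [1, 0, 1, 0]})

total = toy['Fare'].sum()       # 10 + 20 + 60 + 30
lowest = toy['Fare'].min()
highest = toy['Fare'].max()
average = toy['Fare'].mean()    # total divided by the number of rows
print(total, lowest, highest, average)

# For a 0/1 column, sum counts the 1s and mean gives their share.
count_survived = toy['Survived'].sum()
rate_survived = toy['Survived'].mean()
print(count_survived, rate_survived)
```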

Now, we will tackle a more complex problem.

Let's calculate the survival rate by *age* group.
We can apply the filtering that we just learned to select the group whose age was over 30 and the group whose age was 30 or under. Passengers with a missing age fall into neither group.

In each group, we will calculate the mean of the 'Survived' column.

In [None]:
# filtering
df_over_30yrs = df.loc[df['Age'] > 30]
df_under_30yrs = df.loc[df['Age'] <= 30]

# calculating mean of 'Survived' for each group
mean_over_30 = df_over_30yrs['Survived'].mean()
mean_under_30 = df_under_30yrs['Survived'].mean()

# printing the mean survival rates for each group
# round(number, decimal_points): round a number to a given precision in decimal_points
print("Survival rate - age over 30: ", round(mean_over_30*100, 3), "%")
print("Survival rate - age 30 or under: ", round(mean_under_30*100, 3), "%")

There's not much difference between the two groups.

### Groupby

Now, we will group by *sex* to see if there's any difference between female and male passengers.

Here, we use the ```groupby``` function, which lets us group the dataset by a column ('Sex') and then apply an aggregate function to each group.

In [None]:
# numeric_only=True skips text columns such as 'Name', which cannot be averaged
df.groupby(['Sex']).mean(numeric_only=True)

If we want to sort the aggregated result by a column, we can use ```sort_values(by=column_name)```.

Let's sort the result above by "Survived"

In [None]:
df.groupby(['Sex']).mean(numeric_only=True).sort_values(by="Survived")

If we are interested in the survival rate of each group, we can use ```[ ]``` after the groupby call to specify which column to display.

In [None]:
# group by 'Sex' and calculate mean of 'Survived'
df.groupby(['Sex'])['Survived'].mean()

There was a significant difference in the survival rate by *sex*!

We can also apply groupby on multiple columns.

In [None]:
df.groupby(['Sex', 'Pclass']).mean(numeric_only=True)

In both groups (female and male), the survival rate was a lot higher in Pclass 1 than in the other classes!
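The grouping patterns used above can be tried on a small made-up table where the group means are easy to check by hand. The team and player names are hypothetical, and ```numeric_only=True``` is used so that the text column is skipped when averaging.

```python
import pandas as pd

toy = pd.DataFrame({
    'Team': ['red', 'red', 'blue', 'blue'],
    'Level': [1, 2, 1, 1],
    'Won': [1, 0, 1, 1],
    'Player': ['Ann', 'Bob', 'Cy', 'Dee'],
})

# Mean of one column per group
win_rate = toy.groupby('Team')['Won'].mean()
print(win_rate)

# Mean of every numeric column per group; the text column 'Player' is skipped
all_means = toy.groupby('Team').mean(numeric_only=True)
print(all_means)

# Grouping by two columns gives one row per (Team, Level) combination
by_two = toy.groupby(['Team', 'Level'])['Won'].mean()
print(by_two)
```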

### 7.0 - Now Try This

Some of us might now be curious:

Is a lower Pclass number more expensive, or a higher one?

We can answer the question by calculating the mean fares for each class.

**Calculate the mean fares for each Pclass!**

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/7.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


### Let's pause and think!
Did you see any correlation between Pclass, Fare, and Survival rate? Briefly describe what you have found here.

## Takeaways

Using dataframes and aggregate functions in Pandas, we can answer many of the questions that come up about a dataset!

In the tutorial section, we will apply what we have learned in Pandas and further analyze a new dataset.

# Practical Exercise

### About the dataset

In the tutorial, we will be using the US Census Demographic Data.

The data were collected by the US Census Bureau and cover the entire country.

This dataset covers lots of areas: state, county, gender, ethnicity, professional working fields, means of transportation to work, and employment.

All of this information is available at a State and County level. There are many questions that we could try to answer with the dataset:
- Unemployment by state
- Professional fields by state and county
- Means of transportation to work by county in CA
- ...

### Objective
Since the dataset covers all of the states in the US, we are going to select the 5 largest states by population. 
Once we've selected these five states, we will examine the residents' means of transportation to work at a state and county level.

That's our focus in the tutorial, but feel free to play around with it as you'd like.

Let's load our data first!

### Load file

In [None]:
url = 'https://raw.githubusercontent.com/bitprj/DigitalHistory/master/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/data/acs2017_county_data.csv'
df = pd.read_csv(url)
df

### 8.0 - Now Try This

CountyId is a unique identifier for each county.
- Set CountyId as an index.
- Update df

Hint: Use ```set_index``` and don't forget to update df

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/8.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


### Basic info about the dataset


In [None]:
df.head()

Since there are so many columns, the head function doesn't display all of them.

Let's use info(), as it returns **ALL** of the **column names**, their types, and Non-Null counts.



In [None]:
df.info()

There are 36 columns and 3220 counties.

Also, none of the columns have missing values, as every column has 3220 non-null values! That's great!

Let's view summary statistics of the numeric columns and figure out what we want to get out of this dataset.

In [None]:
df.describe()

There are 34 numerical columns out of 36 in this dataset.

As we saw in the objective, there are lots of things we can do with this dataset.

But, let's pick the top 5 states by population and work from there!


## Select subsets of data

Before we dive into the data analysis, let's select the columns we want to work with. 
As we learned earlier, we can select a specific subset of columns using two methods:

1. by index location
2. by column names

As this dataset has so many columns, let's take a look at all of the columns first.

We can use the ```columns``` attribute, which returns all of the column names. We could use ```df.info()``` as well.

In [None]:
df.columns

Since there are so many columns, it's hard to count the index location. So we will use the column names to select subsets of data!

As discussed in the objective, our main focus is transportation methods for workers and population. So, we will select the following columns!

In [None]:
# by column names
df_emp = df[['State', 'County', 'TotalPop', 'Income', 'Employed', 'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome']]
df_emp

## Aggregation

### Total population by state

Let's get the total population by state to select the 5 largest states by population.

To calculate this, we need to group by state and get the ```sum``` of ```TotalPop```.

### 9.0 - Now Try This
- Calculate total population by state from ```df_emp```
- Name the dataframe as ```state_pop```

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/9.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


## Total population by state -- sorted

Since there are so many states, it will be easier to see which states have the largest populations if we sort the data.

If we use ```sort_values()```, the data will be sorted by the aggregated value. Passing ```ascending=False``` puts the largest values first.

In [None]:
state_pop = df_emp.groupby(['State'])['TotalPop'].sum().sort_values(ascending=False)
state_pop

#### Top 5 states

If we want to see the top 5 results from any dataframe or series, we can use the ```head(n)``` function to display the first n rows.

In [None]:
state_pop.head(5)
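Sorting and then taking the first rows is one way to get the largest values. As a sketch of an alternative, a Pandas Series also has an ```nlargest(n)``` function that does both steps at once. The numbers below are made up and are not real populations.

```python
import pandas as pd

pop = pd.Series({'Ohio': 12, 'Utah': 3, 'Iowa': 2, 'Texas': 29})

top_two_sorted = pop.sort_values(ascending=False).head(2)   # sort, then take the first 2
top_two_direct = pop.nlargest(2)                            # same result in one call

print(top_two_sorted)
print(top_two_direct.equals(top_two_sorted))
```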

#### These are the top 5 largest states by population:
- California             
- Texas                   
- Florida                 
- New York                
- Illinois  

We are going to use ```loc``` as we want to filter dataset based on criteria.

We could use the line below: selecting rows if 'State' is California, Texas, Florida, New York, or Illinois.

In [None]:
df_emp.loc[(df_emp['State']=='California') | (df_emp['State']=='Texas') | (df_emp['State']=='Florida') | (df_emp['State']== "New York") | (df_emp['State']=='Illinois')]

But the code above is very lengthy, so we will learn a shortcut!

We can use the ```isin(list_of_values)``` function to check whether 'State' is in a list of values.

The syntax is very similar to ```'a' in ['a','b','c']``` in Python.

In [None]:
five_states = df_emp.loc[df_emp['State'].isin(['California','Texas','Florida','New York','Illinois'])]
five_states
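To confirm that ```isin``` really selects the same rows as the long chain of ```|``` conditions, the sketch below compares the two on a made-up table. The values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({'State': ['Ohio', 'Utah', 'Iowa', 'Ohio'], 'Pop': [5, 3, 2, 7]})

wanted = ['Ohio', 'Iowa']
mask = toy['State'].isin(wanted)   # True where State is one of the wanted values
long_form = (toy['State'] == 'Ohio') | (toy['State'] == 'Iowa')

print(mask.equals(long_form))      # both approaches build the same True/False Series
print(toy.loc[mask])
print(toy.loc[~mask])              # ~ inverts the mask: rows NOT in the list
```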

### Average income by state

Now that we have selected five states to work with, let's get the average income in each state. Since each row is a county, this is the average of the county-level incomes in each state.

In [None]:
five_states.groupby(['State'])['Income'].mean().sort_values(ascending=False)

California has the highest average income and Florida has the lowest average income amongst these five states.

### Total number of employees by state
### 10.0 - Now Try This
- Calculate the total number of employees by state
- Sort by value
- Write down the state with the highest number of employees and the state with the lowest number of employees.

Hint: use ```groupby```

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/10.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# INSERT CODE BELOW


In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/10.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# What's the state with the highest number of employees?
answer1 = ''  # INSERT YOUR ANSWER HERE as a string
print(answer1)

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/10.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# What's the state with the lowest number of employees?
answer2 = ''  # INSERT YOUR ANSWER HERE as a string
print(answer2)

### Means of transportation to work by State

Let's look at each state's modes of transportation and see which one is most popular in each state.

We will group by 'State' and get the sum of each transportation mode.

In [None]:
# Use double brackets to select several columns after groupby
five_states.groupby(['State'])[['Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome']].sum()

As California is the largest state by both population and employment, we will work with the California data only.


### 11.0 - Now Try This

#### Step 1: Select the rows for California only and save them as ```ca_transit```

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/11.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.


# Step 1: Select the rows for California only
# INSERT CODE BELOW



#### Step 2: Calculate ```sum``` of all transit modes by ```county``` in California.

In [None]:
# Once you have verified your answer, please uncomment the line below and run it; this will save your code.
#%%writefile -a {folder_location}/11.py
# Please note that if you uncomment and run it multiple times, the program will keep appending to the file.

# Step 2: Calculate the sum of all transit modes by county in California.
# INSERT CODE BELOW



## Takeaways from this tutorial
Whenever you have tabular data, Pandas can help you explore it and answer many questions about it!

In later weeks, we will dive deeper into Pandas to analyze more complicated data.

## Resources
- [About Titanic](https://en.wikipedia.org/wiki/Titanic)
- [US Census Demographic Data](https://www.kaggle.com/muonneutrino/us-census-demographic-data)
- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Pandas API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html)
- [A Gentle Introduction to Pandas](https://medium.com/@wbusaka/a-gentle-introduction-to-pandas-5ed17421a59d)

## Appendix

### How to download a dataset from Kaggle

Kaggle is the world's largest data science community with powerful tools and resources. There are lots of datasets you can download from this website.

#### Step 1: go to [Kaggle.com](https://Kaggle.com)

#### Step 2: click the Data tab

![Kaggle Data](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/Kaggle_data.png?raw=1)

#### Step 3: search for a dataset of your interest or explore the most popular datasets on the main page

![Search page](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/search_page.png?raw=1)

#### Step 4: once you select a dataset, you can read its description and download it. If you click the Download button at the top, all of the files in the dataset will be downloaded.

![US Census main](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/US_Census_main.png?raw=1)

#### Step 5: if you want to download a specific dataset, hit the download icon in the selected dataset 

![Download specific](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/download_specific.png?raw=1)

#### Step 6: if you'd like to see all columns, check ```select all``` and it will show all of the columns in the dataset.

![View columns](https://github.com/bitprj/DigitalHistory/blob/Narae/Week3-Introduction-to-Open-Data-Importing-Data-and-Basic-Data-Wrangling/assets/view_columns.png?raw=1)

#### Step 7: if your dataset is a csv file, then use ```read_csv()```. If your dataset is an Excel file, use ```read_excel()``` to load your dataset.
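As a minimal sketch of the last step, the example below reads csv text with ```read_csv()```. To keep it self-contained, the csv content is a made-up string wrapped in ```io.StringIO``` rather than a downloaded file. With a real download, you would pass the file path instead.

```python
import io
import pandas as pd

# Made-up csv content standing in for a downloaded file
csv_text = "Id,Name,Value\n1,alpha,10\n2,beta,20\n"

# index_col sets the index while loading, so a separate set_index call is not needed
small = pd.read_csv(io.StringIO(csv_text), index_col='Id')

print(small)
print(small.shape)
```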