# Python Data Science

> Dataframe Wrangling with Pandas

Kuo, Yao-Jen from [DATAINPOINT](https://www.datainpoint.com/)

In [1]:
import requests
import json

## TL;DR

> In this lecture, we will talk about essential data wrangling skills in `pandas`.

## Essential Data Wrangling Skills in `pandas`

## What is `pandas`?

> Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.

Source: <https://github.com/pandas-dev/pandas>

## Why `pandas`?

Python used to have a weak spot in its analysis capability because it lacked an appropriate structure for handling common tabular datasets. Until `pandas` arrived, Python users had to switch to a more data-centric language like R or Matlab during the analysis stage.

## Import `pandas` with the `import` statement

By convention, `pandas` is imported under the alias `pd`.

In [2]:
import pandas as pd

## If Pandas is not installed, we will encounter a `ModuleNotFoundError`

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
```

## Use `pip install` in a terminal to install `pandas`

```bash
pip install pandas
```

## Check the installed version and its installation file path

- `__version__` attribute
- `__file__` attribute

In [3]:
print(pd.__version__)
print(pd.__file__)

1.1.2
/opt/conda/lib/python3.8/site-packages/pandas/__init__.py


## What does `pandas` mean?

![](https://media.giphy.com/media/46Zj6ze2Z2t4k/giphy.gif)

Source: <https://giphy.com/>

## It turns out the name has nothing to do with the animal; it refers to three primary classes created by its author, [Wes McKinney](https://wesmckinney.com/)

- **Pan**el (deprecated since version 0.20.0)
- **Da**taFrame
- **S**eries

## In order to master `pandas`, it is vital to understand the relationships between `Index`, `ndarray`, `Series`, and `DataFrame`

- An `Index` and an `ndarray` together make up a `Series`
- Several `Series` that share the same `Index` can then form a `DataFrame`
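
The relationships above can be sketched with a few lines of code. This is a minimal illustration only: the labels and numbers are placeholder values, and the names `labels`, `confirmed`, `deaths`, and `demo_df` are chosen for this example.

```python
import numpy as np
import pandas as pd

# An Index holds the labels; an ndarray holds the values (placeholder data)
labels = pd.Index(['a', 'b', 'c'], name='key')
values = np.array([10, 20, 30])

# Index + ndarray -> Series
confirmed = pd.Series(values, index=labels, name='Confirmed')
deaths = pd.Series(np.array([1, 2, 3]), index=labels, name='Deaths')

# Several Series sharing the same Index -> DataFrame
demo_df = pd.DataFrame({'Confirmed': confirmed, 'Deaths': deaths})

print(type(confirmed.index))       # the labels of the Series
print(type(confirmed.to_numpy()))  # the values of the Series
print(demo_df)
```

Each column of `demo_df` is itself a `Series`, and all of them share the `Index` stored in `demo_df.index`.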

## Review of the definition of modern data science

> Modern data science is a huge field. It involves applications and tools like importing, tidying, transformation, visualization, modeling, and communication. Surrounding all these is programming.

![Imgur](https://i.imgur.com/din6Ig6.png)

Source: [R for Data Science](https://r4ds.had.co.nz/)

## Key functionalities analysts rely on `pandas` for are

- Importing
- Tidying
- Transforming

## Tidying and transforming together are also known as WRANGLING

![](https://media.giphy.com/media/MnlZWRFHR4xruE4N2Z/giphy.gif)

Source: <https://giphy.com/>

## Importing

## `pandas` provides many functions for importing tabular data

- Flat text file
- Database table
- Spreadsheet
- Array of JSON objects
- HTML `<table></table>` tags
- ...etc.

Source: <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>

## Using `read_csv` function for flat text files

In [4]:
from datetime import date
from datetime import timedelta

today = date.today()
day_delta = timedelta(days=-2)
data_date = today + day_delta
print(data_date)
print(type(data_date))
data_date_str = data_date.strftime('%m-%d-%Y')
print(data_date_str)

2020-10-26
<class 'datetime.date'>
10-26-2020


In [5]:
daily_report_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{}.csv"
daily_report_url = daily_report_url.format(data_date_str)
daily_report = pd.read_csv(daily_report_url)
print(type(daily_report))

<class 'pandas.core.frame.DataFrame'>


## Using `read_sql` function for database tables

```python
import sqlite3

conn = sqlite3.connect('YOUR_DATABASE.db')
sql_query = """
SELECT * 
  FROM YOUR_TABLE
 LIMIT 100;
"""
pd.read_sql(sql_query, conn)
```
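
For a version that runs without an existing database file, the sketch below builds a throwaway in-memory SQLite database first. The table name `demo_table`, its columns, and its rows are all made up for illustration.

```python
import sqlite3
import pandas as pd

# Build a temporary in-memory database (hypothetical table and rows)
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE demo_table (country TEXT, confirmed INTEGER)")
conn.executemany(
    "INSERT INTO demo_table VALUES (?, ?)",
    [('A', 10), ('B', 20), ('C', 30)],
)
conn.commit()

# The ? placeholder is filled from params, so values are not pasted into the SQL text
sql_query = """
SELECT country, confirmed
  FROM demo_table
 WHERE confirmed >= ?
 ORDER BY confirmed;
"""
demo_df = pd.read_sql(sql_query, conn, params=(20,))
conn.close()
print(demo_df)
```

Passing values through `params` rather than string formatting is a common way to keep queries safe when the filter value comes from user input.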

## Using `read_excel` function for spreadsheets

```python
excel_file_path = "PATH/TO/YOUR/EXCEL/FILE"
pd.read_excel(excel_file_path)
```

## Using `read_json` function for an array of JSON objects

> JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

Source: <https://www.json.org/json-en.html>

In [6]:
web_api = "https://data.nba.net/prod/v2/2019/teams.json"
resp_dict = requests.get(web_api).json()
teams_dict = resp_dict['league']['standard']
json_str = json.dumps(teams_dict)
with open('teams.json', 'w') as f:
    f.write(json_str)

In [7]:
teams = pd.read_json("teams.json", orient='records')
print(type(teams))

<class 'pandas.core.frame.DataFrame'>
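
Since the example above depends on a web API being reachable, here is a small offline sketch of the same idea. The records and the field names `team_id` and `name` are hypothetical and are not the fields returned by the API above.

```python
import json
from io import StringIO

import pandas as pd

# A JSON array of objects: each object becomes one row (hypothetical records)
records = [
    {'team_id': 1, 'name': 'Team A'},
    {'team_id': 2, 'name': 'Team B'},
]
json_str = json.dumps(records)

# read_json accepts a file path or a file-like object such as StringIO
demo_teams = pd.read_json(StringIO(json_str), orient='records')
print(demo_teams)
```

With `orient='records'`, the keys of each object are used as column names.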


## Using `read_html` function for HTML `<table></table>` tags

> The `<table>` tag defines an HTML table. An HTML table consists of one `<table>` element and one or more `<tr>`, `<th>`, and `<td>` elements. The `<tr>` element defines a table row, the `<th>` element defines a table header, and the `<td>` element defines a table cell.

Source: <https://www.w3schools.com/default.asp>

In [8]:
request_url = "https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table"
html_tables = pd.read_html(request_url)
print(type(html_tables))
print(len(html_tables))

<class 'list'>
22


In [9]:
html_tables[1]

Unnamed: 0_level_0,Unnamed: 0_level_0,Summer Games,Summer Games,Summer Games,Summer Games,Summer Games,Winter Games,Winter Games,Winter Games,Winter Games,Winter Games,Combined Total,Combined Total,Combined Total,Combined Total,Combined Total
Unnamed: 0_level_1,Team (IOC code),№,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Total,№,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Total,№,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Total
0,Afghanistan (AFG),14,0,0,2,2,0,0,0,0,0,14,0,0,2,2
1,Algeria (ALG),13,5,4,8,17,3,0,0,0,0,16,5,4,8,17
2,Argentina (ARG),24,21,25,28,74,19,0,0,0,0,43,21,25,28,74
3,Armenia (ARM),6,2,6,6,14,7,0,0,0,0,13,2,6,6,14
4,Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148,Zimbabwe (ZIM) [ZIM],13,3,4,1,8,1,0,0,0,0,14,3,4,1,8
149,Independent Olympic Athletes (IOA) [IOA],3,1,0,1,2,0,0,0,0,0,3,1,0,1,2
150,Independent Olympic Participants (IOP) [IOP],1,0,1,2,3,0,0,0,0,0,1,0,1,2,3
151,Mixed team (ZZX) [ZZX],3,8,5,4,17,0,0,0,0,0,3,8,5,4,17


## Basic attributes and methods

## Basic attributes of a `DataFrame` object

- `shape`
- `dtypes`
- `index`
- `columns`

In [10]:
print(daily_report.shape)
print(daily_report.dtypes)
print(daily_report.index)
print(daily_report.columns)

(3958, 14)
FIPS                   float64
Admin2                  object
Province_State          object
Country_Region          object
Last_Update             object
Lat                    float64
Long_                  float64
Confirmed                int64
Deaths                   int64
Recovered                int64
Active                 float64
Combined_Key            object
Incidence_Rate         float64
Case-Fatality_Ratio    float64
dtype: object
RangeIndex(start=0, stop=3958, step=1)
Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incidence_Rate', 'Case-Fatality_Ratio'],
      dtype='object')


## Basic methods of a `DataFrame` object

- `head(n)`
- `tail(n)`
- `describe`
- `info`
- `set_index`
- `reset_index`

## `head(n)` returns the top n observations with header

In [11]:
daily_report.head() # n defaults to 5

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
0,,,,Afghanistan,2020-10-27 04:24:45,33.93911,67.709953,40937,1518,34150,5269.0,Afghanistan,105.159889,3.708137
1,,,,Albania,2020-10-27 04:24:45,41.1533,20.1683,19445,480,10705,8260.0,Albania,675.689763,2.468501
2,,,,Algeria,2020-10-27 04:24:45,28.0339,1.6596,56419,1922,39273,15224.0,Algeria,128.660566,3.406654
3,,,,Andorra,2020-10-27 04:24:45,42.5063,1.5218,4325,72,2957,1296.0,Andorra,5597.618585,1.66474
4,,,,Angola,2020-10-27 04:24:45,-11.2027,17.8739,9644,270,3530,5844.0,Angola,29.343155,2.799668


## `tail(n)` returns the bottom n observations with header

In [12]:
daily_report.tail(3)

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
3955,,,,Yemen,2020-10-27 04:24:45,15.552727,48.516388,2060,599,1360,101.0,Yemen,6.906733,29.07767
3956,,,,Zambia,2020-10-27 04:24:45,-13.133897,27.849332,16200,348,15445,407.0,Zambia,88.120315,2.148148
3957,,,,Zimbabwe,2020-10-27 04:24:45,-19.015438,29.154857,8303,242,7797,264.0,Zimbabwe,55.863828,2.914609


## `describe` returns the descriptive summary for numeric columns

In [13]:
daily_report.describe()

Unnamed: 0,FIPS,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incidence_Rate,Case-Fatality_Ratio
count,3261.0,3877.0,3877.0,3958.0,3958.0,3958.0,3954.0,3877.0,3914.0
mean,32376.071144,35.988073,-72.314962,10986.35,292.924962,7377.222,3319.522,2317.163268,2.132531
std,17974.010299,12.899356,53.156609,60073.91,1792.11928,73587.58,61925.86,1694.677076,2.714726
min,66.0,-52.368,-174.1596,0.0,0.0,0.0,-3460455.0,0.0,0.0
25%,19051.0,33.269842,-96.616867,242.0,3.0,0.0,203.0,1101.511018,0.769231
50%,30067.0,37.938993,-86.879242,760.5,13.0,0.0,653.5,2028.776083,1.587302
75%,47037.0,42.160731,-77.63786,3084.0,61.0,0.0,2116.25,3210.536183,2.768728
max,99999.0,71.7069,178.065,1648665.0,43348.0,3460455.0,1046402.0,17705.773956,81.422925


## `info` returns the concise information of the dataframe

In [14]:
daily_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3958 entries, 0 to 3957
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3261 non-null   float64
 1   Admin2               3266 non-null   object 
 2   Province_State       3789 non-null   object 
 3   Country_Region       3958 non-null   object 
 4   Last_Update          3958 non-null   object 
 5   Lat                  3877 non-null   float64
 6   Long_                3877 non-null   float64
 7   Confirmed            3958 non-null   int64  
 8   Deaths               3958 non-null   int64  
 9   Recovered            3958 non-null   int64  
 10  Active               3954 non-null   float64
 11  Combined_Key         3958 non-null   object 
 12  Incidence_Rate       3877 non-null   float64
 13  Case-Fatality_Ratio  3914 non-null   float64
dtypes: float64(6), int64(3), object(5)
memory usage: 433.0+ KB


## `set_index` replaces current `Index` with a specific variable

In [15]:
daily_report.set_index('Combined_Key')

Unnamed: 0_level_0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incidence_Rate,Case-Fatality_Ratio
Combined_Key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Afghanistan,,,,Afghanistan,2020-10-27 04:24:45,33.939110,67.709953,40937,1518,34150,5269.0,105.159889,3.708137
Albania,,,,Albania,2020-10-27 04:24:45,41.153300,20.168300,19445,480,10705,8260.0,675.689763,2.468501
Algeria,,,,Algeria,2020-10-27 04:24:45,28.033900,1.659600,56419,1922,39273,15224.0,128.660566,3.406654
Andorra,,,,Andorra,2020-10-27 04:24:45,42.506300,1.521800,4325,72,2957,1296.0,5597.618585,1.664740
Angola,,,,Angola,2020-10-27 04:24:45,-11.202700,17.873900,9644,270,3530,5844.0,29.343155,2.799668
...,...,...,...,...,...,...,...,...,...,...,...,...,...
West Bank and Gaza,,,,West Bank and Gaza,2020-10-27 04:24:45,31.952200,35.233200,50952,454,44055,6443.0,998.781515,0.891035
Western Sahara,,,,Western Sahara,2020-10-27 04:24:45,24.215500,-12.885800,10,1,8,1.0,1.674116,10.000000
Yemen,,,,Yemen,2020-10-27 04:24:45,15.552727,48.516388,2060,599,1360,101.0,6.906733,29.077670
Zambia,,,,Zambia,2020-10-27 04:24:45,-13.133897,27.849332,16200,348,15445,407.0,88.120315,2.148148


## `reset_index` replaces the current `Index` with the default `RangeIndex`

In [16]:
daily_report.set_index('Combined_Key').reset_index()

Unnamed: 0,Combined_Key,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incidence_Rate,Case-Fatality_Ratio
0,Afghanistan,,,,Afghanistan,2020-10-27 04:24:45,33.939110,67.709953,40937,1518,34150,5269.0,105.159889,3.708137
1,Albania,,,,Albania,2020-10-27 04:24:45,41.153300,20.168300,19445,480,10705,8260.0,675.689763,2.468501
2,Algeria,,,,Algeria,2020-10-27 04:24:45,28.033900,1.659600,56419,1922,39273,15224.0,128.660566,3.406654
3,Andorra,,,,Andorra,2020-10-27 04:24:45,42.506300,1.521800,4325,72,2957,1296.0,5597.618585,1.664740
4,Angola,,,,Angola,2020-10-27 04:24:45,-11.202700,17.873900,9644,270,3530,5844.0,29.343155,2.799668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3953,West Bank and Gaza,,,,West Bank and Gaza,2020-10-27 04:24:45,31.952200,35.233200,50952,454,44055,6443.0,998.781515,0.891035
3954,Western Sahara,,,,Western Sahara,2020-10-27 04:24:45,24.215500,-12.885800,10,1,8,1.0,1.674116,10.000000
3955,Yemen,,,,Yemen,2020-10-27 04:24:45,15.552727,48.516388,2060,599,1360,101.0,6.906733,29.077670
3956,Zambia,,,,Zambia,2020-10-27 04:24:45,-13.133897,27.849332,16200,348,15445,407.0,88.120315,2.148148
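
One detail worth noting is that both methods return a new `DataFrame` by default and leave the original untouched. The sketch below shows this on a tiny made-up frame; the names `demo`, `indexed`, `restored`, and `dropped` are for illustration only.

```python
import pandas as pd

demo = pd.DataFrame({'key': ['a', 'b', 'c'], 'value': [1, 2, 3]})

# set_index returns a new DataFrame; demo keeps its RangeIndex
indexed = demo.set_index('key')
print(demo.index)
print(indexed.index)

# reset_index moves the index back into a regular column
restored = indexed.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = indexed.reset_index(drop=True)
print(restored)
print(dropped)
```

To keep the result, assign it back to a name, for example `demo = demo.set_index('key')`.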


## Basic Dataframe Wrangling

## Basic wrangling is like writing SQL queries

- Selecting: `SELECT FROM`
- Filtering: `WHERE`
- Subsetting: `SELECT FROM WHERE`
- Indexing
- Sorting: `ORDER BY`
- Deriving
- Summarizing
- Summarizing and Grouping: `GROUP BY`

## Selecting a column as `Series`

In [17]:
print(daily_report['Country_Region'])
print(type(daily_report['Country_Region']))

0              Afghanistan
1                  Albania
2                  Algeria
3                  Andorra
4                   Angola
               ...        
3953    West Bank and Gaza
3954        Western Sahara
3955                 Yemen
3956                Zambia
3957              Zimbabwe
Name: Country_Region, Length: 3958, dtype: object
<class 'pandas.core.series.Series'>


## Selecting a column as `DataFrame`

In [18]:
print(type(daily_report[['Country_Region']]))
daily_report[['Country_Region']]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Country_Region
0,Afghanistan
1,Albania
2,Algeria
3,Andorra
4,Angola
...,...
3953,West Bank and Gaza
3954,Western Sahara
3955,Yemen
3956,Zambia


## Selecting multiple columns always returns a `DataFrame`

In [19]:
cols = ['Country_Region', 'Province_State']
daily_report[cols]

Unnamed: 0,Country_Region,Province_State
0,Afghanistan,
1,Albania,
2,Algeria,
3,Andorra,
4,Angola,
...,...,...
3953,West Bank and Gaza,
3954,Western Sahara,
3955,Yemen,
3956,Zambia,


## Filtering rows with conditional statements

In [20]:
is_taiwan = daily_report['Country_Region'] == 'Taiwan*'
daily_report[is_taiwan]

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
624,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727


## Subsetting columns and rows simultaneously

In [21]:
cols_to_select = ['Country_Region', 'Confirmed']
rows_to_filter = daily_report['Country_Region'] == 'Taiwan*'
daily_report[rows_to_filter][cols_to_select]

Unnamed: 0,Country_Region,Confirmed
624,Taiwan*,550
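
The same subset can also be taken in a single step with `loc[]`, which accepts a boolean mask for rows and a list of names for columns. The sketch below uses a small made-up frame to compare the two spellings.

```python
import pandas as pd

# Placeholder data with the same column names as the daily report
demo = pd.DataFrame({
    'Country_Region': ['A', 'B', 'C'],
    'Confirmed': [10, 20, 30],
    'Deaths': [1, 2, 3],
})
rows_to_filter = demo['Country_Region'] == 'B'
cols_to_select = ['Country_Region', 'Confirmed']

# Two-step version: filter rows, then select columns
chained = demo[rows_to_filter][cols_to_select]

# One-step version: rows and columns in a single loc[] call
single_step = demo.loc[rows_to_filter, cols_to_select]
print(single_step)
```

Both give the same result when reading data. The single `loc[]` call is usually the safer choice when the subset is going to be assigned to, because it avoids operating on an intermediate copy.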


## Indexing `DataFrame` with

- `loc[]`
- `iloc[]`

## `loc[]` indexes a `DataFrame` by `Index` labels

In [22]:
print(daily_report.loc[3388, ['Country_Region', 'Confirmed']]) # as Series
daily_report.loc[[3388], ['Country_Region', 'Confirmed']] # as DataFrame

Country_Region    US
Confirmed         45
Name: 3388, dtype: object


Unnamed: 0,Country_Region,Confirmed
3388,US,45


## `iloc[]` indexes a `DataFrame` by integer position

In [23]:
print(daily_report.iloc[3388, [3, 7]]) # as Series
daily_report.iloc[[3388], [3, 7]] # as DataFrame

Country_Region    US
Confirmed         45
Name: 3388, dtype: object


Unnamed: 0,Country_Region,Confirmed
3388,US,45
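
With the default `RangeIndex`, labels and positions happen to coincide, which is why the two examples above look identical. The difference becomes visible once the `Index` holds something else. A minimal sketch with made-up labels `'x'`, `'y'`, `'z'`:

```python
import pandas as pd

demo = pd.DataFrame(
    {'Confirmed': [10, 20, 30], 'Deaths': [1, 2, 3]},
    index=['x', 'y', 'z'],
)

# loc[] takes labels; iloc[] takes integer positions
by_label = demo.loc['y', 'Confirmed']
by_position = demo.iloc[1, 0]
print(by_label, by_position)

# Slicing differs too: loc[] includes the end label, iloc[] excludes the end position
loc_slice = demo.loc['x':'y']
iloc_slice = demo.iloc[0:1]
print(len(loc_slice), len(iloc_slice))
```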


## Sorting `DataFrame` with

- `sort_values`
- `sort_index`

## `sort_values` sorts `DataFrame` with specific columns

In [24]:
daily_report.sort_values(['Country_Region', 'Confirmed'])

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
0,,,,Afghanistan,2020-10-27 04:24:45,33.939110,67.709953,40937,1518,34150,5269.0,Afghanistan,105.159889,3.708137
1,,,,Albania,2020-10-27 04:24:45,41.153300,20.168300,19445,480,10705,8260.0,Albania,675.689763,2.468501
2,,,,Algeria,2020-10-27 04:24:45,28.033900,1.659600,56419,1922,39273,15224.0,Algeria,128.660566,3.406654
3,,,,Andorra,2020-10-27 04:24:45,42.506300,1.521800,4325,72,2957,1296.0,Andorra,5597.618585,1.664740
4,,,,Angola,2020-10-27 04:24:45,-11.202700,17.873900,9644,270,3530,5844.0,Angola,29.343155,2.799668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3953,,,,West Bank and Gaza,2020-10-27 04:24:45,31.952200,35.233200,50952,454,44055,6443.0,West Bank and Gaza,998.781515,0.891035
3954,,,,Western Sahara,2020-10-27 04:24:45,24.215500,-12.885800,10,1,8,1.0,Western Sahara,1.674116,10.000000
3955,,,,Yemen,2020-10-27 04:24:45,15.552727,48.516388,2060,599,1360,101.0,Yemen,6.906733,29.077670
3956,,,,Zambia,2020-10-27 04:24:45,-13.133897,27.849332,16200,348,15445,407.0,Zambia,88.120315,2.148148
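
When sorting by several columns, `ascending` can be given as a list with one flag per column. The sketch below, on a small made-up frame, sorts countries alphabetically and then confirmed counts from largest to smallest within each country.

```python
import pandas as pd

demo = pd.DataFrame({
    'Country_Region': ['B', 'A', 'B', 'A'],
    'Confirmed': [5, 7, 9, 3],
})

# One ascending flag per sort column: country A-Z, then Confirmed high-to-low
sorted_demo = demo.sort_values(
    ['Country_Region', 'Confirmed'],
    ascending=[True, False],
)
print(sorted_demo)
```

The original row labels travel with the rows, so the `Index` of the result shows where each row came from.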


## `sort_index` sorts `DataFrame` with the `Index` of `DataFrame`

In [25]:
daily_report.sort_index(ascending=False)

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
3957,,,,Zimbabwe,2020-10-27 04:24:45,-19.015438,29.154857,8303,242,7797,264.0,Zimbabwe,55.863828,2.914609
3956,,,,Zambia,2020-10-27 04:24:45,-13.133897,27.849332,16200,348,15445,407.0,Zambia,88.120315,2.148148
3955,,,,Yemen,2020-10-27 04:24:45,15.552727,48.516388,2060,599,1360,101.0,Yemen,6.906733,29.077670
3954,,,,Western Sahara,2020-10-27 04:24:45,24.215500,-12.885800,10,1,8,1.0,Western Sahara,1.674116,10.000000
3953,,,,West Bank and Gaza,2020-10-27 04:24:45,31.952200,35.233200,50952,454,44055,6443.0,West Bank and Gaza,998.781515,0.891035
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,,,,Angola,2020-10-27 04:24:45,-11.202700,17.873900,9644,270,3530,5844.0,Angola,29.343155,2.799668
3,,,,Andorra,2020-10-27 04:24:45,42.506300,1.521800,4325,72,2957,1296.0,Andorra,5597.618585,1.664740
2,,,,Algeria,2020-10-27 04:24:45,28.033900,1.659600,56419,1922,39273,15224.0,Algeria,128.660566,3.406654
1,,,,Albania,2020-10-27 04:24:45,41.153300,20.168300,19445,480,10705,8260.0,Albania,675.689763,2.468501


## Deriving new variables from `DataFrame`

- Simple operations
- `pd.cut`
- `map` with a `dict`
- `map` with a function (or a lambda expression)

## Deriving new variable with simple operations

In [26]:
active = daily_report['Confirmed'] - daily_report['Deaths'] - daily_report['Recovered']
print(active)

0        5269
1        8260
2       15224
3        1296
4        5844
        ...  
3953     6443
3954        1
3955      101
3956      407
3957      264
Length: 3958, dtype: int64


## Deriving a categorical variable from a numerical one with `pd.cut`

In [27]:
import numpy as np

cut_bins = [0, 1000, 10000, 100000, np.inf]  # np.inf is the canonical spelling; the np.Inf alias was removed in NumPy 2.0
cut_labels = ['Less than 1000', 'Between 1000 and 10000', 'Between 10000 and 100000', 'Above 100000']
confirmed_categorical = pd.cut(daily_report['Confirmed'], bins=cut_bins, labels=cut_labels, right=False)
print(confirmed_categorical)

0       Between 10000 and 100000
1       Between 10000 and 100000
2       Between 10000 and 100000
3         Between 1000 and 10000
4         Between 1000 and 10000
                  ...           
3953    Between 10000 and 100000
3954              Less than 1000
3955      Between 1000 and 10000
3956    Between 10000 and 100000
3957      Between 1000 and 10000
Name: Confirmed, Length: 3958, dtype: category
Categories (4, object): ['Less than 1000' < 'Between 1000 and 10000' < 'Between 10000 and 100000' < 'Above 100000']
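
The `right` argument decides which edge of each bin is closed, and that matters for values sitting exactly on a boundary. The sketch below reuses the bins and labels from the cell above on a few hand-picked values to show the difference.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, 999, 1000, 10000, 250000])
cut_bins = [0, 1000, 10000, 100000, np.inf]
cut_labels = ['Less than 1000', 'Between 1000 and 10000',
              'Between 10000 and 100000', 'Above 100000']

# right=False: bins are [0, 1000), [1000, 10000), ... so 1000 falls in the second bin
left_closed = pd.cut(values, bins=cut_bins, labels=cut_labels, right=False)

# Default right=True: bins are (0, 1000], (1000, 10000], ... so 1000 falls in the
# first bin, and 0 is outside every bin and becomes missing
right_closed = pd.cut(values, bins=cut_bins, labels=cut_labels)

print(pd.DataFrame({'value': values,
                    'right=False': left_closed,
                    'right=True': right_closed}))
```

With labels such as "Less than 1000", `right=False` is the choice that matches the wording, which is presumably why the cell above uses it.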


## Deriving a categorical variable from another categorical variable with `map`

- Passing a `dict`
- Passing a function (or a lambda expression)

In [28]:
# Passing a dict
country_name = {
    'Taiwan*': 'Taiwan'
}
daily_report_tw = daily_report[is_taiwan]
daily_report_tw['Country_Region'].map(country_name)

624    Taiwan
Name: Country_Region, dtype: object
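
The example above only contains a value that appears in the `dict`. When a value has no matching key, `map` returns a missing value for it. The sketch below shows that behavior on a made-up `Series`, along with two possible ways to keep unmatched values unchanged.

```python
import pandas as pd

countries = pd.Series(['Taiwan*', 'Japan', 'US'])
country_name = {
    'Taiwan*': 'Taiwan'
}

# Values without a matching key become missing
mapped = countries.map(country_name)
print(mapped)

# Option 1: fall back to the original value where the mapping is missing
kept = countries.map(country_name).fillna(countries)

# Option 2: replace only touches values found in the dict
replaced = countries.replace(country_name)
print(kept)
print(replaced)
```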

In [29]:
# Passing a function
def is_us(x):
    if x == 'US':
        return 'US'
    else:
        return 'Not US'
daily_report['Country_Region'].map(is_us)

0       Not US
1       Not US
2       Not US
3       Not US
4       Not US
         ...  
3953    Not US
3954    Not US
3955    Not US
3956    Not US
3957    Not US
Name: Country_Region, Length: 3958, dtype: object

In [30]:
# Passing a lambda expression
daily_report['Country_Region'].map(lambda x: 'US' if x == 'US' else 'Not US')

0       Not US
1       Not US
2       Not US
3       Not US
4       Not US
         ...  
3953    Not US
3954    Not US
3955    Not US
3956    Not US
3957    Not US
Name: Country_Region, Length: 3958, dtype: object

## Summarizing `DataFrame` with aggregate methods

In [31]:
daily_report['Confirmed'].sum()

43483973

## Summarizing and grouping `DataFrame` with aggregate methods

In [32]:
daily_report.groupby('Country_Region')['Confirmed'].sum()

Country_Region
Afghanistan           40937
Albania               19445
Algeria               56419
Andorra                4325
Angola                 9644
                      ...  
West Bank and Gaza    50952
Western Sahara           10
Yemen                  2060
Zambia                16200
Zimbabwe               8303
Name: Confirmed, Length: 189, dtype: int64
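
Several summaries can be computed per group in one call with `agg`. The following is a sketch on placeholder data; the output column names `confirmed_total`, `deaths_total`, and `n_rows` are chosen for this example.

```python
import pandas as pd

demo = pd.DataFrame({
    'Country_Region': ['A', 'A', 'B'],
    'Confirmed': [10, 20, 5],
    'Deaths': [1, 2, 0],
})

# Named aggregation: new_column=(source_column, aggregate_function)
summary = demo.groupby('Country_Region').agg(
    confirmed_total=('Confirmed', 'sum'),
    deaths_total=('Deaths', 'sum'),
    n_rows=('Confirmed', 'count'),
)
print(summary)
```

This is roughly what `SELECT Country_Region, SUM(Confirmed), SUM(Deaths), COUNT(Confirmed) ... GROUP BY Country_Region` would express in SQL.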

## More Dataframe Wrangling Operations

## Other common `DataFrame` wrangling operations include

- Dealing with missing values
- Dealing with text values
- Reshaping dataframes
- Merging and joining dataframes

## Dealing with missing values

- Using `isnull` or `notnull` to check if `np.nan` exists
- Using `dropna` to drop rows with `np.nan`
- Using `fillna` to fill `np.nan` with specific values

In [33]:
print(daily_report['Province_State'].size)
print(daily_report['Province_State'].isnull().sum())
print(daily_report['Province_State'].notnull().sum())

3958
169
3789


In [34]:
print(daily_report.dropna().shape)
print(daily_report['FIPS'].fillna(0))

(3191, 14)
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
3953    0.0
3954    0.0
3955    0.0
3956    0.0
3957    0.0
Name: FIPS, Length: 3958, dtype: float64
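
`dropna` and `fillna` both accept arguments that narrow their scope to particular columns. The sketch below uses a small made-up frame that borrows three column names from the daily report; the values are placeholders.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'FIPS': [1.0, np.nan, np.nan],
    'Province_State': ['X', None, 'Y'],
    'Confirmed': [1, 2, 3],
})

# Count missing values per column
print(demo.isnull().sum())

# Drop rows with a missing value in any column
dropped_any = demo.dropna()

# Drop rows only when Province_State is missing
dropped_subset = demo.dropna(subset=['Province_State'])

# Fill each column with its own replacement value
filled = demo.fillna({'FIPS': 0, 'Province_State': 'Unknown'})
print(filled)
```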


## Dealing with text values

Prior to `pandas` 1.0, `object` dtype was the only option.

In [35]:
daily_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3958 entries, 0 to 3957
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3261 non-null   float64
 1   Admin2               3266 non-null   object 
 2   Province_State       3789 non-null   object 
 3   Country_Region       3958 non-null   object 
 4   Last_Update          3958 non-null   object 
 5   Lat                  3877 non-null   float64
 6   Long_                3877 non-null   float64
 7   Confirmed            3958 non-null   int64  
 8   Deaths               3958 non-null   int64  
 9   Recovered            3958 non-null   int64  
 10  Active               3954 non-null   float64
 11  Combined_Key         3958 non-null   object 
 12  Incidence_Rate       3877 non-null   float64
 13  Case-Fatality_Ratio  3914 non-null   float64
dtypes: float64(6), int64(3), object(5)
memory usage: 433.0+ KB


## Now we can assign the `string` dtype to text columns

In [36]:
for col in daily_report.columns:
    if daily_report[col].dtype == 'object':
        daily_report[col] = daily_report[col].astype('string')
daily_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3958 entries, 0 to 3957
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3261 non-null   float64
 1   Admin2               3266 non-null   string 
 2   Province_State       3789 non-null   string 
 3   Country_Region       3958 non-null   string 
 4   Last_Update          3958 non-null   string 
 5   Lat                  3877 non-null   float64
 6   Long_                3877 non-null   float64
 7   Confirmed            3958 non-null   int64  
 8   Deaths               3958 non-null   int64  
 9   Recovered            3958 non-null   int64  
 10  Active               3954 non-null   float64
 11  Combined_Key         3958 non-null   string 
 12  Incidence_Rate       3877 non-null   float64
 13  Case-Fatality_Ratio  3914 non-null   float64
dtypes: float64(6), int64(3), string(5)
memory usage: 433.0 KB


## Splitting strings with `str.split` as a `Series`

In [37]:
split_pattern = ', '
daily_report['Combined_Key'].str.split(split_pattern)

0              [Afghanistan]
1                  [Albania]
2                  [Algeria]
3                  [Andorra]
4                   [Angola]
                ...         
3953    [West Bank and Gaza]
3954        [Western Sahara]
3955                 [Yemen]
3956                [Zambia]
3957              [Zimbabwe]
Name: Combined_Key, Length: 3958, dtype: object

## Splitting strings with `str.split` as a `DataFrame`

In [38]:
split_pattern = ', '
daily_report['Combined_Key'].str.split(split_pattern, expand=True)

Unnamed: 0,0,1,2
0,Afghanistan,,
1,Albania,,
2,Algeria,,
3,Andorra,,
4,Angola,,
...,...,...,...
3953,West Bank and Gaza,,
3954,Western Sahara,,
3955,Yemen,,
3956,Zambia,,


## Along with the new `string` data type, `pd.NA` was introduced

In [39]:
split_key = daily_report['Combined_Key'].str.split(split_pattern, expand=True)
print(split_key[1][3408] is pd.NA)

False
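
To see `pd.NA` directly, the following sketch builds a small `string` Series with one missing entry. The data is made up; the point is only how the missing value is represented and how it behaves in a comparison.

```python
import pandas as pd

s = pd.Series(['Taiwan', None, 'Japan'], dtype='string')

# In a string Series, the missing entry is stored as pd.NA
print(s.iloc[1] is pd.NA)
print(s.isna())

# Comparisons propagate pd.NA instead of returning False
is_taiwan = s == 'Taiwan'
print(is_taiwan)

# Fill the unknowns explicitly before using the result as a filter
print(s[is_taiwan.fillna(False)])
```

Because a comparison with `pd.NA` is neither true nor false, deciding how to treat it (here, as `False`) is left to the analyst.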


## Replacing strings with `str.replace`

In [40]:
daily_report['Combined_Key'].str.replace(", ", ';')

0              Afghanistan
1                  Albania
2                  Algeria
3                  Andorra
4                   Angola
               ...        
3953    West Bank and Gaza
3954        Western Sahara
3955                 Yemen
3956                Zambia
3957              Zimbabwe
Name: Combined_Key, Length: 3958, dtype: string

## Testing for strings that match or contain a pattern with `str.contains`

In [41]:
print(daily_report['Country_Region'].str.contains('land').sum())
daily_report[daily_report['Country_Region'].str.contains('land')]

25


Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio
190,,,,Finland,2020-10-27 04:24:45,61.92411,25.748151,14970,354,9800,4816.0,Finland,270.18159,2.364729
233,,,,Iceland,2020-10-27 04:24:45,64.9631,-19.0208,4504,11,3463,1030.0,Iceland,1319.85348,0.244227
274,,,,Ireland,2020-10-27 04:24:45,53.1424,-7.6921,58067,1885,23364,32818.0,Ireland,1175.970008,3.24625
412,,,Aruba,Netherlands,2020-10-27 04:24:45,12.5211,-69.9683,4422,36,4222,164.0,"Aruba, Netherlands",4141.767979,0.814111
413,,,"Bonaire, Sint Eustatius and Saba",Netherlands,2020-10-27 04:24:45,12.1784,-68.2385,150,3,126,21.0,"Bonaire, Sint Eustatius and Saba, Netherlands",572.060562,2.0
414,,,Curacao,Netherlands,2020-10-27 04:24:45,12.1696,-68.99,873,1,590,282.0,"Curacao, Netherlands",531.992687,0.114548
415,,,Drenthe,Netherlands,2020-10-27 04:24:45,52.862485,6.618435,4551,74,0,4477.0,"Drenthe, Netherlands",921.848477,1.626016
416,,,Flevoland,Netherlands,2020-10-27 04:24:45,52.550383,5.515162,5453,106,0,5347.0,"Flevoland, Netherlands",1289.0613,1.943884
417,,,Friesland,Netherlands,2020-10-27 04:24:45,53.087337,5.7925,5177,83,0,5094.0,"Friesland, Netherlands",796.514231,1.603245
418,,,Gelderland,Netherlands,2020-10-27 04:24:45,52.061738,5.939114,27960,773,0,27187.0,"Gelderland, Netherlands",1340.395177,2.764664


## Reshaping dataframes from wide to long format with `pd.melt`

A common problem is a dataset in which some of the column names are not names of variables, but values of a variable.

In [42]:
ts_confirmed_global_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
ts_confirmed_global = pd.read_csv(ts_confirmed_global_url)
ts_confirmed_global

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,10/17/20,10/18/20,10/19/20,10/20/20,10/21/20,10/22/20,10/23/20,10/24/20,10/25/20,10/26/20
0,,Afghanistan,33.939110,67.709953,0,0,0,0,0,0,...,40141,40200,40287,40357,40510,40626,40687,40768,40833,40937
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,16774,17055,17350,17651,17948,18250,18556,18858,19157,19445
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,54203,54402,54616,54829,55081,55357,55630,55880,56143,56419
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,3377,3377,3623,3623,3811,3811,4038,4038,4038,4325
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,7462,7622,7829,8049,8338,8582,8829,9026,9381,9644
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262,,West Bank and Gaza,31.952200,35.233200,0,0,0,0,0,0,...,46746,47135,47616,48129,48628,49134,49579,49989,50442,50952
263,,Western Sahara,24.215500,-12.885800,0,0,0,0,0,0,...,10,10,10,10,10,10,10,10,10,10
264,,Yemen,15.552727,48.516388,0,0,0,0,0,0,...,2055,2056,2056,2057,2057,2057,2060,2060,2060,2060
265,,Zambia,-13.133897,27.849332,0,0,0,0,0,0,...,15789,15853,15897,15982,16000,16035,16095,16117,16117,16200


## We can pivot the columns into a new pair of variables

To describe that operation we need four parameters:

- The set of columns whose names are not values
- The set of columns whose names are values
- The name of the variable to move the column names to
- The name of the variable to move the column values to

## In this example, the four parameters are

- `id_vars`: `['Province/State', 'Country/Region', 'Lat', 'Long']`
- `value_vars`: The columns from `1/22/20` to the last column
- `var_name`: Let's name it `Date`
- `value_name`: Let's name it `Confirmed`

In [43]:
id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long']
ts_confirmed_global_long = pd.melt(
    ts_confirmed_global,
    id_vars=id_vars,
    var_name='Date',
    value_name='Confirmed',
)
ts_confirmed_global_long

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed
0,,Afghanistan,33.939110,67.709953,1/22/20,0
1,,Albania,41.153300,20.168300,1/22/20,0
2,,Algeria,28.033900,1.659600,1/22/20,0
3,,Andorra,42.506300,1.521800,1/22/20,0
4,,Angola,-11.202700,17.873900,1/22/20,0
...,...,...,...,...,...,...
74488,,West Bank and Gaza,31.952200,35.233200,10/26/20,50952
74489,,Western Sahara,24.215500,-12.885800,10/26/20,10
74490,,Yemen,15.552727,48.516388,10/26/20,2060
74491,,Zambia,-13.133897,27.849332,10/26/20,16200
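
The mechanics are easier to follow on a frame small enough to read in full. The sketch below melts a made-up two-row, two-date table and, as an optional extra step not in the original, converts the melted `Date` strings into datetimes.

```python
import pandas as pd

# Wide format: one column per date (placeholder values)
wide = pd.DataFrame({
    'Country/Region': ['A', 'B'],
    '1/22/20': [0, 1],
    '1/23/20': [2, 3],
})

# Long format: column names move into Date, cell values move into Confirmed
long = pd.melt(wide,
               id_vars=['Country/Region'],
               var_name='Date',
               value_name='Confirmed')

# Optional: parse the month/day/two-digit-year strings into datetimes
long['Date'] = pd.to_datetime(long['Date'], format='%m/%d/%y')
print(long)
```

Two rows and two date columns become four rows, one per country and date. When `value_vars` is omitted, every column not listed in `id_vars` is melted.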


## Merging and joining dataframes

- `merge` on column names
- `join` on index

## Using `merge` function to join dataframes on columns

In [44]:
left_df = daily_report[daily_report['Country_Region'].isin(['Taiwan*', 'Japan'])]
right_df = ts_confirmed_global_long[ts_confirmed_global_long['Country/Region'].isin(['Taiwan*', 'Korea, South'])]
# default: inner join
pd.merge(left_df, right_df, left_on='Country_Region', right_on='Country/Region')

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio,Province/State,Country/Region,Lat_y,Long,Date,Confirmed_y
0,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,1/22/20,1
1,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,1/23/20,1
2,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,1/24/20,3
3,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,1/25/20,3
4,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,1/26/20,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
274,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/22/20,548
275,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/23/20,548
276,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/24/20,550
277,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/25/20,550
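
The `_x` and `_y` endings in `Lat_x`, `Lat_y`, `Confirmed_x`, and `Confirmed_y` are the default suffixes `merge` adds when both frames share a column name. As a hedged sketch, the `suffixes` parameter can replace them with more descriptive labels; the two small frames below (`report`, `series`) are simplified stand-ins, not the real data.

```python
import pandas as pd

# Simplified stand-ins for daily_report and ts_confirmed_global_long
report = pd.DataFrame({'Country_Region': ['Taiwan*'],
                       'Confirmed': [550]})
series = pd.DataFrame({'Country/Region': ['Taiwan*', 'Taiwan*'],
                       'Date': ['10/24/20', '10/25/20'],
                       'Confirmed': [550, 550]})

# Both frames have a 'Confirmed' column, so it receives a suffix on each side
merged = pd.merge(report, series,
                  left_on='Country_Region', right_on='Country/Region',
                  suffixes=('_report', '_series'))

print(merged.columns.tolist())
```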


In [45]:
# left join
pd.merge(left_df, right_df, left_on='Country_Region', right_on='Country/Region', how='left')

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio,Province/State,Country/Region,Lat_y,Long,Date,Confirmed_y
0,,,Aichi,Japan,2020-10-27 04:24:45,35.035551,137.211621,5913,91,5499,323.0,"Aichi, Japan",78.294662,1.538982,,,,,,
1,,,Akita,Japan,2020-10-27 04:24:45,39.748679,140.408228,61,0,59,2.0,"Akita, Japan",6.311498,0.000000,,,,,,
2,,,Aomori,Japan,2020-10-27 04:24:45,40.781541,140.828896,189,2,60,127.0,"Aomori, Japan",15.164024,1.058201,,,,,,
3,,,Chiba,Japan,2020-10-27 04:24:45,35.510141,140.198917,4837,77,4336,424.0,"Chiba, Japan",77.275999,1.591896,,,,,,
4,,,Ehime,Japan,2020-10-27 04:24:45,33.624835,132.856842,116,6,110,0.0,"Ehime, Japan",8.661791,5.172414,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
323,,,,Taiwan*,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/22/20,548.0
324,,,,Taiwan*,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/23/20,548.0
325,,,,Taiwan*,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/24/20,550.0
326,,,,Taiwan*,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.7,121.0,10/25/20,550.0


In [46]:
# right join
pd.merge(left_df, right_df, left_on='Country_Region', right_on='Country/Region', how='right')

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio,Province/State,Country/Region,Lat_y,Long,Date,Confirmed_y
0,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/22/20,1
1,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/23/20,1
2,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/24/20,2
3,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/25/20,2
4,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/26/20,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
553,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.700000,121.000000,10/22/20,548
554,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.700000,121.000000,10/23/20,548
555,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.700000,121.000000,10/24/20,550
556,,,,Taiwan*,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,Taiwan*,23.700000,121.000000,10/25/20,550
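
The row counts of the three results follow from which keys are kept: unmatched keys contribute their own rows, and each matched key contributes (left rows) x (right rows). The sketch below reproduces this on made-up miniature frames and also tries `how='outer'` with `indicator=True`, two `merge` options not used in the cells above, to label where each row came from.

```python
import pandas as pd

# Miniature stand-ins: Japan only on the left, Korea only on the right
left = pd.DataFrame({'Country_Region': ['Japan', 'Japan', 'Taiwan*'],
                     'Province_State': ['Aichi', 'Akita', None]})
right = pd.DataFrame({'Country/Region': ['Korea, South', 'Taiwan*', 'Taiwan*'],
                      'Date': ['1/22/20', '1/22/20', '1/23/20']})

# Count the rows produced by each join type
row_counts = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(left, right,
                      left_on='Country_Region', right_on='Country/Region',
                      how=how)
    row_counts[how] = len(result)
print(row_counts)

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
outer = pd.merge(left, right,
                 left_on='Country_Region', right_on='Country/Region',
                 how='outer', indicator=True)
print(outer['_merge'].value_counts())
```

Here the inner join keeps only the 1 x 2 Taiwan rows, the left join adds the 2 unmatched Japan rows, the right join adds the 1 unmatched Korea row, and the outer join keeps all of them.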


## Using the `join` method to join dataframes on their index

In [47]:
left_df = daily_report[daily_report['Country_Region'].isin(['Taiwan*', 'Japan'])]
right_df = ts_confirmed_global_long[ts_confirmed_global_long['Country/Region'].isin(['Taiwan*', 'Korea, South'])]
left_df = left_df.set_index('Country_Region')
right_df = right_df.set_index('Country/Region')

In [48]:
# default: left join
left_df.join(right_df, lsuffix='_x', rsuffix='_y')

Unnamed: 0,FIPS,Admin2,Province_State,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio,Province/State,Lat_y,Long,Date,Confirmed_y
Japan,,,Aichi,2020-10-27 04:24:45,35.035551,137.211621,5913,91,5499,323.0,"Aichi, Japan",78.294662,1.538982,,,,,
Japan,,,Akita,2020-10-27 04:24:45,39.748679,140.408228,61,0,59,2.0,"Akita, Japan",6.311498,0.000000,,,,,
Japan,,,Aomori,2020-10-27 04:24:45,40.781541,140.828896,189,2,60,127.0,"Aomori, Japan",15.164024,1.058201,,,,,
Japan,,,Chiba,2020-10-27 04:24:45,35.510141,140.198917,4837,77,4336,424.0,"Chiba, Japan",77.275999,1.591896,,,,,
Japan,,,Ehime,2020-10-27 04:24:45,33.624835,132.856842,116,6,110,0.0,"Ehime, Japan",8.661791,5.172414,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Taiwan*,,,,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/22/20,548.0
Taiwan*,,,,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/23/20,548.0
Taiwan*,,,,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/24/20,550.0
Taiwan*,,,,2020-10-27 04:24:45,23.700000,121.000000,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/25/20,550.0
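
Unlike `merge`, `join` has no default suffixes, which is why `lsuffix` and `rsuffix` are passed above: both frames contain `Lat` and `Confirmed` columns. A small sketch with made-up frames (the names `left` and `right` and their values are simplified assumptions) shows what happens without them.

```python
import pandas as pd

# Simplified stand-ins that share a 'Confirmed' column
left = pd.DataFrame({'Confirmed': [5913, 550]},
                    index=pd.Index(['Japan', 'Taiwan*'], name='Country_Region'))
right = pd.DataFrame({'Confirmed': [548, 550]},
                     index=pd.Index(['Taiwan*', 'Taiwan*'], name='Country/Region'))

# Without suffixes, overlapping column names raise a ValueError
error_message = None
try:
    left.join(right)
except ValueError as err:
    error_message = str(err)
print(error_message)

# With suffixes, the overlapping columns are renamed and the join succeeds
fixed = left.join(right, lsuffix='_x', rsuffix='_y')
print(fixed)
```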


In [49]:
# inner join
left_df.join(right_df, lsuffix='_x', rsuffix='_y', how='inner')

Unnamed: 0,FIPS,Admin2,Province_State,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio,Province/State,Lat_y,Long,Date,Confirmed_y
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,1/22/20,1
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,1/23/20,1
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,1/24/20,3
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,1/25/20,3
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,1/26/20,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/22/20,548
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/23/20,548
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/24/20,550
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550,7,502,41.0,Taiwan*,2.309297,1.272727,,23.7,121.0,10/25/20,550


In [51]:
# right join
left_df.join(right_df, lsuffix='_x', rsuffix='_y', how='right')

Unnamed: 0,FIPS,Admin2,Province_State,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incidence_Rate,Case-Fatality_Ratio,Province/State,Lat_y,Long,Date,Confirmed_y
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/22/20,1
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/23/20,1
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/24/20,2
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/25/20,2
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/26/20,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,23.700000,121.000000,10/22/20,548
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,23.700000,121.000000,10/23/20,548
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,23.700000,121.000000,10/24/20,550
Taiwan*,,,,2020-10-27 04:24:45,23.7,121.0,550.0,7.0,502.0,41.0,Taiwan*,2.309297,1.272727,,23.700000,121.000000,10/25/20,550
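
The two approaches are closely related: an index-on-index `join` can also be written as a `merge` with `left_index=True` and `right_index=True`. The sketch below checks this on made-up miniature frames (names and values are assumptions for illustration).

```python
import pandas as pd

# Miniature stand-ins, already indexed by country
left = pd.DataFrame({'Confirmed': [5913, 550]},
                    index=pd.Index(['Japan', 'Taiwan*'], name='Country_Region'))
right = pd.DataFrame({'Date': ['1/22/20', '1/22/20', '1/23/20'],
                      'Confirmed': [1, 1, 1]},
                     index=pd.Index(['Korea, South', 'Taiwan*', 'Taiwan*'],
                                    name='Country/Region'))

# join on index (left join by default)
joined = left.join(right, lsuffix='_x', rsuffix='_y')

# the same operation expressed with merge
merged = pd.merge(left, right,
                  left_index=True, right_index=True,
                  how='left', suffixes=('_x', '_y'))

print(joined)
print(joined.equals(merged))
```

Which form to use is mostly a matter of where the keys live: `merge` is convenient when they are ordinary columns, `join` when they have already been set as the index.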
