## Manipulating data using pandas

**Prepared by** <br>
Nur Hurriyatul Huda Abdullah Sani (nurhuda@dosm.gov.my)

# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis.
Pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language.
You can think of pandas as an extremely powerful version of Excel, with a lot more features. Work through the topics in this order:

* Introduction to Pandas
* Series
* DataFrames
* Missing Data
* GroupBy
* Merging, Joining, and Concatenating
* Operations
* Data Input and Output

Reference:<br>
[1] https://pandas.pydata.org/docs/user_guide/index.html#user-guide

First, you need the pandas library, which can be installed with this command (remove the leading `#` to run it):

In [None]:
# !pip install pandas

Import the pandas library using this command:

In [None]:
import pandas as pd

## Series

The first main data type we will learn about in pandas is the Series.

A Series is very similar to a NumPy array (it is built on top of the NumPy array object).

What differentiates a Series from a NumPy array?

1) A Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position.

2) A Series does not need to hold numeric data; it can hold any Python object.

Let's explore this concept through some examples:

In [None]:
import numpy as np
import pandas as pd

You can convert a list, NumPy array, or dictionary to a Series:

In [None]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

Using Lists

In [None]:
pd.Series(data=my_list)

0    10
1    20
2    30
dtype: int64

In [None]:
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

NumPy Arrays

In [None]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [None]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

Dictionary

In [None]:
pd.Series(d)

a    10
b    20
c    30
dtype: int64

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index labels or numbers for fast lookups of information.

In [None]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])

In [None]:
ser2 = pd.Series([6,7,8,9],index = ['USA', 'Germany','Italy', 'Japan'])

In [None]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [None]:
ser2

USA        6
Germany    7
Italy      8
Japan      9
dtype: int64

In [None]:
ser1.iloc[0]  # select by position; plain ser1[0] is deprecated for a label-indexed Series in recent pandas

1

In [None]:
ser2.iloc[3]  # select by position; plain ser2[3] is deprecated for a label-indexed Series in recent pandas

9

Operations are also aligned on the index; labels that appear in only one Series produce NaN:

In [None]:
ser1 + ser2

Germany     9.0
Italy       NaN
Japan      13.0
USA         7.0
USSR        NaN
dtype: float64
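If the NaN values are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one Series as a given default. The sketch below reuses `ser1` and `ser2` from above; the name `total` is only illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([6, 7, 8, 9], index=['USA', 'Germany', 'Italy', 'Japan'])

# Treat a label that is missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

Whether 0 is a sensible default depends on the data; for counts it usually is, for measurements it may hide a real gap.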

## DataFrames

One basic structure that you get with pandas is the DataFrame. A DataFrame is a two-dimensional grid, rather similar to a relational database table, except that it is held in memory.

DataFrames are the workhorse of pandas and are directly inspired by the R programming language.

We can think of a DataFrame as a bunch of Series objects put together to share the same index.
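To make the "Series sharing one index" idea concrete, here is a minimal sketch that builds a DataFrame from two Series. The values and the names `w`, `x` and `small_df` are hypothetical.

```python
import pandas as pd

# Two Series with the same labels (hypothetical values).
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Each dict key becomes a column name; the shared labels become the row index.
small_df = pd.DataFrame({'W': w, 'X': x})
print(small_df)

# Pulling a column back out gives a Series again.
print(type(small_df['W']))
```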

In [None]:
from numpy.random import randn
np.random.seed(101)

In [None]:
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [None]:
?np.random

## Selection and Indexing

How to grab data from a DataFrame

In [None]:
df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [None]:
# Pass a list of column names
df[['W','Z']]

Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


DataFrame Columns are just Series

In [None]:
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [None]:
df['new'] = df['W'] + df['Y']

In [None]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


**Removing Columns**

In [None]:
df.drop('new',axis=1) 

# axis=1 refers to columns

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


However, the 'new' column has not been permanently deleted from `df`:

In [None]:
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


`drop` does not modify the DataFrame in place unless you specify `inplace=True`!

In [None]:
df.drop('new',axis=1,inplace=True)

In [None]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
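As an alternative to `inplace=True`, the result of `drop` can be assigned to a name, which leaves the original untouched. This is a small sketch on a hypothetical frame called `demo`, not the `df` used above.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [4, 6]})

# drop() returns a new DataFrame; binding it to a name keeps the result.
trimmed = demo.drop(columns='new')

print(list(demo.columns))     # the original still has 'new'
print(list(trimmed.columns))  # the returned copy does not
```

Writing `columns='new'` is equivalent to `'new', axis=1` and some readers find it easier to follow.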


**Removing Rows**

In [None]:
df.drop('E',axis=0)
# axis=0 refers to rows

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


**Selecting Rows**

In [None]:
df.loc['A']

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or you can select based on position instead of label 

In [None]:
df.iloc[2]

# selects the row at integer position 2 (the third row)

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

In [None]:
df.iloc[:,2]

# selects the column at integer position 2 (the third column)

A    0.907969
B   -0.848077
C    0.528813
D   -0.933237
E    2.605967
Name: Y, dtype: float64

**Selecting a subset of rows and columns**

In [None]:
df.loc['B','Y']

-0.8480769834036315

In [None]:
df.loc[['A','B'],['W','Y']]

# selects the sub-table made of cells (A, W), (A, Y), (B, W), (B, Y)

Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077
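The same cells can be reached either by label with `loc` or by integer position with `iloc`. The sketch below, on a hypothetical frame `demo` with predictable values, shows the two forms giving identical results.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                    index=['A', 'B', 'C'], columns=['W', 'X', 'Y', 'Z'])

by_label = demo.loc[['A', 'B'], ['W', 'Y']]   # rows and columns by name
by_position = demo.iloc[[0, 1], [0, 2]]       # the same cells by integer position

print(by_label.equals(by_position))
```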


**Example 1**

In [None]:
# dictionary of {column name: list of values}

data = {'state': ['Jakarta', 'Jakarta', 'Jakarta', 'Selangor', 'Selangor','Kelantan','Kelantan'],
       'year': [2000, 2001, 2002, 2001, 2002, 2001, 2002],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 1.2, 3.2]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Jakarta,2000,1.5
1,Jakarta,2001,1.7
2,Jakarta,2002,3.6
3,Selangor,2001,2.4
4,Selangor,2002,2.9
5,Kelantan,2001,1.2
6,Kelantan,2002,3.2


**Rearranging the columns**

In [None]:
frame = pd.DataFrame(data, columns=['year', 'state', 'pop'])
frame

Unnamed: 0,year,state,pop
0,2000,Jakarta,1.5
1,2001,Jakarta,1.7
2,2002,Jakarta,3.6
3,2001,Selangor,2.4
4,2002,Selangor,2.9
5,2001,Kelantan,1.2
6,2002,Kelantan,3.2


**Adding a new column** (a column name that is not in `data` is filled with NaN)

In [None]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index = ['one', 'two', 'three', 'four', 'five', 'six','seven'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Jakarta,1.5,
two,2001,Jakarta,1.7,
three,2002,Jakarta,3.6,
four,2001,Selangor,2.4,
five,2002,Selangor,2.9,
six,2001,Kelantan,1.2,
seven,2002,Kelantan,3.2,


**Selecting a column**

In [None]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
frame2['state']

one       Jakarta
two       Jakarta
three     Jakarta
four     Selangor
five     Selangor
six      Kelantan
seven    Kelantan
Name: state, dtype: object

In [None]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2001
seven    2002
Name: year, dtype: int64

In [None]:
frame2.loc['two'] 

year        2001
state    Jakarta
pop          1.7
debt         NaN
Name: two, dtype: object

`loc` stands for location.<br>
`frame2.loc['two']` lists every element in the row labelled 'two'.<br>
`loc` selects by the assigned label (name), not by position.

In [None]:
frame2.loc['two','state'] 

'Jakarta'

In [None]:
# iloc selects by integer position

frame2.iloc[1,1]

'Jakarta'

In [None]:
frame2.loc['two':,:'state']

Unnamed: 0,year,state
two,2001,Jakarta
three,2002,Jakarta
four,2001,Selangor
five,2002,Selangor
six,2001,Kelantan
seven,2002,Kelantan
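One detail worth noting about slices: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, as in ordinary Python slicing. A minimal sketch on a hypothetical frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame(
    {'year': [2000, 2001, 2002], 'state': ['X', 'Y', 'Z'], 'pop': [1.5, 1.7, 3.6]},
    index=['one', 'two', 'three'],
)

label_slice = demo.loc['one':'two', :'state']   # includes row 'two' and column 'state'
position_slice = demo.iloc[0:1, :1]             # excludes position 1 on both axes

print(label_slice.shape, position_slice.shape)
```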


In [None]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Jakarta,1.5,
two,2001,Jakarta,1.7,
three,2002,Jakarta,3.6,
four,2001,Selangor,2.4,
five,2002,Selangor,2.9,
six,2001,Kelantan,1.2,
seven,2002,Kelantan,3.2,


**The debt values are NaN. We can assign values to 'debt'**

In [None]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Jakarta,1.5,16.5
two,2001,Jakarta,1.7,16.5
three,2002,Jakarta,3.6,16.5
four,2001,Selangor,2.4,16.5
five,2002,Selangor,2.9,16.5
six,2001,Kelantan,1.2,16.5
seven,2002,Kelantan,3.2,16.5


In [None]:
frame2['debt'] = np.arange(7.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Jakarta,1.5,0.0
two,2001,Jakarta,1.7,1.0
three,2002,Jakarta,3.6,2.0
four,2001,Selangor,2.4,3.0
five,2002,Selangor,2.9,4.0
six,2001,Kelantan,1.2,5.0
seven,2002,Kelantan,3.2,6.0


In [None]:
frame2['debt'] = [10,20,15,13,11,67,87]
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Jakarta,1.5,10
two,2001,Jakarta,1.7,20
three,2002,Jakarta,3.6,15
four,2001,Selangor,2.4,13
five,2002,Selangor,2.9,11
six,2001,Kelantan,1.2,67
seven,2002,Kelantan,3.2,87


In [None]:
'Jakarta' in frame2.columns 

False

# Use Case Exercise

In [None]:
import numpy as np
import pandas as pd

**DATA 1**

Load data from the local drive (downloaded files)

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
import io
df = pd.read_csv(io.BytesIO(data_to_load['data2.csv']))

If we run Python directly on our own machine (laptop/PC), we can read the file with the syntax below:

In [None]:
# df = pd.read_csv('data2.csv')

**Checking the top 5 and bottom 5 rows**

In [None]:
df.head()

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947
2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07


In [None]:
df.tail()

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
15,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
16,Algeria,43.38,2381.74,167.56,Africa,5/7/1962
17,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
18,Australia,25.47,7692.02,1408.68,Oceania,
19,Kazakhstan,18.53,2724.9,159.41,Asia,16/12/1991


In [None]:
df.dtypes

COUNTRY        object
POPULATION    float64
AREA          float64
GDP           float64
CONTINENTS     object
IND_DAY        object
dtype: object

In [None]:
df.shape

(20, 6)

**Are there any null values in your data? Let's check.**

In [None]:
pd.isnull(df)

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
0,False,False,False,False,False,True
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,True,False
9,False,False,False,False,False,False
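A table of True/False values is hard to scan on larger data. One possible follow-up is to count the nulls per column, or to keep only the rows that contain a null. The sketch uses a small hypothetical frame `demo`; the column names are borrowed from the dataset above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'COUNTRY': ['A', 'B', 'C'],
    'CONTINENTS': ['Asia', np.nan, 'Europe'],
    'IND_DAY': [np.nan, '15/8/1947', '1/10/1960'],
})

missing_per_column = demo.isnull().sum()               # number of nulls in each column
rows_with_missing = demo[demo.isnull().any(axis=1)]    # rows with at least one null

print(missing_per_column)
print(rows_with_missing)
```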


**Create a new DataFrame from the original df**

In [None]:
newdf1=df[['COUNTRY','POPULATION','CONTINENTS','IND_DAY']]
newdf1

Unnamed: 0,COUNTRY,POPULATION,CONTINENTS,IND_DAY
0,China,1398.72,Asia,
1,India,1351.16,Asia,15/8/1947
2,US,329.74,N.America,1776-07-04
3,Indonesia,268.07,Asia,17/8/1945
4,Brazil,210.32,S.America,1822-09-07
5,Pakistan,205.71,Asia,14/8/1947
6,Nigeria,200.96,Africa,1/10/1960
7,Bangladesh,167.09,Asia,26/3/1971
8,Russia,146.79,,12/6/1992
9,Mexico,126.58,N.America,1810-09-16


In [None]:
newdf1['CONTINENTS'].fillna('transcontinental')

0                 Asia
1                 Asia
2            N.America
3                 Asia
4            S.America
5                 Asia
6               Africa
7                 Asia
8     transcontinental
9            N.America
10                Asia
11              Europe
12              Europe
13              Europe
14              Europe
15           S.America
16              Africa
17           N.America
18             Oceania
19                Asia
Name: CONTINENTS, dtype: object

In [None]:
newdf1

# no change for Russia: fillna returns a new Series unless the result is assigned back

Unnamed: 0,COUNTRY,POPULATION,CONTINENTS,IND_DAY
0,China,1398.72,Asia,
1,India,1351.16,Asia,15/8/1947
2,US,329.74,N.America,1776-07-04
3,Indonesia,268.07,Asia,17/8/1945
4,Brazil,210.32,S.America,1822-09-07
5,Pakistan,205.71,Asia,14/8/1947
6,Nigeria,200.96,Africa,1/10/1960
7,Bangladesh,167.09,Asia,26/3/1971
8,Russia,146.79,,12/6/1992
9,Mexico,126.58,N.America,1810-09-16


In [None]:
newdf1['CONTINENTS'].fillna('Transcontinental', inplace=True)
newdf1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,


Unnamed: 0,COUNTRY,POPULATION,CONTINENTS,IND_DAY
0,China,1398.72,Asia,
1,India,1351.16,Asia,15/8/1947
2,US,329.74,N.America,1776-07-04
3,Indonesia,268.07,Asia,17/8/1945
4,Brazil,210.32,S.America,1822-09-07
5,Pakistan,205.71,Asia,14/8/1947
6,Nigeria,200.96,Africa,1/10/1960
7,Bangladesh,167.09,Asia,26/3/1971
8,Russia,146.79,Transcontinental,12/6/1992
9,Mexico,126.58,N.America,1810-09-16
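The warning shown above appears because `newdf1` was created by selecting columns from `df`, so pandas cannot tell whether the change is meant for a copy or for the original. One common way to avoid it, sketched here on a small hypothetical frame `source`, is to take an explicit `.copy()` and assign the filled column back rather than using `inplace=True`.

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({
    'COUNTRY': ['Russia', 'Mexico'],
    'CONTINENTS': [np.nan, 'N.America'],
    'GDP': [1530.75, 1158.23],
})

# .copy() makes the subset an independent DataFrame.
subset = source[['COUNTRY', 'CONTINENTS']].copy()

# Assign the filled column back instead of calling fillna(..., inplace=True).
subset['CONTINENTS'] = subset['CONTINENTS'].fillna('Transcontinental')

print(subset)
```

The original `source` keeps its missing value, which makes it easy to compare before and after.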


In [None]:
newdf1.dtypes

COUNTRY        object
POPULATION    float64
CONTINENTS     object
IND_DAY        object
dtype: object

**How to convert data to a date format**

In [None]:
?pd.to_datetime

In [None]:
newdf1['IND_DAY']=pd.to_datetime(newdf1['IND_DAY'])
newdf1

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


Unnamed: 0,COUNTRY,POPULATION,CONTINENTS,IND_DAY
0,China,1398.72,Asia,NaT
1,India,1351.16,Asia,1947-08-15
2,US,329.74,N.America,1776-07-04
3,Indonesia,268.07,Asia,1945-08-17
4,Brazil,210.32,S.America,1822-09-07
5,Pakistan,205.71,Asia,1947-08-14
6,Nigeria,200.96,Africa,1960-01-10
7,Bangladesh,167.09,Asia,1971-03-26
8,Russia,146.79,Transcontinental,1992-12-06
9,Mexico,126.58,N.America,1810-09-16
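Note that the column mixes two layouts, and in the output above `1/10/1960` became 1960-01-10 and `12/6/1992` became 1992-12-06, that is, they were read month-first. If the slash-separated strings are meant as day/month/year (an assumption, suggested by values such as `15/8/1947`), a more explicit approach is to parse each layout with its own format and combine the results. A minimal sketch on a hypothetical Series `raw`:

```python
import pandas as pd

raw = pd.Series(['15/8/1947', '1776-07-04', '1/10/1960', None, '12/6/1992'])

# Parse each layout with an explicit format; strings that do not match become NaT.
day_first = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
iso = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

# Prefer the day/month/year result and fall back to the ISO result.
parsed = day_first.fillna(iso)
print(parsed)
```

Being explicit about the format also avoids relying on how a particular pandas version guesses mixed date strings.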


In [None]:
newdf1.dtypes

COUNTRY               object
POPULATION           float64
CONTINENTS            object
IND_DAY       datetime64[ns]
dtype: object

**How to fill in missing dates**

In [None]:
newdf1['IND_DAY'].fillna(pd.Timestamp("20210430"))

0    2021-04-30
1    1947-08-15
2    1776-07-04
3    1945-08-17
4    1822-09-07
5    1947-08-14
6    1960-01-10
7    1971-03-26
8    1992-12-06
9    1810-09-16
10   2021-04-30
11   2021-04-30
12   1789-07-14
13   2021-04-30
14   2021-04-30
15   1816-07-09
16   1962-05-07
17   1867-07-01
18   2021-04-30
19   1991-12-16
Name: IND_DAY, dtype: datetime64[ns]

In [None]:
newdf1['IND_DAY'].astype(str).replace({'NaT': "No date"})

0        No date
1     1947-08-15
2     1776-07-04
3     1945-08-17
4     1822-09-07
5     1947-08-14
6     1960-01-10
7     1971-03-26
8     1992-12-06
9     1810-09-16
10       No date
11       No date
12    1789-07-14
13       No date
14       No date
15    1816-07-09
16    1962-05-07
17    1867-07-01
18       No date
19    1991-12-16
Name: IND_DAY, dtype: object

In [None]:
newdf1['IND_DAY'].fillna(value = 'No date')

0                 No date
1     1947-08-15 00:00:00
2     1776-07-04 00:00:00
3     1945-08-17 00:00:00
4     1822-09-07 00:00:00
5     1947-08-14 00:00:00
6     1960-01-10 00:00:00
7     1971-03-26 00:00:00
8     1992-12-06 00:00:00
9     1810-09-16 00:00:00
10                No date
11                No date
12    1789-07-14 00:00:00
13                No date
14                No date
15    1816-07-09 00:00:00
16    1962-05-07 00:00:00
17    1867-07-01 00:00:00
18                No date
19    1991-12-16 00:00:00
Name: IND_DAY, dtype: object

In [None]:
newdf1['IND_DAY'].fillna(value = 'No date', inplace = True)
newdf1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  downcast=downcast,


Unnamed: 0,COUNTRY,POPULATION,CONTINENTS,IND_DAY
0,China,1398.72,Asia,No date
1,India,1351.16,Asia,1947-08-15 00:00:00
2,US,329.74,N.America,1776-07-04 00:00:00
3,Indonesia,268.07,Asia,1945-08-17 00:00:00
4,Brazil,210.32,S.America,1822-09-07 00:00:00
5,Pakistan,205.71,Asia,1947-08-14 00:00:00
6,Nigeria,200.96,Africa,1960-01-10 00:00:00
7,Bangladesh,167.09,Asia,1971-03-26 00:00:00
8,Russia,146.79,Transcontinental,1992-12-06 00:00:00
9,Mexico,126.58,N.America,1810-09-16 00:00:00


**Add a new dataset called 'data3'**

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
import io
df1 = pd.read_csv(io.BytesIO(data_to_load['data3.csv']), encoding='latin-1')

In [None]:
# if running Python or Jupyter Notebook locally
df1 = pd.read_csv('data3.csv')
df1

Unnamed: 0,country,cases,deaths,region
0,United States,32669121,584226,North America
1,India,16257309,186928,Asia
2,Brazil,14172139,383757,South America
3,France,5408606,102164,Europe
4,Russia,4736121,107103,Europe
...,...,...,...,...
215,MS Zaandam,9,2,
216,Vanuatu,4,1,Australia/Oceania
217,Marshall Islands,4,0,Australia/Oceania
218,Samoa,3,0,Australia/Oceania


In [None]:
df1.dtypes

country    object
cases      object
deaths     object
region     object
dtype: object

In [None]:
df1.shape

(220, 4)

**Change Data Type** <br>
Convert cases and deaths to float

In [None]:
df1['deaths'] = df1['deaths'].str.replace(',','')
df1['deaths'] = df1.deaths.astype(float)
df1['cases'] = df1['cases'].str.replace(',','')
df1['cases'] = df1.cases.astype(float)

In [None]:
df1.dtypes

country     object
cases      float64
deaths     float64
region      object
dtype: object
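`astype(float)` raises an error if any entry is still not a number after the commas are removed. A more forgiving variant, sketched here on a hypothetical Series `raw`, is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into NaN.

```python
import pandas as pd

raw = pd.Series(['32,669,121', '9', 'N/A'])

# Remove thousands separators, then convert; unparseable entries become NaN.
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False), errors='coerce')
print(cleaned)
```

Coercing silently hides bad values, so it is worth checking `cleaned.isna().sum()` afterwards.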

In [None]:
df.head(2)

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947


In [None]:
df1.head(2)

Unnamed: 0,country,cases,deaths,region
0,United States,32669121.0,584226.0,North America
1,India,16257309.0,186928.0,Asia


**Making the column headers consistent**

The two datasets write their headers differently (uppercase in df, lowercase in df1).
Let's change that.

In [None]:
# .str.capitalize() makes the first letter of each header uppercase.
df1.columns=df1.columns.str.capitalize()  
df1.head(2)

Unnamed: 0,Country,Cases,Deaths,Region
0,United States,32669121.0,584226.0,North America
1,India,16257309.0,186928.0,Asia


In [None]:
df1.columns=df1.columns.str.upper()  

# .str.upper() changes the headers to uppercase

df1.head(2)

Unnamed: 0,COUNTRY,CASES,DEATHS,REGION
0,United States,32669121.0,584226.0,North America
1,India,16257309.0,186928.0,Asia


In [None]:
df1.shape

(220, 4)

In [None]:
df.shape

(20, 6)

## Merge 

Pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame or named Series objects.

**MERGE** combines data on common columns or indices.

You can achieve both many-to-one and many-to-many joins with merge()

![Mergeconcept](https://files.realpython.com/media/join_diagram.93e6ef63afbe.png)

When combining DataFrames, you choose how to handle keys that appear in only one of them. With merge() this is controlled by the `how` parameter:

Take the union of the keys from both DataFrames, how='outer'. This results in zero information loss.

Take the intersection of the keys, how='inner'. This is the default for merge().

(For pd.concat(), the equivalent parameter is `join`, and its default is 'outer'.)

In [None]:
merge_df=pd.merge(df, df1)
merge_df

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0,Asia
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0,Asia
3,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,14172139.0,383757.0,South America
4,Pakistan,205.71,881.91,302.14,Asia,14/8/1947,784108.0,16842.0,Asia
5,Nigeria,200.96,923.77,375.77,Africa,1/10/1960,164588.0,2061.0,Africa
6,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971,736074.0,10781.0,Asia
7,Russia,146.79,17098.25,1530.75,,12/6/1992,4736121.0,107103.0,Europe
8,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16,2319519.0,214095.0,North America
9,Japan,126.22,377.97,4872.42,Asia,,547137.0,9777.0,Asia


By default, how='inner', which keeps only the rows whose keys match in both DataFrames.

In [None]:
merge_df.shape

(18, 9)

In [None]:
merge_df.CONTINENTS=merge_df.CONTINENTS.replace(['N.America','S.America'],['North America','South America'])
merge_df

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0,Asia
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0,Asia
3,Brazil,210.32,8515.77,2055.51,South America,1822-09-07,14172139.0,383757.0,South America
4,Pakistan,205.71,881.91,302.14,Asia,14/8/1947,784108.0,16842.0,Asia
5,Nigeria,200.96,923.77,375.77,Africa,1/10/1960,164588.0,2061.0,Africa
6,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971,736074.0,10781.0,Asia
7,Russia,146.79,17098.25,1530.75,,12/6/1992,4736121.0,107103.0,Europe
8,Mexico,126.58,1964.38,1158.23,North America,1810-09-16,2319519.0,214095.0,North America
9,Japan,126.22,377.97,4872.42,Asia,,547137.0,9777.0,Asia


In [None]:
test1merge_df=pd.merge(df, df1, how='inner')
test1merge_df

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0,Asia
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0,Asia
3,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,14172139.0,383757.0,South America
4,Pakistan,205.71,881.91,302.14,Asia,14/8/1947,784108.0,16842.0,Asia
5,Nigeria,200.96,923.77,375.77,Africa,1/10/1960,164588.0,2061.0,Africa
6,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971,736074.0,10781.0,Asia
7,Russia,146.79,17098.25,1530.75,,12/6/1992,4736121.0,107103.0,Europe
8,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16,2319519.0,214095.0,North America
9,Japan,126.22,377.97,4872.42,Asia,,547137.0,9777.0,Asia


In [None]:
test1merge_df.shape

(18, 9)

In [None]:
merge_df=pd.merge(df, df1, how='outer')
merge_df
# keeps every row from both DataFrames

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0,Asia
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,,,
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0,Asia
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,14172139.0,383757.0,South America
...,...,...,...,...,...,...,...,...,...
217,MS Zaandam,,,,,,9.0,2.0,
218,Vanuatu,,,,,,4.0,1.0,Australia/Oceania
219,Marshall Islands,,,,,,4.0,0.0,Australia/Oceania
220,Samoa,,,,,,3.0,0.0,Australia/Oceania


In [None]:
merge_df.shape

(222, 9)

In [None]:
# Let's try how='left' or 'right'

test1=pd.merge(df, df1, how='left')
test1

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0,Asia
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,,,
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0,Asia
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,14172139.0,383757.0,South America
5,Pakistan,205.71,881.91,302.14,Asia,14/8/1947,784108.0,16842.0,Asia
6,Nigeria,200.96,923.77,375.77,Africa,1/10/1960,164588.0,2061.0,Africa
7,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971,736074.0,10781.0,Asia
8,Russia,146.79,17098.25,1530.75,,12/6/1992,4736121.0,107103.0,Europe
9,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16,2319519.0,214095.0,North America


how='left' keeps every row of the left DataFrame, which in this example is df.

In [None]:
test2=pd.merge(df, df1, how='right')
test2

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,United States,,,,,,32669121.0,584226.0,North America
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,14172139.0,383757.0,South America
3,France,67.02,640.68,2582.49,Europe,1789-07-14,5408606.0,102164.0,Europe
4,Russia,146.79,17098.25,1530.75,,12/6/1992,4736121.0,107103.0,Europe
...,...,...,...,...,...,...,...,...,...
215,MS Zaandam,,,,,,9.0,2.0,
216,Vanuatu,,,,,,4.0,1.0,Australia/Oceania
217,Marshall Islands,,,,,,4.0,0.0,Australia/Oceania
218,Samoa,,,,,,3.0,0.0,Australia/Oceania


how='right' keeps every row of the right DataFrame, which in this example is df1.
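The four `how` options can be compared side by side on a small example. The frames `left` and `right` below are hypothetical; `indicator=True` is a standard `merge` option that adds a `_merge` column showing where each row came from, which helps spot keys such as 'US' versus 'United States' that fail to match.

```python
import pandas as pd

left = pd.DataFrame({'COUNTRY': ['China', 'India', 'US'],
                     'GDP': [12234.78, 2575.67, 19485.39]})
right = pd.DataFrame({'COUNTRY': ['India', 'United States', 'China', 'Samoa'],
                      'CASES': [16257309, 32669121, 90566, 3]})

# Row count produced by each join type.
counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    counts[how] = len(pd.merge(left, right, on='COUNTRY', how=how))
print(counts)

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
audit = pd.merge(left, right, on='COUNTRY', how='outer', indicator=True)
unmatched = audit[audit['_merge'] != 'both']
print(unmatched)
```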

## Concatenating

With concatenation, your datasets are just stitched together along an axis — either the row axis or column axis.

In [None]:
concat_df=pd.concat([df, df1], axis=1)
concat_df

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,COUNTRY.1,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,United States,32669121.0,584226.0,North America
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,India,16257309.0,186928.0,Asia
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,Brazil,14172139.0,383757.0,South America
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,France,5408606.0,102164.0,Europe
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,Russia,4736121.0,107103.0,Europe
...,...,...,...,...,...,...,...,...,...,...
215,,,,,,,MS Zaandam,9.0,2.0,
216,,,,,,,Vanuatu,4.0,1.0,Australia/Oceania
217,,,,,,,Marshall Islands,4.0,0.0,Australia/Oceania
218,,,,,,,Samoa,3.0,0.0,Australia/Oceania
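The `axis` argument decides the direction of the stitching. The sketch below uses two small hypothetical frames `a` and `b` to contrast stacking rows (`axis=0`) with placing columns side by side (`axis=1`), where rows are aligned on the index and gaps are filled with NaN.

```python
import pandas as pd

a = pd.DataFrame({'COUNTRY': ['China', 'India'], 'GDP': [12234.78, 2575.67]})
b = pd.DataFrame({'COUNTRY': ['Egypt'], 'GDP': [375.77]})

# axis=0: rows of b are placed below the rows of a.
stacked = pd.concat([a, b], axis=0, ignore_index=True)

# axis=1: columns of b are placed beside the columns of a, aligned on the index.
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape, side_by_side.shape)
```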


Let's try appending the data.

Let's load a new dataset called data4.

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
import io
df2 = pd.read_csv(io.BytesIO(data_to_load['data4.csv']))
df2

In [None]:
# if running Python or Jupyter Notebook locally
df2 = pd.read_csv('data4.csv')
df2

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
0,Egypt,93,640.68,375.77,Asia,1867-07-01
1,Germany,81,242.5,245.63,Europe,1789-07-14
2,Iran,80,301.34,143.0,Europe,
3,Turkey,79,,250.0,,


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
test3 = pd.concat([df, df2], ignore_index=True, sort=False)
test3

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947
2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
5,Pakistan,205.71,881.91,302.14,Asia,14/8/1947
6,Nigeria,200.96,923.77,375.77,Africa,1/10/1960
7,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971
8,Russia,146.79,17098.25,1530.75,,12/6/1992
9,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16


The rows of df (data2) and df2 (data4) are combined at row level.

**LET'S MOVE TO GROUPBY**

In [None]:
df.head(2)

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947


In [None]:
df1.head(2)

Unnamed: 0,COUNTRY,CASES,DEATHS,REGION
0,United States,32669121.0,584226.0,North America
1,India,16257309.0,186928.0,Asia


In [None]:
df2.head()

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY
0,Egypt,93,640.68,375.77,Asia,1867-07-01
1,Germany,81,242.5,245.63,Europe,1789-07-14
2,Iran,80,301.34,143.0,Europe,
3,Turkey,79,,250.0,,


In [None]:
merge_df.head()

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0,Asia
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,,,
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0,Asia
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,14172139.0,383757.0,South America


In [None]:
merge_df.shape

(222, 9)

In [None]:
merge_df.dtypes

COUNTRY        object
POPULATION    float64
AREA          float64
GDP           float64
CONTINENTS     object
IND_DAY        object
CASES         float64
DEATHS        float64
REGION         object
dtype: object

**CONTINENTS and REGION actually refer to the same thing.**
Let's try to fix it.

Checking the elements:

In [None]:
reg=merge_df.groupby('REGION').sum(numeric_only=True)  # sum only the numeric columns
reg

Unnamed: 0_level_0,POPULATION,AREA,GDP,CASES,DEATHS
REGION,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,244.34,3305.51,543.33,4513248.0,119538.0
Asia,3535.5,18927.5,21405.59,35616438.0,482532.0
Australia/Oceania,25.47,7692.02,1408.68,61971.0,1184.0
Europe,357.19,18397.38,9750.28,43483441.0,989869.0
North America,164.17,11949.05,2805.35,37760260.0,852502.0
South America,255.26,11296.17,2693.0,23897427.0,639607.0


In [None]:
con=merge_df.groupby('CONTINENTS').sum(numeric_only=True)
con

Unnamed: 0_level_0,POPULATION,AREA,GDP,CASES,DEATHS
CONTINENTS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,244.34,3305.51,543.33,284951.0,5242.0
Asia,3535.5,18927.5,21405.59,20342739.0,276648.0
Europe,276.84,1541.63,10850.76,12567605.0,302214.0
N.America,493.91,21782.57,22290.74,3475353.0,237917.0
Oceania,25.47,7692.02,1408.68,29626.0,910.0
S.America,255.26,11296.17,2693.0,16968907.0,444377.0


In [None]:
merge_df.CONTINENTS=merge_df.CONTINENTS.replace(['N.America','S.America'],['North America','South America'])
merge_df

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,CONTINENTS,IND_DAY,CASES,DEATHS,REGION
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0,Asia
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0,Asia
2,US,329.74,9833.52,19485.39,North America,1776-07-04,,,
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0,Asia
4,Brazil,210.32,8515.77,2055.51,South America,1822-09-07,14172139.0,383757.0,South America
...,...,...,...,...,...,...,...,...,...
217,MS Zaandam,,,,,,9.0,2.0,
218,Vanuatu,,,,,,4.0,1.0,Australia/Oceania
219,Marshall Islands,,,,,,4.0,0.0,Australia/Oceania
220,Samoa,,,,,,3.0,0.0,Australia/Oceania


In [None]:
reg=merge_df.groupby('REGION').sum(numeric_only=True)
reg

Unnamed: 0_level_0,POPULATION,AREA,GDP,CASES,DEATHS
REGION,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,244.34,3305.51,543.33,4513248.0,119538.0
Asia,3535.5,18927.5,21405.59,35616438.0,482532.0
Australia/Oceania,25.47,7692.02,1408.68,61971.0,1184.0
Europe,357.19,18397.38,9750.28,43483441.0,989869.0
North America,164.17,11949.05,2805.35,37760260.0,852502.0
South America,255.26,11296.17,2693.0,23897427.0,639607.0


In [None]:
con=merge_df.groupby('CONTINENTS').sum(numeric_only=True)
con

Unnamed: 0_level_0,POPULATION,AREA,GDP,CASES,DEATHS
CONTINENTS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Africa,244.34,3305.51,543.33,284951.0,5242.0
Asia,3535.5,18927.5,21405.59,20342739.0,276648.0
Europe,276.84,1541.63,10850.76,12567605.0,302214.0
North America,493.91,21782.57,22290.74,3475353.0,237917.0
Oceania,25.47,7692.02,1408.68,29626.0,910.0
South America,255.26,11296.17,2693.0,16968907.0,444377.0
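`sum()` applies one function to every column. When different columns need different summaries, `agg` with named aggregation is one option. A minimal sketch on a hypothetical frame `demo`; the output column names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    'REGION': ['Asia', 'Asia', 'Europe'],
    'CASES': [100.0, 50.0, 30.0],
    'DEATHS': [10.0, 5.0, 3.0],
})

# Named aggregation: one output column per (input column, function) pair.
summary = demo.groupby('REGION').agg(
    total_cases=('CASES', 'sum'),
    mean_deaths=('DEATHS', 'mean'),
    countries=('CASES', 'count'),
)
print(summary)
```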


In [None]:
df = df.rename(columns={'CONTINENTS': 'REGION'})
df.head()

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,REGION,IND_DAY
0,China,1398.72,9596.96,12234.78,Asia,
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947
2,US,329.74,9833.52,19485.39,N.America,1776-07-04
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07


In [None]:
df1.head()

Unnamed: 0,COUNTRY,CASES,DEATHS,REGION
0,United States,32669121.0,584226.0,North America
1,India,16257309.0,186928.0,Asia
2,Brazil,14172139.0,383757.0,South America
3,France,5408606.0,102164.0,Europe
4,Russia,4736121.0,107103.0,Europe


In [None]:
df101=pd.merge(df, df1)
df101

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,REGION,IND_DAY,CASES,DEATHS
0,China,1398.72,9596.96,12234.78,Asia,,90566.0,4636.0
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,16257309.0,186928.0
2,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,1626812.0,44172.0
3,Pakistan,205.71,881.91,302.14,Asia,14/8/1947,784108.0,16842.0
4,Nigeria,200.96,923.77,375.77,Africa,1/10/1960,164588.0,2061.0
5,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971,736074.0,10781.0
6,Japan,126.22,377.97,4872.42,Asia,,547137.0,9777.0
7,Germany,83.02,357.11,3693.2,Europe,,3238054.0,81693.0
8,France,67.02,640.68,2582.49,Europe,1789-07-14,5408606.0,102164.0
9,Italy,60.36,301.34,1943.84,Europe,,3920945.0,118357.0
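
When `pd.merge` is called without `on`, it joins on every column name the two frames share, and it keeps only rows that match in both (an inner join). That appears to be why rows such as US and Brazil are missing from `df101`: their COUNTRY or REGION spellings differ between `df` and `df1`. The sketch below uses hypothetical miniature frames (`left`, `right`) to show the effect and one way to make the join explicit; the `_cases` suffix is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of df and df1
left = pd.DataFrame({
    "COUNTRY": ["India", "US", "Brazil"],
    "GDP": [2575.67, 19485.39, 2055.51],
    "REGION": ["Asia", "N.America", "S.America"],
})
right = pd.DataFrame({
    "COUNTRY": ["India", "United States", "Brazil"],
    "CASES": [16257309.0, 32669121.0, 14172139.0],
    "REGION": ["Asia", "North America", "South America"],
})

# Default: inner join on every shared column (COUNTRY and REGION)
default_merge = pd.merge(left, right)

# Explicit: keep every row of `left`, match on COUNTRY only,
# and record in `_merge` whether a match was found
explicit_merge = pd.merge(left, right, on="COUNTRY", how="left",
                          suffixes=("", "_cases"), indicator=True)

print(default_merge)
print(explicit_merge[["COUNTRY", "REGION", "REGION_cases", "_merge"]])
```

Rows marked `left_only` in the `_merge` column are the ones whose country name has no counterpart in the second frame, which makes spelling mismatches easy to spot before grouping.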


In [None]:
reg = df101.groupby('REGION').sum(numeric_only=True)  # sum only the numeric columns
reg

REGION,POPULATION,AREA,GDP,CASES,DEATHS
Africa,244.34,3305.51,543.33,284951.0,5242.0
Asia,3535.5,18927.5,21405.59,20342739.0,276648.0
Europe,210.4,1299.13,8219.53,12567605.0,302214.0


**What if you want to group the data into ranges or scales?**

**Data binning** <br>
1) Data binning (or bucketing) groups data into bins (or buckets): every value that falls inside a given interval is replaced by a single representative value for that interval.<br>
2) Binning can be used to convert numeric values to categorical ones, or to quantise numeric values into a smaller set of levels.
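
As a minimal illustration of the idea, the sketch below bins a small hypothetical Series of ages (not part of the course dataset) into three labelled categories with `pd.cut`. The bin edges and labels are made up for the example.

```python
import pandas as pd

# Hypothetical toy data, not taken from the course dataset
ages = pd.Series([3, 17, 25, 42, 68, 90])

# Three bins: (0, 18], (18, 65], (65, 120]
age_group = pd.cut(ages, bins=[0, 18, 65, 120],
                   labels=["child", "adult", "senior"])

print(age_group.tolist())
```

Each numeric value is replaced by the label of the interval it falls into, so the result is a categorical Series of the same length as the input.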

For example, we want to group the GDP values in df into these groups:

![income_group.PNG](attachment:income_group.PNG)

We can use `pd.cut`:

In [None]:
?pd.cut

Can we do it this way?


In [None]:
bins = np.linspace(df["GDP"].min(), df["GDP"].max(), 5)  # 5 edges give 4 equal-width bins
group = ["Low Income", "Middle Low Income", "Upper Middle Income", "High Income"]
df["gdp_bin"] = pd.cut(df["GDP"], bins, labels=group, include_lowest=True)

What will happen to your df?

In [None]:
df

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,REGION,IND_DAY,gdp_bin
0,China,1398.72,9596.96,12234.78,Asia,,Upper Middle Income
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,Low Income
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,High Income
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,Low Income
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,Low Income
5,Pakistan,205.71,881.91,302.14,Asia,14/8/1947,Low Income
6,Nigeria,200.96,923.77,375.77,Africa,1/10/1960,Low Income
7,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971,Low Income
8,Russia,146.79,17098.25,1530.75,,12/6/1992,Low Income
9,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16,Low Income


In [None]:
df['GDP'].min()

159.41

In [None]:
df['GDP'].max()

19485.39
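
To see why so many countries ended up in "Low Income", it can help to look at the edges that `np.linspace` produced. The sketch below recomputes them from the minimum and maximum printed above; the variable names are chosen for illustration only.

```python
import numpy as np

# Minimum and maximum GDP as printed above
gdp_min, gdp_max = 159.41, 19485.39

# Five evenly spaced edges define four equal-width bins
edges = np.linspace(gdp_min, gdp_max, 5)
width = edges[1] - edges[0]

print(edges.round(2))
print(round(width, 2))
```

Each bin is about 4,831 wide, so the first bin already extends to roughly 4,991. Every country with a GDP below that value is labelled "Low Income", whatever the thresholds in the income-group table say.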

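Equal-width bins put most countries into "Low Income", because the range between the minimum and maximum GDP is simply split into four equal intervals, regardless of the thresholds in the income-group table. Let's set the bin edges explicitly instead.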

In [None]:
df["GDP_GROUP"]=pd.cut(df['GDP'], bins=[-float("inf"), 1036, 4045, 12535, float("inf")], 
                          labels=['Low income', 'Lower-middle income', 'Upper-middle income', 'High income'])

In [None]:
df.head()

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,REGION,IND_DAY,gdp_bin,GDP_GROUP
0,China,1398.72,9596.96,12234.78,Asia,,Upper Middle Income,Upper-middle income
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,Low Income,Lower-middle income
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,High Income,High income
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,Low Income,Low income
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,Low Income,Lower-middle income
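
One detail worth knowing: by default `pd.cut` builds intervals that are closed on the right, so a value exactly equal to an edge belongs to the lower group. The sketch below checks this with a few hypothetical values placed on and just above two thresholds, and compares the default with `right=False`.

```python
import pandas as pd

edges = [-float("inf"), 1036, 4045, 12535, float("inf")]
labels = ["Low income", "Lower-middle income",
          "Upper-middle income", "High income"]

# Hypothetical values sitting exactly on, and just above, two thresholds
values = pd.Series([1036, 1036.01, 12535, 12535.01])

right_closed = pd.cut(values, bins=edges, labels=labels)               # default: right=True
left_closed = pd.cut(values, bins=edges, labels=labels, right=False)   # intervals closed on the left

print(pd.DataFrame({"GDP": values,
                    "right=True": right_closed,
                    "right=False": left_closed}))
```

With the default, 1036 is "Low income" and 12535 is "Upper-middle income"; with `right=False` the same two values move up one group. Which convention is appropriate depends on how the thresholds are defined.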


In [None]:
df["GDP_RANGE"]=pd.cut(df['GDP'], bins=[-float("inf"), 1036, 4045, 12535, float("inf")], 
                       labels=['< 1,036', '1,036 - 4,045', '4,046 - 12,535', '> 12,535'])

In [None]:
df.head()

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,REGION,IND_DAY,gdp_bin,GDP_GROUP,GDP_RANGE
0,China,1398.72,9596.96,12234.78,Asia,,Upper Middle Income,Upper-middle income,"4,046 - 12,535"
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,Low Income,Lower-middle income,"1,036 - 4,045"
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,High Income,High income,"> 12,535"
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,Low Income,Low income,"< 1,036"
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,Low Income,Lower-middle income,"1,036 - 4,045"


Not a fan of `pd.cut`? Want to try something else?

In [None]:
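def group(gincome):
    # Thresholds match the pd.cut bins above (upper bounds are inclusive),
    # so no GDP value falls into a gap between two groups
    if gincome > 12535:
        return "high income"
    if gincome > 4045:
        return "upper-middle income"
    if gincome > 1036:
        return "lower-middle income"
    if gincome <= 1036:
        return "low income"
    return 0  # reached only when GDP is missing (NaN)

# Apply the function to every value in the GDP column
df['group'] = df['GDP'].apply(group)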
df

Unnamed: 0,COUNTRY,POPULATION,AREA,GDP,REGION,IND_DAY,gdp_bin,GDP_GROUP,GDP_RANGE,group
0,China,1398.72,9596.96,12234.78,Asia,,Upper Middle Income,Upper-middle income,"4,046 - 12,535",upper-middle income
1,India,1351.16,3287.26,2575.67,Asia,15/8/1947,Low Income,Lower-middle income,"1,036 - 4,045",lower-middle income
2,US,329.74,9833.52,19485.39,N.America,1776-07-04,High Income,High income,"> 12,535",high income
3,Indonesia,268.07,1910.93,1015.54,Asia,17/8/1945,Low Income,Low income,"< 1,036",low income
4,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07,Low Income,Lower-middle income,"1,036 - 4,045",lower-middle income
5,Pakistan,205.71,881.91,302.14,Asia,14/8/1947,Low Income,Low income,"< 1,036",low income
6,Nigeria,200.96,923.77,375.77,Africa,1/10/1960,Low Income,Low income,"< 1,036",low income
7,Bangladesh,167.09,147.57,245.63,Asia,26/3/1971,Low Income,Low income,"< 1,036",low income
8,Russia,146.79,17098.25,1530.75,,12/6/1992,Low Income,Lower-middle income,"1,036 - 4,045",lower-middle income
9,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16,Low Income,Lower-middle income,"1,036 - 4,045",lower-middle income


There are many ways to get the same answer. <br>
What matters most is knowing what you want to do; the how can be worked out afterwards. <br>
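
As one more illustration of "many ways", the same grouping can be sketched with NumPy's `np.select`, which evaluates a list of conditions on the whole column at once and takes the first one that is true. The Series below is a hypothetical stand-in holding the first four GDP values of df, and the `"unknown"` default is an arbitrary placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df["GDP"] (first four rows)
gdp = pd.Series([12234.78, 2575.67, 19485.39, 1015.54])

# Conditions are checked in order; the first true one decides the label
conditions = [gdp <= 1036, gdp <= 4045, gdp <= 12535, gdp > 12535]
choices = ["low income", "lower-middle income",
           "upper-middle income", "high income"]

group_vec = np.select(conditions, choices, default="unknown")
print(list(group_vec))
```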


By the way, we don't want the gdp_bin and group columns, right?

How do we remove them?

Check your previous notes.
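
One possible answer, sketched on a hypothetical miniature frame (`demo`) rather than on df itself, is `DataFrame.drop` with the `columns` argument:

```python
import pandas as pd

# Hypothetical miniature frame with the two helper columns
demo = pd.DataFrame({
    "COUNTRY": ["China", "India"],
    "GDP": [12234.78, 2575.67],
    "gdp_bin": ["Upper Middle Income", "Low Income"],
    "group": ["upper-middle income", "lower-middle income"],
})

# drop returns a new DataFrame; reassign it to keep the result
demo = demo.drop(columns=["gdp_bin", "group"])

print(demo.columns.tolist())
```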

**End of part 1**<br>

I hope you learned something useful today.<br>

Thank you.

**Prepared by** <br>
Nur Hurriyatul Huda Abdullah Sani (nurhuda@dosm.gov.my)

**For more references**, please visit <br>
https://coreteambda.wixsite.com/blog/learn-and-lead

References:<br>
[1] https://stackoverflow.com/questions/42818262/pandas-dataframe-replace-nat-with-none<br>
[2] https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html<br>
[3] https://stackoverflow.com/questions/32327314/how-to-rearrange-a-date-in-python <br>
[4] https://stackoverflow.com/questions/30127427/pandas-cut-with-infinite-upper-lower-bounds