# Working with DataFrames

## 1. Pandas Library

Pandas is one of the most widely used Python libraries for data manipulation. We use its Series and DataFrame data structures extensively because they are much more powerful and convenient for manipulating data than Python's built-in lists and dictionaries.

Another very popular library is NumPy. Pandas is built on top of it, and we usually work with pandas directly.

In [1]:
import pandas as pd

## 2. Pandas Series

A Series is very similar to a list, and we can easily convert a list into a simple Series. Each value in a Series is paired with an index label.

In [2]:
stocks = ["AAPL", "BABA", "DIDI", "MSFT", "AMZN", "ADBE", "TSLA", "MS", "V", "MA", "GS"]

In [3]:
stocks_series = pd.Series(stocks)

In [4]:
stocks_series

0     AAPL
1     BABA
2     DIDI
3     MSFT
4     AMZN
5     ADBE
6     TSLA
7       MS
8        V
9       MA
10      GS
dtype: object

Getting values by index:

In [5]:
stocks_series[0]

'AAPL'

In [6]:
stocks_series[1:3]

1    BABA
2    DIDI
dtype: object

In [7]:
stocks_series[2:6]

2    DIDI
3    MSFT
4    AMZN
5    ADBE
dtype: object

The difference between a list and a Series is that the index does not have to be an integer. In that respect a Series looks more like a dictionary, and we can create one from a dictionary.

In [8]:
sales = {'Central Branch' : 10000,
         'TST Branch' : 2000,
         'Mongkok Branch' : 3000}

In [9]:
sales_series = pd.Series(sales)

In [10]:
sales_series

Central Branch    10000
TST Branch         2000
Mongkok Branch     3000
dtype: int64

Getting a value by its index label:

In [11]:
sales_series["Central Branch"]

np.int64(10000)
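
The lookups above go through the index. As a minimal sketch (re-creating the same branch figures so the block runs on its own), `.loc` makes a label lookup explicit and `.iloc` makes a position lookup explicit, which avoids ambiguity once the index is no longer the default 0, 1, 2, ...

```python
import pandas as pd

# Same branch figures as the sales example, re-created so this block is self-contained.
sales_series = pd.Series({"Central Branch": 10000,
                          "TST Branch": 2000,
                          "Mongkok Branch": 3000})

# .loc selects by index label.
by_label = sales_series.loc["TST Branch"]

# .iloc selects by integer position, whatever the labels are.
by_position = sales_series.iloc[1]

# A label slice with .loc includes both endpoints;
# a position slice with .iloc excludes the stop position, like a list.
label_slice = sales_series.loc["Central Branch":"TST Branch"]
position_slice = sales_series.iloc[0:1]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note the different slice lengths: the label slice keeps two branches, the position slice keeps one.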

## 3. Pandas DataFrame

You can think of a Series as one column of data in an Excel spreadsheet. A DataFrame holds multiple Series, so you can think of it as the data of a whole spreadsheet.

### 3.1 Create dataframe from csv

In [12]:
aapl = pd.read_csv("AAPL.csv")

In [13]:
aapl

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2019-10-28,61.855000,62.312500,61.680000,62.262501,61.650810,96572800
1,2019-10-29,62.242500,62.437500,60.642502,60.822498,60.224953,142839600
2,2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000
3,2019-10-31,61.810001,62.292500,59.314999,62.189999,61.579021,139162000
4,2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200
...,...,...,...,...,...,...,...
248,2020-10-21,116.669998,118.709999,116.449997,116.870003,116.870003,89946000
249,2020-10-22,117.449997,118.040001,114.589996,115.750000,115.750000,101988000
250,2020-10-23,116.389999,116.550003,114.279999,115.040001,115.040001,82572600
251,2020-10-26,114.010002,116.550003,112.879997,115.050003,115.050003,111850700


In [14]:
aapl_proper_index = pd.read_csv("AAPL.csv", parse_dates=True, index_col='Date')

In [15]:
aapl_proper_index

Date,Open,High,Low,Close,Adj Close,Volume
2019-10-28,61.855000,62.312500,61.680000,62.262501,61.650810,96572800
2019-10-29,62.242500,62.437500,60.642502,60.822498,60.224953,142839600
2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000
2019-10-31,61.810001,62.292500,59.314999,62.189999,61.579021,139162000
2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200
...,...,...,...,...,...,...
2020-10-21,116.669998,118.709999,116.449997,116.870003,116.870003,89946000
2020-10-22,117.449997,118.040001,114.589996,115.750000,115.750000,101988000
2020-10-23,116.389999,116.550003,114.279999,115.040001,115.040001,82572600
2020-10-26,114.010002,116.550003,112.879997,115.050003,115.050003,111850700
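
The two `read_csv` calls above differ only in how the index is built. The sketch below reproduces that difference without needing AAPL.csv: it reads a tiny in-memory CSV (three rows copied from the output above, reduced to two columns; the variable names are illustrative) once with the defaults and once with `index_col` and `parse_dates`.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for AAPL.csv (values copied from the output above).
csv_text = """Date,Open,Close
2019-10-28,61.855000,62.262501
2019-10-29,62.242500,60.822498
2019-10-30,61.189999,60.814999
"""

# Default: pandas adds a 0, 1, 2, ... RangeIndex and Date stays an ordinary column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col chooses the index column; parse_dates=True parses that index as dates.
dated = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(raw.index).__name__)
print(type(dated.index).__name__)
print(list(dated.columns))
```

With a date index in place, rows can be selected by date strings, as the later sections do.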


### 3.2 From Quandl

In [16]:
import quandl

In [17]:
quandl.ApiConfig.api_key = 'YOUR_API_KEY'  # use your own Quandl API key; do not publish real keys
ck = quandl.get('HKEX/00001', start_date='2020-10-20', end_date='2021-10-20')

QuandlError: (Status 403) Something went wrong. Please try again. If you continue to have problems, please contact us at connect@quandl.com.

In [18]:
ck

NameError: name 'ck' is not defined

### 3.3 From Series

In [19]:
costs = {'Central Branch' : 300000,
         'TST Branch' : 50000,
         'Mongkok Branch' : 20000}

In [20]:
branch_summary = pd.DataFrame({"sales": sales, "costs": costs})

In [21]:
branch_summary

Unnamed: 0,sales,costs
Central Branch,10000,300000
TST Branch,2000,50000
Mongkok Branch,3000,20000


### 3.4 Getting data from dataframe (getting rows with date)

In [22]:
aapl_proper_index.loc["2019-10-30"]

Open         6.119000e+01
High         6.132500e+01
Low          6.030250e+01
Close        6.081500e+01
Adj Close    6.021753e+01
Volume       1.245220e+08
Name: 2019-10-30 00:00:00, dtype: float64

In [23]:
aapl_proper_index.loc["2019-10-30":"2019-11-15"]

Date,Open,High,Low,Close,Adj Close,Volume
2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000
2019-10-31,61.810001,62.2925,59.314999,62.189999,61.579021,139162000
2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200
2019-11-04,64.332497,64.462502,63.845001,64.375,63.742554,103272000
2019-11-05,64.262497,64.547501,64.080002,64.282501,63.65097,79897600
2019-11-06,64.192497,64.372498,63.842499,64.309998,63.678192,75864400
2019-11-07,64.684998,65.087502,64.527496,64.857498,64.413116,94940400
2019-11-08,64.672501,65.110001,64.212502,65.035004,64.589409,69986400
2019-11-11,64.574997,65.6175,64.57,65.550003,65.100876,81821200
2019-11-12,65.387497,65.697502,65.230003,65.489998,65.041283,87388800


In [24]:
aapl_proper_index.loc["2019-11"]

Date,Open,High,Low,Close,Adj Close,Volume
2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200
2019-11-04,64.332497,64.462502,63.845001,64.375,63.742554,103272000
2019-11-05,64.262497,64.547501,64.080002,64.282501,63.65097,79897600
2019-11-06,64.192497,64.372498,63.842499,64.309998,63.678192,75864400
2019-11-07,64.684998,65.087502,64.527496,64.857498,64.413116,94940400
2019-11-08,64.672501,65.110001,64.212502,65.035004,64.589409,69986400
2019-11-11,64.574997,65.6175,64.57,65.550003,65.100876,81821200
2019-11-12,65.387497,65.697502,65.230003,65.489998,65.041283,87388800
2019-11-13,65.282501,66.195,65.267502,66.1175,65.66449,102734400
2019-11-14,65.9375,66.220001,65.525002,65.660004,65.210121,89182800
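
Selecting by date string works because the index is a DatetimeIndex. A small sketch on made-up prices (the frame and its values are hypothetical) shows the two patterns used above: a date slice, which includes both endpoints, and a partial string such as a year and month, which selects every row in that month.

```python
import pandas as pd

# Hypothetical closing prices on four business days spanning two months.
dates = pd.to_datetime(["2019-10-30", "2019-10-31", "2019-11-01", "2019-11-04"])
prices = pd.DataFrame({"Close": [60.81, 62.19, 63.96, 64.38]}, index=dates)

# A slice of date strings keeps both the start and the end date.
date_range = prices.loc["2019-10-31":"2019-11-01"]

# A partial date string selects the whole period, here all of November 2019.
november = prices.loc["2019-11"]

print(len(date_range))
print(len(november))
```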


### 3.5 Getting data from dataframe (get a series)

In [25]:
aapl_proper_index.loc["2019-11"]["Close"]

Date
2019-11-01    63.955002
2019-11-04    64.375000
2019-11-05    64.282501
2019-11-06    64.309998
2019-11-07    64.857498
2019-11-08    65.035004
2019-11-11    65.550003
2019-11-12    65.489998
2019-11-13    66.117500
2019-11-14    65.660004
2019-11-15    66.440002
2019-11-18    66.775002
2019-11-19    66.572502
2019-11-20    65.797501
2019-11-21    65.502502
2019-11-22    65.445000
2019-11-25    66.592499
2019-11-26    66.072502
2019-11-27    66.959999
2019-11-29    66.812500
Name: Close, dtype: float64

### 3.6 Getting data from dataframe (get multiple columns)

In [26]:
aapl_proper_index.loc["2019-11"][["Open","Close"]]

Date,Open,Close
2019-11-01,62.384998,63.955002
2019-11-04,64.332497,64.375
2019-11-05,64.262497,64.282501
2019-11-06,64.192497,64.309998
2019-11-07,64.684998,64.857498
2019-11-08,64.672501,65.035004
2019-11-11,64.574997,65.550003
2019-11-12,65.387497,65.489998
2019-11-13,65.282501,66.1175
2019-11-14,65.9375,65.660004
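
The examples above pick rows first and columns second with two sets of brackets. As an alternative sketch (hypothetical data, illustrative names), `.loc` also accepts the row selector and the column selector in a single call, which gives the same result here in one step.

```python
import pandas as pd

# Hypothetical prices; only the shape matters for this illustration.
dates = pd.to_datetime(["2019-10-31", "2019-11-01", "2019-11-04"])
prices = pd.DataFrame({"Open": [61.81, 62.38, 64.33],
                       "Close": [62.19, 63.96, 64.38],
                       "Volume": [139162000, 151125200, 103272000]},
                      index=dates)

# Two steps: rows for November, then two columns.
chained = prices.loc["2019-11"][["Open", "Close"]]

# One step: .loc[rows, columns].
single_call = prices.loc["2019-11", ["Open", "Close"]]

# A single column name (not a list) returns a Series instead of a DataFrame.
close_only = prices.loc["2019-11", "Close"]

print(chained.equals(single_call))
```

The single-call form is also the safer one when assigning values, because it operates on the original DataFrame rather than on an intermediate selection.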


### 3.7 Getting data from dataframe (index that is not a date or integer)

In [27]:
branch_summary.loc["Central Branch"]

sales     10000
costs    300000
Name: Central Branch, dtype: int64

In [28]:
branch_summary["sales"]

Central Branch    10000
TST Branch         2000
Mongkok Branch     3000
Name: sales, dtype: int64

### 3.8 Getting data from dataframe (using implicit index)

In [29]:
branch_summary.index

Index(['Central Branch', 'TST Branch', 'Mongkok Branch'], dtype='object')

In [30]:
aapl_proper_index.index

DatetimeIndex(['2019-10-28', '2019-10-29', '2019-10-30', '2019-10-31',
               '2019-11-01', '2019-11-04', '2019-11-05', '2019-11-06',
               '2019-11-07', '2019-11-08',
               ...
               '2020-10-14', '2020-10-15', '2020-10-16', '2020-10-19',
               '2020-10-20', '2020-10-21', '2020-10-22', '2020-10-23',
               '2020-10-26', '2020-10-27'],
              dtype='datetime64[ns]', name='Date', length=253, freq=None)

In [31]:
aapl.index

RangeIndex(start=0, stop=253, step=1)

In [32]:
aapl_proper_index.iloc[0:10]

Date,Open,High,Low,Close,Adj Close,Volume
2019-10-28,61.855,62.3125,61.68,62.262501,61.65081,96572800
2019-10-29,62.2425,62.4375,60.642502,60.822498,60.224953,142839600
2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000
2019-10-31,61.810001,62.2925,59.314999,62.189999,61.579021,139162000
2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200
2019-11-04,64.332497,64.462502,63.845001,64.375,63.742554,103272000
2019-11-05,64.262497,64.547501,64.080002,64.282501,63.65097,79897600
2019-11-06,64.192497,64.372498,63.842499,64.309998,63.678192,75864400
2019-11-07,64.684998,65.087502,64.527496,64.857498,64.413116,94940400
2019-11-08,64.672501,65.110001,64.212502,65.035004,64.589409,69986400


## 4. Filtering

### 4.1 Single condition

In [33]:
aapl_proper_index["Open"] > 100

Date
2019-10-28    False
2019-10-29    False
2019-10-30    False
2019-10-31    False
2019-11-01    False
              ...  
2020-10-21     True
2020-10-22     True
2020-10-23     True
2020-10-26     True
2020-10-27     True
Name: Open, Length: 253, dtype: bool

In [34]:
aapl_proper_index[aapl_proper_index["Open"] > 100]

Date,Open,High,Low,Close,Adj Close,Volume
2020-07-31,102.885002,106.415001,100.824997,106.260002,106.068756,374336800
2020-08-03,108.199997,111.637497,107.892502,108.937500,108.741440,308151200
2020-08-04,109.132500,110.790001,108.387497,109.665001,109.467628,173071600
2020-08-05,109.377502,110.392502,108.897499,110.062500,109.864410,121992000
2020-08-06,110.404999,114.412498,109.797501,113.902496,113.697502,202428800
...,...,...,...,...,...,...
2020-10-21,116.669998,118.709999,116.449997,116.870003,116.870003,89946000
2020-10-22,117.449997,118.040001,114.589996,115.750000,115.750000,101988000
2020-10-23,116.389999,116.550003,114.279999,115.040001,115.040001,82572600
2020-10-26,114.010002,116.550003,112.879997,115.050003,115.050003,111850700


### 4.2 Multiple conditions

In [35]:
(aapl_proper_index["Open"] > 100) & (aapl_proper_index["Volume"] > 100000000)

Date
2019-10-28    False
2019-10-29    False
2019-10-30    False
2019-10-31    False
2019-11-01    False
              ...  
2020-10-21    False
2020-10-22     True
2020-10-23    False
2020-10-26     True
2020-10-27    False
Length: 253, dtype: bool

In [36]:
cond = (aapl_proper_index["Open"] > 100) & (aapl_proper_index["Volume"] > 100000000)

In [37]:
aapl_proper_index[cond]

Date,Open,High,Low,Close,Adj Close,Volume
2020-07-31,102.885002,106.415001,100.824997,106.260002,106.068756,374336800
2020-08-03,108.199997,111.637497,107.892502,108.9375,108.74144,308151200
2020-08-04,109.1325,110.790001,108.387497,109.665001,109.467628,173071600
2020-08-05,109.377502,110.392502,108.897499,110.0625,109.86441,121992000
2020-08-06,110.404999,114.412498,109.797501,113.902496,113.697502,202428800
2020-08-07,113.205002,113.675003,110.292503,111.112503,111.112503,198045600
2020-08-10,112.599998,113.775002,110.0,112.727501,112.727501,212403600
2020-08-11,111.970001,112.482498,109.107498,109.375,109.375,187902400
2020-08-12,110.497498,113.275002,110.297501,113.010002,113.010002,165944800
2020-08-13,114.43,116.042503,113.927498,115.010002,115.010002,210082000
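
Filtering works in two steps: a comparison produces a boolean Series (a mask), and indexing with that mask keeps the rows marked True. The sketch below, on a hypothetical four-row frame, shows how masks combine with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses when written inline, because these operators bind more tightly than `>`.

```python
import pandas as pd

# Hypothetical trading days; the values are made up for illustration.
trades = pd.DataFrame({"Open": [95.0, 102.9, 116.7, 117.4],
                       "Volume": [150_000_000, 374_336_800, 89_946_000, 101_988_000]})

high_open = trades["Open"] > 100
high_volume = trades["Volume"] > 100_000_000

both = trades[high_open & high_volume]    # rows where both conditions hold
either = trades[high_open | high_volume]  # rows where at least one holds
not_high_open = trades[~high_open]        # rows where the first condition fails

print(both.index.tolist())
print(len(either), len(not_high_open))
```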


#### Side notes

Showing the first and last rows with head() and tail():

In [38]:
aapl_proper_index[cond].head(20)

Date,Open,High,Low,Close,Adj Close,Volume
2020-07-31,102.885002,106.415001,100.824997,106.260002,106.068756,374336800
2020-08-03,108.199997,111.637497,107.892502,108.9375,108.74144,308151200
2020-08-04,109.1325,110.790001,108.387497,109.665001,109.467628,173071600
2020-08-05,109.377502,110.392502,108.897499,110.0625,109.86441,121992000
2020-08-06,110.404999,114.412498,109.797501,113.902496,113.697502,202428800
2020-08-07,113.205002,113.675003,110.292503,111.112503,111.112503,198045600
2020-08-10,112.599998,113.775002,110.0,112.727501,112.727501,212403600
2020-08-11,111.970001,112.482498,109.107498,109.375,109.375,187902400
2020-08-12,110.497498,113.275002,110.297501,113.010002,113.010002,165944800
2020-08-13,114.43,116.042503,113.927498,115.010002,115.010002,210082000


In [39]:
aapl_proper_index[cond].tail(5)

Date,Open,High,Low,Close,Adj Close,Volume
2020-10-16,121.279999,121.550003,118.809998,119.019997,119.019997,115393800
2020-10-19,119.959999,120.419998,115.660004,115.980003,115.980003,120639300
2020-10-20,116.199997,118.980003,115.629997,117.510002,117.510002,124423700
2020-10-22,117.449997,118.040001,114.589996,115.75,115.75,101988000
2020-10-26,114.010002,116.550003,112.879997,115.050003,115.050003,111850700


### 4.3 Query

In [40]:
aapl_proper_index[cond].query("Open > 100 and Volume > 110000000").head(10)

Date,Open,High,Low,Close,Adj Close,Volume
2020-07-31,102.885002,106.415001,100.824997,106.260002,106.068756,374336800
2020-08-03,108.199997,111.637497,107.892502,108.9375,108.74144,308151200
2020-08-04,109.1325,110.790001,108.387497,109.665001,109.467628,173071600
2020-08-05,109.377502,110.392502,108.897499,110.0625,109.86441,121992000
2020-08-06,110.404999,114.412498,109.797501,113.902496,113.697502,202428800
2020-08-07,113.205002,113.675003,110.292503,111.112503,111.112503,198045600
2020-08-10,112.599998,113.775002,110.0,112.727501,112.727501,212403600
2020-08-11,111.970001,112.482498,109.107498,109.375,109.375,187902400
2020-08-12,110.497498,113.275002,110.297501,113.010002,113.010002,165944800
2020-08-13,114.43,116.042503,113.927498,115.010002,115.010002,210082000
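
`query` expresses the same kind of filter as a string. The sketch below, on hypothetical data, checks that a query gives the same rows as the equivalent boolean mask, and shows two conveniences: `@name` refers to a Python variable, and backticks quote a column name that contains a space.

```python
import pandas as pd

# Hypothetical trading days; the values are made up for illustration.
trades = pd.DataFrame({"Open": [95.0, 102.9, 116.7, 117.4],
                       "Adj Close": [94.5, 106.1, 116.9, 115.8],
                       "Volume": [150_000_000, 374_336_800, 89_946_000, 101_988_000]})

min_volume = 100_000_000

# Boolean-mask version and query version of the same filter.
by_mask = trades[(trades["Open"] > 100) & (trades["Volume"] > min_volume)]
by_query = trades.query("Open > 100 and Volume > @min_volume")

# Column names with spaces are wrapped in backticks inside the query string.
by_backtick = trades.query("`Adj Close` > 100")

print(by_mask.equals(by_query))
print(len(by_backtick))
```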


## 5. New columns

### 5.1 Density Example

In [41]:

population = pd.Series({'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
)

area = pd.Series({'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995})

states = pd.DataFrame(  {'population': population,'area': area} )
states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [42]:
states["density"] = states["population"] / states["area"]

In [43]:
states

Unnamed: 0,population,area,density
California,38332521,423967,90.413926
Texas,26448193,695662,38.01874
New York,19651127,141297,139.076746
Florida,19552860,170312,114.806121
Illinois,12882135,149995,85.883763


### 5.2 Stocks example

In [44]:
aapl_proper_index["Percent Changes"] = aapl_proper_index["Close"].pct_change()

In [45]:
aapl_proper_index

Date,Open,High,Low,Close,Adj Close,Volume,Percent Changes
2019-10-28,61.855000,62.312500,61.680000,62.262501,61.650810,96572800,
2019-10-29,62.242500,62.437500,60.642502,60.822498,60.224953,142839600,-0.023128
2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000,-0.000123
2019-10-31,61.810001,62.292500,59.314999,62.189999,61.579021,139162000,0.022610
2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200,0.028381
...,...,...,...,...,...,...,...
2020-10-21,116.669998,118.709999,116.449997,116.870003,116.870003,89946000,-0.005446
2020-10-22,117.449997,118.040001,114.589996,115.750000,115.750000,101988000,-0.009583
2020-10-23,116.389999,116.550003,114.279999,115.040001,115.040001,82572600,-0.006134
2020-10-26,114.010002,116.550003,112.879997,115.050003,115.050003,111850700,0.000087
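
The first row of Percent Changes is empty because there is no previous close to compare with. To make the calculation concrete, this sketch takes the first four closing prices from the output above and computes the same quantity by hand, as today's close divided by the previous close, minus 1, using `shift(1)` to line up the previous value.

```python
import pandas as pd

# First four closing prices from the AAPL output above.
close = pd.Series([62.262501, 60.822498, 60.814999, 62.189999])

# Built-in percentage change between consecutive rows.
builtin = close.pct_change()

# The same calculation written out: current / previous - 1.
manual = close / close.shift(1) - 1

print(builtin.round(6).tolist())
print(manual.round(6).tolist())
```

The values are fractions, not percentages: -0.023128 means a fall of about 2.3%.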


## 6. Aggregation

### 6.1 Basic operations

In [46]:
aapl_proper_index["Percent Changes"].mean()

np.float64(0.0028956964634767705)

In [47]:
aapl_proper_index["Percent Changes"].max()

np.float64(0.11980826040056836)

In [48]:
aapl_proper_index["Percent Changes"].min()

np.float64(-0.12864694751232164)

In [49]:
aapl_proper_index["Percent Changes"].median()

np.float64(0.0024045071214521263)

In [50]:
aapl_proper_index[aapl_proper_index["Percent Changes"] > 0]["Percent Changes"].mean()

np.float64(0.01985402460478214)

In [51]:
aapl_proper_index[aapl_proper_index["Percent Changes"] < 0]["Percent Changes"].mean()

np.float64(-0.01881547236798305)
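
The statistics above are computed one call at a time. As a sketch on a short made-up series of daily changes, `agg` with a list of function names returns several of them at once, and filtering before aggregating gives the average gain and the average loss separately.

```python
import pandas as pd

# Hypothetical daily percent changes, chosen so the results are easy to check by hand.
changes = pd.Series([-0.02, 0.01, 0.03, -0.01, 0.02], name="Percent Changes")

# Several aggregations in one call; the result is a Series indexed by function name.
summary = changes.agg(["mean", "max", "min", "median"])

# Average of the up days and of the down days.
avg_gain = changes[changes > 0].mean()
avg_loss = changes[changes < 0].mean()

print(summary)
print(avg_gain, avg_loss)
```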

### 6.2 Grouping

We use the planets discovery dataset from seaborn as an example for grouping.

In [52]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

(1035, 6)

In [53]:
planets

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.300000,7.10,77.40,2006
1,Radial Velocity,1,874.774000,2.21,56.95,2008
2,Radial Velocity,1,763.000000,2.60,19.84,2011
3,Radial Velocity,1,326.030000,19.40,110.62,2007
4,Radial Velocity,1,516.220000,10.50,119.47,2009
...,...,...,...,...,...,...
1030,Transit,1,3.941507,,172.00,2006
1031,Transit,1,2.615864,,148.00,2007
1032,Transit,1,3.191524,,174.00,2007
1033,Transit,1,4.125083,,293.00,2008


In [54]:
planets.groupby('method')['orbital_period'].median()

method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64

In [55]:
planets.groupby('method')['orbital_period'].describe()

method,count,mean,std,min,25%,50%,75%,max
Astrometry,2.0,631.18,544.217663,246.36,438.77,631.18,823.59,1016.0
Eclipse Timing Variations,9.0,4751.644444,2499.130945,1916.25,2900.0,4343.5,5767.0,10220.0
Imaging,12.0,118247.7375,213978.177277,4639.15,8343.9,27500.0,94250.0,730000.0
Microlensing,7.0,3153.571429,1113.166333,1825.0,2375.0,3300.0,3550.0,5100.0
Orbital Brightness Modulation,3.0,0.709307,0.725493,0.240104,0.291496,0.342887,0.943908,1.544929
Pulsar Timing,5.0,7343.021201,16313.265573,0.090706,25.262,66.5419,98.2114,36525.0
Pulsation Timing Variations,1.0,1170.0,,1170.0,1170.0,1170.0,1170.0,1170.0
Radial Velocity,553.0,823.35468,1454.92621,0.73654,38.021,360.2,982.0,17337.5
Transit,397.0,21.102073,46.185893,0.355,3.16063,5.714932,16.1457,331.60059
Transit Timing Variations,3.0,79.7835,71.599884,22.3395,39.67525,57.011,108.5055,160.0


In [56]:
planets.groupby('method')["number"].count()

method
Astrometry                         2
Eclipse Timing Variations          9
Imaging                           38
Microlensing                      23
Orbital Brightness Modulation      3
Pulsar Timing                      5
Pulsation Timing Variations        1
Radial Velocity                  553
Transit                          397
Transit Timing Variations          4
Name: number, dtype: int64
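
The counts differ between the two outputs above: Imaging shows 38 when counting `number` but only 12 under `orbital_period` in `describe()`. That is consistent with `count()` skipping missing values. The sketch below uses a few hypothetical rows (not the seaborn data) to contrast `count()`, which counts non-missing values in a column, with `size()`, which counts rows per group.

```python
import pandas as pd

# Hypothetical rows in the same shape as the planets data; None marks a missing period.
planets_small = pd.DataFrame({
    "method": ["Imaging", "Imaging", "Imaging", "Transit", "Transit"],
    "number": [1, 1, 1, 1, 2],
    "orbital_period": [None, 4639.15, None, 3.94, 2.62],
})

grouped = planets_small.groupby("method")

# count() ignores missing values in the selected column.
non_missing = grouped["orbital_period"].count()

# size() counts every row in the group, missing or not.
rows = grouped.size()

print(non_missing.to_dict())
print(rows.to_dict())
```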

## 7. Joining Data

### 7.1 Merge (or join)

In [57]:
department = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

department

Unnamed: 0,employee,group
0,Bob,Accounting
1,Jake,Engineering
2,Lisa,Engineering
3,Sue,HR


In [58]:
hire_date = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

hire_date

Unnamed: 0,employee,hire_date
0,Lisa,2004
1,Bob,2008
2,Jake,2012
3,Sue,2014


In [59]:
employee = pd.merge(department, hire_date)
employee

Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


In [60]:
employee = pd.merge(department, hire_date, on="employee")
employee

Unnamed: 0,employee,group,hire_date
0,Bob,Accounting,2008
1,Jake,Engineering,2012
2,Lisa,Engineering,2004
3,Sue,HR,2014


In [61]:
salary = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
salary

Unnamed: 0,name,salary
0,Bob,70000
1,Jake,80000
2,Lisa,120000
3,Sue,90000


In [62]:
employee = pd.merge(department, salary, left_on="employee", right_on="name")
employee

Unnamed: 0,employee,group,name,salary
0,Bob,Accounting,Bob,70000
1,Jake,Engineering,Jake,80000
2,Lisa,Engineering,Lisa,120000
3,Sue,HR,Sue,90000


In [63]:
employee = employee.drop('name',axis=1)
employee

Unnamed: 0,employee,group,salary
0,Bob,Accounting,70000
1,Jake,Engineering,80000
2,Lisa,Engineering,120000
3,Sue,HR,90000


### 7.2 One to many merging

In [64]:
supervisor = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
supervisor

Unnamed: 0,group,supervisor
0,Accounting,Carly
1,Engineering,Guido
2,HR,Steve


In [65]:
pd.merge(employee,supervisor)

Unnamed: 0,employee,group,salary,supervisor
0,Bob,Accounting,70000,Carly
1,Jake,Engineering,80000,Guido
2,Lisa,Engineering,120000,Guido
3,Sue,HR,90000,Steve


### 7.3 Many to many merging

In [66]:
skills = pd.DataFrame({'group': ['Accounting', 'Accounting','Engineering', 
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 
                               'linux','spreadsheets', 'organization']})

In [67]:
pd.merge(employee,skills)

Unnamed: 0,employee,group,salary,skills
0,Bob,Accounting,70000,math
1,Bob,Accounting,70000,spreadsheets
2,Jake,Engineering,80000,coding
3,Jake,Engineering,80000,linux
4,Lisa,Engineering,120000,coding
5,Lisa,Engineering,120000,linux
6,Sue,HR,90000,spreadsheets
7,Sue,HR,90000,organization


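The result can look strange: every employee row is paired with every skill row of the same group, so each employee appears once per skill. Make sure you understand this behaviour before using a many-to-many merge.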
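
One way to guard against an accidental many-to-many merge is the `validate` argument of `pd.merge`, which checks the key relationship before merging. This is a sketch on a trimmed copy of the employee and skills tables, not part of the original notebook.

```python
import pandas as pd

# Trimmed copies of the tables above: "Engineering" repeats on both sides.
employee = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa"],
                         "group": ["Accounting", "Engineering", "Engineering"]})
skills = pd.DataFrame({"group": ["Accounting", "Accounting", "Engineering", "Engineering"],
                       "skills": ["math", "spreadsheets", "coding", "linux"]})

# Many-to-many: each employee is paired with every skill of the same group.
# Bob gets 2 rows, Jake 2 and Lisa 2, so 6 rows in total.
merged = pd.merge(employee, skills, on="group", validate="many_to_many")

# Claiming the left keys are unique fails, because "Engineering" appears twice.
try:
    pd.merge(employee, skills, on="group", validate="one_to_many")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```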

### 7.4 Inner Join / Outer Join / Left Join / Right Join

In [68]:
fav_food = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
fav_food

Unnamed: 0,name,food
0,Peter,fish
1,Paul,beans
2,Mary,bread


In [69]:
fav_drink = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
fav_drink

Unnamed: 0,name,drink
0,Mary,wine
1,Joseph,beer


In [70]:
pd.merge(fav_food, fav_drink)

Unnamed: 0,name,food,drink
0,Mary,bread,wine


In [71]:
pd.merge(fav_food, fav_drink, how="outer")

Unnamed: 0,name,food,drink
0,Joseph,,beer
1,Mary,bread,wine
2,Paul,beans,
3,Peter,fish,


In [72]:
pd.merge(fav_food, fav_drink, how="left")

Unnamed: 0,name,food,drink
0,Peter,fish,
1,Paul,beans,
2,Mary,bread,wine


In [73]:
pd.merge(fav_food, fav_drink, how="right")

Unnamed: 0,name,food,drink
0,Mary,bread,wine
1,Joseph,,beer
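
The four join types differ only in which unmatched names they keep. The sketch below rebuilds the two small tables and uses `indicator=True`, which adds a `_merge` column recording whether each row came from the left table, the right table, or both.

```python
import pandas as pd

fav_food = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                         "food": ["fish", "beans", "bread"]})
fav_drink = pd.DataFrame({"name": ["Mary", "Joseph"],
                          "drink": ["wine", "beer"]})

# Outer join keeps every name; _merge tells us where each row came from.
outer = pd.merge(fav_food, fav_drink, how="outer", indicator=True)
sources = dict(zip(outer["name"], outer["_merge"].astype(str)))

# Row counts for each join type.
row_counts = {how: len(pd.merge(fav_food, fav_drink, how=how))
              for how in ["inner", "outer", "left", "right"]}

print(sources)
print(row_counts)
```

Only Mary appears in both tables, so she is the single row of the inner join.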


## 8. Handling Missing Data

In [74]:
hibor = pd.read_csv("hibor.csv", parse_dates=True, index_col='date')

In [75]:
hibor

date,overnight,1 week,2 weeks,1 months,2 months,3 months,6 months,12 months
2010-01-01,,,,,,,,
2010-01-02,,,,,,,,
2010-01-03,,,,,,,,
2010-01-04,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-05,0.03,0.04971,0.05,0.07964,0.11,0.15,0.29929,0.68929
2010-01-06,0.03,0.049,0.04971,0.08,0.11,0.14,0.28,0.66857
2010-01-07,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-08,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-09,,,,,,,,
2010-01-10,,,,,,,,


### 8.1 Check missing data

In [76]:
hibor.isnull()

date,overnight,1 week,2 weeks,1 months,2 months,3 months,6 months,12 months
2010-01-01,True,True,True,True,True,True,True,True
2010-01-02,True,True,True,True,True,True,True,True
2010-01-03,True,True,True,True,True,True,True,True
2010-01-04,False,False,False,False,False,False,False,False
2010-01-05,False,False,False,False,False,False,False,False
2010-01-06,False,False,False,False,False,False,False,False
2010-01-07,False,False,False,False,False,False,False,False
2010-01-08,False,False,False,False,False,False,False,False
2010-01-09,True,True,True,True,True,True,True,True
2010-01-10,True,True,True,True,True,True,True,True


In [77]:
hibor.isnull().values.any()

np.True_

In [78]:
hibor["overnight"].isnull().values.any()

np.True_

In [79]:
hibor["overnight"].isnull().value_counts()

overnight
False    6
True     5
Name: count, dtype: int64

In [80]:
hibor["overnight"][hibor["overnight"].isnull()]

date
2010-01-01   NaN
2010-01-02   NaN
2010-01-03   NaN
2010-01-09   NaN
2010-01-10   NaN
Name: overnight, dtype: float64
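
A compact summary is often more useful than the full True/False table. As a sketch on a small hypothetical rates table (two columns, made-up gaps), summing the boolean result of `isna()` counts the missing values per column, and `all(axis=1)` flags the dates where every column is missing.

```python
import pandas as pd

# Hypothetical rates with gaps; None marks a missing value.
rates = pd.DataFrame(
    {"overnight": [None, 0.03, 0.03, None],
     "1 week": [None, 0.04971, None, None]},
    index=pd.to_datetime(["2010-01-03", "2010-01-04", "2010-01-05", "2010-01-09"]),
)

# True counts as 1 when summed, so this is the number of missing values per column.
missing_per_column = rates.isna().sum()

# Dates on which every column is missing (for example weekends or holidays).
rows_all_missing = rates.isna().all(axis=1)

print(missing_per_column.to_dict())
print(int(rows_all_missing.sum()))
```

`isna()` and `isnull()` are two names for the same check.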

### 8.2 Drop Data

In [81]:
hibor.dropna()

date,overnight,1 week,2 weeks,1 months,2 months,3 months,6 months,12 months
2010-01-04,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-05,0.03,0.04971,0.05,0.07964,0.11,0.15,0.29929,0.68929
2010-01-06,0.03,0.049,0.04971,0.08,0.11,0.14,0.28,0.66857
2010-01-07,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-08,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-11,0.03,0.04971,0.04971,0.06036,0.09,0.12,0.24,0.57


### 8.3 Fill with specific values

Note: this is shown only as an example. Filling missing rates with 0 does not make sense in this scenario.

In [82]:
hibor.fillna(0)

date,overnight,1 week,2 weeks,1 months,2 months,3 months,6 months,12 months
2010-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010-01-04,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-05,0.03,0.04971,0.05,0.07964,0.11,0.15,0.29929,0.68929
2010-01-06,0.03,0.049,0.04971,0.08,0.11,0.14,0.28,0.66857
2010-01-07,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-08,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-09,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2010-01-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 8.4 Fill with previous values (i.e. forward fill)


In [83]:
hibor.ffill()

date,overnight,1 week,2 weeks,1 months,2 months,3 months,6 months,12 months
2010-01-01,,,,,,,,
2010-01-02,,,,,,,,
2010-01-03,,,,,,,,
2010-01-04,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-05,0.03,0.04971,0.05,0.07964,0.11,0.15,0.29929,0.68929
2010-01-06,0.03,0.049,0.04971,0.08,0.11,0.14,0.28,0.66857
2010-01-07,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-08,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-09,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-10,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
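
In the output above the first three dates stay empty, because forward fill has no earlier value to copy. The sketch below shows this on a short hypothetical series, together with the `limit` argument, which caps how many consecutive gaps are filled so that a stale value is not carried forward indefinitely.

```python
import pandas as pd

# Hypothetical daily rate with a leading gap and a three-day gap in the middle.
rates = pd.Series([None, 0.03, None, None, None, 0.05],
                  index=pd.date_range("2010-01-03", periods=6, freq="D"))

# Forward fill copies the last known value into each gap.
filled = rates.ffill()

# With limit=2, only the first two missing days after a known value are filled.
limited = rates.ffill(limit=2)

print(filled.tolist())
print(limited.tolist())
```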


### 8.5 Fill with next values (i.e. back fill)


Remark: back fill may not make sense in this example.

In [84]:
hibor.bfill()

date,overnight,1 week,2 weeks,1 months,2 months,3 months,6 months,12 months
2010-01-01,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-02,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-03,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-04,0.03,0.04971,0.05,0.07964,0.11893,0.15679,0.31571,0.71429
2010-01-05,0.03,0.04971,0.05,0.07964,0.11,0.15,0.29929,0.68929
2010-01-06,0.03,0.049,0.04971,0.08,0.11,0.14,0.28,0.66857
2010-01-07,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-08,0.03,0.04971,0.04971,0.06964,0.1,0.13,0.26,0.62857
2010-01-09,0.03,0.04971,0.04971,0.06036,0.09,0.12,0.24,0.57
2010-01-10,0.03,0.04971,0.04971,0.06036,0.09,0.12,0.24,0.57


## 9. Export CSV

Export a DataFrame to a CSV file. Remember not to overwrite the original file!

In [85]:
aapl_proper_index.to_csv("AAPL_new.csv")
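
As a minimal sketch of a careful export (the small frame and the temporary folder are stand-ins, not part of the original notebook), the code below refuses to write over an existing file and then reads the CSV back to confirm that the date index and the values survive the round trip.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame with a named date index, standing in for the AAPL data.
frame = pd.DataFrame({"Close": [62.26, 60.82]},
                     index=pd.to_datetime(["2019-10-28", "2019-10-29"]))
frame.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "AAPL_new.csv"

    # Guard against overwriting: stop if the target file is already there.
    if target.exists():
        raise FileExistsError(f"{target} already exists; choose another name")

    # The index is written as the first column, so it can be restored on reading.
    frame.to_csv(target)
    restored = pd.read_csv(target, index_col="Date", parse_dates=True)

same_values = restored["Close"].tolist() == frame["Close"].tolist()
same_dates = list(restored.index) == list(frame.index)
print(same_values, same_dates)
```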