# Working with DataFrames

## 1. Pandas Library

Pandas is one of the most widely used libraries for data manipulation. We use its Series and DataFrame data structures extensively because they are much more powerful and convenient for working with data than Python's built-in lists and dictionaries.

Another very popular library is NumPy. Pandas is built on top of it, and we usually use pandas directly.

In [7]:
import pandas as pd
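
To illustrate the relationship with NumPy, here is a minimal sketch using a made-up `prices` Series (hypothetical data, not from any file in this notebook). It suggests that a Series can hand back its values as a NumPy array and that NumPy functions accept a Series directly.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, used only for illustration.
prices = pd.Series([10.0, 12.5, 11.0])

# The values of a Series can be retrieved as a NumPy array.
values = prices.to_numpy()
print(type(values))

# NumPy functions can be applied to a Series; the result keeps the pandas index.
log_prices = np.log(prices)
print(type(log_prices))
```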

## 2. Pandas Series

A Series is very similar to a list, and we can easily convert a list into a simple Series. A Series also has an index.

In [8]:
stocks = ["AAPL", "BABA", "DIDI", "MSFT", "AMZN", "ADBE", "TSLA", "MS", "V", "MA", "GS"]

In [9]:
stocks_series = pd.Series(stocks)

In [10]:
stocks_series

0     AAPL
1     BABA
2     DIDI
3     MSFT
4     AMZN
5     ADBE
6     TSLA
7       MS
8        V
9       MA
10      GS
dtype: object

Getting values using the index

In [11]:
stocks_series[0]

'AAPL'

In [12]:
stocks_series[1:3]

1    BABA
2    DIDI
dtype: object

In [13]:
stocks_series[2:6]

2    DIDI
3    MSFT
4    AMZN
5    ADBE
dtype: object

Unlike a list, a Series does not have to use integers as its index. With non-integer labels it looks more like a dictionary, and we can create one from a dictionary.

In [14]:
sales = {'Central Branch' : 10000,
         'TST Branch' : 2000,
         'Mongkok Branch' : 3000}

In [15]:
sales_series = pd.Series(sales)

In [16]:
sales_series

Central Branch    10000
TST Branch         2000
Mongkok Branch     3000
dtype: int64

Getting a value using the index

In [17]:
sales_series["Central Branch"]

np.int64(10000)
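
When the index holds labels rather than integers, it can help to be explicit about whether you mean a label or a position. The sketch below rebuilds the same sales data and compares `.loc` (by label) with `.iloc` (by position). One detail worth noting: a label slice includes its end label, while a position slice excludes its end position.

```python
import pandas as pd

sales_series = pd.Series({"Central Branch": 10000,
                          "TST Branch": 2000,
                          "Mongkok Branch": 3000})

# Label-based access with .loc
central = sales_series.loc["Central Branch"]

# Position-based access with .iloc (0 is the first row)
first = sales_series.iloc[0]

# A label slice includes its end label ...
by_label = sales_series.loc["Central Branch":"TST Branch"]

# ... while a position slice excludes its end position.
by_position = sales_series.iloc[0:1]

print(len(by_label), len(by_position))
```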

## 3. Pandas DataFrame

You can think of a Series as one column of data in an Excel spreadsheet. A DataFrame contains multiple Series, so you can think of it as the data of a whole spreadsheet.

### 3.1 Create dataframe from csv

In [18]:
aapl = pd.read_csv("AAPL.csv")

In [19]:
aapl

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2019-10-28,61.855000,62.312500,61.680000,62.262501,61.650810,96572800
1,2019-10-29,62.242500,62.437500,60.642502,60.822498,60.224953,142839600
2,2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000
3,2019-10-31,61.810001,62.292500,59.314999,62.189999,61.579021,139162000
4,2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200
...,...,...,...,...,...,...,...
248,2020-10-21,116.669998,118.709999,116.449997,116.870003,116.870003,89946000
249,2020-10-22,117.449997,118.040001,114.589996,115.750000,115.750000,101988000
250,2020-10-23,116.389999,116.550003,114.279999,115.040001,115.040001,82572600
251,2020-10-26,114.010002,116.550003,112.879997,115.050003,115.050003,111850700


In [20]:
aapl_proper_index = pd.read_csv("AAPL.csv", parse_dates=True, index_col='Date')

In [21]:
aapl_proper_index

Date,Open,High,Low,Close,Adj Close,Volume
2019-10-28,61.855000,62.312500,61.680000,62.262501,61.650810,96572800
2019-10-29,62.242500,62.437500,60.642502,60.822498,60.224953,142839600
2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000
2019-10-31,61.810001,62.292500,59.314999,62.189999,61.579021,139162000
2019-11-01,62.384998,63.982498,62.290001,63.955002,63.326683,151125200
...,...,...,...,...,...,...
2020-10-21,116.669998,118.709999,116.449997,116.870003,116.870003,89946000
2020-10-22,117.449997,118.040001,114.589996,115.750000,115.750000,101988000
2020-10-23,116.389999,116.550003,114.279999,115.040001,115.040001,82572600
2020-10-26,114.010002,116.550003,112.879997,115.050003,115.050003,111850700
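
The difference between the two `read_csv` calls can be reproduced without the file. This is a rough sketch that assumes `AAPL.csv` has the column layout shown in the output above; the three rows are held in memory with `io.StringIO` purely for illustration. With `index_col='Date'` and `parse_dates=True`, the dates should become the index instead of a regular column.

```python
import io
import pandas as pd

# A few rows in the layout assumed for AAPL.csv, kept in memory for illustration.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2019-10-28,61.855000,62.312500,61.680000,62.262501,61.650810,96572800
2019-10-29,62.242500,62.437500,60.642502,60.822498,60.224953,142839600
2019-10-30,61.189999,61.325001,60.302502,60.814999,60.217525,124522000
"""

# Default behaviour: rows are numbered and Date is an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# Date becomes the index and is parsed as dates.
date_index = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col="Date")

print(type(default_index.index).__name__)
print(type(date_index.index).__name__)
```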


### 3.2 From Quandl

In [22]:
import quandl

ModuleNotFoundError: No module named 'quandl'

In [None]:
quandl.ApiConfig.api_key = 'x9M_pZutNNPnha1WDdjZ'
ck = quandl.get('HKEX/00001', start_date='2020-10-20', end_date='2021-10-20')

In [None]:
ck

### 3.3 From Series

In [None]:
costs = {'Central Branch' : 300000,
         'TST Branch' : 50000,
         'Mongkok Branch' : 20000}

In [None]:
branch_summary = pd.DataFrame({"sales": sales, "costs": costs})

In [None]:
branch_summary

### 3.4 Getting data from dataframe (getting rows with date)

In [None]:
aapl_proper_index.loc["2019-10-30"]

In [None]:
aapl_proper_index.loc["2019-10-30":"2019-11-15"]

In [None]:
aapl_proper_index.loc["2019-11"]

### 3.5 Getting data from dataframe (get a series)

In [None]:
aapl_proper_index.loc["2019-11"]["Close"]

### 3.6 Getting data from dataframe (get multiple columns from a dataframe)

In [None]:
aapl_proper_index.loc["2019-11"][["Open","Close"]]
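
As an alternative to chaining two selections, `.loc` also accepts the rows and the columns in a single call. The sketch below uses a small made-up `prices` DataFrame (hypothetical values, not real AAPL data) on a daily date index.

```python
import pandas as pd

# Hypothetical prices on a daily DatetimeIndex (not real AAPL data).
dates = pd.date_range("2019-10-30", periods=5, freq="D")
prices = pd.DataFrame({"Open": [1.0, 2.0, 3.0, 4.0, 5.0],
                       "Close": [1.5, 2.5, 3.5, 4.5, 5.5]}, index=dates)

# Rows for November 2019 and one column: the result is a Series.
nov_close = prices.loc["2019-11", "Close"]

# Same rows, two columns: the result is a DataFrame.
nov_open_close = prices.loc["2019-11", ["Open", "Close"]]

print(nov_close)
print(nov_open_close)
```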

### 3.7 Getting data from dataframe (index that is not a date or integer)

In [None]:
branch_summary.loc["Central Branch"]

In [None]:
branch_summary["sales"]

### 3.8 Getting data from dataframe (using implicit index)

In [None]:
branch_summary.index

In [None]:
aapl_proper_index.index

In [None]:
aapl.index

In [None]:
aapl_proper_index.iloc[0:10]

## 4. Filtering

### 4.1 Single condition

In [None]:
aapl_proper_index["Open"] > 100

In [None]:
aapl_proper_index[aapl_proper_index["Open"] > 100]

### 4.2 Multiple conditions

In [None]:
(aapl_proper_index["Open"] > 100) & (aapl_proper_index["Volume"] > 100000000)

In [None]:
cond = (aapl_proper_index["Open"] > 100) & (aapl_proper_index["Volume"] > 100000000)

In [None]:
aapl_proper_index[cond]
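
Conditions are combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` in Python. The sketch below uses a small made-up `quotes` DataFrame (hypothetical numbers) to show all three operators.

```python
import pandas as pd

# Hypothetical data for illustration.
quotes = pd.DataFrame({"Open": [90, 105, 120, 95],
                       "Volume": [150, 80, 200, 300]})

high_open = quotes["Open"] > 100
high_volume = quotes["Volume"] > 100

both = quotes[high_open & high_volume]    # rows meeting both conditions
either = quotes[high_open | high_volume]  # rows meeting at least one condition
not_high_open = quotes[~high_open]        # rows where the first condition is False

print(len(both), len(either), len(not_high_open))
```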

#### Side notes

Use `head()` and `tail()` to show only the first or last rows of a result.

In [None]:
aapl_proper_index[cond].head(20)

In [None]:
aapl_proper_index[cond].tail(5)

### 4.3 Query

In [None]:
aapl_proper_index[cond].query("Open > 100 and Volume > 110000000").head(10)
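
A query string can also refer to Python variables by prefixing them with `@`, which avoids hard-coding thresholds. The sketch below uses a made-up `quotes` DataFrame and hypothetical variable names `min_open` and `min_volume`, and checks that `query` selects the same rows as the equivalent boolean mask.

```python
import pandas as pd

# Hypothetical data for illustration.
quotes = pd.DataFrame({"Open": [90, 105, 120, 95],
                       "Volume": [150, 80, 200, 300]})

min_open = 100
min_volume = 100

# @name refers to a Python variable in the surrounding scope.
by_query = quotes.query("Open > @min_open and Volume > @min_volume")

# The same filter written as a boolean mask.
by_mask = quotes[(quotes["Open"] > min_open) & (quotes["Volume"] > min_volume)]

print(by_query.equals(by_mask))
```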

# 5. New columns

## 5.1 Density Example

In [None]:

population = pd.Series({'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
)

area = pd.Series({'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995})

states = pd.DataFrame(  {'population': population,'area': area} )
states

In [None]:
states["density"] = states["population"] / states["area"]

In [None]:
states

## 5.2 Stocks example

In [None]:
aapl_proper_index["Percent Changes"] = aapl_proper_index["Close"].pct_change()

In [None]:
aapl_proper_index
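
To see what `pct_change()` computes, the sketch below applies it to a short made-up `close` Series and compares it with the manual calculation: each value divided by the previous one, minus 1. The first entry is `NaN` because it has no previous value.

```python
import pandas as pd

# Hypothetical closing prices for illustration.
close = pd.Series([100.0, 110.0, 99.0])

changes = close.pct_change()

# Manual equivalent: current value / previous value - 1
manual = close / close.shift(1) - 1

print(changes)
print(manual)
```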

# 6. Aggregation

## 6.1 Basic operations

In [None]:
aapl_proper_index["Percent Changes"].mean()

In [None]:
aapl_proper_index["Percent Changes"].max()

In [None]:
aapl_proper_index["Percent Changes"].min()

In [None]:
aapl_proper_index["Percent Changes"].median()

In [None]:
aapl_proper_index[aapl_proper_index["Percent Changes"] > 0]["Percent Changes"].mean()

In [None]:
aapl_proper_index[aapl_proper_index["Percent Changes"] < 0]["Percent Changes"].mean()
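
Several statistics can be computed in one call by passing a list of function names to `agg`. The sketch below uses a made-up `changes` Series (hypothetical daily returns) and also repeats the filter-then-average pattern used above for gains and losses.

```python
import pandas as pd

# Hypothetical daily percentage changes for illustration.
changes = pd.Series([0.02, -0.01, 0.03, -0.04])

# One call returning a Series indexed by statistic name.
summary = changes.agg(["mean", "max", "min", "median"])

# Average of the positive days and of the negative days.
avg_gain = changes[changes > 0].mean()
avg_loss = changes[changes < 0].mean()

print(summary)
print(avg_gain, avg_loss)
```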

## 6.2 Grouping

We use the planets discovery dataset that ships with seaborn as an example for grouping.

In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

In [None]:
planets

In [None]:
planets.groupby('method')['orbital_period'].median()

In [None]:
planets.groupby('method')['orbital_period'].describe()

In [None]:
planets.groupby('method')["number"].count()
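
Loading the seaborn dataset may require a network connection, so here is a small offline sketch of the same idea. The `observations` DataFrame is made up (it only borrows the column names `method` and `orbital_period`). Rows are split by `method`, and `agg` computes several statistics per group at once.

```python
import pandas as pd

# Hypothetical observations, not taken from the seaborn planets dataset.
observations = pd.DataFrame({
    "method": ["Transit", "Transit", "Radial Velocity", "Radial Velocity", "Imaging"],
    "orbital_period": [3.0, 5.0, 100.0, 300.0, 1000.0],
})

# One row per method, one column per statistic.
stats = observations.groupby("method")["orbital_period"].agg(["count", "median"])

print(stats)
```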

# 7. Joining Data

## 7.1 Merge (or join)

In [None]:
department = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

department

In [None]:
hire_date = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

hire_date

In [None]:
employee = pd.merge(department, hire_date)
employee

In [None]:
employee = pd.merge(department, hire_date, on="employee")
employee

In [None]:
salary = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
salary

In [None]:
employee = pd.merge(department, salary, left_on="employee", right_on="name")
employee

In [None]:
employee = employee.drop('name',axis=1)
employee

## 7.2 One-to-many merging

In [None]:
supervisor = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
supervisor

In [None]:
pd.merge(employee,supervisor)

## 7.3 Many-to-many merging

In [None]:
skills = pd.DataFrame({'group': ['Accounting', 'Accounting','Engineering', 
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 
                               'linux','spreadsheets', 'organization']})

In [None]:
pd.merge(employee,skills)

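This is an unusual data set because the `group` key is repeated in both tables. Make sure you understand how a many-to-many merge treats it: every matching pair of rows produces one row in the result.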
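
The row multiplication in a many-to-many merge can be checked on a smaller example. In the sketch below (reduced, made-up versions of the tables above), each of the three employees matches two skills, giving six rows. The `validate` argument of `pd.merge` is one way to guard against this when it is not intended: it raises a `MergeError` if the keys are not unique where you expected them to be.

```python
import pandas as pd
from pandas.errors import MergeError

# Reduced, hypothetical versions of the employee and skills tables.
employee = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa"],
                         "group": ["Accounting", "Engineering", "Engineering"]})
skills = pd.DataFrame({"group": ["Accounting", "Accounting", "Engineering", "Engineering"],
                       "skills": ["math", "spreadsheets", "coding", "linux"]})

# Each employee row is repeated once per matching skill row.
merged = pd.merge(employee, skills, on="group")
print(len(merged))

# Ask pandas to check that the left keys are unique; here they are not.
try:
    pd.merge(employee, skills, on="group", validate="one_to_many")
    duplicates_found = False
except MergeError:
    duplicates_found = True

print(duplicates_found)
```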

## 7.4 Inner Join / Outer Join / Left Join / Right Join

In [None]:
fav_food = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
fav_food

In [None]:
fav_drink = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
fav_drink

In [None]:
pd.merge(fav_food, fav_drink)

In [None]:
pd.merge(fav_food, fav_drink, how="outer")

In [None]:
pd.merge(fav_food, fav_drink, how="left")

In [None]:
pd.merge(fav_food, fav_drink, how="right")
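
To see where each row of a join comes from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only` or `both`. The sketch below repeats the two tables and also compares the row counts of the four join types.

```python
import pandas as pd

fav_food = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                         "food": ["fish", "beans", "bread"]})
fav_drink = pd.DataFrame({"name": ["Mary", "Joseph"],
                          "drink": ["wine", "beer"]})

# Outer join with a column recording the source of each row.
outer = pd.merge(fav_food, fav_drink, how="outer", indicator=True)
print(outer)

# Number of rows produced by each join type.
row_counts = {how: len(pd.merge(fav_food, fav_drink, how=how))
              for how in ["inner", "outer", "left", "right"]}
print(row_counts)
```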

# 8. Handling Missing Data

In [None]:
hibor = pd.read_csv("hibor.csv", parse_dates=True, index_col='date')

In [None]:
hibor

## 8.1 Check missing data

In [None]:
hibor.isnull()

In [None]:
hibor.isnull().values.any()

In [None]:
hibor["overnight"].isnull().values.any()

In [None]:
hibor["overnight"].isnull().value_counts()

In [None]:
hibor["overnight"][hibor["overnight"].isnull()]
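
A compact way to summarise missing data is to count the missing values per column with `isnull().sum()`, and to select the rows that contain at least one missing value. The sketch below uses a made-up `rates` DataFrame; the column `one_week` is hypothetical and may not match the columns in `hibor.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical rates with some gaps, for illustration only.
rates = pd.DataFrame({"overnight": [0.5, np.nan, 0.7, np.nan],
                      "one_week": [0.6, 0.65, np.nan, 0.8]})

# Number of missing values in each column.
missing_per_column = rates.isnull().sum()

# Rows that have a missing value in any column.
rows_with_missing = rates[rates.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```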

## 8.2 Drop Data

In [None]:
hibor.dropna()

## 8.3 Fill with specific values

Note: this is shown only as an example. Filling with 0 does not make sense in this scenario.

In [None]:
hibor.fillna(0)

## 8.4 Fill with previous values (i.e. forward fill)


In [None]:
hibor.ffill()

## 8.5 Fill with next values (i.e. back fill)


Remark: back fill may not make sense in this example.

In [None]:
hibor.bfill()
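
The two fill directions are easiest to compare on a tiny example. The sketch below uses a made-up `rate` Series with a two-value gap and the `ffill()` and `bfill()` methods: forward fill copies the last known value into the gap, while back fill copies the next known value.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle.
rate = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = rate.ffill()   # the gap takes the previous value
backward = rate.bfill()  # the gap takes the next value

print(forward.tolist())
print(backward.tolist())
```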

# 9. Export CSV

Export a DataFrame to a CSV file. Remember not to overwrite the original file!

In [None]:
aapl_proper_index.to_csv("AAPL_new.csv")
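
One way to check an export is to read the file back and compare it with the original. The sketch below writes a small made-up `prices` DataFrame to a temporary directory (so no existing file is touched) and reloads it with the date index restored. The file name `prices_new.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data with a named date index.
prices = pd.DataFrame({"Close": [1.5, 2.5]},
                      index=pd.to_datetime(["2019-10-28", "2019-10-29"]))
prices.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prices_new.csv")

    # The index is written as the first column by default.
    prices.to_csv(path)

    # Read it back, turning the Date column into a date index again.
    restored = pd.read_csv(path, parse_dates=True, index_col="Date")

print(restored)
```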