# Pandas 00 - Intro

by Nova@Douban

A video recording of this session is available here: https://zoom.us/recording/share/rDS-o_BWuPyBYIbswQ6bKJ5QGeFzY50BVFnBnw4t7pOwIumekTziMw?startTime=1545565951000

---

## 0.1 Course overview

<img src="../image/outline.png">

---

## 0.2 How to learn pandas?

1. __Code, Code, and Code!__
2. Read [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)
3. Check [StackOverflow](http://stackoverflow.com)
4. Check reference books
    1. _Python for Data Analysis_
    2. _Learning Pandas - Python Data Discovery and Analysis Made Easy_
5. Check blogs
    1. [Wes McKinney, the author of pandas](http://wesmckinney.com/archives.html)
    2. [Dataquest](https://www.dataquest.io/blog/)
    3. [Introduction to Pandas by Ritchie Ng](https://www.ritchieng.com/tag_pandas/)

---

## 0.3 A Brief Overview of pandas
    
### 0.3.1 When to use pandas?

1. If the dataset fits in memory on your local machine or a single server, use pandas;
2. If you want to speed up data computation in Python, use pandas;
3. If the computing logic is too complicated for simple SQL queries, use pandas;
4. If you want to convert data between file formats, use pandas.
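
As a minimal sketch of the format-conversion case, the snippet below reads CSV text and emits JSON Lines. The CSV content is made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import io
import json

import pandas as pd

# Made-up CSV content standing in for a real file (illustrative values only).
csv_text = "Date,Open,Close\n2018-11-23,10.0,12.5\n2018-11-26,12.5,11.0\n"

df = pd.read_csv(io.StringIO(csv_text))

# With no path argument, to_json returns the JSON text instead of writing a file.
json_lines = df.to_json(orient="records", lines=True)

# Each non-empty line is one JSON record.
records = [json.loads(line) for line in json_lines.splitlines() if line]
print(records[0])
```

Passing a file path to `read_csv` and `to_json` instead would perform the same conversion on disk.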

### 0.3.2 How to use pandas?

1. If you use pandas, use it the pandas way.
2. If you use pandas, use it as a framework.
3. If you use pandas, track the code, not the data.
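
One common reading of "the pandas way" is to operate on whole columns rather than looping over rows. The sketch below contrasts the two on a small made-up table; the column names mirror the Nasdaq dataset used later, but the values are illustrative only.

```python
import pandas as pd

# Small made-up price table (values are illustrative only).
df = pd.DataFrame({"High": [12.0, 15.5, 14.0], "Low": [10.0, 12.5, 13.0]})

# Row-by-row loop: works, but creates a Series per row and scales poorly.
loop_result = [row["High"] - row["Low"] for _, row in df.iterrows()]

# The pandas way: one vectorized expression over whole columns.
df["Max_diff"] = df["High"] - df["Low"]

# Both approaches produce the same numbers.
assert df["Max_diff"].tolist() == loop_result
```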


### 0.3.3 Some basic principles of pandas

1. There are usually multiple ways to accomplish a task in pandas.

2. When there are multiple ways to write something in pandas, choose the most suitable one.

3. Treat pandas as a framework, not just a tool.

---

## 0.4 Examples

### 0.4.1 Multiple ways to drop a column

In [1]:
# Download Nasdaq dataset: https://finance.yahoo.com/quote/%5EIXIC/history?p=%5EIXIC

import pandas as pd

in_file = '../data/nasdaq.csv'
df = pd.read_csv(in_file, engine='c')
df.head()
# df.describe()

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2018-11-23 | 6919.52002 | 6987.890137 | 6919.160156 | 6938.97998 | 6938.97998 | 958950000 |
| 1 | 2018-11-26 | 7026.5 | 7083.930176 | 7003.120117 | 7081.850098 | 7081.850098 | 2011180000 |
| 2 | 2018-11-27 | 7041.22998 | 7105.140137 | 7014.359863 | 7082.700195 | 7082.700195 | 2067360000 |
| 3 | 2018-11-28 | 7135.080078 | 7292.709961 | 7090.97998 | 7291.589844 | 7291.589844 | 2390260000 |
| 4 | 2018-11-29 | 7267.370117 | 7319.959961 | 7217.689941 | 7273.080078 | 7273.080078 | 1983460000 |


In [2]:
def drop_col_1(df, col):
    '''
    Drop a column with DataFrame.drop on a copy.
    '''
    df1 = df.copy()
    df1.drop(col, axis=1, inplace=True)
    return df1

def drop_col_2(df, col):
    '''
    Drop a column with the del statement on a copy.
    '''
    df2 = df.copy()
    del df2[col]
    return df2

def drop_col_3(df, col):
    '''
    Drop a column by selecting the list of remaining column labels.
    '''
    df3 = df.copy()

    # Build the list of labels to keep, then select them.
    cols = list(df3.columns)
    cols.remove(col)

    df3 = df3[cols]
    return df3
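
For comparison, here is a sketch of three variants that leave the input untouched without an explicit `copy()`: `drop(columns=...)`, a boolean mask over the column labels, and label-list selection. The small DataFrame is made up for illustration and is not the Nasdaq data.

```python
import pandas as pd

# Made-up sample data (illustrative values only).
df = pd.DataFrame({"Open": [1.0, 2.0], "Close": [1.5, 2.5], "Volume": [100, 200]})

# drop() returns a new DataFrame by default, so no explicit copy is needed.
dropped = df.drop(columns="Open")

# Boolean selection: keep every column whose label is not "Open".
masked = df.loc[:, df.columns != "Open"]

# Label selection: pass the list of columns to keep.
selected = df[[c for c in df.columns if c != "Open"]]

# All three give the same result, and the original still has "Open".
assert dropped.equals(masked) and masked.equals(selected)
assert "Open" in df.columns
```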

### 0.4.2 Choose the most suitable way from multiple ways

We can use profiling tools to benchmark the performance of the different approaches:

1. `%timeit` for speed
2. `%memit` for memory consumption (provided by the `memory_profiler` extension)

In [3]:
%timeit r1 = drop_col_1(df, ['Open'])
%timeit r2 = drop_col_2(df, 'Open')
%timeit r3 = drop_col_3(df, 'Open')

1.08 ms ± 97.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
510 µs ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
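
Outside IPython, where the `%timeit` magic is unavailable, the standard-library `timeit` module gives a comparable measurement. The sketch below is an assumption-laden stand-in: `drop_with_del` is a hypothetical helper equivalent to `drop_col_2`, the DataFrame is made up, and it reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import timeit

import pandas as pd

# Made-up data; the real benchmark above uses the Nasdaq CSV.
df = pd.DataFrame({"Open": range(1000), "Close": range(1000)})

def drop_with_del(frame, col):
    """Hypothetical helper: copy the frame, then delete one column."""
    out = frame.copy()
    del out[col]
    return out

number = 100  # calls per run
# repeat() returns one total time per run; divide by `number` for per-call time.
runs = timeit.repeat(lambda: drop_with_del(df, "Open"), number=number, repeat=5)
best = min(runs) / number
print(f"best of 5 runs: {best * 1e6:.1f} microseconds per call")
```

Absolute timings depend on the machine and the pandas version, so only the relative ordering of the methods is meaningful.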


In [4]:
%load_ext memory_profiler
%memit r1 = drop_col_1(df, ['Open'])
%memit r2 = drop_col_2(df, 'Open')
%memit r3 = drop_col_3(df, 'Open')

peak memory: 76.87 MiB, increment: 0.45 MiB
peak memory: 76.88 MiB, increment: 0.00 MiB
peak memory: 76.88 MiB, increment: 0.00 MiB
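
If `memory_profiler` is not installed, the standard-library `tracemalloc` module offers a rough alternative. Note the assumption: `tracemalloc` reports allocations traced by Python, not the process-level peak that `%memit` shows, so the numbers are not directly comparable. The DataFrame below is made up for illustration.

```python
import tracemalloc

import pandas as pd

# Made-up data, larger than the Nasdaq sample so allocations are visible.
df = pd.DataFrame({"Open": range(100_000), "Close": range(100_000)})

tracemalloc.start()
result = df.drop(columns="Open")
# get_traced_memory() returns (current, peak) sizes in bytes since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak traced allocation: {peak / 2**20:.2f} MiB")
```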


__Conclusion__

1. `drop_col_2` was the fastest of the three methods (about 510 µs versus roughly 1 ms for the other two);
2. All three methods showed essentially the same peak memory (about 76.9 MiB);
3. `drop_col_1` is the most idiomatic pandas approach, and `drop_col_3` is the least readable.

Therefore, we choose `drop_col_2` for this task.

### 0.4.3 Use pandas as a framework

Suppose we have the following task:

1. read data from a CSV file;
2. calculate some results according to requirements;
3. output the results to a JSON file and an Excel file.

pandas handles all of these steps easily.

In [5]:
def process_nasdaq(in_csv, out_json, out_excel):
    # Read from CSV
    df = pd.read_csv(in_csv, engine='c')

    # Clean data
    df.rename(columns={'Adj Close': 'Adj_close'}, inplace=True)

    # Calculation
    df['Max_diff'] = df['High'] - df['Low']
    df['Open_close_diff'] = df['Close'] - df['Open']

    # Output to JSON (one record per line)
    df.to_json(out_json, lines=True, orient='records')

    # Output to Excel; the context manager saves and closes the file
    # (ExcelWriter.save() was removed in pandas 2.0)
    with pd.ExcelWriter(out_excel) as writer:
        df.to_excel(writer, sheet_name='Sheet1')

In [6]:
in_csv = '../data/nasdaq.csv'
out_json = '../data/nasdaq.json'
out_excel = '../data/nasdaq.xlsx'
process_nasdaq(in_csv, out_json, out_excel)
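
A quick way to check such a pipeline is to read the output back. The sketch below covers only the JSON step, uses made-up data and a temporary directory instead of `../data/`, and skips the Excel output, which would need an Excel engine such as `openpyxl` installed.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data (illustrative values only).
df = pd.DataFrame({"High": [12.5, 15.5], "Low": [10.0, 12.0]})
df["Max_diff"] = df["High"] - df["Low"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_json = os.path.join(tmp_dir, "sample.json")
    df.to_json(out_json, lines=True, orient="records")
    # lines=True must match how the file was written.
    restored = pd.read_json(out_json, lines=True)

# Values survive the round trip; dtypes are re-inferred when reading.
assert restored["Max_diff"].tolist() == [2.5, 3.5]
```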

---

To access the remaining sessions (outlines and video recordings), please scan the QR code below to pay.

1. The price is 799 RMB.
2. Please leave your email address in the __payment comment__ so that I can send you the links to the remaining sessions.


<img src="../image/alipay.jpg">