# Pandas 01 - Basics of Pandas

by Nova@Douban

A video recording of this session is available here: https://zoom.us/recording/share/L9Jwofdbg3CX2L4wLoPAVrHYyi0F0ok2_58ozScsXsmwIumekTziMw


---

## 1.1 Data Structure of pandas

`pandas` significantly simplifies data structures. If you have used `R` or a relational database, you will find `pandas` very familiar.

### 1.1.1 Three primary data structures in pandas

1. `Series` (a column):

    1. A one-dimensional array-like object containing an array of data.
    
    2. A fixed-length, __ordered dict__.
    
    3. Automatically aligns differently-indexed data in operations.
    
    4. The column returned when indexing a DataFrame is a view, not a copy (at least in pandas versions without Copy-on-Write, so modifying it can affect the DataFrame).


2. `DataFrame` (a collection of columns): 

    1. A tabular, spreadsheet-like data structure containing an ordered collection of columns;
 
    2. __A collection of Series__.
    
   
3. `Index`:
        
    1. An Index also functions as __a fixed-size set__.
    
    2. Index objects are __immutable__ and thus can’t be modified by the user.
    
    3. It is a class in pandas, and more complex than an index in a relational database. It has three major uses:
    
        a. Identification: indices are used to locate Series / rows / items in a DataFrame.   
        
        b. Alignment: pandas always aligns on the index first, automatically.
        
        c. Selection: the index is used to select the relevant columns/rows.
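
The properties listed above can be checked on a tiny hand-made frame. This is only a minimal sketch: the names `toy`, `price` and `volume` and all values are made up for illustration and are not part of the Nasdaq dataset used below.

```python
import pandas as pd

# A DataFrame is a collection of Series that share one index.
toy = pd.DataFrame(
    {"price": [10.0, 11.5, 12.0], "volume": [100, 150, 90]},
    index=["mon", "tue", "wed"],
)

# Selecting one column gives back a Series.
price = toy["price"]
print(type(price).__name__)          # Series

# A Series behaves like a fixed-length, ordered dict: label -> value.
print(price["tue"])                  # 11.5
print("tue" in price)                # True (membership tests the labels)
print(list(price.index))             # ['mon', 'tue', 'wed']

# An Index is immutable: assigning to one position raises TypeError.
index_error = None
try:
    toy.index[0] = "sun"
except TypeError as err:
    index_error = type(err).__name__
print(index_error)                   # TypeError

# An Index also supports set-like operations.
common = toy.index.intersection(pd.Index(["tue", "wed", "thu"]))
print(list(common))                  # ['tue', 'wed']
```

To relabel rows, replace the whole index (for example `toy.index = [...]`) rather than editing it in place.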
        

### 1.1.2 Example of DataFrame, Series and index

In [1]:
# Download Nasdaq dataset: https://finance.yahoo.com/quote/%5EIXIC/history?p=%5EIXIC

import pandas as pd

in_file = '../data/nasdaq.csv'
df = pd.read_csv(in_file, engine='c')

print(df.head())
print()
print(type(df))

         Date         Open         High          Low        Close  \
0  2018-11-23  6919.520020  6987.890137  6919.160156  6938.979980   
1  2018-11-26  7026.500000  7083.930176  7003.120117  7081.850098   
2  2018-11-27  7041.229980  7105.140137  7014.359863  7082.700195   
3  2018-11-28  7135.080078  7292.709961  7090.979980  7291.589844   
4  2018-11-29  7267.370117  7319.959961  7217.689941  7273.080078   

     Adj Close      Volume  
0  6938.979980   958950000  
1  7081.850098  2011180000  
2  7082.700195  2067360000  
3  7291.589844  2390260000  
4  7273.080078  1983460000  

<class 'pandas.core.frame.DataFrame'>


In [2]:
print(df['Date'].head())
print()
print(type(df['Date']))
print()

0    2018-11-23
1    2018-11-26
2    2018-11-27
3    2018-11-28
4    2018-11-29
Name: Date, dtype: object

<class 'pandas.core.series.Series'>



In [3]:
df.index

RangeIndex(start=0, stop=20, step=1)

In [4]:
df.index.values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

---

### 1.1.3 Two other data structures in pandas

1. items:

    1. The smallest unit in pandas.
    
    
2. rows:

    1. Row is not a primary data structure in pandas
    
### 1.1.4 Examples of items and rows

In [5]:
df['Date'][0]

'2018-11-23'

In [6]:
df.loc[0:0]

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|------|------|------|-----|-------|-----------|--------|
| 0 | 2018-11-23 | 6919.52002 | 6987.890137 | 6919.160156 | 6938.97998 | 6938.97998 | 958950000 |
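
Because a row is not a primary data structure, what comes back depends on how the row is selected. The sketch below uses a hypothetical two-row frame `toy` (values hand-typed and rounded, not read from the CSV) to show the three cases side by side.

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["2018-11-23", "2018-11-26"],
                    "Close": [6938.98, 7081.85]})

item = toy["Date"][0]          # a single item (here a Python str)
row_as_series = toy.loc[0]     # one label -> the row comes back as a Series
row_as_frame = toy.loc[0:0]    # a label slice -> a one-row DataFrame

print(type(item).__name__)            # str
print(type(row_as_series).__name__)   # Series
print(type(row_as_frame).__name__)    # DataFrame
print(row_as_frame.shape)             # (1, 2)

# The row Series is indexed by the column names of the frame.
print(list(row_as_series.index))      # ['Date', 'Close']
```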


---

## 1.2 Functions based on pandas

<img src="../image/folder3.png">



### 1.2.1 Three levels of functions

Each level of function handles only the problems at its own level.

1. DataFrame-level functions

2. Series-level functions

3. Item-level functions

### 1.2.2 Example of different levels of pandas functions

The following is a sample script to analyse logs:

1. `prepare_overall_chat` is the overall wrapper;
2. `clean_chat_log` is a DataFrame-level function;
3. `get_all_mentions` and `count_active_user` are Series-level functions.

In [7]:
def prepare_overall_chat(chat_base, time, day):
    '''
    an overall wrapper
    '''
    # 'reading starts'
    overall = csv2pd(chat_base, time, day, HEAD_ANALYSIS, sep=',', engine='c')
    
    # 'overall'
    clean_overall = clean_chat_log(overall)
    
    # 'cleaned kom records'
    all_mentions = get_all_mentions(clean_overall, switch=False)
    
    # 'all_mentions'
    active_user_count = count_active_user(overall, overall['RoomName'], colname='Name', header=HEAD_ACTIVE)
    return clean_overall, all_mentions, active_user_count

    
def clean_chat_log(df):
    '''
    at DataFrame level
    '''
    # remove rows with a null message or name, then drop duplicates;
    # .copy() avoids SettingWithCopyWarning on the assignments below
    df = df[df['TextMsg'].notnull() & df['Name'].notnull()].drop_duplicates().copy()

    # replace ?? or ** in data (the pattern is a regex, so pass regex=True)
    df['Name'] = df['Name'].str.replace(r'\?\?|\*\*', '?#', regex=True)
    df['TextMsg'] = df['TextMsg'].str.replace(r'\?\?|\*\*', '?#', regex=True)

    # 'to_uni' and 'strip new lines'
    df = batch_to_uni(df, col_list=['Name', 'TextMsg', 'RoomName'])
    df = batch_strip(df, col_list=['Name', 'RoomName'], strip_str='\n\r ')
    return df

    
def get_all_mentions(df, switch=True):
    '''
    at Series level
    '''
    # keep only messages that mention someone; .copy() makes the assignment below safe
    all_mentions = df[df['TextMsg'].str.contains('@')].copy()
    all_mentions['MsgTime'] = pd.to_datetime(all_mentions['MsgTime'])
    if switch:
        cleaned_mentions = all_mentions.copy()
        cleaned_mentions = batch_replace(cleaned_mentions, 'TextMsg', CN_PUNCS, '')
        return all_mentions, cleaned_mentions
    else:
        return all_mentions
    
def count_active_user(df, col1, colname, header):
    '''
    at Series level
    '''
    active_user_count = df.groupby([col1])[colname].unique().apply(len)
    active_user_count = active_user_count.subtract(1)  # Exclude 班长
    active_user_count = active_user_count.reset_index()
    active_user_count.columns = header
    active_user_count = batch_to_uni(active_user_count, ['RoomName'])
    return active_user_count
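
The sample script above depends on helpers and constants that are not shown here (`csv2pd`, `batch_to_uni`, `batch_strip`, `batch_replace`, `HEAD_ANALYSIS`, `HEAD_ACTIVE`, `CN_PUNCS`), so it cannot be run as is. A self-contained miniature of the same three-level idea might look like the following; the function names `strip_item`, `clean_names` and `clean_chat` and the sample rows are hypothetical and only illustrate the layering.

```python
import pandas as pd


def strip_item(text):
    """Item level: clean one single value."""
    return text.strip()


def clean_names(names):
    """Series level: apply the item-level function to a whole column."""
    return names.map(strip_item)


def clean_chat(df):
    """DataFrame level: drop incomplete or duplicate rows, then delegate per column."""
    df = df.dropna(subset=["Name", "TextMsg"]).drop_duplicates().copy()
    df["Name"] = clean_names(df["Name"])
    return df


raw = pd.DataFrame({
    "Name": [" amy\n", "bob", None, "bob"],
    "TextMsg": ["hi @bob", "hello", "lost", "hello"],
})
cleaned = clean_chat(raw)
print(cleaned["Name"].tolist())   # ['amy', 'bob']
```

Each function only touches its own level: the DataFrame-level function decides which rows survive, the Series-level function works on one column, and the item-level function sees a single value.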

---

## 1.3 Summarizing and computing descriptive statistics


### 1.3.1 Take a glance at the dataset

__Be careful about whether a name needs parentheses: methods such as `describe()` are called, while attributes such as `shape` are not.__

1. `DataFrame.describe()`: provide descriptive stats of the dataset
2. `DataFrame.values`: access values of the dataset
3. `DataFrame.head()`: access the head of the dataset
4. `DataFrame.tail()`: access the tail of the dataset
5. `DataFrame.shape`: provide the number of rows and columns of the dataset
6. `DataFrame.size`: provide the number of elements (rows × columns) of the dataset
7. `DataFrame.columns`: provide the column names of the dataset
8. `DataFrame.index`: provide the row index of the dataset
9. `DataFrame.axes`: provide the row index and column names of the dataset
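
The list above mixes methods and attributes. A quick way to see the difference is sketched below on a hypothetical three-row frame `toy` (made-up values, not the Nasdaq data).

```python
import pandas as pd

toy = pd.DataFrame({"Open": [1.0, 2.0, 3.0], "Close": [1.5, 2.5, 2.0]})

# Attributes: read without parentheses.
print(toy.shape)            # (3, 2)
print(toy.size)             # 6
print(list(toy.columns))    # ['Open', 'Close']

# Methods: must be called, optionally with arguments.
print(len(toy.head(2)))     # 2
print(toy.describe().loc["mean", "Open"])   # 2.0

# Forgetting the parentheses returns the bound method, not the data.
print(callable(toy.head))   # True
print(callable(toy.shape))  # False
```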

### 1.3.2 Examples of a glance at the dataset

In [8]:
df.describe()

|       | Open | High | Low | Close | Adj Close | Volume |
|-------|------|------|-----|-------|-----------|--------|
| count | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 | 20.0 |
| mean  | 7036.433984 | 7094.115039 | 6932.403003 | 6996.186035 | 6996.186035 | 2492132000.0 |
| std   | 236.773607 | 234.200045 | 268.133301 | 278.415894 | 278.415894 | 666522200.0 |
| min   | 6573.490234 | 6586.680176 | 6304.629883 | 6333.0 | 6333.0 | 958950000.0 |
| 25%   | 6911.255005 | 6973.870117 | 6842.670166 | 6878.972656 | 6878.972656 | 2186262000.0 |
| 50%   | 7033.86499 | 7117.485108 | 6983.674805 | 7051.080078 | 7051.080078 | 2443730000.0 |
| 75%   | 7142.332397 | 7227.205078 | 7092.375 | 7165.887574 | 7165.887574 | 2643168000.0 |
| max   | 7486.129883 | 7486.509766 | 7392.220215 | 7441.509766 | 7441.509766 | 4534120000.0 |


In [9]:
df.values

array([['2018-11-23', 6919.52002, 6987.890137, 6919.160156, 6938.97998,
        6938.97998, 958950000],
       ['2018-11-26', 7026.5, 7083.930176000001, 7003.120117,
        7081.850098000001, 7081.850098000001, 2011180000],
       ['2018-11-27', 7041.22998, 7105.140137, 7014.359863, 7082.700195,
        7082.700195, 2067360000],
       ['2018-11-28', 7135.080078, 7292.709961, 7090.97998, 7291.589844,
        7291.589844, 2390260000],
       ['2018-11-29', 7267.370117, 7319.959961, 7217.689941, 7273.080078,
        7273.080078, 1983460000],
       ['2018-11-30', 7279.299805, 7332.790039, 7255.680176000001,
        7330.540039, 7330.540039, 2542820000],
       ['2018-12-03', 7486.129883, 7486.509765999999, 7392.220215,
        7441.509765999999, 7441.509765999999, 2621020000],
       ['2018-12-04', 7407.950195, 7421.109863, 7150.109863,
        7158.430176000001, 7158.430176000001, 2635810000],
       ['2018-12-06', 7017.049805, 7189.52002, 6984.339844,
        7188.259765999999, 7188.2

In [10]:
df.head()

|   | Date | Open | High | Low | Close | Adj Close | Volume |
|---|------|------|------|-----|-------|-----------|--------|
| 0 | 2018-11-23 | 6919.52002 | 6987.890137 | 6919.160156 | 6938.97998 | 6938.97998 | 958950000 |
| 1 | 2018-11-26 | 7026.5 | 7083.930176 | 7003.120117 | 7081.850098 | 7081.850098 | 2011180000 |
| 2 | 2018-11-27 | 7041.22998 | 7105.140137 | 7014.359863 | 7082.700195 | 7082.700195 | 2067360000 |
| 3 | 2018-11-28 | 7135.080078 | 7292.709961 | 7090.97998 | 7291.589844 | 7291.589844 | 2390260000 |
| 4 | 2018-11-29 | 7267.370117 | 7319.959961 | 7217.689941 | 7273.080078 | 7273.080078 | 1983460000 |


In [11]:
df.tail()

|    | Date | Open | High | Low | Close | Adj Close | Volume |
|----|------|------|------|-----|-------|-----------|--------|
| 15 | 2018-12-17 | 6886.459961 | 6931.810059 | 6710.009766 | 6753.72998 | 6753.72998 | 2665240000 |
| 16 | 2018-12-18 | 6809.819824 | 6847.27002 | 6733.709961 | 6783.910156 | 6783.910156 | 2595400000 |
| 17 | 2018-12-19 | 6777.589844 | 6868.859863 | 6586.5 | 6636.830078 | 6636.830078 | 2899950000 |
| 18 | 2018-12-20 | 6607.759766 | 6666.200195 | 6447.910156 | 6528.410156 | 6528.410156 | 3258090000 |
| 19 | 2018-12-21 | 6573.490234 | 6586.680176 | 6304.629883 | 6333.0 | 6333.0 | 4534120000 |


In [12]:
df.shape

(20, 7)

In [13]:
df.size

140

In [14]:
df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

In [15]:
df.index

RangeIndex(start=0, stop=20, step=1)

In [16]:
df.axes

[RangeIndex(start=0, stop=20, step=1),
 Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')]

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
Date         20 non-null object
Open         20 non-null float64
High         20 non-null float64
Low          20 non-null float64
Close        20 non-null float64
Adj Close    20 non-null float64
Volume       20 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 1.2+ KB


---

## 1.4 Dive into index

### 1.4.1 Index labels

Index labels:
    
1. do not need to be integers;

2. can have repeated labels (__Be careful, this is different from dict__);

3. can have hierarchical sets of labels.

### 1.4.2 Examples of index labels

In [17]:
import numpy as np
import pandas as pd

# The default index is int
aray = np.random.randn(6)
srs = pd.Series(aray)
print(srs)
print()

# We can set repeated non-int labels to index
ind = ['a'] * 6
srs.index = ind
print(srs)
print()

# We can set multi-level labels to index
ind = zip(['a'] * 3 + ['b'] * 3, np.random.randn(6))
ind = pd.MultiIndex.from_tuples(ind, names=['letter', 'float'])
srs.index = ind
print(ind)
print()
print(srs)

0   -0.281210
1    0.770726
2   -0.176266
3   -1.612378
4   -1.868139
5   -0.496955
dtype: float64

a   -0.281210
a    0.770726
a   -0.176266
a   -1.612378
a   -1.868139
a   -0.496955
dtype: float64

MultiIndex(levels=[['a', 'b'], [-1.1125657480588875, -0.17357278036727714, -0.14227622735236414, 0.04140226853409916, 0.923176857116319, 1.2940772084264573]],
           labels=[[0, 0, 0, 1, 1, 1], [4, 1, 3, 0, 5, 2]],
           names=['letter', 'float'])

letter  float    
a        0.923177   -0.281210
        -0.173573    0.770726
         0.041402   -0.176266
b       -1.112566   -1.612378
         1.294077   -1.868139
        -0.142276   -0.496955
dtype: float64


In [37]:
sr1 = pd.Series(aray, index=['a'] * 6)
sr1

a   -0.281210
a    0.770726
a   -0.176266
a   -1.612378
a   -1.868139
a   -0.496955
dtype: float64
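
The practical consequence of repeated labels shows up when selecting by label. The following sketch uses a small made-up Series `srs` (not the random one above) to contrast this with a dict.

```python
import pandas as pd

srs = pd.Series([1, 2, 3], index=["a", "a", "b"])

print(srs.index.is_unique)     # False

# A repeated label selects every matching entry, so the result is a Series.
print(srs["a"].tolist())       # [1, 2]

# A unique label selects a single item.
print(srs["b"])                # 3

# A dict keeps only the last value per key, so one entry is lost.
as_dict = dict(zip(srs.index, srs.values))
print(len(as_dict))            # 2
```

Checking `index.is_unique` before label-based lookups is a cheap way to avoid being surprised by a Series where a scalar was expected.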

---

### 1.4.3 Three major usages

a. Identification: Indices are used to locate Series / rows / items in a DataFrame.   

b. Alignment: pandas will always align with index automatically first.

c. Selection: using index to select relevant columns/rows.

### 1.4.4 Examples of identification

In [38]:
index_df = df.copy()

index_df[index_df['Date'] == '2018-12-14']

|    | Date | Open | High | Low | Close | Adj Close | Volume |
|----|------|------|------|-----|-------|-----------|--------|
| 14 | 2018-12-14 | 6986.370117 | 7027.169922 | 6898.990234 | 6910.660156 | 6910.660156 | 2200510000 |


---

### 1.4.5 Examples of alignment

In [32]:
index_df['Max_diff'] = index_df['High'] - index_df['Low']
index_df.head()

         Date         Open         High          Low        Close  \
0  2018-11-23  6919.520020  6987.890137  6919.160156  6938.979980   
1  2018-11-26  7026.500000  7083.930176  7003.120117  7081.850098   
2  2018-11-27  7041.229980  7105.140137  7014.359863  7082.700195   
3  2018-11-28  7135.080078  7292.709961  7090.979980  7291.589844   
4  2018-11-29  7267.370117  7319.959961  7217.689941  7273.080078   

     Adj Close      Volume    Max_diff  
0  6938.979980   958950000   68.729981  
1  7081.850098  2011180000   80.810059  
2  7082.700195  2067360000   90.780274  
3  7291.589844  2390260000  201.729981  
4  7273.080078  1983460000  102.270020  
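
In the `Max_diff` example both columns share the same index, so the alignment step is invisible. It is easier to see when the two operands are indexed differently. The sketch below uses made-up labels (`d1`, `d2`, `d3`) and numbers purely for illustration.

```python
import pandas as pd

high = pd.Series([10.0, 20.0, 30.0], index=["d1", "d2", "d3"])
low = pd.Series([25.0, 5.0, 15.0], index=["d3", "d1", "d2"])   # same labels, different order
partial = pd.Series([1.0], index=["d1"])                        # most labels missing

# Values are matched by label, not by position.
diff = high - low
print(diff.tolist())            # [5.0, 5.0, 5.0]

# Labels present on only one side produce NaN.
gap = high - partial
print(gap["d1"])                # 9.0
print(int(gap.isna().sum()))    # 2
```

A purely positional subtraction would have given `10 - 25`, `20 - 5` and `30 - 15` instead, which is why alignment matters once frames have been sorted or filtered differently.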


---

### 1.4.6 Examples of selection

In [33]:
index_df.loc[14, ['Date', 'Close']]

Date     2018-12-14
Close       6910.66
Name: 14, dtype: object


In [34]:
index_df['Date'] = index_df['Date'].astype('str')
index_df.set_index('Date', inplace=True)
print(index_df.index)
print()
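# 'Date' is now the index, not a column, so requesting it as a column yields NaN here
# (pandas >= 1.0 raises KeyError for a missing label instead)
print(index_df.loc['2018-12-14', ['Date', 'Close']])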

Index(['2018-11-23', '2018-11-26', '2018-11-27', '2018-11-28', '2018-11-29',
       '2018-11-30', '2018-12-03', '2018-12-04', '2018-12-06', '2018-12-07',
       '2018-12-10', '2018-12-11', '2018-12-12', '2018-12-13', '2018-12-14',
       '2018-12-17', '2018-12-18', '2018-12-19', '2018-12-20', '2018-12-21'],
      dtype='object', name='Date')

Date             NaN
Close    6910.660156
Name: 2018-12-14, dtype: float64


In [22]:
index_df.index.name = None
print(index_df.head())

                   Open         High          Low        Close    Adj Close  \
2018-11-23  6919.520020  6987.890137  6919.160156  6938.979980  6938.979980   
2018-11-26  7026.500000  7083.930176  7003.120117  7081.850098  7081.850098   
2018-11-27  7041.229980  7105.140137  7014.359863  7082.700195  7082.700195   
2018-11-28  7135.080078  7292.709961  7090.979980  7291.589844  7291.589844   
2018-11-29  7267.370117  7319.959961  7217.689941  7273.080078  7273.080078   

                Volume    Max_diff  
2018-11-23   958950000   68.729981  
2018-11-26  2011180000   80.810059  
2018-11-27  2067360000   90.780274  
2018-11-28  2390260000  201.729981  
2018-11-29  1983460000  102.270020  


In [23]:
index_df.index.name = "Date"
index_df.reset_index(inplace=True)
print(index_df.head())

         Date         Open         High          Low        Close  \
0  2018-11-23  6919.520020  6987.890137  6919.160156  6938.979980   
1  2018-11-26  7026.500000  7083.930176  7003.120117  7081.850098   
2  2018-11-27  7041.229980  7105.140137  7014.359863  7082.700195   
3  2018-11-28  7135.080078  7292.709961  7090.979980  7291.589844   
4  2018-11-29  7267.370117  7319.959961  7217.689941  7273.080078   

     Adj Close      Volume    Max_diff  
0  6938.979980   958950000   68.729981  
1  7081.850098  2011180000   80.810059  
2  7082.700195  2067360000   90.780274  
3  7291.589844  2390260000  201.729981  
4  7273.080078  1983460000  102.270020  


---

### 1.4.7 Five ways of index selection

1. `[]` operator: using index / column names to access data.
2. `df.loc`: Access a group of rows and columns by label(s)
3. `df.iloc`: Access a group of rows and columns by integer position(s)
4. `df.at`: Access a single value for a row/column label pair.
5. `df.iat`: Access a single value for a row/column pair by integer position.

In [24]:
index_df.set_index('Date', inplace=True)

print(index_df['2018-11-27':'2018-11-28'])
print()
print(index_df.loc['2018-11-27':'2018-11-28'])
print()
print(index_df.iloc[2:3])
print()
print(index_df.at['2018-11-27','Open'])
print()
print(index_df.iat[2, 0])

                   Open         High          Low        Close    Adj Close  \
Date                                                                          
2018-11-27  7041.229980  7105.140137  7014.359863  7082.700195  7082.700195   
2018-11-28  7135.080078  7292.709961  7090.979980  7291.589844  7291.589844   

                Volume    Max_diff  
Date                                
2018-11-27  2067360000   90.780274  
2018-11-28  2390260000  201.729981  

                   Open         High          Low        Close    Adj Close  \
Date                                                                          
2018-11-27  7041.229980  7105.140137  7014.359863  7082.700195  7082.700195   
2018-11-28  7135.080078  7292.709961  7090.979980  7291.589844  7291.589844   

                Volume    Max_diff  
Date                                
2018-11-27  2067360000   90.780274  
2018-11-28  2390260000  201.729981  

                  Open         High          Low        Close    Adj

In [25]:
%timeit index_df['2018-11-27':'2018-11-28']

124 µs ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [26]:
%timeit index_df.loc['2018-11-27':'2018-11-28']

112 µs ± 6.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [27]:
%timeit index_df.iloc[2:3]

175 µs ± 90.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [28]:
%timeit index_df.at['2018-11-27','Open']

8.53 µs ± 3.88 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [29]:
%timeit index_df.iat[2, 0]

7.2 µs ± 1.86 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


__We recommend using `DataFrame.loc[]` / `DataFrame.iloc[]` in this case for the best balance of performance and readability. Note that they are indexers used with square brackets, not methods called with parentheses.__
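
One detail worth keeping in mind when choosing between the two: a label slice with `loc` includes both end labels, while a positional slice with `iloc` excludes the end position. This is why `iloc[2:3]` in the cell above selects a single row while the label slice selects two. A minimal sketch on a hypothetical four-row frame `toy` with made-up values:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Open": [1.0, 2.0, 3.0, 4.0]},
    index=["2018-11-26", "2018-11-27", "2018-11-28", "2018-11-29"],
)

by_label = toy.loc["2018-11-27":"2018-11-28"]   # both end labels included
by_position = toy.iloc[1:3]                      # end position excluded

print(len(by_label))                    # 2
print(len(by_position))                 # 2
print(by_label.equals(by_position))     # True
print(len(toy.iloc[1:2]))               # 1

# Single values: .at takes labels, .iat takes integer positions.
print(toy.at["2018-11-27", "Open"])     # 2.0
print(toy.iat[1, 0])                    # 2.0
```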

---

## 1.5 Exercises

### 1.5.1 Reviewing

Please review the code above.

### 1.5.2 Refactoring

If you have written pandas scripts before, try to refactor them into different levels of functions.

### 1.5.3 Checking parameters

Check the default and optional parameters of the following methods (and note which entries are attributes that take no parameters at all):

1. `DataFrame.describe()`: provide descriptive stats of the dataset
2. `DataFrame.values`: access values of the dataset
3. `DataFrame.head()`: access the head of the dataset
4. `DataFrame.tail()`: access the tail of the dataset
5. `DataFrame.shape`: provide the number of rows and columns of the dataset
6. `DataFrame.size`: provide the number of elements (rows × columns) of the dataset
7. `DataFrame.columns`: provide the column names of the dataset
8. `DataFrame.index`: provide the row index of the dataset
9. `DataFrame.axes`: provide the row index and column names of the dataset
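
One possible way to approach this exercise, besides reading `help(pd.DataFrame.head)` or the online documentation, is to inspect the signatures programmatically. The helper `show_defaults` below is a hypothetical convenience function written for this sketch, not part of pandas.

```python
import inspect

import pandas as pd


def show_defaults(func):
    """Return {parameter name: default} for the parameters that have a default."""
    params = inspect.signature(func).parameters
    return {
        name: p.default
        for name, p in params.items()
        if p.default is not inspect.Parameter.empty
    }


head_defaults = show_defaults(pd.DataFrame.head)
print(head_defaults)                                  # {'n': 5}

# The exact set of optional parameters can differ between pandas versions.
print(sorted(show_defaults(pd.DataFrame.describe)))
```

Attributes such as `shape` or `columns` have no signature to inspect, which is itself a quick way to tell them apart from methods.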

---

To access the remaining sessions (outlines and video recordings), please scan the QR code below to pay.

1. The price is 799 RMB.
2. Please leave your email address in the __payment comment__ so that I can send you the links to the remaining sessions.


<img src="../image/alipay.jpg">

---