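# Example goals:
1. Practice performance tuning
2. Handle large datasets

# Key points:
1. This example introduces three ways to speed up Pandas; Python offers many other ways to improve performance
2. Downcasting column dtypes helps reduce memory usage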

Groupby https://zhuanlan.zhihu.com/p/101284491

In [1]:
import pandas as pd
import numpy as np 
import time

In [2]:
score_df = pd.DataFrame([[1,50,80,70,'boy',1], 
              [2,60,45,50,'boy',2],
              [3,98,43,55,'boy',1],
              [4,70,69,89,'boy',2],
              [5,56,79,60,'girl',1],
              [6,60,68,55,'girl',2],
              [7,45,70,77,'girl',1],
              [8,55,77,76,'girl',2],
              [9,25,57,60,'girl',1],
              [10,88,40,43,'girl',3],
              [11,25,60,45,'boy',3],
              [12,80,60,23,'boy',3],
              [13,20,90,66,'girl',3],
              [14,50,50,50,'girl',3],
              [15,89,67,77,'girl',3]],columns=['student_id','math_score','english_score','chinese_score','sex','class'])
score_df

Unnamed: 0,student_id,math_score,english_score,chinese_score,sex,class
0,1,50,80,70,boy,1
1,2,60,45,50,boy,2
2,3,98,43,55,boy,1
3,4,70,69,89,boy,2
4,5,56,79,60,girl,1
5,6,60,68,55,girl,2
6,7,45,70,77,girl,1
7,8,55,77,76,girl,2
8,9,25,57,60,girl,1
9,10,88,40,43,girl,3


**agg with a built-in function (passed by name)**

In [3]:
start_time = time.time()
score_df.groupby('class').agg('mean')
end_time = time.time()
end_time - start_time

0.002966642379760742

**agg with a user-defined function**

In [4]:
start_time = time.time()
score_df.groupby('class').agg(lambda x: x.mean())
end_time = time.time()
end_time - start_time

0.021939516067504883

**transform with a built-in function (passed by name)**

In [5]:
start_time = time.time()
score_df.groupby('class').transform('mean')
end_time = time.time()
end_time - start_time

0.023904800415039062

**transform with a user-defined function**

In [6]:
start_time = time.time()
score_df.groupby('class').transform(lambda x: x.mean())
end_time = time.time()
end_time - start_time

0.014986276626586914

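For `agg`, the built-in `'mean'` was about seven times faster than the equivalent lambda (0.0030 s vs 0.0219 s). The `transform` timings above point the other way (0.0239 s vs 0.0150 s), but a single `time.time()` measurement on 15 rows is dominated by noise and first-call overhead. In general, prefer built-in functions with `agg` and `transform`: pandas dispatches them to optimized implementations, whereas a custom function is called once per group in Python.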
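
Single `time.time()` readings on a tiny frame are hard to interpret, so here is a minimal sketch of a steadier comparison. It assumes a synthetic frame (`big_df`, with illustrative sizes that are not from the original) and takes the best of a few `timeit` runs. Only numeric columns are included, because recent pandas versions raise an error when `mean` is applied to a string column such as `sex`.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger dataset; row and group counts are illustrative assumptions.
rng = np.random.default_rng(0)
big_df = pd.DataFrame({
    "class": rng.integers(0, 50, size=200_000),
    "math_score": rng.integers(0, 101, size=200_000),
    "english_score": rng.integers(0, 101, size=200_000),
})
grouped = big_df.groupby("class")


def best_of(func, repeat=3):
    """Return the fastest of `repeat` single runs, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


# Both forms compute the same per-class means.
builtin_result = grouped.agg("mean")
custom_result = grouped.agg(lambda x: x.mean())

timings = {
    "agg('mean')": best_of(lambda: grouped.agg("mean")),
    "agg(lambda)": best_of(lambda: grouped.agg(lambda x: x.mean())),
    "transform('mean')": best_of(lambda: grouped.transform("mean")),
    "transform(lambda)": best_of(lambda: grouped.transform(lambda x: x.mean())),
}
for name, seconds in timings.items():
    print(f"{name:>18}: {seconds:.5f} s")
```

Taking the minimum of several runs filters out one-off delays such as first-call setup, which is the kind of noise that can flip the ordering in a single measurement.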

In [7]:
score_df

Unnamed: 0,student_id,math_score,english_score,chinese_score,sex,class
0,1,50,80,70,boy,1
1,2,60,45,50,boy,2
2,3,98,43,55,boy,1
3,4,70,69,89,boy,2
4,5,56,79,60,girl,1
5,6,60,68,55,girl,2
6,7,45,70,77,girl,1
7,8,55,77,76,girl,2
8,9,25,57,60,girl,1
9,10,88,40,43,girl,3


**Filtering matching rows with a list comprehension**

In [8]:
score_df1 = score_df.copy()
start_time = time.time()
score_df1['Pass_math'] = [i>=60 for i in score_df1.math_score]
end_time = time.time()
end_time - start_time

0.0009779930114746094

**Filtering with a DataFrame column comparison**

In [9]:
score_df1 = score_df.copy()
start_time = time.time()
score_df1['Pass_math'] = score_df1.math_score>=60
end_time = time.time()
end_time - start_time

0.008969306945800781

**Filtering with a user-defined function via apply**

In [10]:
score_df2 = score_df.copy()
start_time = time.time()
score_df2['Pass_math'] = score_df2.math_score.apply(lambda x : x>=60)
end_time = time.time()
end_time - start_time

0.000978231430053711

**Filtering with isin()**

In [11]:
score_df3 = score_df.copy()
start_time = time.time()
# range(60, 101) so that a perfect score of 100 also counts as a pass
score_df3['Pass_math'] = score_df3.math_score.isin(range(60, 101))
end_time = time.time()
end_time - start_time

0.0009648799896240234

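In this run `isin()` was the fastest (0.00096 s), although the list comprehension and `apply` were within measurement noise of it (both about 0.00098 s). The column comparison came out slowest (0.0090 s), which on 15 rows likely reflects timing noise rather than real cost. `isin()` and the column comparison are both vectorized: they process the whole column at once instead of element by element in Python, which is why they tend to win as the data grows. (`isin()` is only one vectorized option; others are worth trying.)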
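
To see how the four approaches compare once the column is large enough for timing to mean something, the sketch below repeats the experiment on a synthetic Series. The name `scores` and its size are illustrative assumptions, not part of the original notebook. It first checks that all four approaches produce the same mask, then times each one.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger column; the size is an illustrative assumption.
rng = np.random.default_rng(0)
scores = pd.Series(rng.integers(0, 101, size=200_000), name="math_score")

candidates = {
    "list comprehension": lambda: pd.Series([s >= 60 for s in scores]),
    "column comparison": lambda: scores >= 60,
    "apply(lambda)": lambda: scores.apply(lambda x: x >= 60),
    "isin(range)": lambda: scores.isin(range(60, 101)),
}

# All four approaches should yield the same boolean mask.
reference = (scores >= 60).to_numpy()
masks_agree = all(
    np.array_equal(func().to_numpy(), reference) for func in candidates.values()
)
print("masks agree:", masks_agree)

# Best of three single runs per approach, in seconds.
timings = {
    name: min(timeit.repeat(func, number=1, repeat=3))
    for name, func in candidates.items()
}
for name, seconds in timings.items():
    print(f"{name:>20}: {seconds:.5f} s")
```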

In [12]:
np.random.randint(3,9,10)

array([3, 3, 4, 8, 6, 8, 7, 8, 3, 7])

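**With large datasets, running out of memory is a common problem**

**First, generate two large datasets, one of floating-point numbers (float) and one of integers (int), because the improvement differs by type. In the output below, the integer dataset takes 400128 bytes and the float dataset takes 800128 bytes (the integers are int32 on this platform; stored as int64 they would also take 800128 bytes)**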

In [13]:
float_data = pd.DataFrame(np.random.uniform(0,5,100000).reshape(1000,100))
int_data = pd.DataFrame(np.random.randint(0,1000,100000).reshape(1000,100))
int_data.memory_usage(deep=True).sum(), float_data.memory_usage(deep=True).sum()

(400128, 800128)
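
These totals are just element count times item size, plus a small cost for the index. The sketch below checks that arithmetic for the float case; `frame` is an illustrative name, and the exact index size (128 bytes in the output above) can differ between pandas versions.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.uniform(0, 5, 100_000).reshape(1000, 100))

n_rows, n_cols = frame.shape
itemsize = frame.dtypes.iloc[0].itemsize            # 8 bytes per float64 value
data_bytes = n_rows * n_cols * itemsize             # 1000 * 100 * 8 = 800000
index_bytes = frame.index.memory_usage(deep=True)   # small cost of the RangeIndex

total = frame.memory_usage(deep=True).sum()
print(data_bytes, index_bytes, total)
```

The same arithmetic with 4-byte int32 values gives 400000 bytes of data, which matches the integer figure above once the index is added.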

**Downcasting the integer columns to an unsigned integer type (uint) reduces memory usage: 400128 bytes before, 200128 bytes after**

In [14]:
downcast_int = int_data.apply(pd.to_numeric,downcast='unsigned')
int_data.memory_usage(deep=True).sum(),downcast_int.memory_usage(deep=True).sum()

(400128, 200128)

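**The 100 columns were originally int32 here (int64 on platforms where that is NumPy's default integer); after downcasting, all 100 columns are uint16**

In [15]:
compare_int = pd.concat([int_data.dtypes,downcast_int.dtypes],axis=1)
compare_int.columns = ['before','after']
# pd.value_counts is deprecated in recent pandas; Series.value_counts gives the same counts
compare_int.apply(pd.Series.value_counts)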

Unnamed: 0,before,after
uint16,,100.0
int32,100.0,
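
The target dtype depends on the values in each column: `downcast='unsigned'` picks the smallest unsigned type that can hold them. Values up to 999 need uint16 because uint8 stops at 255. The small Series below are made-up examples to illustrate the rule, including what happens with negative values.

```python
import numpy as np
import pandas as pd

# Upper bounds of the two smallest unsigned types: 255 and 65535.
print(np.iinfo(np.uint8).max, np.iinfo(np.uint16).max)

small = pd.to_numeric(pd.Series([0, 100, 255]), downcast="unsigned")
medium = pd.to_numeric(pd.Series([0, 500, 999]), downcast="unsigned")

# Negative values cannot be stored as unsigned, so the dtype is left unchanged.
negative = pd.Series([-1, 500, 999])
negative_unsigned = pd.to_numeric(negative, downcast="unsigned")

# downcast="integer" picks the smallest signed type instead.
negative_signed = pd.to_numeric(negative, downcast="integer")

print(small.dtype, medium.dtype, negative_unsigned.dtype, negative_signed.dtype)
```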


**Downcasting the floating-point columns from float64 to float32 reduces memory usage: 800128 bytes before, 400128 bytes after**

In [16]:
downcast_float = float_data.apply(pd.to_numeric,downcast='float')
float_data.memory_usage(deep=True).sum(),downcast_float.memory_usage(deep=True).sum()

(800128, 400128)

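**The 100 columns were originally float64; after downcasting, all 100 columns are float32**

In [17]:
compare_float = pd.concat([float_data.dtypes,downcast_float.dtypes],axis=1)
compare_float.columns = ['before','after']
# pd.value_counts is deprecated in recent pandas; Series.value_counts gives the same counts
compare_float.apply(pd.Series.value_counts)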

Unnamed: 0,before,after
float32,,100.0
float64,100.0,
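
Halving the memory of float columns costs some precision, because float32 stores fewer significant digits than float64. The sketch below measures the rounding error that the downcast introduces, using made-up data in the same 0 to 5 range as above. Whether that error is acceptable depends on the application.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).uniform(0, 5, 1000))
values32 = pd.to_numeric(values, downcast="float")

# Approximate number of reliable decimal digits for each type.
print(np.finfo(np.float32).precision, np.finfo(np.float64).precision)

# Largest absolute difference between the original and downcast values.
max_abs_error = (values - values32.astype("float64")).abs().max()
print(values32.dtype, max_abs_error)

bytes_before = values.memory_usage(index=False)
bytes_after = values32.memory_usage(index=False)
print(bytes_before, bytes_after)
```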
