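# Example objectives:
1. Implement statistical functions
2. Apply custom functions along rows or columns

# Key points:
1. The statistical functions are used much like their NumPy counterparts; the difference is that pandas works on DataFrame objects
2. In a custom function, `lambda x` means the same thing as f(x) in mathematics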
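
To make the second key point concrete, here is a minimal sketch that applies the same transformation once through a named function and once through a `lambda`. The data and the names `f`, `s`, `via_def`, and `via_lambda` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical data, used only for illustration
s = pd.Series([1, 4, 9])

def f(x):
    # f(x) = 2x + 1 written as a named function
    return 2 * x + 1

via_def = s.apply(f)
via_lambda = s.apply(lambda x: 2 * x + 1)  # the same f(x), written inline

print(via_def.equals(via_lambda))
```

Both calls describe the same mapping from x to f(x); the `lambda` form just avoids naming the function.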

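# [Learning objectives]

* Correctly insert data into or delete data from a DataFrame
* Correctly merge and reshape DataFrames
* Understand how the DataFrame merging methods differ
  - Inserting or deleting data in a DataFrame
    - Delete: `del`, `.pop()`
  - Merging and reshaping DataFrames
    - Merge: `pd.concat([one, two])`, `pd.merge(one, two, on=)`, `one.join(two)`
    - `.reset_index(drop=True)` **renumbers the index after concat**
  - `.groupby('A').sum()`: splits the rows into groups by one or more fields<br>
    Groupby https://zhuanlan.zhihu.com/p/101284491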

In [1]:
import pandas as pd

In [3]:
score_df = pd.DataFrame([[1,50,80,70], 
              [2,60,45,50],
              [3,98,43,55],
              [4,70,69,89],
              [5,56,79,60],
              [6,60,68,55],
              [7,45,70,77],
              [8,55,77,76],
              [9,25,57,60],
              [10,88,40,43]],columns=['student_id','math_score','english_score','chinese_score'])
score_df = score_df.set_index('student_id')
score_df

            math_score  english_score  chinese_score
student_id
1                   50             80             70
2                   60             45             50
3                   98             43             55
4                   70             69             89
5                   56             79             60
6                   60             68             55
7                   45             70             77
8                   55             77             76
9                   25             57             60
10                  88             40             43


**Mean of a single column**

In [4]:
score_df.math_score.mean()

60.7

**Mean of every column**

In [5]:
score_df.mean()

math_score       60.7
english_score    62.8
chinese_score    63.5
dtype: float64

In [7]:
score_df.mean(axis=0)

math_score       60.7
english_score    62.8
chinese_score    63.5
dtype: float64

**Average score per student**

In [6]:
score_df.mean(axis=1)

student_id
1     66.666667
2     51.666667
3     65.333333
4     76.000000
5     65.000000
6     61.000000
7     64.000000
8     69.333333
9     47.333333
10    57.000000
dtype: float64
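
The two `axis` values are easy to mix up, so here is a small sketch on a two-student subset of the scores. The variable names `per_subject` and `per_student` are assumptions made for this illustration.

```python
import pandas as pd

# Two students, two subjects: a subset of the scores, for illustration
df = pd.DataFrame(
    {"math_score": [50, 60], "english_score": [80, 45]},
    index=pd.Index([1, 2], name="student_id"),
)

# axis=0 collapses the rows: one result per column (per subject)
per_subject = df.mean(axis=0)

# axis=1 collapses the columns: one result per row (per student)
per_student = df.mean(axis=1)

print(per_subject)
print(per_student)
```

In other words, `axis` names the dimension that disappears, which is why `axis=0` gives the same result as the default `score_df.mean()`.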

**Total score across the three subjects per student**

In [8]:
score_df.sum(axis=1)

student_id
1     200
2     155
3     196
4     228
5     195
6     183
7     192
8     208
9     142
10    171
dtype: int64

**Number of students who took each exam**

In [9]:
score_df.count()

math_score       10
english_score    10
chinese_score    10
dtype: int64
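
Every subject shows 10 here because no score is missing. As a sketch of why `.count()` can differ from the number of rows, the example below assumes a hypothetical student who skipped the math exam, recorded as `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second student has no math score
df = pd.DataFrame({
    "math_score": [50, np.nan, 98],
    "english_score": [80, 45, 43],
})

present = df.count()   # counts only non-missing values in each column
total_rows = len(df)   # counts every row, missing or not

print(present)
print(total_rows)
```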

**Median of each subject**

In [10]:
score_df.median()

math_score       58.0
english_score    68.5
chinese_score    60.0
dtype: float64

**75th percentile of each subject**

Percentile: `.quantile()`

In [11]:
score_df.quantile(0.75)

math_score       67.50
english_score    75.25
chinese_score    74.50
Name: 0.75, dtype: float64
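
The math value 67.50 is not one of the actual scores because `.quantile()` interpolates linearly between neighbouring values by default. A rough sketch of that calculation by hand, using the math scores from above (the helper names `ordered`, `position`, `lower`, `fraction`, and `by_hand` are illustrative):

```python
import pandas as pd

math_score = pd.Series([50, 60, 98, 70, 56, 60, 45, 55, 25, 88])

ordered = math_score.sort_values().to_list()
position = 0.75 * (len(ordered) - 1)   # fractional position in the sorted list
lower = int(position)                  # index of the value just below it
fraction = position - lower            # how far to move toward the next value

# Linear interpolation between the two neighbouring scores
by_hand = ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

print(by_hand, math_score.quantile(0.75))
```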

**Maximum of each subject**

`.max()`

In [12]:
score_df.max()

math_score       98
english_score    80
chinese_score    89
dtype: int64

**Minimum of each subject**

`.min()`

In [13]:
score_df.min()

math_score       25
english_score    40
chinese_score    43
dtype: int64

**Standard deviation of each subject**

`.std()`

In [14]:
score_df.std()

math_score       20.854256
english_score    15.418603
chinese_score    14.151953
dtype: float64

**Variance of each subject**

`.var()`

In [15]:
score_df.var()

math_score       434.900000
english_score    237.733333
chinese_score    200.277778
dtype: float64
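
One place where pandas and NumPy do differ is the default divisor: pandas `.var()` and `.std()` use the sample formula (`ddof=1`), while `np.var` and `np.std` default to the population formula (`ddof=0`). A short sketch with the math scores (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

math_score = pd.Series([50, 60, 98, 70, 56, 60, 45, 55, 25, 88])

sample_var = math_score.var()                    # pandas default: divide by n - 1
population_var = np.var(math_score.to_numpy())   # NumPy default: divide by n
matched = math_score.var(ddof=0)                 # pandas asked to behave like NumPy

print(sample_var, population_var, matched)
```

Passing `ddof` explicitly on either side is a simple way to keep results comparable between the two libraries.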

**Correlation coefficients between subjects**

`.corr()`

In [16]:
score_df.corr()

               math_score  english_score  chinese_score
math_score       1.000000      -0.532708      -0.314552
english_score   -0.532708       1.000000       0.682340
chinese_score   -0.314552       0.682340       1.000000
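
By default `.corr()` computes the Pearson correlation coefficient. As a sketch, the math/English entry can be reproduced from the sample covariance divided by the product of the two standard deviations (the names `cov`, `r_by_hand`, and `r_pandas` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "math_score": [50, 60, 98, 70, 56, 60, 45, 55, 25, 88],
    "english_score": [80, 45, 43, 69, 79, 68, 70, 77, 57, 40],
})

x = df["math_score"]
y = df["english_score"]

# Sample covariance (divide by n - 1, matching pandas' default std)
cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r_by_hand = cov / (x.std() * y.std())

r_pandas = df.corr().loc["math_score", "english_score"]

print(r_by_hand, r_pandas)
```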


**Square root of each score, multiplied by ten**

In [17]:
score_df.apply(lambda x : x**(0.5)*10)

            math_score  english_score  chinese_score
student_id
1            70.710678      89.442719      83.666003
2            77.459667      67.082039      70.710678
3            98.994949      65.574385      74.161985
4            83.666003      83.066239      94.339811
5            74.833148      88.881944      77.459667
6            77.459667      82.462113      74.161985
7            67.082039      83.666003      87.749644
8            74.161985      87.749644      87.177979
9            50.000000      75.498344      77.459667
10           93.808315      63.245553      65.574385
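
In `DataFrame.apply`, the `x` handed to the lambda is a whole column (a Series) by default, or a whole row when `axis=1`. The sketch below uses a three-student subset of the scores; the names `spread_per_subject` and `spread_per_row` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

score_df = pd.DataFrame({
    "math_score": [50, 60, 98],
    "english_score": [80, 45, 43],
})

# Element-wise arithmetic: apply works, but the vectorized form is equivalent
scaled_apply = score_df.apply(lambda x: x ** 0.5 * 10)
scaled_vectorized = np.sqrt(score_df) * 10

# Reductions show what x really is: a column by default ...
spread_per_subject = score_df.apply(lambda col: col.max() - col.min())
# ... and a row when axis=1
spread_per_row = score_df.apply(lambda row: row.max() - row.min(), axis=1)

print(spread_per_subject)
print(spread_per_row)
```

For plain arithmetic like the square root above, the vectorized expression is usually the simpler choice; `apply` earns its keep when the function needs the whole column or row.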


# DataFrame operations

**Inserting or deleting data in a DataFrame**

In [18]:
import pandas as pd

df = pd.DataFrame([[1], [2]], columns = ['a'])
print(df)

df['b'] = pd.Series([3, 4])
print(df)

   a
0  1
1  2
   a  b
0  1  3
1  2  4
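
Assigning a Series as a new column aligns on index labels rather than on position. The sketch below deliberately gives the Series a shifted index to show the effect, and also uses `DataFrame.insert` to place a column at a chosen position; the column name `z` and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([[1], [2]], columns=["a"])

# The Series has labels 1 and 2, the DataFrame has labels 0 and 1:
# label 0 gets NaN, label 1 gets 3, and label 2 is dropped
df["b"] = pd.Series([3, 4], index=[1, 2])

# insert(loc, column, value) puts the new column at position 0
df.insert(0, "z", [9, 8])

print(df)
```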


**After appending rows, the index contains duplicates**

In [19]:
import pandas as pd

df = pd.DataFrame([[1, 2]], columns = ['a', 'b'])
print(df)


# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame([[3, 4]], columns = ['a', 'b'])])
print(df)


   a  b
0  1  2
   a  b
0  1  2
0  3  4


`.reset_index(drop=True)` **renumbers the index**

In [20]:
import pandas as pd

df = pd.DataFrame([[1, 2]], columns = ['a', 'b'])
print(df)


# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, pd.DataFrame([[3, 4]], columns = ['a', 'b'])])
df = df.reset_index(drop=True)
print(df)

   a  b
0  1  2
   a  b
0  1  2
1  3  4
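
An alternative to calling `.reset_index(drop=True)` afterwards is to ask `pd.concat` for a fresh index directly with `ignore_index=True`. A minimal sketch (the names `first`, `second`, and `combined` are illustrative):

```python
import pandas as pd

first = pd.DataFrame([[1, 2]], columns=["a", "b"])
second = pd.DataFrame([[3, 4]], columns=["a", "b"])

# ignore_index=True discards the original labels and numbers the rows 0..n-1
combined = pd.concat([first, second], ignore_index=True)

print(combined)
```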


**Deleting:**<br>

`del`<br>
`.pop()`<br>

In [21]:
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns = ['a', 'b', 'c'])
print(df)



del df['a']
df.pop('c')

   a  b  c
0  1  2  3


0    3
Name: c, dtype: int64

In [22]:
import pandas as pd

df = pd.DataFrame([[1], [2]], columns = ['a'])
print(df)



df = df.drop(1)
print(df)

   a
0  1
1  2
   a
0  1
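
The three deletion tools behave differently: `del` and `.pop()` modify the DataFrame in place (and `.pop()` also returns the removed column), whereas `.drop()` returns a new DataFrame and can remove either rows or columns. A sketch with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

# drop returns a new object; df itself is unchanged
without_c = df.drop(columns=["c"])
without_row0 = df.drop(index=[0])

# pop removes column b from df in place and hands it back as a Series
popped = df.pop("b")

print(without_c)
print(without_row0)
print(df)
```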


**Merging and reshaping DataFrames**

**Merging:**

`pd.concat([one, two])`

In [23]:
one = pd.DataFrame({
    'id':[1, 2],
    'Name': ['Alex', 'Amy'],
})
two = pd.DataFrame({
    'id':[1, 2],
    'Name': ['Bob', 'Tom']
})

pd.concat([one, two])


   id  Name
0   1  Alex
1   2   Amy
0   1   Bob
1   2   Tom


`.reset_index(drop=True)` **renumbers the index after concat**

In [24]:
one = pd.DataFrame({
    'id':[1, 2],
    'Name': ['Alex', 'Amy'],
})
two = pd.DataFrame({
    'id':[1, 2],
    'Name': ['Bob', 'Tom']
})

pd.concat([one, two]).reset_index(drop=True)

   id  Name
0   1  Alex
1   2   Amy
2   1   Bob
3   2   Tom


`pd.merge(one, two, on=)`

In [25]:
one = pd.DataFrame({
    'id':[1, 2],
    'Name': ['Alex', 'Amy'],
})
two = pd.DataFrame({
    'id':[1, 2],
    'Score': [98, 60]
})

pd.merge(one, two, on='id')


   id  Name  Score
0   1  Alex     98
1   2   Amy     60
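
In the example above every `id` appears in both tables, so the join type does not matter. The sketch below uses hypothetical data in which the keys only partly overlap, to show how the `how` parameter of `pd.merge` changes which rows survive.

```python
import pandas as pd

# Hypothetical data: ids 2 and 3 are shared, 1 and 4 are not
one = pd.DataFrame({"id": [1, 2, 3], "Name": ["Alex", "Amy", "Ann"]})
two = pd.DataFrame({"id": [2, 3, 4], "Score": [60, 75, 88]})

inner = pd.merge(one, two, on="id")                # default how="inner": shared keys only
left = pd.merge(one, two, on="id", how="left")     # every row of `one`, NaN where no match
outer = pd.merge(one, two, on="id", how="outer")   # every key from either table

print(inner)
print(left)
print(outer)
```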


`one.join(two)`

In [26]:
one = pd.DataFrame({
    'Name': ['Alex', 'Amy'],
})
two = pd.DataFrame({
    'Score': [98, 60]
})

one.join(two)


   Name  Score
0  Alex     98
1   Amy     60
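
Unlike `pd.merge(..., on=)`, `.join()` matches rows by index label. The sketch below assumes hypothetical indexes given in a different order in each table, to show that the alignment is by label and not by position, and that the result matches an index-based `pd.merge`.

```python
import pandas as pd

# Hypothetical indexes: the same labels, listed in a different order
one = pd.DataFrame({"Name": ["Alex", "Amy"]}, index=[1, 2])
two = pd.DataFrame({"Score": [98, 60]}, index=[2, 1])

joined = one.join(two)  # left join on the index labels

# The same operation written with pd.merge
same = pd.merge(one, two, left_index=True, right_index=True, how="left")

print(joined)
```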


`.groupby('A').sum()`: splits the rows into groups by one or more fields

In [29]:
df = pd.DataFrame({
  'A' : ['foo', 'bar', 'foo', 'bar'],
  'B' : ['one', 'one', 'two', 'three'],
  'C' : [1,2,3,4],
  'D' : [10, 20, 30, 40]
})

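# numeric_only=True leaves out the string column B; pandas 2.0+ no longer drops it automatically
df.groupby('A').sum(numeric_only=True)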



     C   D
A
bar  6  60
foo  4  40


In [30]:
# Select the numeric columns and name the aggregation as a string
df.groupby('A')[['C', 'D']].agg('sum')

     C   D
A
bar  6  60
foo  4  40


In [31]:
df.groupby(['A','B']).sum()

           C   D
A   B
bar one    2  20
    three  4  40
foo one    1  10
    two    3  30
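
`groupby` is not limited to one aggregation for every column. As a sketch, the example below uses named aggregation to compute a different statistic per column, and then iterates over the groups to show the split step explicitly. The output column names `c_total` and `d_mean` and the variable `manual` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "three"],
    "C": [1, 2, 3, 4],
    "D": [10, 20, 30, 40],
})

# Named aggregation: new_column=(source_column, function)
summary = df.groupby("A").agg(c_total=("C", "sum"), d_mean=("D", "mean"))

# The same split-apply-combine idea written out by hand for column C
manual = {key: int(group["C"].sum()) for key, group in df.groupby("A")}

print(summary)
print(manual)
```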
