# Machine Learning 100-Day Marathon Midterm - Enron Fraud Dataset
- https://www.kaggle.com/c/3rd-ml100marathon-midterm/overview

### Description
***
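Enron was an energy company and, before its bankruptcy in 2001, one of the world's largest electricity, natural gas, and telecommunications companies. Despite holding tens of billions of US dollars in assets, it collapsed into bankruptcy within a few weeks in late 2001, which exposed the scandal that its financial statements had been falsified for years. In this dataset you play the detective: using intelligence from internal emails among senior managers, together with financial features such as salary and stock, train a machine learning model that helps you identify the likely fraudsters. We have already identified several persons of interest (Person-of-Interest, poi) and several innocent employees; use this training data to build your own fraud-detection model.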

## Feature Description
1. Financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all in US dollars). For more detailed descriptions, see the last page of enron61702insiderpay.pdf (available on the Data page).
2. Email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (all are counts except email_address).
3. The suspect label, i.e. the usual **y**. POI label: ['poi'] (boolean, represented as integer)
4. We also encourage you to apply feature engineering to the existing features, such as rescaling and transforms, and to use your imagination to create new features that help identify suspects and improve the model's predictive power.
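As one illustration of item 4, here is a minimal sketch of ratio features built from the email counts. The input column names come from the feature list above; the new names `frac_to_poi` and `frac_from_poi`, the helper `add_poi_ratios`, and the toy values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy rows using the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "from_messages": [100.0, 40.0, np.nan],
    "from_this_person_to_poi": [25.0, 0.0, np.nan],
    "to_messages": [200.0, 80.0, np.nan],
    "from_poi_to_this_person": [50.0, 4.0, np.nan],
})

def add_poi_ratios(df):
    """Return a copy with the share of a person's mail exchanged with POIs."""
    out = df.copy()
    # Hypothetical feature names, not part of the original dataset.
    out["frac_to_poi"] = out["from_this_person_to_poi"] / out["from_messages"]
    out["frac_from_poi"] = out["from_poi_to_this_person"] / out["to_messages"]
    ratio_cols = ["frac_to_poi", "frac_from_poi"]
    # A missing or zero total gives NaN or inf; both are replaced by 0 here.
    out[ratio_cols] = out[ratio_cols].replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return out

features = add_poi_ratios(toy)
print(features[["frac_to_poi", "frac_from_poi"]])
```

A ratio removes the effect of how much mail a person sends overall, which a raw count cannot do. Whether it actually helps should be checked with cross-validation.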

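## Key Questions
If this is your first machine learning project, you may feel a little lost at the start and unsure where to begin. We provide a series of questions that every machine learning project has to answer; try starting from them!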



### Hint
***
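1. Summarize the goal of this project and how machine learning helps achieve it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you received it, and how did you handle them? [Keywords: "EDA", "Outlier"]
2. Which features did you end up using for POI detection, and what method did you use to select them? Did you do any data scaling? Why or why not? You should try to engineer some new features of your own; explain why you tried each one and the rationale behind it. (You do not have to use it in the final analysis, only design and test it.) If you used an algorithm such as a decision tree in the feature selection step, also report the feature importance values. [Keywords: "create feature", "feature selection", "normalization"]
3. Which algorithm did you end up using? Which others did you try? How did model performance differ between algorithms? [Keywords: "modeling"]
4. What does it mean to tune an algorithm's hyper-parameters, and what can happen if you do it poorly? How did you tune the parameters of your chosen algorithm, and which parameters did you tune? (Some algorithms have no parameters that need tuning. If that is the case for the one you chose, say so and briefly explain how you would tune a model you did not finally choose, or a different model, for example a decision tree classifier.) [Keywords: "Hyper-parameter tuning"]
5. What is validation, and what is a classic mistake you can make if you do it wrong? How did you validate your analysis? [Keywords: "Validation"]
6. Give at least 2 evaluation metrics and your average performance for each. Explain why you chose those metrics. [Keywords: "Evaluation metrics"]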
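Hints 4 and 5 can be sketched together: a small grid search over decision tree hyper-parameters, validated with stratified folds. This is only a sketch under stated assumptions. The data is synthetic (from `make_classification`), and the grid values are arbitrary choices, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set: small and imbalanced (assumption).
X, y = make_classification(n_samples=120, n_features=10, n_informative=4,
                           weights=[0.85, 0.15], random_state=0)

# Stratified folds keep the share of positives similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Arbitrary illustrative grid; each combination is scored by mean AUC over the folds.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 3, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                      scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

A classic validation mistake is choosing hyper-parameters on the same rows used to report the final score. Keeping a separate hold-out set, or nesting the search inside an outer cross-validation loop, avoids that optimistic bias.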

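## What You Will Learn from This Project
- How to handle real-world data with various defects
- Using val/test data to understand how model training is going
- Using appropriate evaluation functions to understand prediction results
- Applying appropriate feature engineering to improve model accuracy
- Tuning model hyper-parameters to improve accuracy
- Writing clear documentation so others can understand your results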

### Evaluation
***
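The evaluation metric is AUC. Following the format of sample_submission.csv on the Data page, upload your predictions for the test data.

Note that the file may contain only two columns: the employee name (name) and the predicted probability of being a suspect (poi). After upload, the system automatically computes the AUC score and ranks it. For this midterm the Private score is the final result; beat at least the baseline to pass the exam.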
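To make the metric and the file layout concrete, here is a minimal sketch with made-up labels, probabilities, and placeholder names. It uses `roc_auc_score` from scikit-learn; the competition's own scoring code is not shown in the source, so treat this only as a local approximation.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted probabilities for five people.
y_true = [1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.7, 0.1]

# AUC is the fraction of (positive, negative) pairs where the positive
# receives the higher score; it depends only on the ranking.
auc = roc_auc_score(y_true, y_prob)
print(auc)

# Submission layout: exactly two columns, name and poi (placeholder names).
names = ["PERSON A", "PERSON B", "PERSON C", "PERSON D", "PERSON E"]
submission = pd.DataFrame({"name": names, "poi": y_prob})
print(submission.columns.tolist())
```

Because AUC depends only on the ordering of the scores, submitting probabilities from `predict_proba` is more informative than submitting hard 0/1 labels.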

***
##### REFERENCES: kernel71ad45d024 (https://www.kaggle.com/cntseng2000/kernel71ad45d024)

In [1]:
import os
for dirname, _, filenames in os.walk('../hw/input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

../hw/input/enron61702insiderpay.pdf
../hw/input/sample_submission.csv
../hw/input/test_features.csv
../hw/input/train_data.csv


In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

import warnings
warnings.filterwarnings('ignore')

In [3]:
data_path = '../hw/input/'
df_train = pd.read_csv(data_path + 'train_data.csv')

df_test = pd.read_csv(data_path + 'test_features.csv')

df_train.head()

,name,bonus,deferral_payments,deferred_income,director_fees,email_address,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,...,long_term_incentive,other,poi,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
0,RICE KENNETH D,1750000.0,,-3504386.0,,ken.rice@enron.com,19794175.0,46950.0,18.0,42.0,...,1617011.0,174839.0,True,2748364.0,,420636.0,864.0,905.0,505050.0,22542539.0
1,SKILLING JEFFREY K,5600000.0,,,,jeff.skilling@enron.com,19250000.0,29336.0,108.0,88.0,...,1920000.0,22122.0,True,6843672.0,,1111258.0,2042.0,3627.0,8682716.0,26093672.0
2,SHELBY REX,200000.0,,-4167.0,,rex.shelby@enron.com,1624396.0,22884.0,39.0,13.0,...,,1573324.0,True,869220.0,,211844.0,91.0,225.0,2003885.0,2493616.0
3,KOPPER MICHAEL J,800000.0,,,,michael.kopper@enron.com,,118134.0,,,...,602671.0,907502.0,True,985032.0,,224305.0,,,2652612.0,985032.0
4,CALGER CHRISTOPHER F,1250000.0,,-262500.0,,christopher.calger@enron.com,,35818.0,144.0,199.0,...,375304.0,486.0,True,126027.0,,240189.0,2188.0,2598.0,1639297.0,126027.0


In [4]:
df_train.describe()

,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,long_term_incentive,other,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value
count,61.0,28.0,34.0,13.0,81.0,73.0,65.0,65.0,65.0,2.0,49.0,69.0,82.0,10.0,73.0,65.0,65.0,96.0,98.0
mean,1147436.0,634437.4,-462566.4,89397.846154,2985081.0,51040.547945,711.323077,64.8,40.092308,40962500.0,792617.1,447177.4,1294855.0,-221885.7,273902.5,1111.369231,2156.061538,2590977.0,3527136.0
std,1505189.0,860364.6,809539.2,41143.391399,6004174.0,47596.682104,2074.497628,91.863214,88.901407,57364040.0,950464.5,1341564.0,2498335.0,205191.374121,171664.7,1165.852016,2811.676718,10566450.0,7182997.0
min,70000.0,-102500.0,-3504386.0,3285.0,3285.0,148.0,12.0,0.0,0.0,400000.0,71023.0,2.0,44093.0,-560222.0,477.0,2.0,57.0,148.0,-44093.0
25%,450000.0,76567.5,-552703.2,101250.0,400478.0,18834.0,19.0,10.0,0.0,20681250.0,275000.0,972.0,268922.0,-389621.75,206121.0,178.0,517.0,302402.5,421151.8
50%,750000.0,195190.0,-117534.0,108579.0,850010.0,41953.0,45.0,28.0,7.0,40962500.0,422158.0,52382.0,462822.5,-139856.5,251654.0,599.0,1088.0,1106740.0,997971.0
75%,1000000.0,834205.2,-27083.25,112492.0,2165172.0,59175.0,215.0,88.0,27.0,61243750.0,831809.0,362096.0,966490.5,-77953.25,288589.0,1902.0,2649.0,1985668.0,2493616.0
max,8000000.0,2964506.0,-1042.0,125034.0,34348380.0,228763.0,14368.0,528.0,411.0,81525000.0,5145434.0,10359730.0,14761690.0,44093.0,1111258.0,4527.0,15149.0,103559800.0,49110080.0
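The summary above shows heavy right tails: total_payments, for example, has a maximum of 103,559,800 against a 75th percentile of 1,985,668. One common way to flag such values is the interquartile-range rule. This is a sketch on made-up numbers; the helper `iqr_outliers` and the multiplier `k=1.5` are illustrative assumptions, not something this notebook uses.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up payments with one extreme value.
payments = pd.Series([300_000, 1_100_000, 2_000_000, 900_000, 100_000_000],
                     name="total_payments")
mask = iqr_outliers(payments)

# Show the flagged rows so they can be inspected by hand.
print(payments[mask])
```

A flagged row is not automatically an error. In a fraud dataset extreme payments may belong to exactly the people of interest, so it is worth looking at each one before removing anything.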


In [5]:
df_train['poi'] = df_train['poi'].astype(float)
train_Y = df_train['poi']
ids = df_test['name']

df_train = df_train.drop(['name', 'email_address', 'poi'] , axis=1)
df_test = df_test.drop(['name', 'email_address'] , axis=1)

df = pd.concat([df_train,df_test])
df = df.fillna(0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 146 entries, 0 to 32
Data columns (total 19 columns):
bonus                        146 non-null float64
deferral_payments            146 non-null float64
deferred_income              146 non-null float64
director_fees                146 non-null float64
exercised_stock_options      146 non-null float64
expenses                     146 non-null float64
from_messages                146 non-null float64
from_poi_to_this_person      146 non-null float64
from_this_person_to_poi      146 non-null float64
loan_advances                146 non-null float64
long_term_incentive          146 non-null float64
other                        146 non-null float64
restricted_stock             146 non-null float64
restricted_stock_deferred    146 non-null float64
salary                       146 non-null float64
shared_receipt_with_poi      146 non-null float64
to_messages                  146 non-null float64
total_payments               146 non-null floa

In [6]:
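# Discard the numeric values of the financial and email data:
# 1 if a value is present, 0 if it is missing (NaN, already filled with 0 above)
for col in df:
    df[col] = df[col].apply(lambda x: 1 if abs(x) > 0 else 0)

# Earlier tests showed that combining 'bonus' and 'deferred_income' into a new feature sometimes gives better results
df['com2'] = df['bonus'] + df['deferred_income']
# Note: drop() returns a new DataFrame and the result is not assigned,
# so 'bonus' and 'deferred_income' stay in df, as the output below shows.
df.drop(['bonus', 'deferred_income'], axis=1)

df.describe()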

,bonus,deferral_payments,deferred_income,director_fees,exercised_stock_options,expenses,from_messages,from_poi_to_this_person,from_this_person_to_poi,loan_advances,long_term_incentive,other,restricted_stock,restricted_stock_deferred,salary,shared_receipt_with_poi,to_messages,total_payments,total_stock_value,com2
count,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0,146.0
mean,0.561644,0.267123,0.335616,0.116438,0.69863,0.650685,0.589041,0.506849,0.452055,0.027397,0.452055,0.636986,0.753425,0.123288,0.650685,0.589041,0.589041,0.856164,0.863014,0.89726
std,0.497894,0.44398,0.473831,0.321854,0.460433,0.478395,0.493701,0.501674,0.499409,0.1638,0.499409,0.482524,0.432501,0.329899,0.478395,0.493701,0.493701,0.352131,0.345016,0.749524
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
50%,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
75%,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
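The per-column `apply` in cell 6 can also be written as a single vectorized comparison. This is a sketch on a toy frame with made-up rows, assuming missing values have already been filled with 0 as in cell 5.

```python
import pandas as pd

# Toy frame using two of the dataset's column names.
toy = pd.DataFrame({"bonus": [1750000.0, 0.0, 200000.0],
                    "deferred_income": [-3504386.0, 0.0, 0.0]})

# Any non-zero value (positive or negative) becomes 1, zero becomes 0.
flags = (toy != 0).astype(int)

# Sum of the two indicators, as in the com2 feature: 0, 1, or 2.
flags["com2"] = flags["bonus"] + flags["deferred_income"]
print(flags)
```

Since com2 is the sum of two 0/1 indicators, it counts how many of the two fields are present. That matches the maximum of 2.0 in the summary above.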


In [7]:
# After trying several models, gradient boosting (gdbt) and a decision tree (clf) separated the classes best

gdbt = GradientBoostingClassifier(tol=100, subsample=0.75, n_estimators=250, max_features=9,
                                  max_depth=6, learning_rate=0.03)
         
train_num = train_Y.shape[0]
train_X = df[:train_num]
test_X = df[train_num:]
            
# Hold-out split; not used below, where the models are fit on all of train_X
X_train, X_test, y_train, y_test = train_test_split(train_X, train_Y, test_size=0.25, random_state=4)
          
gdbt.fit(train_X, train_Y)
gdbt_pred = gdbt.predict_proba(test_X)[:,1]
# cross_val_score uses the classifier's default scorer (accuracy), not the competition's AUC
gdbt_score = cross_val_score(gdbt, train_X, train_Y, cv=5).mean()

sub = pd.DataFrame({'name': ids, 'poi': gdbt_pred})
sub.to_csv('mid_abs_com2_gdbt1027.csv', index=False)
            

# Build the decision tree model
clf = DecisionTreeClassifier()
clf.fit(train_X, train_Y)
clf_pred = clf.predict_proba(test_X)[:,1]
clf_score = cross_val_score(clf, train_X, train_Y, cv=5).mean()

sub = pd.DataFrame({'name': ids, 'poi': clf_pred})
sub.to_csv('mid_abs_com2_clf1027.csv', index=False)


print('gdbt:')
print(gdbt_score)

print('clf:')
print(clf_score)      

gdbt:
0.8766798418972332
clf:
0.850197628458498
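The two scores above come from `cross_val_score` without a `scoring` argument, which for a classifier means accuracy. Since the competition is ranked by AUC, it may be more informative to report both metrics side by side. Below is a sketch under stated assumptions: the data is synthetic, and the model settings are simplified stand-ins rather than the exact configuration of cell 7.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for train_X / train_Y (assumption).
X, y = make_classification(n_samples=120, n_features=10, n_informative=4,
                           weights=[0.85, 0.15], random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

models = {
    "gdbt": GradientBoostingClassifier(n_estimators=50, random_state=1),
    "clf": DecisionTreeClassifier(max_depth=3, random_state=1),
}

scores = {}
for name, model in models.items():
    # Same folds for both metrics so the numbers are comparable.
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    scores[name] = {"accuracy": acc, "roc_auc": auc}
    print(name, round(acc, 3), round(auc, 3))
```

On imbalanced labels, accuracy can look high even for a model that rarely predicts the positive class. AUC measures how well the predicted probabilities rank positives above negatives, which is what the leaderboard rewards.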
