## Fraud detection

This dataset, available on <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud">Kaggle</a>, records two days of credit card transactions.
The goal is to detect which transactions are fraudulent. The available features are 28 principal components obtained by PCA (`V1` to `V28`) and the transaction amount.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("../Data/creditcard.csv")

- Inspect the data frame. What proportion of the rows belongs to each class? Is the dataset imbalanced?

In [3]:
df.Class.value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64
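With roughly 0.17% positives, plain accuracy says very little here: a model that never flags fraud would already score about 99.8%. A minimal sketch on synthetic labels illustrates this; the sample size and variable names are illustrative assumptions, chosen only to mimic the proportions shown above.

```python
import numpy as np

# Synthetic labels mimicking the class balance above (illustrative, not the real data).
n_total, n_fraud = 100_000, 173
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1

# A trivial "model" that predicts the majority class for every transaction.
y_pred = np.zeros(n_total, dtype=int)

accuracy = (y_pred == y_true).mean()
recall = (y_pred[y_true == 1] == 1).mean()  # fraction of frauds actually caught

print(f"accuracy={accuracy:.4f}, recall={recall:.4f}")
```

This is why the rest of the notebook evaluates with AUC rather than accuracy.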

In [4]:
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


- Split the data into train and test sets

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X = df.drop("Class", axis = 1)
y = df.Class

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [8]:
y.value_counts(normalize=True)

0    0.998273
1    0.001727
Name: Class, dtype: float64
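The cell above reports the proportions of the full `y`. To confirm that `stratify=y` preserved them in each partition, one can compare `y_train` and `y_test` directly. The sketch below does this on synthetic labels; `random_state=0` is an added assumption for reproducibility, and the `_demo`, `_tr` and `_te` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 20 positives (0.2%).
y_demo = pd.Series(
    np.r_[np.zeros(9_980, dtype=int), np.ones(20, dtype=int)], name="Class"
)
X_demo = pd.DataFrame({"Amount": np.arange(len(y_demo), dtype=float)})

# Stratified split with the default 75% / 25% partition sizes.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Both partitions should show nearly the same class proportions.
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```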

### Train a LightGBM model with RandomizedSearchCV

In [9]:
from lightgbm import LGBMClassifier

In [10]:
import scipy.stats as st

In [11]:
one_to_left = st.beta(10, 1) # This distribution yields values between 0 and 1, mostly close to 1

In [12]:
params_lgbm = {
    "n_estimators": st.randint(20, 40),      # Number of boosted trees to fit (integers 20 to 39).
    "max_depth": st.randint(3, 12),          # Maximum tree depth for base learners (integers 3 to 11).
    "learning_rate": st.uniform(0.05, 0.4),  # Boosting learning rate (XGBoost's "eta"); uniform(loc, scale) spans [0.05, 0.45].
    "colsample_bytree": one_to_left,         # Subsample ratio of columns when constructing each tree.
    "subsample": one_to_left,                # Subsample ratio of the training instances (LightGBM only applies it when subsample_freq > 0).
    "gamma": st.uniform(0, 10),              # Minimum loss reduction required to split a leaf further. This is XGBoost's name; LightGBM documents it as min_split_gain, so LightGBM may not apply it.
    "reg_alpha": st.uniform(0.05, 10),       # L1 regularization term on weights.
    "min_child_weight": st.uniform(1, 20),   # Minimum sum of instance weight (hessian) needed in a child.
}
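The `scipy.stats` distributions used in the search space follow conventions that are easy to misread: `st.uniform(loc, scale)` covers `[loc, loc + scale]`, and `st.randint(low, high)` excludes `high`. A small sketch that samples from three of them to check the ranges; the seed and the sample size are arbitrary assumptions.

```python
import scipy.stats as st

seed = 0  # arbitrary, only for repeatable draws

learning_rate = st.uniform(0.05, 0.4)  # loc=0.05, scale=0.4 -> values in [0.05, 0.45]
n_estimators = st.randint(20, 40)      # integers from 20 to 39 (upper bound excluded)
one_to_left = st.beta(10, 1)           # mass concentrated near 1

lr_draws = learning_rate.rvs(size=1000, random_state=seed)
ne_draws = n_estimators.rvs(size=1000, random_state=seed)

print(lr_draws.min(), lr_draws.max())
print(ne_draws.min(), ne_draws.max())
print(one_to_left.mean())  # analytic mean a / (a + b) = 10 / 11
```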

In [13]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [14]:
model_lgbm = LGBMClassifier() 

In [15]:
lgbm = RandomizedSearchCV(model_lgbm, params_lgbm, n_iter = 25, verbose= True)

In [16]:
lgbm.fit(X_train,y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 25 candidates, totalling 75 fits


[Parallel(n_jobs=1)]: Done  75 out of  75 | elapsed:  1.6min finished


RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
                   estimator=LGBMClassifier(boosting_type='gbdt',
                                            class_weight=None,
                                            colsample_bytree=1.0,
                                            importance_type='split',
                                            learning_rate=0.1, max_depth=-1,
                                            min_child_samples=20,
                                            min_child_weight=0.001,
                                            min_split_gain=0.0,
                                            n_estimators=100, n_jobs=-1,
                                            num_leaves=31, objective=None,
                                            random_state=None, reg_alpha=0....
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001B0C21E5C88>,
                                       

In [17]:
lgbm.best_params_

{'colsample_bytree': 0.6936062469406592,
 'gamma': 5.539430929543178,
 'learning_rate': 0.3886917197228546,
 'max_depth': 8,
 'min_child_weight': 10.753963486402808,
 'n_estimators': 30,
 'reg_alpha': 0.7871729081465484,
 'subsample': 0.6759011175302062}
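The search above passes no `scoring` argument, so candidates are ranked by the estimator's default score, which for a classifier is accuracy. Because the final evaluation uses AUC, it may be worth ranking candidates by AUC as well. The sketch below shows one way to do that on synthetic data; `GradientBoostingClassifier` is a stand-in chosen so the example depends only on scikit-learn (it is not the notebook's model), and the data, the reduced search space and the `_demo` names are illustrative assumptions.

```python
import scipy.stats as st
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Small imbalanced toy problem (about 95% negatives), not the credit card data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)

search_space = {
    "n_estimators": st.randint(20, 40),
    "max_depth": st.randint(2, 5),
    "learning_rate": st.uniform(0.05, 0.4),
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in estimator
    search_space,
    n_iter=5,
    scoring="roc_auc",  # rank candidates by AUC instead of accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated AUC of the best candidate
```

Stratified folds keep the rare class present in every validation split, which matters when positives are this scarce.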

Compute the final AUC on the test set

In [18]:
from sklearn.metrics import roc_auc_score

In [19]:
y_predicted = lgbm.predict_proba(X_test)

In [20]:
roc_auc_score(y_test, y_predicted[:,1])

0.9252731158974951
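`roc_auc_score` expects a score for the positive class, which is why the call above takes column 1 of `predict_proba`: the columns follow the order of the estimator's `classes_` attribute. The sketch below makes that lookup explicit and also computes average precision, a metric often reported alongside ROC AUC when positives are rare. `LogisticRegression` and the synthetic data are stand-ins, and all names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data (about 97% negatives), not the credit card data.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.97], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)

# Locate the positive-class column instead of hard-coding index 1.
pos_col = list(clf.classes_).index(1)
scores = proba[:, pos_col]

roc_auc = roc_auc_score(y_te, scores)
avg_prec = average_precision_score(y_te, scores)
print(f"ROC AUC={roc_auc:.3f}, average precision={avg_prec:.3f}")
```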

### Bonus: train a CatBoost model

In [21]:
from catboost import CatBoostClassifier

In [22]:
model_cat = CatBoostClassifier()

In [23]:
model_cat.fit(X_train, y_train)

Learning rate set to 0.076455
0:	learn: 0.4297808	total: 183ms	remaining: 3m 3s
1:	learn: 0.2785064	total: 296ms	remaining: 2m 27s
2:	learn: 0.1723430	total: 411ms	remaining: 2m 16s
3:	learn: 0.1077993	total: 522ms	remaining: 2m 10s
4:	learn: 0.0707199	total: 718ms	remaining: 2m 22s
5:	learn: 0.0472547	total: 862ms	remaining: 2m 22s
6:	learn: 0.0323778	total: 1s	remaining: 2m 22s
7:	learn: 0.0230914	total: 1.15s	remaining: 2m 22s
8:	learn: 0.0170161	total: 1.27s	remaining: 2m 20s
9:	learn: 0.0129381	total: 1.4s	remaining: 2m 18s
10:	learn: 0.0102554	total: 1.55s	remaining: 2m 19s
11:	learn: 0.0081880	total: 1.76s	remaining: 2m 25s
12:	learn: 0.0068814	total: 1.96s	remaining: 2m 28s
13:	learn: 0.0058929	total: 2.13s	remaining: 2m 29s
14:	learn: 0.0051175	total: 2.25s	remaining: 2m 27s
15:	learn: 0.0046040	total: 2.36s	remaining: 2m 25s
16:	learn: 0.0042015	total: 2.5s	remaining: 2m 24s
17:	learn: 0.0038293	total: 2.69s	remaining: 2m 26s
18:	learn: 0.0035739	total: 2.89s	remaining: 2m 29

160:	learn: 0.0013713	total: 22.8s	remaining: 1m 58s
161:	learn: 0.0013664	total: 23s	remaining: 1m 58s
162:	learn: 0.0013632	total: 23.1s	remaining: 1m 58s
163:	learn: 0.0013570	total: 23.2s	remaining: 1m 58s
164:	learn: 0.0013529	total: 23.4s	remaining: 1m 58s
165:	learn: 0.0013464	total: 23.5s	remaining: 1m 57s
166:	learn: 0.0013379	total: 23.6s	remaining: 1m 57s
167:	learn: 0.0013318	total: 23.7s	remaining: 1m 57s
168:	learn: 0.0013304	total: 23.8s	remaining: 1m 57s
169:	learn: 0.0013242	total: 24s	remaining: 1m 57s
170:	learn: 0.0013181	total: 24.1s	remaining: 1m 56s
171:	learn: 0.0013122	total: 24.2s	remaining: 1m 56s
172:	learn: 0.0013084	total: 24.4s	remaining: 1m 56s
173:	learn: 0.0013036	total: 24.5s	remaining: 1m 56s
174:	learn: 0.0012988	total: 24.6s	remaining: 1m 55s
175:	learn: 0.0012888	total: 24.7s	remaining: 1m 55s
176:	learn: 0.0012841	total: 24.8s	remaining: 1m 55s
177:	learn: 0.0012810	total: 24.9s	remaining: 1m 55s
178:	learn: 0.0012769	total: 25.1s	remaining: 1m 5

316:	learn: 0.0007956	total: 43.5s	remaining: 1m 33s
317:	learn: 0.0007920	total: 43.6s	remaining: 1m 33s
318:	learn: 0.0007862	total: 43.7s	remaining: 1m 33s
319:	learn: 0.0007840	total: 43.8s	remaining: 1m 33s
320:	learn: 0.0007817	total: 43.9s	remaining: 1m 32s
321:	learn: 0.0007802	total: 44.1s	remaining: 1m 32s
322:	learn: 0.0007786	total: 44.2s	remaining: 1m 32s
323:	learn: 0.0007779	total: 44.3s	remaining: 1m 32s
324:	learn: 0.0007752	total: 44.4s	remaining: 1m 32s
325:	learn: 0.0007729	total: 44.5s	remaining: 1m 32s
326:	learn: 0.0007697	total: 44.6s	remaining: 1m 31s
327:	learn: 0.0007658	total: 44.9s	remaining: 1m 31s
328:	learn: 0.0007622	total: 45.1s	remaining: 1m 32s
329:	learn: 0.0007595	total: 45.3s	remaining: 1m 31s
330:	learn: 0.0007559	total: 45.4s	remaining: 1m 31s
331:	learn: 0.0007535	total: 45.6s	remaining: 1m 31s
332:	learn: 0.0007503	total: 45.7s	remaining: 1m 31s
333:	learn: 0.0007479	total: 45.8s	remaining: 1m 31s
334:	learn: 0.0007453	total: 45.9s	remaining: 

472:	learn: 0.0004657	total: 1m 2s	remaining: 1m 9s
473:	learn: 0.0004635	total: 1m 2s	remaining: 1m 9s
474:	learn: 0.0004612	total: 1m 2s	remaining: 1m 9s
475:	learn: 0.0004567	total: 1m 3s	remaining: 1m 9s
476:	learn: 0.0004555	total: 1m 3s	remaining: 1m 9s
477:	learn: 0.0004539	total: 1m 3s	remaining: 1m 9s
478:	learn: 0.0004519	total: 1m 3s	remaining: 1m 8s
479:	learn: 0.0004502	total: 1m 3s	remaining: 1m 8s
480:	learn: 0.0004484	total: 1m 3s	remaining: 1m 8s
481:	learn: 0.0004469	total: 1m 3s	remaining: 1m 8s
482:	learn: 0.0004447	total: 1m 3s	remaining: 1m 8s
483:	learn: 0.0004432	total: 1m 4s	remaining: 1m 8s
484:	learn: 0.0004407	total: 1m 4s	remaining: 1m 8s
485:	learn: 0.0004393	total: 1m 4s	remaining: 1m 7s
486:	learn: 0.0004384	total: 1m 4s	remaining: 1m 7s
487:	learn: 0.0004362	total: 1m 4s	remaining: 1m 7s
488:	learn: 0.0004349	total: 1m 4s	remaining: 1m 7s
489:	learn: 0.0004332	total: 1m 4s	remaining: 1m 7s
490:	learn: 0.0004324	total: 1m 4s	remaining: 1m 7s
491:	learn: 

630:	learn: 0.0002700	total: 1m 22s	remaining: 48s
631:	learn: 0.0002694	total: 1m 22s	remaining: 47.9s
632:	learn: 0.0002671	total: 1m 22s	remaining: 47.8s
633:	learn: 0.0002660	total: 1m 22s	remaining: 47.6s
634:	learn: 0.0002643	total: 1m 22s	remaining: 47.5s
635:	learn: 0.0002626	total: 1m 22s	remaining: 47.4s
636:	learn: 0.0002620	total: 1m 22s	remaining: 47.2s
637:	learn: 0.0002616	total: 1m 23s	remaining: 47.1s
638:	learn: 0.0002609	total: 1m 23s	remaining: 47s
639:	learn: 0.0002602	total: 1m 23s	remaining: 46.8s
640:	learn: 0.0002595	total: 1m 23s	remaining: 46.7s
641:	learn: 0.0002589	total: 1m 23s	remaining: 46.5s
642:	learn: 0.0002578	total: 1m 23s	remaining: 46.4s
643:	learn: 0.0002570	total: 1m 23s	remaining: 46.3s
644:	learn: 0.0002566	total: 1m 23s	remaining: 46.1s
645:	learn: 0.0002559	total: 1m 23s	remaining: 46s
646:	learn: 0.0002539	total: 1m 24s	remaining: 45.9s
647:	learn: 0.0002528	total: 1m 24s	remaining: 45.8s
648:	learn: 0.0002522	total: 1m 24s	remaining: 45.6s

786:	learn: 0.0001650	total: 1m 41s	remaining: 27.6s
787:	learn: 0.0001647	total: 1m 42s	remaining: 27.5s
788:	learn: 0.0001646	total: 1m 42s	remaining: 27.3s
789:	learn: 0.0001634	total: 1m 42s	remaining: 27.2s
790:	learn: 0.0001632	total: 1m 42s	remaining: 27.1s
791:	learn: 0.0001631	total: 1m 42s	remaining: 26.9s
792:	learn: 0.0001627	total: 1m 42s	remaining: 26.8s
793:	learn: 0.0001624	total: 1m 42s	remaining: 26.7s
794:	learn: 0.0001622	total: 1m 42s	remaining: 26.5s
795:	learn: 0.0001616	total: 1m 42s	remaining: 26.4s
796:	learn: 0.0001609	total: 1m 43s	remaining: 26.3s
797:	learn: 0.0001606	total: 1m 43s	remaining: 26.1s
798:	learn: 0.0001595	total: 1m 43s	remaining: 26s
799:	learn: 0.0001589	total: 1m 43s	remaining: 25.9s
800:	learn: 0.0001586	total: 1m 43s	remaining: 25.7s
801:	learn: 0.0001584	total: 1m 43s	remaining: 25.6s
802:	learn: 0.0001583	total: 1m 43s	remaining: 25.5s
803:	learn: 0.0001581	total: 1m 44s	remaining: 25.4s
804:	learn: 0.0001577	total: 1m 44s	remaining: 2

942:	learn: 0.0001118	total: 2m 1s	remaining: 7.36s
943:	learn: 0.0001117	total: 2m 1s	remaining: 7.23s
944:	learn: 0.0001115	total: 2m 2s	remaining: 7.1s
945:	learn: 0.0001113	total: 2m 2s	remaining: 6.97s
946:	learn: 0.0001112	total: 2m 2s	remaining: 6.85s
947:	learn: 0.0001109	total: 2m 2s	remaining: 6.72s
948:	learn: 0.0001108	total: 2m 2s	remaining: 6.59s
949:	learn: 0.0001104	total: 2m 2s	remaining: 6.46s
950:	learn: 0.0001100	total: 2m 2s	remaining: 6.33s
951:	learn: 0.0001098	total: 2m 2s	remaining: 6.2s
952:	learn: 0.0001094	total: 2m 3s	remaining: 6.07s
953:	learn: 0.0001093	total: 2m 3s	remaining: 5.94s
954:	learn: 0.0001092	total: 2m 3s	remaining: 5.81s
955:	learn: 0.0001090	total: 2m 3s	remaining: 5.68s
956:	learn: 0.0001086	total: 2m 3s	remaining: 5.55s
957:	learn: 0.0001085	total: 2m 3s	remaining: 5.42s
958:	learn: 0.0001084	total: 2m 3s	remaining: 5.29s
959:	learn: 0.0001081	total: 2m 3s	remaining: 5.17s
960:	learn: 0.0001078	total: 2m 4s	remaining: 5.04s
961:	learn: 0.

<catboost.core.CatBoostClassifier at 0x1b0c079d898>

In [24]:
y_predicted_cat = model_cat.predict_proba(X_test)

In [25]:
cat_auc = roc_auc_score(y_test,y_predicted_cat[:,1])
print("The AUC value is: ", cat_auc)

The AUC value is:  0.9770241905348189
