## Porto Seguro’s Safe Driver Prediction

<br><font color=blue>The aim of this competition is to predict the probability that a driver will initiate an auto insurance claim in the next year. More accurate predictions will allow Porto Seguro to further tailor its prices and, hopefully, make auto insurance coverage accessible to more drivers. </font>


**Steps**

1. [Read data set](#Read-data-set)
2. [Explore data set](#Explore-data-set)
3. [Correlation plot](#Correlation-plot)
4. [Missing values in data set](#Missing-value-is-data-set)
5. [Convert variables into category type](#Convert-variables-into-category-type)
6. [Univariate analysis](#Univariate-analysis)
7. [Descriptive Statistic Features](#Descrictive-Statistic-Features)
8. [Determine outliers in dataset](#Determine-outliers-in-dataset)
9. [One Hot Encoding](#One-Hot-Encoding)
10. [Split data set](#Split-data-set)
11. [Hyperparameter tuning](#Hyperparameter-tuning)
12. [Logistic Regression model](#Logistic-Regression-model)
13. [Model performance](#Model-performance)
14. [Receiver Operating Characteristic](#Reciever-Operating-Charactaristics)
15. [Predict for unseen data set](#Predict-for-unseen-data-set)

In [None]:
!pip install seaborn
!pip install lightgbm
!pip install missingno
!pip install xgboost

## Import library

In [None]:
#Import library
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import lightgbm as lgbm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score ,roc_curve,auc
from sklearn.model_selection import StratifiedKFold,GridSearchCV
import missingno as mssno
import xgboost as xgb
seed =45
%matplotlib inline

In [None]:
MODELS = {
    "LGBM":lgbm.LGBMClassifier(n_estimators=600, objective='binary' ),
     "LOGI":LogisticRegression()
}

In [None]:
MODEL=MODELS["LGBM"]
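# Whether to apply one-hot encoding
IS_OHE=True
# Whether to drop the ps_calc features, whose correlation coefficients are zero
IS_DROP_COEFF_ZERO=True
# Whether to add flag columns indicating that a value was missing
IS_ADD_NULL_VAL_INFO=True
# Whether to replace outliers with the 1st/99th percentile values
IS_OUTLIER_ROUNDING=True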

## Read data set

In [None]:
path = '../input/'
#path = 'dataset/'
train = pd.read_csv(path+'train.csv',na_values=-1)
test = pd.read_csv(path+'test.csv',na_values=-1)
print('Train rows and columns:',train.shape)
print('Test rows and columns:',test.shape)

Porto Seguro provided close to 600k and 900k observations in the train and test data sets, respectively. The 57 features are anonymized to protect company trade secrets, but a little information about them is given: in both the train and test data sets, features that belong to similar groupings are tagged accordingly in their names (e.g., ind, reg, car, cat, calc, bin). Values of -1 indicate that the feature was missing from the observation.

## Explore data set

In [None]:
train.head(3).T

## Target variable

In [None]:
plt.figure(figsize=(10,3))
# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=train['target'],palette='rainbow')
plt.xlabel('Target')

train['target'].value_counts()

The 'target' variable is imbalanced. The target column indicates whether or not a claim was filed for that policy holder, and only about 4% of the policyholders in the training data filed a claim within the year.
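
One practical consequence of this imbalance is that plain accuracy can look good without any real predictive power. The sketch below uses a hypothetical toy label vector with 4% positives (not the competition data) to compare accuracy with ROC AUC for a predictor that never predicts a claim.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical toy labels: 4 claims out of 100 policyholders (4% positives)
y_true = np.array([1] * 4 + [0] * 96)

# A "model" that always predicts class 0 (no claim)
always_zero = np.zeros_like(y_true)
acc = accuracy_score(y_true, always_zero)

# The same constant prediction expressed as a score for ROC AUC
constant_scores = np.zeros(len(y_true), dtype=float)
auc_value = roc_auc_score(y_true, constant_scores)

print("accuracy:", acc)
print("ROC AUC:", auc_value)
```

Accuracy rewards the constant prediction, while a ranking metric such as ROC AUC does not, which is why a ranking metric is usually the more informative choice for a target like this one.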

## Correlation plot
Correlation is a bivariate measure of the strength of association between two variables and of the direction of their relationship. The correlation coefficient ranges from -1 to +1: values near ±1 indicate a strong linear relationship, and values near 0 indicate a weak one.
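
As a minimal illustration of these two extremes, the sketch below computes the correlation matrix of a small hypothetical frame (the column names `x`, `up`, and `down` are made up for this example) with `DataFrame.corr()`, which uses the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from the competition files
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "up": [2.0, 4.0, 6.0, 8.0, 10.0],   # rises exactly with x
    "down": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls exactly as x rises
})

# Pairwise Pearson correlation between all columns
toy_cor = toy.corr()
print(toy_cor.round(2))
```

Here `x` and `up` are perfectly positively correlated, while `x` and `down` are perfectly negatively correlated.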

In [None]:
cor = train.drop('id',axis=1).corr()
plt.figure(figsize=(16,16))
sns.heatmap(cor)

> The correlation coefficients for the **ps_calc** features are approximately 0, so we will drop these features from our data set.

In [None]:
ps_cal = train.columns[train.columns.str.startswith('ps_calc')]
if IS_DROP_COEFF_ZERO:
    train = train.drop(ps_cal,axis =1)
    test = test.drop(ps_cal,axis=1)
train.shape

## Missing value is data set
>Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
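
The sketch below shows, on a tiny hypothetical CSV held in memory, how `na_values=-1` in `pd.read_csv` turns the -1 markers into NaN so that `isnull()` can count them. The three rows are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-row CSV that uses -1 as the missing-value marker
csv_text = "id,ps_ind_01,ps_car_03_cat\n1,2,-1\n2,-1,0\n3,5,1\n"

# na_values=-1 tells pandas to read every -1 as NaN
toy = pd.read_csv(io.StringIO(csv_text), na_values=-1)

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```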

In [None]:
k= pd.DataFrame()
k['train']= train.isnull().sum()
k['test'] = test.isnull().sum()
fig,ax = plt.subplots(figsize=(16,5))
k.plot(kind='bar',ax=ax)

In [None]:
def count_rows_with_missing_values(df):
    """
    Count the rows of a DataFrame that contain at least one missing value.
    :param df: pandas.DataFrame, input DataFrame
    :return: int, number of rows containing a missing value
    """
    return df.isnull().any(axis=1).sum()
def count_rows_with_nulls_in_target1(df):
    # Count the rows where target is 1 and at least one column is missing
    target1_rows = df[df['target'] == 1]
    num_rows_target1_with_nulls = target1_rows.isnull().any(axis=1).sum()
    return num_rows_target1_with_nulls

def count_rows_without_nulls_in_target0(df):
    # Count the rows where target is 0 and no column is missing
    target0_rows = df[df['target'] == 0]
    num_rows_target0_without_nulls = target0_rows.notnull().all(axis=1).sum()
    return num_rows_target0_without_nulls

num_rows_with_missing_values = count_rows_with_missing_values(train)
print("Total number of rows:",train.shape[0])
print('Rows containing missing values:', num_rows_with_missing_values)
num_rows_target1 = (train['target'] == 1).sum()
print('Number of claims filed:', num_rows_target1)

# Use count_rows_with_nulls_in_target1 to count rows with target 1 and at least one missing value
num_rows_target1_with_nulls = count_rows_with_nulls_in_target1(train)
print('Rows with target 1 and at least one missing value:', num_rows_target1_with_nulls)


# Use count_rows_without_nulls_in_target0 to count rows with target 0 and no missing values
num_rows_target0_without_nulls = count_rows_without_nulls_in_target0(train)
print('Rows with target 0 and no missing values:', num_rows_target0_without_nulls)

## Add flags indicating missing values

In [None]:
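def add_null_val_flag(df=None,cols_with_missing=None):
    # Select the columns that contain missing values
    # (pass the list from train when processing test so both get the same flags)
    if cols_with_missing is None:
        cols_with_missing = df.columns[df.isnull().any()].tolist()
    print(cols_with_missing)

    # Create a 0/1 flag column marking where each value was missing
    for col in cols_with_missing:
        df[col + '_missing'] = df[col].isnull().astype(int)

    return df,cols_with_missing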

In [None]:
if IS_ADD_NULL_VAL_INFO:
    train,train_cols_with_missing =add_null_val_flag(train,None)
    test,_=add_null_val_flag(test,train_cols_with_missing)

In [None]:
train.head(2)

Missing values in the train and test data sets occur in the same columns and in the same proportions.

In [None]:
mssno.bar(train,color='y',figsize=(16,4),fontsize=12)

In [None]:
mssno.bar(test,color='b',figsize=(16,4),fontsize=12)

In [None]:
mssno.matrix(train)

### Replace missing value with mode

In [None]:
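def missing_value(df):
    col = df.columns
    # Impute missing values with the mode (most frequent value) of each column
    for i in col:
        if df[i].isnull().sum()>0:
            # Assign back instead of using inplace=True on a column selection,
            # which newer pandas versions do not apply to the original frame
            df[i] = df[i].fillna(df[i].mode()[0])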

In [None]:
missing_value(train)
missing_value(test)

## Convert variables into category type

In [None]:
def basic_details(df):
    b = pd.DataFrame()
    b['Missing value'] = df.isnull().sum()
    b['N unique value'] = df.nunique()
    b['dtype'] = df.dtypes
    return b
basic_details(train)

>"ps_car_11_cat" has the largest number of unique values in the data set: 104.

## Convert to categorical variables

In [None]:
def category_type(df):
    col = df.columns
    for i in col:
        if df[i].nunique()<=104:
            #print(df[i].nunique())
            # print(df[i].astype("category"))
            df[i] = df[i].astype('category')
category_type(train)
category_type(test)

## Descrictive Statistic Features

In [None]:
def descrictive_stat_feat(df):
    df = pd.DataFrame(df)
    # Columns are selected from the global train frame so that train and test
    # receive the same new features; id is excluded
    dcol= [c for c in train.columns if train[c].nunique()>=10 and c != 'id']
    # Cast to float so that category-typed columns can be summarized as well
    values = df[dcol].astype(np.float32)
    d_median = values.median(axis=0)
    d_mean = values.mean(axis=0)
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    print(d_mean)
    print(dcol)
    # Add 0/1 flags for columns with at least 10 distinct values:
    # below Q1, above Q3, above the mean, above the median
    for c in dcol:
        df[c+str('_q1')] = (values[c].values < q1[c]).astype(np.int8)
        df[c+str('_q3')] = (values[c].values > q3[c]).astype(np.int8)
        df[c+str('_mean_range')] = (values[c].values > d_mean[c]).astype(np.int8)
        df[c+str('_median_range')] = (values[c].values > d_median[c]).astype(np.int8)
    return df

In [None]:
bin_col = [col for col in train.columns if 'bin' in col]
print(bin_col)

In [None]:
cat_col = [col for col in train.columns if '_cat' in col]
print(cat_col)

In [None]:
tot_cat_col = list(train.select_dtypes(include=['category']).columns)

other_cat_col = [c for c in tot_cat_col if c not in cat_col+ bin_col]
other_cat_col

In [None]:
num_col = [c for c in train.columns if c not in tot_cat_col]
num_col.remove('id')
num_col

## Determine outliers in dataset
Extreme observations that behave completely differently from the rest of the data points are called outliers. Here, a value of a numeric feature is treated as an outlier when it lies more than 1.5 × IQR below the first quartile or above the third quartile; such values are replaced by the 1st or 99th percentile of that feature, respectively.
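
The rule can be sketched on a single hypothetical column, here the integers 1 to 99 plus one extreme value of 1000. This toy version uses `Series.mask` rather than the notebook's own function, so it is an illustration of the idea under that assumption, not the exact implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: 1..99 plus a single extreme value
values = pd.Series(list(range(1, 100)) + [1000], dtype=float)

# Quartiles define the IQR fence; the 1st/99th percentiles are the replacements
q1, q3 = np.percentile(values, [25, 75])
p01, p99 = np.percentile(values, [1, 99])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Replace values outside the fence with the matching percentile
capped = values.mask(values < lower_bound, p01).mask(values > upper_bound, p99)

print("fence:", lower_bound, upper_bound)
print("max before:", values.max(), "max after:", capped.max())
```

Only the extreme value falls outside the fence, so the other 99 values are left unchanged.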

In [None]:
def outlier(df,columns):
    for i in columns:
        quartile_1,quartile_3 = np.percentile(df[i],[25,75])
        quartile_f,quartile_l = np.percentile(df[i],[1,99])
        IQR = quartile_3-quartile_1
        lower_bound = quartile_1 - (1.5*IQR)
        upper_bound = quartile_3 + (1.5*IQR)
        print(i,lower_bound,upper_bound,quartile_f,quartile_l)

        # Index the frame directly; chained df[i].loc[...] assignment may
        # modify a temporary copy instead of df
        df.loc[df[i] < lower_bound, i] = quartile_f
        df.loc[df[i] > upper_bound, i] = quartile_l

if IS_OUTLIER_ROUNDING: 
    outlier(train,num_col)
    outlier(test,num_col) 

In [None]:
train.head(3).T

## One Hot Encoding
One-hot encoding represents a categorical variable as binary vectors, which makes the categorical data more expressive for a model. The categorical values are first mapped to integer values (label encoding). Each integer value is then represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1.
The dummy variable trap is a scenario in which the independent variables are multicollinear: two or more variables are highly correlated, so that, in simple terms, one variable can be predicted from the others. A full set of one-hot columns always sums to 1, which is why the encoding here drops the first level of each feature (`drop_first=True`).
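
The effect of dropping one level can be sketched with `pd.get_dummies` on a hypothetical `color` column (the column and its values are invented for this example).

```python
import pandas as pd

# Hypothetical categorical column with three levels
toy = pd.DataFrame({"color": pd.Categorical(["red", "green", "blue", "green"])})

# Full encoding: one column per level, and each row sums to 1
full = pd.get_dummies(toy, columns=["color"], prefix="ohe_color")

# drop_first=True removes the first level to avoid the dummy variable trap
reduced = pd.get_dummies(toy, columns=["color"], prefix="ohe_color", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced encoding, a row with zeros in every remaining column stands for the dropped level, so no information is lost.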

In [None]:
def OHE(df1,df2,column):
    cat_col = column
    #cat_col = df.select_dtypes(include =['category']).columns
    len_df1 = df1.shape[0]
    
    df = pd.concat([df1,df2],ignore_index=True)
    c2,c3 = [],{}
    
    print('Categorical feature',len(column))
    for c in cat_col:
        if df[c].nunique()>2 :
            c2.append(c)
            c3[c] = 'ohe_'+c
    
    df = pd.get_dummies(df, prefix=c3, columns=c2,drop_first=True)

    df1 = df.loc[:len_df1-1]
    df2 = df.loc[len_df1:]
    print('Train',df1.shape)
    print('Test',df2.shape)
    return df1,df2

In [None]:
if IS_OHE:
    train1,test1 = OHE(train,test,tot_cat_col)
else:
    train1, test1 = train,test

In [None]:
train1.head(3)

In [None]:
def gini(y_true, y_prob):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_prob)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n - 1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini

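from sklearn.preprocessing import StandardScaler

def RescaleData(train, test):
    scaler = StandardScaler()
    # Fit on the training data only, then apply the same scaling to the test data
    train = scaler.fit_transform(train)
    test = scaler.transform(test)
    return train, test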


def DropCalcCol(train, test):
    col_to_drop = train.columns[train.columns.str.startswith('ps_calc_')]
    train = train.drop(col_to_drop, axis=1)
    test = test.drop(col_to_drop, axis=1)
    return train, test

## Apply t-SNE

In [None]:
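if True:
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE
    # scikit-learn requires at least 250 iterations for t-SNE
    # (recent releases name this parameter max_iter instead of n_iter)
    tsne = TSNE(n_components=2, random_state = 0, perplexity = 30, n_iter = 250)
    train_2d=tsne.fit_transform(train1)
    
    # Plot the two embedding dimensions against each other
    plt.scatter(train_2d[:, 0], train_2d[:, 1])
    plt.show()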

## Split data set

In [None]:
X = train1.drop(['target','id'],axis=1)
y = train1['target'].astype('category')
# x_test = test1.drop(['target','id'],axis=1)

In [None]:
y_train = train1['target'].values
train_id = train1['id'].values
X = train1.drop(['target', 'id'], axis=1)
test_id = test1['id']
X_test = test1.drop(['id',"target"], axis=1)

## Train

In [None]:
kf = StratifiedKFold(n_splits=2,random_state=seed,shuffle=True)
cv_score=[]
for fold,(train_index,test_index) in enumerate(kf.split(X,y),start=1):
    x_train,x_test = X.loc[train_index],X.loc[test_index]
    y_train,y_test = y[train_index],y[test_index]

    # The same estimator object is refit on every fold,
    # so only the model from the last fold is kept and saved
    model = MODEL
    model.fit(x_train, y_train)

    print('Model Train Score is : ' , model.score(x_train, y_train))
    print('Model Test Score is : ' , model.score(x_test, y_test))

    # Probability of the positive class (claim filed) on the validation fold
    y_pred_prob = model.predict_proba(x_test)[:, 1]
    fold_auc = roc_auc_score(y_test, y_pred_prob)
    cv_score.append(fold_auc)
    # The normalized Gini coefficient equals 2 * AUC - 1
    print('Fold', fold, 'ROC AUC:', fold_auc, 'Gini:', 2 * fold_auc - 1)
name = "model"
joblib.dump(model,name+".h5")

In [None]:
model=joblib.load("model.h5")
#y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)
# print('Predicted values for the loaded model:', y_pred[:10])
print('Prediction probabilities for the loaded model:', y_pred_prob[:10])

In [None]:
#submit = pd.DataFrame({'id':test['id'],'target':y_pred_prob[:,1]})
submit = pd.DataFrame()
submit['id'] = test["id"]
submit['target'] = y_pred_prob[:,1]
#submit.to_csv('lr_porto.csv.gz',index=False,compression='gzip') 
submit.to_csv('submit.csv',index=False) 