## Perform machine learning on something close to real data

I'm going to do this using the Home Credit dataset from Kaggle.

### Problem 1
Confirmation of competition contents

- **What to learn**: The transaction information of the clients
- **What to predict**: The repayment abilities
- **Submission file**:
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
```
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
```
- **What metric are submissions evaluated on?**: Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
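
As a rough illustration of the metric (a sketch, not the competition's scoring code): ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The helper name `pairwise_auc` and the toy numbers are made up for this example.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = payment difficulties, 0 = repaid.
y_true = [0, 0, 1, 0, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.2, 0.8, 0.05]

auc_from_probabilities = pairwise_auc(y_true, y_prob)

# Thresholding at 0.5 discards most of the ranking information.
y_label = [1 if p >= 0.5 else 0 for p in y_prob]
auc_from_labels = pairwise_auc(y_true, y_label)

print(auc_from_probabilities, auc_from_labels)  # 0.875 0.75
```

In this toy case the same predictions score lower once they are turned into 0/1 labels, which is why the submission file asks for probabilities.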

### Problem 2
Learning and verification

In [14]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from category_encoders import CountEncoder

# load the training CSV
df = pd.read_csv('application_train.csv')

# rows without any missing value (not used by the baseline below, which trains on the full df)
cleaned_df = df.dropna()

# names of the categorical (string-typed) columns
categorical_feats = df.select_dtypes('object').columns.tolist()

# separate the features from the target
X = df.drop(columns=['TARGET'])
y = df['TARGET']

In [16]:
# Encoding values
X = CountEncoder(cols=categorical_feats).fit_transform(X)
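
To make the encoding step concrete, here is a minimal sketch of the idea behind count encoding: each category is replaced by the number of times it occurs in the data the encoder was fitted on. This is a plain-Python illustration of that idea, not the `category_encoders` implementation. The helper names, the toy values, and the choice to map unseen categories to 0 are assumptions made for the example.

```python
from collections import Counter


def fit_count_encoding(values):
    """Learn a category -> count mapping from the training values."""
    return Counter(values)


def apply_count_encoding(values, counts):
    """Replace each category by its training count (0 if it was never seen)."""
    return [counts.get(v, 0) for v in values]


train_col = ["Cash loans", "Cash loans", "Revolving loans", "Cash loans"]
test_col = ["Revolving loans", "Cash loans", "Some new category"]

counts = fit_count_encoding(train_col)
train_encoded = apply_count_encoding(train_col, counts)
# Reuse the mapping learned on the training column instead of refitting on test data.
test_encoded = apply_count_encoding(test_col, counts)

print(train_encoded)  # [3, 3, 1, 3]
print(test_encoded)   # [1, 3, 0]
```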

In [17]:
# splitting the data into training and testing data using train_test_split from sklearn
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# standardizing the data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)

# fitting the model
from lightgbm import LGBMClassifier

reg = LGBMClassifier(random_state=5).fit(X_train_trans, y_train)

# predicting hard 0/1 class labels
reg_pred = reg.predict(X_test_trans)

print("Acc:", accuracy_score(y_true=y_test, y_pred=reg_pred))
# note: this ROC AUC is computed from class labels, not from predict_proba scores
print("ROC", roc_auc_score(y_test,reg_pred))

Acc: 0.9196909388901896
ROC 0.5088892347318036


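**Accuracy** is high (about 0.92), but this is not a reliable sign of a good model here. Most clients in this dataset repay, so always predicting 0 would reach a similar accuracy. The ROC AUC of about 0.51 is close to the 0.5 of random guessing; it was computed from the hard class labels returned by `predict`, whereas the competition metric is defined on predicted probabilities.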
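
The gap between accuracy and ROC AUC can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's `make_classification` with about 92% of samples in class 0 and a `LogisticRegression` as a stand-in for the LightGBM model, so the numbers it prints say nothing about the Home Credit data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 92% class 0); not the Home Credit data.
X_demo, y_demo = make_classification(
    n_samples=4000, n_features=10, weights=[0.92], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)            # 0/1 decisions at the default 0.5 cut-off
pos_proba = clf.predict_proba(X_te)[:, 1]  # probability of class 1

# Accuracy of always predicting the majority class, for comparison.
baseline_acc = accuracy_score(y_te, [0] * len(y_te))

print("Baseline acc:", baseline_acc)
print("Model acc:   ", accuracy_score(y_te, hard_labels))
print("AUC (labels):", roc_auc_score(y_te, hard_labels))
print("AUC (proba): ", roc_auc_score(y_te, pos_proba))
```

Comparing the model accuracy with the majority-class baseline shows how much of the score comes from class imbalance alone, and passing `predict_proba(...)[:, 1]` to `roc_auc_score` matches how the competition metric is defined.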

### Problem 3
Estimation for test data

In [20]:
# load the test CSV
test_df = pd.read_csv('application_test.csv')

# rows without any missing value (not used below; predictions are made for every row of test_df)
test_cleaned_df = test_df.dropna(axis=0)

# encode the categorical columns (note: this encoder is fitted on the test set, not reused from training)
test_X = CountEncoder(cols=categorical_feats).fit_transform(test_df)

# standardize the data (note: this scaler is also fitted on the test set)
test_scaler = StandardScaler()
test_X_test_trans = test_scaler.fit_transform(test_X)

# predict class labels (the competition expects probabilities, which predict_proba would give)
test_reg_pred = reg.predict(test_X_test_trans)

kgl_submission = pd.concat([test_df['SK_ID_CURR'], pd.Series(test_reg_pred, name='TARGET')], axis=1)
kgl_submission.to_csv('kggl_submission.csv', index=False)

In [21]:
kgl_submission

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0
1,100005,0
2,100013,0
3,100028,0
4,100038,0
...,...,...
48739,456221,0
48740,456222,0
48741,456223,0
48742,456224,0
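
The submission above contains class labels, while the competition asks for a probability per `SK_ID_CURR`. A possible variant is sketched below on tiny synthetic frames: the scaler is fitted on the training data only and merely applied to the test data, and the `TARGET` column holds `predict_proba` output. The feature names, the `LogisticRegression` stand-in for the LightGBM model, and the random data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Tiny synthetic stand-ins for application_train.csv / application_test.csv.
train_demo = pd.DataFrame({
    "SK_ID_CURR": np.arange(1, 201),
    "feature_a": rng.normal(100.0, 20.0, 200),  # hypothetical feature
    "feature_b": rng.normal(50.0, 10.0, 200),   # hypothetical feature
})
train_demo["TARGET"] = (rng.random(200) < 0.2).astype(int)
test_demo = pd.DataFrame({
    "SK_ID_CURR": np.arange(1001, 1011),
    "feature_a": rng.normal(100.0, 20.0, 10),
    "feature_b": rng.normal(50.0, 10.0, 10),
})

feature_cols = ["feature_a", "feature_b"]

# Fit the preprocessing on the training data only ...
demo_scaler = StandardScaler().fit(train_demo[feature_cols])
model = LogisticRegression().fit(
    demo_scaler.transform(train_demo[feature_cols]), train_demo["TARGET"]
)

# ... then only transform the test data with the already-fitted scaler.
test_scaled = demo_scaler.transform(test_demo[feature_cols])

# Column 1 of predict_proba is the probability of TARGET == 1.
submission = pd.DataFrame({
    "SK_ID_CURR": test_demo["SK_ID_CURR"],
    "TARGET": model.predict_proba(test_scaled)[:, 1],
})

# Without a path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text.splitlines()[0])
```

The same pattern would apply to the encoder: fit it once on the training features and call only `transform` on the test features.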


### Problem 4
Building on the baseline model, we will try several changes to the input features to improve accuracy.

In [23]:
# clean the dataset by dropping every row that contains a missing value
cleaned_df = df.dropna()
# separate the features from the target
X = cleaned_df.drop(columns=['TARGET'])
y = cleaned_df['TARGET']

In [24]:
# imputation
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

# pattern 1: mean imputation
imp_mean = SimpleImputer(strategy='mean')

# fill in the missing values (X still contains string columns here)
imp_X = imp_mean.fit_transform(X)

# One hot encoding
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc_imp_X = enc.fit_transform(imp_X).toarray()

# splitting the data into training and testing data using train_test_split from sklearn
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(enc_imp_X, y, test_size=0.25, random_state=42)

# standardizing the data
scaler = StandardScaler()
scaler.fit(X_train_1)
X_train_trans_1 = scaler.transform(X_train_1)
X_test_trans_1 = scaler.transform(X_test_1)

# fitting the data
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(random_state=5)
lgb = lgbm.fit(X_train_trans_1, y_train_1)

# predicting
reg_pred_1 = lgb.predict(X_test_trans_1)

print("Accuracy: ", accuracy_score(y_test_1,reg_pred_1))

ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'Cash loans'
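
The error comes from asking for the mean of string columns such as the one holding `'Cash loans'`. One common way around it, sketched here on a toy frame, is to impute numeric and string columns separately with scikit-learn's `ColumnTransformer`. The column names and values are hypothetical, and using `most_frequent` for the string column is just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one string column, each with a missing value.
demo = pd.DataFrame({
    "AMT": [1.0, np.nan, 3.0, 5.0],
    "CONTRACT": pd.Series(
        ["Cash loans", "Revolving loans", np.nan, "Cash loans"], dtype=object
    ),
})

num_cols = ["AMT"]
cat_cols = ["CONTRACT"]

imputer = ColumnTransformer([
    # The mean is only defined for numeric columns.
    ("num", SimpleImputer(strategy="mean"), num_cols),
    # String columns can use the most frequent value instead.
    ("cat", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# The output columns follow the order of the transformers: numeric first, then string.
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=num_cols + cat_cols)
print(imputed)
```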

In [18]:
# pattern 2: median imputation
imp_median = SimpleImputer(strategy='median')

# fill in the missing values
imp_X_1 = imp_median.fit_transform(X)

# One hot encoding
from sklearn.preprocessing import OneHotEncoder
enc_1 = OneHotEncoder(handle_unknown='ignore')
enc_imp_X_1 = enc_1.fit_transform(imp_X_1).toarray()

# splitting the data into training and testing data using train_test_split from sklearn
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(enc_imp_X_1, y, test_size=0.25, random_state=42)

# standardizing the data
scaler = StandardScaler()
scaler.fit(X_train_2)
X_train_trans_2 = scaler.transform(X_train_2)
X_test_trans_2 = scaler.transform(X_test_2)

# fitting the data
from lightgbm import LGBMClassifier
lgbm_1 = LGBMClassifier(random_state=5)
lgb_1 = lgbm_1.fit(X_train_trans_2, y_train_2)

# predicting
reg_pred_2 = lgb_1.predict(X_test_trans_2)

print("Accuracy: ", accuracy_score(y_test_2,reg_pred_2))

Accuracy:  0.9446768944676894


In [27]:
# pattern 3: most frequent value imputation
imp_mf = SimpleImputer(strategy='most_frequent')

# fill in the missing values
imp_X_2 = imp_mf.fit_transform(X)

# One hot encoding
from sklearn.preprocessing import OneHotEncoder
enc_2 = OneHotEncoder(handle_unknown='ignore')
enc_imp_X_2 = enc_2.fit_transform(imp_X_2).toarray()

# splitting the data into training and testing data using train_test_split from sklearn
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(enc_imp_X_2, y, test_size=0.25, random_state=42)

# standardizing the data
scaler = StandardScaler()
scaler.fit(X_train_3)
X_train_trans_3 = scaler.transform(X_train_3)
X_test_trans_3 = scaler.transform(X_test_3)

# fitting the data
from lightgbm import LGBMClassifier
lgbm_2 = LGBMClassifier(random_state=5)
lgb_2 = lgbm_2.fit(X_train_trans_3, y_train_3)

# predicting
reg_pred_3 = lgb_2.predict(X_test_trans_3)

print("Accuracy: ", accuracy_score(y_test_3,reg_pred_3))

MemoryError: Unable to allocate 6.05 GiB for an array with shape (8602, 94474) and data type float64
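
The size in the error message follows directly from the shape: one-hot encoding every column, including the numeric ones, produces one output column per distinct value, and `.toarray()` then stores every cell as a dense `float64`. A quick check of the arithmetic, using only the numbers from the message:

```python
# Shape and dtype taken from the MemoryError message above.
n_rows, n_cols = 8602, 94474
bytes_per_float64 = 8

dense_bytes = n_rows * n_cols * bytes_per_float64
dense_gib = dense_bytes / 2**30  # 1 GiB = 2**30 bytes

print(f"{dense_gib:.2f} GiB")  # 6.05 GiB
```

Leaving out `.toarray()` keeps the encoder output as a SciPy sparse matrix, which needs far less memory, and one-hot encoding only the categorical columns would shrink the column count considerably.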

In [None]:
# pattern 4: constant value imputation
imp_cnst = SimpleImputer(strategy='constant')

# fill in the missing values
imp_X_3 = imp_cnst.fit_transform(X)

# One hot encoding
from sklearn.preprocessing import OneHotEncoder
enc_3 = OneHotEncoder(handle_unknown='ignore')
enc_imp_X_3 = enc_3.fit_transform(imp_X_3).toarray()

# splitting the data into training and testing data using train_test_split from sklearn
X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(enc_imp_X_3, y, test_size=0.25, random_state=42)

# standardizing the data
scaler = StandardScaler()
scaler.fit(X_train_4)
X_train_trans_4 = scaler.transform(X_train_4)
X_test_trans_4 = scaler.transform(X_test_4)

# fitting the data
from lightgbm import LGBMClassifier
lgbm_3 = LGBMClassifier(random_state=5)
lgb_3 = lgbm_3.fit(X_train_trans_4, y_train_4)

# predicting
reg_pred_4 = lgb_3.predict(X_test_trans_4)

print("Accuracy: ", accuracy_score(y_test_4,reg_pred_4))

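For the feature engineering, I used imputation and one-hot encoding.

From my observation, the choice of `SimpleImputer` strategy makes little difference and the accuracy stays high. One reason is that `X` is built from `cleaned_df = df.dropna()`, so the rows with missing values are already gone before the imputer runs and there is nothing left for it to fill in. In the runs shown above, only the median strategy finished (accuracy about 0.945); the mean strategy failed on the non-numeric columns and the most_frequent strategy ran out of memory during one-hot encoding.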
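
A small sketch of why the imputation strategy cannot matter once `dropna()` has been applied, and how the strategies start to differ when the imputer sees the original rows instead. The toy frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric frame with missing values (hypothetical column names).
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 10.0],
    "b": [5.0, np.nan, 7.0, 8.0, 0.0],
})

complete_rows = toy.dropna()  # only rows without any NaN remain

# After dropna() there is nothing to fill, so each strategy returns the data unchanged.
results_after_dropna = {
    strategy: SimpleImputer(strategy=strategy).fit_transform(complete_rows)
    for strategy in ("mean", "median", "most_frequent")
}

# Imputing the original frame keeps all rows, and the strategies can differ.
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)

print(len(toy), len(complete_rows))           # 5 3
print(mean_filled[2, 0], median_filled[2, 0])  # 4.25 3.0
```

To compare the strategies on the real data, the imputer would need to be applied to `df` (or to a subset of its columns) rather than to `cleaned_df`.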