# Loan Default Prediction

### Business Context

The goal of the problem is to predict whether a client will default on the loan payment or not. For each ID in the test_data, you must predict the “default” level.

Datasets
The problem contains two datasets: Train Data and Test Data. The model is built on the train dataset and tested on the test dataset. The predictions for the test data are submitted on the hackathon platform.

Metric to measure
Your score is the fraction of all predictions that are correct, known as accuracy. The best accuracy is 1 and the worst is 0. It is calculated as the number of correct predictions (true positives + true negatives) divided by the total number of observations in the dataset.
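
As a quick illustration of the metric, the sketch below computes accuracy from the four confusion-matrix counts. The function name and the counts are hypothetical, chosen only to show the formula; they are not results from this dataset.

```python
def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / total number of observations."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one observation is required")
    return (tp + tn) / total


# Hypothetical counts, chosen only to illustrate the formula
example_accuracy = accuracy_from_counts(tp=20, tn=70, fp=6, fn=4)
print(example_accuracy)  # 0.9
```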

Submission File Format:
You should submit a CSV file with exactly 39933 entries plus a header row.
The file should have exactly two columns

* ID (sorted in any order)
* default (contains 0 and 1; 1 represents default)
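
The format rules above can be checked before uploading. The following is a minimal sketch, assuming the submission is held in a pandas DataFrame; `check_submission` is a hypothetical helper, and the uniqueness check on `ID` is an extra assumption based on the data description, not a stated platform rule.

```python
import pandas as pd


def check_submission(df: pd.DataFrame, expected_rows: int = 39933) -> list:
    """Return a list of format problems; an empty list means the frame looks valid."""
    problems = []
    if list(df.columns) != ["ID", "default"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems  # the remaining checks need both columns
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if df["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if not df["default"].isin([0, 1]).all():
        problems.append("'default' must contain only 0 and 1")
    return problems


# Tiny hypothetical frame, so expected_rows is overridden
toy = pd.DataFrame({"ID": [11, 12, 13], "default": [0, 1, 0]})
print(check_submission(toy, expected_rows=3))  # []
```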

### Data Description
* ID: unique ID assigned to each applicant
* loan_amnt: loan amount ($) applied for by each applicant
* loan_term: loan duration in years
* interest_rate: applicable interest rate on the loan in %
* loan_grade: loan grade assigned by the bank
* loan_subgrade: loan subgrade assigned by the bank
* job_experience: number of years of job experience
* home_ownership: status of house ownership
* annual_income: annual income of the applicant
* income_verification_status: status of income verification by the bank
* loan_purpose: purpose of the loan
* state_code: state code of the applicant's residence
* debt_to_income: ratio of total debt to income (total debt might include other loans as well)
* delinq_2yrs: number of 30+ days delinquencies in the past 2 years
* public_records: number of legal cases against the applicant
* revolving_balance: total credit revolving balance
* total_acc: total number of credit lines available in the member's credit line
* interest_receive: total interest received by the bank on the loan
* application_type: whether the applicant applied for the loan with an individual or a joint account
* last_week_pay: how many months the applicant has already paid the loan EMI
* total_current_balance: total current balance of all the accounts of the applicant
* total_revolving_limit: total revolving credit limit
* default: status of the loan, 1 = defaulter, 0 = non-defaulter

## Importing necessary libraries

In [214]:
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")

# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 100
plt.rcParams['savefig.dpi'] = 100
import seaborn as sns

# To be used for missing value imputation
from sklearn.impute import SimpleImputer

# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

from xgboost import XGBClassifier

# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,
)

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation when displaying a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# This will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black

## Loading the dataset

In [215]:
# load datasets
train_set=pd.read_csv("Train_set_(1)_(2).csv")
test_set=pd.read_csv("Test_set_(1)_(1).csv")

### Observations & Sanity checks

In [216]:
# Checking the shape of the data
print(f"There are {train_set.shape[0]} rows and {train_set.shape[1]} columns")
print(f"There are {test_set.shape[0]} rows and {test_set.shape[1]} columns")

There are 93174 rows and 23 columns
There are 39933 rows and 22 columns


- *There are missing values in the train data*

In [217]:
# printing a concise summary of the DataFrame
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93174 entries, 0 to 93173
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          93174 non-null  int64  
 1   loan_amnt                   93174 non-null  int64  
 2   loan_term                   93174 non-null  object 
 3   interest_rate               93174 non-null  float64
 4   loan_grade                  93174 non-null  object 
 5   loan_subgrade               93174 non-null  object 
 6   job_experience              88472 non-null  object 
 7   home_ownership              93174 non-null  object 
 8   annual_income               93173 non-null  float64
 9   income_verification_status  93174 non-null  object 
 10  loan_purpose                93174 non-null  object 
 11  state_code                  93174 non-null  object 
 12  debt_to_income              93174 non-null  float64
 13  delinq_2yrs                 931

In [218]:
# printing a concise summary of the DataFrame
test_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39933 entries, 0 to 39932
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          39933 non-null  int64  
 1   loan_amnt                   39933 non-null  int64  
 2   loan_term                   39933 non-null  object 
 3   interest_rate               39933 non-null  float64
 4   loan_grade                  39933 non-null  object 
 5   loan_subgrade               39933 non-null  object 
 6   job_experience              37844 non-null  object 
 7   home_ownership              39933 non-null  object 
 8   annual_income               39933 non-null  float64
 9   income_verification_status  39933 non-null  object 
 10  loan_purpose                39933 non-null  object 
 11  state_code                  39933 non-null  object 
 12  debt_to_income              39933 non-null  float64
 13  delinq_2yrs                 399

In [219]:
# Checking null (NaN) values for sanity
train_set.isnull().sum()

ID                               0
loan_amnt                        0
loan_term                        0
interest_rate                    0
loan_grade                       0
loan_subgrade                    0
job_experience                4702
home_ownership                   0
annual_income                    1
income_verification_status       0
loan_purpose                     0
state_code                       0
debt_to_income                   0
delinq_2yrs                      2
public_records                   2
revolving_balance                0
total_acc                        2
interest_receive                 0
application_type                 0
last_week_pay                 1924
total_current_balance         7386
total_revolving_limit         7386
default                          0
dtype: int64

In [220]:
# Checking null (NaN) values for sanity
test_set.isnull().sum()

ID                               0
loan_amnt                        0
loan_term                        0
interest_rate                    0
loan_grade                       0
loan_subgrade                    0
job_experience                2089
home_ownership                   0
annual_income                    0
income_verification_status       0
loan_purpose                     0
state_code                       0
debt_to_income                   0
delinq_2yrs                      1
public_records                   1
revolving_balance                0
total_acc                        1
interest_receive                 0
application_type                 0
last_week_pay                  806
total_current_balance         3230
total_revolving_limit         3230
dtype: int64

- *There are missing values in 'job_experience', 'last_week_pay', 'total_current_balance', and 'total_revolving_limit' in both the train and test data*
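
To see how large these gaps are relative to the dataset (for example, 4702 of 93174 train rows is about 5% for `job_experience`), counts can be paired with percentages. This is a minimal sketch; `missing_summary` is a hypothetical helper and the toy frame holds made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "percent": 100 * counts / len(df)})
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with hypothetical values
toy = pd.DataFrame(
    {
        "job_experience": ["<5 Years", None, "10+ years", None],
        "annual_income": [50000.0, 62000.0, np.nan, 48000.0],
        "loan_amnt": [1000, 2000, 3000, 4000],
    }
)
print(missing_summary(toy))
```

On the real data the same call would be `missing_summary(train_set)`.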

### Feature Engineering

#### Encoding categorical job_experience for datasets:

In [221]:
# encoding job_experience into numbers
job_experience = {
    "<5 Years": 0,
    "6-10 years": 1,
    "10+ years": 2,
}

train_set2=train_set.copy()
test_set2=test_set.copy()

train_set2["job_experience"] = train_set2["job_experience"].map(job_experience)
test_set2["job_experience"] = test_set2["job_experience"].map(job_experience)
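
One thing worth checking with `Series.map` and a dict: any label that is not a key of the dict silently becomes NaN and is later treated as a missing value. The sketch below lists labels that the mapping does not cover; `unmapped_labels` is a hypothetical helper, and the toy labels are assumptions, since the actual spellings in the column are not shown here.

```python
import pandas as pd


def unmapped_labels(series: pd.Series, mapping: dict) -> list:
    """Labels that are present in the data but absent from the mapping."""
    present = series.dropna().unique()
    return sorted(label for label in present if label not in mapping)


# Hypothetical labels; the real column may use different spellings
mapping = {"<5 Years": 0, "6-10 years": 1, "10+ years": 2}
toy = pd.Series(["<5 Years", "10+ years", "6-10 Years", None])

print(unmapped_labels(toy, mapping))  # ['6-10 Years']
```

An empty list for `unmapped_labels(train_set["job_experience"], job_experience)` would confirm that only the original NaN values remain missing after the mapping.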

#### Train-Validation Split

- *To avoid data leakage, let's split the data into train and validation sets before treating missing values:*

In [222]:
# Dividing train data into X and y
X = train_set2.drop(["default"], axis=1)
y = train_set2["default"]
X_test = test_set2

In [223]:
# Splitting data into training and validation set:

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(X_train.shape, X_val.shape, y_val.shape, X_test.shape)

(74539, 22) (18635, 22) (18635,) (39933, 22)


In [224]:
X_train, X_test = X_train.align(X_test, join='left', axis=1)
print(X_train.shape, X_val.shape, y_val.shape, X_test.shape)

(74539, 22) (18635, 22) (18635,) (39933, 22)


- *The split sets have the same number of columns*

#### Impute missing values for train, val and test sets

In [225]:
# Get list of categorical and numerical columns
cat_cols = list(X_train.select_dtypes(include='object').columns)
num_cols = list(X_train.select_dtypes(exclude='object').columns)

# Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_val[cat_cols] = cat_imputer.transform(X_val[cat_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])

# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_val[num_cols] = num_imputer.transform(X_val[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])

#### Encoding categorical data

In [226]:
# let's use pandas get-dummies function
X_train_encoded = pd.get_dummies(X_train, drop_first=True, dtype=float) # for train
X_val_encoded = pd.get_dummies(X_val, drop_first=True, dtype=float) # for val
X_test_encoded = pd.get_dummies(X_test, drop_first=True, dtype=float) # for test set
print(X_train_encoded.shape, X_val_encoded.shape, X_test_encoded.shape)

(74539, 114) (18635, 113) (39933, 114)


In [227]:
# One-hot encoding each set separately gives a shape mismatch: find the train columns
# missing from val/test (here state_code_ID) and add them to val as zeros
train_cols = list(X_train_encoded.columns)
val_cols = list(X_val_encoded.columns)
test_cols = list(X_test_encoded.columns)
cols_not_in_val = {c:0 for c in train_cols if c not in val_cols}
print(cols_not_in_val)
cols_not_in_test = {c:0 for c in train_cols if c not in test_cols}
print(cols_not_in_test)
X_val_encoded = X_val_encoded.assign(**cols_not_in_val)
print(X_train_encoded.shape, X_val_encoded.shape, X_test_encoded.shape)

{'state_code_ID': 0}
{}
(74539, 114) (18635, 114) (39933, 114)


In [228]:
# align all validation columns to train
X_train_encoded, X_val_encoded = X_train_encoded.align(X_val_encoded, join='left', axis=1)

In [229]:
# align all test columns to train (train is the left frame, so its columns and order are kept)
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)

- *Looks aligned!*
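
An alternative way to get the same result in one step is `DataFrame.reindex`, which keeps exactly the training columns in training order. This is a minimal sketch on toy frames with made-up values; it omits `drop_first=True` only to keep the output easy to read.

```python
import pandas as pd

# Toy train/validation frames (hypothetical values)
train = pd.DataFrame({"loan_amnt": [1000, 2000, 3000], "state_code": ["CA", "ID", "NY"]})
val = pd.DataFrame({"loan_amnt": [1500, 2500], "state_code": ["NY", "TX"]})

train_enc = pd.get_dummies(train, dtype=float)
val_enc = pd.get_dummies(val, dtype=float)

# Keep exactly the training columns, in training order:
# missing dummies are filled with 0, unseen ones (state_code_TX) are dropped
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0.0)

print(list(val_enc.columns))
```

Dropping unseen categories is a design choice: a model fitted on the training columns has no weight for them anyway.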

In [230]:
# Checking class balance for whole data, train set, validation set, and test set

print("Target value ratio in y")
print(y.value_counts(1))
print("*" * 80)
print("Target value ratio in y_train")
print(y_train.value_counts(1))
print("*" * 80)
print("Target value ratio in y_val")
print(y_val.value_counts(1))
print("*" * 80)

Target value ratio in y
default
0   0.762
1   0.238
Name: proportion, dtype: float64
********************************************************************************
Target value ratio in y_train
default
0   0.763
1   0.237
Name: proportion, dtype: float64
********************************************************************************
Target value ratio in y_val
default
0   0.762
1   0.238
Name: proportion, dtype: float64
********************************************************************************


- *The classes are imbalanced (about 76% non-default vs. 24% default), so resampling is worth trying*
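
Because the competition metric is accuracy, the class ratio also sets a floor: always predicting the majority class already scores about 0.762. The sketch below computes that baseline; `majority_baseline_accuracy` is a hypothetical helper and the toy target only mimics the proportions printed above.

```python
import pandas as pd


def majority_baseline_accuracy(y: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(y.value_counts(normalize=True).max())


# Toy target with roughly the same proportions as y_val (hypothetical data)
toy_y = pd.Series([0] * 762 + [1] * 238)
print(majority_baseline_accuracy(toy_y))  # 0.762
```

Any model is worth keeping only if its validation accuracy clearly exceeds this value.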

## Model building

### Model evaluation criterion

The evaluation metric is accuracy: the number of correct predictions (true positives + true negatives) divided by the total number of observations in the dataset.

**Let's define a function to output different metrics on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.**

In [231]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf

def make_confusion_matrix(actual_targets, predicted_targets):
    """
    To plot the confusion_matrix with percentages

    actual_targets: actual target (dependent) variable values
    predicted_targets: predicted target (dependent) variable values
    """
    cm = confusion_matrix(actual_targets, predicted_targets)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(cm.shape[0], cm.shape[1])

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

#### Sample tuning method for XGBoost with original data

In [278]:
# defining model
Model = XGBClassifier()

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=10,
    n_jobs=-1,
    scoring=metrics.make_scorer(metrics.accuracy_score),
    cv=5,
    random_state=1,
)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_encoded, y_train)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

Best parameters are {'subsample': 0.7, 'scale_pos_weight': 2, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9069212185040139:
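
It helps to know how much of the search space a randomized search actually covers. The sketch below counts the combinations in the grid above and compares that with `n_iter=10`; it uses plain lists in place of `np.arange(50, 110, 25)`, which yields 50, 75 and 100.

```python
import math

# Same candidate values as the grid above (np.arange(50, 110, 25) gives 50, 75, 100)
param_grid = {
    "n_estimators": [50, 75, 100],
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}

n_combinations = math.prod(len(values) for values in param_grid.values())
n_iter, cv = 10, 5

print(n_combinations)                    # 108
print(f"{n_iter / n_combinations:.1%}")  # 9.3%
print(n_iter * cv)                       # 50 model fits during the search
```

So the reported best parameters are the best of 10 sampled candidates, not of all 108; a larger `n_iter` may find a better combination at a higher compute cost.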


In [279]:
# for origin set
tuned_xgb_origin = XGBClassifier(
    subsample=0.7,
    scale_pos_weight=2,
    n_estimators=50,
    learning_rate=0.05,
    gamma=3,
    objective='binary:logistic'
)


tuned_xgb_origin.fit(X_train_encoded, y_train)

In [280]:
# Checking the model's performance on the origin_encoded training set
tuned_xgb_origin_train = model_performance_classification_sklearn(tuned_xgb_origin, X_train_encoded, y_train)
tuned_xgb_origin_train

|   | Accuracy | Recall | Precision | F1 |
|---|----------|--------|-----------|----|
| 0 | 0.909 | 0.89 | 0.764 | 0.822 |


In [281]:
# Checking the model's performance on the origin validation set
tuned_xgb_origin_val = model_performance_classification_sklearn(tuned_xgb_origin, X_val_encoded, y_val)
tuned_xgb_origin_val

|   | Accuracy | Recall | Precision | F1 |
|---|----------|--------|-----------|----|
| 0 | 0.904 | 0.873 | 0.76 | 0.812 |


- *Train and validation accuracy are close (0.909 vs. 0.904), so the model generalizes well*

#### SMOTE balance

In [236]:
# initializing SMOTE instance
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=42)

# fitting SMOTE on the training data only (the validation set is left untouched)
X_train_smote, y_train_smote= sm.fit_resample(X_train_encoded, y_train)
print('After UpSampling, the shape of train_X: {}'.format(X_train_smote.shape))
print('After UpSampling, the shape of train_y: {} \n'.format(y_train_smote.shape))

After UpSampling, the shape of train_X: (113672, 114)
After UpSampling, the shape of train_y: (113672,) 
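
For intuition, SMOTE creates each synthetic minority row by interpolating between a minority sample and one of its `k_neighbors` nearest minority neighbors. The sketch below shows only that interpolation step on two made-up points, plus the row arithmetic behind the shapes printed above; it is not the imblearn implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class points in a 2-feature space
x = np.array([1.0, 4.0])
neighbor = np.array([3.0, 8.0])

# Core SMOTE step: pick a random point on the segment between x and a neighbor
u = rng.uniform(0.0, 1.0)
synthetic = x + u * (neighbor - x)
print(synthetic)

# With sampling_strategy=1 the minority class is grown to the majority size
n_majority, n_minority = 56836, 17703
print(n_majority - n_minority)  # 39133 synthetic rows
print(2 * n_majority)           # 113672 rows after resampling
```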



In [237]:
# Checking class balance for train set
print(y_train_smote.value_counts(1))

default
0   0.500
1   0.500
Name: proportion, dtype: float64


- *looks balanced*

#### Sample tuning method for XGBoost with oversampled balanced data

In [238]:
# defining model
Model = XGBClassifier()

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=10,
    n_jobs=-1,
    scoring=metrics.make_scorer(metrics.accuracy_score),
    cv=5,
    random_state=1,
)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_smote, y_train_smote)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))



Best parameters are {'subsample': 0.9, 'scale_pos_weight': 5, 'n_estimators': 100, 'learning_rate': 0.1, 'gamma': 3} with CV score=0.9310299189047256:


In [282]:
# for SMOTE-oversampled set
tuned_xgb_smote = XGBClassifier(
    subsample=0.9,
    scale_pos_weight=5,
    n_estimators=100,
    learning_rate=0.1,
    gamma=3,
    objective='binary:logistic'
)

tuned_xgb_smote.fit(X_train_smote, y_train_smote)

In [283]:
# Checking the model's performance on the SMOTE-oversampled training set
tuned_xgb_smote_train = model_performance_classification_sklearn(tuned_xgb_smote, X_train_smote, y_train_smote)
tuned_xgb_smote_train

|   | Accuracy | Recall | Precision | F1 |
|---|----------|--------|-----------|----|
| 0 | 0.939 | 0.985 | 0.903 | 0.942 |


In [284]:
# Checking the model's performance on the origin validation set
tuned_xgb_smote_val = model_performance_classification_sklearn(tuned_xgb_smote, X_val_encoded, y_val)
tuned_xgb_smote_val

|   | Accuracy | Recall | Precision | F1 |
|---|----------|--------|-----------|----|
| 0 | 0.9 | 0.928 | 0.727 | 0.815 |


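- *Validation accuracy is lower (0.900) while training accuracy is higher (0.939), so this model overfits more and needs regularization. The CV score of 0.931 is also optimistic: SMOTE was applied before cross-validation, so synthetic samples end up in the validation folds*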

#### Sample tuning method for XGBoost with undersampled data

In [242]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train_encoded, y_train)

In [243]:
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))

print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))

Before Under Sampling, counts of label 'Yes': 17703
Before Under Sampling, counts of label 'No': 56836 

After Under Sampling, counts of label 'Yes': 17703
After Under Sampling, counts of label 'No': 17703 

After Under Sampling, the shape of train_X: (35406, 114)
After Under Sampling, the shape of train_y: (35406,) 



In [244]:
# defining model
Model = XGBClassifier()

# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_iter=10,
    n_jobs=-1,
    scoring=metrics.make_scorer(metrics.accuracy_score),
    cv=5,
    random_state=1,
)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

Best parameters are {'subsample': 0.7, 'scale_pos_weight': 2, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9142517568423256:


In [245]:
# for undersampled set
# NOTE: scale_pos_weight=1 is used here, although the search above reported 2
tuned_xgb_un = XGBClassifier(
    subsample=0.7,
    scale_pos_weight=1,
    n_estimators=50,
    learning_rate=0.05,
    gamma=3,
    objective='binary:logistic'
)

tuned_xgb_un.fit(X_train_un, y_train_un)

In [246]:
# Checking the model's performance on the undersampled training set
tuned_xgb_un_train = model_performance_classification_sklearn(tuned_xgb_un, X_train_un, y_train_un)
tuned_xgb_un_train

|   | Accuracy | Recall | Precision | F1 |
|---|----------|--------|-----------|----|
| 0 | 0.914 | 0.919 | 0.91 | 0.915 |


In [247]:
# Checking the model's performance on the origin validation set
tuned_xgb_un_val = model_performance_classification_sklearn(tuned_xgb_un, X_val_encoded, y_val)
tuned_xgb_un_val

|   | Accuracy | Recall | Precision | F1 |
|---|----------|--------|-----------|----|
| 0 | 0.905 | 0.903 | 0.748 | 0.818 |


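- *Validation accuracy (0.905) is on par with the original-data model (0.904), but precision drops from 0.910 on the undersampled training set to 0.748 on validation, so the train-validation gap is wider than with the original data*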
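
Part of the precision gap between a balanced training set and the imbalanced validation set can come from class prevalence alone: with fixed true-positive and false-positive rates, precision falls as positives become rarer. The sketch below illustrates this with assumed rates (not measured from the models in this notebook); `precision_at_prevalence` is a hypothetical helper.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision implied by a fixed TPR/FPR at a given share of positives."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Illustrative rates (assumed, not measured from the model above)
tpr, fpr = 0.90, 0.09

print(round(precision_at_prevalence(tpr, fpr, 0.500), 3))  # 0.909
print(round(precision_at_prevalence(tpr, fpr, 0.238), 3))  # 0.757
```

For this reason, precision and F1 measured on resampled training data are not directly comparable with the same metrics on the validation set.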

#### Performance comparison

In [248]:
# Performance comparison

models_train_comp_df = pd.concat(
    [
        tuned_xgb_origin_train.T,
        tuned_xgb_smote_train.T,
        tuned_xgb_un_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "XGB Original data train",
    "XGB Oversampled data train",
    "XGB Undersampled data train", 
]

models_val_comp_df = pd.concat(
    [
        tuned_xgb_origin_val.T,
        tuned_xgb_smote_val.T,
        tuned_xgb_un_val.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "XGB Original data val",
    "XGB Oversampled data val",
    "XGB Undersampled data val", 
]



print("Validation performance comparison:")
models_val_comp_df

Validation performance comparison:

|           | XGB Original data val | XGB Oversampled data val | XGB Undersampled data val |
|-----------|-----------------------|--------------------------|---------------------------|
| Accuracy  | 0.904 | 0.9 | 0.905 |
| Recall    | 0.873 | 0.928 | 0.903 |
| Precision | 0.76 | 0.727 | 0.748 |
| F1        | 0.812 | 0.815 | 0.818 |


In [249]:
# concat train and valid metric
result = pd.concat([models_train_comp_df, models_val_comp_df], axis=1).reindex(models_train_comp_df.index)

In [250]:
print("Models performance comparison")
result.T

Models performance comparison


|                             | Accuracy | Recall | Precision | F1 |
|-----------------------------|----------|--------|-----------|----|
| XGB Original data train     | 0.909 | 0.89 | 0.764 | 0.822 |
| XGB Oversampled data train  | 0.939 | 0.985 | 0.903 | 0.942 |
| XGB Undersampled data train | 0.914 | 0.919 | 0.91 | 0.915 |
| XGB Original data val       | 0.904 | 0.873 | 0.76 | 0.812 |
| XGB Oversampled data val    | 0.9 | 0.928 | 0.727 | 0.815 |
| XGB Undersampled data val   | 0.905 | 0.903 | 0.748 | 0.818 |


In [251]:
# add difference between metrics
result['difference accuracy origin'] = result['XGB Original data train'] - result['XGB Original data val']
result['difference accuracy balanced'] = result['XGB Oversampled data train'] - result['XGB Oversampled data val']
result['difference accuracy undersampled'] = result['XGB Undersampled data train'] - result['XGB Undersampled data val']

In [252]:
result.T

|                                  | Accuracy | Recall | Precision | F1 |
|----------------------------------|----------|--------|-----------|----|
| XGB Original data train          | 0.909 | 0.89 | 0.764 | 0.822 |
| XGB Oversampled data train       | 0.939 | 0.985 | 0.903 | 0.942 |
| XGB Undersampled data train      | 0.914 | 0.919 | 0.91 | 0.915 |
| XGB Original data val            | 0.904 | 0.873 | 0.76 | 0.812 |
| XGB Oversampled data val         | 0.9 | 0.928 | 0.727 | 0.815 |
| XGB Undersampled data val        | 0.905 | 0.903 | 0.748 | 0.818 |
| difference accuracy origin       | 0.005 | 0.018 | 0.005 | 0.01 |
| difference accuracy balanced     | 0.039 | 0.057 | 0.176 | 0.127 |
| difference accuracy undersampled | 0.009 | 0.016 | 0.162 | 0.096 |


- *The XGBoost model trained on the original data generalizes best: it has the smallest train-validation gaps*
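
The same conclusion can be reached programmatically by ranking the models on their largest train-validation gap. This is a minimal sketch using the rounded values from the tables above, so the last digit of a gap may differ slightly from the unrounded notebook output; the short model keys are hypothetical labels.

```python
import pandas as pd

metric_names = ["Accuracy", "Recall", "Precision", "F1"]

# Rounded values copied from the comparison tables above
train_scores = pd.DataFrame(
    {
        "original": [0.909, 0.890, 0.764, 0.822],
        "oversampled": [0.939, 0.985, 0.903, 0.942],
        "undersampled": [0.914, 0.919, 0.910, 0.915],
    },
    index=metric_names,
)
val_scores = pd.DataFrame(
    {
        "original": [0.904, 0.873, 0.760, 0.812],
        "oversampled": [0.900, 0.928, 0.727, 0.815],
        "undersampled": [0.905, 0.903, 0.748, 0.818],
    },
    index=metric_names,
)

gap = (train_scores - val_scores).round(3)
print(gap)

# Model whose worst-case gap across the four metrics is smallest
print(gap.abs().max().idxmin())  # original
```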

#### Extracting predictions from the test set and saving the submission

In [253]:
# choosing best model:
y_predict = tuned_xgb_origin.predict(X_test_encoded)

In [265]:
data = {'ID': X_test_encoded['ID'].astype(int), 'default': y_predict.astype(int)}
df = pd.DataFrame(data)

In [266]:
df

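|       | ID       | default |
|-------|----------|---------|
| 0     | 4855329  | 1 |
| 1     | 66862420 | 0 |
| 2     | 3637416  | 1 |
| 3     | 53682249 | 0 |
| 4     | 53937165 | 0 |
| ...   | ...      | ... |
| 39928 | 57779318 | 0 |
| 39929 | 59742362 | 0 |
| 39930 | 72657145 | 0 |
| 39931 | 15220189 | 0 |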


In [268]:
# Save the DataFrame to a CSV file
df.to_csv('Submission.csv', index=False)

In [269]:
df.shape

(39933, 2)

In [263]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39933 entries, 0 to 39932
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   ID       39933 non-null  float64
 1   default  39933 non-null  int64  
dtypes: float64(1), int64(1)
memory usage: 624.1 KB


In [270]:
# load the sample submission to compare its format with ours
dfe=pd.read_csv("Sample_Submission_(1)_(1).csv")

In [272]:
dfe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39933 entries, 0 to 39932
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   ID       39933 non-null  int64
 1   default  39933 non-null  int64
dtypes: int64(2)
memory usage: 624.1 KB


