
Let's investigate our target variable. 

In [None]:
app_train['TARGET'].value_counts()

In [None]:
app_train['TARGET'].plot.hist();

The classes are very imbalanced: far more loans were repaid on time than were not repaid.
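
To put a number on the imbalance, the class shares can be computed instead of the raw counts. A minimal sketch on a made-up frame (`toy` and its values are illustrative stand-ins, not figures from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for app_train: 0 = repaid on time, 1 = not repaid
toy = pd.DataFrame({"TARGET": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})

# normalize=True returns the share of each class instead of raw counts
class_share = toy["TARGET"].value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `app_train['TARGET'].value_counts(normalize=True)`.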

In [None]:
app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)

Most categorical variables have only a few levels. We'll need to one-hot encode these, which will add one new feature per level to our dataset.
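
As a rough illustration of how one-hot encoding widens the table, here is a sketch using `pd.get_dummies` on a toy frame. The frame, its values and the `AMOUNT` column are invented for the example; only the two categorical column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "CODE_GENDER": ["F", "M", "F"],
    "AMOUNT": [100.0, 200.0, 150.0],  # hypothetical numeric column
})

# get_dummies adds one indicator column per level and drops the originals
encoded = pd.get_dummies(toy, columns=["NAME_CONTRACT_TYPE", "CODE_GENDER"])
print(toy.shape, "->", encoded.shape)
```

Two categorical columns with two levels each become four indicator columns, so the width grows with the total number of levels rather than the number of categorical columns.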

Let's examine a few interesting variables. In some cases we'll split them by whether the loan was repaid or defaulted.

In [None]:
# Used to explore a single feature on a set of subplots. 
# Visualise distribution, noise & outliers and missing values as well as correlation with target.
def explore_variable(df, feature, target=None, by_categorical=None):
    '''
    Numerical features display 3 plots: a histogram, a scatter plot of the feature against the target, and a box plot of the feature
    (2 plots, histogram and box plot, when no target is given).
    Optional by_categorical can be provided to show the box plot by levels of a categorical variable.
    Categorical features display 2 plots: a bar chart of level counts and the mean of the target by feature level.
    '''
    feature_type = df[feature].dtype
    # Count missing values for this feature only
    missing = df[feature].isnull().sum()
    print("'{}' is of type {} with {} missing values".format(feature, feature_type, missing))

    if feature_type == "object":
        fig, ax = plt.subplots(1, 2, figsize=(15,5))
        fig.subplots_adjust(wspace=0.3)

        ax1 = ax.ravel()[0]
        ax1.set_title("Distribution of {}".format(feature))
        df[feature].value_counts().plot.barh(ax=ax1)

        ax2 = ax.ravel()[1]
        # pivot_table aggregates with the mean by default
        ax2.set_title("Mean {} by {}".format(target, feature))
#         df.groupby(feature)[[target]].median().plot.barh(ax=ax2)
        if by_categorical is not None:
            pd.pivot_table(data=df, index=feature, values=target, columns=by_categorical).plot.barh(ax=ax2)
        else:
            pd.pivot_table(data=df, index=feature, values=target).plot.barh(ax=ax2)

        plt.show()

    # `== "int64" or "float64"` is always truthy, so test membership instead
    elif feature_type in ("int64", "float64"):
        if target is not None:
            fig, ax = plt.subplots(1, 3, figsize=(20,8), dpi=400)
            fig.subplots_adjust(wspace=0.3)

            ax1 = ax.ravel()[0]
            ax1.set_title("Distribution of {}".format(feature))
            df[feature].hist(bins=50, ax=ax1)

            ax2 = ax.ravel()[1]
            ax2.set_title("Correlation btw\n {} and {}".format(feature, target))
            df.plot.scatter(x=feature, y=target, ax=ax2)  

            ax3 = ax.ravel()[2]
            ax3.set_title("Box plot for {}".format(feature))
            #df[feature].plot.box(ax=ax3)
            if by_categorical is not None:
                sns.boxplot(x=by_categorical, y=feature, data=df, ax=ax3)
            else:
                sns.boxplot(y=df[feature], ax=ax3)
        else:
            fig, ax = plt.subplots(1, 2, figsize=(20,8), dpi=400)
            fig.subplots_adjust(wspace=0.3)

            ax1 = ax.ravel()[0]
            ax1.set_title("Distribution of {}".format(feature))
            df[feature].hist(bins=50, ax=ax1)

            ax2 = ax.ravel()[1]
            ax2.set_title("Box plot for {}".format(feature))
            #df[feature].plot.box(ax=ax3)
            if by_categorical is not None:
                sns.boxplot(x=by_categorical, y=feature, data=df, ax=ax2)
            else:
                sns.boxplot(y=df[feature], ax=ax2)
            
        

        plt.show()


In [None]:
app_train.NAME_CONTRACT_TYPE.value_counts()

In [None]:
plt.style.use('fivethirtyeight')

sns.catplot(data=app_train, x="TARGET", hue="NAME_CONTRACT_TYPE", kind="count")
plt.title('Repayment or Default by Loan Type')
# The x-axis is TARGET; loan type is shown by the hue
plt.xlabel('TARGET (0 = repaid, 1 = default)'); plt.ylabel('Number of Loans');

Since most of the loans are Cash Loans, this doesn't tell us much.
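
Counts are dominated by the largest group, so a per-type share is easier to compare. One option is `pd.crosstab` with `normalize="index"`. A sketch on invented data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: 8 cash loans and 2 revolving loans
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 8 + ["Revolving loans"] * 2,
    "TARGET": [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
})

# normalize="index" turns each row into shares that sum to 1,
# so loan types of very different sizes become comparable
rates = pd.crosstab(toy["NAME_CONTRACT_TYPE"], toy["TARGET"], normalize="index")
print(rates)
```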

In [None]:
app_train.CODE_GENDER.value_counts()

In [None]:
sns.catplot(data=app_train, x="TARGET", hue="CODE_GENDER", kind="count")

More females than males repay their loans, but more females also default. There are nearly twice as many females in the dataset, so the raw counts alone are not informative.
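
Because the groups differ in size, the share of loans not repaid per group is the fairer comparison. A sketch with `groupby` on invented data (the `toy` frame is hypothetical); the mean of a 0/1 target is the default rate:

```python
import pandas as pd

# Hypothetical toy data: twice as many F as M
toy = pd.DataFrame({
    "CODE_GENDER": ["F"] * 6 + ["M"] * 3,
    "TARGET": [0, 0, 0, 0, 0, 1, 0, 0, 1],
})

# count = group size, mean of a 0/1 target = share of loans not repaid
by_gender = toy.groupby("CODE_GENDER")["TARGET"].agg(["count", "mean"])
print(by_gender)
```

In this toy example each group has one default, yet the rate in the smaller group is twice as high.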

There is a category called XNA with only 4 entries. Let's remove these rows from the training set.

In [None]:
# .copy() makes the filtered frame independent, so later column assignments do not warn
app_train = app_train[app_train.CODE_GENDER != "XNA"].copy()

In [None]:
app_train.shape

Let's see if number of children affects loan repayment.

In [None]:
app_train.CNT_CHILDREN.describe()

Someone has 19 children! This looks like an error, so let's check it out.

In [None]:
app_train[app_train.CNT_CHILDREN >= 10]

These seem unlikely. Most also seem to repay their loans (TARGET == 0). Let's change their child count to np.nan so it can be imputed instead, and add a flag to indicate that they are anomalies.

In [None]:
anom = app_train[app_train['CNT_CHILDREN'] >= 10]
non_anom = app_train[app_train['CNT_CHILDREN'] < 10]
print('The non-anomalies default on %0.2f%% of loans' % (100 * non_anom['TARGET'].mean()))
print('The anomalies default on %0.2f%% of loans' % (100 * anom['TARGET'].mean()))
print('There are %d anomalous children counts' % len(anom))

In [None]:
# Create an anomalous flag column
app_train['CNT_CHILDREN_ANOM'] = app_train["CNT_CHILDREN"] >= 10

# Replace the anomalous values (10 or more children) with nan
app_train["CNT_CHILDREN"] = app_train["CNT_CHILDREN"].mask(app_train["CNT_CHILDREN"] >= 10)
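
Once the implausible counts are set to NaN, they can be filled in by an imputation step. The sketch below uses the column median, which is only one simple choice and not necessarily the one used later in this notebook. The `toy` frame is invented.

```python
import pandas as pd

# Hypothetical toy data with two implausible child counts
toy = pd.DataFrame({"CNT_CHILDREN": [0, 1, 2, 0, 19, 14]})

# Flag first, then blank out the flagged counts (threshold of 10, as above)
toy["CNT_CHILDREN_ANOM"] = toy["CNT_CHILDREN"] >= 10
toy["CNT_CHILDREN"] = toy["CNT_CHILDREN"].mask(toy["CNT_CHILDREN_ANOM"])

# Median imputation (an assumption for this sketch); median() skips NaN by default
median_children = toy["CNT_CHILDREN"].median()
toy["CNT_CHILDREN"] = toy["CNT_CHILDREN"].fillna(median_children)
print(toy)
```

The flag column survives the imputation, so a model can still tell which rows originally held an anomalous value.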

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_val_percent = 100 * mis_val / len(df)

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})

    # Keep only columns with missing values, sorted by percentage missing, descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_val_table_ren_columns

In [None]:
missing_values_table(app_train).head(30)

- There are a lot of missing values. We can try to impute them using some criterion, or we can remove columns whose share of missing values is at or above a chosen threshold.
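
Both options can be combined: drop the columns that are mostly empty, then impute what remains. A sketch on invented data, where the `toy` frame, the `threshold` of 0.5 and the choice of median imputation are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})

threshold = 0.5  # hypothetical cut-off: drop columns at or above 50% missing

# isnull().mean() gives the share of missing values per column
missing_share = toy.isnull().mean()
to_drop = missing_share[missing_share >= threshold].index
reduced = toy.drop(columns=to_drop)

# Fill the remaining numeric gaps with each column's median
reduced = reduced.fillna(reduced.median(numeric_only=True))
print(reduced)
```

The right threshold depends on how informative the sparse columns are, so it is worth trying more than one value.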

In [None]:
app_test = pd.read_csv('input/application_test.csv')
print("test data size:", app_test.shape)
app_test.head()