<h3>Note: I will use the version of the dataset that includes the Risk column</h3>

<h2>Context</h2>
The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes a credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes. The link to the original dataset can be found below.

<h2>Content</h2>
It is almost impossible to understand the original dataset due to its complicated system of categories and symbols. Thus, I wrote a small Python script to convert it into a readable CSV file. Several columns are simply ignored, because in my opinion either they are not important or their descriptions are obscure. The selected attributes are:

<b>Age </b>(numeric)<br>
<b>Sex </b>(text: male, female)<br>
<b>Job </b>(numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)<br>
<b>Housing</b> (text: own, rent, or free)<br>
<b>Saving accounts</b> (text - little, moderate, quite rich, rich)<br>
<b>Checking account</b> (text: little, moderate, rich)<br>
<b>Credit amount</b> (numeric, in DM - Deutsche Mark)<br>
<b>Duration</b> (numeric, in months)<br>
<b>Purpose</b> (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)<br>
<b>Risk</b> (target value: good or bad risk)<br>

Please give me your feedback, and if you like this kernel, upvote it to keep me motivated.

In [1]:
# Load the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import rcParams

%matplotlib inline

# figure size in inches
rcParams['figure.figsize'] = 8,6

df_credit = pd.read_csv("../input/german-credit-data-with-risk/german_credit_data.csv",index_col=0)

FileNotFoundError: File b'../input/german-credit-data-with-risk/german_credit_data.csv' does not exist

In [None]:
# Checking for missing values, the data types, and the shape of the data
print(df_credit.info())

In [None]:
# Counting the unique values in each column
print(df_credit.nunique())
# Looking at the first rows of the data
print(df_credit.head())
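
Since `info()` only reports non-null counts, it can help to count the missing entries per column directly. The following is a minimal sketch on a made-up stand-in for `df_credit` (the `toy` frame and its values are assumptions for illustration, not rows from the real file):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_credit; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22, 45, 31, 58],
    "Saving accounts": ["little", np.nan, "rich", np.nan],
    "Checking account": [np.nan, "moderate", np.nan, np.nan],
})

# Number and share of missing entries per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
missing_share = toy.isna().mean().sort_values(ascending=False)
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

On the real data, replacing `toy` with `df_credit` should show which columns are candidates for imputation or removal.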

I will start with the Age column.



In [None]:
fig, ax =plt.subplots(2,1, figsize=(12,8))
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df_credit["Age"], kde=True, stat="density", ax=ax[0])
sns.countplot(x="Age", data=df_credit, palette="hls", ax=ax[1], hue = "Risk")
plt.show()

In [None]:
# Grouping the ages into categories and comparing Credit amount across them
interval = (18, 25, 35, 60, 120)

cats = ['Student', 'Young', 'Adult', 'Senior']
df_credit["Age_cat"] = pd.cut(df_credit.Age, interval, labels=cats)

# catplot replaces the older factorplot; its `size` argument is now `height`
g = sns.catplot(x="Age_cat", y="Credit amount", data=df_credit, kind="box", height=5, aspect=2, hue="Risk", palette='hls')
g.set(xlabel='Age Category', ylabel='Credit Amount (DM)', title="Age Category x Credit Amount Separated by Risk Groups")
plt.show()
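
One detail of `pd.cut` worth knowing: by default the bins are closed on the right, so the lowest edge itself is excluded. The sketch below uses a made-up `ages` series (an assumption for illustration) with the same edges and labels as above, and shows how `include_lowest=True` changes the result for an 18-year-old:

```python
import pandas as pd

# Made-up ages sitting on and next to the bin edges.
ages = pd.Series([18, 19, 25, 26, 35, 36, 60, 61])
interval = (18, 25, 35, 60, 120)
cats = ['Student', 'Young', 'Adult', 'Senior']

# Default: right-closed bins (18, 25], (25, 35], ... so 18 falls outside (NaN).
default_bins = pd.cut(ages, interval, labels=cats)

# include_lowest=True closes the first bin on the left, so 18 becomes 'Student'.
inclusive_bins = pd.cut(ages, interval, labels=cats, include_lowest=True)

print(pd.DataFrame({"age": ages, "default": default_bins, "inclusive": inclusive_bins}))
```

Whether this matters for the credit data depends on the minimum age in the file, which `df_credit["Age"].min()` would show.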

Interesting. 

I will now look at the distribution of the Housing categories (own, rent, free) by Risk.


In [None]:
ax = sns.countplot(x="Housing", data=df_credit, palette="hls", hue = "Risk")
ax.set(xlabel='Housing Categories', ylabel='Count',title="Housing Count")
plt.show()

We can see that people who own their housing are strongly associated with good risk.
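
A count plot shows absolute numbers, so a group that is simply larger will always look taller. One way to check the association is a row-normalized crosstab, which gives the share of good and bad risk within each housing category. This is a minimal sketch on made-up rows (the `toy` frame is an assumption, not real data):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Housing": ["own", "own", "own", "own", "rent", "rent", "free", "free"],
    "Risk":    ["good", "good", "good", "bad", "good", "bad", "bad", "good"],
})

# normalize="index" turns each row into shares that sum to 1.
risk_share = pd.crosstab(toy["Housing"], toy["Risk"], normalize="index")
print(risk_share)
```

Applied to `df_credit`, the same call would show whether the share of good risk really differs between own, rent, and free.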

In [None]:
# catplot replaces the older factorplot; its `size` argument is now `height`
g = sns.catplot(x="Housing", y="Credit amount", data=df_credit,
                palette="hls", hue="Risk", kind="violin", height=4, aspect=2, split=True)
g.set(xlabel="Housing", ylabel="Credit Amount (DM)", title="Housing by Credit Amount")
plt.show()

Interesting patterns! The highest values come from the "free" category, and the distribution differs by Risk.

Looking at the difference by Sex

In [None]:
print("Total difference by Sex: ")
print(df_credit.groupby("Sex")["Sex"].count())
sns.countplot(x="Sex", data=df_credit, palette="hls", hue = "Risk")
plt.show()

In [None]:
# Looking at Credit amount by Sex
g = sns.catplot(x="Sex", y="Credit amount", data=df_credit, kind="violin",
                palette="hls", height=6, hue="Risk", split=True)
g.set(title="Amount by Sex")
plt.show()

I will create categories of Age and look at the distribution of Credit Amount by Risk...


I will do some exploration of the Job column:
- distribution
- crossed with Credit amount
- crossed with Age

In [None]:
fig, ax = plt.subplots(figsize=(12,12),nrows=3)
sns.countplot(x="Job", data=df_credit, palette="hls", hue="Risk", ax=ax[0])
sns.boxplot(x="Job", y="Credit amount", data=df_credit, palette="hls", ax=ax[1], hue="Risk")
sns.violinplot(x="Job", y="Age", data=df_credit, ax=ax[2],  hue="Risk", split=True, palette="hls")
plt.show()

Looking at the distribution of Credit Amount

In [None]:
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df_credit["Credit amount"], kde=True, stat="density")
plt.show()

Distribution of Saving accounts by Risk

In [None]:
print("Distribution of Saving accounts by Risk: ")
print(pd.crosstab(df_credit["Saving accounts"],df_credit.Risk))

fig, ax = plt.subplots(3,1, figsize=(12,10))
sns.countplot(x="Saving accounts", data=df_credit, palette="hls", ax=ax[0],hue="Risk")
sns.violinplot(x="Saving accounts", y="Job", data=df_credit, palette="hls", hue = "Risk", ax=ax[1],split=True)
sns.boxplot(x="Saving accounts", y="Credit amount", data=df_credit, ax=ax[2], hue = "Risk",palette="hls")
plt.show()


Pretty and interesting distribution...

In [None]:
print("Purpose by Risk: ")
print(pd.crosstab(df_credit.Purpose, df_credit.Risk))

fig, ax = plt.subplots(3,1, figsize=(12,14))
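sns.countplot(x="Purpose", data=df_credit, palette="hls", hue="Risk", ax=ax[0])
# `orient` only accepts "v" or "h"; rotating the tick labels gives the intended 45-degree effect
ax[0].tick_params(axis="x", rotation=45)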
sns.boxplot(x="Purpose", y="Age", data=df_credit, palette="hls", ax=ax[1], hue = "Risk",)
sns.violinplot(x="Purpose", y="Credit amount", data=df_credit, palette="hls", ax=ax[2], hue = "Risk",split=True)
plt.show()

Distribution and density of the loan Duration

In [None]:
fig, ax =plt.subplots(3,1, figsize=(12,10))
sns.countplot(x="Duration", data=df_credit, palette="hls", ax=ax[0], hue = "Risk")
sns.pointplot(x="Duration", y ="Credit amount",data=df_credit,ax=ax[1],hue="Risk", palette="hls")
# histplot with kde=True replaces the deprecated distplot
sns.histplot(df_credit["Duration"], kde=True, stat="density", ax=ax[2])
plt.show()


Interesting: we can see that the longest durations come with the highest credit amounts. <br>
The highest density is around 12, 18, and 24 months.<br>
It all makes sense.


Totals for Checking account, a variable that may be dropped later

In [None]:
print("Total values of the most missing variable: ")
print(df_credit.groupby("Checking account")["Checking account"].count())

sns.countplot(x="Checking account", data=df_credit, palette="hls", hue="Risk")
plt.show()

In [None]:
fig, ax = plt.subplots(2,1, figsize=(12,14))

sns.boxplot(x="Checking account",y="Credit amount", data=df_credit,hue='Risk',palette="hls", ax=ax[0])
sns.violinplot(x="Checking account", y="Age", data=df_credit, palette="hls", ax=ax[1], hue = "Risk",split=True)

plt.show()

A crosstab section, plus a few other views, to explore the data a little more deeply

In [None]:
print(pd.crosstab(df_credit.Sex, df_credit.Job))

In [None]:
# catplot replaces the older factorplot; its `size` argument is now `height`
sns.catplot(x="Housing", y="Job", data=df_credit, kind="violin", height=6, hue="Risk", palette="hls", split=True)
plt.show()

In [None]:
print(pd.crosstab(df_credit["Checking account"],df_credit.Sex))

In [None]:
print(pd.crosstab(df_credit.Purpose, df_credit.Sex))

In [None]:
print("Purpose : ",df_credit.Purpose.unique())
print("Sex : ",df_credit.Sex.unique())
print("Housing : ",df_credit.Housing.unique())
print("Saving accounts : ",df_credit['Saving accounts'].unique())
print("Risk : ",df_credit['Risk'].unique())
print("Checking account : ",df_credit['Checking account'].unique())
print("Age_cat : ",df_credit['Age_cat'].unique())

In [None]:
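# Encoding the categorical columns as integers. The results are assigned back to
# the DataFrame instead of relying on inplace=True through a column accessor.
df_credit["Purpose"] = df_credit["Purpose"].map({'radio/TV': 0, 'education': 1, 'furniture/equipment': 2, 'car': 3, 'business': 4, 'domestic appliances': 5, 'repairs': 6, 'vacation/others': 7})

df_credit["Sex"] = df_credit["Sex"].map({'female': 0, 'male': 1})

df_credit["Housing"] = df_credit["Housing"].map({'own': 0, 'free': 1, 'rent': 2})

# Missing values are NaN, not the string 'nan', so they are filled with 0 explicitly
df_credit["Saving accounts"] = df_credit["Saving accounts"].map({'little': 1, 'moderate': 2, 'quite rich': 3, 'rich': 4}).fillna(0)

df_credit["Risk"] = df_credit["Risk"].map({'good': 0, 'bad': 1})

# Missing values in Checking account stay NaN here
df_credit["Checking account"] = df_credit["Checking account"].map({'little': 0, 'moderate': 1, 'rich': 2})

df_credit["Age_cat"] = df_credit["Age_cat"].astype(object).map({'Student': 0, 'Young': 1, 'Adult': 2, 'Senior': 3})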
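
Integer codes like these impose an order on columns such as Purpose and Housing that have no natural ranking. One alternative, shown here only as a sketch on a made-up `toy` frame (an assumption for illustration), is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per category:

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Purpose": ["car", "radio/TV", "education", "car"],
    "Housing": ["own", "rent", "free", "own"],
    "Age": [22, 45, 31, 58],
})

# One indicator column per category; numeric columns such as Age are left untouched.
encoded = pd.get_dummies(toy, columns=["Purpose", "Housing"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models often cope with integer codes, while linear models such as logistic regression usually benefit more from one-hot columns.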

<h1>Looking at the correlation of the data</h1>

In [None]:
plt.figure(figsize=(14,12))
sns.heatmap(df_credit.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True,  linecolor='white', annot=True)
plt.show()

<h1>Training a model to see the possibilities of prediction using all the variables that we have</h1>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [None]:
# Dropping the columns that contain missing values
del df_credit["Saving accounts"]
del df_credit["Checking account"]

In [None]:
# Creating the X and y variables
# (recent pandas versions require the keyword form instead of a positional axis)
X = df_credit.drop(columns='Risk').values
y = df_credit["Risk"].values

# Splitting X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=42)

# Creating the classifier
model = RandomForestClassifier(n_estimators=10, random_state=0, class_weight="balanced_subsample", )

# Running the fit
model.fit(X_train, y_train)

# Printing the Training Score
print("Training score data: ")
print(model.score(X_train, y_train))

In [None]:
# Testing the model
# Predicting on X_test
y_pred = model.predict(X_test)

# Checking the results obtained
print(accuracy_score(y_test,y_pred))
print("\n")
print(confusion_matrix(y_test, y_pred))
print("\n")
print(classification_report(y_test, y_pred))

I am getting bad classification results. How can I improve my model?

I will try a new approach, because the data is unbalanced. Let me know what is wrong and how I can do it the correct way.
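
With unbalanced classes, a plain random split can leave the train and test sets with different class proportions. A small sketch of a stratified split follows; the `X` and `y` arrays are made up with a 70/30 class ratio (an assumption for illustration), and `stratify=y` asks `train_test_split` to keep that ratio in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 70 of class 0 and 30 of class 1.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

# stratify=y preserves the class proportions in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```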

In [None]:
from sklearn.utils import resample
from sklearn.metrics import roc_curve

In [None]:
# Splitting the data into train (70%) and test (30%) sets
df_train, df_test = train_test_split(df_credit, test_size=0.3, random_state=42)

In [None]:
#train_high = df_train[df_train["Risk"] == 0] 
#train_lower = df_train[df_train["Risk"] == 1]

#train_resample = resample(train_high, replace=False, n_samples=len(train_lower), random_state=123)

#train_sample = pd.concat([train_resample, train_lower])
#train_sample.Risk.value_counts()
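
The commented-out cell above outlines downsampling the majority class with `resample`. As a self-contained sketch of that idea on made-up data (the `toy_train` frame is an assumption for illustration), the majority class is sampled without replacement down to the size of the minority class:

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training frame: 8 good (0) rows and 3 bad (1) rows.
toy_train = pd.DataFrame({
    "Credit amount": range(1000, 12000, 1000),
    "Risk": [0] * 8 + [1] * 3,
})

majority = toy_train[toy_train["Risk"] == 0]
minority = toy_train[toy_train["Risk"] == 1]

# Sample the majority class without replacement down to the minority size.
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=123
)

balanced = pd.concat([majority_down, minority])
print(balanced["Risk"].value_counts())
```

Only the training set should be resampled; the test set is usually left with its original class proportions so that the scores reflect real conditions.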

In [None]:
# Separating the features and the target for the train and test sets
X_train = df_train.drop(columns='Risk').values
y_train = df_train["Risk"].values

X_test = df_test.drop(columns='Risk').values
y_test = df_test["Risk"].values

In [None]:
# Creating the random forest classifier
rf = RandomForestClassifier(n_estimators=5, random_state=0)

# Fitting with train data
rf.fit(X_train, y_train)

In [None]:
# Printing the training score of the newly fitted classifier
print("Training score data: ")
print(rf.score(X_train, y_train))

How can I improve these models? What is the correct way to evaluate them? What is the best technique to test models?
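
One common answer to the evaluation question is stratified k-fold cross-validation with a metric that is less sensitive to class imbalance than accuracy, such as ROC AUC. The sketch below runs on synthetic data from `make_classification` with roughly a 70/30 class split (an assumption standing in for the credit data), and the hyperparameters are illustrative rather than tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with roughly a 70/30 class split.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=42
)

clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)

# Each fold keeps the class proportions; every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")

print("ROC AUC per fold:", scores.round(3))
print("Mean ROC AUC: %.3f" % scores.mean())
```

A score on the training data alone says little, because a random forest can nearly memorize its training set; the spread of the per-fold scores gives a more honest picture of how the model generalizes.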



<h2>Please let me know how I can improve my predictions and whether I could do something differently! Feel free to comment below.</h2>
