First, I'll import all the libraries that will be needed:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
from sklearn.metrics import r2_score
import seaborn as sns 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

Ok, now I will load the given data. It's a CSV file, so I use read_csv.

In [2]:
df = pd.read_csv('Dane_bank_nowe.csv', sep=',')

**It is good practice to work on a copy of the data, not on the original.** If we edit the original several times and after 3 hours it turns out that deleting one of the columns was a mistake, working on a copy will allow us to quickly restore that column :) So..

In [3]:
df_copy = df.copy(deep = True)
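
To show why the copy helps, here is a minimal sketch on a tiny made-up frame (the names `original` and `working` and the two rows are only for illustration, they are not the real dataset). A column dropped from the deep copy is still available in the original, so it can be put back:

```python
import pandas as pd

# Tiny stand-in for the real data (illustrative values only).
original = pd.DataFrame({"CreditScore": [619, 608],
                         "Surname": ["Hargrave", "Hill"]})
working = original.copy(deep=True)

# Drop a column from the working copy only.
working.drop(["Surname"], axis=1, inplace=True)
print(list(original.columns))  # the original is untouched

# Changed our mind: restore the column from the original.
working["Surname"] = original["Surname"]
print(list(working.columns))
```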

### Let's see what we have here ;)

In [4]:
df_copy.head()

,Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,886607.9,1
1,1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,916554.56,0
2,2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,849781.25,1
3,3,4,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,1367384.5,0
4,4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,1504164.2,0


In [5]:
df_copy.duplicated().sum()

0

Ok, there are no duplicates in this file, which is good news ;)
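
One caveat: `duplicated()` compares entire rows by default. If per-row identifiers such as RowNumber are unique, no two rows can be identical, so the count is 0 regardless of the other columns. A small sketch on made-up rows (the `toy` frame and its values are my assumption, not the real data) shows how the `subset` parameter restricts the check to chosen columns:

```python
import pandas as pd

# Made-up rows: the last two differ only in the row identifier.
toy = pd.DataFrame({"RowNumber": [1, 2, 3],
                    "CreditScore": [619, 608, 608],
                    "Age": [42, 41, 41]})

# Whole-row comparison: the unique RowNumber hides the repeat.
full_row_dups = int(toy.duplicated().sum())

# Compare only the columns that describe the customer.
subset_dups = int(toy.duplicated(subset=["CreditScore", "Age"]).sum())

print(full_row_dups, subset_dups)
```

Whether such a repeat is a real duplicate or just two similar customers still has to be judged from the data.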

There are a few columns that are not useful for further analysis:
* Unnamed: 0
* RowNumber
* CustomerId
* Surname (I assume that a surname has no effect on creditworthiness ;) )

So, goodbye columns!

In [6]:
to_drop = ['RowNumber', 'Surname', 'CustomerId', 'Unnamed: 0']
df_copy.drop(to_drop, inplace=True, axis=1)
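
A side note on where `Unnamed: 0` usually comes from: pandas gives that name to a column whose header is empty, which typically happens when a DataFrame index was written to the CSV. Assuming that is the case here (I have not verified it for this file), `index_col=0` would read it back as the index instead. A sketch with an in-memory CSV (the text below is made up for illustration):

```python
import io
import pandas as pd

# Made-up CSV text whose first header cell is empty,
# as produced by to_csv() when the index is written out.
csv_text = ",RowNumber,CreditScore\n0,1,619\n1,2,608\n"

with_unnamed = pd.read_csv(io.StringIO(csv_text))
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_unnamed.columns))
print(list(with_index.columns))
```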

Did it work?

In [7]:
df_copy.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,886607.9,1
1,608,Spain,Female,41,1,83807.86,1,0,1,916554.56,0
2,502,France,Female,42,8,159660.8,3,1,0,849781.25,1
3,699,France,Female,39,1,0.0,2,0,0,1367384.5,0
4,850,Spain,Female,43,2,125510.82,1,1,1,1504164.2,0


Yup, it's ok :)

In the next step I'm going to check the non-numeric columns and convert their values to numbers. This makes the work much easier, and it's necessary e.g. for plotting charts.

In [8]:
df_copy.Geography.unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [9]:
df_copy.Gender.unique()

array(['Female', 'Male'], dtype=object)
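
Instead of checking each column by hand, the non-numeric columns can also be listed automatically. This is only a sketch on a made-up frame (`toy` and its values are my assumption); `select_dtypes` is a regular pandas method:

```python
import pandas as pd

# Made-up frame with one numeric and two text columns.
toy = pd.DataFrame({"CreditScore": [619, 608],
                    "Geography": ["France", "Spain"],
                    "Gender": ["Female", "Female"]})

# Keep only the columns that are not numeric.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Show the distinct values of each of them.
for col in non_numeric:
    print(col, toy[col].unique())
```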

In [10]:
d = {'France':1,'Spain':2,'Germany':3, 'Female':0, 'Male':1}
df_copy = df_copy.replace(d)
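
The single dictionary above is applied to every column of the frame. A more defensive variant, sketched here on made-up rows (the names `geography_codes`, `gender_codes` and `encoded` are mine), maps each column with its own dictionary, so a value can never be replaced in the wrong column:

```python
import pandas as pd

# Made-up rows with the same labels as the real columns.
toy = pd.DataFrame({"Geography": ["France", "Spain", "Germany"],
                    "Gender": ["Female", "Female", "Male"]})

geography_codes = {"France": 1, "Spain": 2, "Germany": 3}
gender_codes = {"Female": 0, "Male": 1}

encoded = toy.copy()
encoded["Geography"] = encoded["Geography"].map(geography_codes)
encoded["Gender"] = encoded["Gender"].map(gender_codes)

# map() leaves NaN for labels missing from the dictionary,
# so this count should be 0 if every label was covered.
unmapped = int(encoded.isna().sum().sum())
print(encoded)
print(unmapped)
```

Keep in mind that the codes 1, 2, 3 for Geography suggest an order (France < Spain < Germany) that a linear model would treat as meaningful.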

In [11]:
df_copy.head()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,1,0,42,2,0.0,1,1,1,886607.9,1
1,608,2,0,41,1,83807.86,1,0,1,916554.56,0
2,502,1,0,42,8,159660.8,3,1,0,849781.25,1
3,699,1,0,39,1,0.0,2,0,0,1367384.5,0
4,850,2,0,43,2,125510.82,1,1,1,1504164.2,0


Much better, isn't it?

Ok, now let's see more information about the data:

In [12]:
df_copy.describe()

,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,650.5288,1.7495,0.5457,38.9218,5.0128,76485.889288,1.5302,0.7055,0.5151,1130141.0,0.2037
std,96.653299,0.830433,0.497932,10.487806,2.892174,62397.405202,0.581654,0.45584,0.499797,307852.8,0.402769
min,350.0,1.0,0.0,18.0,0.0,0.0,1.0,0.0,0.0,191477.9,0.0
25%,584.0,1.0,0.0,32.0,3.0,0.0,1.0,0.0,0.0,872845.0,0.0
50%,652.0,1.0,1.0,37.0,5.0,97198.54,1.0,1.0,1.0,1160135.0,0.0
75%,718.0,3.0,1.0,44.0,7.0,127644.24,2.0,1.0,1.0,1363969.0,0.0
max,850.0,3.0,1.0,92.0,10.0,250898.09,4.0,1.0,1.0,1730501.0,1.0
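
A hint for reading this table: for a 0/1 column the mean is simply the share of ones, so the Exited mean of 0.2037 says that about 20% of customers exited. Likewise, the 25% quantile of Balance being 0.0 means at least a quarter of customers have a zero balance. The sketch below checks the same idea on the five rows shown by `head()` earlier (the `toy` frame is just those five values retyped, not the whole dataset):

```python
import pandas as pd

# Exited and Balance of the first five rows shown above.
toy = pd.DataFrame({"Exited": [1, 0, 1, 0, 0],
                    "Balance": [0.0, 83807.86, 159660.8, 0.0, 125510.82]})

summary = toy.describe()

# Mean of a 0/1 column = fraction of rows equal to 1.
exited_share = summary.loc["mean", "Exited"]

# Fraction of rows with a zero balance.
zero_balance_share = (toy["Balance"] == 0).mean()

print(exited_share, zero_balance_share)
```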
