<a href="https://colab.research.google.com/github/carolinaberrafato/project-sklearn-si/blob/main/Projeto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project - Intelligent Systems

Group:
- Maria Carolina Santos Berrafato (mcsb3)
- Victor Luiz de Paula Lima (vlpl)
- Victor Matheus de Azevedo Pereira (vmap)

The chosen dataset, linked further below, provides data on customers of a bank. The target column records whether a given customer accepted a personal loan offered in the bank's most recent marketing campaign.

We therefore use three algorithms (kNN, decision trees, and Naive Bayes) to classify whether a given customer accepts the offered loan.

Columns:
- ID: Customer ID;
- Age: Customer's age in completed years;
- Experience: Years of professional experience;
- Income: Customer's annual income;
- ZIP Code: ZIP code of the home address;
- Family: Size of the customer's family;
- CCAvg: Average monthly credit card spending;
- Education: Education level (1: Undergraduate | 2: Graduate | 3: Advanced/Professional);
- Mortgage: Value of the house mortgage, if any;
- Personal Loan: Did the customer accept the loan offered in the previous campaign? (target column);
- Securities Account: Does the customer have a securities account with the bank?;
- CD Account: Does the customer have a certificate of deposit (CD) account with the bank?;
- Online: Does the customer use the bank's online facilities?;
- CreditCard: Does the customer have a credit card issued by this bank?

## Setting up the environment, importing the dataset from Kaggle, and reading the data


_Dataset link: https://www.kaggle.com/datasets/teertha/personal-loan-modeling?resource=download_

_To make the following commands work, I followed this tutorial: https://medium.com/ml-book/how-to-import-kaggle-data-in-google-colab-c286de376fe1_

**Note: the code only runs on my machine because it is connected to my Drive, which contains a folder named "Projeto".**
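
Since the paths are tied to one Drive layout, a reader who wants to run the notebook elsewhere could make the base directory configurable. The following is only a minimal sketch: `PROJECT_DIR` is a hypothetical environment variable that is not part of the original notebook, and the default is the path the notebook uses.

```python
import os
from pathlib import Path

# PROJECT_DIR is a hypothetical environment variable (an assumption);
# the fallback is the Drive folder this notebook relies on.
project_dir = Path(os.environ.get("PROJECT_DIR", "/content/gdrive/MyDrive/Projeto"))

# Folder where the zip is extracted, and the CSV file inside it.
data_dir = project_dir / "projectDataset"
csv_path = data_dir / "Bank_Personal_Loan_Modelling.csv"

print(csv_path)
```

With absolute paths built this way, `pd.read_csv(csv_path)` would not depend on the current working directory.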



In [33]:
from google.colab import drive
drive.mount('/content/gdrive') # Mount Google Drive in this notebook

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [34]:
import os
os.environ["KAGGLE_CONFIG_DIR"] = '/content/gdrive/MyDrive/Projeto' # Point the Kaggle CLI at the "Projeto" directory on Drive

In [35]:
# Change to the "Projeto" directory on Drive (this relative path only resolves when run from /content)

%cd gdrive/MyDrive/Projeto

[Errno 2] No such file or directory: 'gdrive/MyDrive/Projeto'
/content/gdrive/MyDrive/Projeto/projectDataset


In [36]:
!kaggle datasets download -d teertha/personal-loan-modeling # Download the dataset (zip file) into the Projeto directory

personal-loan-modeling.zip: Skipping, found more recently modified local copy (use --force to force download)


In [37]:
!unzip personal-loan-modeling.zip -d projectDataset # Extract the zip file into a new folder (projectDataset) inside the Projeto directory

Archive:  personal-loan-modeling.zip
replace projectDataset/Bank_Personal_Loan_Modelling.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [38]:
# Change to the "projectDataset" directory on Drive

%cd projectDataset

/content/gdrive/MyDrive/Projeto/projectDataset/projectDataset


In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import f1_score
import seaborn as sns
import matplotlib.pyplot as plt

In [40]:
# Read the CSV file (dataset)

dataset = pd.read_csv("Bank_Personal_Loan_Modelling.csv")
dataset.head(15) # Show the first 15 rows of the dataset

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
5,6,37,13,29,92121,4,0.4,2,155,0,0,0,1,0
6,7,53,27,72,91711,2,1.5,2,0,0,0,0,1,0
7,8,50,24,22,93943,1,0.3,3,0,0,0,0,0,1
8,9,35,10,81,90089,3,0.6,2,104,0,0,0,1,0
9,10,34,9,180,93023,1,8.9,3,0,1,0,0,0,0


In [41]:
# Replace spaces in the column names with underscores and convert them to lowercase

dataset.columns = [column.replace(' ', '_').lower() for column in dataset.columns]

dataset.head(15)

Unnamed: 0,id,age,experience,income,zip_code,family,ccavg,education,mortgage,personal_loan,securities_account,cd_account,online,creditcard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1
5,6,37,13,29,92121,4,0.4,2,155,0,0,0,1,0
6,7,53,27,72,91711,2,1.5,2,0,0,0,0,1,0
7,8,50,24,22,93943,1,0.3,3,0,0,0,0,0,1
8,9,35,10,81,90089,3,0.6,2,104,0,0,0,1,0
9,10,34,9,180,93023,1,8.9,3,0,1,0,0,0,0


In [42]:
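# Features: every column except the identifier, the ZIP code, and the target.
# drop() returns a new DataFrame, so later changes to `dataset` do not propagate to x.
x = dataset.drop(['id', 'zip_code', 'personal_loan'], axis=1)
# Target: whether the customer accepted the personal loan
y = dataset.personal_loan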

## Analyzing the data before applying the algorithms

In [43]:
# Check for missing values in each column

dataset.isnull().sum()

id                    0
age                   0
experience            0
income                0
zip_code              0
family                0
ccavg                 0
education             0
mortgage              0
personal_loan         0
securities_account    0
cd_account            0
online                0
creditcard            0
dtype: int64

In [44]:
# Hold out 20% for testing, then 20% of the remainder for validation (64% / 16% / 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.2, random_state=0)
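
The two-step split above yields 3200 training, 800 validation, and 1000 test rows out of 5000. Because only about 9.6% of the customers accepted the loan, one option worth considering is a stratified split, which keeps that proportion similar in every subset. The sketch below uses synthetic placeholder data, and the `stratify` argument is an addition that the notebook itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 5000-row dataset (feature values are placeholders).
x_demo = np.arange(5000).reshape(-1, 1)
y_demo = np.array([1] * 480 + [0] * 4520)  # 9.6% positives

# Same two-step split as the notebook, with stratify added (an assumption).
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)
x_tr, x_va, y_tr, y_va = train_test_split(
    x_tr, y_tr, test_size=0.2, random_state=0, stratify=y_tr)

print(len(x_tr), len(x_va), len(x_te))         # subset sizes
print(y_tr.mean(), y_va.mean(), y_te.mean())   # share of positives per subset
```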

In [45]:
dataset.describe()

Unnamed: 0,id,age,experience,income,zip_code,family,ccavg,education,mortgage,personal_loan,securities_account,cd_account,online,creditcard
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,45.3384,20.1046,73.7742,93152.503,2.3964,1.937938,1.881,56.4988,0.096,0.1044,0.0604,0.5968,0.294
std,1443.520003,11.463166,11.467954,46.033729,2121.852197,1.147663,1.747659,0.839869,101.713802,0.294621,0.305809,0.23825,0.490589,0.455637
min,1.0,23.0,-3.0,8.0,9307.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1250.75,35.0,10.0,39.0,91911.0,1.0,0.7,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2500.5,45.0,20.0,64.0,93437.0,2.0,1.5,2.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,3750.25,55.0,30.0,98.0,94608.0,3.0,2.5,3.0,101.0,0.0,0.0,0.0,1.0,1.0
max,5000.0,67.0,43.0,224.0,96651.0,4.0,10.0,3.0,635.0,1.0,1.0,1.0,1.0,1.0


Describing the dataset reveals a minimum value of -3 in the experience column, which is not a valid number of years. We therefore replace each negative value with the mean experience of customers of similar age.

In [46]:
dataset[dataset.experience < 0].experience.value_counts()

-1    33
-2    15
-3     4
Name: experience, dtype: int64

In [47]:
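# For each distinct negative experience value, replace it with the mean
# experience of customers whose age matches any customer having that value
for oddExp in dataset[dataset.experience < 0].experience.unique():
    # Ages of the customers that have this negative value
    ageForOddExp = dataset[dataset.experience == oddExp].age.value_counts().index.tolist()

    for i in dataset[dataset.experience == oddExp].experience.index.tolist():
        # Mean over customers of those ages with positive experience
        dataset.loc[i, 'experience'] = dataset[(dataset.age.isin(ageForOddExp)) & (dataset.experience > 0)].experience.mean()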
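
A per-age variant of the same idea can be written without explicit loops. This is a sketch on toy data with hypothetical values, and it is not equivalent to the loop above: it averages within each exact age instead of pooling the ages that share a negative value, and it counts an experience of 0 as valid. An age group with no valid values would stay `NaN` and need a fallback.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the real dataset.
toy = pd.DataFrame({
    "age":        [24, 24, 24, 25, 25, 25],
    "experience": [-1,  1,  3, -2,  2,  4],
})

# Treat negative experience as missing (NaN)...
valid = toy["experience"].where(toy["experience"] >= 0)

# ...then fill each gap with the mean of the valid values for the same age.
toy["experience"] = valid.fillna(valid.groupby(toy["age"]).transform("mean"))

print(toy)
```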


In [48]:
dataset.describe()

Unnamed: 0,id,age,experience,income,zip_code,family,ccavg,education,mortgage,personal_loan,securities_account,cd_account,online,creditcard
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,45.3384,20.149833,73.7742,93152.503,2.3964,1.937938,1.881,56.4988,0.096,0.1044,0.0604,0.5968,0.294
std,1443.520003,11.463166,11.391004,46.033729,2121.852197,1.147663,1.747659,0.839869,101.713802,0.294621,0.305809,0.23825,0.490589,0.455637
min,1.0,23.0,0.0,8.0,9307.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1250.75,35.0,10.0,39.0,91911.0,1.0,0.7,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2500.5,45.0,20.0,64.0,93437.0,2.0,1.5,2.0,0.0,0.0,0.0,0.0,1.0,0.0
75%,3750.25,55.0,30.0,98.0,94608.0,3.0,2.5,3.0,101.0,0.0,0.0,0.0,1.0,1.0
max,5000.0,67.0,43.0,224.0,96651.0,4.0,10.0,3.0,635.0,1.0,1.0,1.0,1.0,1.0


## Training

In [49]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
x_valid = scaler.transform(x_valid)
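
The scaler is fitted on the training set only and then reused for the other subsets, so no statistics from the test or validation data leak into training. A small sketch with synthetic data (the arrays and the name `scaler_demo` are illustrative, not from the notebook) shows the effect: the training features end up with mean 0 and standard deviation 1, while the held-out data is only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features drawn from the same distribution (illustrative only).
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
held_out = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)   # learns mean/std from train
held_out_scaled = scaler_demo.transform(held_out) # reuses the train statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(held_out_scaled.mean(axis=0), held_out_scaled.std(axis=0))
```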

### Naive Bayes

In [50]:
from sklearn.naive_bayes import GaussianNB as NB

classifier_nb = NB()
classifier_nb.fit(x_train, y_train)

In [51]:
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report

# Macro-averaged precision: unweighted mean of the per-class precisions
print("Training score:", precision_score(y_train, classifier_nb.predict(x_train), average="macro"))
print(classification_report(y_train, classifier_nb.predict(x_train)))

Training score: 0.7038051173915566
              precision    recall  f1-score   support

           0       0.96      0.92      0.94      2884
           1       0.45      0.60      0.52       316

    accuracy                           0.89      3200
   macro avg       0.70      0.76      0.73      3200
weighted avg       0.91      0.89      0.90      3200



In [52]:
# Micro-averaged precision, which equals accuracy for single-label classification
print("Test score:", precision_score(y_test, classifier_nb.predict(x_test), average="micro"))
print(classification_report(y_test, classifier_nb.predict(x_test)))

Test score: 0.892
              precision    recall  f1-score   support

           0       0.96      0.92      0.94       910
           1       0.43      0.62      0.51        90

    accuracy                           0.89      1000
   macro avg       0.70      0.77      0.72      1000
weighted avg       0.91      0.89      0.90      1000



In [53]:
# Micro-averaged precision, which equals accuracy for single-label classification
print("Validation score:", precision_score(y_valid, classifier_nb.predict(x_valid), average="micro"))
print(classification_report(y_valid, classifier_nb.predict(x_valid)))

Validation score: 0.87125
              precision    recall  f1-score   support

           0       0.96      0.90      0.93       726
           1       0.38      0.59      0.46        74

    accuracy                           0.87       800
   macro avg       0.67      0.75      0.69       800
weighted avg       0.90      0.87      0.88       800
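
The training cell uses `average="macro"` while the test and validation cells use `average="micro"`, so the three scores are not directly comparable. The sketch below, on a small hand-made example (the labels are hypothetical), shows the difference: micro-averaged precision coincides with accuracy, while macro averaging gives the minority class the same weight as the majority class.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical labels: 6 negatives and 4 positives.
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

micro = precision_score(y_true_demo, y_pred_demo, average="micro")
macro = precision_score(y_true_demo, y_pred_demo, average="macro")
acc = accuracy_score(y_true_demo, y_pred_demo)

print("micro precision:", micro)
print("accuracy:       ", acc)
print("macro precision:", macro)
```

Here 6 of the 10 predictions are correct, so micro precision and accuracy are both 0.6, while macro precision is the mean of 4/6 (class 0) and 2/4 (class 1).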



### Decision trees

In [54]:
from sklearn.tree import DecisionTreeClassifier as DTC

classifier_dt = DTC(criterion = "gini", random_state = 50)
classifier_dt.fit(x_train, y_train)

In [55]:
pred_y_test = classifier_dt.predict(x_test)

# Micro-averaged precision, which equals accuracy for single-label classification
score_teste = precision_score(y_test, pred_y_test, average="micro")

print(f'Test score: {score_teste}')
print(classification_report(y_test, pred_y_test))

Test score: 0.982
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       910
           1       0.90      0.90      0.90        90

    accuracy                           0.98      1000
   macro avg       0.95      0.95      0.95      1000
weighted avg       0.98      0.98      0.98      1000
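
Only the test score is reported for the tree, and `confusion_matrix` is imported earlier but never used. As a possible follow-up, the sketch below shows how the training and test accuracies and the confusion matrix could be inspected together. It runs on synthetic data from `make_classification` (an assumption, so that it is self-contained), and its numbers say nothing about the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem (roughly 90% / 10%), illustrative only.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0)
X_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

# Same settings as the notebook's tree: Gini criterion, no depth limit.
tree_demo = DecisionTreeClassifier(criterion="gini", random_state=50)
tree_demo.fit(X_syn_tr, y_syn_tr)

# An unpruned tree usually fits the training data almost perfectly, so the
# gap between these two accuracies hints at overfitting.
print("train accuracy:", tree_demo.score(X_syn_tr, y_syn_tr))
print("test accuracy: ", tree_demo.score(X_syn_te, y_syn_te))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_syn_te, tree_demo.predict(X_syn_te))
print(cm)
```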



### kNN

In [56]:
from sklearn.neighbors import KNeighborsClassifier as KNN

k_range = tuple(range(1, 20))
k_scores_train = []
k_scores_valid = []

for k in k_range:
    classifier_knn = KNN(n_neighbors = k)
    classifier_knn.fit(x_train, y_train)
    k_scores_train.append(classifier_knn.score(x_train, y_train))
    k_scores_valid.append(classifier_knn.score(x_valid, y_valid))

print(f'Training score: {k_scores_train}')
print(f'Validation score: {k_scores_valid}')

Training score: [1.0, 0.9703125, 0.9759375, 0.96375, 0.9696875, 0.9603125, 0.9646875, 0.9571875, 0.96, 0.955, 0.956875, 0.9528125, 0.95625, 0.949375, 0.951875, 0.9471875, 0.9503125, 0.9459375, 0.9484375]
Validation score: [0.9525, 0.95125, 0.955, 0.95875, 0.96, 0.95375, 0.95625, 0.95, 0.955, 0.95, 0.9525, 0.95, 0.95, 0.9475, 0.94625, 0.945, 0.94625, 0.945, 0.94625]
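
A natural next step is to pick the `k` with the highest validation accuracy. The sketch below copies the second list from the output above (the one computed on `x_valid`) and selects the best value in plain Python. Choosing by validation accuracy alone is one possible criterion, not necessarily the one the authors intended.

```python
# Validation accuracies copied from the output above, for k = 1..19.
k_scores_valid = [0.9525, 0.95125, 0.955, 0.95875, 0.96, 0.95375, 0.95625,
                  0.95, 0.955, 0.95, 0.9525, 0.95, 0.95, 0.9475, 0.94625,
                  0.945, 0.94625, 0.945, 0.94625]
k_values = range(1, 20)

# Pair each k with its score and keep the pair with the highest score.
best_k, best_score = max(zip(k_values, k_scores_valid), key=lambda pair: pair[1])

print(best_k, best_score)  # → 5 0.96
```

The final classifier could then be refitted with `n_neighbors=best_k` and evaluated once on the test set.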
