
## Installing the required packages

In [None]:
!python3 -m pip install optuna



# Exploratory analysis and data preparation

In [None]:
'''
Antônio Paulino - apln2@cin.ufpe.br
Ailton Rodrigues - ajr@cin.ufpe.br
Douglas Tavares - dtrps@cin.ufpe.br

Carry out the problem understanding, data understanding and exploratory
analysis activities for the Credit Approval Data Set
(https://archive.ics.uci.edu/ml/datasets/Credit+Approval).
Present reports covering the items above, with discussion and plots of the
dataset.
'''

DATA_FOLDER = (
    'https://archive.ics.uci.edu/ml/machine-learning-databases/'
    'credit-screening/'
)

DATA_DESCRIPTION = DATA_FOLDER + 'crx.names'
DATA_SET = DATA_FOLDER + 'crx.data'

In [None]:
import pandas
import numpy

aliases = [
  'Gender', 'Age', 'Debt', 'Married', 'BankCustomer', 'EducationLevel',
  'Ethnicity', 'YearsEmployed', 'PriorDefault', 'Employed', 'CreditScore',
  'DriversLicense', 'Citizen', 'ZipCode', 'Income', 'Approved'
]
data = pandas.read_csv(DATA_SET, names=aliases, na_values='?', header=None)
data.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,Ethnicity,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,ZipCode,Income,Approved
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+


#### Removing the variables Ethnicity (A7) and ZipCode (A14), as they do not influence the target variable

In [None]:
# removing useless variables A7 (Ethnicity) and A14 (ZipCode)

data.drop(['Ethnicity', 'ZipCode'], axis=1, inplace=True)
data.head()

Unnamed: 0,Gender,Age,Debt,Married,BankCustomer,EducationLevel,YearsEmployed,PriorDefault,Employed,CreditScore,DriversLicense,Citizen,Income,Approved
0,b,30.83,0.0,u,g,w,1.25,t,t,1,f,g,0,+
1,a,58.67,4.46,u,g,q,3.04,t,t,6,f,g,560,+
2,a,24.5,0.5,u,g,q,1.5,t,f,0,f,g,824,+
3,b,27.83,1.54,u,g,w,3.75,t,t,5,t,g,3,+
4,b,20.17,5.625,u,g,w,1.71,t,f,0,f,s,0,+


#### Splitting the variables into continuous and categorical

In [None]:
continuous = data.describe().columns
categorical = data.drop(list(continuous) + ['Approved'], axis=1).columns

print(continuous)
print(categorical)

Index(['Age', 'Debt', 'YearsEmployed', 'CreditScore', 'Income'], dtype='object')
Index(['Gender', 'Married', 'BankCustomer', 'EducationLevel', 'PriorDefault',
       'Employed', 'DriversLicense', 'Citizen'],
      dtype='object')
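
The same split can also be obtained by selecting columns by dtype instead of relying on `describe()` reporting only numeric columns. The sketch below is a minimal illustration on a made-up frame: `toy` and its values are hypothetical and not taken from the dataset.

```python
import pandas

# Hypothetical miniature frame that mimics the mix of column types above.
toy = pandas.DataFrame({
    'Age': [30.83, 58.67, 24.5],
    'Gender': ['b', 'a', 'a'],
    'CreditScore': [1, 6, 0],
    'Approved': ['+', '+', '-'],
})

# Numeric columns are treated as continuous ...
toy_continuous = toy.select_dtypes(include='number').columns
# ... and every remaining column except the target as categorical.
toy_categorical = toy.drop(
    columns=list(toy_continuous) + ['Approved']
).columns

print(list(toy_continuous))
print(list(toy_categorical))
```

Selecting by dtype makes the intent explicit, but it assumes that no categorical column has been stored as a number.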


# Data cleaning

## Linear regression model to fill in missing continuous values

Missing continuous values are filled with predictions from a linear regression model fitted between the column with missing values and the complete column most strongly correlated with it.

In [None]:
continuous_columns_missing_values = []

for column in continuous:
  if data[column].isnull().sum() > 0:
    continuous_columns_missing_values.append(column)

print(continuous_columns_missing_values)

['Age']


In [None]:
most_correlated_columns = {}
candidates = [
  x for x in continuous if x not in continuous_columns_missing_values
]
for column in continuous_columns_missing_values:
  most_correlated_columns[column] = max(
      candidates, key=lambda x: abs(data[x].corr(data[column]))
  )

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
prediction_models = {}

# Fit one simple linear regression per column with missing values, using its
# most correlated complete column as the single predictor.
for target, predictor in most_correlated_columns.items():
  rows = data[[target, predictor]].dropna()
  lr = LinearRegression()
  lr.fit(rows[predictor].values.reshape(-1, 1), rows[target])
  prediction_models[target] = lr
  predicted = lr.predict(data[predictor].values.reshape(-1, 1))
  # Keep observed values; use the prediction only where the target is missing.
  data[target] = numpy.where(data[target].isna(), predicted, data[target])

data[continuous].isna().sum()

Age              0
Debt             0
YearsEmployed    0
CreditScore      0
Income           0
dtype: int64

# Categorical data

Most machine learning algorithms expect numeric input, so the categorical variables of the dataset were converted to integers using the *LabelEncoder* from the *sklearn* library.

Missing values were filled in using a decision tree algorithm.

In [None]:
categorical_columns_missing_values = [
  p[0] for p in dict(data[categorical].isna().sum() > 0).items() if p[1]
]
complete_data = data.dropna()
print(categorical_columns_missing_values)

['Gender', 'Married', 'BankCustomer', 'EducationLevel']
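
Because the imputed categories have to be mapped back to their original symbols, the encoding must be reversible. The sketch below, on made-up values (the `toy` frame is hypothetical), illustrates the `defaultdict(LabelEncoder)` pattern: one encoder is created lazily per column, and `inverse_transform` recovers the original labels.

```python
from collections import defaultdict

import pandas
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns, loosely modelled on the dataset.
toy = pandas.DataFrame({
    'Gender': ['b', 'a', 'a', 'b'],
    'Married': ['u', 'y', 'u', 'u'],
})

# One LabelEncoder per column, created on first access.
encoders = defaultdict(LabelEncoder)
encoded = toy.apply(lambda col: encoders[col.name].fit_transform(col))

# Classes are sorted before being numbered, so 'a' -> 0 and 'b' -> 1.
print(encoded)

# Decoding with the same encoders restores the original symbols.
decoded = encoded.apply(
    lambda col: pandas.Series(
        encoders[col.name].inverse_transform(col), index=col.index
    )
)
print((decoded == toy).all().all())
```

The round trip only works if the encoder used for decoding is the one that produced the codes; refitting an encoder on different data can silently change the mapping.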


In [None]:
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict

label_dict = defaultdict(LabelEncoder)
complete_data = complete_data.apply(
    lambda x: label_dict[x.name].fit_transform(x)
    if x.name in list(categorical) + ['Approved']
    else x
)

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
trees = {}
X = complete_data.drop(categorical_columns_missing_values, axis=1)
for column in categorical_columns_missing_values:
  Y = complete_data[column]
  tree = DecisionTreeClassifier(
      max_leaf_nodes=Y.nunique(), random_state=2**Y.nunique()
  )
  trees[column] = tree
  tree.fit(X.values, Y.values)

In [None]:
for column in trees:
  tree = trees[column]
  encoder = label_dict[column]
  d = pandas.DataFrame(data= {
      'value' : encoder.inverse_transform(
          tree.predict(
          data.drop(categorical_columns_missing_values, axis=1).apply(
                  lambda x: label_dict[x.name].fit_transform(x)
                  if x.name in list(categorical) + ['Approved']
                  else x
          ).values
        )
      )
    }
  )

  data[column] = numpy.where(data[column].isna(), d['value'], data[column])

In [None]:
labels = data['Approved']
data.drop('Approved', axis=1, inplace=True)
X = data.apply(
    lambda x: label_dict[x.name].fit_transform(x)
    if x.name in categorical
    else x
)
print(X)

     Gender    Age    Debt  ...  DriversLicense  Citizen  Income
0         1  30.83   0.000  ...               0        0       0
1         0  58.67   4.460  ...               0        0     560
2         0  24.50   0.500  ...               0        0     824
3         1  27.83   1.540  ...               1        0       3
4         1  20.17   5.625  ...               0        2       0
..      ...    ...     ...  ...             ...      ...     ...
685       1  21.08  10.085  ...               0        0       0
686       0  22.67   0.750  ...               1        0     394
687       0  25.25  13.500  ...               1        0       1
688       1  17.92   0.205  ...               0        0     750
689       1  35.00   3.375  ...               1        0       0

[690 rows x 13 columns]


In [None]:
Y = pandas.DataFrame(
    LabelEncoder().fit_transform(labels), columns=[labels.name]
)
print(Y)

     Approved
0           0
1           0
2           0
3           0
4           0
..        ...
685         1
686         1
687         1
688         1
689         1

[690 rows x 1 columns]


# Splitting the instances into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(
    X.values, Y.values, test_size = 0.2, random_state = 4
)

In [None]:
X_train = pandas.DataFrame(X_train, columns=X.columns)
X_test = pandas.DataFrame(X_test, columns=X.columns)
Y_train = pandas.DataFrame(Y_train, columns=Y.columns)
Y_test = pandas.DataFrame(Y_test, columns=Y.columns)

In [None]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

(552, 13)
(552, 1)
(138, 13)
(138, 1)


# Data transformation

* Since the continuous variables take values between 0 and some upper bound, they are normalized to the range 0.0 to 1.0 before the dimensionality reduction analysis

In [None]:
X_train[continuous].describe()

Unnamed: 0,Age,Debt,YearsEmployed,CreditScore,Income
count,552.0,552.0,552.0,552.0,552.0
mean,31.375927,4.723342,2.222554,2.574275,1016.20471
std,11.79137,4.95801,3.34217,5.163208,5328.577631
min,13.75,0.0,0.0,0.0,0.0
25%,22.5,0.875,0.165,0.0,0.0
50%,28.448036,2.75,1.0,0.0,5.5
75%,37.52,7.3125,2.55125,3.0,462.25
max,80.25,28.0,28.5,67.0,100000.0


In [None]:
X_test[continuous].describe()

Unnamed: 0,Age,Debt,YearsEmployed,CreditScore,Income
count,138.0,138.0,138.0,138.0,138.0
mean,32.437283,4.900254,2.226812,1.702899,1022.108696
std,12.17297,5.073769,3.376051,3.331796,4724.580194
min,15.17,0.0,0.0,0.0,0.0
25%,23.08,1.25,0.125,0.0,0.0
50%,29.585,2.73,0.75,0.0,1.0
75%,39.1075,7.0,2.75,2.0,200.0
max,76.75,25.21,17.5,20.0,50000.0


In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
train_scalers = defaultdict(MinMaxScaler)
test_scalers = defaultdict(MinMaxScaler)

for column in continuous:
  train_scaler = train_scalers[column]
  test_scaler = test_scalers[column]
  X_train[column] = train_scaler.fit_transform(X_train[column].values.reshape(-1, 1))
  X_test[column] = test_scaler.fit_transform(X_test[column].values.reshape(-1, 1))
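
A common alternative to fitting a separate scaler on each split is to fit the scaler on the training split only and reuse it on the test split, so that both splits share one mapping and no test-set statistics enter the preprocessing. The sketch below illustrates this on synthetic data; the arrays `toy_train` and `toy_test` are made up, and scaled test values may fall outside [0, 1] when the test range exceeds the training range.

```python
import numpy
from sklearn.preprocessing import MinMaxScaler

# Synthetic single-feature splits (illustrative values only).
toy_train = numpy.array([[0.0], [10.0], [20.0]])
toy_test = numpy.array([[5.0], [25.0]])

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training split only.
train_scaled = scaler.fit_transform(toy_train)
# Apply the same mapping to the test split, without refitting.
test_scaled = scaler.transform(toy_test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

With this variant, a value of 5.0 maps to the same scaled number whether it appears in the training or in the test split.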

# Dimensionality reduction

<!--

* https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/

 -->

## Principal component analysis

<!--

* https://www.datasklr.com/principal-component-analysis-and-factor-analysis/principal-component-analysis
* https://www.youtube.com/watch?v=FgakZw6K1QQ
* https://jmausolf.github.io/code/pca_in_python/

-->

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca_train = PCA()
pca_train.fit(X_train[continuous])

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [None]:
numpy.cumsum(pca_train.explained_variance_ratio_)

array([0.49681538, 0.7953898 , 0.91164728, 0.96771056, 1.        ])
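
The cumulative explained variance is what decides how many components to keep. As a hedged sketch on synthetic data (the `toy` matrix and the 0.9 threshold are illustrative assumptions), the manual rule "first index where the cumulative ratio exceeds 0.9" can be compared with passing a float `n_components` to `PCA`, which asks scikit-learn to keep enough components to explain more than that fraction of the variance.

```python
import numpy
from sklearn.decomposition import PCA

# Synthetic data with five features of decreasing variance (illustrative).
rng = numpy.random.default_rng(0)
toy = rng.normal(size=(200, 5)) * numpy.array([5.0, 3.0, 2.0, 1.0, 0.5])

# Manual rule: first index where the cumulative ratio exceeds 0.9.
full = PCA().fit(toy)
cumulative = numpy.cumsum(full.explained_variance_ratio_)
n_manual = int(numpy.argmax(cumulative > 0.9)) + 1

# Shortcut: a float n_components selects the component count by variance.
auto = PCA(n_components=0.9, svd_solver='full').fit(toy)

print(n_manual, auto.n_components_)
```

Both approaches should agree on the number of components; the manual version has the advantage of exposing the full cumulative curve for inspection.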

In [None]:
pca_X_train = pandas.DataFrame(
    data = pca_train.transform(X_train[continuous]),
    columns = ['PC%d' % (i) for i in numpy.arange(pca_train.n_components_)+1]
)

# Keep the smallest number of components whose cumulative explained variance
# exceeds 90%.
max_column = numpy.argmax(
    numpy.cumsum(pca_train.explained_variance_ratio_) > 0.9
) + 1
principal_components = pca_X_train.columns[:max_column]

pca_X_train = pandas.concat(
    [pca_X_train[principal_components], X_train[categorical]],
    axis = 1
)

pca_X_train

Unnamed: 0,PC1,PC2,PC3,Gender,Married,BankCustomer,EducationLevel,PriorDefault,Employed,DriversLicense,Citizen
0,-0.080758,-0.184270,-0.065782,1.0,1.0,0.0,5.0,0.0,0.0,0.0,1.0
1,-0.216770,-0.018099,0.009423,0.0,1.0,0.0,6.0,0.0,0.0,1.0,2.0
2,-0.189481,-0.012675,0.048359,1.0,1.0,0.0,6.0,0.0,1.0,1.0,0.0
3,0.087857,-0.047823,-0.124984,1.0,1.0,0.0,10.0,1.0,1.0,0.0,0.0
4,-0.130168,0.191575,-0.026154,1.0,1.0,0.0,6.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
547,-0.143830,0.060022,0.087901,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0
548,0.041393,0.023916,0.093961,0.0,1.0,0.0,8.0,0.0,0.0,1.0,2.0
549,-0.167171,-0.040376,-0.002445,1.0,1.0,0.0,8.0,0.0,1.0,1.0,0.0
550,0.028834,0.238163,-0.045819,1.0,1.0,0.0,10.0,1.0,0.0,0.0,0.0


In [None]:
pca_X_test = pandas.DataFrame(
    data = pca_train.transform(X_test[continuous]),
    columns = ['PC%d' % (i) for i in numpy.arange(pca_train.n_components_)+1]
)

principal_components = pca_X_test.columns[:max_column]

pca_X_test = pandas.concat(
    [pca_X_test[principal_components], X_test[categorical]],
    axis = 1
)

pca_X_test

Unnamed: 0,PC1,PC2,PC3,Gender,Married,BankCustomer,EducationLevel,PriorDefault,Employed,DriversLicense,Citizen
0,0.013034,0.154900,0.158099,0.0,1.0,0.0,10.0,1.0,1.0,0.0,0.0
1,0.616798,-0.115678,-0.053012,1.0,2.0,2.0,1.0,1.0,0.0,0.0,0.0
2,-0.141278,0.023545,-0.017648,0.0,2.0,2.0,0.0,1.0,0.0,1.0,0.0
3,0.207157,-0.161917,0.414889,1.0,2.0,2.0,6.0,0.0,0.0,1.0,2.0
4,-0.293477,0.043979,0.033666,1.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
133,-0.131200,0.060932,0.003077,1.0,2.0,2.0,8.0,0.0,1.0,1.0,0.0
134,-0.245238,0.012432,0.017578,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
135,0.305105,-0.170936,-0.040540,0.0,1.0,0.0,6.0,0.0,0.0,1.0,0.0
136,0.237959,0.276361,-0.153109,1.0,2.0,2.0,5.0,0.0,1.0,0.0,0.0


In [None]:
import optuna
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedKFold
# from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

#### *Repeated K-Fold* with 5 folds and 10 repetitions

In [None]:
# Recent scikit-learn versions accept these arguments by keyword only.
kfold = RepeatedKFold(n_splits=5, n_repeats=10)
m = SVC()
cross_validate(
    m, pca_X_train.values, Y_train.values.ravel(), scoring='accuracy',
    cv=kfold, return_train_score=True
)

{'fit_time': array([0.00962114, 0.00562048, 0.00545645, 0.00553393, 0.00562096,
        0.00533414, 0.00822878, 0.00578189, 0.00533676, 0.00564837,
        0.00526595, 0.00532103, 0.00562048, 0.00522041, 0.00585079,
        0.00486565, 0.00550747, 0.00548363, 0.00552034, 0.00551963,
        0.00559449, 0.00578117, 0.00536966, 0.00532913, 0.00523257,
        0.00543594, 0.00523996, 0.00540829, 0.00594544, 0.00553298,
        0.0054214 , 0.00546765, 0.0054996 , 0.00548315, 0.00554061,
        0.00515032, 0.00525832, 0.00537682, 0.00529075, 0.0059998 ,
        0.00523925, 0.00551701, 0.00524068, 0.00497222, 0.00530648,
        0.00538301, 0.01378775, 0.0054338 , 0.00532055, 0.00534225]),
 'score_time': array([0.00197959, 0.00306439, 0.00119066, 0.00121307, 0.00135422,
        0.001791  , 0.0018971 , 0.00126386, 0.00111699, 0.00111389,
        0.00112319, 0.00109458, 0.00112534, 0.00109267, 0.00110984,
        0.00103903, 0.00114894, 0.00116086, 0.00123692, 0.0011735 ,
        0.00121212, 
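
Reading 50 individual scores is impractical, so the result of a repeated k-fold run is usually summarised by its mean and standard deviation. The sketch below is a minimal illustration on synthetic data: `X_toy` and `y_toy` stand in for the notebook's features and labels, and `random_state=0` is an added assumption for repeatability.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only).
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

# 5 folds repeated 10 times, with a fixed seed for repeatable splits.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
results = cross_validate(
    SVC(), X_toy, y_toy, scoring='accuracy', cv=cv, return_train_score=True
)

# One score per split: 5 x 10 = 50 entries.
print(len(results['test_score']))
# Summarise each score array by its mean and standard deviation.
print(numpy.mean(results['test_score']), numpy.std(results['test_score']))
print(numpy.mean(results['train_score']), numpy.std(results['train_score']))
```

A large gap between the mean training score and the mean validation score is a first hint of overfitting.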

#### Different SVM configurations (with the four kernels 'linear', 'poly', 'rbf' and 'sigmoid') for classifying the dataset

In [None]:
def model_factory(c_value, k_function, d, g, c):
  # Recent scikit-learn versions accept SVC hyperparameters by keyword only.
  model = SVC(C=c_value, kernel=k_function, degree=d, gamma=g, coef0=c)
  return model

def svc_model_optimization_study(trial: optuna.trial.Trial):
  c_value = trial.suggest_categorical('c_value', [0.01, 0.1, 1, 10, 100])
  kernel = trial.suggest_categorical(
      'kernel', ['linear', 'poly', 'rbf', 'sigmoid']
  )
  degree = trial.suggest_int('degree', 1, 3)
  gamma = trial.suggest_categorical(
      'gamma', ['scale', 'auto']
  )
  # Log-uniform sampling; suggest_float(..., log=True) supersedes the
  # deprecated suggest_loguniform.
  coef0 = trial.suggest_float('coef0', 0.1, 1.0, log=True)
  model = model_factory(c_value, kernel, degree, gamma, coef0)
  kfold = RepeatedKFold(n_splits=5, n_repeats=10)
  scores = cross_validate(
      model, pca_X_train.values, Y_train.values.ravel(), scoring='accuracy',
      cv=kfold, return_train_score=True
  )
  trial.set_user_attr('model', model)
  trial.set_user_attr('scores', scores)
  return numpy.mean(scores['test_score'])

In [None]:
study = optuna.study.create_study(
    study_name='SVM_cross_validation_study', direction='maximize'
)
study.optimize(svc_model_optimization_study, n_trials=20)

[I 2021-11-03 00:55:01,421] A new study created in memory with name: SVM_cross_validation_study
[I 2021-11-03 00:55:02,050] Trial 0 finished with value: 0.6025733005733006 and parameters: {'c_value': 1, 'kernel': 'sigmoid', 'degree': 2, 'gamma': 'scale', 'coef0': 0.3792263139879589}. Best is trial 0 with value: 0.6025733005733006.
[I 2021-11-03 00:55:02,275] Trial 1 finished with value: 0.8604963144963145 and parameters: {'c_value': 0.1, 'kernel': 'linear', 'degree': 1, 'gamma': 'auto', 'coef0': 0.797715645308757}. Best is trial 1 with value: 0.8604963144963145.
[I 2021-11-03 00:55:03,212] Trial 2 finished with value: 0.548945126945127 and parameters: {'c_value': 0.01, 'kernel': 'sigmoid', 'degree': 1, 'gamma': 'scale', 'coef0': 0.1465759553707515}. Best is trial 1 with value: 0.8604963144963145.
[I 2021-11-03 00:55:03,840] Trial 3 finished with value: 0.8719148239148241 and parameters: {'c_value': 100, 'kernel': 'rbf', 'degr

In [None]:
for i, trial in enumerate(study.trials):
  print('Trial %i' % i)
  print(
      'Training accuracy:',
      numpy.mean(trial.user_attrs['scores']['train_score'])
  )
  print(
      'Validation accuracy:',
      numpy.mean(trial.user_attrs['scores']['test_score'])
  )
  model = trial.user_attrs['model'].fit(pca_X_train.values, Y_train.values.ravel())
  print(
      'Test accuracy:',
      model.score(pca_X_test.values, Y_test.values.ravel())
  )

Trial 0
Training accuracy: 0.5986735206903273
Validation accuracy: 0.6025733005733006
Test accuracy: 0.7101449275362319
Trial 1
Training accuracy: 0.8605065615989987
Validation accuracy: 0.8604963144963145
Test accuracy: 0.8333333333333334
Trial 2
Training accuracy: 0.5489150532007675
Validation accuracy: 0.548945126945127
Test accuracy: 0.5797101449275363
Trial 3
Training accuracy: 0.9075632304203732
Validation accuracy: 0.8719148239148241
Test accuracy: 0.8405797101449275
Trial 4
Training accuracy: 0.9195650567919477
Validation accuracy: 0.8731859131859132
Test accuracy: 0.8115942028985508
Trial 5
Training accuracy: 0.5862321338791927
Validation accuracy: 0.5786257166257167
Test accuracy: 0.7246376811594203
Trial 6
Training accuracy: 0.86621520403033
Validation accuracy: 0.8606633906633907
Test accuracy: 0.8333333333333334
Trial 7
Training accuracy: 0.5867457752331702
Validation accuracy:

# Final remarks

The SVM algorithm was trained with 4 *kernels* (linear, polynomial, rbf and sigmoid), tuning the parameters *C*, *gamma*, *degree* and *coef0*. The values of *gamma* were *scale* and *auto*; *degree* ranged over the integers 1 to 3; *coef0* was sampled from a log-uniform distribution with a minimum of 0.1 and a maximum of 1.0; and *C* was chosen from the set {0.01, 0.1, 1, 10, 100}.

With this setup, 20 combinations were evaluated. The best combination on the validation set was number 11, with the parameters C = 10, kernel rbf, gamma auto and coef0 ~= 0.11 (a value the rbf kernel does not use). This combination reached an accuracy of 0.91 on training, 0.87 on validation and 0.81 on test.
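
In scikit-learn's `SVC`, `degree` is used only by the poly kernel and `coef0` only by the poly and sigmoid kernels, so for an rbf model these two sampled values play no role. The sketch below checks this on synthetic data; `X_toy`, `y_toy` and the specific parameter values are illustrative assumptions, not the notebook's data.

```python
import numpy
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data (illustrative only).
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

# Two rbf models that differ only in degree and coef0.
rbf_a = SVC(C=10, kernel='rbf', gamma='auto', degree=1, coef0=0.11)
rbf_b = SVC(C=10, kernel='rbf', gamma='auto', degree=3, coef0=0.9)
rbf_a.fit(X_toy, y_toy)
rbf_b.fit(X_toy, y_toy)

# The rbf kernel depends on gamma only, so the decision values coincide.
same_rbf = numpy.allclose(
    rbf_a.decision_function(X_toy), rbf_b.decision_function(X_toy)
)

# For the sigmoid kernel, coef0 enters the kernel, so changing it is
# expected to change the fitted model.
sig_a = SVC(kernel='sigmoid', coef0=0.1).fit(X_toy, y_toy)
sig_b = SVC(kernel='sigmoid', coef0=1.0).fit(X_toy, y_toy)
same_sigmoid = numpy.allclose(
    sig_a.decision_function(X_toy), sig_b.decision_function(X_toy)
)

print(same_rbf, same_sigmoid)
```

One consequence for the search is that several of the 20 sampled combinations may describe the same effective model when the kernel is linear or rbf.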

Training and validation accuracy are close (0.91 and 0.87), and there is a small further loss of accuracy in the test phase (0.81). Despite this, we conclude that the trained model is adequate, suffered neither *underfitting* nor severe *overfitting*, and is useful for the credit approval task according to the experiments performed.

SVM proved to be a strong algorithm compared with the other models analysed: it reached an accuracy of 0.81, while the K-NN and LVQ algorithms (with their best settings for the test set), tried in previous assignments, reached accuracies of 0.78 and 0.62 respectively.

Moreover, SVM is more memory-efficient than K-NN at prediction time, since its decision function depends only on the support vectors, whereas K-NN must keep the whole training set. SVM also remains effective in high-dimensional spaces.
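
The memory argument can be inspected directly on a fitted model. The sketch below is a minimal illustration on synthetic data (`X_toy` and `y_toy` are made up): it counts how many training points a fitted `SVC` retains as support vectors, which are the only points needed to evaluate its decision function.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
model = SVC(kernel='rbf').fit(X_toy, y_toy)

# Only these training points are stored for prediction.
n_sv = model.support_vectors_.shape[0]
print(n_sv, 'support vectors out of', X_toy.shape[0], 'training points')
print('fraction kept:', n_sv / X_toy.shape[0])
```

The saving depends on the data: when the classes overlap heavily, a large share of the training points can end up as support vectors.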




