---
# **Exercises - Example-Based Classification**
---

**Author**
> Vitor Eduardo de Souza Costa

**References**
> - Solange Rezende. [Paradigma de aprendizado baseado em instâncias](https://edisciplinas.usp.br/pluginfile.php/8366136/mod_resource/content/1/Aula_15_IA_MedidasDistancia_KNN.pdf). Mai. de 2024.

## Importing necessary libraries

In [259]:
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold

---
## Exercise 1
---

### Creating a dataset of patients registers to verify their diagnostic

> On the proposed exercise, the available data are referent to six different patients, commenting their symptoms and respective diagnostic. The objective is utilizing this data to label new cases, compare results with theoretical awaited results and the Decision Trees results with this same dataset.

#### Writing dataset file to train

In [260]:
%%writefile patients_register_train.tsv
Nome;Febre;Enjôo;Manchas;Dores;Diagnóstico
João;sim;sim;pequenas;sim;doente
Pedro;não;não;grandes;não;saudável
Maria;sim;sim;pequenas;não;saudável
José;sim;não;grandes;sim;doente
Ana;sim;não;pequenas;sim;saudável
Leila;não;não;grandes;sim;doente

Writing patients_register_train.tsv


#### Writing dataset file to predict

In [261]:
%%writefile patients_register_predict.tsv
Nome;Febre;Enjôo;Manchas;Dores
Luis;não;não;pequenas;sim
Laura;sim;sim;grandes;sim

Writing patients_register_predict.tsv


#### Reading dataset file to train

In [262]:
patients_dataset_train = pd.read_csv('patients_register_train.tsv', index_col='Nome', sep=';')

print(patients_dataset_train)

      Febre Enjôo   Manchas Dores Diagnóstico
Nome                                         
João    sim   sim  pequenas   sim      doente
Pedro   não   não   grandes   não    saudável
Maria   sim   sim  pequenas   não    saudável
José    sim   não   grandes   sim      doente
Ana     sim   não  pequenas   sim    saudável
Leila   não   não   grandes   sim      doente


#### Reading dataset file to predict

In [263]:
patients_x_predict = pd.read_csv('patients_register_predict.tsv', index_col='Nome', sep=';')

print(patients_x_predict)

      Febre Enjôo   Manchas Dores
Nome                             
Luis    não   não  pequenas   sim
Laura   sim   sim   grandes   sim


### Cleaning and treating the dataset

> With the available data, It's necessary to facility its computational treatment, converting its symbolical values into numerical, conserving the order on ordinary values and keeping an unitary distance when there isn't an order. On our case, all we need to do is the conversion of categorical values into numerical.

#### Converting symbolical values into numerical to train

In [264]:
patients_dataset_train.Febre.replace({'sim': 1,'não': 0}, inplace=True)
patients_dataset_train.Enjôo.replace({'sim': 1,'não': 0}, inplace=True)
patients_dataset_train.Manchas.replace({'grandes': 1,'pequenas': 0}, inplace=True)
patients_dataset_train.Dores.replace({'sim': 1,'não': 0}, inplace=True)
patients_dataset_train.Diagnóstico.replace({'saudável': 1,'doente': 0}, inplace=True)

# Separating label from dataset
patients_x = patients_dataset_train.drop(['Diagnóstico'], axis=1)
patients_y = patients_dataset_train.Diagnóstico

print(patients_x)

       Febre  Enjôo  Manchas  Dores
Nome                               
João       1      1        0      1
Pedro      0      0        1      0
Maria      1      1        0      0
José       1      0        1      1
Ana        1      0        0      1
Leila      0      0        1      1


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_dataset_train.Febre.replace({'sim': 1,'não': 0}, inplace=True)
  patients_dataset_train.Febre.replace({'sim': 1,'não': 0}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_dataset_train.Enjôo.replace({'sim': 1,'não': 0}, inplace=True)
  patients_dat

#### Converting symbolical values into numerical to predict

In [265]:
patients_x_predict.Febre.replace({'sim': 1,'não': 0}, inplace=True)
patients_x_predict.Enjôo.replace({'sim': 1,'não': 0}, inplace=True)
patients_x_predict.Manchas.replace({'grandes': 1,'pequenas': 0}, inplace=True)
patients_x_predict.Dores.replace({'sim': 1,'não': 0}, inplace=True)

print(patients_x_predict)

       Febre  Enjôo  Manchas  Dores
Nome                               
Luis       0      0        0      1
Laura      1      1        1      1


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_x_predict.Febre.replace({'sim': 1,'não': 0}, inplace=True)
  patients_x_predict.Febre.replace({'sim': 1,'não': 0}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_x_predict.Enjôo.replace({'sim': 1,'não': 0}, inplace=True)
  patients_x_predict.Enjôo

### Solution for K = 1

> Creating a K-NN model where K = 1, looking to compare these obtained results for different values of K and theoretical values obtained at the Excel spreadsheet named "*exercise_1.xlsx*".

#### Construction of the model

In [266]:
patients_k1 = KNeighborsClassifier(n_neighbors=1)

#### Validation of the model

##### Instance a Stratified KFold cross validation for K = 1

In [267]:
# Defining Stratified KFold and the number of splits which our dataset should be divisible
patients_skf_k1 = StratifiedKFold(n_splits=3)
patients_k1_index = KNeighborsClassifier(n_neighbors=1)
patients_accuracies_k1 = []
patients_f1_scores_k1 = []

# Defining train and test dataframes for each index prepared by SKF, then defining accuracy for each index results
for i, (patients_train_index_k1, patients_test_index_k1) in enumerate(patients_skf_k1.split(patients_x, patients_y)):
    patients_x_train_k1 = patients_x.iloc[patients_train_index_k1]
    patients_y_train_k1 = patients_y.iloc[patients_train_index_k1]
    patients_x_test_k1 = patients_x.iloc[patients_test_index_k1]
    patients_y_test_k1 = patients_y.iloc[patients_test_index_k1]
    patients_k1_index.fit(patients_x_train_k1,patients_y_train_k1)
    patients_accuracies_k1.insert(i,patients_k1_index.score(patients_x_test_k1,patients_y_test_k1))
    patients_f1_scores_k1.insert(i,f1_score(patients_y_test_k1,patients_k1_index.predict(patients_x_test_k1), average='weighted'))

##### Showing accuracy results by Stratified KFold cross validation for K = 1

In [268]:
# Transforming the list of accuracies and F1 for each interactions of KFold into a dataframe
patients_accuracies_k1_df = pd.DataFrame(data=patients_accuracies_k1, columns=[''])
patients_f1_scores_k1_df = pd.DataFrame(data=patients_f1_scores_k1, columns=[''])

# Calculating the accuracy of model as mean of accuracies and F1 from each split of SKF
patients_accuracy_k1 = patients_accuracies_k1_df.mean()
patients_f1_score_k1 = patients_f1_scores_k1_df.mean()

# Getting standard deviation for accuracies
patients_accuracy_std_k1 = patients_accuracies_k1_df.std()
patients_f1_score_std_k1 = patients_f1_scores_k1_df.std()

print(f"Accuracy of the model is: {patients_accuracy_k1}\nIts standard deviation is: {patients_accuracy_std_k1}\n\n")
print(f"F1 score of the model is: {patients_f1_score_k1}\nIts standard deviation is: {patients_f1_score_std_k1}")

Accuracy of the model is:     0.0
dtype: float64
Its standard deviation is:     0.0
dtype: float64


F1 score of the model is:     0.0
dtype: float64
Its standard deviation is:     0.0
dtype: float64


#### Obtaining the label on examples to predict

In [269]:
patients_classifier_k1 = patients_k1.fit(patients_x,patients_y)

patients_classified_k1 = patients_classifier_k1.predict(patients_x_predict)
patients_results_k1 = patients_x_predict.copy()
patients_results_k1['Diagnóstico'] = patients_classified_k1

print(patients_results_k1)

       Febre  Enjôo  Manchas  Dores  Diagnóstico
Nome                                            
Luis       0      0        0      1            1
Laura      1      1        1      1            0


#### Returning numerical values into categorical for better interpretation

In [270]:
patients_results_k1.Febre.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k1.Enjôo.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k1.Dores.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k1.Manchas.replace({0: 'Pequenas', 1: 'Grandes'}, inplace=True)
patients_results_k1.Diagnóstico.replace({0: 'Doente', 1: 'Saudável'}, inplace=True)

print(patients_results_k1)

      Febre Enjôo   Manchas Dores Diagnóstico
Nome                                         
Luis    Não   Não  Pequenas   Sim    Saudável
Laura   Sim   Sim   Grandes   Sim      Doente


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_results_k1.Febre.replace({0: 'Não', 1: 'Sim'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_results_k1.Enjôo.replace({0: 'Não', 1: 'Sim'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the inte

### Solution for K = 3

> Creating a K-NN model where K = 3, looking to compare these obtained results for different values of K and theoretical values obtained at the Excel spreadsheet named "*exercise_1.xlsx*".

#### Construction of the model

In [271]:
patients_k3 = KNeighborsClassifier(n_neighbors=3)

#### Validation of the model

##### Instance a Stratified KFold cross validation for K = 3

In [272]:
# Defining Stratified KFold and the number of splits which our dataset should be divisible
patients_skf_k3 = StratifiedKFold(n_splits=3)
patients_k3_index = KNeighborsClassifier(n_neighbors=3)
patients_accuracies_k3 = []
patients_f1_scores_k3 = []

# Defining train and test dataframes for each index prepared by SKF, then defining accuracy for each index results
for i, (patients_train_index_k3, patients_test_index_k3) in enumerate(patients_skf_k3.split(patients_x, patients_y)):
    patients_x_train_k3 = patients_x.iloc[patients_train_index_k3]
    patients_y_train_k3 = patients_y.iloc[patients_train_index_k3]
    patients_x_test_k3 = patients_x.iloc[patients_test_index_k3]
    patients_y_test_k3 = patients_y.iloc[patients_test_index_k3]
    patients_k3_index.fit(patients_x_train_k3,patients_y_train_k3)
    patients_accuracies_k3.insert(i,patients_k3_index.score(patients_x_test_k3,patients_y_test_k3))
    patients_f1_scores_k3.insert(i,f1_score(patients_y_test_k3,patients_k3_index.predict(patients_x_test_k3), average='weighted'))

##### Showing accuracy results by Stratified KFold cross validation for K = 3

In [273]:
# Transforming the list of accuracies and F1 for each interactions of KFold into a dataframe
patients_accuracies_k3_df = pd.DataFrame(data=patients_accuracies_k3, columns=[''])
patients_f1_scores_k3_df = pd.DataFrame(data=patients_f1_scores_k3, columns=[''])

# Calculating the accuracy of model as mean of accuracies and F1 from each split of SKF
patients_accuracy_k3 = patients_accuracies_k3_df.mean()
patients_f1_score_k3 = patients_f1_scores_k3_df.mean()

# Getting standard deviation for accuracies
patients_accuracy_std_k3 = patients_accuracies_k3_df.std()
patients_f1_score_std_k3 = patients_f1_scores_k3_df.std()

print(f"Accuracy of the model is: {patients_accuracy_k3}\nIts standard deviation is: {patients_accuracy_std_k3}\n\n")
print(f"F1 score of the model is: {patients_f1_score_k3}\nIts standard deviation is: {patients_f1_score_std_k3}")

Accuracy of the model is:     0.5
dtype: float64
Its standard deviation is:     0.5
dtype: float64


F1 score of the model is:     0.444444
dtype: float64
Its standard deviation is:     0.509175
dtype: float64


#### Obtaining the label on examples to predict

In [274]:
patients_classifier_k3 = patients_k3.fit(patients_x,patients_y)

patients_classified_k3 = patients_classifier_k3.predict(patients_x_predict)
patients_results_k3 = patients_x_predict.copy()
patients_results_k3['Diagnóstico'] = patients_classified_k3

print(patients_results_k3)

       Febre  Enjôo  Manchas  Dores  Diagnóstico
Nome                                            
Luis       0      0        0      1            0
Laura      1      1        1      1            0


#### Returning numerical values into categorical for better interpretation

In [275]:
patients_results_k3.Febre.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k3.Enjôo.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k3.Dores.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k3.Manchas.replace({0: 'Pequenas', 1: 'Grandes'}, inplace=True)
patients_results_k3.Diagnóstico.replace({0: 'Doente', 1: 'Saudável'}, inplace=True)

print(patients_results_k3)

      Febre Enjôo   Manchas Dores Diagnóstico
Nome                                         
Luis    Não   Não  Pequenas   Sim      Doente
Laura   Sim   Sim   Grandes   Sim      Doente


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_results_k3.Febre.replace({0: 'Não', 1: 'Sim'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_results_k3.Enjôo.replace({0: 'Não', 1: 'Sim'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the inte

### Solution for K = 5

> Creating a K-NN model where K = 5, looking to compare these obtained results for different values of K and theoretical values obtained at the Excel spreadsheet named "*exercise_1.xlsx*".

#### Construction of the model

In [276]:
patients_k5 = KNeighborsClassifier(n_neighbors=5)

#### Validation of the model

> Actually, because of the amount of examples on the dataset is limited, we're not able to determine the cross validation for k = 5, because to find the 5-nearest examples we need at least 6 examples, and It's exactly the number of examples we have in all dataset, so if we accomplish a split, we'll not have enough data to get 5-nearest examples.

#### Obtaining the label on examples to predict

In [277]:
patients_classifier_k5 = patients_k5.fit(patients_x,patients_y)

patients_classified_k5 = patients_classifier_k5.predict(patients_x_predict)
patients_results_k5 = patients_x_predict.copy()
patients_results_k5['Diagnóstico'] = patients_classified_k5

print(patients_results_k5)

       Febre  Enjôo  Manchas  Dores  Diagnóstico
Nome                                            
Luis       0      0        0      1            0
Laura      1      1        1      1            0


#### Returning numerical values into categorical for better interpretation

In [278]:
patients_results_k5.Febre.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k5.Enjôo.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k5.Dores.replace({0: 'Não', 1: 'Sim'}, inplace=True)
patients_results_k5.Manchas.replace({0: 'Pequenas', 1: 'Grandes'}, inplace=True)
patients_results_k5.Diagnóstico.replace({0: 'Doente', 1: 'Saudável'}, inplace=True)

print(patients_results_k5)

      Febre Enjôo   Manchas Dores Diagnóstico
Nome                                         
Luis    Não   Não  Pequenas   Sim      Doente
Laura   Sim   Sim   Grandes   Sim      Doente


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_results_k5.Febre.replace({0: 'Não', 1: 'Sim'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  patients_results_k5.Enjôo.replace({0: 'Não', 1: 'Sim'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the inte

---
## Exercise 2
---

### Creating a dataset about scholarity and salary correlated to a class

> On the presented exercise, It's available data referent scholarity, state and salary of different people. We're looking to register this available data to implement a solution to label new examples.

#### Writing dataset file to train

In [279]:
%%writefile salary_scholarity_train.tsv
Estado;Escolaridade;Altura;Salário;Classe
SP;Médio;180;3000;A
RJ;Superior;174;7000;B
RS;Médio;180;600;B
RJ;Superior;100;2000;A
SP;Fundamental;178;5000;A
RJ;Fundamental;188;1800;A


Writing salary_scholarity_train.tsv


#### Writing dataset file to predict

In [280]:
%%writefile salary_scholarity_predict.tsv
Estado;Escolaridade;Altura;Salário
RJ;Médio;178;2000
SP;Superior;200;800

Writing salary_scholarity_predict.tsv


#### Reading dataset file to train

In [281]:
salary_dataset = pd.read_csv('salary_scholarity_train.tsv', sep=';')

print(salary_dataset)

  Estado Escolaridade  Altura  Salário Classe
0     SP        Médio     180     3000      A
1     RJ     Superior     174     7000      B
2     RS        Médio     180      600      B
3     RJ     Superior     100     2000      A
4     SP  Fundamental     178     5000      A
5     RJ  Fundamental     188     1800      A


#### Reading dataset file to predict

In [282]:
# @title Lendo dataset com valores a classificar

salary_x_predict = pd.read_csv('salary_scholarity_predict.tsv', sep=';')

print(salary_x_predict)

  Estado Escolaridade  Altura  Salário
0     RJ        Médio     178     2000
1     SP     Superior     200      800


### Cleaning and treating the datasets

> With this available data, It's necessary to facility the computational treatment, so we need to convert symbolical values into numerical, conserving its order if It's an ordinary value and keeping unitary distance for non-ordinary values. At our dataset, we just need to convert categorical values into numerical.

#### Converting categorical values into numerical to train dataset

In [283]:
# Symbolical values with unitary distance or ordinary values with defined order
salary_dataset.Classe.replace({'A': 1,'B': 0}, inplace=True)
salary_dataset.Escolaridade.replace({'Superior': 2, 'Médio': 1, 'Fundamental': 0}, inplace=True)

# Treatment to keep unitary distance at non-ordinary values
salary_onehot = pd.get_dummies(salary_dataset.Estado,dtype=int)
salary_dataset = pd.concat([salary_onehot, salary_dataset.drop('Estado', axis=1)], axis=1)

# Separating label from the other attributes of dataset
salary_x = salary_dataset.drop('Classe', axis=1)
salary_y = salary_dataset.Classe.copy()
print(f"{salary_x}\n\n{salary_y}")


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  salary_dataset.Classe.replace({'A': 1,'B': 0}, inplace=True)
  salary_dataset.Classe.replace({'A': 1,'B': 0}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  salary_dataset.Escolaridade.replace({'Superior': 2, 'Médio': 1, 'Fundamental': 0}, inplace=True)
  salary_da

   RJ  RS  SP  Escolaridade  Altura  Salário
0   0   0   1             1     180     3000
1   1   0   0             2     174     7000
2   0   1   0             1     180      600
3   1   0   0             2     100     2000
4   0   0   1             0     178     5000
5   1   0   0             0     188     1800

0    1
1    0
2    0
3    1
4    1
5    1
Name: Classe, dtype: int64


#### Converting categorical values into numerical to predict dataset

In [284]:
# Symbolical values with unitary distance or ordinary values with defined order
salary_x_predict.Escolaridade.replace({'Superior': 2, 'Médio': 1, 'Fundamental': 0}, inplace=True)

# Treatment to keep unitary distance at non-ordinary values
salary_onehot_predict = pd.get_dummies(salary_x_predict.Estado,dtype=int)
salary_x_predict = pd.concat([salary_onehot_predict, salary_x_predict.drop('Estado', axis=1)], axis=1)

# Recovering lost column RS for the predict dataset to keep the same shape as train dataset
RS = [0]*len(salary_x_predict.RJ)
RS_pd = pd.DataFrame(RS,columns=['RS'])

salary_x_predict = pd.concat([salary_x_predict.RJ,RS_pd,salary_x_predict.drop('RJ', axis=1)],axis=1)
print(salary_x_predict)

   RJ  RS  SP  Escolaridade  Altura  Salário
0   1   0   0             1     178     2000
1   0   0   1             2     200      800


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  salary_x_predict.Escolaridade.replace({'Superior': 2, 'Médio': 1, 'Fundamental': 0}, inplace=True)
  salary_x_predict.Escolaridade.replace({'Superior': 2, 'Médio': 1, 'Fundamental': 0}, inplace=True)


### Solution for K = 1

> Creating a K-NN model where K = 1, looking to compare these obtained results for different values of K and theoretical values obtained at the Excel spreadsheet named "*exercise_2.xlsx*".

#### Construction of the model

In [285]:
salary_k1 = KNeighborsClassifier(n_neighbors=1)

#### Validation of the model

##### Instance a Stratified KFold cross validation for K = 1

In [286]:
# Defining Stratified KFold and the number of splits which our dataset should be divisible
salary_skf_k1 = StratifiedKFold(n_splits=3)
salary_k1_index = KNeighborsClassifier(n_neighbors=1)
salary_accuracies_k1 = []
salary_f1_scores_k1 = []

# Defining train and test dataframes for each index prepared by SKF, then defining accuracy for each index results
for i, (salary_train_index_k1, salary_test_index_k1) in enumerate(salary_skf_k1.split(salary_x, salary_y)):
    salary_x_train_k1 = salary_x.iloc[salary_train_index_k1]
    salary_y_train_k1 = salary_y.iloc[salary_train_index_k1]
    salary_x_test_k1 = salary_x.iloc[salary_test_index_k1]
    salary_y_test_k1 = salary_y.iloc[salary_test_index_k1]
    salary_k1_index.fit(salary_x_train_k1,y_train_k1)
    salary_accuracies_k1.insert(i,salary_k1_index.score(salary_x_test_k1,salary_y_test_k1))
    salary_f1_scores_k1.insert(i,f1_score(salary_y_test_k1,salary_k1_index.predict(salary_x_test_k1), average='weighted'))



##### Showing accuracy results by Stratified KFold cross validation for K = 1

In [287]:
# Transforming the list of accuracies and F1 for each interactions of KFold into a dataframe
salary_accuracies_k1_df = pd.DataFrame(data=salary_accuracies_k1, columns=[''])
salary_f1_scores_k1_df = pd.DataFrame(data=salary_f1_scores_k1, columns=[''])

# Calculating the accuracy of model as mean of accuracies and F1 from each split of SKF
salary_accuracy_k1 = salary_accuracies_k1_df.mean()
salary_f1_score_k1 = salary_f1_scores_k1_df.mean()

# Getting standard deviation for accuracies
salary_accuracy_std_k1 = salary_accuracies_k1_df.std()
salary_f1_score_std_k1 = salary_f1_scores_k1_df.std()

print(f"Accuracy of the model is: {salary_accuracy_k1}\nIts standard deviation is: {salary_accuracy_std_k1}\n\n")
print(f"F1 score of the model is: {salary_f1_score_k1}\nIts standard deviation is: {salary_f1_score_std_k1}")

Accuracy of the model is:     0.333333
dtype: float64
Its standard deviation is:     0.288675
dtype: float64


F1 score of the model is:     0.222222
dtype: float64
Its standard deviation is:     0.19245
dtype: float64


#### Obtaining the label on examples to predict

In [288]:
salary_classifier_k1 = salary_k1.fit(salary_x,salary_y)

salary_classified_k1 = salary_classifier_k1.predict(salary_x_predict)
salary_results_k1 = salary_x_predict.copy()
salary_results_k1['Classe'] = salary_classified_k1

print(salary_results_k1)

   RJ  RS  SP  Escolaridade  Altura  Salário  Classe
0   1   0   0             1     178     2000       1
1   0   0   1             2     200      800       0


#### Returning numerical values into categorical for better interpretation

In [289]:
salary_results_k1.Escolaridade.replace({1:'Médio',2:'Superior'}, inplace=True)
salary_results_k1.Classe.replace({1:'A',0:'B'}, inplace=True)
salary_dummies_k1 = pd.concat([salary_results_k1.RJ, salary_results_k1.RS, salary_results_k1.SP], axis=1)
salary_onecold_k1 = pd.from_dummies(salary_dummies_k1)
salary_onecold_k1.rename(columns={'':'Estado'}, inplace=True)
salary_results_k1 = pd.concat([salary_onecold_k1, salary_results_k1.drop(['RJ','RS','SP'], axis=1)], axis=1)

print(salary_results_k1)

  Estado Escolaridade  Altura  Salário Classe
0     RJ        Médio     178     2000      A
1     SP     Superior     200      800      B


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  salary_results_k1.Escolaridade.replace({1:'Médio',2:'Superior'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  salary_results_k1.Classe.replace({1:'A',0:'B'}, inplace=True)


### Solution for K = 3

> Creating a K-NN model where K = 3, looking to compare these obtained results for different values of K and theoretical values obtained at the Excel spreadsheet named "*exercise_2.xlsx*".

#### Construction of the model

In [290]:
salary_k3 = KNeighborsClassifier(n_neighbors=3)

#### Validation of the model

##### Instance a Stratified KFold cross validation for K = 3

In [291]:
# Defining Stratified KFold and the number of splits which our dataset should be divisible
salary_skf_k3 = StratifiedKFold(n_splits=3)
salary_k3_index = KNeighborsClassifier(n_neighbors=3)
salary_accuracies_k3 = []
salary_f1_scores_k3 = []

# Defining train and test dataframes for each index prepared by SKF, then defining accuracy for each index results
for i, (salary_train_index_k3, salary_test_index_k3) in enumerate(salary_skf_k3.split(salary_x, salary_y)):
    salary_x_train_k3 = salary_x.iloc[salary_train_index_k3]
    salary_y_train_k3 = salary_y.iloc[salary_train_index_k3]
    salary_x_test_k3 = salary_x.iloc[salary_test_index_k3]
    salary_y_test_k3 = salary_y.iloc[salary_test_index_k3]
    salary_k3_index.fit(salary_x_train_k3,salary_y_train_k3)
    salary_accuracies_k3.insert(i,salary_k3_index.score(salary_x_test_k3,salary_y_test_k3))
    salary_f1_scores_k3.insert(i,f1_score(salary_y_test_k3,salary_k3_index.predict(salary_x_test_k3), average='weighted'))



##### Showing accuracy results by Stratified KFold cross validation for K = 3

In [292]:
# Transforming the list of accuracies and F1 for each interactions of KFold into a dataframe
salary_accuracies_k3_df = pd.DataFrame(data=salary_accuracies_k3, columns=[''])
salary_f1_scores_k3_df = pd.DataFrame(data=salary_f1_scores_k3, columns=[''])

# Calculating the accuracy of model as mean of accuracies and F1 from each split of SKF
salary_accuracy_k3 = salary_accuracies_k3_df.mean()
salary_f1_score_k3 = salary_f1_scores_k3_df.mean()

# Getting standard deviation for accuracies
salary_accuracy_std_k3 = salary_accuracies_k3_df.std()
salary_f1_score_std_k3 = salary_f1_scores_k3_df.std()

print(f"Accuracy of the model is: {salary_accuracy_k3}\nIts standard deviation is: {salary_accuracy_std_k3}\n\n")
print(f"F1 score of the model is: {salary_f1_score_k3}\nIts standard deviation is: {salary_f1_score_std_k3}")

Accuracy of the model is:     0.666667
dtype: float64
Its standard deviation is:     0.288675
dtype: float64


F1 score of the model is:     0.555556
dtype: float64
Its standard deviation is:     0.3849
dtype: float64


#### Obtaining the label on examples to predict

In [293]:
salary_classifier_k3 = salary_k3.fit(salary_x,salary_y)

salary_classified_k3 = salary_classifier_k3.predict(salary_x_predict)
salary_results_k3 = salary_x_predict.copy()
salary_results_k3['Classe'] = salary_classified_k3

print(salary_results_k3)

   RJ  RS  SP  Escolaridade  Altura  Salário  Classe
0   1   0   0             1     178     2000       1
1   0   0   1             2     200      800       1


#### Returning numerical values into categorical for better interpretation

In [294]:
salary_results_k3.Escolaridade.replace({1:'Médio',2:'Superior'}, inplace=True)
salary_results_k3.Classe.replace({1:'A',0:'B'}, inplace=True)
salary_dummies_k3 = pd.concat([salary_results_k3.RJ, salary_results_k3.RS, salary_results_k3.SP], axis=1)
salary_onecold_k3 = pd.from_dummies(salary_dummies_k3)
salary_onecold_k3.rename(columns={'':'Estado'}, inplace=True)
salary_results_k3 = pd.concat([salary_onecold_k3, salary_results_k3.drop(['RJ','RS','SP'], axis=1)], axis=1)

print(salary_results_k3)

  Estado Escolaridade  Altura  Salário Classe
0     RJ        Médio     178     2000      A
1     SP     Superior     200      800      A


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  salary_results_k3.Escolaridade.replace({1:'Médio',2:'Superior'}, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  salary_results_k3.Classe.replace({1:'A',0:'B'}, inplace=True)
