---

# **Exercise - Decision Tree**

---

**Author**

> Vitor Eduardo de Souza Costa (13902723)

**References**
> - Brucce N. dos Santos and Solange O. Rezende. [Prática de Árvores de Decisão](https://edisciplinas.usp.br/mod/resource/view.php?id=5293825). May 2024.
> - Solange Rezende. [Algoritmos de Indução de Árvores de Decisão](https://edisciplinas.usp.br/pluginfile.php/8358567/mod_resource/content/1/Aula_14_IA_Arvores-de-Decisao.pdf). May 2024.

## Importing necessary libraries

In [49]:
import pandas as pd, graphviz
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import StratifiedKFold

---
## Decision Tree Model
---

### Creating a dataset from each patient's record

#### Writing the dataset file

In [50]:
%%writefile register_patient_train.tsv
Nome;Febre;Enjôo;Manchas;Dores;Diagnóstico
João;sim;sim;pequenas;sim;doente
Pedro;não;não;grandes;não;saudável
Maria;sim;sim;pequenas;não;saudável
José;sim;não;grandes;sim;doente
Ana;sim;não;pequenas;sim;saudável
Leila;não;não;grandes;sim;doente

Overwriting register_patient_train.tsv


#### Reading dataset file

In [51]:
dataset = pd.read_csv('register_patient_train.tsv', index_col='Nome', sep=';')

print(dataset)

      Febre Enjôo   Manchas Dores Diagnóstico
Nome                                         
João    sim   sim  pequenas   sim      doente
Pedro   não   não   grandes   não    saudável
Maria   sim   sim  pequenas   não    saudável
José    sim   não   grandes   sim      doente
Ana     sim   não  pequenas   sim    saudável
Leila   não   não   grandes   sim      doente


### Cleaning and treating data

> In this exercise we must predict the diagnosis of new patients from an already labeled dataset, so very little preparation is needed. Some of the following steps are therefore dispensable, but we reproduce them anyway to practice everything that was taught.

#### Quantifying the amount of null values

In [52]:
dataset.isnull().sum(axis=0)

Febre          0
Enjôo          0
Manchas        0
Dores          0
Diagnóstico    0
dtype: int64

#### Removing duplicate examples

In [53]:
dataset.drop_duplicates(inplace=True)

#### Print the unique values of each column

In [54]:
for col in dataset.columns:
  print(col, dataset[col].unique(), sep='\n\t')

Febre
	['sim' 'não']
Enjôo
	['sim' 'não']
Manchas
	['pequenas' 'grandes']
Dores
	['sim' 'não']
Diagnóstico
	['doente' 'saudável']


#### Transforming categorical data into numerical

In [55]:
# Assign the mapped column back instead of calling replace(..., inplace=True) on a column
# selection, which is deprecated chained assignment and stops working in pandas 3.0
dataset['Febre'] = dataset['Febre'].map({'sim': 1, 'não': 0})
dataset['Enjôo'] = dataset['Enjôo'].map({'sim': 1, 'não': 0})
dataset['Manchas'] = dataset['Manchas'].map({'grandes': 1, 'pequenas': 0})
dataset['Dores'] = dataset['Dores'].map({'sim': 1, 'não': 0})
dataset['Diagnóstico'] = dataset['Diagnóstico'].map({'saudável': 1, 'doente': 0})

print(dataset)

       Febre  Enjôo  Manchas  Dores  Diagnóstico
Nome                                            
João       1      1        0      1            0
Pedro      0      0        1      0            1
Maria      1      1        0      0            1
José       1      0        1      1            0
Ana        1      0        0      1            1
Leila      0      0        1      1            0

#### Filling null values with zeros

In [56]:
## Not a best practice, and unnecessary here because the dataset has no null values; done only for didactic purposes
dataset.fillna(0, inplace=True)

#### Separating the labels from the other attributes

In [57]:
y = dataset.Diagnóstico

# Drop every column that will not be used as a feature for classification, including the label
x = dataset.drop(['Diagnóstico'], inplace=False, axis=1)

print(x)

       Febre  Enjôo  Manchas  Dores
Nome                               
João       1      1        0      1
Pedro      0      0        1      0
Maria      1      1        0      0
José       1      0        1      1
Ana        1      0        0      1
Leila      0      0        1      1


### Construction/Training

#### Setting the decision tree model

In [58]:
# @title Building the decision tree

## Instantiate the classifier, using entropy as the split criterion
tree = DecisionTreeClassifier(criterion="entropy")

## Train the decision tree using all data
tree.fit(x, y)
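
With `criterion="entropy"`, each split is chosen to maximize information gain, that is, the reduction in label entropy. The sketch below recomputes that quantity by hand for the first split on the encoded training data. The helper names `entropy` and `information_gain` are illustrative and not part of scikit-learn, and the table is retyped from the output above.

```python
import numpy as np
import pandas as pd

# Encoded training data, retyped from the table above
# (1 = sim / grandes / saudável, 0 = não / pequenas / doente)
data = pd.DataFrame(
    {
        "Febre":       [1, 0, 1, 1, 1, 0],
        "Enjôo":       [1, 0, 1, 0, 0, 0],
        "Manchas":     [0, 1, 0, 1, 0, 1],
        "Dores":       [1, 0, 0, 1, 1, 1],
        "Diagnóstico": [0, 1, 1, 0, 1, 0],
    },
    index=["João", "Pedro", "Maria", "José", "Ana", "Leila"],
)


def entropy(labels: pd.Series) -> float:
    """Shannon entropy, in bits, of a label column (illustrative helper)."""
    probs = labels.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())


def information_gain(df: pd.DataFrame, feature: str, target: str = "Diagnóstico") -> float:
    """Entropy of the target minus the weighted entropy after splitting on `feature`."""
    weighted = sum(
        len(subset) / len(df) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return entropy(df[target]) - weighted


gains = {col: information_gain(data, col) for col in data.columns.drop("Diagnóstico")}
best_feature = max(gains, key=gains.get)

for col, gain in gains.items():
    print(f"{col}: {gain:.4f}")
print("Best first split:", best_feature)
```

On this data `Dores` has the largest gain, because every patient without pain is healthy, so that branch is already pure.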

#### Rendering the decision tree to a PDF

In [59]:
# Class names follow the encoded label order: 0 = doente, 1 = saudável
labels_name = ['Doente', 'Saudável']
graph_data = export_graphviz(tree, feature_names=list(x.columns), class_names=labels_name, filled=True)
graph = graphviz.Source(graph_data)
graph.render('tree_diagram')

'tree_diagram.pdf'
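
If Graphviz is not installed, the learned rules can also be inspected as plain text. This is a minimal sketch using `sklearn.tree.export_text`. It retrains an equivalent tree on the retyped table so the block runs on its own, and `random_state=0` is an assumption added here only for reproducibility.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded features and labels, retyped from the tables above
features = pd.DataFrame(
    {
        "Febre":   [1, 0, 1, 1, 1, 0],
        "Enjôo":   [1, 0, 1, 0, 0, 0],
        "Manchas": [0, 1, 0, 1, 0, 1],
        "Dores":   [1, 0, 0, 1, 1, 1],
    },
    index=["João", "Pedro", "Maria", "José", "Ana", "Leila"],
)
labels = pd.Series([0, 1, 1, 0, 1, 0], index=features.index, name="Diagnóstico")

# random_state is fixed only so repeated runs build the same tree
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(features, labels)

# Text rendering of the tree; class 0 = doente, class 1 = saudável
tree_text = export_text(clf, feature_names=list(features.columns))
print(tree_text)
```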

### Validation of model

#### Instantiating a stratified k-fold cross-validation

In [60]:
# Stratified k-fold with 3 splits: the 6 examples divide evenly, giving 2 test examples per fold
skf = StratifiedKFold(n_splits=3)
tree_index = DecisionTreeClassifier(criterion="entropy")
accuracies = []
f1_scores = []

# For each split produced by SKF, build the train and test sets, fit the tree, and record accuracy and F1
for train_index, test_index in skf.split(x, y):
    x_train = x.iloc[train_index]
    y_train = y.iloc[train_index]
    x_test = x.iloc[test_index]
    y_test = y.iloc[test_index]
    tree_index.fit(x_train, y_train)
    accuracies.append(tree_index.score(x_test, y_test))
    f1_scores.append(f1_score(y_test, tree_index.predict(x_test), average='weighted'))
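
To see what stratification does on a dataset this small, the sketch below lists the composition of each fold. It uses only the label vector from the table above. `placeholder_features` is a stand-in array introduced here, since the fold assignment depends on the labels and not on the feature values.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Encoded diagnoses in dataset order (0 = doente, 1 = saudável)
labels = np.array([0, 1, 1, 0, 1, 0])
# Stand-in feature matrix: only its number of rows matters for the split
placeholder_features = np.zeros((len(labels), 1))

skf_demo = StratifiedKFold(n_splits=3)
fold_summaries = []
for fold, (train_idx, test_idx) in enumerate(skf_demo.split(placeholder_features, labels)):
    test_labels = sorted(labels[test_idx].tolist())
    fold_summaries.append((len(train_idx), len(test_idx), test_labels))
    print(f"fold {fold}: {len(train_idx)} train, {len(test_idx)} test, test labels {test_labels}")
```

Each fold trains on four patients and tests on two, one per class, which keeps the class balance of the full dataset in every split.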



#### Showing accuracy and F1 results from the stratified k-fold cross-validation

In [61]:
# Turn the per-fold lists of accuracies and F1 scores into dataframes
accuracies_df = pd.DataFrame(data=accuracies, columns=[''])
f1_scores_df = pd.DataFrame(data=f1_scores, columns=[''])

# The model's accuracy and F1 are the means over the SKF splits
accuracy = accuracies_df.mean()
f1_mean = f1_scores_df.mean()  # not named f1_score, which would shadow the imported function

# Standard deviation of each metric across the splits
accuracy_std = accuracies_df.std()
f1_score_std = f1_scores_df.std()

print(f"Accuracy of the model is: {accuracy}\nIts standard deviation is: {accuracy_std}\n\n")
print(f"F1 score of the model is: {f1_mean}\nIts standard deviation is: {f1_score_std}")

Accuracy of the model is:     0.5
dtype: float64
Its standard deviation is:     0.5
dtype: float64


F1 score of the model is:     0.444444
dtype: float64
Its standard deviation is:     0.509175
dtype: float64


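> The dataset is very small: with six patients and three folds, each test fold holds only two examples, so the accuracy of a fold can only be 0, 0.5, or 1. The estimated accuracy is therefore both low and highly variable.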
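
The same validation can be written more compactly with scikit-learn's `cross_validate`, which runs the split, fit, and scoring loop internally. This is a sketch under two assumptions: the table is retyped inline so the block runs on its own, and `random_state=0` is added for reproducibility. Note that NumPy's `std` defaults to the population formula, so `ddof=1` is passed to match the pandas `std` used above.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Encoded features and labels, retyped from the tables above
features = pd.DataFrame(
    {
        "Febre":   [1, 0, 1, 1, 1, 0],
        "Enjôo":   [1, 0, 1, 0, 0, 0],
        "Manchas": [0, 1, 0, 1, 0, 1],
        "Dores":   [1, 0, 0, 1, 1, 1],
    }
)
labels = pd.Series([0, 1, 1, 0, 1, 0], name="Diagnóstico")

cv_results = cross_validate(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    features,
    labels,
    cv=StratifiedKFold(n_splits=3),
    scoring=["accuracy", "f1_weighted"],
)

fold_accuracies = cv_results["test_accuracy"]
fold_f1 = cv_results["test_f1_weighted"]

# ddof=1 gives the sample standard deviation, as pandas does by default
print("accuracy:", fold_accuracies.mean(), "+/-", fold_accuracies.std(ddof=1))
print("weighted F1:", fold_f1.mean(), "+/-", fold_f1.std(ddof=1))
```

Folds in which one class is never predicted may emit an `UndefinedMetricWarning` for F1; with two test examples per fold that is expected rather than a sign of a bug.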

---
## Prediction for new examples
---

> We can now define the new cases given in the exercise and classify them with the decision tree trained above.

### Defining new dataframe for prediction

#### Writing dataframe file

In [62]:
%%writefile new_patients_predict.tsv
Nome;Febre;Enjôo;Manchas;Dores
Luis;não;não;pequenas;sim
Laura;sim;sim;grandes;sim

Overwriting new_patients_predict.tsv


#### Reading dataframe file

In [63]:
x_predict = pd.read_csv('new_patients_predict.tsv', index_col='Nome', sep=';')

#### Converting categorical values into numerical

In [64]:
# Same encoding as the training data, assigned back to avoid deprecated chained inplace calls
x_predict['Febre'] = x_predict['Febre'].map({'sim': 1, 'não': 0})
x_predict['Enjôo'] = x_predict['Enjôo'].map({'sim': 1, 'não': 0})
x_predict['Manchas'] = x_predict['Manchas'].map({'grandes': 1, 'pequenas': 0})
x_predict['Dores'] = x_predict['Dores'].map({'sim': 1, 'não': 0})
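
Because the training data and the new patients must be encoded identically, one option is to keep the mapping in a single place. The sketch below does that with a hypothetical `ENCODING` dictionary and `encode_patients` helper; neither name comes from the exercise.

```python
import pandas as pd

# Single source of truth for the categorical-to-numerical encoding (hypothetical name)
ENCODING = {
    "Febre": {"sim": 1, "não": 0},
    "Enjôo": {"sim": 1, "não": 0},
    "Manchas": {"grandes": 1, "pequenas": 0},
    "Dores": {"sim": 1, "não": 0},
    "Diagnóstico": {"saudável": 1, "doente": 0},
}


def encode_patients(df: pd.DataFrame) -> pd.DataFrame:
    """Return an encoded copy of `df`; columns missing from `df` are skipped."""
    encoded = df.copy()
    for column, mapping in ENCODING.items():
        if column in encoded.columns:
            # Values absent from the mapping become NaN, which exposes typos early
            encoded[column] = encoded[column].map(mapping)
    return encoded


# The two new patients from the file above, retyped inline
new_patients = pd.DataFrame(
    {
        "Febre": ["não", "sim"],
        "Enjôo": ["não", "sim"],
        "Manchas": ["pequenas", "grandes"],
        "Dores": ["sim", "sim"],
    },
    index=pd.Index(["Luis", "Laura"], name="Nome"),
)

encoded_new = encode_patients(new_patients)
print(encoded_new)
```

The same helper would work on the training table, since it also handles the `Diagnóstico` column when present.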

### Prediction of new cases

In [65]:
y_predict = tree.predict(x_predict)
result = x_predict.copy()
result['Diagnóstico'] = y_predict

## Convert the numerical label values back into categorical ones
result['Diagnóstico'] = result['Diagnóstico'].map({1: 'Saudável', 0: 'Doente'})

print(result['Diagnóstico'])

Nome
Luis     Saudável
Laura      Doente
Name: Diagnóstico, dtype: object
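
As a complement to the hard predictions, `predict_proba` reports the class distribution of the leaf each patient reaches. The sketch below retrains an equivalent tree on the retyped data so it runs on its own; `random_state=0` is an assumption added for reproducibility.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

columns = ["Febre", "Enjôo", "Manchas", "Dores"]

# Encoded training data, retyped from the tables above
train_x = pd.DataFrame(
    [[1, 1, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0], [1, 0, 1, 1], [1, 0, 0, 1], [0, 0, 1, 1]],
    columns=columns,
    index=["João", "Pedro", "Maria", "José", "Ana", "Leila"],
)
train_y = pd.Series([0, 1, 1, 0, 1, 0], index=train_x.index, name="Diagnóstico")

# Encoded new patients (Luis and Laura)
new_x = pd.DataFrame([[0, 0, 0, 1], [1, 1, 1, 1]], columns=columns, index=["Luis", "Laura"])

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(train_x, train_y)

predictions = clf.predict(new_x)
# One column per class, ordered as in clf.classes_ (0 = doente, 1 = saudável)
probabilities = pd.DataFrame(
    clf.predict_proba(new_x), index=new_x.index, columns=["Doente", "Saudável"]
)
print(probabilities)
```

Since the tree is grown until every leaf is pure, these probabilities are all 0 or 1; they describe the training examples in the leaf, not how certain the diagnosis is.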
