### Imports
First, the necessary libraries are imported:
 - pandas: to read the dataset
 - numpy: for the array operations used when inspecting the decision tree
 - matplotlib: for plotting
 - sklearn model_selection: for model selection; here it provides the train/test split and k-fold cross-validation

The remaining imports are the chosen regression methods:
 - Linear Regression
 - Elastic Net
 - Decision Tree (regression)
 - Random Forest (regression)
 - Gradient Boosting (regression)
 - Support Vector (regression)

In [227]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from sklearn import tree
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

### Dataset
The dataset is then read from the file 'Analise Geral Normalizada.xlsx' and split into:
- X: the input features (columns 1 to 22).
- Y: the value to be predicted (column 23).

In [228]:
df = pd.read_excel("Analise Geral Normalizada.xlsx")
array = df.values
X = array[:,1:23]
Y = array[:,23]
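
The same split can be written against the DataFrame with `iloc`, which keeps the column ranges easy to audit. This is a minimal sketch on a hypothetical synthetic frame: the layout (column 0 skipped, 22 feature columns, target in column 23) mirrors the slicing above, but the reason column 0 is skipped and the real column names are not shown in this notebook, so `demo_df`, `X_demo` and `y_demo` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the spreadsheet: 24 columns, where column 0 is
# skipped, columns 1..22 are features and column 23 is the target.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.random((5, 24)))

X_demo = demo_df.iloc[:, 1:23].to_numpy()  # 22 feature columns
y_demo = demo_df.iloc[:, 23].to_numpy()    # target column

print(X_demo.shape, y_demo.shape)
```

Checking the shapes right after the split is a cheap way to catch an off-by-one error in the column range.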

### Test models
The test models are created with their default hyperparameters.

In [229]:
linearReg = LinearRegression()
elastic = ElasticNet()
Dtree = DecisionTreeRegressor()
forest = RandomForestRegressor()
boosting = GradientBoostingRegressor()
supportVector = SVR()

### Tests with fit
Each model is fitted on an 80/20 train/test split and scored on the held-out 20%.

In [230]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=42)

### Linear Regression

In [231]:
linearReg.fit(X_train, y_train)

y_pred = linearReg.predict(X_test)
print("Predictions:", y_pred[:10])


Predictions: [0.13820664 0.61106307 0.71278126 0.1279912  0.71684887 0.97760433
 0.37771122 0.64218896 0.49525694 0.40706873]


In [232]:
print(linearReg.coef_)

[-0.1144203  -0.01030204  0.09960172  0.59311501 -0.05712835  0.24095261
 -0.07550595 -0.06795311 -1.19533992  1.42507437 -0.28058173 -0.00216886
  0.3374322   0.17842422 -0.17140218 -0.09950561  0.39563949  0.05883316
  0.06690265  0.10312122 -0.12843137  0.05908376]


In [233]:
score = linearReg.score(X_test, y_test)
print("R^2 score:", score)

R^2 score: 0.7743193368735054
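
For regressors, `score` returns the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data to show what the number above measures; the data, the 80/20 slicing and all `*_demo` names are assumptions made for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

# Fit on the first 80 rows, evaluate on the last 20.
model = LinearRegression().fit(X_demo[:80], y_demo[:80])
y_true = y_demo[80:]
y_hat = model.predict(X_demo[80:])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

# The manual value should agree with the estimator's own score.
print(np.isclose(r2_manual, model.score(X_demo[80:], y_true)))
```

An R² of 0 corresponds to always predicting the mean of the test target, and negative values mean the model does worse than that baseline.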


### Elastic Net

In [234]:
elastic.fit(X_train, y_train)

y_pred = elastic.predict(X_test)
print("Predictions:", y_pred[:10])

Predictions: [0.36901506 0.36901506 0.36901506 0.36901506 0.36901506 0.36901506
 0.36901506 0.36901506 0.36901506 0.36901506]


In [235]:
print(elastic.coef_)

[ 0. -0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0. -0.  0.  0.
 -0.  0.  0.  0.]


In [236]:
score = elastic.score(X_test, y_test)
print("R^2 score:", score)

R^2 score: -0.002521183097285995
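
The all-zero coefficients and the R² close to zero are consistent with a penalty that is too strong for the scale of the data. `ElasticNet()` defaults to `alpha=1.0` and `l1_ratio=0.5`, and with normalized inputs the L1 part can shrink every coefficient to zero, leaving only the intercept. The constant prediction 0.36901506 is also the root-node value of the decision tree in this notebook, that is, the mean of the training target. The sketch below reproduces the effect on synthetic data; the value `alpha=0.001` is an assumption chosen for illustration, not a tuned setting for this dataset.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic data with features in [0, 1], similar in scale to normalized inputs.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 5))
y_demo = X_demo @ np.array([0.5, 0.0, -0.3, 0.2, 0.0]) + 0.5

strong = ElasticNet(alpha=1.0).fit(X_demo, y_demo)   # default penalty strength
weak = ElasticNet(alpha=0.001).fit(X_demo, y_demo)   # much weaker penalty (assumed value)

print("alpha=1.0  :", strong.coef_)
print("alpha=0.001:", weak.coef_)
```

If Elastic Net is to be compared fairly with the other models, `alpha` would likely need to be selected by cross-validation, for example with `sklearn.linear_model.ElasticNetCV`.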


### Decision Tree

In [237]:
Dtree.fit(X_train, y_train)

y_pred = Dtree.predict(X_test)
print("Predictions:", y_pred[:10])

Predictions: [0.01830035 0.50920737 0.7650692  0.14823287 0.88493652 0.99405238
 0.15578177 0.59647718 0.34804987 0.29452133]


In [238]:
print(Dtree.tree_.node_count)

271


In [239]:
score = Dtree.score(X_test, y_test)
print("R^2 score:", score)

R^2 score: 0.9612878406434329


In [240]:
n_nodes = Dtree.tree_.node_count
children_left = Dtree.tree_.children_left
children_right = Dtree.tree_.children_right
feature = Dtree.tree_.feature
threshold = Dtree.tree_.threshold
values = Dtree.tree_.value

node_depth = np.zeros(shape=n_nodes, dtype=np.int64)
is_leaves = np.zeros(shape=n_nodes, dtype=bool)
stack = [(0, 0)]  # start with the root node id (0) and its depth (0)
while len(stack) > 0:
    # `pop` ensures each node is only visited once
    node_id, depth = stack.pop()
    node_depth[node_id] = depth

    # If the left and right child of a node is not the same we have a split
    # node
    is_split_node = children_left[node_id] != children_right[node_id]
    # If a split node, append left and right children and depth to `stack`
    # so we can loop through them
    if is_split_node:
        stack.append((children_left[node_id], depth + 1))
        stack.append((children_right[node_id], depth + 1))
    else:
        is_leaves[node_id] = True

print(
    "The binary tree structure has {n} nodes and has "
    "the following tree structure:\n".format(n=n_nodes)
)
for i in range(n_nodes):
    if is_leaves[i]:
        print(
            "{space}node={node} is a leaf node with value={value}.".format(
                space=node_depth[i] * "\t", node=i, value=values[i]
            )
        )
    else:
        print(
            "{space}node={node} is a split node with value={value}: "
            "go to node {left} if X[:, {feature}] <= {threshold} "
            "else to node {right}.".format(
                space=node_depth[i] * "\t",
                node=i,
                left=children_left[i],
                feature=feature[i],
                threshold=threshold[i],
                right=children_right[i],
                value=values[i],
            )
        )

The binary tree structure has 271 nodes and has the following tree structure:

node=0 is a split node with value=[[0.36901506]]: go to node 1 if X[:, 3] <= 0.6411448419094086 else to node 200.
	node=1 is a split node with value=[[0.24296656]]: go to node 2 if X[:, 3] <= 0.4746275395154953 else to node 161.
		node=2 is a split node with value=[[0.19850166]]: go to node 3 if X[:, 5] <= 0.20239591598510742 else to node 106.
			node=3 is a split node with value=[[0.15847183]]: go to node 4 if X[:, 3] <= 0.10608267039060593 else to node 29.
				node=4 is a split node with value=[[0.25282644]]: go to node 5 if X[:, 3] <= 0.06771592050790787 else to node 24.
					node=5 is a split node with value=[[0.26248427]]: go to node 6 if X[:, 9] <= 0.04108499363064766 else to node 19.
						node=6 is a split node with value=[[0.25736508]]: go to node 7 if X[:, 19] <= 0.3689735680818558 else to node 18.
							node=7 is a split node with value=[[0.25931221]]: go to node 8 if X[:, 19] <= 0.03382581565529
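
A more compact view of the same structure can be obtained with `sklearn.tree.export_text`, which prints the split rules as an indented text report. The sketch below uses a small synthetic tree; the feature names `f0`, `f1`, `f2` and the `max_depth=2` limit are assumptions made to keep the output short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Small synthetic problem so the printed tree stays readable.
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1]

demo_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, y_demo)

# export_text returns a string with one line per split or leaf.
report = export_text(demo_tree, feature_names=["f0", "f1", "f2"])
print(report)
```

For a tree with 271 nodes, passing `max_depth` to `export_text` limits how many levels are printed.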

### Random Forest

In [241]:
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
print("Predictions:", y_pred[:10])

Predictions: [0.03552328 0.56397804 0.74140455 0.15172366 0.89666705 0.9568226
 0.1873453  0.55941553 0.43087129 0.31636052]


In [243]:
score = forest.score(X_test, y_test)
print("R^2 score:", score)

R^2 score: 0.9812747345084193


### Gradient Boosting

In [244]:
boosting.fit(X_train, y_train)

y_pred = boosting.predict(X_test)
print("Predictions:", y_pred[:10])

Predictions: [0.03079668 0.5654899  0.73994872 0.15419837 0.86136649 0.96158482
 0.19786776 0.56255851 0.44088295 0.33043663]


In [246]:
score = boosting.score(X_test, y_test)
print("R^2 score:", score)

R^2 score: 0.9675101913649108


### Support Vector

In [247]:
supportVector.fit(X_train, y_train)

y_pred = supportVector.predict(X_test)
print("Predictions:", y_pred[:10])

Predictions: [0.09736798 0.66388512 0.68831663 0.14788433 0.76942262 0.92412936
 0.29357767 0.60849423 0.40973288 0.3183579 ]


In [249]:
score = supportVector.score(X_test, y_test)
print("R^2 score:", score)

R^2 score: 0.8322128167607192


### K-fold
A k-fold splitter is created with 12 folds. Beyond 12 folds there was no considerable improvement in the resulting R² scores, and the runs become quite costly from that point on.

In [250]:
seed = 0
kfold = model_selection.KFold(n_splits=12, random_state=seed, shuffle=True)
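
To make the splitter's behavior concrete, the sketch below iterates over a `KFold` object on a tiny synthetic array. The 12 samples and 4 folds are assumptions chosen so that the fold sizes are easy to check by eye; they are not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(24).reshape(12, 2)  # 12 samples, 2 features
demo_kfold = KFold(n_splits=4, shuffle=True, random_state=0)

fold_sizes = []
seen_test_indices = []
for fold, (train_idx, test_idx) in enumerate(demo_kfold.split(X_demo)):
    # Each sample lands in the test part of exactly one fold.
    fold_sizes.append(len(test_idx))
    seen_test_indices.extend(test_idx.tolist())
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```

With `shuffle=True`, fixing `random_state` makes the fold assignment reproducible between runs.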

In [251]:
linearReg = LinearRegression()
elastic = ElasticNet()
tree = DecisionTreeRegressor()
forest = RandomForestRegressor()
boosting = GradientBoostingRegressor()
supportVector = SVR()

### Cross Validation
Cross-validation is run over the k folds for each of the test models.

In [252]:
results = [['Linear Regression:', model_selection.cross_val_score(linearReg, X, Y, cv=kfold)],
           ['Elastic Net:', model_selection.cross_val_score(elastic, X, Y, cv=kfold)],
           ['Decision Tree:', model_selection.cross_val_score(tree, X, Y, cv=kfold)],
           ['Random Forest:', model_selection.cross_val_score(forest, X, Y, cv=kfold)],
           ['Gradient Boosting:', model_selection.cross_val_score(boosting, X, Y, cv=kfold)],
           ['Support Vector:', model_selection.cross_val_score(supportVector, X, Y, cv=kfold)],
]
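
For regressors, `cross_val_score` uses the estimator's default scorer, which is R², so the values collected above are R² per fold. The sketch below shows the same pattern with a dictionary of models and an explicit `scoring="r2"`, and also reports the standard deviation across folds. It runs on synthetic data with 5 folds; the data, the fold count and the `demo_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic, mostly linear regression problem (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.random((120, 4))
y_demo = X_demo @ np.array([1.0, 0.5, -0.5, 0.0]) + 0.05 * rng.standard_normal(120)

demo_models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}
demo_cv = KFold(n_splits=5, shuffle=True, random_state=0)

summary = {}
for name, model in demo_models.items():
    # One R^2 value per fold; the same splitter is reused for every model.
    scores = cross_val_score(model, X_demo, y_demo, cv=demo_cv, scoring="r2")
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the spread alongside the mean helps when two models have close averages but one varies much more from fold to fold.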

### Analysis
Final analysis of the cross-validation tests, showing the per-fold R² scores and their mean for each model.

In [253]:
for result in results:
    print(result[0], result[1])
    print('Mean', result[1].mean())
    print()

Linear Regression: [0.84577921 0.81880755 0.67316311 0.78250798 0.62509547 0.86103952
 0.59735783 0.84649799 0.84674514 0.39750102 0.87190733 0.82943846]
Mean 0.7496533844182457

Elastic Net: [-2.78961815e-02 -6.20156253e-02 -3.26208873e-04 -1.07068978e-01
 -1.42875970e-01 -4.83975947e-02 -7.12118232e-01 -5.14377089e-03
 -2.98824541e-02 -3.38697044e-01 -2.43489135e-05 -9.48001590e-02]
Mean -0.13077054724227172

Decision Tree: [0.87334379 0.6959032  0.96399474 0.98174688 0.85776857 0.94910625
 0.90436934 0.95974148 0.9482513  0.80605517 0.96559567 0.98777356]
Mean 0.9078041631214725

Random Forest: [0.96472785 0.82569747 0.97213178 0.98641822 0.85129821 0.96606305
 0.97750351 0.98485167 0.96858404 0.91642942 0.98032722 0.96721781]
Mean 0.9467708535380585

Gradient Boosting: [0.96390893 0.87004673 0.94818565 0.98201628 0.83718664 0.97664659
 0.98443862 0.97323495 0.97642582 0.79699684 0.98334314 0.9503105 ]
Mean 0.9368950571552889

Support Vector: [0.80028897 0.75597856 0.76471198 0

The best results were obtained with Random Forest (mean R² 0.947), Gradient Boosting (0.937), and Decision Tree (0.908).