# Multivariate Linear Regression - Assignment

## Case study: Wine Quality

In this assignment, we train a linear regression model with stochastic gradient descent on the Wine Quality dataset. The example assumes that a CSV copy of the dataset is in the current working directory under the file name *winequality-white.csv*.

The Wine Quality dataset involves predicting the quality of white wines on a scale, given chemical measurements of each wine. It is a multiclass classification problem, but it can also be framed as a regression problem. The number of observations per class is not balanced. There are 4,898 observations with 11 input variables and 1 output variable. The variable names are as follows:

1. Fixed acidity.
2. Volatile acidity.
3. Citric acid.
4. Residual sugar.
5. Chlorides.
6. Free sulfur dioxide. 
7. Total sulfur dioxide. 
8. Density.
9. pH.
10. Sulphates.
11. Alcohol.
12. Quality (score between 0 and 10).
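
The class imbalance mentioned above can be checked by counting how many wines received each quality score. The sketch below is a minimal illustration that parses semicolon-delimited text with the standard `csv` module. The three data rows in `SAMPLE` are made-up placeholders rather than real records, and `quality_counts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter

# Made-up rows in a semicolon-delimited layout similar to the dataset file
# (only three columns are shown to keep the sample short).
SAMPLE = '''"alcohol";"pH";"quality"
8.8;3.0;6
9.5;3.3;6
10.1;3.26;5
'''


def quality_counts(text_stream):
    """Count how many rows carry each value in the last (quality) column."""
    reader = csv.reader(text_stream, delimiter=";")
    next(reader)  # skip the header row
    return Counter(int(row[-1]) for row in reader if row)


counts = quality_counts(io.StringIO(SAMPLE))
for score in sorted(counts):
    print(score, counts[score])
```

With the real file, an open file handle such as `open("winequality-white.csv", newline="")` could be passed in place of the `io.StringIO` wrapper.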

The baseline, which always predicts the mean quality, achieves an RMSE of approximately 0.148 quality points.
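
A mean-value baseline like the one quoted above can be sketched in a few lines: predict the mean of the training targets for every test observation and score the result with RMSE. `baseline_rmse` and the sample values are illustrative assumptions. The score depends on the scale of the target, so a figure as small as 0.148 presumably refers to a rescaled quality column rather than raw 0-10 scores.

```python
from math import sqrt


def baseline_rmse(train_targets, test_targets):
    """RMSE obtained by predicting the training mean for every test target."""
    mean_value = sum(train_targets) / len(train_targets)
    squared_errors = [(mean_value - y) ** 2 for y in test_targets]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only, not taken from the dataset.
train_y = [5.0, 6.0, 7.0]
test_y = [5.0, 7.0]
print(baseline_rmse(train_y, test_y))
```

Any trained model should be compared against this number computed on the same split and in the same scale as its own predictions.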

Use the example presented in the tutorial and modify it to load the data and evaluate the accuracy of your solution.


### Student: Marcos Felipe de Menezes Mota - 354080

In [19]:
from math import sqrt

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize


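def predict(row, coefficients):
    """Predict the target for one row; the last element of row is the target and is skipped."""
    yhat = coefficients[0]  # intercept
    for i in range(len(row) - 1):
        yhat += coefficients[i + 1] * row[i]
    return yhat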


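def coefficients_sgd(train, l_rate, n_epoch):
    """Estimate linear regression coefficients with stochastic gradient descent."""
    # One coefficient per input column plus the intercept (the target column is excluded).
    coef = [0.0 for _ in range(len(train[0]))]
    # str.format fills the {0} placeholder; the % operator would leave it printed literally.
    print('Initial coefficients={0}'.format(coef))
    for epoch in range(n_epoch):
        sum_error = 0.0
        for row in train:
            yhat = predict(row, coef)
            error = yhat - row[-1]
            sum_error += error ** 2
            # Update the intercept, then each weight, from this single row's error.
            coef[0] = coef[0] - l_rate * error
            for i in range(len(row) - 1):
                coef[i + 1] = coef[i + 1] - l_rate * error * row[i]
        print('epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
    return coef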


def rmse_metric(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    sum_error = 0.0
    for i in range(len(actual)):
        prediction_error = predicted[i] - actual[i]
        sum_error += prediction_error ** 2
    mean_error = sum_error / float(len(actual))
    return sqrt(mean_error)

dataset = pd.read_csv("winequality-white.csv", delimiter=";")
dataset = dataset.values
# normalize() defaults to norm='l2' and axis=1, so each row (target included) is scaled to unit length.
dataset = normalize(dataset)
data_train, data_test = train_test_split(dataset, test_size=0.2)

print(data_train)

[[ 0.03387099  0.00120558  0.0013778  ...,  0.00223893  0.05453803
   0.02870423]
 [ 0.06434697  0.00255837  0.00333364 ...,  0.00364374  0.07209961
   0.03876323]
 [ 0.03719477  0.00091839  0.00225005 ...,  0.0021123   0.04591947
   0.03214363]
 ..., 
 [ 0.07470573  0.00245462  0.00277478 ...,  0.00586974  0.10138635
   0.06403348]
 [ 0.05133141  0.00150572  0.00225858 ...,  0.00465405  0.07802374
   0.04106513]
 [ 0.0866237   0.00210372  0.00457868 ...,  0.00569241  0.13364799
   0.0866237 ]]
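
The printed values are small because `sklearn.preprocessing.normalize` rescales each row to unit L2 norm by default (target column included), rather than rescaling each column. A column-wise min-max rescaling is a common alternative for this kind of exercise, and the sketch below shows one way to write it in plain Python. `column_minmax` and `minmax_scale_rows` are hypothetical helper names, and the rows are made up for illustration.

```python
def column_minmax(rows):
    """Return a (min, max) pair for every column of a list of rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def minmax_scale_rows(rows, minmax):
    """Rescale every column to the range [0, 1] using precomputed bounds."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0.0
            for value, (lo, hi) in zip(row, minmax)
        ])
    return scaled


# Made-up rows: two feature columns followed by the target column.
rows = [[7.0, 0.2, 5.0], [9.0, 0.4, 6.0], [8.0, 0.3, 8.0]]
bounds = column_minmax(rows)
print(minmax_scale_rows(rows, bounds))
```

Because the bounds are returned separately, they can be computed on the training split only and then reused for the test split.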


In [24]:
l_rate = 0.01
n_epoch = 30

coef = coefficients_sgd(data_train.tolist(), l_rate, n_epoch)
print(coef)

Coeficiente Inicial={0}
epoch=0, lrate=0.010, error=1.407
epoch=1, lrate=0.010, error=1.177
epoch=2, lrate=0.010, error=1.028
epoch=3, lrate=0.010, error=0.902
epoch=4, lrate=0.010, error=0.796
epoch=5, lrate=0.010, error=0.706
epoch=6, lrate=0.010, error=0.630
epoch=7, lrate=0.010, error=0.566
epoch=8, lrate=0.010, error=0.512
epoch=9, lrate=0.010, error=0.466
epoch=10, lrate=0.010, error=0.427
epoch=11, lrate=0.010, error=0.395
epoch=12, lrate=0.010, error=0.367
epoch=13, lrate=0.010, error=0.344
epoch=14, lrate=0.010, error=0.324
epoch=15, lrate=0.010, error=0.307
epoch=16, lrate=0.010, error=0.293
epoch=17, lrate=0.010, error=0.281
epoch=18, lrate=0.010, error=0.271
epoch=19, lrate=0.010, error=0.262
epoch=20, lrate=0.010, error=0.255
epoch=21, lrate=0.010, error=0.249
epoch=22, lrate=0.010, error=0.244
epoch=23, lrate=0.010, error=0.239
epoch=24, lrate=0.010, error=0.236
epoch=25, lrate=0.010, error=0.233
epoch=26, lrate=0.010, error=0.230
epoch=27, lrate=0.010, error=0.228
epoch=
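
A quick way to sanity-check an SGD training loop like the one above is to run it on noise-free data with a known linear relationship and confirm that the coefficients are recovered. The sketch below re-implements the same per-row update rule in compact form. `sgd_fit` is a hypothetical name, and the data are synthetic, generated from y = 1 + 2x.

```python
def sgd_fit(rows, l_rate, n_epoch):
    """Fit intercept and weights by SGD; the last element of each row is the target."""
    coef = [0.0] * len(rows[0])
    for _ in range(n_epoch):
        for row in rows:
            features, target = row[:-1], row[-1]
            yhat = coef[0] + sum(c * x for c, x in zip(coef[1:], features))
            error = yhat - target
            coef[0] -= l_rate * error  # intercept update
            for i, x in enumerate(features):
                coef[i + 1] -= l_rate * error * x  # weight update
    return coef


# Synthetic, noise-free data generated from y = 1 + 2x.
rows = [[x / 4.0, 1.0 + 2.0 * (x / 4.0)] for x in range(5)]
coef = sgd_fit(rows, l_rate=0.1, n_epoch=500)
print(coef)
```

If the recovered intercept and slope are not close to 1 and 2, the update rule or the learning rate deserves a second look before training on real data.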

In [25]:
def test_predict(dataset, coefficients):
    """Return the model's prediction for every row of dataset."""
    return [predict(row, coefficients) for row in dataset]

In [28]:
y_pred = test_predict(data_test, coef)
print(y_pred)

[0.050871531415869503, 0.030857179981893121, 0.062964789343552319, 0.030520093107617958, 0.027404513593404466, 0.060804853354153143, 0.046397826677712112, 0.050145472321706713, 0.074604473565915666, 0.061633889372975065, 0.035989847897245984, 0.05301212325018112, 0.043416220253610083, 0.071062491156764607, 0.04727507143335867, 0.057244647982916202, 0.051296165624744372, 0.076503238482190206, 0.050361654906679251, 0.047280874510256296, 0.044561342241359943, 0.027035958957764436, 0.029073834283720248, 0.034521174971068616, 0.034560360504433368, 0.028701360406641518, 0.066112631778924791, 0.041981546710690952, 0.058614394987136276, 0.040429631830565804, 0.040721547285493556, 0.027725854721566419, 0.043437680801333589, 0.052153772269223728, 0.040737793291901331, 0.04441897890026536, 0.032179416813290257, 0.05315466427567684, 0.028076005689954171, 0.035624170614166305, 0.056610694574571088, 0.067005977541398296, 0.073314375255596864, 0.036739711912576689, 0.070439672629330652, 0.04432327484

In [37]:
print(data_test[:,-1].tolist())

[0.0430433853080912, 0.027962641959843815, 0.060479486152498006, 0.019431419716201777, 0.022674016403607347, 0.07617130857365903, 0.042807655227836125, 0.05046695238576308, 0.06812844042259575, 0.06594320808258869, 0.04571444420296451, 0.05786812219068246, 0.04479497830415948, 0.07894011708178932, 0.047675010689764354, 0.0619548138129868, 0.057768851694845094, 0.08789813309476784, 0.04499429893843479, 0.05303699643342404, 0.05060177498774814, 0.02207111724063403, 0.0343249631952334, 0.029038132351423357, 0.034302032380470145, 0.03637150430451434, 0.06513439284647424, 0.04017036733067905, 0.04222823462070803, 0.04044496641189083, 0.04431329204749007, 0.021322425761744614, 0.05043749676638263, 0.052784954339487294, 0.03941106434082922, 0.04137553177428458, 0.035865884348005334, 0.05008054883550479, 0.02173036985404247, 0.029718984751528427, 0.053573355659924625, 0.08973778066514383, 0.06325611060210598, 0.035461435234930776, 0.07939358141914474, 0.04366818883233077, 0.05198170428368488, 

In [34]:
print(rmse_metric(data_test[:,-1], y_pred))

0.0076665329240807635
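
The RMSE above is measured on the normalized scale, so it is not expressed in quality points and cannot be compared directly with a figure computed on a different scale. Since row-wise L2 normalization divides each row by its own norm, one way to get back to the original scale is to keep those norms and multiply each normalized value by the norm of its row. The following is a minimal sketch with made-up rows; `l2_normalize_rows` and `denormalize` are hypothetical helpers, not part of scikit-learn.

```python
from math import sqrt


def l2_normalize_rows(rows):
    """Scale each row to unit L2 norm and also return the norms that were used.

    Assumes no row consists entirely of zeros.
    """
    norms = [sqrt(sum(v * v for v in row)) for row in rows]
    scaled = [[v / n for v in row] for row, n in zip(rows, norms)]
    return scaled, norms


def denormalize(values, norms):
    """Map per-row normalized values back to the original scale."""
    return [v * n for v, n in zip(values, norms)]


# Made-up rows: one feature followed by the target.
rows = [[3.0, 4.0], [6.0, 8.0]]
scaled, norms = l2_normalize_rows(rows)
normalized_targets = [row[-1] for row in scaled]
print(denormalize(normalized_targets, norms))
```

Applying `denormalize` to both the predictions and the true targets before computing the RMSE would give an error in the original units.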
