**Author: German Bertachini - 58750**

# Neural Networks - Practical Assignment No. 1

## Regression problem

We use the following [dataset](https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure) to build a predictive algorithm.


### Libraries used

In [19]:
import pandas as pd
import numpy as np
import scipy as sp
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report, roc_auc_score, mean_squared_error, r2_score
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, Adam, RMSprop

#### Building the dataframe with pandas

In [2]:
#features
features = ['F1','F2','F3','F4','F5','F6','F7','F8','F9']
RMSD =['RMSD']
cols= RMSD + features
data_frame= pd.read_csv('CASP.csv', sep=',')
df_RMSD=data_frame["RMSD"]
data_frame.drop(columns="RMSD", inplace=True)

We inspect the resulting data frame. It contains 45,730 samples, each described by 9 features (`F1` to `F9`); the target `RMSD` was split off into `df_RMSD` above.

In [4]:
print(data_frame.shape)

(45730, 9)


In [106]:
df_RMSD.head()

0    17.284
1     6.021
2     9.275
3    15.851
4     7.962
Name: RMSD, dtype: float64

### Normalizing the dataset
The features span very different scales, so we normalize them before fitting the regression.

In [6]:
data_frame=((data_frame-data_frame.mean())/data_frame.std())/data_frame.max()
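
The cell above standardizes each column (subtract the mean, divide by the standard deviation) and then divides the result by that column's raw maximum, because `data_frame.max()` is evaluated before the assignment takes place. For comparison, here is a minimal sketch of plain z-score standardization. The toy frame and its values are hypothetical and only stand in for the real features:

```python
import pandas as pd

# Hypothetical toy data standing in for the real features (not from CASP.csv).
toy = pd.DataFrame({
    "F1": [1.0, 2.0, 3.0, 4.0],
    "F2": [10.0, 20.0, 30.0, 40.0],
})

# Z-score: subtract the column mean and divide by the column (sample) std.
# Every column ends up with mean 0 and standard deviation 1.
toy_z = (toy - toy.mean()) / toy.std()

print(toy_z.mean().round(12).tolist())
print(toy_z.std().round(12).tolist())
```

With plain z-scores all columns share the same unit scale, whereas the extra division by the raw maximum leaves each column with a different, much smaller spread.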

In [23]:
normalized_df=pd.concat([data_frame,df_RMSD],axis=1)

In [8]:
normalized_df.head()

| | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | RMSD |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.3e-05 | 5.7e-05 | 0.416976 | 0.002867 | 1.634555e-07 | 0.001664 | 1e-06 | 0.00162 | -0.02266 | 17.284 |
| 1 | -2.3e-05 | -6.2e-05 | -1.108272 | -0.002448 | -1.830124e-07 | -0.001395 | -3e-06 | -0.001567 | 0.012166 | 6.021 |
| 2 | -1.3e-05 | -5.8e-05 | -2.173551 | -0.001769 | -9.481921e-08 | -0.001524 | -5e-06 | -0.002072 | 0.012968 | 9.275 |
| 3 | -9e-06 | -2.9e-05 | -0.585817 | -0.001742 | -5.113615e-08 | -0.000864 | -4e-06 | 1e-06 | 0.013733 | 15.851 |
| 4 | -1.5e-05 | -5.7e-05 | -1.915627 | -0.002495 | -1.125187e-07 | -0.00122 | -6e-06 | -0.001465 | 0.016303 | 7.962 |


Summary statistics of the features.

In [22]:
normalized_df.describe()

| | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | RMSD |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 45730.0 | 45730.0 | 45730.0 | 45730.0 | 45730.0 | 45730.0 | 45730.0 | 45730.0 | 45730.0 | 45730.0 |
| mean | -5.237196999999999e-19 | 1.695032e-18 | -2.632733e-14 | 2.6009300000000002e-17 | 1.949226e-21 | 1.5413910000000002e-17 | -8.053947e-20 | -1.3342349999999999e-19 | -1.342439e-15 | 7.748528 |
| std | 2.497821e-05 | 6.530825e-05 | 1.731032 | 0.002707701 | 1.827482e-07 | 0.001671101 | 9.438577e-06 | 0.002857143 | 0.01808289 | 6.118312 |
| min | -4.603729e-05 | -0.0001165774 | -5.777607 | -0.004552277 | -3.398146e-07 | -0.002713603 | -1.88895e-05 | -0.003538973 | -0.05835043 | 0.0 |
| 25% | -1.806468e-05 | -4.630873e-05 | -1.201587 | -0.001950646 | -1.343655e-07 | -0.001214678 | -3.90328e-06 | -0.001971155 | -0.01239526 | 2.305 |
| 50% | -5.987623e-06 | -1.557471e-05 | -0.06171133 | -0.0007695214 | -4.246997e-08 | -0.0004646203 | -7.082148e-07 | -0.0008079355 | 0.002345538 | 5.03 |
| 75% | 1.387698e-05 | 3.429898e-05 | 1.114775 | 0.001473142 | 1.045293e-07 | 0.0008553847 | 3.098429e-06 | 0.001063331 | 0.01314582 | 13.379 |
| max | 0.0001856579 | 0.0005483354 | 7.578017 | 0.01298644 | 1.329605e-06 | 0.01080904 | 0.000482722 | 0.0141622 | 0.06283074 | 20.999 |


### Splitting the dataset
We assign 70% of the samples to training and 15% each to test and validation.

In [161]:
df_train,df_aux=train_test_split(normalized_df, test_size=0.3, shuffle=True) 
df_test,df_val=train_test_split(df_aux, test_size=0.5, shuffle=True) 
X_train=df_train.drop(columns="RMSD").values
X_test=df_test.drop(columns="RMSD").values
y=df_RMSD.values
y_train = df_train.RMSD.values
y_test = df_test.RMSD.values
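
The split above calls `train_test_split` without a `random_state`, so each run produces different subsets. The following is a minimal sketch of the same 70/15/15 scheme with a fixed seed. The synthetic arrays, the `_demo` names, and the seed value `0` are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows with 9 features each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 9))
y_demo = rng.normal(size=1000)

# First cut: 70% for training, 30% held out.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
# Second cut: halve the held-out 30% into 15% validation and 15% test.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0
)

print(len(X_tr), len(X_va), len(X_te))
```

Fixing the seed makes the reported metrics repeatable between runs, which helps when comparing models.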

In [111]:
y

array([17.284,  6.021,  9.275, ..., 10.356,  9.791, 18.827])

In [162]:
LR = LinearRegression()
LR.fit(X_train,y_train)
y_pred=LR.predict(X_test)

### Baseline predictor (no optimizations)

In [163]:
coef=LR.coef_
MSE=mean_squared_error(y_test,y_pred)
score=r2_score(y_test,y_pred)
print("The fitted coefficients are: \n", coef)
print("MSE  \n%.2f" % MSE)
print("Score \n%.2f" % score)


The fitted coefficients are: 
 [ 2.61004768e+05  3.29132909e+04  6.35810526e-01 -2.18084042e+03
 -1.37613774e+07 -1.05712454e+03 -2.60069524e+04  2.95431804e+02
 -3.63698398e+01]
MSE  
27.42
Score 
0.28


The results ($MSE=27.42$ and an $R^2$ score of $0.28$) are poor: the linear model explains only about 28% of the variance in RMSD. This motivates adding polynomial features to obtain a better fit.
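
To make the two metrics concrete, here is a small sketch that computes them by hand with NumPy, following the same definitions that `mean_squared_error` and `r2_score` use. The helper names `mse` and `r2` and the toy values are hypothetical:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot: the fraction of variance explained by the model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, chosen so the arithmetic is easy to follow.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 9.0]

print(mse(y_true_demo, y_pred_demo))
print(r2(y_true_demo, y_pred_demo))
```

A score of 0 corresponds to always predicting the mean of the target, and a negative score means the model does worse than that constant prediction.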

### Polynomial features
We generate polynomial features of increasing order and look for the order that gives the best validation result on this data frame.
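
As a quick illustration of what `PolynomialFeatures` produces, the sketch below expands a single hypothetical sample with two features and then counts how many columns each order generates for nine inputs. The names `sample`, `expanded`, and `counts` are introduced here for illustration only:

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two features, a = 2 and b = 3.
sample = np.array([[2.0, 3.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
print(expanded)  # columns are ordered as: 1, a, b, a^2, a*b, b^2

# With n inputs and degree d (bias column included), the expansion
# has C(n + d, d) columns.
n_inputs = 9
counts = {d: comb(n_inputs + d, d) for d in range(1, 6)}
print(counts)
```

For the nine features used here this gives 10, 55, 220, 715, and 2002 columns for orders 1 to 5, so the number of fitted coefficients grows quickly with the order.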

In [174]:
def give_me_my_best_poly_order():
    for i in range(1,6):
        poly=PolynomialFeatures(i)
        poly_feat=poly.fit_transform(normalized_df.drop(columns="RMSD").values)
        # same unshuffled 70/15/15 split in every iteration
        # (random_state has no effect when shuffle=False)
        X_train,X_aux,y_train,y_aux=train_test_split(poly_feat,y, test_size=0.3, shuffle=False, random_state=42)
        X_test,X_val,y_test, y_val=train_test_split(X_aux,y_aux, test_size=0.5, shuffle=False, random_state=42)
        LR = LinearRegression()
        LR.fit(X_train, y_train)
        # evaluate on the validation set; the test set is kept for the final check
        y_pred=LR.predict(X_val)
        print("For order %i\n MSE: %f \n Score %f \n " % (i, mean_squared_error(y_val,y_pred), r2_score(y_val,y_pred)))

In [175]:
give_me_my_best_poly_order()

For order 1
 MSE: 26.883806 
 Score 0.284245 
 
For order 2
 MSE: 24.197869 
 Score 0.355756 
 
For order 3
 MSE: 23.430014 
 Score 0.376199 
 
For order 4
 MSE: 34.945097 
 Score 0.069621 
 
For order 5
 MSE: 52.213734 
 Score -0.390139 
 
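
The search over orders can also return the chosen order instead of only printing metrics. The following is a sketch of that idea under stated assumptions: the data is synthetic (a quadratic signal plus noise), and `select_degree` is a hypothetical helper, not the notebook's function:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical synthetic data: a quadratic signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(600, 2))
y_demo = (1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] ** 2
          + rng.normal(scale=0.1, size=600))

# 70% train, 15% validation, 15% test (the test part is left untouched here).
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=0
)

def select_degree(degrees):
    """Fit one model per degree and return (best_degree, {degree: validation MSE})."""
    val_mse = {}
    for d in degrees:
        # The pipeline expands the features and fits the regression in one step.
        model = make_pipeline(PolynomialFeatures(d), LinearRegression())
        model.fit(X_tr, y_tr)
        val_mse[d] = mean_squared_error(y_va, model.predict(X_va))
    best = min(val_mse, key=val_mse.get)  # lowest validation MSE wins
    return best, val_mse

best_degree, val_mse = select_degree(range(1, 6))
print(best_degree, val_mse)
```

Choosing the order on the validation set and reporting the final metric on the untouched test set keeps the test estimate from being biased by the selection step.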
 
        

### Test
We choose polynomial features of order 3, which give the lowest validation MSE (23.43) and the highest score (0.376). We then apply these features to the test set.

In [178]:
poly=PolynomialFeatures(3)
poly_feat=poly.fit_transform(normalized_df.drop(columns="RMSD").values)
X_train,X_aux,y_train,y_aux=train_test_split(poly_feat,y, test_size=0.3, shuffle=False, random_state=42)
X_test,X_val,y_test, y_val=train_test_split(X_aux,y_aux, test_size=0.5, shuffle= False, random_state=42)
LR = LinearRegression()
LR.fit(X_train, y_train)
# final evaluation on the held-out test set
y_pred=LR.predict(X_test)
print("For order %i\n MSE: %f \n Score %f \n " % (3, mean_squared_error(y_test,y_pred), r2_score(y_test,y_pred)))

For order 3
 MSE: 22.256534 
 Score 0.424631 
 


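We obtain $MSE=22.26$ and $score=0.42$ on the test set, a clear improvement over the first iteration without polynomial features ($MSE=27.42$, $score=0.28$). Note that the baseline used a shuffled split and this run an unshuffled one, so the two figures are not measured on the same rows.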
