# Introduction
For my final project, I'm going to implement two different machine learning approaches to see how they differ from one another.
From this, I hope to draw a conclusion about which approach works best for the dataset.

## Dataset 
This project will primarily use the [scikit-learn diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset). I will consider the project successful if I can test different machine learning models, identify which model works best, and understand *why* it works best.

# Import Packages


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
%matplotlib inline

from numpy.random import default_rng
from sklearn import datasets # Where our diabetes datasets lives.
from sklearn.model_selection import train_test_split # used for splitting our data
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# Load Data Set
This dataset corresponds to sklearn's diabetes dataset. Instead of using the copy provided by sklearn, I got the [data](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt) directly from the [source](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html), because the features in sklearn's copy are standardized.

In [2]:
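diabetes_data = datasets.load_diabetes()  # sklearn's standardized copy, kept for reference
data = pd.read_csv('final/diabetes_data.csv')  # raw (unstandardized) data saved from the source
# Rename the columns; 'dp' is the disease progression target.
data.columns = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'dp']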
data.head()


We can take a quick look at the shape of the data to see how it's structured, as well as the correlation coefficients of the different feature variables.

In [None]:
data.shape

# Data Interpretation
### We can see from the shape of the data that there are 442 instances (samples), each with 10 feature variables plus the target.

### These feature variables are:
*   **age:** (in years)
*   **sex:**
*   **bmi:** body mass index
*   **bp:** average blood pressure
*   **s1-tc:** total serum cholesterol
*   **s2-ldl:** low-density lipoproteins
*   **s3-hdl:** high-density lipoproteins
*   **s4-tch:** total cholesterol / HDL
*   **s5-ltg:** possibly log of serum triglycerides level
*   **s6-glu:** blood sugar level
*   **dp:** Our target data represents the disease progression one year after baseline. This is our variable of interest.








# Setup

The first two machine learning approaches that I will focus on in this project are univariate and multivariate linear regression models. This dataset consists of 10 feature variables, so it's important to see how the predictions differ between the two models. I will then compare the results with a K-Nearest-Neighbors model.

*Note: I will use Assignment 3 as a reference when building the models, as it's a well-done exemplar of univariate and multivariate regression algorithms.*



In [None]:
data.corr()

As we can see, Body Mass Index and s5 (possibly log of serum triglycerides level) are the features most strongly correlated with disease progression after a year, with correlation values of 0.586450 and 0.565883 respectively.
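To read the ranking off directly rather than scanning the full correlation matrix, the `dp` column of the matrix can be sorted by absolute value. The sketch below uses sklearn's bundled copy of the dataset so that it runs on its own; correlations are unaffected by the standardization of that copy. The names `frame` and `corr_with_dp` are introduced here for illustration.

```python
import pandas as pd
from sklearn import datasets

# Bundled copy of the dataset; 'target' is renamed to 'dp' to match this notebook.
frame = datasets.load_diabetes(as_frame=True).frame.rename(columns={'target': 'dp'})

# Correlation of every feature with the target, strongest first (by absolute value).
corr_with_dp = frame.corr()['dp'].drop('dp').sort_values(key=abs, ascending=False)
print(corr_with_dp.head(3))
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.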

For our univariate linear regression model, we will use **BMI** as our feature variable and **dp** as our target.

Our first step is to write a function that fits our univariate linear regression model.
*Note: I will use Assignment 3 as a reference, as the process is similar.*

In [None]:
def univariate_model(df, feature):
  linear_regression = LinearRegression()
  X = df[[feature]].values  # Primarily for, but not limited to, BMI.
  y = df['dp'].values  # target: disease progression
  linear_regression.fit(X, y)
  return X, y, linear_regression


In [None]:
Xp, yp, lrp = univariate_model(data, 'bmi')
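Once a univariate model is fit, its slope and intercept can be read from the `coef_` and `intercept_` attributes of `LinearRegression`. The sketch below uses hypothetical toy data that lies exactly on the line y = 2x + 1, so the recovered values are easy to check; `X_toy`, `y_toy`, and `lr_toy` are illustrative names only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on the line y = 2x + 1.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 * X_toy.ravel() + 1.0

lr_toy = LinearRegression().fit(X_toy, y_toy)
slope = lr_toy.coef_[0]        # one coefficient per feature
intercept = lr_toy.intercept_  # value predicted when the feature is 0
print(f"slope={slope:.2f}, intercept={intercept:.2f}")  # slope=2.00, intercept=1.00
```

Applied to `lrp`, the same two attributes would give the expected change in disease progression per unit of BMI and the baseline offset.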

Now that we have this univariate regression model, we can visualize it by plotting its predictions against the data.

In [None]:
def plot_graph(X, y, lr, xlabel, title):
    y_label = 'Disease Progression'
    y_pred = lr.predict(X)  # model predictions for each value in X
    fig = plt.figure(figsize=(12, 5))
    ax = fig.add_subplot(111)
    ax.scatter(X, y, c="darkgreen")  # observed data points
    ax.plot(X, y_pred, c="darkred")  # fitted regression line
    ax.set_xlabel(xlabel)
    ax.set_ylabel(y_label)
    ax.set_title(title)
    return None

def plot_regression_result_1(X, y, lr, xlabel):
    title = f'{xlabel} Model with Respect to Disease Progression'
    plot_graph(X, y, lr, xlabel, title)
    return None

In [None]:
plt.rcParams.update({'font.size': 12})

plot_regression_result_1(Xp, yp, lrp, 'bmi')

It's important for us to analyze the regression loss of our predicted model. In essence, this will help us judge whether linear regression is a viable approach for this dataset.

I'm going to create a function that calculates the mean absolute error and the root mean squared error so we can see how large the prediction errors are.
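As a small worked example of the two error metrics, the sketch below computes them by hand with NumPy and checks the result against sklearn's functions. The three target and prediction values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical targets and predictions (not taken from the diabetes data).
y_true_toy = np.array([100.0, 150.0, 200.0])
y_pred_toy = np.array([110.0, 140.0, 230.0])

errors = y_true_toy - y_pred_toy              # [-10, 10, -30]
mae_manual = np.mean(np.abs(errors))          # (10 + 10 + 30) / 3
rmse_manual = np.sqrt(np.mean(errors ** 2))   # sqrt((100 + 100 + 900) / 3)

# The library functions should agree with the manual computation.
mae_sklearn = mean_absolute_error(y_true_toy, y_pred_toy)
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))
print(f"MAE:  {mae_manual:.2f} vs {mae_sklearn:.2f}")
print(f"RMSE: {rmse_manual:.2f} vs {rmse_sklearn:.2f}")
```

Both metrics are in the same units as the target. The RMSE is larger here because squaring gives the single 30-unit miss more weight than the two 10-unit misses.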

In [None]:
def round_y(y):  # Referenced Assignment 3
  # Round the predictions to integers so they can be compared with the integer targets.
  # This lets us compute an exact-match "accuracy" for the regression.
  return np.round(y)

def evaluate(X, y, lr):
  y_true = np.asarray(y)  # ground-truth targets passed in by the caller
  y_pred = lr.predict(X)

  # Fraction of rounded predictions that exactly equal the target.
  accur = np.sum(round_y(y_pred) == y_true) / y_true.shape[0]

  rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error
  mae = mean_absolute_error(y_true, y_pred)  # mean absolute error
  r2 = lr.score(X, y)  # coefficient of determination
  print("The root mean squared error is: %4.2f" % rmse)
  print("The mean absolute error is: %4.2f" % mae)
  print(f'Coefficient of determination(R2): {r2:5.2f}')
  print(f"Accuracy: {accur:5.3f}")

In [None]:
evaluate(Xp,yp,lrp)

We use these evaluation metrics to judge how well this linear regression model predicts a continuous value. In this case, we predicted disease progression from BMI.


As we can see, the evaluation metrics tell us that our BMI regression model predicts the disease progression with an **average error of 51.80** (the mean absolute error). The mean squared error on its own is harder to interpret because it is expressed in squared units and penalizes larger errors heavily, so I calculated the root mean squared error instead, which is in the same units as the target.

It's important to note that the accuracy on BMI is not good whatsoever. This is to be expected, since exact-match accuracy is a harsh measure for a continuous target and our error metrics were high on average. A more informative measure is the coefficient of determination (R2): in this linear regression model, BMI explains only about 0.344 of the variance in disease progression.


*   R2 = 0: the model does no better than always predicting the mean of the target
*   R2 = 1: the model perfectly predicts the target value
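The two reference points above follow from the definition R2 = 1 - SS_res / SS_tot. The sketch below is a minimal illustration on made-up values; `r2_manual` is a helper written here for the example and is compared against sklearn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical target values.
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

perfect = y_toy.copy()                        # predictions equal to the targets
mean_only = np.full_like(y_toy, y_toy.mean()) # always predict the mean

print(r2_manual(y_toy, perfect))    # 1.0
print(r2_manual(y_toy, mean_only))  # 0.0
```

A model that does worse than predicting the mean gets a negative R2, which is why 0 is a baseline rather than a lower bound.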

We can also compute the R2 values for the other feature variables.

In [None]:
def calculate_r2():
  r2_scores = []
  for feature in data.columns.drop(['dp']):
      Xa, ya, lra = univariate_model(data, feature)
      r2 = lra.score(Xa, ya)  # R2 of the single-feature model
      r2_scores.append([r2, feature])  # store the score together with its feature name

  for r2, feature in r2_scores:  # print the score for each feature
      print(f"R2 score {r2:5.2f} for feature {feature}")


calculate_r2()

By looking at the R2 scores of these univariate regression models, we can tell that the **bmi** and **s5** models have the strongest relationship with disease progression.
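For a univariate least-squares fit with an intercept, R2 equals the square of the Pearson correlation between the feature and the target, which is why the R2 ranking matches the correlation ranking seen earlier. The sketch below checks this for BMI using sklearn's bundled copy of the dataset (standardization does not change the correlation or R2); the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes(as_frame=True)
bmi = diabetes.data[['bmi']].values  # 2-D array, as LinearRegression expects
target = diabetes.target.values

r = np.corrcoef(bmi.ravel(), target)[0, 1]                   # Pearson correlation
r2 = LinearRegression().fit(bmi, target).score(bmi, target)  # R2 of the BMI-only model
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, R2 = {r2:.3f}")
```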

## Generalize with Multivariate Linear Regression
As we can see, BMI by itself produces very low accuracy and high error. Now let's train our linear regression on all of our feature variables.

We're going to start off by creating a function that fits our multiple features.


In [None]:
def multivariate_model(df):
  X = df[df.columns.drop(['dp'])]  # all features except disease progression
  y = df['dp']  # our target value
  model = LinearRegression()
  model.fit(X, y)
  return X, y, model

In [None]:
Xm, ym, lrm = multivariate_model(data)

y_pred = np.round(lrm.predict(Xm))
evaluate(Xm,ym,lrm)

Using this dataset, we can determine that sklearn's linear regression model, when given all of the features, explains only about 52% of the variance in disease progression (an R2 of roughly 0.52). That is an improvement over BMI alone, but it still leaves much of the variation unexplained.
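Note that the score above is computed on the same rows the model was fit on, whereas the k-NN models later in this notebook are scored on held-out data. A like-for-like comparison could score the linear model on a held-out split as well. The sketch below assumes the same 50/50 split with `random_state=42` that the k-NN section uses, and loads sklearn's bundled copy of the data so that it runs on its own (a linear model with an intercept gives the same R2 on standardized and raw features).

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_all, y_all = datasets.load_diabetes(return_X_y=True)

# Assumed split: 50% train / 50% test, fixed seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.5, random_state=42)

lr_holdout = LinearRegression().fit(X_tr, y_tr)
train_r2 = lr_holdout.score(X_tr, y_tr)  # R2 on the rows used for fitting
test_r2 = lr_holdout.score(X_te, y_te)   # R2 on rows the model has not seen
print(f"train R2: {train_r2:.3f}, test R2: {test_r2:.3f}")
```

The held-out score is the fairer number to set beside the k-NN results.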

# K-Nearest-Neighbors Regressor
Let's implement a K-nearest neighbor algorithm to see whether it improves our coefficient of determination. This algorithm also lets us use all 10 features instead of just looking at BMI.

## Setup

In order for our k-NN algorithm to work properly, we need to split the data into two subsets: training and testing.

Due to the nature of the data, I'll write a function that does the splitting for us. This function uses sklearn's `train_test_split`.

When splitting this dataset:

*   We're splitting data 50/50.
*   Random State = 42.
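The split settings listed above can be sketched on placeholder arrays with the same shape as the diabetes data (442 rows, 10 features). The arrays here are dummy values used only to show the resulting sizes and the effect of fixing `random_state`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the diabetes data.
X_demo = np.arange(442 * 10).reshape(442, 10)
y_demo = np.arange(442)

X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, test_size=0.5, random_state=42)
print(X_a.shape, X_b.shape)  # (221, 10) (221, 10)

# Repeating the call with the same random_state reproduces the same split.
X_a2, X_b2, y_a2, y_b2 = train_test_split(X_demo, y_demo, test_size=0.5, random_state=42)
print(np.array_equal(y_b, y_b2))  # True
```

Fixing the seed means every value of k is evaluated on exactly the same test rows, so differences in R2 come from the model and not from the split.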



In [None]:
def split(df,size):
  X = df[df.columns.drop(['dp'])].values # All features except for disease progression
  y = df['dp'].values
  return train_test_split(X,y,test_size=size,random_state=42)

Now we can implement the k-NN algorithm and see how the R2 metrics differ between the two machine learning models.

In [None]:
def fit_nearest_neighbor(df, k):  # Reference from Assignment 3.
  X_train, X_test, y_train, y_test = split(df, .5)
  kn = KNeighborsRegressor(n_neighbors=k)
  kn.fit(X_train, y_train)  # fit on the training data
  r2 = kn.score(X_test, y_test)  # predicts on X_test internally and returns R2
  print(f'R2 score with {k} neighbors: {r2}')
  return r2

In [None]:
n_neighbors = 10
r2_scores = []
for i in range(2,n_neighbors+1):
  r2_scores.append(fit_nearest_neighbor(data,i))


In [None]:
print(f"Average knn R2 score: {np.mean(r2_scores):5.3f}")

# Neural Network Approach
## Setup

In [None]:
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.metrics import MeanAbsoluteError  # use tensorflow.keras consistently

!pip install tensorflow-addons # - May need to install this.
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow_addons.metrics import RSquare # add-on metric for rsquare

As we've previously seen with both our sklearn linear regression model and k-NN, the coefficient of determination is possibly our best indicator of how well a model fits. As a result, I wanted to collect the R2 metric in our neural network implementation as well.

## Custom Metric Function
I found an amazing metric function that calculates R2 when using keras. I will be using this metric to see if our neural network model produces better R2 results.

*Source: https://jmlb.github.io/ml/2017/03/20/CoeffDetermination_CustomMetric4Keras/*

**Update: I'm actually using a [tensorflow add-on](https://www.tensorflow.org/addons/api_docs/python/tfa/metrics/RSquare) that calculates the R2 metric.**

In [None]:
model_1 = Sequential()
input_size = len(data.columns)-1  # 10 feature variables
model_1.add(Dense(10, input_dim=input_size, kernel_initializer='normal', activation='relu'))  # first hidden layer, fed by the 10 features
model_1.add(Dense(5, activation='relu'))  # second hidden layer
model_1.add(Dense(1, activation='relu'))  # single output (ReLU keeps predictions non-negative)

Our next step is to compile the network. We'll use the Adam optimizer with mean squared error as the loss function. It is also beneficial to track the mean squared error and mean absolute error, as they are useful for comparing the network against a regular regression model with no regularization.

In [None]:
loss_fn = keras.losses.MeanSquaredError()
opt = keras.optimizers.Adam()
mae = MeanAbsoluteError()
r_sqr_metric = RSquare(dtype=tf.float32, y_shape=(1,))
# compile network
model_1.compile(loss=loss_fn, optimizer=opt, metrics=[r_sqr_metric, mae])

Now we can split the data and train the neural network, using the held-out portion as validation data. Note that `split(data, .6)` sets `test_size` to 0.6, so 40% of the rows are used for training and 60% are held out.

In [None]:
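X_train, X_test, y_train, y_test = split(data,.6)  # test_size=0.6: 40% train, 60% held out
# validation_data takes precedence over validation_split in Keras, so only validation_data is passed.
history = model_1.fit(X_train, y_train, batch_size=5, epochs=1000, validation_data=(X_test, y_test))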

In [None]:
def plot_metrics(metric_name, title,ylim=1):
  plt.title(title)
  plt.ylim(-1,ylim)
  plt.plot(history.history[metric_name],color='blue',label=metric_name)
  plt.plot(history.history['val_' + metric_name],color='green',label='val_' + metric_name)
  plt.legend()

plot_metrics("r_square", "R Squared");

# Conclusion



By using these two machine learning approaches with this dataset, I was able to see how well, or how poorly, the models predicted the target value. For this dataset, it is more appropriate to use a linear regression model than a KNN model: the R2 score of the multivariate linear regression model was higher than the average R2 of the k-NN models with up to 10 neighbors. I focused mostly on the coefficient of determination because the accuracy for this dataset was very low and the error was high. I then created a neural network, primarily to see whether training one on this regression problem could further improve our results. I was glad to see that it does improve the overall fit between the model's predictions and our variable of interest. Even so, I would still stick with a non-neural-network linear regression model, as the neural network is a bit too complex (and, I assume, resource intensive) for roughly a 10% better R2 score. It was an amazing experience applying these models to make decisions about real-world problems such as diabetes.
