## ReadMe

This script takes a data array X and a label vector Y and fits linear regression parameters theta using the scikit-learn package.

Shapes:
X = (m, n) where m is number of training examples and n is number of features
Y = (m,)
theta = (n + 1,), i.e. n feature weights plus 1 bias term (scikit-learn stores these as coef_ and intercept_)
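
The shapes above can be checked on a small example. This is only a sketch: the sizes `m = 50`, `n = 3`, the weights, and the bias value are made up for illustration, and the data are noiseless so the fit recovers them exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

m, n = 50, 3  # hypothetical sizes, chosen only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n))                 # shape (m, n)
Y = X @ np.array([1.0, -2.0, 0.5]) + 4.0    # shape (m,), noiseless linear target

model = LinearRegression().fit(X, Y)

# theta stacks the bias (intercept_) in front of the n feature weights (coef_)
theta = np.concatenate(([model.intercept_], model.coef_))
print(X.shape, Y.shape, theta.shape)
```

Here `theta[0]` is the bias and `theta[1:]` are the per-feature weights.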

Sections:

1) (OPTIONAL) Create a dummy data vector and label vector

2) Fit linear regression parameters theta with scikit-learn's LinearRegression (ordinary least squares, not stochastic gradient descent)

3) Report performance of linear regression compared to zero rule algorithm (i.e. guess the mean output always)

Citations:

https://machinelearningmastery.com/implement-linear-regression-stochastic-gradient-descent-scratch-python/

https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html

### Imports

In [None]:
import numpy as np
from random import seed
from random import randrange
from csv import reader
from math import sqrt
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression
from sklearn import preprocessing 

### Section 1: (OPTIONAL) Create dummy data vectors and label vector

In [None]:
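np.random.seed(7)  # make_regression and the split below draw from NumPy's global RNG, which random.seed does not affect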
X, Y = make_regression(n_samples=1000, n_features=100000, noise=0.2, n_informative = 1000)
Y = np.rint(np.interp(Y, (Y.min(), Y.max()), (0, 100)))

### Section 2: Linear Regression with scikit-learn (Ordinary Least Squares)

#### Hyperparameters

In [None]:
train_split = 0.8 # fraction of examples assigned to the training set

#### Normalize Data

In [None]:
# Scale data to have 0 mean and unit variance (the default)
X_normalized = preprocessing.scale(X)
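
As a rough illustration of what `preprocessing.scale` does with its defaults, the sketch below compares it against the per-column formula `(X - mean) / std` on a tiny made-up array (`X_demo` is hypothetical, not the notebook's data).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical 3x2 array: two features on very different scales
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

X_scaled = preprocessing.scale(X_demo)

# Same result by hand: subtract each column's mean, divide by its (population) std
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Note that scaling before the train/test split lets test-set statistics influence the training features. Fitting a `StandardScaler` on the training rows only and applying it to the test rows would avoid that.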

#### Split data into train and test

In [None]:
train_dev_split = np.random.rand(len(Y)) < train_split # boolean mask over the labels: each example goes to train with probability train_split (about 80% train, 20% test)
X_train = X_normalized[train_dev_split]
X_test = X_normalized[~train_dev_split]
Y_train = Y[train_dev_split]
Y_test = Y[~train_dev_split]
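
The random mask above gives only approximately the requested train fraction. If exact split sizes matter, scikit-learn's `train_test_split` is one alternative. A minimal sketch on made-up arrays (`X_demo`, `Y_demo` and the other names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 examples, 2 features
X_demo = np.arange(20, dtype=float).reshape(10, 2)
Y_demo = np.arange(10, dtype=float)

# train_size=0.8 puts exactly 8 of the 10 examples into train; the rest go to test
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, train_size=0.8, random_state=7
)
print(X_tr.shape, X_te.shape, Y_tr.shape, Y_te.shape)
```

Passing `random_state` makes the split reproducible without relying on a global seed.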

In [None]:
#print(X_train.shape); print(Y_train.shape)

#### Train linear regression model

In [None]:
# Create linear regression object using scikit-learn
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(X_train, Y_train)

# Make predictions using the testing set
Y_pred = regr.predict(X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
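
`LinearRegression` solves the least-squares problem directly, whereas the tutorial cited in the ReadMe uses stochastic gradient descent. If an SGD-based fit is wanted, scikit-learn's `SGDRegressor` is one option. The sketch below is an assumption about how that could look, run on made-up noiseless data rather than the notebook's arrays. The hyperparameters shown are not tuned.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 examples, 3 features, known weights and bias
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(200, 3))
Y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 4.0

# SGD is sensitive to feature scale, so standardize first
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Squared-error loss is the default; max_iter and tol control when training stops
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=7)
sgd.fit(X_demo_scaled, Y_demo)

train_r2 = sgd.score(X_demo_scaled, Y_demo)  # R^2 on the training data
print("coef:", sgd.coef_, "intercept:", sgd.intercept_, "R^2:", train_r2)
```

Unlike the closed-form fit, the result depends on the learning-rate schedule and stopping criteria, so the recovered weights are approximate.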

### Section 3: Report Results and Compare to Zero-Rule

#### Calculate RMSE and R^2 (Linear Regression)

In [None]:
print("Root mean squared error of Linear Regression: %.2f"
      % sqrt(mean_squared_error(Y_test, Y_pred)))
# R^2 (coefficient of determination): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(Y_test, Y_pred))
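
For reference, both metrics can be written out by hand: RMSE is the square root of the mean squared residual, and `r2_score` computes 1 - SS_res / SS_tot. The sketch below checks this against scikit-learn on four made-up values (`y_true` and `y_hat` are hypothetical).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true labels and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R^2: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2, r2_score(y_true, y_hat))
```

An R^2 of 0 corresponds to always predicting the mean of the evaluated labels, and negative values mean the model does worse than that.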

#### Calculate RMSE and R^2 (Zero-Rule for comparison)

In [None]:
Y_pred_0rule = np.full_like(Y_test, np.mean(Y_train))  

# The root mean squared error
print("Root mean squared error of Zero Rule: %.2f"
      % sqrt(mean_squared_error(Y_test, Y_pred_0rule)))
# R^2 (coefficient of determination): 1 is perfect prediction, about 0 is expected for a mean-only baseline
print('R^2 score: %.2f' % r2_score(Y_test, Y_pred_0rule))
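
The same zero-rule baseline can also be expressed with scikit-learn's `DummyRegressor`, which ignores the features and predicts the training mean. A minimal sketch on made-up labels (all arrays here are hypothetical placeholders):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical labels; the features are placeholders because the baseline ignores them
Y_train_demo = np.array([10.0, 20.0, 30.0, 40.0])
Y_test_demo = np.array([15.0, 25.0, 35.0])
X_train_demo = np.zeros((len(Y_train_demo), 1))
X_test_demo = np.zeros((len(Y_test_demo), 1))

# strategy="mean" always predicts the mean of the training labels (25.0 here)
baseline = DummyRegressor(strategy="mean").fit(X_train_demo, Y_train_demo)
Y_pred_demo = baseline.predict(X_test_demo)

rmse_0rule = np.sqrt(mean_squared_error(Y_test_demo, Y_pred_demo))
r2_0rule = r2_score(Y_test_demo, Y_pred_demo)
print(Y_pred_demo, rmse_0rule, r2_0rule)
```

Any model worth keeping should beat this RMSE on the same test set.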

#### Plot Linear Regression Results

In [None]:
# Plot predicted vs. true
fig, ax = plt.subplots()
ax.scatter(Y_test, Y_pred, edgecolors=(0, 0, 0))
# Dashed reference line y = x: points on it are perfect predictions
ax.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()], 'k--', lw=4)
ax.set_xlabel('GDI True')
ax.set_ylabel('GDI Predicted')
plt.show()

### Scratch

In [None]:
# X = np.random.rand(100,1000000)
# Y = np.random.randint(0,101,1000)
# generate regression dataset

#print(X); print(Y)

In [None]:
# Find the min and max values for each feature column in dataset
# def dataset_minmax(dataset):
#     minmax = list()
#     for i in range(len(dataset[0])):
#         col_values = [row[i] for row in dataset]
#         value_min = min(col_values)
#         value_max = max(col_values)
#         minmax.append([value_min, value_max])
#     return minmax
 
# # Standardize dataset to zero mean and unit variance (minmax is accepted but unused)
# def normalize_dataset(dataset, minmax):
#     dataset_normalized = (dataset - dataset.mean())/dataset.std()
#     return dataset_normalized

# minmax = dataset_minmax(X)
# X_normalized = normalize_dataset(X, minmax)
