# Full Model Prediction 
This model runs a regression on all of the data over the given time period, treating the identifier, market cap, the factors, etc. as independent variables. 
The regression is performed with scikit-learn: the data is split into training and test sets and the model is fitted to the training set. The fitted model is then used for sequential (forward/backward) factor selection, which can be used to fine-tune the model.

**How to Run This Code:** Run each code and markdown cell in order using the 'Run' button above. Results appear once the last cell has finished executing. This may take a few moments, so please be patient.

### Import Libraries

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from time import time
from sklearn.feature_selection import SequentialFeatureSelector
import warnings
warnings.filterwarnings('ignore')

## Functions to be called later

The model is built and run in the section **"Calculating the Model"**.

**Calculating r^2:** This function calculates the r^2 of the model by utilizing the r2_score function with the test target data and predictions from the model.

In [None]:
def r2(preds, test):
    
    # Calculates the r^2 score based on the test values and the predicted values
    r = r2_score(test["target"], preds)
    
    print("r^2 is = " + str(r) + "\n")
    
    return r
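
As a quick check on what `r2_score` reports, the sketch below recomputes R^2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The arrays are made-up toy values, not data from this notebook, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy values, used only for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: error left over after the predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

# r2_score expects the true values first and the predictions second
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)
```

Note that the argument order matters: swapping the true values and the predictions generally gives a different score.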

**K Fold Test:** Runs a 10-fold cross-validation on the training data using the KFold function. It calculates the cross-validation score for each fold (the negative mean squared error), which is then used to calculate the Root Mean Squared Error (RMSE). The RMSE is useful because it measures the accuracy of our model: the lower the RMSE, the better.

In [None]:
def k_fold_test(predictor_list, train, model, X, y):

    # Sets up a shuffled 10-fold split of the training data
    cv = KFold(n_splits=10, random_state=100, shuffle=True)

    # scikit-learn returns the negated mean squared error for each fold
    scores = cross_val_score(model, train[predictor_list], train["target"],
                             scoring='neg_mean_squared_error',
                             cv=cv, n_jobs=-1)
    
    # RMSE = square root of the mean squared error averaged over the folds
    print("root mean squared error (RMSE) = " + str(np.sqrt(np.mean(np.absolute(
        scores)))) + "\n")
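
The relationship between the cross-validation scores and the RMSE can be sketched on synthetic data. Everything below (the random data, the coefficients, the `_demo` names) is an assumption made for illustration and is not part of the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: a linear signal plus a little noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=100)

# Scores are negated so that "higher is better" holds for every scorer
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=cv)

# Flip the sign back before taking the square root
rmse_per_fold = np.sqrt(-neg_mse)
rmse_pooled = np.sqrt(np.mean(-neg_mse))

print(rmse_per_fold)
print(rmse_pooled)
```

Taking the square root of a mean absolute error would not give an RMSE, which is why the scorer has to be a squared-error one.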

**Selecting features with Sequential Feature Selection:** SFS is a greedy procedure where, at each iteration, we choose the best new feature to add to our selected features based on a cross-validation score. It starts with zero features and chooses the single feature with the highest score; the procedure is then repeated until the desired number of selected features is reached. Our function below performs forward SFS.

In [None]:
def seq_selection(X, y, features, ridge):

    # Forward Selection
    
    # starts the timer
    tic_fwd = time()
    # runs the SequentialFeatureSelector with the estimator passed in as `ridge`
    # and specifies that this is to be done in the forward direction
    sfs_forward = SequentialFeatureSelector(
        ridge, scoring='r2', direction="forward"
    ).fit(X, y)
    # stops the timer
    toc_fwd = time()

    # keeps only the feature names flagged by the boolean support mask
    selected = list(np.array(features)[sfs_forward.get_support()])

    # prints the results
    print("Features selected by forward sequential selection: ", selected)
    print(f"Done in {toc_fwd - tic_fwd:.3f}s")
    
    return selected
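
To make the greedy procedure concrete, here is a minimal sketch on synthetic data in which only two of five features drive the target. The data, the feature names, and the choice of `n_features_to_select=2` are assumptions for illustration; the notebook's own call leaves that parameter at the library default.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])  # hypothetical names
X_demo = rng.normal(size=(300, 5))

# Only f1 and f3 influence the target; the rest are pure noise
y_demo = 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=300)

sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,   # assumption: stop after two features
    direction="forward",
    scoring="r2",
    cv=5,
)
sfs.fit(X_demo, y_demo)

# get_support() returns a boolean mask in the original column order
selected = list(feature_names[sfs.get_support()])
print(selected)
```

Switching `direction` to `"backward"` would start from all features and remove one at a time instead.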


## Calculating the Model

**Set Path and Read Data:** Below I have specified the path to the CSV containing the data. You do not need to change the path for it to work in this notebook; if you download the code and run it elsewhere, you may change the path to point to your copy of the data.

In [None]:
path_to_file = "data/data.csv"
# Reads the information contained in the CSV
ds = pd.read_csv(path_to_file)

**Turning the Identifiers into Dummy Variables:** This encodes each identifier as its own indicator (0/1) column so that the regression can use the identifier as a predictor.

In [None]:
ds = pd.get_dummies(ds,prefix='Identifier ', prefix_sep='=', columns=[
    'identifier'])
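
The effect of `get_dummies` can be seen on a toy frame. The values below are made up, and the `drop_first=True` variant is an assumption shown only for comparison; it is not used in this notebook. With a full set of indicator columns, each row's indicators sum to one, which is perfectly collinear with the regression intercept, so dropping one level is a common way to avoid that.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook, values are made up
toy = pd.DataFrame({
    "identifier": ["AAA", "BBB", "AAA", "CCC"],
    "target": [0.1, 0.2, 0.3, 0.4],
})

# Same call as in the notebook: one indicator column per identifier
full = pd.get_dummies(toy, prefix="Identifier ", prefix_sep="=",
                      columns=["identifier"])

# Variant that drops the first level to avoid redundancy with the intercept
reduced = pd.get_dummies(toy, prefix="Identifier ", prefix_sep="=",
                         columns=["identifier"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```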

**Get Predictors:** Lists the predictors that will be used in the model by dropping the target column and skipping the first remaining column.

In [None]:
df = ds.drop(['target'], axis = 1)
predictor_list = df.columns[1:]
predictor_list

**1) Setting the dates as the index of the dataset:** The date column becomes the index of the data. 

In [None]:
ds = ds.set_index('date')
ds.head(5)

**2) Partitioning the training and testing data:** 30% of the data is being selected as test data, while 70% is being used as training data. This selection is done at random.

In [None]:
train_data, test_data = train_test_split(ds, test_size=0.3)
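
Because the split is random, each run of the cell above produces a different partition and therefore slightly different scores. The sketch below, on a made-up ten-row frame, shows how passing `random_state` makes the split repeatable; the seed value 42 is an arbitrary assumption and is not used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with ten rows
toy = pd.DataFrame({"x": np.arange(10), "target": np.arange(10) * 2.0})

# The same seed yields the same partition on every call
train_a, test_a = train_test_split(toy, test_size=0.3, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.3, random_state=42)

print(len(train_a), len(test_a))
print(test_a.index.equals(test_b.index))
```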

**3) Running Linear Model**: A linear regression is run on the training data using scikit-learn's LinearRegression().

In [None]:
X, y = train_data[predictor_list], train_data["target"]

# Defines the model as a Linear Regression
model = LinearRegression()

# Fits the model using the predictors above and defined target training data
model.fit(X,y);

**4) Making predictions with all of the features and graphing the data vs predicted:** The model fitted on the training data is used to predict the test data, and the predictions are graphed against the test targets to demonstrate accuracy.

In [None]:
# Creates the predictions using model.predict
preds = model.predict(test_data[predictor_list])

combined = np.vstack((test_data['target'], preds))

plt.plot(combined.T)
plt.grid()
plt.legend(("Target","Predictions"))
plt.title("Full Model");

**5) Calculating the R^2 for the full model:** R^2 for the model is calculated using the function defined in the functions section of the code.

In [None]:
r_2 = r2(preds, test_data)

**6) Running the K-fold test for the full model:** The cross-validated RMSE for the model is calculated using the K-fold function defined in the functions section of the code.

In [None]:
k_fold_test(predictor_list, train_data, LinearRegression(), X, y)

### NOTE: All sections below #6 may take multiple minutes to execute. This is because of the size of the dataset and the number of predictors created by using dummy variables for our identifiers.

**7) Forward Selection:** Forward selection is conducted to determine the best predictors for the data, which will be used to modify our model.

In [None]:
new_predictors = seq_selection(X, y, predictor_list, model)

**8) Refitting the model with the selected features:** Using the new predictors from the sequential selection, we run a linear regression on the training data and predict the test data to obtain new predictions and a revised R^2 score.

In [None]:
# Refits a linear regression using only the selected predictors
reduced_model = LinearRegression().fit(train_data[new_predictors], train_data["target"])
new_preds = reduced_model.predict(test_data[new_predictors])

new_r_2 = r2(new_preds, test_data)

**9) Graphing all of the models vs the test target data:** The new predictions are graphed against our previous predictions and the test target data.

In [None]:
combined = np.vstack((test_data['target'], new_preds, preds))

plt.plot(combined.T)
plt.grid()
plt.legend(("Target","Reduced Predictions","Full Predictions"))
plt.title("Full vs. Reduced Model");