**DESCRIPTION:** This model runs a regression on all of the data over the given time period, treating the identifier, market cap, the factors, etc. as independent variables.
The regression uses scikit-learn: the model is fit on a training split of the data and evaluated on a held-out test split.
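
As a rough illustration of the train/test idea described above, here is a minimal sketch on synthetic data. The two predictors, their coefficients, and the fixed `random_state` are assumptions made for the example and are not taken from the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y depends linearly on
# two made-up predictors plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 30% of the rows for testing; fit on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Score the fitted model on rows it never saw during training.
test_r2 = r2_score(y_test, model.predict(X_test))
print("test r^2 =", test_r2)
```

Scoring on the held-out rows rather than the training rows gives a less optimistic estimate of how well the fitted line generalizes.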

In [1]:
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

**Calculating the Model:** The function below fits the model, prints its scores, and plots the predictions against the actual target values for a given DataFrame and list of predictor columns.

In [2]:
def calculate_model(data, predictor_list):
    # Sets the DataFrame index to the date so that the test targets and
    # predictions are plotted against the dates
    data.set_index(pd.DatetimeIndex(data['date']), inplace=True)

    # Sets the predictor columns
    predictors = predictor_list

    # Uses the train_test_split to randomly select 30% of the data as testing
    # data and saving the rest for the creation/training of the model
    train, test = train_test_split(data, test_size=0.3)

    # Defines the model as a Linear Regression
    model = LinearRegression()

    # Fits the model using the predictors above and defined target training data
    model.fit(train[predictors], train["target"])

    # Creates the predictions using model.predict
    preds = model.predict(test[predictors])
    preds = pd.Series(preds, index=test.index)

    # Calculates the r^2 score based on the test values and the predicted values
    r = r2_score(test["target"], preds)

    combined = pd.concat({"target": test["target"], "Predictions": preds},
                         axis=1)

    # K-fold cross-validation
    # Predictors (X) and response variable (y)
    y = data['target']
    X = data[predictor_list]

    # Sets up 10 shuffled folds
    cv = KFold(n_splits=10, random_state=1, shuffle=True)

    # Calculates the cross-validation scores, which are used to calculate the
    # RMSE below. scikit-learn negates the error so that higher is better,
    # hence the absolute value when reporting.
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_squared_error',
                             cv=cv, n_jobs=-1)

    # The lower the RMSE the better
    print("r^2 is = " + str(r))
    print("root mean squared error (RMSE) = " + str(np.sqrt(np.mean(np.absolute(
        scores)))))
    
    # Plots the original vs predicted values
    combined.plot()
    plt.title("All of them Together")
    plt.show()
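
To make the link between the cross-validation scorer and the reported error concrete, here is a minimal sketch on synthetic data. RMSE is the square root of the mean squared error, so it is derived from `neg_mean_squared_error` scores, while `neg_mean_absolute_error` yields the mean absolute error (MAE) instead. The data, coefficients, and noise level are assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration): three made-up predictors
# and noise with standard deviation 0.5.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -0.5, 2.0]) + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=10, shuffle=True, random_state=1)

# scikit-learn returns negated errors so that a higher score is always better.
neg_mse = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_squared_error", cv=cv)
neg_mae = cross_val_score(LinearRegression(), X, y,
                          scoring="neg_mean_absolute_error", cv=cv)

# RMSE: square root of the average squared error across the folds.
rmse = np.sqrt(np.mean(-neg_mse))
# MAE: average absolute error across the folds (no square root).
mae = np.mean(-neg_mae)

print("RMSE =", rmse)
print("MAE  =", mae)
```

With noise of standard deviation 0.5, the RMSE should land near 0.5, and the MAE is never larger than the RMSE.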


**Note on Path to File:** Below I have specified the path to the CSV containing the data. You do not need to change the path for it to work in this notebook. If you download the code and run it elsewhere, update the path to point to your copy of the CSV.
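
For reference, the cell below expects the CSV to contain at least `date`, `identifier`, and `target` columns. The following sketch shows how the dummy encoding and predictor list work on a tiny in-memory table; the rows and the `market_cap` column name are hypothetical stand-ins, not values from the actual file.

```python
import pandas as pd

# Hypothetical rows: 'date', 'identifier', and 'target' are the column names
# the notebook relies on; 'market_cap' is an assumed name for illustration.
df = pd.DataFrame({
    "date": ["2020-01-31", "2020-01-31", "2020-02-29"],
    "identifier": ["AAA", "BBB", "AAA"],
    "market_cap": [1.2, 3.4, 1.3],
    "target": [0.01, -0.02, 0.03],
})

# One indicator column per identifier, named '<prefix><sep><value>'.
df = pd.get_dummies(df, prefix="Identifier ", prefix_sep="=",
                    columns=["identifier"])

# Every column except the target and the date becomes a predictor.
cols = [c for c in df.columns if c not in ("target", "date")]
print(cols)  # ['market_cap', 'Identifier =AAA', 'Identifier =BBB']
```

Note that the trailing space in the prefix is kept in the generated column names.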

In [None]:
if __name__ == '__main__':
    path_to_file = "data/data.csv"
    
    # Reads the information contained in the CSV
    df = pd.read_csv(path_to_file)

    # Turns the identifiers into dummy variables for the regression
    df = pd.get_dummies(df, prefix='Identifier ', prefix_sep='=',
                        columns=['identifier'])
    
    # Builds the list of predictor columns, called cols: every column except
    # the target and the date
    cols = list(df.columns)

    cols.remove('target')
    cols.remove('date')

    # Running the regression with the given dataset and predictor list
    calculate_model(df, cols)

    print('\n Model Complete')