# Linear Modeling

The goal of this lab is to design a class for linear regression, using no classes or functions from Scikit-Learn.

Ensure your class satisfies:
- Includes a class method to fit the model to a Pandas dataframe $X$ and a Pandas series $y$
- Includes a class method to solve for the optimal coefficients
- Includes a class method to make predictions, given a new matrix $\hat{X}$
- Does not invert any matrices explicitly. Instead, solve the normal equations using `np.linalg.solve`.
- It can be instructed to automatically perform a train-test split and return performance metrics on the test set.
- It can provide metrics including SSE, MSE, RMSE, and $R^2$.
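
As a minimal sketch of the "no explicit inverse" requirement, the normal equations $X^\top X \beta = X^\top y$ can be handed directly to `np.linalg.solve`. The helper names `solve_normal_equations` and `regression_metrics` and the synthetic data below are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def solve_normal_equations(X, y):
    """Solve (X^T X) beta = X^T y without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def regression_metrics(y_true, y_pred):
    """Return SSE, MSE, RMSE, and R^2 as a dictionary."""
    residuals = y_true - y_pred
    sse = float(np.sum(residuals ** 2))
    mse = sse / len(y_true)
    sst = float(np.sum((y_true - np.mean(y_true)) ** 2))
    return {"SSE": sse, "MSE": mse, "RMSE": float(np.sqrt(mse)), "R2": 1.0 - sse / sst}

# Hypothetical noise-free data: y = 2 + 3*x1 - x2
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X = np.column_stack([np.ones(50), features])  # intercept column first
y = 2.0 + 3.0 * features[:, 0] - features[:, 1]

beta = solve_normal_equations(X, y)
metrics = regression_metrics(y, X @ beta)
```

Because the data contain no noise, `beta` should recover the intercept and slopes used to generate `y`, and $R^2$ should be essentially 1.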

Before programming your class, consider the following questions and record the answers:
- How does your class handle categorical data? How does scikit-learn do it?
It does not; categorical columns must be one-hot encoded before fitting. scikit-learn's linear regression is the same: it expects numeric input, and encoding is a separate preprocessing step.
- How does your class handle missing data? How does scikit-learn do it?
It does not; missing values must be imputed or dropped before fitting. scikit-learn's linear regression is the same.
- Does your class have any methods for creating polynomial expansions or otherwise transforming data? How does scikit-learn do it?
It does not, but you can create your own polynomial expansions. scikit-learn has a preprocessing module that generates polynomial features for you.
- How does your class handle the bias/intercept/constant? How does scikit-learn do it?
My class adds an intercept column to the data. scikit-learn has a `fit_intercept` parameter, which defaults to `True`.
- What output do you automatically provide to the user? Why? How does scikit-learn do it?
I automatically calculate SSE, MSE, RMSE, and $R^2$ for the user. scikit-learn does not report these automatically, but it has a metrics module that lets users compute them easily.
- Are you including any tools for statistical inference? How does scikit-learn do it?
No, I am not including any tools for statistical inference. scikit-learn's linear regression model does not report standard errors or p-values either; a separate statistics library is needed for that.
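
The preprocessing steps mentioned in the answers above (one-hot encoding, imputation, polynomial expansion) could be done with pandas before the data reach the class. The following is only a sketch on a made-up dataframe; the column names `fuel` and `hp` are hypothetical, and mean imputation is one simple choice among many.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one categorical column and one missing value
df = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "electric"],
    "hp": [110.0, np.nan, 150.0, 200.0],
})

# One-hot encode; drop_first avoids perfect collinearity with the intercept column
encoded = pd.get_dummies(df, columns=["fuel"], drop_first=True, dtype=float)

# Impute the missing value with the column mean
encoded["hp"] = encoded["hp"].fillna(encoded["hp"].mean())

# Degree-2 polynomial expansion built by hand
encoded["hp_squared"] = encoded["hp"] ** 2
```

The resulting `encoded` dataframe is fully numeric and free of missing values, so it can be passed to a `fit` method that expects a numeric design matrix.
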

To measure how long code takes to run, you can `import time` and wrap the code like this:

```
start = time.time()
<expressions and code>
finish = time.time()
runtime = finish - start
```
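
A runnable version of this timing pattern might look like the sketch below. It uses `time.perf_counter`, a standard-library clock intended for measuring short durations, in place of `time.time`; the random linear system is a hypothetical workload chosen only to have something to time.

```python
import time
import numpy as np

# Hypothetical workload: solve a random 200x200 linear system
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 200))
b = rng.normal(size=200)

start = time.perf_counter()
x = np.linalg.solve(A, b)
finish = time.perf_counter()
runtime = finish - start  # elapsed seconds: later reading minus earlier reading
```

A single measurement is noisy, so repeating the timed section several times and averaging gives a fairer comparison.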

For the `heart_hw.csv` and `cars_hw.csv` data in the assignment folder, run some regressions and compare the performance of your class with scikit-learn's linear regression model. Do you get the same answers for the optimal coefficients, SSE, and $R^2$? Which one runs faster?
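
One way to set up such a comparison is sketched below on synthetic data, since the contents of the assignment files are not shown here. The normal-equations solve stands in for the custom class; the alias `SkLinearRegression` and the generated data are assumptions made for illustration.

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression as SkLinearRegression

# Hypothetical data: three features, known coefficients, small noise
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
y = 1.5 + features @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Normal equations with an explicit intercept column
start = time.perf_counter()
X = np.column_stack([np.ones(len(features)), features])
beta = np.linalg.solve(X.T @ X, X.T @ y)
own_runtime = time.perf_counter() - start

# scikit-learn fits the intercept itself (fit_intercept=True by default)
start = time.perf_counter()
sk_model = SkLinearRegression().fit(features, y)
sk_runtime = time.perf_counter() - start

# Put scikit-learn's intercept and slopes in the same order as beta
sk_beta = np.concatenate([[sk_model.intercept_], sk_model.coef_])
coefficients_match = np.allclose(beta, sk_beta)
```

On well-conditioned data the two coefficient vectors should agree to numerical precision; which approach is faster depends on the data size and the machine, so `own_runtime` and `sk_runtime` are best compared over several runs.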



In [None]:
#linear regression class
import pandas as pd
import numpy as np
from numpy.linalg import solve
from sklearn.model_selection import train_test_split

class LinearRegression:
    def __init__(self, split_data=False, test_size = 0.2, random_state = 0):
        self.coefficients = None
        self.split_data = split_data
        self.test_size = test_size
        self.random_state = random_state

    def train_test(self, X, Y):
        # split into train and test sets using the settings given to __init__
        return train_test_split(X, Y, test_size=self.test_size, random_state=self.random_state)

    def fit(self, X: pd.DataFrame, y: pd.Series):
        X = X.copy()  # avoid modifying the caller's dataframe
        X.insert(0, 'Intercept', 1)  # add intercept
        X = X.values
        Y = y.values.reshape(-1, 1)  # turn y into column
        
        if self.split_data:
            X_train, X_test, Y_train, Y_test = self.train_test(X, Y)
            
            
