# Simple Linear Regression - using Scikit Learn

By: Anuj Khandelwal (@anujonthemove)

In my previous notebooks I implemented simple linear regression using the closed-form solution and using gradient descent. This notebook demonstrates simple linear regression using scikit-learn.

**Dataset:** [Sweden Auto Insurance data](https://www.math.muni.cz/~kolacek/docs/frvs/M7222Q/data/AutoInsurSweden.txt)

**Task:** Predict payment for auto insurance claims in Sweden.

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Load data

In [2]:
df = pd.read_csv("../datasets/regression/univariate-regression/auto-insurance-sweden.csv", sep=',')
print(df.head())


## Rows and Columns

In [None]:
# rows and columns
rows, cols = df.shape
print("rows: {}".format(rows), "cols: {}".format(cols))

## Split data and convert to numpy arrays
Split the data into the input feature (first column) and the target (second column).<br>
Converting the dataframes to numpy arrays makes the math operations easier.

In [None]:
X_df = df.iloc[:,0:1]
y_df = df.iloc[:,1:2]

# converting dataframe object to numpy arrays for optimized math operations
X = np.array(X_df)
y = np.array(y_df)
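
The slices `0:1` and `1:2` in the cell above keep each selection two-dimensional, which is the shape scikit-learn expects for the feature matrix. The sketch below illustrates the difference on a tiny stand-in dataframe; the column names and values are illustrative assumptions, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the insurance data (names and values are assumptions).
df_demo = pd.DataFrame({"claims": [108, 19, 13], "payment": [392.5, 46.2, 15.7]})

# A slice keeps a DataFrame, so the array is 2-D: (n_samples, 1).
X_2d = np.array(df_demo.iloc[:, 0:1])

# A single integer index returns a Series, so the array is 1-D: (n_samples,).
x_1d = np.array(df_demo.iloc[:, 0])

print(X_2d.shape, x_1d.shape)
```

Passing the 1-D version as `X` to `fit` would raise an error, which is why the slice form is used here.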

## Linear model evaluation
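The model is evaluated with **k-fold cross-validation**. "Cross-validation is a technique to evaluate predictive models by partitioning the original sample into a training set to train the model, and a test set to evaluate it." In the k-fold variant the data is split into k folds; each fold serves once as the test set while the remaining k-1 folds are used for training. <br>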
Source: [OpenML](https://www.openml.org/a/estimation-procedures/1)
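
As a rough illustration of how the folds partition the data, here is a minimal sketch on six toy samples. The names `X_demo`, `kf_demo`, and `folds` are made up for this example, and shuffling is left off so the folds are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples with one feature each (illustrative data only).
X_demo = np.arange(6).reshape(-1, 1)

# Without shuffling, each test fold is a contiguous block of indices.
kf_demo = KFold(n_splits=3)
folds = list(kf_demo.split(X_demo))

for fold, (train_idx, test_idx) in enumerate(folds):
    print("fold {}: train={} test={}".format(fold, train_idx, test_idx))
```

Every sample lands in exactly one test fold, so each observation is used for evaluation once.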

In [None]:
# error metric
from sklearn.metrics import mean_squared_error
# cross-validation
from sklearn.model_selection import KFold
# import linear model
from sklearn import linear_model

# initialize k-fold cross-validation with 2 splits
kf = KFold(n_splits=2, shuffle=True)

# number of splitting iterations (returns 2); this call does not split the data itself
kf.get_n_splits(X)

# create linear regression object
regr = linear_model.LinearRegression()

results = []
score = []

for train_index, test_index in kf.split(X):
    predictions = regr.fit(X[train_index], y[train_index]).predict(X[test_index])
    results.append(mean_squared_error(y[test_index], predictions))
    score.append(regr.score(X[test_index], y[test_index]))
    
print("Mean of MSE: {}".format(np.mean(results)))
# LinearRegression.score returns the coefficient of determination (R^2)
print("Mean of R^2 score: {}".format(np.mean(score)))
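
The manual loop above could probably also be written with scikit-learn's `cross_val_score` helper. The sketch below uses synthetic data as a stand-in for the insurance file; the slope, intercept, noise level, and `random_state` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption, not the real dataset).
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 100, size=(60, 1))
y_demo = 20.0 + 3.4 * X_demo[:, 0] + rng.normal(0, 30, size=60)

# Fixing random_state makes the shuffled folds reproducible.
cv = KFold(n_splits=2, shuffle=True, random_state=0)

# scikit-learn scorers follow "higher is better", so MSE is returned negated.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=cv, scoring="neg_mean_squared_error")
r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("Mean of MSE: {}".format(-neg_mse.mean()))
print("Mean of R^2 score: {}".format(r2.mean()))
```

The explicit loop is still useful when the per-fold predictions themselves are needed.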

## Fit model on the entire dataset

In [None]:
regr.fit(X, y)
print("Coefficients: {}".format(regr.coef_))
print("Intercept: {}".format(regr.intercept_))
coeff = np.array([(regr.intercept_).flatten(), (regr.coef_).flatten()])
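
Since the earlier notebooks used the closed-form solution, a quick sanity check is to compare it against what `LinearRegression` reports. The sketch below does this on synthetic data (an assumption; the real file is not loaded here) using the standard formulas slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, not the real dataset).
rng = np.random.RandomState(0)
x = rng.uniform(0, 100, size=50)
y_demo = 20.0 + 3.4 * x + rng.normal(0, 30, size=50)

# Closed-form least-squares estimates for one feature.
slope = np.sum((x - x.mean()) * (y_demo - y_demo.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y_demo.mean() - slope * x.mean()

# scikit-learn fit on the same data; X must be 2-D.
model = LinearRegression().fit(x[:, np.newaxis], y_demo)

print("closed form : slope={:.4f}, intercept={:.4f}".format(slope, intercept))
print("scikit-learn: slope={:.4f}, intercept={:.4f}".format(model.coef_[0], model.intercept_))

# A prediction is just intercept + slope * x.
new_x = np.array([31, 14, 53, 26])
manual = intercept + slope * new_x
```

The two sets of numbers should agree up to floating-point error, and `manual` should match `model.predict(new_x[:, np.newaxis])`.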

## Prediction

In [None]:
Test_x = np.array([31, 14, 53, 26])
Test_x = Test_x[:, np.newaxis]
predictions = regr.predict(Test_x)
print("predictions: \n {}".format(predictions))

## Regression Line

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(X, y, alpha=0.5)
plt.title('Sweden Auto Insurance Dataset')
plt.xlabel('number of claims')
plt.ylabel('total payment (in thousands)')
plt.grid()
# the predictions all lie on the fitted line, so joining them draws the regression line
plt.plot(Test_x, predictions, 'g');