# Linear regression 
In this lab we train a linear regressor on synthetic data for a regression problem (the prediction of a real number). The training set has a single feature, "Time", as the independent variable, and the average packet size as the dependent variable. We want to predict the packet size in order to spot anomalies in the network traffic.
In this lab we focus on the **problem of overfitting the training data with Polynomial Regression**.

In [None]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

OUTPUT_FILE = "./reg_tree"

# Generate some sample data (a linear function of X with Gaussian noise)
np.random.seed(0)
m = 40
X = 6 * np.random.rand(m, 1) - 3 
y = 0.5 * X  + 80 + 0.3*np.random.randn(m, 1)

# Fit a linear regressor
lr = LinearRegression()
lr.fit(X, y)

In [None]:
print(f"Resulting curve: {lr.intercept_[0]} + {lr.coef_[0][0]}*Time")

# Predictions (numbers from -3 to 3 with increment 0.01)
X_plot = np.arange(-3, 3, 0.01)[:, np.newaxis]
y_pred = lr.predict(X_plot)

np.random.seed(1)
X_test = 6 * np.random.rand(m, 1) - 3 
y_test = 0.5 * X_test + 80 + 0.3*np.random.randn(m, 1)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="yellow", label="training data")
plt.scatter(X_test, y_test, s=20, edgecolor="black", c="red", label="test data")
plt.plot(X_plot, y_pred, color="cornflowerblue", label="prediction")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear Regression")
plt.legend()
plt.show()

# Error on the training set
y_pred = lr.predict(X)
mse = mean_squared_error(y,y_pred)
print("MSE measured on the training set: ", mse)

y_pred_test = lr.predict(X_test)
mse = mean_squared_error(y_test,y_pred_test)
print("MSE measured on the test set: ", mse)

# Linear regression with polynomial features
You can use a linear model to fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression. 
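
As a minimal sketch of the idea, the toy example below expands a single feature into its powers and fits an ordinary linear model on the result. The data and the names `x_demo`, `y_demo`, `poly_demo`, and `lr_demo` are illustrative assumptions and are not part of the lab's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Toy data (illustrative only): y = 1 + 2x + 3x^2, without noise
x_demo = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
y_demo = 1 + 2 * x_demo + 3 * x_demo**2

# Expand the single feature x into the two columns [x, x^2]
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
x_demo_poly = poly_demo.fit_transform(x_demo)
print(x_demo_poly)

# A plain linear model trained on the expanded features fits the quadratic
lr_demo = LinearRegression().fit(x_demo_poly, y_demo)
print(lr_demo.intercept_, lr_demo.coef_)
```

The model is still linear in its weights; only the input features are nonlinear in `x`. With `include_bias=False` the constant term is left to the regressor's own intercept.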

In [None]:
# X_poly contains the original feature of X plus its powers up to MAX_DEGREE
MAX_DEGREE = 19

poly_features = PolynomialFeatures(degree=MAX_DEGREE, include_bias=False)
X_poly = poly_features.fit_transform(X)

lr = LinearRegression() 
lr.fit(X_poly, y)
print(lr.intercept_, lr.coef_) # print bias and weights

In [None]:
# Predictions (numbers from -3 to 3 with increment 0.01)
X_plot = np.arange(-3, 3, 0.01)[:, np.newaxis]
# Reuse the transformer fitted on the training data (transform, not fit_transform)
X_plot_poly = poly_features.transform(X_plot)
y_pred = lr.predict(X_plot_poly)

np.random.seed(2)
X_test = 6 * np.random.rand(m, 1) - 3 
X_test_poly = poly_features.transform(X_test)  # transformer was fitted on the training data
y_test = 0.5 * X_test + 80 + 0.3*np.random.randn(m, 1)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="yellow", label="training data")
plt.scatter(X_test, y_test, s=20, edgecolor="black", c="red", label="test data")
plt.plot(X_plot, y_pred, color="cornflowerblue", label="prediction")
plt.xlabel("x")
plt.ylabel("y")
plt.ylim([78, 82])
plt.title("Polynomial Regression")
plt.legend()
plt.show()

# Error on the training set
y_pred = lr.predict(X_poly)
mse = mean_squared_error(y,y_pred)
print("MSE measured on the training set: ", mse)


y_pred_test = lr.predict(X_test_poly)
mse = mean_squared_error(y_test,y_pred_test)
print("MSE measured on the test set: ", mse)

# Plot the MSE trend on training and test set when varying the polynomial degree

In [None]:
mse_train = []
mse_test = []

for degree in range(1,MAX_DEGREE+1):
    poly_features = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly_features.fit_transform(X)
    X_test_poly = poly_features.transform(X_test)  # fit on training data only


    lr = LinearRegression() 
    lr.fit(X_poly, y)

    y_pred = lr.predict(X_poly)
    mse_train.append(mean_squared_error(y,y_pred))
    y_pred_test = lr.predict(X_test_poly)
    mse_test.append(mean_squared_error(y_test,y_pred_test))

# Index for the x-axis, increasing by 1
index = list(range(1, MAX_DEGREE+1))

# Plotting the line charts for both data sets
plt.xticks(np.arange(1, MAX_DEGREE+1, 1.0))
plt.plot(index, mse_train, marker='o', linestyle='-', color='b', label='Training set')
plt.plot(index, mse_test, marker='s', linestyle='--', color='r', label='Test set')

# Adding titles and labels
plt.title('Mean Squared Error (MSE)')
plt.xlabel('Polynomial Degree')
plt.ylabel('MSE')
plt.grid(True)
plt.legend()

# Display the plot
plt.show()