### HELLENIC OPEN UNIVERSITY - SCHOOL OF SCIENCE AND TECHNOLOGY
### DATA SCIENCE AND MACHINE LEARNING : DAMA61 ACAD. YEAR 2023-24

#### <center> WRITTEN ASSIGNMENT 2 - SOLUTIONS </center>

In [None]:
# increase the width of the notebook
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

### Problem 1

Use the following code to create a set of non-linear data.

In [None]:
import numpy as np
np.random.seed(40)
m = 1000
X = 10 * np.random.rand(m, 1) - 5
y = 2*X + X**2 + np.random.randn(m, 1)

1) Transform the data using a polynomial of degree four.<br>
2) Train a Lasso regularized linear regression model with alpha 0.01 on the polynomial features.<br>
3) What coefficients does the Lasso regularized model estimate? Comment on your results.<br>
4) Train a Ridge regularized linear regression model with alpha 0.01 on the polynomial features.<br>
5) What coefficients does the Ridge regularized model estimate? Comment on your results.<br>
<i>Hint: Check the API Documentation of <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso">Lasso</a> and <a href="https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge">Ridge</a> models.</i> 

In [None]:
# create the data
import numpy as np
np.random.seed(40)
m = 1000
X = 10 * np.random.rand(m, 1) - 5
y = 2 * X + X**2 + np.random.randn(m, 1)

In [None]:
# plot the data
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,6))
plt.plot(X, y, "o")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

In [None]:
# transform the data to polynomial features of degree 4
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=4, include_bias=False)
x_poly = poly_features.fit_transform(X)

In [None]:
# apply Lasso regularizer with alpha 0.01
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.01)
lasso.fit(x_poly, y)

In [None]:
# make predictions and display the data and the results
y_pred = lasso.predict(x_poly)

plt.plot(X, y, "b*", label="Ground Truth")
plt.plot(X, y_pred, "ro", label="Predicted")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

In [None]:
lasso.coef_

#### Comments:

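We fitted a polynomial of degree 4, although the data were generated from y = 2x + x^2 plus Gaussian noise. We observe that the Lasso model recovers the coefficients of the first- and second-order features while eliminating the contribution of the third- and fourth-order features. This is the behaviour expected from the L1 penalty, which can drive the coefficients of unneeded features to zero.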

In [None]:
# apply Ridge regularizer with alpha 0.01
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.01)
ridge.fit(x_poly, y)

In [None]:
# make predictions and display the data and the results
y_pred = ridge.predict(x_poly)

plt.plot(X, y, "b*", label="Ground Truth")
plt.plot(X, y_pred, "ro", label="Predicted")
plt.xlabel("X")
plt.ylabel("Y")
plt.legend()
plt.show()

In [None]:
ridge.coef_

#### Comments:

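Like the Lasso model, the Ridge model recovers the coefficients of the first- and second-order features, while the coefficients of the third- and fourth-order features are close to zero. Note that, unlike the L1 penalty of Lasso, the L2 penalty of Ridge shrinks coefficients towards zero but does not, in general, set them exactly to zero.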
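To make the comparison between the two regularizers concrete, the sketch below refits both models on the same polynomial features and counts how many coefficients are exactly zero. In general an L1 penalty can produce exact zeros, whereas an L2 penalty only shrinks coefficients. The variable names `lasso_coefs` and `ridge_coefs` are illustrative, and `y.ravel()` is used here so that both models return a flat coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures

# recreate the data of Problem 1
np.random.seed(40)
m = 1000
X = 10 * np.random.rand(m, 1) - 5
y = 2 * X + X**2 + np.random.randn(m, 1)

x_poly = PolynomialFeatures(degree=4, include_bias=False).fit_transform(X)

# fit both regularized models with the same alpha
lasso_coefs = Lasso(alpha=0.01).fit(x_poly, y.ravel()).coef_
ridge_coefs = Ridge(alpha=0.01).fit(x_poly, y.ravel()).coef_

# count the coefficients that are exactly zero in each model
print("Lasso:", lasso_coefs, "| exact zeros:", int(np.sum(lasso_coefs == 0)))
print("Ridge:", ridge_coefs, "| exact zeros:", int(np.sum(ridge_coefs == 0)))
```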

<hr>

### Problem 2

Work with the Iris dataset to create an SVM classifier to distinguish Iris-Setosa and Iris-Virginica samples:

1) Load the data and keep only the sepal length and sepal width features, and the Iris-Setosa and Iris-Virginica target values.<br>
2) Visualize the data on a plot displaying the sepal length on the x-axis and the sepal width on the y-axis.<br>
3) Train two linear SVM classifiers with the regularization hyperparameter C equal to 10 and 100, respectively.<br>
4) Which value of C leads to a more reliable model? Comment on your results.<br>
5) Use your choice of the value of C from the previous question and train a Logistic Regression model to classify the two classes again.<br>
6) Use a contour plot to visualize the probability of a sample being Iris-Setosa or Iris-Virginica over the full range of sepal lengths and sepal widths, from their minimum to their maximum values.<br>
7) What is the probability of a sample being Iris-Setosa given that its sepal length is 5.5 cm and its sepal width is 3.25 cm?<br>

In [None]:
# load the data
from sklearn import datasets
iris = datasets.load_iris(as_frame=True)

# keep the sepal-length and sepal-width features
X = iris.data[["sepal length (cm)", "sepal width (cm)"]].values
y = iris.target

# keep the setosa and virginica samples
setosa_or_virginica = (y == 0) | (y == 2)
X = X[setosa_or_virginica]
y = y[setosa_or_virginica]

In [None]:
import matplotlib.pyplot as plt

# visualize the data
fig = plt.figure(figsize=(8,6))
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs", label="Iris-Setosa")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "ro", label="Iris-Virginica")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.legend()
plt.grid()
plt.show()

In [None]:
# define a function that draws the decision boundary of a given SVM classifier
def plot_svc_decision_boundary(svm_clf, xmin, xmax):
    w = svm_clf.coef_[0]
    b = svm_clf.intercept_[0]

    # At the decision boundary, w0*x0 + w1*x1 + b = 0
    # => x1 = -w0/w1 * x0 - b/w1
    x0 = np.linspace(xmin, xmax, 200)
    decision_boundary = -w[0] / w[1] * x0 - b / w[1]

    margin = 1/w[1]
    gutter_up = decision_boundary + margin
    gutter_down = decision_boundary - margin
    svs = svm_clf.support_vectors_

    plt.plot(x0, decision_boundary, "k-", linewidth=2, zorder=-2)
    plt.plot(x0, gutter_up, "k--", linewidth=2, zorder=-2)
    plt.plot(x0, gutter_down, "k--", linewidth=2, zorder=-2)
    plt.scatter(svs[:, 0], svs[:, 1], s=180, facecolors='#AAA', zorder=-1)

In [None]:
import numpy as np
from sklearn.svm import SVC

# train an SVM classifier with linear kernel and C = 10
svm_clf10 = SVC(kernel="linear", C=10)
svm_clf10.fit(X, y)

# train an SVM classifier with linear kernel and C = 100
svm_clf100 = SVC(kernel="linear", C=100)
svm_clf100.fit(X, y)

# visualize our results
fig = plt.figure(figsize=(14, 6))

plt.subplot(121)
plt.title("C = 10")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs", label="Iris-Setosa")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "ro", label="Iris-Virginica")

# plot the decision boundary for C=10
plot_svc_decision_boundary(svm_clf10, min(X[:,0]), max(X[:,0]))
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.legend()
plt.grid()

plt.subplot(122)
plt.title("C = 100")
plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs", label="Iris-Setosa")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "ro", label="Iris-Virginica")

# plot the decision boundary for C=100
plot_svc_decision_boundary(svm_clf100, min(X[:,0]), max(X[:,0]))

plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.legend()
plt.grid()
plt.show()

#### Comments:

We observe that the lower value of C, i.e., C=10, yields a more reliable model that should generalize better: it keeps the street as wide as possible while limiting the margin violations. On the other hand, C=100 penalizes margin violations more heavily, which narrows the street and makes the model more sensitive to outliers.
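The width of the street can also be compared numerically. For a linear SVM with weight vector w, the distance between the two gutters is 2 / ||w||. The sketch below is self-contained: it reloads the data, refits both classifiers, and reports the street width and the number of support vectors for each value of C. The `results` dictionary is a name introduced here for illustration, and the data are loaded as plain arrays rather than as a data frame to keep the example short.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# sepal length and sepal width are the first two columns of the Iris data
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# keep the setosa (0) and virginica (2) samples
mask = (y == 0) | (y == 2)
X, y = X[mask], y[mask]

results = {}
for C in (10, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # distance between the two gutters of the street
    width = 2 / np.linalg.norm(clf.coef_[0])
    results[C] = (width, len(clf.support_vectors_))
    print(f"C = {C:>3}: street width = {width:.3f}, support vectors = {results[C][1]}")
```

A wider street for the smaller C is consistent with the plots above, since a smaller C penalizes margin violations less strongly.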

In [None]:
# train a logistic regression model
from sklearn.linear_model import LogisticRegression
softmax_reg = LogisticRegression(C=10, random_state=42)
softmax_reg.fit(X, y)

In [None]:
lengths, widths = np.meshgrid(np.linspace(min(X[:,0]), max(X[:,0]), 500).reshape(-1, 1),
                              np.linspace(min(X[:,1]), max(X[:,1]), 200).reshape(-1, 1))

X_new = np.c_[lengths.ravel(), widths.ravel()]

y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

probs = y_proba[:, 1].reshape(lengths.shape)
classes = y_predict.reshape(lengths.shape)

plt.figure(figsize=(8, 6))

plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs", label="Iris-Setosa")
plt.plot(X[:, 0][y==2], X[:, 1][y==2], "ro", label="Iris-Virginica")

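# filled contours show P(Iris-Virginica); the line contour marks the predicted class boundary
prob_contour = plt.contourf(lengths, widths, probs, alpha=0.5, cmap="bwr")
contour = plt.contour(lengths, widths, classes, cmap="bwr")
plt.colorbar(prob_contour, label="P(Iris-Virginica)")

plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.legend()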
plt.grid()
plt.show()

In [None]:
print(f"The probability of a sample being Iris setosa given that its sepal length is 5.5 cm and its sepal width is 3.25 cm is {100*softmax_reg.predict_proba([[5.5, 3.25]])[0][0]:.2f}%.")
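
#### Comments:

The probability printed above can be reproduced by hand. For a binary Logistic Regression model, the probability of the positive class (here `classes_[1]`, i.e., Iris-Virginica) is the sigmoid of the linear score w·x + b, and the probability of Iris-Setosa is its complement. The sketch below is self-contained and refits the model on plain arrays; the names `log_reg`, `sample`, `score`, `p_virginica` and `p_setosa` are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# sepal length and sepal width, setosa (0) and virginica (2) samples only
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
mask = (y == 0) | (y == 2)
X, y = X[mask], y[mask]

log_reg = LogisticRegression(C=10, random_state=42)
log_reg.fit(X, y)

sample = np.array([[5.5, 3.25]])

# linear score w.x + b for the positive class (classes_[1] == 2, virginica)
score = (sample @ log_reg.coef_.T + log_reg.intercept_)[0, 0]

# sigmoid of the score gives P(virginica); its complement is P(setosa)
p_virginica = 1 / (1 + np.exp(-score))
p_setosa = 1 - p_virginica

print("classes:", log_reg.classes_)
print("P(setosa) by hand:      ", p_setosa)
print("P(setosa) predict_proba:", log_reg.predict_proba(sample)[0, 0])
```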