---

# Project Management Assignment 1

description: This document is a study on how different statistical and machine learning models can be used to predict data based on some parameters. 

authors: Diego Díaz Vidal, Marco Fernández Pérez, Lucía Patricia Gil Díaz

date: 04/2024

---

## Part 2: Effort Estimation

In this part of the assignment we will study the relationship between different variables (size, resources, duration, etc.) and the effort needed to carry out a project. For this purpose, we will use data from two datasets: `albretch.arff` and `china.arff`.
Both datasets contain observations of projects for which the effort is known (**supervised learning**). The goal is to train a machine learning model on part of that data and then use the model to predict the effort for the rest.

## Task 1: albretch.arff

For this task, we will use data from the "albretch" dataset. This dataset contains observations with only two variables: the size of the project and the effort needed.

The following code imports the necessary libraries and loads the data from a local file.

In [None]:
# Imports
import arff
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
with open('./albretch.arff', 'r') as f:
    data = arff.load(f)

# Convert the data to numpy arrays
size = np.array([row[0] for row in data['data']])
effort = np.array([row[1] for row in data['data']])

Next, we will create and study two different estimation models and compare the results to see which one best suits this data.

### 1.1. Linear Regression Model

A linear regression model is a statistical method used for modeling the **relationship between a dependent variable** (also known as the target variable) **and one or more independent variables** (also known as the predictor variables). It assumes that there is a linear relationship between the independent variables and the dependent variable.

The basic idea behind the linear regression model is to find the best-fitting straight line that describes the relationship between the independent variables and the dependent variable. This line is called the **regression line**.

The goal of the linear regression model is to **estimate the coefficients that minimize the sum of the squared differences between the observed values of the dependent variable and the values predicted by the regression line**.

Once the coefficients are estimated, **the regression line can be used to make predictions** about dependent variables for new values of the independent variables.
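For a single predictor, the least-squares coefficients have a well-known closed form. The following sketch uses made-up numbers (not the albretch data) to compute the slope and intercept from that formula and compare them with `np.polyfit`. The helper name `fit_line` and the demo arrays are hypothetical and only serve as illustration.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y = m*x + b for one predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b = y_mean - m * x_mean
    return m, b

# Illustrative values only
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

m_demo, b_demo = fit_line(x_demo, y_demo)
m_ref, b_ref = np.polyfit(x_demo, y_demo, 1)

print(f"closed form: m={m_demo:.3f}, b={b_demo:.3f}")
print(f"np.polyfit:  m={m_ref:.3f}, b={b_ref:.3f}")
```

Both approaches should agree up to floating-point error, since `np.polyfit` with degree 1 solves the same least-squares problem.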

The following code creates a linear regression model that calculates those coefficients and then uses the regression line to calculate the **residuals**, the mean of their absolute values (referred to below as the **mean residual**), and the **mean squared error**.

In [None]:
m, b = np.polyfit(size, effort, 1)              # regression line
effort_predicted = m * size + b                 # predict the effort

residuals = effort - effort_predicted           # residuals
mean_residual = np.mean(np.abs(residuals))      # mean residual

mse = np.mean((effort - effort_predicted)**2)   # mean squared error

In this case, due to the small number of observations, we decided to use all data for the training of the model.

The results will be shown in the following code.

In [None]:
# Print the results
print(f"The Linear Regression is: Effort = {m:.2f}*Size + {b:.2f}")
print(f"The Mean Squared Error is: {mse:.2f}")
print(f"The Mean Residual is: {mean_residual:.2f}")

# Plot the data and the regression line
plt.figure(figsize=(6, 6))
plt.scatter(size, effort, color='blue', label='Data')
order = np.argsort(size)                        # sort by size so the band is drawn left to right
plt.plot(size[order], effort_predicted[order], color='red', label='Linear Regression')
plt.fill_between(size[order], effort_predicted[order] - mean_residual,
                 effort_predicted[order] + mean_residual,
                 color='green', alpha=0.2, label='Mean Residual Region')
plt.title('Size vs Effort')
plt.xlabel('Size')
plt.ylabel('Effort')
plt.grid(True)
plt.legend()
plt.show()

As we can see, the mean residual is notably high relative to the magnitude of the effort values. This is probably because there is a weak relationship between the variables or high variability not explained by the model. This could be due to:

1. **Nonlinearity in the data**: if the data does not follow a linear relationship, a linear model may not fit well, resulting in large residuals. We will later compare the results obtained with this linear model with the results obtained with a polynomial regression model.
2. **Presence of outliers**: outliers can significantly affect the model and increase residuals, especially in a small dataset like this one.
3. **Low number of observations**: the greater the number of observations, the more robust the model is to outliers, and the more accurately it can identify the relationship between the variables (if one exists).
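To illustrate the second and third points, the following sketch fits a line to five made-up points (not the albretch data) with and without a single outlier. All values and variable names are hypothetical.

```python
import numpy as np

# Illustrative values only: five points lying exactly on y = 2x
x_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = 2.0 * x_clean

# Same data with the last observation replaced by an outlier
y_outlier = y_clean.copy()
y_outlier[-1] = 30.0

coef_clean = np.polyfit(x_clean, y_clean, 1)
coef_outlier = np.polyfit(x_clean, y_outlier, 1)

# Mean absolute residual of each fit on its own data
mar_clean = np.mean(np.abs(y_clean - np.polyval(coef_clean, x_clean)))
mar_outlier = np.mean(np.abs(y_outlier - np.polyval(coef_outlier, x_clean)))

print(f"slope without outlier: {coef_clean[0]:.2f}, mean residual: {mar_clean:.2f}")
print(f"slope with outlier:    {coef_outlier[0]:.2f}, mean residual: {mar_outlier:.2f}")
```

With so few observations, one unusual point is enough to pull the slope away from the trend of the remaining data and to raise the mean residual.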

### 1.2. Polynomial Regression Model

After observing the results produced by the Linear Regression Model, we decided to continue studying the relationship between the two variables by training another model: a polynomial regression model.

While **the Linear Regression Model assumes a linear relationship between the variables**, **the Polynomial Model allows for a non-linear relationship**, so it can capture curvature that a straight line misses.
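One point worth keeping in mind when comparing the two models: a degree-2 polynomial contains the straight line as a special case, so its error on the data it was fitted to can never be higher than that of the linear fit. The sketch below shows this on synthetic data that is truly linear plus noise. The data, the seed, and the helper name `training_mse` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a truly linear relationship plus Gaussian noise
x_demo = np.linspace(0.0, 10.0, 20)
y_demo = 3.0 * x_demo + 5.0 + rng.normal(scale=2.0, size=x_demo.size)

def training_mse(x, y, degree):
    """Fit a polynomial of the given degree and return its MSE on the same data."""
    coefs = np.polyfit(x, y, degree)
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

mse_deg1 = training_mse(x_demo, y_demo, 1)
mse_deg2 = training_mse(x_demo, y_demo, 2)

print(f"training MSE, degree 1: {mse_deg1:.4f}")
print(f"training MSE, degree 2: {mse_deg2:.4f}")
```

Even though the underlying relationship here is linear, the degree-2 fit reports an equal or lower training error, so a lower training error alone does not show that the relationship is non-linear.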

In [None]:
coefficients = np.polyfit(size, effort, 2)      # polynomial regression

# Function to calculate the predicted values
def polyval(coefficients, x):
    return coefficients[0] * x**2 + coefficients[1] * x + coefficients[2]

effort_predicted = polyval(coefficients, size)  # predict the effort

residuals = effort - effort_predicted           # residuals
mean_residual = np.mean(np.abs(residuals))      # mean residual

mse = np.mean((effort - effort_predicted)**2)   # mean squared error

# Equation of the polynomial regression
equation = f"Effort = {coefficients[0]:.2f} * Size^2 + {coefficients[1]:.2f} * Size + {coefficients[2]:.2f}"

This code creates a polynomial model of degree 2 that studies the relationship between the size of the project and the effort needed. As with the linear model, in addition to the mean residual we also calculate the mean squared error to assess how well the polynomial regression captures the relationship.

Now we will observe the results.

In [None]:
# Print the results
print("Equation of the Polynomial Regression:", equation)
print(f"The Mean Squared Error is: {mse:.2f}")
print(f"The Mean Residual is: {mean_residual:.2f}")

# Plot the data and the polynomial regression
plt.figure(figsize=(6, 6))
plt.scatter(size, effort, color='blue', label='Data')
sorted_indices = np.argsort(size)
plt.plot(size[sorted_indices], effort_predicted[sorted_indices], color='red', label='Polynomial Regression')
plt.fill_between(size[sorted_indices], effort_predicted[sorted_indices] - mean_residual,
                 effort_predicted[sorted_indices] + mean_residual, color='green',
                 alpha=0.2, label='Mean Residual Region')
plt.title('Size vs Effort')
plt.xlabel('Size')
plt.ylabel('Effort')
plt.grid(True)
plt.legend()
plt.show()

### Results

Both models provide equations that express the effort as a function of size. To compare how well each model fits the actual data, we will use the following statistics:

- Mean Residual: this value represents the average absolute distance between the actual values and the values predicted by the curve. For the linear model, the mean residual is 143.14, while for the polynomial model it is 110.45, which is lower. This means that the curve drawn by the polynomial model is, on average, closer to the actual values.
- Mean Squared Error: while the linear model has an MSE of 28697.99, the polynomial model's MSE is 21549.97, which is also lower.
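As a small worked example of how these two statistics differ, the following sketch computes both on three made-up predictions. The function names and the numbers are hypothetical and are not taken from either dataset.

```python
import numpy as np

def mean_absolute_residual(actual, predicted):
    """Average absolute distance between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

def mean_squared_error_np(actual, predicted):
    """Average squared distance between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Illustrative values only: residuals are -2, 2 and -6
actual_demo = np.array([10.0, 20.0, 30.0])
predicted_demo = np.array([12.0, 18.0, 36.0])

print(f"mean residual: {mean_absolute_residual(actual_demo, predicted_demo):.3f}")
print(f"MSE:           {mean_squared_error_np(actual_demo, predicted_demo):.3f}")
```

Because the MSE squares each residual, the single error of 6 contributes far more to it than the two errors of 2, whereas the mean residual weights all errors in proportion to their size. The MSE is also expressed in squared units of effort, which is why its values are so much larger.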

These results indicate that the polynomial regression model fits the available data better, making it a better candidate than the linear regression model when choosing a model to predict new outcomes. Note, however, that both models were trained and evaluated on the same observations, so these figures measure fit rather than predictive accuracy on unseen projects. In any case, neither model is accurate enough to be considered reliable for this dataset, which may be due to a weak dependence between the variables or to the limited volume of data.

## Task 2: china.arff

As in the previous task, in this part we will load a dataset containing observations of various projects.

This dataset is much bigger than the previous one, so for convenience, we will convert it to csv format before starting to work with it.
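The conversion step itself is not shown in this notebook. As one possible approach, the following sketch copies the attribute names and data rows of a simple ARFF file into CSV using only the standard library. It assumes a dense file with unquoted, comma-separated values and attribute names without spaces; the function name `arff_to_csv` and the in-memory example are hypothetical.

```python
import csv
import io

def arff_to_csv(arff_file, csv_file):
    """Copy attribute names and data rows from a simple dense ARFF file to CSV."""
    writer = csv.writer(csv_file)
    header = []
    in_data = False
    for raw_line in arff_file:
        line = raw_line.strip()
        if not line or line.startswith('%'):       # skip blank lines and comments
            continue
        if in_data:
            writer.writerow([value.strip() for value in line.split(',')])
        elif line.lower().startswith('@attribute'):
            header.append(line.split()[1])          # second token is the attribute name
        elif line.lower().startswith('@data'):
            writer.writerow(header)                 # header row goes first
            in_data = True

# Tiny in-memory example (hypothetical content, not the china dataset)
arff_text = """@relation demo
@attribute Size numeric
@attribute Effort numeric
@data
100,250
200,610
"""
csv_buffer = io.StringIO()
arff_to_csv(io.StringIO(arff_text), csv_buffer)
print(csv_buffer.getvalue())
```

With real files, the same function could be called with `open('china.arff')` as input and `open('chinaOriginal.csv', 'w', newline='')` as output.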

Additionally, since we have many more observations at our disposal, we can split the dataset into training and testing sets to perform a more thorough analysis of the models.
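The idea behind such a split can be sketched in a few lines of NumPy: shuffle the row indices, then set a fraction of them aside for testing. This is only an illustration of the concept, with a hypothetical function name and toy arrays; the cells below use scikit-learn's `train_test_split`, whose shuffling will not produce the same rows as this sketch.

```python
import numpy as np

def holdout_split(X, y, test_size=0.2, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))               # random order of row indices
    n_test = int(np.ceil(test_size * len(X)))       # round the test size up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 observations with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo, test_size=0.2)
print("training rows:", len(X_tr), "testing rows:", len(X_te))
```

The model never sees the held-out rows during training, so the error measured on them is an estimate of how the model behaves on new projects.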

### 2.1. Linear Regression Model

Just like with the `albretch` dataset, here we will train a Linear Regression Model and evaluate its predictive capabilities.

In [None]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Read the file
data = pd.read_csv("chinaOriginal.csv")

# Divide the data into training and testing
X = data.drop(columns=["Effort"])
y = data["Effort"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

linear_model = LinearRegression()               # create the model
linear_model.fit(X_train, y_train)              # train the model using only training data
y_pred_linear = linear_model.predict(X_test)    # predict the effort using testing data

mse_linear = mean_squared_error(y_test, y_pred_linear)  # mean squared error
r2 = r2_score(y_test, y_pred_linear)                    # R^2 score

residuals = y_test - y_pred_linear      # residuals
average_residual = residuals.mean()     # average residual

# Print Metrics
print("Mean Squared Error:", mse_linear)
print("R2 Score:", r2)
print("Coefficients:", linear_model.coef_)
print("Intercept:", linear_model.intercept_)
print("Average Residual:", average_residual)

# Visualize the results of the predictions
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred_linear, color='blue', label='Actual Values vs Predictions')
# Reference line (perfect prediction)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', color='red', label='Perfect Predictions')
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Actual Values vs Predictions (Linear Model)')
plt.grid(True)
plt.legend()
plt.show()

# Visualize the results for the residuals
plt.figure(figsize=(6, 6))
plt.scatter(y_pred_linear, residuals, color='blue', label='Residuals vs Predictions')
plt.xlabel('Predictions')
plt.ylabel('Residuals')
plt.title('Residuals Diagram (Linear Regression)')
# Reference line (perfect residuals)
plt.axhline(y=0, color='red', linestyle='--', label='Perfect Residuals')
plt.grid(True)
plt.legend()
plt.show()

### 2.2. Neural Network

Multi-layer Perceptron Regressor (MLP) is a type of **artificial neural network that is used in regression tasks**. It consists of multiple layers of nodes (neurons) arranged in a feedforward manner, where each node in one layer is connected to every node in the subsequent layer.

The **training** of MLP Regressors **is carried out using optimization algorithms such as gradient descent to minimize a loss function, which measures the difference between the predicted and the actual values**.
**Backpropagation is used to compute the gradients with which the weights of the network are updated iteratively during training**.
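To make the training loop concrete, the following sketch trains a network with one hidden layer on a toy problem using plain full-batch gradient descent and hand-written backpropagation. It is a simplified illustration, not what `MLPRegressor` does internally (the model below uses the `adam` solver and two hidden layers). The toy data, layer size, learning rate, and number of epochs are all arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative only): y = x^2 on [-1, 1]
X_toy = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y_toy = X_toy ** 2

# One hidden layer with 8 ReLU units and a linear output unit
W1 = rng.normal(scale=0.5, size=(1, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

learning_rate = 0.05
n_samples = X_toy.shape[0]
losses = []

for epoch in range(500):
    # Forward pass
    hidden_pre = X_toy @ W1 + b1
    hidden = np.maximum(hidden_pre, 0.0)            # ReLU activation
    y_hat = hidden @ W2 + b2
    error = y_hat - y_toy
    losses.append(float(np.mean(error ** 2)))       # MSE loss

    # Backward pass: propagate the gradient of the loss layer by layer
    grad_y_hat = 2.0 * error / n_samples
    grad_W2 = hidden.T @ grad_y_hat
    grad_b2 = grad_y_hat.sum(axis=0)
    grad_hidden = grad_y_hat @ W2.T
    grad_hidden_pre = grad_hidden * (hidden_pre > 0)  # derivative of ReLU
    grad_W1 = X_toy.T @ grad_hidden_pre
    grad_b1 = grad_hidden_pre.sum(axis=0)

    # Gradient descent update
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

print(f"loss at start: {losses[0]:.4f}, loss at end: {losses[-1]:.4f}")
```

Each iteration computes predictions, measures the loss, sends the gradient backwards through the layers, and moves every weight a small step against its gradient.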

In [None]:
# Imports
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score, explained_variance_score, mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error, median_absolute_error

from sklearn.neural_network import MLPRegressor

# Read the file
data = pd.read_csv("chinaOriginal.csv")

# Divide the data into training and testing
X = data.drop(columns=["Effort"])
y = data["Effort"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Multi-layer Perceptron regressor Model
mlp_model = MLPRegressor(hidden_layer_sizes=(100, 50), activation='relu', solver='adam', random_state=42)

mlp_model.fit(X_train, y_train)         # train the model
y_pred_mlp = mlp_model.predict(X_test)  # predict the effort

# Metrics
mse_mlp = mean_squared_error(y_test, y_pred_mlp)    # mean squared error
r2_mlp = r2_score(y_test, y_pred_mlp)               # R^2 score

residuals = y_test - y_pred_mlp         # residuals
average_residual = residuals.mean()     # average residual

# Print Metrics
print("Mean Squared Error (MLPRegressor):", mse_mlp)
print("R2 Score (MLPRegressor):", r2_mlp)
print("Average Residual:", average_residual)

# Visualize the results of the predictions
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred_mlp, color='blue', label='Actual Values vs Predictions')
# Reference line (perfect prediction)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', color='red', label='Perfect Predictions')
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Actual Values vs Predictions (Neural Network)')
plt.grid(True)
plt.legend()
plt.show()

# Visualize the results for the residuals
plt.figure(figsize=(6, 6))
plt.scatter(y_pred_mlp, residuals, color='blue', label='Residuals vs Predictions')
plt.xlabel('Predictions')
plt.ylabel('Residuals')
plt.title('Residuals Diagram (MLP)')
# Reference line (perfect residuals)
plt.axhline(y=0, color='red', linestyle='--', label='Perfect Residuals')
plt.grid(True)
plt.legend()
plt.show()

### 2.3. Support Vector Regression

Support Vector Regression (SVR) is a type of Support Vector Machine (SVM) algorithm that is used for **regression tasks**. Similar to classification using SVM, SVR also works by **finding the hyperplane that best fits the data**. However, **in SVR, the goal is to minimize the error between the predicted value and the actual value**, rather than simply classifying data points into different categories.
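More precisely, the standard SVR formulation measures this error with an epsilon-insensitive loss: deviations smaller than a margin `epsilon` are ignored, and larger ones are penalized linearly. scikit-learn's `SVR` exposes this margin as its `epsilon` parameter (0.1 by default). The sketch below computes that loss for a few made-up values; the function name and the numbers are hypothetical.

```python
import numpy as np

def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Per-sample SVR loss: zero inside the epsilon tube, linear outside it."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    deviation = np.abs(actual - predicted)
    return np.maximum(0.0, deviation - epsilon)

# Illustrative values only: deviations of 0.05, 0.5 and 1.0
actual_demo = np.array([1.0, 2.0, 3.0])
predicted_demo = np.array([1.05, 2.5, 2.0])

losses_demo = epsilon_insensitive_loss(actual_demo, predicted_demo, epsilon=0.1)
print(losses_demo)
```

The first prediction falls inside the tube and contributes no loss, while the other two are penalized only for the part of their deviation that exceeds `epsilon`.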

The following code creates an SVR model and uses it to predict the effort based on the other parameters.

In [None]:
# Imports
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

from sklearn.svm import SVR

# Read the file
data = pd.read_csv("chinaOriginal.csv")

# Divide the data into training and testing
X = data.drop(columns=["Effort"])
y = data["Effort"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the SVR model
svr_model = SVR(kernel='linear')

svr_model.fit(X_train, y_train)         # train the model
y_pred_svr = svr_model.predict(X_test)  # predict the effort

# Metrics
mse_svr = mean_squared_error(y_test, y_pred_svr)    # mean squared error
r2_svr = r2_score(y_test, y_pred_svr)               # R^2 score

residuals_svr = y_test - y_pred_svr         # residuals
average_residual_svr = residuals_svr.mean() # mean residuals

# Print Metrics
print("Mean Squared Error (SVR):", mse_svr)
print("R2 Score (SVR):", r2_svr)
print("Average Residual (SVR):", average_residual_svr)

# Visualize the results of the predictions
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_pred_svr, color='blue', label='Actual Values vs Predictions')
# Reference line (perfect prediction)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', color='red', label='Perfect Predictions')
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Actual Values vs Predictions (Support Vector Regression)')
plt.grid(True)
plt.legend()
plt.show()

# Visualize the results for the residuals
plt.figure(figsize=(6, 6))
plt.scatter(y_pred_svr, residuals_svr, color='blue', label='Residuals vs Predictions')
plt.xlabel('Predictions')
plt.ylabel('Residuals')
plt.title('Residuals Diagram (SVR)')
# Reference line (perfect residuals)
plt.axhline(y=0, color='red', linestyle='--', label='Perfect Residuals')
plt.grid(True)
plt.legend()
plt.show()

### Conclusions

After studying the performance of multiple models (Linear, SVR and MLP) we have obtained the following results:

|             |LINEAR   |MLP      |SVR      |
|:-----------:|:-------:|:-------:|:-------:|
|MSE          |1679987.5|647875.52|322104.17|
|R2           |0.944    |0.978    |0.989    |
|Mean Residual|-9.105   |38.43    |76.052   |

According to these results:
- The Support Vector Regression Model has the lowest MSE value, indicating better predictive quality.
- The SVR also has the best R2 coefficient.
- Finally, we observe that the linear model has a negative mean residual. Since the residuals are computed as actual value minus predicted value, this means that the linear model overestimated the effort on average, while the other two, whose mean residuals are positive, underestimated it.

In conclusion, the SVR appears to be the most suitable model for this dataset based on the MSE and R2 values. However, other factors such as training time and the interpretability of the model should also be considered.
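For reference, the R2 score used in this comparison relates the squared error of the model to the squared error of always predicting the mean of the actual values. The following sketch computes it by hand on made-up numbers; the function name and the values are hypothetical and unrelated to the table above.

```python
import numpy as np

def r2_from_scratch(actual, predicted):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative values only
actual_demo = np.array([1.0, 2.0, 3.0, 4.0])
predicted_demo = np.array([1.1, 1.9, 3.2, 3.8])

print(f"R2 of the demo predictions: {r2_from_scratch(actual_demo, predicted_demo):.3f}")

# A model that always predicts the mean of the actual values scores 0
mean_only = np.full_like(actual_demo, actual_demo.mean())
print(f"R2 of predicting the mean:  {r2_from_scratch(actual_demo, mean_only):.3f}")
```

A score close to 1 therefore means that the model leaves little of the variance in the test efforts unexplained, which is consistent with the ordering of the MSE values.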