a) Description of the data set and its characteristics:

The dataset contains information about various independent tasks that a computer needs to perform. The dataset has four main features:

Number of instructions (in units of 10^9 instructions, a measure of the task's computational workload)
Memory required (in MB)
Input file size (in MB)
Output file size (in MB)

The dataset consists of at least 70 data points. The goal is to understand the relationship between these features and to predict the resource requirements of future tasks. In this notebook, the output file size serves as the target variable and the other three features serve as predictors.

b) Loading the selected dataset and displaying the first seven rows:

In [18]:
import pandas as pd

data = pd.read_csv("tasks.csv")
data.head(7)


,id,Number of instructions (109 instructions),Memory required (MB),Input file size (MB),Output file size (MB)
0,Task_1,37.0,124.0,27.0,59.0
1,Task_2,27.0,155.0,26.0,60.0
2,Task_3,25.0,121.0,88.0,53.0
3,Task_4,25.0,185.0,57.0,83.0
4,Task_5,69.0,115.0,46.0,97.0
5,Task_6,60.0,69.0,13.0,75.0
6,Task_7,54.0,61.0,50.0,50.0


c) Apply the necessary pre-processing to the dataset:

In [19]:
# Check for missing values
print(data.isnull().sum())

# Impute missing values with the column means (numeric columns only)
data.fillna(data.mean(numeric_only=True), inplace=True)

# Scale the features
from sklearn.preprocessing import StandardScaler
data.drop(columns='id', inplace=True)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
scaled_data = pd.DataFrame(scaled_data, columns=data.columns)

# Split the dataset into features (X) and the target variable (y)
X = scaled_data.drop("Output file size (MB)", axis=1)
y = scaled_data["Output file size (MB)"]

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


id                                           0
Number of instructions (109 instructions)    4
Memory required (MB)                         7
Input file size (MB)                         5
Output file size (MB)                        4
dtype: int64
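
The cell above imputes and scales the full dataset before splitting it, so statistics from the test rows influence the training data. A minimal sketch of a split-first alternative follows. It assumes synthetic stand-in data, and the shortened column names and variable names are hypothetical, not taken from tasks.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for tasks.csv: random values with a few gaps.
rng = np.random.default_rng(0)
cols = ["instructions", "memory_mb", "input_mb", "output_mb"]
demo = pd.DataFrame(rng.uniform(10, 200, size=(70, 4)), columns=cols)
demo.iloc[::10, 1] = np.nan  # inject missing values into one feature

X_demo = demo.drop(columns="output_mb")
y_demo = demo["output_mb"]

# Split first, so the test rows never influence the imputer or the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Imputation means and scaling statistics are learned from the training rows only.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_tr_scaled = preprocess.fit_transform(X_tr)
X_te_scaled = preprocess.transform(X_te)
```

In this sketch the target is left unscaled, which keeps the mean squared error in the original units (MB squared).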




d) Train a model using one of the methods learned in the lesson:

Here, we will use a simple linear regression model.

In [20]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)


e) Evaluate the model:

In [33]:
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions using the test set
y_pred = model.predict(X_test)

# Calculate the mean squared error and R-squared value
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared Value:", r2)


Mean Squared Error: 0.8507658977810996
R-squared Value: -0.05438986085577868


Next, we use a Random Forest Regressor model.

In [22]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regressor model
model_Random_Forest = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model to the training data
model_Random_Forest.fit(X_train, y_train)


In [32]:
# Make predictions using the test set
y_pred_Random_Forest = model_Random_Forest.predict(X_test)

# Calculate the mean squared error and R-squared value
mse_Random_Forest = mean_squared_error(y_test, y_pred_Random_Forest)
r2_Random_Forest = r2_score(y_test, y_pred_Random_Forest)

print("Mean Squared Error:", mse_Random_Forest)
print("R-squared Value:", r2_Random_Forest)


Mean Squared Error: 0.9883945365415557
R-squared Value: -0.22495880543957436


Next, we use a KNN model with a predefined k (k = 5).

In [24]:
from sklearn.neighbors import KNeighborsRegressor

# Initialize the k-Nearest Neighbors Regressor model with a chosen value for 'k'
k = 5
model_KNN = KNeighborsRegressor(n_neighbors=k)

# Fit the model to the training data
model_KNN.fit(X_train, y_train)


In [29]:
# Make predictions using the test set
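y_pred_KNN = model_KNN.predict(X_test)

# Calculate the mean squared error and R-squared value
mse_KNN = mean_squared_error(y_test, y_pred_KNN)
r2_KNN = r2_score(y_test, y_pred_KNN)

print("Mean Squared Error:", mse_KNN)
print("R-squared Value:", r2_KNN)


Note: the output originally recorded for this cell (Mean Squared Error 0.8507658977810996, R-squared -0.05438986085577868) is identical to the linear regression result, because the predictions were taken from `model` instead of `model_KNN`. Re-run the cell to obtain the scores for KNN with k = 5.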


Here, we use cross-validation to find the best 'k' value for the KNN regressor.

In [26]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for 'k'
param_grid = {'n_neighbors': list(range(1, 31))}

# Initialize the KNN Regressor model
knn = KNeighborsRegressor()

# Set up the GridSearchCV
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Find the best 'k' value
best_k = grid_search.best_params_['n_neighbors']
print("Best k value:", best_k)

# Use the best KNN model
best_knn = grid_search.best_estimator_


Best k value: 30
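
The selected value, k = 30, is the largest value in the search grid, so it can help to look at the cross-validated error for every k rather than only the winner. The sketch below shows one way to tabulate `cv_results_`. It assumes synthetic noise data, and the names `X_noise`, `y_noise`, `search`, and `cv_table` are hypothetical, so the printed numbers will not match the task dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data with no built-in relationship between features and target.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(56, 3))
y_noise = rng.normal(size=56)

search = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": list(range(1, 31))},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_noise, y_noise)

# Convert the negated scores back to MSE and tabulate them per k.
cv_table = pd.DataFrame({
    "k": [params["n_neighbors"] for params in search.cv_results_["params"]],
    "mean_cv_mse": -search.cv_results_["mean_test_score"],
}).sort_values("k")

print(cv_table.to_string(index=False))
print("Best k:", search.best_params_["n_neighbors"])
```

If the error keeps falling up to the edge of the grid, the features may carry little signal. In that case a larger k mainly moves the prediction toward the average of the training targets.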


In [30]:
# Make predictions using the test set
y_pred_best_knn = best_knn.predict(X_test)

# Calculate the mean squared error and R-squared value
mse_best_knn = mean_squared_error(y_test, y_pred_best_knn)
r2_best_knn = r2_score(y_test, y_pred_best_knn)

print("Mean Squared Error:", mse_best_knn)
print("R-squared Value:", r2_best_knn)


Mean Squared Error: 0.843922165272763
R-squared Value: -0.04590813611102895


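As we can see, the KNN model with k = 30 has the lowest test error of the models compared (MSE 0.844, versus 0.851 for linear regression and 0.988 for the random forest). However, every R-squared value is negative, which means that none of the models predicts the output file size better than a constant equal to the test-set mean. The selected k = 30 also sits at the upper edge of the search grid, which suggests that the tuned KNN is approaching a plain average of the training targets rather than exploiting the features.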
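
A single 80/20 split on a small dataset gives noisy scores, so a cross-validated comparison that includes a mean-only baseline can make the ranking easier to trust. The following is a minimal sketch under stated assumptions: the data is synthetic, and the names `X_syn`, `y_syn`, `candidates`, and `cv_mse` are hypothetical, so the printed values say nothing about the task dataset itself.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in data: 70 rows, 3 features, unrelated target.
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(70, 3))
y_syn = rng.normal(size=70)

candidates = {
    "mean baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "KNN (k=30)": KNeighborsRegressor(n_neighbors=30),
}

# 5-fold cross-validated MSE for each candidate (lower is better).
cv_mse = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_syn, y_syn, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()
    print(f"{name}: {cv_mse[name]:.3f}")
```

A model that does not clearly beat the mean baseline in this comparison is probably not learning a usable relationship from the features.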