### TASK 1
Load the data from Lab 2 and convert the answers in the answers DataFrame to the numerical equivalents taken from the questions DataFrame. Do the next steps with the numerical data only.

#### Reading data from file
First we read both files. From `questions.csv` we build a map that returns the weight for a given (question, answer) pair. We then apply that map to every column of `responses.csv`, replacing each answer with its corresponding weight.

In [1]:
import pandas as pd

qf = pd.read_csv('questions.csv', keep_default_na=False)
rf = pd.read_csv('responses.csv', keep_default_na=False)

rf.fillna('None', inplace=True)
qf.fillna('None', inplace=True)

# Create a map from question and answer to weight
question_answer_to_weight_map = {(row['Question'], row['Answer']): row['Answer weight'] for _, row in qf.iterrows()}

# Iterate through the responses and replace the answers with weights
for question in rf.columns:
    rf[question] = rf[question].map(lambda answer: question_answer_to_weight_map.get((question, answer), answer))
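
The mapping step can be illustrated on a toy example. This is only a sketch: the two small DataFrames below are made-up stand-ins for `questions.csv` and `responses.csv`, and the names `questions`, `responses`, `weights` and `converted` are hypothetical. It also shows that an answer with no entry in the map is left unchanged, so the result may still contain strings.

```python
import pandas as pd

# Toy stand-ins for questions.csv and responses.csv (hypothetical data)
questions = pd.DataFrame({
    "Question": ["Q1", "Q1", "Q2", "Q2"],
    "Answer": ["Yes", "No", "Low", "High"],
    "Answer weight": [1, 0, 0, 2],
})
responses = pd.DataFrame({
    "Q1": ["Yes", "No", "Yes"],
    "Q2": ["High", "Low", "Unknown"],
})

# Map (question, answer) pairs to their weight
weights = {
    (q, a): w
    for q, a, w in zip(
        questions["Question"], questions["Answer"], questions["Answer weight"]
    )
}

# Replace every answer with its weight; unknown answers are kept as they are
converted = responses.copy()
for question in converted.columns:
    converted[question] = converted[question].map(
        lambda answer, q=question: weights.get((q, answer), answer)
    )

print(converted)
```

Checking the converted columns for leftover strings before training is a cheap way to catch answers that did not match any row of the questions file.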

### TASK 2
Split the data for training and validation (80%) and testing (20%) randomly (so that you have a roughly equal representation in all parts).

#### Splitting data for training and validation
We use `train_test_split` from `sklearn` to split the dataset randomly into train and test sets. We save both sets to files so that later runs reuse the same split instead of re-randomizing it.

In [2]:
from sklearn.model_selection import train_test_split
from os import path

# Define file names
TRAIN_FILE_NAME = 'train.csv'
TEST_FILE_NAME = 'test.csv'

# Read train and test data from files if they exist
if path.exists(TRAIN_FILE_NAME) and path.exists(TEST_FILE_NAME):
    train = pd.read_csv(TRAIN_FILE_NAME)
    test = pd.read_csv(TEST_FILE_NAME)
    print("Using saved train and test datasets!")
else:
    # Split the data into two datasets, one for training and one for testing
    train, test = train_test_split(rf, test_size=0.2, random_state=1618)

    # Save train dataset (index=False avoids an extra 'Unnamed: 0' column on reload)
    train.to_csv(TRAIN_FILE_NAME, index=False)

    # Save test dataset
    test.to_csv(TEST_FILE_NAME, index=False)
    print("Created and saved new train and test datasets!")


Using saved train and test datasets!
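
Two properties of this caching approach are worth checking, sketched here on a made-up DataFrame (`toy` and the helper `round_trip` are hypothetical names). First, a fixed `random_state` already makes `train_test_split` reproducible. Second, whether the index is written by `to_csv` decides which columns come back from `read_csv`, which matters because every column other than the dropped ones is later used as a feature.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the responses DataFrame
toy = pd.DataFrame({"q1": range(10), "q2": range(10, 20)})

# The same random_state yields the same split on every call
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=1618)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=1618)
same_split = train_a.index.equals(train_b.index)


def round_trip(df, **to_csv_kwargs):
    """Write df to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    df.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(train_a)                  # index written as a column
without_index = round_trip(train_a, index=False)  # index not written

print(same_split)
print(list(with_index.columns))
print(list(without_index.columns))
```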


### TASK 3
Check how well you can predict a student's level of machine learning experience (Question 4) from his/her other variables (maybe except the year). Use a machine learning method of your choice. You can approach this both as regression (output is a number) or classification task. For this, you should train your ML model on the training data and report the accuracy of your choice on the testing data. You can also use (cross-)validation to validate different (versions of) ML methods to choose the best. Remember not to use the testing data for validation!

#### Import required libraries
First we import the required libraries.

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

#### Define parameters
Define the parameters that will be used in training.

In [4]:

# Define training parameters
PREDICTION_QUESTION = 'What is your level of machine learning (/deep learning) experience?'
TRAIN_COLUMNS_DROP = ['Year', PREDICTION_QUESTION]
# Training Kernel
# Possible values 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'
KERNEL = 'rbf'

#### Create train and test datasets
We create the feature and target datasets. For the features we drop the target and year columns; for the target we keep only the prediction question column. We then split the training data further into training and validation subsets, so that the model can be validated without touching the held-out test set. In the code, the validation subset is named `X_test` and `y_test`.

In [5]:

# Create training dataset by removing target column and other not useful columns
X = train.drop(columns=TRAIN_COLUMNS_DROP)

# Create training target dataset
y = train[[PREDICTION_QUESTION]]

# Split the training data into training and validation subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Training
We train a [support vector machine](https://scikit-learn.org/stable/modules/svm.html) classifier on the training features and target.

In [6]:
# Train the model
svm_model = SVC(kernel=KERNEL)
svm_model.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


#### Validation
We evaluate the model on the validation data to estimate its accuracy: we first make predictions on the validation set and then compare them with the true values.

##### Accuracy
Accuracy shows what percentage of our predictions were correct.

##### Confusion matrix
The confusion matrix shows how many predictions are correct and incorrect per class.

##### Classification report
Classification report displays a table with precision, recall, F1, and support scores for the model.

**Precision** - fraction of the predictions for a class that were correct.

**Recall** - fraction of the actual members of a class that were correctly identified.

**F1-score** - harmonic mean of precision and recall.

**Support** - the number of actual occurrences of the class.

In [7]:

# Make predictions on the validation dataset
y_pred = svm_model.predict(X_test)

# Evaluate the model
class_report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Output the results
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)


Model Accuracy: 50.00%
Confusion Matrix:
[[ 0  1  0  1  0]
 [ 0  7  0  8  0]
 [ 0  0  0  3  0]
 [ 0  7  0 14  0]
 [ 0  0  0  1  0]]
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.47      0.47      0.47        15
           2       0.00      0.00      0.00         3
           3       0.52      0.67      0.58        21
           4       0.00      0.00      0.00         1

    accuracy                           0.50        42
   macro avg       0.20      0.23      0.21        42
weighted avg       0.43      0.50      0.46        42



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
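
The figures in the report can be reproduced by hand from the confusion matrix printed above (rows are true classes, columns are predicted classes). The following is a sketch using NumPy only; the variable names are illustrative. Classes that were never predicted have an undefined precision (0/0), which is set to 0 here, matching what the report shows.

```python
import numpy as np

# Confusion matrix copied from the SVM validation output
conf = np.array([
    [0, 1, 0, 1, 0],
    [0, 7, 0, 8, 0],
    [0, 0, 0, 3, 0],
    [0, 7, 0, 14, 0],
    [0, 0, 0, 1, 0],
])

true_positives = np.diag(conf)          # correct predictions per class
predicted_per_class = conf.sum(axis=0)  # column sums: times each class was predicted
support = conf.sum(axis=1)              # row sums: actual occurrences of each class

accuracy = true_positives.sum() / conf.sum()

# Precision is 0/0 for classes that were never predicted; use 0 in that case
precision = np.divide(
    true_positives, predicted_per_class,
    out=np.zeros(len(conf)), where=predicted_per_class > 0,
)
recall = true_positives / support

# F1 is the harmonic mean of precision and recall (0 when both are 0)
denominator = precision + recall
f1 = np.divide(
    2 * precision * recall, denominator,
    out=np.zeros(len(conf)), where=denominator > 0,
)

print(f"accuracy: {accuracy:.2f}")
for label in range(len(conf)):
    print(label, f"{precision[label]:.2f}", f"{recall[label]:.2f}",
          f"{f1[label]:.2f}", support[label])
```

For class 3 this gives precision 14/27, recall 14/21 and an F1-score of about 0.58, in line with the report.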


#### Cross validation
We can use cross-validation to train several candidate models and keep the one with the best mean accuracy across folds.

##### Defining parameters
Define parameters for cross validation.

In [8]:
FOLDS = 5
PARAMS = {
    'C': [0.1, 1, 10, 100],       # Regularization parameter
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],   # Kernel type
    # 'gamma': ['scale', 'auto'],    # Kernel coefficient
    'degree': [2, 3, 4]            # Only used for 'poly' kernel
}
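
Because `degree` only affects the `poly` kernel, a single flat grid fits every other kernel several times with effectively identical settings. One possible alternative, shown only as a sketch, is to pass a list of dictionaries so that `degree` is varied for `poly` alone. `ParameterGrid` is used here just to count the combinations; the names `flat_grid` and `split_grid` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Same values as PARAMS: every kernel is combined with every degree
flat_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": [2, 3, 4],
}

# Alternative: vary degree only for the polynomial kernel
split_grid = [
    {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf", "sigmoid"]},
    {"C": [0.1, 1, 10, 100], "kernel": ["poly"], "degree": [2, 3, 4]},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
print(n_flat, n_split)
```

`GridSearchCV` accepts a list of dictionaries as its parameter grid, so the second form can be passed to it directly and covers the same distinct models with half as many fits.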

##### Grid search
We run a grid search over the parameter combinations to find the best-scoring model.

In [9]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(SVC(), PARAMS, cv=FOLDS, scoring='accuracy')
grid_search.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)

##### Models
We take the best model and its parameters from the grid search.

In [10]:
best_svm_params = grid_search.best_params_
best_svm_model = grid_search.best_estimator_
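
Besides `best_params_` and `best_estimator_`, a fitted `GridSearchCV` also exposes `best_score_` (the mean cross-validated score of the best candidate) and `cv_results_` (scores for every candidate). The sketch below shows how these could be inspected. It runs on synthetic data from `make_classification`, not on the survey data, and all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the survey responses)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=8, n_informative=4, random_state=0
)

search = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the winning parameter combination
print(search.best_params_, f"{search.best_score_:.3f}")

# All candidates ranked by mean accuracy across the folds
ranked = sorted(
    zip(search.cv_results_["mean_test_score"], search.cv_results_["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
for mean_score, params in ranked:
    print(f"{mean_score:.3f}", params)
```

Looking at the full ranking shows how close the runner-up candidates are, which helps judge whether the chosen parameters are clearly better or only marginally so.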

##### Validation
We validate the best model found by the grid search.

In [11]:
y_pred = best_svm_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Best Params: {best_svm_params}")
print(f"Best Model Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Best Params: {'C': 100, 'degree': 2, 'kernel': 'linear'}
Best Model Accuracy: 54.76%
Confusion Matrix:
[[ 0  2  0  0  0]
 [ 0 10  0  4  1]
 [ 0  1  0  0  2]
 [ 0  6  1 12  2]
 [ 0  0  0  0  1]]
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.53      0.67      0.59        15
           2       0.00      0.00      0.00         3
           3       0.75      0.57      0.65        21
           4       0.17      1.00      0.29         1

    accuracy                           0.55        42
   macro avg       0.29      0.45      0.30        42
weighted avg       0.57      0.55      0.54        42



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
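
To put these validation accuracies in context, they can be compared with a majority-class baseline. The sketch below rebuilds the validation labels from the support column of the report (2, 15, 3, 21, 1) and scores a `DummyClassifier` that always predicts the most frequent class. Fitting the baseline on the same labels it is scored on is a simplification for illustration; normally it would be fitted on the training labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Support per class, copied from the validation classification report
support = {0: 2, 1: 15, 2: 3, 3: 21, 4: 1}

# Rebuild a label vector with those class counts
y_val = np.repeat(list(support.keys()), list(support.values()))

# The features are irrelevant for this baseline, so a dummy column suffices
X_dummy = np.zeros((len(y_val), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_val)
baseline_accuracy = baseline.score(X_dummy, y_val)

print(f"Majority-class baseline accuracy: {baseline_accuracy * 100:.2f}%")
```

Always predicting class 3 is correct for 21 of the 42 validation samples, so a classifier needs to exceed 50% on this split to improve on the baseline.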


#### Other methods
Here I tried training and validating models using other ML methods.

##### Linear regression

Fits a linear model with coefficients $w = (w_1, ..., w_p)$ to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation. [source](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares).

MSE - Mean squared error (lower is better)

MAE - Mean absolute error (lower is better)

$R^2$ - indicates the portion of variance in the target that is predictable (higher is better).

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Train the Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Validate the model
y_pred = linear_reg.predict(X_test)

# Get model metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R-squared (R2 Score): {r2}")

Mean Squared Error (MSE): 0.7483275673309363
Mean Absolute Error (MAE): 0.7426320298575062
R-squared (R2 Score): 0.33997508561411405
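
As a sanity check on what these three metrics measure, they can be computed by hand on a small made-up example and compared with the scikit-learn functions. The arrays `y_true` and `y_hat` are hypothetical values, not taken from the survey data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 2.0, 5.0])

residuals = y_true - y_hat

# MSE: mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# MAE: mean of the absolute residuals
mae_manual = np.mean(np.abs(residuals))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

An $R^2$ of 0 corresponds to always predicting the mean of the targets, and a negative value means the model does worse than that.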


##### Decision trees
We can use decision trees to train a model.

In [22]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Train the decision tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

# Make predictions on the test set
y_pred = tree.predict(X_test)

# Get model metrics
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)

Accuracy: 52.38%
Confusion Matrix:
[[ 0  1  1  0  0  0]
 [ 0 10  3  2  0  0]
 [ 0  0  1  2  0  0]
 [ 0  2  5 11  2  1]
 [ 0  0  0  1  0  0]
 [ 0  0  0  0  0  0]]
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.77      0.67      0.71        15
           2       0.10      0.33      0.15         3
           3       0.69      0.52      0.59        21
           4       0.00      0.00      0.00         1
           5       0.00      0.00      0.00         0

    accuracy                           0.52        42
   macro avg       0.26      0.25      0.24        42
weighted avg       0.63      0.52      0.56        42



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


##### k-Nearest neighbors

There are two types of models we can train with k-nearest neighbors: classification or regression.

###### Regression
We make a regression model using KNN and test it.

In [23]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Train the K nearest neighbors model
knn_reg = KNeighborsRegressor(n_neighbors=17)
knn_reg.fit(X_train, y_train)

# Make predictions
y_pred = knn_reg.predict(X_test)

# Get model metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R-squared (R2 Score): {r2}")

Mean Squared Error (MSE): 1.1082550667325752
Mean Absolute Error (MAE): 0.9355742296918766
R-squared (R2 Score): 0.022519031141868373


All three scores are worse than those of our linear regression model (MSE 1.11 vs. 0.75, MAE 0.94 vs. 0.74, $R^2$ 0.02 vs. 0.34).

###### Classification
We make a classification model using KNN and we test it.

In [24]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Train the K nearest neighbors model
knn_class = KNeighborsClassifier(n_neighbors=41)
knn_class.fit(X_train, y_train)

# Make predictions
y_pred = knn_class.predict(X_test)

# Evaluate the model
class_report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

# Output the results
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)
print("Classification Report:")
print(class_report)


Model Accuracy: 52.38%
Confusion Matrix:
[[ 0  2  0  0  0]
 [ 0  8  0  7  0]
 [ 0  1  0  2  0]
 [ 0  7  0 14  0]
 [ 0  0  0  1  0]]
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.44      0.53      0.48        15
           2       0.00      0.00      0.00         3
           3       0.58      0.67      0.62        21
           4       0.00      0.00      0.00         1

    accuracy                           0.52        42
   macro avg       0.21      0.24      0.22        42
weighted avg       0.45      0.52      0.48        42



  return self._fit(X, y)
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


This model's validation accuracy (52.38%) is below that of the grid-searched SVM (54.76%) but above that of the default RBF SVM (50.00%).
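
The number of neighbors was fixed by hand in the KNN models. One way it could be chosen instead is by cross-validation on the training data, sketched below. The data comes from `make_classification` rather than the survey, and the candidate list `candidate_ks` is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the survey responses)
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=4, random_state=0
)

# Hypothetical candidate values for n_neighbors
candidate_ks = [1, 5, 9, 17, 25, 41]

# Mean 5-fold cross-validated accuracy for each candidate
mean_scores = {}
for k in candidate_ks:
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k),
        X_demo, y_demo, cv=5, scoring="accuracy",
    )
    mean_scores[k] = scores.mean()

# Keep the candidate with the highest mean accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, f"{mean_scores[best_k]:.3f}")
```

Selecting `n_neighbors` this way uses only training data, so the held-out test set stays untouched until the final evaluation.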

#### Testing
Here we evaluate our models against the held-out testing data, since validation was done on a subset of the training data.

In [25]:
# Test input data
X = test.drop(columns=TRAIN_COLUMNS_DROP)

# Test target data
y = test[[PREDICTION_QUESTION]]

# SVM
svm_pred = svm_model.predict(X)
svm_best_pred = best_svm_model.predict(X)

svm_accuracy = accuracy_score(y, svm_pred)
svm_best_accuracy = accuracy_score(y, svm_best_pred)
print(f"Support Vector Machine:")
print(f"Accuracy: {svm_accuracy*100:.2f}%")
print(f"Best model accuracy: {svm_best_accuracy*100:.2f}%\n")

# Decision tree
tree_pred = tree.predict(X)

tree_accuracy = accuracy_score(y, tree_pred)

print(f"Decision tree")
print(f"Accuracy: {tree_accuracy*100:.2f}%\n")

# KNN classification
knn_class_pred = knn_class.predict(X)

knn_class_accuracy = accuracy_score(y, knn_class_pred)

print(f"KNN classification")
print(f"Accuracy: {knn_class_accuracy*100:.2f}%\n")

# Linear regression
linear_pred = linear_reg.predict(X)

linear_mse = mean_squared_error(y, linear_pred)
linear_mae = mean_absolute_error(y, linear_pred)
linear_r2 = r2_score(y, linear_pred)

print(f"Linear regression")
print(f"MSE: {linear_mse}")
print(f"MAE: {linear_mae}")
print(f"R2: {linear_r2}\n")

# KNN regression
knn_reg_pred = knn_reg.predict(X)

knn_reg_mse = mean_squared_error(y, knn_reg_pred)
knn_reg_mae = mean_absolute_error(y, knn_reg_pred)
knn_reg_r2 = r2_score(y, knn_reg_pred)

print(f"KNN regression")
print(f"MSE: {knn_reg_mse}")
print(f"MAE: {knn_reg_mae}")
print(f"R2: {knn_reg_r2}\n")

Support Vector Machine:
Accuracy: 42.31%
Best model accuracy: 32.69%

Decision tree
Accuracy: 30.77%

KNN classification
Accuracy: 44.23%

Linear regression
MSE: 1.4086013911329283
MAE: 0.9160405621514113
R2: 0.026365500607505532

KNN regression
MSE: 1.46513175405909
MAE: 0.9864253393665159
R2: -0.0127086561799028



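On the test data, the best classifier was KNN (44.23% accuracy) and the best regressor was linear regression (MSE 1.41 vs. 1.47 for KNN regression). The grid-searched SVM, which scored highest on validation (54.76%), dropped to 32.69% on the test set, so its validation performance did not carry over. Results may change depending on the data chosen for training and testing.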