# Machine Learning

Machine learning is applied here with caution: a dataset this small (81 provinces) is neither a typical nor a particularly favorable setting for machine learning, so the results should be read with that limitation in mind.

Because the dataset is small, a regression model is a suitable choice for this project: such models are fairly simple and remain stable on small datasets. Specifically, Logistic Regression is used to model the probability that a province is coastal, encoded as a binary variable (1 = coastal, 0 = inland), based on the given well-being indicators.
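For reference, the standard form of a logistic regression model can be written as below. The symbols are generic notation and not values from this notebook: $x_1, \dots, x_k$ stand for the well-being indicators and $\beta_0, \dots, \beta_k$ for the fitted intercept and coefficients.

```latex
P(\text{coastal} = 1 \mid x_1, \dots, x_k)
  = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

With scikit-learn's default behavior, `predict` assigns the class with the higher estimated probability, which for a binary target corresponds to a 0.5 threshold.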

## Import Libraries and Load the Dataset

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

import pandas as pd
import numpy as np

from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv("/content/drive/MyDrive/DSA210 Project/data/merged_data.csv", sep=";")



## Select Target and Features

Target: coastal status (1 = coastal, 0 = inland).
The model estimates the probability that a province is coastal from its well-being indicators, which serve as the features.

In [None]:
# Target (Y)
y = df['coast_stat']

features = [
    "happiness_score_percent",
    "employment_rate_percent",
    "average_years_in_school",
    "satisfied_with_education_percent",
    "hospital_beds_per_100k",
    "satisfied_with_healthcare_percent",
]

x = df[features]

## Split Train-Test Samples

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.2,    # hold out 20% of the provinces for testing
    random_state=42,  # fixed seed so the split is reproducible
    stratify=y        # keep the coastal/inland ratio the same in both splits
)

print("Train samples:", x_train.shape[0])
print("Test samples:", x_test.shape[0])
print(f"Total: {len(x)} (Provinces)")


Train samples: 64
Test samples: 17
Total: 81 (Provinces)
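To illustrate what `stratify=y` does, the following minimal sketch splits a made-up label vector of the same size (81 rows). The class counts (28 coastal, 53 inland) and the names `X_demo` and `y_demo` are assumptions for illustration only; they are not read from the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 28 coastal (1) and 53 inland (0) rows. These counts are
# assumed for illustration and are not read from merged_data.csv.
y_demo = np.array([1] * 28 + [0] * 53)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify set, each split keeps roughly the same coastal share
# as the full label vector.
print("Overall coastal share:", round(y_demo.mean(), 3))
print("Train coastal share:  ", round(y_tr.mean(), 3))
print("Test coastal share:   ", round(y_te.mean(), 3))
print("Split sizes:", len(y_tr), len(y_te))
```

Without stratification, a test set of only 17 rows could easily end up with very few coastal provinces, which would make the per-class metrics even less reliable.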


## Scaling

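Standardize the features so that they are comparable, since the columns differ widely in units and scale. The scaler is fit on the training set only and then applied to the test set, so no information from the test set leaks into training.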

In [None]:
scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
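As a small sketch of what the scaling step computes, the example below checks that `StandardScaler` applies z = (x - mean) / std using the training statistics. The arrays `train_demo` and `test_demo` hold made-up values (a percentage and a per-100k count) and are not taken from the project data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two indicators on very different scales;
# these numbers are illustrative and not from merged_data.csv.
train_demo = np.array([[45.0, 210.0], [50.0, 320.0], [55.0, 250.0], [60.0, 400.0]])
test_demo = np.array([[52.0, 300.0]])

# Fit on the training rows only, as in the cell above.
scaler_demo = StandardScaler().fit(train_demo)

# Manual z-score using the training mean and (population) standard deviation.
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)

print("StandardScaler:", scaler_demo.transform(test_demo))
print("Manual z-score:", manual)
```

Both lines should print the same values, which shows that the test rows are scaled with the training mean and standard deviation rather than their own.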

## Model Training and Evaluation

In [None]:
# Train the Model #

model = LogisticRegression(max_iter=1000)
model.fit(x_train_scaled, y_train)

# Evaluate #

y_pred = model.predict(x_test_scaled)

print(classification_report(y_test, y_pred))
print("\n")

# Confusion matrix: rows are true classes, columns are predicted classes #
print("Confusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.77      0.91      0.83        11
           1       0.75      0.50      0.60         6

    accuracy                           0.76        17
   macro avg       0.76      0.70      0.72        17
weighted avg       0.76      0.76      0.75        17



Confusion Matrix:

[[10  1]
 [ 3  3]]


### Evaluation
The model correctly predicted the coastal status of 76% of the test provinces (13 of 17). It performed better on inland provinces (recall 10/11 ≈ 0.91) than on coastal provinces (recall 3/6 = 0.50). This is expected given the small dataset and the fact that the coastal subset is even smaller than the inland one.

Important note: this test split later turns out to be slightly optimistic, as explained in the Cross-Validation section.

### Confusion Matrix Results

##### True Negative: 10
(Inland province predicted as inland)
##### False Positive: 1
(Inland province predicted as coastal)
##### True Positive: 3
(Coastal province predicted as coastal)
##### False Negative: 3
(Coastal province predicted as inland)
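The figures in the classification report can be reproduced by hand from these four counts. The short sketch below does the arithmetic in plain Python; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above.
tn, fp, fn, tp = 10, 1, 3, 3

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 13 / 17
precision_coastal = tp / (tp + fp)          # 3 / 4
recall_coastal = tp / (tp + fn)             # 3 / 6
precision_inland = tn / (tn + fn)           # 10 / 13
recall_inland = tn / (tn + fp)              # 10 / 11

print(f"Accuracy:          {accuracy:.2f}")
print(f"Precision coastal: {precision_coastal:.2f}")
print(f"Recall coastal:    {recall_coastal:.2f}")
print(f"Precision inland:  {precision_inland:.2f}")
print(f"Recall inland:     {recall_inland:.2f}")
```

With only six coastal provinces in the test set, a single additional correct or incorrect prediction shifts the coastal recall by about 17 percentage points.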

### Cross-Validation

#### Resample to maximize stability for small dataset

In [None]:
# 5-fold Cross Validation #
cv = cross_val_score(
    model,
    x_train_scaled,
    y_train,
    cv=5,
    scoring="accuracy"
)

print("Cross-validation scores:", cv)
print("Mean CV accuracy:", cv.mean())


Cross-validation scores: [0.76923077 0.84615385 0.61538462 0.53846154 0.66666667]
Mean CV accuracy: 0.6871794871794872


These scores show that model performance varies considerably across folds (from about 0.54 to 0.85), which is again expected with such a small dataset. The test split used in the Model Training and Evaluation section was also a fairly favorable one: its accuracy of 76% is higher than the cross-validation mean of about 69%.
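In the cell above, the scaler was fit on all training rows before cross-validation, so each validation fold also contributed to the scaling statistics. One possible variant, sketched below, wraps the scaler and the model in a pipeline so that scaling is refit inside every fold. The data here is synthetic (generated with `make_classification` as a stand-in for 64 rows and 6 indicators), and the names `pipe`, `folds`, `X_demo` and `y_demo` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 64 training rows and 6 indicators (assumption).
X_demo, y_demo = make_classification(
    n_samples=64, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

# The pipeline refits the scaler on the training part of every fold,
# so the validation part never influences the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio similar in each fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(pipe, X_demo, y_demo, cv=folds, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean:", round(scores.mean(), 3), "Std:", round(scores.std(), 3))
```

On the project data, the unscaled `x_train` and `y_train` would take the place of the synthetic arrays. Reporting the standard deviation next to the mean makes the fold-to-fold variation explicit.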

### Logistic Regression Coefficients

In [None]:
coefficients = pd.DataFrame({
    "Feature": features,
    "Coefficient": model.coef_[0]
})

coefficients.sort_values(by="Coefficient", ascending=False)

                             Feature  Coefficient
1            employment_rate_percent     1.033262
2            average_years_in_school     0.242848
3   satisfied_with_education_percent     0.046255
5  satisfied_with_healthcare_percent     0.022090
4             hospital_beds_per_100k    -0.146382
0            happiness_score_percent    -0.603151


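A positive coefficient means that higher values of the feature push the model towards classifying a province as coastal (1); a negative coefficient pushes it towards inland (0). Because the features were standardized, the magnitudes are comparable across features.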

The Logistic Regression model suggests a substantial positive association between employment rate and coastal status. Average years in school also moderately pushes the model towards classifying provinces as coastal rather than inland. An interesting takeaway from the negative happiness score coefficient, with a magnitude of about 0.6, is that higher happiness is associated with inland rather than coastal provinces.

These results are consistent with the EDA and Hypothesis Test results.
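One common way to read logistic regression coefficients is to exponentiate them into odds ratios. The sketch below does this for the values in the table above; the dictionary `coefs` and the helper name `odds_ratios` are introduced here for illustration only.

```python
import math

# Coefficients copied from the table above (fitted on standardized features).
coefs = {
    "employment_rate_percent": 1.033262,
    "average_years_in_school": 0.242848,
    "satisfied_with_education_percent": 0.046255,
    "satisfied_with_healthcare_percent": 0.02209,
    "hospital_beds_per_100k": -0.146382,
    "happiness_score_percent": -0.603151,
}

# exp(coefficient) is the multiplicative change in the odds of "coastal"
# for a one-standard-deviation increase in the feature, others held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:35s} {ratio:.2f}")
```

An odds ratio above 1 increases the odds of a coastal prediction and one below 1 decreases them. Given the small sample, these values are best treated as descriptive rather than as precise effect sizes.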

## Summary

The coefficients point to the same patterns as the EDA and hypothesis testing, which suggests the model picked up the main relationships in the data. However, the cross-validation accuracy is concerning: the model cannot consistently predict whether a province is coastal based on the given indicators, which was again expected given the small dataset.

In summary, the model is more exploratory than predictive: it can find associations between variables but cannot classify provinces correctly on a consistent basis.

As a final note, my own takeaway is that the model exceeded my very low expectations, which were due to the dataset size, by producing some fairly useful results that are in line with the EDA findings.