Importing libraries

NumPy: utilized for high-performance vector arithmetic.

Pandas: utilized for dataset management and cleaning.

Matplotlib: utilized for generating graphical visualizations.

LogisticRegressor: a simple custom-built class implementing the logistic regression logic.

In [49]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sys
import os

# Add the parent directory to the import path so the `src` package can be found
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from src.logistic_regressor import LogisticRegressor

For this introduction, I chose a simple dataset whose target is itself a probability.
The Floods dataset fits this criterion well and is ideal for demonstrating the basics.
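
The `LogisticRegressor` class imported above lives in `src/logistic_regressor.py` and is not shown in this notebook. For readers who want a picture of what such a class could look like, here is a minimal sketch. Everything in it is an assumption: the class name `SketchLogisticRegressor`, the gradient-descent training loop, the default parameter values, and the choice to have `predict` return probabilities rather than class labels. Only the `fit(..., show_progress, learning_rate, n_epochs)` and `predict` call shapes are taken from how the notebook uses the real class.

```python
import numpy as np


class SketchLogisticRegressor:
    """Hypothetical stand-in for src.logistic_regressor.LogisticRegressor."""

    def fit(self, X, y, learning_rate=0.1, n_epochs=1000, show_progress=False):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        for epoch in range(n_epochs):
            p = self._sigmoid(X @ self.weights + self.bias)
            # Gradient of the mean cross-entropy loss; it also works when the
            # targets are probabilities in [0, 1] rather than hard 0/1 labels
            error = p - y
            self.weights -= learning_rate * (X.T @ error) / n_samples
            self.bias -= learning_rate * error.mean()
            if show_progress and epoch % 100 == 0:
                eps = 1e-12  # avoids log(0)
                loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
                print(f"epoch {epoch}: loss {loss:.6f}")
        return self

    def predict(self, X):
        # Assumption: predictions are probabilities, not thresholded labels
        return self._sigmoid(X @ self.weights + self.bias)

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))


# Usage on synthetic data (not the Floods dataset): the targets are generated
# from a known sigmoid, so the sketch should be able to recover them closely.
rng = np.random.default_rng(0)
X_demo = rng.random((500, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_demo = 1.0 / (1.0 + np.exp(-(X_demo @ true_w + 0.2)))

demo_model = SketchLogisticRegressor().fit(X_demo, y_demo, learning_rate=1.0, n_epochs=5000)
demo_pred = demo_model.predict(X_demo)
demo_mae = np.mean(np.abs(demo_pred - y_demo))
print(f"MAE on synthetic data: {demo_mae:.4f}")
```

Because the target is a probability, the sigmoid output can be compared with it directly, which is why a logistic model is a natural fit for this kind of dataset.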

In [None]:
# --- 1. Load Data ---
FILE_PATH = '../datasets/floods.csv'

try:
    df = pd.read_csv(FILE_PATH)
except FileNotFoundError:
    print(f"Error: The file at {FILE_PATH} was not found.")
    raise  # re-raise so the cells below do not run without `df`


# --- 2. Data Cleaning ---
# No cleaning is needed: this dataset is already clean.


# --- 3. Analysis ---
# Correlation of each feature with the target (simple pairwise relationship)
correlations = df.corr()['FloodProbability'].sort_values(ascending=False).drop('FloodProbability')
print("A table of features correlations with flood probability:")
print(correlations)

# No single feature correlates strongly with flood probability (all values are about 0.22).
# This means we will need to use as many features as we can to predict flood probability accurately.

A table of features correlations with flood probability:
DeterioratingInfrastructure        0.229444
TopographyDrainage                 0.229414
RiverManagement                    0.228917
Watersheds                         0.228152
DamsQuality                        0.227467
PopulationScore                    0.226928
Siltation                          0.226544
IneffectiveDisasterPreparedness    0.225126
PoliticalFactors                   0.225009
MonsoonIntensity                   0.224081
WetlandLoss                        0.223732
InadequatePlanning                 0.223329
Landslides                         0.222991
AgriculturalPractices              0.221846
ClimateChange                      0.220986
Urbanization                       0.220867
Deforestation                      0.220237
Encroachments                      0.218259
DrainageSystems                    0.217895
CoastalVulnerability               0.215187
Name: FloodProbability, dtype: float64
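
One possible reading of this table: all 20 correlations sit close to 1/sqrt(20) ≈ 0.224, which is the value you would expect if the target were roughly an equal-weighted average of 20 independent features. The sketch below checks that expectation on synthetic data. The data generation (independent integer scores from 0 to 10, target equal to their plain average) is purely an assumption for illustration and says nothing definite about how the Floods dataset was built.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 50_000, 20

# Synthetic assumption: independent integer scores, target = their plain average
features = rng.integers(0, 11, size=(n_rows, n_features)).astype(float)
target = features.mean(axis=1)

# Pearson correlation of each feature column with the target
per_feature_corr = np.array(
    [np.corrcoef(features[:, j], target)[0, 1] for j in range(n_features)]
)
expected_corr = 1 / np.sqrt(n_features)

print(f"mean correlation: {per_feature_corr.mean():.3f}")
print(f"1/sqrt(20):       {expected_corr:.3f}")
```

If that reading holds, weak individual correlations do not mean the target is hard to predict. Each feature carries a small, equal share of the signal, and a model that combines all of them can still fit the target closely.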


In [None]:
# Separate the features from the target
X = df.drop('FloodProbability', axis=1).values
y = df['FloodProbability'].values

# Min-max scale every feature column to [0, 1]
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Shuffle reproducibly, then split 80% train / 20% test
np.random.seed(10)
indices = np.random.permutation(len(X))
split = int(len(X) * 0.8)

X_train, y_train = X[indices[:split]], y[indices[:split]]
X_test, y_test = X[indices[split:]], y[indices[split:]]

# Train the custom model
model = LogisticRegressor()
model.fit(X_train, y_train, show_progress=False, learning_rate=2, n_epochs=2000)

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Evaluate: mean absolute error and coefficient of determination (R-squared)
mae = np.mean(np.abs(y_pred - y_test))
r2 = 1 - (np.sum((y_test - y_pred)**2) / np.sum((y_test - np.mean(y_test))**2))

print(f"Mean Absolute Error: {mae:.4f}")
print(f"R-squared Score: {r2:.4f}")

Mean Absolute Error: 0.0002
R-squared Score: 0.9999
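
The cell above scales `X` with the minimum and maximum of the full dataset before splitting, so the test rows influence the scaling. A common alternative is to compute the scaling statistics from the training rows only and reuse them on the test rows. The sketch below shows that pattern on synthetic data. The helper names `min_max_fit` and `min_max_apply`, the synthetic array, and the constant-column guard are all assumptions added for illustration; they are not part of the notebook, and the results reported above were not produced this way.

```python
import numpy as np


def min_max_fit(X_train):
    """Return per-column minimum and range computed from the training rows only."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero
    col_range = np.where(col_range == 0, 1.0, col_range)
    return col_min, col_range


def min_max_apply(X, col_min, col_range):
    """Scale X with statistics obtained from min_max_fit."""
    return (X - col_min) / col_range


# Synthetic stand-in for the feature matrix (not the Floods dataset)
rng = np.random.default_rng(10)
X_raw = rng.integers(0, 17, size=(1000, 4)).astype(float)
X_raw[:, 3] = 5.0  # a constant column, to exercise the guard

# Same 80/20 shuffle-and-split pattern as in the notebook
indices = rng.permutation(len(X_raw))
split = int(len(X_raw) * 0.8)
X_train_raw, X_test_raw = X_raw[indices[:split]], X_raw[indices[split:]]

# Fit the scaling on the training rows, then apply it to both parts
col_min, col_range = min_max_fit(X_train_raw)
X_train_scaled = min_max_apply(X_train_raw, col_min, col_range)
X_test_scaled = min_max_apply(X_test_raw, col_min, col_range)

print("train range:", X_train_scaled.min(), "to", X_train_scaled.max())
```

With this approach the scaled test values can fall slightly outside [0, 1] if the test rows contain a value more extreme than anything in the training rows. On a large dataset with bounded integer scores the difference from full-dataset scaling is likely to be negligible, but the pattern keeps the test set fully unseen.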
