<a href="https://colab.research.google.com/github/basugautam/Reproducibility-Challenge-Project/blob/Architecture-Files/15_Constrained_Learning_framework_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# Mount Google Drive to access the file
from google.colab import drive
drive.mount('/content/drive')  # Mount Google Drive

# Import necessary libraries
import pandas as pd

# Provide the path to the file in Google Drive
file_path = '/content/drive/My Drive/timeseries_data.csv.csv'

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the data
df.head()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Dependency Ratio,Unnamed: 11,Unnamed: 12,Median Age,Unnamed: 14,Unnamed: 15
0,Name,GENC,Year,Total Population,Growth Rate,Population Density (per sq km),Total Fertility Rate,Life Expectancy at Birth,Under-5 Mortality Rate,Sex Ratio of the Population,Youth and Old Age (0-14 and 65+),Youth (0-14),Old Age (65+),Both Sexes,Male,Female
1,-> 2024,,,--,--,--,--,--,--,--,--,--,--,--,--,--
2,Canada,CA,2024,38904514,0.72,4.3,1.44,83.9,4.8,0.99,56.8,23.9,32.9,42.5,43.9,41.2
3,-> 2025,,,--,--,--,--,--,--,--,--,--,--,--,--,--
4,Canada,CA,2025,39187155,0.73,4.3,1.43,84.8,4.4,0.99,57.7,23.8,33.9,42.8,44.1,41.4

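The preview above suggests that the real column names sit in row 0 and that rows such as `-> 2024` are year separators filled with `--`. The following is a minimal sketch of one way to tidy such a table, assuming the rest of the file follows the same layout. `clean_population_table` is a hypothetical helper, and the sample is copied from the preview rather than read from Drive.

```python
import io

import pandas as pd

# Sample rows copied from the df.head() preview; the full file is assumed to share this layout.
SAMPLE_CSV = """\
Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Dependency Ratio,Unnamed: 11,Unnamed: 12,Median Age,Unnamed: 14,Unnamed: 15
0,Name,GENC,Year,Total Population,Growth Rate,Population Density (per sq km),Total Fertility Rate,Life Expectancy at Birth,Under-5 Mortality Rate,Sex Ratio of the Population,Youth and Old Age (0-14 and 65+),Youth (0-14),Old Age (65+),Both Sexes,Male,Female
1,-> 2024,,,--,--,--,--,--,--,--,--,--,--,--,--,--
2,Canada,CA,2024,38904514,0.72,4.3,1.44,83.9,4.8,0.99,56.8,23.9,32.9,42.5,43.9,41.2
3,-> 2025,,,--,--,--,--,--,--,--,--,--,--,--,--,--
4,Canada,CA,2025,39187155,0.73,4.3,1.43,84.8,4.4,0.99,57.7,23.8,33.9,42.8,44.1,41.4
"""


def clean_population_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: promote row 0 to the header and drop separator rows."""
    table = raw.drop(columns=raw.columns[0])   # drop the saved index column
    table.columns = table.iloc[0].tolist()     # row 0 holds the real column names
    table = table.iloc[1:]
    # Separator rows carry a label such as "-> 2024" in the Name column.
    is_separator = table["Name"].astype(str).str.startswith("->")
    table = table.loc[~is_separator].copy()
    # Every column except the two label columns is expected to be numeric.
    numeric_cols = [c for c in table.columns if c not in ("Name", "GENC")]
    table[numeric_cols] = table[numeric_cols].apply(pd.to_numeric, errors="coerce")
    return table.reset_index(drop=True)


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
clean = clean_population_table(sample)
print(clean[["Name", "Year", "Total Population", "Growth Rate"]])
```

Promoting row 0 discards the group labels `Dependency Ratio` and `Median Age` from the original header, so the `Both Sexes`, `Male`, and `Female` columns in the cleaned table refer to median age.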

In [4]:
# --- Explanation for Extracting and Reading Data (previous cell) ---
# (a) Google Drive is mounted so that files stored there can be read from Google Colab.
# (b) pd.read_csv loads the CSV file into a pandas DataFrame, which makes the data easy to manipulate and analyze.
# (c) df.head() shows the first five rows as a quick preview of the loaded data.
# (d) In the preview, row 0 holds the real column names and the '-> 2024' / '-> 2025' rows are separators, so the table needs cleaning before further analysis.

# --- Extracting and Reading Data from the Loaded CSV ---
# Checking basic information about the dataset (columns, types, non-null count)
df.info()

# --- Explanation for df.info() ---
# (a) df.info() lists each column's name, data type, and number of non-null values.
# (b) This shows whether columns are correctly typed and whether any contain missing data that needs handling.
# (c) Numeric columns that appear as object dtype usually contain stray text (here, the header row and '--' placeholders) and need converting before modeling.
# (d) The output is a quick overview of potential data quality issues such as missing values or incorrect data types.

# Checking for missing values in the dataset
# (print is needed because a notebook cell only displays its last expression automatically)
print(df.isnull().sum())

# --- Explanation for df.isnull().sum() ---
# (a) df.isnull().sum() counts the missing values in each column; handling missing data is critical for building accurate models.
# (b) Columns with missing values can then be dropped or filled to prepare the data for analysis.
# (c) isnull() returns a boolean mask marking missing entries, and sum() adds up the True values per column.
# (d) Placeholder strings such as '--' (visible in the preview of the data) are not counted as missing; they must be converted to NaN first.

# Checking descriptive statistics for numerical columns
print(df.describe())

# --- Explanation for df.describe() ---
# (a) df.describe() generates summary statistics (count, mean, standard deviation, min, quartiles, max) for the numerical columns.
# (b) These statistics help detect outliers, skewness, and other characteristics that influence preprocessing and model selection.
# (c) By default only numeric columns are summarized; columns stored as text (object dtype) are skipped until they are converted.
# (d) Together the statistics describe the range, central tendency, and spread of each numerical column.

# --- Problem Identification: Uneven Error Distribution in Multi-Step Forecasting ---

# Assuming we have the true values (y_test) and predicted values (y_pred).
# 'true_values' and 'predicted_values' are placeholder names; the DataFrame previewed earlier has no such columns.
y_test = df['true_values']  # Replace with the actual column name for true values
y_pred = df['predicted_values']  # Replace with the actual column name for predicted values

# Calculate Mean Absolute Error (MAE) and Mean Squared Error (MSE)
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# Identify steps with large errors (threshold can be set as needed)
threshold = 10  # Example threshold for large error detection
error_diff = y_test - y_pred
large_errors = error_diff[abs(error_diff) > threshold]  # Find large errors based on the threshold

# Plotting the error differences for visualization
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(error_diff, label='Error Difference (True - Predicted)', color='blue')
plt.axhline(y=threshold, color='red', linestyle='--', label='Threshold')
plt.axhline(y=-threshold, color='red', linestyle='--')
plt.title('Error Differences (True - Predicted)')
plt.xlabel('Time Step')
plt.ylabel('Error Difference')
plt.legend()
plt.show()

# Printing the error metrics and large errors
print(f"MAE: {mae}, MSE: {mse}")
print(f"Steps with large errors: {large_errors}")

# --- Explanation for the Problem Identification ---

# (a) MAE and MSE measure how well the model performs overall; both are objective summaries of prediction error.
# (b) Flagging errors above the threshold highlights individual time steps where the model performs poorly, which an aggregate metric would average away.
# (c) MAE is the average magnitude of the errors, while MSE squares each error and therefore weights large errors more heavily.
# (d) The plot shows where the errors cross the threshold, so improvement efforts can focus on those time steps.

# --- Proposed Solution: A Constrained Learning Framework ---
# In this approach, we add a constraint to keep the error distribution balanced across all time steps.
# We impose upper bounds on forecast errors to avoid large fluctuations at certain time steps.

# Constrained Learning Approach: Implementing error constraints
upper_bound = 5  # Setting an upper bound on errors (can be adjusted based on domain knowledge)

# Applying the constraint by clipping each prediction's error to [-upper_bound, upper_bound].
# Note: this step uses the true values, so it only illustrates the effect of the bound after the fact;
# it is not a training-time constraint and cannot be applied when y_test is unknown.
y_pred_constrained = y_test + (y_pred - y_test).clip(-upper_bound, upper_bound)

# Calculating the new error metrics for constrained predictions
mae_constrained = mean_absolute_error(y_test, y_pred_constrained)
mse_constrained = mean_squared_error(y_test, y_pred_constrained)

# Plotting the error differences for constrained predictions
error_diff_constrained = y_test - y_pred_constrained

plt.figure(figsize=(10, 6))
plt.plot(error_diff_constrained, label='Constrained Error Difference (True - Predicted)', color='green')
plt.axhline(y=upper_bound, color='red', linestyle='--', label='Upper Bound')
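plt.axhline(y=-upper_bound, color='red', linestyle='--')
# Remaining plot calls are assumed to mirror the unconstrained plot above
plt.title('Constrained Error Differences (True - Predicted)')
plt.xlabel('Time Step')
plt.ylabel('Error Difference')
plt.legend()
plt.show()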
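The cell above bounds the errors after the fact by clipping predictions against the true values. As a contrast, the following sketch shows how an upper bound on the per-step error could be enforced during training instead, using a standard primal-dual (Lagrangian) update. It is not taken from the notebook: the synthetic data, the linear model, the bound `epsilon`, and both learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic multi-step forecasting task (hypothetical data, not the notebook's CSV):
# 4 input features, a 4-step-ahead target, and noise that grows with the forecast step.
n_samples, n_features, horizon = 200, 4, 4
X = rng.normal(size=(n_samples, n_features))
W_true = rng.normal(size=(n_features, horizon))
noise_std = np.array([0.1, 0.2, 0.3, 0.4])
Y = X @ W_true + rng.normal(size=(n_samples, horizon)) * noise_std


def per_step_mse(W):
    """Mean squared error at each forecast step (one value per step)."""
    return ((X @ W - Y) ** 2).mean(axis=0)


epsilon = 0.5     # assumed upper bound on the MSE of every forecast step
primal_lr = 0.05  # step size for the model weights
dual_lr = 0.1     # step size for the dual variables

W = np.zeros((n_features, horizon))  # linear forecaster, one weight column per step
lam = np.zeros(horizon)              # one dual variable per forecast step

initial_step_mse = per_step_mse(W)

for _ in range(500):
    residual = X @ W - Y
    step_loss = (residual ** 2).mean(axis=0)
    # Lagrangian: mean_k(loss_k) + sum_k lam_k * (loss_k - epsilon).
    # Column k of grad is the gradient of loss_k with respect to weight column k.
    grad = 2.0 * X.T @ residual / n_samples
    W -= primal_lr * grad * (1.0 / horizon + lam)
    # Dual ascent: raise lam_k while step k violates the bound, never below zero.
    lam = np.maximum(0.0, lam + dual_lr * (step_loss - epsilon))

final_step_mse = per_step_mse(W)
print("Initial per-step MSE:", np.round(initial_step_mse, 3))
print("Final per-step MSE:  ", np.round(final_step_mse, 3))
print("Dual variables:      ", np.round(lam, 3))
```

In this toy setup each forecast step has its own weight column, so the dual variables mainly speed up the steps that start above the bound. With a model that shares parameters across steps, they would instead shift the trade-off toward the steps that violate the bound.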