I will be applying optimization in different areas:
1. Optimization of memory usage by loading data in chunks
2. Optimization by data types
3. Optimization of time, memory usage and CPU usage while applying machine learning models on my dataset taken from csv file (GridSeachCV and Cross-Validation part).
   Apart from this, I will apply cross validation with cv=5,10,15 also to check the optimized results.
   Lastly, I will check the optimized execution time with different no of CPU processors used. 
Justification: The reason for applying optimization in machine learning parts is that these sections of code require exhaustive search and computation, thus taking a lot of time and memory, that need to be optimised. 

In [118]:
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline 
import matplotlib.dates as mdates
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_score
from scipy.stats import skew,kurtosis
from scipy.stats import poisson,norm
sns.set(color_codes=True)
from memory_profiler import memory_usage
import time
import multiprocessing
import psutil
import threading 


# DATA PREPARATION AND VISUALISATION

## EXPLORATORY DATA ANALYSIS:

In [120]:
df=pd.read_csv("Energy2.csv")

In [122]:
df.head()

Unnamed: 0,Statistic Label,Sector,Year,Fuel Type,UNIT,VALUE
0,Fuel Consumption (ktoe),Final energy consumption,1990,Sum of all coal products,ktoe,843
1,Fuel Consumption (ktoe),Final energy consumption,1990,Bituminous coal,ktoe,825
2,Fuel Consumption (ktoe),Final energy consumption,1990,Anthracite and manufactured ovoids,ktoe,0
3,Fuel Consumption (ktoe),Final energy consumption,1990,Coke,ktoe,0
4,Fuel Consumption (ktoe),Final energy consumption,1990,Lignite,ktoe,18


# 1. Optimization of memory usage by loading data in chunks

In [125]:
#pip install memory-profiler

In [127]:
#import pandas as pd
from memory_profiler import memory_usage

# Function to load the entire dataset
def load_full_data():
    #df = pd.read_csv('energy_data.csv')
    df=pd.read_csv("Energy2.csv")# Load the whole dataset at once
    return df

# Function to load the data in chunks and process it
def load_data_in_chunks(chunksize=10000):
    for chunk in pd.read_csv('Energy2.csv', chunksize=chunksize):
        process(chunk)  # Process each chunk

# Simulate processing function for chunks
def process(chunk):
    
    pass

# Use memory_profiler to track memory usage for the full load
def profile_full_load():
    mem_usage = memory_usage((load_full_data,))  # Measure memory usage when loading full data
    return mem_usage

# Use memory_profiler to track memory usage for chunked loading
def profile_chunked_load(chunksize=10000):
    mem_usage = memory_usage((load_data_in_chunks, (chunksize,)))  # Measure memory usage when loading data in chunks
    return mem_usage

if __name__ == "__main__":
    # Profile memory usage when loading the full dataset
    print("Memory usage when loading the full dataset:")
    full_load_mem_usage = profile_full_load()
    print(f"Max memory used (full load): {max(full_load_mem_usage)} MiB\n")
    
    # Profile memory usage when loading the dataset in chunks
    print("Memory usage when loading the dataset in chunks:")
    chunked_load_mem_usage = profile_chunked_load()
    print(f"Max memory used (chunked load): {max(chunked_load_mem_usage)} MiB\n")


Memory usage when loading the full dataset:
Max memory used (full load): 74.87109375 MiB

Memory usage when loading the dataset in chunks:
Max memory used (chunked load): 65.5546875 MiB



Insights: The above results show a clear decrease in memory usage when data is loaded in chunks as compared to loading the full dataset all together. 

# 2. Optimization by data types

In [13]:
#We look at memory usage in by the daat types and variable in our original dataset
df.info(memory_usage='deep')
#Gives basic info about your DataFrame df and also shows a more accurate estimate of memory usage


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65340 entries, 0 to 65339
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Statistic Label  65340 non-null  object
 1   Sector           65340 non-null  object
 2   Year             65340 non-null  int64 
 3   Fuel Type        65340 non-null  object
 4   UNIT             65340 non-null  object
 5   VALUE            65340 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 17.8 MB


Here we can see that Statistic Label, Sector, Fuel Type and UNIT have object data type. Since object type takes more memory therefore we can down cast our data type to category, which requires less memory.
Similarly, Year and VALUE have data type int64. This again requires more memory so we can down cast our data type to int16, which requires less memory. We can do this because every year has four digits and it is an 11 bit number and int16 is sufficient to store an 11-bit number. Similarly, the VALUE column has a maximum value 13189, which comprises 5 digits and comes out to be a 14 bit number. Here, also, int16 is sufficient to store a 14-bit number.

In [15]:
# Check memory usage before optimization
mem_before = df.memory_usage(deep=True).sum() / 1024**2  # in MiB
print(f"Memory usage before optimization: {mem_before:.2f} MiB")

Memory usage before optimization: 17.84 MiB


In [19]:
#Optimize data types
# Convert object columns to category if they have few unique values
for col in df.select_dtypes(include='object').columns:
    num_unique = df[col].nunique()
    num_total = len(df[col])
    if num_unique / num_total < 0.5:
        df[col] = df[col].astype('category')

# Downcast integers
df['Year'] = pd.to_numeric(df['Year'], downcast='integer')
df['VALUE'] = pd.to_numeric(df['VALUE'], downcast='integer')

print(df.dtypes)


Statistic Label    category
Sector             category
Year                  int16
Fuel Type          category
UNIT               category
VALUE                 int16
dtype: object


In [21]:
#Optimized memory usage
mem_after = df.memory_usage(deep=True).sum() / 1024**2  # in MiB
print(f"Memory usage after optimization: {mem_after:.2f} MiB")
print(f"Memory saved: {mem_before - mem_after:.2f} MiB")


Memory usage after optimization: 0.51 MiB
Memory saved: 17.33 MiB


In [23]:
df['Year'] = pd.to_numeric(df['Year'], downcast='integer')
df['VALUE'] = pd.to_numeric(df['VALUE'], downcast='integer')

for col in ['Statistic Label', 'Sector', 'Fuel Type', 'UNIT']:
    df[col] = df[col].astype('category')
    
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65340 entries, 0 to 65339
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Statistic Label  65340 non-null  category
 1   Sector           65340 non-null  category
 2   Year             65340 non-null  int16   
 3   Fuel Type        65340 non-null  category
 4   UNIT             65340 non-null  category
 5   VALUE            65340 non-null  int16   
dtypes: category(4), int16(2)
memory usage: 519.3 KB


Insights: The above working shows clearly that as we downcast the data types, memory usage gets reduced significantly. 

# 3. Optimization of time, memory usage and CPU usage while applying machine learning models

# MACHINE LEARNING

In [60]:
import pandas as pd
df=pd.read_csv("Energy2.csv")

In [61]:
df.head()

Unnamed: 0,Statistic Label,Sector,Year,Fuel Type,UNIT,VALUE
0,Fuel Consumption (ktoe),Final energy consumption,1990,Sum of all coal products,ktoe,843
1,Fuel Consumption (ktoe),Final energy consumption,1990,Bituminous coal,ktoe,825
2,Fuel Consumption (ktoe),Final energy consumption,1990,Anthracite and manufactured ovoids,ktoe,0
3,Fuel Consumption (ktoe),Final energy consumption,1990,Coke,ktoe,0
4,Fuel Consumption (ktoe),Final energy consumption,1990,Lignite,ktoe,18


## MODELLING

In [65]:
#from sklearn.model_selection import train_test_split, GridSearchCV
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.tree import DecisionTreeRegressor
#from sklearn.metrics import mean_squared_error, r2_score
#import numpy as np

df = df[~df['Sector'].str.contains('Sum of all|Final energy consumption', case=False)]
df = df[~df['Fuel Type'].str.contains('Sum of all', case=False)]



In [67]:
df['VALUE_log'] = np.log1p(df['VALUE'])
df.head()

Unnamed: 0,Statistic Label,Sector,Year,Fuel Type,UNIT,VALUE,VALUE_log
2905,Fuel Consumption (ktoe),Industry- non energy mining,1990,Bituminous coal,ktoe,0,0.0
2906,Fuel Consumption (ktoe),Industry- non energy mining,1990,Anthracite and manufactured ovoids,ktoe,0,0.0
2907,Fuel Consumption (ktoe),Industry- non energy mining,1990,Coke,ktoe,0,0.0
2908,Fuel Consumption (ktoe),Industry- non energy mining,1990,Lignite,ktoe,0,0.0
2910,Fuel Consumption (ktoe),Industry- non energy mining,1990,Milled peat,ktoe,0,0.0


In [69]:
#define X amd y
X=df[['Sector', 'Fuel Type', 'Year']]
X_encoded = pd.get_dummies(X, columns=['Sector', 'Fuel Type'], drop_first=True)
y = df['VALUE_log']

In [77]:
# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

In [85]:
#define hyperparameters
param_grids = {
    'Decision Tree': {
        'max_depth': [10, 20, 30 ],
        'min_samples_split': [2, 5, 10, 20],
        'min_samples_leaf': [1, 2, 4, 8],
        'max_features': ['sqrt', 'log2']
    },
    
    'Random Forest': {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, 20, None],
    'max_features': ['sqrt', 'log2']
   # 'learning_rate': [0.01, 0.05, 0.1],
    #'subsample': [0.7, 1],
    #'colsample_bytree': [0.7, 1]
}
}


In [87]:
#Initialize models
models = {
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42)
}

#Store evaluation results
results = {
    "R2 Score testing": {},
    "R2 Score training": {},
    "MAE": {},
    "RMSE": {},
    "CV": {},
    "CV R2 Mean": {},    #  Cross-validation R2 mean
    "CV R2 Std": {}      #  Cross-validation R2 std deviation
}
predictions = {}
best_params = {}
fitted_models={}

In [89]:
pip install psutil


Note: you may need to restart the kernel to use updated packages.


In [90]:
#checking memory and cpu usage and execution time for both models
#import time
#import multiprocessing
#import psutil

print("Number of CPU cores available:", multiprocessing.cpu_count())

# Function to print system usage
def print_system_usage(note=""):
    cpu = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory()
    print(f"{note} CPU usage: {cpu}% | Memory usage: {mem.percent}%")

#Train, predict, evaluate
for name, model in models.items():
       
    print_system_usage("Before GridSearchCV")
    
    grid_search = GridSearchCV(estimator=model, param_grid=param_grids[name], 
                               cv=5, n_jobs=-1, verbose=1, scoring='r2')
    start_time = time.time()
    grid_search.fit(X_train, y_train)
    end_time = time.time()

    print(f"{name} GridSearchCV execution time: {end_time - start_time:.2f} sec")
    print_system_usage("After GridSearchCV")

    best_model = grid_search.best_estimator_
    best_params[name] = grid_search.best_params_
    #y_pred = best_model.predict(X_test)
    # Predict on test set
    
    y_pred_log = best_model.predict(X_test)
    y_trainn=best_model.predict(X_train)

    # Invert the log transformation to get predictions in original VALUE scale
    y_pred = np.expm1(y_pred_log)
    y_train_pred = np.expm1(y_trainn)

    # Invert y_test as well
    y_test_original = np.expm1(y_test)
    y_train_original = np.expm1(y_train)
    
    # Store results
    results["R2 Score testing"][name] = r2_score(y_test_original, y_pred)
    results["R2 Score training"][name] = r2_score(y_train_original, y_train_pred)
    results["MAE"][name] = mean_absolute_error(y_test_original, y_pred)
    results["RMSE"][name] = np.sqrt(mean_squared_error(y_test_original, y_pred))
    predictions[name] = y_pred

    print(f"\n{name} Best Params: {grid_search.best_params_}")
    print(f"{name} R² Score (Test): {results['R2 Score testing'][name]:.3f}")
    print(f"{name} R² Score (Train): {results['R2 Score training'][name]:.3f}")
    print(f"{name} MAE: {results['MAE'][name]:.3f}")
    print(f"{name} RMSE: {results['RMSE'][name]:.3f}")





Number of CPU cores available: 8
Before GridSearchCV CPU usage: 3.9% | Memory usage: 88.2%
Fitting 5 folds for each of 96 candidates, totalling 480 fits
Decision Tree GridSearchCV execution time: 11.99 sec
After GridSearchCV CPU usage: 3.3% | Memory usage: 93.1%

Decision Tree Best Params: {'max_depth': 30, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2}
Decision Tree R² Score (Test): 0.910
Decision Tree R² Score (Train): 0.965
Decision Tree MAE: 1.937
Decision Tree RMSE: 16.438
Before GridSearchCV CPU usage: 1.0% | Memory usage: 93.1%
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Random Forest GridSearchCV execution time: 302.91 sec
After GridSearchCV CPU usage: 0.4% | Memory usage: 89.9%

Random Forest Best Params: {'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 500}
Random Forest R² Score (Test): 0.941
Random Forest R² Score (Train): 0.983
Random Forest MAE: 1.355
Random Forest RMSE: 13.299


Insights: The optimization process using GridSearchCV is computationally efficient for the Decision Tree model, completing 480 fits in just 11.99 seconds with minimal CPU usage (~3.3–3.9%) and moderate memory increase (from 88.2% to 93.1%). In contrast, the Random Forest model, despite performing only 120 fits, requires significantly more time (302.91 seconds) due to its computationally intensive nature, especially with 500 estimators, but interestingly shows lower CPU usage (0.4–1.0%) and memory fluctuation (from 93.1% to 89.9%). This highlights the trade-off between model complexity, resource efficiency, and execution time, where Random Forest offers superior performance metrics but at a higher computational cost.

# Cross validation with cv=5,10,15 to check the optimized results

In [None]:
#cv checks

In [93]:
#from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
#import time #for checking execution time
#import multiprocessing #to run multiple processes in parallel, allowing tasks to be executed concurrently using multiple CPU cores.
#import psutil #for checking memory and cpu usage
#import threading #to create and manage multiple threads within a single process, enabling concurrent execution.
#import pandas as pd

print("Number of CPU cores available:", multiprocessing.cpu_count())

def get_system_usage():
    cpu = psutil.cpu_percent(interval=1)
    mem = psutil.virtual_memory().percent
    return cpu, mem

cv_values = [5, 10, 15]
results = {
    "Model": [],
    "CV": [],
    "CV R2 Mean": [],
    "CV R2 Std": [],
    "Time (s)": [],
    "Avg CPU (%)": [],
    "Avg Memory (%)": []
}

for name, model in models.items():
    print(f"\nRunning GridSearchCV for {name} ...")
    
    grid_search = GridSearchCV(estimator=model, param_grid=param_grids[name], 
                               cv=5, n_jobs=-1, verbose=1, scoring='r2')
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_

    for cv in cv_values:
        kf = KFold(n_splits=cv, shuffle=True, random_state=42)
        print(f"\nRunning cross_val_score for {name} with cv={cv} ...")
        
        resource_readings = {"cpu": [], "mem": []}
        stop_monitor = threading.Event()

        def monitor():
            while not stop_monitor.is_set():
                cpu, mem = get_system_usage()
                resource_readings["cpu"].append(cpu)
                resource_readings["mem"].append(mem)
                time.sleep(0.5)

        monitor_thread = threading.Thread(target=monitor)
        monitor_thread.start()

        start_time = time.time()
        cv_scores = cross_val_score(best_model, X_train, y_train, cv=kf, scoring='r2', n_jobs=-1)
        end_time = time.time()

        stop_monitor.set()
        monitor_thread.join()

        avg_cpu = sum(resource_readings["cpu"]) / len(resource_readings["cpu"]) if resource_readings["cpu"] else 0
        avg_mem = sum(resource_readings["mem"]) / len(resource_readings["mem"]) if resource_readings["mem"] else 0
        exec_time = end_time - start_time

        results["Model"].append(name)
        results["CV"].append(cv)
        results["CV R2 Mean"].append(cv_scores.mean())
        results["CV R2 Std"].append(cv_scores.std())
        results["Time (s)"].append(exec_time)
        results["Avg CPU (%)"].append(avg_cpu)
        results["Avg Memory (%)"].append(avg_mem)

        print(f"{name} with CV={cv}: Mean R² = {cv_scores.mean():.3f}, Std Dev = {cv_scores.std():.3f}")
        print(f"Execution time: {exec_time:.2f} sec | Avg CPU: {avg_cpu:.1f}% | Avg Memory: {avg_mem:.1f}%")

results_df = pd.DataFrame(results)
print("\nSummary:")
print(results_df)


Number of CPU cores available: 8

Running GridSearchCV for Decision Tree ...
Fitting 5 folds for each of 96 candidates, totalling 480 fits





Running cross_val_score for Decision Tree with cv=5 ...
Decision Tree with CV=5: Mean R² = 0.820, Std Dev = 0.030
Execution time: 0.27 sec | Avg CPU: 19.8% | Avg Memory: 91.8%

Running cross_val_score for Decision Tree with cv=10 ...
Decision Tree with CV=10: Mean R² = 0.815, Std Dev = 0.024
Execution time: 0.50 sec | Avg CPU: 24.8% | Avg Memory: 90.6%

Running cross_val_score for Decision Tree with cv=15 ...
Decision Tree with CV=15: Mean R² = 0.808, Std Dev = 0.026
Execution time: 0.62 sec | Avg CPU: 34.2% | Avg Memory: 90.6%

Running GridSearchCV for Random Forest ...
Fitting 5 folds for each of 24 candidates, totalling 120 fits

Running cross_val_score for Random Forest with cv=5 ...
Random Forest with CV=5: Mean R² = 0.972, Std Dev = 0.003
Execution time: 45.99 sec | Avg CPU: 53.9% | Avg Memory: 90.8%

Running cross_val_score for Random Forest with cv=10 ...
Random Forest with CV=10: Mean R² = 0.976, Std Dev = 0.003
Execution time: 109.86 sec | Avg CPU: 71.9% | Avg Memory: 91.1%


Insights: As cross-validation (CV) folds increases from 5 to 15, both models requires more time and CPU resources. For the Decision Tree, execution time grows from 0.27s to 0.62s, with CPU usage increasing from 19.8% to 34.2%, while memory stays stable (~90–92%), showing efficient scalability. In contrast, the Random Forest model shows significant increases in computational load i.e. time increases sharply from 46s (CV=5) to 141s (CV=15), and average CPU usage surges from 54% to 96%, reflecting its parallel complexity and larger model size.
Despite these increases, R² values remained stable for Random Forest, with the Decision Tree slightly declining and the Random Forest consistently achieving high performance across CV folds, thus highlighting the trade-off between improved accuracy and resource demands at higher CV values.

In [None]:
# As checked above, total number of CPU cores available is 8 so we will check execution time with 4 and 8 (max) CPUs.  

# Time optimization with different CPU cores (n_jobs)

In [95]:
# Optimization Benchmarking Section 
#import time
for name, model in models.items():
    optimization_configs = [
    { "n_jobs": 4},
    {"n_jobs": -1}
]

for config in optimization_configs:
    print(f"\n{name} - Benchmarking GridSearchCV with cv=5, n_jobs={config['n_jobs']}")
    
    start_time = time.time()
    grid_search_opt = GridSearchCV(estimator=model, param_grid=param_grids[name],
                                   cv=5, n_jobs=config['n_jobs'], 
                                   verbose=1, scoring='r2')
    grid_search_opt.fit(X_train, y_train)
    end_time = time.time()
    
    print(f"Execution time: {end_time - start_time:.2f} sec")



Random Forest - Benchmarking GridSearchCV with cv=5, n_jobs=4
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Execution time: 425.28 sec

Random Forest - Benchmarking GridSearchCV with cv=5, n_jobs=-1
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Execution time: 310.13 sec
