
In [1]:
import pandas as pd
import numpy as np

In [2]:

# Load the daily city-level air-quality dataset (city_day.csv) from GitHub
url = 'https://raw.githubusercontent.com/datagrad/MS_Reference_MS-DS-LJMU-C13/main/city_day.csv'
df = pd.read_csv(url)

# Keep only the data for Delhi, Bengaluru, and Hyderabad
cities_to_keep = ['Delhi', 'Bengaluru', 'Hyderabad']
df = df[df['City'].isin(cities_to_keep)]


# Retain only the desired columns: 'City', 'Date', and 'PM2.5'
# (.copy() makes df an independent frame, so later column assignments do not warn)
df = df[['City', 'Date', 'PM2.5']].copy()

# Remove the 'AQI_Bucket' column
# df.drop(columns=['AQI_Bucket'], inplace=True)

df

Unnamed: 0,City,Date,PM2.5
1,Bengaluru,1/1/2015,
3,Delhi,1/1/2015,313.22
7,Bengaluru,1/2/2015,
9,Delhi,1/2/2015,186.18
13,Bengaluru,1/3/2015,
...,...,...,...
29489,Delhi,6/30/2020,39.80
29493,Hyderabad,6/30/2020,19.38
29509,Bengaluru,7/1/2020,17.50
29515,Delhi,7/1/2020,54.01


In [3]:
data_types = df.dtypes
print(data_types)


City      object
Date      object
PM2.5    float64
dtype: object


In [4]:
# Convert 'Date' column to datetime data type
df['Date'] = pd.to_datetime(df['Date'])

# Now you can check the data types again to verify the change
data_types = df.dtypes
print(data_types)

City             object
Date     datetime64[ns]
PM2.5           float64
dtype: object


In [5]:
# Find NaN values for each city in the 'PM2.5' column
nan_values_by_city = df[df['PM2.5'].isnull()].groupby('City').size()
print("NaN values for PM2.5 by city:")
print(nan_values_by_city)

NaN values for PM2.5 by city:
City
Bengaluru    146
Delhi          2
Hyderabad    115
dtype: int64


In [6]:
# NaN value imputation

# Forward-fill NaN values in the 'PM2.5' column within each city: each gap takes
# the last valid value from an earlier row of the same city
df['PM2.5'] = df.groupby('City')['PM2.5'].ffill()
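
The per-city forward fill can be illustrated on a small frame. This is a minimal sketch on hypothetical toy data (the city names "A" and "B" and all values are made up); it shows that a gap is filled from the previous row of the same city, and that a gap at the start of a city's series has nothing to copy from and stays NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each city starts with a gap and ends with a gap.
toy = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B", "B"],
    "PM2.5": [np.nan, 10.0, np.nan, np.nan, 5.0, np.nan],
})

# Forward fill within each city, as in the cell above.
toy["filled"] = toy.groupby("City")["PM2.5"].ffill()

# Trailing gaps are filled from the same city's previous row.
# Leading gaps stay NaN: city B's first row does not borrow city A's value.
print(toy)
```

This is consistent with the next cells, where some rows at the start of a city's record still contain NaN after the fill and are dropped.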

In [7]:
# Filter rows that still have NaN in the 'PM2.5' column after the forward fill
rows_with_nan_pm25 = df[df['PM2.5'].isnull()]

# Display the remaining rows with NaN values in the 'PM2.5' column
rows_with_nan_pm25

Unnamed: 0,City,Date,PM2.5
1,Bengaluru,2015-01-01,
7,Bengaluru,2015-01-02,
13,Bengaluru,2015-01-03,
19,Bengaluru,2015-01-04,
22,Hyderabad,2015-01-04,
...,...,...,...
589,Hyderabad,2015-03-26,
596,Hyderabad,2015-03-27,
603,Hyderabad,2015-03-28,
610,Hyderabad,2015-03-29,


In [8]:
# Drop rows with NaN values in the 'PM2.5' column
df.dropna(subset=['PM2.5'], inplace=True)

In [9]:

# Extract date-related features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfWeek'] = df['Date'].dt.dayofweek

# Drop the original 'Date' column since we have extracted useful features
df.drop(columns=['Date'], inplace=True)

# Handling Missing Values (optional)
# df = df.dropna()  # Remove rows with missing values

# Encoding Categorical Variables (City column)
df_encoded = pd.get_dummies(df, columns=['City'], drop_first=True)

# Splitting the Data into Training and Testing Sets
from sklearn.model_selection import train_test_split

X = df_encoded.drop(columns=['PM2.5'])
y = df_encoded['PM2.5']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
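
The effect of `drop_first=True` in the City encoding above can be seen on a small frame. This is a minimal sketch using a hypothetical three-row frame (the name `toy_cities` is illustrative); it assumes pandas orders the categories alphabetically, so Bengaluru becomes the dropped baseline category.

```python
import pandas as pd

# Hypothetical frame with one row per city kept in this notebook.
toy_cities = pd.DataFrame({"City": ["Bengaluru", "Delhi", "Hyderabad"]})

# One-hot encode and drop the first category (Bengaluru).
encoded = pd.get_dummies(toy_cities, columns=["City"], drop_first=True)

# Two indicator columns remain; a Bengaluru row is the one where both are off.
print(encoded)
```

Dropping one category removes a redundant column, since the third city is implied when the other two indicators are both off.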


In [10]:


# # Handling 'nan' (non-numeric) values in the 'AQI' column and replacing them with NaN
# df['AQI'] = df['AQI'].replace('nan', np.nan)

In [11]:
#Check the unique values in the 'AQI' column:
# print(df['AQI'].unique())


In [12]:
from sklearn.tree import DecisionTreeRegressor

# Create the Decision Trees Model
decision_tree_model = DecisionTreeRegressor(max_depth=5, random_state=42)

# Train the Model
decision_tree_model.fit(X_train, y_train)


In [16]:
y_pred = decision_tree_model.predict(X_test)


In [17]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
# Take the square root of the MSE; the 'squared' argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Mean Absolute Error:", mae)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)


Mean Absolute Error: 20.838791392340415
Root Mean Squared Error: 35.14648450118847
R-squared: 0.7011526955298639


**Mean Absolute Error (MAE):** The average absolute difference between the actual and predicted values. Lower is better, and an MAE of 0 would mean every prediction is exact. Here, an MAE of 20.84 means the model's PM2.5 predictions are off by about 20.84 on average, in the same units as the 'PM2.5' column. Whether that is acceptable depends on the scale of the target variable.
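
As a small worked example of the MAE calculation, the sketch below uses hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted values (not from the dataset).
actual = np.array([50.0, 100.0, 150.0, 200.0])
predicted = np.array([60.0, 90.0, 150.0, 240.0])

# Absolute errors are 10, 10, 0 and 40; their mean is the MAE.
mae_manual = np.mean(np.abs(actual - predicted))
print(mae_manual)
```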

**Root Mean Squared Error (RMSE):** The square root of the average squared error. Like MAE, it is in the units of the target and lower is better, but squaring gives large errors more weight. Here the RMSE of 35.15 is well above the MAE of 20.84, which suggests that some predictions miss by considerably more than the typical error.
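
To see how RMSE weights large errors, the sketch below compares two hypothetical sets of errors with the same MAE. The helper names `mae` and `rmse` and the error values are illustrative, not part of the notebook.

```python
import numpy as np

# Two hypothetical error sets with the same total absolute error.
even_errors = np.array([10.0, 10.0, 10.0, 10.0])
one_outlier = np.array([0.0, 0.0, 0.0, 40.0])

def mae(errors):
    """Mean of the absolute errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Square root of the mean squared error."""
    return np.sqrt(np.mean(errors ** 2))

# Both sets have the same MAE, but the single large miss raises the RMSE.
print(mae(even_errors), rmse(even_errors))
print(mae(one_outlier), rmse(one_outlier))
```

When all errors are equal the two metrics coincide; the more the errors are concentrated in a few large misses, the further RMSE rises above MAE.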

**R-squared (R^2):** The proportion of variance in the target variable that the model explains. A value of 1 is a perfect fit and 0 is no better than always predicting the mean; on held-out data it can also be negative. An R^2 of 0.70 means that about 70% of the variance in the 'PM2.5' column of the test set is explained by the features used in the model.
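
The definition R^2 = 1 - SS_res / SS_tot can be checked by hand. The sketch below uses a hypothetical helper `r_squared` and made-up values to show the three reference cases: a perfect fit, a mean-only prediction, and a prediction worse than the mean.

```python
import numpy as np

def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical target values (not from the dataset).
actual = np.array([10.0, 20.0, 30.0, 40.0])

r2_perfect = r_squared(actual, actual)                             # exact predictions
r2_mean = r_squared(actual, np.full_like(actual, actual.mean()))   # always predict the mean
r2_bad = r_squared(actual, actual[::-1])                           # predictions in reverse order

print(r2_perfect, r2_mean, r2_bad)
```

The last case gives a negative value, which is why R^2 on a test set is not strictly bounded below by 0.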