## Combining CNN and LSTM
### First, the CNN finds local patterns in the data, much like spotting key phrases in a text. Then, the LSTM reads these patterns as a sequence and carries information from earlier steps forward, like keeping track of how the weather has changed over the past few days. Combining the two lets the model use both the local patterns and their order when making predictions.

## How It Works Together
### 1. CNN: Finds important features in the data.
### 2. LSTM: Remembers these features over time.
### 3. XGBoost: Is trained on the output of the LSTM layer to make the final prediction.

In [7]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_absolute_percentage_error
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, LSTM, Dense
from math import sqrt

In [8]:
df = pd.read_csv(r"C:\Users\visha\Desktop\processed_data2.csv")
print(df.head())

# Encode categorical columns, keeping one fitted encoder per column
# so each integer code can be mapped back to its original label
categorical_cols = ['Region', 'Day_period', 'Season', 'Weekday_or_weekend', 'Regular_day_or_holiday']
label_encoders = {}
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])

# Define features and target including encoded columns
features = df[['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene',
               'Region', 'Day_period', 'Month_encoded', 'Season', 'Weekday_or_weekend', 'Regular_day_or_holiday']]
target = df['AQI']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Print the shapes of the splits to ensure consistency
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

   Serial number       City             Datetime     PM2.5     PM10        NO  \
0              1  Ahmedabad  2015-01-29 09:00:00  0.051896  0.15735 -0.717443   
1              2  Ahmedabad  2015-01-29 12:00:00  0.099619  0.15735 -0.741590   
2              3  Ahmedabad  2015-01-29 13:00:00 -0.136347  0.15735 -0.747717   
3              4  Ahmedabad  2015-01-29 14:00:00 -0.149292  0.15735 -0.745915   
4              5  Ahmedabad  2015-01-29 15:00:00 -0.249729  0.15735 -0.753123   

        NO2       NOx       NH3        CO  ...  Status      Region  \
0 -0.589015 -0.525303  0.112012  0.032461  ...  Active  5. Western   
1 -0.815643 -0.641089  0.112012 -0.347962  ...  Active  5. Western   
2 -0.922628 -0.751600  0.112012 -0.444487  ...  Active  5. Western   
3 -0.836468 -0.678852  0.112012 -0.416097  ...  Active  5. Western   
4 -0.908745 -0.740493  0.112012 -0.529656  ...  Active  5. Western   

     Day_period    Month  Year     Season Weekday_or_weekend  \
0    1. Morning  01. Jan  20

#### Why Use a Nested 1D Array?
#### Convolutional operation: A 1D CNN slides its filters along one axis of each sample (row). Here that axis is the sequence of 17 feature columns.

#### Feature extraction: Conv1D expects input of shape (samples, steps, channels), so each row is reshaped from 17 values into 17 steps with one channel each. This lets the filters pick up patterns between neighbouring feature columns.

#### So, converting the data to a "nested 1D array" for a 1D CNN means adding a trailing channel axis so that convolution can run over the sequence of features within each sample. The same reshaping is used in time series analysis, where the steps are points in time rather than feature columns.
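To make the sliding-filter idea concrete, here is a minimal NumPy sketch of what one filter with kernel size 2 and a pooling window of 2 do to a single row of 17 values. The helper names `conv1d_valid` and `max_pool1d`, the stand-in row, and the filter weights are all hypothetical, and the bias term that a real Conv1D layer learns is left out. It illustrates the shape arithmetic only and is not the Keras implementation.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Slide `kernel` along `x` with stride 1 and no padding."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, pool_size):
    """Take the max over non-overlapping windows, dropping any leftover tail."""
    n = len(x) // pool_size
    return x[:n * pool_size].reshape(n, pool_size).max(axis=1)

row = np.arange(17, dtype=float)   # stand-in for one sample's 17 feature values
kernel = np.array([-1.0, 1.0])     # hypothetical weights for one filter (kernel_size=2)

conv_out = np.maximum(conv1d_valid(row, kernel), 0.0)  # ReLU activation
pooled = max_pool1d(conv_out, pool_size=2)

print(conv_out.shape)  # (16,)
print(pooled.shape)    # (8,)
```

With 64 filters instead of one, the same arithmetic gives 16 steps of 64 values after the convolution and 8 steps of 64 values after pooling, which is the sequence the LSTM layer then reads.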

#### Only the feature matrix is reshaped this way; the target is left unchanged.


#### X_1d_cnn:

    [
      [ [30], [50], [20] ],
      [ [35], [55], [22] ],
      [ [40], [60], [25] ],
      [ [45], [65], [28] ],
      [ [50], [70], [30] ]
    ]

Each element [30], [50], [20], etc., holds a single scalar feature value within the reshaped array.
The outermost brackets [] contain all the samples (rows).
The middle brackets [] contain the features within each sample.
The innermost brackets [] are the channel dimension added by np.expand_dims().
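The reshaping described above can be reproduced on the toy array with a few lines of NumPy. The variable name `X` is an assumption made for this sketch; `X_1d_cnn` follows the label used above.

```python
import numpy as np

# Toy feature matrix: 5 samples (rows) x 3 features (columns)
X = np.array([
    [30, 50, 20],
    [35, 55, 22],
    [40, 60, 25],
    [45, 65, 28],
    [50, 70, 30],
])

# Add a trailing channel axis: (samples, features) -> (samples, features, 1)
X_1d_cnn = np.expand_dims(X, axis=2)

print(X.shape)               # (5, 3)
print(X_1d_cnn.shape)        # (5, 3, 1)
print(X_1d_cnn[0].tolist())  # [[30], [50], [20]]
```

The values themselves are unchanged; only the shape gains one axis of length 1.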

In [9]:

# Add a trailing channel axis for Conv1D: (samples, steps, channels),
# where the steps are the feature columns and there is a single channel
X_train_cnn = np.expand_dims(X_train.values, axis=2)
X_test_cnn = np.expand_dims(X_test.values, axis=2)
print(f"X_train_cnn shape: {X_train_cnn.shape}")
print(f"X_test_cnn shape: {X_test_cnn.shape}")

X_train_cnn shape: (2517657, 17, 1)
X_test_cnn shape: (1078996, 17, 1)


### Epoch Loop:

    1. The model loops through 50 epochs.
    2. For each epoch:
           The training data is divided into batches of size 32.
           The number of batches per epoch is (dataset - test set - validation set) / 32, rounded up.
           For each batch:
               The model performs a forward pass to make predictions.
               The loss is calculated by comparing the predictions to the actual values.
               The model performs a backward pass to compute gradients, and the optimizer updates the weights.
           After processing all batches, the model is evaluated on the validation set, which is held out from the training data. I set validation_split to 0.3, which means 30 percent of the training data is reserved for validation.
           The training loss and validation loss are recorded and displayed.
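As a quick check of the batch arithmetic, the sketch below computes the number of batches per epoch from the training shape printed earlier. The variable names are hypothetical, and the way the held-out fraction is rounded is an assumption about how Keras applies `validation_split`.

```python
import math

n_train_rows = 2_517_657   # rows in X_train_cnn, from the shape printed earlier
validation_split = 0.3
batch_size = 32

# Assumption: rows used for fitting are the first floor(n * (1 - split)) rows,
# and the remainder is held out for validation
n_fit = math.floor(n_train_rows * (1.0 - validation_split))
n_val = n_train_rows - n_fit

# The last batch may be smaller than 32, so round up
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_fit, n_val)
print(steps_per_epoch)  # 55074, the step count shown in the training log
```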

In [None]:
# Initialize the CNN-LSTM model
cnn_lstm_model = Sequential()
cnn_lstm_model.add(Conv1D(filters=64, kernel_size=2, activation='relu', input_shape=(X_train_cnn.shape[1], 1)))
cnn_lstm_model.add(MaxPooling1D(pool_size=2))
cnn_lstm_model.add(LSTM(50, return_sequences=False))
cnn_lstm_model.add(Dense(50, activation='relu'))
cnn_lstm_model.add(Dense(1))
# Compile the model
cnn_lstm_model.compile(optimizer='adam', loss='mse')

# Train the model
cnn_lstm_model.fit(X_train_cnn, y_train, epochs=50, batch_size=32, validation_split=0.3, verbose=1)



Epoch 1/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 853s 15ms/step - loss: 0.2125 - val_loss: 0.1548
Epoch 2/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 848s 15ms/step - loss: 0.1464 - val_loss: 0.1342
Epoch 3/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 776s 14ms/step - loss: 0.1290 - val_loss: 0.1229
Epoch 4/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 617s 11ms/step - loss: 0.1204 - val_loss: 0.1176
Epoch 5/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 519s 9ms/step - loss: 0.1147 - val_loss: 0.1133
Epoch 6/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 489s 9ms/step - loss: 0.1105 - val_loss: 0.1098
Epoch 7/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 520s 9ms/step - loss: 0.1073 - val_loss: 0.1094
Epoch 8/50
55074/55074 ━━━━━━━━━━━━━━━━━━━━ 707s 13ms/step - loss: 0.1047 - val_

In [None]:
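# Extract features from the trained CNN-LSTM model by reusing its first three
# layers (Conv1D, MaxPooling1D, LSTM). The Dense layers are left out, so the
# output for each sample is the 50-value vector produced by the LSTM layer.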
intermediate_layer_model = Sequential()
intermediate_layer_model.add(cnn_lstm_model.layers[0])
intermediate_layer_model.add(cnn_lstm_model.layers[1])
intermediate_layer_model.add(cnn_lstm_model.layers[2])

X_train_features = intermediate_layer_model.predict(X_train_cnn)
X_test_features = intermediate_layer_model.predict(X_test_cnn)

In [None]:
# Initialize and train the XGBoost model
xgb_model = xgb.XGBRegressor(n_estimators=100, random_state=42)
xgb_model.fit(X_train_features, y_train)

In [None]:
# Make predictions on the test set
y_pred_xgb = xgb_model.predict(X_test_features)

# Calculate and print metrics for the combined model
r2_xgb = r2_score(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
# scikit-learn returns MAPE as a fraction (0.05 means 5%), not as a percentage
mape_xgb = mean_absolute_percentage_error(y_test, y_pred_xgb)
rmse_xgb = sqrt(mse_xgb)

print("Combined CNN-LSTM + XGBoost Metrics:")
print("R² (Coefficient of Determination):", r2_xgb)
print("MSE (Mean Squared Error):", mse_xgb)
print("MAE (Mean Absolute Error):", mae_xgb)
print("MAPE (Mean Absolute Percentage Error):", mape_xgb)
print("RMSE (Root Mean Square Error):", rmse_xgb)

In [None]:
import matplotlib.pyplot as plt

# Generate predictions for the test set using the XGBoost model
y_pred_xgb = xgb_model.predict(X_test_features)

# Plotting the actual vs. predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_xgb, alpha=0.5, label='Predicted vs Actual', color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', label='Ideal fit')
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI using CNN-LSTM + XGBoost')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Recompute the predictions so this cell can run on its own
y_pred_xgb = xgb_model.predict(X_test_features)

# Number of points to sample for visualization
num_points = 500

# Ensure the number of points does not exceed the test set size
num_points = min(num_points, len(y_test))

# Randomly select indices for the sample (seeded so the plot is reproducible)
rng = np.random.default_rng(42)
sample_indices = rng.choice(len(y_test), num_points, replace=False)

# Sample the actual and predicted values
y_test_sample = y_test.iloc[sample_indices]
y_pred_xgb_sample = y_pred_xgb[sample_indices]

plt.figure(figsize=(10, 6))

# Plot actual vs predicted values for the sampled subset
plt.scatter(y_test_sample, y_pred_xgb_sample, color='blue', edgecolor='k', alpha=0.7, label='Predicted Values')

# Ideal fit line
plt.plot([y_test_sample.min(), y_test_sample.max()], [y_test_sample.min(), y_test_sample.max()], 'r--', lw=2, label='Ideal Fit')

# Labels, title, and legend
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values using CNN-LSTM + XGBoost')
plt.legend(loc='lower right')

# Adjusting the scale to a smaller range to zoom in
plt.xlim([0, 5])
plt.ylim([0, 5])

plt.grid(True)
plt.show()
