# Importing necessary libraries for data manipulation, preprocessing, and model building

In [2]:
# Importing pandas for data manipulation and analysis
import pandas as pd 
# Importing train_test_split for splitting the dataset into training and testing sets
from sklearn.model_selection import train_test_split  
# Importing OneHotEncoder for encoding categorical variables and StandardScaler for scaling numerical features
from sklearn.preprocessing import OneHotEncoder, StandardScaler 
# Importing ColumnTransformer to apply different preprocessing steps to different columns in the dataset
from sklearn.compose import ColumnTransformer  
# Importing Pipeline to streamline the process of creating machine learning models by combining preprocessing and modeling steps
from sklearn.pipeline import Pipeline  


Conclusion: These libraries cover data preparation and modeling: data manipulation (pandas), dataset splitting (train_test_split), feature encoding and scaling (OneHotEncoder and StandardScaler), column-specific preprocessing (ColumnTransformer), and chaining these steps into a single workflow (Pipeline).









# This line loads the dataset from the CSV file

In [3]:
# Loading the dataset from a CSV file into a pandas DataFrame for data manipulation and analysis
df = pd.read_csv('updated_floods.csv') 



# This line defines a list of features necessary for modeling and analysis, specifying the variables to be used.

In [4]:
# Defining a list of required features that will be used for modeling and analysis
required_features = [
    'Year', 'Flood_Area', 'MonsoonIntensity', 'Deforestation', 
    'ClimateChange', 'Siltation', 'AgriculturalPractices', 
    'DrainageSystems', 'CoastalVulnerability', 'Landslides', 
    'PopulationScore', 'InadequatePlanning', 'Latitude', 'Longitude'
] 


Conclusion: Specifying the required features restricts the analysis and modeling to the relevant columns, so only the necessary information is used to build the predictive model.
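Before selecting columns, it can help to confirm that every required feature is present in the loaded data. The sketch below is illustrative only: `toy_df` and `toy_required` are hypothetical stand-ins with made-up values, since the contents of `updated_floods.csv` are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the values are made up.
toy_df = pd.DataFrame({
    'Year': [2019, 2020],
    'Flood_Area': ['A', 'B'],
    'MonsoonIntensity': [5, 7],
})

# Shortened, illustrative feature list; 'Deforestation' is deliberately
# absent from toy_df so the check has something to report.
toy_required = ['Year', 'Flood_Area', 'MonsoonIntensity', 'Deforestation']

# Collect the required columns that the DataFrame does not contain.
missing = [col for col in toy_required if col not in toy_df.columns]
print(f'Missing columns: {missing}')
```

Running the same comprehension against `df` and `required_features` would surface a misspelled or absent column before `df[required_features]` raises a `KeyError`.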

# Set FloodProbability as the target variable for the prediction model; it represents the likelihood of a flood occurring.

In [4]:
# Defining the target variable for the prediction model, which represents the likelihood of a flood occurring
target = 'FloodProbability' 



Conclusion: Specifying the target variable helps direct the model’s focus on predicting the likelihood of a flood based on input features.

# Select the features and target variable from the DataFrame to prepare the input data (X) and output data (y) for the prediction model.

In [5]:
# Selecting the required features from the DataFrame to create the input data for the model
X = df[required_features]
# Selecting the target variable from the DataFrame to create the output data for the model
y = df[target]


Conclusion: Selecting the features and target explicitly ensures that the model is trained only on the intended columns. How accurate the predictions are still depends on the data and the model.

# Define lists for feature encoding and processing: categorical_features holds the features that need encoding, and numeric_features holds the remaining required features, which are numerical.

In [6]:
# Defining a list of categorical features that need to be encoded for machine learning models
categorical_features = ['Flood_Area']  
# Defining a list of numerical features by excluding categorical features from the required features
numeric_features = [col for col in required_features if col not in categorical_features]  


Conclusion: Separating categorical and numerical features ensures that appropriate preprocessing techniques, such as encoding for categorical features and scaling for numerical features, can be applied during model training.
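The split above is written by hand. As a cross-check, the column dtypes can suggest which features are categorical. The sketch below assumes a hypothetical `toy_df` with made-up values; whether the real `Flood_Area` column holds strings is an assumption, not something the notebook shows.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns.
toy_df = pd.DataFrame({
    'Flood_Area': ['North', 'South', 'North'],
    'MonsoonIntensity': [5, 7, 6],
    'Latitude': [10.5, 11.2, 9.8],
})

# Columns with a numeric dtype are candidates for scaling.
inferred_numeric = toy_df.select_dtypes(include=['number']).columns.tolist()
# Everything else is treated here as a candidate for encoding.
inferred_categorical = toy_df.select_dtypes(exclude=['number']).columns.tolist()

print(inferred_categorical, inferred_numeric)
```

A numeric column can still be categorical in meaning (for example, an integer region code), so the inferred lists are a check on the manual lists rather than a replacement for them.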

# Create a ColumnTransformer named preprocessor that applies StandardScaler to the numerical features to standardize them and OneHotEncoder to the categorical features to encode them for machine learning models.

In [7]:
# Creating a preprocessor object using ColumnTransformer to apply different preprocessing steps to numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        # Applying StandardScaler to numerical features to standardize them (mean=0, variance=1)
        ('num', StandardScaler(), numeric_features),
        # Applying OneHotEncoder to categorical features to convert them into indicator columns;
        # handle_unknown='ignore' avoids an error if transform() later sees a category absent from the training data
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)


Conclusion: Using ColumnTransformer ensures that appropriate preprocessing techniques are applied to different types of features, facilitating effective data transformation and preparation for modeling.
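To make the effect of the transformer concrete, here is a minimal sketch on a hypothetical four-row DataFrame (`toy_df`, made-up values). It is not the notebook's data; it only shows how one numeric column and one categorical column map to output columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with three distinct areas.
toy_df = pd.DataFrame({
    'Flood_Area': ['North', 'South', 'East', 'North'],
    'MonsoonIntensity': [4.0, 6.0, 8.0, 6.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['MonsoonIntensity']),
        ('cat', OneHotEncoder(), ['Flood_Area']),
    ]
)

transformed = toy_preprocessor.fit_transform(toy_df)
# One scaled numeric column plus one indicator column per distinct area.
print(transformed.shape)
```

The output has four columns: the scaled `MonsoonIntensity` followed by three indicator columns, one per distinct `Flood_Area` value.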

# Split the dataset into training and testing sets, with 80% of the data used for training and 20% for testing, while ensuring reproducibility by setting a fixed random_state

In [8]:
# Splitting the dataset into training and testing sets with 80% of the data for training and 20% for testing, ensuring reproducibility with a fixed random state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

Conclusion: Splitting the data into training and testing sets ensures that the model is evaluated on unseen data, and using a fixed random state guarantees that the split is reproducible.
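The two properties named above, the 80/20 proportion and reproducibility, can be checked on a small synthetic array. The names `toy_X` and `toy_y` are hypothetical and the data is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten arbitrary rows with two columns each.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Two calls with the same random_state.
a_train, a_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# 8 rows go to training and 2 to testing.
print(len(a_train), len(a_test))

# The same seed selects the same test rows both times.
same_split = np.array_equal(a_test, b_test)
print(same_split)
```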

# Fit the preprocessor to the training data, applying standard scaling and one-hot encoding, and then transform both the training and test data using the fitted preprocessor to ensure consistent preprocessing.

In [9]:
# Fitting the preprocessor to the training data and transforming it, applying standard scaling and one-hot encoding as defined
X_train = preprocessor.fit_transform(X_train)
# Transforming the test data using the same preprocessor fitted to the training data to ensure consistency in feature scaling and encoding
X_test = preprocessor.transform(X_test)

Conclusion: Fitting the preprocessor on the training data only, and then applying it unchanged to both the training and test data, keeps feature scaling and encoding consistent and prevents test-set statistics from leaking into training. Both matter for a fair evaluation.
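A small sketch of what "fit on training, transform both" means for the scaler alone. The arrays `toy_train` and `toy_test` are hypothetical; the point is that the stored mean comes from the training rows and is reused for the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[10.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows.
scaled_train = scaler.fit_transform(toy_train)
# transform reuses those training statistics; nothing is learned from the test row.
scaled_test = scaler.transform(toy_test)

print(scaler.mean_)   # mean of the training column only
print(scaled_test)    # the test value expressed in training standard deviations
```

Calling `fit_transform` on the test data instead would recompute the statistics from the test rows, which is the inconsistency the notebook's approach avoids.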

# Imports libraries for building and evaluating a regression model, and for saving/loading the model

In [10]:
# Importing RandomForestRegressor for building a regression model using the random forest algorithm
from sklearn.ensemble import RandomForestRegressor 
# Importing mean_squared_error to evaluate the performance of the regression model
from sklearn.metrics import mean_squared_error  
# Importing pickle for saving and loading the trained model to and from files
import pickle  


Conclusion: These imports enable you to build and assess a regression model, and manage model persistence through saving and loading, facilitating model deployment and reuse.

# Create a RandomForestRegressor model with 100 trees and a fixed random state for reproducibility, and then train the model using the preprocessed training data.

In [11]:
# Creating a RandomForestRegressor model with 100 trees and a fixed random state for reproducibility
model = RandomForestRegressor(n_estimators=100, random_state=42)
# Training the model on the preprocessed training data
model.fit(X_train, y_train)  


Conclusion: Configuring the RandomForestRegressor with a specific number of trees and a fixed random state ensures reproducible results, and training the model on preprocessed data allows it to learn patterns from the training set.
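`Pipeline` was imported at the top of the notebook, but the steps here are applied by hand. As an alternative, the sketch below shows how preprocessing and the model could be chained into a single estimator. It is not the notebook's method: the data is synthetic, and `toy_X`, `toy_y`, and `toy_pipeline` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target is just a rescaled copy of one feature.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'Flood_Area': rng.choice(['North', 'South', 'East'], size=30),
    'MonsoonIntensity': rng.uniform(0, 10, size=30),
})
toy_y = toy_X['MonsoonIntensity'] / 10.0

# Preprocessing and model combined; fit() runs both steps in order.
toy_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['MonsoonIntensity']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Flood_Area']),
    ])),
    ('model', RandomForestRegressor(n_estimators=10, random_state=42)),
])

toy_pipeline.fit(toy_X, toy_y)
# predict() accepts raw rows; the pipeline preprocesses them internally.
toy_pred = toy_pipeline.predict(toy_X.head(3))
print(toy_pred)
```

With this design a single object carries both the fitted preprocessing and the fitted model, so they cannot drift apart when saved or reused.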

# These lines predict target variable values for the test set using the trained model, calculate the Mean Squared Error (MSE) between the actual and predicted values, and print the MSE to evaluate the model's performance.

In [12]:
# Predicting the target variable values for the test set using the trained model
y_pred = model.predict(X_test)  
# Calculating the Mean Squared Error (MSE) between the actual and predicted values for the test set
mse = mean_squared_error(y_test, y_pred)  
# Printing the Mean Squared Error to evaluate the model's performance
print(f'Mean Squared Error: {mse}')  

Mean Squared Error: 0.0013818382875000002


Conclusion: The MSE gives a quantitative measure of how far the model's predictions are from the actual values on unseen data. Because MSE is expressed in squared units of the target, its square root (RMSE) is easier to read on the FloodProbability scale.
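The arithmetic behind the metric can be sketched as follows. The first part takes the square root of the MSE value printed above; the second part computes an MSE by hand on three made-up value pairs (`toy_true`, `toy_pred`), which are hypothetical and unrelated to the real predictions.

```python
import math
import numpy as np

# MSE value copied from the notebook output.
reported_mse = 0.0013818382875000002
# RMSE is on the same scale as the target.
rmse = math.sqrt(reported_mse)
print(f'RMSE: {rmse:.4f}')

# Hand-computed MSE on made-up values: the mean of the squared differences.
toy_true = np.array([0.50, 0.40, 0.60])
toy_pred = np.array([0.55, 0.35, 0.60])
toy_mse = np.mean((toy_true - toy_pred) ** 2)
print(f'Toy MSE: {toy_mse}')
```

The RMSE comes out at roughly 0.037, which can be read as a typical prediction error in units of FloodProbability.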

# Save the trained RandomForestRegressor model and the preprocessor object to files named 'flood_model.pkl' and 'preprocessor.pkl' respectively, using pickle.

In [13]:
# Saving the trained RandomForestRegressor model to a file named 'flood_model.pkl' for future use;
# the with-block closes the file handle once the dump completes
with open('flood_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
# Saving the preprocessor object to a file named 'preprocessor.pkl' to ensure consistency in data preprocessing for future predictions
with open('preprocessor.pkl', 'wb') as preprocessor_file:
    pickle.dump(preprocessor, preprocessor_file)


Conclusion: Saving the model and preprocessor ensures that they can be loaded and reused later, allowing consistent data preprocessing and model inference without retraining.
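The reverse step, loading a saved model and predicting with it, might look like the sketch below. It uses a small hypothetical model (`toy_model`) and a temporary file rather than `flood_model.pkl`, so it runs without the notebook's data.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical model trained on a few made-up points.
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([0.0, 0.1, 0.2, 0.3])
toy_model = RandomForestRegressor(n_estimators=5, random_state=42).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.pkl')
    # Save, then load back from the same file.
    with open(model_path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(model_path, 'rb') as f:
        restored_model = pickle.load(f)

# The restored model reproduces the original predictions.
same_predictions = np.allclose(toy_model.predict(toy_X), restored_model.predict(toy_X))
print(same_predictions)
```

In the notebook's setting, the analogous flow would load both 'preprocessor.pkl' and 'flood_model.pkl', call `transform` on the new rows with the loaded preprocessor, and pass the result to the loaded model's `predict`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.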