<a href="https://www.kaggle.com/code/gizemnalbantarslan/regression-with-a-flood-prediction?scriptVersionId=199070967" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# <p style="text-align:center;"> Regression with a Flood Prediction Dataset </p>

<span style="font-size:18px;"> The goal of this competition is to predict the probability of a region flooding based on various factors. </span>

In [None]:
# import and requirements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,accuracy_score
from sklearn.model_selection import GridSearchCV,train_test_split,cross_validate, RandomizedSearchCV, validation_curve
from sklearn.preprocessing import StandardScaler

warnings.simplefilter(action='ignore', category=FutureWarning)
# warnings.simplefilter("ignore", category=ConvergenceWarning)

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# 1. Exploratory Data Analysis

In [None]:
train = pd.read_csv("../input/playground-series-s4e5/train.csv")
tr=train.copy()
test = pd.read_csv("../input/playground-series-s4e5/test.csv")
ts=test.copy()

> # 1.1 Definition of Functions

<span style="font-size:18px;">First of all, let's define our functions. The functions we will use here and their tasks are as follows: 

* <span style="font-size:18px;"> **check_df:** overview of the dataset
* <span style="font-size:18px;"> **cat_summary:** analysis of categorical variables
* <span style="font-size:18px;"> **num_summary:** review of numeric variables
* <span style="font-size:18px;"> **target_summary_with_num:** analysis of the relationship between numeric variables and the target variable
* <span style="font-size:18px;"> **target_summary_with_cat:** analysis of the relationship between categorical variables and the target variable
* <span style="font-size:18px;"> **correlation_matrix:** analysis of correlations
* <span style="font-size:18px;"> **grab_col_names:** detailed categorization of variables </span>

In [None]:
def check_df(dataframe):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(3))
    print("##################### Tail #####################")
    print(dataframe.tail(3))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### describe #####################")
    print(dataframe.describe())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

In [None]:
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show(block=True)

In [None]:
def num_summary(dataframe, numerical_col, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=20)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show(block=True)

In [None]:
def target_summary_with_num(dataframe, target, numerical_col):
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")

def target_summary_with_cat(dataframe, target, categorical_col):
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean()}), end="\n\n\n")

In [None]:
def correlation_matrix(df, cols):
    fig = plt.gcf()
    fig.set_size_inches(20,20)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    fig = sns.heatmap(df[cols].corr(), annot=False, linewidths=0.5, annot_kws={'size': 12}, linecolor='w', cmap='RdBu')
    plt.show(block=True)

In [None]:
def grab_col_names(dataframe, cat_th=10, car_th=20):

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    # print(f"Observations: {dataframe.shape[0]}")
    # print(f"Variables: {dataframe.shape[1]}")
    # print(f'cat_cols: {len(cat_cols)}')
    # print(f'num_cols: {len(num_cols)}')
    # print(f'cat_but_car: {len(cat_but_car)}')
    # print(f'num_but_cat: {len(num_but_cat)}')
    return cat_cols, num_cols, cat_but_car
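
<span style="font-size:18px;"> To make the categorization rule concrete, here is a minimal sketch on a made-up three-column frame. The frame `toy` and the lowered threshold `cat_th = 3` are assumptions for illustration only; the logic mirrors the rule in grab_col_names but leaves out the cardinality check. </span>

```python
import pandas as pd

# Made-up frame: one text column, one low-cardinality numeric column,
# and one genuinely numeric column.
toy = pd.DataFrame({
    "city": pd.Series(["a", "b", "a", "c"], dtype=object),
    "flag": [0, 1, 0, 1],
    "score": [0.10, 0.40, 0.35, 0.80],
})

cat_th = 3  # illustrative threshold, lower than the default of 10

# Numeric columns with fewer than cat_th unique values count as categorical
toy_num_but_cat = [c for c in toy.columns
                   if toy[c].dtype != "O" and toy[c].nunique() < cat_th]
toy_cat_cols = [c for c in toy.columns if toy[c].dtype == "O"] + toy_num_but_cat
toy_num_cols = [c for c in toy.columns
                if toy[c].dtype != "O" and c not in toy_num_but_cat]

print(toy_cat_cols)  # ['city', 'flag']
print(toy_num_cols)  # ['score']
```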

> # 1.2 Data Analysis

In [None]:
check_df(tr)

<span style="font-size:18px;">There are no NA values, and the quantiles show no anomalous distribution. </span>

In [None]:
cat_cols, num_cols, cat_but_car = grab_col_names(tr)
num_cols = [col for col in num_cols if col not in ("FloodProbability", "id")]

<span style="font-size:18px;"> As this analysis shows, the data contains no categorical or high-cardinality variables; all variables are numeric.
Therefore, the following steps analyze the numeric variables only. </span>

In [None]:
# Analysis of num_cols with graphics
plt.figure(figsize=(20, 20))
for col in num_cols:
    plt.subplot(7, 3, num_cols.index(col) + 1)
    sns.histplot(tr[col], bins=20)
    plt.title(col)
plt.tight_layout()
plt.show()

In [None]:
#correlation analysis of numerical variables
correlation_matrix(tr, num_cols)

In [None]:
# target analysis of numerical variables
for col in num_cols:
    target_summary_with_num(tr, "FloodProbability", col)

# 2. Data Preprocessing & Feature Engineering


> # 2.1 Definition of Functions

<span style="font-size:18px;">We define the functions to prepare the data. 

* <span style="font-size:18px;">**outlier_thresholds:** computes the lower and upper outlier limits from the interquartile range (IQR).
* <span style="font-size:18px;">**replace_with_thresholds:** caps values outside these limits at the limits; no rows are removed.
* <span style="font-size:18px;">**check_outlier:** checks whether a variable contains any outliers.</span>

In [None]:
def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit

In [None]:
def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

In [None]:
def check_outlier(dataframe, col_name, q1=0.25, q3=0.75):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False
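
<span style="font-size:18px;"> As a worked illustration of the IQR rule, the sketch below computes the limits for a made-up column and caps the one extreme value. The frame `toy_out` and its values are assumptions; `clip` is used here as a compact stand-in for the two `.loc` assignments in replace_with_thresholds. </span>

```python
import pandas as pd

# Made-up column with one obvious extreme value
toy_out = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 100]})

toy_q1 = toy_out["x"].quantile(0.25)   # 3.0
toy_q3 = toy_out["x"].quantile(0.75)   # 7.0
toy_iqr = toy_q3 - toy_q1              # 4.0

toy_low = toy_q1 - 1.5 * toy_iqr       # -3.0
toy_up = toy_q3 + 1.5 * toy_iqr        # 13.0

# Values outside [toy_low, toy_up] are pulled back to the nearest limit
toy_capped = toy_out["x"].clip(lower=toy_low, upper=toy_up)
print(toy_capped.tolist())
```

<span style="font-size:18px;"> Only the value 100 lies outside the limits, so it is replaced by the upper limit of 13 while every other value stays as it was. </span>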

> # 2.2 Definition of New Variables

<span style="font-size:18px;"> Let's extract new variables from existing variables and define all operations as a function so that we can easily perform operations for the test dataset. </span>

In [None]:
def add_features(dataframe):
    # Distinct name so this feature is not overwritten by NEW_musdrasy below
    dataframe["NEW_mustopdra"] = dataframe["MonsoonIntensity"] * dataframe["TopographyDrainage"]
    dataframe["NEW_musinefdis"] = dataframe["MonsoonIntensity"] * dataframe["IneffectiveDisasterPreparedness"]
    dataframe["NEW_musdrasy"] = dataframe["MonsoonIntensity"] * dataframe["DrainageSystems"]
    dataframe["NEW_musdeter"] = dataframe["MonsoonIntensity"] * dataframe["DeterioratingInfrastructure"]
    dataframe["NEW_deforlandsl"] = dataframe["Deforestation"] * dataframe["Landslides"]
    dataframe["NEW_deforpolit"] = dataframe["Deforestation"] * dataframe["PoliticalFactors"]
    dataframe["NEW_urbdrasy"] = dataframe["Urbanization"] * dataframe["DrainageSystems"]
    dataframe["NEW_urbdeter"] = dataframe["Urbanization"] * dataframe["DeterioratingInfrastructure"]
    dataframe["NEW_urbinadeq"] = dataframe["Urbanization"] * dataframe["InadequatePlanning"]
    dataframe["NEW_clmagr"] = dataframe["ClimateChange"] * dataframe["AgriculturalPractices"]
    dataframe["NEW_damdeter"] = dataframe["DamsQuality"] * dataframe["DeterioratingInfrastructure"]
    dataframe["NEW_dampop"] = dataframe["DamsQuality"] * dataframe["PopulationScore"]
    dataframe["NEW_popinadeq"] = dataframe["PopulationScore"] * dataframe["InadequatePlanning"]
    dataframe["NEW_inadeqpolit"] = dataframe["InadequatePlanning"] * dataframe["PoliticalFactors"]
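
<span style="font-size:18px;"> The same idea can be written more generically. The helper below is a hypothetical sketch, not part of the original notebook: `add_interactions`, its `pairs` argument and the `_x_` naming scheme are assumptions. It builds one product column per pair and returns a copy instead of modifying the input in place. </span>

```python
import pandas as pd

def add_interactions(dataframe, pairs, prefix="NEW_"):
    """Return a copy of dataframe with one product column per (left, right) pair."""
    out = dataframe.copy()
    for left, right in pairs:
        out[f"{prefix}{left}_x_{right}"] = out[left] * out[right]
    return out

# Made-up values for two of the dataset's column names
toy_feat = pd.DataFrame({"MonsoonIntensity": [2, 5], "DrainageSystems": [3, 4]})
toy_feat_new = add_interactions(toy_feat, [("MonsoonIntensity", "DrainageSystems")])
print(toy_feat_new)
```

<span style="font-size:18px;"> Deriving the column name from the pair makes it harder to assign two different products to the same name by accident. </span>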

In [None]:
add_features(tr)
tr.shape

<span style="font-size:18px;"> Since the quantiles of the variables do not differ sharply and the variables have similar distributions, we did not derive categorical variables from them.

<span style="font-size:18px;"> Instead, we derived new numeric variables by multiplying pairs of existing variables. 

<span style="font-size:18px;"> Because we have added new variables, we repeat some of the earlier steps. </span>

In [None]:
check_df(tr)

> # 2.3 Data Preprocessing with New Variables

In [None]:
cat_cols, num_cols, cat_but_car = grab_col_names(tr)
num_cols = [col for col in num_cols if col not in ("FloodProbability", "id")]

In [None]:
# target analysis of numerical variables
for col in num_cols:
    target_summary_with_num(tr, "FloodProbability", col)

<span style="font-size:18px;">  We check for outliers and cap them at the threshold values. </span>

In [None]:
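# Loop over the feature columns only, so "id" and the target are never capped
for col in num_cols:
    # Report outliers against the wider 5%-95% limits
    print(col, check_outlier(tr, col, 0.05, 0.95))
    # Cap using the default 25%-75% (IQR) limits
    if check_outlier(tr, col):
        replace_with_thresholds(tr, col)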

<span style="font-size:18px;"> After suppression, we analyze outliers again.</span>

In [None]:
for col in num_cols:
    print(col, check_outlier(tr, col, 0.05, 0.95))

> # 2.4 StandardScaler


<span style="font-size:18px;">Once all outliers have been suppressed, we can proceed with standardization. 

* <span style="font-size:18px;"> Since we do not have categorical variables, we do not apply one-hot encoding.
* <span style="font-size:18px;"> For numeric variables, we continue with StandardScaler.</span>

In [None]:
X_scaled = StandardScaler().fit_transform(tr[num_cols])
tr[num_cols] = pd.DataFrame(X_scaled, columns=tr[num_cols].columns)
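
<span style="font-size:18px;"> In this notebook the scaler is fitted separately on each dataset. A common alternative, sketched below on made-up data, is to keep the fitted scaler and reuse its mean and standard deviation for the test set. The names `toy_train`, `toy_test` and `toy_scaler` are assumptions for illustration. </span>

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up single-column datasets
toy_train = pd.DataFrame({"a": [0.0, 2.0, 4.0]})
toy_test = pd.DataFrame({"a": [2.0, 6.0]})

# Learn the mean and standard deviation from the training data only
toy_scaler = StandardScaler().fit(toy_train[["a"]])

# Apply the same statistics to both datasets
toy_train_scaled = toy_scaler.transform(toy_train[["a"]])
toy_test_scaled = toy_scaler.transform(toy_test[["a"]])

print(toy_scaler.mean_)          # mean learned from toy_train
print(toy_test_scaled.ravel())   # test values expressed in training units
```

<span style="font-size:18px;"> With this approach a test value equal to the training mean maps to 0, whatever the distribution of the test set looks like. </span>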

In [None]:
y = tr["FloodProbability"]
X = tr.drop(["FloodProbability"], axis=1)

> # 2.5 Data Preprocessing Function 

<span style="font-size:18px;"> Let's put all these operations in a function so that we can apply the same steps to the test dataset without repeating the code.  </span>

In [None]:
def flood_data_prep(dataframe,graphs=True):
    cat_cols, num_cols, cat_but_car = grab_col_names(dataframe)
    num_cols = [col for col in num_cols if col not in ("FloodProbability", "id")]
    
    # Analysis of num_cols with graphics
    if graphs:
        plt.figure(figsize=(20, 20))
        for col in num_cols:
            plt.subplot(7, 3, num_cols.index(col) + 1)
            sns.histplot(dataframe[col], bins=20)
            plt.title(col)
        plt.tight_layout()
        plt.show()
    
        # correlation analysis of numerical variables (only when graphs=True)
        correlation_matrix(dataframe, num_cols)
    
    # adding new features
    add_features(dataframe)
    
    cat_cols, num_cols, cat_but_car = grab_col_names(dataframe)
    num_cols = [col for col in num_cols if col not in ("FloodProbability", "id")]
        
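    # report outliers (5%-95% limits) and cap them (default IQR limits),
    # looping over the feature columns only so "id" and the target are untouched
    for col in num_cols:
        print(col, check_outlier(dataframe, col, 0.05, 0.95))
        if check_outlier(dataframe, col):
            replace_with_thresholds(dataframe, col)
        
    # StandardScaler (fitted on whichever dataframe is passed in)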
    X_scaled = StandardScaler().fit_transform(dataframe[num_cols])
    dataframe[num_cols] = pd.DataFrame(X_scaled, columns=dataframe[num_cols].columns)

# 3. Model for r2 Score

In [None]:
# random_state is an arbitrary fixed seed so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
from sklearn.metrics import r2_score

In [None]:
lr = LinearRegression()
lr.fit(X_train,y_train)
y_pred_lr = lr.predict(X_test)
r2 = r2_score(y_test,y_pred_lr)
print(r2)

<span style="font-size:18px;"> When we evaluated the model with the r2 score on the 20% of the training data held out for validation, we found a value of approximately 0.84. 

<span style="font-size:18px;"> Now let's apply these operations to the test data, which the model has never seen.</span>
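
<span style="font-size:18px;"> For readers who want to see what the r2 score measures, the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, which is the definition `r2_score` uses. The four target values and predictions are made up for illustration. </span>

```python
import numpy as np

# Made-up true values and predictions
toy_y_true = np.array([0.40, 0.50, 0.60, 0.70])
toy_y_hat = np.array([0.45, 0.50, 0.55, 0.70])

# Residual sum of squares: error left over after the prediction
toy_ss_res = np.sum((toy_y_true - toy_y_hat) ** 2)

# Total sum of squares: error of always predicting the mean
toy_ss_tot = np.sum((toy_y_true - toy_y_true.mean()) ** 2)

toy_r2 = 1 - toy_ss_res / toy_ss_tot
print(round(toy_r2, 3))  # 0.9
```

<span style="font-size:18px;"> A score of 0.9 here means the predictions remove 90% of the squared error that a constant mean prediction would make. </span>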

# 4. Prediction on test data

<span style="font-size:18px;"> It will be easier for us to work on the test dataset because we have done our preliminary preparations and defined the necessary functions. The operations we will do are as follows: 

* <span style="font-size:18px;"> 1. Examine and prepare the data with the "flood_data_prep" function.
* <span style="font-size:18px;"> 2. Make predictions for the test dataset.</span>

In [None]:
flood_data_prep(ts)

In [None]:
# check again    
for col in num_cols:
    print(col, check_outlier(ts, col, 0.05, 0.95))

In [None]:
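# flood_data_prep has already standardized these columns, so this second
# fit leaves the values essentially unchanged
X_scaled = StandardScaler().fit_transform(ts[num_cols])
ts[num_cols] = pd.DataFrame(X_scaled, columns=ts[num_cols].columns)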

In [None]:
ts.head()

In [None]:
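# Refit on the full training data before predicting on the test set
model = LinearRegression()
model.fit(X, y)
# Select the training columns explicitly so feature names and order match
predictions = model.predict(ts[X.columns])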
dictionary = {"id":ts["id"], "FloodProbability":predictions}
dfSubmission = pd.DataFrame(dictionary)
dfSubmission.head()

# 5. CONCLUSION

In [None]:
dfSubmission.to_csv("FloodProbability.csv", index=False)

<span style="font-size:20px;"> We also made our prediction for the test dataset and transferred it to the submission file.

<span style="font-size:20px;"> Thank you for following along. I hope this work has been useful and inspiring for you as well. </span>