<a href="https://colab.research.google.com/github/fact-h/Graduation-project/blob/main/LightGBM_v1_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

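# Rapid Urban Flood Simulation Based on Machine Learning

- Goal: predict the maximum water depth at a given location from rainfall and tide-level time series
- Machine learning algorithm: [LightGBM](https://lightgbm.readthedocs.io/en/latest/)
- Model inputs: 10 rainfall and tide-level features
- Model output: maximum water depth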

Contents:
- Import modules


## Import the required modules

In [45]:
#@title Import modules
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

import lightgbm as lgb

import tensorflow as tf
from tensorflow.keras import layers
from matplotlib import pyplot as plt

## Data preprocessing

### Load the dataset
First upload the two CSV data files:
- `E:\毕业设计\数据\模型训练数据\X.csv`
- `E:\毕业设计\数据\模型训练数据\y.csv`

In [2]:
df_X_raw = pd.read_csv('/content/X.csv')
df_y = pd.read_csv('/content/y.csv')

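## Extract features
The 10 features are:
- 6 rainfall features: **cumulative rainfall, rainfall return period, peak rainfall, maximum 2 h rainfall, maximum 3 h rainfall, cumulative rainfall before the peak**
- 4 tide-level features: **maximum tide level, tide-level return period, mean tide level, maximum 5 h mean tide level**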
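
As a rough illustration of how the window-based features in this list can be computed, the sketch below builds a made-up hourly rainfall table and derives the rainfall features (except the return period) with pandas `rolling`. The column names `rain1` and `rain2`, the one-row-per-hour layout, and all values are assumptions for illustration; they are not taken from `X.csv`.

```python
import pandas as pd

# Hypothetical hourly rainfall for two design storms (values invented for illustration)
toy_rain = pd.DataFrame({
    "rain1": [1.0, 4.0, 10.0, 3.0, 2.0],
    "rain2": [2.0, 8.0, 20.0, 6.0, 4.0],
})

toy_features = pd.DataFrame({
    "CumRainfall": toy_rain.sum(),
    "RainfallPeak": toy_rain.max(),
    # Largest sum over 2 and over 3 consecutive rows
    "MaxRainfall2h": toy_rain.rolling(2).sum().max(),
    "MaxRainfall3h": toy_rain.rolling(3).sum().max(),
})

# Rows strictly before the peak of the first storm
toy_peak_pos = int(toy_rain["rain1"].to_numpy().argmax())
toy_features["CumRainfallBeforePeak"] = toy_rain.iloc[:toy_peak_pos].sum()

print(toy_features)
```

Each row of `toy_features` describes one storm, which is the same orientation as the feature tables built in the following cells.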

### Create the rainfall feature DataFrame `rain_feature_df` and the tide-level feature DataFrame `tide_feature_df`

In [3]:
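# Create an empty DataFrame for the rainfall features
rain_feature_df = pd.DataFrame()

# Columns 2-8 of X.csv hold the 7 rainfall series, one per return period
rain_df = df_X_raw.iloc[:, 2:9]

# Cumulative rainfall
rain_feature_df['CumRainfall'] = rain_df.sum()

# Rainfall return period
rain_feature_df['RainRP'] = [5,10,20,35,50,75,100]

# Peak rainfall
rain_feature_df['RainfallPeak'] = rain_df.max()

# Maximum 2 h rainfall: largest sum over 2 consecutive rows.
# rolling() works by position; pandas 2.x aligns DataFrames on their index labels
# before applying np.add, so adding shifted slices no longer sums consecutive rows.
rain_feature_df['MaxRainfall2h'] = rain_df.rolling(2).sum().max()

# Maximum 3 h rainfall: largest sum over 3 consecutive rows
rain_feature_df['MaxRainfall3h'] = rain_df.rolling(3).sum().max()

# Cumulative rainfall before the peak (rows before the peak of the first rainfall series)
peak_index = int(rain_df.iloc[:, 0].to_numpy().argmax())
rain_feature_df['CumRainfallBeforePeak'] = rain_df.iloc[0:peak_index].sum()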



# Create a DataFrame for the features extracted from the tide levels
tide_feature_df = pd.DataFrame()

# Columns 9 onward of X.csv hold the tide-level series, one per return period
tide_df = df_X_raw.iloc[:, 9:]

# Maximum tide level
tide_feature_df['MaxTide'] = tide_df.max()

# Tide-level return period
tide_feature_df['TideRP'] = [5,10,20,35,50,75,100]

# Mean tide level
tide_feature_df['MeanTide'] = tide_df.mean()

# Maximum 5 h mean tide level: largest mean over 5 consecutive rows
tide_feature_df['MaxTide5h'] = tide_df.rolling(5).mean().max()

  


## Combine the rainfall and tide-level data

### Reset the index, replacing the `rainx` / `tidex` labels with integers

In [4]:
# Reset the index to integers so that the data can be combined later
rain_feature_df = rain_feature_df.reset_index(drop=True) # drop=True discards the old index instead of keeping it as a column
tide_feature_df = tide_feature_df.reset_index(drop=True)

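### Combine the rainfall features, tide-level features and maximum water depth into 49 samples
First repeat each rainfall row 7 times, then use `concat` to chain 7 copies of the tide-level data end to end, which repeats the whole table 7 times.

Then use `join` to connect rainfall and tide levels, so that each rainfall return period is paired with all 7 tide-level return periods. Finally, add the water-depth data to obtain the full dataset `df_data`.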
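
The pairing described above is a Cartesian product of the rainfall and tide-level scenarios. As an alternative sketch (not the approach used in this notebook), pandas can build the same kind of table with `merge(how="cross")`. The two small frames and their values below are hypothetical.

```python
import pandas as pd

# Hypothetical scenario tables with 2 rainfall and 3 tide-level return periods
toy_rain_features = pd.DataFrame({"RainRP": [5, 10], "CumRainfall": [199.0, 230.0]})
toy_tide_features = pd.DataFrame({"TideRP": [5, 10, 20], "MaxTide": [2.9, 3.2, 3.4]})

# Cross join: every rainfall row is paired with every tide-level row
toy_X = toy_rain_features.merge(toy_tide_features, how="cross")

print(toy_X.shape)
print(toy_X[["RainRP", "TideRP"]])
```

If the water depths are attached by row position, as in this notebook, the row order of the combined table should be checked against the order of the depth file before joining.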

In [5]:
# Repeat each rainfall row 7 times (once per tide-level return period)
rain_repeat_df = pd.DataFrame(np.repeat(rain_feature_df.values,tide_feature_df.shape[0],axis=0))
rain_repeat_df.columns = rain_feature_df.columns
# Repeat the whole tide-level table 7 times (once per rainfall return period)
tide_concat_df = pd.concat([tide_feature_df] * rain_feature_df.shape[0]).reset_index(drop=True)

# Combine the rainfall and tide-level features
df_X = rain_repeat_df.join(tide_concat_df)
# Add the output variable: water depth
df_data = df_X.join(df_y['depth'])
df_data.head()

|  | CumRainfall | RainRP | RainfallPeak | MaxRainfall2h | MaxRainfall3h | CumRainfallBeforePeak | MaxTide | TideRP | MeanTide | MaxTide5h | depth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 199.101882 | 5.0 | 56.394737 | 86.377019 | 99.455229 | 35.806351 | 2.8989 | 5 | 1.872049 | 2.579491 | 0.0 |
| 1 | 199.101882 | 5.0 | 56.394737 | 86.377019 | 99.455229 | 35.806351 | 3.1596 | 10 | 2.040403 | 2.811467 | 0.23 |
| 2 | 199.101882 | 5.0 | 56.394737 | 86.377019 | 99.455229 | 35.806351 | 3.4009 | 20 | 2.19623 | 3.026179 | 0.33 |
| 3 | 199.101882 | 5.0 | 56.394737 | 86.377019 | 99.455229 | 35.806351 | 3.585497 | 35 | 2.315439 | 3.190437 | 0.46 |
| 4 | 199.101882 | 5.0 | 56.394737 | 86.377019 | 99.455229 | 35.806351 | 3.7012 | 50 | 2.390157 | 3.293392 | 0.61 |


In [6]:
# Standardization: z-score
df_data_mean = df_data.mean()
df_data_std = df_data.std()
df_data_norm = (df_data - df_data_mean) / df_data_std
df_data_norm

|  | CumRainfall | RainRP | RainfallPeak | MaxRainfall2h | MaxRainfall3h | CumRainfallBeforePeak | MaxTide | TideRP | MeanTide | MaxTide5h | depth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | -1.738472 | -1.127397 | -1.738472 | -1.738472 | -2.244169 |
| 1 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | -0.982866 | -0.975632 | -0.982866 | -0.982866 | -1.299035 |
| 2 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | -0.283489 | -0.672102 | -0.283489 | -0.283489 | -0.888107 |
| 3 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | 0.251543 | -0.216807 | 0.251543 | 0.251543 | -0.353901 |
| 4 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | 0.586893 | 0.238488 | 0.586893 | 0.586893 | 0.262491 |
| 5 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | 0.952582 | 0.997313 | 0.952582 | 0.952582 | 0.796697 |
| 6 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | 1.213811 | 1.756138 | 1.213811 | 1.213811 | 1.207625 |
| 7 | -1.016574 | -0.975632 | -1.016574 | -1.016574 | -1.016574 | -1.016574 | -1.738472 | -1.127397 | -1.738472 | -1.738472 | -1.422313 |
| 8 | -1.016574 | -0.975632 | -1.016574 | -1.016574 | -1.016574 | -1.016574 | -0.982866 | -0.975632 | -0.982866 | -0.982866 | -1.175757 |
| 9 | -1.016574 | -0.975632 | -1.016574 | -1.016574 | -1.016574 | -1.016574 | -0.283489 | -0.672102 | -0.283489 | -0.283489 | -0.847014 |
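
For reference, the z-score transform used in this step and its inverse can be written as two small helpers. This is a minimal sketch on invented numbers; the function names `standardize` and `destandardize` are hypothetical. Like `DataFrame.std()`, it uses the sample standard deviation (`ddof=1`).

```python
import pandas as pd

def standardize(df, mean, std):
    """Map each column to zero mean and unit (sample) standard deviation."""
    return (df - mean) / std

def destandardize(df_norm, mean, std):
    """Invert standardize(): recover values in the original scale."""
    return df_norm * std + mean

# Hypothetical data (values invented for illustration)
toy = pd.DataFrame({"MaxTide": [2.9, 3.2, 3.4, 3.6], "depth": [0.0, 0.2, 0.3, 0.5]})
toy_mean, toy_std = toy.mean(), toy.std()  # pandas std() defaults to ddof=1

toy_norm = standardize(toy, toy_mean, toy_std)
toy_back = destandardize(toy_norm, toy_mean, toy_std)

print(toy_norm.mean().round(6))
print(toy_norm.std().round(6))
```

Keeping the mean and standard deviation as separate objects is what later allows predictions to be mapped back to the original scale.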


In [7]:
# Data augmentation: perturb each value with Gaussian noise proportional to the value (scale 0.005)
df_data_augmented = df_data_norm + 0.005 * df_data_norm * np.random.standard_normal(size=df_data_norm.shape)
df_data_augmented

|  | CumRainfall | RainRP | RainfallPeak | MaxRainfall2h | MaxRainfall3h | CumRainfallBeforePeak | MaxTide | TideRP | MeanTide | MaxTide5h | depth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.638597 | -1.132505 | -1.638855 | -1.649347 | -1.652165 | -1.643009 | -1.727972 | -1.13143 | -1.729869 | -1.737032 | -2.24042 |
| 1 | -1.625219 | -1.122769 | -1.636227 | -1.641314 | -1.628225 | -1.63765 | -0.984799 | -0.97265 | -0.98401 | -0.989552 | -1.294526 |
| 2 | -1.644028 | -1.131 | -1.638359 | -1.637321 | -1.640108 | -1.638998 | -0.283161 | -0.673239 | -0.281139 | -0.284249 | -0.883818 |
| 3 | -1.618174 | -1.127088 | -1.635215 | -1.645793 | -1.630973 | -1.637533 | 0.252102 | -0.217786 | 0.251533 | 0.252309 | -0.352799 |
| 4 | -1.640206 | -1.128005 | -1.633753 | -1.634919 | -1.641359 | -1.649018 | 0.590669 | 0.237625 | 0.587945 | 0.589927 | 0.264668 |
| 5 | -1.636057 | -1.121802 | -1.626772 | -1.642178 | -1.634103 | -1.63565 | 0.952843 | 1.002178 | 0.956052 | 0.954282 | 0.798041 |
| 6 | -1.622015 | -1.12859 | -1.629927 | -1.633073 | -1.631687 | -1.635738 | 1.214343 | 1.7713 | 1.204639 | 1.20931 | 1.204751 |
| 7 | -1.016935 | -0.97933 | -1.014657 | -1.020277 | -1.015498 | -1.014891 | -1.749625 | -1.134062 | -1.740176 | -1.743517 | -1.427677 |
| 8 | -1.020393 | -0.977904 | -1.01566 | -1.02653 | -1.010455 | -1.010218 | -0.984334 | -0.980573 | -0.984333 | -0.98863 | -1.173176 |
| 9 | -1.011146 | -0.970126 | -1.005075 | -1.024109 | -1.019386 | -1.012082 | -0.282573 | -0.670702 | -0.283307 | -0.285272 | -0.85198 |
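
Because the noise is drawn at random, the augmented values change on every run. A minimal sketch of the same multiplicative perturbation with a seeded NumPy `Generator` is shown below. The seed `0`, the helper name `augment`, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def augment(df_norm, scale=0.005, seed=0):
    """Return df_norm * (1 + scale * eps), with eps drawn from N(0, 1)."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(size=df_norm.shape)
    return df_norm + scale * df_norm * noise

# Hypothetical standardized data (values invented for illustration)
toy_norm = pd.DataFrame({"MaxTide": [-1.0, 0.0, 1.0], "depth": [-1.2, 0.1, 1.1]})
toy_aug_a = augment(toy_norm)
toy_aug_b = augment(toy_norm)

# Same seed, same noise: the two augmented copies are identical
print(toy_aug_a.equals(toy_aug_b))
```

Since the noise is proportional to the value, entries equal to zero are left unchanged by this kind of augmentation.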


In [8]:
# Append the noisy samples to the dataset and shuffle the rows
df_all_data = pd.concat([df_data_norm,df_data_augmented])
df_all_data = df_all_data.reset_index(drop=True)
df_all_data = df_all_data.reindex(np.random.permutation(df_all_data.index))
df_all_data = df_all_data.reset_index(drop=True)
df_all_data

|  | CumRainfall | RainRP | RainfallPeak | MaxRainfall2h | MaxRainfall3h | CumRainfallBeforePeak | MaxTide | TideRP | MeanTide | MaxTide5h | depth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.644028 | -1.131000 | -1.638359 | -1.637321 | -1.640108 | -1.638998 | -0.283161 | -0.673239 | -0.281139 | -0.284249 | -0.883818 |
| 1 | -0.370592 | -0.672102 | -0.370592 | -0.370592 | -0.370592 | -0.370592 | -0.283489 | -0.672102 | -0.283489 | -0.283489 | -0.723736 |
| 2 | 0.546989 | 0.239377 | 0.547556 | 0.548359 | 0.547934 | 0.549118 | -0.977844 | -0.985344 | -0.973850 | -0.984427 | -0.882804 |
| 3 | 0.547887 | 0.237567 | 0.547652 | 0.551826 | 0.548791 | 0.546262 | -1.727288 | -1.120214 | -1.741160 | -1.740640 | -1.091757 |
| 4 | -0.374609 | -0.670914 | -0.370952 | -0.371207 | -0.372128 | -0.370611 | -1.723827 | -1.121339 | -1.733838 | -1.736293 | -1.263278 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 93 | 0.980497 | 0.998661 | 0.991453 | 0.985473 | 0.982148 | 0.990226 | -0.283008 | -0.677847 | -0.281727 | -0.281089 | -0.557767 |
| 94 | -1.635439 | -1.127397 | -1.635439 | -1.635439 | -1.635439 | -1.635439 | 0.251543 | -0.216807 | 0.251543 | 0.251543 | -0.353901 |
| 95 | -0.370592 | -0.672102 | -0.370592 | -0.370592 | -0.370592 | -0.370592 | -0.982866 | -0.975632 | -0.982866 | -0.982866 | -1.052478 |
| 96 | 0.986053 | 0.997313 | 0.986053 | 0.986053 | 0.986053 | 0.986053 | 0.952582 | 0.997313 | 0.952582 | 0.952582 | 1.248717 |


In [16]:
# Create the training, validation and test sets (70% / 20% / 10%)
train_split = round(0.7 * df_all_data.shape[0])
val_split = round(0.2 * df_all_data.shape[0])
test_split = round(0.1 * df_all_data.shape[0])

y_test = df_all_data.depth[0:test_split]
y_val = df_all_data.depth[test_split:(val_split + test_split)]
y_train = df_all_data.depth[(val_split + test_split):]

X_test = df_all_data[0:test_split].drop(['depth'], axis=1)
X_val = df_all_data[test_split:(val_split + test_split)].drop(['depth'], axis=1)
X_train = df_all_data[(val_split + test_split):].drop(['depth'], axis=1)
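
In this notebook the noisy copies are mixed into the dataset before it is split, so a test row can be a slightly perturbed copy of a training row. One possible variation, sketched below, splits the original samples first and augments only the training part. The helper name `split_then_augment`, the seed, and the toy data are hypothetical; this is not what the notebook does.

```python
import numpy as np
import pandas as pd

def split_then_augment(df_norm, test_frac=0.1, val_frac=0.2, scale=0.005, seed=0):
    """Shuffle, split into test/val/train, then add noisy copies to train only."""
    rng = np.random.default_rng(seed)
    shuffled = df_norm.iloc[rng.permutation(len(df_norm))].reset_index(drop=True)

    n_test = round(test_frac * len(shuffled))
    n_val = round(val_frac * len(shuffled))
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]

    # Multiplicative Gaussian noise, applied to the training rows only
    noisy = train + scale * train * rng.standard_normal(size=train.shape)
    train_aug = pd.concat([train, noisy], ignore_index=True)
    return train_aug, val, test

# Hypothetical standardized data: 20 samples, 2 features plus the target
toy = pd.DataFrame(np.random.default_rng(1).standard_normal((20, 3)),
                   columns=["MaxTide", "CumRainfall", "depth"])
toy_train, toy_val, toy_test = split_then_augment(toy)
print(len(toy_train), len(toy_val), len(toy_test))
```

With this ordering the validation and test sets contain only unperturbed samples that the model has not seen in any form.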

## Train the model

In [17]:
gbm = lgb.LGBMRegressor(num_leaves=30,
                        learning_rate=0.01,
                        n_estimators=200)
gbm.fit(X_train,y_train,
        eval_set=[(X_val,y_val)],
        eval_metric=['l1','l2'],
        callbacks=[lgb.early_stopping(5)])

[1]	valid_0's l1: 0.836503	valid_0's l2: 0.933815	valid_0's l2: 0.933815
Training until validation scores don't improve for 5 rounds.
[2]	valid_0's l1: 0.829764	valid_0's l2: 0.920167	valid_0's l2: 0.920167
[3]	valid_0's l1: 0.823092	valid_0's l2: 0.906791	valid_0's l2: 0.906791
[4]	valid_0's l1: 0.817122	valid_0's l2: 0.893683	valid_0's l2: 0.893683
[5]	valid_0's l1: 0.808662	valid_0's l2: 0.879619	valid_0's l2: 0.879619
[6]	valid_0's l1: 0.802779	valid_0's l2: 0.866972	valid_0's l2: 0.866972
[7]	valid_0's l1: 0.794486	valid_0's l2: 0.853401	valid_0's l2: 0.853401
[8]	valid_0's l1: 0.786876	valid_0's l2: 0.840107	valid_0's l2: 0.840107
[9]	valid_0's l1: 0.78055	valid_0's l2: 0.828096	valid_0's l2: 0.828096
[10]	valid_0's l1: 0.774825	valid_0's l2: 0.816326	valid_0's l2: 0.816326
[11]	valid_0's l1: 0.766869	valid_0's l2: 0.803689	valid_0's l2: 0.803689
[12]	valid_0's l1: 0.759525	valid_0's l2: 0.791311	valid_0's l2: 0.791311
[13]	valid_0's l1: 0.753419	valid_0's l2: 0.780132	valid_0's 

LGBMRegressor(learning_rate=0.01, n_estimators=200, num_leaves=30)

## Predict

In [18]:
# Predict
y_pred = gbm.predict(X_test,num_iteration=gbm.best_iteration_)

# Evaluation on the standardized data
rmse_test = mean_squared_error(y_test,y_pred) ** 0.5 # the square root of the MSE is the RMSE
print(f'The RMSE of prediction is: {rmse_test}')

The RMSE of prediction is: 0.22931484187481377


In [19]:
# Evaluation in the original scale
y_pred_raw = y_pred * df_data_std.depth + df_data_mean.depth
y_test_raw = y_test * df_data_std.depth + df_data_mean.depth

rmse_test_raw = mean_squared_error(y_test_raw,y_pred_raw) ** 0.5 # the square root of the MSE is the RMSE
print(f'The RMSE of raw prediction is: {rmse_test_raw}')

The RMSE of raw prediction is: 0.05580416514962073
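
The two RMSE values are related. De-standardizing multiplies every error by the standard deviation of `depth` and the mean cancels, so the RMSE in the original scale equals the standardized RMSE times that standard deviation. A small numerical check on made-up numbers follows; `toy_mean`, `toy_std`, the helper `rmse`, and the arrays are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical mean and standard deviation of the target
toy_mean, toy_std = 0.55, 0.25
y_true_norm = np.array([-1.0, -0.2, 0.4, 1.3])
y_pred_norm = np.array([-0.8, -0.3, 0.5, 1.0])

rmse_norm = rmse(y_true_norm, y_pred_norm)
rmse_raw = rmse(y_true_norm * toy_std + toy_mean, y_pred_norm * toy_std + toy_mean)

# The two printed numbers agree up to floating-point rounding
print(rmse_raw, rmse_norm * toy_std)
```

With the values reported in this notebook, 0.0558 / 0.2293 is about 0.243, which corresponds to the standard deviation of `depth` in `df_data`.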


In [47]:
depth = [list(y_test_raw),list(y_pred_raw)]

depth = np.transpose(depth)
cols = ['real','predict']
df = pd.DataFrame(data=depth,columns=cols)
# Absolute error |predict - real| (not divided by the real value, so it is not a relative error)
df['abs_error'] = np.abs(df.predict - df.real)
df

|  | real | predict | abs_error |
|---|---|---|---|
| 0 | 0.331044 | 0.366611 | 0.035567 |
| 1 | 0.37 | 0.321033 | 0.048967 |
| 2 | 0.33129 | 0.356359 | 0.025068 |
| 3 | 0.280441 | 0.356359 | 0.075917 |
| 4 | 0.238702 | 0.32794 | 0.089238 |
| 5 | 0.460268 | 0.52097 | 0.060702 |
| 6 | 0.391079 | 0.416524 | 0.025445 |
| 7 | 0.859651 | 0.773328 | 0.086323 |
| 8 | 0.39 | 0.356359 | 0.033641 |
| 9 | 0.341961 | 0.356359 | 0.014398 |


In [48]:
# Compute the Nash-Sutcliffe efficiency (NSE)
H_obs = y_test_raw
H_m = y_pred_raw
H_obs_mean = H_obs.mean()  # the NSE uses the mean of the observed values

NSE = 1 - ((H_obs - H_m)**2).sum() / ((H_obs - H_obs_mean)**2).sum()
print(f'The NSE of prediction is: {NSE}')

The NSE of prediction is: 0.8842957179954278


In [49]:
# Compute the R2 score
R2 = r2_score(y_test_raw,y_pred_raw)
print(f'The R2 score of prediction is: {R2}')

The R2 score of prediction is: 0.8842957179954278
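
The NSE and the R2 score coincide here because, for a single output, `r2_score` computes 1 - SS_res / SS_tot with SS_tot taken around the mean of the observed values, which is the same expression as the NSE. A minimal NumPy sketch with a hypothetical helper `nse` and invented data:

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SS_res / SS_tot around the observed mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    ss_res = np.sum((observed - simulated) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical observed and simulated depths
toy_obs = np.array([0.2, 0.4, 0.6, 0.8])
toy_sim = np.array([0.25, 0.35, 0.65, 0.7])

print(nse(toy_obs, toy_sim))
print(nse(toy_obs, toy_obs))                      # a perfect simulation scores 1.0
print(nse(toy_obs, np.full(4, toy_obs.mean())))   # predicting the observed mean scores 0.0
```

A value of 1 means a perfect match, and a value of 0 means the model is no better than predicting the mean of the observations.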
