<h2><center>Welcome to the Landslide Prediction Challenge</center></h2>

A landslide is the movement of a mass of rock, debris, or earth (soil) down a slope. As a common natural hazard, it can cause significant loss of life and property.


Hong Kong, a hilly and densely populated city, is frequently affected by extreme rainstorms, which makes it highly susceptible to rain-induced landslides on natural terrain.

<img src = "https://drive.google.com/uc?export=view&id=1-8sSI75AG3HM89nDJEwo6_KJbAEUXS-r">

Landslides are commonly identified by visual interpretation, which is labor-intensive and time-consuming.

***Thus, this hack will focus on automating the landslide identification process using artificial intelligence techniques.***

The approach uses high-resolution terrain information for terrain-based landslide identification. Auxiliary data, such as the lithology of the surface materials and a rainfall intensification factor, are also provided.


# **SETUP**

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
csv_path = "/content/drive/MyDrive/LandSlidePrevention/"

In [16]:
%%capture
!pip install catboost --quiet

# **Libraries**

In [17]:
# Import libraries
import pandas as pd
import random
import os
import numpy as np
import sklearn

from sklearn.metrics import f1_score

import lightgbm as lgb
import xgboost as xgb
import catboost
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')

In [18]:
print("version sklearn==",sklearn.__version__)
print("version lightgbm==",lgb.__version__)
print("version xgboost==",xgb.__version__)
print("version catboost==",catboost.__version__)
print("version pandas==",pd.__version__)
print("version numpy==",np.__version__)

version sklearn== 1.0.2
version lightgbm== 2.2.3
version xgboost== 0.90
version catboost== 1.1
version pandas== 1.3.5
version numpy== 1.21.6


In [19]:
#!pip freeze > requirements.txt

# **UTILS**

In [20]:
## Seeder
# seed (int): value used to seed the random number generators so runs are repeatable
def seed_everything(seed=0):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
## ------------------- 
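
A quick way to see what the seeder buys is to seed, draw, re-seed, and draw again. The sketch below repeats the helper so it runs on its own; `draw_sample` is a hypothetical function added only for the check. Two caveats that are assumptions about the wider setup rather than claims from this notebook: setting `PYTHONHASHSEED` inside an already running interpreter only affects child processes, and GPU training may still show small run-to-run differences.

```python
import os
import random

import numpy as np


def seed_everything(seed=0):
    """Same seeding steps as the notebook's helper."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)


def draw_sample():
    """Draw one value from each seeded generator (hypothetical helper)."""
    return random.random(), float(np.random.rand())


seed_everything(42)
first = draw_sample()
seed_everything(42)
second = draw_sample()

# Re-seeding replays the same pseudo-random sequence.
print(first == second)
```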

In [21]:
########################### Vars
#################################################################################
SEED = 42
seed_everything(SEED)

# **Load Data files**

In [22]:
# Read files to pandas dataframes
train = pd.read_csv(f'{csv_path}Train.csv')
test = pd.read_csv(f'{csv_path}Test.csv')
sample_submission = pd.read_csv(f'{csv_path}SampleSubmission.csv')

In [23]:
train.head()

Unnamed: 0,Sample_ID,1_elevation,2_elevation,3_elevation,4_elevation,5_elevation,6_elevation,7_elevation,8_elevation,9_elevation,...,17_sdoif,18_sdoif,19_sdoif,20_sdoif,21_sdoif,22_sdoif,23_sdoif,24_sdoif,25_sdoif,Label
0,1,130,129,127,126,123,126,125,124,122,...,1.281779,1.281743,1.28172,1.281684,1.281811,1.281788,1.281752,1.281729,1.281693,0
1,2,161,158,155,153,151,162,159,155,153,...,1.359639,1.359608,1.359587,1.359556,1.359683,1.359662,1.359631,1.35961,1.359579,1
2,3,149,151,154,156,158,154,157,158,160,...,1.365005,1.365025,1.365055,1.365075,1.364937,1.364967,1.364988,1.365018,1.365038,0
3,4,80,78,77,75,73,80,78,77,75,...,1.100708,1.100738,1.100759,1.100789,1.10063,1.10065,1.10068,1.1007,1.100731,0
4,5,117,115,114,112,110,115,113,111,110,...,1.28418,1.28413,1.284056,1.284006,1.284125,1.28405,1.284001,1.283926,1.283876,0


# **Feature Engineering**

In [24]:
groupCols = {"elevation": [[f"{i}_elevation" for i in range(1,6)],  [f"{i}_elevation" for i in range(6,11)], [f"{i}_elevation" for i in range(11,16)],
                           [f"{i}_elevation" for i in range(16,21)], [f"{i}_elevation" for i in range(21,26)] ],
             "lsfactor": [[f"{i}_lsfactor" for i in range(1,6)],  [f"{i}_lsfactor" for i in range(6,11)], [f"{i}_lsfactor" for i in range(11,16)],
                           [f"{i}_lsfactor" for i in range(16,21)], [f"{i}_lsfactor" for i in range(21,26)] ],
             "placurv": [[f"{i}_placurv" for i in range(1,6)],  [f"{i}_placurv" for i in range(6,11)], [f"{i}_placurv" for i in range(11,16)],
                           [f"{i}_placurv" for i in range(16,21)], [f"{i}_placurv" for i in range(21,26)] ],
             "procurv": [[f"{i}_procurv" for i in range(1,6)],  [f"{i}_procurv" for i in range(6,11)], [f"{i}_procurv" for i in range(11,16)],
                           [f"{i}_procurv" for i in range(16,21)], [f"{i}_procurv" for i in range(21,26)] ],
             "sdoif": [[f"{i}_sdoif" for i in range(1,6)],  [f"{i}_sdoif" for i in range(6,11)], [f"{i}_sdoif" for i in range(11,16)],
                           [f"{i}_sdoif" for i in range(16,21)], [f"{i}_sdoif" for i in range(21,26)] ],
             "slope": [[f"{i}_slope" for i in range(1,6)],  [f"{i}_slope" for i in range(6,11)], [f"{i}_slope" for i in range(11,16)],
                           [f"{i}_slope" for i in range(16,21)], [f"{i}_slope" for i in range(21,26)] ],
             "twi": [[f"{i}_twi" for i in range(1,6)],  [f"{i}_twi" for i in range(6,11)], [f"{i}_twi" for i in range(11,16)],
                           [f"{i}_twi" for i in range(16,21)], [f"{i}_twi" for i in range(21,26)] ],
             "aspect": [[f"{i}_aspect" for i in range(1,6)],  [f"{i}_aspect" for i in range(6,11)], [f"{i}_aspect" for i in range(11,16)],
                           [f"{i}_aspect" for i in range(16,21)], [f"{i}_aspect" for i in range(21,26)] ],
             }
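
The mapping above follows a regular pattern: for each feature, columns 1 to 25 are split into five consecutive groups of five. As a sketch, the same structure could be generated with a small helper; `make_group_cols` and `FEATURE_NAMES` are illustrative names, not part of the original notebook.

```python
# Feature names used in the notebook; each has columns "1_<name>" .. "25_<name>".
FEATURE_NAMES = ["elevation", "lsfactor", "placurv", "procurv",
                 "sdoif", "slope", "twi", "aspect"]


def make_group_cols(names, n_cells=25, group_size=5):
    """Split the 1..n_cells column names of each feature into consecutive groups."""
    return {
        name: [
            [f"{i}_{name}" for i in range(start, start + group_size)]
            for start in range(1, n_cells + 1, group_size)
        ]
        for name in names
    }


group_cols = make_group_cols(FEATURE_NAMES)
print(group_cols["twi"][0])  # ['1_twi', '2_twi', '3_twi', '4_twi', '5_twi']
```

Generating the dictionary this way keeps the grouping rule in one place, so changing the group size would not require editing eight near-identical entries.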

In [25]:
df = pd.DataFrame()
selectedCols = ['elevation', 'lsfactor', 'placurv', 'procurv', 'sdoif', 'slope', 'twi', 'aspect']

# Row-wise summary statistics over all 25 cells of each terrain feature
for i in selectedCols:
  cols = [x for x in train.columns if i in x]
  df[i+"_mean"] = train[cols].mean(axis = 1)
  df[i+"_median"] = train[cols].median(axis = 1)
  df[i+"_min"] = train[cols].min(axis = 1)
  df[i+"_max"] = train[cols].max(axis = 1)
  df[i+"_std"] = train[cols].std(axis = 1)
  df[i+"_range"] = df[i+"_max"] - df[i+"_min"]
  df[i+"_ratio1"] = df[i+"_max"] / df[i+"_min"]  # becomes inf/NaN where the minimum is 0
  df[i+"_ratio2"] = df[i+"_min"] / df[i+"_max"]

# The same statistics over each of the five groups of five cells defined in groupCols
for i in selectedCols:
  for idx in range(0,5):
    df[i+f"_mean_5_{idx}"] = train[groupCols[i][idx]].mean(axis = 1)
    df[i+f"_median_5_{idx}"] = train[groupCols[i][idx]].median(axis = 1)
    df[i+f"_min_5_{idx}"] = train[groupCols[i][idx]].min(axis = 1)
    df[i+f"_max_5_{idx}"] = train[groupCols[i][idx]].max(axis = 1)
    df[i+f"_std_5_{idx}"] = train[groupCols[i][idx]].std(axis = 1)
    df[i+f"_range_5_{idx}"] = df[i+f"_max_5_{idx}"] - df[i+f"_min_5_{idx}"]
    df[i+f"_ratio1_5_{idx}"] = df[i+f"_max_5_{idx}"] / df[i+f"_min_5_{idx}"]
    df[i+f"_ratio2_5_{idx}"] = df[i+f"_min_5_{idx}"] / df[i+f"_max_5_{idx}"]

# Count how many of the 25 cells fall into each geology class (1-7)
geologyCols = [x for x in train.columns if "geology" in x]
df[[f"geology_{i}" for i in range(1, 8)]] = np.nan
for index in range(len(df)):
  current_dict = train[geologyCols].iloc[index,:].value_counts().to_dict()
  for i in range(1, 8):
    df.loc[index, f"geology_{i}"] = current_dict.get(i, 0)

# Number of distinct geology classes per sample
df["geology_nunique"] = train[geologyCols].nunique(axis = 1)
df['Label'] = train.Label
df.head()

Unnamed: 0,elevation_mean,elevation_median,elevation_min,elevation_max,elevation_std,elevation_range,elevation_ratio1,elevation_ratio2,lsfactor_mean,lsfactor_median,...,aspect_ratio2_5_4,geology_1,geology_2,geology_3,geology_4,geology_5,geology_6,geology_7,geology_nunique,Label
0,119.44,119.0,110,130,5.795688,20,1.181818,0.846154,9.013694,8.703197,...,0.940183,0.0,0.0,25.0,0.0,0.0,0.0,0.0,1,0
1,156.2,156.0,150,162,4.092676,12,1.08,0.925926,8.013825,7.571757,...,0.947432,0.0,0.0,25.0,0.0,0.0,0.0,0.0,1,1
2,162.56,164.0,149,171,6.474823,22,1.147651,0.871345,10.958018,11.46899,...,0.801143,0.0,25.0,0.0,0.0,0.0,0.0,0.0,1,0
3,76.4,77.0,72,80,2.565801,8,1.111111,0.9,3.78572,3.400196,...,0.90582,0.0,25.0,0.0,0.0,0.0,0.0,0.0,1,0
4,109.16,109.0,102,117,3.954744,15,1.147059,0.871795,7.742521,7.587172,...,0.863444,0.0,10.0,0.0,0.0,15.0,0.0,0.0,2,0
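
The geology counts above are filled one row at a time with `.loc`, which can be slow on roughly ten thousand rows. A possible vectorized alternative is sketched below, assuming (as the loop's `range(1, 8)` suggests) that geology codes are the integers 1 to 7. `geology_counts` is a hypothetical helper, and the toy frame uses three geology cells instead of 25.

```python
import numpy as np
import pandas as pd


def geology_counts(frame, codes=range(1, 8)):
    """Count, per row, how many geology cells carry each code."""
    geo_cols = [c for c in frame.columns if "geology" in c]
    values = frame[geo_cols].to_numpy()
    out = pd.DataFrame(index=frame.index)
    for code in codes:
        # A boolean comparison summed across the row gives the count for this code.
        out[f"geology_{code}"] = (values == code).sum(axis=1).astype(float)
    return out


# Toy frame standing in for Train.csv (three geology cells per sample).
toy = pd.DataFrame({"1_geology": [3, 2], "2_geology": [3, 5], "3_geology": [3, 2]})
counts = geology_counts(toy)
print(counts)
```

The counts are cast to float only to match the `0.0` / `25.0` values seen in the output above.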


In [26]:
test_df = pd.DataFrame()

# Same feature engineering as for the training set, applied to the test set
for i in selectedCols:
  cols = [x for x in test.columns if i in x]
  test_df[i+"_mean"] = test[cols].mean(axis = 1)
  test_df[i+"_median"] = test[cols].median(axis = 1)
  test_df[i+"_min"] = test[cols].min(axis = 1)
  test_df[i+"_max"] = test[cols].max(axis = 1)
  test_df[i+"_std"] = test[cols].std(axis = 1)
  test_df[i+"_range"] = test_df[i+"_max"] - test_df[i+"_min"]
  test_df[i+"_ratio1"] = test_df[i+"_max"] / test_df[i+"_min"]  # becomes inf/NaN where the minimum is 0
  test_df[i+"_ratio2"] = test_df[i+"_min"] / test_df[i+"_max"]

# Statistics over each of the five groups of five cells defined in groupCols
for i in selectedCols:
  for idx in range(0,5):
    test_df[i+f"_mean_5_{idx}"] = test[groupCols[i][idx]].mean(axis = 1)
    test_df[i+f"_median_5_{idx}"] = test[groupCols[i][idx]].median(axis = 1)
    test_df[i+f"_min_5_{idx}"] = test[groupCols[i][idx]].min(axis = 1)
    test_df[i+f"_max_5_{idx}"] = test[groupCols[i][idx]].max(axis = 1)
    test_df[i+f"_std_5_{idx}"] = test[groupCols[i][idx]].std(axis = 1)
    test_df[i+f"_range_5_{idx}"] = test_df[i+f"_max_5_{idx}"] - test_df[i+f"_min_5_{idx}"]
    test_df[i+f"_ratio1_5_{idx}"] = test_df[i+f"_max_5_{idx}"] / test_df[i+f"_min_5_{idx}"]
    test_df[i+f"_ratio2_5_{idx}"] = test_df[i+f"_min_5_{idx}"] / test_df[i+f"_max_5_{idx}"]

# Count how many of the 25 cells fall into each geology class (1-7)
geologyCols = [x for x in test.columns if "geology" in x]
test_df[[f"geology_{i}" for i in range(1, 8)]] = np.nan
for index in range(len(test_df)):
  current_dict = test[geologyCols].iloc[index,:].value_counts().to_dict()
  for i in range(1, 8):
    test_df.loc[index, f"geology_{i}"] = current_dict.get(i, 0)

# Number of distinct geology classes per sample
test_df["geology_nunique"] = test[geologyCols].nunique(axis = 1)
test_df.head()

Unnamed: 0,elevation_mean,elevation_median,elevation_min,elevation_max,elevation_std,elevation_range,elevation_ratio1,elevation_ratio2,lsfactor_mean,lsfactor_median,...,aspect_ratio1_5_4,aspect_ratio2_5_4,geology_1,geology_2,geology_3,geology_4,geology_5,geology_6,geology_7,geology_nunique
0,117.0,118.0,109,123,4.769696,14,1.12844,0.886179,6.773817,8.138618,...,2.666667,0.375,0.0,25.0,0.0,0.0,0.0,0.0,0.0,1
1,184.64,185.0,181,189,2.252406,8,1.044199,0.957672,4.992889,5.424446,...,1.084574,0.922021,0.0,20.0,0.0,0.0,5.0,0.0,0.0,2
2,37.92,37.0,35,43,2.158703,8,1.228571,0.813953,3.488714,2.247645,...,9.61796,0.103972,0.0,25.0,0.0,0.0,0.0,0.0,0.0,1
3,132.24,132.0,123,140,4.342426,17,1.138211,0.878571,9.090602,8.958429,...,1.906624,0.524487,0.0,0.0,25.0,0.0,0.0,0.0,0.0,1
4,332.64,332.0,325,341,5.266878,16,1.049231,0.953079,8.841271,8.933039,...,1.389817,0.719519,0.0,0.0,25.0,0.0,0.0,0.0,0.0,1
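
The train and test cells compute identical statistics, so the logic could live in one function that is called twice. The sketch below is one way to do that under the notebook's naming scheme (`<feature>_<stat>` and `<feature>_<stat>_5_<idx>`); `summarise` and `build_features` are hypothetical names, and the toy frame has four `twi` cells split into two groups of two purely to keep the example small.

```python
import pandas as pd


def summarise(block, name, suffix=""):
    """Row-wise summary statistics for one block of columns."""
    stats = {
        "mean": block.mean(axis=1),
        "median": block.median(axis=1),
        "min": block.min(axis=1),
        "max": block.max(axis=1),
        "std": block.std(axis=1),
    }
    out = pd.DataFrame({f"{name}_{k}{suffix}": v for k, v in stats.items()})
    lo, hi = out[f"{name}_min{suffix}"], out[f"{name}_max{suffix}"]
    out[f"{name}_range{suffix}"] = hi - lo
    out[f"{name}_ratio1{suffix}"] = hi / lo
    out[f"{name}_ratio2{suffix}"] = lo / hi
    return out


def build_features(frame, names, group_cols):
    """Overall statistics per feature, then statistics per column group."""
    parts = [
        summarise(frame[[c for c in frame.columns if n in c]], n) for n in names
    ]
    for n in names:
        for idx, cols in enumerate(group_cols[n]):
            parts.append(summarise(frame[cols], n, f"_5_{idx}"))
    return pd.concat(parts, axis=1)


# Toy data: two samples, four twi cells, two groups of two cells.
toy = pd.DataFrame({"1_twi": [1.0, 2.0], "2_twi": [2.0, 4.0],
                    "3_twi": [3.0, 6.0], "4_twi": [4.0, 8.0]})
toy_groups = {"twi": [["1_twi", "2_twi"], ["3_twi", "4_twi"]]}

feats = build_features(toy, ["twi"], toy_groups)
print(feats[["twi_mean", "twi_range", "twi_mean_5_0", "twi_max_5_1"]])
```

With such a helper, the two cells would reduce to `build_features(train, selectedCols, groupCols)` and `build_features(test, selectedCols, groupCols)`, which removes the risk of the two copies drifting apart.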


In [27]:
test_df.shape

(5430, 392)

In [28]:
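# Append the raw per-cell columns to the engineered features
# (Sample_ID is dropped from both sets; Label is also dropped from train because df already holds it)
df = pd.concat([df, train.iloc[:, 1:-1]], axis = 1)
test_df = pd.concat([test_df, test.iloc[:, 1:]], axis = 1)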

In [29]:
df.shape, test_df.shape

((10864, 618), (5430, 617))
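
The two shapes differ by exactly one column, which is consistent with `Label` existing only in the training frame. A shape match does not prove the column names match, though, so a quick set comparison can help before modelling. The check below is a sketch on toy frames; `check_alignment` is a hypothetical helper.

```python
import pandas as pd


def check_alignment(train_features, test_features, target="Label"):
    """Return columns present in only one of the two frames (target excluded)."""
    train_cols = set(train_features.columns) - {target}
    test_cols = set(test_features.columns)
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)


# Toy frames standing in for df and test_df.
toy_train = pd.DataFrame({"twi_mean": [1.0], "slope_max": [2.0], "Label": [0]})
toy_test = pd.DataFrame({"twi_mean": [1.5], "slope_max": [2.5]})

only_train, only_test = check_alignment(toy_train, toy_test)
print(only_train, only_test)  # [] []
```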

# **MODELLING**

# **CATBOOST**

In [30]:
features =  ['8_slope', 'geology_3', '9_slope', '13_slope', 'aspect_range',
       'slope_max_5_1', 'twi_mean', 'elevation_ratio2', '14_slope',
       '14_lsfactor', 'placurv_min', 'slope_std', 'slope_max', '24_elevation',
       '8_lsfactor', 'slope_mean_5_1', 'elevation_ratio1', 'slope_range',
       'placurv_std', 'elevation_std', 'sdoif_mean', 'twi_max', '17_aspect',
       'twi_median', 'lsfactor_max_5_0', '8_placurv', 'twi_median_5_4',
       'slope_std_5_1', '14_sdoif', 'placurv_min_5_2', '16_elevation',
       '14_placurv', '25_elevation', '10_geology', 'twi_mean_5_0',
       'sdoif_ratio1', 'twi_max_5_1', '15_placurv', '3_geology', '9_lsfactor',
       '3_aspect', 'aspect_std', 'sdoif_std_5_4', 'sdoif_min_5_0',
       '19_geology', 'lsfactor_min_5_0', '13_geology', '9_geology',
       'sdoif_max_5_3', 'lsfactor_max_5_4',
# ------------------------------------------------------------------------------  
        '19_placurv', '15_elevation', '13_placurv', 'slope_range_5_1',
       'slope_median_5_1', '7_geology', 'twi_max_5_0', 'slope_mean_5_0',
       '13_lsfactor', '12_geology', '6_aspect', '11_geology',
       'lsfactor_median_5_2', 'elevation_max_5_3', '12_aspect', '4_sdoif',
       'placurv_range', 'twi_median_5_2', 'aspect_ratio1_5_0',
       'lsfactor_range', '6_geology', 'placurv_range_5_2', 'lsfactor_mean_5_4',
       '8_procurv', '11_lsfactor', '21_lsfactor', '5_placurv',
       'sdoif_range_5_1', 'lsfactor_std', 'sdoif_median', 'lsfactor_max',
       'twi_median_5_1', '11_aspect', 'aspect_min_5_0', '5_procurv',
       'elevation_ratio1_5_1', '2_geology', '5_geology', 'aspect_std_5_4',
       'procurv_max', 'elevation_median_5_0', '16_lsfactor', '11_twi',
       'sdoif_median_5_3', 'twi_ratio2', 'twi_ratio2_5_2', 'elevation_max_5_4',
       '15_twi', 'sdoif_min_5_4', 'twi_range',
       
# ------------------------------------------------------------------------------ 
        '23_twi', 'slope_mean_5_2', '8_elevation', 'placurv_mean_5_1',
       'lsfactor_min', 'lsfactor_mean', '1_aspect', '25_lsfactor', 'sdoif_max',
       'sdoif_ratio2_5_2', '6_twi', 'sdoif_mean_5_4', 'sdoif_ratio2',
       'elevation_mean', 'sdoif_ratio1_5_2', 'twi_std_5_4', '20_sdoif',
       '9_placurv', 'sdoif_ratio2_5_4', 'sdoif_min_5_3', 'twi_mean_5_4',
       'elevation_mean_5_3', 'sdoif_std_5_0', '22_elevation',
       'lsfactor_min_5_4', '17_elevation', 'elevation_std_5_3',
       'elevation_range', '13_sdoif', '20_geology', 'procurv_ratio2',
       'placurv_median_5_2', '8_sdoif', 'sdoif_mean_5_0', 'twi_max_5_3',
       '1_lsfactor', '20_elevation', 'procurv_mean_5_2',
       'elevation_ratio1_5_4', 'aspect_max_5_4', 'sdoif_range_5_3',
       '10_placurv', '4_procurv', 'aspect_range_5_3', 'twi_ratio1_5_2',
       'placurv_std_5_2', '22_geology', '5_elevation', 'twi_median_5_3',
       'placurv_mean_5_2']
# ------------------------------------------------------------------------------ 
# Further candidate features, assigned to a throwaway name and not used for training
_  =    [ 'slope_std_5_3', 'aspect_median_5_1', 'sdoif_ratio1_5_0', '18_placurv', 
        'slope_range_5_2', 'slope_mean_5_4',
       'aspect_ratio2', 'lsfactor_max_5_1', '2_sdoif', 'sdoif_max_5_1',
       'procurv_std', 'slope_ratio2', 'procurv_std_5_1', '3_twi',
       'elevation_min_5_0', 'lsfactor_median', 'slope_max_5_2', '5_lsfactor',
       'placurv_ratio1_5_1', 'twi_max_5_2', '17_geology', 'lsfactor_mean_5_0',
       'slope_ratio1_5_1', '21_sdoif', 'elevation_std_5_2', '3_placurv',
       'sdoif_range', '21_aspect', 'aspect_std_5_3', 'procurv_std_5_0',
       'aspect_ratio1', '5_twi', '16_sdoif', '7_twi', 'aspect_min',
       'sdoif_ratio2_5_3', '6_lsfactor', 'placurv_ratio1_5_2',
       'aspect_max_5_3', 'elevation_min_5_3', 'sdoif_ratio2_5_1',
       'lsfactor_ratio2_5_0', 'slope_min_5_1', 'twi_min_5_4',
       'aspect_range_5_4', '2_lsfactor', 'twi_range_5_4', 'twi_mean_5_2',
       '23_aspect', '15_procurv'
# ------------------------------------------------------------------------------ 
       ]

In [31]:
predsCatBoost = np.zeros(len(test_df))
# Train one model per class weight and accumulate their 0/1 predictions
for scalePos in [3.8, 4]:
  cat_params = {
      'n_estimators': 5000,
      'learning_rate': 0.07,
      'eval_metric': 'F1',  # 'TotalF1'
      #'eval_metric': 'AUC',
      'loss_function': 'Logloss',
      'random_seed': SEED,
      'scale_pos_weight': scalePos,
      'metric_period': 500,
      'od_wait': 500,
      'task_type': 'GPU',
      'depth': 8,
      #'colsample_bylevel': 0.7,
  }
  model = CatBoostClassifier(**cat_params)
  model.fit(df[features], df.Label)
  predsCatBoost += model.predict(test_df[features])

0:	learn: 0.8118961	total: 230ms	remaining: 19m 10s
500:	learn: 0.9704600	total: 52.1s	remaining: 7m 47s
1000:	learn: 0.9900713	total: 1m 45s	remaining: 7m
1500:	learn: 0.9969572	total: 2m 39s	remaining: 6m 11s
2000:	learn: 0.9993706	total: 3m 32s	remaining: 5m 18s
2500:	learn: 0.9999516	total: 4m 26s	remaining: 4m 26s
3000:	learn: 1.0000000	total: 5m 19s	remaining: 3m 32s
3500:	learn: 1.0000000	total: 6m 12s	remaining: 2m 39s
4000:	learn: 1.0000000	total: 7m 4s	remaining: 1m 46s
4500:	learn: 1.0000000	total: 7m 58s	remaining: 53.1s
4999:	learn: 1.0000000	total: 8m 52s	remaining: 0us
0:	learn: 0.8264122	total: 108ms	remaining: 9m 1s
500:	learn: 0.9706963	total: 52.7s	remaining: 7m 52s
1000:	learn: 0.9911052	total: 1m 46s	remaining: 7m 6s
1500:	learn: 0.9962403	total: 2m 39s	remaining: 6m 11s
2000:	learn: 0.9992642	total: 3m 33s	remaining: 5m 19s
2500:	learn: 0.9998159	total: 4m 25s	remaining: 4m 25s
3000:	learn: 1.0000000	total: 5m 18s	remaining: 3m 32s
3500:	learn: 1.0000000	total: 6m
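
The log above reports F1 on the training data only, so a value of 1.0 shows that the model fits the training set, not how it generalizes. One way to get a validation signal would be to hold out a stratified slice of the rows and pass it to `fit` through CatBoost's `eval_set` argument. The split itself can be sketched in plain NumPy; `stratified_holdout` is a hypothetical helper and the 80/20 toy labels are made up for illustration.

```python
import numpy as np


def stratified_holdout(labels, valid_frac=0.2, seed=42):
    """Split row positions into train/validation parts, class by class."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, valid_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then cut off the validation share.
        pos = rng.permutation(np.flatnonzero(labels == cls))
        n_valid = int(round(len(pos) * valid_frac))
        valid_idx.extend(pos[:n_valid])
        train_idx.extend(pos[n_valid:])
    return np.sort(train_idx), np.sort(valid_idx)


# Toy labels: 80 negatives and 20 positives.
toy_labels = np.array([0] * 80 + [1] * 20)
tr, va = stratified_holdout(toy_labels)
print(len(tr), len(va), toy_labels[va].mean())
```

Under these assumptions, the call would look like `model.fit(X.iloc[tr], y.iloc[tr], eval_set=(X.iloc[va], y.iloc[va]))`, where `X` and `y` stand for `df[features]` and `df.Label`.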

# **LIGHTGBM**

In [32]:
features = [
    'geology_3', '8_slope', '9_slope', '13_slope', 'sdoif_mean', 'twi_mean',
       'aspect_range', 'placurv_min', 'slope_max_5_1', 'lsfactor_max_5_4',
       '13_lsfactor', 'slope_max_5_2', '8_lsfactor', '1_aspect', 'sdoif_range',
       'elevation_ratio1', 'elevation_std', '14_slope', 'slope_max',
       '20_elevation', 'elevation_ratio2', 'placurv_ratio1', 'placurv_std',
       'twi_mean_5_0', '1_lsfactor', 'placurv_min_5_2', 'lsfactor_min_5_0',
       'sdoif_median', '4_elevation', 'lsfactor_std', 'sdoif_ratio1_5_4',
       '9_placurv', '21_lsfactor', 'aspect_std', 'aspect_ratio1',
       '14_lsfactor', 'lsfactor_range', 'twi_max', 'slope_std', 'twi_std',
       'lsfactor_range_5_1', 'placurv_ratio2_5_0', 'sdoif_ratio2_5_4',
       'placurv_ratio1_5_2', '6_aspect', '8_placurv', '19_placurv',
       '16_aspect', 'sdoif_max_5_0', '5_lsfactor',
# -------------------------------------------------------------------------------- 
      '7_aspect', '6_lsfactor', 'lsfactor_mean_5_4', '9_twi', 'slope_std_5_1',
       'lsfactor_max', '14_placurv', '2_aspect', 'sdoif_max', 'lsfactor_min',
       '10_elevation', 'twi_max_5_2', '9_lsfactor', 'placurv_ratio1_5_1',
       'placurv_range', 'sdoif_std', '11_aspect', 'procurv_max_5_2',
       '3_aspect', 'twi_ratio1', 'aspect_min', 'procurv_range', '10_aspect',
       'twi_max_5_0', '4_twi', '25_elevation', 'procurv_std',
       'slope_range_5_1', '15_twi', 'sdoif_ratio2', '16_twi', 'twi_median',
       'slope_range', '12_aspect', 'sdoif_range_5_4', 'sdoif_std_5_4',
       'procurv_mean_5_4', 'aspect_min_5_0', 'aspect_max_5_4',
       'lsfactor_median_5_2', '18_placurv', '15_placurv', 'aspect_ratio1_5_0',
       '25_placurv', 'twi_std_5_2', 'slope_mean_5_2', '11_lsfactor',
       'slope_std_5_2', 'twi_median_5_0', 'procurv_min',
# -------------------------------------------------------------------------------- 
    '1_procurv', 'twi_ratio2_5_1', '25_twi', 'twi_ratio1_5_2',
       'twi_range_5_2', '24_lsfactor', 'lsfactor_std_5_1', '10_procurv',
       'elevation_ratio1_5_3', 'twi_range', '12_twi', '9_procurv',
       '16_lsfactor', 'twi_std_5_1', '17_aspect', 'twi_median_5_3',
       'twi_ratio2', '5_elevation', 'procurv_ratio2_5_1', 'procurv_ratio1_5_1',
       'procurv_ratio2', 'lsfactor_mean_5_1', 'placurv_min_5_1',
       'aspect_std_5_3', 'twi_range_5_0', 'procurv_max', 'twi_mean_5_1',
       'procurv_std_5_1', 'placurv_min_5_0', '13_placurv', 'aspect_std_5_4',
       '12_placurv', 'elevation_ratio1_5_1', 'slope_mean_5_0', 'twi_min_5_3',
       'procurv_min_5_1', 'twi_min', '11_twi', 'procurv_range_5_0',
       'procurv_max_5_1', '1_elevation', 'twi_median_5_1',
       'placurv_median_5_1', '23_twi', 'placurv_std_5_2', 'placurv_range_5_2',
       'twi_min_5_2', 'placurv_ratio1_5_4', 'sdoif_ratio1',
       'elevation_std_5_0',

# -------------------------------------------------------------------------------- 
      'procurv_ratio2_5_2', '6_placurv', '16_placurv', 'twi_std_5_3',
       'procurv_range_5_2', '22_aspect', 'slope_median_5_2', 'slope_mean_5_1',
       'lsfactor_max_5_3', 'procurv_max_5_4', 'placurv_ratio2',
       'elevation_ratio2_5_3', '19_twi', 'aspect_std_5_2', '20_placurv',
       'aspect_ratio2_5_0', '1_placurv', 'slope_std_5_3', '3_slope',
       'procurv_ratio2_5_4', 'procurv_ratio2_5_0', 'placurv_range_5_3',
       '22_procurv', '5_procurv', 'placurv_median_5_0', 'twi_range_5_3',
       'slope_ratio1', '22_twi', 'aspect_range_5_0', '6_procurv',
       'slope_ratio1_5_1', '15_procurv', 'lsfactor_std_5_4',
       'lsfactor_median_5_1', 'placurv_median_5_2', 'aspect_median_5_1',
       '8_procurv', 'sdoif_min', 'slope_range_5_0', 'placurv_min_5_3',
       'twi_ratio1_5_1', 'procurv_median_5_0', '25_procurv',
       'lsfactor_range_5_0', 'aspect_mean_5_4', '15_slope', 'aspect_ratio2',
       'aspect_ratio1_5_4', 'lsfactor_ratio1', 'aspect_mean'
# -------------------------------------------------------------------------------- 
]

In [33]:
predsLGB = np.zeros(len(test_df))
# Train one model per class weight and accumulate their 0/1 predictions
for scalePos in [3.9, 4]:
  lgb_params = {
      'objective': 'binary',
      'boosting_type': 'gbdt',
      #'metric': 'auc',
      'n_jobs': -1,
      'learning_rate': 0.01,
      'num_leaves': 2**6,  # 2**8
      'max_depth': -1,
      'scale_pos_weight': scalePos,  # 3
      'tree_learner': 'serial',
      'colsample_bytree': 0.7,
      'subsample_freq': 1,
      'subsample': 1,
      'n_estimators': 800,
      'max_bin': 255,
      'verbose': -1,
      'seed': SEED
  }
  model_lgb = lgb.LGBMClassifier(**lgb_params)
  model_lgb.fit(df[features],df.Label)
  predsLGB += model_lgb.predict(test_df[features])
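
All three models pass a `scale_pos_weight` between 3.8 and 4 to up-weight the positive class. The notebook does not say how these values were chosen. A common starting heuristic is the ratio of negative to positive samples, which could be computed as in the sketch below; the toy labels are invented and `negative_to_positive_ratio` is a hypothetical helper.

```python
import numpy as np


def negative_to_positive_ratio(labels):
    """Ratio of negative (0) to positive (1) samples in a binary label array."""
    labels = np.asarray(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive samples, ratio is undefined")
    return n_neg / n_pos


# Toy labels: 8 negatives and 2 positives.
toy_labels = np.array([0, 0, 0, 1, 0, 1, 0, 0, 0, 0])
ratio = negative_to_positive_ratio(toy_labels)
print(ratio)  # 4.0
```

On the real data this would be `negative_to_positive_ratio(train.Label)`; values near that ratio are then usually tuned against a validation metric.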

# **XGBOOST**

In [34]:
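def xgb_f1(y, t, threshold=0.5):
    # Custom F1 metric for XGBoost: y holds the predicted scores, t is the DMatrix with the true labels
    t = t.get_label()
    # Binarize the scores at the threshold before computing F1
    y_bin = (y > threshold).astype(int)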
    return 'f1',f1_score(t,y_bin)
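
To make the thresholding step concrete, the same calculation can be written out in plain NumPy, with F1 expressed as 2·TP / (2·TP + FP + FN). This is a sketch for checking values by hand, not a replacement for `f1_score`; `f1_at_threshold` and the toy arrays are illustrative.

```python
import numpy as np


def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """Binarize scores at the threshold, then compute F1 from the confusion counts."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]

# At 0.5 the third sample is missed; lowering the threshold recovers it.
f1_default = f1_at_threshold(y_true, y_prob)
f1_lower = f1_at_threshold(y_true, y_prob, threshold=0.3)
print(f1_default, f1_lower)
```

Here the default threshold gives TP = 2, FP = 1, FN = 1, while a threshold of 0.3 gives TP = 3, FP = 1, FN = 0, which shows how sensitive F1 can be to the cut-off.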

In [35]:
# Features

features = ['8_slope', 'elevation_std', '10_geology', '5_geology', '19_geology',
       'geology_3', 'elevation_median_5_1', '9_slope', '23_geology',
       '8_elevation', '13_slope', '22_sdoif', '14_geology', '13_geology',
       'elevation_mean_5_1', 'slope_mean_5_1', 'elevation_range', '17_geology',
       '24_sdoif', '15_geology', 'slope_max_5_1', '18_geology', '20_geology',
       '19_sdoif', 'twi_mean', '12_sdoif', 'elevation_mean_5_4', '22_geology',
       '16_sdoif', '14_slope', '7_geology', '21_geology', '17_sdoif',
       '10_elevation', '20_elevation', 'sdoif_mean_5_2', '8_sdoif',
       'sdoif_ratio1', 'sdoif_ratio2_5_4', 'slope_max_5_2', 'sdoif_min_5_3',
       'sdoif_mean_5_4', '20_sdoif', 'geology_5', 'sdoif_ratio1_5_4',
       'slope_median_5_2', '5_elevation', 'elevation_min_5_2', 'sdoif_max_5_4',
       '15_elevation',
       
      '9_elevation', '11_sdoif', '4_sdoif', '25_sdoif', '18_elevation',
       'sdoif_min_5_1', '14_sdoif', 'sdoif_median_5_4', 'lsfactor_mean_5_4',
       'sdoif_ratio1_5_3', '7_sdoif', 'sdoif_min', '8_geology', '6_elevation',
       '4_elevation', '23_sdoif', '25_elevation', '19_elevation',
       '22_elevation', '1_elevation', 'slope_mean_5_2', '6_sdoif',
       'elevation_ratio2', 'sdoif_max', 'sdoif_median', 'elevation_mean_5_3',
       '21_sdoif', 'aspect_std', 'twi_median', 'sdoif_mean', '16_geology',
       'lsfactor_max_5_4', 'sdoif_std_5_2', '14_elevation', 'slope_max',
       'slope_median_5_1', 'elevation_mean_5_2', 'sdoif_max_5_0',
       'elevation_min_5_4', 'elevation_ratio1', 'elevation_median_5_0',
       '2_sdoif', '2_geology', 'sdoif_ratio2_5_3', '6_aspect', 'placurv_min',
       '24_geology', 'elevation_min_5_0', '9_sdoif', 'sdoif_range',


       '7_elevation', 'sdoif_min_5_4', '6_geology', '1_aspect', '1_lsfactor',
       'sdoif_median_5_1', '18_sdoif', 'sdoif_mean_5_0', 'placurv_std',
       '21_elevation', 'sdoif_range_5_0', 'placurv_min_5_2', 'aspect_min_5_0',
       'aspect_range', 'sdoif_ratio1_5_2', '3_geology', '17_elevation',
       'twi_max', '8_placurv', 'aspect_median_5_3', '4_geology',
       'elevation_median', 'sdoif_ratio2', '5_sdoif', '13_elevation',
       'sdoif_max_5_1', '25_geology', 'lsfactor_median_5_4', 'sdoif_range_5_1',
       'twi_mean_5_2', '7_aspect', '15_sdoif', 'twi_median_5_2',
       'sdoif_std_5_3', 'slope_range', 'sdoif_min_5_0', 'aspect_max_5_1',
       'lsfactor_min_5_0', 'geology_2', 'twi_std', 'geology_nunique',
       'sdoif_median_5_3', '1_sdoif', 'elevation_min_5_3', 'sdoif_range_5_4',
       'elevation_max_5_2', 'aspect_ratio1', '17_aspect', 'sdoif_std_5_4',
       'placurv_mean_5_1',

       'placurv_min_5_1', 'sdoif_range_5_2', 'elevation_mean',
       'slope_mean_5_3', 'sdoif_ratio2_5_0', '9_placurv', 'elevation_max',
       'twi_range', 'sdoif_min_5_2', 'twi_mean_5_0', 'elevation_max_5_4',
       'aspect_ratio2_5_0', '11_aspect', '12_aspect', 'aspect_max_5_2',
       'placurv_ratio1', 'slope_ratio1', 'slope_mean_5_0', 'twi_min_5_0',
       'elevation_min_5_1', 'twi_mean_5_1', '2_aspect', 'aspect_ratio2',
       'twi_max_5_1', 'twi_max_5_2', 'lsfactor_min', '3_aspect',
       'slope_max_5_3', 'elevation_max_5_1', '8_lsfactor', '6_lsfactor',
       'placurv_mean_5_2', 'lsfactor_min_5_4', '21_lsfactor', '1_geology',
       '12_elevation', 'aspect_median_5_1', 'lsfactor_ratio2', '4_twi',
       'placurv_median_5_2', 'aspect_ratio2_5_1', '11_twi', 'procurv_std',
       'lsfactor_ratio2_5_1', '3_elevation', 'aspect_mean_5_1', 'slope_std',
       '24_slope', '19_slope', 'aspect_ratio1_5_0'    
       ]

In [36]:
predsXGB = np.zeros(len(test_df))
# Train one model per class weight and accumulate their 0/1 predictions
for scalePos in [3.8, 4]:
  model_xgb = xgb.XGBClassifier(
      n_estimators=2000,
      max_depth=12,
      learning_rate=0.02,
      subsample=0.8,
      scale_pos_weight=scalePos,
      colsample_bytree=0.4,
      missing=-1,
      eval_metric='auc',
      tree_method='gpu_hist',
      random_state=0
  )
  model_xgb.fit(df[features], df.Label)
  predsXGB += model_xgb.predict(test_df[features])

# **Blend**

In [37]:
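# Each preds* array is the sum of two 0/1 predictions, so dividing by 6 gives the share of models voting for class 1
predBlend = (predsXGB + predsCatBoost + predsLGB) / 6
# Predict class 1 when at least half of the six models vote for it
preds = [1 if x >= 0.5 else 0 for x in predBlend]
sub_file = pd.DataFrame({'Sample_ID': test.Sample_ID, 'Label': preds})

# Create a CSV file to upload to Zindi
sub_file.to_csv('WinningSolution.csv', index = False)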
sub_file.head()

Unnamed: 0,Sample_ID,Label
0,10865,0
1,10866,0
2,10867,0
3,10868,1
4,10869,1
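
Because every model contributes hard 0/1 predictions, the blend is a vote among six models (two per library), and the `>= 0.5` rule means a three-to-three tie is resolved in favour of class 1. The sketch below reproduces that rule on toy vote sums; `blend_votes` and the toy arrays are illustrative.

```python
import numpy as np


def blend_votes(*vote_sums, n_models=6, threshold=0.5):
    """Average summed 0/1 predictions and threshold the resulting vote share."""
    share = np.sum(vote_sums, axis=0) / n_models
    return (share >= threshold).astype(int)


# Each array holds the sum of two 0/1 predictions per sample, as in the notebook.
toy_xgb = np.array([2, 1, 0, 1])
toy_cat = np.array([2, 1, 0, 0])
toy_lgb = np.array([1, 1, 1, 0])

blended = blend_votes(toy_xgb, toy_cat, toy_lgb)
print(blended)  # [1 1 0 0]
```

The second sample receives three of six votes and is still labelled 1, which is the tie behaviour described above.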


In [38]:
from google.colab import files
files.download('WinningSolution.csv')
