# LightGBM implementation

LightGBM is a gradient boosting framework built on decision trees. It is designed to be efficient and scalable, and it supports several related algorithms, including GBDT, GBRT, GBM, and MART. LightGBM has been reported to be several times faster than other implementations of gradient boosting trees, largely because of its fully greedy (leaf-wise) tree-growth method and its histogram-based optimization of memory use and computation.
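To make the histogram idea concrete, here is a minimal pure-Python sketch of split finding on a single feature: values are bucketed into a fixed number of bins, gradient sums are accumulated per bin, and only the bin boundaries are evaluated as split candidates. The function name `best_histogram_split`, the equal-width binning, and the simplified gain formula (unit Hessians, no regularization) are illustrative assumptions, not LightGBM's actual implementation.

```python
def best_histogram_split(values, gradients, n_bins=8):
    """Return (threshold, gain) of the best split found on bin boundaries.

    Illustrative only: equal-width bins, a squared-error-style gain with
    unit Hessians, and no regularization.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        return None, 0.0  # constant feature: nothing to split on
    width = (hi - lo) / n_bins

    # One pass over the data: accumulate gradient sum and row count per bin.
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), sum(count)
    parent_score = total_g ** 2 / total_n

    # Scan the n_bins - 1 bin boundaries instead of every distinct value.
    best_threshold, best_gain = None, 0.0
    left_g, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - parent_score
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


# Toy data: the gradient sign flips between x = 4 and x = 6.
xs = [0.5, 1.0, 2.0, 3.5, 4.0, 6.0, 7.0, 8.5, 9.0, 10.0]
gs = [-1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
threshold, gain = best_histogram_split(xs, gs)
print(threshold, gain)
```

The scan costs one pass over the rows plus one pass over the bins, which is the source of the speed-up compared with sorting and testing every distinct feature value.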

## Importing libraries and settings

In [4]:
import re
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import string
import nltk
import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)

%matplotlib inline

In [5]:
from sklearn.model_selection import TimeSeriesSplit, KFold
from sklearn.metrics import mean_squared_error
import lightgbm as lgbm

# binary classification, evaluated with log loss on the validation set
params = {"objective" : "binary", 
          "metric" : "binary_logloss", 
          "max_depth": 11}
# optional parameters, currently disabled:
#           "min_child_samples": 20, 
#           "reg_alpha": 0.2, 
#           "reg_lambda": 0.2,
#           "num_leaves" : 100, 
#           "learning_rate" : 0.01, 
#           "subsample" : 0.9, 
#           "colsample_bytree" : 0.9, 
#           "subsample_freq": 5}

n_fold = 10
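For reference when reading the `binary_logloss` values reported during training, here is a small sketch of how that metric is computed: it is the mean negative log-likelihood of the true labels under the predicted probabilities. The helper name `binary_logloss` and the toy labels are illustrative assumptions. A useful anchor is that a model predicting 0.5 for every row scores ln 2 ≈ 0.693, so validation values close to that indicate little predictive signal.

```python
import math


def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 1, 0]  # toy labels, not from the earthquake data

# An uninformative model that always predicts 0.5 scores ln(2).
chance = binary_logloss(labels, [0.5] * len(labels))
# Confident, correct predictions push the loss towards 0.
good = binary_logloss(labels, [0.9, 0.1, 0.8, 0.7, 0.2])
# Confident, wrong predictions are penalized heavily.
bad = binary_logloss(labels, [0.1, 0.9, 0.2, 0.3, 0.8])
print(chance, good, bad)
```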

## Get dataset

Data is obtained from the United States Geological Survey [1]. The earthquake occurrences lie between latitudes 20°S and 40°S and between longitudes 70°E and 105°E.
<img src="./dataset.png">

In [7]:
data = pd.read_csv("./query_large.csv", sep=",", parse_dates=['time'])
# convert timestamps to seconds since the Unix epoch so they can be used as a numeric feature
data['time'] = data['time'].apply(lambda t: t.timestamp())
X = data[['time', 'latitude', 'longitude']]
# binary target: 1 if the magnitude is above 4.3, else 0
# (built as a new Series, which avoids pandas' SettingWithCopyWarning)
y = (data['mag'] > 4.3).astype(int)

# hold out the last 1000 rows as the validation set
X_train, y_train, X_test, y_test = X[:-1000], y[:-1000], X[-1000:], y[-1000:]

from sklearn.metrics import f1_score

# splitting data into training and validation set
xtrain, xvalid, ytrain, yvalid = X_train, X_test, y_train, y_test


## Train the model

In [9]:
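# shuffle=False, so no random_state is passed (scikit-learn rejects that combination)
folds = KFold(n_splits=n_fold, shuffle=False)  # note: not used by the loop below
prediction = np.zeros(X_test.shape[0])

d_train = lgbm.Dataset(xtrain, ytrain)
d_valid = lgbm.Dataset(xvalid, yvalid)

# every pass trains on the same d_train/d_valid, so each pass yields the same model
for i in range(25):
   
    print('Fold:', i)

    # early stopping and logging go through callbacks
    # (the early_stopping_rounds/verbose_eval arguments were removed in LightGBM 4.0)
    model = lgbm.train(params, d_train, 5000, valid_sets=[d_valid],
                       callbacks=[lgbm.early_stopping(stopping_rounds=1000),
                                  lgbm.log_evaluation(period=50)])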

Fold: 0
Training until validation scores don't improve for 1000 rounds.
[50]	valid_0's binary_logloss: 0.68782
[100]	valid_0's binary_logloss: 0.69546
[150]	valid_0's binary_logloss: 0.704342
[200]	valid_0's binary_logloss: 0.711216
[250]	valid_0's binary_logloss: 0.72005
[300]	valid_0's binary_logloss: 0.726577
[350]	valid_0's binary_logloss: 0.735014
[400]	valid_0's binary_logloss: 0.739004
[450]	valid_0's binary_logloss: 0.747595
[500]	valid_0's binary_logloss: 0.751528
[550]	valid_0's binary_logloss: 0.759173
[600]	valid_0's binary_logloss: 0.766587
[650]	valid_0's binary_logloss: 0.768439
[700]	valid_0's binary_logloss: 0.777053
[750]	valid_0's binary_logloss: 0.781247
[800]	valid_0's binary_logloss: 0.783521
[850]	valid_0's binary_logloss: 0.789548
[900]	valid_0's binary_logloss: 0.795698
[950]	valid_0's binary_logloss: 0.800459
[1000]	valid_0's binary_logloss: 0.808418
Early stopping, best iteration is:
[10]	valid_0's binary_logloss: 0.680071
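The loop above retrains on the same train/validation split on every pass and never uses `folds`. As a rough sketch of what a contiguous (unshuffled) K-fold partition looks like, the pure-Python generator below yields train and validation row indices for each fold. `contiguous_kfold` is a hypothetical helper written for illustration, and the fold-size rule (the first `n_samples % n_splits` folds get one extra row) is an assumption modeled on how scikit-learn's unshuffled `KFold` sizes its folds.

```python
def contiguous_kfold(n_samples, n_splits):
    """Yield (train_idx, valid_idx) lists for unshuffled K-fold splitting."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


splits = list(contiguous_kfold(n_samples=10, n_splits=3))
for fold, (train_idx, valid_idx) in enumerate(splits):
    print("Fold:", fold, "valid rows:", valid_idx, "train size:", len(train_idx))
```

Each `(train_idx, valid_idx)` pair could then select the rows for a separate training and validation set per pass, so that the 'Fold' label in the log corresponds to a genuinely different split. If the rows are in time order, note that this scheme validates early folds with models trained partly on later rows; scikit-learn's `TimeSeriesSplit` is the usual alternative in that case.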

## Get predictions and scores

In [12]:
prediction = model.predict(xvalid) # predicted probabilities on the validation set
prediction_int = prediction >= 0.3 # label as 1 if the probability is at least 0.3, else 0
prediction_int = prediction_int.astype(int) # np.int was removed in NumPy 1.24; the builtin int works
print("Validation F1 Score :")
f1_score(yvalid, prediction_int) # calculating f1 score

Validation F1 Score :


0.6737400530503979
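As a sketch of what the scoring cell computes, the following pure-Python snippet thresholds predicted probabilities and derives F1 from the confusion counts, then compares a few cut-offs. The helper name `f1_at_threshold` and the toy probabilities are illustrative assumptions, not values from the earthquake data. The 0.3 cut-off used above is one choice among many, and would ideally be tuned on data that is not also used to report the final score.

```python
def f1_at_threshold(y_true, p_pred, threshold):
    """F1 of the positive class after labeling p >= threshold as 1."""
    y_hat = [1 if p >= threshold else 0 for p in p_pred]
    tp = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(y_true, y_hat) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels and probabilities for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
p_pred = [0.9, 0.4, 0.35, 0.6, 0.2, 0.55, 0.25, 0.1]

scores = {t: f1_at_threshold(y_true, p_pred, t) for t in (0.3, 0.5, 0.7)}
for t, score in scores.items():
    print(t, round(score, 3))
```

On this toy input a lower threshold trades precision for recall, which is why the F1 score changes with the cut-off.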