# This is the training and testing notebook for the Machine Learning Model

We are going to train and test different machine learning models for prediction based on our data.
We are testing the following models: linear regression, decision tree, XGBoost, random forest (tuned with grid search), and a neural network.
We are also going to use this notebook to truncate unnecessary values from our data.

### Install the required dependencies: numpy, matplotlib, pandas, basemap, scikit-learn, scipy, xgboost, dill, and tensorflow

In [1]:
%pip install numpy
%pip install matplotlib
%pip install pandas
%pip install basemap
%pip install scikit-learn
%pip install scipy
%pip install xgboost
%pip install dill
%pip install tensorflow


Note: you may need to restart the kernel to use updated packages.


### Import the required dependencies: numpy, pandas, matplotlib, and basemap for viewing the data

In [3]:
# import required dependencies
#-----------------------------
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap


### Read and describe the data from the csv file

In [4]:
# read the data
#-------------------
data = pd.read_csv("final_data.csv")
data.describe()

| | latitude | longitude | mag | depth | timestamp |
|---|---|---|---|---|---|
| count | 3749279.0 | 3749279.0 | 3749279.0 | 3749279.0 | 3749279.0 |
| mean | 36.10546 | -99.83861 | 1.900921 | 22.56044 | 1114155000.0 |
| std | 20.04812 | 77.68673 | 1.266708 | 55.3005 | 389052900.0 |
| min | -84.422 | -179.999 | 0.01 | 0.0 | 94712400.0 |
| 25% | 33.68367 | -141.6976 | 1.0 | 3.34 | 823122000.0 |
| 50% | 37.567 | -118.9253 | 1.55 | 7.547 | 1184695000.0 |
| 75% | 44.516 | -115.9252 | 2.4 | 15.6 | 1452330000.0 |
| max | 87.386 | 180.0 | 9.1 | 735.8 | 1641010000.0 |
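
The `timestamp` column looks like Unix epoch seconds, though the notebook does not say so. Under that assumption, the min and max from the summary above can be turned into readable dates. A minimal sketch using only the standard library (both values are copied from the `describe()` output, where the max is displayed rounded):

```python
from datetime import datetime, timezone

# Values copied from the describe() output above.
# ASSUMPTION: timestamps are Unix epoch seconds (not stated in the notebook).
ts_min = 94712400.0
ts_max = 1641010000.0  # displayed rounded in the summary


def to_utc(ts):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


first_event = to_utc(ts_min)
last_event = to_utc(ts_max)
print(first_event.isoformat())
print(last_event.isoformat())
```

If the assumption holds, the data spans roughly 1973 to early 2022.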


### Plot the values for data visualization

In [None]:
m = Basemap(projection='mill',llcrnrlat=-80,urcrnrlat=80, llcrnrlon=-180,urcrnrlon=180,lat_ts=20,resolution='c')

longitudes = data["longitude"].tolist()
latitudes = data["latitude"].tolist()
#m = Basemap(width=12000000,height=9000000,projection='lcc', resolution=None,lat_1=80.,lat_2=55,lat_0=80,lon_0=-107.)
x,y = m(longitudes,latitudes)
fig = plt.figure(figsize=(12,10))
plt.title("All affected areas")
m.plot(x, y, "o", markersize = 2, color = 'blue')
m.drawcoastlines()
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary()
m.drawcountries()
plt.show()

## The hard-core ML stuff begins here

In [5]:
# create our data for training and testing
X = data[['timestamp', 'latitude', 'longitude']]
y = data[['mag', 'depth']]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2999423, 3) (749856, 3) (2999423, 2) (749856, 2)
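
The row counts in that output can be checked by hand. For a fractional `test_size`, scikit-learn rounds the test set size up and gives the remaining rows to training. A small sketch of that arithmetic, using the row count reported by `data.describe()`:

```python
import math

n_samples = 3_749_279  # row count reported by data.describe()
test_size = 0.2        # same fraction as in the train_test_split call

# The test set size is rounded up; training gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```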


### Train and test the Linear Regression model

In [30]:
from sklearn.linear_model import LinearRegression

linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)
linear_regressor.score(X_test, y_test)

0.2645730311363322

The R² score for the linear regression model is about 0.26 (26%). Not good.
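
For regressors, scikit-learn's `score` method returns the coefficient of determination R², not an accuracy. With two targets, as here, the per-target values are averaged by default. A minimal pure-Python sketch of that calculation, on toy numbers invented for illustration (not taken from the dataset):

```python
def r2_single(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for one target column."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def r2_multi(y_true_cols, y_pred_cols):
    """Uniform average of the per-target R^2 values."""
    scores = [r2_single(t, p) for t, p in zip(y_true_cols, y_pred_cols)]
    return sum(scores) / len(scores)


# Toy data, invented for illustration.
# mag is predicted perfectly; depth is always predicted as its mean.
mag_true, mag_pred = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]
depth_true, depth_pred = [5.0, 10.0, 15.0, 20.0], [12.5, 12.5, 12.5, 12.5]

score = r2_multi([mag_true, depth_true], [mag_pred, depth_pred])
print(score)  # 0.5: the average of 1.0 (mag) and 0.0 (depth)
```

A score of 0 means the model does no better than always predicting the mean, so 0.26 is only a modest improvement over that baseline.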

### Let's train and test a Decision Tree Regressor

In [31]:
from sklearn.tree import DecisionTreeRegressor
decisiontree_regressor = DecisionTreeRegressor(random_state=0)
decisiontree_regressor.fit(X_train, y_train)
decisiontree_regressor.score(X_test, y_test)

0.7770814361611147

The R² score for the Decision Tree Regressor is about 0.78 (78%). Better, but not the best.

### Let's try working with XGBoost


In [32]:
import xgboost as xgb

# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
xgb_model = xgb.XGBRegressor(objective="reg:squarederror", random_state=42)
xgb_model.fit(X_train, y_train)
xgb_model.score(X_test, y_test)



0.8261888851791132

The R² score for XGBoost is about 0.83 (83%). It's pretty good, but we can do better.

### Let's try working with the RandomForest Regressor

In [6]:
from sklearn.ensemble import RandomForestRegressor
randomforest_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
randomforest_regressor.fit(X_train, y_train)
randomforest_regressor.score(X_test, y_test)

0.8786946505597895

In [4]:
from sklearn.ensemble import RandomForestRegressor
randomforest_regressor2 = RandomForestRegressor(n_estimators=10, random_state=42)
randomforest_regressor2.fit(X_train, y_train)
randomforest_regressor2.score(X_test, y_test)

0.8688528742183084

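With 10 estimators the R² score is about 0.87 (87%), against about 0.88 (88%) with 100 estimators. That is pretty good. Can we do better?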
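
A random forest averages the predictions of its trees. Averaging more predictors reduces the variance of the result, but with diminishing returns, which is consistent with the small difference between the 10-estimator and 100-estimator scores above. A toy simulation of that effect, assuming fully independent noisy predictors (real trees are correlated, so the actual gain is smaller); every name and number here is invented for illustration:

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 2.0  # invented target value, for illustration only


def noisy_prediction():
    """One stand-in 'tree': the true value plus independent Gaussian noise."""
    return TRUE_VALUE + random.gauss(0.0, 1.0)


def ensemble_prediction(n_estimators):
    """Average the predictions of n_estimators independent noisy predictors."""
    return statistics.fmean(noisy_prediction() for _ in range(n_estimators))


def spread(n_estimators, trials=2000):
    """Standard deviation of the ensemble prediction over repeated trials."""
    return statistics.pstdev(
        ensemble_prediction(n_estimators) for _ in range(trials)
    )


spread_10 = spread(10)
spread_100 = spread(100)
print(spread_10, spread_100)
```

Ten times as many predictors shrink the spread by only about a factor of three in this idealized setting, while the training cost grows roughly tenfold.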

### Let's use GridSearch to see how well we can optimize our findings

In [5]:
from sklearn.model_selection import GridSearchCV
print("Initializing parameters")
parameters = {'n_estimators':[20, 50]}
print("Initializing grid search")
grid_obj = GridSearchCV(randomforest_regressor2, parameters)
print("Fitting grid search")
grid_fit = grid_obj.fit(X_train, y_train)
print("Getting best estimator")
best_fit = grid_fit.best_estimator_
print("Getting score")
best_fit.score(X_test, y_test)

Initializing parameters
Initializing grid search
Fitting grid search


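The recorded output stops at the fitting step, so no grid search score is shown here. The best result so far is still the RandomForest Regressor with 100 estimators, at about 0.88 (88%). Can we do better?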
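
Grid search is expensive on a dataset of this size: `GridSearchCV` fits every parameter combination once per cross-validation fold, then refits the best candidate on the full training set. A rough sketch of the fit count for the grid above, assuming the scikit-learn defaults of 5-fold cross-validation and `refit=True`:

```python
from itertools import product

param_grid = {"n_estimators": [20, 50]}  # grid from the cell above
cv_folds = 5   # ASSUMPTION: scikit-learn's default when cv is not given
refit = True   # ASSUMPTION: default, refits the best candidate once

# Enumerate every combination of parameter values.
names = list(param_grid)
candidates = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

# One fit per candidate per fold, plus the final refit.
total_fits = len(candidates) * cv_folds + (1 if refit else 0)
# Trees grown during cross-validation only (the refit is not counted).
trees_grown = sum(c["n_estimators"] for c in candidates) * cv_folds
print(candidates)
print(total_fits, trees_grown)
```

Passing `n_jobs=-1` to `GridSearchCV` runs these fits in parallel, at the cost of more memory.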

### Let's try using a Neural Network

In [None]:
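import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense

# Regression network: 3 input features -> 2 continuous targets (mag, depth)
model = keras.Sequential()
model.add(Dense(16, activation='relu', input_shape=(3,)))
model.add(Dense(16, activation='relu'))
# Linear output layer: softmax would force the two outputs to sum to 1
model.add(Dense(2, activation='linear'))

# Mean squared error suits regression; 'accuracy' is a classification metric
# Note: the inputs are unscaled (timestamps are around 1e9), which SGD handles poorly
model.compile(optimizer='SGD', loss='mse', metrics=['mae'])
model.fit(X_train, y_train, batch_size=10, epochs=20, verbose=1, validation_data=(X_test, y_test))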
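
Gradient-based training tends to behave badly when inputs differ in scale by many orders of magnitude, as `timestamp` (around 1e9) and `latitude` (within ±90) do here. One common remedy is to standardize each feature before training. A minimal sketch, using the means and standard deviations from the `describe()` output and a hypothetical helper named `standardize`:

```python
# Means and standard deviations copied from the describe() output.
feature_stats = {
    "timestamp": (1114155000.0, 389052900.0),
    "latitude": (36.10546, 20.04812),
    "longitude": (-99.83861, 77.68673),
}


def standardize(row):
    """Return z-scores, (value - mean) / std, per feature (hypothetical helper)."""
    return {
        name: (row[name] - mean) / std
        for name, (mean, std) in feature_stats.items()
    }


# Example row built from the medians in the describe() output.
median_row = {"timestamp": 1184695000.0, "latitude": 37.567, "longitude": -118.9253}
scaled = standardize(median_row)
print(scaled)
```

In a real pipeline the statistics would come from the training split only, so that the test data does not leak into preprocessing.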

### Let's save the models now.

In [8]:
import dill as pickle

# print("Linear Model saving")
# pickle.dump(linear_regressor, open('linear_regressor.pkl', 'wb'))
# print("Linear Model saved")
# print("decision tree Model saving")
# pickle.dump(decisiontree_regressor, open('decisiontree_regressor.pkl', 'wb'))
# print("decision tree Model saved")
# print("XGBoost Model saving")
# pickle.dump(xgb_model, open('xgboost_regressor.pkl', 'wb'))
# print("XGBoost Model saved")
print("Random forest Model saving")
# the with-block closes the file once the dump has finished
with open('randomforest_regressor.pkl', 'wb') as f:
    pickle.dump(randomforest_regressor2, f)
print("Random forest Model saved")
# print("Neural Network Model saving")
# pickle.dump(neural_network_model, open('neural_network.pkl', 'wb'))
# print("Neural Network Model saved")

Random forest Model saving
Random forest Model saved


In [1]:
import dill as pickle
# the with-block closes the file once the model has been loaded
with open('randomforest_regressor.pkl', 'rb') as f:
    randomforest_regressor3 = pickle.load(f)

In [6]:
randomforest_regressor3.score(X_test, y_test)

0.8688528742183084
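
The reloaded model reproduces the score of `randomforest_regressor2` exactly, as expected from a serialization round trip. A minimal sketch of the same save and load pattern, using context managers so the file handles are closed, the standard-library `pickle` instead of `dill`, and a stand-in dictionary instead of a real model (the file name is hypothetical):

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object works the same way.
fake_model = {"name": "randomforest_regressor", "n_estimators": 10}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name

    # The with-block closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(fake_model, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == fake_model)
```

Pickle files should only be loaded from trusted sources, since loading one can execute arbitrary code.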

In [10]:
randomforest_regressor3.predict(X_test[0:100])

array([[1.54300e+00, 1.53640e+00],
       [1.07000e+00, 9.33000e+00],
       [1.07200e+00, 7.91100e+00],
       [1.49400e+00, 9.49970e+00],
       [1.20000e+00, 1.10300e+01],
       [1.95000e+00, 3.73900e+01],
       [4.30000e+00, 3.76400e+01],
       [5.20000e-01, 1.53420e+01],
       [2.87000e+00, 2.64700e+01],
       [4.16000e+00, 4.22250e+02],
       [4.39000e+00, 9.68300e+01],
       [1.19700e+00, 4.76320e+00],
       [1.04900e+00, 1.67220e+00],
       [1.63000e+00, 2.84440e+00],
       [7.97000e-01, 7.44800e-01],
       [1.91000e+00, 5.23300e+00],
       [4.42000e+00, 8.03270e+01],
       [3.05000e+00, 1.40340e+01],
       [1.80600e+00, 2.29740e+00],
       [2.68000e+00, 2.99000e+01],
       [1.36000e+00, 3.38700e+01],
       [1.17100e+00, 3.95050e+00],
       [2.69000e+00, 6.00760e+00],
       [1.01200e+00, 4.12570e+00],
       [2.88000e+00, 4.10000e+00],
       [4.93000e+00, 1.00000e+01],
       [1.12000e+00, 1.13000e+00],
       [1.35000e+00, 1.09120e+02],
       [9.13000e-01,

In [9]:
y_test[0:100]

| | mag | depth |
|---|---|---|
| 218604 | 0.46 | 1.910 |
| 1599636 | 0.90 | 16.400 |
| 3703199 | 0.69 | 8.000 |
| 812369 | 2.19 | 9.018 |
| 1412766 | 1.30 | 5.000 |
| ... | ... | ... |
| 1920489 | 0.20 | 8.000 |
| 852063 | 2.50 | 10.000 |
| 481474 | 1.94 | 8.630 |
| 595183 | 1.03 | 4.856 |
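
Placing the first rows of the predictions next to the actual values gives a feel for the size of the errors. A small sketch computing the mean absolute error per target over the first four rows, with values copied from the two outputs above (four rows are far too few to draw conclusions from):

```python
# First four rows, copied from the predict() output and from y_test.
# Each tuple is (mag, depth).
predicted = [(1.543, 1.5364), (1.07, 9.33), (1.072, 7.911), (1.494, 9.4997)]
actual = [(0.46, 1.910), (0.90, 16.400), (0.69, 8.000), (2.19, 9.018)]


def mean_abs_error(pred_rows, true_rows, column):
    """Mean absolute error for one target column (0 = mag, 1 = depth)."""
    errors = [abs(p[column] - t[column]) for p, t in zip(pred_rows, true_rows)]
    return sum(errors) / len(errors)


mae_mag = mean_abs_error(predicted, actual, 0)
mae_depth = mean_abs_error(predicted, actual, 1)
print(round(mae_mag, 3), round(mae_depth, 3))
```

Reporting an error per target in its own units (magnitude units and depth units) complements the single averaged R² score.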
