
# Earthquake Analysis and Prediction

# Introduction and Objective:

This project aims to predict future earthquakes using Linear Regression on the Kaggle dataset [The Ultimate Earthquake Dataset from 1990-2023](https://www.kaggle.com/datasets/alessandrolobello/the-ultimate-earthquake-dataset-from-1990-2023), which includes data on more than 3 million earthquakes worldwide.

The model will be:

1. Trained on historical data
2. Evaluated on a test set
3. Deployed for real-time prediction

Goal: Improve the accuracy and reliability of earthquake prediction using machine learning, potentially saving lives and reducing damage.

In [None]:
import os
for filename in os.listdir('/content'):
    print(filename)

## 1. Importing necessary libraries:

In [None]:
# Basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For better plots
import plotly.express as px
import plotly.graph_objects as go

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

# Deep-learning
import tensorflow as tf

## 2. Importing Data:

In [None]:
data = pd.read_csv("/content/Eartquakes-1990-2023.csv")

## 3. Understanding the basics of the Data:

In [None]:
data.head()

In [None]:
data.describe()

In [None]:
data.info()

Checking for "Null" values:

In [None]:
data.isna().sum()

Great, no "Null" values, won't have to go through "Pre-Processing" steps 😎😎

## 4. EDA Time:

Before delving into EDA, let's convert the "date" column to pandas datetime values.

In [None]:
data['date'] = pd.to_datetime(data['date'], format='ISO8601')

In [None]:
data.date
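
A small aside on the conversion above: `format='ISO8601'` is only accepted by pandas 2.0 and later. The sketch below uses made-up timestamp strings (not rows from the dataset) and `errors='coerce'` to show how unparseable entries turn into `NaT`, which gives a quick way to check that the conversion worked.

```python
import pandas as pd

# Made-up sample values; the real "date" column may look different.
raw = pd.Series([
    "2023-02-06 01:17:34+00:00",
    "1990-01-01 00:00:00+00:00",
    "not a date",
])

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

print(parsed.isna().sum())      # number of entries that failed to parse
print(parsed.dt.year.tolist())  # year per row; NaN where parsing failed
```

On the real data, a non-zero `NaT` count after conversion would mean some rows need attention before the year is extracted.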

#### a) Magnitude of earthquakes:

In [None]:
# Extract the year
year = pd.to_numeric(data.date.dt.year)

# Extract the magnitude
magnitude = data.magnitudo
df = pd.DataFrame({
    'year': year,
    'magnitude': magnitude
})
df = df[df['magnitude'] >= 0]

# Each bar is the sum of magnitudes in that year bin (because of weights=)
sns.histplot(data=df, x='year', weights='magnitude', bins=10, kde=True)
plt.xlabel("Year")
plt.ylabel("Sum of magnitudes")
plt.show()


In [None]:
# Plot against the date column so the x-axis is time rather than the row index
data.plot(x="date", y="magnitudo", kind="line", style=".", title="Magnitude trend by year", figsize=(16, 5))
plt.show()

**Inferences**

* The histogram bins earthquakes by year and weights each one by its magnitude, so bar height reflects the summed magnitude recorded in that period. The magnitudes themselves span a wide range, from about 2 to 9, with a peak around 5. Earthquakes with a magnitude of 7 or higher are relatively rare.

* The bars rise over time. This most likely reflects the improved ability of scientists to detect and record earthquakes, especially small ones, rather than a real increase in earthquake strength.

* The line graph suggests the same upward trend, although the increase is not linear. There are some years with no earthquakes recorded, and the number of earthquakes with a magnitude of 7 or higher stays relatively low.

The apparent increase could be due to a number of factors, such as:
- Human activity: reservoir filling, mining, and fluid injection can induce earthquakes locally by changing the stress on the Earth's crust.
- Climate change: a link to earthquake activity is speculative, and this dataset cannot confirm it.
- Plate tectonics: the movement of the Earth's tectonic plates causes earthquakes where plates collide or rub against each other, although this process changes very little over a few decades.

The relatively low number of earthquakes with a magnitude of 7 or higher could be due to a number of factors, such as:
- Large earthquakes are inherently rarer than small ones: earthquake frequency drops steeply as magnitude increases.
- Earthquakes are not evenly distributed, and some areas are more prone to them than others.
- The monitoring of earthquakes has improved over time, so more small earthquakes are detected and large ones make up a smaller share of the record.
- Climate change may also play a role, although this is speculative and the data here cannot confirm it.
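
To check whether an upward trend comes from more recorded events or from stronger ones, per-year counts and mean magnitudes can be compared directly. The following is a minimal sketch, assuming the `date` column is already datetime; `yearly_summary` is a hypothetical helper and the demo frame is made up for illustration.

```python
import pandas as pd

def yearly_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-year event count, mean magnitude, and count of magnitude >= 7 events."""
    year = frame["date"].dt.year
    return (
        frame.assign(year=year, is_major=frame["magnitudo"] >= 7)
        .groupby("year")
        .agg(
            events=("magnitudo", "size"),
            mean_magnitude=("magnitudo", "mean"),
            major_events=("is_major", "sum"),
        )
    )

# Tiny made-up frame; on the real data this would be yearly_summary(data).
demo = pd.DataFrame({
    "date": pd.to_datetime(
        ["1990-03-01", "1990-07-15", "2020-01-10", "2020-05-05", "2020-09-30"]
    ),
    "magnitudo": [5.1, 7.2, 1.3, 2.0, 4.5],
})
summary = yearly_summary(demo)
print(summary)
```

If the event count grows while the mean magnitude falls, that pattern would be consistent with better detection of small earthquakes rather than stronger seismic activity.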

#### b) Locating the places of earthquakes 🌍:
(***Run these cells only on a machine with enough resources, e.g. a GPU runtime; the maps are heavy to render***)

In [None]:
import folium

# Create a new DataFrame with the latitude and longitude columns
df = pd.DataFrame({
    "latitude": data.latitude,
    "longitude": data.longitude,
    "magnitude": data.magnitudo
})
df = df.sort_values(by='magnitude', ascending=False).head(1000)

In [None]:
from folium.plugins import MarkerCluster

# Named quake_map to avoid shadowing Python's built-in map()
quake_map = folium.Map(location=[0, 0], zoom_start=2)
marker_cluster = MarkerCluster().add_to(quake_map)

for latitude, longitude, magnitude in zip(df.latitude, df.longitude, df.magnitude):
    folium.CircleMarker(
        location=[latitude, longitude],
        radius=magnitude * 1.5,  # scaled down
        color='red',
        fill=True,
        fill_opacity=0.7
    ).add_to(marker_cluster)

quake_map


In [None]:
import pandas as pd
import plotly.graph_objects as go

# Create a new DataFrame with correct column names
df = pd.DataFrame({
    "latitude": data['latitude'],
    "longitude": data['longitude'],
    "magnitude": data['magnitudo']
})

# Drop rows with NaNs
df = df.dropna(subset=['latitude', 'longitude', 'magnitude'])

# Ensure all magnitudes are non-negative and size is reasonable
df['size'] = df['magnitude'].clip(lower=0) * 3  # Adjust multiplier for better marker sizes

# Create a single Scattergeo trace
fig = go.Figure(
    data=go.Scattergeo(
        lat=df['latitude'],
        lon=df['longitude'],
        mode='markers',
        marker=dict(
            size=df['size'],
            color='red',
            opacity=0.6,
        ),
        text=df['magnitude'],  # Optional: hover text
    ),
    layout=go.Layout(
        title="Earthquakes (1990–2023)",
        geo=dict(showland=True, landcolor='rgb(217, 217, 217)'),
        margin=dict(l=0, r=0, t=30, b=0)
    )
)

# Show the map
fig.show()
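
Plotting every row can be slow when the frame holds millions of points. One option, sketched here under assumptions, is to keep all strong events and a random sample of the rest before building the figure. The magnitude threshold of 6 and the sample size of 2,000 are arbitrary illustrative choices, and the frame below is synthetic rather than the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the latitude/longitude/magnitude frame.
rng = np.random.default_rng(0)
points = pd.DataFrame({
    "latitude": rng.uniform(-90, 90, 10_000),
    "longitude": rng.uniform(-180, 180, 10_000),
    "magnitude": rng.uniform(0, 9, 10_000),
})

# Keep every strong event and a random subset of the weaker ones.
strong = points[points["magnitude"] >= 6]
rest = points[points["magnitude"] < 6]
sample = pd.concat([strong, rest.sample(n=min(2_000, len(rest)), random_state=0)])

print(len(points), "->", len(sample))
```

The reduced frame can then be passed to the plotting code in place of the full one.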


#### c) Top 5 states with the most earthquakes:

In [None]:
# Get the top 5 states with the highest number of earthquakes
top_5_states = (
    data.groupby("state")
    .size()
    .to_frame(name="count")
    .reset_index()
    .sort_values(by=["count"], ascending=False)
    .head(5)["state"]
)

# Get the unique values of the top 5 states
top_5_states_unique = top_5_states.unique()

# Print the unique values
print(top_5_states_unique)

In [None]:
print(data['place'].head(10))
# This will create a new 'state' column by extracting the word after the last comma
data['state'] = data['place'].str.extract(r',\s*([\w\s]+)$')

# Drop rows where state couldn't be extracted (no comma in place string)
data = data.dropna(subset=['state'])

In [None]:
top_5_states = (
    data['state']
    .value_counts()
    .head(5)
    .reset_index(name='count')
    .rename(columns={'index': 'state'})
)

import plotly.express as px

fig = px.bar(
    top_5_states,
    x='state',
    y='count',
    title='Top 5 States with the Highest Number of Earthquakes (1990–2023)',
    labels={'count': 'Earthquake Count', 'state': 'State'},
    color='state'
)

fig.show()

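California appears twice, which suggests the same state is recorded under more than one label (for example, an abbreviation or a difference in spacing).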
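
One possible cause of duplicate labels is that some place strings end in an abbreviation and others in the full state name. The sketch below applies the same regular expression as above to made-up place strings and then normalizes the result. The `abbreviations` mapping is hypothetical and deliberately incomplete; whether the real data mixes these two styles is an assumption to verify with `data['state'].unique()`.

```python
import pandas as pd

# Made-up place strings in two styles that might occur in the data.
places = pd.Series([
    "10km NE of Ridgecrest, CA",
    "5 km S of Ridgecrest, California",
    "South of the Fiji Islands",  # no comma, so no state is extracted
])

# Same pattern as in the notebook: the text after the last comma.
state = places.str.extract(r",\s*([\w\s]+)$")[0]

# Hypothetical, incomplete abbreviation map; extend as needed.
abbreviations = {"CA": "California", "AK": "Alaska", "NV": "Nevada", "HI": "Hawaii"}
state_clean = state.str.strip().replace(abbreviations)

print(state_clean.tolist())
```

After normalization, both Ridgecrest rows fall under a single label, so they would be counted together in the bar chart.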

#### d) Bottom 5 states with the fewest earthquakes:

In [None]:
import plotly.express as px

# Get the bottom 5 states by number of earthquakes
bottom_5_states = (
    data.groupby("state")
    .size()
    .reset_index(name="count")
    .sort_values(by="count", ascending=True)
    .head(5)
)

# Plot the bar chart
fig = px.bar(
    bottom_5_states,
    x="state",
    y="count",
    title="Bottom 5 States with the Lowest Number of Earthquakes (1990–2023)",
    labels={"state": "State", "count": "Earthquake Count"},
    color="state"  # Optional: adds color for each bar
)

fig.show()


#### e) Top 5 Strongest earthquakes:

In [None]:
# Get the top 5 strongest earthquakes
top_5_earthquakes = (
    data.sort_values(by=["magnitudo"], ascending=False)
    .head(5)
)

# Create a map of the earthquakes
fig = px.scatter_geo(
    top_5_earthquakes,
    lat="latitude",  # Make sure to use the correct latitude column name
    lon="longitude",  # Make sure to use the correct longitude column name
    size="magnitudo",
    color="magnitudo",
    title="Top 5 Strongest Earthquakes (1990-2023)",
)

# Display the map
fig.show()

The size and the color of each marker both represent the magnitude of the earthquake. With Plotly's default continuous color scale, dark blue markers are the weakest earthquakes shown and yellow markers are the strongest.

## 5. Modelling Time:

We will create a simple Linear Regression model because the target, magnitude, is continuous. The input features are depth, latitude, and longitude.

In [None]:
# Preprocess the data (drop "place" column and keep only numerical features)
numerical_columns = ["magnitudo", "depth", "latitude", "longitude"]
data_numeric = data[numerical_columns]

# Separate input features (X) and target variable (y)
X = data_numeric.drop(columns=["magnitudo"])
y = data_numeric["magnitudo"]

# Split the data into a training set and a test set (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create the Linear Regression model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

# Predict using the model
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Print the mean squared error
print("Mean Squared Error:", mse)

In [None]:
for i, (actual, predicted) in enumerate(zip(y_test, y_pred), 1):
    if i % 1000 == 0:
        print(f"Iteration: {i}  |  Actual Magnitude: {actual:.2f}  |  Predicted Magnitude: {predicted:.2f}")

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.xlabel("Actual Magnitude")
plt.ylabel("Predicted Magnitude")
plt.title("Actual vs. Predicted Earthquake Magnitudes")
plt.show()

### Inferences:

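The model does not seem to be performing very badly. The Mean Squared Error is almost 0.9, which corresponds to a root mean squared error of about 0.95 magnitude units.

The predicted values are fairly close to the actual values, though a typical error of nearly one magnitude unit leaves clear room for improvement.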
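
An error figure is easier to judge next to a baseline. The sketch below compares Linear Regression with scikit-learn's `DummyRegressor`, which always predicts the training mean, and reports RMSE and R² for both. It runs on synthetic data with a weak linear signal, so the printed numbers say nothing about the earthquake model. On the real data, the same comparison would use `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three features, weak linear signal in the first.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2_000, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(size=2_000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

for name, fitted in [("baseline", baseline), ("linear", linear)]:
    pred = fitted.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    print(f"{name}: RMSE={rmse:.3f}  R^2={r2_score(y_te, pred):.3f}")
```

If the regression's RMSE is only slightly below the baseline's, the features carry little information about magnitude, however small the MSE looks in isolation.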