# Earthquake Data Analysis

### Description

The catalog includes the magnitude, time of occurrence (s), and 3D coordinates (m) of earthquakes in about 20 years of recording in South California. Coordinates were converted from latitude, longitude, and depth of events in a seismic catalog. Magnitudes should be within the range $[0,8]$.

* **Waiting time (t)**: time interval between an event and the next one in the sequence.
* **Distance (r)**: Eucledian 3D distance between events. (each 3D set of coordinates refers to the hypocenter, i.e. the point triggering the slip in a fault that forms the earthquake)

### Assignments

3. Compute the distribution $P_m(t)$ of waiting times for events of magnitude m or above (i.e. do not consider events below $m$). In shaping the bin sizes, take into account that this distribution is expected to have a power-law decay with time (e.g $\sim 1/t$), and that a power-law is well visualized in log-log scale. Do this analysis for many values of $m$, say $m=2,3,4,5$.

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.optimize import curve_fit
from scipy.stats import norm
import pandas_bokeh
from bokeh.models import CrosshairTool
from bokeh.models import HoverTool
from bokeh.plotting import figure, output_file, show

pandas_bokeh.output_notebook()
%matplotlib inline

In [2]:
#load the file
url='https://raw.githubusercontent.com/rossetl/Earthquake-Data-Analysis/main/SouthCalifornia-1982-2011_Physics-of-Data.dat'
df = pd.read_csv(url, sep='\s', usecols=[i for i in range(0,7,1)], 
                 names=['index','pointer','t','mag','x','y','z'], engine='python')

df.sort_values(by='t')
#set the limits for the waiting times
Tmin = 40 #short times are difficult to detect and distiguish

## ANALYSIS OF THE DISTRIBUTION $P_m(t)$

-  When examining waiting times we lose information about the real time of each event, which can be regarded as a disadvantage, this, however, depends on which relations we consider most interesting 

- In our analysis we calculate the waiting times between time-neighbouring events and make a histogram of the count using successive bins that are logarithmically increasing in length. Such count histograms are equivalent to the statistical probability distributions, this can illuminate more details of the data on the graphs than can the corresponding probability density functions (pdf)

-  it is important to also analyse the data in order to try to identify underlying patterns of behaviour

- explain the omori's law and show it

- this range is a very large set of times, so it would have been extremely impractical to use equal-sized bins so, bins with exponentially increasing sizes were used

Model 1: $P_m (t) = \frac{K}{t^\alpha}$ \
\
Model 2: $P_m (t) = \frac{K}{t + c}$ \
\
Model 3: $P_m (t) = \frac{K}{(t+c)^\alpha}$ 

In [3]:
#Definition of 3 different models to fit the distribution

def fit_pdf(x, K, p):
    return K*((x)**(-p))

def fit_pdf2(x, K, c):
    return K*((x+c)**(-1))

def fit_pdf3(x, K, c, p):
    return K*((x+abs(c))**(-p))

In [47]:
#Definition of the function to create histograms and fit them
def histo(m, nbins, Tup):
    plt.figure(figsize=(12,7))
    mask = df['mag'] >= m     #select only events with magnitude m or above
    dt = df['t'][mask].diff().dropna(how='any')     #compute the waiting times
    dt = dt[dt > Tmin]     #select only times above Tmin
    
    #histogram of waiting time
    n, bins, patches = plt.hist(dt, bins = np.logspace(np.log10(dt.min()), np.log10(dt.max()), nbins),  
                                alpha = 0.4, color = 'r', edgecolor='b', label = 'Data distribution', density=True, 
                                linewidth=1, zorder=1, visible=0)
    plt.close('all')
    
    Y = n[n>0]       #we don't consider empty bins for the fit
    X = np.array([0.5 * (bins[i] + bins[i+1]) for i in range(len(bins)-1)])
    X = X[n>0]
    
    #fit model 1
    par, pcov = curve_fit(fit_pdf, xdata=X[X<Tup], ydata=Y[X<Tup], p0=[0.01, 1])
    perr = np.sqrt(pcov)
    
    #fit model 2
    par2, pcov2 = curve_fit(fit_pdf2, xdata=X[X<Tup], ydata=X[X<Tup], p0=[0.1, 0])
    perr2 = np.sqrt(pcov2)
    
    #fit model 3
    par3, pcov3 = curve_fit(fit_pdf3, xdata=X, ydata=Y, p0=[0.01, 0, 1])
    perr3 = np.sqrt(abs(pcov3))
    
    # fit results (parameters)
    print('MAGNITUDE ABOVE m=' + str(m))
    print('---Fit Parameters model 1---')
    print('K= %0.3f +/- %0.3f' %(par[0], perr[0][0]))
    print('\u03B1= %0.2f +/- %0.2f' %(par[1], perr[1][1]))
    
    print('---Fit Parameters model 2---')
    print('K= %0.0f +/- %0.0f' %(par2[0], perr2[0][0]))
    print('c= %0.0f +/- %0.0f' %(abs(par2[1]), perr2[1][1]))
    
    if m < 5:
        print('---Fit Parameters model 3---')
        print('K= %0.2f +/- %0.2f' %(par3[0], perr3[0][0]))
        print('c= %0.0f +/- %0.0f' %(abs(par3[1]), perr3[1][1]))
        print('\u03B1= %0.2f +/- %0.2f\n' %(par3[2], perr3[2][2]))
    else : 
        print('---Fit Parameters model 3---')
        print('K= %0.0f +/- %0.0f' %(par3[0], perr3[0][0]))
        print('c= %0.0f +/- %0.0f' %(abs(par3[1]), perr3[1][1]))
        print('\u03B1= %0.2f +/- %0.2f\n' %(par3[2], perr3[2][2]))
    return X, Y, [par, par2, par3]

In [48]:
#Computation and plot using an interactive tool 
x2, y2, pfit2 = histo(2,15, 2*10**4) #(m,bins,Tmax)
x3, y3, pfit3 = histo(3,20, 2*10**5)
x4, y4, pfit4 = histo(4,20, 3*10**6)
x5, y5, pfit5 = histo(5,32, 6*10**7)

#define vector for universal law
alpha=np.array([round(pfit2[0][1],2), round(pfit3[0][1],2), round(pfit4[0][1],2), round(pfit5[0][1],2)])
print('\n\u03B1-parameters=',alpha)

p = figure(x_axis_type='log', y_axis_type='log', plot_width=800, plot_height=600, title="Histogram of Waiting times")
p.title.text_font_size = "30px"

#histograms
p.scatter(x = x2, y=y2, color="blue", legend_label='m=2', marker='circle', size=7, alpha=0.75)
p.scatter(x = x3, y=y3, color="orange", legend_label='m=3', marker='square', size=7,)
p.scatter(x = x4, y=y4, color="green", legend_label='m=4', marker='triangle', size=7)
p.scatter(x = x5, y=y5, color="red", legend_label='m=5', marker='diamond', size=7)

#fit model 1 
p.line(x=np.linspace(50, 2*10**4,100), y=fit_pdf(np.linspace(50, 2*10**4,100), *pfit2[0]), color="blue", 
       legend_label='Fit model 1(m=2)', line_width=1.5)
p.line(x=np.linspace(50, 2*10**5,100), y=fit_pdf(np.linspace(50, 2*10**5,100), *pfit3[0]), color="orange", 
       legend_label='Fit model 1(m=3)', line_width=1.5)
p.line(x=np.linspace(50, 3*10**6,100), y=fit_pdf(np.linspace(50, 3*10**6,100), *pfit4[0]), color="green", 
       legend_label='Fit model 1(m=4)', line_width=1.5)
p.line(x=np.linspace(50, 6*10**7,100), y=fit_pdf(np.linspace(50, 6*10**7,100), *pfit5[0]), color="red", 
       legend_label='Fit model 1(m=5)', line_width=1.5)

p.legend.click_policy='hide'
p.legend.location='bottom_left'
p.xaxis.axis_label = 'Waiting times (s)'
p.yaxis.axis_label = 'P\u2098 (t)'
p.xaxis.axis_label_text_font_size = '15pt'
p.yaxis.axis_label_text_font_size = '15pt'
show(p)

MAGNITUDE ABOVE m=2
---Fit Parameters model 1---
K= 0.014 +/- 0.002
α= 0.63 +/- 0.03
---Fit Parameters model 2---
K= 545655 +/- 5922170
c= 4746 +/- 1220
---Fit Parameters model 3---
K= 0.09 +/- 0.01
c= 77 +/- 6
α= 0.91 +/- 0.02

MAGNITUDE ABOVE m=3
---Fit Parameters model 1---
K= 0.017 +/- 0.002
α= 0.74 +/- 0.02
---Fit Parameters model 2---
K= 9 +/- 8746320
c= 2208 +/- 5557209084
---Fit Parameters model 3---
K= 0.09 +/- 0.01
c= 51 +/- 3
α= 1.00 +/- 0.02

MAGNITUDE ABOVE m=4
---Fit Parameters model 1---
K= 0.016 +/- 0.002
α= 0.78 +/- 0.02
---Fit Parameters model 2---
K= 10 +/- 137923054
c= 1947 +/- 99546347079
---Fit Parameters model 3---
K= 0.06 +/- 0.02
c= 43 +/- 11
α= 1.00 +/- 0.06

MAGNITUDE ABOVE m=5
---Fit Parameters model 1---
K= 0.008 +/- 0.003
α= 0.74 +/- 0.09
---Fit Parameters model 2---
K= 10 +/- 3262381469
c= 2247 +/- 181669359814
---Fit Parameters model 3---
K= 3 +/- 24
c= 243 +/- 293
α= 1.61 +/- 0.99


α-parameters= [0.63 0.74 0.78 0.74]


### OBSERVATIONS

- the fit are not really good: our aim is to find the best parameter p for every function, so the best one for all m is the model 1

-  real data series include a complex series of foreshocks, main shocks and aftershocks, and just which events are main shocks may be difficult to define

- Observing empirical waiting time probability distributions demonstrate a characteristic pattern, with power law behaviour at short and intermediate waiting times, and a marked decrease in the number of events at larger waiting times

- It has been claimed that the observed characteristic pattern reflects important physics of the system related to a change in event correlation which scales in a defined manner with changes in threshold magnitude, waiting time and spatial distance

- It is relevant to recall that the Omori law is a purely empirical law without any complete underlying physical theory. It is therefore likely that it is, on a fundamental level, only an approximation to the seismological reality

- aftershocks don’t all occur instantly at the time of the mainshock, but they have a time dependence that is described by Omori’s law

- The waiting time distribution seen contains the Omori characteristic, with a rise, near power law segment for the intermediate waiting times and fall-off at larger waiting times. However, the waiting times tails do not fit well to a single Omori sequence

- The Omori waiting time probability distribution can be divided in two parts: \
(1) The distribution for the shortest waiting times which shows power law behaviour consistent with the fact that the rate can be approximated to a constant rate \
(2) For big waiting times the curve rapidly decays

-  There are small deviations from these simple models how we can see in the pdf

- It is important to note that the range of the power-law region varies with cutoff magnitude