# Module 4, Week 2 In Class Exercise

Non-linear model fitting

**Before-class reading: Chapter C of the USGS HayWired Report**

**Last week we:**
- Loaded marine bathymetry data and age of seafloor (found from marine magnetics) from several oceanic ridges
- Used `interpolate.Rbf` to do a 2D interpolation of age data
- Plotted bathymetry data vs. age
- Fit a bathymetry(age) model
- Compared our model with the model of Stein and Stein (1992)

**Our goals for today:**
- Load peak ground acceleration observations from two notable M6 quakes in California
- Fit a ground motion prediction equation (GMPE) using `polyfit`
- Vary our assumed mean event depth to find a better-fitting model
- Fit a GMPE after weighting the data by the distance distribution using `linalg.solve`


## Setup

Run this cell as is to set up your environment.

In [None]:
import math
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Analysis of Strong Ground Motion Data

Earthquakes are the sudden dislocation of rock on opposite sides of a fault due to applied stress. This process generates seismic waves that propagate away from the fault and affect nearby communities. It is this strong shaking that we recognize as the earthquake. The motions can trigger landslides and liquefaction of the ground, and they affect anything built within or on the ground, so they touch many aspects of modern society. Earthquake engineering is the field that studies the ground motions generated by earthquakes and how they affect the built environment. Using ground motions in engineering applications requires studying the physics of seismic wave propagation and developing models that describe it effectively. Accurately modeling and predicting seismic wave amplitudes is particularly important. Such studies generally focus on peak acceleration and peak velocity as a function of distance from the source. The physics indicates that ground motions generally decrease in amplitude with increasing distance.

In this module we will investigate peak ground acceleration observations from two notable M6 earthquakes in California: the 2004 Parkfield and the 2014 West Napa earthquakes. You will analyze the data by fitting a ground motion prediction equation (GMPE), i.e., an attenuation relationship that describes the rate at which ground motions decrease with increasing distance from the source. GMPEs are of primary importance for forecasting the effects of future earthquakes, taking into account the uncertainty that appears as the observed variance of motions about the median level across many events. This information, coupled with the statistics of earthquake occurrence rates (notably Gutenberg-Richter statistics), provides the framework for characterizing future ground motion hazard.

In [None]:
from IPython.display import YouTubeVideo
# https://www.youtube.com/watch?v=D6vX0926aGI
# M6.7 Earthquake on the West Napa Fault
# Video credit: SCEC
YouTubeVideo('D6vX0926aGI', width="1000", height="500")

## Part 1, Load and Plot Peak Ground Acceleration Data

Make a plot showing the data in both linear-linear and log-log projections. Note that a single earthquake can have significant gaps in coverage, but when the earthquakes are considered together a more complete representation of strong ground motion attenuation may be obtained. Note also that the data are non-linear even in the log-log projection. The following is an example for the 2004 Parkfield earthquake.

<img src="./parkonly.png">

One 'g' is the gravitational acceleration at the surface of the Earth and has a value of 981 cm/$s^2$. Earthquake engineers commonly express peak ground acceleration in these units in their geotechnical materials and structural engineering analyses. People can generally perceive shaking at about 0.1%g; at 2%g some people may be disoriented; at 50%g the shaking is very violent and unengineered structures can suffer damage and collapse, while well-engineered buildings can survive if the duration is short.
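As a quick illustration of the units described above, the following sketch converts an acceleration in cm/$s^2$ to g and to %g. The helper names `cm_s2_to_g` and `g_to_percent_g` and the sample value are hypothetical and are not part of the exercise data.

```python
G_CM_S2 = 981.0  # gravitational acceleration in cm/s^2, as quoted above

def cm_s2_to_g(acc_cm_s2):
    """Convert an acceleration in cm/s^2 to units of g."""
    return acc_cm_s2 / G_CM_S2

def g_to_percent_g(acc_g):
    """Express an acceleration given in g as a percentage of g."""
    return 100.0 * acc_g

# Hypothetical peak acceleration of 196.2 cm/s^2 (not taken from the data files)
example_g = cm_s2_to_g(196.2)
print(f"{example_g:.2f} g = {g_to_percent_g(example_g):.0f} %g")
```

The plots in this notebook label peak ground acceleration in g, so a value of 0.5 on the vertical axis corresponds to the 50%g level of very violent shaking.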

In [None]:
#Read Napa and Parkfield Earthquake Peak Ground Acceleration Data
ndist, npga=np.array(pd.read_table('napa_pga.txt')).transpose()
pdist, ppga=np.array(pd.read_table('park_pga.txt')).transpose()
dist=np.hstack((ndist,pdist))
pga =np.hstack((npga,ppga))

In [None]:
#Plot the two data sets
fig, ax = plt.subplots()
plt.plot(...,...)
ax.set(xlabel='Distance (km)', ylabel='Peak ground acceleration (g)',
       title='Peak Acceleration Data')
plt.legend(['Napa','Parkfield'],fontsize=12)
plt.show()

fig, ax = plt.subplots()
plt.loglog(...,...)
ax.set(xlabel='Distance (km)', ylabel='Peak ground acceleration (g)',
       title='Peak Acceleration Data')
plt.legend(['Napa','Parkfield'],fontsize=12,loc=3)
plt.show()

## Part 2, Fitting Strong Motion Data

In order to use the observations of peak ground acceleration (and other parameters such as peak velocity or spectral acceleration quantities) it is necessary to develop a model that accurately describes their behavior. From physics it is understood that in the far field (large distance compared to the source dimension) ground motions decay as a power law with distance, due to the spreading of wave energy in three dimensions as the wavefield travels outward from the earthquake source. This is called geometrical spreading. In addition, there is an inelastic attenuation term that accounts for dissipative energy loss due to material imperfections. Based on theory, the following simple relationship describes this behavior.

$pga=a*{\frac{1}{r^b}}*e^{cr}$

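where $r=\sqrt{(dist^2 + h^2)}$ is the total distance from the source taking into account an average depth $h$, $a$ is a coefficient that depends on magnitude and scales the overall motions, $b$ is the exponent for the power-law geometrical spreading term, and $c$ is the coefficient for the inelastic term (important only at large distances). Taking the natural logarithm of this equation yields a relationship that is linear in the model coefficients:

$\mathrm{ln}(pga)=\mathrm{ln}(a) - b*\mathrm{ln}(r) + c*r$

In the code below the fitted intercept and slope are still labeled $a$ and $b$; they correspond to $\mathrm{ln}(a)$ and $-b$ of the equation above, so the fit is written as $\mathrm{ln}(pga)=a + b*\mathrm{ln}(r) + c*r$.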

For this exercise we will fit the above equation to the data assuming that $c=0$. 
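To see why taking logarithms helps, here is a minimal sketch on synthetic data (not the earthquake observations): a noisy power law becomes a straight line in log-log space, so a first-degree `np.polyfit` recovers the exponent as the slope and the log of the scale factor as the intercept. The values `true_A` and `true_B` and the noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, noise-perturbed power law y = A * x**B (all values are made up)
true_A, true_B = 2.0, -1.5
x = np.linspace(1.0, 100.0, 200)
y = true_A * x**true_B * np.exp(rng.normal(0.0, 0.05, x.size))

# Straight-line fit in log-log space: ln(y) = ln(A) + B*ln(x)
# np.polyfit returns the highest-degree coefficient first: [slope, intercept]
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)

print(f"slope (B) = {slope:.3f}, intercept (ln A) = {intercept:.3f}")
print(f"recovered A = {np.exp(intercept):.3f}")
```

Note the coefficient order: the slope comes first and the intercept second, which is why the notebook indexes the solution as `soln[0]` for the slope and `soln[1]` for the intercept.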

__Compute the ground motion prediction equation (GMPE) for the combined Parkfield and Napa earthquake data sets, plot the results, and print the best fitting solution parameters.__


The following is an example of an unweighted least squares inversion and the 95% confidence intervals for the combined data set.

<img src="./unweightedfit1.png">
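The band drawn around a fit like this one can be sketched on synthetic data as follows. This is only an illustration of the general recipe used in this notebook (residual variance with two fitted parameters, then a two-sided Student-t critical value); the straight line and noise level are invented.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.2, x.size)   # synthetic straight line

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)                # residuals between data and model
dof = x.size - 2                                   # two fitted parameters
se = np.sqrt(np.sum(resid**2) / dof)               # standard error of the estimate

t_crit = stats.t.ppf(1 - 0.05 / 2, dof)            # two-sided 95% critical value
lower = intercept + slope * x - t_crit * se
upper = intercept + slope * x + t_crit * se
inside = np.mean((y >= lower) & (y <= upper))      # fraction of points inside the band
print(f"t = {t_crit:.3f}, fraction of points inside band = {inside:.2f}")
```

When the fit is done on log(pga), the bounds are computed in log space and exponentiated afterwards, which is why they appear as parallel offsets of the model curve on a log-log plot.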

In [None]:
#Create total distance accounting for an average depth h
h=4.0
r=np.sqrt(dist**2 + h**2)

# Fit the relationship: log(pga) = a + b*log(r)
# Note: log(r) and log(pga) are used to linearize the problem
soln = np.polyfit(...,...,...)

#Create GMPE curve
Model_dist = np.arange(0.1,180,0.1)
Model_r = ...
Model_x = ...
Model_y = ...

In [None]:
#Compute 95% confidence levels
degfree=len(r)-2                           #degrees of freedom
e=np.log(pga)-(soln[1]+soln[0]*np.log(r))  #residuals between data and model
var=np.sum(e**2)/degfree                   #variance
se_y=np.sqrt(var)                          #standard error of the estimate
sdev=np.sqrt(var)                          #standard deviation
#Calculate 95% confidence bounds
t=stats.t.ppf(1-0.05/2,degfree)             #division by 2 to map from single-tail to dual-tail t-distribution
lower95=np.exp(np.log(Model_y)-t*se_y)
upper95=np.exp(np.log(Model_y)+t*se_y)

In [None]:
#Plot Results
fig, ax = plt.subplots()
ax.loglog(...,...,'r.',...,...,'b.',...,...,'k-',linewidth=2)
ax.loglog(Model_dist,lower95,'r-',Model_dist,upper95,'r-',linewidth=1)
ax.set(xlabel='Distance (km)', ylabel='Peak ground acceleration (g)',
       title='Peak Acceleration Data and Unweighted Least Squares Inversion')
plt.legend(['Napa','Parkfield'],fontsize=12,loc=3)
plt.show()

print(f'h={h:.1f}; a, Intercept= {...:.3f}; b, Slope={...:.3f}; Variance={var:.3f}')

## Part 3, Vary Depth and Evaluate Variance

Next we want to test our assumption that the mean depth of the earthquakes is 4.0 km. We will define a function that takes a value of $h$, solves for the $a$ and $b$ coefficients, and quantifies the fit using the variance of the data with respect to the resulting model. We then use this function to compute the variance between our data and models as a function of the input $h$.
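Scanning one parameter and keeping the value with the smallest misfit is a simple grid search. The sketch below shows the pattern on a made-up misfit curve; `misfit` is a hypothetical stand-in with a known minimum, not the GMPE variance you will compute.

```python
import numpy as np

def misfit(p):
    """Hypothetical misfit curve with a single minimum at p = 3.2."""
    return (p - 3.2)**2 + 0.5

trial = np.arange(0.1, 10.0 + 0.05, 0.1)     # trial parameter values 0.1, 0.2, ..., 10.0
scores = np.array([misfit(p) for p in trial])

best_index = np.argmin(scores)               # index of the smallest misfit
best_p = trial[best_index]
print(f"best p = {best_p:.1f}, misfit = {scores[best_index]:.3f}")
```

The resolution of the answer is limited by the grid spacing, so a finer grid near the minimum can be used if more precision is needed.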


In [None]:
def gmpe_fit(dist,pga,h):
    r=np.sqrt(dist**2 + h**2)

    # Fit the relationship: log(pga) = a + b*log(r)
    # Note: log(r) and log(pga) are used to linearize the problem
    soln = ...

    #Compute variance
    degfree=len(r)-2                           #degrees of freedom
    e=np.log(pga)-(soln[1]+soln[0]*np.log(r))  #residuals between data and model
    var=np.sum(e**2)/degfree                   #variance

    return var

In [None]:
depth = ... # depth from 0.1 to 10 km
var_d = np.zeros(len(depth))
for i in range(0,len(depth),1):
    var_d[i] = gmpe_fit(dist,pga,depth[i])

    
fig, ax = plt.subplots()
ax.plot(...,...)
ax.set(xlabel='Depth (km)', ylabel='Variance',
       title='Fit of model as a function of assumed average event depth')
plt.show()   

What depth produces the best fitting model (minimum variance)?

In [None]:
min_var = np.min(var_d)
min_depth = ...
print(min_depth)

_Write your answer here._

Run the Part 2 inversion again with this depth.

## Part 4, Weighted Fitting

You have probably noticed that the data are not uniformly distributed in distance. In such circumstances it is often advantageous to weight the data prior to inversion to make their respective influence on the model more equal. In this particular problem we can consider discrete distance bins of 10 km width from 0 to 180 km and count the number of observations in each bin. The weight is then the inverse of the count: bins with many observations are given less weight and bins with few observations are given more, with the goal of making the influence on the model more equal over the entire distance range.
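The weighting idea can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the notebook's solution: the data are an invented straight line sampled densely at small `x` and sparsely at large `x`, and `np.digitize` is used here as a compact substitute for an explicit loop that assigns each sample to its bin.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic line y = 2 - 1.5*x, sampled densely at small x and sparsely at large x
x = np.concatenate([rng.uniform(0.0, 1.0, 90), rng.uniform(1.0, 5.0, 10)])
y = 2.0 - 1.5 * x + rng.normal(0.0, 0.1, x.size)

# Count samples per bin and look up each sample's bin index
edges = np.arange(0.0, 5.0 + 1.0, 1.0)            # bin edges 0, 1, ..., 5
counts, _ = np.histogram(x, bins=edges)
bin_index = np.clip(np.digitize(x, edges) - 1, 0, counts.size - 1)

# Inverse-count weights (+1 guards against empty bins), normalized to sum to 1
w = 1.0 / (counts[bin_index] + 1)
w = w / w.sum()

# Weighted least squares: scale rows by sqrt(w), then solve the normal equations
sw = np.sqrt(w)
A = np.column_stack((x * sw, sw))                  # columns: slope, intercept
d = y * sw
slope, intercept = np.linalg.solve(A.T @ A, A.T @ d)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Scaling both the design matrix and the data by the square root of the weights means the ordinary normal equations minimize the weighted sum of squared residuals.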

Compare your results with and without the weighting scheme.

In [None]:
# plot a histogram of distances
binstep=10.  # 10 km bins
bins=np.arange(0.,180.+binstep,binstep)  # bin edges 0, 10, ..., 180 km (arange excludes the stop value)
...(dist,bins=bins,density=False);

If we weight all the observations equally, which distances will dominate the result (i.e., which distance bin has the most measurements)?

_Write your answer here._

In [None]:
#Determine distance weighting bins
binstep=10.  # 10 km bins
bins=np.arange(0.,180.+binstep,binstep)  # bin edges 0, 10, ..., 180 km (arange excludes the stop value)
num=...(dist,bins=bins,density=...) # use np.histogram to sort the dist data into bins and count occurrences

ww=1/(num[0]+1) #weight is the inverse of the count in each bin; 1 is added to avoid division by zero for empty bins

#Loop over distances and assign each a weight
w=np.ones(len(dist))
for i in range(0,len(dist),1):
    for j in range(0,len(ww),1):
        if abs(dist[i]-bins[j]) <= binstep:
            I=j
            break
    w[i]=ww[I]


w=w/np.sum(w)        #normalize the weights so that they sum to 1


In [None]:
#Create total distance accounting for an average depth h
h=...
r=np.sqrt(dist**2 + h**2)

#Create weighted A and D matrices. Note: log(r) and log(pga) are used to linearize the problem
tmp=np.ones(len(r))
A=np.column_stack((...,...))  #columns: log(r)*sqrt(weights), sqrt(weights); column_stack takes one tuple
D=...   #log(pga)*sqrt(weights)

#Create ATA ATD and invert
ATA=np.dot(np.transpose(A),A)
ATD=np.dot(np.transpose(A),D)
W_soln=np.linalg.solve(...,...)

#Create GMPE curve
W_Model_dist = np.arange(0.1,180,0.1)
W_Model_r = np.sqrt(W_Model_dist**2 + h**2)
W_Model_x = np.log(W_Model_r)
W_Model_y = np.exp(np.polyval(W_soln,W_Model_x))

In [None]:
#Compute 95% confidence levels
degfree=len(r)-2                           #degrees of freedom
e=...-...  #residuals between data and model
var=np.sum(e**2)/degfree                   #variance
se_y=np.sqrt(var)                          #standard error of the estimate
sdev=np.sqrt(var)                          #standard deviation
#Calculate 95% confidence bounds
t=stats.t.ppf(1-0.05/2,degfree)             #division by 2 to map from single-tail to dual-tail t-distribution
W_lower95=np.exp(np.log(W_Model_y)-...)
W_upper95=np.exp(np.log(W_Model_y)+t*se_y)

In [None]:
#Plot Inversion Results
fig, ax = plt.subplots()
ax.loglog(ndist, npga,'r.',pdist,ppga,'b.',W_Model_dist,W_Model_y,'k-',Model_dist,Model_y,'--')
ax.loglog(W_Model_dist,W_lower95,'r-',W_Model_dist,W_upper95,'r-',linewidth=1)
ax.set(xlabel='Distance (km)', ylabel='Peak ground acceleration (g)',
       title='Peak Acceleration Data and Weighted Least Squares Inversion')
plt.legend(['Napa','Parkfield','Weighted','Unweighted'],fontsize=12)
plt.show()

print(f'h={h:.1f}; Intercept= {W_soln[1]:.3f}; Slope={W_soln[0]:.3f}; Variance={var:.3f}')

Compare your results with and without the weighting scheme.

_Write your answer here._