# Module 3, Week 1 In Class Exercise

Linear regression

**Before class reading: Data 8 textbook Chapter 15**

**Last week we:**
- Loaded a Bay Area seismic catalog.
- Computed the distance and time interval between earthquakes, and used these to identify aftershocks.
- Removed the aftershocks from the catalog (declustering).

**Our goals for today:**
- Learn how to deal with bivariate data (fitting lines, curves).
- Apply line fitting to determine the spreading rate of various ocean ridges.

## Setup

Run this cell as it is to set up your environment.

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
from cartopy import config
import cartopy.crs as ccrs

### Bivariate statistics

There are many examples in the Earth Sciences where we are interested in the dependence between two types of data. For example, the distance from the ridge crest versus age gives you a spreading rate. The depth in a sediment core versus age gives you a sedimentation rate. The ratio of the radioactive form of carbon, $^{14}$C, to a stable form, $^{12}$C, is a function of the age of the material being dated. The difference in arrival times of the $P$ and $S$ seismic waves is related to the distance from the earthquake source to the seismometer receiver. These examples rely on bivariate statistics to get at the desired quantities.


If two variables are associated or dependent, they will show a correlation, or trend, when plotted. Independent variables will show no relation, i.e. just scatter.

<img src="Figures/Correlation_examples.svg" width=900>
> Pearson correlation coefficient between several example X,Y sets. Source: https://en.wikipedia.org/wiki/Correlation_and_dependence

The simplest dependence between two variables is linear, where one variable is proportional to the other. This relationship can be described with the equation $y = mx +b$, where $m$ is the slope of the line and $b$ is the line's y-axis intercept. This equation is called the _model_.

With just two data points, the line connecting them can be found exactly: there are two data to constrain the two unknowns, $m$ and $b$. Once the model is determined, it can be used to predict $y$ for any given $x$.
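
As a hedged aside, the two-point case can be written out symbolically. The labels $(x_1, y_1)$ and $(x_2, y_2)$ for the two points are notation introduced here for illustration and do not appear in the exercise itself:

$$m = \frac{y_2 - y_1}{x_2 - x_1}, \qquad b = y_1 - m\,x_1$$

Substituting either point back into $y = mx + b$ should reproduce its $y$ value, which is a quick check on the arithmetic.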

In [None]:
x = np.asarray([3.0, 5.0]);
y = np.asarray([2.0, 7.0]);

m = ...; # use the two points to find the slope rise/run, dy/dx
b = ...; # solve for the y-intercept

plt.figure(1,(5,5)) 
plt.plot(x,y,'o')
plt.plot(x,m*x+b,'-')
plt.xlabel('X ', fontsize=16);
plt.ylabel('Y', fontsize=16);
plt.grid()

However, the more usual case in the Earth Sciences is an _overdetermined_ problem: there are more data than unknowns in our model, and the data are scattered. For overdetermined problems we find the best-fitting line to the data, which is the line that minimizes the misfit between the data and the model. This is called _linear regression_. It uses the method of _least squares_, in which the _mean squared_ misfit is minimized.

<img src="Figures/best_fit.png" width=300> <img src="Figures/worse_fit.png" width=300>
>Source: https://www.inferentialthinking.com/chapters/15/3/Method_of_Least_Squares
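
To make "minimizing the mean squared misfit" concrete, here is a minimal sketch that compares the misfit of a least-squares line with that of an arbitrary guess. The data, the helper `mean_squared_misfit`, and the guessed slope and intercept are all made up for illustration. The sketch uses `np.polyfit`, which is introduced further down.

```python
import numpy as np

# Hypothetical noisy data scattered about the line y = 2x + 1 (illustrative only).
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 10.0, 50)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0.0, 1.0, x_demo.size)

def mean_squared_misfit(slope, intercept, x, y):
    """Mean of the squared vertical distances between the data and the line."""
    residuals = y - (slope * x + intercept)
    return np.mean(residuals**2)

# Misfit of the least-squares line versus an arbitrary guessed line.
m_fit, b_fit = np.polyfit(x_demo, y_demo, 1)
mse_fit = mean_squared_misfit(m_fit, b_fit, x_demo, y_demo)
mse_guess = mean_squared_misfit(1.5, 3.0, x_demo, y_demo)

print('least-squares line: %6.3f' % mse_fit)
print('guessed line:       %6.3f' % mse_guess)
```

Whatever slope and intercept you guess, the mean squared misfit should never come out lower than that of the least-squares line.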

In [None]:
# make up some randomly scattered linearly related data
x = np.random.randn(100)*5
m = 2
# the scatter comes from b: each point gets its own intercept, drawn uniformly from [0, 10)
b = np.random.rand(100)*10
y = m*x+b

In [None]:
# plot the data
plt.figure(1,(5,5)) 
plt.plot(x,y,'o')
plt.xlabel('X ', fontsize=16);
plt.ylabel('Y', fontsize=16);
plt.grid()

To fit a line, $y = mx + b$, through some noisy data-points we can use the numpy functions `np.polyfit` and `np.polyval`. `np.polyfit` takes the $x$ and $y$ data arrays, and the degree of our model (1 for linear, 2 for quadratic, etc.) as inputs, and returns the model parameters of the least-squares best-fit line as outputs. `np.polyval` takes the model parameters and any $x$ values as input, and returns the $y$ values of the model at the given $x$ values. Another function, which we will look at next week, that returns the model parameters of a least-squares best-fit line is `np.linalg.lstsq`.
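
The degree argument and the ordering of the returned coefficients are easiest to see on a curve rather than a line. The sketch below fits a quadratic to noise-free, made-up samples; the variable names and values are illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Hypothetical noise-free samples of y = 2x^2 - 3x + 1 (illustrative values only).
x_quad = np.linspace(-2.0, 2.0, 9)
y_quad = 2.0 * x_quad**2 - 3.0 * x_quad + 1.0

# Degree 2 returns three coefficients, ordered from the highest power down.
coeffs = np.polyfit(x_quad, y_quad, 2)

# Evaluate the fitted model at new x values.
x_new = np.array([0.0, 1.0])
y_new = np.polyval(coeffs, x_new)

print(coeffs)
print(y_new)
```

Because the samples carry no noise, the recovered coefficients should agree with 2, -3 and 1 to within floating-point error. For a degree-1 fit the same ordering means the slope comes first and the intercept second.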

In [None]:
m, b = np.polyfit(...,...,1)
modelY=np.polyval([m, b],...)

print ('slope: %7.3f, intercept: %4.1f'%\
    (m, b))

The fitted slope should be close to the _real_ value $m = 2$ used to create the scattered data, and the fitted intercept should be close to 5, the mean of the random values of $b$.

In [None]:
# now plot the data and the best-fit line 
plt.figure(1,(5,5)) 
plt.plot(x,y,'o')
plt.plot(x,modelY,'k-') 
plt.xlabel('X', fontsize=16);
plt.ylabel('Y', fontsize=16);
plt.grid()

We'd also like to know how well this model fits our data, i.e. how correlated the data are. We'll use $R^{2}$ for this, the square of the Pearson correlation coefficient (also called the coefficient of determination). $R^{2}$ is zero for uncorrelated data and 1 for perfectly linear data (no misfit between the model line and the data). We'll use the scipy function `stats.linregress` to compute $R^{2}$.
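
As a rough illustration of what $R^{2}$ measures, the sketch below computes it by hand as one minus the ratio of the misfit sum of squares to the total sum of squares about the mean, and compares that with the squared `rvalue` from `stats.linregress`. The data and variable names are made up for this example.

```python
import numpy as np
from scipy import stats

# Hypothetical noisy linear data (illustrative only).
rng = np.random.default_rng(1)
x_demo = np.linspace(0.0, 10.0, 40)
y_demo = 0.5 * x_demo + 2.0 + rng.normal(0.0, 0.8, x_demo.size)

# Best-fit line and its predictions at the data locations.
m_fit, b_fit = np.polyfit(x_demo, y_demo, 1)
y_model = np.polyval([m_fit, b_fit], x_demo)

# R^2 = 1 - (misfit sum of squares) / (total sum of squares about the mean).
ss_res = np.sum((y_demo - y_model)**2)
ss_tot = np.sum((y_demo - np.mean(y_demo))**2)
r_squared = 1.0 - ss_res / ss_tot

# Same quantity from scipy: the square of the correlation coefficient.
result = stats.linregress(x_demo, y_demo)
print('by hand: %5.3f, linregress: %5.3f' % (r_squared, result.rvalue**2))
```

For a straight-line fit with an intercept, the two numbers should agree to within rounding error.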



In [None]:
# compute the fit statistics  
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)

print ('slope: %7.3f, intercept: %4.1f, R^2: %5.3f'%\
    (slope, intercept, r_value**2))

## Seafloor Spreading Rates

Now we will look at the relationship between seafloor age (Myr) and distance from the ridge (km) to determine the velocity of the spreading between oceanic plates (km/Myr).

### Mid-Atlantic Ridge

In [None]:
# data from the Mid-Atlantic Ridge
atlantic_data=pd.read_csv('MAR_age_dist.csv')
atlantic_data.head()

We'll plot the location of these age picks.

In [None]:
# set up plot and extent of map
fig = plt.figure(figsize=(20,9))  # create only the figure; the map axes are added on the next line
ax = plt.axes(projection=ccrs.Robinson(-46.0))
lat0=0.0
lat1=60.0
lon0=-90.0
lon1=0
ax.set_extent([lon0, lon1, lat0, lat1], crs=ccrs.PlateCarree())

# colormap
z = atlantic_data.Age
clrmap = plt.get_cmap('jet_r')  # matplotlib.cm.get_cmap is unavailable in recent matplotlib releases
normalize = matplotlib.colors.Normalize(vmin=0, vmax=150)
colors = [clrmap(normalize(value)) for value in z]

# plot age pick locations on map with color set by age
ax.scatter(atlantic_data.Lon,atlantic_data.Lat,marker='o',color=colors,transform=ccrs.PlateCarree())
cax, _ = matplotlib.colorbar.make_axes(ax)
cbar = matplotlib.colorbar.ColorbarBase(cax, cmap=clrmap, norm=normalize)

plt.title('Age Myr')
ax.coastlines()
ax.stock_img()
ax.gridlines()

plt.show()

Next we'll plot the distance from the ridge as a function of seafloor age. 

In [None]:
plt.figure(1,(10,10)) 
plt.plot(atlantic_data.Age,atlantic_data.Distance,'o')
plt.xlabel('Age, Myr ', fontsize=16);
plt.ylabel('Distance, km', fontsize=16);
plt.grid()

In [None]:
# Declare variables for the columns we'll be using further.
age = atlantic_data.Age
abs_dist = abs(atlantic_data.Distance)

In [None]:
plt.figure(1,(10,10)) 
plt.plot(age,abs_dist,'o')
plt.xlabel('Age, Myr ', fontsize=16);
plt.ylabel('Distance, km', fontsize=16);
plt.grid()

In [None]:
m, b= np.polyfit(...,...,1)
print ('Spreading rate (km/Myr): %7.3f, intercept (km): %4.1f'%(m,b))

In [None]:
modelYs=np.polyval(...,...)

# now plot the data and the best-fit line: 
plt.figure(1,(10,10)) 
plt.plot(age,abs_dist,'o')
plt.plot(age,modelYs,'k-') # plot as black line
plt.xlabel('Age, Myr ', fontsize=16);
plt.ylabel('Distance, km', fontsize=16);
plt.grid()

It looks like the data older than 100 Myr have a different slope than the data younger than 100 Myr. Let's fit the two age ranges separately.
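
Fitting subsets separately relies on Boolean masks to select the data on each side of a break. Here is a minimal sketch on made-up data with a change in slope at a hypothetical break point of 50; the arrays `t` and `d` and all values are illustrative and are not the ridge data.

```python
import numpy as np

# Hypothetical data with a change in slope at t = 50 (illustrative only).
t = np.linspace(0.0, 100.0, 101)
d = np.where(t < 50.0, 1.0 * t, 50.0 + 3.0 * (t - 50.0))

# Boolean masks select the samples on each side of the break.
early = t < 50.0
late = t >= 50.0

# Fit each subset independently.
m_early, b_early = np.polyfit(t[early], d[early], 1)
m_late, b_late = np.polyfit(t[late], d[late], 1)

print('slope before break: %5.2f, after break: %5.2f' % (m_early, m_late))
```

The same mask used to select the data for a fit can be reused to select the $x$ values at which that fit is evaluated and plotted.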

In [None]:
m_young,b_young = np.polyfit(...,...,1)
m_old,b_old = np.polyfit(...,...,1)

modelY1=np.polyval(...,...)
modelY2=np.polyval(...,...)

# now plot the data and the best-fit line: 
plt.figure(1,(10,10)) 
plt.plot(age,abs_dist,'o')
plt.plot(age[age<100],modelY1,'k-') # plot as black line
plt.plot(age[age>100],modelY2,'k-') # plot as black line
plt.xlabel('Age, Myr ', fontsize=16);
plt.ylabel('Distance, km', fontsize=16);
plt.grid()

In [None]:
print ('Spreading rate since 100 Myr (km/Myr): %7.3f'%(m_young))
print ('Spreading rate before 100 Myr (km/Myr): %7.3f'%(m_old))

Did spreading at the Mid-Atlantic Ridge speed up or slow down?

_Write your answer here._

By how much?

_Write your answer here._

### Pacific-Antarctic Ridge

In [None]:
# data from the Pacific Antarctic Ridge
pacific_data=pd.read_csv('PAR_age_dist.csv')
pacific_data.head()

We'll plot the location of these age picks.

In [None]:
# set up figure and map extent
fig = plt.figure(figsize=(20,9))  # create only the figure; the map axes are added on the next line
ax = plt.axes(projection=ccrs.Robinson(-113.0))
lat0=-40.0
lat1=-80.0
lon0=-175.0
lon1=-90.0
ax.set_extent([lon0, lon1, lat0, lat1], crs=ccrs.PlateCarree())

# colormap scaled with age
z = pacific_data.Age
cmap = plt.get_cmap('jet_r')  # matplotlib.cm.get_cmap is unavailable in recent matplotlib releases
normalize = matplotlib.colors.Normalize(vmin=0, vmax=150)
colors = [cmap(normalize(value)) for value in z]

#scatter plot of age locations
ax.scatter(pacific_data.Lon,pacific_data.Lat,marker='o',color=colors,transform=ccrs.PlateCarree())
cax, _ = matplotlib.colorbar.make_axes(ax)
cbar = matplotlib.colorbar.ColorbarBase(cax, cmap=cmap, norm=normalize)

plt.title('Age Myr')
ax.coastlines()
ax.stock_img()
ax.gridlines()

plt.show()

What do you notice about the time span and data spacing of this dataset compared with the data from the Atlantic?

_Write your answer here._

In [None]:
plt.figure(1,(10,10)) 
plt.plot(pacific_data.Age,pacific_data.Distance,'o')

plt.xlabel('Age, Myr ', fontsize=16);
plt.ylabel('Distance, km', fontsize=16);
plt.text(10,2000, 'Eastern Side', fontsize=16);
plt.text(10,1700, 'Antarctic Plate', fontsize=16);
plt.text(10,-2000, 'Western Side', fontsize=16);
plt.text(10,-2300, 'Pacific Plate', fontsize=16);
plt.grid()

In [None]:
# Declare variables for the columns we'll be using further.
age = pacific_data.Age
dist = pacific_data.Distance

Fit the Pacific and Antarctic plates with separate best-fit lines. Use Boolean indexing (i.e. `dist > 0` and `dist < 0`) together with `polyfit` and `polyval`.

Plot the data and the best-fit lines.

Print the spreading rates.

Which plate is spreading faster?

_Write your answer here._

How fast are a point on the Antarctic plate and a point on the Pacific plate moving apart?

_Write your answer here._

How good are these fits? Compute and print $R^2$ for each plate.

In [None]:
# compute the fit statistics 
slope, intercept, r_value_ant, p_value, std_err = stats.linregress(age[(dist > 0)],dist[(dist > 0)])
print ('R^2 Antarctic plate: %5.3f'%(r_value_ant**2))

# Pacific plate



Is this consistent with how scattered the data look about the best-fit lines?

_Write your answer here._