# Module 2, Week 1 In Class Exercise

Splitting and filtering data.

**Review from Data 8 textbook Chapters **

**Before class reading: Fundamentals of Geophysics  **

**Last week we:**
- Loaded and visualized an earthquake catalog.
- Plotted earthquake magnitude and depth.
- Learned some more complicated mapping techniques.

**Our goals for today:**
- pandas DataFrames, indexing, and data cleaning.
- Load marine geophysical data (bathymetry and marine magnetic anomalies) from two oceanic ridges.
- Select data and drop rows with gaps.
- Plot bathymetry data and evaluate spreading rate.
- Declare a function to detrend and filter magnetic anomaly data.
- Plot marine magnetic anomaly data and compare spreading rates.

## Setup

Run this cell as it is to set up your environment.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy import signal

import warnings
warnings.filterwarnings("ignore")

##  Part 1: Data Wrangling

### Arrays and Data Structures

NumPy and pandas offer several types of data structures. The two main structures that we have been using so far, and will keep using, are the NumPy array (`nparray` in this notebook; the actual NumPy type is `ndarray`) and the pandas `DataFrame`. A `nparray` is a fast and flexible container for large datasets that allows you to perform operations on whole blocks of data at once. Arrays are best suited for homogeneous (just one type) numerical data. `DataFrames` are designed for tabular datasets and can handle heterogeneous data (multiple types: int, float, string, etc.).
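
As a small illustration of the homogeneous versus heterogeneous distinction, the sketch below builds one array and one `DataFrame` from made-up values (the names `mixed_arr` and `mixed_frame` and the station table are hypothetical, not part of the course data) and inspects their dtypes.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype: mixing ints and floats upcasts everything to float
mixed_arr = np.array([1, 2.5, 3])
print(mixed_arr.dtype)

# A DataFrame keeps a separate dtype for each column (values are made up)
mixed_frame = pd.DataFrame({
    'station': ['A01', 'B02', 'C03'],   # strings
    'n_picks': [12, 7, 30],             # integers
    'depth_km': [10.0, 33.5, 5.2],      # floats
})
print(mixed_frame.dtypes)
```

The array ends up as all floats, while each `DataFrame` column keeps its own type.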


__nparray__

In [None]:
# Generate a random nparray called arr_data
arr_data = np.random.randn(2,3)
arr_data

In [None]:
# use .shape to determine the shape of arr_data
arr_data.shape

In [None]:
# use .dtype to determine the type of arr_data
arr_data.dtype

In [None]:
# use .ndim to determine the dimensions of arr_data
arr_data.ndim

In [None]:
# Generate a nparray of zeros with np.zeros
arr0 = np.zeros((4,4))
arr0

In [None]:
# Generate a nparray of ones with np.ones
arr1 = np.ones((4,4))
arr1

In [None]:
# np.ones is handy for making a nparray of any single value
arr5 = arr1 * 5
arr5

In [None]:
# Generate an array of integers between 0 and 10 in steps of 1, including 0 (start) but not 11 (end)
arr2 = np.arange(0,11,1) 
arr2


In [None]:
# Generate an array of floats between 0 and 10 in steps of 2, including 0 (start) but not 11 (end)
arr2 = np.arange(0.,11.,2.) 
arr2

In [None]:
# Generate an array of 14 evenly spaced numbers between 0 and 10, including 0 (start) and 10 (end).
arr3=np.linspace(0,10,14) 
arr3

__DataFrame__

`Series` and `DataFrames` are like nparrays, but they have the added feature of index labels assigned to each row and column (the bold labels in the `DataFrame` below). These labels can be used to bin and select data.

In [None]:
# generate a new DataFrame
# note that index values (like the column labels) don't have to be integers and don't have to be in order
frame = pd.DataFrame(np.random.rand(3, 3), index=['a','d','c'], columns=['banana','apple','pear'])
frame
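
One way to use those labels for selection is `.loc`, which looks rows and columns up by label. The sketch below rebuilds a similar 3x3 table under the hypothetical name `demo` (with a seeded random generator so it is repeatable) and pulls out a row, a single value, and a sub-table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the example is repeatable
demo = pd.DataFrame(rng.random((3, 3)),
                    index=['a', 'd', 'c'],
                    columns=['banana', 'apple', 'pear'])

row_d = demo.loc['d']                            # one row, returned as a Series
cell = demo.loc['c', 'apple']                    # one value, by row and column label
sub = demo.loc[['a', 'c'], ['pear', 'banana']]   # sub-table, in the order requested

print(row_d)
print(cell)
print(sub)
```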

We've seen `DataFrame` structures before in our tabular data files, such as the .csv (comma-separated values) data file of all the earthquakes of magnitude 4 and higher from 2000 - 2012 in the ANSS (Advanced National Seismic System) Comprehensive Catalog, or "ComCat."

In [None]:
EQ_catalog = pd.read_csv('ANSS_2000_2012.csv',header=8,names = ['DateTime','Latitude','Longitude','Depth','Magnitude','MagType','NbStations','Gap','Distance','RMS','Source','EventID'])
EQ_catalog.head()


There are two ways to reference individual columns (which are called `Series`): `DataFrame.Series` and `DataFrame['Series']`. These do the same thing.

In [None]:
print(type(EQ_catalog.Depth))
print(type(EQ_catalog['Depth']))

The `.values` attribute returns the values of the `Series` as a `nparray`, without the labeled index of the `Series`.

In [None]:
type(EQ_catalog.Depth.values)


### Indexing and Slicing

<img src="Figures/indices.png" width=900>
> Source: Python for Data Analysis (2nd Edition) McKinney, W.

<img src="Figures/array_slicing.png" width=900>
> Source: Python for Data Analysis (2nd Edition) McKinney, W.

In [None]:
# generate a random array
arr_data = np.random.randn(10,5)
arr_data

In [None]:
# slice out the first 3 rows of arr_data
a = arr_data[...]
a

In [None]:
# slice out the last 2 columns of arr_data
b = arr_data[...]
b

Slicing a `DataFrame` is a bit different because you can reference the index labels.

In [None]:
# slice out the first 10 rows of EQ_catalog
EQ_catalog[...]

In [None]:
# slice out a chunk of Depths
EQ_catalog.Depth[5:10]

If you just want the values from that chunk and not the index labels use `.values`.

In [None]:
EQ_catalog.Depth.values[5:10]

This matters for elementwise arithmetic: pandas aligns the two `Series` on their index labels rather than on position, so adding two slices whose labels do not overlap gives `NaN` everywhere.

In [None]:
EQ_catalog.Depth[5:10]+EQ_catalog.Depth[10:15]

In [None]:
EQ_catalog.Depth.values[5:10]+EQ_catalog.Depth.values[10:15]
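
The same behavior can be seen on a tiny made-up `Series` (the name `depths` and its values are hypothetical). Adding two slices as `Series` aligns on labels, while adding their `.values` works by position.

```python
import pandas as pd

depths = pd.Series([10.0, 20.0, 30.0, 40.0])   # default labels 0, 1, 2, 3

first = depths.iloc[0:2]   # keeps labels 0 and 1
last = depths.iloc[2:4]    # keeps labels 2 and 3

# Series + Series: aligned on labels; no label is shared, so every result is NaN
by_label = first + last

# array + array: aligned on position, labels are ignored
by_position = first.values + last.values

print(by_label)
print(by_position)
```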

You can also use `reindex` to rearrange or add/delete DataFrame index labels.

In [None]:
frame2 = frame.reindex(['a','b','c','d'])
frame2
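
A minimal sketch of what `reindex` does, using a hypothetical one-column frame called `small`: labels that are requested but missing get a row of `NaN`, labels that are left out are dropped, and rows come back in the order requested.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'd', 'c'])

# 'b' does not exist in small, so it is added as a NaN row; rows are reordered
expanded = small.reindex(['a', 'b', 'c', 'd'])

# 'd' is not requested, so it is dropped
trimmed = small.reindex(['c', 'a'])

print(expanded)
print(trimmed)
```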

### Boolean Indexing

We can use Boolean indexing to keep only the rows of our DataFrame where a condition is `True`.

In [None]:
# use Boolean Indexing to fish out rows with magnitudes larger than 6.5
large_mag=EQ_catalog[...]

large_mag.head()


In [None]:
# use Boolean Indexing to fish out depths of earthquakes with magnitudes of 6.5 or larger
large_mag_depths=xxx[EQ_catalog.Magnitude>=6.5]
large_mag_depths.head()

# note that the original index information is retained
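
To see the mechanics of a Boolean mask on its own, here is a sketch on a hypothetical five-row catalog called `quakes` (made-up values, and a depth condition rather than the magnitude one used above). The comparison produces a `Series` of `True`/`False`, which then selects rows or filters a single column.

```python
import pandas as pd

# Hypothetical mini-catalog; the values are made up for illustration
quakes = pd.DataFrame({
    'Magnitude': [4.2, 6.8, 5.1, 7.3, 6.5],
    'Depth': [10.0, 35.0, 8.0, 600.0, 22.0],
})

is_deep = quakes.Depth > 300.0           # boolean Series, one True/False per row
deep_rows = quakes[is_deep]              # whole rows where the mask is True
deep_mags = quakes.Magnitude[is_deep]    # one column, filtered by the same mask

print(is_deep.tolist())
print(deep_rows)
print(deep_mags)
```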

### Sorting

nparrays can be sorted in place using the `.sort()` method. Put the axis you want to sort along in the parentheses.

In [None]:
arr_data = np.random.randn(6,3)
arr_data

In [None]:
arr_data.sort(0)
arr_data

Note that this sorts each column (or row) independently, rather than sorting by just one column (or row) and keeping the rows (or columns) intact.

In [None]:
arr_data = np.random.randn(6,3)
arr_data.sort(1)
arr_data
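
If the goal is instead to sort by one column while keeping each row together, one common approach is `np.argsort`. The sketch below uses a small made-up array named `table` to contrast the two behaviors.

```python
import numpy as np

table = np.array([[3.0, 10.0],
                  [1.0, 30.0],
                  [2.0, 20.0]])

# np.sort along axis 0 sorts each column on its own, so the original rows are broken up
independent = np.sort(table, axis=0)

# argsort returns the row order that would sort the first column...
order = np.argsort(table[:, 0])
# ...and indexing with that order rearranges whole rows
by_first_col = table[order]

print(independent)
print(by_first_col)
```

`np.sort` returns a sorted copy, whereas the `.sort()` method used above modifies the array in place.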

DataFrames can be sorted by their index labels (`.sort_index`) or by the values in a given column (`.sort_values`).

In [None]:
EQ_catalog.sort_index(axis=1).head()

In [None]:
EQ_catalog.sort_values(by=['Magnitude']).head()

You can reverse the order of sorting with `ascending=False`.

In [None]:
EQ_catalog.sort_values(by=['Magnitude'],ascending=False).head()

You can also sort first by one column and then by another.

In [None]:
EQ_catalog.sort_values(by=['Magnitude','Depth']).tail(15)

### Data Cleaning - replacing data, removing data, find duplicates and missing data (NaNs)

`np.where` can be used to replace values of an array.

In [None]:
arr_data = np.random.randn(6,3)
print(arr_data)

In [None]:
# replace the elements of arr_data that are <0 with 3.0
arr_data2=np.where(...) 
arr_data2
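
For reference, `np.where(condition, value_if_true, value_if_false)` picks elementwise between its last two arguments. The sketch below shows the pattern on a small made-up array (`vals`) with a different condition from the exercise above, along with the one-argument form that returns indices.

```python
import numpy as np

vals = np.array([[0.2, 1.5],
                 [2.0, -0.7]])

# Three-argument form: where vals > 1.0 use 1.0, otherwise keep the original value
capped = np.where(vals > 1.0, 1.0, vals)

# One-argument form: row and column indices of the elements where the condition holds
rows, cols = np.where(vals > 1.0)

print(capped)
print(rows, cols)
```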

`.drop()` can be used to drop whole columns from a DataFrame.

In [None]:
EQ_data=EQ_catalog.drop(['MagType','NbStations','Gap', 'Distance','RMS','Source','EventID'], axis='columns')
EQ_data.head()

`.unique()` returns the unique values from the specified object.

In [None]:
unique_mags = EQ_data.Magnitude.unique()
unique_mags.sort()
unique_mags

`.value_counts()` returns the count of each unique value from the specified object.

In [None]:
EQ_data.Magnitude.value_counts()

This can be used to find duplicate values.

In [None]:
EQ_data.DateTime.value_counts()

In [None]:
EQ_data[EQ_data.DateTime == '2001/03/07 02:49:42.87']

Two earthquakes at the same time!

`np.isnan` returns a boolean object with True where NaNs appear in the DataFrame.

In [None]:
frame2.head()

In [None]:
np.isnan(frame2)
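
Because `True` counts as 1, a NaN mask can also be summed to count missing values. The sketch below does this on a hypothetical frame called `gappy` and, as an assumption about what you might want next, uses the pandas methods `.isna()` and `.dropna()` as label-aware counterparts.

```python
import numpy as np
import pandas as pd

gappy = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, 5.0, 6.0]},
                     index=['a', 'b', 'c'])

nan_mask = np.isnan(gappy)     # True wherever a value is NaN
n_missing = nan_mask.sum()     # number of NaNs in each column
complete = gappy.dropna()      # keep only rows with no NaN at all

print(nan_mask)
print(n_missing)
print(complete)
```

For numeric data `np.isnan(gappy)` and `gappy.isna()` give the same mask; `.isna()` also works on columns of strings, where `np.isnan` raises an error.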

__Your turn__

Create a new 5x3 `DataFrame` of random numbers using `something = pd.DataFrame( ... , index=[''], columns=[''])`

Sort your `DataFrame` by the first column using `.sort_values(by=[''])`.

Select the rows where the second column is positive using boolean indexing.

##  Part 2: Marine Geology - Bathymetry and Magnetic Anomalies

We'll look at marine magnetics and bathymetry data from two surveys, one from the Mid-Atlantic Ridge and one from the East Pacific Rise.

First we'll load the Atlantic data, and then we'll need to clean it up.

In [None]:
# Load the seafloor depth, marine mag anom data
# Source: https://maps.ngdc.noaa.gov/viewers/geophysics/
#names=['SURVEY_ID','TIMEZONE','DATE','TIME','LAT','LON','POS_TYPE','NAV_QUALCO','BAT_TTIME','CORR_DEPTH','BAT_CPCO','BAT_TYPCO','BAT_QUALCO','MAG_TOT','MAG_TOT2','MAG_RES','MAG_RESSEN','MAG_DICORR','MAG_SDEPTH','MAG_QUALCO','GRA_OBS','EOTVOS','FREEAIR','GRA_QUALCO','LINEID','POINTID'])

atlantic_data_file=pd.read_table('data_tracks/vanc05mv.m77t')
atlantic_data_file.head()



Let's use `.drop` to remove the columns we won't be using.

In [None]:
atlantic_data_slim = atlantic_data_file.xxx(['SURVEY_ID','TIMEZONE','DATE','TIME','POS_TYPE','NAV_QUALCO','BAT_TTIME','BAT_CPCO','BAT_TYPCO','BAT_QUALCO','MAG_TOT2','MAG_RES','MAG_RESSEN','MAG_DICORR','MAG_SDEPTH','MAG_QUALCO','GRA_OBS','EOTVOS','FREEAIR','GRA_QUALCO','LINEID','POINTID'], axis='columns')

atlantic_data_slim.head()

Next we'll use `np.isnan` to remove rows where we don't have depth AND magnetic field measurements.

In [None]:
atlantic_data_clean = atlantic_data_slim[~np.isnan(atlantic_data_slim.CORR_DEPTH) &  ~np.isnan(atlantic_data_slim.MAG_TOT)];
atlantic_data_clean.head()
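
The two-column NaN filter can be checked on a small made-up track (the frame `track` below is hypothetical and only borrows the column names). As an alternative that is not used in this notebook, `dropna(subset=[...])` should select the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical four-point track with one gap in each measured column
track = pd.DataFrame({
    'LON': [-30.0, -29.5, -29.0, -28.5],
    'CORR_DEPTH': [4000.0, np.nan, 3900.0, 3850.0],
    'MAG_TOT': [45000.0, 45010.0, np.nan, 45030.0],
})

# Keep rows where depth is not NaN AND total field is not NaN
keep = ~np.isnan(track.CORR_DEPTH) & ~np.isnan(track.MAG_TOT)
clean_mask = track[keep]

# pandas alternative: drop rows with a NaN in either of the listed columns
clean_dropna = track.dropna(subset=['CORR_DEPTH', 'MAG_TOT'])

print(clean_mask)
```

Only the first and last rows survive, since each of the middle rows is missing one of the two measurements.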


Let's take a look at our data!

In [None]:
plt.figure(1,(20,4))
plt.plot(atlantic_data_clean.LON,-1*atlantic_data_clean.CORR_DEPTH,color='mediumblue');
plt.xlabel('Longitude, degrees');
plt.ylabel('Bathymetry, m');

In [None]:
plt.figure(1,(20,4))
plt.plot(atlantic_data_clean.LON,atlantic_data_clean.MAG_TOT,color='mediumblue');
plt.xlabel('Longitude, degrees');
plt.ylabel('Total magnetic field, nT');

Let's analyze just the portion of the survey around the ridge, from longitudes -24.0 to 0.0 degrees. We'll use Boolean indexing to pull out rows of `atlantic_data_clean` where `atlantic_data_clean.LON` is between those values.

In [None]:
atlantic_data = atlantic_data_clean[...]
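
Selecting a window between two bounds needs two conditions joined with `&`, each wrapped in parentheses. The sketch below shows the pattern on a made-up track with hypothetical bounds (`west`, `east`), and compares it with the `Series.between` method, which is inclusive at both ends by default.

```python
import pandas as pd

# Hypothetical track; the longitudes and field values are made up
track = pd.DataFrame({'LON': [-40.0, -20.0, -10.0, 5.0],
                      'MAG_TOT': [45010.0, 45120.0, 44980.0, 45050.0]})

west, east = -25.0, 0.0   # hypothetical window bounds, in degrees

# Parentheses are required because & binds more tightly than >= and <=
in_window = track[(track.LON >= west) & (track.LON <= east)]

# Equivalent selection using Series.between
same_rows = track[track.LON.between(west, east)]

print(in_window)
```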

In [None]:
atl_lat=atlantic_data.LAT;
atl_lon=atlantic_data.LON;
atl_depth=atlantic_data.CORR_DEPTH;
atl_total_mag=atlantic_data.MAG_TOT;


Here's a map of where our survey line was collected with a grid of seafloor bathymetry in the background.

<img src="Figures/MAR_track_map.png" width=900>

In [None]:
plt.figure(1,(20,4))
plt.plot(atl_lon,-1*atl_depth,color='mediumblue');
plt.title('Mid Atlantic Ridge')
plt.xlabel('Longitude, degrees');
plt.ylabel('Bathymetry, m');

In [None]:
plt.figure(1,(20,4))
plt.plot(atl_lon,atl_total_mag,color='mediumblue');
plt.title('Mid Atlantic Ridge')
plt.xlabel('Longitude, degrees');
plt.ylabel('Total magnetic field, nT');

I used another program to project the latitude and longitude coordinates to distance from the ridge along the ship track azimuth -- let's load that.

In [None]:
projected_atlantic_data=pd.read_csv('data_tracks/projected_vanc05mv.csv',names=['DIST','DEPTH','MAG_TOT'])
atl_dist=projected_atlantic_data.DIST;
atl_depth=projected_atlantic_data.DEPTH;
atl_total_mag=projected_atlantic_data.MAG_TOT;

In [None]:
plt.figure(1,(20,4))
plt.plot(atl_dist,-1*atl_depth,color='mediumblue');
plt.title('Mid Atlantic Ridge')
plt.xlabel('Distance to Ridge, km');
plt.ylabel('Bathymetry, m');

In [None]:
plt.figure(1,(20,4))
plt.plot(atl_dist,atl_total_mag,color='mediumblue');
plt.title('Mid Atlantic Ridge')
plt.xlabel('Distance to Ridge, km');
plt.ylabel('Total magnetic field, nT');

Now let's load and clean the data from the East Pacific Rise. This time we'll select data from longitudes between -126.0 and -95.0 degrees.

In [None]:
# Load the seafloor depth, marine mag anom data
# Source: https://maps.ngdc.noaa.gov/viewers/geophysics/
#names=['SURVEY_ID','TIMEZONE','DATE','TIME','LAT','LON','POS_TYPE','NAV_QUALCO','BAT_TTIME','CORR_DEPTH','BAT_CPCO','BAT_TYPCO','BAT_QUALCO','MAG_TOT','MAG_TOT2','MAG_RES','MAG_RESSEN','MAG_DICORR','MAG_SDEPTH','MAG_QUALCO','GRA_OBS','EOTVOS','FREEAIR','GRA_QUALCO','LINEID','POINTID'])

pacific_data_file=pd.read_table('data_tracks/nbp9707.m77t')

pacific_data_clean = pacific_data_file[...]; # use ~np.isnan to clear out rows where there are NaNs
pacific_data = pacific_data_clean[...] # use Boolean indexing to select rows with Longitude -126 deg to -95 deg

In [None]:
pac_lat=pacific_data.LAT;
pac_lon=pacific_data.LON;
pac_depth=pacific_data.CORR_DEPTH;
pac_total_mag=pacific_data.MAG_TOT;


Here's a map of where our survey line was collected with a grid of seafloor bathymetry in the background.

<img src="Figures/EPR_track_map.png" width=900>

In [None]:
plt.figure(1,(20,4))
plt.plot(pac_lon,-1*pac_depth,color='tomato');
plt.title('East Pacific Rise');
plt.xlabel('Longitude, degrees');
plt.ylabel('Bathymetry, m');

In [None]:
plt.figure(1,(20,4))
plt.plot(pac_lon,pac_total_mag,color='tomato');
plt.title('East Pacific Rise');
plt.xlabel('Longitude, degrees');
plt.ylabel('Total magnetic field, nT');

Again, I used another program to project the latitude and longitude coordinates to distance from the ridge along the ship track azimuth -- let's load that.

In [None]:
projected_pacific_data=pd.read_csv('data_tracks/projected_nbp9707.csv',names=['DIST','DEPTH','MAG_TOT'])
pac_dist=projected_pacific_data.DIST;
pac_depth=projected_pacific_data.DEPTH;
pac_total_mag=projected_pacific_data.MAG_TOT;


__Bathymetry__

Now let's compare the two ridges' bathymetry. 

Let's plot them together on one figure as subplots. First we use `.GridSpec` to set up the grid of subplots, then we use `fig.add_subplot` to set up the subplot axes, and then we can start adding our plot elements to the subplots.

In [None]:
fig = plt.figure(1,(20,8)) # create figure object
grid = plt.GridSpec(2, 1, wspace=0.4, hspace=0.3) # create grid reference frame and spacing for 2 vertical subplots

ax1=fig.add_subplot(grid[0,0]) # create axis object for top subplot
ax2=fig.add_subplot(grid[1,0]) # create axis object for bottom subplot

ax1.plot(...,...,color='tomato'); # plot the pacific bathymetry
ax1.set_xlim(-1000, 1000); # set the x axis range
ax1.set_ylim(-5000, -1000); # set the y axis range
ax1.set_xlabel('Distance to Ridge, km'); # labels!
ax1.set_ylabel('Bathymetry, m');
ax1.set_title('East Pacific Rise');
ax1.grid() # add a grid

ax2.plot(...,...,color='mediumblue'); # plot the atlantic bathymetry
ax2.set_xlim(-1000, 1000);
ax2.set_ylim(-5000, -1000);
ax2.set_xlabel('Distance to Ridge, km');
ax2.set_ylabel('Bathymetry, m');
ax2.set_title('Mid Atlantic Ridge');
ax2.grid()
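
The `GridSpec` and `add_subplot` pattern can be tried on synthetic curves before plugging in the survey data. In the sketch below the two profiles (`profile_a`, `profile_b`) are invented shapes, not real bathymetry, and `sharex` is an optional extra (an assumption, not something the cell above uses) that ties the two x axes together.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up profiles, only to have something to draw
x = np.linspace(-1000, 1000, 201)
profile_a = -2500 - 1.0 * np.abs(x)
profile_b = -3000 - 1.5 * np.abs(x)

fig = plt.figure(figsize=(10, 6))
grid = plt.GridSpec(2, 1, hspace=0.4)                     # 2 rows, 1 column
ax_top = fig.add_subplot(grid[0, 0])                      # top panel
ax_bottom = fig.add_subplot(grid[1, 0], sharex=ax_top)    # bottom panel, same x axis

ax_top.plot(x, profile_a, color='tomato')
ax_top.set_title('Profile A (synthetic)')
ax_top.set_xlim(-1000, 1000)                              # also applies to the shared bottom axis

ax_bottom.plot(x, profile_b, color='mediumblue')
ax_bottom.set_title('Profile B (synthetic)')
ax_bottom.set_xlabel('Distance, km')
```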



<img src="Figures/spreading_ridges.png" width=900>
> Source: Essentials of Geology (13th Edition) Lutgens, Tarbuck, and Tasa.

What do you observe in the bathymetry? Do these ridges have a rift valley at the center? Is the slope steep or gentle? Is the bathymetry rough or smooth?

_Write your answer here._

Based on the ridge bathymetry, which spreading center do you think is spreading faster: the Atlantic (blue) or the Pacific (red)?

_Write your answer here._

__Crustal Magnetic Field__

Now we compare the evidence from their marine magnetic field data.

In [None]:
fig = plt.figure(1,(20,8))
grid = plt.GridSpec(2, 1, wspace=0.4, hspace=0.3)

ax0=fig.add_subplot(grid[0,0])
ax1=fig.add_subplot(grid[1,0])

ax0.plot(pac_dist,pac_total_mag,color='tomato');
ax0.set_xlim(-1000, 1000);
ax0.set_xlabel('Distance to Ridge, km');
ax0.set_ylabel('Total Field, nT');
ax0.set_title('East Pacific Rise');

ax1.plot(atl_dist,atl_total_mag,color='mediumblue');
ax1.set_xlim(-1000, 1000);
ax1.set_xlabel('Distance to Ridge, km');
ax1.set_ylabel('Total Field, nT');
ax1.set_title('Mid Atlantic Ridge');

I'm defining a new function `total2anom` to process these total magnetic field measurements into magnetic anomaly by removing the background drift.

In [None]:
def total2anom(total_mag, distance):
    """
    Simple (naive) function to process measured total magnetic field to magnetic anomaly; it does not use
    knowledge of the background field from an observatory. Detrends and highpass filters the total field.
    
    inputs: 
    total magnetic field measurement
    distance from the ridge in km
    
    output:
    marine magnetic anomaly (detrended and filtered total field)
    """
    total_detrended = signal.detrend(total_mag); # detrend to remove drift
    sample_dist = np.mean(abs(distance.values[1:]-distance.values[0:-1])); # determine sample spacing
    fs = 1/sample_dist; # sampling frequency in km^-1
    fN = fs *0.5; # Nyquist frequency
    # design filter coefficients for highpass filter - 0 to 1/500km filtered, 1/450km to fN passed, 
    # remove nonlinear drift
    filter_coefs = signal.remez(1001, [0, 0.002, 0.00222, fN], [0, 1], fs=fs); # fs is the sampling frequency keyword (Hz is deprecated)
    # apply the filter to the detrended anomaly
    filtered_anom = signal.filtfilt(filter_coefs, [1], total_detrended, padlen=len(total_detrended)-1)
    
    return filtered_anom
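
To get a feel for the detrend-then-highpass steps, the sketch below applies them to a synthetic signal: a short-wavelength sine (standing in for magnetic stripes) riding on a large linear drift. All numbers are made up, and the filter here is deliberately shorter and has a wider transition band than the one in `total2anom`, so it illustrates the idea rather than reproducing that function.

```python
import numpy as np
from scipy import signal

# Synthetic track: 1 km sample spacing, 40 km "stripes" on top of a linear drift
dist_km = np.arange(0.0, 2000.0, 1.0)
stripes = 200.0 * np.sin(2 * np.pi * dist_km / 40.0)
total = 45000.0 + 0.8 * dist_km + stripes

# Step 1: remove the best-fit straight line
detrended = signal.detrend(total)

# Step 2: sampling frequency and Nyquist frequency from the mean sample spacing
spacing = np.mean(np.abs(np.diff(dist_km)))   # km
fs = 1.0 / spacing                            # samples per km
f_nyq = 0.5 * fs

# Step 3: highpass FIR design: stop 0 to 1/200 km, pass 1/50 km to Nyquist
taps = signal.remez(201, [0, 0.005, 0.02, f_nyq], [0, 1], fs=fs)

# Step 4: zero-phase filtering (forward and backward passes)
anomaly = signal.filtfilt(taps, [1.0], detrended)

print(spacing, fs, f_nyq)
print(anomaly[:5])
```

Plotting `anomaly` against `stripes` is a quick way to judge how much of the short-wavelength signal survives the filter.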

Use this `total2anom` function to compute the marine magnetic anomalies.

In [None]:
atl_mma = total2anom(atl_total_mag,atl_dist)
pac_mma = total2anom(pac_total_mag,pac_dist)

In [None]:
fig = plt.figure(1,(20,8))
grid = plt.GridSpec(2, 1, wspace=0.4, hspace=0.3)

ax0=fig.add_subplot(grid[0,0])
ax1=fig.add_subplot(grid[1,0])

ax0.plot(pac_dist[:],np.zeros(pac_dist.shape),color='black'); # plot a black reference line at 0 nT
ax0.plot(...,...,color='tomato'); # plot the pacific marine magnetic anomaly
ax0.set_xlim(-1000, 1000);
ax0.set_xlabel('Distance to Ridge, km');
ax0.set_ylabel('Magnetic Anomaly, nT');
ax0.set_title('East Pacific Rise');

ax1.plot(...,...,color='black'); # plot a black reference line at 0 nT
ax1.plot(atl_dist,atl_mma,color='mediumblue'); # plot the atlantic marine magnetic anomaly
ax1.set_xlim(-1000, 1000);
ax1.set_xlabel('Distance to Ridge, km');
ax1.set_ylabel('Magnetic Anomaly, nT');
ax1.set_title('Mid Atlantic Ridge');



Plot the marine magnetic anomalies together as subplots again with reference lines at zero nT, but zoom in the `xlim` to $\pm$250 km for the Pacific data and $\pm$150 km for the Atlantic data.

Which wiggles can you match between lines and to the model profile due to the geomagnetic polarity time scale (GPTS) below? Can you pick the Brunhes, Matuyama, Gauss, and Gilbert polarity chrons? At what distance from the ridge does the Brunhes-Matuyama reversal (which tells us an age of 780 kyr) occur for both ridges?

_Write your answer here._

<img src="Figures/marine_mag_anom.png" width=900>
> Source: Fundamentals of Geophysics (2nd Edition) Lowrie, W.

Based on the marine magnetic anomalies, which spreading center do you think is spreading faster: the Atlantic (blue) or the Pacific (red)? Is that consistent with your estimate from the bathymetry?

_Write your answer here._
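
Once you have picked a distance for the Brunhes-Matuyama reversal, a half-spreading rate follows from distance divided by age. The helper below is a hypothetical sketch (the function name and the 39 km example distance are invented for illustration and are not a result from either survey); it only handles the unit conversion.

```python
def half_spreading_rate_mm_per_yr(distance_km, age_kyr):
    """Half-spreading rate from the distance between the ridge axis and a dated reversal."""
    # 1 km per kyr = 1e6 mm per 1e3 yr = 1000 mm/yr
    return distance_km / age_kyr * 1000.0

bm_age_kyr = 780.0             # age of the Brunhes-Matuyama reversal, from the text above
example_distance_km = 39.0     # hypothetical pick, NOT measured from the data

half_rate = half_spreading_rate_mm_per_yr(example_distance_km, bm_age_kyr)
full_rate = 2.0 * half_rate    # assumes symmetric spreading on both flanks

print(half_rate, full_rate)
```

With these made-up numbers the half rate comes out at about 50 mm/yr; substitute your own picks for each ridge to compare them.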

__Build a new DataFrame of our distance, depth, and marine magnetic anomaly, and output it as a .csv file that we can open again later.__

In [None]:
data1 = pd.DataFrame({'Distance':atl_dist, 'Depth':atl_depth, 'Magnetic_Anomaly':atl_mma})

In [None]:
data1.head()

In [None]:
data1.to_csv("Atlantic_dist_depth_mma.csv")

In [None]:
data2 = pd.DataFrame({'Distance':pac_dist, 'Depth':pac_depth, 'Magnetic_Anomaly':pac_mma})
data2.to_csv("Pacific_dist_depth_mma.csv")
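
When reopening a file written this way, keep in mind that `to_csv` also writes the index as an unnamed first column by default. The sketch below round-trips a tiny made-up table through a temporary file (the file name and values are hypothetical) and shows two ways of handling that column.

```python
import os
import tempfile

import numpy as np
import pandas as pd

out = pd.DataFrame({'Distance': [-1.0, 0.0, 1.0],
                    'Depth': [2600.0, 2500.0, 2610.0],
                    'Magnetic_Anomaly': [-50.0, 120.0, -40.0]})

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo_dist_depth_mma.csv')

# Option 1: write the index, then tell read_csv that column 0 is the index
out.to_csv(path)
back_with_index = pd.read_csv(path, index_col=0)

# Option 2: leave the index out of the file altogether
out.to_csv(path, index=False)
back_no_index = pd.read_csv(path)

print(back_with_index.head())
print(back_no_index.head())
```

Reading the Option 1 file without `index_col=0` would instead produce an extra column named `Unnamed: 0`.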