<a href="https://colab.research.google.com/github/emilygolf/Predicting-Seismicity/blob/main/Data_collection%26cleaning_STATS112.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project | Stats 112 | Emily Snell | Data Collection/Cleaning

---

This colab collects and cleans my final project data. 

I will be collecting:
- seismicity (earthquake) data
- eruption data
- volcanic location data 

I have already downloaded the raw data in CSV format from various geologic websites (USGS and the Smithsonian).

Prediction data:
- my prediction data will be scraped from a university website. More details at the end of this colab! 


# Project Goal:


**Predict seismicity associated with "old" eruptions**

I want to use seismic and eruption data to predict the seismicity associated with historic eruptions. Seismographs were not invented until the 1880s, and even then they were not globally available. Further, seismic data for eruptions prior to 2000 is sparse. In volcanology, which is my primary field of study, geologists can look at volcanic deposits to understand how large or violent an eruption was, even if it occurred thousands of years ago. However, estimating the size of an earthquake that occurred hundreds (or thousands) of years ago is nearly impossible from first-order observations. 

With this data I aim to train a machine learning model that uses eruptions from 1990-2022, along with seismic data for earthquakes that occurred at volcanic sites, to predict and give insight into the seismicity of historical eruptions (1450 BC to the early 1900s) globally. 

Let's leverage machine learning to estimate some really old earthquakes!!!  

# 1) load in data

In [1]:
#load in data from my personal github
import requests
import pandas as pd
import os 
!git clone https://github.com/emilygolf/Final-Project.git

Cloning into 'Final-Project'...
remote: Enumerating objects: 148, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 148 (delta 20), reused 60 (delta 17), pack-reused 79
Receiving objects: 100% (148/148), 9.21 MiB | 8.54 MiB/s, done.
Resolving deltas: 100% (41/41), done.


In [2]:
#move to the right folder to access the data 
%cd /content/Final-Project/DataSci\ Final\ Project

/content/Final-Project/DataSci Final Project


In [3]:
#initialize earthquake df
#this is the global data from 1990-present 
earthquake_df = pd.read_csv('volcanic_earthquakes.csv')
earthquake_df

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2022-01-15T04:14:45.000Z,-20.546000,-175.390000,0.00,5.8,ms_20,,,,2.82,...,2022-11-15T02:42:10.861Z,"68 km NNW of Nuku‘alofa, Tonga",volcanic eruption,8.10,6.00,0.021,507,reviewed,us,us
1,2018-08-02T21:55:12.060Z,19.411167,-155.283167,0.55,5.3,mw,34.0,27.0,0.006521,0.10,...,2022-05-03T19:25:25.243Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.14,0.11,,22,reviewed,hv,hv
2,2018-07-31T17:59:46.000Z,19.410333,-155.285667,0.57,5.3,mw,36.0,23.0,0.007439,0.12,...,2022-05-03T19:25:07.031Z,"6 km WSW of Volcano, Hawaii",volcanic eruption,0.16,0.11,,9,reviewed,hv,hv
3,2018-07-29T22:10:25.570Z,19.406333,-155.282333,1.29,5.3,mw,39.0,24.0,0.011390,0.12,...,2018-11-01T01:31:30.280Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.16,0.12,,22,reviewed,hv,hv
4,2018-07-28T12:37:25.390Z,19.396833,-155.271833,0.21,5.3,mw,37.0,40.0,0.002246,0.10,...,2022-05-03T19:25:00.878Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.10,0.10,,21,reviewed,hv,hv
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,2018-05-20T21:50:07.310Z,19.405000,-155.281000,0.01,4.9,mw,13.0,109.0,0.006997,0.09,...,2018-08-15T19:47:00.040Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.34,31.61,,5,reviewed,hv,hv
59,2018-05-20T01:58:13.320Z,19.405000,-155.281000,0.01,4.9,mw,19.0,71.0,0.006997,0.21,...,2018-08-15T19:46:59.040Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.32,31.61,,6,reviewed,hv,hv
60,2018-05-19T09:58:33.210Z,19.405000,-155.281000,0.01,5.1,mw,13.0,56.0,0.006997,0.38,...,2018-08-15T19:46:58.040Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.61,31.61,,5,reviewed,hv,hv
61,2018-05-17T14:04:10.700Z,19.405000,-155.281000,0.01,5.0,mw,13.0,56.0,0.006997,0.41,...,2018-08-15T19:46:55.040Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.62,31.61,,5,reviewed,hv,hv


# Concatenate earthquake CSV files

In [4]:
#let's add some more data from various regions to make the data more robust
other_earthquakes = ['Hawaii_2015.csv', 'aleutians_2000.csv', 'aleutians_2015.csv', 'NewZealand_2000.csv',
                     'NewZealand_2010.csv', 'Japan_2001.csv', 'indonesia_2014.csv', 'indonesia_2018.csv', 'hawaii_2023.csv', 
                     'central_america_1989.csv', 'central_america_1995.csv', 'central_america_2000.csv','columbia_1990.csv', 
                     'new_zealand_1990.csv', 'Russia.csv', 'Indonesia.csv', 'Chile_1990.csv']
for earthquake in other_earthquakes:
  new_df = pd.read_csv(earthquake)
  earthquake_df = pd.concat([earthquake_df, new_df], axis=0)
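
The loop above re-concatenates the growing frame on every pass and keeps each file's own row labels, so labels such as 0, 1, 2 repeat across regions. A minimal sketch of an alternative, assuming nothing else in the colab relies on the repeated labels: read every file once, then concatenate in a single call with `ignore_index=True`. The file names and contents here are made-up stand-ins held in memory, not the real regional CSVs.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the regional CSV files, kept in memory so the sketch runs anywhere.
fake_files = {
    "region_a.csv": "time,mag\n2001-01-01T00:00:00.000Z,4.5\n2001-01-02T00:00:00.000Z,4.7\n",
    "region_b.csv": "time,mag\n2015-06-01T00:00:00.000Z,5.1\n",
}

# Read every file once, then concatenate them in one call.
frames = [pd.read_csv(io.StringIO(text)) for text in fake_files.values()]

# ignore_index=True gives the combined frame fresh, unique row labels 0..n-1.
combined = pd.concat(frames, ignore_index=True)
```

With the real files, `frames` would be built from `pd.read_csv(name)` for each name in `other_earthquakes`, with `earthquake_df` placed first in the list.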

In [5]:
#now we have 116,000+ earthquakes!
earthquake_df

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2022-01-15T04:14:45.000Z,-20.546000,-175.390000,0.00,5.8,ms_20,,,,2.82,...,2022-11-15T02:42:10.861Z,"68 km NNW of Nuku‘alofa, Tonga",volcanic eruption,8.10,6.00,0.021,507.0,reviewed,us,us
1,2018-08-02T21:55:12.060Z,19.411167,-155.283167,0.55,5.3,mw,34.0,27.0,0.006521,0.10,...,2022-05-03T19:25:25.243Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.14,0.11,,22.0,reviewed,hv,hv
2,2018-07-31T17:59:46.000Z,19.410333,-155.285667,0.57,5.3,mw,36.0,23.0,0.007439,0.12,...,2022-05-03T19:25:07.031Z,"6 km WSW of Volcano, Hawaii",volcanic eruption,0.16,0.11,,9.0,reviewed,hv,hv
3,2018-07-29T22:10:25.570Z,19.406333,-155.282333,1.29,5.3,mw,39.0,24.0,0.011390,0.12,...,2018-11-01T01:31:30.280Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.16,0.12,,22.0,reviewed,hv,hv
4,2018-07-28T12:37:25.390Z,19.396833,-155.271833,0.21,5.3,mw,37.0,40.0,0.002246,0.10,...,2022-05-03T19:25:00.878Z,"6 km SW of Volcano, Hawaii",volcanic eruption,0.10,0.10,,21.0,reviewed,hv,hv
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2323,1990-03-15T09:07:01.160Z,-21.143000,-68.748000,120.00,4.6,mb,,,,1.00,...,2014-11-07T00:45:56.979Z,"146 km N of Calama, Chile",earthquake,,9.10,,3.0,reviewed,us,us
2324,1990-03-10T02:04:13.390Z,-22.671000,-68.267000,123.20,4.8,mb,,,,0.80,...,2014-11-07T00:45:54.684Z,"27 km NNW of San Pedro de Atacama, Chile",earthquake,,18.50,,2.0,reviewed,us,us
2325,1990-03-08T03:59:35.230Z,-24.625000,-70.263000,59.20,4.5,mb,,,,1.30,...,2014-11-07T00:45:54.332Z,"89 km NNE of Taltal, Chile",earthquake,,11.70,,2.0,reviewed,us,us
2326,1990-03-04T16:30:52.060Z,-20.069000,-69.311000,125.00,4.2,mb,,,,1.70,...,2014-11-07T00:45:52.298Z,"89 km E of Iquique, Chile",earthquake,,15.30,,1.0,reviewed,us,us


# Cleaning earthquake data 

(keeping only the columns we care about)

In [6]:
#keep time, lat, long, depth, mag, horizontalError, depthError, magError
earthquakes_clean_df = earthquake_df[['time', 'latitude', 'longitude', 'depth', 'mag', 'horizontalError', 'depthError', 'magError']]

**Now let's clean up the time column**

In [7]:
## this function takes in the default earthquake time column and converts it to a yyyy-mm-dd pandas datetime format 
## the new column will be called "Date"

def clean_time_data(data_frame, date):

  #initialize list of date strings
  Date = []

  for time in data_frame[date]:
    #slice the year (chars 0-3), month (5-6), and day (8-9) out of the ISO timestamp
    Date.append(time[0:4] + '-' + time[5:7] + '-' + time[8:10])
  Date_series = pd.to_datetime(pd.Series(Date), yearfirst=True)
  data_frame['Date'] = Date_series
  return data_frame

In [8]:
# keeping the original time column since I could need it down the line...
earthquakes_clean_df = clean_time_data(earthquakes_clean_df, 'time')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_frame['Date'] = Date_series


In [9]:
#look at the beautiful new date column!
earthquakes_clean_df

Unnamed: 0,time,latitude,longitude,depth,mag,horizontalError,depthError,magError,Date
0,2022-01-15T04:14:45.000Z,-20.546000,-175.390000,0.00,5.8,8.10,6.00,0.021,2022-01-15
1,2018-08-02T21:55:12.060Z,19.411167,-155.283167,0.55,5.3,0.14,0.11,,2018-08-02
2,2018-07-31T17:59:46.000Z,19.410333,-155.285667,0.57,5.3,0.16,0.11,,2018-07-31
3,2018-07-29T22:10:25.570Z,19.406333,-155.282333,1.29,5.3,0.16,0.12,,2018-07-29
4,2018-07-28T12:37:25.390Z,19.396833,-155.271833,0.21,5.3,0.10,0.10,,2018-07-28
...,...,...,...,...,...,...,...,...,...
2323,1990-03-15T09:07:01.160Z,-21.143000,-68.748000,120.00,4.6,,9.10,,2001-06-02
2324,1990-03-10T02:04:13.390Z,-22.671000,-68.267000,123.20,4.8,,18.50,,2001-06-01
2325,1990-03-08T03:59:35.230Z,-24.625000,-70.263000,59.20,4.5,,11.70,,2001-05-31
2326,1990-03-04T16:30:52.060Z,-20.069000,-69.311000,125.00,4.2,,15.30,,2001-05-30
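
In the output above, the last rows show a `Date` that does not match `time` (for example, a 1990 earthquake with a 2001 date). A likely cause, stated here as an assumption: the concatenated frame keeps repeated row labels from each file, while `clean_time_data` builds a new Series labeled 0..n-1, and pandas aligns the two on labels rather than on position. The sketch below reproduces the effect on a tiny made-up frame and shows a positional alternative.

```python
import pandas as pd

# Two made-up "files" concatenated without ignore_index, so labels 0 and 1 repeat.
part_1 = pd.DataFrame({"time": ["2022-01-15T04:14:45.000Z", "2018-08-02T21:55:12.060Z"]})
part_2 = pd.DataFrame({"time": ["1990-03-15T09:07:01.160Z", "1990-03-10T02:04:13.390Z"]})
quakes = pd.concat([part_1, part_2])

# Label-based assignment: a fresh 0..n-1 Series is matched to rows by label,
# so rows labeled 0 and 1 in the second file receive dates from the first file.
by_label = quakes.copy()
by_label["Date"] = pd.to_datetime(pd.Series([t[:10] for t in quakes["time"]]))

# Positional assignment: parse the column itself and hand pandas a bare array,
# which is placed row by row regardless of labels.
by_position = quakes.copy()
by_position["Date"] = pd.to_datetime(quakes["time"].str[:10]).to_numpy()
```

Resetting the index after concatenation would also keep the label-based version aligned.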


# Clean Eruptions and Volcanoes df's 

In [10]:
#now read eruption data 
eruptions_df = pd.read_csv('global_eruptions_1990+.csv', skiprows=1, encoding='cp1252')
eruptions_df

Unnamed: 0,Volcano Number,Volcano Name,Eruption Number,Eruption Category,Area of Activity,VEI,VEI Modifier,Start Year Modifier,Start Year,Start Year Uncertainty,...,Evidence Method (dating),End Year Modifier,End Year,End Year Uncertainty,End Month,End Day Modifier,End Day,End Day Uncertainty,Latitude,Longitude
0,355100,Lascar,22493,Confirmed Eruption,,,,,2022,,...,Observations: Reported,>,2022.0,,12.0,,19.0,,-23.370,-67.730
1,332020,Mauna Loa,22492,Confirmed Eruption,,,,,2022,,...,Observations: Reported,,2022.0,,12.0,,10.0,,19.475,-155.608
2,284141,Ahyi,22489,Confirmed Eruption,,,,,2022,,...,Observations: Reported,>,2022.0,,12.0,,19.0,,20.420,145.030
3,343100,San Miguel,22491,Confirmed Eruption,,,,,2022,,...,Observations: Reported,,2022.0,,11.0,,29.0,,13.434,-88.269
4,352050,Cotopaxi,22486,Confirmed Eruption,Summit crater,2.0,,,2022,,...,Observations: Reported,>,2022.0,,12.0,,19.0,,-0.677,-78.436
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
775,233020,"Fournaise, Piton de la",14373,Confirmed Eruption,SE Dolomieu crater and upper SE flank,0.0,,,1990,,...,Observations: Reported,,1990.0,,5.0,,8.0,,-21.244,55.708
776,263340,Raung,16149,Confirmed Eruption,,2.0,,,1990,,...,Observations: Reported,,1990.0,,12.0,>,16.0,15.0,-8.119,114.056
777,241100,Ruapehu,14659,Confirmed Eruption,,1.0,,,1990,,...,Observations: Reported,,1990.0,,1.0,,26.0,,-39.280,175.570
778,351080,Galeras,11373,Confirmed Eruption,,2.0,,,1990,,...,Observations: Reported,,1992.0,,7.0,,16.0,,1.220,-77.370


Narrow down the columns we care about...

In [11]:
#pick the columns we want: start/end dates, volcano name, location, and VEI
#eruptions from 1990 onward
eruptions_clean_df = eruptions_df[['Start Year', 'Start Month', 'Start Day', 'End Year', 'End Month', 'End Day', 'Volcano Name', 'Latitude', 'Longitude', 'VEI']]

Some end dates have NaNs, so to make things simpler, if an end date value is NaN then I will assume it should be the same as the start equivalent (i.e., the same year, month, or day that the eruption started).

In [12]:
## This function takes in an eruption data frame and fills missing values in the End Year/Month/Day columns
## if a value is missing we simply replace it with the start column equivalent 
## most NaNs in this data set are missing end days, or the entire end year, month, and day is missing
## the safest estimate is to assume the eruption ended in the same day/month/year it started
## for example, if all end values are missing the assumption is the eruption lasted one day
## if only the year is missing then we assume the eruption happened within one year, etc.
## the output is 3 updated series for year, month, and day
def clean_nans(eruption_df):
  year = eruption_df['End Year']
  Syear = eruption_df['Start Year']
  month = eruption_df['End Month']
  Smonth = eruption_df['Start Month']
  day = eruption_df['End Day']
  Sday = eruption_df['Start Day']

  for i in range(len(year)):
    if pd.isna(year[i]):
      year[i] = Syear[i]
    if pd.isna(month[i]):
      month[i] = Smonth[i]
    if pd.isna(day[i]):
      day[i] = Sday[i]
  return year, month, day
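
The same fill rule (a missing end value takes the matching start value) can also be written without an explicit row loop. A minimal sketch on a made-up two-row frame, assuming the column names used in this colab; `filled` is a placeholder name:

```python
import numpy as np
import pandas as pd

# Made-up eruptions: the first is missing only its end day, the second has no end date at all.
eruptions = pd.DataFrame({
    "Start Year": [2022, 1990], "Start Month": [12, 1], "Start Day": [10, 7],
    "End Year": [2022.0, np.nan], "End Month": [12.0, np.nan], "End Day": [np.nan, np.nan],
})

# Work on a copy so the input frame is left untouched.
filled = eruptions.copy()
for part in ("Year", "Month", "Day"):
    # fillna with a Series fills each missing end value from the same row's start value.
    filled[f"End {part}"] = filled[f"End {part}"].fillna(filled[f"Start {part}"])
```

Because the assignment happens on an explicit copy, this form should not trigger the copy-of-a-slice warning.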

In [14]:
#creating new end data
year, month, day = clean_nans(eruptions_clean_df)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  year[i] = Syear[i]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  month[i] = Smonth[i]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  day[i] = Sday[i]


In [15]:
#assigning the values
eruptions_clean_df['End Year'] = year
eruptions_clean_df['End Month'] = month
eruptions_clean_df['End Day'] = day

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eruptions_clean_df['End Year'] = year
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eruptions_clean_df['End Month'] = month
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eruptions_clean_df['End Day'] = day


Now read in my volcano data and rename the columns. This data frame has the latitude and longitude of every volcano on Earth, as well as its name, country, and type.

In [16]:
volcanoes_df = pd.read_csv('volcanoes_global.csv', skiprows=1, encoding='UTF-8')

In [17]:
volcanoes_df = volcanoes_df.rename(columns={'Unnamed: 0': 'Volcano Name', 'Unnamed: 1': 'Country', 'Unnamed: 2': 'Location', 'Unnamed: 3': 'Latitude', 'Unnamed: 4': 'Longitude', 'Unnamed: 5': 'Type'})

In [18]:
volcanoes_df

Unnamed: 0,Volcano Name,Country,Location,Latitude,Longitude,Type
0,Dacht-I-Navar Group,Afghanistan,Afghanistan,33.95,67.920,Lava dome
1,Vakak Group,Afghanistan,Afghanistan,34.25,67.970,Volcanic field
2,Atakor Volcanic Field,Algeria,Africa-N,23.33,5.830,Scoria cones
3,In Ezzane Volc Field,Algeria,Africa-N,23.00,10.833,Volcanic field
4,Manzaz Volcanic Field,Algeria,Africa-N,23.92,5.830,Scoria cones
...,...,...,...,...,...,...
1599,"Sawad, Harra Es-",Yemen,Arabia-S,13.58,46.120,Volcanic field
1600,Hanish,Yemen,Red Sea,13.72,42.730,Shield volcano
1601,"Tair, Jebel at",Yemen,Red Sea,15.55,41.830,Stratovolcano
1602,Zubair Group,Yemen,Red Sea,15.05,42.180,Shield volcano


# Combining data frames and setting data type

---
I want to combine my eruption and seismicity data. To do this I am going to use my volcano data set to filter and match earthquakes to volcanic centers (i.e., keeping only earthquakes located at a volcano and then assigning each of those earthquakes the volcano's name, country, and type). 

I will then use this same method to clean up the eruptions data set to make sure naming conventions are consistent between all of the data sets. 

I am mapping earthquakes to volcanoes with a function since each volcanic center is recorded as a single lat and long value but corresponds to a physical 3-D area. 

A function lets me search a range of lats and longs rather than simply merging the volcano data and seismic data on a single exact location.
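
The matching functions in this colab use tolerances of 0.0375 and 0.010 degrees. As a rough guide to what those mean on the ground, the sketch below converts a tolerance in degrees into north-south and east-west half-widths in kilometers. It assumes a spherical Earth with a mean radius of 6371 km, and the helper name `box_half_widths_km` is made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def box_half_widths_km(tolerance_deg, latitude_deg):
    """Return (north-south, east-west) half-widths in km of a lat/long box."""
    # One degree of latitude spans roughly the same distance everywhere.
    km_per_deg_lat = math.pi * EARTH_RADIUS_KM / 180.0
    north_south = tolerance_deg * km_per_deg_lat
    # One degree of longitude shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_equator, ew_equator = box_half_widths_km(0.0375, 0.0)
ns_iceland, ew_iceland = box_half_widths_km(0.0375, 64.6)
```

Under these assumptions, 0.0375 degrees is about 4.2 km (roughly 2.6 miles) north-south at any latitude, while the east-west half-width at Icelandic latitudes is less than half of that, so the search box is narrower for high-latitude volcanoes.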


In [19]:
## this function maps earthquakes to volcanoes and returns a df of volcanic earthquakes
## Each earthquake at a volcanic center will also be assigned the volcano's name, country, and type
## each row of the output data frame is a single earthquake
def map_volcanoes(volcanoe_df, seismic_df):
  indexs = []
  names = []
  countries = []
  kinds = []
  #outer loop: every volcanic center (name, lat, long, country, type)
  for name, lat_2, lon_2, country, kind in zip(volcanoe_df['Volcano Name'], volcanoe_df['Latitude'], volcanoe_df['Longitude'], volcanoe_df['Country'], volcanoe_df['Type']):
    #inner loop: lat and long of every earthquake; i tracks the earthquake's row position
    i = 0
    for lat, lon in zip(seismic_df['latitude'], seismic_df['longitude']):
      #if the earthquake is within 0.0375 degrees of the volcanic center in both lat and long (a box of a few km, not a true radius), add the row position to our list 
      if abs(float(lat) - float(lat_2)) <= 0.0375 and abs(float(lon) - float(lon_2)) <= 0.0375:
        indexs.append(i)
        names.append(name)
        countries.append(country)
        kinds.append(kind)
      i += 1
        
  new_seismic_df = seismic_df.iloc[indexs]
  new_seismic_df['Volcano Name'] = names
  new_seismic_df['Country'] = countries
  new_seismic_df['Type'] = kinds
  return new_seismic_df, indexs, names
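
The nested loops above compare every volcano with every earthquake one pair at a time. A sketch of the same box test written with NumPy broadcasting is given here for comparison. The tiny frames are made up ("Example Peak" is a hypothetical volcano; the Bardarbunga coordinates are taken from the eruption table in this colab), and `match_box` is an illustrative name, not a function used elsewhere in the project.

```python
import numpy as np
import pandas as pd

volcanoes = pd.DataFrame({
    "Volcano Name": ["Example Peak", "Bardarbunga"],
    "Latitude": [19.40, 64.633],
    "Longitude": [-155.28, -17.516],
})
quakes = pd.DataFrame({
    "latitude": [19.411, 64.654, 0.0],
    "longitude": [-155.283, -17.521, 0.0],
    "mag": [5.3, 4.6, 4.0],
})


def match_box(volcanoes, quakes, tol=0.0375):
    # Broadcasting builds (n_volcanoes, n_quakes) tables of absolute coordinate differences.
    dlat = np.abs(volcanoes["Latitude"].to_numpy()[:, None] - quakes["latitude"].to_numpy()[None, :])
    dlon = np.abs(volcanoes["Longitude"].to_numpy()[:, None] - quakes["longitude"].to_numpy()[None, :])
    # Row/column positions of every (volcano, earthquake) pair inside the box.
    v_idx, q_idx = np.nonzero((dlat <= tol) & (dlon <= tol))
    # Copy the matching earthquake rows, then attach the volcano name by position.
    matched = quakes.iloc[q_idx].copy()
    matched["Volcano Name"] = volcanoes["Volcano Name"].to_numpy()[v_idx]
    return matched


matched = match_box(volcanoes, quakes)
```

On the full data set the difference tables would be large (every volcano against every earthquake), so processing the volcanoes in batches may be needed to keep memory use reasonable.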


In [20]:
volcanoe_earthquakes_df, indexs, names = map_volcanoes(volcanoes_df, earthquakes_clean_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_seismic_df['Volcano Name'] = names
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_seismic_df['Country'] = countries
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_seismic_df['Type'] = kinds


# Volcanic Earthquakes

---

This data frame is the new data frame computed with my function which mapped earthquakes to volcanic centers and assigned the corresponding information to each row. 

Now each row is an earthquake which happened at a volcano with the volcano's information

In [21]:
#almost 16,000 earthquakes!
volcanoe_earthquakes_df

Unnamed: 0,time,latitude,longitude,depth,mag,horizontalError,depthError,magError,Date,Volcano Name,Country,Type
2172,1990-12-27T20:54:57.750Z,-24.018,-66.474,205.6,4.5,,4.9,,2002-11-12,Tuzgle,Argentina,Stratovolcano
852,1996-09-28T23:49:28.830Z,-19.482,-67.445,224.2,3.9,,12.9,,2010-07-14,"Jayu Khota, Laguna",Bolivia,Maars
991,1996-04-17T01:29:48.540Z,-21.772,-68.234,135.6,4.3,,12.9,,2009-05-18,"Azufre, Cerro del",Bolivia,Stratovolcano
2091,1991-05-10T22:07:51.010Z,-18.387,-69.100,139.0,4.9,,6.4,,2003-06-27,Guallatiri,Chile,Stratovolcano
435,1998-03-30T18:28:27.590Z,-23.818,-67.754,105.6,4.9,,7.4,,2013-01-17,Miniques,Chile,Stratovolcano
...,...,...,...,...,...,...,...,...,...,...,...,...
530,2009-10-05T17:03:29.580Z,-18.747,169.212,250.2,4.6,,6.4,,2012-07-22,Traitor's Head,Vanuatu,Stratovolcano
1063,2009-07-03T10:13:05.050Z,-18.778,169.268,258.0,4.4,,9.0,,2008-09-16,Traitor's Head,Vanuatu,Stratovolcano
8975,2006-04-01T11:27:08.160Z,-18.737,169.236,243.3,4.5,,,,1997-10-13,Traitor's Head,Vanuatu,Stratovolcano
11881,2004-12-17T02:48:01.690Z,-18.783,169.210,279.7,4.4,,31.2,,1995-11-11,Traitor's Head,Vanuatu,Stratovolcano


# Cleaning Eruption time columns

Before I go any further, let's clean the eruption dates to match the format of the earthquake dates, since I want them to be the same format and data type to compare later on.

This means I need to convert the individual start and end year/month/day columns into two columns, each in yyyy-mm-dd pandas datetime format.

In [22]:
eruptions_clean_df 

Unnamed: 0,Start Year,Start Month,Start Day,End Year,End Month,End Day,Volcano Name,Latitude,Longitude,VEI
0,2022,12,10,2022.0,12.0,19.0,Lascar,-23.370,-67.730,
1,2022,11,27,2022.0,12.0,10.0,Mauna Loa,19.475,-155.608,
2,2022,11,18,2022.0,12.0,19.0,Ahyi,20.420,145.030,
3,2022,11,15,2022.0,11.0,29.0,San Miguel,13.434,-88.269,
4,2022,10,21,2022.0,12.0,19.0,Cotopaxi,-0.677,-78.436,2.0
...,...,...,...,...,...,...,...,...,...,...
775,1990,1,18,1990.0,5.0,8.0,"Fournaise, Piton de la",-21.244,55.708,0.0
776,1990,1,16,1990.0,12.0,16.0,Raung,-8.119,114.056,2.0
777,1990,1,7,1990.0,1.0,26.0,Ruapehu,-39.280,175.570,1.0
778,1990,1,7,1992.0,7.0,16.0,Galeras,1.220,-77.370,2.0


In [23]:
#let's combine the year, month, and day to get 2 columns: a start date and an end date
start = []
end = []
for row in eruptions_clean_df.values:
  start.append(str(row[0])+'-'+str(row[1])+'-'+str(row[2]))
  end.append(str(row[3]).replace(".0", "")+'-'+str(row[4]).replace(".0", "")+'-'+str(row[5]).replace(".0", ""))


In [24]:
#only keeping what I want
eruptions_final_df = eruptions_clean_df[['Volcano Name', 'Latitude', 'Longitude', 'VEI']]

In [25]:
#let's make the date columns, for start and end of eruption
eruptions_final_df['Start Date'] = pd.to_datetime(pd.Series(start), yearfirst=True)
eruptions_final_df['End Date'] = pd.to_datetime(pd.Series(end), yearfirst=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eruptions_final_df['Start Date'] = pd.to_datetime(pd.Series(start), yearfirst=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eruptions_final_df['End Date'] = pd.to_datetime(pd.Series(end), yearfirst=True)
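
An alternative to building date strings by hand: `pd.to_datetime` can assemble dates from a frame whose columns are named `year`, `month`, and `day`. The sketch below uses made-up rows and assumes the end columns have already had their NaNs filled, so they can be cast to integers; `assemble_date` and `dated` are illustrative names.

```python
import pandas as pd

# Made-up eruptions with float end columns, as they appear after reading the CSV.
eruptions = pd.DataFrame({
    "Start Year": [2022, 1990], "Start Month": [12, 1], "Start Day": [10, 7],
    "End Year": [2022.0, 1992.0], "End Month": [12.0, 7.0], "End Day": [19.0, 16.0],
})


def assemble_date(frame, prefix):
    # Rename the three parts to the names pd.to_datetime expects and cast to int.
    parts = pd.DataFrame({
        "year": frame[f"{prefix} Year"],
        "month": frame[f"{prefix} Month"],
        "day": frame[f"{prefix} Day"],
    }).astype(int)
    return pd.to_datetime(parts)


# Work on a copy so adding columns does not touch the input frame.
dated = eruptions.copy()
dated["Start Date"] = assemble_date(eruptions, "Start")
dated["End Date"] = assemble_date(eruptions, "End")
```

The integer cast would fail on any remaining NaN, which makes leftover missing values visible instead of silently producing odd strings.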


Eruption Data frame with correct date format!

In [26]:
eruptions_final_df

Unnamed: 0,Volcano Name,Latitude,Longitude,VEI,Start Date,End Date
0,Lascar,-23.370,-67.730,,2022-12-10,2022-12-19
1,Mauna Loa,19.475,-155.608,,2022-11-27,2022-12-10
2,Ahyi,20.420,145.030,,2022-11-18,2022-12-19
3,San Miguel,13.434,-88.269,,2022-11-15,2022-11-29
4,Cotopaxi,-0.677,-78.436,2.0,2022-10-21,2022-12-19
...,...,...,...,...,...,...
775,"Fournaise, Piton de la",-21.244,55.708,0.0,1990-01-18,1990-05-08
776,Raung,-8.119,114.056,2.0,1990-01-16,1990-12-16
777,Ruapehu,-39.280,175.570,1.0,1990-01-07,1990-01-26
778,Galeras,1.220,-77.370,2.0,1990-01-07,1992-07-16


In [27]:
volcanoe_earthquakes_df=volcanoe_earthquakes_df.reset_index()

In [28]:
volcanoe_earthquakes_df

Unnamed: 0,index,time,latitude,longitude,depth,mag,horizontalError,depthError,magError,Date,Volcano Name,Country,Type
0,2172,1990-12-27T20:54:57.750Z,-24.018,-66.474,205.6,4.5,,4.9,,2002-11-12,Tuzgle,Argentina,Stratovolcano
1,852,1996-09-28T23:49:28.830Z,-19.482,-67.445,224.2,3.9,,12.9,,2010-07-14,"Jayu Khota, Laguna",Bolivia,Maars
2,991,1996-04-17T01:29:48.540Z,-21.772,-68.234,135.6,4.3,,12.9,,2009-05-18,"Azufre, Cerro del",Bolivia,Stratovolcano
3,2091,1991-05-10T22:07:51.010Z,-18.387,-69.100,139.0,4.9,,6.4,,2003-06-27,Guallatiri,Chile,Stratovolcano
4,435,1998-03-30T18:28:27.590Z,-23.818,-67.754,105.6,4.9,,7.4,,2013-01-17,Miniques,Chile,Stratovolcano
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15862,530,2009-10-05T17:03:29.580Z,-18.747,169.212,250.2,4.6,,6.4,,2012-07-22,Traitor's Head,Vanuatu,Stratovolcano
15863,1063,2009-07-03T10:13:05.050Z,-18.778,169.268,258.0,4.4,,9.0,,2008-09-16,Traitor's Head,Vanuatu,Stratovolcano
15864,8975,2006-04-01T11:27:08.160Z,-18.737,169.236,243.3,4.5,,,,1997-10-13,Traitor's Head,Vanuatu,Stratovolcano
15865,11881,2004-12-17T02:48:01.690Z,-18.783,169.210,279.7,4.4,,31.2,,1995-11-11,Traitor's Head,Vanuatu,Stratovolcano


# Cleaning Nomenclature (Eruption Data Frame)

I notice that my eruptions data set has slightly different names for a few volcanoes. Using a function very similar to the one above (slightly tweaked), I will use my volcano data frame to map the correct names to each row in my eruption data frame. 


In [29]:
#this function maps volcanic names to eruptions
def map_volcanoes_eruptions(volcanoe_df, eruption_df):
  indexs = []
  names = []
  #outer loop: name, lat, and long of every volcanic center
  for name, lat_2, lon_2 in zip(volcanoe_df['Volcano Name'], volcanoe_df['Latitude'], volcanoe_df['Longitude']):
    #inner loop: lat and long of every eruption; i tracks the eruption's row position
    i = 0
    for lat, lon in zip(eruption_df['latitude'], eruption_df['longitude']):
      #if the eruption and volcanic center are within 0.010 degrees in both lat and long (about 1 km in latitude), add the row position to our list 
      if abs(float(lat) - float(lat_2)) <= 0.010 and abs(float(lon) - float(lon_2)) <= 0.010:
        indexs.append(i)
        names.append(name)
      i += 1
        
  new_eruption_df = eruption_df.iloc[indexs]
  new_eruption_df['Volcano Name'] = names
  return new_eruption_df, indexs, names

In [30]:
#rename my columns and now running the function!
eruptions_final_df = eruptions_final_df.rename(columns={"Latitude": 'latitude', 'Longitude': 'longitude'})
eruptions_renamed, indexs, names = map_volcanoes_eruptions(volcanoes_df, eruptions_final_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_eruption_df['Volcano Name'] = names


# Final Eruption Data Frame 

---

This data frame has the correct names and date columns! So it is ready to be merged with my earthquake data set

In [31]:
eruptions_renamed

Unnamed: 0,Volcano Name,latitude,longitude,VEI,Start Date,End Date
138,Bristol Island,-59.017,-26.533,1.0,2016-04-24,2016-07-19
521,Montagu Island,-58.445,-26.374,1.0,2001-10-01,2007-09-20
185,Saunders,-57.800,-26.483,1.0,2014-11-12,2022-11-16
549,Saunders,-57.800,-26.483,0.0,2000-05-13,2013-11-16
580,Saunders,-57.800,-26.483,0.0,1999-01-19,1999-01-19
...,...,...,...,...,...,...
448,St. Helens,46.200,-122.180,2.0,2004-10-01,2008-01-27
761,St. Helens,46.200,-122.180,3.0,1990-11-05,1991-02-14
370,"Tair, Jebel at",15.550,41.830,3.0,2007-09-30,2008-06-16
215,Zubair Group,15.050,42.180,2.0,2013-09-28,2013-11-20


# Mapping eruptions to earthquakes

---

Before I can actually merge, I first need to create an artificial date column in my seismic data set which allows me to merge on a unique key. Currently, only general information like the volcano name, location, and type matches between the data sets. However, there are multiple earthquakes and eruptions at any given location. 

Here is my behemoth of a function which matches an eruption (at a specific location and time) to an earthquake. 

This produces a "Start Date" column, named that way so I can easily merge with my eruption data set, which has a column with the same name. 

In [35]:
## this function takes in eruption and earthquake data
## using volcano name and date it determines whether a volcanic earthquake happened during an eruption
## if the earthquake happened during an eruption, its "Start Date" is set to the start date of that eruption
## this is because I will merge my data sets on start date and volcano name
## if an earthquake at one of these volcanoes does not fall within an eruption it is still kept in the data set
## its "Start Date" is just its own date, so when I merge it is not matched to an eruption
def eruption_matching(earthquake_df, eruptions_df):
  #empty data frame 
  final_df = pd.DataFrame()
  #create a list of unique volcanoes that have eruptions
  volcanoes = eruptions_df['Volcano Name'].unique().tolist()
  #loop through
  for volcano in volcanoes:
    #creating mini df's with just earthquakes and eruptions for that specific volcano
    mini_df = earthquake_df[earthquake_df['Volcano Name']==volcano]
    mini_df=mini_df.reset_index()
    eruption_df = eruptions_df[eruptions_df['Volcano Name']==volcano]
    eruption_df=eruption_df.reset_index()
    #creating an artifical eruptions series with the earthquake dates which I will overwrite soon
    eruption_series = pd.Series(mini_df['Date'])
    #then find the index of earthquakes that match an eruption time
    #but make sure the volcano exists in both data sets 
    if len(mini_df) > 0 and len(eruption_df) > 0:
      #print('yes')
      for i in range(len(mini_df)):
        for p in range(len(eruption_df)):
          if eruption_df['End Date'][p]>=mini_df['Date'][i] and eruption_df['Start Date'][p]<=mini_df['Date'][i]:
          #assigning start date here, may change this to something else later depending on how it looks merged
            eruption_series[i] = eruption_df['Start Date'][p]
            #print(eruption_df['Start Date'][p])
      #make column in our masked data frame
      #eruption_series already holds datetimes, so no parsing options are needed
      mini_df['Start Date'] = pd.to_datetime(eruption_series)
      final_df = pd.concat([final_df, mini_df], axis=0)
  return final_df

dummy = eruption_matching(volcanoe_earthquakes_df, eruptions_renamed)
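
The window test inside `eruption_matching` (eruption start <= earthquake date <= eruption end, for the same volcano) can also be sketched with a merge instead of nested loops. This is an illustration on made-up data, not a drop-in replacement: unlike the function above, it keeps earthquakes at volcanoes that have no eruptions at all, and `match_eruptions` and `quake_id` are hypothetical names.

```python
import pandas as pd

quakes = pd.DataFrame({
    "Volcano Name": ["A", "A", "B"],
    "Date": pd.to_datetime(["2001-01-05", "2003-01-01", "2001-01-05"]),
})
eruptions = pd.DataFrame({
    "Volcano Name": ["A"],
    "Start Date": pd.to_datetime(["2001-01-01"]),
    "End Date": pd.to_datetime(["2001-02-01"]),
})


def match_eruptions(quakes, eruptions):
    q = quakes.reset_index(drop=True)
    q["quake_id"] = q.index  # unique key for each earthquake row
    # Pair every earthquake with every eruption at the same volcano.
    pairs = q.merge(eruptions[["Volcano Name", "Start Date", "End Date"]],
                    on="Volcano Name", how="left")
    # True where the earthquake date falls inside the eruption window (inclusive).
    in_window = pairs["Date"].between(pairs["Start Date"], pairs["End Date"])
    # If several eruptions match, keep the last one, as the loop version overwrites.
    hits = (pairs.loc[in_window]
            .drop_duplicates("quake_id", keep="last")
            .set_index("quake_id")["Start Date"])
    # Unmatched earthquakes fall back to their own date.
    q["Start Date"] = q["quake_id"].map(hits).fillna(q["Date"])
    return q.drop(columns="quake_id")


result = match_eruptions(quakes, eruptions)
```

Here the first earthquake takes the eruption's start date, while the other two keep their own dates.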


I've created a dummy seismic data frame with mapped "start times" of a corresponding eruption if there is an eruption at that time

In [36]:
dummy

Unnamed: 0,level_0,index,time,latitude,longitude,depth,mag,horizontalError,depthError,magError,Date,Volcano Name,Country,Type,Start Date
0,6,792,1994-06-06T20:47:40.530Z,2.917000,-76.057000,12.100,6.80,,,,2008-10-26,"Huila, Nevado del",Colombia,Stratovolcano,2008-10-26
0,7,4528,1995-10-11T22:44:38.970Z,10.851000,-85.316000,161.800,4.30,,,,1994-08-10,Rincon de la Vieja,Costa Rica,Complex volcano,1994-08-10
0,19,31,2015-01-23T03:07:03.020Z,64.653800,-17.520600,6.630,4.60,6.90,3.40,0.076,2018-06-21,Bardarbunga,Iceland,Stratovolcano,2018-06-21
1,20,84,2014-12-23T22:23:48.060Z,64.631500,-17.505000,6.650,4.50,3.00,4.70,0.085,2014-08-29,Bardarbunga,Iceland,Stratovolcano,2014-08-29
2,21,128,2014-12-02T02:18:28.000Z,64.666000,-17.487000,2.500,5.30,6.80,3.90,,2014-08-29,Bardarbunga,Iceland,Stratovolcano,2014-08-29
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
161,15707,18639,2017-04-08T14:11:28.810Z,19.475500,-155.642700,7.905,3.20,0.80,2.90,0.191,2015-02-01,Mauna Loa,United States,Shield volcano,2015-02-01
162,15708,18640,2017-04-08T14:11:28.810Z,19.475500,-155.642667,4.380,3.23,0.24,0.38,0.191,2015-02-01,Mauna Loa,United States,Shield volcano,2015-02-01
163,15709,18733,2016-11-09T10:07:47.170Z,19.459833,-155.588501,44.610,2.57,1.09,1.28,,2015-01-30,Mauna Loa,United States,Shield volcano,2015-01-30
164,15710,18772,2016-09-06T14:25:57.620Z,19.461000,-155.592667,-1.830,3.79,0.24,0.23,0.220,2015-01-27,Mauna Loa,United States,Shield volcano,2015-01-27


# Merging

Now that the data frames match, I will merge on start date and volcano name. I will do an "outer" merge so I do not lose data I can analyze later on. 


I will keep earthquakes which do not correspond to an eruption because seismicity at volcanoes does not occur only during an active eruption. Processes like magma rising to the surface or gases being released, which are often called pre-eruptive signatures, can produce detectable seismicity. This is why seismicity outside of an eruption time window is very informative about the volcano, its structure, and how it erupts. 

In [38]:
final = dummy.merge(eruptions_renamed, on=['Volcano Name', 'Start Date'], how='outer', suffixes=('_earthquake', '_eruption'))

In [39]:
#my final data frame! 
#it is not the prettiest but I've kept all data so I am not losing anything I want to use later on
final

Unnamed: 0,level_0,index,time,latitude_earthquake,longitude_earthquake,depth,mag,horizontalError,depthError,magError,Date,Volcano Name,Country,Type,Start Date,latitude_eruption,longitude_eruption,VEI,End Date
0,6.0,792.0,1994-06-06T20:47:40.530Z,2.9170,-76.0570,12.10,6.8,,,,2008-10-26,"Huila, Nevado del",Colombia,Stratovolcano,2008-10-26,2.930,-76.030,3.0,2012-01-14
1,7.0,4528.0,1995-10-11T22:44:38.970Z,10.8510,-85.3160,161.80,4.3,,,,1994-08-10,Rincon de la Vieja,Costa Rica,Complex volcano,1994-08-10,,,,NaT
2,19.0,31.0,2015-01-23T03:07:03.020Z,64.6538,-17.5206,6.63,4.6,6.9,3.4,0.076,2018-06-21,Bardarbunga,Iceland,Stratovolcano,2018-06-21,,,,NaT
3,20.0,84.0,2014-12-23T22:23:48.060Z,64.6315,-17.5050,6.65,4.5,3.0,4.7,0.085,2014-08-29,Bardarbunga,Iceland,Stratovolcano,2014-08-29,64.633,-17.516,0.0,2015-02-27
4,21.0,128.0,2014-12-02T02:18:28.000Z,64.6660,-17.4870,2.50,5.3,6.8,3.9,,2014-08-29,Bardarbunga,Iceland,Stratovolcano,2014-08-29,64.633,-17.516,0.0,2015-02-27
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16208,,,,,,,,,,,NaT,St. Helens,,,2004-10-01,46.200,-122.180,2.0,2008-01-27
16209,,,,,,,,,,,NaT,St. Helens,,,1990-11-05,46.200,-122.180,3.0,1991-02-14
16210,,,,,,,,,,,NaT,"Tair, Jebel at",,,2007-09-30,15.550,41.830,3.0,2008-06-16
16211,,,,,,,,,,,NaT,Zubair Group,,,2013-09-28,15.050,42.180,2.0,2013-11-20
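
Since the merged frame carries both Start Date and End Date, one possible way to check whether an earthquake falls inside its matched eruption window is sketched below. The three rows are copied from the output above. The `during_eruption` column name is my own invention, and treating rows with no End Date as "outside a window" is an assumption.

```python
import pandas as pd

# Three rows copied from the merged output: earthquake time, eruption start, eruption end
toy = pd.DataFrame({
    "time": ["2014-12-23T22:23:48.060Z", "1994-06-06T20:47:40.530Z", "2015-01-23T03:07:03.020Z"],
    "Start Date": pd.to_datetime(["2014-08-29", "2008-10-26", "2018-06-21"]),
    "End Date": pd.to_datetime(["2015-02-27", "2012-01-14", None]),
})

# Parse the ISO timestamps, then drop the UTC timezone so they compare with the plain dates
quake_time = pd.to_datetime(toy["time"], utc=True).dt.tz_localize(None)

# True only when the earthquake lies between start and end; a missing End Date (NaT) gives False
toy["during_eruption"] = quake_time.between(toy["Start Date"], toy["End Date"])
print(toy[["time", "during_eruption"]])
```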


In [40]:
#now let's save my final data set for me to use in other colabs
#mount my drive so I can copy the csv there
from google.colab import drive
drive.mount('drive', force_remount=True)
final.to_csv('final.csv')
!cp final.csv "drive/My Drive/"

Mounted at drive


Now on to my historical data, which I will use for prediction!

# Historical Eruptions

---


To make things interesting, I will also scrape eruption data from this website: https://volcano.oregonstate.edu/largest-eruptions-1400-ad , which lists large historical eruptions.

I will use this data set to predict the seismicity associated with these historic eruptions, giving insight into the past!!!


In [None]:
import requests
response_wiki = requests.get(
    "https://volcano.oregonstate.edu/largest-eruptions-1400-ad")

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response_wiki.text, "html.parser")

In [None]:
#I have inspected and found the tag for the table, let's see how many tables match it!
len(soup.find_all("table"))

1

In [None]:
#okay so there is only 1 table under that tag so great! Will pull that

table = soup.find_all("table")[0]

In [None]:

#pull table body with info
table_body = table.find('tbody')
#initialize dict
historical_eruptions = {'Name': [], 'Date': [], 'VEI': [], 'Type': []}

#tr's indicate rows
#td's indicate "columns" but really data points
#some rows keep the text in the td tags and some nest it in <a> tags, so I set each field manually
#.text is safer than .string here, less error prone
for row in table_body.find_all('tr')[1:]: #here we skip the first row which is a header
  value = row.find_all('td')
  historical_eruptions['Name'].append(value[0].text.strip())
  historical_eruptions['Date'].append(value[2].text.strip())
  #drop the "?" and "+" markers so only the value is kept
  historical_eruptions['VEI'].append(value[3].text.strip().replace("?", "").replace("+",""))
  historical_eruptions['Type'].append(value[4].text.strip().replace("?", ""))
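
To show why reading the cell text works for both plain and nested cells, here is a small self-contained sketch. The HTML string is made up to imitate the layout described in the comments: the "Location" column and the placement of the "+" and "?" markers are assumptions, not copied from the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the table layout: a header row, then rows whose cells
# hold either plain text or text nested inside an <a> tag
html = """
<table><tbody>
<tr><td>Name</td><td>Location</td><td>Date</td><td>VEI</td><td>Type</td></tr>
<tr><td><a href="#">Kuwae</a></td><td>Vanuatu</td><td>1452</td><td>6+</td><td>Caldera</td></tr>
<tr><td>Raung</td><td>Java</td><td>1593</td><td>5?</td><td><a href="#">Stratovolcano?</a></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("tbody").find_all("tr")[1:]:  # skip the header row
    # get_text collects the text of the cell and of anything nested inside it
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    rows.append({
        "Name": cells[0],
        "Date": cells[2],
        "VEI": cells[3].replace("?", "").replace("+", ""),
        "Type": cells[4].replace("?", ""),
    })

print(rows)
```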


In [None]:
import pandas as pd
historical_eruptions_df = pd.DataFrame.from_dict(historical_eruptions)

In [None]:
historical_eruptions_df = historical_eruptions_df.merge(volcanoes_df, left_on='Name', right_on='Volcano Name')
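
One thing to keep in mind: `merge` defaults to an inner join, so any scraped eruption whose name is spelled differently in the volcano list would drop out silently. A minimal sketch for checking this on toy data is below. The frames and the "Hypothetical Peak" entry are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the scraped eruptions and the volcano location list
scraped = pd.DataFrame({"Name": ["Kuwae", "Raung", "Hypothetical Peak"], "VEI": ["6", "5", "5"]})
volcanoes = pd.DataFrame({"Volcano Name": ["Kuwae", "Raung"], "Country": ["Vanuatu", "Indonesia"]})

# A left merge with indicator=True keeps every scraped row and marks the ones with no match
checked = scraped.merge(
    volcanoes, left_on="Name", right_on="Volcano Name", how="left", indicator=True
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

Any names that show up in `unmatched` could then be fixed by hand before doing the real merge.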

In [None]:
#my historical data!
historical_eruptions_df

Unnamed: 0,Name,Date,VEI,Type_x,Volcano Name,Country,Location,Latitude,Longitude,Type_y
0,Aniakchak,1450C,5,Caldera,Aniakchak,United States,Alaska Peninsula,56.88,-158.17,Caldera
1,Kuwae,1452,6,Caldera,Kuwae,Vanuatu,Vanuatu-SW Pacific,-16.829,168.536,Caldera
2,Bardarbunga,1477,5,Stratovolcano,Bardarbunga,Iceland,Iceland-NE,64.633,-17.516,Stratovolcano
3,St. Helens,1480D,5,Stratovolcano,St. Helens,United States,US-Washington,46.2,-122.18,Stratovolcano
4,St. Helens,1540,5,Stratovolcano,St. Helens,United States,US-Washington,46.2,-122.18,Stratovolcano
5,St. Helens,18-May-80,5,Stratovolcano,St. Helens,United States,US-Washington,46.2,-122.18,Stratovolcano
6,Billy Mitchell,1580C,6,Ash shield,Billy Mitchell,Papua New Guinea,Bougainville-SW Paci,-6.092,155.225,Pyroclastic shield
7,Raung,1593,5,Stratovolcano,Raung,Indonesia,Java,-8.119,114.056,Stratovolcano
8,Huaynaputina,1600,6,Explosion crater,Huaynaputina,Peru,Peru,-16.608,-70.85,Stratovolcano
9,Parker,"Jan. 4, 1641",6,Stratovolcano,Parker,Philippines,Mindanao-Philippines,6.113,124.892,Stratovolcano
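
The Date column mixes several formats (plain years, years with a trailing letter, and full dates), so it will probably need cleaning before modeling. Below is a rough sketch of pulling out just the year. The `extract_year` helper is hypothetical, and reading a two-digit year like "80" as 1980 is an assumption that only fits recent eruptions.

```python
import re

def extract_year(date_text, two_digit_century=1900):
    """Return the year found in a messy date string, or None if there is none."""
    # Prefer a four-digit year anywhere in the string, e.g. "1450C" or "Jan. 4, 1641"
    match = re.search(r"\d{4}", date_text)
    if match:
        return int(match.group())
    # Fall back to a trailing two-digit year, e.g. "18-May-80" (assumed to mean 19xx)
    match = re.search(r"(\d{2})$", date_text.strip())
    if match:
        return two_digit_century + int(match.group(1))
    return None

# Date strings copied from the output above
dates = ["1450C", "1452", "1480D", "18-May-80", "Jan. 4, 1641"]
years = [extract_year(d) for d in dates]
print(years)
```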


In [None]:
#mount my drive so I can copy the csv there
from google.colab import drive
drive.mount('drive', force_remount=True)
historical_eruptions_df.to_csv('historical_eruptions.csv')
!cp historical_eruptions.csv "drive/My Drive/"

Mounted at drive


# On to data visualization and exploration!