In [1]:
"""
Purpose of this file:

- Handling of NaN values
- Verifying that the data provided is adequate and, if not, adding more relevant columns

Summary

NaN values -

Looking at the values, most of the populated values appear to come from "stronger" earthquakes, which are detectable. Activity outside the detectable range is given NULL values, 
but any logical failures are also recorded as NULL.

It's not possible to tell how many NULLs come from "weaker" signals and how many from logical errors (not with my current understanding of the calculations, anyway). 
So it is safer to assume that the majority of these signals are "weak", and I raised the floor from NULL to the lowest plausible value for each column.

Country Level Data

I wanted to review an aggregated comparison across country/state level data. However, the {place} value is not suitable for indexing because its description/content varies too much.

To enable country level aggregation, I used reverse geocoding via Nominatim, which takes the latitude and longitude column data and returns location data at the requested zoom level. 
Zoom 3 was used to return country level data.

Manual sifting

Because America is a very large country, it looked like an outlier, with a massive share of the events. So I extracted state level data and integrated it into the dataset to enable 
a clearer comparison and a finer zoom level of where the events are occurring.

(Don't run it, it takes 1-2 hours to run!)

Results: (modified_all_month tab)
https://raw.githubusercontent.com/Whistlingwind/wk3-earthquake-assignment/main/modified_all_month.csv
"""

import csv
import pandas as pd
import math
import numpy as np
from numpy import nan
import io
from geopy.geocoders import Nominatim
from geopy.point import Point

#Import Data and cleanse the NaN values from the dataset
df = pd.read_csv(r'https://raw.githubusercontent.com/Whistlingwind/wk3-earthquake-assignment/main/all_month.csv', 
            index_col='id', 
            parse_dates=['time'],
            header=0, 
            names=['time','Latitude','Longitude','depth','mag','magType','nst','gap','dmin','rms','net','id','updated','place','type','horizontalError','depthError',
                   'magError','magNst','status','locationSource','magSource'] )

#Replace any empty fields as NaN so it can be caught in the next steps
df = df.replace('',np.nan) 

"""
NaN replacement settings:

nst : 0 - No specific floor other than 0
gap : 0 - No specific floor other than 0
dmin : 0 - No specific floor other than 0
horizontalError : 0 - No specific floor other than 0
depthError : 0 - No specific floor other than 0
magError : 0 - No specific floor other than 0
magNst : 1 - Denotes total number of stations for detection, assuming that at least 1 station is required for detection, so setting the floor as 1.

"""
#Create a dictionary mapping each specified column to the value that should replace its NaN entries
naFixValues = {"nst": 0, "gap": 0, "dmin": 0, "horizontalError": 0, "depthError": 0, "magError": 0, "magNst": 1}
#Replace NaN values using pandas; fillna returns a new DataFrame, so the result is assigned back to df
df = df.fillna(value=naFixValues)

#Set up the Nominatim geocoder; a user agent string is required to identify the caller ("test" is used here)
geolocator = Nominatim(user_agent="test")

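#Pass lat & long coordinates into the Nominatim reverse function to return location data at zoom level 3 (country)
def reverse_geocoding(lat, lon):
    try:
        location = geolocator.reverse(Point(lat, lon), zoom=3)
        #reverse() returns None when no result is found
        return location.raw['display_name'] if location is not None else None
    except Exception:
        #Any lookup failure (e.g. timeout or service error) is recorded as a missing country
        return None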


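#Call the reverse function for every row and store the results in a new column called "country"
#otypes=[object] keeps failed lookups as None instead of letting NumPy infer a string dtype from the first result
df['country'] = np.vectorize(reverse_geocoding, otypes=[object])(df['Latitude'], df['Longitude'])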

#Save the result as CSV: the geocoding takes about 70 minutes, so it can't be rerun mid-presentation or during a code/result review.
df.to_csv('all_month_modified.csv')
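
The NaN replacement step can be checked on a small frame before running the full pipeline. The sketch below is only an illustration: the three-row `sample` frame and its values are invented, and `na_fix_values` copies the replacement settings from the cell above. It shows that `fillna` returns a new DataFrame, so the result has to be assigned to keep the replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only.
sample = pd.DataFrame({
    "nst": [12.0, np.nan, 7.0],
    "gap": [np.nan, 85.0, 120.0],
    "magNst": [np.nan, 4.0, np.nan],
    "mag": [1.2, np.nan, 2.5],  # not in the dictionary, so it is left untouched
})

# Same per-column floors as the notebook cell above.
na_fix_values = {"nst": 0, "gap": 0, "dmin": 0, "horizontalError": 0,
                 "depthError": 0, "magError": 0, "magNst": 1}

# fillna returns a new DataFrame; the original frame is not modified.
filled = sample.fillna(value=na_fix_values)

# Count the NaN values left in each column to confirm the replacement.
remaining = filled.isna().sum()
print(remaining)
```

Columns that are missing from the frame (such as `dmin` here) are simply skipped, and columns that are not in the dictionary (such as `mag`) keep their NaN values.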



In [78]:
df = pd.read_csv(r'C:\Users\Whistlingwind\Desktop\Data Science\Earthquake_data\all_month.csv',header=0,usecols=['id','latitude','longitude'] )
print(df)

       latitude   longitude            id
0     38.825500 -122.854332    nc73941516
1     58.177200 -155.255000  ak023c9xsxas
2     55.910700 -158.895200  ak023c9xrwud
3     61.043900 -148.400400  ak023c9xqda0
4     63.514800 -151.058000  ak023c9xoh1i
...         ...         ...           ...
9480  13.837700  144.759200    us7000kse7
9481  35.380167  -84.167167    se60546486
9482  51.890667 -177.855000    av92020996
9483  38.827835 -122.782837    nc73929141
9484  19.379667 -155.284500    hv73544682

[9485 rows x 3 columns]
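
The summary describes adding state level data for America by manual sifting. One possible automated alternative, sketched here under stated assumptions, is a second pass that only looks up rows already labelled with that country. The function `add_state_column`, the callable `lookup(lat, lon, zoom)`, the label "United States" and the placeholder frame are all hypothetical; zoom 5 is the state level in Nominatim's zoom scale. A stand-in lookup is used so the sketch runs offline.

```python
import pandas as pd

def add_state_column(df, lookup, country="United States", state_zoom=5):
    """Return a copy of df with a 'state' column filled only for rows in `country`."""
    out = df.copy()
    out["state"] = [
        lookup(lat, lon, state_zoom) if name == country else None
        for lat, lon, name in zip(out["Latitude"], out["Longitude"], out["country"])
    ]
    return out

# Stand-in for a real reverse geocoding call, so no network access is needed.
def fake_lookup(lat, lon, zoom):
    return f"state-level result (zoom={zoom})"

# Placeholder rows; the country labels are assumptions, not geocoded results.
demo = pd.DataFrame({
    "Latitude": [38.8255, 13.8377],
    "Longitude": [-122.854332, 144.7592],
    "country": ["United States", "Elsewhere"],
})

result = add_state_column(demo, fake_lookup)
print(result)
```

Restricting the second pass to one country keeps the number of extra lookups down, which matters given how long the country level pass already takes.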


In [73]:
df = pd.DataFrame({'id': [1,2,3,4,5],
                   'Latitude': [30.197535, 34.895699, 33.636700, 33.636700, 32.733601],
                   'Longitude': [-97.662015, -82.218903, -84.428101, -84.428101, -117.190000]
                    })
print(df)

   id   Latitude   Longitude
0   1  30.197535  -97.662015
1   2  34.895699  -82.218903
2   3  33.636700  -84.428101
3   4  33.636700  -84.428101
4   5  32.733601 -117.190000
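
The test frame above contains a repeated coordinate pair (rows 2 and 3), which suggests one way to shorten the long geocoding run: look up each distinct pair once and reuse the result. The sketch below is a minimal version of that idea, not the method used in this notebook. `geocode_unique` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for `reverse_geocoding` so the example runs without network access.

```python
import pandas as pd

def geocode_unique(df, lookup, lat_col="Latitude", lon_col="Longitude"):
    """Call lookup(lat, lon) once per distinct coordinate pair and map results back to every row."""
    pairs = df[[lat_col, lon_col]].drop_duplicates()
    cache = {
        (lat, lon): lookup(lat, lon)
        for lat, lon in zip(pairs[lat_col], pairs[lon_col])
    }
    return [cache[(lat, lon)] for lat, lon in zip(df[lat_col], df[lon_col])]

# Stand-in for reverse_geocoding; it records each call so the saving is visible.
calls = []
def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return f"place at ({lat}, {lon})"

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Latitude': [30.197535, 34.895699, 33.636700, 33.636700, 32.733601],
                   'Longitude': [-97.662015, -82.218903, -84.428101, -84.428101, -117.190000]})

df['country'] = geocode_unique(df, fake_lookup)
print(len(calls))  # number of lookups actually made
```

How much time this saves on the real dataset depends on how many events share exact coordinates, which is not measured here.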
