## Reverse Geocoder - MODIS Datasets

This notebook translates the lat/long features into the state codes used in the weather data file. There are around 7 million usable rows, so looking up the state for each lat/long pair manually is not feasible.

We decided to automate this with the Geocoding API provided by Google. The process applied is known as reverse geocoding: converting a coordinate pair into an address, from which we read the state.
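
To make the idea concrete, here is a minimal sketch of how a state code could be read from a reverse-geocoding result. It assumes the response layout documented for the Google Geocoding API, where each result carries a list of `address_components` with `short_name` and `types` fields and the state-level entry is tagged `administrative_area_level_1`. The helper name `extract_state_code` and the abbreviated `sample_result` are hypothetical and made up for illustration; no API call is made.

```python
def extract_state_code(results):
    """Return the short name of the first state-level component, or None."""
    for result in results:
        for component in result.get("address_components", []):
            if "administrative_area_level_1" in component.get("types", []):
                return component.get("short_name")
    return None


# Hypothetical, heavily abbreviated response; the values are made up.
sample_result = [
    {
        "address_components": [
            {"long_name": "Example Municipality", "short_name": "Example Municipality",
             "types": ["administrative_area_level_2", "political"]},
            {"long_name": "State of Mato Grosso do Sul", "short_name": "MS",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "Brazil", "short_name": "BR",
             "types": ["country", "political"]},
        ]
    }
]

print(extract_state_code(sample_result))
print(extract_state_code([]))
```

Looking the component up by its type rather than by its position would avoid relying on the state always sitting at the same index in the list.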


In [None]:
!pip install googlemaps



We used a shared Google Drive folder to make loading the datasets easier on everyone's machines. This notebook cannot be executed in its current state without access to that shared folder.

- We start with the imports.

In [None]:
from google.colab import drive
import pandas as pd
import googlemaps

Cell to test the Google API (kept inside a string literal so that it does not run).

In [None]:
'''
# Replace the placeholder with your own Google Maps API key; never commit a real key.
gmaps = googlemaps.Client(key='YOUR_API_KEY')
reverse_geocode_result = gmaps.reverse_geocode((-21.7912, -54.1011))
# Assumes the state is the third address component of the first result.
if len(reverse_geocode_result) >= 1 and len(reverse_geocode_result[0]['address_components']) >= 3:
  obj = reverse_geocode_result[0]['address_components'][2]['short_name']
  print(obj)
'''

- Mount the drive to access the .csv datasets from the shared folder.

In [None]:
drive.mount("/content/drive")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


- We use a list to iteratively manage the CSV files, which are named according to their year.
- Create an empty dataframe.

In [None]:
list1 = list(range(2000, 2020))  # years 2000 through 2019
df = pd.DataFrame()

- Iterate over the list and append each CSV to the dataframe. Edit here to limit the years loaded into the df.

In [None]:
for i in list1:
  print(i)
  # DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
  df = pd.concat([df, pd.read_csv(f'/content/drive/Shared drives/BNCS411_Final_Project/modis_{i}_Brazil.csv', delimiter=",")])

2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
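
As an alternative sketch of the loading step, the frames can be collected in a list and concatenated once with `ignore_index=True`, which produces a fresh 0..n-1 index right away. The in-memory CSV strings and the helper name `load_years` are stand-ins for the real `modis_{year}_Brazil.csv` files, which are assumed here to share the same columns.

```python
import io

import pandas as pd


def load_years(sources):
    """Read each CSV source and concatenate them into one frame with a fresh index."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)


# Stand-ins for two yearly files; the notebook reads the real ones from the shared drive.
csv_2000 = io.StringIO("latitude,longitude,acq_date\n-10.1,-50.2,2000-11-01\n")
csv_2001 = io.StringIO(
    "latitude,longitude,acq_date\n-12.3,-48.9,2001-08-15\n-9.7,-51.0,2001-09-02\n"
)

combined = load_years([csv_2000, csv_2001])
print(combined.index.tolist())
```

Concatenating once also avoids copying the growing frame on every loop iteration.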


- We reset the index of the df to eliminate duplicate index labels (each CSV starts counting at 0 again) and ease further editing. Because `reset_index()` is called without `drop=True`, the old index is kept as a new column 'index'.
- Drop unused features, including the 'index' column.
- Drop all rows (fires) categorized as anything other than forest fires, i.e. rows whose type is 1, 2, or 3.
- Drop all rows with low confidence (< 30).
- Sort the dataframe by acquisition date and reset the index once more.
- Finally, drop confidence, type, and the newly created index column: they were only needed to filter rows and are no longer required.

In [None]:
df = df.reset_index()
# Positional axis arguments were removed in pandas 2.0, so name the columns explicitly.
df = df.drop(columns=['index', 'satellite', 'brightness', 'scan', 'track', 'acq_time', 'instrument', 'version', 'bright_t31', 'frp', 'daynight'])
# Assuming read_csv parsed 'type' as numbers, compare against ints; strings would match nothing.
mask = df['type'].isin([1, 2, 3])
df = df[~mask]
drop_index = df[df['confidence'] <= 29].index
df.drop(drop_index, inplace=True)
df = df.sort_values('acq_date')
df.reset_index(inplace=True)
df = df.drop(columns=['confidence', 'index', 'type'])
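
The filtering steps above can also be written as a single boolean selection. The following is a minimal sketch on a made-up toy frame, assuming `type` and `confidence` are numeric columns; `toy`, `keep` and `filtered` are illustrative names only.

```python
import pandas as pd

toy = pd.DataFrame({
    "latitude": [-10.0, -11.0, -12.0, -13.0],
    "longitude": [-50.0, -51.0, -52.0, -53.0],
    "acq_date": ["2001-05-02", "2000-01-01", "2000-03-04", "2002-07-08"],
    "confidence": [80, 29, 30, 95],
    "type": [0, 0, 0, 2],
})

# Keep rows that are not type 1, 2 or 3 and whose confidence is at least 30.
keep = ~toy["type"].isin([1, 2, 3]) & (toy["confidence"] >= 30)
filtered = (
    toy[keep]
    .sort_values("acq_date")
    .drop(columns=["confidence", "type"])
    .reset_index(drop=True)
)
print(filtered)
```

Using `reset_index(drop=True)` here means no extra 'index' column is created, so nothing has to be dropped afterwards.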

- Create the statecode column.
- Shuffle the df so that a random 50,000 rows can be extracted later.

In [None]:
df["statecode"] = ""
df = df.sample(frac=1).reset_index(drop=True)
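
If the random subset needs to be reproducible across runs, `DataFrame.sample` accepts a `random_state` seed. A small sketch on made-up data follows; the seed value 42 and the frame `toy` are arbitrary choices for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"latitude": range(10), "longitude": range(10, 20)})

# Shuffle all rows with a fixed seed, then take the first 3 as the sample.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)
subset = shuffled.head(3)

# Re-running with the same seed yields the same row order.
again = toy.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.equals(again))
```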

- Create a smaller df2 from the shuffled df.
- Iterate over the reduced df2 and get the state code for each lat/long pair.
- Skip lat/long pairs for which the API returns no usable result (e.g. corrupted values).
- Execute only once(!) and write the result to a CSV file for future use.



In [None]:
'''
df2 = df.head(50000).copy()  # copy so the writes below do not target a view of df
for i in df2.index:
  rg_result = gmaps.reverse_geocode((df2.loc[i, 'latitude'], df2.loc[i, 'longitude']))
  # Assumes the state is the third address component of the first result.
  if len(rg_result) >= 1 and len(rg_result[0]['address_components']) >= 3:
    df2.at[i, 'statecode'] = rg_result[0]['address_components'][2]['short_name']
    print(i)
  else:
    print('skip')

df2.to_csv('rand_fires.csv')
!cp rand_fires.csv "/content/drive/Shared drives/BNCS411_Final_Project"
'''

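
Since the number of API requests is the bottleneck, one option is to cache lookups so that repeated or nearly identical coordinates trigger only one request. The sketch below is an illustration under stated assumptions: `fake_lookup` is a hypothetical stand-in for the real reverse-geocoding call, and rounding to two decimals (roughly 1 km) is an arbitrary choice that trades accuracy near state borders for fewer requests.

```python
def make_cached_lookup(lookup, decimals=2):
    """Wrap `lookup(lat, lng)` so each rounded coordinate pair is fetched only once."""
    cache = {}

    def cached(lat, lng):
        key = (round(lat, decimals), round(lng, decimals))
        if key not in cache:
            cache[key] = lookup(*key)
        return cache[key]

    return cached


calls = []


def fake_lookup(lat, lng):
    """Hypothetical stand-in for the API call; records every request it receives."""
    calls.append((lat, lng))
    return "MS"


get_state = make_cached_lookup(fake_lookup)
codes = [
    get_state(-21.7912, -54.1011),
    get_state(-21.7913, -54.1012),  # rounds to the same key, served from the cache
    get_state(-20.5, -54.6),
]
print(codes, len(calls))
```

In the real loop, `fake_lookup` would be replaced by a function that calls the API and extracts the state code from the result.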

Because the cell above was executed only once, we now reload the CSV file it wrote into df2 and continue working with it.

In [None]:
# Read the saved sample directly; appending to an empty frame is unnecessary.
df2 = pd.read_csv('/content/drive/Shared drives/BNCS411_Final_Project/rand_fires.csv', delimiter=",")

Here we clean up df2 to make it usable afterwards:
- Save the length of each statecode value into 'length'.
- Delete all rows where the length is > 2; we treat those as faulty API results.
- Drop the 'Unnamed: 0' column, which `read_csv` creates from the index that `to_csv` wrote into the file.

In [None]:
df2['length'] = df2.statecode.str.len()
pd.set_option('display.max_rows', None)
df2 = df2[df2.length < 3]
del df2['length']
# Reset the index of df2 (the frame that was just filtered), not df.
df2.reset_index(inplace=True, drop=True)
del df2['Unnamed: 0']
df2
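
The length check keeps any value of at most two characters, including codes that are not Brazilian states. A stricter sketch, offered as one possible refinement, filters against the 27 federative-unit abbreviations (26 states plus the Federal District). The toy frame and the name `BRAZIL_UF` are illustrative.

```python
import pandas as pd

# Abbreviations of Brazil's 26 states and the Federal District.
BRAZIL_UF = {
    "AC", "AL", "AP", "AM", "BA", "CE", "DF", "ES", "GO",
    "MA", "MT", "MS", "MG", "PA", "PB", "PR", "PE", "PI",
    "RJ", "RN", "RS", "RO", "RR", "SC", "SP", "SE", "TO",
}

toy = pd.DataFrame({
    "latitude": [-21.8, -3.1, -25.5, -15.8],
    "statecode": ["MS", "Manaus", "PY", None],
})

# isin() is False for missing values, so rows without a state code are dropped too.
valid = toy[toy["statecode"].isin(BRAZIL_UF)].reset_index(drop=True)
print(valid)
```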

Save the cleaned df2 (reduced from 50,000 rows) and the full df, whose statecode column is still empty, to the shared folder.

In [None]:
df2.to_csv('rand_fires.csv')
!cp rand_fires.csv "/content/drive/Shared drives/BNCS411_Final_Project"
df.to_csv('rand_fires_empty.csv')
!cp rand_fires_empty.csv "/content/drive/Shared drives/BNCS411_Final_Project"

TODO:
- ~~create column 'Statecode'~~
- ~~**NEW** reduce size of df to manageable requests~~
- ~~iterate over each line: make API request & save into 'Statecode'~~ **Big problem here: too many requests.**
- ~~**NEW** Use the data from the reduced dataframe (df2) to predict the rest of the df. Possible SVM use?~~
- ~~drop 'latitude' & 'longitude'~~
- ~~sort by acq_date -> statecode~~
- ~~create df3 for the final data structure (state, date, fires)~~
- ~~count occurrences per day/state & save into column~~
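
The final structure named in the last two items (state, date, fires) can be sketched with a `groupby`. The toy data is made up; the column names `statecode` and `acq_date` follow the notebook, and `fires` is an assumed name for the count column.

```python
import pandas as pd

toy = pd.DataFrame({
    "statecode": ["MS", "MS", "PA", "MS", "PA"],
    "acq_date": ["2001-01-01", "2001-01-01", "2001-01-01", "2001-01-02", "2001-01-02"],
})

# Count rows per (state, day) and store the count in a 'fires' column.
df3 = (
    toy.groupby(["statecode", "acq_date"])
    .size()
    .reset_index(name="fires")
    .sort_values(["acq_date", "statecode"])
    .reset_index(drop=True)
)
print(df3)
```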
