<h1 align=center>SP20: SQL AND NOSQL: 12147</h1> 
<h1 align=center>US Parks</h1>
<h3>Project Team</h3>
<ul>
<li>Huzefa Igatpuriwala - higatpur@iu.edu</li>
<li>Saurabh Swaroop - sswaroop@iu.edu</li>
<li>Vijayakumar Perumalsamy - vperumal@iu.edu</li>
</ul>


<h5><b>Objective:</b></h5>
To gain more insight into the given NPS (National Park Service) dataset by preparing it for visualization.

In [29]:
import pandas as pd
import numpy as np

<h3>Data Preprocessing</h3>

The following datasets are mainly used for this visualization.
<ul>
<li>US Parks Visitors Last 10 years.csv</li>
<li>geo_locations.csv</li>
<li>us_states_abbr.csv</li>
</ul>
<br>
Read the complete US Parks dataset, which covers the last 10 years of RV camper, tent camper, and recreational visit counts by month and year, and process it.

In [30]:
np_df=pd.read_csv("US Parks Visitors Last 10 years.csv")
print("List of columns available in National Parks dataset:")
print(np_df.columns)

List of columns available in National Parks dataset:
Index(['Park', 'UnitCode', 'ParkType', 'Region', 'State', 'Year', 'Month',
       'RecreationVisits', 'TentCampers', 'RVCampers'],
      dtype='object')


In [31]:
print("National Parks dataset shape:",np_df.shape)
print("National Parks dataset sample:",np_df.head(1))

National Parks dataset shape: (53296, 10)
National Parks dataset sample:                              Park UnitCode                  ParkType  \
0  Abraham Lincoln Birthplace NHP     ABLI  National Historical Park   

       Region State  Year  Month  RecreationVisits  TentCampers  RVCampers  
0  Southeast     KY  2008      1              5829            0          0  
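Since each row is one park-month, one possible processing step is to combine `Year` and `Month` into a single date and total the visits per park and year. The following is only a minimal sketch on a hand-built frame holding three Abraham Lincoln Birthplace NHP rows from this notebook, not the full CSV; the `Date` column and the `sample`, `parts`, and `yearly` names are illustrative assumptions and do not appear in the original notebook.

```python
import pandas as pd

# Toy stand-in for np_df: three monthly rows copied from the notebook output.
sample = pd.DataFrame({
    "UnitCode": ["ABLI", "ABLI", "ABLI"],
    "Year": [2008, 2008, 2008],
    "Month": [1, 2, 3],
    "RecreationVisits": [5829, 6911, 10720],
})

# Build a first-of-month date from Year and Month (hypothetical "Date" column).
parts = sample[["Year", "Month"]].rename(columns={"Year": "year", "Month": "month"})
sample["Date"] = pd.to_datetime(parts.assign(day=1))

# Total recreation visits per park and year.
yearly = sample.groupby(["UnitCode", "Year"], as_index=False)["RecreationVisits"].sum()
print(yearly)
```

A real date column makes later time-series plots and resampling simpler than carrying separate `Year` and `Month` integers.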


<h4>Geolocation data collection based on each National Park address</h4>
Latitude and longitude were retrieved from the Google Geocoding API for each park address (using an API key), and the geo_locations.csv dataset was updated accordingly. Refer to the attached <b>Lat-Long-Google.ipynb</b> for more details. 
<br><br>
<b>Google API URL:</b> <a>https://maps.googleapis.com/maps/api/geocode/json?address={}&key={}</a>
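
To make the URL template above concrete, here is a minimal sketch of how the two placeholders might be filled and how a response could be read. It is not the code from Lat-Long-Google.ipynb, which is not shown here. The function names, the address string format, and the hand-written payload are assumptions for illustration; the payload only mirrors the `status` / `results[0].geometry.location` layout of a Geocoding API response, and no network request is made.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def build_geocode_url(address, api_key):
    """Return the request URL with the address and key URL-encoded."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})

def parse_lat_lng(payload):
    """Extract (lat, lng) from a decoded response dict, or None if no match."""
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written payload in the expected shape; values taken from geo_locations.csv.
sample_payload = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 44.338556, "lng": -68.273335}}}],
}

# "Acadia NP, ME, USA" is an assumed address format; "YOUR_API_KEY" is a placeholder.
url = build_geocode_url("Acadia NP, ME, USA", "YOUR_API_KEY")
coords = parse_lat_lng(sample_payload)
print(url)
print(coords)
```

Encoding the address with `urlencode` avoids malformed requests when park names contain spaces, commas, or ampersands.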

Read geo-coordinates data from geo_locations.csv.

In [32]:
np_df_loc=pd.read_csv("geo_locations.csv")
np_df_loc= np_df_loc.rename(columns={"State": "State_Abbr"})
np_df_loc = np_df_loc.reset_index(drop=True)  # reset_index returns a new frame, so assign it
print("List of columns available in National Park Locations dataset:")
print(np_df_loc.columns)

print("National Park Locations dataset shape:",np_df_loc.shape)
print("National Park Locations dataset sample:")
print(np_df_loc.head(5))

List of columns available in National Park Locations dataset:
Index(['UnitCode', 'ParkName', 'State_Abbr', 'Country', 'Latitude',
       'Longitude'],
      dtype='object')
National Park Locations dataset shape: (379, 6)
National Park Locations dataset sample:
  UnitCode                        ParkName State_Abbr Country   Latitude  \
0     ABLI  Abraham Lincoln Birthplace NHP         KY     USA  37.531540   
1     ACAD                       Acadia NP         ME     USA  44.338556   
2     ADAM                       Adams NHP         MA     USA  42.239235   
3     AFBG        African Burial Ground NM         NY     USA  40.714537   
4     AGFO            Agate Fossil Beds NM         NE     USA  42.425210   

    Longitude  
0  -85.735254  
1  -68.273335  
2  -71.003528  
3  -74.004467  
4 -103.734240  


Read US state full names and abbreviations dataset.

In [33]:
us_states_abbr_df=pd.read_csv("us_states_abbr.csv")
print("List of columns available in US state abbreviation dataset:")
print(us_states_abbr_df.columns)

print("US state abbreviation dataset shape:",us_states_abbr_df.shape)
print("US state abbreviation dataset sample:",us_states_abbr_df.head(5))

#Trim the state_abbr column values
us_states_abbr_df.State_Abbr = us_states_abbr_df.State_Abbr.str.strip()
us_states_abbr_df=us_states_abbr_df.set_index('State_Abbr')

List of columns available in US state abbreviation dataset:
Index(['State', 'State_Abbr'], dtype='object')
US state abbreviation dataset shape: (50, 2)
US state abbreviation dataset sample:          State State_Abbr
0     Alabama          AL
1      Alaska          AK
2     Arizona          AZ
3    Arkansas          AR
4  California          CA


##### Merge the National Parks and US state abbreviations datasets to include the full state name


In [39]:
# Rename State to State_Abbr so it matches the key of the state lookup table
np_df_actual = np_df.rename(columns={"State": "State_Abbr"})
np_df_actual.State_Abbr = np_df_actual.State_Abbr.str.strip()
# Left merge keeps every park row; State is NaN where no abbreviation matches
np_df_actual = pd.merge(np_df_actual,us_states_abbr_df, how='left', on='State_Abbr')
np_df_actual.State = np_df_actual.State.str.strip()
np_df_actual.head(10)


,Park,UnitCode,ParkType,Region,State_Abbr,Year,Month,RecreationVisits,TentCampers,RVCampers,State
0,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,1,5829,0,0,Kentucky
1,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,2,6911,0,0,Kentucky
2,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,3,10720,0,0,Kentucky
3,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,4,17829,0,0,Kentucky
4,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,5,26295,0,0,Kentucky
5,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,6,33817,0,0,Kentucky
6,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,7,29225,0,0,Kentucky
7,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,8,22166,0,0,Kentucky
8,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,9,21765,0,0,Kentucky
9,Abraham Lincoln Birthplace NHP,ABLI,National Historical Park,Southeast,KY,2008,10,13747,0,0,Kentucky
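
Because the abbreviation table holds only 50 rows, a left merge leaves `State` empty for any park whose abbreviation is not in that table. The sketch below shows one way to list such abbreviations using the `indicator` option of `pd.merge`. It runs on toy frames, not the notebook's data: the `XXXX` unit code and the `DC` row are made-up examples, and whether the real dataset contains unmatched abbreviations is not established here.

```python
import pandas as pd

# Toy stand-ins for the park and state tables; "XXXX"/"DC" is a made-up row
# illustrating an abbreviation that is missing from the lookup table.
parks = pd.DataFrame({
    "UnitCode": ["ABLI", "ACAD", "XXXX"],
    "State_Abbr": ["KY", "ME ", "DC"],  # note the trailing space on "ME "
})
states = pd.DataFrame({"State": ["Kentucky", "Maine"], "State_Abbr": ["KY", "ME"]})

# Strip whitespace before joining, as the notebook does.
parks["State_Abbr"] = parks["State_Abbr"].str.strip()

# indicator=True adds a "_merge" column recording where each row matched.
merged = pd.merge(parks, states, how="left", on="State_Abbr", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "State_Abbr"].unique().tolist()
print(unmatched)
```

Any abbreviations reported this way can then be added to the lookup file or handled explicitly before plotting by state.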


In [40]:
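# Attach coordinates to every park-month row via UnitCode
np_df_loc_short=np_df_loc[["UnitCode","Latitude","Longitude"]]
np_df_final = pd.merge(np_df_actual,np_df_loc_short, how='left', on='UnitCode')
# Bare expressions in the middle of a cell are not displayed, so print them
print(np_df_final.head(10))
print("Final dataset shape:",np_df_final.shape)
# Note: the row index is written as the first, unnamed CSV column
np_df_final.to_csv("nps_base_data.csv")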
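
A left merge on `UnitCode` keeps the row count of the visits table only if each `UnitCode` appears once in the locations table. As a hedged sketch of how that assumption could be checked, the example below uses the `validate` option of `pd.merge` on toy frames; the `visits`, `locations`, and `joined` names and the ACAD visit count are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy stand-ins for the merged visits table and the locations table.
visits = pd.DataFrame({
    "UnitCode": ["ABLI", "ABLI", "ACAD"],
    "RecreationVisits": [5829, 6911, 100],  # the ACAD value is a placeholder
})
locations = pd.DataFrame({
    "UnitCode": ["ABLI", "ACAD"],
    "Latitude": [37.531540, 44.338556],
    "Longitude": [-85.735254, -68.273335],
})

# validate="many_to_one" raises pandas.errors.MergeError if UnitCode is
# duplicated in the right-hand (locations) frame.
joined = pd.merge(visits, locations, how="left", on="UnitCode", validate="many_to_one")

# A left join against unique keys keeps the row count unchanged.
rows_preserved = len(joined) == len(visits)
# Count rows that did not receive coordinates.
missing_coords = int(joined["Latitude"].isna().sum())
print(rows_preserved, missing_coords)
```

If the saved CSV should not carry the row index as an extra column, `to_csv` also accepts `index=False`.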