Zach Tretter

June 2020

# Step 4 - Build Overall Dataframe

### Read in Individual Dataframes
* [Fill Times](#Read-in-Fill-Times)
* [Weather](#Read-in-Weather-Data)
* [Visits](#Read-in-Visits-Data)
* [Air Quality](#Read-in-Air-Quality-Data)
* [Campground Attributes](#Read-in-Campground-Attributes)

### Integrate Dataframes

* Fill Times as foundation of [Final Dataframe](#Instantiate-final-df-as-df_fill)
* Left Join Dataframe with Air Quality on Date (AQI is same for all campsites)
* Left Join Dataframe with Monthly Visits on 'key_year_month' (e.g. YYYY_MMM)
* Left Join Dataframe with Campground Attributes on 'cg_name' (Campground Name)
    * Can now create 'supply' variable
    * Can now create 'date_wxstation'
* Left Join Dataframe with Weather on 'date_wxstation'
* Drop columns no longer needed
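
The joins listed above all follow the same pattern: a column in the growing dataframe is matched against the index of a lookup table. The toy sketch below (made-up rows, with `Made Up Camp` as a deliberately unmatched name) shows what a left join of this kind keeps and what it fills with `NaN`.

```python
import pandas as pd

# Toy stand-ins for the fill-times and campground-attribute tables (illustrative values).
fill = pd.DataFrame({
    "cg_name": ["Apgar", "Apgar", "Made Up Camp"],
    "did_fill": [0, 1, 0],
})
attributes = pd.DataFrame(
    {"sites": [194]},
    index=pd.Index(["Apgar"], name="cg_name"),
)

# A left join keeps every row of `fill`; keys missing from `attributes` get NaN.
joined = pd.merge(fill, attributes, left_on="cg_name", right_index=True, how="left")
print(joined)
```

Because the left table is never filtered, the row count should stay the same after each join as long as the lookup table has unique keys.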

### Export to CSV
* [Export](#Export-Dataframe)

In [1]:
import pandas as pd
import os
import numpy as np

-------------

## Read in Fill Times

In [2]:
# Read in the data
df_fill = pd.read_csv('../data/02_filltimes_clean.csv')

df_fill.drop(columns='Unnamed: 0',inplace=True)

# Convert date to a datetype
df_fill['date'] = pd.to_datetime(df_fill['date'])

df_fill = df_fill.set_index('key_name_date')
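
As a possible alternative to dropping `Unnamed: 0` and converting the date afterwards, both can be handled while reading. This is only a sketch: the CSV text below is hypothetical and assumes the cleaned file was written with the default integer index.

```python
import io
import pandas as pd

# Hypothetical CSV text shaped like a file written by DataFrame.to_csv() with its default index.
csv_text = """,key_name_date,cg_name,date,did_fill
0,2000-05-01_Apga,Apgar,2000-05-01,0
1,2000-05-05_Apga,Apgar,2000-05-05,1
"""

# index_col=0 consumes the unnamed first column, so no 'Unnamed: 0' column is created;
# parse_dates converts 'date' to datetime while reading.
df_example = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["date"])
df_example = df_example.set_index("key_name_date")
print(df_example.dtypes)
```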

#### View Dataframe

In [3]:
df_fill.T

key_name_date,2000-05-01_Apga,2000-05-02_Apga,2000-05-03_Apga,2000-05-04_Apga,2000-05-05_Apga,2000-05-06_Apga,2000-05-07_Apga,2000-05-08_Apga,2000-05-09_Apga,2000-05-10_Apga,...,2019-09-21_TwoM,2019-09-22_TwoM,2019-09-23_TwoM,2019-09-24_TwoM,2019-09-25_TwoM,2019-09-26_TwoM,2019-09-27_TwoM,2019-09-28_TwoM,2019-09-29_TwoM,2019-09-30_TwoM
cg_name,Apgar,Apgar,Apgar,Apgar,Apgar,Apgar,Apgar,Apgar,Apgar,Apgar,...,Two Medicine,Two Medicine,Two Medicine,Two Medicine,Two Medicine,Two Medicine,Two Medicine,Two Medicine,Two Medicine,Two Medicine
date,2000-05-01 00:00:00,2000-05-02 00:00:00,2000-05-03 00:00:00,2000-05-04 00:00:00,2000-05-05 00:00:00,2000-05-06 00:00:00,2000-05-07 00:00:00,2000-05-08 00:00:00,2000-05-09 00:00:00,2000-05-10 00:00:00,...,2019-09-21 00:00:00,2019-09-22 00:00:00,2019-09-23 00:00:00,2019-09-24 00:00:00,2019-09-25 00:00:00,2019-09-26 00:00:00,2019-09-27 00:00:00,2019-09-28 00:00:00,2019-09-29 00:00:00,2019-09-30 00:00:00
did_fill,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fill_time,,,,,9:25am,,,,,,...,,,,,,,,,,
available,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,0,0,0,0
nickname,Apga,Apga,Apga,Apga,Apga,Apga,Apga,Apga,Apga,Apga,...,TwoM,TwoM,TwoM,TwoM,TwoM,TwoM,TwoM,TwoM,TwoM,TwoM
time_24,0,0,0,0,09:25,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hours_after_midnight,0,0,0,0,9.42,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
year,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000,...,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019
month_num,5,5,5,5,5,5,5,5,5,5,...,9,9,9,9,9,9,9,9,9,9


-------------

## Read in Weather Data

In [4]:
df_weather = pd.read_csv('../data/03a_weather_clean.csv')
df_weather.drop(columns=["Unnamed: 0"],inplace=True)

df_weather = df_weather.set_index('date_wxstation')

#### Drop Irrelevant Columns

In [5]:
df_weather.drop(columns = [
    'LATITUDE',
    'LONGITUDE',
    'wx_station'
],inplace=True)

#### View Dataframe

In [6]:
df_weather.T

date_wxstation,2000-05-01_east_glac,2000-05-01_many_glac,2000-05-01_st_mary,2000-05-01_west_glac,2000-05-02_east_glac,2000-05-02_many_glac,2000-05-02_st_mary,2000-05-02_west_glac,2000-05-03_east_glac,2000-05-03_many_glac,...,2019-09-28_st_mary,2019-09-28_west_glac,2019-09-29_east_glac,2019-09-29_many_glac,2019-09-29_st_mary,2019-09-29_west_glac,2019-09-30_east_glac,2019-09-30_many_glac,2019-09-30_st_mary,2019-09-30_west_glac
date,2000-05-01,2000-05-01,2000-05-01,2000-05-01,2000-05-02,2000-05-02,2000-05-02,2000-05-02,2000-05-03,2000-05-03,...,2019-09-28,2019-09-28,2019-09-29,2019-09-29,2019-09-29,2019-09-29,2019-09-30,2019-09-30,2019-09-30,2019-09-30
PRCP,0,0,0,0,0,0,0,0.06,0,0.7,...,0.8,0.12,0.95,0.4,0.4,0.02,0.94,0.1,0.1,0.08
SNOW,0,0,0,0,0,0,0,0,0,0,...,15,0,19,19,19,0,7,7,7,0
TMAX,70,63,70,72,53,49,53,71,58,54,...,27,51,23,29,26,41,24,40,33,40
TMIN,39,37,39,31,39,37,39,40,41,40,...,22,33,21,26,23,32,20,18,12,30
did_PRCP,0,0,0,0,0,0,0,1,0,1,...,1,1,1,1,1,1,1,1,1,1
did_SNOW,0,0,0,0,0,0,0,0,0,0,...,1,0,1,1,1,0,1,1,1,0


-------------

## Read in Visits Data

In [7]:
df_visits = pd.read_csv('../data/03b_monthly_visits_clean.csv')
df_visits.set_index('key_year_month',inplace=True)
df_visits = pd.DataFrame(df_visits['visits'].astype(int))

#### View Dataframe

In [8]:
df_visits

| key_year_month | visits |
|---|---|
| 2019_Jan | 13581 |
| 2018_Jan | 12222 |
| 2017_Jan | 14690 |
| 2016_Jan | 15674 |
| 2015_Jan | 12087 |
| ... | ... |
| 2004_Dec | 10174 |
| 2003_Dec | 10073 |
| 2002_Dec | 8334 |
| 2001_Dec | 3387 |


----------

## Read in Air Quality Data

In [9]:
df_aqi = pd.read_csv('../data/03c_air_quality_clean.csv')

#### Convert date to datetime and set date as index

In [10]:
df_aqi['date'] = pd.to_datetime(df_aqi['date'])
df_aqi.set_index('date',inplace=True)
df_aqi.sort_index(inplace=True)

#### Drop Irrelevant Columns

In [11]:
df_aqi.drop(columns=['Main Pollutant','date_aqi'],inplace=True)

#### View Dataframe

In [12]:
df_aqi

| date | aqi | ozone | PM10 | PM25 |
|---|---|---|---|---|
| 2000-01-01 | 60 | 28 | 16 | 60 |
| 2000-01-02 | 29 | 29 | 12 | 53 |
| 2000-01-03 | 40 | 28 | 9 | 46 |
| 2000-01-04 | 40 | 30 | 8 | 40 |
| 2000-01-05 | 33 | 33 | 7 | 7 |
| ... | ... | ... | ... | ... |
| 2019-12-27 | 29 | 25 | 20 | 29 |
| 2019-12-28 | 37 | 20 | 17 | 37 |
| 2019-12-29 | 43 | 23 | 13 | 43 |
| 2019-12-30 | 38 | 19 | 17 | 38 |


## Read in Campground Attributes

In [13]:
df_cg_attributes = pd.read_csv('../data/03d_campground_attributes_clean.csv',
                              index_col = 'cg_name')

#### View Dataframe

In [14]:
df_cg_attributes

| cg_name | fee | sites | flush_toilets | showers | disposal_station | reservations | rv | primitive | isolated | nearest_wx_station |
|---|---|---|---|---|---|---|---|---|---|---|
| Apgar | 20 | 194 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | west_glac |
| Avalanche | 20 | 87 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | west_glac |
| Bowman Lake | 15 | 46 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | west_glac |
| Cut Bank | 10 | 14 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | east_glac |
| Fish Creek | 23 | 178 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | west_glac |
| Kintla Lake | 15 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | west_glac |
| Logging Creek | 10 | 7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | west_glac |
| Many Glacier | 23 | 109 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | many_glac |
| Quartz Creek | 10 | 7 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | west_glac |
| Rising Sun | 20 | 84 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | st_mary |


--------


# Integrate Dataframes
[Back-to-top](#Back-to-Top)

#### Instantiate final df as df_fill

In [15]:
df = df_fill.copy()

df = df.sort_values(['date',
                     'cg_name'])

#### Join with AQI on 'date'

In [16]:
df = pd.merge(df,
              df_aqi,
              left_on = 'date',
              right_index = True,
              how = 'left')
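
Since AQI is the same for all campsites on a date, this join should be many-to-one and should not change the number of rows. A hedged sketch of how that could be checked, using toy frames rather than the real data:

```python
import pandas as pd

# Toy rows: two campgrounds on one date, one campground on the next.
left = pd.DataFrame({
    "cg_name": ["Apgar", "Avalanche", "Apgar"],
    "date": pd.to_datetime(["2000-05-01", "2000-05-01", "2000-05-02"]),
})
aqi = pd.DataFrame(
    {"aqi": [60, 29]},
    index=pd.DatetimeIndex(["2000-05-01", "2000-05-02"], name="date"),
)

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand keys
# (here the date index) contain duplicates.
merged = pd.merge(left, aqi, left_on="date", right_index=True, how="left",
                  validate="many_to_one")

# A left join against unique keys should leave the row count unchanged.
assert len(merged) == len(left)
```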

#### Join with Monthly Visits on 'key_year_month'

In [17]:
df = pd.merge(df,
              df_visits,
              left_on = 'key_year_month',
              right_index = True,
              how = 'left')
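
This join relies on a `key_year_month` column already existing in the fill-times data. How that column was built is not shown in this notebook; the sketch below is one assumed way to derive a key in the same `YYYY_MMM` shape from a date.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2000-05-01", "2019-09-30"]))

# %b is the abbreviated month name; its spelling depends on the active locale,
# so this assumes an English (or default C) locale.
key_year_month = dates.dt.strftime("%Y_%b")
print(key_year_month.tolist())
```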

#### Join with Campground Attributes on 'cg_name'

In [18]:
df = pd.merge(df,
              df_cg_attributes,
              left_on = 'cg_name',
              right_index = True,
              how = 'left')

#### Create 'Supply' Variable and Integrate

In [19]:
df['cg_supply'] = df['sites'] * df['available']
df_supply = pd.DataFrame(df.groupby('date')['cg_supply'].sum())

df = pd.merge(df,
             df_supply,
             left_on = 'date',
             right_index = True,
             how = 'left')

In [20]:
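# After the merge, 'cg_supply_x' is this campground's own open sites and
# 'cg_supply_y' is the park-wide total for the date. Subtracting the campground's
# own supply leaves the open sites at all other campgrounds.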
df['cg_supply_y'] = df['cg_supply_y'] - df['sites'] * df['available']
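
The two cells above work because both frames in the merge carry a `cg_supply` column, so pandas suffixes them as `cg_supply_x` (per campground) and `cg_supply_y` (per-date total). The same quantity can be computed without a merge; the sketch below uses toy rows and a hypothetical column name `supply_elsewhere`.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2000-05-01"] * 3 + ["2000-05-02"] * 3),
    "cg_name": ["Apgar", "Avalanche", "Bowman Lake"] * 2,
    "sites": [194, 87, 46] * 2,
    "available": [1, 0, 0, 1, 1, 1],
})

# Open sites at each campground on each date.
toy["cg_supply"] = toy["sites"] * toy["available"]

# Park-wide open sites per date, broadcast back onto every row of that date.
park_total = toy.groupby("date")["cg_supply"].transform("sum")

# Open sites at all *other* campgrounds on the same date.
toy["supply_elsewhere"] = park_total - toy["cg_supply"]
print(toy)
```

On 2000-05-01 only Apgar is open, so its `supply_elsewhere` is 0 while the two closed campgrounds each see 194.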

#### Create 'date_wxstation' column to enable integration with weather dataframe

In [21]:
df['date_wxstation'] = df['date'].astype(str) + '_' + df['nearest_wx_station']
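
The key has to match the weather index exactly (e.g. `2000-05-01_west_glac`). `astype(str)` depends on how pandas renders the timestamps, so a slightly more explicit variant, sketched here on toy rows, pins the date format with `strftime`.

```python
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2000-05-01", "2019-09-30"]),
    "nearest_wx_station": ["west_glac", "st_mary"],
})

# Format the date explicitly as YYYY-MM-DD before appending the station name.
toy["date_wxstation"] = (
    toy["date"].dt.strftime("%Y-%m-%d") + "_" + toy["nearest_wx_station"]
)
print(toy["date_wxstation"].tolist())
```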

#### Join with Weather on 'date_wxstation'

In [22]:
df = pd.merge(df,
             df_weather,
             left_on = 'date_wxstation',
             right_index = True,
             how = 'left')
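
Both frames contain a `date` column at this point, which is why the result carries `date_x` and `date_y`. A small sketch on toy frames shows the collision and one optional way to avoid it, by dropping the duplicate column from the weather side before merging.

```python
import pandas as pd

left = pd.DataFrame({
    "date": pd.to_datetime(["2000-05-01"]),
    "date_wxstation": ["2000-05-01_west_glac"],
})
weather = pd.DataFrame(
    {"date": ["2000-05-01"], "TMAX": [72]},
    index=pd.Index(["2000-05-01_west_glac"], name="date_wxstation"),
)

# Overlapping column names get the default '_x' / '_y' suffixes.
with_suffixes = pd.merge(left, weather, left_on="date_wxstation",
                         right_index=True, how="left")

# Dropping the redundant column first keeps a single, unsuffixed 'date'.
without_dup = pd.merge(left, weather.drop(columns="date"),
                       left_on="date_wxstation", right_index=True, how="left")

print(list(with_suffixes.columns))
print(list(without_dup.columns))
```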

#### Drop Columns not of Interest

In [23]:
df = df.drop(columns = [
    
    # Fill Dates
    'fill_time',
    'nickname',
    'key_year_month',
    
    
    # Campground Attributes
    'nearest_wx_station',
    'cg_supply_x',
    
    # From Weather Data
    'date_wxstation',
    'date_y',
    
])

#### Examine Dataframe

In [24]:
df.T

key_name_date,2000-05-01_Apga,2000-05-01_Aval,2000-05-01_BoLa,2000-05-01_CuBa,2000-05-01_FiCr,2000-05-01_KiLa,2000-05-01_LoCr,2000-05-01_MaGl,2000-05-01_QuCr,2000-05-01_RiSu,...,2019-09-30_CuBa,2019-09-30_FiCr,2019-09-30_KiLa,2019-09-30_LoCr,2019-09-30_MaGl,2019-09-30_QuCr,2019-09-30_RiSu,2019-09-30_SpCr,2019-09-30_StMa,2019-09-30_TwoM
cg_name,Apgar,Avalanche,Bowman Lake,Cut Bank,Fish Creek,Kintla Lake,Logging Creek,Many Glacier,Quartz Creek,Rising Sun,...,Cut Bank,Fish Creek,Kintla Lake,Logging Creek,Many Glacier,Quartz Creek,Rising Sun,Sprague Creek,St. Mary,Two Medicine
date_x,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,2000-05-01 00:00:00,...,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00,2019-09-30 00:00:00
did_fill,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
available,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
time_24,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
hours_after_midnight,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
year,2000,2000,2000,2000,2000,2000,2000,2000,2000,2000,...,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019
month_num,5,5,5,5,5,5,5,5,5,5,...,9,9,9,9,9,9,9,9,9,9
month_text,May,May,May,May,May,May,May,May,May,May,...,Sep,Sep,Sep,Sep,Sep,Sep,Sep,Sep,Sep,Sep
day_of_year,122,122,122,122,122,122,122,122,122,122,...,273,273,273,273,273,273,273,273,273,273


# Export Dataframe

[Back-to-top](#Back-to-Top)

In [25]:
df.to_csv('../data/04_Full_DataFrame.csv')
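
When the exported file is read back in a later step, the index and date column need to be restored explicitly. A minimal round-trip sketch, using an in-memory buffer and a one-row toy frame instead of the real file:

```python
import io
import pandas as pd

final = pd.DataFrame(
    {"cg_name": ["Apgar"], "date_x": pd.to_datetime(["2000-05-01"]), "did_fill": [0]},
    index=pd.Index(["2000-05-01_Apga"], name="key_name_date"),
)

# Write to an in-memory buffer; the named index becomes the first CSV column.
buffer = io.StringIO()
final.to_csv(buffer)
buffer.seek(0)

# Restore the index by name and parse the date column again.
restored = pd.read_csv(buffer, index_col="key_name_date", parse_dates=["date_x"])
print(restored)
```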