## Daily Cumulative Applications Year-to-Date

!!!! TODO: The backfill needs adjusting. We need to remove the early applications from Phen and Hawkins. How should this be done? Replace each one individually every time?

In [1]:
import pandas as pd
import datetime as dt
import os

Load data

In [2]:
file_name = '10_20_2020_7_27_39.csv'
file_path = f'C:\\Users\\avery\\Dropbox\\Data Analysis\\Databases\\Camper_Tracking_2021\\{file_name}'
full_df = pd.read_csv(file_path, index_col='PersonID')

Process the original data.

In [3]:
# Convert the application date column to datetime.
full_df['Application Date'] = pd.to_datetime(full_df['Application Date'])

# Fill empty enrolled sessions with applied sessions
full_df['Enrolled Sessions'] = full_df['Enrolled Sessions'].fillna(full_df['Applied Sessions'])

# Select subset to get only data we care about.
data_list = ['Full Name', 'Application Date', 'Enrolled Sessions', 'Gender']
df_sorted = full_df[data_list]

# Sort by application date
df_sorted = df_sorted.sort_values('Application Date', ascending=True)

Move any application dates earlier than 9/15/2020 up to 9/15/2020.

In [4]:
df_sorted.loc[df_sorted['Application Date'] < '2020-09-15', 'Application Date'] = '2020-09-15 00:00:00'

In [5]:
print(df_sorted.head())

               Full Name     Application Date               Enrolled Sessions  \
PersonID                                                                        
12457499    Hawkins Mead  2020-09-15 00:00:00                       Session 2   
10363284    Stephen Mead  2020-09-15 00:00:00                       Session 3   
12473840   Cecelia Abney  2020-09-15 00:00:00                       Session 3   
3565475   Charles Miller  2020-09-15 00:00:00  Leadership in Training (LIT) 1   
11097303      Sarah Weil  2020-09-15 00:00:00                       Session 2   

          Gender  
PersonID          
12457499    Male  
10363284    Male  
12473840  Female  
3565475     Male  
11097303  Female  


Reconvert the application date to datetime; otherwise the join later will fail.

In [6]:
df_sorted['Application Date'] = pd.to_datetime(df_sorted['Application Date'])
print(df_sorted.head())

               Full Name Application Date               Enrolled Sessions  \
PersonID                                                                    
12457499    Hawkins Mead       2020-09-15                       Session 2   
10363284    Stephen Mead       2020-09-15                       Session 3   
12473840   Cecelia Abney       2020-09-15                       Session 3   
3565475   Charles Miller       2020-09-15  Leadership in Training (LIT) 1   
11097303      Sarah Weil       2020-09-15                       Session 2   

          Gender  
PersonID          
12457499    Male  
10363284    Male  
12473840  Female  
3565475     Male  
11097303  Female  
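The clamp-then-reconvert steps above can be sketched in one step by assigning a `pd.Timestamp` instead of a string, which should leave the column as a datetime type. This is a minimal sketch on made-up data: `toy_df`, the camper names, and `season_open` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy data standing in for df_sorted; names and dates are made up.
toy_df = pd.DataFrame(
    {
        "Full Name": ["Camper A", "Camper B", "Camper C"],
        "Application Date": pd.to_datetime(
            ["2020-08-20", "2020-09-15", "2020-10-01"]
        ),
    }
)

# Hypothetical cutoff: dates before this are moved up to it.
season_open = pd.Timestamp("2020-09-15")

# Assign a Timestamp (not a string) so the column stays a datetime type.
early = toy_df["Application Date"] < season_open
toy_df.loc[early, "Application Date"] = season_open

print(toy_df)
print(toy_df["Application Date"].dtype)
```

Only the first row is changed here; the other two dates are already on or after the cutoff.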


## Remap Sessions
Explicitly remap expedition sessions

In [7]:
exp_remap_dict = {'Western Expedition': 'EXP',
                  'Blue Ridge 1': 'EXP',
                  'Blue Ridge 2': 'EXP',
                  'Blue Ridge 3': 'EXP',
                  'Outer Banks 1': 'EXP',
                  'Outer Banks 2': 'EXP',
                  'Leadership in Training (LIT) 1': 'EXP',
                  'Leadership in Training (LIT) 2': 'EXP',
                  'Mountain Biking 2- Base Camp Expedition': 'EXP',
                  'Mountain Biking 1- Base Camp Expedition and Session 2': 'Session 2',
                  'Session 2 and Rock Climbing 1- Base Camp Expedition': 'Session 2',
                  'Session 3 and Circumnavigate GRP 2- Base Camp Expedition': 'Session 3',
                  'Session 2 and Session 4': 'Session 4'}

df_remapped = df_sorted.replace(exp_remap_dict)
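`DataFrame.replace` with a flat dictionary looks for matches in every column. If the remap should only ever touch the session column, a nested dictionary keyed by column name is one way to scope it. The sketch below uses made-up rows and a shortened, hypothetical `toy_remap`; it is not the notebook's actual data.

```python
import pandas as pd

# Toy rows standing in for df_sorted; names are made up.
toy_df = pd.DataFrame(
    {
        "Full Name": ["Camper A", "Camper B", "Camper C"],
        "Enrolled Sessions": [
            "Blue Ridge 1",
            "Session 2",
            "Session 2 and Session 4",
        ],
    }
)

# Shortened, hypothetical version of exp_remap_dict.
toy_remap = {"Blue Ridge 1": "EXP", "Session 2 and Session 4": "Session 4"}

# Nested dict: only values in the 'Enrolled Sessions' column are replaced.
toy_remapped = toy_df.replace({"Enrolled Sessions": toy_remap})

print(toy_remapped)
```

Matching is on the whole cell value, so a plain `Session 2` is left alone even though it is a prefix of one of the keys.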

## Pivot to count apps per date

In [8]:
pivot_count_df = pd.pivot_table(df_remapped, 
                                values='Full Name', 
                                index=['Application Date'], 
                                columns=['Enrolled Sessions'], 
                                aggfunc='count', 
                                fill_value=0)

Calculate the cumulative sum of the pivot table rows down each column. (The dates stay as datetimes here; they are not converted back to strings before joining.)

In [9]:
pivot_cum_df = pivot_count_df.cumsum(axis=0)
pivot_cum_df.dtypes

Enrolled Sessions
EXP          int64
Session 1    int64
Session 2    int64
Session 3    int64
Session 4    int64
Session 5    int64
Session 6    int64
dtype: object
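To make the count-then-accumulate steps concrete, here is a small sketch on four made-up applications. The frame `toy_df` and its values are hypothetical; only the `pivot_table` and `cumsum` calls mirror the cells above.

```python
import pandas as pd

# Four made-up applications across two dates and two sessions.
toy_df = pd.DataFrame(
    {
        "Full Name": ["A", "B", "C", "D"],
        "Application Date": pd.to_datetime(
            ["2020-09-15", "2020-09-15", "2020-09-17", "2020-09-17"]
        ),
        "Enrolled Sessions": ["Session 1", "Session 2", "Session 1", "Session 1"],
    }
)

# Count applications per date (rows) and session (columns).
toy_counts = pd.pivot_table(
    toy_df,
    values="Full Name",
    index="Application Date",
    columns="Enrolled Sessions",
    aggfunc="count",
    fill_value=0,
)

# Running total down each session column.
toy_cum = toy_counts.cumsum(axis=0)

print(toy_counts)
print(toy_cum)
```

Session 1 has one application on 9/15 and two on 9/17, so its cumulative column reads 1 and then 3. Session 2 has one on 9/15 and none after, so it stays at 1.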

## Standardize Dates

Generate the standardized date dataframe.

In [10]:
# Start and end of the tracking year, as strings.
start_string = "2020-08-01"
end_string = "2021-07-31"
    
# Convert strings to datetime type
start_date = dt.datetime.strptime(start_string, '%Y-%m-%d')
end_date = dt.datetime.strptime(end_string, '%Y-%m-%d')
    
# Generate temporary dataframe with time series.
date_series = pd.date_range(start=start_date,end=end_date)
date_df = pd.DataFrame(date_series)
date_df.columns = ['Standard Date']
    
# Preview the standardized date dataframe.
print(date_df.head())

  Standard Date
0    2020-08-01
1    2020-08-02
2    2020-08-03
3    2020-08-04
4    2020-08-05


(If this were another year, we'd test for leap year and drop Feb 29. Code is in the historical models.)
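For reference, dropping Feb 29 from a date dataframe could look roughly like the sketch below. This is not the code from the historical models; it is an assumed approach, using a hypothetical 2019-20 range that contains a leap day.

```python
import pandas as pd

# Hypothetical range that spans the leap day 2020-02-29.
leap_dates = pd.DataFrame(
    {"Standard Date": pd.date_range(start="2019-08-01", end="2020-07-31")}
)

# Flag Feb 29; the mask is all False in a range with no leap day.
is_feb_29 = (leap_dates["Standard Date"].dt.month == 2) & (
    leap_dates["Standard Date"].dt.day == 29
)

# Keep everything else and renumber the rows.
no_leap_dates = leap_dates.loc[~is_feb_29].reset_index(drop=True)

print(len(leap_dates), len(no_leap_dates))
```

Because the mask is simply empty when there is no Feb 29, the same lines can run every year without a separate leap-year test.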

## Join with standardized dates
Then backfill NaN values by at most one row.

In [11]:
joined_df = pd.merge_ordered(pivot_cum_df, date_df, left_on='Application Date', right_on='Standard Date', how='right')
# Same one-row backfill as before; fillna(method=...) is deprecated in newer pandas.
joined_df = joined_df.bfill(limit=1)
print(joined_df.head())

   EXP  Session 1  Session 2  Session 3  Session 4  Session 5  Session 6  \
0  NaN        NaN        NaN        NaN        NaN        NaN        NaN   
1  NaN        NaN        NaN        NaN        NaN        NaN        NaN   
2  NaN        NaN        NaN        NaN        NaN        NaN        NaN   
3  NaN        NaN        NaN        NaN        NaN        NaN        NaN   
4  NaN        NaN        NaN        NaN        NaN        NaN        NaN   

  Standard Date  
0    2020-08-01  
1    2020-08-02  
2    2020-08-03  
3    2020-08-04  
4    2020-08-05  
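Since the backfill is flagged for adjustment, one possible alternative for a cumulative series is to carry the last known total forward and treat dates before the first application as zero. This is a sketch under that assumption, on made-up totals; `toy_cum`, `toy_dates`, and `toy_filled` are hypothetical names, and it is not necessarily the behavior wanted in the final export.

```python
import pandas as pd

# Made-up cumulative totals on two application dates.
toy_cum = pd.DataFrame(
    {"Session 1": [1, 3]},
    index=pd.to_datetime(["2020-09-15", "2020-09-17"]),
)

# Hypothetical standardized date range around those dates.
toy_dates = pd.date_range("2020-09-13", "2020-09-18", name="Standard Date")

# Align to every date, carry totals forward, and use 0 before the first application.
toy_filled = toy_cum.reindex(toy_dates).ffill().fillna(0).astype(int)

print(toy_filled)
```

Note that a forward fill also extends the last total across any future dates in the range, which flattens the end of the line in a year-to-date chart. Limiting the range to today's date, or leaving future rows as NaN, would avoid that.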


## Export & Open

In [12]:
joined_df.set_index('Standard Date', inplace=True)

export_file_path = 'C:\\Users\\avery\\Dropbox\\Data Analysis\\Outputs\\2021_Tracking\\Daily_Applications_YTD.csv'
joined_df.to_csv(export_file_path)

In [13]:
os.startfile(export_file_path)