# Updating Camper Database

Our database stores information on campers, households, and camp sessions across many years. We receive new camper applications every day leading up to summer camp. Before we can compare this season's campers with their local COVID-19 data, we need to make sure the database includes recently enrolled campers.

For privacy reasons, we update the database from a CSV file stored on my hard drive and don't post personal details publicly. The file is exported from our data management system, CampMinder, and the raw export needs some processing to fit the normalized database.

In [1]:
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table

Load the CSV with the latest applications and rename its columns to match the database.

In [2]:
column_map = {'PersonID' : 'camper_id', 
              'Application Date': 'application_date', 
              'Enrolled Sessions': 'session_id', 
              'Applied Sessions': 'applied_id',
              'Full Name': 'name', 
              'Birth Date': 'birthdate', 
              'Gender': 'gender', 
              'Attention Notes': 'notes',
              'Primary Childhood ID': 'household_id', 
              'Primary Childhood HomeAddr1': 'street',
              'Primary Childhood HomeCity': 'city', 
              'Primary Childhood HomeState': 'state',
              'Primary Childhood HomeZip': 'zipcode', 
              'Primary Childhood HomeCountry Name': 'country'}
    
origin_df = pd.read_csv('C:\\Users\\avery\\OneDrive\\covid_geo_docs\\latest_campers_for_database.csv')

renamed_df = origin_df.rename(column_map, axis=1)
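
`rename` silently skips any key in the mapping that is not a column in the file, so a changed export would only fail later, at the column selection step. A minimal sketch of an early check follows; `missing_columns`, `sample_map`, and `sample_df` are hypothetical names used for illustration and are not part of the notebook.

```python
import pandas as pd

def missing_columns(df, column_map):
    """Return the source columns named in column_map that df lacks."""
    return sorted(set(column_map) - set(df.columns))

# Hypothetical miniature export, for illustration only
sample_map = {'PersonID': 'camper_id', 'Full Name': 'name', 'Gender': 'gender'}
sample_df = pd.DataFrame({'PersonID': [1, 2], 'Full Name': ['A', 'B']})

missing = missing_columns(sample_df, sample_map)
print(missing)
```

Running the same check on `origin_df` with `column_map` before renaming would surface a changed export right away.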

Split into three dataframes representing the three applicable tables in our database.

In [3]:
campers_df = renamed_df[['camper_id', 'name', 'birthdate', 'gender', 'notes', 'household_id']]

applications_df = renamed_df[['camper_id', 'application_date', 'session_id', 'applied_id']]

households_df = renamed_df[['household_id', 'street', 'city', 'state', 'zipcode', 'country']]

## 1. Prepare 'campers' dataframe
Set camper_id as the index and convert birthdate to datetime.

In [4]:
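# campers_df was selected from renamed_df, so the assignment below would
# trigger a SettingWithCopyWarning; silence it for this notebook
pd.options.mode.chained_assignment = None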

campers_df['birthdate'] = pd.to_datetime(campers_df['birthdate'])

campers_final_df = campers_df.set_index('camper_id')

## 2. Prepare 'households' dataframe

First, set household_id as the index and trim zip codes to their first five digits.

In [5]:
households_indexed = households_df.set_index('household_id')

# Keep only the first five characters of each zip code (drops any ZIP+4 suffix)
households_indexed['zipcode'] = households_indexed['zipcode'].astype('str')
households_indexed['zipcode'] = households_indexed['zipcode'].str[:5]
# Casting to int drops leading zeros ('02134' becomes 2134) and fails on missing values
households_indexed['zipcode'] = households_indexed['zipcode'].astype('int')
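
If leading zeros matter (for example, for New England zip codes), one alternative is to keep the column as text. The sketch below is an illustration under assumptions, not what this notebook does: `normalize_zip` is a hypothetical helper, and it assumes zip codes arrive as text or as integers that have lost their leading zeros.

```python
import pandas as pd

def normalize_zip(zip_series):
    """Return five-character zip strings, leaving missing values as <NA>."""
    cleaned = zip_series.astype('string').str.strip()
    # Drop any ZIP+4 suffix, then restore leading zeros lost upstream
    without_suffix = cleaned.str.replace(r'-.*$', '', regex=True)
    return without_suffix.str.zfill(5).str[:5]

raw = pd.Series(['28601-1234', '2134', None, '37350'])
normalized = normalize_zip(raw)
print(normalized)
```

Missing values stay missing instead of raising, so a later presence check can still catch them.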

Validate that all zip codes are present.

In [6]:
if households_indexed['zipcode'].isna().sum() > 0:
    present_text = '!! Missing Zip Codes !!'
    # household_id is the index, so select the offending rows by mask
    print(households_indexed.loc[households_indexed['zipcode'].isna(), ['zipcode']].head(5))
else:
    present_text = 'All zip codes present.'

print(present_text)

All zip codes present.


Validate that zipcodes have the right number of digits.

In [7]:
digit_text = 'All codes 5 digits.'

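# Iterate with the household_id index so offending rows can be looked up
for household_id, value in households_indexed['zipcode'].items():
    # Zip codes are integers here, so a leading zero is already gone;
    # values below 1000 are treated as malformed
    if value < 1000:
        digit_text = '!! Wrong Digit Present !!'
        print(value)
        print(households_indexed.loc[household_id, :])
print(digit_text)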

All codes 5 digits.


Throw an error if our zipcodes are off.

In [8]:
assert present_text == 'All zip codes present.'
assert digit_text == 'All codes 5 digits.'
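
The presence and digit checks can also be expressed as one vectorized test that returns the offending rows. This is a sketch under the assumption that zip codes are held as text; `invalid_zip_rows` and the sample frame are hypothetical.

```python
import pandas as pd

def invalid_zip_rows(df, column='zipcode'):
    """Return the rows whose zip code is missing or not exactly five digits."""
    as_text = df[column].astype('string')
    is_valid = as_text.str.fullmatch(r'\d{5}').fillna(False).astype(bool)
    return df[~is_valid]

sample = pd.DataFrame(
    {'zipcode': ['28601', '2134', None, '37350']},
    index=pd.Index([101, 102, 103, 104], name='household_id'),
)
bad_rows = invalid_zip_rows(sample)
print(bad_rows)
```

With a helper like this, a single `assert invalid_zip_rows(df).empty` could stand in for the two string comparisons.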

Rename to match later context.

In [9]:
households_final_df = households_indexed

## 3. Prepare 'applications' dataframe

This one requires transforming the session types from the way they appear in our data management system to the way they're stored in our Postgres database.

In [10]:
print(applications_df.head())

   camper_id application_date                      session_id applied_id
0    2424030        9/15/2020              Western Expedition        NaN
1    2587669        9/21/2020                       Session 2        NaN
2    2601673        10/7/2020  Leadership in Training (LIT) 1        NaN
3    2683141        10/1/2020              Western Expedition        NaN
4    2748592        9/15/2020              Western Expedition        NaN


Fill missing session_id with applied_id, then drop the applied_id.

In [11]:
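# Fill gaps in session_id row by row from applied_id
apps_filled = applications_df.copy()
apps_filled['session_id'] = apps_filled['session_id'].fillna(apps_filled['applied_id'])
apps_fill_only = apps_filled.drop('applied_id', axis=1)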

print(apps_fill_only.head())

   camper_id application_date                      session_id
0    2424030        9/15/2020              Western Expedition
1    2587669        9/21/2020                       Session 2
2    2601673        10/7/2020  Leadership in Training (LIT) 1
3    2683141        10/1/2020              Western Expedition
4    2748592        9/15/2020              Western Expedition
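
To see how a column-to-column fill behaves, here is a small sketch on made-up data. `Series.fillna` given another Series aligns on the row index, whereas `DataFrame.fillna` given a Series matches that Series' index against column names. The `toy` frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'camper_id': [1, 2, 3],
    'session_id': ['Session 1', None, None],
    'applied_id': [None, 'Session 2', None],
})

# Fill gaps in one column from another column, matching row by row
toy['session_id'] = toy['session_id'].fillna(toy['applied_id'])
print(toy[['camper_id', 'session_id']])
```

Rows where both columns are empty stay missing, so they can still be caught by a later check.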


Adjust index, date type, and string format to match database standards.

In [12]:
apps_session_format = apps_fill_only.set_index('camper_id')

apps_session_format['application_date'] = pd.to_datetime(apps_session_format['application_date'])

apps_session_format['session_id'] = apps_session_format['session_id'].replace(' ', '_', regex=True).str.lower()

apps_session_format['year_id'] = 2021

print(apps_session_format.head())

          application_date                      session_id  year_id
camper_id                                                          
2424030         2020-09-15              western_expedition     2021
2587669         2020-09-21                       session_2     2021
2601673         2020-10-07  leadership_in_training_(lit)_1     2021
2683141         2020-10-01              western_expedition     2021
2748592         2020-09-15              western_expedition     2021


*Side note: We've been adding new sessions this year, and campers can now choose multiple sessions. To refine our database further, we need a list of who's enrolled and which session combinations exist. Let's export the counts per session and move on.*

In [38]:
session_df = pd.DataFrame(apps_session_format.groupby('session_id').count())
session_df.to_csv('C:\\Users\\avery\\OneDrive\\grp_database_docs\\sessions_enrolled_count.csv')

Rename to match later context.

In [13]:
applications_final_df = apps_session_format

## 4. Insert into database

Connect to the local Postgres database. Remember, the dataframes we want to upload are:

campers_final_df = 'campers' 

applications_final_df = 'applications'

households_final_df = 'households'

In [14]:
password = '**********'

engine = create_engine(f'postgresql://postgres:{password}@localhost:5432/grp_data')

metadata = MetaData()

connection = engine.connect()
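
One way to keep the password out of the notebook is to read it from an environment variable and escape it before building the URL. This is a sketch only: `GRP_DB_PASSWORD` and `build_postgres_url` are hypothetical names, and the other connection details are copied from the cell above.

```python
import os
from urllib.parse import quote_plus

def build_postgres_url(user, password, host, port, database):
    """Assemble a Postgres connection URL, escaping special characters in the password."""
    return f'postgresql://{user}:{quote_plus(password)}@{host}:{port}/{database}'

# GRP_DB_PASSWORD is a hypothetical variable name; set it outside the notebook
password = os.environ.get('GRP_DB_PASSWORD', '')
url = build_postgres_url('postgres', password, 'localhost', 5432, 'grp_data')
```

The resulting string could then be passed to `create_engine(url)` in place of the f-string.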

Before we update, let's remove rows from our dataframes that are already present in the database tables. The database won't let us upload duplicate values for primary keys.

We query the existing database for the keys in each table as they stand before the update. We then compare them with our dataframes and upload only the rows that are new.

In [39]:
camper_stmt = 'SELECT camper_id FROM campers'
campers_db = pd.read_sql(camper_stmt, con=connection, columns=['camper_id'])
print(campers_db.head())
print(campers_db.dtypes)

   camper_id
0    2421760
1    2421807
2    2421813
3    2421814
4    2421838
camper_id    int64
dtype: object


Let's replicate a SQL-style anti-join in pandas, keeping only the campers that aren't already in the database.

In [40]:
join_campers = campers_df['camper_id'].isin(campers_db['camper_id'])

anti_join_campers = campers_df[~join_campers].set_index('camper_id')

print(anti_join_campers.dtypes)

name                    object
birthdate       datetime64[ns]
gender                  object
notes                   object
household_id             int64
dtype: object


Use an anti-join with the applications dataframe as well. Its primary key is the combination of camper_id and year_id. (Campers return for multiple years, so we use both columns as the key to identify unique applications.)

We haven't uploaded any applications for this year yet, so there are no duplicates to remove in this update. The code below is kept for future updates.

apps_stmt = 'SELECT camper_id, year_id FROM applications'
apps_db = pd.read_sql(apps_stmt, con=connection, columns=['camper_id', 'year_id'])
print(apps_db.head())
print(apps_db.dtypes)

# Left-merge on the composite key; '_merge' marks rows found only in the new data
apps_merged = applications_final_df.reset_index().merge(
    apps_db, on=['camper_id', 'year_id'], how='left', indicator=True)

anti_join_apps = (apps_merged[apps_merged['_merge'] == 'left_only']
                  .drop(columns='_merge')
                  .set_index('camper_id'))

print(anti_join_apps.head())
print(anti_join_apps.dtypes)
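
For a two-column key, the comparison has to be made on whole (camper_id, year_id) pairs rather than on each column separately. One way to do that while keeping the `isin` pattern used for campers is to build a `MultiIndex` from the key columns. The frames below are made-up illustrations.

```python
import pandas as pd

keys = ['camper_id', 'year_id']

new_apps = pd.DataFrame({
    'camper_id': [1, 2, 3],
    'year_id': [2021, 2021, 2021],
    'session_id': ['session_1', 'session_2', 'western_expedition'],
})
existing = pd.DataFrame({'camper_id': [1, 2], 'year_id': [2020, 2021]})

# Compare whole (camper_id, year_id) pairs rather than each column separately
new_pairs = pd.MultiIndex.from_frame(new_apps[keys])
existing_pairs = pd.MultiIndex.from_frame(existing[keys])
already_stored = new_pairs.isin(existing_pairs)

to_upload = new_apps[~already_stored]
print(to_upload)
```

Camper 1 is kept because only the 2020 application is stored, while camper 2's 2021 application is filtered out.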

Do a final anti-join with households.

In [41]:
house_stmt = 'SELECT household_id FROM households'
house_db = pd.read_sql(house_stmt, con=connection, columns=['household_id'])
print(house_db.head())
print(house_db.dtypes)

   household_id
0        929676
1       4377270
2       2466708
3        870609
4        657635
household_id    int64
dtype: object


In [42]:
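# Note: households_df still holds the raw zip codes; the trimmed values are in households_final_df
join_house = households_df['household_id'].isin(house_db['household_id'])

# Siblings share a household, so keep one row per household_id before indexing
anti_join_house = households_df[~join_house].drop_duplicates('household_id').set_index('household_id')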

print(anti_join_house.dtypes)

street     object
city       object
state      object
zipcode    object
country    object
dtype: object


Insert our dataframes into their respective tables and confirm results.

In [43]:
anti_join_house.to_sql('households', con=connection, if_exists='append')

anti_join_campers.to_sql('campers', con=connection, if_exists='append')

applications_final_df.to_sql('applications', con=connection, if_exists='append')
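
The filter-then-append pattern can be tried end to end without touching the real database. The sketch below uses an in-memory SQLite database as a stand-in for Postgres, with a made-up two-column `campers` table. It illustrates the pattern only and does not reproduce our schema.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE campers (camper_id INTEGER PRIMARY KEY, name TEXT)')
conn.executemany('INSERT INTO campers VALUES (?, ?)', [(1, 'Ada'), (2, 'Ben')])

incoming = pd.DataFrame({'camper_id': [2, 3], 'name': ['Ben', 'Cy']})

# Keep only rows whose key is not stored yet, then append them
stored_ids = pd.read_sql('SELECT camper_id FROM campers', con=conn)
new_rows = incoming[~incoming['camper_id'].isin(stored_ids['camper_id'])]
new_rows.to_sql('campers', con=conn, if_exists='append', index=False)

row_count = conn.execute('SELECT COUNT(*) FROM campers').fetchone()[0]
print(row_count)
conn.close()
```

Only camper 3 is appended, so the table ends with three rows and the primary key constraint is never violated.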

Build a query to validate the update.

In [48]:
stmt = """
    SELECT c.gender, h.zipcode, a.application_date
    FROM campers AS c
    JOIN households AS h USING(household_id)
    JOIN applications AS a USING(camper_id)
    WHERE a.application_date > '2020-09-01'
    LIMIT 10;
"""

result = pd.read_sql(stmt, con=connection)

print(result.head())

   gender zipcode application_date
0  Female   28601       2020-09-15
1    Male   37350       2020-09-21
2    Male   32746       2020-10-07
3  Female   29464       2020-10-01
4    Male   27608       2020-09-15


Success! Our database is now updated.

In [49]:
connection.close()