## Members and HouseholdStructure

1. This notebook reads CSV files.
2. It does not need the complete bcpp environment.


### Problem
Filtering members on `bcpp-year-1` (that is, `survey_schedule='bcpp-year-1'`) returns too many members (expected 20174).

How was the original member dataset for bcpp-year-1 filtered?

- households = 11582
- members = 20174

In [186]:
expected_members = 20174

In [187]:
import os  # imported here because this cell runs before the main imports cell

path = os.path.expanduser('~/Documents/bcpp')
member_csv = 'members_20171016.csv'
household_structure_csv = 'householdstructure_20171016.csv'

In [188]:
import numpy as np
import pandas as pd
import os

from arrow import Arrow
from datetime import datetime
from django.db import connection
from edc_constants.constants import YES, NO, NEG, UNK
from pprint import pprint

In [189]:
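def survey_year(row):
    """Return the survey year part of `survey_schedule`, e.g. 'bcpp-year-1'."""
    value = row['survey_schedule']
    if pd.notnull(value):
        # The year is the second dot-separated part of survey_schedule.
        return value.split('.')[1]
    return value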

In [190]:
df_member = pd.read_csv(os.path.join(path, member_csv))
df_household_structure = pd.read_csv(os.path.join(path, household_structure_csv))

In [191]:
# df_member.head()

In [192]:
# df_household_structure.head()

Create a new column `survey_year` on both dataframes by splitting `survey_schedule`.

In [193]:
# print(list(df_member.columns))
df_household_structure['survey_year'] = df_household_structure.apply(
    lambda row: row['survey_schedule'].split('.')[1], axis=1)
df_member['survey_year'] = df_member.apply(
    lambda row: row['survey_schedule'].split('.')[1], axis=1)
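
The row-wise `apply` above can also be written with pandas string methods, which split the column once and leave missing values as missing instead of raising. The following is a minimal sketch on toy data; the three-part dotted values are assumed for illustration and are not taken from the CSV files.

```python
import pandas as pd

# Toy data: the 'group.year.community' layout is an assumption for illustration.
toy = pd.DataFrame({
    'survey_schedule': [
        'bcpp-survey.bcpp-year-1.community_a',
        'bcpp-survey.bcpp-year-2.community_b',
        None,
    ]
})

# Split once into positional columns 0, 1, 2; a missing schedule yields missing parts.
parts = toy['survey_schedule'].str.split('.', expand=True)
toy['survey_year'] = parts[1]
toy['survey_community'] = parts[2]

print(toy[['survey_year', 'survey_community']])
```

Splitting once with `expand=True` also means `survey_year` and `survey_community` come from the same pass over the column.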

Filtering on `bcpp-year-1` members (`survey_year='bcpp-year-1'`) returns too many members.

In [194]:
count = len(df_member[df_member['survey_year'] == 'bcpp-year-1'])
print(f"{count} == {expected_members} is {count == expected_members}.")

32113 == 20174 is False.
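
When a count is higher than expected, a breakdown by candidate filter columns can hint at which rows the original dataset left out. This is a sketch on toy data, not the notebook's result; the column values are made up for illustration.

```python
import pandas as pd

# Toy data standing in for df_member; values are invented for illustration.
toy_member = pd.DataFrame({
    'survey_year': ['bcpp-year-1', 'bcpp-year-1', 'bcpp-year-2', 'bcpp-year-3'],
    'present_today': ['Yes', None, 'No', 'Yes'],
})

# Row counts per survey year, keeping missing values visible.
per_year = toy_member['survey_year'].value_counts(dropna=False)
print(per_year)

# Cross-tabulate against a second column to see which values make up each year.
breakdown = pd.crosstab(
    toy_member['survey_year'],
    toy_member['present_today'].fillna('(missing)'))
print(breakdown)
```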


Create a new column `survey_community` on both dataframes by splitting `survey_schedule`.

In [195]:
df_household_structure['survey_community'] = df_household_structure.apply(
    lambda row: row['survey_schedule'].split('.')[2], axis=1)
df_member['survey_community'] = df_member.apply(
    lambda row: row['survey_schedule'].split('.')[2], axis=1)

Are the values of `survey_schedule` on household_structure inconsistent with `survey_schedule` on member?

Merge members with household_structure on the foreign key, `household_structure_id` = `id`.

In [196]:
df = pd.merge(
    df_member, df_household_structure,
    left_on='household_structure_id', right_on='id', how='left', suffixes=['_mem', '_hhs'])
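
A left merge like the one above can silently add rows if the right-hand key is not unique, which would itself inflate member counts. As a hedged sketch on toy frames (the names `toy_member` and `toy_hhs` are hypothetical), the `validate` and `indicator` arguments of `pd.merge` can guard against that and flag members without a matching household_structure.

```python
import pandas as pd

# Toy frames; names and values are hypothetical.
toy_member = pd.DataFrame({
    'id': [1, 2, 3],
    'household_structure_id': [10, 10, 11],
})
toy_hhs = pd.DataFrame({
    'id': [10, 11],
    'survey_year': ['bcpp-year-1', 'bcpp-year-1'],
})

merged = pd.merge(
    toy_member, toy_hhs,
    left_on='household_structure_id', right_on='id',
    how='left', suffixes=('_mem', '_hhs'),
    validate='many_to_one',  # raises MergeError if toy_hhs 'id' has duplicates
    indicator=True)          # adds a '_merge' column: 'both' or 'left_only'

# Members with no matching household_structure row.
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))
```

With `validate='many_to_one'` the merged frame keeps one row per member, so row counts before and after the merge should be equal.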

Show that `survey_year` is the same on each member and on its household_structure.

In [197]:
value = len(df[df['survey_year_mem'] != df['survey_year_hhs']]) == 0
if value:
    print(f'{value}. survey_year is consistent between models')
else:
    print(f'{value}. survey_year is NOT consistent between models!')

True. survey_year is consistent between models


In [198]:
# Show whether present_today was possibly used to filter.
# The mask is built from the merged frame `df` and applied to `df_member`;
# this assumes both frames share the same row order and index.
mask = ((df['survey_year_mem'] == 'bcpp-year-1')
        & (pd.isnull(df['cloned_datetime']))
        & (df['present_today'].isin(['Yes', 'No'])))

# df_member[mask].info()

print("Add present_today in ['Yes', 'No'] ...")
df_member[mask].groupby('present_today').size()



Add present_today in ['Yes', 'No'] ...


present_today
No      7256
Yes    20949
dtype: int64

In [200]:
# Build each condition once from the merged frame `df` and reuse it.
# The masks are applied to `df_member`, which assumes both frames share
# the same row order and index.
year_1 = df['survey_year_mem'] == 'bcpp-year-1'
not_cloned = pd.isnull(df['cloned_datetime'])
present_answered = df['present_today'].isin(['Yes', 'No'])
present_missing = pd.isnull(df['present_today'])
has_identifier = df['subject_identifier'].str.startswith('066')

included = df_member[year_1 & not_cloned & present_answered & has_identifier]
excluded = df_member[year_1 & not_cloned & present_missing & has_identifier]

print("Add subject_identifier as a filter (startswith '066') ...")
print(len(included))

print('This many were excluded ...')
print(len(excluded))

print('Show min max created dates ...')
print(included['created'].min())
print(included['created'].max())

Add subject_identifier as a filter (startswith '066') ...
12902
This many were excluded ...
3435
Show min max created dates ...
2013-10-30 16:57:04.000000
2015-11-24 17:43:43.000000
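
The filters tried so far are applied cumulatively, so it can help to print the remaining row count after each step and compare it with the expected total. The following is a minimal sketch on toy data; the frame, its values and the step labels are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the merged member frame; values are invented.
toy = pd.DataFrame({
    'survey_year': ['bcpp-year-1'] * 4 + ['bcpp-year-2'],
    'cloned_datetime': [None, None, None, '2015-01-01', None],
    'present_today': ['Yes', 'No', None, 'Yes', 'Yes'],
    'subject_identifier': ['066-1', '067-1', '066-2', '066-3', '066-4'],
})

# Each step is a label plus a boolean condition over the whole frame.
steps = [
    ('survey_year is bcpp-year-1', toy['survey_year'] == 'bcpp-year-1'),
    ('not cloned', toy['cloned_datetime'].isnull()),
    ('present_today is Yes/No', toy['present_today'].isin(['Yes', 'No'])),
    ("subject_identifier starts with '066'",
     toy['subject_identifier'].str.startswith('066', na=False)),
]

# Apply the conditions cumulatively and record how many rows remain.
mask = pd.Series(True, index=toy.index)
counts = []
for label, condition in steps:
    mask &= condition
    counts.append(int(mask.sum()))
    print(f'{counts[-1]:>6}  after: {label}')
```

Passing `na=False` to `str.startswith` treats a missing `subject_identifier` as not matching, which keeps the combined mask strictly boolean.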


Merge with subject consent