# Unexpected appliance use survey responses

In my 2016-03-02 notebook, I showed that some households reported owning an appliance but did not answer the questions about how many hours they use it. This notebook elaborates on that work.


We want to be sure that we are extrapolating correctly from our
appliance survey to the whole population.  For example, if a null response
for the hours an appliance is used means the household doesn't use that
appliance, then we should count it as zero, and the mean over all
households in the population will be correct.  If, however, the household
owns the appliance and simply didn't answer the hours question, then we
should assign that household the mean of the other responses.
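
To make the two treatments concrete, here is a minimal sketch on made-up numbers. The `hours` values and variable names are illustrative assumptions, not survey data, and the sketch relies on pandas skipping nulls in `mean()` by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy responses: hours of use for five households, two of them null.
hours = pd.Series([6.0, 4.0, np.nan, 2.0, np.nan])

# Treatment A: a null means "does not use the appliance", so count it as zero.
mean_null_as_zero = hours.fillna(0).mean()

# Treatment B: a null means "owns it but did not answer", so impute the mean
# of the observed responses. The overall mean then equals the observed mean.
observed_mean = hours.mean()  # pandas skips nulls by default
mean_null_as_observed = hours.fillna(observed_mean).mean()

print(mean_null_as_zero)      # 12 / 5 = 2.4
print(mean_null_as_observed)  # 12 / 3 = 4.0
```

The gap between the two results shows why the interpretation of a null matters before extrapolating to the population.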

| TV yes/no | TV hours |
| -         | -        |
| 1         | 6        |
| 1         | null     |
| 0         | 6        |
| 0         | null     |

The second row (appliance owned, hours null) is the case we need to
check for, and the third row (appliance not owned, hours given) should not
happen.  The third row shouldn't occur because the survey uses the
`selected()` command in the relevant column, so that only respondents
who own a given appliance are asked the hours question for that appliance.
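
As a rough illustration of the four cases in the table above, the sketch below labels each row of a toy frame. The column names `tv_yn` and `tv_hours` and the case labels are hypothetical and do not match the survey's real column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the four rows of the table above.
toy = pd.DataFrame({
    'tv_yn':    [1, 1, 0, 0],
    'tv_hours': [6.0, np.nan, 6.0, np.nan],
})

owns = toy['tv_yn'] == 1
has_hours = toy['tv_hours'].notnull()

# Start from the "not owned, not asked" case and overwrite the other three.
toy['case'] = 'ok (not asked)'
toy.loc[owns & has_hours, 'case'] = 'ok'
toy.loc[owns & ~has_hours, 'case'] = 'missing hours'
toy.loc[~owns & has_hours, 'case'] = 'should not happen'

print(toy)
```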

One next step is to build a calculation that accounts for this
discrepancy and to compare its result with the uncorrected one.



- Tabulate bad responses in the data
- For each appliance, count records with missing hours despite reported ownership
- Assert that no records have hours without owning the appliance
- Show hourly responses above 24 and weekly responses above 7


In [1]:
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

from drs_sentani import get_survey
import pandas as pd
survey = get_survey()
import pysentani as sti

survey['access_type'] = sti.access_type(survey)



# Unexpected nulls for appliance use

The counts below are records with a null in the hours or the times-per-week response for an appliance, even though the household reported owning that appliance.

In [2]:
# build a boolean mask that is true where the household has the appliance
# and the weekly or hourly response is null,
# then count the matching records and display the count

for app in ['TV', 'fridge', 'radio', 'rice_cooker', 'fan', 'lighting']:
    app_yn = 'app_now/{}'.format(app)
    wk_col = 'app_{}_per_wk'.format(app)
    hr_col = 'app_{}_hrs'.format(app)
    mask = (survey[app_yn]==1) & (survey[wk_col].isnull() | survey[hr_col].isnull())
    print(survey[mask][[app_yn, wk_col, hr_col]].shape[0],
         'records with unexpected nulls for {}'.format(app))

34 records with unexpected nulls for TV
5 records with unexpected nulls for fridge
12 records with unexpected nulls for radio
5 records with unexpected nulls for rice_cooker
3 records with unexpected nulls for fan
25 records with unexpected nulls for lighting


# Unexpected hourly observations for no appliance

The survey instrument doesn't allow households to provide responses to questions about appliance use if they do not own the appliance.  This verifies that there are no records like that.

In [3]:
# notnull doesn't seem to do what I want
for app in ['TV', 'fridge', 'radio', 'rice_cooker', 'fan', 'lighting']:
    app_yn = 'app_now/{}'.format(app)
    wk_col = 'app_{}_per_wk'.format(app)
    hr_col = 'app_{}_hrs'.format(app)

    print(survey[(survey[app_yn]==0) & 
                  (survey[wk_col].notnull() 
                  | survey[hr_col].notnull())][[app_yn, wk_col, hr_col]].shape[0],
         'records with unexpected hours for {}'.format(app))

0 records with unexpected hours for TV
0 records with unexpected hours for fridge
0 records with unexpected hours for radio
0 records with unexpected hours for rice_cooker
0 records with unexpected hours for fan
0 records with unexpected hours for lighting


# Hourly and weekly responses

Below I show the responses that exceed 24 hours of use per day or 7 times per week.

Many of the responses above seven times per week are 24, suggesting that the question was misinterpreted as hours per day.

Some of the responses are inexplicably large and would best be caught by an input check in the electronic survey tool.
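
The kind of range check meant here can be sketched as follows. This is Python only to illustrate the rule; the survey tool would express it in its own constraint syntax. The helper name `check_response` is hypothetical, and the limits of 24 hours per day and 7 times per week are the ones assumed throughout this notebook.

```python
def check_response(value, upper):
    """Hypothetical helper: True if value is present and within [0, upper]."""
    return value is not None and 0 <= value <= upper

# Assumed limits: 24 hours per day and 7 times per week.
print(check_response(24, upper=7))      # False: 24 is not a valid times-per-week answer
print(check_response(6, upper=24))      # True
print(check_response(12000, upper=24))  # False
```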

In [4]:
for app in ['TV', 'radio', 'fridge', 'fan', 'rice_cooker', 'lighting']:
    wk_col = 'app_{}_per_wk'.format(app)
    hr_col = 'app_{}_hrs'.format(app)    
    print('weekly bad responses {}'.format(app), survey[survey[wk_col]>7][wk_col].values)

for app in ['TV', 'radio', 'fridge', 'fan', 'rice_cooker', 'lighting']:
    wk_col = 'app_{}_per_wk'.format(app)
    hr_col = 'app_{}_hrs'.format(app)
    print('hourly bad responses {}'.format(app), survey[survey[hr_col]>24][hr_col].values)

weekly bad responses TV [ 10.  24.  10.   9.  14.]
weekly bad responses radio [ 24.]
weekly bad responses fridge [ 20000.     24.     24.     24.     24.     24.]
weekly bad responses fan []
weekly bad responses rice_cooker [ 24.  24.]
weekly bad responses lighting [  2.00000000e+05   9.00000000e+00   2.40000000e+01]
hourly bad responses TV []
hourly bad responses radio []
hourly bad responses fridge []
hourly bad responses fan []
hourly bad responses rice_cooker []
hourly bad responses lighting [ 12000.     27.]


# Correcting these

We can either cap these at the highest valid value (24 hours per day, 7 times per week) or throw them out by replacing them with nulls.

Here I cap them at the highest valid value and show the resulting change in the means.
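
Before applying this to the survey, the two strategies can be sketched on a toy series. The `per_wk` values are made up. `clip` and `mask` are used here as compact stand-ins that should behave like the `where` calls in this notebook, since both leave existing nulls untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly responses, including one null and two out-of-range values.
per_wk = pd.Series([7.0, 3.0, np.nan, 24.0, 10.0])

# Strategy 1: cap out-of-range values at the highest valid option (7).
capped = per_wk.clip(upper=7)

# Strategy 2: discard out-of-range values by replacing them with nulls.
discarded = per_wk.mask(per_wk > 7)

print(capped.mean())     # (7 + 3 + 7 + 7) / 4 = 6.0
print(discarded.mean())  # (7 + 3) / 2 = 5.0
```

Capping keeps the record in the sample at the maximum value, while discarding shrinks the sample, so the two corrected means differ slightly.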

In [5]:
import numpy as np
survey_copy = survey.copy(deep=True)

# use dictionaries to collect the per-appliance differences for the summary table
hour_differences = {}
week_differences = {}

for app in ['TV', 'radio', 'fridge', 'fan', 'rice_cooker']:
    # generate column labels for each appliance
    wk_col = 'app_{}_per_wk'.format(app)
    hr_col = 'app_{}_hrs'.format(app)
    energy_col = 'daily_{}_energy'.format(app)
    weekly_hr_col = 'weekly_{}_hrs'.format(app)
    
    # look at hourly values before and after
    hr_mean_uncorrected = survey_copy[hr_col].mean()
    survey_copy[hr_col] = survey_copy[hr_col].where(survey_copy[hr_col].isnull() | (survey_copy[hr_col] <= 24), 24)
    hr_mean_corrected = survey_copy[hr_col].mean()
    hour_differences[app] = hr_mean_uncorrected - hr_mean_corrected
    
    # look at weekly values before and after
    wk_mean_uncorrected = survey_copy[wk_col].mean()
    survey_copy[wk_col] = survey_copy[wk_col].where(survey_copy[wk_col].isnull() | (survey_copy[wk_col] <= 7), 7)
    wk_mean_corrected = survey_copy[wk_col].mean()
    week_differences[app] = wk_mean_uncorrected - wk_mean_corrected
    
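    # ensure non-null data is in range
    # (all(series) <= 7 compares a single boolean to 7 and always passes)
    assert (survey_copy[wk_col].dropna() <= 7).all()
    assert (survey_copy[wk_col].dropna() >= 0).all()
    assert (survey_copy[hr_col].dropna() <= 24).all()
    assert (survey_copy[hr_col].dropna() >= 0).all()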

pd.DataFrame({'hourly_mean_differences':hour_differences,
              'weekly_mean_differences':week_differences})

|             | hourly_mean_differences | weekly_mean_differences |
| -           | -                       | -                       |
| TV          | 0                       | 0.039506                |
| fan         | 0                       | 0.0                     |
| fridge      | 0                       | 104.031088              |
| radio       | 0                       | 0.049708                |
| rice_cooker | 0                       | 0.201183                |

The second strategy throws the out-of-range values out by replacing them with nulls.


In [6]:
import numpy as np
survey_copy = survey.copy(deep=True)

# use dictionaries to collect the per-appliance differences for the summary table
hour_differences = {}
week_differences = {}

for app in ['TV', 'radio', 'fridge', 'fan', 'rice_cooker']:
    # generate column labels for each appliance
    wk_col = 'app_{}_per_wk'.format(app)
    hr_col = 'app_{}_hrs'.format(app)
    energy_col = 'daily_{}_energy'.format(app)
    weekly_hr_col = 'weekly_{}_hrs'.format(app)
    
    # look at hourly values before and after
    hr_mean_uncorrected = survey_copy[hr_col].mean()
    survey_copy[hr_col] = survey_copy[hr_col].where(survey_copy[hr_col].isnull() | (survey_copy[hr_col] <= 24), np.nan)
    hr_mean_corrected = survey_copy[hr_col].mean()
    hour_differences[app] = hr_mean_uncorrected - hr_mean_corrected
    
    # look at weekly values before and after
    wk_mean_uncorrected = survey_copy[wk_col].mean()
    survey_copy[wk_col] = survey_copy[wk_col].where(survey_copy[wk_col].isnull() | (survey_copy[wk_col] <= 7), np.nan)
    wk_mean_corrected = survey_copy[wk_col].mean()
    week_differences[app] = wk_mean_uncorrected - wk_mean_corrected
    
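    # ensure non-null data is in range
    # (all(series) <= 7 compares a single boolean to 7 and always passes)
    assert (survey_copy[wk_col].dropna() <= 7).all()
    assert (survey_copy[wk_col].dropna() >= 0).all()
    assert (survey_copy[hr_col].dropna() <= 24).all()
    assert (survey_copy[hr_col].dropna() >= 0).all()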

pd.DataFrame({'hourly_mean_differences':hour_differences,
              'weekly_mean_differences':week_differences})

|             | hourly_mean_differences | weekly_mean_differences |
| -           | -                       | -                       |
| TV          | 0                       | 0.042926                |
| fan         | 0                       | 0.0                     |
| fridge      | 0                       | 104.035244              |
| radio       | 0                       | 0.054201                |
| rice_cooker | 0                       | 0.205152                |


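The two strategies change the means by very similar amounts. The only large change is in the fridge weekly mean (about 104 in both cases), which is consistent with the single 20000 response dominating the uncorrected mean.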