In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.api as sm
from statsmodels.api import OLS
import warnings
warnings.filterwarnings('ignore')

# Lab 3: Wrangling Survey Data

This lab uses a real-world dataset from the Rural Water Project Spring Protection Study, which Innovations for Poverty Action collected in 2008. Professor Van Dusen was working as a Research Manager on the project at the time. The data come from follow-up work on the main study that used high-frequency monitoring for disease surveillance and studied the use of household chlorination products.

Part of this study was included in a journal article in the Proceedings of the National Academy of Sciences (PNAS) called _Being Surveyed Can Change Later Behavior and Related Parameter Estimates_. The paper can be found [here](https://www.pnas.org/doi/epdf/10.1073/pnas.1000776108). A summary of the work related to this notebook is contained in this [unpublished draft](https://berkeley.box.com/s/epj41ofw9bph99dim92dsa69gj17puul).

This project actually has two datasets from repeated visits to the same households: a household-level dataset and a child-level dataset (where there can be multiple children within a given household).

Before embarking on the lab, make sure you understand the overall structure of the [survey](https://berkeley.app.box.com/s/2thnb8dan58o7w44mu9tpk1zy8l4c0no) - it will be very helpful.

**Learning Objectives:**
- Perform an EDA of the Main Rural Project Spring Protection Study, a real-world dev-econ project
- Understand a complex survey
- Understand the Hawthorne effect and find empirical evidence of it
- Analyze survey data using the Pandas tools we have learned so far

---
## Phase 1: Exploratory Data Analysis (EDA)

Let's start off by loading in the dataset. Like many datasets from academia, this one is a Stata file, denoted by the `.dta` extension.

**Question 1.1:** Load in the dataset `BWM_child_EVDvars.dta` and read it into a Pandas dataframe from its Stata format. Name it `wg_df`.

Hint: Check out this [Pandas documentation](https://pandas.pydata.org/docs/user_guide/io.html) to see how to read in different types of files. 
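
As a rough illustration of what reading a Stata file looks like, the sketch below writes a tiny, invented table to a temporary `.dta` file and reads it back with `pd.read_stata`. The values are made up and the file name `toy.dta` is hypothetical; only the column names are borrowed from this lab.

```python
import os
import tempfile

import pandas as pd

# Invented toy table; only the column names are borrowed from the lab.
toy = pd.DataFrame({
    "a1_hh_id": [101, 101, 102],
    "bwm_round": [1, 2, 1],
    "d6a1_7dd_n": [0.0, 1.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.dta")  # hypothetical file name
    # write_index=False keeps the pandas index out of the Stata file
    toy.to_stata(path, write_index=False)
    # read_stata returns an ordinary DataFrame
    toy_loaded = pd.read_stata(path)

print(toy_loaded.shape)
print(toy_loaded.dtypes)
```

Stata stores integers in narrower types than pandas' default, so the dtypes printed after the round trip may differ from the ones you started with even though the values are unchanged.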

In [None]:
wg_df = ...
wg_df.head()

In [None]:
grader.check("q1_1")

**Question 1.2:** Using the `shape` attribute, find the structure of the `wg_df` dataframe. How many rows and columns are there in the dataframe? Assign the values to the corresponding variables `N_rows` and `N_cols`. 

In [None]:
N_rows = ...
N_cols = ...
N_rows, N_cols

In [None]:
grader.check("q1_2")

**Granularity**: Now, let's focus on the granularity of the dataset. We define this as the level of aggregation in our data. 

- For geospatial data, think of data summarized over city, block, street, building, address, room number, etc. as increasing levels of granularity.

- For time series data, granularity could, for instance, be weekly, daily, or hourly averages of a variable. Survey data might be a bit trickier to untangle, but we're certain you've got this!
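
One way to probe granularity is to test whether a set of columns identifies each row uniquely. The sketch below does this on an invented table; the column names `household`, `child`, and `visit` and the helper `is_unique_key` are hypothetical and are not taken from the lab's dataset.

```python
import pandas as pd

# Invented toy table with hypothetical column names
toy = pd.DataFrame({
    "household": [1, 1, 1, 2, 2],
    "child": [1, 2, 1, 1, 1],
    "visit": [1, 1, 2, 1, 2],
})


def is_unique_key(df, columns):
    """Return True if no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()


# How many rows does each household contribute?
print(toy["household"].value_counts())

# Try increasingly fine candidate keys
print(is_unique_key(toy, ["household"]))
print(is_unique_key(toy, ["household", "visit"]))
print(is_unique_key(toy, ["household", "child", "visit"]))
```

In this toy table only the three-column combination is unique, which tells us that each row is one child in one household at one visit.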

<!-- BEGIN QUESTION -->

**Question 1.3:** What is the granularity of our dataset? Think of what each row represents. Choose 3 arbitrary columns you find interesting and explain how they help you understand the dataset's granularity. One of them should identify the *primary key* of this dataset. (Note that the primary key can be a combination of 2 or more columns.)

Hint: You can use [`pandas.Series.value_counts`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) and/or [`pandas.Series.unique`](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 1.4** How many unique households and springs are there in this dataset? How many rounds of interviews were held? Assign the value to the corresponding variables. 

Note: `a1_hh_id` is the household ID. `a2_spring_id` is the spring ID. `bwm_round` specifies the round of interview. 

In [None]:
num_households = ...
num_springs = ...
num_interview_rounds = ...

print('Households: ', num_households)
print('Springs: ', num_springs)
print('Interview Rounds: ', num_interview_rounds)

In [None]:
grader.check("q1_4")

---
## Phase 2: Understanding the Survey

Our EDA in Phase 1 helped us understand the general structure of the data. Now, turn to the [survey PDF](https://berkeley.app.box.com/s/2thnb8dan58o7w44mu9tpk1zy8l4c0no). Glance over it and try to put yourself in the shoes of the interviewer and answer the questions below. Throughout the project, you might find it helpful to refer to [Appendix 1](#appendix_1) which connect the relevant survey sections with columns in our dataset.

<!-- BEGIN QUESTION -->

**Question 2.1:** What are the main parts of the survey? In this question, list each section denoted by a letter and explain in 1 sentence what you believe to be its significance. We'll start you off with two:

- Section A: Introduction with general respondent and interview round information and consent. 
- Section B: Characteristics of respondent. Filled out once during the survey rounds (if respondent stays the same).

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.2:** After your first glance at the survey, what do you deem to be the most important "datapoints" collected that are relevant to the paper's research hypothesis? You can refer to either specific survey questions or specific columns.

Hint: this paper focuses on the prevalence of diarrhea across treatment and control groups.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.3:** Outside of the paper's "sphere of research interest", what would be interesting datapoints to analyse further? This is an open-ended question, and we suggest you form a short research question and how you would use the data from the survey.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Phase 3: Analyzing Disease Prevalence Across Survey Rounds

From our EDA and from analyzing the survey itself, we understand that our dataset is the result of a series of survey rounds (20 in total). A set of households was interviewed and asked primarily about the general health condition of the children in the household over the past week. If a child was ill on the survey day, the child was examined further for symptoms of diarrhea. We now direct our focus to the result of **section D: The Health History tables** (starting on page 7 in the [survey](https://berkeley.app.box.com/s/2thnb8dan58o7w44mu9tpk1zy8l4c0no)), one of which is filled out for each child in the household per survey round. During the following questions, we suggest you **keep a close eye on the survey.**

**Question 3.1:** How many households are surveyed each round? Use `groupby` and an aggregate function. Assign the result to the dataframe `grouped_by_round` that will have only two columns `bwm_round` and `a1_hh_id`. Then make a bar plot using `plotly` that shows the number of households surveyed for each round.
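
The shape of a `groupby` followed by an aggregate and `reset_index` can be sketched on an invented table as follows. The values are made up; only the column names come from the lab. The sketch also contrasts `count`, which counts non-missing rows, with `nunique`, which counts distinct IDs. The two differ whenever a household contributes more than one row to a round.

```python
import pandas as pd

# Invented toy data: household 10 has two rows in round 1
toy = pd.DataFrame({
    "bwm_round": [1, 1, 1, 2, 2],
    "a1_hh_id": [10, 10, 11, 10, 12],
})

# Number of rows per round
rows_per_round = toy.groupby("bwm_round")[["a1_hh_id"]].count().reset_index()

# Number of distinct household IDs per round
unique_per_round = toy.groupby("bwm_round")[["a1_hh_id"]].nunique().reset_index()

print(rows_per_round)
print(unique_per_round)
```

Both results have exactly two columns, `bwm_round` and `a1_hh_id`, because `reset_index` turns the group labels back into an ordinary column.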

In [None]:
# group by
grouped_by_round = wg_df.groupby(...)[[...]].count().reset_index()

# plot
px.bar(..., x=..., y=...,title='Households Surveyed per Survey Round')

In [None]:
grader.check("q3_1")

**Question 3.2:** There appear to be two survey rounds denoted "99" and "161". For now, we choose not to include them. Remove them and make the bar plot again.
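
Conditional filtering on a list of values is often written with `isin` and a boolean mask. The sketch below uses an invented table and assumes the round column holds plain numbers; if the Stata file stores rounds as labels, the values to match would differ.

```python
import pandas as pd

# Invented toy data; assumes rounds are stored as plain numbers
toy = pd.DataFrame({
    "bwm_round": [1, 9, 99, 16, 161],
    "a1_hh_id": [10, 11, 12, 13, 14],
})

extra_round_labels = [99, 161]

# Boolean mask: True where the row belongs to one of the extra rounds
is_extra = toy["bwm_round"].isin(extra_round_labels)

regular = toy[~is_extra]  # ~ negates the mask and keeps everything else
extras = toy[is_extra]    # keeps only the extra rounds

print(regular)
print(extras)
```

The same mask can be reused later to pull out only the extra rounds, so it is worth storing it in a variable.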

In [None]:
# select relevant rounds
relevant_rounds = wg_df[...]

# group by rounds
grouped_by_round = relevant_rounds.groupby(...)[[...]].count().reset_index()

# plot
px.bar(..., x=..., y=...,title='Households Surveyed per Survey Round')

In [None]:
grader.check("q3_2")

<!-- BEGIN QUESTION -->

**Question 3.3**: In the text cell below, share an observation from the plot and what you believe potential causes of the variation of participating households in each round could be. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->


**What about round 99 and round 161?**  
During the survey, a decision was made to survey an additional set of households that were not being surveyed as frequently. Round 99 happened at the same time as round 9 and serves as a new control group of households (and round 161 happened at the same time as round 16).
In other words, the researchers added a control group that was "less frequently surveyed". These households reported much higher diarrhea prevalence. We later use these extra rounds to quantify the [Hawthorne effect](https://en.wikipedia.org/wiki/Hawthorne_effect) discussed in lecture.

**Question 3.4**: We now turn our attention to the variable of interest: The 7-day recall variable for diarrhea in the past week, denoted `d6a1_7dd_n` in the dataset. It is a binary categorical variable: 1 if a surveyed child in a given household had diarrhea the past 7 days, 0 if not. In the code cell below, plot the overall count for this variable across all survey rounds. Input the corresponding numbers in the answer cells below.

Hint: again use `groupby` and an aggregate function. Assign the result to the dataframe `grouped_by_diarrhea` that will have only two columns `d6a1_7dd_n` and `a1_hh_id`. 

In [None]:
grouped_by_diarrhea = relevant_rounds...
px.bar(..., x=..., y=..., 
       title='Surveyed Households with Child with Diarrhea past 7 days')

In [None]:
grader.check("q3_4")

**Question 3.5**: How does our variable of interest, `d6a1_7dd_n`, change across survey rounds? In the code cell below, make a line plot of the count of positive and negative 7-day diarrhea cases as a function of survey round.
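
Grouping on two columns at once yields one row per combination, which is the "long" layout that a `color=` argument in a plotting call typically expects. A minimal sketch on invented data (the values are made up; the column names are the lab's):

```python
import pandas as pd

# Invented toy data: three households observed in two rounds
toy = pd.DataFrame({
    "bwm_round": [1, 1, 1, 2, 2, 2],
    "d6a1_7dd_n": [0.0, 0.0, 1.0, 0.0, 1.0, 1.0],
    "a1_hh_id": [10, 11, 12, 10, 11, 12],
})

# One row per (round, diarrhea status) pair, with the number of rows in each
counts = (
    toy.groupby(["bwm_round", "d6a1_7dd_n"])[["a1_hh_id"]]
    .count()
    .reset_index()
)

print(counts)
```

Each distinct value of `d6a1_7dd_n` then becomes its own line when that column is passed as the color.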

In [None]:
# Select relevant columns
wg_df_plot_prep = relevant_rounds[['a1_hh_id', 'bwm_round','d6a1_7dd_n']]

# Group by survey round and binary 7-day diarrhea variable
wg_df_plot_prep = wg_df_plot_prep...

# Plot results as a line graph
px.line(..., x=..., y=..., 
    color='d6a1_7dd_n', # this will generate a different color for different `d6a1_7dd_n`
    title='Households Reporting Child Diarrhea Last 7 days across All Survey Rounds'
)

In [None]:
grader.check("q3_5")

<!-- BEGIN QUESTION -->

**Question 3.6**: Do you observe any particular trends in the reported past 7-day prevalence of child diarrhea across the survey rounds? Think of how its prevalence changes relative to previous survey rounds. Furthermore, take note of potential reasons for the trends you are observing.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 3.7**: Let's dive deeper into the trends we're observing here. Taking into consideration our results from 3.2 which showed that the number of surveyed households varied across all survey rounds, we can instead see how the average proportion of households reporting child diarrhea is changing over the survey rounds.

Hint: Use `groupby` and another aggregation function. The resulting dataframe `wg_df_plot_prep_new` should have two columns `bwm_round` and `d6a1_7dd_n`. 
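
The reason an average works here is that the mean of a 0/1 variable equals the share of ones. The sketch below checks this on invented data and also shows that pandas skips missing values when averaging.

```python
import numpy as np
import pandas as pd

# Invented toy data; the last row has a missing answer
toy = pd.DataFrame({
    "bwm_round": [1, 1, 1, 1, 2, 2, 2],
    "d6a1_7dd_n": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, np.nan],
})

# Mean of the binary variable per round
share = toy.groupby("bwm_round")[["d6a1_7dd_n"]].mean().reset_index()

# The same quantity computed by hand: positives divided by non-missing answers
grouped = toy.groupby("bwm_round")["d6a1_7dd_n"]
by_hand = grouped.sum() / grouped.count()

print(share)
print(by_hand)
```

Because the denominator is the number of non-missing answers in each round, the proportions stay comparable even when the number of surveyed households changes from round to round.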

In [None]:
# Group by round and take the average of the binary 7-day diarrhea variable.
wg_df_plot_prep_new = ...

# Plot results as a 1-line graph.
px.line(data_frame=..., x=..., y=...,
        title='Average Proportion of Households Reporting Child Diarrhea Last 7 days across All Survey Rounds')

In [None]:
grader.check("q3_7")

**Question 3.8**: In the code cell below, generalize the prevalence plot code in the form of a function that inputs a condition column (e.g `d6a1_7dd_n`) and name (e.g. Diarrhea) and outputs a plot as above.

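Hint: Again, use `groupby` and a correct aggregation function. `df_plot_prep` should have only two columns: `"bwm_round"` and the column named by `condition_column` (this is a variable! For example, `condition_column` can be `'d6a1_7dd_n'`). The `condition_name` argument (e.g. "Diarrhea") is only used in the plot title.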

In [None]:
"""
Takes in the condition column and actual name. Outputs plots of average prevalence of 
condition across survey rounds. 
"""
def prevalence_plotter(df, condition_column, condition_name):

    # Group by round and take the average of the binary condition variable
    df_plot_prep = ...

    # Plot results as a 1-line graph
    title_string = f"Average proportion of households reporting {condition_name} in survey round"
    return px.line(df_plot_prep, x='bwm_round', y=condition_column, title=title_string)

prevalence_plotter(relevant_rounds, 'd6a1_7dd_n', 'Diarrhea')

In [None]:
grader.check("q3_8")

Now, we will apply the function on a dictionary of diseases already prepared for you. This will generate four plots that describe the average proportion of households reporting a certain disease over the survey rounds.

In [None]:
conditions = {
    'd6d1_7dc_n': 'Cough',
    'd7d2_7dcn_n': 'Chest Noise',
    'd6b_7day_fever_n': 'Fever',
    'd6c_7day_vomiting_n': 'Vomiting',
}

for condition_column, condition_name in conditions.items():
    figure = prevalence_plotter(relevant_rounds, condition_column, condition_name)
    figure.show()

<!-- BEGIN QUESTION -->

**Question 3.9**: Choose one of the plots above and thoroughly reflect on a set of observations in a few sentences. Can you think of why disease prevalence steadily declines as the number of survey rounds increases? And what could have caused the sudden uptick in the last rounds? (Hint: Revisit the lecture slides.)

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Phase 4: Analyzing the Hawthorne Effect

In this phase of the project we seek to analyze and quantify the [Hawthorne effect](https://en.wikipedia.org/wiki/Hawthorne_effect) mentioned in lecture. In general terms, it is a positive change in the performance of a group of persons taking part in an experiment or study due to their perception of being singled out for special consideration. (Collins Dictionary) In our context, we define it as the decrease in disease prevalence attributed to the mere fact that households are *aware of themselves being surveyed.* Let's see if we can put a number to it!

**Question 4.1**: In a separate dataframe, single out all results from round "99" and "161" and the columns relevant for an investigation of the 7-day diarrhea prevalence columns. 

Hint: same as before. Use `groupby` and a correct aggregation function. The resulting dataframe `de_grouped` should have two columns `bwm_round` and `d6a1_7dd_n`. 
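
Once both groups have a per-round average, putting a number on the gap amounts to lining up each extra round with the regular round held at the same time (99 with 9, 161 with 16) and subtracting. The prevalence values below are invented purely to show the mechanics and say nothing about the real data.

```python
import pandas as pd

# Invented prevalence values, for illustration only
frequent = pd.DataFrame({"bwm_round": [9, 16], "d6a1_7dd_n": [0.10, 0.08]})
infrequent = pd.DataFrame({"bwm_round": [99, 161], "d6a1_7dd_n": [0.20, 0.18]})

# Map each extra round onto the regular round held at the same time
same_time = {99: 9, 161: 16}
infrequent = infrequent.assign(bwm_round=infrequent["bwm_round"].map(same_time))

# Line the two groups up side by side and take the difference
comparison = frequent.merge(
    infrequent, on="bwm_round", suffixes=("_frequent", "_infrequent")
)
comparison["gap"] = (
    comparison["d6a1_7dd_n_infrequent"] - comparison["d6a1_7dd_n_frequent"]
)

print(comparison)
```

A positive gap means the less frequently surveyed group reported more diarrhea than the frequently surveyed group at the same point in time.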

In [None]:
# Select relevant rounds (use conditional filtering)
extra_rounds = ...

# Group by rounds
de_grouped = extra_rounds.groupby(...)...

de_grouped.head()

In [None]:
grader.check("q4_1")

In [None]:
# create a synthetic dataframe for plotting
de_grouped_plot = de_grouped.copy()
de_grouped_plot["bwm_round"] = np.array([9, 16])

# plot both groups
fig1 = px.line(wg_df_plot_prep_new, x='bwm_round', y='d6a1_7dd_n')
fig2 = px.scatter(de_grouped_plot, x='bwm_round', y='d6a1_7dd_n', color_discrete_sequence=['red'])

layout = go.Layout(title='Average Proportion of Households Reporting Child Diarrhea Last 7 days across All Survey Rounds')
fig_all = go.Figure(data=fig1.data + fig2.data, layout=layout)
fig_all.show()

<!-- BEGIN QUESTION -->

**Question 4.2**: Look at the graph above. The red points are the corresponding control groups 99 and 161. How different are these from the normal group quantitatively? (Feel free to just eyeball it or write some code) Are you surprised by your findings? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Phase 5: Analyzing WaterGuard Usage

In this phase of the project, we turn our attention to the effect of [WaterGuard](https://www.engineeringforchange.org/solutions/product/waterguard/) usage on disease prevalence across households.

**Question 5.1**: Read in the `BWM_HH_EVDvars.dta` dataset and convert it into a dataframe. 

In [None]:
hh_wg = ...
hh_wg.head()

In [None]:
grader.check("q5_1")

<!-- BEGIN QUESTION -->

**Question 5.2**: What does each row of `hh_wg` contain? What does it say about the granularity (or the level of aggregation)? How does it compare to the dataframe used in Phases 1-4?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 5.3:** How many households appear in each survey round? 

Hint: The resulting dataframe should have two columns `bwm_round` and `a1_hh_id`. You may want to use [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) after `groupby`. 

In [None]:
hh_round = ...
hh_round

In [None]:
grader.check("q5_3")

**Question 5.4:** In each round, how many households have validated usage of WaterGuard?

Hint: The column with the obscure name of `G5XH5` might be of interest to you - it's a dummy variable for validated WG usage! 
The resulting dataframe should have three columns `bwm_round`, `G5XH5` and `a1_hh_id`. You may want to use [`reset_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) after `groupby`.  

In [None]:
hh_round_vld = ...
hh_round_vld

In [None]:
grader.check("q5_4")

Now we turn to a truly interesting part of the survey: How does the promotion of WaterGuard affect actual use of the product? You might have noticed the `assign_wg` column in both of our datasets: This denotes whether or not WG was promoted to a given household. Promotion includes free samples, encouragement scripts, and follow-ups on water quality. The encoding is as follows:

- `assign_wg` = 0 is a control household with no WG Promotion.
- `assign_wg` = 1 is a household with WG Promotion.

Here's an example of what a household with WG promotion looks like across all survey rounds:

In [None]:
hh_wg[hh_wg['a1_hh_id']== 872002][['bwm_round', 'assign_wg']]

**Question 5.5:** Select the `bwm_round`, `G5XH5`, and `assign_wg` columns from our `hh_wg` dataset. Then rename `G5XH5` as `validated_wg` and `assign_wg` as `promoted_wg`, respectively. Only include the normal survey rounds (i.e. not rounds 99 and 161).

In [None]:
a_df = hh_wg[...][[...]]
a_df = a_df.rename({...}, axis=1)
a_df

In [None]:
grader.check("q5_5")

<!-- BEGIN QUESTION -->

**Question 5.6:** Which of the two WaterGuard columns informs us whether or not a given household is in the *treatment* or *control* group? Which column stores our *outcome* variable?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 5.7:** How does validated WaterGuard usage vary in our treatment and control groups across all survey rounds? Use the `a_df` you defined above to find the average proportion of households with validated WG usage per round for the treatment and control groups, respectively.

Hint: You will need to `groupby` on two columns. 

In [None]:
tc_prep = a_df.groupby([...]).mean().reset_index()

px.line(tc_prep, x=..., y=..., color = 'promoted_wg',
        title='Average Proportion of Households with Validated WG Usage Per Round across Control and Treatment')

In [None]:
grader.check("q5_7")

**Question 5.8 (Extra Credit):** Now, quantify the relationship between our dependent variable (`validated_wg`) and our independent variable (`promoted_wg`). Do this by performing an Ordinary Least Squares regression using the `statsmodels` package. You should return a regression summary.
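
To build intuition for what such a regression estimates, the sketch below fits the same kind of model on invented 0/1 data. Note the swap: it uses NumPy's least-squares solver rather than `statsmodels`, so it shows only the coefficients, not standard errors or a summary table. With a single binary regressor and an intercept, the slope equals the difference between the two group means.

```python
import numpy as np

# Invented binary data: x = 1 means "promoted", y = 1 means "validated use"
x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 0, 1], dtype=float)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of y = intercept + slope * x
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

# Group means for comparison
control_mean = y[x == 0].mean()
treated_mean = y[x == 1].mean()

print(intercept, slope)
print(control_mean, treated_mean - control_mean)
```

The intercept reproduces the control-group mean and the slope reproduces the treated-minus-control gap. An OLS fit on the same design matrix should report the same two coefficients, along with the standard errors needed to judge significance.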

In [None]:
df_reg = a_df.dropna()

y = ...
X = ...

# Remember to add an intercept term to X
X = ...

# Fit model and return the summary
model = ...
result = model.fit()
result.summary()

In [None]:
grader.check("q5_8")

<!-- BEGIN QUESTION -->

**Question 5.9 (Extra Credit):** Interpret your findings using the [Sign, Significance, and Size framework](https://are.berkeley.edu/courses/EEP118/spring2014/section/Handout4_2014.pdf). 

Hint: If you're new to interpreting `statsmodels` summaries, you might find this [blog post](https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a) helpful.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---
## Phase 6: Impact of More-Frequent Surveying (Optional)

In this last part, we focus our attention on the main concern of Zwane et al. (2011), _[Being Surveyed Can Change Later Behavior and Related Parameter Estimates](https://www.pnas.org/doi/10.1073/pnas.1000776108)_. They ask:

- Does completing a household survey change the later behaviour of those surveyed?

We direct our focus to what our data may tell us: 

- What's the impact of more-frequent surveys on chlorine use and diarrhea prevalence?

More specifically, we'd like to replicate the first columns of the following regression table, but without the protected spring variable and its interaction variable with 'surveyed more frequently'. Are you up for the challenge?

![regressions.png](attachment:regressions.png)

Before you start with the regressions, make sure you understand the experimental design as described in the [paper](https://www.pnas.org/doi/epdf/10.1073/pnas.1000776108)'s results section:

> The sample for this experiment is composed of 330 households in rural western Kenya who were randomly selected from a frame of 1,500 households involved in a larger randomized evaluation of spring protection and WaterGuard use (23). Of these, 170 households were randomly assigned to be surveyed about health status biweekly [to accord with common practice in surveys of child diarrhea in epidemiology (9, 10)] for 18 2-wk rounds beginning in May 2007, although there was a 4-mo gap between rounds 15 and 16 owing to the postelection violence in early 2008. A final survey (round 19) was conducted 7 mo later in December 2008. The remaining 160 households were randomly selected to get the same survey just three times, or every 6 mo, during the same period: in biweekly survey rounds 9 (September 2007), 16 (April 2008), and 19 (December 2008). More than 97% of both the biweekly and low-frequency groups completed at least one of their survey rounds; 90% of the biweekly group completed at least 17 of the 19 surveys, and 90% of the low-frequency group completed at least 2 of the 3 surveys. ...

### Merging WG with Diarrhea Prevalence Data

To answer this, we first need to merge our WaterGuard usage data onto our diarrhea prevalence data.

We start off by keeping all rounds (including 99 and 161) and renaming the columns as done earlier. This time, we also sort by household and survey round to make the data a bit more interpretable.

In [None]:
# Keep all rounds and rename the WaterGuard columns.
df = hh_wg.rename({'G5XH5':'validated_wg', 'assign_wg': 'promoted_wg'}, axis=1)

# Sort for household id and round number.
df_s = df.sort_values(['a1_hh_id','bwm_round'])
df_s.head()

Next, our goal is to understand how the 7-day recall rate for diarrhea changes across households in each round. To do so, we need to find the data corresponding to the following answer in the survey:

![survey_snap.png](attachment:survey_snap.png)

The relevant columns are all named `d6a1_7day_diarrhea_XX` with a number from 01 to 22 corresponding to child 'number' for each household. For example, if `d6a1_7day_diarrhea_04` is 1.0, child number 4 of that household in that given round had diarrhea the past 7 days.

Now that we have identified our columns of interest and their format, we need to sub-select the relevant columns for further investigation.

**Question 6.2:** In the following cell, create `df_disease` that contains only the relevant columns for each given household using `df_s`. 

Hint: The relevant columns are all named `d6a1_7day_diarrhea_XX` with a number from 01 to 22 corresponding to child 'number' for each household.
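
Selecting a family of columns that share a prefix can be done without typing out all the names. The sketch below shows two equivalent approaches on an invented table; the fever column is included only to show that it is left out.

```python
import pandas as pd

# Invented toy table with two diarrhea columns and one unrelated column
toy = pd.DataFrame({
    "a1_hh_id": [10, 11],
    "bwm_round": [1, 1],
    "d6a1_7day_diarrhea_01": [1.0, 2.0],
    "d6a1_7day_diarrhea_02": [2.0, 99.0],
    "d6b_7day_fever_01": [2.0, 1.0],
})

prefix = "d6a1_7day_diarrhea_"

# Option 1: DataFrame.filter with a regular expression anchored at the start
by_filter = toy.filter(regex=f"^{prefix}")

# Option 2: a list comprehension over the column names
by_list = toy[[col for col in toy.columns if col.startswith(prefix)]]

print(by_filter.columns.tolist())
```

Both options keep the original row index, which matters later when the aggregated result is attached back to the household table.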

In [None]:
# Subselect all relevant disease columns - per household per round
# df_disease = ...
# df_disease.head()

To clarify, `df_disease`'s rows now represent a household in a given round, and each column informs us whether or not a given child (up to 22!) has had diarrhea the past 7 days. But, there's something interesting with the values present: What does '2' or '99' that we observe when running `.describe()` represent?

In [None]:
# df_disease.describe()

If we take a second look at the specific survey entry below, we observe that the 99 is an encoding for 'DK', shorthand for 'Don't know.' In this survey data, '2.0' is an encoding for 'no'. Let's deal with these encodings to avoid faults in our analysis!

![survey_snap_2.png](attachment:survey_snap_2.png)

**Question 6.3:** In the following cell, use the [`pandas.DataFrame.replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) function to change both 2.0 and 99.0 entries to 0.
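
`replace` accepts a dictionary that maps old values to new ones, so several codes can be recoded in a single call. A small sketch on invented data (the column names `child_01` and `child_02` are hypothetical):

```python
import numpy as np
import pandas as pd

# Invented toy data using the survey's codes: 1 = yes, 2 = no, 99 = don't know
toy = pd.DataFrame({
    "child_01": [1.0, 2.0, 99.0],
    "child_02": [2.0, np.nan, 1.0],
})

# Recode both 2 and 99 to 0 in one call; missing values are left untouched
recoded = toy.replace({2.0: 0.0, 99.0: 0.0})

print(recoded)
```

`replace` returns a new dataframe and leaves `toy` unchanged. Recoding "don't know" to 0 treats it as a "no"; mapping 99 to `np.nan` instead would keep those answers out of later averages, which is a different modelling choice from the one this question asks for.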

In [None]:
# YOUR CODE HERE
# As 2.0 is the encoded answer for a no, we replace each such value with 0.
# df_disease = ...
# We do the same for values of 99, as that is the encoding for 'Don't know.'
# df_disease = ...
# df_disease.head()

Being interested in the diarrhea prevalence of each household in a given survey round, we ought to summarise our 22 columns into a meaningful variable.

**Question 6.4:** In the next question, you'll be asked to aggregate the findings above *horizontally*, that is, across columns. Before you do so, consider two options: taking the average or summing them together. What would the benefits and/or trade-offs of each approach be?

*Your answer here.*

**Question 6.5:** In the following cell, perform a horizontal aggregation of the columns with the chosen method from 6.4.

Hint: remember that we can specify the `axis` parameter in both `.mean()` and `.sum()`!
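
The two aggregations behave differently when a household has fewer children than there are columns, because the unused columns are missing. The sketch below illustrates this on invented data with hypothetical column names.

```python
import numpy as np
import pandas as pd

# Invented toy data: household 0 has three children, household 1 has one,
# and household 2 has no recorded children at all
toy = pd.DataFrame({
    "child_01": [1.0, 0.0, np.nan],
    "child_02": [1.0, np.nan, np.nan],
    "child_03": [0.0, np.nan, np.nan],
})

row_sum = toy.sum(axis=1)            # number of children with diarrhea
row_mean = toy.mean(axis=1)          # share of recorded children with diarrhea
n_children = toy.notna().sum(axis=1) # number of non-missing answers per row

print(pd.DataFrame({"sum": row_sum, "mean": row_mean, "n": n_children}))
```

With the default settings, an all-missing row sums to 0 but averages to `NaN`. The sum therefore cannot distinguish "no sick children" from "no data", while the mean adjusts for household size at the cost of hiding how many children were counted.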

In [None]:
# YOUR CODE HERE
# df_d_agg = ...
# df_d_agg.head()

**Question 6.6:** Now, merge your aggregated column back onto the existing `df_s`. Your final `df_m` should contain the relevant columns `'bwm_round'`, `'a1_hh_id'`, and `'validated_wg'`, plus the aggregate. Rename the aggregate column `child_diarrhea`.
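
When the aggregate was computed row by row from the same dataframe, it shares that dataframe's index, so it can be attached by index alignment. The sketch below uses an invented table with a deliberately shuffled index; the names `household_table` and `aggregate` are hypothetical.

```python
import pandas as pd

# Invented household table with a non-trivial index (as after sorting)
household_table = pd.DataFrame(
    {
        "bwm_round": [1, 1, 2],
        "a1_hh_id": [10, 11, 10],
        "validated_wg": [1.0, 0.0, 1.0],
    },
    index=[5, 3, 8],
)

# Invented aggregate, stored in a different row order but with the same labels
aggregate = pd.Series([0.0, 0.5, 1.0], index=[3, 5, 8])

# join matches rows by index label, not by position
merged = household_table.join(aggregate.rename("child_diarrhea"))

print(merged)
```

Naming the series before joining sets the new column's name in the same step. `pd.merge` with `left_index=True, right_index=True` is an alternative that would give the same result here.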

In [None]:
# YOUR CODE HERE
# Merge disease prevalence per household per round back to original df.
# df_m = ...

# Relabel columns
# df_m.columns = ...

# df_m.head()

### Regressing Child Diarrhea Prevalence on validated WG usage

**Question 6.7:** With that done, create `val_prep`, `df_m` grouped by `bwm_round` and whether or not the household had validated WG usage. It should contain the mean 7-day diarrhea household prevalence for each round and treatment/control group. After that, perform the regression as specified below.

In [None]:
# YOUR CODE HERE
# val_prep = ...
# val_prep.head()

The cell below plots our two trend lines: 7-day Mean Diarrhea Prevalence in households with and without validated WG usage. To illustrate the trend lines, we drop our less surveyed households, denoted as round 99 and 161.

In [None]:
# px.line(val_prep[val_prep['bwm_round'] <19], x='bwm_round', y='child_diarrhea', color = 'validated_wg',
#         title='7-day Mean Diarrhea Prevalence in Households with and without Validated WG Usage Across Rounds')

**Question 6.8:** In the plot above, it appears that the 7-day mean diarrhea prevalence in households is lower with validated WG usage, on average. In the cell below, quantify the relationship between `child_diarrhea` and `validated_wg` and interpret your findings, following the example from Questions 5.8 and 5.9. Some questions to consider: What are you observing? Did you expect a larger impact from validated WaterGuard usage? Is this relationship significant?

In [None]:
# YOUR CODE HERE (REGRESSION)

*Your answer here. (SSS-interpretation).*

### Regressing Child Diarrhea Prevalence on Survey Rounds 

Next, let's turn our attention back towards the main focus of the aforementioned paper: What's the impact of more-frequent surveys on chlorine use and diarrhea prevalence? In the regression below, you should compare the differences between rounds 9 and 99 and between rounds 16 and 161. Remember, these rounds were held at the same time, but with households surveyed biweekly (9 and 16) and biannually (99 and 161). Remember also that these households were randomly sampled from the larger sample of WaterGuard-assigned (or not) households.

**Question 6.9:** In the cell below, run a regression of `child_diarrhea` on a series of dummy variables, one for each round.
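
To see what round dummies measure, the sketch below one-hot encodes an invented `bwm_round` column, drops round 1 as the reference category, and solves the least-squares problem with NumPy (a stand-in for `statsmodels`, chosen so that the arithmetic is easy to check). Passing `dtype=float` to `pd.get_dummies` is a precaution: recent pandas versions return boolean dummy columns by default, and a mix of boolean and float columns can trip up regression libraries.

```python
import numpy as np
import pandas as pd

# Invented toy data: two households in each of three rounds
toy = pd.DataFrame({
    "bwm_round": [1, 1, 9, 9, 99, 99],
    "child_diarrhea": [0.4, 0.6, 0.2, 0.4, 0.6, 0.8],
})

# One column per round, then drop round 1 so it becomes the reference category
dummies = pd.get_dummies(toy["bwm_round"], dtype=float).drop(columns=1)

# Intercept column plus the remaining round dummies
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
y = toy["child_diarrhea"].to_numpy()

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, round_9, round_99 = coef

round_means = toy.groupby("bwm_round")["child_diarrhea"].mean()

print(intercept, round_9, round_99)
print(round_means)
```

In this setup the intercept is the mean of the reference round, and each dummy coefficient is the gap between that round's mean and the reference round's mean. The difference between the coefficients for rounds 99 and 9 is then the gap between the two groups surveyed at the same time.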

In [None]:
# YOUR CODE HERE
# df_m = df_m.dropna(axis=0) # Drops 4 nans.
# df_reg = ... # Select relevant columns.

# y = ... # Select our dependent variable.
# X = ... # Select our independent variables.

# round_dummies = pd.get_dummies(X['bwm_round']) # One-Hot Encode our bwm_round variable.
# round_dummies = round_dummies.drop(1,axis=1) # Drop one round.

# X = X.drop('bwm_round',axis=1) # Remove the original bwm_round column.
# X = X.join(round_dummies) # Join OHE into X.
# X

Now, we'll fit our Ordinary Least Squares linear regression using the `statsmodels` API. It provides us with a neat summary, but it's still on us to interpret it!

In [None]:
# OLS(y,X).fit().summary()

**Question 6.10:** What are you observing in terms of the differences between less and more frequently surveyed rounds? Is this as expected? Also, how does diarrhea prevalence change across survey rounds? Hint: Can you spot a 'trend line' in your data?

*Your answer here.*

Answer: There are large differences between the less and more frequently surveyed groups. Diarrhea prevalence declines steadily as the rounds progress, with the exception of rounds 99 and 161!

**Question 6.11:** Now repeat the exercise from above, but regress `validated_wg` on rounds instead. What do you observe?

In [None]:
# YOUR CODE HERE
# df_m = df_m.dropna(axis=0) # Drops 4 nans.
# df_reg = ... # Select relevant columns.

# y = ... # Select our dependent variable.
# X = ... # Select our independent variables.

# round_dummies = pd.get_dummies(X['bwm_round']) # One-Hot Encode our bwm_round variable.
# round_dummies = round_dummies.drop(1,axis=1) # Drop one round.

# X = X.drop('bwm_round',axis=1) # Remove the original bwm_round column.
# X = X.join(round_dummies) # Join OHE into X.
# X

**Question 6.12:** (Last one, I promise!) Interpret your findings above in the way that suits you best.
Hint: Again, we suggest using the [SSS-framework](https://are.berkeley.edu/courses/EEP118/spring2014/section/Handout4_2014.pdf) (Sign, Significance, and Size)

*Your answer here.* 

## Congratulations!

You just finished the longest lab in Econ 148. Give yourself a pat on the back, then sit back and appreciate the following accomplishments:

- You performed a stellar EDA of the Main Rural Project Spring Protection Study, a real-world dev-econ project!
- You delved deep into a truly complex survey, and put yourself into the situation of both the interviewer and the one being interviewed!
- With your understanding of the survey, you analysed the prevalence of 5 serious diseases across 2000+ households, and even built a function to do so!
- You understood and analysed the Hawthorne effect, and managed to quantify the extent to which it was present in our study data.
- You set yourself up for a further analysis of diarrhea prevalence through examining the usage and promotion of WaterGuard.
- And much more!

What a project!

---
## Feedback

**Question 7:** Please fill out this short [feedback form](https://forms.gle/HicDWSXkfaow2hVj7) to let us know your thoughts about this lab! We really appreciate your opinions and feedback! At the end of the Google form, you should see a codeword. Assign the codeword to the variable `codeword` below. 

In [None]:
codeword = ...

In [None]:
grader.check("q7")

<a id='appendix_1'></a>
## Appendix 1: Relevant Column Sections

Columns to register children born within the past 2 weeks

- 'c3_1_child_id',
 'c3_2a_name1',
 'c3_2b_name2',
 'c3_3_gender',
 'c3_4_doa_day',
 'c3_4_doa_month',
 'c3_4_doa_year',
 'c3_5_dob_day',
 'c3_5_dob_month',
 'c3_5_dob_year',
 'c3_6_age_months',
 'c3_6_age_weeks',
 'c3_6_age_years',
 'c3_7_verified',
 'c3_7_verified_other',

General Child Health History columns

Note: There are 10x sheets of these in the original survey, one for each child's health condition. They ask questions on whether or not child has had (during the past week) symptoms of diarrhea, blood in stool, fever, vomiting, constant cough, and/or weakness. One of the reasons for the amount of the columns is that the survey asks for daily symptom data.

-  'd1_child_id',
 'd3_clinic_card',
 'd4a_dob_day',
 'd4a_dob_month',
 'd4a_dob_year',
 'd4b_age_months',
 'd4b_age_weeks',
 'd4b_age_years',
 'd5a_main_other_relation',
 'd5a_main_relation',
 'd5b_hist_other_relation',
 'd5b_hist_relation',

- Diarrhea Section: 'd6a1_7day_diarrhea',
 'd6a2_7day_blood_in_stool',
 'd6b_7day_fever',
 'd6c_7day_vomiting',
 'd6d1_7day_cough',
 'd6d2_7day_chest_noise',
 'd6d3_7day_dif_breathing',
 'd6e_7day_weakness',
 'd7a1_num_diarrhea',
 'd7a2_num_blood_in_stool',
 'd7b_num_fever',
 'd7c_num_vomiting',
 'd7d1_num_cough',
 'd7d2_num_chest_noise',
 'd7d3_num_dif_breathing',
 'd7e_num_weakness',
 'd8a1_unit_diarrhea',
 'd8a2_unit_blood_in_stool',
 'd8b_unit_fever',
 'd8c_unit_vomiting',
 'd8d1_unit_cough',
 'd8d2_unit_chest_noise',
 'd8d3_unit_dif_breath',
 'd8e_unit_weakness',
 'd9a1_today_diarrhea',
 'd9a2_today_blood_in_stool',
 'd9b_today_fever',
 'd9c_today_vomiting',
 'd9d1_today_cough',
 'd9d2_today_chest_noise',
 'd9d3_today_dif_breathing',
 'd9e_today_weakness',
 'd10a1_yest_diarrhea',
 'd10a2_yest_blood_in_stool',
 'd11a1_2day_diarrhea',
 'd11a2_2day_blood_in_stool',
 'd12a1_3day_diarrhea',
 'd12a2_3day_blood_in_stool',
 'd13a1_4day_diarrhea',
 'd13a2_4day_blood_in_stool',
 'd14a1_5day_diarrhea',
 'd14a2_5day_blood_in_stool',
 'd15a1_6day_diarrhea',
 'd15a2_6day_blood_in_stool',
 'd16a1_7day_diarrhea',
 'd16a2_7day_blood_in_stool',
 'd17b_numdays_fever',
 'd17c_numdays_vomiting',
 'd17d1_numdays_cough',
 'd17d2_numdays_chest_noise',
 'd17d3_numdays_dif_breathing',
 'd17e_numdays_weakness',
 'd18_stool',
 'd18_stool_other',
 'd19_drink_from_respondent',
 'd20_tears',
 'd21_urine',
 'd22_missed_school',
 'd23_hospital',
 'd24_breastfeeding',
 'd25_consent',

Child Examination Columns

Note: This section is filled out if child had diarrhea on the same day as the survey - after filling out section D, the one above. Each entry is for a unique child_id in the household. There can be multiple per visit (e.g several children have diarrhea in the household at the same time.)

- 'e1_1_child_id',
 'e1_2_mucous_membrane',
 'e1_3_nasal_flaring',
 'e1_4_acc_musc',
 'e1_5_fontanels',
 'e1_6_temp',
 'e1_7_resp_num',
 'e1_7_resp_sleep',
 'e1_7_resp_time',
 'e1_8_turgor',
 'e1_9_alertness',
 'e1_10_pulse_num',
 'e1_10_pulse_sleep',
 'e1_10_pulse_time',
 'e1_11_unable_to_assess',

Child Information Section

This section gives an overview of the two previous sections, registering whether or not a child health and/or examination section was filled out for every individual child in the household. It can be thought of as a 'history' chart for a given household over the time of the survey.

- 'fa_id',
 'fb1_child_alive',
 'fb2_status_duplicate_id',
 'fc_history_taken',
 'fc_history_taken_explain',
 'fd_exam_done',
 'fd_exam_done_explain',
 'fe_clinic_card',
 'ff_card_measles',
 'fg_vaccinations',
 'fh_vacc_age',
 'fi_vacc_public',
 'f11_1_child_id',
 'f11_2_diarrhea',
 'f11_3_num_of_days',

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)