# Imputing Missing Values

Proper handling of missing data is key to drawing reliable conclusions from an analysis, particularly when the mechanism that generates the missingness is not completely random. The following three examples illustrate such cases:

* A survey-based earnings series may have missing values because high-income respondents do not want to reveal their earnings. 
* Patients may be more likely to opt out of undertaking a medical treatment when the treatment causes discomfort. 
* A blood pressure data series may have missing values for young patients who have no cardiovascular diseases. 

The following analysis focuses on one of these missingness-generating mechanisms: "missingness at random" (often abbreviated MAR), a confusing name because it sounds much like "missingness completely at random" (MCAR). The two are different. When data are missing at random, the probability that a value is missing depends on the available (observed) information, as in the blood pressure example above, where missingness depends on the patient's age and medical history. I examine this case to understand how missing data might bias predictions and whether a random regression approach can be used to fill in the missing values. For a comprehensive treatment of the subject, please refer to Gelman and Hill.
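
The difference between the two mechanisms can be illustrated with a small simulation. The sketch below uses made-up numbers (the age range, the blood pressure relationship, and the missingness probabilities are all assumptions chosen for illustration, not estimates from real data). Under the completely random mechanism, the mean of the observed values stays close to the mean of the full series; under missingness at random, where younger patients are less likely to have a reading, the observed mean drifts upward.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (illustrative only): blood pressure rises with age.
age = rng.uniform(20, 80, size=n)
blood_pressure = 100 + 0.5 * age + rng.normal(0, 10, size=n)

# Missing completely at random: every reading has the same 50% chance
# of being missing, regardless of any patient characteristic.
mcar_missing = rng.random(n) < 0.5

# Missing at random: the chance of a missing reading depends on age,
# which is itself fully observed. Younger patients are missing more often.
p_missing = np.clip((80 - age) / 60, 0, 1)
mar_missing = rng.random(n) < p_missing

true_mean = blood_pressure.mean()
mcar_mean = blood_pressure[~mcar_missing].mean()
mar_mean = blood_pressure[~mar_missing].mean()

print(f'full sample mean:   {true_mean:.1f}')
print(f'observed mean MCAR: {mcar_mean:.1f}')
print(f'observed mean MAR:  {mar_mean:.1f}')
```

Because age is observed for everyone, the bias in the second case can in principle be corrected by conditioning on age, which is what makes this mechanism tractable.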

To this end, I use an unconditional income convergence model for its simplicity. In this model, growth in a country's income is explained by the country's initial income level, adjusted for cross-country price differentials. The expectation is that countries that start off from a relatively low income level grow faster and catch up with the rest. The "missingness at random" mechanism is implemented by setting half of the values in the independent variable, the PPP-adjusted per capita GDP series, to missing.
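
As a rough sketch of this setup, the snippet below simulates a convergence relationship of the form growth = alpha + beta * log(initial income) + error with beta < 0, and then masks exactly half of the income values. Everything here is an assumption made for illustration: the data are synthetic rather than the World Bank series, the coefficient values are arbitrary, and the masking probability is assumed to depend on the (fully observed) growth rate, which may differ from the rule used in the actual analysis.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200  # hypothetical number of countries

# Synthetic data (not the World Bank series): log initial income and growth.
log_income = rng.normal(9.0, 1.0, size=n)
beta_true = -0.5  # assumed convergence coefficient
growth = 6.0 + beta_true * log_income + rng.normal(0, 1.0, size=n)

# Missing-at-random masking: the chance that income is masked depends only
# on growth, which stays fully observed. Faster growers are masked more often.
weights = 1.0 / (1.0 + np.exp(-(growth - np.median(growth))))
weights /= weights.sum()
masked_idx = rng.choice(n, size=n // 2, replace=False, p=weights)

log_income_obs = log_income.copy()
log_income_obs[masked_idx] = np.nan

# Complete-case fit: drop rows with a missing regressor, then fit a line.
keep = ~np.isnan(log_income_obs)
slope_cc, intercept_cc = np.polyfit(log_income_obs[keep], growth[keep], deg=1)

# Benchmark fit on the full, unmasked data.
slope_full, intercept_full = np.polyfit(log_income, growth, deg=1)

print(f'slope, full data:      {slope_full:.3f}')
print(f'slope, complete cases: {slope_cc:.3f}')
```

Comparing the two slopes gives a first impression of how much dropping the incomplete rows can distort the estimated convergence coefficient.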

## Data Preparation



In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import requests

In [4]:
url = 'http://api.worldbank.org/v2/country/all/indicator'
indicator_list = ['NY.GDP.PCAP.KD.ZG',
                  'NY.GDP.PCAP.PP.CD']
parameters = {'date': '1989:2018',
              'footnote': 'n',
              'format': 'json',
              'per_page': 7920}
results_list = []
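for indicator in indicator_list:
    # Request one indicator at a time and keep the raw JSON payload for flattening later.
    results = requests.get(f'{url}/{indicator}', params=parameters)
    if results.status_code == 200:
        results_list.append(results.json())
    else:
        print(f'Failed request for {indicator}: HTTP {results.status_code}')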

In [20]:
%run data_prep.py

In [21]:
df = flatten_wb_api_response(results_list)
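
The helper `flatten_wb_api_response` lives in `data_prep.py` and is not reproduced here. To give an idea of what such a step involves, the sketch below defines a hypothetical stand-in, `flatten_wb_payloads`, under the assumption that each World Bank v2 JSON payload is a two-element list of the form `[metadata, observations]`. The field names follow my understanding of that response format, and the sample payload contains placeholder values, not real data.

```python
def flatten_wb_payloads(payloads):
    """Hypothetical helper: turn World Bank v2 JSON payloads into flat records."""
    records = []
    for payload in payloads:
        # Each payload is assumed to be [metadata, observations]; skip
        # anything that lacks the second element (e.g. an error response).
        if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
            continue
        for obs in payload[1]:
            records.append({
                'indicator': obs['indicator']['id'],
                'country': obs['country']['value'],
                'iso3': obs['countryiso3code'],
                'year': int(obs['date']),
                'value': obs['value'],  # None when the observation is missing
            })
    return records


# Placeholder payloads (made-up country and values, for illustration only).
sample = [
    [
        {'page': 1, 'pages': 1, 'per_page': 2, 'total': 2},
        [
            {'indicator': {'id': 'NY.GDP.PCAP.PP.CD', 'value': 'placeholder'},
             'country': {'id': 'XX', 'value': 'Exampleland'},
             'countryiso3code': 'XXX', 'date': '2018', 'value': 1234.5},
            {'indicator': {'id': 'NY.GDP.PCAP.PP.CD', 'value': 'placeholder'},
             'country': {'id': 'XX', 'value': 'Exampleland'},
             'countryiso3code': 'XXX', 'date': '2017', 'value': None},
        ],
    ],
    [{'message': 'illustrative error payload'}],  # skipped by the helper
]

records = flatten_wb_payloads(sample)
print(records)
```

A list of records in this shape can be passed directly to `pd.DataFrame(records)`, which yields one row per country, year, and indicator, with missing observations showing up as `NaN` in the `value` column.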

In [26]:
# Count missing values per indicator: filter the rows first, then tally the indicator column.
df[df.value.isna()].indicator.value_counts()

In [7]:
country_iso3_code = pd.read_html('https://unstats.un.org/unsd/methodology/m49/')

In [8]:
country_iso3_code = country_iso3_code[0]