
# Imputing Missing Values

## Description

Proper handling of missing data is crucial to deriving reliable conclusions from an analysis, particularly when the process that produces the missing values is not completely random. The following three examples illustrate such cases:

* A survey-based earnings series may have missing values because high-income respondents do not want to reveal their earnings. 
* Patients may be more likely to drop out of a medical treatment when the treatment causes discomfort.
* A blood pressure data series may have missing values for young patients who have no cardiovascular diseases. 

This analysis focuses on the case illustrated by the blood pressure example above, known as "missingness at random". The name is confusing because it sounds much like "missingness completely at random", but the two are different. When data are missing completely at random, every value has the same probability of being missing. When data are missing at random, the probability of missingness depends only on the available (observed) information, such as the patient's age and cardiovascular history in the blood pressure example. The following analysis examines whether a random regression approach can be used to impute the missing data in such cases (for a comprehensive treatment of the subject, please refer to Gelman and Hill).
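
To make the idea concrete, the following is a minimal sketch of random regression imputation on synthetic data, in the spirit of the approach described by Gelman and Hill. It is not the analysis carried out in this notebook: the variables `z` and `x`, the logistic missingness rule, and all numeric values are assumptions chosen only for illustration. The variable `z` is always observed, `x` goes missing with a probability that depends only on `z`, and each missing `x` is replaced by a regression prediction plus a random draw from the estimated error distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): z is always observed, x depends on z.
n = 500
z = rng.normal(size=n)
x = 1.0 + 2.0 * z + rng.normal(scale=0.5, size=n)

# Missing at random: the chance that x is missing depends only on observed z.
p_missing = 1.0 / (1.0 + np.exp(-z))
missing = rng.random(n) < p_missing
x_obs = np.where(missing, np.nan, x)

# Fit x ~ z by least squares on the complete cases only.
Z = np.column_stack([np.ones(n), z])
coef, *_ = np.linalg.lstsq(Z[~missing], x_obs[~missing], rcond=None)
resid = x_obs[~missing] - Z[~missing] @ coef
sigma = resid.std(ddof=2)  # residual std. dev.; two coefficients were estimated

# Random regression imputation: prediction plus a random error draw,
# so that the imputed values keep a realistic amount of spread.
x_imputed = x_obs.copy()
x_imputed[missing] = Z[missing] @ coef + rng.normal(scale=sigma, size=missing.sum())
```

Adding the random draw is the step that separates this from deterministic regression imputation, which would place every imputed value exactly on the fitted line and understate the variance of the imputed series.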

The analysis uses an unconditional income convergence model for its simplicity. In this model, growth in a country's per capita income is explained by the country's initial income level, adjusted for cross-country price differentials. The expectation is that countries that start from a relatively low income level grow faster and catch up with the rest. The "missingness at random" mechanism is implemented by setting half of the values of the independent variable, the PPP-adjusted per capita GDP series, to missing.
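
One possible way to set half of a series to missing under a "missingness at random" rule is sketched below. The text does not say how the mechanism is implemented, so everything here is an assumption: the toy data frame, the column names `growth` and `initial_income`, and the choice to make faster-growing countries more likely to lose their income value. The only essential feature is that the probability of missingness depends on a variable that remains fully observed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical cross-section: one row per country.
n = 100
toy = pd.DataFrame({
    'growth': rng.normal(loc=2.0, scale=1.5, size=n),
    'initial_income': rng.lognormal(mean=9.0, sigma=1.0, size=n),
})

# Missingness weights depend only on growth, which stays fully observed.
# Assumption: a higher growth rank means a higher chance of being set to missing.
ranks = toy['growth'].rank().to_numpy()
weights = ranks / ranks.sum()

# Draw exactly half of the rows, without replacement, using those weights.
drop_idx = rng.choice(n, size=n // 2, replace=False, p=weights)
toy_mar = toy.copy()
toy_mar.loc[toy_mar.index[drop_idx], 'initial_income'] = np.nan
```

Replacing `weights` with equal weights would turn the same code into a "missingness completely at random" mechanism, which makes the two cases easy to compare.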

## Data Preparation



The data used in the analysis are from the World Development Indicators database, downloaded through the World Bank API. Because the API also returns regional and income-group aggregates, the UN M49 country code list is used as a filter to limit the analysis to countries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

In [2]:
url = 'http://api.worldbank.org/v2/country/all/indicator'
indicator_list = ['NY.GDP.PCAP.KD.ZG',
                  'NY.GDP.PCAP.PP.CD']
parameters = {'date': '1989:2018',
              'footnote': 'n',
              'format': 'json',
              'per_page': 7920}
results_list = []
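for indicator in indicator_list:
    # One request per indicator; per_page is set high so each response fits on one page
    results = requests.get(f'{url}/{indicator}', params=parameters)
    if results.status_code == 200:
        results_list.append(results.json())
    else:
        print(f'Failed request for {indicator}: HTTP {results.status_code}')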

In [24]:
%run data_prep.py

In [25]:
df = flatten_wb_api_response(results_list)
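
The helper `flatten_wb_api_response` is defined in `data_prep.py`, which is not shown here. The sketch below illustrates what such a function might look like; it is a guess, not the actual helper, and it is given a different name (`flatten_wb_response_sketch`) to make that clear. It assumes each JSON response is a two-element list of page metadata followed by a list of records, with the record keys `indicator`, `country`, `countryiso3code`, `date` and `value`. The `fake_response` used to exercise it is made-up data.

```python
import pandas as pd

COLUMNS = ['indicator', 'country', 'country_iso3_code', 'year', 'value']

def flatten_wb_response_sketch(responses):
    """Flatten a list of World Bank API JSON responses into one long DataFrame."""
    rows = []
    for response in responses:
        # Each response is assumed to be [page_metadata, records].
        _, records = response
        for rec in records or []:
            rows.append({
                'indicator': rec['indicator']['id'],
                'country': rec['country']['value'],
                'country_iso3_code': rec['countryiso3code'],
                'year': int(rec['date']),
                'value': rec['value'],  # None when the observation is missing
            })
    frame = pd.DataFrame(rows, columns=COLUMNS)
    frame['value'] = pd.to_numeric(frame['value'])  # None becomes NaN
    return frame

# Made-up response with one observed and one missing value.
fake_response = [
    {'page': 1, 'pages': 1},
    [
        {'indicator': {'id': 'NY.GDP.PCAP.PP.CD', 'value': 'GDP per capita, PPP'},
         'country': {'id': 'AW', 'value': 'Aruba'},
         'countryiso3code': 'ABW', 'date': '2018', 'value': 1.5},
        {'indicator': {'id': 'NY.GDP.PCAP.PP.CD', 'value': 'GDP per capita, PPP'},
         'country': {'id': 'AW', 'value': 'Aruba'},
         'countryiso3code': 'ABW', 'date': '2017', 'value': None},
    ],
]
flat = flatten_wb_response_sketch([fake_response])
```

The long format, with one row per indicator, country and year, matches the five columns reported by `df.info()` in this notebook.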

In [27]:
country_iso3_code = pd.read_html('https://unstats.un.org/unsd/methodology/m49/')
country_iso3_code = country_iso3_code[0]['ISO-alpha3 code']

In [28]:
df = df.loc[df.country_iso3_code.isin(country_iso3_code)]   

In [39]:
df.set_index(['indicator', 'country_iso3_code', 'country', 'year']).unstack(level = 0).info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 6450 entries, (ABW, Aruba, 1989) to (ZWE, Zimbabwe, 2018)
Data columns (total 2 columns):
(value, NY.GDP.PCAP.KD.ZG)    5794 non-null float64
(value, NY.GDP.PCAP.PP.CD)    5395 non-null float64
dtypes: float64(2)
memory usage: 136.0+ KB


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12900 entries, 1410 to 15839
Data columns (total 5 columns):
indicator            12900 non-null object
country              12900 non-null object
country_iso3_code    12900 non-null object
year                 12900 non-null int64
value                11189 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 604.7+ KB


In [12]:
df

Unnamed: 0,indicator,country,country_iso3_code,date,value


In [26]:
# Count the missing values for each indicator
df.loc[df.value.isna(), 'indicator'].value_counts()

## Analysis

In [None]:
import statsmodels.api as sm