# Scraping 3ie's Impact Evaluation Registry
The following code downloads nearly all of the data from the 3ie impact evaluation registry.

## Download list of valid study numbers
In 3ie's registry, the details of each impact evaluation are listed on a webpage with a URL of the form "http://www.3ieimpact.org/en/evidence/impact-evaluations/details/study_number", where study_number appears to be an arbitrarily assigned number. Unfortunately, not all study numbers are valid.  
First, I get all of the valid study numbers from the main webpage, which lists links to all of the impact evaluation webpages.

In [1]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import re
import pandas as pd

# get the raw html of the 3ie page listing all of the studies
r = requests.get("http://www.3ieimpact.org/en/evidence/impact-evaluations/?q=&all=on&sort_by=alphabet")
# parse this page using BeautifulSoup
soup = BeautifulSoup(r.content, "lxml")
# use BeautifulSoup's findAll method to get a collection of all <a> tags with an href
# note: findAll also accepts a compiled regular expression as an attribute filter
# (e.g. href=re.compile(...)).  This would potentially allow the following loop to be
# written more compactly
all_links = soup.findAll("a", href=True)

# loop through all of the links in all_links, check if the link appears to be for a study page, and add to 
# study_nums if it is
study_nums = []
for link in all_links:
    m = re.search(r"details/([\d]+)", link["href"])
    if m:
        study_nums.append(m.group(1))

# convert the study_nums list to a numpy array and sort
valid_study_nums = np.asarray(study_nums).astype("int")
valid_study_nums = np.sort(valid_study_nums)

# Save a copy of the array.  (3ie's website doesn't always work, so it is important to have a backup.)
# Note: np.save appends the ".npy" extension to the file name.
np.save("/Users/douglasjohnson/Documents/code/datasets/study_nums", valid_study_nums)
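
The comment in the cell above wonders whether a regular expression can be passed to BeautifulSoup's search method. It can: attribute filters such as `href` accept a compiled pattern. The following is a minimal sketch of that variant, not the code used for the actual download. It runs against a small made-up HTML snippet (`sample_html`) rather than the live 3ie page, and the helper name `extract_study_nums` and the use of the built-in `html.parser` are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the registry listing page; the real markup may differ.
sample_html = """
<ul>
  <li><a href="/en/evidence/impact-evaluations/details/3193/">Study A</a></li>
  <li><a href="/en/evidence/impact-evaluations/details/12/">Study B</a></li>
  <li><a href="/en/about/">About</a></li>
  <li><a name="no-href">Anchor without href</a></li>
</ul>
"""

# Same pattern as in the loop above, compiled once so it can be reused.
STUDY_LINK = re.compile(r"details/(\d+)")

def extract_study_nums(html):
    """Return the sorted, de-duplicated study numbers linked from html."""
    soup = BeautifulSoup(html, "html.parser")
    # Only <a> tags whose href matches the pattern are returned.
    links = soup.find_all("a", href=STUDY_LINK)
    nums = {int(STUDY_LINK.search(a["href"]).group(1)) for a in links}
    return sorted(nums)

print(extract_study_nums(sample_html))
```

Filtering inside `find_all` removes the need for the `if m:` check, and collecting into a set drops duplicates if the same study is linked more than once on the page.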

## Download metadata for all studies
Iterate through the list of valid study numbers, download the metadata for each study, and then combine metadata for all studies into a dataframe.
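
Before the full loop, here is a small offline sketch of the core parsing step: turning the `<dt>` labels inside `<section class="evidence_meta">` into dictionary keys. It assumes each value sits in a `<dd>` element that follows its `<dt>`, which is a guess based on how the code below walks the tree. The snippet `sample_html`, its field names, and the helper `parse_meta` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up example of the metadata block; the field names are illustrative only.
sample_html = """
<section class="evidence_meta">
  <dl>
    <dt>Country</dt>
    <dd>India</dd>
    <dt>Sector</dt>
    <dd> Education </dd>
  </dl>
</section>
"""

def parse_meta(html):
    """Map each <dt> label to the text of the <dd> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    section = soup.find("section", class_="evidence_meta")
    if section is None:
        # Page has no metadata block.
        return {}
    meta = {}
    for dt in section.find_all("dt"):
        # Assumption: the value is the next <dd> sibling of the label.
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            meta[dt.get_text(strip=True)] = dd.get_text(strip=True)
    return meta

print(parse_meta(sample_html))
```

Compared with chaining `next_sibling` and `next_element`, looking up the sibling by tag name is less sensitive to whitespace between tags, and `get_text(strip=True)` stores plain strings rather than BeautifulSoup objects.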

In [1]:
# only execute these lines if you want to skip the download above and load the saved study numbers instead
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import numpy as np
valid_study_nums = np.load("/Users/douglasjohnson/Documents/code/datasets/study_nums.npy")

In [2]:
# create an empty list of studies
studies = []
# loop over each valid study number, get the metadata and save it in temp_dict,
# and then add temp_dict to studies
for idx, val in enumerate(valid_study_nums):
    try:
        print("loading data for study: " + str(val))
        url = "http://www.3ieimpact.org/en/evidence/impact-evaluations/details/" + str(val)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, "lxml")
        temp_dict = {}
        temp_dict['study_num'] = val

        # search the soup for <section class="evidence_meta"> and then get each of the elements from this section
        dt_elements_of_dl = soup.find('section', {'class':"evidence_meta"}).find('dl').findAll('dt')
        for dt in dt_elements_of_dl:
            temp_dict[dt.next_element] = dt.next_sibling.next_element.next_element
        
        # get the title from the h1 tag
        if soup.find('h1'):
            temp_dict['title'] = soup.find('h1').next_element
            
        # get the time from the time tag
        if soup.find('time'):
            temp_dict['year'] = soup.find('time').next_element
        
        # get the link to the paper from the <section class="evidence_source"> tag
        try:
            source = soup.find('section', class_="evidence_source")
            temp_dict['source'] = source.find('p').next_element
            temp_dict['source_link'] = source.find('a')['href']
        except Exception:
            print('no evidence source found')
        # The following code to load the methodology and findings doesn't work for webpages in which
        # there is a context or synopsis before the methodology section. Since there are only a few pages
        # like this, I am ignoring it for now.
        try:
            method_findings = soup.findAll('section', class_='summary_item')
            temp_dict['methodology'] = method_findings[0].find('p').next_element
            temp_dict['findings'] = method_findings[1].find('p').next_element
        except Exception:
            print("couldn't load methodology or findings")
        studies.append(temp_dict)
        
        # Every x-th study, save all the results so far so that they aren't lost if the connection goes down.
        # (I didn't end up using these files.)
        x = 100
        if idx % x == 0:
            df = pd.DataFrame(studies)
            df.to_csv("/Users/douglasjohnson/Documents/code/datasets/3ie/3ie_registry_" + str(idx))
        
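    except Exception as e:
        # report which study failed and why, then move on to the next one
        print("loading metadata failed for study " + str(val) + ": " + str(e))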
df = pd.DataFrame(studies)
df.to_csv("/Users/douglasjohnson/Documents/code/datasets/3ie/3ie_final.csv")

loading data for study: 3193
couldn't load methodology or findings
loading data for study: 3194
couldn't load methodology or findings
loading data for study: 3195
couldn't load methodology or findings
loading data for study: 3196
couldn't load methodology or findings
loading data for study: 3197
couldn't load methodology or findings
loading data for study: 3198
couldn't load methodology or findings
loading data for study: 3199
couldn't load methodology or findings
loading data for study: 3200
couldn't load methodology or findings
loading data for study: 3201
couldn't load methodology or findings
loading data for study: 3202
couldn't load methodology or findings
loading data for study: 3203
couldn't load methodology or findings
loading data for study: 3204
couldn't load methodology or findings
loading data for study: 3205
couldn't load methodology or findings
loading data for study: 3206
couldn't load methodology or findings
loading data for study: 3207
couldn't load methodology or find