## Scraping the Syrian War Timeline off of Wikipedia

Wikipedia has one of the richest timelines of the Syrian Civil War, including references to news articles and other sources that can be used to verify each entry. The timeline, however, is not structured in a way that makes it easy to export to an Excel sheet or another analysis package.

At the heart of this problem is the fact that there is a many-to-many relationship between the references on each page and the statements that cite them. This script uses the BeautifulSoup library and Python 3.6 to scrape the data from the Wikipedia pages and parse it into a SQLite database with a many-to-many structure, allowing for flexible querying of the data and exporting into a format suitable for analysis in an external application.
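
To make the many-to-many structure concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module rather than the `dataset` library the script relies on. The table names match the ones the script creates; the reduced column set and the sample rows are made up for illustration.

```python
import sqlite3

# In-memory database for the sketch; the script itself writes to a file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE statements (id INTEGER PRIMARY KEY, statement TEXT);
    CREATE TABLE citations  (id INTEGER PRIMARY KEY, external_link TEXT);
    -- Join table: one row per (statement, citation) pair.
    CREATE TABLE citations_statements (
        statement_id INTEGER REFERENCES statements(id),
        citation_id  INTEGER REFERENCES citations(id)
    );
""")

# Hypothetical sample data: statement 1 cites two sources,
# and source 2 is shared with statement 2.
conn.executemany("INSERT INTO statements VALUES (?, ?)",
                 [(1, "Statement A"), (2, "Statement B")])
conn.executemany("INSERT INTO citations VALUES (?, ?)",
                 [(1, "http://example.com/a"), (2, "http://example.com/b")])
conn.executemany("INSERT INTO citations_statements VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

def links_for_statement(conn, statement_id):
    """Return the external links cited by one statement."""
    rows = conn.execute(
        """SELECT c.external_link
           FROM citations c
           JOIN citations_statements cs ON cs.citation_id = c.id
           WHERE cs.statement_id = ?
           ORDER BY c.id""",
        (statement_id,),
    )
    return [r[0] for r in rows]

print(links_for_statement(conn, 1))
```

Because the relationship lives in the join table, the same query shape works in the other direction (all statements that cite a given source) by swapping the roles of the two tables.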

This script was created by Clay Heaton, for Harvard's FXB Center for Health and Human Rights, as part of The Lancet's Commission on Syria in July of 2017. More information can be found [at this link](https://epicenter.wcfia.harvard.edu/blog/documenting-burden-war-syrians).

To run this script, you need to be running Python 3.5 or greater and have the following libraries installed and accessible to your Jupyter Notebook:

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [arrow](http://arrow.readthedocs.io/en/latest/)
- [dateparser](https://dateparser.readthedocs.io/en/latest/)
- [dataset](http://dataset.readthedocs.io/en/latest/)
- [requests](https://requests.readthedocs.io/en/latest/)

Note that variations between the scraped pages make it impractical to reliably extract the date from every timeline statement. The script makes several passes to extract each date, but ultimately the research assistants involved with the project will review each citation and timeline statement for veracity and accuracy, correcting malformed dates in the process.
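
The "several passes" idea can be sketched as a chain of fallbacks that returns nothing when every attempt fails, leaving the value for manual review. This sketch uses the standard library's `datetime.strptime` instead of the `dateparser` and `arrow` calls the script uses. The format list and the helper name `parse_date_guess` are assumptions made for illustration.

```python
from datetime import datetime

# Candidate formats, tried in order. These are illustrative guesses,
# not the formats the script itself relies on.
DATE_FORMATS = ["%d %B %Y", "%B %d %Y", "%B %Y"]

def parse_date_guess(text):
    """Return an ISO date string (YYYY-MM-DD), or None if no format matches."""
    if not text:
        return None
    # Drop commas and collapse runs of whitespace before matching.
    cleaned = " ".join(text.replace(",", " ").split())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next format
    return None  # left for a human reviewer to correct
```

Returning `None` rather than a best guess keeps malformed dates easy to find during the manual review.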

In [None]:
from bs4 import BeautifulSoup
import dataset
import requests
import arrow
import hashlib
import dateparser

db = dataset.connect("sqlite:///syria_timeline.sqlite")

months = ["January","February","March","April","May","June","July","August",
          "September","October","November","December"]

In [None]:
try:
    db['pages'].drop()
    db['citations'].drop()
    db['statements'].drop()
    db['citations_statements'].drop()
except:
    pass

tab_pages = db['pages']
tab_citations = db['citations']
tab_statements = db['statements']
tab_join = db['citations_statements']

In [None]:
pages = [
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(January–April_2011)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(May–August_2011)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(September–December_2011)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(January–April_2012)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(May–August_2012)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(September–December_2012)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(January–April_2013)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(May–December_2013)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(January–July_2014)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(August–December_2014)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(January–July_2015)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(August–December_2015)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(January–April_2016)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(May–August_2016)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(September–December_2016)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(January–April_2017)",
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(May_2017–present)"
]

for page in pages:
    rec = {"url":page,"year":page[-5:-1]}
    if rec['year'] == 'sent':
        # The "(May_2017–present)" URL ends in "present"; keep the year a string like the others
        rec['year'] = '2017'
    rec_id = tab_pages.insert(rec)
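
The loop above takes the year from a fixed slice of each URL and special-cases the final page. As an alternative sketch (not what the script does), a regular expression can pull the last four-digit year out of any of these URLs. The helper name `year_from_url` is hypothetical.

```python
import re

def year_from_url(url):
    """Return the last four-digit year found in a timeline URL, as a string."""
    years = re.findall(r"(?:19|20)\d{2}", url)
    return years[-1] if years else None

print(year_from_url(
    "https://en.wikipedia.org/wiki/Timeline_of_the_Syrian_Civil_War_(May_2017–present)"
))
```

This avoids the `'sent'` special case, at the cost of assuming that the last year in the title is the one wanted.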

In [None]:
def citation_record_from_bottom(file_id,soup_element):
    """
    Send in a BeautifulSoup li tag from the References list and return a
    dict that can be inserted into the citations table in the local db.
    """
    reference_record = {}

    # The id on the reference links the citations to the reference
    f_id = soup_element.get('id')
    reference_record['citation_anchor'] = f_id

    # External link in citation
    try:
        f_ex_link = soup_element.find(class_='external').get('href')
        reference_record['external_link'] = f_ex_link
    except:
        reference_record['external_link'] = None

    # Reference Text as a larger chunk
    try:
        f_ref_text = soup_element.find('span',class_='reference-text')
        reference_record['citation_text'] = f_ref_text.text.replace('\xa0','')
    except:
        reference_record['citation_text'] = None
    
    reference_record['page_id'] = file_id
    return reference_record

def get_reference_link_from_citation(citation):
    try:
        return list(citation.children)[0].get('href')[1:]
    except:
        print("Error getting citation link")
        return None

In [None]:
def extract_dd_date(elem):
    if elem.parent.name == 'dd':
        try:
            while elem.name != 'dl':
                elem = elem.parent
                if elem.name == 'dl':
                    for e in elem.previous_siblings:
                        if e and e.name == 'ul':
                            try:
                                date = list(list(e.children)[1].children)[0].text
                                break
                            except:
                                date = list(list(e.children)[1].children)[0]
                                break
                    break
                    
            if any(month in date for month in months):
                return date
        except:
            return None

In [None]:
def find_nearest_date(elem, counter=0):

    date = None
    
    # dd elements often have the date as the first child
    # of the previous sibling to their parent element
    if elem.parent.name == 'dd':
        date = extract_dd_date(elem)
        if date and any(month in date for month in months):
            return date
    
    # Otherwise try the first child of the parent, then the list that
    # precedes the grandparent
    try:
        date = list(elem.parent.children)[0].text
    except:
        try:
            date = list(list(elem.parent.parent.previous_sibling.previous_sibling.children)[1].children)[0].text
        except:
            date = None
    
    if date and any(month in date for month in months):
        return date

    if any(month in elem for month in months):
        return elem.text
    
    for child in elem.children:
        if any(month in child for month in months):
            return child
    
    for sibling in elem.previous_siblings:
        if any(month in sibling for month in months):
            return sibling

        try:
            for child in sibling.children:
                if any(month in child for month in months):
                    return child
        except:
            pass
        
    # We haven't found anything if we are here, so retry one level up
    parent = elem.parent
    if parent is not None and parent.parent is not None:
        return find_nearest_date(parent,counter+1)
    return None

def extract_li_date(ref):
    try:
        ref = ref.parent.parent.previous_sibling.previous_sibling
        t = list(ref.children)[0].text.strip()
        return t
    except:
        return None        

def extract_h3_date(ref,element_type='h3'):
    """
    Walk backwards through the siblings of ref's parent to the nearest
    heading of the given type and return the date from its headline.
    """
    ref = ref.parent
    try:
        while ref.name != element_type:
            ref = ref.previous_sibling
            if ref is None:
                # Ran out of siblings without finding a heading
                return None

        t = ref.find(class_='mw-headline').text
        # If the headline holds a range, keep only the part before the dash
        for dash in ('–','-'):
            if dash in t:
                return t.split(dash)[0].strip()
        return t

    except Exception:
        return None

def extract_h4_date(ref,element_type='h4'):
    # Same search as extract_h3_date, defaulting to h4 headings
    return extract_h3_date(ref,element_type)

In [None]:
def build_timeline_element(ref,year,use_alt_date=False):
    row = {}
    row['statement'] = ref.parent.text
    row['year'] = year
    try:
        if use_alt_date is True:
            d = extract_h4_date(ref)
            
            if d is None:
                d = extract_h3_date(ref)
                
            if d is None:
                d = extract_li_date(ref)
        else:
            d = find_nearest_date(ref)
        
        row['date_guess'] = d + " " + str(row['year'])
    except:
        row['date_guess'] = None
    row['date_guess_formatted'] = ""
    row['ref_number_in_statement'] = ref.text
    try:
        row['date_guess_formatted'] = arrow.get(dateparser.parse(row['date_guess'],date_formats=['%d %B %Y'])).format("YYYY-MM-DD")
        if arrow.get(row['date_guess_formatted']).date().year > int(row['year']):
            row['date_guess_formatted'] = None
    except:
        pass
    
    # Hash the content for de-duplication
    h = hashlib.sha256()
    h.update(row['statement'].encode())
    row['hash'] = h.hexdigest()
    
    return row
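
The hash matters because a statement followed by several `<sup>` references is visited once per reference, and each visit produces the same statement text. A minimal sketch of that grouping follows. The helper names and the sample text are hypothetical.

```python
import hashlib

def statement_hash(text):
    """SHA-256 hex digest of the statement text, used as a de-duplication key."""
    return hashlib.sha256(text.encode()).hexdigest()

def group_by_statement(pairs):
    """
    pairs: iterable of (statement_text, ref_number) tuples, one per citation.
    Returns a dict keyed by hash, holding the text once and every ref number.
    """
    grouped = {}
    for text, ref_number in pairs:
        key = statement_hash(text)
        if key not in grouped:
            grouped[key] = {"statement": text, "refs": []}
        grouped[key]["refs"].append(ref_number)
    return grouped

# Made-up example: the first statement is seen twice, once per reference.
sample = [
    ("A happened.[1][2]", "[1]"),
    ("A happened.[1][2]", "[2]"),
    ("B happened.[3]", "[3]"),
]
groups = group_by_statement(sample)
print(len(groups))
```

Each group corresponds to one row in the statements table, and each entry in `refs` to one row in the join table.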

In [None]:
def process_page(ref_page_record):
    k = ref_page_record['id']
    v = ref_page_record['url']
    page_year = ref_page_record['year']

    response = requests.get(v)
    content = response.content
    soup = BeautifulSoup(content,'html.parser')
    
    # Isolate the citations at the bottom of the page
    ref_ol = soup.find(class_="references")
    references = []

    # Isolate each individual element in the citations
    # and create a database entry
    for elem in ref_ol.find_all('li'):
        references.append(citation_record_from_bottom(k,elem))
        
    # Try inserting them into the database
    tab_citations.insert_many(references)
    
    # Insert the statements
    citations = soup.find_all('sup',class_="reference")
    
    # Used to deduplicate
    page_citations = {}
    
    for citation in citations:
        join_link = get_reference_link_from_citation(citation)
        if k == 1:
            statement = build_timeline_element(citation,page_year)
        else:
            statement = build_timeline_element(citation,page_year,True)

        h = statement['hash']
        
        try:
            statement_link = v + "#" + citation.get('id')
        except:
            statement_link = None
            
        statement['statement_url'] = statement_link
        
        if h not in page_citations.keys():
            page_citations[h] = {}
            statement_ref_number = statement['ref_number_in_statement']
            del statement['ref_number_in_statement']
            del statement['hash']
            
            statement_id = tab_statements.insert(statement)
            page_citations[h]['id'] = statement_id
            page_citations[h]['links'] = [(k,join_link,statement_ref_number, statement_id)]
            page_citations[h]['statement'] = statement
        else:
            page_citations[h]['links'].append((k,join_link,statement['ref_number_in_statement'],page_citations[h]['id']))
            
        
    # At this point, all of the statements should be in the database.
    # Collect the links and build the records for the
    # citations_statements join table.
    join_records = []
    for h in page_citations.keys():
        links = page_citations[h]['links'] # list of tuples
        for l in links:
            join_rec = {"page_id":l[0],"citation_anchor":l[1],"text_ref_number":l[2],"statement_id":l[3]}
            reference = tab_citations.find_one(page_id=join_rec['page_id'],
                                                citation_anchor=join_rec['citation_anchor'])
            if reference is not None:
                join_rec['citation_id'] = reference['id']
                del join_rec['page_id']
                del join_rec['citation_anchor']
                
                join_records.append(join_rec)
            else:
                # Odd footnote statements may not have an entry in the citations table
                continue

    tab_join.insert_many(join_records)

In [None]:
records = tab_pages.find()

In [None]:
for r in records:
    process_page(r)
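
Once every page has been processed, the data can be flattened for analysis elsewhere. The sketch below is one possible export, not part of the original script: it uses the built-in `sqlite3` and `csv` modules and writes one row per statement and citation pair. The function name `export_timeline` and the output file name in the comment are assumptions. The table and column names are the ones the script populates.

```python
import csv
import sqlite3

EXPORT_QUERY = """
    SELECT s.date_guess_formatted, s.statement, c.external_link, c.citation_text
    FROM statements s
    JOIN citations_statements cs ON cs.statement_id = s.id
    JOIN citations c ON c.id = cs.citation_id
    ORDER BY s.date_guess_formatted, s.id
"""

def export_timeline(conn, out_file):
    """Write one CSV row per (statement, citation) pair; return the row count."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "statement", "external_link", "citation_text"])
    count = 0
    for row in conn.execute(EXPORT_QUERY):
        writer.writerow(row)
        count += 1
    return count

# Possible usage (file names are assumptions):
#
# conn = sqlite3.connect("syria_timeline.sqlite")
# with open("timeline.csv", "w", newline="", encoding="utf-8") as f:
#     export_timeline(conn, f)
# conn.close()
```

A statement with several citations appears on several rows, which suits spreadsheet filtering. Grouping by statement instead would need an aggregate such as SQLite's `group_concat`.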