# Scraping Wikipedia for demographic data
The data I am working with consists of absolute numbers of suicides, which would be easier to interpret alongside population and other demographic data. The easiest source for this is Wikipedia, where the data is available state-wise; the Indian census website provides it district-wise.

This Jupyter notebook contains the following steps I took to collect the data:
 1. Scrape [this Wikipedia article](https://en.wikipedia.org/wiki/States_and_union_territories_of_India) to get the list of demographic indicators available on Wikipedia
 2. Use the interact widget to manually select relevant articles without having to type in the names of each article I'm interested in
 3. Scrape each selected article and save its first table to a list
 4. Clean up the HTML tables and load them into a list of pandas dataframes
 5. Save the list of dataframes into an Excel file that I can refer to later.
 
## Step 1: Scrape root article to get list of indicators

In [1]:
import requests
from bs4 import BeautifulSoup

WIKIPEDIA_BASE_URL = 'https://en.wikipedia.org'
ROOT_ARTICLE = '/wiki/States_and_union_territories_of_India'

In [2]:
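def retrieve_page(article):
    """Fetch a Wikipedia article given its site-relative path, e.g. '/wiki/India'."""
    wikipedia_url = WIKIPEDIA_BASE_URL + article
    response = requests.get(wikipedia_url)

    # Report failures, but still return the response so the caller can inspect it
    if response.status_code != 200:
        print("Cannot fetch article {} from Wikipedia.\nStatus code: {}, Reason: {}"
              .format(article, response.status_code, response.reason))

    return response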

In [3]:
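response = retrieve_page(ROOT_ARTICLE)
soup = BeautifulSoup(response.text, "html5lib")
# Index 1 picks the second table on the page, which is where the
# indicator links sat when this notebook was written
table = soup.find_all('table')[1]
# class_=None keeps only anchors that carry no CSS class
links = table.find_all('a', class_=None)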

In [4]:
from collections import OrderedDict
categories = OrderedDict()
for link in links:
    categories[link.text] = link.get('href')
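The `href` values collected above are site-relative paths, and a table of links can also contain entries that are not articles (anchors, namespaced pages, or red links to pages that do not exist yet). Below is a minimal sketch of one way to filter them before building the selection list, using only the standard library. The helper name `keep_article_links` and the sample entries are hypothetical and not part of the original notebook.

```python
from urllib.parse import urljoin

WIKIPEDIA_BASE_URL = "https://en.wikipedia.org"

def keep_article_links(categories):
    """Return only the entries whose href looks like a regular article path."""
    kept = {}
    for title, href in categories.items():
        if not href or not href.startswith("/wiki/"):
            continue  # skips anchors, external links and red links
        if ":" in href[len("/wiki/"):]:
            # Skips namespaced pages such as /wiki/Template:...
            # (an approximation: a few real article titles contain colons)
            continue
        kept[title] = href
    return kept

# Made-up entries, for illustration only
sample = {
    "Population": "/wiki/Example_population_article",
    "v": "/wiki/Template:Example_navbox",
    "Missing page": "/w/index.php?title=Missing_page&action=edit&redlink=1",
    "Top": "#top",
}
article_links = keep_article_links(sample)
print(list(article_links))
print(urljoin(WIKIPEDIA_BASE_URL, article_links["Population"]))
```

In the notebook itself, the same filter could be applied to `categories` right before it is passed to the widget in Step 2.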


## Step 2: Manually select relevant articles to scrape

In [5]:
import ipywidgets as widgets
selections = widgets.SelectMultiple(
    options=list(categories.keys()),
    description='Pick a few categories',
    disabled=False
)

widgets.VBox([selections])

In [6]:
# Run this cell after making the selection in the widget above
wanted_categories = list(selections.value)

## Step 3: Scrape each article in the selected list and save the first table to a list

In [7]:
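tables = []
# Iterate in the order of wanted_categories so that tables[i] always
# corresponds to wanted_categories[i] when the two are paired up later
for name in wanted_categories:
    response = retrieve_page(categories[name])
    soup = BeautifulSoup(response.text, "html5lib")
    # The first table with class "wikitable" is the article's main data table
    tables.append(soup.find('table', class_='wikitable'))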

## Step 4: Clean up tables and load them into pandas dataframes

In [9]:
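from io import StringIO

import html5lib  # parser that pandas needs for flavor='bs4'
import bs4
import pandas as pd
from pandas import Series

df_list = []
for table in tables:
    for td in table.find_all(['td', 'th']):
        # Drop the first <span> in the cell (hidden sort keys and similar markup)
        if isinstance(td.span, bs4.element.Tag):
            td.span.decompose()
        # Drop the first <sup> in the cell, including any footnote link inside it
        if isinstance(td.sup, bs4.element.Tag):
            td.sup.decompose()
    # Turn header cells into ordinary cells so that header=0 uses the first row
    html = table.prettify().replace('th>', 'td>')
    # Recent pandas versions expect literal HTML to be wrapped in StringIO
    data = pd.read_html(StringIO(html), flavor='bs4', header=0)[0]
    df_list.append(data)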

In [10]:
df_list[0].head(3)

| | Rank | State or union territory | Population (2011 Census) (% of population of India) [15] | Decadal growth (2001–2011) | Rural pop. (%) | Urban pop. (%) | Area | Density | Sex ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Uttar Pradesh | 199,281,477 (16.49%) | 20.1% | 155,111,022 (77.72%) | 44,470,455 (22.28%) | 240,928 km (93,023 sq mi) | 828/km (2,140/sq mi) | 908 |
| 1 | 2 | Maharashtra | 112,372,972 (9.28%) | 16.0% | 61,545,441 (54.77%) | 50,827,531 (45.23%) | 307,713 km (118,809 sq mi) | 365/km (950/sq mi) | 946 |
| 2 | 3 | Bihar | 103,804,637 (8.58%) | 25.1% | 92,075,028 (88.70%) | 11,729,609 (11.30%) | 94,163 km (36,357 sq mi) | 1,102/km (2,850/sq mi) | 916 |
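Several columns in this output pack a count and a percentage into one string, so they cannot be used in arithmetic as they are. The following is a minimal sketch of one way to split such a column into numeric parts with pandas. The helper `split_count_and_share` and the shortened column names are assumptions made for this example; the two sample rows are copied from the output above.

```python
import pandas as pd

# Two rows taken from the output above; column names shortened for the example
sample = pd.DataFrame({
    "State": ["Uttar Pradesh", "Maharashtra"],
    "Population": ["199,281,477 (16.49%)", "112,372,972 (9.28%)"],
})

def split_count_and_share(column):
    """Split strings like '199,281,477 (16.49%)' into a count and a percentage."""
    parts = column.str.extract(r"^\s*([\d,]+)\s*\(([\d.]+)%\)")
    # Remove thousands separators before converting the count to an integer
    count = parts[0].str.replace(",", "", regex=False).astype("int64")
    share = parts[1].astype(float)
    return count, share

sample["population"], sample["share_pct"] = split_count_and_share(sample["Population"])
print(sample[["State", "population", "share_pct"]])
```

The integer conversion assumes every row matches the pattern. Rows that do not match would come back as missing values from `str.extract` and would need to be handled before the cast.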


## Step 5: Save the dataframes to an Excel file

In [11]:
df_series = Series(df_list, index = wanted_categories)

In [12]:
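excel_file = "India_Demographics.xlsx"

# The context manager saves and closes the workbook on exit;
# writer.save() is no longer available in pandas 2.0 and later
with pd.ExcelWriter(excel_file) as writer:
    # Write one sheet per selected category, named after the category
    for name, df in df_series.items():
        df.to_excel(writer, sheet_name=name)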
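Because the sheet names come straight from Wikipedia link text, they can break Excel's naming rules: a worksheet name may be at most 31 characters long, may not contain any of `[ ] : * ? / \`, and must be unique within the workbook regardless of case. Here is a minimal sketch of a sanitizer that could be applied to each name before writing. The helper `safe_sheet_name` and the sample names are hypothetical and not part of the original notebook.

```python
import re

# Characters Excel does not allow in worksheet names
_FORBIDDEN = re.compile(r"[\[\]:*?/\\]")

def safe_sheet_name(name, used):
    """Return a worksheet name that is valid in Excel and not already in `used`."""
    base = _FORBIDDEN.sub("_", name)[:31].strip() or "Sheet"
    candidate, counter = base, 1
    # Excel compares sheet names case-insensitively, so track lowercase forms
    while candidate.lower() in used:
        suffix = "_{}".format(counter)
        candidate = base[:31 - len(suffix)] + suffix
        counter += 1
    used.add(candidate.lower())
    return candidate

used_names = set()
names = [safe_sheet_name(n, used_names) for n in [
    "Population",
    "GDP/capita",
    "A very long category name that exceeds the limit",
    "population",
]]
print(names)
```

Each dataframe could then be written with `sheet_name=safe_sheet_name(category, used_names)` in place of the raw category name.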