### Web Scraper to download files from Census 2011 Website

While researching publicly available datasets in India, I decided to explore the Census 2011 website.

To quote the Census Organisation - "The Indian Census is the largest single source of a variety of statistical information on different characteristics of the people of India. The rich diversity of the people of India is truly brought out by the decennial census which has become one of the tools to understand and study India."

I wanted to explore and analyse the datasets on age structure and education status in India, and the different factors impeding education, from a public policy perspective. There were multiple tables and it was tedious to click and download them one by one. So I developed a web scraper that extracts and prepares the URLs of the files/tables related to age (C-13) and education (C-08), then downloads and saves them on the local system.

For illustrative purposes, the scraper was extended to download language-related content, including PDF files/reports.

I will be publishing further reports on different hypotheses in future.

**Step 1: Import modules/packages**

Requisite modules/packages were imported.
- requests to fetch response for the given URL
- BeautifulSoup to parse the response and extract links to download files
- os to create file path

In [46]:
import requests
from bs4 import BeautifulSoup
import os

**Step 2: census_url, getUrlOnWebPage Function**
- census_url is the link to the Population Enumeration Data webpage which has the links to all the different datasets. 
- A custom function, getUrlOnWebPage, was defined. It takes a URL as its argument, extracts the href of every anchor tag (a) on the webpage and returns them as a dictionary keyed by position.
- Regular expressions and string matching could have been used to extract the specific URLs. However, the structure of the website is relatively straightforward, and it was easier to visually inspect the dictionary of all URLs returned from the population enumeration page and select the links to download in the next step.
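
The regular-expression route mentioned above could look roughly like the sketch below. It assumes a dictionary in the same shape that getUrlOnWebPage returns. The helper name `filter_links` and the small sample dictionary are illustrative and not part of the original notebook.

```python
import re

def filter_links(href_dict, pattern):
    """Return only the entries of href_dict whose URL matches the regex pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return {key: url for key, url in href_dict.items() if regex.search(url)}

# Small sample in the same shape as the dictionary returned by getUrlOnWebPage
sample_links = {
    0: "c-13/DDW-0000C-13.xls",
    1: "c-13/DDW-0000C-13SC.xls",
    2: "Language-2011/Statement-1.pdf",
    3: "http://www.censusindia.gov.in/2011census/C-16.html",
}

# Keep Excel files only (.xls or .xlsx, any letter case)
excel_links = filter_links(sample_links, r"\.xlsx?$")
print(excel_links)
```

Because the original keys are preserved, a filtered entry can still be traced back to its position in the full dictionary.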

In [47]:
census_url = "http://www.censusindia.gov.in/2011census/population_enumeration.html"

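def getUrlOnWebPage(url):
    # Fetch the page and stop on an HTTP error instead of parsing an error page
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'lxml')
    # href=True skips anchors without an href, which would otherwise raise KeyError
    a_tag_list = soup.find_all("a", href=True)
    href_list = [link["href"] for link in a_tag_list]
    # Key each link by its position on the page so it can be selected by number
    href_dict = dict(enumerate(href_list))
    return href_dict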

In [48]:
# Returns all the links on census_url webpage as dictionary
href_dict_census_home = getUrlOnWebPage(census_url)

In [49]:
href_dict_census_home

{0: 'http://censusindia.gov.in/pca/cdb_pca_census/cd_block.html',
 1: 'http://www.censusindia.gov.in/2011-Documents/PCA_HL_2011_Release.xls',
 2: 'http://www.censusindia.gov.in/2011census/SC-ST/pca_state_distt_sc.xls',
 3: 'http://www.censusindia.gov.in/2011census/pca_sc/pca-sc.html',
 4: 'http://www.censusindia.gov.in/2011census/SC-ST/pca_state_distt_st.xls',
 5: 'http://www.censusindia.gov.in/2011census/pca_st/pca-st.html',
 6: 'http://www.censusindia.gov.in/2011-Documents/slum_data_census_2011.xls',
 7: 'http://www.censusindia.gov.in/2011census/PCA/PCA_Highlights/PCA_Data_highlight.html',
 8: 'http://www.censusindia.gov.in/pca/DDW_PCA0000_2011_Indiastatedist.xlsx',
 9: 'http://censusindia.gov.in/pca/pcadata/pca.html',
 10: 'http://www.censusindia.gov.in/2011census/PCA/PCA_OTH_0000_2011.xlsx',
 11: 'http://www.censusindia.gov.in/2011census/PCA/Primary Census Abstractother.pdf',
 12: 'http://www.censusindia.gov.in/2011census/A-1_NO_OF_VILLAGES_TOWNS_HOUSEHOLDS_POPULATION_AND_AREA.xlsx

In [96]:
# Link to C-13 State-wise Single Year Age Data selected using key 30
# all URLs on C-13 webpage returned as href_dict_district
href_dict_district = getUrlOnWebPage(href_dict_census_home[30])

In [140]:
href_dict_district

{0: 'c-13/DDW-0000C-13.xls',
 1: 'c-13/DDW-0000C-13SC.xls',
 2: 'c-13/DDW-0000C-13ST.xls',
 3: 'c-13/DDW-3500C-13.xls',
 4: 'c-13/DDW-3500C-13ST.xls',
 5: 'c-13/DDW-2800C-13.xls',
 6: 'c-13/DDW-2800C-13SC.xls',
 7: 'c-13/DDW-2800C-13ST.xls',
 8: 'c-13/DDW-1200C-13.xls',
 9: 'c-13/DDW-1200C-13ST.xls',
 10: 'c-13/DDW-1800C-13.xls',
 11: 'c-13/DDW-1800C-13SC.xls',
 12: 'c-13/DDW-1800C-13ST.xls',
 13: 'c-13/DDW-1000C-13.xls',
 14: 'c-13/DDW-1000C-13SC.xls',
 15: 'c-13/DDW-1000C-13ST.xls',
 16: 'c-13/DDW-0400C-13.xls',
 17: 'c-13/DDW-0400C-13SC.xls',
 18: 'c-13/DDW-2200C-13.xls',
 19: 'c-13/DDW-2200C-13SC.xls',
 20: 'c-13/DDW-2200C-13ST.xls',
 21: 'c-13/DDW-2600C-13.xls',
 22: 'c-13/DDW-2600C-13SC.xls',
 23: 'c-13/DDW-2600C-13ST.xls',
 24: 'c-13/DDW-2500C-13.xls',
 25: 'c-13/DDW-2500C-13SC.xls',
 26: 'c-13/DDW-2500C-13ST.xls',
 27: 'c-13/DDW-0700C-13.xls',
 28: 'c-13/DDW-0700C-13SC.xls',
 29: 'c-13/DDW-3000C-13.xls',
 30: 'c-13/DDW-3000C-13SC.xls',
 31: 'c-13/DDW-3000C-13ST.xls',
 32: 'c-13

**Step 3: baseUrl Function**
- The function baseUrl was defined to extract the base URL, i.e. everything up to and including the last "/". The relative paths of the different files/tables/reports are appended to this base URL to build the download links.

In [76]:
def baseUrl(url):
    base_url = (url.rsplit('/', 1)[0]) + "/"
    return base_url
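
As an alternative to splitting the string by hand, the standard library's `urllib.parse.urljoin` resolves a relative link against the page it was found on. The sketch below is not part of the original notebook. The helper name `absolute_urls` is hypothetical, and the page URL and relative path are taken from the C-16 output shown later in this notebook.

```python
from urllib.parse import urljoin

def absolute_urls(page_url, href_dict):
    """Resolve every href in href_dict against the page it was scraped from."""
    return [urljoin(page_url, href) for href in href_dict.values()]

page_url = "http://www.censusindia.gov.in/2011census/C-16.html"
links = {
    0: "C-16/DDW-C16-STMT-MDDS-0000.XLSX",                    # relative link
    1: "http://www.censusindia.gov.in/2011census/C-16.html",  # already absolute
}

for url in absolute_urls(page_url, links):
    print(url)
```

An absolute link passes through unchanged, whereas plain concatenation would prefix it with the base URL and produce an invalid address.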

**Step 4: Downloading C-13 Tables:** 
- These tables comprise State-wise Single Year Age Data.

In [92]:
c_13_age = href_dict_census_home[30]
base_url_c_13 = baseUrl(c_13_age)
# Prefix every relative path with the base URL to get the full download links
excel_url_list_c_13 = [(base_url_c_13 + value) for key, value in href_dict_district.items()]

#excel_url_list_c_13


**Step 5: Specifying Download Directory/Filepath**
- Download Directory/Filepath is specified and the files/tables are downloaded there.

In [None]:
download_dir = "C:\\Users\\Arunank\\Documents\\Data Science\\public_policy\\CensusData\\c_13_age"

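for link in excel_url_list_c_13:
    # Use the last part of the URL as the local file name
    filename = os.path.join(download_dir, link.rsplit('/', 1)[-1])
    response_excel = requests.get(link)
    # Write the raw bytes; download_dir must already exist
    with open(filename, 'wb') as file:
        file.write(response_excel.content)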
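
The loop above saves whatever the server returns, including error pages. A slightly more defensive variant is sketched below. It creates the target folder if needed and skips any response whose status is not 200. The function name `download_files` and its `get` parameter are hypothetical and not part of the original notebook. `get` stands for any callable that behaves like `requests.get`.

```python
import os
from urllib.parse import unquote

def download_files(url_list, download_dir, get):
    """Download each URL into download_dir and return the URLs that failed.

    `get` is any callable that takes a URL and returns an object with
    `status_code` and `content` attributes, for example `requests.get`.
    """
    os.makedirs(download_dir, exist_ok=True)  # create the folder if it is missing
    failed = []
    for link in url_list:
        # Last path segment, with %20-style escapes decoded, becomes the file name
        filename = os.path.join(download_dir, unquote(link.rsplit("/", 1)[-1]))
        response = get(link)
        if response.status_code != 200:
            failed.append(link)  # do not save error pages as data files
            continue
        with open(filename, "wb") as file:
            file.write(response.content)
    return failed
```

In this notebook it could be called as `download_files(excel_url_list_c_13, download_dir, requests.get)`; the returned list shows which links need a retry.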

**Step 6: Downloading C-08 Tables:**

- These tables contain data on Educational Level By Age And Sex For Population Age 7 And Above (Total, SC/ST) (India & States/UTs-District Level).
- Steps 1 to 5 above are repeated here for the C-08 tables.

In [74]:
c_08_education = href_dict_census_home[63]

In [98]:
href_dict_district_c_08 = getUrlOnWebPage(c_08_education)

In [99]:
#href_dict_district_c_08

In [100]:
base_url_c_08 = baseUrl(c_08_education)

In [101]:
base_url_c_08

'http://www.censusindia.gov.in/2011census/C-series/'

In [102]:
excel_url_list_c_08 = [(base_url_c_08 + value) for key, value in href_dict_district_c_08.items()]

In [90]:
#excel_url_list_c_08

In [88]:
download_dir_education = "C:\\Users\\Arunank\\Documents\\Data Science\\public_policy\\CensusData\\c_08_education"

In [94]:
for link in excel_url_list_c_08:
    filename_education = os.path.join(download_dir_education, link.rsplit("/", 1)[-1])
    response_excel_education = requests.get(link)
    with open(filename_education, 'wb') as file_education:
        file_education.write(response_excel_education.content)

**Step 7: Downloading Language related content:**

- It includes different papers/statements with insights and the C-16 language table.

In [137]:
c_16_language = href_dict_census_home[76]
href_dict_c_16 = getUrlOnWebPage(c_16_language)
base_url_c_16 = baseUrl(c_16_language)


In [141]:
href_dict_c_16.items()

dict_items([(0, 'C-16_25062018_NEW.pdf'), (1, 'Language-2011/General Note.pdf'), (2, 'Language-2011/Statement-1.pdf'), (3, 'Language-2011/Statement-2.pdf'), (4, 'Language-2011/Statement-3.pdf'), (5, 'Language-2011/Statement-4.pdf'), (6, 'Language-2011/Statement-5.pdf'), (7, 'Language-2011/Statement-6.pdf'), (8, 'Language-2011/Statement-7.pdf'), (9, 'Language-2011/Statement-8.pdf'), (10, 'Language-2011/Statement-9.pdf'), (11, 'Language-2011/Part-A.pdf'), (12, 'Language-2011/Part-B.pdf'), (13, 'http://www.censusindia.gov.in/2011census/C-16.html')])

In [142]:
# Selecting all URLs except key 13 
# which is link to the webpage with language tables
pdf_url_list_c_16 = [(base_url_c_16 + value) for key, value in href_dict_c_16.items() if key != 13]

In [None]:
pdf_url_list_c_16

In [117]:
download_dir_language = "C:\\Users\\Arunank\\Documents\\Data Science\\public_policy\\CensusData\\c_16_language"

**Different pdf files/reports are downloaded.**

In [119]:
for link in pdf_url_list_c_16:
    filename_language = os.path.join(download_dir_language, link.rsplit("/",1)[-1])
    response_pdf_language = requests.get(link)
    with open(filename_language, 'wb') as file_language:
        file_language.write(response_pdf_language.content)

**Link to C-16 Tables webpage is extracted.**

In [122]:
c_16_url_to_tables = href_dict_c_16[13]

In [123]:
c_16_url_to_tables

'http://www.censusindia.gov.in/2011census/C-16.html'

**All URLs on C-16 webpage extracted. These contain relative links to language tables.**

In [126]:
c_16_excel_dict = getUrlOnWebPage(c_16_url_to_tables)

In [128]:
#c_16_excel_dict

In [133]:
base_url_c_16_excel = baseUrl(c_16_url_to_tables)

In [134]:
base_url_c_16_excel

'http://www.censusindia.gov.in/2011census/'

In [135]:
excel_url_list_c_16 = [(base_url_c_16_excel + value) for key, value in c_16_excel_dict.items()]

**URL paths to the language-related Excel tables are created.**

In [143]:
excel_url_list_c_16

['http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-0000.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-3500.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-2800.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-1200.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-1800.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-1000.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-0400.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-2200.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-2600.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-2500.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-0700.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDDS-3000.XLSX',
 'http://www.censusindia.gov.in/2011census/C-16/DDW-C16-STMT-MDD

**Language-related tables are downloaded.**

In [144]:
for link in excel_url_list_c_16:
    filename_language = os.path.join(download_dir_language, link.rsplit("/",1)[-1])
    response_excel_language = requests.get(link)
    with open(filename_language, 'wb') as file_language:
        file_language.write(response_excel_language.content)