# Scraping the BPUT Results Website

## _[bputexam.in](http://www.bputexam.in) sucks, we all know that._ 

This is a utility to fetch all marks and (possibly) preserve them in an "air-tight, sterilized, transparent container", free from "viruses".

Simple interactive scraping project using lxml and requests, nothing fancy.

Perhaps one day you will use this data for analytics and find the subject that sucks the most. Feel free to use this dump for whatever you want. The code is MIT-licensed; the terms are at the end of this notebook.

In [1]:
# import 'em all!
import requests
import lxml.html
from ipywidgets import interact_manual

## A bit about scraping sites using ASP.NET Forms

It's a nightmare, made worse by the fact that these people use Telerik.UI controls, which expose a bunch of hidden state of their own.

[Scraping Websites Based on ViewStates with Scrapy](https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition/), from the ScrapingHub blog, explains a bit about the various ASP.NET hidden inputs and their significance.

### Submitting forms

To simulate a form submission, you need to submit all the ASP.NET hidden fields: `__VIEWSTATE`, `__VIEWSTATEENCRYPTED`, and `__VIEWSTATEGENERATOR`.

Moar doks:

 - https://msdn.microsoft.com/en-us/library/ms972976.aspx
 - https://msdn.microsoft.com/en-us/library/bb386448.aspx
 

_These folks use view state encryption, so it's not possible to tamper with it; we have to pass the view state back as is._

After a bit of fiddling with Firefox's network debug tool, I came up with the "magic fields".
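
As a rough illustration of what those hidden fields look like to a scraper, here is a minimal sketch that collects them with lxml. The sample page, its field values, and the `hidden_fields` helper are made up for this example; the real BPUT page carries much larger (and encrypted) values.

```python
import lxml.html

# Hypothetical, heavily shortened stand-in for an ASP.NET WebForms page.
SAMPLE_PAGE = """
<html><body>
<form method="post" action="StudentResult.aspx" id="form1">
  <input type="hidden" name="__VIEWSTATE" value="opaque-encrypted-blob" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="CA0B0334" />
  <input type="hidden" name="__VIEWSTATEENCRYPTED" value="" />
  <input type="text" name="txtRegNo" value="" />
</form>
</body></html>
"""


def hidden_fields(page_html):
    """Return {name: value} for every named hidden input in the first form."""
    form = lxml.html.fromstring(page_html).forms[0]
    return {
        el.get("name"): el.get("value", "")
        for el in form.xpath('.//input[@type="hidden"][@name]')
    }


state = hidden_fields(SAMPLE_PAGE)
for name, value in state.items():
    print(f"{name} = {value!r}")
```

Whatever comes out of this dict goes back to the server unchanged on the next POST; only the visible fields are edited.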

### Triggering an onClick (POST-back), server-side

Yuck, why? But it's ASP.NET, so triggering an onClick takes a special POST request (a "postback"). The control that was clicked is named in the special parameter `__EVENTTARGET`, whose value is an ASP.NET ControlID.

Moar doks:

 - https://www.codeproject.com/Articles/134614/Way-To-Know-Which-Control-Has-Raised-PostBack
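
In the browser, ASP.NET's `__doPostBack(eventTarget, eventArgument)` JavaScript helper fills in these parameters before submitting the form. A scraper can mimic that by setting them by hand. The sketch below is one way to do it; `postback_payload` and the sample field values are hypothetical, while the control ID is the one used later in this notebook.

```python
def postback_payload(form_fields, event_target, event_argument="", drop=()):
    """Build a POST body that mimics a click on the control `event_target`."""
    payload = dict(form_fields)      # copy, so the hidden state stays untouched
    for name in drop:                # submit buttons we are not "clicking"
        payload.pop(name, None)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


# Made-up form fields standing in for what lxml reads from the page.
fields = {
    "__VIEWSTATE": "opaque-encrypted-blob",
    "txtRegNo": "1234567890",
    "btnView": "View",
    "btnReset": "Reset",
}

payload = postback_payload(
    fields,
    "gvResultSummary$ctl02$lnkViewResult",
    drop=("btnView", "btnReset"),
)
print(payload)
```

The resulting dict is what would be handed to `requests.Session.post` as the form data.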
 
Another trick concerns forms with multiple submit buttons. By default, lxml's form `fields` include every submit button, so we need to delete the ones we are not clicking before making the request.
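
To see the submit-button behaviour in isolation, here is a small sketch on a made-up form. The button names mirror the ones used in this notebook, but the HTML itself is an assumption, not a copy of the BPUT page.

```python
import lxml.html

# Hypothetical form with two submit buttons, like the BPUT input page.
SAMPLE_PAGE = """
<html><body>
<form method="post" action="StudentResult.aspx">
  <input type="text" name="txtRegNo" value="" />
  <input type="submit" name="btnView" value="View" />
  <input type="submit" name="btnReset" value="Reset" />
</form>
</body></html>
"""

form = lxml.html.fromstring(SAMPLE_PAGE).forms[0]
form.fields["txtRegNo"] = "1234567890"

# `fields` lists every named input, submit buttons included.
all_fields = {k: v for k, v in form.fields.items()}
print(sorted(all_fields))

# Keep only the button we pretend to click; drop every other submit input.
clicked = "btnView"
submit_names = {el.name for el in form.inputs if el.get("type") == "submit"}
payload = {
    k: v for k, v in all_fields.items()
    if k not in submit_names or k == clicked
}
print(payload)
```

Sending both buttons would be ambiguous for the server, since a real browser only ever submits the one that was clicked.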

In [2]:
# URL for the student result page
url = 'http://www.bputexam.in/StudentSection/ResultPublished/StudentResult.aspx'

# BPUT Website needs a browser session, so we create a session and let `requests`
# manage the cookie jar
s = requests.Session()

In [3]:
# Get initial data
res = s.get(url)
input_page = lxml.html.fromstring(res.text)
input_page.make_links_absolute(url)
# Map each exam session label to the value the form expects for it
exam_options = {
    option.text: option.attrib['value']
    for option in input_page.cssselect('#ddlSession > option')
}

@interact_manual(reg_no='', exam=exam_options)
def show_results(reg_no, exam):
    """Fetch and print the marks for one registration number and exam session."""
    
    # Fill the first form
    student_info_form = input_page.forms[0]
    student_info_form.fields['ddlSession'] = exam
    # BPUT does not validate date of birth
    student_info_form.fields['dpStudentdob'] = '2018-02-08'
    student_info_form.fields['txtRegNo'] = reg_no

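    # Copy the fields into a plain dict and drop the Reset button, so it is
    # not sent along with the button we are "clicking"
    fields = {k: v for k, v in student_info_form.fields.items()}
    del fields['btnReset']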
    
    # Submit the input page
    res = s.post(student_info_form.action,fields)
    
    intermediate_page = lxml.html.fromstring(res.text)
    intermediate_page.make_links_absolute(url)
    
    result_params = { k:v for k,v in intermediate_page.forms[0].fields.items() }
    # From browser inspector, it's the ControlID. It does not change between
    # page refreshes
    result_params['__EVENTTARGET'] = 'gvResultSummary$ctl02$lnkViewResult'
    result_params['__EVENTARGUMENT'] = ''
    del result_params['btnView']
    del result_params['btnReset']
    
    # Submit the summary page
    result_res = s.post(intermediate_page.forms[0].action,result_params)
    result_page = lxml.html.fromstring(result_res.text)
    results_table = result_page.cssselect('#gvViewResult')[0]
    
    # Extract data from the table
    rows = results_table.cssselect('tr')
    header = [str(th.text_content()).strip() for th in rows[0].cssselect('th')]
    data = [[str(td.text_content()).strip() for td in row.cssselect('td')] for row in rows[1:]]
    summary = data[-1]
    marks = data[:-1]
    
    name = result_page.cssselect('#lblName')[0].text_content()
    college = result_page.cssselect('#lblCollege')[0].text_content()
    exam_name = result_page.cssselect('#lblResultName')[0].text_content()
    lbl_branch = result_page.cssselect('#lblBranch')[0].text_content()
    
    # show results
    print('Name:', name)
    print('College:', college)
    
    for sub in marks:
        print(f'{sub[2]} ({sub[4]}) : {sub[5]}')
        
    print(summary[4])
    print(summary[5])
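
If the goal is to preserve the marks rather than just print them, one option is to pair each row with the table header and dump the result as CSV. This is only a sketch: the column labels and rows below are placeholders, not the real BPUT table, and in the notebook they would come from the `header` and `marks` lists built above.

```python
import csv
import io

# Hypothetical column labels and rows standing in for `header` and `marks`.
header = ["Sl", "Code", "Subject", "Type", "Credit", "Grade"]
marks = [
    ["1", "SUB101", "Sample Subject A", "Theory", "4", "A"],
    ["2", "SUB102", "Sample Subject B", "Lab", "2", "B"],
]

# Pair every cell with its column label so later code can use names, not indexes.
records = [dict(zip(header, row)) for row in marks]

# An in-memory buffer keeps the sketch side-effect free;
# swap it for open("marks.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```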

## LICENSE
Copyright 2018 Amitosh Swain Mahapatra

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.