## Web Scraping

In this notebook, we scrape http://www.ipaidabribe.com, an initiative in India to tackle corruption by harnessing the collective energy of citizens. IPaidABribe.com is a citizen-driven mechanism for tracking bribe payments, as well as instances in which people resisted paying bribes.

We will extract the following information from each bribe report on the website:
1. Title of the report (Transaction)
2. Number of views of the report
3. Amount paid as bribery
4. City
5. State
6. Time stamp
7. Name of the authority that took the bribe (Police, Transport, etc.)

We extract the above information from around 1200 reports.

In [8]:
## Libraries 
from bs4 import BeautifulSoup as soup  # HTML parser
from urllib.request import urlopen as uReq
import requests

We start by learning how to extract information from a single report, and then use a for loop to apply the same steps to 1200 reports.

In [18]:
## url
my_url = "http://www.ipaidabribe.com/reports/all#gsc.tab=0"

In [19]:
#opening a connection
uClient = uReq(my_url)
uClient

<http.client.HTTPResponse at 0x110b3e400>

In [20]:
## Read the source html code
page_html= uClient.read()

## Close the connection 
uClient.close()
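The open, read, and close steps above can also be written with a `with` block, which closes the connection even if reading fails. A minimal sketch, assuming a hypothetical helper name `fetch_html` and an arbitrary 10-second timeout (neither appears in the original notebook):

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url` (hypothetical helper)."""
    # The response object is a context manager, so the connection is
    # closed automatically when the block exits.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage (requires network access):
# page_html = fetch_html("http://www.ipaidabribe.com/reports/all#gsc.tab=0")
```

The timeout keeps a stalled request from hanging the notebook; the value of 10 seconds is only a placeholder.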

In [21]:
## seeing how it read the html code
page_html



The raw HTML is hard to read in this form. Let's use BeautifulSoup to parse it.

In [22]:
# Beautiful Soup
page_soup = soup(page_html,"lxml")

In [23]:
# Visualizing 
page_soup.contents

['html', <html lang="en">
 <head>
 <meta charset="utf-8"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
 <meta content="Initiative to tackle corruption by harnessing the collective energy of citizens. IPaidABribe.com is a citizen driven mechanism for tracking bribe payment activity, as also instances of when people resisted bribe payments." name="description"/>
 <meta content="" name="author"/>
 <meta content="dlojv8zdoMvJ_mCAaaZnnK2rEGof11H6mdt2GHAPTR0" name="google-site-verification"/>
 <link href="http://www.ipaidabribe.com/assets/images/ipab-favicon.ico" rel="icon"/>
 <title>I Paid A Bribe | All Reports</title>
 <!-- bootstrap core CSS -->
 <link href="http://www.ipaidabribe.com/assets/js/plugins/bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
 <!-- chosen core CSS for form select ele

In [24]:
# Verifying 
page_soup.body.div

<div style="display:none">
<script src="//www.googleadservices.com/pagead/conversion.js" type="text/javascript"></script>
</div>

In [25]:
## Each <section class="ref-module-paid-bribe"> element holds one report; collect all of them on the page
containers = page_soup.find_all("section", {"class":"ref-module-paid-bribe"})

In [26]:
## Number of reports in a page
len(containers)

10

The containers list has 10 elements, one for each report on the page. Now we extract the fields one by one.

In [27]:
## First Report 
containers[0]

<section class="ref-module-paid-bribe">
<ul class="overview clearfix">
<li class="label ref-font-color"><i class="fa fa-bullseye ref-font-color"></i>I Paid A Bribe
                    </li>
<li class="time-span"><i class="fa fa-clock-o"></i>2 days ago
                    </li>
<li class="views"><i class="fa fa-eye"></i>132 views</li>
</ul>
<h3 class="heading-3">
<a href="http://www.ipaidabribe.com/reports/paid/police-verification-for-passport-12" title="Police verification for passport ">
                      Police verification for passport 
                    </a>
</h3>
<ul class="department clearfix">
<li class="name">
<a href="http://www.ipaidabribe.com/reports/all/all-cities/police/all-amount" title="Police">Police</a>
</li>
<li class="transaction">
<a href="http://www.ipaidabribe.com/reports/all/all-cities/police/all-amount" title="Background or Personal Verification">Background or Personal Verification</a>
</li><li class="paid-amount">
<span>Paid INR 1,000
                    

In [28]:
## Views (take the first whitespace-separated token, so counts of any length are captured)
views = containers[0].find("li", {"class":"views"}).contents[1].split()[0]
views

'132'

In [29]:
## Amount paid
containers[0].find("li", {"class":"paid-amount"}).span.contents[-1].split()[2]

'1,000'
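Both values above are still strings, and '1,000' contains a thousands separator, so they need cleaning before any arithmetic. A minimal sketch, assuming each text contains a single number; `parse_number` is a hypothetical helper, not part of the original notebook:

```python
import re


def parse_number(text):
    """Return the first number in `text` as an int, or None if there is none."""
    # Match a run of digits that may contain thousands separators.
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


views = parse_number("132 views")        # 132
amount = parse_number("Paid INR 1,000")  # 1000
```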

In [30]:
## Transaction: Title
str(containers[0].find("li", {"class":"transaction"}).a["title"])

'Background or Personal Verification'

In [31]:
# Name
containers[0].find("li", {"class":"name"}).a['title']

'Police'

In [32]:
#Date
containers[0].find("span",{"class":"date"}).contents[0]

'January 23, 2018'
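The date is returned as text. It could be converted to a `datetime.date` with the standard library; this is a sketch that assumes every report uses the "Month day, year" format seen above and that the interpreter runs with English month names:

```python
from datetime import datetime

raw_date = "January 23, 2018"  # value extracted from the first report

# %B = full month name, %d = day of the month, %Y = four-digit year
report_date = datetime.strptime(raw_date, "%B %d, %Y").date()
print(report_date.isoformat())  # 2018-01-23
```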

In [33]:
# State
containers[0].find("div",{"class":"key"}).a['title'].split()[1].replace(',',"")

'Karnataka'

In [34]:
# City
containers[0].find("div",{"class":"key"}).a['title'].split()[0].replace(',',"")

'Shimoga'
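Splitting on whitespace keeps only the first word of each part, so a multi-word city or state name would be cut short. A sketch of a comma-based split, assuming the `title` attribute always has the form "City, State" (inferred from the output above, not confirmed against the site); `split_location` is a hypothetical helper and the second input is purely illustrative:

```python
def split_location(title):
    """Split a 'City, State' string at its last comma."""
    city, sep, state = title.rpartition(",")
    if not sep:
        # No comma found: treat the whole string as the city.
        return title.strip(), ""
    return city.strip(), state.strip()


print(split_location("Shimoga, Karnataka"))  # ('Shimoga', 'Karnataka')
print(split_location("New Delhi, Delhi"))    # ('New Delhi', 'Delhi')
```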

Now that we know where to find each feature in a report, we loop through the pages sequentially, from the latest reports to the oldest.
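The paging scheme can be checked before any request is sent. A sketch assuming, as the loop in the final code does, that the `page` query parameter is a report offset that advances in steps of 10; `page_urls` and `BASE_URL` are hypothetical names:

```python
BASE_URL = "http://www.ipaidabribe.com/reports/all"  # listing page used in this notebook


def page_urls(n_reports, per_page=10):
    """Build one listing URL per page needed to cover `n_reports` reports."""
    n_pages = -(-n_reports // per_page)  # ceiling division
    return [f"{BASE_URL}?page={i * per_page}" for i in range(n_pages)]


urls = page_urls(1200)
print(len(urls))  # 120
print(urls[1])    # http://www.ipaidabribe.com/reports/all?page=10
```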

### Final Code to extract information from reports

In [16]:
data = []

for i in range(120):      # 120 pages x 10 reports per page = 1200 reports
    my_url = "http://www.ipaidabribe.com/reports/all?page={}#gsc.tab=0".format(i*10)
    uClient = uReq(my_url)
    page_html= uClient.read()
    uClient.close()
    page_soup = soup(page_html,"lxml")
    containers = page_soup.find_all("section", {"class":"ref-module-paid-bribe"})
   
    for contains in containers:
        views = contains.find("li", {"class":"views"}).contents[1].split()[0]
        paid_amount = contains.find("li", {"class":"paid-amount"}).span.contents[-1].split()[2]
        transaction = contains.find("li", {"class":"transaction"}).a["title"]
        name = contains.find("li", {"class":"name"}).a['title']
        date = contains.find("span",{"class":"date"}).contents[0]
        city = contains.find("div",{"class":"key"}).a['title'].split()[0].replace(',',"")
        state = contains.find("div",{"class":"key"}).a['title'].split()[1].replace(',',"")
    
        data.append((views,paid_amount,transaction,name,date,city,state))
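`find` returns `None` when a tag is missing, so a single incomplete report would stop the loop above with an `AttributeError`. A sketch of a more defensive lookup, run on a made-up HTML fragment and using the built-in `html.parser` so that no extra parser is needed; `text_or_none` is a hypothetical helper:

```python
from bs4 import BeautifulSoup

# Made-up fragment: the second report has no "views" element.
html = """
<section class="ref-module-paid-bribe"><ul><li class="views">132 views</li></ul></section>
<section class="ref-module-paid-bribe"><ul><li class="label">I Paid A Bribe</li></ul></section>
"""


def text_or_none(container, tag, css_class):
    """Return the stripped text of the first matching tag, or None if absent."""
    node = container.find(tag, {"class": css_class})
    return node.get_text(strip=True) if node is not None else None


page = BeautifulSoup(html, "html.parser")
sample_containers = page.find_all("section", {"class": "ref-module-paid-bribe"})
sample_views = [text_or_none(c, "li", "views") for c in sample_containers]
print(sample_views)  # ['132 views', None]
```

Rows with a `None` field can then be skipped or kept as missing values, depending on what the later analysis needs.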

### Saving to CSV

In [17]:
import pandas as pd
df = pd.DataFrame(data,columns=["views","paid_amount","transaction","name",'date',"city","state"])
df['date'] = pd.to_datetime(df['date'])
df.to_csv("I_Paid_Bribe.csv",index=False,encoding='utf-8')

In [35]:
## Previewing the data created

df.head(20)

| Unnamed: 0 | views | paid_amount | transaction | name | date | city | state |
|---|---|---|---|---|---|---|---|
| 0 | 131 | 1000 | Background or Personal Verification | Police | 2018-01-23 | Shimoga | Karnataka |
| 1 | 139 | 1500 | Police Verification for Passport | Passport | 2018-01-23 | Ghaziabad | Uttar |
| 2 | 149 | 500 | Traffic Violations | Police | 2018-01-23 | Margao | Goa |
| 3 | 192 | 500 | Traffic Violations | Police | 2018-01-22 | New | Delhi |
| 4 | 249 | 30 | Duplicate Driving License | Transport | 2018-01-20 | Jamshedpur | Jharkhand |
| 5 | 339 | 18000 | Transfer of Property | Stamps and Registration | 2018-01-19 | Bangalore | Karnataka |
| 6 | 320 | 70 | Activities on Beat | Police | 2018-01-19 | Proddatur | Andhra |
| 7 | 368 | 200000 | Police Harassment | Police | 2018-01-18 | Bangalore | Karnataka |
| 8 | 331 | 450 | Traffic Violations | Police | 2018-01-18 | Pune | Maharashtra |
| 9 | 336 | 6400 | Driving licence Process | Transport | 2018-01-18 | Udumalaipettai | Tamil |


### Now that we have the data, we can answer the following questions:
1. Which cities/states are the most corrupt?
2. Which department (name) is the most corrupt?
3. How does the amount of money paid in bribes change over time?
4. In which situations (transactions) are bribes paid most often?
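As a starting point for question 2, reports can be counted and amounts summed per department. A sketch using the `name` and `paid_amount` values from the first five rows of the preview above, with `paid_amount` entered as integers (the scraped values are strings, so a conversion step is assumed here):

```python
import pandas as pd

# First five rows of the preview, reduced to the two columns needed.
sample = pd.DataFrame(
    [
        ("Police", 1000),
        ("Passport", 1500),
        ("Police", 500),
        ("Police", 500),
        ("Transport", 30),
    ],
    columns=["name", "paid_amount"],
)

# Number of reports per department.
reports_by_dept = sample["name"].value_counts()

# Total amount paid per department, largest first.
total_by_dept = (
    sample.groupby("name")["paid_amount"].sum().sort_values(ascending=False)
)

print(reports_by_dept)
print(total_by_dept)
```

Whether "most corrupt" is better measured by report count or by total amount is a modelling choice; both are shown so they can be compared.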