<h1>2. Web Scraping</h1>
<HR WIDTH="100%" size="6">


<table align='left'>
  <tr>
    <td><b>Step</b></td>
    <td><b>Description</b></td>
  </tr>
  <tr>
  <td><b>2.1 Function: </b>mongo_database_setup()</td>
  <td>Connect to MongoDB. Create a new database named after the current date. Define two collections (reports, comments).</td>
  </tr>
  <tr>  
  <td><b>2.2 Function: </b>web_scrape(web_index)</td>
  <td>Scrape the web page and return a <b>BeautifulSoup</b> object</td>
  </tr>
  <tr>  
  <td><b>2.3 Function: </b>parse_tables(soup,web_id,reports,comments)</td>
  <td>Take the first table in the <b>BeautifulSoup</b> object and loop through its rows. Extract the data using 
  regular expressions, save it to a Python dictionary, and save the dictionary to the Mongo DB collection <i>reports</i>. Then take the third table, loop through its rows, extract the data with regular expressions, save each row to a Python dictionary, and save the dictionary to the Mongo DB collection <i>comments</i>.</td>
  </tr>
  <tr> 
  <td><b>2.4 </b>Scrape, Parse, Save</td>
      <td>Provide counters to control the starting page and the number of pages to scrape. Loop through the pages and call the functions above to save each one to Mongo DB. Terminate when the counter limit is reached.</td>
  </tr>
    </table><br clear="left"/>

<table align='left'>
   <tr>
   <th colspan="4"><p style="text-align: center;">Packages Used</p></th>
  </tr>
  <tr style="background-color:azure">
    <td>Package</td>
    <td>Pre-installed with Anaconda</td>
    <td>Install instruction from command line</td>
    <td>Documentation Link</td>
    </tr>
   <tr>
    <td>time</td>
    <td colspan="2"> <p> Part of the Python Standard Library</p></td>
    <td>https://docs.python.org/2/library/time.html</td>
   </tr>
   <tr>
    <td>pymongo</td>
    <td><p style="text-align: center;">&#x2718;</p></td>
    <td>pip install pymongo</td>
    <td>http://api.mongodb.org/python/current/</td>
    </tr>
    <tr>
    <td>BeautifulSoup</td>
    <td><p style="text-align: center;">&#x2718;</p></td>
    <td>pip install beautifulsoup4</td>
    <td>http://www.crummy.com/software/BeautifulSoup/bs4/doc/</td>
    </tr>
    <tr>
    <td>urllib2</td>
     <td colspan="2"> <p> Part of the Python Standard Library (v2)</p></td>
    <td>https://docs.python.org/2/library/urllib2.html</td>
    </tr>
    <tr>
    <td>re</td>
    <td colspan="2"> <p> Part of the Python Standard Library</p></td>
    <td>https://docs.python.org/2/library/re.html</td>
    </tr>
   </table><br clear="left"/>



<HR WIDTH="100%" size="4">
<font size="5" color="red">An instance of MongoDB must be running before executing this notebook.</font>

<HR WIDTH="100%" size="6">

<h3>2.1 Connect and Setup MongoDB </h3>

<p><font size="3" color="red">The Mongo database must be running on the local machine.</font> Create the connection to the Mongo database. If the connection succeeds, create a database named <i>pillreports_ddMonyy</i> (for example <i>pillreports_31Mar15</i>). If this database already exists, delete it and create it again. Define two collections in the database: <b>reports</b> and <b>comments</b>.</p>
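
<p>As a small illustration of the naming scheme, the sketch below builds the database name for a fixed date. The fixed date and the helper name <i>database_name_for</i> are assumptions made for reproducibility; the notebook itself uses the current date.</p>

```python
import time

def database_name_for(date_string):
    """Build the database name for a date given as YYYY-MM-DD (hypothetical helper)."""
    parsed = time.strptime(date_string, "%Y-%m-%d")
    # %d = day of month, %b = abbreviated month name, %y = two-digit year
    return 'pillreports_' + time.strftime("%d%b%y", parsed)

example_name = database_name_for("2015-03-31")
print(example_name)
```

<p>With an English locale this gives <i>pillreports_31Mar15</i>, the form of the name listed in the last output of this notebook. Because the name only changes once a day, a second run on the same day drops and recreates the same database.</p>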

In [1]:
import time
import pymongo

def mongo_database_setup():

    # Try to connect to MongoDB. Re-raise the error if not successful,
    # because nothing below can run without a connection.
    try:
        conn=pymongo.MongoClient()
        print "Connected successfully!!!"
        
    except pymongo.errors.ConnectionFailure, e:
        print "Could not connect to MongoDB: %s" % e 
        raise

    #Use today's date for the database name, e.g. pillreports_31Mar15
    name='pillreports_'+time.strftime("%d%b%y")

    #Drop the database if it already exists so each run starts clean
    if name in conn.database_names():
        conn.drop_database(name)

    #Get the database and its two collections.
    #MongoDB creates them lazily on the first insert.
    db = conn[name]
    reports = db.reports
    comments = db.comments
   
    #Return the connection, the database and the two collections.
    return conn,db,reports,comments

<h3>2.2 Scrape individual reports and comments</h3>

<p>Using the <b>urllib2</b> library, open and read a web page and pass it to the <b>BeautifulSoup</b> library, which parses it and extracts all tables. If the page does not contain exactly 3 tables, return <i>False</i>. Otherwise return the parsed web page.</p>
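
<p>The three-table check can be tried without network access or extra packages. The sketch below is not the notebook's method: it swaps BeautifulSoup for the standard library HTML parser, and the class <i>TableCounter</i>, the function <i>looks_like_report</i> and both HTML strings are made up for illustration.</p>

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as used in this notebook

class TableCounter(HTMLParser):
    """Count <table> start tags (hypothetical stand-in for the BeautifulSoup check)."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables += 1

def looks_like_report(html):
    """Return True only when the page holds exactly three tables."""
    counter = TableCounter()
    counter.feed(html)
    return counter.tables == 3

# Made-up pages for illustration only.
report_page = "<table><tr><td>report</td></tr></table><table></table><table></table>"
other_page = "<table><tr><td>not found</td></tr></table>"

print(looks_like_report(report_page))
print(looks_like_report(other_page))
```

<p>The first page passes the check and the second does not, which mirrors the way <i>web_scrape</i> returns either a parsed page or <i>False</i>.</p>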

In [2]:
from bs4 import BeautifulSoup
import urllib2

def web_scrape(web_index):
    base_path="http://www.pillreports.net/index.php?page=display_pill&id="
    web_path =base_path+str(web_index)
    
    #Open and read the web page, then parse it with BeautifulSoup
    #and count the tables it contains.
    try:
        web_page = urllib2.urlopen(web_path).read().decode('utf-8')
        soup=BeautifulSoup(web_page)
        number_of_tables = len(soup.findChildren('table'))
     
        #Based on research, a published report has 3 tables: the report is
        #in the first table and the comments are in the third table.
        #Any other count means the page is not a published report.
        if number_of_tables!=3:
            return False
        else:
            return soup

    #If the page cannot be decoded as UTF-8, skip it.
    except (UnicodeDecodeError):
        print "Encoding error encountered. Page " +str(web_index)+ " skipped."
        return False


<h3>2.3 Parse "Description" and "Comments" tables</h3>

<p> Pass in the <b>BeautifulSoup</b> object, the web page index, and references to the <i>reports</i> and <i>comments</i> collections in the MongoDB database.</p>

<p> Loop through the rows of the first table. Column 1 holds the field names and column 2 holds the data. Save each row to a Python dictionary and insert the dictionary into the <i>reports</i> collection of the Mongo database. The advantage of a dictionary is that the keys do not have to be specified in advance.</p>
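
<p>A minimal sketch of why the dictionary is convenient here: rows are stored under whatever header they arrive with, so two reports need not share the same fields. The helper <i>rows_to_dict</i> and the field names are made up for illustration and are not taken from the site.</p>

```python
import re

def rows_to_dict(rows, web_id):
    """Turn (header, value) pairs into one dictionary (hypothetical helper)."""
    record = {'ID: ': str(web_id)}
    for header, value in rows:
        # Collapse runs of two or more whitespace characters into one space
        record[header] = re.sub(r'\s\s+', ' ', value)
    return record

# Made-up rows; the second report has a field the first one lacks.
report_a = rows_to_dict([('Name:', 'Blue  \n  Star'), ('Colour:', 'Blue')], 1)
report_b = rows_to_dict([('Name:', 'Red Heart'), ('Weight:', '250mg')], 2)

print(sorted(report_a.keys()))
print(sorted(report_b.keys()))
```

<p>MongoDB accepts documents with different keys in the same collection, so both dictionaries could be inserted as they are.</p>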

<p> Loop through the third table, which contains the comments. Extract the information using regular expressions to match strings. Save each comment to a Python dictionary and insert each dictionary into the <i>comments</i> collection of the Mongo database.</p>
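
<p>The sketch below applies the same regular expressions to a made-up comment string. The layout of the string is an assumption inferred from the patterns, not copied from the site. It also shows one thing worth watching for: the greedy <i>(.*)</i> runs to the last closing parenthesis, so a comment that itself contains parentheses changes what is captured.</p>

```python
import re

# Made-up comment text in the layout the expressions expect.
plain = "Posted on March 30, 2015, 9:15 pm GMT by some_user (member since 2012) Good report"

posted_on = re.search(r'Posted on (.*) GMT', plain).group(1)
author = re.search(r'GMT by (.*) \(', plain).group(1)
details = re.search(r'\((.*)\)', plain).group(1)
comment = re.search(r'\)(.*)', plain).group(1)

# The same text with parentheses inside the comment body.
tricky = plain + " (thanks)"

# Greedy (.*) extends to the LAST ')' in the text ...
greedy_details = re.search(r'\((.*)\)', tricky).group(1)
# ... while the non-greedy (.*?) stops at the FIRST ')'.
lazy_details = re.search(r'\((.*?)\)', tricky).group(1)

print(posted_on)
print(author)
print(details)
print(greedy_details)
print(lazy_details)
```

<p>If comments with parentheses turn out to be common, the non-greedy form may be the safer choice. Whether that matters depends on the real page layout, which this sketch does not verify.</p>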



In [3]:
import re

def parse_tables(soup,web_id,reports,comments):
    
    #Dictionary to hold the report values. The ID is the web page index.
    pill_report_dict={}
    pill_report_dict['ID: ']=unicode(web_id)
    
    #Select the 1st table and loop through each row of the table,
    #extracting the first and second columns using the find_next() method of BeautifulSoup
    for tr_tag in soup.find_all('table')[0]('tr'):
        col1=tr_tag.find_next()
        col2=col1.find_next()
    
        #Only process rows whose second element is a <td> cell.
        if col2.name=='td':   
            
            #\s is white space. Replace runs of two or more whitespace characters with one space.
            text_tidy=re.sub(r'\s\s+', ' ', col2.text) 
            
            #Remove anything of the form <xx>. This removes any HTML tags remaining.
            text_tidy=re.sub(r'<.*?>','', text_tidy) 
            
            #Add the row to the dictionary. col1 is the column 1 header, col2 is the content.
            pill_report_dict[col1.text] = unicode(text_tidy)
 
    #Insert the dictionary into the Mongo Database.
    reports.insert(pill_report_dict)
    
    
#####################################################################
#                         Parse Comments                            #
#####################################################################

    
    #Dictionary to hold values. 
    comment_report_dict={}
    
    #Index to hold the comment number
    i=0
        
    #Loop through all rows in the third table of the soup object.
    for tr_tag in soup.find_all('table')[2]('tr'):
        
        try:
            row1=tr_tag.find_next()

            if row1.text=="There are no comments":
                #Nothing to save for this report.
                return
            else:
                #Don't save the first row of the table
                if row1.name=='td' and i > 0:
                    
                    comment_report_dict={}
                    comment_report_dict["Report ID:  "] = str(web_id)
                    comment_report_dict["Comment Number: "]=str(i)
                    
                    #Select everything between 'Posted on ' and ' GMT'
                    comment_report_dict["Posted On: "] = re.search(r'Posted on (.*) GMT',row1.text).group(1)
                    #Select everything between 'GMT by ' and ' ('
                    comment_report_dict["By: "] = re.search(r'GMT by (.*) \(',row1.text).group(1)
                    #Select everything between '(' and ')'
                    comment_report_dict["Member details: "] = re.search(r'\((.*)\)',row1.text).group(1)

                    #Collapse white space and remove HTML tags 
                    tidy_comment = re.sub(r'\s\s+', ' ', unicode(row1.text))
                    p = re.compile(r'<.*?>')
                    tidy_comment=p.sub('', tidy_comment)
                   
                    #Select everything after ')'
                    comment_report_dict["Comment: "]= re.search(r'\)(.*)',tidy_comment).group(1)

                    #Insert the dictionary into the comments collection
                    comments.insert(comment_report_dict)

            i=i+1
        
        #re.search returns None when a pattern does not match, so .group(1)
        #raises AttributeError. Skip that row and carry on with the next one.
        except(AttributeError):
            print "Encoding error encountered. Not all comments captured on document:" +str(web_id)
            #return
        
    return


<h3>2.4 Scrape, Parse and Save</h3>

<p>Create a new Mongo database with two collections using the user-defined function <i>mongo_database_setup()</i>.</p>

<p>Set the parameters for the number of reports to download and the index at which to start downloading.</p>

<p>Pass <i>top_index</i> to the <i>web_scrape(index)</i> function. Parse the returned <b>BeautifulSoup</b> object in the <i>parse_tables(soup,index,reports,comments)</i> function. </p>

<p>Decrement <i>top_index</i> after every page and increment <i>download_count</i> after every saved report. When <i>download_count</i> exceeds <i>target</i>, terminate and close the Mongo DB connection. </p>
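
<p>The counter logic can be checked in isolation. The sketch below mirrors the loop condition used in this notebook but replaces the scraper with a stub, so <i>fake_web_scrape</i>, <i>run_loop</i> and the small numbers are assumptions chosen only for illustration.</p>

```python
def fake_web_scrape(index):
    """Stand-in for web_scrape: pretend every third index is not a report."""
    return False if index % 3 == 0 else "soup-%d" % index

def run_loop(target, top_index):
    """Mirror the notebook's counters and return (saved, final_index)."""
    download_count = 0
    while download_count <= target:
        soup = fake_web_scrape(top_index)
        if not isinstance(soup, bool):
            download_count = download_count + 1   # one more report saved
        top_index = top_index - 1                 # always move to the next older page
    return download_count, top_index

saved, final_index = run_loop(target=4, top_index=20)
print("%d reports saved, stopped at index %d" % (saved, final_index))
```

<p>Note that the number saved is <i>target</i> + 1, because the loop continues while the count is less than or equal to <i>target</i>. Skipped pages still move the index down, so more indices are visited than reports are saved.</p>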

In [4]:
#Setup the database, get reference to the connection
#and the collections. 
conn,db,reports,comments=mongo_database_setup()

#Scrape reports starting at the most recent index and working backwards.
#Because the loop condition uses <=, target+1 reports are saved.

target=5001
top_index=34120
download_count=0

while download_count <= target:
    
    soup=web_scrape(top_index)
         
    if not isinstance(soup, bool):
        #print top_index
        parse_tables(soup,top_index,reports,comments)
        download_count=download_count+1
    
    top_index=top_index-1
 
print "A total of "+str(db.reports.count())+" reports were saved. A total of "+str(db.comments.count())+" comments were saved to the database."   

####Ensure connection is closed to save the database####

conn.close()
print top_index

Connected successfully!!!
A total of 51 reports were saved. A total of 313 comments were saved to the database.
34051


In [5]:
#Check database is saved.
conn=pymongo.MongoClient()
print conn.database_names()
conn.close()

[u'local', u'pillreports_31Mar15']
