## Boston Medical Malpractice Twitter Bot
by [Erica Dumore](https://www.linkedin.com/in/erica-dumore-62b306a2/)

## About

I am currently employed as a law clerk at a medical malpractice law firm in the Greater Boston area. Because I enjoy my work and this niche of law, I decided to create a Twitter bot, [@Boston_MedMal](https://twitter.com/Boston_MedMal), that tweets updates on current medical malpractice legal issues. The bot does this by scraping posts from two Boston legal blogs. Whenever either blog releases a new post relating to medical malpractice, whether it covers a recent case decision, a settlement award, or an ongoing issue, the bot tweets the article title and a link that takes Twitter followers to the full article.

My hope with this bot is to help keep those interested in this area of law educated and up to date on current issues with one easy click.

## Google Spreadsheet ID 

When my Twitter bot scrapes the two blogs for new articles, it saves each article's title and link to a Google spreadsheet. The spreadsheet, My Twitter Bot, has a document ID, 1-sYU2ur4kpERjl0dFr_ZcpsVRMWG_J1n98r9ExDOClE, which I linked to my Twitter bot code. The spreadsheet timestamps each entry, and the bot compares new articles against the most recent entry so that each article is tweeted only once.

## Process

Before creating my Twitter bot, I brainstormed different ways to gather information to tweet. I knew the subject I wanted the bot to cover, medical malpractice, but I was not sure which sources I wanted it to scrape.

I began by creating a Twitter account, [@Boston_MedMal](https://twitter.com/Boston_MedMal). After creating the account, I created the Google spreadsheet discussed above and noted its document key.



This project was adapted from the notebook found at [How to Build a Law Bot](https://lawyerist.com/how-build-law-bot/).

In [22]:
# Load the module for visiting and reading websites.
import urllib.request
# Load the module for running regular expressions (regex).
import re 
# Load the module for date and time stuff.
import datetime
# Define the variable now as equal to the current date and time.
now = datetime.datetime.now()

In [23]:
# Set the URL you want to scrape.
#url_1 = "https://www.bostoninjurylawyerblog.com/category/medical-malpractice"
url_1 ="https://www.bostonpersonalinjuryattorneyblog.com/category/medical-malpractice/feed"
url_2 = "https://www.bostoninjurylawyerblog.com/category/medical-malpractice/feed"

# If you want to scrape data from multiple pages, you can, 
# just replicate the above and below but change url_1 to url_2 et al.
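
The comment above suggests copying the code once per URL. As a minimal sketch of an alternative, the feeds can live in one list so that adding a third blog means adding one line. The names `FEED_URLS` and `feed_host` are hypothetical and are not used elsewhere in this notebook.

```python
from urllib.parse import urlparse

# The two feeds defined in the cell above. Add more URLs here instead of
# creating url_3, url_4, and so on. (FEED_URLS is a hypothetical name.)
FEED_URLS = [
    "https://www.bostonpersonalinjuryattorneyblog.com/category/medical-malpractice/feed",
    "https://www.bostoninjurylawyerblog.com/category/medical-malpractice/feed",
]


def feed_host(url):
    """Return the host name of a feed URL, handy as a label when logging."""
    return urlparse(url).netloc


# Walk the list once; each later step (fetch, parse, tweet) could sit in this loop.
for number, url in enumerate(FEED_URLS, start=1):
    print(number, feed_host(url))
```

The rest of this notebook keeps the separate `url_1` and `url_2` variables, so this loop is only an illustration of how the pattern could scale.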

In [24]:
# Load the module for accessing Google Sheets.
import gspread
# Load the module needed for securely communicating with Google Sheets.
from oauth2client.service_account import ServiceAccountCredentials
# The scope for your access credentials
scope = ['https://spreadsheets.google.com/feeds']

# Your spreadsheet's ID
document_key = "1-sYU2ur4kpERjl0dFr_ZcpsVRMWG_J1n98r9ExDOClE" 
#              ^^^^^^^^^^^ SWAP OUT FOR YOUR DOCUMENT ID/KEY
# Your Google project's .json key
credentials = ServiceAccountCredentials.from_json_keyfile_name('../../../../../CodingProject2-517c5ae88a4a.json', scope)
#                                                                              ^^^^^^^^ SWAP OUT FOR YOUR JSON KEY
# Use your credentials to authorize yourself.
gc = gspread.authorize(credentials)
# Open up the Sheet with the defined ID.
wks = gc.open_by_key(document_key)

#########################################
#
#  NOTE: The name of the sheet you are 
#  trying to access should be in the 
#  parenthetical below (e.g., Data). By
#  Default this is probably "Sheet1".
#
#########################################
worksheet = wks.worksheet("Sheet1")

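# Resize the Sheet to its current row count. The code further down treats
# the last row of the Sheet as the most recent entry, so delete any blank
# rows below your data in Google Sheets before running this.
worksheet.resize(worksheet.row_count)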

In [26]:
# Import the relevant Twitter libraries so you can use Twitter.
import twitter
from twitter import TwitterError

# Create the following four text files and add them to the same directory as your
# Google API key. In each file, add the appropriate value found when retrieving your
# Twitter credentials.

# .strip() drops any trailing newline that a text editor added to the file.
with open('../../../../../key.txt', 'r') as myfile:
    key=myfile.read().strip()
    
with open('../../../../../secret.txt', 'r') as myfile:
    secret=myfile.read().strip()
    
with open('../../../../../token_key.txt', 'r') as myfile:
    token_key=myfile.read().strip()

with open('../../../../../token_secret.txt', 'r') as myfile:
    token_secret=myfile.read().strip()

# Set your Twitter API credentials.
api = twitter.Api(consumer_key=key,
                  consumer_secret=secret,
                  access_token_key=token_key,
                  access_token_secret=token_secret)
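
Text editors often end a file with a newline, and `read()` returns that newline as part of the string. The sketch below shows one way to read a credential file and drop surrounding whitespace. The helper name `read_secret` is hypothetical, and the demo writes a throwaway file in a temporary folder rather than touching real credentials.

```python
import tempfile
from pathlib import Path


def read_secret(path):
    """Read one credential from a text file, dropping surrounding whitespace."""
    return Path(path).read_text(encoding="utf-8").strip()


# Demo with a throwaway file; the value "abc123" is made up.
with tempfile.TemporaryDirectory() as folder:
    demo_file = Path(folder) / "key.txt"
    demo_file.write_text("abc123\n", encoding="utf-8")
    print(repr(read_secret(demo_file)))
```

Keeping the four values in files outside the project folder, as this notebook does, also keeps them out of anything you publish.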

## Read the contents of your first webpage

When you run the next cell, your program will visit the first URL you defined above and print out the raw contents of that page, which here is an RSS feed in XML. The cell after it does the same for the second URL.

In [27]:
p_1 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor).open(url_1).read()
print(p_1)

b'<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"\n\txmlns:content="http://purl.org/rss/1.0/modules/content/"\n\txmlns:wfw="http://wellformedweb.org/CommentAPI/"\n\txmlns:dc="http://purl.org/dc/elements/1.1/"\n\txmlns:atom="http://www.w3.org/2005/Atom"\n\txmlns:sy="http://purl.org/rss/1.0/modules/syndication/"\n\txmlns:slash="http://purl.org/rss/1.0/modules/slash/"\n\t>\n\n<channel>\n\t<title>medical malpractice &#8211; Boston Personal Injury Attorney Blog</title>\n\t<atom:link href="https://www.bostonpersonalinjuryattorneyblog.com/category/medical-malpractice/feed" rel="self" type="application/rss+xml" />\n\t<link>https://www.bostonpersonalinjuryattorneyblog.com</link>\n\t<description>Published by Boston, Massachusetts Personal Injury Lawyer \xe2\x80\x94 Jeffrey S. Glassman</description>\n\t<lastBuildDate>Tue, 10 Oct 2017 16:45:06 +0000</lastBuildDate>\n\t<language>en-US</language>\n\t<sy:updatePeriod>hourly</sy:updatePeriod>\n\t<sy:updateFrequency>1</sy:updateFrequency>\n\t\

In [28]:
p_2 = urllib.request.build_opener(urllib.request.HTTPCookieProcessor).open(url_2).read()
print(p_2)

b'<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"\n\txmlns:content="http://purl.org/rss/1.0/modules/content/"\n\txmlns:wfw="http://wellformedweb.org/CommentAPI/"\n\txmlns:dc="http://purl.org/dc/elements/1.1/"\n\txmlns:atom="http://www.w3.org/2005/Atom"\n\txmlns:sy="http://purl.org/rss/1.0/modules/syndication/"\n\txmlns:slash="http://purl.org/rss/1.0/modules/slash/"\n\t>\n\n<channel>\n\t<title>Medical Malpractice &#8211; Boston Injury Lawyer Blog</title>\n\t<atom:link href="https://www.bostoninjurylawyerblog.com/category/medical-malpractice/feed" rel="self" type="application/rss+xml" />\n\t<link>https://www.bostoninjurylawyerblog.com</link>\n\t<description>Published by Boston, Massachusetts Personal Injury Attorneys \xe2\x80\x94 Altman &#38; Altman</description>\n\t<lastBuildDate>Tue, 24 Oct 2017 13:50:21 +0000</lastBuildDate>\n\t<language>en-US</language>\n\t<sy:updatePeriod>hourly</sy:updatePeriod>\n\t<sy:updateFrequency>1</sy:updateFrequency>\n\t\n<site xmlns="com-wordpress:
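
Both cells above repeat the same fetch call. As a rough sketch, that call could be wrapped in one function with a timeout so a slow site does not hang the bot. The name `fetch` and the 10-second default are assumptions. The demo uses a `data:` URL, which `urllib.request` can open without a network connection, as a stand-in for a real feed.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw bytes served at url, giving up after timeout seconds."""
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


# A data: URL stands in for a feed URL so this sketch runs offline.
print(fetch("data:text/plain,hello"))
```

With this helper, `p_1 = fetch(url_1)` and `p_2 = fetch(url_2)` would yield the same kind of bytes object the cells above print.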

------------------------------------

# One Data Point, One Match

## Parse the site's contents 

Scan the feed output above for the content you are trying to extract. Cut and paste that output into the TEST STRING box over at [Regex 101](https://regex101.com/) and craft a regex that captures your desired content. Be sure to use the Python flavor.

Remember that each parenthetical is a group you're pulling out. Once you have a working regex, plug it into the code below and run the cell. If it worked, you'll see your scraped data as output.
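
To see how the groups work before running against a live feed, here is a small sketch that applies the same kind of pattern to a made-up feed fragment. The sample bytes, the `example.com` links, and the helper name `latest_item` are all invented for illustration. The helper returns `None` when nothing matches, so a changed feed layout does not raise an error.

```python
import re

# A made-up feed fragment shaped like the RSS output above.
SAMPLE = (
    b"<channel>\n\t<title>Example Blog</title>\n"
    b"\t<item>\n\t\t<title>First Post</title>\n"
    b"\t\t<link>https://example.com/first</link>\n\t</item>\n"
    b"\t<item>\n\t\t<title>Second Post</title>\n"
    b"\t\t<link>https://example.com/second</link>\n\t</item>\n"
    b"</channel>"
)

# Group 1 captures the title and group 2 the link of an <item>.
# The rb prefix makes this a raw bytes pattern, so \s reaches the regex engine as written.
ITEM = re.compile(rb"<item>\s*<title>(.*?)</title>\s*<link>(.*?)</link>")


def latest_item(feed_bytes):
    """Return (title, link) for the first item in the feed, or None if no match."""
    match = ITEM.search(feed_bytes)
    if match is None:
        return None
    title, link = (group.decode("utf-8") for group in match.groups())
    return title, link


print(latest_item(SAMPLE))
```

Because `search` stops at the first match, the result is the newest post when the feed lists items newest first, as the two feeds in this notebook appear to do.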

In [29]:
res_1 = re.search(rb"item>\s*<title>(.*)</title>\s*<link>(.*)</link>",p_1)
output_1 = res_1.group(1).decode('UTF-8')
output_2 = res_1.group(2).decode('UTF-8')
print(output_1)
print(output_2)

Medical Malpractice Cases and Medical Battery
https://www.bostonpersonalinjuryattorneyblog.com/2017/08/medical-malpractice-cases-medical-battery.html


In [30]:
res_2 = re.search(rb"item>\s*<title>(.*)</title>\s*<link>(.*)</link>",p_2)
output_3 = res_2.group(1).decode('UTF-8')
output_4 = res_2.group(2).decode('UTF-8')
print(output_3)
print(output_4)

$3 Million Award in Delaware Birth Injury Case
https://www.bostoninjurylawyerblog.com/2017/10/3-million-award-delaware-birth-injury-case.html
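
Since both pages are RSS feeds, an XML parser is another option. This is a different technique from the regex this notebook uses, sketched here with the standard library's `xml.etree.ElementTree` on a made-up feed. The sample content and the name `feed_items` are assumptions. One practical difference is that the parser decodes character references such as `&#38;` for you.

```python
import xml.etree.ElementTree as ET

# A made-up RSS document; real feeds carry more elements per item.
SAMPLE_FEED = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
  <title>Example Blog</title>
  <item>
    <title>Awards &#38; Settlements</title>
    <link>https://example.com/first</link>
  </item>
  <item>
    <title>Second Post</title>
    <link>https://example.com/second</link>
  </item>
</channel>
</rss>"""


def feed_items(feed_bytes):
    """Return a list of (title, link) pairs, one per <item>, in feed order."""
    root = ET.fromstring(feed_bytes)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in feed_items(SAMPLE_FEED):
    print(title, link)
```

This returns every item rather than only the first, which could be useful if a blog publishes several posts between runs of the bot.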


## Post to Twitter and Save to Google (Two Data Points, One Match)

In [31]:
# Read the most recently saved row once:
# [timestamp, title 1, link 1, title 2, link 2].
last_row = worksheet.row_values(worksheet.row_count)

# An article is new when its regex matched and neither its title nor its
# link equals the value saved in the last row.
new_1 = bool(res_1) and last_row[1] != output_1 and last_row[2] != output_2
new_2 = bool(res_2) and last_row[3] != output_3 and last_row[4] != output_4

if new_1:
    try:
        # Post to Twitter.
        status = api.PostUpdate('%s %s'%(output_1,output_2))
        print(status.text)
    except TwitterError as error:
        # Report the error rather than resending the same tweet unguarded.
        print(error)

if new_2:
    try:
        # Post to Twitter.
        status = api.PostUpdate('%s %s'%(output_3,output_4))
        print(status.text)
    except TwitterError as error:
        # Report the error rather than resending the same tweet unguarded.
        print(error)

if new_1 or new_2:
    # Save to Google only after Tweeting.
    # str(now) stores the timestamp as plain text.
    worksheet.append_row([str(now),output_1,output_2,output_3,output_4])
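
The check in the cell above can be tried without touching Google Sheets or Twitter. The sketch below pulls the comparison into a small function and runs it on a made-up row. The name `is_new`, the sample titles, and the `example.com` links are hypothetical. Treating a row that is too short as "nothing saved yet" is an added assumption, not something the cell above does.

```python
def is_new(last_row, title, link, title_col, link_col):
    """Return True when both title and link differ from the saved row.

    Mirrors the cell above: an article counts as new only if neither its
    title nor its link equals the value stored in the last row.
    """
    # Assumption: a row too short to hold these columns means nothing is saved yet.
    if len(last_row) <= max(title_col, link_col):
        return True
    return last_row[title_col] != title and last_row[link_col] != link


# A made-up last row: [timestamp, title 1, link 1, title 2, link 2].
last_row = [
    "2017-10-24 10:00:00",
    "Old title A", "https://example.com/a",
    "Old title B", "https://example.com/b",
]

print(is_new(last_row, "New title A", "https://example.com/new-a", 1, 2))  # True
print(is_new(last_row, "Old title B", "https://example.com/b", 3, 4))      # False
```

Note that this only compares against the single most recent row, so an article that drops out of first place and later returns would be tweeted again.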