by Graham Lim

# 1. Webscraping

In this notebook, we will be scraping contract clauses from LawInsider.com. There are 2 main requirements for this to work:

**1) You must already have a LawInsider premium account**. As of July 2020, it costs USD1.00 to sign up for a premium 30-day trial. 

  **If you don't wish to spend money, that's cool - please skip to the 2nd notebook in this project.** The rest of the Capstone Project will still work with the saved .csv files that were derived from the scraping done here. 

**2) You will need to run the following pip install commands in a terminal or command prompt:**

* `pip install bs4` (for BeautifulSoup)
* `pip install selenium` (for Selenium)
* `pip install webdriver-manager` (for the automated Selenium web driver to work)

In [1]:
#Standard Python DS imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#set column size to be larger
pd.set_option("display.max_colwidth", 1000)

We have to use `Selenium` because the clauses on this website do not all load at once. The full content only loads as you keep scrolling or paging down (infinite scrolling).

Hence, we will import `Selenium` and the related `WebDriver Manager` tool to run a Chrome instance that keeps scrolling down for us, so that we don't have to do this manually for each of our clause types.

In [3]:
#Selenium and WebDriver Manager imports:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager


import time
from selenium.webdriver.common.keys import Keys

In [4]:
#let's assign the major URLS we want to scrape from LawInsider:

#these are our 2 target clauses - automatic and optional/manual renewal clauses:
auto_renewal_url = "https://www.lawinsider.com/clause/automatic-renewal"
optional_renewal_url = "https://www.lawinsider.com/clause/renewal-option"

#I then take the other most common clauses found in commercial agreements and list them:
licenses_url = "https://www.lawinsider.com/clause/licenses"
delivery_url = "https://www.lawinsider.com/clause/delivery"
fees_royalties_url = "https://www.lawinsider.com/clause/fees-and-royalties"
payment_url = "https://www.lawinsider.com/clause/payment-terms"
support_url = "https://www.lawinsider.com/clause/support"
marketing_url = "https://www.lawinsider.com/clause/marketing-and-publicity"
proprietary_rights_url = "https://www.lawinsider.com/clause/proprietary-rights"
warranty_url = "https://www.lawinsider.com/clause/warranty"
indemnification_url = "https://www.lawinsider.com/clause/indemnification"
confidentiality_url = "https://www.lawinsider.com/clause/confidentiality"
limited_liability_url = "https://www.lawinsider.com/clause/limitation-of-liability"
compliance_url = "https://www.lawinsider.com/clause/compliance-with-law"
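
Since all of these URLs share the same prefix, one optional way to keep them organised is a dictionary keyed by clause label. This is only a sketch and is not what the rest of this notebook uses; `BASE_URL`, `CLAUSE_SLUGS` and `build_clause_urls` are hypothetical names introduced for illustration, and only a few of the slugs are shown.

```python
# Hypothetical helper: map each clause label to its LawInsider URL.
BASE_URL = "https://www.lawinsider.com/clause/"

# Slugs copied from the URLs assigned above (only a few shown here).
CLAUSE_SLUGS = {
    "automatic_renewal": "automatic-renewal",
    "renewal_option": "renewal-option",
    "licenses": "licenses",
    "limited_liability": "limitation-of-liability",
}

def build_clause_urls(slugs, base_url=BASE_URL):
    """Return a {label: full_url} dictionary from a {label: slug} dictionary."""
    return {label: base_url + slug for label, slug in slugs.items()}

clause_urls = build_clause_urls(CLAUSE_SLUGS)
for label, url in clause_urls.items():
    print(label, "->", url)
```

A dictionary like this could later be looped over instead of calling the scraper once per variable, at the cost of losing the individually named `*_url` variables.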

## LawInsider.com Scraper Function

We then write a function that scrapes the clauses on a LawInsider page after paging down a set number of times, so that the page loads in full.

It takes 3 arguments: one of the URLs we previously assigned (`urls`), the number of page-downs to execute (`pagedown_pushes`), and the delay in seconds between page-downs (`pagedown_lag`), so that LawInsider doesn't get overwhelmed with too many requests.

In [5]:
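#this function takes a single url, the number of pagedowns to perform,
#and how long the lag (in seconds) is between pagedowns,
#and produces a list of scraped contract clauses.

#recent Selenium 4 releases removed the find_element_by_* helpers,
#so we use By locators and pass the driver path via a Service object
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def lawinsider_scraper(urls, pagedown_pushes, pagedown_lag):
    browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        browser.get(urls)
        time.sleep(2)  #give the page a moment to load

        elem = browser.find_element(By.TAG_NAME, "body")

        #page down the requested number of times, pausing between each push
        for _ in range(pagedown_pushes):
            elem.send_keys(Keys.PAGE_DOWN)
            time.sleep(pagedown_lag)

        post_elems = browser.find_elements(By.CLASS_NAME, "snippet-content")
        list_name = [post.text for post in post_elems]
    finally:
        browser.quit()  #close the Chrome window even if scraping fails

    return list_name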
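
The scraper above pages down a fixed number of times, which can be more (or fewer) scrolls than a given page needs. As a possible alternative, the sketch below keeps scrolling until the number of loaded items stops growing. It is not part of the original scraper: `scroll_until_stable`, its parameters and the `FakePage` stand-in are all hypothetical names, and the browser is replaced by two plain callables so the logic can be tried without opening Chrome.

```python
import time

def scroll_until_stable(scroll_once, count_items, max_scrolls=400, lag=1.0, patience=5):
    """Scroll until the item count stops growing.

    scroll_once: callable that performs one page-down (hypothetical hook).
    count_items: callable returning how many clauses are currently loaded.
    patience:    consecutive scrolls without new items before we stop.
    Returns the number of scrolls actually performed.
    """
    last_count = count_items()
    stale = 0
    scrolls = 0
    while scrolls < max_scrolls and stale < patience:
        scroll_once()
        scrolls += 1
        time.sleep(lag)
        current = count_items()
        if current > last_count:
            last_count = current
            stale = 0      # new content arrived, reset the counter
        else:
            stale += 1     # nothing new after this scroll
    return scrolls

# Stand-in for a real page: loads 10 more items per scroll, up to 35 in total.
class FakePage:
    def __init__(self, total=35, step=10):
        self.total, self.step, self.loaded = total, step, 0

    def scroll(self):
        self.loaded = min(self.total, self.loaded + self.step)

    def count(self):
        return self.loaded

page = FakePage()
scrolls_used = scroll_until_stable(page.scroll, page.count, lag=0, patience=3)
print(scrolls_used, page.count())  # 7 35
```

With Selenium, `scroll_once` could wrap the `send_keys(Keys.PAGE_DOWN)` call and `count_items` could count the elements with class `snippet-content`; the `patience` value would need tuning against how slowly the site loads new clauses.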

In [6]:
#A simple function that converts a scraped list into a labelled pandas DataFrame and prints its `shape`.
#Note: the `df_name` argument is overwritten straight away, so the string passed in has no effect on the result.

def clause_list_converter(list_name, df_name, clause):
    df_name = pd.DataFrame(list_name)
    df_name = df_name.rename(columns = {0:"clause_text"})
    df_name["clause_type"]=clause

    print(df_name.shape)
    
    return df_name
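
Because `df_name` is reassigned on the first line of the function above, a two-argument variant would produce the same dataframe. The sketch below shows one way that could look; `clauses_to_frame` is a hypothetical name and the two sample sentences are made up purely for the demonstration.

```python
import pandas as pd

def clauses_to_frame(clause_texts, clause_type):
    """Build a two-column DataFrame: the clause text and its label."""
    return pd.DataFrame({
        "clause_text": list(clause_texts),
        "clause_type": clause_type,   # scalar is repeated on every row
    })

# Made-up sample input, just to show the resulting shape and columns.
sample = clauses_to_frame(
    ["Term renews automatically.", "Either party may renew."],
    "demo",
)
print(sample.shape)  # (2, 2)
```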

### Automatic Renewal Clauses
We first run our function to create a list containing as many `automatic renewal clauses` as the site will offer us. These clauses enable a contract to renew automatically by default, without the contracting parties having to renegotiate.

For example, if you and Software X company have signed a 3-year software license agreement, the automatic renewal clause takes effect at the end of those 3 years and renews the contract by default, without any further negotiation.

In [7]:
auto_renewal_list = lawinsider_scraper(auto_renewal_url, 200, 2)

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache

In [8]:
df_auto = clause_list_converter(auto_renewal_list, "df_auto", "automatic_renewal")

(161, 2)


### Renewal Option Clauses
We also want to scrape `renewal option clauses`. These clauses generally don't let contracts renew automatically; some prerequisite must be met before the contract can be renewed, e.g. 30 days' notice of the intention to renew must be given.

In [9]:
optional_renewal_list = lawinsider_scraper(optional_renewal_url, 400, 1)

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


In [10]:
df_renewal_option = clause_list_converter(optional_renewal_list, "df_renewal_options", "renewal_option")

(70, 2)


### Other General Clauses
We also want our model to distinguish renewal clauses from other common general clauses in contracts, e.g. `warranty clauses`, `indemnification clauses`, `limitation of liability clauses` etc.

This means we should also scrape the other common clauses found in commercial contracts for services/products:

In [11]:
licenses_list = lawinsider_scraper(licenses_url, 400, 1)
df_licenses = clause_list_converter(licenses_list, "df_licenses", "licenses")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache

(630, 2)


In [12]:
delivery_list = lawinsider_scraper(delivery_url, 400, 1)
df_delivery = clause_list_converter(delivery_list, "df_delivery", "delivery")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 
(1000, 2)


In [13]:
fees_royalties_list = lawinsider_scraper(fees_royalties_url, 400, 1)
df_royalties = clause_list_converter(fees_royalties_list, "df_royalties", "royalties")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 
(51, 2)


In [14]:
payment_list = lawinsider_scraper(payment_url, 400, 1)
df_payment = clause_list_converter(payment_list, "df_payments", "payment")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache

(1000, 2)


In [15]:
support_list = lawinsider_scraper(support_url, 400, 1)
df_support = clause_list_converter(support_list, "df_support", "support")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 
(1000, 2)


In [16]:
marketing_list = lawinsider_scraper(marketing_url, 400, 1)
df_marketing = clause_list_converter(marketing_list, "df_marketing", "marketing")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 
(12, 2)


In [17]:
proprietary_list = lawinsider_scraper(proprietary_rights_url, 400, 1)
df_proprietary = clause_list_converter(proprietary_list, "df_proprietary", "proprietary_rights")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 
(930, 2)


In [18]:
warranty_list = lawinsider_scraper(warranty_url, 400, 1)
df_warranty = clause_list_converter(warranty_list, "df_warranties", "warranty")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache

(1000, 2)


In [19]:
indemnification_list = lawinsider_scraper(indemnification_url, 400, 1)
df_indemnity = clause_list_converter(indemnification_list, "df_indemnity", "indemnity")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache

(740, 2)


In [20]:
confidentiality_list = lawinsider_scraper(confidentiality_url, 400, 1)
df_confidentiality = clause_list_converter(confidentiality_list, "df_confidentiality", "confidentiality")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 
(670, 2)


In [21]:
limited_liability_list = lawinsider_scraper(limited_liability_url, 400, 1)
df_liability = clause_list_converter(limited_liability_list, "df_liability", "limited_liability")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache

(750, 2)


In [22]:
compliance_list = lawinsider_scraper(compliance_url, 400, 1)
df_compliance = clause_list_converter(compliance_list, "df_compliance", "compliance")

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 
(1000, 2)


## Evaluating Clauses for a Balanced Dataset
An imbalanced dataset is difficult to correct for later on, so I will intentionally pick only the dataframes we scraped that have a larger sample of clauses. We'll combine all our dataframes that have between `600` and `1,000` rows.

That leaves us with 10 dataframes (i.e. 10 categories), which also means we don't have too many labels.
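
As a quick check of that selection rule, the sketch below applies the 600 to 1,000 row range to the row counts printed by the scraping cells above. The dictionary is typed in by hand from those printed shapes, and the names `clause_counts`, `MIN_ROWS`, `MAX_ROWS` and `selected` are only illustrative.

```python
# Row counts taken from the shapes printed by the scraping cells above.
clause_counts = {
    "automatic_renewal": 161,
    "renewal_option": 70,
    "licenses": 630,
    "delivery": 1000,
    "royalties": 51,
    "payment": 1000,
    "support": 1000,
    "marketing": 12,
    "proprietary_rights": 930,
    "warranty": 1000,
    "indemnity": 740,
    "confidentiality": 670,
    "limited_liability": 750,
    "compliance": 1000,
}

MIN_ROWS, MAX_ROWS = 600, 1000  # the range described above

# Keep only the clause types whose row count falls inside the range.
selected = [label for label, n in clause_counts.items() if MIN_ROWS <= n <= MAX_ROWS]
total_rows = sum(clause_counts[label] for label in selected)

print(len(selected), "categories,", total_rows, "rows")  # 10 categories, 8720 rows
```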

In [40]:
dfs = [df_licenses, df_delivery, df_payment, df_support, 
       df_proprietary, df_warranty, df_indemnity,
       df_confidentiality, df_liability, df_compliance]

df = pd.concat(dfs, join='outer', axis=0)

In [42]:
df.head(3)

Unnamed: 0,clause_text,clause_type
0,"Licenses. The Acquiror Company possesses from the appropriate Governmental Authority all licenses, permits, authorizations, approvals, franchises and rights that are necessary for the Acquiror Company to engage in its business as currently conducted and to permit the Acquiror Company to own and use its properties and assets in the manner in which it currently owns and uses such properties and assets (collectively, “Acquiror Company Permits”). The Acquiror Company has not received notice from any Governmental Authority or other Person that there is lacking any license, permit, authorization, approval, franchise or right necessary for the Acquiror Company to engage in its business as currently conducted and to permit the Acquiror Company to own and use its properties and assets in the manner in which it currently owns and uses such properties and assets. The Acquiror Company Permits are valid and in full force and effect. No event has occurred or circumstance exists that may (with or...",licenses
1,Licenses. The Depositor shall cause the Trust to use its best efforts to obtain and maintain the effectiveness of any licenses required in connection with this Agreement and the other Operative Agreements and the transactions contemplated hereby and thereby until such time as the Trust shall terminate in accordance with the terms hereof. It shall be the duty of the Owner Trustee to cooperate with the Depositor with respect to such matters.,licenses
2,"Licenses. AUGI and its subsidiaries hold all licenses and permits as may be requisite for carrying on the AUGI Business in the manner in which it has heretofore been carried on, which licenses and permits have been maintained and continue to be in good standing except where the failure to obtain or maintain such licenses or permits would not have a material adverse effect on the AUGI Business;",licenses


In [43]:
#great, we have a reasonably large dataframe filled with clause text that needs to be backed up.

df.shape

(8720, 2)

In [44]:
df.to_csv("../data/df_raw.csv")
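
Two details of the concat-and-save step are worth knowing. Stacking dataframes with `pd.concat` keeps each frame's own row labels, so index values repeat, and `to_csv` writes the index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back without `index_col=0`. The sketch below illustrates both with two tiny made-up frames and an in-memory buffer instead of a file; it is an illustration only, not a change to the saved `df_raw.csv`.

```python
import io
import pandas as pd

# Two tiny stand-in frames; each starts its own index at 0, like the scraped ones.
df_a = pd.DataFrame({"clause_text": ["a1", "a2"], "clause_type": "licenses"})
df_b = pd.DataFrame({"clause_text": ["b1"], "clause_type": "delivery"})

# Plain concat keeps the original row labels, so 0 appears twice.
stacked = pd.concat([df_a, df_b], axis=0)
print(stacked.index.tolist())      # [0, 1, 0]

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2]

# Round trip through CSV text: the index is written as an unnamed first column,
# so index_col=0 is needed to read it back as the index.
buffer = io.StringIO()
renumbered.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.shape)              # (3, 2)
```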

## DataFrames and .csv Backups

In [23]:
# #Automatic Renewals Clause dataframe
# df_auto = clause_list_converter(auto_renewal_list, "df_auto", "automatic_renewal_clause")

In [24]:
# #Automatic Renewals dataframe has turned out ok.
# df_auto.head(3)

In [25]:
# #we backup our Automatic Renewals Clauses dataframe into a .csv file
# df_auto.to_csv("../data/df_auto_renewals.csv")

In [26]:
# #we do the same for Renewal Option Clauses and Other Clauses:
# df_option = clause_list_converter(optional_renewal_list, "df_option", "renewal_option_clause")

In [27]:
# #the Renewal Option Clause dataframe looks good too:
# df_option.head(3)

In [28]:
# #backing that up too
# df_option.to_csv("../data/df_optional_renewals.csv")

In [29]:
# #we do the same for the Other Clauses:
# df_others = clause_list_converter(other_clauses_list, "df_others", "other_clauses")

In [30]:
# #our other clauses look good too:

# df_others

In [31]:
# #backing that up too
# df_others.to_csv("../data/df_others.csv")

In [32]:
# df_others = pd.read_csv("../data/df_others.csv", index_col = 0)

In [33]:
# df_others

In [34]:
# #we now merge everything together - start by merging both renewal clause types
# df_merged = pd.merge(df_auto, df_option, how = "outer")

In [35]:
# #then we merge with all other clauses
# df_merged = pd.merge(df_merged, df_others, how = "outer")

In [36]:
# #this gives us a fairly large dataframe suitable for Machine Learning
# df_merged.shape

In [37]:
# #and the dataframe does look fine. we back it up too:

# df_merged.to_csv("../data/df_merged.csv")
# df_merged 