by Graham Lim

# 1. Webscraping and Cleaning

In this notebook, we will be scraping contract clauses from LawInsider.com. There are 2 main requirements for this to work:

**1) You must already have a LawInsider premium account**. As of July 2020, it costs USD1.00 to sign up for a premium 30-day trial. 

  **If you don't wish to spend money, that's cool - please skip to the 2nd notebook in this project.** The rest of the Capstone Project will still work with the saved .csv files that were derived from the scraping done here. 

2) You will need to run the following pip install commands in terminal or cmd line:

* `pip install bs4` (for BeautifulSoup)
* `pip install selenium` (for Selenium)
* `pip install webdriver-manager` (for the automated Selenium web driver to work)

In [1]:
#Standard Python DS imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#set column size to be larger
pd.set_option("display.max_colwidth", 1000)

We have to use `Selenium` because of the fact that all the clauses don't load in full in this website. The content only loads up in full via infinite scrolling down/paging down. 

Hence, we will import `Selenium` and the related `WebDriver Manager` tool to run a Chrome instance within Selenium that will keep scrolling down for us, so that we don't manually have to do this for our 15+ types of clauses.

In [3]:
#Selenium and WebDriver Manager imports:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager


import time
from selenium.webdriver.common.keys import Keys

In [4]:
#let's assign the major URLS we want to scrape from LawInsider:

#these are our 2 target clauses - automatic and optional/manual renewal clauses:
auto_renewal_url = "https://www.lawinsider.com/clause/automatic-renewal"
optional_renewal_url = "https://www.lawinsider.com/clause/renewal-option"

#I then take the other most common clauses found in commercial agreements and list them:
other_clauses_urls= ["https://www.lawinsider.com/clause/definitions",
                     "https://www.lawinsider.com/clause/licenses",
                     "https://www.lawinsider.com/clause/delivery",
                     "https://www.lawinsider.com/clause/fees-and-royalties",
                     "https://www.lawinsider.com/clause/payment-terms",
                     "https://www.lawinsider.com/clause/support",
                     "https://www.lawinsider.com/clause/marketing-and-publicity",
                     "https://www.lawinsider.com/clause/proprietary-rights",
                     "https://www.lawinsider.com/clause/warranty",
                     "https://www.lawinsider.com/clause/indemnification",
                     "https://www.lawinsider.com/clause/confidentiality",
                     "https://www.lawinsider.com/clause/limitation-of-liability",
                     "https://www.lawinsider.com/clause/compliance-with-law",
                     "https://www.lawinsider.com/clause/miscellaneous"]

In [5]:
len(other_clauses_urls)

14

In [6]:
(len(other_clauses_urls)-1)

13

## LawInsider.com Scraper Function

We then write a function that will scrape the clauses contained in the LawInsider site url(s) after scrolling down that page x number of pagedowns to load it in full.

It takes 3 arguments: the url/url list objects we previously assigned (`urls`), the number of pagedowns/scrolls downwards to execute (`pagedown_pushes`), and the delay between each page/scroll down (`pagedown_lag`), so that LawInsider doesn't get overwhelmed with too many requests.

In [18]:
#this function takes the url/url list, number of pagedown scrolling, 
#and long the lag is
#

def lawinsider_scraper(urls, pagedown_pushes, pagedown_lag):

    
    if (type(urls) == list):
        
        data=[]
        
        for i in range(0,(len(urls)-1)):
 
            url = urls[i]
            driver = webdriver.Chrome(ChromeDriverManager().install())
            browser = driver
            browser.get(url) 
            time.sleep(2)
            
            elem = browser.find_element_by_tag_name("body")

            no_of_pagedowns = pagedown_pushes

            while no_of_pagedowns:
                elem.send_keys(Keys.PAGE_DOWN)
                time.sleep(pagedown_lag)
                no_of_pagedowns-=1

            post_elems = browser.find_elements_by_class_name("snippet-content")

            list_name = [post.text for post in post_elems]

            for clause in list_name:
                data.append(clause)
                
        return data

    elif (type(urls) == str):
        driver = webdriver.Chrome(ChromeDriverManager().install())
        browser = driver
        browser.get(urls)
        time.sleep(2)

   
        elem = browser.find_element_by_tag_name("body")

        no_of_pagedowns = pagedown_pushes

        while no_of_pagedowns:
            elem.send_keys(Keys.PAGE_DOWN)
            time.sleep(pagedown_lag)
            no_of_pagedowns-=1

        post_elems = browser.find_elements_by_class_name("snippet-content")

        list_name = [post.text for post in post_elems]

        return list_name

### Automatic Renewal Clauses
We first run our function to create a list object containing as many `automatic renewal clauses` as the site will offer us.

In [8]:
auto_renewal_list = lawinsider_scraper(auto_renewal_url, 200, 2)

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


In [9]:
len(auto_renewal_list)

160

### Renewal Option Clauses
We also want to scrape clauses that have `renewal option clauses`. These clauses generally don't let contracts renew automatically, and some pre-requisites must arise before the contract can be renewed e.g. 30 days notice must be given of intention to renew the contract.

In [15]:
optional_renewal_list = lawinsider_scraper(optional_renewal_url, 400, 1)

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


In [16]:
len(optional_renewal_list)

583

### Other General Clauses
We also want our model to distinguish renewal clauses as compared with other common general clauses in contracts e.g. warranty clauses, indemnification clauses, limitation of liability clauses etc.

This means we should scrape other common clauses found in commercial contracts for services/products:

In [19]:
other_clauses_list = lawinsider_scraper(other_clauses_urls, 400, 1)

[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147


 


[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache
[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147


 


[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache
[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147


 


[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache
[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


[WDM] - Current google-chrome version is 84.0.4147
[WDM] - Get LATEST driver version for 84.0.4147
[WDM] - Driver [/Users/grahamlim/.wdm/drivers/chromedriver/mac64/84.0.4147.30/chromedriver] found in cache


 


In [21]:
len(other_clauses_list)

9143

## DataFrames and .csv Backups

We then write a simple function to convert and label these lists as DataFrames in pandas, and tells us what the `shape` of the dataframe is:

In [31]:
def clause_list_converter(list_name, df_name, clause):
    df_name = pd.DataFrame(list_name)
    df_name = df_name.rename(columns = {0:"clause_text"})
    df_name["clause_type"]=clause

    print(df_name.shape)
    
    return df_name

In [32]:
#Automatic Renewals Clause dataframe
df_auto = clause_list_converter(auto_renewal_list, "df_auto", "automatic_renewal_clause")

(160, 2)


In [35]:
#Automatic Renewals dataframe has turned out ok.
df_auto.head(3)

Unnamed: 0,clause_text,clause_type
0,"Automatic Renewal. Upon the expiration of the original term or any renewal term of employment, Employee’s employment shall be automatically renewed for a one (1) year period unless, at least sixty (60) days prior to the renewal date, either party gives the other party written notice of its intent not to continue the employment relationship. During any renewal term of employment, the terms, conditions and provisions set forth in this Agreement shall remain in effect unless modified in accordance with Section 8.",automatic_renewal_clause
1,"Automatic Renewal. This Agreement shall be automatically extended for one additional year, unless on or before November 30, 2007 (for the initial term), or thirteen (13) months before the expiration of any extended term, either Party provides to the other written notice of its desire not to automatically renew this Agreement.",automatic_renewal_clause
2,"Automatic Renewal. This Agreement shall renew automatically, with respect to each series set forth in Schedule A, on the same terms, for a period of one year from the expiration of the waiver and/or expense commitment applicable to such series as set forth in Schedule A, unless prior to such an expiration, CMA and/or CMD provide notice to the appropriate Board of Trustees of a Company of any proposals to increase, decrease or eliminate the series’ fee waivers and/or expense commitment, or to change the time period covered or any other terms thereof, for a subsequent period. Any renewal of this Agreement with respect to a series does not preclude CMA or CMD from requesting that a Company’s Board of Trustees approve changes to the fee waivers and/or expense commitment, or to the time period covered or any other terms thereof, prior to a subsequent renewal.",automatic_renewal_clause


In [36]:
#we backup our Automatic Renewals Clauses dataframe into a .csv file
df_auto.to_csv("../data/df_auto_renewals.csv")

In [37]:
#we do the same for Renewal Option Clauses and Other Clauses:
df_option = clause_list_converter(optional_renewal_list, "df_option", "renewal_option_clause")

(583, 2)


In [39]:
#the Renewal Option Clause dataframe looks good too:
df_option.head(3)

Unnamed: 0,clause_text,clause_type
0,"Renewal Option. Lessor hereby grants Lessee (but no assignee or subtenant) two (2) options to renew this Lease, each option to be for a period of sixty(60) months, for a total of one hundred twenty (120) months in the event both renewal options are exercised. Each said renewal option shall be exercised by Lessee notifying Lessor thereof in writing not more than two hundred seventy (270) and at least two hundred ten (210) days prior to the expiration of the then current lease or renewal term, as the case may be. In the event a renewal agreement has not been executed at least one hundred twenty (120) days prior to the expiration date of the current lease or renewal term, the option shall automatically become null and void. Each such renewal shall be subject to all of the terms and conditions of this Lease except that (i) the rentals payable during each renewal term shall be as set forth below and (ii) no further renewal option shall exist during the second renewal term. It shall be a...",renewal_option_clause
1,"Renewal Option. This Contract may be renewed under the same terms and conditions, subject to the approval of the Commissioner of the Department of Administration and the State Budget Director in compliance with IC §5-22-17-4. The term of the renewed contract may not be longer than the term of the original contract.",renewal_option_clause
2,"Renewal Option. Landlord hereby grants to Tenant, and Tenant shall have, the right and option to extend the Term of this Lease for one (1) period of five (5) years (the “Renewal Term”). The Renewal Term shall commence upon the day next following the last day of the initial Term. Tenant shall notify Landlord in writing of its election to extend this Lease for the Renewal Term not less than six (6) months prior to the expiration of the initial Term, time being of the essence with respect to such notification. Notice thereof shall be deemed sufficient if given in the manner hereinafter provided. If Landlord does not receive such written notice as and when required herein, the Renewal Term shall terminate and be of no further force or effect, and this Lease shall expire as of the then-scheduled expiration date. The Renewal Term shall be upon all of the terms, covenants and conditions of this Lease, except that the Fixed Rent shall be increased by adding the CPI Adjustment Amount (defin...",renewal_option_clause


In [40]:
#backing that up too
df_option.to_csv("../data/df_optional_renewals.csv")

In [41]:
#we do the same for Renewal Option Clauses and Other Clauses:
df_others = clause_list_converter(other_clauses_list, "df_others", "other_clauses")

(9143, 2)


In [44]:
#our other clauses look good too:

df_others

Unnamed: 0,clause_text,clause_type
0,"Definitions. As used in this Agreement, the following terms shall have the following meanings:",other_clauses
1,Definitions. For purposes of this Agreement:,other_clauses
2,"Definitions. For purposes of this Agreement, the following terms shall have the following meanings:",other_clauses
3,Definitions. As used in this Agreement:,other_clauses
4,"Definitions. For all purposes of this Indenture, except as otherwise expressly provided or unless the context otherwise requires:",other_clauses
...,...,...
9138,"Compliance with Law. In the performance of such services as described herein, the Insurer shall comply with applicable laws, rules and regulations.",other_clauses
9139,Compliance with Law. It shall comply with all applicable laws relating to its performance under this Agreement.,other_clauses
9140,"Compliance with Law. JCR agrees to comply with all applicable laws, rules and regulations with respect to Product for use in the Field in the Territory.",other_clauses
9141,"Compliance with Law. Landlord and Tenant shall each do all acts necessary to comply with all applicable laws, statutes, ordinances, and rules of any public authority relating to their respective maintenance obligations as set forth herein. The provisions of Section 9.2. are deemed restated here.",other_clauses


In [45]:
#backing that up too
df_others.to_csv("../data/df_others.csv")

In [47]:
#we now merge everything together - start by merging both renewal clause types
df_merged = pd.merge(df_auto, df_option, how = "outer")

In [49]:
#then we merge with all other clauses
df_merged = pd.merge(df_merged, df_others, how = "outer")

In [51]:
#this gives us a fairly large dataframe suitable for Machine Learning
df_merged.shape

(9886, 2)

In [54]:
#and the dataframe does look fine. we back it up too:

df_merged.to_csv("../data/df_merged.csv")
df_merged 

Unnamed: 0,clause_text,clause_type
0,"Automatic Renewal. Upon the expiration of the original term or any renewal term of employment, Employee’s employment shall be automatically renewed for a one (1) year period unless, at least sixty (60) days prior to the renewal date, either party gives the other party written notice of its intent not to continue the employment relationship. During any renewal term of employment, the terms, conditions and provisions set forth in this Agreement shall remain in effect unless modified in accordance with Section 8.",automatic_renewal_clause
1,"Automatic Renewal. This Agreement shall be automatically extended for one additional year, unless on or before November 30, 2007 (for the initial term), or thirteen (13) months before the expiration of any extended term, either Party provides to the other written notice of its desire not to automatically renew this Agreement.",automatic_renewal_clause
2,"Automatic Renewal. This Agreement shall renew automatically, with respect to each series set forth in Schedule A, on the same terms, for a period of one year from the expiration of the waiver and/or expense commitment applicable to such series as set forth in Schedule A, unless prior to such an expiration, CMA and/or CMD provide notice to the appropriate Board of Trustees of a Company of any proposals to increase, decrease or eliminate the series’ fee waivers and/or expense commitment, or to change the time period covered or any other terms thereof, for a subsequent period. Any renewal of this Agreement with respect to a series does not preclude CMA or CMD from requesting that a Company’s Board of Trustees approve changes to the fee waivers and/or expense commitment, or to the time period covered or any other terms thereof, prior to a subsequent renewal.",automatic_renewal_clause
3,"Automatic Renewal. If a Holder of such Security has not delivered a Repayment Election for repayment of the Security on or prior to the 15th day following the Maturity Date, and the Company did not notify the Holder of its intention to repay the Security in the Notice of Maturity, then such maturing Security shall be extended automatically for an additional term equal to the original term, and shall be deemed to be renewed by the Holder and the Company as of the Maturity Date of such maturing Security. A maturing Security will continue to renew as described herein absent a Redemption Notice or Repurchase Request by the Holder or an indication by the Company that it will repay and not allow the Security to be renewed in the Notice of Maturity. Interest on the renewed Security shall accrue from the Issue Date thereof, which is the first day of such renewed term (i.e., the Maturity Date of the maturing Security). Such renewed Security will be deemed to have the identical terms and pro...",automatic_renewal_clause
4,Automatic Renewal. This Agreement shall be renewed automatically for succeeding terms of three (3) years each unless either party gives written notice to the other at least ninety (90) days prior to the expiration of any term of Executive’s or Company’s intention not to renew pursuant to Company’s bylaws.,automatic_renewal_clause
...,...,...
9881,"Compliance with Law. In the performance of such services as described herein, the Insurer shall comply with applicable laws, rules and regulations.",other_clauses
9882,Compliance with Law. It shall comply with all applicable laws relating to its performance under this Agreement.,other_clauses
9883,"Compliance with Law. JCR agrees to comply with all applicable laws, rules and regulations with respect to Product for use in the Field in the Territory.",other_clauses
9884,"Compliance with Law. Landlord and Tenant shall each do all acts necessary to comply with all applicable laws, statutes, ordinances, and rules of any public authority relating to their respective maintenance obligations as set forth herein. The provisions of Section 9.2. are deemed restated here.",other_clauses
