# Extract names from doc - baseline

The following solution uses a basic sequential approach to find the party names in an agreement.<br>
Output format:  Agreement Name, Party1:  YOU, Party2: Apple

For this baseline solution, **SyntaxNet** and **Machine Learning** are **NOT** used.  
These techniques will be used in later iterations to improve on the baseline performance.

**Challenges with current baseline solution:**
- Does not capture party names that contain multiple words.
- The current logic captures the first word after "between" as the first party and the first word after "and" as the second party. One of the test cases captured the word "the" instead of the party name.
- The major challenge with this approach is that handling the nuances of each agreement's writing style requires ever more lines of code, with no clear upper bound.

**Conclusion**: A more sophisticated solution is needed. Google's SyntaxNet could be applied to identify word types and dependencies in a sentence; these dependencies can then be used to retrieve the correct party names. Alternatively, the agreement text could be processed with a machine learning approach, in which an algorithm is trained to identify target labels (here, party names) in text. Once successfully trained, the algorithm can be applied to extract party names from unseen data.

In [1]:
# Import libraries
import pandas as pd
import re

Adjust scroll bar activation threshold...

In [2]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 100;


In [3]:
# Choose agreement file from list in csv

# read csv file
agreements_data = pd.read_csv("data/agreements_dataset1.csv")
print "\n","Agreements data read successfully!","\n"

# Select text file name
selected_text_file_name = agreements_data['file_name'][16]
print "Selected text file name from csv -> ", selected_text_file_name, "\n"


Agreements data read successfully! 

Selected text file name from csv ->  CareerBuilder_eula.txt 



In [4]:
# Cleanse and parse text into separate sentences

# Regex building blocks used by split_into_sentences (raw strings keep "\s" literal)
caps = r"([A-Z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr|www|1|2|3|4|5|6|7|8|9|10)[.]"
suffixes = r"(Inc|INC|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov)"

def split_into_sentences(text):
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = text.replace("\r"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(caps + "[.]" + caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(caps + "[.]" + caps + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + caps + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

filepath = "data/" + selected_text_file_name
with open(filepath, 'r') as my_agreement:
    my_agreement_text=my_agreement.read()

my_agreement_lines = split_into_sentences(my_agreement_text)

print "\n", "Agreement text has been cleansed and parsed into separate sentences.", "\n"


Agreement text has been cleansed and parsed into separate sentences. 
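
The placeholder idea behind `split_into_sentences` can be illustrated with a condensed sketch: periods that do not end a sentence are temporarily replaced with `<prd>`, every remaining terminator is tagged with `<stop>`, and the text is split on that tag. The function name `split_sentences_sketch`, the shortened `ABBREVIATIONS` list, and the sample text are assumptions made for illustration; this is not a replacement for the full function above, which handles many more cases.

```python
import re

# Hypothetical, shortened abbreviation list (illustration only).
ABBREVIATIONS = r"\b(Inc|Ltd|Co|Mr|Mrs|Dr)[.]"

def split_sentences_sketch(text):
    """Condensed version of the <prd>/<stop> placeholder technique."""
    # 1. Protect periods that belong to abbreviations.
    text = re.sub(ABBREVIATIONS, r"\1<prd>", text)
    # 2. Tag every remaining sentence terminator.
    text = re.sub(r"([.?!])", r"\1<stop>", text)
    # 3. Restore the protected periods, then split on the tags.
    text = text.replace("<prd>", ".")
    return [s.strip() for s in text.split("<stop>") if s.strip()]

# Made-up sample text, not taken from any agreement in the dataset.
sample = ("This Agreement is between you and Example Software Inc. of Boston. "
          "Do you accept? Read it carefully.")
sentences = split_sentences_sketch(sample)
# sentences == ["This Agreement is between you and Example Software Inc. of Boston.",
#               "Do you accept?",
#               "Read it carefully."]
```

Note that the period after "Inc" survives inside the first sentence instead of splitting it, which is the purpose of the `<prd>` placeholder.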



In [5]:
# Find sentences with "between"

key_sentences = []
type_of_sentence = ""
key_words = ["between"]
Party1 = "not found yet"
Party2 = "not found yet"

for line in my_agreement_lines:
    for word in key_words:
        if word in line.lower(): 
            key_sentences.append(line)
                
print "\n", "Number of sentences retrieved:", len(key_sentences),  "\n"
# print "\n", key_sentences, "\n"


Number of sentences retrieved: 2 



In [6]:
# Extract party names, branching on the number of sentences retrieved

def summary():
    print "\n", "Agreement:", selected_text_file_name
    print "Party1:",Party1,",","Party2:",Party2,"\n"


if len(key_sentences) == 0: 
    print "There is no 'between' clause in the document"
    print "Party1 ->",Party1
    print "Party2 ->",Party2
    
if len(key_sentences) == 1: 

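    # Sentences were selected case-insensitively, so match case-insensitively here too
    match1 = re.search(r'between (\w+)', key_sentences[0], flags=re.I) # re.I -> ignore upper/lower case
    after_match1 = re.search(r'between (.*)', key_sentences[0], flags=re.I)
    match2 = re.search(r'and (\w+)', after_match1.group(1), flags=re.I)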
    
    Party1 = match1.group(1)
    Party2 = match2.group(1) 
    summary()
    
if len(key_sentences) > 1: 
    
    key_sentences2 = []
    listOfWords = ['between','agreement']

    for line in key_sentences:
        if all(word in line.lower() for word in listOfWords):
            key_sentences2.append(line)
    
    if len(key_sentences2) == 0:
        print "no key sentences returned."
        
    if len(key_sentences2) == 1:
        
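        # Use the filtered list (key_sentences2), not the unfiltered key_sentences
        match1 = re.search(r'between (\w+)', key_sentences2[0], flags=re.I) # re.I -> ignore upper/lower case
        after_match1 = re.search(r'between (.*)', key_sentences2[0], flags=re.I)
        match2 = re.search(r'and (\w+)', after_match1.group(1), flags=re.I)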
        
        Party1 = match1.group(1)
        Party2 = match2.group(1)
        summary()
        
    if len(key_sentences2) > 1:
        print "Number of sentences retrieved:", len(key_sentences2)
        print "Processing the first retrieved sentence"
        # print key_sentences2
        
        match1 = re.search(r'between (\w+)', key_sentences2[0], flags=re.I) # re.I -> ignore upper/lower case
        after_match1 = re.search(r'between (.*)', key_sentences2[0], flags=re.I)
        match2 = re.search(r'and (\w+)', after_match1.group(1), flags=re.I)
    
        Party1 = match1.group(1)
        Party2 = match2.group(1)
        summary()


Agreement: CareerBuilder_eula.txt
Party1: you , Party2: CareerBuilder 
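
The branching in the cell above (no "between" sentence, exactly one, or several that are then narrowed by the word "agreement") could be collected into one helper, which may make the selection rule easier to test separately from the regex extraction. The sketch below is one possible arrangement, not the notebook's actual code; the name `pick_key_sentence` and the sample sentences are hypothetical.

```python
def pick_key_sentence(sentences):
    """Return the sentence the baseline would extract parties from, or None."""
    with_between = [s for s in sentences if "between" in s.lower()]
    if len(with_between) <= 1:
        # No match -> None; exactly one match -> use it directly.
        return with_between[0] if with_between else None
    # Several candidates: keep only those that also mention "agreement"
    # and process the first one, as the baseline does.
    narrowed = [s for s in with_between if "agreement" in s.lower()]
    return narrowed[0] if narrowed else None

# Made-up sentences for illustration.
doc = [
    "Choose between the free and paid plans.",
    "This Agreement is between you and ExampleSoft.",
    "Thank you.",
]
chosen = pick_key_sentence(doc)
# chosen == "This Agreement is between you and ExampleSoft."
```

Returning `None` instead of printing a message leaves it to the caller to decide how to report a document without a usable "between" clause.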



**Successful extracts**

1. ABBYY_eula.txt, Party1: you , Party2: ABBYY 
2. Aeria Games & Entertainment Inc_eula.txt, Party1: Licensor , Party2: you 
3. AllCursors_eula.txt, Party1: you , Party2: Licensor 
4. AnyChart_eula.txt, Party1: you , Party2: Sibental
5. AOL Inc_eula.txt, Party1: you , Party2: us
6. Bitdefender_eula.txt, Party1: you , Party2: BITDEFENDER
7. BTC_eula.txt, Party1: you , Party2: bigtincan
8. Caphyon_eula.txt, Party1: YOU , Party2: CAPHYON 
9. CareerBuilder_eula.txt, Party1: you , Party2: CareerBuilder 
10. Caristix_eula.txt, Party1: you , Party2: Caristix 
<br>
...
    

**FAILED extracts:**

1. Google_Construction_Agreement_C.txt, Party1: GOOGLE , Party2: S    
2. 2Think1 Solutions Inc_eula.txt, Party1: you , Party2: 2THINK1 
3. ALM Works Ltd_eula.txt, Party1: you , Party2: the
4. app square OG_eula.txt, Party1: you , Party2: appsquare 
5. APPEARTOME LIMITED_eula.txt, Party1: you , Party2: Appeartome 
6. Atlassian_eula.txt, Party1: you , Party2: Atlassian
7. Avanquest Software SA_eula.txt, Party1: you , Party2: Avanquest
8. Blizzard Entertainment Inc_eula.txt, Party1: Blizzard , Party2: you 
9. ChemAxon Ltd_eula.txt, Party1: you , Party2: ChemAxon
10. Cloudfind Limited_eula.txt, Party1: you , Party2: Cloudfind
<br>
...
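
The failure patterns listed above can be reproduced with the same two regular expressions on a few sentences. The sentences below are invented for illustration (the actual agreement texts are not shown here), and `first_word_parties` is a hypothetical wrapper around the baseline patterns; the sketch only suggests how outputs such as "the", a truncated company name, or a single letter can arise.

```python
import re

def first_word_parties(sentence):
    """Apply the baseline rule: first word after 'between', first word after 'and'."""
    party1 = re.search(r'between (\w+)', sentence, flags=re.I)
    rest = re.search(r'between (.*)', sentence, flags=re.I)
    party2 = re.search(r'and (\w+)', rest.group(1), flags=re.I)
    return party1.group(1), party2.group(1)

# Made-up sentences, each triggering a different failure mode.
examples = [
    "This agreement is between you and the Licensor named below.",
    "This agreement is between you and Example Solutions Inc.",
    "This agreement is between Example Holdings and S&T Builders.",
]
results = [first_word_parties(s) for s in examples]
# results[0] == ("you", "the")      -> an article is captured instead of a name
# results[1] == ("you", "Example")  -> a multi-word name is cut after its first word
# results[2] == ("Example", "S")    -> "\w+" stops at "&", leaving one letter
```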
    
