# Summary

This notebook demonstrates the current progress in training lightweight machine learning models on customized datasets with the spaCy framework.<br>
Details of the datasets and models are documented below or in separate notebooks.<br>

A sample notebook showing how the dataset is created (by insertion) and how a model is trained is available at https://github.com/azavea/cicero-nlp/blob/vanilaNER/Part2-Light_Spacy_Models/TrainSpacy_traing6.ipynb

## Load Data

Load the data used for demonstration.<br>
Note that the data loaded here differs from the training datasets of most models; only model 2 was trained on it.<br>

In [1]:
from bs4 import BeautifulSoup
import collections
from tqdm import tqdm
import pandas as pd

In [2]:
def remove_tags(html):
  
    # parse html content
    soup = BeautifulSoup(html, "html.parser")
  
    for data in soup(['style', 'script']):
        # Remove tags
        data.decompose()
  
    # return data by retrieving the tag content
    return ' '.join(soup.stripped_strings)
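
To make the effect of `remove_tags` concrete, here is a minimal sketch of the same idea using only the standard library's `html.parser` rather than BeautifulSoup. It is not the notebook's implementation: the class `VisibleTextParser`, the helper `visible_text`, and the toy HTML string are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <style> or <script>."""

    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside <style>/<script>
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Toy input (assumption): a tiny page with a style block and a script block
example_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Senate Roster</h1><script>var x = 1;</script>"
    "<p>District 9</p></body></html>"
)
print(visible_text(example_html))  # Senate Roster District 9
```

As in `remove_tags`, each text fragment is stripped and the fragments are joined with single spaces, which is why the extracted pages further down read as one long line of navigation labels and body text.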

In [3]:
import os, glob
# Maps page id (file name without extension) -> visible text of the page
result_dict = {}
path = '/content/drive/MyDrive/Cicero/webpages'
for filename in tqdm(glob.glob(os.path.join(path, '*.html'))):
    with open(filename, 'r') as f:  # open in read-only mode
        # e.g. '.../webpages/363085.html' -> '363085'
        page_id = os.path.splitext(os.path.basename(filename))[0]
        result_dict[page_id] = remove_tags(f.read())

100%|██████████| 1548/1548 [02:08<00:00, 12.07it/s]
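
The loading loop above keys each page by its file name without the extension. A hedged sketch of the same keying with `pathlib` follows; the helper `load_pages`, the temporary directory, and the toy file contents are assumptions made for illustration, and the explicit UTF-8 encoding is a choice of this sketch (the notebook opens files with the platform default).

```python
import tempfile
from pathlib import Path


def load_pages(directory):
    """Map each '*.html' file's stem (name without extension) to its raw content."""
    pages = {}
    for html_file in sorted(Path(directory).glob("*.html")):
        pages[html_file.stem] = html_file.read_text(encoding="utf-8")
    return pages


# Toy directory (assumption): one HTML page and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "363085.html").write_text("<p>Senator Stover</p>", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("ignored", encoding="utf-8")
    pages = load_pages(tmp)

print(pages)  # {'363085': '<p>Senator Stover</p>'}
```

`Path.stem` does not depend on the path separator, whereas splitting on `'/'` assumes POSIX-style paths such as the Colab paths used here.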


In [4]:
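# Skip malformed rows instead of raising; `error_bad_lines` is deprecated since
# pandas 1.3 and removed in 2.0, where `on_bad_lines='warn'` replaces it.
sample = pd.read_csv('/content/drive/MyDrive/Cicero/cicero_officials_sample_2022-09-08.csv',
                     error_bad_lines=False)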



b'Skipping line 13: expected 55 fields, saw 60\nSkipping line 34: expected 55 fields, saw 59\nSkipping line 110: expected 55 fields, saw 60\nSkipping line 125: expected 55 fields, saw 60\nSkipping line 127: expected 55 fields, saw 60\nSkipping line 171: expected 55 fields, saw 60\nSkipping line 175: expected 55 fields, saw 65\nSkipping line 181: expected 55 fields, saw 60\nSkipping line 196: expected 55 fields, saw 60\nSkipping line 197: expected 55 fields, saw 58\nSkipping line 206: expected 55 fields, saw 60\nSkipping line 220: expected 55 fields, saw 60\nSkipping line 261: expected 55 fields, saw 60\nSkipping line 272: expected 55 fields, saw 63\nSkipping line 279: expected 55 fields, saw 60\nSkipping line 353: expected 55 fields, saw 58\nSkipping line 356: expected 55 fields, saw 59\nSkipping line 407: expected 55 fields, saw 57\nSkipping line 429: expected 55 fields, saw 60\nSkipping line 446: expected 55 fields, saw 69\nSkippi
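
The messages above mean that some rows contain more fields than the header declares ("expected 55 fields, saw 60"), so those rows are dropped. The following standard-library sketch shows one way to list such rows before loading. The function `split_rows_by_width` and the toy CSV text are hypothetical, and the line numbers it reports are simply the `csv` reader's own count, which is not guaranteed to match the numbering in the pandas messages.

```python
import csv
import io


def split_rows_by_width(csv_text):
    """Separate rows that fit the header from rows with too many fields."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    good_rows, bad_line_numbers = [], []
    for row in reader:
        if len(row) > len(header):
            # reader.line_num is the number of source lines read so far
            bad_line_numbers.append(reader.line_num)
        else:
            good_rows.append(row)
    return header, good_rows, bad_line_numbers


# Toy data (assumption): the third line has two extra fields
toy_csv = (
    "id,party,last_name\n"
    "1,PartyA,NameA\n"
    "2,PartyB,NameB,extra,field\n"
    "3,PartyC,NameC\n"
)
header, good_rows, bad_line_numbers = split_rows_by_width(toy_csv)
print(len(good_rows), bad_line_numbers)  # 2 [3]
```

Extra fields like these often come from unquoted commas inside a value, so inspecting the flagged lines may allow the rows to be repaired rather than skipped.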

In [5]:
sample.iloc[0]['id']

363085

In [6]:
result_dict.get('363085')

"West Virginia Senate skip navigation SENATE PRESIDENT SENATORS COMMITTEES DISTRICT MAPS VIDEO/AUDIO SENATE CLERK SENATE RULES HOUSE SPEAKER DELEGATES COMMITTEES DISTRICT MAPS VIDEO/AUDIO HOUSE CLERK HOUSE RULES JOINT INTERIM COMMITTEES ADMINISTRATION DIVISION JUDICIAL COMP. COMMISSION BUDGET DIVISION DIVISION OF REGULATORY AND FISCAL AFFAIRS CLAIMS COMMISSION CRIME VICTIMS LASD LEGISLATIVE SERVICES PERD POST AUDIT PUBLIC INFORMATION RULE-MAKING REVIEW SPECIAL INVESTIGATIONS JOINT RULES REDISTRICTING STAFF INFO BILL STATUS BILL STATUS BILL TRACKING STATE LAW WEST VIRGINIA CODE ACTS OF THE LEGISLATURE CODE OF 1931 WV CONSTITUTION US CONSTITUTION REPORTS AGENCY REPORTS AGENCY GRANT AWARDS PERFORMANCE EVALUATIONS POST AUDITS EDUCATIONAL CITIZEN’S GUIDE INTERNSHIP PROGRAM PAGE PROGRAM PUBLICATIONS PHOTO GALLERY CAPITOL HISTORY HOW A BILL BECOMES LAW CONTACT SENATE ROSTER HOUSE ROSTER PUBLIC INFO. WEBMASTER HELPFUL LINKS West Virginia State Senate House Roster | Senate Roster Find by Name C

## Show Models

In [12]:
import spacy
from spacy import displacy

Test sample 1

In [19]:
sample.iloc[0]

id                                                                              363085
party                                                                       Republican
initial_term_start_date                                                            NaN
initial_term_start_date_precision                                                  NaN
valid_from                                                      2020-12-01 00:00:00+00
valid_from_precision                                                                 D
valid_to                                                        2022-12-01 00:00:00+00
valid_to_precision                                                                   D
first_name                                                                       David
middle_initial                                                                    Bugs
last_name                                                                       Stover
name_suffix                                

In [21]:
sample.iloc[0]['url_1']

'https://www.wvlegislature.gov/Senate1/lawmaker.cfm?member=Senator%20Stover'

In [8]:
sample1 = result_dict.get(str(sample.iloc[0]['id']))
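
The `str(...)` conversion in the lookup matters because the ids in the CSV are parsed as integers while the page dictionary is keyed by file-name strings. A small sketch with a toy dictionary (the shortened page text is a placeholder, not the full page):

```python
# Toy stand-in for result_dict: keys are strings taken from file names
pages = {"363085": "West Virginia Senate skip navigation"}

record_id = 363085  # ids read from the CSV are integers

# An integer key never matches a string key, so .get() silently returns None
print(pages.get(record_id))       # None
print(pages.get(str(record_id)))  # West Virginia Senate skip navigation
```

Because `dict.get` returns `None` instead of raising, a missing conversion would only surface later, when the model is called on `None`.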

In [9]:
sample1

"West Virginia Senate skip navigation SENATE PRESIDENT SENATORS COMMITTEES DISTRICT MAPS VIDEO/AUDIO SENATE CLERK SENATE RULES HOUSE SPEAKER DELEGATES COMMITTEES DISTRICT MAPS VIDEO/AUDIO HOUSE CLERK HOUSE RULES JOINT INTERIM COMMITTEES ADMINISTRATION DIVISION JUDICIAL COMP. COMMISSION BUDGET DIVISION DIVISION OF REGULATORY AND FISCAL AFFAIRS CLAIMS COMMISSION CRIME VICTIMS LASD LEGISLATIVE SERVICES PERD POST AUDIT PUBLIC INFORMATION RULE-MAKING REVIEW SPECIAL INVESTIGATIONS JOINT RULES REDISTRICTING STAFF INFO BILL STATUS BILL STATUS BILL TRACKING STATE LAW WEST VIRGINIA CODE ACTS OF THE LEGISLATURE CODE OF 1931 WV CONSTITUTION US CONSTITUTION REPORTS AGENCY REPORTS AGENCY GRANT AWARDS PERFORMANCE EVALUATIONS POST AUDITS EDUCATIONAL CITIZEN’S GUIDE INTERNSHIP PROGRAM PAGE PROGRAM PUBLICATIONS PHOTO GALLERY CAPITOL HISTORY HOW A BILL BECOMES LAW CONTACT SENATE ROSTER HOUSE ROSTER PUBLIC INFO. WEBMASTER HELPFUL LINKS West Virginia State Senate House Roster | Senate Roster Find by Name C

Model 1

Model 1 is trained on the customized dataset, with multiple attributes inserted and multiple attributes tagged.
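
The notebook does not show how attributes are inserted; the linked sample notebook covers the actual procedure. As a rough illustration only, the sketch below inserts a value into a text and records its character offsets as a `(start, end, label)` tuple, the character-offset style spaCy accepts for entity annotations. The helper `insert_attribute`, the sentence, the address, and the `ADDRESS` label are all hypothetical.

```python
def insert_attribute(text, position, value, label):
    """Insert `value` (preceded by a space) into `text` at `position`.

    Returns the new text and a (start, end, label) span that covers
    exactly the inserted value.
    """
    new_text = text[:position] + " " + value + text[position:]
    start = position + 1  # skip the space added before the value
    end = start + len(value)
    return new_text, (start, end, label)


# Toy sentence with a gap after "at" (assumption)
base_text = "Contact the office at for more information."
position = base_text.index(" for")
new_text, span = insert_attribute(base_text, position, "123 Main Street", "ADDRESS")

print(new_text)                    # Contact the office at 123 Main Street for more information.
print(new_text[span[0]:span[1]])   # 123 Main Street
```

Because the span is computed at insertion time, the label is exact by construction, which is the main appeal of building a dataset this way instead of labeling by hand.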

In [13]:
train1 = spacy.load("/content/drive/MyDrive/workdata_10 05 2022/model-best")

In [14]:
doc = train1(sample1)
displacy.render(doc, style="ent", jupyter=True)

Model 2

Model 2 is trained on the labeled dataset, which is exactly the same as the demonstration data.
Although its result below looks good, this is only because the model was trained on this data and has memorized the answers (including the mislabeled ones).
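
A common way to avoid judging a model on pages it has memorized is to hold some pages out of training. The sketch below splits page ids into disjoint training and evaluation sets; the function `split_ids`, the 20% fraction, the seed, and the toy ids are assumptions for illustration, not something the notebook does.

```python
import random


def split_ids(page_ids, eval_fraction=0.2, seed=0):
    """Shuffle ids reproducibly and split them into (train_ids, eval_ids)."""
    ids = sorted(page_ids)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    n_eval = max(1, int(len(ids) * eval_fraction))
    return ids[n_eval:], ids[:n_eval]


# Toy ids (assumption); in the notebook these would be the keys of result_dict
toy_ids = [str(i) for i in range(10)]
train_ids, eval_ids = split_ids(toy_ids)

# No page may appear on both sides of the split
assert not set(train_ids) & set(eval_ids)
print(len(train_ids), len(eval_ids))  # 8 2
```

Rendering a model's output on a page from `eval_ids` then shows what it has learned rather than what it has seen.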

In [15]:
train2 = spacy.load('/content/drive/MyDrive/worddata_10 08 2022/model-best')

In [16]:
doc = train2(sample1)
displacy.render(doc, style="ent", jupyter=True)

Model 3

Model 3 is trained on the customized dataset, with multiple attributes inserted but only the address attributes tagged.

In [17]:
train3 = spacy.load('/content/drive/MyDrive/workdata_10 09 2022/model1/model-best')

In [18]:
doc = train3(sample1)
displacy.render(doc, style="ent", jupyter=True)

Model 4

Model 4 is trained on the customized dataset, with multiple attributes inserted but only the address attributes tagged. To improve the model's ability to differentiate address information from other entities, named entities extracted from [CoNLL-2003](https://paperswithcode.com/dataset/conll-2003) are also inserted into the dataset.
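
How the CoNLL-2003 entities are mixed in is not shown here, so the following is only a sketch of the general idea: distractor entities are placed in the text without a label, while the address keeps its tag, so the model sees non-address entities as negative examples. The helper `build_example`, the segments, and the `ADDRESS` label are hypothetical.

```python
def build_example(segments):
    """Join (text, label_or_None) segments with spaces.

    Returns the joined text and (start, end, label) spans for the
    segments that carry a label; unlabeled segments get no span.
    """
    parts, spans, cursor = [], [], 0
    for text, label in segments:
        if parts:
            cursor += 1  # account for the joining space
        if label is not None:
            spans.append((cursor, cursor + len(text), label))
        parts.append(text)
        cursor += len(text)
    return " ".join(parts), spans


# Toy segments (assumption): a person name as an untagged distractor
segments = [
    ("Senator", None),
    ("Jane Doe", None),  # distractor entity, inserted but left untagged
    ("can be reached at", None),
    ("123 Main Street, Springfield", "ADDRESS"),
]
text, spans = build_example(segments)

print(spans)  # [(35, 63, 'ADDRESS')]
start, end, label = spans[0]
print(text[start:end])  # 123 Main Street, Springfield
```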

In [22]:
train4 = spacy.load('/content/drive/MyDrive/workdata_10 09 2022/model2/model-best')

In [23]:
doc = train4(sample1)
displacy.render(doc, style="ent", jupyter=True)