# extract-pages-from-mongo-v7
SanjayKAroraPhD@gmail.com <br>
December 2018

## Description
This version of the notebook extracts groups of pages from mongodb by firm_name to create firm-centric <b>about</b> page output files that can later be topic modeled.  In doing so, it removes repetitive content (e.g., repeated menu items) and garbage content (e.g., improperly parsed HTML code). 

## Change log
v4 focuses on about pages

## TODO:
* Whole process: get data, topic model and see if it looks sufficiently interesting/different
* Enhance data collection, per the following: 
    * Select a region or country — WAIT 
        * http://www.ivoclarvivadent.com: Please select your region
        * https://www.enersys.com/: PLEASE SELECT A REGION
        * https://www.m-petfilm.com/: ENGLISH
    * Crawl from focal about page only following links that look like part of the about story, maintaining ordering.  Check to see if the other links identified above are also there? 
        * http://xtalsolar.com/investors_partners.html
* Order known about us pages in the same way the links are found on a home page or about us landing page

In [1]:
# import data processing and other libraries
import csv
import sys
import requests
import os
import re
import pprint
import pymongo
import traceback
from time import sleep
import requests
import pandas as pd
import io
from IPython.display import display
import time
import numpy as np
from bs4 import BeautifulSoup
import string
import random
from urllib.parse import urlparse, urljoin
from collections import defaultdict
from collections import OrderedDict
import collections

In [2]:
from boilerpipe.extract import Extractor

In [3]:
# import sklearn
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

In [4]:
MONGODB_DB = "FirmDB_20181226"
MONGODB_COLLECTION = "pages_ABOUT2"
CONNECTION_STRING = "mongodb://localhost"

client = pymongo.MongoClient(CONNECTION_STRING)
db = client[MONGODB_DB]
col = db[MONGODB_COLLECTION]

ABOUT_DIR = '/Users/sarora/dev/EAGER/data/orgs/about/'
DATA_DIR = '/Users/sarora/dev/EAGER/data/orgs/parsed_page_output/'
TRAINING_PERCENT = .10

PHRASE_LENGTH = 60
MIN_PARA_LEN = 5

pp = pprint.PrettyPrinter()

In [369]:
# output urls for labeling of training data
results = col.find({},{"url": 1, "firm_name": 1})
df = pd.DataFrame(columns = ('firm_name', 'url'))
for i in range(results.count()):
    result = results.next()
    url = result['url'][0]
    firm_name = result['firm_name'][0] if 'firm_name' in result else ''
    df.loc[i] = [firm_name, url]
    
df['gid'] = df.groupby(['firm_name']).ngroup()

In [6]:
df.gid.nunique()
label_ids = random.sample(range(1, df.gid.nunique()), 200)
df_label = df[df['gid'].isin(label_ids)]
with open(ABOUT_DIR + 'about_pages_to_label.csv', mode='w') as to_label:
    df_label.to_csv(to_label, index=False)

In [7]:
# read back labeled data (note that about, management/team and partners, are dichotomous)
df_about_labeled = pd.read_csv(ABOUT_DIR + 'about_pages_labeled_v4.csv')
df_about_labeled = df_about_labeled.fillna(0)
df_about_labeled['pages_in_domain_ftr'] = df_about_labeled.groupby(["firm_name"])["url"].transform("count")

labeled_urls = list(df_about_labeled['url']) # for training models on labeled urls below
df_about_labeled = df_about_labeled.set_index(['firm_name', 'url'])
df_about_labeled.head()

# final test set is the rows of the original data frame without the urls in df_about_labeled 

Unnamed: 0_level_0,Unnamed: 1_level_0,about_lbl,mgmt_lbl,partners_lbl,ip_lbl,about_agg_lbl,gid,pages_in_domain_ftr
firm_name,url,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3M Innovative Properties Company,https://www.3m.com/,0.0,0.0,0.0,0.0,0.0,1.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/3m-science-applied-to-life/,0.0,0.0,0.0,0.0,0.0,0.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/,1.0,0.0,0.0,0.0,1.0,1.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/state-of-science-index-survey/,0.0,0.0,0.0,0.0,0.0,0.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/technologies/,1.0,0.0,0.0,0.0,0.0,0.0,16


In [8]:
df_about_labeled.shape

(1455, 7)

## Create features to predict about pages
Create features:
1. number of domain pages (identified above)
2. whether a given page is an about us page (as opposed to a home page)
3. number of words on a page (and share of words on a page given all candidate pages)
4. number of sentences (and share fo sentences on a page given all candidate pages)
5. title and url path fragment unigrams (also tried n-grams) with worse results
6. descriptor text around the focal link (as identified upstream when crawling as as persisted to mongodb then)

Other text-based ideas for features may be found here: https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41

In [9]:
# pattern regex to remove unwanted words that show up in topic models
p = re.compile(r"(\(\)|''|``|\"|null|ul|li|ol|^\.|^:|^/|\\|--|cooki|'s|corpor|busi|inc\.|ltd|co\.|compan|keyboard|product|technolog)", flags=re.IGNORECASE)

# remove html content
def is_javascript (x):
    match_string = r"(CDATA|return\s+true|return\s+false|getelementbyid|function|\w+\(.*?\);|\w{2,}[\\.|:]+\w{2,}|header|hover|'\w+':\s+'\w+|\\|{|}|\r|\n|\/\/')"
    # capture CDATA; function declarations; function calls; word sequences separated by a period (e.g., denoting paths)
    regex = re.findall(match_string, x) 
    # check to see if the regex finds some percentage of the words look like javascript patterns
    if (len(regex) / float(len(x.split())) > .10):
        return True 
    else:
        return False

def clean_page_content (text_list):
    # remove whatever we think is html
    removed_html = filter(lambda x: not( bool(BeautifulSoup(x, "html.parser").find()) ), text_list)
    # remove content that looks like javascript 
    removed_js = filter(lambda x: not (is_javascript(x)), removed_html)
    # add other checks here as needed

    return removed_js

# this method called from below
def count_page_features (result): 
    if not result:
        return 0, 0
    
    # get number of words
    running_text = ''
    clnd_text = clean_page_content(result['full_text'])
    clnd_text = '\n'.join(clnd_text)
    boilerpipe = None

    if 'body' in result:
        extractor = Extractor(extractor='DefaultExtractor', html = result['body'][0])
        lines = extractor.getText().replace(u'\xa0', u' ').split('\n')
        filtered = filter(lambda x: not re.match(r'^\s*$', x), lines)
        boilerpipe = '\n'.join(filtered)

    # TODO fix to split().  Counting characters currently 
    if boilerpipe and (len(boilerpipe) > .5 * len(clnd_text)):
        running_text += boilerpipe
    else:
        running_text += clnd_text
    
    num_words = len(running_text.split())
    num_sentences = 0
    
    # loop over text and add title elements to the paragraph they describe
    document = running_text.split('\n')
    for i in range(len(document)): # figure out a way to chunk groups of content
        if len(document[i]) <= 12 or len(document[i].split()) < MIN_PARA_LEN : # maybe a menu or simple pagragraph heading? 
            continue
        num_sentences += len(re.findall(r'(\.|;|\!)( |$)', document[i])) # count what appears to be number of sentences sentence

    # pp.pprint (joined)
    return num_words, num_sentences

In [10]:
# remove simple article words and punctuation (need to keep 'about')
stop_words = ['the','a'] + list(string.punctuation) 
# remove known company names for model training and evaluation in the labeled data 
remove_regex = re.compile(r'^(3m|united|states|menu|en_us|algeternal|s\d+|sarepta|skygen|nexgen|abbott|adlens|errorpage|\d{1,3}|\d{5,}|\w+\d+|\d+\w+|asten|johnson|baker|hughes|ge|bhge|biocon|egfr|gcsf|biocon|pegfilgrastim|bostik|canon|chevron|phillips|coloplast|cyberonics|microsoft|evoqua|ford|hitachi|glucanbio|hunter|douglas|kimberly|clark|lextar|fisher|lockheed|martin |lux|nec|nanocopoeia|cisco|schlumberger|weccamerica|inanobio|nanocomposix|zoetis|zygo)$', re.IGNORECASE)
# used to filter top-level header content
header_in = re.compile('(about|company|corporate|who.we.are|(^|/)vision|awards|profile|corporate|management|team|history|values|strategy|our |technology|research|commercialization)', flags=re.IGNORECASE)
header_regex = re.compile(r'h[1-9]+')

def get_domain (url):
    o = urlparse(url.lower())
    domain = o.netloc.strip('www.')
    return domain

def strip_firm_name (firm_name, text):
    strip_regex = re.compile(r"(" + "\s|".join(firm_name.split()) + "\s)", re.IGNORECASE)
    clnd_text = strip_regex.sub ('', text)
    
    more_regex = re.compile(r"([A-Z]\.?){1,} ")
    clnd_text = more_regex.sub ('', clnd_text)
    
    return ' '.join(clnd_text.split(' '))

# standard firm cleaning regex
def clean_firm_name (firm):
    firm_clnd = re.sub('(\.|,| corporation| incorporated| llc| inc| international| gmbh| ltd)', '', firm, flags=re.IGNORECASE).rstrip()
    return firm_clnd

def clean_string(in_string):
    if not in_string:
        return in_string
    split_words = in_string.lower().split()
    result_words  = [word for word in split_words if word not in stop_words]
    result_words  = [word for word in result_words if not remove_regex.search(word)]
    result = ' '.join(result_words)
    return ' ' + result

def get_page_path_text (url):
    o = urlparse(url.lower())
    path = o.path
    path_parts = path.split ('/')
    path_parts = [part.split('.')[0] for part in path_parts] # remove page names
    path_parts = [split for part in path_parts for split in part.split('-') ] # split on underscores, hyphens, et al
    path_parts = [split for part in path_parts for split in part.split('_') ] 
    clnd_string = clean_string(' '.join(path_parts))
    return clnd_string

# recurse through the header text to add into feature grams
def get_header_text (headers, names, index):
    texts = [clean_string(header.text) for header in headers if header.name == names[index]]
    texts = list(filter(header_in.search, texts))
    if texts and len(texts[0].split()) > 4:
        if(len(names) > (index + 1)):
            return get_header_text (headers, names, index + 1)
        else:
            return ''
    else: 
        return ' '.join (texts)

# load page data and create features (this method kicks everything else off)
def process_firms (urls): 
    firm_text_features = {}
    firm_count_features = {} # ['is_about', 'num_words', 'num_sentences']
    
    for url in urls: 
        result = col.find_one({"url": url})
        if not result:
            result = col.find_one({"orig_url": url})
        if not result: # just can't find the page
            print ('Cannot find ' + url)
            continue
        
        # --------------------
        # text based features
        # --------------------
        firm_name = result
        domain = get_domain(url)
        
        if 'html' not in result:
            print ('Cannot find html for', url)
            continue
            
        html = result['html'][0]
        
        descriptor_text = ''
        
        # text from the text wrapping the link
        descriptor = result['descriptor'][0].replace ('|', '').replace('None', '')
        if descriptor: 
            descriptor_text = clean_string(descriptor)
        
        # path text within the url 
        path_text = get_page_path_text(url)
         
        # title from the html page
        soup = BeautifulSoup(html, 'lxml')
        title_text = ''
        if soup.title and soup.title.string:
            # print (soup.title.string)
            title_text = clean_string(soup.title.string)
        
        # headers from the page
        header_text = ''
        headers = soup.find_all(header_regex, text=True)
        names = sorted(set ([header.name for header in headers]))
        header_text = get_header_text (headers, names, 0)

        firm_name = result['firm_name'][0]      
        firm_text_features[url] = [strip_firm_name(descriptor_text), strip_firm_name(path_text), strip_firm_name(title_text), strip_firm_name(header_text)]
        
        # --------------------
        # count based features
        # --------------------        
        is_about = int(result['is_about'][0])
        num_words, num_sentences = count_page_features (result)
        firm_count_features[url] = [is_about, num_words, num_sentences]
        
    return firm_text_features, firm_count_features

In [61]:
# Test various methods and regex used above 
print (get_page_path_text ('http://biocon.com/biocon_aboutus_businesses.asp'))

print (get_page_path_text('http://www.google.com/path-en/path_to/page.html'))
print (re.split("\W+|_", "Testing this_thing"))
print (clean_string('3m 01	08	100	10m ford 235 1990 s129 188209 0913lk the ? about us'))
print (clean_string('3m 01	08	100	10m ford 235 1990 s129 188209 0913lk the ? about us'))
pp.pprint (list(filter(header_in.search, ['about us', 'not found', 'company'])))

print (strip_firm_name('Ford Motor Company', 'This Ford Motor Company has been around for a while.'))
print (strip_firm_name (clean_firm_name('Ford Motor Company'), 'Ford is a motor company.  It has been building vehicles for over a century. H.W.F_ Ford was a nice guy.'))

 aboutus businesses
 path en path to page
['Testing', 'this', 'thing']
 about us
 about us
['about us', 'company']
This has been around for a while.
is a company.  It has been building vehicles for over a century. H.W.F_ was a nice guy.


In [13]:
# get firm website data for n-gram processing AND grab count features
labeled_firm_text_features, labeled_firm_count_features = process_firms (labeled_urls)

Cannot find http://www.genomichealth.com/en-US/


In [62]:
# testing, should be  
# pages about us about us – about us 
# [1, 882, 19]
print(labeled_firm_text_features['https://nanocomposix.com/pages/about-us'])
print(labeled_firm_count_features['https://nanocomposix.com/pages/about-us'])

['', ' pages about us', ' about us –', ' about us']
[1, 882, 19]


### Process firm text features into a vectorized format

In [151]:
urls = labeled_firm_text_features.keys() # create in an order
print (len(urls))
corpus = []


for url in urls:
    running_text = ''
    descriptor_text = labeled_firm_text_features[url][0]
    path_text = labeled_firm_text_features[url][1]
    title_text = labeled_firm_text_features[url][2]
    header_text = labeled_firm_text_features[url][3]
    running_text = descriptor_text + path_text + title_text + header_text
    corpus.append (running_text)
    
# unigram
ubv = TfidfVectorizer(min_df=0., max_df=1.)
# you can set the n-gram range to 1,2 to get unigrams as well as bigrams (performs worse than just unigrams)
# ubv = TfidfVectorizer(ngram_range=(1,2)) 

ubv_matrix = ubv.fit_transform(corpus)

ubv_matrix = ubv_matrix.toarray()
vocab = ubv.get_feature_names()
ubv_df = pd.DataFrame(ubv_matrix, columns=vocab)
ubv_df.index = urls
ubv_df.index.name='url'
ubv_df.head()

1454


Unnamed: 0_level_0,003,10,10m,10mseriesafinancing,13485,14001,1933,1961,1962,1975,...,zegage,zeitung,zeno,zero,zerowastemanagement,zoetis,zonne,中文,公司简介,隆達電子
url,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
https://www.3m.com/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.3m.com/3M/en_US/company-us/3m-science-applied-to-life/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.3m.com/3M/en_US/company-us/about-3m/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.3m.com/3M/en_US/company-us/about-3m/state-of-science-index-survey/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.3m.com/3M/en_US/company-us/about-3m/technologies/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [152]:
count_df = pd.DataFrame.from_dict(labeled_firm_count_features, orient='index', columns = ['is_about_ftr', 'num_words_ftr', 'num_sentences_ftr'])
count_df.index.name = 'url'
count_df.head()

Unnamed: 0_level_0,is_about_ftr,num_words_ftr,num_sentences_ftr
url,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
https://www.3m.com/,1,3487,210
https://www.3m.com/3M/en_US/company-us/3m-science-applied-to-life/,1,1140,14
https://www.3m.com/3M/en_US/company-us/about-3m/,1,854,38
https://www.3m.com/3M/en_US/company-us/about-3m/state-of-science-index-survey/,1,1033,9
https://www.3m.com/3M/en_US/company-us/about-3m/technologies/,1,3739,175


In [153]:
# processed above but output here for clarity
df_about_labeled.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,about_lbl,mgmt_lbl,partners_lbl,ip_lbl,about_agg_lbl,gid,pages_in_domain_ftr
firm_name,url,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3M Innovative Properties Company,https://www.3m.com/,0.0,0.0,0.0,0.0,0.0,1.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/3m-science-applied-to-life/,0.0,0.0,0.0,0.0,0.0,0.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/,1.0,0.0,0.0,0.0,1.0,1.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/state-of-science-index-survey/,0.0,0.0,0.0,0.0,0.0,0.0,16
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/technologies/,1.0,0.0,0.0,0.0,0.0,0.0,16


### Merge text-based features with count features and labels for model training and evaluation

In [154]:
# merge datasets (features and labeled data)
print(ubv_df.shape)
print(df_about_labeled.shape)
print(count_df.shape)

merged = ubv_df.join(df_about_labeled, how='inner')

labeled = merged.join(count_df, how='inner')
labeled['num_words_firm_ftr'] = labeled['num_words_ftr'].groupby(level=0).transform('sum')
labeled['share_of_words_ftr'] = labeled['num_words_ftr'] / labeled['num_words_firm_ftr']

labeled['num_sentences_firm_ftr'] = labeled['num_sentences_ftr'].groupby(level=0).transform('sum')
labeled['share_of_sentences_ftr'] = labeled['num_sentences_ftr'] / labeled['num_sentences_firm_ftr']

print(labeled.shape)
labeled.head()

(1454, 2491)
(1455, 7)
(1454, 3)
(1454, 2505)


Unnamed: 0_level_0,Unnamed: 1_level_0,003,10,10m,10mseriesafinancing,13485,14001,1933,1961,1962,1975,...,about_agg_lbl,gid,pages_in_domain_ftr,is_about_ftr,num_words_ftr,num_sentences_ftr,num_words_firm_ftr,share_of_words_ftr,num_sentences_firm_ftr,share_of_sentences_ftr
firm_name,url,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
3M Innovative Properties Company,https://www.3m.com/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,16,1,3487,210,28252,0.123425,869,0.241657
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/3m-science-applied-to-life/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,16,1,1140,14,28252,0.040351,869,0.01611
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1.0,16,1,854,38,28252,0.030228,869,0.043728
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/state-of-science-index-survey/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,16,1,1033,9,28252,0.036564,869,0.010357
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/technologies/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,16,1,3739,175,28252,0.132345,869,0.201381


In [155]:
# check for missing urls in the final labeled data frame
for url in labeled_urls: # labeled data
    if url not in list(labeled.index.levels[1]): # feature data
        print ('Missing', url)

labeled.xs('https://nanocomposix.com/pages/about-us', level=1)

Missing http://www.genomichealth.com/en-US/


Unnamed: 0_level_0,003,10,10m,10mseriesafinancing,13485,14001,1933,1961,1962,1975,...,about_agg_lbl,gid,pages_in_domain_ftr,is_about_ftr,num_words_ftr,num_sentences_ftr,num_words_firm_ftr,share_of_words_ftr,num_sentences_firm_ftr,share_of_sentences_ftr
firm_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
nanoComposix,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,1049.0,3,1,882,19,3116,0.283055,102,0.186275


In [156]:
# labeled train/test split
print (len(ubv_df.columns))
X = labeled.iloc[:,1:len(ubv_df.columns)]
       
# other in other non text-based features
# 1. number of domain pages (identified above)
# 2. is home page and doesn't have any other pages (is_sole_page)
# 3. whether a given page is an about us page (as opposed to a home page)
# 4. number of words on a page (and share of words)
# 5. number of sentences (and share of sentences)

X['pages_in_domain_ftr'] = np.reciprocal(labeled['pages_in_domain_ftr']) # in
X['is_about_ftr'] = labeled['is_about_ftr'] # in
# X['num_words_ftr'] = labeled['num_words_ftr']
# X['share_of_words_ftr'] = labeled['share_of_words_ftr']
# X['num_sentences_ftr'] = labeled['num_sentences_ftr']
X['share_of_sentences_ftr'] = labeled['share_of_sentences_ftr'] # in
X.to_csv(ABOUT_DIR + 'X.csv', index = True) # for manual inspection

# normalize
# X = (X - X.mean()) / X.std()
# X = (X - X.min()) / (X.max() - X.min())

y = labeled.loc[:,'about_lbl']
print (X.shape)
print (y.shape)
X.head()

2491
(1454, 2493)
(1454,)


Unnamed: 0_level_0,Unnamed: 1_level_0,10,10m,10mseriesafinancing,13485,14001,1933,1961,1962,1975,1976,...,zero,zerowastemanagement,zoetis,zonne,中文,公司简介,隆達電子,pages_in_domain_ftr,is_about_ftr,share_of_sentences_ftr
firm_name,url,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
3M Innovative Properties Company,https://www.3m.com/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.241657
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/3m-science-applied-to-life/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.01611
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.043728
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/state-of-science-index-survey/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.010357
3M Innovative Properties Company,https://www.3m.com/3M/en_US/company-us/about-3m/technologies/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,1,0.201381


## Train and evaluate the model
On just the labeled data

In [157]:
# specify a few models

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", 
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "SVC", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(gamma=0.001, C=100.), 
    QuadraticDiscriminantAnalysis()]

In [158]:
# build dataframe for output metrics 
eval_df = pd.DataFrame (names,index=(range(len(names))), columns=["Name"])
eval_df['Accuracy'] = np.float64(0)

In [159]:
# build evaluation outputs (currently limited to accuracy)
i = np.int64(0)
for name, clf in zip(names, classifiers):
    display (name)
    scores = cross_val_score(clf, X, y)
    avg_score = np.mean(scores)
    eval_df.set_value(i, 'Accuracy', avg_score)
    i = i + 1
    
display(eval_df)
eval_df.to_clipboard()
# Neural net work best

'Nearest Neighbors'

'Linear SVM'

'RBF SVM'

'Decision Tree'

'Random Forest'

'Neural Net'

'AdaBoost'

'Naive Bayes'

'SVC'

'QDA'

Unnamed: 0,Name,Accuracy
0,Nearest Neighbors,0.821864
1,Linear SVM,0.732463
2,RBF SVM,0.83287
3,Decision Tree,0.815697
4,Random Forest,0.732463
5,Neural Net,0.890656
6,AdaBoost,0.851447
7,Naive Bayes,0.729668
8,SVC,0.847329
9,QDA,0.456059


## Grid search using MLPClassifier to tune hyperparameters
The above results clearly show that a type of feed-forward neural network is the most accurate type of model

In [160]:
hls = []
# hls.append([20,])
# hls.append([70,])
hls.append([100,])
# hls.append([50,50])
# hls.append([70,70,70])
# hls.append([40,40,40])
# hls.append([10,10,10])
# hls.append([50,50,50,50])
pp.pprint(hls)

[[100]]


In [161]:
parameters = {'solver': ['adam'], 'max_iter': [50, 100], 'alpha': 10.0 ** -np.arange(1, 3), 'hidden_layer_sizes': hls}
clf_grid = GridSearchCV(MLPClassifier(), parameters, n_jobs=-1)
clf_grid.fit(X,y)

print("Best score: %0.4f" % clf_grid.best_score_)
print("Using the following parameters:")
print(clf_grid.best_params_)

Best score: 0.8886
Using the following parameters:
{'alpha': 0.1, 'hidden_layer_sizes': [100], 'max_iter': 50, 'solver': 'adam'}


In [162]:
# train neural net model with best hyperparameter configuration
clf = MLPClassifier(alpha=0.1, hidden_layer_sizes=(100,), max_iter=100, solver='adam')
clf.fit(X, y)

y_hat = clf.predict(X)
print(confusion_matrix(y, y_hat))

[[1061    4]
 [   6  383]]


In [None]:
# print all instances where predictions don't match labels (for inspection)
for key, y_i, y_hat_i in zip(list(X.index), y, y_hat):
    if y_i != y_hat_i:
        print(key[1], 'has been classified as ', y_hat_i, 'but should be ', y_i) 

## Predict about pages for unlabeled data

In [419]:
# prepare domain level features 
df_predict = df[~df['url'].isin(labeled_urls)] 
df_predict['pages_in_domain_ftr'] = df_predict.groupby(["firm_name"])["url"].transform("count")

df_predict = df_predict.set_index(['firm_name', 'url'])
df_predict = df_predict.fillna(0)

df_predict = df_predict.sort_values(by=['gid'])
print (df_predict.shape)
df_predict.head()

(8208, 2)


Unnamed: 0_level_0,Unnamed: 1_level_0,gid,pages_in_domain_ftr
firm_name,url,Unnamed: 2_level_1,Unnamed: 3_level_1
22nd Century Limited,http://www.xxiicentury.com/history/,0,4
22nd Century Limited,http://www.xxiicentury.com/our-management/,0,4
22nd Century Limited,http://www.xxiicentury.com/,0,4
22nd Century Limited,http://www.xxiicentury.com/profile/,0,4
ABB AB,https://new.abb.com/investorrelations,2,17


In [420]:
# check to see whether there are duplicate urls
# note: there should be because different assignees may map to the same domain (see error above)
counter=collections.Counter(df_predict.index)
most_common = counter.most_common(5)
pp.pprint (most_common)

[(('22nd Century Limited', 'http://www.xxiicentury.com/history/'), 1),
 (('22nd Century Limited', 'http://www.xxiicentury.com/our-management/'), 1),
 (('22nd Century Limited', 'http://www.xxiicentury.com/'), 1),
 (('22nd Century Limited', 'http://www.xxiicentury.com/profile/'), 1),
 (('ABB AB', 'https://new.abb.com/investorrelations'), 1)]


In [421]:
# prepare n-gram and count features
unlabeled_firm_text_features, unlabeled_firm_count_features = process_firms (set(df_predict.index.get_level_values('url')))

Cannot find html for https://www.fujielectric.com/company/news/rss/rss.xml
Cannot find html for https://www.bridgestone.com/corporate/news/rss/bridgestone_global_news.xml


In [425]:
unlabeled_firm_count_features['http://www.foxconn.com/Investors_En/Corporate_Governance.html?index=2']

[1, 56, 0]

In [426]:
prediction_urls = unlabeled_firm_text_features.keys()
pred_corpus = []

for url in prediction_urls:
    running_text = ''
    descriptor_text = unlabeled_firm_text_features[url][0]
    path_text = unlabeled_firm_text_features[url][1]
    title_text = unlabeled_firm_text_features[url][2]
    header_text = unlabeled_firm_text_features[url][3]
    running_text = descriptor_text + path_text + title_text + header_text
    pred_corpus.append (running_text)

ubv_prediction_matrix = ubv.transform(pred_corpus)

ubv_prediction_matrix = ubv_prediction_matrix.toarray()
vocab = ubv.get_feature_names()
ubv_prediction_df = pd.DataFrame(ubv_prediction_matrix, columns=vocab)
ubv_prediction_df.index = prediction_urls
ubv_prediction_df.index.name='url'
ubv_prediction_df.head()

Unnamed: 0_level_0,003,10,10m,10mseriesafinancing,13485,14001,1933,1961,1962,1975,...,zegage,zeitung,zeno,zero,zerowastemanagement,zoetis,zonne,中文,公司简介,隆達電子
url,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
https://www.eaton.com/us/en-us/company/policies-and-statements.html,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.fei.com/life-sciences/history-of-cryo-em/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.daikin.com/csr/company/index.html,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.dana-farber.org/about-us/locations/?homepagewidget,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
https://www.abbvie.com/partnerships/additional-collaboration-opportunities/open-innovation-program.html,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [427]:
count_pred_df = pd.DataFrame.from_dict(unlabeled_firm_count_features, orient='index', columns = ['is_about_ftr', 'num_words_ftr', 'num_sentences_ftr'])
count_pred_df.index.name = 'url'
print (count_pred_df.shape)
count_pred_df.head()

(8206, 3)


Unnamed: 0_level_0,is_about_ftr,num_words_ftr,num_sentences_ftr
url,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
https://www.eaton.com/us/en-us/company/policies-and-statements.html,1,684,8
https://www.fei.com/life-sciences/history-of-cryo-em/,1,2742,228
https://www.daikin.com/csr/company/index.html,1,1244,6
https://www.dana-farber.org/about-us/locations/?homepagewidget,1,700,10
https://www.abbvie.com/partnerships/additional-collaboration-opportunities/open-innovation-program.html,1,379,18


In [428]:
# merge datasets (features and labeled data)
print(ubv_prediction_df.shape)
print(df_predict.shape)
print(count_pred_df.shape)

predict_merged = ubv_prediction_df.join(df_predict, how='inner')

unlabeled = predict_merged.join(count_pred_df, how='inner')
unlabeled['num_sentences_firm_ftr'] = unlabeled['num_sentences_ftr'].groupby(level=0).transform('sum')
unlabeled['share_of_sentences_ftr'] = unlabeled['num_sentences_ftr'] / unlabeled['num_sentences_firm_ftr']
# fill in empties with 0's
unlabeled.loc[np.isnan(unlabeled['share_of_sentences_ftr']), 'share_of_sentences_ftr'] = 0 

print(unlabeled.shape)
unlabeled.head()

(8206, 2491)
(8208, 2)
(8206, 3)
(8206, 2498)


Unnamed: 0_level_0,Unnamed: 1_level_0,003,10,10m,10mseriesafinancing,13485,14001,1933,1961,1962,1975,...,中文,公司简介,隆達電子,gid,pages_in_domain_ftr,is_about_ftr,num_words_ftr,num_sentences_ftr,num_sentences_firm_ftr,share_of_sentences_ftr
firm_name,url,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
22nd Century Limited,http://www.xxiicentury.com/history/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,4,1,1388,77,175,0.44
22nd Century Limited,http://www.xxiicentury.com/our-management/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,4,1,799,59,175,0.337143
22nd Century Limited,http://www.xxiicentury.com/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,4,1,1104,23,175,0.131429
22nd Century Limited,http://www.xxiicentury.com/profile/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0,4,1,403,16,175,0.091429
ABB AB,https://new.abb.com/investorrelations,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2,17,1,1109,6,142,0.042254


In [429]:
# merge
X_test = unlabeled.iloc[:,1:len(ubv_prediction_df.columns)]

print (X.shape)
print (X_test.shape) # should be the same number of cols

X_test['pages_in_domain_ftr'] = np.reciprocal(unlabeled['pages_in_domain_ftr'])
X_test['is_about_ftr'] = unlabeled['is_about_ftr']
X_test['share_of_sentences_ftr'] = unlabeled['share_of_sentences_ftr']
X_test.to_csv(ABOUT_DIR + 'X_test.csv', index = True) # for manual inspection

(1454, 2493)
(8206, 2490)


In [493]:
# predict with newly constructed X
y_predicted = clf.predict(X_test)

In [496]:
y_hat = y_predicted.astype(int) 
print (type(y_hat))

<class 'numpy.ndarray'>


In [565]:
# write to file
def find_home_urls (firm_names):
    urls = [] # return object 
    for fn in firm_names: 
        result = col.find_one({"firm_name": fn, "is_about": "0"}, {'orig_url': 1})
        if not result: 
            print ("Cannot find home page result for", fn)
        else: 
            urls.append(result['orig_url'][0])
    return urls
    

about_pages = []
has_about_firm_names = {}

# output predicted values to file
for fn, u, predicted_value in zip(X_test.index.get_level_values('firm_name'), X_test.index.get_level_values('url'), y_predicted):
    # print (cfn + ' with url ' + u + ' has predicted value ' + str(predicted_value))
    about_pages.append([fn, u, predicted_value, 0])
    if predicted_value == 1.: 
        has_about_firm_names[fn] = 1
        
# and the labeled ones too...
for fn, u, labeled_value in zip(X.index.get_level_values('firm_name'), X.index.get_level_values('url'), y):
    # print (cfn + ' with url ' + u + ' has predicted value ' + str(labeled_value))
    about_pages.append([fn, u, labeled_value, 0])
    if labeled_value == 1: 
        has_about_firm_names[fn] = 1
    
about_df = pd.DataFrame (about_pages, columns = ('firm_name', 'url', 'is_about', 'default_to_home'))
about_df.set_index(['firm_name', 'url'], inplace=True)

firm_names = list(about_df.index.levels[0])
no_positive_pred_firm_names = set(firm_names).difference(set(has_about_firm_names)) # len is 97
default_home_urls = find_home_urls (no_positive_pred_firm_names)
# print (default_home_urls)

Cannot find home page result for AT&T Corporation
Cannot find home page result for AGC Flat Glass North America
Cannot find home page result for Seagate Technology LLC
Cannot find home page result for Coactive Drive Corporation
Cannot find home page result for Nanocopocia
Cannot find home page result for Community Power Corporation
Cannot find home page result for Biological Dynamics
Cannot find home page result for Gilead Sciences


In [None]:
# check output
about_df.iloc[about_df.index.get_level_values('firm_name') == "Zygo Corporation"]
# about_df.iloc[about_df.index.get_level_values('url') == "https://spc-intl.com/(X(1)S(zopdv2d2knciwj2xbkkpfqfz))/?AspxAutoDetectCookieSupport=1"]

In [567]:
# in the data frame set these pages to is_about = 1 and default_to_home = 1
about_df.iloc[about_df.index.get_level_values('url').isin (default_home_urls), 0:2] = 1
about_df.iloc[about_df.index.get_level_values('url').isin (default_home_urls), 0:2].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,is_about,default_to_home
firm_name,url,Unnamed: 2_level_1,Unnamed: 3_level_1
ALSTOM Technology Ltd,https://www.alstom.com/,1.0,1
ASTUTE MEDICAL,http://www.astutemedical.com/,1.0,1
Advanced Silicon Group,http://www.advancedsilicongroup.com/,1.0,1
Aerogen,https://www.aerogen.com/,1.0,1
Agena Bioscience,http://agenabio.com/,1.0,1


In [574]:
# clean firm names
about_out_df = about_df.reset_index()
about_out_df['firm_name'] = about_out_df['firm_name'].apply(lambda x: clean_firm_name(x))
about_out_df.head()

# write df to csv
about_out_df.to_csv(ABOUT_DIR + 'about_predicted_and_labels.csv', index = False)

## Extract data from mongodb
* Now that we know which pages are about pages, extract from mongodb and output for topic modeling
* For now, construct paragraphs from different pages by ordering urls by their length.  In the future, might want to contruct paragraphs in their 'natural' sequential order as they would appear on a home page or landing page

In [360]:
# combine both labeled and predicted frames
# TODO: add in homepages where needed! -- ska (2/19/19)
print (X_test.shape)
print(X.shape)

combined = X_test.append(X)
print (combined.shape)
print (len(y_predicted))
print (len(y))
abouts = pd.DataFrame(index=combined.index)

abouts['is_about'] = list(y_predicted) + list(y)
abouts = abouts.reset_index()
abouts.head()

(8206, 2493)
(1454, 2493)
(9660, 2493)
8206
1454


Unnamed: 0,firm_name,url,is_about
0,Xintec Inc.,http://www.xintec.com.tw/eng/AX_Core-Value_Vis...,1.0
1,Xintec Inc.,http://www.xintec.com.tw/eng/IR_Company_Profil...,0.0
2,ZENA TECHNOLOGIES,http://xena-technologies.com/about-us/,1.0
3,Xintec Inc.,http://www.xintec.com.tw/eng/CSR_Corporate_Gov...,0.0
4,22nd Century Limited,http://www.xxiicentury.com/,1.0


In [184]:
# gather unique firm_names from mongodb
firm_names = set(abouts['firm_name'])
print (len(firm_names))
pp = pprint.PrettyPrinter()
pp.pprint(firm_names)

183
{'22nd Century Limited',
 '3M Innovative Properties Company',
 'ABB AB',
 'ACACIA RESEARCH GROUP LLC',
 'AGFA-GEVAERT N.V.',
 'ALGETERNAL TECHNOLOGIES',
 'ASML Netherlands B.V.',
 'AVI BioPharma',
 'AVOGY',
 'Abbott Molecular Inc.',
 'Abbott Point of Care Inc.',
 'Adlens Beacon',
 'Advanced Analogic Technologies',
 'Agienic',
 'Allertein Therapeutics',
 'Allison Transmission',
 'Altex Technologies Corporation',
 'Amberwave Inc.',
 'Amgen Fremont Inc.',
 'Amicus Therapeutics',
 'Amtech Systems',
 'Andritz Inc.',
 'Angiotech Pharmaceuticals (US)',
 'Applied Materials',
 'Archer Daniels Midland Company',
 'Arkema Inc.',
 'AstenJohnson',
 'Atom Nanoelectronics',
 'BRIGHTLEAF TECHNOLOGIES INC.',
 'Baker Hughes Incorporated',
 'Baxter Healthcare SA',
 'Biocon Limited',
 'Bostik',
 'Braun Intertec Geothermal',
 'Brookhaven Science Associates LLC ',
 'Butamax(TM) Advanced Biofuels LLC',
 'CI4 Technologies',
 'CMC ICOS BIOLOGICS',
 'COOK MEDICAL TECHNOLOGIES LLC',
 'Cadence Design Systems',

In [362]:
def get_ordered_about_urls (firm_name):
    print (firm_name)
    urls = list (abouts.loc[(abouts['firm_name'] == firm_name), 'url'])
    urls.sort(key = len)
    print ('Original urls')
    pp.pprint(urls)

    index = {}
    for url in urls:
        path_fragments = len(url.split('/'))
        added = False
        for i in range(1, path_fragments):
            key_phrase = url.rsplit('/', maxsplit=i)[0]
            if key_phrase in urls or (key_phrase + '/') in urls: 
                od = index.setdefault(key_phrase, OrderedDict())
                od[url] = 1
                added = True
                continue
        if not added:
            od = index.setdefault(url, OrderedDict())
            od[url] = 1
 
    # pp.pprint (index)
    
    return_urls = []
    seen = set ()
    for key in index.keys():
        tree_urls = index[key]
        for fu in tree_urls:
            if fu not in seen:
                return_urls.append(fu)
                seen.add(fu)
    
    # finally remove home page if it exists and if there are other pages to draw on
    if not return_urls: 
        return None
    else: 
#         first_page = return_urls[0]
#         first_page_path = get_page_path_text (first_page)
#         if first_page_path == ' ' or first_page_path == '':
#             print (first_page_path + 'empty')
#             return_urls.pop(0)
        return return_urls

test_urls = get_ordered_about_urls ('Previvo Genetics')
print ('Ordered urls')
pp.pprint (test_urls)

Previvo Genetics
Original urls
['https://previvo.com/']
Ordered urls
['https://previvo.com/']


In [363]:
# iterate through firm urls and return concatenated string
def get_content (urls): 
    running_text = ''
    for url in urls:
        print ('\tWorking on ' + url)
        result = col.find_one( {"url": url} )
        if result:
            clnd_text = clean_page_content(result['full_text'])
            clnd_text = '\n'.join(clnd_text)
            boilerpipe = None
            
            if 'body' in result:
                extractor = Extractor(extractor='DefaultExtractor', html = result['body'][0])
                lines = extractor.getText().replace(u'\xa0', u' ').split('\n')
                filtered = filter(lambda x: not re.match(r'^\s*$', x), lines)
                boilerpipe = '\n'.join(filtered)

            # TODO fix to split().  Counting characters currently 
            if boilerpipe and (len(boilerpipe) > .5 * len(clnd_text)):
                print ('\t\tUsing boilerplate')
                running_text += boilerpipe
            else:
                print ('\t\tUsing clnd_text')
                running_text += clnd_text
        else:
            print ('Cannot find url: ' + url)

    return running_text

In [364]:
# regex test 
regex = re.findall(r"(CDATA|return\s+true|return\s+false|getelementbyid|function|\w+\(.*?\);|\w{2,}[\\.|:]+\w{2,}|'\w+':\s+'\w+|\\|{|}|\r|\n|\/\/')", 
                   "CDATA function contact-us getelementbyid javascript.function linker:autoLink www.littlekidsinc.com fxnCall(param.param); email@dextr.us 'type': 'image' return true return false rev7bynlh\\u00252bvcgrjg\\ {height}") # last part is words sequences separated by punct

In [365]:
print (test_urls)
test_site_text = get_content (test_urls)
print (test_site_text)

['https://previvo.com/']
	Working on https://previvo.com/
		Using boilerplate
Begin typing your search above and press return to search.
Learn More
Empowering Women & Couples with Fertility Choices
Previvo offers a new choice, a new medical device, and a new procedure that empowers women and couples, both fertile and infertile, in assisted reproduction. Our proprietary and patented medical device is designed with you and your potential baby in mind.
Our studies demonstrate that we can conduct Previvo Uterine Lavage while maintaining safety of the woman and the embryo(s).
The Previvo Uterine Lavage System is built upon robust and extensive intellectual property and clinical testing. We continue to invest in our intellectual portfolio by filing new methods and system improvements. With over 130 lavages, The Previvo Uterine Lavage System has demonstrated the ability to recover healthy embryos.
The Previvo Uterine Lavage System empower women and couples with a new assisted reproductive tec

In [367]:
# run process_firm and write to file
pp = pprint.PrettyPrinter()
for firm_name in firm_names: 
    print ("Working on " + firm_name)
    about_urls = get_ordered_about_urls(firm_name)
    if not about_urls:
        print ("\tCouldn't find any urls for firm!")
        firm_urls = df_predict.xs(firm_name, level=0)
        home_page = firm_urls.loc[firm_urls['is_about_ftr'] == 0]
        about_urls.append(home_page.index.get_level_values('url'))
        
    about_text = get_content (about_urls)
    
    if about_text: 
        firm_clnd = clean_firm_name(firm_name) # standard cleaning code throughout project
        about_clnd = strip_firm_name (firm_name, about_text)
        file = re.sub('\/', '|', firm_clnd) + '.txt'
        with io.open(DATA_DIR + file,'w',encoding='utf8') as f:
            f.write (about_clnd)
    else:
        print ("\tCouldn't find any text for firm!")

Working on Evoqua Water Technologies LLC
Evoqua Water Technologies LLC
Original urls
['http://evoqua.com/',
 'http://evoqua.com/en/about',
 'http://evoqua.com/en/about/Pages/default.aspx',
 'http://evoqua.com/de/about/Seiten/Impressum.aspx',
 'http://evoqua.com/en/about/Pages/contact-us.aspx',
 'http://evoqua.com/de/about/Seiten/Datenschutzerklaerung.aspx',
 'http://evoqua.com/en/about/data-protection/Pages/default.aspx',
 'http://evoqua.com/en/about/data-protection/Pages/Cookies-Policy.aspx',
 'https://www.evoqua.com/en/about/data-protection/Pages/Cookies-Policy.aspx',
 'http://evoqua.com/en/about/data-protection/Pages/Data-Privacy-Protection-Policy.aspx']
	Working on http://evoqua.com/
		Using boilerplate
	Working on http://evoqua.com/en/about
		Using clnd_text
	Working on http://evoqua.com/en/about/Pages/default.aspx
		Using boilerplate
	Working on http://evoqua.com/de/about/Seiten/Impressum.aspx
		Using boilerplate
	Working on http://evoqua.com/en/about/Pages/contact-us.aspx
		Usin

		Using clnd_text
	Working on https://allisontransmission.com/company/centennial/today
		Using boilerplate
	Working on https://allisontransmission.com/company/history-heritage
		Using boilerplate
	Working on https://allisontransmission.com/company/centennial/future
		Using boilerplate
	Working on https://allisontransmission.com/company/centennial/history
		Using clnd_text
	Working on https://allisontransmission.com/company/corporate-leadership
		Using boilerplate
	Working on https://allisontransmission.com/company/corporate-responsibility
		Using clnd_text
	Working on https://allisontransmission.com/company/corporate-responsibility/sustainability
		Using clnd_text
	Working on https://allisontransmission.com/company/corporate-responsibility/causes-we-support
		Using clnd_text
	Working on https://allisontransmission.com/company/news-article/!details/2018/11/26/agrale-mercedes-choose-allison-transmissions-for-brazil-buses
		Using boilerplate
	Working on https://www.allisontransmission.com

		Using boilerplate
Working on Applied Materials
Applied Materials
Original urls
['http://www.appliedmaterials.com/',
 'http://www.appliedmaterials.com/company/about',
 'http://www.appliedmaterials.com/company/careers',
 'http://www.appliedmaterials.com/company/contact',
 'http://www.appliedmaterials.com/company/news-media',
 'http://www.appliedmaterials.com/company/news/events',
 'http://www.appliedmaterials.com/company/careers/jobs',
 'http://www.appliedmaterials.com/company/collaboration',
 'http://www.appliedmaterials.com/company/about/our-story',
 'http://www.appliedmaterials.com/company/about/leadership',
 'http://www.appliedmaterials.com/company/applied-ventures',
 'http://www.appliedmaterials.com/company/careers/diversity',
 'http://www.appliedmaterials.com/company/contact/locations',
 'http://www.appliedmaterials.com/company/news/media-center',
 'http://www.appliedmaterials.com/company/careers/university',
 'http://www.appliedmaterials.com/company/investor-relations',
 'http:/

		Using clnd_text
	Working on https://www.eni.com/en_IT/company/fuel-cafe.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/governance.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/eni-history.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/news-company.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/our-management.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/company-profile.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/investors/risk-management.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/governance/by-laws.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/governance/eni-model.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/governance/audit-firm.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/company/company-profile/people.page
		Using clnd_text
	Working on https://www.eni.com/en_IT/compa

		Using boilerplate
	Working on https://www.toray.com/aboutus/philosophy.html
		Using boilerplate
	Working on https://www.toray.com/aboutus/event/index.html
		Using clnd_text
	Working on https://www.toray.com/ir/management/index.html
		Using clnd_text
	Working on https://www.toray.com/aboutus/vision/index.html
		Using boilerplate
	Working on https://www.toray.com/aboutus/visual/index.html
		Using clnd_text
	Working on https://www.toray.com/aboutus/history/index.html
		Using boilerplate
	Working on https://www.toray.com/aboutus/download/index.html
		Using clnd_text
	Working on https://www.toray.com/aboutus/ourpeople/index.html
		Using clnd_text
	Working on https://www.toray.com/csr/activity/compliance/index.html
		Using clnd_text
	Working on https://www.toray.com/csr/activity/governance/index.html
		Using clnd_text
	Working on https://www.toray.com/csr/activity/riskmanagement/index.html
		Using clnd_text
Working on nanoComposix
nanoComposix
Original urls
['https://nanocomposix.com/',
 '

		Using boilerplate
	Working on https://leonardodrs.com/about-us/management-team/bill-lynn-s-corner/
		Using boilerplate
	Working on https://www.leonardodrs.com/about-us/
		Using boilerplate
	Working on https://www.leonardodrs.com/about-us/ethics/
		Using boilerplate
	Working on https://www.leonardodrs.com/about-us/investors/
		Using clnd_text
	Working on https://www.leonardodrs.com/about-us/our-structure/
		Using clnd_text
	Working on https://www.leonardodrs.com/about-us/sustainability/
		Using boilerplate
	Working on https://www.leonardodrs.com/about-us/management-team/
		Using boilerplate
	Working on https://www.leonardodrs.com/about-us/in-the-community/
		Using boilerplate
	Working on https://www.leonardodrs.com/about-us/contract-vehicles/
		Using boilerplate
	Working on https://www.leonardodrs.com/about-us/thought-leadership/
		Using clnd_text
Working on R.A. Miller Industries
R.A. Miller Industries
Original urls
['https://www.rami.com/', 'https://www.rami.com/about/']
	Working on

		Using boilerplate
	Working on https://www.arkema.com/en/innovation/water-management/
		Using clnd_text
	Working on https://www.arkema.com/en/social-responsibility/vision-and-strategy/
		Using clnd_text
	Working on https://www.arkema.com/en/arkema-group/governance/executive-committee/
		Using clnd_text
	Working on https://www.arkema.com/en/careers/working-at-arkema/career-management-/
		Using boilerplate
	Working on https://www.arkema.com/en/investor-relations/arkema-share/share-profile/
		Using clnd_text
	Working on https://www.arkema.com/en/investor-relations/corporate-governance/compensation/
		Using clnd_text
	Working on https://www.arkema.com/en/investor-relations/corporate-governance/board-of-directors/
		Using boilerplate
	Working on https://www.arkema.com/en/investor-relations/corporate-governance/specialized-committees/
		Using boilerplate
	Working on https://www.arkema.com/en/social-responsibility/vision-and-strategy/materiality-analysis/
		Using boilerplate
	Working on http

		Using clnd_text
	Working on https://www.nec.com/en/global/about/corporate_profile.html
		Using clnd_text
	Working on https://www.nec.com/en/press/201811/global_20181129_01.html
		Using boilerplate
	Working on https://www.nec.com/en/global/about/fact/index.html?cid=gltop183006
		Using clnd_text
	Working on https://www.nec.com/en/global/about/vision/index.html?cid=gltop183001
		Using clnd_text
Working on AVOGY
AVOGY
Original urls
['https://nexgenpowersystems.com/',
 'https://nexgenpowersystems.com/about/',
 'https://nexgenpowersystems.com/contact/',
 'https://nexgenpowersystems.com/management/']
	Working on https://nexgenpowersystems.com/
		Using boilerplate
	Working on https://nexgenpowersystems.com/about/
		Using boilerplate
	Working on https://nexgenpowersystems.com/contact/
		Using clnd_text
	Working on https://nexgenpowersystems.com/management/
		Using boilerplate
Working on Nutech Ventures
Nutech Ventures
Original urls
['http://www.nutechventures.org/', 'http://www.nutechventures

		Using boilerplate
	Working on https://www.3m.com/3M/en_US/company-us/patent/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/company-us/about-3m/
		Using boilerplate
	Working on https://www.3m.com/3M/en_US/company-us/site-map/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/home-improvement-us/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/company-us/SDS-search/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/company-us/help-center/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/company-us/privacy-policy/
		Using boilerplate
	Working on https://www.3m.com/3M/en_US/company-us/legal-information/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/company-us/partners-suppliers/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/company-us/about-3m/technologies/
		Using boilerplate
	Working on https://www.3m.com/3M/en_US/company-us/dmca-copyright-policy/
		Using clnd_text
	Working on https://www.3m.com/3M/en_US/company-us/

		Using boilerplate
	Working on https://www.christiedigital.com/en-us/about-christie/news-room/pr-contacts
		Using clnd_text
	Working on https://www.christiedigital.com/en-us/about-christie/tradeshows-and-events
		Using boilerplate
	Working on https://www.christiedigital.com/en-us/about-christie/anti-slavery-statement
		Using clnd_text
	Working on https://www.christiedigital.com/en-us/about-christie/news-room/awards-and-industry-firsts
		Using boilerplate
	Working on https://www.christiedigital.com/en-us/about-christie/news-room/press-releases/christie-wins-most-influential-company-award
		Using boilerplate
	Working on https://www.christiedigital.com/en-us/about-christie/news-room/press-releases/CGR-Cinemas-declares-future-of-cinema-is-RGB-laser
		Using boilerplate
	Working on https://www.christiedigital.com/en-us/_layouts/15/cdlogin.aspx?ReturnUrl=/en-us/_layouts/15/Authenticate.aspx?Source=%2Fen%2Dus%2Fpartner%2Dhome
		Using clnd_text
Working on Microsemi SoC Corporation
Microsemi So

		Using boilerplate
	Working on https://www.phoseon.com/resources/partners
		Using clnd_text
	Working on https://www.phoseon.com/about-industrial-curing
		Using boilerplate
Working on Unifrax I LLC
Unifrax I LLC
Original urls
['http://www.unifrax.com/',
 'http://www.unifrax.com/about-us/',
 'http://www.unifrax.com/about-us/locations/',
 'http://www.unifrax.com/about-us/luyang-partnership/',
 'http://www.unifrax.com/about-us/career-opportunities/']
	Working on http://www.unifrax.com/
		Using clnd_text
	Working on http://www.unifrax.com/about-us/
		Using boilerplate
	Working on http://www.unifrax.com/about-us/locations/
		Using clnd_text
	Working on http://www.unifrax.com/about-us/luyang-partnership/
		Using clnd_text
	Working on http://www.unifrax.com/about-us/career-opportunities/
		Using clnd_text
Working on Amberwave Inc.
Amberwave Inc.
Original urls
['http://amberwave.com/', 'http://amberwave.com/about.html']
	Working on http://amberwave.com/
		Using boilerplate
	Working on http://a

		Using clnd_text
	Working on http://www.deltaww.com/about/grouplink.aspx?secID=5&pid=7&tid=0&hl=en-US
		Using clnd_text
	Working on http://www.deltaww.com/about/milestone.aspx?secID=5&pid=5&tid=0&hl=en-US
		Using clnd_text
	Working on http://www.deltaww.com/about/csr_Report.aspx?secID=5&pid=6&tid=9&hl=en-US
		Using clnd_text
	Working on http://www.deltaww.com/about/leadership.aspx?secID=5&pid=1&tid=0&hl=en-US
		Using boilerplate
	Working on http://www.deltaww.com/about/operations.aspx?secID=5&pid=3&tid=0&hl=en-US
		Using clnd_text
	Working on http://www.deltaww.com/about/innovation2.aspx?secID=5&pid=4&tid=0&hl=en-US
		Using boilerplate
	Working on http://www.deltaww.com/about/csr_features_eng.aspx?secID=5&pid=6&tid=0&hl=en-US
		Using boilerplate
Working on Envision Solar International
Envision Solar International
Original urls
['http://www.envisionsolar.com/',
 'http://www.envisionsolar.com/about-envision-solar/']
	Working on http://www.envisionsolar.com/
		Using clnd_text
	Working on

		Using clnd_text
Working on Renewable Power Conversion
Renewable Power Conversion
Original urls
['http://rpcincorporated.com/',
 'http://rpcincorporated.com/about-rpc.html',
 'http://rpcincorporated.com/rpc-technology.html']
	Working on http://rpcincorporated.com/
		Using clnd_text
	Working on http://rpcincorporated.com/about-rpc.html
		Using boilerplate
	Working on http://rpcincorporated.com/rpc-technology.html
		Using boilerplate
Working on Abbott Point of Care Inc.
Abbott Point of Care Inc.
Original urls
['https://www.pointofcare.abbott/us/en/home',
 'https://www.pointofcare.abbott/us/en/about-us/careers',
 'https://www.pointofcare.abbott/us/en/about-us/locations',
 'https://www.pointofcare.abbott/us/en/about-us/contact-us',
 'https://www.pointofcare.abbott/us/en/about-us/phone-numbers',
 'https://www.pointofcare.abbott/us/en/about-us/benefits-of-point-of-care-testing']
	Working on https://www.pointofcare.abbott/us/en/home
		Using boilerplate
	Working on https://www.pointofcare.abb

		Using boilerplate
	Working on https://www.andritz.com/group-en/about-us/gr-company-boards
		Using boilerplate
	Working on https://www.andritz.com/group-en/about-us/gr-sustainability
		Using boilerplate
	Working on https://www.andritz.com/group-en/investors/corporate-governance
		Using boilerplate
	Working on https://www.andritz.com/group-en/about-us/gr-compliance-startseite
		Using boilerplate
	Working on https://www.andritz.com/group-en/about-us/research-and-development
		Using boilerplate
	Working on https://www.andritz.com/group-en/about-us/suppliers/direct-material
		Using clnd_text
	Working on https://www.andritz.com/group-en/about-us/suppliers/terms-conditions
		Using clnd_text
	Working on https://www.andritz.com/group-en/about-us/suppliers/indirect-material
		Using boilerplate
	Working on https://www.andritz.com/group-en/about-us/suppliers/scoc-declaration-form
		Using boilerplate
	Working on https://www.andritz.com/group-en/about-us/gr-profile-vision-strategy-goals
		Using bo

		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/group/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/history/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/message/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/overview/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/philosophy/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/group/ink/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/group/ink/history/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/group/thou/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/investor/policy/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/investor/governance/
		Using clnd_text
	Working on https://www.taiyo-hd.co.jp/en/contact/?_contact=company
		Using clnd_text
Working on Transposagen Biopharmaceuticals
Transposagen Biopharmaceuticals
Original urls
['http://www.transposagenbio.com/',
 'http://www.transposagenbio.com/about-us',
 'http://www.transpo

		Using boilerplate
	Working on https://www.tel.com/about/release/2018/
		Using clnd_text
	Working on https://www.tel.com/about/sponsor/2017/
		Using clnd_text
	Working on https://www.tel.com/about/riskmanagement/
		Using boilerplate
	Working on https://www.tel.com/museum/exhibition/history/
		Using boilerplate
	Working on https://www.tel.com/about/release/2018/20180619_002.html
		Using boilerplate
	Working on https://www.tel.com/about/release/2018/20180726_001.html
		Using clnd_text
	Working on https://www.tel.com/about/release/2018/20181002_001.html
		Using clnd_text
	Working on https://www.tel.com/about/release/2018/20181031_001.html
		Using clnd_text
	Working on https://www.tel.com/about/release/2018/20181031_002.html
		Using clnd_text
	Working on https://www.tel.com/about/release/2018/20181031_003.html
		Using clnd_text
Working on Siliconware Precision Industries Co.
Siliconware Precision Industries Co.
Original urls
['http://www.spil.com.tw/',
 'http://www.spil.com.tw/about/',
 '

		Using clnd_text
Working on Selkermetrics
Selkermetrics
Original urls
['http://selkermetrics.com/', 'http://selkermetrics.com/company.htm']
	Working on http://selkermetrics.com/
		Using boilerplate
	Working on http://selkermetrics.com/company.htm
		Using boilerplate
Working on SAE Magnetics (HK) Ltd.
SAE Magnetics (HK) Ltd.
Original urls
['http://www.sae.com.hk/', 'http://www.sae.com.hk/about-us']
	Working on http://www.sae.com.hk/
		Using clnd_text
	Working on http://www.sae.com.hk/about-us
		Using boilerplate
Working on PLEX LLC
PLEX LLC
Original urls
['https://www.plex.tv/',
 'https://www.plex.tv/about/',
 'https://www.plex.tv/about/careers/',
 'https://www.plex.tv/about/charity/',
 'https://www.plex.tv/about/privacy-legal/']
	Working on https://www.plex.tv/
		Using clnd_text
	Working on https://www.plex.tv/about/
		Using boilerplate
	Working on https://www.plex.tv/about/careers/
		Using boilerplate
	Working on https://www.plex.tv/about/charity/
		Using boilerplate
	Working on http

		Using clnd_text
	Working on https://www.oracle.com/search/customers
		Using clnd_text
	Working on http://www.oracle.com/partners/index.html
		Using clnd_text
	Working on https://www.oracle.com/enterprise-manager/
		Using boilerplate
	Working on https://www.oracle.com/applications/performance-management/
		Using boilerplate
	Working on https://www.oracle.com/applications/performance-management/solutions/business-planning.html
		Using clnd_text
	Working on https://www.oracle.com/applications/performance-management/solutions/financial-close-reporting.html
		Using boilerplate
	Working on https://www.oracle.com/applications/performance-management/solutions/profitability-cost-management.html
		Using boilerplate
	Working on https://www.oracle.com/applications/performance-management/planning-budgeting-cloud-midsize-companies.html
		Using boilerplate
	Working on https://www.oracle.com/applications/supply-chain-management/
		Using boilerplate
	Working on https://www.oracle.com/applications/sup

		Using boilerplate
	Working on https://www.kimberly-clark.com/en-us/company/employees
		Using clnd_text
	Working on https://www.kimberly-clark.com/en-us/company/innovation
		Using boilerplate
	Working on https://www.kimberly-clark.com/en-us/company/leadership
		Using boilerplate
	Working on https://www.kimberly-clark.com/en-us/company/our-vision
		Using clnd_text
	Working on https://www.kimberly-clark.com/en-us/company/supplier-link
		Using clnd_text
	Working on https://www.kimberly-clark.com/en-us/company/technology-licensing
		Using boilerplate
	Working on https://www.kimberly-clark.com/en-us/investors/corporate-governance
		Using clnd_text
	Working on https://www.kimberly-clark.com/en-us/company/supplier-link/procure-to-pay
		Using boilerplate
	Working on https://www.kimberly-clark.com/en-us/company/supplier-link/supplier-portals
		Using boilerplate
	Working on https://www.kimberly-clark.com/en-us/company/supplier-link/supplier-diversity
		Using boilerplate
	Working on https://www.

		Using clnd_text
Working on Kronos International Inc
Kronos International Inc
Original urls
['https://www.kronos.com/',
 'https://www.kronos.com/about-us',
 'https://www.kronos.com/about-us/events',
 'https://www.kronos.com/about-us/careers',
 'https://www.kronos.com/about-us/newsroom',
 'https://www.kronos.com/about-us/leadership',
 'https://www.kronos.com/about-us/our-culture',
 'https://www.kronos.com/about-us/1-100-million',
 'https://www.kronos.com/kronos-partner-network',
 'https://www.kronos.com/about-us/kronos-history',
 'https://www.kronos.com/about-us/awards-and-recognition',
 'https://www.kronos.com/about-us/locations-and-global-reach',
 'https://www.kronos.com/kronos-partner-network/become-partner',
 'https://www.kronos.com/resources/kronos-partner-network-brochure',
 'https://www.kronos.com/about-us/events/kronos-hr-payroll-esymposium',
 'https://www.kronos.com/about-us/kronos-corporate-social-responsibility-csr',
 'https://www.kronos.com/kronos-partner-network/find-partn

		Using boilerplate
	Working on https://braunintertec.com/about-us/history/
		Using boilerplate
	Working on https://braunintertec.com/about-us/leadership/
		Using boilerplate
	Working on https://braunintertec.com/about-us/fleet-equipment/
		Using clnd_text
	Working on https://braunintertec.com/about-us/community-engagement/
		Using clnd_text
	Working on https://braunintertec.com/about-us/safety-quality-certifications/
		Using boilerplate
Working on Humanetics Corporation
Humanetics Corporation
Original urls
['http://www.humaneticsatd.com/',
 'http://www.humaneticsatd.com/about-us',
 'http://www.humaneticsatd.com/about-us/News',
 'http://www.humaneticsatd.com/about-us/legal',
 'http://www.humaneticsatd.com/about-us/events',
 'http://www.humaneticsatd.com/about-us/careers',
 'http://www.humaneticsatd.com/about-us/press-releases',
 'http://www.humaneticsatd.com/about-us/platinum-spares',
 'http://www.humaneticsatd.com/about-us/message-president',
 'http://www.humaneticsatd.com/about-us/ne

		Using clnd_text
	Working on https://www.westerndigital.com/company/newsroom
		Using boilerplate
	Working on https://www.westerndigital.com/company/leadership
		Using clnd_text
	Working on https://www.westerndigital.com/company/innovations
		Using boilerplate
	Working on https://www.westerndigital.com/company/innovations/history
		Using boilerplate
	Working on https://www.westerndigital.com/company/innovations/publications
		Using clnd_text
	Working on https://www.westerndigital.com/solutions/data-protection-management
		Using boilerplate
	Working on https://www.westerndigital.com/company/newsroom/regional-events/risc-v
		Using clnd_text
	Working on https://www.westerndigital.com/company/innovations/academic-collaborations
		Using boilerplate
	Working on https://www.westerndigital.com/company/newsroom/regional-events/flash-memory-summit-2018
		Using clnd_text
Working on Propagation Research Associates
Propagation Research Associates
Original urls
['http://pra-corp.com/', 'http://pra-c

		Using clnd_text
	Working on http://agfa.com/corporate/about-us/history/
		Using boilerplate
	Working on http://agfa.com/corporate/about-us/technology/
		Using clnd_text
	Working on http://agfa.com/corporate/about-us/our-company/
		Using boilerplate
	Working on http://agfa.com/corporate/privacy-cookie-settings/
		Using clnd_text
	Working on http://agfa.com/corporate/privacy-and-legal-notice/
		Using boilerplate
	Working on http://agfa.com/corporate/about-us/agfa-in-the-world/
		Using clnd_text
	Working on http://agfa.com/corporate/investor-relations/faq-investors-contact/
		Using clnd_text
	Working on http://agfa.com/corporate/about-us/our-company/company-film-meet-the-world-of-agfa/
		Using clnd_text
	Working on https://www.agfa.com/corporate/
		Using clnd_text
	Working on https://www.agfa.com/corporate/faq/
		Using boilerplate
	Working on https://www.agfa.com/corporate/press/
		Using clnd_text
	Working on https://www.agfa.com/corporate/contact/
		Using clnd_text
	Working on https://

		Using clnd_text
	Working on https://nuclear.gepower.com/company-info/our-experts
		Using clnd_text
	Working on https://nuclear.gepower.com/company-info/news-releases
		Using clnd_text
	Working on https://nuclear.gepower.com/company-info/about-ge-hitachi
		Using boilerplate
	Working on https://nuclear.gepower.com/company-info/ge-in-the-community
		Using boilerplate
	Working on https://nuclear.gepower.com/company-info/nuclear-power-basics
		Using boilerplate
	Working on https://nuclear.gepower.com/company-info/our-experts/eric-mino
		Using boilerplate
	Working on https://nuclear.gepower.com/company-info/our-experts/patty-mccumbee
		Using clnd_text
	Working on https://nuclear.gepower.com/service-and-optimize/solutions/all-solutions/long-term-asset-management
		Using clnd_text
	Working on https://nuclear.gepower.com/fuel-a-plant/tools-and-resources/learn-about-fuel-manufacturing-locations
		Using boilerplate
Working on Sumitomo Electric Device Innovations
Sumitomo Electric Device Innovat

		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company/events.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company/careers.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company/culture.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company/newsroom.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company/customers.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company/investors.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/company/contact-us.html
		Using clnd_text
	Working on https://www.cadence.com/content/cadence-www/global/en_US/home/compa

		Using clnd_text
	Working on http://www.firstsolar.com/en/About-Us/Locations
		Using clnd_text
	Working on http://www.firstsolar.com/en/About-Us/Leadership
		Using boilerplate
	Working on http://www.firstsolar.com/en/PV-Plants/Corporate
		Using boilerplate
	Working on http://www.firstsolar.com/About-Us/Privacy-Policy
		Using boilerplate
	Working on http://www.firstsolar.com/en/About-Us/Press-Center
		Using clnd_text
	Working on http://www.firstsolar.com/en/About-Us/Terms-of-Use
		Using clnd_text
	Working on http://www.firstsolar.com/en/About-Us/Privacy-Policy
		Using boilerplate
	Working on http://www.firstsolar.com/en/About-Us/Corporate-Responsibility
		Using boilerplate
Working on Sun Chemical Corporation
Sun Chemical Corporation
Original urls
['https://www.sunchemical.com/',
 'https://www.sunchemical.com/about/',
 'https://www.sunchemical.com/200-years/',
 'https://www.sunchemical.com/about/locations/',
 'https://www.sunchemical.com/about/leadership/',
 'https://www.sunchemical.com