# extract-pages-from-mongo-v5
SanjayKAroraPhD@gmail.com <br>
December 2018

## Description
This version of the notebook extracts groups of pages from MongoDB by firm_name and writes one firm-centric <b>about</b> page output file per firm, which can later be topic modeled. Along the way it removes repetitive content (e.g., repeated menu items) and garbage content (e.g., improperly parsed HTML code).
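
One possible way to drop repeated menu items is sketched below. It assumes each page is available as a list of text lines, and it discards any line that shows up on more than a chosen share of a firm's pages. The helper name `drop_repeated_lines`, the `max_share` threshold and the sample lines are illustrative assumptions, not code from this notebook.

```python
from collections import Counter

def drop_repeated_lines(pages, max_share=0.5):
    """Remove lines that occur on more than max_share of the given pages.

    pages: list of pages, each a list of text lines (assumed input shape).
    """
    if len(pages) < 2:
        # with a single page there is nothing to compare against
        return [list(page) for page in pages]
    seen_on = Counter()
    for page in pages:
        # count each distinct line once per page
        for line in {l.strip() for l in page}:
            seen_on[line] += 1
    limit = max_share * len(pages)
    return [[l for l in page if seen_on[l.strip()] <= limit] for page in pages]

pages = [
    ['Home | About | Contact', 'We design sensors.'],
    ['Home | About | Contact', 'Our team has forty engineers.'],
    ['Home | About | Contact', 'We license our technology.'],
]
cleaned = drop_repeated_lines(pages)
```

Here the menu line appears on all three pages and is dropped, while the page-specific sentences are kept.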

## Change log
v4 focuses on about pages

## TODO:
* Whole process: get data, topic model and see if it looks sufficiently interesting/different
* Enhance data collection, per the following: 
    * Select a region or country — WAIT 
        * http://www.ivoclarvivadent.com: Please select your region
        * https://www.enersys.com/: PLEASE SELECT A REGION
        * https://www.m-petfilm.com/: ENGLISH
    * Crawl from focal about page only following links that look like part of the about story, maintaining ordering.  Check to see if the other links identified above are also there? 
        * http://xtalsolar.com/investors_partners.html
* Order known about us pages in the same way the links are found on a home page or about us landing page
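
The last TODO item could be approached roughly as follows: collect the links of a landing page in document order, then sort the known about urls by the position of their first appearance. This is a minimal sketch using only the standard library; `LinkCollector`, `order_by_landing_page` and the example.com urls are hypothetical and not part of this notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags in document order."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                # resolve relative links against the landing page url
                self.links.append(urljoin(self.base_url, href))

def order_by_landing_page(landing_url, landing_html, known_urls):
    collector = LinkCollector(landing_url)
    collector.feed(landing_html)
    # position of the first appearance of each link on the landing page
    rank = {}
    for pos, link in enumerate(collector.links):
        rank.setdefault(link.rstrip('/'), pos)
    # urls that are not linked from the landing page go last, in their original order
    last = len(collector.links)
    return sorted(known_urls, key=lambda u: rank.get(u.rstrip('/'), last))

html = '<nav><a href="/about/history/">History</a><a href="/about/team/">Team</a></nav>'
known = [
    'http://example.com/about/team/',
    'http://example.com/about/values/',
    'http://example.com/about/history/',
]
ordered = order_by_landing_page('http://example.com/about/', html, known)
```

Because `sorted` is stable, pages that the landing page does not link to keep their relative order at the end of the list.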

In [1]:
# import data processing and other libraries
import csv
import sys
import requests
import os
import re
import pprint
import pymongo
import traceback
from time import sleep
import pandas as pd
import io
from IPython.display import display
import time
import numpy as np
from bs4 import BeautifulSoup
import string
import random
from urllib.parse import urlparse, urljoin
from collections import defaultdict
from collections import OrderedDict
import collections

In [2]:
from boilerpipe.extract import Extractor

In [3]:
# import sklearn
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix

In [4]:
MONGODB_DB = "FirmDB_20181226"
MONGODB_COLLECTION = "pages_ABOUT"
CONNECTION_STRING = "mongodb://localhost"

client = pymongo.MongoClient(CONNECTION_STRING)
db = client[MONGODB_DB]
col = db[MONGODB_COLLECTION]

ABOUT_DIR = '/Users/sarora/dev/EAGER/data/orgs/about/'
DATA_DIR = '/Users/sarora/dev/EAGER/data/orgs/parsed_page_output/'
TRAINING_PERCENT = .10
pp = pprint.PrettyPrinter()

In [5]:
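def get_domain (url):
    o = urlparse(url.lower())
    domain = o.netloc
    # str.strip('www.') removes any leading/trailing 'w' and '.' characters
    # (e.g. 'www.wwsensdevices.com' -> 'sensdevices.com'), so drop the prefix explicitly
    if domain.startswith('www.'):
        domain = domain[len('www.'):]
    return domain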

# output urls for labeling of training data
results = col.find({},{"url": 1, "firm_name": 1})
df = pd.DataFrame(columns = ('firm_name', 'url', 'label'))
domain_count = defaultdict(int)
# iterate the cursor directly instead of calling the deprecated Cursor.count()
for i, result in enumerate(results):
    url = result['url'][0]
    domain_count[get_domain(url)] += 1
    firm_name = result['firm_name'][0] if 'firm_name' in result else ''
    df.loc[i] = [firm_name, url, '']
    
df['gid'] = df.groupby(['firm_name']).ngroup()

In [22]:
# ngroup() numbers groups from 0, so sample from the full range of group ids
n_groups = df.gid.nunique()
label_ids = random.sample(range(n_groups), 200)
df_label = df[df['gid'].isin(label_ids)]
with open(ABOUT_DIR + 'about_pages_to_label.csv', mode='w') as to_label:
    df_label.to_csv(to_label, index=False)

In [37]:
# read back labeled data (note that about, management/team and partners, are dichotomous)
df_about_labeled = pd.read_csv(ABOUT_DIR + 'about_pages_labeled.csv')
df_about_labeled = df_about_labeled.fillna(0)

# count pages per domain
for index, row in df_about_labeled.iterrows():
    pages_in_domain = domain_count[get_domain(row['url'])]
    df_about_labeled.loc[index,'pages_in_domain'] = pages_in_domain
    is_sole_page = 0 if pages_in_domain > 1 else 1
    df_about_labeled.loc[index,'is_sole_page'] = is_sole_page
    
labeled_urls = list(df_about_labeled['url']) # for training models on labeled urls below
df_about_labeled = df_about_labeled.set_index(['firm_name', 'url'])
print (df_about_labeled.columns.tolist())

# final test set is the rows of the original data frame without the urls in df_about_labeled 

['about_lbl', 'mgmt_lbl', 'partners_lbl', 'ip_lbl', 'about_agg_lbl', 'gid', 'pages_in_domain', 'is_sole_page']


## Create features to predict about pages
Create features:
1. title and url path fragment unigrams (also tried n-grams, as well as content from headers, with worse results) 
2. is home page and doesn't have any other pages
3. other ideas here: https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41

In [38]:
# load page data and create features

# remove simple article words and punctuation (need to keep 'about')
stop_words = ['the','a'] + list(string.punctuation) 
# remove known company names for model training and evaluation in the labeled data 
remove_regex = re.compile(r'^(3m|united|states|menu|en_us|algeternal|s\d+|sarepta|skygen|nexgen|abbott|adlens|errorpage|\d{1,3}|\d{5,}|\w+\d+|\d+\w+|asten|johnson|baker|hughes|ge|bhge|biocon|egfr|gcsf|pegfilgrastim|bostik|canon|chevron|phillips|coloplast|cyberonics|microsoft|evoqua|ford|hitachi|glucanbio|hunter|douglas|kimberly|clark|lextar|fisher|lockheed|martin|lux|nec|nanocopoeia|cisco|schlumberger|weccamerica|inanobio|nanocomposix|zoetis|zygo)$', re.IGNORECASE)
# used to filter top-level header content
header_in = re.compile('(about|company|corporate|who.we.are|(^|/)vision|awards|profile|corporate|management|team|history|values|strategy|our |technology|research|commercialization)', flags=re.IGNORECASE)
header_regex = re.compile(r'^h[1-6]$') # HTML defines heading tags h1 through h6

def clean_string(in_string):
    if not in_string:
        return in_string
    split_words = in_string.lower().split()
    result_words  = [word for word in split_words if word not in stop_words]
    result_words  = [word for word in result_words if not remove_regex.search(word)]
    result = ' '.join(result_words)
    return ' ' + result

def get_page_path_text (url):
    o = urlparse(url.lower())
    path = o.path
    path_parts = path.split ('/')
    path_parts = [part.split('.')[0] for part in path_parts] # remove page names
    path_parts = [split for part in path_parts for split in part.split('-') ] # split on hyphens
    path_parts = [split for part in path_parts for split in part.split('_') ] # split on underscores
    clnd_string = clean_string(' '.join(path_parts))
    return clnd_string

# recurse through the header text to add into feature grams
def get_header_text (headers, names, index):
    texts = [clean_string(header.text) for header in headers if header.name == names[index]]
    texts = list(filter(header_in.search, texts))
    if texts and len(texts[0].split()) > 4:
        if(len(names) > (index + 1)):
            return get_header_text (headers, names, index + 1)
        else:
            return ''
    else: 
        return ' '.join (texts)
    
def process_firms (urls): 
    firm_page_features = {}
    for url in urls: 
        result = col.find_one({"url": url})
        domain = get_domain(url)
        html = result['html'][0]
        
        # print (url)

        soup = BeautifulSoup(html, 'lxml')
        running_text = ''
        path_text = get_page_path_text(url)
        
        if path_text:
            # print (path_text)
            running_text += path_text
            
        if soup.title and soup.title.string:
            # print (soup.title.string)
            running_text += clean_string(soup.title.string)
            
        headers = soup.find_all(header_regex, text=True)
        names = sorted(set ([header.name for header in headers]))
        running_text += get_header_text (headers, names, 0)

        firm_page_features[url] = running_text
        
    return firm_page_features

In [39]:
# for testing regex
print (get_page_path_text('http://www.google.com/path-en/path_to/page.html'))
print (re.split(r"\W+|_", "Testing this_thing"))
print (clean_string('3m 01	08	100	10m ford 235 1990 s129 188209 0913lk the ? about us'))
pp.pprint (list(filter(header_in.search, ['about us', 'not found', 'company'])))

 path en path to page
['Testing', 'this', 'thing']
 about us
['about us', 'company']


In [40]:
# get firm website data for n-gram processing 
labeled_firm_page_features = process_firms (labeled_urls)

urls = labeled_firm_page_features.keys()
print (len(urls))
corpus = []
for url in urls:
    corpus.append (labeled_firm_page_features[url])
    
# unigram
ubv = TfidfVectorizer(min_df=0., max_df=1.)
# you can set the n-gram range to 1,2 to get unigrams as well as bigrams (performs worse than just unigrams)
# ubv = TfidfVectorizer(ngram_range=(1,2)) 

ubv_matrix = ubv.fit_transform(corpus)

ubv_matrix = ubv_matrix.toarray()
vocab = ubv.get_feature_names() # scikit-learn 1.0+ provides get_feature_names_out() instead
ubv_df = pd.DataFrame(ubv_matrix, columns=vocab)
ubv_df.index = urls
ubv_df.index.name='url'

1031


In [41]:
# merge two datasets (features and labeled data)
print(ubv_df.shape)
print(df_about_labeled.shape)

all_merged = ubv_df.join(df_about_labeled, how='inner')
print(all_merged.shape)

(1031, 1472)
(1031, 8)
(1031, 1480)


In [42]:
# split labeled and predict datasets 
labeled = all_merged[all_merged['gid'].notnull()]
print (df_about_labeled.columns.tolist())

['about_lbl', 'mgmt_lbl', 'partners_lbl', 'ip_lbl', 'about_agg_lbl', 'gid', 'pages_in_domain', 'is_sole_page']


In [48]:
# labeled train/test split
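# note: the slice starts at column 1, so the first vocabulary column is left out (X_test is sliced the same way below)
X = labeled.iloc[:,1:len(ubv_df.columns)].copy() # copy so the column assignments below do not write to a view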
X['pages_in_domain'] = labeled['pages_in_domain']
X['is_sole_page'] = labeled['is_sole_page']
X.to_csv(ABOUT_DIR + 'X.csv', index = True) # for manual inspection

y = labeled.loc[:,'about_lbl']

## Train and evaluate the model
On just the labeled data

In [49]:
# specify a few models

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", 
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "SVC", "QDA"]

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(gamma=0.001, C=100.), 
    QuadraticDiscriminantAnalysis()]

In [50]:
# build dataframe for output metrics 
eval_df = pd.DataFrame (names,index=(range(len(names))), columns=["Name"])
eval_df['Accuracy'] = np.float64(0)
display (eval_df)

                Name  Accuracy
0  Nearest Neighbors       0.0
1         Linear SVM       0.0
2            RBF SVM       0.0
3      Decision Tree       0.0
4      Random Forest       0.0
5         Neural Net       0.0
6           AdaBoost       0.0
7        Naive Bayes       0.0
8                SVC       0.0
9                QDA       0.0


In [51]:
# build evaluation outputs (currently limited to accuracy)
i = np.int64(0)
for name, clf in zip(names, classifiers):
    display (name)
    scores = cross_val_score(clf, X, y)
    avg_score = np.mean(scores)
    eval_df.at[i, 'Accuracy'] = avg_score # DataFrame.set_value was removed in pandas 1.0
    i = i + 1
    
display(eval_df)
eval_df.to_clipboard()

'Nearest Neighbors'

'Linear SVM'

'RBF SVM'

'Decision Tree'

'Random Forest'

'Neural Net'

'AdaBoost'

'Naive Bayes'

'SVC'

'QDA'

                Name  Accuracy
0  Nearest Neighbors  0.683819
1         Linear SVM  0.646946
2            RBF SVM  0.702277
3      Decision Tree  0.750716
4      Random Forest  0.646946
5         Neural Net  0.847723
6           AdaBoost   0.81671
7        Naive Bayes  0.660529
8                SVC  0.800189
9                QDA   0.47525
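
A caveat on the accuracies above: `cross_val_score` with its default folds does not take firms into account, so pages from one firm can end up in both the training and the test fold. `GroupKFold` is imported earlier but never used. The dependency-free sketch below only illustrates the idea of keeping each `gid` group inside a single fold; `group_folds` and the sample `gids` are hypothetical, and this is a simplified illustration rather than scikit-learn's implementation. With scikit-learn itself, passing `groups=` together with `cv=GroupKFold(...)` to `cross_val_score` would be the likely route.

```python
from collections import defaultdict

def group_folds(groups, n_splits=3):
    """Yield (train_idx, test_idx) pairs so that no group is split across train and test."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    folds = [[] for _ in range(n_splits)]
    # place larger groups first, always into the currently smallest fold
    for g in sorted(members, key=lambda g: len(members[g]), reverse=True):
        smallest = min(folds, key=len)
        smallest.extend(members[g])
    for test in folds:
        test_set = set(test)
        train = [i for i in range(len(groups)) if i not in test_set]
        yield train, sorted(test)

# one group id per labeled page (illustrative values)
gids = [0, 0, 0, 1, 1, 2, 3, 3]
splits = list(group_folds(gids, n_splits=3))
for train_idx, test_idx in splits:
    print('train:', train_idx, 'test:', test_idx)
```

Accuracy measured on such folds reflects how the model handles firms it has not seen, which is the situation in the prediction step further down.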


## Grid search using MLPClassifier to tune hyperparameters

In [19]:
hls = []
hls.append([100,])
hls.append([70,70,70])
hls.append([40,40,40])
hls.append([10,10,10])
pp.pprint(hls)

[[100], [70, 70, 70], [40, 40, 40], [10, 10, 10]]


In [52]:
# hls comes from the previous cell, which has to be run first
parameters = {'solver': ['lbfgs'], 'max_iter': [300,500,700], 'alpha': 10.0 ** -np.arange(1, 10), 'hidden_layer_sizes': hls, 'random_state':[5,10,15]}
clf_grid = GridSearchCV(MLPClassifier(), parameters, n_jobs=-1)
clf_grid.fit(X,y)

print("Best score: %0.4f" % clf_grid.best_score_)
print("Using the following parameters:")
print(clf_grid.best_params_)

NameError: name 'hls' is not defined

In [53]:
# train neural net model with best hyperparameter configuration
clf = MLPClassifier(alpha=0.01, hidden_layer_sizes=(70,70,70), max_iter=300, random_state=10, solver='lbfgs')
clf.fit(X, y)

y_hat = clf.predict(X)

In [54]:
# print all instances where predictions don't match labels (for inspection)
confusion_matrix(y, y_hat)

for key, y_i, y_hat_i in zip(list(X.index), y, y_hat):
    if y_i != y_hat_i:
        print(key[1], 'has been classified as ', y_hat_i, 'but should be ', y_i) 

http://acaciaresearch.com/history/ has been classified as  1.0 but should be  0.0
https://www.asml.com/company/faq/en/s32894 has been classified as  1.0 but should be  0.0
https://www.asml.com/company/company-calendar/en/s32775 has been classified as  1.0 but should be  0.0
https://allisontransmission.com/company/centennial has been classified as  0.0 but should be  1.0
https://allisontransmission.com/company/history-heritage has been classified as  1.0 but should be  0.0
https://www.amgen.com/about/how-we-operate/policies-practices-and-disclosures/ has been classified as  0.0 but should be  1.0
https://www.amgen.com/about/quick-facts/ has been classified as  0.0 but should be  1.0
https://www.andritz.com/group-en/about-us/gr-company-boards has been classified as  1.0 but should be  0.0
http://angiotech.com/about/itemlist/category/59-come-see-us has been classified as  1.0 but should be  0.0
https://www.adm.com/our-company/procurement has been classified as  0.0 but should be  1.0
http

## Predict about pages for unlabeled data

In [31]:
# prepare domain level features 
df_predict = df[~df['url'].isin(labeled_urls)].copy() # copy so the .loc assignments below do not write to a view; careful: index is firm_name and url now
# count pages per domain
for index, row in df_predict.iterrows():
    pages_in_domain = domain_count[get_domain(row['url'])]
    df_predict.loc[index,'pages_in_domain'] = pages_in_domain
    is_sole_page = 0 if pages_in_domain > 1 else 1
    df_predict.loc[index,'is_sole_page'] = is_sole_page
    
# set index 
df_predict = df_predict.set_index(['firm_name', 'url'])
print (df_predict.columns.tolist())

['label', 'gid', 'pages_in_domain', 'is_sole_page']


In [37]:
# check to see whether there are duplicate urls
# note: there should be because different assignees may map to the same domain (see error above)
counter=collections.Counter(df_predict.index)
most_common = counter.most_common(10)
pp.pprint (most_common)

[(('Mitsubishi Metal Corporation',
   'https://www.mitsubishicorp.com/jp/en/about/plan/'),
  1),
 (('22nd Century Limited', 'http://www.xxiicentury.com/history/'), 1),
 (('Mitsubishi Metal Corporation',
   'https://www.mitsubishicorp.com/jp/en/about/global/'),
  1),
 (('Mitsubishi Metal Corporation',
   'https://www.mitsubishicorp.com/jp/en/about/message/'),
  1),
 (('Forest Concepts', 'http://forestconcepts.com/index.php?page=01002'), 1),
 (('ZENA TECHNOLOGIES', 'http://xena-technologies.com/program-management/'), 1),
 (('Kansai Paint Co.', 'https://www.kansai.com/about-us/corporate_data.html'),
  1),
 (('W&Wsens Devices', 'https://www.wwsensdevices.com/'), 1),
 (('Kansai Paint Co.', 'https://www.kansai.com/about-us/brand.html'), 1),
 (('Kansai Paint Co.', 'https://www.kansai.com/about-us/'), 1)]


In [38]:
# prepare n-gram features
unlabeled_firm_page_features = process_firms (set(df_predict.index.get_level_values('url')))

prediction_urls = unlabeled_firm_page_features.keys()

pred_corpus = []
for url in prediction_urls:
    pred_corpus.append (unlabeled_firm_page_features[url])

ubv_prediction_matrix = ubv.transform(pred_corpus)

ubv_prediction_matrix = ubv_prediction_matrix.toarray()
vocab = ubv.get_feature_names() # scikit-learn 1.0+ provides get_feature_names_out() instead
ubv_prediction_df = pd.DataFrame(ubv_prediction_matrix, columns=vocab)
ubv_prediction_df.index = prediction_urls
ubv_prediction_df.index.name='url'

In [39]:
print(ubv_prediction_df.shape)
print(df_predict.shape)

predict_merged = ubv_prediction_df.join(df_predict, how='right', rsuffix='_lbl')
print(predict_merged.shape)

# merge
X_test = predict_merged.iloc[:,1:len(ubv_prediction_df.columns)].copy() # copy so the column assignments below do not write to a view
X_test['pages_in_domain'] = predict_merged['pages_in_domain']
X_test['is_sole_page'] = predict_merged['is_sole_page']
print (X.shape)
print (X_test.shape) # should be the same number of cols

X_test

(4147, 1472)
(4516, 4)
(4516, 1476)
(1031, 1473)
(4516, 1473)


firm_name,url,10m,13485,14001,1870s,1910s,1920s,2016,2020,20pages,3d,...,z18038e,zegage,zeno,zero,zoetis,zonne,公司简介,隆達電子,pages_in_domain,is_sole_page
Mitsubishi Metal Corporation,https://www.mitsubishicorp.com/jp/en/about/plan/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0
22nd Century Limited,http://www.xxiicentury.com/history/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
Mitsubishi Metal Corporation,https://www.mitsubishicorp.com/jp/en/about/global/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0
Mitsubishi Metal Corporation,https://www.mitsubishicorp.com/jp/en/about/message/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,0.0
Forest Concepts,http://forestconcepts.com/index.php?page=01002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
ZENA TECHNOLOGIES,http://xena-technologies.com/program-management/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
Kansai Paint Co.,https://www.kansai.com/about-us/corporate_data.html,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
W&Wsens Devices,https://www.wwsensdevices.com/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
Kansai Paint Co.,https://www.kansai.com/about-us/brand.html,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
Kansai Paint Co.,https://www.kansai.com/about-us/,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0


In [40]:
# predict with newly constructed X
y_predicted = clf.predict(X_test)

In [41]:
# standard firm cleaning regex
def clean_firm_name (firm):
    firm_clnd = re.sub(r'(\.|,| corporation| incorporated| llc| inc| international| gmbh| ltd)', '', firm, flags=re.IGNORECASE).rstrip()
    return firm_clnd

# write to file
with open(ABOUT_DIR + 'about_predicted_and_labels.csv', mode='w', newline='') as about_file: # newline='' as recommended for csv.writer
    about_writer = csv.writer(about_file, delimiter=',', quotechar='"')
    about_writer.writerow(['firm_name', 'url', 'is_about'])
    # output predicted values to file
    for fn, u, predicted_value in zip(X_test.index.get_level_values('firm_name'), X_test.index.get_level_values('url'), y_predicted):
        # print (fn + ' with url ' + u + ' has predicted value ' + str(predicted_value))
        about_writer.writerow([clean_firm_name(fn), u, predicted_value])
    # and the labeled ones too...
    for fn, u, labeled_value in zip(X.index.get_level_values('firm_name'), X.index.get_level_values('url'), y):
        # print (fn + ' with url ' + u + ' has predicted value ' + str(labeled_value))
        about_writer.writerow([clean_firm_name(fn), u, labeled_value])

## Extract data from mongodb
* Now that we know which pages are about pages, extract them from MongoDB and output them for topic modeling
* For now, construct paragraphs from different pages by ordering urls by their length.  In the future, we might want to construct paragraphs in their 'natural' sequential order as they would appear on a home page or landing page
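
A minimal illustration of the length-based ordering, using three urls that appear in this notebook's log output. Sorting by length puts a landing page ahead of its longer child pages, but siblings end up ordered by url length rather than by their position on the site, which is the limitation the second bullet points at.

```python
urls = [
    'https://www.infineum.com/en/about-us/history/',
    'https://www.infineum.com/en/about-us/',
    'https://www.infineum.com/en/about-us/safety/',
]

# shortest url first: the landing page precedes its child pages
ordered_by_length = sorted(urls, key=len)
for url in ordered_by_length:
    print(url)
```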

In [42]:
# combine both labeled and predicted frames
print (X_test.shape)
print(X.shape)

combined = pd.concat([X_test, X]) # DataFrame.append was removed in pandas 2.0
print (combined.shape)
print (len(y_predicted))
print (len(y))
abouts = pd.DataFrame(index=combined.index)

abouts['is_about'] = list(y_predicted) + list(y)
abouts = abouts.reset_index()
abouts

(4516, 1473)
(1031, 1473)
(5547, 1473)
4516
1031


Unnamed: 0,firm_name,url,is_about
0,Mitsubishi Metal Corporation,https://www.mitsubishicorp.com/jp/en/about/plan/,0.0
1,22nd Century Limited,http://www.xxiicentury.com/history/,1.0
2,Mitsubishi Metal Corporation,https://www.mitsubishicorp.com/jp/en/about/glo...,1.0
3,Mitsubishi Metal Corporation,https://www.mitsubishicorp.com/jp/en/about/mes...,0.0
4,Forest Concepts,http://forestconcepts.com/index.php?page=01002,0.0
5,ZENA TECHNOLOGIES,http://xena-technologies.com/program-management/,0.0
6,Kansai Paint Co.,https://www.kansai.com/about-us/corporate_data...,1.0
7,W&Wsens Devices,https://www.wwsensdevices.com/,1.0
8,Kansai Paint Co.,https://www.kansai.com/about-us/brand.html,1.0
9,Kansai Paint Co.,https://www.kansai.com/about-us/,1.0


In [43]:
# gather unique firm_names from mongodb
firm_names = set(abouts['firm_name'])
print (len(firm_names))
pp = pprint.PrettyPrinter()
pp.pprint(firm_names)

1050
{'22nd Century Limited',
 '3M Innovative Properties Company',
 'ABB AB',
 'AC International Inc.',
 'ACACIA RESEARCH GROUP LLC',
 'ACUCELA INC.',
 'ACell',
 'ADASA INC.',
 'ADVANCED INNOVATION CENTER LLC',
 'AFMODEL',
 'AGC Flat Glass North America',
 'AGFA-GEVAERT N.V.',
 'ALGETERNAL TECHNOLOGIES',
 'ALSTOM Technology Ltd',
 'AMPT',
 'APPLIED STEMCELL',
 'ARBOR THERAPEUTICS',
 'ASCENT SOLAR TECHNOLOGIES',
 'ASM America',
 'ASML Netherlands B.V.',
 'ASTUTE MEDICAL',
 'AT&T Corporation',
 'ATC Technologies',
 'ATOMERA INCORPORATED',
 'ATTOSTAT',
 'AVI BioPharma',
 'AVOGY',
 'AbbVie Inc.',
 'Abbott Molecular Inc.',
 'Abbott Point of Care Inc.',
 'Abengoa Bioenergy New Technologies',
 'Ablexis',
 'Access Business Group International LLC',
 'AccuRay Corporation',
 'Acorn Technologies',
 'Adaptive Biotechnologies Corp.',
 'Adeka Corporation',
 'Adhesives Research',
 'Adlens Beacon',
 'Adobe Systems Incorporated',
 'Adtran',
 'Advanced Analogic Technologies',
 'Advanced Aqua Group',
 'A

In [177]:
def get_ordered_about_urls (firm_name):
    urls = list (abouts.loc[(abouts['firm_name'] == firm_name) & (abouts['is_about'] == 1), 'url'])
    urls.sort(key = len)
    # print ('Original urls')
    # pp.pprint(urls)

    index = {}
    for url in urls:
        path_fragments = len(url.split('/'))
        added = False
        for i in range(1, path_fragments):
            key_phrase = url.rsplit('/', maxsplit=i)[0]
            if key_phrase in urls or (key_phrase + '/') in urls: 
                od = index.setdefault(key_phrase, OrderedDict())
                od[url] = 1
                added = True
                continue
        if not added:
            od = index.setdefault(url, OrderedDict())
            od[url] = 1
 
    # pp.pprint (index)
    
    return_urls = [] 
    seen = set ()
    for key in index.keys():
        tree_urls = index[key]
        for fu in tree_urls:
            if fu not in seen:
                return_urls.append(fu)
                seen.add(fu)
    
    # finally remove home page if it exists and if there are other pages to draw on
    if not return_urls: 
        return None
    else: 
        first_page = return_urls[0]
        first_page_path = get_page_path_text (first_page)
        if first_page_path == ' ' or first_page_path == '':
            print (first_page_path + 'empty')
            return_urls.pop(0)
        return return_urls

test_urls = get_ordered_about_urls ('Previvo Genetics')
print ('Ordered urls')
pp.pprint (test_urls)

Ordered urls
None


In [155]:
# remove html content
def is_javascript (x):
    match_string = r"(CDATA|return\s+true|return\s+false|getelementbyid|function|\w+\(.*?\);|\w{2,}[\\.|:]+\w{2,}|header|hover|'\w+':\s+'\w+|\\|{|}|\r|\n|\/\/')"
    # capture CDATA; function declarations; function calls; word sequences separated by a period (e.g., denoting paths)
    regex = re.findall(match_string, x) 
    words = x.split()
    if not words:
        return False # empty or whitespace-only strings would otherwise cause a division by zero
    # check to see if the regex finds some percentage of the words look like javascript patterns
    return (len(regex) / float(len(words))) > .10

def clean_page_content (text_list):
    # remove whatever we think is html
    removed_html = filter(lambda x: not( bool(BeautifulSoup(x, "html.parser").find()) ), text_list)
    # remove content that looks like javascript 
    removed_js = filter(lambda x: not (is_javascript(x)), removed_html)
    # add other checks here as needed

    return removed_js
    

# iterate through firm urls and return concatenated string
def get_content (urls): 
    running_text = ''
    for url in urls:
        print ('\tWorking on ' + url)
        result = col.find_one( {"url": url} )
        if result:
            clnd_text = clean_page_content(result['full_text'])
            clnd_text = '\n'.join(clnd_text)
            boilerpipe = None
            
            if 'body' in result:
                extractor = Extractor(extractor='DefaultExtractor', html = result['body'][0])
                lines = extractor.getText().replace(u'\xa0', u' ').split('\n')
                filtered = filter(lambda x: not re.match(r'^\s*$', x), lines)
                boilerpipe = '\n'.join(filtered)

            # TODO fix to split().  Counting characters currently 
            if boilerpipe and (len(boilerpipe) > .5 * len(clnd_text)):
                print ('\t\tUsing boilerplate')
                running_text += boilerpipe
            else:
                print ('\t\tUsing clnd_text')
                running_text += clnd_text
        else:
            print ('Cannot find url: ' + url)

    return running_text

def clean_text (firm_name, text):
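    # escape each token so characters such as '.', '&' or '+' in a firm name are matched literally
    tokens = [re.escape(token) for token in firm_name.split()]
    clnd_text = text
    if tokens: # an empty firm name would otherwise reduce the pattern to (\s) and strip all whitespace
        strip_regex = re.compile(r"(" + r"\s|".join(tokens) + r"\s)", re.IGNORECASE)
        clnd_text = strip_regex.sub ('', text)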
    
    more_regex = re.compile(r"([A-Z]\.?){1,} ")
    clnd_text = more_regex.sub ('', clnd_text)
    
    return ' '.join(clnd_text.split(' '))
    

In [156]:
# regex test 
regex = re.findall(r"(CDATA|return\s+true|return\s+false|getelementbyid|function|\w+\(.*?\);|\w{2,}[\\.|:]+\w{2,}|'\w+':\s+'\w+|\\|{|}|\r|\n|\/\/')", 
                   "CDATA function contact-us getelementbyid javascript.function linker:autoLink www.littlekidsinc.com fxnCall(param.param); email@dextr.us 'type': 'image' return true return false rev7bynlh\\u00252bvcgrjg\\ {height}") # last part is words sequences separated by punct
print (regex)

print (clean_firm_name('Ford Motor Co. Corporation'))
print (clean_text (clean_firm_name('Ford Motor Co.'), 'Ford is a motor company.  It has been building vehicles for over a century. H.W.F_ Ford was a nice guy.'))

['CDATA', 'function', 'getelementbyid', 'javascript.function', 'linker:autoLink', 'www.littlekidsinc', 'fxnCall(param.param);', 'dextr.us', "'type': 'image", 'return true', 'return false', 'rev7bynlh\\u00252bvcgrjg', '\\', '{', '}']
Ford Motor Co
is a company.  It has been building vehicles for over a century. H.W.F_ was a nice guy.


In [157]:
test_site_text = get_content (test_urls)
print (test_site_text)

	Working on http://www.xxiicentury.com/history/
		Using boilerplate
	Working on http://www.xxiicentury.com/profile/
		Using boilerplate
Company History
THE IDEA
In 1998, Joseph Pandolfino founded 22nd Century Limited, LLC (22nd Century) to provide funding to North Carolina State University (NCSU) for a research and development collaboration on nicotine biosynthesis in the tobacco plant. Mr. Pandolfino asked the question: “Since coffee without caffeine and beer without alcohol are commercially available, why aren’t tobacco cigarettes without nicotine a choice for consumers?” Further, he hypothesized: “If it were possible to produce tobacco cigarettes without nicotine, couldn’t smokers use these extraordinary cigarettes to successfully quit smoking?”
A paradox to be sure. Nonetheless, Mr. Pandolfino’s hypothesis stemmed from his careful observation of real smokers. As an importer of tobacco-free herbal cigarettes, Mr. Pandolfino learned that many consumers were using nicotine-free herbal

In [174]:
# run process_firm and write to file
pp = pprint.PrettyPrinter()
for firm_name in firm_names: 
    print ("Working on " + firm_name)
    about_urls = get_ordered_about_urls(firm_name)
    if not about_urls:
        print ("\tCouldn't find any urls for firm!")
        continue
        
    about_text = get_content (about_urls)
    
    if about_text: 
        firm_clnd = clean_firm_name(firm_name) # standard cleaning code throughout project
        about_clnd = clean_text (firm_name, about_text)
        file = firm_clnd.replace('/', '|') + '.txt' # a slash in a firm name would be read as a path separator
        with io.open(DATA_DIR + file,'w',encoding='utf8') as f:
            f.write (about_clnd)
    else:
        print ("\tCouldn't find any text for firm!")

Working on DEKA Products Limited Partnership
	Working on http://www.dekaresearch.com/about-deka/
		Using boilerplate
Working on Elenion Technologies
	Working on https://elenion.com/summary/
		Using boilerplate
Working on INFINEUM INTERNATIONAL LIMITED
	Working on https://www.infineum.com/en/about-us/
		Using clnd_text
	Working on https://www.infineum.com/en/about-us/safety/
		Using clnd_text
	Working on https://www.infineum.com/en/about-us/history/
		Using boilerplate
	Working on https://www.infineum.com/en/about-us/overview/
		Using boilerplate
	Working on https://www.infineum.com/en/about-us/our-brand/
		Using clnd_text
	Working on https://www.infineum.com/en/careers/our-teams/
		Using boilerplate
Working on OmniVision Technologies
	Working on https://www.ovt.com/company
		Using clnd_text
	Working on https://www.ovt.com/company/company-profile
		Using clnd_text
Working on Kimberly-Clark Worldwide
	Working on https://www.kimberly-clark.com/en-us/company/awards
		Using clnd_text
	Worki

		Using boilerplate
Working on Livetv
	Couldn't find any urls for firm!
Working on Poly-Med
	Working on http://poly-med.com/about/
		Using boilerplate
Working on SELMAN AND ASSOCIATES
	Working on https://selmanlog.com/Default.aspx
		Using boilerplate
Working on INVISTA North America S.a.r.l.
	Working on https://www.invista.com/about/who-we-are
		Using boilerplate
	Working on https://www.invista.com/About/Who-We-Are
		Using boilerplate
Working on WCA Group LLC
	Working on https://www.wca-group.com/about-us/
		Using clnd_text
Working on Nanotek Instruments
	Working on http://nanotekinstruments.com/about-us/
		Using boilerplate
Working on First Solar
	Working on http://www.firstsolar.com/en/About-Us/Overview
		Using clnd_text
Working on Altex Technologies Corporation
	Working on http://altextech.com/about.html
		Using boilerplate
Working on Dermazone Solutions
	Couldn't find any urls for firm!
Working on Innovation Hammer
	Working on http://ihammerllc.com/aboutus
		Using boilerplate
	Work

		Using boilerplate
	Working on https://www.goodyear.com/en-US/company/college-football
		Using boilerplate
Working on Eastman Kodak Company
	Working on https://kodak.com/corp/company/default.htm
		Using clnd_text
	Working on https://kodak.com/motion/About/default.htm
		Using clnd_text
	Working on https://kodak.com/US/en/services-business/who-we-are/default.htm
		Using clnd_text
Working on POSiFA MICROSYSTEMS
	Working on http://posifamicrosystems.com/about_us.php
		Using boilerplate
Working on LGS Innovations LLC
	Working on https://www.lgsinnovations.com/about/
		Using boilerplate
	Working on https://www.lgsinnovations.com/history/
		Using boilerplate
	Working on https://www.lgsinnovations.com/about-alliant/
		Using clnd_text
	Working on https://www.lgsinnovations.com/mission-values/
		Using clnd_text
	Working on https://www.lgsinnovations.com/company-culture
		Using boilerplate
Working on Community Power Corporation
	Couldn't find any urls for firm!
Working on Oculus VR
	Couldn't fin

		Using boilerplate
	Working on https://global-sei.com/company/vision.html
		Using boilerplate
Working on Saint-Gobain Adfors Canada
	Working on http://www.adfors.com/us/about-adfors
		Using clnd_text
Working on Heat Seal LLC
	Working on https://heatsealco.com/about-ampak
		Using clnd_text
	Working on https://heatsealco.com/about-heatseal
		Using boilerplate
Working on Yageo Corporation
	Working on http://www.yageo.com/portal/about_yageo/about.jsp?SWITCH_CATEGORY=/about_yageo/Corporate%20overview&menuid=0&title1=c1
		Using clnd_text
	Working on http://www.yageo.com/portal/about_yageo/about.jsp?SWITCH_CATEGORY=/about_yageo/Social%20Responsibility&menuid=1&title1=c2
		Using clnd_text
	Working on http://www.yageo.com/portal/about_yageo/about.jsp?SWITCH_CATEGORY=/about_yageo/Environment%20and%20Safety&menuid=2&title1=c3
		Using clnd_text
	Working on http://www.yageo.com/portal/about_yageo/about.jsp?SWITCH_CATEGORY=/about_yageo/Certificates%20and%20Awards&menuid=4&title1=c5
		Using clnd_tex

		Using clnd_text
	Working on https://www.schneider-electric.us/en/about-us/company-profile/
		Using clnd_text
Working on Takeda Pharmaceutical Company Limited
	Working on https://www.takeda.com/who-we-are/
		Using clnd_text
	Working on https://www.takeda.com/who-we-are/company-information/
		Using boilerplate
	Working on https://www.takeda.com/who-we-are/corporate-governance/
		Using boilerplate
	Working on https://www.takeda.com/who-we-are/corporate-philosophy/
		Using boilerplate
Working on Clearside Biomedical
	Working on http://www.clearsidebio.com/aboutus.htm
		Using clnd_text
Working on Abengoa Bioenergy New Technologies
	Working on http://www.abengoa.com/web/en/compania/index.html
		Using clnd_text
Working on Microsemi SoC Corporation
	Working on https://www.microsemi.com/company/awards
		Using clnd_text
	Working on https://www.microsemi.com/company/quality
		Using clnd_text
	Working on https://www.microsemi.com/company/about-us
		Using clnd_text
	Working on https://www.microse

Working on Advanced Silicon Group
	Couldn't find any urls for firm!
Working on Kaneka Corporation
	Working on http://www.kaneka.co.jp/en/corporate/
		Using clnd_text
	Working on http://www.kaneka.co.jp/en/corporate/director/
		Using clnd_text
Working on Nutech Ventures
	Working on http://www.nutechventures.org/about-us/
		Using clnd_text
Working on Sun Chemical Corporation
	Working on https://www.sunchemical.com/about/
		Using boilerplate
	Working on https://www.sunchemical.com/about/research-and-development/
		Using clnd_text
	Working on https://www.sunchemical.com/200-years/
		Using boilerplate
Working on CoolEarth Solar
 empty
	Couldn't find any urls for firm!
Working on Industrial Technology Research Institute
	Working on https://www.itri.org.tw/eng/Content/Messagess/contents.aspx?SiteID=1&MmmID=617731521661672477
		Using clnd_text
Working on Arrowhead Center
	Working on http://arrowheadcenter.nmsu.edu/our-team/
		Using boilerplate
Working on SanDisk Technologies LLC
	Working on ht

		Using boilerplate
Working on Adobe Systems Incorporated
	Working on https://www.adobe.com/about-adobe.html?promoid=2NVQCDBQ&mv=other
		Using clnd_text
Working on Met Tech Inc.
	Working on http://www.mettech.us/about-us/
		Using boilerplate
Working on Bemis Company
	Working on http://www.bemis.com/about-bemis
		Using clnd_text
	Working on http://www.bemis.com/about-bemis/about-us
		Using clnd_text
Working on TP Solar
 empty
	Couldn't find any urls for firm!
Working on SEaB Energy Holdings Ltd.
	Working on http://seabenergy.com/about-seab/
		Using boilerplate
Working on Moxtek
	Working on http://moxtek.com/about/
		Using clnd_text
Working on Singular Bio
	Couldn't find any urls for firm!
Working on Kia Motors Corporation 
	Working on https://www.kia.com/us/en/content/about-kia/who-we-are
		Using boilerplate
Working on Sequenom
	Working on https://www.sequenom.com/company
		Using boilerplate
Working on DePuy Synthes Products
	Working on https://www.depuysynthes.com/about
		Using clnd_te

		Using clnd_text
Working on Wenger Corporation
	Working on https://www.wengercorp.com/about-wenger.php
		Using boilerplate
Working on MEDport
	Couldn't find any urls for firm!
Working on OrbusNeich Medical
	Working on http://orbusneich.com/en/general/history
		Using clnd_text
	Working on http://orbusneich.com/en/general/our-vision
		Using clnd_text
	Working on http://orbusneich.com/en/patient/about-coronary-artery-disease-cad
		Using clnd_text
Working on CSEM Centre Suisse d'Electronique et de Microtechnique SA-Recherche et Developpement
	Working on https://www.csem.ch/About
		Using boilerplate
	Working on https://www.csem.ch/About/Awards
		Using clnd_text
	Working on https://www.csem.ch/About/History
		Using boilerplate
	Working on https://www.csem.ch/About/Start-ups
		Using boilerplate
	Working on https://www.csem.ch/About/Certifications
		Using boilerplate
	Working on https://www.csem.ch/About/Socialresponsibility
		Using boilerplate
	Working on https://www.csem.ch/About/MissionVis

		Using clnd_text
	Working on http://www.fujifilm.com/about/history/
		Using clnd_text
	Working on http://www.fujifilm.com/about/research/
		Using clnd_text
	Working on http://www.fujifilm.com/about/research/oih/
		Using clnd_text
Working on Pinnacle Technology
	Working on http://pinnacle-technology.com/about.php
		Using boilerplate
	Working on http://pinnacle-technology.com/software-development.php
		Using boilerplate
Working on AccuRay Corporation
	Working on https://www.accuray.com/company/
		Using boilerplate
	Working on https://www.accuray.com/who-we-are/news/
		Using clnd_text
	Working on https://www.accuray.com/who-we-are/media/
		Using clnd_text
	Working on https://www.accuray.com/who-we-are/careers/
		Using boilerplate
	Working on https://www.accuray.com/who-we-are/events-2/
		Using clnd_text
	Working on https://www.accuray.com/who-we-are/locations/
		Using clnd_text
Working on Winbond Electronics Corp.
	Couldn't find any urls for firm!
Working on Sundrop Fuels
	Working on htt

Working on Sumitomo Rubber Industries
	Working on https://sumitomorubber-usa.com/company/about/
		Using boilerplate
	Working on https://sumitomorubber-usa.com/company/culture/
		Using boilerplate
	Working on https://sumitomorubber-usa.com/company/history/
		Using boilerplate
Working on Angiotech Pharmaceuticals (US)
	Working on http://angiotech.com/about
		Using boilerplate
	Working on http://angiotech.com/about/itemlist/category/5-corporate-profile
		Using boilerplate
Working on Paratek Pharmaceuticals
	Working on https://paratekpharma.com/about/
		Using clnd_text
Working on ACUCELA INC.
	Working on https://www.acucela.com/company/index.html
		Using clnd_text
	Working on https://www.acucela.com/company/information/index.html
		Using clnd_text
Working on Heliae Development
	Working on http://heliae.com/company/
		Using boilerplate
	Working on https://heliae.com/company/
		Using boilerplate
Working on Tessera
	Couldn't find any urls for firm!
Working on Pacific Biosciences of California

		Using clnd_text
	Working on https://www.bp.com/en_us/bp-us/who-we-are/commitment-to-safety/adapting-innovative-technologies.html
		Using clnd_text
	Working on https://www.bp.com/en_us/bp-us/who-we-are/commitment-to-safety/stopping-problems-before-they-start.html
		Using clnd_text
Working on BRIGHTLEAF TECHNOLOGIES INC.
	Working on http://www.brightleafpower.com/values/
		Using boilerplate
	Working on http://www.brightleafpower.com/about-us/
		Using boilerplate
	Working on http://www.brightleafpower.com/innovation/
		Using boilerplate
Working on Kobe Steel
	Working on http://www.kobelco.co.jp/english/about_kobelco/
		Using clnd_text
	Working on http://www.kobelco.co.jp/english/about_kobelco/kobesteel/
		Using boilerplate
Working on k-Space Associates
	Working on https://www.k-space.com/company/customers/
		Using clnd_text
	Working on https://www.k-space.com/company/company-videos/
		Using clnd_text
	Working on https://www.k-space.com/company/newsletter-archive/
		Using clnd_text
Worki

		Using clnd_text
	Working on http://investors.kalarx.com/phoenix.zhtml?c=254596&p=irol-govHighlights
		Using clnd_text
Working on Synopsys
	Working on https://www.synopsys.com/company.html
		Using clnd_text
	Working on https://www.synopsys.com/community/snug/about-snug.html
		Using clnd_text
Working on iNanoBio LLC
	Working on http://inanobio.com/aboutus.php
		Using boilerplate
Working on Graphene Technologies
 empty
	Couldn't find any urls for firm!
Working on Seiko Epson Corporation
	Working on https://global.epson.com/company/
		Using clnd_text
	Working on https://global.epson.com/innovation/vision/
		Using boilerplate
Working on Performance Plants
	Working on http://performanceplants.com/corporate
		Using boilerplate
Working on Fluidigm Corporation
	Working on https://www.fluidigm.com/about/aboutfluidigm
		Using boilerplate
Working on Antaya Technologies Corporation
	Working on https://www.antaya.com/company/
		Using boilerplate
Working on Integrated Solar Technology
	Working on h

		Using boilerplate
	Working on https://www.lyondellbasell.com/en/about-us/fortune/
		Using clnd_text
	Working on https://www.lyondellbasell.com/en/about-us/who-we-are/
		Using boilerplate
	Working on https://www.lyondellbasell.com/en/about-us/company-investments/
		Using clnd_text
	Working on https://www.lyondellbasell.com/en/investors/company-earnings/
		Using clnd_text
Working on Luna Innovations Incorporated
	Working on http://lunainc.com/about-luna/
		Using clnd_text
Working on Novus Technology
	Working on https://www.novustek.net/about-nti
		Using boilerplate
Working on II-VI Incorporated
	Working on https://www.ii-vi.com/about-us/
		Using boilerplate
	Working on https://mobile.twitter.com/ceramtec
		Using clnd_text
Working on Bigelow Aerospace
	Working on http://bigelowaerospace.com/pages/whoweare/
		Using boilerplate
Working on PDF Solutions
	Working on http://pdf.com/corporate-overview
		Using boilerplate
	Working on http://pdf.com/investors-corporate-overview
		Using boilerpl

		Using clnd_text
Working on Toray Industries Inc. 
	Working on https://www.toray.com/aboutus/
		Using clnd_text
	Working on https://www.toray.com/aboutus/index.html
		Using clnd_text
	Working on https://www.toray.com/aboutus/outline.html
		Using clnd_text
	Working on https://www.toray.com/aboutus/philosophy.html
		Using boilerplate
	Working on https://www.toray.com/aboutus/vision/index.html
		Using boilerplate
Working on VINDICO NANOBIO TECHNOLOGY INC.
	Couldn't find any urls for firm!
Working on H R D CORPORATION
	Working on http://haywardrec.org/27/About-Us
		Using boilerplate
Working on NanoLab
	Working on https://www.nano-lab.com/aboutdefault.html
		Using clnd_text
Working on Global OLED Technology LLC
	Working on http://www.globaloledtech.com/global-oled-technology.html
		Using boilerplate
Working on Polysar Corporation
	Working on https://www.polystar.com/company/
		Using boilerplate
	Working on https://www.polystar.com/company/whistleblower/
		Using clnd_text
	Working on https:

		Using clnd_text
	Working on https://www.shire.com/who-we-are/our-story/our-history
		Using boilerplate
	Working on https://www.shire.com/who-we-are/our-story/our-culture
		Using clnd_text
Working on Analog Devices
	Working on https://www.analog.com/en/about-adi.html
		Using boilerplate
	Working on https://www.analog.com:443/en/myhistory.html
		Using boilerplate
Working on W&Wsens Devices
 empty
	Couldn't find any urls for firm!
Working on Phillips 66 Company
	Working on https://www.phillips66.com/about
		Using boilerplate
Working on Lockheed Martin Corporation
	Working on https://lockheedmartin.com/en-us/who-we-are.html
		Using clnd_text
	Working on https://lockheedmartin.com/en-us/who-we-are/eesh.html
		Using boilerplate
	Working on https://lockheedmartin.com/en-us/who-we-are/ethics.html
		Using clnd_text
	Working on https://lockheedmartin.com/en-us/news/features/history.html
		Using clnd_text
	Working on https://lockheedmartin.com/en-us/who-we-are/international.html
		Using clnd_te

		Using boilerplate
	Working on https://www.janssen.com/company
		Using clnd_text
Working on PluroGen Therapeutics
	Working on http://plurogen.com/about-us/
		Using boilerplate
	Working on http://plurogen.com/about-us/about-dr-rodeheaver/
		Using boilerplate
Working on Opel Solar
	Couldn't find any urls for firm!
Working on Furukawa Electric Co.
	Working on https://www.furukawa.co.jp/en/company/
		Using clnd_text
	Working on https://www.furukawa.co.jp/en/company/outline.html
		Using clnd_text
	Working on https://www.furukawa.co.jp/en/company/history.html
		Using boilerplate
	Working on https://www.furukawa.co.jp/en/company/hereandthere/
		Using clnd_text
	Working on https://www.furukawa.co.jp/en/rd/vision.html
		Using clnd_text
Working on SolaBlock LLC
	Couldn't find any urls for firm!
Working on LIQUID X PRINTED METALS
	Working on http://www.liquid-x.com/about/
		Using boilerplate
Working on Arkema Inc.
	Working on https://www.arkema.com/en/arkema-group/profile/
		Using clnd_text
	Wor

		Using boilerplate
	Working on https://www.dolby.com/us/en/about/brand-identity.html
		Using boilerplate
Working on APPLIED STEMCELL
	Working on https://www.appliedstemcell.com/about-us
		Using clnd_text
Working on Ivoclar Vivadent AG
	Working on http://www.ivoclarvivadent.com/en/
		Using boilerplate
Working on Lawrence Livermore National Security
	Couldn't find any urls for firm!
Working on Silicon Storage Technology
	Working on http://sst.com/about-sst
		Using clnd_text
	Working on http://sst.com/about-sst/
		Using clnd_text
	Working on http://sst.com/about-sst/press-release
		Using clnd_text
	Working on http://sst.com/about-sst/corporate-overview
		Using clnd_text
Working on WAFERTECH
	Working on http://www.wafertech.com/en/csr/
		Using clnd_text
	Working on http://www.wafertech.com/en/csr/supply.html
		Using boilerplate
	Working on http://www.wafertech.com/en/about/
		Using clnd_text
	Working on http://www.wafertech.com/en/about/values.html
		Using boilerplate
	Working on http://w

		Using boilerplate
	Working on https://www.weatherford.com/en/about-us/resource-hub/
		Using boilerplate
Working on IDEALAB
	Couldn't find any urls for firm!
Working on Synaptic Research
	Couldn't find any urls for firm!
Working on Intermolecular
	Working on https://intermolecular.com/company/
		Using clnd_text
Working on Pentair Thermal Management LLC
	Working on https://www.nventthermal.com/about-us/index.aspx
		Using boilerplate
Working on Selecta Biosciences
	Working on https://selectabio.com/about/about-selecta/
		Using boilerplate
Working on Vaxiion Therapeutics
	Working on http://www.vaxiion.com/the-company.html
		Using boilerplate
Working on Kabushikikaisha Toshiba
	Working on http://www.toshiba.com/tai/about_us.jsp
		Using clnd_text
	Working on http://www.toshiba.com/tai/about_us_companies.jsp
		Using clnd_text
Working on Crossbar
	Working on https://crossbar-inc.com/en/company/about-crossbar/
		Using boilerplate
	Working on https://crossbar-inc.com/en/technology/reram-overvi

		Using clnd_text
	Working on https://www.redhat.com/en/about/privacy-policy
		Using boilerplate
Working on Bristol-Myers Squibb Company
	Working on https://www.bms.com/about-us.html
		Using boilerplate
	Working on https://www.bms.com/about-us/our-company/achievements.html
		Using boilerplate
	Working on https://www.bms.com/about-us/our-company/worldwide-facilities.html
		Using clnd_text
Working on AGFA-GEVAERT N.V.
	Working on http://agfa.com/corporate/about-us/history/
		Using boilerplate
	Working on http://agfa.com/corporate/about-us/technology/
		Using clnd_text
Working on Taiyo Ink Mfg. Co.
	Working on http://taiyo-hd.co.jp/en/about/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/group/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/history/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/overview/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/about/philosophy/
		Using clnd_text
	Working on http://taiyo-hd.co.jp/en/group/thou/
		Us

		Using clnd_text
Working on United Technologies Corporation
	Working on http://www.utc.com/Pages/Home.aspx
		Using clnd_text
	Working on http://www.utc.com/Who-We-Are/Pages/Key-Facts.aspx
		Using boilerplate
	Working on http://www.utc.com/Who-We-Are/Pages/Our-People.aspx
		Using boilerplate
	Working on http://www.utc.com/Who-We-Are/Pages/At-A-Glance.aspx
		Using boilerplate
	Working on http://www.utc.com/Investors/Pages/Stock-Split-History.aspx
		Using clnd_text
	Working on http://www.utc.com/Who-We-Are/Research-Center/Pages/default.aspx
		Using clnd_text
Working on EMC Corporation
	Working on https://www.emcins.com/aboutEMC/
		Using boilerplate
Working on Deployable Space Systems
	Working on https://www.dss-space.com/about-capabilities
		Using boilerplate
	Working on https://www.dss-space.com/about-company-profile
		Using boilerplate
Working on CyboEnergy
	Working on http://cyboenergy.com/company/awards.html
		Using boilerplate
	Working on http://cyboenergy.com/company/about_cyboener

		Using clnd_text
	Working on https://grampower.com/about-us/join-us.html
		Using boilerplate
	Working on https://king-electric.com/about/company-bio/
		Using clnd_text
	Working on https://grampower.com/about-us/about-us.html
		Using boilerplate
	Working on https://grampower.com/about-us/our-team.html
		Using boilerplate
	Working on https://www.capsugel.com/now-a-lonza-company
		Using clnd_text
	Working on http://www.perkinelmer.com/company/index.html
		Using clnd_text
	Working on http://daylightsolutions.com/about-us/awards/
		Using clnd_text
	Working on https://www.accessbusinessgroup.com/about-abg/
		Using boilerplate
	Working on https://www.gsk.com/en-gb/investors/about-gsk/
		Using clnd_text
	Working on https://www.bosch.com/research/about-research/
		Using boilerplate
	Working on https://www.bosch.com/research/about-research/roots/
		Using boilerplate
	Working on http://pidc.com/our-company/corporate-overview
		Using clnd_text
	Working on https://corporate.dow.com/en-us/about/loc

		Using clnd_text
	Working on https://www.omron.com/about/ir/
		Using clnd_text
	Working on https://www.omron.com/about/social/
		Using clnd_text
	Working on https://www.omron.com/about/annual/
		Using clnd_text
	Working on https://www.omron.com/about/history/
		Using clnd_text
	Working on https://www.omron.com/about/outline/
		Using clnd_text
	Working on https://www.omron.com/about/strategy/
		Using clnd_text
Working on AltaRock Energy
	Working on http://altarockenergy.com/about-us/
		Using boilerplate
Working on Alexion Pharmaceuticals
	Working on http://alexion.com/about-alexion-pharmaceuticals
		Using boilerplate
	Working on http://alexion.com/about-alexion-pharmaceuticals/history
		Using boilerplate
	Working on http://alexion.com/en/about-alexion-pharmaceuticals
		Using boilerplate
Working on Proterra Inc.
	Couldn't find any urls for firm!
Working on Magnachip Semiconductor
	Working on http://investors.magnachip.com/investor-overview
		Using clnd_text
Working on Interface Performa

Working on Nanosys
	Working on http://www.nanosysinc.com/who-we-are
		Using boilerplate
	Working on http://www.nanosysinc.com/who-we-are/
		Using boilerplate
Working on TMC Corporation
	Working on http://www.tcm.com/my-profile/index.html
		Using boilerplate
Working on Banpil Photonics
	Working on http://banpil.com/vmv.htm
		Using clnd_text
Working on Forest Concepts
	Working on http://forestconcepts.com/index.php?page=5000
		Using clnd_text
	Working on http://forestconcepts.com/index.php?page=05004
		Using boilerplate
	Working on http://forestconcepts.com/index.php?page=01000
		Using clnd_text
Working on iBio
	Working on https://ibioinc.com/index.php/company-background/disclaimer
		Using boilerplate
	Working on https://ibioinc.com/index.php/company-background/company-background-2
		Using clnd_text
Working on GE Healthcare Dharmacon
	Working on https://dharmacon.horizondiscovery.com/about-us/
		Using clnd_text
	Working on https://dharmacon.horizondiscovery.com/about-us/about-open-biosys

Working on MILLENIUM SYNTHFUELS CORPORATION
	Working on http://milleniumsynthfuels.com/about/
		Using boilerplate
Working on Inphenix
	Working on https://www.inphenix.com/en/about-us/
		Using boilerplate
Working on Gram Power
	Working on https://grampower.com/about-us/join-us.html
		Using boilerplate
	Working on https://grampower.com/about-us/about-us.html
		Using boilerplate
	Working on https://grampower.com/about-us/our-team.html
		Using boilerplate
Working on Bluestar Silicones France
	Working on https://silicones.elkem.com/EN/about-us
		Using boilerplate
	Working on https://silicones.elkem.com:443/EN/company/About_Us/Pages/home.aspx
		Using boilerplate
	Working on https://silicones.elkem.com/EN/company/About_Us/Pages/Our-values.aspx
		Using clnd_text
	Working on https://silicones.elkem.com/EN/company/Lists/Event_company/Dispform.aspx?ID=149
		Using boilerplate
	Working on https://silicones.elkem.com/EN/company/Lists/Event_company/Dispform.aspx?ID=152
		Using boilerplate
	Working on