### extract-pages-from-mongo
SanjayKAroraPhD@gmail.com <br>
November 2018

## Description
This notebook extracts groups of pages from MongoDB by firm_name to create firm-centric page output files that can later be topic modeled. In doing so, it removes repetitive content (e.g., repeated menu items) and garbage content (e.g., improperly parsed HTML code).
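
One way to picture the removal of repetitive content is to drop any line that appears on most pages of a site, since menu items tend to repeat everywhere. The sketch below is only an illustration under that assumption: `drop_repeated_lines`, its `max_share` threshold, and the sample pages are hypothetical and are not part of this notebook, which relies on boilerpipe instead.

```python
from collections import Counter

def drop_repeated_lines(pages, max_share=0.5):
    """Remove lines that occur on more than max_share of the pages.

    pages maps a URL to the list of text lines found on that page.
    (Hypothetical helper; the threshold is an assumption.)
    """
    counts = Counter()
    for lines in pages.values():
        # count each distinct line once per page
        counts.update(set(lines))
    limit = max_share * len(pages)
    return {url: [line for line in lines if counts[line] <= limit]
            for url, lines in pages.items()}

# made-up pages: "Home" and "About" are menu items repeated on every page
pages = {
    "/": ["Home", "About", "We build flywheels."],
    "/about": ["Home", "About", "Founded by engineers."],
    "/contact": ["Home", "About", "Email us."],
}
cleaned = drop_repeated_lines(pages)
print(cleaned["/"])
```

Counting each line once per page keeps a phrase that is repeated many times on a single page from being mistaken for site-wide boilerplate.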

## Change log
v3 adds the Python boilerpipe API for web page cleaning.

## TODO:
* Make better use of all pages in the site, e.g., to improve data quality and to use additional paragraph data found on non-home pages

In [2]:
# import data processing and other libraries
import csv
import sys
import requests
import os
import re
import pprint
import pymongo
import traceback
from time import sleep
import pandas as pd
import io
from IPython.display import display
import time
import numpy as np
from bs4 import BeautifulSoup
import string

In [5]:
from boilerpipe.extract import Extractor

In [6]:
MONGODB_DB = "FirmDB"
MONGODB_COLLECTION = "pages_co_urls"
CONNECTION_STRING = "mongodb://localhost"
username = "scrapy"
password = "eager"
authSource = "FirmDB"
authMechanism='SCRAM-SHA-1'

client = pymongo.MongoClient(CONNECTION_STRING, username=username, password=password, authSource=authSource, authMechanism=authMechanism)
db = client[MONGODB_DB]
col = db[MONGODB_COLLECTION]

DATA_DIR = '/home/eager/EAGER/data/orgs/workshop/depth0_boilerpipe/'

In [7]:
# gather unique firm_names from mongodb

def get_firm_aggregates ():
    query = [ { "$group": {"_id":"$firm_name" , "number":{"$sum":1}} } ]
    results = col.aggregate(query)

    mongo_dict = {}
    for result in results:
        key = (result['_id'])
        if key:
            mongo_dict[key[0]] = result['number']
        else:
            mongo_dict['NA'] = result['number']
    
    return mongo_dict

results_dict = get_firm_aggregates()
firm_names = results_dict.keys()
print (len(firm_names))
pp = pprint.PrettyPrinter()
pp.pprint(firm_names)

24
dict_keys(['Ticona LLC', 'Plant Sensory Systems', 'CyboEnergy', 'EMD Technologies Inc.', 'Ferro Corporation', 'NanoOncology', 'Genesco Inc.', 'Immunolight', 'Iogen Corporation', 'NA', 'ACACIA RESEARCH GROUP LLC', 'Kinetech Power Company LLC', 'Lux Bio Group', 'Pharmatrophix', 'Alliance for Sustainable Energy', 'Nanoquantum Sciences', 'Glucan Biorenewables LLC', 'Magnolia Solar', 'Ablexis', 'FULL CIRCLE BIOCHAR', 'GOAL ZERO LLC', 'Matrix Genetics', 'Gilead Connecticut', 'Mattson Technology'])
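
To make the grouping step easier to follow without a database, here is a small in-memory sketch of what the `$group` stage with `"$sum": 1` computes: one page count per firm, with pages lacking a firm name collected under `'NA'`. The documents below are invented, and `count_pages_by_firm` is a hypothetical stand-in rather than a PyMongo call. It assumes, as the cell above does, that `firm_name` is stored as a one-element list.

```python
from collections import Counter

# invented documents shaped like the scraped pages (fields hold one-element lists)
docs = [
    {"firm_name": ["Ablexis"], "url": ["http://a.example/"]},
    {"firm_name": ["Ablexis"], "url": ["http://a.example/about"]},
    {"firm_name": ["CyboEnergy"], "url": ["http://c.example/"]},
    {"url": ["http://unknown.example/"]},  # no firm_name recorded
]

def count_pages_by_firm(documents):
    """Count documents per firm name, mirroring the $group / $sum logic."""
    counts = Counter()
    for doc in documents:
        key = doc.get("firm_name")
        # a missing or empty firm_name is filed under 'NA', as in get_firm_aggregates
        counts[key[0] if key else "NA"] += 1
    return dict(counts)

firm_counts = count_pages_by_firm(docs)
print(firm_counts)
```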


In [8]:
# remove html content
def is_javascript (x):
    match_string = r"(CDATA|return\s+true|return\s+false|getelementbyid|function|\w+\(.*?\);|\w{2,}[\\.|:]+\w{2,}|'\w+':\s+'\w+|\\|{|}|\r|\n|\/\/')"
    # capture CDATA; function declarations; function calls; word sequences separated by a period (e.g., denoting paths)
    regex = re.findall(match_string, x)
    words = x.split()
    # an empty or whitespace-only string has no words; avoid dividing by zero
    if not words:
        return False
    # flag the text when the number of matches exceeds 10% of the word count
    return len(regex) / float(len(words)) > .10

def clean_page_content (text_list):
    # remove whatever we think is html
    removed_html = filter(lambda x: not( bool(BeautifulSoup(x, "html.parser").find()) ), text_list)
    # remove content that looks like javascript 
    removed_js = filter(lambda x: not (is_javascript(x)), removed_html)
    # add other checks here as needed

    # filter() is lazy in Python 3; return a list so the result can be stored and reused
    return list(removed_js)
    

# iterate through each firm, get all pages associated with a firm, and produce data structure
# url --> depth
#     --> content (list)
# return data structure
def process_firm (firm_name): 
    # match the firm name exactly (case-insensitive); escaping keeps characters such as "." literal
    regex = '^' + re.escape(firm_name) + '$'
    results = col.find( {"firm_name": re.compile(regex, re.IGNORECASE) } )
    firm_pages_dict = {}
    depth0_page_text = [] # home page
    for result in results:
        key = result['url'][0]
        if key:
            page_dict = {}
            depth = result['depth'][0]
            page_dict['depth'] = depth
            page_dict['domain'] = result['domain'][0]
            page_dict['firm_name'] = firm_name
            clnd_text = clean_page_content(result['full_text'])
            page_dict['clnd_text'] = clnd_text
            extractor = Extractor(extractor='DefaultExtractor', html = result['body'][0])
            page_dict['boilerpipe'] = extractor.getText()
            firm_pages_dict[key] = page_dict
            
            if depth == -1:
                # depth0_page_text = clnd_text
                depth0_page_text = page_dict['boilerpipe']
        else:
            continue
            
    return firm_pages_dict, depth0_page_text
# TODO: identify which pieces of content are common across all sites, and remove those
# def clean_content(firm_dict): 

In [9]:
# regex test 
regex = re.findall(r"(CDATA|return\s+true|return\s+false|getelementbyid|function|\w+\(.*?\);|\w{2,}[\\.|:]+\w{2,}|'\w+':\s+'\w+|\\|{|}|\r|\n|\/\/')", 
                   "CDATA function contact-us getelementbyid javascript.function linker:autoLink www.littlekidsinc.com fxnCall(param.param); email@dextr.us 'type': 'image' return true return false rev7bynlh\\u00252bvcgrjg\\ {height}") # the final tokens are word sequences separated by punctuation
print (regex)

['CDATA', 'function', 'getelementbyid', 'javascript.function', 'linker:autoLink', 'www.littlekidsinc', 'fxnCall(param.param);', 'dextr.us', "'type': 'image", 'return true', 'return false', 'rev7bynlh\\u00252bvcgrjg', '\\', '{', '}']
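
The regex test above lists individual matches. The filter itself depends on the ratio of matches to words, so it may help to see that ratio on two short inputs. This is a sketch only: `JS_PATTERN` copies the pattern from `is_javascript`, while `js_ratio` and the two sample strings are hypothetical additions for illustration.

```python
import re

# same pattern as in is_javascript
JS_PATTERN = r"(CDATA|return\s+true|return\s+false|getelementbyid|function|\w+\(.*?\);|\w{2,}[\\.|:]+\w{2,}|'\w+':\s+'\w+|\\|{|}|\r|\n|\/\/')"

def js_ratio(text):
    """Return the number of pattern matches divided by the number of words."""
    words = text.split()
    if not words:
        # no words means nothing to classify
        return 0.0
    return len(re.findall(JS_PATTERN, text)) / len(words)

prose = "We strongly believe energy storage to be essential for more widespread renewable energy adoption"
script = "function init() { return true; }"

for sample in (prose, script):
    # anything above 0.10 would be treated as javascript by the filter
    print(round(js_ratio(sample), 2), js_ratio(sample) > 0.10)
```

Plain prose yields no matches, while the short script matches on `function`, the braces, and `return true`, which puts it well above the 10% threshold.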


In [10]:
firm_pages_dict, depth0_page_text = process_firm ("Kinetech Power Company LLC")
print (depth0_page_text)

What We Do
Kinetech Power Systems (KPS) has developed a low-cost, flexible duration  - long or short - flywheel energy storage system (FESS), also known as a mechanical battery, that provides non-toxic, environmentally friendly power for up to 30 years with little maintenance required.
How We Do It
Constant innovation and hard work are behind our FESS. We've managed to create our own technology and designs for our high speed hybrid composite flywheel system, which allows us to drive our costs down without sacrificing efficiency. 
Why We Do It
We strongly believe energy storage to be essential for more widespread renewable energy adoption. We are committed to the development of technology that can put us all one step closer to a clean future for all generations to come.
Our Core Technology and Key Advantages
Innovative FESS Design
Our patented FESS design has been tested and proven to provide long duration energy delivery in several prototype systems. KPS is constantly developing new in

In [11]:
print (DATA_DIR)

/home/eager/EAGER/data/orgs/workshop/depth0_boilerpipe/


In [12]:
# run
pp = pprint.PrettyPrinter()
for firm_name in firm_names: 
    print ("Working on " + firm_name)
    firm_pages_dict, depth0_page_text = process_firm (firm_name)
    # pp.pprint(depth0_page_text)
    if depth0_page_text: 
        # replace "." and "/" so the firm name is safe to use as a file name
        file_name = re.sub(r'\.|\/', '_', firm_name) + '.txt'
        with io.open(os.path.join(DATA_DIR, file_name), 'w', encoding='utf8') as f:
            f.write(depth0_page_text)

Working on Ticona LLC
Working on Plant Sensory Systems
Working on CyboEnergy
Working on EMD Technologies Inc.
Working on Ferro Corporation
Working on NanoOncology
Working on Genesco Inc.
Working on Immunolight
Working on Iogen Corporation
Working on NA
Working on ACACIA RESEARCH GROUP LLC
Working on Kinetech Power Company LLC
Working on Lux Bio Group
Working on Pharmatrophix
Working on Alliance for Sustainable Energy
Working on Nanoquantum Sciences
Working on Glucan Biorenewables LLC
Working on Magnolia Solar
Working on Ablexis
Working on FULL CIRCLE BIOCHAR
Working on GOAL ZERO LLC
Working on Matrix Genetics
Working on Gilead Connecticut
Working on Mattson Technology
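
The output file for each firm is named after the firm, with periods and slashes replaced by underscores. The sketch below shows that mapping in isolation. `firm_output_path` and the `/tmp/out` directory are hypothetical names used only for this illustration.

```python
import os
import re

def firm_output_path(data_dir, firm_name):
    """Build the .txt path for a firm, replacing '.' and '/' with '_'."""
    safe_name = re.sub(r"[./]", "_", firm_name)
    return os.path.join(data_dir, safe_name + ".txt")

path = firm_output_path("/tmp/out", "EMD Technologies Inc.")
print(os.path.basename(path))
```

Here the trailing period in the firm name becomes an underscore, so the file is written as `EMD Technologies Inc_.txt`.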
