## Post HTML Featurizer

A short notebook that takes the URL of a post or article and builds a few raw HTML tables, which would then be loaded back into a database for downstream feature extraction.

(Note: not particularly fast, fancy, or optimized in any way.)

### Set Up

In [9]:
# encoding=utf-8

##########################################
## IMPORTS & SETUP
##########################################

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from collections import OrderedDict, Counter
import json

For this example, just create a list of a few sample posts. In the real world this would be a feed of the most recent posts to be processed.

In [10]:
# get a sample of post urls to scrape
post_urls = ['http://hollywoodlife.com/2018/01/20/ewan-mcgregor-wife-eve-mavrakis-divorce-disappointing/',
            'http://variety.com/2018/film/news/producers-guild-awards-pga-2017-winners-list-1202671339/',
            'https://www.rollingstone.com/culture/features/who-owns-the-womens-march-w515597',
            'http://deadline.com/2018/01/donald-trump-womens-march-tweet-los-angeles-new-york-seattle-1202264495/'
            ]

### Get & Parse HTML

In [11]:
# create some DataFrames to collect results into
df_out = pd.DataFrame()
df_out_links = pd.DataFrame()
df_out_meta = pd.DataFrame()

##########################################
## LOOP OVER POSTS
##########################################

for post_url in post_urls:

    ##########################################
    ## PARSE POST
    ##########################################

    # try to parse each post; just move on if it fails
    try:

        # read and parse the html
        page = urlopen(post_url).read()
        soup = BeautifulSoup(page, "lxml")

        # get raw html in case need it later for anything else
        raw_html = soup.prettify()
        #print(raw_html)

        # get all tags found and their counts
        tag_counts_json = json.dumps(Counter([str(tag.name) for tag in soup.find_all()]))

        # append info to df_out, one row per post
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        df_out = pd.concat([df_out, pd.DataFrame([{
            'post_url': post_url,
            'raw_html': raw_html,
            'tag_counts_json': tag_counts_json
        }])])

        # get all links in post
        links = []
        for link in soup.find_all(href=True):
            links.append(link['href'])

        # count link occurrences
        links_counter = dict(Counter(links))

        # now add each link found to df_out_links
        for k in links_counter:
            found_link = k
            found_link_count = links_counter[k]
            # pd.concat is used because DataFrame.append was removed in pandas 2.0
            df_out_links = pd.concat([df_out_links, pd.DataFrame([{
                'post_url': post_url,
                'found_link': found_link,
                'found_link_count': found_link_count,
            }])], ignore_index=True)

        # get all meta tags in the post as key/value style info
        for meta in soup.find_all('meta'):

            # get meta info
            meta_name = meta.get('name')
            meta_property = meta.get('property')
            meta_class = meta.get('class')
            meta_content = meta.get('content')
            
            # append each meta tag onto df_out_meta
            # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
            df_out_meta = pd.concat([df_out_meta, pd.DataFrame([{
                'post_url': post_url,
                'meta_name': str(meta_name),
                'meta_property': str(meta_property),
                'meta_class': str(meta_class),
                'meta_content': str(meta_content),
            }])], ignore_index=True)

    # if a post fails, print the error and move on; occasional failures are expected
    except Exception as e:
        print(e)
    
# reorder cols
df_out = df_out[['post_url','tag_counts_json','raw_html']]
df_out_links = df_out_links[['post_url','found_link','found_link_count']]
df_out_meta = df_out_meta[['post_url','meta_name','meta_property','meta_class','meta_content']]

### df_out

One row per post: the raw HTML for later use, plus a JSON object of counts by tag name that can be pulled into specific features later.

In [12]:
df_out.head()

Unnamed: 0,post_url,tag_counts_json,raw_html
0,http://hollywoodlife.com/2018/01/20/ewan-mcgre...,"{""html"": 1, ""head"": 1, ""meta"": 33, ""link"": 64,...","<!DOCTYPE html>\n<html lang=""en"">\n <head>\n ..."
0,http://variety.com/2018/film/news/producers-gu...,"{""html"": 1, ""head"": 1, ""meta"": 50, ""link"": 67,...","<!DOCTYPE html>\n<!--[if IE 6]>\n<html id=""ie6..."
0,https://www.rollingstone.com/culture/features/...,"{""html"": 1, ""head"": 1, ""script"": 25, ""meta"": 2...","<!DOCTYPE html>\n<html class=""no-js"" lang="""">\..."
0,http://deadline.com/2018/01/donald-trump-women...,"{""html"": 1, ""head"": 1, ""meta"": 48, ""title"": 1,...","<!DOCTYPE html>\n<html lang=""en"">\n <head>\n ..."
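
As a rough sketch of how `tag_counts_json` might be pulled into features later, the JSON strings can be parsed and spread into one column per tag. The frame `df_demo`, its tiny made-up counts, and the `n_tag_` prefix are all hypothetical and stand in for the real `df_out`.

```python
import json

import pandas as pd

# Hypothetical stand-in for df_out, with small made-up tag counts.
df_demo = pd.DataFrame({
    "post_url": ["http://example.com/a", "http://example.com/b"],
    "tag_counts_json": ['{"a": 12, "img": 3}', '{"a": 5, "script": 7}'],
})

# Parse each JSON string into a dict; pandas aligns the keys as columns.
tag_counts = pd.DataFrame(
    [json.loads(s) for s in df_demo["tag_counts_json"]],
    index=df_demo.index,
)

# A tag that is missing from a post means zero occurrences.
tag_features = tag_counts.fillna(0).astype(int).add_prefix("n_tag_")

# One row per post, one integer column per tag name.
df_features = pd.concat([df_demo[["post_url"]], tag_features], axis=1)
print(df_features)
```

On the real `df_out` it would probably be worth calling `reset_index(drop=True)` first, since every row in the output above carries index 0 and the column-wise concat relies on a unique index.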


### df_out_links

For each link in the post, get a count. This can be useful for finding posts that link to Twitter or other specific domains.

In [13]:
df_out_links.sample(25)

Unnamed: 0,post_url,found_link,found_link_count
238,http://variety.com/2018/film/news/producers-gu...,//load.s3.amazonaws.com,1.0
551,http://deadline.com/2018/01/donald-trump-women...,http://tvline.com/2018/01/21/snl-recap-stormy-...,1.0
412,https://www.rollingstone.com/culture/features/...,/hip-hop,2.0
482,http://deadline.com/2018/01/donald-trump-women...,http://fonts.googleapis.com/css?family=Open+Sa...,1.0
286,http://variety.com/2018/film/news/producers-gu...,https://www.facebook.com/sharer.php?u=http%3A%...,1.0
95,http://hollywoodlife.com/2018/01/20/ewan-mcgre...,http://hollywoodlife.com/celeb/rihanna/,2.0
424,https://www.rollingstone.com/culture/features/...,/movies/reviews,1.0
218,http://variety.com/2018/film/news/producers-gu...,http://0.gravatar.com/blavatar/8181b523e3c891b...,1.0
362,http://variety.com/2018/film/news/producers-gu...,http://variety.com/c/in-contention/,1.0
147,http://hollywoodlife.com/2018/01/20/ewan-mcgre...,http://www.reddit.com/submit?url=http://hollyw...,1.0
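
One possible way to get from raw links to domains, sketched with the standard library only: resolve relative and protocol-relative links against the post URL, then count hosts. The helper `link_domain`, the sample post URL, and the sample links are hypothetical; the notebook itself does not do this step.

```python
from collections import Counter
from urllib.parse import urljoin, urlparse


def link_domain(post_url, found_link):
    """Return the lowercase host a link points to, resolving relative links."""
    absolute = urljoin(post_url, found_link)
    return urlparse(absolute).netloc.lower()


# Hypothetical post and links, shaped like the rows in df_out_links.
post_url = "https://www.rollingstone.com/culture/features/some-post"
found_links = [
    "/hip-hop",                          # relative to the post's own site
    "//load.s3.amazonaws.com",           # protocol-relative
    "https://twitter.com/RollingStone",
    "https://twitter.com/share?url=x",
]

# Count how many links point at each domain.
domain_counts = Counter(link_domain(post_url, link) for link in found_links)

# Simple flag feature: does the post link to Twitter at all?
links_to_twitter = domain_counts["twitter.com"] > 0
print(domain_counts, links_to_twitter)
```

This ignores `found_link_count`; weighting each domain by that count would be a small extension.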


### df_out_meta

Capture the content value and key attributes of every meta tag in the post. These can be useful for measuring the extent to which the post is optimized for SEO.

In [14]:
df_out_meta.sample(25)

Unnamed: 0,post_url,meta_name,meta_property,meta_class,meta_content
84,https://www.rollingstone.com/culture/features/...,,,,ie=edge
118,http://deadline.com/2018/01/donald-trump-women...,news_keywords,,,"Donald Trump, Women's March"
135,http://deadline.com/2018/01/donald-trump-women...,twitter:creator,,,@GregEvans5
105,https://www.rollingstone.com/culture/features/...,twitter:site,,,@RollingStone
91,https://www.rollingstone.com/culture/features/...,,fb:app_id,,144417125962063
101,https://www.rollingstone.com/culture/features/...,,og:image,,http://img.wennermedia.com/social/h_14917462-f...
141,http://deadline.com/2018/01/donald-trump-women...,title,,['swiftype'],"Donald Trump Mansplains Women's March, Takes C..."
58,http://variety.com/2018/film/news/producers-gu...,twitter:title,,,‘The Shape of Water’ Wins Producers Guild Awar...
38,http://variety.com/2018/film/news/producers-gu...,,fb:admins,,697514199
5,http://hollywoodlife.com/2018/01/20/ewan-mcgre...,generator,,,WordPress.com
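
As an illustration of an SEO-style feature, the meta rows could be reduced to per-post counts of Open Graph tags (`og:` in `meta_property`) and Twitter Card tags (`twitter:` in `meta_name`). This is only a sketch: `df_demo_meta` and its values are hypothetical, and it assumes missing attributes are stored as the string `"None"`, which is what `str(meta.get(...))` produces.

```python
import pandas as pd

# Hypothetical stand-in for df_out_meta; "None" marks a missing attribute.
df_demo_meta = pd.DataFrame({
    "post_url": ["p1", "p1", "p1", "p2"],
    "meta_name": ["twitter:site", "None", "description", "generator"],
    "meta_property": ["None", "og:image", "None", "None"],
})

# Flag rows that look like Open Graph or Twitter Card tags.
is_og = df_demo_meta["meta_property"].str.startswith("og:")
is_twitter = df_demo_meta["meta_name"].str.startswith("twitter:")

# Sum the boolean flags per post to get tag counts.
seo_features = (
    df_demo_meta.assign(is_og=is_og, is_twitter=is_twitter)
    .groupby("post_url")[["is_og", "is_twitter"]]
    .sum()
    .rename(columns={"is_og": "n_og_tags", "is_twitter": "n_twitter_tags"})
)
print(seo_features)
```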


In a real-life setting the DataFrames would be streamed into a database for longer-term storage and further downstream feature processing.
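
A minimal sketch of that load step, assuming SQLite as the target purely for illustration (the notebook does not say which database is used). The table name `post_links` and the demo rows are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical rows shaped like df_out_links.
df_demo_links = pd.DataFrame({
    "post_url": ["http://example.com/a", "http://example.com/a"],
    "found_link": ["/about", "https://twitter.com/example"],
    "found_link_count": [1, 2],
})

# In-memory database for the sketch; a real run would point at a file or server.
conn = sqlite3.connect(":memory:")

# if_exists="append" lets each new batch of posts be added to the same table.
df_demo_links.to_sql("post_links", conn, if_exists="append", index=False)

# Read the rows back to confirm the load.
n_rows = conn.execute("SELECT COUNT(*) FROM post_links").fetchone()[0]
round_trip = pd.read_sql_query("SELECT * FROM post_links", conn)
conn.close()
print(n_rows)
```

The same pattern would apply to `df_out` and `df_out_meta`, each going to its own table.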

That is all. The plan is to see which features extracted from the raw HTML of posts have predictive power in modelling the number of pageviews a post is expected to get in its first 7 days.