The goal here is to write a function, snapshot(), that obtains a company's quarterly earnings report from its investor relations (IR) website and outputs the metrics we care about (a "snapshot" of the report). In many cases, these metrics are the company's earnings per share (EPS) and revenue for the current quarter, plus the company's estimates of those metrics for the next quarter, known as "guidance". For many companies, however, other metrics matter in addition to these, or in some cases instead of them. Furthermore, there is very little uniformity in how companies structure the reports they post on their IR websites. Thus, we have our work cut out for us.

To start, we will try to parse the release of Netflix (NFLX). We are primarily concerned with identifying GAAP EPS and revenue along with guidance for these metrics for next quarter. For NFLX, we are also concerned with identifying net streaming adds.

The first task is to obtain the reports from the websites. In practice, we will want the program running about a minute before the expected release time, refreshing the page every tenth of a second or so, so that it has the report text the moment the website publishes it. Reports are usually released as PDFs, although NVDA publishes its results as an HTML press release, so we may have to account for that possibility.
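
The refresh-until-released idea can be sketched as a small polling helper. This is only a sketch written for Python 3 (the notebook cells are Python 2); `poll_until`, its parameters, and the `fetch` callback are hypothetical names, not part of the notebook. `fetch` stands in for "request the IR page and return the report link, or None if it is not posted yet".

```python
import time


def poll_until(fetch, interval=0.1, timeout=120.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() every `interval` seconds until it returns a non-None value.

    Returns that value, or None if `timeout` seconds pass first.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            return None  # give up: the report never appeared
        sleep(interval)
```

A timeout keeps the loop from hammering the site indefinitely if the release is delayed; the 0.1 second default mirrors the refresh rate mentioned above.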

Most companies structure their reports as a narrative discussion of the metrics followed by tables of comprehensive figures. We will probably want to separate the two so that each can be parsed on its own.
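
One possible way to separate narrative from tables, sketched for Python 3 (the notebook cells are Python 2). It assumes each page has already been turned into rows of text cells, and it uses a heuristic that is purely an assumption here: a row counts as tabular when a label cell is followed by at least one numeric-looking cell. The names `NUMERIC_CELL`, `is_table_row`, and `split_rows` are hypothetical.

```python
import re

# Matches cells such as "1,234", "0.12", "(1,234)", "$ 5" or "12%"
NUMERIC_CELL = re.compile(r"^\(?\$?\s*[\d,]+(\.\d+)?\s*%?\)?$")


def is_table_row(row):
    """Heuristic: a label cell followed by at least one numeric cell."""
    return len(row) >= 2 and any(
        NUMERIC_CELL.match(cell.strip()) for cell in row[1:]
    )


def split_rows(rows):
    """Return (narrative_rows, table_rows); blank rows are dropped."""
    narrative, table = [], []
    for row in rows:
        if not row:
            continue
        (table if is_table_row(row) else narrative).append(row)
    return narrative, table
```

The heuristic will misfile some rows, for example narrative text that happens to share a line with chart labels, so it is a starting point rather than a reliable classifier.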


TODO:
    
    - Write the get_* functions that refresh the quarterly results IR page and download the file
        - get_nflx
        - get_amzn
        - get_twtr
        - get_tsla
        - get_aapl
        
    - Write the parsers for each company that pull the information we want from the tables and collect the paragraphs containing keywords
        - nflx_parser
        - amzn_parser
        - twtr_parser
        - tsla_parser
        - aapl_parser
        
    - Wrap up the notebook so that it is usable from command line
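
For the command-line item, a minimal sketch with `argparse`, written for Python 3 (the notebook cells are Python 2). The option names (`--pdf`, `--interval`) and the `main` function are assumptions made for illustration; the actual download and parse calls are left as a comment because they depend on the functions still on the TODO list.

```python
import argparse

SUPPORTED = ["NFLX", "AMZN", "TWTR", "TSLA", "AAPL"]


def build_arg_parser():
    parser = argparse.ArgumentParser(
        description="Print a snapshot of a quarterly earnings report.")
    parser.add_argument("ticker", choices=SUPPORTED,
                        help="company whose report should be parsed")
    parser.add_argument("--pdf", default=None,
                        help="parse this local PDF instead of downloading one")
    parser.add_argument("--interval", type=float, default=0.1,
                        help="seconds between page refreshes")
    return parser


def main(argv=None):
    args = build_arg_parser().parse_args(argv)
    # Default file name follows the notebook's convention (nflx.pdf, amzn.pdf)
    pdf_path = args.pdf or args.ticker.lower() + ".pdf"
    # The notebook's get_<ticker>() and parse_pages() calls would go here.
    return args.ticker, pdf_path, args.interval
```

Calling `main()` with no arguments reads `sys.argv`, so the same function serves both as the script entry point and as something that can be called with an explicit argument list.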

In [2]:
import itertools
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

import math
import requests
import time
import json
from bs4 import BeautifulSoup
import sys

import pdfminer
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.converter import PDFPageAggregator


#Get the PDFs

In [3]:
def get_nflx(link_dict):
    # Poll the IR page once a second until the shareholder letter is linked
    link = None
    while link is None:
        page = requests.get(link_dict['NFLX'])
        soup = BeautifulSoup(page.text, 'html.parser')
        # search the first 'accBody' div for the letter's link
        q3_html = soup.find_all('div', {'class': 'accBody'})[0]
        for doc in q3_html.find_all('a'):
            if doc.text == 'Q316 Letter to shareholders':
                link = doc['href']
                break
        else:
            # not posted yet: wait before refreshing
            time.sleep(1)
    link = 'https://ir.netflix.com/' + link
    pdfile = requests.get(link)
    with open('nflx.pdf', 'wb') as f:
        f.write(pdfile.content)

In [101]:
def get_amzn(link_dict):
    # Poll the IR page once a second until the results PDF is linked
    link = None
    while link is None:
        page = requests.get(link_dict['AMZN'])
        soup = BeautifulSoup(page.text, 'html.parser')
        # search the first 'a-section article-copy' div for the results link
        q3_html = soup.find_all('div', {'class': 'a-section article-copy'})[0]
        for doc in q3_html.find_all('a'):
            if doc.text == 'Q3 2016 Financial Results':
                link = doc['href']
                break
        else:
            # not posted yet: wait before refreshing
            time.sleep(1)
    pdfile = requests.get(link)
    with open('amzn.pdf', 'wb') as f:
        f.write(pdfile.content)
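
The two getters differ only in where they look and which link text they match, so the link search could be factored out. The sketch below is written for Python 3 (the notebook cells are Python 2) and deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it has no dependencies. `find_link` and `_LinkCollector` are hypothetical names. It also uses `urljoin` in place of string concatenation, so relative and absolute hrefs are both handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collects (link text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_link(html, page_url, link_text):
    """Return the absolute URL of the link labelled `link_text`, or None."""
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()
    for text, href in collector.links:
        if text == link_text:
            return urljoin(page_url, href)
    return None
```

Unlike the getters above, this searches the whole page rather than one div, and it strips whitespace around the link text before comparing.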


#Code to parse the PDFs, extract tables

In [102]:
def extract_layout_by_page(pdf_path):
    """
    Extracts LTPage objects from a pdf file.
    
    slightly modified from
    https://euske.github.io/pdfminer/programming.html
    """
    laparams = LAParams()
    layouts = []

    # keep the file open while pages are processed, then close it
    with open(pdf_path, 'rb') as fp:
        parser = PDFParser(fp)
        document = PDFDocument(parser)

        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed

        rsrcmgr = PDFResourceManager()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layouts.append(device.get_result())

    return layouts

TEXT_ELEMENTS = [
    pdfminer.layout.LTTextBox,
    pdfminer.layout.LTTextBoxHorizontal,
    pdfminer.layout.LTTextLine,
    pdfminer.layout.LTTextLineHorizontal
]

def flatten(lst):
    """Flattens a list of lists"""
    return [subelem for elem in lst for subelem in elem]


def extract_characters(element):
    """
    Recursively extracts individual characters from 
    text elements. 
    """
    if isinstance(element, pdfminer.layout.LTChar):
        return [element]

    if any(isinstance(element, i) for i in TEXT_ELEMENTS):
        return flatten([extract_characters(e) for e in element])

    if isinstance(element, list):
        return flatten([extract_characters(l) for l in element])

    return []

def does_it_intersect(x, (xmin, xmax)):
    return (x <= xmax and x >= xmin)

def convert_to_rows(characters):
    x_limit = 10
    y_limit = 5
    paragraph_limit = 20

    rows = []
    row = []
    cell = ""
    prior_x = None
    prior_y = None

    y_s = []  # y position of each row found so far
    x_s = []  # x positions of the cells in each row
    for c in characters:
        c_x, c_y = math.floor((c.bbox[0] + c.bbox[2]) / 2), math.floor((c.bbox[1] + c.bbox[3]) / 2)
        if prior_x is not None and not (c_x - prior_x <= x_limit and abs(c_y - prior_y) <= y_limit):
            if abs(c_y - prior_y) > y_limit:
                row.append(cell)

                # find the right row
                for i in xrange(len(rows)):
                    if abs(y_s[i] - prior_y) <= y_limit:
                        for j in xrange(len(x_s[i])):
                            if prior_x < x_s[i][j]:
                                rows[i] = rows[i][:j] + row + rows[i][j:]
                                x_s[i] = x_s[i][:j] + [prior_x] + x_s[i][j:]
                                break
                        else:
                            rows[i] += row
                            x_s[i].append(prior_x)
                            break
                        break
                else:
                    rows.append(row)
                    y_s.append(prior_y)
                    x_s.append([prior_x])

                cell = ""
                row = []
            elif c_x - prior_x > x_limit:
                row.append(cell)
                cell = ""

        cell += c.get_text()
        prior_x = c_x
        prior_y = c_y

    # handle the last row
    row.append(cell)
    for i in xrange(len(rows)):
        if abs(y_s[i] - prior_y) <= y_limit:
            for j in xrange(len(x_s[i])):
                if prior_x < x_s[i][j]:
                    rows[i] = rows[i][:j] + row + rows[i][j:]
                    x_s[i] = x_s[i][:j] + [prior_x] + x_s[i][j:]
                    break
            else:
                rows[i] += row
                x_s[i].append(prior_x)
                break
            break
    else:
        rows.append(row)
        y_s.append(prior_y)
        x_s.append([prior_x])
        
    # insert blank rows between particularly separated lines
    for i in xrange(len(y_s) - 2, -1, -1):
        if abs(y_s[i] - y_s[i+1]) > paragraph_limit:
            rows = rows[:i+1] + [[]] + rows[i+1:]
    
    return rows
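
The core idea of convert_to_rows, stripped of its bookkeeping, is: characters whose vertical centres lie within y_limit of each other share a row, and a horizontal gap larger than x_limit starts a new cell. The sketch below illustrates that idea on plain `(text, x, y)` tuples instead of pdfminer objects. It is written for Python 3, the name `group_into_rows` is hypothetical, and unlike the function above it sorts rows top to bottom and does not insert blank paragraph rows.

```python
def group_into_rows(chars, x_limit=10, y_limit=5):
    """Group (text, x, y) tuples into rows of cells.

    x and y are the centre coordinates of each character.
    """
    lines = []  # each entry: [row_y, [(x, text), ...]]
    for text, x, y in chars:
        for line in lines:
            if abs(line[0] - y) <= y_limit:
                line[1].append((x, text))
                break
        else:
            lines.append([y, [(x, text)]])

    # PDF y coordinates grow upward, so larger y means higher on the page
    lines.sort(key=lambda line: -line[0])

    rows = []
    for _, items in lines:
        items.sort(key=lambda item: item[0])  # left to right
        cells, prev_x = [], None
        for x, text in items:
            if prev_x is None or x - prev_x > x_limit:
                cells.append(text)   # large gap: start a new cell
            else:
                cells[-1] += text    # same cell: append the character
            prev_x = x
        rows.append(cells)
    return rows
```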

#Page Parsers 
###Company specific

functions:
- parse_pages: takes the pdf files and converts them into analysable format
- parser: company specific parser that retrieves info from document text
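
One way the company-specific parsers could eventually plug into snapshot() is a small registry keyed by ticker, with each parser returning a dict instead of printing. This is a sketch under that assumption, written for Python 3 (the notebook cells are Python 2). `PARSERS`, `register`, and `demo_parser` are hypothetical, and this `snapshot` takes already-parsed pages rather than downloading anything.

```python
PARSERS = {}


def register(ticker):
    """Decorator that files a parser function under its ticker symbol."""
    def decorator(func):
        PARSERS[ticker] = func
        return func
    return decorator


@register("DEMO")
def demo_parser(pages):
    """Hypothetical parser: list the first cell of each non-empty row on page 1."""
    return {"labels": [row[0] for row in pages[0] if row]}


def snapshot(ticker, pages):
    """Look up the parser for `ticker` and run it on the parsed pages."""
    try:
        parser = PARSERS[ticker]
    except KeyError:
        raise ValueError("no parser registered for " + ticker)
    return parser(pages)
```

Returning a dict rather than printing would make the results easier to test and to reuse from a command-line wrapper.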

In [103]:
def parse_pages(url, parser):
    
    page_layouts = extract_layout_by_page(url)
    #objects_on_page = set(type(o) for o in page_layouts[3])

    pages = []
    for i in xrange(len(page_layouts)):
        current_page = page_layouts[i]

        texts = []

        # keep only the horizontal text boxes; other layout elements are skipped
        for e in current_page:
            if isinstance(e, pdfminer.layout.LTTextBoxHorizontal):
                texts.append(e)

        # break the text boxes into characters and group them into rows
        characters = extract_characters(texts)
        pages.append(convert_to_rows(characters))
    parser(pages)           


In [152]:
def nflx_parser(pages):
    for page in pages:
        # income statement page: the second row holds the table title
        if len(page) > 1 and len(page[1]) > 0 and "Consolidated Statements of Operations " == page[1][0]:
            for row in page:
                if len(row) > 0 and row[0] == "Revenues":
                    # appending ",000" assumes the table is reported in thousands
                    print("Revenue: " + row[2] + ",000")
                # we want the first Basic in the table
                elif len(row) > 0 and row[0] == "Basic":
                    print("Basic EPS: " + row[2])
                    break
        # shareholder letter page: print the paragraph that mentions net adds
        if len(page) > 1 and len(page[0]) > 0 and "Q3 Results" in page[0][0]:
            last_blank = -1  # index of the most recent blank row (paragraph break)
            for idx in range(len(page)):
                row = page[idx]
                if not row:
                    last_blank = idx
                if len(row) > 0 and any("global net adds" in s for s in row):
                    paragraph = []
                    nest_id = last_blank + 1
                    # collect rows up to the next blank row or the end of the page
                    while nest_id < len(page) and page[nest_id] != []:
                        paragraph.extend(page[nest_id])
                        nest_id = nest_id + 1
                    print("".join(paragraph))

In [147]:
def amzn_parser(pages):
    # examine the first page, then recurse on the remaining pages
    page = pages[0]
    tail = pages[1:]
    flattened = list(itertools.chain.from_iterable(page))
    if any("Consolidated Statements of Operations" in s for s in flattened):
        index1 = flattened.index('Total net sales ') + 1
        print("Revenue: " + flattened[index1] + ",000,000")
        index2 = flattened.index('Basic earnings per share ') + 2
        print("Basic EPS: " + flattened[index2])
        # only recurse if pages remain; pages[0] would fail on an empty list
        if tail:
            amzn_parser(tail)
    elif any("Segment Information" in s for s in flattened):
        # no recursion here: parsing stops once the segment page is found
        index3 = flattened.index('Net sales ') + 2
        print("AWS Rev: " + flattened[index3] + ',000,000')
    elif any("Fourth Quarter 2016 Guidance" in s for s in flattened):
        idx = flattened.index('Fourth Quarter 2016 Guidance ')
        paragraph = []
        # collect cells until the ' ' cell that ends the paragraph (or the page ends)
        while idx < len(flattened) and flattened[idx] != ' ':
            paragraph.append(flattened[idx])
            idx = idx + 1
        print("".join(paragraph))
        if tail:
            amzn_parser(tail)
    else:
        if tail != []:
            amzn_parser(tail)
        else:
            print "No Data"
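
The parsers above print table cells as strings and append ",000" or ",000,000" to indicate the scale. If numeric values are wanted instead, a cell can be converted along these lines. This is a sketch written for Python 3 (the notebook cells are Python 2); `parse_amount` is a hypothetical helper, and the treatment of parentheses as negatives assumes the usual financial-statement convention.

```python
def parse_amount(cell, scale=1):
    """Convert a table cell such as "1,234", "$ 0.52" or "(1,234)" to a float.

    `scale` multiplies the result, e.g. 1e6 for a table stated in millions.
    Parentheses are read as a negative number.
    """
    text = cell.strip().replace("$", "").replace(",", "").strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    value = float(text) * scale
    return -value if negative else value
```

For example, a cell "1,234" from a table stated in millions would be read with `parse_amount("1,234", scale=1e6)`.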

In [None]:
def twtr_parser(pages):
    page = pages[0]
    tail = pages[1:]
    flattened = list(itertools.chain.from_iterable(page))
    if any("Consolidated Statements of Operations" in s for s in flattened):
        # concatenate the two cells that follow each label
        index1 = flattened.index('Revenue ') + 1
        print("Revenue: " + flattened[index1] + flattened[index1 + 1] + ",000")
        index2 = flattened.index('Basic and diluted ') + 1
        print("EPS: " + flattened[index2] + flattened[index2 + 1])
    elif any("MAUs" in s for s in flattened):
        # TODO: extract the MAU figures (left unfinished)
        pass


#Put It All Together

In [151]:
tix = ['TWTR', 'TSLA', 'NFLX', 'AMZN', 'AAPL']
# no AAPL link is listed yet; zip() stops at the shorter list, so ir_dict has no 'AAPL' key
links = ['https://investor.twitterinc.com/index.cfm', 'http://ir.tesla.com/', 'https://ir.netflix.com/results.cfm', 'http://phx.corporate-ir.net/phoenix.zhtml?c=97664&p=irol-reportsOther']

ir_dict = dict(zip(tix, links))
        
#get_nflx(ir_dict)
#parse_pages('nflx.pdf', nflx_parser)

get_amzn(ir_dict)
parse_pages('amzn.pdf', amzn_parser)

Fourth Quarter 2016 Guidance •  Net sales are expected to be between $42.0 billion and $45.5 billion, or to grow between 17% and 27% compared with fourth quarter 2015. This guidance anticipates approximately 60 basis points of favorable impact from foreign exchange rates. •  Operating income is expected to be between $0 and $1.25 billion, compared with $1.1 billion in fourth quarter 2015. •  This guidance assumes, among other things, that no additional business acquisitions, investments, restructurings, or legal settlements are concluded. A conference call will be webcast live today at 2:30 p.m. PT/5:30 p.m. ET, and will be available for at least three months at www.amazon.com/ir. This call will contain forward-looking statements and other material information regarding the Company’s financial and operating results. These forward-looking statements are inherently difficult to predict. Actual results could differ materially for a variety of reasons, including, in addition to the factors
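
Once the guidance paragraph is isolated, the ranges themselves can be pulled out with a regular expression. The sketch below is written for Python 3 (the notebook cells are Python 2) and is tailored to the phrasing in the output above ("X is/are expected to be between $A and $B billion"); other companies, or other quarters, may word their guidance differently. `extract_guidance`, `RANGE`, and `UNITS` are hypothetical names.

```python
import re

UNITS = {"million": 1e6, "billion": 1e9}

RANGE = re.compile(
    r"(?P<metric>[A-Z][A-Za-z ]+?) (?:is|are) expected to be between "
    r"\$(?P<low>[\d.]+)(?: (?P<low_unit>million|billion))? and "
    r"\$(?P<high>[\d.]+) (?P<high_unit>million|billion)"
)


def extract_guidance(text):
    """Return {metric: (low, high)} in dollars for each guidance range found."""
    ranges = {}
    for m in RANGE.finditer(text):
        high_scale = UNITS[m.group("high_unit")]
        # "$0 and $1.25 billion": a missing unit on the low end
        # is taken to share the unit of the high end
        low_scale = UNITS[m.group("low_unit")] if m.group("low_unit") else high_scale
        ranges[m.group("metric")] = (
            float(m.group("low")) * low_scale,
            float(m.group("high")) * high_scale,
        )
    return ranges
```

Applied to the paragraph above, this picks up two entries, "Net sales" and "Operating income".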

In [154]:
page_layouts = extract_layout_by_page('twtr.pdf')
#objects_on_page = set(type(o) for o in page_layouts[3])

pages = []
for i in xrange(len(page_layouts)):
    current_page = page_layouts[i]

    texts = []

    # keep only the horizontal text boxes; other layout elements are skipped
    for e in current_page:
        if isinstance(e, pdfminer.layout.LTTextBoxHorizontal):
            texts.append(e)

    # break the text boxes into characters and group them into rows
    characters = extract_characters(texts)
    pages.append(convert_to_rows(characters))
    

[[u'\u2022 Q3 adjusted EBITDA of $181 million, up 28% year-over-year, ',
  u'Monthly Active Users (MAU)'],
 [u'representing an adjusted EBITDA margin of 29%.'],
 [u'In Millions'],
 [],
 [u'\u2022 Average monthly active users (MAUs) were 317 million for Q3, up 3% '],
 [u'year-over-year and compared to 313 million in the previous quarter.',
  u'313',
  u'317'],
 [u'\u2022 Average U.S. MAUs were 67 million for Q3, up 1% year-over-',
  u'66',
  u'67'],
 [u'year and compared to 66 million in the previous quarter. '],
 [u'\u2022 Average international MAUs were 250 million for Q3,  ',
  u'247',
  u'250'],
 [u'up 4% year-over-year and compared to 247 million in  '],
 [u'the previous quarter.        ', u'Q2\u201916', u'Q3\u201916'],
 [u'\u2022 ', u'Mobile MAUs represented 83% of total MAUs.'],
 [u'\u2022 Average daily active usage* (DAU) grew 7% year-over-year,  '],
 [u'an acceleration from 5% in Q2 and 3% in Q1.'],
 [],
 [u'We\u2019re focused on driving value across three key areas of our serv

In [159]:
flattened = list(itertools.chain.from_iterable(pages[2]))
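
Flattening a page this way discards the blank rows that mark paragraph boundaries. An alternative is to split on those blank rows first and then search the resulting paragraphs for a keyword such as "MAUs". The sketch below is written for Python 3 (the notebook cells are Python 2); `paragraphs` and `paragraphs_mentioning` are hypothetical helpers that operate on rows in the shape convert_to_rows returns.

```python
def paragraphs(rows):
    """Join the cells between blank rows into one string per paragraph."""
    result, current = [], []
    for row in rows:
        if row:
            current.extend(row)
        elif current:
            # a blank row closes the paragraph collected so far
            result.append("".join(current))
            current = []
    if current:
        result.append("".join(current))
    return result


def paragraphs_mentioning(rows, keyword):
    """Return only the paragraphs that contain `keyword`."""
    return [p for p in paragraphs(rows) if keyword in p]
```

Because cells are joined in row order, chart labels that share a line with the text (as in the Twitter output above) end up inside the paragraph and may need to be filtered out separately.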

In [170]:
# replace each blank row with " " so paragraph breaks survive flattening;
# a lambda body must be an expression, and equality is tested with ==
p = map((lambda x: " " if x == [] else x), pages[2])
p