# Lesson 7 Intermediate Python for Data Analytics (Financial Performance and Fraud Detection)
# Introduction

In this lesson we scrape three kinds of financial data:
* Income statement
* Balance sheet
* Cashflows

## Objective

* To use Tweepy (a Python library for the Twitter API) to load data from Twitter
* Setting up a Twitter Developer account linked to your profile
* Initializing connections and extracting the presidential election tweets
* To explore the presidential election (Trump vs Hillary) using Twitter data
    * Viewing the data
    * Search Term Analysis
    * Exploring Twitter Trends
* Sentiment Analysis
    * Generating Sentiment Analysis
    * Plotting out Sentiment Analysis
    * What about the news media? How often do they mention the election candidates?
* Topic Analysis
    * Generating Topic with LDA
    * Plotting out Topic Analysis
* Challenges:
    * Analysing fake news
    * Analysing sentiment by geographic location using charts
    * Applying these techniques for companies, commodities, and stocks
* Next lesson:
    * Lesson 6 Basic Python for Data Analytics (Optimization Model for Operations Management)

# Scraping Wikipedia SP500 Data Using Beautiful Soup

In [56]:
import bs4 as bs
import pickle
import requests

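def save_sp500_tickers():
    # Download the Wikipedia page that lists the S&P 500 companies
    resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    # Take the first table whose class is 'wikitable sortable'
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    # Skip the header row; the ticker symbol is in the first cell of each row
    for row in table.find_all('tr')[1:]:
        # strip() removes any whitespace or newline around the symbol
        ticker = row.find_all('td')[0].text.strip()
        tickers.append(ticker.lower())

    return tickers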

tickers = save_sp500_tickers()

In [58]:
tickers[:5]

[u'mmm', u'abt', u'abbv', u'acn', u'atvi']

# Scraping Financial Data
The financial statements are scraped with Selenium from Nasdaq pages such as:
http://www.nasdaq.com/symbol/aapl/financials?query=income-statement&data=quarterly

## Repeat the scraping for multiple pages

In [4]:
import pandas as pd
from numpy import nan
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

In [10]:
## return nan values if elements not found, and convert the webelements to text
def get_elements(xpath, browser):
    ## find the elements (find_elements_by_xpath was removed in Selenium 4)
    elements = browser.find_elements(By.XPATH, xpath)
    ## if any are missing, return all nan values
    if len(elements) != 4:
        return [nan] * 4
    ## otherwise, return just the text of each element
    return [e.text for e in elements]
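
The all-or-nothing rule in `get_elements` can be checked without launching a browser. The sketch below is a hypothetical stand-in and not part of the lesson's code: `FakeElement` and `texts_or_nan` are illustrative names, and only the `.text` attribute of a Selenium element is imitated.

```python
from math import isnan

nan = float("nan")

class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is imitated."""
    def __init__(self, text):
        self.text = text

def texts_or_nan(elements, expected=4):
    """Return each element's text, or all-NaN when the count is not `expected`."""
    if len(elements) != expected:
        return [nan] * expected
    return [e.text for e in elements]

# Four elements found: the texts are returned unchanged
complete = texts_or_nan([FakeElement(t) for t in ["$4", "$3", "$2", "$1"]])
# Only two elements found: the whole row is replaced by NaN values
partial = texts_or_nan([FakeElement("$4"), FakeElement("$3")])

print(complete)
print(all(isnan(v) for v in partial))
```

Returning a full list of NaN values keeps every column at the same length, so the later row-filling loop can index positions 0 to 3 without checking for missing data.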

In [60]:
import numpy as np
def clean_up(period,df):
    ## columns that hold dollar amounts as text
    dollarlist = [u'total_revenue', u'gross_profit',u'net_income', u'total_assets', u'total_liabilities', u'total_equity','net_cash_flow'] 
    ## strip '$', ',' and ')', then turn '(' into a minus sign; astype(int) raises if any value is NaN
    df.loc[:,dollarlist] = df.loc[:,dollarlist].replace(r'[\$,)]', '', regex=True).replace(r'[(]', '-', regex=True).astype(int)
    if period=='yearly':
        ratiolist = [u'liq_current_ratio', u'liq_quick_ratio', u'liq_cash_ratio',
               u'prof_gross_margin', u'prof_operating_margin', u'prof_profit_margin']
        ## drop the '%' sign and convert percentages to fractions (np.float was removed from NumPy)
        df.loc[:,ratiolist] = df.loc[:,ratiolist].apply(lambda x: x.str.replace('%','')).astype(float)/100
    return df
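
To see what the replacements in `clean_up` do to a single value, the sketch below applies the same steps to plain strings with the standard `re` module. The helper names `parse_dollars` and `parse_percent` are hypothetical, and the sample strings are illustrative values in the format the cleaning step assumes (a `$` prefix, comma separators, parentheses for negatives).

```python
import re

def parse_dollars(text):
    """Convert a string such as '$43,741,000' or '($3,420,000)' to an int."""
    stripped = re.sub(r"[\$,)]", "", text)   # drop '$', ',' and ')'
    return int(stripped.replace("(", "-"))   # '(' marks a negative amount

def parse_percent(text):
    """Convert a percentage string such as '12.5%' to a fraction."""
    return float(text.replace("%", "")) / 100

print(parse_dollars("$43,741,000"))
print(parse_dollars("($3,420,000)"))
print(parse_percent("12.5%"))
```

The pandas version in `clean_up` performs the same substitutions column by column instead of one string at a time.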

In [61]:
# Scrape yearly or quarterly financials for a list of symbols

def scrapenasdaq(period,symbols,csv_path):
    
    chrome_options = Options()
    chrome_options.add_argument("--disable-extensions")

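    ## launch the Chrome browser (this will generate a window)
    ## Selenium 4 takes the driver path through a Service object instead of executable_path
    from selenium.webdriver.chrome.service import Service
    my_path = "chromedriver.exe"
    browser = webdriver.Chrome(service=Service(my_path), options=chrome_options)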
    browser.maximize_window()
    df = pd.DataFrame()
    
    financials_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '$')]"
    ratios_xpath = "//tbody/tr/th[text() = '{}']/../td[contains(text(), '%')]"

    ## choose the yearly or the quarterly URL template
    if period == 'yearly':
        url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}" 
    else:
        url_form = "http://www.nasdaq.com/symbol/{}/financials?query={}&data=quarterly" 

    for i, symbol in enumerate(symbols):
        try:
            ## navigate to the income statement page (yearly or quarterly)
            url = url_form.format(symbol, "income-statement")
            browser.get(url)

            company_xpath = "//h1[contains(text(), 'Company Financials')]"
            company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
            
            period_endings=""
            quarters = ""
            quarter_endings = ""
            if period=='yearly':
                period_endings_xpath = "//thead/tr[th[1][text() = 'Period Ending:']]/th[position()>=3]"
                period_endings = get_elements(period_endings_xpath,browser)
            else:
                quarters_xpath = "//thead/tr[th[1][text() = 'Quarter:']]/th[position()>=3]"
                quarters = get_elements(quarters_xpath,browser)

                quarter_endings_xpath = "//thead/tr[th[1][text() = 'Quarter Ending:']]/th[position()>=3]"
                quarter_endings = get_elements(quarter_endings_xpath,browser)
                         
            total_revenue = get_elements(financials_xpath.format("Total Revenue"),browser)
            gross_profit = get_elements(financials_xpath.format("Gross Profit"),browser)
            net_income = get_elements(financials_xpath.format("Net Income"),browser)

            ## navigate to the balance sheet page (yearly or quarterly)
            url = url_form.format(symbol, "balance-sheet")
            browser.get(url)

            total_assets = get_elements(financials_xpath.format("Total Assets"),browser)
            total_liabilities = get_elements(financials_xpath.format("Total Liabilities"),browser)
            total_equity = get_elements(financials_xpath.format("Total Equity"),browser)

            ## navigate to the cash flow page (yearly or quarterly)
            url = url_form.format(symbol, "cash-flow")
            browser.get(url)

            net_cash_flow = get_elements(financials_xpath.format("Net Cash Flow"),browser)

            
            ## navigate to ratios page 
            if period=='yearly':
                url = url_form.format(symbol, "ratios")
                browser.get(url)

                liq_current_ratio = get_elements(ratios_xpath.format("Current Ratio"),browser)
                liq_quick_ratio = get_elements(ratios_xpath.format("Quick Ratio"),browser)
                liq_cash_ratio = get_elements(ratios_xpath.format("Cash Ratio"),browser)
                prof_gross_margin = get_elements(ratios_xpath.format("Gross Margin"),browser)
                prof_operating_margin = get_elements(ratios_xpath.format("Operating Margin"),browser)
                prof_profit_margin = get_elements(ratios_xpath.format("Profit Margin"),browser)
            
            ## fill the dataframe with the scraped data, 4 rows per company
            for j in range(4):  
                row = i*4 + j
                df.loc[row, 'company'] = company
                
                if period == 'yearly':
                    df.loc[row, 'period_endings'] = period_endings[j]
                    df.loc[row, 'liq_current_ratio'] = liq_current_ratio[j]
                    df.loc[row, 'liq_quick_ratio'] = liq_quick_ratio[j]
                    df.loc[row, 'liq_cash_ratio'] = liq_cash_ratio[j]
                    df.loc[row, 'prof_gross_margin'] = prof_gross_margin[j]
                    df.loc[row, 'prof_operating_margin'] = prof_operating_margin[j]            
                    df.loc[row, 'prof_profit_margin'] = prof_profit_margin[j]
                else:
                    df.loc[row, 'quarter'] = quarters[j]
                    df.loc[row, 'quarter_ending'] = quarter_endings[j]
                
                df.loc[row, 'total_revenue'] = total_revenue[j]
                df.loc[row, 'gross_profit'] = gross_profit[j]
                df.loc[row, 'net_income'] = net_income[j]
                df.loc[row, 'total_assets'] = total_assets[j]
                df.loc[row, 'total_liabilities'] = total_liabilities[j]
                df.loc[row, 'total_equity'] = total_equity[j]
                df.loc[row, 'net_cash_flow'] = net_cash_flow[j]
        except Exception as e:
            ## Selenium timeouts are not IOError, so catch broadly and report which symbol failed
            print(symbol, e)
    browser.quit()
    df = clean_up(period,df)
    
    ## create a csv file in our working directory with our scraped data
    df.to_csv(csv_path, index=False)
    
    return df
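
Filling a DataFrame one cell at a time with `df.loc[row, column]` works, but it gets slow as the number of symbols grows. A possible alternative, sketched below under stated assumptions, is to collect one dictionary per period and build the frame once at the end. `build_rows` is a hypothetical helper, and the sample strings are illustrative values based on the Amazon output of this lesson.

```python
def build_rows(company, labels, columns):
    """Combine per-period lists into one dict per period.

    `columns` maps a column name to a list with one value per period,
    in the same order as `labels`.
    """
    rows = []
    for j, label in enumerate(labels):
        row = {"company": company, "quarter_ending": label}
        for name, values in columns.items():
            row[name] = values[j]
        rows.append(row)
    return rows

rows = build_rows(
    "AMZN Company Financials",
    ["12/31/2016", "9/30/2016"],
    {
        "total_revenue": ["$43,741,000", "$32,714,000"],
        "net_income": ["$749,000", "$252,000"],
    },
)
for row in rows:
    print(row)
```

Passing the accumulated list to `pd.DataFrame(rows)` would then create the frame in a single step, which also removes the need to compute the row index `i*4 + j` by hand.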

In [63]:
%%time
scrapenasdaq('quarterly',['amzn'],'output.csv')



| | company | quarter | quarter_ending | total_revenue | gross_profit | net_income | total_assets | total_liabilities | total_equity | net_cash_flow |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMZN Company Financials | 4th | 12/31/2016 | 43741000 | 14782000 | 749000 | 83402000 | 64117000 | 19285000 | 5678000 |
| 1 | AMZN Company Financials | 3rd | 9/30/2016 | 32714000 | 11455000 | 252000 | 70897000 | 53115000 | 17782000 | 1135000 |
| 2 | AMZN Company Financials | 2nd | 6/30/2016 | 30404000 | 11223000 | 857000 | 65076000 | 48538000 | 16538000 | 51000 |
| 3 | AMZN Company Financials | 1st | 3/31/2016 | 29128000 | 10262000 | 513000 | 61128000 | 46372000 | 14756000 | -3420000 |


# Generating Financial Ratios data

Ratio analysis is a quantitative analysis of the information contained in a company’s financial statements. It is based on line items in statements such as the balance sheet, income statement and cash flow statement: the ratio of one item, or a combination of items, to another item or combination is calculated. Ratio analysis is used to evaluate various aspects of a company’s operating and financial performance, such as its efficiency, liquidity, profitability and solvency.

Source: Ratio Analysis Definition, Investopedia, http://www.investopedia.com/terms/r/ratioanalysis.asp

http://www.investorguide.com/article/13690/list-of-important-financial-ratios-for-stock-analysis/
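
As a minimal sketch of ratio analysis, the code below computes a few common ratios from two quarters of the Amazon figures scraped earlier in this lesson. The choice of ratios and the `compute_ratios` helper are assumptions made for illustration. Return on assets is simplified here to quarterly net income over period-end total assets.

```python
# Quarterly Amazon figures copied from the scraped output shown earlier in this lesson
quarters = [
    {"quarter_ending": "12/31/2016", "total_revenue": 43741000,
     "gross_profit": 14782000, "net_income": 749000,
     "total_assets": 83402000, "total_liabilities": 64117000,
     "total_equity": 19285000},
    {"quarter_ending": "9/30/2016", "total_revenue": 32714000,
     "gross_profit": 11455000, "net_income": 252000,
     "total_assets": 70897000, "total_liabilities": 53115000,
     "total_equity": 17782000},
]

def compute_ratios(q):
    """Return a few profitability and solvency ratios for one period."""
    return {
        "gross_margin": q["gross_profit"] / q["total_revenue"],
        "profit_margin": q["net_income"] / q["total_revenue"],
        "debt_to_equity": q["total_liabilities"] / q["total_equity"],
        "return_on_assets": q["net_income"] / q["total_assets"],
    }

for q in quarters:
    # Sanity check: assets should equal liabilities plus equity
    assert q["total_assets"] == q["total_liabilities"] + q["total_equity"]
    ratios = compute_ratios(q)
    print(q["quarter_ending"], {k: round(v, 4) for k, v in ratios.items()})
```

The balance-sheet identity check is a cheap way to catch scraping errors before any ratios are interpreted.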