# Automated News Update, or, "Dwyer's attempt at automating himself out of a job"

### To Do
* Add Governing
* Axios
* Functionality to add news article to SQL database after the fact
* Auto open word doc in gen_docx
* Detroit News
* Add Phys.org
* Add Jalopnik
* Transport Policy
* [Journal of Modern Transportation](https://link.springer.com/journal/40534)
* Journal of Urban Economics
* https://www.springer.com/engineering/civil+engineering/journal/42421?TrucksFoT

### Can't because of paywalls:
* WSJ
* Nikkei
* Automotive World

In [1]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Import packages, define important stuff

In [3]:
import os

import numpy as np
import pandas as pd
import datetime as dt
import time

import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
import requests

import docx
from docx.enum.text import WD_COLOR_INDEX
from docx.shared import Pt
import win32com.client as win32

In [4]:
# Keyword lists for each of the different news updates
cav_keywords = ['self-driving', 'automated', 'self driving', 'autonomous', 'MaaS', 'ride-sharing', 'ridesharing', 'ride-hailing',
                'ridehailing', 'lidar', 'LiDAR', 'rideshare', 'ridehail', 'ride-hail', 'ridesource', 'ride-source', 'ride-sourcing',
                'carsharing', 'car-sharing', 'carshare', 'car-share', 'Uber', 'Lyft', 'Chariot', 'connected car', 'Waymo', 'TRI',
                'Cruise', 'Zoox', 'Mobileye', 'Softbank', 'peer-to-peer', 'Turo']
afv_keywords = ['rare-earth', 'rare earth', 'natural gas', 'electric vehicles', 'electric vehicle', 'electric car', 'EV', 'electrification', 'alternative fuel', 'CNG', 'LNG',
                'alt-fuel', 'propane', 'charging stations', 'EVSE', 'electric vehicle charging', 'HEV', 'hybrid', 'hybrid-electric', 'plug-in', 'PHEV', 'electric motor',
                'bio-fuel', 'biofuel', 'idle reduction', 'fuel cell', 'electric bus', 'electric buses', 'electric truck', 'electric trucks', 'electric drive',
                'battery-electric', 'battery electric', 'electric', 'battery-electric-powered']
truck_keywords = ['alternative fuels', 'natural gas', 'compressed natural gas', 'liquefied natural gas', 'CNG', 'LNG', 'propane', 'LPG', 'dimethyl ether', 'DME', 'electric', 'electricity', 'electrified', 'electric drive',
                  'battery', 'energy storage', 'hydrogen', 'fuel cell', 'hybrid', 'hybrid electric', 'hybrid hydraulic', ' Phase 2', 'Phase II', 'efficiency', 'fuel efficiency', 'fuel economy', 'aftertreatment',
                  'emission control', 'diesel particulate filter', 'DPF', 'selective catalytic reduction', 'SCR', 'aerodynamics', 'sustainability', 'waste heat recovery', 'Rankine', 'organic Rankine', 'SuperTruck',
                  'automated manual', 'AMT', 'platooning', 'lithium', 'biofuel', 'fast charging', 'downspeed', 'downsize', 'clean diesel', 'turbocompound', 'rolling resistance', 'skirt', 'boat tail', 'axle', 'low viscosity',
                  'catenary', 'autonomy', 'autonomous', 'connected and autonomous', 'connected', 'telematics', 'driver assist', 'CACC', 'active cruise control', 'crash avoidance', 'crashworthiness', 'weigh-in-motion', 'weigh in motion',
                  'high productivity', 'truck size and weight', 'V2I', 'V2V', 'vehicle to infrastructure', ' vehicle to vehicle',  'restructuring', 'acquisition', 'driver cost', 'operational efficiency',
                  'facilities', 'proving ground', 'partnership', 'regional haul', 'joint venture', 'grant', 'FOA', 'funding opportunity', 'unveil', 'announce', 'offer', 'expansion', 'greenhouse gas', 'GHG', 'emission regulation',
                  'emissions regulation', 'idle', 'idling', 'zero emissions', 'strategic plan', 'SmartWay', 'VIUS', 'well to wheels', 'pump to wheels', 'well to pump', 'CARB', 'CEC', 'air resources board', 'energy commission', 'EPA',
                  'Environmental Protection Agency', 'smart mobility', 'smart cities']
hyperloop_keywords = ['hyperloop', 'high-speed train',
                      'high speed train', 'bullet train']

# Used for diagnostics/tracking later
scrape_specs = {}

# Set day of week for each scraper category, to reduce run time (avoid searching for all keyword lists every run)
scraper_sched = {'CAV': 0, 'AFV': 2, '21CTP': 4, 'Hyperloop': 0}

# Age filter, in days (only want to pull articles that are <= 1 week old)
max_age = 7

# For file naming and tracking
search_date = str(dt.date.today())

# Needed for web scraping "browser"
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

# For database update; ensures duplicates aren't loaded
db_update = False

In [5]:
# scraped_count = 0
# skip_count = 0
# too_old = 0
# iteration = 0
# skip_ind = []
# old_ind = []


def replace_em(text):
    '''Replaces odd characters in text. Used for page titles and summaries'''
    bad_chars = ['â€œ', 'â€™', 'â€�', '\n', 'Â',
                 'â€”', '(earlier post)', 'â€?', '\t', 'â€œ', '(TNS) — ', '(Reuters) - ',
                 'DUBAI (Reuters) - ']
    for bad_char in bad_chars:
        text = text.replace(bad_char, '')
    return text


def grab_homepage(url):
    '''Creates BeautifulSoup object using input url'''
#     headers = {'user-agent': 'Mozilla/5.0'}
    page_1 = requests.get(url, headers=headers)
    return BeautifulSoup(page_1.content, "html5lib")


def print_results(site, scraped_count, skip_count, too_old, df, duration, scrape_specs):
    '''Prints out a quick summary of one website's full scraping and adds summary specs to scrape_specs dictionary'''
    print(f'{scraped_count} {site} article(s) scraped')
    print(f'{skip_count} {site} article(s) skipped due to error')
    print(f'{too_old} {site} article(s) skipped due to age')
    print(f'{df.shape[0]} relevant article(s) collected')
    scrape_specs[f"{site}"] = {'Pages Scraped': scraped_count, 'Relevant Articles': df.shape[0], 'Errors': skip_count,
                               'Too old': too_old, 'Time spent': duration}
    return scrape_specs


def page_scan(title, summary, url, date, source):
    '''
    Searches a web page title and summary for keywords; returns the dictionary object that is used to create 
    the final dataframe. Searches the title first; if the keyword is there, it doesn't search the summary.

    Only searches for keywords specific to that day of the week's news update.
    '''
    bool_dict = {'Hyperloop': 0, 'CAV': 0, 'AFV': 0, '21CTP': 0}
    title_scrape = title+' '+title.lower()
    summary_scrape = summary+' '+summary.lower()

    if dt.date.today().weekday() == scraper_sched['CAV']:
        if any(keyword in title_scrape for keyword in cav_keywords):
            bool_dict['CAV'] = 1
        elif any(keyword in summary_scrape for keyword in cav_keywords):
            bool_dict['CAV'] = 1

    if dt.date.today().weekday() == scraper_sched['Hyperloop']:
        if any(keyword in title_scrape for keyword in hyperloop_keywords):
            bool_dict['Hyperloop'] = 1
        elif any(keyword in summary_scrape for keyword in hyperloop_keywords):
            bool_dict['Hyperloop'] = 1

    if dt.date.today().weekday() == scraper_sched['AFV']:
        if any(keyword in title_scrape for keyword in afv_keywords):
            bool_dict['AFV'] = 1
        elif any(keyword in summary_scrape for keyword in afv_keywords):
            bool_dict['AFV'] = 1

    if dt.date.today().weekday() == scraper_sched['21CTP']:
        if any(keyword in title + title_scrape for keyword in truck_keywords) & (('truck' in title_scrape) | ('trucks' in title_scrape)):
            bool_dict['21CTP'] = 1
        elif any(keyword in summary_scrape for keyword in truck_keywords) & (('truck' in summary_scrape) | ('trucks' in summary_scrape)):
            bool_dict['21CTP'] = 1

    if sum(bool_dict.values()) > 0:
        return {'title': title.strip(), 'summary': summary.strip(), 'link': url, 'source': source,
                'date': date, 'AFV': bool_dict['AFV'], 'CAV': bool_dict['CAV'], '21CTP': bool_dict['21CTP'],
                'Hyperloop': bool_dict['Hyperloop']}
    else:
        return 'Most definitely nope'

# The following two functions are for the Word document output!


def add_hyperlink(paragraph, url, text):
    '''
    :param paragraph: The paragraph we are adding the hyperlink to.
    :param url: A string containing the required url
    :param text: The text displayed for the url
    :return: The hyperlink object
    '''
    # This gets access to the document.xml.rels file and gets a new relation id value
    part = paragraph.part
    r_id = part.relate_to(
        url, docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, is_external=True)

    # Create the w:hyperlink tag and add needed values
    hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
    hyperlink.set(docx.oxml.shared.qn('r:id'), r_id, )

    # Create a w:r element
    new_run = docx.oxml.shared.OxmlElement('w:r')

    # Create a new w:rPr element
    rPr = docx.oxml.shared.OxmlElement('w:rPr')

    # bold the text
    u = docx.oxml.shared.OxmlElement('w:b')
    rPr.append(u)

    # Join all the xml elements together add add the required text to the w:r element
    new_run.append(rPr)
    new_run.text = text
    hyperlink.append(new_run)

    paragraph._p.append(hyperlink)

    return hyperlink


def gen_docx(newstype, dwyer=True, CA_nums='NEED TO INSERT'):
    '''
    Generates news Word doc using data file from web scrape
    :param newstype: Either "21CTP", "CAV", or "AFV"
    :param dwyer: If not running on Dwyer's computer, set this to False and put all needed files in the same directory
    :param CA_nums: Input string for the CA EVSE numbers (automatically populates the caption for the EVSE bar chart figure)
    '''

    # select data file (xls) based on the newstype and date. Note that search_date is a global variable defined outside
    # of this function. Each news update only happens once a week --> only one xls file per newstype per week --> can't just
    # pick any old search_date and make a file.
    if dwyer:
        # Name of the excel file (standardized)
        data_file = f"{newstype.lower()}_news_updates/{search_date}_{newstype}_news_download.xls"
    else:
        data_file = f"{search_date}_{newstype}_news_download.xls"

    # Read the data in from the selected file
    df = pd.read_excel(data_file)
    df = df.reset_index(drop=True).T.to_dict()

    # Start creating the word doc
    newsdoc = docx.Document(docx='python_docx.docx')

    # Add up-front stuff - title, headers, and for the AFV update, some other stuff (two captions and some text)
    if newstype == 'AFV':
        newsdoc.add_heading(
            f"Alternative Fuel Vehicle Weekly News Update – {dt.date.today().strftime('%m/%d/%Y')}", 0)
        newsdoc.add_heading('EVSE Market Analysis', 1)
        evse_bar_chart = newsdoc.add_paragraph().add_run('INSERT EVSE BAR CHART HERE')
        evse_bar_chart.font.bold = True
        evse_bar_chart.font.size = Pt(16)
        evse_bar_chart.font.highlight_color = WD_COLOR_INDEX.YELLOW
        newsdoc.add_paragraph('Figure: Number of EVSE plugs (note: not stations) by state and charging level.'
                              'CA is not included, since it would make the rest of the state numbers illegible.'
                              f"CA holds a disproportionately large share of the total EVSE plugs: {CA_nums} "
                              'of Level 1, Level 2, and DCFC plugs respectively. Data Source: U.S. DOE AFDC Station Locator.',
                              style='Caption')
        newsdoc.add_paragraph(' ')
        newsdoc.add_paragraph('The table below summarizes overall changes in number of EV charging stations by state between '
                              f"{(dt.date.today() - dt.timedelta(7)).strftime('%m/%d/%Y')} and {dt.date.today().strftime('%m/%d/%Y')}:",
                              style='Normal')
        newsdoc.add_paragraph('Table 1: Change in number of EV charging stations by state, between '
                              f"{(dt.date.today() - dt.timedelta(7)).strftime('%m/%d/%Y')} and {dt.date.today().strftime('%m/%d/%Y')}",
                              style='Caption')
        evse_delta_table = newsdoc.add_paragraph().add_run('INSERT EVSE DELTA TABLE HERE')
        evse_delta_table.font.bold = True
        evse_delta_table.font.size = Pt(16)
        evse_delta_table.font.highlight_color = WD_COLOR_INDEX.YELLOW

    if newstype == 'CAV':
        newsdoc.add_heading(
            f"Connected and Automated Vehicle Weekly News Update – {dt.date.today().strftime('%m/%d/%Y')}", 0)
        newsdoc.add_paragraph(' ')
        newsdoc.add_paragraph('Includes coverage of ride-sharing and other smart mobility technologies. '
                              'The majority of this is direct quotations from the respective articles. I '
                              'claim none of this text content as my own, having only sifted through the '
                              'web to find already-existing pieces relevant to these topics.')

    if newstype == '21CTP':
        newsdoc.add_heading(
            f"21CTP Trucking Weekly News Update – {dt.date.today().strftime('%m/%d/%Y')}", 0)

    for header in ['Business and Market Analysis', 'Technology, Testing, and Analysis', 'Policy and Government']:
        newsdoc.add_heading(header, 1)
        newsdoc.add_paragraph('')

    # Add all of the actual news items
    for row in df:
        row = df[row]
        newsdoc.add_heading(row['title'], level=2)
        p = newsdoc.add_paragraph(row['summary'] + ' ')
        p.add_run('(')
        # This is where the add_hyperlink function is used
        add_hyperlink(p, '{}'.format(row['link']), '{}'.format(row['source']))
        p.add_run(')')
    if newstype == 'CAV':
        newsdoc.add_heading('Relevant Transportation Research', 1)
        newsdoc.add_paragraph('This section includes publications, papers, articles, and conferences that investigate and/or'
                              'discuss transportation and travel demand impacts of MaaS or other “future travel” considerations.'
                              'Portions of the abstract or description (not my words) are included under each title for more information.')
    if newstype == 'AFV':
        newsdoc.add_heading('Relevant Transportation Research', 1)
        newsdoc.add_paragraph('This section includes publications, papers, articles, and conferences that investigate and/or'
                              'discuss alternative fuel vehicle impacts on transportation systems. Portions of the abstract '
                              'or description (not my words) are included under each title for more information.')
    if dwyer:
        newsdoc.save(
            f"{newstype.lower()}_news_updates/Energetics {newstype} News Update - {search_date}.docx")
    else:
        filename = f"Energetics {newstype} News Update - {search_date}.docx"
        newsdoc.save(filename)


def which_keyword_found(row):
    ''' Identifies and stores which keywords triggered the news item pull '''
    words_found = []
    for keyword in cav_keywords+afv_keywords:
        try:
            if (row['summary'].find(keyword) > 0) | (row['title'].find(keyword) > 0):
                words_found.append(keyword)
        except:
            continue
    return ', '.join(words_found)


def keyword_pull(string):
    ''' Pulls all relevant capitalized words out of the title, as a quick "keyword" list '''
    not_keywords = ['A', 'The', 'This', 'I', 'To', 'Who', 'Silicon', 'Valley', 'System', 'Build', 'Payment', 'Business', 'API', 'JV', 'JVs',
                    'European', 'American', 'America', 'Europe', 'China', 'But', 'Are', 'They', 'Legal', 'Says', 'AV', 'Revolution', 'Is',
                    'TechCrunch', 'For', 'EVs', 'Really', 'Get', 'Money', 'Adds', 'We', 'All', 'Starts', 'Return', 'Apart',
                    'Them', 'Cities', 'After', 'Insurance', 'Back', 'Against', 'Would', 'Displace', 'Improves', 'While',
                    'That', 'You', 'Find', 'Along', 'From', 'Their', 'Not', 'So', 'Say', 'Experts', 'Drivers', 'Its', 'Into', 'Fully',
                    'Ranks', 'Stretch', 'SUV', 'Data', 'Sharing', 'Live', 'When', 'Agencies', 'Still', 'Trying', 'Program', 'Offer', 'Four',
                    'Will', 'Backs', 'Just', 'Around', 'Years', 'Its', 'Future', 'Deploying', 'Objects', 'Distance', 'Highlights']
    string = string.replace(';', '').replace(',', '').lstrip().split(' ')
    keywords = [word for word in string if (
        word[0].isupper()) & (word not in not_keywords)]
    return ', '.join(keywords)

### Create a scraper class that will be used for each website

In [6]:
class scraypah:
    '''
    Scraypah is a web scraper that searches through all of the recent articles on a website and extracts key information
    from those that include relevant keywords. It requires a dictionary of parameters specific to each website that needs
    to be scraped. See the __init__ docstring for information on the input parameter requirements.
    '''

    def __repr__(self):
        return "This is an object of class scraypah!"

    def __init__(self, params):
        '''
        Attributes:
            params[url] (str): Homepage of the website, where each of the article page links are extracted from
            params[source] (str): Name of the website
            params[strain_bool] (bool): Is there a soup strainer for this website or not?
            params[strain_tag] (str): Tag used for soup strainer
            params[strain_attr_name] (str): Attribute name used for soup strainer
            params[strain_attr_value] (str): Attribute value used for soup strainer
            params[date_loc] (str): Location of the date in the HTML
            params[date_format] (str): Allows user to set the date format, if the format on the website does not parse automaticall
            params[sum_loc] (str): Location of the summary in the HTML (this is typically the first 3 paragraphs of the article)
            params[title_loc] (str): Location of the title in the HTML
            params[url_list_query] (str): BeautifulSoup code to extract the list of articles from the website homepage(s) (url)
        '''
        self.base_url = params['url']
        self.source = params['source']
        self.strainer = params['strain_bool']
        if self.strainer:
            self.strain_tag = params['strain_tag']
            self.strain_attr_name = params['strain_attr_name']
            self.strain_attr_value = params['strain_attr_value']
        self.date_loc = params['date_loc']
        self.date_format = params['date_format']
        self.sum_loc = params['sum_loc']
        self.title_loc = params['title_loc']
        self.url_list_query = params['url_list_query']

    def get_urls(self):
        '''Populates self.urls_to_scrape with a list of urls extracted from the website homepage(s)'''
        self.urls_to_scrape = []
        with requests.Session() as s:

            # Checks if the base_url is a single url or a list of urls - some websites publish enough articles
            # that we have to pull multiple pages
            if isinstance(self.base_url, str):

                # Checks if there is a "soup strainer" for the website being scraped. See here:
                # https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document
                if not self.strainer:
                    page = requests.get(self.base_url, headers=headers)
                    time.sleep(0.5)
                    self.base_soup = BeautifulSoup(page.content, "lxml")
                else:
                    only_parse = SoupStrainer(self.strain_tag, attrs={
                                              self.strain_attr_name: self.strain_attr_value})
                    self.base_soup = BeautifulSoup(requests.get(
                        self.base_url, headers=headers).content, "lxml", parse_only=only_parse)

                time.sleep(1)
                self.urls_to_scrape = eval(self.url_list_query)

            else:
                for url in list(self.base_url):
                    if not self.strainer:
                        page = requests.get(url, headers=headers)
                        time.sleep(0.5)
                        self.base_soup = BeautifulSoup(page.content, "lxml")
                    else:
                        only_parse = SoupStrainer(self.strain_tag, attrs={
                                                  self.strain_attr_name: self.strain_attr_value})
                        self.base_soup = BeautifulSoup(requests.get(
                            url, headers=headers).content, "lxml", parse_only=only_parse)
                    time.sleep(1)
                    self.urls_to_scrape += eval(self.url_list_query)
        self.urls_to_scrape = list(set(self.urls_to_scrape))

    def scrape_em(self):
        self.relevant_articles = {}
        self.scraped_count = 0
        self.skip_count = 0
        self.too_old = 0
        self.iteration = 0
        self.skip_ind = []
        self.old_ind = []
        for url in self.urls_to_scrape:
            time.sleep(0.2)
            self.iteration += 1
            summary = None
            title = None
            date = None
            try:
                with requests.Session() as s:
                    page = s.get(url, headers=headers)
                    if self.source in ['Semiconductor Engineering', 'Reuters', 'Recode']:
                        article = BeautifulSoup(page.content, "html5lib")
                    else:
                        article = BeautifulSoup(page.content, "lxml")
                    date = pd.to_datetime(eval(self.date_loc).strip().replace(
                        '\\xa0', '').replace(' -\nBy:', ''), format=self.date_format).date()
                    if (date - dt.date.today()).days >= -max_age:
                        if self.source == 'Autoblog':
                            try:
                                summary = eval(self.sum_loc)
                                summary = replace_em(
                                    summary[0].text + ' '+summary[1].text + ' '+summary[2].text)
                            except:
                                summary = ' '.join(article.find('div', attrs={
                                                   'class': 'post-body'}).text.replace('\\t', '').replace('\\n\\n', '\n').split('\n')[1:4])
                        else:
                            summary = eval(self.sum_loc)
                            try:
                                summary = replace_em(
                                    summary[0].text + ' '+summary[1].text + ' '+summary[2].text)
                            except:
                                # Some articles are actually just one paragraph
                                summary = replace_em(summary[0])
                        title = eval(self.title_loc).replace('â€™', "'").replace(
                            '\\xa0', ' ').replace('\\n', '').lstrip().replace('  ', '')
                        temp = page_scan(title, summary, url,
                                         date, self.source)
                        if temp != 'Most definitely nope':
                            self.relevant_articles[self.scraped_count] = temp
                        self.scraped_count += 1
                    else:
                        self.too_old += 1
                        self.old_ind.append(self.iteration-1)
            except Exception as exc:
                print(
                    f"{str(exc)}: {url} \ndate:{date}\ntitle:{title}\nsummary:{summary}")
                self.skip_count += 1
                self.skip_ind.append(self.iteration-1)
                continue
        self.relevant_df = pd.DataFrame.from_dict(self.relevant_articles).T
        if not self.relevant_df.empty:
            self.relevant_df.drop_duplicates('link', inplace=True)

### Set parameters for each website scraper

In [36]:
# 'article' is the variable that stores the BeautifulSoup soup for a particular article page. e.g. "energy.gov/some-article."
# The values of date_loc, sum_loc, and title_loc should be the BeautifulSoup commands for accessing the date location, 
# summary location and title location, respectively.

scraper_dict = {'MIT': {'url': 'http://news.mit.edu/mit-news',
                        'source': 'MIT',
                        'strain_tag': 'ul',
                        'strain_attr_name': 'class',
                        'strain_attr_value': 'view-mit-news clearfix',
                        'url_list_query': "['http://news.mit.edu'+item.a['href'] for item in self.base_soup.find('ul', class_='view-mit-news clearfix').find_all('li')]",
                        'date_loc': "article.find('span', attrs={'itemprop':'datePublished'}).text",
                        'date_format': None,
                        'sum_loc': "article.find('div', attrs={'class': 'field-item even'}).find_all('p')",
                        'title_loc': "article.find('h1', attrs={'class':'article-heading'}).text",
                        'strain_bool': True},
                'SemEng': {'url': 'http://semiengineering.com/category-main-page-iot-security/',
                           'source': 'Semiconductor Engineering',
                           'strain_tag': 'div',
                           'strain_attr_name': 'class',
                           'strain_attr_value': 'l_col',
                           'url_list_query': "[item['href'] for item in self.base_soup.find('div', class_='l_col').find_all('a', href=True,title=True)]",
                           'date_loc': "article.find('div',class_='loop_post_meta').contents[0]",
                           'date_format': None,
                           'sum_loc': "article.find('div', class_='post_cnt post_cnt_first_letter').find_all('p')[1:4]",
                           'title_loc': "article.find('h1', class_='post_title').text",
                           'strain_bool': True},
                'Quartz': {'url': 'https://qz.com/search/self-driving',
                           'source': 'Quartz',
                           'url_list_query': "['https://qz.com' + a['href'] for a in self.base_soup.find_all('a', class_='_5ff1a')]",
                           'date_loc': "article.time.text",
                           'date_format': None,
                           'sum_loc': "article.find_all('p')[:3]",
                           'title_loc': "article.h1.text",
                           'strain_bool': False},
                            # Note: member exclusive articles for Quartz will be skipped.
                'Recode': {'url': 'https://www.recode.net/',
                           'source': 'Recode',
                           'strain_tag': 'a',
                           'strain_attr_name': 'data-analytics-link',
                           'strain_attr_value': 'article',
                           'url_list_query': "[item['href'] for item in self.base_soup.find_all('a', attrs={'data-analytics-link':'article'})]",
                           'date_loc': "article.time.text.replace('\\n', '')",
                           'date_format': None,
                           'sum_loc': "article.find_all('p')",
                           'title_loc': "article.h1.text",
                           'strain_bool': True},
                'GovTech': {'url': 'http://www.govtech.com/fs/transportation/',
                            'source': 'GovTech',
                            'url_list_query': "[item.a['href'] for item in self.base_soup.find_all(class_=['sub-feature-article','feature-article'])]",
                            'date_loc': "article.find('span', class_='date').text.strip()",
                            'date_format': None,
                            'sum_loc': "[item for item in article.find(class_='col-md-10').find_all('div') if len(str(item)) > 12] \
                                        if len([item for item in article.find(class_='col-md-10').find_all('p')]) < 3 \
                                        else [item for item in article.find(class_='col-md-10').find_all('p')]",
                            'title_loc': "article.find('h1').text.strip()",
                            'strain_bool': False},
                'Reuters': {'url': ['https://www.reuters.com/news/technology',
                                    'https://www.reuters.com/news/archive/technologynews?view=page&page=2',
                                    'https://www.reuters.com/news/archive/technologynews?view=page&page=3',
                                    'https://www.reuters.com/news/archive/technologynews?view=page&page=4',
                                    'https://www.reuters.com/news/archive/technologynews?view=page&page=5'],
                            'source': 'Reuters',
                            'url_list_query': "['https://www.reuters.com'+item.a['href'] for item in self.base_soup.find_all('div', class_='story-content')]",
                            'date_loc': "article.find('div', attrs={'class':'ArticleHeader_date'}).text.split('/')[0]",
                            'date_format': None,
                            'sum_loc': "article.find('div', attrs={'class':'StandardArticleBody_body'}).find_all('p')",
                            'title_loc': "article.h1.text",
                            'strain_bool': False},
                'CityLab': {'url': 'https://www.citylab.com/transportation/',
                            'source': 'Citylab',
                            'strain_tag': ['h2', 'h1'],
                            'strain_attr_name': 'class', 'strain_attr_value': ['c-promo__hed', 'c-river-item__hed c-river-item__hed--'],
                            'url_list_query': "[item.a['href'] for item in self.base_soup.find_all(['h1','h2'], class_=['c-promo__hed','c-river-item__hed c-river-item__hed--'])]",
                            'date_loc': "article.time.text",
                            'date_format': None,
                            'sum_loc': "article.find_all('p')[1:]",
                            'title_loc': "article.h1.text",
                            'strain_bool': True},
                'Autoblog': {'url': ['https://www.autoblog.com/archive/']
                                     + ['https://www.autoblog.com/archive/pg-' + str(i) for i in range(2,6)],
                             'source': 'Autoblog',
                             'strain_tag': 'h6',
                             'strain_attr_name': 'class',
                             'strain_attr_value': 'record-heading',
                             'url_list_query': "['https://www.autoblog.com' + header.a['href'] for header in self.base_soup.find_all('h6', class_ = 'record-heading')]",
                             'date_loc': "article.find('div', class_='post-date').text.strip().split(' at')[0]",
                             'date_format': None,
                             'sum_loc': "article.find('div', attrs={'class':'post-body'}).find_all('p')",
                             'title_loc': "article.h1.text",
                             'strain_bool': True},
                'Electrek': {'url': ['https://electrek.co/'] + 
                                     ['https://electrek.co/page/' + str(i) for i in range(2,6)],
                             'source': 'Electrek',
                             'strain_tag': 'h1',
                             'strain_attr_name': 'class', 'strain_attr_value': 'post-title',
                             'url_list_query': "[item.a['href'] for item in self.base_soup.find_all('h1', class_='post-title')]",
                             'date_loc': "article.find('p', class_='time-twitter').text",
                             'date_format': None,
                             'sum_loc': "article.find('div', class_='post-body').find_all('p')[1:]",
                             'title_loc': "article.find('h1', class_='post-title').text",
                             'strain_bool': True},
                'The Verge': {'url': 'https://www.theverge.com/transportation',
                              'source': 'The Verge',
                              'strain_tag': 'h2',
                              'strain_attr_name': 'class', 'strain_attr_value': 'c-entry-box--compact__title',
                              'url_list_query': "[item.a['href'] for item in self.base_soup.find_all('h2', class_='c-entry-box--compact__title')]",
                              'date_loc': "article.time.text",
                              'date_format': None,
                              'sum_loc': "article.find_all('p')",
                              'title_loc': "article.h1.text",
                              'strain_bool': True},
                'Crunchbase': {'url': 'https://news.crunchbase.com/',
                               'source': 'Crunchbase',
                               'url_list_query': "[item.a['href'] for item in self.base_soup.find_all('h2',class_=['entry-title h3','entry-title h5'])]",
                               'date_loc': "article.find('div', class_='meta-item herald-date').text",
                               'date_format': None,
                               'sum_loc': "article.find('div', class_='entry-content herald-entry-content').find_all('p')[1:]",
                               'title_loc': "article.find('h1', class_='entry-title h1').text",
                               'strain_bool': False},
                'Truck News': {'url': ['https://www.trucknews.com/news',
                                       'https://www.trucknews.com/news/page/2/'],
                               'source': 'Truck News',
                               'url_list_query': "[item.a['href'] for item in self.base_soup.find('ul', class_='media-list').find_all('h4')]",
                               'date_loc': "article.find('div', class_ = 'well').find('p').text.split('by')[0].strip()",
                               'date_format': None,
                               'sum_loc': "[p.text for p in article.find('div', class_ = 'the-content').find_all('p')]",
                               'title_loc': "article.find('h2').text.strip()",
                               'strain_bool': False},
                'Trucks.com': {'url': ['https://www.trucks.com/category/news/tech/autonomous-vehicles/',
                                       'https://www.trucks.com/category/editors-picks/'],
                               'source': 'Trucks.com',
                               'url_list_query': "[item.find(['h2','div'], attrs={'class':['title','h4']}).a['href'] for item in self.base_soup.find_all('div', attrs={'class':['content-block','cb-meta container-page-trucks']})]",
                               'date_loc': "article.find('div',class_='date-author').text.strip().split(' by')[0]",
                               'date_format': None,
                               'sum_loc': "article.find('section', attrs={'itemprop':'articleBody'}).find_all('p', attrs={'class':None})",
                               'title_loc': "article.h1.text",
                               'strain_bool': False},
                'TechCrunch': {'url': ['https://techcrunch.com/', 
                                       'https://techcrunch.com/page/2/', 
                                       'https://techcrunch.com/page/3/', 
                                       'https://techcrunch.com/page/4/'], 
                               'source': 'TechCrunch',
                               'strain_tag': 'a',
                               'strain_attr_name': 'class',
                               'strain_attr_value': 'post-block__title__link',
                               'url_list_query': "[item['href'] for item in self.base_soup.find_all('a', class_='post-block__title__link')]",
                               'date_loc': "url[23:33]",
                               'date_format': None,
                               'sum_loc': "article.find('div', attrs={'class':'article-content'}).find_all('p')",
                               'title_loc': "article.find('h1', attrs={'class':'article__title'}).text",
                               'strain_bool': True},
                'Charged EVs': {'url': ['https://chargedevs.com/category/newswire/', 'https://chargedevs.com/category/newswire/page/2/'],
                                'source': 'Charged EVs',
                                'strain_tag': 'h3',
                                'strain_attr_name': 'class',
                                'strain_attr_value': 'h2',
                                'url_list_query': '[item.a["href"] for item in self.base_soup.find_all("h3", class_="h2")]',
                                'date_loc': "article.find('time').text",
                                'date_format': None,
                                'sum_loc': "article.find('section',class_='entry-content clearfix').find_all('p')",
                                'title_loc': "article.find('h2', class_='page-title').text",
                                'strain_bool': True},
                'ARS Technica': {'url': 'https://arstechnica.com/cars/',
                                 'source': 'ARS Technica',
                                 'strain_tag': 'a',
                                 'strain_attr_name': 'class',
                                 'strain_attr_value': 'overlay',
                                 'url_list_query': "[item['href'] for item in self.base_soup.find_all('a', attrs={'class': 'overlay'})]",
                                 'date_loc': "article.find('time', attrs={'class':'date'}).text",
                                 'date_format': None,
                                 'sum_loc': "article.find('div', attrs={'itemprop':'articleBody'}).find_all('p', attrs={'class':None})",
                                 'title_loc': "article.h1.text",
                                 'strain_bool': True},
                'Venture Beat': {'url': 'https://venturebeat.com/category/transportation/',
                                 'source': 'Venture Beat',
                                 'url_list_query': "[item.a['href'] for item in self.base_soup.select('h2.article-title')]+[item.a['href'] for item in self.base_soup.select('article')]",
                                 'date_loc': "article.find('meta', attrs={'property':'article:published_time'})['content']",
                                 'date_format': None,
                                 'sum_loc': "[p.text for p in article.find('div', class_ = 'article-content').find_all('p')]",
                                 'title_loc': "article.find('h1').text",
                                 'strain_bool': False},
                'IEEE Spectrum': {'url': 'https://spectrum.ieee.org/transportation', 'source': 'IEEE Spectrum',
                                  'strain_tag': 'article',
                                  'strain_attr_name': 'class',
                                  'url_list_query': "['https://spectrum.ieee.org'+item.a['href'] for item in self.base_soup.find_all('article')]",
                                  'strain_attr_value': 'item sml_article transportation',
                                  'date_loc': "article.label.text",
                                  'date_format': '%d %b %Y | %H:%M GMT',
                                  'sum_loc': "article.find_all('p', limit=5)",
                                  'title_loc': "article.h1.text",
                                  'strain_bool': True},
                'Transport Topics': {'url': ['https://www.ttnews.com/government',
                                             'https://www.ttnews.com/government?page=1',
                                             'https://www.ttnews.com/government?page=2',
                                             'https://www.ttnews.com/government?page=3',
                                             'https://www.ttnews.com/business',
                                             'https://www.ttnews.com/business?page=1',
                                             'https://www.ttnews.com/business?page=2',
                                             'https://www.ttnews.com/technology',
                                             'https://www.ttnews.com/technology?page=1',
                                             'https://www.ttnews.com/technology?page=2',
                                             'https://www.ttnews.com/equipment',
                                             'https://www.ttnews.com/equipment?page=1',
                                             'https://www.ttnews.com/equipment?page=2'],
                                     'source': 'Transport Topics',
                                     'url_list_query': "['https://www.ttnews.com'+item.a['href'] for item in self.base_soup.find_all('h2', class_='content-access-1067')]",
                                     'date_loc': "article.find('span',class_='date-display-single')['content']",
                                     'date_format': None,
                                     'sum_loc': "[p for p in article.find_all('p') if p.text and len(p.text)>10]",
                                     'title_loc': "article.find('h1').text",
                                     'strain_bool': False},
                'GreenCarCongress': {'url': ['http://www.greencarcongress.com/', 'http://www.greencarcongress.com/page/2/'],
                                     'source': 'GreenCarCongress',
                                     'strain_tag': 'article',
                                     'strain_attr_name': 'class',
                                     'strain_attr_value': 'post entry',
                                     'url_list_query': "[item.a['href'] for item in self.base_soup.find_all('article', attrs={'class': 'post entry'})]",
                                     'date_loc': "article.find('span', attrs={'class':'entry-date'}).a.text",
                                     'date_format': None,
                                     'sum_loc': "article.find_all('p', limit=5)",
                                     'title_loc': "article.h2.a.text",
                                     'strain_bool': True},
                'Green Car Reports': {'url': 'https://www.greencarreports.com/news',
                                     'source': 'Green Car Reports',
                                     'strain_tag': 'div',
                                     'strain_attr_name': 'class',
                                     'strain_attr_value': 'right-side',
                                     'url_list_query': "['https://www.greencarreports.com' + item.h2.a['href'] for item in self.base_soup.find_all('div', attrs={'class': 'right-side'})]",
                                     'date_loc': "article.find('div', class_='by-line-comments-views-date').span.text",
                                     'date_format': None,
                                     'sum_loc': "article.find('div', class_='article_content').find_all('p', limit = 5)",
                                     'title_loc': "article.h1.text",
                                     'strain_bool': True},
                'The Fuse': {'url': 'http://energyfuse.org/category/autonomous-vehicles/',
                                     'source': 'The Fuse',
                                     'strain_tag': 'div',
                                     'strain_attr_name': 'class',
                                     'strain_attr_value': 'category-content-block active',
                                     'url_list_query': "[a['href'] for a in self.base_soup.find_all('a', class_='full-block-link')]",
                                     'date_loc': "article.h2.text.split('| ')[-1]",
                                     'date_format': None,
                                     'sum_loc': "article.find('div', class_='content-wrapper').find('p')",
                                     'title_loc': "article.h1.text",
                                     'strain_bool': True},
                'Business Wire': {'url': 'https://www.businesswire.com/portal/site/home/news/',
                                     'source': 'Business Wire',
                                     'strain_tag': 'a',
                                     'strain_attr_name': 'class',
                                     'strain_attr_value': 'bwTitleLink',
                                     'url_list_query': "['https://www.businesswire.com' + a['href'] for a in self.base_soup.find_all('a', class_='bwTitleLink') if '/en/' in a['href']]",
                                     'date_loc': "' '.join(article.find('time').text.split()[:3])",
                                     'date_format': None,
                                     'sum_loc': "['<p>' + p.text + '</p>' for p in article.find('div', class_='bw-release-story').find_all('p')]",
                                     'title_loc': "' '.join(article.h1.text.strip().split())",
                                     'strain_bool': True},
                'U.S. Department of Energy': {'url': 'https://www.energy.gov/listings/energy-news',
                                     'source': 'U.S. Department of Energy',
                                     'strain_tag': 'a',
                                     'strain_attr_name': 'class',
                                     'strain_attr_value': 'title-link',
                                     'url_list_query': "['https://www.energy.gov' + a['href'] for a in self.base_soup.find_all('a', class_='title-link')]",
                                     'date_loc': "article.find('div', class_='node-hero-date').text",
                                     'date_format': None,
                                     'sum_loc': "article.find('div', class_='field-items').find_all('p', limit = 3)",
                                     'title_loc': "article.h1.text",
                                     'strain_bool': True},
               
                'jmtonline': {'url': 'https://link.springer.com/journal/40534/onlineFirst/page/1',
                                     'source': 'Journal of Modern Transportation',
                                     'strain_tag': 'div',
                                     'strain_attr_name': 'class',
                                     'strain_attr_value': 'toc-item',
                                     'url_list_query': "['https://link.springer.com' + title.a['href'] for title in self.base_soup.find_all('div', class_='toc-item') if title.p.text == 'OriginalPaper']",
                                     'date_loc': "article.time.text",
                                     'date_format': None,
                                     'sum_loc': "article.find('section', class_='Abstract').p",
                                     'title_loc': "article.h1.text",
                                     'strain_bool': True},
          
                }

### Run the scrapers
Note: there will be a couple errors, especially with Autoblog. The scraper for that site still picks up a couple irrelevant items that it can't handle.

For testing a single scraper - only needed when adding new sites (don't want to run all of them over and over...)

In [37]:
scrape_specs = {}
scraypahs = {}
temp_start_time = time.time()

name = 'jmtonline'
scraypahs[name] = scraypah(scraper_dict[name])
scraypahs[name].get_urls()
scraypahs[name].scrape_em()
scrape_specs = print_results(scraypahs[name].source, scraypahs[name].scraped_count, scraypahs[name].skip_count,
                             scraypahs[name].too_old, scraypahs[name].relevant_df, round(
                                 time.time()-temp_start_time, 2),
                             scrape_specs)

0 Journal of Modern Transportation article(s) scraped
0 Journal of Modern Transportation article(s) skipped due to error
7 Journal of Modern Transportation article(s) skipped due to age
0 relevant article(s) collected


In [9]:
scrape_specs = {}
scraypahs = {}
start_time = time.time()

for site in list(scraper_dict.keys()):
    temp_start_time = time.time()
    print('\n'+site)
    scraypahs[site] = scraypah(scraper_dict[site])
    scraypahs[site].get_urls()
    scraypahs[site].scrape_em()
    scrape_specs = print_results(scraypahs[site].source, scraypahs[site].scraped_count, scraypahs[site].skip_count,
                                 scraypahs[site].too_old, scraypahs[site].relevant_df, round(
                                     time.time()-temp_start_time, 2),
                                 scrape_specs)


MIT
7 MIT article(s) scraped
0 MIT article(s) skipped due to error
25 MIT article(s) skipped due to age
0 relevant article(s) collected

SemEng
'NoneType' object is not callable: https://semiengineering.com/inferencing-at-the-edge/ 
date:2019-01-10
title:None
summary:[<p><span class="embed-youtube" style="text-align:center; display: block;"><iframe allowfullscreen="true" class="youtube-player" height="360" src="https://www.youtube.com/embed/1BTxwew--5U?version=3&amp;rel=1&amp;fs=1&amp;autohide=2&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent" style="border:0;" type="text/html" width="640"></iframe></span></p>]
4 Semiconductor Engineering article(s) scraped
1 Semiconductor Engineering article(s) skipped due to error
13 Semiconductor Engineering article(s) skipped due to age
2 relevant article(s) collected

Quartz
'NoneType' object has no attribute 'text': https://qz.com/1495144/chinas-self-driving-efforts-need-to-accelerate/ 
date:None
title:None
summary:Non



19 Recode article(s) scraped
0 Recode article(s) skipped due to error
16 Recode article(s) skipped due to age
2 relevant article(s) collected

GovTech
7 GovTech article(s) scraped
0 GovTech article(s) skipped due to error
40 GovTech article(s) skipped due to age
2 relevant article(s) collected

Reuters
61 Reuters article(s) scraped
0 Reuters article(s) skipped due to error
0 Reuters article(s) skipped due to age
3 relevant article(s) collected

CityLab




13 Citylab article(s) scraped
0 Citylab article(s) skipped due to error
8 Citylab article(s) skipped due to age
3 relevant article(s) collected

Autoblog
'NoneType' object has no attribute 'text': https://www.autoblog.com/2019/01/09/2019-jaguar-i-pace-review-quick-spin/ 
date:2019-01-09
title:None
summary:None
HTTPSConnectionPool(host='www.autoblog.comhttps', port=443): Max retries exceeded with url: //www.autoblog.com/photos/top-10-classic-cars-on-the-rise/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000000000BC0D240>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')): https://www.autoblog.comhttps://www.autoblog.com/photos/top-10-classic-cars-on-the-rise/ 
date:None
title:None
summary:None
HTTPSConnectionPool(host='www.autoblog.comhttps', port=443): Max 

# Summary

In [10]:
# Meta-data from the scrape session
scrape_specs_df = pd.DataFrame.from_dict(scrape_specs).T.reset_index()
scrape_specs_df['Time per relevant article'] = scrape_specs_df['Time spent'] / \
    scrape_specs_df['Relevant Articles']
display(scrape_specs_df)

# List all of the relevant news from each of the scrapers (each scraypah item has an attribute "relevant_df", which is a pandas
# dataframe with all of the selected news items from that website)
all_news_dfs = []
for key, value in scraypahs.items():
    all_news_dfs.append(value.relevant_df)

# Stack all of the articles into a single dataframe and do some cleaning (drop duplicate articles)
all_df = pd.concat(all_news_dfs)
all_df = all_df[['title', 'date', 'AFV', 'CAV', '21CTP', 'Hyperloop',
                 'summary', 'source', 'link']].sort_values('date', ascending=False)
all_df.drop_duplicates(subset='title', inplace=True)
all_df = all_df.replace('\$', '$', regex=True)

print('Smart Mobility articles found: {}'.format(
    all_df['CAV'].sum().astype(int)))
print('Alternative Fuel Vehicle articles found: {}'.format(
    all_df['AFV'].sum().astype(int)))
print('21CTP articles found: {}'.format(all_df['21CTP'].sum().astype(int)))
print('Hyperloop articles found: {}'.format(
    all_df['Hyperloop'].sum().astype(int)))

# Populate meta-data columns (helpful for searching all news items in the future if we want)
all_df['reason_for_tag'] = all_df.apply(which_keyword_found, axis=1)
all_df['keywords'] = all_df['title'].str.strip().apply(keyword_pull)

# Format for excel writing
AFV_news = all_df[all_df['AFV'] == 1].sort_values(
    'date', ascending=False).drop(['AFV', 'CAV', '21CTP', 'Hyperloop'], axis=1)
CAV_news = all_df[all_df['CAV'] == 1].sort_values(
    'date', ascending=False).drop(['AFV', 'CAV', '21CTP', 'Hyperloop'], axis=1)
truck_news = all_df[all_df['21CTP'] == 1].sort_values(
    'date', ascending=False).drop(['AFV', 'CAV', '21CTP', 'Hyperloop'], axis=1)
hyperloop_news = all_df[all_df['Hyperloop'] == 1].sort_values(
    'date', ascending=False).drop(['AFV', 'CAV', '21CTP', 'Hyperloop'], axis=1)

Unnamed: 0,index,Errors,Pages Scraped,Relevant Articles,Time spent,Too old,Time per relevant article
0,MIT,0.0,7.0,0.0,37.02,25.0,inf
1,Semiconductor Engineering,1.0,4.0,2.0,42.08,13.0,21.04
2,Quartz,1.0,0.0,0.0,7.57,9.0,inf
3,Recode,0.0,19.0,2.0,28.9,16.0,14.45
4,GovTech,0.0,7.0,2.0,29.66,40.0,14.83
5,Reuters,0.0,61.0,3.0,47.27,0.0,15.756667
6,Citylab,0.0,13.0,3.0,13.56,8.0,4.52
7,Autoblog,5.0,99.0,6.0,133.56,0.0,22.26
8,Electrek,0.0,43.0,6.0,29.13,1.0,4.855
9,The Verge,1.0,24.0,8.0,24.21,14.0,3.02625


Smart Mobility articles found: 90
Alternative Fuel Vehicle articles found: 0
21CTP articles found: 0
Hyperloop articles found: 1


### Write dataframe to a spreadsheet
CAVs on Monday, AFVs on Wednesday, 21CTP on Friday

In [12]:
if (dt.date.today().weekday() == scraper_sched['CAV']):
    print('Monday!')
    filename = f'cav_news_updates/{search_date}_cav_news_download.xls'
    CAV_news.to_excel(filename)
    if hyperloop_news.shape[0] > 0:
        filename2 = f'hyperloop_news_updates/{search_date}_hyperloop_news_download.xls'
        hyperloop_news.to_excel(filename2)
        print('Some hyperloop stuff!')
elif (dt.date.today().weekday() == scraper_sched['AFV']):
    print('Wednesday!')
    filename = f'afv_news_updates/{search_date}_afv_news_download.xls'
    AFV_news.to_excel(filename)
elif (dt.date.today().weekday() == scraper_sched['21CTP']):
    print('Friday!')
    filename = f'21CTP_news_updates/{search_date}_21CTP_news_download.xls'
    truck_news.to_excel(filename)

# # Open excel file to edit or add any additional news items
cwd = os.getcwd()
xls_file = cwd+'/'+filename

excel = win32.gencache.EnsureDispatch('Excel.Application')
excel.Visible = True

# # open the file
excel.Workbooks.Open(xls_file)

# # wait before closing
_ = input("Press enter to close Excel: ")
excel.Application.Quit()

Monday!
Some hyperloop stuff!


<win32com.gen_py.None.Workbook>

Press enter to close Excel:  


## Create Word file from the news update spreadsheets
Automatically does CAV on Mondays, AFV on Wednesdays, and 21CTP on Fridays. 

In [13]:
if dt.date.today().weekday() == scraper_sched['AFV']:
    print('AFV')
    gen_docx('AFV')
elif dt.date.today().weekday() == scraper_sched['CAV']:
    print('CAV')
    gen_docx('CAV')
elif dt.date.today().weekday() == scraper_sched['21CTP']:
    print('21CTP')
    gen_docx('21CTP')

CAV


## Update news item tracking and news scraper meta-data databases
Only run when **final** news item spreadsheet is saved in your working directory (i.e., after you have manually added other articles to the already-saved spreadsheet from the cell above)

In [14]:
# This is in case you go to upload this week's news items to the database, and realize you forgot to do last week's. Just replace
# all instances of "search_date" in the next cell with "last_week"and run it. Make sure you switch them all back to "search_date"..
last_week = str((pd.to_datetime(search_date) - dt.timedelta(days=7)).date())

In [15]:
conn = sqlite3.connect('news_updates.db')
if (dt.date.today().weekday() == scraper_sched['CAV']) & (~db_update):
    print('CAV')
    pd.read_excel('cav_news_updates/{}_cav_news_download.xls'.format(search_date)
                  ).to_sql('CAV', conn, if_exists='append', index=False)
    db_update = True
elif (dt.date.today().weekday() == scraper_sched['AFV']) & (~db_update):
    print('AFV')
    pd.read_excel('afv_news_updates/{}_afv_news_download.xls'.format(search_date)
                  ).to_sql('AFV', conn, if_exists='append', index=False)
    db_update = True
conn.close()

# This saves the meta-data from all of the scraper runs every Wednesday (print out "scrape_specs_df" to see what the meta-data includes)
if dt.date.today().weekday() == 2:
    conn = sqlite3.connect('news_updates_meta.db')
    scrape_specs_df.drop(['Time spent', 'Time per relevant article'], axis=1).to_sql(
        'news_updates_meta', conn, if_exists='append', index=False)
    conn.close()
    print('Uploaded metadata! So many datas!')

CAV


## Academic articles scraper (**NOTE** I only run this for CAVs, so only on Mondays)
Dumps all recently-published articles (in the past week) and their abstracts into a word file. Only does a few journals right now. Check your working directory for a file called *{date} papers.docx* after you run the cells below.

In [16]:
driver = webdriver.Chrome()

In [17]:
def paypuh_scraypuh(url, source):
    '''
    bad_egg: Missing a key component (usually abstract), so skip printout/tracking
    still_more: Date is still within past week, continue scraping!
    '''
    soup = grab_homepage(url)
    papers_to_scrape = [paper.a['href'] for paper in soup.find_all(
        'div', attrs={'class': 'pod-listing-header'})]
    still_more = True
    scraped_count = 0
    papers = {}

    for paper in papers_to_scrape:
        #bad_egg bool becomes True if publication date of article can't be accessed.
        bad_egg = False 
        if not still_more:
            break
        # Open article URL using selenium.
        driver.get(paper)
        # Get article publication date, title, and summary.
        try:
            driver.find_element_by_css_selector(
                "span[class='CollapseText']").click()
            date = pd.to_datetime(driver.find_element_by_css_selector(
                "dl[class='articleDates smh']").text.split('Available online ')[1]).date()
            soup = BeautifulSoup(driver.page_source, "html5lib")
            if (date - dt.date.today()).days > -max_age:
                title = soup.find('h1', class_='svTitle').text
                summary = soup.find(
                    'div', class_='abstract svAbstract ').p.text
            # If article is too old, set still_more = False.  This will break out of the for loop so that
            # paypuh_scraypuh() stops scraping papers from this source.
            else:
                still_more = False
        except:
            try:
                # Use selenium to "click" the "Show more" button to reveal the date.
                driver.find_element_by_css_selector(
                    "button[class='show-hide-details']").click()
                soup = BeautifulSoup(driver.page_source, "html5lib")
                date = pd.to_datetime(soup.find('div', class_='wrapper').p.text.split(
                    'Available online ')[1]).date()
                # If age of article is less than max_age, get the article title and summary.
                if (date - dt.date.today()).days > -max_age:
                    title = soup.find('span', class_='title-text').text
                    summary = soup.find('div', class_='abstract author').p.text
                else:
                    still_more = False
            # If publication date can't be accessed, skip article.
            except:
                bad_egg = True
                print('bad egg in {}: {}'.format(source, paper))
                pass
        scraped_count += 1

        if still_more and not bad_egg:
            papers[scraped_count] = {
                'title': title, 'summary': summary, 'link': paper, 'source': source, 'date': date}

    # Print the number of papers scraped from each source.
    print('{} new papers in {}'.format(scraped_count, source))

    return pd.DataFrame(papers).T

In [18]:
tparta = paypuh_scraypuh(
    'https://www.journals.elsevier.com/transportation-research-part-a-policy-and-practice/recent-articles', 'Transportation Part A')
tpartb = paypuh_scraypuh(
    'https://www.journals.elsevier.com/transportation-research-part-b-methodological/recent-articles', 'Transportation Part B')
tpartc = paypuh_scraypuh(
    'https://www.journals.elsevier.com/transportation-research-part-c-emerging-technologies/recent-articles', 'Transportation Part C')
tpartd = paypuh_scraypuh(
    'https://www.journals.elsevier.com/transportation-research-part-d-transport-and-environment/recent-articles', 'Transportation Part D')
tparte = paypuh_scraypuh(
    'https://www.journals.elsevier.com/transportation-research-part-e-logistics-and-transportation-review/recent-articles', 'Transportation Part E')
tpartf = paypuh_scraypuh(
    'https://www.journals.elsevier.com/transportation-research-part-f-traffic-psychology-and-behaviour/recent-articles', 'Transportation Part F')

11 new papers in Transportation Part A
4 new papers in Transportation Part B
2 new papers in Transportation Part C
2 new papers in Transportation Part D
6 new papers in Transportation Part E
3 new papers in Transportation Part F


In [20]:
week_o_papers = pd.concat([tparta, tpartb, tpartc, tpartd, tparte, tpartf])
# week_o_papers.to_excel('{} papers.xls'.format(search_date))
week_o_papers.dropna(how='all', axis=0, inplace=True)

In [21]:
newsdoc = docx.Document(docx='python_docx.docx')

for row in week_o_papers.reset_index(drop=True).T:
    row = week_o_papers.iloc[row, :]
    newsdoc.add_heading(row['title'], level=2)
    p = newsdoc.add_paragraph(row['summary'] + ' ')
    p.add_run('(')
    add_hyperlink(p, '{}'.format(row['link']), '{}'.format(row['source']))
    p.add_run(')')
newsdoc.save('{} papers.docx'.format(search_date))

<docx.text.paragraph.Paragraph at 0xe7dda20>

<docx.text.run.Run at 0xe7ddd68>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xd527048>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddcc0>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7d3ac8>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddf98>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7d3908>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dda58>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9908>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddc88>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9a48>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dd898>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9b88>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddd68>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9cc8>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dd8d0>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dda58>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddc88>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dd898>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddd68>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dd8d0>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dda58>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddc88>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dd898>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddd68>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dd8d0>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7dda58>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7ddc50>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddc88>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7e9e08>

<docx.text.run.Run at 0xe7e3ef0>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7e37f0>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7db148>

<docx.text.run.Run at 0xe7e3ef0>

<docx.text.paragraph.Paragraph at 0xc8cd400>

<docx.text.run.Run at 0xe7ddd68>

<Element {http://schemas.openxmlformats.org/wordprocessingml/2006/main}hyperlink at 0xe7db148>

<docx.text.run.Run at 0xe7e3ef0>

In [22]:
conn = sqlite3.connect('news_papers.db')
week_o_papers.to_sql('news_papers', conn, if_exists='append', index=False)
conn.close()

# For scraper development (no need to run)

In [None]:
headers

In [None]:
url = 'https://venturebeat.com/2018/09/19/renault-unveils-autonomous-delivery-concept-ez-pro-with-customizable-robo-pods/'
page = requests.get(url, headers = headers)
soup = BeautifulSoup(page.content, 'html5lib')

In [None]:
[item for item in soup.find('div', class_ = ['the-content','article-content']).find_all('p')]

In [None]:
[p for p in soup.find_all('p') if p.text and len(p.text)>10]

In [None]:
'strain_tag':'a', 
'strain_attr_name':'class', 
'strain_attr_value':'post-block__title__link',

In [None]:
test_dict = {'url':'https://venturebeat.com/category/transportation/', 
                                 'source': 'Venture Beat', 
                                 'url_list_query':"[item.a['href'] for item in self.base_soup.select('h2.article-title')]+[item.a['href'] for item in self.base_soup.select('article')]",
                                 'date_loc': "article.find('meta', attrs={'property':'article:published_time'})['content']", 
                                 'date_format':None,
                                 'sum_loc': "[p for p in article.find('div', class_ = ['the-content','article-content']).find_all('p')]",
                                 'title_loc':"article.find('h1').text", 
                                 'strain_bool':False}

In [None]:
scraper = scraypah(test_dict)
scraper.get_urls()
scraper.urls_to_scrape

In [None]:
def test_a_scraypah(attr_dict):
    start = time.time()
    scraper = scraypah(attr_dict)
    print(time.time()-start)
    scraper.get_urls()
    print(time.time()-start)
    scraper.scrape_em()
    print(time.time()-start)
    return scraper

In [None]:
with pd.option_context('display.max_colwidth', 100):
    print(ttopics.relevant_df['title'])

In [None]:
ttopics.relevant_df['summary'][0]

In [None]:
ttopics = test_a_scraypah(test_dict)

### Unused scrapers

In [None]:
'Engadget': {'url': ['https://www.engadget.com/tags/transportation/', 'https://www.engadget.com/tag/transportation/page/2/'],
                             'source': 'Engadget',
                             'strain_tag': 'a',
                             'strain_attr_name': 'class',
                             'strain_attr_value': 'o-hit__link',
                             'url_list_query': "['https://www.engadget.com'+item['href'] for item in self.base_soup.find_all('a', attrs={'class':'o-hit__link'})]",
                             'date_loc': "article.find('meta', attrs={'name':'published_at'})['content']",
                             'date_format': None,
                             'sum_loc': "article.find('div', attrs={'class':'container@m-'}).find_all('p')",
                             'title_loc': "article.title.text",
                             'strain_bool': True},
'NGV Global': {'url': 'http://www.ngvglobal.com/',
                               'source': 'NGV Global',
                               'strain_tag': 'h2',
                               'strain_attr_name': 'class',
                               'strain_attr_value': 'entry-title',
                               'url_list_query': "[item.a['href'] for item in self.base_soup.find_all('h2', attrs={'class':'entry-title'})]",
                               'date_loc': "article.find('time')['title']",
                               'date_format': None,
                               'sum_loc': "article.find('div', attrs={'class':'pf-content'}).find_all('p')",
                               'title_loc': "article.find('h1', attrs={'class':'entry-title'}).text",
                               'strain_bool': True},

# Change Log
* 8/29/2018: Added Citylab, Electrek, cleaned code
* 8/7/2018: Added Transport Reviews to academic paper scraper
* 7/30/2018: Fixed GovTech scraper
* 6/29/2018: Changed the whole scraper over to utilize a new class called *scraypah*. 
* 5/12/2018: Added Semiconductor Engineering scraper and academic articles scraper (~3 hours)
* 4/13/2018: Integrated word document production through python
* 3/19/2018: Added OEM/Gov section that quickly checks 17 sites for updates - only prints a notification that it needs to be checked if there are new updates from the past week
* 2/27/2018: Wrote a function *page_scan* to more efficiently create the relevant web page dictionary "profiles"
* 2/27/2018: Added 21CTP trucking news keywords to search for. Integrated functionality into existing web scraper.
* 2/14/2018: Added NGV Global scraper for AFV stuff
* 2/14/2018: Added fuel cells, hybrid, hybrid-electric, 'electric buses', 'electric truck', 'electric trucks', 'electric drive' to the search terms for AFVs...
* 1/31/2018: Added *print_results* function to streamline printed results for each scraper. Added counter to track #articles that were too old. Added meta-data tracking capability (dumps into SQL database every week)
* 1/31/2018: Split EV market analysis and web scraper into two different Notebooks
* 1/26/2018: Added Lexology scraper
* 1/19/2018: Fixed GreenCarCongress scraper (site redesign)
* 1/4/2018: Added Engadget scraper
* 1/4/2018: Added "replace_em" function to streamline removal of meaningless substrings from body text summaries
* 12/29/2017: Added Reuters, MITNews, and ARSTechnica scrapers. Did some streamlining in the EV Sales analysis
* 12/20/2017: Wrote up quick-guide to all the post-Python processing needed for the final News Update doc.
* 12/20/2017: Changed to .xls format. Had to import a different package to do so, but makes mail merge work better
* 12/13/2017: Fixed Trucks.com scraper - was pulling out the wrong date for each article (pulled a date from the sidebar...)
* 12/8/2017: Edited Trucks.com search so that it doesn't pick up paragraph tags that are actually image captions (added condition that "class = None")
* 12/8/2017: Added a bunch of comments, specifically in the first code segment ("IEEE Spectrum") for explanatory purposes
* 1/8/2019: Added Green Car Reports, DOE, Business Wire, and The Fuse.
* 1/10/2019: Uploaded ipynb to Energetics' GitHub (EICode).  Log entries can now be found via GitHub.