## Extracting article URL info from ESPN

In [78]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

In [7]:
def pprint(soup):
    """Pretty-print a BeautifulSoup object or tag."""
    print(soup.prettify())

In [79]:
html_page = requests.get("http://www.espn.com/nfl/team/_/name/buf/buffalo-bills")
soup = BeautifulSoup(html_page.content, 'lxml')

In [60]:
# All article links live inside <article> tags; keep only the ones that carry
# a data-id attribute, which drops extraneous articles.
articles = [art for art in soup.find_all('article') if 'data-id' in art.attrs]
print(len(articles))


25
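
The same filter can also be written as a CSS attribute selector. Below is a minimal sketch on a made-up HTML fragment (the markup is illustrative, not copied from ESPN); it uses the built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two feed articles with a data-id, one without.
sample_html = """
<article class="news-feed-item news-feed-story-package" data-id="a-1"></article>
<article class="news-feed-item news-now" data-id="a-2"></article>
<article class="promo"></article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# List-comprehension filter, as in the cell above.
by_attr = [art for art in sample_soup.find_all("article") if "data-id" in art.attrs]

# Equivalent CSS attribute selector: <article> tags that have a data-id attribute.
by_css = sample_soup.select("article[data-id]")

print(len(by_attr), len(by_css))
```

Both approaches should select the same tags; the selector form is simply shorter.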


## Types of articles 

Looking at the `class` attribute of every `article` tag, there appear to be three main types of articles:

* `news-feed-story-package`
* `news-now` / `news-feed-shortstop` (sometimes with `has-media` when media is attached)
* `video-standalone` / `video`

We only care about the `news-feed-story-package` articles.

(Attributes can be read directly from a tag object; for example, `article['class']` returns `['news-feed-item', 'news-feed-story-package']`.)
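
One way to survey the article types is to tally the class combinations. This is a small sketch on invented markup (the class names mirror the ones listed above, but the fragment itself is an assumption made for illustration):

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical fragment with the class combinations described above.
sample_html = """
<article class="news-feed-item news-feed-story-package" data-id="a-1"></article>
<article class="news-feed-item news-now news-feed-shortstop" data-id="a-2"></article>
<article class="news-feed-item news-now news-feed-shortstop has-media" data-id="a-3"></article>
<article class="news-feed-item news-feed-story-package" data-id="a-4"></article>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Beautiful Soup returns `class` as a list, so convert it to a tuple to make it hashable.
class_counts = Counter(tuple(art["class"]) for art in sample_soup.find_all("article"))

for classes, count in class_counts.most_common():
    print(count, " ".join(classes))
```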

In [61]:
for art in articles:
    print('---'*30)
    print(art['class'])
    print(art.a['class'])

------------------------------------------------------------------------------------------
['news-feed-item', 'news-feed-story-package']
['story-link']
------------------------------------------------------------------------------------------
['news-feed-item', 'news-feed-story-package']
['story-link']
------------------------------------------------------------------------------------------
['news-feed-item', 'news-now', 'news-feed-shortstop']
['btn-social', 'sm', 'icon-font-before', 'icon-facebook-solid-before', 'Shortstop']
------------------------------------------------------------------------------------------
['news-feed-item', 'news-now', 'news-feed-shortstop']
['btn-social', 'sm', 'icon-font-before', 'icon-facebook-solid-before', 'Shortstop']
------------------------------------------------------------------------------------------
['news-feed-item', 'news-now', 'news-feed-shortstop']
['btn-social', 'sm', 'icon-font-before', 'icon-facebook-solid-before', 'Shortstop']
---------

---

Let's find the desired info for the first article in the HTML below.


Results:
The attributes of the top-level `a` tag (class `story-link`) appear to hold all the info we need, with the full URL in `data-popup-href`. The `a` tag with class `realStory` carries the same relative (domain-less) `href`, so it adds nothing new.

The top-level `div` tag (class `text-container no-headlines`) seems to hold any remaining useful info, but the only relevant piece appears to be the text of the `span` tag with class `author`.
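
As a rough illustration of reading those attributes, the sketch below uses a simplified, hypothetical article (the URL and author name are placeholders). It looks tags up by class and uses `Tag.get`, which returns `None` rather than raising `KeyError` when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical version of one story-package article.
demo_html = """
<article class="news-feed-item news-feed-story-package" data-id="demo-1">
  <a class="story-link" data-id="demo-1" data-sport="nfl"
     data-popup-href="http://example.com/story/demo-1" href="/story/demo-1"></a>
  <div class="text-container no-headlines">
    <span class="timestamp">3d</span><span class="author">Jane Doe</span>
  </div>
</article>
"""
demo_article = BeautifulSoup(demo_html, "html.parser").article

# Look the tags up by class instead of relying on their position in the tree.
demo_link = demo_article.find("a", class_="story-link")
demo_author = demo_article.find("span", class_="author")

# Guard against tags or attributes that are absent.
demo_url = demo_link.get("data-popup-href") if demo_link else None
demo_author_name = demo_author.get_text(strip=True) if demo_author else None
print(demo_url, demo_author_name)
```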
 

In [63]:
article = articles[0]
for child in article.children:
    print(child.name)
print()
pprint(article)

a
figure
div

<article class="news-feed-item news-feed-story-package" data-id="buffalo-bills-32874">
 <a class="story-link" data-id="buffalo-bills-32874" data-popup-href="http://espn.com/blog/buffalo-bills/post/_/id/32874/why-bills-lorenzo-alexander-stays-in-frigid-buffalo-in-offseason" data-sport="nfl" href="/blog/buffalo-bills/post/_/id/32874/why-bills-lorenzo-alexander-stays-in-frigid-buffalo-in-offseason" name="&amp;lpos=nfl:feed:xx:news">
 </a>
 <figure class="feed-item-figure ">
  <div class="img-wrap">
   <a data-mptype="image" data-sport="nfl" href="/blog/buffalo-bills/post/_/id/32874/why-bills-lorenzo-alexander-stays-in-frigid-buffalo-in-offseason" name="&amp;lpos=nfl:feed:xx:news">
    <picture>
     <source data-srcset="https://a4.espncdn.com/combiner/i?img=%2Fphoto%2F2019%2F0206%2Fr498444_1296x518_5%2D2.jpg&amp;w=375&amp;h=150&amp;scale=crop&amp;cquality=80&amp;location=origin, https://a4.espncdn.com/combiner/i?img=%2Fphoto%2F2019%2F0206%2Fr498444_1296x518_5%2D2.jpg&amp;w=7

In [77]:
article_info = {}

# if article['class'] == ['news-feed-item', 'news-feed-story-package']:
    
for child in article.children:
    if child.name == 'a':
        article_info['class'] = child['class'][0]
        article_info['data-id'] = child['data-id']
        article_info['url'] = child['data-popup-href']
        article_info['sport'] = child['data-sport']
    if child.name == 'div':
        for span in child.div.div.children:  # expected: a timestamp span and an author span
            if 'timestamp' in span['class']:  # Beautiful Soup always returns class as a list (NOT a string)
                article_info['timestamp'] = span.string
            elif 'author' in span['class']:
                article_info['author'] = span.string

article_info

{'author': 'Mike Rodak',
 'class': 'story-link',
 'data-id': 'buffalo-bills-32874',
 'sport': 'nfl',
 'timestamp': '3d',
 'url': 'http://espn.com/blog/buffalo-bills/post/_/id/32874/why-bills-lorenzo-alexander-stays-in-frigid-buffalo-in-offseason'}

---
Now that we have the logic set up for one article, let's write the full logic to extract all the articles on a page.


In [86]:
def extract_articles(soup, teamname): 
    articles = [art for art in soup.find_all('article') if 'data-id' in art.attrs.keys()]

    articles_list = []
    for article in articles:
        if 'news-feed-story-package' in article['class']: # should be ['news-feed-item', 'news-feed-story-package']
            article_info = {}
            article_info['teamname'] = teamname
            for child in article.children:
                if child.name == 'a':
                    article_info['class'] = child['class'][0]
                    article_info['data-id'] = child['data-id']
                    article_info['url'] = child['data-popup-href']
                    article_info['sport'] = child['data-sport']
                if child.name == 'div':
                    for span in child.div.div.children:  # expected: a timestamp span and an author span
                        if 'timestamp' in span['class']:  # Beautiful Soup always returns class as a list (NOT a string)
                            article_info['timestamp'] = span.string
                        elif 'author' in span['class']:
                            article_info['author'] = span.string

        # NOTE: this append sits outside the `if` above, so it also runs for
        # non-story articles and re-appends the previous article_info dict.
        articles_list.append(article_info)
    
    # convert list of dictionaries into dataframe
    df = pd.DataFrame(articles_list)
    return df

def get_df_from_teamname_link(link):
    teamname = link.split('/')[-1]  # last path segment of the URL, e.g. 'buffalo-bills'
    html_page = requests.get(link)
    soup = BeautifulSoup(html_page.content, 'lxml')
    df = extract_articles(soup, teamname)
    return df

In [87]:
link = "http://www.espn.com/nfl/team/_/name/buf/buffalo-bills"
df = get_df_from_teamname_link(link)
df

|   | author | class | data-id | sport | teamname | timestamp | url |
|---|---|---|---|---|---|---|---|
| 0 | Mike Rodak | story-link | buffalo-bills-32874 | nfl | buffalo-bills | 3d | http://espn.com/blog/buffalo-bills/post/_/id/3... |
| 1 | Jeremy Willis | story-link | 25932328 | nfl | buffalo-bills | 4d | http://www.espn.com/nfl/story/_/id/25932328/nf... |
| 2 | Jeremy Willis | story-link | 25932328 | nfl | buffalo-bills | 4d | http://www.espn.com/nfl/story/_/id/25932328/nf... |
| 3 | Jeremy Willis | story-link | 25932328 | nfl | buffalo-bills | 4d | http://www.espn.com/nfl/story/_/id/25932328/nf... |
| 4 | Jeremy Willis | story-link | 25932328 | nfl | buffalo-bills | 4d | http://www.espn.com/nfl/story/_/id/25932328/nf... |
| 5 | Bill Barnwell | story-link | 25834281 | nfl | buffalo-bills | 12d | http://www.espn.com/nfl/story/_/id/25834281/pr... |
| 6 | Bill Barnwell | story-link | 25834281 | nfl | buffalo-bills | 12d | http://www.espn.com/nfl/story/_/id/25834281/pr... |
| 7 | ESPN | story-link | nflnation-292685 | nfl | buffalo-bills | 16d | http://espn.com/blog/nflnation/post/_/id/29268... |
| 8 | Mike Rodak | story-link | buffalo-bills-32847 | nfl | buffalo-bills | 16d | http://espn.com/blog/buffalo-bills/post/_/id/3... |
| 9 | Mike Rodak | story-link | buffalo-bills-32849 | nfl | buffalo-bills | 17d | http://espn.com/blog/buffalo-bills/post/_/id/3... |
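
Rows 1 to 4 of the output above are identical. That matches what the loop in `extract_articles` does: the `append` call runs for every article, so each non-story article re-appends the dictionary built for the previous story. The sketch below is one possible variant that appends only inside the story-package branch. The function name `extract_story_packages` and the sample markup are assumptions made for illustration, and tags are looked up by class rather than by position.

```python
from bs4 import BeautifulSoup

def extract_story_packages(soup, teamname):
    """Return one dict per `news-feed-story-package` article (hypothetical variant)."""
    rows = []
    # attrs={"data-id": True} matches any <article> that has a data-id attribute.
    for article in soup.find_all("article", attrs={"data-id": True}):
        if "news-feed-story-package" not in article.get("class", []):
            continue  # skip shortstop and video items entirely
        link = article.find("a", class_="story-link")
        if link is None:
            continue  # no story link, nothing to record
        author = article.find("span", class_="author")
        timestamp = article.find("span", class_="timestamp")
        rows.append({
            "teamname": teamname,
            "data-id": link.get("data-id"),
            "url": link.get("data-popup-href"),
            "sport": link.get("data-sport"),
            "author": author.get_text(strip=True) if author else None,
            "timestamp": timestamp.get_text(strip=True) if timestamp else None,
        })
    return rows

# Invented markup: a story, two shortstop items, then another story.
sample_html = """
<article class="news-feed-item news-feed-story-package" data-id="s-1">
  <a class="story-link" data-id="s-1" data-sport="nfl" data-popup-href="http://example.com/s-1"></a>
  <div><span class="timestamp">3d</span><span class="author">Jane Doe</span></div>
</article>
<article class="news-feed-item news-now news-feed-shortstop" data-id="n-1">
  <a class="btn-social"></a>
</article>
<article class="news-feed-item news-now news-feed-shortstop" data-id="n-2">
  <a class="btn-social"></a>
</article>
<article class="news-feed-item news-feed-story-package" data-id="s-2">
  <a class="story-link" data-id="s-2" data-sport="nfl" data-popup-href="http://example.com/s-2"></a>
  <div><span class="timestamp">4d</span><span class="author">John Roe</span></div>
</article>
"""
rows = extract_story_packages(BeautifulSoup(sample_html, "html.parser"), "demo-team")
print(len(rows), [row["data-id"] for row in rows])
```

The returned list of dictionaries can be passed to `pd.DataFrame(rows)` in the same way as `articles_list`.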


## Final Notes

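* `timestamp` is a relative age (e.g. `3d`), so it doesn't seem very useful
* `data-id` seems to carry a higher-level prefix that indicates the type of article (e.g. `buffalo-bills-...` or `nflnation-...`)
* there are many duplicates, most likely because `articles_list.append(article_info)` sits outside the `if` block in `extract_articles` and re-appends the previous dictionary for each non-story article; indenting that line should avoid them, and `df.drop_duplicates()` removes them after the fact.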
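
The last two notes can be tried on a handful of rows. This is a minimal sketch: the rows reuse `author` and `data-id` values from the table above (URLs are left out because they are truncated there), and the `id_prefix` column name and the regular expression are assumptions about how the ids are structured.

```python
import pandas as pd

# A few rows with values taken from the output above.
demo_df = pd.DataFrame([
    {"author": "Mike Rodak", "data-id": "buffalo-bills-32874"},
    {"author": "Jeremy Willis", "data-id": "25932328"},
    {"author": "Jeremy Willis", "data-id": "25932328"},
    {"author": "ESPN", "data-id": "nflnation-292685"},
])

# Keep the first occurrence of each article id.
deduped = demo_df.drop_duplicates(subset="data-id").reset_index(drop=True)

# Split "prefix-number" ids; purely numeric ids have no prefix and become NaN.
deduped["id_prefix"] = deduped["data-id"].str.extract(r"^(.*)-\d+$", expand=False)
print(deduped)
```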
