# Introduction to Web Scraping, Part Two (Python)
- UMN LATIS & Libraries workshop, Dec 3, 2021
- Cody Hennesy (chennesy@umn.edu) and Michael Beckstrand (mjbeckst@umn.edu)

In this part of the workshop, we'll explore reproducible web scraping methods using Python. 

Specifically, we will:
* Use Python 3 in a Jupyter computing environment
* Use the Requests and BeautifulSoup Python libraries to access HTML data from the web
* Create variables, lists and loops to work with web data in Python

Credits: Content for this workshop was adapted from [Rochelle Terman's Web Scraping workshop](https://github.com/rochelleterman/scrape-interwebz) and from [Software Carpentry Python lessons](http://swcarpentry.github.io/python-novice-inflammation/).

### Why Python? 
- Reproducible
- Repeatable
- Extensible
- Great for data access and data cleaning

### What's Jupyter?
- Web-based, easy to share
- Easy to read, easy to run
- Run code piece by piece

## Python variables
- You can use Python as a calculator. 
- To "run" a Jupyter cell, hold down Shift and press Return/Enter, or choose the "play icon" (right-facing triangle) from the Jupyter menu above. 

In [None]:
weight_kg = 60

In [None]:
print(weight_kg)

In Python, variable names:

* can include letters, digits, and underscores
* cannot start with a digit
* are case sensitive.

You can do calculations and save text strings in variables too.

In [None]:
website = "All the words on a website"
print(website)
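
As a small illustration of doing a calculation with a variable, the sketch below converts the weight above to pounds. The name `weight_lb` and the approximate 2.2 conversion factor are assumptions added for illustration; they are not part of the original notebook.

```python
# store a number in a variable
weight_kg = 60

# use the variable in a calculation and save the result in a new variable
# (2.2 is an approximate kg-to-lb conversion factor, used here for illustration)
weight_lb = 2.2 * weight_kg

# print() accepts several values separated by commas
print('weight in pounds:', weight_lb)
```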

## Importing Libraries
Importing a library is like getting a piece of lab equipment out of a storage locker and setting it up on the bench. Libraries provide additional functionality to the basic Python package, much like a new piece of equipment adds functionality to a lab space. Just like in the lab, importing too many libraries can sometimes complicate and slow down your programs - so we only import what we need for each program.

Our primary tools will be the [Requests library](http://docs.python-requests.org/en/latest/user/quickstart/)
and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). 

Note: Another popular tool for web scraping with Python is [Scrapy](https://scrapy.org/). The general consensus is that it's a faster tool, and it does *more* than Beautiful Soup, but it might be more complex to learn.

In [1]:
import requests
from bs4 import BeautifulSoup

### Library functions
The expression ```requests.get(...)``` is a function call that asks Python to run the function ```get``` which belongs to the ```requests``` library. 

This dotted notation is used everywhere in Python: the thing that appears before the dot contains the thing that appears after.

As an example, we could use the dot notation to write the relationship between Minneapolis and Minnesota as ```Minnesota.Minneapolis```, just as *get* is a function that belongs to the *requests* library.

In [2]:
requests.get('http://www.startribune.com/')

<Response [200]>

#### What did we do above?
1. Created a Python HTTP request object for a GET
2. Sent the HTTP request to the webserver at http://www.startribune.com/
3. Received the response ```[200]``` from http://www.startribune.com/ - [what's that mean?](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
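
To see what a status code means without leaving Python, one option is the standard library's `http.HTTPStatus` lookup. This is a minimal sketch that makes no network request; the codes chosen below are just examples.

```python
from http import HTTPStatus

# look up the standard phrase for a few common status codes
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)

# a Response object from requests exposes the same number as .status_code,
# so a check like `response.status_code == 200` tells you the request succeeded
ok_phrase = HTTPStatus(200).phrase
```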

In Jupyter notebooks using Python you can explore functions of a library using the *tab* key.

And to understand each function you can get information by putting a question mark after it:

In [3]:
requests.get?

Signature: requests.get(url, params=None, **kwargs)
Docstring:
Sends a GET request.

:param url: URL for the new :class:`Request` object.
:param params: (optional) Dictionary, list of tuples or bytes to send
    in the query string for the :class:`Request`.
:param \*\*kwargs: Optional arguments that ``request`` takes.
:return: :class:`Response <Response>` object
:rtype: requests.Response
File:      /anaconda3/lib/python3.7/site-packages/requests/api.py
Type:      function


You can store the data that is returned from the GET request in a variable:

In [5]:
star_trib = requests.get('http://www.startribune.com/')
print(star_trib)

<Response [200]>


### Ethical scraping
One way to make sure you're engaging in transparent and ethical scraping practices is to send the website information about *yourself* along with your request. 

```requests.get``` accepts a ```headers=``` parameter that you can use to send your name, your contact information, and the software you're using to collect data:

In [6]:
headers = {'user-agent': 'python-requests/2.22.0; chennesy@umn.edu; Cody Hennesy'}
star_trib = requests.get('http://www.startribune.com/', headers=headers)
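
If you want to check what will be sent before contacting the server, a request can be built and inspected without sending it. The sketch below uses `requests.Request(...).prepare()` for that; the contact details are placeholders (assumptions) that you should replace with your own.

```python
import requests

# placeholder contact details (hypothetical): replace with your own
contact = 'Your Name; you@example.edu'

# requests.utils.default_user_agent() returns a string like 'python-requests/2.x.y'
user_agent = requests.utils.default_user_agent() + '; ' + contact
headers = {'user-agent': user_agent}

# build the request without sending it, so nothing touches the network
prepared = requests.Request('GET', 'http://www.startribune.com/', headers=headers).prepare()

# header names are case-insensitive, so 'User-Agent' finds 'user-agent'
print(prepared.method, prepared.url)
print(prepared.headers['User-Agent'])
```

Building the string from `default_user_agent()` avoids hard-coding a library version that may not match the one you have installed.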

Now you can explore the attributes of the data object stored in ```star_trib``` using the same dot notation. 

Use tab to explore the options, and the question mark to read more about the attribute.

```star_trib.text```, for example.

In [7]:
src = star_trib.text

Let's move the .text content that was returned from the Request into a BeautifulSoup object so we can start to explore the HTML tree.

In [8]:
# parse the response into an HTML tree by calling BeautifulSoup
soup = BeautifulSoup(src, 'lxml')

# look at what it looks like now, using the soup.prettify tool
# [:1000] will give us the first 1000 characters in the soup object so it doesn't fill up the whole screen
print(soup.prettify()[:1000])

<!DOCTYPE html>
<!--[if IE 8 ]>    <html dir="ltr" lang="en-US" class="no-js ie8 oldie"> <![endif]-->
<!--[if IE 9]><html lang="en" class="ie ie9"><![endif]-->
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <script>
   window.zeusAdUnitPath ='/7932/website/web_homepage';
  </script>
  <link href="startribune.zeustechnology.com" rel="dns-prefetch"/>
  <link href="securepubads.g.doubleclick.net" rel="dns-prefetch"/>
  <link href="static.doubleclick.net" rel="dns-prefetch"/>
  <link href="ib.adnxs.com" rel="dns-prefetch"/>
  <link href="as-sec.casalemedia.com" rel="dns-prefetch"/>
  <link href="js-sec.indexww.com" rel="dns-prefetch"/>
  <link href="ox-ui.mst.servedbyopenx.com" rel="dns-prefetch"/>
  <link href="hbopenbid.pubmatic.com" rel="dns-prefetch"/>
  <link href="fastlane.rubiconproject.com" rel="dns-prefetch"/>
  <link href="ap.lijit.com" rel="dns-prefetch"/>
  <link href="tlx.3lift.com" rel="dns-prefetch"/>
  <link href="c.amazo

## Find Elements

BeautifulSoup has a number of functions to find things on a page. Like other web scraping tools, Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

#### HTML tags
Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree for all the elements with a particular HTML tag, and returns all of those elements.

What does the example below do?

In [14]:
# find all elements in a certain tag
soup.find_all("a")

In [15]:
soup.find_all('p')

In [16]:
soup.find_all('h3')

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [13]:
soup.find_all("a")
soup("a")

#### HTML Attributes 
If you search for everything with the `a` tag, you're likely to get a lot of stuff, much of which you don't want. What if we wanted to search for HTML tags ONLY with certain attributes, like particular CSS classes? 

We can do this by passing an additional argument to `find_all`, like this: ```soup("a", class_="class_name")```
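
Here is a self-contained sketch of filtering by class. It parses a small made-up HTML snippet (an assumption, not the real Star Tribune page) with Python's built-in `html.parser`, so it runs without a network connection or `lxml`.

```python
from bs4 import BeautifulSoup

# a tiny, made-up page used only for illustration
html = """
<ul>
  <li><a class="feed-list-link" href="/story-one">Story one</a></li>
  <li><a class="feed-list-link" href="/story-two">Story two</a></li>
  <li><a class="nav-link" href="/about">About</a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# every 'a' tag, using both the full method name and the shortcut
all_links = demo_soup.find_all('a')
same_links = demo_soup('a')

# only the 'a' tags whose class is feed-list-link
# (class_ has a trailing underscore because class is a reserved word in Python)
feed_links = demo_soup('a', class_='feed-list-link')
feed_hrefs = [a['href'] for a in feed_links]

print(len(all_links), len(same_links), len(feed_links))
print(feed_hrefs)
```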

### Challenge (1): Finding the Most Read and Emailed articles on the Star Tribune homepage
We can use Chrome's *Inspect* feature to find the class name for the Most Read and Most Emailed article lists (`feed-list-link`). 

**DEMO using Chrome Inspect to look at Headlines**

1. Let's create a variable called most_read, and use ```soup()``` to find all of the links with the appropriate class
2. Then we'll print out the matches below


In the example below, we are finding all the `a` tags, and then filtering those with `class_="feed-list-link"`.

In [17]:
# Get only the 'a' tags with the 'feed-list-link' class
most_read = soup("a", class_="feed-list-link")
print(most_read)

[<a class="feed-list-link" data-linkname="DNR: Walleye poachers near Baudette were 48 fish over limit" data-linktype="headline" data-modulename="most-read - n-a" data-moduletype="zone2-most-read" data-position="2-1" href="https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/">
                    DNR: Walleye poachers near Baudette were 48 fish over limit
                </a>, <a class="feed-list-link" data-linkname="Two charged with fatally beating pregnant woman whose body was found in Uptown" data-linktype="headline" data-modulename="most-read - n-a" data-moduletype="zone2-most-read" data-position="2-2" href="https://www.startribune.com/two-charged-with-fatally-beating-pregnant-woman-whose-body-was-found-in-uptown/600117428/">
                    Two charged with fatally beating pregnant woman whose body was found in Uptown
                </a>, <a class="feed-list-link" data-linkname="'The Bachelorette' brings her suitors — includin

#### CSS Selectors
It can be more efficient to search and find things on a website by **CSS selector.** For this we have to use a different method, `select()`. Just pass a valid CSS selector to `.select()` as a string, and it returns all of the elements that match.

In the example above, we can use "a.feed-list-link" as a CSS selector, which returns all `a` tags with class `feed-list-link`.
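
To illustrate, the sketch below runs `select()` on a small made-up snippet (an assumption standing in for the real page) and compares it with the `find_all` approach. It also shows `select_one()`, which returns only the first match, or `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# made-up HTML used only for illustration
html = """
<div id="most-read">
  <a class="feed-list-link" href="/story-one">Story one</a>
  <a class="feed-list-link" href="/story-two">Story two</a>
</div>
<a class="nav-link" href="/about">About</a>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# the same two links, found two different ways
by_selector = demo_soup.select('a.feed-list-link')
by_find_all = demo_soup.find_all('a', class_='feed-list-link')
print(len(by_selector), len(by_find_all))

# selectors can also express nesting: 'a' tags inside the element with id="most-read"
nested = demo_soup.select('#most-read a')

# select_one gives a single Tag (or None) instead of a list
first_match = demo_soup.select_one('a.feed-list-link')
no_match = demo_soup.select_one('a.does-not-exist')
print(first_match['href'], no_match)
```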

##### How to find selectors?
A number of browser extensions and other tools exist to help you find HTML and CSS. 
- [Selector Gadget](https://selectorgadget.com/)
- [CSS Selector Helper](https://chrome.google.com/webstore/detail/css-selector-helper-for-c/gddgceinofapfodcekopkjjelkbjodin?hl=en)


In [18]:
# get elements with the "a.feed-list-link" CSS selector
most_read_select = soup.select("a.feed-list-link")

In [19]:
most_read_select

[<a class="feed-list-link" data-linkname="DNR: Walleye poachers near Baudette were 48 fish over limit" data-linktype="headline" data-modulename="most-read - n-a" data-moduletype="zone2-most-read" data-position="2-1" href="https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/">
                     DNR: Walleye poachers near Baudette were 48 fish over limit
                 </a>,
 <a class="feed-list-link" data-linkname="Two charged with fatally beating pregnant woman whose body was found in Uptown" data-linktype="headline" data-modulename="most-read - n-a" data-moduletype="zone2-most-read" data-position="2-2" href="https://www.startribune.com/two-charged-with-fatally-beating-pregnant-woman-whose-body-was-found-in-uptown/600117428/">
                     Two charged with fatally beating pregnant woman whose body was found in Uptown
                 </a>,
 <a class="feed-list-link" data-linkname="'The Bachelorette' brings her suitors — in

### Python Lists

The most popular kind of data collection in Python is the list, which takes the place of arrays in programming languages like C and Fortran.
Lists have two important characteristics:
1. They are mutable, i.e., they can be changed after they are created.
2. They are heterogeneous, i.e., they can store values of many different types.

To create a new list, you can just put some values in square brackets with commas in between.

In [20]:
my_list = ['red', 'orange', 'yellow']
my_list

['red', 'orange', 'yellow']

To fetch the element at a specific location, put the *index* of that location in square brackets. But keep in mind that Python lists start the index from 0. So the list above has three index values: ```my_list[0] my_list[1] my_list[2]```

In [21]:
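my_list[1]  # valid: the second item in the list
my_list[3]  # invalid: the last valid index is 2, so this raises an IndexError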

IndexError: list index out of range
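
The two list properties and the zero-based indexing described above can be checked with a short sketch. The list contents are arbitrary examples, and the `try`/`except` is one way to handle an out-of-range index without stopping the notebook.

```python
colors = ['red', 'orange', 'yellow']

# mutable: items can be replaced or added after the list is created
colors[0] = 'crimson'
colors.append('green')

# heterogeneous: one list can hold values of different types
mixed = ['red', 3, 2.5, True]

# valid indexes run from 0 to len(colors) - 1
last_index = len(colors) - 1
print(colors[0], colors[last_index])

# asking for an index past the end raises IndexError
try:
    colors[len(colors)]
except IndexError as err:
    message = str(err)
    print('IndexError:', message)
```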

Let's go back to all of the `a` links with the class ```feed-list-link```.

In [22]:
#this is a Python list
most_read = soup.select("a.feed-list-link")

You can see how many items are in your list using the ```len()``` Python function.

In [23]:
len(most_read)

10

And you can look at the first element in the list using the syntax variable[0]. 

Note: [0] refers to the first element in a list in Python, and [1] refers to the second.

In [24]:
most_read[0]

<a class="feed-list-link" data-linkname="DNR: Walleye poachers near Baudette were 48 fish over limit" data-linktype="headline" data-modulename="most-read - n-a" data-moduletype="zone2-most-read" data-position="2-1" href="https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/">
                    DNR: Walleye poachers near Baudette were 48 fish over limit
                </a>

We can use a built-in Python function called `type()` to explore the results.

In [25]:
# save the first element in the list to its own variable to make it easier to explore
first_link = most_read[0]

# check out its class
type(first_link)

bs4.element.Tag

It's a tag! If we look up Tag in the BeautifulSoup documentation, we know that we can use `.text` to look at the text.

In [26]:
first_link.text

'\n                    DNR: Walleye poachers near Baudette were 48 fish over limit\n                '

We can also look at the href attribute to check out the URL:

In [27]:
first_link['href']

'https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/'
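
A `Tag` offers a few related ways to read its text and attributes. The sketch below uses a made-up link (an assumption, not taken from the live page) to compare them; `.get()` is handy because it returns `None` rather than raising a `KeyError` when an attribute is missing.

```python
from bs4 import BeautifulSoup

# a made-up link with extra whitespace around its text
html = '<a class="feed-list-link" href="/story-one">\n   Story one\n</a>'
link = BeautifulSoup(html, 'html.parser').a

raw_text = link.text                     # includes the surrounding whitespace
clean_text = link.get_text(strip=True)   # whitespace removed
href = link['href']                      # raises KeyError if the attribute is missing
missing = link.get('data-position')      # returns None if the attribute is missing
classes = link['class']                  # class can hold several values, so this is a list

print(repr(raw_text))
print(clean_text, href, missing, classes)
```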

### Loops

If we want to explore all of the most popular articles, we can loop through each link and only grab the information that we care about. 

#### Note the syntax: 

```
for x in y:
    do_something # the code in the loop needs to be indented
    do_another_thing
```
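
A runnable version of that pattern, using a made-up list of headlines (an assumption for illustration):

```python
headlines = ['  First headline ', 'Second headline\n', ' Third headline']

for headline in headlines:
    # both indented lines run once for each item in the list
    cleaned = headline.strip()
    print(cleaned, '\n')

# this line is not indented, so it runs once, after the loop has finished
print('Done:', len(headlines), 'headlines')
```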

For 'a' tags, we also know there's an 'href' attribute that tells us where the link URL goes. 

We'll clean up the print output in Python by using a built-in string method, .strip(), that removes extra white space from the start and end of a string, and by adding some line breaks between elements using the ```'\n'``` escape character.

In [28]:
for link in most_read:
    print(link.text.strip(), '\n', link['href'], '\n')

DNR: Walleye poachers near Baudette were 48 fish over limit 
 https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/ 

Two charged with fatally beating pregnant woman whose body was found in Uptown 
 https://www.startribune.com/two-charged-with-fatally-beating-pregnant-woman-whose-body-was-found-in-uptown/600117428/ 

'The Bachelorette' brings her suitors — including 'Minnesota Joe' — home to Minnesota 
 https://www.startribune.com/the-bachelorette-brings-her-suitors-including-minnesota-joe-home-to-minnesota/600117702/ 

Minnesota preps COVID booster shot expansion for all adults 
 https://www.startribune.com/minnesota-urges-booster-expansion-school-protections-against-worsening-covid-19-wave/600117372/ 

Minnesota has worst 7-day rate of new COVID cases in U.S. 
 https://www.startribune.com/minnesota-has-nation-s-worst-7-day-rate-of-new-covid-19-infections/600116980/ 

5 Twin Cities chefs share tips on preparing the best Thanksgiving tu

While printing content can provide a useful output as we code, it's usually much more useful to store the data so that we can operate on it later on (save it, clean it, etc.). In this case let's create an empty Python list and then extract the URLs from our most_read list to save there.

In [29]:
# you can create an empty list using open and closed square brackets without content
most_read_urls = []

for link in most_read:
    # we can append new items to a list "in place" (without using an equals sign) using the append() function
    most_read_urls.append(link['href'])

In [30]:
print(most_read_urls)

['https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/', 'https://www.startribune.com/two-charged-with-fatally-beating-pregnant-woman-whose-body-was-found-in-uptown/600117428/', 'https://www.startribune.com/the-bachelorette-brings-her-suitors-including-minnesota-joe-home-to-minnesota/600117702/', 'https://www.startribune.com/minnesota-urges-booster-expansion-school-protections-against-worsening-covid-19-wave/600117372/', 'https://www.startribune.com/minnesota-has-nation-s-worst-7-day-rate-of-new-covid-19-infections/600116980/', 'https://www.startribune.com/5-twin-cities-chefs-share-tips-on-preparing-the-best-thanksgiving-turkey/600117719/', 'https://www.startribune.com/green-bay-packers-kick-off-stock-sale/600117440/', 'https://www.startribune.com/minnesota-has-nation-s-worst-7-day-rate-of-new-covid-19-infections/600116980/', 'https://www.startribune.com/the-bachelorette-brings-her-suitors-including-minnesota-joe-home-to-minnesota/6001

Now we have a list of URLs that we can later use as part of a modular approach to scraping. If we figure out how to scrape the content from an article page, we can re-use that code and say "scrape the article content for every page in this list of URLs." To do that, let's figure out how to scrape content from an article page.
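
The printed list above contains some URLs more than once, presumably because the same article can show up in more than one list on the homepage. Before scraping, you may want to drop the repeats so each page is only requested once. A minimal sketch, using shortened stand-in URLs rather than the real ones:

```python
# stand-in URLs (hypothetical) with one repeat, mimicking most_read_urls
urls = [
    'https://example.com/story-one/',
    'https://example.com/story-two/',
    'https://example.com/story-one/',
    'https://example.com/story-three/',
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes repeats while preserving the original order
unique_urls = list(dict.fromkeys(urls))

print(len(urls), 'urls,', len(unique_urls), 'unique')
```

A `set` would also remove duplicates, but it does not keep the original order.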

## Scrape an article page


In [31]:
page = requests.get("https://www.startribune.com/former-vikings-great-matt-blair-dies-at-age-70-likely-linked-to-cte/572832852/")

In [32]:
src = page.text

In [33]:
page_soup = BeautifulSoup(src, 'lxml')

Exploring the HTML in Chrome is a great way to find the right selectors or attributes to scrape, but you can also take sneak peeks at common tags using ```.find_all()``` to help pinpoint specific elements. For example: ```.find_all('h1')``` or ```.find_all('p')```

In [34]:
page_soup.find_all('p')

[<p class="Text_Body">Matt Blair, one of the greatest linebackers in Vikings history, died Thursday of what’s believed to be complications from chronic traumatic encephalopathy (CTE), the neurodegenerative disease linked to football and considered to be the signature menace in the NFL’s concussion claims in recent years.</p>,
 <p class="Text_Body">He was 70 and had been in hospice care for an extended period of time.</p>,
 <p class="Text_Body">“He’d been suffering for a while, so I guess maybe it’s a blessing in disguise,” said former teammate Scott Studwell, the only person in Vikings history with more tackles than the 1,452 Blair had from 1974-85. “But it’s still too young. It’s a sad day.”</p>,
 <p class="Text_Body">In February 2015, a still-chiseled 64-year-old Blair broke down in tears during a Star Tribune interview. A local neurologist had just given Blair and his wife, Mary Beth, the bad news that his early signs of dementia were likely the results of CTE — which can’t be diagn

It looks like the p class ```Text_Body``` would snag the full-text of the article for us:

In [35]:
article_text = page_soup.select("p.Text_Body")

In [36]:
for article in article_text:
    print(article.text)

Matt Blair, one of the greatest linebackers in Vikings history, died Thursday of what’s believed to be complications from chronic traumatic encephalopathy (CTE), the neurodegenerative disease linked to football and considered to be the signature menace in the NFL’s concussion claims in recent years.
He was 70 and had been in hospice care for an extended period of time.
“He’d been suffering for a while, so I guess maybe it’s a blessing in disguise,” said former teammate Scott Studwell, the only person in Vikings history with more tackles than the 1,452 Blair had from 1974-85. “But it’s still too young. It’s a sad day.”
In February 2015, a still-chiseled 64-year-old Blair broke down in tears during a Star Tribune interview. A local neurologist had just given Blair and his wife, Mary Beth, the bad news that his early signs of dementia were likely the results of CTE — which can’t be diagnosed until after death — and were about to accelerate. Blair is believed to have had Alzheimer’s diseas

### Comments on Star Tribune - How to get them?

Some elements of websites are "hidden" from web scrapers like BeautifulSoup because they appear as part of an iframe, or because they require other code such as JavaScript to load on the page. 

If we look at the link to "Show Comments" on the Star Tribune article, for example, there's no URL in the `href`, just a call to a JavaScript tool called *js-comments* that is visible in the `a` tag's class attribute:

```<a href="#" class="js-comments-show comments-count-link talk-enabled"><div class="comments-count">18</div><span class="comments-show js-comments-show-txt">Show Comments</span><img class="comment-count-image-tracking" alt="" src="http://apps.startribune.com/circulars/images/blank.gif" style="display: none;"></a>```

To capture this "hidden" data, some researchers use browser emulators such as:

- [Selenium](https://pypi.org/project/selenium/). This is used together with other tools, such as a browser driver like [ChromeDriver](https://sites.google.com/a/chromium.org/chromedriver/downloads) and language bindings like [RSelenium](https://cran.r-project.org/web/packages/RSelenium/index.html) (for R) or [Selenium with Python](https://selenium-python.readthedocs.io/).
- [Puppeteer](https://pptr.dev/)

## Challenge (3): Scrape the headline, byline, and date
Let's explore the HTML for a Star Tribune article to see if we can scrape the headline, byline (author), and the date the article was posted from the page.

In [37]:
headline = page_soup.h1
headline.text

'Former Vikings great Matt Blair dies at age 70, likely linked to CTE'

In [38]:
byline = page_soup.select('div.article-byline')
byline[0].a.text

'Mark Craig'

In [39]:
date = page_soup.select('div.article-dateline')
date[0].text.strip()[:-9]

'October 23, 2020'
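
The `[:-9]` slice above drops the last nine characters of the dateline. The exact dateline text is not shown in this notebook, so the string below is a hypothetical stand-in; the sketch shows how the slice behaves, plus one alternative that does not depend on counting characters.

```python
# hypothetical dateline text; the real page may differ
dateline = 'October 23, 2020 - 9:15pm'

# negative slice: keep everything except the last 9 characters
by_slice = dateline[:-9]

# alternative: split on the separator and keep the first piece
by_split = dateline.split(' - ')[0]

print(by_slice)
print(by_split)
```

A fixed-length slice breaks if the trailing text changes length (for example, a time with two hour digits), so it is worth checking the result on several pages.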

## Functions
Later we'll define a function, ```scrape_strib```, that cycles through our ```most_read_urls``` list and grabs all of the data we care about from each page. First, let's look at how functions work using a simpler example.

A function definition opens with the keyword ```def``` followed by the name of the function (```add_things``` in the example below) and a parenthesized list of parameter names (```x``` and ```y```). The body of the function, meaning the statements that are executed when it runs, is indented below the definition line. The body concludes with the ```return``` keyword followed by the value we want to get back from the function.

In [40]:
def add_things(x, y):
    z = x + y
    return z

We can run the function, passing values to the x and y parameters, and save the return value to a new variable.

In [41]:
z = add_things(10,25)
print(z)

35
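
Functions are also a convenient home for small cleaning steps that get reused. The sketch below wraps the whitespace stripping used earlier in a function; the name `clean_text` and its `default` parameter are assumptions made up for this example.

```python
def clean_text(raw, default=''):
    """Return raw with surrounding whitespace removed, or default if raw is None."""
    if raw is None:
        return default
    return raw.strip()

# a string with extra whitespace comes back cleaned
print(clean_text('\n    A headline\n   '))

# None comes back as the default (an empty string)
print(repr(clean_text(None)))
```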


## Putting it all together: Scraping list function
So let's use a function to scrape a few key elements from article pages. The input that the function will take is a list of URLs. Then for each url, it will scrape the headline, byline, and article text. Since we'll be hitting the Star Tribune server pretty rapidly, let's build in a timer using the time library to pause between each page.

In [42]:
import time

In [43]:
# new function that accepts one parameter, a list.
def scrape_strib(url_list):
    #empty list to hold each page
    results_list = []
    
    for url in url_list:
        print('Scraping:', url)
        time.sleep(5) # wait for 5 seconds
        
        # request the page content and convert it to a soup object
        page = requests.get(url)
        src = page.text
        page_soup = BeautifulSoup(src, 'lxml')
        
        #find the text on the page
        article_text = page_soup.select("p") # because different kinds of articles use different p classes, we'll make this as generic as possible

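        # save all of the p tags to one long string
        # note: ''.join adds no separator, so paragraphs run together; ' '.join(...) would put a space between them
        article_string = ''.join(article.text.strip() for article in article_text)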
        
        # save the headline (there should only be one!)
        headline = page_soup.h1
        
        # we haven't talked about if/else statements, but before assigning a value we want to check that headline exists
        # we can do that with the handy statement if x is not None: 
        # whatever is indented will only take place if the 'if' statement evaluates as true; otherwise we'll skip it, and go to the else
        if headline is not None:
            headline = headline.text.strip()
        else:
            headline = ''
        
        # bylines have a lot of variation from page to page so we're going to do a somewhat complex if/else check
        byline = page_soup.select('div.article-byline')
        if len(byline) > 0:
            if byline[0].a is not None:
                byline = byline[0].a.text
            else:
                byline = ''
        else:
            byline = ''
        
        # this is a new kind of container object, known as a tuple. It's a good way to package these strings together for each page, and then append them as one object to the overall results_list. 
        article_tuple = (url, headline, byline, article_string)
        results_list.append(article_tuple)
    
    #return results_list after the for loop concludes
    return results_list

Now we can call our function, passing the most_read_urls list as an argument, and saving the return value to a variable called scraped_articles.

In [44]:
scraped_articles = scrape_strib(most_read_urls)

Scraping: https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/
Scraping: https://www.startribune.com/two-charged-with-fatally-beating-pregnant-woman-whose-body-was-found-in-uptown/600117428/
Scraping: https://www.startribune.com/the-bachelorette-brings-her-suitors-including-minnesota-joe-home-to-minnesota/600117702/
Scraping: https://www.startribune.com/minnesota-urges-booster-expansion-school-protections-against-worsening-covid-19-wave/600117372/
Scraping: https://www.startribune.com/minnesota-has-nation-s-worst-7-day-rate-of-new-covid-19-infections/600116980/
Scraping: https://www.startribune.com/5-twin-cities-chefs-share-tips-on-preparing-the-best-thanksgiving-turkey/600117719/
Scraping: https://www.startribune.com/green-bay-packers-kick-off-stock-sale/600117440/
Scraping: https://www.startribune.com/minnesota-has-nation-s-worst-7-day-rate-of-new-covid-19-infections/600116980/
Scraping: https://www.startribune.com/the-bachelorette-b

We can explore the content by referring to the index for each item in the scraped_articles list. Also, looking over the URLs above, we can guess that some of these (like the videos) are probably not going to fit our articles list very well. When building a scraper, you'll want to spend a fair amount of time testing different kinds of content and building if/else statements to only collect the data you need.
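
One way to test extraction logic without repeatedly hitting the server is to run it against small HTML strings. The sketch below is a simplified variant of the checks in `scrape_strib`, written under assumptions: the helper name `extract_fields` and the sample markup are made up for illustration, and it uses `select_one()` and the built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

def extract_fields(html):
    """Return (headline, byline) from an HTML string, using '' for missing pieces."""
    page_soup = BeautifulSoup(html, 'html.parser')

    # select_one returns the first match or None
    h1 = page_soup.select_one('h1')
    headline = h1.get_text(strip=True) if h1 is not None else ''

    # the byline link sits inside div.article-byline, when it exists
    byline_link = page_soup.select_one('div.article-byline a')
    byline = byline_link.get_text(strip=True) if byline_link is not None else ''

    return (headline, byline)

# made-up pages: one complete, one with no byline link, one with nothing useful
full_page = '<h1> A headline </h1><div class="article-byline"><a href="#">A. Writer</a></div>'
no_link = '<h1>Video: a clip</h1><div class="article-byline">Staff</div>'
empty_page = '<p>Nothing to see</p>'

for html in (full_page, no_link, empty_page):
    print(extract_fields(html))
```

Saving a few real pages to disk and feeding them to a helper like this is a gentle way to refine a scraper before running it against the live site.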

In [45]:
# we can look at the full tuple for each article
print(scraped_articles[0])

('https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/', 'DNR: Walleye poachers near Baudette were 48 fish over their limit', 'Tony Kennedy', 'Three Twin Cities area men who went fishing instead of hunting during opening weekend of the deer season were caught poaching walleyes and saugers from Rainy River and Lake of the Woods, according to charges against them.Conservation Officer Corey Sura  caught onto the group\'s activity Sunday afternoon while they were launching their boat into the river at Wheeler\'s Point Public Access near Baudette, Minn. They initially denied having an overabundance of fish, but their stash was uncovered when the game warden heard a flopping noise.In total, the three men were busted for possessing 72 walleyes and saugers, 48 over their combined limit.The three men were identified asMichael Sysa, 22, of Oak Grove; David Sysa, 23, of Oak Grove; and Yevgeniy Simonovich, 29, of Elk River. Sura\'s report said the

In [46]:
# we can look at specific items by referring to their index within the tuple
print('First article:')
print('url:', scraped_articles[0][0])
print('title:', scraped_articles[0][1])
print('byline:', scraped_articles[0][2])
print('First 250 chars from the text:', scraped_articles[0][3][0:250])

First article:
url: https://www.startribune.com/dnr-walleye-poachers-near-baudette-were-48-fish-over-their-limit/600117379/
title: DNR: Walleye poachers near Baudette were 48 fish over their limit
byline: Tony Kennedy
First 250 chars from the text: Three Twin Cities area men who went fishing instead of hunting during opening weekend of the deer season were caught poaching walleyes and saugers from Rainy River and Lake of the Woods, according to charges against them.Conservation Officer Corey Su
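
Tuples can also be unpacked into named variables, which can be easier to read than nested indexes. A minimal sketch with made-up article tuples in the same (url, headline, byline, text) order:

```python
# a made-up result in the same shape as one item of scraped_articles
article = ('https://example.com/story-one/', 'A headline', 'A. Writer', 'Body text of the article.')

# unpack the four items into four variables in one step
url, headline, byline, text = article
print(headline, 'by', byline)

# unpacking also works directly in a for loop over a list of tuples
articles = [article, ('https://example.com/story-two/', 'Another headline', '', 'More text.')]
for url, headline, byline, text in articles:
    print(url, len(text))

# tuples are immutable: trying to change an item raises TypeError
try:
    article[1] = 'New headline'
except TypeError:
    print('tuples cannot be changed in place')
```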


## Data view and export

When you've collected a lot of data like this it can be helpful to save it for later use. Which format you want to use to save data in Python depends on the structure of the data, but common formats are CSVs, JSON, and pickle files.
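
As one example of these formats, the standard library's `json` module can save a list of results and load it back. This is a sketch under assumptions: the records and the file name `scraped_articles.json` are made up, and the file is written to a temporary folder so nothing is left behind. Note that JSON has no tuple type, so tuples come back as lists.

```python
import json
import os
import tempfile

# made-up results in the same shape as scraped_articles
articles = [
    ('https://example.com/story-one/', 'A headline', 'A. Writer', 'Body text.'),
    ('https://example.com/story-two/', 'Another headline', '', 'More text.'),
]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'scraped_articles.json')

    # write the list to disk
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(articles, f)

    # read it back
    with open(path, encoding='utf-8') as f:
        loaded = json.load(f)

print(loaded[0])
print(type(loaded[0]))
```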

In the case of more complex data structures, like our list of tuples, we can use a tabular data tool called Pandas to store each article in a row, and then save that Pandas dataframe to a CSV file.

In [47]:
import pandas as pd

# pandas has a constructor called DataFrame() that accepts a list as its input; the columns= argument sets the column names
df = pd.DataFrame(scraped_articles, columns=['url', 'title', 'byline', 'text'])

# now let's view the table
df

| | url | title | byline | text |
|---|---|---|---|---|
| 0 | https://www.startribune.com/dnr-walleye-poache... | DNR: Walleye poachers near Baudette were 48 fi... | Tony Kennedy | Three Twin Cities area men who went fishing in... |
| 1 | https://www.startribune.com/two-charged-with-f... | Two charged with fatally beating pregnant woma... | Paul Walsh | Two people, one of them a convicted sex offend... |
| 2 | https://www.startribune.com/the-bachelorette-b... | 'The Bachelorette' brings her suitors — includ... | Jenna Ross | Minnesota is getting lots of love on "The Bach... |
| 3 | https://www.startribune.com/minnesota-urges-bo... | Minnesota urges booster expansion, school prot... | Jeremy Olson | Minnesota is preparing to expand COVID-19 vacc... |
| 4 | https://www.startribune.com/minnesota-has-nati... | Minnesota has nation's worst 7-day rate of new... | Jeremy Olson | Minnesota's rate of new coronavirus infections... |
| 5 | https://www.startribune.com/5-twin-cities-chef... | 5 Twin Cities chefs share tips on preparing th... | Rick Nelson | To brine, or not to brine. Roasting vs. frying... |
| 6 | https://www.startribune.com/green-bay-packers-... | Green Bay Packers kick off 'stock' sale | Burl Gilyard | The Green Bay Packers, a storied football team... |
| 7 | https://www.startribune.com/minnesota-has-nati... | Minnesota has nation's worst 7-day rate of new... | Jeremy Olson | Minnesota's rate of new coronavirus infections... |
| 8 | https://www.startribune.com/the-bachelorette-b... | 'The Bachelorette' brings her suitors — includ... | Jenna Ross | Minnesota is getting lots of love on "The Bach... |
| 9 | https://www.startribune.com/after-the-disrupti... | After disruption at Guthrie's 'Christmas Carol... | Rohan Preston | Twin Cities theater companies are taking a new... |


And we can save a dataframe to a CSV file using the pandas method to_csv().

In [None]:
df.to_csv('scraped_sites.csv')
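
By default `to_csv()` also writes the dataframe's row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv()`. If you don't need it, `index=False` leaves it out. The sketch below uses a made-up two-row dataframe and in-memory buffers instead of real files.

```python
import io
import pandas as pd

# a made-up dataframe used only for illustration
demo_df = pd.DataFrame(
    [('https://example.com/story-one/', 'A headline'),
     ('https://example.com/story-two/', 'Another headline')],
    columns=['url', 'title'],
)

# default: the index is written as an extra, unnamed column
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the named columns are written
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default)
print(cols_no_index)
```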

### More resources

- [Programming Historian's Intro to BeautifulSoup](https://programminghistorian.org/en/lessons/intro-to-beautiful-soup)