# Introduction

In this workshop, Python V, we'll be scraping the work of the late Daphne Caruana Galizia, unless somebody else has a website they would desperately like to scrape. In that case, we'll look at that instead. What we'll look at in depth is a technique for scraping basically any website that doesn't need interaction. As soon as interaction comes into play, things get a lot more complicated.

[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

# Imports

First of all, we'll import the modules and libraries we'll be needing. 

In [3]:
from bs4 import BeautifulSoup
#This lets us parse the information we need from the website we are visiting.
import requests
#This lets us visit any URL automatically.
import pandas as pd
#We'll use Pandas to store the data we have scraped
import pickle
#We could also use Pickle, if we want to store the data more efficiently.
import time
#This combined with the 
import progressbar
#If you need to install any of these modules or libraries, please do so with the !pip install XXX command.
#time and pickle ship with Python already; progressbar is installed with pip install progressbar2 (note the 2 at the end).

# Strategy - how do you want to proceed?

1. Pull out all the URLs including the date
2. Use a list of them to visit every URL
3. Use each and every URL to pull out the text
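
The three steps above can be sketched as one small driver function. This is only an outline under assumptions: `scrape_site`, `extract_links`, `fetch` and `extract_text` are hypothetical names that do not appear in the workshop, and the pause between requests is an added courtesy to the server rather than something the site requires.

```python
import time


def scrape_site(index_html, extract_links, fetch, extract_text, delay=1.0):
    """Run the three strategy steps; every helper is passed in by the caller.

    extract_links(index_html) -> iterable of (url, date) pairs   (step 1)
    fetch(url)                -> HTML of one article as a string (step 2)
    extract_text(page_html)   -> the article text                (step 3)
    """
    results = []
    for url, date in extract_links(index_html):
        page_html = fetch(url)
        results.append({"url": url, "date": date, "text": extract_text(page_html)})
        # Pause between requests so we do not hammer the server.
        time.sleep(delay)
    return results
```

Passing the helpers in as arguments means each step can be written and tried out on its own before the whole site is visited.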

# Set up

In [5]:
# Let's get started: scrape main page
url = "https://daphnecaruanagalizia.com"
response = requests.get(url)
daphne = BeautifulSoup(response.text, 'html.parser')
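
If the request fails, `response.text` may hold an error page instead of the blog. A slightly more defensive version of the cell above could look like the sketch below. `fetch_soup` is a hypothetical helper name, and the 10 second timeout is an arbitrary choice, not a recommendation from the site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=10):
    """Download one page and return it as a BeautifulSoup object."""
    # Use the given session if there is one, otherwise plain requests.get.
    client = session or requests
    response = client.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx answers instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Usage (needs a network connection):
# daphne = fetch_soup("https://daphnecaruanagalizia.com")
```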

In [None]:
daphne

# Find, Find All & .text

First, let's take a look at [Google's Developer Tools](https://developer.chrome.com/devtools) (I hope you're using Google Chrome). We'll use the developer tools to get the structural information of the site.

In [11]:
# Looking for the first post on the page
post = daphne.find("div", class_="postmaster")

In [12]:
post

<div class="postmaster" data-postid="97964">
<p class="column-caption"></p>
<div class="post">
<h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
First things first: do something about that horrendous posture </a>
</h1>
<div class="entry">
<p>
You can wear the flashiest watch and keep your snazzy shirt-cuff turned up to make …</p>
</div>
<p class="postmetadata"><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#respond">Post a comment</a> | <a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#comments"><span class="dsq-postid" data-dsqidentifier="97964 https://daphnecaruanagalizia.com/?p=97964">Read (4)</span></a> | <span class="time">Monday, 16 October 2:09 pm</span></p>
</div>
</div>

In [13]:
#Looking for all the posts on the page
posts = daphne.find_all("div", class_="postmaster")

In [16]:
posts

[<div class="postmaster" data-postid="97964">
 <p class="column-caption"></p>
 <div class="post">
 <h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
 First things first: do something about that horrendous posture </a>
 </h1>
 <div class="entry">
 <p>
 You can wear the flashiest watch and keep your snazzy shirt-cuff turned up to make …</p>
 </div>
 <p class="postmetadata"><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#respond">Post a comment</a> | <a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#comments"><span class="dsq-postid" data-dsqidentifier="97964 https://daphnecaruanagalizia.com/?p=97964">Read (4)</span></a> | <span class="time">Monday, 16 October 2:09 pm</span></p>
 </div>
 </div>, <div class="postmaster" data-postid="97
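
To make the difference between `find` and `find_all` concrete, here is a small example that runs without a network connection. The HTML string is made up for illustration and only mimics the `postmaster` structure shown above.

```python
from bs4 import BeautifulSoup

# Invented miniature page with two posts.
html = """
<div class="postmaster"><h1><a href="/first/">First post</a></h1></div>
<div class="postmaster"><h1><a href="/second/">Second post</a></h1></div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="postmaster")          # first match only
all_posts = soup.find_all("div", class_="postmaster")  # list of every match
missing = soup.find("div", class_="does-not-exist")    # no match at all

print(first.find("a").text)  # First post
print(len(all_posts))        # 2
print(missing)               # None
```

Because `find` returns `None` when nothing matches, calling `.text` on the result of a failed search raises an error. `find_all` returns an empty list in that case.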

In [17]:
#Let's look at just the first item in the list to see which information we need.
posts[0]

<div class="postmaster" data-postid="97964">
<p class="column-caption"></p>
<div class="post">
<h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
First things first: do something about that horrendous posture </a>
</h1>
<div class="entry">
<p>
You can wear the flashiest watch and keep your snazzy shirt-cuff turned up to make …</p>
</div>
<p class="postmetadata"><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#respond">Post a comment</a> | <a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#comments"><span class="dsq-postid" data-dsqidentifier="97964 https://daphnecaruanagalizia.com/?p=97964">Read (4)</span></a> | <span class="time">Monday, 16 October 2:09 pm</span></p>
</div>
</div>

In [21]:
posts[0].find('a')

<a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
First things first: do something about that horrendous posture </a>

In [22]:
posts[0].find('a')['href']

'https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/'

In [26]:
posts[0].find('span', {'class':'time'})

<span class="time">Monday, 16 October 2:09 pm</span>

In [27]:
posts[0].find('span', {'class':'time'}).text

'Monday, 16 October 2:09 pm'
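
The time string has no year, but the URL does. One possible way to combine the two into a full timestamp is sketched below. It assumes that the first four digits in the URL path are the year of the post and that Python runs with English day and month names. `post_timestamp` is a hypothetical helper, not part of the workshop.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Shortened copy of the first post shown above.
html = """
<div class="postmaster">
<h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/">
First things first: do something about that horrendous posture </a></h1>
<p class="postmetadata"><span class="time">Monday, 16 October 2:09 pm</span></p>
</div>
"""
post = BeautifulSoup(html, "html.parser").find("div", class_="postmaster")


def post_timestamp(post):
    """Return a datetime for one post, or None if a piece is missing."""
    link = post.find("a")
    stamp = post.find("span", {"class": "time"})
    if link is None or stamp is None:
        return None
    # Assumption: the URL path looks like /YYYY/MM/slug/.
    match = re.search(r"/(\d{4})/\d{2}/", link.get("href", ""))
    if match is None:
        return None
    text = f"{stamp.text.strip()} {match.group(1)}"
    return datetime.strptime(text, "%A, %d %B %I:%M %p %Y")


print(post_timestamp(post))  # 2017-10-16 14:09:00
```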

# Looping

In [None]:
#Now let's look at the whole list and, using a for loop, extract only the 
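
A loop over the posts could look roughly like the sketch below. It collects the link and the time of every post, which is an assumption about what the loop is meant to extract. The HTML is a cut-down stand-in for the real page: the second URL is invented, and the third block has no link on purpose.

```python
from bs4 import BeautifulSoup

html = """
<div class="postmaster">
<h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/">First things first</a></h1>
<span class="time">Monday, 16 October 2:09 pm</span>
</div>
<div class="postmaster">
<h1><a href="https://example.com/2017/10/made-up-second-post/">Made-up second post</a></h1>
<span class="time">Monday, 16 October 1:00 pm</span>
</div>
<div class="postmaster"><p>No link or time in here</p></div>
"""
posts = BeautifulSoup(html, "html.parser").find_all("div", class_="postmaster")

rows = []
for post in posts:
    link = post.find("a")
    stamp = post.find("span", {"class": "time"})
    # Skip blocks that lack a link or a time instead of crashing on None.
    if link is None or stamp is None:
        continue
    rows.append({"url": link["href"], "time": stamp.text})

print(len(rows))  # 2
```

A list of dictionaries like `rows` can be handed straight to `pd.DataFrame(rows)` once we want to store the results.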