# Introduction

In this workshop, Python V, we'll be scraping the work of the late Daphne Caruana Galizia, unless somebody else has a website they desperately want to scrape. In that case, we'll look at that instead. What we'll cover in depth is a technique for scraping basically any website that doesn't need interaction. As soon as interaction comes into play, things get a lot more complicated.

[BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

# Imports

First of all, we'll import the modules and libraries we'll be needing. 

In [1]:
from bs4 import BeautifulSoup
#This lets us parse the information we need from the website we are visiting.
import requests
#This lets us visit any URL automatically.
import pandas as pd
#We'll use Pandas to store the data we have scraped

# Strategy - how do you want to proceed?

1. Pull out all the URLs including the date
2. Use a list of them to visit every URL
3. Use each and every URL to pull out the text

# Set up

In [2]:
# Let's get started: scrape main page
url = "https://daphnecaruanagalizia.com"
response = requests.get(url)
daphne = BeautifulSoup(response.text, 'html.parser')

In [3]:
daphne

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en-GB" prefix="og: http://ogp.me/ns#" xmlns="http://www.w3.org/1999/xhtml">
<head profile="http://gmpg.org/xfn/11">
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width,initial-scale=1.0" name="viewport">
<title>Daphne Caruana Galizia's Notebook | Running Commentary - Daphne Caruana Galizia is a journalist working in Malta. | Daphne Caruana Galizia</title>
<link href="https://daphnecaruanagalizia.com/wp-content/themes/daphne-v3/style.css?v=2823442352343" media="screen" rel="stylesheet" type="text/css"/>
<!--[if IE]>
<link rel="stylesheet" href="https://daphnecaruanagalizia.com/wp-content/themes/daphne-v3/style-ie.css?v=25032009" type="text/css" media="screen" />
<![endif]-->
<link href="https://daphnecaruanagalizia.com/feed/" rel="alternate" title="posts" type="application/rss+xml"/>
<link href="https://

# Find, Find All & .text

First, let's take a look at [Google's Developer Tools](https://developer.chrome.com/devtools) (I hope you're using Google Chrome). We'll use the developer tools to inspect the structure of the site.

In [4]:
# Looking for the first post on the page
post = daphne.find("div", class_="postmaster")

In [5]:
post

<div class="postmaster" data-postid="97964">
<p class="column-caption"></p>
<div class="post">
<h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
First things first: do something about that horrendous posture </a>
</h1>
<div class="entry">
<p>
You can wear the flashiest watch and keep your snazzy shirt-cuff turned up to make …</p>
</div>
<p class="postmetadata"><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#respond">Post a comment</a> | <a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#comments"><span class="dsq-postid" data-dsqidentifier="97964 https://daphnecaruanagalizia.com/?p=97964">Read (4)</span></a> | <span class="time">Monday, 16 October 2:09 pm</span></p>
</div>
</div>

In [6]:
#Looking for all the posts on the page
posts = daphne.find_all("div", class_="postmaster")

In [7]:
posts

[<div class="postmaster" data-postid="97964">
 <p class="column-caption"></p>
 <div class="post">
 <h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
 First things first: do something about that horrendous posture </a>
 </h1>
 <div class="entry">
 <p>
 You can wear the flashiest watch and keep your snazzy shirt-cuff turned up to make …</p>
 </div>
 <p class="postmetadata"><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#respond">Post a comment</a> | <a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#comments"><span class="dsq-postid" data-dsqidentifier="97964 https://daphnecaruanagalizia.com/?p=97964">Read (4)</span></a> | <span class="time">Monday, 16 October 2:09 pm</span></p>
 </div>
 </div>, <div class="postmaster" data-postid="97

In [8]:
#Let's look at just the first post in the list to see which information we need.
posts[0]

<div class="postmaster" data-postid="97964">
<p class="column-caption"></p>
<div class="post">
<h1><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
First things first: do something about that horrendous posture </a>
</h1>
<div class="entry">
<p>
You can wear the flashiest watch and keep your snazzy shirt-cuff turned up to make …</p>
</div>
<p class="postmetadata"><a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#respond">Post a comment</a> | <a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/#comments"><span class="dsq-postid" data-dsqidentifier="97964 https://daphnecaruanagalizia.com/?p=97964">Read (4)</span></a> | <span class="time">Monday, 16 October 2:09 pm</span></p>
</div>
</div>

In [9]:
posts[0].find('a')

<a href="https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/" rel="bookmark" title="Permanent Link to First things first: do something about that horrendous posture">
First things first: do something about that horrendous posture </a>

In [10]:
posts[0].find('a')['href']

'https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/'

In [11]:
posts[0].find('span')

<span class="dsq-postid" data-dsqidentifier="97964 https://daphnecaruanagalizia.com/?p=97964">Read (4)</span>

In [27]:
posts[0].find_all('span')[1]

<span class="time">Monday, 16 October 2:09 pm</span>

In [27]:
posts[0].find('span', {'class':'time'}).text

'Monday, 16 October 2:09 pm'
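
To try the calls from this section without hitting the website, here is a minimal sketch on a made-up HTML snippet. The snippet, its `example.com` URLs and the variable names are invented for illustration; only `find`, `find_all`, `['href']` and `.text` are the calls used above.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of two posts (not real site content)
html = """
<div class="postmaster">
  <h1><a href="https://example.com/first/">First post</a></h1>
  <span class="count">Read (4)</span>
  <span class="time">Monday, 16 October 2:09 pm</span>
</div>
<div class="postmaster">
  <h1><a href="https://example.com/second/">Second post</a></h1>
  <span class="count">Read (0)</span>
  <span class="time">Sunday, 15 October 10:07 pm</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find returns only the first match (or None if nothing matches)
first_post = soup.find("div", class_="postmaster")
first_link = first_post.find('a')['href']

# find_all returns a list-like collection of every match
all_posts = soup.find_all("div", class_="postmaster")

# .text gives the text inside a tag, without the markup
first_time = first_post.find('span', {'class': 'time'}).text

print(first_link)
print(len(all_posts))
print(first_time)
```

Because `find` returns `None` when nothing matches, chaining `['href']` or `.text` onto a failed search raises an error, so it is worth checking the class names in the developer tools first.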

# Looping

In [25]:
#Now let's go through the whole list and, using a for loop, extract only the URLs
for elem in posts:
    #First the URL
    print(elem.find('a')['href'])

https://daphnecaruanagalizia.com/2017/10/first-things-first-something-horrendous-posture/
https://daphnecaruanagalizia.com/2017/10/austrias-new-chancellor-31-will-form-coalition-neo-nazi/
https://daphnecaruanagalizia.com/2017/10/party-leaders-sunday-morning/
https://daphnecaruanagalizia.com/2017/10/looks-like-delia-surrounding-like-minded-individuals/
https://daphnecaruanagalizia.com/2017/10/chris-cardona-one-track-mind/
https://daphnecaruanagalizia.com/2017/10/david-thake-subject-adrian-delia/
https://daphnecaruanagalizia.com/2017/10/not-blinded-little-intelligence-much-personal-ambition-reality/
https://daphnecaruanagalizia.com/2017/10/toni-bezzina-throws-hat-ring-two-minutes-deadline/
https://daphnecaruanagalizia.com/2017/10/chris-said-just-made-public-statement-deputy-leadership-election/
https://daphnecaruanagalizia.com/2017/10/chris-said-says-will-not-stand-election-nationalist-party-deputy-leader/
https://daphnecaruanagalizia.com/2017/10/nationalist-party-deputy-leadership-conte

In [28]:
#Now the time
for elem in posts:
    #This time the date and time of each post
    print(elem.find('span', {'class':'time'}).text)

Monday, 16 October 2:09 pm
Sunday, 15 October 10:07 pm
Sunday, 15 October 7:26 pm
Saturday, 14 October 12:52 am
Saturday, 14 October 12:26 am
Friday, 13 October 11:20 pm
Friday, 13 October 6:29 pm
Friday, 13 October 6:19 pm
Friday, 13 October 5:22 pm
Friday, 13 October 4:46 pm
Friday, 13 October 12:45 pm
Thursday, 12 October 8:45 pm
Thursday, 12 October 7:21 pm
Thursday, 12 October 6:53 pm
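
The scraped time text has no year, but each post URL contains one. A possible way to turn both into a proper timestamp is sketched below. The function name `parse_post_datetime` is hypothetical, and the sketch assumes that the `/YYYY/MM/` part of the URL matches the publication date and that Python runs with English day and month names.

```python
import re
from datetime import datetime

def parse_post_datetime(url, time_text):
    """Combine the year found in the URL with the scraped time text."""
    # Assumption: post URLs look like https://.../2017/10/some-title/
    match = re.search(r"/(\d{4})/\d{2}/", url)
    if match is None:
        raise ValueError("no /YYYY/MM/ part found in " + url)
    year = match.group(1)
    # Example input: 'Sunday, 15 October 7:26 pm' plus ' 2017'
    return datetime.strptime(time_text + " " + year, "%A, %d %B %I:%M %p %Y")

example = parse_post_datetime(
    "https://daphnecaruanagalizia.com/2017/10/party-leaders-sunday-morning/",
    "Sunday, 15 October 7:26 pm",
)
print(example.isoformat())
```

Storing real timestamps instead of strings makes it possible to sort and filter the posts by date later on.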


In [29]:
#Saving them into a list
lst = []
for elem in posts:
    
    url = elem.find('a')['href']
    date = elem.find('span', {'class':'time'}).text
    
    mini_dict = {'URL': url,
                 'Date': date}
    
    lst.append(mini_dict)

# Making it Human Readable

In [30]:
#And now we'll bring in Pandas and make it readable for us humans
pd.DataFrame(lst)

| | Date | URL |
|---|---|---|
| 0 | Monday, 16 October 2:09 pm | https://daphnecaruanagalizia.com/2017/10/first... |
| 1 | Sunday, 15 October 10:07 pm | https://daphnecaruanagalizia.com/2017/10/austr... |
| 2 | Sunday, 15 October 7:26 pm | https://daphnecaruanagalizia.com/2017/10/party... |
| 3 | Saturday, 14 October 12:52 am | https://daphnecaruanagalizia.com/2017/10/looks... |
| 4 | Saturday, 14 October 12:26 am | https://daphnecaruanagalizia.com/2017/10/chris... |
| 5 | Friday, 13 October 11:20 pm | https://daphnecaruanagalizia.com/2017/10/david... |
| 6 | Friday, 13 October 6:29 pm | https://daphnecaruanagalizia.com/2017/10/not-b... |
| 7 | Friday, 13 October 6:19 pm | https://daphnecaruanagalizia.com/2017/10/toni-... |
| 8 | Friday, 13 October 5:22 pm | https://daphnecaruanagalizia.com/2017/10/chris... |
| 9 | Friday, 13 October 4:46 pm | https://daphnecaruanagalizia.com/2017/10/chris... |


In [32]:
#Let's save it (the folder 'd' must already exist)
df = pd.DataFrame(lst)
df.to_csv('d/date_urls.csv')
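
Saving fails if the target folder is missing, and by default `to_csv` also writes the row numbers as an extra column. The sketch below shows one way to handle both. The rows are made up, and the temporary directory is only there so the example does not overwrite anything; in the notebook you would use the real `d` folder.

```python
from pathlib import Path
import tempfile
import pandas as pd

# Made-up rows in the same shape as the scraped list
rows = [
    {'URL': 'https://example.com/first/', 'Date': 'Monday, 16 October 2:09 pm'},
    {'URL': 'https://example.com/second/', 'Date': 'Sunday, 15 October 10:07 pm'},
]
df_example = pd.DataFrame(rows)

# Hypothetical output folder inside a temporary directory
out_dir = Path(tempfile.mkdtemp()) / 'd'
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
out_file = out_dir / 'date_urls.csv'

# index=False leaves out the 0, 1, 2, ... row numbers
df_example.to_csv(out_file, index=False)

# Reading the file back gives only the two data columns
df_back = pd.read_csv(out_file)
print(list(df_back.columns))
```

Leaving out the index avoids an extra unnamed column when the file is read back with `pd.read_csv`.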

# Let's get all the URLs

In [33]:
# Examine the URL and the next page
url = "https://daphnecaruanagalizia.com/page/2/" #all the way to 1441
response = requests.get(url)
daphne = BeautifulSoup(response.text, 'html.parser')
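
The page address follows a simple pattern, so the full list of overview pages can be generated up front. This is a small sketch, not the only way to do it: the helper `page_url` is hypothetical, and it assumes that page 1 is the home page we scraped first and that 1441 is the last page, as noted in the comment above.

```python
BASE_URL = "https://daphnecaruanagalizia.com"

def page_url(page_number):
    """Return the address of one overview page."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    if page_number == 1:
        # Assumption: the first overview page is the home page itself
        return BASE_URL
    return BASE_URL + "/page/" + str(page_number) + "/"

# range(1, 1442) covers pages 1 to 1441, because the end value is excluded
page_urls = [page_url(n) for n in range(1, 1442)]
print(page_urls[0])
print(page_urls[1])
print(len(page_urls))
```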

In [None]:
# How to loop through all the pages?
whole_list = []

for elem in range(1, 1442): #pages 1 to 1441
    url = "https://daphnecaruanagalizia.com/page/" + str(elem) + "/"
    response = requests.get(url)
    daphne = BeautifulSoup(response.text, 'html.parser')

In [None]:
# Now we need to pull out every post at each step of the loop

whole_list = []

for page in range(1, 1442): #pages 1 to 1441
    url = "https://daphnecaruanagalizia.com/page/" + str(page) + "/"
    response = requests.get(url)
    daphne = BeautifulSoup(response.text, 'html.parser')
    #Find the posts on the page we just downloaded
    posts = daphne.find_all("div", class_="postmaster")
    
    ###### this is what we did before #####
    lst = [] 
    for elem in posts:
    
        post_url = elem.find('a')['href']
        date = elem.find('span', {'class':'time'}).text
        mini_dict = {'URL': post_url,
                     'Date': date}
        lst.append(mini_dict)
    ###### this is what we did before #####
        
    whole_list += lst
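
With more than a thousand pages, a single failed request would stop the whole loop, and firing requests without a pause puts load on the server. The following is a hedged sketch of a more forgiving loop. The function `scrape_pages` and its parameters are hypothetical, and the download and parsing steps are passed in as functions so the example can run offline with stand-ins.

```python
import time

def scrape_pages(fetch_page, parse_posts, first_page, last_page, delay=1.0):
    """Collect rows from several overview pages, skipping pages that fail."""
    rows = []
    failed_pages = []
    for page in range(first_page, last_page + 1):
        url = "https://daphnecaruanagalizia.com/page/" + str(page) + "/"
        try:
            html = fetch_page(url)
        except Exception as error:
            # Remember the page so it can be retried later
            failed_pages.append((page, str(error)))
            continue
        rows.extend(parse_posts(html))
        # Pause so we do not hammer the server
        time.sleep(delay)
    return rows, failed_pages

# Offline demonstration with stand-in functions (no network access)
def fake_fetch(url):
    if url.endswith("/page/2/"):
        raise RuntimeError("simulated timeout")
    return url

def fake_parse(html):
    return [{'URL': html, 'Date': 'unknown'}]

rows, failed_pages = scrape_pages(fake_fetch, fake_parse, 1, 3, delay=0)
print(len(rows), failed_pages)
```

In the notebook, `fetch_page` could wrap `requests.get(url, timeout=30)` followed by `response.raise_for_status()`, and `parse_posts` could hold the inner loop that builds the list of dictionaries.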

In [None]:
df = pd.DataFrame(whole_list)

# Next steps?
1. Create the entire dataframe
2. Use the URL to visit each page
3. Create a new dataframe with the entire text
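
Steps 2 and 3 could start from something like the sketch below. The helper `extract_text` is hypothetical, and the selector `div.entry` is an assumption: that class appears on the overview page, but whether the article pages use it for the full text needs to be checked in the developer tools. The sample HTML is made up.

```python
from bs4 import BeautifulSoup

def extract_text(html, selector="div.entry"):
    """Return the text of the first element matching selector, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.select_one(selector)
    if body is None:
        return None
    # get_text joins the text of all nested tags; strip=True trims whitespace
    return body.get_text(" ", strip=True)

# Made-up article page for an offline check
sample = '<div class="entry"><p>First paragraph.</p><p>Second one.</p></div>'
print(extract_text(sample))
```

Applied to the HTML of every URL in the dataframe, the returned strings could go into a new `Text` column.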