

<div class="alert alert-success">

## Note 

> This notebook scrapes the programme information for every episode of the BBC Radio 4 programme 'In Our Time', which has run since 1998. It then does some initial cleaning to produce a usable dataframe (the guests' details and affiliations are not yet extracted).

> The website's robots.txt (https://www.bbc.co.uk/robots.txt) has no relevant interdictions, and the Terms of Use (https://www.bbc.co.uk/usingthebbc/terms-of-use/) essentially request normal and respectful usage.

</div>
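
The robots.txt check mentioned above can also be done in code. This is only a minimal sketch using the standard library's `urllib.robotparser`; the helper name `is_allowed` and the sample rules are made up for illustration and are not the real BBC file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (NOT the real BBC robots.txt):
# one blocked path, everything else open.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://www.bbc.co.uk/programmes/b006qykl/episodes/guide"))
print(is_allowed(SAMPLE_ROBOTS, "https://www.bbc.co.uk/private/page"))
```

To check the live file, the text downloaded from https://www.bbc.co.uk/robots.txt could be passed in as `robots_txt`.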

In [1]:
# imports

import requests
from bs4 import BeautifulSoup as bs
import time
from random import randint
import pandas as pd
import json
from datetime import datetime as dt


## all episodes - collect list of urls

> this includes future episodes and URLs that lead to empty pages

In [40]:
# gather 932 urls from the 32 pages of the episode guide

url_list = []

my_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/84.0"}

for nb in range(1, 33):
    url = f'https://www.bbc.co.uk/programmes/b006qykl/episodes/guide?page={nb}'
    req = requests.get(url, timeout=5, headers=my_headers)
    # name the parser explicitly so bs4 does not have to guess one
    soup = bs(req.content, 'html.parser')
    for a in soup.find_all('a', class_='br-blocklink__link block-link__target'):
        url_list.append(a.attrs['href'])
    print(nb)
    # only the 'time' module is imported, so a bare sleep() raises NameError
    time.sleep(2)

print('done', len(url_list), 'urls')

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
done 932 urls


In [44]:
# test for duplicates in the url list

print(len(url_list))
print(len(set(url_list)))

932
932
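
Both counts match here, so there is nothing to remove. Had they differed, one possible way to drop the duplicates while keeping the original page order is sketched below; `dedupe_keep_order` is a hypothetical helper, not something used in this notebook.

```python
def dedupe_keep_order(urls):
    """Remove repeated URLs while keeping the first occurrence of each, in order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(urls))


# Two of the episode URLs from this notebook, with one repeated on purpose.
sample = [
    "https://www.bbc.co.uk/programmes/m000rvnj",
    "https://www.bbc.co.uk/programmes/m000rll4",
    "https://www.bbc.co.uk/programmes/m000rvnj",
]
print(len(sample), len(set(sample)))
print(dedupe_keep_order(sample))
```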


In [45]:
# save url list locally as a backup, indicating in file name how many urls there are for future reference

with open("url_list_932.json", "w") as f:
    json.dump(url_list, f)
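
A backup is only useful if it can be read back. The sketch below shows a round trip with the standard `json` module; the helper names `save_urls` and `load_urls` and the temporary file are assumptions for illustration (the notebook itself writes `url_list_932.json`).

```python
import json
import os
import tempfile


def save_urls(urls, path):
    """Write the list of URLs to path as JSON."""
    with open(path, "w") as f:
        json.dump(urls, f)


def load_urls(path):
    """Read a list of URLs back from a JSON file."""
    with open(path) as f:
        return json.load(f)


# Round trip in a temporary directory so no real backup is overwritten.
sample_urls = ["https://www.bbc.co.uk/programmes/m000gvbv"]
with tempfile.TemporaryDirectory() as tmp:
    backup_path = os.path.join(tmp, "url_list_demo.json")
    save_urls(sample_urls, backup_path)
    restored = load_urls(backup_path)

print(restored == sample_urls)
```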


## run a scrape of the 932 urls

In [134]:
my_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:71.0) Gecko/20100101 Firefox/84.0"}

nb = 0
rows = []

print('START')
dt_string = dt.now().strftime("%d/%m/%Y %H:%M:%S")
print(dt_string)

for x in range(len(url_list)):
    
    url = url_list[nb]
    req = requests.get(url, timeout = 10, headers = my_headers)
   
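    if req.status_code != 200:
        # append() is a method call, so it needs parentheses; a placeholder dict
        # with the same keys keeps rows aligned with url_list and usable by pd.DataFrame
        rows.append({'date':'MISSING', 'subject':'MISSING', 'information':[], 'featured_in_tags':[], 'address':url})
        nb = nb + 1
        continue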
    
    else:
        # name the parser explicitly so bs4 does not have to guess one
        soup = bs(req.content, 'html.parser')
        
        title = soup.find('h1').text
        
        # get the date (which breaks down sometimes if episode is unavailable)
        if soup.find('span', class_ = 'broadcast-event__date text-base timezone--date') is None:
            for x in soup.find_all('div', class_ = 'episode-panel__meta'):
                ep_time = x.text.split(':')[1]
                print('date fixed here!', url)
        else:
            for x in soup.find('span', class_ = 'broadcast-event__date text-base timezone--date'):
                ep_time = x
        
        info = []
        for x in soup.find_all('div', class_ = 'synopsis-toggle__long'):
            for y in x.find_all('p'):
                info.extend(y)
        
        tags = []
        for x in soup.find_all('div', class_ = 'programme block-link highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover'):
            for y in x.find_all('span', class_ = 'programme__title'):
                tags.append(y.text)

        rows.append({'date':ep_time, 'subject':title, 'information':info, 'featured_in_tags':tags, 'address':url})
        
    nb = nb + 1
    if nb%25 == 0:
        print('starting', nb)
    time.sleep(randint(3, 14))

print('DONE!')
dt_string = dt.now().strftime("%d/%m/%Y %H:%M:%S")
print(dt_string)


START
21/01/2021 11:37:39
date fixed here! https://www.bbc.co.uk/programmes/p08f91mx
date fixed here! https://www.bbc.co.uk/programmes/p08b58rd
date fixed here! https://www.bbc.co.uk/programmes/p07hp4q2
date fixed here! https://www.bbc.co.uk/programmes/p08nmz16
date fixed here! https://www.bbc.co.uk/programmes/p08qwkk5
date fixed here! https://www.bbc.co.uk/programmes/p06dzrmx
date fixed here! https://www.bbc.co.uk/programmes/p08qdxtk
date fixed here! https://www.bbc.co.uk/programmes/p08msds7
date fixed here! https://www.bbc.co.uk/programmes/p06f8b6f
date fixed here! https://www.bbc.co.uk/programmes/p088cfz4
date fixed here! https://www.bbc.co.uk/programmes/p08pl28v
date fixed here! https://www.bbc.co.uk/programmes/p08m474r
date fixed here! https://www.bbc.co.uk/programmes/p06f89pm
date fixed here! https://www.bbc.co.uk/programmes/p07jswd0
date fixed here! https://www.bbc.co.uk/programmes/p08dlmmh
date fixed here! https://www.bbc.co.uk/programmes/p06g342f
date fixed here! https://www.b
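
The scrape loop above takes a long time to run, and an unhandled exception from `requests.get` (a timeout, for instance) stops it partway through. The original run did not use retries; what follows is only a sketch of one way to add them. The helper `fetch_with_retries` and its parameters are hypothetical, and the `fetch` callable is passed in so the idea can be tried without touching the network.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); on an exception wait (2s, 4s, ...) and try again.

    The exception from the final attempt is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a fake fetch that fails twice and then succeeds (no network needed).
calls = {"n": 0}
waits = []


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "page for " + url


result = fetch_with_retries(flaky_fetch, "https://www.bbc.co.uk/programmes/m000gvbv", sleep=waits.append)
print(result, waits)
```

In the loop itself, `fetch` could be something like `lambda u: requests.get(u, timeout=10, headers=my_headers)`, and the broad `except Exception` could be narrowed to `requests.exceptions.RequestException`.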

## check the results

In [137]:
# any difference between number of URLs and number of items gathered?

diff = (len(url_list) - len(rows))
diff

0

In [138]:
# what was the last 'nb' run in the loop?

nb

932

In [139]:
# what was the last url checked?

url

'https://www.bbc.co.uk/programmes/m000gvbv'

In [140]:
# what is the last url in the URL list

url_list[-1]

'https://www.bbc.co.uk/programmes/m000gvbv'

## make df and save the raw locally

In [575]:
df = pd.DataFrame(rows)

In [467]:
len(df)

932

In [143]:
df[:4]

Unnamed: 0,date,subject,info,featured_in_tags,address
0,Thu 4 Feb 2021,04/02/2021,[],[],https://www.bbc.co.uk/programmes/m000rvnj
1,Next Thursday,Saint Cuthbert,[Melvyn Bragg and guests the Northumbrian man ...,[],https://www.bbc.co.uk/programmes/m000rll4
2,Today,The Plague of Justinian,[Melvyn Bragg and guests discuss the plague th...,"[Ancient Rome, History]",https://www.bbc.co.uk/programmes/m000rc43
3,Last Thursday,The Great Gatsby,[Melvyn Bragg and guests discuss F Scott Fitzg...,"[20th Century, Culture]",https://www.bbc.co.uk/programmes/m000r4tq
4,Thu 31 Dec 2020,Eclipses,[Melvyn Bragg and guests discuss solar eclipse...,[Science],https://www.bbc.co.uk/programmes/m000qmnj
...,...,...,...,...,...
927,\n26 March 2020\n,26/03/2020,[],[],https://www.bbc.co.uk/programmes/m000glmv
928,\n09 July 2020\n,"1816, the Year Without a Summer (Summer Repeat)","[In a programme first broadcast in 2016, Melvy...",[],https://www.bbc.co.uk/programmes/p08k3fjp
929,\n14 May 2020\n,14/05/2020,[],[],https://www.bbc.co.uk/programmes/m000j1hw
930,\n07 May 2020\n,07/05/2020,[],[],https://www.bbc.co.uk/programmes/m000hvsb


In [144]:
# export a raw df copy locally as csv with no index as a backup

df.to_csv('iot_fullscrape_jan21.csv', index = False)

## clean 

In [577]:
# drop first row

df = df[1:]
df[:3]

Unnamed: 0,date,subject,information,featured_in_tags,address
1,Next Thursday,Saint Cuthbert,[Melvyn Bragg and guests the Northumbrian man ...,[],https://www.bbc.co.uk/programmes/m000rll4
2,Today,The Plague of Justinian,[Melvyn Bragg and guests discuss the plague th...,"[Ancient Rome, History]",https://www.bbc.co.uk/programmes/m000rc43
3,Last Thursday,The Great Gatsby,[Melvyn Bragg and guests discuss F Scott Fitzg...,"[20th Century, Culture]",https://www.bbc.co.uk/programmes/m000r4tq


## repeats

In [578]:
# drop the 28 summer repeats - these only appear in 2018, 2019 and 2020 (hence so few)
nosr = df[~df['subject'].str.contains("Summer Repeat")]

# drop the 8 non-summer repeats
nosr = nosr[~nosr['subject'].str.contains("repeat")]

# the 'May 2020' dates show 4 more repeats - and 2 empty pages
nosr = nosr[~nosr['date'].str.contains("May 2020\n")]

# any remaining date containing a newline is also a repeat or an empty page
nosr = nosr[~nosr['date'].str.contains("\n")]

len(nosr)

880
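
The first two filters above differ only in the capitalisation of 'repeat'. As a sketch, a single case-insensitive pattern covers both, shown here on a plain list with the standard `re` module. The first two titles come from the scrape; the third is invented for the demo. In pandas the same idea would be `str.contains("repeat", case=False)`. This would also drop any genuine episode with 'repeat' in its title, so it is worth checking what gets removed.

```python
import re

# One case-insensitive pattern instead of separate "Summer Repeat" / "repeat" checks.
REPEAT = re.compile(r"repeat", re.IGNORECASE)


def drop_repeats(titles):
    """Keep only the titles that do not mention 'repeat' in any capitalisation."""
    return [t for t in titles if not REPEAT.search(t)]


titles = [
    "1816, the Year Without a Summer (Summer Repeat)",
    "Eclipses",
    "An Invented Title (repeat)",  # hypothetical, for illustration only
]
print(drop_repeats(titles))
```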

## fix dates that are labelled 'boxing', 'christmas' etc

In [505]:
# find the problem days...

nosr[nosr['date'].str.contains("New Year")]

Unnamed: 0,date,subject,information,featured_in_tags,address
245,New Year's Day 2015,The Sun,[Melvyn Bragg and his guests discuss the Sun. ...,"[Solstice, In praise of the Moon, Planets, Sat...",https://www.bbc.co.uk/programmes/b048nlfb
481,New Year's Day 2009,The Consolations of Philosophy,[Melvyn Bragg and guests discuss the consolati...,"[Ancient Rome, Philosophy]",https://www.bbc.co.uk/programmes/b00g46p0


In [579]:
# correct the three instances of 'boxing day' instead of dates

nosr.iat[27, 0] = 'Thu 26 Dec 2019'
nosr.iat[270, 0] = 'Thu 26 Dec 2013'
nosr.iat[682, 0] = 'Thu 26 Dec 2003'

In [580]:
# xmas eve

nosr.iat[191, 0] = 'Thu 24 Dec 2015'
nosr.iat[439, 0] = 'Thu 24 Dec 2009'
nosr.iat[869, 0] = 'Thu 24 Dec 1998'

In [581]:
# new years day

nosr.iat[243, 0] = 'Thu 1 Jan 2015'
nosr.iat[479, 0] = 'Thu 1 Jan 2009'
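
The `iat` fixes above rely on hard-coded row positions, which shift whenever the filtering changes. A sketch of an alternative is to derive the date from the label itself. Only the "New Year's Day 2015" form is visible in the output above; the "Boxing Day ..." and "Christmas Eve ..." label forms, and the helper name `holiday_to_date_label`, are assumptions. The weekday and month abbreviations also assume an English locale.

```python
from datetime import date

# (month, day) for each holiday name that replaces a date on the site
HOLIDAYS = {
    "Boxing Day": (12, 26),
    "Christmas Eve": (12, 24),
    "New Year's Day": (1, 1),
}


def holiday_to_date_label(label):
    """Turn e.g. "New Year's Day 2015" into 'Thu 1 Jan 2015'."""
    name, year = label.rsplit(" ", 1)
    month, day = HOLIDAYS[name]
    d = date(int(year), month, day)
    # build the string by hand so the day number is not zero-padded
    return f"{d:%a} {d.day} {d:%b} {d.year}"


print(holiday_to_date_label("New Year's Day 2015"))
print(holiday_to_date_label("Boxing Day 2019"))
```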

In [582]:
# we could fix the entries for last week, this week and next week, but let's drop them; each will be picked up with a proper date once it has aired

nosr = nosr[~nosr['date'].str.contains("Next ")]
nosr = nosr[~nosr['date'].str.contains("Today")]
nosr = nosr[~nosr['date'].str.contains("Last ")]


In [583]:
len(nosr)

877

## remove one remaining empty row

In [642]:
len(nosr)

877

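In [633]:
# note: date_str, date_dt, day and summary are created in the 'fix time' and 'fix text 1' sections below (cells 584-591), which were run before this cell

nosr[nosr.date_str == '01-02-2007']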

Unnamed: 0,date,subject,information,featured_in_tags,address,date_str,date_dt,day,summary
560,Thu 1 Feb 2007,01/02/2007,[],[],https://www.bbc.co.uk/programmes/b00773nl,01-02-2007,2007-02-01,Thursday,EMPTY
561,Thu 1 Feb 2007,Genghis Khan,[Melvyn Bragg and guests discuss Genghis Khan....,"[Medieval, History]",https://www.bbc.co.uk/programmes/b00773mr,01-02-2007,2007-02-01,Thursday,Melvyn Bragg and guests discuss Genghis Khan. ...


In [641]:
nosr[555:556]

Unnamed: 0,date,subject,information,featured_in_tags,address,date_str,date_dt,day,summary
560,Thu 1 Feb 2007,01/02/2007,[],[],https://www.bbc.co.uk/programmes/b00773nl,01-02-2007,2007-02-01,Thursday,EMPTY


In [647]:
# remove the empty duplicate row

nosr = nosr.drop(nosr.index[[555]])


In [648]:
# check latest length

len(nosr)

876

## fix time


In [584]:
# convert the 'date' col (e.g. 'Thu 31 Dec 2020') into a day-month-year string (e.g. '31-12-2020')

nosr['date_str'] = nosr['date'].apply(lambda x: time.strftime("%d-%m-%Y", time.strptime(x,"%a %d %b %Y")))


In [586]:
# same data but in datetime format

nosr['date_dt'] = pd.to_datetime(nosr.date_str, dayfirst = True) 

In [588]:
# add a col with day name 

nosr['day'] = nosr.date_dt.dt.day_name()
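
The three date columns above can be checked on a single value without pandas. The sketch below uses the same `"%a %d %b %Y"` format string as the cell above; `parse_broadcast_date` is a hypothetical helper name, and the day names assume an English locale.

```python
from datetime import datetime


def parse_broadcast_date(text):
    """Parse e.g. 'Thu 3 Dec 2020' and return (date_str, day name)."""
    parsed = datetime.strptime(text, "%a %d %b %Y")
    return parsed.strftime("%d-%m-%Y"), parsed.strftime("%A")


print(parse_broadcast_date("Thu 3 Dec 2020"))
print(parse_broadcast_date("Thu 31 Dec 2020"))
```

Labels such as 'Next Thursday' or "New Year's Day 2015" raise a `ValueError` here, which is why they are fixed or dropped before this step.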

In [589]:
# check our layout

nosr[:4]

Unnamed: 0,date,subject,information,featured_in_tags,address,date_str,date_dt,day
4,Thu 31 Dec 2020,Eclipses,[Melvyn Bragg and guests discuss solar eclipse...,[Science],https://www.bbc.co.uk/programmes/m000qmnj,31-12-2020,2020-12-31,Thursday
5,Thu 17 Dec 2020,The Cultural Revolution,[Melvyn Bragg and guests discuss Chairman Mao ...,"[20th Century, History]",https://www.bbc.co.uk/programmes/m000q9b6,17-12-2020,2020-12-17,Thursday
6,Thu 10 Dec 2020,John Wesley and Methodism,[Melvyn Bragg and guests discuss John Wesley (...,"[18th Century, Religion]",https://www.bbc.co.uk/programmes/m000q3m2,10-12-2020,2020-12-10,Thursday
7,Thu 3 Dec 2020,Fernando Pessoa,[Melvyn Bragg and guests discuss the Portugues...,"[20th Century, Culture]",https://www.bbc.co.uk/programmes/m000q0yj,03-12-2020,2020-12-03,Thursday


In [590]:
len(nosr)

877

## fix text 1

In [591]:
# extract episode summaries into a list and a col

episode_list = []
people_list = []  # remaining paragraphs (guest details), kept for later extraction

for x in nosr.information:
    if len(x) == 0:
        episode_list.append('EMPTY')
    else:
        # the first paragraph is the episode summary
        episode_list.append(x[0])
    people_list.append(x[1:])

nosr['summary'] = episode_list
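
The split used above (first paragraph as the summary, the rest as guest details) can be written as a small function, which makes it easy to test on one row. `split_information` is a hypothetical name, and the sample paragraphs are invented stand-ins for a scraped row.

```python
def split_information(paragraphs):
    """Return (summary, remaining paragraphs); 'EMPTY' marks a page with no text."""
    if len(paragraphs) == 0:
        return "EMPTY", []
    return paragraphs[0], list(paragraphs[1:])


# Invented paragraphs standing in for one scraped 'information' list.
sample_info = [
    "Melvyn Bragg and guests discuss an example topic.",
    "With",
    "A Guest, Professor of Examples",
]
print(split_information(sample_info))
print(split_information([]))
```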


In [592]:
nosr[:4]


Unnamed: 0,date,subject,information,featured_in_tags,address,date_str,date_dt,day,summary
4,Thu 31 Dec 2020,Eclipses,[Melvyn Bragg and guests discuss solar eclipse...,[Science],https://www.bbc.co.uk/programmes/m000qmnj,31-12-2020,2020-12-31,Thursday,Melvyn Bragg and guests discuss solar eclipses...
5,Thu 17 Dec 2020,The Cultural Revolution,[Melvyn Bragg and guests discuss Chairman Mao ...,"[20th Century, History]",https://www.bbc.co.uk/programmes/m000q9b6,17-12-2020,2020-12-17,Thursday,Melvyn Bragg and guests discuss Chairman Mao a...
6,Thu 10 Dec 2020,John Wesley and Methodism,[Melvyn Bragg and guests discuss John Wesley (...,"[18th Century, Religion]",https://www.bbc.co.uk/programmes/m000q3m2,10-12-2020,2020-12-10,Thursday,Melvyn Bragg and guests discuss John Wesley (1...
7,Thu 3 Dec 2020,Fernando Pessoa,[Melvyn Bragg and guests discuss the Portugues...,"[20th Century, Culture]",https://www.bbc.co.uk/programmes/m000q0yj,03-12-2020,2020-12-03,Thursday,Melvyn Bragg and guests discuss the Portuguese...
8,Thu 26 Nov 2020,The Zong Massacre,[Melvyn Bragg and guests discuss the notorious...,"[18th Century, History]",https://www.bbc.co.uk/programmes/m000pqbz,26-11-2020,2020-11-26,Thursday,Melvyn Bragg and guests discuss the notorious ...
...,...,...,...,...,...,...,...,...,...
877,Thu 12 Nov 1998,The City in the 20th Century,"[Melvyn Bragg and guests discuss the artistic,...","[The Built Environment, 20th Century, Culture,...",https://www.bbc.co.uk/programmes/p005457r,12-11-1998,1998-11-12,Thursday,"Melvyn Bragg and guests discuss the artistic, ..."
878,Thu 5 Nov 1998,Science in the 20th century,[Melvyn Bragg and guests discuss how perceptio...,"[20th Century, Science]",https://www.bbc.co.uk/programmes/p005457h,05-11-1998,1998-11-05,Thursday,Melvyn Bragg and guests discuss how perception...
879,Thu 29 Oct 1998,Science's Revelations,[Melvyn Bragg and guests discuss whether the m...,"[20th Century, Science]",https://www.bbc.co.uk/programmes/p005454c,29-10-1998,1998-10-29,Thursday,Melvyn Bragg and guests discuss whether the ma...
880,Thu 22 Oct 1998,Politics in the 20th Century,[Melvyn Bragg talks to Gore Vidal and Alan Cla...,"[20th Century, History]",https://www.bbc.co.uk/programmes/p005456z,22-10-1998,1998-10-22,Thursday,Melvyn Bragg talks to Gore Vidal and Alan Clar...


## clean up this version of dataset (and export as interim backup)

> Note - the guest details still have to be extracted and cleaned from the 'information' column

> Note - pickling is a better export than .csv, since the datetime and list types are preserved (a .csv turns them into plain strings)
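
The point about pickling can be seen on a single row. The sketch below uses the standard `csv` and `pickle` modules instead of pandas so it runs anywhere; the row is a cut-down version of the 'Eclipses' entry. With a dataframe, the corresponding calls would be `df.to_pickle(...)` and `pd.read_pickle(...)`.

```python
import csv
import io
import pickle
from datetime import datetime

row = {
    "subject": "Eclipses",
    "featured_in_tags": ["Science"],
    "date_dt": datetime(2020, 12, 31),
}

# CSV round trip: every value is written as text and read back as text
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# pickle round trip: the list and the datetime come back as the same types
from_pickle = pickle.loads(pickle.dumps(row))

print(type(from_csv["featured_in_tags"]), type(from_pickle["featured_in_tags"]))
print(type(from_csv["date_dt"]), type(from_pickle["date_dt"]))
```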

In [683]:
test = nosr[['subject', 'summary', 'date_str', 'featured_in_tags', 'address', 'date_dt', 'day']]

In [684]:
test

Unnamed: 0,subject,summary,date_str,featured_in_tags,address,date_dt,day
4,Eclipses,Melvyn Bragg and guests discuss solar eclipses...,31-12-2020,[Science],https://www.bbc.co.uk/programmes/m000qmnj,2020-12-31,Thursday
5,The Cultural Revolution,Melvyn Bragg and guests discuss Chairman Mao a...,17-12-2020,"[20th Century, History]",https://www.bbc.co.uk/programmes/m000q9b6,2020-12-17,Thursday
6,John Wesley and Methodism,Melvyn Bragg and guests discuss John Wesley (1...,10-12-2020,"[18th Century, Religion]",https://www.bbc.co.uk/programmes/m000q3m2,2020-12-10,Thursday
7,Fernando Pessoa,Melvyn Bragg and guests discuss the Portuguese...,03-12-2020,"[20th Century, Culture]",https://www.bbc.co.uk/programmes/m000q0yj,2020-12-03,Thursday
8,The Zong Massacre,Melvyn Bragg and guests discuss the notorious ...,26-11-2020,"[18th Century, History]",https://www.bbc.co.uk/programmes/m000pqbz,2020-11-26,Thursday
...,...,...,...,...,...,...,...
877,The City in the 20th Century,"Melvyn Bragg and guests discuss the artistic, ...",12-11-1998,"[The Built Environment, 20th Century, Culture,...",https://www.bbc.co.uk/programmes/p005457r,1998-11-12,Thursday
878,Science in the 20th century,Melvyn Bragg and guests discuss how perception...,05-11-1998,"[20th Century, Science]",https://www.bbc.co.uk/programmes/p005457h,1998-11-05,Thursday
879,Science's Revelations,Melvyn Bragg and guests discuss whether the ma...,29-10-1998,"[20th Century, Science]",https://www.bbc.co.uk/programmes/p005454c,1998-10-29,Thursday
880,Politics in the 20th Century,Melvyn Bragg talks to Gore Vidal and Alan Clar...,22-10-1998,"[20th Century, History]",https://www.bbc.co.uk/programmes/p005456z,1998-10-22,Thursday


In [686]:
# sort the df by date, oldest (1998) first

test = test.sort_values('date_dt')

Unnamed: 0,subject,summary,date_str,featured_in_tags,address,date_dt,day
881,War in the 20th Century,In the first programme of a new series examini...,15-10-1998,"[20th Century, History, In Our Time: American ...",https://www.bbc.co.uk/programmes/p0054578,1998-10-15,Thursday
880,Politics in the 20th Century,Melvyn Bragg talks to Gore Vidal and Alan Clar...,22-10-1998,"[20th Century, History]",https://www.bbc.co.uk/programmes/p005456z,1998-10-22,Thursday
879,Science's Revelations,Melvyn Bragg and guests discuss whether the ma...,29-10-1998,"[20th Century, Science]",https://www.bbc.co.uk/programmes/p005454c,1998-10-29,Thursday
878,Science in the 20th century,Melvyn Bragg and guests discuss how perception...,05-11-1998,"[20th Century, Science]",https://www.bbc.co.uk/programmes/p005457h,1998-11-05,Thursday
877,The City in the 20th Century,"Melvyn Bragg and guests discuss the artistic, ...",12-11-1998,"[The Built Environment, 20th Century, Culture,...",https://www.bbc.co.uk/programmes/p005457r,1998-11-12,Thursday
...,...,...,...,...,...,...,...
8,The Zong Massacre,Melvyn Bragg and guests discuss the notorious ...,26-11-2020,"[18th Century, History]",https://www.bbc.co.uk/programmes/m000pqbz,2020-11-26,Thursday
7,Fernando Pessoa,Melvyn Bragg and guests discuss the Portuguese...,03-12-2020,"[20th Century, Culture]",https://www.bbc.co.uk/programmes/m000q0yj,2020-12-03,Thursday
6,John Wesley and Methodism,Melvyn Bragg and guests discuss John Wesley (1...,10-12-2020,"[18th Century, Religion]",https://www.bbc.co.uk/programmes/m000q3m2,2020-12-10,Thursday
5,The Cultural Revolution,Melvyn Bragg and guests discuss Chairman Mao a...,17-12-2020,"[20th Century, History]",https://www.bbc.co.uk/programmes/m000q9b6,2020-12-17,Thursday


In [632]:
# examine the breakdown by year

test.groupby(test.date_dt.dt.year)['subject'].count()


date_dt
1998    12
1999    44
2000    34
2001    33
2002    38
2003    36
2004    40
2005    37
2006    42
2007    43
2008    41
2009    42
2010    41
2011    43
2012    43
2013    40
2014    38
2015    41
2016    39
2017    34
2018    37
2019    39
2020    40
Name: subject, dtype: int64

In [689]:
# keep a copy of this df as the cleanest thing we have so far

test.to_csv('iot_df_halfclean.csv', index = False)

In [688]:
# also keep a copy of nosr, which still has all the columns (including the uncleaned 'information' col)

nosr.to_csv('iot_df_raw_876.csv', index = False)