# Web Scraping
---
https://www.w3schools.com/tags/ref_httpmethods.asp

Types of HTTP requests:
- GET: request data from a resource
- POST: send data to a server to create or update a resource
- PUT: like POST, but idempotent (repeating the same PUT request has the same effect as sending it once)

Types of HTTP status codes:
- 200: everything went okay (the 200s indicate success)
- 300s: redirection
- 400s: client errors (e.g. 404 Not Found)
- 500s: server errors

see: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
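
As a rough sketch of how the code ranges fit together, the helper below labels a status code by its first digit. The name `status_class` is made up for illustration; `HTTPStatus` comes from Python's standard library and supplies the standard reason phrase.

```python
from http import HTTPStatus

def status_class(code):
    """Return a rough label for an HTTP status code based on its range."""
    if 100 <= code < 200:
        return 'informational'
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirection'
    if 400 <= code < 500:
        return 'client error'
    if 500 <= code < 600:
        return 'server error'
    return 'unknown'

for code in (200, 301, 404, 429, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. 'Not Found'
    print(code, HTTPStatus(code).phrase, '->', status_class(code))
```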

Example of an API: http://api.open-notify.org/iss-now.json (returns the current latitude and longitude of the International Space Station)

In [2]:
import datetime as dt
import json
import requests

In [2]:
#we want to access the location of the ISS through Python
url = r'http://api.open-notify.org/iss-now.json'
r = requests.get(url)

In [3]:
r.status_code #the HTTP status code of the response

200

In [4]:
r.text #the response text, this is a string

'{"iss_position": {"latitude": "41.6512", "longitude": "-63.0113"}, "timestamp": 1602682093, "message": "success"}'

In [5]:
r.content #same as above but as raw bytes: useful for anything that isn't text, such as images

b'{"iss_position": {"latitude": "41.6512", "longitude": "-63.0113"}, "timestamp": 1602682093, "message": "success"}'

In [11]:
response = json.loads(r.text) #string has been converted to a dictionary (r.json() does the same in one step)
response #timestamp is Unix time (seconds since 1970-01-01 00:00 UTC)

{'iss_position': {'latitude': '41.6512', 'longitude': '-63.0113'},
 'timestamp': 1602682093,
 'message': 'success'}

In [9]:
response['iss_position']['latitude'] #now we can get the latitude

'41.6512'
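
Note that the latitude comes back as a string, not a number. A minimal sketch of casting the coordinates to floats, using the response text copied from the output above so that it runs without a network call:

```python
import json

# Response text copied from the r.text output above
text = '{"iss_position": {"latitude": "41.6512", "longitude": "-63.0113"}, "timestamp": 1602682093, "message": "success"}'
response = json.loads(text)

# The API returns coordinates as strings, so cast them before doing arithmetic
position = {key: float(value) for key, value in response['iss_position'].items()}
print(position)
```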

In [13]:
dt.datetime.utcfromtimestamp(response['timestamp']) #converted time

datetime.datetime(2020, 10, 14, 13, 28, 13)
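
`utcfromtimestamp` returns a naive datetime (no timezone attached) and is deprecated from Python 3.12 onwards. A small sketch of the timezone-aware alternative, using the timestamp from the response above:

```python
import datetime as dt

timestamp = 1602682093  # value from the response above

# fromtimestamp with an explicit tz returns a timezone-aware datetime
aware = dt.datetime.fromtimestamp(timestamp, tz=dt.timezone.utc)
print(aware.isoformat())
```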

## ISS Pass Times
see: http://api.open-notify.org/iss-pass.json?lat=LAT&lon=LON

In [14]:
url = r'http://api.open-notify.org/iss-pass.json?lat=51.5074&lon=0.1278' #approximate lat and long for London (strictly, London's longitude is -0.1278)
#everything after the question mark is a query parameter that we can change
r = requests.get(url)
response = json.loads(r.text)
response

{'message': 'success',
 'request': {'altitude': 100,
  'datetime': 1602682762,
  'latitude': 51.5074,
  'longitude': 0.1278,
  'passes': 5},
 'response': [{'duration': 282, 'risetime': 1602694171},
  {'duration': 490, 'risetime': 1602748711},
  {'duration': 636, 'risetime': 1602754400},
  {'duration': 656, 'risetime': 1602760187}]}
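
Rather than typing the query string by hand, we can build it from a dictionary. The sketch below uses the standard library's `urlencode` so it runs offline; the variable names are only illustrative.

```python
from urllib.parse import urlencode

base_url = 'http://api.open-notify.org/iss-pass.json'
params = {'lat': 51.5074, 'lon': 0.1278}  # same values as the cell above

# urlencode turns the dict into 'lat=51.5074&lon=0.1278'
url = base_url + '?' + urlencode(params)
print(url)
```

With requests, `requests.get(base_url, params=params)` builds the same kind of query string for us.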

In [25]:
#convert the times to datetimes
[dt.datetime.utcfromtimestamp(i['risetime']) for i in response['response']] #easy with a list comprehension

[datetime.datetime(2020, 10, 14, 16, 49, 31),
 datetime.datetime(2020, 10, 15, 7, 58, 31),
 datetime.datetime(2020, 10, 15, 9, 33, 20),
 datetime.datetime(2020, 10, 15, 11, 9, 47)]
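
Each pass also has a duration. Assuming the duration is measured in seconds (an assumption worth checking against the API docs), a sketch of working out when each pass ends, using two entries copied from the response above:

```python
import datetime as dt

# First two entries copied from the 'response' list above
passes = [{'duration': 282, 'risetime': 1602694171},
          {'duration': 490, 'risetime': 1602748711}]

windows = []
for p in passes:
    start = dt.datetime.fromtimestamp(p['risetime'], tz=dt.timezone.utc)
    end = start + dt.timedelta(seconds=p['duration'])  # assumes duration is in seconds
    windows.append((start, end))

for start, end in windows:
    print(start.strftime('%Y-%m-%d %H:%M:%S'), '->', end.strftime('%H:%M:%S'))
```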

## Pokemon API

In [26]:
pokemon = 'charizard'
url = rf'https://pokeapi.co/api/v2/pokemon/{pokemon}'
r = requests.get(url)

In [27]:
response = json.loads(r.text)

In [29]:
response.keys()

dict_keys(['abilities', 'base_experience', 'forms', 'game_indices', 'height', 'held_items', 'id', 'is_default', 'location_area_encounters', 'moves', 'name', 'order', 'species', 'sprites', 'stats', 'types', 'weight'])

In [31]:
for m in response['moves']:
    print(m['move']['name'])

mega-punch
fire-punch
thunder-punch
scratch
swords-dance
cut
wing-attack
fly
mega-kick
headbutt
body-slam
take-down
double-edge
leer
growl
roar
ember
flamethrower
hyper-beam
submission
counter
seismic-toss
strength
solar-beam
dragon-rage
fire-spin
earthquake
fissure
dig
toxic
rage
mimic
double-team
smokescreen
defense-curl
reflect
bide
fire-blast
swift
skull-bash
rest
rock-slide
slash
substitute
snore
curse
protect
scary-face
mud-slap
outrage
sandstorm
endure
swagger
fury-cutter
steel-wing
attract
sleep-talk
return
frustration
dynamic-punch
dragon-breath
iron-tail
metal-claw
hidden-power
twister
sunny-day
rock-smash
heat-wave
will-o-wisp
facade
focus-punch
brick-break
secret-power
blast-burn
air-cutter
overheat
rock-tomb
aerial-ace
dragon-claw
roost
natural-gift
tailwind
fling
flare-blitz
air-slash
dragon-pulse
focus-blast
giga-impact
shadow-claw
fire-fang
defog
captivate
ominous-wind
hone-claws
flame-burst
flame-charge
round
echoed-voice
sky-drop
incinerate
inferno
fire-pledge
bulldoz
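
Printing every move gives a long list. A small sketch of collecting the names into a list and filtering it instead; `sample` is a cut-down stand-in for the real response that keeps only the keys used above.

```python
# A cut-down stand-in for the real response, keeping only the keys used above
sample = {'moves': [{'move': {'name': 'mega-punch'}},
                    {'move': {'name': 'fire-punch'}},
                    {'move': {'name': 'scratch'}},
                    {'move': {'name': 'fly'}}]}

# Collect the names first, then filter the plain list of strings
move_names = [m['move']['name'] for m in sample['moves']]
punches = [name for name in move_names if name.endswith('-punch')]
print(len(move_names), punches)
```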

# Scraping websites with BeautifulSoup
- Some stuff on HTML: https://www.w3schools.com/html/html_basic.asp
- HTML Tags: https://www.w3schools.com/TAGS/ref_byfunc.asp

In [33]:
from bs4 import BeautifulSoup

In [32]:
url = r'https://www.w3schools.com/TAGS/ref_byfunc.asp'
r = requests.get(url)
r.text #press ctrl+shift+i to see it on the website



In [34]:
#we create a soup object from the html
soup = BeautifulSoup(r.text,'html.parser')
soup


<!DOCTYPE html>

<html lang="en-US">
<head>
<title>HTML Reference</title>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="HTML,CSS,JavaScript,SQL,PHP,jQuery,XML,DOM,Bootstrap,Python,Java,Web development,W3C,tutorials,programming,training,learning,quiz,primer,lessons,references,examples,exercises,source code,colors,demos,tips" name="Keywords"/>
<meta content="Well organized and easy to understand Web building tutorials with lots of examples of how to use HTML, CSS, JavaScript, SQL, PHP, Python, Bootstrap, Java and XML." name="Description"/>
<link href="/favicon.ico" rel="icon" type="image/x-icon"/>
<link href="/w3css/4/w3.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Source Code Pro" rel="stylesheet"/>
<style>
a:hover,a:active{color:#4CAF50}
table.w3-table-all{margin:20px 0}
/*OPPSETT AV TOP, TOPNAV, SIDENAV, MAIN, RIGHT OG FOOTER:*/
.top {
position:relative;
background-color:#ffffff;
height:68px

In [37]:
#now we want the basic HTML table at the top of the webpage
tag = 'table'
attributes = {'class':'w3-table-all notranslate'}
table_soup = soup.find(tag,attributes) #this finds the first match with the right tag and attributes
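
To make the difference between `find` and `find_all` concrete, here is a sketch on a tiny made-up page (the HTML and class names are hypothetical, not taken from w3schools):

```python
from bs4 import BeautifulSoup

# Hypothetical three-table page, just to show the difference between the calls
html = '''
<table class="wanted"><tr><td>first</td></tr></table>
<table class="other"><tr><td>second</td></tr></table>
<table class="wanted"><tr><td>third</td></tr></table>
'''
soup = BeautifulSoup(html, 'html.parser')

first_match = soup.find('table', {'class': 'wanted'})      # a single Tag (or None if nothing matches)
all_matches = soup.find_all('table', {'class': 'wanted'})  # a list of every matching Tag

print(first_match.td.text)               # text of the first matching table
print([t.td.text for t in all_matches])  # text of every matching table
```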

In [38]:
#store table data as a list of lists
table_data = []
for row in table_soup.find_all('tr'):
    print(row)
    print('----') #gets us all of the table rows

<tr>
<th style="width:20%">Tag</th>
<th>Description</th>
</tr>
----
<tr>
<td><a href="tag_doctype.asp">&lt;!DOCTYPE&gt;</a> </td>
<td>Defines the document type</td>
</tr>
----
<tr>
<td><a href="tag_html.asp">&lt;html&gt;</a></td>
<td>Defines an HTML document</td>
</tr>
----
<tr>
<td><a href="tag_head.asp">&lt;head&gt;</a></td>
<td>Contains metadata/information for the document</td>
</tr>
----
<tr>
<td><a href="tag_title.asp">&lt;title&gt;</a></td>
<td>Defines a title for the document</td>
</tr>
----
<tr>
<td><a href="tag_body.asp">&lt;body&gt;</a></td>
<td>Defines the document's body</td>
</tr>
----
<tr>
<td><a href="tag_hn.asp">&lt;h1&gt; to &lt;h6&gt;</a></td>
<td> Defines HTML headings</td>
</tr>
----
<tr>
<td><a href="tag_p.asp">&lt;p&gt;</a></td>
<td>Defines a paragraph</td>
</tr>
----
<tr>
<td><a href="tag_br.asp">&lt;br&gt;</a></td>
<td>Inserts a single line break</td>
</tr>
----
<tr>
<td><a href="tag_hr.asp">&lt;hr&gt;</a></td>
<td> Defines a thematic change in the content</td>

In [39]:
for row in table_soup.find_all('tr'):
    row_text = [e.text for e in row.find_all('td')]
    table_data.append(row_text)
table_data    

[[],
 ['<!DOCTYPE>\xa0', 'Defines the document type'],
 ['<html>', 'Defines an HTML document'],
 ['<head>', 'Contains metadata/information for the document'],
 ['<title>', 'Defines a title for the document'],
 ['<body>', "Defines the document's body"],
 ['<h1> to <h6>', ' Defines HTML headings'],
 ['<p>', 'Defines a paragraph'],
 ['<br>', 'Inserts a single line break'],
 ['<hr>', ' Defines a thematic change in the content'],
 ['<!--...-->', 'Defines a comment']]

In [41]:
table_data = [] #reset the list, otherwise re-running the cell appends the rows again
for row in table_soup.find_all('tr'):
    row_text = [e.text.strip() for e in row.find_all('td')] #add a .strip to remove annoying whitespace
    table_data.append(row_text)
table_data #the first (empty) list is the header row, which has th cells rather than td cells

[[],
 ['<!DOCTYPE>', 'Defines the document type'],
 ['<html>', 'Defines an HTML document'],
 ['<head>', 'Contains metadata/information for the document'],
 ['<title>', 'Defines a title for the document'],
 ['<body>', "Defines the document's body"],
 ['<h1> to <h6>', 'Defines HTML headings'],
 ['<p>', 'Defines a paragraph'],
 ['<br>', 'Inserts a single line break'],
 ['<hr>', 'Defines a thematic change in the content'],
 ['<!--...-->', 'Defines a comment']]

## Pulling tables using pandas

In [43]:
import pandas as pd
list_df = pd.read_html(url)


In [44]:
list_df[0]

Unnamed: 0,Tag,Description
0,<!DOCTYPE>,Defines the document type
1,<html>,Defines an HTML document
2,<head>,Contains metadata/information for the document
3,<title>,Defines a title for the document
4,<body>,Defines the document's body
5,<h1> to <h6>,Defines HTML headings
6,<p>,Defines a paragraph
7,<br>,Inserts a single line break
8,<hr>,Defines a thematic change in the content
9,<!--...-->,Defines a comment


In [45]:
len(list_df)

12

In [46]:
list_df[3] #really easy

Unnamed: 0,Tag,Description
0,<frame>,Not supported in HTML5.Defines a window (a fra...
1,<frameset>,Not supported in HTML5.Defines a set of frames
2,<noframes>,Not supported in HTML5.Defines an alternate co...
3,<iframe>,Defines an inline frame


## World cup wiki page
https://en.wikipedia.org/wiki/FIFA_World_Cup

In [47]:
url = r'https://en.wikipedia.org/wiki/FIFA_World_Cup'
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>FIFA World Cup - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"32bca89f-5348-434a-b363-df0259f9ce7f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"FIFA_World_Cup","wgTitle":"FIFA World Cup","wgCurRevisionId":982193558,"wgRevisionId":982193558,"wgArticleId":11370,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","All articles with dead external links","Articles with dead external links from November 2017","Articles with permanently dead exter

In [52]:
#now inspect the webpage for the table that we want
tag = 'table'
attributes = {'class':'wikitable'}
len(soup.find_all(tag,attributes)) #there are 6 of them though
table_soup = soup.find_all(tag,attributes)[1] #ours is the second on the page so we will just index it

In [53]:
table_data = []
for row in table_soup.find_all('tr'):
    row_text = [e.text.strip() for e in row.find_all('td')] #add a .strip to remove annoying whitespace
    table_data.append(row_text)
table_data #we have the right data but we are missing the headers

[['', '', ''],
 ['1',
  '1930',
  'Uruguay',
  'Uruguay',
  '4–2 Estadio Centenario, Montevideo',
  'Argentina',
  'United States',
  '[note 1]',
  'Yugoslavia',
  '13'],
 ['2',
  '1934',
  'Italy',
  'Italy',
  '2–1 (a.e.t.) Stadio Nazionale PNF, Rome',
  'Czechoslovakia',
  'Germany',
  '3–2 Stadio Giorgio Ascarelli, Naples',
  'Austria',
  '16'],
 ['3',
  '1938',
  'France',
  'Italy',
  '4–2 Stade de Colombes, Paris',
  'Hungary',
  'Brazil',
  '4–2 Parc Lescure, Bordeaux',
  'Sweden',
  '15'],
 ['1942', 'Editions cancelled without organization because of World War II'],
 ['1946'],
 ['4',
  '1950',
  'Brazil',
  '',
  'Uruguay',
  '[note 2]2–1 Maracanã, Rio de Janeiro',
  'Brazil',
  '',
  'Sweden',
  '[note 2]3–1 Pacaembu, São Paulo',
  'Spain',
  '',
  '13'],
 ['5',
  '1954',
  'Switzerland',
  'West Germany',
  '3–2 Wankdorfstadion, Bern',
  'Hungary',
  'Austria',
  '3–1 Hardturm, Zürich',
  'Uruguay',
  '16'],
 ['6',
  '1958',
  'Sweden',
  'Brazil',
  '5–2 Råsundastadion, Sol

In [54]:
table_data = []
for row in table_soup.find_all('tr'):
    header_text = [e.text.strip() for e in row.find_all('th')]
    row_text = [e.text.strip() for e in row.find_all('td')] #add a .strip to remove annoying whitespace
    if header_text:
        table_data.append(header_text)
    else:
        table_data.append(row_text)
table_data 

[['Edition',
  'Year',
  'Hosts',
  'Champions',
  'Score and Venue',
  'Runners-up',
  'Third place',
  'Score and Venue',
  'Fourth place',
  'No. of Teams'],
 ['1',
  '1930',
  'Uruguay',
  'Uruguay',
  '4–2 Estadio Centenario, Montevideo',
  'Argentina',
  'United States',
  '[note 1]',
  'Yugoslavia',
  '13'],
 ['2',
  '1934',
  'Italy',
  'Italy',
  '2–1 (a.e.t.) Stadio Nazionale PNF, Rome',
  'Czechoslovakia',
  'Germany',
  '3–2 Stadio Giorgio Ascarelli, Naples',
  'Austria',
  '16'],
 ['3',
  '1938',
  'France',
  'Italy',
  '4–2 Stade de Colombes, Paris',
  'Hungary',
  'Brazil',
  '4–2 Parc Lescure, Bordeaux',
  'Sweden',
  '15'],
 ['1942', 'Editions cancelled without organization because of World War II'],
 ['1946'],
 ['4',
  '1950',
  'Brazil',
  '',
  'Uruguay',
  '[note 2]2–1 Maracanã, Rio de Janeiro',
  'Brazil',
  '',
  'Sweden',
  '[note 2]3–1 Pacaembu, São Paulo',
  'Spain',
  '',
  '13'],
 ['5',
  '1954',
  'Switzerland',
  'West Germany',
  '3–2 Wankdorfstadion, Be
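
One caveat with the if/else above: a row that mixes th and td cells would keep only its th cells. A sketch of an alternative that collects both cell types in one call, shown on a made-up mini table (the HTML is hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical mini table: a header row plus a data row whose first cell is a <th>
html = '''
<table>
<tr><th>Year</th><th>Hosts</th><th>Champions</th></tr>
<tr><th>1930</th><td> Uruguay </td><td>Uruguay</td></tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').find('table')

rows = []
for tr in table.find_all('tr'):
    # Passing a list of names matches either tag, in document order
    cells = [cell.text.strip() for cell in tr.find_all(['th', 'td'])]
    rows.append(cells)

print(rows)
```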

In [55]:
final_table = pd.DataFrame(table_data[1:])

In [56]:
final_table #the two rows for the cancelled 1942 and 1946 editions have fewer cells and break the column alignment

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,1,1930,Uruguay,Uruguay,"4–2 Estadio Centenario, Montevideo",Argentina,United States,[note 1],Yugoslavia,13,,,
1,2,1934,Italy,Italy,"2–1 (a.e.t.) Stadio Nazionale PNF, Rome",Czechoslovakia,Germany,"3–2 Stadio Giorgio Ascarelli, Naples",Austria,16,,,
2,3,1938,France,Italy,"4–2 Stade de Colombes, Paris",Hungary,Brazil,"4–2 Parc Lescure, Bordeaux",Sweden,15,,,
3,1942,Editions cancelled without organization becaus...,,,,,,,,,,,
4,1946,,,,,,,,,,,,
5,4,1950,Brazil,,Uruguay,"[note 2]2–1 Maracanã, Rio de Janeiro",Brazil,,Sweden,"[note 2]3–1 Pacaembu, São Paulo",Spain,,13.0
6,5,1954,Switzerland,West Germany,"3–2 Wankdorfstadion, Bern",Hungary,Austria,"3–1 Hardturm, Zürich",Uruguay,16,,,
7,6,1958,Sweden,Brazil,"5–2 Råsundastadion, Solna",Sweden,France,"6–3 Ullevi, Gothenburg",West Germany,16,,,
8,7,1962,Chile,Brazil,"3–1 Estadio Nacional, Santiago",Czechoslovakia,Chile,"1–0 Estadio Nacional, Santiago",Yugoslavia,16,,,
9,8,1966,England,England,"4–2 (a.e.t.) Wembley Stadium, London",West Germany,Portugal,"2–1 Wembley Stadium, London",Soviet Union,16,,,


In [58]:
table_data_filt = table_data[0:4] + table_data[6:]

In [59]:
final_table = pd.DataFrame(table_data_filt[1:])

In [60]:
final_table #the 1950 (Brazil) row still has extra cells, so its values sit in the wrong columns

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,1,1930,Uruguay,Uruguay,"4–2 Estadio Centenario, Montevideo",Argentina,United States,[note 1],Yugoslavia,13,,,
1,2,1934,Italy,Italy,"2–1 (a.e.t.) Stadio Nazionale PNF, Rome",Czechoslovakia,Germany,"3–2 Stadio Giorgio Ascarelli, Naples",Austria,16,,,
2,3,1938,France,Italy,"4–2 Stade de Colombes, Paris",Hungary,Brazil,"4–2 Parc Lescure, Bordeaux",Sweden,15,,,
3,4,1950,Brazil,,Uruguay,"[note 2]2–1 Maracanã, Rio de Janeiro",Brazil,,Sweden,"[note 2]3–1 Pacaembu, São Paulo",Spain,,13.0
4,5,1954,Switzerland,West Germany,"3–2 Wankdorfstadion, Bern",Hungary,Austria,"3–1 Hardturm, Zürich",Uruguay,16,,,
5,6,1958,Sweden,Brazil,"5–2 Råsundastadion, Solna",Sweden,France,"6–3 Ullevi, Gothenburg",West Germany,16,,,
6,7,1962,Chile,Brazil,"3–1 Estadio Nacional, Santiago",Czechoslovakia,Chile,"1–0 Estadio Nacional, Santiago",Yugoslavia,16,,,
7,8,1966,England,England,"4–2 (a.e.t.) Wembley Stadium, London",West Germany,Portugal,"2–1 Wembley Stadium, London",Soviet Union,16,,,
8,9,1970,Mexico,Brazil,"4–1 Estadio Azteca, Mexico City",Italy,West Germany,"1–0 Estadio Azteca, Mexico City",Uruguay,16,,,
9,10,1974,West Germany,West Germany,"2–1 Olympiastadion, Munich",Netherlands,Poland,"1–0 Olympiastadion, Munich",Brazil,16,,,


In [62]:
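table_data_filt = table_data[0:4] + table_data[6:]
#attempt: drop the last 3 cells of every row; this also cuts the normal 10-cell rows down to 7, so it is not the fix we want
table_data_filt = [table_data_filt[0]] + [i[:-3] for i in table_data_filt[1:]]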

In [63]:
final_table = pd.DataFrame(table_data_filt[1:])
final_table

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1,1930,Uruguay,Uruguay,"4–2 Estadio Centenario, Montevideo",Argentina,United States,,,
1,2,1934,Italy,Italy,"2–1 (a.e.t.) Stadio Nazionale PNF, Rome",Czechoslovakia,Germany,,,
2,3,1938,France,Italy,"4–2 Stade de Colombes, Paris",Hungary,Brazil,,,
3,4,1950,Brazil,,Uruguay,"[note 2]2–1 Maracanã, Rio de Janeiro",Brazil,,Sweden,"[note 2]3–1 Pacaembu, São Paulo"
4,5,1954,Switzerland,West Germany,"3–2 Wankdorfstadion, Bern",Hungary,Austria,,,
5,6,1958,Sweden,Brazil,"5–2 Råsundastadion, Solna",Sweden,France,,,
6,7,1962,Chile,Brazil,"3–1 Estadio Nacional, Santiago",Czechoslovakia,Chile,,,
7,8,1966,England,England,"4–2 (a.e.t.) Wembley Stadium, London",West Germany,Portugal,,,
8,9,1970,Mexico,Brazil,"4–1 Estadio Azteca, Mexico City",Italy,West Germany,,,
9,10,1974,West Germany,West Germany,"2–1 Olympiastadion, Munich",Netherlands,Poland,,,


In [64]:
[len(row) for row in table_data] #most rows have 10 cells, but the 1950 (Brazil) row has 13

[10,
 10,
 10,
 10,
 2,
 1,
 13,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10,
 10]
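
The row lengths suggest a more targeted fix than slicing every row: the 13-cell row has three empty-string cells, so removing them brings it back to 10. A sketch under that assumption, using the header and two rows copied from the scraped output; `drop_empty_cells` is a made-up helper name, and this only works because the surplus cells happen to be empty in this scrape.

```python
header = ['Edition', 'Year', 'Hosts', 'Champions', 'Score and Venue',
          'Runners-up', 'Third place', 'Score and Venue', 'Fourth place',
          'No. of Teams']

# Two rows copied from the scraped output above
rows = [
    ['1', '1930', 'Uruguay', 'Uruguay', '4–2 Estadio Centenario, Montevideo',
     'Argentina', 'United States', '[note 1]', 'Yugoslavia', '13'],
    ['4', '1950', 'Brazil', '', 'Uruguay', '[note 2]2–1 Maracanã, Rio de Janeiro',
     'Brazil', '', 'Sweden', '[note 2]3–1 Pacaembu, São Paulo', 'Spain', '', '13'],
]

def drop_empty_cells(row, expected):
    """Remove empty-string cells, but only from rows longer than expected."""
    if len(row) <= expected:
        return row
    return [cell for cell in row if cell != '']

cleaned = [drop_empty_cells(row, len(header)) for row in rows]
print([len(row) for row in cleaned])
```

Rows that are already the right length are left untouched, so a genuinely empty cell in a normal row is not lost.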

## Customising requests headers

In [4]:
url = r'https://www.reddit.com/r/MachineLearning.json' #for reddit, adding .json to the end of a URL returns the page as JSON
r = requests.get(url)
r.status_code #429 = Too Many Requests: reddit spots that we are a script and blocks us

429

In [None]:
#to find your user agent: open any webpage, open developer tools, go to the Network tab, refresh the page, click on any request and scroll down to the User-Agent header
#An example user agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38

In [5]:
user_agent = r'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38'
r = requests.get(url, headers={'User-Agent': user_agent}) #we are now pretending to be our edge web browser
r.status_code #now it works

200
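
If we make several requests, passing `headers=` every time gets repetitive. A sketch of setting the user agent once on a `requests.Session` instead (no request is sent here, so it runs offline):

```python
import requests

user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36 Edg/86.0.622.38')

session = requests.Session()
default_agent = session.headers['User-Agent']  # what requests sends if we change nothing
print(default_agent)

# Every request made through this session will now carry the browser user agent
session.headers.update({'User-Agent': user_agent})
print(session.headers['User-Agent'])
```

After this, `session.get(url)` is used in place of `requests.get(url, headers=...)`.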

In [6]:
r.text

'{"kind": "Listing", "data": {"modhash": "", "dist": 27, "children": [{"kind": "t3", "data": {"approved_at_utc": null, "subreddit": "MachineLearning", "selftext": "Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead!\\n\\nThread will stay alive until next one so keep posting after the date in the title.\\n\\nThanks to everyone for answering questions in the previous thread!", "author_fullname": "t2_6l4z3", "saved": false, "mod_reason_title": null, "gilded": 0, "clicked": false, "title": "[D] Simple Questions Thread October 11, 2020", "link_flair_richtext": [], "subreddit_name_prefixed": "r/MachineLearning", "hidden": false, "pwls": 6, "link_flair_css_class": "one", "downs": 0, "thumbnail_height": null, "top_awarded_type": null, "hide_score": false, "name": "t3_j9difr", "quarantine": false, "link_flair_text_color": "dark", "upvote_ratio": 1.0, "author_flair_background_color": null, "subreddit_type": "

# Selenium

You need Chrome installed, plus the matching ChromeDriver from https://chromedriver.chromium.org/downloads

Put the chromedriver executable into the anaconda3 folder (or any other folder on your PATH) so that Selenium can find it

See: https://www.browserstack.com/guide/launch-edge-browser-in-selenium for the Edge driver if you care

## Why we need Selenium
Often we need to interact with the webpage (click, scroll, type) before the content we want appears

A lot of these pages use AJAX (Asynchronous JavaScript and XML) to change the page content without reloading it

The requests module only downloads the raw HTML; it doesn't run JavaScript or let us interact with the page.

Selenium works by automating a real browser, which loads the website and runs its JavaScript

In [65]:
import requests
from bs4 import BeautifulSoup

### Examples of us failing to scrape a website
http://pythonscraping.com/pages/javascript/ajaxDemo.html we want the second line of text from this url

In [66]:
r = requests.get(r'http://pythonscraping.com/pages/javascript/ajaxDemo.html')
r.text

'<html>\n<head>\n<title>Some JavaScript-loaded content</title>\n<script src="../js/jquery-2.1.1.min.js"></script>\n\n</head>\n<body>\n<div id="content">\nThis is some content that will appear on the page while it\'s loading. You don\'t care about scraping this.\n</div>\n\n<script>\n$.ajax({\n    type: "GET",\n    url: "loadedContent.php",\n    success: function(response){\n\n\tsetTimeout(function() {\n\t    $(\'#content\').html(response);\n\t}, 2000);\n    }\n  });\n\nfunction ajax_delay(str){\n setTimeout("str",2000);\n}\n</script>\n</body>\n</html>'

In [67]:
soup = BeautifulSoup(r.text,'html.parser')
msg = soup.find('div').text
print(msg) #not ideal, we need to interact with the website first


This is some content that will appear on the page while it's loading. You don't care about scraping this.



In [68]:
url = r'https://uk.reuters.com/search/news?blob=covid-19'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

results = [i.text for i in soup.find_all('h3', {'class':'search-result-title'})]
len(results) #we only get 10 headlines when there are loads more

10

## Using Selenium

In [76]:
from selenium import webdriver #if this import fails, run conda install selenium in the Anaconda prompt first
from selenium.webdriver.common.keys import Keys 
import time


In [77]:
driver = webdriver.Chrome()

In [78]:
driver.get(r'http://pythonscraping.com/pages/javascript/ajaxDemo.html')
time.sleep(3)

In [79]:
soup = BeautifulSoup(driver.page_source, 'html.parser')
soup.find('div').text

'Here is some important text you want to retrieve! A button to click!'
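
A fixed `time.sleep(3)` is either longer than needed or too short on a slow connection. Selenium has its own explicit-wait helper (`WebDriverWait`) for this; the sketch below is a library-free illustration of the same idea, and `wait_until` is a made-up name.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll)
```

With a live driver this could be called as `wait_until(lambda: 'important text' in driver.page_source)` before parsing the page.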

In [80]:
driver.close()

In [89]:
driver = webdriver.Chrome()

In [90]:
url = r'https://uk.reuters.com/search/news?blob=covid-19'
driver.get(url) #cookies pop-up happens - need to click accept in an automated way

In [91]:
#note: the find_element_by_* helpers are Selenium 3 syntax; Selenium 4.3+ removed them in favour of driver.find_element(By.XPATH, ...)
xpath_cookies_accept = r'//button[@class="evidon-barrier-acceptbutton"]'
driver.find_element_by_xpath(xpath_cookies_accept).click()

In [92]:
css_more = ".search-result-more-txt"
driver.find_element_by_css_selector(css_more).click()
time.sleep(2)
driver.find_element_by_css_selector(css_more).click()
time.sleep(2)
driver.find_element_by_css_selector(css_more).click()
time.sleep(2)
driver.find_element_by_css_selector(css_more).click()
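
The click-then-sleep pattern above repeats itself, so it can be wrapped in a loop. A sketch with a made-up helper, `click_repeatedly`, which takes any function that returns something with a `.click()` method:

```python
import time

def click_repeatedly(find_button, times, pause=2.0):
    """Find and click a button `times` times, sleeping `pause` seconds after each click."""
    for _ in range(times):
        # Look the element up again each time, since the page may have re-rendered it
        find_button().click()
        time.sleep(pause)
```

With the driver from this section, the call would look like `click_repeatedly(lambda: driver.find_element_by_css_selector(css_more), times=4)`.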

In [94]:
soup = BeautifulSoup(driver.page_source,'html.parser')
results = [i.text for i in soup.find_all('h3',{'class':'search-result-title'})] 
#results
len(results) #now we have 50 results

50

In [95]:
driver.close()

## More advanced reuters example

XPath syntax: https://www.w3schools.com/xml/xpath_syntax.asp

- //book: matches all book elements
- //title: matches all the title elements
- //book/price: matches all the price elements that are children of book elements
- /bookstore/book[1]: matches the first book under bookstore (XPath positions start at 1, not 0)
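
These expressions can be tried without a browser. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath, on a made-up bookstore document (the XML is hypothetical, in the style of the w3schools tutorial). Paths are written relative to the root element, hence the leading `.`.

```python
import xml.etree.ElementTree as ET

# Hypothetical bookstore document in the style of the w3schools XPath tutorial
xml = '''
<bookstore>
  <book><title>Book A</title><price>30.00</price></book>
  <book><title>Book B</title><price>29.99</price></book>
</bookstore>
'''
root = ET.fromstring(xml)  # root is the <bookstore> element

all_books = root.findall('.//book')          # like //book
all_titles = root.findall('.//title')        # like //title
book_prices = root.findall('.//book/price')  # like //book/price
first_book = root.find('book[1]')            # like /bookstore/book[1]

print([t.text for t in all_titles])
print(first_book.find('title').text)
```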

In [22]:
import datetime as dt
import json
import requests
from selenium import webdriver #if this import fails, run conda install selenium in the Anaconda prompt first
from selenium.webdriver.common.keys import Keys 
import time
from bs4 import BeautifulSoup

In [8]:
driver=webdriver.Chrome()

In [9]:
url =r'https://uk.reuters.com'
driver.get(url)

In [10]:
xpath_cookies_accept = r'//button[@class="evidon-barrier-acceptbutton"]' #right click the accept button and inspect to find this
driver.find_element_by_xpath(xpath_cookies_accept).click() #we have accepted cookies 

In [16]:
xpath_search_icon = r'//div[@class="search-icon"]' #find all search icon objects
driver.find_element_by_xpath(xpath_search_icon).click() #we have now selected the search icon

In [14]:
#alternatively find the element, right click then copy the xpath directly
xpath_search_icon2 = r'//*[@id="headerNav"]/div/ul/li[13]/div'
driver.find_element_by_xpath(xpath_search_icon2).click() #the search bar has been selected again
#this is not recommended because a position-based path like this breaks as soon as the page layout changes

In [18]:
xpath_search_field = r'//input[@class="search-field"]'
driver.find_element_by_xpath(xpath_search_field).send_keys('covid-19',Keys.ENTER) #Keys.ENTER presses the Enter key for us

In [20]:
xpath_more = r'//div[@class="search-result-more-txt"]'
driver.find_element_by_xpath(xpath_more).click()
time.sleep(2)
driver.find_element_by_xpath(xpath_more).click()
time.sleep(2)
driver.find_element_by_xpath(xpath_more).click()
time.sleep(2)
driver.find_element_by_xpath(xpath_more).click()

In [23]:
soup = BeautifulSoup(driver.page_source,'html.parser')
results = [i.text for i in soup.find_all('h3',{'class':'search-result-title'})]  #inspect a headline to see this
for i in results:
    print(i) #our covid headlines

BRIEF-Academedia Comments On Covid-19
BRIEF-Bergenbio Provides COVID-19 Impact Assessment
BRIEF-Biocept To Begin Covid-19 Testing
BRIEF-Deva Holding Develops No Vaccine For Covid-19
BRIEF-Iconovo Sees Limited Impact Of COVID-19
BRIEF-Infinity Pharmaceuticals Provides COVID-19 Update
BRIEF-Mateon Expands Covid-19 Therapeutic Program
BRIEF-Biotest Is Developing COVID-19 Therapy With Trimodulin
BRIEF-Spie Updates On COVID-19 Impact
BRIEF-Hemogenyx Pharma Starts COVID-19 Project
BRIEF-BrainCool Is Conducting Study With COVID-19 Patients
BRIEF-Vaxil Commences Preclinical COVID-19 Vaccine Trial And Files An Additional COVID-19 Patent
BRIEF-Deva Holding Develops No Vaccine For Covid-19
BRIEF-Omega Provides Update On Covid-19 Impact
BRIEF-Isofol Medical: Information In Relation To COVID-19
BRIEF-Ampio Provides Update On COVID-19 Program
BRIEF-Centogene To Expand Testing For COVID-19
BRIEF-Iconovo Sees Limited Impact Of COVID-19
BRIEF-Infinity Pharmaceuticals Provides COVID-19 Update
BRIEF-Mate

In [24]:
#what if we want the links instead of the headlines?
links = [i.find('a')['href'] for i in soup.find_all('h3',{'class':'search-result-title'})] #the link is in the a element, we want the href
links = ['https://uk.reuters.com' + i  for i in links] #the hrefs already start with a slash, so the base must not end with one
links #all of our article links

['https://uk.reuters.com/article/idUKFWN2B8087',
 'https://uk.reuters.com/article/idUKFWN2BK1C4',
 'https://uk.reuters.com/article/idUKFWN2BX191',
 'https://uk.reuters.com/article/idUKFWN2CH17A',
 'https://uk.reuters.com/article/idUKFWN2BC16Z',
 'https://uk.reuters.com/article/idUKFWN2BV0QP',
 'https://uk.reuters.com/article/idUKFWN2BW0FZ',
 'https://uk.reuters.com/article/idUKFWN2BR0N7',
 'https://uk.reuters.com/article/idUKFWN2BK1FG',
 'https://uk.reuters.com/article/idUKFWN2CA054',
 'https://uk.reuters.com/article/idUKFWN2DF0EI',
 'https://uk.reuters.com/article/idUKFWN2BK19S',
 'https://uk.reuters.com/article/idUKFWN2CH17A',
 'https://uk.reuters.com/article/idUKFWN2BK1M1',
 'https://uk.reuters.com/article/idUKFWN2BP1G1',
 'https://uk.reuters.com/article/idUKFWN2D4029',
 'https://uk.reuters.com/article/idUKFWN2CN0VF',
 'https://uk.reuters.com/article/idUKFWN2BC16Z',
 'https://uk.reuters.com/article/idUKFWN2BV0QP',
 'https://uk.reuters.com/article/idUKFWN2BW0FZ',
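
Building links by string concatenation depends on getting the slashes exactly right, and the results above also contain repeated articles. A sketch that handles both with the standard library, on a short sample of hrefs in the same form as the ones scraped here:

```python
from urllib.parse import urljoin

base = 'https://uk.reuters.com/'
# Sample hrefs; the repeated entry mimics the duplicate articles in the results
hrefs = ['/article/idUKFWN2B8087', '/article/idUKFWN2BK1C4', '/article/idUKFWN2B8087']

# dict.fromkeys keeps the first occurrence of each href and preserves order
unique_hrefs = list(dict.fromkeys(hrefs))

# urljoin copes with the base ending in '/' and the href starting with '/'
links = [urljoin(base, href) for href in unique_hrefs]
print(links)
```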
