<a href="https://colab.research.google.com/github/dpanagop/COVID/blob/main/Scap_COVID_announcements.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# General Information
This notebook scrapes the daily press releases of Greece's National Public Health Organisation (EODY) to extract
*   the number of hospitalised COVID-19 patients that receive respiratory help
*   the number of deaths that are attributed to COVID-19

The page with the announcements is https://eody.gov.gr/category/anakoinoseis/
We are interested in the announcements that carry the daily COVID-19 press release. Their title is "Ημερήσια έκθεση επιτήρησης COVID-19" ("Daily COVID-19 surveillance report", e.g. https://eody.gov.gr/20201113_briefing_covid19/ ).

In [1]:
# Importing necessary libraries
import requests
import json
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re

In [2]:
def get_announcement(url="",search=""):
  """ Takes a url as input and returns a list with the href of every link
      whose text contains the string in the search variable"""
  results=[]
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')
  # announcement links on the listing page carry an aria-label attribute
  for label in soup.find_all("a",attrs={"aria-label":True}):
    # get_text() also works when the link text is wrapped in a nested tag
    text=label.get_text()
    if search in text:
      # the next two print functions can be removed, they are used for debugging
      print(text)
      print(label['href'])
      results.append(label['href'])
  return results

An inspection of "https://eody.gov.gr/category/anakoinoseis/" shows that the bottom of the page contains links to the next pages. In the HTML code such a link is an ```<a class="next page-numbers" href=...>``` element.

The loop below starts with a url and searches it for links to the COVID daily releases with the get_announcement function. If the page contains a link to a next page, the loop repeats with that page. The links are stored in a list named announcements.

In [3]:
announcements=[] #the urls of daily press releases
url="https://eody.gov.gr/category/anakoinoseis/"
announcements=get_announcement(url=url,search="Ημερήσια έκθεση επιτήρησης COVID-19")
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
next_page=soup.find("a",{'class': 'next page-numbers'})
while next_page:
  url=next_page["href"]
  print("Checking page "+url)
  new_announcements=get_announcement(url=url,search="Ημερήσια έκθεση επιτήρησης COVID-19")
  if len(new_announcements)>0:
    announcements=announcements+new_announcements
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')
  next_page=soup.find("a",{'class': 'next page-numbers'})

Ημερήσια έκθεση επιτήρησης COVID-19 (14/11/2020)
https://eody.gov.gr/20201114_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (13/11/2020)
https://eody.gov.gr/20201113_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (12/11/2020)
https://eody.gov.gr/20201112_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (11/11/2020)
https://eody.gov.gr/20201111_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (10/11/2020)
https://eody.gov.gr/20201110_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (09/11/2020)
https://eody.gov.gr/20201109_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (08/11/2020)
https://eody.gov.gr/20201108_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (07/11/2020) – ΟΡΘΗ
https://eody.gov.gr/20201107_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (06/11/2020)
https://eody.gov.gr/20201106_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (05/11/2020)
https://eody.gov.gr/20201105_briefing_covid19/
Checking page https://eody.gov.gr
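
Note that the loop above downloads every listing page twice: once inside get_announcement and once more to look for the next-page link. A minimal sketch of a single-pass alternative follows. It is not the notebook's method: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the names LinkCollector, crawl and fetch, as well as the two hand-written pages, are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the matching announcement links and the next-page link of one page."""
    def __init__(self, search):
        super().__init__()
        self.search = search
        self.links = []        # hrefs whose link text contains `search`
        self.next_url = None   # href of the <a class="next page-numbers"> element
        self._href = None      # href of the <a aria-label=...> currently open
        self._text = ""        # text collected inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("class") == "next page-numbers":
            self.next_url = attrs.get("href")
        if "aria-label" in attrs:
            self._href, self._text = attrs.get("href"), ""

    def handle_data(self, data):
        if self._href is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if self.search in self._text:
                self.links.append(self._href)
            self._href = None

def crawl(start_url, search, fetch):
    """Follow next-page links from start_url, downloading each page exactly once."""
    links, url, seen = [], start_url, set()
    while url and url not in seen:
        seen.add(url)
        parser = LinkCollector(search)
        parser.feed(fetch(url))   # fetch(url) must return the page's HTML as a string
        links.extend(parser.links)
        url = parser.next_url
    return links

# Two tiny hand-written pages stand in for the real site (assumption for the demo)
pages = {
    "p1": '<a aria-label="x" href="/a/">COVID-19 (14/11/2020)</a>'
          '<a class="next page-numbers" href="p2">next</a>',
    "p2": '<a aria-label="y" href="/b/">Other news</a>'
          '<a aria-label="z" href="/c/">COVID-19 (13/11/2020)</a>',
}
found = crawl("p1", "COVID-19", pages.__getitem__)
print(found)  # ['/a/', '/c/']
```

With the real site, fetch could be something like `lambda u: requests.get(u).text`. The `seen` set is only a guard against a next-page link that points back to an already visited page.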

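An inspection of the full output reveals that the URL format of the daily press releases is not constant. Recent releases look like
https://eody.gov.gr/20201113_briefing_covid19/ while older ones look like https://eody.gov.gr/0915_briefing_covid19/
I.e. the prefix changes from ```https://eody.gov.gr/(**date**)_briefing_covid19/``` to the shorter ```https://eody.gov.gr/(**post_ID**)_briefing_covid19/```. Judging from the older releases in the final dataframe (e.g. 0723 for 23/07/2020), the short prefix appears to be the month and day rather than an arbitrary ID.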

This justifies the use of a search string for detecting the press releases and not using their url format.
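
A small illustration of why matching on the URL alone would be fragile. The pattern below is a hypothetical one written for this example; it assumes an eight-digit date prefix and therefore misses the older URL style.

```python
import re

# Hypothetical pattern: eight digits (YYYYMMDD) right before "_briefing_covid19/"
dated = re.compile(r"/(\d{8})_briefing_covid19/$")

urls = [
    "https://eody.gov.gr/20201113_briefing_covid19/",
    "https://eody.gov.gr/0915_briefing_covid19/",
]
matches = [bool(dated.search(u)) for u in urls]
print(matches)  # [True, False]
```

Searching the link text for the fixed title string avoids having to maintain one pattern per URL style.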


Each press release page has several meta tags. We can use them to extract the page's title and time of publication. An example is shown in the code below.


In [4]:
print(f'Using {announcements[0]} as example')
page = requests.get(announcements[0])
soup = BeautifulSoup(page.content, 'html.parser')
title=soup.find('meta',{'property': 'og:title'})
print(title['content'])
timestamp=soup.find('meta',{'property': 'article:published_time'})
print(timestamp['content'])

Using https://eody.gov.gr/20201114_briefing_covid19/ as example
Ημερήσια έκθεση επιτήρησης COVID-19 (14/11/2020) - Εθνικός Οργανισμός Δημόσιας Υγείας
2020-11-14T16:06:26+00:00
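
The two strings printed above can be turned into proper date objects with the standard library. This is only a sketch: it assumes that the title always contains the report date as (dd/mm/yyyy) and that the timestamp is in ISO 8601 format, as in the example output. The variable names are illustrative.

```python
import re
from datetime import datetime

# Values copied from the example output above
title = "Ημερήσια έκθεση επιτήρησης COVID-19 (14/11/2020) - Εθνικός Οργανισμός Δημόσιας Υγείας"
timestamp = "2020-11-14T16:06:26+00:00"

# Publication time as a timezone-aware datetime
published = datetime.fromisoformat(timestamp)

# Report date taken from the parentheses in the title
match = re.search(r"\((\d{2}/\d{2}/\d{4})\)", title)
report_date = datetime.strptime(match.group(1), "%d/%m/%Y").date()

print(published.isoformat())
print(report_date.isoformat())
```

Keeping both values can be useful because a release may be published, or corrected, at a different time than the date it reports on.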


The number of patients on a ventilator appears between the phrases

*   "σχετιζόμενα με ήδη γνωστό κρούσμα" (part of the sentence that reports how many confirmed COVID cases are related to other confirmed cases) and
*   "συμπολίτες μας νοσηλεύονται διασωληνωμένοι." (meaning "of our fellow citizens are hospitalised intubated")

We extract the first number between those two phrases.

We then do the same for the number of COVID deaths, which appears between

*   "Τέλος," (meaning "Finally,") and
*   "ακόμα καταγεγραμμέ" (meaning "more recorded")

The second word of the last phrase is deliberately cut before its suffix, because the suffix differs between the singular and the plural form.

Below is the code used for the number extraction.








In [6]:
text=soup.prettify()
ventilator_start=text.index("σχετιζόμενα με ήδη γνωστό κρούσμα")
ventilator_end=text.index("συμπολίτες μας νοσηλεύονται διασωληνωμένοι.")
ventilator= re.findall(r'\b\d+\b', text[ventilator_start:ventilator_end])
ventilator=int(ventilator[0])
# print the text with the number of patients on a ventilator
# (43 is the length of the closing phrase, so that it is printed as well)
print(re.sub('<[^<]+?>', '', text[ventilator_start:ventilator_end+43])) 
print(ventilator)
deaths_start=text.index("Τέλος,")
deaths_end=text.index("ακόμα καταγεγραμμέ")
deaths= re.findall(r'\b\d+\b', text[deaths_start:deaths_end])
deaths=int(deaths[0])
# print the text with the number of COVID deaths
# (31 is the length of the full plural phrase "ακόμα καταγεγραμμένους θανάτους")
print(re.sub('<[^<]+?>', '', text[deaths_start:deaths_end+31]))
print(deaths)

σχετιζόμενα με ήδη γνωστό κρούσμα.


 
  366
 
 συμπολίτες μας νοσηλεύονται διασωληνωμένοι.
366
Τέλος, έχουμε
 
  38
 
 ακόμα καταγεγραμμένους θανάτους
38
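
The same extraction can be wrapped in a small helper that returns None instead of raising when a phrase is missing. This is a sketch, not part of the original notebook: the function name number_between and the sample string are made up for illustration, and unlike the cell above it looks for the closing phrase only after the opening one.

```python
import re

def number_between(text, start_phrase, end_phrase):
    """Return the first integer between two phrases, or None if either
    phrase (or a number between them) is missing."""
    start = text.find(start_phrase)
    if start < 0:
        return None
    # search for the closing phrase only after the opening phrase
    end = text.find(end_phrase, start + len(start_phrase))
    if end < 0:
        return None
    numbers = re.findall(r"\b\d+\b", text[start:end])
    return int(numbers[0]) if numbers else None

# Hand-written sample that mimics the structure of a press release
sample = "Τέλος, έχουμε <b>38</b> ακόμα καταγεγραμμένους θανάτους"
print(number_between(sample, "Τέλος,", "ακόμα καταγεγραμμέ"))          # 38
print(number_between("no such text", "Τέλος,", "ακόμα καταγεγραμμέ"))  # None
```

Returning None lets the caller decide what a missing sentence means, for example zero deaths on that day.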


Finally, we create a dataframe named announcements_content that, for each press release, will hold the url, the page's title, the time of publication, the number of patients on a ventilator and the number of COVID deaths.

In [7]:
announcements_content=pd.DataFrame(announcements,columns=['url'])
announcements_content['title']=""
announcements_content['timestamp']=""
announcements_content['ventilator']=""
announcements_content['deaths']=""
announcements_content.head()

Unnamed: 0,url,title,timestamp,ventilator,deaths
0,https://eody.gov.gr/20201114_briefing_covid19/,,,,
1,https://eody.gov.gr/20201113_briefing_covid19/,,,,
2,https://eody.gov.gr/20201112_briefing_covid19/,,,,
3,https://eody.gov.gr/20201111_briefing_covid19/,,,,
4,https://eody.gov.gr/20201110_briefing_covid19/,,,,


The for loop below gets the content of each announcement and populates the dataframe. Note that the extraction of the number of COVID deaths is wrapped in a try-except block. That is because there were some fortunate days on which no COVID death was announced, so the corresponding sentence is missing.

The code prints each url, as well as the parts of the text about intubated patients and COVID deaths together with the extracted numbers. This might seem like unnecessary clutter, but it was valuable for debugging during the initial development.

In [9]:
for idx,row in announcements_content.iterrows():
  url=row['url']
  print(f'Index {idx}')
  print(url)
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')
  title=soup.find('meta',{'property': 'og:title'})
  print(title['content'])
  timestamp=soup.find('meta',{'property': 'article:published_time'})
  text=soup.prettify()
  ventilator_start=text.index("σχετιζόμενα με ήδη γνωστό κρούσμα")
  ventilator_end=text.index("συμπολίτες μας νοσηλεύονται διασωληνωμένοι.")
  ventilator= re.findall(r'\b\d+\b', text[ventilator_start:ventilator_end])
  ventilator=int(ventilator[0])
  print("----VENTILATOR-----")
  print(re.sub('<[^<]+?>', '', text[ventilator_start:ventilator_end+43]))
  print(ventilator)
  try:
   deaths_start=text.index("Τέλος,")
   deaths_end=text.index("ακόμα καταγεγραμμέ")
   deaths= re.findall(r'\b\d+\b', text[deaths_start:deaths_end])
   deaths=int(deaths[0])
   print("----DEATHS----")
   print(re.sub('<[^<]+?>', '', text[deaths_start:deaths_end+31]))
   print(deaths)
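  except (ValueError, IndexError):
   # text.index raises ValueError when a phrase is missing,
   # deaths[0] raises IndexError when no number is found
   print("DEATHS - not found")
   deaths=0
  # .loc writes into the dataframe itself and avoids chained assignment
  announcements_content.loc[idx,'title']=title['content']
  announcements_content.loc[idx,'timestamp']=timestamp['content']
  announcements_content.loc[idx,'ventilator']=ventilator
  announcements_content.loc[idx,'deaths']=deaths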

Index 0
https://eody.gov.gr/20201114_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (14/11/2020) - Εθνικός Οργανισμός Δημόσιας Υγείας
----VENTILATOR-----
σχετιζόμενα με ήδη γνωστό κρούσμα.


 
  366
 
 συμπολίτες μας νοσηλεύονται διασωληνωμένοι.
366
----DEATHS----
Τέλος, έχουμε
 
  38
 
 ακόμα καταγεγραμμένους θανάτους
38
Index 1
https://eody.gov.gr/20201113_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (13/11/2020) - Εθνικός Οργανισμός Δημόσιας Υγείας
----VENTILATOR-----
σχετιζόμενα με ήδη γνωστό κρούσμα.


 
  336
 
 συμπολίτες μας νοσηλεύονται διασωληνωμένοι.
336
----DEATHS----
Τέλος, έχουμε
 
  38
 
 ακόμα καταγεγραμμένους θανάτους
38
Index 2
https://eody.gov.gr/20201112_briefing_covid19/
Ημερήσια έκθεση επιτήρησης COVID-19 (12/11/2020) - Εθνικός Οργανισμός Δημόσιας Υγείας
----VENTILATOR-----
σχετιζόμενα με ήδη γνωστό κρούσμα.


 
  310
 
 συμπολίτες μας νοσηλεύονται διασωληνωμένοι.
310
----DEATHS----
Τέλος, έχουμε
 
  50
 
 ακόμα καταγεγραμμένους θανάτους
50
Index 3

In [10]:
announcements_content

Unnamed: 0,url,title,timestamp,ventilator,deaths
0,https://eody.gov.gr/20201114_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (14/11/202...,2020-11-14T16:06:26+00:00,366,38
1,https://eody.gov.gr/20201113_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (13/11/202...,2020-11-13T15:35:20+00:00,336,38
2,https://eody.gov.gr/20201112_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (12/11/202...,2020-11-12T17:43:19+00:00,310,50
3,https://eody.gov.gr/20201111_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (11/11/202...,2020-11-11T16:04:59+00:00,297,43
4,https://eody.gov.gr/20201110_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (10/11/202...,2020-11-10T16:43:50+00:00,263,41
...,...,...,...,...,...
113,https://eody.gov.gr/0723_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (23/07/202...,2020-07-23T15:03:58+00:00,8,1
114,https://eody.gov.gr/0722_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (22/07/202...,2020-07-22T15:56:38+00:00,10,3
115,https://eody.gov.gr/0721_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (21/07/202...,2020-07-21T16:07:41+00:00,10,2
116,https://eody.gov.gr/0720_briefing_covid19/,Ημερήσια έκθεση επιτήρησης COVID-19 (20/07/202...,2020-07-20T15:46:59+00:00,12,1
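
Because the columns were initialised with empty strings, they hold generic Python objects rather than dates and integers. A possible clean-up step, sketched on three rows copied from the table above (the small dataframe df is a stand-in for announcements_content, not the real data):

```python
import pandas as pd

# Three rows copied from the table above; object dtype mimics the notebook's columns
df = pd.DataFrame({
    "timestamp": ["2020-11-14T16:06:26+00:00",
                  "2020-11-13T15:35:20+00:00",
                  "2020-11-12T17:43:19+00:00"],
    "ventilator": pd.Series([366, 336, 310], dtype=object),
    "deaths": pd.Series([38, 38, 50], dtype=object),
})

# Parse the ISO timestamps and convert the counts to integers
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
df = df.astype({"ventilator": int, "deaths": int})

# Oldest release first, which is the natural order for a time series
df = df.sort_values("timestamp").reset_index(drop=True)
print(df.dtypes)
```

With typed columns the data can be plotted or aggregated directly, for example by week.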


The two cells below store the dataframe in xlsx format and use a Google Colab library to download it.

In [11]:
from pandas import ExcelWriter
# used as a context manager, the writer saves and closes the file on exit
with ExcelWriter('deaths_ventilator_20201114.xlsx') as writer:
  announcements_content.to_excel(writer,sheet_name='all')

In [12]:
from google.colab import files
files.download('deaths_ventilator_20201114.xlsx')
