# Part B: Take Home | Alkiviadis Kariotis 241735

## Use your favorite web crawling technique and crawl information related to the CORONAVIRUS from the following websites:
https://www.who.int/emergencies/diseases/novel-coronavirus-2019
https://www.un.org/en/coronavirus

Make sure that you just get:
-   information in English,
-   information RELATED to coronavirus / COVID-19 (including non-textual data if any),
-   information on the topic as found in the above pages and the links included in these pages


Once you have finished, answer the following questions:
*     How many documents have you successfully acquired? [5%]
*     How did you restrict or clean-up your crawling as requested above? Please paste in this space your exact code and corresponding explanation of how the crawler was restricted as requested; pay particular notice at justifying the necessity of your decisions (e.g., you should not include restrictions that are redundant/have no effect in crawling because default restrictions apply, etc.). [20%]
*     Vectorize the MAIN text contained in the following link using Vector Space approach? Use a vocabulary of a maximum of 20 words that you choose carefully to be representative of the document. State the criteria that made you make these choices. [25%] https://www.who.int/health-topics/coronavirus#tab=tab_1


In [1]:
import requests
from bs4 import BeautifulSoup
import re
import time
import warnings

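# Silence library warnings for the rest of the notebook. A `with warnings.catch_warnings():`
# block would restore the previous filters as soon as the block exits, so it has no lasting effect.
warnings.filterwarnings("ignore")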
    
# Function to parse HTML content and extract links.
# Only hrefs containing 'http' are kept, so relative links (e.g. '/en/...') are skipped.
def extract_links(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    links = [link['href'] for link in soup.find_all('a', href=True) if 'http' in link['href']]
    return links

# Filter functions
def is_english_url(url):
    return '/en/' in url or 'lang=en' in url or re.search(r'english|en_', url, re.IGNORECASE) is not None

def is_related_to_covid(url):
    covid_keywords = ['coronavirus', 'covid', 'covid-19', 'pandemic']
    return any(keyword in url.lower() for keyword in covid_keywords)

# Constants
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
DELAY_BETWEEN_REQUESTS = 2  # Delay in seconds

# Function to crawl a website and extract COVID-related information
def crawl_website(url):
    headers = {'User-Agent': USER_AGENT}
    try:
        time.sleep(DELAY_BETWEEN_REQUESTS)  # Delay to prevent being blocked
        response = requests.get(url, headers=headers, timeout=30)  # timeout so a stalled server cannot hang the crawl
        if response.status_code == 200:
            page_content = response.text
            print(f"Successfully retrieved content from {url}")
            return page_content
        else:
            print(f"Failed to retrieve content from {url}, status code: {response.status_code}")
            return None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Function to run the crawler and store contents
def run_crawler_and_store_contents(urls):
    contents = {
        'UN': [],
        'WHO': []
    }
    for url in urls:
        page_content = crawl_website(url)
        if page_content:
            links = extract_links(page_content)
            filtered_links = [link for link in links if is_english_url(link) and is_related_to_covid(link)]
            for link in filtered_links:
                subpage_content = crawl_website(link)
                if subpage_content:
                    if 'un.org' in link:
                        contents['UN'].append(subpage_content)
                    elif 'who.int' in link:
                        contents['WHO'].append(subpage_content)
    return contents

# Function to print the summary of contents
def print_summary(contents):
    for site, pages in contents.items():
        print(f"{site} site: Retrieved {len(pages)} documents.")
        if pages:  # If there is at least one document
            # Print first 500 characters for example
            print(f"Example content from {site} site:\n{pages[0][:500]}")

# List of URLs to crawl
urls = [
    'https://www.who.int/emergencies/diseases/novel-coronavirus-2019',
    'https://www.un.org/en/coronavirus'
]

# Run the crawler and store contents
contents = run_crawler_and_store_contents(urls)

# Print the summary of retrieved contents
print_summary(contents)

Successfully retrieved content from https://www.who.int/emergencies/diseases/novel-coronavirus-2019
Successfully retrieved content from https://www.un.org/en/coronavirus
Successfully retrieved content from https://news.un.org/en/tags/covid-19
Successfully retrieved content from https://news.un.org/en/events/un-news-coverage-coronavirus-outbreak
Successfully retrieved content from https://www.un.org/en/coronavirus/covid-19-faqs
Successfully retrieved content from https://www.un.org/en/coronavirus/covid-19-faqs
Successfully retrieved content from https://www.un.org/en/coronavirus/covid-19-faqs
Successfully retrieved content from https://www.un.org/en/coronavirus/covid-19-faqs
Successfully retrieved content from https://www.un.org/en/coronavirus/financing-development/global-accelerator
Successfully retrieved content from https://www.un.org/en/coronavirus/information-un-system
Successfully retrieved content from https://www.un.org/en/coronavirus/information-un-system
Successfully retrieved

-----------

## Documents successfully acquired (a)

In [2]:
print_summary(contents)

UN site: Retrieved 19 documents.
Example content from UN site:
<!DOCTYPE html>
<html lang="en" dir="ltr" prefix="og: https://ogp.me/ns#">
  <head>
    <meta charset="utf-8" />
<script>window.dataLayer = window.dataLayer || []; window.dataLayer.push({"drupalLanguage":"en","drupalCountry":"US","siteName":"UN News","entityLangcode":"en","entityType":"taxonomy_term","entityBundle":"tags","entityId":"233091","entityTitle":"COVID-19","userUid":0});</script>
<link rel="canonical" href="https://news.un.org/en/tags/covid-19" />
<link rel="shortlink" href="https://ne
WHO site: Retrieved 0 documents.
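
The crawl log shows the same URL being fetched several times (for example https://www.un.org/en/coronavirus/covid-19-faqs), and every successful fetch is appended to the results, so the 19 UN documents appear to include repeated pages. In addition, extract_links keeps only hrefs containing 'http', so relative links are never followed, and the WHO seed URL itself carries none of the markers that is_english_url looks for; either of these may contribute to the WHO count of 0. A minimal sketch of resolving and de-duplicating links before fetching them, assuming a hypothetical helper named normalize_links that is not part of the notebook above:

```python
from urllib.parse import urljoin, urldefrag


def normalize_links(hrefs, base_url):
    """Resolve hrefs against base_url, strip #fragments, and de-duplicate in order."""
    seen = set()
    unique = []
    for href in hrefs:
        # urljoin turns relative links such as '/en/...' into absolute URLs
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript:, tel: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            unique.append(absolute)
    return unique


# Illustrative input only: two spellings of the same page plus a non-web link
example_hrefs = [
    "/en/coronavirus/covid-19-faqs",
    "https://www.un.org/en/coronavirus/covid-19-faqs#top",
    "mailto:someone@example.org",
]
print(normalize_links(example_hrefs, "https://www.un.org/en/coronavirus"))
```

With links normalized this way, the number of acquired documents would equal the number of distinct URLs fetched, which is easier to justify as an answer to question (a).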


---------

## Restricting and Cleaning-up the Crawling: (b)

In [3]:
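# Language restriction: keep a link only if its URL carries an English marker
# ('/en/' path segment, 'lang=en' query parameter, or 'english' / 'en_' anywhere in the URL).
# The check looks at the URL only, not at the content of the page.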
def is_english_url(url):
    return '/en/' in url or 'lang=en' in url or re.search(r'english|en_', url, re.IGNORECASE) is not None
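
The example output from the UN site starts with `<html lang="en" ...>`, which suggests a complementary check on the fetched page itself rather than on its URL. The sketch below reads the lang attribute of the html element using only the standard library. It assumes pages declare their language this way, which is common but not guaranteed, and the names LangSniffer and is_english_page are hypothetical, not part of the notebook above:

```python
from html.parser import HTMLParser


class LangSniffer(HTMLParser):
    """Records the lang attribute of the first <html> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def is_english_page(html_content):
    """Return True if the page declares English ('en' or a regional 'en-XX')."""
    sniffer = LangSniffer()
    sniffer.feed(html_content)
    lang = (sniffer.lang or "").lower()
    return lang == "en" or lang.startswith("en-")
```

A check like this could be applied after a page is fetched, so pages whose URLs lack an English marker are not discarded before their declared language is known.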

In [4]:
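# Topic filtering: keep a link only if its URL contains a COVID-related keyword.
# Note that 'covid-19' is redundant here, because any URL containing it also contains 'covid'.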
def is_related_to_covid(url):
    covid_keywords = ['coronavirus', 'covid', 'covid-19', 'pandemic']
    return any(keyword in url.lower() for keyword in covid_keywords)
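
Filtering on the URL alone can miss relevant pages whose addresses contain no keyword. One possible extension, sketched below, falls back to counting keyword occurrences in the page text. The keyword 'sars-cov-2', the threshold min_hits=3, and the function names are illustrative assumptions, not values taken from the notebook above:

```python
# Keywords to look for; 'covid' already covers 'covid-19'
COVID_KEYWORDS = ("coronavirus", "covid", "pandemic", "sars-cov-2")


def covid_keyword_hits(text):
    """Count case-insensitive occurrences of all keywords in text."""
    lowered = text.lower()
    return sum(lowered.count(keyword) for keyword in COVID_KEYWORDS)


def is_related_to_covid_page(url, page_text, min_hits=3):
    """Accept a page if its URL mentions a keyword, or its text mentions them often enough."""
    if covid_keyword_hits(url) > 0:
        return True
    return covid_keyword_hits(page_text) >= min_hits
```

The threshold trades recall for precision: a low value admits pages that mention the topic only in passing, while a high value may drop short but relevant pages.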

In [5]:
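# User-Agent and delays: send a browser-like User-Agent header with every request and
# pause DELAY_BETWEEN_REQUESTS (2) seconds before each one to prevent being blocked.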
headers = {'User-Agent': USER_AGENT}
time.sleep(DELAY_BETWEEN_REQUESTS)

----------

## Vectorizing the MAIN text: (c)

### First attempt, without stop-word removal

In [6]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')


#Retrieve HTML content
def get_html(url):
    response = requests.get(url)
    return response.text if response.status_code == 200 else None

#Parse HTML to extract the main text
def extract_main_text(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    article_text = ' '.join([p.get_text() for p in soup.find_all('p')])
    return article_text

#Select a representative vocabulary
def select_vocabulary(text, max_features=20):
    # Tokenize the text
    tokens = text.split()
    # Count the frequency of each token
    counter = Counter(tokens)
    # Select the most common tokens as the vocabulary
    most_common_tokens = [word for word, count in counter.most_common(max_features)]
    return most_common_tokens

#Create a document-term matrix or vector for the main text
def vectorize_text(text, vocabulary):
    vectorizer = CountVectorizer(vocabulary=vocabulary)
    vector = vectorizer.fit_transform([text])
    return vector.toarray()


url = 'https://www.who.int/health-topics/coronavirus#tab=tab_1'

# Perform the vectorization process
html_content = get_html(url)
if html_content:
    main_text = extract_main_text(html_content)
    vocabulary = select_vocabulary(main_text)
    vector = vectorize_text(main_text, vocabulary)
    print("Vocabulary:", vocabulary)
    print("Vector representation:", vector)
else:
    print("Failed to retrieve HTML content")


Vocabulary: ['and', 'the', 'to', 'of', 'a', 'in', 'or', 'COVID-19', 'is', 'with', 'at', 'from', 'on', 'disease', 'by', 'people', 'infected', 'virus', 'respiratory', 'for']
Vector representation: [[27 22 16  9  0  7  6  0  5  5  5  5  6  6  4  5  4  5  4  4]]
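
The zeros for 'a' and 'COVID-19' in the vector above are consistent with CountVectorizer's defaults: the text is lowercased and split with the token pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters, while a vocabulary passed in explicitly is used exactly as given. A one-letter word or a capitalized, hyphenated entry therefore never matches any token. The sketch below reproduces that default tokenization in plain Python for illustration; default_tokens and count_vector are hypothetical names, and the sample sentence is made up:

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Lowercase the text, then keep tokens of two or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


def count_vector(text, vocabulary):
    """Count how often each vocabulary entry appears among the default tokens."""
    counts = Counter(default_tokens(text))
    return [counts[term] for term in vocabulary]


sample = "COVID-19 is a disease. A virus causes COVID-19."
print(default_tokens(sample))
print(count_vector(sample, ["COVID-19", "a", "disease", "covid"]))
```

In the sample, 'COVID-19' is split into 'covid' and '19', so only the lowercase entry 'covid' receives a non-zero count.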


### To make the vocabulary more representative of the document, we can consider the following:

1. Synonyms and variations
2. Medical and scientific terms
3. Public health keywords
4. Severity and impact
5. Action words
6. Policy and guidance

We can refine the vocabulary by replacing less informative words, such as stop words, with terms that add specificity and relevance to the document's subject, based on the aforementioned criteria.

In [7]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords', quiet=True)
import warnings
warnings.filterwarnings('ignore')

# Download the set of stop words the first time
nltk.download('stopwords')

#Retrieve HTML content
def get_html(url):
    response = requests.get(url)
    return response.text if response.status_code == 200 else None

#Parse HTML to extract the main text
def extract_main_text(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    article_text = ' '.join([p.get_text() for p in soup.find_all('p')])
    return article_text

#Select a representative vocabulary
def select_vocabulary(text, max_features=20):
    # Tokenize the text
    tokens = text.split()
    # Remove common stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    # Count the frequency of each token
    counter = Counter(filtered_tokens)
    # Select the most common tokens as the vocabulary
    most_common_tokens = [word for word, count in counter.most_common(max_features)]
    return most_common_tokens

#Create a document-term matrix or vector for the main text
def vectorize_text(text, vocabulary):
    vectorizer = CountVectorizer(vocabulary=vocabulary)
    vector = vectorizer.fit_transform([text])
    return vector.toarray()


url = 'https://www.who.int/health-topics/coronavirus#tab=tab_1'

# Perform the vectorization process
html_content = get_html(url)
if html_content:
    main_text = extract_main_text(html_content)
    vocabulary = select_vocabulary(main_text)
    vector = vectorize_text(main_text, vocabulary)
    print("Vocabulary:", vocabulary)
    print("Vector representation:", vector)
else:
    print("Failed to retrieve HTML content")


Vocabulary: ['COVID-19', 'disease', 'people', 'infected', 'virus', 'respiratory', 'health', 'mild', 'recover', 'medical', 'symptoms:', 'symptoms', 'Health', 'pandemic', 'SARS-CoV-2', 'moderate', 'illness', 'without', 'become', 'seriously']
Vector representation: [[0 6 5 4 5 4 8 3 3 3 0 7 0 3 0 2 3 2 2 2]]


[nltk_data] Downloading package stopwords to /Users/alkis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
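
In the vocabulary above, 'Health' and 'health', and 'symptoms:' and 'symptoms', appear as separate entries because the tokens come from text.split() without any normalization, and the capitalized, punctuated, and hyphenated entries all receive a count of 0. A minimal sketch of normalizing tokens first and using the same tokenizer for both vocabulary selection and counting is shown below. The regular expression, the function names, and passing the stop words in as a parameter (for example set(stopwords.words('english'))) are assumptions made for illustration:

```python
import re
from collections import Counter

# Lowercase alphanumeric words, optionally joined by hyphens ('covid-19', 'sars-cov-2')
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each token."""
    return TOKEN_RE.findall(text.lower())


def select_vocabulary_normalized(text, stop_words, max_features=20):
    """Pick the most frequent normalized tokens that are not stop words."""
    counts = Counter(
        token for token in tokenize(text)
        if token not in stop_words and len(token) > 1
    )
    return [word for word, _count in counts.most_common(max_features)]


def term_frequency_vector(text, vocabulary):
    """Count each vocabulary term using the same tokenizer as the selection step."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]
```

Because selection and counting share one tokenizer, every vocabulary entry is guaranteed a non-zero count for the document it was selected from.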


### Refining the vocabulary with a predefined list of coronavirus-related terms

The following cell uses a predefined list of coronavirus-related terms that are expected to be representative of the content. The most common words in the text are filtered against this list, so that the final vocabulary is both frequent in the document and specific to the coronavirus topic. The isalpha check keeps only purely alphabetic tokens (no numbers or symbols), which also discards hyphenated or alphanumeric terms such as COVID-19 and SARS-CoV-2. The candidate pool is larger than 20 words before the final filtering, to leave enough words to choose from.

In [8]:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords', quiet=True)
import warnings
warnings.filterwarnings('ignore')

# Download the set of stop words the first time
nltk.download('stopwords')

# Retrieve HTML content
def get_html(url):
    response = requests.get(url)
    return response.text if response.status_code == 200 else None

# Parse HTML to extract the main text
def extract_main_text(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    article_text = ' '.join([p.get_text() for p in soup.find_all('p')])
    return article_text

# Select a representative vocabulary
def select_vocabulary(text, max_features=20):
    # Tokenize the text
    tokens = text.split()
    # Remove common stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words and word.isalpha()]
    # Count the frequency of each token
    counter = Counter(filtered_tokens)
    # Select the most common tokens as the vocabulary
    most_common_tokens = [word for word, count in counter.most_common(max_features * 2)]  # Increase to have a selection pool

    # Apply further criteria to refine the vocabulary
    covid_related_terms = ['coronavirus', 'COVID-19', 'pandemic', 'SARS-CoV-2', 'transmission', 'quarantine', 
                           'vaccination', 'symptoms', 'outbreak', 'spread', 'prevent', 'treatment', 'mask', 
                           'social distancing', 'public health', 'testing', 'immune', 'hospital', 'sanitizer', 'guidelines']
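    # Intersect the most common tokens with the covid_related_terms list while preserving order.
    # The comparison uses word.lower(), so mixed-case entries ('COVID-19', 'SARS-CoV-2') and
    # multi-word entries ('social distancing', 'public health') can never match a single token.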
    representative_tokens = [word for word in most_common_tokens if word.lower() in covid_related_terms][:max_features]
    
    return representative_tokens

# Create a document-term matrix or vector for the main text
def vectorize_text(text, vocabulary):
    vectorizer = CountVectorizer(vocabulary=vocabulary)
    vector = vectorizer.fit_transform([text])
    return vector.toarray()

url = 'https://www.who.int/health-topics/coronavirus#tab=tab_1'

# Perform the vectorization process
html_content = get_html(url)
if html_content:
    main_text = extract_main_text(html_content)
    vocabulary = select_vocabulary(main_text)
    vector = vectorize_text(main_text, vocabulary)
    print("Vocabulary:", vocabulary)
    print("Vector representation:", vector)
else:
    print("Failed to retrieve HTML content")


[nltk_data] Downloading package stopwords to /Users/alkis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Vocabulary: ['symptoms', 'pandemic', 'prevent', 'transmission', 'Coronavirus']
Vector representation: [[7 3 2 2 0]]
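
Only five of the twenty predefined terms survive the filtering above, and 'Coronavirus' gets a count of 0 because the vocabulary entry keeps its capital letter while CountVectorizer lowercases the text. One way to keep hyphenated and multi-word terms is to count each curated term directly in the lowercased text, as in the sketch below. The function names and the sample sentence are hypothetical, and this is an alternative counting approach rather than the method used in the cell above:

```python
import re


def count_terms(text, terms):
    """Count case-insensitive, whole-term occurrences of each term in text."""
    lowered = text.lower()
    counts = {}
    for term in terms:
        # Lookarounds stop a term from matching inside a longer word
        pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
        counts[term] = len(re.findall(pattern, lowered))
    return counts


def curated_vocabulary(text, terms, max_features=20):
    """Return the terms present in text, most frequent first, with their counts."""
    counts = count_terms(text, terms)
    present = [term for term in terms if counts[term] > 0]
    present.sort(key=lambda term: counts[term], reverse=True)  # stable for ties
    vocabulary = present[:max_features]
    return vocabulary, [counts[term] for term in vocabulary]


sample = (
    "COVID-19 is caused by SARS-CoV-2. Public health advice includes "
    "social distancing. Common covid-19 symptoms vary."
)
terms = ["COVID-19", "SARS-CoV-2", "public health", "social distancing", "symptoms", "quarantine"]
print(curated_vocabulary(sample, terms))
```

Terms that do not occur in the document, such as 'quarantine' in the sample, are dropped, so the resulting vector contains no zero entries.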
