## News extraction with `BeautifulSoup`
This notebook is the first part of a project to train an LDA model. Here we build a corpus of news articles from Australian news, using the BeautifulSoup library to scrape the articles. In a second notebook we **pre-process** the data and transform it to train the LDA model. To see the second part of the project, check the folder 'LDA_model_and_vis'.

### Web Scraper - getting the corpus
In this first section we obtain some news articles from Australian news.

In [1]:
# loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from urllib.request import urlopen # library to open URLs
import requests # get the html content
from bs4 import BeautifulSoup

#### Specify URL with data and use requests to get html content

In [10]:
url = 'https://www.theguardian.com/'
r1 = requests.get(url)
coverpage = r1.content #html content
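
The request above assumes that the server answers promptly and successfully. A minimal sketch of a more defensive fetch could look like the following. The helper name `fetch_html` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes of `url` (hypothetical helper, not in the original notebook)."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return response.content
```

With this helper, `coverpage = fetch_html(url)` would play the role of the two lines above, and a failed request would stop the notebook early instead of passing an error page to the parser.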

In [11]:
soup = BeautifulSoup(coverpage, 'html5lib') # breaks html file into python objects
type(soup)

bs4.BeautifulSoup

Find all `h3` tags that hold headlines and use their `fc-item__title` class to access the content (this class is specific to this website; other pages may need a different one). In this section we only explore the structure, so that we can extract the contents systematically in the next section.

In [28]:
# soup # shows the source code of the website. Look for h3 tags and get the name of the class

In [17]:
coverpage_news = soup.find_all('h3',class_='fc-item__title') # use the name of the class for h3 tags

In [19]:
coverpage_news[4]

<h3 class="fc-item__title"><a class="fc-item__link" data-link-name="article" href="https://www.theguardian.com/australia-news/2020/aug/04/covid-australia-police-officer-in-victoria-allegedly-brutally-bashed-by-anti-masker"><span class="fc-item__kicker">Melbourne</span> <span class="u-faux-block-link__cta fc-item__headline"> <span class="js-headline-text">Police officer allegedly brutally bashed by anti-masker</span></span> </a></h3>

In [20]:
coverpage_news[4].get_text() # check text of fifth object

'Melbourne  Police officer allegedly brutally bashed by anti-masker '

In [43]:
coverpage_news[4].find('a')['href']
# coverpage_news[4].find('a').get_text()
# requests.get(str(coverpage_news[4].find('a')['href']))

'https://www.theguardian.com/australia-news/2020/aug/04/covid-australia-police-officer-in-victoria-allegedly-brutally-bashed-by-anti-masker'
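
The exploration steps above (find the `h3` tags, read the text, read the `href`) can be condensed into one small function. This is only a sketch: `extract_headlines` is a hypothetical helper, the HTML fragment is a simplified stand-in modelled on the markup shown above, and the built-in `html.parser` is used here instead of `html5lib` so that the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified front-page fragment modelled on the markup shown above
sample_html = """
<h3 class="fc-item__title">
  <a class="fc-item__link" href="https://example.com/story-1">
    <span class="fc-item__kicker">Melbourne</span>
    <span class="js-headline-text">First headline</span>
  </a>
</h3>
<h3 class="fc-item__title">
  <a class="fc-item__link" href="https://example.com/story-2">
    <span class="js-headline-text">Second headline</span>
  </a>
</h3>
"""


def extract_headlines(html, title_class="fc-item__title"):
    """Return (title, link) pairs for every matching h3 tag (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install
    pairs = []
    for h3 in soup.find_all("h3", class_=title_class):
        anchor = h3.find("a")
        if anchor is None or not anchor.has_attr("href"):
            continue  # skip headings that do not link to an article
        title = " ".join(anchor.get_text().split())  # collapse runs of whitespace
        pairs.append((title, anchor["href"]))
    return pairs


headlines = extract_headlines(sample_html)
```

Skipping headings without a link avoids the `TypeError` that `h3.find('a')['href']` would raise when `find` returns `None`.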

#### Scrape the first 50 articles

In [161]:
import json
import operator
import functools

# Function for the live news section, which has a different structure (JSON format)
def live_news(soup_obj, tit_list, cont_list):
    body = soup_obj.find_all('script', type='application/ld+json') # script tags holding the live news data
    for article in body: # for each of the found tags
        try:
            json_file = json.loads(article.text) # parse the tag text (JSON) into Python objects
            json_txt = functools.reduce(operator.iconcat, json_file['liveBlogUpdate'], []) # flat list of dicts
            full_dict = {d['headline']:d['articleBody'] for d in json_txt} # dict of headline and article content
        except (json.JSONDecodeError, KeyError, TypeError):
            continue # not a live blog payload, skip this tag
        for key,value in full_dict.items():
            tit_list.append(key)
            cont_list.append(value)
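
To make the flattening step in `live_news` easier to follow, here is a small standalone sketch that runs the same `reduce`/`iconcat` idiom on a made-up payload. The payload is hypothetical: in particular, the nested-list shape of `liveBlogUpdate` is an assumption inferred from the flattening step, not something confirmed against the live site.

```python
import functools
import json
import operator

# Hypothetical JSON-LD payload; the nested-list shape of 'liveBlogUpdate' is an assumption
sample_json = json.dumps({
    "@type": "LiveBlogPosting",
    "liveBlogUpdate": [
        [{"headline": "Update A", "articleBody": "Body A"}],
        [{"headline": "Update B", "articleBody": "Body B"},
         {"headline": "Update C", "articleBody": "Body C"}],
    ],
})

data = json.loads(sample_json)

# reduce + iconcat extends an initially empty list with each inner list in turn
updates = functools.reduce(operator.iconcat, data["liveBlogUpdate"], [])

# Keying by headline means a repeated headline keeps only its last body
by_headline = {u["headline"]: u["articleBody"] for u in updates}

titles = list(by_headline)
bodies = list(by_headline.values())
```

Because the intermediate result is a dictionary keyed by headline, two updates that share a headline would collapse into one entry, which is worth keeping in mind when counting the extracted articles.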

In [184]:
num_articles = 50

cont_list = []
links_list = []
titles_list = []

for n in range(num_articles): # for each article
    link = coverpage_news[n].find('a')['href'] # get the link of news article
    links_list.append(link)

    article = requests.get(link) # request the article page
    cont = article.content # get the article html

    soup_article = BeautifulSoup(cont,'html5lib') # BS object
    body = soup_article.find_all('div', {'class':['content__article-body from-content-api js-article__body',
                                                  'article-body-commercial-selector css-79elbk']}) # get the content in the specific class_
    if not body: # no regular article body found, so treat the page as a live news page
        live_news(soup_article, titles_list, cont_list)
    else:
        x = body[0].find_all('p') # get only the p (paragraph) tags
        list_paragraphs = [p.get_text() for p in x] # text of each paragraph
        final_article = ' '.join(list_paragraphs) # put all paragraphs together

        cont_list.append(final_article)
        title = coverpage_news[n].find('a').get_text() # get the title of news article
        titles_list.append(title)

After the extraction process we obtained 68 articles. The live news pages contain more than one title each, so we end up with more than 50.

In [188]:
print(len(titles_list))
print(len(cont_list))
titles_list[50],cont_list[50][:300]

68
68


("  \n\n\n  Bolt's talent for speed becomes more apparent now it is denied to us ",
 'Something you may not know about me: there is almost no set of circumstances – personal, professional, medical – in which I will not drop everything to watch Usain Bolt. Naturally, my personalised YouTube algorithm has already known this for some time, and will now instantly recommend me a selection')
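
The title printed above still carries leading spaces and newlines from the page markup. A minimal sketch of one way to normalise such titles before further processing is shown below. `clean_title` is a hypothetical helper and is not part of the original notebook.

```python
def clean_title(raw_title):
    """Collapse whitespace runs and trim the ends (hypothetical helper)."""
    return " ".join(raw_title.split())


raw = "  \n\n\n  Bolt's talent for speed becomes more apparent now it is denied to us "
cleaned = clean_title(raw)
```

The same helper could be applied to the whole list with `[clean_title(t) for t in titles_list]`.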

### Conclusion
After exploring the html content of the selected news webpage we were able to extract the first 50 articles of the main page. During this process we came across two types of articles (live updates and regular articles), which required two different extraction approaches. Live updates store their content in JSON format inside a `script` tag, while regular articles were obtained from the `div` tag with the corresponding `class`.