# Web Scraping: Daily Mail

In [24]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
import time

### Obtain list of news from the coverpage

URL definition:

In [25]:
# url definition
url = "https://www.iiot-world.com/artificial-intelligence-ml/artificial-intelligence/five-challenges-in-designing-a-fully-autonomous-system-for-driverless-cars/"

List of news:

In [26]:
# Request
r1 = requests.get(url)
r1.status_code

# We'll save the page content in newspage
newspage = r1.content

# Soup creation
soup1 = BeautifulSoup(newspage, 'html5lib')

# News identification
news_titles = soup1.find_all('strong')
news_titles = news_titles[10:13]

In [27]:
news_titles1 = [news_title.get_text() for news_title in news_titles]
news_titles1

['3. Traffic conditions', '4. Accident Liability', '5. Radar Interference']
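
The slice `news_titles[10:13]` only works while the page keeps exactly the same number of `<strong>` tags before the headings. A minimal sketch of a less position-dependent selection follows, assuming the headings always look like `N. Title`. The sample markup and the helper name `numbered_titles` are hypothetical, and `html.parser` is used only so the snippet runs without `html5lib`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
sample_html = """
<p><strong>Share this</strong></p>
<p><strong>3. Traffic conditions</strong></p>
<p><strong>4. Accident Liability</strong></p>
<p><strong>5. Radar Interference</strong></p>
"""

def numbered_titles(html, wanted=(3, 4, 5)):
    """Return the text of <strong> tags that look like 'N. Title' for N in wanted."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for tag in soup.find_all("strong"):
        text = tag.get_text(strip=True)
        match = re.match(r"(\d+)\.\s+\S", text)
        if match and int(match.group(1)) in wanted:
            titles.append(text)
    return titles

print(numbered_titles(sample_html))
```

Matching on the text pattern means unrelated `<strong>` tags, such as the "Share this" label in the sample, are skipped regardless of where they appear.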

In [28]:
soup_article = BeautifulSoup(newspage, 'html5lib')
body = soup_article.find_all('p')
body = body[6:9]

In [29]:
bodies = [x.get_text() for x in body ]
bodies

['Autonomous cars would have to get onto the road where they would have to drive in all sorts of traffic conditions. They would have to drive with other autonomous cars on the road, and at the same time, there would also be a lot of humans. Wherever humans are involved, there are involved a lot of emotions. Traffic could be highly moderated and self-regulated. But often there are cases where people may be breaking traffic rules. An object may turn up in unexpected conditions. In the case of dense traffic, even the movement of few cms per minute does matter. One can’t wait endlessly for traffic to automatically clear and have some precondition to start moving. If more of such cars on the road are waiting for traffic to get cleared, ultimately that may result in a traffic deadlock.',
 'The most important aspect of autonomous cars is accidents liability. Who is liable for accidents caused by a self-driving car? In the case of autonomous cars, the software will be the main component that w

In [30]:
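def file_writer(num, title, body):
    # Build a file name such as 001.txt from the zero-based index
    filename = '00' + str(num + 1) + ".txt"
    # The with-block closes the file even if the write fails
    with open(filename, "w") as file:
        file.write(title + "\n\n" + body)

for i in range(0, 3):
    file_writer(i, news_titles1[i], bodies[i])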
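
As a possible variation on the writer above, here is a sketch using `pathlib` and a zero-padded format specifier. The name `write_article` and the `out_dir` parameter are hypothetical additions for illustration. Unlike the `'00' + str(num + 1)` prefix, `:03d` keeps the name three digits wide once the index reaches ten.

```python
import tempfile
from pathlib import Path

def write_article(num, title, body, out_dir="."):
    """Write one article to a zero-padded file such as 001.txt and return its path."""
    path = Path(out_dir) / f"{num + 1:03d}.txt"
    path.write_text(f"{title}\n\n{body}", encoding="utf-8")
    return path

# Demo in a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_article(0, "3. Traffic conditions", "Body text.", out_dir=tmp)
    print(saved.name, saved.read_text(encoding="utf-8").splitlines()[0])
```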

Now we have a list in which every element is a news article:

In [6]:
coverpage_news[0]

<h2 class="entry-title qodef-post-title" itemprop="name">
    
	        Five challenges in designing a fully autonomous system for driverless cars        	
</h2>

### Let's extract the text from the articles:

First, we'll define the number of articles we want:

In [7]:
number_of_articles = 1

In [11]:
title = coverpage_news[0].get_text()
title

'\n    \n\t        Five challenges in designing a fully autonomous system for driverless cars        \t\n'
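
The title returned by `get_text()` still carries the newlines, tabs and padding from the markup. A small sketch of one way to normalise it with plain string methods follows; the helper name `clean_title` is hypothetical.

```python
def clean_title(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(raw.split())

raw_title = '\n    \n\t        Five challenges in designing a fully autonomous system for driverless cars        \t\n'
print(clean_title(raw_title))
```

Calling `get_text(strip=True)` on the tag removes the leading and trailing whitespace as well, but it does not collapse whitespace inside the text.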

In [14]:
soup_article = BeautifulSoup(newspage, 'html5lib')
body = soup_article.find_all('p')
body[6]

<p>Autonomous cars would have to get onto the road where they would have to drive in all sorts of traffic conditions. They would have to drive with other autonomous cars on the road, and at the same time, there would also be a lot of humans. Wherever humans are involved, there are involved a lot of emotions. Traffic could be highly moderated and self-regulated. But often there are cases where people may be breaking traffic rules. An object may turn up in unexpected conditions. In the case of dense traffic, even the movement of few cms per minute does matter. One can’t wait endlessly for traffic to automatically clear and have some precondition to start moving. If more of such cars on the road are waiting for traffic to get cleared, ultimately that may result in a traffic deadlock.</p>

In [8]:
# Empty lists for content, links and titles
news_contents = []
list_links = []
list_titles = []

for n in np.arange(0, number_of_articles):
        
    # Getting the link of the article
    link = url + coverpage_news[n].find('a')['href']
    list_links.append(link)
    
    # Getting the title
    title = coverpage_news[n].find('a').get_text()
    list_titles.append(title)
    
    # Reading the content (it is divided in paragraphs)
    article = requests.get(link)
    article_content = article.content
    soup_article = BeautifulSoup(article_content, 'html5lib')
    body = soup_article.find_all('p', class_='mol-para-with-font')
    
    # Unifying the paragraphs
    list_paragraphs = []
    for p in np.arange(0, len(body)):
        paragraph = body[p].get_text()
        list_paragraphs.append(paragraph)
    # Join once after the loop so final_article is defined even when body is empty
    final_article = " ".join(list_paragraphs)
        
    # Removing special characters
    final_article = re.sub("\\xa0", "", final_article)
        
    news_contents.append(final_article)

TypeError: 'NoneType' object is not subscriptable
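
This error is most likely raised by `coverpage_news[n].find('a')['href']`: the `<h2>` shown earlier contains no `<a>` tag, so `find('a')` returns `None` and indexing it fails. A minimal sketch of a guard follows. The helper name `extract_link_and_title`, the sample markup and the `example.com` base URL are assumptions for illustration, and `html.parser` is used only so the snippet runs without `html5lib`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_link_and_title(heading, base_url):
    """Return (link, title) for a heading tag; link is None when no <a href> exists."""
    anchor = heading.find("a")
    title = heading.get_text(strip=True)
    if anchor is None or not anchor.get("href"):
        return None, title
    # urljoin handles both relative and absolute href values.
    return urljoin(base_url, anchor["href"]), title

# Hypothetical sample: one heading without a link, one with a relative link.
sample = BeautifulSoup(
    '<h2 class="entry-title">Plain heading</h2>'
    '<h2 class="entry-title"><a href="/news/a.html">Linked heading</a></h2>',
    "html.parser",
)
for h2 in sample.find_all("h2"):
    print(extract_link_and_title(h2, "https://example.com"))
```

Returning `None` for the link lets the calling loop skip headings that do not point to an article instead of stopping with an exception.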

Let's put them into:
* a dataset which will be the input of the models (`df_features`)
* a dataset with the title and the link (`df_show_info`)

In [7]:
# df_features
df_features = pd.DataFrame(
     {'Article Content': news_contents 
    })

# df_show_info
df_show_info = pd.DataFrame(
    {'Article Title': list_titles,
     'Article Link': list_links})

In [8]:
df_features

| | Article Content |
|---|---|
| 0 | A pair of student drug dealers have been spare... |
| 1 | Abu Hamza's son is the suspect held in connect... |
| 2 | The doorman stabbed in Mayfair on New Year's E... |
| 3 | Quite a lot of space on the internet has been ... |
| 4 | Grace Kelly's granddaughterCharlotte Casiraghi... |
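
The fused word `granddaughterCharlotte` in row 4 is consistent with the non-breaking space (`\xa0`) being deleted rather than replaced by an ordinary space. A small sketch of the difference follows; the helper name `strip_nbsp` and the sample string are hypothetical, since the real article text is truncated above.

```python
import re

def strip_nbsp(text):
    """Replace non-breaking spaces with ordinary spaces and collapse repeats."""
    return re.sub(r"[\xa0 ]+", " ", text).strip()

# Hypothetical sample with a non-breaking space between two words.
sample = "Grace Kelly's granddaughter\xa0Charlotte Casiraghi"

print(re.sub("\\xa0", "", sample))  # deleting the character fuses the two words
print(strip_nbsp(sample))           # substituting a space keeps them apart
```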


In [9]:
df_show_info

| | Article Title | Article Link |
|---|---|---|
| 0 | Student drug dealers are SPARED jail after imp... | https://www.dailymail.co.uk/news/article-65535... |
| 1 | Revealed: Hate preacher Abu Hamza's son, 26, i... | https://www.dailymail.co.uk/news/article-65540... |
| 2 | Doorman stabbed in Mayfair on New Year's Eve d... | https://www.dailymail.co.uk/news/article-65485... |
| 3 | Enjoy crispy hash browns, the freshest fries, ... | https://www.dailymail.co.uk/femail/food/articl... |
| 4 | Grace Kelly's granddaughter Charlotte Casiragh... | https://www.dailymail.co.uk/femail/article-655... |


### Time Elapsed

We are interested in how long the script takes to get the news, because this directly affects the user experience. To measure it, we'll put everything into a single function and then call it:

In [10]:
def get_news_dailymail():
    
    # url definition
    url = "https://www.dailymail.co.uk"
    
    # Request
    r1 = requests.get(url)
    r1.status_code

    # We'll save in coverpage the cover page content
    coverpage = r1.content

    # Soup creation
    soup1 = BeautifulSoup(coverpage, 'html5lib')

    # News identification
    coverpage_news = soup1.find_all('h2', class_='linkro-darkred')
    len(coverpage_news)
    
    number_of_articles = 5

    # Empty lists for content, links and titles
    news_contents = []
    list_links = []
    list_titles = []

    for n in np.arange(0, number_of_articles):

        # Getting the link of the article
        link = url + coverpage_news[n].find('a')['href']
        list_links.append(link)

        # Getting the title
        title = coverpage_news[n].find('a').get_text()
        list_titles.append(title)

        # Reading the content (it is divided in paragraphs)
        article = requests.get(link)
        article_content = article.content
        soup_article = BeautifulSoup(article_content, 'html5lib')
        body = soup_article.find_all('p', class_='mol-para-with-font')

        # Unifying the paragraphs
        list_paragraphs = []
        for p in np.arange(0, len(body)):
            paragraph = body[p].get_text()
            list_paragraphs.append(paragraph)
        # Join once after the loop so final_article is defined even when body is empty
        final_article = " ".join(list_paragraphs)

        # Removing special characters
        final_article = re.sub("\\xa0", "", final_article)
        
        news_contents.append(final_article)
        

    # df_features
    df_features = pd.DataFrame(
         {'Content': news_contents 
        })

    # df_show_info
    df_show_info = pd.DataFrame(
        {'Article Title': list_titles,
         'Article Link': list_links,
         'Newspaper': 'Daily Mail'})
    
    return (df_features, df_show_info)

In [13]:
start = time.time()
x, y = get_news_dailymail()
end = time.time()
te = end-start
print("The time elapsed is %f seconds" %(te))

The time elapsed is 31.879444 seconds
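
For repeated measurements, the start/end bookkeeping can be wrapped in a small helper. The sketch below uses `time.perf_counter()`, a monotonic clock intended for measuring intervals. The helper name `timed` is hypothetical, and `sum` stands in for `get_news_dailymail` so the snippet runs without network access.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be get_news_dailymail.
result, elapsed = timed(sum, range(1_000_000))
print(f"The time elapsed is {elapsed:.3f} seconds")
```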


That is really slow, so we won't include this scraper in the app.