<a href="https://colab.research.google.com/github/este7734/Web_scraping_project/blob/master/Der_Speigel_Web_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <font color='lightgreen'> Der Spiegel Web Scraper </font>

---



## Import Dependencies

In [1]:
# Import libraries for processing web text
from bs4 import BeautifulSoup
import requests

from textblob import TextBlob
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lxml import html

# Import these dependencies if using Google Colab 
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Define All Functions

In [2]:
# Get content of the webpage in an html string format by passing a url 
def get_html(url):
    page = requests.get(url)
    html_out = html.fromstring(page.content)
    text = page.text
    return html_out, text

# Convert html into soup to enable soup menthods
def get_soup(html_string):
    soup = BeautifulSoup(html_string, 'html.parser')
    return soup

# Extract hyperlinks from soup
def get_soup_links(soup):
    links = []
    for link in soup.find_all('a'):
        out_link = link.get('href')
        links.append(out_link)
    return links

# This function is for use with only the Topic pages on reuters.com
# Search through ALL links and filter for only those that are for actual articles
# links are formatted differently 
def get_articles_reuters_topics(links, old_url_set):
    articles = []
    for link in links:
        try:
            split_link = link.split('/')
            #if 'www.reuters.com'in split_link:
            if 'www.spiegel.de'in split_link:
              for topic in spiegel_topics:
                if topic in split_link:
                  link = link
                  if url_check(old_url_set, link) == False:
                    articles.append(link)
        except:
            continue
    articles = list(set(articles))
    old_url_set = set(articles + list(old_url_set))       
    return articles, old_url_set

# Check if new urls exists in the old_url_set. if yes, return True; if no, return False
# This function is used in the get_articles_reuters_topics function
def url_check(old_url_set, url):
    url_set = set([url])
    test_set = old_url_set & url_set
    if len(test_set) == 0:
        check = False
    else:
        check = True
    return check

# Get html strings from list of article weblinks
def get_html_reuters(articles):
    soup_list = []
    for article in articles:
        _, text = get_html(article)
        soup = get_soup(text)
        soup_list.append(soup)
    return soup_list

Tags for different websites

In [3]:
# Der Spiegel classes and tags
body_class = 'div'
headline_class = 'h2'
date_class = 'div'

body_tag = 'clearfix lg:pt-32 md:pt-32 sm:pt-24 md:pb-48 lg:pb-48 sm:pb-32'
headline_tag = 'lg:mb-20 md:mb-20 sm:mb-24'
date_tag = 'font-sansUI lg:text-base md:text-base sm:text-s text-shade-dark'

In [4]:
# Break out article_body, article_headline, and article_date from each article in provided hyperlinks and put into a dictionary called: out_list
def get_reuters_elements(soup_list, articles):
    out_list = []
    i = 0
    for article in soup_list:
        link = articles[i] # I don't think this is used at all here, which means there is no reason to require the second argument: articles
        i += 1
        try:
            article_body = article.find_all(body_class, {'class': body_tag})
            article_p = []
            for item in article_body:
                p_list = item.find_all('p')
                for p in p_list:
                    article_p.append(p.text)
            out_text = ' '.join(article_p)
            if out_text == '':
              continue
            if out_text.startswith('Besondere Reportagen'):
              continue
            out_dict = dict([('Text',out_text),('url',link)])
            out_list.append(out_dict)
        except:
            print('Unable to decode...skipping article...')
            continue

    return out_list

## Define URL Variables and Run Functions

Step 1. Instantiate `old_url_set` to be used in the `get_articles_reuters_topics` function. This is a running log of article links that will be compiled by iterating from steps 2 - 6.

In [5]:
# Instantiate empty set to use a running list of hyperlinks while 
# running the scrape iterations
old_url_set = set([])

## Scrape Reuters Topics pages for all the most recent news articles. <font color='orange'>*Run Steps 2 - 7 for each instance of `url` variable, before moving on to the next steps*</font>

<font color='orange'>Step 2.</font> Define variables for each of Reuters main topics pages. Run this cell for each iteration by uncommenting a different url each time.

In [26]:
# Define url variables
# NOTE: You must run these individually through the end of this section
# I didn't have time to figure out how to loop through all of them properly
# There is a section at the very bottom where you can see that I attempted but ran
# into a problem on one of the last functions. 

# Der Speigel Links
#url = r'https://www.spiegel.de/'
#url = r'https://www.spiegel.de/plus/'
#url = r'https://www.spiegel.de/schlagzeilen/'
#url = r'https://www.spiegel.de/politik/deutschland/'#zero
#url = r'https://www.spiegel.de/politik/ausland/' #zero
#url = r'https://www.spiegel.de/panorama/'#one
#url = r'https://www.spiegel.de/wirtschaft/'#one
#url = r'https://www.spiegel.de/netzwelt/' #one
#url = r'https://www.spiegel.de/wissenschaft/' #one
#url = r'https://www.spiegel.de/geschichte/'
url = r'https://www.spiegel.de/thema/leben/' #zero

spiegel_topics = ['plus', 'schlagzeilen', 'politik/deutschland', 'politik/ausland', 'panorama', 'wirtschaft', 'netzwelt', 'wissenschaft', 'geschichte', 'thema/leben']


## Scraper

<font color='orange'>Step 3.</font> Get HTML srting from web `url`

In [27]:
# Pass the each instance of `url` variable to return the web page in HTML format and convert it to a string
html_string = str(get_html(url))
# Pass the HTML string (of the web page) to get its soup
soup = get_soup(html_string)
# Find ALL links on within the soup
links = get_soup_links(soup)
# Use this for Topics Pages only
# Filter out only those links that are for actual articles. We only want the "good" links
# This filters out things like links to images and advertisements or non-news worthy pages
articles, old_url_set = get_articles_reuters_topics(links, old_url_set)
print(len(articles))
# Print out the running list of hyperlinks to see how many you have
print(len(old_url_set))

0
487


## <font color='skyblue'>Parse soup from entire list of hyperlinks that you just accumulated</font>

Step 8. Get soup for every link

Step 9. Parse the soup for each link into `article_body`, `article_title`, and `article_date`. Create list of dictionaries for each web page

Step 10. Run `TextBlob` on list of dictionaries to separate all sentences in a single list

In [28]:
# Get soup for each one of the "good" links
url_links = list(old_url_set) # Convert the running set of links to a list for use in the following functions

soup_list = get_html_reuters(url_links)
print(f'Length of Soup List: {len(soup_list)}')
#print(soup_list[0])

# Parse the soup for each "good" link to get article text, title, and date
out_list = get_reuters_elements(soup_list, url_links)
print(f'Length of out_list: {len(out_list)}')
#out_list[0:2]
blob_sentences =[]

Length of Soup List: 487
Length of out_list: 440


In [29]:
for i in range(len(out_list)):
  try:
    blob = out_list[i]['Text']
    trans = TextBlob(blob) # Enter string object
    trans = trans.translate(to='en')
    trans = str(trans) # Change to = 'en' for English
    trans = TextBlob(trans)
    for item in trans.sentences:
      blob_sentences.append(trans)
    print('\n', i+1, '\n',trans)
  except:
    print('got an error ...., skipping article....')
print(f'\nYou have {len(blob_sentences)} total sentences from {i+1} different articles')


 1 
 Numerous people were killed in heavy storms in India, and numerous people are said to have died as a result of lightning strikes. The authorities assume at least 104 deaths. Several others were injured, for example because gusts of wind and heavy rain outlined trees and electricity pylons and destroyed simply built houses. This was announced by the civil protection authorities of the two affected states, Bihar and Uttar Pradesh. According to the authorities, the victims are mainly farmers and homeless people who were outside at the time of the storm. The storm was part of the beginning monsoon in northern India. The monsoon season in South Asia usually lasts from June to September. Rain is vital for agriculture, but it also often causes great damage. This also includes lightning strikes - and dozens of people die again and again. However, as many people as now in the state of Bihar have not died on a single day for several years, said a civil protection worker there. Bihar is one

This is just a troubleshooting section to ensure you're `TextBlob` came out right. You should see a list of stentences, each starting with the word `Sentence`. Check this before moving to the next step.

In [30]:
# print the first 6 sentences so you see what it looks like
blob_sentences[0]
len(blob_sentences)

17112

## <font color='skyblue'> Event Detection </font>



Step 11. Determine key words used to search for event of interest in all the articles

In [31]:
# Allow user to type in key words to search the text for
# Note this is case sensitive... so you need to make sure you enter your search 
# Try the entering the following to see some results: corona,COVID,death
filter_list = input("Enter key words to search for separated by commas. don't use spaces.\nSearch is case sensitive.\nExample search: coronavirus,COVID,death\n\n") #.title() # This is still a string... not a list yet

Enter key words to search for separated by commas. don't use spaces.
Search is case sensitive.
Example search: coronavirus,COVID,death

America


Step 12. Convert keywords into an iterable list for use in the next step


In [32]:
# Split filter words and convert into a list for itterating in the next step 

f = []
for word in (filter_list.split(",")):  # Split string into separate words, separate by comma
  f.append(word)                       # Generate new list containing each key word
f # This is now a list of key words that the user typed in

['America']

Step 13. <font color='skyblue'> Search </font>  All sentences for Key Words and return only those consisting of a key word



In [33]:
# Instantiate list for holding the sentences
sentences = []

# Generate empty list of lists to store the sentences with your key words in them
for i in range(len(f)):
  sentences.append([])
#print('Here is what you just made, and empty list of lists: ', sentences, '\n')

# Generate lists of sentences for each key word and plug them into the list of lists from above            
for i in range(len(f)):
  for sentence in blob_sentences:
    if f[i] in sentence:
        sentences[i].append(sentence)
        
# Print number of sentences containing each key word
# Print out all sentences containing each key word
for i in range(len(f)):
  print('='*200)   
  print('\nThere are {} sentences containing the word: {} '.format(len(sentences[i]), f[i])) 
  print('-'*200)   


file_Der_Speigel = open("MyFileDer_Speigel.txt", "w") 
file_Der_Speigel.write(str(sentences))
file_Der_Speigel.close()
  #for sentence in sentences[i]:
      #print(sentence)


There are 5706 sentences containing the word: America 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


## <font color='black'>Step 13. Store All article data in DataFrame </font>  
<font color='grey'>Note: This is not just the filtered data, this contains all information from each web page. While not used in this project, it represents all of the original data used in this run. It can be used for reference.</font> 

In [34]:
# Put parsed data into Pandas DataFrame
pd.DataFrame(out_list)

Unnamed: 0,Text,url
0,Bei schweren Unwettern sind in Indien zahlreic...,https://www.spiegel.de/panorama/gesellschaft/i...
1,"SPIEGEL: Herr Professor Patzold, seit vielen J...",https://www.spiegel.de/geschichte/faszination-...
2,Der Vorstand des Bezahldienstleisters Wirecard...,https://www.spiegel.de/wirtschaft/unternehmen/...
3,Schauspieler Dennis Quaid und seine Partnerin ...,https://www.spiegel.de/panorama/leute/dennis-q...
4,In den USA steigen die Zahlen der Neuinfektion...,https://www.spiegel.de/panorama/corona-pandemi...
...,...,...
435,"""Viele Einwegprodukte aus Kunststoff sind über...",https://www.spiegel.de/wirtschaft/umweltschutz...
436,Die britische Regierung will die Übernahme wic...,https://www.spiegel.de/wirtschaft/unternehmen/...
437,Die schwedische Modekette Hennes & Mauritz hat...,https://www.spiegel.de/wirtschaft/coronakrise-...
438,"Von Silke Fokken, Annette Großbongardt, Armin ...",https://www.spiegel.de/panorama/bildung/corona...
