## DS103 - Data Engineering - Final Project 
### Name: Wong Woon Yong

### Introduction
Web scraping is a technique for gathering structured data from web pages; it is a quick way to acquire data that is presented on the web in a particular format. This web-scraping project shows the three basic steps of retrieving information from a web page:<br>

Step 1: Access the target website using the HTTP library Requests.<br>
Step 2: Parse the page content using the parsing library Beautiful Soup.<br>
Step 3: Save the result in DataFrame format.<br>


### Objective
This project demonstrates the process of parsing and collecting information from a news website. The "Requests" library makes an HTTP request and collects the HTML; after the connection status is checked, the HTML is scraped with "Beautiful Soup". The target is to retrieve the title, the headers (h1 and h2), and HTML elements selected by tag name (two tag names: "a" and "span").


### Import libraries and modules for selecting HTML tags.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup, diagnose
import requests

### 1) Establish a Connection to a News Web-page.
Start by setting up a connection to the Straits Times website and checking that the connection status is response code 200. The "headers" argument in the code sends a "user-agent" value, so the request identifies itself the way a browser user would.

* Code Details:<br>
1) Codes in the 200 series indicate a successful connection. <br>
2) Codes in the 400 series indicate a client error, for example 403 "Forbidden" (access is blocked or protected) or 404 "Not Found". <br>
3) Codes in the 500 series indicate a "Server Error". <br>

* Important: Proceed to the next step only when the connection status is Code 200.<br>
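
The status-code ranges listed above can be sketched as a small helper. This is only an illustration using the standard library; the function name `describe_status` and the category labels are assumptions made for this example, not part of Requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Map an HTTP status code to a coarse category (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "informational or unknown"


# Print the standard reason phrase next to the coarse category.
for code in (200, 403, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", describe_status(code))
```

With Requests itself, calling `raise_for_status()` on a response raises an exception for 4xx and 5xx codes, which is one way to stop a script before the parsing step.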


In [2]:
url = "https://www.straitstimes.com/global"      # Assign Web-page link to "url"
connection1 = requests.get(url, headers={"user-agent":"Mozilla/80.0"})

connection1.status_code    # Check the Connection status
    

200

### Content check from web-page
With a successful connection (Code 200), the content of the web-page can be checked directly with the ".content" attribute. However, this displays the raw content as unformatted bytes, which is difficult to read.

In [18]:
connection1.content
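
As a rough illustration of why the raw output is hard to read: ".content" holds bytes, which have to be decoded before they look like ordinary text. The sketch below uses a made-up byte string (`raw_bytes` is an assumption for this example) instead of a live request.

```python
# Hypothetical stand-in for the bytes that ".content" would return.
raw_bytes = b"<html><head><title>Caf\xc3\xa9 News</title></head></html>"

# Decoding turns the bytes into a regular string.
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes).__name__, "->", type(decoded_text).__name__)
print(decoded_text[:60])   # preview only the first 60 characters
```

Requests also offers a ".text" attribute that returns the decoded string; slicing it, as above, keeps the preview short.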

### 2) Parsing the content and extracting the text
### a) Applying Parser

BeautifulSoup is commonly used together with the requests library: requests fetches the page, and BeautifulSoup parses the resulting HTML so that data can be extracted. This is needed because the data from the web-page is raw and must be parsed before the content can be understood.

There are several types of parser. The examples here use (i) "html.parser" and (ii) "html5lib", each assigned to a variable ("soup" and "soup_2"). The result is then printed with the ".prettify()" function, which indents the HTML code to make it easier to read.


### i) html.parser


In [19]:
soup = BeautifulSoup(connection1.content, 'html.parser') 
print(soup.prettify()) 


### ii) html5lib

In [20]:
soup_2 = BeautifulSoup(connection1.content,'html5lib')
print(soup_2.prettify()) 


### Observation on the two Parser methods above
- The results from 'html.parser' and 'html5lib' are not much different; the contents extracted from the web-page are about 95% similar.
- Based on research, 'html.parser' has decent speed and "html5lib" is slower, but "html5lib" is much better at handling dangling-tag issues that can occur in the middle of a web-page, for example a paragraph (p tag) without a closing tag.
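
The unclosed-paragraph case can be tried on a small fragment. The sketch below is an assumption-laden illustration: `parse_with_available` is a hypothetical helper, the fragment is made up, and parsers that are not installed are skipped by catching `FeatureNotFound` from bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_available(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Parse the same markup with each installed parser (hypothetical helper)."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # This parser is not installed in the current environment.
            continue
    return results


# Two paragraphs, neither of which has a closing tag.
broken = "<p>First paragraph<p>Second paragraph"

for name, repaired in parse_with_available(broken).items():
    print(f"{name}: {repaired}")
```

Comparing the printed lines side by side shows how each installed parser chose to repair the same fragment.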

## *** Adding a Check on Available Parsers
With a simple check using the "diagnose" module from bs4, the available parsers can be identified, along with how each one handles the markup, which indicates which parser gives the best result. <br>

### i. Create a dummy of "copy_html" and check with "diagnose".

In [6]:
copy_html = """ 
<!DOCTYPE html>
<!--[if IE 8]> <html class="no-js lt-ie9 is-ie"> <![endif]-->
<!--[if IE 9]> <html class="no-js is-ie"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" dir="ltr" lang="en" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product# content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
 <!--<![endif]-->
 <head profile="http://www.w3.org/1999/xhtml/vocab">
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
  <script src="/sites/all/themes/custom/bootdemo/js/ads_checker.js">
  </script>
"""

### ii. Check with "diagnose"
Apply the "diagnose" function on the dummy of "copy_html" and check for the result.

In [7]:
diagnose.diagnose(copy_html)       # Run the diagnose function from the bs4.diagnose module.

Diagnostic running on Beautiful Soup 4.9.1
Python version 3.8.3 (default, Jul  2 2020, 17:30:36) [MSC v.1916 64 bit (AMD64)]
Found lxml version 4.5.2.0
Found html5lib version 1.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<!DOCTYPE html>
<!--[if IE 8]> <html class="no-js lt-ie9 is-ie"> <![endif]-->
<!--[if IE 9]> <html class="no-js is-ie"> <![endif]-->
<!--[if gt IE 9]><!-->
<html class="no-js" dir="ltr" lang="en" prefix="og: http://ogp.me/ns# article: http://ogp.me/ns/article# book: http://ogp.me/ns/book# profile: http://ogp.me/ns/profile# video: http://ogp.me/ns/video# product: http://ogp.me/ns/product# content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ rdfs: http://www.w3.org/2000/01/rdf-schema# sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema#">
 <!--<![endif]-->
 <head profile

### Observation on Diagnose Result:
The "html5lib" parser is the best at handling missing tags: in the retrieved data above, it inserted the missing **"< body >    < /body >"** tags accordingly.

### 2) Parsing the content and extracting the text
### b) Extracting Web-page Title
HTML contains different types of tags. It is necessary to understand what each HTML tag stands for, because that helps to identify the tag pattern and eases the web scraping process. First, extract the Title element, a required HTML element that assigns a title to an HTML document.<br>

*** The title can be extracted directly with the simple ".title" attribute, or with the more comprehensive ".find_all()" function.


In [8]:
soup.title                           # Simple ".title" method


<title>The Straits Times - Breaking News, Lifestyle &amp; Multimedia News</title>

In [9]:
title = soup.find_all('title')       # Extracting the "title" with ".find_all" function.
print("All title in the web-page: \n", title)


All title in the web-page: 
 [<title>The Straits Times - Breaking News, Lifestyle &amp; Multimedia News</title>]
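
If only the title text is wanted, without the surrounding tag, something like the following could be used. It is a minimal sketch on a static HTML string (`sample_html` is an assumption that copies the title shown above), so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# Static sample that reuses the title from the output above.
sample_html = (
    "<html><head><title>The Straits Times - Breaking News, "
    "Lifestyle &amp; Multimedia News</title></head><body></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first match, or None when the tag is absent.
title_tag = sample_soup.find("title")
title_text = title_tag.get_text(strip=True) if title_tag else None

print(title_text)
```

The `None` check guards against pages that have no title element, which ".title" alone would not handle when chained with further calls.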


### Extracting the Headlines and Sub-headlines.
In a web-page, header tags have their own place and should be used in proper order, starting with the header, h1. It contains targeted keywords closely related to the page title and content, while the sub-header, h2, should contain keywords similar to those of the h1 tag.
Here the ".find_all()" function is applied to extract the header and sub-header.


In [10]:
header_chk = soup.find_all(['h1', 'h2'])    # Find every h1 and h2 tag.
total_headers = len(header_chk)             # Count the h1 and h2 tags in the web-page.
print("Total h1 and h2 tags in this website :", total_headers)

for a in header_chk:       # Loop over the results to display each h1 and h2.
    print(a)


Total h1 and h2 tags in this website : 14
<h1 class="site-name"><a class="name navbar-brand" href="/" title="Home"><span>The Straits Times</span></a></h1>
<h2 class="pane-title">
            Top Stories          </h2>
<h2 class="pane-title">
            top picks          </h2>
<h2 class="pane-title">
            covid-19          </h2>
<h2 class="pane-title">
            For Subscribers          </h2>
<h2 class="pane-title">
            VIEWS          </h2>
<h2 class="pane-title">
            Asian Insider          </h2>
<h2 class="pane-title">
            DISCOVER          </h2>
<h2 class="pane-title">
            Videos          </h2>
<h2 class="pane-title">
            GZERO MEDIA          </h2>
<h2 class="pane-title">
            PODCASTS          </h2>
<h2 class="pane-title">
            MULTIMEDIA          </h2>
<h2 class="pane-title">
            MOST POPULAR          </h2>
<h2 class="pane-title">
            Branded Content          </h2>
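
The h2 results above carry a lot of surrounding whitespace. One possible way to tidy them, sketched here on a static sample copied from that output (the names `sample_html` and `headings` are assumptions for this example), is to pair each tag name with its stripped text.

```python
from bs4 import BeautifulSoup

# Static sample based on the first few headings in the output above.
sample_html = """
<h1 class="site-name"><a class="name navbar-brand" href="/" title="Home"><span>The Straits Times</span></a></h1>
<h2 class="pane-title">
            Top Stories          </h2>
<h2 class="pane-title">
            top picks          </h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep the tag name so h1 and h2 entries can still be told apart.
headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in sample_soup.find_all(["h1", "h2"])
]

for level, text in headings:
    print(level, ":", text)
```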


### Extracting other HTML tags - "a" and "span"
Next, extract other types of tags, again applying the ".find_all()" function to locate the following two tags:<br>

1) An anchor or "a" element creates a hyperlink to another web-page or to a location within the same web-page. <br>
2) A "span" element marks up inline content, typically for styling purposes.


In [11]:
other_tag1=soup.find_all('a')
total_links=len(other_tag1) 
print("total links in my website :", total_links)

for i in other_tag1[:8]:        # print 8 rows for checking
    print(i)

total links in my website : 215
<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
<a class="name navbar-brand" href="/" title="Home"><span>The Straits Times</span></a>
<a class="name navbar-brand" href="/" title="Home"><span>The Straits Times</span></a>
<a class="globallink-ed" href="/global/" id="global-ed">International</a>
<a class="sinlink-ed" href="/" id="sin-ed">Singapore</a>
<a href="http://stepaper.straitstimes.com" target="_blank">E-paper</a>
<a href="/">Home</a>
<a href="/singapore">Singapore</a>
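
Many of the "href" values above are relative paths such as "/singapore". As a hedged sketch, they could be turned into full addresses with `urljoin` from the standard library. The sample below reuses three anchors from the output and adds one made-up anchor without an "href" to show that it is skipped; `base_url` is assumed to be the page that was requested.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.straitstimes.com/global"   # assumed page address

sample_html = """
<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
<a href="http://stepaper.straitstimes.com" target="_blank">E-paper</a>
<a href="/singapore">Singapore</a>
<a>Hypothetical anchor without href</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only anchors that actually carry a link target.
links = [
    (a.get_text(strip=True), urljoin(base_url, a["href"]))
    for a in sample_soup.find_all("a", href=True)
]

for text, address in links:
    print(text, "->", address)
```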


In [12]:
other_tag2=soup.find_all('span')
total_links=len(other_tag2) 
print("\nTotal SPAN tag in this website (displaying 8 rows): ", total_links)

for i in other_tag2[:8]:       # print 8 rows for checking
    print(i)


Total SPAN tag in this website (displaying 8 rows):  97
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span>The Straits Times</span>
<span>The Straits Times</span>
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>


### Further Extract - removing the tag name
To remove the tag names from the retrieved information, print the result using the ".get_text()" function.
For example, after running ".get_text()" on the span tag results, the output (in the cell below) is easier to read and understand.

In [13]:
other_tag2=soup.find_all('span')
total_links=len(other_tag2) 
print("\nTotal SPAN tag in this website (displaying 8 rows): ", total_links)

for i in other_tag2[:8]:         # print 8 rows for checking
    print(i.get_text())



Total SPAN tag in this website (displaying 8 rows):  97
Toggle navigation



The Straits Times
The Straits Times
Toggle navigation
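
The blank lines in the output above come from span tags that contain no text, such as the "icon-bar" spans. A minimal sketch of skipping them, using a static sample copied from the earlier span output (`sample_html` and `span_texts` are assumed names):

```python
from bs4 import BeautifulSoup

sample_html = """
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span>The Straits Times</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Strip each span's text, then keep only the non-empty results.
stripped = (span.get_text(strip=True) for span in sample_soup.find_all("span"))
span_texts = [text for text in stripped if text]

print(span_texts)
```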



### Alternatively, the tag name can be removed by using ".text.strip()"
The cell below uses a for-loop with ".text.strip()" on the "a" tags. Each tag name is removed and the remaining text is saved into "a_list".

In [14]:
a_list = []                      # Collect the text of every "a" tag.
for b in other_tag1:
    result = b.text.strip()      # Drop the tag markup and surrounding whitespace.
    a_list.append(result)
print(a_list)

['Skip to main content', 'The Straits Times', 'The Straits Times', 'International', 'Singapore', 'E-paper', 'Home', 'Singapore', 'Jobs', 'Housing', 'Parenting & Education', 'Politics', 'Health', 'Transport', 'Courts & Crime', 'Consumer', 'Environment', 'Community', 'Asia', 'SE Asia', 'East Asia', 'South Asia', 'Australia/NZ', 'World', 'United States', 'Europe', 'Middle East', 'Opinion', 'ST Editorial', 'Cartoons', 'Forum', 'Life', 'Food', 'Entertainment', 'Style', 'Travel', 'Arts', 'Motoring', 'Home & Design', 'Business', 'Economy', 'Invest', 'Banking', 'Companies & Markets', 'Property', 'Tech', 'Tech News', 'E-sports', 'Reviews', 'Sport', 'Football', 'Schools', 'Formula One', 'Combat Sports', 'Basketball', 'Tennis', 'Golf', 'More', 'Opinion', 'Life', 'Business', 'Tech', 'Sport', 'Videos', 'Podcasts', 'Multimedia', 'SPH Websites', 'news with benefits', 'SPH Rewards', 'STJobs', 'STCars', 'STProperty', 'STClassifieds', 'SITES', 'Berita Harian', 'Hardwarezone', 'Lianhe Wanbao', 'STOMP', '

### 3) Saving the extracted information into DataFrame
Using "a_list" above as an example, convert it into a DataFrame. A DataFrame can then be saved in another format, for example CSV or Excel.

The cell below shows the conversion using the pandas.DataFrame() function, with "a_list" saved into "df" under the column name "Tag-a".

In [15]:
import pandas as pd
df = pd.DataFrame()
df['Tag-a'] = a_list

df.head(8)         # display the first 8 rows

| | Tag-a |
|---|---|
| 0 | Skip to main content |
| 1 | The Straits Times |
| 2 | The Straits Times |
| 3 | International |
| 4 | Singapore |
| 5 | E-paper |
| 6 | Home |
| 7 | Singapore |
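
Saving the DataFrame to CSV, as mentioned above, could look roughly like this. The sketch uses a short made-up list (`sample_links`) and keeps the CSV in memory as a string so that nothing is written to disk.

```python
import pandas as pd

# Short sample standing in for the full list of "a" tag texts.
sample_links = ["Skip to main content", "The Straits Times", "International"]
sample_df = pd.DataFrame({"Tag-a": sample_links})

# With no path given, to_csv returns the CSV text; index=False drops the row numbers.
csv_text = sample_df.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument writes the same content to a file instead. For Excel output, `to_excel` works in a similar way but needs an Excel engine such as openpyxl to be installed.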


### *** Adding a Complete Program
Putting all the functions above into one complete program displays all the results at once in text format. The "tag_a" and "tag_span" lists are filled accordingly and converted into a DataFrame in the next cell.

In [16]:
# Import libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

# Assigning website link to "url".
# url = input("Please Enter the website address: ")     # Option for user to input a link.
url = "https://www.straitstimes.com/global"
connection2 = requests.get(url, headers = {"user-agent" : "Mozilla/80.0"})  # Requests

# Connection status check: raise an error for 4xx/5xx responses instead of continuing.
connection2.raise_for_status()
    
# Applying BeautifulSoup
soup = BeautifulSoup(connection2.content, 'html.parser') 
# print(soup)

# Extracting the Header (h1) and Sub-header (h2) with for-loop.
many_link1=soup.find_all(['h1','h2'])
total_links=len(many_link1) 
print("\nTotal Header and Sub-header in this website : ", total_links)

for a in many_link1:          # print both h1 and h2 in text
    print(a.get_text())


# Create a list "tag_a" to hold the retrieved information for the DataFrame.
tag_a = []
many_link2=soup.find_all('a', href=True)
total_links=len(many_link2) 
print("\nTotal 'a'-tag in this website (displaying 8 rows): ", total_links)

for b in many_link2[:8]:         # print the first 8 rows of "a" tag in text
    print(b.get_text())
    tag_a.append(b.get_text())   # Adding the information into "list".
print(tag_a)    
    
    
tag_span =[]   
many_link3=soup.find_all('span')
total_links=len(many_link3) 
print("\nTotal SPAN tag in this website (displaying 8 rows): ", total_links)

for c in many_link3[:8]:        # print the first 8 rows of "span" tag in text
    print(c.get_text())
    tag_span.append(c.get_text())
print(tag_span)  


# Additional check: count the links and images on the web-page.
# Counting number of links
link_count = len(soup.find_all('a', href=True))
print("\nLinks: ", link_count)

# Counting images (a separate counter, so the link total is not added to the image total)
image_count = len(soup.find_all('img'))
print("Images: ", image_count)



Total Header and Sub-header in this website :  14
The Straits Times

            Top Stories          

            top picks          

            covid-19          

            For Subscribers          

            VIEWS          

            Asian Insider          

            DISCOVER          

            Videos          

            GZERO MEDIA          

            PODCASTS          

            MULTIMEDIA          

            MOST POPULAR          

            Branded Content          

Total 'a'-tag in this website (displaying 8 rows):  209
Skip to main content
The Straits Times
The Straits Times
International
Singapore
E-paper
Home
Singapore
['Skip to main content', 'The Straits Times', 'The Straits Times', 'International', 'Singapore', 'E-paper', 'Home', 'Singapore']

Total SPAN tag in this website (displaying 8 rows):  97
Toggle navigation



The Straits Times
The Straits Times
Toggle navigation

['Toggle navigation', '', '', '', 'The Straits Times', 'The Strai
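
The link and image counts in the program can also be obtained in a single pass over all tags. The following is only a sketch on a made-up HTML snippet (`sample_html` is an assumption), using `Counter` from the standard library.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical page fragment with three anchors (one without href) and one image.
sample_html = """
<div>
  <a href="/">Home</a>
  <a href="/singapore">Singapore</a>
  <a>Anchor without href</a>
  <img src="logo.png" alt="logo">
  <span>The Straits Times</span>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all(True) matches every tag, so Counter tallies each tag name once per occurrence.
tag_counts = Counter(tag.name for tag in sample_soup.find_all(True))

link_count = len(sample_soup.find_all("a", href=True))   # anchors that carry a link
image_count = len(sample_soup.find_all("img"))

print(tag_counts)
print("Links:", link_count, "Images:", image_count)
```

In the sample, the "a" tally is larger than the link count because one anchor has no "href". A similar effect would be consistent with the 215 "a" tags found earlier versus the 209 found with `href=True` in the program above.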

### Saving the Information into a DataFrame
Create a DataFrame (df1) with the columns Tag-a and Tag-Span holding the information.

In [17]:
import pandas as pd
df1 = pd.DataFrame({'Tag-a': tag_a,'Tag-Span': tag_span})     # Create Dataframe
df1

| | Tag-a | Tag-Span |
|---|---|---|
| 0 | Skip to main content | Toggle navigation |
| 1 | The Straits Times | |
| 2 | The Straits Times | |
| 3 | International | |
| 4 | Singapore | The Straits Times |
| 5 | E-paper | The Straits Times |
| 6 | Home | Toggle navigation |
| 7 | Singapore | |
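
The DataFrame above works because both lists hold exactly 8 items. If the lists had different lengths, building the DataFrame from plain lists would raise a ValueError. One possible workaround, sketched here with made-up sample lists, is to wrap each list in a pandas Series so that the shorter column is padded with missing values.

```python
import pandas as pd

# Hypothetical lists of unequal length.
sample_a = ["Skip to main content", "The Straits Times", "International"]
sample_span = ["Toggle navigation", "The Straits Times"]

# Series are aligned on their index, so the shorter column gets a missing value.
padded_df = pd.DataFrame({
    "Tag-a": pd.Series(sample_a),
    "Tag-Span": pd.Series(sample_span),
})

print(padded_df)
```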
