# Web Scraping

When performing data science tasks, it's common to want to use data found on the internet. You can usually access this data in CSV format or through an Application Programming Interface (API). Sometimes, however, the data you want is only available as part of a web page. In such cases, you'll need a technique called web scraping to get the data from the web page into a format you can work with in your analysis.

## BeautifulSoup
Beautiful Soup is a Python library for extracting data from HTML and XML files. We can use it to parse a document and extract the text from its tags.

You can install it with `pip` using the following command:

```bash
pip install beautifulsoup4
```

For the full documentation, refer to: https://www.crummy.com/software/BeautifulSoup/bs4/doc/




## Initializing BeautifulSoup

To parse a document and extract text from its tags (for example, a `p` tag), we first import the library. Later, we create an instance of the `BeautifulSoup` class to parse our document:

```python
from bs4 import BeautifulSoup
```

Because we are going to download content from a website, we also need the `requests` package.

```python
import requests
```

We can then load the page and parse its content using BeautifulSoup, as shown below.

```python
page_response = requests.get(url, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")
```

Putting these steps together for a news article:

```python
from bs4 import BeautifulSoup
import requests

url = 'https://www.thestar.com.my/news/nation/2018/08/31/prices-of-cigarettes-to-go-up-after-sales-tax-fixed-at-10-pc/'
# Fetch the content from the URL
page_response = requests.get(url, timeout=5)
# Parse the HTML
page_content = BeautifulSoup(page_response.content, "html.parser")
```
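To see the parsing step in isolation, without a network request, here is a minimal sketch that parses a small inline HTML string. The HTML snippet and the names `sample_html`, `sample_soup`, and `paragraph_text` are made up for illustration and are not part of the article page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
sample_html = "<html><body><p>Hello, web scraping!</p></body></html>"

# Parse the string with Python's built-in HTML parser
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; get_text() returns its text
paragraph_text = sample_soup.find("p").get_text()
print(paragraph_text)
```

Trying selectors on a small string like this is a quick way to check your understanding before running them against a live page.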

We will use CSS selectors for this task. Some examples:
- `p a` — finds all a tags inside of a p tag.
- `body p a` — finds all a tags inside of a p tag inside of a body tag.
- `html body` — finds all body tags inside of an html tag.
- `p.outer-text` — finds all p tags with a class of outer-text.
- `p#first` — finds all p tags with an id of first.
- `body p.outer-text` — finds any p tags with a class of outer-text inside of a body tag.
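The selectors listed above can be tried out with the `select` method. The sketch below runs them against a small, made-up HTML document; the markup and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical document containing the tags, class, and id used in the examples
demo_html = """
<html>
  <body>
    <p id="first" class="outer-text">First paragraph with a <a href="/one">link</a>.</p>
    <p class="outer-text">Second paragraph, no link.</p>
    <p>Third paragraph with <a href="/two">another link</a>.</p>
  </body>
</html>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# select() always returns a list of matching tags (possibly empty)
links_in_paragraphs = demo_soup.select("p a")          # a tags inside p tags
links_in_body = demo_soup.select("body p a")           # same, restricted to body
bodies = demo_soup.select("html body")                 # body tags inside html
outer_paragraphs = demo_soup.select("p.outer-text")    # p tags with class outer-text
first_paragraph = demo_soup.select("p#first")          # p tags with id first
outer_in_body = demo_soup.select("body p.outer-text")  # class outer-text, inside body

link_targets = [a["href"] for a in links_in_paragraphs]
print(link_targets)
print(len(outer_paragraphs), len(first_paragraph))
```

Because `select` returns a list even when there is a single match, you index into the result (for example `first_paragraph[0]`) to get the tag itself.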

Next, we will extract content from the page. Following your instructor's lead and applying your knowledge of HTML and CSS, inspect the page. You will find that:

- The title is stored inside an `h1` tag.
- The image is stored in an `img` tag inside a `div` with class `story-image`.
- The content is stored in multiple `p` tags inside a `div` with class `story`.

So, for instance, to extract the title, we need to extract the text from the `h1` tag.

```python
titles = page_content.select('h1')
title = titles[0].text.strip()
```
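Indexing with `titles[0]` raises an `IndexError` if the page has no `h1` tag. As a hedged alternative, the sketch below uses `select_one`, which returns the first match or `None`. The inline HTML and the names `heading_soup`, `heading`, `headline`, and `missing` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with a single heading and surrounding whitespace
heading_soup = BeautifulSoup(
    "<html><body><h1>  Example headline  </h1></body></html>", "html.parser"
)

# select_one() returns the first matching tag, or None when nothing matches
heading = heading_soup.select_one("h1")
headline = heading.get_text(strip=True) if heading is not None else None

# A selector with no match gives None instead of raising an error
missing = heading_soup.select_one("h2")

print(headline)
print(missing)
```

This pattern is useful when scraping many pages, where some may not follow the expected layout.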

You can complete the rest of the tasks using the following hints:
- Extract the `src` attribute of the image.
- Loop through the `p` tags and combine their text into a single string.

```python
titles = page_content.select('h1')
title = titles[0].text.strip()

title
```

```text
'Prices of cigarettes to go up after sales tax fixed at 10%'
```

```python
# Get the title of the article
titles = page_content.select('h1')
title = titles[0].text.strip()

# Get the image
imgs = page_content.select('div.story-image img')
# We already know the selector matches a single element, so we can take the first item by index.
img = imgs[0]['src']

# Get the text content
nodes = page_content.select('div.story p')
# Inspecting nodes shows one <p> tag per paragraph, so we join their text together.
content = ''
for node in nodes:
    content = content + node.text

news = {
    'title': title,
    'image': img,
    'content': content
}

news
```

```text
{'title': 'Prices of cigarettes to go up after sales tax fixed at 10%',
 'image': '/~/media/online/2018/08/16/03/25/cigarettes.ashx/?w=620&h=413&crop=1&hash=555F31920A04C7C3BA0C89B58D414A308066E5E4',
 'content': 'PETALING JAYA: Sales and Service Tax (SST) for cigarettes has been gazetted at 10% as of Aug 29, and this will cause the prices of cigarette to increase.British American Tobacco (M) Bhd (BAT) managing director Erik Stoel said the company is concerned about the impact that the 10% SST will have on the cigarette industry, given the high incidence of illegal cigarettes.He urged the Government to re-consider an SST increase on tobacco products "in light of these high levels of illegal cigarette trade and the persistent pressure on disposable income for the average Malaysian consumer"."For the tobacco industry, a SST of 10% is higher than the previous 6% GST (goods and service tax). "Furthermore, this implies a double taxation as the SST will be levied inclusive of the high levels 
```
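Two details in the output above may be worth refining: the paragraphs run together without a separator (for example `increase.British`), and the image `src` is a relative path. The following is a minimal sketch of one possible refinement, not the approach used in this tutorial. It uses `str.join` and `urllib.parse.urljoin` from the standard library on made-up markup; `base_url`, `story_html`, and the other names are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and markup that mimic the structure described above
base_url = "https://example.com/news/story/"
story_html = """
<div class="story-image"><img src="/media/photo.jpg"></div>
<div class="story"><p>First paragraph.</p><p>Second paragraph.</p></div>
"""
story_soup = BeautifulSoup(story_html, "html.parser")

# Join the paragraph texts with a blank line so they do not run together
paragraphs = [p.get_text(strip=True) for p in story_soup.select("div.story p")]
joined_content = "\n\n".join(paragraphs)

# Resolve the relative src against the page URL to get an absolute URL
relative_src = story_soup.select("div.story-image img")[0]["src"]
image_url = urljoin(base_url, relative_src)

print(joined_content)
print(image_url)
```

Any separator works in place of `"\n\n"`; a single space keeps the text on one line while still separating sentences.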

---
Finally, you can combine the extracted data into a single `news` dictionary.

```python
news = {
    'title': title,
    'image': img,
    'content': content
}
```
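If you want to reuse these steps on several pages, one option is to wrap them in a function. The sketch below is an assumption about how that might look, not part of the original exercise: `extract_news` is a hypothetical helper, and it assumes every page uses the same `h1`, `div.story-image img`, and `div.story p` structure.

```python
from bs4 import BeautifulSoup


def extract_news(soup):
    """Return a dict with the title, image src, and text content of a parsed page."""
    heading = soup.select_one("h1")
    image = soup.select_one("div.story-image img")
    paragraphs = soup.select("div.story p")
    return {
        # Fall back to None when an element is missing from the page
        "title": heading.get_text(strip=True) if heading is not None else None,
        "image": image.get("src") if image is not None else None,
        # Separate paragraphs with a space so sentences do not run together
        "content": " ".join(p.get_text(strip=True) for p in paragraphs),
    }


# Hypothetical markup used only to exercise the function
sample_page = BeautifulSoup(
    '<h1> Headline </h1>'
    '<div class="story-image"><img src="/photo.jpg"></div>'
    '<div class="story"><p>One.</p><p>Two.</p></div>',
    "html.parser",
)
sample_news = extract_news(sample_page)
empty_news = extract_news(BeautifulSoup("", "html.parser"))
print(sample_news)
```

With the page loaded earlier, the call would be `extract_news(page_content)`.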


