# Basic Web Scraping

This notebook demonstrates techniques for data mining (web scraping), covering:
- Installation
- Creating a User-Agent and requesting a URL
- Creating the "soup of information"
- Determining the tags of the objects to extract
- Extracting product information
- Comparing results across multiple web pages


## Preparation

Approach:

- First, we import the required libraries.
- Then we take the URL stored in our text file.
- We fetch the page at that URL and pass its content to our soup object, which extracts the relevant information based on the element ids we provide.
- Finally, we save the extracted information to our CSV file.

### 1. Install

In [1]:
#!pip install beautifulsoup4
## BeautifulSoup: our primary module; it parses the HTML of a webpage into searchable Python objects.

#!pip install lxml
## lxml: a fast HTML/XML parser that BeautifulSoup can use as its backend.

#!pip install requests
## requests: simplifies sending HTTP requests and reading the responses.

### 2. Creating a User Agent and Sending a Request to a URL

Many user-agent strings are published online for the reader to choose from. The following is an example of a User-Agent set within the request headers.

In [2]:
from bs4 import BeautifulSoup
import requests

# Output file for the scraped results, opened in append mode (call File.close() when done)
File = open("out.csv", "a")

HEADERS = ({'User-Agent':
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
            'Accept-Language': 'en-US, en;q=0.5'})

#Sending a request to a URL

URL = "https://www.amazon.com/Sony-PlayStation-Pro-1TB-Console-4/dp/B07K14XKZH/"
webpage = requests.get(URL, headers=HEADERS)

A webpage is accessed by its URL (Uniform Resource Locator). With the help of the URL, we will send the request to the webpage for accessing its data.

The requested webpage features an Amazon product. Hence, our Python script focuses on extracting product details like “The Name of the Product”, “The Current Price” and so on.

Note: The request to the URL is sent via the "requests" library. If you get a “No module named requests” error, install the library with "pip install requests".
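
Before parsing, it can help to confirm what is being sent and that the response is usable. The sketch below is one possible way to do that, not part of the original notebook: `fetch` is a hypothetical helper name, and the 10-second timeout is an assumed value. The request is only prepared, not sent, so the headers can be inspected without touching the network.

```python
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
    "Accept-Language": "en-US, en;q=0.5",
}

URL = "https://www.amazon.com/Sony-PlayStation-Pro-1TB-Console-4/dp/B07K14XKZH/"


def fetch(url, headers=HEADERS, timeout=10):
    """Hypothetical helper: return the response body, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx statuses
    return response.content


# Build the request without sending it, to inspect what would be transmitted.
prepared = requests.Request("GET", URL, headers=HEADERS).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Calling `fetch(URL)` would perform the actual request; a blocked or failed request then surfaces as an exception instead of an error page being parsed silently.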

### 3. Creating a soup of information

The webpage variable contains the response received from the website. We pass the content of the response and the name of the parser to the BeautifulSoup constructor.

lxml is a high-speed parser that Beautiful Soup uses to break the HTML page down into a tree of Python objects.

In [3]:
#Creating a soup of information

soup = BeautifulSoup(webpage.content, "lxml")
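
To see what the resulting tree looks like without making a request, the following sketch parses a small, made-up HTML string. The markup is hypothetical and only stands in for `webpage.content`; it also uses Python's built-in "html.parser" so that it runs even where lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for webpage.content.
html = """
<html><body>
  <span id="productTitle">  Example Console 1TB  </span>
  <span class="a-icon-alt">4.5 out of 5 stars</span>
</body></html>
"""

# "html.parser" ships with Python; "lxml" could be passed instead if installed.
soup = BeautifulSoup(html, "html.parser")

spans = soup.find_all("span")
print(len(spans))         # number of <span> tags in the tree
print(spans[0].name)      # tag name
print(spans[0].attrs)     # attributes as a dictionary
print(spans[1]["class"])  # class is multi-valued, so it comes back as a list
```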

### 4. Discovering the exact tags for Object Extraction

One of the most tedious parts of this project is unearthing the ids and tags that store the relevant information. We use a web browser's developer tools to accomplish this task.

We open the webpage in the browser, right-click the relevant element, and choose Inspect.

See presentation for example
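
Once the inspector has revealed an element's id or class, that element can be addressed either with find() or with a CSS selector. The sketch below compares the two on made-up markup; the ids `productTitle` and `availability` are the ones used later in this notebook, while the HTML itself is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the ids found with the browser inspector.
html = """
<div id="availability"><span> In Stock. </span></div>
<span id="productTitle"> Example Console 1TB </span>
"""
soup = BeautifulSoup(html, "html.parser")

# Lookup by tag name and attribute dictionary.
by_find = soup.find("span", attrs={"id": "productTitle"})

# The same element addressed with a CSS selector.
by_css = soup.select_one("span#productTitle")

# A descendant selector reaches the <span> nested inside the availability <div>.
nested = soup.select_one("#availability span")

print(by_find.string.strip())
print(by_css.string.strip())
print(nested.string.strip())
```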

### 5. Extracting the Product Title

Using the find() function, which searches for specific tags with specific attributes, we locate the Tag object containing the title of the product.

In [4]:
# Outer Tag Object
title = soup.find("span", attrs={"id":'productTitle'})

Then, we extract the inner NavigableString object.

In [5]:
# Inner NavigableString Object
title_value = title.string

And finally, we strip extra spaces and convert the object to a string value.

In [6]:
# Title as a string value
title_string = title_value.strip()

We can take a look at the type of each variable using the type() function.

In [7]:
# Printing types of values for efficient understanding
print(type(title))
print(type(title_value))
print(type(title_string))
print()

# Printing Product Title
print("Product Title = ", title_string)

<class 'bs4.element.Tag'>
<class 'bs4.element.NavigableString'>
<class 'str'>

Product Title =  Sony PlayStation 4 Pro 1TB Console - Black (PS4 Pro)
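
One caveat with `.string`: it is `None` whenever the tag has more than one child, in which case calling `.strip()` on it raises an AttributeError. The sketch below shows this on hypothetical markup with a nested `<b>` tag, and uses `get_text()` as one possible alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical title containing a nested tag, so the span has three children.
html = '<span id="productTitle"> Example <b>Console</b> 1TB </span>'
soup = BeautifulSoup(html, "html.parser")
title = soup.find("span", attrs={"id": "productTitle"})

# .string is None because the tag has several children.
print(title.string)

# get_text() collects every string in the tag; each piece is stripped
# and the pieces are joined with the given separator.
title_text = title.get_text(" ", strip=True)
print(title_text)
```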


### 6. Python Script to extract product information

The following Python script displays these details for a product:

- The Title of the Product
- The Price of the Product
- The Rating of the Product
- Number of Customer Reviews
- Product Availability

In [8]:
from bs4 import BeautifulSoup
import requests

# Function to extract Product Title
def get_title(soup):
	
	try:
		# Outer Tag Object
		title = soup.find("span", attrs={"id":'productTitle'})

		# Inner NavigableString Object
		title_value = title.string

		# Title as a string value
		title_string = title_value.strip()

		# # Printing types of values for efficient understanding
		# print(type(title))
		# print(type(title_value))
		# print(type(title_string))
		# print()

	except AttributeError:
		title_string = ""	

	return title_string

# Function to extract Product Price
def get_price(soup):

	try:
		price = soup.find("span", attrs={'id':'priceblock_ourprice'}).string.strip()

	except AttributeError:
		price = ""	

	return price

# Function to extract Product Rating
def get_rating(soup):

	try:
		rating = soup.find("i", attrs={'class':'a-icon a-icon-star a-star-4-5'}).string.strip()
		
	except AttributeError:
		
		try:
			rating = soup.find("span", attrs={'class':'a-icon-alt'}).string.strip()
		except AttributeError:
			rating = ""	

	return rating

# Function to extract Number of User Reviews
def get_review_count(soup):
	try:
		review_count = soup.find("span", attrs={'id':'acrCustomerReviewText'}).string.strip()
		
	except AttributeError:
		review_count = ""	

	return review_count

# Function to extract Availability Status
def get_availability(soup):
	try:
		available = soup.find("div", attrs={'id':'availability'})
		available = available.find("span").string.strip()

	except AttributeError:
		available = ""	

	return available	

if __name__ == '__main__':

	# Headers for request
	HEADERS = ({'User-Agent':
	            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
	            'Accept-Language': 'en-US, en;q=0.5'})

	# The webpage URL
	URL = "https://www.amazon.com/Sony-PlayStation-Pro-1TB-Console-4/dp/B07K14XKZH/"

	# HTTP Request
	webpage = requests.get(URL, headers=HEADERS)

	# Soup Object containing all data
	soup = BeautifulSoup(webpage.content, "lxml")

	# Function calls to display all necessary product information
	print("Product Title =", get_title(soup))
	print("Product Price =", get_price(soup))
	print("Product Rating =", get_rating(soup))
	print("Number of Product Reviews =", get_review_count(soup))
	print("Availability =", get_availability(soup))
	print()
	print()

Product Title = Sony PlayStation 4 Pro 1TB Console - Black (PS4 Pro)
Product Price = $1,499.97
Product Rating = 4.6 out of 5 stars
Number of Product Reviews = 4,168 ratings
Availability = 
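
The values printed above are display strings. To compare prices or ratings numerically they would first need converting to numbers. The following is a minimal sketch using only the standard library; the helper names `parse_price`, `parse_rating` and `parse_count` are hypothetical, and the parsing assumes US-style formatting as in the output above.

```python
import re


def parse_price(text):
    """Convert a string such as '$1,499.97' to a float, or None if empty."""
    cleaned = re.sub(r"[^\d.]", "", text)  # drop currency sign and commas
    return float(cleaned) if cleaned else None


def parse_rating(text):
    """Convert '4.6 out of 5 stars' to 4.6, or None if no number is found."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def parse_count(text):
    """Convert '4,168 ratings' to 4168, or None if no digits are found."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


print(parse_price("$1,499.97"))
print(parse_rating("4.6 out of 5 stars"))
print(parse_count("4,168 ratings"))
print(parse_price(""))
```

Returning `None` for empty strings keeps missing values distinguishable from a genuine price of zero.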




### 7. Fetching links from an Amazon search result webpage

Previously, we obtained information about a single PlayStation 4 product. It would be useful to extract the same information for multiple PlayStation listings so that prices and ratings can be compared.

A link is stored as the value of the href attribute of an <a></a> tag.

Instead of fetching a single link, we can extract all the similar links using the find_all() function.

In [9]:
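# Fetch links as a list of Tag objects.
# Note: soup must be built from a search results page (such as the URL used in section 8),
# not from the single product page parsed earlier.
links = soup.find_all("a", attrs={'class':'a-link-normal s-no-outline'})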

The find_all() function returns a list-like ResultSet of Tag objects. We iterate over it and read the link stored in each tag's href attribute.

In [10]:
# Store the links
links_list = []

# Loop for extracting links from Tag Objects
for link in links:
	links_list.append(link.get('href'))
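
The collected hrefs may be relative paths, may repeat, and `link.get('href')` returns `None` for an anchor without an href. One way to clean the list before requesting each page is sketched below with the standard library's `urljoin`; the helper name `absolute_links` and the sample hrefs are assumptions for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"


def absolute_links(hrefs, base=BASE_URL):
    """Resolve hrefs against base, skipping missing values and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # None or empty string: nothing to request
            continue
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical hrefs: a relative path, a missing value, a duplicate, another path.
hrefs = [
    "/dp/B07K14XKZH/",
    None,
    "https://www.amazon.com/dp/B07K14XKZH/",
    "/dp/EXAMPLE123/",
]
clean_links = absolute_links(hrefs)
print(clean_links)
```

Unlike plain string concatenation, `urljoin` leaves already absolute links unchanged.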

In [13]:
# Loop for extracting product details from each link 
for link in links_list:
    new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS)
    new_soup = BeautifulSoup(new_webpage.content, "lxml")

    print("Product Title =", get_title(new_soup))
    print("Product Price =", get_price(new_soup))
    print("Product Rating =", get_rating(new_soup))
    print("Number of Product Reviews =", get_review_count(new_soup))
    print("Availability =", get_availability(new_soup))

Product Title = Madden NFL 23 – PlayStation 4
Product Price = $39.88
Product Rating = 4.6 out of 5 stars
Number of Product Reviews = 693 ratings
Availability = In Stock
Product Title = DualShock 4 Wireless Controller for PlayStation 4 - Jet Black
Product Price = $58.75
Product Rating = 4.7 out of 5 stars
Number of Product Reviews = 140,837 ratings
Availability = In Stock
Product Title = PlayStation 4 Slim 1TB Console
Product Price = 
Product Rating = 4.7 out of 5 stars
Number of Product Reviews = 15,164 ratings
Availability = 
Product Title = Hunting Simulator - PlayStation 4
Product Price = $16.71
Product Rating = 4.3 out of 5 stars
Number of Product Reviews = 939 ratings
Availability = In Stock
Product Title = BlueFire Professional 3.5mm PS4 Gaming Headset Headphone with Mic and LED Lights for Playstation 4, PS5, Xbox one,Laptop, Computer (Blue)
Product Price = 
Product Rating = 4.4 out of 5 stars
Number of Product Reviews = 8,261 ratings
Availability = In Stock.
Product Title = Hori
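
Records like those printed above could also be saved as CSV rows instead of only being printed. The sketch below uses the standard `csv` module; the field names and the `write_rows` helper are hypothetical, and an in-memory buffer stands in for a file such as out.csv. The sample row reuses values from the output above.

```python
import csv
import io

# Hypothetical column names for the five extracted fields.
FIELDS = ["title", "price", "rating", "review_count", "availability"]


def write_rows(handle, rows):
    """Write product dictionaries as CSV rows preceded by a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = [{
    "title": "DualShock 4 Wireless Controller for PlayStation 4 - Jet Black",
    "price": "$58.75",
    "rating": "4.7 out of 5 stars",
    "review_count": "140,837 ratings",
    "availability": "In Stock",
}]

# An in-memory buffer stands in for the file so the sketch has no side effects.
buffer = io.StringIO()
write_rows(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, the csv documentation recommends opening it with `newline=""`, for example `open("out.csv", "a", newline="", encoding="utf-8")`. In append mode the header should only be written when the file is new.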

### 8. Python Script to extract product details across multiple webpages


In [14]:
from bs4 import BeautifulSoup
import requests

# Function to extract Product Title
def get_title(soup):
	
	try:
		# Outer Tag Object
		title = soup.find("span", attrs={"id":'productTitle'})

		# Inner NavigableString Object
		title_value = title.string

		# Title as a string value
		title_string = title_value.strip()

		# # Printing types of values for efficient understanding
		# print(type(title))
		# print(type(title_value))
		# print(type(title_string))
		# print()

	except AttributeError:
		title_string = ""	

	return title_string

# Function to extract Product Price
def get_price(soup):

	try:
		price = soup.find("span", attrs={'id':'priceblock_ourprice'}).string.strip()

	except AttributeError:

		try:
			# If there is some deal price
			price = soup.find("span", attrs={'id':'priceblock_dealprice'}).string.strip()

		except AttributeError:
			price = ""	

	return price

# Function to extract Product Rating
def get_rating(soup):

	try:
		rating = soup.find("i", attrs={'class':'a-icon a-icon-star a-star-4-5'}).string.strip()
		
	except AttributeError:
		
		try:
			rating = soup.find("span", attrs={'class':'a-icon-alt'}).string.strip()
		except AttributeError:
			rating = ""	

	return rating

# Function to extract Number of User Reviews
def get_review_count(soup):
	try:
		review_count = soup.find("span", attrs={'id':'acrCustomerReviewText'}).string.strip()
		
	except AttributeError:
		review_count = ""	

	return review_count

# Function to extract Availability Status
def get_availability(soup):
	try:
		available = soup.find("div", attrs={'id':'availability'})
		available = available.find("span").string.strip()

	except AttributeError:
		available = "Not Available"	

	return available	


if __name__ == '__main__':

	# Headers for request
	HEADERS = ({'User-Agent':
	            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
	            'Accept-Language': 'en-US'})

	# The webpage URL
	URL = "https://www.amazon.com/s?k=playstation+4&ref=nb_sb_noss_2"
	
	# HTTP Request
	webpage = requests.get(URL, headers=HEADERS)

	# Soup Object containing all data
	soup = BeautifulSoup(webpage.content, "lxml")

	# Fetch links as List of Tag Objects
	links = soup.find_all("a", attrs={'class':'a-link-normal s-no-outline'})

	# Store the links
	links_list = []

	# Loop for extracting links from Tag Objects
	for link in links:
		links_list.append(link.get('href'))


	# Loop for extracting product details from each link 
	for link in links_list:

		new_webpage = requests.get("https://www.amazon.com" + link, headers=HEADERS)

		new_soup = BeautifulSoup(new_webpage.content, "lxml")
		
		# Function calls to display all necessary product information
		print("Product Title =", get_title(new_soup))
		print("Product Price =", get_price(new_soup))
		print("Product Rating =", get_rating(new_soup))
		print("Number of Product Reviews =", get_review_count(new_soup))
		print("Availability =", get_availability(new_soup))
		print()
		print()

Product Title = PlayStation 4 Slim 1TB Console
Product Price = 
Product Rating = 4.7 out of 5 stars
Number of Product Reviews = 15,164 ratings
Availability = 


Product Title = Newest Sony Playstation 4 Slim 1TB SSD Console - Marvel's Spider-Man PS4 Bundle with DualShock-4 Wireless Controller
Product Price = 
Product Rating = 4.6 out of 5 stars
Number of Product Reviews = 239 ratings
Availability = 


Product Title = FXH Wireless PS4 Controller for PS4/ Slim/Pro with Highly Sensitive Buttons /Dual Vibration/6-Axis Motion Sensor/Audio Jack PS4 Controller ,Compatible with Playstation-4-Black…
Product Price = 
Product Rating = 4.9 out of 5 stars
Number of Product Reviews = 1,278 ratings
Availability = Not Available


Product Title = PlayStation 4 Slim 1TB Console - Black (Renewed)
Product Price = $479.00
Product Rating = 4.6 out of 5 stars
Number of Product Reviews = 1,174 ratings
Availability = 


Product Title = PlayStation 4 Slim 1TB Console (Renewed)
Product Price = $424.99
Product Ra