<a id="ref0"></a>

<h2 id="http">Overview of HTTP</h2>

When you open a web page, your browser (the **client**) sends an **HTTP** request to the **server** that hosts the page. The server then tries to find the requested **resource**, such as the home page (index.html).

If the request succeeds, the server returns the resource to the client in an **HTTP response**, which also carries metadata such as the type and length of the **resource**.

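To make the shape of a response concrete, here is a minimal sketch in base R that picks apart a hand-written response. The response text, its header values, and the variable names are all assumptions made up for illustration, not output from a real server.

```r
# Hypothetical raw HTTP response, written by hand for illustration
raw_response <- paste(
  "HTTP/1.1 200 OK",
  "Content-Type: text/html; charset=UTF-8",
  "Content-Length: 1256",
  "",
  "<html>...</html>",
  sep = "\r\n"
)

# The header section and the body are separated by the first blank line
parts <- strsplit(raw_response, "\r\n\r\n", fixed = TRUE)[[1]]
head_lines <- strsplit(parts[1], "\r\n", fixed = TRUE)[[1]]

# The first line is the status line: protocol version, status code, reason
status_line <- head_lines[1]
status_code <- as.integer(strsplit(status_line, " ", fixed = TRUE)[[1]][2])

# Turn the remaining "Name: value" lines into a named character vector
header_lines <- head_lines[-1]
headers <- sub("^[^:]+: *", "", header_lines)
names(headers) <- tolower(sub(":.*$", "", header_lines))

print(status_code)
print(headers[["content-type"]])
print(headers[["content-length"]])
```

In practice `httr` does this parsing for you; the sketch only shows where the status code, resource type, and resource length sit in a response.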
<p>
The figure below illustrates the process: the circle on the left represents the client, and the circle on the right represents the web server. The table under the web server lists the resources stored on it: in this case, an <code>HTML</code> file, a <code>png</code> image, and a <code>txt</code> file.
</p>
<p>
<b>HTTP</b> lets you send and receive information over the web, including web pages, images, and other web resources.
</p>



<center>
    <img src="https://fpt.edu.vn/Resources/brand/uploads/749540_132829686029858301_o.jpg" width="500" alt="cognitiveclass.ai logo"  />
</center>

# Lab 1: Web Scraping

<br>

#### Class name: ______________________

#### Student code: ______________________

#### Student name: ______________________

<br>

## Objectives

After completing this lab you will be able to:

* Understand HTML via coding practice
* Handle HTTP requests and responses using R
* Perform basic web scraping using rvest


Estimated time needed: **60** minutes
<h4 style='color:red; font-weight:bold'>DO NOT CHEAT! Anyone who copies or shares code will receive 1 point.</h4>

<h2 id="#httr">The httr library</h2>

`httr` is an R library that lets you build and send <code>HTTP</code> requests and process <code>HTTP</code> responses easily. We can load the package as follows (loading should take less than a minute):

In [1]:
# This lab requires some packages. If an error occurs when running it, uncomment the lines below to install them:
# install.packages("httr", type = "binary")
# install.packages("rvest", type="binary")


In [2]:
library(httr)
library(rvest)

Loading required package: xml2
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2


## 1. Example code

In [3]:
url <- 'https://fap.fpt.edu.vn/'
response <- GET(url)

print(sprintf("Time: %s", response$date))
print(sprintf("URL link: %s", response$url))
print(sprintf("Status code: %d", response$status_code))

[1] "Time: 2024-05-20 05:29:32"
[1] "URL link: https://fap.fpt.edu.vn/"
[1] "Status code: 200"

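The status code printed above tells you whether the request succeeded, so it is worth checking before parsing a page. The following is a minimal sketch of that check; the helpers `status_class` and `fetch_page` and the 10-second timeout are assumptions introduced for illustration, while `GET`, `timeout`, `status_code`, `http_error`, and `content` are functions from `httr`.

```r
# Hypothetical helper: name the class of an HTTP status code from its first digit
status_class <- function(code) {
  switch(as.character(code %/% 100),
    "1" = "informational",
    "2" = "success",
    "3" = "redirection",
    "4" = "client error",
    "5" = "server error",
    "unknown"
  )
}

# Hypothetical helper: download a page and stop early if the request failed
fetch_page <- function(url) {
  response <- httr::GET(url, httr::timeout(10))  # 10 s is an arbitrary choice
  code <- httr::status_code(response)
  if (httr::http_error(response)) {
    stop(sprintf("Request to %s failed: %d (%s)", url, code, status_class(code)))
  }
  # Return the body as text so it can be passed on to read_html()
  httr::content(response, as = "text", encoding = "UTF-8")
}

print(status_class(200))
print(status_class(404))
```

Calling `fetch_page(url)` with the URL from the cell above would return the page body as text, or stop with an error message when the server reports a 4xx or 5xx status.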

In [4]:
root <- read_html(response)
options_node <- html_nodes(root, "option")
values <- c()
print("List of FPT University campus: ")
for(node in options_node){
    v <- as.integer(html_attr(node, "value"))
    if(!is.na(v) && !(v %in% values)){
        values<- c(values, v)
        print(html_text(node))
    }
}

[1] "List of FPT University campus: "
[1] "FU-Hòa Lạc"
[1] "FU-Hồ Chí Minh"
[1] "FU-Đà Nẵng"
[1] "FU-Cần Thơ"
[1] "FU-Quy Nhơn"

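The loop above can also be written without an explicit loop, because `html_attr()` and `html_text()` accept a whole node set. Here is a sketch of that variant on a small hypothetical HTML fragment (the option values and campus names are made up), so it runs without a network connection.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded page
html <- '<select>
  <option value="">Select campus</option>
  <option value="3">Campus A</option>
  <option value="4">Campus B</option>
  <option value="3">Campus A</option>
</select>'

page <- read_html(html)
options_node <- html_nodes(page, "option")

# Extract every value and label in one call each
campuses <- data.frame(
  value = suppressWarnings(as.integer(html_attr(options_node, "value"))),
  name  = html_text(options_node),
  stringsAsFactors = FALSE
)

# Drop the placeholder option (no numeric value) and any repeated values
campuses <- campuses[!is.na(campuses$value) & !duplicated(campuses$value), ]
print(campuses$name)
```

Keeping the values and labels together in a data frame also makes it easier to reuse them later than printing inside a loop.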

## 2. Data source
Adapt the example code above by changing the URL to one of the following sources:

* https://webtygia.com/

* https://giavang.org/

* https://tygiadola.net/giavang/gia-vang-hom-nay

* https://nongnghiep.vn/bang-gia-vang-sjc-9999-24k-18k-14k-10k-hom-nay-24-10-2022-d335344.html

or any other URL that you can find!


## 3. Tasks

#### 3.1 Getting the data

Use web scraping to collect SJC gold prices in major cities and provinces of Vietnam. The data should contain more than 10 records. Display the data in a table.

In [5]:
# Enter code here
page <- GET("https://giavang.org/khu-vuc/")
page_content <- content(page, "text")

print(sprintf("Time: %s", page$date))
print(sprintf("URL link: %s", page$url))
print(sprintf("Status code: %d", page$status_code))



[1] "Time: 2024-05-20 05:29:32"
[1] "URL link: https://giavang.org/khu-vuc/"
[1] "Status code: 200"


In [12]:
# Parse the downloaded HTML and select the first <table> on the page
webpage <- read_html(page_content)
table_node <- html_node(webpage, "table")

# Convert the table node to a data frame
table_content <- html_table(table_node, fill = TRUE)

# Display the table
table_content


| Khu vực         | Mua vào      | Bán ra       |
| --------------- | ------------ | ------------ |
| TP. Hồ Chí Minh | 88.40089.400 | 90.20090.800 |
| Biên Hòa        | 84.600       | 86.800       |
| Hà Nội          | 87.80089.400 | 90.10090.700 |
| Đà Nẵng         | 88.50088.700 | 90.20090.500 |
| Miền Tây        | 89.100       | 90.800       |
| Tây Nguyên      | 88.500       | 90.500       |
| Đông Nam Bộ     | 88.40088.500 | 90.50090.600 |
| Bắc Ninh        | 89.400       | 90.700       |
| Hải Dương       | 89.400       | 90.700       |
| Bến Tre         | 89.400       | 90.600       |

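Several cells in the table above contain two prices run together, such as `88.40089.400`. Below is a minimal sketch of one way to separate them, assuming every price consists of digits, a dot, and exactly three more digits. The helper name `split_prices` is hypothetical, and what the two values in a cell stand for is not stated in this lab, so choosing the first one is only an assumption.

```r
# Hypothetical helper: return every "dd.ddd"-style price found in each cell
split_prices <- function(cells) {
  matches <- regmatches(cells, gregexpr("[0-9]+\\.[0-9]{3}", cells))
  lapply(matches, as.numeric)
}

# Sample cells copied from the table output
cells <- c("88.40089.400", "84.600")
prices <- split_prices(cells)
print(prices)

# Keep only the first price of each cell (an arbitrary choice for this sketch)
first_prices <- vapply(prices, function(p) p[1], numeric(1))
print(first_prices)
```

Note that `as.numeric()` reads the dot as a decimal point, so `88.400` becomes `88.4`. The relative order of the prices is unaffected, which is all the comparison tasks need.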

#### 3.2 Which province has the highest gold selling price?

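In [7]:
# Some cells hold two prices run together (e.g. "88.40089.400"), which as.numeric() turns into NA.
# Assumption: every price has the form "88.400" and the first price in a cell is the one to keep.
first_price <- function(x) as.numeric(sub("^([0-9]+\\.[0-9]{3}).*$", "\\1", x))
table_content[[2]] <- first_price(table_content[[2]])
table_content[[3]] <- first_price(table_content[[3]])

In [8]:
# Highest selling price and the region(s) that offer it
highest_selling <- max(table_content[[3]], na.rm = TRUE)
highest_selling
table_content[[1]][which(table_content[[3]] == highest_selling)]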

#### 3.3 Which provinces have the biggest difference in selling and buying prices?

In [9]:
# Enter code here

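One possible approach is to compute the spread per row and keep the rows where it is largest. The sketch below runs on a small hypothetical data frame: the region names and prices are made up, and the columns `buy` and `sell` stand in for "Mua vào" and "Bán ra".

```r
# Hypothetical sample data (not scraped values)
prices <- data.frame(
  region = c("Region A", "Region B", "Region C"),
  buy    = c(88.4, 84.6, 89.1),
  sell   = c(90.2, 86.8, 90.8),
  stringsAsFactors = FALSE
)

# Spread between selling and buying price for each region
prices$spread <- prices$sell - prices$buy

# Keep every region that ties for the largest spread
widest <- prices[prices$spread == max(prices$spread, na.rm = TRUE), ]
print(widest)
```

Comparing against `max()` rather than using `which.max()` keeps all regions when several share the same largest spread.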

#### 3.4 Find all provinces whose selling price is below average

In [10]:
# Enter code here

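A minimal sketch of the filtering step, again on hypothetical data (the region names and selling prices are made up for illustration): compute the mean selling price, then keep the rows below it.

```r
# Hypothetical sample data (not scraped values)
prices <- data.frame(
  region = c("Region A", "Region B", "Region C", "Region D"),
  sell   = c(90.2, 86.8, 90.8, 89.0),
  stringsAsFactors = FALSE
)

# Average selling price across all regions, ignoring missing values
average_sell <- mean(prices$sell, na.rm = TRUE)

# Regions whose selling price is strictly below the average
below_average <- prices[prices$sell < average_sell, ]
print(average_sell)
print(below_average)
```

With scraped data, `na.rm = TRUE` matters because any cell that failed to convert to a number would otherwise make the mean `NA`.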

#### 3.5 Find the difference between the highest buying price and the lowest selling price across all provinces

In [11]:
# Enter code here

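This task reduces to one `max()`, one `min()`, and a subtraction. A sketch on hypothetical data follows; the prices are made up, and `buy` and `sell` stand in for the "Mua vào" and "Bán ra" columns.

```r
# Hypothetical sample data (not scraped values)
prices <- data.frame(
  region = c("Region A", "Region B", "Region C"),
  buy    = c(88.4, 84.6, 89.1),
  sell   = c(90.2, 86.8, 90.8),
  stringsAsFactors = FALSE
)

# Highest buying price and lowest selling price over all regions
highest_buy <- max(prices$buy, na.rm = TRUE)
lowest_sell <- min(prices$sell, na.rm = TRUE)

# Positive when the best buying price exceeds the cheapest selling price
gap <- highest_buy - lowest_sell
print(gap)
```

The two extremes usually come from different regions, so they are computed independently rather than row by row.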

## Author

#### <a href="" target="_blank"></a>

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description                 |
| ----------------- | ------- | ---------- | ---------------------------------- |
| 2024-01-10        | 2.1     |            | Created version 2.1                |
|                   |         |            |                                    |
|                   |         |            |                                    |

<hr>

## <h3 align="center"> © FPT University. All rights reserved. </h3>
