# 1. Accessing the content of the website.

## 1.1. Utilizing the Requests library

First, we import the Requests library, which lets us send HTTP requests to the website.


In this notebook, I will scrape the results of the Indonesia Cup of Excellence for the year 2021.

In [1]:
# Import the library
import requests

# Assign the URL of the website to be scraped
url = 'https://allianceforcoffeeexcellence.org/indonesia-2021/'

In web scraping practice, we keep the website open in our browser and jump back and forth to the `url` to inspect the website's HTML elements (simply right-click on the page and then click "Inspect") and find the particular content we want to extract.


[Click here to open the 'url' variable.](https://allianceforcoffeeexcellence.org/indonesia-2021/)

In [2]:
# Send an HTTP GET request to the server
r = requests.get(url)

# Check the value and Python data type
print("The value of 'r' is:", r)
print("The data type of 'r' is:", type(r))

The value of 'r' is: <Response [200]>
The data type of 'r' is: <class 'requests.models.Response'>


We now have the server's response to our HTTP request, stored in the `r` variable.

Python recognizes it as a `Response` object.

The value `<Response [200]>` means the request succeeded: 200 is the HTTP status code for "OK", so the website is reachable and serving content.
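To make the meaning of the number in `<Response [200]>` concrete, here is a minimal sketch that looks up status codes with Python's standard library. The helper names `describe_status` and `is_success` are my own and are not part of Requests; with Requests, the same number is available as `r.status_code`.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the numeric code together with its standard reason phrase."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def is_success(code: int) -> bool:
    """Codes in the 2xx range mean the server handled the request successfully."""
    return 200 <= code < 300


# Compare a successful code with two common failure codes
for code in (200, 404, 503):
    outcome = "success" if is_success(code) else "failure"
    print(describe_status(code), "->", outcome)
```

In a scraper, a check like `is_success(r.status_code)` before parsing helps avoid parsing an error page by mistake.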

Next, we read the body of the response as a string through the `text` attribute.

In [3]:
# Get the body of the Response object as a string
r.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=0" /><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /><meta name="format-detection" content="telephone=no"> <style>\n        #wpadminbar #wp-admin-bar-p404_free_top_button .ab-icon:before {\n            content: "\\f103";\n            color:red;\n            top: 2px;\n        }\n    </style>\n<script type="text/javascript">var ajaxurl = "https://allianceforcoffeeexcellence.org/wp-admin/admin-ajax.php";</script><meta name="robots" content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" />\n<script>window._wca = window._wca || [];</script>\n<style id="critical-path-css" type="text/css">\n\t\t\tbody,html{width:100%;height:100%;margin:0;padding:0}.page-preloader{top:0;left:0;z-index:999;position:fixed;height:100%;width:100%;text-align:center}.preloader-prev

We now have the text content of the page, which is its raw HTML.

Note that the text is hard to read because it is one long string with very little spacing.



---



## 1.2. Utilizing the BeautifulSoup library

Next, we import the BeautifulSoup library to parse the HTML.

In [4]:
# Import the library
from bs4 import BeautifulSoup

# Create a parse tree from the raw HTML content
soup = BeautifulSoup(r.text, 'lxml')

The `soup` variable now holds a parse tree of the page, which is much easier to work with than the raw string we got from the Response object.

We can print it with the `prettify()` method, which turns the Beautiful Soup parse tree into nicely formatted text with each **HTML tag** on its own line.

Be careful! The output is long, so expect a lot of scrolling.

In [5]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <meta content="telephone=no" name="format-detection"/>
  <style>
   #wpadminbar #wp-admin-bar-p404_free_top_button .ab-icon:before {
            content: "\f103";
            color:red;
            top: 2px;
        }
  </style>
  <script type="text/javascript">
   var ajaxurl = "https://allianceforcoffeeexcellence.org/wp-admin/admin-ajax.php";
  </script>
  <meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
  <script>
   window._wca = window._wca || [];
  </script>
  <style id="critical-path-css" type="text/css">
   body,html{width:100%;height:100%;margin:0;padding:0}.page-preloader{top:0;left:0;z-index:999;position:fixed;height:100%;width:100%;text-align:center}.prelo
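If the full output is too long to scroll through, one option is to preview only the first few lines. The sketch below assumes a tiny made-up HTML string instead of the real page, and it uses Python's built-in `html.parser` rather than `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for illustration
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
demo_soup = BeautifulSoup(html, "html.parser")

# prettify() returns one string; split it into lines and keep the first few
pretty_lines = demo_soup.prettify().splitlines()
for line in pretty_lines[:4]:
    print(line)
```

The same slicing should work on the real `soup` variable, for example `soup.prettify().splitlines()[:20]`.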

I mentioned HTML tags earlier. They are the key elements for locating the target content on the website to be scraped.

One HTML tag found on almost every web page is `title`, for example:

In [6]:
# Get the "title" tag from the page's HTML
title = soup.title

# Check the Python data type
type(title)

bs4.element.Tag

As the output above shows, Python recognizes it as a `Tag` object.

In [7]:
# Get the value of the HTML tag
title

<title>Indonesia 2021 - Alliance For Coffee Excellence</title>

Notice that an HTML tag is written between angle brackets. The opening tag, `<title>`, holds the tag name between `<` and `>`, while the closing tag, `</title>`, has a forward slash `/` right after the opening angle bracket.

In [8]:
# Extract the string that is contained inside the HTML tag
title.text

'Indonesia 2021 - Alliance For Coffee Excellence'

Now we have the text content inside the Tag.
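To tie these pieces together, here is a small self-contained sketch that shows a Tag's name, its full markup, and its text side by side. It parses only the `title` element copied from the output above, and it assumes the built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Only the title element from the page, wrapped in a minimal document
html = (
    "<html><head>"
    "<title>Indonesia 2021 - Alliance For Coffee Excellence</title>"
    "</head></html>"
)
demo_soup = BeautifulSoup(html, "html.parser")
title_tag = demo_soup.title

print(title_tag.name)  # the tag name found between the angle brackets
print(str(title_tag))  # the full markup: opening tag, text, closing tag
print(title_tag.text)  # only the text between the opening and closing tags
```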



---



# 2. Navigating the content of the website.

In the previous chapter, we got an idea of how to parse HTML content using some basic functions and methods of the Beautiful Soup library.

Let's navigate the website.

## 2.1. Extracting the tables.

We want specific content: the tables containing the Indonesia Cup of Excellence 2021 information.

Using the inspect element tool in the web browser, we find that the tabs leading to the tables are listed in a `ul` Tag with the class name `vc_tta-tabs-list`.

In [9]:
# Get the `ul` Tag that lists the tabs of the tables
tabs = soup.select('ul.vc_tta-tabs-list')[0]
print(tabs.prettify())

<ul class="vc_tta-tabs-list">
 <li class="vc_tta-tab vc_active" data-vc-tab="">
  <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1513038058770-76a416aa-73df">
   <span class="vc_tta-title-text">
    COE Competition Results
   </span>
  </a>
 </li>
 <li class="vc_tta-tab" data-vc-tab="">
  <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1643312940613-53cf472d-7577">
   <span class="vc_tta-title-text">
    COE Auction Results
   </span>
  </a>
 </li>
 <li class="vc_tta-tab" data-vc-tab="">
  <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1621020840203-316b2877-44f1">
   <span class="vc_tta-title-text">
    NW Competition Results
   </span>
  </a>
 </li>
 <li class="vc_tta-tab" data-vc-tab="">
  <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1644012076250-c970f6cf-2425">
   <span class="vc_tta-title-text">
    NW Auction Results
   </span>
  </a>
 </li>
 <li class="vc_tta-tab" data-vc-tab="">
  <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1617922008283-

Before we navigate further into the `tabs` variable, let's review the `select()` method.

The method returns a `ResultSet`, which behaves like a Python list, so we need to index it to get the Tag inside.

Here is the comparison:



In [10]:
# Compare the data types with and without indexing
a = soup.select('ul.vc_tta-tabs-list')
tabs = soup.select('ul.vc_tta-tabs-list')[0]

print("The data type of select() method without indexing is:", type(a))
print("The data type of select() method with indexing is:", type(tabs))

The data type of select() method without indexing is: <class 'bs4.element.ResultSet'>
The data type of select() method with indexing is: <class 'bs4.element.Tag'>
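As a side note, Beautiful Soup also offers `select_one()`, which returns the first matching Tag directly and gives `None` when nothing matches. The sketch below compares it with `select()` on a made-up HTML fragment; the `#tab-1` and `#tab-2` links and the `no-such-class` selector are placeholders, and `html.parser` is used instead of `lxml` to keep the example dependency-free.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure of the tab list
html = """
<ul class="vc_tta-tabs-list">
  <li><a href="#tab-1">First</a></li>
  <li><a href="#tab-2">Second</a></li>
</ul>
"""
demo_soup = BeautifulSoup(html, "html.parser")

matches = demo_soup.select("ul.vc_tta-tabs-list")     # list-like ResultSet
first = demo_soup.select_one("ul.vc_tta-tabs-list")   # a single Tag
missing = demo_soup.select_one("ul.no-such-class")    # no match

print(type(matches), len(matches))
print(type(first))
print(missing)
```

Checking for `None` is a gentler way to handle a missing element than indexing an empty result with `[0]`.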


Now let's get back to the `tabs` variable.

Each table sits in its own tab on the same page as the `url` variable, and each tab is addressed by a link that starts with `#`.

Those links are found in the `a` Tags under the `tabs` variable.

To get them, we use the `find_all()` method.

In [27]:
# Get the 'a' Tags
tabs_links = tabs.find_all('a')
tabs_links

[<a data-vc-container=".vc_tta" data-vc-tabs="" href="#1513038058770-76a416aa-73df"><span class="vc_tta-title-text">COE Competition Results</span></a>,
 <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1643312940613-53cf472d-7577"><span class="vc_tta-title-text">COE Auction Results</span></a>,
 <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1621020840203-316b2877-44f1"><span class="vc_tta-title-text">NW Competition Results</span></a>,
 <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1644012076250-c970f6cf-2425"><span class="vc_tta-title-text">NW Auction Results</span></a>,
 <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1617922008283-0d0ddaee-ded3"><span class="vc_tta-title-text">International Jury</span></a>,
 <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1617922031187-d6ee3ac3-edbf"><span class="vc_tta-title-text">National Jury</span></a>,
 <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1528229985999-af37d072-45f5"><span class="vc_tta-title-tex

In [29]:
tabs_links[1].attrs

{'href': '#1643312940613-53cf472d-7577',
 'data-vc-tabs': '',
 'data-vc-container': '.vc_tta'}

Now we have a list containing the `a` Tags under the `tabs` variable.

Before we move on, let's review the `find_all()` and `find()` methods and the `attrs` attribute.

The `find_all()` method returns a `ResultSet` (which is simply a list of Tags) with every match under a particular Tag, while `find()` returns only the first Tag found.

Here is the comparison:

In [12]:
# Find all Tags
tabs_links = tabs.find_all('a')

# Find the first Tag only
find_link = tabs.find('a')

print("tabs_links result:", tabs_links)
print("The data type of tabs_links is:", type(tabs_links))
print('')
print("find_link result:", find_link)
print("The data type of find_link is:", type(find_link))

tabs_links result: [<a data-vc-container=".vc_tta" data-vc-tabs="" href="#1513038058770-76a416aa-73df"><span class="vc_tta-title-text">COE Competition Results</span></a>, <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1643312940613-53cf472d-7577"><span class="vc_tta-title-text">COE Auction Results</span></a>, <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1621020840203-316b2877-44f1"><span class="vc_tta-title-text">NW Competition Results</span></a>, <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1644012076250-c970f6cf-2425"><span class="vc_tta-title-text">NW Auction Results</span></a>, <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1617922008283-0d0ddaee-ded3"><span class="vc_tta-title-text">International Jury</span></a>, <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1617922031187-d6ee3ac3-edbf"><span class="vc_tta-title-text">National Jury</span></a>, <a data-vc-container=".vc_tta" data-vc-tabs="" href="#1528229985999-af37d072-45f5"><span class="vc_

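In practice, `find()` returns the same Tag as `find_all()[0]` whenever there is at least one match. When nothing matches, `find()` returns `None`, whereas `find_all()[0]` raises an `IndexError`.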
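The relationship between `find()` and `find_all()` can be sketched on a small made-up list of links, including the case where nothing matches. The `#one` and `#two` links are placeholders, and `html.parser` is assumed instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two links and no table
html = '<ul><li><a href="#one">One</a></li><li><a href="#two">Two</a></li></ul>'
demo_soup = BeautifulSoup(html, "html.parser")

all_links = demo_soup.find_all("a")   # every matching Tag
first_link = demo_soup.find("a")      # only the first matching Tag
print(first_link == all_links[0])

# When nothing matches, the two approaches behave differently
no_table = demo_soup.find("table")    # returns None
try:
    demo_soup.find_all("table")[0]    # indexing an empty result fails
    index_error_raised = False
except IndexError:
    index_error_raised = True

print(no_table, index_error_raised)
```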

The Tag stored in the `find_link` variable is more complex than the `title` Tag we tried as an example earlier.

A Tag may have any number of attributes, which we can check with its `.attrs` attribute. It returns them as a dictionary.

In [13]:
# Check the Tag attributes
find_link.attrs

{'href': '#1513038058770-76a416aa-73df',
 'data-vc-tabs': '',
 'data-vc-container': '.vc_tta'}
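Besides reading the whole `attrs` dictionary, a single attribute can be read with square brackets or with `get()`. The sketch below uses the first `a` Tag copied from the output above and assumes `html.parser` instead of `lxml`. The `title` attribute is just an example of an attribute this Tag does not have.

```python
from bs4 import BeautifulSoup

html = (
    '<a data-vc-container=".vc_tta" data-vc-tabs="" '
    'href="#1513038058770-76a416aa-73df">COE Competition Results</a>'
)
link = BeautifulSoup(html, "html.parser").a

print(link.attrs)         # all attributes as a dictionary
print(link["href"])       # dictionary-style access
print(link.get("href"))   # same value through get()

# For a missing attribute, get() returns None while [] raises KeyError
missing_value = link.get("title")
try:
    link["title"]
    key_error_raised = False
except KeyError:
    key_error_raised = True

print(missing_value, key_error_raised)
```

Using `get()` is the safer choice when looping over many Tags, since some of them may lack the attribute.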

Now we need to extract the `href` attribute, which holds the link to where each target table is located.

In [14]:
# Create a list containing the hyperlinks found in tabs_links variable
relative_links = [t.get("href") for t in tabs_links]
relative_links

['#1513038058770-76a416aa-73df',
 '#1643312940613-53cf472d-7577',
 '#1621020840203-316b2877-44f1',
 '#1644012076250-c970f6cf-2425',
 '#1617922008283-0d0ddaee-ded3',
 '#1617922031187-d6ee3ac3-edbf',
 '#1528229985999-af37d072-45f5',
 '#1623963496226-1c196f4d-5732']

As we can see, the returned strings are not complete URLs.

They are relative links. Each one starts with `#`, which makes it a fragment identifier pointing to a section of the same page rather than to a different site.

In this case, the address of the page is saved in the `url` variable.

To learn more about the href attribute in HTML, [please follow this link](https://www.semrush.com/blog/ahref-link/).
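One way to combine a page address with such a relative link is `urljoin()` from Python's standard library, which takes care of the slashes for us. The sketch below is an alternative to building the string by hand; the names `base_url` and `fragment_link` are my own, and their values are copied from the outputs above.

```python
from urllib.parse import urldefrag, urljoin

base_url = "https://allianceforcoffeeexcellence.org/indonesia-2021/"
fragment_link = "#1513038058770-76a416aa-73df"

# Resolve the relative link against the page address
full_url = urljoin(base_url, fragment_link)
print(full_url)

# Split the result back into the page address and the fragment
page_url, fragment = urldefrag(full_url)
print(page_url)
print(fragment)
```

Splitting the URL again shows that the part before `#` is still the original page address.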




In [30]:
# Join the relative_links with the page URL (url already ends with "/")
tables_urls = [f"{url}{link}" for link in relative_links]
tables_urls

['https://allianceforcoffeeexcellence.org/indonesia-2021/#1513038058770-76a416aa-73df',
 'https://allianceforcoffeeexcellence.org/indonesia-2021/#1643312940613-53cf472d-7577',
 'https://allianceforcoffeeexcellence.org/indonesia-2021/#1621020840203-316b2877-44f1',
 'https://allianceforcoffeeexcellence.org/indonesia-2021/#1644012076250-c970f6cf-2425',
 'https://allianceforcoffeeexcellence.org/indonesia-2021/#1617922008283-0d0ddaee-ded3',
 'https://allianceforcoffeeexcellence.org/indonesia-2021/#1617922031187-d6ee3ac3-edbf',
 'https://allianceforcoffeeexcellence.org/indonesia-2021/#1528229985999-af37d072-45f5',
 'https://allianceforcoffeeexcellence.org/indonesia-2021/#1623963496226-1c196f4d-5732']

In [16]:
tables_urls[0]

'https://allianceforcoffeeexcellence.org/indonesia-2021/#1513038058770-76a416aa-73df'

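We now have the complete URLs of the tables' links, saved in the `tables_urls` variable.

Next, we request one of the `tables_urls` with the Requests library. The part after `#` is not sent to the server, so the response is the full HTML of the page, including every table.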

In [32]:
# Request the website's Response.
tables_Response = requests.get(tables_urls[0])

# Convert the Response into a string.
tables_Response.text

'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta charset="UTF-8" /><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=0" /><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" /><meta name="format-detection" content="telephone=no"> <style>\n        #wpadminbar #wp-admin-bar-p404_free_top_button .ab-icon:before {\n            content: "\\f103";\n            color:red;\n            top: 2px;\n        }\n    </style>\n<script type="text/javascript">var ajaxurl = "https://allianceforcoffeeexcellence.org/wp-admin/admin-ajax.php";</script><meta name="robots" content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" />\n<script>window._wca = window._wca || [];</script>\n<style id="critical-path-css" type="text/css">\n\t\t\tbody,html{width:100%;height:100%;margin:0;padding:0}.page-preloader{top:0;left:0;z-index:999;position:fixed;height:100%;width:100%;text-align:center}.preloader-prev

With the pandas library, we can now turn the HTML tables into readable DataFrames using the `read_html()` function.

In [18]:
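from io import StringIO

import pandas as pd

# Extract the HTML tables; StringIO wraps the HTML string in a file-like
# object, which recent pandas versions expect instead of a literal string
tables = pd.read_html(StringIO(tables_Response.text), header=0)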
tables

[    RANK  SCORE                           FARM                   FARMER  \
 0      1  89.28                  Pantan Musara           Dilen Ali Gogo   
 1      2  89.04                       Ibun Ita               Ita Rosita   
 2      3  88.89                  Pantan Musara  Roberto Bagus Syahputra   
 3      4  88.75                   Ijen Lestari           Dandy Darmawan   
 4      5  88.58                      Ibun Yudi                     Yudi   
 5      6  88.49     Koperasi Koerintji Barokah                  Triyono   
 6      7  88.46                   Avatara Gayo               Drs Hamdan   
 7      8  88.30                       Kamojang             Ahmad Vansyu   
 8      9  88.25                         Topidi              Daeng Halim   
 9     10  88.15  Lereng Gunung Argopuro Krucil         Dinul Haq Sabyli   
 10    11  88.14                         Topidi         Daeng Balengkang   
 11    12  88.05          PT Sulotco Jaya Abadi         Samuel Karundeng   
 12    13  8

Now we have a list containing the tables as pandas DataFrames.

In [19]:
type(tables)

list
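To see on a small scale why `read_html()` returns a list, here is a sketch with two tiny tables in one HTML string. The rows are the first two entries from the outputs in this notebook, trimmed to three columns. It assumes that `lxml` is installed, as it is for the parsing steps earlier in the notebook.

```python
from io import StringIO

import pandas as pd

# Two small tables in a single HTML string
html = """
<table>
  <tr><th>RANK</th><th>SCORE</th><th>FARM</th></tr>
  <tr><td>1</td><td>89.28</td><td>Pantan Musara</td></tr>
  <tr><td>2</td><td>89.04</td><td>Ibun Ita</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Farm</th><th>High Bid</th></tr>
  <tr><td>1</td><td>Pantan Musara</td><td>$80.00</td></tr>
  <tr><td>2</td><td>Ibun Ita</td><td>$65.20</td></tr>
</table>
"""

# read_html() returns one DataFrame per <table> element it finds
demo_tables = pd.read_html(StringIO(html), header=0)

print(len(demo_tables))
print(demo_tables[0])
print(demo_tables[1])
```

Each element of the list is an ordinary DataFrame, so indexing the list, as in `demo_tables[0]`, picks out a single table.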

From the `tables` variable, we can get the DataFrame for each category listed on the page, as follows:

In [20]:
COE_competition_results = tables[0]
COE_auction_results = tables[1]
NW_competition_results = tables[2]
NW_auction_results = tables[3]
International_Jury = tables[4]
National_Jury = tables[5]
Organizing_Country_Commissions = tables[6]


Of those tables, we need only two for this project: COE Competition Results, which contains the characteristics of the green coffee beans, and COE Auction Results, which contains each coffee's highest bid price and buyer.

In [21]:
COE_competition_results

Unnamed: 0,RANK,SCORE,FARM,FARMER,REGION,WEIGHT (kg),VARIETY,PROCESS
0,1,89.28,Pantan Musara,Dilen Ali Gogo,Aceh,210,"Ateng, Gayo 1, P88",Honey
1,2,89.04,Ibun Ita,Ita Rosita,Jawa Barat,186,Sigararutang,Natural
2,3,88.89,Pantan Musara,Roberto Bagus Syahputra,Aceh,262,"Ateng, Gayo 1, P88",Washed
3,4,88.75,Ijen Lestari,Dandy Darmawan,Jawa Timur,359,"USDA, Colombia Brazil",Natural
4,5,88.58,Ibun Yudi,Yudi,Jawa Barat,206,"Sigararutang, Kartika, S-795",Natural
5,6,88.49,Koperasi Koerintji Barokah,Triyono,Jambi,206,"Sigararutang, S-795, Andungsari",Honey
6,7,88.46,Avatara Gayo,Drs Hamdan,Aceh,218,"Abyssinia, P88, Ateng, Typica",Natural
7,8,88.3,Kamojang,Ahmad Vansyu,Jawa Barat,203,"Ateng, Sigararutang, S-795, Andungsari",Washed
8,9,88.25,Topidi,Daeng Halim,Sulawesi Selatan,189,Typica,Washed
9,10,88.15,Lereng Gunung Argopuro Krucil,Dinul Haq Sabyli,Jawa Timur,194,Cobra & Typica,Natural


In [22]:
COE_auction_results

Unnamed: 0,Rank,Farm,Score,Weight (lbs),High Bid,Total Value,Company Name
0,1,Pantan Musara,89.28,462.97,$80.00,"$37,037.60",Wataru for YAMATOYA COFFEE
1,2,Ibun Ita,89.04,410.06,$65.20,"$26,735.91",Proud Mary Coffee Roasters
2,3,Pantan Musara,88.89,577.61,$69.10,"$39,912.85",Terarosa (Haksan Co. Ltd)
3,4,Ijen Lestari,88.75,791.46,$35.20,"$27,860.80",SUPREMO COFFEE
4,5,Ibun Yudi,88.58,454.15,$25.10,"$11,399.17",Latorre&Dutch (China) for Cut Hand Group（剁手咖啡群...
5,6,Koperasi Koerintji Barokah,88.49,454.15,$21.70,"$9,855.06","MARISSTELLA COFFEE, INTELLIGENTSIA, RYANS COFF..."
6,7,Avatara Gayo,88.46,480.61,$45.20,21723.57,Blue Bottle Coffee
7,8,Kamojang,88.3,447.54,$19.20,8592.77,"MUSEO Co., Ltd. // wondumyungga cafe de Jura a..."
8,9,Topidi,88.25,416.67,$24.30,10125.08,"Orsir Coffee Co., Ltd."
9,10,Lereng Gunung Argopuro Krucil,88.15,427.7,$24.20,10350.34,Herd Coffee Roaster. CARA Instant Coffee. Besk...
