# ``Web Scraping``

## Import the required libraries

`BeautifulSoup` and `requests` are used for web scraping, while `pandas` is used to process the data.

Learn more in the ``Beautiful Soup Documentation``: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
from IPython.display import display

<hr>

# **Scraping**: World Population Dataset
``Scrape data from:`` https://www.worldometers.info/world-population/population-by-country/

## Request the web page to scrape

In [8]:
world_web = requests.get('https://www.worldometers.info/world-population/population-by-country/')
world_web

<Response [200]>

In [10]:
data_web = BeautifulSoup(world_web.content, 'html.parser')
# data_web

## Extract data from specific tags

In [5]:
# get the page title, including its HTML tag
data_web.title

<title>Population by Country (2020) - Worldometer</title>

In [11]:
# get the page title without the HTML tag
data_web.title.string

'Population by Country (2020) - Worldometer'

In [5]:
data_web.title.text

'Population by Country (2020) - Worldometer'
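
Both `.string` and `.text` return the same value for the title above, but they are not interchangeable in general. The sketch below uses a small made-up HTML snippet (not the real page) to show the difference: `.string` only works when a tag has a single child, while `.text` joins all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = "<html><head><title>Demo page</title></head><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_string = soup.title.string  # <title> has one text child, so its text is returned
title_text = soup.title.text      # same text, collected from all descendants

p_string = soup.p.string          # <p> has two children (text and <b>), so this is None
p_text = soup.p.text              # joins the text of every descendant

print(title_string, title_text, p_string, p_text)
```

When a tag may contain nested tags, `.text` is usually the safer choice.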

In [6]:
# the <tr> tag defines one table row in HTML
# selecting on it alone is imprecise, because the data we want sits in more specific tags nested inside it
tr = data_web.find_all('tr')
# tr

In [7]:
# the <p> tag defines a paragraph in HTML
p = data_web.find_all('p')
p[0].text

'This list includes both countries and dependent territories. Data based on the latest United Nations Population Division estimates. Click on the name of the country or dependency for current estimates (live population clock), historical data, and projected figures.  See also: World Population  '

In [8]:
# the <th> tag defines a header cell of an HTML table
th = data_web.find_all('th')
th[7].text

'Migrants (net)'

In [9]:
# the <td> tag defines a data cell of an HTML table
td = data_web.find_all('td')
td[13].text

'India'
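
Calling `find_all('td')` on the whole page returns one flat list of cells, so the row boundaries are lost. As a sketch of an alternative, the cells can be collected row by row by searching for `<td>` inside each `<tr>`. The HTML below is a made-up two-row table standing in for the real page, whose markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical table, used only for illustration.
html = """
<table>
  <tr><th>#</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>China</td><td>1,439,323,776</td></tr>
  <tr><td>2</td><td>India</td><td>1,380,004,385</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

header = [th.text for th in soup.find_all("th")]

rows = []
for tr in soup.find_all("tr"):
    cells = [td.text for td in tr.find_all("td")]
    if cells:  # the header row contains no <td>, so it yields an empty list
        rows.append(cells)

print(header)
print(rows)
```

Each entry of `rows` is already one table row, so no manual slicing by column count is needed.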

## Process the data

In [10]:
# store the cell text in lists
th = data_web.find_all('th')
td = data_web.find_all('td')

box = [element.text for element in td]
# print(box)
kotak = [element.text for element in th]
# print(kotak)

In [11]:
# alternative: build the same lists with explicit for loops
th = data_web.find_all('th')
td = data_web.find_all('td')

box = []
for element in td:
    box.append(element.text) 
    
kotak = []
for element in th:
    kotak.append(element.text) 

# kotak
# box

In [12]:
# split the data into one list per row (per country) and store the rows in the list "world_population"
world_population = []
a = 0
for i in range(235):  # 235 is the number of rows in the HTML table (number of countries)
    b = (i+1)*12
    world_population.append(box[a:b])
    a = b

print(world_population[3])

['4', 'Indonesia', '273,523,615', '1.07 %', '2,898,047', '151', '1,811,570', '-98,955', '2.3', '30', '56 %', '3.51 %']
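
The loop above hard-codes both the row count (235) and the column count (12). A minimal sketch of a more general version derives the row width from the header list instead. The helper name `chunk_rows` and the `*_demo` lists are hypothetical stand-ins for `box` and `kotak`.

```python
def chunk_rows(cells, width):
    """Split a flat list of cells into consecutive rows of `width` items."""
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Small stand-ins for the `kotak` (headers) and `box` (cells) lists.
kotak_demo = ["#", "Country", "Population"]
box_demo = ["1", "China", "1,439,323,776", "2", "India", "1,380,004,385"]

rows_demo = chunk_rows(box_demo, len(kotak_demo))
print(rows_demo)
```

If the number of cells is not a multiple of `width`, the last row comes out shorter, which is a useful signal that the page layout has changed.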


In [13]:
# convert to a DataFrame, using the "kotak" list as column names
world_df = pd.DataFrame(world_population, columns=kotak)
world_df.head()

| | # | Country (or dependency) | Population (2020) | Yearly Change | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399 | 1.7 | 38 | 61 % | 18.47 % |
| 1 | 2 | India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687 | 2.2 | 28 | 35 % | 17.70 % |
| 2 | 3 | United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806 | 1.8 | 38 | 83 % | 4.25 % |
| 3 | 4 | Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955 | 2.3 | 30 | 56 % | 3.51 % |
| 4 | 5 | Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379 | 3.6 | 23 | 35 % | 2.83 % |


In [14]:
# rename a column
world_df.rename(columns={'Yearly Change':'Yearly Change (%)'}, inplace=True)
world_df.head()

| | # | Country (or dependency) | Population (2020) | Yearly Change (%) | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | China | 1439323776 | 0.39 % | 5540090 | 153 | 9388211 | -348399 | 1.7 | 38 | 61 % | 18.47 % |
| 1 | 2 | India | 1380004385 | 0.99 % | 13586631 | 464 | 2973190 | -532687 | 2.2 | 28 | 35 % | 17.70 % |
| 2 | 3 | United States | 331002651 | 0.59 % | 1937734 | 36 | 9147420 | 954806 | 1.8 | 38 | 83 % | 4.25 % |
| 3 | 4 | Indonesia | 273523615 | 1.07 % | 2898047 | 151 | 1811570 | -98955 | 2.3 | 30 | 56 % | 3.51 % |
| 4 | 5 | Pakistan | 220892340 | 2.00 % | 4327022 | 287 | 770880 | -233379 | 3.6 | 23 | 35 % | 2.83 % |


In [15]:
# strip the '%' sign from the 'Yearly Change (%)' column
gudang = []
for i in range(len(world_df['Yearly Change (%)'])):
    j = world_df['Yearly Change (%)'][i].split()
    gudang.append(float(j[0]))

In [16]:
world_df['Yearly Change (%)'] = gudang
world_df.head()

| | # | Country (or dependency) | Population (2020) | Yearly Change (%) | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | China | 1439323776 | 0.39 | 5540090 | 153 | 9388211 | -348399 | 1.7 | 38 | 61 % | 18.47 % |
| 1 | 2 | India | 1380004385 | 0.99 | 13586631 | 464 | 2973190 | -532687 | 2.2 | 28 | 35 % | 17.70 % |
| 2 | 3 | United States | 331002651 | 0.59 | 1937734 | 36 | 9147420 | 954806 | 1.8 | 38 | 83 % | 4.25 % |
| 3 | 4 | Indonesia | 273523615 | 1.07 | 2898047 | 151 | 1811570 | -98955 | 2.3 | 30 | 56 % | 3.51 % |
| 4 | 5 | Pakistan | 220892340 | 2.0 | 4327022 | 287 | 770880 | -233379 | 3.6 | 23 | 35 % | 2.83 % |
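
The same cleanup can be done without an explicit loop by using the pandas string accessor. The sketch below applies it to a small hand-made frame (values copied from the rows shown earlier) and also removes the thousands separators from the population column. It assumes every cell is a plain number followed by `%` or containing commas; cells with other text would need something like `pd.to_numeric(..., errors='coerce')` instead of `astype`.

```python
import pandas as pd

# Small stand-in for `world_df`, with all values still stored as strings.
demo = pd.DataFrame({
    "Population (2020)": ["1,439,323,776", "273,523,615"],
    "Urban Pop %": ["61 %", "56 %"],
    "World Share": ["18.47 %", "3.51 %"],
})

# Drop the '%' sign and surrounding spaces, then convert to float.
for col in ["Urban Pop %", "World Share"]:
    demo[col] = demo[col].str.replace("%", "", regex=False).str.strip().astype(float)

# Drop the thousands separators, then convert to integer.
demo["Population (2020)"] = (
    demo["Population (2020)"].str.replace(",", "", regex=False).astype("int64")
)

print(demo.dtypes)
```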


<hr>

# **In-Class Exercise**: Indonesia Population in 1960, 1970, ..., 2050
## ``Indonesia Population in 10-Year Periods``

### A. Get data (scrape) from this site: https://www.worldometers.info/world-population/indonesia-population/

### B. Create DataFrame with columns: ``Year, Population, Median Age, Fertility Rate, & Urban Pop Percentage``

In [24]:
# fetch the page from the website
indo_web = requests.get('https://www.worldometers.info/world-population/indonesia-population/')
data_indo = BeautifulSoup(indo_web.content, 'html.parser')

In [26]:
# get the table header cells
th_indo = data_indo.find_all('th')
kolom_header = [cell.text for cell in th_indo]
kolom_header = kolom_header[0:13]  # keep only the first 13 header cells
kolom_header

['Year',
 'Population',
 'Yearly %  Change',
 'Yearly Change',
 'Migrants (net)',
 'Median Age',
 'Fertility Rate',
 'Density (P/Km²)',
 'Urban Pop %',
 'Urban Population',
 "Country's Share of World Pop",
 'World Population',
 'IndonesiaGlobal Rank']
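
Two of the headers above come out slightly mangled: `'Yearly %  Change'` has a double space and `'IndonesiaGlobal Rank'` has two words fused together. One plausible cause is a line break tag inside the header cell, which `.text` drops without leaving a space. The sketch below assumes that markup (the real page may differ) and shows how `get_text` with a separator, followed by whitespace normalization, gives cleaner names.

```python
from bs4 import BeautifulSoup

# Hypothetical header cells; the <br/> placement is an assumption.
html = "<table><tr><th>Indonesia<br/>Global Rank</th><th>Yearly % <br/> Change</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Plain .text joins the pieces with nothing in between.
raw = [th.text for th in soup.find_all("th")]

# get_text(" ") inserts a space between pieces; split/join collapses repeated spaces.
clean = [" ".join(th.get_text(" ").split()) for th in soup.find_all("th")]

print(raw)
print(clean)
```

Note that cleaned names would no longer match the original strings, so any later column selection would have to use the cleaned spelling.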

In [29]:
# get the table data cells
td_indo = data_indo.find_all('td')
kolom_data = [cell.text for cell in td_indo]
# keep the cells from index 2 up to (not including) the first cell whose text is '1'
kolom_data = kolom_data[2:kolom_data.index('1')]
# kolom_data

In [19]:
Indo_pop = []
a = 0
for i in range(25): 
    b = (i+1)*13
    Indo_pop.append(kolom_data[a:b])
    a = b

print(Indo_pop[0])

['2020', '273,523,615', '1.07 %', '2,898,047', '-98,955', '29.7', '2.32', '151', '56.4 %', '154,188,546', '3.51 %', '7,794,798,739', '4']


In [33]:
indo_df = pd.DataFrame(Indo_pop, columns=kolom_header)
indo_df = indo_df.sort_values(by='Year')[['Year', 'Population', 'Median Age', 'Fertility Rate', 'Urban Pop %']]
indo_df.drop_duplicates(subset='Year', inplace=True)
# indo_df

In [34]:
ten_period = [year for year in indo_df['Year'] if int(year) % 10 == 0]
ten_period

['1960',
 '1970',
 '1980',
 '1990',
 '2000',
 '2010',
 '2020',
 '2030',
 '2040',
 '2050']

In [35]:
indo_df[indo_df['Year'].isin(ten_period)].reset_index(drop=True)

| | Year | Population | Median Age | Fertility Rate | Urban Pop % |
|---|---|---|---|---|---|
| 0 | 1960 | 87751068 | 20.2 | 5.67 | 14.6 % |
| 1 | 1970 | 114793178 | 18.6 | 5.57 | 17.1 % |
| 2 | 1980 | 147447836 | 19.1 | 4.73 | 22.1 % |
| 3 | 1990 | 181413402 | 21.3 | 3.4 | 30.6 % |
| 4 | 2000 | 211513823 | 24.4 | 2.55 | 42.0 % |
| 5 | 2010 | 241834215 | 27.2 | 2.5 | 50.1 % |
| 6 | 2020 | 273523615 | 29.7 | 2.32 | 56.4 % |
| 7 | 2030 | 299198430 | 32.4 | 2.32 | 62.1 % |
| 8 | 2040 | 318637858 | 35.1 | 2.32 | 66.8 % |
| 9 | 2050 | 330904664 | 37.4 | 2.32 | 70.7 % |
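
The filter on years divisible by 10 can also be written as a single boolean mask, without building an intermediate list and calling `isin`. This is a minimal sketch on a made-up `Year` column; the frame name `demo` is hypothetical.

```python
import pandas as pd

# Small stand-in for the `Year` column of `indo_df`, stored as strings.
demo = pd.DataFrame({"Year": ["1955", "1960", "1965", "1970", "2019", "2020"]})

# Convert to integers and keep only the years divisible by 10.
mask = demo["Year"].astype(int) % 10 == 0
decades = demo[mask].reset_index(drop=True)

print(decades)
```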


<hr>

# **Take Home Exercise** (Technically Challenging)
## ``Ultraman & Monster Dataset``

### A. Get data (scrape) from this site: http://www.scifijapan.com/articles/2015/10/04/bandai-ultraman-ultra-500-figure-list/

### B. Create DataFrame with columns: ``Ultraman Name``

### C. Create DataFrame with columns: ``Monster Name``

## **Scraping Ultraman Dataset**: Request the web page to scrape

In [2]:
web = requests.get('http://www.scifijapan.com/articles/2015/10/04/bandai-ultraman-ultra-500-figure-list/')
data = BeautifulSoup(web.content, 'html.parser')

## Extract data from specific tags

In [3]:
# get the page title
data.title.string

'Bandai ULTRAMAN Ultra 500 Figure List «  SciFi Japan'

In [8]:
# get all text in website
# data.get_text()

In [4]:
strong = data.find_all('strong')


daftar = []
for element in strong:
    daftar.append(element.text)  

daftar
# (strong[0]).text

['Note:',
 'Ultra Hero 500/ ウルトラヒーロー５００',
 '01 Ultraman',
 '02 Ultra Seven',
 '03 Zoffy',
 '04 Ultraman Jack',
 '05 Ultraman Ace',
 '06 Ultraman Taro',
 '07 Ultraman Leo',
 '08 Ultraman Tiga (Multi-Type)',
 '09 Ultraman Gaia (V2)',
 '10 Ultraman Agul (V2)',
 '11 Ultraman Ginga',
 '12 Jean-Nine',
 '13 Astra',
 '14 Ultraman Dyna (Flash Type)',
 '15 Ultraman 80',
 '16 Ultraman Cosmos (Luna Mode)',
 '17 Ultraman Nexus (Anphans)',
 '18 Ultraman Max',
 '19 Ultraman Mebius',
 '20 Ultraman Hikari',
 '21 Ultraman Zero',
 '22 Ultraman Nice',
 '23 Father of Ultra',
 '24 Ultraman King',
 '25 Ultraman Saga',
 '26 Tiga Dark (Spark Doll)',
 '27 Ultraman Dark (Spark Doll)',
 '28 Ultraman Victory',
 '29 Ultraman Ginga Strium',
 '30 Ultraman Gingavictory',
 '31 Shining Ultraman Zero',
 '32 Ultraman Nexus Junis',
 '33 Ultraman Cosmos Eclipse Mode',
 '34 Ultraman Victory Knight',
 'Ultra Monster 500/ ウルトラ怪獣５００',
 '01 Alien Baltan',
 '02 Gomora',
 '03 Zetton',
 '04 Zaragas',
 '05 Eleking',
 '06 Alien Godor

## Process the data

In [5]:
ultraman = daftar[2:36]
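
The slice `daftar[2:36]` depends on fixed positions in the list. As a sketch of a less brittle approach, the section headers themselves can be located and used as slice boundaries, which also gives a starting point for the monster list in part C. The helper `find_header` and the shortened `daftar_demo` list are hypothetical.

```python
# Short stand-in for the `daftar` list produced above.
daftar_demo = [
    "Note:",
    "Ultra Hero 500/ ウルトラヒーロー５００",
    "01 Ultraman",
    "02 Ultra Seven",
    "Ultra Monster 500/ ウルトラ怪獣５００",
    "01 Alien Baltan",
    "02 Gomora",
]


def find_header(items, prefix):
    """Return the index of the first item that starts with `prefix`."""
    return next(i for i, text in enumerate(items) if text.startswith(prefix))


hero_start = find_header(daftar_demo, "Ultra Hero 500")
monster_start = find_header(daftar_demo, "Ultra Monster 500")

# Everything between the two headers is a hero; everything after is a monster.
ultraman_demo = daftar_demo[hero_start + 1:monster_start]
monster_demo = daftar_demo[monster_start + 1:]

print(ultraman_demo)
print(monster_demo)
```

On the real page, the monster section may be followed by other `<strong>` elements, so the end of that slice would need its own check.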

In [6]:
ultraman[0].split(maxsplit=1)

['01', 'Ultraman']

In [7]:
ultraman_baru = []
for entry in ultraman:
    ultraman_baru.append(entry.split(maxsplit=1))

ultraman_baru

[['01', 'Ultraman'],
 ['02', 'Ultra Seven'],
 ['03', 'Zoffy'],
 ['04', 'Ultraman Jack'],
 ['05', 'Ultraman Ace'],
 ['06', 'Ultraman Taro'],
 ['07', 'Ultraman Leo'],
 ['08', 'Ultraman Tiga (Multi-Type)'],
 ['09', 'Ultraman Gaia (V2)'],
 ['10', 'Ultraman Agul (V2)'],
 ['11', 'Ultraman Ginga'],
 ['12', 'Jean-Nine'],
 ['13', 'Astra'],
 ['14', 'Ultraman Dyna (Flash Type)'],
 ['15', 'Ultraman 80'],
 ['16', 'Ultraman Cosmos (Luna Mode)'],
 ['17', 'Ultraman Nexus (Anphans)'],
 ['18', 'Ultraman Max'],
 ['19', 'Ultraman Mebius'],
 ['20', 'Ultraman Hikari'],
 ['21', 'Ultraman Zero'],
 ['22', 'Ultraman Nice'],
 ['23', 'Father of Ultra'],
 ['24', 'Ultraman King'],
 ['25', 'Ultraman Saga'],
 ['26', 'Tiga Dark (Spark Doll)'],
 ['27', 'Ultraman Dark (Spark Doll)'],
 ['28', 'Ultraman Victory'],
 ['29', 'Ultraman Ginga Strium'],
 ['30', 'Ultraman Gingavictory'],
 ['31', 'Shining Ultraman Zero'],
 ['32', 'Ultraman Nexus Junis'],
 ['33', 'Ultraman Cosmos Eclipse Mode'],
 ['34', 'Ultraman Victory Knight']]
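
The `maxsplit=1` argument matters here because many names contain spaces: only the first run of whitespace is used as the split point, so the number is separated and the rest of the name stays intact. A short sketch with three entries taken from the list above, also converting the number to an integer:

```python
# Three entries copied from the scraped list.
ultraman_demo = ["01 Ultraman", "08 Ultraman Tiga (Multi-Type)", "15 Ultraman 80"]

# Split once: the first piece is the number, the second is the full name.
pairs = [entry.split(maxsplit=1) for entry in ultraman_demo]

# Optionally convert the number to int, which drops the leading zero.
numbered = [(int(no), name) for no, name in pairs]

print(pairs)
print(numbered)
```

Without `maxsplit=1`, the last entry would split into `['15', 'Ultraman', '80']` and no longer fit a two-column DataFrame.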

In [8]:
df = pd.DataFrame(ultraman_baru, columns=['No', 'Nama Ultraman'])
df

| | No | Nama Ultraman |
|---|---|---|
| 0 | 1 | Ultraman |
| 1 | 2 | Ultra Seven |
| 2 | 3 | Zoffy |
| 3 | 4 | Ultraman Jack |
| 4 | 5 | Ultraman Ace |
| 5 | 6 | Ultraman Taro |
| 6 | 7 | Ultraman Leo |
| 7 | 8 | Ultraman Tiga (Multi-Type) |
| 8 | 9 | Ultraman Gaia (V2) |
| 9 | 10 | Ultraman Agul (V2) |
