# Web Scraping
This program scrapes the One Piece wiki's list of canon characters.
The data collected is the character table, stored as text.

### Import Libraries
This program uses five libraries: two for interacting with the website (requests and BeautifulSoup), one for processing data (re), and two for working with files (csv and pandas).

In [107]:
from bs4 import BeautifulSoup
import requests
import re
import csv
import pandas as pd

### Link Website
This section declares the URL of the website whose data will be scraped.

In [108]:
web_url = "https://onepiece.fandom.com/wiki/List_of_Canon_Characters"

### Connect to Website
The connect_to_web() function handles the connection to the website. It downloads the page and parses its HTML tags. It also reports whether the URL we passed in could be reached by checking the HTTP status code.

In [109]:
def connect_to_web(url):
    """Connect to the given website and fetch its content.

    Args:
        url (str): URL of the website to scrape

    Returns:
        BeautifulSoup: parsed document containing the page's HTML tags
    """
    connect = requests.get(url)
    page = BeautifulSoup(connect.content, "html.parser")
    
    # Report whether the request succeeded (HTTP status 200).
    if connect.status_code == 200:
        # Message: "Successfully connected to the website: <page heading>"
        print(f"Berhasil terhubung ke Web: {page.h1.text}")
    else:
        # Message: "Failed to connect to the website / Error code: <status>"
        print(f"Gagal terhubung ke web \nKode Error: {connect.status_code}")
    
    return page

### Fetching the Tags
The cell below calls connect_to_web and then collects every th and td tag from the page.

In [111]:
webPage_content = connect_to_web(web_url)

table_header = webPage_content.find_all("th")
table_content = webPage_content.find_all("td")

Berhasil terhubung ke Web: 
					List of Canon Characters				
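
The heading printed above keeps the tabs and line breaks that surround it in the page markup. A minimal sketch of collapsing that whitespace before printing, assuming a hypothetical helper `clean_text` (not part of the notebook) and a sample string that imitates the output:

```python
def clean_text(text):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(text.split())


# Sample imitating the raw heading text; the exact whitespace is an assumption
raw_heading = "\n\t\t\t\t\tList of Canon Characters\t\t\t\t"
print(f"Berhasil terhubung ke Web: {clean_text(raw_heading)}")
```

The same helper could be applied to `page.h1.text` inside connect_to_web if a one-line message is preferred.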


### Getting the Header Content
The getHeader_values function extracts the text of the table's header cells.

In [68]:
def getHeader_values(content_wrapped, begin, end):
    """Extract the table header data from the target page.

    Args:
        content_wrapped (bs4.element.ResultSet): th tags found on the page
        begin (int): index where the loop starts
        end (int): index where the loop stops (exclusive)

    Returns:
        list: table header texts
    """
    temp = []
    
    for index in range(begin, end):
        # The header label sits inside a <b> tag within each th
        content_text = content_wrapped[index].b.text
        # Drop any ", " sequence from the label
        temp.append(re.sub(r", ", r"", content_text))
        
    return temp
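
To make the `begin` and `end` arguments concrete, here is a small sketch that applies the same slicing and cleanup to plain strings instead of th tags. The helper `slice_headers` and the sample list are hypothetical; in particular, the first and last entries are assumptions about what the page might contain.

```python
import re


def slice_headers(texts, begin, end):
    """Return the cleaned texts at positions begin..end-1, mirroring getHeader_values."""
    return [re.sub(r", ", "", texts[index]) for index in range(begin, end)]


# Assumed layout: one leading cell to skip, five labels, one trailing cell
sample_headers = ["#", "Name", "Chapter", "Episode", "Year", "Note", "Other"]
header_example = slice_headers(sample_headers, 1, 6)
print(header_example)
```

Starting at index 1 and stopping before index 6 keeps exactly five labels, which matches the five columns written to the CSV file later on.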

### Getting the Body Content
The getContent_values function extracts and cleans the body cells of the table.

In [51]:
def getContent_values(content_wrapped):
    """Extract the body data of the table.

    Args:
        content_wrapped (bs4.element.ResultSet): td tags found on the page

    Returns:
        list: table rows as a list of lists
    """
    length_content = len(content_wrapped)
    temp_list = []
    loopBegin = 1
    
    # The code assumes 6 cells per row. range() reads loopBegin only once
    # (as 1), so lane simply counts rows starting from 1.
    for lane in range(loopBegin, length_content//6):
        temp_val = []
        
        # loopBegin is the row's second cell and 6*lane is one past its last,
        # so the first cell of every row is skipped.
        for index in range(loopBegin, 6*lane):
            content_text = content_wrapped[index].text
            # Strip a trailing newline
            content_text = re.sub(r"$(\r\n|\r|\n)", r"", content_text)
            
            # Mark empty cells explicitly
            if content_text == "":
                temp_val.append("Empty")
            else:
                temp_val.append(content_text)
            
        loopBegin += 6
        temp_list.append(temp_val)
        
        # Stop after 1317 rows
        if len(temp_list) >= 1317:
            break
    
    return temp_list
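
The index arithmetic above is easier to follow on plain strings. The sketch below is a simplified variant, not the notebook's function: it groups a flat list of cell texts into rows of six, drops the first cell of each row, and replaces empty cells with "Empty". The name `rows_from_cells`, the constant `CELLS_PER_ROW`, and the sample cells are assumptions. Unlike the original loop, it also keeps the final complete row.

```python
import re

CELLS_PER_ROW = 6  # assumption: every table row contributes six cells


def rows_from_cells(cells, max_rows=None):
    """Group flat cell texts into rows, skipping the first cell of each row."""
    rows = []
    for start in range(0, len(cells) - CELLS_PER_ROW + 1, CELLS_PER_ROW):
        row = []
        for text in cells[start + 1 : start + CELLS_PER_ROW]:
            # Strip one trailing line break, if present
            text = re.sub(r"(\r\n|\r|\n)$", "", text)
            row.append(text if text != "" else "Empty")
        rows.append(row)
        if max_rows is not None and len(rows) >= max_rows:
            break
    return rows


# Hypothetical cells: a leading counter cell, then five data cells per row
sample_cells = [
    "1", "A O\n", "0551", "0460", "2009", "Some note\n",
    "2", "Abdullah\n", "0704", "0632", "2013", "\n",
]
sample_rows = rows_from_cells(sample_cells)
print(sample_rows)
```

Stepping through the list in strides of six avoids having to keep `loopBegin` and `lane` in sync by hand.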


### Calling the function
The cell below calls the two functions above and stores their return values in variables.

In [69]:
character_list = getContent_values(table_content)
header = getHeader_values(table_header, 1, 6)

### File CSV
In this section the program opens a CSV file for writing and fills it with the data collected earlier.

In [105]:
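# newline="" lets the csv module control line endings itself
with open('data_from_webScrapping.csv', 'w', encoding="utf-8", newline="") as file:
    writer = csv.writer(file, delimiter=",")
    
    # Header row first, then one row per character
    writer.writerow(header)
    for lineValues in character_list:
        writer.writerow(lineValues)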
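
Because the Note column can contain free text, it helps to see how `csv.writer` handles a field that itself contains a comma. The sketch below writes to an in-memory buffer instead of a file; the note text is a made-up example, not data from the wiki.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",")

writer.writerow(["Name", "Chapter", "Episode", "Year", "Note"])
# Hypothetical row whose last field contains the delimiter
writer.writerow(["A O", "0551", "0460", "2009", "Hypothetical note, with a comma"])

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original five fields
rows_back = list(csv.reader(io.StringIO(csv_text)))
```

The writer wraps the last field in double quotes, so a comma-aware reader still sees five columns per row.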

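### Reading the File
In this section the program reads the CSV file it just created using the pandas library. The file was written with a comma delimiter but is read here with sep="|", so pandas treats each line as a single column, as the output shows.

In [106]:
data = pd.read_csv("data_from_webScrapping.csv", sep="|")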
print(data)

                         Name,Chapter,Episode,Year,Note
0     A O,0551,0460,2009,His name was revealed in th...
1                         Abdullah,0704,0632,2013,Empty
2                          Absalom,0444,0339,2007,Empty
3                           Acilia,0706,0652,2013,Empty
4     Adele,0608,0527,2010,Her name was revealed in ...
...                                                 ...
1312  Zeus,0827,0786,2016,His name was revealed in C...
1313  Zodia,0553,0462,2009,His name was revealed in ...
1314  Zotto,0533,0432,02009,His name was revealed in...
1315  Zucca,0564,0489,2009,His name was revealed in ...
1316                       Zunesha,0802,0751,2015,Empty

[1317 rows x 1 columns]
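
The single column in the output above follows from the separator choice. As a small illustration, the sketch below parses two lines taken from that output with the standard csv module (used here as a stand-in for pandas), once with "|" and once with ",". The helper `parse` is hypothetical.

```python
import csv
import io

# Two lines copied from the output above
csv_text = "Name,Chapter,Episode,Year,Note\nAbdullah,0704,0632,2013,Empty\n"


def parse(text, delimiter):
    """Parse CSV text with the given delimiter and return its rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows_pipe = parse(csv_text, "|")    # no "|" in the text: one field per row
rows_comma = parse(csv_text, ",")   # matches the writer: five fields per row

print(len(rows_pipe[0]), len(rows_comma[0]))
```

With pandas, the equivalent change would be `pd.read_csv("data_from_webScrapping.csv", sep=",", dtype=str)`. The `dtype=str` argument is an assumption here, chosen so that values such as 0704 keep their leading zeros.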


In [97]:
data = pd.read_csv("data_from_webScrapping.csv", sep="|", encoding="latin-1")
print(data)

                         Name,Chapter,Episode,Year,Note
0     A O,0551,0460,2009,His name was revealed in th...
1                         Abdullah,0704,0632,2013,Empty
2                          Absalom,0444,0339,2007,Empty
3                           Acilia,0706,0652,2013,Empty
4     Adele,0608,0527,2010,Her name was revealed in ...
...                                                 ...
1312  Zeus,0827,0786,2016,His name was revealed in C...
1313  Zodia,0553,0462,2009,His name was revealed in ...
1314  Zotto,0533,0432,02009,His name was revealed in...
1315  Zucca,0564,0489,2009,His name was revealed in ...
1316                       Zunesha,0802,0751,2015,Empty

[1317 rows x 1 columns]
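
This cell reads the file as latin-1 although it was written as UTF-8. The rows shown look the same as before, which is expected for ASCII-only text, but a non-ASCII character would come out garbled. A minimal sketch of that effect, using a made-up name rather than one taken from the table:

```python
# Hypothetical name containing one non-ASCII character (n with tilde)
name = "Do\u00f1a"

utf8_bytes = name.encode("utf-8")

# Wrong codec: each UTF-8 byte becomes its own character
as_latin1 = utf8_bytes.decode("latin-1")
# Matching codec: the text round-trips unchanged
as_utf8 = utf8_bytes.decode("utf-8")

print(as_latin1, as_utf8)
```

Reading with `encoding="utf-8"`, the encoding used when the file was written, avoids the problem.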


In [94]:
data.head(15)

| | Name,Chapter,Episode,Year,Note |
|---|---|
| 0 | A O,0551,0460,2009,His name was revealed in th... |
| 1 | Abdullah,0704,0632,2013,Empty |
| 2 | Absalom,0444,0339,2007,Empty |
| 3 | Acilia,0706,0652,2013,Empty |
| 4 | Adele,0608,0527,2010,Her name was revealed in ... |
| 5 | Aggie 68,0552,0461,2009,His name was revealed ... |
| 6 | Agotogi,0163,0100,2001,His name was revealed i... |
| 7 | Agsilly,0570,0482,2010,His name was revealed i... |
| 8 | Agyo (aka Fighting Lion),0706,0652,2013,His na... |
| 9 | Ahho Desunen IX,0587,0501,2010,His name was re... |


In [92]:
data.tail()

| | Name,Chapter,Episode,Year,Note |
|---|---|
| 1312 | Zeus,0827,0786,2016,His name was revealed in C... |
| 1313 | Zodia,0553,0462,2009,His name was revealed in ... |
| 1314 | Zotto,0533,0432,02009,His name was revealed in... |
| 1315 | Zucca,0564,0489,2009,His name was revealed in ... |
| 1316 | Zunesha,0802,0751,2015,Empty |
