# Capstone Web Scraping using BeautifulSoup

This notebook contains guidance and tasks on the data processing for the application.

## Background

I am very interested in exploring research in NLP, which will require scraping a lot of data from various online sources. For that reason, I chose web scraping with BeautifulSoup for this capstone project.

## Requesting the Data and Creating a BeautifulSoup Object

Let's begin by requesting the web page from the site with the `get` method.

In [1]:
import requests

url_get = requests.get('https://www.kalibrr.id/job-board/te/data/1')

To see exactly what `requests.get` returns, we can use `.content`. Here I slice it so the HTML from the page does not fill the screen. You can remove the slicing if you want to see the full response.

In [2]:
url_get.content[1:500]

b'!DOCTYPE html><html lang="en"><head><meta charSet="utf-8"/><meta name="viewport" content="width=device-width"/><script type="application/ld+json">{\n    "@context": "https://schema.org",\n    "@type": "WebSite",\n    "url": "https://www.kalibrr.com",\n    "potentialAction": [\n      {\n        "@type": "SearchAction",\n        "target": "https://www.kalibrr.com/job-board/te/={search_term_string}",\n        "query-input": "required name=search_term_string"\n      }\n     ]\n  }</script><meta property="og:i'
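
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, not part of the original notebook: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw body of `url`, raising if the server reports an error.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

With this helper, `fetch_html('https://www.kalibrr.id/job-board/te/data/1')` would return the same bytes as `url_get.content`, but it fails loudly when the site returns an error page.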

As we can see, we get very unstructured and complex HTML, which contains the code needed to render the page in a web browser. As humans, though, it is hard to tell which parts of that code we can use and where they are, so this is where BeautifulSoup comes in. The `BeautifulSoup` class returns a BeautifulSoup object: Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.

Let's make a BeautifulSoup object; feel free to explore the object here.

In [3]:
from bs4 import BeautifulSoup 

soup = BeautifulSoup(url_get.content,"html.parser")
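
To get a feel for the tree of Python objects, here is a small sketch that explores a stand-in document instead of the live page. The HTML string is a shortened imitation of the response shown earlier, and the `<title>` text is made up for the example.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the real page; the <title> text is invented for this example.
html = (
    '<!DOCTYPE html><html lang="en"><head>'
    '<meta name="viewport" content="width=device-width"/>'
    '<title>Example job board</title>'
    '</head><body></body></html>'
)

demo_soup = BeautifulSoup(html, "html.parser")

# Tags are reachable as attributes, and tag attributes work like dict keys.
page_lang = demo_soup.html["lang"]
viewport = demo_soup.find("meta", attrs={"name": "viewport"})["content"]
page_title = demo_soup.title.string  # the text inside <title>

print(page_lang, viewport, page_title)
```

The same calls work on the `soup` object built from `url_get.content`.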

## Finding the right key to scrape the data & Extracting the right information

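Find the right key and pass it to `.find()`. On this page the job-title links are `<a>` tags with the class `k-text-primary-color`, so that class is the key used first; the cells that follow explore the keys for location, company, and posting date in the same way.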

In [7]:
posisi = soup.find('a',attrs={'class':'k-text-primary-color'})
print(posisi.prettify()[1:1000])

a class="k-text-primary-color" href="/c/pt-berlian-sistem-informasi/jobs/204136/data-analyst-2" itemprop="name">
 Data Analyst
</a>



In [18]:
posisi = soup.find_all('a',attrs={'class':'k-text-primary-color'})
posisi

[<a class="k-text-primary-color" href="/c/pt-berlian-sistem-informasi/jobs/204136/data-analyst-2" itemprop="name">Data Analyst</a>,
 <a class="k-text-primary-color" href="/c/vlink-inc/jobs/210140/data-engineer" itemprop="name">Data Engineer</a>,
 <a class="k-text-primary-color" href="/c/pgi-data/jobs/215150/it-data-center-monitoring" itemprop="name">IT Data Center  Monitoring</a>,
 <a class="k-text-primary-color" href="/c/pt-indocyber-global-technology/jobs/226730/big-data-engineer-dm" itemprop="name">Big Data Engineer (DM)</a>,
 <a class="k-text-primary-color" href="/c/pgi-data/jobs/187713/full-stack-developer-reactjs-golang" itemprop="name">Full Stack Developer (ReactJS &amp; Golang)</a>,
 <a class="k-text-primary-color" href="/c/magna-solusi-indonesia/jobs/212247/data-engineer" itemprop="name">Data Engineer</a>,
 <a class="k-text-primary-color" href="/c/alam-sutera/jobs/225458/digital-marketing-specialist" itemprop="name">Digital Marketing Specialist</a>,
 <a class="k-text-primary-c
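
`find` returns only the first match, while `find_all` returns a list of every match. As a small sketch of what to do with that list, the example below pulls the visible text and the `href` out of two anchors copied from the output above. The wrapping `<div>` is added only for the example.

```python
from bs4 import BeautifulSoup

# Two anchors copied from the output above, wrapped in a <div> for the example.
html = """
<div>
  <a class="k-text-primary-color" href="/c/pt-berlian-sistem-informasi/jobs/204136/data-analyst-2" itemprop="name">Data Analyst</a>
  <a class="k-text-primary-color" href="/c/vlink-inc/jobs/210140/data-engineer" itemprop="name">Data Engineer</a>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")
links = demo_soup.find_all("a", attrs={"class": "k-text-primary-color"})

# get_text(strip=True) returns the text without surrounding whitespace
titles = [link.get_text(strip=True) for link in links]
# Attributes of a tag are read like dictionary entries
hrefs = [link["href"] for link in links]

print(titles)
print(hrefs)
```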

In [11]:
lokasi = soup.find_all('a',attrs={'class':'k-text-subdued k-block'})
lokasi

[<a class="k-text-subdued k-block" href="/job-board/l/East-Jakarta">East Jakarta, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/Jakarta-Selatan">Jakarta Selatan, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/Jakarta">Jakarta, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/West-Jakarta">West Jakarta, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/Jakarta">Jakarta, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/Jakarta-Selatan">Jakarta Selatan, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/Tangerang-Kota">Tangerang Kota, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/Jakarta-Selatan">Jakarta Selatan, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/East-Jakarta">East Jakarta, Indonesia</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/Tangerang-Selatan">Tangerang Selatan, Indonesia</a>,
 <a class="k-text-

In [12]:
perusahaan = soup.find_all('a',attrs={'class':'k-text-subdued'})
perusahaan

[<a class="k-bg-white k-flex k-items-center k-flex-shrink k-justify-center k-text-4xl k-text-subdued k-overflow-hidden k-px-4 k-py-2 k-row-span-4" href="/c/pt-berlian-sistem-informasi/jobs"><div><img alt="PT Berlian Sistem Informasi" class="k-block k-max-w-full k-max-h-full k-bg-white k-mx-auto" decoding="async" height="80" loading="eager" src="https://rec-data.kalibrr.com/www.kalibrr.com/logos/HELY75DFBQLJMNBNXJWB4LJCAZRYW7WPUJLN8CUH-649a6aba.png" width="130"/></div></a>,
 <a class="k-text-subdued" href="/c/pt-berlian-sistem-informasi/jobs">PT Berlian Sistem Informasi</a>,
 <a class="k-text-subdued k-block" href="/job-board/l/East-Jakarta">East Jakarta, Indonesia</a>,
 <a class="k-bg-white k-flex k-items-center k-flex-shrink k-justify-center k-text-4xl k-text-subdued k-overflow-hidden k-px-4 k-py-2 k-row-span-4" href="/c/vlink-inc/jobs"><div><img alt="VLink Inc" class="k-block k-max-w-full k-max-h-full k-bg-white k-mx-auto" decoding="async" height="80" loading="eager" src="https://rec
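
Note that the result above mixes logo links, company links, and location links. Searching for a single class name matches every tag that has that name anywhere in its class list. One possible way to keep only the company links, sketched here under the assumption that they carry `k-text-subdued` as their only class, is to filter with a function. The HTML is modelled on the output above with the logo image omitted, and `is_company_link` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Three anchors modelled on the output above (the logo <img> is omitted).
html = """
<a class="k-bg-white k-flex k-text-subdued" href="/c/vlink-inc/jobs"></a>
<a class="k-text-subdued" href="/c/vlink-inc/jobs">VLink Inc</a>
<a class="k-text-subdued k-block" href="/job-board/l/Jakarta-Selatan">Jakarta Selatan, Indonesia</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Matches every <a> that has k-text-subdued among its classes: all three here.
loose = demo_soup.find_all("a", attrs={"class": "k-text-subdued"})


def is_company_link(tag):
    """True only for <a> tags whose class list is exactly ['k-text-subdued']."""
    return tag.name == "a" and tag.get("class") == ["k-text-subdued"]


companies = [tag.get_text(strip=True) for tag in demo_soup.find_all(is_company_link)]

print(len(loose), companies)
```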

In [13]:
post_submit = soup.find_all('span',attrs={'class':'k-block k-mb-1'})
post_submit

[<span class="k-block k-mb-1">Posted 23 days ago • Apply before 29 Oct</span>,
 <span class="k-block k-mb-1">Posted 12 days ago • Apply before 25 Apr</span>,
 <span class="k-block k-mb-1">Posted 13 days ago • Apply before 29 Oct</span>,
 <span class="k-block k-mb-1">Posted 12 days ago • Apply before 17 Dec</span>,
 <span class="k-block k-mb-1">Posted 4 days ago • Apply before 19 Sep</span>,
 <span class="k-block k-mb-1">Posted 10 days ago • Apply before 30 Oct</span>,
 <span class="k-block k-mb-1">Posted 9 days ago • Apply before 30 Sep</span>,
 <span class="k-block k-mb-1">Posted 16 days ago • Apply before 16 Oct</span>,
 <span class="k-block k-mb-1">Posted a month ago • Apply before 23 Nov</span>,
 <span class="k-block k-mb-1">Posted a month ago • Apply before 5 Nov</span>,
 <span class="k-block k-mb-1">Posted 19 days ago • Apply before 13 Oct</span>,
 <span class="k-block k-mb-1">Posted 2 months ago • Apply before 29 Sep</span>,
 <span class="k-block k-mb-1">Posted 2 months ago • Ap

In [39]:
jobdesc = soup.find_all('div')
jobdesc

[<div data-reactroot="" id="__next"><div class="k-flex k-flex-col k-items-stretch k-min-h-screen k-bg-body"><header class="k-sticky k-flex k-flex-col k-top-0 k-bg-white k-border-b k-border-grey-300 k-z-50"><noscript class="k-bg-red k-text-white"><div class="k-container k-p-4"><strong>Not working?</strong> You need to <a class="k-border-transparent k-border-b k-text-white" href="https://www.activatejavascript.org" target="_blank">enable JavaScript</a> to use Kalibrr.</div></noscript><div class="k-container k-flex k-justify-between md:k-px-4"><a class="k-px-6 k-py-5 md:k-p-6 md:k-border-b-4 k-border-transparent" href="/"><img alt="Kalibrr" class="k-w-auto k-h-8" decoding="async" height="20" src="https://static.kalibrr.com/public/new-kalibrr-logos/kalibrr-logo-blue%403x.min.__f366811b__.png" width="100"/></a><button class="k-px-4 k-text-primary-color k-flex k-items-center k-space-x-2 k-justify-center k-border-transparent md:k-hidden md:k-border-b-4"><svg aria-hidden="true" class="MuiSvgIc

Finding row length.

In [19]:
row=soup.find_all('a',attrs={'class':'k-text-primary-color'})
print(len(row))
jumlahbaris=(len(row))

18


In [24]:
jumlahbaris=15

Run the scraping process here. The loop below collects the position, location, company, and posting information of each job and stores them in `temp` as a list of tuples.

In [30]:
temp = []  # list that will hold one tuple per job posting

# NOTE: range starts at 1, so the first posting (index 0) is skipped
for i in range(1, jumlahbaris):
    # get data
    posisi = soup.find_all('a',attrs={'class':'k-text-primary-color'})[i].text
    lokasi = soup.find_all('a',attrs={'class':'k-text-subdued k-block'})[i].text
    perusahaan = soup.find_all('a',attrs={'class':'k-text-subdued'})[i].text
    post_submit = soup.find_all('span',attrs={'class':'k-block k-mb-1'})[i].text
    
    temp.append((posisi,lokasi,perusahaan,post_submit))
    
temp 

[('Data Engineer',
  'Jakarta Selatan, Indonesia',
  'PT Berlian Sistem Informasi',
  'Posted 12 days ago • Apply before 25 Apr'),
 ('IT Data Center  Monitoring',
  'Jakarta, Indonesia',
  'East Jakarta, Indonesia',
  'Posted 13 days ago • Apply before 29 Oct'),
 ('Big Data Engineer (DM)',
  'West Jakarta, Indonesia',
  '',
  'Posted 12 days ago • Apply before 17 Dec'),
 ('Full Stack Developer (ReactJS & Golang)',
  'Jakarta, Indonesia',
  'VLink Inc',
  'Posted 4 days ago • Apply before 19 Sep'),
 ('Data Engineer',
  'Jakarta Selatan, Indonesia',
  'Jakarta Selatan, Indonesia',
  'Posted 10 days ago • Apply before 30 Oct'),
 ('Digital Marketing Specialist',
  'Tangerang Kota, Indonesia',
  '',
  'Posted 9 days ago • Apply before 30 Sep'),
 ('Data Engineer',
  'Jakarta Selatan, Indonesia',
  'PGI Data',
  'Posted 16 days ago • Apply before 16 Oct'),
 ('IT Asset Management Officer',
  'East Jakarta, Indonesia',
  'Jakarta, Indonesia',
  'Posted a month ago • Apply before 23 Nov'),
 ('De
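
In the output above, several `perusahaan` values are locations or empty strings. This is because the `k-text-subdued` search also returns logo and location links, so its indices drift away from the other lists. The first posting is also missing because the loop starts at index 1. A possible alternative, sketched below on two postings rebuilt from the earlier outputs, is to run each search once, restrict the company search to the exact class, and combine the lists with `zip`. The flat HTML layout and the `texts` helper are simplifications for illustration.

```python
from bs4 import BeautifulSoup

# Two postings modelled on the outputs above; the flat layout is a simplification.
html = """
<a class="k-text-primary-color" href="/c/pt-berlian-sistem-informasi/jobs/204136/data-analyst-2">Data Analyst</a>
<a class="k-text-subdued" href="/c/pt-berlian-sistem-informasi/jobs">PT Berlian Sistem Informasi</a>
<a class="k-text-subdued k-block" href="/job-board/l/East-Jakarta">East Jakarta, Indonesia</a>
<span class="k-block k-mb-1">Posted 23 days ago • Apply before 29 Oct</span>
<a class="k-text-primary-color" href="/c/vlink-inc/jobs/210140/data-engineer">Data Engineer</a>
<a class="k-text-subdued" href="/c/vlink-inc/jobs">VLink Inc</a>
<a class="k-text-subdued k-block" href="/job-board/l/Jakarta-Selatan">Jakarta Selatan, Indonesia</a>
<span class="k-block k-mb-1">Posted 12 days ago • Apply before 25 Apr</span>
"""
demo_soup = BeautifulSoup(html, "html.parser")


def texts(tags):
    """Return the stripped text of every tag in `tags`."""
    return [tag.get_text(strip=True) for tag in tags]


posisi = texts(demo_soup.find_all("a", attrs={"class": "k-text-primary-color"}))
lokasi = texts(demo_soup.find_all("a", attrs={"class": "k-text-subdued k-block"}))
# Keep only anchors whose class list is exactly ['k-text-subdued']
perusahaan = texts(demo_soup.find_all(
    lambda tag: tag.name == "a" and tag.get("class") == ["k-text-subdued"]))
post_submit = texts(demo_soup.find_all("span", attrs={"class": "k-block k-mb-1"}))

# zip pairs the i-th entry of every list, starting from the first posting
rows = list(zip(posisi, lokasi, perusahaan, post_submit))
print(rows)
```

This relies on every posting having exactly one element of each kind. If the four lists differ in length, `zip` silently stops at the shortest, so comparing their lengths first is a sensible check.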

In [None]:
# try moving to the next page (unfinished draft)

temp = []  # list that will hold one tuple per job posting

for i in range(1, jumlahbaris):
    # get data
    posisi = soup.find_all('a',attrs={'class':'k-text-primary-color'})[i].text
    lokasi = soup.find_all('a',attrs={'class':'k-text-subdued k-block'})[i].text
    perusahaan = soup.find_all('a',attrs={'class':'k-text-subdued'})[i].text
    post_submit = soup.find_all('span',attrs={'class':'k-block k-mb-1'})[i].text
    
    temp.append((posisi,lokasi,perusahaan,post_submit))
    if i==1:
        try:
            # NOTE: the class name is still empty, `driver` is not defined in this
            # notebook, and the URL points to jobstreet.co.id rather than kalibrr.id
            lanjut = soup.find('a', class_='').get('href') 
            driver.get('https://www.jobstreet.co.id'+lanjut)
        except Exception:
            break

temp
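
One way to move between pages without a browser driver is to build the page URLs directly. The sketch below assumes that the trailing number in the job-board URL requested earlier is the page number; the notebook does not confirm this. `page_url`, `scrape_titles`, and `scrape_pages` are hypothetical helper names, and `fetch` is any callable that maps a URL to its HTML.

```python
from bs4 import BeautifulSoup

# Prefix of the URL requested at the start of the notebook
BASE_URL = "https://www.kalibrr.id/job-board/te/data/"


def page_url(page):
    """Assumption: the trailing number of the job-board URL is the page number."""
    return f"{BASE_URL}{page}"


def scrape_titles(html):
    """Return the job titles found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", attrs={"class": "k-text-primary-color"})
    return [link.get_text(strip=True) for link in links]


def scrape_pages(fetch, first_page=1, last_page=2):
    """Collect titles from several pages; `fetch` maps a URL to its HTML."""
    titles = []
    for page in range(first_page, last_page + 1):
        titles.extend(scrape_titles(fetch(page_url(page))))
    return titles
```

With `requests` imported, a call such as `scrape_pages(lambda url: requests.get(url, timeout=10).content)` would fetch the first two pages. Passing `fetch` as a parameter also makes it possible to try the logic on saved HTML without touching the network.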





## Creating data frame & Data wrangling

Put the list into a DataFrame.

In [31]:
import pandas as pd

df = pd.DataFrame(temp,columns=('posisi','lokasi','perusahaan','post_submit'))
df.head()

|   | posisi | lokasi | perusahaan | post_submit |
|---|---|---|---|---|
| 0 | Data Engineer | Jakarta Selatan, Indonesia | PT Berlian Sistem Informasi | Posted 12 days ago • Apply before 25 Apr |
| 1 | IT Data Center Monitoring | Jakarta, Indonesia | East Jakarta, Indonesia | Posted 13 days ago • Apply before 29 Oct |
| 2 | Big Data Engineer (DM) | West Jakarta, Indonesia |  | Posted 12 days ago • Apply before 17 Dec |
| 3 | Full Stack Developer (ReactJS & Golang) | Jakarta, Indonesia | VLink Inc | Posted 4 days ago • Apply before 19 Sep |
| 4 | Data Engineer | Jakarta Selatan, Indonesia | Jakarta Selatan, Indonesia | Posted 10 days ago • Apply before 30 Oct |


Do the data cleaning here. (Please replace this markdown with an explanation of what you do for data wrangling.)
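
As one possible cleaning step, the `post_submit` text can be split into separate columns. The sketch below uses two rows copied from the DataFrame preview. The new column names `posted` and `apply_before` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the DataFrame preview above.
df_demo = pd.DataFrame(
    {
        "posisi": ["Data Engineer", "IT Data Center Monitoring"],
        "post_submit": [
            "Posted 12 days ago • Apply before 25 Apr",
            "Posted 13 days ago • Apply before 29 Oct",
        ],
    }
)

# Split once on the bullet separator into two temporary columns.
parts = df_demo["post_submit"].str.split(" • ", n=1, expand=True)

# Drop the fixed prefixes so only the varying part remains.
df_demo["posted"] = parts[0].str.replace("Posted ", "", regex=False)
df_demo["apply_before"] = parts[1].str.replace("Apply before ", "", regex=False)

print(df_demo[["posisi", "posted", "apply_before"]])
```

The deadline has no year in the scraped text, so converting `apply_before` to a real date would need an assumption about which year is meant.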

Data visualisation. (Please replace this markdown with an explanation of what you do for data visualisation.)
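
A simple starting point for a visualisation is the number of postings per location. The sketch below counts the five locations shown in the DataFrame preview. It is only an illustration on a tiny sample, not an analysis of the full data.

```python
import pandas as pd

# Locations copied from the five rows of the DataFrame preview above.
lokasi = pd.Series(
    [
        "Jakarta Selatan, Indonesia",
        "Jakarta, Indonesia",
        "West Jakarta, Indonesia",
        "Jakarta, Indonesia",
        "Jakarta Selatan, Indonesia",
    ],
    name="lokasi",
)

# Number of postings per location, most frequent first.
counts = lokasi.value_counts()
print(counts)
```

On the real data, `df['lokasi'].value_counts()` gives the same kind of summary, and `counts.plot(kind="barh")` draws it as a horizontal bar chart when matplotlib is installed.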

### Implementing your web scraping in the Flask dashboard

- Copy and paste your whole web scraping process into the desired position in `app.py`
- Change the title of the dashboard in `index.html`

## Finishing This Notebook with Your Analysis and Conclusion

You can start by making the data visualisation.


(Put your analysis and conclusion here.)

### Implement it in the web app

- You can create additional analysis from the data.
- Implement it in the dashboard via `app.py` and `index.html`.