Before using Selenium WebDriver, install the required software:
- [Chrome browser](https://www.google.com/intl/id_id/chrome/) and a [chromedriver](https://chromedriver.chromium.org/downloads) release that matches the browser version, or
- [Firefox browser](https://www.mozilla.org/id/firefox/new/) and a [geckodriver](https://github.com/mozilla/geckodriver/releases) release that matches the browser version
- Then install Selenium using pip:
```
pip install selenium
```

Define the URL to crawl, e.g. Google Search

In [1]:
url = 'https://www.google.com/'

Start a Chrome instance and load the URL. An automated Chrome window will open showing that page.

In [3]:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)

Inspect the page to find the element (DOM node) we want to interact with. For example, we want to fill the search box with a value, e.g. **"sensus penduduk 2020"**. To exclude results from a specific site, Google Search supports the **-site:** operator.

In [4]:
driver.find_element_by_name('q').send_keys('sensus penduduk 2020 -site:bps.go.id')
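
The search string can also be assembled programmatically. The sketch below is a minimal illustration and not part of the original notebook: `build_query` and `search_url` are hypothetical helper names, and the `https://www.google.com/search?q=...` address form is an assumption about how a search can be opened directly by URL.

```python
from urllib.parse import urlencode


def build_query(terms, excluded_sites=()):
    """Append one -site: operator per excluded domain to the search terms."""
    parts = [terms] + ['-site:' + site for site in excluded_sites]
    return ' '.join(parts)


def search_url(query):
    """Build a search URL for the query (assumed URL form, see the text above)."""
    return 'https://www.google.com/search?' + urlencode({'q': query})


query = build_query('sensus penduduk 2020', ['bps.go.id'])
print(query)
print(search_url(query))
```

The resulting `query` can be passed to `send_keys` exactly like the literal string used in the cell above.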

Submit the form to see the results

In [5]:
driver.find_element_by_name('f').submit()

From the results, we want the URL of each item. We can use a CSS selector: ```.mw .g .r > a``` matches the title link of each item.

In [6]:
google_result_urls = []
for el in driver.find_elements_by_css_selector('.mw .g .r > a'):
    google_result_urls.append(el.get_attribute('href'))

Let's see the results from the first page

In [7]:
google_result_urls

['https://www.tribunnews.com/nasional/2020/03/11/segera-isi-sensus-penduduk-online-2020-pastikan-anda-mengisi-dengan-benar-di-situs-sensusbpsgoid',
 'https://www.tribunnews.com/nasional/2020/03/12/cara-isi-data-sensus-penduduk-online-2020-di-sensusbpsgoid-cuma-5-menit-per-anggota-keluarga',
 'https://www.tagar.id/tata-cara-mengisi-sensus-penduduk-online-2020',
 'https://indonesia.go.id/layanan/kependudukan/ekonomi/cara-mengikuti-sensus-penduduk-online-2020',
 'https://katadata.co.id/berita/2020/02/17/sensus-penduduk-2020-begini-cara-pengisiannya-secara-online',
 'https://www.kompas.com/tren/read/2020/03/04/193000265/sensus-penduduk-online-2020-ini-tahapan-cara-daftar-dan-isi-data',
 'https://money.kompas.com/read/2020/02/26/170027726/cepat-dan-mudah-begini-prosedur-ikut-sensus-penduduk-online-2020?page=all',
 'https://www.liputan6.com/tag/sensus-penduduk-2020']
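
Since the query excluded `bps.go.id`, a quick sanity check is to look at the domains of the collected URLs. The sketch below uses only the standard library and a few of the URLs printed above; `domain_of` and `is_excluded` are hypothetical helper names introduced for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# A few of the URLs from the first results page shown above
first_page_urls = [
    'https://www.tribunnews.com/nasional/2020/03/11/segera-isi-sensus-penduduk-online-2020-pastikan-anda-mengisi-dengan-benar-di-situs-sensusbpsgoid',
    'https://www.tribunnews.com/nasional/2020/03/12/cara-isi-data-sensus-penduduk-online-2020-di-sensusbpsgoid-cuma-5-menit-per-anggota-keluarga',
    'https://www.tagar.id/tata-cara-mengisi-sensus-penduduk-online-2020',
    'https://indonesia.go.id/layanan/kependudukan/ekonomi/cara-mengikuti-sensus-penduduk-online-2020',
]


def domain_of(url):
    """Return the host part of a URL, e.g. 'www.tagar.id'."""
    return urlparse(url).netloc


def is_excluded(url, site):
    """True if the URL is hosted on `site` or on one of its subdomains."""
    domain = domain_of(url)
    return domain == site or domain.endswith('.' + site)


# How many results come from each domain
domain_counts = Counter(domain_of(u) for u in first_page_urls)
# Any result that slipped through the -site: filter
leaked = [u for u in first_page_urls if is_excluded(u, 'bps.go.id')]

print(domain_counts)
print(leaked)
```

Matching on the host rather than on the whole URL matters here, because some article paths mention the excluded site by name.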

Google Search splits the results across several pages, so to get all of them we have to **click** the **next** button, which has the id **pnnext**. After clicking it, we can collect the URL of each item on the new page.

In [8]:
driver.find_element_by_id('pnnext').click()
for el in driver.find_elements_by_css_selector('.mw .g .r > a'):
    google_result_urls.append(el.get_attribute('href'))

Let's see the results from the first and second pages

In [9]:
google_result_urls

['https://www.tribunnews.com/nasional/2020/03/11/segera-isi-sensus-penduduk-online-2020-pastikan-anda-mengisi-dengan-benar-di-situs-sensusbpsgoid',
 'https://www.tribunnews.com/nasional/2020/03/12/cara-isi-data-sensus-penduduk-online-2020-di-sensusbpsgoid-cuma-5-menit-per-anggota-keluarga',
 'https://www.tagar.id/tata-cara-mengisi-sensus-penduduk-online-2020',
 'https://indonesia.go.id/layanan/kependudukan/ekonomi/cara-mengikuti-sensus-penduduk-online-2020',
 'https://katadata.co.id/berita/2020/02/17/sensus-penduduk-2020-begini-cara-pengisiannya-secara-online',
 'https://www.kompas.com/tren/read/2020/03/04/193000265/sensus-penduduk-online-2020-ini-tahapan-cara-daftar-dan-isi-data',
 'https://money.kompas.com/read/2020/02/26/170027726/cepat-dan-mudah-begini-prosedur-ikut-sensus-penduduk-online-2020?page=all',
 'https://www.liputan6.com/tag/sensus-penduduk-2020',
 'https://www.cnbcindonesia.com/news/20200215110504-4-138104/perhatian-hari-ini-sensus-penduduk-2020-online-dimulai',
 'https:

Keep iterating over the pages until the last one. The last page is the one with no **next** button, in which case ```find_element_by_id``` raises ```NoSuchElementException```. We can catch it in an ```except``` block to stop the loop.

In [10]:
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        # Raises NoSuchElementException on the last page, where there is no "next" button
        driver.find_element_by_id('pnnext').click()
    except NoSuchElementException:
        break
    # Collect the result links of the page we just moved to
    for el in driver.find_elements_by_css_selector('.mw .g .r > a'):
        google_result_urls.append(el.get_attribute('href'))
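
The same "advance until an exception signals the end" pattern can be tried without a browser. The following is a minimal sketch under stated assumptions: `collect_all_pages`, `FakePager` and the `max_pages` cap are hypothetical names introduced here, and `LookupError` merely stands in for `NoSuchElementException`.

```python
def collect_all_pages(read_page, go_next, stop_on, max_pages=100):
    """Read the current page, then keep advancing until go_next raises stop_on."""
    items = list(read_page())
    # The cap guards against looping forever if the end is never signalled
    for _ in range(max_pages - 1):
        try:
            go_next()
        except stop_on:
            break
        items.extend(read_page())
    return items


class FakePager:
    """Stand-in for a browser: a fixed list of pages with no 'next' after the last."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def read_page(self):
        return self.pages[self.index]

    def go_next(self):
        if self.index + 1 >= len(self.pages):
            raise LookupError('no next button')
        self.index += 1


pager = FakePager([['a', 'b'], ['c'], ['d', 'e']])
all_items = collect_all_pages(pager.read_page, pager.go_next, LookupError)
print(all_items)
```

With Selenium, `read_page` would wrap the `find_elements_by_css_selector` call, `go_next` would wrap the click on **pnnext**, and `stop_on` would be `NoSuchElementException`. Unlike the cell above, this helper also reads the page it starts on.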

Let's see all the results

In [11]:
google_result_urls

['https://www.tribunnews.com/nasional/2020/03/11/segera-isi-sensus-penduduk-online-2020-pastikan-anda-mengisi-dengan-benar-di-situs-sensusbpsgoid',
 'https://www.tribunnews.com/nasional/2020/03/12/cara-isi-data-sensus-penduduk-online-2020-di-sensusbpsgoid-cuma-5-menit-per-anggota-keluarga',
 'https://www.tagar.id/tata-cara-mengisi-sensus-penduduk-online-2020',
 'https://indonesia.go.id/layanan/kependudukan/ekonomi/cara-mengikuti-sensus-penduduk-online-2020',
 'https://katadata.co.id/berita/2020/02/17/sensus-penduduk-2020-begini-cara-pengisiannya-secara-online',
 'https://www.kompas.com/tren/read/2020/03/04/193000265/sensus-penduduk-online-2020-ini-tahapan-cara-daftar-dan-isi-data',
 'https://money.kompas.com/read/2020/02/26/170027726/cepat-dan-mudah-begini-prosedur-ikut-sensus-penduduk-online-2020?page=all',
 'https://www.liputan6.com/tag/sensus-penduduk-2020',
 'https://www.cnbcindonesia.com/news/20200215110504-4-138104/perhatian-hari-ini-sensus-penduduk-2020-online-dimulai',
 'https:

Next, we want to get the content of each URL. As an example, we take the first result.

In [12]:
url_content = google_result_urls[0]
url_content

'https://www.tribunnews.com/nasional/2020/03/11/segera-isi-sensus-penduduk-online-2020-pastikan-anda-mengisi-dengan-benar-di-situs-sensusbpsgoid'

We can reuse the initial automated Chrome instance, or create a new one by instantiating ```webdriver.Chrome()```

In [None]:
driver2 = webdriver.Chrome()
driver2.get(url_content)

After that, we can get the full text content of the page

In [16]:
content = driver2.find_element_by_tag_name('body').text
content

'Jumat, 13 Maret 2020\nCari\nNetwork\nLogin\n\nTribun\nHome\nNasional\nInternasional\nRegional\nMetropolitan\nSains\nPendidikan\nHome » Nasional » Umum\nSensus Penduduk 2020\nSegera Isi Sensus Penduduk Online 2020, Pastikan Anda Mengisi dengan Benar di Situs sensus.bps.go.id\nRabu, 11 Maret 2020 23:11 WIB\nINSTAGRAM/@bps_statistics\nSegera Isi Sensus Penduduk Online 2020, Pastikan Anda Mengisi dengan Benar di Situs sensus.bps.go.id \nTRIBUNNEWS.COM - Pemerintah melalui Badan Pusat Statistik (BPS) kembali menggelar Sensus Penduduk 2020.\nKabar gembiranya, Sensus Penduduk 2020 dilakukan melalui online lewat situs resmi BPS: www.sensus.bps.go.id.\nSensus Penduduk 2020 lewat online telah berlangsung sejak pertengahan Februari dan akan berakhir pada 31 Maret 2020.\nSensus Penduduk 2020 lewat online dilakukan secara mandiri oleh masyarakat dan dapat diakses kapan saja dan di mana saja.\nBaca: Cara Isi Sensus Penduduk Online 2020 di Laman sensus.bps.go.id Tanpa Eror, Gunakan LTE atau 4G\nBaik

After the content has been retrieved, close the browser and end the session

In [17]:
driver2.quit()

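To get the content of all URLs, we can iterate over them. Here a pool of 4 threads crawls them in parallel, with each call opening its own browser instance.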

In [19]:
from multiprocessing.dummy import Pool as ThreadPool

def crawl(g_url):
    # Each call opens its own browser so threads do not share a driver
    driver2 = webdriver.Chrome()
    try:
        driver2.get(g_url)
        content = driver2.find_element_by_tag_name('body').text
    finally:
        # Always close the browser, even if loading the page fails
        driver2.quit()
    return (g_url, content)

pool = ThreadPool(4)
url_content_list = pool.map(crawl, google_result_urls)
pool.close()
pool.join()
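
With many URLs, a single page that fails to load would abort the whole `pool.map` call. The sketch below shows one possible way to tolerate that, using the standard library's `concurrent.futures` instead of `multiprocessing.dummy`. It is an illustration only: `crawl_all`, `safe_fetch` and `fake_fetch` are hypothetical names, the `example.com` URLs are placeholders, and `fake_fetch` stands in for a real browser-based function.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, workers=4):
    """Fetch every URL in a thread pool; a failed URL yields (url, None)."""

    def safe_fetch(url):
        try:
            return (url, fetch(url))
        except Exception:
            # Keep going: record the failure instead of aborting the whole run
            return (url, None)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(safe_fetch, urls))


def fake_fetch(url):
    """Stand-in for a real crawler function; fails for one placeholder URL."""
    if 'broken' in url:
        raise RuntimeError('page failed to load')
    return 'content of ' + url


results = crawl_all(
    ['https://example.com/a', 'https://example.com/broken', 'https://example.com/b'],
    fake_fetch,
)
print(results)
```

Failed pages can then be filtered out with `[row for row in results if row[1] is not None]` before further processing.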

Check the result

In [20]:
url_content_list

[('https://www.tribunnews.com/nasional/2020/03/11/segera-isi-sensus-penduduk-online-2020-pastikan-anda-mengisi-dengan-benar-di-situs-sensusbpsgoid',
  'Jumat, 13 Maret 2020\nCari\nNetwork\nLogin\n\nTribun\nHome\nNasional\nInternasional\nRegional\nMetropolitan\nSains\nPendidikan\nHome » Nasional » Umum\nSensus Penduduk 2020\nSegera Isi Sensus Penduduk Online 2020, Pastikan Anda Mengisi dengan Benar di Situs sensus.bps.go.id\nRabu, 11 Maret 2020 23:11 WIB\nINSTAGRAM/@bps_statistics\nSegera Isi Sensus Penduduk Online 2020, Pastikan Anda Mengisi dengan Benar di Situs sensus.bps.go.id \nTRIBUNNEWS.COM - Pemerintah melalui Badan Pusat Statistik (BPS) kembali menggelar Sensus Penduduk 2020.\nKabar gembiranya, Sensus Penduduk 2020 dilakukan melalui online lewat situs resmi BPS: www.sensus.bps.go.id.\nSensus Penduduk 2020 lewat online telah berlangsung sejak pertengahan Februari dan akan berakhir pada 31 Maret 2020.\nSensus Penduduk 2020 lewat online dilakukan secara mandiri oleh masyarakat dan

In [25]:
# Flatten each page's text onto a single line by replacing newlines and tabs with spaces
url_content_list = [(page_url, text.replace('\n', ' ').replace('\t', ' ')) for page_url, text in url_content_list]
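
Replacing each newline or tab with a space leaves double spaces wherever the page had consecutive line breaks. If that matters, one alternative is to collapse every run of whitespace into a single space. This is a minimal sketch, with `normalize_whitespace` as a hypothetical helper name and a short made-up sample in the style of the page text above.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops empties
    return ' '.join(text.split())


sample = 'Cari\nNetwork\nLogin\n\nTribun\tHome'
print(normalize_whitespace(sample))
```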

In [27]:
# Write one page per line; the explicit encoding avoids platform-dependent defaults
with open('sensus_penduduk_news.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(content for _, content in url_content_list))
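
The text file above keeps only the page content, so the link between a line and its source URL is lost. If the URLs are needed later, one option is to store both columns as tab-separated values. The sketch below assumes the content has already been flattened to one line per page; `save_rows` and `load_rows` are hypothetical helpers, and the rows and the `.tsv` file name are placeholders.

```python
import csv
import os
import tempfile


def save_rows(path, rows):
    """Write (url, content) pairs as tab-separated lines."""
    with open(path, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f, delimiter='\t').writerows(rows)


def load_rows(path):
    """Read the pairs back as a list of tuples."""
    with open(path, encoding='utf-8', newline='') as f:
        return [tuple(row) for row in csv.reader(f, delimiter='\t')]


# Placeholder rows standing in for url_content_list
rows = [
    ('https://example.com/a', 'first page text'),
    ('https://example.com/b', 'second page text'),
]
path = os.path.join(tempfile.mkdtemp(), 'sensus_penduduk_news.tsv')
save_rows(path, rows)
restored = load_rows(path)
print(restored == rows)
```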