# I. Introduction


*Note: This notebook scrapes news results from Google Search for sentiment analysis of coverage of Prabowo-Gibran, the presidential and vice-presidential candidate pair in the 2024 Indonesian election.*

*Prepared by*: **Achmad Dhani & Faris Arief Mawardi**

# II. Import Libraries and Setting Up Functions

In [1]:
import warnings
warnings.filterwarnings("ignore")

## 2.1 Libraries

In [2]:
# Required Libraries
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import pandas as pd
from time import sleep
import numpy as np
import itertools

## 2.2 Functions

In [5]:
def get_element_text(element, value, value_type="class"):
    """Return the text of the first descendant whose attribute matches, or None if there is no match."""
    item = element.find(attrs={value_type: value})
    return item.text if item else None

def get_data(parent_tag_, day_, month_):
    """Build one record per result card; the date comes from the query, with the year fixed to 2023."""
    data = [{
        'Headline': get_element_text(el, 'n0jPhd ynAwRc MBeuO nDgy9d', 'class'),
        'Media': get_element_text(el, 'MgUUmf NUnG9d', 'class'),
        'Date': f"{day_}/{month_}/2023",
        'url': el.find("a", {"class": "WlydOe"})['href']
    } for el in parent_tag_]
    return data

scrape_data= []

# final code
def scraping(month, day):
    """Scrape every result page for a single day in 2023 and append the records to scrape_data."""
    chrome_options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=chrome_options)

    url = f"https://www.google.com/search?q=prabowo+gibran&sca_esv=590380016&biw=1710&bih=953&sxsrf=AM9HkKl2jyc3pgk2OS3bmHnxuxjCN9ANBw%3A1702718802779&source=lnt&tbs=sbd%3A1%2Ccdr%3A1%2Ccd_min%3A{month}%2F{day}%2F2023%2Ccd_max%3A{month}%2F{day}%2F2023&tbm=nws"
    try:
        while url:
            driver.get(url)
            sleep(2)  # give the page time to render
            soup = bs(driver.page_source, "html.parser")  # parse the page's HTML
            main_tag = soup.find_all("div", {"class": "SoaBEf"})
            scrape_data.extend(get_data(main_tag, day, month))
            # The "next" link is missing on the last page (or when there is
            # only one page), which ends the loop.
            next_link = soup.find('a', id="pnnext")
            url = 'https://www.google.com' + next_link['href'] if next_link else None
    finally:
        driver.quit()  # always close the browser, even if a page fails
    
def media_merge(df, div, parent):
    df['portal_media'] = df['portal_media'].apply(lambda x: parent if div in x else x)
    return df

# III. Scraping

## 3.1 Webdriver Setup

In [3]:
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)

# Navigate to the website
driver.get("https://www.google.com")
print(driver.title)

driver.quit()

Google


## 3.2 Building Scraping Code

In [23]:
url='https://www.google.com/search?q=prabowo+gibran&sca_esv=590380016&biw=1710&bih=953&sxsrf=AM9HkKl2jyc3pgk2OS3bmHnxuxjCN9ANBw%3A1702718802779&source=lnt&tbs=sbd%3A1%2Ccdr%3A1%2Ccd_min%3A11%2F15%2F2023%2Ccd_max%3A12%2F15%2F2023&tbm=nws'

In [29]:
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)

scrape_data=[]
driver.get(url)
sleep(3)
html = driver.page_source # Get the page's html content
soup = bs(html, "html.parser") # parse the html content using BeautifulSoup

driver.quit()

## 3.3 Getting all the elements

In [36]:
parent_tag = soup.find_all("div", {"class":"SoaBEf"})
for parent in parent_tag:
    media= parent.find("div", {"class":"MgUUmf NUnG9d"})
    print(media.get_text())
    headline= parent.find("div", {"class":"n0jPhd ynAwRc MBeuO nDgy9d"})
    print(headline.get_text())
    date= parent.find("div", {"class":"OSrXXb rbYSKb LfVVr"})
    print(date.get_text())
    news_link= parent.find("a", {"class":"WlydOe"})['href']
    print(news_link)
    print('')

Bisnis Tempo
Prabowo-Gibran Diklaim Bisa Sejahterakan Petani dan Peternak
2 hari lalu
https://bisnis.tempo.co/read/1809886/prabowo-gibran-diklaim-bisa-sejahterakan-petani-dan-peternak

KOMPAS.com
Ini Jurus Prabowo-Gibran Atasi Kekerasan terhadap ...
2 hari lalu
https://nasional.kompas.com/read/2023/12/15/18061001/ini-jurus-prabowo-gibran-atasi-kekerasan-terhadap-perempuan-dan-anak

detikcom
7 Baliho Prabowo-Gibran di Pati Sobek, Loyalis: Milik Capres Lain Masih Utuh
2 hari lalu
https://www.detik.com/jateng/berita/d-7092148/7-baliho-prabowo-gibran-di-pati-sobek-loyalis-milik-capres-lain-masih-utuh

Finansial Bisnis
Chandra Arie Mundur dari Bank Mandiri Taspen, Gabung TKN Prabowo-Gibran
3 hari lalu
https://finansial.bisnis.com/read/20231214/90/1724032/chandra-arie-mundur-dari-bank-mandiri-taspen-gabung-tkn-prabowo-gibran

Antaranews.com
Repnas Prabowo-Gibran akan luncurkan program Super Mentor
1 hari lalu
https://www.antaranews.com/berita/3874161/repnas-prabowo-gibran-akan-luncurkan-prog

In [37]:
data = [{
    'Headline' : get_element_text(el,'n0jPhd ynAwRc MBeuO nDgy9d', 'class'),
    'Media' : get_element_text(el, 'MgUUmf NUnG9d', 'class'),
    'Date' : get_element_text(el, 'OSrXXb rbYSKb LfVVr', 'class'),
    'url': el.find("a", {"class":"WlydOe"})['href']
} for el in parent_tag]

In [38]:
df= pd.DataFrame(data)

In [39]:
df

| | Headline | Media | Date | url |
|---|---|---|---|---|
| 0 | Prabowo-Gibran Diklaim Bisa Sejahterakan Petan... | Bisnis Tempo | 2 hari lalu | https://bisnis.tempo.co/read/1809886/prabowo-g... |
| 1 | Ini Jurus Prabowo-Gibran Atasi Kekerasan terha... | KOMPAS.com | 2 hari lalu | https://nasional.kompas.com/read/2023/12/15/18... |
| 2 | 7 Baliho Prabowo-Gibran di Pati Sobek, Loyalis... | detikcom | 2 hari lalu | https://www.detik.com/jateng/berita/d-7092148/... |
| 3 | Chandra Arie Mundur dari Bank Mandiri Taspen, ... | Finansial Bisnis | 3 hari lalu | https://finansial.bisnis.com/read/20231214/90/... |
| 4 | Repnas Prabowo-Gibran akan luncurkan program S... | Antaranews.com | 1 hari lalu | https://www.antaranews.com/berita/3874161/repn... |
| 5 | Prabowo: Kalau Rakyat Tak Suka Prabowo-Gibran,... | CNBC Indonesia | 4 hari lalu | https://www.cnbcindonesia.com/news/20231212211... |
| 6 | Nelayan Cilacap Gelar Deklarasi Dukung Prabowo... | detikNews | 2 hari lalu | https://news.detik.com/pemilu/d-7091452/nelaya... |
| 7 | KPU Bakal Tegur Gibran Usai Bakar Semangat Saa... | KOMPAS.com | 3 hari lalu | https://nasional.kompas.com/read/2023/12/14/11... |
| 8 | Viral Bantuan Presiden Dibagikan Tim Prabowo-G... | detikcom | 1 hari lalu | https://www.detik.com/jateng/berita/d-7092364/... |
| 9 | Hasil Survei Prabowo-Gibran Unggul di Bali, Pe... | detikcom | 6 hari lalu | https://www.detik.com/bali/berita/d-7083660/ha... |


In [44]:
link = soup.find('a', id="pnnext")['href'] # to go to the next page
print(link)

/search?q=prabowo+gibran&tbm=nws&tbs=cdr:1,cd_min:11/19/2023,cd_max:12/13/2023,sbd:1&sa=X&sca_esv=590380016&biw=1710&bih=953&gbv=1&sei=fhl5ZYO7IKaVg8UP-pu-sAg
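
The `href` printed above is relative to the site root, which is why the scraping code prefixes it with `https://www.google.com`. A minimal sketch of the same step using the standard library's `urljoin`, offered as an alternative to string concatenation rather than as what the notebook does; the shortened `next_href` value is illustrative only:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.google.com"

# Illustrative, shortened relative link in the same shape as the output above.
next_href = "/search?q=prabowo+gibran&tbm=nws&sa=X"

# urljoin resolves the root-relative path against the base URL.
next_url = urljoin(BASE_URL, next_href)
print(next_url)
```

`urljoin` also copes with an `href` that is already absolute, which plain concatenation would not.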


## 3.4 Web Scrape Code

### Initial Code

```python
chrome_options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=chrome_options)

driver.get(url)
sleep(2)
html = driver.page_source # Get the page's html content
soup = bs(html, "html.parser") # parse the html content using BeautifulSoup
parent_tag = soup.find_all("div", {"class": "SoaBEf"})

scrape_data = [{
    'judul_berita' : get_element_text(el,'n0jPhd ynAwRc MBeuO nDgy9d', 'class'),
    'media' : get_element_text(el, 'MgUUmf NUnG9d', 'class'),
    'tanggal_publikasi' : get_element_text(el, 'OSrXXb rbYSKb LfVVr', 'class'),
    'url': el.find("a", {"class":"WlydOe"})['href']
} for el in parent_tag]
link= soup.find('a', id="pnnext")['href']

while True:
    driver.get('https://www.google.com' + link)
    sleep(2)
    html = driver.page_source # Get the page's html content
    soup = bs(html, "html.parser") # parse the html content using BeautifulSoup
    loop_tag = soup.find_all("div", {"class": "SoaBEf"})
    data = [{
    'judul_berita' : get_element_text(ele,'n0jPhd ynAwRc MBeuO nDgy9d', 'class'),
    'media' : get_element_text(ele, 'MgUUmf NUnG9d', 'class'),
    'tanggal_publikasi' : get_element_text(ele, 'OSrXXb rbYSKb LfVVr', 'class'),
    'url': ele.find("a", {"class":"WlydOe"})['href']
    } for ele in loop_tag]
    scrape_data.extend(data)
    try:
        link= soup.find('a', id="pnnext")['href']
    except TypeError:
        break
driver.quit()
```

### Final Code

```python
def scraping(month, day):
    """Scrape every result page for a single day in 2023 and append the records to scrape_data."""
    chrome_options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options=chrome_options)

    url = f"https://www.google.com/search?q=prabowo+gibran&sca_esv=590380016&biw=1710&bih=953&sxsrf=AM9HkKl2jyc3pgk2OS3bmHnxuxjCN9ANBw%3A1702718802779&source=lnt&tbs=sbd%3A1%2Ccdr%3A1%2Ccd_min%3A{month}%2F{day}%2F2023%2Ccd_max%3A{month}%2F{day}%2F2023&tbm=nws"
    try:
        while url:
            driver.get(url)
            sleep(2)  # give the page time to render
            soup = bs(driver.page_source, "html.parser")  # parse the page's HTML
            main_tag = soup.find_all("div", {"class": "SoaBEf"})
            scrape_data.extend(get_data(main_tag, day, month))
            # The "next" link is missing on the last page (or when there is
            # only one page), which ends the loop.
            next_link = soup.find('a', id="pnnext")
            url = 'https://www.google.com' + next_link['href'] if next_link else None
    finally:
        driver.quit()  # always close the browser, even if a page fails

```

## 3.5 Scraping Google Search

In [51]:
scrape_data=[]

Dates scraped (2023), one query per day:

- November (15th - 30th)
- December (1st - 15th)

In [None]:
# scrape for november
list_days= range(15,31)
for day in list_days:
    scraping(11, day)

In [61]:
# scrape for december
list_days= range(1,16)
for day in list_days:
    scraping(12, day)
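
The two loops above cover 15 November to 15 December 2023 by hard-coding the day range for each month. As a sketch of an alternative, assuming the same window, the standard library can generate every `(month, day)` pair in one pass. `daily_pairs` is a hypothetical helper, not part of the notebook:

```python
from datetime import date, timedelta

def daily_pairs(start, end):
    """Yield (month, day) for every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.month, current.day
        current += timedelta(days=1)

# Same window as the two loops above (assumed, taken from the list of dates).
pairs = list(daily_pairs(date(2023, 11, 15), date(2023, 12, 15)))
print(len(pairs), pairs[0], pairs[-1])
```

Each pair could then be passed to `scraping(month, day)` in a single loop, and month lengths are handled by `datetime` rather than by hand.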

In [62]:
df= pd.DataFrame(scrape_data)
df.shape

(2121, 4)

In [69]:
df.select_dtypes(include='object').describe()

| | Headline | Media | Date | url |
|---|---|---|---|---|
| count | 2121 | 2121 | 2121 | 2121 |
| unique | 2113 | 527 | 31 | 2121 |
| top | Kementerian Komunikasi dan Informatika | detikNews | 6/12/2023 | https://www.cnbcindonesia.com/news/20231115130... |
| freq | 3 | 92 | 130 | 1 |


In [80]:
df['Headline'].duplicated().sum()

8

In [79]:
df[df['Headline'].duplicated()]

| | Headline | Media | Date | url |
|---|---|---|---|---|
| 80 | Kementerian Komunikasi dan Informatika | Kementerian Komunikasi dan Informatika | 16/11/2023 | https://www.kominfo.go.id/content/detail/52963... |
| 95 | Kementerian Komunikasi dan Informatika | Kementerian Komunikasi dan Informatika | 16/11/2023 | https://www.kominfo.go.id/content/detail/52976... |
| 893 | TKN Prabowo-Gibran dan DKPP RI Kumpulkan PPK d... | TubasMedia.com | 2/12/2023 | https://tubasmedia.com/tkn-prabowo-gibran-dan-... |
| 1324 | Perempuan Muda Nahdliyin Deklarasi Dukung Prab... | Viva | 7/12/2023 | https://www.viva.co.id/berita/politik/1665357-... |
| 1496 | Klarifikasi Deklarasi Dukungan Pasangan Prabow... | Detik Peristiwa | 9/12/2023 | https://www.detikperistiwa.com/news-619559/kla... |
| 1497 | Relawan Pedagang Indonesia Maju Deklarasi Duku... | Penasultra.com | 9/12/2023 | https://penasultra.com/relawan-pedagang-indone... |
| 1740 | TKN Prabowo-Gibran Paparkan Kesuksesan Makan S... | HEADLINE KALTIM | 12/12/2023 | https://headlinekaltim.co/tkn-prabowo-gibran-p... |
| 2102 | Sekali Bicara, Gibran Tak Bisa Bedakan Pilpres... | KBA News | 15/12/2023 | https://kbanews.com/hot-news/sekali-bicara-gib... |


In [70]:
df.to_csv('webscrape.csv', sep=';', index=False)

## 3.6 Reading File Test

In [71]:
test= pd.read_csv('webscrape.csv', delimiter=';')

In [72]:
test.shape

(2121, 4)

In [82]:
test.iloc[95]

Headline               Kementerian Komunikasi dan Informatika
Media                  Kementerian Komunikasi dan Informatika
Date                                               16/11/2023
url         https://www.kominfo.go.id/content/detail/52976...
Name: 95, dtype: object

# IV. Data Cleaning

In [41]:
df = pd.read_csv('./media_datasets/webscrape.csv', delimiter=';')

## 4.1 Data Engineering

In [47]:
df

| | judul_berita | portal_media | tanggal_publikasi | url |
|---|---|---|---|---|
| 0 | Live Now! Tim Prabowo-Gibran Beberkan Visi-Pro... | CNBC Indonesia | 15/11/2023 | https://www.cnbcindonesia.com/news/20231115130... |
| 1 | Ratusan Nelayan Jabar Deklarasi Dukung Prabowo... | Politik | 15/11/2023 | https://politik.rmol.id/read/2023/11/15/597394... |
| 2 | Waketum PAN Analogikan Prabowo-Gibran seperti ... | detikNews | 15/11/2023 | https://news.detik.com/pemilu/d-7038627/waketu... |
| 3 | Prabowo-Gibran Nomor Urut 2, Gerindra Jatim: P... | Detik | 15/11/2023 | https://www.detik.com/jatim/berita/d-7038617/p... |
| 4 | Prabowo-Gibran Nomor Urut 2, PAN: Jalan Tengah... | Politik | 15/11/2023 | https://politik.rmol.id/read/2023/11/15/597367... |
| ... | ... | ... | ... | ... |
| 2116 | Mahfud Md ke Pendukung: Jangan Terpengaruh Has... | detikcom | 15/12/2023 | https://www.detik.com/jabar/berita/d-7092431/m... |
| 2117 | Alumni HMI Prihatin Banyak Caleg Tak Pasang Fo... | DRberita.ID | 15/12/2023 | https://www.drberita.id/politik/alumni-hmi-pri... |
| 2118 | Gibran Slated for First Campaign Outside of Java | Jakarta Globe | 15/12/2023 | https://jakartaglobe.id/news/gibran-slated-for... |
| 2119 | Anies Singgung Soal Oposisi, Prabowo Singgung ... | KOMPAS.tv | 15/12/2023 | https://www.kompas.tv/video/469419/anies-singg... |


In [44]:
df.rename(columns={'Headline': 'judul_berita', 'Media': 'portal_media', 'Date': 'tanggal_publikasi'}, inplace=True)

In [50]:
df['tanggal_publikasi'] = pd.to_datetime(df['tanggal_publikasi'], format='%d/%m/%Y')

In [51]:
df

| | judul_berita | portal_media | tanggal_publikasi | url |
|---|---|---|---|---|
| 0 | Live Now! Tim Prabowo-Gibran Beberkan Visi-Pro... | CNBC Indonesia | 2023-11-15 | https://www.cnbcindonesia.com/news/20231115130... |
| 1 | Ratusan Nelayan Jabar Deklarasi Dukung Prabowo... | Politik | 2023-11-15 | https://politik.rmol.id/read/2023/11/15/597394... |
| 2 | Waketum PAN Analogikan Prabowo-Gibran seperti ... | detikNews | 2023-11-15 | https://news.detik.com/pemilu/d-7038627/waketu... |
| 3 | Prabowo-Gibran Nomor Urut 2, Gerindra Jatim: P... | Detik | 2023-11-15 | https://www.detik.com/jatim/berita/d-7038617/p... |
| 4 | Prabowo-Gibran Nomor Urut 2, PAN: Jalan Tengah... | Politik | 2023-11-15 | https://politik.rmol.id/read/2023/11/15/597367... |
| ... | ... | ... | ... | ... |
| 2116 | Mahfud Md ke Pendukung: Jangan Terpengaruh Has... | detikcom | 2023-12-15 | https://www.detik.com/jabar/berita/d-7092431/m... |
| 2117 | Alumni HMI Prihatin Banyak Caleg Tak Pasang Fo... | DRberita.ID | 2023-12-15 | https://www.drberita.id/politik/alumni-hmi-pri... |
| 2118 | Gibran Slated for First Campaign Outside of Java | Jakarta Globe | 2023-12-15 | https://jakartaglobe.id/news/gibran-slated-for... |
| 2119 | Anies Singgung Soal Oposisi, Prabowo Singgung ... | KOMPAS.tv | 2023-12-15 | https://www.kompas.tv/video/469419/anies-singg... |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2121 entries, 0 to 2120
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   judul_berita       2121 non-null   object        
 1   portal_media       2121 non-null   object        
 2   tanggal_publikasi  2121 non-null   datetime64[ns]
 3   url                2121 non-null   object        
dtypes: datetime64[ns](1), object(3)
memory usage: 66.4+ KB


In [12]:
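# Drop rows whose media name is only a section label; copy so the later
# column assignments operate on an independent DataFrame, not a slice.
df = df[~df['portal_media'].isin(['Nasional', 'Politik'])].copy()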

In [15]:
df['portal_media'].value_counts()

portal_media
detikNews          92
Antaranews.com     90
KOMPAS.com         87
Liputan6.com       62
detikcom           54
                   ..
Tribun Lampung      1
suara indonesia     1
SuaraSikka          1
Lumajang Satu       1
DRberita.ID         1
Name: count, Length: 525, dtype: int64

In [16]:
df['portal_media'] = df['portal_media'].apply(lambda x: 'Kompas.com' if 'KOMPAS' in x else x)

In [17]:
df = media_merge(df,'Kompas', 'Kompas.com')

In [18]:
df = media_merge(df,'detik', 'detik.com')

In [19]:
df = media_merge(df, 'Tribun', 'TribunNews')

In [20]:
df = media_merge(df, 'Detik', 'detik.com')

In [21]:
df = media_merge(df, 'Antara News', 'Antaranews.com')

In [22]:
df = media_merge(df, 'ANTARA', 'Antaranews.com')

In [23]:
df = media_merge(df, 'JPNN', 'JPNN.com')
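
Each `media_merge` call above maps one substring to one parent name. A sketch of the same idea as a single ordered rule list in plain Python; `MERGE_RULES` and `normalize_media` are hypothetical names, and the rules simply restate the calls above in the same order:

```python
# (substring, parent name) pairs, in the order the notebook applies them.
MERGE_RULES = [
    ("KOMPAS", "Kompas.com"),
    ("Kompas", "Kompas.com"),
    ("detik", "detik.com"),
    ("Tribun", "TribunNews"),
    ("Detik", "detik.com"),
    ("Antara News", "Antaranews.com"),
    ("ANTARA", "Antaranews.com"),
    ("JPNN", "JPNN.com"),
]

def normalize_media(name, rules=MERGE_RULES):
    """Apply every rule in order, as the chained media_merge calls do."""
    for fragment, parent in rules:
        if fragment in name:
            name = parent
    return name

print(normalize_media("KOMPAS.tv"), normalize_media("detikNews"))
```

In the notebook this could be applied with `df['portal_media'].apply(normalize_media)`. Substring matching is deliberately broad: "Detik Peristiwa" (detikperistiwa.com) is folded into `detik.com`, as the duplicate check in 4.2 shows, which may or may not be intended.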

## 4.2 Double Checking

In [25]:
df.select_dtypes(include='object').describe()

| | judul_berita | portal_media | url |
|---|---|---|---|
| count | 1986 | 1986 | 1986 |
| unique | 1978 | 456 | 1986 |
| top | Kementerian Komunikasi dan Informatika | TribunNews | https://www.cnbcindonesia.com/news/20231115130... |
| freq | 3 | 196 | 1 |


In [26]:
df[df['judul_berita'].duplicated()]

| | judul_berita | portal_media | tanggal_publikasi | url |
|---|---|---|---|---|
| 80 | Kementerian Komunikasi dan Informatika | Kementerian Komunikasi dan Informatika | 2023-11-16 | https://www.kominfo.go.id/content/detail/52963... |
| 95 | Kementerian Komunikasi dan Informatika | Kementerian Komunikasi dan Informatika | 2023-11-16 | https://www.kominfo.go.id/content/detail/52976... |
| 893 | TKN Prabowo-Gibran dan DKPP RI Kumpulkan PPK d... | TubasMedia.com | 2023-12-02 | https://tubasmedia.com/tkn-prabowo-gibran-dan-... |
| 1324 | Perempuan Muda Nahdliyin Deklarasi Dukung Prab... | Viva | 2023-12-07 | https://www.viva.co.id/berita/politik/1665357-... |
| 1496 | Klarifikasi Deklarasi Dukungan Pasangan Prabow... | detik.com | 2023-12-09 | https://www.detikperistiwa.com/news-619559/kla... |
| 1497 | Relawan Pedagang Indonesia Maju Deklarasi Duku... | Penasultra.com | 2023-12-09 | https://penasultra.com/relawan-pedagang-indone... |
| 1740 | TKN Prabowo-Gibran Paparkan Kesuksesan Makan S... | HEADLINE KALTIM | 2023-12-12 | https://headlinekaltim.co/tkn-prabowo-gibran-p... |
| 2102 | Sekali Bicara, Gibran Tak Bisa Bedakan Pilpres... | KBA News | 2023-12-15 | https://kbanews.com/hot-news/sekali-bicara-gib... |
2102,"Sekali Bicara, Gibran Tak Bisa Bedakan Pilpres...",KBA News,2023-12-15,https://kbanews.com/hot-news/sekali-bicara-gib...


In [27]:
df['portal_media'].value_counts().reset_index().head(5)

| | portal_media | count |
|---|---|---|
| 0 | TribunNews | 196 |
| 1 | detik.com | 194 |
| 2 | Kompas.com | 171 |
| 3 | Antaranews.com | 150 |
| 4 | Liputan6.com | 62 |


In [29]:
df.to_csv('./media_datasets/cleaned_media.csv', index=False)

# V. Conclusion

Selenium and BeautifulSoup were used to scrape Google Search news results for "prabowo gibran" from 15 November to 15 December 2023, yielding 2,121 articles. Each query was restricted to a single day rather than a range of days, so the publication date is taken from the query itself. This avoids parsing the relative labels Google displays, such as "1 day ago" or "2 days ago" for results from the past week and "1 week ago", "2 weeks ago" or "1 month ago" for older ones. During cleaning, rows labeled "Nasional" or "Politik" in the media column were dropped, leaving 1,986 articles, and the sub-portals of major outlets (for example detikNews and detikcom) were merged under their parent names.
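
The per-day restriction described above comes from setting `cd_min` and `cd_max` to the same date in the `tbs` query parameter. A sketch of building such a URL with `urllib.parse.urlencode`, keeping only the parameters that carry the query, sort order, date range and news tab. `build_day_url` is a hypothetical helper, and whether Google accepts the URL without the session parameters used in the notebook (`sxsrf`, `sca_esv`, and so on) is an assumption:

```python
from urllib.parse import urlencode

def build_day_url(query, month, day, year=2023):
    """Return a Google news-search URL limited to one calendar day."""
    day_str = f"{month}/{day}/{year}"
    params = {
        "q": query,
        # Same tbs value as in the notebook's URL: cd_min == cd_max
        # pins the custom date range to a single day.
        "tbs": f"sbd:1,cdr:1,cd_min:{day_str},cd_max:{day_str}",
        "tbm": "nws",  # news tab
    }
    return "https://www.google.com/search?" + urlencode(params)

url = build_day_url("prabowo gibran", 11, 15)
print(url)
```

`urlencode` percent-encodes the colons, commas and slashes, producing the same `tbs` encoding that appears in the notebook's hand-written URL.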