<a href="https://colab.research.google.com/github/felkira/unifact.github.io/blob/modeling/scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# install missing modules
!pip install grequests
!pip install ipython-autotime
!pip install langdetect

# import all modules that will be used
from bs4 import BeautifulSoup as bs
from math import ceil
from time import sleep
from collections import Counter
from langdetect import detect_langs
import pandas as pd
import numpy as np
import re, string, grequests, gdown

'''
the module import and cleanup below are used to
suppress the warning messages shown when importing the grequests module
'''
import sys
del sys.modules["grequests"]
del grequests
del sys.modules["gevent.monkey"]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# load autotime generator and import grequests module again
# then use grequests once to ensure that the next output is clean
%load_ext autotime
import grequests
grequests.map([grequests.get('https://www.google.com')])


sys.settrace() should not be used when the debugger is being used.
This may cause the debugger to stop working correctly.
If this is needed, please check: 
http://pydev.blogspot.com/2007/06/why-cant-pydev-debugger-work-with.html
to see how to restore the debug tracing back correctly.
Call Location:
  File "/usr/local/lib/python3.8/dist-packages/gevent/threadpool.py", line 163, in _before_run_task
    _sys.settrace(_get_thread_trace())


sys.settrace() should not be used when the debugger is being used.
This may cause the debugger to stop working correctly.
If this is needed, please check: 
http://pydev.blogspot.com/2007/06/why-cant-pydev-debugger-work-with.html
to see how to restore the debug tracing back correctly.
Call Location:
  File "/usr/local/lib/python3.8/dist-packages/gevent/threadpool.py", line 168, in _after_run_task
    _sys.settrace(None)



[<Response [200]>]

time: 181 ms (started: 2022-12-21 10:31:49 +00:00)


# **Core Scraping Functions**

The `get_urls` function builds the URL of each index page, while the `get_data` function fetches the content behind each collected URL. I use the grequests module to send the HTTP requests because it supports parallel requests, which makes it as fast as Lightning McQueen. The parallel requests are split into batches: each batch contains at most 50 requests and is followed by a 2.5 s sleep. I'm a scraper, not a defacer: a large burst of simultaneous requests can bring the target server down, and that would be evil, right?
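The batch boundaries can be checked without sending any request. The following is a minimal sketch, assuming a hypothetical helper `batch_bounds` (not part of the notebook) that reproduces the slicing arithmetic used in `get_data`:

```python
from math import ceil

def batch_bounds(total, batch_size=50):
    """Return the (start, end) slice bounds of each batch of at most batch_size items."""
    parts = ceil(total / batch_size)  # number of batches
    if parts == 0:
        return []
    step = total / parts  # average batch size, may be fractional
    return [(ceil(i * step), ceil((i + 1) * step)) for i in range(parts)]

bounds = batch_bounds(492)  # 492 is the number of hoax index pages used later
print(len(bounds))          # 10
print(bounds[:3])           # [(0, 50), (50, 99), (99, 148)]
```

Because the step is fractional, the batches are not all the same size, which is why the progress messages report parts such as 1 - 50 followed by 51 - 99.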

In [None]:
def get_urls(url, pages):
  # build the url of every index page, from page 1 to the last page
  return [f'{url}{i}/' for i in range(1, pages+1)]

def get_data(urls):
  part = ceil(len(urls)/50) # number of batches, each with at most 50 requests
  res = []
  for i in range(part):
    start = ceil(i*(len(urls)/part))
    end = ceil((i+1)*(len(urls)/part))
    batch = grequests.map([grequests.get(url) for url in urls[start:end]])
    sleep(2.5)
    # stop the fetch process as soon as any request fails (grequests.map returns None for a request that raised an error)
    for r in batch:
      if r is None or r.status_code != 200:
        raise RuntimeError("Found a bad request. Forcibly stopping the process!")
    print(f"Processed Part: {start + 1} - {end}")
    res += batch
  return res

time: 2.56 ms (started: 2022-12-21 10:31:49 +00:00)


# **Hoax Scraping Part**

I use [turnbackhoax](https://turnbackhoax.id) to get the hoax data, since this website provides it.

In [None]:
hoax_urls = get_urls('https://turnbackhoax.id/page/', 492)

time: 707 µs (started: 2022-12-21 10:40:33 +00:00)


In [None]:
hoax_res = get_data(hoax_urls)

Processed Part: 1 - 50
Processed Part: 51 - 99
Processed Part: 100 - 148
Processed Part: 149 - 197
Processed Part: 198 - 246
Processed Part: 247 - 296
Processed Part: 297 - 345
Processed Part: 346 - 394
Processed Part: 395 - 443
Processed Part: 444 - 492
time: 1min 45s (started: 2022-12-21 10:40:37 +00:00)


# **Let's Scrape the Data!**

We can use BeautifulSoup, not to cook a soup, but to extract the desired content from articles. For now, I use this module to extract the URL of each article on each page. With the collected URLs, I can then extract more details of each article.

In [None]:
hoax_links = []
for r in hoax_res:
  sp = bs(r.content, 'html.parser') # start to scrape
  data = sp.find_all('header', {'class': 'mh-loop-header'})
  for i in data:
    title = i.find('a').text
    link = i.find('a')['href']
    # I don't want the articles in English, so I skip every article with '[FALSE]' in its title
    # ('ISINFORMASI]' matches both '[MISINFORMASI]' and '[DISINFORMASI]')
    if ('[SALAH]' in title or '[HOAX]' in title or 'ISINFORMASI]' in title) and '[FALSE]' not in title:
      hoax_links.append(link)

# count the number of article links that were collected successfully
print(f'Number of fetched data: {len(hoax_links)}')

Number of fetched data: 8050
time: 21.1 s (started: 2022-12-21 10:42:42 +00:00)
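The title filter used above can be isolated into a small predicate, which makes it easy to try on individual titles. This is only a sketch: the function name `is_indonesian_hoax_title` and the sample titles are hypothetical, and only the label strings come from the cell above.

```python
def is_indonesian_hoax_title(title):
    """True when the title carries an Indonesian hoax label and no English '[FALSE]' label."""
    # 'ISINFORMASI]' matches both '[MISINFORMASI]' and '[DISINFORMASI]'
    labels = ('[SALAH]', '[HOAX]', 'ISINFORMASI]')
    return any(label in title for label in labels) and '[FALSE]' not in title

# hypothetical sample titles, for illustration only
print(is_indonesian_hoax_title('[SALAH] Contoh judul artikel'))      # True
print(is_indonesian_hoax_title('[DISINFORMASI] Contoh judul lain'))  # True
print(is_indonesian_hoax_title('[FALSE] Example English title'))     # False
```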


In [None]:
hoax_content = get_data(hoax_links)

Well, we have the URLs now. Let's extract more details of each article, such as title, narrative, date, and category. The category matters because hoaxes fall into seven categories; you can refer to [FirstDraft (2017)](https://firstdraftnews.org/articles/fake-news-complicated/) for more insight about hoaxes or fake news. By the way, the articles on turnbackhoax.id are quite difficult to scrape, so I must use some filters 😥.

![image.png](https://firstdraftnews.org/wp-content/uploads/2017/02/FDN_7Types_Misinfo-01-1024x576.jpg)

In [None]:
hoax_titles = []
hoax_narratives = []
hoax_dates = []
hoax_categories = []

def filter(i, j, k, nar, exp, narrative):
  if k != None and '@' not in k and '#' not in k and 'dua klaim itu sudah masuk' not in k:
    nar_idx = narrative.index(nar)
    exp_idx = narrative.index(exp)
    hoax_narratives.append(' '.join(narrative[nar_idx+1:exp_idx]))
    hoax_titles.append(i)
    hoax_dates.append(j)
    hoax_categories.append(k)

for i in hoax_content:
  sp = bs(i.content, 'html.parser')
  data = sp.find_all('article')
  for i in data:
    title = i.find('h1').text.lower()
    narrative = i.find('div', {'class': 'entry-content'}).getText(separator=" ").lower().split()
    date = i.find('span', {'class': 'entry-meta-date updated'}).text
    category = i.find(text=re.compile(
    '''(
      : satir|: parodi|: misleading|: konten yang menyesatkan|: fabricated|: konten palsu|: konten|: impostor|
      : false context|: konteks yang salah|: konten yang salah|: manipulated|: konten yang dimanipulasi|: imposter|
      : konten tiruan|: false connection|: koneksi yang salah|
      satir|satire.|parodi.|misleading content.|konten yang menyesatkan.|fabricated content.|konten palsu.|impostor content.|
      false context.|konteks yang salah.|konten yang salah.|manipulated content.|konten yang dimanipulasi.|imposter content.|
      konten tiruan.|false connection.|koneksi yang salah.
    )'''
    , re.IGNORECASE))

    # try each known spelling of the narrative/explanation markers in turn
    markers = [('narasi:', 'penjelasan:'), ('(narasi):', '(penjelasan):'), ('[narasi]:', '[penjelasan]:'),
               ('narasi :', 'penjelasan :'), ('narasi:', 'penjelasan')]
    for nar, exp in markers:
      try:
        filter(title, date, category, nar, exp, narrative)
        break
      except ValueError: # a marker is missing from the narrative, so try the next pair
        continue
    else:
      # no marker pair was found, so fall back to the title as the narrative
      if category != None and '@' not in category and '#' not in category and 'dua klaim itu sudah masuk' not in category:
        hoax_narratives.append(title)
        hoax_titles.append(title)
        hoax_dates.append(date)
        hoax_categories.append(category)

time: 3min 4s (started: 2022-12-21 11:27:54 +00:00)
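The core idea of the cell above is to keep only the words between a "narasi" marker and a "penjelasan" marker. A minimal, self-contained sketch of that idea follows; the helper `extract_between` and the sample sentence are hypothetical and not part of the notebook.

```python
def extract_between(tokens, marker_pairs):
    """Return the words between the first marker pair found in tokens, or None."""
    for start, end in marker_pairs:
        try:
            s = tokens.index(start)
            e = tokens.index(end)
        except ValueError:
            continue  # one of the markers is missing, try the next pair
        return ' '.join(tokens[s + 1:e])
    return None

pairs = [('narasi:', 'penjelasan:'), ('(narasi):', '(penjelasan):')]
# hypothetical article text, already lowercased and split on whitespace
tokens = 'sumber: facebook narasi: bumi itu datar penjelasan: klaim tersebut salah'.split()
print(extract_between(tokens, pairs))  # bumi itu datar
```

Note that a marker containing a space, such as `'narasi :'`, can never equal a single token produced by `split()`, so a pair written that way cannot match a whitespace-split token list.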


In [None]:
print(hoax_narratives[:5])

['[salah] video juru kamera terlihat lebih cepat dari peserta lomba lari', '[salah] pernyataan ceo pfizer: “saya mengundurkan diri dan vaksin mrna belum terbukti aman”', '[salah] “video perkumpulan lgbt di sicc sentul”', '[salah] dijerat perkara terselubung bawaslu resmi tetapkan anies pasal berat', '“120 penonton non-muslim yang datang untuk menyaksikan pertandingan sepak bola piala dunia fifa 2022 di qatar telah memeluk islam.” = = =']
time: 731 µs (started: 2022-12-21 11:30:59 +00:00)


In [None]:
for i in range(len(hoax_narratives)):
  if hoax_narratives[i] == "":
    hoax_narratives[i] = hoax_titles[i]

time: 2.51 ms (started: 2022-12-21 11:30:59 +00:00)


# **Boring Time!**

It's time to clean the data, the hardest part of NLP (I mean Indonesian NLP).

In [None]:
def clean_text(text):
    text = text.lower() # make the text lowercase
    text = re.sub(r'(\S+|)(http|blogspot|@|mail|dot)(\S+|)', '', text) # remove emails and weird urls like http://blabla[dot]com
    text = " ".join(['' if ('.' in w and '/' in w) else w for w in text.split()]) # remove every word that contains both a dot and a slash (urls)
    text = re.sub(r'[.,-]', ' ', text) # replace dots, commas, and hyphens with spaces (only Indotizen can use them this weirdly)
    text = "".join([i for i in text if ord(i) < 128]) # remove non-ascii characters (this also removes all emoticons)
    text = re.sub(r'\[.*?\]', '', text) # remove text in square brackets
    text = re.sub(r'<.*?>+', '', text) # remove html tags
    text = " ".join([re.sub('[%s]' % re.escape(string.punctuation), ' ', w) if '#' not in w else w for w in text.split()]) # replace punctuation with spaces, except inside hashtags
    text = " ".join(['' if w.isdigit() or (re.findall(r'\d+', w) and len(re.findall(r'\d+', w)[0]) >= 7) else w for w in text.split()]) # remove pure numbers and words with a long run of digits (7 or more), aka WA urls
    return text.strip() # remove leading and trailing whitespace

for i in range(len(hoax_narratives)):
  hoax_narratives[i] = clean_text(hoax_narratives[i])
  hoax_titles[i] = clean_text(hoax_titles[i])

time: 2.24 s (started: 2022-12-21 11:31:27 +00:00)


In [None]:
print(hoax_narratives[:5])

['video juru kamera terlihat lebih cepat dari peserta lomba lari', 'pernyataan ceo pfizer saya mengundurkan diri dan vaksin mrna belum terbukti aman', 'video perkumpulan lgbt di sicc sentul', 'dijerat perkara terselubung bawaslu resmi tetapkan anies pasal berat', 'penonton non muslim yang datang untuk menyaksikan pertandingan sepak bola piala dunia fifa  di qatar telah memeluk islam']
time: 1.84 ms (started: 2022-12-21 11:31:29 +00:00)


In [None]:
for i in range(len(hoax_narratives)):
  if hoax_narratives[i] == "": # just to verify if anything failed while scraping
    hoax_narratives[i] = hoax_titles[i]

time: 2.35 ms (started: 2022-12-21 11:31:29 +00:00)


In [None]:
'''
define functions for a better life again.
the first function downloads a file from google drive,
while the second function loads the items of a txt file into a set.
I will use both functions a lot later.
'''

def drive(id, name):
  gdown.download(f'https://drive.google.com/u/0/uc?id={id}&export=download', name, quiet=False)

def txt_to_sets(txt, sets, sep):
  with open(f'/content/{txt}', "r") as f: # the with block closes the file automatically
    for i in f.read().split(sep=sep): sets.add(i)

time: 1.23 ms (started: 2022-12-21 11:31:29 +00:00)


In [None]:
drive('1a4wLhTZc3zKfsfm3JPtjWXjmMYT5b9cE', 'tags.txt')

hashtag = set()
txt_to_sets('tags.txt', hashtag, ',') # this txt file contains a list of Indonesian hashtag words, incomplete but enough

hashtag.discard('') # drop the empty string left by the separators, if there is one
wordList = list(hashtag) # a set already has no duplicates, so a plain list conversion is enough

Downloading...
From: https://drive.google.com/u/0/uc?id=1a4wLhTZc3zKfsfm3JPtjWXjmMYT5b9cE&export=download
To: /content/tags.txt
100%|██████████| 11.4k/11.4k [00:00<00:00, 10.1MB/s]

time: 1.42 s (started: 2022-12-21 11:31:29 +00:00)





In [None]:
'''
hashtags in Indonesian are so weird,
like #TurunkanHargaBBM or #jokowi_mundur.
the content of a hashtag is important but quite difficult to extract
'''

wordOr = '|'.join(wordList)

def splitHashTag(hashTag):
  new_words = []
  for wordSequence in re.findall('(?:' + wordOr + ')+', hashTag):
    for word in re.findall(wordOr, wordSequence):
      new_words.append(word)
  return ' '.join(new_words)

time: 941 µs (started: 2022-12-21 11:31:31 +00:00)
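To see how the hashtag splitter behaves, here is a minimal sketch that follows the same two-step regex approach with a tiny word list. The five-word list is a hypothetical stand-in for the contents of tags.txt, and the inputs are lowercase because `clean_text` lowercases the text before this step.

```python
import re

# hypothetical stand-in for the word list loaded from tags.txt
word_list = ['turunkan', 'harga', 'bbm', 'jokowi', 'mundur']
word_or = '|'.join(word_list)

def split_hashtag(hashtag):
    words = []
    # step 1: find every run of consecutive known words
    for sequence in re.findall('(?:' + word_or + ')+', hashtag):
        # step 2: split each run back into its individual words
        words.extend(re.findall(word_or, sequence))
    return ' '.join(words)

print(split_hashtag('#turunkanhargabbm'))  # turunkan harga bbm
print(split_hashtag('#jokowi_mundur'))     # jokowi mundur
```

Anything in the hashtag that is not in the word list, including the `#` and `_` characters, is silently dropped, so the quality of the result depends on how complete the word list is.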


In [None]:
for i in range(len(hoax_narratives)):
  hoax_narratives[i] = ' '.join([splitHashTag(w) if '#' in w else w for w in hoax_narratives[i].split()])
  hoax_titles[i] = ' '.join([splitHashTag(w) if '#' in w else w for w in hoax_titles[i].split()])

time: 121 ms (started: 2022-12-21 11:31:31 +00:00)


In [None]:
print(hoax_narratives[:5])

['video juru kamera terlihat lebih cepat dari peserta lomba lari', 'pernyataan ceo pfizer saya mengundurkan diri dan vaksin mrna belum terbukti aman', 'video perkumpulan lgbt di sicc sentul', 'dijerat perkara terselubung bawaslu resmi tetapkan anies pasal berat', 'penonton non muslim yang datang untuk menyaksikan pertandingan sepak bola piala dunia fifa di qatar telah memeluk islam']
time: 2.48 ms (started: 2022-12-21 11:31:31 +00:00)


In [None]:
print(hoax_titles[:5])

['video juru kamera terlihat lebih cepat dari peserta lomba lari', 'pernyataan ceo pfizer saya mengundurkan diri dan vaksin mrna belum terbukti aman', 'video perkumpulan lgbt di sicc sentul', 'dijerat perkara terselubung bawaslu resmi tetapkan anies pasal berat', 'video para supporter bola masuk islam di piala dunia qatar']
time: 500 µs (started: 2022-12-21 11:31:31 +00:00)


In [None]:
for i in range(len(hoax_narratives)):
  if len(hoax_narratives[i].split()) <= 8:
    hoax_narratives[i] = hoax_titles[i]

time: 18.1 ms (started: 2022-12-21 11:31:31 +00:00)


In [None]:
print(hoax_narratives[:5])

['video juru kamera terlihat lebih cepat dari peserta lomba lari', 'pernyataan ceo pfizer saya mengundurkan diri dan vaksin mrna belum terbukti aman', 'video perkumpulan lgbt di sicc sentul', 'dijerat perkara terselubung bawaslu resmi tetapkan anies pasal berat', 'penonton non muslim yang datang untuk menyaksikan pertandingan sepak bola piala dunia fifa di qatar telah memeluk islam']
time: 1.52 ms (started: 2022-12-21 11:31:31 +00:00)


In [None]:
# let's see how many words containing numbers remain

pattern = re.compile(r'([A-Za-z]+[\d@]+[\w@]*|[\d@]+[A-Za-z]+[\w@]*)') # compile once, outside the loop
word_with_number = []
for i in range(len(hoax_titles)):
  word_with_number += pattern.findall(hoax_titles[i])
  word_with_number += pattern.findall(hoax_narratives[i])

word_with_number = list(dict.fromkeys(word_with_number)) # remove duplicates while keeping the order

time: 253 ms (started: 2022-12-21 11:31:31 +00:00)


In [None]:
print(len(word_with_number))

900
time: 1.59 ms (started: 2022-12-21 11:31:31 +00:00)


In [None]:
# this dataset helps us normalize the weird words of Indotizen
drive('1XcHBG8XpbkTDSwOtcjK1tJqqD6qWHCli', 'alay.csv')
alay = pd.read_csv('/content/alay.csv')
alay

Downloading...
From: https://drive.google.com/u/0/uc?id=1XcHBG8XpbkTDSwOtcjK1tJqqD6qWHCli&export=download
To: /content/alay.csv
100%|██████████| 384k/384k [00:00<00:00, 90.0MB/s]


Unnamed: 0,alay,normal
0,ramayana,ramayana
1,000lima,lima
2,000tolong,tolong
3,000untuk,untuk
4,01dan,dan
...,...,...
20104,zul,zul
20105,zupeer,super
20106,zuyle,zumi
20107,zuzu,susu


time: 1.7 s (started: 2022-12-21 11:31:31 +00:00)


In [None]:
drive('1hosi8uadYMeqfarBWlm7NZormxTjr-Vu', 'abnormal.txt')
abnormal = set()
txt_to_sets('abnormal.txt', abnormal, '\n') # these are some words that pass the filtering process. I don't know why.

Downloading...
From: https://drive.google.com/u/0/uc?id=1hosi8uadYMeqfarBWlm7NZormxTjr-Vu&export=download
To: /content/abnormal.txt
100%|██████████| 1.38k/1.38k [00:00<00:00, 1.41MB/s]

time: 1.54 s (started: 2022-12-21 11:31:33 +00:00)





In [None]:
def normalize(text):
  # normalize text
  n = text.split()
  for a in range(len(alay['alay'])):
    if alay['alay'][a] in n:
      n[n.index(alay['alay'][a])] = alay['normal'][a]
  
  text = " ".join(n)
  text = " ".join(w for w in text.split() if not any(c.isdigit() for c in w)) # remove word containing number (just to verify)
  text = " ".join(w for w in text.split() if w not in abnormal)
  text = clean_text(text)
  return text

time: 1.68 ms (started: 2022-12-21 11:31:34 +00:00)
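The `normalize` function above scans every row of the alay table for each text. A dictionary lookup per word is a common alternative; the sketch below shows the idea with a hypothetical helper `normalize_words` and a two-entry mapping taken from the table preview. In the notebook, the mapping could presumably be built with `dict(zip(alay['alay'], alay['normal']))`, which is an assumption rather than something the original code does.

```python
def normalize_words(text, mapping):
    """Replace every word found in mapping, and keep all other words unchanged."""
    return ' '.join(mapping.get(w, w) for w in text.split())

# two rows from the alay.csv preview, used as a tiny example mapping
mapping = {'zupeer': 'super', 'zuzu': 'susu'}
print(normalize_words('zuzu ini zupeer enak', mapping))  # susu ini super enak
```

This variant replaces every occurrence of a word, whereas the loop above replaces only the first occurrence of each entry, so the two are not strictly equivalent.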


In [None]:
normal_narrative = []
normal_title = []
for i in range(len(hoax_narratives)):
  if i != 0 and i % 100 == 0:
    print(f'{i} rows have been normalized')

  normal_narrative.append(normalize(hoax_narratives[i]))

  temp = normalize(hoax_titles[i])
  if str(detect_langs(temp)[0])[:2] != 'id': temp = 'a' # replace a non-Indonesian title with the placeholder 'a', because rows with non-Indonesian titles are removed later
  
  normal_title.append(temp)

In [None]:
temp_nar = normal_narrative # note: this is an alias, not a copy, so normal_narrative is modified as well

for i in range(len(temp_nar)):
  det = detect_langs(temp_nar[i][0:int(1/2*len(temp_nar[i]))])[0] # detect the language from the first half of the narrative
  if str(det)[:2] != 'id':
    temp_nar[i] = normal_title[i] # same idea here: I use the title as the narrative if the narrative is not in Indonesian
  
  temp_nar[i] = " ".join(w for w in temp_nar[i].split() if len(w) > 1) # remove words shorter than 2 characters

time: 40.9 s (started: 2022-12-21 11:52:02 +00:00)


In [None]:
print(hoax_narratives[:5])

['video juru kamera terlihat lebih cepat dari peserta lomba lari', 'pernyataan ceo pfizer saya mengundurkan diri dan vaksin mrna belum terbukti aman', 'video perkumpulan lgbt di sicc sentul', 'dijerat perkara terselubung bawaslu resmi tetapkan anies pasal berat', 'penonton non muslim yang datang untuk menyaksikan pertandingan sepak bola piala dunia fifa di qatar telah memeluk islam']
time: 1.98 ms (started: 2022-12-21 11:52:42 +00:00)


In [None]:
# check how many narratives and titles contain only a very short sentence

nonenar = [i for i in temp_nar if len(i.split()) < 4]
nonetit = [i for i in normal_title if len(i.split()) < 1]

print(len(nonenar))
print(len(nonetit))

280
0
time: 26.8 ms (started: 2022-12-21 11:53:56 +00:00)


In [None]:
# flatten all narratives and all titles into two lists of words
nars = [w for sentence in temp_nar for w in sentence.split()]
tits = [w for sentence in normal_title for w in sentence.split()]

time: 49.6 ms (started: 2022-12-21 11:53:56 +00:00)


In [None]:
print(len(set(nars)))
print(len(set(tits)))

19184
9617
time: 21.5 ms (started: 2022-12-21 11:53:56 +00:00)


In [None]:
# look at the frequency of each word across all sentences

nar_res = dict(Counter(nars))
tit_res = dict(Counter(tits))
nar_res_sorted = {k: v for k, v in sorted(nar_res.items(), key=lambda item: item[1], reverse=True)}
tit_res_sorted = {k: v for k, v in sorted(tit_res.items(), key=lambda item: item[1], reverse=True)}

time: 51 ms (started: 2022-12-21 11:53:56 +00:00)
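The same kind of frequency table can also be obtained with `Counter.most_common()`, which returns the (word, count) pairs sorted from most to least frequent. A small sketch on a hypothetical toy word list:

```python
from collections import Counter

words = 'di dan di video dan di'.split()  # toy data, for illustration only
freq_sorted = dict(Counter(words).most_common())
print(freq_sorted)  # {'di': 3, 'dan': 2, 'video': 1}
```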


In [None]:
print(nar_res_sorted)
print(tit_res_sorted)

{'di': 1597, 'dan': 760, 'video': 754, 'yang': 664, 'indonesia': 599, 'foto': 566, 'dari': 476, 'a': 442, 'covid': 440, 'untuk': 394, 'jokowi': 354, 'tidak': 328, 'dengan': 323, 'akun': 292, 'ke': 282, 'presiden': 265, 'karena': 261, 'ini': 259, 'tahun': 227, 'vaksin': 224, 'oleh': 221, 'meninggal': 218, 'whatsapp': 216, 'corona': 213, 'orang': 211, 'sudah': 203, 'virus': 186, 'dunia': 183, 'akan': 181, 'dapat': 179, 'anak': 165, 'pada': 164, 'negara': 162, 'dalam': 155, 'saat': 154, 'rakyat': 153, 'ada': 153, 'anies': 152, 'china': 151, 'kota': 148, 'surat': 146, 'partai': 144, 'bantuan': 138, 'artikel': 135, 'bupati': 133, 'facebook': 131, 'hari': 131, 'pesan': 130, 'baru': 129, 'gambar': 125, 'kerja': 124, 'air': 123, 'setelah': 123, 'daerah': 122, 'warga': 122, 'juta': 118, 'jadi': 118, 'meminta': 116, 'republik': 115, 'jakarta': 113, 'rumah': 113, 'dana': 111, 'pemerintah': 111, 'wakil': 107, 'adalah': 105, 'bisa': 102, 'bagi': 102, 'tak': 101, 'islam': 94, 'polisi': 94, 'uang': 9

In [None]:
print(len(nar_res_sorted))

19184
time: 421 µs (started: 2022-12-21 11:53:56 +00:00)


In [None]:
drive('1cxksiiSI1k8qWKtyAnnCF5-dvpVfbMaU', 'sisa.txt')
sisa = set() # this file contains some non-Indonesian words that passed the filters, so let's remove them from the data
txt_to_sets('sisa.txt', sisa, '\n')

Downloading...
From: https://drive.google.com/u/0/uc?id=1cxksiiSI1k8qWKtyAnnCF5-dvpVfbMaU&export=download
To: /content/sisa.txt
100%|██████████| 1.65k/1.65k [00:00<00:00, 3.66MB/s]

time: 1.71 s (started: 2022-12-21 11:56:31 +00:00)





In [None]:
for i in range(len(temp_nar)):
  temp_nar[i] = " ".join([w for w in temp_nar[i].split() if w not in sisa])

time: 50.2 ms (started: 2022-12-21 11:56:40 +00:00)


In [None]:
# make sure that all lists still have the same length

print(len(temp_nar), len(normal_title), len(hoax_dates), len(hoax_categories))

6320 6320 6320 6320
time: 773 µs (started: 2022-12-21 11:56:40 +00:00)


In [None]:
# convert date to dd/mm/yyyy format

months = {'Januari': '01', 'Februari': '02', 'Maret': '03', 'April': '04', 'Mei': '05', 'Juni': '06',
          'Juli': '07', 'Agustus': '08', 'September': '09', 'Oktober': '10', 'November': '11', 'Desember': '12'}

def convert_date(date):
  dd = date[-8:-6].replace(" ", "0") # day, zero-padded when it has a single digit
  yy = date[-4:] # four-digit year
  mm = ''
  for name, number in months.items():
    if name in date: mm = number
  return dd + '/' + mm + '/' + yy

normal_date = [convert_date(d) for d in hoax_dates]

# display preview of the action above
print(*normal_date[:5], sep='\n')
print('...')
print(*normal_date[-5:], sep='\n')

21/12/2022
20/12/2022
20/12/2022
20/12/2022
20/12/2022
...
04/07/2018
26/06/2018
11/06/2018
11/06/2018
03/01/2017
time: 15.4 ms (started: 2022-12-21 11:56:40 +00:00)


In [None]:
# clean up the categories, because the raw values are quite dirty

normal_category = []
for i in range(len(hoax_categories)):
  hoax_categories[i] = hoax_categories[i].lower()
  hoax_categories[i] = hoax_categories[i].replace('\n', ' ')
  if 'imposter' in hoax_categories[i] or 'tiruan' in hoax_categories[i]:
    normal_category.append('Konten Tiruan')
  elif 'manipulated' in hoax_categories[i] or 'manipulasi' in hoax_categories[i] or 'dimanpulasi' in hoax_categories[i]:
    normal_category.append('Konten Yang Dimanipulasi')
  elif 'fabricated' in hoax_categories[i] or 'palsu' in hoax_categories[i]:
    normal_category.append('Konten Palsu')
  elif 'false context' in hoax_categories[i] or 'konteks yang salah' in hoax_categories[i] or 'konten yang salah' in hoax_categories[i] or 'konten salah' in hoax_categories[i]:
    normal_category.append('Konten Yang Salah')
  elif 'misleading' in hoax_categories[i] or 'menyesatkan' in hoax_categories[i] or 'menyesakan' in hoax_categories[i]:
    normal_category.append('Konten Yang Menyesatkan')
  elif 'satir' in hoax_categories[i] or 'parodi' in hoax_categories[i]:
    normal_category.append('Satire/Parodi')
  else:
    normal_category.append('Koneksi Yang Salah')
  

# display preview of the action above
print(*normal_category[:5], sep='\n')
print('...')
print(*normal_category[-5:], sep='\n')

Konten Yang Salah
Konten Yang Menyesatkan
Konten Yang Salah
Konten Yang Menyesatkan
Konten Yang Salah
...
Konten Yang Menyesatkan
Konten Yang Menyesatkan
Satire/Parodi
Konten Yang Menyesatkan
Satire/Parodi
time: 25.4 ms (started: 2022-12-21 11:56:40 +00:00)


In [None]:
# finally, keep only the rows whose narrative has at least 4 words and whose title has at least 2 words

fix_title = []
fix_narrative = []
fix_date = []
fix_category = []

for i in range(len(temp_nar)):
  if len(temp_nar[i].split()) >= 4 and len(normal_title[i].split()) > 1:
    fix_title.append(normal_title[i])
    fix_narrative.append(temp_nar[i])
    fix_date.append(normal_date[i])
    fix_category.append(normal_category[i])

time: 26.4 ms (started: 2022-12-21 11:56:40 +00:00)


In [None]:
print(len(fix_title), len(fix_narrative), len(fix_date), len(fix_category))

5850 5850 5850 5850
time: 2.34 ms (started: 2022-12-21 11:56:40 +00:00)


In [None]:
# verify whether any narrative is None

nonar = [i for i in fix_narrative if i is None]
print(len(nonar))

0
time: 3.34 ms (started: 2022-12-21 11:56:40 +00:00)


# **Valid News Scraping Part**

I use each news category on [detik](https://detik.com) to get the valid data. They are:


1.   detik edu: [detik edu](https://www.detik.com/edu/indeks) 2.8%
2.   detik finance: [detik finance](https://finance.detik.com/indeks) 2.8%
3.   detik inet: [detik inet](https://inet.detik.com/indeks) 2.8%
4.   detik hot: [detik hot](https://hot.detik.com/indeks) 2.8%
5.   detik sport: [detik sport](https://sport.detik.com/indeks) 2.8%
6.   detik oto: [detik oto](https://oto.detik.com/indeks) 2.8%
7.   detik travel: [detik travel](https://travel.detik.com/indeks) 2.8%
8.   detik food: [detik food](https://food.detik.com/indeks) 2.8%
9.   detik health: [detik health](https://health.detik.com/indeks) 35.8%
10.  detik tag politik: [detik politik](https://www.detik.com/tag/politik) 20%
11.  detik tag bencana: [detik bencana](https://www.detik.com/tag/bencana-alam) 15.1% (recently there have been many disasters in Indonesia, so I increased the quota of this category by 1.1%)
12.  detik tag agama: [detik agama](https://www.detik.com/tag/agama) 6.7%

For each category, I scraped 60 pages. The percentages indicate the share of the data that is scraped from each category. They are based on the research of Judita and Darmawan (2020), whose map of fake news by theme looks like this:

![image.png](https://drive.google.com/u/0/uc?id=1gclSk574y6GcfZrK1IgzEfUn7YEbMbPB&export=download)
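The percentages translate into page counts through a simple ceiling calculation. The sketch below repeats the arithmetic of the scraping cell that follows; the base of 400 pages and the factor of 2 for tag pages are taken from that cell, while the dictionary names are hypothetical.

```python
from math import ceil

# share (in percent) of each index category and each tag category
index_share = {'edu': 2.8, 'finance': 2.8, 'inet': 2.8, 'hot': 2.8, 'sport': 2.8,
               'oto': 2.8, 'travel': 2.8, 'food': 2.8, 'health': 35.8}
tag_share = {'politik': 20, 'bencana-alam': 15.1, 'agama': 6.7}

index_pages = {name: ceil((pct / 100) * 400) for name, pct in index_share.items()}
tag_pages = {name: ceil((pct / 100) * 400 * 2) for name, pct in tag_share.items()}

print(index_pages['edu'], index_pages['health'])           # 12 144
print(sum(index_pages.values()), sum(tag_pages.values()))  # 240 335
```

The two totals match the last page numbers reported in the progress output of the scraping cell (240 index pages and 335 tag pages).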

In [None]:
true_urls = get_urls('https://detik.com/edu/indeks/', ceil((2.8/100)*400))
tag_urls = get_urls('https://www.detik.com/tag/politik/?sortby=time&page=', ceil((20/100)*400*2))

kesehatan = get_urls('https://health.detik.com/indeks/', ceil((35.8/100)*400))
bencana = get_urls('https://www.detik.com/tag/bencana-alam/?sortby=time&page=', ceil((15.1/100)*400*2))
agama = get_urls('https://www.detik.com/tag/agama/?sortby=time&page=', ceil((6.7/100)*400*2))

urls_cat = ['finance', 'inet', 'hot', 'sport', 'oto', 'travel', 'food']
for cat in urls_cat:
  true_urls.extend(get_urls(f'https://{cat}.detik.com/indeks/', ceil((2.8/100)*400)))

true_urls.extend(kesehatan)
tag_urls.extend(bencana)
tag_urls.extend(agama)

true_res = get_data(true_urls)
print('-------------------------')
tag_res = get_data(tag_urls)

Processed Part: 1 - 48
Processed Part: 49 - 96
Processed Part: 97 - 144
Processed Part: 145 - 192
Processed Part: 193 - 240
-------------------------
Processed Part: 1 - 48
Processed Part: 49 - 96
Processed Part: 97 - 144
Processed Part: 145 - 192
Processed Part: 193 - 240
Processed Part: 241 - 288
Processed Part: 289 - 335
time: 1min 27s (started: 2022-12-21 12:59:59 +00:00)


The flow is the same as for the hoax data, but with less data cleaning, and the scraping is done per category.

In [None]:
true_links = []
true_titles = []
true_categories = []

for r in true_res:
  sp = bs(r.content, 'html.parser')
  data = sp.find_all('div', {'class': 'media__text'}) # this is the tag and the class that contain link and title of index link
  for i in data:
    if not i.find('h2'):
      link = i.find('a')['href']
      title = i.find('a').text
      true_links.append(link)
      true_titles.append(title)
      true_categories.append('Valid')

for r in tag_res:
  sp = bs(r.content, 'html.parser')
  data = sp.find_all('article') # this is the tag that contain link and title of tag link
  for i in data:
    link = i.find('a')['href']
    title = i.find('h2').text
    true_links.append(link)
    true_titles.append(title)
    true_categories.append('Valid')
      

# count the number of article links that were collected successfully
print(f'Number of fetched data: {len(true_links)}')

Number of fetched data: 6645
time: 26.9 s (started: 2022-12-21 13:01:27 +00:00)


In [None]:
true_content = get_data(true_links)

In [None]:
true_narratives = []
true_dates = []

for i in true_content:
  sp = bs(i.content, 'html.parser')
  data = sp.find_all('article', {'class': 'detail'})
  try:
    # here, I use the first four paragraphs of each article as the narrative
    # (slicing simply returns fewer paragraphs when the article is shorter)
    narrative = ' '.join(k.text for k in data[0].find_all('p')[:4])
    date = data[0].find('div', {'class': 'detail__date'}).text
  except (IndexError, AttributeError):
    # the page has no article body or no date: the values of the previous article are
    # reused so that all lists stay aligned, which means such a row duplicates the previous one
    pass
  true_narratives.append(narrative)
  true_dates.append(date)

time: 7min 36s (started: 2022-12-21 13:20:15 +00:00)


In [None]:
print(true_dates[-5:])

['Rabu, 21 Des 2022 16:33 WIB', 'Rabu, 21 Des 2022 16:13 WIB', 'Rabu, 21 Des 2022 16:00 WIB', 'Rabu, 21 Des 2022 15:54 WIB', 'Rabu, 21 Des 2022 15:38 WIB']
time: 735 µs (started: 2022-12-21 13:27:51 +00:00)


In [None]:
for i in range(len(true_narratives)):
  true_narratives[i] = clean_text(true_narratives[i])
  true_narratives[i] = " ".join(w for w in true_narratives[i].split() if not any(c.isdigit() for c in w))
  true_titles[i] = clean_text(true_titles[i])
  true_titles[i] = " ".join(w for w in true_titles[i].split() if not any(c.isdigit() for c in w))

print(true_narratives[:5])

['kondisi ekonomi bukan menjadi penghalang berarti bagi lutfian untuk meraih pendidikan setinggi tinggi lulusan fakultas keperawatan universitas jember unej tersebut membuktikan dengan berhasil lolos beasiswa lembaga pengelola dana pendidikan lpdp ke luar negeri semangat juang dari diri lutfian tumbuh dari cerminan perjuangan kedua orang tuanya yang bekerja sebagai tki di malaysia fian sapaan akrabnya mengungkapkan bahwa seseorang yang sangat berperan penting dalam hidup adalah ibunya ibu sangat mendukung apapun yang saya lakukan dan beliau adalah motivator terhebat saat saya merasa tidak percaya diri ucapnya dikutip dari laman resmi unej rabu', 'setiap makhluk hidup memiliki keunikannya masing masing termasuk tawon ini yang pakai alat genitalnya untuk kabur dari serangan predator tawon adalah salah satu jenis serangga yang memiliki warna mencolok dengan keahlian menyengat namun satu fakta menyebutkan hanya tawon betina yang bisa menyengat dan menyengat predator dengan racunnya racun t

In [None]:
# convert date to dd/mm/yyyy format

months = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'Mei': '05', 'Jun': '06',
          'Jul': '07', 'Agu': '08', 'Sep': '09', 'Okt': '10', 'Nov': '11', 'Des': '12'}

for i in range(len(true_dates)):
  # 'Rabu, 21 Des 2022 16:33 WIB' -> '21/Des/2022'
  date = true_dates[i].split(', ')[1][:11].replace(' ', '/')
  for month, idx in months.items():
    if month in date: true_dates[i] = date.replace(month, idx)

time: 17.6 ms (started: 2022-12-21 13:27:57 +00:00)


In [None]:
print(true_dates[:1000])

['21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '21/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '20/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/12/2022', '19/1

In [None]:
print(len(true_categories))
print(len(true_titles))
print(len(true_dates))
print(len(true_narratives))

6645
6645
6645
6645
time: 952 µs (started: 2022-12-21 13:27:57 +00:00)


# **Merge All Together**

Concatenate the hoax data with the valid data, field by field (hoax entries first, then valid entries).

In [None]:
categories = fix_category + true_categories
titles = fix_title + true_titles
dates = fix_date + true_dates
narratives = fix_narrative + true_narratives

time: 2.75 ms (started: 2022-12-21 13:27:57 +00:00)


In [None]:
print(len(categories))

12495
time: 818 µs (started: 2022-12-21 13:27:57 +00:00)


In [None]:
data = {"kategori": categories, "judul": titles, "tanggal": dates, "narasi": narratives}

time: 712 µs (started: 2022-12-21 13:27:57 +00:00)
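
The dict above only makes sense if the four lists line up row by row. A minimal sketch of an explicit check that reports the length of each column, which can make a mismatch easier to trace; `check_equal_lengths` is a hypothetical helper, not part of this notebook.

```python
def check_equal_lengths(**columns):
    """Return the shared length of all columns, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    # an empty call has no columns, so report a length of 0
    return next(iter(lengths.values()), 0)

# toy lists standing in for categories, titles, dates and narratives
n_rows = check_equal_lengths(
    kategori=['Valid', 'Konten Palsu'],
    judul=['judul a', 'judul b'],
    tanggal=['21/12/2022', '20/12/2022'],
    narasi=['narasi a', 'narasi b'],
)
print(n_rows)  # → 2
```

In the notebook this could be called as `check_equal_lengths(**data)` before building the dataframe.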


# **Convert to Dataframe**

Let's convert the dict to a dataframe, which makes deeper cleaning steps such as removing duplicates easier.

In [None]:
df = pd.DataFrame.from_dict(data)

time: 5.08 ms (started: 2022-12-21 13:27:57 +00:00)


In [None]:
df.narasi.duplicated(keep="first").value_counts() # count rows whose narrative duplicates an earlier row

False    9949
True     2546
Name: narasi, dtype: int64

time: 14 ms (started: 2022-12-21 13:27:57 +00:00)


In [None]:
df[df.judul.duplicated(keep="first")] # show rows whose title duplicates an earlier row

Unnamed: 0,kategori,judul,tanggal,narasi
623,Konten Palsu,link bantuan program keluarga harapan tahap,22/06/2022,telah dibuka pencairan bantuan program keluarg...
720,Konten Palsu,innalillahi detik detik ular piton telan seora...,30/05/2022,innalillahi detik detik ular piton telan seora...
1015,Konten Palsu,gadis ular menggemparkan dunia,23/03/2022,gadis ular menggemparkan dunia
1169,Konten Yang Menyesatkan,jus campuran nanas lobak dan kemiri dapat meny...,26/02/2022,untuk siapa saja yang kena asam urat boleh cob...
1299,Konten Yang Salah,foto artikel pesantren alquran terbakar santri...,04/02/2022,foto artikel pesantren alquran terbakar santri...
...,...,...,...,...
12490,Valid,nadia mulya gugat cerai suami,21/12/2022,nadia mulya mengajukan gugatan cerai kepada su...
12491,Valid,kunjungi kampus muhammadiyah sorong zulhas puj...,21/12/2022,menteri perdagangan zulkifli hasan melanjutkan...
12492,Valid,khiyanah artinya berkhianat termasuk sifat yan...,21/12/2022,khiyanah termasuk salah satu akhlak tercela da...
12493,Valid,jadwal misa natal gereja katedral jakarta cek ...,21/12/2022,jadwal misa natal gereja katedral jakarta suda...


time: 15.8 ms (started: 2022-12-21 13:27:57 +00:00)


In [None]:
# remove rows with a duplicate narrative or title, keeping the first occurrence

df.drop_duplicates(subset="narasi", keep="first", inplace=True)
df.drop_duplicates(subset="judul", keep="first", inplace=True)
df.reset_index(drop=True, inplace=True)
print(df.narasi.duplicated(keep="first").value_counts())
print('---------------------------')
print(df.judul.duplicated(keep="first").value_counts())

False    9885
Name: narasi, dtype: int64
---------------------------
False    9885
Name: judul, dtype: int64
time: 33.7 ms (started: 2022-12-21 13:27:57 +00:00)
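
The two passes are applied in sequence: of the 12495 merged rows, the narrative pass removes the 2546 duplicates counted earlier (leaving 9949), and the title pass then removes a further 64 rows, giving 9885. A minimal sketch of the same two-step logic on a toy dataframe (the values are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'judul':  ['a', 'b', 'a', 'c'],
    'narasi': ['x', 'x', 'y', 'z'],
})

# pass 1: row 1 repeats the narrative 'x' and is dropped
step1 = toy.drop_duplicates(subset='narasi', keep='first')

# pass 2: among the survivors, row 2 repeats the title 'a' and is dropped
step2 = step1.drop_duplicates(subset='judul', keep='first').reset_index(drop=True)

print(len(toy), len(step1), len(step2))  # → 4 3 2
```

Note that a row is removed if either its narrative or its title has appeared before, so the result is stricter than dropping rows where both fields match.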


In [None]:
df.kategori.value_counts() # let's count the data of each category

Valid                       4082
Konten Yang Menyesatkan     2094
Konten Yang Salah           1449
Konten Palsu                 897
Konten Yang Dimanipulasi     694
Konten Tiruan                438
Satire/Parodi                140
Koneksi Yang Salah            91
Name: kategori, dtype: int64

time: 6.42 ms (started: 2022-12-21 13:27:57 +00:00)
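
The counts above are uneven across classes. A minimal sketch that turns them into shares, using the numbers printed above; on the dataframe itself, `df.kategori.value_counts(normalize=True)` should give the same fractions directly.

```python
# counts copied from the value_counts() output above
counts = {
    'Valid': 4082,
    'Konten Yang Menyesatkan': 2094,
    'Konten Yang Salah': 1449,
    'Konten Palsu': 897,
    'Konten Yang Dimanipulasi': 694,
    'Konten Tiruan': 438,
    'Satire/Parodi': 140,
    'Koneksi Yang Salah': 91,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# every non-'Valid' category is a hoax class
hoax_total = total - counts['Valid']

for label, share in shares.items():
    print(f'{label:<26}{share:6.1%}')
print(f'hoax vs valid: {hoax_total} vs {counts["Valid"]}')
```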


In [None]:
# make sure that titles and narratives are in lowercase

def lower(text): return text.lower()

df['narasi'] = df['narasi'].apply(lower)
df['judul'] = df['judul'].apply(lower)

time: 19.6 ms (started: 2022-12-21 13:28:55 +00:00)


In [None]:
# let's remove stopwords from the narrative data
drive('1dNRiXb9fy3fzeypcYukzeexk7M8RNNLB', 'stopwordsID.txt')
stopword = set()
txt_to_sets('stopwordsID.txt', stopword, '\n') # this txt file contains a reasonably complete list of Indonesian stopwords

def clean_stop(text): return " ".join([w for w in text.split() if w not in stopword])

pd.set_option('max_colwidth', 400)

print(f'Before Stopword Remove:\n\n{df.narasi.head(10)}\n\n{df.narasi.tail(10)}')

df.narasi = df.narasi.apply(clean_stop)

print(f'\n\nAfter Stopword Remove:\n\n{df.narasi.head(10)}\n\n{df.narasi.tail(10)}')

Downloading...
From: https://drive.google.com/u/0/uc?id=1dNRiXb9fy3fzeypcYukzeexk7M8RNNLB&export=download
To: /content/stopwordsID.txt
100%|██████████| 7.32k/7.32k [00:00<00:00, 2.63MB/s]

Before Stopword Remove:

0                                                                                                                                                                                            video juru kamera terlihat lebih cepat dari peserta lomba lari
1                                                                                                                                                                          pernyataan ceo pfizer saya mengundurkan diri dan vaksin mrna belum terbukti aman
2                                                                                                                                                                                                                     video perkumpulan lgbt di sicc sentul
3                                                                                                                                                                dijerat perkara terselubung badan pengawas pemilihan umum 
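
The filtering step itself does not depend on the `drive` and `txt_to_sets` helpers, which are defined elsewhere in this notebook. A minimal sketch of `clean_stop` with a tiny inline stopword set; the four words below are an assumption chosen for illustration, not the contents of `stopwordsID.txt`.

```python
# illustrative stopwords only; the real list comes from stopwordsID.txt
stopword = {'yang', 'dan', 'di', 'dari'}

def clean_stop(text):
    """Drop every whitespace-separated token that is a stopword."""
    return " ".join(w for w in text.split() if w not in stopword)

sample = 'video juru kamera terlihat lebih cepat dari peserta lomba lari'
print(clean_stop(sample))
# → video juru kamera terlihat lebih cepat peserta lomba lari

# a text made up only of stopwords collapses to an empty string
print(repr(clean_stop('yang dan di')))  # → ''
```

Because lookups in a `set` are constant time on average, the cost of this step grows with the number of tokens rather than with the size of the stopword list.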




In [None]:
df.isnull().sum() # check for null values in each column

kategori    0
judul       0
tanggal     0
narasi      0
dtype: int64

time: 8.61 ms (started: 2022-12-21 13:28:57 +00:00)


In [None]:
df = df.dropna() # remove rows that contain null data
df.isnull().sum()

kategori    0
judul       0
tanggal     0
narasi      0
dtype: int64

time: 16.7 ms (started: 2022-12-21 13:28:57 +00:00)
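
One caveat: `isnull()` does not flag empty strings, and stopword removal can leave a narrative empty when every word in it was a stopword. Whether that happens in this dataset is not shown here. A minimal sketch of an additional check on a toy dataframe (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'judul':  ['a', 'b', 'c'],
    'narasi': ['teks berita', '', '   '],
})

# isnull() reports nothing, although two narratives carry no text
print(toy.isnull().sum().sum())  # → 0

# flag rows whose narrative is empty or whitespace-only, then drop them
blank = toy['narasi'].str.strip() == ''
cleaned = toy[~blank].reset_index(drop=True)
print(int(blank.sum()), len(cleaned))  # → 2 1
```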


# **Final Stage**

Let's export our dataset to a CSV file.

In [None]:
# convert dataframe to csv
df.to_csv('valid_hoaks_rep_stop_new.csv', index=False, encoding='utf-8')

time: 193 ms (started: 2022-12-21 13:29:18 +00:00)
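
A quick way to sanity-check an export like this is to read it back and compare. A minimal sketch using an in-memory buffer instead of a file, on a toy dataframe with made-up values; it also shows that `read_csv` keeps `dd/mm/yyyy` dates as plain strings unless asked to parse them.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'kategori': ['Valid', 'Konten Palsu'],
    'tanggal':  ['21/12/2022', '04/02/2022'],
    'narasi':   ['teks berita', 'teks hoaks'],
})

# write to a text buffer with the same index setting as the export above
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# rewind and read the CSV back
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.shape == toy.shape)  # → True
print(list(restored['tanggal']))    # → ['21/12/2022', '04/02/2022']
```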
