algonacci/INHEAD

INHEAD

Indonesian News Headline Dataset

Description

A simple tool that helps scrape news headline data using the PyGoogleNews library.

Installation

# Python version 3.7 or newer
$ git clone https://github.com/algonacci/INHEAD.git
$ cd INHEAD
$ pip install -r requirements.txt

Arguments

--set        : Dataset split: train, test, or val
--query      : Keyword used to search for related news
--topic      : Target label (class) assigned to each scraped headline
--quantity   : Number of rows to display (maximum 60)
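
For illustration, the four flags above could be declared with `argparse` roughly as follows. This is a minimal sketch, not the repository's actual code: the names `build_parser` and `quantity_type`, the `dataset` attribute, the default values, and the lower bound of 1 for `--quantity` are all assumptions made for the example.

```python
import argparse


def quantity_type(value: str) -> int:
    """Parse --quantity and enforce the documented maximum of 60.

    The lower bound of 1 is an assumption made for this sketch.
    """
    number = int(value)
    if not 1 <= number <= 60:
        raise argparse.ArgumentTypeError("quantity must be between 1 and 60")
    return number


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real scripts may define their flags differently.
    parser = argparse.ArgumentParser(description="INHEAD arguments (sketch)")
    parser.add_argument("--set", dest="dataset", default="train",
                        choices=["train", "test", "val"],
                        help="dataset split to work on")
    parser.add_argument("--query", help="keyword used to search for related news")
    parser.add_argument("--topic", help="label assigned to each scraped headline")
    parser.add_argument("--quantity", type=quantity_type, default=60,
                        help="number of rows to display (maximum 60)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--set", "train", "--query", "twitter", "--topic", "teknologi"]
    )
    print(args.dataset, args.query, args.topic, args.quantity)
```

Restricting `--set` with `choices` and validating `--quantity` in a type function makes the parser reject invalid input before any scraping or file access starts.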

Usage

# To scrape data
$ python src/scraping.py --set train --query twitter --topic teknologi

# To merge all scraped data
$ python src/merge.py --set train

# To inspect the result as a pandas DataFrame
$ python src/check_df.py --set train --quantity 60
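
The merge step could work along these lines. This is a sketch under stated assumptions, not the contents of `src/merge.py`: it assumes each scraping run writes one CSV file per query with `title` and `topic` columns into a per-split directory, and the function name `merge_csv_files` is hypothetical.

```python
import csv
from pathlib import Path


def merge_csv_files(input_dir: Path, output_file: Path) -> int:
    """Concatenate every CSV in input_dir into output_file.

    Rows with the same (title, topic) pair are written only once.
    Returns the number of rows written. The column names "title" and
    "topic" are assumptions for this sketch.
    """
    seen = set()
    rows = []
    for path in sorted(input_dir.glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                key = (row["title"], row["topic"])
                if key not in seen:
                    seen.add(key)
                    rows.append({"title": row["title"], "topic": row["topic"]})

    with output_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["title", "topic"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The output file should live outside `input_dir`; otherwise a later run would pick up the merged file as one of its inputs.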

TODO

  • Decide on the broad topics to classify

Topics covered so far:

  • Pendidikan (Education)
  • Internasional (International)
  • Politik (Politics)
  • Kesehatan (Health)
  • Pariwisata (Tourism)
  • Ekonomi (Economy)
  • Bisnis (Business)
  • Entertainment
  • Teknologi (Technology)

Target

  • 45,000 headlines in the train set
  • 5,000 headlines in the validation set
  • 5,000 headlines in the test set
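
As a quick check of what these targets imply, the short sketch below computes the total dataset size and the share of each split. The variable names are illustrative only.

```python
# Target sizes taken from the list above.
targets = {"train": 45_000, "val": 5_000, "test": 5_000}

total = sum(targets.values())
shares = {name: count / total for name, count in targets.items()}

print(f"total: {total} headlines")
for name, count in targets.items():
    # Prints each split with its share of the whole dataset.
    print(f"{name}: {count} headlines ({shares[name]:.1%})")
```

The targets add up to 55,000 headlines, which is roughly an 82/9/9 split between train, validation, and test.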
