Skip to content

CLI to extract article contents in bulk using Newspaper3k and multithreading.

Notifications You must be signed in to change notification settings

dsynkov/newspaper-bulk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 

Repository files navigation

newspaper-bulk

A command-line script to apply Newspaper3k's article extraction feature in bulk, made with large lists (>1K) of URLs in mind. The script is made to be as forgiving as possible, accepting any and all URLs insofar as they are in the first column of a .csv, .xlsx, .xls, or .csv. file. It uses Python's standard threading library to divvy up the work (default is for 100 threads; see Shane Lynn's tutorial). In addition to the requirements, make sure you have nltk's punkt package installed (via nlkt.download() in interactive Python) for Newspaper3k's article.nlp() to work properly.

usage

usage: newspaperbulk.py [-h] [-t THREADS] [-r] [-u] [-m MAX_RETRIES]
                        [-b BACKOFF]
                        filepath
                        
positional arguments:
  filepath              Enter the path of the .csv, .txt, .xlsx, or .xls file
                        containing the URLs. If the file is not in your
                        current directory, you must enter the absolute path.

optional arguments:
  -h, --help            show this help message and exit
  -t THREADS, --threads THREADS
                        Number of threads to launch (default 100).
  -r, --redirects       Select to allow redirects.
  -u, --unverified      Select to allow unverified SSL certificates.
  -m MAX_RETRIES, --max_retries MAX_RETRIES
                        Set the max number of retries (default 0 to fail on
                        first retry).
  -b BACKOFF, --backoff BACKOFF
                        Set the backoff factor (default 0).

About

CLI to extract article contents in bulk using Newspaper3k and multithreading.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages