Train models to classify news articles automatically. Articles come from NewsAPI.

angelxd84130/NewsClassification


News Article Classification

Automatic classification system
View Demo · Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contact
  6. Acknowledgements

About The Project

The goal is to build a system that continuously collects real-world data, preprocesses it, and applies machine learning to classify it. The expectation is that continuously and automatically collecting new data and retraining on it improves classification accuracy.
The project implements the following steps: Data Mining, Data Preprocessing, Data Modeling, Data Training & Testing, and Evaluation.

Here's why:

  • Real-world data is very complicated and needs to be handled carefully
  • Only by accumulating the amount of data can the accuracy of the training model be improved
  • The training results can be used in the automatic filing and search system of articles

Data Mining

The system collects articles through NewsAPI in 6 categories (science, general, health, business, entertainment, and sports) and stores them in a JSON file. Because of the API's request limit, no more than 100 articles per category can be downloaded at a time. The system therefore runs for 7 days and downloads the daily data to increase the amount of data.
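The daily collection step could be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the function name `collect_articles`, the JSON layout (one list of articles per category), and de-duplication by article URL are assumptions made for illustration. The `client` argument is expected to behave like `NewsApiClient` from the newsapi-python package, whose `get_top_headlines` accepts `category` and `page_size`.

```python
import json
from pathlib import Path

CATEGORIES = ["science", "general", "health", "business", "entertainment", "sports"]


def collect_articles(client, path, page_size=100):
    """Fetch one batch per category and merge it into the JSON file at `path`.

    Hypothetical helper: the storage layout is an assumption, not the
    project's documented format.
    """
    path = Path(path)
    # Load what earlier runs stored, or start with an empty store.
    if path.exists():
        store = json.loads(path.read_text(encoding="utf-8"))
    else:
        store = {}
    for category in CATEGORIES:
        response = client.get_top_headlines(category=category, page_size=page_size)
        known = store.setdefault(category, [])
        seen_urls = {article["url"] for article in known}
        for article in response.get("articles", []):
            # Skip articles that were already downloaded on a previous day.
            if article["url"] not in seen_urls:
                known.append(article)
                seen_urls.add(article["url"])
    path.write_text(json.dumps(store, ensure_ascii=False, indent=2), encoding="utf-8")
    # Report how many articles are stored per category after this run.
    return {category: len(items) for category, items in store.items()}
```

Calling `collect_articles(NewsApiClient(api_key=...), "articles.json")` once a day would then grow the file over the 7-day run; the file name here is only an example.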

Data Preprocessing

Because this project uses keywords to distinguish article categories, the system uses the nltk library for natural language processing and the scikit-learn library to build a dictionary.
First, the text is split into words, and each word is reduced to its base form after its part of speech is looked up.
Next, stop words are filtered out, the frequency of each word is calculated, and the high-frequency words are recorded in the dictionary to simplify subsequent calculations.
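The frequency-based dictionary described above can be illustrated with a small standard-library sketch. It is a simplification under stated assumptions: it uses a regular expression and a tiny hand-written stop-word list in place of nltk, it omits the part-of-speech lookup and base-form reduction entirely, and the names `tokenize` and `build_dictionary` are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; the project uses nltk, whose list is larger.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on",
              "is", "are", "was", "for", "with"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


def build_dictionary(texts, size=1000):
    """Map the `size` most frequent words to consecutive indices."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # most_common() sorts by descending frequency, so index 0 is the top word.
    return {word: index for index, (word, _) in enumerate(counts.most_common(size))}
```

Limiting the dictionary to the most frequent words keeps the feature vectors small, at the cost of ignoring rare words that might still be distinctive for a category.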

Data Modeling

This project applies 3 supervised learning models: Naive Bayes, SVM, and Logistic Regression. The Naive Bayes model considers both unigrams and bigrams.
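One way to set up the three models with scikit-learn is sketched below. The specific choices are assumptions, since the README does not state them: `CountVectorizer` as the feature extractor, `MultinomialNB` as the Naive Bayes variant, `LinearSVC` as the SVM, and `max_iter=1000` for logistic regression. Only the unigram-plus-bigram setting for Naive Bayes comes from the description above. The function name `build_models` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_models():
    """Return one untrained text-classification pipeline per model."""
    return {
        # ngram_range=(1, 2) makes the vectorizer emit unigrams and bigrams.
        "Naive Bayes": make_pipeline(
            CountVectorizer(stop_words="english", ngram_range=(1, 2)),
            MultinomialNB(),
        ),
        # LinearSVC is an assumed choice of SVM implementation.
        "SVM": make_pipeline(CountVectorizer(stop_words="english"), LinearSVC()),
        "Logistic Regression": make_pipeline(
            CountVectorizer(stop_words="english"),
            LogisticRegression(max_iter=1000),
        ),
    }
```

Each pipeline is trained with `model.fit(texts, labels)` and queried with `model.predict([article_text])`, so raw article text can be passed in directly.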

Data Training & Testing

The data stored in the JSON file is randomly split into 80% training data and 20% test data.
The system trains and predicts every day to observe the correlation between the amount of data and the prediction accuracy.
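A random 80/20 split can be sketched with the standard library alone. The helper name `split_train_test` and its `seed` parameter are assumptions for illustration; the project may well use a library routine such as scikit-learn's `train_test_split` instead.

```python
import random


def split_train_test(samples, test_ratio=0.2, seed=None):
    """Shuffle `samples` and return (train, test) lists."""
    shuffled = list(samples)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]
```

Passing a fixed `seed` makes the daily runs comparable, because a change in accuracy then reflects the added data rather than a different split.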

Evaluation

The 7-day experiment yielded about 1300 articles; their distribution across the 6 categories is listed under Data in the Usage section.
According to the results, the SVM model performs best while the amount of data is small, with an accuracy of about 0.61.
Logistic Regression follows at about 0.55, and Naive Bayes comes last at about 0.54.
The confusion matrix shows that, among the six categories, sports articles are the most recognizable, with an accuracy as high as 96.6%.
Business and entertainment news follow at 68.6% and 62.7%.
For the remaining three categories, insufficient data makes the articles hard to distinguish, which lowers the overall accuracy.
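The per-category figures can be derived from the true and predicted labels, as in the sketch below. It assumes that the per-category "accuracy" quoted above is the share of each category's articles that were classified correctly (the diagonal of the confusion matrix divided by the row total, also called recall). The README does not confirm this definition, and the function name is hypothetical.

```python
from collections import Counter


def per_category_recall(true_labels, predicted_labels):
    """Return, for each category, the share of its articles predicted correctly."""
    totals = Counter(true_labels)
    # Count only the pairs where the prediction matches the true category.
    correct = Counter(
        true for true, predicted in zip(true_labels, predicted_labels)
        if true == predicted
    )
    return {category: correct[category] / totals[category] for category in totals}
```

For example, if 2 of 2 sports articles and 1 of 2 health articles are predicted correctly, the result is 1.0 for sports and 0.5 for health.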

Built With

Getting Started

Start with the Python file for any of the machine learning models, e.g. NaiveBayes.py.
Download the code and replace the api_key with your own.

Prerequisites

  1. Get a free API key at NewsAPI

  2. Replace api_key with your own key.

    newsapi = NewsApiClient(api_key=' your api key here ')
  3. Run the Python file

  4. Check the plot to see the prediction results

Usage

Use the plots to check the results.

Data

Articles: {'science': 61, 'general': 226, 'health': 133, 'business': 294, 'entertainment': 270, 'sports': 341}
[Plots: DataAccumulation, Data]

Accuracy

  • Naive Bayes: 0.539156
  • SVM: 0.608433
  • Logistic Regression: 0.551204
    [Plot: SupervisedLearning]

Confusion matrix

[Plot: NaiveBayes]

Alternatively, copy the content of an article and apply the model to it to make a prediction.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contact

Yu-Chieh Wang - LinkedIn
email: angelxd84130@gmail.com

Acknowledgements
