Automatic classification system
The goal is to build a system that continuously obtains real-world data, processes it, and applies machine learning to classify it. The hope is that by continuously and automatically collecting and training on new data, classification accuracy will improve.
In this project, the following steps were implemented: Data Mining, Data Preprocessing, Data Modeling, Data Training & Testing, and Evaluation.
Here's why:
- Real-world data is messy and must be handled carefully
- The accuracy of the trained model improves only as the amount of data accumulates
- The training results can be used in an automatic article filing and search system
The system collects articles through NewsAPI in six categories (science, general, health, business, entertainment, and sports) and stores them in a JSON file. Because of the API's download limit, at most 100 articles per category can be retrieved at a time, so the system runs for 7 days, downloading data daily to grow the dataset.
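The daily collection step can be sketched as follows, assuming the newsapi-python client (the function name and output path are illustrative, not the project's actual code):

```python
# Sketch of the daily collection step, assuming the newsapi-python client.
import json

CATEGORIES = ["science", "general", "health", "business", "entertainment", "sports"]

def fetch_daily_articles(api_key, out_path="articles.json"):
    """Download up to 100 top headlines per category (the API's per-request
    cap) and store them in a JSON file keyed by category."""
    from newsapi import NewsApiClient  # pip install newsapi-python
    newsapi = NewsApiClient(api_key=api_key)
    articles = {}
    for category in CATEGORIES:
        resp = newsapi.get_top_headlines(category=category,
                                         language="en", page_size=100)
        articles[category] = resp["articles"]
    with open(out_path, "w") as f:
        json.dump(articles, f)
```

Running this once per day and merging the results into the stored JSON file is what grows the dataset over the 7-day window.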
Since this project uses keywords to distinguish article categories, the system uses the nltk library for natural language processing and the scikit-learn library to build a dictionary.
First, the text is tokenized into words; each word's part of speech is tagged, and the word is lemmatized back to its base form.
Next, after filtering out stop words, the frequency of each word is calculated, and high-frequency words are recorded in the dictionary for later computation.
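The dictionary-building step above can be illustrated with a simplified sketch. The project itself uses nltk's tokenizer, POS tagger, lemmatizer, and stop-word corpus; here a plain-Python stand-in (with a tiny hand-picked stop-word set) shows the idea:

```python
from collections import Counter

# Small stand-in stop-word list; the project uses nltk.corpus.stopwords.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def build_dictionary(texts, top_n=1000):
    """Count word frequencies across all articles and keep the top_n
    high-frequency words as the feature dictionary."""
    counts = Counter()
    for text in texts:
        words = [w.lower() for w in text.split()]          # tokenize (nltk.word_tokenize in the project)
        words = [w for w in words if w.isalpha()]          # drop punctuation and numbers
        words = [w for w in words if w not in STOP_WORDS]  # filter stop words
        counts.update(words)
    return [word for word, _ in counts.most_common(top_n)]

vocab = build_dictionary(["The match was a great match",
                          "Stocks fell in early trading"])
```

Words that survive filtering are ranked by frequency, so the most common content words ("match" above) head the dictionary.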
Three supervised learning models are applied in this project: Naive Bayes, SVM, and Logistic Regression. The Naive Bayes model considers both unigrams and bigrams.
The data stored in the JSON file is randomly split into 80% training data and 20% test data.
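The modeling and split steps can be sketched with scikit-learn; the toy texts, labels, and pipeline names below are illustrative, not the project's actual data:

```python
# Condensed sketch: 80/20 split, then the three classifiers, with
# unigrams + bigrams for the Naive Bayes pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["the team won the game", "the stock market fell",
         "the striker scored a goal", "shares rose after earnings",
         "the coach praised the players", "the company reported profits"]
labels = ["sports", "business", "sports", "business", "sports", "business"]

# Random 80/20 split of the stored articles.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

models = {
    # Naive Bayes over unigrams AND bigrams.
    "naive_bayes": make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
    "svm": make_pipeline(CountVectorizer(), LinearSVC()),
    "logistic_regression": make_pipeline(CountVectorizer(), LogisticRegression()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

Refitting these pipelines on each day's accumulated data is what lets accuracy be tracked against dataset size.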
The system trains and evaluates the models every day to observe the correlation between the amount of data and prediction accuracy.
Over the 7-day experiment, about 1,300 articles were collected; their distribution across the six categories is shown in the plot below.
According to the results, the SVM model performs best when the amount of data is small, with an accuracy of 0.60, followed by Logistic Regression at 0.55 and Naive Bayes at 0.53.
The confusion matrix shows that among the six categories, sports articles are the most recognizable, with a per-class accuracy as high as 96.6%, followed by business and entertainment news at 68.6% and 62.7%.
The remaining three categories have insufficient data, making their articles hard to distinguish and lowering overall accuracy.
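The per-category accuracy figures above come from the confusion matrix: each category's score is its diagonal entry divided by its row total. A minimal sketch with toy labels (the real values come from the trained models):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels; real values come from the trained model.
y_true = ["sports", "sports", "business", "business", "health"]
y_pred = ["sports", "sports", "business", "sports", "health"]
labels = ["business", "health", "sports"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
# Per-category accuracy (recall): correct predictions over row totals.
per_class = cm.diagonal() / cm.sum(axis=1)
for label, acc in zip(labels, per_class):
    print(f"{label}: {acc:.1%}")
```

A category with few examples has a small row total, so a handful of misclassifications drags its score down sharply, which is why the three under-represented categories score poorly.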
Start with a Python file containing any of the machine learning models, e.g. NaiveBayes.py.
Download the code and replace the api_key with your own.
- Get a free API Key at NewsAPI
- Replace api_key with your own:
  newsapi = NewsApiClient(api_key=' your api key here ')
- Run the Python file
- Check the plot to see the prediction results
Use the plots to check the results, e.g.:
Articles: {'science': 61, 'general': 226, 'health': 133, 'business': 294, 'entertainment': 270, 'sports': 341}
Or copy an article's content and run it through the model to make a prediction.
See the open issues for a list of proposed features (and known issues).
Yu-Chieh Wang - LinkedIn
email: angelxd84130@gmail.com