Automatic classification system
The goal is to build a system that continuously obtains real-world data, processes it, and applies machine learning to classify it. The hope is that by continuously and automatically collecting and training on new data, classification accuracy will improve.
In this project, the following steps were implemented: Data Mining, Data Preprocessing, Data Modeling, Data Training & Testing, and Evaluation.
Here's why:
- Real-world data is messy and must be handled carefully
- The accuracy of the trained model improves only as the amount of data accumulates
- The training results can be used in an automatic article filing and search system
The system collects articles through NewsAPI in six categories (science, general, health, business, entertainment, and sports) and stores them in a JSON file. Because of the API's download limit, at most 100 articles per category can be retrieved at a time, so the system runs for 7 days, downloading data daily to grow the dataset.
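The daily collection step can be sketched as follows, assuming the newsapi-python client (the function name and output path are illustrative, not the project's actual code):

```python
# Sketch of the daily collection step, assuming the newsapi-python client.
import json

CATEGORIES = ["science", "general", "health", "business", "entertainment", "sports"]

def fetch_daily_articles(api_key, out_path="articles.json"):
    """Download up to 100 top headlines per category (the API's per-request
    cap) and store them in a JSON file keyed by category."""
    from newsapi import NewsApiClient  # pip install newsapi-python
    newsapi = NewsApiClient(api_key=api_key)
    articles = {}
    for category in CATEGORIES:
        resp = newsapi.get_top_headlines(category=category,
                                         language="en", page_size=100)
        articles[category] = resp["articles"]
    with open(out_path, "w") as f:
        json.dump(articles, f)
```

Running this once per day and merging the results into the stored JSON file is what grows the dataset over the 7-day window.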
Since this project uses keywords to distinguish article categories, the system uses the nltk library for natural language processing and the scikit-learn library to build a dictionary.
First, the text is tokenized into words; each word's part of speech is tagged, and the word is lemmatized back to its base form.
Next, after filtering out stop words, the frequency of each word is calculated, and high-frequency words are recorded in the dictionary for later computation.
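The dictionary-building step above can be illustrated with a simplified sketch. The project itself uses nltk's tokenizer, POS tagger, lemmatizer, and stop-word corpus; here a plain-Python stand-in (with a tiny hand-picked stop-word set) shows the idea:

```python
from collections import Counter

# Small stand-in stop-word list; the project uses nltk.corpus.stopwords.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def build_dictionary(texts, top_n=1000):
    """Count word frequencies across all articles and keep the top_n
    high-frequency words as the feature dictionary."""
    counts = Counter()
    for text in texts:
        words = [w.lower() for w in text.split()]          # tokenize (nltk.word_tokenize in the project)
        words = [w for w in words if w.isalpha()]          # drop punctuation and numbers
        words = [w for w in words if w not in STOP_WORDS]  # filter stop words
        counts.update(words)
    return [word for word, _ in counts.most_common(top_n)]

vocab = build_dictionary(["The match was a great match",
                          "Stocks fell in early trading"])
```

Words that survive filtering are ranked by frequency, so the most common content words ("match" above) head the dictionary.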
Three supervised learning models are applied in this project: Naive Bayes, SVM, and Logistic Regression. The Naive Bayes model considers both unigrams and bigrams.
The data stored in the JSON file is randomly split into 80% training data and 20% test data.
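The modeling and split steps can be sketched with scikit-learn; the toy texts, labels, and pipeline names below are illustrative, not the project's actual data:

```python
# Condensed sketch: 80/20 split, then the three classifiers, with
# unigrams + bigrams for the Naive Bayes pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["the team won the game", "the stock market fell",
         "the striker scored a goal", "shares rose after earnings",
         "the coach praised the players", "the company reported profits"]
labels = ["sports", "business", "sports", "business", "sports", "business"]

# Random 80/20 split of the stored articles.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

models = {
    # Naive Bayes over unigrams AND bigrams.
    "naive_bayes": make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB()),
    "svm": make_pipeline(CountVectorizer(), LinearSVC()),
    "logistic_regression": make_pipeline(CountVectorizer(), LogisticRegression()),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

Refitting these pipelines on each day's accumulated data is what lets accuracy be tracked against dataset size.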
The system trains and evaluates the models every day to observe the correlation between the amount of data and prediction accuracy.
Over the 7-day experiment, about 1,300 articles were collected; their distribution across the six categories is shown in the plot below.
According to the results, the SVM model performs best when the amount of data is small, with an accuracy of 0.60, followed by Logistic Regression at 0.55 and Naive Bayes at 0.53.
The confusion matrix shows that among the six categories, sports articles are the most recognizable, with a per-class accuracy as high as 96.6%, followed by business and entertainment news at 68.6% and 62.7%.
The remaining three categories have insufficient data, making their articles hard to distinguish and lowering overall accuracy.
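The per-category accuracy figures above come from the confusion matrix: each category's score is its diagonal entry divided by its row total. A minimal sketch with toy labels (the real values come from the trained models):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true/predicted labels; real values come from the trained model.
y_true = ["sports", "sports", "business", "business", "health"]
y_pred = ["sports", "sports", "business", "sports", "health"]
labels = ["business", "health", "sports"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
# Per-category accuracy (recall): correct predictions over row totals.
per_class = cm.diagonal() / cm.sum(axis=1)
for label, acc in zip(labels, per_class):
    print(f"{label}: {acc:.1%}")
```

A category with few examples has a small row total, so a handful of misclassifications drags its score down sharply, which is why the three under-represented categories score poorly.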
Start with a Python file containing any of the machine learning models, e.g. NaiveBayes.py.
Download the code and replace the api_key with your own.
- Get a free API Key at NewsAPI
- Replace api_key with your own:
  newsapi = NewsApiClient(api_key=' your api key here ')
- Run the Python file
- Check the plot to see the prediction results
Use the plots to check the results, e.g.:
Articles: {'science': 61, 'general': 226, 'health': 133, 'business': 294, 'entertainment': 270, 'sports': 341}
Or copy an article's content and run it through the model to make a prediction.
See the open issues for a list of proposed features (and known issues).
Yu-Chieh Wang - LinkedIn
email: angelxd84130@gmail.com