cnstll/detect_synthetic_news

Detect Synthetic News

ChatGPT was released on 30 November 2022. Professionals in a variety of positions and across different industries quickly began to use it.
In this context, the project aims to assess the use of GPT in the news industry and to see whether "synthetic news" can be distinguished from "real news".
The learning purpose of this project is to design and build an ML system that exposes the results of the detection model to end users through an interface. Disclaimer: I am aware that detecting whether a text was generated by an LLM is a very difficult task.
For many reasons, the results of this project are not to be trusted.

Extraction of News

I used a news API to source a batch of metadata for several news articles.
The metadata elements to be labeled are the title, the description, and a sample of the content.
I limited the experiment to three groups of news actors: newspapers, cable news, and tech news.
I selected the top news sources in each group based on their reach:

  • Top newspapers in the US based on circulation numbers: article
  • Top cable news based on households reached: article
  • Top online tech news based on appearance in search engine results
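
The extraction step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: it assumes each article arrives as a dictionary with `source`, `title`, `description`, and `content` keys, and the source identifiers in `SOURCE_GROUPS` are hypothetical placeholders for the selected outlets. The functions `extract_text_fields` and `group_articles` are names introduced here for the example.

```python
from collections import defaultdict
from typing import Any, Dict, List

# The three text fields that get labeled (field names are an assumption
# about the shape of the news API response).
TEXT_FIELDS = ("title", "description", "content")

# Hypothetical mapping from a source identifier to one of the three groups.
# The real project would list its selected top sources here.
SOURCE_GROUPS = {
    "example-newspaper": "newspapers",
    "example-cable-channel": "cable news",
    "example-tech-site": "tech news",
}


def extract_text_fields(article: Dict[str, Any]) -> Dict[str, str]:
    """Keep only the fields to be labeled; missing or null values become ''."""
    return {field: (article.get(field) or "").strip() for field in TEXT_FIELDS}


def group_articles(articles: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, str]]]:
    """Group the extracted text fields by news group, skipping unknown sources."""
    grouped: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for article in articles:
        group = SOURCE_GROUPS.get(article.get("source"))
        if group is None:
            continue  # source is not among the selected outlets
        grouped[group].append(extract_text_fields(article))
    return dict(grouped)
```

Articles from sources outside the three groups are dropped, and any extra metadata (URL, author, and so on) is discarded so that only the text to be labeled remains.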

Models Used

Article and Research Papers