
Summer 2019 research internship project that analyzes archived web articles' linguistic cues as time series




Fake News in Time

Analysis of fake news language evolution in time
Report Bug · Request Feature

Table of Contents

About The Project

As part of the Texas A&M — University of Cyprus Student Exchange Program, this is a Summer 2019 research internship project. As technology evolved to stop the propagation of fake news, the propagandists and the people who deliberately share false content adapted. Analyzing the evolution of fake news language provides new cues for detecting fake news.

Because many fake news articles may have been removed from the web, this project uses the Web Archive's snapshots, where webpages remain available over time.
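The Wayback CDX Server API is queried over plain HTTP with URL parameters. A minimal sketch of building such a query (the domain and date range below are placeholders, not values from this repository):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def build_cdx_query(domain, year_from, year_to):
    """Build a Wayback CDX Server API query URL for a domain's snapshots."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains of the target site
        "from": str(year_from),      # earliest snapshot year
        "to": str(year_to),          # latest snapshot year
        "output": "json",
        "filter": "statuscode:200",  # keep only successfully archived pages
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Example: snapshots of a (hypothetical) news site between 2016 and 2019
print(build_cdx_query("example.com", 2016, 2019))
```

The response is a JSON array of snapshot records, each containing a timestamp and the original URL, which is what the cdx.py spider iterates over.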

The project is divided into three steps:


  1. Data Crawling and Web Scraping
    • Using the Scrapy framework, the project breaks this process down into two crawlers:
      • cdx.py spider
        • This collects valid snapshots from the Wayback CDX Server API. It deploys crawlers to the snapshot URLs and extracts URLs from each page using Scrapy's link extractor. After collecting the URLs, it inserts the data into MongoDB.
      • url_article.py spider
        • This starts by running aggregations on the urls collection (the default in config.py) in MongoDB. It then filters the aggregated URLs and parses each article using the Newspaper3k library. The spider inserts the articles' metadata into an article collection or a filter collection.
      • You can avoid running the two separate crawlers by using the article.py spider, which combines them. It both collects and filters URLs from Wayback CDX Server API snapshots and then crawls the article URLs, skipping the insertions into the urls collection. (At this time, it has not been fully tested.)
  2. Text Analytics and Natural Language Processing (NLP)
    • This project employs Check-It's¹ feature engineering component, which divides linguistic features into:
      • Part-of-Speech
      • Readability and Vocabulary Richness
      • Sentiment Analysis
      • Surface and Syntax Punctuation
    • The code for the feature engineering is not yet publicly available; this repository will be updated when it is. For now, this article's sentiment analysis section provides an alternative to Check-It's sentiment score.
  3. Statistical Analysis on Time-series Data
    • This step takes the CSV file of features extracted in the previous step. The pandas library converts the CSV file into a data frame, and the Matplotlib and seaborn libraries handle plotting the data as time-series graphs. Drawing statistical conclusions from the plots requires applying your own knowledge and intuition.
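The third step can be sketched with pandas alone; the column names and values below are assumptions for illustration, not the actual feature CSV produced by this project:

```python
import io
import pandas as pd

# Hypothetical extract of a feature CSV; in the project this file comes
# from the previous feature-engineering step.
csv_text = """date,label,sentiment
2017-01-03,fake,0.12
2017-02-14,fake,0.35
2017-02-20,factual,0.61
2017-03-09,factual,0.58
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Average each feature per month and per label to obtain a time series.
monthly = (
    df.set_index("date")
      .groupby("label")
      .resample("MS")["sentiment"]   # "MS" = month-start frequency
      .mean()
)
print(monthly)
# Plotting (not shown): e.g. monthly.unstack("label").plot() via
# Matplotlib, or the equivalent seaborn line plot.
```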

Essentially, though, the main purpose of this project is the data crawling and web scraping component, which serves as a basis for future work.

¹ Demetris Paschalides, Alexandros Kornilakis, Chrysovalantis Christodoulou, Rafael Andreou, George Pallis, Marios D. Dikaiakos, and Evangelos Markatos. 2019. Check-It: A Plugin for Detecting and Reducing the Spread of Fake News and Misinformation on the Web. arXiv:1905.04260 https://arxiv.org/abs/1905.04260v1

Built With

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

Installation

  1. Clone the repo
git clone https://github.com/anguyen120/fake-news-in-time.git  
  2. Go inside the repo folder
cd /folder/to/fake-news-in-time
  3. Install pip packages
pip3 install -r requirements.txt

Usage

If you plan on storing data in MongoDB, be sure to have it running beforehand. If not, make the necessary adjustments in the code for your preferred storage.
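A quick, stdlib-only way to confirm that something is listening on MongoDB's default port before launching the spiders (the host and port below are MongoDB's defaults; adjust them to match your config):

```python
import socket

def mongo_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if something accepts TCP connections at host:port.

    This only checks reachability; it does not authenticate or verify
    that the listener is actually MongoDB.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not mongo_is_reachable():
    print("MongoDB does not appear to be running; start it before crawling.")
```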

Before starting, it is recommended to check each component's respective *config.py.

  • scrapy_config.py is located in .../fake-news-in-time/scrapy_archive/archive/
  • scrapy_config.py is located in .../fake-news-in-time/feature_engineering/
  • timeseries_config.py is located in .../fake-news-in-time/timeseries/

A curated list of fake and factual news sites is provided. The lists are influenced by the blacklist in Check-It, slightly modified by using Newspaper3k's popular URLs function for the factual news sites. You are more than welcome to use your own.
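If you maintain your own list, one lightweight approach is a plain text file of domains, normalized and deduplicated before use. The file format and helper names below are assumptions for illustration, not the repository's actual layout:

```python
from urllib.parse import urlparse

def normalize_domain(entry):
    """Reduce a line like 'https://www.example.com/path' to 'example.com'."""
    entry = entry.strip().lower()
    if not entry or entry.startswith("#"):
        return None  # skip blanks and comments
    if "://" not in entry:
        entry = "http://" + entry  # urlparse needs a scheme to find the netloc
    host = urlparse(entry).netloc
    return host[4:] if host.startswith("www.") else host

def load_site_list(lines):
    """Deduplicate domains while preserving order of first appearance."""
    seen = {}
    for line in lines:
        domain = normalize_domain(line)
        if domain and domain not in seen:
            seen[domain] = True
    return list(seen)

sites = load_site_list([
    "https://www.example.com/politics",  # placeholder entries
    "example.com",
    "# a comment",
    "news.example.org",
])
print(sites)  # → ['example.com', 'news.example.org']
```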

Data Crawling and Web Scraping

To deploy the spider, go to your terminal:

cd folder/to/fake-news-in-time/scrapy_archive/archive/

Depending on your preference, this component can be launched with the cdx.py spider followed by the url_article.py spider, or with the article.py spider alone.

For the cdx.py spider followed by the url_article.py spider:

  • scrapy crawl cdx
  • Once the urls collection reaches an appropriate size, call the url_article.py spider:
    • scrapy crawl url_article

For the article.py spider:

  • scrapy crawl article

If you need to run multiple spiders, it is highly encouraged to run them in the same process; Scrapy provides documentation on how to do so.

Text Analytics and Natural Language Processing (NLP)

As previously mentioned, Check-It's feature engineering component, which this step relies on, is not publicly available at this time. For now, feature.py is provided as a skeleton for aggregating the article collection and storing the extracted features in a CSV file.
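The shape of such a skeleton, sketched with the stdlib only: iterate over article records and write one feature row per article to a CSV file. The field names and the placeholder features below are assumptions; Check-It's real linguistic features would replace them:

```python
import csv
import io

def extract_features(article):
    """Placeholder feature extraction; Check-It's real features go here."""
    words = article["text"].split()
    return {
        "date": article["date"],
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

def write_feature_csv(articles, out_file):
    """Write one feature row per article to an open file-like object."""
    fieldnames = ["date", "word_count", "avg_word_length"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for article in articles:
        writer.writerow(extract_features(article))

# In the project the articles would come from the MongoDB article collection.
buf = io.StringIO()
write_feature_csv([{"date": "2017-01-03", "text": "breaking news tonight"}], buf)
print(buf.getvalue())
```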

Statistical Analysis on Time-series Data

Before running this step, there should be a CSV file containing the articles' extracted features in the same directory as timeseries.py (.../fake-news-in-time/timeseries/).

To run timeseries.py from your terminal, go to the time-series component path:

cd folder/to/fake-news-in-time/timeseries/

Run the script:

python3 timeseries.py

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Alan Nguyen - anguyen120@pm.me

Project Link: https://github.com/anguyen120/fake-news-in-time

Acknowledgements
