
Summer 2019 research internship project that analyzes archived web articles' linguistic cues as time series




Fake News in Time

Analysis of fake news language evolution in time
Report Bug · Request Feature

Table of Contents

About The Project

As part of the Texas A&M — University of Cyprus Student Exchange Program, this is a Summer 2019 research internship project. As technology evolved to stop the propagation of fake news, the propagandists and the people who deliberately share false content adapted. Analyzing the evolution of fake news language provides new cues for detecting fake news.

Because many fake news articles may have been removed from the web, this project uses the Web Archive's snapshots, where webpages remain available over time.
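The Wayback CDX Server API is queried over plain HTTP with URL parameters. A minimal sketch of building such a query (the domain and date range below are placeholders, not values from this repository):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def build_cdx_query(domain, year_from, year_to):
    """Build a Wayback CDX Server API query URL for a domain's snapshots."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains of the target site
        "from": str(year_from),      # earliest snapshot year
        "to": str(year_to),          # latest snapshot year
        "output": "json",
        "filter": "statuscode:200",  # keep only successfully archived pages
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Example: snapshots of a (hypothetical) news site between 2016 and 2019
print(build_cdx_query("example.com", 2016, 2019))
```

The response is a JSON array of snapshot records, each containing a timestamp and the original URL, which is what the cdx.py spider iterates over.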

The project is divided into three steps:


  1. Data Crawling and Web Scraping
    • Using the Scrapy framework, the project breaks this process down into two crawlers:
      • cdx.py spider
        • This collects valid snapshots from the Wayback CDX Server API. It deploys crawlers to the snapshot URLs and extracts URLs from each page using Scrapy's link extractor. After collecting the URLs, it inserts the data into MongoDB.
      • url_article.py spider
        • This starts by running aggregations on the urls collection (the default in config.py) in MongoDB. It then filters the aggregated URLs and parses each article using the Newspaper3k library. The spider inserts the articles' metadata into an article collection or a filter collection.
      • You can avoid running the two separate crawlers by using the article.py spider, which combines them. It both collects and filters URLs from Wayback CDX Server API snapshots and then crawls the article URLs, skipping the insertions into the urls collection. (At this time, it has not been fully tested.)
  2. Text Analytics and Natural Language Processing (NLP)
    • This project employs Check-It's¹ feature engineering component, which divides linguistic features into:
      • Part-of-Speech
      • Readability and Vocabulary Richness
      • Sentiment Analysis
      • Surface and Syntax Punctuation
    • The code for the feature engineering is not yet publicly available; this repository will be updated when it is. For now, this article's sentiment analysis section provides an alternative to Check-It's sentiment score.
  3. Statistical Analysis on Time-series Data
    • This step takes the CSV file of features extracted in the previous step. The pandas library converts the CSV file into a data frame, and the Matplotlib and seaborn libraries handle plotting the data as time-series graphs. Drawing statistical conclusions from the plots requires applying your own knowledge and intuition.
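The third step can be sketched with pandas alone; the column names and values below are assumptions for illustration, not the actual feature CSV produced by this project:

```python
import io
import pandas as pd

# Hypothetical extract of a feature CSV; in the project this file comes
# from the previous feature-engineering step.
csv_text = """date,label,sentiment
2017-01-03,fake,0.12
2017-02-14,fake,0.35
2017-02-20,factual,0.61
2017-03-09,factual,0.58
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Average each feature per month and per label to obtain a time series.
monthly = (
    df.set_index("date")
      .groupby("label")
      .resample("MS")["sentiment"]   # "MS" = month-start frequency
      .mean()
)
print(monthly)
# Plotting (not shown): e.g. monthly.unstack("label").plot() via
# Matplotlib, or the equivalent seaborn line plot.
```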

Essentially, though, the main purpose of this project is the data crawling and web scraping component, which serves as a basis for future work.

¹ Demetris Paschalides, Alexandros Kornilakis, Chrysovalantis Christodoulou, Rafael Andreou, George Pallis, Marios D. Dikaiakos, and Evangelos Markatos. 2019. Check-It: A Plugin for Detecting and Reducing the Spread of Fake News and Misinformation on the Web. arXiv:1905.04260 https://arxiv.org/abs/1905.04260v1

Built With

Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

Installation

  1. Clone the repo
git clone https://github.com/anguyen120/fake-news-in-time.git  
  2. Go inside the repo folder
cd /folder/to/fake-news-in-time
  3. Install pip packages
pip3 install -r requirements.txt

Usage

If you plan on storing data in MongoDB, be sure to have it running beforehand. If not, make the necessary adjustments in the code for your preferred storage.
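A quick, stdlib-only way to confirm that something is listening on MongoDB's default port before launching the spiders (the host and port below are MongoDB's defaults; adjust them to match your config):

```python
import socket

def mongo_is_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if something accepts TCP connections at host:port.

    This only checks reachability; it does not authenticate or verify
    that the listener is actually MongoDB.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if not mongo_is_reachable():
    print("MongoDB does not appear to be running; start it before crawling.")
```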

Before starting, it is recommended to check each component's respective *config.py.

  • scrapy_config.py is located in .../fake-news-in-time/scrapy_archive/archive/
  • scrapy_config.py is located in .../fake-news-in-time/feature_engineering/
  • timeseries_config.py is located in .../fake-news-in-time/timeseries/

A curated list of fake and factual news sites is provided. The lists are influenced by the blacklist in Check-It, slightly modified by using Newspaper3k's popular URLs function for the factual news sites. You are more than welcome to use your own.
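If you maintain your own list, one lightweight approach is a plain text file of domains, normalized and deduplicated before use. The file format and helper names below are assumptions for illustration, not the repository's actual layout:

```python
from urllib.parse import urlparse

def normalize_domain(entry):
    """Reduce a line like 'https://www.example.com/path' to 'example.com'."""
    entry = entry.strip().lower()
    if not entry or entry.startswith("#"):
        return None  # skip blanks and comments
    if "://" not in entry:
        entry = "http://" + entry  # urlparse needs a scheme to find the netloc
    host = urlparse(entry).netloc
    return host[4:] if host.startswith("www.") else host

def load_site_list(lines):
    """Deduplicate domains while preserving order of first appearance."""
    seen = {}
    for line in lines:
        domain = normalize_domain(line)
        if domain and domain not in seen:
            seen[domain] = True
    return list(seen)

sites = load_site_list([
    "https://www.example.com/politics",  # placeholder entries
    "example.com",
    "# a comment",
    "news.example.org",
])
print(sites)  # → ['example.com', 'news.example.org']
```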

Data Crawling and Web Scraping

To deploy the spider, go to your terminal:

cd folder/to/fake-news-in-time/scrapy_archive/archive/

Depending on your preference, this component can be launched with the cdx.py spider followed by the url_article.py spider, or with the article.py spider alone.

For the cdx.py spider followed by the url_article.py spider:

  • scrapy crawl cdx
  • Once the urls collection reaches an appropriate size, call the url_article.py spider:
    • scrapy crawl url_article

For the article.py spider:

  • scrapy crawl article

If you need to run multiple spiders, it is highly encouraged to run them in the same process; Scrapy provides documentation on how to do so.

Text Analytics and Natural Language Processing (NLP)

As previously mentioned, Check-It's feature engineering component, which this step relies on, is not publicly available at this time. For now, feature.py is provided as a skeleton for aggregating the article collection and storing the extracted features in a CSV file.
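The shape of such a skeleton, sketched with the stdlib only: iterate over article records and write one feature row per article to a CSV file. The field names and the placeholder features below are assumptions; Check-It's real linguistic features would replace them:

```python
import csv
import io

def extract_features(article):
    """Placeholder feature extraction; Check-It's real features go here."""
    words = article["text"].split()
    return {
        "date": article["date"],
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
    }

def write_feature_csv(articles, out_file):
    """Write one feature row per article to an open file-like object."""
    fieldnames = ["date", "word_count", "avg_word_length"]
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for article in articles:
        writer.writerow(extract_features(article))

# In the project the articles would come from the MongoDB article collection.
buf = io.StringIO()
write_feature_csv([{"date": "2017-01-03", "text": "breaking news tonight"}], buf)
print(buf.getvalue())
```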

Statistical Analysis on Time-series Data

Before running this step, there should be a CSV file containing the articles' extracted features in the same directory as timeseries.py (.../fake-news-in-time/timeseries/).

To run timeseries.py from your terminal, go to the time-series component path:

cd folder/to/fake-news-in-time/timeseries/

Run the script:

python3 timeseries.py

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Alan Nguyen - anguyen120@pm.me

Project Link: https://github.com/anguyen120/fake-news-in-time

Acknowledgements
