
Duck News Reporters

This repository contains the source code for the programs written for this project. This README guides the user through our file structure and gives general hints on how to run the code.

Also see ATTRIBUTIONS.md for a list of attributions. A plaintext version is included at ATTRIBUTIONS.txt for those without a Markdown renderer.

Directory structure

All code is contained within the src directory.

Dataset

We include the dataset in src/data/Horne2017_FakeNewsData/Buzzfeed. Along with the raw dataset, this directory also contains intermediate cached files holding our contributions. These are loaded automatically by our pipeline.

  • context.csv - All the URLs manually gathered for each input article, together with the summary we used.
  • context/ - A folder containing all the downloaded article content. It is not really intended for browsing; it is cached here so we don't have to re-fetch the articles, and its contents are joined automatically when the dataset is loaded (see the sketch after this list).
  • features_*.csv - All of our features after running the pipeline. These act as a cache so we don't have to run the whole pipeline multiple times.
  • tf_model/ - The saved model for our neural-network classifier. It is loaded by main.py when running our pipeline.
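These caches are meant to be transparent to the user, but they can be inspected by hand. Here is a minimal sketch of doing so; pandas and every file-layout detail below are assumptions for illustration, and the pipeline's real loading logic lives in dataset.py:

```python
# Minimal sketch for inspecting the cached files by hand. All file-layout
# details are assumptions for illustration; see dataset.py for the real
# loading logic.
from pathlib import Path
import pandas as pd

data = Path("src/data/Horne2017_FakeNewsData/Buzzfeed")

context = pd.read_csv(data / "context.csv")   # URLs + summaries per article
print(context.head())

for cached in (data / "context").iterdir():   # downloaded article bodies
    print(cached.name, len(cached.read_text(encoding="utf-8")), "chars")
```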

Software

We include both helper code in *.py files and notebooks in *.ipynb files. The notebooks have been pre-run, so they can be viewed simply by opening them. The structure of the Python files is as follows:

  • classification.py - Contains the helpers for our classification models and a pipeline that loads all features.
  • dataset.py - Contains helpers to load the dataset from the `data/` folder.
  • eda.py - Contains helper functions to perform data analysis.
  • non_latent_features.py - Contains helpers used to extract non-latent features.
  • preprocess.py - Contains the helper to preprocess and tokenize input.
  • sentiment.py - Contains a legacy helper used to extract sentiment. It has been superseded by code within non_latent_features.py.
  • similarity.py - Contains the summary extractor and similarity model.

  • main.py - Takes the user through a run of our pipeline: fetching any article, prompting the user to enter 3 context articles based on a summary shown to them, building the features, and performing an inference (a sketch of this flow follows). Note: this is currently broken. It will run to the end but produce nonsensical results.
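For orientation, here is a rough sketch of the flow main.py steps through. The module names are real files from this repository, but every function name below is an assumption made for illustration; consult the files themselves for the actual API.

```python
# Hypothetical sketch of main.py's pipeline; all function names are
# assumptions for illustration, not the repository's real API.
import dataset
import preprocess
import non_latent_features
import similarity
import classification

url = "https://example.com/some-political-article"

article = dataset.fetch(url)                          # assumed helper
summary = similarity.summarize(article)               # assumed helper
context = prompt_for_context_articles(summary, n=3)   # assumed helper

tokens = preprocess.tokenize(article)                 # assumed helper
features = non_latent_features.extract(tokens)        # assumed helper
features += similarity.score(article, context)        # assumed helper

model = classification.load("data/Horne2017_FakeNewsData/Buzzfeed/tf_model")
print(model.predict([features]))                      # real vs. fake
```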

Running code

You will need python>=3.10 installed, and no guarantees are made about GPU support. The ducknewsreporters team used a combination of Google Colab and local WSL2 systems with CUDA installed to run the software. We believe our code should automatically fall back to CPU.

To run the code:

```sh
# Make sure you have python>=3.10 installed
$ cd src
# This will install all the dependencies. (May take a while)
$ ./install
$ python3 main.py
```
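The saved tf_model/ directory suggests a TensorFlow backend. If you want to confirm which device will be used before running the full pipeline, a minimal check (assuming TensorFlow 2.x is among the installed dependencies) is:

```python
# List the GPUs TensorFlow can see; with none visible, TensorFlow
# transparently runs all ops on the CPU.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible:", gpus if gpus else "none - running on CPU")
```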
