
This is the backend of an Alexa skill that comments on any given news headline based on information that was gathered from Reddit.


andy-96/nlp-reddit-worldnews


This repo was created as part of the Applied Deep Learning for NLP course held in WS 2020/2021 at the Technical University of Munich.

Introduction

There are always situations where you don't have the perfect response to a political topic. With this project, we have solved this problem!

We've built an Alexa skill that comments on any given news headline based on information that was gathered from Reddit.

See the Examples section below.

Quickstart

OpenNMT model

  1. Download the pretrained OpenNMT model
  2. Serve the model using TensorFlow Serving (inspired by this example)
    docker run -p 9000:9000 -v $PWD:/models --name tensorflow_serving --entrypoint tensorflow_model_server tensorflow/serving --enable_batching=true --port=9000 --model_base_path=/models/hot-take-model --model_name=hot-take-model
  3. Install pipenv: pip install pipenv
  4. Set up a virtual environment using pipenv: pipenv --python 3.8
  5. Install all dependencies: pipenv install
  6. Create an environment file based on .env.example
  7. Start the server: pipenv run python3 -m api.main --model openmnt
     Or, to test the model directly: pipenv run python3 -m api.model.comment_generator_opennmt --headline "This is amazing"
  8. Test: curl -X POST localhost:8000/generate-comment -d '{"headline": "This is amazing"}'

Self-trained Transformer model

  1. Download the 200k_comment_model
  2. Move the downloaded files into api/model/pretrained
  3. Install pipenv: pip install pipenv
  4. Set up a virtual environment using pipenv: pipenv --python 3.8
  5. Install all dependencies: pipenv install
  6. Create an environment file based on .env.example
  7. Start the server: pipenv run python3 -m api.main --model 200k_comment_model --preprocessed_data
  8. Test: curl -X POST localhost:8000/generate-comment -d '{"headline": "This is amazing"}'
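
The curl test above can also be scripted. The following is a minimal Python sketch, assuming the server listens on localhost:8000 as in the quickstart and accepts the JSON body shown in the curl command. The response format is not documented here, so the raw body is returned unparsed. The names build_request and generate_comment are illustrative and not part of this repo.

```python
import json
import urllib.request

# Endpoint taken from the curl example in the quickstart.
API_URL = "http://localhost:8000/generate-comment"


def build_request(headline: str, url: str = API_URL) -> urllib.request.Request:
    """Build a POST request carrying the same JSON body as the curl command."""
    payload = json.dumps({"headline": headline}).encode("utf-8")
    # Assumption: the server accepts a JSON content type.
    # The curl example does not set this header explicitly.
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_comment(headline: str, url: str = API_URL) -> str:
    """Send the headline to a running server and return the raw response body."""
    request = build_request(headline, url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With the server running, generate_comment("This is amazing") should return whatever the API responds with for that headline.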

Documentation

Project structure

/alexa-skill: Code of the Alexa skill. Can be uploaded directly to the Alexa developer console or deployed on AWS Lambda

/api: Main code of the API and model

/assets: Images, PDFs, etc.

/data: Raw and processed data

/data-acquisition: Jupyter notebooks and Python code for acquiring the data

/pretrained: Pretrained models

Data Acquisition

We used PRAW and pushshift.io to crawl Reddit posts and their top-level comments from the subreddits worldnews, news, politics, upliftingnews and truenews, covering the last three years.
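
As a rough illustration of the PRAW part of this step, the sketch below collects headlines together with their top-level comments. It is not the code from /data-acquisition: the function names, the choice of the top listing and the limit are assumptions, and the pushshift.io part (needed for older posts) is not shown. Reddit API credentials are required to create a real client.

```python
from typing import Iterator, Tuple

# Subreddits named in this README.
SUBREDDITS = ["worldnews", "news", "politics", "upliftingnews", "truenews"]


def make_client(client_id: str, client_secret: str, user_agent: str):
    """Create a read-only PRAW client (requires `pip install praw`)."""
    import praw  # imported lazily so the rest of the sketch runs without it

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def iter_headline_comment_pairs(
    reddit, subreddit_name: str, limit: int = 100
) -> Iterator[Tuple[str, str, int]]:
    """Yield (headline, comment body, comment score) for top-level comments."""
    # Assumption: the "top" listing of the past year; the authors' exact query is unknown.
    for submission in reddit.subreddit(subreddit_name).top(time_filter="year", limit=limit):
        # Drop "load more comments" placeholders instead of fetching them.
        submission.comments.replace_more(limit=0)
        # Iterating the comment forest directly yields only top-level comments.
        for comment in submission.comments:
            yield submission.title, comment.body, comment.score
```

Reddit listings are capped at roughly 1,000 items per query, which is one reason an archive such as pushshift.io is useful for covering several years.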

Data Preprocessing

We filtered the data in three steps:

  • Removing comments with negative scores
  • Filtering out comments based on keywords (deleted, removed, tl;dr)
  • Keeping only one sentence per comment to reduce complexity in training
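
The filtering steps above can be sketched as a single function. This is an illustration rather than the repo's actual preprocessing code: the function name, the substring-based keyword match, the regex sentence splitter and the choice to keep the first sentence are all assumptions.

```python
import re
from typing import Optional

# Keywords named in this README; matching is case-insensitive.
BLOCKED_KEYWORDS = ("deleted", "removed", "tl;dr")

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def preprocess_comment(body: str, score: int) -> Optional[str]:
    """Return one cleaned sentence, or None if the comment is filtered out."""
    # Step 1: drop comments with a negative score.
    if score < 0:
        return None
    # Step 2: drop comments containing a blocked keyword.
    # Note: a plain substring match also drops e.g. "he was removed from office".
    lowered = body.lower()
    if any(keyword in lowered for keyword in BLOCKED_KEYWORDS):
        return None
    # Collapse newlines and repeated spaces before splitting.
    text = " ".join(body.split())
    if not text:
        return None
    # Step 3: keep a single sentence (assumption: the first one).
    return _SENTENCE_BOUNDARY.split(text, maxsplit=1)[0]
```

For example, preprocess_comment("This is great. Really!", 5) keeps only "This is great.", while a "[deleted]" comment or one with a score of -1 is dropped.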

Model

Since this task can be framed as a translation task (from headline to comment), we used a Transformer-based architecture. In total, we trained three models from scratch:

  1. 50k_comments_model (implementation based on the lecture code, not provided in this repo)

    • Did not properly learn grammatical structure
    • Could not understand the context of the headline
  2. 200k_comments_model (implementation based on the lecture code)

    • One epoch took approx. 4 hours, which made training impractical with access only to Google Colab's GPUs
  3. OpenNMT Transformer model

    • Learned grammatical structure
    • Could give coherent answers to new, unseen headlines
    • Is quite opinionated
    • Learned correlations between topics were visible, e.g. connecting Republicans/Democrats with Socialists/Communists
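
Framing the task as translation means each headline acts as the source sentence and its comment as the target sentence. Translation toolkits such as OpenNMT typically read such pairs from two line-aligned text files. The sketch below writes that layout; the function name and file names are hypothetical and not taken from this repo.

```python
from typing import Iterable, Tuple


def write_parallel_corpus(
    pairs: Iterable[Tuple[str, str]], src_path: str, tgt_path: str
) -> int:
    """Write headlines and comments to two files where line i of each file forms a pair.

    Returns the number of pairs written.
    """
    count = 0
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for headline, comment in pairs:
            # Each example must occupy exactly one line, so collapse all whitespace.
            headline = " ".join(headline.split())
            comment = " ".join(comment.split())
            # Skip incomplete pairs so the two files stay aligned.
            if not headline or not comment:
                continue
            src.write(headline + "\n")
            tgt.write(comment + "\n")
            count += 1
    return count
```

A call such as write_parallel_corpus(pairs, "headlines.txt", "comments.txt") produces one source file and one target file with the same number of lines.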

Deployment

For demonstration purposes, we deployed our project into a production environment using the OpenNMT model. On an AWS server instance, we started two Docker containers: one serves the TensorFlow model using TensorFlow Serving, and the other runs the Python code that mediates between Alexa and the model and handles pre- and postprocessing of the data.

Potential Extensions

Even though the performance is reasonably good, we propose the following extensions:

  • Improve the preprocessing of Reddit comments, e.g. using a spellchecker
  • Use multi-sentence comments as target data
  • Classify comments by sentiment to create a dynamic comment generator
  • Collect a larger amount of data
  • Extend to domains beyond news/politics
  • Deploy this service as a bot on Reddit

Examples

"Estonia to become the only country in the world with a female president and female prime minister"

it's almost as if the ultra rich don't give a shit.

"politicians want to blow up the planet by two thousand twenty two"

this is why i don't have a problem.

"EU urges biden to help draft joint rule book to rein in tech giants"

this is the kind of stuff that makes me happy.

Learnings

  • Exploring and understanding the data early is crucial for anticipating patterns the model might learn
  • It makes sense to use a library like OpenNMT to get a good baseline and then improve upon it
  • Even though we "only" collected a little over 250k comments, the model learned the grammatical structure of the English language and could give coherent answers

If there are any questions or problems, feel free to create an issue!

With lots of ❤️ by Maja & Andy
