This repo was created as part of the Applied Deep Learning for NLP course held in WS 2020/2021 at the Technical University of Munich.
There are always situations where you don't know the perfect answer to a political topic. With this project, we have solved this problem!
We've built an Alexa skill that comments on any given news headline based on information that was gathered from Reddit.
- Download pretrained OpenNMT model
- Serve the model using TensorFlow Serving (inspired by this example):

```
docker run -p 9000:9000 -v $PWD:/models --name tensorflow_serving --entrypoint tensorflow_model_server tensorflow/serving --enable_batching=true --port=9000 --model_base_path=/models/hot-take-model --model_name=hot-take-model
```
- Install pipenv: `pip install pipenv`
- Set up a virtual environment using pipenv: `pipenv --python 3.8`
- Install all dependencies: `pipenv install`
- Create an environment file based on `.env.example`
- Start the server: `pipenv run python3 -m api.main --model openmnt`
  - Or test the model directly: `pipenv run python3 -m api.model.comment_generator_opennmt --headline "This is amazing"`
- Test the endpoint: `curl -X POST localhost:8000/generate-comment -d '{"headline": "This is amazing"}'`
- Download the `200k_comment_model`
- Move it into `api/model/pretrained`
- Install pipenv: `pip install pipenv`
- Set up a virtual environment using pipenv: `pipenv --python 3.8`
- Install all dependencies: `pipenv install`
- Create an environment file based on `.env.example`
- Start the server: `pipenv run python3 -m api.main --model 200k_comment_model --preprocessed_data`
- Test the endpoint: `curl -X POST localhost:8000/generate-comment -d '{"headline": "This is amazing"}'`
- `/alexa-skill`: Code of the skill; can be uploaded directly to the Alexa developer console or deployed on AWS Lambda
- `/api`: Main code of the API and model
- `/assets`: Images, PDFs, etc.
- `/data`: Raw and processed data
- `/data-acquisition`: Jupyter notebooks and Python code for acquiring the data
- `/pretrained`: Pretrained models
We used PRAW and pushshift.io to crawl Reddit posts and top-level comments from the subreddits worldnews, news, politics, upliftingnews, and truenews over the last three years.
We cleaned the data using the following approaches:
- Filtering out comments with a negative score
- Filtering out comments based on keywords (deleted, removed, tl;dr)
- Keeping only one sentence per comment to reduce training complexity
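The cleaning steps above can be sketched as follows. This is an illustrative re-implementation, not the repo's actual code; the function names, keyword list, and sentence-splitting regex are our own assumptions.

```python
import re

# Keywords that mark low-quality or deleted comments (list assumed for illustration)
BAD_KEYWORDS = ("deleted", "removed", "tl;dr")

def first_sentence(text: str) -> str:
    """Keep only the first sentence of a comment (naive split on ., !, ?)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]

def filter_comments(comments):
    """Yield single-sentence comments that have a non-negative score
    and contain none of the bad keywords."""
    for score, body in comments:
        if score < 0:
            continue
        if any(keyword in body.lower() for keyword in BAD_KEYWORDS):
            continue
        yield first_sentence(body)

comments = [
    (12, "This is great news! I hope it lasts."),
    (-3, "Terrible take."),
    (5, "[removed]"),
]
print(list(filter_comments(comments)))  # → ['This is great news!']
```

Only the first comment survives: the second has a negative score and the third matches a keyword.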
As this task can be framed as a translation task, we used a Transformer-based architecture. In total, we trained three models from scratch:
- `50k_comments_model` (implementation based on the lecture code; not provided in this repo)
  - Did not properly learn grammatical structure
  - Could not understand the context of the headline
- `200k_comments_model` (implementation based on the lecture code)
  - One epoch took approx. 4 hours, so training was not feasible with access only to Google Colab's GPUs
- OpenNMT Transformer model
  - Learned grammatical structure
  - Could give coherent answers to new, unseen headlines
  - Is quite opinionated
  - Correlations it learned between topics were visible, e.g. connecting Republicans/Democrats with socialists/communists
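For reference, training a Transformer with OpenNMT-tf can look roughly like the sketch below. All file paths and names are placeholders, not the repo's actual configuration, and flags may differ between OpenNMT-tf versions.

```yaml
# data.yml — all paths are placeholders
model_dir: run/
data:
  train_features_file: data/train.headlines.txt
  train_labels_file: data/train.comments.txt
  eval_features_file: data/val.headlines.txt
  eval_labels_file: data/val.comments.txt
  source_vocabulary: data/vocab.src.txt
  target_vocabulary: data/vocab.tgt.txt
```

Training then uses the built-in Transformer preset: `onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval`.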
For demonstration purposes, we deployed our project into a production environment using the OpenNMT model. On an AWS server instance, we started two Docker containers: one serving the TensorFlow model via TensorFlow Serving, and one running the Python code that mediates between Alexa and the model and handles pre- and postprocessing of the data.
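The two-container setup could be expressed as a docker-compose file along these lines. This is a sketch based on the serving command above; the service names, build context, and volume paths are our assumptions, not the repo's actual deployment files.

```yaml
version: "3"
services:
  tensorflow-serving:
    image: tensorflow/serving
    entrypoint: tensorflow_model_server
    command: >
      --enable_batching=true --port=9000
      --model_base_path=/models/hot-take-model
      --model_name=hot-take-model
    volumes:
      - ./pretrained:/models   # assumed location of the exported model
    ports:
      - "9000:9000"
  api:
    build: .                   # assumed Dockerfile for the Python API
    ports:
      - "8000:8000"
    depends_on:
      - tensorflow-serving
```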
Even though the performance is already quite good, we propose the following extensions:
- Improved preprocessing of Reddit comments, e.g. using a spellchecker
- Use multi-sentence comments as target data
- Classifying comments by sentiment to create a dynamic comment generator
- Collecting a larger amount of data
- Extending the system to domains beyond news/politics
- Deploying this service as a bot on Reddit
"Estonia to become the only country in the world with a female president and female prime minister"
it's almost as if the ultra rich don't give a shit.
"politicians want to blow up the planet by two thousand twenty two"
this is why i don't have a problem.
"EU urges biden to help draft joint rule book to rein in tech giants"
this is the kind of stuff that makes me happy.
- Playing around with and understanding the data early is crucial for anticipating patterns the model might learn
- It makes sense to use a library like OpenNMT to have a good baseline and improve upon this baseline
- Even though we "only" collected roughly 250k comments, the model learned the grammatical structure of the English language and could give coherent answers
If there are any questions or problems, feel free to create an issue!
With lots of ❤️ by Maja & Andy