Disclaimer I: This document is intended for students either writing their (nlp-related) bachelor or master thesis or working on their (nlp-related) student consulting project under our supervision.
Disclaimer II: Please note that this document is subject to continuous change. Every time we find a new, nice source we will add it. Some stuff might also be deleted over time if not considered useful anymore.
Authors: Matthias Aßenmacher // Christian Heumann
Note: The most important resources are marked by a ⚠️
Last change: 03-03-2022
- Join Mattermost (ask Matthias for the invite link)
- Ask Matthias to add you to the “NLP”-channel on Mattermost
- Ask Matthias to add you to our internal “NLP” mailing list
- You can reach Matthias via Mattermost or E-Mail, Christian prefers to be contacted via E-Mail
(In case of e-mails related to your thesis/project, make sure to cc the respective other supervisor in order to avoid information asymmetries.)
- We have had very good experiences with (approx.) bi-weekly meetings for short status updates and prefer to work together with you in this fashion.
(Nevertheless, this is not mandatory; we just think it helps you to (i) get started and (ii) stay on track.)
- We will have a so-called “NLP Colloquium” every now and then (intended four times a year) where all of our BA-/MA-/consulting students present their work to the others. This meeting is rather informal in character (mostly intended to connect you with each other), so there is no need for high-gloss slides or anything like that. Jupyter notebooks, interesting figures, plain slides: all of that is fine.
We will announce this via the mailing list and via Mattermost.
Dates for 2022:
- 01.04. at 13h s.t.
- 01.07. at 13h s.t.
- 21.10. at 14h s.t.
- 16.12. at 13h s.t.
- The mailing list will be mostly used for announcements, while in the Mattermost channel we will occasionally also post (nlp-related) stuff we consider interesting.
- TALK TO US sooner rather than later if any problems occur that you are not able to solve on your own. Open (and timely) communication is (in our opinion) key to a successful supervision/cooperation during theses or consulting projects.
- Pre-Processing (e.g. in Python with NLTK or spaCy)
- One-hot-encoding of words, the bag-of-words (bow) approach, its applications in ML, drawbacks & limitations (just google this stuff, you will find enough material).
- Extensions of the bow approach, like n-grams or tf-idf (also just google this; a minimal code sketch for both is given below).
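To make the items above a bit more concrete, here is a minimal sketch of a typical pre-processing plus bag-of-words/tf-idf pipeline. It assumes spaCy and scikit-learn are installed and that the small English model `en_core_web_sm` has been downloaded; treat it as an illustration, not a recipe.

```python
# Minimal sketch: pre-processing (spaCy) -> bag-of-words / tf-idf (scikit-learn).
# Assumes: pip install spacy scikit-learn && python -m spacy download en_core_web_sm
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

nlp = spacy.load("en_core_web_sm")

docs = [
    "Natural language processing is fun.",
    "Bag-of-words models ignore word order.",
]

def preprocess(text):
    # Lowercase, lemmatize, and drop stop words and punctuation.
    return " ".join(
        tok.lemma_.lower()
        for tok in nlp(text)
        if not tok.is_stop and not tok.is_punct
    )

cleaned = [preprocess(d) for d in docs]

# Bag-of-words: raw term counts per document (here with unigrams + bigrams).
bow = CountVectorizer(ngram_range=(1, 2))
X_bow = bow.fit_transform(cleaned)

# tf-idf: re-weight the counts by how document-specific each term is.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(cleaned)

print(bow.get_feature_names_out())
print(X_tfidf.toarray().round(2))
```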
- In general, setting up a working Python environment is a little different from the plug-and-play style in which you can install R and RStudio.
- Find a comfortable setup:
- Jupyter Notebooks / Lab
- Google Colaboratory
- Pretty nice book for a broad overview of everything until self-attention, useful for covering the basics: Goldberg (2017)
- Good overview on Embeddings: Pilehvar & Camacho-Collados (2021)
- Overview on DL in general: Deep Learning book
- Or more basic: An Introduction to Statistical Learning
- ⚠️ Hugging Face Transformer course ⚠️
- Internal teaching resources (LMU):
- Our lecture from WS 20/21: https://moodle.lmu.de/course/view.php?id=10268
- Same lecture, slightly updated (WS 21/22): https://moodle.lmu.de/course/view.php?id=17645
- Booklet from our NLP seminar (summer semester 2020): https://slds-lmu.github.io/seminar_nlp_ss20/
- (Examples of) Supervised theses: https://www.misoda.statistik.uni-muenchen.de/studium_lehre/theses_old/index.html
- Course about ABSA (aspect-based sentiment analysis) from a student consulting project: https://lisa-wm.github.io/nlp-twitter-r-bert/
- Important conceptual foundation
- Bengio, Yoshua, et al. "A neural probabilistic language model." Journal of Machine Learning Research 3 (2003): 1137-1155.
- Modification of the idea of Bengio et al.: the internal representations of the neural network become the primary objective. The resulting architecture is called word2vec and learns static word embeddings (a small gensim-based sketch follows after the two papers below).
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781(2013).
- Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.
- Alternative framework to word2vec
- Pennington, Jeffrey, Richard Socher, and Christopher Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
- Extending the embedding idea from word2vec to sentence/paragraph/document level
- Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.
- Sequence-to-sequence models
- Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.
- Extending the embedding idea to subword tokens
- Bojanowski, Piotr, et al. "Enriching word vectors with subword information." Transactions of the Association for Computational Linguistics 5 (2017): 135-146.
- Joulin, Armand, et al. "Bag of tricks for efficient text classification." arXiv preprint arXiv:1607.01759 (2016).
- Important foundations for the so-called Attention & Self-Attention mechanism (a small numpy sketch follows after the three papers below)
- Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
- Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
- Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
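For intuition, here is a minimal numpy sketch of the scaled dot-product self-attention from Vaswani et al. (2017); it deliberately leaves out multi-head logic and masking, and the random projection matrices stand in for learned weights.

```python
# Minimal sketch: scaled dot-product self-attention (no multi-head, no masking).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of each query to each key
    weights = softmax(scores, axis=-1)     # attention weights, rows sum to 1
    return weights @ V, weights            # weighted sum of the values

# Toy example: a "sequence" of 3 tokens with embedding dimension 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))

# In self-attention, queries, keys and values are linear projections of X.
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(attn.round(2))   # 3x3 matrix: how much each token attends to every other token
```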
- Some of the most famous models that (a) learn contextualized embeddings and (b) can be used for transfer learning (a minimal usage sketch follows after the papers below)
- Radford, Alec, et al. "Improving language understanding by generative pre-training." pdf (2018).
- Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).
- Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).
- Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI Blog 1.8 (2019).
- Yang, Zhilin, et al. "XLNet: Generalized Autoregressive Pretraining for Language Understanding." Advances in neural information processing systems 32 (2019).
- Liu, Yinhan, et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).
- Lan, Zhenzhong, et al. "Albert: A lite bert for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019).
- Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." arXiv preprint arXiv:1910.10683 (2019).
- Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).
- Clark, Kevin, et al. "Electra: Pre-training text encoders as discriminators rather than generators." arXiv preprint arXiv:2003.10555 (2020).
- Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
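To see what "contextualized" means in practice, here is a minimal sketch of extracting contextual embeddings from a pre-trained BERT model via the Hugging Face transformers library (see the course linked above); the checkpoint name and example sentences are our choices for illustration.

```python
# Minimal sketch: contextualized embeddings from pre-trained BERT via transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The bank raised interest rates.", "We sat on the river bank."]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextualized vector per (sub-)token: "bank" gets a different representation
# in each sentence, unlike the static word2vec embeddings above.
print(outputs.last_hidden_state.shape)   # (batch, sequence length, hidden size)
```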
- Heavily used benchmark data sets (a short loading sketch follows after the papers below)
- Wang, Alex, et al. "GLUE: A multi-task benchmark and analysis platform for natural language understanding." arXiv preprint arXiv:1804.07461 (2018).
- Wang, Alex, et al. "Superglue: A stickier benchmark for general-purpose language understanding systems." arXiv preprint arXiv:1905.00537 (2019).
- Pranav Rajpurkar, et al. "SQuAD: 100,000+ questions for machine comprehension of text" Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
- Pranav Rajpurkar, et al. "Know what you don’t know: Unanswerable questions for SQuAD" In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia.
- Zero-/Few-Shot Learning (TO DO; a small pipeline-based sketch follows after the two papers below)
- Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
- Schick, Timo, and Hinrich Schütze. "It's not just size that matters: Small language models are also few-shot learners." arXiv preprint arXiv:2009.07118 (2020).
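As a small illustration of zero-shot classification, here is a sketch using the Hugging Face pipeline API, where an NLI model scores candidate labels it was never explicitly trained on; the specific checkpoint and labels are our assumptions.

```python
# Minimal sketch: zero-shot text classification with an NLI-based model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier(
    "The new graphics card delivers twice the frame rate of its predecessor.",
    candidate_labels=["technology", "politics", "sports"],
)
print(result["labels"][0], round(result["scores"][0], 3))  # most likely label + score
```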
- Prompting/Prompt-Engineering (a small fill-mask prompting sketch follows after the papers below)
- Liu, Pengfei, et al. "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." arXiv preprint arXiv:2107.13586 (2021).
- Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).
- Lester, Brian, et al. "The Power of Scale for Parameter-Efficient Prompt Tuning" In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
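To illustrate the "pre-train, prompt, and predict" idea, here is a sketch that recasts sentiment classification as a cloze task for a masked language model; the prompt wording and checkpoint are our assumptions for illustration only.

```python
# Minimal sketch: prompting a masked language model (cloze-style classification).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

review = "The plot was predictable and the acting was wooden."
prompt = f"{review} Overall, the movie was [MASK]."

# The model's preferred fillers for [MASK] act as a verbalizer for the sentiment label.
for pred in fill_mask(prompt, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```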
5. Make use of the vast range of blogs and tutorials (or the internet in general). Here are some nice online resources:
- https://github.com/ivan-bilan/The-NLP-Pandect (Exhaustive overview on basically everything)
- https://mccormickml.com/ (+ https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw)
- https://openai.com/blog/
- https://www.deepmind.com/blog
- https://syncedreview.com/
- https://thegradient.pub/
- https://jalammar.github.io/ (+ https://www.youtube.com/channel/UCmOwsoHty5PrmE-3QhUBfPQ)
- https://ruder.io/nlp-news/ (+ his thesis: https://ruder.io/thesis/neural_transfer_learning_for_nlp.pdf)
- https://dair.ai/ (pretty nice blog)
- https://github.com/tomohideshibata/BERT-related-papers (Exhaustive list of BERT related papers)
- Summaries by Yannic Kilcher: https://www.youtube.com/channel/UCZHmQk67mSJgfCCTn7xBfew