# Tweets Tokenization

The goal of this assignment is to write a tweet tokenizer. The input is a set of tweets, and the output is the list of tokens in each tweet. The assignment consists of four tasks.

The [data](https://drive.google.com/file/d/15x_wPAflvYQ2Xh38iNQGrqUIWLj5l5Nw/view?usp=share_link) consists of 5 files, each containing 44 tweets, one tweet per line. Use only one file for manual tokenization.
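Since each file stores one tweet per line, loading the data can be sketched as below. This is a minimal sketch, not a required interface: the helper name `load_tweets` and the file name in the usage comment are assumptions, and the files are assumed to be UTF-8 encoded.

```python
from pathlib import Path


def load_tweets(path: str) -> list[str]:
    """Read one tweet per line and skip blank lines (hypothetical helper)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Usage (the file name is a placeholder, not the real name of a data file):
# tweets = load_tweets("tweets_file.txt")
```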

Grading:
- 30 points - Tokenize tweets by hand
- 30 points - Implement 4 tokenizers
- 20 points - Stemming and Lemmatization
- 20 points - Explain sentencepiece (for masters only)


Remarks: 
- Use Python 3.9 or later (the function stubs use built-in generic annotations such as `list[str]`)
- Max is 80 points for bachelors, 100 points for masters

## Tokenize tweets by hand

As a first task you need to tokenize 15 tweets by hand. This will allow you to understand the problem from a linguistic point of view. The guidelines for tweet tokenization are as follows:

- Each smiley is a separate token
- Each hashtag is an individual token. Each user reference is an individual token
- If the letters of a word are separated by spaces (for example, `N A T O`), merge them into a single token
- If a sentence ends with a word that legitimately contains a full stop (an abbreviation, for example), keep that full stop as part of the word and add a separate final full stop token
- Every punctuation mark is an individual token. This includes double quotes and single quotes
- A URL is a single token

Example of output

    Input tweet
    @xfranman Old age has made N A T O!

    Tokenized tweet (separated by comma)
    @xfranman , Old , age , has , made , NATO , !


    1. Input tweet
    ...
    1. Tokenized tweet
    ...

    2. Input tweet
    ...
    2. Tokenized tweet
    ...

## Implement 4 tokenizers

Your task is to implement 4 different tokenizers, each of which takes a list of tweets on a topic and outputs the tokenization of each tweet:

- White Space Tokenization
- Sentencepiece
- Tokenizing text using regular expressions
- NLTK TweetTokenizer

For the regular-expression tokenizer, use the rules from task 1: combine them into a regular expression and build a tokenizer around it.
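One possible way to combine several rules into a single pattern is an alternation in which the more specific alternatives come first. The sketch below is an illustration under stated assumptions, not a reference solution: the names `SPACED_WORD`, `TOKEN`, and `sketch_re_tokenizer` are hypothetical, the smiley pattern covers only a handful of common forms, the spaced-letter merge is a heuristic that needs at least three single letters, and the abbreviation rule is not handled at all.

```python
import re

# Heuristic: three or more single letters separated by spaces, e.g. "N A T O".
SPACED_WORD = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

# Alternatives are tried left to right, so specific patterns come first.
TOKEN = re.compile(
    r"""
      https?://\S+      # a URL is a single token
    | [@#]\w+           # user reference or hashtag
    | [:;=]-?[)(DP]     # a few common smileys only
    | \w+               # a word or number
    | [^\w\s]           # any other single punctuation character
    """,
    re.VERBOSE,
)


def sketch_re_tokenizer(text: str) -> list[str]:
    """Merge letter-spaced words, then extract tokens with one regex."""
    merged = SPACED_WORD.sub(lambda m: m.group().replace(" ", ""), text)
    return TOKEN.findall(merged)


print(sketch_re_tokenizer("@xfranman Old age has made N A T O!"))
# ['@xfranman', 'Old', 'age', 'has', 'made', 'NATO', '!']
```

Note that the URL alternative greedily consumes everything up to the next whitespace, so punctuation attached to the end of a URL stays inside the URL token; a fuller solution would need to decide how to treat that case.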

In [None]:
def white_space_tokenizer(text: str) -> list[str]:
    
    return []

In [None]:
import sentencepiece as spm

def sentencepiece_wrapper(text: str) -> list[str]:
    
    return []

In [None]:
def re_tokenizer(text: str) -> list[str]:
    
    return []

In [None]:
import nltk

def nltk_tweet_tokenizer(text: str) -> list[str]:
    
    return []
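For orientation, basic use of NLTK's `TweetTokenizer` looks roughly like the sketch below. It uses the default constructor; the class also accepts options such as `preserve_case`, `reduce_len`, and `strip_handles`, and which of them suit this assignment is left for you to decide.

```python
from nltk.tokenize import TweetTokenizer

# Default settings: case is preserved and user handles are kept.
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize("@xfranman Old age has made N A T O!")
print(tokens)
```

Unlike the hand-tokenization guidelines, this tokenizer does not merge letter-spaced words, so `N A T O` comes out as four separate tokens.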

Run your implementations on the data. Compare the results and decide which tokenizer is best. List its advantages.
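A simple quantitative starting point for the comparison is to count the tokens and distinct token types each tokenizer produces. The helper below is only a sketch: the name `compare_tokenizers` and the summary keys are assumptions, and `str.split` stands in as a whitespace baseline. Numbers alone do not decide which tokenizer is best, so inspect the actual outputs as well.

```python
from typing import Callable


def compare_tokenizers(
    tweets: list[str],
    tokenizers: dict[str, Callable[[str], list[str]]],
) -> dict[str, dict[str, int]]:
    """Return total token count and vocabulary size per tokenizer name."""
    summary = {}
    for name, tokenize in tokenizers.items():
        all_tokens = [tok for tweet in tweets for tok in tokenize(tweet)]
        summary[name] = {
            "tokens": len(all_tokens),      # total number of tokens
            "types": len(set(all_tokens)),  # number of distinct tokens
        }
    return summary


# Toy input; replace it with the real tweets and your own tokenizers.
demo = compare_tokenizers(["a b a", "c!"], {"whitespace": str.split})
print(demo)  # {'whitespace': {'tokens': 4, 'types': 3}}
```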

...

## Stemming and Lemmatization

Your task is to write two functions: `stem` and `lemmatize`. The input is raw text, so you need to tokenize it first.
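As a reminder of the difference: stemming strips suffixes with rules and may produce strings that are not real words, while lemmatization maps each word to its dictionary form. The sketch below shows the Snowball stemmer applied to already-tokenized words; the choice of the `"english"` stemmer and the sample words are assumptions made for illustration.

```python
from nltk.stem.snowball import SnowballStemmer

# The stemmer works on single tokens, so the text must be tokenized first.
stemmer = SnowballStemmer("english")
words = ["running", "tweets"]
stems = [stemmer.stem(word) for word in words]
print(stems)  # ['run', 'tweet']
```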

In [None]:
from nltk.stem.snowball import SnowballStemmer

def stem(text: str) -> list[str]:

    return []

In [None]:
import spacy

nlp = spacy.load("...")

def lemmatize(text: str) -> list[str]:

    return []

## Explain sentencepiece (for masters only)

For this task you will use the sentencepiece text tokenizer. Read how it works and write an explanation of at least 10 sentences describing how the tokenizer works.

...

## Resources

1. [Regular Expressions 1](https://realpython.com/regex-python/)
2. [Regular Expressions 2](https://realpython.com/regex-python-part-2/)
3. [spaCy Lemmatizer](https://spacy.io/api/lemmatizer)
4. [NLTK Stem](https://www.nltk.org/howto/stem.html)
5. [SentencePiece](https://github.com/google/sentencepiece)
6. [SentencePiece tokenizer demystified](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15)