## Natural Language Processing - Summer Term 2024
### Hochschule Karlsruhe
### Lecturer: Prof. Dr. Jannik Strötgen
### Tutor: Paul Löhr

# Exercise 02

### You will learn about:

- tokenization
- data cleaning and stop word removal
- stemming
- Zipf's law

---

## Task 1 - Tokenization (5 P):

### Part 1

Describe what tokenization is, how it is performed, and what problems it solves.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 100 words

From the lecture:

- Token: the occurrence of a word in a text
- Tokenization: segmentation of an input stream into an ordered sequence of tokens
- Tokenizer: a system that splits texts into word tokens
- Example:
  - Input text: John likes Mary and Mary likes John.
  - Tokens: {“John”, “likes”, “Mary”, “and”, “Mary”, “likes”, “John”, “.”}
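
The lecture example can be reproduced with a very small regex-based tokenizer. This is only a sketch under stated assumptions, not how `nltk` tokenizes internally: the function name `simple_tokenize` and the pattern are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("John likes Mary and Mary likes John.")
print(tokens)
# → ['John', 'likes', 'Mary', 'and', 'Mary', 'likes', 'John', '.']
```

A pattern this simple splits a contraction such as "don't" into three tokens, which is one of the problems dedicated tokenizers try to handle.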




### Part 2

For the later analysis of each text file, we need to identify single tokens. Therefore, you have to use a library to separate single tokens from the text. We will use the methods offered by `nltk` for this.

In [2]:
#%pip install nltk

In [3]:
import os
import json
import nltk

In [16]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dawso\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dawso\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [5]:
with open('data/texts.json', 'r') as infile:
    data = json.load(infile)

content_debates = data['debates']
content_reddit = data['reddit']
content_tv = data['tv']

In [6]:
# 1. Tokenize the text content for the three datasets above
from nltk.tokenize import word_tokenize

tokens_debates = word_tokenize(content_debates)
tokens_reddit = word_tokenize(content_reddit)
tokens_tv = word_tokenize(content_tv)


# 2. Print the first 20 tokens for each dataset

# direct print
# print("Tokens DEBATE: ", tokens_debates[:20])
# print("Tokens REDDIT: ", tokens_reddit[:20])
# print("Tokens TV: ", tokens_tv[:20])
# save as variables
tokens_debates_f20 = tokens_debates[:20]
tokens_reddit_f20 = tokens_reddit[:20]
tokens_tv_f20 = tokens_tv[:20]

print("Tokens DEBATE: ", tokens_debates_f20)
print("Tokens REDDIT: ", tokens_reddit_f20)
print("Tokens TV: ", tokens_tv_f20)


# 3. Now display the first paragraphs of the corresponding original text and study them. 
print(content_debates[:210],"\n")
print(content_reddit[0:200],"\n")
print(content_tv[0:200],"\n")

Tokens DEBATE:  ['Good', 'evening', 'from', 'Hofstra', 'University', 'in', 'Hempstead', ',', 'New', 'York', '.', 'I', 'am', 'Lester', 'Holt', ',', 'anchor', 'of', '``', 'NBC']
Tokens TV:  ['``', 'THE', 'TERMS', 'WERE', 'LAID', 'OUT', '.', 'I', 'WROTE', '--', '>', '>', 'YOU', 'CALLED', 'IT', 'THE', 'GOLD', 'STANDARD', '.', "''"]
 Good evening from Hofstra University in Hempstead, New York. I am Lester Holt, anchor of "NBC Nightly News.” I want to welcome you to the first presidential debate.
The participants tonight are Donald Trump an 


This is a reminder that this subreddit has strict posting/commenting rules that will be enforced by moderation. If you are new to this su 

"THE TERMS WERE LAID OUT. I WROTE -- >> YOU CALLED IT THE GOLD STANDARD."
  YEAH."
  SECRETARY CLINTON. >> I HAVE A FEELING BY THE END OF THIS"
 "THAT ARE INEFFECTIVE. STOP AND FRISK WAS FOUND TO BE U 



In [7]:
# CODE SUBMISSION ANSWER HERE 
# 1. 
# 




### Part 3

Does this do what you expected it to do? How well does the tokenization work? What happens to special characters? Can you think of any problems?

\# TEXT SUBMISSION ANSWER HERE (Double click to edit)
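- Tokenization only splits the text; stop words such as 'from', 'in', and 'of' are still present as tokens.
- Special characters are tokenized too: punctuation marks become separate tokens, double quotes are converted to the tokens '``' and "''", and the '>>' in the TV text is split into two '>' tokens.
- Problems: tokens such as '>', '--', and the converted quotes carry no lexical meaning and would distort later word counts unless they are removed.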

---

## Task 2 - Data Cleaning and Stop Word Removal (10 P):


### Part 1

In two to three sentences, describe what *data cleaning* in the context of text data refers to.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit)

- Detecting and correcting corrupt, inaccurate or irrelevant data records.
- After cleaning, the data should be consistent with the other data sets in the system.
- It is the prerequisite for the proper analysis afterwards.


### Part 2

To have more accurate word counts and visualizations, it is often helpful to remove the capitalization of words. This is especially true for languages like German. In the following, for the three texts from above, remove any capitalization.

In [12]:
# CODE SUBMISSION ANSWER HERE
    
# def conv_to_lower(tokens):
#     tokens_lower = []
#     for token in tokens:
#         tokens_lower.append(token.lower())
#     return tokens_lower 

# print(conv_to_lower(tokens_reddit_f20))
# print(conv_to_lower(tokens_debates_f20))
# print(conv_to_lower(tokens_tv_f20))

reddit_lower = content_reddit.lower()
debates_lower = content_debates.lower()
tv_lower = content_tv.lower()

### Part 3

Apply tokenization to the lowercase versions of the texts.

In [14]:
# CODE SUBMISSION ANSWER HERE
reddit_lower_tok = word_tokenize(reddit_lower)
debates_lower_tok = word_tokenize(debates_lower)
tv_lower_tok = word_tokenize(tv_lower)

### Part 4

In two to three sentences, describe what *stop word removal* in the context of text data refers to.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit)
It is a common preprocessing step in NLP. The idea is to remove very common words that occur across all documents and carry little to no information. These tend to be things like articles or pronouns for example.
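
As a minimal sketch of the idea, the snippet below filters a token list against a tiny hand-picked stop word set. The set `toy_stop_words` and the helper `remove_stop_words` are hypothetical and only serve as an illustration; real stop word lists are much longer.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop word set, preserving order."""
    return [token for token in tokens if token not in stop_words]

# Hypothetical toy list, for illustration only
toy_stop_words = {"this", "is", "a", "that", "the", "and"}

tokens = ["this", "is", "a", "reminder", "that", "rules", "exist"]
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)
# → ['reminder', 'rules', 'exist']
```

Using a `set` keeps each membership check cheap, which matters once the token lists contain whole documents.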

### Part 5

Now apply stop word removal to the three datasets.

Hint: Assume the texts are all written in _English_

In [19]:
# CODE SUBMISSION ANSWER HERE
# Build the English stop word set once; set membership checks are fast
stop_words = set(nltk.corpus.stopwords.words("english"))

content_reddit_rm_sw = [token for token in reddit_lower_tok if token not in stop_words]
content_debates_rm_sw = [token for token in debates_lower_tok if token not in stop_words]
content_tv_rm_sw = [token for token in tv_lower_tok if token not in stop_words]

### Part 6

Now compare the first original sentence for each dataset with the parts remaining after performing the above steps. Write them down and explain what happens.

In [22]:
# CODE SUBMISSION ANSWER HERE

print(content_reddit[:200])
print(content_reddit_rm_sw[:100])


This is a reminder that this subreddit has strict posting/commenting rules that will be enforced by moderation. If you are new to this su


\# TEXT SUBMISSION ANSWER HERE (Double click to edit)
- All words are lowercase
- All stop words have been removed
- Some special characters, such as '*', remain

---

## Task 3 - Stemming (10 P):


### Part 1

In two to three sentences, describe what *stemming* in the context of text data refers to.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit)
- Reduce response time for indexing by truncating words to their root. 
- Reducing inflected words to their stem.
- The last few characters (suffix) of a given word are removed to obtain a shorter form.
- Goal is to reduce the number of unique words by grouping word variations based on their root.

### Part 2

Think about how you would go about implementing your own stemmer.
Come up with at least ten rules and write them down.

Hint: For example:

```*s -> *   # remove trailing s```

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) 
```*e -> *   # remove trailing e```  
```*ed -> *   # remove trailing ed```  
```*er -> *   # remove trailing er```  
```*es -> *   # remove trailing es```  
```*ies -> *   # remove trailing ies```  
```*ing -> *   # remove trailing ing```  
```*al -> *   # remove trailing al```  
```*ity -> *   # remove trailing ity```  
```*ly -> *   # remove trailing ly```  
```*y -> *i   # replace trailing y by i```  
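
One way the rules above could be turned into code is sketched below. The assumptions are mine, not part of the exercise: only the first matching rule is applied, longer suffixes are tried first, and a hypothetical parameter `min_stem_len` keeps very short words from being stripped.

```python
# (suffix, replacement) pairs taken from the rules above, longest suffix first
RULES = [
    ("ies", ""), ("ing", ""), ("ity", ""),
    ("ed", ""), ("er", ""), ("es", ""), ("al", ""), ("ly", ""),
    ("s", ""), ("e", ""), ("y", "i"),
]

def rule_stem(word, min_stem_len=3):
    """Apply the first matching suffix rule if the remaining stem is long enough."""
    for suffix, replacement in RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word  # no rule applied

for word in ["walking", "walked", "quickly", "parties", "happy", "is"]:
    print(word, "->", rule_stem(word))
```

With these rules "parties" is reduced to "part", so it collides with the unrelated word "part". Over-stemming of this kind is worth looking for when comparing against a library stemmer.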

### Part 3

Use the cleaned word tokens (Task 2, Part 5 above) and apply stemming with the Snowball Stemmer.

In [None]:
# CODE SUBMISSION ANSWER HERE



### Part 4

Compare the results of the Snowball Stemmer with your stemming rules. How do they differ, how could you improve your stemmer?

\# TEXT SUBMISSION ANSWER HERE (Double click to edit)

### Part 5

Create the word clouds from Exercise 1 again, but now with the preprocessed text.

What changes do you see?

In [None]:
from utils import create_word_cloud

# CODE SUBMISSION ANSWER HERE

\# TEXT SUBMISSION ANSWER HERE (Double click to edit)

## Task 4 - Zipf's Law (5 P):

In the lecture, you have heard about Zipf’s law. 

### Part 1

State Zipf's law.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit) - expected approx. 1-2 sentences and the formula

### Part 2

Check if Zipf's law (approximately) holds for our three datasets after all preprocessing steps.

For this, plot Zipf's law and the word distribution for each of the datasets.
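
Before plotting, the rank-frequency table can be computed with the standard library alone. The sketch below assumes the common formulation in which a word's frequency is roughly proportional to 1/rank, so the expected count at rank r is the top count divided by r. The helper name `rank_frequency` and the toy token list are made up for illustration.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, observed count, Zipf-expected count) tuples, most frequent first."""
    counts = Counter(tokens).most_common()
    top_count = counts[0][1] if counts else 0
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(counts, start=1)
    ]

# Toy token list, for illustration only
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
table = rank_frequency(toy_tokens)
for rank, word, count, expected in table:
    print(rank, word, count, round(expected, 2))
```

Plotting rank against observed count on log-log axes, next to the expected counts, should give a roughly straight line if the law approximately holds for a dataset.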

In [None]:
# CODE SUBMISSION ANSWER HERE

### Part 3

Describe your plots and discuss your findings.

\# TEXT SUBMISSION ANSWER HERE (Double click to edit)

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook file and all referenced files in there.
- log in to ILIAS and submit the `*.ipynb` or archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. If you want to use publicly available code, please add a reference to the respective code snippet.
- Check that your code runs without errors, even after you have restarted the kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.