## Natural Language Processing - Summer Term 2024
### Hochschule Karlsruhe
### Lecturer: Prof. Dr. Jannik Strötgen
### Tutor: Paul Löhr

# Exercise 01

### You will learn

- how to get the infrastructure up and running
- how you will complete the exercises


- about some different data formats
- how to write a simple data parser
- how to analyze word occurrences in text

## Task 0.1 - Introduction (0 P):

- learn about global and local environments
- create a new local environment and add it to jupyter notebook
- install the dependencies in requirements.txt for each exercise
- start jupyter notebook or jupyter-lab, switch the kernel to your environment and start solving


This assumes:
- you already have Python 3.9 installed
- you have Jupyter Notebook or Jupyterlab installed
- Python and Jupyter Notebook are already included in Anaconda, which we recommend due to the ease of installation. You can, however, also install them manually.


### Decide between local environment and global installation

When Python is installed, it is installed globally, meaning that all users and applications use the same version and, more importantly, the same version of packages (libraries). As we might have some applications that run only with a specific version of a library, it is hard and sometimes impossible to resolve version conflicts system-wide.

Therefore, it is possible to create *environments*, which you can think of as encapsulated copies of the Python interpreter together with a set of specific package versions (the idea of containerization before and during the earliest years of Docker). It is customary to list the required packages in a file called `requirements.txt`, one package name per line (for example `wordcloud`), optionally with a pinned version.

**We strongly recommend creating environments to avoid global version conflicts.**

That said, nothing prevents you in principle from installing the required packages globally. We will use the latest versions of packages (as listed on https://pypi.org/) and are unlikely to encounter version conflicts. We will not deduct points for using a global installation per se, but note that resolving any version conflicts is then your responsibility. A submission that cannot be executed because of such a conflict is obviously not helpful.

### Create an environment and make it usable in Jupyter Notebook

Using your global installation, start a terminal in the top folder where you store your exercises. Then execute the following commands in a shell:

```
# creates the environment in the folder .env
python -m venv .env            

# activate the environment
source .env/bin/activate       # Linux and macOS only
.env\Scripts\activate.bat      # Windows only

# add the jupyter kernel to the environment
pip install ipykernel

# add your environment to jupyter notebook
python -m ipykernel install --name=.env
```
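
To confirm from inside Python that the environment is really in use (for example in a notebook cell after switching the kernel), one option is to compare `sys.prefix` with `sys.base_prefix`. This is a minimal sketch that applies to environments created with `python -m venv`; the helper name `in_virtual_environment` is made up for illustration.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the global installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtual_environment())
```

If the second line prints `False` although you expected the `.env` kernel, the notebook is probably still running on the global kernel.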

### What you have now

You now have:

 - a new environment called .env in your top exercise folder
 - the environment registered in the (global) Jupyter Notebook settings
 
 
You now can:
 
 - **configure your local environment** as you like, for example, install specific packages (**for each exercise**)
 - **start jupyter notebook globally** and change the kernel to use your local environment via Kernel -> Change Kernel -> .env in the top menu bar. **Then you can start to solve your exercise.**

While it might be more intuitive to execute jupyter notebook *within* the environment, this is the least cumbersome way.

### Configure your environment
##### Install requirements within the environment
by typing the following command in a terminal in this directory

`source .env/bin/activate`   # windows: `.env\Scripts\activate.bat`

`pip install -r requirements.txt`  # distributed with the latest exercise

In [4]:
!pip install -r requirements.txt



*This is also necessary when the dependencies have changed!*
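
To check which versions actually ended up in the active environment, the standard-library module `importlib.metadata` can be queried from a notebook cell. The sketch below is only an illustration: the helper name `installed_version` is hypothetical, and the two package names are the ones mentioned in this notebook.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# `wordcloud` is mentioned above; `ipykernel` was installed during setup.
for name in ("wordcloud", "ipykernel"):
    print(name, "->", installed_version(name) or "not installed")
```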

### Start Jupyter Notebook (Jupyterlab) within this environment

Close this window, close the previous terminal session and type

`jupyter notebook` or `jupyter-lab`

in a new terminal.

- *Select your current exercise ipynb file*
- *Change the kernel* to use your .env via Kernel -> Change Kernel -> .env

*This is required every time to start up your exercise when coding.*

## Task 0.2 - How to solve the exercises (0 P):

The upcoming exercises contain both code submissions and written answers.

### Code-Submission

For these assignments, you complete the template code that we already provide within this notebook. For example, an exercise might look like this:

In [5]:
def square(x):
    # Assignment XY: calculate the square here and return the value
    pass

Then you are expected to complete the missing functionality. In this case, the function might look like this:

In [6]:
def square(x):
    return x*x

And to check, we can even call the function and get a result:

In [7]:
square(4)

16

### Structure your code within external python files

While you can solve all coding directly within the jupyter notebook, it is also possible to include external python files from the containing directory.

For this to work, you have to enable live edit capabilities by executing the following commands in your jupyter notebook:
```
%load_ext autoreload
%autoreload 2
```
Then you can include files like `util.py` by just importing them with `import util`.

**In principle, we recommend completing all your exercises within the notebook wherever possible and only include external files when strictly necessary!**

**Beware:** *If you include any external files, always submit your solution as an archive that contains both the ipynb file and all referenced files. Otherwise, we **cannot** grade it.*

### Text-Submissions
Another type of submission is a text submission. Here you are asked for a written answer.

*Tip*: You can use Markdown formatting if you set the cell type in the top menu bar to `Markdown`!

**To edit a cell, you can double-click it.**

For example, in the following case:

*Question*: What is Python?

\# ANSWER HERE (Double click to edit)

we expect you to write your answer in this cell (or several, if you choose to do so), like this:

Python is an interpreted, high-level, general-purpose programming language and commonly used in data science.

#### Markdown Support
Please use the markdown syntax to make your submission more readable!

Here are some examples:

##### Headings

*italic*

**bold**

\*literal asterisks\*

Latex code: $e^{i\pi} + 1 = 0$


Syntax-highlighted Python (not for code submissions!)
```python
print("Hello World!")
```

tables

| This | is   |
|------|------|
|   a  | table|

### Include Images
If you need to include external images (not any visualizations computed inside this notebook), you can reference the file like this:


![my image 123](cat.jpg)


**Beware:** *If you include any external files, always submit your solution as an archive that contains both the ipynb file and all referenced files. Otherwise, we **cannot** grade it.*

This concludes the short precursor, and we can start with the real exercises.

---

## Task 1 - Data Description (10P):

We provided some data files together with this assignment. These data files have a specific structure with some meta-information. Study the files and try to understand what they contain. Then try to make some sense out of this information.
Prepare a short description of the data in written form naming interesting facts (e.g., what kind of data are you seeing, how is it encoded).

Hint: It might help to read the text a little to understand what the data is all about.
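
One way to get a first impression of unknown files is to print their size and first few lines before writing any parser. The following is a rough sketch under assumptions: the helper `preview` is hypothetical, and the folder name `data` is only a guess at where the provided files are stored.

```python
from pathlib import Path


def preview(path, max_lines=5, encoding="utf-8"):
    """Return the first `max_lines` lines of a text file as a list of strings."""
    lines = []
    # errors="replace" keeps the preview going even if the encoding guess is wrong
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            if len(lines) >= max_lines:
                break
            lines.append(line.rstrip("\n"))
    return lines


data_dir = Path("data")  # assumed location of the provided files
if data_dir.is_dir():
    for file in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        print(f"--- {file} ({file.stat().st_size} bytes)")
        for line in preview(file):
            print(line[:120])  # truncate very long lines for readability
```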

### Dataset 1: Debates
- XML file, encoding declared as `<?xml version="1.0" encoding="UTF-8"?>`
- Transcript of a presidential debate
Your submission

### Dataset 2: Reddit
- JSON file
- A Reddit thread about the presidential debate
Your submission


### Dataset 3: TV
- CSV file
- Archived coverage of the 2016 presidential debate by different news channels
Your submission


expected approx. 200-500 words

## Task 2 - Data Parser Implementation (10 P):

Use the provided framework that you now have in your project. Write a reader that reads in the data files, together with the data structures you need to begin with. You only need to store what you will use later (e.g., many attributes in the reddit dump are useless for us).
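
As a rough illustration of "store only what you need", the sketch below keeps just the text plus a dataset label and pulls text fields out of nested JSON. Everything here is an assumption made for the example: the class name `Document`, the helper `collect_texts`, and in particular the key `"body"`, which may or may not match the actual reddit dump.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Document:
    """One text unit (a debate turn, a comment, a caption, ...)."""
    text: str
    dataset: str  # e.g. "debates", "reddit", "tv"
    metadata: dict = field(default_factory=dict)


def collect_texts(node, key="body"):
    """Recursively yield every string stored under `key` in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key and isinstance(value, str):
                yield value
            else:
                yield from collect_texts(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from collect_texts(item, key)


# Tiny made-up example that mimics a nested comment thread.
sample = json.loads('{"body": "top", "replies": [{"body": "reply", "score": 3}]}')
docs = [Document(text=t, dataset="reddit") for t in collect_texts(sample)]
print([d.text for d in docs])  # ['top', 'reply']
```

Flattening the thread this way is in line with the hint in Part 1 that the thread structure does not need to be represented.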

Follow the instructions in the following code snippets.

In [8]:
# import the necessary library functions we prepared
from utils import create_word_cloud, read_lorem_ipsum_text

In [9]:
#
# Part 1
#
#  - Read all the files in the data directory (and subdirectories)
#
# You may use any available library here, but please add it to the requirements.txt and SUBMIT the changed file
# Create python classes to represent your data. For the beginning, the simple text body should be sufficient.
# However, if you want to add other metadata you are free to do so.
# Regard each folder as one dataset. So in Part 3, you should generate three additional word clouds.
# 
# Hint: You do _not_ need to represent the thread structure of the reddit data with your Python class instances.
# 
# Example:
#loremipsum = read_lorem_ipsum_text()
#from bs4 import BeautifulSoup as bs
#print(loremipsum)


    
# TODO - ADD YOUR OTHER READER HERE
# XML
import xml.etree.ElementTree as ET

# Parse the XML file
tree = ET.parse(r'C:\Users\dawso\Desktop\WIM\WIM\NLP\data\debates\first.turns.xml')
root = tree.getroot()

# Collect the text of every second-level element of the XML tree
parts = []
for child in root:
    for subchild in child:
        # subchild.text is None for empty elements, so fall back to ""
        parts.append(subchild.text or "")
alltext = "".join(parts)

print(alltext)

 Good evening from Hofstra University in Hempstead, New York. I am Lester Holt, anchor of "NBC Nightly News.” I want to welcome you to the first presidential debate.
The participants tonight are Donald Trump and Hillary Clinton. This debate is sponsored by the Commission on Presidential Debates, a nonpartisan, nonprofit organization. The commission drafted tonight's format, and the rules have been agreed to by the campaigns.
The 90-minute debate is divided into six segments, each 15 minutes long. We'll explore three topic areas tonight: Achieving prosperity; America's direction; and securing America. At the start of each segment, I will ask the same lead-off question to both candidates, and they will each have up to two minutes to respond. From that point until the end of the segment, we'll have an open discussion.
The questions are mine and have not been shared with the commission or the campaigns. The audience here in the room has agreed to remain silent so that we can focus on what 
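
The same extraction can be tried on a small inline document before running it on the real file. This sketch uses `Element.itertext()`, which also picks up text inside nested tags and never yields `None`. The tag and attribute names (`debate`, `turn`, `p`, `speaker`) are placeholders and not necessarily those used in the provided XML file.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document; the tag and attribute names are placeholders.
sample_xml = """<debate>
  <turn speaker="A"><p>Good evening.</p><p/></turn>
  <turn speaker="B"><p>Thank you.</p><p>Next question.</p></turn>
</debate>"""

root = ET.fromstring(sample_xml)

turn_texts = []
for turn in root:
    # itertext() walks all nested text nodes, so an empty element such as
    # <p/> simply contributes an empty string instead of None.
    paragraphs = ["".join(p.itertext()).strip() for p in turn]
    turn_texts.append(" ".join(t for t in paragraphs if t))

print(turn_texts)  # ['Good evening.', 'Thank you. Next question.']
```

Joining paragraphs with a space also avoids gluing the last word of one paragraph to the first word of the next.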

In [10]:
# 
# Part 2
#
#  - Count the words in a map. Do so for each dataset.
#
# Example:
# word_count_lorem = {}
# words = loremipsum.split(" ")
# for word in words:
#     word_count_lorem[word] = word_count_lorem.get(word, 0) + 1
# #print(word_count_lorem)
# # TODO - ADD YOUR OTHER COUNTER FUNCTIONS HERE
# # XML
    
word_count = {}
# using the alltext variable from previous XML task
alltext_words = alltext.split(" ")
for word in alltext_words:
    word_count[word] = word_count.get(word, 0) +1
print(word_count)


# # reddit

# import json

# # Open the JSON file
# with open(r'C:\Users\dawso\Desktop\WIM\WIM\NLP\data\reddit\redditdump.json') as f:
#     # Load the JSON data
#     data = json.load(f)

# # Now `data` contains the content of the JSON file as a Python dictionary
# print(data)



{'': 8, 'Good': 3, 'evening': 2, 'from': 41, 'Hofstra': 3, 'University': 5, 'in': 219, 'Hempstead,': 1, 'New': 15, 'York.': 2, 'I': 416, 'am': 47, 'Lester': 2, 'Holt,': 1, 'anchor': 1, 'of': 338, '"NBC': 1, 'Nightly': 1, 'News.”': 1, 'want': 48, 'to': 571, 'welcome': 2, 'you': 258, 'the': 596, 'first': 17, 'presidential': 7, 'debate.\nThe': 1, 'participants': 1, 'tonight': 2, 'are': 217, 'Donald': 23, 'Trump': 7, 'and': 315, 'Hillary': 9, 'Clinton.': 3, 'This': 6, 'debate': 7, 'is': 285, 'sponsored': 1, 'by': 50, 'Commission': 1, 'on': 90, 'Presidential': 1, 'Debates,': 1, 'a': 316, 'nonpartisan,': 1, 'nonprofit': 1, 'organization.': 1, 'The': 33, 'commission': 2, 'drafted': 1, "tonight's": 1, 'format,': 1, 'rules': 1, 'have': 252, 'been': 56, 'agreed': 2, 'campaigns.\nThe': 1, '90-minute': 1, 'divided': 1, 'into': 31, 'six': 3, 'segments,': 1, 'each': 4, '15': 4, 'minutes': 10, 'long.': 2, "We'll": 3, 'explore': 1, 'three': 2, 'topic': 1, 'areas': 1, 'tonight:': 1, 'Achieving': 1, 'pr
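
The output above contains an empty-string key (`'': 8`) and tokens such as `'debate.\nThe'` because `split(" ")` splits on single spaces only. A possible alternative, sketched here with a hypothetical helper `count_words`, is `str.split()` without arguments combined with `collections.Counter`:

```python
from collections import Counter


def count_words(text: str) -> Counter:
    """Count whitespace-separated tokens in `text`."""
    # split() without arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and never produces empty strings.
    return Counter(text.split())


sample = "the debate.\nThe  debate the"
counts = count_words(sample)
print(counts)
```

`Counter` is a subclass of `dict`, so the result should work wherever a plain dictionary of counts is expected. Punctuation and capitalisation are deliberately left untouched in this sketch.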

In [11]:
# 
# Part 3
#
#  - Create a word cloud for the dataset.
# 
# Example:

create_word_cloud(word_count_lorem, "Lorem ipsum")

# TODO - CREATE THE OTHER THREE WORDCLOUDS HERE

Lorem ipsum:


ValueError: We need at least 1 word to plot a word cloud, got 0.
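
The `ValueError` above is raised when the dictionary of counts is empty, presumably because the example lines that fill `word_count_lorem` are commented out in Parts 1 and 2. A small guard like the following sketch makes that situation easier to spot. The names `plot_if_not_empty` and `fake_plot` are hypothetical; in the notebook, `create_word_cloud` could be passed as `plot_fn`, assuming it takes the counts and a title as in the call above.

```python
def plot_if_not_empty(counts, title, plot_fn):
    """Call plot_fn(counts, title) only when `counts` holds at least one word."""
    if not counts:
        print(f"Skipping '{title}': no words counted yet.")
        return False
    plot_fn(counts, title)
    return True


# Stand-in for the provided plotting helper, used here only for demonstration.
def fake_plot(counts, title):
    print(f"{title}: {len(counts)} distinct words")


plot_if_not_empty({}, "Lorem ipsum", fake_plot)
plot_if_not_empty({"lorem": 2, "ipsum": 1}, "Lorem ipsum", fake_plot)
```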

### Wordcloud 1: Debates (xml)

### Wordcloud 2: Reddit (json)

### Wordcloud 3: TV (csv)

## Task 3 - Word-Cloud Interpretation and Next Steps (10 P):

Interpret the word clouds you created. Can you spot any differences between the data sets? What do you think would be the next steps to improve the word clouds?

Reference the corresponding figure and write down your interpretation.

\# TEXT SUBMISSION ANSWER - expected approx. 250-600 words



---

#### Submitting your results:

To submit your results, please:

- save this file, i.e. `ex01_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.
- log in to ILIAS and submit the `*.ipynb` file or `*.zip` archive for the corresponding assignment.

**Remarks:**
    
- Do not copy any code from the Internet. If you want to use publicly available code, add a reference to its source next to the respective code snippet.
- Check that your code executes without errors, even after you have restarted the kernel (Menu -> Kernel -> Restart & Run all).
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.