Submission for <br>
Exercise task 1 <br>
of UTU course TKO_8964-3006 <br>
Textual Data Analysis <br>
by Botond Ortutay <br>

---

**Instructions:**

### Part 1: loading a dataset from the HF hub

Using the [load_dataset](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/loading_methods#datasets.load_dataset) function of the [datasets](https://huggingface.co/docs/datasets/en/index) library, load each of the following datasets in turn:

 - [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)
 - [eriktks/conll2003](https://huggingface.co/datasets/eriktks/conll2003)
 - [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

For each of the datasets, report the following information:

 - What NLP task is the dataset intended for (e.g. syntactic analysis, toxicity detection, etc.)? (You may need to refer to the documentation of the dataset for this.)
 - What parts is the dataset split into (e.g. train, test) and how many examples does each contain?
 - What features (e.g. text, label) does the dataset have? (Try to understand how these relate to the NLP task the dataset is intended for.)
 - What is the first item in the training set of the dataset?

### Part 2: creating a dataset from your own data

You can find data collected from the Yle news RSS feed here: http://dl.turkunlp.org/TKO_8964_2023/

Download either the Finnish or English data (`news-fi-2021.jsonl` or `news-en-2021.jsonl`) using `wget` and create a dataset from the JSONL data (see https://huggingface.co/docs/datasets/loading#json). Answer the following questions:

 - What NLP tasks could the dataset be used for?
 - What features does the dataset have?
 - How many space-separated words do the texts of the dataset contain in total?

---

**Solutions:**

**Part 1:**

**Importing libraries & environment setup:**

**NOTE (to self):** This notebook assumes a kernel with all the relevant libraries installed. Setting up a venv for the Python libraries on my Debian laptop cost me sweat and blood (I got something wrong during configuration and it took far too long to debug). So, once again: make sure the libraries are available before running this. Use tdavenv on the Debian laptop; on any other computer, install the relevant libraries first and redo the environment configuration if needed.

In [1]:
import datasets
import random    # For getting random samples

**Defining & documenting functions used below:**

In [2]:
"""
A function that returns a DatasetDict object and the train split from a huggingface dataset
---
In:
path                 str            the path huggingface uses to find the dataset
mainConfigNeeded     boolean        certain datasets want you to specify a config before datasets.load_dataset, setting this as True allows you to load such a dataset with the main config.
---
Out:
(dsDict, dsTrain),
    where:
dsDict               DatasetDict    object containing all available splits in dataset as well as the splits' features and amounts
dsTrain              Dataset        contains the train split of the dataset
---
Note: we assume that all the datasets handled by this function have a split called "train"
"""
def loadDataset(path, mainConfigNeeded):
    # No split is specified, so load_dataset returns a DatasetDict with every available split
    if mainConfigNeeded:
        dsDict = datasets.load_dataset(path, "main")
    else:
        dsDict = datasets.load_dataset(path)
    dsTrain = dsDict["train"]    # Reuse the already-loaded train split to take a closer peek at its contents
    return dsDict, dsTrain

"""
A function that prints basic information from a huggingface dataset, such as:
split names, features & sizes
samples from an inputted split
---
In:
dsDict       DatasetDict    DatasetDict of the dataset we want to examine
dsSplit      Dataset        A split from the dataset for printing samples
splitName    str            The name of the split we print samples from within the original huggingface dataset
prints       int            The amount of samples we want printed
---
Out:
None (everything is printed to stdout)
---
Note: we assume that splitName is the name of the split passed in as dsSplit
"""
def examineDataset(dsDict, dsSplit, splitName, prints):
    print(dsDict)    # Printing the DatasetDict to get an overview of all the splits and features

    print("")
    print("---")
    print("")

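    # Printing a few random samples from the dataset to see what kind of data is there
    for i in range(prints):
        # randrange excludes the upper bound, so the index is always valid
        # (randint includes it and could raise an IndexError)
        print(dsSplit[random.randrange(dsDict[splitName].num_rows)])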
        print("")
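One thing to watch when drawing random row indices: `random.randint(a, b)` includes both endpoints, while `range` and `random.randrange` stop one short of the upper bound. The sketch below is one possible way to pick distinct, always-valid indices. The helper name `sample_indices` and the stand-in `rows` list are hypothetical and only there for illustration.

```python
import random


def sample_indices(num_rows, k, seed=None):
    """Return up to k distinct random row indices in the range [0, num_rows)."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no index repeats;
    # range(num_rows) stops at num_rows - 1, so every index is valid.
    return rng.sample(range(num_rows), min(k, num_rows))


# Hypothetical stand-in for a dataset split: anything supporting len() and indexing.
rows = [{"text": f"example {i}"} for i in range(10)]
for idx in sample_indices(len(rows), 4, seed=0):
    print(idx, rows[idx])
```

Sampling without replacement also means the same row is never printed twice, which a loop of independent draws cannot guarantee.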

**The Stanford IMDB dataset:** <br>
code & outputs:

In [3]:
stanfordIMDB, stanfordIMDBTrain = loadDataset("stanfordnlp/imdb", False)

In [4]:
examineDataset(stanfordIMDB, stanfordIMDBTrain, "train", 4)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

---

{'text': 'Why has this not been released? I kind of thought it must be a bit rubbish since it hasn\'t been. How wrong can a girl be! This film is, in a word, enthralling.<br /><br />You will be captivated. It holds your attention from the start and its pace never slows.<br /><br />The final part of the film, the "episode" as it were (not giving anything away, you saw that in the trailer) is also unmissable. You will chose a favourite, you will be shocked, you wont be able to go and make a cup of coffee because you need to find out what happens. The adrenalin rises and you cant not watch. Cudos to the actors, it\'s very believable. And it doesn\'t stop there, they have a final shock for you.<br /

answers:

**What NLP task is the dataset intended for?**<br>
According to the stanfordnlp/imdb dataset's [README.md file](https://huggingface.co/datasets/stanfordnlp/imdb/blob/main/README.md) the dataset is intended to be used for [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). The authors claim that the dataset contains "highly polar" movie reviews. This is probably done so that the positive and negative feelings people have about the movies (a.k.a. the "sentiments") can be more easily associated with certain kinds of language. <br>
<br>
**What parts is the dataset split into and how many examples does each contain?**<br>
The dataset has a train and a test split, each with 25000 examples. It also has a split called "unsupervised" with 50000 examples. According to the dataset documentation, this is additional unlabeled data intended for unsupervised learning, not a copy of the train and test splits. <br>
<br>
**What features does the dataset have?**<br>
The dataset consists of the text and label features. The text feature contains the movie review and the label feature contains either a 0 or a 1. 0 seems to be associated with negative reviews and 1 with positive reviews. A model trained on this data would use the labels to learn what kind of language is associated with positive and negative sentiment. <br>
<br>
**What is the first item in the training set of the dataset?**<br>
Printed below:
<br>

In [5]:
print(stanfordIMDBTrain[0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

**Eriktks's conll2003 dataset:**<br>
code & outputs:

In [6]:
conll2003Dict, conll2003Train = loadDataset("eriktks/conll2003", False)

In [7]:
examineDataset(conll2003Dict, conll2003Train, "train", 4)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

---

{'id': '9179', 'tokens': ['Atheist', 'China', 'officially', 'bans', 'missionary', 'activities', 'but', 'often', 'turns', 'a', 'blind', 'eye', 'to', 'religious', 'activities', 'of', 'people', 'nominally', 'employed', 'as', 'foreign', 'language', 'teachers', ',', 'particularly', 'in', 'remote', 'areas', 'that', 'are', 'unable', 'to', 'attract', 'other', 'candidates', '.'], 'pos_tags': [21, 22, 30, 42, 16, 24, 10, 30, 42, 12, 16, 21, 35, 16, 24, 15, 24, 30, 40, 15, 16, 21, 24, 6, 30, 15, 16, 24, 43, 41, 16, 35, 37, 16, 24, 7], 'chunk_tags': [11, 12, 21, 22, 11, 12, 0, 21, 22, 11, 12, 12, 13, 1

answers:

**What NLP task is the dataset intended for?**<br>
According to the eriktks/conll2003 dataset's [README.md file](https://huggingface.co/datasets/eriktks/conll2003) the dataset is intended to be used for "language-independent named entity recognition". So if I understand this correctly, the aim is to teach a model to look at a text and then determine from the context which words refer to "named entities", such as people, places and organizations. (The README lists four entity types: persons, locations, organizations and miscellaneous names.) <br>
<br>
**What parts is the dataset split into and how many examples does each contain?**<br>
The dataset consists of a train split of 14041 rows, a validation split of 3250 rows and a test split of 3453 rows. <br>
<br>
**What features does the dataset have?**<br>
The dataset has the following features: id, tokens, pos_tags, chunk_tags and ner_tags. <br>
<br>
The id feature is, surprise surprise, an ID tag whose purpose is to let any computer system reference individual rows of the dataset. Having such a feature is common practice. <br>
<br>
The tokens feature is the text to be analyzed: sentences or collections of words that include the named entities among other words. (Although my random sample included several rows where the tokens were just: city name, number, number, number, number. What is up with that? Probably postal addresses or something, but it seems weird to have so many of these that a random sample of 4 out of 14041 caught two of them.) <br>
<br>
The ner_tags feature holds one tag per token and marks whether the token is part of a person, location, organization or other named entity, or none of these.<sup>(source: the README file)</sup> This is important because teaching a model to predict these tags is the goal of the dataset. <br>
<br>
The pos_tags and chunk_tags features were a bit harder for me to understand (probably easier for people who have taken NLP courses, which I haven't), but my hypothesis was that these are grammatical. I found [an article](https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb) that seems to confirm this. According to the article, pos_tags refers to POS (part-of-speech) tags, a system for categorizing the grammatical functions of words into groups such as nouns, verbs and pronouns. This matches my hypothesis and is in line with the README's explanation as well. According to the same article, chunk_tags refers to groups of words that refer to the same concept (for example, in the sentence "The quick brown fox jumped over the lazy dog", "the quick brown fox" would be a chunk). <br>
<br>
**What is the first item in the training set of the dataset?**<br>
Printed below:
<br>

In [8]:
print(conll2003Train[0])

{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
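The integer `ner_tags` in the item above are easier to read once they are mapped back to tag names. The following is a minimal sketch of that mapping. The tag order in `NER_NAMES` is taken from the dataset card as I understand it and should be treated as an assumption, and the helper name `decode_ner` is hypothetical.

```python
# Tag order as listed on the eriktks/conll2003 dataset card (assumption: check it
# against the features of the loaded dataset before relying on it).
NER_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_ner(tokens, ner_tags, names=NER_NAMES):
    """Pair each token with the string name of its integer NER tag."""
    return [(tok, names[tag]) for tok, tag in zip(tokens, ner_tags)]


# The first training item, copied from the output above.
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

for tok, name in decode_ner(tokens, ner_tags):
    print(f"{tok:10s} {name}")
```

Under this mapping "EU" comes out as the beginning of an organization name and "German" and "British" as miscellaneous entities, while every other token is tagged `O` (not part of a named entity).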


**OpenAI's gsm8k dataset:**<br>
code & outputs:

In [9]:
gsm8kDict, gsm8kTrain = loadDataset("openai/gsm8k", True)

In [10]:
examineDataset(gsm8kDict, gsm8kTrain, "train", 4)

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 7473
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 1319
    })
})

---

{'question': "Viggo's age was 10 years more than twice his younger brother's age when his brother was 2. If his younger brother is currently 10 years old, what's the sum of theirs ages?", 'answer': "Twice Viggo's younger brother's age when his brother was 2 is 2*2 = <<2*2=4>>4 years.\nIf Viggo's age was 10 more than twice his younger brother's age when his brother was 2, Viggo was 10+4 = 14 years old.\nViggo is 14 years - 2 years = <<14-2=12>>12 years older than his brother\nSince Viggo's brother is currently 10 and Vigo is 12 years older, he's currently 10 + 12 = <<10+12=22>>22 years old\nTheir combined age is 22 years + 10 years = <<22+10=32>>32 years\n#### 32"}

{'question': 'Jack goes hunting 6 times a month.  The hunting season lasts for 1 quarter of the year.  He catches 2 de

answers:

**What NLP task is the dataset intended for?**<br>
According to the openai/gsm8k dataset's [README.md file](https://github.com/openai/grade-school-math/blob/master/README.md) the dataset is intended to be used to teach transformer-based LLM systems to extract mathematical problems from written descriptions and then perform multi-step mathematical reasoning similarly to how elementary school math-word problems work. <br>
<br>
**What parts is the dataset split into and how many examples does each contain?**<br>
The dataset consists of a train split of 7473 rows and a test split of 1319 rows. <br>
<br>
**What features does the dataset have?**<br>
The dataset contains question and answer features. The question feature contains a written problem, and the answer feature contains a written answer with the logical deduction parts written down in plaintext as well as mathematical syntax. <br>
<br>
**What is the first item in the training set of the dataset?**<br>
Printed below:
<br>

In [11]:
print(gsm8kTrain[0])

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}
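The answer strings follow a visible pattern: calculation steps are annotated as `<<expression=result>>` and the final answer comes after a `####` marker. A minimal sketch of how these parts could be pulled apart is shown below. It is based only on the format seen in the printed items, and the helper name `split_gsm8k_answer` is hypothetical.

```python
import re


def split_gsm8k_answer(answer):
    """Split a GSM8K answer into reasoning text, calculator annotations and final answer."""
    # The final answer follows the last "####" marker.
    reasoning, _, final = answer.rpartition("####")
    # Calculator annotations have the form <<expression=result>>.
    calculations = re.findall(r"<<(.+?)=(.+?)>>", reasoning)
    return reasoning.strip(), calculations, final.strip()


# The answer field of the first training item, copied from the output above.
answer = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

reasoning, calculations, final = split_gsm8k_answer(answer)
print(calculations)
print(final)
```

If an answer contained no `####` marker, this sketch would return the whole string as the final answer and an empty reasoning part, so real use would need a check for that case.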


---

**Part 2:**

**NOTE (to the person checking this):** We assume that Jupyter is configured in such a way that shell commands can be run from the notebook (the lines starting with `!`). This is because the exercise instructions require the use of `wget`.

Using `wget` to download the data:

In [12]:
# NOTE: the ! lines are IPython shell escapes; they need a shell with wget available
!echo Downloading data from TurkuNLP
!echo
!wget http://dl.turkunlp.org/TKO_8964_2023/news-en-2019.jsonl
!echo
!echo Printing all files in current directory to check data has downloaded
!echo
!ls

Downloading data from TurkuNLP

--2025-01-20 14:17:40--  http://dl.turkunlp.org/TKO_8964_2023/news-en-2019.jsonl
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7855444 (7,5M) [application/octet-stream]
Saving to: ‘news-en-2019.jsonl’


2025-01-20 14:17:42 (5,16 MB/s) - ‘news-en-2019.jsonl’ saved [7855444/7855444]


Printing all files in current directory to check data has downloaded

empty.ipynb  exercise_task_1.ipynb  news-en-2019.jsonl	tdavenv


<br>Using `datasets.load_dataset()` to load the downloaded json file to python as a dataset:

In [13]:
newsEn = datasets.load_dataset("json", data_files="news-en-2019.jsonl")

<br>Poking around our new dataset (bash):

In [14]:
# NOTE: the ! lines are IPython shell escapes; they need a shell with shuf and wc available
# Printing a few random samples from the dataset to see what kind of data is there
!shuf -n 4 news-en-2019.jsonl

!echo
!echo ---
!echo

!echo -n Line count: 
!wc -l news-en-2019.jsonl

{"summary": "Funds from the campaign will be channeled into conservation projects aimed at reducing the impact of destructive algae.", "tags": ["Baltia", "Itämeri", "Muumi-kirjat", "Pohjoismainen kirjallisuus", "Tove Jansson", "kuvittajat", "lasten- ja nuortenkirjallisuus", "levät", "meret", "muumit", "sarjakuvataiteilijat", "scifi", "sinilevät", "suomenkielinen kirjallisuus", "vesistöt"], "text": "If Moomin creator  Tove Jansson  could see the Baltic Sea in 2019, she would do her best to improve it. \n That's the view of Moomin Characters' current artistic director, Tove Jansson's niece  Sophia Jansson . \n Moomin Characters Ltd, the limited liability company that controls the image and licensing rights of the Moomin characters, plans to raise one million euros next year to protect the Baltic Sea. Company CEO  Roleff Kråkström  first proposed the idea of using Moomins as patrons, when he witnessed the effect of blue-green algae on the water. \n Since then, more than 50 companies and o


---

Line count:2481 news-en-2019.jsonl


Poking around our new dataset (python):

In [15]:
print(newsEn)

DatasetDict({
    train: Dataset({
        features: ['summary', 'tags', 'text', 'timestamp', 'title', 'url'],
        num_rows: 2481
    })
})


Written answers for the questions in Part 2:<br>
<br>
**What NLP tasks could the dataset be used for?**<br>
This is a sizeable collection (2481 articles) and it could be used for all kinds of purposes. The first thing that came to mind was LLM training, although using this material for that has several problems. Firstly, all the material is news media, meaning that it's not generic enough to train an LLM: all generated text would be news-related. Furthermore, this data isn't in a conversational format, so the text prompts it would need in order to generate new text wouldn't feel conversational, which would make it harder to use. I tried looking around the net for similar kinds of datasets and found [this one](https://github.com/chimaobi-okite/NLP-Projects-Competitions/blob/main/NewsCategorization/README.md) where they collected over 5000 news articles from different newspapers in Nigeria. They used it to perform news categorization, which makes sense and could absolutely be a valid application for our news dataset. However, it is a bit funny how our tags (which would be the feature to do news categorization by) are in Finnish whereas everything else in the dataset is in English. I suppose I could've chosen the Finnish dataset and then this problem wouldn't exist.<br>
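To get a feel for whether the tags would work as category labels, one could count how often each tag occurs. Below is a minimal sketch using only the standard library. The helper name `tag_counts` and the two sample records are hypothetical (the tag values are borrowed from the sample printed earlier); the real file would be read line by line in the same way.

```python
import json
from collections import Counter


def tag_counts(lines):
    """Count how often each tag occurs across JSONL records."""
    counts = Counter()
    for line in lines:
        if line.strip():
            # Records without a "tags" field contribute nothing.
            counts.update(json.loads(line).get("tags") or [])
    return counts


# Hypothetical records standing in for lines of the downloaded file.
sample_lines = [
    '{"tags": ["Itämeri", "muumit"], "text": "..."}',
    '{"tags": ["Itämeri", "meret"], "text": "..."}',
]
print(tag_counts(sample_lines).most_common(2))
```

Tags that occur only once or twice would be poor category labels, so a frequency table like this is a reasonable first check before attempting any categorization.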
<br>
**What features does the dataset have?**<br>
The dataset has the features: summary, tags, text, timestamp, title and url. I believe these are self-explanatory enough that I won't need to explain them.<br>
<br>
**How many space-separated words do the texts of the dataset contain in total?**<br>
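That depends on one's definition of "space-separated word". However, I'll just give the simplest answer and let the following shell command count them for me. Note that `wc -w` counts every whitespace-separated token in the whole file, including the JSON keys, summaries, tags, titles and URLs, so the figure is an upper bound for the `text` field alone:<br>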

In [16]:
# NOTE: the ! lines are IPython shell escapes; they need a shell with wc available
!echo -n Word count in news-en-2019.jsonl as counted by the wc command: 
!wc -w news-en-2019.jsonl

Word count in news-en-2019.jsonl as counted by the wc command:1193770 news-en-2019.jsonl
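The `wc -w` figure covers the whole file, not just the article texts. A minimal sketch of counting only the words in the `text` field is given below. It uses the standard library, the helper name `count_text_words` is hypothetical, and the two sample records are made up for illustration.

```python
import json


def count_text_words(lines, field="text"):
    """Sum the whitespace-separated words in one field across JSONL records."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # str.split() with no argument splits on any run of whitespace.
        total += len((record.get(field) or "").split())
    return total


# Two hypothetical records standing in for lines of the downloaded file.
sample_lines = [
    '{"title": "A", "text": "one two  three"}',
    '{"title": "B", "text": "four five"}',
]
print(count_text_words(sample_lines))  # 5
```

For the real data, the function can be given the open file object directly, for example `with open("news-en-2019.jsonl", encoding="utf-8") as f: print(count_text_words(f))`. A stricter reading of "space-separated" would use `split(" ")` instead, which treats newlines as part of a word and counts empty strings between consecutive spaces.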
