# Textual Data Analysis - Exercise 1
## Name: Ayesha Zafar

# Part 1: loading a dataset from the HF hub

Using the load_dataset function of the datasets library, load each of the following datasets in turn:

stanfordnlp/imdb

eriktks/conll2003

openai/gsm8k

For each of the datasets, report the following information:

1. What NLP task is the dataset intended for (e.g. syntactic analysis, toxicity detection, etc.)? (You may need to refer to the documentation of the dataset for this.)

2. What parts is the dataset split into (e.g. train, test) and how many examples does each contain?

3. What features (e.g. text, label) does the dataset have? (Try to understand how these relate to the NLP task the dataset is intended for.)

4. What is the first item in the training set of the dataset?

### 1. What NLP task is the dataset intended for (e.g. syntactic analysis, toxicity detection, etc.)? 

#### stanfordnlp/imdb

NLP Task: Sentiment analysis (The dataset is designed for binary classification of movie reviews as negative (0, `neg`) or positive (1, `pos`))

#### eriktks/conll2003

NLP Task: Named entity recognition (NER), with part-of-speech (POS) tagging and syntactic chunking as additional annotation layers (The dataset is used for identifying named entities in text, e.g. persons, locations, and organizations, and each token also carries a POS tag and a syntactic chunk tag)
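
To make the NER annotation concrete, here is a minimal sketch of how integer NER tags can be decoded into entity spans. It assumes the nine-label IOB2 scheme (`O`, `B-PER`, `I-PER`, ...) in the order given on the dataset card; the sentence and tag ids are an illustrative example, and `extract_entities` is a hypothetical helper, not part of the datasets library.

```python
# Label order assumed to match the eriktks/conll2003 dataset card.
NER_LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def extract_entities(tokens, tag_ids, label_names=NER_LABELS):
    """Group IOB2-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag_id in zip(tokens, tag_ids):
        label = label_names[tag_id]
        if label == "O":
            flush()
            current_tokens, current_type = [], None
        elif label.startswith("B-") or label[2:] != current_type:
            # A B- tag (or an I- tag of a different type) starts a new entity.
            flush()
            current_tokens, current_type = [token], label[2:]
        else:
            # An I- tag of the same type continues the current entity.
            current_tokens.append(token)
    flush()
    return entities


# Illustrative sentence and tag ids (assumed, not copied from the notebook output).
tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tag_ids = [3, 0, 7, 0, 0, 0, 7, 0, 0]
print(extract_entities(tokens, tag_ids))
```

With the real dataset, the label names could be read from the features of the training split rather than hard-coded.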

#### openai/gsm8k

NLP Task: Math word problem solving (This dataset provides grade-school math word problems with step-by-step solutions, for training and evaluating models on multi-step mathematical reasoning)
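
In gsm8k, each solution is a free-text explanation whose last line gives the final result after a `####` marker. The sketch below shows one way to pull out that final answer for evaluation. The answer string is made up for illustration, and `extract_final_answer` is a hypothetical helper.

```python
def extract_final_answer(answer_text):
    """Return the text after the last '####' marker, or None if the marker is missing."""
    if "####" not in answer_text:
        return None
    final = answer_text.rsplit("####", 1)[1].strip()
    # Remove thousands separators so "1,250" and "1250" compare equal.
    return final.replace(",", "")


# Made-up solution text in the same style as the dataset's `answer` field.
example_answer = (
    "She sold 48 / 2 = <<48/2=24>>24 cups on Tuesday.\n"
    "In total she sold 48 + 24 = <<48+24=72>>72 cups.\n"
    "#### 72"
)
print(extract_final_answer(example_answer))
```

Comparing only the extracted final answer is a simple way to score a model's output, at the cost of ignoring whether the intermediate reasoning is correct.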


In [18]:
from datasets import load_dataset

# Dataset list with optional configurations
datasets_list = [
    {"name": "stanfordnlp/imdb", "config": None},
    {"name": "eriktks/conll2003", "config": None},
    {"name": "openai/gsm8k", "config": "main"}
]

# Looping over each dataset 
for dataset_info in datasets_list:
    dataset_name = dataset_info["name"]
    config = dataset_info["config"]
    print(f"=> Dataset: {dataset_name}")
    try:
        # A config of None selects the dataset's default configuration
        dataset = load_dataset(dataset_name, config, trust_remote_code=True)
        # 2. What parts is the dataset split into (e.g. train, test) and how many examples does each contain?
        for split, data in dataset.items():
            print(f"  Split: {split}, Examples: {len(data)}\n")
        # 3. What features (e.g. text, label) does the dataset have? 
        print(f"  Features: {dataset[list(dataset.keys())[0]].features}\n")
        # 4. What is the first item in the training set of the dataset?
        if "train" in dataset:
            print(f"  First item in training set: {dataset['train'][0]}\n")
    except Exception as e:
        print(f"  Error loading dataset: {e}")

=> Dataset: stanfordnlp/imdb
  Split: train, Examples: 25000

  Split: test, Examples: 25000

  Split: unsupervised, Examples: 50000

  Features: {'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}

  First item in training set: {'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens o
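
The first training item shows that the IMDb reviews contain raw HTML line breaks (`<br />`). A minimal sketch of removing them before tokenizing or counting words follows. It assumes `<br>` variants are the only markup that needs handling, and `strip_line_breaks` is a hypothetical helper.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_line_breaks(text):
    """Replace HTML <br> variants with a space and collapse repeated whitespace."""
    without_tags = BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


# Excerpt taken from the first training item printed above.
sample = (
    "I really had to see this for myself.<br /><br />"
    "The plot is centered around a young Swedish drama student named Lena"
)
print(strip_line_breaks(sample))
```

A function like this could be applied to every review with the dataset's `map` method before further processing.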

# Part 2: creating a dataset from your own data

You can find data collected from the Yle news RSS feed here: http://dl.turkunlp.org/TKO_8964_2023/

Download either the Finnish or English data (news-fi-2021.jsonl or news-en-2021.jsonl) using wget and create a dataset from the JSONL data (see https://huggingface.co/docs/datasets/loading#json). Answer the following questions:

1. What NLP tasks could the dataset be used for?

2. What features does the dataset have?

3. How many space-separated words do the texts of the dataset contain in total?

In [19]:
import requests

# Downloading dataset using URL
url = "http://dl.turkunlp.org/TKO_8964_2023/news-en-2021.jsonl"
output_file = "news-en-2021.jsonl"

# A timeout keeps the request from hanging indefinitely if the server does not respond
response = requests.get(url, timeout=30)
if response.status_code == 200:
    with open(output_file, "wb") as f:
        f.write(response.content)
    print(f"File downloaded successfully and saved as {output_file}")
else:
    print(f"Failed to download file. Status code: {response.status_code}")

File downloaded successfully and saved as news-en-2021.jsonl


In [22]:
from datasets import load_dataset

# Loading JSONL file as a dataset
dataset = load_dataset("json", data_files="news-en-2021.jsonl")

# Displaying dataset info
print("Dataset loaded successfully:")
print(dataset)

Dataset loaded successfully:
DatasetDict({
    train: Dataset({
        features: ['summary', 'tags', 'text', 'timestamp', 'title', 'url'],
        num_rows: 1059
    })
})
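
JSONL stores one JSON object per line, which is why the `json` loader can turn each line of the file into one row. The sketch below reads the same format with the standard library only. The two records are made up and reuse some of the field names reported above; `read_jsonl` is a hypothetical helper.

```python
import io
import json

# Two made-up records; the values are illustrative and not taken from the Yle data.
jsonl_text = (
    '{"title": "Example A", "tags": ["news"], "text": "First example text."}\n'
    '{"title": "Example B", "tags": ["sport", "news"], "text": "Second example text."}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line of a file-like object."""
    return [json.loads(line) for line in handle if line.strip()]


# io.StringIO stands in for an opened file such as open("news-en-2021.jsonl").
records = read_jsonl(io.StringIO(jsonl_text))
print(len(records))
print(sorted(records[0].keys()))
```

Reading the file this way is a quick cross-check that the number of rows reported by the loader matches the number of lines in the file.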


## 1. What NLP tasks could the dataset be used for?

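Based on its fields, the dataset could support several NLP tasks:

Text classification: Categorizing articles by topic, using the `tags` field as labels.

Named entity recognition (NER): Identifying entities such as people, locations, and organizations in the text. The dataset has no entity annotations, so labels would have to be added first.

Summarization: Generating a summary of each news article, using the `summary` field as the reference.

Sentiment analysis: Determining the sentiment or tone of the news articles. The dataset has no sentiment labels, so these would also have to be added.

Language modeling: Training models to predict text sequences from the `text` field.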
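
Whether topic classification is practical depends on how the `tags` values are distributed across articles. The sketch below counts tag frequencies with `collections.Counter`. The rows and tag names are made-up placeholders; with the real data one would iterate over the `tags` column of the training split instead. `tag_frequencies` is a hypothetical helper.

```python
from collections import Counter

# Made-up rows; each has a list of tags, like the `tags` feature of the dataset.
rows = [
    {"tags": ["politics", "health"]},
    {"tags": ["health"]},
    {"tags": ["sport"]},
    {"tags": ["politics"]},
]


def tag_frequencies(rows):
    """Count how many rows carry each tag."""
    counts = Counter()
    for row in rows:
        counts.update(row["tags"])
    return counts


frequencies = tag_frequencies(rows)
for tag, count in frequencies.most_common():
    print(f"{tag}: {count}")
```

A very skewed distribution, or many tags that occur only once, would suggest merging or filtering tags before training a classifier.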

## 2. What features does the dataset have?

In [23]:
# Displaying features
print("Features of the dataset:")
print(dataset['train'].features)

# First item
print("\nFirst item in the dataset:")
print(dataset['train'][0])


Features of the dataset:
{'summary': Value(dtype='string', id=None), 'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'text': Value(dtype='string', id=None), 'timestamp': Value(dtype='timestamp[s]', id=None), 'title': Value(dtype='string', id=None), 'url': Value(dtype='string', id=None)}

First item in the dataset:
{'summary': 'The decisions follow a meeting of government ministers at the House of the Estates on Thursday afternoon.', 'tags': ['Kotimaan uutiset'], 'text': 'Finland\'s government is pushing ahead with plans to introduce a Covid pass, following a meeting of ministers at the House of the Estates in Helsinki on Thursday afternoon. \n "There are still many open questions that need to be answered. At this point, it is impossible to promise that the pass will come or when it will come," Prime Minister  Sanna Marin  (SDP) told the media following the conclusion of the meeting. \n "The government has given the green light to the Covid pass and prepara

## 3. How many space-separated words do the texts of the dataset contain in total?

In [26]:
def count_words(batch):
    # Return a list of word counts, one per text in the batch.
    # str.split() with no argument splits on any run of whitespace (spaces, tabs, newlines).
    return {"word_count": [len(text.split()) for text in batch["text"]]}

# Applying function to entire dataset
word_count_result = dataset['train'].map(count_words, batched=True, batch_size=1000)
# Calculating total word count 
total_words = sum(word_count_result["word_count"])
# Displaying total word count
print(f"Total number of words in the dataset: {total_words}")

Total number of words in the dataset: 475975
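
The count above relies on `text.split()` with no argument, which splits on any run of whitespace rather than on the space character alone. Since the article texts contain newlines, a strict space-only split can give a different number. The comparison below uses a made-up sample string; both helper functions are hypothetical.

```python
def count_whitespace_words(text):
    """Count tokens separated by any whitespace (spaces, tabs, newlines)."""
    return len(text.split())


def count_space_words(text):
    """Count tokens separated by the space character only, ignoring empty pieces."""
    return len([piece for piece in text.split(" ") if piece])


# Made-up sample: the newline joins two words when splitting on spaces only.
sample = "Covid pass approved.\nMinisters met."
print(count_whitespace_words(sample))
print(count_space_words(sample))
```

Which definition is appropriate depends on how "space-separated" is read; the whitespace-based count treats a newline as a word boundary, which is usually what is wanted for running text.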
