# INLP Group Work HS2022

This document describes the group work aspect of the INLP module. We provide descriptions and code samples of the planned tasks.

Please fill in details where explicitly indicated and leave everything else intact.

Group ID: 6

Group members: Alina Meyer

## Task

| Name            | Split (Train/Test/Validation) |       Type |
|-----------------|:-----------------------------:|-----------:|
| emotion         | 3257/1421/374                 | Quaternary |
| hate            | 9000/2970/1000                |     Binary |
| irony           | 2862/784/955                  |     Binary |
| offensive       | 11916/860/1324                |     Binary |
| sentiment       | 45615/12284/2000              |    Ternary |
| stance_abortion | 587/280/66                    |    Ternary |
| stance_atheism  | 461/220/52                    |    Ternary |
| stance_climate  | 355/169/40                    |    Ternary |
| stance_feminist | 597/285/67                    |    Ternary |
| stance_hillary  | 620/295/69                    |    Ternary |

Using the approaches presented in the module, complete the following tasks:

1. Envision an NLP application using one or more of the data sets described in the table above
2. Implement your solution in a Jupyter Notebook
3. Document it using the provided Canvas document, which guides you through the required aspects
4. Present your solution in the final lecture of the module

## Dependencies

In [1]:
!pip install datasets


## Data loading

This section provides starter code for loading all the data sets.

In [3]:
import datasets

# NOTE: this block loads all the available data into a dictionary
# use the keys of the dictionary to access the required data set
all_data = {}
names = ["emotion", "hate", "irony",
         "offensive", "sentiment", "stance_abortion",
         "stance_atheism", "stance_climate", "stance_feminist",
         "stance_hillary"]
for name in names:
    all_data[name] = datasets.load_dataset("tweet_eval", name)

In [4]:
all_data.keys()

dict_keys(['emotion', 'hate', 'irony', 'offensive', 'sentiment', 'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist', 'stance_hillary'])

In [5]:
# print description of the "offensive" data set
print(all_data["offensive"]["train"].info.description)




In [6]:
# print labels available for "offensive" data set (with order)
print(all_data["offensive"]["train"].info.features["label"].names)

['non-offensive', 'offensive']


In [7]:
print(all_data["emotion"]["train"].info.features["label"].names)

['anger', 'joy', 'optimism', 'sadness']


In [8]:
print(all_data["stance_feminist"]["train"].info.features["label"].names)

['none', 'against', 'favor']


In [9]:
print(all_data["sentiment"]["train"].info.features["label"].names)

['negative', 'neutral', 'positive']
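Since the label names come back in index order, they can be paired with the integer labels to check how balanced a split is. The following is a minimal sketch, not part of the provided starter code: `label_distribution` is a hypothetical helper, and the toy labels stand in for a real label column.

```python
from collections import Counter


def label_distribution(labels, names):
    """Return {label_name: count} for a sequence of integer labels."""
    counts = Counter(labels)
    # names[i] is the string for integer label i, as printed above
    return {name: counts.get(i, 0) for i, name in enumerate(names)}


# Toy labels (assumption) standing in for a real split's label column
toy_labels = [0, 0, 1, 0, 1, 0]
toy_names = ["non-offensive", "offensive"]
print(label_distribution(toy_labels, toy_names))
# → {'non-offensive': 4, 'offensive': 2}
```

With the loaded data, the same helper could presumably be called with `all_data["offensive"]["train"]["label"]` and the `names` list printed above.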


In [10]:
# example of a non-offensive tweet
all_data["offensive"]["train"][0]

{'text': '@user Bono... who cares. Soon people will understand that they gain nothing from following a phony celebrity. Become a Leader of your people instead or help and support your fellow countrymen.',
 'label': 0}

In [11]:
# example of an offensive tweet
all_data["offensive"]["train"][1]

{'text': '@user Eight years the republicans denied obama’s picks. Breitbarters outrage is as phony as their fake president.',
 'label': 1}

## Implementation

This section describes next steps in your implementation.


### Feature extraction/transformation and tokenization

**Fill in** your NLP pipeline in the next blocks.

In [12]:
!pip install nltk


In [13]:
feminist_stance_train = all_data["stance_feminist"]["train"]
feminist_stance_test = all_data["stance_feminist"]["test"]
feminist_stance_val = all_data["stance_feminist"]["validation"]

In [14]:
offensive_train = all_data["offensive"]["train"]
offensive_test = all_data["offensive"]["test"]
offensive_val = all_data["offensive"]["validation"]

In [15]:
feminist_stance_train[2]

{'text': 'RT @user Look for our latest indiegogo campaign coming out soon to help turn young girls into great leaders. #womensrights #SemST',
 'label': 2}

In [18]:
import re
import string

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from nltk.util import ngrams

# WordNetLemmatizer needs the WordNet corpus; the download is skipped if it is already present
nltk.download('wordnet', quiet=True)

tweet_tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
lemmatizer = WordNetLemmatizer()

PUNCTUATION_AND_DIGITS = string.punctuation + string.digits


def preprocess_and_clean_text(example):
    """Clean one tweet and return its lemmatized unigrams plus bigrams as one string."""
    text = example['text']
    # remove retweet markers such as "RT @user"; requiring the "@" keeps
    # ordinary words that start with or follow "rt" from being deleted
    text = re.sub(r'\brt\s*@\w+:?\s*', '', text, flags=re.IGNORECASE)
    # remove urls
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # lowercases the text and shortens long character runs before splitting
    tokens = tweet_tokenizer.tokenize(text)

    # keeps: '#something' '@user'
    # removes: '#' '123' '.' '@'
    filtered_tokens = [
        token for token in tokens
        if not all(c in PUNCTUATION_AND_DIGITS for c in token)
    ]

    # lemmatize() treats every token as a noun unless a POS tag is passed
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # append bigrams as single '_'-joined tokens, e.g. 'great_leader'
    bigrams = list(ngrams(lemmatized_tokens, n=2))
    bigram_strings = ['_'.join(gram) for gram in bigrams]
    processed_tokens = lemmatized_tokens + bigram_strings

    processed_string = " ".join(processed_tokens)
    return {'processed_text': processed_string}
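To make the filtering and bigram steps easier to inspect in isolation, here is a minimal standalone sketch that applies only those two steps to an already tokenized list. It skips tokenization and lemmatization, `filter_and_add_bigrams` is a hypothetical helper, and `zip` stands in for `nltk.util.ngrams` with `n=2`.

```python
import string

PUNCTUATION_AND_DIGITS = string.punctuation + string.digits


def filter_and_add_bigrams(tokens):
    """Drop tokens made only of punctuation/digits, then append '_'-joined bigrams."""
    kept = [t for t in tokens if not all(c in PUNCTUATION_AND_DIGITS for c in t)]
    # pair each kept token with its successor to form bigrams
    bigrams = ["_".join(pair) for pair in zip(kept, kept[1:])]
    return kept + bigrams


# Toy token list (assumption) covering the keep/remove cases from the comments
toy_tokens = ["#womensrights", "#", "123", ".", "@user", "great", "leaders"]
print(filter_and_add_bigrams(toy_tokens))
# → ['#womensrights', '@user', 'great', 'leaders',
#    '#womensrights_@user', '@user_great', 'great_leaders']
```

Note that the bigrams are built after filtering, so tokens that were separated by punctuation in the tweet end up adjacent.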

In [19]:
processed_feminist_stance_train = feminist_stance_train.map(preprocess_and_clean_text)
processed_feminist_stance_validation = feminist_stance_val.map(preprocess_and_clean_text)
processed_feminist_stance_test = feminist_stance_test.map(preprocess_and_clean_text)

In [20]:
processed_offensive_train = offensive_train.map(preprocess_and_clean_text)
processed_offensive_validation = offensive_val.map(preprocess_and_clean_text)
processed_offensive_test = offensive_test.map(preprocess_and_clean_text)

### Vocabulary and vector representation

**Fill in** the code for providing the vector representation of your data set(s).

In [9]:
# TODO: code goes here
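One possible starting point, shown here only as a sketch, is a plain bag-of-words representation built from the `processed_text` strings. The helpers `build_vocabulary` and `vectorize` and the `min_count` parameter are hypothetical names, and the toy strings stand in for real preprocessed tweets. The vocabulary is fitted on training text only, so validation and test tokens that were never seen are ignored.

```python
from collections import Counter


def build_vocabulary(texts, min_count=1):
    """Map each whitespace-separated token seen at least min_count times to an index."""
    counts = Counter(token for text in texts for token in text.split())
    kept = sorted(tok for tok, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate(kept)}


def vectorize(text, vocabulary):
    """Return a bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in text.split():
        idx = vocabulary.get(token)
        if idx is not None:
            vector[idx] += 1
    return vector


# Toy strings (assumption) in the same shape as 'processed_text'
toy_train = ["great leader great_leader", "phony leader phony_leader"]
vocab = build_vocabulary(toy_train)
print(vocab)
# → {'great': 0, 'great_leader': 1, 'leader': 2, 'phony': 3, 'phony_leader': 4}
print(vectorize("great phony unseen", vocab))
# → [1, 0, 0, 1, 0]
```

For the real data sets a sparse representation, for example from scikit-learn's `CountVectorizer` or `TfidfVectorizer`, is likely a better fit than dense Python lists; the sketch only illustrates the idea.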

### Evaluation (traditional ML)

**Fill in** the code for evaluating your NLP pipeline.

In [10]:
# TODO: code goes here
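Whichever classifier is used, its predictions on the test split can be compared against the gold labels. The sketch below computes accuracy and macro-averaged F1 in plain Python. Choosing macro F1 is an assumption (it weights every class equally, which helps with imbalanced classes), and `accuracy`, `macro_f1`, and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1; a class absent from gold and predictions scores 0."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn  # F1 = 2*TP / (2*TP + FP + FN)
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy gold labels and predictions (assumption)
toy_true = [0, 0, 1, 1]
toy_pred = [0, 1, 1, 1]
print(accuracy(toy_true, toy_pred))                 # → 0.75
print(macro_f1(toy_true, toy_pred, num_classes=2))  # per-class F1: 2/3 and 0.8
```

scikit-learn's `accuracy_score` and `f1_score(average="macro")` should give the same numbers and are probably the more convenient choice in the notebook.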

### Evaluation (neural network)

**Fill in** the code for evaluating/comparing with a neural network (transformers).

In [11]:
# TODO: code goes here

## That's all folks :)