### 4. Prepare the data
Notes:
* Work on copies of the data (keep the original dataset intact).
* Write functions for all data transformations you apply, for three reasons:
    * So you can easily prepare the data the next time you run your code
    * So you can apply these transformations in future projects
    * To clean and prepare the test set
    
    
1. Data cleaning:
    * Fix or remove outliers (or keep them)
    * Fill in missing values (e.g. with zero, mean, median, regression ...) or drop their rows (or columns)
2. Feature selection (optional):
    * Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).
3. Feature engineering, where appropriate:
    * Discretize continuous features
    * Use one-hot encoding if/when relevant
    * Add promising transformations of features (e.g. $\log(x)$, $\sqrt{x}$, $x^2$, etc)
    * Aggregate features into promising new features
4. Feature scaling: standardise or normalise features
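
The note about writing a function for every transformation can be sketched as follows. This is a minimal illustration, not the pipeline used in this notebook: the helper names `drop_missing_text`, `strip_whitespace` and `prepare`, as well as the toy data, are made up for the example. Each step returns a new frame, so the original data stays intact and the same chain can later be applied to the test set.

```python
import pandas as pd


def drop_missing_text(df: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Return a copy without the rows whose text is missing."""
    return df.dropna(subset=[text_column]).copy()


def strip_whitespace(df: pd.DataFrame, text_column: str = "text") -> pd.DataFrame:
    """Return a copy with leading/trailing whitespace removed from the text."""
    out = df.copy()
    out[text_column] = out[text_column].str.strip()
    return out


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Chain the individual steps; the input frame is never modified."""
    return df.pipe(drop_missing_text).pipe(strip_whitespace)


# Toy data (hypothetical, for illustration only)
raw = pd.DataFrame({"text": ["  hello ", None, "world"], "label": [1, 0, 2]})
prepared = prepare(raw)
print(prepared)
```

Because `prepare` only composes small functions, individual steps can be reused in other projects or swapped out without touching the rest.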

In [20]:
# installing dependencies
!pip3 install pandas

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip' command.


In [21]:
# importing the data as dataframes
import pandas as pd

emotion_raw_df = pd.read_csv('../exports/emotion_raw.csv')
go_emotions_raw_df = pd.read_csv('../exports/go_emotions_raw.csv')

pd.set_option('display.max_columns', None)

In [22]:
emotion_raw_df

Unnamed: 0.1,Unnamed: 0,text,label
0,0,i feel awful about it too because it s my job ...,0
1,1,im alone i feel awful,0
2,2,ive probably mentioned this before but i reall...,1
3,3,i was feeling a little low few days back,0
4,4,i beleive that i am much more sensitive to oth...,2
...,...,...,...
416804,416804,that was what i felt when i was finally accept...,1
416805,416805,i take every day as it comes i m just focussin...,4
416806,416806,i just suddenly feel that everything was fake,0
416807,416807,im feeling more eager than ever to claw back w...,1


In [23]:
go_emotions_raw_df

Unnamed: 0.1,Unnamed: 0,text,id,author,subreddit,link_id,parent_id,created_utc,rater_id,example_very_unclear,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,disappointment,disapproval,disgust,embarrassment,excitement,fear,gratitude,grief,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,0,That game hurt.,eew5j0j,Brdd9,nrl,t3_ajis4z,t1_eew18eq,1.548381e+09,1,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
1,1,>sexuality shouldn’t be a grouping category I...,eemcysk,TheGreen888,unpopularopinion,t3_ai4q37,t3_ai4q37,1.548084e+09,37,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,2,"You do right, if you don't care then fuck 'em!",ed2mah1,Labalool,confessions,t3_abru74,t1_ed2m7g7,1.546428e+09,37,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,3,Man I love reddit.,eeibobj,MrsRobertshaw,facepalm,t3_ahulml,t3_ahulml,1.547965e+09,18,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,4,"[NAME] was nowhere near them, he was by the Fa...",eda6yn6,American_Fascist713,starwarsspeculation,t3_ackt2f,t1_eda65q2,1.546669e+09,2,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
211220,211220,Everyone likes [NAME].,ee6pagw,Senshado,heroesofthestorm,t3_agjf24,t3_agjf24,1.547634e+09,16,False,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
211221,211221,Well when you’ve imported about a gazillion of...,ef28nod,5inchloser,nottheonion,t3_ak26t3,t3_ak26t3,1.548553e+09,15,False,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
211222,211222,That looks amazing,ee8hse1,springt1me,shittyfoodporn,t3_agrnqb,t3_agrnqb,1.547684e+09,70,False,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
211223,211223,The FDA has plenty to criticize. But like here...,edrhoxh,enamedata,medicine,t3_aejqzd,t1_edrgdtx,1.547169e+09,4,False,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### 1. Data cleaning:
* Fix or remove outliers (or keep them)
* Fill in missing values (e.g. with zero, mean, median, regression ...) or drop their rows (or columns)

We start the cleaning with the missing values. For `go_emotions` we had already written a function that adds a column showing which rows have no emotion label. We delete these rows: they are unlabeled, we do not have the capacity to label them ourselves, and a look at some of them suggests they are probably not relevant anyway.

After running the tokenizing function we realized that it takes too long for such a simple task, so we wrote a new function whose only purpose is to delete the rows without any emotion label.

In [24]:
def remove_rows_without_emotion(input_df, emotion_columns):
    """
    Returns a copy of the dataframe without the rows where no emotion is labeled.

    :param input_df: DataFrame to be filtered (left unchanged).
    :param emotion_columns: List of columns corresponding to emotions.
    :return: Filtered copy of the DataFrame.
    """
    df = input_df.copy()
    # Creating a mask to identify rows with no emotion
    no_emotion_mask = df[emotion_columns].sum(axis=1) == 0

    # Removing rows where no emotion is labeled
    df = df[~no_emotion_mask]

    return df

# List of emotion columns
emotion_columns = [
    "admiration",
    "amusement",
    "anger",
    "annoyance",
    "approval",
    "caring",
    "confusion",
    "curiosity",
    "desire",
    "disappointment",
    "disapproval",
    "disgust",
    "embarrassment",
    "excitement",
    "fear",
    "gratitude",
    "grief",
    "joy",
    "love",
    "nervousness",
    "optimism",
    "pride",
    "realization",
    "relief",
    "remorse",
    "sadness",
    "surprise",
    "neutral",
]

In [25]:
go_emotions_cleaned = remove_rows_without_emotion(go_emotions_raw_df, emotion_columns)

go_emotions_cleaned.count(), go_emotions_raw_df.count()

(Unnamed: 0              207814
 text                    207814
 id                      207814
 author                  207814
 subreddit               207814
 link_id                 207814
 parent_id               207814
 created_utc             207814
 rater_id                207814
 example_very_unclear    207814
 admiration              207814
 amusement               207814
 anger                   207814
 annoyance               207814
 approval                207814
 caring                  207814
 confusion               207814
 curiosity               207814
 desire                  207814
 disappointment          207814
 disapproval             207814
 disgust                 207814
 embarrassment           207814
 excitement              207814
 fear                    207814
 gratitude               207814
 grief                   207814
 joy                     207814
 love                    207814
 nervousness             207814
 optimism                207814
 pride  

Next we look at the percentage of letters in each text to check that the text is clean and does not contain weird characters.

In [26]:
def filter_text_by_letter_percentage(df, text_column, min_percentage):
    """
    Filters the dataframe to include only rows where the percentage of letters (excluding spaces) in the text column
    is at least the specified minimum percentage.

    :param df: DataFrame to be filtered.
    :param text_column: The name of the column containing text.
    :param min_percentage: The minimum percentage of letters (0-100) in the text, excluding spaces.
    :return: Filtered DataFrame.
    """
    def is_letter_percentage_above_min(text):
        if not isinstance(text, str):
            return False
        letter_count = sum(c.isalpha() for c in text)
        non_space_count = sum(not c.isspace() for c in text)
        if non_space_count == 0:
            return False
        return (letter_count / non_space_count) * 100 >= min_percentage

    return df[df[text_column].apply(is_letter_percentage_above_min)]



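# Note: the docstring defines min_percentage on a 0-100 scale, so 0.95 keeps every row
# with at least 0.95% letters; a 95% threshold would be min_percentage=95.
go_emotions_cleaned_filtered = filter_text_by_letter_percentage(go_emotions_cleaned, 'text', 0.95)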

go_emotions_cleaned_filtered.count(), go_emotions_cleaned.count()


(Unnamed: 0              207805
 text                    207805
 id                      207805
 author                  207805
 subreddit               207805
 link_id                 207805
 parent_id               207805
 created_utc             207805
 rater_id                207805
 example_very_unclear    207805
 admiration              207805
 amusement               207805
 anger                   207805
 annoyance               207805
 approval                207805
 caring                  207805
 confusion               207805
 curiosity               207805
 desire                  207805
 disappointment          207805
 disapproval             207805
 disgust                 207805
 embarrassment           207805
 excitement              207805
 fear                    207805
 gratitude               207805
 grief                   207805
 joy                     207805
 love                    207805
 nervousness             207805
 optimism                207805
 pride  

In [27]:
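# Note: this call filters go_emotions_cleaned again (not emotion_raw_df), so the counts repeat the previous cell.
emotion_cleaned_filtered = filter_text_by_letter_percentage(go_emotions_cleaned, 'text', 0.95)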

emotion_cleaned_filtered.count(), go_emotions_cleaned.count()

(Unnamed: 0              207805
 text                    207805
 id                      207805
 author                  207805
 subreddit               207805
 link_id                 207805
 parent_id               207805
 created_utc             207805
 rater_id                207805
 example_very_unclear    207805
 admiration              207805
 amusement               207805
 anger                   207805
 annoyance               207805
 approval                207805
 caring                  207805
 confusion               207805
 curiosity               207805
 desire                  207805
 disappointment          207805
 disapproval             207805
 disgust                 207805
 embarrassment           207805
 excitement              207805
 fear                    207805
 gratitude               207805
 grief                   207805
 joy                     207805
 love                    207805
 nervousness             207805
 optimism                207805
 pride  

In [28]:
go_emotions_cleaned['example_very_unclear'].describe()

count     207814
unique         1
top        False
freq      207814
Name: example_very_unclear, dtype: object

Checking `example_very_unclear` shows that none of the remaining records is flagged as very unclear, so nothing needs to be cleaned there either.

The letter filter removed only 9 of the 207,814 records, and in the `emotion` dataset no record has more than 5% non-letter text, so we keep the data raw and only remove the records that have no emotion label.
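
The docstring of `filter_text_by_letter_percentage` puts `min_percentage` on a 0-100 scale, while the calls above pass `0.95`. The sketch below shows how much the scale matters. The helper `letter_percentage` is a hypothetical re-implementation of the inner check, written only for this illustration, and the sample strings are partly made up.

```python
def letter_percentage(text: str) -> float:
    """Share of letters among the non-space characters, in percent (0-100)."""
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return 0.0
    return 100 * sum(c.isalpha() for c in non_space) / len(non_space)


samples = ["That game hurt.", "!!! ??? 123", "Man I love reddit."]
percentages = [letter_percentage(s) for s in samples]

# Threshold as passed in the calls above (0.95 on a 0-100 scale)
keep_loose = [p >= 0.95 for p in percentages]
# Threshold if 95% letters is meant
keep_strict = [p >= 95 for p in percentages]

for s, p, loose, strict in zip(samples, percentages, keep_loose, keep_strict):
    print(f"{s!r}: {p:.1f}% letters, kept at 0.95: {loose}, kept at 95: {strict}")
```

With a threshold of 95, even short clean sentences are dropped because of a single trailing period, so a somewhat lower cut-off would probably be needed if the filter is used on the percent scale.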

#### 2. Feature selection (optional):
* Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).

In [29]:
text_column = "text"
emotion_columns = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring", "confusion", "curiosity", 
    "desire", "disappointment", "disapproval", "disgust", "embarrassment", "excitement", "fear", 
    "gratitude", "grief", "joy", "love", "nervousness", "optimism", "pride", "realization", "relief", 
    "remorse", "sadness", "surprise", "neutral"
]

selected_columns = [text_column] + emotion_columns
go_emotions_cleaned_selected_df = go_emotions_cleaned[selected_columns]

In [30]:
go_emotions_cleaned_selected_df

Unnamed: 0,text,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,desire,disappointment,disapproval,disgust,embarrassment,excitement,fear,gratitude,grief,joy,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,That game hurt.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
2,"You do right, if you don't care then fuck 'em!",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,Man I love reddit.,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,"[NAME] was nowhere near them, he was by the Fa...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
5,Right? Considering it’s such an important docu...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
211219,"Well, I'm glad you're out of all that now. How...",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
211220,Everyone likes [NAME].,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
211221,Well when you’ve imported about a gazillion of...,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
211222,That looks amazing,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [31]:
## dataset down-sampling
def slim_down_df(input_df, to_percentage = 0.10):
    df = input_df.copy()

    df_slim = df.sample(frac=to_percentage)

    print(f"Original size: {len(df)}")
    print(f"Slimmed down size: {len(df_slim)}")
    
    return df_slim

go_emotions_cleaned_selected_slim_df = slim_down_df(go_emotions_cleaned_selected_df)

Original size: 207814
Slimmed down size: 20781
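
`DataFrame.sample` draws a different subset on every run unless a seed is given. If the slim export should be reproducible, a fixed `random_state` can be passed. The function name `slim_down_reproducible` and the seed value are assumptions for this sketch, not part of the notebook.

```python
import pandas as pd


def slim_down_reproducible(input_df: pd.DataFrame, frac: float = 0.10, seed: int = 42) -> pd.DataFrame:
    """Return a random fraction of the rows; the same seed gives the same rows."""
    return input_df.sample(frac=frac, random_state=seed)


# Toy data (hypothetical)
toy = pd.DataFrame({"text": [f"row {i}" for i in range(100)]})
first = slim_down_reproducible(toy)
second = slim_down_reproducible(toy)
print(len(first), first.index.equals(second.index))
```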


In [32]:
go_emotions_cleaned_selected_df.to_csv('../exports/go_emotions_cleaned_selected.csv')
go_emotions_cleaned_selected_slim_df.to_csv('../exports/go_emotions_cleaned_selected_slim.csv')

---------------

#### 3. Feature engineering, where appropriate:
* Discretize continuous features
* Use one-hot encoding if/when relevant
* Add promising transformations of features (e.g. $\log(x)$, $\sqrt{x}$, $x^2$, etc)
* Aggregate features into promising new features

Now we need to consolidate the emotions into fewer categories and then tokenize them into a single label.

These are the categories we consolidate into. It would be possible to create new ones, but we have to be careful with that, because `go_emotions` would then be the only dataset representing these new labels.

- 0: 'sadness'
- 1: 'joy'
- 2: 'love'
- 3: 'anger'
- 4: 'fear'
- 5: 'surprise'

### Possible mapping of the go_emotions labels to the shorter category set:

### 0: 'sadness'
- disappointment
- grief
- remorse
- sadness

### 1: 'joy'
- amusement
- excitement
- joy
- optimism

### 2: 'love'
- caring
- love

### 3: 'anger'
- anger
- annoyance
- disapproval 
- disgust

### 4: 'fear'
- fear
- nervousness

### 5: 'surprise'
- surprise


Not as easily categorized, listed with the category we assign them to:
- admiration 1
- approval 1
- gratitude 1
- pride 1
- relief 1
- desire 2

##### new category:
- neutral (label 6)

*Not categorized*:
- confusion
- curiosity
- embarrassment
- realization

After mapping the many emotions onto the few and leaving some emotions uncategorized, we look at how to join them together.

The 4 uncategorized emotions may also simply appear together with another emotion, in which case the record can take that emotion's category.

The goal is to consolidate the categorized emotions into their respective categories and then check how many records still carry more than one of the main categories:
- 0: 'sadness'
- 1: 'joy'
- 2: 'love'
- 3: 'anger'
- 4: 'fear'
- 5: 'surprise'
- 6: 'neutral'

In [33]:
def consolidate_emotions(input_df):
    """
    Consolidates the emotion labels into fewer categories. If a record has multiple emotions from multiple categories,
    it will have 1 in multiple category columns. The function also removes all columns that are not the main categories.

    :param input_df: DataFrame with emotion columns to be consolidated.
    :return: Modified DataFrame with consolidated emotion categories.
    """
    df = input_df.copy()
    
    # Mapping of original emotions to new categories
    emotion_mapping = {
        'sadness': ['disappointment', 'grief', 'remorse'],
        'joy': ['amusement', 'excitement', 'optimism', 'admiration', 'approval', 'gratitude', 'pride', 'relief'],
        'love': ['caring', 'desire'],
        'anger': ['annoyance', 'disapproval', 'disgust'],
        'fear': ['nervousness'],
    }

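    # Resetting the existing category columns to 0
    # (redundant: the assignment below overwrites them anyway)
    for category in emotion_mapping:
        if category in df.columns:
            df[category] = 0

    # Mapping the old emotions to the new categories.
    # Caveat: the lists above do not contain the category itself, so a row labeled only
    # with the original 'sadness', 'joy', 'love', 'anger' or 'fear' ends up with 0 here.
    for new_emotion, old_emotions in emotion_mapping.items():
        df[new_emotion] = df[old_emotions].max(axis=1)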

    # Non-categorized emotions that should remain untouched
    non_categorized_emotions = ['confusion', 'curiosity', 'embarrassment', 'realization']

    # Columns to keep: new categories + non-categorized emotions + 'text'
    keep_columns = list(emotion_mapping.keys()) + non_categorized_emotions + ['text']

    # Removing columns that are not in the keep_columns list
    df = df[keep_columns]

    return df

# Applying the function to the loaded go_emotions_cleaned_selected_df dataset
consolidated_emotions_df = consolidate_emotions(go_emotions_cleaned_selected_df)

# Displaying the first few rows of the consolidated dataframe
consolidated_emotions_df.head()


Unnamed: 0,sadness,joy,love,anger,fear,confusion,curiosity,embarrassment,realization,text
0,0,0,0,0,0,0,0,0,0,That game hurt.
2,0,0,0,0,0,0,0,0,0,"You do right, if you don't care then fuck 'em!"
3,0,0,0,0,0,0,0,0,0,Man I love reddit.
4,0,0,0,0,0,0,0,0,0,"[NAME] was nowhere near them, he was by the Fa..."
5,0,1,0,0,0,0,0,0,0,Right? Considering it’s such an important docu...


In [34]:
emotion_categories = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise', 'neutral']
# 0-6
non_categorized_emotions = ['curiosity', 'embarrassment'] # 'realization' and 'confusion' were taken out (they are mapped to 'surprise')
# 7-8

columns_to_keep = emotion_categories + ['text']

In [35]:
def map_and_reduce_emotions(df, emotion_mapping, keep_columns):
    """
    Maps specific emotions to broader categories and reduces the DataFrame to only include specified columns.
    For each row, the function checks each column, and if there is 1 in any of the specified columns, 
    it sets 1 in the corresponding broader category column.

    :param df: DataFrame to be processed.
    :param emotion_mapping: Dictionary mapping broader categories to specific emotions.
    :param keep_columns: List of columns to keep in the final DataFrame.
    :return: DataFrame with emotions mapped and reduced to specified columns.
    """
    # Working on a copy of the DataFrame
    df_copy = df.copy()

    # Iterating over the rows
    for index, row in df_copy.iterrows():
        # Iterating over the emotion mapping
        for broader_emotion, specific_emotions in emotion_mapping.items():
            # Check if any of the specific emotions are 1
            if any(row[specific_emotion] == 1 for specific_emotion in specific_emotions):
                df_copy.at[index, broader_emotion] = 1

    # Keeping only the specified columns
    df_reduced = df_copy[keep_columns]

    return df_reduced
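
The row-by-row loop above is easy to follow but slow on roughly 200k rows. A column-wise version is sketched below. It assumes that every emotion column only holds 0 or 1 and that each broader category lists itself among its specific emotions, as the mapping used in this notebook does. The name `map_emotions_vectorized` and the toy data are made up for this sketch.

```python
import pandas as pd


def map_emotions_vectorized(df, emotion_mapping, keep_columns):
    """Set each broader category to 1 where any of its specific emotions is 1."""
    out = df.copy()
    for broader_emotion, specific_emotions in emotion_mapping.items():
        # Row-wise maximum over 0/1 columns is 1 exactly when any of them is 1
        out[broader_emotion] = out[specific_emotions].max(axis=1)
    return out[keep_columns]


# Toy data (hypothetical)
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "sadness": [0, 0, 1],
    "grief": [1, 0, 0],
    "joy": [0, 0, 0],
    "amusement": [0, 1, 0],
})
toy_mapping = {"sadness": ["sadness", "grief"], "joy": ["joy", "amusement"]}
mapped = map_emotions_vectorized(toy, toy_mapping, ["sadness", "joy", "text"])
print(mapped)
```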

In [36]:
emotion_mapping = {
    'sadness': ['sadness', 'disappointment', 'grief', 'remorse'],
    'joy': ['amusement', 'excitement', 'optimism', 'admiration', 'approval', 'gratitude', 'pride', 'relief', 'joy'],
    'love': ['caring', 'desire', 'love'],
    'anger': ['annoyance', 'disapproval', 'disgust', 'anger'],
    'fear': ['nervousness', 'fear'],
    'surprise': ['realization', 'surprise', 'confusion']
}

go_emotions_mapped_reduced = map_and_reduce_emotions(go_emotions_cleaned_selected_df, emotion_mapping, columns_to_keep + non_categorized_emotions)

go_emotions_mapped_reduced

Unnamed: 0,sadness,joy,love,anger,fear,surprise,neutral,text,curiosity,embarrassment
0,1,0,0,0,0,0,0,That game hurt.,0,0
2,0,0,0,0,0,0,1,"You do right, if you don't care then fuck 'em!",0,0
3,0,0,1,0,0,0,0,Man I love reddit.,0,0
4,0,0,0,0,0,0,1,"[NAME] was nowhere near them, he was by the Fa...",0,0
5,0,1,0,0,0,0,0,Right? Considering it’s such an important docu...,0,0
...,...,...,...,...,...,...,...,...,...,...
211219,0,1,0,0,0,0,0,"Well, I'm glad you're out of all that now. How...",0,0
211220,0,0,1,0,0,0,0,Everyone likes [NAME].,0,0
211221,0,0,1,0,0,0,0,Well when you’ve imported about a gazillion of...,0,0
211222,0,1,0,0,0,0,0,That looks amazing,0,0


In [37]:
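def tokenize_emotions(input_df, emotion_columns):
    """
    Adds an 'emotion_label' column: the position of the single labeled emotion within
    emotion_columns, -1 if several emotions are labeled, -2 if none is.

    :param input_df: DataFrame with one 0/1 column per emotion (left unchanged).
    :param emotion_columns: Ordered list of emotion columns; the position defines the label.
    :return: Copy of the DataFrame with the 'emotion_label' column added.
    """
    df = input_df.copy()
    # Creating a new column for the tokenized label
    df["emotion_label"] = -2  # default value indicating no emotion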

    for index, row in df.iterrows():
        emotions = row[emotion_columns]
        labeled_emotions = emotions[emotions == 1]

        if len(labeled_emotions) > 1:
            df.at[index, "emotion_label"] = -1  # multiple emotions
        elif len(labeled_emotions) == 1:
            emotion_index = labeled_emotions.idxmax()
            df.at[index, "emotion_label"] = emotion_columns.index(emotion_index)

    return df
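
The same labeling rule can also be written without a Python loop. The following is a sketch under the assumption that the emotion columns only contain 0 and 1. `tokenize_emotions_vectorized` is a hypothetical name, and the toy frame exists only to show the three cases (single emotion, several emotions, no emotion).

```python
import numpy as np
import pandas as pd


def tokenize_emotions_vectorized(input_df, emotion_columns):
    """Label = position of the single emotion, -1 for several, -2 for none."""
    df = input_df.copy()
    flags = df[emotion_columns].to_numpy()
    n_labeled = flags.sum(axis=1)      # number of emotions per row
    first_hit = flags.argmax(axis=1)   # position of the first 1 in each row
    df["emotion_label"] = np.select(
        [n_labeled > 1, n_labeled == 1],
        [-1, first_hit],
        default=-2,
    )
    return df


# Toy data (hypothetical)
toy = pd.DataFrame({
    "sadness": [1, 0, 1, 0],
    "joy": [0, 1, 1, 0],
    "text": ["a", "b", "c", "d"],
})
tokenized = tokenize_emotions_vectorized(toy, ["sadness", "joy"])
print(tokenized)
```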

In [38]:
go_emotions_mapped_tokenized = tokenize_emotions(go_emotions_mapped_reduced, emotion_categories + non_categorized_emotions)

go_emotions_mapped_tokenized

Unnamed: 0,sadness,joy,love,anger,fear,surprise,neutral,text,curiosity,embarrassment,emotion_label
0,1,0,0,0,0,0,0,That game hurt.,0,0,0
2,0,0,0,0,0,0,1,"You do right, if you don't care then fuck 'em!",0,0,6
3,0,0,1,0,0,0,0,Man I love reddit.,0,0,2
4,0,0,0,0,0,0,1,"[NAME] was nowhere near them, he was by the Fa...",0,0,6
5,0,1,0,0,0,0,0,Right? Considering it’s such an important docu...,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
211219,0,1,0,0,0,0,0,"Well, I'm glad you're out of all that now. How...",0,0,1
211220,0,0,1,0,0,0,0,Everyone likes [NAME].,0,0,2
211221,0,0,1,0,0,0,0,Well when you’ve imported about a gazillion of...,0,0,2
211222,0,1,0,0,0,0,0,That looks amazing,0,0,1


In [39]:
go_emotions_mapped_tokenized['emotion_label'].value_counts()

emotion_label
 1    56620
 6    55298
 3    27308
-1    22425
 5    13584
 0    11310
 2    11005
 7     5885
 4     2946
 8     1433
Name: count, dtype: int64

After tokenizing the mapped emotions we can see their distribution: 22,425 records overlap, i.e. they carry more than one category (label -1).

In [40]:
go_emotions_mapped_chopped = go_emotions_mapped_reduced[columns_to_keep]
go_emotions_mapped_chopped

Unnamed: 0,sadness,joy,love,anger,fear,surprise,neutral,text
0,1,0,0,0,0,0,0,That game hurt.
2,0,0,0,0,0,0,1,"You do right, if you don't care then fuck 'em!"
3,0,0,1,0,0,0,0,Man I love reddit.
4,0,0,0,0,0,0,1,"[NAME] was nowhere near them, he was by the Fa..."
5,0,1,0,0,0,0,0,Right? Considering it’s such an important docu...
...,...,...,...,...,...,...,...,...
211219,0,1,0,0,0,0,0,"Well, I'm glad you're out of all that now. How..."
211220,0,0,1,0,0,0,0,Everyone likes [NAME].
211221,0,0,1,0,0,0,0,Well when you’ve imported about a gazillion of...
211222,0,1,0,0,0,0,0,That looks amazing


After picking only the columns we finally want and chopping off the rest, we tokenize again.

In [41]:
go_emotions_mapped_chopped_tokenized = tokenize_emotions(go_emotions_mapped_chopped, emotion_categories)
go_emotions_mapped_chopped_tokenized

Unnamed: 0,sadness,joy,love,anger,fear,surprise,neutral,text,emotion_label
0,1,0,0,0,0,0,0,That game hurt.,0
2,0,0,0,0,0,0,1,"You do right, if you don't care then fuck 'em!",6
3,0,0,1,0,0,0,0,Man I love reddit.,2
4,0,0,0,0,0,0,1,"[NAME] was nowhere near them, he was by the Fa...",6
5,0,1,0,0,0,0,0,Right? Considering it’s such an important docu...,1
...,...,...,...,...,...,...,...,...,...
211219,0,1,0,0,0,0,0,"Well, I'm glad you're out of all that now. How...",1
211220,0,0,1,0,0,0,0,Everyone likes [NAME].,2
211221,0,0,1,0,0,0,0,Well when you’ve imported about a gazillion of...,2
211222,0,1,0,0,0,0,0,That looks amazing,1


In the counts below, 7,346 records (about 3.5% of the 207,814) cannot be categorized (label -2), and the overlaps went down from 22,425 to 18,306 (about 8.8%) in the main categories that we need. Removing both groups loses 25,652 records, about 12% of the data, which we accept.

In [42]:
go_emotions_mapped_chopped_tokenized['emotion_label'].value_counts()

emotion_label
 1    57947
 6    55298
 3    27997
-1    18306
 5    14830
 0    11759
 2    11290
-2     7346
 4     3041
Name: count, dtype: int64

In [43]:
multi_emotion_entries = go_emotions_mapped_tokenized[go_emotions_mapped_tokenized['emotion_label'] == -1]

# Count the occurrences of each emotion in these entries
multi_emotion_distribution = multi_emotion_entries[emotion_categories + non_categorized_emotions].sum()
print(multi_emotion_distribution)

sadness           5740
joy              13335
love              6439
anger             7123
fear              1569
surprise          7439
neutral              0
curiosity         3807
embarrassment     1043
dtype: int64


In [44]:
multi_emotion_entries = go_emotions_mapped_chopped_tokenized[go_emotions_mapped_chopped_tokenized['emotion_label'] == -1]

# Count the occurrences of each emotion in these entries
multi_emotion_distribution = multi_emotion_entries[emotion_categories].sum()
print(multi_emotion_distribution)

sadness      5291
joy         12008
love         6154
anger        6434
fear         1474
surprise     6193
neutral         0
dtype: int64
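
The sums above show how often each category appears among the multi-label records, but not which categories appear together. A co-occurrence matrix would answer that. The sketch below uses a small made-up frame instead of the real data. On the notebook's data the input would presumably be `multi_emotion_entries[emotion_categories]`.

```python
import pandas as pd

# Toy 0/1 label matrix (hypothetical)
toy = pd.DataFrame({
    "sadness": [1, 0, 1, 0],
    "joy": [0, 1, 0, 1],
    "love": [1, 1, 0, 1],
})

# Keep only the rows that carry more than one category
multi = toy[toy.sum(axis=1) > 1]

# Entry (a, b) counts the rows where a and b are both 1;
# the diagonal counts how often each category appears at all
co_occurrence = multi.T.dot(multi)
print(co_occurrence)
```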


In [45]:
go_emotions_mapped_chopped_tokenized_slim = slim_down_df(go_emotions_mapped_chopped_tokenized)

go_emotions_mapped_chopped_tokenized.to_csv('../exports/go_emotions_mapped_chopped_tokenized.csv')
go_emotions_mapped_chopped_tokenized_slim.to_csv('../exports/go_emotions_mapped_chopped_tokenized_slim.csv')

Original size: 207814
Slimmed down size: 20781


Now that the mapped and chopped records are tokenized, we can drop every column other than `text` and `emotion_label`,
and then remove all records that have a negative emotion label (-1 marks records with more than one emotion, -2 marks records without a categorized emotion).

In [46]:
def remove_negative_emotion_labels(df, label_name):
    """
    Removes rows with negative emotion labels from the DataFrame.

    :param df: DataFrame to be processed.
    :param label_name: Name of the column that holds the emotion label.
    :return: DataFrame with rows having negative emotion labels removed.
    """
    df_copy = df.copy()
    df_copy = df_copy[df_copy[label_name] >= 0]
    return df_copy

In [47]:
def retain_specific_columns(df, columns_to_keep):
    """
    Retains only the specified columns in the DataFrame.

    :param df: DataFrame to be processed.
    :param columns_to_keep: List of column names to retain.
    :return: DataFrame with only the specified columns.
    """
    df_copy = df.copy()
    df_reduced = df_copy[columns_to_keep]
    return df_reduced

In [48]:
go_emotions_clear_labeled = retain_specific_columns(go_emotions_mapped_chopped_tokenized, ['text', 'emotion_label'])

go_emotions_clear_labeled = go_emotions_clear_labeled.rename(columns={'emotion_label': 'label'})

go_emotions_clear_labeled = remove_negative_emotion_labels(go_emotions_clear_labeled, 'label')

In [49]:
go_emotions_clear_labeled['label'].value_counts()

label
1    57947
6    55298
3    27997
5    14830
0    11759
2    11290
4     3041
Name: count, dtype: int64
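
The counts above are quite uneven, which matters for training later on. As a quick sketch, the class shares and the ratio between the largest and the smallest class can be computed directly from the numbers printed above. On the real data, `go_emotions_clear_labeled['label'].value_counts()` would take the place of the hand-typed series.

```python
import pandas as pd

# Label counts copied from the output above
label_counts = pd.Series({1: 57947, 6: 55298, 3: 27997, 5: 14830, 0: 11759, 2: 11290, 4: 3041})

# Share of each label in percent
shares = (label_counts / label_counts.sum() * 100).round(1)

# How many times larger the biggest class is than the smallest one
imbalance_ratio = label_counts.max() / label_counts.min()

print(shares)
print(f"largest / smallest class: {imbalance_ratio:.1f}")
```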

In [None]:
go_emotions_clear_labeled['label'].value_counts()

label
1    60404
6    55298
3    28879
0    12300
2    11649
5     4072
4     3164
Name: count, dtype: int64

In [51]:
go_emotions_clear_labeled

Unnamed: 0,text,label
0,That game hurt.,0
2,"You do right, if you don't care then fuck 'em!",6
3,Man I love reddit.,2
4,"[NAME] was nowhere near them, he was by the Fa...",6
5,Right? Considering it’s such an important docu...,1
...,...,...
211219,"Well, I'm glad you're out of all that now. How...",1
211220,Everyone likes [NAME].,2
211221,Well when you’ve imported about a gazillion of...,2
211222,That looks amazing,1


The `emotion` dataset does not even need to be cleaned:

In [58]:
emotion_raw_df

Unnamed: 0.1,Unnamed: 0,text,label
0,0,i feel awful about it too because it s my job ...,0
1,1,im alone i feel awful,0
2,2,ive probably mentioned this before but i reall...,1
3,3,i was feeling a little low few days back,0
4,4,i beleive that i am much more sensitive to oth...,2
...,...,...,...
416804,416804,that was what i felt when i was finally accept...,1
416805,416805,i take every day as it comes i m just focussin...,4
416806,416806,i just suddenly feel that everything was fake,0
416807,416807,im feeling more eager than ever to claw back w...,1


In [56]:
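def slim_and_export(input_df: pd.DataFrame, export_file_name: str):
    """
    Exports the full DataFrame to <name>.csv and a down-sampled copy to <name>_slim.csv.

    :param input_df: DataFrame to export.
    :param export_file_name: File name without extension.
    """
    slimmed = slim_down_df(input_df)

    input_df.to_csv(f'../exports/{export_file_name}.csv')
    slimmed.to_csv(f'../exports/{export_file_name}_slim.csv')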

In [62]:
def randomize_dataframe_order(df):
    df_copy = df.copy()
    randomized_df = df_copy.sample(frac=1).reset_index(drop=True)
    return randomized_df

In [57]:
slim_and_export(go_emotions_clear_labeled, 'go_emotions_clear_labeled')

Original size: 182162
Slimmed down size: 18216


In [60]:
emotion_raw_df = emotion_raw_df.drop(columns=['Unnamed: 0'])

concatenated_df = pd.concat([go_emotions_clear_labeled, emotion_raw_df], ignore_index=True)

# Now concatenated_df is the combined DataFrame of both datasets
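
`pd.concat` silently fills columns that exist in only one of the frames with NaN, so it can be worth checking that both datasets share the same columns before merging. The helper `concat_same_schema` below is a hypothetical sketch of such a check, and the toy frames are made up.

```python
import pandas as pd


def concat_same_schema(frames):
    """Concatenate the frames, but fail loudly if their columns differ."""
    expected = list(frames[0].columns)
    for position, frame in enumerate(frames[1:], start=1):
        if list(frame.columns) != expected:
            raise ValueError(
                f"frame {position} has columns {list(frame.columns)}, expected {expected}"
            )
    return pd.concat(frames, ignore_index=True)


# Toy data (hypothetical)
a = pd.DataFrame({"text": ["x"], "label": [0]})
b = pd.DataFrame({"text": ["y", "z"], "label": [1, 6]})
merged = concat_same_schema([a, b])
print(merged)
```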

In [63]:
slim_and_export(concatenated_df, 'merged_emotions_train_ready')
slim_and_export(randomize_dataframe_order(concatenated_df), 'merged_emotions_scrambled_train_ready')

Original size: 598971
Slimmed down size: 59897
Original size: 598971
Slimmed down size: 59897
