In this notebook, I prepare the SemEval-2016 stance dataset for my experiments in three stages:
1. Filtering for the climate change target only.
2. Splitting the training data into stratified subsets of 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90%.
3. Oversampling the minority classes in each subset.

### Loading and Filtering Data:

In [2]:
# Importing libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

In [13]:
# Loading the data

train_df = pd.read_csv("semeval_data/trainingdata-all-annotations copy.csv")
test_df = pd.read_csv("semeval_data/testdata-taskA-all-annotations copy.csv")

In [14]:
train_df.info()
train_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2814 entries, 0 to 2813
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               2814 non-null   int64 
 1   Target           2814 non-null   object
 2   Tweet            2814 non-null   object
 3   Stance           2814 non-null   object
 4   Opinion towards  2814 non-null   object
 5   Sentiment        2814 non-null   object
dtypes: int64(1), object(5)
memory usage: 132.0+ KB


Unnamed: 0,ID,Target,Tweet,Stance,Opinion towards,Sentiment
0,101,Atheism,dear lord thank u for all of ur blessings forg...,AGAINST,OTHER,POSITIVE
1,102,Atheism,"Blessed are the peacemakers, for they shall be...",AGAINST,OTHER,POSITIVE
2,103,Atheism,I am not conformed to this world. I am transfo...,AGAINST,OTHER,POSITIVE
3,104,Atheism,Salah should be prayed with #focus and #unders...,AGAINST,OTHER,POSITIVE
4,105,Atheism,And stay in your houses and do not display you...,AGAINST,OTHER,NEGATIVE


In [15]:
# Printing the unique targets
unique_targets = train_df['Target'].unique()
print(unique_targets)


['Atheism' 'Climate Change is a Real Concern' 'Feminist Movement'
 'Hillary Clinton' 'Legalization of Abortion']


In [16]:
# Filtering for Climate Change tweets
train_df = train_df[train_df['Target']== 'Climate Change is a Real Concern']
train_df = train_df.reset_index(drop=True)

test_df = test_df[test_df['Target']== 'Climate Change is a Real Concern']
test_df = test_df.reset_index(drop=True)


In [18]:
unique_targets = train_df['Target'].unique()
print(unique_targets)
train_df.info()
train_df.head()

['Climate Change is a Real Concern']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   ID               395 non-null    int64 
 1   Target           395 non-null    object
 2   Tweet            395 non-null    object
 3   Stance           395 non-null    object
 4   Opinion towards  395 non-null    object
 5   Sentiment        395 non-null    object
dtypes: int64(1), object(5)
memory usage: 18.6+ KB


Unnamed: 0,ID,Target,Tweet,Stance,Opinion towards,Sentiment
0,614,Climate Change is a Real Concern,"We cant deny it, its really happening. #SemST",FAVOR,TARGET,NEITHER
1,615,Climate Change is a Real Concern,RT @cderworiz: Timelines are short. Strategy m...,FAVOR,TARGET,POSITIVE
2,616,Climate Change is a Real Concern,SO EXCITING! Meaningful climate change action ...,FAVOR,TARGET,POSITIVE
3,617,Climate Change is a Real Concern,"Delivering good jobs for Albertans, maintainin...",FAVOR,TARGET,POSITIVE
4,618,Climate Change is a Real Concern,@davidswann says he wants carbon fund to be sp...,FAVOR,NO ONE,NEITHER


In [19]:
# Class distribution of the SemEval data (train and test)

print("Train:")
print(train_df["Stance"].value_counts())

print("Test:")
print(test_df["Stance"].value_counts())

Train:
Stance
FAVOR      212
NONE       168
AGAINST     15
Name: count, dtype: int64
Test:
Stance
FAVOR      123
NONE        35
AGAINST     11
Name: count, dtype: int64
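
The counts above show how skewed the filtered training set is. As a minimal sketch in plain Python, with the counts copied from the training output above (the variable names are illustrative and not part of the pipeline), the shares and the imbalance ratio can be computed like this:

```python
# Stance counts copied from the training-set output above.
train_counts = {"FAVOR": 212, "NONE": 168, "AGAINST": 15}

total = sum(train_counts.values())  # 395 climate change tweets
shares = {label: count / total for label, count in train_counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(train_counts.values()) / min(train_counts.values())

for label, share in shares.items():
    print(f"{label:<8} {share:.1%}")
print(f"Largest/smallest class ratio: {imbalance_ratio:.1f}")
```

AGAINST makes up less than 4% of the training tweets, which is why the later steps stratify the splits and oversample the minority classes.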


### Splitting the training data into subsets:

In [21]:
# Define the split percentages
split_percentages = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
RANDOM_SEED = 42

# Create subsets and save

for train_percent in split_percentages:
    subset_train_df, _ = train_test_split(
        train_df,
        train_size=train_percent,
        shuffle=True,
        random_state=RANDOM_SEED,
        stratify=train_df["Stance"]  # Stratify to maintain label distribution
    )

    print(f"Created subset with {len(subset_train_df)} samples out of {len(train_df)} total.")
    
    # Save to disk
    subset_path = f"semeval_data/initial_splits/subset_train_{int(train_percent*100)}.csv"
    subset_train_df.to_csv(subset_path, index=False)

Created subset with 39 samples out of 395 total.
Created subset with 79 samples out of 395 total.
Created subset with 118 samples out of 395 total.
Created subset with 158 samples out of 395 total.
Created subset with 197 samples out of 395 total.
Created subset with 237 samples out of 395 total.
Created subset with 276 samples out of 395 total.
Created subset with 316 samples out of 395 total.
Created subset with 355 samples out of 395 total.
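
The subset sizes printed above are consistent with a fractional `train_size` being rounded down to a whole number of rows. The sketch below assumes that rounding rule (floor of `train_size * n_total`) and only reproduces the arithmetic; it does not call scikit-learn.

```python
import math

n_total = 395  # size of the filtered training set
split_percentages = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Assumption: a float train_size yields floor(train_size * n_total) rows.
expected_sizes = [math.floor(p * n_total) for p in split_percentages]
print(expected_sizes)
```

Under that assumption the list matches the sizes reported above, e.g. 10% of 395 is 39.5, which becomes 39 rows.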


### Oversampling the subsets to make the class distribution balanced:
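
Before running this on the real subsets, here is a small sketch of random oversampling with replacement on made-up data. It uses only the standard library rather than `sklearn.utils.resample`, and the toy rows and the names `rows`, `by_class` and `balanced` are hypothetical; it illustrates the idea and is not the code used for the experiments.

```python
import random
from collections import Counter

RANDOM_SEED = 42
rng = random.Random(RANDOM_SEED)

# Toy rows (label, text); made-up stand-ins for tweets.
rows = (
    [("FAVOR", f"favor tweet {i}") for i in range(5)]
    + [("NONE", f"none tweet {i}") for i in range(3)]
    + [("AGAINST", "against tweet 0")]
)

# Group the rows by class.
by_class = {}
for label, text in rows:
    by_class.setdefault(label, []).append((label, text))

# Target size: the number of rows in the largest class.
n_samples = max(len(group) for group in by_class.values())

balanced = []
for label, group in by_class.items():
    if len(group) == n_samples:
        balanced.extend(group)  # the majority class is kept as-is
    else:
        # Draw with replacement until the class has n_samples rows.
        balanced.extend(rng.choices(group, k=n_samples))

rng.shuffle(balanced)

class_counts = Counter(label for label, _ in balanced)
unique_against = {text for label, text in balanced if label == "AGAINST"}
print(class_counts)
print(f"{len(unique_against)} unique AGAINST text(s)")
```

Every class ends up with five rows, but all five AGAINST rows are copies of the same original. Oversampling balances the counts without adding new information, which matters most for the smallest subsets. Unlike this sketch, the cell below keeps FAVOR unchanged and therefore relies on FAVOR being the largest class in every subset.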

In [23]:
for train_percent in split_percentages:
    # Read the subset created in the previous step
    subset_path = f"semeval_data/initial_splits/subset_train_{int(train_percent*100)}.csv"
    subset_train_df = pd.read_csv(subset_path)

    print(f"\nOversampling subset with {len(subset_train_df)} samples")

    # Separate each class
    df_favor   = subset_train_df[subset_train_df["Stance"] == "FAVOR"]
    df_none    = subset_train_df[subset_train_df["Stance"] == "NONE"]
    df_against = subset_train_df[subset_train_df["Stance"] == "AGAINST"]


    # Oversample the smaller classes to match the largest
    n_samples = max(len(df_favor), len(df_none), len(df_against))

    df_none_oversampled = resample(df_none,
                                   replace=True,
                                   n_samples=n_samples,
                                   random_state=RANDOM_SEED)
    
    df_against_oversampled = resample(df_against,
                                      replace=True,
                                      n_samples=n_samples,
                                      random_state=RANDOM_SEED)
    
    # Combine the oversampled classes with the majority class (FAVOR)
    oversampled_train_df = pd.concat([df_favor,
                                      df_none_oversampled,
                                      df_against_oversampled])

    # Shuffle the final oversampled set
    oversampled_train_df = oversampled_train_df.sample(
        frac=1,
        random_state=RANDOM_SEED
    ).reset_index(drop=True)


    # Save the oversampled subsets
    oversampled_path = f"semeval_data/oversampled_splits/subset_train_{int(train_percent*100)}.csv"
    oversampled_train_df.to_csv(oversampled_path, index=False)
    print("Oversampling complete. Class distribution:")
    print(oversampled_train_df["Stance"].value_counts())


Oversampling subset with 39 samples
Oversampling complete. Class distribution:
Stance
AGAINST    21
FAVOR      21
NONE       21
Name: count, dtype: int64

Oversampling subset with 79 samples
Oversampling complete. Class distribution:
Stance
NONE       42
FAVOR      42
AGAINST    42
Name: count, dtype: int64

Oversampling subset with 118 samples
Oversampling complete. Class distribution:
Stance
AGAINST    63
FAVOR      63
NONE       63
Name: count, dtype: int64

Oversampling subset with 158 samples
Oversampling complete. Class distribution:
Stance
AGAINST    85
FAVOR      85
NONE       85
Name: count, dtype: int64

Oversampling subset with 197 samples
Oversampling complete. Class distribution:
Stance
FAVOR      106
AGAINST    106
NONE       106
Name: count, dtype: int64

Oversampling subset with 237 samples
Oversampling complete. Class distribution:
Stance
AGAINST    127
NONE       127
FAVOR      127
Name: count, dtype: int64

Oversampling subset with 276 samples
Oversampling complete.