# Dataset Classes Remapping

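This notebook remaps the classes of the Fakeddit dataset (its subreddit sources) to our binary classification problem: pristine or fake.
In particular, we consider as "fake" every image that comes from the Photoshop Battles comments, which appear in the dataset under the subreddit value "psbattle_artwork"; images from every other subreddit are labeled "pristine".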

In [1]:
import pandas as pd
import os
import json
from IPython.display import display
from dotenv import load_dotenv

load_dotenv()


True

### Get the unique list of all possible subreddits

In [2]:
multimodal_train_tsv_path = os.getenv('MULTIMODAL_TRAIN_CLEANED_NO_EXACT_DUPLICATES_NO_IMAGEONLY_DUPLICATES_NO_CORRUPTED_TSV')

df = pd.read_csv(multimodal_train_tsv_path, sep='\t')

# Get the unique values of the "subreddit" column
unique_subreddits = list(df['subreddit'].unique())

with open('subreddit_values.json', 'w') as json_file:
    json.dump(unique_subreddits, json_file, indent=2)

print("Unique subreddit values saved to subreddit_values.json")
print(unique_subreddits)

Unique subreddit values saved to subreddit_values.json
['mildlyinteresting', 'pareidolia', 'neutralnews', 'photoshopbattles', 'nottheonion', 'psbattle_artwork', 'fakehistoryporn', 'propagandaposters', 'upliftingnews', 'fakealbumcovers', 'subredditsimulator', 'satire', 'savedyouaclick', 'misleadingthumbnails', 'pic', 'theonion', 'confusing_perspective', 'usanews', 'usnews', 'waterfordwhispersnews', 'subsimulatorgpt2', 'fakefacts']
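
The cell above reads the TSV path from an environment variable. `os.getenv` returns `None` when the variable is unset, and `pd.read_csv` then fails with a less obvious error. A minimal sketch of a fail-fast lookup follows; `require_env_path` is a hypothetical helper name and is not part of the original notebook.

```python
import os


def require_env_path(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper (an assumption, not from the original notebook).
    """
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file"
        )
    return value
```

Under this assumption, `require_env_path('MULTIMODAL_TRAIN_CLEANED_NO_EXACT_DUPLICATES_NO_IMAGEONLY_DUPLICATES_NO_CORRUPTED_TSV')` could stand in for the plain `os.getenv(...)` call.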


### Add a "class" column set to "fake" or "pristine" depending on the "subreddit" value
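
Before running the remapping on the full splits, the rule can be checked on a toy frame. The sketch below is one possible way to express it as a reusable function; `add_binary_class` and `FAKE_SUBREDDITS` are hypothetical names, and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Subreddit values treated as "fake"; the notebook uses only "psbattle_artwork".
FAKE_SUBREDDITS = ("psbattle_artwork",)


def add_binary_class(df: pd.DataFrame, fake_subreddits=FAKE_SUBREDDITS) -> pd.DataFrame:
    """Return a copy of `df` with a "class" column: "fake" or "pristine".

    Hypothetical helper (an assumption, not from the original notebook).
    """
    out = df.copy()
    is_fake = out["subreddit"].isin(fake_subreddits)
    # Vectorized choice: "fake" where the mask is True, "pristine" elsewhere.
    out["class"] = np.where(is_fake, "fake", "pristine")
    return out


# Toy data for illustration only; not taken from Fakeddit.
toy = pd.DataFrame(
    {"subreddit": ["pic", "psbattle_artwork", "photoshopbattles", "theonion"]}
)
labelled = add_binary_class(toy)
print(labelled)
```

Note that only "psbattle_artwork" maps to "fake" here; "photoshopbattles" rows stay "pristine", matching the rule used in this notebook.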

In [3]:
multimodal_train_tsv_path = os.getenv('MULTIMODAL_TRAIN_CLEANED_NO_EXACT_DUPLICATES_NO_IMAGEONLY_DUPLICATES_NO_CORRUPTED_TSV')
multimodal_test_tsv_path = os.getenv('MULTIMODAL_TEST_CLEANED_NO_EXACT_DUPLICATES_NO_CORRUPTED_TSV')
multimodal_validation_tsv_path = os.getenv('MULTIMODAL_VAL_CLEANED_NO_EXACT_DUPLICATES_NO_CORRUPTED_TSV')

df_train = pd.read_csv(multimodal_train_tsv_path, sep='\t')
df_test = pd.read_csv(multimodal_test_tsv_path, sep='\t')
df_val = pd.read_csv(multimodal_validation_tsv_path, sep='\t')

# Default every row to "pristine", then mark rows whose "subreddit" is "psbattle_artwork" as "fake"
for df_split in (df_train, df_test, df_val):
    df_split['class'] = 'pristine'
    df_split.loc[df_split['subreddit'] == 'psbattle_artwork', 'class'] = 'fake'

df_train.to_csv("train_tsv_with_class.tsv", sep='\t', index=False)
df_test.to_csv("test_tsv_with_class.tsv", sep='\t', index=False)
df_val.to_csv("val_tsv_with_class.tsv", sep='\t', index=False)

train_counts = df_train['class'].value_counts()
test_counts = df_test['class'].value_counts()
val_counts = df_val['class'].value_counts()

print("Train Counts:")
print(train_counts)
print("\nTest Counts:")
print(test_counts)
print("\nValidation Counts:")
print(val_counts)

# Sum each class over the three splits
total_pristine = train_counts['pristine'] + test_counts['pristine'] + val_counts['pristine']
total_fake = train_counts['fake'] + test_counts['fake'] + val_counts['fake']
print(f'\n=> Total "pristine": {total_pristine} | Total "fake": {total_fake}')

Train Counts:
class
pristine    385871
fake        157126
Name: count, dtype: int64

Test Counts:
class
pristine    41567
fake        16480
Name: count, dtype: int64

Validation Counts:
class
pristine    40909
fake        16580
Name: count, dtype: int64

=> Total "pristine": 468347 | Total "fake": 190186
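
The counts above imply that the two classes are imbalanced. As a small worked example, the share of "fake" samples can be computed per split and overall from the printed numbers; `fake_fraction` is a hypothetical helper name, and the counts are copied from the output above.

```python
# Per-split counts copied from the notebook output above.
counts = {
    "train": {"pristine": 385871, "fake": 157126},
    "test": {"pristine": 41567, "fake": 16480},
    "val": {"pristine": 40909, "fake": 16580},
}


def fake_fraction(split_counts: dict) -> float:
    """Fraction of samples labeled "fake" in one split (hypothetical helper)."""
    total = split_counts["pristine"] + split_counts["fake"]
    return split_counts["fake"] / total


for split, split_counts in counts.items():
    print(f"{split}: {fake_fraction(split_counts):.1%} fake")

# Sum each label over the three splits.
totals = {
    label: sum(split_counts[label] for split_counts in counts.values())
    for label in ("pristine", "fake")
}
print(f"overall: {fake_fraction(totals):.1%} fake")
```

Each split comes out at between 28% and 29% "fake", so the three splits share a similar class balance.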
