# Dataset Classes Remapping

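I need to remap the classes of the Fakeddit dataset (which follow the source subreddit) to our binary classification problem (pristine or fake).
In particular, every image coming from the "PS battle comments" source, stored in the "subreddit" column as `psbattle_artwork`, is considered "fake"; everything else is considered "pristine".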

In [1]:
import pandas as pd
import os
import json
from IPython.display import display
from dotenv import load_dotenv

load_dotenv()


True

### Get the unique list of all possible subreddits

In [2]:
multimodal_train_tsv_path = os.getenv('MULTIMODAL_TRAIN_CLEANED_NO_EXACT_DUPLICATES_NO_IMAGEONLY_DUPLICATES_NO_CORRUPTED_TSV')

df = pd.read_csv(multimodal_train_tsv_path, sep='\t')

#get unique values from the "subreddit" column
unique_subreddits = list(df['subreddit'].unique())

with open('subreddit_values.json', 'w') as json_file:
    json.dump(unique_subreddits, json_file, indent=2)

print("Unique subreddit values saved to subreddit_values.json")
print(unique_subreddits)

Unique subreddit values saved to subreddit_values.json
['mildlyinteresting', 'pareidolia', 'neutralnews', 'photoshopbattles', 'nottheonion', 'psbattle_artwork', 'fakehistoryporn', 'propagandaposters', 'upliftingnews', 'fakealbumcovers', 'subredditsimulator', 'satire', 'savedyouaclick', 'misleadingthumbnails', 'pic', 'theonion', 'confusing_perspective', 'usanews', 'usnews', 'waterfordwhispersnews', 'subsimulatorgpt2', 'fakefacts']


Add a column named "class" that contains either "fake" or "pristine", depending on the value of the "subreddit" column.

In [3]:
multimodal_train_tsv_path = os.getenv('MULTIMODAL_TRAIN_CLEANED_NO_EXACT_DUPLICATES_NO_IMAGEONLY_DUPLICATES_NO_CORRUPTED_TSV')
multimodal_test_tsv_path = os.getenv('MULTIMODAL_TEST_CLEANED_NO_EXACT_DUPLICATES_NO_CORRUPTED_TSV')
multimodal_validation_tsv_path = os.getenv('MULTIMODAL_VAL_CLEANED_NO_EXACT_DUPLICATES_NO_CORRUPTED_TSV')

df_train = pd.read_csv(multimodal_train_tsv_path, sep='\t')
df_test = pd.read_csv(multimodal_test_tsv_path, sep='\t')
df_val = pd.read_csv(multimodal_validation_tsv_path, sep='\t')

#create a new column "class" and set default value to "pristine"
df_train['class'] = 'pristine'
df_test['class'] = 'pristine'
df_val['class'] = 'pristine'

#then set the value to "fake" for rows where "subreddit" is equal to "psbattle_artwork"
df_train.loc[df_train['subreddit'] == 'psbattle_artwork', 'class'] = 'fake'
df_test.loc[df_test['subreddit'] == 'psbattle_artwork', 'class'] = 'fake'
df_val.loc[df_val['subreddit'] == 'psbattle_artwork', 'class'] = 'fake'

df_train.to_csv("train_tsv_with_class.tsv", sep='\t', index=False)
df_test.to_csv("test_tsv_with_class.tsv", sep='\t', index=False)
df_val.to_csv("val_tsv_with_class.tsv", sep='\t', index=False)

train_counts = df_train['class'].value_counts()
test_counts = df_test['class'].value_counts()
val_counts = df_val['class'].value_counts()

print("Train Counts:")
print(train_counts)
print("\nTest Counts:")
print(test_counts)
print("\nValidation Counts:")
print(val_counts)

print("\n=> Total \"pristine\": "+ str(train_counts['pristine']+test_counts['pristine']+val_counts['pristine'])+" | Total \"fake\": "+str(train_counts['fake']+test_counts['fake']+val_counts['fake']))

Train Counts:
class
pristine    385871
fake        157126
Name: count, dtype: int64

Test Counts:
class
pristine    41567
fake        16480
Name: count, dtype: int64

Validation Counts:
class
pristine    40909
fake        16580
Name: count, dtype: int64

=> Total "pristine": 468347 | Total "fake": 190186
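
The cell above repeats the same two steps for each split. As a minimal sketch, the same rule can be written once as a helper and applied to any split. The names `add_binary_class` and `FAKE_SUBREDDITS` and the toy frame are illustrative assumptions, not part of the original notebook, and the data below is made up.

```python
import numpy as np
import pandas as pd

# Assumption: only this subreddit is mapped to "fake", as in the cell above.
FAKE_SUBREDDITS = ["psbattle_artwork"]

def add_binary_class(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a 'class' column set to 'fake' or 'pristine'."""
    out = df.copy()
    is_fake = out["subreddit"].isin(FAKE_SUBREDDITS)
    out["class"] = np.where(is_fake, "fake", "pristine")
    return out

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "subreddit": ["psbattle_artwork", "nottheonion", "photoshopbattles"],
})
toy_labeled = add_binary_class(toy)
print(toy_labeled["class"].value_counts())
```

Returning a copy leaves the input frame untouched, so the helper could be called on `df_train`, `df_test` and `df_val` in turn without side effects.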


Given the old 6-way classes:

 Class 0: True

 Class 1: Satire

 Class 2: False Connection

 Class 3: Imposter Content

 Class 4: Manipulated Content

 Class 5: Misleading Content

I also add 4 redundant columns, "real_image", "fake_image", "real_text" and "fakenews_text", to df_train based on the value of "6_way_label" (class 4, Manipulated Content, is the one assigned to the psbattle_artwork rows):

 "real_image" = 1 if 6_way_label is equal to 0, 1, 2, 3 or 5, else = 0.

 "fake_image" = 0 if 6_way_label is equal to 0, 1, 2, 3 or 5, else = 1.

 "real_text" = 1 if 6_way_label is equal to 0 or 4, else = 0.

 "fakenews_text" = 0 if 6_way_label is equal to 0 or 4, else = 1.

In [3]:
df_train = pd.read_csv('C:/Users/nello/OneDrive - University of Pisa/TESI/TSV_JSON/1_dataset_cleaning/tsv/train_tsv_with_class.tsv', sep='\t')
df_test = pd.read_csv('C:/Users/nello/OneDrive - University of Pisa/TESI/TSV_JSON/1_dataset_cleaning/tsv/test_tsv_with_class.tsv', sep='\t')
df_val = pd.read_csv('C:/Users/nello/OneDrive - University of Pisa/TESI/TSV_JSON/1_dataset_cleaning/tsv/val_tsv_with_class.tsv', sep='\t')

In [4]:
df_test.head()

Unnamed: 0,author,clean_title,created_utc,domain,hasImage,id,image_url,linked_submission_id,num_comments,score,subreddit,title,upvote_ratio,2_way_label,3_way_label,6_way_label,class
0,trustbytrust,stargazer,1425139000.0,,True,cozywbv,http://i.imgur.com/BruWKDi.jpg,2xct9d,,3,psbattle_artwork,stargazer,,0,2,4,fake
1,,yeah,1438173000.0,,True,ctk61yw,http://i.imgur.com/JRZT727.jpg,3f0h7o,,2,psbattle_artwork,yeah,,0,2,4,fake
2,chaseoes,pd phoenix car thief gets instructions from yo...,1560492000.0,abc15.com,True,c0gl7r,https://external-preview.redd.it/1A2_4VwgS8Qd2...,,2.0,16,nottheonion,PD: Phoenix car thief gets instructions from Y...,0.89,1,0,0,pristine
3,SFepicure,as trump accuses iran he has one problem his o...,1560606000.0,nytimes.com,True,c0xdqy,https://external-preview.redd.it/9BKRcgvaobpTo...,,4.0,45,neutralnews,"As Trump Accuses Iran, He Has One Problem: His...",0.78,1,0,0,pristine
4,fragments_from_Work,believers hezbollah,1515139000.0,i.imgur.com,True,7o9rmx,https://external-preview.redd.it/rbwXHncnjVh51...,,40.0,285,propagandaposters,"""Believers"" - Hezbollah 2011",0.95,0,1,5,pristine


In [9]:
# Keep only the columns needed for the export (this drops all the others)
df_train = df_train[['id', 'author', 'num_comments', '6_way_label', 'class']]

# Function to assign values based on conditions
def assign_values(label):
    real_image = 0 if label == 4 else 1
    fake_image = 1 if label == 4 else 0
    real_text = 1 if label in [0, 4] else 0
    fakenews_text = 0 if label in [0, 4] else 1
    return pd.Series([real_image, fake_image, real_text, fakenews_text])

# Apply the function to create new columns
df_train[['real_image', 'fake_image', 'real_text', 'fakenews_text']] = df_train['6_way_label'].apply(assign_values)

# Export to CSV
df_train.to_csv('train_tsv_with_class2.csv', index=False)
df_train.head()

Unnamed: 0,id,author,num_comments,6_way_label,class,real_image,fake_image,real_text,fakenews_text
0,awxhir,Alexithymia,2.0,0,pristine,1,0,1,0
1,98pbid,VIDCAs17,2.0,2,pristine,1,0,0,1
2,6f2cy5,prometheus1123,1.0,0,pristine,1,0,1,0
3,4xypkv,,26.0,0,pristine,1,0,1,0
4,8gnet9,3rikR3ith,2.0,2,pristine,1,0,0,1
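
Row-wise `apply` returning a `pd.Series` can be slow on the roughly 543,000 training rows. The sketch below is a vectorized equivalent of `assign_values`, following the rules coded in the cell above (label 4 marks a manipulated image; labels 0 and 4 count as real text). The function name `add_modality_flags` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def add_modality_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the four 0/1 flag columns derived from '6_way_label'."""
    out = df.copy()
    label = out["6_way_label"]
    is_fake_image = (label == 4).astype(int)       # Manipulated Content
    is_real_text = label.isin([0, 4]).astype(int)  # True or Manipulated Content
    out["real_image"] = 1 - is_fake_image
    out["fake_image"] = is_fake_image
    out["real_text"] = is_real_text
    out["fakenews_text"] = 1 - is_real_text
    return out

# Toy data covering every 6-way label once (made up for illustration)
toy = pd.DataFrame({"6_way_label": [0, 1, 2, 3, 4, 5]})
flags = add_modality_flags(toy)
print(flags)
```

Each pair of columns is complementary by construction, so `real_image + fake_image` and `real_text + fakenews_text` always equal 1.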
