# Reformatting the data from the MMHS150K dataset

This notebook reformats the data from the MMHS150K dataset. The original ground truth is stored as a JSON file; here it is converted to a CSV file for easier access and manipulation. The resulting CSV contains the following columns:
- `index`: The unique identifier for each tweet.
- `img_url`: The URL of the image associated with the tweet.
- `labels`: The list of numeric labels assigned to the tweet (0 means `NotHate`).
- `tweet_url`: The URL of the tweet.
- `tweet_text`: The text content of the tweet.
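- `labels_str`: The same labels as a list of strings (e.g. `NotHate`, `Racist`).
- `tweet_text_clean`: The cleaned version of the tweet text. (everything from the first `https://t.co/` link onward is dropped, and mentions are replaced with `<tag>`)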
- `img_text`: The text extracted from the image, if available. (otherwise NaN)
- `text_in_image`: Indicates whether the image contains text or not.
- `hate_speech`: The fraction of the tweet's labels that are hateful, i.e. different from 0. (from 0 to 1)
- `binary_hate`: A binary label indicating whether the tweet contains hate speech or not. (1 if `hate_speech` is at least 0.5, otherwise 0)
- `split`: The split of the dataset (train, test, or val).
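
As a rough illustration of how `hate_speech` and `binary_hate` relate to `labels`, the sketch below recomputes both for a few label lists that appear in the output further down. The helper name `hate_scores` is hypothetical and not part of the notebook, which computes the same quantities with pandas.

```python
def hate_scores(labels, threshold=0.5):
    """Return (hate_speech, binary_hate) for one tweet's list of numeric labels."""
    # Label 0 means NotHate; any other label counts as hateful
    hate_speech = sum(1 for label in labels if label != 0) / len(labels)
    binary_hate = 1 if hate_speech >= threshold else 0
    return hate_speech, binary_hate


for labels in ([0, 0, 0], [1, 0, 0], [1, 0, 1], [4, 1, 3]):
    print(labels, hate_scores(labels))
```

With three labels per tweet, `binary_hate` is 1 exactly when at least two of them are hateful.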

The notebook expects the following directory structure:
```
.
├── MMHS150K
│   ├── MMHS150K_GT.json
│   ├── img_resized
│   │   ├── 1114679353714016256.jpg
│   │   ├── ...
│   │   └── 1110368198786846720.jpg
│   ├── img_txt
│   │   ├── 1114679353714016256.json
│   │   ├── ...
│   │   └── 1110368198786846720.json
│   └── splits
│       ├── train_ids.txt
│       ├── test_ids.txt
│       └── val_ids.txt
└── reformat_data.ipynb
```

In [12]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
import json
import tqdm
import re

import torch
import torch.nn as nn


In [15]:
## Load the text side of the dataset into a pandas DataFrame and save it as a CSV file
# for ease of use in the future

# Ground-truth JSON file (a file path, despite the variable name)
DATA_FOLDER = 'MMHS150K/MMHS150K_GT.json'
# Folder with the img_txt
IMG_TEXT_FOLDER = 'MMHS150K/img_txt/'
# Splits folder
SPLITS_FOLDER = 'MMHS150K/splits/'

## Load data
data = pd.read_json(DATA_FOLDER, orient='index', convert_dates=False, convert_axes=False)
data = data.reset_index(drop=False)
data['index'] = data['index'].astype(int)


## Clean the tweet text
# Keep only the text before https://t.co/
data['tweet_text_clean'] = data['tweet_text'].str.split('https://t.co/').str[0]
# Replace any occurrence of @user with <tag> (the match also consumes the character before the @)
regex_tag = r'(^|[^@\w])@(\w{1,15})\b'
data['tweet_text_clean'] = data['tweet_text_clean'].apply(lambda x: re.sub(regex_tag, '<tag>', x))


## Add the text of the image if it exists
# Map each tweet index to the text extracted from its image (one JSON file per image)
img_texts = {}
for file in os.listdir(IMG_TEXT_FOLDER):
    index = int(file.split('.')[0])

    # Open the file (json)
    with open(os.path.join(IMG_TEXT_FOLDER, file)) as f:
        img_texts[index] = json.load(f)["img_text"]
# Look up the text for every row at once; tweets without a file get NaN
data['img_text'] = data['index'].map(img_texts)
# True when text was extracted from the image
data['text_in_image'] = data['img_text'].notna()


## Add the hate_speech label
# Fraction of the tweet's labels that are hateful (any label other than 0, i.e. NotHate)
data['hate_speech'] = data['labels'].apply(lambda labels: np.mean([0 if label == 0 else 1 for label in labels]))
data['binary_hate'] = data['hate_speech'].apply(lambda x: 1 if x >= 0.5 else 0)


## Add the split
# Load the splits
train = pd.read_csv(SPLITS_FOLDER + 'train_ids.txt', header=None)
test = pd.read_csv(SPLITS_FOLDER + 'test_ids.txt', header=None)
val = pd.read_csv(SPLITS_FOLDER + 'val_ids.txt', header=None)

# Default every row to 'train', then overwrite the rows whose index is listed in the test or val split
data['split'] = 'train'
data.loc[data['index'].isin(test[0]), 'split'] = 'test'
data.loc[data['index'].isin(val[0]), 'split'] = 'val'

display(data)



## Save the dataset
data.to_csv('MMHS150K/MMHS150K.csv', index=False)

Unnamed: 0,index,img_url,labels,tweet_url,tweet_text,labels_str,tweet_text_clean,img_text,text_in_image,hate_speech,binary_hate,split
0,1114679353714016256,http://pbs.twimg.com/tweet_video_thumb/D3gi9MH...,"[4, 1, 3]",https://twitter.com/user/status/11146793537140...,@FriskDontMiss Nigga https://t.co/cAsaLWEpue,"[Religion, Racist, Homophobe]",<tag> Nigga,#YOUNGERU SAVE IT,True,1.000000,1,train
1,1063020048816660480,http://pbs.twimg.com/ext_tw_video_thumb/106301...,"[5, 5, 5]",https://twitter.com/user/status/10630200488166...,My horses are retarded https://t.co/HYhqc6d5WN,"[OtherHate, OtherHate, OtherHate]",My horses are retarded,,False,1.000000,1,train
2,1108927368075374593,http://pbs.twimg.com/media/D2OzhzHUwAADQjd.jpg,"[0, 0, 0]",https://twitter.com/user/status/11089273680753...,“NIGGA ON MA MOMMA YOUNGBOY BE SPITTING REAL S...,"[NotHate, NotHate, NotHate]",“NIGGA ON MA MOMMA YOUNGBOY BE SPITTING REAL S...,,False,0.000000,0,train
3,1114558534635618305,http://pbs.twimg.com/ext_tw_video_thumb/111401...,"[1, 0, 0]",https://twitter.com/user/status/11145585346356...,RT xxSuGVNGxx: I ran into this HOLY NIGGA TODA...,"[Racist, NotHate, NotHate]",RT xxSuGVNGxx: I ran into this HOLY NIGGA TODA...,,False,0.333333,0,train
4,1035252480215592966,http://pbs.twimg.com/media/Dl30pGIU8AAVGxO.jpg,"[1, 0, 1]",https://twitter.com/user/status/10352524802155...,“EVERYbody calling you Nigger now!” https://t....,"[Racist, NotHate, Racist]",“EVERYbody calling you Nigger now!”,,False,0.666667,1,val
...,...,...,...,...,...,...,...,...,...,...,...,...
149818,1114170734472048640,http://pbs.twimg.com/tweet_video_thumb/D3ZUXNw...,"[2, 5, 0]",https://twitter.com/user/status/11141707344720...,@svdate @gtconway3d I would just say hes Donny...,"[Sexist, OtherHate, NotHate]",<tag><tag> I would just say hes Donny the retard,LATE MOGIF LATE MOTIV,True,0.666667,1,train
149819,1110368198786846720,http://pbs.twimg.com/ext_tw_video_thumb/111036...,"[0, 0, 0]",https://twitter.com/user/status/11103681987868...,@Cheftime_Dev congrats my nigga keep on grindi...,"[NotHate, NotHate, NotHate]",<tag> congrats my nigga keep on grinding,ON AIR Elapsed Time: 05.47:18 Select Your Leve...,True,0.000000,0,train
149820,1106941858540851200,http://pbs.twimg.com/media/D1yluGmXgAEKNG5.jpg,"[0, 1, 0]",https://twitter.com/user/status/11069418585408...,My nigga big shitty https://t.co/e0snJGBgH9,"[NotHate, Racist, NotHate]",My nigga big shitty,,False,0.333333,0,train
149821,1105268309233188865,http://pbs.twimg.com/tweet_video_thumb/D1azqiz...,"[1, 0, 0]",https://twitter.com/user/status/11052683092331...,did she just say “my nigga” to Rich? &amp; she...,"[Racist, NotHate, NotHate]",did she just say “my nigga” to Rich? &amp; she...,,False,0.333333,0,train
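
Rows such as 149818 show `<tag><tag>` with no space between the two tags. This happens because `regex_tag` also matches the character in front of the `@`, so the space separating two consecutive mentions is replaced together with the second mention. If keeping that separator is preferred, one possible variant uses a negative lookbehind instead of a capturing group. This is an assumption for illustration, not what the notebook uses, and `regex_tag_keep` is a hypothetical name.

```python
import re

# Pattern used in the notebook: group 1 captures the character before the @
regex_tag = r'(^|[^@\w])@(\w{1,15})\b'
# Hypothetical variant: the lookbehind checks that character without consuming it
regex_tag_keep = r'(?<![@\w])@(\w{1,15})\b'

text = '@svdate @gtconway3d I would just say'
original = re.sub(regex_tag, '<tag>', text)
variant = re.sub(regex_tag_keep, '<tag>', text)

print(original)  # <tag><tag> I would just say
print(variant)   # <tag> <tag> I would just say
```

Switching to the variant would change `tweet_text_clean` for every tweet with a mention that is not at the start of the text, so the CSV files would need to be regenerated.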


A second CSV is created, containing only the tweets whose image contains text. It has the same columns as the first CSV, except `text_in_image`, which is dropped because it would be `True` for every row.

In [17]:
# Second version of the dataset with only tweets which have a text in the image
data2 = data[data['text_in_image']]

# Remove text_in_image column
data2 = data2.drop(columns=['text_in_image'])

display(data2)

# Save the dataset
data2.to_csv('MMHS150K/MMHS150K_text_in_image.csv', index=False)

Unnamed: 0,index,img_url,labels,tweet_url,tweet_text,labels_str,tweet_text_clean,img_text,hate_speech,binary_hate,split
0,1114679353714016256,http://pbs.twimg.com/tweet_video_thumb/D3gi9MH...,"[4, 1, 3]",https://twitter.com/user/status/11146793537140...,@FriskDontMiss Nigga https://t.co/cAsaLWEpue,"[Religion, Racist, Homophobe]",<tag> Nigga,#YOUNGERU SAVE IT,1.000000,1,train
6,1113920043568463874,http://pbs.twimg.com/media/D3VwYEKW4AYz4vk.jpg,"[5, 1, 1]",https://twitter.com/user/status/11139200435684...,@WhiteHouse @realDonaldTrump Fuck ice. White s...,"[OtherHate, Racist, Racist]",<tag><tag> Fuck ice. White supremacist trash. ...,"Hello, White Nationalist. Good-bye. Others wil...",1.000000,1,train
7,1114588617693966336,http://pbs.twimg.com/media/D3fQcCCWAAIG8tO.jpg,"[0, 0, 0]",https://twitter.com/user/status/11145886176939...,Day’s a cunt https://t.co/Ie6QZReHsw,"[NotHate, NotHate, NotHate]",Day’s a cunt,Dad's a Cunt Mum's a Cunt Nan's a Cunt Kids ar...,0.000000,0,train
8,1045809514740666370,http://pbs.twimg.com/media/DoN2KFmXcAAIT-Y.jpg,"[3, 3, 0]",https://twitter.com/user/status/10458095147406...,#sissy faggot https://t.co/bm1nk8HcYO,"[Homophobe, Homophobe, NotHate]",#sissy faggot,EVERY SISSY GIRL SHOULD KNOW THAT MEN ARE WIZA...,0.666667,1,val
10,1116702448016556035,http://pbs.twimg.com/tweet_video_thumb/D39S8tb...,"[0, 0, 0]",https://twitter.com/user/status/11167024480165...,@DefNotJerm So.... you turn to twitter for it ...,"[NotHate, NotHate, NotHate]",<tag> So.... you turn to twitter for it instea...,SBURG,0.000000,0,train
...,...,...,...,...,...,...,...,...,...,...,...
149808,1056595445215059969,http://pbs.twimg.com/media/DqnH9QYVYAAKgDj.jpg,"[1, 0, 1]",https://twitter.com/user/status/10565954452150...,New Video: Spice – Black Hypocrisy https://t.c...,"[Racist, NotHate, Racist]",New Video: Spice – Black Hypocrisy,vevo,0.666667,1,train
149812,1107387893541232642,http://pbs.twimg.com/media/D147aqyX0AICbzE.jpg,"[0, 0, 0]",https://twitter.com/user/status/11073878935412...,@EcholsEli Nigga Ricardo literally givin out n...,"[NotHate, NotHate, NotHate]",<tag> Nigga Ricardo literally givin out n-word...,Ricardo wants to grant you a free n-word pass!...,0.000000,0,train
149816,1105465552544374786,http://pbs.twimg.com/tweet_video_thumb/D1dnDez...,"[0, 1, 1]",https://twitter.com/user/status/11054655525443...,@quisLaFlare Good luck my nigga 🤘🏾 https://t.c...,"[NotHate, Racist, Racist]",<tag> Good luck my nigga 🤘🏾,YOU CAN DO IT,0.666667,1,train
149818,1114170734472048640,http://pbs.twimg.com/tweet_video_thumb/D3ZUXNw...,"[2, 5, 0]",https://twitter.com/user/status/11141707344720...,@svdate @gtconway3d I would just say hes Donny...,"[Sexist, OtherHate, NotHate]",<tag><tag> I would just say hes Donny the retard,LATE MOGIF LATE MOTIV,0.666667,1,train
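
When either CSV is read back with pandas, list-valued columns such as `labels` come back as plain strings (for example `"[4, 1, 3]"`), because CSV has no list type. The following is a minimal sketch of restoring `labels` while reading. It uses a small in-memory sample with only a few of the columns in place of `MMHS150K/MMHS150K.csv`, and the variable names are illustrative.

```python
import ast
import io

import pandas as pd

# Two sample rows standing in for MMHS150K/MMHS150K.csv (only a few columns kept)
sample_csv = io.StringIO(
    'index,labels,hate_speech,binary_hate,split\n'
    '1114679353714016256,"[4, 1, 3]",1.0,1,train\n'
    '1114558534635618305,"[1, 0, 0]",0.333333,0,train\n'
)

# The converter parses each stringified list back into a Python list while reading
df = pd.read_csv(sample_csv, converters={'labels': ast.literal_eval})

print(type(df.loc[0, 'labels']))
print(df.loc[0, 'labels'])
```

To use this on the real file, pass the CSV path to `pd.read_csv` instead of `sample_csv`.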
