# Reformatting the data from the MMHS150K dataset

This notebook reformats the data from the MMHS150K dataset. The original dataset is distributed as a JSON file (`MMHS150K_GT.json`), with the text extracted from the images and the split ids stored in separate files. Here everything is merged into a single CSV file for easier access and manipulation. The resulting CSV contains the following columns:
- `index`: The unique identifier for each tweet.
- `img_url`: The URL of the image associated with the tweet.
- `labels`: The labels assigned to the tweet.
- `tweet_url`: The URL of the tweet.
- `tweet_text`: The text content of the tweet.
- `labels_str`: The labels assigned to the tweet as a string.
- `tweet_text_clean`: The cleaned version of the tweet text. (text before the first `https://t.co/` link, mentions replaced with `<tag>`, empty text replaced with `<empty>`)
- `img_text`: The text extracted from the image, if available. (otherwise NaN)
- `text_in_image`: Indicates whether the image contains text or not.
- `hate_speech`: The level of hate speech in the tweet, computed as the fraction of its labels that are non-zero. (from 0 to 1)
- `binary_hate`: A binary label indicating whether the tweet contains hate speech or not. (1 if `hate_speech` is at least 0.5, otherwise 0)
- `split`: The split of the dataset (train, test, or val).
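
The relation between `labels`, `hate_speech` and `binary_hate` can be illustrated with a minimal sketch. It assumes each entry of `labels` is an integer where 0 means "not hateful"; the helper names `hate_score` and `binarize` and the example labels are hypothetical, used only for illustration.

```python
def hate_score(labels):
    """Fraction of labels that are non-zero (i.e. flagged as hateful)."""
    return sum(1 for label in labels if label != 0) / len(labels)


def binarize(score, threshold=0.5):
    """Binary label: 1 if the score reaches the threshold, else 0."""
    return 1 if score >= threshold else 0


example_labels = [0, 3, 1]  # made-up labels for one tweet
score = hate_score(example_labels)
print(score, binarize(score))
```

With two non-zero labels out of three, the score is 2/3 and the binary label is 1.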

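The notebook expects the following directory structure. Note that the code builds its paths from `ROOT_DIR` (the parent of the working directory) and looks for the dataset under `data/MMHS150K/`: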
```
.
├── MMHS150K
│   ├── MMHS150K_GT.json
│   ├── img_resized
│   │   ├── 1114679353714016256.jpg
│   │   ├── ...
│   │   └── 1110368198786846720.jpg
│   ├── img_txt
│   │   ├── 1114679353714016256.json
│   │   ├── ...
│   │   └── 1110368198786846720.json
│   └── splits
│       ├── train_ids.txt
│       ├── test_ids.txt
│       └── val_ids.txt
└── reformat_data.ipynb
```
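
Before running the cells, it can help to check that the expected files are in place. The following is only a sketch: the helper `missing_paths` and the location `"MMHS150K"` passed to it are hypothetical and should be adapted to where the dataset was extracted.

```python
import os


def missing_paths(dataset_dir):
    """Return the expected files and folders that are absent from dataset_dir."""
    expected = [
        "MMHS150K_GT.json",
        "img_resized",
        "img_txt",
        os.path.join("splits", "train_ids.txt"),
        os.path.join("splits", "test_ids.txt"),
        os.path.join("splits", "val_ids.txt"),
    ]
    return [p for p in expected if not os.path.exists(os.path.join(dataset_dir, p))]


# Hypothetical location; an empty list means nothing is missing
print(missing_paths("MMHS150K"))
```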

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
import json
import tqdm
import re
import torch
import torch.nn as nn

ROOT_DIR = os.path.dirname(os.getcwd()) + '/'

In [None]:

## Will load the text-side of the dataset, put it in a pandas dataframe and save it in a csv file
# for ease of use in the future

# Path to the ground-truth JSON file (ROOT_DIR already ends with '/')
DATA_FOLDER = ROOT_DIR+'data/MMHS150K/MMHS150K_GT.json'
# Folder with the img_txt
IMG_TEXT_FOLDER = ROOT_DIR+'data/MMHS150K/img_txt/'
# Splits folder
SPLITS_FOLDER = ROOT_DIR+'data/MMHS150K/splits/'

## Load data
data = pd.read_json(DATA_FOLDER, orient='index', convert_dates=False, convert_axes=False)
data = data.reset_index(drop=False)
data['index'] = data['index'].astype('int64')


## Clean the tweet text
# Keep only the text before https://t.co/
data['tweet_text_clean'] = data['tweet_text'].str.split('https://t.co/').str[0]
# Replace any occurrence of @user with <tag>, keeping the character before the mention (\1)
regex_tag = r'(^|[^@\w])@(\w{1,15})\b'
data['tweet_text_clean'] = data['tweet_text_clean'].str.replace(regex_tag, r'\1<tag>', regex=True)
# Replace NaN and '' with the string <empty>
# (assigning to the column directly; chained indexing would only modify a copy)
data['tweet_text_clean'] = data['tweet_text_clean'].fillna('<empty>').replace('', '<empty>')

## Add the text of the image if it exists
# Names of the files (one JSON file per image in which text was found)
files = os.listdir(IMG_TEXT_FOLDER)
# Map each tweet index to the text extracted from its image
img_texts = {}
for file in files:
    index = int(file.split('.')[0])

    # Open the file (json)
    with open(IMG_TEXT_FOLDER + file) as f:
        img_texts[index] = json.load(f)["img_text"]

# Add the text to the dataset; tweets without a file get NaN
# (a single map is much faster than one boolean-mask assignment per file)
data['img_text'] = data['index'].map(img_texts)
data['text_in_image'] = data['img_text'].notna()

## Add the hate_speech label
# hate_speech is the fraction of non-zero (hateful) labels; binary_hate thresholds it at 0.5
data['hate_speech'] = data.apply(lambda x: np.mean([0 if i == 0 else 1 for i in x['labels']]), axis=1)
data['binary_hate'] = data['hate_speech'].apply(lambda x: 1 if x >= 0.5 else 0)


## Add the split
# Load the splits
train = pd.read_csv(SPLITS_FOLDER + 'train_ids.txt', header=None)
test = pd.read_csv(SPLITS_FOLDER + 'test_ids.txt', header=None)
val = pd.read_csv(SPLITS_FOLDER + 'val_ids.txt', header=None)

# Every tweet defaults to 'train'; indices listed in the test and val files overwrite that default
data['split'] = 'train'
data.loc[data['index'].isin(test[0]), 'split'] = 'test'
data.loc[data['index'].isin(val[0]), 'split'] = 'val'

display(data)



## Save the dataset
data.to_csv(os.path.join(ROOT_DIR, "data", "MMHS150K", "MMHS150K.csv"), index=False)
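
The text-cleaning step of the cell above can be tried in isolation on a few strings. This is a minimal sketch, assuming the character that precedes a mention (usually a space) should be kept; the function name `clean_tweet` and the sample tweets are made up for illustration.

```python
import re

# Same pattern as in the cell above: an @handle not preceded by a word character or '@'
MENTION_RE = re.compile(r'(^|[^@\w])@(\w{1,15})\b')


def clean_tweet(text):
    """Drop everything from the first t.co link onward and mask mentions."""
    text = text.split('https://t.co/')[0]
    # \1 keeps the character that preceded the mention (e.g. a space)
    text = MENTION_RE.sub(r'\1<tag>', text)
    return text if text != '' else '<empty>'


print(repr(clean_tweet('hello @someone look https://t.co/abc123')))
print(repr(clean_tweet('https://t.co/abc123')))
```

Because the pattern requires a non-word character (or the start of the string) before `@`, strings such as e-mail addresses are left untouched.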

A second CSV is created, containing only the tweets whose image contains text. It has the same columns as the first CSV, except for `text_in_image`, which is dropped because it would be true for every row.

In [None]:
# Second version of the dataset with only tweets which have a text in the image
data2 = data[data['text_in_image']]

# Remove text_in_image column
data2 = data2.drop(columns=['text_in_image'])

display(data2)

# Save the dataset
data2.to_csv(os.path.join(ROOT_DIR, "data", "MMHS150K", "MMHS150K_with_img_text.csv"), index=False)
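
When one of the saved CSV files is read back, list-valued columns such as `labels` come back as strings (for example `"[0, 1, 1]"`). The sketch below shows one possible way to restore them with `ast.literal_eval`; it uses a tiny made-up frame and an in-memory buffer instead of the real file, so the values are purely illustrative.

```python
import ast
import io

import pandas as pd

# Tiny made-up frame standing in for the real data
df = pd.DataFrame({
    "index": [1, 2],
    "labels": [[0, 1, 1], [0, 0, 0]],
    "split": ["train", "val"],
})

# Round-trip through CSV (an in-memory buffer replaces the file on disk)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# converters parses the stringified lists back into Python lists
loaded = pd.read_csv(buffer, converters={"labels": ast.literal_eval})
train = loaded[loaded["split"] == "train"]
print(type(loaded.loc[0, "labels"]), len(train))
```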

In [None]:
# Display the row at position 372 of data2 (i.e. the 373rd row)
display(data2.iloc[372])