In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)

# Preliminary analysis
## First dataset: df_duplicated_with_path

Before running this notebook, run the script "download_dataset.sh".

In [None]:
with open('dataset/raw/df_duplicated_with_path.pkl', 'rb') as file:
    df1 = pd.compat.pickle_compat.load(file) 

If a tweet has multiple associated images, the dataset contains one entry per image.

In [None]:
df1.shape

In [None]:
df1[:3]

In [None]:
df1.columns

## Second dataset: df_no_duplicated_with_path2

In [None]:
with open('dataset/raw/df_no_duplicated_with_path2.pkl', 'rb') as file:
    df2 = pd.compat.pickle_compat.load(file) 

Same as dataset 1, but with a single entry per tweet that holds all of its image paths.
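
The relationship between the two layouts can be sketched on toy data. This is only an illustration under assumptions: the rows below are made up, and it assumes that dataset 1 stores a single path per row in "path_photos", which has not been verified against the real file.

```python
import pandas as pd

# Toy stand-in for dataset 1: one row per (tweet, image) pair.
# Column names mirror the real data, but the values are made up.
per_image = pd.DataFrame({
    "id": [1, 1, 2],
    "tweet": ["two pics", "two pics", "one pic"],
    "path_photos": ["img/1_0.jpg", "img/1_1.jpg", "img/2_0.jpg"],
})

# Collapse to one row per tweet, gathering the image paths into a list.
per_tweet = (
    per_image.groupby("id", as_index=False)
    .agg(tweet=("tweet", "first"), path_photos=("path_photos", list))
)
print(per_tweet)
```

Grouping by "id" keeps the tweet text once and turns the image paths into a list, which is the shape described for dataset 2.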

In [None]:
df2.shape

In [None]:
df2[:3]

In [None]:
df2.columns

## Third dataset: merged_df_with_gold_freq1

In [None]:
with open('dataset/raw/merged_df_with_gold_freq1.pkl', 'rb') as file:
    df3 = pd.compat.pickle_compat.load(file) 

In [None]:
df3.shape

This dataset adds gold labels to dataset 2. Columns named T_x hold the text-only gold labels and contain the score for emotion x; T_gold_multi_label lists the emotions for which the entry has a non-zero score. Columns named M_x are the same, but for the multimodal gold labels. Gold labels are those assigned manually, and they are available for 900 entries of the dataset.
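
As a rough illustration of how a multi-label field relates to the per-emotion scores, the sketch below rebuilds such a list from toy score columns. The emotion names other than Anger, the scores, and the helper `nonzero_labels` are assumptions made for the example; how the dataset authors actually produced the field is not known from the data.

```python
import pandas as pd

# Illustrative subset of emotions, not the dataset's full label set.
emotions = ["Anger", "Joy", "Sadness"]

toy = pd.DataFrame({
    "T_Anger": [2, 0],
    "T_Joy": [0, 3],
    "T_Sadness": [1, 0],
})

def nonzero_labels(row, prefix="T_"):
    """Return the emotions whose score in this row is non-zero."""
    return [e for e in emotions if row[f"{prefix}{e}"] != 0]

# One list of emotion names per row.
toy["T_gold_multi_label"] = toy.apply(nonzero_labels, axis=1)
print(toy)
```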

In [None]:
df3[:3]

In [None]:
df3[df3["M_gold_multi_label"].notnull()][:4].filter(regex="^M_")

In [None]:
df3[df3["M_gold_multi_label"].notnull()].shape

In [None]:
df3.columns

## Fourth dataset: merged_df_with_gold

In [None]:
with open('dataset/raw/merged_df_with_gold.pkl', 'rb') as file:
    df4 = pd.compat.pickle_compat.load(file) 

The format of this dataset is the same as the previous one; only the contents of the gold multi-label fields differ.

In [None]:
df4.shape

In [None]:
df4[:3]

In [None]:
df4.columns

This dataset differs from the previous one only in the gold multi-label fields. Whereas dataset 3 lists in those fields every label with a non-zero score, this one lists a label only if its score is higher than 2. In the comparison below, NaN values mean that the fields are equal. For both the multimodal and the text-only labels, around 100 entries (not necessarily the same ones) have identical labels in the two datasets.
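
The NaN convention of `DataFrame.compare` can be seen on a toy example. This is a minimal sketch with made-up frames `a` and `b`, not the real datasets: rows that are identical are dropped, and within the remaining rows a cell is NaN when the two frames agree on it.

```python
import pandas as pd

a = pd.DataFrame({"label": ["x", "y", "z"], "score": [1, 2, 3]})
b = pd.DataFrame({"label": ["x", "q", "z"], "score": [1, 2, 4]})

diff = a.compare(b)
# Row 0 is identical in both frames, so it does not appear.
# Row 1 differs only in "label": its "score" cells are NaN.
# Row 2 differs only in "score": its "label" cells are NaN.
print(diff)

# Counting NaN per column gives the number of differing rows
# in which that particular column is nevertheless equal.
equal_counts = diff.isnull().sum()
print(equal_counts)
```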

In [None]:
print(df3.compare(df4).isnull().sum())
df3.compare(df4)

Given all this, both the third and fourth datasets are suitable for further analysis, but it is also helpful to merge the two by keeping both sets of label columns. The label columns are also renamed for future convenience.

In [None]:
df3["label_M_gold_main"] = df4["M_gold_multi_label"]
df3["label_M_gold_multi"] = df3["M_gold_multi_label"]

df3["label_T_gold_main"] = df4["T_gold_multi_label"]
df3["label_T_gold_multi"] = df3["T_gold_multi_label"]

df3 = df3.drop(columns = ["M_gold_multi_label", "T_gold_multi_label"])

# Cleaning up the dataset
Some of the columns are not useful for the task so they can be dropped. First of all, any field relating to the user who posted the tweet can be dropped. There are also many fields which have a single value across all entries, some of them being simply null.<br>
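
One way to spot single-valued columns is to count distinct values per column. The sketch below uses a made-up frame whose column names are borrowed from the dataset; it is an illustration, not the procedure that was used to pick the columns dropped here. Note that `nunique` needs hashable values, so columns holding lists would have to be excluded or converted first.

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet": ["a", "b", "c"],
    "timezone": ["+0000", "+0000", "+0000"],  # same value everywhere
    "near": [None, None, None],                # entirely null
})

# nunique(dropna=False) counts null as a value, so an all-null column
# and a constant column both report exactly one distinct value.
distinct = toy.nunique(dropna=False)
constant_cols = distinct[distinct <= 1].index.tolist()
print(constant_cols)
```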

In [None]:
df3 = df3.drop(columns=["name", "user_id", "user_id_str", "user_rt", "user_rt_id", "username",
                    "video", "translate", "trans_dest", "trans_src", "timezone", "geo", "hour", "day", "near",
                    "created_at", "retweet", "retweet_date", "retweet_id", "reply_to", "source", "place"])

There are some other fields that can be dropped as they are not useful.<br>
<ul>
    <li>"quote_url" contains, if present, the link to the tweet being replied to.</li>
    <li>"urls" contains any links present in the text of the tweet.</li>
    <li>"thumbnail" is the link to the picture used as the thumbnail of the tweet and is a replica of one of the images linked in "photos".</li>
    <li>"photos" can also be dropped as it only contains the links to the images in the tweet, which are already stored locally.</li>
    <li>"link" simply contains the link to the tweet so it can be dropped as well.</li>
    <li>The "cashtags" and "hashtags" fields are redundant as that text is already present in the text of the tweet.</li>
    <li>"date", as the name implies, contains the timestamp of the tweet.</li>
</ul>
   

In [None]:
df3 = df3.drop(columns=["quote_url", "urls", "thumbnail", "photos", "link", "cashtags", "hashtags", "date"])

"conversation_id" contains some kind of ID which is in some cases different from the ID of the tweet. Nonetheless, it seems to be useless in our case so it can be dropped

In [None]:
print(df3["conversation_id"].compare(df3["id"]))
df3 = df3.drop(columns=["conversation_id"])

"path_photos" contains the local paths to the images, but the file names are simply the ID of the tweet with a number appended to the end, so it is sufficient to store the number of pictures.

In [None]:
print(df3.loc[0, "path_photos"])
print(df3.loc[0, "id"])

In [None]:
df3["img_count"] = df3["path_photos"].apply(len)
df3 = df3.drop(columns=["path_photos"])

The "language" column is mostly useless, as only 8 rows have a value other than "en". Additionally, only 2 of those marked "fr" are actually in French; the others are still in English. None of them have gold labels either.

In [None]:
print(df3.loc[df3["language"]!="en", ["tweet", "label_M_gold_multi"]])
df3 = df3.drop(columns = ["language"])

By checking for null values we can see that "old_label" is missing in the vast majority of the dataset so it would not be particularly useful. It is also unclear what it represents.

In [None]:
print(df3.isnull().sum())
df3 = df3.drop(columns=["old_label"])

Some of the remaining columns are not directly useful for our task, but might be interesting for some kind of analysis, for example relating emotions with said fields. They are the following:
<ul>
    <li>nlikes</li>
    <li>nreplies</li>
    <li>nretweets</li>
    <li>search</li>
    <li>seeds</li>
</ul>

The following columns are now left:
<ul>
    <li>id: id of the tweet.</li>
    <li>tweet: text of the tweet.</li>
    <li>nlikes: number of likes.</li>
    <li>nreplies: number of replies.</li>
    <li>nretweets: number of retweets.</li>
    <li>search: the search used to retrieve the tweet.</li>
    <li>seeds: the words that retrieved the tweet separated by emotion with a score of how "strongly" it embodies that emotion.</li>
    <li>uni_label: the emotion with the highest score in seeds.</li>
    <li>multi_label: all the emotions in seeds.</li>
    <li>M_x: gold multimodal labels, one for each emotion. Contains the score for that emotion.</li>
    <li>T_x: gold text-only labels, one for each emotion. Contains the score for that emotion.</li>
    <li>label_M_gold_main: list of emotions with a multimodal score of at least 2.</li>
    <li>label_M_gold_multi: list of emotions with a multimodal score of at least 1.</li>
    <li>label_T_gold_main: list of emotions with a text only score of at least 2.</li>
    <li>label_T_gold_multi: list of emotions with a text only score of at least 1.</li>
    <li>img_count: number of images of the tweet.</li>
</ul>

## Gold and silver label split

Now that the dataset is cleaned up, it is useful to separate the data with gold labels from the data with only silver labels.

In [None]:
gold_df = df3[df3["M_Anger"].notnull()].copy().reset_index(drop=True)
print(gold_df.shape)

For the gold dataset it might also be useful to modify the scores for the gold labels so that they sum to 1 over a single row, both for multimodal and text only labels. (Implemented but not actually used currently)
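
The normalization can be tried on toy scores first. The sketch below uses made-up values and illustrative emotion names; it also shows one possible convention for rows whose scores are all zero, which would otherwise become NaN after the division. Whether such rows occur in the gold data has not been checked here.

```python
import pandas as pd

toy = pd.DataFrame({
    "M_Anger": [2.0, 0.0],
    "M_Joy": [1.0, 0.0],
    "M_Sadness": [1.0, 0.0],
})

cols = toy.columns[toy.columns.str.startswith("M_")]
row_sums = toy[cols].sum(axis=1)

# Divide each row by its sum. An all-zero row gives 0/0 = NaN,
# which is replaced by 0 here (an arbitrary choice for this sketch).
normalized = toy[cols].div(row_sums, axis=0).fillna(0.0)
print(normalized)
```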

In [None]:
# currently not enabled


# cols = gold_df.columns[gold_df.columns.str.startswith("M_")]
# gold_df[cols] = gold_df[cols].div(gold_df[cols].sum(axis=1), axis=0)

# cols = gold_df.columns[gold_df.columns.str.startswith("T_")]
# gold_df[cols] = gold_df[cols].div(gold_df[cols].sum(axis=1), axis=0)

Converting gold labels to lowercase for consistency with silver labels.

In [None]:
cols = gold_df.columns[gold_df.columns.str.startswith("label_")]
for column in cols:
    gold_df[column] = gold_df[column].apply(lambda x : [y.lower() for y in x])

In [None]:
gold_df[:3]

Extract images in the gold label dataset from the zip file containing all images.
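
`ZipFile.extract` raises a `KeyError` when the requested member is not in the archive, so it can be worth checking membership first. The sketch below demonstrates this on a small in-memory archive standing in for images.zip; the file names and the `missing` list are made up for the example.

```python
import io
import tempfile
from pathlib import Path
from zipfile import ZipFile

# Build a tiny in-memory archive as a stand-in for images.zip.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("twint_images3/1_0.jpg", b"fake image bytes")

# The second file is deliberately absent from the archive.
wanted = ["twint_images3/1_0.jpg", "twint_images3/1_1.jpg"]
missing = []

with tempfile.TemporaryDirectory() as out_dir, ZipFile(buffer) as zf:
    members = set(zf.namelist())
    for name in wanted:
        if name not in members:
            missing.append(name)  # record it instead of raising KeyError
            continue
        if not (Path(out_dir) / name).is_file():
            zf.extract(name, out_dir)
    extracted = sorted(p.name for p in Path(out_dir).rglob("*.jpg"))

print(extracted, missing)
```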

In [None]:
from zipfile import ZipFile
import os
from pathlib import Path

gold_dir = "dataset/gold_images"

with ZipFile("dataset/raw/images.zip") as zfile:
    for _, row in gold_df.iterrows():
        # Image files are named "<tweet id>_<n>.jpg", with n starting at 0
        for n in range(row["img_count"]):
            file_name = f"{row['id']}_{n}.jpg"
            # Skip images that were already extracted on a previous run
            if not os.path.isfile(f"{gold_dir}/twint_images3/{file_name}"):
                zfile.extract(f"twint_images3/{file_name}", gold_dir)

In [None]:
silver_df = df3[df3["M_Anger"].isnull()].reset_index(drop=True)
print(silver_df.shape)
silver_df[:3]

Extract the images of the silver label dataset from the zip file containing all images (currently commented out).

In [None]:
# from zipfile import ZipFile
# import os
# from pathlib import Path

# silver_dir = "dataset/silver_images"

# with ZipFile("dataset/raw/images.zip") as zfile:
#     for i, row in silver_df.iterrows():
#         for n in range(0, row["img_count"]):
#             file_name = f"{row['id']}_{n}.jpg"
#             if not os.path.isfile(f"{silver_dir}/twint_images3/{file_name}"):
#                 zfile.extract(f"twint_images3/{file_name}", silver_dir)

We can now save the datasets to file: the gold label dataset as a pickle and the silver label dataset as a CSV.
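
One thing to keep in mind with CSV is that list-valued cells are written as their string representation and come back as plain strings. The sketch below shows a round trip on a made-up frame and restores the lists with `ast.literal_eval`; it assumes that a column such as "multi_label" holds Python lists, which should be checked on the real data.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2],
    "multi_label": [["anger", "joy"], ["sadness"]],
})

# Write to an in-memory buffer instead of a file for the example.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

raw = pd.read_csv(buffer)
# The lists were stored as text such as "['anger', 'joy']".
is_string = isinstance(raw.loc[0, "multi_label"], str)

# Parse the text back into Python lists.
raw["multi_label"] = raw["multi_label"].apply(ast.literal_eval)
print(raw)
```

Also note that `to_csv` writes the index as an extra unnamed column unless `index=False` is passed.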

In [None]:
gold_df.to_pickle("dataset/gold_label_dataset.pkl")
silver_df.to_csv("dataset/silver_label_dataset.csv")