# "Battle plan":

**Personal notes on what needs/should be done.**

- Correct dtypes in both dataframes
- Merge dataframes together

- Filter out blatant errors
    - How to deal with "label pending" / "uncategorised"?

- First explore the ownership dataframe thoroughly. Do we care that some values are missing? A missing value is also a category.

- We definitely don't include the articles without text

- Try to find which images have a human face - use a pretrained model for that
- Get the semantic analysis for the text

- Try to predict the "classification" which they have based on the image input data and the rest of the data

Relevant articles:

https://ieeexplore.ieee.org/document/8715409

https://arxiv.org/abs/1911.08670        https://github.com/haamoon/mmtm    MMTM model

https://ieeexplore.ieee.org/abstract/document/8928538   A Survey on Canonical Correlation Analysis

https://ieeexplore.ieee.org/document/9099281   Multi-Modal Stacked Denoising Autoencoder for Handling Missing Data in Healthcare Big Data

https://arxiv.org/abs/1504.06063    Multimodal Convolutional Neural Networks for Matching Image and Sentence

https://arxiv.org/abs/1904.02874    An Attentive Survey of Attention Models

https://arxiv.org/abs/2003.08897    Normalized and geometry-aware self-attention network for image captioning

https://arxiv.org/abs/1409.0473     Neural Machine Translation by Jointly Learning to Align and Translate

https://link.springer.com/article/10.1007/s10489-020-01801-5   Transfer learning based hybrid 2D-3D CNN for traffic sign recognition and semantic road detection applied in advanced driver assistance systems

https://arxiv.org/abs/1904.01475    Good News, Everyone! Context driven entity-aware captioning for news images   ***

https://arxiv.org/abs/1903.06275    Show, Translate and Tell   ***

https://arxiv.org/abs/1911.12377    Multimodal Attention Networks for Low-Level Vision-and-Language Navigation

# Name

The topic of the thesis could be "multimodal learning for political classification of newsletter texts".

# Imports and introduction

This data corresponds to the date: 08.11.2022

Day before American midterm elections.

## Loading

In [None]:
# Import libraries

import re
import json
import os
import pickle
import pandas as pd

from ydata_profiling import ProfileReport

In [None]:
# JSON exports are normally UTF-8; set the encoding explicitly so the platform default is not used
with open('exported_articles.json', encoding='utf-8') as json_file:
    data_original = json.load(json_file)

In [None]:
json_keys = data_original[0].keys()
json_keys

In [None]:
data_original[0]['ownership'].keys()

In [None]:
data_original[3]

In [None]:
max(data_original, key=lambda x: len(x['images']))

In [None]:
data_original[144696]

In [None]:
# Could be done with pd.json_normalize, but I wanted to keep the ownerships separate

ownerships = [None] * len(data_original)
missing_ownerships = []

for i, record in enumerate(data_original):
    try:
        ownerships[i] = record['ownership']
    except KeyError:
        # Record has no ownership information; remember it for removal below
        print('Error at index', i)
        missing_ownerships.append(i)
    #ownerships[i]['index'] = i
    # Keep every field except the nested ownership dict
    data_original[i] = {k: record[k] for k in json_keys - {'ownership'}}
    #data_original[i]['index'] = i
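
A minimal sketch of the same split without mutating the input, run on made-up records rather than the real export. The helper name `split_ownership` and the toy field values are assumptions for illustration; only the `ownership` key comes from the data above.

```python
import pandas as pd


def split_ownership(records):
    """Split each record into article fields and its nested 'ownership' dict.

    Records without an 'ownership' key are skipped and their positions returned.
    """
    articles, owners, missing = [], [], []
    for position, record in enumerate(records):
        if "ownership" not in record:
            missing.append(position)
            continue
        article = dict(record)  # shallow copy, the input stays untouched
        owners.append(article.pop("ownership"))
        articles.append(article)
    return articles, owners, missing


# Toy records (hypothetical values, not taken from exported_articles.json)
toy_records = [
    {"id": 1, "title": "A", "ownership": {"name": "Outlet A", "country": "US"}},
    {"id": 2, "title": "B"},
    {"id": 3, "title": "C", "ownership": {"name": "Outlet C", "country": "UK"}},
]

toy_articles, toy_owners, toy_missing = split_ownership(toy_records)
toy_df_articles = pd.DataFrame(toy_articles)
toy_df_owner = pd.DataFrame(toy_owners)
print(toy_missing)
```

Because incomplete records are skipped while building the lists, both lists stay the same length and no second removal pass is needed.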

In [None]:
print('Missing ownerships:', len(missing_ownerships))

In [None]:
for i in reversed(missing_ownerships):
    print('Removing index', i)
    ownerships.pop(i)
    data_original.pop(i)

df_owner = pd.DataFrame(ownerships)
df_owner.head()

In [None]:
len(data_original)

In [None]:
list(df_owner.values)[2:4]

In [None]:
list(df_owner.values)[-5:]

In [None]:
df_owner.keys()

In [None]:
df_owner

# Basic processing

### df_original

This is the dataframe that contains information about the text and title of the articles. It also contains the "classification" which will serve as the target variable.

In [None]:
df_original = pd.DataFrame(data_original)
df_original.head()

In [None]:
df_original.dtypes

TODO: Do we care about the time variable?
Is it really necessary for the prediction?

Should we filter out the empty time? Or is it fine as is?
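
One way to start answering this is to parse the column and count what fails to parse. The sketch below uses made-up timestamps; the real format of `time` is not known here, so the example strings are assumptions.

```python
import pandas as pd

# Hypothetical timestamp strings; the actual format of the "time" column may differ.
toy_time = pd.Series(
    ["2022-11-08 10:00:00", "2022-11-07 09:30:00", "", None, "not a date"]
)

# errors="coerce" turns empty or unparseable entries into NaT instead of raising.
parsed_time = pd.to_datetime(toy_time, errors="coerce")

n_missing = int(parsed_time.isna().sum())
print(f"{n_missing} of {len(parsed_time)} timestamps are missing or unparseable")
```

If the share of missing timestamps is small, dropping those rows is cheap; if it is large, the column is probably better left out of the model.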

In [None]:
df_original.isnull().sum()

In [None]:
df_original[df_original['text'].str.len() == 0]

In [None]:
df_original[df_original['title'].str.len() == 0]

In [None]:
df_original['images'].apply(lambda x: len(x)).value_counts()

In [None]:
df_original[df_original['images'].apply(lambda x: len(x)) == 13]

Remove all the rows that have no text or no title (we have to do it for both dataframes).

In [None]:
# Rows with an empty text or title; selected on df_original, dropped from both frames
empty_rows = df_original[(df_original['text'].str.len() == 0) | (df_original['title'].str.len() == 0)].index
df_owner.drop(empty_rows, inplace=True)
df_original.drop(empty_rows, inplace=True)
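
Because `df_owner` has no text column, the rows to drop have to be selected on `df_original` and then removed from both frames by index label. A small sketch on toy data (the frame contents are invented), which only works while both frames share the same index:

```python
import pandas as pd

toy_articles = pd.DataFrame(
    {"title": ["A", "", "C", "D"], "text": ["some text", "more text", "", "body"]}
)
toy_owners = pd.DataFrame({"name": ["X", "Y", "Z", "W"]})

# Select the offending rows once, on the frame that actually holds the text.
empty_mask = (toy_articles["text"].str.len() == 0) | (toy_articles["title"].str.len() == 0)
empty_index = toy_articles.index[empty_mask]

# Drop the same index labels from both frames so they stay aligned.
toy_articles = toy_articles.drop(index=empty_index)
toy_owners = toy_owners.drop(index=empty_index)

print(toy_articles.index.equals(toy_owners.index))
```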

In [None]:
df_original

### Correct the dtypes of the columns.

Now let's look at the classification labels.

These labels come from the overall definition of the publication source, not the article itself.

In [None]:
print(len(df_original["classification"].unique()))
df_original["classification"].unique()

In [None]:
df_original["classification"].value_counts()

In [None]:
#df_original["time"] = pd.to_datetime(df_original["time"])
df_original["text"] = df_original["text"].astype('string')
df_original["title"] = df_original["title"].astype('string')
df_original["classification"] = df_original["classification"].astype('category')
df_original["classification_by_editorial"] = df_original["classification_by_editorial"].astype('category')

In [None]:
df_original.dtypes

In [None]:
df_original["classification_by_editorial"].value_counts()

We could also delete the "image_url" column: it will not really help our predictions, since it essentially identifies the author. If we want to use it, we could keep only the domain name.

In [None]:
#df_original.drop(columns=["image_url"], inplace=True) # ID has to stay in since it's the image identifier
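
If the domain turns out to be useful, it can be extracted with the standard library. A minimal sketch, assuming the column holds full URLs as strings; the URLs and the helper name `extract_domain` below are hypothetical.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical URLs for illustration only.
toy_urls = pd.Series(
    [
        "https://www.example.com/media/photo1.jpg",
        "http://cdn.example.org/img/2.png",
        None,
    ]
)


def extract_domain(url):
    """Return the host part of a URL, or None for missing or malformed values."""
    if not isinstance(url, str):
        return None
    host = urlparse(url).hostname
    return host if host else None


toy_domains = toy_urls.map(extract_domain)
print(toy_domains.tolist())
```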

This completes the basic processing of the df_original dataframe, except for the "text" and "title" columns. We will handle those later.

## df_owner

Now we will deal with the preprocessing of the dataframe that corresponds to the information known about the owner/creator of the article.

In [None]:
df_owner

In [None]:
df_owner.dtypes

In [None]:
df_owner.isnull().sum()

# Skip to later part

In [None]:
df_owner[df_owner["id"].isna()]

# Question - are any of the names in the id.isna rows the same as the names in the id.notna rows?

# df_owner[df_owner["id"].notna()]

In [None]:
names = df_owner[df_owner["id"].isna()]["name"].unique()
len(names)

In [None]:
df_owner[df_owner["name"].isin(names) & df_owner["id"].notna()]

This shows that every row with missing values can be reconstructed by grouping on any column. The values are therefore not missing at random; they are simply absent from some rows of an owner whose other rows carry them.

In [None]:
len(df_owner.groupby(["name"], dropna=False).value_counts())

In [None]:
len(df_owner.groupby(["detail"], dropna=False).value_counts())

In [None]:
len(df_owner.groupby(["address"], dropna=False).value_counts())

In [None]:
len(df_owner.groupby(["info_url"], dropna=False).value_counts())

In [None]:
#grouped_owners = df_owner.groupby(["name", "url", "label", "detail"], dropna=False).value_counts(dropna=False)
grouped_owners = df_owner.groupby("name", dropna=False).value_counts(dropna=False)
grouped_owners

In [None]:
len(grouped_owners)

In [None]:
len(df_owner["name"].unique())

In [None]:
# Name the counts column explicitly; its default name differs between pandas versions
grouped_owners = grouped_owners.reset_index(name="count")
grouped_owners.sort_values(by="count", ascending=False, inplace=True)
grouped_owners

In [None]:
df_original[df_owner["name"] == "National Review"]

In [None]:
grouped_owners.isna().sum()

In [None]:
grouped_owners[grouped_owners["url"] == "www.stanforddaily.com"]

Now we can reconstruct the missing values in the df_owner dataframe from the grouped_owners dataframe, since every missing value can be recovered from there.

In [None]:
#(df_owner.set_index('name').combine_first(grouped_owners.set_index('name')).reset_index()).isna().sum()
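
A minimal sketch of one way to do this fill, assuming `name` is always present and identifies the owner. It runs on an invented toy table; the column names `country` and `label` appear in the data, the values do not.

```python
import numpy as np
import pandas as pd

# Toy owner table: some rows lost values that other rows of the same owner still have.
toy_owner = pd.DataFrame(
    {
        "name": ["Outlet A", "Outlet A", "Outlet B", "Outlet B"],
        "country": ["US", np.nan, np.nan, "UK"],
        "label": ["private", np.nan, "public", "public"],
    }
)

value_columns = ["country", "label"]

# For every row, look up the first non-missing value within the same name group.
first_known = toy_owner.groupby("name")[value_columns].transform("first")

# Fill only the gaps; values that are already present stay untouched.
toy_owner[value_columns] = toy_owner[value_columns].fillna(first_known)
print(toy_owner)
```

This keeps the row order and index of the frame, which matters here because df_owner has to stay aligned with df_original.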

In [None]:
df_owner

# Skip to here, experiments

Let's do some cleaning in the data:

- The column **ID** is irrelevant for machine learning, so we can omit it.
- I don't think we can get any further insight or feature from the **url** column, so we can drop it as well.
- **Label/detail** should be pre-processed further; more investigation is needed to see whether they contain useful information.
- **Address** could also be processed into the region of the country, but that would require a lot of effort, and **country** itself should be sufficient in most instances.
- **info_url** can be dropped as it doesn't contain any other relevant information (maybe we could parse the domain out of it, but I don't think that would help).
- **year** shouldn't correlate strongly with the predicted variable, but we will keep it for the time being to test that hypothesis.
- **country** can be really helpful for understanding the context of the text.

In [None]:
df_owner.drop(columns=["id", "url", "address", "info_url"], inplace=True)

In [None]:
df_owner.head()

Now let's combine df_owner and df_original.

In [None]:
df_combined = pd.concat([df_original, df_owner], axis=1)
df_combined.head()
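
`pd.concat(..., axis=1)` aligns rows by index label, not by position, so the combination is only correct while both frames carry the same index. A short sketch on invented toy frames shows what happens when they do not:

```python
import pandas as pd

toy_articles = pd.DataFrame({"title": ["A", "C"]}, index=[0, 2])
toy_owners_ok = pd.DataFrame({"name": ["X", "Z"]}, index=[0, 2])
toy_owners_shifted = pd.DataFrame({"name": ["X", "Z"]}, index=[0, 1])

# Aligned indices: one row per article, no gaps.
combined_ok = pd.concat([toy_articles, toy_owners_ok], axis=1)

# Misaligned indices: concat takes the union of labels and pads with NaN.
combined_bad = pd.concat([toy_articles, toy_owners_shifted], axis=1)

print(toy_articles.index.equals(toy_owners_ok.index))
print(len(combined_ok), len(combined_bad))
```

Checking `df_original.index.equals(df_owner.index)` before the concat is a cheap guard against this.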

In [None]:
df_combined.dtypes

In [None]:
df_combined.isna().sum()

In [None]:
df_combined["label"] = df_combined["label"].astype('category')
df_combined["detail"] = df_combined["detail"].astype('string')
df_combined["name"] = df_combined["name"].astype('category')
df_combined["country"] = df_combined["country"].astype('category')

In [None]:
df_combined.dtypes

# Removing articles without an image and non-English articles

Non-English articles:

In [None]:
df_combined['text'].str.contains(r'[^\x00-\x7F]').value_counts()

In [None]:
df_combined['title'].str.contains(r'[^\x00-\x7F]').value_counts()

In [None]:
# non_latin_rows = df_combined[(df_combined['text'].str.contains(r'[^\x00-\x7F]')) | (df_combined['title'].str.contains(r'[^\x00-\x7F]'))]

# df_combined = non_latin_rows
df_combined
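
The regex above flags any text that contains a single non-ASCII character, which also catches English articles with curly quotes or accented names. A rough sketch of a softer heuristic based on the share of non-ASCII characters follows. The threshold of 0.2 and both helper names are assumptions, and the heuristic will not catch languages written mostly in ASCII, such as Spanish or German.

```python
def non_ascii_share(text):
    """Fraction of characters outside the 7-bit ASCII range (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ord(ch) > 127 for ch in text) / len(text)


def looks_non_english(text, threshold=0.2):
    """Crude heuristic: flag text whose non-ASCII share exceeds the threshold."""
    return non_ascii_share(text) > threshold


samples = [
    "The senator’s café visit made headlines.",  # English with a curly quote and an accent
    "Выборы пройдут во вторник",  # Cyrillic sentence
]
flags = [looks_non_english(sample) for sample in samples]
print(flags)
```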

Articles without available images:

In [None]:


filename = "rows_to_remove.json"
# Create a list to store the IDs of rows to be removed
rows_to_remove = []

if not os.path.exists(filename):
    # List all files in the "./images" folder
    image_files = os.listdir("./images")

    # Iterate through the IDs in the "id" column
    for id_value in df_combined["id"]:
        # Check if there is no image file that begins with the ID
        if not any(file.startswith(str(id_value)) for file in image_files):
            rows_to_remove.append(id_value)

    with open(filename, "w") as file:
        json.dump(rows_to_remove, file)
else:
    with open(filename, "r") as file:
        rows_to_remove = json.load(file)

len(rows_to_remove)
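
Note that a plain `startswith` check also matches longer ids: the file `12_0.jpg` starts with `1`, so id 1 would count as having an image. The sketch below compares whole id tokens instead and uses a set so each id is a single lookup. It assumes files are named like `<id>.jpg` or `<id>_<n>.jpg`, which is a guess about the naming scheme, and the helper name is hypothetical.

```python
import re


def ids_without_image(ids, filenames):
    """Return the ids for which no filename has '<id>' as its leading token.

    Assumes files are named like '<id>.jpg' or '<id>_<n>.jpg' (hypothetical scheme).
    """
    # Collect the leading token of every filename once, so each id is a set lookup.
    prefixes = {re.split(r"[_.]", name, maxsplit=1)[0] for name in filenames}
    return [id_value for id_value in ids if str(id_value) not in prefixes]


toy_files = ["12_0.jpg", "12_1.jpg", "345.png"]
toy_ids = [12, 1, 345, 99]
print(ids_without_image(toy_ids, toy_files))
```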

In [None]:
df_combined = df_combined.drop(df_combined[df_combined['id'].isin(rows_to_remove)].index)
df_combined

In [None]:
#df_owner = df_owner[(df_original['text'].str.contains(r'[^\x00-\x7F]')) | (df_original['title'].str.contains(r'[^\x00-\x7F]'))]
#df_original = df_original[(df_original['text'].str.contains(r'[^\x00-\x7F]')) | (df_original['title'].str.contains(r'[^\x00-\x7F]'))]

df_owner = df_owner.drop(df_original[df_original['id'].isin(rows_to_remove)].index)
df_original = df_original.drop(df_original[df_original['id'].isin(rows_to_remove)].index)

In [None]:
df_owner

In [None]:
df_original

In [None]:
df_combined

# Analysis

In [None]:
"""
from autoviz import AutoViz_Class

AV = AutoViz_Class()

dft = AV.AutoViz(
    "",
    sep=",",
    depVar=["classification_by_editorial", "classification"],
    dfte=df_combined[df_combined.columns.difference(['images', 'id', 'time'])].dropna(),
    header=0,
    verbose=1,
    lowess=False,
    chart_format="html",
    max_rows_analyzed=15000000,
    max_cols_analyzed=30,
    save_plot_dir=None
)
"""
print("AutoViz is not working")

In [None]:
df_original.columns

In [None]:
#profile = ProfileReport(df_original[df_original.columns.difference(['images', 'id', 'text'])], title="Profiling Report")
#profile

In [None]:
df_owner.columns

In [None]:
#profile = ProfileReport(df_owner, title="Profiling Report")
#profile

In [None]:
df_combined.columns

In [None]:
#ProfileReport(df_combined[df_combined.columns.difference(['images', 'id', 'text', 'title'])], title="Profiling Report")

TODO: Add more analysis, the extra things from the beginning

### Ideas for extra features:

# Exploratory analysis notes/observations

## "Classification" feature from the dataset

The "classification" feature could serve as our primary "target".

However, there are still some open questions about this variable, mainly what "neutral" means. Arguably no news outlet can be considered neutral (unless it is strictly reporting, and perhaps not even then).

Are there guidelines for evaluating this variable? Can we trust that the labeling is consistent?
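
One quick consistency check is to cross-tabulate the two label columns in the data. The sketch below assumes that `classification` and `classification_by_editorial` use comparable label sets, which is not confirmed here; the label values are invented for illustration.

```python
import pandas as pd

# Invented labels, only to show the shape of the comparison.
toy_labels = pd.DataFrame(
    {
        "classification": ["left", "left", "neutral", "right", "neutral"],
        "classification_by_editorial": ["left", "neutral", "neutral", "right", "right"],
    }
)

# Rows: source-level label, columns: editorial label, cells: article counts.
agreement_table = pd.crosstab(
    toy_labels["classification"], toy_labels["classification_by_editorial"]
)

# Share of articles where both labels coincide.
agreement_rate = (
    toy_labels["classification"] == toy_labels["classification_by_editorial"]
).mean()

print(agreement_table)
print(f"agreement: {agreement_rate:.2f}")
```

Large off-diagonal counts, especially in the "neutral" row or column, would suggest that the two labelings follow different guidelines.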

# Involved preprocessing

## Text preprocessing

## Face detection in images

## Saving the final version of the data after preprocessing

In [None]:
df_combined.to_csv("df_combined.csv", index=False)
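
CSV stores everything as plain text, so the category and string dtypes set earlier and the list-valued `images` column do not survive a round trip. A small sketch on a toy frame (contents invented) compares CSV with pickle, which the notebook already imports. Pickle files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

toy_combined = pd.DataFrame(
    {
        "title": pd.Series(["A", "B"], dtype="string"),
        "classification": pd.Series(["left", "right"], dtype="category"),
        "images": [["1_0.jpg"], []],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_combined.csv")
    pkl_path = os.path.join(tmp_dir, "toy_combined.pkl")

    toy_combined.to_csv(csv_path, index=False)
    toy_combined.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

# CSV: the category dtype is gone and lists come back as strings.
print(from_csv["classification"].dtype, type(from_csv["images"].iloc[0]))
# Pickle: dtypes and Python objects are restored as they were.
print(from_pickle["classification"].dtype, type(from_pickle["images"].iloc[0]))
```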