## Structural analysis of synopsis (GPT2)

This notebook visualizes structural similarities among anime/manga titles based on the words used in their synopses. It uses GPT2 to vectorize each token of a synopsis, then averages all token vectors of a title's synopsis and uses that mean as the title's representative vector.
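The averaging step can be sketched on toy data without loading GPT2. This is a minimal NumPy illustration, not the notebook's own code: the function name `masked_mean` and the toy arrays are assumptions made for the example. It shows how an attention mask keeps padding positions out of the mean.

```python
import numpy as np

def masked_mean(token_vectors, attention_mask):
    """Average token vectors over the positions where attention_mask is 1.

    token_vectors: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask[..., np.newaxis].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)  # (batch, hidden)
    counts = mask.sum(axis=1)                    # (batch, 1)
    return summed / counts

# Toy input: one "synopsis" with three real tokens and one padding position.
token_vectors = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 1, 0]])

title_vector = masked_mean(token_vectors, attention_mask)
print(title_vector)
```

The padding vector `[100, 100]` does not affect the result, which is the mean of the first three token vectors only.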

Why GPT2? It follows the recommendation in the "Intended uses" section of the RoBERTa model card: https://huggingface.co/roberta-base

Quote: `tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at model like GPT2.`

#### ToDo:

- [x] Add an anime/manga flag to the meta labels
- [x] Try other BERT models
- [ ] Structure of table data with standard dim reduction (PCA/UMAP) and clustering

In [1]:
!pip install transformers
!pip install tensorboardX
!pip install tensorboard



In [2]:
import pandas as pd
import numpy as np
from tensorboardX import SummaryWriter
import tensorboard

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [3]:
from transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.to(device)

GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0): GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (1): GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwis

In [4]:
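def prep_label_array(df, excluded_idx, columns):
    """Build a metadata array: a header row, then one row of labels per title.

    excluded_idx holds row positions counted with the header row at index 0.
    """
    # Turn each selected column into an (n_rows, 1) array and join them side by side
    flag_list = [np.expand_dims(np.array(df[col]), axis=1) for col in columns]
    meta = np.concatenate(flag_list, axis=1)
    # Prepend the column names as the header row
    meta_header = np.insert(meta, 0, columns, axis=0)
    # Drop the rows of titles that could not be vectorized
    meta_header = np.delete(meta_header, excluded_idx, axis=0)
    return meta_header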
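The index handling in `prep_label_array` is easy to get wrong, so here is a small sketch of the same steps on a toy table. The DataFrame `df_toy` and the variable `failed_row` are hypothetical stand-ins for illustration, and `dtype=object` is added here as an assumption so that mixed text and numeric columns stack safely.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for df_titles.
df_toy = pd.DataFrame({
    "type": ["MANGA", "ANIME", "MANGA"],
    "Action": [1, 0, 1],
})
columns = ["type", "Action"]

# Stack the selected columns side by side, then prepend the header row.
meta = np.concatenate(
    [np.expand_dims(np.array(df_toy[col], dtype=object), axis=1) for col in columns],
    axis=1,
)
with_header = np.insert(meta, 0, columns, axis=0)

# DataFrame row 1 ("ANIME") sits at index 2 once the header occupies index 0.
failed_row = 1
result = np.delete(with_header, [failed_row + 1], axis=0)
print(result)
```

Because the header row is inserted before the deletion, any row index collected from the DataFrame has to be shifted by one before it is passed as `excluded_idx`.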

In [5]:
# Load data
df_titles = pd.read_csv("../assets/titles_200p_cleaned.csv")
display(df_titles.head())

Unnamed: 0,title_id,title_english,title_romaji,type,duration,start_year,chapters,volume,publishing_status,country,...,Sci-Fi,Slice of Life,Sports,Supernatural,Thriller,title_romaji_type,synopsis_cleaned,synopsis_source,synopsis_wc,synopsis_cleaned_token
0,30002,Berserk,Berserk,MANGA,,1989.0,,,RELEASING,JP,...,0,0,0,0,0,Berserk_MANGA,His name Guts Black Swordsman feared warrior s...,Dark Horse,425,"['name', 'feared', 'warrior', 'spoken', 'whisp..."
1,31706,,JoJo no Kimyou na Bouken: Steel Ball Run,MANGA,,2004.0,95.0,24.0,FINISHED,JP,...,0,0,1,1,0,JoJo no Kimyou na Bouken: Steel Ball Run_MANGA,Originally presented unrelated story series la...,Wikipedia,346,"['presented', 'unrelated', 'story', 'series', ..."
2,114129,Gintama: THE VERY FINAL,Gintama: THE FINAL,ANIME,104.0,2021.0,,,FINISHED,JP,...,1,0,0,0,0,Gintama: THE FINAL_ANIME,Gintama THE FINAL rd final film adaptation rem...,no match,82,"['rd', 'final', 'film', 'adaptation', 'remaind..."
3,30013,One Piece,ONE PIECE,MANGA,,1997.0,,,RELEASING,JP,...,0,0,0,0,0,ONE PIECE_MANGA,As child Monkey D Luffy inspired become pirate...,Viz Media,348,"['child', 'inspired', 'become', 'pirate', 'lis..."
4,124194,Fruits Basket The Final Season,Fruits Basket: The Final,ANIME,24.0,2021.0,,,FINISHED,JP,...,0,1,0,1,0,Fruits Basket: The Final_ANIME,After last season revelations Soma family move...,Funimation,277,"['last', 'season', 'revelations', 'family', 'm..."


### ↓ Below is the code that vectorizes the synopsis text with GPT2. It takes a while to run

In [17]:
max_length = 256  # maximum number of tokens kept per synopsis
vec = []  # destination of sentence vectors
error_idx = []  # indices to exclude from visualization

# GPT2 has no padding token, so reuse the end-of-sequence token
# https://github.com/huggingface/transformers/issues/12594
tokenizer.pad_token = tokenizer.eos_token

# Series.items() replaces iteritems(), which was removed in pandas 2.0
for item in df_titles["synopsis_cleaned"].items():
    try:
        enc = tokenizer(
            item[1],
            max_length=max_length,
            padding=True, # https://github.com/huggingface/transformers/issues/2630 -> https://huggingface.co/docs/transformers/glossary#attention-mask
            truncation=True,
            return_tensors="pt"
        )
        enc = { k: v.to(device) for k, v in enc.items() }
        attention_mask = enc["attention_mask"]
        with torch.no_grad():
            output = model(**enc)
            last_hs = output.last_hidden_state
            # Mean of the last hidden states over the non-padding tokens
            averaged_hs = (last_hs*attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(1, keepdim=True)

        vec.append(averaged_hs[0].cpu().numpy())
    except Exception:
        # Rows with a missing (NaN) synopsis end up here; +1 skips the header row of the metadata array
        error_idx.append(item[0]+1)
        print("Error at: ", item)

vec = np.array(vec)
print(vec.shape)
print(type(vec))

print("length of df_titles: ", len(df_titles))
print("length of error indices: ", len(error_idx))
meta_header = prep_label_array(df_titles, error_idx, ["type", "Action", "Adventure", "Drama", "Mystery", "Psychological", "Romance", "Slice of Life", "title_romaji", "synopsis_source"])
print(meta_header)
np.savetxt("../assets/synvec_gpt2.tsv", vec, delimiter='\t', fmt="%f")
np.savetxt("../assets/synvec_gpt2_meta.tsv", meta_header, delimiter='\t', fmt="%s")

Error at:  (998, nan)
Error at:  (3183, nan)
Error at:  (5716, nan)
Error at:  (5906, nan)
Error at:  (6072, nan)
Error at:  (6329, nan)
Error at:  (6635, nan)
Error at:  (6954, nan)
Error at:  (7224, nan)
Error at:  (7422, nan)
Error at:  (7632, nan)
(8775, 768)
<class 'numpy.ndarray'>
length of df_titles:  8786
length of error indices:  11
[['type' 'Action' 'Adventure' ... 'Slice of Life' 'title_romaji'
  'synopsis_source']
 ['MANGA' 1 1 ... 0 'Berserk' 'Dark Horse']
 ['MANGA' 1 1 ... 0 'JoJo no Kimyou na Bouken: Steel Ball Run'
  'Wikipedia']
 ...
 ['MANGA' 0 0 ... 0 'Katappashi Kara Zenbu Koi' 'no match']
 ['ANIME' 0 0 ... 0 'WiSH VOYAGE' 'no match']
 ['MANGA' 0 1 ... 0
  'Maydare Tensei Monogatari: Kono Sekai de Ichiban Warui Majo'
  'no match']]
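The `nan` entries reported above are titles without a synopsis. As an alternative to catching the exception, the missing rows could be located up front. The sketch below assumes a default integer index and uses a hypothetical toy Series in place of `df_titles["synopsis_cleaned"]`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df_titles["synopsis_cleaned"].
synopses = pd.Series(["a feared warrior", np.nan, "a boy becomes a pirate"])

valid_mask = synopses.notna()
valid_texts = synopses[valid_mask]

# Positions of the missing rows, shifted by one for the metadata header row.
error_idx = [int(i) + 1 for i in np.flatnonzero(~valid_mask.to_numpy())]

print(len(valid_texts), error_idx)
```

Only `valid_texts` would then be passed to the tokenizer, while `error_idx` keeps the same meaning it has in the loop above.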


In [18]:
# Load gpt2 vectors
vec = np.loadtxt("../assets/synvec_gpt2.tsv", delimiter='\t')

# tensorboard
writer = SummaryWriter()
writer.add_embedding(torch.FloatTensor(vec))
writer.close()
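Since the vectors are written with `fmt="%f"`, they are stored with six decimal places. A quick round-trip check on toy data, sketched below, shows the size of the rounding error to expect after reloading. The array `vec_toy` and the temporary file name are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy vectors standing in for the GPT2 sentence vectors.
vec_toy = np.random.default_rng(0).normal(size=(3, 4))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectors.tsv")
    np.savetxt(path, vec_toy, delimiter="\t", fmt="%f")
    loaded = np.loadtxt(path, delimiter="\t")

# "%f" keeps six decimals, so the reloaded values differ only by rounding.
max_abs_error = np.abs(loaded - vec_toy).max()
print(loaded.shape, max_abs_error)
```

If more precision is needed, a format such as `"%.8e"` could be passed to `np.savetxt` instead, at the cost of a larger file.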

In [19]:
# Visualizations by tensorboard
# Download "../assets/synvec_gpt2_meta.tsv" and use "Load" in the projector to label the data points
%reload_ext tensorboard
%tensorboard --logdir ./runs