## Subtitle Processing

Methodology: We first check whether the folder name of each subtitle can be matched to the IMDb ID of a movie in our dataset.
For each subtitle that is matched with a movie, we then read the subtitle XML file using the BeautifulSoup package and write the clean sentences into a txt file.

Here we show the resulting dataframes only. You can check the processed dataset [here](https://drive.google.com/drive/folders/1FycaszmTdI2UjO06tgsg5nqvtpLG_z4s?usp=sharing).

### Libraries used and imports

In [133]:
from typing import Union
from pathlib import Path
import shutil
import requests
import json
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm

### Loading data

In [116]:
cmu_path = Path('../data/raw/MovieSummaries/')

In [5]:
col_names = [
    'Wikipedia movie ID',
    'Freebase movie ID',
    'Movie name',
    'Movie release date',
    'Movie box office revenue',
    'Movie runtime',
    'Movie languages (Freebase ID:name tuples)',
    'Movie countries (Freebase ID:name tuples)',
    'Movie genres (Freebase ID:name tuples)'
]

df_movie = pd.read_csv(cmu_path.joinpath('movie.metadata.tsv'), delimiter='\t', names=col_names)

### Variables

In [None]:
INPUT_DIR = '../data/raw/en/OpenSubtitles/xml/en/'
OUTPUT_DIR = '../data/raw/subtitles/'
WIKIDATA_PATH = '../data/preprocessed/wikipedia_ids.csv'

### Helper Functions

In [None]:
def xml2txt(xml_path: Union[Path, str], txt_path: Union[Path, str]) -> None:
    '''
    Opens an XML file using the BeautifulSoup package,
    reads every sentence by removing the tags,
    and writes each sentence on a new line of a txt file.
    '''
    with open(xml_path, 'r', encoding='utf8') as f:
        data = f.read()

    # The 'xml' parser requires the lxml package to be installed
    soup = BeautifulSoup(data, 'xml')

    text = []

    # Each <s> element is a sentence and each <w> element is a token
    sentences = soup.find_all('s')
    for sen in sentences:
        words = sen.find_all('w')
        # get_text() always returns a string, even for an empty <w> element
        sentence = ' '.join(w.get_text() for w in words)
        text.append(sentence + '\n')

    with open(txt_path, 'w', encoding='utf8') as f:
        f.writelines(text)
        
        
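def imdb_id_corrector(imdb_id: str) -> Union[str, None]:
    '''
    Converts a folder name into an IMDb ID of the form 'tt' + 7 digits.
    Returns None if the folder name is too long to be a valid ID.
    '''
    if len(imdb_id) > 7:
        # Folder names longer than 7 characters do not seem to correspond to a correct IMDb ID
        return None

    # Pad with leading zeros up to 7 characters, then add the 'tt' prefix
    imdb_id = 'tt' + imdb_id.rjust(7, '0')

    return imdb_id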


def check_imdb_id(test_id, imdb_id_list):
    '''
    Check if the folder name is in our dataset.
    Returns the first IMDb ID that contains test_id as a substring,
    or None if there is no such ID.
    '''
    for imdb_id in imdb_id_list:
        if test_id in imdb_id:
            return imdb_id
    return None

Based on manual inspection, we observed that folder names longer than 7 characters do not seem to correspond to a correct IMDb ID. Hence, we discard all folders whose name is longer than 7 characters.
Furthermore, an IMDb ID follows the convention `tt*`, where `*` is a sequence of digits of length either 7 or 8 [ref](https://en.wikipedia.org/wiki/Template:IMDb_title#:~:text=https%3A%2F%2Fwww.imdb.com,work%20if%20it%20is%20included).
Hence, we add leading zeros if a folder name is shorter than 7 characters to obtain an appropriate ID.

We do not convert letters to lowercase or remove punctuation. These steps should be dealt with at the beginning of an NLP pipeline.
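
To illustrate what the cleaned text looks like, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` rather than BeautifulSoup so that it runs without extra packages. The sample XML and the name `sentences_from_xml` are hypothetical; we only assume that `<s>` elements hold sentences and `<w>` elements hold tokens, as in `xml2txt`.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical sample in the shape we assume for the subtitle files:
# <s> elements are sentences, <w> elements are tokens.
SAMPLE_XML = """<document>
  <s id="1"><time id="T1S" value="00:00:01,000" /><w id="1.1">Hello</w><w id="1.2">,</w><w id="1.3">World</w><w id="1.4">!</w></s>
  <s id="2"><w id="2.1">Where</w><w id="2.2">are</w><w id="2.3">you</w><w id="2.4">?</w></s>
</document>"""


def sentences_from_xml(xml_text: str) -> List[str]:
    """Join the tokens of each <s> element with single spaces."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter('s'):
        # Skip empty tokens; keep case and punctuation untouched
        tokens = [(w.text or '').strip() for w in s.iter('w')]
        sentences.append(' '.join(t for t in tokens if t))
    return sentences


for line in sentences_from_xml(SAMPLE_XML):
    print(line)
# Hello , World !
# Where are you ?
```

Capitalization is preserved and punctuation marks remain as separate tokens, so any normalization is left to the downstream pipeline.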

We do not process all of the folders: a folder is processed only if its candidate IMDb ID is contained in our movie dataset.
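
The membership test could also be done as an exact match on the padded ID. The sketch below is one possible variant, not the method used in this notebook: `match_exact` and the listed IDs are hypothetical, and it assumes folder names are zero-padded to 7 characters as described above.

```python
from typing import Optional, Set


def match_exact(folder_name: str, known_ids: Set[str]) -> Optional[str]:
    """Return the padded IMDb ID if it is in known_ids, otherwise None."""
    if len(folder_name) > 7:
        # Too long to be padded into a 'tt' + 7 digit ID
        return None
    candidate = 'tt' + folder_name.rjust(7, '0')
    return candidate if candidate in known_ids else None


# Hypothetical IDs, for illustration only
known_ids = {'tt0012345', 'tt0133093'}

print(match_exact('133093', known_ids))  # tt0133093
print(match_exact('123', known_ids))     # None
print('123' in 'tt0012345')              # True: a substring test would accept '123'
```

A set lookup is constant time on average, and the exact comparison avoids accepting a short folder name that merely appears inside a longer ID.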

### Process subtitles

In [100]:
input_dir = Path(INPUT_DIR)
output_dir = Path(OUTPUT_DIR)
wikidata_path = Path(WIKIDATA_PATH)

# Create the output directory (and any missing parents) if needed
output_dir.mkdir(parents=True, exist_ok=True)
    

wikidata = pd.read_csv(wikidata_path)
imdb_id_list = wikidata['IMDb ID']
imdb_id_list = imdb_id_list[imdb_id_list.notna()]


# Maps each matched IMDb ID to [subtitle XML path, output txt path]
io_map = {}

# Find all the subtitles that match a movie in our dataset
for year_folder in tqdm(list(input_dir.iterdir())):
    for movie_folder in tqdm(list(year_folder.iterdir()), leave=False):
        movie_id = movie_folder.stem
        imdb_id = check_imdb_id(movie_id, imdb_id_list)
        if imdb_id is None:
            continue

        # Use the first XML file in the folder; skip folders without one
        sub_path = next(movie_folder.glob('*.xml'), None)
        if sub_path is None:
            continue

        txt_path = output_dir.joinpath(imdb_id + '.txt')

        io_map[imdb_id] = [sub_path, txt_path]
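
The folder walk can be tried on a small temporary tree before running it on the full dataset. This is a sketch under the assumption that the input directory is laid out as year folder / movie folder / XML files, as in the loop above; `collect_subtitles` and the folder names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict


def collect_subtitles(input_dir: Path) -> Dict[str, Path]:
    """Map each movie folder name to its first XML file, skipping empty folders."""
    found = {}
    for year_folder in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        for movie_folder in sorted(p for p in year_folder.iterdir() if p.is_dir()):
            # Sort for a deterministic choice when a folder holds several files
            sub_path = next(iter(sorted(movie_folder.glob('*.xml'))), None)
            if sub_path is not None:
                found[movie_folder.name] = sub_path
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # One movie folder with a subtitle file and one without
    (root / '1999' / '133093').mkdir(parents=True)
    (root / '1999' / '133093' / 'a.xml').write_text('<document/>', encoding='utf8')
    (root / '1999' / '999').mkdir(parents=True)

    result = collect_subtitles(root)
    print(sorted(result))  # ['133093']
```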

In [None]:
# Convert subtitles
for imdb_id, (subtitle_path, txt_path) in tqdm(io_map.items()):
    xml2txt(subtitle_path, txt_path)