# A Movie behind a Script


In [1]:
import os
import re
import findspark
import pandas as pd
findspark.init()
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import urllib.request
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'

In [2]:
spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
sc = spark.sparkContext
sqlContext = SQLContext(sc)

## Overview of datasets

The OpenSubtitles dataset is a compressed cluster of folders containing XML files. Each XML file is split into a script portion with the subtitles of the movie and a metadata portion with additional information about the movie or show. The name of one of the parent folders of the XML file is the corresponding IMDb identifier of the movie or show, thus allowing us to extract additional information from the IMDb dataset.

### OpenSubtitles

The dataset consists of 31 (**TODO: how many?**) GB of XML files distributed in the following file structure: 

```
├── opensubtitle
│   ├── OpenSubtitles2018
│   │   ├── Year
│   │   │   ├── Id
│   │   │   │   ├── #######.xml.gz
│   │   │   │   ├── #######.xml.gz
│   ├── en.tar.gz
│   ├── fr.tar.gz
│   ├── zh_cn.tar.gz
```
where
- `#######` is the unique numeric identifier of the file in the OpenSubtitles dataset.
- `Year` is the year the movie or episode was made.
- `Id` is a 5 to 7 digit identifier (if it's 7-digit it's also an IMDb identifier).

The subtitles are provided in different languages. For now we analyze only the `OpenSubtitles2018` folder, so it is the only one described here.

Some `Year` folders do not correspond to a real release year, for instance `0`, `666` and `1191`. For each `Id` there can be several subtitle XML files, as illustrated above. The decompressed XML files range in size from about 5 KB to 9000 KB.
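As a rough sketch of how this layout can be used, the helper below splits a file path into its `Year`, `Id` and file-identifier parts. The function name `parse_subtitle_path` and the dictionary keys are hypothetical choices made for illustration; the example path is the sample file loaded later in this notebook.

```python
import re

# Matches the trailing ".../Year/Id/file.xml.gz" part of a path.
_PATH_RE = re.compile(r"(?P<year>\d+)/(?P<film_id>\d+)/(?P<file_id>\d+)\.xml\.gz$")


def parse_subtitle_path(path):
    """Return the year, folder id and file id of a subtitle path, or None."""
    match = _PATH_RE.search(path.replace("\\", "/"))
    if match is None:
        return None
    return {
        "year": int(match.group("year")),
        # Ids are kept as strings so that leading zeros are not lost.
        "film_id": match.group("film_id"),
        "file_id": match.group("file_id"),
    }


info = parse_subtitle_path("data_subtitles/2017/5052448/6963336.xml.gz")
print(info)
```

Keeping the ids as strings makes it easy to check their length later, for example to test whether a folder id has exactly 7 digits.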

### XML Files

#### Subtitles

The subtitles in each XML file are stored as sentences, each with a unique id (an integer, increasing from 1). Each sentence contains a set of timestamps and a set of words. Every timestamp and every word also has an id and a set of attributes. The timestamp id comes in two formats.

The timestamp id has the form "T#S" or "T#E", where # is an increasing integer, "S" marks a start and "E" marks an end. The words between a start timestamp and the matching end timestamp are shown on screen during that interval. **This is a great indicator of fast dialog!** Apart from the id, each timestamp has a value attribute in the format `HH:mm:ss,fff`.
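To make the two formats concrete, here is a minimal sketch that parses a timestamp id and converts a timestamp value to seconds. The function names are hypothetical and the timestamp values in the example are made up for illustration; they are not taken from a real subtitle file.

```python
import re

_TIME_ID_RE = re.compile(r"^T(\d+)([SE])$")
_TIME_VALUE_RE = re.compile(r"^(\d+):(\d{2}):(\d{2}),(\d{3})$")


def parse_time_id(time_id):
    """Split an id such as 'T12S' into (12, 'S'); return None otherwise."""
    match = _TIME_ID_RE.match(time_id)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def timestamp_to_seconds(value):
    """Convert 'HH:mm:ss,fff' to a number of seconds; return None if malformed."""
    match = _TIME_VALUE_RE.match(value.strip())
    if match is None:
        return None
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


# Made-up start and end values for one subtitle block.
start = timestamp_to_seconds("00:01:18,250")
end = timestamp_to_seconds("00:01:23,750")
duration = end - start
print(parse_time_id("T7S"), parse_time_id("T7E"), duration)
```

Dividing the number of words shown in a block by its duration would give a words-per-second rate, which is one possible way to quantify fast dialog.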

The word id is a decimal number "X.Y", where X is the id of the sentence and Y is the position of the word within that sentence. Each word element in the XML file has a non-empty value (the actual word, which can also be a punctuation mark) and may have an alternative and an initial value. The initial value generally records slang or words mispronounced because of an accent, such as lyin' instead of lying. The alternative value is another way of displaying the word, for example HOW instead of how.

Sentences and words can also carry an emphasis attribute, which is not present in all files and takes the value true or false.
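The structure described above can be illustrated with the standard library XML parser. The snippet below is a hand-written sentence, not an excerpt from the dataset. The tag names `s`, `time` and `w` and the attributes `id` and `value` follow the schema printed later in this notebook, while the attribute names `initial` and `alternative` are assumed from the description above.

```python
import xml.etree.ElementTree as ET

# Hand-written example sentence; not taken from a real subtitle file.
SAMPLE_SENTENCE = """
<s id="1">
  <time id="T1S" value="00:00:10,500" />
  <w id="1.1">You</w>
  <w id="1.2" initial="lyin'">lying</w>
  <w id="1.3">.</w>
  <time id="T1E" value="00:00:12,000" />
</s>
"""


def parse_sentence(xml_text):
    """Parse one <s> element into its id, timestamps and words."""
    sentence = ET.fromstring(xml_text.strip())
    times = [(t.get("id"), t.get("value")) for t in sentence.findall("time")]
    words = [
        {
            "id": w.get("id"),
            "text": w.text,
            # Optional attributes: .get() returns None when they are absent.
            "initial": w.get("initial"),
            "alternative": w.get("alternative"),
        }
        for w in sentence.findall("w")
    ]
    return {"id": int(sentence.get("id")), "times": times, "words": words}


parsed = parse_sentence(SAMPLE_SENTENCE)
print([w["text"] for w in parsed["words"]])
```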

#### Metadata

Each XML file has a unique identifier in the name of the file and contains at the end of the file metadata in the following structure:

```
├── Conversion
│   ├── corrected_words: Integer
│   ├── sentences: Integer
│   ├── tokens: Integer
│   ├── encoding: String always utf-8
│   ├── unknown_words: Integer
│   ├── ignored_blocks: Integer
│   ├── truecased_words: Integer
├── Subtitle
│   ├── language: String
│   ├── date: String
│   ├── duration: String
│   ├── cds: String presented as #/# where # is an int
│   ├── blocks: Integer
│   ├── confidence: Double
├── Source
│   ├── genre: String[] (up to 3 genres)
│   ├── year: Integer
│   ├── duration: Integer (in minutes)
│   ├── original: String
│   ├── country: String
```

This is the structure of the metadata we consider, although some XML files may not have all the entries. 
We use the metadata to obtain additional information about the movie or show's subtitles and compute certain statistics. 

#### Exploration

Going through the dataset, we found several things worth noting. First, the dataset is not uniform: it has "strange folders" and contains XML files that are not related to movies or TV shows. For example, the folder 666/ contains subtitles of Justin Bieber songs, and the folder 1858/ is empty. To deal with this, we decided to ignore every folder outside the range 1920-2018. Film trailers are also present in the dataset: in the folder 2018 we found, for example, subtitles for the Black Panther teaser trailer.

Many subtitle files also contain text unrelated to the movie, such as the credits of the person who made the subtitles.

We found that the folder id is not always a reliable way to recover the actual movie title, so we cannot be fully certain that the subtitles are associated with the correct film. Each movie may also have more than one subtitle file, so we have to decide which one to use: we could pick one at random, or rely on the confidence attribute in the metadata. To keep only movies that can have a valid IMDb identifier, we check that the id consists of exactly 7 digits; files in folders whose id (the folder below the year) has more or fewer than 7 digits are very hard to associate with a video.
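The filtering rules discussed in this section could be combined as in the sketch below, which uses pandas on a small hand-made table. The column names, the function name `select_subtitles` and all values except the sample ids `5052448` and `6963336` are hypothetical. Taking the highest-confidence file per movie is only one of the two options mentioned above.

```python
import pandas as pd


def select_subtitles(df, min_year=1920, max_year=2018):
    """Keep plausible movies and one subtitle file per movie."""
    valid_year = df["year"].between(min_year, max_year)
    # Only folder ids made of exactly 7 digits can be IMDb identifiers.
    valid_id = df["film_id"].str.match(r"^\d{7}$")
    kept = df[valid_year & valid_id]
    # For each movie, keep the file with the highest confidence.
    best = kept.sort_values("confidence", ascending=False).drop_duplicates("film_id")
    return best.sort_values("film_id").reset_index(drop=True)


files = pd.DataFrame(
    {
        "year": [2017, 2017, 666, 2018],
        "film_id": ["5052448", "5052448", "1234567", "12345"],
        "file_id": ["6963336", "1111111", "2222222", "3333333"],
        "confidence": [0.9, 1.0, 1.0, 1.0],  # made-up confidence values
    }
)
best = select_subtitles(files)
print(best)
```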

### IMDb Dataset

We also have the IMDb ratings and basics datasets at our disposal.

In [3]:
baseURL = "https://datasets.imdbws.com/"
ratings_fn = "title.ratings.tsv.gz"
basics_fn = "title.basics.tsv.gz"

In [4]:
df_ratings = pd.read_csv(baseURL + ratings_fn, sep='\t', compression='gzip')
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.8,1439
1,tt0000002,6.3,172
2,tt0000003,6.6,1040
3,tt0000004,6.4,102
4,tt0000005,6.2,1735


In [5]:
df_basics = pd.read_csv(baseURL + basics_fn, sep='\t', compression='gzip')
df_basics.head()


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
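Since the 7-digit folder ids are meant to be IMDb identifiers, one possible way to attach IMDb information to a subtitle file is sketched below. It assumes that the `tconst` key is the folder id prefixed with `tt`, which matches the shape of the values shown above but is not verified here. The tiny dataframes reuse two rows from the outputs above, and the `subtitles` table and the `to_tconst` helper are hypothetical.

```python
import pandas as pd


def to_tconst(film_id):
    """Turn a folder id such as '5052448' into an IMDb-style id 'tt5052448'."""
    return "tt" + str(film_id).zfill(7)


# Two rows copied from the outputs shown above.
ratings = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002"],
        "averageRating": [5.8, 6.3],
        "numVotes": [1439, 172],
    }
)
basics = pd.DataFrame(
    {
        "tconst": ["tt0000001", "tt0000002"],
        "primaryTitle": ["Carmencita", "Le clown et ses chiens"],
        "startYear": [1894, 1892],
    }
)

# Hypothetical subtitle table with one folder id.
subtitles = pd.DataFrame({"film_id": ["0000001"]})
subtitles["tconst"] = subtitles["film_id"].map(to_tconst)

# Left joins keep subtitle rows even when IMDb has no matching entry.
joined = subtitles.merge(basics, on="tconst", how="left").merge(
    ratings, on="tconst", how="left"
)
print(joined)
```

When reading the real files, passing `na_values='\\N'` to `pd.read_csv` should turn the `\N` placeholders into missing values.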


## Sample film loading

Here we take one sample film and load it into a Spark dataframe with the help of the spark-xml library. Since the dataset is very large, Spark is a natural choice. With this library we can load two distinct dataframes per movie, each revealing different information: one contains the actual text and the other contains the metadata of the film.

First we look at the schema and content of the dataframe containing the subtitles. It is not very readable: it contains many null values and information we want to get rid of. Each word struct carries an id we don't really need, and each row holds one array of structs for the words and another for the timestamps. We need to decide how to store the information and what to keep.

In [4]:
df_sample_film = sqlContext.read.format('com.databricks.spark.xml')\
                                .options(rowTag='s') \
                                .load('data_subtitles/2017/5052448/6963336.xml.gz')
df_sample_film.printSchema()
df_sample_film.show()

root
 |-- _id: long (nullable = true)
 |-- time: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _value: string (nullable = true)
 |-- w: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: double (nullable = true)

+---+--------------------+--------------------+
|_id|                time|                   w|
+---+--------------------+--------------------+
|  1|[[, T1S, 00:00:10...|[[", 1.1], [ahmad...|
|  2|[[, T2S, 00:00:56...|[[Well, 2.1], [,,...|
|  3|[[, T3S, 00:00:58...|[[What, 3.1], [ki...|
|  4|[[, T5S, 00:01:09...|[[Crazy, 4.1], [....|
|  5|[[, T6S, 00:01:10...|[[Got, 5.1], [me,...|
|  6|[[, T7S, 00:01:18...|[[Serious, 6.1], ...|
|  7|[[, T7E, 00:01:23...|[[I, 7.1], [feel,...|
|  8|[[, T8S, 00:01:25...|[[Alright, 8.1], ...|
|  9|[[, T8E, 00:01:29...|[[See, 9.1

In [5]:
df_sample_metadata = sqlContext.read.format('com.databricks.spark.xml')\
                                    .options(rowTag='meta') \
                                    .load('data_subtitles/2017/5052448/6963336.xml.gz')

The metadata gives a much cleaner dataframe, which can be used for many statistics and for filtering. It contains useful fields such as the duration and the genre of the film. Here we can see the schema. We need to decide what is actually relevant, filter out the rest, and choose the format of our dataframe (for example, having each genre in a separate column).

In [6]:
df_sample_metadata.printSchema()
df_sample_metadata.show()

root
 |-- conversion: struct (nullable = true)
 |    |-- corrected_words: long (nullable = true)
 |    |-- encoding: string (nullable = true)
 |    |-- ignored_blocks: long (nullable = true)
 |    |-- sentences: long (nullable = true)
 |    |-- tokens: long (nullable = true)
 |    |-- truecased_words: long (nullable = true)
 |    |-- unknown_words: long (nullable = true)
 |-- source: struct (nullable = true)
 |    |-- duration: long (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- year: long (nullable = true)
 |-- subtitle: struct (nullable = true)
 |    |-- blocks: long (nullable = true)
 |    |-- cds: string (nullable = true)
 |    |-- confidence: double (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- duration: string (nullable = true)
 |    |-- language: string (nullable = true)

+--------------------+--------------------+--------------------+
|          conversion|              source|            subtitle|
+--------------------+------------
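One possible target format for the metadata is a flat row with one column per genre. The plain-Python sketch below shows the idea on a dictionary that mimics the nested schema above. It assumes that the `genre` string is comma-separated, and the function name and all example values are hypothetical.

```python
def flatten_metadata(meta, max_genres=3):
    """Flatten nested metadata into one row with genre_1..genre_N columns."""
    row = {}
    for section, fields in meta.items():
        for name, value in (fields or {}).items():
            if section == "source" and name == "genre":
                continue  # genres get their own columns below
            row["{}_{}".format(section, name)] = value
    # Assumption: several genres are stored as one comma-separated string.
    raw_genres = (meta.get("source") or {}).get("genre") or ""
    genres = [g.strip() for g in raw_genres.split(",") if g.strip()]
    for i in range(max_genres):
        row["genre_{}".format(i + 1)] = genres[i] if i < len(genres) else None
    return row


# Made-up metadata values that follow the schema printed above.
sample_meta = {
    "conversion": {"sentences": 1200, "tokens": 9000},
    "source": {"duration": 95, "genre": "Comedy,Drama", "year": 2017},
    "subtitle": {"confidence": 1.0, "language": "English"},
}
row = flatten_metadata(sample_meta)
print(row)
```

The same logic could be wrapped in a Spark UDF or rewritten with column expressions once the final set of fields is chosen.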

We can see that there is no actual link between the two dataframes. The id of the film is only present in the name of the folder that contains the subtitle files. To link the subtitle and metadata dataframes, we add an id column containing the id of the film.

We now need to process the dataframes to store the information we actually want in an efficient manner. We use our sample film to write functions that reshape the dataframes so that we can then extract the information we need.

In [7]:
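def to_sentence(words):
    """Return the list of tokens of a sentence, dropping the word ids."""
    # Each element of `words` is a (_VALUE, _id) struct; index 0 is the token text.
    return [w[0] for w in words]

# UDF returning the sentence as a list of tokens.
udf_word = udf(to_sentence, ArrayType(StringType()))
# UDF returning the sentence as a single space-separated string.
udf_sentence = udf(lambda x: ' '.join([w[0] for w in x]), StringType())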

In [8]:
df_sample_film_sentence_list = df_sample_film.withColumn("sentence", udf_word("w"))
df_sample_film_sentence_list.show()

+---+--------------------+--------------------+--------------------+
|_id|                time|                   w|            sentence|
+---+--------------------+--------------------+--------------------+
|  1|[[, T1S, 00:00:10...|[[", 1.1], [ahmad...|[", ahmad, torifi...|
|  2|[[, T2S, 00:00:56...|[[Well, 2.1], [,,...|[Well, ,, the, th...|
|  3|[[, T3S, 00:00:58...|[[What, 3.1], [ki...|[What, kinda, sic...|
|  4|[[, T5S, 00:01:09...|[[Crazy, 4.1], [....|          [Crazy, .]|
|  5|[[, T6S, 00:01:10...|[[Got, 5.1], [me,...|[Got, me, out, in...|
|  6|[[, T7S, 00:01:18...|[[Serious, 6.1], ...|[Serious, though, .]|
|  7|[[, T7E, 00:01:23...|[[I, 7.1], [feel,...|[I, feel, here, l...|
|  8|[[, T8S, 00:01:25...|[[Alright, 8.1], ...|[Alright, man, ,,...|
|  9|[[, T8E, 00:01:29...|[[See, 9.1], [ya,...|        [See, ya, .]|
| 10|[[, T9S, 00:01:32...|[[Okay, 10.1], [,...|[Okay, ,, so, thi...|
| 11|[[, T10S, 00:01:3...|[[It, 11.1], ['s,...|[It, 's, like, a,...|
| 12|[[, T11S, 00:01:4...|[[Okay, 

After analyzing the subtitle dataframe, we encountered the problem of not being able to associate words with timestamps. As our XML files separate data by sentences, each sentence might have zero or many timestamps associated, and it would be necessary to change the whole dataset to b

In [9]:
df_sample_film_sentence_string = df_sample_film.withColumn("sentence", udf_sentence("w"))
df_sample_film_sentence_string.show()

+---+--------------------+--------------------+--------------------+
|_id|                time|                   w|            sentence|
+---+--------------------+--------------------+--------------------+
|  1|[[, T1S, 00:00:10...|[[", 1.1], [ahmad...|" ahmad torifi " ...|
|  2|[[, T2S, 00:00:56...|[[Well, 2.1], [,,...|Well , the thing ...|
|  3|[[, T3S, 00:00:58...|[[What, 3.1], [ki...|What kinda sick i...|
|  4|[[, T5S, 00:01:09...|[[Crazy, 4.1], [....|             Crazy .|
|  5|[[, T6S, 00:01:10...|[[Got, 5.1], [me,...|Got me out in thi...|
|  6|[[, T7S, 00:01:18...|[[Serious, 6.1], ...|    Serious though .|
|  7|[[, T7E, 00:01:23...|[[I, 7.1], [feel,...|I feel here like ...|
|  8|[[, T8S, 00:01:25...|[[Alright, 8.1], ...|Alright man , alr...|
|  9|[[, T8E, 00:01:29...|[[See, 9.1], [ya,...|            See ya .|
| 10|[[, T9S, 00:01:32...|[[Okay, 10.1], [,...|Okay , so this is...|
| 11|[[, T10S, 00:01:3...|[[It, 11.1], ['s,...|It 's like a fuck...|
| 12|[[, T11S, 00:01:4...|[[Okay, 

In [10]:
# One row per token: explode the token list of each sentence.
df_sample_film_words = df_sample_film_sentence_list.select('*', explode(col("sentence")).alias('word'))
# Keep only purely alphabetic tokens; the regular expression drops punctuation marks and spaces.
df_sample_film_words = df_sample_film_words.filter(df_sample_film_words.word.rlike("^[a-zA-Z]+$"))
word_count_distinct = df_sample_film_words.select("word").distinct().count()
word_count_total = df_sample_film_words.select("word").count()

In [11]:
print("Number of distinct words in film is: {:}".format(word_count_distinct))
print("Total number of  words in film is: {:}".format(word_count_total))

Number of distinct words in film is: 1452
Total number of  words in film is: 6798
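To see what the regular expression keeps and drops, the plain-Python sketch below mirrors the counting logic on a few token lists. The first four lists are taken from the `sentence` column shown earlier (the last of them is only the visible beginning of a longer sentence), the final list is made up, and the function name and `lowercase` option are hypothetical additions.

```python
import re

WORD_RE = re.compile(r"^[a-zA-Z]+$")


def count_words(sentences, lowercase=False):
    """Return (distinct, total) counts of purely alphabetic tokens."""
    words = [tok for sentence in sentences for tok in sentence if WORD_RE.match(tok)]
    if lowercase:
        words = [w.lower() for w in words]
    return len(set(words)), len(words)


sentences = [
    ["Crazy", "."],
    ["Serious", "though", "."],
    ["See", "ya", "."],
    ["It", "'s", "like", "a"],
    ["see", "ya", "!"],  # made-up sentence to show case sensitivity
]
counts = count_words(sentences)
counts_lower = count_words(sentences, lowercase=True)
print(counts, counts_lower)
```

Punctuation marks and tokens such as `'s` are dropped by the pattern, and without lowercasing `See` and `see` count as two distinct words.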


In [12]:
def film_words(df_film):
    """Return (distinct word count, total word count) for a subtitle dataframe."""
    df_words = df_film.withColumn("sentence", udf_word("w")) \
                      .select('*', explode(col("sentence")).alias('word'))
    # TODO: change udf_sentence to filter out empty strings and marks.
    # Filter on this dataframe's own column rather than the global df_sample_film_words.
    df_words_filter = df_words.filter(col("word").rlike("^[a-zA-Z]+$"))
    word_count_distinct = df_words_filter.select("word").distinct().count()
    word_count_total = df_words_filter.select("word").count()
    return (word_count_distinct, word_count_total)
    