# A Movie behind a Script


In [1]:
import os
import re
import findspark
import pandas as pd
findspark.init()
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import urllib.request
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'

In [2]:
spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
sc = spark.sparkContext
sqlContext = SQLContext(sc)

# Overview of datasets

The OpenSubtitles dataset is a compressed cluster of folders containing XML files. Each XML file is split into a script portion with the subtitles of the movie and a metadata portion with additional information about the movie or show. The name of one of the parent folders of the XML file is the corresponding IMDb identifier of the movie or show, thus allowing us to extract additional information from the IMDb dataset.

## IMDb Dataset

We have at our disposal the IMDb ratings and basics dataset. For the moment we have downloaded the files locally, but we would like to scrape the data.

In [3]:
# TODO scrape data https://datasets.imdbws.com/
ratings_fn = "title.ratings.tsv.gz"
basics_fn = "title.basics.tsv.gz"

In [4]:
df_ratings = spark.read.option("header", "true")\
                       .option("sep", "\t")\
                       .csv("imdb_data/" + ratings_fn)
df_ratings = df_ratings.selectExpr("tconst", "cast(averageRating as float) averageRating", "cast(numVotes as int) numVotes")
df_ratings.show()

+---------+-------------+--------+
|   tconst|averageRating|numVotes|
+---------+-------------+--------+
|tt0000001|          5.8|    1440|
|tt0000002|          6.3|     172|
|tt0000003|          6.6|    1041|
|tt0000004|          6.4|     102|
|tt0000005|          6.2|    1735|
|tt0000006|          5.5|      91|
|tt0000007|          5.5|     579|
|tt0000008|          5.6|    1539|
|tt0000009|          5.6|      74|
|tt0000010|          6.9|    5127|
|tt0000011|          5.4|     214|
|tt0000012|          7.4|    8599|
|tt0000013|          5.7|    1318|
|tt0000014|          7.2|    3739|
|tt0000015|          6.2|     660|
|tt0000016|          5.9|     982|
|tt0000017|          4.8|     197|
|tt0000018|          5.5|     414|
|tt0000019|          6.6|      13|
|tt0000020|          5.1|     232|
+---------+-------------+--------+
only showing top 20 rows



In [5]:
df_basics = spark.read.option("header", "true")\
                      .option("sep", "\t")\
                      .csv("imdb_data/" + basics_fn)
df_basics = df_basics.withColumn("rttmp", df_basics.runtimeMinutes.cast(DoubleType())) \
                     .drop("runtimeMinutes").withColumnRenamed("rttmp", "runtimeMinutes")
df_basics.show()

+---------+---------+--------------------+--------------------+-------+---------+-------+--------------------+--------------+
|   tconst|titleType|        primaryTitle|       originalTitle|isAdult|startYear|endYear|              genres|runtimeMinutes|
+---------+---------+--------------------+--------------------+-------+---------+-------+--------------------+--------------+
|tt0000001|    short|          Carmencita|          Carmencita|      0|     1894|     \N|   Documentary,Short|           1.0|
|tt0000002|    short|Le clown et ses c...|Le clown et ses c...|      0|     1892|     \N|     Animation,Short|           5.0|
|tt0000003|    short|      Pauvre Pierrot|      Pauvre Pierrot|      0|     1892|     \N|Animation,Comedy,...|           4.0|
|tt0000004|    short|         Un bon bock|         Un bon bock|      0|     1892|     \N|     Animation,Short|          null|
|tt0000005|    short|    Blacksmith Scene|    Blacksmith Scene|      0|     1893|     \N|        Comedy,Short|        

## OpenSubtitles dataset

The dataset consists of 31 GB of XML files distributed in the following file structure: 

```
├── opensubtitle
│   ├── OpenSubtitles2018
│   │   ├── Year
│   │   │   ├── Id
│   │   │   │   ├── #######.xml.gz
│   │   │   │   ├── #######.xml.gz
│   ├── en.tar.gz
│   ├── fr.tar.gz
│   ├── zh_cn.tar.gz
```
where
- `######` is a 6-digit unique identifier of the file on the OpenSubtitles dataset.
- `Year` is the year the movie or episode was made.
- `Id` is a 5 to 7 digit identifier (if it's 7-digit it's also an IMDb identifier).

The subtitles are provided in different languages. We only analyze the `OpenSubtitles2018` folder and it's the only folder we detail.

The decompressed XML files vary in size, ranging from 5KB to 9000KB sized files.

## XML Files

Each XML file is split into a `document` and `metadata` section.

### Subtitles

The `document` section contains all the subtitles and its general structure is the following:

```
├── s
│   ├── time: Integer
│   ├── w: String
```

An example snippet of an XML file:

```xml
  <s id="1">
    <time id="T1S" value="00:00:51,819" />
    <w id="1.1">Travis</w>
    <w id="1.2">.</w>
    <time id="T1E" value="00:00:53,352" />
  </s>
```

The subtitles in each XML file are stored by **blocks** denoted by `s` with a unique `id` attribute (integers in increasing order starting at 1).  

Each block (`<s id="1">` for instance) has a:  

1. Set of timestamps (denoted by `time`) with
 - A timestamp `id` attribute that can take two different formats: `T#S` or `T#E`, where _S_ indicates _start_, _E_ indicates _end_ and _#_ is an increasing integer. 
 - A `value` attribute which has the format `HH:mm:ss,fff`.

2. Set of words (denoted by `w`) with
 - an `id` attribute that is simply an increasing number of decimal numbers of the format `X.Y` where X is the string id and Y is the word id within the corresponding string
 - a non-empty `value` attribute that contains a token: a word or a punctuation character. 

It sometimes also has an `alternative`, `initial` and `emphasis` attribute.  

 - The `initial` attribute generally corresponds to slang words or mispronounced words because of an accent such as _lyin'_ instead of _lying_.  
 - The `alternative` attribute is another way of displaying the subtitle for example _HOW_ instead of _how_.
 - The `emphasis` attribute is a boolean.

### Metadata

The `metadata` section has the following structure:

```
├── Conversion
│   ├── corrected_words: Integer
│   ├── sentences: Integer
│   ├── tokens: Integer
│   ├── encoding: String (always utf-8)
│   ├── unknown_words: Integer
│   ├── ignored_blocks: Integer
│   ├── truecased_words: Integer
├── Subtitle
│   ├── language: String
│   ├── date: String
│   ├── duration: String
│   ├── cds: String (presented as #/# where # is an int)
│   ├── blocks: Integer
│   ├── confidence: Double
├── Source
│   ├── genre: String[] (up to 3 genres)
│   ├── year: Integer
│   ├── duration: Integer (in minutes)
│   ├── original: String
│   ├── country: String
```

We note that some XML files may not have all the entries. 
We can use the metadata to obtain additional information about the movie or show's subtitles and compute certain statistics. 

## Document dataframe

## Exploration

Going through the dataset we notice a few things:

1. The dataset has meaningless folders. For example, the folder 1858/ is empty.
2. Dataset contains XML files that are not related to movies or TV shows. For example, the folder 666/ contains Justin Bieber song subtitles.  
3. Trailer of films can be present in the dataset. For example, the folder 2018/ we found for example Black Panther teaser trailer subtitles.
4. Each movie might have more than 1 subtitle file.
5. Some subtitle files contain text that is not related to the movie, like credits to the person who made the subtitles.
6. The IDMDb folder name is not always a 7-digit number, meaning it is not always a valid IMDb identifer and we can't retrieve the IMDb info.
7. Each block may have an arbitrary number (including 0) of timestamps associated to it.

To solve points 1 and 2, we ignore all the folders which aren't inside the range of 1920-2018.

To solve point 3, we drop trailers by looking at the `duration` field in the metadata section.

To solve point 4, we simply take the first one.

To solve point 6, we keep movies that have a correct IMDb identifier. Hence, all the files in folders that don't have a 7-digit folder name are dropped.

To solve point 7, we decide not to associate a timestamp to each word for the moment.
 
For the moment, we take a sample of the dataset from the cluster (see python script `extract_sample_2.py`) by collecting 1 or 2 movies for each year in the range 1920-2018.

## Putting it all together

After doing an analysis of the files and considering the statistics we want to obtain taking the size of our data into account, we decide to load the metadata and sentences directly into 1 dataframe where we manipulate it as before. We decide not to extract all words at first as it would induce into very heavy computations. We store the text in an array of sentences where each sentence is an array of words.

In [6]:
imdb_id = '6464116'
df_document_example = sqlContext.read.format('com.databricks.spark.xml')\
                                     .options(rowTag='document') \
                                     .load('sample_dataset/2017/6464116/6887453.xml.gz')
df_document_example.printSchema()
df_document_example.show()

root
 |-- _id: long (nullable = true)
 |-- meta: struct (nullable = true)
 |    |-- conversion: struct (nullable = true)
 |    |    |-- corrected_words: long (nullable = true)
 |    |    |-- encoding: string (nullable = true)
 |    |    |-- ignored_blocks: long (nullable = true)
 |    |    |-- sentences: long (nullable = true)
 |    |    |-- tokens: long (nullable = true)
 |    |    |-- truecased_words: long (nullable = true)
 |    |    |-- unknown_words: long (nullable = true)
 |    |-- source: struct (nullable = true)
 |    |    |-- duration: long (nullable = true)
 |    |    |-- genre: string (nullable = true)
 |    |    |-- year: long (nullable = true)
 |    |-- subtitle: struct (nullable = true)
 |    |    |-- blocks: long (nullable = true)
 |    |    |-- cds: string (nullable = true)
 |    |    |-- confidence: double (nullable = true)
 |    |    |-- date: string (nullable = true)
 |    |    |-- duration: string (nullable = true)
 |    |    |-- language: string (nullable = true)
 

To avoid confusion, we will set some naming conventions. We will refer to certain attributes as follows:

- The `s` array as **blocks**
- An element of blocks, as a **block**.
- The `w` array as **elements**
- An element of elements, as **element**.
- `_VALUE` as a **token**

### Dataframe manipulation

We define a function that converts the `w` column of the document to an array of sentences, where each sentence is an array of words.

In [7]:
def to_sentences_array(sentences):
    """Function to map the struct containing the words 
    to a list of words """
    s_list = []
    if sentences is None:
        return s_list
    for words in sentences:
        w_list = []
        if words and "w" in words and words["w"]:
            for w in words["w"]:
                if '_VALUE' in w and w['_VALUE']:
                    w_list.append(w['_VALUE'])
                
            s_list.append(w_list)

    return s_list

Here we define a couple of udf functions we will later use for the manipulation of our dataset

In [8]:
# Transform to spark function
udf_sentences_array = udf(to_sentences_array, ArrayType(ArrayType(StringType())))
# Convert array of words into a single string
udf_sentence = udf(lambda x: ' '.join(x), StringType())
# Function to split genres
udf_split = udf(str.split, ArrayType(StringType()))

The function below structures our data to the format we want to then process all the queries we need: We link the movie with the proper imdbID, we get all the sentences, change the subtitle duration to be in seconds (We assume for this that they all have the same format and after exploring the dataset we know the vast majority does).

In [9]:
def dataframe_cleaning(df_document, imdb_id):
    # Create IMDb ID and sentences column
    df_film_sentences = df_document.withColumn("tconst", lit("tt" + imdb_id))\
                                   .withColumn("sentences", udf_sentences_array("s"))
    
    # Select metadata and previously created columns
    df_result = df_film_sentences.selectExpr("tconst",
                                             "meta.conversion.sentences as num_sentences",
                                             "meta.source.genre", 
                                             "meta.source.year", 
                                             "meta.subtitle.blocks",
                                             "meta.subtitle.duration as subtitle_duration",
                                             "meta.subtitle.language",
                                             "sentences")
    # Split genre column and convert subtitle duration to seconds
    df_result = df_result.withColumn("genres", udf_split("genre")) \
                         .withColumn("subtitle_mins", 
                                     unix_timestamp(df_result.subtitle_duration, "HH:mm:ss,SSS") / 60)
    # Discard redundant columns
    return df_result.select("tconst", "num_sentences", "year", "blocks", "subtitle_mins", "genres", "sentences")

Here we have an example of the resulting dataframe

In [10]:
df_document_example = dataframe_cleaning(df_document_example, imdb_id)
df_document_example.printSchema()
df_document_example.show()

root
 |-- tconst: string (nullable = false)
 |-- num_sentences: long (nullable = true)
 |-- year: long (nullable = true)
 |-- blocks: long (nullable = true)
 |-- subtitle_mins: double (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- sentences: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)

+---------+-------------+----+------+-------------+--------------------+--------------------+
|   tconst|num_sentences|year|blocks|subtitle_mins|              genres|           sentences|
+---------+-------------+----+------+-------------+--------------------+--------------------+
|tt6464116|          967|2017|   888|         40.7|[Action,Crime,Drama]|[[[, shots, firin...|
+---------+-------------+----+------+-------------+--------------------+--------------------+



In [11]:
def dataframe_maker(path):
    """Create dataframe based on the document"""
    df_film = sqlContext.read.format('com.databricks.spark.xml')\
                             .options(rowTag='document')\
                             .load(path)
    return df_film

The code below creates a dataframe based on the sample dataset. We will later expand it to cover a bigger quantity of films.

In [12]:
path = "sample_dataset/"
# Create empty dataframe with same schema
df_films = spark.createDataFrame([], df_document_example.schema)

for year in os.listdir(path):
    if not year.startswith('.'):
        for imdb_id in os.listdir(path + year):
            if not imdb_id.startswith('.'):
                current_path = path + year + "/" + imdb_id
                for file in os.listdir(current_path):
                        df_document = dataframe_maker(current_path + '/' + file)
                        df_films = df_films.union(dataframe_cleaning(df_document, imdb_id))

#             df_m.show()
#             print(current_path + "/" + file)

We now have a proper dataframe which will help us generate useful statistics. This is the resulting format

In [13]:
df_films.show()

+---------+-------------+----+------+------------------+--------------------+--------------------+
|   tconst|num_sentences|year|blocks|     subtitle_mins|              genres|           sentences|
+---------+-------------+----+------+------------------+--------------------+--------------------+
|tt1165285|          337|1924|   319|28.066666666666666|             [Short]|[[BACKWARD, CURRE...|
|tt1452522|          208|1924|   181|53.333333333333336| [Drama,History,War]|[[The, Battle, of...|
|tt1002599|           10|1927|     9|               3.5|   [Animation,Short]|[[FIINBECK, HAS, ...|
|tt1320310|           49|1928|    47|              7.55|      [Comedy,Short]|[[YAJI, AND, KITA...|
|tt1886619|           98|1928|    76|             54.15|             [Drama]|[[The, night, coa...|
|tt1002784|          141|1934|   138|55.983333333333334|             [Drama]|[[Song, of, the, ...|
|tt1002784|          138|1934|   138| 55.96666666666667|             [Drama]|[[Song, of, the, ...|
|tt1703934

We join our dataframe with the IMDb dataframe

In [14]:
df_films.join(df_ratings, ["tconst"]).show()
df_films.printSchema()

+---------+-------------+----+------+------------------+--------------------+--------------------+-------------+--------+
|   tconst|num_sentences|year|blocks|     subtitle_mins|              genres|           sentences|averageRating|numVotes|
+---------+-------------+----+------+------------------+--------------------+--------------------+-------------+--------+
|tt1165285|          337|1924|   319|28.066666666666666|             [Short]|[[BACKWARD, CURRE...|          6.4|      47|
|tt1452522|          208|1924|   181|53.333333333333336| [Drama,History,War]|[[The, Battle, of...|          5.0|      11|
|tt1002599|           10|1927|     9|               3.5|   [Animation,Short]|[[FIINBECK, HAS, ...|          4.4|       5|
|tt1320310|           49|1928|    47|              7.55|      [Comedy,Short]|[[YAJI, AND, KITA...|          5.4|      22|
|tt1886619|           98|1928|    76|             54.15|             [Drama]|[[The, night, coa...|          7.7|      21|
|tt1002784|          141

Query to get number of sentences of best 50 films considering at least 20000 reviews

In [15]:
df_filtered = df_ratings.filter(df_ratings.numVotes > 20000)
df_50_best = df_films.join(df_filtered, ["tconst"])\
                     .orderBy(df_ratings.averageRating.desc()).select("num_sentences", 
                                                                      "averageRating", 
                                                                      "numVotes").take(50)

In [16]:
#cest de la merde pour linstant vue la taille du dataset
df_pd_example = pd.DataFrame(df_50_best)

In [17]:
df_pd_example


Unnamed: 0,0,1,2
0,21,6.2,174384
