# A Movie behind a Script


In [1]:
import os
import re
import findspark
import pandas as pd
findspark.init()
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import urllib.request
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'

In [2]:
spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
sc = spark.sparkContext
sqlContext = SQLContext(sc)

## Overview of datasets

We were provided with a dataset of movie and TV show subtitles that contains one or more XML files per video. Each file is supposed to carry an IMDb identifier, which lets us link it to the IMDb datasets and extract further information about a movie (rating, actors, director, etc.). We therefore decided to use the IMDb datasets as well, so that we have more analytical tools at our disposal.

### OpenSubtitles

(Considering only the folder OpenSubtitles2018.) The dataset consists of 31 GB of XML files containing subtitles, in different languages, of movies, TV shows and trailers. The data is separated into nested folders: first by language, then by year and finally by identifier. Each identifier is supposed to be the IMDb identifier of the video, but this needs to be checked. Some year folders are not meaningful, however; for example, there is a folder 0. A single film can also have multiple subtitle XML files. The decompressed XML files vary a great deal in size as well, from about 5 KB to about 9000 KB.
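The folder layout described above can be read back from a file path. The sketch below is only an illustration: the helper name `parse_subtitle_path` is hypothetical, and it assumes that the last three path components are always the year folder, the identifier folder and the file name.

```python
from pathlib import PurePosixPath

def parse_subtitle_path(path):
    """Return (year folder, id folder, file name) from the last three path parts."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError("expected at least year/id/file in %r" % path)
    year, video_id, file_name = parts[-3:]
    return year, video_id, file_name

# Path of the sample file that is loaded later in this notebook.
year, video_id, file_name = parse_subtitle_path(
    "data_subtitles/2017/5052448/6963336.xml.gz"
)
```

Keeping the three parts as strings avoids losing leading zeros in the identifier.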

Each XML file has a document id and contains the following metadata, split into 3 categories: Conversion, Source, and Subtitle.

Conversion contains:
- Number of Sentences
- Number of corrected words
- Number of unknown words
- Number of tokens
- Encoding type

Source contains:
- Genre (Action, drama, horror, etc. Can have multiple)
- Year

Subtitle contains:
- Language
- Date (creation of file or release date of associated video?) -> XML file
- Duration (of video associated)
- Cds (can be 1/5, 2/3)
- Blocks
- Confidence

We can use these metadata to find different statistics that might reveal interesting information.

The actual subtitles in each XML file are stored as sentences, each with a unique id (integers in increasing order starting at 1). Each sentence has a set of timestamps and a set of words. Every timestamp and every word also has an id and a set of attributes. The timestamp id comes in two forms.

The timestamp id is either "T#S" or "T#E", where # is an increasing integer, "S" indicates start and "E" indicates end. The words between a start and an end timestamp are shown on the screen during the interval they delimit. **This is a great indicator of fast dialog!** Apart from the id, each timestamp also has a value attribute in the format ``HH:mm:ss,fff``.
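As a rough illustration of how the timestamps could be turned into a measure of dialog speed, the sketch below parses ``HH:mm:ss,fff`` values into seconds and computes the number of words per second for one block. The helper names `parse_timestamp` and `words_per_second` are hypothetical, and the example values are made up rather than taken from the dataset.

```python
import re

# Matches the HH:mm:ss,fff format described above.
TIMESTAMP_RE = re.compile(r"\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*")

def parse_timestamp(value):
    """Convert an 'HH:mm:ss,fff' string to seconds as a float."""
    match = TIMESTAMP_RE.fullmatch(value)
    if match is None:
        raise ValueError("unexpected timestamp format: %r" % value)
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0

def words_per_second(start, end, n_words):
    """Speech rate for the words shown between a T#S and its T#E."""
    duration = parse_timestamp(end) - parse_timestamp(start)
    if duration <= 0:
        return None  # malformed or zero-length block
    return n_words / duration

# Illustrative values only, not taken from the dataset.
rate = words_per_second("00:00:56,000", "00:00:58,500", 10)
```

Returning `None` for non-positive durations is a design choice for this sketch, so that malformed blocks can be filtered out instead of raising an error.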

For the words, the id has the form "X.Y", where X is the sentence id and Y is the position of the word within that sentence. Each word element in the XML file has a non-empty value (the actual word, which can also be a punctuation mark) and may have an alternative and an initial value. The initial value generally corresponds to slang or to words spelled as they are pronounced with an accent, such as lyin' instead of lying. The alternative is another way of displaying the subtitle, for example HOW instead of how.
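Because the word id looks like a decimal number, it is easy to lose information if it is parsed as a float: "2.10" (the tenth word of sentence 2) becomes 2.1, the same value as "2.1" (the first word). A minimal sketch of a safer approach, using a hypothetical helper `split_word_id` that keeps the id as a string and splits it:

```python
def split_word_id(word_id):
    """Split an 'X.Y' word id into (sentence index, word index)."""
    sentence, word = word_id.split(".")
    return int(sentence), int(word)

# Kept as strings, the 1st and 10th word of sentence 2 stay distinct.
first = split_word_id("2.1")
tenth = split_word_id("2.10")

# Parsed as floats, both ids collapse to the same value.
collision = float("2.1") == float("2.10")
```

The same collision shows up when the id column is inferred as a double while loading the XML, so reading it as a string may be preferable if word positions matter.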

There is another attribute, emphasis, which we found on sentences and words in some files but not in all of them; it takes the value true or false.

#### Exploration

After going through the dataset we found several things worth noting. First, the dataset is not uniform: it has "strange folders" and contains XML files that are not related to movies or TV shows. For example, the folder 666/ contains subtitles for Justin Bieber songs, the folder 1858/ is empty, and so on. To solve this we decided to ignore all folders outside the range 1920-2018. We also found that film trailers are present in the dataset; in the folder 2018, for example, we found subtitles for the Black Panther teaser trailer.
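The year filter can be sketched as follows. The function name `is_valid_year_folder` and its parameters are hypothetical; the folder names in the example are the ones mentioned in this notebook plus the two bounds of the range.

```python
def is_valid_year_folder(name, first_year=1920, last_year=2018):
    """True if the folder name is a year within the range we keep."""
    return name.isdecimal() and first_year <= int(name) <= last_year

# Folder names mentioned in this notebook, plus the bounds of the range.
folders = ["0", "666", "1858", "1920", "2017", "2018"]
kept = [f for f in folders if is_valid_year_folder(f)]
```

Note that this only discards folders by name; trailers and other unrelated files inside a valid year folder still need a separate check.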

Another thing worth mentioning is that many subtitle files contain text that is not related to the movie, such as credits for the person who made the subtitles.

We found that the folder id is not always a reliable way to recover the actual movie name, so we cannot be 100% certain that a subtitle id is associated with the correct film. Each movie may also have more than one subtitle file, so we have to decide which one to use: we could pick one at random, or we could rely on the confidence attribute in the metadata. To select movies that can actually have a correct IMDb identifier, we checked that the id is composed of 7 digits; files in folders whose id (after the year folder) has more or fewer than 7 digits are very hard to associate with a video.
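The two decisions above (keeping only 7-digit ids and choosing one subtitle file per movie) could look roughly like the sketch below. It assumes that a 7-digit folder id corresponds to the numeric part of an IMDb `tconst` key such as `tt0000001`, which still needs to be verified. The helper names `to_tconst` and `pick_most_confident`, the confidence values and the second file name are hypothetical.

```python
import re

# Exactly seven ASCII digits, as required for the folder id.
SEVEN_DIGITS = re.compile(r"\d{7}", re.ASCII)

def to_tconst(folder_id):
    """Return 'tt' + id for a 7-digit id, or None if the id does not fit."""
    if SEVEN_DIGITS.fullmatch(folder_id) is None:
        return None
    return "tt" + folder_id

def pick_most_confident(files):
    """Among (file name, confidence) pairs, keep the one with the highest confidence."""
    return max(files, key=lambda item: item[1])

# Confidence values and the second file name are made up for illustration.
candidates = [("6963336.xml.gz", 0.9), ("0000000.xml.gz", 1.0)]
best_file, best_confidence = pick_most_confident(candidates)
key = to_tconst("5052448")
```

Using the confidence attribute makes the choice reproducible, whereas picking a file at random would give different results from one run to the next unless a seed is fixed.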

### IMDb Dataset

We also have at our disposal the IMDb ratings and basics datasets.

In [3]:
baseURL = "https://datasets.imdbws.com/"
ratings_fn = "title.ratings.tsv.gz"
basics_fn = "title.basics.tsv.gz"

In [4]:
df_ratings = pd.read_csv(baseURL + ratings_fn, sep='\t', compression='gzip')
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.8,1439
1,tt0000002,6.3,172
2,tt0000003,6.6,1040
3,tt0000004,6.4,102
4,tt0000005,6.2,1735


In [5]:
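# low_memory=False reads the file in one pass, so column types are inferred consistently
df_basics = pd.read_csv(baseURL + basics_fn, sep='\t', compression='gzip', low_memory=False)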
df_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


In [6]:
df_sample_film = sqlContext.read.format('com.databricks.spark.xml')\
                                .options(rowTag='s') \
                                .load('data_subtitles/2017/5052448/6963336.xml.gz')
df_sample_film.printSchema()
df_sample_film.show()

root
 |-- _id: long (nullable = true)
 |-- time: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _value: string (nullable = true)
 |-- w: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: double (nullable = true)

+---+--------------------+--------------------+
|_id|                time|                   w|
+---+--------------------+--------------------+
|  1|[[, T1S, 00:00:10...|[[", 1.1], [ahmad...|
|  2|[[, T2S, 00:00:56...|[[Well, 2.1], [,,...|
|  3|[[, T3S, 00:00:58...|[[What, 3.1], [ki...|
|  4|[[, T5S, 00:01:09...|[[Crazy, 4.1], [....|
|  5|[[, T6S, 00:01:10...|[[Got, 5.1], [me,...|
|  6|[[, T7S, 00:01:18...|[[Serious, 6.1], ...|
|  7|[[, T7E, 00:01:23...|[[I, 7.1], [feel,...|
|  8|[[, T8S, 00:01:25...|[[Alright, 8.1], ...|
|  9|[[, T8E, 00:01:29...|[[See, 9.1

In [7]:
df_sample_metadata = sqlContext.read.format('com.databricks.spark.xml')\
                                    .options(rowTag='meta') \
                                    .load('data_subtitles/2017/5052448/6963336.xml.gz')

In [9]:
df_sample_metadata.printSchema()
df_sample_metadata.show()

root
 |-- conversion: struct (nullable = true)
 |    |-- corrected_words: long (nullable = true)
 |    |-- encoding: string (nullable = true)
 |    |-- ignored_blocks: long (nullable = true)
 |    |-- sentences: long (nullable = true)
 |    |-- tokens: long (nullable = true)
 |    |-- truecased_words: long (nullable = true)
 |    |-- unknown_words: long (nullable = true)
 |-- source: struct (nullable = true)
 |    |-- duration: long (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- year: long (nullable = true)
 |-- subtitle: struct (nullable = true)
 |    |-- blocks: long (nullable = true)
 |    |-- cds: string (nullable = true)
 |    |-- confidence: double (nullable = true)
 |    |-- date: string (nullable = true)
 |    |-- duration: string (nullable = true)
 |    |-- language: string (nullable = true)

+--------------------+--------------------+--------------------+
|          conversion|              source|            subtitle|
+--------------------+------------

In [8]:
df_sample_film = df_sample_film.select('w', explode(col("w")).alias('ws'))
df_sample_film.show()

+--------------------+-----------------+
|                   w|               ws|
+--------------------+-----------------+
|[[", 1.1], [ahmad...|         [", 1.1]|
|[[", 1.1], [ahmad...|     [ahmad, 1.2]|
|[[", 1.1], [ahmad...|    [torifi, 1.3]|
|[[", 1.1], [ahmad...|         [", 1.4]|
|[[", 1.1], [ahmad...|  [subtitle, 1.5]|
|[[Well, 2.1], [,,...|      [Well, 2.1]|
|[[Well, 2.1], [,,...|         [,, 2.2]|
|[[Well, 2.1], [,,...|       [the, 2.3]|
|[[Well, 2.1], [,,...|     [thing, 2.4]|
|[[Well, 2.1], [,,...|         [I, 2.5]|
|[[Well, 2.1], [,,...|       ['ve, 2.6]|
|[[Well, 2.1], [,,...|      [been, 2.7]|
|[[Well, 2.1], [,,...|    [asking, 2.8]|
|[[Well, 2.1], [,,...|    [myself, 2.9]|
|[[Well, 2.1], [,,...|        [is, 2.1]|
|[[Well, 2.1], [,,...|      [..., 2.11]|
|[[What, 3.1], [ki...|      [What, 3.1]|
|[[What, 3.1], [ki...|     [kinda, 3.2]|
|[[What, 3.1], [ki...|      [sick, 3.3]|
|[[What, 3.1], [ki...|[individual, 3.4]|
+--------------------+-----------------+
only showing top

In [10]:
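def to_sentence(words):
    # Each element of `words` is a (word, id) struct; keep only the word itself.
    return [w[0] for w in words]

# UDF returning the sentence as a list of words.
udf_word = udf(to_sentence, ArrayType(StringType()))
# UDF returning the sentence as a single space-separated string.
udf_sentence = udf(lambda x: ' '.join([w[0] for w in x]), StringType())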

In [11]:
df_sample_film_sentence_list = df_sample_film.withColumn("sentence", udf_word("w"))
df_sample_film_sentence_list.show()

+--------------------+-----------------+--------------------+
|                   w|               ws|            sentence|
+--------------------+-----------------+--------------------+
|[[", 1.1], [ahmad...|         [", 1.1]|[", ahmad, torifi...|
|[[", 1.1], [ahmad...|     [ahmad, 1.2]|[", ahmad, torifi...|
|[[", 1.1], [ahmad...|    [torifi, 1.3]|[", ahmad, torifi...|
|[[", 1.1], [ahmad...|         [", 1.4]|[", ahmad, torifi...|
|[[", 1.1], [ahmad...|  [subtitle, 1.5]|[", ahmad, torifi...|
|[[Well, 2.1], [,,...|      [Well, 2.1]|[Well, ,, the, th...|
|[[Well, 2.1], [,,...|         [,, 2.2]|[Well, ,, the, th...|
|[[Well, 2.1], [,,...|       [the, 2.3]|[Well, ,, the, th...|
|[[Well, 2.1], [,,...|     [thing, 2.4]|[Well, ,, the, th...|
|[[Well, 2.1], [,,...|         [I, 2.5]|[Well, ,, the, th...|
|[[Well, 2.1], [,,...|       ['ve, 2.6]|[Well, ,, the, th...|
|[[Well, 2.1], [,,...|      [been, 2.7]|[Well, ,, the, th...|
|[[Well, 2.1], [,,...|    [asking, 2.8]|[Well, ,, the, th...|
|[[Well,

In [12]:
df_sample_film_sentence_string = df_sample_film.withColumn("sentence", udf_sentence("w"))
df_sample_film_sentence_string.show()

+--------------------+-----------------+--------------------+
|                   w|               ws|            sentence|
+--------------------+-----------------+--------------------+
|[[", 1.1], [ahmad...|         [", 1.1]|" ahmad torifi " ...|
|[[", 1.1], [ahmad...|     [ahmad, 1.2]|" ahmad torifi " ...|
|[[", 1.1], [ahmad...|    [torifi, 1.3]|" ahmad torifi " ...|
|[[", 1.1], [ahmad...|         [", 1.4]|" ahmad torifi " ...|
|[[", 1.1], [ahmad...|  [subtitle, 1.5]|" ahmad torifi " ...|
|[[Well, 2.1], [,,...|      [Well, 2.1]|Well , the thing ...|
|[[Well, 2.1], [,,...|         [,, 2.2]|Well , the thing ...|
|[[Well, 2.1], [,,...|       [the, 2.3]|Well , the thing ...|
|[[Well, 2.1], [,,...|     [thing, 2.4]|Well , the thing ...|
|[[Well, 2.1], [,,...|         [I, 2.5]|Well , the thing ...|
|[[Well, 2.1], [,,...|       ['ve, 2.6]|Well , the thing ...|
|[[Well, 2.1], [,,...|      [been, 2.7]|Well , the thing ...|
|[[Well, 2.1], [,,...|    [asking, 2.8]|Well , the thing ...|
|[[Well,