# A Movie behind a Script


In [1]:
import os
import re
import findspark
import pandas as pd
findspark.init()
from pyspark.sql import *
from pyspark.sql.functions import unix_timestamp, udf, to_date
from pyspark.sql.types import *
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import urllib.request

In [2]:
spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.session.timeZone', 'UTC')
sc = spark.sparkContext
sqlContext = SQLContext(sc)

## Overview of dataset

(Considering only the OpenSubtitles2018 folder.) The dataset consists of 31 GB of XML files containing subtitles, in different languages, for movies, TV shows and trailers. The data is separated into nested folders: first by language, then by year, and finally by identifier. Each identifier is supposed to be the IMDb identifier of the title, but this needs to be checked. Some year folders are not informative, however: for example, there is a folder named 0. A single film can also have multiple subtitle XML files. The decompressed XML files vary a great deal in size as well: one file can be 9000 KB and another 5 KB.
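
The folder layout described above can be turned into fields with a few lines of Python. This is only a sketch: the function name `parse_subtitle_path`, the language code `en` and the example path are assumptions made for illustration, and the real layout on disk may contain extra directory levels.

```python
from pathlib import PurePosixPath

def parse_subtitle_path(path):
    """Split '<language>/<year>/<identifier>/<file>.xml' into its parts.

    Assumes the last four path components follow the layout described
    in the text; anything before them is ignored.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components, got {path!r}")
    language, year, identifier, filename = parts[-4:]
    return {
        "language": language,
        "year": int(year),        # 0 is one of the uninformative year folders
        "identifier": identifier,
        "file": filename,
    }

# Hypothetical path, used only to show the shape of the result.
info = parse_subtitle_path("en/2017/331314/6908253.xml")
print(info)
```

Keeping the year as an integer makes it easy to filter out uninformative folders such as 0 before computing any per-year statistics.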

Each XML file has a document id and contains the following metadata, split into 3 categories: Conversion, Source, and Subtitle.

Conversion contains:
- Number of Sentences
- Number of corrected words
- Number of unknown words
- Number of tokens
- Encoding type

Source contains:
- Genre (Action, drama, horror, etc. Can have multiple)
- Year

Subtitle contains:
- Language
- Date (creation of file or release date of associated video?) -> XML file
- Duration (of video associated)
- Cds (can be 1/5, 2/3)
- Blocks
- Confidence

We can use these metadata to find different statistics that might reveal interesting information.
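
As a rough illustration of how these metadata could be collected, the sketch below flattens every field of a metadata block into a dictionary keyed by `category.field`. The tag names (`document`, `meta`, `conversion`, `source`, `subtitle` and their children) and all values in `SAMPLE` are assumptions made up for this example; they should be checked against a real file before relying on them.

```python
import xml.etree.ElementTree as ET

# Made-up document: tag names and values are assumptions, not real data.
SAMPLE = """<document id="example">
  <meta>
    <conversion>
      <sentences>3</sentences>
      <tokens>12</tokens>
    </conversion>
    <source>
      <genre>Action,Drama</genre>
      <year>2017</year>
    </source>
    <subtitle>
      <language>English</language>
      <cds>1/1</cds>
    </subtitle>
  </meta>
</document>"""

def read_metadata(xml_text):
    """Return {'category.field': text} for every field under <meta>."""
    root = ET.fromstring(xml_text)
    result = {"document_id": root.get("id")}
    meta = root.find("meta")
    if meta is None:
        return result
    for category in meta:          # e.g. conversion, source, subtitle
        for field in category:     # e.g. sentences, year, language
            key = f"{category.tag}.{field.tag}"
            result[key] = (field.text or "").strip()
    return result

metadata = read_metadata(SAMPLE)
print(metadata)
```

Because the function does not hard-code field names, it keeps working if a file is missing some fields or carries extra ones; type conversion (for example `int` for the year) can be applied afterwards.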

For the actual subtitles in each XML file, we can see that they are stored in sentences, each with a unique id (integers in increasing order starting at 1). Each sentence has a set of timestamps and a set of words. Every timestamp and word also has an id; the timestamp id has two different formats.

For timestamps, the ids are "T#S" and "T#E", where # is an increasing integer, "S" indicates start and "E" indicates end. The words between a start timestamp and the matching end timestamp are shown on screen during the interval the two timestamps delimit. **This is a great indicator of fast dialog!** 
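
One way to turn this observation into a number is to divide the token count of a block by its on-screen duration. The sketch below assumes timestamps are formatted as `HH:MM:SS,mmm`; the helper names and the example tokens are hypothetical and only illustrate the idea.

```python
def to_seconds(stamp):
    """Convert 'HH:MM:SS,mmm' (assumed format) to seconds as a float."""
    clock, _, millis = stamp.partition(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis or 0) / 1000

def words_per_second(start, end, words):
    """Tokens shown per second between a T#S and its T#E timestamp."""
    duration = to_seconds(end) - to_seconds(start)
    if duration <= 0:
        return None  # malformed or zero-length block
    return len(words) / duration

# Made-up block: four tokens displayed for two seconds.
rate = words_per_second("00:00:51,000", "00:00:53,000",
                        ["Travis", ",", "come", "here"])
print(rate)
```

Note that punctuation marks are stored as tokens too, so they may need to be filtered out before the rate is interpreted as speech speed.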

For words, the id has the form "X.Y", where X is the sentence id and Y is the position of the word within that sentence.
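
A caveat worth keeping in mind: the schema inferred further down reads the word `_id` as a double, and as a number "3.1" and "3.10" are the same value, so the first and tenth words of a sentence can no longer be told apart. A minimal sketch of parsing the id as a string instead (the helper name `split_word_id` is hypothetical):

```python
def split_word_id(word_id):
    """Split an 'X.Y' word id into (sentence_id, word_position)."""
    sentence, _, position = str(word_id).partition(".")
    return int(sentence), int(position)

# As strings the two ids stay distinct...
first = split_word_id("3.1")
tenth = split_word_id("3.10")

# ...but as floats they collapse into the same number.
collide = float("3.1") == float("3.10")
print(first, tenth, collide)
```

If the position within the sentence matters, it may therefore be safer to read the attribute as a string or to rely on the order of the words in the array.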

In [3]:
baseURL = "https://datasets.imdbws.com/"
ratings_fn = "title.ratings.tsv.gz"
basics_fn = "title.basics.tsv.gz"

In [4]:
df_ratings = pd.read_csv(baseURL + ratings_fn, sep='\t', compression='gzip')
df_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.8,1439
1,tt0000002,6.3,172
2,tt0000003,6.6,1040
3,tt0000004,6.4,102
4,tt0000005,6.2,1735


In [5]:
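# low_memory=False avoids chunked dtype inference, which triggers a DtypeWarning on mixed-type columns
df_basics = pd.read_csv(baseURL + basics_fn, sep='\t', compression='gzip', low_memory=False)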
df_basics.head()



Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
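
The `tconst` column gives a way to check whether the subtitle folder identifiers really are IMDb identifiers. The sketch below assumes a folder identifier is the numeric part of a `tconst` with the leading zeros removed; the function name, the folder ids and the small `known_tconsts` set are stand-ins chosen for illustration.

```python
def folder_id_to_tconst(folder_id):
    """Format a numeric folder id as an IMDb-style 'tt' id, zero-padded to 7 digits."""
    return f"tt{int(folder_id):07d}"

# Stand-in for the tconst column of the IMDb table (the first two values
# appear in the output above, the last one is assumed for the example).
known_tconsts = {"tt0000001", "tt0000005", "tt0331314"}

# Hypothetical folder ids as they might appear on disk.
folder_ids = ["331314", "5", "9999999"]

matches = {fid: folder_id_to_tconst(fid) in known_tconsts for fid in folder_ids}
print(matches)
```

With the real tables, `set(df_basics['tconst'])` would play the role of `known_tconsts`, and the share of unmatched folder ids gives a first estimate of how reliable the identifiers are.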


In [6]:
df_sample_film = sqlContext.read.format('com.databricks.spark.xml')\
                                .options(rowTag='s') \
                                .load('data_subtitles/2017/331314/6908253.xml/6908253.xml')

In [7]:
df_sample_film.printSchema()
df_sample_film.show()

root
 |-- _id: long (nullable = true)
 |-- time: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- _value: string (nullable = true)
 |-- w: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _alternative: string (nullable = true)
 |    |    |-- _id: double (nullable = true)
 |    |    |-- _initial: string (nullable = true)

+---+--------------------+--------------------+
|_id|                time|                   w|
+---+--------------------+--------------------+
|  1|[[, T1S, 00:00:51...|[[Travis,, 1.1,],...|
|  2|[[, T2S, 00:00:53...|[[Travis,, 2.1,],...|
|  3|[[, T3S, 00:00:54...|[[What,, 3.1,], [...|
|  4|[[, T4S, 00:01:04...|[[You,, 4.1,], [j...|
|  5|[[, T5S, 00:01:08...|[[Travis,, 5.1,],...|
|  6|[[, T6S, 00:01:10...|[[I,, 6.1,], ['ll...|
|  7|[[, T7S, 00:01:13...|[[