The first step is to get the movie IDs from the tmdb.tsv file.

In [None]:
import re

import findspark
findspark.init()
import pyspark

sc = pyspark.SparkContext()

# tmdb.tsv.1 is my copy of the original tmdb.tsv; change the path to "data/tmdb.tsv" if you are running locally.
tmbdRDD = sc.textFile("data/tmdb.tsv.1")

# Take the imdb_id column (second tab-separated field), keep only values that start
# with "tt" (this also skips the header row), and strip everything but the digits.
imdb_ids = (tmbdRDD.map(lambda line: line.split("\t")[1])
                   .filter(lambda word: word.startswith("tt"))
                   .map(lambda word: re.sub(r"\D", "", word))
                   .collect())
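
To make the extraction step concrete, here is a minimal plain-Python sketch of what the Spark pipeline does to each row. The sample rows are hypothetical: only the positions of imdb_id (second field) and the title (third field) come from the code above, and the first column is a placeholder. `extract_imdb_id` is an illustrative helper, not part of the notebook.

```python
import re

# Hypothetical rows; the first column is a made-up placeholder.
sample_rows = [
    "id\timdb_id\ttitle",
    "1\ttt2771200\tBeauty and the Beast",
    "2\ttt3315342\tLogan",
    "3\t\tUntitled",
]

def extract_imdb_id(row):
    """Return the numeric part of the imdb_id field, or None if the row has no usable id."""
    fields = row.split("\t")
    if len(fields) < 2 or not fields[1].startswith("tt"):
        return None
    return re.sub(r"\D", "", fields[1])

# The header row and the row with an empty imdb_id are both dropped.
sample_ids = [x for x in map(extract_imdb_id, sample_rows) if x is not None]
print(sample_ids)
```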

I've tried multiple methods to get movie information. These are: 

Method 1. Use these IDs to query IMDb and get each movie's summary (which contains the genre information).

In [None]:
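from imdb import IMDb

ia = IMDb()
# Write one line per movie: the IMDb id followed by the summary fields.
with open('data/imdb_movie_summary.txt', 'w') as imdbMoviefile:
    for i in imdb_ids:
        movie = ia.get_movie(str(i))
        # Drop the first two lines of the summary text and keep the remaining fields.
        imdbMoviefile.write("%s\n" % (['Id: ' + str(i)] + str(movie.summary()).splitlines()[2:]))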

Summary looks like this:

['Id: 2771200', 'Title: Beauty and the Beast (2017)', 'Genres: Family, Fantasy, Musical, Romance.', 'Director: Bill Condon.', 'Writer: Stephen Chbosky, Evan Spiliotopoulos, Jeanne-Marie Leprince de Beaumont.', 'Cast: Emma Watson (Belle), Dan Stevens (Beast), Luke Evans (Gaston), Josh Gad (LeFou), Kevin Kline (Maurice).', 'Runtime: 129.', 'Country: USA.', 'Language: English.', 'Rating: 7.8 (62624 votes).', "Plot: Disney's animated classic takes on a new form, with a widened mythology and an all-star cast. A young prince, imprisoned in the form of a beast, can be freed only by true love. What may be his only opportunity arrives when he meets Belle, the only human girl to ever visit the castle since it was enchanted."]

['Id: 3315342', 'Title: Logan (2017)', 'Genres: Action, Drama, Sci-Fi, Thriller.', 'Director: James Mangold.', 'Writer: James Mangold, Scott Frank, James Mangold, Michael Green, Craig Kyle, John Romita Sr, Roy Thomas, Herb Trimpe, Len Wein, Christopher Yost.', 'Cast: Hugh Jackman (Logan), Patrick Stewart (Charles), Dafne Keen (Laura), Boyd Holbrook (Pierce), Stephen Merchant (Caliban).', 'Runtime: 141, China:122::(Mainland China Censored Version).', 'Country: USA.', 'Language: English, Spanish.', 'Rating: 8.5 (196768 votes).', 'Plot: In 2029 the mutant population has shrunk significantly and the X-Men have disbanded. Logan, whose power to self-heal is dwindling, has surrendered himself to alcohol and now earns a living as a chauffeur. He takes care of the ailing old Professor X whom he keeps hidden away. One day, a female stranger asks Logan to drive a girl named Laura to the Canadian border. At first he refuses, but the Professor has been waiting for a long time for her to appear. Laura possesses an extraordinary fighting prowess and is in many ways like Wolverine. She is pursued by sinister figures working for a powerful corporation; this is because her DNA contains the secret that connects her to Logan. A relentless pursuit begins - In this third cinematic outing featuring the Marvel comic book character Wolverine we see the superheroes beset by everyday problems. They are ageing, ailing and struggling to survive financially. A decrepit Logan is forced to ask himself if he can or even wants to put his remaining powers to good use. It would appear that in the near-future, the times in which they were able put the world to rights with razor sharp claws and telepathic powers are now over.']
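
Since each line of the summary file is the printed form of a Python list, the genre information can be read back with the standard library. The sketch below is one possible way to do it, not something the notebook itself does: `sample_line` is a shortened copy of the Logan line above, and `parse_summary_line` and `genres_of` are hypothetical helper names.

```python
import ast

# Shortened copy of one line from data/imdb_movie_summary.txt (most fields omitted).
sample_line = ("['Id: 3315342', 'Title: Logan (2017)', "
               "'Genres: Action, Drama, Sci-Fi, Thriller.', "
               "'Country: USA.', 'Language: English, Spanish.']")

def parse_summary_line(line):
    """Turn one written line back into a {field: value} dict."""
    record = {}
    for item in ast.literal_eval(line):
        # Split on the first ": " only, so values containing colons stay intact.
        key, _, value = item.partition(": ")
        record[key] = value
    return record

def genres_of(record):
    """Split the 'Genres' field into a list, dropping the trailing period."""
    raw = record.get("Genres", "").rstrip(".")
    return [g.strip() for g in raw.split(",") if g.strip()]

record = parse_summary_line(sample_line)
print(record["Id"], genres_of(record))
```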


Method 2. Unfortunately, the above method takes a long time, so I tried another approach mentioned in the IMDbPY documentation. A Python script provided in their development repository loads the entire IMDb database into a SQL database. I used the script to populate a local MySQL database (on a Mac) and then used IMDbPY to get the relevant movie information.

Script: https://bitbucket.org/alberanid/imdbpy/src/f2762bef1563c3d4a169868ede61e89855edaff4/bin/imdbpy2sql.py?at=default&fileviewer=file-view-default

Unfortunately, the SQL API for IMDbPY does not provide a direct way to query a movie by imdb_id. However, one can search by movie name, iterate over the results, and get the imdb_id of each result.

In [None]:
imdb_name = (tmbdRDD.map(lambda lines: lines.split("\t")[2])
                   .collect())

The code below iterates over the name of each movie in the TMDB database and, for every search result whose imdb_id is in our list, writes a dictionary of all the movie's information, including its SQL id and imdb_id.

In [None]:
from imdb import IMDb

i = IMDb('sql', uri='mysql://imdb:imdb@localhost/imdb')
wanted_ids = set(imdb_ids)  # set membership is much faster than scanning a list

with open('data/imdb_movie_summary_sql.txt', 'w') as imdbMoviefile:
    for imdb_n in imdb_name:
        # A title search can return many candidates; keep only those whose imdb_id came from TMDB.
        for s in i.search_movie(imdb_n):
            imdb_id = i.get_imdbID(s)
            if imdb_id in wanted_ids:
                movie = i.get_movie(s.movieID)
                movie.update({"SQL_ID": str(s.movieID)})
                movie.update({"IMDB_ID": str(imdb_id)})
                imdbMoviefile.write("%s\n" % str(movie.items()))

The dictionary looks like this:

[('rating', 7.7), ('SQL_ID', '4081894'), ('writer', [<Person id:818505[sql] name:_Ghosh, Rituparno_>]), ('producer', [<Person id:5448295[sql] name:_Biswas, Tapan (I)_>, <Person id:3017608[sql] name:_Ghosh, Sutapa_>]), ('votes', 252), ('IMDB_ID', '0356129'), ('director', [<Person id:818505[sql] name:_Ghosh, Rituparno_>]), ('votes distribution', u'0..0001212'), ('cinematographer', [<Person id:1610641[sql] name:_Mukhopadhyay, Avik_>]), ('composer', [<Person id:1558850[sql] name:_Mishra, Debajyoti_>]), ('year', 2002), ('miscellaneous crew', [<Person id:4939827[sql] name:_Ghosh, Nirmal (II)_>, <Person id:3199130[sql] name:_Karlekar, Madhuchhanda_>, <Person id:5159311[sql] name:_Mukhopadhyay, Alok (I)_>]), ('akas', [u'The First Monsoon Day (2002)::(International: English title)']), ('color info', [u'Color']), ('connections', {'references': [<Movie id:2982037[sql] title:_Aradhana (1969)_>]}), ('genres', [u'Drama']), ('costume designer', [<Person id:4343542[sql] name:_Das, Sabarni_>]), ('title', u'Titli'), ('kind', 'movie'), ('languages', [u'Bengali']), ('cast', [<Person id:3714752[sql] name:_Sen, Aparna_>, <Person id:385319[sql] name:_Chakraborty, Mithun (I)_>, <Person id:576359[sql] name:_Dey, Dipankar (I)_>, <Person id:3722728[sql] name:_Sharma, Konkona Sen_>, <Person id:3017572[sql] name:_Ghosh, Rukkmini_>]), ('editor', [<Person id:4514805[sql] name:_Mitra, Arghakamal_>]), ('production companies', [<Company id:150502[sql] name:_Cinemawalla [in]_>]), ('countries', [u'India']), ('canonical title', u'Titli'), ('long imdb title', u'Titli (2002)'), ('long imdb canonical title', u'Titli (2002)'), ('smart canonical title', u'Titli'), ('smart long imdb canonical title', u'Titli (2002)')]

Unfortunately, the above method also turned out to be quite slow. Searching by movie name returns many similar movies, and going over each of them takes a long time.

Method 3. I also tried another method suggested by Andrew. Assuming the imdb_ids are already present in the tables:
A) get the imdb_ids corresponding to all sql_ids
B) filter the imdb_ids we care about
C) use the sql_ids for those to get the movie information.
Here is sample code for the first step, A):

In [None]:
# Build a list of (sql_id, imdb_id) pairs by probing every candidate SQL id.
imdb_id_list = []
for k in range(1, 5000000):
    movie_id = i.get_imdbMovieID(k)
    if movie_id:
        imdb_id_list.append((k, movie_id))
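
Steps B and C could then be sketched roughly as follows. This is an illustration under assumptions, not code from the notebook: the first pair reuses the SQL_ID and IMDB_ID shown in the dictionary output above, the second pair and the `wanted` set are made up, and the final database call is left as a comment because it needs the local SQL database.

```python
# Hypothetical output of step A: (sql_id, imdb_id) pairs.
imdb_id_pairs = [
    (4081894, "0356129"),   # values taken from the dictionary output above
    (10, "0000001"),        # made-up pair that we do not care about
]

# Hypothetical stand-in for the imdb_ids collected from TMDB.
wanted = {"0356129", "2771200"}

# Step B: keep only the pairs whose imdb_id we care about.
sql_ids_to_fetch = {imdb_id: sql_id
                    for sql_id, imdb_id in imdb_id_pairs
                    if imdb_id in wanted}
print(sql_ids_to_fetch)

# Step C would then fetch each movie by its SQL id, for example:
# for imdb_id, sql_id in sql_ids_to_fetch.items():
#     movie = i.get_movie(sql_id)
```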

Unfortunately, this method is also very slow. After investigating, I found that the Python script that generates the SQL database deliberately does not import the IMDb ids. Within the IMDb SQL database, the table "title" has an "imdb_id" column, but it is initially always null. It only gets populated in the database by the first (HTTP) query for a movie's id; subsequent lookups through the API reuse the stored imdb_id.

Thus the above code also needs to make an HTTP call for every movie id. In conclusion, we are using the "dictionary" approach from Method 2 to get movie information.
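
The populate-on-first-query behaviour described above can be pictured with a toy model. This is only a sketch of the general pattern, not IMDbPY's actual implementation: `LazyIdCache`, `fake_remote`, and the `remote_calls` counter are all hypothetical, and a dictionary lookup stands in for the HTTP request.

```python
class LazyIdCache:
    """Toy model of a column that starts out empty and is filled on first lookup."""

    def __init__(self, remote_lookup):
        self._remote_lookup = remote_lookup  # stands in for the HTTP query
        self._column = {}                    # stands in for the title.imdb_id column
        self.remote_calls = 0

    def get(self, sql_id):
        if sql_id not in self._column:
            # First lookup: pay for the slow remote call and store the result.
            self.remote_calls += 1
            self._column[sql_id] = self._remote_lookup(sql_id)
        # Later lookups are served from the stored value.
        return self._column[sql_id]

fake_remote = {4081894: "0356129"}.get
cache = LazyIdCache(fake_remote)
first = cache.get(4081894)
second = cache.get(4081894)
print(first, second, cache.remote_calls)
```

In this model a loop over millions of ids pays the remote cost once per id, which matches why the first pass is slow.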

Method 4. The code below appends movie information to the dictionary file populated above.

In [None]:
import re

# Read the dictionary file written earlier, one movie per line.
with open("data/imdb_movie_summary_sql.txt") as movie_file:
    lines = [line.strip() for line in movie_file]

# Collect the IMDb ids that have already been written to the file.
already_pop_ids = []
for line in lines:
    for l in line.split("), ("):
        if "IMDB_ID" in l:
            already_pop_ids.append(re.sub(r"\D", "", l))

# Remove the already-populated ids from the original imdb_ids (taken from TMDB)
# and broadcast the remaining set to the Spark workers.
imdb_ids_to_pop = sc.broadcast(set(imdb_ids).difference(already_pop_ids))

def needs_population(line):
    """True if the row's imdb_id starts with "tt" and has not been populated yet."""
    imdb_id = line.split("\t")[1]
    return imdb_id.startswith("tt") and re.sub(r"\D", "", imdb_id) in imdb_ids_to_pop.value

# Get the movie names for the remaining imdb_ids.
imdb_name_1 = (tmbdRDD.filter(needs_population)
                      .map(lambda line: line.split("\t")[2])
                      .collect())
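
As a possible alternative to splitting each line on "), (", a targeted regular expression can pull the IMDB_ID value out directly. The sketch below is an assumption-laden illustration: `sample_line` is a shortened copy of the dictionary output shown earlier, `tmdb_ids` is a hypothetical stand-in for imdb_ids, and `IMDB_ID_PATTERN` and `imdb_id_from_line` are names introduced here.

```python
import re

# Matches the ('IMDB_ID', '<digits>') tuple as it is written to the file.
IMDB_ID_PATTERN = re.compile(r"\('IMDB_ID', '(\d+)'\)")

def imdb_id_from_line(line):
    """Return the IMDB_ID stored in one written line, or None if it is missing."""
    match = IMDB_ID_PATTERN.search(line)
    return match.group(1) if match else None

# Shortened copy of one line from data/imdb_movie_summary_sql.txt.
sample_line = ("[('rating', 7.7), ('SQL_ID', '4081894'), ('votes', 252), "
               "('IMDB_ID', '0356129'), ('year', 2002)]")

already_done = {imdb_id_from_line(sample_line)}

# Hypothetical stand-in for the imdb_ids collected from TMDB.
tmdb_ids = {"0356129", "2771200", "3315342"}
remaining = sorted(tmdb_ids - already_done)
print(remaining)
```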

In [None]:
from imdb import IMDb

i = IMDb('sql', uri='mysql://imdb:imdb@localhost/imdb')
movies_populated = set()  # the built-in set replaces the deprecated sets.Set

# Append the movie information to the previous file.
with open('data/imdb_movie_summary_sql.txt', 'a') as imdbMoviefile:
    for imdb_n in imdb_name_1:
        search = i.search_movie(imdb_n)
        for s in search:
            imdb_id = i.get_imdbID(s)
            # Skip results that are not wanted or were already written in this run.
            if imdb_id in imdb_ids_to_pop.value and imdb_id not in movies_populated:
                movie = i.get_movie(s.movieID)
                movie.update({"SQL_ID": str(s.movieID)})
                movie.update({"IMDB_ID": str(imdb_id)})
                imdbMoviefile.write("%s\n" % str(movie.items()))
                print(movie)
                movies_populated.add(imdb_id)

print("population complete")