# Processing Logic
- Download the IMDb datasets: name.basics.tsv, title.principals.tsv, title.akas.tsv, title.basics.tsv, and title.ratings.tsv.
- The dataset files are available from https://datasets.imdbws.com/.
- Apply the filters below and prepare the output of actor and movie IDs.
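
The download step can be sketched as follows. This is a minimal sketch, not the method used for this notebook: it assumes each file is published as `<name>.tsv.gz` under the URL above, and the helper names (`dataset_url`, `gunzip`, `download_all`) are hypothetical.

```python
import gzip
import shutil
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://datasets.imdbws.com/"
DATASETS = ["name.basics", "title.principals", "title.akas",
            "title.basics", "title.ratings"]


def dataset_url(name: str) -> str:
    """Return the assumed download URL of one compressed dataset file."""
    return f"{BASE_URL}{name}.tsv.gz"


def gunzip(src: Path, dest: Path) -> None:
    """Decompress a .gz file into dest."""
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def download_all(target_dir: Path) -> None:
    """Download and decompress every dataset (requires network access)."""
    target_dir.mkdir(parents=True, exist_ok=True)
    for name in DATASETS:
        gz_path = target_dir / f"{name}.tsv.gz"
        urlretrieve(dataset_url(name), gz_path)
        gunzip(gz_path, target_dir / f"{name}.tsv")
```

Calling `download_all(Path('../sourceFiles'))` would place the uncompressed `.tsv` files where the cells below expect them.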

## Filters

- Actors born after 1940 with at least 4 titles listed in knownForTitles
- Title region is US
- Title type is movie
- More than 2000 user votes


## Download and Process Actors Information

***name.basics.tsv.gz – Contains the following information for names:***

- nconst (string) - alphanumeric unique identifier of the name/person
- primaryName (string) – name by which the person is most often credited
- birthYear – in YYYY format
- deathYear – in YYYY format if applicable, else '\N'
- primaryProfession (array of strings) – the top-3 professions of the person
- knownForTitles (array of tconsts) – titles the person is known for
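
The array-valued columns arrive as comma-separated strings. A minimal sketch of turning `knownForTitles` into lists and counting entries is shown below; the rows are made up for illustration. Note that counting commas and adding 1 reports one title for a `'\N'` entry, whereas this variant reports zero.

```python
import pandas as pd

# Toy rows in the shape of name.basics.tsv (IDs are illustrative only).
names = pd.DataFrame({
    "nconst": ["nm0000001", "nm0000002", "nm0000003"],
    "knownForTitles": [
        "tt0000001,tt0000002,tt0000003,tt0000004",
        "tt0000005",
        "\\N",
    ],
})

# Treat the '\N' placeholder as missing before splitting.
known = names["knownForTitles"].mask(names["knownForTitles"] == "\\N")

# Split comma-separated strings into lists; missing values stay missing.
names["knownForList"] = known.str.split(",")

# Number of titles per person, with 0 for missing entries.
names["titleCount"] = names["knownForList"].str.len().fillna(0).astype(int)

print(names[["nconst", "titleCount"]])
```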

***Following Filters Applied***

- Actors born after 1940
- Primary profession contains 'act' (actor or actress)
- At least 4 titles in knownForTitles


In [1]:
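import pandas as pd

# Load the names dataset; missing values are stored as the literal string '\N'.
actors = pd.read_csv('../sourceFiles/name.basics.tsv', sep='\t', header=0)
print("Actor Count before Filter:", len(actors))

# Drop unknown birth years, then keep actors/actresses born after 1940.
# birthYear is compared as a string, which is only reliable for four-digit years.
actors = actors[actors.birthYear != '\\N']
actors = actors[(actors.birthYear > '1940') & (actors.primaryProfession.str.contains('act', na=False))].copy()

# Count the comma-separated entries in knownForTitles.
actors['titleCount'] = actors['knownForTitles'].str.count(',') + 1
print("Actor Count after Filter:", len(actors))  # printed before the title-count filter

# Keep actors with at least 4 known-for titles and only the columns needed later.
actors = actors[actors.titleCount > 3]
actors = actors[['nconst', 'primaryName']]
actors.head()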

Actor Count before Filter: 11359935
Actor Count after Filter: 220915


|    | nconst    | primaryName       |
|----|-----------|-------------------|
| 3  | nm0000004 | John Belushi      |
| 28 | nm0000029 | Margaux Hemingway |
| 83 | nm0000084 | Gong Li           |
| 86 | nm0000087 | Elena Koreneva    |
| 90 | nm0000091 | Gérard Pirès      |
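
The cell above compares `birthYear` as a string. That works for four-digit years, but string comparison is lexicographic, so a shorter year would sort after `'1940'`. The sketch below shows a numeric alternative on made-up rows; it is an illustration of the difference, not the code that produced the counts above.

```python
import pandas as pd

people = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3", "nm4"],  # hypothetical IDs
    "birthYear": ["1949", "721", "\\N", "1935"],
})

# Lexicographic comparison: '721' and '\N' both sort after '1940'.
string_mask = people["birthYear"] > "1940"

# Numeric comparison: values that cannot be parsed become NaN and compare False.
birth_year = pd.to_numeric(people["birthYear"], errors="coerce")
numeric_mask = birth_year > 1940

print(people[string_mask]["nconst"].tolist())   # string-based result
print(people[numeric_mask]["nconst"].tolist())  # numeric result
```

With the numeric version, the separate step that drops `'\N'` rows is no longer needed, because missing years never satisfy the comparison.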


## Download and Process Principal Actors for a Given Movie

***title.principals.tsv.gz – Contains the principal cast/crew for titles***
- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'
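
In the sample output further down, the `characters` value looks like a JSON-style array stored as text (for example `["Self"]`). Assuming that holds for the whole column, it could be decoded as sketched here; `parse_characters` is a hypothetical helper, not part of the notebook.

```python
import json

import pandas as pd


def parse_characters(value):
    """Decode the characters field; return [] for the '\\N' placeholder."""
    if pd.isna(value) or value == "\\N":
        return []
    return json.loads(value)


princ_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001"],
    "nconst": ["nm1588970", "nm0005690"],
    "category": ["self", "director"],
    "characters": ['["Self"]', "\\N"],
})

princ_sample["characterList"] = princ_sample["characters"].apply(parse_characters)
print(princ_sample[["nconst", "characterList"]])
```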


In [2]:
import pandas as pd
princ = pd.read_csv('../sourceFiles/title.principals.tsv', sep='\t', header=0)
princ.head()

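|   | tconst    | ordering | nconst    | category        | job                     | characters |
|---|-----------|----------|-----------|-----------------|-------------------------|------------|
| 0 | tt0000001 | 1        | nm1588970 | self            | \N                      | ["Self"]   |
| 1 | tt0000001 | 2        | nm0005690 | director        | \N                      | \N         |
| 2 | tt0000001 | 3        | nm0374658 | cinematographer | director of photography | \N         |
| 3 | tt0000002 | 1        | nm0721526 | director        | \N                      | \N         |
| 4 | tt0000002 | 2        | nm1335271 | composer        | \N                      | \N         |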


In [3]:
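# Keep acting credits only ('act' matches the actor and actress categories).
princ = princ[princ.category.str.contains('act')][['nconst', 'tconst']]
# Attach every acting credit to the filtered actors.
actors = actors.merge(princ, on='nconst', how='inner')
print(len(actors))
actors.head()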

7515381


|   | nconst    | primaryName  | tconst    |
|---|-----------|--------------|-----------|
| 0 | nm0000004 | John Belushi | tt0076816 |
| 1 | nm0000004 | John Belushi | tt0077621 |
| 2 | nm0000004 | John Belushi | tt0077975 |
| 3 | nm0000004 | John Belushi | tt0078723 |
| 4 | nm0000004 | John Belushi | tt0079660 |
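
The principals file is large, and only two of its columns survive the acting filter. If memory is a concern, the same filter could be applied while reading in chunks. The sketch below uses an in-memory stand-in for the file; the category values and the chunk size are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Stand-in for title.principals.tsv; a file path would be used in practice.
sample_tsv = io.StringIO(
    "tconst\tordering\tnconst\tcategory\tjob\tcharacters\n"
    "tt0000001\t1\tnm1588970\tself\t\\N\t\\N\n"
    "tt0000001\t2\tnm0005690\tdirector\t\\N\t\\N\n"
    "tt0076816\t1\tnm0000004\tactor\t\\N\t\\N\n"
    "tt0077621\t1\tnm0005460\tactress\t\\N\t\\N\n"
)

wanted = []
reader = pd.read_csv(
    sample_tsv,
    sep="\t",
    usecols=["tconst", "nconst", "category"],  # skip columns that are never used
    chunksize=2,                               # tiny here; far larger for the real file
)
for chunk in reader:
    # Keep acting credits from this chunk only.
    is_acting = chunk["category"].str.contains("act", na=False)
    wanted.append(chunk.loc[is_acting, ["nconst", "tconst"]])

acting = pd.concat(wanted, ignore_index=True)
print(acting)
```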


## Download and Process Titles

***title.akas.tsv.gz - Contains the following information for titles:***

- titleId (string) - a tconst, an alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given titleId
- title (string) – the localized title
- region (string) - the region for this version of the title
- language (string) - the language of the title
- types (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes (array) - Additional terms to describe this alternative title, not enumerated
- isOriginalTitle (boolean) – 0: not original title; 1: original title

***Following Filters Applied***

- Region is US

In [4]:
import pandas as pd
titles = pd.read_csv('../sourceFiles/title.akas.tsv', sep='\t', header=0)
titles = titles[titles.region == 'US']
titles.head()


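|    | titleId   | ordering | title                  | region | language | types       | attributes                 | isOriginalTitle |
|----|-----------|----------|------------------------|--------|----------|-------------|----------------------------|-----------------|
| 5  | tt0000001 | 6        | Carmencita             | US     | \N       | imdbDisplay | \N                         | 0               |
| 14 | tt0000002 | 7        | The Clown and His Dogs | US     | \N       | \N          | literal English title      | 0               |
| 35 | tt0000005 | 1        | Blacksmithing Scene    | US     | \N       | alternative | \N                         | 0               |
| 39 | tt0000005 | 5        | Blacksmith Scene #1    | US     | \N       | alternative | \N                         | 0               |
| 40 | tt0000005 | 6        | Blacksmithing          | US     | \N       | \N          | informal alternative title | 0               |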


## Download and Process Title Basic Info

***title.basics.tsv.gz - Contains the following information for titles:***

- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. ‘\N’ for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

***Following Filters Applied***

- Title type is movie

In [5]:
import pandas as pd
basics = pd.read_csv('../sourceFiles/title.basics.tsv', sep='\t', header=0)
# Keep movie IDs only, renaming tconst to titleId so it matches title.akas.tsv.
basics = basics[basics.titleType == 'movie'][['tconst']]
basics.columns = ['titleId']
basics.head()


|     | titleId   |
|-----|-----------|
| 498 | tt0000502 |
| 570 | tt0000574 |
| 587 | tt0000591 |
| 610 | tt0000615 |
| 625 | tt0000630 |


In [6]:
# Keep only the US titles that are movies.
titles = titles.merge(basics, on='titleId', how='inner')
del basics
print(len(titles))
titles.head()

301354


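|   | titleId   | ordering | title                          | region | language | types       | attributes | isOriginalTitle |
|---|-----------|----------|--------------------------------|--------|----------|-------------|------------|-----------------|
| 0 | tt0000574 | 6        | The Story of the Kelly Gang    | US     | \N       | imdbDisplay | \N         | 0               |
| 1 | tt0000591 | 3        | The Prodigal Son               | US     | \N       | \N          | \N         | 0               |
| 2 | tt0000630 | 4        | Hamlet                         | US     | \N       | \N          | \N         | 0               |
| 3 | tt0000679 | 3        | The Fairylogue and Radio-Plays | US     | \N       | imdbDisplay | \N         | 0               |
| 4 | tt0000886 | 2        | Hamlet, Prince of Denmark      | US     | \N       | \N          | \N         | 0               |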


In [7]:
# titles is already restricted to the US region, so only deduplicate the IDs
# and name the column tconst to match the other datasets.
df_titles = pd.DataFrame({'tconst': titles['titleId'].unique()})
del titles
df_titles.head()

|   | tconst    |
|---|-----------|
| 0 | tt0000574 |
| 1 | tt0000591 |
| 2 | tt0000630 |
| 3 | tt0000679 |
| 4 | tt0000886 |
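
The region and title-type filters amount to intersecting two sets of IDs. As an alternative to merging full frames and then deduplicating, the intersection can be taken directly with `isin`. This is a sketch on made-up rows, offered as one possible design rather than the notebook's approach.

```python
import pandas as pd

# Illustrative rows only; titleType values are assumptions for the example.
akas = pd.DataFrame({
    "titleId": ["tt0000001", "tt0000574", "tt0000574", "tt0000591"],
    "region": ["US", "US", "AU", "US"],
})
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000574", "tt0000591"],
    "titleType": ["short", "movie", "movie"],
})

movie_ids = basics.loc[basics["titleType"] == "movie", "tconst"]
us_ids = akas.loc[akas["region"] == "US", "titleId"]

# Unique US title IDs that are also movies, in first-seen order.
us_movies = pd.DataFrame({"tconst": us_ids[us_ids.isin(movie_ids)].unique()})
print(us_movies)
```

Working with single ID columns avoids carrying the unused title text through the join.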


## Download and Process Ratings Info

***title.ratings.tsv.gz – Contains the IMDb rating and votes information for titles***

- tconst (string) - alphanumeric unique identifier of the title
- averageRating – weighted average of all the individual user ratings
- numVotes - number of votes the title has received

***Following Filters Applied***

- numVotes greater than 2000

In [8]:
import pandas as pd
ratings = pd.read_csv('../sourceFiles/title.ratings.tsv', sep='\t', header=0)
print(len(ratings))
ratings.head()

1201921


|   | tconst    | averageRating | numVotes |
|---|-----------|---------------|----------|
| 0 | tt0000001 | 5.7           | 1834     |
| 1 | tt0000002 | 6.0           | 236      |
| 2 | tt0000003 | 6.5           | 1595     |
| 3 | tt0000004 | 6.0           | 153      |
| 4 | tt0000005 | 6.2           | 2411     |


In [9]:
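# Keep titles with more than 2000 votes.
ratings = ratings[ratings.numVotes > 2000]
print(len(ratings))
# Intersect the US movie IDs with the titles that passed the vote filter.
ratings = ratings[['tconst']]
df_titles = df_titles.merge(ratings, on='tconst', how='inner')
print(len(df_titles))
df_titles.head()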

41363
22978


|   | tconst    |
|---|-----------|
| 0 | tt0002130 |
| 1 | tt0002844 |
| 2 | tt0003419 |
| 3 | tt0003740 |
| 4 | tt0004707 |
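
The vote filter uses a strict comparison, so a title with exactly 2000 votes is excluded. The short sketch below makes the boundary behavior explicit; the third row is a hypothetical title added to hit the boundary.

```python
import pandas as pd

ratings_sample = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000005", "tt9999999"],  # last ID is hypothetical
    "numVotes": [1834, 2411, 2000],
})

# Strict threshold, as used in the cell above.
strictly_more = ratings_sample[ratings_sample["numVotes"] > 2000]["tconst"].tolist()

# Inclusive threshold, which would also keep the boundary title.
at_least = ratings_sample[ratings_sample["numVotes"] >= 2000]["tconst"].tolist()

print(strictly_more)
print(at_least)
```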


In [10]:
# Keep only actor-title pairs whose title passed every title filter.
actors = actors.merge(df_titles, on='tconst', how='inner')
actors = actors.reset_index(drop=True)
print(len(actors))
actors.head()

57041


|   | nconst    | primaryName      | tconst    |
|---|-----------|------------------|-----------|
| 0 | nm0000004 | John Belushi     | tt0077621 |
| 1 | nm0005460 | Mary Steenburgen | tt0077621 |
| 2 | nm0000004 | John Belushi     | tt0077975 |
| 3 | nm0000261 | Karen Allen      | tt0077975 |
| 4 | nm0001371 | Tom Hulce        | tt0077975 |


In [11]:
print("Total Movies : ",len(actors['tconst'].unique()))
print("Total Actors : ",len(actors['nconst'].unique()))

Total Movies :  18657
Total Actors :  17435
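
The same totals can be obtained with `nunique()`, and a `groupby` gives the number of qualifying movies per actor, which may be a useful sanity check on the output. The sketch below reuses the five rows shown in the previous output.

```python
import pandas as pd

pairs = pd.DataFrame({
    "nconst": ["nm0000004", "nm0005460", "nm0000004", "nm0000261", "nm0001371"],
    "primaryName": ["John Belushi", "Mary Steenburgen", "John Belushi",
                    "Karen Allen", "Tom Hulce"],
    "tconst": ["tt0077621", "tt0077621", "tt0077975", "tt0077975", "tt0077975"],
})

total_movies = pairs["tconst"].nunique()
total_actors = pairs["nconst"].nunique()

# Distinct movies per actor, most prolific first.
movies_per_actor = (
    pairs.groupby("nconst")["tconst"].nunique().sort_values(ascending=False)
)

print("Total Movies :", total_movies)
print("Total Actors :", total_actors)
print(movies_per_actor.head())
```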


In [12]:
# actors = actors[:100]  # optional: uncomment to keep only the first 100 rows for a quick test run

## Write File

In [13]:
actors.to_csv('../files/actorsOrig.csv',header=True,index=False)

In [14]:
print(len(actors))

57041
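
A quick way to confirm the written file is intact is to read it back and compare it with the frame in memory. The sketch below does this with a temporary directory and three sample rows; the real check would point at `../files/actorsOrig.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

pairs = pd.DataFrame({
    "nconst": ["nm0000004", "nm0005460", "nm0000261"],
    "primaryName": ["John Belushi", "Mary Steenburgen", "Karen Allen"],
    "tconst": ["tt0077621", "tt0077621", "tt0077975"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "actorsOrig.csv"
    pairs.to_csv(out_path, header=True, index=False)
    # Read every column back as text so IDs are not reinterpreted.
    reloaded = pd.read_csv(out_path, dtype=str)

same_columns = list(reloaded.columns) == list(pairs.columns)
same_rows = reloaded.values.tolist() == pairs.values.tolist()
print("Round trip OK:", same_columns and same_rows)
```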
