# Statistical Programming Python
## IMDB assignment
### Frederico Ferreira Andrade - MBD2

Importing our libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
%matplotlib inline

### Q1. Open the dataset as a pandas DataFrame named imdb: please note that the file is .tsv type. Investigate what this is, and how to pass a different separator value sep="" when using pd.read_csv

In [5]:
# A .tsv file is tab-separated, so pass sep="\t" to pd.read_csv
IMDB = pd.read_csv("imdb_2019.tsv", sep="\t")
IMDB.head(5)

| | index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87501.0 | tt0089435 | short | Kokoa | Kokoa | 0.0 | 2019.0 | | 13.0 | |
| 1 | 89512.0 | tt0091490 | short | Martina's Playhouse | Martina's Playhouse | 0.0 | 2019.0 | | 20.0 | |
| 2 | 114407.0 | tt0116991 | movie | Mariette in Ecstasy | Mariette in Ecstasy | 0.0 | 2019.0 | | | |
| 3 | 126556.0 | tt0129960 | tvMovie | Eine geschlossene Gesellschaft | Eine geschlossene Gesellschaft | 0.0 | 2019.0 | | | |
| 4 | 166388.0 | tt0172112 | short | Ambulans | Ambulans | 0.0 | 2019.0 | | 11.0 | |
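
A .tsv file stores one record per line with fields separated by tab characters instead of commas. The sketch below illustrates what `sep="\t"` changes; it uses a small in-memory sample (hypothetical rows, not taken from `imdb_2019.tsv`) so it can run without the data file.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for a tab-separated file.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\n"
    "tt0000001\tshort\tExample One\n"
    "tt0000002\tmovie\tExample Two\n"
)

# sep="\t" tells read_csv to split each line on tabs.
demo = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# With the default comma separator, each line is read as one single field.
wrong = pd.read_csv(io.StringIO(sample_tsv))

print(demo.shape)   # three columns
print(wrong.shape)  # one column
```

Checking `shape` or `columns` right after loading is a quick way to catch a wrong separator.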


<b>tconst (string)</b> - alphanumeric unique identifier of the title

<b>titleType (string)</b> – the type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc.)

<b>primaryTitle (string)</b> – the more popular title / the title used by the filmmakers on promotional materials at the point of release

<b>originalTitle (string)</b> - original title, in the original language

<b>isAdult (boolean)</b> - 0: non-adult title; 1: adult title

<b>startYear (YYYY)</b> – represents the release year of a title. In the case of TV Series, it is the series start year

<b>endYear (YYYY)</b> – TV Series end year. ‘\N’ for all other title types

<b>runtimeMinutes</b> – primary runtime of the title, in minutes

<b>genres (string array)</b> – includes up to three genres associated with the title
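
The field list mentions two conventions worth handling at load time: `\N` as a placeholder for missing values, and genres stored as a "string array". A minimal sketch follows, assuming that `\N` appears literally in the raw file and that the genres are comma-separated inside one string (an assumption; the description above only says "string array"). The sample rows and the `genre_list` column name are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample; "\\N" in Python source is the two characters \N.
sample_tsv = (
    "tconst\ttitleType\tendYear\tgenres\n"
    "tt0000001\ttvSeries\t2019\tComedy,Drama\n"
    "tt0000002\tmovie\t\\N\tDocumentary\n"
)

# na_values makes read_csv treat the \N placeholder as a missing value (NaN).
demo = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values=["\\N"])

# Split the genres string into a Python list per row (assumed comma-separated).
demo["genre_list"] = demo["genres"].str.split(",")

print(demo[["endYear", "genre_list"]])
```

Note that `"\N"` with a single backslash is not a valid Python string literal, so the placeholder has to be written as `"\\N"` or `r"\N"`.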

### Q2. How many types of titles are there in the column titleType? No for loops allowed! Check pandas.unique, pandas.Series.value_counts, set

In [6]:
from IPython.display import display
display(pd.unique(IMDB['titleType']))

print("Number of title types: ")
display(IMDB['titleType'].nunique())

array(['short', 'movie', 'tvMovie', 'video', 'tvSeries', 'tvEpisode',
       'tvMiniSeries', 'tvSpecial', 'videoGame', 'tvShort'], dtype=object)

Number of title types: 


10

In [7]:
IMDB['titleType'].value_counts()

tvEpisode       225972
short            35351
movie            15640
video            10450
tvSeries          7154
tvMovie           2431
tvMiniSeries      2000
tvSpecial         1306
videoGame          884
tvShort            151
Name: titleType, dtype: int64
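
The three tools named in the question answer slightly different things: `nunique` and `set` give the number of distinct values, while `value_counts` gives the frequency of each. A small sketch on a hypothetical Series (not the real data):

```python
import pandas as pd

# Hypothetical title types, for illustration only.
title_types = pd.Series(
    ["short", "movie", "short", "tvEpisode", "tvEpisode", "tvEpisode"]
)

n_types = title_types.nunique()        # number of distinct values (NaN excluded)
n_types_set = len(set(title_types))    # same count here, via a Python set
counts = title_types.value_counts()    # frequency per value, sorted descending
shares = title_types.value_counts(normalize=True)  # relative frequencies

print(n_types, n_types_set)
print(counts)
```

If the column contained missing values, `set` would count NaN as an extra element, whereas `nunique` ignores it by default.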

### Q3. Create a slice of imdb that only contains the following columns: 
### • titleType, primaryTitle, startYear, runtimeMinutes

In [9]:
sliced_imdb = IMDB[['titleType', 'primaryTitle', 'startYear', 'runtimeMinutes']]
sliced_imdb.head(5)

| | titleType | primaryTitle | startYear | runtimeMinutes |
|---|---|---|---|---|
| 0 | short | Kokoa | 2019.0 | 13.0 |
| 1 | short | Martina's Playhouse | 2019.0 | 20.0 |
| 2 | movie | Mariette in Ecstasy | 2019.0 | |
| 3 | tvMovie | Eine geschlossene Gesellschaft | 2019.0 | |
| 4 | short | Ambulans | 2019.0 | 11.0 |
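
Column selection depends on the kind of brackets used: a single label returns a Series, while a list of labels returns a DataFrame. The sketch below shows the difference on a hypothetical two-row frame; the `.copy()` call is an optional addition that makes the slice independent of the original.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
demo = pd.DataFrame({
    "titleType": ["short", "movie"],
    "primaryTitle": ["Example One", "Example Two"],
    "startYear": [2019.0, 2019.0],
    "runtimeMinutes": [13.0, 95.0],
    "isAdult": [0.0, 0.0],
})

one_column = demo["titleType"]         # single label -> Series
one_column_df = demo[["titleType"]]    # list with one label -> DataFrame

cols = ["titleType", "primaryTitle", "startYear", "runtimeMinutes"]
subset = demo[cols].copy()             # independent copy of the chosen columns

print(type(one_column).__name__, type(one_column_df).__name__)
print(subset.shape)
```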


### Q4. Create a subset of imdb named tvEpisodes_2019 that only includes the type tvEpisodes

In [10]:
tvEpisodes_2019 = IMDB[IMDB.titleType == 'tvEpisode']
tvEpisodes_2019.head(5)

| | index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 758955.0 | tt0782666 | tvEpisode | Save the Duckling! | Save the Duckling! | 0.0 | 2019.0 | | | |
| 30 | 969426.0 | tt10001058 | tvEpisode | Le grand saut | Le grand saut | 0.0 | 2019.0 | | 51.0 | |
| 32 | 969458.0 | tt10001110 | tvEpisode | The Interviews | The Interviews | 0.0 | 2019.0 | | | |
| 34 | 969464.0 | tt10001120 | tvEpisode | The New Guy | The New Guy | 0.0 | 2019.0 | | | |
| 35 | 969465.0 | tt10001122 | tvEpisode | The New Girl | The New Girl | 0.0 | 2019.0 | | | |
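
The subset keeps the original row labels (21, 30, 32, ...) because boolean filtering only drops rows; it does not renumber them. A minimal sketch of the mechanism on a hypothetical frame, with `isin` as one possible way to keep several types at once:

```python
import pandas as pd

# Hypothetical frame, for illustration only.
demo = pd.DataFrame({
    "titleType": ["short", "tvEpisode", "movie", "tvEpisode"],
    "primaryTitle": ["A", "B", "C", "D"],
})

# The comparison yields one True/False per row; indexing with it keeps the True rows.
mask = demo["titleType"] == "tvEpisode"
tv_episodes = demo[mask]

# Optional: renumber the rows from 0 after filtering.
renumbered = tv_episodes.reset_index(drop=True)

# isin() accepts a list when more than one type should be kept.
tv_or_movie = demo[demo["titleType"].isin(["tvEpisode", "movie"])]

print(tv_episodes.index.tolist())
print(renumbered.index.tolist())
```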


### Q5. Percentage of adult films over total releases in 2019. Check pd.Series.mean

In [12]:
releases2019 = IMDB[IMDB.startYear == 2019]
# isAdult is coded 0/1, so its mean is the share of adult titles
print(releases2019.isAdult.mean() * 100, '%')


2.9684176293144926 %
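
Taking the mean works here because `isAdult` is coded as 0 and 1: the sum of the column is the number of adult titles, and dividing by the number of rows gives the proportion. A worked sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical flags: one adult title out of four.
is_adult = pd.Series([0.0, 1.0, 0.0, 0.0])

pct = is_adult.mean() * 100                           # mean of 0/1 flags = proportion
manual = is_adult.sum() / is_adult.count() * 100      # same value, computed by hand

print(f"{pct:.2f} %")
```

Both `mean` and `count` ignore missing values, so titles with an unknown `isAdult` flag do not enter the denominator.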


### Q6. Create a column named words_in_title that contains the total number of words in the title. Use map alongside a function or a lambda function.

In [14]:
# Split each title on whitespace and count the pieces (map, as the question asks)
IMDB['wordsInTitle'] = IMDB['primaryTitle'].astype(str).map(lambda x: len(x.split()))
IMDB.head(5)

| | index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | wordsInTitle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 87501.0 | tt0089435 | short | Kokoa | Kokoa | 0.0 | 2019.0 | | 13.0 | | 1 |
| 1 | 89512.0 | tt0091490 | short | Martina's Playhouse | Martina's Playhouse | 0.0 | 2019.0 | | 20.0 | | 2 |
| 2 | 114407.0 | tt0116991 | movie | Mariette in Ecstasy | Mariette in Ecstasy | 0.0 | 2019.0 | | | | 3 |
| 3 | 126556.0 | tt0129960 | tvMovie | Eine geschlossene Gesellschaft | Eine geschlossene Gesellschaft | 0.0 | 2019.0 | | | | 3 |
| 4 | 166388.0 | tt0172112 | short | Ambulans | Ambulans | 0.0 | 2019.0 | | 11.0 | | 1 |
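
One detail to watch when counting words: depending on the pandas version, `astype(str)` may turn a missing title into the literal text "nan", which would then be counted as one word. The sketch below is one possible way to handle missing titles explicitly with `map`; the helper name `count_words` and the sample titles are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical titles, including one missing value.
titles = pd.Series(["Kokoa", "Mariette in Ecstasy", np.nan])


def count_words(title):
    """Return the number of whitespace-separated words, or 0 if the title is missing."""
    if pd.isna(title):
        return 0
    return len(str(title).split())


words = titles.map(count_words)

# Vectorised alternative: leaves missing titles as missing instead of 0.
words_str = titles.str.split().str.len()

print(words.tolist())
```

Whether a missing title should count as 0 words or stay missing is a design choice; the `.str` version keeps it missing.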


### Q7. What’s the average value of runtimeMinutes for the type short?

In [16]:
# Series.mean() skips missing values by default, so no explicit null filter is needed
types_short = IMDB[IMDB.titleType == 'short']
types_short.runtimeMinutes.mean()

12.536104279390065
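
Missing runtimes do not distort this average, because `Series.mean` skips NaN by default (`skipna=True`). A small sketch on hypothetical values, with `groupby` as one way to get the average for every title type at once:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing runtime.
demo = pd.DataFrame({
    "titleType": ["short", "short", "short", "movie"],
    "runtimeMinutes": [13.0, np.nan, 11.0, 90.0],
})

shorts = demo.loc[demo["titleType"] == "short", "runtimeMinutes"]

mean_default = shorts.mean()            # NaN skipped: (13 + 11) / 2
mean_dropna = shorts.dropna().mean()    # explicit filter gives the same value

# Average runtime per title type in one call.
by_type = demo.groupby("titleType")["runtimeMinutes"].mean()

print(mean_default)
print(by_type)
```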

### Q8. Filter imdb to return tvMovie type with 3 or more words in the title, and fewer than 75 minutes of runtimeMinutes

In [17]:
Movie_filter = IMDB[(IMDB.titleType == 'tvMovie') & (IMDB.wordsInTitle >= 3) & (IMDB.runtimeMinutes < 75)]
Movie_filter.head(5)

| | index | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres | wordsInTitle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 290068.0 | tt0302617 | tvMovie | Great Bear Rainforest | Great Bear Rainforest | 0.0 | 2019.0 | | 41.0 | | 3 |
| 77 | 970092.0 | tt10002188 | tvMovie | La Sagi, una pionera del Barça | La Sagi, una pionera del Barça | 0.0 | 2019.0 | | 54.0 | | 6 |
| 655 | 972490.0 | tt10006422 | tvMovie | Warrior Women of Dahomey | Warrior Women of Dahomey | 0.0 | 2019.0 | | 60.0 | | 4 |
| 870 | 974152.0 | tt10009314 | tvMovie | Peter Kraus: Immer in Bewegung | Peter Kraus: Immer in Bewegung | 0.0 | 2019.0 | | 59.0 | | 5 |
| 1549 | 977406.0 | tt10015036 | tvMovie | Arts Across the Heartland | Arts Across the Heartland | 0.0 | 2019.0 | | 42.0 | | 4 |
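
Each condition in the filter needs its own parentheses because `&` binds more tightly than comparison operators such as `==` and `<`. The sketch below builds the same kind of mask on a hypothetical frame and shows that rows with a missing runtime are dropped, since a comparison with NaN evaluates to False.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, for illustration only.
demo = pd.DataFrame({
    "titleType": ["tvMovie", "tvMovie", "movie", "tvMovie"],
    "wordsInTitle": [3, 2, 4, 5],
    "runtimeMinutes": [41.0, 60.0, 50.0, np.nan],
})

is_tv_movie = demo["titleType"] == "tvMovie"
long_title = demo["wordsInTitle"] >= 3
short_runtime = demo["runtimeMinutes"] < 75    # NaN < 75 is False

# Element-wise AND of the three conditions.
mask = is_tv_movie & long_title & short_runtime
filtered = demo[mask]

print(filtered)
```

Naming the intermediate conditions is optional, but it makes a long filter easier to read and to debug one condition at a time.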
