IMDB Dataset
For illustration, we will use the IMDB dataset of the top 1000 movies and apply the 10-step process to understand the features and traits of top IMDB movies. Let’s start:

# 1. Summary
The first step is to get the summary of columns present in the dataset, but before that, we will import the packages and read our IMDB dataset.

The summary helps us understand the data by showing statistics (count, number of unique values, most frequent value, mean, etc.) for each column, as shown below:

In [None]:
# import our packages 
import pandas as pd 
import matplotlib.pyplot as plt

# reading the data
df_movies =  pd.read_csv('https://raw.githubusercontent.com/peetck/IMDB-Top1000-Movies/master/IMDB-Movie-Data.csv')

# summary of the columns   
df_movies.describe(include = 'all')

Insights :

Drishyam occurs twice in the top 1000 IMDB movies (one of the movies is in Malayalam and the other is in Hindi).
2014 is the most common release year (32 movies).
Drama is the most common genre (85 movies).
Alfred Hitchcock has the most movies (14) in the top 1000.
We can also see the distribution (min, 25%, 50%, etc.) of IMDB ratings across these movies.
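To make the `unique`, `top` and `freq` rows of the summary easier to read, here is a minimal sketch on a made-up four-row frame (the `toy` data is hypothetical, not taken from the IMDB file). It shows how a repeated title surfaces in the summary, in the same way as the first insight above.

```python
import pandas as pd

# Hypothetical toy data: title 'A' appears twice
toy = pd.DataFrame({
    'Title': ['A', 'B', 'A', 'C'],
    'Rating': [8.0, 7.5, 8.2, 6.9],
})

summary = toy.describe(include='all')

# For text columns: 'unique' = number of distinct values,
# 'top' = most frequent value, 'freq' = how often 'top' occurs
print(summary.loc[['count', 'unique', 'top', 'freq'], 'Title'])

# For numeric columns the same table holds mean, std and the quartiles
print(summary.loc[['mean', 'min', '50%', 'max'], 'Rating'])
```

A `freq` greater than 1 on a column that should be unique, such as a title, is a quick hint that duplicates are worth checking.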

# 2. Data Types
Step 2 is to do a sanity check of the data types of the columns of the dataframe. If we find some incorrect data types then we will correct them in this step.

In [None]:
# check the datatypes 
print(df_movies.dtypes)

Fix any "wrong" data types

In [None]:
# convert release year into int
# (the commented-out lines in this cell refer to columns named Released_Year, Gross and Runtime,
#  which are not used anywhere else in this notebook; treat them as patterns to adapt)
# df_movies.loc[df_movies.Released_Year == 'PG','Released_Year'] = '9999'
df_movies['Year'] = df_movies['Year'].astype(int)

# Gross into int
# df_movies['Gross'] = df_movies['Gross'].str.replace(',','')
# df_movies['Gross'] = df_movies['Gross'].fillna(0).astype(int)

# runtime into int
# df_movies['Runtime'] = df_movies['Runtime'].str.replace(' min','').astype(int)
# df_movies.dtypes
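The commented-out conversions can be tried on a small frame. The sketch below is only an illustration: the `toy` frame and its values are made up, with column names borrowed from the commented-out lines. It also swaps the `'9999'` placeholder for `pd.to_numeric(..., errors='coerce')`, which is a different technique: unparseable entries become missing values instead of a sentinel year.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text
toy = pd.DataFrame({
    'Gross': ['28,341,469', '134,966,411', None],
    'Runtime': ['142 min', '175 min', '152 min'],
    'Released_Year': ['1994', '1972', 'PG'],
})

# remove thousands separators, parse to numbers, then treat missing gross as 0
gross = pd.to_numeric(toy['Gross'].str.replace(',', '', regex=False))
toy['Gross'] = gross.fillna(0).astype(int)

# drop the ' min' suffix before casting
toy['Runtime'] = toy['Runtime'].str.replace(' min', '', regex=False).astype(int)

# errors='coerce' turns entries such as 'PG' into NaN instead of raising;
# the nullable 'Int64' dtype keeps the column integer despite the gap
toy['Released_Year'] = pd.to_numeric(toy['Released_Year'], errors='coerce').astype('Int64')

print(toy.dtypes)
```

Whether a missing gross should become 0 or stay missing depends on the later analysis; 0 is used here only because the commented-out code does the same.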

# 3. Missing values
The 3rd step is to find the number of missing values across the columns of the dataframe. It’s important to understand the count of nulls so that we can gauge whether we need to treat them.

In [None]:
# find nulls 
df_movies.isnull().sum()
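Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a made-up `toy` frame (the values are hypothetical; only the column names mirror the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Metascore': [76.0, np.nan, 65.0, np.nan],
    'Revenue (Millions)': [333.1, 126.5, np.nan, 270.3],
})

# isnull().mean() is the fraction of missing rows per column
null_report = pd.DataFrame({
    'nulls': toy.isnull().sum(),
    'pct_null': (toy.isnull().mean() * 100).round(1),
}).sort_values('pct_null', ascending=False)

print(null_report)
```

Sorting by percentage puts the columns that most need treatment at the top.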

# 4. Missing values treatment
Once we know the count of missing values, the next step is to treat the columns with missing values.

For illustration purposes, I am filling the nulls with the mean value of the columns, although there are more sophisticated methods of missing value treatment.

In [None]:
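# let's replace the nulls with the column mean (if a column has no nulls, this changes nothing)
# fillna returns a new Series, so the result has to be assigned back to the column
df_movies['Metascore'] = df_movies['Metascore'].fillna(df_movies['Metascore'].mean())
df_movies['Revenue (Millions)'] = df_movies['Revenue (Millions)'].fillna(df_movies['Revenue (Millions)'].mean())
df_movies.head()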
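As one example of a slightly more careful treatment, the sketch below fills each gap with the median of movies from the same year and keeps a flag recording which rows were imputed. This is only one possible approach, shown on a made-up `toy` frame; the flag column name `Metascore_was_null` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Year': [2014, 2014, 2014, 2016, 2016],
    'Metascore': [50.0, np.nan, 70.0, 80.0, np.nan],
})

# remember which rows were imputed so they can be excluded or inspected later
toy['Metascore_was_null'] = toy['Metascore'].isnull()

# transform('median') returns the per-year median aligned to every row,
# so fillna can pick the matching group's value for each gap
year_median = toy.groupby('Year')['Metascore'].transform('median')
toy['Metascore'] = toy['Metascore'].fillna(year_median)

print(toy)
```

The median is less sensitive to extreme values than the mean, and grouping keeps the filled value closer to comparable movies.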

# 5. Outliers
Step 5 is to check for outliers. There are multiple ways to detect outliers; we will use the graphical method: pick one continuous variable and look for outliers in its histogram.

In [None]:
# distribution of meta scores
# drop missing values explicitly so that only real scores are binned
plt.hist(df_movies['Metascore'].dropna(), bins = 15)
plt.show()
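A numeric check can complement the histogram. The sketch below uses the common 1.5 × IQR rule, which is a different technique from the graphical method used in this notebook; the `scores` values are made up for illustration.

```python
import pandas as pd

# Hypothetical scores with one very low and one very high value
scores = pd.Series([45, 52, 58, 60, 61, 63, 65, 66, 70, 72, 99, 5], name='Metascore')

# quartiles and interquartile range
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1

# values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(lower, upper)
print(outliers)
```

The 1.5 multiplier is a convention, not a law; a flagged value still needs a judgment call about whether it is an error or a genuine extreme.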

# 6. Outlier Treatment
Step 6 is to treat the outliers detected in step 5. There are different ways of treating outliers, such as 1) capping the minimum and maximum values, or 2) removing the rows with outlier values.

Although there is nothing off with the distribution of meta scores, for illustration purposes, let’s cap the minimum meta score value to 40.

In [None]:
# capping the minimum meta score to 40
df_movies.loc[df_movies['Metascore'] < 40,'Metascore'] = 40

#check the minimum score 
df_movies['Metascore'].min()
# output : 40.0
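Both treatments mentioned above can be sketched side by side. The `toy` frame and the limits 40 and 95 are arbitrary choices for illustration, not recommendations for this dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'Metascore': [11.0, 45.0, 72.0, 88.0, 100.0],
})

# 1) capping: clip() limits both ends in one call and keeps every row
capped = toy.assign(Metascore=toy['Metascore'].clip(lower=40, upper=95))

# 2) removal: keep only the rows whose score lies inside the limits (inclusive)
kept = toy[toy['Metascore'].between(40, 95)]

print(capped)
print(kept)
```

Capping preserves the row count at the cost of distorting the extreme values; removal keeps the values honest but shrinks the data.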

# 7. Who
Step 7 is to answer questions related to a person, member, etc. For example, in our use case we have actors and directors, and we can formulate and answer the following questions about them.

Who has directed the most top IMDB movies? (univariate)
Who has acted in the most top IMDB movies? (univariate)
Which actor-director combination produced the most top IMDB movies? (bivariate)
Who composed the music for the most top IMDB movies? (data not available)
And more…
Now, let’s answer these questions.

In [None]:
## Who has directed the most number of top IMDB movies ?
df_movies.groupby(['Director']).agg({'Title':'count'}).reset_index().rename(columns = {'Title':'count'}).\
sort_values('count',ascending = False).head(5)

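## Who has acted in the most number of top movies ?
# note: this counts each distinct value of 'Actors'; if that column stores the whole cast
# as one string, identical cast lists are counted rather than individual actors
df_movies.groupby(['Actors']).agg({'Title':'count'}).reset_index().rename(columns = {'Title':'count'}).\
sort_values('count',ascending = False).head(5)

## Which Director - Actor combination works best ?
# the same caveat applies to 'Actors' here
df_movies.groupby(['Director','Actors'])['Title'].count().reset_index().\
rename(columns = {'Title':'Count'}).sort_values('Count',ascending = False).head(5)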
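If the `Actors` column holds several names in one comma-separated string (an assumption about the file, worth checking with `df_movies['Actors'].head()`), counting per person needs one row per actor first. A minimal sketch on made-up data, where `toy`, `per_actor` and the names are all hypothetical:

```python
import pandas as pd

# Hypothetical toy data: each movie lists its cast in one string
toy = pd.DataFrame({
    'Title': ['M1', 'M2', 'M3'],
    'Director': ['D1', 'D1', 'D2'],
    'Actors': ['Ann Lee, Bob Ray', 'Ann Lee, Cy Fox', 'Bob Ray, Ann Lee'],
})

# split the cast string into a list, then explode to one row per (movie, actor)
per_actor = toy.assign(Actor=toy['Actors'].str.split(',')).explode('Actor')
per_actor['Actor'] = per_actor['Actor'].str.strip()  # remove spaces left after the commas

# movies per actor
actor_counts = per_actor.groupby('Actor')['Title'].count().sort_values(ascending=False)

# movies per director-actor pair
pair_counts = per_actor.groupby(['Director', 'Actor'])['Title'].count().sort_values(ascending=False)

print(actor_counts.head())
print(pair_counts.head())
```

Without the split, 'Ann Lee, Bob Ray' and 'Bob Ray, Ann Lee' would count as two unrelated values.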

# 8. When
Step 8 is to answer questions related to the time aspect: year, month, week, etc. In the context of our data, we can ask the following:

Which years have the most movies in the IMDB top 1000? (univariate)

In [None]:
# finding years with most movies in top 1000
year_dis = df_movies.groupby('Year')['Title'].count().reset_index().\
rename(columns = {'Title':'Count'}).sort_values('Count',ascending = False).head(10)

plt.bar(year_dis['Year'].astype(str), year_dis['Count'], width = 0.5)
plt.xlabel('Year')
plt.ylabel('Number of Movies')
plt.title('Years with most movies in IMDB top 1000')
plt.show()
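Time questions can also be asked at a coarser grain. As a sketch, the same count can be taken per decade by integer-dividing the year; the `toy` frame is made up and the `Decade` column is a hypothetical helper, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Year': [1994, 1999, 2008, 2010, 2014, 2014],
})

# 1994 // 10 * 10 -> 1990, 2014 // 10 * 10 -> 2010
toy['Decade'] = (toy['Year'] // 10) * 10

decade_dis = toy.groupby('Decade')['Title'].count().reset_index().\
rename(columns = {'Title':'Count'}).sort_values('Decade')

print(decade_dis)
```

Sorting by decade rather than by count keeps the time order, which makes trends easier to see in a bar chart.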

# 9. Where
Step 9 is to look at things from the “place” perspective, for example country, state, or region. In the context of our dataset, we can ask the following:

Which countries have the most movies in the IMDB top 1000?
Currently, we don’t have the data to answer this question.

While formulating these questions, I recommend being as exhaustive as possible rather than limiting the thought process to the data at hand, because data that is not available now could be obtained later.

# 10. What/Which
Step 10 is about formulating questions about everything not covered above, that is, anything not related to people, place, or time. This is a bit subjective and takes some time to get adept at.

Which genres are featured most in the top 1000?
What is the duration of the top movies?
What is the correlation between the rating and gross earnings?
and more…
For illustration purposes, we answer the first question using the following code:

In [None]:
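### Which genres are featured most in top 1000 ?
# note: this counts each distinct value of 'Genre'; if a movie lists several genres
# in one string, every combination is counted as its own category
genre_dis = df_movies.groupby('Genre')['Title'].count().reset_index().\
rename(columns = {'Title':'Count'}).sort_values('Count',ascending = False).head(5)
fig, ax = plt.subplots()
# draw on the axes created above so the tick-label rotation applies to the same plot
ax.bar(genre_dis['Genre'], genre_dis['Count'], width = 0.5)
plt.setp(ax.get_xticklabels(), rotation=30, horizontalalignment='right')
plt.show()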
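The third question, the correlation between rating and gross earnings, can be sketched with `Series.corr`, which computes the Pearson coefficient by default. The numbers below are made up and the column names are assumed to match the notebook's `Rating` and `Revenue (Millions)` columns.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'Rating': [8.1, 7.0, 6.5, 7.8, 8.8],
    'Revenue (Millions)': [333.1, 126.5, 45.1, 270.3, 532.2],
})

# Pearson correlation; rows where either value is missing are ignored
corr = toy['Rating'].corr(toy['Revenue (Millions)'])
print(round(corr, 2))

# a scatter plot is a useful companion, since one coefficient can hide non-linear patterns
# toy.plot.scatter(x='Rating', y='Revenue (Millions)')
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 a weak one; it says nothing about causation.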