# Analysis of the top 1,000 movies from 2006 to 2016 according to IMDb



### Content:
   + Introduction 
   + Data description
   + Research questions
   + Data preparation: cleaning and shaping
   + Conclusion
    

### Introduction
IMDb (also known as the Internet Movie Database) is the world's most popular and authoritative source of information related to films, television programs, home videos, video games, and streaming content online, including cast, production crew and personal biographies, plot summaries, ratings, and fan and critical reviews. The movie and talent pages of IMDb are accessible to all internet users, but registration is required to contribute information to the site.
Alongside this data, IMDb offers a rating scale that lets users rate films from one to ten.
IMDb launched online in 1990 and has been a subsidiary of Amazon.com since 1998.
As of January 2020, IMDb has approximately 6.5 million titles (including episodes) and 10.4 million personalities in its database, as well as 83 million registered users.

The following analysis provides an overview of the top 1,000 movies from 2006 to 2016. It also gives further insight into the relationships between ratings, revenue, actors, directors, genres and years.


Sources: https://en.wikipedia.org/wiki/IMDb and https://help.imdb.com/article/imdb/general-information/what-is-imdb/G836CY29Z4SGNMK5?ref_=help


### Data description
The data set contains the 1,000 most popular movies on IMDb released between 2006 and 2016.
The data points included are: Rank, Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue and Metascore.
The goal is to learn more about the relationships between these data points.
All of the analysis is based on data from this period (2006-2016).

Source: https://www.kaggle.com/PromptCloudHQ/imdb-data


### Research questions
1. Analyse the relationship between the revenue and the rating of a film.
2. Find which year had the most highly rated films.
3. Identify which director makes the films with the biggest revenue.
4. Track the relationship between genre and rating. Which genres are most often rated low, and which are rated highest?
5. Which actors most often appear in highly rated films?
6. Is there any relationship between the presence of the most popular actors and a movie's revenue?

### Data preparation: cleaning and shaping

First of all, let's take a look at the dataset and identify the main tasks for further cleaning.

In [87]:
import pandas as pd
import warnings
warnings.simplefilter('ignore')
data = pd.read_csv('IMDB-Movie-Data.csv')
data.head(5)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


As we can see, some columns are not needed for this analysis,
namely rank, description, runtime, votes and metascore, so I am going to delete them.

In [None]:
# Drop the columns that are not needed for this analysis.
data = data.drop(columns=['Rank', 'Description', 'Runtime (Minutes)',
                          'Votes', 'Metascore'])

In [86]:
data.head(5)

Unnamed: 0,Title,Genre,Director,Actors,Year,Rating,Revenue (Millions)
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,8.1,333.13
1,Prometheus,"Adventure,Mystery,Sci-Fi",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,7.0,126.46
2,Split,"Horror,Thriller",M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,7.3,138.12
3,Sing,"Animation,Comedy,Family",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,7.2,270.32
4,Suicide Squad,"Action,Adventure,Fantasy",David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,6.2,325.02


Next, I want to check that the data is complete.

In [67]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Title               1000 non-null   object 
 1   Genre               1000 non-null   object 
 2   Director            1000 non-null   object 
 3   Actors              1000 non-null   object 
 4   Year                1000 non-null   int64  
 5   Rating              1000 non-null   float64
 6   Revenue (Millions)  872 non-null    float64
dtypes: float64(2), int64(1), object(4)
memory usage: 54.8+ KB


This summary shows that the Revenue (Millions) column has only 872 non-null values out of 1,000, so 128 rows have no revenue. The next step is therefore to find those rows and delete them.

In [68]:
data = data[pd.notnull(data['Revenue (Millions)'])]

In [69]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 872 entries, 0 to 999
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Title               872 non-null    object 
 1   Genre               872 non-null    object 
 2   Director            872 non-null    object 
 3   Actors              872 non-null    object 
 4   Year                872 non-null    int64  
 5   Rating              872 non-null    float64
 6   Revenue (Millions)  872 non-null    float64
dtypes: float64(2), int64(1), object(4)
memory usage: 54.5+ KB
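
The same cleaning step can also be sketched with `isna()` and `dropna()`. Note that the filtered frame above keeps its original row labels (872 entries, labelled 0 to 999), so the sketch also renumbers the index. This is only an illustration under stated assumptions: the frame `toy` and its titles and values are made up and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movie table; titles and values are made up for illustration.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Rating": [8.1, 7.0, 6.5, 7.3],
    "Revenue (Millions)": [333.13, np.nan, 126.46, np.nan],
})

# Count missing values per column before removing anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the rows where revenue is missing, then renumber the index from 0.
cleaned = toy.dropna(subset=["Revenue (Millions)"]).reset_index(drop=True)
print(cleaned)
```

Resetting the index is optional; it mainly helps if rows are later accessed by position rather than by label.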


After deleting the rows with null values, I am going to check the dataframe for duplicated rows.

In [88]:
data.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
995    False
996    False
997    False
998    False
999    False
Length: 1000, dtype: bool
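
Because the printed Series shows only the first and last few rows, a single summary number is easier to read than the full boolean listing. A minimal sketch, assuming a made-up frame `toy` that deliberately contains one repeated row (it is not taken from the dataset):

```python
import pandas as pd

# Toy frame with one repeated row; the titles are made up for illustration.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie A"],
    "Year": [2014, 2012, 2014],
    "Rating": [8.1, 7.0, 8.1],
})

# duplicated() marks every repeat of an earlier row as True, so the sum is a count.
duplicate_count = int(toy.duplicated().sum())
print(duplicate_count)

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(len(deduplicated))
```

On the real data, a count of zero from `data.duplicated().sum()` would confirm that no rows are repeated.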

There are no duplicated rows, so the last step is to prepare a sub-dataframe for each of my research questions.

In [74]:
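# One sub-dataframe per research question (data1 for question 1, and so on).
data1 = data[['Title','Rating','Revenue (Millions)']]    # revenue vs. rating
data2 = data[['Title','Year','Rating']]                  # rating by year
data3 = data[['Title','Director','Revenue (Millions)']]  # revenue by director
data4 = data[['Title','Genre','Rating']]                 # rating by genre
data5 = data[['Title','Actors','Rating']]                # rating by actor
data6 = data[['Title','Actors','Revenue (Millions)']]    # revenue by actor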


In [85]:
data1.head(5)

Unnamed: 0,Title,Rating,Revenue (Millions)
0,Guardians of the Galaxy,8.1,333.13
1,Prometheus,7.0,126.46
2,Split,7.3,138.12
3,Sing,7.2,270.32
4,Suicide Squad,6.2,325.02


In [84]:
data2.head(5)

Unnamed: 0,Title,Year,Rating
0,Guardians of the Galaxy,2014,8.1
1,Prometheus,2012,7.0
2,Split,2016,7.3
3,Sing,2016,7.2
4,Suicide Squad,2016,6.2


In [83]:
data3.head(5)

Unnamed: 0,Title,Director,Revenue (Millions)
0,Guardians of the Galaxy,James Gunn,333.13
1,Prometheus,Ridley Scott,126.46
2,Split,M. Night Shyamalan,138.12
3,Sing,Christophe Lourdelet,270.32
4,Suicide Squad,David Ayer,325.02


In [82]:
data4.head(5)

Unnamed: 0,Title,Genre,Rating
0,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",8.1
1,Prometheus,"Adventure,Mystery,Sci-Fi",7.0
2,Split,"Horror,Thriller",7.3
3,Sing,"Animation,Comedy,Family",7.2
4,Suicide Squad,"Action,Adventure,Fantasy",6.2


In [81]:
data5.head(5)

Unnamed: 0,Title,Actors,Rating
0,Guardians of the Galaxy,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",8.1
1,Prometheus,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",7.0
2,Split,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",7.3
3,Sing,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",7.2
4,Suicide Squad,"Will Smith, Jared Leto, Margot Robbie, Viola D...",6.2


In [80]:
data6.head(5)

Unnamed: 0,Title,Actors,Revenue (Millions)
0,Guardians of the Galaxy,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",333.13
1,Prometheus,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",126.46
2,Split,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",138.12
3,Sing,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",270.32
4,Suicide Squad,"Will Smith, Jared Leto, Margot Robbie, Viola D...",325.02
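
The Genre and Actors columns hold several comma-separated values per film, and the separators are not fully consistent (some actor names follow a comma without a space). For the genre and actor questions it may help to reshape them to one row per value. The following is only a sketch: the helper `explode_column` and the frame `toy` are hypothetical names introduced here, and the rows are made up.

```python
import pandas as pd

# Toy rows in the same shape as data4/data5; names and values are made up.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B"],
    "Genre": ["Action,Adventure", "Comedy"],
    "Actors": ["Actor One,Actor Two, Actor Three", "Actor Two, Actor Four"],
    "Rating": [8.1, 6.2],
})

def explode_column(frame, column):
    """Return one row per comma-separated item in `column`, whitespace stripped."""
    result = frame.copy()
    result[column] = result[column].str.split(",")
    result = result.explode(column)
    result[column] = result[column].str.strip()
    return result.reset_index(drop=True)

by_genre = explode_column(toy[["Title", "Genre", "Rating"]], "Genre")
by_actor = explode_column(toy[["Title", "Actors", "Rating"]], "Actors")

# Mean rating per genre and per actor on the toy data.
print(by_genre.groupby("Genre")["Rating"].mean())
print(by_actor.groupby("Actors")["Rating"].mean())
```

Stripping whitespace after the split keeps names such as "Actor Two" and " Actor Two" from being counted as different people.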


### Conclusion
The dataset is now cleaned: it has no unneeded columns, no duplicated rows and no rows with null values. Six dataframes have also been prepared, one for each of the six research questions.