# Features Description
* **Ranking of movie:** Movie's position in the dataset.
* **Movie name:** Name of the movie.
* **Year:** Year of movie release.
* **Certificate:** Age rating or classification of the movie.
* **Runtime:** Duration of the movie in minutes.
* **Genre:** Category or type of the movie.
* **Rating:** Score given to the movie.
* **Detail about movie:** Brief information about the movie.
* **Director:** Name of the movie's director(s).
* **Actor 1, Actor 2, Actor 3, Actor 4:** Names of the main actors or actresses.
* **Votes:** Number of votes or ratings received.
* **Metascore:** Aggregated score based on critic reviews.
* **Gross Collection:** Total box office earnings of the movie.

## Import Libraries

In [1]:
import pandas as pd 
import numpy as np 

## Reading Data

In [2]:
df=pd.read_csv('imdb_movies.csv')

### Data Overview

### EDA

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ranking of movie    250 non-null    int64  
 1   movie name          250 non-null    object 
 2   Year                250 non-null    object 
 3   certificate         250 non-null    object 
 4   runtime             250 non-null    object 
 5   genre               250 non-null    object 
 6   RATING              250 non-null    float64
 7   DETAIL ABOUT MOVIE  250 non-null    object 
 8   DIRECTOR            250 non-null    object 
 9   ACTOR 1             250 non-null    object 
 10  ACTOR 2             250 non-null    object 
 11  ACTOR 3             250 non-null    object 
 12  ACTOR 4             250 non-null    object 
 13  votes               250 non-null    int64  
 14  metascore           218 non-null    float64
 15  GROSS COLLECTION    214 non-null    object 
dtypes: float

In [17]:
df.describe()

| | ranking of movie | rating | votes | metascore |
| --- | --- | --- | --- | --- |
| count | 250.0 | 250.0 | 250.0 | 218.0 |
| mean | 125.5 | 8.3084 | 578529.9 | 82.449541 |
| std | 72.312977 | 0.234669 | 495130.4 | 10.822392 |
| min | 1.0 | 8.1 | 26538.0 | 55.0 |
| 25% | 63.25 | 8.1 | 168862.8 | 75.0 |
| 50% | 125.5 | 8.2 | 431335.5 | 84.0 |
| 75% | 187.75 | 8.4 | 885425.5 | 90.0 |
| max | 250.0 | 9.4 | 2515762.0 | 100.0 |


In [10]:
df.describe(include='O')

| | movie name | Year | certificate | runtime | genre | DETAIL ABOUT MOVIE | DIRECTOR | ACTOR 1 | ACTOR 2 | ACTOR 3 | ACTOR 4 | GROSS COLLECTION |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 250 | 250 | 250 | 250 | 250 | 250 | 250 | 250 | 250 | 250 | 250 | 214 |
| unique | 250 | 90 | 11 | 102 | 109 | 250 | 154 | 185 | 233 | 241 | 242 | 204 |
| top | Jai Bhim | -1995 | R | 130 min | Drama | When a tribal man is arrested for a case of al... | Christopher Nolan | Robert De Niro | Matt Damon | Joe Pesci | Diane Keaton | $4.36M |
| freq | 1 | 8 | 106 | 9 | 22 | 1 | 7 | 6 | 3 | 3 | 2 | 3 |


In [12]:
df.columns=df.columns.str.lower()

In [14]:
df.isnull().mean()*100

ranking of movie       0.0
movie name             0.0
year                   0.0
certificate            0.0
runtime                0.0
genre                  0.0
rating                 0.0
detail about movie     0.0
director               0.0
actor 1                0.0
actor 2                0.0
actor 3                0.0
actor 4                0.0
votes                  0.0
metascore             12.8
gross collection      14.4
dtype: float64
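
These percentages can be cross-checked by hand against the non-null counts reported by `df.info()`: 218 of 250 rows have a metascore and 214 have a gross collection. A minimal sketch of that arithmetic (the helper name `missing_pct` is made up for illustration):

```python
def missing_pct(non_null: int, total: int) -> float:
    """Share of missing values, as a percentage of all rows."""
    return round((total - non_null) / total * 100, 1)

TOTAL_ROWS = 250  # RangeIndex size reported by df.info()

metascore_missing = missing_pct(218, TOTAL_ROWS)  # 32 rows without a metascore
gross_missing = missing_pct(214, TOTAL_ROWS)      # 36 rows without a gross value
print(metascore_missing, gross_missing)
```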

In [16]:
df.duplicated().sum()

0

### Drop Unnecessary Columns

In [23]:
df.drop(['ranking of movie','detail about movie'],axis=1,inplace=True)

### Rename Columns to a Suitable Form

In [30]:
df.rename(mapper={'runtime':'runtime(m)','rating':'rating(10)','gross collection':'gross collection(M-$)'},axis=1,inplace=True)

In [32]:
df.columns

Index(['movie name ', 'year', 'certificate', 'runtime(m)', 'genre',
       'rating(10)', 'director ', 'actor 1', 'actor 2', 'actor 3', 'actor 4',
       'votes', 'metascore', 'gross collection(M-$)'],
      dtype='object')

In [42]:
for col in df.select_dtypes('object').columns:
    print(col)
    print(df[col].unique())
    print('-'*100)

movie name 
['Jai Bhim' 'The Shawshank Redemption' 'The Godfather' 'The Dark Knight'
 'The Godfather: Part II' '12 Angry Men'
 'The Lord of the Rings: The Return of the King' 'Pulp Fiction'
 "Schindler's List" 'Inception' 'Spider-Man: No Way Home' 'Fight Club'
 'The Lord of the Rings: The Fellowship of the Ring' 'Forrest Gump'
 'The Good, the Bad and the Ugly' 'The Lord of the Rings: The Two Towers'
 'The Matrix' 'Goodfellas'
 'Star Wars: Episode V - The Empire Strikes Back'
 "One Flew Over the Cuckoo's Nest" 'Parasite' 'Interstellar' 'City of God'
 'Spirited Away' 'Saving Private Ryan' 'The Green Mile'
 'Life Is Beautiful' 'Se7en' 'The Silence of the Lambs'
 'Star Wars: Episode IV - A New Hope' 'Hara-Kiri' 'Seven Samurai'
 "It's a Wonderful Life" 'Whiplash' 'The Intouchables' 'The Prestige'
 'The Departed' 'The Pianist' 'Gladiator' 'American History X'
 'The Usual Suspects' 'Léon: The Professional' 'The Lion King'
 'Terminator 2: Judgment Day' 'Cinema Paradiso' 'Grave of the Fireflies

In [63]:
df['year'] = df['year'].apply(lambda x: int(x.strip('-')) if len(x) <= 5 else int(x.split()[1].strip('()')))
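
The lambda above depends on the string length and on the year being the second whitespace-separated token. A slightly more defensive sketch pulls out the first four-digit run with a regular expression instead. The helper name `parse_year` and the second sample string are assumptions for illustration; only `'-1995'` is known to appear in the data (it is the `top` value in the `describe` output).

```python
import re

def parse_year(raw: str) -> int:
    """Return the first four-digit number found in a raw year string."""
    match = re.search(r"\d{4}", raw)
    if match is None:
        raise ValueError(f"no four-digit year in {raw!r}")
    return int(match.group())

# '-1995' is the format seen in the describe() output; the second string
# is a hypothetical example of a longer value with a prefix token.
years = [parse_year(s) for s in ["-1995", "(I) (2016)"]]
print(years)
```

It could then be applied with `df['year'].apply(parse_year)`.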


In [82]:
df['runtime(m)']=df['runtime(m)'].apply(lambda x: int(x.split(' ')[0]))

In [92]:
df['gross collection(M-$)']=df['gross collection(M-$)'].apply(lambda x: x if isinstance(x, float) else float(x[1:-1]))
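
The conversion above assumes every non-missing value looks like `$4.36M` (a dollar sign, a number, and a trailing `M`) and passes missing values through because pandas stores them as float `NaN`. The same logic as a named function, with the format check made explicit, might look like this (`parse_gross_millions` is a hypothetical helper name):

```python
import math

def parse_gross_millions(value):
    """Convert strings such as '$4.36M' to 4.36; keep missing values as NaN."""
    if isinstance(value, float):  # NaN from pandas is already a float
        return value
    text = value.strip()
    if not (text.startswith("$") and text.endswith("M")):
        raise ValueError(f"unexpected gross format: {value!r}")
    return float(text[1:-1])  # drop the leading '$' and trailing 'M'

print(parse_gross_millions("$4.36M"))
print(math.isnan(parse_gross_millions(float("nan"))))
```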

In [108]:
df.columns=df.columns.str.strip()
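
Stripping matters here because the earlier `df.columns` output shows `'movie name '` and `'director '` with trailing spaces, so `df['director']` would raise a `KeyError` until the names are cleaned. A minimal sketch on a toy frame (one row with values borrowed from this notebook) that strips and lowercases the headers in one step:

```python
import pandas as pd

toy = pd.DataFrame({"movie name ": ["Jai Bhim"], "DIRECTOR ": ["T.J. Gnanavel"]})

# Normalise header text: remove surrounding whitespace, then lowercase.
toy.columns = toy.columns.str.strip().str.lower()
print(list(toy.columns))
```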

### What are the minimum and maximum ratings?


In [109]:
df['rating(10)'].min()

8.1

In [110]:
df['rating(10)'].max()

9.4

### Which movies have a rating > 9?

In [111]:
msk =df['rating(10)']>9
df[msk]

| | movie name | year | certificate | runtime(m) | genre | rating(10) | director | actor 1 | actor 2 | actor 3 | actor 4 | votes | metascore | gross collection(M-$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Jai Bhim | 2021 | TV-MA | 164 | Crime, Drama | 9.4 | T.J. Gnanavel | Suriya | Lijo Mol Jose | Manikandan | Rajisha Vijayan | 163431 | NaN | NaN |
| 1 | The Shawshank Redemption | 1994 | R | 142 | Drama | 9.3 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2515762 | 80.0 | 28.34 |
| 2 | The Godfather | 1972 | R | 175 | Crime, Drama | 9.2 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1732749 | 100.0 | 134.97 |


### Top 10 movies by metascore

In [112]:
df.sort_values(by='metascore',ascending=False).head(10)

| | movie name | year | certificate | runtime(m) | genre | rating(10) | director | actor 1 | actor 2 | actor 3 | actor 4 | votes | metascore | gross collection(M-$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 112 | Vertigo | 1958 | PG | 128 | Mystery, Romance, Thriller | 8.3 | Alfred Hitchcock | James Stewart | Kim Novak | Barbara Bel Geddes | Tom Helmore | 387477 | 100.0 | 3.2 |
| 109 | Lawrence of Arabia | 1962 | Approved | 228 | Adventure, Biography, Drama | 8.3 | David Lean | Peter O'Toole | Alec Guinness | Anthony Quinn | Jack Hawkins | 283057 | 100.0 | 44.82 |
| 117 | Citizen Kane | 1941 | PG | 119 | Drama, Mystery | 8.3 | Orson Welles | Orson Welles | Joseph Cotten | Dorothy Comingore | Agnes Moorehead | 427403 | 100.0 | 1.59 |
| 50 | Casablanca | 1942 | PG | 102 | Drama, Romance, War | 8.5 | Michael Curtiz | Humphrey Bogart | Ingrid Bergman | Paul Henreid | Claude Rains | 549646 | 100.0 | 1.02 |
| 49 | Rear Window | 1954 | PG | 112 | Mystery, Thriller | 8.5 | Alfred Hitchcock | James Stewart | Grace Kelly | Wendell Corey | Thelma Ritter | 471860 | 100.0 | 36.76 |
| 2 | The Godfather | 1972 | R | 175 | Crime, Drama | 9.2 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1732749 | 100.0 | 134.97 |
| 224 | Fanny and Alexander | 1982 | R | 188 | Drama | 8.1 | Ingmar Bergman | Bertil Guve | Pernilla Allwin | Kristina Adolphson | Börje Ahlstedt | 62011 | 100.0 | 4.97 |
| 113 | Singin' in the Rain | 1952 | G | 103 | Comedy, Musical, Romance | 8.3 | Stanley Donen | Gene Kelly | Gene Kelly | Donald O'Connor | Debbie Reynolds | 232106 | 99.0 | 8.82 |
| 52 | City Lights | 1931 | G | 87 | Comedy, Drama, Romance | 8.5 | Charles Chaplin | Charles Chaplin | Virginia Cherrill | Florence Lee | Harry Myers | 178296 | 99.0 | 0.02 |
| 31 | Seven Samurai | 1954 | Not Rated | 207 | Action, Drama | 8.6 | Akira Kurosawa | Toshirô Mifune | Takashi Shimura | Keiko Tsushima | Yukiko Shimazaki | 333085 | 98.0 | 0.27 |
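
The call above orders by `metascore` only, so the films tied at 100 are not arranged by user rating. To bring the rating into the ranking as a tie-breaker, one option is to sort on both columns. A sketch on three rows copied from the table above:

```python
import pandas as pd

toy = pd.DataFrame({
    "movie name": ["Vertigo", "Casablanca", "The Godfather"],
    "rating(10)": [8.3, 8.5, 9.2],
    "metascore": [100.0, 100.0, 100.0],
})

# Sort by metascore first, then by user rating within equal metascores.
ranked = toy.sort_values(by=["metascore", "rating(10)"], ascending=[False, False])
print(ranked["movie name"].tolist())
```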


### Top 10 genres

In [113]:
df['genre'].value_counts().head(10).to_frame()

| genre | count |
| --- | --- |
| Drama | 22 |
| Crime, Drama | 14 |
| Biography, Drama, History | 9 |
| Drama, War | 8 |
| Crime, Drama, Mystery | 8 |
| Animation, Adventure, Comedy | 8 |
| Action, Crime, Drama | 7 |
| Drama, Romance | 7 |
| Action, Adventure, Drama | 6 |
| Action, Adventure, Sci-Fi | 5 |


### Top 10 Directors

In [114]:
df['director'].value_counts().head(10).to_frame()

| director | count |
| --- | --- |
| Christopher Nolan | 7 |
| Stanley Kubrick | 7 |
| Akira Kurosawa | 7 |
| Martin Scorsese | 7 |
| Alfred Hitchcock | 6 |
| Steven Spielberg | 6 |
| Charles Chaplin | 5 |
| Ingmar Bergman | 5 |
| Quentin Tarantino | 5 |
| Billy Wilder | 5 |


### Top 10 First Actors

In [115]:
df['actor 1'].value_counts().head(10).to_frame()

| actor 1 | count |
| --- | --- |
| Robert De Niro | 6 |
| Tom Hanks | 5 |
| Charles Chaplin | 5 |
| Leonardo DiCaprio | 5 |
| James Stewart | 4 |
| Clint Eastwood | 4 |
| Toshirô Mifune | 4 |
| Christian Bale | 4 |
| Al Pacino | 3 |
| Harrison Ford | 3 |


### Top 10 Second Actors

In [116]:
df['actor 2'].value_counts().head(10).to_frame()

| actor 2 | count |
| --- | --- |
| Matt Damon | 3 |
| Robert De Niro | 3 |
| Harrison Ford | 3 |
| Paulette Goddard | 2 |
| Liv Ullmann | 2 |
| Alec Guinness | 2 |
| Brad Pitt | 2 |
| Julie Delpy | 2 |
| Joseph Cotten | 2 |
| Robert Downey Jr. | 2 |


In [118]:
df['genre'].unique()

array(['Crime, Drama', 'Drama', 'Action, Crime, Drama',
       'Action, Adventure, Drama', 'Biography, Drama, History',
       'Action, Adventure, Sci-Fi', 'Action, Adventure, Fantasy',
       'Drama, Romance', 'Adventure, Western', 'Action, Sci-Fi',
       'Biography, Crime, Drama', 'Comedy, Drama, Thriller',
       'Adventure, Drama, Sci-Fi', 'Animation, Adventure, Family',
       'Drama, War', 'Crime, Drama, Fantasy', 'Comedy, Drama, Romance',
       'Crime, Drama, Mystery', 'Crime, Drama, Horror',
       'Action, Drama, Mystery', 'Action, Drama',
       'Drama, Family, Fantasy', 'Drama, Music',
       'Biography, Comedy, Drama', 'Drama, Mystery, Thriller',
       'Biography, Drama, Music', 'Animation, Adventure, Drama',
       'Animation, Drama, War', 'Adventure, Comedy, Sci-Fi', 'Western',
       'Horror, Mystery, Thriller', 'Mystery, Thriller',
       'Drama, Romance, War', 'Comedy, Drama, Family',
       'Crime, Drama, Thriller', 'Animation, Drama, Fantasy',
       'Action, Biog
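
Each `genre` value is a comma-separated combination, so counting whole strings treats `'Crime, Drama'` and `'Drama'` as unrelated categories. If the goal is to count individual genres, one possible approach is to split and explode the column first. A sketch on three of the values listed above:

```python
import pandas as pd

genres = pd.Series(["Crime, Drama", "Drama", "Action, Crime, Drama"], name="genre")

# Split each combination into a list, give every genre its own row, then count.
single_counts = genres.str.split(", ").explode().value_counts()
print(single_counts)
```

On the full data this would be `df['genre'].str.split(', ').explode().value_counts()`, assuming every combination uses `', '` as the separator.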

# Business Questions Based on the Data

### 1- For the Drama genre, which years have the highest total gross collection?

In [127]:
df[df['genre']=='Drama'].groupby(['year'])['gross collection(M-$)'].sum().sort_values(ascending=False).head(10).to_frame()

| year | gross collection(M-$) |
| --- | --- |
| 1999 | 167.13 |
| 2008 | 148.1 |
| 1975 | 112.0 |
| 2007 | 40.22 |
| 1994 | 28.34 |
| 1996 | 16.5 |
| 2011 | 7.1 |
| 1998 | 6.72 |
| 1982 | 4.97 |
| 2000 | 3.64 |


### 2- What is the average rating for each director?

In [132]:
df.groupby('director')['rating(10)'].mean().to_frame()

| director | rating(10) |
| --- | --- |
| Aamir Khan | 8.400000 |
| Adam Elliot | 8.100000 |
| Akira Kurosawa | 8.314286 |
| Alejandro G. Iñárritu | 8.100000 |
| Alfred Hitchcock | 8.316667 |
| ... | ... |
| William Wyler | 8.100000 |
| Wim Wenders | 8.100000 |
| Wolfgang Petersen | 8.300000 |
| Yasujirô Ozu | 8.200000 |
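
An average over a single film is not directly comparable with an average over seven, so it can help to show the number of films next to the mean and to sort the result. A minimal sketch on a toy frame (the director labels and ratings are placeholders, not values from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "director": ["Director A", "Director A", "Director B"],
    "rating(10)": [8.2, 8.6, 8.5],
})

# Mean rating and film count per director, highest mean first.
summary = (
    toy.groupby("director")["rating(10)"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```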


### 3- Is there a correlation between rating and votes?

In [136]:
df[['rating(10)', 'votes']].corr()

| | rating(10) | votes |
| --- | --- | --- |
| rating(10) | 1.0 | 0.556386 |
| votes | 0.556386 | 1.0 |
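
`DataFrame.corr()` uses the Pearson coefficient by default, so the 0.556 above measures linear association between rating and vote count: a moderate positive relationship. As a sketch of what is being computed, here is the same coefficient written out in plain Python on made-up numbers:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Illustrative values only, not taken from the dataset.
ratings = [8.1, 8.3, 8.6, 9.2]
votes = [100, 300, 200, 900]
print(round(pearson(ratings, votes), 3))
```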


### 4- What is the average gross collection after 2000 and before 2000?

In [142]:
msk1=df['year']>2000
msk2=df['year']<2000
df[msk1]['gross collection(M-$)'].mean(),df[msk2]['gross collection(M-$)'].mean()

(136.17291139240507, 60.27147286821706)
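
The first value is the post-2000 mean and the second the pre-2000 mean, both in millions of dollars. Because both masks are strict, films released in 2000 itself fall into neither group. If every film should land in exactly one group, an alternative is to label each row with a period and group on that label. A sketch with made-up values, where the choice to count 2000 with the later period is an assumption:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "year": [1994, 2000, 2008, 2014],
    "gross collection(M-$)": [10.0, np.nan, 30.0, 50.0],  # made-up values
})

# Label every row with a period; here 2000 is counted with the later group.
period = np.where(toy["year"] >= 2000, "2000 and later", "before 2000")

# mean() skips NaN, so films without a gross value do not affect the average.
avg_gross = toy.groupby(period)["gross collection(M-$)"].mean()
print(avg_gross)
```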

### 5- Is there a correlation between gross collection and rating?

In [144]:
df[['gross collection(M-$)','rating(10)']].corr()

| | gross collection(M-$) | rating(10) |
| --- | --- | --- |
| gross collection(M-$) | 1.0 | 0.206913 |
| rating(10) | 0.206913 | 1.0 |


### 6- Which genre has director Martin Scorsese worked in most often?

In [150]:
df[df['director']=='Martin Scorsese']['genre'].value_counts()

genre
Crime, Drama                2
Biography, Crime, Drama     1
Action, Crime, Drama        1
Mystery, Thriller           1
Biography, Comedy, Crime    1
Biography, Drama, Sport     1
Name: count, dtype: int64

### 7- What is the highest rating among movies with Tim Robbins as actor 1?

In [152]:
df[df['actor 1']=='Tim Robbins']['rating(10)'].max()

9.3
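
`max()` returns only the rating. To see which film it belongs to, one option is to look up the row with `idxmax()`. A sketch on a toy frame: the first row is taken from the tables in this notebook, the second is a hypothetical entry added so there is something to compare against.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie name": ["The Shawshank Redemption", "Hypothetical Film"],
    "actor 1": ["Tim Robbins", "Tim Robbins"],
    "rating(10)": [9.3, 8.1],
})

robbins = toy[toy["actor 1"] == "Tim Robbins"]

# idxmax() gives the index label of the highest rating; loc fetches that row.
best_row = robbins.loc[robbins["rating(10)"].idxmax()]
print(best_row["movie name"], best_row["rating(10)"])
```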