# IMDB Top 1000 Movies — Data Preparation & Feature Engineering

This notebook documents the data cleaning and feature engineering steps applied to the original IMDB Top 1000 Movies dataset prior to analysis in Power BI.

## 1. Load Raw Dataset

The raw dataset is loaded from the original Kaggle source and inspected to identify data quality issues.

## 2. Data Cleaning

Several numerical fields are stored as text and require type conversion before analysis.
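
As a minimal sketch of this kind of conversion, the snippet below uses a small hypothetical frame (`sample`) rather than the real dataset. It assumes `Gross` is stored with comma thousands separators, which may not hold for every copy of the file. `pd.to_numeric` with `errors="coerce"` turns unparseable entries into `NaN`, which makes offending rows easy to locate before fixing them.

```python
import pandas as pd

# Hypothetical toy frame mimicking text-typed numeric columns (illustrative values only).
sample = pd.DataFrame({
    "Released_Year": ["1994", "PG", "2008"],
    "Gross": ["28,341,469", None, "534,858,444"],
})

# errors="coerce" converts unparseable entries to NaN instead of raising.
year_numeric = pd.to_numeric(sample["Released_Year"], errors="coerce")

# Rows whose year could not be parsed and need a manual fix.
bad_year_rows = sample[year_numeric.isna()]

# Assumption: thousands separators are commas. Missing values remain NaN.
gross_numeric = pd.to_numeric(
    sample["Gross"].str.replace(",", "", regex=False),
    errors="coerce",
)
```

Inspecting `bad_year_rows` first, then correcting the values, keeps the later integer cast from failing on a stray non-numeric entry.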

## 3. Runtime Transformation

Movie runtime is converted from text format (e.g. "136 min") to integer minutes to support quantitative analysis.
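
One possible way to do this, sketched on a hypothetical Series (`runtime_text`), is to capture the digits with a regular expression instead of removing the unit. This is an alternative formulation and it tolerates extra whitespace around the number.

```python
import pandas as pd

# Hypothetical sample values in the same "<number> min" format.
runtime_text = pd.Series(["136 min", "96 min", "142 min"])

# Capture the first run of digits in each string, then cast to integers.
runtime_minutes = (
    runtime_text
    .str.extract(r"(\d+)", expand=False)
    .astype("int64")
)
```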

## 4. Feature Engineering — Main Genre

Multiple IMDB genres are grouped into high-level categories to simplify analysis and align with business use cases.

## 5. Export Clean Dataset

The final cleaned and feature-engineered dataset is exported for use in Power BI.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("imdb_top_1000.csv")

In [3]:
df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [4]:
df.dtypes

Poster_Link       object
Series_Title      object
Released_Year     object
Certificate       object
Runtime           object
Genre             object
IMDB_Rating      float64
Overview          object
Meta_score       float64
Director          object
Star1             object
Star2             object
Star3             object
Star4             object
No_of_Votes        int64
Gross             object
dtype: object

In [5]:
df.sort_values(by="IMDB_Rating", ascending=False)

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,https://m.media-amazon.com/images/M/MV5BNGEwMT...,Breakfast at Tiffany's,1961,A,115 min,"Comedy, Drama, Romance",7.6,A young New York socialite becomes interested ...,76.0,Blake Edwards,Audrey Hepburn,George Peppard,Patricia Neal,Buddy Ebsen,166544,
996,https://m.media-amazon.com/images/M/MV5BODk3Yj...,Giant,1956,G,201 min,"Drama, Western",7.6,Sprawling epic covering the life of a Texas ca...,84.0,George Stevens,Elizabeth Taylor,Rock Hudson,James Dean,Carroll Baker,34075,
997,https://m.media-amazon.com/images/M/MV5BM2U3Yz...,From Here to Eternity,1953,Passed,118 min,"Drama, Romance, War",7.6,"In Hawaii in 1941, a private is cruelly punish...",85.0,Fred Zinnemann,Burt Lancaster,Montgomery Clift,Deborah Kerr,Donna Reed,43374,30500000
998,https://m.media-amazon.com/images/M/MV5BZTBmMj...,Lifeboat,1944,,97 min,"Drama, War",7.6,Several survivors of a torpedoed merchant ship...,78.0,Alfred Hitchcock,Tallulah Bankhead,John Hodiak,Walter Slezak,William Bendix,26471,


In [6]:
df = df.drop(columns=["Poster_Link"])

In [7]:
df.isna().sum()

Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

In [8]:
df.iloc[966]

Series_Title                                             Apollo 13
Released_Year                                                   PG
Certificate                                                      U
Runtime                                                    140 min
Genre                                    Adventure, Drama, History
IMDB_Rating                                                    7.6
Overview         NASA must devise a strategy to return Apollo 1...
Meta_score                                                    77.0
Director                                                Ron Howard
Star1                                                    Tom Hanks
Star2                                                  Bill Paxton
Star3                                                  Kevin Bacon
Star4                                                  Gary Sinise
No_of_Votes                                                 269197
Gross                                                  173,837

In [9]:
df[df["Series_Title"]=="Apollo 13"]

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
966,Apollo 13,PG,U,140 min,"Adventure, Drama, History",7.6,NASA must devise a strategy to return Apollo 1...,77.0,Ron Howard,Tom Hanks,Bill Paxton,Kevin Bacon,Gary Sinise,269197,173837933


In [10]:
df["Released_Year"].value_counts()

Released_Year
2014    32
2004    31
2009    29
2013    28
2016    28
        ..
1920     1
1930     1
1922     1
1943     1
PG       1
Name: count, Length: 100, dtype: int64

In [11]:
# Row 966 (Apollo 13) holds "PG" in Released_Year; replace it with the film's release year.
df.loc[966, "Released_Year"] = "1995"

In [12]:
df["Released_Year"] = df["Released_Year"].astype(int)

In [13]:
df.dtypes

Series_Title      object
Released_Year      int64
Certificate       object
Runtime           object
Genre             object
IMDB_Rating      float64
Overview          object
Meta_score       float64
Director          object
Star1             object
Star2             object
Star3             object
Star4             object
No_of_Votes        int64
Gross             object
dtype: object

In [14]:
df["Genre"].nunique()

202

In [15]:
def map_main_genre(genres):
    if pd.isna(genres):
        return "Other"

    genres = genres.lower()

    if "action" in genres or "adventure" in genres:
        return "Action"
    elif "crime" in genres or "thriller" in genres or "mystery" in genres:
        return "Crime & Thriller"
    elif "sci-fi" in genres or "fantasy" in genres:
        return "Sci-Fi & Fantasy"
    elif "drama" in genres:
        return "Drama"
    elif "comedy" in genres:
        return "Comedy"
    elif "romance" in genres:
        return "Romance"
    elif "animation" in genres:
        return "Animation"
    else:
        return "Other"


In [16]:
df["Main_Genre"] = df["Genre"].apply(map_main_genre)

df[["Genre", "Main_Genre"]].head(15)


Unnamed: 0,Genre,Main_Genre
0,Drama,Drama
1,"Crime, Drama",Crime & Thriller
2,"Action, Crime, Drama",Action
3,"Crime, Drama",Crime & Thriller
4,"Crime, Drama",Crime & Thriller
5,"Action, Adventure, Drama",Action
6,"Crime, Drama",Crime & Thriller
7,"Biography, Drama, History",Drama
8,"Action, Adventure, Sci-Fi",Action
9,Drama,Drama


In [17]:
df["Main_Genre"].value_counts()

Main_Genre
Drama               363
Action              302
Crime & Thriller    247
Sci-Fi & Fantasy     48
Comedy               34
Other                 6
Name: count, dtype: int64
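
The counts above contain no "Romance" or "Animation" rows. A likely explanation is rule precedence: the mapping returns on the first match, so a title tagged with several genres is claimed by whichever rule is checked earliest. The sketch below restates the same precedence as an ordered table. `GENRE_RULES` and `map_main_genre_ordered` are hypothetical names introduced for illustration, and the example genre strings are made up rather than taken from the dataset.

```python
# Ordered (keywords, label) rules; the first rule with a matching keyword wins.
GENRE_RULES = [
    (("action", "adventure"), "Action"),
    (("crime", "thriller", "mystery"), "Crime & Thriller"),
    (("sci-fi", "fantasy"), "Sci-Fi & Fantasy"),
    (("drama",), "Drama"),
    (("comedy",), "Comedy"),
    (("romance",), "Romance"),
    (("animation",), "Animation"),
]


def map_main_genre_ordered(genres):
    """Hypothetical table-driven variant with the same rule order."""
    # Treat missing or non-text values as "Other".
    if not isinstance(genres, str):
        return "Other"
    text = genres.lower()
    for keywords, label in GENRE_RULES:
        if any(keyword in text for keyword in keywords):
            return label
    return "Other"


# Illustrative genre strings showing how earlier rules shadow later ones.
examples = [
    "Animation, Adventure, Comedy",
    "Comedy, Romance",
    "Animation, Drama, War",
    "Western",
]
mapped = {genres: map_main_genre_ordered(genres) for genres in examples}
```

Here the animated adventure lands in "Action" and the romantic comedy in "Comedy". Reordering the table is the simplest lever if a different precedence is wanted.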

In [26]:
df.head(15)

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Main_Genre
0,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469,Drama
1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,Crime & Thriller
2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444,Action
3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000,Crime & Thriller
4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000,Crime & Thriller
5,The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905,Action
6,Pulp Fiction,1994,A,154 min,"Crime, Drama",8.9,"The lives of two mob hitmen, a boxer, a gangst...",94.0,Quentin Tarantino,John Travolta,Uma Thurman,Samuel L. Jackson,Bruce Willis,1826188,107928762,Crime & Thriller
7,Schindler's List,1993,A,195 min,"Biography, Drama, History",8.9,"In German-occupied Poland during World War II,...",94.0,Steven Spielberg,Liam Neeson,Ralph Fiennes,Ben Kingsley,Caroline Goodall,1213505,96898818,Drama
8,Inception,2010,UA,148 min,"Action, Adventure, Sci-Fi",8.8,A thief who steals corporate secrets through t...,74.0,Christopher Nolan,Leonardo DiCaprio,Joseph Gordon-Levitt,Elliot Page,Ken Watanabe,2067042,292576195,Action
9,Fight Club,1999,A,139 min,Drama,8.8,An insomniac office worker and a devil-may-car...,66.0,David Fincher,Brad Pitt,Edward Norton,Meat Loaf,Zach Grenier,1854740,37030102,Drama


In [18]:
df["Runtime"] = (
    df["Runtime"]
    .str.replace("min", "", regex=False)  # remove the "min" suffix
    .astype("int64")                      # convert to integer
)

In [19]:
df.dtypes

Series_Title      object
Released_Year      int64
Certificate       object
Runtime            int64
Genre             object
IMDB_Rating      float64
Overview          object
Meta_score       float64
Director          object
Star1             object
Star2             object
Star3             object
Star4             object
No_of_Votes        int64
Gross             object
Main_Genre        object
dtype: object

In [20]:
df.head()

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Main_Genre
0,The Shawshank Redemption,1994,A,142,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469,Drama
1,The Godfather,1972,A,175,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,Crime & Thriller
2,The Dark Knight,2008,UA,152,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444,Action
3,The Godfather: Part II,1974,A,202,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000,Crime & Thriller
4,12 Angry Men,1957,U,96,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000,Crime & Thriller


In [22]:
df["Runtime"].unique()

array([142, 175, 152, 202,  96, 201, 154, 195, 148, 139, 178, 161, 179,
       136, 146, 124, 133, 160, 132, 153, 169, 130, 125, 189, 116, 127,
       118, 121, 207, 122, 106, 112, 151, 150, 155, 119, 110,  88, 137,
        89, 165, 109, 102,  87, 126, 147, 117, 181, 149, 105, 164, 170,
        98, 101, 113, 134, 229, 115, 143,  95, 104, 123, 131, 108,  81,
        99, 114, 129, 228, 128, 103, 107,  68, 138, 156, 167, 163, 186,
       321, 135, 140, 180, 158, 210,  86, 162, 177, 204,  91, 172,  45,
       145, 100, 196,  93, 120,  92, 144,  80, 183, 111, 141, 224, 171,
       188,  94, 185,  85, 205, 212, 238,  72,  67,  76, 159,  83,  90,
        84, 191, 197, 174,  97,  75, 157, 209,  82, 220,  64, 184, 168,
       166, 192, 194, 193,  69,  70, 242,  79,  71,  78])

In [23]:
bins = [0, 90, 120, 150, 1000]
labels = [
    "< 1h30",
    "1h30–2h",
    "2h–2h30",
    "2h30+"
]


df["Runtime_Avg"] = pd.cut(df["Runtime"], bins=bins, labels=labels)
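
A point worth checking with `pd.cut` is which bucket the boundary values fall into. By default the intervals are closed on the right, so a 90-minute film is placed in "< 1h30" even though it runs exactly 1h30. The sketch below uses a hypothetical Series of edge values (`edge_runtimes`) to compare the default with `right=False`. It illustrates the behaviour and does not change the notebook's bucketing.

```python
import pandas as pd

bins = [0, 90, 120, 150, 1000]
labels = ["< 1h30", "1h30–2h", "2h–2h30", "2h30+"]

# Hypothetical runtimes sitting on or just past each bin edge.
edge_runtimes = pd.Series([90, 91, 120, 121, 150, 151])

# Default right=True: intervals are (0, 90], (90, 120], (120, 150], (150, 1000].
right_closed = pd.cut(edge_runtimes, bins=bins, labels=labels)

# right=False: intervals are [0, 90), [90, 120), [120, 150), [150, 1000).
left_closed = pd.cut(edge_runtimes, bins=bins, labels=labels, right=False)

comparison = pd.DataFrame({
    "Runtime": edge_runtimes,
    "right_closed": right_closed,
    "left_closed": left_closed,
})
```

With `right=False`, the labels read more literally: 90 minutes falls in "1h30–2h" and 150 minutes in "2h30+".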

In [26]:
df["Runtime_Avg"].value_counts()

Runtime_Avg
1h30–2h    445
2h–2h30    330
2h30+      147
< 1h30      78
Name: count, dtype: int64

In [25]:
df[["Runtime", "Runtime_Avg"]].sample(10)

Unnamed: 0,Runtime,Runtime_Avg
639,109,1h30–2h
101,81,< 1h30
324,130,2h–2h30
347,146,2h–2h30
594,134,2h–2h30
421,118,1h30–2h
99,126,2h–2h30
216,123,2h–2h30
194,45,< 1h30
141,134,2h–2h30


In [27]:
runtime_insight = (
    df
    .groupby("Runtime_Avg")
    .agg(
        Avg_Rating=("IMDB_Rating", "mean"),
        Movie_Count=("IMDB_Rating", "count")
    )
    .reset_index()
)

runtime_insight


Unnamed: 0,Runtime_Avg,Avg_Rating,Movie_Count
0,< 1h30,7.917949,78
1,1h30–2h,7.89236,445
2,2h–2h30,7.967576,330
3,2h30+,8.097279,147
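
Because `pd.cut` produces a categorical column, recent pandas versions may emit a `FutureWarning` about the default of the `observed` argument when grouping on it. A minimal sketch, using a hypothetical toy frame (`toy`) with made-up values, passes `observed=True` explicitly so the behaviour does not depend on the installed version.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Runtime": [80, 100, 110, 130, 200],
    "IMDB_Rating": [7.8, 8.0, 8.2, 7.9, 8.5],
})
toy["Runtime_Avg"] = pd.cut(
    toy["Runtime"],
    bins=[0, 90, 120, 150, 1000],
    labels=["< 1h30", "1h30–2h", "2h–2h30", "2h30+"],
)

# observed=True keeps only categories that actually occur in the data.
summary = (
    toy
    .groupby("Runtime_Avg", observed=True)
    .agg(
        Avg_Rating=("IMDB_Rating", "mean"),
        Movie_Count=("IMDB_Rating", "count"),
    )
    .reset_index()
)
```

When every bucket is populated, as in the table above, `observed=True` and `observed=False` give the same rows. They differ only when a bucket is empty.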


In [28]:
df.head()

Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Main_Genre,Runtime_Avg
0,The Shawshank Redemption,1994,A,142,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469,Drama,2h–2h30
1,The Godfather,1972,A,175,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411,Crime & Thriller,2h30+
2,The Dark Knight,2008,UA,152,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444,Action,2h30+
3,The Godfather: Part II,1974,A,202,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000,Crime & Thriller,2h30+
4,12 Angry Men,1957,U,96,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000,Crime & Thriller,1h30–2h


In [29]:
df.to_csv("IMDB.csv", index=False)
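
A quick round trip can confirm that the export reads back as expected. The sketch below writes a hypothetical two-row frame (`toy`) to an in-memory buffer instead of a file. It shows that `index=False` keeps the row index out of the CSV, so no extra unnamed column appears on reload.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned frame.
toy = pd.DataFrame({"Series_Title": ["A", "B"], "Runtime": [142, 96]})

# Write to an in-memory text buffer rather than disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV back.
buffer.seek(0)
restored = pd.read_csv(buffer)
```

Note that CSV stores no type information, so a categorical column such as `Runtime_Avg` comes back as plain text and its category order has to be set again by the consuming tool.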