# CinemAI: A Machine Learning Approach for Accurate Movie Quality Prediction

## Preface

The U.S. movie industry is a thriving economic powerhouse, with its value reaching an impressive [$95.45 billion](https://www.zippia.com/advice/us-film-industry-statistics/#:~:text=The%20U.S.%20movie%20industry%20is%20worth%20%2495.45%20billion%20as%20of,4.1%25%20from%202018%20to%202025.) in 2022. Moreover, the industry is projected to maintain a Compound Annual Growth Rate (CAGR) of 4.1% from 2018 to 2025, indicating promising prospects for those working within this domain. However, this growth presents a double-edged sword. While it ensures ample job opportunities for industry employees, it also intensifies the competition among filmmakers to create captivating films that capture audience attention and satisfaction.

To see why predicting movie quality matters, consider a case from the past. In 2005, director Breck Eisner helmed an action-adventure film called ["Sahara"](https://en.wikipedia.org/wiki/Sahara_(2005_film)#:~:text=Sahara%20grossed%20%24119%20million%20worldwide,office%20failures%20of%20all%2Dtime.), based on Clive Cussler's bestselling novel of the same name. Despite a production cost of $160 million, the movie grossed only $119 million worldwide, failing to recoup its expenses. The film encountered numerous challenges, including legal disputes among the crew and violations of international laws, which substantially escalated production costs. However, the primary factor behind its lackluster performance was the lack of creativity and clear goals in its storyline. As a consequence, the movie failed to capture sufficient attention and delivered disappointing box office returns.

Hence, predicting movie quality is important for the following reasons:

1. Audience Satisfaction: Predicting movie quality allows filmmakers, production studios, and distributors to gauge the potential reception of a movie by the audience. By identifying whether a movie is likely to be good or bad, they can make informed decisions about marketing strategies, release dates, and investment returns.

2. Financial Success: As the movie industry is a highly competitive and costly business, accurately predicting movie quality helps minimize financial risks by allowing stakeholders to invest their resources wisely. It aids in identifying potential box office successes, maximizing revenue, and minimizing losses.

3. Resource Allocation: Predicting movie quality enables better allocation of resources during the production process. Filmmakers can make adjustments, such as script revisions, casting choices, and production enhancements, to improve the overall quality of the movie based on the predictions. This helps optimize resource allocation and increases the chances of creating a well-received film.

4. Critical Reception: Movie quality predictions can also influence critical reception and industry recognition. Positive reviews and critical acclaim can enhance a movie's reputation, leading to increased exposure, award nominations, and overall industry impact. Accurate predictions allow filmmakers to target the critical reception aspect and potentially elevate the movie's stature.

5. Audience Engagement: Predicting movie quality helps in tailoring marketing and promotional campaigns to attract the target audience. By understanding the expected quality of a movie, marketers can design effective strategies to engage viewers, generate buzz, and drive ticket sales or streaming numbers.

The ability to predict movie quality holds immense significance in the movie industry. It serves as a valuable decision-making tool for stakeholders, enabling them to optimize their efforts, allocate resources wisely, and make informed investments. Ultimately, accurate predictions empower filmmakers, production studios, and distributors to deliver highly satisfying experiences to audiences while maximizing commercial success in an increasingly competitive landscape.

## Aim

The aim of this project is to leverage textual inputs to develop a predictive model that accurately assesses the quality of a movie. Using machine learning techniques, the goal is to classify a movie as either 'Good' or 'Bad', providing insight into its potential reception among audiences. This project seeks to give movie enthusiasts, prospective viewers, and filmmakers a reliable tool to evaluate the anticipated quality of a movie based on textual information, supporting decision-making and fostering a deeper understanding of cinematic excellence.

## Data

1. The data is web-scraped from [IMDB](https://www.imdb.com/).

## Project Stakeholders

1. Movie enthusiasts seeking deeper insights into various types of movies.
2. Curious individuals who want to decide whether an unreleased movie is worth watching.
3. Movie directors who want to evaluate the potential success or failure of their upcoming projects.

In [3]:
# import libraries
import numpy as np
import pandas as pd
import re

In [4]:
df = pd.read_csv('datasets/all_movies_02.csv')

In [5]:
df.shape

(700698, 7)

In [4]:
df.head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors
0,Comedy,Beef,(2023– ),TV-MA,8.2,Two people let a road rage incident burrow int...,"['Steven Yeun', 'Ali Wong', 'Joseph Lee', 'You..."
1,Comedy,Succession,(2018–2023),TV-MA,8.8,The Roy family is known for controlling the bi...,"['Nicholas Braun', 'Brian Cox', 'Kieran Culkin..."
2,Comedy,The Super Mario Bros. Movie,(2023),PG,7.3,The story of The Super Mario Bros. on their jo...,"['Aaron Horvath', 'Michael Jelenic', 'Pierre L..."
3,Comedy,Ted Lasso,(2020– ),TV-MA,8.8,American college football coach Ted Lasso head...,"['Jason Sudeikis', 'Brett Goldstein', 'Hannah ..."
4,Comedy,Ghosted,(2023),PG-13,5.8,Cole falls head over heels for enigmatic Sadie...,"['Dexter Fletcher', '| ', ' Stars:', 'Chris..."


Remarks:
1. The web-scraped data has been read in.

## Data Cleaning

### Actors column

Some rows in the Actors column contain undesired strings, and each value is stored as a single string rather than as a list.

In [5]:
df['Actors'].head(10)

0    ['Steven Yeun', 'Ali Wong', 'Joseph Lee', 'You...
1    ['Nicholas Braun', 'Brian Cox', 'Kieran Culkin...
2    ['Aaron Horvath', 'Michael Jelenic', 'Pierre L...
3    ['Jason Sudeikis', 'Brett Goldstein', 'Hannah ...
4    ['Dexter Fletcher', '| ', '    Stars:', 'Chris...
5    ['James Marsden', 'Alan Barinholtz', 'Susan Be...
6    ['Ari Aster', '| ', '    Stars:', 'Joaquin Pho...
7    ['Bill Hader', 'Stephen Root', 'Sarah Goldberg...
8    ['Rachel Brosnahan', 'Alex Borstein', 'Michael...
9    ['Chris McKay', '| ', '    Stars:', 'Nicholas ...
Name: Actors, dtype: object

Strategy: 
1. From the web scraping, it is known that anything before the string 'Stars:' is a movie director.
2. The next few steps filter out these rows and remove the undesired values from them.
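
The strategy above can be sketched on a tiny hand-made sample. This is only an illustration and not the notebook's own code: the `split_credits` helper and the sample rows are hypothetical, and it assumes that the `'| '` separator and the `'Stars:'` marker appear as in the rows shown earlier. It also keeps the director names in a separate list instead of discarding them.

```python
import ast

import pandas as pd

# Hypothetical sample mimicking the scraped 'Actors' strings shown above.
sample = pd.Series([
    "['Steven Yeun', 'Ali Wong', 'Joseph Lee', 'Young Mazino']",
    "['Dexter Fletcher', '| ', '    Stars:', 'Chris Evans', 'Ana de Armas']",
])


def split_credits(raw):
    """Return (directors, stars) parsed from one scraped string."""
    names = ast.literal_eval(raw)  # parse the list literal without executing code
    # Locate the marker while ignoring surrounding whitespace.
    marker = next((i for i, n in enumerate(names) if n.strip() == 'Stars:'), None)
    if marker is None:
        return [], names  # no marker: every entry is treated as a star
    directors = [n for n in names[:marker] if n.strip() != '|']
    return directors, names[marker + 1:]


parsed = [split_credits(raw) for raw in sample]
directors = [d for d, _ in parsed]
stars = [s for _, s in parsed]
```

Comparing on `n.strip()` rather than the exact string makes the sketch tolerant of differing amounts of leading whitespace before `Stars:`.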

In [6]:
redundant_val = df[df['Actors'].str.contains('Stars:')]

In [7]:
redundant_val['Actors'].head()

2     ['Aaron Horvath', 'Michael Jelenic', 'Pierre L...
4     ['Dexter Fletcher', '| ', '    Stars:', 'Chris...
6     ['Ari Aster', '| ', '    Stars:', 'Joaquin Pho...
9     ['Chris McKay', '| ', '    Stars:', 'Nicholas ...
10    ['John Francis Daley', 'Jonathan Goldstein', '...
Name: Actors, dtype: object

In [8]:
df['Actors'][0].split("''")

["['Steven Yeun', 'Ali Wong', 'Joseph Lee', 'Young Mazino']"]

In [9]:
import ast  # literal_eval parses the list literal without executing arbitrary code
df['Actors'] = df['Actors'].apply(ast.literal_eval)

In [10]:
df['Actors'][0]

['Steven Yeun', 'Ali Wong', 'Joseph Lee', 'Young Mazino']

In [11]:
df['Actors'][4]

['Dexter Fletcher',
 '| ',
 '    Stars:',
 'Chris Evans',
 'Ana de Armas',
 'Adrien Brody',
 'Mike Moh']

In [12]:
x = df['Actors'][4].index('    Stars:')
df['Actors'][4][x+1:]

['Chris Evans', 'Ana de Armas', 'Adrien Brody', 'Mike Moh']

In [13]:
df['Actors'][4][df['Actors'][4].index('    Stars:')+1:]

['Chris Evans', 'Ana de Armas', 'Adrien Brody', 'Mike Moh']

In [14]:
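def remove_before_stars(lst):
    # Keep only the names after the 'Stars:' marker (directors come before it).
    try:
        return lst[lst.index('    Stars:')+1:]
    except ValueError:
        # No marker in this row: the list already contains only actors.
        return lst

df['Actors'] = df['Actors'].apply(remove_before_stars)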

In [15]:
df['Actors'][4]

['Chris Evans', 'Ana de Armas', 'Adrien Brody', 'Mike Moh']

In [16]:
df.head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors
0,Comedy,Beef,(2023– ),TV-MA,8.2,Two people let a road rage incident burrow int...,"[Steven Yeun, Ali Wong, Joseph Lee, Young Mazino]"
1,Comedy,Succession,(2018–2023),TV-MA,8.8,The Roy family is known for controlling the bi...,"[Nicholas Braun, Brian Cox, Kieran Culkin, Pet..."
2,Comedy,The Super Mario Bros. Movie,(2023),PG,7.3,The story of The Super Mario Bros. on their jo...,"[Chris Pratt, Anya Taylor-Joy, Charlie Day, Ja..."
3,Comedy,Ted Lasso,(2020– ),TV-MA,8.8,American college football coach Ted Lasso head...,"[Jason Sudeikis, Brett Goldstein, Hannah Waddi..."
4,Comedy,Ghosted,(2023),PG-13,5.8,Cole falls head over heels for enigmatic Sadie...,"[Chris Evans, Ana de Armas, Adrien Brody, Mike..."


Remarks: 
1. The Actors column has been cleaned up.
2. The lists shown contain 4 actors each. Splitting these 4 actors into 4 columns will make further data exploration easier.
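
The split described above can also be written as a single loop. The following is a minimal sketch on a hypothetical sample frame named `demo`, where one row deliberately has fewer than four names to show how missing positions are handled.

```python
import pandas as pd

# Hypothetical sample; the third row has fewer than four names on purpose.
actors = pd.Series([
    ['Steven Yeun', 'Ali Wong', 'Joseph Lee', 'Young Mazino'],
    ['Chris Evans', 'Ana de Armas', 'Adrien Brody', 'Mike Moh'],
    ['Patrick Brammall', 'Harriet Dyer'],
])
demo = pd.DataFrame({'Actors': actors})

# One column per billing position; positions that do not exist become None.
for position in range(4):
    demo[f'Actor_{position + 1}'] = demo['Actors'].apply(
        lambda names, i=position: names[i] if len(names) > i else None
    )
```

Binding `i=position` as a default argument fixes the position for each lambda, which avoids relying on the loop variable at call time.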

In [17]:
df['Actors'][0]

['Steven Yeun', 'Ali Wong', 'Joseph Lee', 'Young Mazino']

In [18]:
df['Actor_1'] = df['Actors'].apply(lambda x: x[0] if len(x) > 0 else None)

In [19]:
df['Actor_1'].head()

0       Steven Yeun
1    Nicholas Braun
2       Chris Pratt
3    Jason Sudeikis
4       Chris Evans
Name: Actor_1, dtype: object

In [20]:
df['Actor_2'] = df['Actors'].apply(lambda x: x[1] if len(x) > 1 else None)

In [21]:
df['Actor_2'].head()

0           Ali Wong
1          Brian Cox
2    Anya Taylor-Joy
3    Brett Goldstein
4       Ana de Armas
Name: Actor_2, dtype: object

In [22]:
df['Actor_3'] = df['Actors'].apply(lambda x: x[2] if len(x) > 2 else None)

In [23]:
df['Actor_3'].head()

0           Joseph Lee
1        Kieran Culkin
2          Charlie Day
3    Hannah Waddingham
4         Adrien Brody
Name: Actor_3, dtype: object

In [24]:
df['Actor_4'] = df['Actors'].apply(lambda x: x[3] if len(x) > 3 else None)

In [25]:
df['Actor_4'].head()

0      Young Mazino
1    Peter Friedman
2        Jack Black
3      Brendan Hunt
4          Mike Moh
Name: Actor_4, dtype: object

In [26]:
df.head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4
0,Comedy,Beef,(2023– ),TV-MA,8.2,Two people let a road rage incident burrow int...,"[Steven Yeun, Ali Wong, Joseph Lee, Young Mazino]",Steven Yeun,Ali Wong,Joseph Lee,Young Mazino
1,Comedy,Succession,(2018–2023),TV-MA,8.8,The Roy family is known for controlling the bi...,"[Nicholas Braun, Brian Cox, Kieran Culkin, Pet...",Nicholas Braun,Brian Cox,Kieran Culkin,Peter Friedman
2,Comedy,The Super Mario Bros. Movie,(2023),PG,7.3,The story of The Super Mario Bros. on their jo...,"[Chris Pratt, Anya Taylor-Joy, Charlie Day, Ja...",Chris Pratt,Anya Taylor-Joy,Charlie Day,Jack Black
3,Comedy,Ted Lasso,(2020– ),TV-MA,8.8,American college football coach Ted Lasso head...,"[Jason Sudeikis, Brett Goldstein, Hannah Waddi...",Jason Sudeikis,Brett Goldstein,Hannah Waddingham,Brendan Hunt
4,Comedy,Ghosted,(2023),PG-13,5.8,Cole falls head over heels for enigmatic Sadie...,"[Chris Evans, Ana de Armas, Adrien Brody, Mike...",Chris Evans,Ana de Armas,Adrien Brody,Mike Moh


Remarks: 

1. As observed, there is an order of precedence for stars.
2. Stars are listed from left to right, starting with the most prominent star of the show.
3. The order of precedence is therefore: Actor_1 > Actor_2 > Actor_3 > Actor_4.
4. As a result, an actor can be the top star in movie A, and so appear in the Actor_1 column, while appearing in Actor_4 for movie B because they are not the top star there.
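
To make the last point concrete, the sketch below counts how often each actor appears in each billing position. The `demo` frame and its titles are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical rows: the same actor is billed first in one title and last in another.
demo = pd.DataFrame({
    'Title': ['Movie A', 'Movie B'],
    'Actor_1': ['Chris Pratt', 'Zoe Saldana'],
    'Actor_2': ['Anya Taylor-Joy', 'Dave Bautista'],
    'Actor_3': ['Charlie Day', 'Vin Diesel'],
    'Actor_4': ['Jack Black', 'Chris Pratt'],
})

# Reshape to one row per (title, billing position, actor).
long = demo.melt(id_vars='Title', var_name='Position', value_name='Actor')

# Count how often each actor appears in each billing position.
position_counts = pd.crosstab(long['Actor'], long['Position'])
```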

### Certificate ratings columns

Check the unique values in the Certificate column.

In [33]:
df['Certificate'].unique()

array(['TV-MA', 'PG', 'PG-13', '16+', 'R', nan, 'TV-PG', 'TV-14',
       'Not Rated', '18+', 'TV-Y7-FV', 'TV-G', 'TV-Y7', 'G', '12', '13+',
       'M', 'Approved', '16', 'NC-17', 'Passed', '6', 'TV-Y', 'Unrated',
       'GP', 'MA-17', 'AO', '14', 'AL', 'X', '9', 'TV-13', 'T', 'E',
       'M/PG', 'E10+', 'MG6', 'K-A', '18', '7+', 'F', 'EM', 'Open', 'GA',
       'MA-13', 'EC', '15', '(Banned)', 'C', 'Banned'], dtype=object)

Remarks:
1. Some values mean the same thing but are labelled differently, for example (Banned) and Banned.
2. The next lines of code rectify this issue.

In [34]:
df['Certificate'] = df['Certificate'].replace({'(Banned)':'Banned',
                      'Unrated': 'Not Rated'})

In [35]:
df['Certificate'].unique()

array(['TV-MA', 'PG', 'PG-13', '16+', 'R', nan, 'TV-PG', 'TV-14',
       'Not Rated', '18+', 'TV-Y7-FV', 'TV-G', 'TV-Y7', 'G', '12', '13+',
       'M', 'Approved', '16', 'NC-17', 'Passed', '6', 'TV-Y', 'GP',
       'MA-17', 'AO', '14', 'AL', 'X', '9', 'TV-13', 'T', 'E', 'M/PG',
       'E10+', 'MG6', 'K-A', '18', '7+', 'F', 'EM', 'Open', 'GA', 'MA-13',
       'EC', '15', 'Banned', 'C'], dtype=object)

In [36]:
df['Certificate'] = df['Certificate'].replace({'PG-13':'13+','12': '7+','M':'16+','Approved':'16+','16':'16+','Passed':'16+','6':'7+',
                                              'GP': 'PG','MA-17': 'NC-17','AO':'PG','14':'14A','AL':'G','X':'A','9':'PG','TV-13':'TV-14',
                                              'T':'16+','E':'G','M/PG':'PG','E10+':'PG','MG6':'PG','K-A':'PG','18':'18A','F':'PG','Open':'PG',
                                              '15':'16+','EC':'PG','Banned':'A','C':'PG'})

In [37]:
df['Certificate'].unique()

array(['TV-MA', 'PG', '13+', '16+', 'R', nan, 'TV-PG', 'TV-14',
       'Not Rated', '18+', 'TV-Y7-FV', 'TV-G', 'TV-Y7', 'G', '7+',
       'NC-17', 'TV-Y', '14A', 'A', '18A', 'EM', 'GA', 'MA-13'],
      dtype=object)

Remarks: The certificate values have been renamed to follow Canada's rating system, as it is the most comprehensive.
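
When a mapping like the one above grows large, it helps to list the labels that it does not cover. The sketch below uses a hypothetical sample and only a small subset of the mapping; `Series.replace` leaves any label that is not a key in the dictionary unchanged.

```python
import pandas as pd

# Hypothetical sample of raw labels, including a missing value.
certs = pd.Series(['PG-13', '(Banned)', 'Unrated', 'TV-MA', None, '16'])

# A small subset of the mapping used above; labels not listed are left unchanged.
mapping = {'(Banned)': 'Banned', 'Unrated': 'Not Rated', 'PG-13': '13+', '16': '16+'}
cleaned = certs.replace(mapping)

# Labels that passed through untouched and may deserve a manual look.
unmapped = sorted(set(certs.dropna()) - set(mapping))
```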

### Check for NaN/None/Null values

In [38]:
null_values = df[df['Ratings'].isnull()]
null_values.head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4
14,Comedy,Barbie,(2023),,,To live in Barbie Land is to be a perfect bein...,"[Margot Robbie, Ariana Greenblatt, Ryan Goslin...",Margot Robbie,Ariana Greenblatt,Ryan Gosling,Helen Mirren
27,Comedy,Guardians of the Galaxy Vol. 3,(2023),13+,,"Still reeling from the loss of Gamora, Peter Q...","[Chris Pratt, Zoe Saldana, Dave Bautista, Vin ...",Chris Pratt,Zoe Saldana,Dave Bautista,Vin Diesel
35,Comedy,Lilo & Stitch,,,,Live-action remake of Disney's animated classi...,"[Sydney Agudong, Billy Magnussen, Tia Carrere,...",Sydney Agudong,Billy Magnussen,Tia Carrere,Courtney B. Vance
58,Comedy,Wicked,(2024),,,The story of how a green-skinned woman framed ...,"[Michelle Yeoh, Cynthia Erivo, Jeff Goldblum, ...",Michelle Yeoh,Cynthia Erivo,Jeff Goldblum,Ariana Grande
76,Comedy,White Men Can't Jump,(2023),R,,A remake of the 1992 film about a pair of bask...,"[Sinqua Walls, Jack Harlow, Lance Reddick, Tey...",Sinqua Walls,Jack Harlow,Lance Reddick,Teyana Taylor


Remarks:
Ratings are NaN for the following reasons:
1. The TV show is still ongoing.
2. The movie has not yet been released (2023).
3. The movie is still in production (release in 2024 or later).

In [39]:
null_values.shape

(210385, 11)

To decide whether the null values should be dropped, the percentage of null values is calculated.

In [40]:
null_percent = (null_values.shape[0]/df.shape[0])*100

In [41]:
print(f'The percentage of null values is: {null_percent:.1f}%. For now, these rows can be dropped.')
print('However, they will be kept as hold-out data to test the algorithm, since these movies/shows have yet to be released.')

The percentage of null values is: 30.0%. For now, these rows can be dropped.
However, they will be kept as hold-out data to test the algorithm, since these movies/shows have yet to be released.
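
The same percentage can be obtained for every column at once. This is a minimal sketch on a hypothetical `demo` frame: `isna()` yields booleans, so the column mean is the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing ratings and certificates.
demo = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Ratings': [8.2, np.nan, 7.3, np.nan],
    'Certificate': ['PG', None, None, None],
})

# Fraction of missing values per column, expressed as a percentage.
missing_percent = demo.isna().mean() * 100
```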


In [42]:
null_values.to_csv('./datasets/hold_out.csv', index=False)

Remarks: The hold-out dataset has been created.

In [43]:
df.drop(df[df['Ratings'].isnull()].index, inplace=True)

In [44]:
df.columns

Index(['Genre', 'Title', 'Year_produced', 'Certificate', 'Ratings',
       'Description', 'Actors', 'Actor_1', 'Actor_2', 'Actor_3', 'Actor_4'],
      dtype='object')

In [45]:
df[df['Genre'].isnull()].head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4


In [46]:
df[df['Title'].isnull()].head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4


In [47]:
df.shape

(490313, 11)

Remarks:
1. The null values have been removed.
2. Hold out dataset created.

### Year_produced column

In [48]:
df[df['Year_produced'].isnull()].head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4
26169,Comedy,Eldritch USA,,,9.6,Siblings Geoff and Rich Brewer have competed a...,"[Graham Weldin, Andy Phinney, Cameron Perry, A...",Graham Weldin,Andy Phinney,Cameron Perry,Aline O'Neill
36388,Comedy,Cool as Hell 2,,,5.3,After goofball Rich accidentally decapitates h...,"[James Balsamo, Billy Walsh, Dave Stein, Phil ...",James Balsamo,Billy Walsh,Dave Stein,Phil Anselmo
45317,Comedy,Mannphodganj Ki Binny,,,5.2,"Binny Bajpai is a 21-year old, dreamy-eyed gir...","[Anurag Sinha, Pranati Rai Prakash, Atul Sriva...",Anurag Sinha,Pranati Rai Prakash,Atul Srivastava,Alka Badola Kaushal
48968,Comedy,Shkembimi,,,7.2,"The paradoxical lives of two brothers, one in ...","[Ina Aderi, Bujar Asqeriu, Tomi Filipi, Vani G...",Ina Aderi,Bujar Asqeriu,Tomi Filipi,Vani Gjuzi
55036,Sci-Fi,Cyborg Nemesis: The Dark Rift,,R,8.1,A U.S. Marine special ops team awakens from hy...,"[Sasha Mitchell, Vincent Klyn, Terrie Batson, ...",Sasha Mitchell,Vincent Klyn,Terrie Batson,Olivier Gruner


1. Some rows in the Year_produced column are NaN.
2. As there are only a few such rows (about 40 blanks), the information can be sourced from the web.

Strategy: 
1. Locate the year produced on the web and fill in the details.

In [49]:
df.loc[df['Title'] == 'Eldritch USA', 'Year_produced'] = df.loc[df['Title'] == 'Eldritch USA', 'Year_produced'].apply(lambda x: 2022)

In [50]:
df.loc[df['Title'] == 'Cool as Hell 2', 'Year_produced'] = df.loc[df['Title'] == 'Cool as Hell 2', 'Year_produced'].apply(lambda x: 2019)

In [51]:
df.loc[df['Title'] == 'Mannphodganj Ki Binny', 'Year_produced'] = df.loc[df['Title'] == 'Mannphodganj Ki Binny', 'Year_produced'].apply(lambda x: 2020)

In [52]:
df.loc[df['Title'] == 'Shkembimi', 'Year_produced'] = df.loc[df['Title'] == 'Shkembimi', 'Year_produced'].apply(lambda x: 2021)

In [53]:
df.loc[df['Title'] == 'Cyborg Nemesis: The Dark Rift', 'Year_produced'] = df.loc[df['Title'] == 'Cyborg Nemesis: The Dark Rift', 'Year_produced'].apply(lambda x: 2016)

In [54]:
df.loc[df['Title'] == 'Hargrave', 'Year_produced'] = df.loc[df['Title'] == 'Hargrave', 'Year_produced'].apply(lambda x: 2020)

In [55]:
df.loc[(df['Title'] == 'Shifters') & (df['Year_produced'].isnull()), 'Year_produced'] = df.loc[(df['Title'] == 'Shifters') & (df['Year_produced'].isnull()), 'Year_produced'].apply(lambda x: 2018)

In [56]:
df.loc[df['Title'] == '*69', 'Year_produced'] = df.loc[df['Title'] == '*69', 'Year_produced'].apply(lambda x: 2022)

In [57]:
df.loc[df['Title'] == 'Haunted Connecticut', 'Year_produced'] = df.loc[df['Title'] == 'Haunted Connecticut', 'Year_produced'].apply(lambda x: 2022)

In [58]:
df.loc[df['Title'] == 'Sesha Raati', 'Year_produced'] = df.loc[df['Title'] == 'Sesha Raati', 'Year_produced'].apply(lambda x: 2022)

In [59]:
df.loc[df['Title'] == 'This Guest of Summer', 'Year_produced'] = df.loc[df['Title'] == 'This Guest of Summer', 'Year_produced'].apply(lambda x: 2021)

In [60]:
df.loc[df['Title'] == 'Halloween Jack 3D', 'Year_produced'] = df.loc[df['Title'] == 'Halloween Jack 3D', 'Year_produced'].apply(lambda x: 2022)

In [61]:
df.loc[df['Title'] == 'Hell Phone', 'Year_produced'] = df.loc[df['Title'] == 'Hell Phone', 'Year_produced'].apply(lambda x: 2018)

In [62]:
df.loc[(df['Title'] == 'Night of the Clown') & (df['Year_produced'].isnull()), 'Year_produced'] = df.loc[(df['Title'] == 'Night of the Clown') & (df['Year_produced'].isnull()), 'Year_produced'].apply(lambda x: 2022)

In [63]:
df.loc[df['Title'] == 'My Dear Guardian', 'Year_produced'] = df.loc[df['Title'] == 'My Dear Guardian', 'Year_produced'].apply(lambda x: 2020)

In [64]:
df.loc[df['Title'] == 'Prem Prakaran', 'Year_produced'] = df.loc[df['Title'] == 'Prem Prakaran', 'Year_produced'].apply(lambda x: 2022)

In [65]:
df.loc[df['Title'] == 'Raebareli', 'Year_produced'] = df.loc[df['Title'] == 'Raebareli', 'Year_produced'].apply(lambda x: 2022)

In [66]:
df.loc[df['Title'] == 'Tiger Cops', 'Year_produced'] = df.loc[df['Title'] == 'Tiger Cops', 'Year_produced'].apply(lambda x: 2017)

In [67]:
df.loc[df['Title'] == 'Student War Punjabi Movie', 'Year_produced'] = df.loc[df['Title'] == 'Student War Punjabi Movie', 'Year_produced'].apply(lambda x: 2021)

In [68]:
df.loc[df['Title'] == 'Adakaar', 'Year_produced'] = df.loc[df['Title'] == 'Adakaar', 'Year_produced'].apply(lambda x: 2022)

In [69]:
df.loc[df['Title'] == 'The Dark Web', 'Year_produced'] = df.loc[df['Title'] == 'The Dark Web', 'Year_produced'].apply(lambda x: 2020)

In [70]:
df.loc[df['Title'] == 'Amsterdam Vice', 'Year_produced'] = df.loc[df['Title'] == 'Amsterdam Vice', 'Year_produced'].apply(lambda x: 2019)

In [71]:
df.loc[df['Title'] == 'It Was Him: The Many Murders of Ed Edwards', 'Year_produced'] = df.loc[df['Title'] == 'It Was Him: The Many Murders of Ed Edwards', 'Year_produced'].apply(lambda x: 2017)

In [72]:
df.loc[(df['Title'] == 'Faultline') & (df['Year_produced'].isnull()), 'Year_produced'] = df.loc[(df['Title'] == 'Faultline') & (df['Year_produced'].isnull()), 'Year_produced'].apply(lambda x: 2017)

In [73]:
df.loc[df['Title'] == 'Chop Chop Ninja', 'Year_produced'] = df.loc[df['Title'] == 'Chop Chop Ninja', 'Year_produced'].apply(lambda x: 2018)

In [74]:
df.loc[df['Title'] == 'Jade of Death', 'Year_produced'] = df.loc[df['Title'] == 'Jade of Death', 'Year_produced'].apply(lambda x: 2018)

In [75]:
df[df['Year_produced'].isnull()]['Title'].unique()

array([], dtype=object)

Remarks:
1. NaN values have been filled in.
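
The per-title assignments above could also be driven by a single lookup table. The sketch below is an alternative under stated assumptions, not the code that was run: the `demo` frame is hypothetical (including the existing `'(2019)'` value), and only two of the manually sourced years are listed. Restricting the fill to rows where the year is missing ensures that titles shared by several entries do not have an existing year overwritten.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 'Shifters' appears twice, once with a known year.
demo = pd.DataFrame({
    'Title': ['Eldritch USA', 'Shifters', 'Shifters', 'Beef'],
    'Year_produced': [np.nan, '(2019)', np.nan, '(2023)'],
})

# Years looked up manually, keyed by title (values taken from the cells above).
year_by_title = {'Eldritch USA': 2022, 'Shifters': 2018}

# Fill only rows whose year is missing, so existing values are never overwritten.
missing = demo['Year_produced'].isna()
demo.loc[missing, 'Year_produced'] = demo.loc[missing, 'Title'].map(year_by_title)
```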

In [76]:
# Assign the result back instead of calling fillna in place on a column selection.
df['Certificate'] = df['Certificate'].fillna('Not Rated')

In [77]:
df[df['Certificate'] == 'Not Rated'].head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4
51,Comedy,Kisi Ka Bhai Kisi Ki Jaan,(2023),Not Rated,7.3,The eldest brother refuses to marry since he b...,"[Salman Khan, Pooja Hegde, Venkatesh Daggubati...",Salman Khan,Pooja Hegde,Venkatesh Daggubati,Jagapathi Babu
89,Comedy,Shehzada,(2023),Not Rated,4.6,Bantu is hated by his father Valmiki since he ...,"[Kartik Aaryan, Kriti Sanon, Paresh Rawal, Man...",Kartik Aaryan,Kriti Sanon,Paresh Rawal,Manisha Koirala
97,Comedy,Mrs Undercover,(2023),Not Rated,6.5,A simple Indian housewife who is in fact a spe...,"[Radhika Apte, Sumeet Vyas, Rajesh Sharma, Roy...",Radhika Apte,Sumeet Vyas,Rajesh Sharma,Roy Angana
100,Comedy,Colin from Accounts,(2022– ),Not Rated,8.2,"Ashley and Gordon, two single-ish, complex hum...","[Patrick Brammall, Harriet Dyer, Zak, Emma Har...",Patrick Brammall,Harriet Dyer,Zak,Emma Harvie
127,Comedy,Totally Completely Fine,(2023– ),Not Rated,7.3,"It follows Vivian Cunningham, who winds up hel...","[Thomasin McKenzie, Contessa Treffone, Rowan W...",Thomasin McKenzie,Contessa Treffone,Rowan Witt,Brandon McClelland


Remarks:
1. Many movies are not rated. This could be because the website was unable to categorize them properly.

### Ratings column

In [79]:
df[df['Ratings'].isnull()].head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4


Remarks:
1. This shows that no missing values remain in the Ratings column.

### Certificate column

In [80]:
df.head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4
0,Comedy,Beef,(2023– ),TV-MA,8.2,Two people let a road rage incident burrow int...,"[Steven Yeun, Ali Wong, Joseph Lee, Young Mazino]",Steven Yeun,Ali Wong,Joseph Lee,Young Mazino
1,Comedy,Succession,(2018–2023),TV-MA,8.8,The Roy family is known for controlling the bi...,"[Nicholas Braun, Brian Cox, Kieran Culkin, Pet...",Nicholas Braun,Brian Cox,Kieran Culkin,Peter Friedman
2,Comedy,The Super Mario Bros. Movie,(2023),PG,7.3,The story of The Super Mario Bros. on their jo...,"[Chris Pratt, Anya Taylor-Joy, Charlie Day, Ja...",Chris Pratt,Anya Taylor-Joy,Charlie Day,Jack Black
3,Comedy,Ted Lasso,(2020– ),TV-MA,8.8,American college football coach Ted Lasso head...,"[Jason Sudeikis, Brett Goldstein, Hannah Waddi...",Jason Sudeikis,Brett Goldstein,Hannah Waddingham,Brendan Hunt
4,Comedy,Ghosted,(2023),13+,5.8,Cole falls head over heels for enigmatic Sadie...,"[Chris Evans, Ana de Armas, Adrien Brody, Mike...",Chris Evans,Ana de Armas,Adrien Brody,Mike Moh


1. Quite a number of movies are classified as 'Not Rated'.
2. A future NLP project could predict the certificate for such movies. This would allow them to be labelled properly, so that users of the website are well informed about the certificate of each movie.

Strategy: 
1. Drop rows that are 'Not Rated'.

In [81]:
df[df['Certificate'] == 'Not Rated'].shape

(300065, 11)

In [82]:
df[df['Certificate'] != 'Not Rated'].shape

(190248, 11)

In [83]:
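# .copy() makes new_df independent of df, so later edits do not raise SettingWithCopyWarning
new_df = df[df['Certificate'] != 'Not Rated'].copy()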

In [89]:
new_df.shape

(190248, 11)

In [84]:
df.head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4
0,Comedy,Beef,(2023– ),TV-MA,8.2,Two people let a road rage incident burrow int...,"[Steven Yeun, Ali Wong, Joseph Lee, Young Mazino]",Steven Yeun,Ali Wong,Joseph Lee,Young Mazino
1,Comedy,Succession,(2018–2023),TV-MA,8.8,The Roy family is known for controlling the bi...,"[Nicholas Braun, Brian Cox, Kieran Culkin, Pet...",Nicholas Braun,Brian Cox,Kieran Culkin,Peter Friedman
2,Comedy,The Super Mario Bros. Movie,(2023),PG,7.3,The story of The Super Mario Bros. on their jo...,"[Chris Pratt, Anya Taylor-Joy, Charlie Day, Ja...",Chris Pratt,Anya Taylor-Joy,Charlie Day,Jack Black
3,Comedy,Ted Lasso,(2020– ),TV-MA,8.8,American college football coach Ted Lasso head...,"[Jason Sudeikis, Brett Goldstein, Hannah Waddi...",Jason Sudeikis,Brett Goldstein,Hannah Waddingham,Brendan Hunt
4,Comedy,Ghosted,(2023),13+,5.8,Cole falls head over heels for enigmatic Sadie...,"[Chris Evans, Ana de Armas, Adrien Brody, Mike...",Chris Evans,Ana de Armas,Adrien Brody,Mike Moh


In [88]:
df.shape

(490313, 11)

In [85]:
new_df.head()

Unnamed: 0,Genre,Title,Year_produced,Certificate,Ratings,Description,Actors,Actor_1,Actor_2,Actor_3,Actor_4
0,Comedy,Beef,(2023– ),TV-MA,8.2,Two people let a road rage incident burrow int...,"[Steven Yeun, Ali Wong, Joseph Lee, Young Mazino]",Steven Yeun,Ali Wong,Joseph Lee,Young Mazino
1,Comedy,Succession,(2018–2023),TV-MA,8.8,The Roy family is known for controlling the bi...,"[Nicholas Braun, Brian Cox, Kieran Culkin, Pet...",Nicholas Braun,Brian Cox,Kieran Culkin,Peter Friedman
2,Comedy,The Super Mario Bros. Movie,(2023),PG,7.3,The story of The Super Mario Bros. on their jo...,"[Chris Pratt, Anya Taylor-Joy, Charlie Day, Ja...",Chris Pratt,Anya Taylor-Joy,Charlie Day,Jack Black
3,Comedy,Ted Lasso,(2020– ),TV-MA,8.8,American college football coach Ted Lasso head...,"[Jason Sudeikis, Brett Goldstein, Hannah Waddi...",Jason Sudeikis,Brett Goldstein,Hannah Waddingham,Brendan Hunt
4,Comedy,Ghosted,(2023),13+,5.8,Cole falls head over heels for enigmatic Sadie...,"[Chris Evans, Ana de Armas, Adrien Brody, Mike...",Chris Evans,Ana de Armas,Adrien Brody,Mike Moh


Remarks:
1. Two separate dataframes have been created: one that includes 'Not Rated' movies and one that excludes them. These can be used in future machine learning projects.
2. For data visualisation and modelling, only the data that excludes 'Not Rated' movies will be used.

## Export to CSV

This export includes 'Not Rated' movies.

In [86]:
df.to_csv('./datasets/all_movies_cleaned_02.csv', index=False)

This export excludes all 'Not Rated' movies.

In [87]:
new_df.to_csv('./datasets/all_movies_rated_only_cleaned_02.csv', index=False)

Remarks: Two datasets have been exported, one including and one excluding 'Not Rated' movies.
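
One point to keep in mind when reloading these files: CSV has no list type, so the Actors lists are written as their string representation. The sketch below shows the round trip on a hypothetical one-row frame, using an in-memory buffer in place of a file, and parses the column back into lists with a `converters` entry.

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame with a list-valued column.
demo = pd.DataFrame({
    'Title': ['Beef'],
    'Actors': [['Steven Yeun', 'Ali Wong']],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Plain read: the list comes back as its string representation.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# Read with a converter: the string is parsed back into a list.
buffer.seek(0)
as_list = pd.read_csv(buffer, converters={'Actors': ast.literal_eval})
```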