# Data Cleaning Project
This notebook documents the data cleaning process for an IMDb movie dataset, which I downloaded from [Kaggle](https://www.kaggle.com/datasets/davidfuenteherraiz/messy-imdb-dataset/data). It contains information such as Title, Release Year, Genre, and Content Rating. The goal is to clean and organize the data so it's ready for further analysis. In this notebook, I'll go through the cleaning steps I took: handling missing values, removing duplicates, and fixing data formats.

In [1]:
#Importing Libraries
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

In [2]:
#Loading the Dataset
import chardet

with open('messy_IMDB_dataset.csv', 'rb') as f:
    rawdata = f.read()
    result = chardet.detect(rawdata)
    encoding = result['encoding']

IMDb = pd.read_csv('messy_IMDB_dataset.csv', encoding = encoding, sep=';')
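
Character-set detection is a statistical guess, so it can help to compare it against a strict decode. The following is a minimal sketch, assuming the file is either UTF-8 or Windows-1252. The helper name `decode_with_fallback`, the candidate list, and the sample bytes are illustrative and not part of the original notebook.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes raw without errors."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')


# Hypothetical sample bytes, not taken from the real file
utf8_bytes = 'Léon;8.5'.encode('utf-8')
legacy_bytes = 'Léon;8.5'.encode('cp1252')

print(decode_with_fallback(utf8_bytes))    # decoded as UTF-8
print(decode_with_fallback(legacy_bytes))  # strict UTF-8 fails, falls back to Windows-1252
```

The encoding name returned this way could be passed to `pd.read_csv(..., encoding=...)` in place of the detected one.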

In [3]:
#Display the first 5 rows of the dataset
IMDb.head()

Unnamed: 0,IMBD title ID,Original titlÊ,Release year,Genrë¨,Duration,Country,Content Rating,Director,Unnamed: 8,Income,Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142.0,USA,R,Frank Darabont,,$ 28815245,2.278.845,9.3
1,tt0068646,The Godfather,09 21 1972,"Crime, Drama",175.0,USA,R,Francis Ford Coppola,,$ 246120974,1.572.674,9.2
2,tt0468569,The Dark Knight,23 -07-2008,"Action, Crime, Drama",152.0,US,PG-13,Christopher Nolan,,$ 1005455211,2.241.615,9.
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220.0,USA,R,Francis Ford Coppola,,"$ 4o8,035,783",1.098.714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,,$ 222831817,1.780.147,"8,9f"


In [4]:
#Checking column names
IMDb.columns

Index(['IMBD title ID', 'Original titlÊ', 'Release year', 'Genrë¨', 'Duration',
       'Country', 'Content Rating', 'Director', 'Unnamed: 8', 'Income',
       ' Votes ', 'Score'],
      dtype='object')

In [5]:
#Rename columns to standardize format and fix typos
IMDb.rename(columns={
    'IMBD title ID': 'IMDb Title ID',
    'Original titlÊ': 'Original Title',
    'Release year': 'Release Year',
    'Genrë¨': 'Genre',
    ' Votes ': 'Votes'
}, inplace=True)
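
Padding such as the spaces around `' Votes '` can also be removed for every column at once instead of being listed by hand. A small sketch on a toy frame (the data here is made up for illustration):

```python
import pandas as pd

# Toy frame with the same kind of padded header as ' Votes '
df = pd.DataFrame({' Votes ': [1, 2], 'Score': [9.3, 9.2]})

# Strip leading and trailing whitespace from every column name
df.columns = df.columns.str.strip()

print(list(df.columns))
```

Typos and garbled characters in the names still need an explicit mapping like the one above.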

In [6]:
#Get a concise summary of the DataFrame
IMDb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101 entries, 0 to 100
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   IMDb Title ID   100 non-null    object 
 1   Original Title  100 non-null    object 
 2   Release Year    100 non-null    object 
 3   Genre           100 non-null    object 
 4   Duration        99 non-null     object 
 5   Country         100 non-null    object 
 6   Content Rating  77 non-null     object 
 7   Director        100 non-null    object 
 8   Unnamed: 8      0 non-null      float64
 9   Income          100 non-null    object 
 10  Votes           100 non-null    object 
 11  Score           100 non-null    object 
dtypes: float64(1), object(11)
memory usage: 9.6+ KB


In [7]:
#Summary statistics for numerical columns
IMDb.describe()

Unnamed: 0,Unnamed: 8
count,0.0
mean,
std,
min,
25%,
50%,
75%,
max,


## Handling Missing Values

In [8]:
#Check for missing values in the dataset
IMDb.isna().sum()

IMDb Title ID       1
Original Title      1
Release Year        1
Genre               1
Duration            2
Country             1
Content Rating     24
Director            1
Unnamed: 8        101
Income              1
Votes               1
Score               1
dtype: int64

### "Unnamed: 8" Column

In [9]:
#Drop "Unnamed: 8" column
IMDb = IMDb.drop(columns=['Unnamed: 8'])
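
If the name of the empty column is not known in advance, columns that contain no values at all can be dropped generically. A minimal sketch on a toy frame, assuming that an all-missing column is never wanted:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Director': ['Frank Darabont', 'Francis Ford Coppola'],
    'Unnamed: 8': [np.nan, np.nan],
})

# Drop every column in which all values are missing
df = df.dropna(axis=1, how='all')

print(list(df.columns))
```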

In [10]:
IMDb.isna().sum()

IMDb Title ID      1
Original Title     1
Release Year       1
Genre              1
Duration           2
Country            1
Content Rating    24
Director           1
Income             1
Votes              1
Score              1
dtype: int64

### "Content Rating" Column

In [11]:
#Check unique values in the "Content Rating" column
IMDb['Content Rating'].unique()

array(['R', 'PG-13', 'Not Rated', 'Approved', nan, 'PG', 'Unrated', 'G'],
      dtype=object)

In [12]:
#Replace 'Unrated' with 'Not Rated' for consistency
IMDb['Content Rating'] = IMDb['Content Rating'].replace({'Unrated': 'Not Rated'})

In [13]:
IMDb['Content Rating'].unique()

array(['R', 'PG-13', 'Not Rated', 'Approved', nan, 'PG', 'G'],
      dtype=object)

In [14]:
#Fill missing content ratings with 'Not Rated'
IMDb['Content Rating'] = IMDb['Content Rating'].fillna('Not Rated')
IMDb['Content Rating'].unique()

array(['R', 'PG-13', 'Not Rated', 'Approved', 'PG', 'G'], dtype=object)

In [15]:
IMDb.isna().sum()

IMDb Title ID     1
Original Title    1
Release Year      1
Genre             1
Duration          2
Country           1
Content Rating    0
Director          1
Income            1
Votes             1
Score             1
dtype: int64

### "Duration" Column

In [16]:
#Filter and display all rows where "Duration" is missing
IMDb[IMDb['Duration'].isna()]

Unnamed: 0,IMDb Title ID,Original Title,Release Year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
13,,,,,,,Not Rated,,,,
14,tt0133093,The Matrix,1999-05-07,"Action, Sci-Fi",,USA,R,"Lana Wachowski, Lilly Wachowski",$ 465718588,1.632.315,++8.7


In [17]:
#Drop row 13, which has no data apart from the filled-in "Content Rating"
IMDb.drop(index=13, inplace=True)
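
Because "Content Rating" was filled with 'Not Rated' earlier, row 13 no longer counts as entirely empty, so `dropna(how='all')` would keep it. A sketch of an index-free alternative, assuming a row without an "IMDb Title ID" carries no usable record (toy data):

```python
import pandas as pd

df = pd.DataFrame({
    'IMDb Title ID': ['tt0111161', None, 'tt0133093'],
    'Content Rating': ['R', 'Not Rated', 'R'],
})

# how='all' keeps the middle row because one of its cells is filled
print(len(df.dropna(how='all')))

# Requiring the identifier removes it
df = df.dropna(subset=['IMDb Title ID'])
print(df.index.tolist())
```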

In [18]:
IMDb.isna().sum()

IMDb Title ID     0
Original Title    0
Release Year      0
Genre             0
Duration          1
Country           0
Content Rating    0
Director          0
Income            0
Votes             0
Score             0
dtype: int64

In [19]:
#Check unique values in the "Duration" column
IMDb['Duration'].unique()

array(['142', '175', '152', '220', ' ', '201', 'Nan', '96', '148', 'Inf',
       '178c', '161', nan, '179', 'Not Applicable', '146', '-', '169',
       '127', '118', '121', '189', '130', '125', '116', '132', '207',
       '155', '151', '119', '110', '137', '106', '88', '122', '112',
       '150', '109', '102', '165', '89', '87', '164', '113', '98', '115',
       '149', '117', '181', '147', '120', '95', '105', '170', '134',
       '229', '153', '178', '131', '99', '108', '81', '126', '104', '136',
       '103', '114', '160', '128', '228', '129', '123'], dtype=object)

In [20]:
#Filter rows where "Duration" is missing or has an invalid value
invalid_values = [' ', 'Nan', 'Inf', 'Not Applicable', '-']

IMDb[IMDb['Duration'].isin(invalid_values) |
     IMDb['Duration'].isna()]

Unnamed: 0,IMDb Title ID,Original Title,Release Year,Genre,Duration,Country,Content Rating,Director,Income,Votes,Score
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",,USA,R,Quentin Tarantino,$ 222831817,1.780.147,"8,9f"
6,tt0108052,Schindler's List,1994-03-11,"Biography, Drama, History",Nan,USA,R,Steven Spielberg,$ 322287794,1.183.248,8.9
9,tt0137523,Fight Club,10-29-99,Drama,Inf,UK,R,David Fincher,$ 101218804,1.807.440,8.8
14,tt0133093,The Matrix,1999-05-07,"Action, Sci-Fi",,USA,R,"Lana Wachowski, Lilly Wachowski",$ 465718588,1.632.315,++8.7
16,tt0080684,Star Wars: Episode V - The Empire Strikes Back,1980-09-19,"Action, Adventure, Fantasy",Not Applicable,USA,PG,Irvin Kershner,$ 549265501,1.132.073,87e-0
18,tt0073486,One Flew Over the Cuckoo's Nest,18/11/1976,Drama,-,USA,R,Milos Forman,$ 108997629,891.071,8.7


In [21]:
#Fill missing or invalid "Duration" values using a manual lookup by title
durations = {
    'Pulp Fiction': 154,
    "Schindler's List": 195,
    'Fight Club': 139,
    'The Matrix': 126,
    'Star Wars: Episode V - The Empire Strikes Back': 124,
    "One Flew Over the Cuckoo's Nest": 133
}

IMDb['Duration'] = IMDb.apply(
    lambda row: durations.get(row['Original Title'], row['Duration']), axis=1
)
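
The row-wise `apply` overrides "Duration" for every listed title. A vectorized variant is sketched below on toy data. It differs in one respect, which is an assumption of this sketch: it only fills values that fail numeric conversion and leaves valid ones untouched.

```python
import pandas as pd

df = pd.DataFrame({
    'Original Title': ['Pulp Fiction', 'The Matrix', 'Inception'],
    'Duration': [' ', None, '148'],
})
known = {'Pulp Fiction': 154, 'The Matrix': 126}  # subset of the lookup above

# Unparseable or missing durations become NaN ...
numeric = pd.to_numeric(df['Duration'], errors='coerce')

# ... and only those gaps are filled from the lookup
lookup = df['Original Title'].map(known)
df['Duration'] = numeric.fillna(lookup)

print(df['Duration'].tolist())
```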

In [22]:
#Replace '178c' with '178' for consistency
IMDb['Duration'] = IMDb['Duration'].replace({'178c': '178'})

In [23]:
IMDb['Duration'].unique()

array(['142', '175', '152', '220', 154, '201', 195, '96', '148', 139,
       '178', '161', 126, '179', 124, '146', 133, '169', '127', '118',
       '121', '189', '130', '125', '116', '132', '207', '155', '151',
       '119', '110', '137', '106', '88', '122', '112', '150', '109',
       '102', '165', '89', '87', '164', '113', '98', '115', '149', '117',
       '181', '147', '120', '95', '105', '170', '134', '229', '153',
       '131', '99', '108', '81', '126', '104', '136', '103', '114', '160',
       '128', '228', '129', '123'], dtype=object)

In [24]:
#Check for missing values in the dataset
IMDb.isna().sum()

IMDb Title ID     0
Original Title    0
Release Year      0
Genre             0
Duration          0
Country           0
Content Rating    0
Director          0
Income            0
Votes             0
Score             0
dtype: int64

## Handling Duplicate Values

In [25]:
#Check for duplicate data
IMDb.duplicated().sum()

0
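
`duplicated()` compares whole rows, so two records for the same film that differ in a single messy cell are not counted. A sketch of a key-based check, assuming "IMDb Title ID" should be unique (toy data):

```python
import pandas as pd

df = pd.DataFrame({
    'IMDb Title ID': ['tt0068646', 'tt0068646', 'tt0468569'],
    'Country': ['USA', 'US', 'US'],
})

full_row_dupes = int(df.duplicated().sum())                    # whole-row comparison
key_dupes = int(df.duplicated(subset=['IMDb Title ID']).sum())  # key-only comparison

print(full_row_dupes, key_dupes)
```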

## Checking Columns One by One

### "IMDb Title ID" Column

In [26]:
#Check unique values in the "IMDb Title ID" column
IMDb['IMDb Title ID'].unique()

array(['tt0111161', 'tt0068646', 'tt0468569', 'tt0071562', 'tt0110912',
       'tt0167260', 'tt0108052', 'tt0050083', 'tt1375666', 'tt0137523',
       'tt0109830', 'tt0120737', 'tt0060196', 'tt0133093', 'tt0167261',
       'tt0080684', 'tt0099685', 'tt0073486', 'tt0816692', 'tt0114369',
       'tt0102926', 'tt0076759', 'tt0120815', 'tt0120689', 'tt0317248',
       'tt0245429', 'tt0118799', 'tt6751668', 'tt0038650', 'tt0047478',
       'tt0172495', 'tt0407887', 'tt0482571', 'tt0088763', 'tt0120586',
       'tt0110413', 'tt0103064', 'tt0114814', 'tt0110357', 'tt7286456',
       'tt1675434', 'tt0253474', 'tt2582802', 'tt0054215', 'tt0034583',
       'tt0064116', 'tt0095327', 'tt0095765', 'tt0027977', 'tt1345836',
       'tt1853728', 'tt0209144', 'tt0910970', 'tt0081505', 'tt0082971',
       'tt4154756', 'tt0078748', 'tt4154796', 'tt0078788', 'tt0364569',
       'tt0057012', 'tt0047396', 'tt2380307', 'tt0405094', 'tt4633694',
       'tt1187043', 'tt0119698', 'tt0087843', 'tt0032553', 'tt00

### "Original Title" Column

In [27]:
#Check unique values in the "Original Title" column
IMDb['Original Title'].unique()

array(['The Shawshank Redemption', 'The Godfather', 'The Dark Knight',
       'The Godfather: Part II', 'Pulp Fiction',
       'The Lord of the Rings: The Return of the King',
       "Schindler's List", '12 Angry Men', 'Inception', 'Fight Club',
       'Forrest Gump',
       'The Lord of the Rings: The Fellowship of the Ring',
       'Il buono, il brutto, il cattivo', 'The Matrix',
       'The Lord of the Rings: The Two Towers',
       'Star Wars: Episode V - The Empire Strikes Back', 'Goodfellas',
       "One Flew Over the Cuckoo's Nest", 'Interstellar', 'Se7en',
       'The Silence of the Lambs', 'Star Wars', 'Saving Private Ryan',
       'The Green Mile', 'Cidade de Deus',
       'Sen to Chihiro no kamikakushi', 'La vita B9 bella',
       'Gisaengchung', "It's a Wonderful Life", 'Shichinin no samurai',
       'Gladiator', 'The Departed', 'The Prestige', 'Back to the Future',
       'American History X', 'LÃ©on', 'Terminator 2: Judgment Day',
       'The Usual Suspects', 'The Lion Ki

### "Release Year" Column

In [28]:
#Check unique values in the "Release Year" column
IMDb['Release Year'].unique()

array(['1995-02-10', '09 21 1972', ' 23 -07-2008', '1975-09-25',
       '1994-10-28', '22 Feb 04', '1994-03-11', '1957-09-04',
       '2010-09-24', '10-29-99', '1994-10-06', '2002-01-18',
       '23rd December of 1966 ', '1999-05-07', '01/16-03', '1980-09-19',
       '1990-09-20', '18/11/1976', '2014-11-06', '1995-12-15',
       '1991-03-05', '1977-10-20', '1998-10-30', '2000-10-03',
       '2003-05-09', '2003-04-18', '1997-12-20', '2019-11-07',
       '1948-03-11', '1955-08-19', '2000-05-19', '2006-10-27',
       '2006-12-22', '1985-10-18', '1999-08-27', '1995-04-07',
       '1991-12-19', '1995-11-30', '1994-11-25', '2019-10-03',
       '2012-02-24', '2002-10-25', '2015-02-12', '1960-10-28', '21-11-46',
       '1968-12-21', '2015-10-11', '1988-11-17', '1937-03-12',
       '2012-08-29', '2013-01-17', '2001-01-19', '2008-10-17',
       '1980-12-22', '1981-06-12', '2018-04-25', '1979-10-25',
       '2019-04-24', '1979-12-18', '2005-05-06', '1964-04-03',
       '1955-04-14', '2017-12-28',

In [29]:
#Replace incorrect dates
date_map = {
    '09 21 1972': '1972-03-24',
    ' 23 -07-2008': '2008-07-18',
    '22 Feb 04': '2003-12-17',
    '23rd December of 1966 ': '1966-12-23',
    '01/16-03': '2002-12-18',
    'The 6th of marzo, year 1951': '1950-08-04',
    '1984-02-34': '1983-12-09',
    '1976-13-24': '1976-02-09'
}

IMDb['Release Year'] = IMDb['Release Year'].replace(date_map, regex=False)

In [30]:
#Parse "Release Year" as datetime and keep only the date part (the column stays object dtype)
IMDb['Release Year'] = pd.to_datetime(IMDb['Release Year'], format='mixed').dt.date
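
`format='mixed'` lets pandas infer a format for each value separately, which is convenient but leaves ambiguous strings such as two-digit years to the parser. As an alternative, here is a sketch that tries explicit formats in order. The format list, the helper name `parse_release_date`, and the `max_year` cutoff are assumptions based on the strings visible in the output above.

```python
from datetime import date, datetime

# Candidate formats, tried in order; month-first is preferred for ambiguous two-digit dates
FORMATS = ('%Y-%m-%d', '%d/%m/%Y', '%m-%d-%y', '%d-%m-%y')


def parse_release_date(text, formats=FORMATS, max_year=2025):
    """Return a date for the first matching format, or None if nothing matches."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            parsed = datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
        # %y maps 00-68 to 2000-2068, so '46' would become 2046; shift it back a century
        if '%y' in fmt and parsed.year > max_year:
            parsed = parsed.replace(year=parsed.year - 100)
        return parsed
    return None


for raw in ['1995-02-10', '18/11/1976', '10-29-99', '21-11-46', ' 23 -07-2008']:
    print(repr(raw), '->', parse_release_date(raw))
```

Values that return `None` are the ones that still need a manual entry like those in `date_map`.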

In [31]:
IMDb['Release Year'].unique()

array([datetime.date(1995, 2, 10), datetime.date(1972, 3, 24),
       datetime.date(2008, 7, 18), datetime.date(1975, 9, 25),
       datetime.date(1994, 10, 28), datetime.date(2003, 12, 17),
       datetime.date(1994, 3, 11), datetime.date(1957, 9, 4),
       datetime.date(2010, 9, 24), datetime.date(1999, 10, 29),
       datetime.date(1994, 10, 6), datetime.date(2002, 1, 18),
       datetime.date(1966, 12, 23), datetime.date(1999, 5, 7),
       datetime.date(2002, 12, 18), datetime.date(1980, 9, 19),
       datetime.date(1990, 9, 20), datetime.date(1976, 11, 18),
       datetime.date(2014, 11, 6), datetime.date(1995, 12, 15),
       datetime.date(1991, 3, 5), datetime.date(1977, 10, 20),
       datetime.date(1998, 10, 30), datetime.date(2000, 10, 3),
       datetime.date(2003, 5, 9), datetime.date(2003, 4, 18),
       datetime.date(1997, 12, 20), datetime.date(2019, 11, 7),
       datetime.date(1948, 3, 11), datetime.date(1955, 8, 19),
       datetime.date(2000, 5, 19), datetime.date(

### "Genre" Column

In [32]:
#Check unique values in the "Genre" column
IMDb['Genre'].unique()

array(['Drama', 'Crime, Drama', 'Action, Crime, Drama',
       'Action, Adventure, Drama', 'Biography, Drama, History',
       'Action, Adventure, Sci-Fi', 'Drama, Romance', 'Western',
       'Action, Sci-Fi', 'Action, Adventure, Fantasy',
       'Biography, Crime, Drama', 'Adventure, Drama, Sci-Fi',
       'Crime, Drama, Mystery', 'Crime, Drama, Thriller', 'Drama, War',
       'Crime, Drama, Fantasy', 'Animation, Adventure, Family',
       'Comedy, Drama, Romance', 'Comedy, Drama, Thriller',
       'Drama, Family, Fantasy', 'Drama, Mystery, Sci-Fi',
       'Adventure, Comedy, Sci-Fi', 'Crime, Mystery, Thriller',
       'Animation, Adventure, Drama', 'Biography, Comedy, Drama',
       'Biography, Drama, Music', 'Drama, Music',
       'Horror, Mystery, Thriller', 'Drama, Romance, War',
       'Animation, Drama, War', 'Comedy, Drama, Family',
       'Action, Adventure', 'Drama, Western', 'Mystery, Thriller',
       'Drama, Horror', 'Horror, Sci-Fi', 'Drama, Mystery, War',
       'Action,

### "Duration" Column

In [33]:
#Check unique values in the "Duration" column
IMDb['Duration'].unique()

array(['142', '175', '152', '220', 154, '201', 195, '96', '148', 139,
       '178', '161', 126, '179', 124, '146', 133, '169', '127', '118',
       '121', '189', '130', '125', '116', '132', '207', '155', '151',
       '119', '110', '137', '106', '88', '122', '112', '150', '109',
       '102', '165', '89', '87', '164', '113', '98', '115', '149', '117',
       '181', '147', '120', '95', '105', '170', '134', '229', '153',
       '131', '99', '108', '81', '126', '104', '136', '103', '114', '160',
       '128', '228', '129', '123'], dtype=object)

In [34]:
#Convert "Duration" to numeric
IMDb['Duration'] = pd.to_numeric(IMDb['Duration'], errors='coerce')
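
`errors='coerce'` silently turns anything unparseable into NaN, so a leftover bad value would go unnoticed. The sketch below wraps the conversion in a check that raises instead. The helper name `to_numeric_checked` is hypothetical.

```python
import pandas as pd


def to_numeric_checked(series):
    """Convert to numeric, raising if a non-missing value fails to parse."""
    converted = pd.to_numeric(series, errors='coerce')
    failed = series[converted.isna() & series.notna()]
    if not failed.empty:
        raise ValueError(f'unparseable values: {failed.unique().tolist()}')
    return converted


print(to_numeric_checked(pd.Series(['142', 154, '178'])).tolist())

try:
    to_numeric_checked(pd.Series(['142', '178c']))
except ValueError as exc:
    print(exc)
```

The same check would apply to the "Income", "Votes", and "Score" conversions later in the notebook.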

### "Country" Column

In [35]:
#Check unique values in the "Country" column
IMDb['Country'].unique()

array(['USA', 'US', 'New Zealand', 'UK', 'New Zesland', 'Italy',
       'New Zeland', 'US.', 'Brazil', 'Japan', 'Italy1', 'South Korea',
       'France', 'Germany', 'India', 'Denmark', 'West Germany', 'Iran'],
      dtype=object)

In [36]:
#Replace inconsistent country names
country_map = {
    'New Zesland': 'New Zealand',
    'New Zeland': 'New Zealand',
    'US.': 'US',
    'Italy1': 'Italy'
}

IMDb['Country'] = IMDb['Country'].replace(country_map, regex=False)
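
The mapping above keeps 'US' and 'USA' as separate values. If they are meant to be the same country, they could be folded into one label. A sketch, assuming 'USA' is the preferred spelling (that choice and the alias table are mine, not from the dataset documentation):

```python
import pandas as pd

countries = pd.Series(['USA', 'US', 'US.', 'Italy1', 'New Zeland'])

# Remove stray digits and dots, then map known variants to one spelling
cleaned = countries.str.replace(r'[\d.]', '', regex=True).str.strip()
aliases = {'US': 'USA', 'New Zeland': 'New Zealand', 'New Zesland': 'New Zealand'}
cleaned = cleaned.replace(aliases)

print(cleaned.tolist())
```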

In [37]:
IMDb['Country'].unique()

array(['USA', 'US', 'New Zealand', 'UK', 'Italy', 'Brazil', 'Japan',
       'South Korea', 'France', 'Germany', 'India', 'Denmark',
       'West Germany', 'Iran'], dtype=object)

### "Content Rating" Column

In [38]:
#Check unique values in the "Content Rating" column
IMDb['Content Rating'].unique()

array(['R', 'PG-13', 'Not Rated', 'Approved', 'PG', 'G'], dtype=object)

### "Director" Column

In [39]:
#Check unique values in the "Director" column
IMDb['Director'].unique()

array(['Frank Darabont', 'Francis Ford Coppola', 'Christopher Nolan',
       'Quentin Tarantino', 'Peter Jackson', 'Steven Spielberg',
       'Sidney Lumet', 'David Fincher', 'Robert Zemeckis', 'Sergio Leone',
       'Lana Wachowski, Lilly Wachowski', 'Irvin Kershner',
       'Martin Scorsese', 'Milos Forman', 'Jonathan Demme',
       'George Lucas', 'Fernando Meirelles, KÃ¡tia Lund',
       'Hayao Miyazaki', 'Roberto Benigni', 'Bong Joon Ho', 'Frank Capra',
       'Akira Kurosawa', 'Ridley Scott', 'Tony Kaye', 'Luc Besson',
       'James Cameron', 'Bryan Singer', 'Roger Allers, Rob Minkoff',
       'Todd Phillips', 'Olivier Nakache, Ã‰ric Toledano',
       'Roman Polanski', 'Damien Chazelle', 'Alfred Hitchcock',
       'Michael Curtiz', 'Isao Takahata', 'Giuseppe Tornatore',
       'Charles Chaplin', 'Andrew Stanton', 'Stanley Kubrick',
       'Anthony Russo, Joe Russo', 'Chan-wook Park',
       'Lee Unkrich, Adrian Molina', 'Florian Henckel von Donnersmarck',
       'Bob Persichetti,
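
Names such as 'KÃ¡tia Lund' and 'Ã‰ric Toledano' look like UTF-8 text that was decoded as Windows-1252. That is my reading of the pattern, not something the dataset documents. Under that assumption, a round trip through the two encodings repairs them. The helper name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was decoded as Windows-1252; return text unchanged if that does not apply."""
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


for name in ['LÃ©on', 'KÃ¡tia Lund', 'Ã‰ric Toledano', 'Sergio Leone']:
    print(fix_mojibake(name))
```

It could be applied with `IMDb['Director'].map(fix_mojibake)` and likewise to "Original Title". Values that are already correct pass through unchanged.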

### "Income" Column

In [40]:
#Check unique values in the "Income" column
IMDb['Income'].unique()

array(['$ 28815245', '$ 246120974', '$ 1005455211', '$ 4o8,035,783',
       '$ 222831817', '$ 1142271098', '$ 322287794', '$ 576',
       '$ 869784991', '$ 101218804', '$ 678229452', '$ 887934303',
       '$ 25252481', '$ 465718588', '$ 951227416', '$ 549265501',
       '$ 46879633', '$ 108997629', '$ 696742056', '$ 327333559',
       '$ 272753884', '$ 775768912', '$ 482349603', '$ 286801374',
       '$ 30680793', '$ 355467056', '$ 230098753', '$ 257604912',
       '$ 6130720', '$ 322773', '$ 465361176', '$ 291465034',
       '$ 109676311', '$ 388774684', '$ 23875127', '$ 19552639',
       '$ 520884847', '$ 23341568', '$ 968511805', '$ 1074251311',
       '$ 426588510', '$ 120072577', '$ 48983260', '$ 32008644',
       '$ 4374761', '$ 112911', '$ 516962', '$ 13826605', '$ 457688',
       '$ 1081133191', '$ 425368238', '$ 39970386', '$ 521311860',
       '$ 46520613', '$ 390133212', '$ 2048359754', '$ 108110316',
       '$ 2797800564', '$ 91968688', '$ 15002116', '$ 9443876',
       '$ 

In [41]:
#Fix the letter 'o' used as zero and remove commas, dollar signs, and spaces
income_map = {
    'o': '0',
    ',': '',
    r'\$': '',
    ' ': ''
}

IMDb['Income'] = IMDb['Income'].replace(income_map, regex=True)
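
The same cleanup can be written with the `.str` accessor: fix the mistyped zero first, then drop every non-digit character in one pattern. A sketch on three values from the column:

```python
import pandas as pd

income = pd.Series(['$ 28815245', '$ 4o8,035,783', '$ 576'])

# Fix the letter 'o' typed for zero, then drop everything that is not a digit
digits = (
    income.str.replace('o', '0', regex=False)
          .str.replace(r'\D', '', regex=True)
)

print(digits.tolist())
```

Using a raw string for the pattern avoids Python treating the backslash as a string escape.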

In [42]:
IMDb['Income'].unique()

array(['28815245', '246120974', '1005455211', '408035783', '222831817',
       '1142271098', '322287794', '576', '869784991', '101218804',
       '678229452', '887934303', '25252481', '465718588', '951227416',
       '549265501', '46879633', '108997629', '696742056', '327333559',
       '272753884', '775768912', '482349603', '286801374', '30680793',
       '355467056', '230098753', '257604912', '6130720', '322773',
       '465361176', '291465034', '109676311', '388774684', '23875127',
       '19552639', '520884847', '23341568', '968511805', '1074251311',
       '426588510', '120072577', '48983260', '32008644', '4374761',
       '112911', '516962', '13826605', '457688', '1081133191',
       '425368238', '39970386', '521311860', '46520613', '390133212',
       '2048359754', '108110316', '2797800564', '91968688', '15002116',
       '9443876', '37032034', '807083670', '77356942', '375540831',
       '60262836', '169785629', '5472914', '969879', '299645',
       '321455689', '356296601', '2

In [43]:
#Convert "Income" to numeric
IMDb['Income'] = pd.to_numeric(IMDb['Income'], errors='coerce')

In [44]:
#Rename "Income" to include the currency unit
IMDb = IMDb.rename(columns={'Income': 'Income ($)'})

### "Votes" Column

In [45]:
#Check unique values in the "Votes" column
IMDb['Votes'].unique()

array(['2.278.845', '1.572.674', '2.241.615', '1.098.714', '1.780.147',
       '1.604.280', '1.183.248', '668.473', '2.002.816', '1.807.440',
       '1.755.490', '1.619.920', '672.499', '1.632.315', '1.449.778',
       '1.132.073', '991.505', '891.071', '1.449.256', '1.402.015',
       '1.234.134', '1.204.107', '1.203.825', '1.112.336', '685.856',
       '626.693', '605.648', '470.931', '388.310', '307.958', '1.308.193',
       '1.159.703', '1.155.723', '1.027.330', '1.014.218', '1.007.598',
       '974.970', '968.947', '917.248', '855.097', '736.691', '707.942',
       '690.732', '586.765', '509.953', '295.220', '225.438', '223.050',
       '211.250', '1.480.582', '1.317.856', '1.098.879', '974.734',
       '869.480', '865.510', '796.486', '768.874', '754.786', '591.251',
       '501.082', '441.115', '432.390', '352.455', '349.642', '335.892',
       '332.217', '331.045', '302.317', '197.381', '195.789', '1.229.958',
       '1.049.009', '941.683', '928.036', '896.551', '889.875', '864

In [46]:
#Remove the dots used as thousands separators
IMDb['Votes'] = IMDb['Votes'].str.replace('.', '', regex=False)

In [47]:
IMDb['Votes'].unique()

array(['2278845', '1572674', '2241615', '1098714', '1780147', '1604280',
       '1183248', '668473', '2002816', '1807440', '1755490', '1619920',
       '672499', '1632315', '1449778', '1132073', '991505', '891071',
       '1449256', '1402015', '1234134', '1204107', '1203825', '1112336',
       '685856', '626693', '605648', '470931', '388310', '307958',
       '1308193', '1159703', '1155723', '1027330', '1014218', '1007598',
       '974970', '968947', '917248', '855097', '736691', '707942',
       '690732', '586765', '509953', '295220', '225438', '223050',
       '211250', '1480582', '1317856', '1098879', '974734', '869480',
       '865510', '796486', '768874', '754786', '591251', '501082',
       '441115', '432390', '352455', '349642', '335892', '332217',
       '331045', '302317', '197381', '195789', '1229958', '1049009',
       '941683', '928036', '896551', '889875', '864461', '837379',
       '766589', '748291', '740301', '739717', '721343', '703264',
       '690480', '658175', '639

In [48]:
#Convert "Votes" to numeric
IMDb['Votes'] = pd.to_numeric(IMDb['Votes'], errors='coerce')

### "Score" Column

In [49]:
#Check unique values in the "Score" column
IMDb['Score'].unique()

array(['9.3', '9.2', '9.', '9,.0', '8,9f', '08.9', '8.9', '8..8', '8.8',
       '8:8', '++8.7', '8.7.', '8,7e-0', '8.7', '8.6', '8,6', '8.5',
       '8.4', '8.3', '8.2', '8.1', '8.0', '7.9', '7.8', '7.7', '7.6',
       '7.5', '7.4'], dtype=object)

In [50]:
#Checking how many times each unique value appears in the "Score" column
IMDb['Score'].value_counts()

Score
8.6       11
8.2        8
8.3        8
8.1        7
8.4        7
8.5        6
7.5        6
7.8        6
8.0        6
7.9        5
7.6        4
7.7        4
7.4        3
8.8        3
8.7        2
8.9        2
9.2        1
8,6        1
8,7e-0     1
8.7.       1
++8.7      1
8:8        1
8..8       1
08.9       1
8,9f       1
9,.0       1
9.         1
9.3        1
Name: count, dtype: int64

In [51]:
#Replace inconsistent score values
score_map = {
    '9.': '9.0',
    '9,.0': '9.0',
    '8,9f': '8.9',
    '08.9': '8.9',
    '8..8': '8.8',
    '8:8': '8.8',
    '++8.7': '8.7',
    '8.7.': '8.7',
    '8,7e-0': '8.7',
    '8,6': '8.6'
}

IMDb['Score'] = IMDb['Score'].replace(score_map, regex=False)

In [52]:
IMDb['Score'].unique()

array(['9.3', '9.2', '9.0', '8.9', '8.8', '8.7', '8.6', '8.5', '8.4',
       '8.3', '8.2', '8.1', '8.0', '7.9', '7.8', '7.7', '7.6', '7.5',
       '7.4'], dtype=object)

In [53]:
#Convert "Score" to numeric
IMDb['Score'] = pd.to_numeric(IMDb['Score'], errors='coerce')

## Cleaned Data

In [54]:
#Display the first 5 rows of the dataset after cleaning
IMDb.head()

Unnamed: 0,IMDb Title ID,Original Title,Release Year,Genre,Duration,Country,Content Rating,Director,Income ($),Votes,Score
0,tt0111161,The Shawshank Redemption,1995-02-10,Drama,142,USA,R,Frank Darabont,28815245,2278845,9.3
1,tt0068646,The Godfather,1972-03-24,"Crime, Drama",175,USA,R,Francis Ford Coppola,246120974,1572674,9.2
2,tt0468569,The Dark Knight,2008-07-18,"Action, Crime, Drama",152,US,PG-13,Christopher Nolan,1005455211,2241615,9.0
3,tt0071562,The Godfather: Part II,1975-09-25,"Crime, Drama",220,USA,R,Francis Ford Coppola,408035783,1098714,9.0
4,tt0110912,Pulp Fiction,1994-10-28,"Crime, Drama",154,USA,R,Quentin Tarantino,222831817,1780147,8.9


In [55]:
#Get a concise summary of the DataFrame after cleaning
IMDb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 0 to 100
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   IMDb Title ID   100 non-null    object 
 1   Original Title  100 non-null    object 
 2   Release Year    100 non-null    object 
 3   Genre           100 non-null    object 
 4   Duration        100 non-null    int64  
 5   Country         100 non-null    object 
 6   Content Rating  100 non-null    object 
 7   Director        100 non-null    object 
 8   Income ($)      100 non-null    int64  
 9   Votes           100 non-null    int64  
 10  Score           100 non-null    float64
dtypes: float64(1), int64(3), object(7)
memory usage: 9.4+ KB


In [56]:
#Summary statistics for numerical columns after cleaning
IMDb.describe()

Unnamed: 0,Duration,Income ($),Votes,Score
count,100.0,100.0,100.0,100.0
mean,135.94,299125500.0,829771.3,8.24
std,31.04276,436033500.0,482934.0,0.451485
min,81.0,576.0,195789.0,7.4
25%,115.0,23741740.0,389069.0,7.9
50%,130.0,109337000.0,751538.5,8.3
75%,152.25,405208000.0,1102243.0,8.6
max,229.0,2797801000.0,2278845.0,9.3


## Saving the Cleaned Data

In [57]:
IMDb.to_csv("IMDb_cleaned.csv", index=False)
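
CSV stores no type information, so the dates are written as text and need to be parsed again when the file is reloaded. A round-trip sketch using an in-memory buffer in place of the real file (toy data with one row):

```python
import io
from datetime import date

import pandas as pd

df = pd.DataFrame({
    'Original Title': ['The Shawshank Redemption'],
    'Release Year': [date(1995, 2, 10)],
    'Score': [9.3],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates the column would come back as plain strings
restored = pd.read_csv(buffer, parse_dates=['Release Year'])
print(restored.dtypes)
```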