# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning : film.csv

This notebook walks through the step-by-step cleaning process for the table film.csv.

### Import 

We start by importing the libraries we are going to use and loading the dataset.

In [1]:
%pip install ipython
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt

import seaborn as sns 

%matplotlib inline

In [3]:
films = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/film.csv')

In [4]:
films.head()

Unnamed: 0,film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,special_features,last_update
0,1,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,2006,1,,6,0.99,86,20.99,PG,"Deleted Scenes,Behind the Scenes",2006-02-15 05:03:42
1,2,ACE GOLDFINGER,A Astounding Epistle of a Database Administrat...,2006,1,,3,4.99,48,12.99,G,"Trailers,Deleted Scenes",2006-02-15 05:03:42
2,3,ADAPTATION HOLES,A Astounding Reflection of a Lumberjack And a ...,2006,1,,7,2.99,50,18.99,NC-17,"Trailers,Deleted Scenes",2006-02-15 05:03:42
3,4,AFFAIR PREJUDICE,A Fanciful Documentary of a Frisbee And a Lumb...,2006,1,,5,2.99,117,26.99,G,"Commentaries,Behind the Scenes",2006-02-15 05:03:42
4,5,AFRICAN EGG,A Fast-Paced Documentary of a Pastry Chef And ...,2006,1,,6,2.99,130,22.99,G,Deleted Scenes,2006-02-15 05:03:42


### Good practices

A few good practices before we continue with the exercise:

In [5]:
#creating a back-up with the original table 

filmsoriginal = films.copy()

In [6]:
#ensuring column names are clean 

films.columns

Index(['film_id', 'title', 'description', 'release_year', 'language_id',
       'original_language_id', 'rental_duration', 'rental_rate', 'length',
       'replacement_cost', 'rating', 'special_features', 'last_update'],
      dtype='object')

In [7]:
films.columns = [c.lower().replace(' ', '_') for c in films.columns]

films.columns

Index(['film_id', 'title', 'description', 'release_year', 'language_id',
       'original_language_id', 'rental_duration', 'rental_rate', 'length',
       'replacement_cost', 'rating', 'special_features', 'last_update'],
      dtype='object')

In [8]:
#checking for duplicates 

films.duplicated().any() #there are no duplicates 

False
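
As a minimal sketch of the two checks above, the toy frame below (hypothetical data, not taken from film.csv) has messy headers and one repeated row. It also strips leading and trailing spaces from the headers, which is an extra assumption on top of the notebook's own list comprehension.

```python
import pandas as pd

# Toy frame with messy headers and one repeated row (hypothetical data)
toy = pd.DataFrame(
    {
        "Film ID": [1, 2, 2],
        " Title ": ["Academy Dinosaur", "Ace Goldfinger", "Ace Goldfinger"],
    }
)

# Strip outer spaces, lowercase, and replace inner spaces with underscores
toy.columns = [c.strip().lower().replace(" ", "_") for c in toy.columns]

# Count fully duplicated rows, then drop them
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(list(toy.columns), n_duplicates, len(deduped))
```

`duplicated().any()` only says whether a repeat exists; summing the mask gives the count, which is handy when the answer is not `False`.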

### Explore 

Exploratory analysis to understand the dataset (e.g., description, column types, null values).

In [13]:
#this is a repository of films, with their respective data (e.g., their title, language, length, etc.)

films.head(1)

Unnamed: 0,film_id,title,description,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,rating,special_features,last_update
0,1,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a Mad Scientist...,2006,1,,6,0.99,86,20.99,PG,"Deleted Scenes,Behind the Scenes",2006-02-15 05:03:42


In [15]:
#we have 13 columns, and 1000 entries (rows) in our original database

filmsoriginal.shape

(1000, 13)

In [16]:
#here we can see the type of each of the columns  
#we also see column original_language_id is 100% null 

films.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   film_id               1000 non-null   int64  
 1   title                 1000 non-null   object 
 2   description           1000 non-null   object 
 3   release_year          1000 non-null   int64  
 4   language_id           1000 non-null   int64  
 5   original_language_id  0 non-null      float64
 6   rental_duration       1000 non-null   int64  
 7   rental_rate           1000 non-null   float64
 8   length                1000 non-null   int64  
 9   replacement_cost      1000 non-null   float64
 10  rating                1000 non-null   object 
 11  special_features      1000 non-null   object 
 12  last_update           1000 non-null   object 
dtypes: float64(3), int64(5), object(5)
memory usage: 101.7+ KB


In [17]:
#description table 
#here we can see the number of unique values and the most frequent value (top) for text fields, plus summary statistics for numeric fields

films.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
film_id,1000.0,,,,500.5,288.819436,1.0,250.75,500.5,750.25,1000.0
title,1000.0,1000.0,ACADEMY DINOSAUR,1.0,,,,,,,
description,1000.0,1000.0,A Epic Drama of a Feminist And a Mad Scientist...,1.0,,,,,,,
release_year,1000.0,,,,2006.0,0.0,2006.0,2006.0,2006.0,2006.0,2006.0
language_id,1000.0,,,,1.0,0.0,1.0,1.0,1.0,1.0,1.0
original_language_id,0.0,,,,,,,,,,
rental_duration,1000.0,,,,4.985,1.411654,3.0,4.0,5.0,6.0,7.0
rental_rate,1000.0,,,,2.98,1.646393,0.99,0.99,2.99,4.99,4.99
length,1000.0,,,,115.272,40.426332,46.0,80.0,114.0,149.25,185.0
replacement_cost,1000.0,,,,19.984,6.050833,9.99,14.99,19.99,24.99,29.99


### Null values

As noted above, one of the columns (original_language_id) is entirely null. The counts below confirm it:

In [18]:
nan_cols = films.isna().sum()

nan_cols

film_id                    0
title                      0
description                0
release_year               0
language_id                0
original_language_id    1000
rental_duration            0
rental_rate                0
length                     0
replacement_cost           0
rating                     0
special_features           0
last_update                0
dtype: int64

In [19]:
#I will exclude that column from the database

films.drop('original_language_id', axis=1, inplace=True)

In [20]:
nan_cols = films.isna().sum()

nan_cols

film_id             0
title               0
description         0
release_year        0
language_id         0
rental_duration     0
rental_rate         0
length              0
replacement_cost    0
rating              0
special_features    0
last_update         0
dtype: int64
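
When more than one column might be empty, the same step can be done without naming the column. The sketch below uses a small made-up frame (the values are hypothetical) and `DataFrame.dropna(axis=1, how="all")`, which removes only columns where every value is missing.

```python
import numpy as np
import pandas as pd

# Toy frame where one column is entirely null (hypothetical values)
toy = pd.DataFrame(
    {
        "film_id": [1, 2, 3],
        "language_id": [1, 1, 1],
        "original_language_id": [np.nan, np.nan, np.nan],
    }
)

# Share of missing values per column (1.0 means the column is entirely null)
null_share = toy.isna().mean()
fully_null = null_share[null_share == 1.0].index.tolist()

# Drop every column whose values are all missing; returns a new frame
cleaned = toy.dropna(axis=1, how="all")

print(fully_null, list(cleaned.columns))
```

Returning a new frame instead of using `inplace=True` keeps the untouched version available for comparison.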

### Other cleaning 

#### film_id

In [21]:
#we have a column of int values, which seem to be the IDs for each film
#this is the most appropriate datatype (although we will optimize it later)

films.film_id.dtype

dtype('int64')

In [22]:
#it seems all the IDs are unique values 

len(films.film_id.unique())

1000
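
An equivalent check, sketched here on made-up IDs, is `Series.is_unique`, which returns a single boolean rather than a count that has to be compared against the number of rows. `Series.nunique()` gives the count of distinct values when that is needed.

```python
import pandas as pd

# Hypothetical ID columns: one clean, one with a repeated value
ids = pd.Series([1, 2, 3, 4], name="film_id")
repeated = pd.Series([1, 2, 2, 4], name="film_id")

ids_unique = ids.is_unique            # every value appears once
repeated_unique = repeated.is_unique  # the value 2 appears twice
n_distinct = repeated.nunique()       # number of distinct values

print(ids_unique, repeated_unique, n_distinct)
```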

In [24]:
#films.film_id.unique()

#### title

In [25]:
#object type, as strings 

films.title.dtype

dtype('O')

In [26]:
#all film titles unique

len(films.title.unique())

1000

In [27]:
#not a fan of uppercase: convert titles to title case and strip surrounding whitespace

films.title = films.title.apply(lambda X: X.title().strip())

In [29]:
films.head(3)

Unnamed: 0,film_id,title,description,release_year,language_id,rental_duration,rental_rate,length,replacement_cost,rating,special_features,last_update
0,1,Academy Dinosaur,A Epic Drama of a Feminist And a Mad Scientist...,2006,1,6,0.99,86,20.99,PG,"Deleted Scenes,Behind the Scenes",2006-02-15 05:03:42
1,2,Ace Goldfinger,A Astounding Epistle of a Database Administrat...,2006,1,3,4.99,48,12.99,G,"Trailers,Deleted Scenes",2006-02-15 05:03:42
2,3,Adaptation Holes,A Astounding Reflection of a Lumberjack And a ...,2006,1,7,2.99,50,18.99,NC-17,"Trailers,Deleted Scenes",2006-02-15 05:03:42
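
The same transformation can be written with pandas' vectorized string accessor instead of `apply` with a lambda. This is a sketch on three sample titles (one with padding added for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Sample titles; the padded second entry is a hypothetical messy value
titles = pd.Series(["ACADEMY DINOSAUR", "  ACE GOLDFINGER ", "ADAPTATION HOLES"])

# Strip surrounding whitespace, then capitalize the first letter of each word
cleaned = titles.str.strip().str.title()

print(cleaned.tolist())
```

Unlike a plain lambda, the `.str` methods pass missing values through as missing instead of raising an error.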


#### description

In [30]:
#object type, as strings 

films.description.dtype

dtype('O')

In [31]:
#all descriptions unique

len(films.description.unique())

1000

In [34]:
#prefer only the first letter of each sentence capitalized (note: this lowercases every other letter, including proper nouns)

films.description = films.description.apply(lambda x: '. '.join(map(str.capitalize, x.split('. '))))

In [35]:
films.head(3)

Unnamed: 0,film_id,title,description,release_year,language_id,rental_duration,rental_rate,length,replacement_cost,rating,special_features,last_update
0,1,Academy Dinosaur,A epic drama of a feminist and a mad scientist...,2006,1,6,0.99,86,20.99,PG,"Deleted Scenes,Behind the Scenes",2006-02-15 05:03:42
1,2,Ace Goldfinger,A astounding epistle of a database administrat...,2006,1,3,4.99,48,12.99,G,"Trailers,Deleted Scenes",2006-02-15 05:03:42
2,3,Adaptation Holes,A astounding reflection of a lumberjack and a ...,2006,1,7,2.99,50,18.99,NC-17,"Trailers,Deleted Scenes",2006-02-15 05:03:42
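
The one-line lambda above can be easier to read and test as a named helper. The sketch below uses a hypothetical function name, `sentence_case`, and a made-up two-sentence description to show what happens at a sentence boundary.

```python
import pandas as pd


def sentence_case(text: str) -> str:
    """Capitalize the first letter of each '. '-separated sentence.

    str.capitalize() lowercases every remaining character, so any
    proper noun inside a sentence loses its capital letter.
    """
    return ". ".join(part.capitalize() for part in text.split(". "))


# Made-up description with two sentences
descriptions = pd.Series(
    ["A Epic Drama of a Feminist And a Mad Scientist. A Second Sentence Here"]
)
result = descriptions.map(sentence_case)

print(result.iloc[0])
```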


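## Data Cleaning : actors

The remaining cells work on the `actors` DataFrame (exported at the end as actor_clean.csv) rather than on `films`. The cell that loads it is not shown here.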

#### last_update

In [40]:
#this column is type 'object', though a datetime type would be more appropriate

actors.last_update.dtype

dtype('O')

In [39]:
#all the values are the same, indicating all the rows were last updated on Feb 15th, 2006

actors.last_update.value_counts()

last_update
2006-02-15 04:34:33    200
Name: count, dtype: int64

In [41]:
#I will convert the data to datetime64

actors.last_update = pd.to_datetime(actors.last_update)

In [42]:
#converted 

actors.last_update.dtype

dtype('<M8[ns]')
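
A small sketch of the same conversion with an explicit format string, using the two timestamp values that appear in the outputs above. Passing `format` is optional, and `errors="coerce"` is an assumption about how bad values should be handled: it turns anything unparseable into `NaT` instead of raising.

```python
import pandas as pd

# The two timestamp strings seen in the table previews
raw = pd.Series(["2006-02-15 04:34:33", "2006-02-15 05:03:42"])
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S")

# A hypothetical malformed value becomes NaT with errors="coerce"
messy = pd.Series(["2006-02-15 04:34:33", "not a date"])
parsed_messy = pd.to_datetime(messy, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.dtype, parsed_messy.isna().tolist())
```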

#### first_name, last_name, and new column full_name

I will clean the columns first_name and last_name together, and check for repeated actors (by their full name).

In [44]:
#these columns are type 'object'. They contain strings

print(actors.first_name.dtype)
print(actors.last_name.dtype)

object
object


In [49]:
#these are the top first_names 
#some repeated values, but let us wait until we see full names

actors.first_name.value_counts().head(3)

first_name
PENELOPE    4
JULIA       4
KENNETH     4
Name: count, dtype: int64

In [79]:
#actors.first_name.unique()

In [50]:
#these are the top last_names 
#some repeated values, but let us wait until we see full names

actors.last_name.value_counts().head(3)

last_name
KILMER    5
TEMPLE    4
NOLTE     4
Name: count, dtype: int64

In [81]:
#actors.last_name.unique()

In [None]:
#I personally don't like uppercase, so the next two cells convert names to title case and remove any spaces

In [52]:
actors.first_name = actors.first_name.apply(lambda X: X.title().replace(' ',''))

In [53]:
actors.last_name = actors.last_name.apply(lambda X: X.title().replace(' ',''))

In [55]:
actors.head()

Unnamed: 0,actor_id,first_name,last_name,last_update
0,1,Penelope,Guiness,2006-02-15 04:34:33
1,2,Nick,Wahlberg,2006-02-15 04:34:33
2,3,Ed,Chase,2006-02-15 04:34:33
3,4,Jennifer,Davis,2006-02-15 04:34:33
4,5,Johnny,Lollobrigida,2006-02-15 04:34:33


In [57]:
#let us create a full name column, and place it after last_name 

actors.insert(3, 'full_name', actors['first_name'] + ' ' + actors['last_name'])

In [63]:
#now let us see if we have repeated actors 

actors.full_name.value_counts()

full_name
Susan Davis             2
Ewan Gooding            1
Daryl Crawford          1
Greta Keitel            1
Jane Jackman            1
                       ..
Michelle Mcconaughey    1
Adam Grant              1
Sean Williams           1
Gary Penn               1
Thora Temple            1
Name: count, Length: 199, dtype: int64

In [84]:
#actors.full_name.unique()

In [None]:
#it seems we do have a repeated value. However, according to the note below, there is more than one actress named Susan Davis.
#for now I will keep both values, but keep that info in mind as we establish links between the tables

'''From ChatGPT (not independently verified)
Susan Davis (born 1943): Known for her roles in films such as "Three Women" (1977) and "Love and Death" (1975).

Susan Davis (born 1944): Known for her role as Betty Munson in the TV series "The Mary Tyler Moore Show" (1970-1977) and its spin-off "Lou Grant" (1977-1982).'''
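
To see which rows share a full name, rather than only the count, `Series.duplicated(keep=False)` marks every member of a repeated group. The sketch below uses a four-row toy table. The names come from the output above, but the `actor_id` values are made up for illustration.

```python
import pandas as pd

# Toy actors table; the actor_id values are hypothetical
toy = pd.DataFrame(
    {
        "actor_id": [1, 2, 3, 4],
        "first_name": ["Susan", "Ewan", "Susan", "Gary"],
        "last_name": ["Davis", "Gooding", "Davis", "Penn"],
    }
)
toy["full_name"] = toy["first_name"] + " " + toy["last_name"]

# keep=False flags all rows in a repeated group, not just the later ones
repeated = toy[toy["full_name"].duplicated(keep=False)]

print(repeated[["actor_id", "full_name"]])
```

Because the two rows carry different IDs, they remain distinguishable when linking to other tables, even though the names match.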

### Column types and optimization 

I will optimize the column types to reduce memory usage.

In [86]:
actors.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   actor_id     200 non-null    int64         
 1   first_name   200 non-null    object        
 2   last_name    200 non-null    object        
 3   full_name    200 non-null    object        
 4   last_update  200 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 41.3 KB


In [87]:
#downcast actor_id

actors.actor_id = pd.to_numeric(actors.actor_id, downcast='integer')
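
`pd.to_numeric` with `downcast='integer'` picks the smallest signed integer type that can hold every value in the column. A minimal sketch on made-up IDs from 1 to 200 (the real `actor_id` values are assumed to be similar, not checked):

```python
import pandas as pd

# Hypothetical IDs 1..200 stored as 64-bit integers
ids = pd.Series(range(1, 201), dtype="int64")

# Smallest signed type: int8 tops out at 127, so this lands on int16
signed = pd.to_numeric(ids, downcast="integer")

# Smallest unsigned type: uint8 holds 0..255, enough for these IDs
unsigned = pd.to_numeric(ids, downcast="unsigned")

print(signed.dtype, unsigned.dtype)
```

The signed result matches the `int16` shown for `actor_id` in the later `info()` output.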

In [88]:
#name columns to 'category'

for c in actors.select_dtypes(include='object'):
    actors[c] = actors[c].astype('category')

In [89]:
#no need for 'nanoseconds' precision

actors.last_update = actors.last_update.astype('datetime64[s]')

### Comparison output vs. original

In [90]:
#one additional column as we have created a 'full_name' column 

print(actorsoriginal.shape)
print(actors.shape)

(200, 4)
(200, 5)


In [91]:
actors.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype        
---  ------       --------------  -----        
 0   actor_id     200 non-null    int16        
 1   first_name   200 non-null    category     
 2   last_name    200 non-null    category     
 3   full_name    200 non-null    category     
 4   last_update  200 non-null    datetime64[s]
dtypes: category(3), datetime64[s](1), int16(1)
memory usage: 48.1 KB


In [92]:
actorsoriginal.info(memory_usage='deep') #take into account we have included a column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   actor_id     200 non-null    int64 
 1   first_name   200 non-null    object
 2   last_name    200 non-null    object
 3   last_update  200 non-null    object
dtypes: int64(1), object(3)
memory usage: 41.0 KB
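
The two `info()` outputs above report 48.1 KB for the converted table against 41.0 KB for the original, so the conversion did not save memory here. A likely reason, offered as an assumption rather than a measured diagnosis, is that the name columns hold mostly distinct values: a categorical stores each distinct value once plus an integer code per row, which only pays off when values repeat often. The sketch below compares both situations on made-up data.

```python
import pandas as pd

n = 900

# High cardinality: every value is distinct (hypothetical names)
unique_names = pd.Series([f"name_{i}" for i in range(n)], dtype=object)

# Low cardinality: three rating codes repeated many times
ratings = pd.Series(["PG", "G", "NC-17"] * (n // 3), dtype=object)


def deep_bytes(s: pd.Series) -> int:
    """Memory footprint in bytes, counting the string contents."""
    return int(s.memory_usage(deep=True))


names_obj = deep_bytes(unique_names)
names_cat = deep_bytes(unique_names.astype("category"))
ratings_obj = deep_bytes(ratings)
ratings_cat = deep_bytes(ratings.astype("category"))

print("names:  ", names_obj, "->", names_cat)
print("ratings:", ratings_obj, "->", ratings_cat)
```

A rough rule of thumb is to convert a column to `category` only when the number of distinct values is small relative to the number of rows.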


### Export clean table

In [93]:
actors.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/actor_clean.csv', index=False)
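
CSV files do not store column types, so the downcast and datetime conversions have to be re-applied when the file is read back. The sketch below writes a two-row toy table to a temporary directory (a stand-in for the real output folder) and restores the types on load. The `dtype` and `parse_dates` arguments reflect the column names used in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two-row toy table using names and a timestamp from the outputs above
toy = pd.DataFrame(
    {
        "actor_id": [1, 2],
        "full_name": ["Penelope Guiness", "Nick Wahlberg"],
        "last_update": pd.to_datetime(["2006-02-15 04:34:33"] * 2),
    }
)
toy["actor_id"] = toy["actor_id"].astype("int16")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "actor_clean.csv"
    toy.to_csv(out_path, index=False)

    # Re-declare the types while reading, since CSV keeps only text
    reloaded = pd.read_csv(
        out_path,
        dtype={"actor_id": "int16"},
        parse_dates=["last_update"],
    )

print(reloaded.dtypes)
```

Building the path with `pathlib.Path` relative to a project folder, instead of hard-coding a user-specific absolute path, would also make the notebook easier to run on another machine.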