# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning: old_HDD.csv

This notebook walks through the step-by-step cleaning process for the table old_HDD.csv.

We were told this is a database that was 'lost' among the other files - let us see if it can be useful!

### Import 

We start by importing the libraries we are going to use and loading the database

In [2]:
%pip install ipython
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt

import seaborn as sns 

%matplotlib inline

In [4]:
olddb = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/old_HDD.csv')

In [5]:
olddb.head()

Unnamed: 0,first_name,last_name,title,release_year,category_id
0,PENELOPE,GUINESS,ACADEMY DINOSAUR,2006,6
1,PENELOPE,GUINESS,ANACONDA CONFESSIONS,2006,2
2,PENELOPE,GUINESS,ANGELS LIFE,2006,13
3,PENELOPE,GUINESS,BULWORTH COMMANDMENTS,2006,10
4,PENELOPE,GUINESS,CHEAPER CLYDE,2006,14


It seems this will be quite useful for: 

1) relating actors and titles 
2) relating titles and categories

I will work on both!  

### Good practices

Some good practices before we continue with the exercise

In [6]:
#creating a back-up with the original table 

olddboriginal = olddb.copy()

In [7]:
#ensuring column names are clean 

olddb.columns

Index(['first_name', 'last_name', 'title', 'release_year', 'category_id'], dtype='object')

In [8]:
olddb.columns = [c.lower().replace(' ', '_') for c in olddb.columns]

olddb.columns

Index(['first_name', 'last_name', 'title', 'release_year', 'category_id'], dtype='object')

In [9]:
#checking for duplicates 

olddb.duplicated().any() #there are no duplicates 

False
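
`duplicated().any()` only flags rows that are identical in every column. As a minimal sketch, on a made-up toy frame (the rows and the names `toy`, `pair_cols` and `repeated_pairs` are illustrative, not taken from old_HDD.csv), the same check can be narrowed to the columns that identify an actor/title pair:

```python
import pandas as pd

# Toy data (hypothetical): the same actor/title pair appears twice
toy = pd.DataFrame({
    'first_name': ['PENELOPE', 'PENELOPE', 'PENELOPE'],
    'last_name': ['GUINESS', 'GUINESS', 'GUINESS'],
    'title': ['ACADEMY DINOSAUR', 'ANGELS LIFE', 'ACADEMY DINOSAUR'],
    'category_id': [6, 13, 6],
})

# Full-row check, as in the cell above
has_full_duplicates = toy.duplicated().any()

# Check only the columns that identify an actor/title pair;
# keep=False marks every occurrence, not just the later ones
pair_cols = ['first_name', 'last_name', 'title']
repeated_pairs = toy[toy.duplicated(subset=pair_cols, keep=False)]

print(has_full_duplicates)
print(repeated_pairs)
```

Using `keep=False` makes it easier to inspect all copies of a repeated pair side by side before deciding which one to keep.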

### Explore 

Exploratory analysis to understand the database (e.g., description, column types, searching for null values)

In [10]:
#it seems we have a repository of actors and the films they appeared in
#we also have the category associated with each movie 

olddb.head()

Unnamed: 0,first_name,last_name,title,release_year,category_id
0,PENELOPE,GUINESS,ACADEMY DINOSAUR,2006,6
1,PENELOPE,GUINESS,ANACONDA CONFESSIONS,2006,2
2,PENELOPE,GUINESS,ANGELS LIFE,2006,13
3,PENELOPE,GUINESS,BULWORTH COMMANDMENTS,2006,10
4,PENELOPE,GUINESS,CHEAPER CLYDE,2006,14


In [11]:
#we have 5 columns, and 1000 entries (rows) in our original database

olddboriginal.shape

(1000, 5)

In [12]:
#here we can see the type of each of the columns 
#it seems all values are non-null

olddb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   first_name    1000 non-null   object
 1   last_name     1000 non-null   object
 2   title         1000 non-null   object
 3   release_year  1000 non-null   int64 
 4   category_id   1000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 39.2+ KB


In [13]:
#description table 
#here we can see the number of unique values and the mode of each field
#ultimately we will be interested in the unique 'full names', so it is worth checking whether there are repeated values there

olddb.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
first_name,1000.0,38.0,SANDRA,56.0,,,,,,,
last_name,1000.0,37.0,OLIVIER,53.0,,,,,,,
title,1000.0,614.0,BOONDOCK BALLROOM,6.0,,,,,,,
release_year,1000.0,,,,2006.0,0.0,2006.0,2006.0,2006.0,2006.0,2006.0
category_id,1000.0,,,,8.355,4.726872,1.0,4.0,8.0,13.0,16.0


### Null values

As stated above, there are no null values in the database, as shown below:

In [14]:
#there are no null values in the database 

nan_cols = olddb.isna().sum()

nan_cols

first_name      0
last_name       0
title           0
release_year    0
category_id     0
dtype: int64

### Other cleaning 

#### first_name, last_name, and new column full_name

I will clean the columns first_name, last_name together, and make sure there are no repeated actors (by their full name)

In [15]:
#both columns are of type 'object'; they contain strings

print(olddb.first_name.dtype)
print(olddb.last_name.dtype)

object
object


In [16]:
#these are the top first_names 
#some repeated values, but let us wait until we see full names

olddb.first_name.value_counts().head(5)

first_name
SANDRA    56
VAL       35
UMA       35
JULIA     33
RIP       33
Name: count, dtype: int64

In [17]:
#olddb.first_name.unique()

In [18]:
#these are the top last_names 
#some repeated values, but let us wait until we see full names

olddb.last_name.value_counts().head(3)

last_name
OLIVIER    53
PECK       43
KILMER     37
Name: count, dtype: int64

In [19]:
#olddb.last_name.unique()

In [20]:
#I personally don't like uppercase 

In [21]:
olddb.first_name = olddb.first_name.apply(lambda X: X.title().replace(' ',''))

In [22]:
olddb.last_name = olddb.last_name.apply(lambda X: X.title().replace(' ',''))

In [23]:
olddb.head()

Unnamed: 0,first_name,last_name,title,release_year,category_id
0,Penelope,Guiness,ACADEMY DINOSAUR,2006,6
1,Penelope,Guiness,ANACONDA CONFESSIONS,2006,2
2,Penelope,Guiness,ANGELS LIFE,2006,13
3,Penelope,Guiness,BULWORTH COMMANDMENTS,2006,10
4,Penelope,Guiness,CHEAPER CLYDE,2006,14


In [24]:
#let us create a full name column, and place it after last_name 

olddb.insert(2, 'full_name', olddb['first_name'] + ' ' + olddb['last_name'])

In [25]:
#there will be repeated values, as this table relates actors to the films they appeared in

olddb.head()

Unnamed: 0,first_name,last_name,full_name,title,release_year,category_id
0,Penelope,Guiness,Penelope Guiness,ACADEMY DINOSAUR,2006,6
1,Penelope,Guiness,Penelope Guiness,ANACONDA CONFESSIONS,2006,2
2,Penelope,Guiness,Penelope Guiness,ANGELS LIFE,2006,13
3,Penelope,Guiness,Penelope Guiness,BULWORTH COMMANDMENTS,2006,10
4,Penelope,Guiness,Penelope Guiness,CHEAPER CLYDE,2006,14
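
The same name cleaning can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a minimal sketch on toy names (the `names` frame and its rows are hypothetical), meant as an equivalent alternative rather than a change to the steps above:

```python
import pandas as pd

# Toy names (hypothetical) in the same uppercase style as old_HDD.csv,
# with stray spaces added to show the effect of the replace step
names = pd.DataFrame({
    'first_name': ['PENELOPE', ' NICK'],
    'last_name': ['GUINESS', 'WAHLBERG '],
})

for col in ['first_name', 'last_name']:
    # Title-case and drop any spaces, mirroring the lambda used above
    names[col] = names[col].str.title().str.replace(' ', '', regex=False)

# Build full_name and place it right after last_name
names.insert(2, 'full_name', names['first_name'] + ' ' + names['last_name'])

print(names)
```

Note that removing every space also joins multi-word names, which is the behaviour of the original lambda as well.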


#### title 

In [26]:
#object type, as strings 

olddb.title.dtype

dtype('O')

In [27]:
#not a fan of uppercase: convert titles to title case and strip surrounding whitespace

olddb.title = olddb.title.apply(lambda X: X.title().strip())

In [28]:
olddb.head(3)

Unnamed: 0,first_name,last_name,full_name,title,release_year,category_id
0,Penelope,Guiness,Penelope Guiness,Academy Dinosaur,2006,6
1,Penelope,Guiness,Penelope Guiness,Anaconda Confessions,2006,2
2,Penelope,Guiness,Penelope Guiness,Angels Life,2006,13


In [29]:
#here we have the category for 614 titles. Our titles database has 1000 unique titles.
#will assess later if we can leverage that info to connect the films & category tables

len(olddb.title.unique())

614
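
To see which films have no category information, the titles in this table can be compared with the titles in the films table. This is a minimal sketch on toy series (`old_titles` and `film_titles` are hypothetical stand-ins for `olddb.title` and the title column of the cleaned films table):

```python
import pandas as pd

# Hypothetical stand-ins for olddb.title and the films table titles
old_titles = pd.Series(['Academy Dinosaur', 'Angels Life', 'Academy Dinosaur'])
film_titles = pd.Series(['Academy Dinosaur', 'Ace Goldfinger', 'Angels Life'])

# True for films that appear at least once in the old table
covered = film_titles.isin(old_titles.unique())

print(covered.sum(), 'of', len(film_titles), 'films covered')
print(film_titles[~covered].tolist())  # films with no category information
```

With the real data, the uncovered list would correspond to the films whose category cannot be recovered from old_HDD.csv alone.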

#### release_year

In [30]:
#I do not need that column for the purposes stated above (i.e., relating films to actors and relating films to categories)

olddb.drop(columns='release_year', inplace=True)

In [31]:
olddb.head()

Unnamed: 0,first_name,last_name,full_name,title,category_id
0,Penelope,Guiness,Penelope Guiness,Academy Dinosaur,6
1,Penelope,Guiness,Penelope Guiness,Anaconda Confessions,2
2,Penelope,Guiness,Penelope Guiness,Angels Life,13
3,Penelope,Guiness,Penelope Guiness,Bulworth Commandments,10
4,Penelope,Guiness,Penelope Guiness,Cheaper Clyde,14


### Importing the name ID and title ID

Ideally, I want to relate actors to titles and titles to categories by their IDs, so I need to import the actor and film IDs. That information is in the actor.csv and title.csv files, which we have already cleaned.

In [32]:
films_clean = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/films_clean.csv')

In [33]:
actor_clean = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/actor_clean.csv')

In [34]:
films_clean.head(1)

Unnamed: 0,film_id,title,description,release_year,language_id,rental_duration,rental_rate,length,replacement_cost,rating,special_features,film_last_update
0,1,Academy Dinosaur,A epic drama of a feminist and a mad scientist...,1970,1,6,0.99,86,20.99,PG,"Deleted Scenes,Behind the Scenes",2006-02-15 05:03:42


In [35]:
actor_clean.head(1)

Unnamed: 0,actor_id,first_name,last_name,full_name,actor_last_update
0,1,Penelope,Guiness,Penelope Guiness,2006-02-15 04:34:33


In [36]:
#I will merge the data frames on 'full_name' and on 'title'

In [37]:
olddb_merge1 = pd.merge(olddb, actor_clean[['full_name','actor_id']], on='full_name', how='left')

In [38]:
olddb_merge1.head()

Unnamed: 0,first_name,last_name,full_name,title,category_id,actor_id
0,Penelope,Guiness,Penelope Guiness,Academy Dinosaur,6,1
1,Penelope,Guiness,Penelope Guiness,Anaconda Confessions,2,1
2,Penelope,Guiness,Penelope Guiness,Angels Life,13,1
3,Penelope,Guiness,Penelope Guiness,Bulworth Commandments,10,1
4,Penelope,Guiness,Penelope Guiness,Cheaper Clyde,14,1


In [39]:
olddb_merge2 = pd.merge(olddb_merge1, films_clean[['title','film_id']], on='title', how='left')

In [40]:
olddb_merge2.head()

Unnamed: 0,first_name,last_name,full_name,title,category_id,actor_id,film_id
0,Penelope,Guiness,Penelope Guiness,Academy Dinosaur,6,1,1
1,Penelope,Guiness,Penelope Guiness,Anaconda Confessions,2,1,23
2,Penelope,Guiness,Penelope Guiness,Angels Life,13,1,25
3,Penelope,Guiness,Penelope Guiness,Bulworth Commandments,10,1,106
4,Penelope,Guiness,Penelope Guiness,Cheaper Clyde,14,1,140
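
A left merge on a text key can silently add rows (if the key is repeated in the lookup table) or leave IDs empty (if a name or title has no match). This is a minimal sketch, on toy frames (`left` and `right` are hypothetical stand-ins for `olddb` and `actor_clean`), of how `pd.merge` can check both cases through its `validate` and `indicator` parameters:

```python
import pandas as pd

# Toy stand-ins (hypothetical rows) for olddb and actor_clean
left = pd.DataFrame({
    'full_name': ['Penelope Guiness', 'Penelope Guiness', 'Nick Wahlberg'],
    'title': ['Academy Dinosaur', 'Angels Life', 'Adaptation Holes'],
})
right = pd.DataFrame({
    'full_name': ['Penelope Guiness', 'Nick Wahlberg'],
    'actor_id': [1, 2],
})

# validate='many_to_one' raises pandas.errors.MergeError if a key is repeated in `right`;
# indicator=True adds a '_merge' column that flags rows with no match
merged = pd.merge(left, right, on='full_name', how='left',
                  validate='many_to_one', indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']

print(len(merged) == len(left))  # row count is preserved
print(unmatched.empty)           # every row found an actor_id
```

The same pattern would apply to the merge on `title`; the `_merge` column can be dropped once the check has passed.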


### Column names and duplicates 

In [41]:
#checking for duplicates 

olddb_merge2.duplicated().any() #there are no duplicates 

False

In [42]:
olddb_merge2.head(2)

Unnamed: 0,first_name,last_name,full_name,title,category_id,actor_id,film_id
0,Penelope,Guiness,Penelope Guiness,Academy Dinosaur,6,1,1
1,Penelope,Guiness,Penelope Guiness,Anaconda Confessions,2,1,23


### Column types and optimization 

I will optimize the database for memory 

In [43]:
olddb_merge2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   first_name   1000 non-null   object
 1   last_name    1000 non-null   object
 2   full_name    1000 non-null   object
 3   title        1000 non-null   object
 4   category_id  1000 non-null   int64 
 5   actor_id     1000 non-null   int64 
 6   film_id      1000 non-null   int64 
dtypes: int64(3), object(4)
memory usage: 282.9 KB


In [44]:
#downcast int

for c in olddb_merge2.select_dtypes('integer'):
    
    olddb_merge2[c] = pd.to_numeric(olddb_merge2[c], downcast='integer')

In [45]:
#convert object (string) columns to 'category'

for c in olddb_merge2.select_dtypes(include='object'):
    
    olddb_merge2[c] = olddb_merge2[c].astype('category')   
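
The effect of these two steps can be measured directly with `memory_usage(deep=True)`. This is a minimal sketch on a toy frame (the values are hypothetical, chosen so that strings repeat often and integers are small):

```python
import pandas as pd

# Toy frame (hypothetical values) with a repetitive string column and small integers
toy = pd.DataFrame({
    'full_name': ['Penelope Guiness'] * 500 + ['Nick Wahlberg'] * 500,
    'film_id': list(range(1, 1001)),
})

before = toy.memory_usage(deep=True).sum()

# Same two steps as above: downcast integers, convert strings to 'category'
toy['film_id'] = pd.to_numeric(toy['film_id'], downcast='integer')
toy['full_name'] = toy['full_name'].astype('category')

after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(before, '->', after, 'bytes')
```

The 'category' dtype saves the most memory when a column has few distinct values relative to its length; for a column where almost every value is unique, the gain is small.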

### Comparison output vs. original

In [46]:
#we have added the ID columns

print(olddboriginal.shape)
print(olddb_merge2.shape)

(1000, 5)
(1000, 7)


In [47]:
olddb_merge2.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   first_name   1000 non-null   category
 1   last_name    1000 non-null   category
 2   full_name    1000 non-null   category
 3   title        1000 non-null   category
 4   category_id  1000 non-null   int8    
 5   actor_id     1000 non-null   int8    
 6   film_id      1000 non-null   int16   
dtypes: category(4), int16(1), int8(2)
memory usage: 78.2 KB


In [48]:
olddboriginal.info(memory_usage='deep') #keep in mind that the cleaned table has more columns than the original

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   first_name    1000 non-null   object
 1   last_name     1000 non-null   object
 2   title         1000 non-null   object
 3   release_year  1000 non-null   int64 
 4   category_id   1000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 207.6 KB


### Export clean table

#### table actor_id and film_id 

In [49]:
actor_film = olddb_merge2.loc[:,['actor_id','film_id']]

In [50]:
actor_film.head(4)

Unnamed: 0,actor_id,film_id
0,1,1
1,1,23
2,1,25
3,1,106


In [51]:
actor_film.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/actor_film_clean.csv', index=False)

#### table  film_id and category_id

In [52]:
film_category = olddb_merge2.loc[:,['title','film_id','category_id']]

In [53]:
film_category.head(3)

Unnamed: 0,title,film_id,category_id
0,Academy Dinosaur,1,6
1,Anaconda Confessions,23,2
2,Angels Life,25,13


In [54]:
film_category.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/film_category_clean.csv', index=False)
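
Because the source table has one row per actor/title pair, a film with several actors appears several times in `film_category`. This is a minimal sketch, on toy rows (hypothetical values), of reducing the table to one row per film before exporting. Whether the exported file needs to be deduplicated depends on how it is loaded into MySQL, which is an assumption here.

```python
import pandas as pd

# Toy rows (hypothetical): one film listed once per actor
film_category = pd.DataFrame({
    'title': ['Academy Dinosaur', 'Academy Dinosaur', 'Angels Life'],
    'film_id': [1, 1, 25],
    'category_id': [6, 6, 13],
})

# Keep one row per film_id/category_id pair
film_category_unique = (
    film_category
    .drop_duplicates(subset=['film_id', 'category_id'])
    .reset_index(drop=True)
)

# A film mapped to more than one category would show up here
conflicts = film_category_unique[
    film_category_unique.duplicated('film_id', keep=False)
]

print(film_category_unique)
print(conflicts.empty)
```

An empty `conflicts` frame suggests that `film_id` could serve as a unique key for the film/category relation.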