# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning : old_HDD.csv

This notebook walks through the step-by-step cleaning process for the table old_HDD.csv.

We were told this is a database that was 'lost' among the other files - let us see if it can be useful!

### Import 

We start by importing the libraries we are going to use and loading the data.

In [58]:
%pip install ipython
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [59]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt

import seaborn as sns 

%matplotlib inline

In [60]:
olddb = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/old_HDD.csv')

In [61]:
olddb.head()

Unnamed: 0,first_name,last_name,title,release_year,category_id
0,PENELOPE,GUINESS,ACADEMY DINOSAUR,2006,6
1,PENELOPE,GUINESS,ANACONDA CONFESSIONS,2006,2
2,PENELOPE,GUINESS,ANGELS LIFE,2006,13
3,PENELOPE,GUINESS,BULWORTH COMMANDMENTS,2006,10
4,PENELOPE,GUINESS,CHEAPER CLYDE,2006,14


It seems this will be quite useful for: 

1) relating actors and titles 
2) relating titles and categories

I will work on both!  
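
As a rough sketch of where this is heading, the single table can be split into two link tables, one per relationship. The rows below are toy data in the same shape as the preview above; the second actor ("JANE DOE") is made up purely to show a title shared by two actors, and the names `actor_title` and `title_category` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy rows shaped like old_HDD.csv; the JANE DOE row is hypothetical
sample = pd.DataFrame({
    "first_name": ["PENELOPE", "PENELOPE", "JANE"],
    "last_name": ["GUINESS", "GUINESS", "DOE"],
    "title": ["ACADEMY DINOSAUR", "ANGELS LIFE", "ACADEMY DINOSAUR"],
    "release_year": [2006, 2006, 2006],
    "category_id": [6, 13, 6],
})

# 1) actor <-> title: one row per (actor, film) pair
actor_title = sample[["first_name", "last_name", "title"]].drop_duplicates()

# 2) title <-> category: one row per film, however many actors it has
title_category = sample[["title", "category_id"]].drop_duplicates()

print(actor_title)
print(title_category)
```

The second table is smaller than the first because a film that appears once per actor collapses to a single title/category row.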

### Good practices

Some good practices before we continue with the exercise

In [62]:
#creating a back-up with the original table 

olddboriginal = olddb.copy()

In [63]:
#ensuring column names are clean 

olddb.columns

Index(['first_name', 'last_name', 'title', 'release_year', 'category_id'], dtype='object')

In [64]:
olddb.columns = [c.lower().replace(' ', '_') for c in olddb.columns]

olddb.columns

Index(['first_name', 'last_name', 'title', 'release_year', 'category_id'], dtype='object')

In [65]:
#checking for duplicates 

olddb.duplicated().any() #there are no duplicates 

False
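
Called without arguments, `duplicated()` only flags rows that match on every column. A possible extra check, sketched below on made-up rows, is to narrow it to the columns that identify an actor-film pair, which would catch the same pair listed under two different categories. Whether such rows exist in old_HDD.csv is not something this sketch establishes.

```python
import pandas as pd

# Hypothetical rows: same actor and title, different category_id
toy = pd.DataFrame({
    "first_name": ["PENELOPE", "PENELOPE"],
    "last_name": ["GUINESS", "GUINESS"],
    "title": ["ACADEMY DINOSAUR", "ACADEMY DINOSAUR"],
    "category_id": [6, 2],
})

# Full-row check: the rows differ in category_id, so nothing is flagged
full_row_dupes = toy.duplicated().any()

# Subset check: the (actor, title) pair repeats, so this is flagged
pair_dupes = toy.duplicated(subset=["first_name", "last_name", "title"]).any()

print(full_row_dupes, pair_dupes)
```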

### Explore 

Exploratory analysis to understand the database (e.g., description, column types, searching for null values)

In [66]:
#it seems we have a repository of actors with films where they participated
#we also have the category associated with each movie 

olddb.head()

Unnamed: 0,first_name,last_name,title,release_year,category_id
0,PENELOPE,GUINESS,ACADEMY DINOSAUR,2006,6
1,PENELOPE,GUINESS,ANACONDA CONFESSIONS,2006,2
2,PENELOPE,GUINESS,ANGELS LIFE,2006,13
3,PENELOPE,GUINESS,BULWORTH COMMANDMENTS,2006,10
4,PENELOPE,GUINESS,CHEAPER CLYDE,2006,14


In [67]:
#we have 5 columns, and 1000 entries (rows) in our original database

olddboriginal.shape

(1000, 5)

In [68]:
#here we can see the type of each of the columns 
#it seems all values are non-null

olddb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   first_name    1000 non-null   object
 1   last_name     1000 non-null   object
 2   title         1000 non-null   object
 3   release_year  1000 non-null   int64 
 4   category_id   1000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 39.2+ KB


In [69]:
#description table 
#here we can see the number of unique values and the most frequent value (top) of each field
#ultimately we are interested in unique full names, so it is worth checking for repeated values there

olddb.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
first_name,1000.0,38.0,SANDRA,56.0,,,,,,,
last_name,1000.0,37.0,OLIVIER,53.0,,,,,,,
title,1000.0,614.0,BOONDOCK BALLROOM,6.0,,,,,,,
release_year,1000.0,,,,2006.0,0.0,2006.0,2006.0,2006.0,2006.0,2006.0
category_id,1000.0,,,,8.355,4.726872,1.0,4.0,8.0,13.0,16.0


### Null values

As stated above, there are no null values in the database, as shown below:

In [70]:
#there are no null values in the database 

nan_cols = olddb.isna().sum()

nan_cols

first_name      0
last_name       0
title           0
release_year    0
category_id     0
dtype: int64
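
One caveat: `isna()` only counts true missing values (NaN/None), so an empty or whitespace-only string would still show up as non-null. A minimal sketch of an additional check for the text columns, on made-up data (the column list `text_cols` is an assumption for illustration):

```python
import pandas as pd

# Toy frame: no NaN anywhere, but two 'blank' names
toy = pd.DataFrame({
    "first_name": ["PENELOPE", " ", ""],
    "category_id": [6, 2, 13],
})

# isna() reports zero missing values here
nan_total = int(toy.isna().sum().sum())

# Strip whitespace, then count strings that end up empty
text_cols = ["first_name"]
blank_counts = toy[text_cols].apply(lambda s: s.str.strip().eq("").sum())

print(nan_total)
print(blank_counts)
```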

### Other cleaning 

#### first_name, last_name, and new column full_name

I will clean the columns first_name, last_name together, and make sure there are no repeated actors (by their full name)

In [71]:
#these columns are of type 'object'; they contain strings

print(olddb.first_name.dtype)
print(olddb.last_name.dtype)

object
object


In [72]:
#these are the top first_names 
#some repeated values, but let us wait until we see full names

olddb.first_name.value_counts().head(5)

first_name
SANDRA    56
VAL       35
UMA       35
JULIA     33
RIP       33
Name: count, dtype: int64

In [73]:
#olddb.first_name.unique()

In [74]:
#these are the top last_names 
#some repeated values, but let us wait until we see full names

olddb.last_name.value_counts().head(3)

last_name
OLIVIER    53
PECK       43
KILMER     37
Name: count, dtype: int64

In [75]:
#olddb.last_name.unique()

In [76]:
#I personally don't like uppercase 

In [77]:
olddb.first_name = olddb.first_name.apply(lambda X: X.title().replace(' ',''))

In [78]:
olddb.last_name = olddb.last_name.apply(lambda X: X.title().replace(' ',''))

In [79]:
olddb.head()

Unnamed: 0,first_name,last_name,title,release_year,category_id
0,Penelope,Guiness,ACADEMY DINOSAUR,2006,6
1,Penelope,Guiness,ANACONDA CONFESSIONS,2006,2
2,Penelope,Guiness,ANGELS LIFE,2006,13
3,Penelope,Guiness,BULWORTH COMMANDMENTS,2006,10
4,Penelope,Guiness,CHEAPER CLYDE,2006,14


In [80]:
#let us create a full name column, and place it after last_name 

olddb.insert(2, 'full_name', olddb['first_name'] + ' ' + olddb['last_name'])

In [82]:
#full_name values repeat, as each row relates an actor to one film they participated in

olddb.head()

Unnamed: 0,first_name,last_name,full_name,title,release_year,category_id
0,Penelope,Guiness,Penelope Guiness,ACADEMY DINOSAUR,2006,6
1,Penelope,Guiness,Penelope Guiness,ANACONDA CONFESSIONS,2006,2
2,Penelope,Guiness,Penelope Guiness,ANGELS LIFE,2006,13
3,Penelope,Guiness,Penelope Guiness,BULWORTH COMMANDMENTS,2006,10
4,Penelope,Guiness,Penelope Guiness,CHEAPER CLYDE,2006,14


#### title 

In [83]:
#object type, as strings 

olddb.title.dtype

dtype('O')

In [85]:
#not a fan of uppercase: convert titles to title case and strip surrounding whitespace

olddb.title = olddb.title.apply(lambda X: X.title().strip())

In [86]:
olddb.head(3)

Unnamed: 0,first_name,last_name,full_name,title,release_year,category_id
0,Penelope,Guiness,Penelope Guiness,Academy Dinosaur,2006,6
1,Penelope,Guiness,Penelope Guiness,Anaconda Confessions,2006,2
2,Penelope,Guiness,Penelope Guiness,Angels Life,2006,13


#### release_year

In [87]:
#I do not need that column for the purposes stated above (i.e., relating films to actors and relating films to categories)

olddb.drop('release_year',axis=1, inplace = True )

In [88]:
olddb.head()

Unnamed: 0,first_name,last_name,full_name,title,category_id
0,Penelope,Guiness,Penelope Guiness,Academy Dinosaur,6
1,Penelope,Guiness,Penelope Guiness,Anaconda Confessions,2
2,Penelope,Guiness,Penelope Guiness,Angels Life,13
3,Penelope,Guiness,Penelope Guiness,Bulworth Commandments,10
4,Penelope,Guiness,Penelope Guiness,Cheaper Clyde,14


### Importing the name ID and title ID

Ideally, I want to relate actors to titles and titles to categories by their IDs, so I need to import the actor and title IDs. That information is in the actor.csv and title.csv files, which we have already cleaned.

In [None]:
films_clean = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/film.csv')
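
Once the clean film table is loaded, the IDs can be attached with a left merge on the title. The sketch below uses toy frames; the column name `film_id`, its values, and the unmatched title "Unknown Film" are assumptions for illustration and are not taken from film.csv.

```python
import pandas as pd

# Toy actor-title links, shaped like the cleaned olddb
links = pd.DataFrame({
    "full_name": ["Penelope Guiness", "Penelope Guiness", "Penelope Guiness"],
    "title": ["Academy Dinosaur", "Angels Life", "Unknown Film"],
})

# Toy film lookup; film_id and its values are hypothetical
films = pd.DataFrame({
    "film_id": [1, 2],
    "title": ["Academy Dinosaur", "Angels Life"],
})

# A left merge keeps every link; indicator=True adds a _merge column
# recording whether each row found a match in the lookup table
merged = links.merge(films, on="title", how="left", indicator=True)

# Titles with no ID are usually spelling or casing mismatches
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(unmatched["title"].tolist())
```

Inspecting `unmatched` before dropping the text columns is a cheap way to confirm that the title cleaning on both sides actually lines up.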

### Column names and duplicates 

In [53]:
actors.columns

Index(['actor_id', 'first_name', 'last_name', 'last_update'], dtype='object')

In [54]:
#renaming last_update to distinguish it from the same column in other tables
#rename() only touches the named column, so it does not depend on the number of columns

actors = actors.rename(columns={'last_update': 'actor_last_update'})

In [None]:
#checking for duplicates 

actors.duplicated().any() #there are no duplicates 

In [None]:
actors.head(2)

### Column types and optimization 

I will optimize the database for memory 

In [None]:
actors.info(memory_usage='deep')

In [None]:
#downcast actor_id

actors.actor_id = pd.to_numeric(actors.actor_id, downcast='integer')

In [None]:
#name columns to 'category'

for c in actors.select_dtypes(include='object'):
    
    actors[c] = actors[c].astype('category')   

In [None]:
#no need for 'nanoseconds' precision

actors['actor_last_update'] = actors['actor_last_update'].astype('datetime64[s]')
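
To see what these conversions buy, memory can be measured before and after with `memory_usage(deep=True)`. The sketch below runs on a made-up frame of 1,000 rows with only two repeated names, so the savings are illustrative and will differ on the real table.

```python
import pandas as pd

# Toy frame: 1,000 ids and two repeated names (hypothetical values)
toy = pd.DataFrame({
    "actor_id": range(1, 1001),
    "first_name": ["Penelope", "Nick"] * 500,
})

before = toy.memory_usage(deep=True).sum()

# Downcast the id to the smallest integer type that holds its values
toy["actor_id"] = pd.to_numeric(toy["actor_id"], downcast="integer")

# Store each distinct name once and keep small integer codes per row
toy["first_name"] = toy["first_name"].astype("category")

after = toy.memory_usage(deep=True).sum()

print(before, after)
print(toy.dtypes)
```

The category conversion pays off when a column has few distinct values relative to its length; on a column of mostly unique strings it can use more memory rather than less.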

### Comparison output vs. original

In [None]:
#one additional column as we have created a 'full_name' column 

print(actorsoriginal.shape)
print(actors.shape)

In [None]:
actors.info(memory_usage='deep')

In [None]:
actorsoriginal.info(memory_usage='deep') #take into account we have included a column

### Export clean table

In [None]:
actors.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/actor_clean.csv', index=False)