# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning : actors.csv

In this file, one can review the step by step cleaning process for table actors.csv 

### Import 

We start by importing the libraries we are going to use and loading the database

In [1]:
%pip install ipython
%pip install seaborn
%pip install mysql-connector-python
%pip install sqlalchemy
%pip install pymysql

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import pylab as plt  

import seaborn as sns 

import mysql.connector as conn

from sqlalchemy import create_engine

%matplotlib inline

In [3]:
actors = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/actor.csv')

In [4]:
actors.head()

Unnamed: 0,actor_id,first_name,last_name,last_update
0,1,PENELOPE,GUINESS,2006-02-15 04:34:33
1,2,NICK,WAHLBERG,2006-02-15 04:34:33
2,3,ED,CHASE,2006-02-15 04:34:33
3,4,JENNIFER,DAVIS,2006-02-15 04:34:33
4,5,JOHNNY,LOLLOBRIGIDA,2006-02-15 04:34:33


### Good practices

Some good practices before we continue with the exercise

In [5]:
#creating a back-up with the original table 

actorsoriginal = actors.copy()

In [6]:
#ensuring column names are clean 

actors.columns

Index(['actor_id', 'first_name', 'last_name', 'last_update'], dtype='object')

In [7]:
actors.columns = [c.lower().replace(' ', '_') for c in actors.columns]

actors.columns

Index(['actor_id', 'first_name', 'last_name', 'last_update'], dtype='object')

In [8]:
#checking for duplicates 

actors.duplicated().any() #there are no duplicates 

False

### Explore 

Exploratory analysis to understand the data base (e.g,. description, column types, searching for null values) 

In [9]:
#it seems we have a repository of actors with their respective IDs and the data in each it was updated. 

actors.head()

Unnamed: 0,actor_id,first_name,last_name,last_update
0,1,PENELOPE,GUINESS,2006-02-15 04:34:33
1,2,NICK,WAHLBERG,2006-02-15 04:34:33
2,3,ED,CHASE,2006-02-15 04:34:33
3,4,JENNIFER,DAVIS,2006-02-15 04:34:33
4,5,JOHNNY,LOLLOBRIGIDA,2006-02-15 04:34:33


In [10]:
#we have 4 columns, and 200 entries (rows) in our original database

actorsoriginal.shape

(200, 4)

In [11]:
#here we can see the type of each of the columns - int type for column ID, and object type for the name columns (strings)
#it seems all values are non-null

actors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   actor_id     200 non-null    int64 
 1   first_name   200 non-null    object
 2   last_name    200 non-null    object
 3   last_update  200 non-null    object
dtypes: int64(1), object(3)
memory usage: 6.4+ KB


In [12]:
#description table 
#here we can see the #of unique values, and the mode of each field. Ultimately we will be interested on the unique 'full names', so worth checking if there are non-unique values there

actors.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
actor_id,200.0,,,,100.5,57.879185,1.0,50.75,100.5,150.25,200.0
first_name,200.0,128.0,PENELOPE,4.0,,,,,,,
last_name,200.0,121.0,KILMER,5.0,,,,,,,
last_update,200.0,1.0,2006-02-15 04:34:33,200.0,,,,,,,


### Null values

As stated above, there are no null values in the database. See per below:

In [13]:
#there are no null values in the database 

nan_cols = actors.isna().sum()

nan_cols

actor_id       0
first_name     0
last_name      0
last_update    0
dtype: int64

### Other cleaning 

#### actor_id

In [14]:
#we got a list of int values, which seem to be IDs for the each actor
#this is the most appropriate datatype 

actors.actor_id.dtype

dtype('int64')

In [15]:
#it seems all the IDs are unique values 

len(actors.actor_id.unique())

200

In [53]:
#actors.actor_id.unique()

#### last_update

In [17]:
#this column is type 'object'. It seems tough it would be most appropriate as a 'time type'

actors.last_update.dtype

dtype('O')

In [18]:
#all the values are the same, indicating all the names were last updated on Feb 15th 2006

actors.last_update.value_counts()

last_update
2006-02-15 04:34:33    200
Name: count, dtype: int64

In [19]:
#I will convert the data to datetime64

actors.last_update = pd.to_datetime(actors.last_update)

In [20]:
#converted 

actors.last_update.dtype

dtype('<M8[ns]')

#### first_name, last_name, and new column full_name

I will clean the columns first_name, last_name together, and make sure there are no repeated actors (by their full name)

In [21]:
#this column is type 'object'. They cointain a list of strings 

print(actors.first_name.dtype)
print(actors.last_name.dtype)

object
object


In [22]:
#these are the top first_names 
#some repeated values, but let us wait until we see full names

actors.first_name.value_counts().head(3)

first_name
PENELOPE    4
JULIA       4
KENNETH     4
Name: count, dtype: int64

In [51]:
#actors.first_name.unique()

In [24]:
#these are the top last_names 
#some repeated values, but let us wait until we see full names

actors.last_name.value_counts().head(3)

last_name
KILMER    5
TEMPLE    4
NOLTE     4
Name: count, dtype: int64

In [25]:
#actors.last_name.unique()

In [26]:
#I personally don't like uppercase 

In [27]:
actors.first_name = actors.first_name.apply(lambda X: X.title().replace(' ',''))

In [28]:
actors.last_name = actors.last_name.apply(lambda X: X.title().replace(' ',''))

In [29]:
actors.head()

Unnamed: 0,actor_id,first_name,last_name,last_update
0,1,Penelope,Guiness,2006-02-15 04:34:33
1,2,Nick,Wahlberg,2006-02-15 04:34:33
2,3,Ed,Chase,2006-02-15 04:34:33
3,4,Jennifer,Davis,2006-02-15 04:34:33
4,5,Johnny,Lollobrigida,2006-02-15 04:34:33


In [30]:
#let us create a full name column, and place it after last_name 

actors.insert(3, 'full_name', actors['first_name'] + ' ' + actors['last_name'])

In [31]:
#now let us see if we have repeated actors 

actors.full_name.value_counts()

full_name
Susan Davis             2
Ewan Gooding            1
Daryl Crawford          1
Greta Keitel            1
Jane Jackman            1
                       ..
Michelle Mcconaughey    1
Adam Grant              1
Sean Williams           1
Gary Penn               1
Thora Temple            1
Name: count, Length: 199, dtype: int64

In [32]:
#actors.full_name.unique()

In [33]:
#it seems we do have a repeated value. However, there are indeed more than one actress named Susan Davis. 
#for now I will keep both values, but keep that info in mind as we establish links between the tables

'''From chat GPT 
Susan Davis (born 1943): Known for her roles in films such as "Three Women" (1977) and "Love and Death" (1975).

Susan Davis (born 1944): Known for her role as Betty Munson in the TV series "The Mary Tyler Moore Show" (1970-1977) and its spin-off "Lou Grant" (1977-1982).'''

'From chat GPT \nSusan Davis (born 1943): Known for her roles in films such as "Three Women" (1977) and "Love and Death" (1975).\n\nSusan Davis (born 1944): Known for her role as Betty Munson in the TV series "The Mary Tyler Moore Show" (1970-1977) and its spin-off "Lou Grant" (1977-1982).'

### Column names and duplicates 

In [34]:
actors.columns

Index(['actor_id', 'first_name', 'last_name', 'full_name', 'last_update'], dtype='object')

In [35]:
#renaming last_update to distinguish from other tables

newcolumns = ['actor_id', 'first_name', 'last_name', 'full_name', 'actor_last_update']

In [36]:
actors.columns = newcolumns

In [37]:
#checking for duplicates 

actors.duplicated().any() #there are no duplicates 

False

In [38]:
actors.head(2)

Unnamed: 0,actor_id,first_name,last_name,full_name,actor_last_update
0,1,Penelope,Guiness,Penelope Guiness,2006-02-15 04:34:33
1,2,Nick,Wahlberg,Nick Wahlberg,2006-02-15 04:34:33


### Column types and optimization 

I will optimize the database for memory 

In [39]:
actors.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   actor_id           200 non-null    int64         
 1   first_name         200 non-null    object        
 2   last_name          200 non-null    object        
 3   full_name          200 non-null    object        
 4   actor_last_update  200 non-null    datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 41.3 KB


In [40]:
#downcast actor_id

actors.actor_id = pd.to_numeric(actors.actor_id, downcast='integer')

In [41]:
#name columns to 'category'

for c in actors.select_dtypes(include='object'):
    
    actors[c] = actors[c].astype('category')   

In [42]:
#no need for 'nanoseconds' precision

actors.last_update = actors.actor_last_update.astype('datetime64[s]')

### Comparison output vs. original

In [43]:
#one additional column as we have created a 'full_name' column 

print(actorsoriginal.shape)
print(actors.shape)

(200, 4)
(200, 5)


In [44]:
actors.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   actor_id           200 non-null    int16         
 1   first_name         200 non-null    category      
 2   last_name          200 non-null    category      
 3   full_name          200 non-null    category      
 4   actor_last_update  200 non-null    datetime64[ns]
dtypes: category(3), datetime64[ns](1), int16(1)
memory usage: 48.1 KB


In [45]:
actorsoriginal.info(memory_usage='deep') #take into account we have included a column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   actor_id     200 non-null    int64 
 1   first_name   200 non-null    object
 2   last_name    200 non-null    object
 3   last_update  200 non-null    object
dtypes: int64(1), object(3)
memory usage: 41.0 KB


### Export clean table

In [46]:
actors.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/actor_clean.csv', index=False)

### Export to MYSQL

In [47]:
with open('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/pw.txt') as file: 
    
    password = file.read()

In [48]:
str_conn=f'mysql+pymysql://root:{password}@localhost:3306/rentalstore'

cursor = create_engine(str_conn)

In [49]:
actors.to_sql(name='actors',
              con=cursor,
              if_exists = 'replace',
              index=True)

OperationalError: (pymysql.err.OperationalError) (3730, "Cannot drop table 'actors' referenced by a foreign key constraint 'fk_actors' on table 'actorsfilms'.")
[SQL: 
DROP TABLE actors]
(Background on this error at: https://sqlalche.me/e/20/e3q8)