# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning : category.csv

In this file, one can review the step by step cleaning process for table category.csv 

### Import 

We start by importing the libraries we are going to use and loading the database

In [150]:
%pip install ipython
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [151]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import pylab as plt  

import seaborn as sns 

%matplotlib inline

In [152]:
categories = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/category.csv')

In [153]:
categories.head()

Unnamed: 0,category_id,name,last_update
0,1,Action,2006-02-15 04:46:27
1,2,Animation,2006-02-15 04:46:27
2,3,Children,2006-02-15 04:46:27
3,4,Classics,2006-02-15 04:46:27
4,5,Comedy,2006-02-15 04:46:27


### Good practices

Some good practices before we continue with the exercise

In [154]:
#creating a back-up with the original table 

categoriesoriginal = categories.copy()

In [155]:
#ensuring column names are clean 

categories.columns

Index(['category_id', 'name', 'last_update'], dtype='object')

In [156]:
categories.columns = [c.lower().replace(' ', '_') for c in categories.columns]

categories.columns

Index(['category_id', 'name', 'last_update'], dtype='object')

In [157]:
#checking for duplicates 

categories.duplicated().any() #there are no duplicates 

False

### Explore 

Exploratory analysis to understand the data base (e.g,. description, column types, searching for null values) 

In [158]:
#it seems we have a repository of movie categories with their respective IDs and the data in each it was updated. 

categories.head()

Unnamed: 0,category_id,name,last_update
0,1,Action,2006-02-15 04:46:27
1,2,Animation,2006-02-15 04:46:27
2,3,Children,2006-02-15 04:46:27
3,4,Classics,2006-02-15 04:46:27
4,5,Comedy,2006-02-15 04:46:27


In [159]:
#we have 3 columns, and 16 entries (rows) in our original database

categoriesoriginal.shape

(16, 3)

In [160]:
#here we can see the type of each of the columns
#it seems all values are non-null

categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   category_id  16 non-null     int64 
 1   name         16 non-null     object
 2   last_update  16 non-null     object
dtypes: int64(1), object(2)
memory usage: 512.0+ bytes


In [161]:
#description table
#here we can see the #of unique values, and the mode of each field. 
#it seems all categories names are unique 

categories.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
category_id,16.0,,,,8.5,4.760952,1.0,4.75,8.5,12.25,16.0
name,16.0,16.0,Action,1.0,,,,,,,
last_update,16.0,1.0,2006-02-15 04:46:27,16.0,,,,,,,


### Null values

As stated above, there are no null values in the database. See per below:

In [162]:
#there are no null values in the database 

nan_cols = categories.isna().sum()

nan_cols

category_id    0
name           0
last_update    0
dtype: int64

### Other cleaning 

#### category_id

In [163]:
#we got a list of int values, which seem to be IDs for the each category
#this is the most appropriate datatype (although we will optimize it later)

categories.category_id.dtype

dtype('int64')

In [164]:
#it seems all the IDs are unique values 

len(categories.category_id.unique())

16

In [165]:
categories.category_id.unique()

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16])

#### last_update

In [166]:
#this column is type 'object'. It seems tough it would be most appropriate as a 'time type'

categories.last_update.dtype

dtype('O')

In [167]:
#all the values are the same, indicating all the names were last updated on Feb 15th 2006

categories.last_update.value_counts()

last_update
2006-02-15 04:46:27    16
Name: count, dtype: int64

In [168]:
#I will convert the data to datetime64

categories.last_update = pd.to_datetime(categories.last_update)

In [169]:
#converted 

categories.last_update.dtype

dtype('<M8[ns]')

#### name

In [170]:
#this column is type 'object'. It seems tough it would be most appropriate as a string 

categories.name.dtype

dtype('O')

In [171]:
#these are the categories
#all values are unique 

categories.name.value_counts()

name
Action         1
Animation      1
Children       1
Classics       1
Comedy         1
Documentary    1
Drama          1
Family         1
Foreign        1
Games          1
Horror         1
Music          1
New            1
Sci-Fi         1
Sports         1
Travel         1
Name: count, dtype: int64

In [172]:
categories.name.unique()

array(['Action', 'Animation', 'Children', 'Classics', 'Comedy',
       'Documentary', 'Drama', 'Family', 'Foreign', 'Games', 'Horror',
       'Music', 'New', 'Sci-Fi', 'Sports', 'Travel'], dtype=object)

In [173]:
#in case there are spaces 

categories.name = categories.name.apply(lambda X: X.title().replace(' ',''))

In [174]:
categories.head()

Unnamed: 0,category_id,name,last_update
0,1,Action,2006-02-15 04:46:27
1,2,Animation,2006-02-15 04:46:27
2,3,Children,2006-02-15 04:46:27
3,4,Classics,2006-02-15 04:46:27
4,5,Comedy,2006-02-15 04:46:27


### Column names and duplicates

In [175]:
categories.columns

Index(['category_id', 'name', 'last_update'], dtype='object')

In [176]:
#renaming last_update to distinguish from other tables

newcolumns = ['category_id', 'category_name', 'category_last_update']

In [177]:
categories.columns = newcolumns

In [178]:
#checking for duplicates 

categories.duplicated().any() #there are no duplicates 

False

In [179]:
categories.head(2)

Unnamed: 0,category_id,category_name,category_last_update
0,1,Action,2006-02-15 04:46:27
1,2,Animation,2006-02-15 04:46:27


### Column types and optimization 

I will optimize the database for memory 

In [180]:
categories.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   category_id           16 non-null     int64         
 1   category_name         16 non-null     object        
 2   category_last_update  16 non-null     datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 1.4 KB


In [181]:
#downcast category_id

categories.category_id = pd.to_numeric(categories.category_id, downcast='integer')

In [182]:
#name column to 'category'

for c in categories.select_dtypes(include='object'):
    
    categories[c] = categories[c].astype('category')   

In [183]:
#no need for 'nanoseconds' precision

categories.last_update = categories.category_last_update.astype('datetime64[s]')

### Comparison output

In [184]:
#no values eliminated 

print(categoriesoriginal.shape)
print(categories.shape)

(16, 3)
(16, 3)


In [185]:
#1.8KB vs 2.4KB

categories.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   category_id           16 non-null     int8          
 1   category_name         16 non-null     category      
 2   category_last_update  16 non-null     datetime64[ns]
dtypes: category(1), datetime64[ns](1), int8(1)
memory usage: 1.8 KB


In [186]:
categoriesoriginal.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   category_id  16 non-null     int64 
 1   name         16 non-null     object
 2   last_update  16 non-null     object
dtypes: int64(1), object(2)
memory usage: 2.4 KB


### Export clean table

In [187]:
categories.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/category_clean.csv', index=False)