# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning : inventory.csv

In this file, one can review the step by step cleaning process for table inventory.csv 

### Import 

We start by importing the libraries we are going to use and loading the database

In [40]:
%pip install ipython
%pip install seaborn
%pip install mysql-connector-python
%pip install sqlalchemy
%pip install pymysql

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [41]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import pylab as plt  

import seaborn as sns 

import mysql.connector as conn

from sqlalchemy import create_engine

%matplotlib inline

In [42]:
inventory = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/inventory.csv')

In [43]:
inventory.head()

Unnamed: 0,inventory_id,film_id,store_id,last_update
0,1,1,1,2006-02-15 05:09:17
1,2,1,1,2006-02-15 05:09:17
2,3,1,1,2006-02-15 05:09:17
3,4,1,1,2006-02-15 05:09:17
4,5,1,2,2006-02-15 05:09:17


### Good practices

Some good practices before we continue with the exercise

In [44]:
#creating a back-up with the original table 

inventoryoriginal = inventory.copy()

In [45]:
#ensuring column names are clean 

inventory.columns

Index(['inventory_id', 'film_id', 'store_id', 'last_update'], dtype='object')

In [46]:
inventory.columns = [c.lower().replace(' ', '_') for c in inventory.columns]

inventory.columns

Index(['inventory_id', 'film_id', 'store_id', 'last_update'], dtype='object')

In [47]:
#checking for duplicates 

inventory.duplicated().any() #there are no duplicates 

False

### Explore 

Exploratory analysis to understand the data base (e.g,. description, column types, searching for null values) 

In [48]:
#it seems we have a repository of films per store 

inventory.head()

Unnamed: 0,inventory_id,film_id,store_id,last_update
0,1,1,1,2006-02-15 05:09:17
1,2,1,1,2006-02-15 05:09:17
2,3,1,1,2006-02-15 05:09:17
3,4,1,1,2006-02-15 05:09:17
4,5,1,2,2006-02-15 05:09:17


In [49]:
#we have 4 columns, and 1000 entries (rows) in our original database

inventoryoriginal.shape

(1000, 4)

In [50]:
#here we can see the type of each of the columns
#it seems all values are non-null

inventory.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   inventory_id  1000 non-null   int64 
 1   film_id       1000 non-null   int64 
 2   store_id      1000 non-null   int64 
 3   last_update   1000 non-null   object
dtypes: int64(3), object(1)
memory usage: 31.4+ KB


In [51]:
#description table 

inventory.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
inventory_id,1000.0,,,,500.5,288.819436,1.0,250.75,500.5,750.25,1000.0
film_id,1000.0,,,,109.866,63.862042,1.0,56.0,111.5,164.0,223.0
store_id,1000.0,,,,1.497,0.500241,1.0,1.0,1.0,2.0,2.0
last_update,1000.0,1.0,2006-02-15 05:09:17,1000.0,,,,,,,


### Null values

As stated above, there are no null values in the database. See per below:

In [52]:
#there are no null values in the database 

nan_cols = inventory.isna().sum()

nan_cols

inventory_id    0
film_id         0
store_id        0
last_update     0
dtype: int64

### Other cleaning 

#### inventory_id

In [53]:
#we got a list of int values, which seem to be IDs for the each inventory log 
#this is the most appropriate datatype 

inventory.inventory_id.dtype

dtype('int64')

In [54]:
#it seems all the IDs are unique values 

print(len(inventory.inventory_id.unique()))
print(inventory.inventory_id.min())
print(inventory.inventory_id.max())

1000
1
1000


In [55]:
#values from 1 to 1000 

#inventory.inventory_id.unique()

#### film_id

In [56]:
#data type is int - correct, although will optimize it later 
#this should be linked to table film

inventory.film_id.dtype

dtype('int64')

In [57]:
#the ids are int numbers from 1 to 223
#this is a subset of the film_ids in the film table, which varies from 1 to 1000

#inventory.film_id.unique()

#### store_id

In [58]:
#data type is int - correct, although will optimize it later 

inventory.store_id.dtype

dtype('int64')

In [59]:
#the ids are int numbers - either 1 or 2
#this is a subset of the film_ids in the film table, which varies from 1 to 1000

inventory.store_id.unique()

array([1, 2])

#### last_update

In [60]:
#this column is type 'object'. It seems tough it would be most appropriate as a 'time type'

inventory.last_update.dtype

dtype('O')

In [61]:
#all the values are the same, indicating all the names were last updated on Feb 15th 2006 05:09

inventory.last_update.value_counts()

last_update
2006-02-15 05:09:17    1000
Name: count, dtype: int64

In [62]:
#I will convert the data to datetime64

inventory.last_update = pd.to_datetime(inventory.last_update)

In [63]:
#converted 

inventory.last_update.dtype

dtype('<M8[ns]')

### Column names and duplicates 

In [64]:
inventory.columns

Index(['inventory_id', 'film_id', 'store_id', 'last_update'], dtype='object')

In [65]:
#renaming last_update to distinguish from other tables

newcolumns = ['inventory_id', 'film_id', 'store_id', 'inventory_last_update']

In [66]:
inventory.columns = newcolumns

In [67]:
#checking for duplicates 

inventory.duplicated().any() #there are no duplicates 

False

In [68]:
inventory.head(2)

Unnamed: 0,inventory_id,film_id,store_id,inventory_last_update
0,1,1,1,2006-02-15 05:09:17
1,2,1,1,2006-02-15 05:09:17


### Column types and optimization 

I will optimize the database for memory 

In [69]:
inventory.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   inventory_id           1000 non-null   int64         
 1   film_id                1000 non-null   int64         
 2   store_id               1000 non-null   int64         
 3   inventory_last_update  1000 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 31.4 KB


In [70]:
#downcast int

for c in inventory.select_dtypes('integer'):
    
    inventory[c] = pd.to_numeric(inventory[c], downcast='integer')

In [71]:
#no need for 'nanoseconds' precision

inventory.last_update = inventory.inventory_last_update.astype('datetime64[s]')

### Comparison output vs. original

In [72]:
#one additional column as we have created a 'full_name' column 

print(inventoryoriginal.shape)
print(inventory.shape)

(1000, 4)
(1000, 4)


In [73]:
inventory.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   inventory_id           1000 non-null   int16         
 1   film_id                1000 non-null   int16         
 2   store_id               1000 non-null   int8          
 3   inventory_last_update  1000 non-null   datetime64[ns]
dtypes: datetime64[ns](1), int16(2), int8(1)
memory usage: 12.8 KB


In [74]:
inventoryoriginal.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   inventory_id  1000 non-null   int64 
 1   film_id       1000 non-null   int64 
 2   store_id      1000 non-null   int64 
 3   last_update   1000 non-null   object
dtypes: int64(3), object(1)
memory usage: 97.8 KB


### Export clean table

In [75]:
inventory.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/inventory_clean.csv', index=False)

### Export to MYSQL

In [76]:
with open('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/pw.txt') as file: 
    
    password = file.read()

In [77]:
str_conn=f'mysql+pymysql://root:{password}@localhost:3306/rentalstore'

cursor = create_engine(str_conn)

In [78]:
inventory.to_sql(name='inventory',
              con=cursor,
              if_exists = 'replace',
              index=True)

1000