# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning: language.csv

This notebook walks through the step-by-step cleaning process for the table language.csv.

### Import 

We start by importing the libraries we are going to use and loading the data.

In [1]:
%pip install ipython
%pip install seaborn
%pip install mysql-connector-python
%pip install sqlalchemy
%pip install pymysql

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt

import seaborn as sns 

import mysql.connector as conn

from sqlalchemy import create_engine

%matplotlib inline

In [3]:
languages = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/language.csv')

In [4]:
languages.head()

Unnamed: 0,language_id,name,last_update
0,1,English,2006-02-15 05:02:19
1,2,Italian,2006-02-15 05:02:19
2,3,Japanese,2006-02-15 05:02:19
3,4,Mandarin,2006-02-15 05:02:19
4,5,French,2006-02-15 05:02:19


### Good practices

A few good practices before we continue with the cleaning.

In [5]:
#creating a back-up with the original table 

languagesoriginal = languages.copy()
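
As a minimal sketch of why this back-up is useful: `DataFrame.copy()` makes a deep copy by default, so later edits to the working table do not touch it. The small frame and the name `languages_backup` below are illustrative only, not the notebook's actual objects.

```python
import pandas as pd

# Hypothetical stand-in for the loaded table (rows taken from the head() preview).
languages = pd.DataFrame({
    "language_id": [1, 2, 3],
    "name": ["English", "Italian", "Japanese"],
})

# copy() is deep by default, so the back-up owns its own data.
languages_backup = languages.copy()

# Modify the working table only.
languages.loc[0, "name"] = "english"

# The back-up still holds the original value.
print(languages_backup.loc[0, "name"])
```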

In [6]:
#ensuring column names are clean 

languages.columns

Index(['language_id', 'name', 'last_update'], dtype='object')

In [7]:
languages.columns = [c.lower().replace(' ', '_') for c in languages.columns]

languages.columns
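
The same column-name clean-up can be wrapped in a small helper so it is reusable across tables. This is a sketch under assumptions: `clean_column_names` is a hypothetical name, the extra `strip()` is my addition, and the messy headers are invented for illustration (language.csv already has clean names).

```python
import pandas as pd

def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with stripped, lower-case, underscore-separated column names."""
    cleaned = df.copy()
    cleaned.columns = [c.strip().lower().replace(" ", "_") for c in cleaned.columns]
    return cleaned

# Hypothetical messy headers, not taken from language.csv.
messy = pd.DataFrame(columns=["Language ID", " Name", "Last Update"])
tidy = clean_column_names(messy)
print(list(tidy.columns))
```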

In [None]:
#checking for duplicates 

languages.duplicated().any() #there are no duplicates 
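
Note that `duplicated()` compares whole rows. For an ID table it may also be worth checking the key column on its own. The sketch below uses an invented frame where two rows share a `language_id` but differ in `name`.

```python
import pandas as pd

# Hypothetical example: two rows share a language_id but differ in name.
sample = pd.DataFrame({
    "language_id": [1, 2, 2],
    "name": ["English", "Italian", "Italiano"],
})

# Compares every column, so the two "2" rows are not flagged.
full_row_dupes = sample.duplicated().any()

# Compares the key column only, so the repeated id is flagged.
key_dupes = sample.duplicated(subset=["language_id"]).any()

print(full_row_dupes, key_dupes)
```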

### Explore 

Exploratory analysis to understand the data (e.g., description, column types, searching for null values).

In [None]:
#it seems we have a list of languages with their respective IDs and the date on which each was last updated.

languages.head()

In [None]:
#we have 3 columns, and 6 entries (rows) in our original database

languagesoriginal.shape

In [None]:
#here we can see the type of each of the columns - int type for language ID, and object type for name and last_update columns
#it seems all values are non-null

languages.info()

In [None]:
#description table 
#here we can see the number of unique values and the most frequent value of each field.

languages.describe(include='all').T

### Null values

As stated above, there are no null values in the table, as shown below:

In [None]:
#there are no null values in the database 

nan_cols = languages.isna().sum()

nan_cols
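
One caveat: `isna()` only counts true missing values, so empty or whitespace-only strings pass unnoticed. A minimal sketch of an additional check, using invented values rather than the contents of language.csv:

```python
import pandas as pd

# Hypothetical column with one real missing value and two blank strings.
sample = pd.DataFrame({"name": ["English", "", "  ", None]})

# Only the None entry is counted as missing.
missing = sample["name"].isna()

# Blank or whitespace-only strings are detected separately.
blank = sample["name"].str.strip().eq("")

missing_or_blank = missing | blank
print(missing.sum(), missing_or_blank.sum())
```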

### Other cleaning 

#### language_id

In [None]:
#we got a list of int values, which seem to be the IDs for each language
#this is the most appropriate datatype (although we will optimize later)

languages.language_id.dtype

In [None]:
#it seems all the IDs are unique values 

len(languages.language_id.unique())

In [None]:
languages.language_id.unique()
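
Instead of comparing the count of unique values by eye, uniqueness can be tested directly. A small sketch, using only the five IDs visible in the `head()` preview:

```python
import pandas as pd

# IDs shown in the head() preview above.
ids = pd.Series([1, 2, 3, 4, 5], name="language_id")

# True when no value repeats.
print(ids.is_unique)

# Equivalent check: number of distinct values equals number of rows.
print(ids.nunique() == len(ids))
```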

#### last_update

In [None]:
#this column is type 'object', though a datetime type would be more appropriate

languages.last_update.dtype

In [None]:
#all the values are the same, indicating all the names were last updated on Feb 15th 2006 at 5:02

languages.last_update.value_counts()

In [None]:
#I will convert the data to datetime64

languages.last_update = pd.to_datetime(languages.last_update)

In [None]:
#converted 

languages.last_update.dtype
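
For reference, a slightly more defensive version of the conversion passes an explicit format and coerces unparseable values to `NaT`. This is a sketch: the `"not a date"` entry is invented to show the behaviour, and the format string is my reading of the timestamps in the preview.

```python
import pandas as pd

# Two timestamps as seen in the preview, plus one invented bad value.
raw = pd.Series(["2006-02-15 05:02:19", "2006-02-15 05:02:19", "not a date"])

# An explicit format avoids per-value guessing; errors="coerce" turns bad values into NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```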

#### name

In [None]:
#this column is type 'object'. It contains strings

languages.name.dtype

In [None]:
#all unique values 

languages.name.unique()

In [None]:
#title-casing the names and making sure there are no spaces

languages.name = languages.name.str.title().str.replace(' ', '', regex=False)

In [None]:
languages.head()
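
Removing every space is safe for single-word names like the ones in this table, but it would also glue multi-word names together. A hedged alternative is to trim the ends and collapse inner runs of whitespace instead. The values below are invented for illustration.

```python
import pandas as pd

# Hypothetical values, not taken from language.csv.
names = pd.Series([" english ", "old  english"])

# Approach used in this notebook: title-case, then drop every space.
removed_all = names.str.title().str.replace(" ", "", regex=False)

# Alternative: trim the ends, collapse inner whitespace, then title-case.
trimmed = names.str.strip().str.replace(r"\s+", " ", regex=True).str.title()

print(removed_all.tolist())
print(trimmed.tolist())
```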

### Column names and duplicates 

In [None]:
languages.columns

In [None]:
#renaming last_update to distinguish from other tables

newcolumns = ['language_id', 'language_name', 'language_last_update']

In [None]:
languages.columns = newcolumns
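
Assigning a list to `columns` relies on the column order. An equivalent sketch with `rename` and a mapping renames by label instead, so it keeps working if the order changes. The empty frame here is only a stand-in for the real table.

```python
import pandas as pd

# Stand-in frame with the same column names as the table before renaming.
languages = pd.DataFrame(columns=["language_id", "name", "last_update"])

# A mapping renames by label, so it does not depend on column order.
languages = languages.rename(columns={
    "name": "language_name",
    "last_update": "language_last_update",
})

print(list(languages.columns))
```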

In [None]:
#checking for duplicates 

languages.duplicated().any() #there are no duplicates 

In [None]:
languages.head(2)

### Column types and optimization 

I will optimize the table's memory usage.

In [None]:
languages.info(memory_usage='deep')

In [None]:
#downcast language_id

languages.language_id = pd.to_numeric(languages.language_id, downcast='integer')

In [None]:
#name column to 'category'

languages.language_name = languages.language_name.astype('category')   

In [None]:
#no need for 'nanoseconds' precision

languages.language_last_update = languages.language_last_update.astype('datetime64[s]')
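
To see the effect of each optimization in isolation, memory can be measured before and after on a small frame. This is a sketch using the five rows from the preview. Whether `category` helps depends on how often values repeat, so the last comparison is printed rather than assumed.

```python
import pandas as pd

# The five rows visible in the head() preview.
df = pd.DataFrame({
    "language_id": [1, 2, 3, 4, 5],
    "language_name": ["English", "Italian", "Japanese", "Mandarin", "French"],
})
before = df.memory_usage(deep=True).sum()

# Downcast picks the smallest integer type that can hold the values.
df["language_id"] = pd.to_numeric(df["language_id"], downcast="integer")
after = df.memory_usage(deep=True).sum()

print(df["language_id"].dtype)
print(before, after)

# Compare the name column stored as-is and as a category before deciding.
name_as_is = df["language_name"].memory_usage(deep=True)
name_as_category = df["language_name"].astype("category").memory_usage(deep=True)
print(name_as_is, name_as_category)
```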

### Comparison output vs. original

In [None]:
#no values excluded

print(languagesoriginal.shape)
print(languages.shape)

In [None]:
#744 bytes (clean table) vs. 1016 bytes (original table)

languages.info(memory_usage='deep')

In [None]:
languagesoriginal.info(memory_usage='deep') 

### Export clean table

In [None]:
languages.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/language_clean.csv', index=False)

### Export to MySQL

In [8]:
with open('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/pw.txt') as file:
    #strip the trailing newline, if any, so it does not end up in the connection string
    password = file.read().strip()

In [9]:
str_conn=f'mysql+pymysql://root:{password}@localhost:3306/rentalstore'

cursor = create_engine(str_conn)
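
If the password contains characters that have a meaning inside a URL (such as `@`, `:` or `/`), the connection string above can be misparsed. A minimal sketch of one way around this is to percent-encode the password with the standard library first. The password below is invented for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical password, for illustration only.
password = "p@ss:word/1"

# Percent-encode characters that have a meaning inside a URL.
safe_password = quote_plus(password)

str_conn = f"mysql+pymysql://root:{safe_password}@localhost:3306/rentalstore"
print(str_conn)
```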

In [10]:
languages.to_sql(name='language',
              con=cursor,
              if_exists = 'replace',
              index=True)

6
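
The value returned by `to_sql` is the number of rows written, which here is consistent with the six rows of the table. As a quick sanity check, the row count can be read back after the export. The sketch below uses an in-memory SQLite database as a stand-in, since it needs no server. The notebook itself writes to MySQL, and the two-row frame is illustrative.

```python
import sqlite3
import pandas as pd

# Illustrative two-row table, not the full language table.
df = pd.DataFrame({
    "language_id": [1, 2],
    "language_name": ["English", "Italian"],
})

# In-memory SQLite is a stand-in here; the notebook itself writes to MySQL.
con = sqlite3.connect(":memory:")
df.to_sql(name="language", con=con, if_exists="replace", index=False)

# Read the row count back to confirm the export.
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM language", con)["n"].iloc[0]
print(row_count)

con.close()
```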