# Project: Building MySQL Database for VHS Rental Store | Cristiane Carneiro

## Data Cleaning: language.csv

This notebook walks through the step-by-step cleaning process for the table language.csv.

### Import 

We start by importing the libraries we are going to use and loading the dataset.

In [80]:
%pip install ipython
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [81]:
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt

import seaborn as sns 

%matplotlib inline

In [82]:
languages = pd.read_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/raw/language.csv')

In [83]:
languages.head()

|   | language_id | name | last_update |
|---|---|---|---|
| 0 | 1 | English | 2006-02-15 05:02:19 |
| 1 | 2 | Italian | 2006-02-15 05:02:19 |
| 2 | 3 | Japanese | 2006-02-15 05:02:19 |
| 3 | 4 | Mandarin | 2006-02-15 05:02:19 |
| 4 | 5 | French | 2006-02-15 05:02:19 |


### Good practices

Some good practices to apply before we continue with the exercise.

In [84]:
#creating a back-up with the original table 

languagesoriginal = languages.copy()

In [85]:
#ensuring column names are clean 

languages.columns

Index(['language_id', 'name', 'last_update'], dtype='object')

In [86]:
languages.columns = [c.lower().replace(' ', '_') for c in languages.columns]

languages.columns

Index(['language_id', 'name', 'last_update'], dtype='object')

In [87]:
#checking for duplicates 

languages.duplicated().any() #there are no duplicates 

False
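
`duplicated()` with no arguments only flags rows that match on every column. As a minimal sketch (the sample frame below is hypothetical and not taken from language.csv), a second check restricted to the `name` column would also catch the same language stored under two different IDs:

```python
import pandas as pd

# Hypothetical sample with the same columns as language.csv
sample = pd.DataFrame({
    "language_id": [1, 2, 3],
    "name": ["English", "Italian", "English"],
    "last_update": ["2006-02-15 05:02:19"] * 3,
})

# Full-row duplicates: every column has to match
full_row_dupes = sample.duplicated().any()

# Name-level duplicates: same language name under different IDs
name_dupes = sample.duplicated(subset=["name"], keep=False)

print(full_row_dupes)
print(sample[name_dupes])
```

With `keep=False`, every row involved in a repeat is flagged, which makes it easier to inspect all the conflicting records side by side.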

### Explore 

Exploratory analysis to understand the database (e.g., description, column types, searching for null values).

In [88]:
#it seems we have a repository of languages with their respective IDs and the date on which each was last updated.

languages.head()

|   | language_id | name | last_update |
|---|---|---|---|
| 0 | 1 | English | 2006-02-15 05:02:19 |
| 1 | 2 | Italian | 2006-02-15 05:02:19 |
| 2 | 3 | Japanese | 2006-02-15 05:02:19 |
| 3 | 4 | Mandarin | 2006-02-15 05:02:19 |
| 4 | 5 | French | 2006-02-15 05:02:19 |


In [89]:
#we have 3 columns, and 6 entries (rows) in our original database

languagesoriginal.shape

(6, 3)

In [90]:
#here we can see the type of each of the columns - int type for language ID, and object type for name and last_update columns
#it seems all values are non-null

languages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   language_id  6 non-null      int64 
 1   name         6 non-null      object
 2   last_update  6 non-null      object
dtypes: int64(1), object(2)
memory usage: 272.0+ bytes


In [91]:
#description table 
#here we can see the number of unique values and the mode of each field.

languages.describe(include='all').T

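|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| language_id | 6.0 |  |  |  | 3.5 | 1.870829 | 1.0 | 2.25 | 3.5 | 4.75 | 6.0 |
| name | 6.0 | 6.0 | English | 1.0 |  |  |  |  |  |  |  |
| last_update | 6.0 | 1.0 | 2006-02-15 05:02:19 | 6.0 |  |  |  |  |  |  |  |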


### Null values

As stated above, there are no null values in the database, as the check below confirms:

In [92]:
#there are no null values in the database 

nan_cols = languages.isna().sum()

nan_cols

language_id    0
name           0
last_update    0
dtype: int64

### Other cleaning 

#### language_id

In [93]:
#we have a list of int values, which seem to be the IDs for each language
#this is the most appropriate datatype (although we will optimize later)

languages.language_id.dtype

dtype('int64')

In [94]:
#it seems all the IDs are unique values 

len(languages.language_id.unique())

6

In [95]:
languages.language_id.unique()

array([1, 2, 3, 4, 5, 6])
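
Since `language_id` will presumably serve as the primary key once the table is loaded into MySQL, the uniqueness checks above can be folded into one reusable test. The following is only a sketch: `is_valid_primary_key` is a hypothetical helper, and the "positive values only" rule is an assumption about how the IDs are generated, not something stated in the data.

```python
import pandas as pd

def is_valid_primary_key(ids: pd.Series) -> bool:
    """Return True when ids has no nulls, no repeats, and only positive values."""
    # The positivity rule is an assumption (auto-increment style IDs)
    return bool(ids.notna().all() and ids.is_unique and (ids > 0).all())

good_ids = pd.Series([1, 2, 3, 4, 5, 6], name="language_id")
bad_ids = pd.Series([1, 2, 2, 4], name="language_id")  # hypothetical repeat

print(is_valid_primary_key(good_ids))
print(is_valid_primary_key(bad_ids))
```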

#### last_update

In [96]:
#this column is type 'object', though a datetime type would be more appropriate

languages.last_update.dtype

dtype('O')

In [97]:
#all the values are the same, indicating all the names were last updated on Feb 15th 2006 at 5:02

languages.last_update.value_counts()

last_update
2006-02-15 05:02:19    6
Name: count, dtype: int64

In [98]:
#I will convert the data to datetime64

languages.last_update = pd.to_datetime(languages.last_update)

In [99]:
#converted 

languages.last_update.dtype

dtype('<M8[ns]')
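
The conversion above lets pandas infer the timestamp layout. A slightly stricter variant, sketched here on hypothetical input, passes the format explicitly and uses `errors="coerce"` so that malformed entries become `NaT` and can be counted instead of raising an exception:

```python
import pandas as pd

# Hypothetical raw values: one valid timestamp and one malformed entry
raw = pd.Series(["2006-02-15 05:02:19", "not a date"])

# An explicit format avoids guessing; errors="coerce" maps bad values to NaT
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Number of values that could not be parsed
n_unparsed = int(parsed.isna().sum())

print(parsed)
print(n_unparsed)
```

In this table all six values share the same well-formed timestamp, so both approaches would give the same result; the explicit format mainly helps on larger or messier columns.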

#### name

In [100]:
#this column is type 'object'. It contains a list of strings

languages.name.dtype

dtype('O')

In [101]:
#all unique values 

languages.name.unique()

array(['English', 'Italian', 'Japanese', 'Mandarin', 'French', 'German'],
      dtype=object)

In [102]:
#normalizing names to title case and making sure there are no spaces

languages.name = languages.name.str.title().str.replace(' ', '', regex=False)

In [103]:
languages.head()

|   | language_id | name | last_update |
|---|---|---|---|
| 0 | 1 | English | 2006-02-15 05:02:19 |
| 1 | 2 | Italian | 2006-02-15 05:02:19 |
| 2 | 3 | Japanese | 2006-02-15 05:02:19 |
| 3 | 4 | Mandarin | 2006-02-15 05:02:19 |
| 4 | 5 | French | 2006-02-15 05:02:19 |
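
Removing every space works for the six single-word names in this table, but it would also merge the words of a multi-word name. As a hedged sketch (the values below, including "old norse", are hypothetical and not part of language.csv), trimming only the leading and trailing whitespace keeps internal spaces intact:

```python
import pandas as pd

# Hypothetical names, including padding and a multi-word entry
names = pd.Series(["  english ", "Mandarin", "old norse"])

# Approach used in the notebook: title case, then drop every space
removed_all = names.str.title().str.replace(" ", "", regex=False)

# Alternative: strip only the surrounding whitespace, then title case
trimmed = names.str.strip().str.title()

print(removed_all.tolist())
print(trimmed.tolist())
```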


### Column types and optimization 

I will optimize the column types to reduce memory usage.

In [104]:
languages.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   language_id  6 non-null      int64         
 1   name         6 non-null      object        
 2   last_update  6 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 608.0 bytes


In [105]:
#downcast language_id

languages.language_id = pd.to_numeric(languages.language_id, downcast='integer')

In [106]:
#name columns to 'category'

languages.name = languages.name.astype('category')

In [107]:
#no need for 'nanoseconds' precision

languages.last_update = languages.last_update.astype('datetime64[s]')

### Comparison output vs. original

In [108]:
#no values excluded

print(languagesoriginal.shape)
print(languages.shape)

(6, 3)
(6, 3)


In [109]:
#744 bytes after optimization vs. 1016 bytes in the original

languages.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype        
---  ------       --------------  -----        
 0   language_id  6 non-null      int8         
 1   name         6 non-null      category     
 2   last_update  6 non-null      datetime64[s]
dtypes: category(1), datetime64[s](1), int8(1)
memory usage: 744.0 bytes


In [110]:
languagesoriginal.info(memory_usage='deep') 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   language_id  6 non-null      int64 
 1   name         6 non-null      object
 2   last_update  6 non-null      object
dtypes: int64(1), object(2)
memory usage: 1016.0 bytes
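
The two `info()` outputs give totals only. To see which column contributes what, a per-column breakdown with `memory_usage(deep=True)` can be sketched as below. The frame is rebuilt inline so the example is self-contained, and the names `before`, `after` and `comparison` are illustrative:

```python
import pandas as pd

# Rebuild a small frame with the same id and name values as language.csv
before = pd.DataFrame({
    "language_id": [1, 2, 3, 4, 5, 6],
    "name": ["English", "Italian", "Japanese", "Mandarin", "French", "German"],
})

# Apply the same two conversions used in the notebook
after = before.copy()
after["language_id"] = pd.to_numeric(after["language_id"], downcast="integer")
after["name"] = after["name"].astype("category")

# Per-column bytes, excluding the index
comparison = pd.DataFrame({
    "before_bytes": before.memory_usage(deep=True, index=False),
    "after_bytes": after.memory_usage(deep=True, index=False),
})
comparison["saved_bytes"] = comparison["before_bytes"] - comparison["after_bytes"]

print(comparison)
```

On a table this small, the `category` dtype carries fixed overhead for its category index, so its benefit tends to show mainly on columns with many repeated values.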


### Export clean table

In [111]:
languages.to_csv('/Users/criscarneiro/desktop/ironhack/6_Projects/sql-data-base-building/data/clean/language_clean.csv', index=False)
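
One caveat worth keeping in mind: CSV stores plain text, so the optimized dtypes are not preserved in the exported file. The sketch below uses an in-memory buffer instead of the real path, and a hypothetical two-row frame, to show how the types could be restored when the clean file is read back:

```python
import io
import pandas as pd

# Hypothetical two-row frame with the optimized dtypes
clean = pd.DataFrame({
    "language_id": pd.Series([1, 2], dtype="int8"),
    "name": pd.Series(["English", "Italian"], dtype="category"),
    "last_update": pd.to_datetime(["2006-02-15 05:02:19"] * 2),
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# A plain read falls back to default type inference
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing dtype and parse_dates restores the intended types
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={"language_id": "int8", "name": "category"},
    parse_dates=["last_update"],
)

print(naive.dtypes)
print(restored.dtypes)
```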