# Data Cleaning Process

Before we start analyzing the Loan Data database, we need to perform a few data cleaning steps: renaming columns, giving each column an appropriate data type, and tidying the formatting of the column values.

- Refer to the data dictionary for more details.

## 1\. Checking Out Our Data

To start, let's take a glimpse at our data by selecting the first five rows of all the columns in the dataset.

In [1]:
SELECT TOP(5) *
FROM [Projects].[dbo].[loan_data]

credit#policy,purpose,int#rate,installment,log#annual#inc,dti,fico,days#with#cr#line,revol#bal,revol#util,inq#last#6mths,delinq#2yrs,pub#rec,not#fully#paid
1,debt_consolidation,0.1189,829.1,11.35040654,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,credit_card,0.1071,228.22,11.08214255,14.29,707,2760.0,33623,76.7,0,0,0,0
1,debt_consolidation,0.1357,366.86,10.37349118,11.63,682,4710.0,3511,25.6,1,0,0,0
1,debt_consolidation,0.1008,162.34,11.35040654,8.1,712,2699.958333,33667,73.2,1,0,0,0
1,credit_card,0.1426,102.92,11.29973224,14.97,667,4066.0,4740,39.5,0,1,0,0


## 2\. Changing Column Names and Data Types

The results show that the column names use `#` as a separator, so we need to rename them to match the Column Naming Standards.

In [None]:
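-- Renaming columns: replace '#' with '_' (run with Projects as the current database)

EXEC sp_RENAME '[dbo].[loan_data].credit#policy', 'credit_policy', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].int#rate', 'int_rate', 'COLUMN'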
EXEC sp_RENAME '[dbo].[loan_data].log#annual#inc', 'log_annual_inc', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].days#with#cr#line', 'days_with_cr_line', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].revol#bal', 'revol_bal', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].revol#util', 'revol_util', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].inq#last#6mths', 'inq_last_6mths', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].delinq#2yrs', 'delinq_2yrs', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].pub#rec', 'pub_rec', 'COLUMN'
EXEC sp_RENAME '[dbo].[loan_data].not#fully#paid', 'not_fully_paid', 'COLUMN'
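
As a quick sanity check on the renames above, one could list any columns whose names still contain `#`. This is a minimal sketch, assuming the table lives in the `dbo` schema of the current database; it relies only on the standard `INFORMATION_SCHEMA.COLUMNS` view. An empty result would suggest that every column was renamed.

```sql
-- List columns of dbo.loan_data whose names still contain '#'
-- (assumption: the current database is the one that holds loan_data)
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'loan_data'
  AND COLUMN_NAME LIKE '%#%'
ORDER BY ORDINAL_POSITION;
```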

Now that the columns are renamed, we will adjust their data types. All numeric columns are currently stored as _float_, so let's change the columns that do not contain decimal numbers to _int_.

In [1]:
-- Altering inq_last_6mths column from float to int
ALTER TABLE [Projects].[dbo].[loan_data] 
ALTER COLUMN inq_last_6mths int 
-- Altering fico column from float to int
ALTER TABLE [Projects].[dbo].[loan_data] 
ALTER COLUMN fico int 
-- Altering delinq_2yrs column from float to int
ALTER TABLE [Projects].[dbo].[loan_data] 
ALTER COLUMN delinq_2yrs int 
-- Altering pub_rec column from float to int
ALTER TABLE [Projects].[dbo].[loan_data] 
ALTER COLUMN pub_rec int 
-- Altering not_fully_paid column from float to int
ALTER TABLE [Projects].[dbo].[loan_data] 
ALTER COLUMN not_fully_paid int 
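
Converting a _float_ column to _int_ in SQL Server truncates any fractional part, so a check like the following is worth running before the ALTER statements above. It is only a sketch: the `_fractional` aliases are illustrative names, and the column list simply mirrors the five columns converted in this notebook. Each count should be 0 if the column really holds whole numbers only.

```sql
-- Count rows whose value has a fractional part, for each column to be converted.
-- FLOOR(x) equals x only for whole numbers, so a non-zero count flags values
-- that a float-to-int conversion would truncate.
SELECT
    SUM(CASE WHEN inq_last_6mths <> FLOOR(inq_last_6mths) THEN 1 ELSE 0 END) AS inq_last_6mths_fractional,
    SUM(CASE WHEN fico           <> FLOOR(fico)           THEN 1 ELSE 0 END) AS fico_fractional,
    SUM(CASE WHEN delinq_2yrs    <> FLOOR(delinq_2yrs)    THEN 1 ELSE 0 END) AS delinq_2yrs_fractional,
    SUM(CASE WHEN pub_rec        <> FLOOR(pub_rec)        THEN 1 ELSE 0 END) AS pub_rec_fractional,
    SUM(CASE WHEN not_fully_paid <> FLOOR(not_fully_paid) THEN 1 ELSE 0 END) AS not_fully_paid_fractional
FROM [Projects].[dbo].[loan_data];
```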

Let's check our table so far. The column names are now more readable and the columns have the appropriate data types. The "purpose" column can also be cleaned by replacing the underscores with spaces and capitalizing the first letter.

In [8]:
SELECT TOP(5) *
FROM [Projects].[dbo].[loan_data]

credit_policy,purpose,int_rate,installment,log_annual_inc,dti,fico,days_with_cr_line,revol_bal,revol_util,inq_last_6mths,delinq_2yrs,pub_rec,not_fully_paid
1,debt_consolidation,0.1189,829.1,11.35040654,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,credit_card,0.1071,228.22,11.08214255,14.29,707,2760.0,33623,76.7,0,0,0,0
1,debt_consolidation,0.1357,366.86,10.37349118,11.63,682,4710.0,3511,25.6,1,0,0,0
1,debt_consolidation,0.1008,162.34,11.35040654,8.1,712,2699.958333,33667,73.2,1,0,0,0
1,credit_card,0.1426,102.92,11.29973224,14.97,667,4066.0,4740,39.5,0,1,0,0
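
A `TOP(5)` preview shows the values but not the declared data types. To confirm the types directly, a query along these lines could be used; it is a sketch that assumes the table is `dbo.loan_data` in the `Projects` database, as in the queries above.

```sql
-- Show each column's declared data type, in table order
SELECT COLUMN_NAME, DATA_TYPE
FROM [Projects].INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo'
  AND TABLE_NAME = 'loan_data'
ORDER BY ORDINAL_POSITION;
```

The five converted columns should report `int`, while the remaining numeric columns should still report `float`.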


In [9]:
UPDATE [Projects].[dbo].[loan_data] 
SET purpose = REPLACE(purpose, '_', ' ')

In [10]:
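-- Capitalize the first letter and lowercase the rest.
-- SUBSTRING is used instead of RIGHT(purpose, LEN(purpose) - 1), which raises
-- an error when purpose is an empty string (negative length).
UPDATE [Projects].[dbo].[loan_data]
SET purpose = UPPER(LEFT(purpose, 1)) + LOWER(SUBSTRING(purpose, 2, LEN(purpose)))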
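
Because `UPDATE` changes the data in place, it can help to preview the transformation first. The following is a minimal sketch that applies both steps (underscore replacement and capitalization) in a single read-only `SELECT`; the aliases `original_purpose` and `cleaned_purpose` are illustrative names, not part of the dataset.

```sql
-- Preview original vs. cleaned values without modifying the table
SELECT DISTINCT
    purpose AS original_purpose,
    UPPER(LEFT(REPLACE(purpose, '_', ' '), 1))
        + LOWER(SUBSTRING(REPLACE(purpose, '_', ' '), 2, LEN(purpose))) AS cleaned_purpose
FROM [Projects].[dbo].[loan_data]
ORDER BY original_purpose;
```

If the preview looks right, the same expression could be used in a single `UPDATE` instead of two separate passes.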

## 3\. The Final Look

Now that everything is cleaned, let's take a final look at the table before starting our analysis.

In [11]:
SELECT TOP(5) *
FROM [Projects].[dbo].[loan_data]

credit_policy,purpose,int_rate,installment,log_annual_inc,dti,fico,days_with_cr_line,revol_bal,revol_util,inq_last_6mths,delinq_2yrs,pub_rec,not_fully_paid
1,Debt consolidation,0.1189,829.1,11.35040654,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,Credit card,0.1071,228.22,11.08214255,14.29,707,2760.0,33623,76.7,0,0,0,0
1,Debt consolidation,0.1357,366.86,10.37349118,11.63,682,4710.0,3511,25.6,1,0,0,0
1,Debt consolidation,0.1008,162.34,11.35040654,8.1,712,2699.958333,33667,73.2,1,0,0,0
1,Credit card,0.1426,102.92,11.29973224,14.97,667,4066.0,4740,39.5,0,1,0,0
