# Big Data Real-Time Analytics with Python and Spark

## Chapter 2 - Case Study 1 - Cleaning and Processing Data with NumPy

- Documentation: https://numpy.org/
- Data from: https://www.openintro.org/data/index.php

![Case Study DSA](images/CaseStudy1.png "Case Study DSA")

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [2]:
# Import NumPy, the only library used in this notebook
import numpy as np

In [3]:
# Warning filter
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Package versions used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversions

Author: Bianca Amorim

numpy: 1.23.3



- https://numpy.org/doc/stable/reference/generated/numpy.set_printoptions.html

In [5]:
# NumPy print settings: no scientific notation, wider lines, 2 decimal places
np.set_printoptions(suppress = True, linewidth = 200, precision = 2)

## Loading the dataset

- https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html

In [6]:
# Load the dataset
dataset = np.genfromtxt("datasets/dataset1.csv",
                        delimiter = ";",
                        skip_header = 1,
                        autostrip = True,
                        encoding = 'cp1252')

In [7]:
# Check the type (ndarray is NumPy's n-dimensional array)
type(dataset)

numpy.ndarray

In [8]:
# 10000 rows by 14 columns
dataset.shape

(10000, 14)

In [9]:
dataset.view()

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])

**"nan"** means "not a number". The columns are not actually empty: NumPy simply could not parse some of the data. This happens because of the special characters in the dataset and because `genfromtxt` tries to load every field as a number by default, so text fields become nan.
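To see this effect in isolation, here is a minimal sketch on a tiny in-memory CSV. The text in `csv_text` is a hypothetical stand-in for `datasets/dataset1.csv`, not the real file.

```python
import io
import numpy as np

# Hypothetical two-row CSV standing in for datasets/dataset1.csv
csv_text = "id;loan_amnt;loan_status\n1;35000;Current\n2;;Late\n"

# With the default float dtype, text fields and empty fields both become nan
toy = np.genfromtxt(io.StringIO(csv_text), delimiter=";", skip_header=1)
print(toy)

# nan count per column: none in id, one in loan_amnt, all rows in loan_status
print(np.isnan(toy).sum(axis=0))
```

A column that is nan in every row is therefore a strong hint that it holds text rather than numbers.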

## Checking Missing Values

In [10]:
# Count the total number of missing values
# Many of these missing values were generated when we loaded the data
np.isnan(dataset).sum()

88005

- https://numpy.org/doc/stable/reference/generated/numpy.nanmax.html

In [11]:
# Returns the highest value + 1, ignoring nan values
# We will use it to fill the missing values of the numeric variables when we reload the data,
# and later treat this value as the marker for a missing value
joker_value = np.nanmax(dataset) + 1
print(joker_value)

68616520.0


In [12]:
# Without the nan-aware function above, the maximum would be nan
np.max(dataset)

nan

- https://numpy.org/doc/stable/reference/generated/numpy.nanmean.html

In [13]:
# Compute the mean of each column, ignoring nan values
# A column whose mean is still nan holds only text, which lets us separate numeric from string columns
average_ignoring_nan = np.nanmean(dataset, axis = 0)
print(average_ignoring_nan)

[54015809.19         nan    15273.46         nan    15311.04         nan       16.62      440.92         nan         nan         nan         nan         nan     3143.85]
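The difference between `np.max`, `np.nanmax` and `np.nanmean` can be checked on a small array. This is only an illustrative sketch; the array `toy` is made up. Note that `np.nanmean` emits a `RuntimeWarning` for a column that is entirely nan, which is presumably why the notebook filters warnings at the top.

```python
import warnings
import numpy as np

# Made-up array: the middle column is entirely nan, like a text column after loading
toy = np.array([[1.0, np.nan, 10.0],
                [3.0, np.nan, np.nan]])

print(np.max(toy))     # nan, because nan propagates through the comparison
print(np.nanmax(toy))  # 10.0, nan entries are ignored

# Silence the "Mean of empty slice" warning only around this call
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    col_means = np.nanmean(toy, axis=0)

print(col_means)            # mean per column; nan where the whole column is nan
print(np.isnan(col_means))  # True marks the all-nan (text) column
```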


- `argwhere` returns the positions of the elements that are non-zero (True):
https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html
- `squeeze` removes the extra axis, because we don't need the nested form [[1], [3], ...]:
https://numpy.org/doc/stable/reference/generated/numpy.squeeze.html


In [14]:
# Indices of the string columns (the ones whose mean is nan)
# squeeze() flattens the (n, 1) result of argwhere into a 1-D array
string_columns = np.argwhere(np.isnan(average_ignoring_nan)).squeeze()
string_columns

array([ 1,  3,  5,  8,  9, 10, 11, 12])

In [15]:
# argwhere above returns a position only where the value is True (1)
# False counts as 0, and argwhere does not return the positions of zeros
np.isnan(average_ignoring_nan)

array([False,  True, False,  True, False,  True, False, False,  True,  True,  True,  True,  True, False])

In [16]:
# Indices of the numeric columns
numeric_columns = np.argwhere(np.isnan(average_ignoring_nan) == False).squeeze()
numeric_columns

array([ 0,  2,  4,  6,  7, 13])

In [17]:
# Now True marks the columns whose mean is not nan
np.isnan(average_ignoring_nan) == False

array([ True, False,  True, False,  True, False,  True,  True, False, False, False, False, False,  True])
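The index extraction used in the cells above can be sketched on a short mask. The array `col_means` is made up, and `np.flatnonzero` with the `~` operator is shown as an alternative spelling, not as what the notebook uses.

```python
import numpy as np

# Made-up column means: nan marks a text column
col_means = np.array([2.0, np.nan, 10.0, np.nan])
is_text = np.isnan(col_means)

# argwhere returns one row per True entry, hence the extra axis
print(np.argwhere(is_text).shape)  # (2, 1)

# squeeze() drops the length-1 axis
text_cols = np.argwhere(is_text).squeeze()

# Alternative: ~mask negates the mask, flatnonzero always returns a 1-D array
numeric_cols = np.flatnonzero(~is_text)

print(text_cols, numeric_cols)  # [1 3] [0 2]
```

One design point: when only a single column matches, `squeeze()` collapses the result to a 0-d array, while `np.flatnonzero` still returns a 1-D array of length one.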

> Import the dataset again, separating string columns from numeric columns

In [18]:
# Load only the columns with string data
# We pass the indices of the string columns and the data type
arr_strings = np.genfromtxt("datasets/dataset1.csv",
                           delimiter = ";",
                           skip_header = 1,
                           autostrip = True,
                           usecols = string_columns,
                           dtype = str,
                           encoding = 'cp1252')

In [19]:
arr_strings

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

In [20]:
# Load only the columns with numeric data
# We pass the indices of the numeric columns
# filling_values sets the default value used when data is missing
arr_numeric = np.genfromtxt("datasets/dataset1.csv",
                           delimiter = ";",
                           skip_header = 1,
                           autostrip = True,
                           usecols = numeric_columns,
                           filling_values = joker_value,
                           encoding = 'cp1252')

In [21]:
arr_numeric

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  , 68616520.  ,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  , 68616520.  ,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  , 68616520.  , 68616520.  ,     2185.64],
       [46154151.  , 68616520.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  , 68616520.  ,      309.97,      301.9 ]])

> Now we extract the column names. We did not extract them before because they are all strings.

In [22]:
# Load the column names
arr_columns_name = np.genfromtxt("datasets/dataset1.csv",
                                delimiter = ";",
                                autostrip = True,
                                skip_footer = dataset.shape[0],
                                dtype = str,
                                encoding = 'cp1252')

In [23]:
arr_columns_name

array(['id', 'issue_d', 'loan_amnt', 'loan_status', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state', 'total_pymnt'], dtype='<U19')

> "skip_footer" is the number of lines to skip at the end of the file. Skipping `dataset.shape[0]` lines leaves only the header row.
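As an alternative sketch, the header row can also be read with the `max_rows` parameter of `np.genfromtxt`, which avoids depending on the row count. The CSV text below is a hypothetical stand-in for the real file.

```python
import io
import numpy as np

# Hypothetical CSV standing in for datasets/dataset1.csv
csv_text = "id;issue_d;loan_amnt\n1;May-15;35000\n2;Sep-15;15000\n"

# max_rows = 1 reads only the first line, which here is the header
header = np.genfromtxt(io.StringIO(csv_text),
                       delimiter=";",
                       autostrip=True,
                       dtype=str,
                       max_rows=1)

print(header)  # the three column names
```

Note that `max_rows` and `skip_footer` cannot be combined in the same call.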

In [24]:
# Separate numeric and string column headers
header_strings, header_numeric = arr_columns_name[string_columns], arr_columns_name[numeric_columns]

In [25]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [26]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

## Checkpoint function
### Checkpoint 1
We will create a checkpoint function to save the intermediate results.

In [27]:
# A function that saves the given header and data to disk and loads them back.
# The caller chooses what to save.
def checkpoint(file_name, checkpoint_header, checkpoint_data):
    np.savez(file_name, header = checkpoint_header, data = checkpoint_data)
    checkpoint_variable = np.load(file_name + ".npz")
    return(checkpoint_variable)

In [28]:
# Save the string arrays first, because they are the most critical
checkpoint_inicial = checkpoint("datasets/Checkpoint-Inicial", header_strings, arr_strings)

In [29]:
checkpoint_inicial['data']

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

In [30]:
# Check that the reloaded array equals the string array. They must be equal.
np.array_equal(checkpoint_inicial['data'], arr_strings)

True
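The save-and-reload round trip can be tried without touching the `datasets` folder. The sketch below is a variant of the function above, not a replacement: `checkpoint_in_memory`, `toy_header` and `toy_data` are hypothetical names, and the variant copies the arrays into a plain dictionary so that the `.npz` file can be closed immediately.

```python
import os
import tempfile
import numpy as np

def checkpoint_in_memory(file_name, checkpoint_header, checkpoint_data):
    # np.savez appends ".npz" to the file name when it is missing
    np.savez(file_name, header=checkpoint_header, data=checkpoint_data)
    # Copy the arrays out and close the file handle right away
    with np.load(file_name + ".npz") as npz:
        return {key: npz[key] for key in npz.files}

# Made-up header and data for the round trip
toy_header = np.array(["issue_d", "term"])
toy_data = np.array([["May-15", "36 months"], ["", "60 months"]])

with tempfile.TemporaryDirectory() as tmp_dir:
    restored = checkpoint_in_memory(os.path.join(tmp_dir, "toy-checkpoint"),
                                    toy_header, toy_data)

print(np.array_equal(restored["data"], toy_data))  # True
```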

## Manipulating strings columns

In [31]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [32]:
# We start with the first column
# Rename it to make the column easier to identify
header_strings[0] = "issue_date"

In [33]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [34]:
arr_strings

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

## Preprocessing the variable issue_date with Label Encoding
This is a preprocessing strategy for categorical variables.

In [35]:
# Extract the unique values of this variable
np.unique(arr_strings[:,0])

array(['', 'Apr-15', 'Aug-15', 'Dec-15', 'Feb-15', 'Jan-15', 'Jul-15', 'Jun-15', 'Mar-15', 'May-15', 'Nov-15', 'Oct-15', 'Sep-15'], dtype='<U69')

- Notice that every value is a month followed by `-15`. The suffix is the same in every record (it most likely denotes the year 2015 rather than a day of the month), so it carries no information. We only need the month, to which we can then apply label encoding, because we cannot feed text to an ML model.

In [36]:
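# Use strip to remove "-15" from each string and save the result in the same column
# strip removes any of the characters '-', '1' and '5' from both ends, which is safe for month names
# NumPy makes this easy; with other tools it is not always as simple.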
arr_strings[:,0] = np.chararray.strip(arr_strings[:,0], "-15")

In [37]:
np.unique(arr_strings[:,0])

array(['', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'], dtype='<U69')
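One detail worth knowing about `strip`: the second argument is a set of characters to remove from both ends, not a literal suffix. The sketch below uses `np.char.strip`, the function form of the same operation, on made-up arrays to show where this matters, and `np.char.replace` as one possible alternative.

```python
import numpy as np

dates = np.array(["May-15", "Dec-15"])
stripped = np.char.strip(dates, "-15")
print(stripped)  # ['May' 'Dec']

# Pitfall: every leading or trailing '-', '1' or '5' is removed, not only the suffix
codes = np.array(["15-A-51"])
over_stripped = np.char.strip(codes, "-15")
print(over_stripped)  # ['A']

# Alternative: remove the literal substring "-15" wherever it occurs
replaced = np.char.replace(dates, "-15", "")
print(replaced)  # ['May' 'Dec']
```

For the month column both approaches give the same result, because no month name starts or ends with `-`, `1` or `5`.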

In [38]:
# Notice that we have missing values (empty strings), which we must handle too
# First we create an array with the months; index 0 is the empty string
months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

In [39]:
# Loop to convert the month names to numbers
# This is called "LABEL ENCODING"
for i in range(13):
    arr_strings[:,0] = np.where(arr_strings[:,0] == months[i], i, arr_strings[:,0])

The loop above checks whether each value in column 0 equals a month in the `months` array. If it does, the value is replaced by the month's index `i`; otherwise the current value is kept.
This way every month name is replaced by its number and every missing value by 0. This is a good strategy because there is no month 0, so a 0 always indicates a missing value.

In [40]:
np.unique(arr_strings[:,0])

array(['0', '1', '10', '11', '12', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U69')
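The loop-based label encoding above can be traced on a four-element column. This is a minimal sketch with made-up data; it passes `str(i)` instead of `i` so that both branches of `np.where` are text, which is an assumption made here to avoid relying on NumPy's number-to-string conversion.

```python
import numpy as np

months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# Made-up column with one missing value (empty string)
col = np.array(['May', '', 'Sep', 'Dec'], dtype='<U69')

for i, month in enumerate(months):
    # Replace the month name by its index, keep everything else unchanged
    col = np.where(col == month, str(i), col)

print(col)  # ['5' '0' '9' '12']

# The values are still text until the array is converted
encoded = col.astype(int)
print(encoded.tolist())  # [5, 0, 9, 12]
```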

## Preprocessing the variable loan_status with binarization

In [41]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [42]:
# Extract the unique values from the column loan_status
np.unique(arr_strings[:,1])

array(['', 'Charged Off', 'Current', 'Default', 'Fully Paid', 'In Grace Period', 'Issued', 'Late (16-30 days)', 'Late (31-120 days)'], dtype='<U69')

In [43]:
# Count the unique values
np.unique(arr_strings[:,1]).size

9

At this stage we have to know, or ask, what is important: not every detail may be necessary, and the data can often be grouped into broader categories. Here we only need to know whether the loan status is good or bad, so we create a list of the statuses that count as bad and use it as a reference.

If the category is in the list status_bad we assign one value (0); otherwise we assign the other (1).

In [44]:
# Create the array with the bad statuses
# Missing values are included too, because we do not know what they are
status_bad = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])

In [45]:
# Check whether each value is in status_bad and convert the column to binary values
arr_strings[:,1] = np.where(np.isin(arr_strings[:,1], status_bad),0,1)

In [46]:
# Extract the unique values from the column loan_status to confirm
np.unique(arr_strings[:,1])

array(['0', '1'], dtype='<U69')
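The binarization step can be followed on a handful of made-up statuses. In this sketch the result is kept as a separate integer array (`encoded`), whereas the notebook writes it back into the string array, where the values are stored as '0' and '1'.

```python
import numpy as np

# Made-up sample of statuses, including one missing value
status = np.array(['Current', 'Default', '', 'Fully Paid', 'Late (16-30 days)'])
status_bad = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])

# True where the status belongs to the "bad" list
is_bad = np.isin(status, status_bad)
print(is_bad.tolist())  # [False, True, True, False, False]

# 0 for bad, 1 for good
encoded = np.where(is_bad, 0, 1)
print(encoded.tolist())  # [1, 0, 0, 1, 1]
```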

## Preprocessing the variable term with string cleaning

In [47]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [48]:
# The data source was clearly not prepared with ML analysis in mind
# Here the values mix numbers and text
np.unique(arr_strings[:,2])

array(['', '36 months', '60 months'], dtype='<U69')

In [49]:
# Remove the word "months"
# Mind the leading space: it must be removed too, because we need only the number
arr_strings[:,2] = np.chararray.strip(arr_strings[:,2], " months")

In [50]:
# Rename the variable to make clear that the column holds a number of months
header_strings[2] = "term_months"

In [51]:
# Missing values are a problem and cannot stay as they are.
# We must decide how to treat them; here we fill them with the longest term
arr_strings[:,2] = np.where(arr_strings[:,2] == '', '60', arr_strings[:,2])

- **Note:** When we do not know how long the borrower will take to repay, it is safer to assume the longest term available. Filling with 0 makes no sense, because every loan takes some time to repay, and we cannot assume it will be the shortest term either.

In [52]:
arr_strings[:,2]

array(['36', '36', '36', ..., '36', '36', '36'], dtype='<U69')

In [53]:
np.unique(arr_strings[:,2])

array(['36', '60'], dtype='<U69')

## Preprocessing variables grade and subgrade with dictionary (A Label Encoding Type)

In [54]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

**Note:** As an analyst, pay attention to the variable names and what they represent. In this example, the variables 'grade' and 'sub_grade' seem to be related. They have similar names, so check whether they carry similar information.

In [55]:
# Extract the unique values from the column grade
np.unique(arr_strings[:,3])

array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [56]:
# Extract the unique values from the column sub_grade
np.unique(arr_strings[:,4])

array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1',
       'G2', 'G3', 'G4', 'G5'], dtype='<U69')

**Note:** If two variables carry the same information, there is no point in keeping both. In ML models this is harmful, because it reinforces the same information twice.
**_Keeping both variables is not a good decision._** It makes more sense to keep the variable sub_grade, which has more detail.

In [57]:
np.unique(arr_strings[:,3])

array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [58]:
# Unique values of the valid categories (without the missing value)
# The slice [1:] drops the empty string, which sorts first; we use it in the loop below
# This leaves only the valid values
np.unique(arr_strings[:,3])[1:]

array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [59]:
# Loop to fill in the variable sub_grade
# Iterate over each grade, skipping the missing value
for i in np.unique(arr_strings[:,3])[1:]:
    arr_strings[:,4] = np.where((arr_strings[:,4] == '') & (arr_strings[:,3] == i), i + '5', arr_strings[:,4])

**This ensures that the two variables stay consistent with each other.**
For each grade category, the loop checks two conditions: whether sub_grade is missing and whether grade equals `i`. If both are true, sub_grade is replaced with the concatenation `i + '5'`; otherwise the existing value is kept.

In [60]:
# Print the unique values with the count of each one
np.unique(arr_strings[:,4], return_counts = True)

(array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1',
        'G2', 'G3', 'G4', 'G5'], dtype='<U69'),
 array([  9, 285, 278, 239, 323, 592, 509, 517, 530, 553, 633, 629, 567, 586, 564, 577, 391, 267, 250, 255, 288, 235, 162, 171, 139, 160,  94,  52,  34,  43,  24,  19,  10,   3,   7,   5]))

In [61]:
# Some missing values remain, because the loop above required both conditions
# (rows where grade is also missing were not filled). We need to treat them.
# Replace the remaining missing values with a new category, H1 (no record uses H1 yet)
arr_strings[:,4] = np.where(arr_strings[:,4] == '', 'H1', arr_strings[:,4])
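The two fill steps above can be followed on four made-up records. In this sketch the arrays `grade` and `sub_grade` are hypothetical, and a boolean mask (`grade != ''`) is used instead of the `[1:]` slice to skip the missing grade.

```python
import numpy as np

# Made-up records: the third one has neither grade nor sub_grade
grade = np.array(['A', 'B', '', 'C'])
sub_grade = np.array(['A1', '', '', ''])

# Step 1: where sub_grade is missing but grade is known, use grade + '5'
for g in np.unique(grade[grade != '']):
    missing_with_grade = (sub_grade == '') & (grade == g)
    sub_grade = np.where(missing_with_grade, g + '5', sub_grade)

after_step_1 = sub_grade.copy()
print(after_step_1)  # ['A1' 'B5' '' 'C5']

# Step 2: whatever is still missing gets the new category H1
sub_grade = np.where(sub_grade == '', 'H1', sub_grade)
print(sub_grade)  # ['A1' 'B5' 'H1' 'C5']
```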

Delete the grade column because we do not need it anymore.

In [62]:
# Delete the grade column
arr_strings = np.delete(arr_strings, 3, axis = 1)

**Note:** When we delete a column, the remaining columns shift, so column 3 is now sub_grade.

In [63]:
arr_strings[:,3]

array(['C3', 'A5', 'B5', ..., 'A5', 'D2', 'A4'], dtype='<U69')

In [64]:
header_strings = np.delete(header_strings, 3)

In [65]:
# After deleting the column name above, index 3 of the header is now sub_grade
header_strings[3]

'sub_grade'

Now we have the subgrade column ready to convert into a numeric representation.

In [66]:
# Extract the variable unique values
np.unique(arr_strings[:,3])

array(['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1', 'G2',
       'G3', 'G4', 'G5', 'H1'], dtype='<U69')

**Create a dictionary**

In [67]:
# Create a list of keys
# Put the category values as key
keys = list(np.unique(arr_strings[:,3]))
keys[0]

'A1'

In [68]:
# Create a list of values
values = list(range(1, np.unique(arr_strings[:,3]).shape[0] + 1))
values[0]

1

In [69]:
# Create a dictionary with the keys and values
# We use zip to join keys and values and create the dictionary
dict_sub_grade = dict(zip(keys, values))

In [70]:
dict_sub_grade

{'A1': 1,
 'A2': 2,
 'A3': 3,
 'A4': 4,
 'A5': 5,
 'B1': 6,
 'B2': 7,
 'B3': 8,
 'B4': 9,
 'B5': 10,
 'C1': 11,
 'C2': 12,
 'C3': 13,
 'C4': 14,
 'C5': 15,
 'D1': 16,
 'D2': 17,
 'D3': 18,
 'D4': 19,
 'D5': 20,
 'E1': 21,
 'E2': 22,
 'E3': 23,
 'E4': 24,
 'E5': 25,
 'F1': 26,
 'F2': 27,
 'F3': 28,
 'F4': 29,
 'F5': 30,
 'G1': 31,
 'G2': 32,
 'G3': 33,
 'G4': 34,
 'G5': 35,
 'H1': 36}

In [71]:
# Now we use this dictionary to replace the values in our array in the dataset
# We replace each category with the corresponding number
for i in np.unique(arr_strings[:,3]):
    arr_strings[:,3] = np.where(arr_strings[:,3] == i, dict_sub_grade[i], arr_strings[:,3])

In [72]:
# Extract the unique values of the variable
np.unique(arr_strings[:,3])

array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36', '4', '5', '6',
       '7', '8', '9'], dtype='<U69')
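A possible alternative to the dictionary-and-loop approach is the `return_inverse` option of `np.unique`, sketched below on made-up data. It yields the same 1-based codes as a dictionary built from the sorted unique values, but this is only an illustration, not the method the notebook uses.

```python
import numpy as np

# Made-up sub_grade column
sub_grade = np.array(['C3', 'A5', 'B5', 'A5'])

# categories is sorted; inverse holds, for each record, the index of its category
categories, inverse = np.unique(sub_grade, return_inverse=True)

# Shift by one so the codes start at 1, as in the dictionary above
codes = inverse + 1

# The equivalent dictionary, for reference
mapping = dict(zip(categories.tolist(), range(1, len(categories) + 1)))

print(mapping)         # {'A5': 1, 'B5': 2, 'C3': 3}
print(codes.tolist())  # [3, 1, 2, 1]
```

The dictionary has the advantage of being easy to inspect and to save alongside the data, which helps when the codes must later be translated back to categories.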

## Preprocessing the variable verification_status with binarization

In [73]:
# List of the variables names
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [74]:
# Extract variables unique values
np.unique(arr_strings[:,4])

array(['', 'Not Verified', 'Source Verified', 'Verified'], dtype='<U69')

It is unclear whether the last two categories, 'Source Verified' and 'Verified', mean the same thing. Here we assume they do and apply binarization. We also treat missing values as not verified.

In [75]:
# Encoding with binarization
arr_strings[:,4] = np.where((arr_strings[:,4] == '') | (arr_strings[:,4] == 'Not Verified'), 0, 1)

In [76]:
# Extract unique values of the variable to confirm the change
np.unique(arr_strings[:,4])

array(['0', '1'], dtype='<U69')

## Preprocessing the variable url with ID extraction

In [77]:
# List of the variables names
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [78]:
# View a sample of the data
arr_strings[:,5]

array(['https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', ..., 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249'], dtype='<U69')

View a sample of the data and try to detect a pattern. Above we can see that the id number is the only information that differs from one record to another.

In [79]:
# Extract the ID at the end of each url
np.chararray.strip(arr_strings[:,5], 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=')

chararray(['48010226', '57693261', '59432726', ..., '50415990', '46154151', '66055249'], dtype='<U69')

In [80]:
# Replace the url value with the ID value
arr_strings[:,5] = np.chararray.strip(arr_strings[:,5], 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=')

In [81]:
# Check that the values convert to int32, since the column now holds only digits (the result is not stored)
arr_strings[:,5].astype(dtype = np.int32)

array([48010226, 57693261, 59432726, ..., 50415990, 46154151, 66055249], dtype=int32)

The analyst should always pay attention to the other variables. The first column of the numeric array seems to hold the same values that we extracted here.

In [82]:
# Convert the first column of the numeric array to int32 so we can compare
arr_numeric[:,0].astype(dtype = np.int32)

array([48010226, 57693261, 59432726, ..., 50415990, 46154151, 66055249], dtype=int32)

In [83]:
# Check whether the two arrays are equal
# They look the same above, but it is good to confirm
np.array_equal(arr_numeric[:,0].astype(dtype = np.int32), arr_strings[:,5].astype(dtype = np.int32))

True

In [84]:
# Remove the variable that we created above
# Since both hold the same values, we can remove one. We choose to remove the one in arr_strings
arr_strings = np.delete(arr_strings, 5, axis = 1)

In [85]:
# Remove the header of the variable
header_strings = np.delete(header_strings, 5)

In [86]:
# Show the variable that is now at column index 5
arr_strings[:,5]

array(['CA', 'NY', 'PA', ..., 'CA', 'OH', 'IL'], dtype='<U69')

In [87]:
# Show the new list of columns names
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'addr_state'], dtype='<U19')

In [88]:
# ID column
arr_numeric[:,0]

array([48010226., 57693261., 59432726., ..., 50415990., 46154151., 66055249.])

In [89]:
# ID column header
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

## Preprocessing the variable addr_state with categorization

In [90]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'addr_state'], dtype='<U19')

In [91]:
# Change the name of the variable
header_strings[5] = 'state_address'

- `numpy.unique(return_counts=True)`: if `return_counts` is True, also returns the number of times each unique item appears in the array.
https://numpy.org/doc/stable/reference/generated/numpy.unique.html?highlight=return_counts
- `numpy.argsort`: returns the indices that would sort an array.
https://numpy.org/doc/stable/reference/generated/numpy.argsort.html?highlight=argsort#numpy.argsort

In [92]:
# Extract the state names and their counts
# Two variables because np.unique returns two arrays: the unique values and the count of each one
states_names, states_count = np.unique(arr_strings[:,5], return_counts = True)

In [93]:
# Returns the indices that would sort the array states_count
# Using them as an index sorts by states_count
# The "-" sorts in descending order
states_count_sorted = np.argsort(-states_count)

In [94]:
# Show the state names and counts, sorted by count
# The indices returned by np.argsort are used as the index
states_names[states_count_sorted], states_count[states_count_sorted]

(array(['CA', 'NY', 'TX', 'FL', '', 'IL', 'NJ', 'GA', 'PA', 'OH', 'MI', 'NC', 'VA', 'MD', 'AZ', 'WA', 'MA', 'CO', 'MO', 'MN', 'IN', 'WI', 'CT', 'TN', 'NV', 'AL', 'LA', 'OR', 'SC', 'KY', 'KS', 'OK',
        'UT', 'AR', 'MS', 'NH', 'NM', 'WV', 'HI', 'RI', 'MT', 'DE', 'DC', 'WY', 'AK', 'NE', 'SD', 'VT', 'ND', 'ME'], dtype='<U69'),
 array([1336,  777,  758,  690,  500,  389,  341,  321,  320,  312,  267,  261,  242,  222,  220,  216,  210,  201,  160,  156,  152,  148,  143,  143,  130,  119,  116,  108,  107,   84,   84,   83,
          74,   74,   61,   58,   57,   49,   44,   40,   28,   27,   27,   27,   26,   25,   24,   17,   16,   10]))
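The count-and-sort pattern above can be checked on six made-up records. Negating the counts before calling `np.argsort` is what produces the descending order.

```python
import numpy as np

# Made-up state column
states = np.array(['TX', 'NY', 'TX', 'CA', 'TX', 'NY'])

# names is sorted alphabetically; counts is aligned with names
names, counts = np.unique(states, return_counts=True)

# Indices that sort the counts from largest to smallest
order = np.argsort(-counts)

print(names[order].tolist())   # ['TX', 'NY', 'CA']
print(counts[order].tolist())  # [3, 2, 1]
```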

In [95]:
# Replace missing values with 0
arr_strings[:,5] = np.where(arr_strings[:,5] == '', 0, arr_strings[:,5])

Is it relevant to know each borrower's state? Is any analysis being done per state? If not, it doesn't make sense to keep every state, and we can convert the variable to regions.

In [96]:
# Split the states by region
states_west = np.array(['WA', 'OR', 'CA', 'NV', 'ID', 'MT', 'WY', 'UT', 'CO', 'AZ', 'NM', 'HI', 'AK'])
states_south = np.array(['TX', 'OK', 'AR', 'LA', 'MS', 'AL', 'TN', 'KY', 'FL', 'GA', 'SC', 'NC', 'VA', 'WV', 'MD', 'DE', 'DC'])
states_midwest = np.array(['ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO', 'WI', 'IL', 'IN', 'MI', 'OH'])
states_east = np.array(['PA', 'NY', 'NJ', 'CT', 'MA', 'VT', 'NH', 'ME', 'RI'])

In [97]:
# Replace each state with the region ID
arr_strings[:,5] = np.where(np.isin(arr_strings[:,5], states_west), 1, arr_strings[:,5])
arr_strings[:,5] = np.where(np.isin(arr_strings[:,5], states_south), 2, arr_strings[:,5])
arr_strings[:,5] = np.where(np.isin(arr_strings[:,5], states_midwest), 3, arr_strings[:,5])
arr_strings[:,5] = np.where(np.isin(arr_strings[:,5], states_east), 4, arr_strings[:,5])

In [98]:
# Extract unique values
np.unique(arr_strings[:,5])

array(['0', '1', '2', '3', '4'], dtype='<U69')

**You can change the data, but you cannot change the information!**

## Converting an array

Our string array now holds only numeric values. Let's adjust the data type.

In [99]:
arr_strings

array([['5', '1', '36', '13', '1', '1'],
       ['0', '1', '36', '5', '1', '4'],
       ['9', '1', '36', '10', '1', '4'],
       ...,
       ['6', '1', '36', '5', '1', '1'],
       ['4', '1', '36', '17', '1', '3'],
       ['12', '1', '36', '4', '0', '3']], dtype='<U69')

In [100]:
arr_strings = arr_strings.astype(int)

In [101]:
arr_strings

array([[ 5,  1, 36, 13,  1,  1],
       [ 0,  1, 36,  5,  1,  4],
       [ 9,  1, 36, 10,  1,  4],
       ...,
       [ 6,  1, 36,  5,  1,  1],
       [ 4,  1, 36, 17,  1,  3],
       [12,  1, 36,  4,  0,  3]])

In [102]:
arr_strings.dtype

dtype('int64')

## Checkpoint with the cleaned and preprocessed string data
### Checkpoint 2
The first part is complete.

In [103]:
checkpoint_strings = checkpoint("datasets/Checkpoint-Strings", header_strings, arr_strings)

In [104]:
checkpoint_strings["header"]

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'state_address'], dtype='<U19')

In [105]:
checkpoint_strings["data"]

array([[ 5,  1, 36, 13,  1,  1],
       [ 0,  1, 36,  5,  1,  4],
       [ 9,  1, 36, 10,  1,  4],
       ...,
       [ 6,  1, 36,  5,  1,  1],
       [ 4,  1, 36, 17,  1,  3],
       [12,  1, 36,  4,  0,  3]])

In [106]:
# Check that the checkpoint saved on disk matches the array we have in memory
np.array_equal(checkpoint_strings['data'], arr_strings)

True

## Manipulating numeric columns

In [107]:
# View the data
arr_numeric

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  , 68616520.  ,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  , 68616520.  ,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  , 68616520.  , 68616520.  ,     2185.64],
       [46154151.  , 68616520.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  , 68616520.  ,      309.97,      301.9 ]])

In [108]:
# Name of the columns
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

In [109]:
# Check how many nan values we have
# There are no nan values here because, when loading the data, we filled them with an arbitrary value
np.isnan(arr_numeric).sum()

0

The 0 above shows that no data is missing, but information still is. The joker value is not valid information for the dataset. We inserted it so that we can now find, replace, and treat those entries.

In [110]:
joker_value

68616520.0

In [111]:
# 1. Check whether the column contains the joker value
np.isin(arr_numeric[:,0], joker_value)

array([False, False, False, ..., False, False, False])

In [112]:
# 2. Count how many entries in the column equal the joker value
# This second option removes any doubt, because a True could be hidden in the "..."
np.isin(arr_numeric[:,0], joker_value).sum()

0

Let's create an array of statistics, specifically the minimum, mean, and maximum of each variable. We will use it to handle the missing values (which were filled with the joker value).

In [113]:
# Create an array with the minimum, mean and maximum of each variable
# The mean was already computed at the beginning
arr_stats = np.array([np.nanmin(dataset, axis = 0), average_ignoring_nan, np.nanmax(dataset, axis = 0)])

In [114]:
print(arr_stats)

[[  373332.           nan     1000.           nan     1000.           nan        6.         31.42         nan         nan         nan         nan         nan        0.  ]
 [54015809.19         nan    15273.46         nan    15311.04         nan       16.62      440.92         nan         nan         nan         nan         nan     3143.85]
 [68616519.           nan    35000.           nan    35000.           nan       28.99     1372.97         nan         nan         nan         nan         nan    41913.62]]


In [115]:
arr_stats[:, numeric_columns]

array([[  373332.  ,     1000.  ,     1000.  ,        6.  ,       31.42,        0.  ],
       [54015809.19,    15273.46,    15311.04,       16.62,      440.92,     3143.85],
       [68616519.  ,    35000.  ,    35000.  ,       28.99,     1372.97,    41913.62]])
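
To make the row layout of the statistics array concrete, here is a minimal sketch on a made-up array. It assumes the average was computed with `np.nanmean`; the definition of `average_ignoring_nan` is not shown in this part of the notebook, so that is an assumption.

```python
import numpy as np

# Hypothetical toy data with a few missing entries
toy = np.array([[1.0, np.nan, 10.0],
                [3.0, 4.0, np.nan],
                [5.0, 8.0, 30.0]])

# Row 0 = minimum, row 1 = average, row 2 = maximum, each computed per column while skipping NaN
toy_stats = np.array([np.nanmin(toy, axis=0),
                      np.nanmean(toy, axis=0),
                      np.nanmax(toy, axis=0)])
print(toy_stats)
```

For a column that contains only NaN, these functions return NaN (with a RuntimeWarning), which is consistent with the `nan` entries that appear in `arr_stats` for the non-numeric columns.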

## Preprocessing the variable funded_amnt

In [116]:
# View the data
arr_numeric[:,2]

array([35000., 30000., 15000., ..., 10000., 10000., 10000.])

In [117]:
arr_stats[0, numeric_columns[2]]

1000.0

In [118]:
# Adjust the column content: replace the joker value with the column minimum (row 0 of arr_stats)
arr_numeric[:,2] = np.where(arr_numeric[:,2] == joker_value, arr_stats[0, numeric_columns[2]], arr_numeric[:,2])

In [119]:
arr_numeric[:,2]

array([35000., 30000., 15000., ..., 10000., 10000., 10000.])

## Preprocessing variables loan_amnt, int_rate, installment and total_pymnt

In [120]:
# Name of the columns
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

In [121]:
# Loop to replace the missing values (joker value) with the column maximum (row 2 of arr_stats)
for i in [1, 3, 4, 5]:
    arr_numeric[:,i] = np.where(arr_numeric[:,i] == joker_value,
                               arr_stats[2, numeric_columns[i]],
                               arr_numeric[:,i])

In [122]:
arr_numeric

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  ,       28.99,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  ,       28.99,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  ,       28.99,     1372.97,     2185.64],
       [46154151.  ,    35000.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  ,       28.99,      309.97,      301.9 ]])
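
The loop above can also be written as a single `np.where` call, because a 1-D array of per-column fill values broadcasts across the rows. This is a sketch on made-up data: `toy` and `fill_values` are hypothetical, and the fill values are derived from the toy array itself rather than from `arr_stats`.

```python
import numpy as np

sentinel = 68616520.0
toy = np.array([[10.0, sentinel, 1.0],
                [sentinel, 2.0, 3.0],
                [30.0, 4.0, sentinel]])

# Hide the sentinel behind NaN so it does not distort the column maxima
masked = np.where(toy == sentinel, np.nan, toy)
fill_values = np.nanmax(masked, axis=0)  # one fill value per column

# fill_values has shape (3,), so it is broadcast along the rows of toy
cleaned = np.where(toy == sentinel, fill_values, toy)
print(cleaned)
```

When different columns need different statistics (minimum for one, maximum for the others, as in this notebook), the fill vector can be assembled column by column before the single `np.where` call.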

## Working with the Second Dataset
We will load the USD to EUR exchange rates. Each row of the dataset holds the exchange rate for one month of a given year. We need this because, in this case, we assume the company will also want to see the amounts in euros.

In [123]:
# Load the second dataset
# We need only the column at index 3 (usecols is zero-based, so this is the fourth column)
exchange_rate_data = np.genfromtxt("datasets/dataset2.csv",
                             delimiter = ',',
                             autostrip = True,
                             skip_header = 1,
                             usecols = 3)

In [124]:
# View the data
exchange_rate_data

array([1.13, 1.12, 1.08, 1.11, 1.1 , 1.12, 1.09, 1.13, 1.13, 1.1 , 1.06, 1.09])

In [125]:
# Header names of the string data
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'state_address'], dtype='<U19')

In [126]:
# string data
arr_strings

array([[ 5,  1, 36, 13,  1,  1],
       [ 0,  1, 36,  5,  1,  4],
       [ 9,  1, 36, 10,  1,  4],
       ...,
       [ 6,  1, 36,  5,  1,  1],
       [ 4,  1, 36, 17,  1,  3],
       [12,  1, 36,  4,  0,  3]])

In [127]:
# Column 0 of the string array is the month (issue_date)
arr_strings[:,0]

array([ 5,  0,  9, ...,  6,  4, 12])

In [128]:
# Let's assign the month column to the variable called exchange_rate
exchange_rate = arr_strings[:,0]

In [129]:
exchange_rate

array([ 5,  0,  9, ...,  6,  4, 12])

In [130]:
# Loop to fill the exchange_rate variable with the rate corresponding to each month
# We use exchange_rate_data[i - 1] because the months run from 1 to 12, while the array indices start at 0
for i in range(1, 13):
    exchange_rate = np.where(exchange_rate == i, exchange_rate_data[i - 1], exchange_rate)

In [131]:
exchange_rate

array([1.1 , 0.  , 1.13, ..., 1.12, 1.11, 1.09])

In [132]:
# Replace month 0 (missing values) with the average exchange rate
exchange_rate = np.where(exchange_rate == 0, np.mean(exchange_rate_data), exchange_rate)

In [133]:
exchange_rate

array([1.1 , 1.11, 1.13, ..., 1.12, 1.11, 1.09])
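
The loop above works here because none of the rates equals an integer between 1 and 12, so a value that has already been replaced is never matched again. An alternative that avoids comparing floats against month numbers is a lookup table indexed by month. The sketch below is only an illustration: `rates` copies the rounded values printed above, and `months` is a made-up sample of month codes.

```python
import numpy as np

# Rounded monthly rates, in the order printed for exchange_rate_data above
rates = np.array([1.13, 1.12, 1.08, 1.11, 1.1, 1.12, 1.09, 1.13, 1.13, 1.1, 1.06, 1.09])

# Hypothetical month codes; 0 marks a missing month
months = np.array([5, 0, 9, 6, 4, 12])

# Position 0 holds the fallback (the mean rate), positions 1..12 hold the monthly rates
lookup = np.concatenate(([rates.mean()], rates))

# Integer (fancy) indexing picks one rate per row in a single step
rate_per_row = lookup[months]
print(rate_per_row)
```

Because the month codes are used only as indices, the missing-month fallback and the month mapping happen in the same step.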

In [134]:
# The number of rows has to match the numeric array
# Compare both
exchange_rate.shape

(10000,)

In [135]:
arr_numeric.shape

(10000, 6)

In [136]:
# Reshape the 1-D array into a 2-D column of shape (10000, 1) so it can be concatenated with the numeric array
exchange_rate = np.reshape(exchange_rate, (10000,1))

In [137]:
exchange_rate.shape

(10000, 1)

- numpy.**h**stack - stack arrays in sequence horizontally (column wise).
https://numpy.org/doc/stable/reference/generated/numpy.hstack.html?highlight=hstack#numpy.hstack

In [138]:
# Horizontal concatenation of arrays
# To do this, both arrays must have the same number of rows
arr_numeric = np.hstack((arr_numeric, exchange_rate))
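
The reshape-then-stack pattern can be written without hard-coding the number of rows. The following sketch, on made-up arrays, shows three equivalent ways to append a 1-D vector as a new column.

```python
import numpy as np

matrix = np.arange(6.0).reshape(3, 2)   # hypothetical 3 x 2 table
vector = np.array([10.0, 20.0, 30.0])   # hypothetical new column, shape (3,)

# -1 lets NumPy infer the number of rows
col_a = vector.reshape(-1, 1)
# np.newaxis adds the second dimension by indexing
col_b = vector[:, np.newaxis]

stacked = np.hstack((matrix, col_a))
# column_stack accepts the 1-D vector directly and treats it as a column
stacked_alt = np.column_stack((matrix, vector))
print(stacked.shape)
```

Using `-1` or `np.column_stack` keeps the code valid if the number of rows in the dataset changes.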

In [139]:
# Add the new column name to the array of column names
header_numeric = np.concatenate((header_numeric, np.array(['exchange_rate'])))

In [140]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt', 'exchange_rate'], dtype='<U19')

Let's use the exchange rate column to create a euro (EUR) version of each dollar (USD) column.

In [141]:
# Indices of the columns that are expressed in dollars
columns_dollar = np.array([1,2,4,5])

In [142]:
# Exchange rate
arr_numeric[:,6]

array([1.1 , 1.11, 1.13, ..., 1.12, 1.11, 1.09])

In [143]:
# Shape
arr_numeric.shape

(10000, 7)

In [144]:
# Loop to create and add the 4 euro columns from the 4 dollar columns
# Note that each result is reshaped into a (10000, 1) column before it is stacked
for i in columns_dollar:
    arr_numeric = np.hstack((arr_numeric, np.reshape(arr_numeric[:,i] / arr_numeric[:,6], (10000,1))))

In [145]:
# Shape
arr_numeric.shape

(10000, 11)
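
The four divisions can also be done in one step with broadcasting, by dividing a block of dollar columns by the rate reshaped into a column. This is a sketch on a made-up table; `toy`, `usd_cols` and the column meanings are hypothetical.

```python
import numpy as np

# Hypothetical columns: id, amount_a (USD), amount_b (USD), exchange_rate
toy = np.array([[1.0, 125.0, 250.0, 1.25],
                [2.0, 150.0, 300.0, 1.5]])
usd_cols = [1, 2]
rate = toy[:, 3]

# (2, 2) block divided by a (2, 1) column: each row is divided by its own rate
eur = toy[:, usd_cols] / rate[:, np.newaxis]

# Append all euro columns at once
toy_with_eur = np.hstack((toy, eur))
print(toy_with_eur.shape)
```

The result has the same column order as the loop version: the original columns first, then one euro column per dollar column.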

In [146]:
# View
arr_numeric

array([[48010226.  ,    35000.  ,    35000.  , ...,    31933.3 ,     1081.04,     8624.69],
       [57693261.  ,    30000.  ,    30000.  , ...,    27132.46,      848.86,     4232.39],
       [59432726.  ,    15000.  ,    15000.  , ...,    13326.3 ,      439.64,     1750.04],
       ...,
       [50415990.  ,    10000.  ,    10000.  , ...,     8910.3 ,     1223.36,     1947.47],
       [46154151.  ,    35000.  ,    10000.  , ...,     8997.4 ,      318.78,     2878.63],
       [66055249.  ,    10000.  ,    10000.  , ...,     9145.8 ,      283.49,      276.11]])

In [147]:
# Adjust the column names
header_additional = np.array([column_name + '_EUR' for column_name in header_numeric[columns_dollar]])

In [148]:
header_additional

array(['loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'], dtype='<U15')

In [149]:
header_numeric = np.concatenate((header_numeric, header_additional))

In [150]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt', 'exchange_rate', 'loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'], dtype='<U19')

In [151]:
header_numeric[columns_dollar] = np.array([column_name + '_USD' for column_name in header_numeric[columns_dollar]])

In [152]:
header_numeric

array(['id', 'loan_amnt_USD', 'funded_amnt_USD', 'int_rate', 'installment_USD', 'total_pymnt_USD', 'exchange_rate', 'loan_amnt_EUR', 'funded_amnt_EUR', 'installment_EUR', 'total_pymnt_EUR'],
      dtype='<U19')

In [153]:
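# New column order: each USD column is followed by its EUR counterpart, and exchange_rate goes last
columns_index_order = [0, 1, 7, 2, 8, 3, 4, 9, 5, 10, 6]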

In [154]:
header_numeric = header_numeric[columns_index_order]

In [155]:
arr_numeric

array([[48010226.  ,    35000.  ,    35000.  , ...,    31933.3 ,     1081.04,     8624.69],
       [57693261.  ,    30000.  ,    30000.  , ...,    27132.46,      848.86,     4232.39],
       [59432726.  ,    15000.  ,    15000.  , ...,    13326.3 ,      439.64,     1750.04],
       ...,
       [50415990.  ,    10000.  ,    10000.  , ...,     8910.3 ,     1223.36,     1947.47],
       [46154151.  ,    35000.  ,    10000.  , ...,     8997.4 ,      318.78,     2878.63],
       [66055249.  ,    10000.  ,    10000.  , ...,     9145.8 ,      283.49,      276.11]])

In [156]:
arr_numeric = arr_numeric[:, columns_index_order]
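
Since the header and the data are reordered in two separate cells, the same index list must be applied to both, and it must be a full permutation of the column indices. A minimal sketch with made-up names and values:

```python
import numpy as np

# Hypothetical header and data with matching column order
header = np.array(['id', 'a_USD', 'b_USD', 'rate', 'a_EUR', 'b_EUR'])
data = np.array([[1, 10, 20, 2, 5, 10],
                 [2, 30, 40, 2, 15, 20]], dtype=float)

# Desired order: each USD column next to its EUR counterpart, rate last
order = [0, 1, 4, 2, 5, 3]

# Guard against dropping or duplicating a column by mistake
is_permutation = sorted(order) == list(range(data.shape[1]))

# Apply the same index list to the header and to the columns of the data
header_reordered = header[order]
data_reordered = data[:, order]
print(header_reordered)
```

Checking the permutation first catches a mistyped index before it silently loses a column.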

## Preprocessing variable int_rate

In [157]:
header_numeric

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate', 'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate'],
      dtype='<U19')

In [158]:
arr_numeric[:,5]

array([13.33, 28.99, 28.99, ..., 28.99, 16.55, 28.99])

In [159]:
# Divide by 100 to convert the percentage to a fraction
arr_numeric[:,5] = arr_numeric[:,5] / 100

In [160]:
arr_numeric[:,5]

array([0.13, 0.29, 0.29, ..., 0.29, 0.17, 0.29])

## Checkpoint with the cleaned and preprocessed numeric data
### Checkpoint 3

In [161]:
checkpoint_numeric = checkpoint("datasets/Checkpoint-Numeric", header_numeric, arr_numeric)

In [162]:
checkpoint_numeric['header'], checkpoint_numeric['data']

(array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate', 'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate'],
       dtype='<U19'),
 array([[48010226.  ,    35000.  ,    31933.3 , ...,     9452.96,     8624.69,        1.1 ],
        [57693261.  ,    30000.  ,    27132.46, ...,     4679.7 ,     4232.39,        1.11],
        [59432726.  ,    15000.  ,    13326.3 , ...,     1969.83,     1750.04,        1.13],
        ...,
        [50415990.  ,    10000.  ,     8910.3 , ...,     2185.64,     1947.47,        1.12],
        [46154151.  ,    35000.  ,    31490.9 , ...,     3199.4 ,     2878.63,        1.11],
        [66055249.  ,    10000.  ,     9145.8 , ...,      301.9 ,      276.11,        1.09]]))
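
The `checkpoint` function is defined earlier in the notebook and is not repeated here. As a reminder of the general idea, the sketch below shows one plausible shape for such a helper, assuming it is built on `np.savez` and `np.load`. The name `save_checkpoint` and the toy arrays are hypothetical, and the real function may differ.

```python
import os
import tempfile
import numpy as np

def save_checkpoint(file_name, header, data):
    """Hypothetical helper: write header and data to <file_name>.npz and load them back."""
    np.savez(file_name, header=header, data=data)
    # np.savez appends the .npz extension when it is missing
    return np.load(file_name + ".npz")

# Made-up header and data
header = np.array(['id', 'amount'])
data = np.array([[1.0, 10.0], [2.0, 20.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    restored = save_checkpoint(os.path.join(tmp_dir, "toy-checkpoint"), header, data)
    # Compare what came back from disk with what is in memory
    same_data = np.array_equal(restored['data'], data)
    same_header = np.array_equal(restored['header'], header)
    restored.close()

print(same_data, same_header)
```

Comparing the reloaded arrays with the in-memory ones, as done with `np.array_equal` earlier in the notebook, confirms that nothing was lost in the round trip.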

## Building the final dataset

In [163]:
# Both datasets must have the same number of rows, since we never added or removed rows
checkpoint_strings['data'].shape

(10000, 6)

In [164]:
checkpoint_numeric['data'].shape

(10000, 11)

In [165]:
# Concatenate datasets
df_final = np.hstack((checkpoint_numeric['data'], checkpoint_strings['data']))

In [166]:
df_final

array([[48010226.  ,    35000.  ,    31933.3 , ...,       13.  ,        1.  ,        1.  ],
       [57693261.  ,    30000.  ,    27132.46, ...,        5.  ,        1.  ,        4.  ],
       [59432726.  ,    15000.  ,    13326.3 , ...,       10.  ,        1.  ,        4.  ],
       ...,
       [50415990.  ,    10000.  ,     8910.3 , ...,        5.  ,        1.  ,        1.  ],
       [46154151.  ,    35000.  ,    31490.9 , ...,       17.  ,        1.  ,        3.  ],
       [66055249.  ,    10000.  ,     9145.8 , ...,        4.  ,        0.  ,        3.  ]])

In [167]:
# Check whether any missing values (nan) remain after all this work
np.isnan(df_final).sum()

0

In [168]:
# Concatenate the column names
header_full = np.concatenate((checkpoint_numeric['header'], checkpoint_strings['header']))

In [169]:
header_full

array(['id', 'loan_amnt_USD', 'loan_amnt_EUR', 'funded_amnt_USD', 'funded_amnt_EUR', 'int_rate', 'installment_USD', 'installment_EUR', 'total_pymnt_USD', 'total_pymnt_EUR', 'exchange_rate',
       'issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'state_address'], dtype='<U19')

In [170]:
# Sort the final dataset by the Id Column (index :,0)
df_final = df_final[np.argsort(df_final[:,0])]

In [171]:
df_final

array([[  373332.  ,     9950.  ,     9038.08, ...,       21.  ,        0.  ,        1.  ],
       [  575239.  ,    12000.  ,    10900.2 , ...,       25.  ,        1.  ,        2.  ],
       [  707689.  ,    10000.  ,     8924.3 , ...,       13.  ,        1.  ,        0.  ],
       ...,
       [68614880.  ,     5600.  ,     5121.65, ...,        8.  ,        1.  ,        1.  ],
       [68615915.  ,     4000.  ,     3658.32, ...,       10.  ,        1.  ,        2.  ],
       [68616519.  ,    21600.  ,    19754.93, ...,        3.  ,        0.  ,        2.  ]])

In [172]:
# Check the sort on the id column: argsort should now return 0, 1, 2, ... in order
np.argsort(df_final[:,0])

array([   0,    1,    2, ..., 9997, 9998, 9999])
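
Reading the truncated `argsort` output only shows the first and last indices. A check that returns a single boolean covers the whole column. A minimal sketch on a made-up id column:

```python
import numpy as np

# Hypothetical unsorted ids
ids = np.array([30.0, 10.0, 20.0])

# argsort returns the positions that would sort the array
order = np.argsort(ids)
sorted_ids = ids[order]

# A column is sorted in ascending order when no consecutive difference is negative
is_sorted = bool(np.all(np.diff(sorted_ids) >= 0))
print(order, is_sorted)
```

The same `np.diff` test applied to the original `ids` returns False, which makes it usable as a before and after check.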

## Save the Final Dataset Cleaned and Preprocessed

- numpy.**v**stack - Stack arrays in sequence vertically (row wise).
https://numpy.org/doc/stable/reference/generated/numpy.vstack.html?highlight=vstack#numpy.vstack

In [173]:
df_final = np.vstack((header_full, df_final))

In [174]:
df_final

array([['id', 'loan_amnt_USD', 'loan_amnt_EUR', ..., 'sub_grade', 'verification_status', 'state_address'],
       ['373332.0', '9950.0', '9038.082814338286', ..., '21.0', '0.0', '1.0'],
       ['575239.0', '12000.0', '10900.20037910145', ..., '25.0', '1.0', '2.0'],
       ...,
       ['68614880.0', '5600.0', '5121.647851612413', ..., '8.0', '1.0', '1.0'],
       ['68615915.0', '4000.0', '3658.319894008867', ..., '10.0', '1.0', '2.0'],
       ['68616519.0', '21600.0', '19754.927427647883', ..., '3.0', '0.0', '2.0']], dtype='<U32')

In [175]:
# Save to disk
np.savetxt("datasets/cleaned_preprocessed_dataset.csv",
          df_final,
          fmt = '%s',
          delimiter = ',')
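
Stacking the header on top of the data converts every number to a string, which is why values such as `'9038.082814338286'` appear above. An alternative, sketched here on made-up data, is to keep the array numeric and pass the column names through the `header` argument of `np.savetxt`. The in-memory buffer stands in for a file on disk, and the two-decimal format is an assumption made for the example.

```python
import io
import numpy as np

# Hypothetical header and numeric data
header = np.array(['id', 'amount', 'rate'])
data = np.array([[1.0, 10.5, 0.13],
                 [2.0, 20.25, 0.29]])

buffer = io.StringIO()
np.savetxt(buffer,
           data,
           fmt='%.2f',                # fixed number of decimals for every value
           delimiter=',',
           header=','.join(header),   # written as the first line
           comments='')               # suppress the default '# ' prefix on the header line
text = buffer.getvalue()
print(text)

# Read the data back, skipping the header line
reloaded = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=1)
```

A fixed format such as `'%.2f'` rounds the stored values, so choose a precision that suits the data before relying on this approach.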

# The End