# Big Data Real-Time Analytics with Python and Spark

## Chapter 2 - Case Study 1 - Cleaning and Processing Data with NumPy

- Documentation: https://numpy.org/
- Data from: https://www.openintro.org/data/index.php

![Case Study DSA](images/CaseStudy1.png "Case Study DSA")

In [1]:
# Python version
from platform import python_version
print('The version used in this notebook is: ', python_version())

The version used in this notebook is:  3.8.8


In [2]:
# Import the only library we will use here
import numpy as np

In [3]:
# Warning filter
import warnings
warnings.filterwarnings('ignore')

In [4]:
# package version used in this notebook
%reload_ext watermark
%watermark -a "Bianca Amorim" --iversion

Author: Bianca Amorim

numpy: 1.21.5



- https://numpy.org/doc/stable/reference/generated/numpy.set_printoptions.html

In [5]:
# print setting in Numpy
np.set_printoptions(suppress = True, linewidth = 200, precision = 2)

## Loading the dataset

- https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html

In [6]:
# loading the data set 
dataset = np.genfromtxt("datasets/dataset1.csv",
                        delimiter = ";",
                        skip_header = 1,
                        autostrip = True,
                        encoding = 'cp1252')

In [7]:
# Check the type (ndarray is NumPy's N-dimensional array)
type(dataset)

numpy.ndarray

In [8]:
# 10000 rows and 14 columns
dataset.shape

(10000, 14)

In [9]:
dataset.view()

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])

**"nan"** means "not a number". The file has no empty columns, though: NumPy simply could not parse some of the data. By default genfromtxt reads every field as a float, so text fields and fields with special characters come back as nan, alongside the values that are genuinely missing.
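
The effect is easy to reproduce on a tiny in-memory file. The sketch below is only an illustration: the two-row sample is made up, and it assumes the same ";" delimiter as the real dataset.

```python
import io
import numpy as np

# Hypothetical two-row sample in the same ';'-separated layout
sample = "id;issue_d;loan_amnt\n1;May-15;35000\n2;;30000\n"

# With the default float dtype, text fields and empty fields both become nan
parsed = np.genfromtxt(io.StringIO(sample),
                       delimiter=";",
                       skip_header=1,
                       encoding="utf-8")
print(parsed)
```

The second column is nan on both rows: once because the text could not be parsed, once because the field was empty. The two cases cannot be told apart after loading.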

## Checking for Missing Values

In [10]:
# Count the total number of missing values
# Many of these were generated when we loaded the data
np.isnan(dataset).sum()

88005

- https://numpy.org/doc/stable/reference/generated/numpy.nanmax.html

In [11]:
# Highest value in the dataset + 1, ignoring nan values
# We will use it to fill missing values when we reload the numerical variables,
# and later treat this value as the marker for a missing value
joker_value = np.nanmax(dataset) + 1
print(joker_value)

68616520.0


In [12]:
# Without the nan-aware function above, the maximum is nan
np.max(dataset)

nan
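
A minimal sketch of the same idea on a made-up 2x2 array (the array and variable names are hypothetical): nan propagates through np.max, np.nanmax skips it, and max + 1 gives a marker value that cannot collide with real data.

```python
import numpy as np

toy = np.array([[1.0, np.nan],
                [3.0, 2.0]])          # hypothetical data

plain_max = np.max(toy)               # nan propagates
safe_max = np.nanmax(toy)             # nan entries are skipped
sentinel = safe_max + 1               # larger than every real value

# Replace the missing entries with the marker value
filled = np.where(np.isnan(toy), sentinel, toy)
print(plain_max, safe_max, sentinel)
print(filled)
```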

- https://numpy.org/doc/stable/reference/generated/numpy.nanmean.html

In [13]:
# We calculate the average of the numerical variables ignoring the nan values in the column
# We will use this to separate numerical variables from string variables
average_ignoring_nan = np.nanmean(dataset, axis = 0)
print(average_ignoring_nan)

[54015809.19         nan    15273.46         nan    15311.04         nan       16.62      440.92         nan         nan         nan         nan         nan     3143.85]


- argwhere returns the positions of the elements that are non-zero (True)
https://numpy.org/doc/stable/reference/generated/numpy.argwhere.html


- squeeze removes the extra dimension, because we don't need [[1], [3], ...]
https://numpy.org/doc/stable/reference/generated/numpy.squeeze.html


In [14]:
# Columns that hold strings (their mean is nan)
# squeeze() flattens the (n, 1) result of argwhere into a 1-D array
string_columns = np.argwhere(np.isnan(average_ignoring_nan)).squeeze()
string_columns

array([ 1,  3,  5,  8,  9, 10, 11, 12])

In [15]:
# argwhere above returns a position only where the value is True (1)
# False counts as 0, and argwhere skips zeros
np.isnan(average_ignoring_nan)

array([False,  True, False,  True, False,  True, False, False,  True,  True,  True,  True,  True, False])

In [16]:
# Select the numerical columns
numerical_columns = np.argwhere(np.isnan(average_ignoring_nan) == False).squeeze()
numerical_columns

array([ 0,  2,  4,  6,  7, 13])

In [17]:
# Now True marks the columns whose mean is not nan
np.isnan(average_ignoring_nan) == False

array([ True, False,  True, False,  True, False,  True,  True, False, False, False, False, False,  True])
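
The shapes involved are easier to see on a short vector. This is a sketch with hypothetical column means; it also shows two equivalent spellings that are not used in the notebook: `~mask` instead of `mask == False`, and np.flatnonzero instead of argwhere followed by squeeze.

```python
import numpy as np

col_means = np.array([5.0, np.nan, 7.5, np.nan])   # hypothetical column means

is_text = np.isnan(col_means)

positions = np.argwhere(is_text)             # one row per True element
text_cols = positions.squeeze()              # drop the length-1 axis
num_cols = np.argwhere(~is_text).squeeze()   # ~ inverts the boolean mask
flat_alt = np.flatnonzero(is_text)           # 1-D positions in a single call

print(positions.shape, text_cols, num_cols, flat_alt)
```

One design point: if only a single column matches, squeeze() returns a 0-dimensional array, while np.flatnonzero always stays 1-D.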

> Import the dataset again, separating string columns from numeric columns

In [18]:
# Load only the string columns
# We pass their indices and the data type
arr_strings = np.genfromtxt("datasets/dataset1.csv",
                           delimiter = ";",
                           skip_header = 1,
                           autostrip = True,
                           usecols = string_columns,
                           dtype = str,
                           encoding = 'cp1252')

In [19]:
arr_strings

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

In [20]:
# Load only the numerical columns
# We pass their indices; the default float data type applies
# filling_values sets the value used when a field is missing
arr_numeric = np.genfromtxt("datasets/dataset1.csv",
                           delimiter = ";",
                           skip_header = 1,
                           autostrip = True,
                           usecols = numerical_columns,
                           filling_values = joker_value,
                           encoding = 'cp1252')

In [21]:
arr_numeric

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  , 68616520.  ,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  , 68616520.  ,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  , 68616520.  , 68616520.  ,     2185.64],
       [46154151.  , 68616520.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  , 68616520.  ,      309.97,      301.9 ]])

> Now we extract the column names. We skipped them before (skip_header = 1) because they are all strings.

In [22]:
# Load the column names
arr_columns_name = np.genfromtxt("datasets/dataset1.csv",
                                delimiter = ";",
                                autostrip = True,
                                skip_footer = dataset.shape[0],
                                dtype = str,
                                encoding = 'cp1252')

In [23]:
arr_columns_name

array(['id', 'issue_d', 'loan_amnt', 'loan_status', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state', 'total_pymnt'], dtype='<U19')

> "skip_footer" is the number of lines to skip at the end of the file. Skipping as many lines as there are data rows leaves only the header.
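
A small sketch of the header-only read, using a made-up three-line file. It also shows max_rows = 1, another genfromtxt parameter, as an alternative that does not need the row count; the notebook itself uses skip_footer.

```python
import io
import numpy as np

# Hypothetical file: one header line followed by two data rows
sample = "id;issue_d;loan_amnt\n1;May-15;35000\n2;Jun-15;30000\n"
n_data_rows = 2

# Skip every data row at the end, so only the header line is parsed
header = np.genfromtxt(io.StringIO(sample),
                       delimiter=";",
                       dtype=str,
                       skip_footer=n_data_rows,
                       encoding="utf-8")

# Alternative: read just the first line (cannot be combined with skip_footer)
header_alt = np.genfromtxt(io.StringIO(sample),
                           delimiter=";",
                           dtype=str,
                           max_rows=1,
                           encoding="utf-8")
print(header)
```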

In [24]:
# Separate numerical and string column headers
header_strings, header_numerical = arr_columns_name[string_columns], arr_columns_name[numerical_columns]

In [25]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [26]:
header_numerical

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

## Checkpoint function
#### Checkpoint 1
We will create a checkpoint function to save the intermediate results

In [27]:
# A function that saves to disk what we have so far and loads it back.
# The caller chooses what to save.
def checkpoint(file_name, checkpoint_header, checkpoint_data):
    # np.savez appends ".npz" to file_name when the extension is missing
    np.savez(file_name, header = checkpoint_header, data = checkpoint_data)
    checkpoint_variable = np.load(file_name + ".npz")
    return checkpoint_variable

In [28]:
# Save the string arrays here, because they are the most critical
checkpoint_inicial = checkpoint("datasets/Checkpoint-Inicial", header_strings, arr_strings)

In [29]:
checkpoint_inicial['data']

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

In [30]:
# Check that the reloaded array equals the original strings array. They must be equal.
np.array_equal(checkpoint_inicial['data'], arr_strings)

True
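
The round trip can be tried in isolation. The sketch below writes to a temporary directory instead of datasets/, with made-up arrays, and closes the archive explicitly with a context manager.

```python
import os
import tempfile
import numpy as np

header = np.array(["issue_d", "term"])                 # hypothetical header
data = np.array([["May-15", "36 months"],
                 ["", "60 months"]])                   # hypothetical data

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint-demo")
    np.savez(path, header=header, data=data)           # writes checkpoint-demo.npz
    with np.load(path + ".npz") as restored:
        keys = sorted(restored.files)                  # names given as keyword arguments
        restored_header = restored["header"]
        restored_data = restored["data"]

print(keys, np.array_equal(restored_data, data))
```

The keyword names passed to np.savez become the keys of the loaded archive, which is why the checkpoint is read back with ['header'] and ['data'].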

## Manipulating string columns

In [31]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [32]:
# We will start with the first column
# Rename it to make the column easier to identify
header_strings[0] = "issue_date"

In [33]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [34]:
arr_strings

array([['May-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '', 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']], dtype='<U69')

## Preprocessing the variable issue_date with Label Encoding
Label encoding is a preprocessing strategy for categorical variables.

In [35]:
# Extract the unique values of this variable
np.unique(arr_strings[:,0])

array(['', 'Apr-15', 'Aug-15', 'Dec-15', 'Feb-15', 'Jan-15', 'Jul-15', 'Jun-15', 'Mar-15', 'May-15', 'Nov-15', 'Oct-15', 'Sep-15'], dtype='<U69')

- Notice that every value is a month followed by -15. The suffix is the same on every row (it appears to be the two-digit year, 2015, rather than a day of the month), so it carries no information. We need only the month, and then we can apply label encoding, because we cannot feed text to the ML model.

In [36]:
# strip removes the characters '-', '1' and '5' from both ends of each string; we save the result in the same column
# NumPy is excellent! Doing this with other tools is not as easy.
arr_strings[:,0] = np.char.strip(arr_strings[:,0], "-15")

In [37]:
np.unique(arr_strings[:,0])

array(['', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'], dtype='<U69')
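
One detail worth keeping in mind: strip treats its argument as a set of characters to remove from both ends, not as a literal suffix. The sketch below uses a made-up second value, 'Oct-11', to show the difference from np.char.replace, which removes the exact substring. It is safe in this dataset only because every value ends in -15 and no month name starts or ends with those characters.

```python
import numpy as np

dates = np.array(["May-15", "Oct-11"])     # 'Oct-11' is a hypothetical value

# strip: removes any of '-', '1', '5' from both ends
stripped = np.char.strip(dates, "-15")

# replace: removes only the exact substring '-15'
replaced = np.char.replace(dates, "-15", "")

print(stripped)
print(replaced)
```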

In [38]:
# Notice the empty string: it marks a missing value, and we have to handle it too
# First we create an array with the months, keeping '' at index 0
months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

In [39]:
# Loop to convert the month names to numbers
# This is called "label encoding"
for i in range(13):
    arr_strings[:,0] = np.where(arr_strings[:,0] == months[i], i, arr_strings[:,0])

The loop above checks every value in column 0 against each entry of the months array. Where the value equals months[i], it is replaced by the index i; otherwise the current value is kept.
This replaces every month with its number and the missing value with 0. This is a good strategy, because there is no month 0, so a 0 always indicates a missing value.
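
The same loop on a four-element column, as a sketch with made-up values. Two choices here are assumptions rather than the notebook's code: the code is written as a string explicitly (the column holds strings anyway), and the result is cast to integers at the end.

```python
import numpy as np

months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
column = np.array(['May', '', 'Sep', 'Jan'])     # hypothetical column

encoded = column.copy()
for code, name in enumerate(months):
    # Where the value matches the month name, write its index as text
    encoded = np.where(encoded == name, str(code), encoded)

encoded_int = encoded.astype(int)                # numeric labels, 0 = missing
print(encoded, encoded_int)
```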

In [40]:
np.unique(arr_strings[:,0])

array(['0', '1', '10', '11', '12', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U69')

## Preprocessing the variable loan_status with binarization

In [41]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [42]:
# Extract the unique values from the column loan_status
np.unique(arr_strings[:,1])

array(['', 'Charged Off', 'Current', 'Default', 'Fully Paid', 'In Grace Period', 'Issued', 'Late (16-30 days)', 'Late (31-120 days)'], dtype='<U69')

In [43]:
# Count the unique values
np.unique(arr_strings[:,1]).size

9

At this point we have to know (or ask) what matters, because we may not need every level of detail and can group the categories. Here we only need to know whether the loan status is good or bad, so we create a list of the bad statuses to use as a reference.

If a category is in status_bad we assign one value (0); if not, we assign the other (1).

In [44]:
# Create the array with the bad statuses
# Missing values go here too, because we do not know what they are
status_bad = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])

In [45]:
# Check whether each value is in status_bad and convert the column to binary values
arr_strings[:,1] = np.where(np.isin(arr_strings[:,1], status_bad),0,1)

In [46]:
# Extract the unique values from the column loan_status to confirm
np.unique(arr_strings[:,1])

array(['0', '1'], dtype='<U69')
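
A standalone sketch of the binarization on five made-up statuses. Unlike the notebook, the result is kept in a separate integer array, which also makes it easy to count how many rows fall in each class.

```python
import numpy as np

status = np.array(['Current', 'Default', '', 'Fully Paid', 'Charged Off'])  # hypothetical
status_bad = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])

is_bad = np.isin(status, status_bad)       # True where the status is in the bad list
flag = np.where(is_bad, 0, 1)              # 0 = bad, 1 = good

values, counts = np.unique(flag, return_counts=True)
print(flag, values, counts)
```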

## Preprocessing the variable term with string cleaning

In [47]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

In [48]:
# The data source was clearly not prepared with ML analysis in mind
# Here the values mix numbers and text
np.unique(arr_strings[:,2])

array(['', '36 months', '60 months'], dtype='<U69')

In [49]:
# Remove the word "months"
# Mind the leading space: it has to go too, because we need only the number
arr_strings[:,2] = np.char.strip(arr_strings[:,2], " months")

In [50]:
# Rename the variable so it is clear that the column holds a number of months
header_strings[2] = "term_months"

In [51]:
# Missing values are a problem and cannot stay as they are; they always need some treatment.
# We have to decide what to do with them
arr_strings[:,2] = np.where(arr_strings[:,2] == '', '60', arr_strings[:,2])

- **Note:** When we do not know how long the person will take to pay, a sensible choice is the largest term available. Using 0 makes no sense, because every loan takes some time to repay, and we cannot assume the shortest term either.
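
A sketch of the same rule where the fill value is derived from the data instead of being typed in as '60'. The sample values and variable names are hypothetical.

```python
import numpy as np

term = np.array(['36 months', '', '60 months', '36 months'])   # hypothetical column

# Drop the ' months' characters, leaving the number or an empty string
cleaned = np.char.strip(term, ' months')

# Largest term actually observed among the non-missing values
longest = cleaned[cleaned != ''].astype(int).max()

# Fill the missing entries with that term and convert everything to integers
filled = np.where(cleaned == '', str(longest), cleaned).astype(int)
print(cleaned, longest, filled)
```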

In [52]:
arr_strings[:,2]

array(['36', '36', '36', ..., '36', '36', '36'], dtype='<U69')

In [53]:
np.unique(arr_strings[:,2])

array(['36', '60'], dtype='<U69')

## Preprocessing variables grade and subgrade with dictionary (A Label Encoding Type)

In [54]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state'], dtype='<U19')

**Note:** As an analyst, you must pay attention to the variable names and what they represent. In this example, 'grade' and 'sub_grade' seem to be related. They have similar names, so you have to check whether they carry similar information.

In [55]:
# Extract the unique values from the column grade
np.unique(arr_strings[:,3])

array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [56]:
# Extract the unique values from the column sub_grade
np.unique(arr_strings[:,4])

array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1',
       'G2', 'G3', 'G4', 'G5'], dtype='<U69')

**Note:** If two variables carry the same information, it makes no sense to keep both. That hurts ML models, because the same information gets reinforced.
**_Keeping both variables is not a good decision._** It makes more sense to keep sub_grade, which has more detail.

In [57]:
np.unique(arr_strings[:,3])

array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [58]:
# Unique values of the valid categories (without the missing value)
# The slice [1:] drops the empty string, which sorts first; we use it below
# This leaves only valid values; the empty string marks missing data
np.unique(arr_strings[:,3])[1:]

array(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [59]:
# Loop to fill in the variable sub_grade
# Go through each grade, skipping the missing value
for i in np.unique(arr_strings[:,3])[1:]:
    arr_strings[:,4] = np.where((arr_strings[:,4] == '') & (arr_strings[:,3] == i), i + '5', arr_strings[:,4])

**I do this to make sure the two variables stay consistent with each other.**
For each category in the grade column, the loop applies two checks to every row: first, whether sub_grade is missing; second, whether grade equals i. If both conditions are true, sub_grade is replaced with the concatenation i + '5' (the last sub-grade of that grade); otherwise the existing value is kept.
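
The two checks can be followed row by row on a small made-up example. As a variation on the notebook's [1:] slice, this sketch skips the empty grade with an explicit test, and it ends with the H1 fallback for rows where both fields are missing.

```python
import numpy as np

grade = np.array(['A', 'B', '', 'C'])          # hypothetical grades
sub_grade = np.array(['A2', '', '', ''])       # hypothetical sub-grades

for g in np.unique(grade):
    if g == '':
        continue                               # nothing to infer from a missing grade
    # Fill only rows where sub_grade is missing AND the grade is g
    sub_grade = np.where((sub_grade == '') & (grade == g), g + '5', sub_grade)

after_loop = sub_grade.copy()

# Rows with neither grade nor sub_grade get a category of their own
sub_grade = np.where(sub_grade == '', 'H1', sub_grade)
print(after_loop, sub_grade)
```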

In [60]:
# Print each value with its count
np.unique(arr_strings[:,4], return_counts = True)

(array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5', 'G1',
        'G2', 'G3', 'G4', 'G5'], dtype='<U69'),
 array([  9, 285, 278, 239, 323, 592, 509, 517, 530, 553, 633, 629, 567, 586, 564, 577, 391, 267, 250, 255, 288, 235, 162, 171, 139, 160,  94,  52,  34,  43,  24,  19,  10,   3,   7,   5]))

In [61]:
# Some missing values remain, because the loop required both conditions (these rows have no grade either).
# They still need treatment
# Replace the missing value with H1 (a category that does not exist yet)
arr_strings[:,4] = np.where(arr_strings[:,4] == '', 'H1', arr_strings[:,4])

Delete the grade column because we do not need it anymore

In [62]:
# Delete the grade column
arr_strings = np.delete(arr_strings, 3, axis = 1)

**Note:** When we delete a column, the remaining columns shift: column 3 is now sub_grade.

In [64]:
arr_strings[:,3]

array(['C3', 'A5', 'B5', ..., 'A5', 'D2', 'A4'], dtype='<U69')
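
The index shift can be checked on a small table. The sketch below uses made-up rows and looks the column up by name instead of hard-coding 3. It also deletes the same position from the header array, on the assumption that the header should keep describing the data columns.

```python
import numpy as np

header = np.array(['issue_date', 'loan_status', 'term_months', 'grade', 'sub_grade'])
table = np.array([['5', '1', '36', 'C', 'C3'],
                  ['0', '1', '36', 'A', 'A5']])      # hypothetical rows

# Find the position of the column by name
grade_idx = int(np.flatnonzero(header == 'grade')[0])

# np.delete returns a new array; axis=1 removes a column
table = np.delete(table, grade_idx, axis=1)
header = np.delete(header, grade_idx)

print(header[grade_idx], table[:, grade_idx])
```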