# Loan Data Analysis Project
### Elaina S, 12-29-2021

## Import Numpy Package

In [1]:
import numpy as np

suppress = True makes NumPy print floats in fixed-point notation instead of scientific notation.
linewidth = 100 extends the number of characters in a line of output to 100.
precision = 2 displays only 2 digits after the decimal point.
These settings help keep each row of the array on a single line of output, which makes the results easier to read.

In [2]:
np.set_printoptions(suppress = True, linewidth = 100, precision = 2)
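To see what these options change, the following sketch formats the same array twice. It is only an illustration: the sample values are made up, and it uses np.array2string with explicit arguments so the result does not depend on the global print settings.

```python
import numpy as np

# Made-up values: one large number and one very small number.
sample = np.array([1234567.891, 0.00001])

# Explicit formatting arguments, independent of the global print options.
default_style = np.array2string(sample, precision=8, suppress_small=False)
notebook_style = np.array2string(sample, precision=2, suppress_small=True)

print(default_style)   # scientific notation
print(notebook_style)  # fixed-point notation with 2 decimals
```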

## Import the Data from the .csv File

In [3]:
raw_data_np = np.genfromtxt("loan-data.csv", delimiter = ';', encoding = 'ISO8859', skip_header = 1, autostrip = True)
raw_data_np

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])

raw_data_np contains nan values, which mark either text that could not be parsed as a number or missing entries.
skip_header = 1 omits the first line, which holds the names of the columns. 
autostrip = True removes leading and trailing whitespace from each value, since stray spaces can distort the columns.
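The behaviour described above can be reproduced on a tiny in-memory file. This is a minimal sketch: the three rows and the column names are invented, and io.StringIO stands in for the real .csv file.

```python
import numpy as np
from io import StringIO

# Made-up three-row file: a numeric id, a text status, and a numeric amount.
csv_text = """id;status;amount
1;Current;1000
2;Fully Paid;
3;;2500
"""

sample = np.genfromtxt(StringIO(csv_text), delimiter=';', skip_header=1, autostrip=True)

print(sample)                  # the text column and the empty field are nan
print(np.isnan(sample).sum())  # number of nan entries
```

Both the text values and the genuinely missing amount end up as nan, so nan alone cannot tell the two cases apart.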

## Check for Incomplete Data

There are 88005 nan (not a number) values in the data. 

In [4]:
np.isnan(raw_data_np).sum()

88005

In [5]:
temporary_fill = np.nanmax(raw_data_np) + 1
temporary_mean = np.nanmean(raw_data_np, axis = 0)

  


nanmax() and nanmean() return the max and mean of an array, ignoring nan values.
temporary_fill is a temporary filler for all missing entries in the dataset.
temporary_mean holds the means for every column.

The call to nanmean raises 'RuntimeWarning: Mean of empty slice', which means some columns in the data are entirely nan. The default data type for genfromtxt() is float, so any value that cannot be parsed as a number becomes nan, and a column of only text is therefore all nan and treated as empty. Such columns may still hold important text data, and looking at the .csv file confirms that the data has string values. Unlike an error, a warning does not stop execution, so we can proceed.  
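As a small illustration of that warning, the sketch below runs nanmean on a made-up array whose second column is all nan and records the warning instead of printing it. The array and variable names are assumptions for the example only.

```python
import warnings
import numpy as np

# Made-up array whose second column is entirely nan, as a text column would be.
sample = np.array([[1.0, np.nan],
                   [3.0, np.nan]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")       # make sure the warning is recorded
    col_means = np.nanmean(sample, axis=0)

print(col_means)                              # mean of column 0, nan for column 1
print([w.category.__name__ for w in caught])  # the warning raised along the way
```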

In [6]:
temporary_mean

array([54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
            440.92,         nan,         nan,         nan,         nan,         nan,     3143.85])

In [7]:
np.isnan(temporary_mean).sum()

8

temporary_mean shows that 8 columns have nan as their mean, meaning those columns contain no numeric values.  
Storing numbers and text in the same array limits what we can do with the dataset, so we will split the data into a numeric array and a string array. First, get the minimum and maximum values of each numeric column using nanmin and nanmax. These calls raise warnings as well, for the same reason: some columns contain no floats.

In [8]:
temporary_stats = np.array(
    [np.nanmin(raw_data_np, axis = 0), 
     temporary_mean, 
     np.nanmax(raw_data_np, axis = 0)])

  


In [9]:
temporary_stats

array([[  373332.  ,         nan,     1000.  ,         nan,     1000.  ,         nan,        6.  ,
              31.42,         nan,         nan,         nan,         nan,         nan,        0.  ],
       [54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
             440.92,         nan,         nan,         nan,         nan,         nan,     3143.85],
       [68616519.  ,         nan,    35000.  ,         nan,    35000.  ,         nan,       28.99,
            1372.97,         nan,         nan,         nan,         nan,         nan,    41913.62]])

temporary_stats is a 2-D array that holds 3 1-dimensional arrays. The first array is the min of each column, the second array is the mean of each column, and the third array is the max of each column. 

## Split the Dataset

In [10]:
columns_strings = np.argwhere(np.isnan(temporary_mean)).squeeze()
columns_strings

array([ 1,  3,  5,  8,  9, 10, 11, 12])

In [11]:
columns_numeric = np.argwhere(~np.isnan(temporary_mean)).squeeze()
columns_numeric

array([ 0,  2,  4,  6,  7, 13])

If a column contains only text, its mean is nan, so np.isnan() returns True for the mean of that column. argwhere() returns the indices of the elements of an array that are non-zero (here, True). Its result has one row per match, so without .squeeze() each index sits in its own row of a 2-D array. .squeeze() removes the length-1 axis and stores the results in a 1-dimensional array. 

columns_strings is an array containing the indices of the string columns and columns_numeric is an array containing the indices of the numeric columns.
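The shapes involved are easy to check on a short made-up vector of column means. The sketch also shows np.flatnonzero, which returns 1-D indices directly; it is offered as an alternative, not as what the notebook uses.

```python
import numpy as np

means = np.array([5.0, np.nan, 2.5, np.nan])    # made-up column means

raw_idx = np.argwhere(np.isnan(means))          # shape (2, 1): one row per match
flat_idx = raw_idx.squeeze()                    # shape (2,)
numeric_idx = np.flatnonzero(~np.isnan(means))  # 1-D indices in a single call

print(raw_idx.shape, flat_idx, numeric_idx)
```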

## Re-Import the Dataset as 2 Separate Arrays

In [12]:
loan_data_strings = np.genfromtxt("loan-data.csv", dtype = str, delimiter = ';', skip_header = 1, usecols = columns_strings, autostrip = True, encoding = 'ISO8859')

loan_data_strings

array([['May-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']],
      dtype='<U69')

In [13]:
loan_data_numeric = np.genfromtxt("loan-data.csv", dtype = str, delimiter = ';', skip_header = 1, filling_values = temporary_fill, usecols = columns_numeric, autostrip = True, encoding = 'ISO8859')

loan_data_numeric

array([['48010226', '35000.0', '35000.0', '13.33', '1184.86', '9452.96'],
       ['57693261', '30000.0', '30000.0', 'þëè.89', '938.57', '4679.7'],
       ['59432726', '15000.0', '15000.0', 'íîå.53', '494.86', '1969.83'],
       ...,
       ['50415990', '10000.0', '10000.0', 'þëè.89', '', '2185.64'],
       ['46154151', '', '10000.0', '16.55', '354.3', '3199.4'],
       ['66055249', '10000.0', '10000.0', 'þëè.26', '309.97', '301.9']], dtype='<U11')

Here we pass the optional dtype parameter to genfromtxt() to set the data type of the resulting array. Both calls use a string dtype, so at this stage loan_data_strings and loan_data_numeric both hold strings. The usecols parameter restricts each import to the chosen column indices: loan_data_strings holds only the text columns of each row, and loan_data_numeric holds only the numeric columns. We manipulate the two datasets separately because missing values are handled differently for strings and numbers. 

In [14]:
header_full = np.genfromtxt("loan-data.csv", dtype = str, delimiter = ';', skip_footer = raw_data_np.shape[0], autostrip = True, encoding = 'ISO8859')
header_full


array(['id', 'issue_d', 'loan_amnt', 'loan_status', 'funded_amnt', 'term', 'int_rate',
       'installment', 'grade', 'sub_grade', 'verification_status', 'url', 'addr_state',
       'total_pymnt'], dtype='<U19')

We set skip_footer to the number of rows in raw_data_np, which tells genfromtxt to skip every data row at the end of the file and leaves only the header, the first row at index 0. header_full contains all the column names.
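Another way to read only the header, sketched here on a made-up in-memory file, is genfromtxt's max_rows parameter, which stops after a given number of lines and so does not need the row count. This is an alternative for illustration, not the approach used in this notebook.

```python
import numpy as np
from io import StringIO

# Made-up file contents standing in for the real .csv file.
csv_text = """id;status;amount
1;Current;1000
2;Fully Paid;500
"""

# max_rows=1 stops reading after the first line, so no row count is needed.
header = np.genfromtxt(StringIO(csv_text), delimiter=';', dtype=str,
                       max_rows=1, autostrip=True)
print(header)
```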

In [15]:
header_strings, header_numeric = header_full[columns_strings], header_full[columns_numeric]

header_strings is an array containing the string column names, and header_numeric is an array containing the numeric column names.

In [16]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [17]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

## Create a checkpoint to store a copy of the dataset and avoid losing progress

In [18]:
def checkpoint(file_name, checkpoint_header, checkpoint_data):
    np.savez(file_name, header = checkpoint_header, data = checkpoint_data) #Create .npz file.
    checkpoint_variable = np.load(file_name + ".npz") #Load the file we just saved into a checkpoint variable.
    return(checkpoint_variable) #Return the checkpoint.

In [19]:
checkpoint1 = checkpoint('20211229_backup_string_loan_data', header_strings, loan_data_strings)

In [20]:
np.array_equal(checkpoint1['data'], loan_data_strings)

True

In [21]:
checkpoint2 = checkpoint('20211229_backup_numeric_loan_data', header_numeric, loan_data_numeric)

In [22]:
np.array_equal(checkpoint2['data'], loan_data_numeric)

True
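The save-and-reload round trip can be tried without touching the project files. The sketch below writes a made-up header and data array to a temporary directory; the file name and arrays are assumptions for the example.

```python
import os
import tempfile
import numpy as np

header = np.array(['id', 'amount'])              # made-up header
data = np.array([['1', '1000'], ['2', '500']])   # made-up rows

with tempfile.TemporaryDirectory() as tmp_dir:
    base_name = os.path.join(tmp_dir, 'example_checkpoint')
    np.savez(base_name, header=header, data=data)   # writes example_checkpoint.npz
    with np.load(base_name + '.npz') as archive:    # closes the file afterwards
        restored_header = archive['header']
        restored_data = archive['data']

print(np.array_equal(restored_data, data))
```

np.savez appends the .npz extension when the name does not already end in it, which is why the load call adds it explicitly.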

## Manipulate String Columns

### Issue Date

In [23]:
header_strings

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

Give the column a descriptive name.

In [24]:
header_strings[0] = 'issue_date'

In [25]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

Take a slice of the multidimensional array, keeping all rows but only selecting the first column. 

In [26]:
loan_data_strings[:,0]

array(['May-15', '', 'Sep-15', ..., 'Jun-15', 'Apr-15', 'Dec-15'], dtype='<U69')

Use the unique() method to see all the unique values in the array.

In [27]:
np.unique(loan_data_strings[:,0])

array(['', 'Apr-15', 'Aug-15', 'Dec-15', 'Feb-15', 'Jan-15', 'Jul-15', 'Jun-15', 'Mar-15',
       'May-15', 'Nov-15', 'Oct-15', 'Sep-15'], dtype='<U69')

Every value in the column ends in '-15', which stands for the year 2015. Because it is the same in every row, it is not a meaningful differentiator in the data and can be removed. 

In [28]:
loan_data_strings[:,0] = np.chararray.strip(loan_data_strings[:,0], "-15")

In [29]:
np.unique(loan_data_strings[:,0])

array(['', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'],
      dtype='<U69')

The month values can be represented as integers, e.g. 1 for January, because they use less memory compared to using the names of the months. 

Missing values will be represented with 0, so 0 marks accounts with no loan issue date. The first element of months is therefore the empty string, so that its index, 0, replaces the empty entries.

In [30]:
months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

In [31]:
#range() uses a half-open interval, so the loop variable, i, goes from 0 up to but not including 13. 
for i in range(13):
    loan_data_strings[:,0] = np.where(loan_data_strings[:,0] == months[i], i, loan_data_strings[:,0])

In [32]:
np.unique(loan_data_strings[:,0])

array(['0', '1', '10', '11', '12', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U69')
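As an alternative sketch of the same month-to-integer mapping, the comparison can be done for all months at once with broadcasting. The sample values are made up, and placing the empty string at index 0 is what makes a missing month map to 0.

```python
import numpy as np

# Index 0 holds the empty string so that a missing month maps to 0.
month_names = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
sample = np.array(['May', '', 'Sep', 'Dec'])   # made-up values

# Compare each value with every month name; argmax returns the matching index.
codes = (sample[:, None] == month_names[None, :]).argmax(axis=1)
print(codes)
```

One caveat of this version: a value that matches no entry at all would also come out as 0, so it assumes the column holds only known month names or empty strings.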

### Loan Status

In [33]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

The Loan Status column is at index 1 in the header_strings array, the second column.

In [34]:
np.unique(loan_data_strings[:,1])

array(['', 'Charged Off', 'Current', 'Default', 'Fully Paid', 'In Grace Period', 'Issued',
       'Late (16-30 days)', 'Late (31-120 days)'], dtype='<U69')

In [35]:
np.unique(loan_data_strings[:,1]).size

9

The 9 loan statuses can be simplified into 2 categories: 1, which represents good, non-defaulted accounts; and 0, which represents bad, defaulted accounts.

In [36]:
bad_status = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])

If the loan status is in the bad_status array, assign it 0, and if not, assign it 1.

In [37]:
loan_data_strings[:,1] = np.where(np.isin(loan_data_strings[:,1], bad_status), 0, 1)

In [38]:
np.unique(loan_data_strings[:,1])

array(['0', '1'], dtype='<U69')
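The membership test behind this step can be checked on a few made-up statuses. np.isin returns True wherever a value appears in the second array, and np.where turns that mask into 0 or 1.

```python
import numpy as np

bad = np.array(['', 'Charged Off', 'Default', 'Late (31-120 days)'])
sample = np.array(['Current', 'Default', '', 'Fully Paid'])   # made-up statuses

# 0 where the status is in the bad list, 1 everywhere else.
flags = np.where(np.isin(sample, bad), 0, 1)
print(flags)
```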

### Term

In [39]:
header_strings

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [40]:
np.unique(loan_data_strings[:,2])

array(['', '36 months', '60 months'], dtype='<U69')

Give the column a descriptive name.

In [41]:
header_strings[2] = 'term_months'
header_strings

array(['issue_date', 'loan_status', 'term_months', 'grade', 'sub_grade', 'verification_status',
       'url', 'addr_state'], dtype='<U19')

Since the column name now states the unit, we can remove ' months' from the values without taking away any meaning from the data. 

In [42]:
loan_data_strings[:,2] = np.chararray.strip(loan_data_strings[:,2], " months")
loan_data_strings[:,2]

array(['36', '36', '36', ..., '36', '36', '36'], dtype='<U69')

In [43]:
np.unique(loan_data_strings[:,2])

array(['', '36', '60'], dtype='<U69')

### Grade and Subgrade

In [44]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'grade', 'sub_grade', 'verification_status',
       'url', 'addr_state'], dtype='<U19')

In [45]:
np.unique(loan_data_strings[:,3])

array(['', 'A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='<U69')

In [46]:
np.unique(loan_data_strings[:,4])

array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4',
       'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4',
       'F5', 'G1', 'G2', 'G3', 'G4', 'G5'], dtype='<U69')

### Filling Sub Grade

Loop through all the unique grades in the grade column except for the first one, which is the empty string. For every grade, if the sub grade is equal to an empty string and the grade is equal to the iterator variable, assign the worst sub grade for that grade. Otherwise, the sub grade should remain unchanged. 

In [47]:
for i in np.unique(loan_data_strings[:,3])[1:]:
    loan_data_strings[:,4] = np.where((loan_data_strings[:,4] == '') & (loan_data_strings[:,3] == i), i + '5', loan_data_strings[:,4])

In [48]:
np.unique(loan_data_strings[:,4], return_counts = True)

(array(['', 'A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4',
        'C5', 'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4',
        'F5', 'G1', 'G2', 'G3', 'G4', 'G5'], dtype='<U69'),
 array([  9, 285, 278, 239, 323, 592, 509, 517, 530, 553, 633, 629, 567, 586, 564, 577, 391, 267,
        250, 255, 288, 235, 162, 171, 139, 160,  94,  52,  34,  43,  24,  19,  10,   3,   7,   5]))

The empty string at the start of the output means we still have missing data. Specifically, there are still rows that have neither a grade nor a sub grade. Setting the return_counts parameter to True lets us view how many rows have each of the unique values: 9 rows have no grade or sub grade. We treat these as accounts that are withholding information, and so they will receive a sub grade below all the existing ones. 

In [49]:
loan_data_strings[:,4] = np.where(loan_data_strings[:,4] == '', 'H1', loan_data_strings[:,4])

In [50]:
np.unique(loan_data_strings[:,4], return_counts = True)

(array(['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5',
        'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5',
        'G1', 'G2', 'G3', 'G4', 'G5', 'H1'], dtype='<U69'),
 array([285, 278, 239, 323, 592, 509, 517, 530, 553, 633, 629, 567, 586, 564, 577, 391, 267, 250,
        255, 288, 235, 162, 171, 139, 160,  94,  52,  34,  43,  24,  19,  10,   3,   7,   5,   9]))

Now, there is no longer missing data, and there is a new sub grade called H1. 
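The two filling steps can be traced on a four-row made-up example: a missing sub grade takes the worst sub grade of its grade, and a row with neither value falls back to 'H1'.

```python
import numpy as np

grade = np.array(['A', 'B', '', 'C'])          # made-up grades
sub_grade = np.array(['A2', '', '', 'C4'])     # made-up sub grades

# Skip the first unique value, the empty string, which sorts before the letters.
for g in np.unique(grade)[1:]:
    sub_grade = np.where((sub_grade == '') & (grade == g), g + '5', sub_grade)

# Rows with neither a grade nor a sub grade get the new lowest category.
sub_grade = np.where(sub_grade == '', 'H1', sub_grade)
print(sub_grade)
```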

All the grade information is in the sub grade and so we don't need the grade column anymore and can remove it.

In [51]:
loan_data_strings = np.delete(loan_data_strings, 3, axis = 1) #Removes the 4th column in the data set. 

In [52]:
loan_data_strings[:,3] #The column at index 3 is no longer grade but sub grade.

array(['C3', 'A5', 'B5', ..., 'A5', 'D2', 'A4'], dtype='<U69')

In [53]:
header_strings = np.delete(header_strings, 3) #Remove grade from the headers.

In [54]:
header_strings[3]

'sub_grade'

### Converting Sub Grade

In [55]:
np.unique(loan_data_strings[:,3])

array(['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5', 'C1', 'C2', 'C3', 'C4', 'C5',
       'D1', 'D2', 'D3', 'D4', 'D5', 'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5',
       'G1', 'G2', 'G3', 'G4', 'G5', 'H1'], dtype='<U69')

In [56]:
keys = list(np.unique(loan_data_strings[:,3]))
values = list(range(1, np.unique(loan_data_strings[:,3]).shape[0] + 1))
dict_sub_grade = dict(zip(keys,values))

In [57]:
dict_sub_grade

{'A1': 1,
 'A2': 2,
 'A3': 3,
 'A4': 4,
 'A5': 5,
 'B1': 6,
 'B2': 7,
 'B3': 8,
 'B4': 9,
 'B5': 10,
 'C1': 11,
 'C2': 12,
 'C3': 13,
 'C4': 14,
 'C5': 15,
 'D1': 16,
 'D2': 17,
 'D3': 18,
 'D4': 19,
 'D5': 20,
 'E1': 21,
 'E2': 22,
 'E3': 23,
 'E4': 24,
 'E5': 25,
 'F1': 26,
 'F2': 27,
 'F3': 28,
 'F4': 29,
 'F5': 30,
 'G1': 31,
 'G2': 32,
 'G3': 33,
 'G4': 34,
 'G5': 35,
 'H1': 36}

In [58]:
for i in np.unique(loan_data_strings[:,3]):
    loan_data_strings[:,3] = np.where(loan_data_strings[:,3] == i,
                                     dict_sub_grade[i],
                                     loan_data_strings[:,3])

In [59]:
np.unique(loan_data_strings[:,3])

array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22',
       '23', '24', '25', '26', '27', '28', '29', '3', '30', '31', '32', '33', '34', '35', '36',
       '4', '5', '6', '7', '8', '9'], dtype='<U69')
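The order of this output is lexicographic because the codes are still strings, so '10' sorts before '2'. A short sketch with made-up codes shows the difference once the values are cast to integers.

```python
import numpy as np

codes_as_text = np.array(['1', '10', '2', '36', '2'])   # made-up codes

text_order = np.unique(codes_as_text)                   # sorted as strings
numeric_order = np.unique(codes_as_text.astype(int))    # sorted as numbers

print(text_order)
print(numeric_order)
```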

### Verification Status

In [60]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [61]:
np.unique(loan_data_strings[:,4])

array(['', 'Not Verified', 'Source Verified', 'Verified'], dtype='<U69')

In [62]:
loan_data_strings[:,4] = np.where((loan_data_strings[:,4] == '') | (loan_data_strings[:,4] == 'Not Verified'), 0, 1)

In [63]:
np.unique(loan_data_strings[:,4])

array(['0', '1'], dtype='<U69')

### URL

In [64]:
loan_data_strings[:,5]

array(['https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', ...,
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151',
       'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249'], dtype='<U69')

In [65]:
np.chararray.strip(loan_data_strings[:,5], 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=')

chararray(['48010226', '57693261', '59432726', ..., '50415990', '46154151', '66055249'],
          dtype='<U69')

In [66]:
loan_data_strings[:,5] = np.chararray.strip(loan_data_strings[:,5], 'https://www.lendingclub.com/browse/loanDetail.action?loan_id=')

In [67]:
loan_data_strings[:,5]

array(['48010226', '57693261', '59432726', ..., '50415990', '46154151', '66055249'], dtype='<U69')
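A note on strip: its second argument is treated as a set of characters to remove from both ends, not as a literal prefix. That works here because the IDs consist only of digits and the URL contains none. The sketch below uses a shortened, made-up prefix and a contrived value to show where the two interpretations differ, with np.char.replace as one way to remove a literal substring.

```python
import numpy as np

prefix = 'id='                                   # made-up, shortened prefix
sample = np.array(['id=48010226', 'id=dad'])     # second value is contrived

# strip removes any of the characters 'i', 'd', '=' from both ends.
stripped = np.char.strip(sample, prefix)
# replace removes every occurrence of the literal substring 'id='.
replaced = np.char.replace(sample, prefix, '')

print(stripped)
print(replaced)
```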

In [68]:
loan_data_numeric[:,0]

array(['48010226', '57693261', '59432726', ..., '50415990', '46154151', '66055249'], dtype='<U11')

In [69]:
np.array_equal(loan_data_strings[:,5], loan_data_numeric[:,0])

True

The URL column doesn't hold any data we can't already get from the ID column. Therefore, we can get rid of the URL column.

In [70]:
loan_data_strings = np.delete(loan_data_strings, 5, axis = 1)
header_strings = np.delete(header_strings, 5)

In [71]:
loan_data_strings[:,5]

array(['CA', 'NY', 'PA', ..., 'CA', 'OH', 'IL'], dtype='<U69')

In [72]:
header_strings[5]

'addr_state'

In [73]:
loan_data_numeric[:,0]

array(['48010226', '57693261', '59432726', ..., '50415990', '46154151', '66055249'], dtype='<U11')

In [74]:
header_numeric

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U19')

### State Address

In [75]:
header_strings

array(['issue_date', 'loan_status', 'term_months', 'sub_grade', 'verification_status',
       'addr_state'], dtype='<U19')

In [76]:
header_strings[5] = "state address"

In [77]:
np.unique(loan_data_strings[:,5])

array(['', 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA', 'HI', 'IL', 'IN',
       'KS', 'KY', 'LA', 'MA', 'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH',
       'NJ', 'NM', 'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA',
       'VT', 'WA', 'WI', 'WV', 'WY'], dtype='<U69')

In [78]:
np.unique(loan_data_strings[:,5]).size

50

One of the 50 unique values is an empty string and another is DC, so only 48 of the 50 states appear in the data. The two missing states are Iowa (IA) and Idaho (ID). Iowa was excluded because it is used as a baseline/benchmark/base case. 

In [79]:
states_names, states_count = np.unique(loan_data_strings[:,5], return_counts = True)
states_count_sorted = np.argsort(-states_count)
states_names[states_count_sorted], states_count[states_count_sorted]

(array(['CA', 'NY', 'TX', 'FL', '', 'IL', 'NJ', 'GA', 'PA', 'OH', 'MI', 'NC', 'VA', 'MD', 'AZ',
        'WA', 'MA', 'CO', 'MO', 'MN', 'IN', 'WI', 'CT', 'TN', 'NV', 'AL', 'LA', 'OR', 'SC', 'KY',
        'KS', 'OK', 'UT', 'AR', 'MS', 'NH', 'NM', 'WV', 'HI', 'RI', 'MT', 'DE', 'DC', 'WY', 'AK',
        'NE', 'SD', 'VT', 'ND', 'ME'], dtype='<U69'),
 array([1336,  777,  758,  690,  500,  389,  341,  321,  320,  312,  267,  261,  242,  222,  220,
         216,  210,  201,  160,  156,  152,  148,  143,  143,  130,  119,  116,  108,  107,   84,
          84,   83,   74,   74,   61,   58,   57,   49,   44,   40,   28,   27,   27,   27,   26,
          25,   24,   17,   16,   10]))

In [80]:
loan_data_strings[:,5] = np.where(loan_data_strings[:,5] == '',
                                 0,
                                 loan_data_strings[:,5])
#So that accounts with no state won't belong to any region.

In [81]:
states_west = np.array(['WA', 'OR', 'CA', 'NV', 'ID', 'MT', 'WY', 'UT', 'CO', 'AZ', 'NM', 'HI', 'AK'])
states_south = np.array(['TX', 'OK', 'AR', 'LA', 'MS', 'AL', 'TN', 'KY', 'FL', 'GA', 'SC', 'NC', 'VA', 'WV', 'MD', 'DE', 'DC'])
states_midwest = np.array(['ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO', 'WI', 'IL', 'IN', 'MI', 'OH'])
states_east = np.array(['PA', 'NY', 'NJ', 'CT', 'MA', 'VT', 'NH', 'ME', 'RI'])

In [83]:
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_west), 1, loan_data_strings[:,5])
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_south), 2, loan_data_strings[:,5])
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_midwest), 3, loan_data_strings[:,5])
loan_data_strings[:,5] = np.where(np.isin(loan_data_strings[:,5], states_east), 4, loan_data_strings[:,5])


In [84]:
np.unique(loan_data_strings[:,5])

array(['0', '1', '2', '3', '4'], dtype='<U69')
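The same region coding can be sketched with a dictionary lookup. The dictionary below is a shortened, hypothetical version of the four region lists, and any state not found in it (including the empty string) falls back to 0.

```python
import numpy as np

# Hypothetical, shortened mapping: 1 = west, 2 = south, 3 = midwest, 4 = east.
region_codes = {'CA': 1, 'WA': 1, 'TX': 2, 'FL': 2,
                'IL': 3, 'OH': 3, 'NY': 4, 'PA': 4}
sample = np.array(['CA', 'NY', '', 'TX', 'OH'])   # made-up states

# Unknown or missing states fall back to 0.
regions = np.array([region_codes.get(state, 0) for state in sample])
print(regions)
```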

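At this point, we have converted our string data into numeric values stored as strings. Any empty strings that remain (term_months, for example, still contains '') must be filled before the array can be cast to a numeric data type.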