
## Part III - Machine Learning Application
### by Koketso Diale

> Now that we are more familiar with the dataset, let us use the insights we have to build a machine learning model to predict a potential Borrower's Prosper Score.


#### Data Gathering

In [None]:
# import packages for data manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Download Data
!wget 'https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv'

--2024-03-07 12:53:39--  https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.12.62, 16.182.40.224, 52.217.223.32, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.12.62|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86471101 (82M) [application/octet-stream]
Saving to: ‘prosperLoanData.csv.1’


2024-03-07 12:53:40 (101 MB/s) - ‘prosperLoanData.csv.1’ saved [86471101/86471101]



In [None]:
# Load in the dataset
loan_df = pd.read_csv('prosperLoanData.csv')
loan_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss

> From our analysis in Part I, we have already observed that our data is heavily imbalanced across states. Most of the data comes from California and Texas, in that order. We will therefore account for this by randomly selecting a limited number of borrowers from each state.
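
> A minimal sketch of the per-state sampling described above, run on toy data rather than `loan_df`. It assumes the state column is named `BorrowerState`; the helper `sample_per_state`, the cap `max_per_state` and the seed are hypothetical choices for illustration, not values taken from the analysis.

```python
import pandas as pd

def sample_per_state(df, state_col="BorrowerState", max_per_state=1000, seed=42):
    """Return at most `max_per_state` randomly chosen rows for each state."""
    # Shuffle every row once, then keep the first `max_per_state` rows per state.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(state_col).head(max_per_state)

# Toy data standing in for loan_df (values are illustrative only)
toy_df = pd.DataFrame({
    "BorrowerState": ["CA"] * 6 + ["TX"] * 4 + ["NY"] * 2,
    "StatedMonthlyIncome": range(12),
})

balanced_df = sample_per_state(toy_df, max_per_state=3)
print(balanced_df["BorrowerState"].value_counts())
```

> States with fewer rows than the cap keep all of their rows, so the cap limits the dominant states without discarding the small ones.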

> Machine learning frameworks operate on numbers rather than strings, so we will use numeric ratings instead of alpha ratings for simplicity and performance. Additionally, some features will need to be normalized first so that features measured on very different scales contribute comparably.
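
> The two preprocessing steps above can be sketched as follows, on toy data. The bracket labels in `income_order` are assumed for illustration and should be checked against the actual values in `IncomeRange`. Min-max scaling is just one possible normalization, and the helper `min_max_scale` is hypothetical.

```python
import pandas as pd

# Assumed ordering of income brackets, lowest to highest (illustrative labels)
income_order = ["$0", "$1-24,999", "$25,000-49,999",
                "$50,000-74,999", "$75,000-99,999", "$100,000+"]

toy_df = pd.DataFrame({
    "IncomeRange": ["$25,000-49,999", "$100,000+", "$1-24,999", "$50,000-74,999"],
    "StatedMonthlyIncome": [3000.0, 10000.0, 1500.0, 5000.0],
})

# Replace each string label with its position in the ordered list.
income_codes = {label: code for code, label in enumerate(income_order)}
toy_df["IncomeRangeCode"] = toy_df["IncomeRange"].map(income_codes)

def min_max_scale(series):
    """Rescale a numeric series to the [0, 1] interval."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # A constant column carries no scale information; map it to zeros.
        return series * 0.0
    return (series - lo) / (hi - lo)

toy_df["IncomeScaled"] = min_max_scale(toy_df["StatedMonthlyIncome"])
print(toy_df)
```

> Any label missing from the mapping becomes `NaN` after `.map`, which makes unexpected categories easy to spot before training.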

#### Feature Selection

In [None]:
# Select suitable features we can use for our models.
features_df = loan_df[[ 'ProsperScore', 'ProsperRating (numeric)',
                       'EmploymentStatus', 'Term','IsBorrowerHomeowner',
                       'IncomeVerifiable', 'IncomeRange',
                       'StatedMonthlyIncome', 'Recommendations']]

> Note:
  * We will choose between `ProsperRating (numeric)` and `ProsperScore` as the prediction target depending on which one has more usable (complete) rows and fewer outliers.
  * The other features will be evaluated in the same way to decide whether they are usable as inputs.
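
> One possible way to compare columns on completeness and outliers is sketched below on toy data. The 1.5 × IQR rule used here is an assumed outlier criterion, not one stated in this notebook, and the helper `completeness_and_outliers` is hypothetical.

```python
import numpy as np
import pandas as pd

def completeness_and_outliers(df, columns):
    """Count non-null entries and 1.5*IQR outliers for each numeric column."""
    rows = []
    for col in columns:
        values = df[col].dropna()
        q1, q3 = values.quantile([0.25, 0.75])
        iqr = q3 - q1
        # Flag values outside the usual box-plot whiskers.
        is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        rows.append({"column": col,
                     "non_null": len(values),
                     "outliers": int(is_outlier.sum())})
    return pd.DataFrame(rows).set_index("column")

# Toy data standing in for features_df (values are illustrative only)
toy_df = pd.DataFrame({
    "ProsperScore": [4, 5, 6, 5, np.nan, 7, 6, 5],
    "StatedMonthlyIncome": [3000, 3500, 4000, 4200, 3800, 3600, 3900, 250000],
})

summary = completeness_and_outliers(toy_df, ["ProsperScore", "StatedMonthlyIncome"])
print(summary)
```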

In [None]:
# define function to determine field with fewest entries.

def find_features_with_fewest_entries(dataframe):
  """Print the dataframe field(s) with the fewest non-null entries.

  If the dataframe has no columns, prints that no features were found.
  If the input is not a pandas DataFrame, prints an error message.

  Arguments:
  ----------
  dataframe : pd.DataFrame
    input pandas dataframe.

  Returns:
  --------
  None
    the result is printed, not returned.
  """
  # Check if the input is a DataFrame
  if not isinstance(dataframe, pd.DataFrame):
    print("Input is not a valid DataFrame.")
    return

  # Get the number of rows in the DataFrame
  num_rows = len(dataframe)

  # Find min entries
  min_entries = dataframe.count().min()

  # Find feature names
  min_feature_names = dataframe.columns[dataframe.count() == min_entries].to_list()

  # Print results
  if min_feature_names:
    feature_str = ", ".join([f"'{feature}'" for feature in min_feature_names])
    print(f"min features: {feature_str} = {min_entries}/{num_rows}")
  else:
    print("No features found in DF.")

In [None]:
# We can then use this func to outline the features with smallest sample size
find_features_with_fewest_entries(features_df)


min features: 'ProsperScore', 'ProsperRating (numeric)' = 84853/113937


> From this output we can see that our candidate targets (`ProsperScore` and `ProsperRating (numeric)`) have the same number of non-null entries (84,853 of 113,937), so completeness alone does not separate them. We may therefore use whichever of the two responds better to normalization.
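
> Equal counts do not by themselves show that the two columns are missing in the same rows. A minimal sketch of that check, on toy data rather than `features_df` (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data standing in for features_df (values are illustrative only)
toy_df = pd.DataFrame({
    "ProsperScore": [4.0, np.nan, 7.0, np.nan, 9.0],
    "ProsperRating (numeric)": [3.0, np.nan, 5.0, np.nan, 6.0],
})

score_missing = toy_df["ProsperScore"].isna()
rating_missing = toy_df["ProsperRating (numeric)"].isna()

# True only if every row is missing in both columns or present in both.
same_rows_missing = bool((score_missing == rating_missing).all())
print(f"missing in the same rows: {same_rows_missing}")

# Keep only the rows where both candidate targets are present.
complete_df = toy_df.dropna(subset=["ProsperScore", "ProsperRating (numeric)"])
print(f"complete rows: {len(complete_df)}/{len(toy_df)}")
```

> If the check holds on the real data, dropping incomplete rows costs the same number of rows whichever target is chosen.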