# Part I - Loan Data from Prosper Exploration
## by Agustin Barto

In [1]:
from pathlib import Path
from urllib import request
from zipfile import ZipFile

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from IPython.display import display, Markdown, HTML

%matplotlib inline

## Table of Contents

* [Introduction](#introduction)
* [Preliminary Wrangling](#preliminary-wrangling)
    * [Data Dictionary](#data-dictionary)

## Introduction<a class="anchor" id="introduction"></a>

I've chosen the "Loan Data from Prosper" dataset because it strikes a good balance between size, complexity and richness. It has many variables to choose from, both numerical and categorical, and enough samples to provide meaningful results.

As the following section shows, the dataset contains a large number of columns to choose from. During the wrangling process we'll briefly explore the dataset to decide which columns will be the focus of the analysis. Whenever needed, the column definitions will be expanded with external sources.

## Preliminary Wrangling<a class="anchor" id="preliminary-wrangling"></a>

### Data dictionary<a class="anchor" id="data-dictionary"></a>

The following table (converted to CSV from the original [Google sheet](https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/edit#gid=0)) contains a brief description of each column:

In [2]:
variable_definitions_df = pd.read_csv(
    "https://docs.google.com/spreadsheets/d/1gDyi_L4UvIrLTEC6Wri5nbaMmkGmLQBk-Yx3z0XDEtI/export?format=csv",
)

In [3]:
styler = variable_definitions_df\
    .style\
    .set_properties(
        **{'text-align': 'left'}
    )\
    .hide(axis="index")
display(HTML(styler.to_html()))

Variable,Description
ListingKey,"Unique key for each listing, same value as the 'key' used in the listing object in the API."
ListingNumber,The number that uniquely identifies the listing to the public as displayed on the website.
ListingCreationDate,The date the listing was created.
CreditGrade,The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
Term,The length of the loan expressed in months.
LoanStatus,"The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket."
ClosedDate,"Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses."
BorrowerAPR,The Borrower's Annual Percentage Rate (APR) for the loan.
BorrowerRate,The Borrower's interest rate for this loan.
LenderYield,The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.


### Downloading the dataset

We assume the dataset has been included with the submission. In case it had to be removed due to size constraints, the following cells check whether the data is available and download it if it is not.

In [4]:
prosper_load_data_csv_zip = Path("./prosperLoanData.csv.zip")

In [5]:
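from zipfile import ZIP_DEFLATED


def download_compress_dataset():
    """Download the Prosper CSV and store it as a deflated zip archive."""
    try:
        with request.urlopen("https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv") as f:
            # ZipFile stores members uncompressed unless a compression method is given.
            with ZipFile(prosper_load_data_csv_zip, "w", compression=ZIP_DEFLATED) as zf:
                with zf.open("prosperLoanData.csv", "w") as g:
                    g.write(f.read())
    except Exception as e:
        display(Markdown(f"> <span style='color: red;'>**Exception raised retrieving data set:**</span> {e}"))
        # Remove any partial archive; missing_ok (Python 3.8+) avoids a second error if it was never created.
        prosper_load_data_csv_zip.unlink(missing_ok=True)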

In [6]:
if not prosper_load_data_csv_zip.exists():
    download_compress_dataset()
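
The download cell reads the whole response into memory with `f.read()` before writing it to the archive. As an alternative, the sketch below streams the data in chunks with `shutil.copyfileobj`. It is an illustration under stated assumptions rather than the notebook's method: `compress_stream` is a hypothetical helper, the payload is dummy bytes standing in for the HTTP response, and `ZIP_DEFLATED` is passed explicitly because `ZipFile` stores members uncompressed by default.

```python
import io
import shutil
from zipfile import ZIP_DEFLATED, ZipFile


def compress_stream(source, zip_target, member_name):
    """Copy a binary file-like `source` into a deflated zip member in chunks.

    Hypothetical helper for illustration only.
    """
    with ZipFile(zip_target, "w", compression=ZIP_DEFLATED) as zf:
        with zf.open(member_name, "w") as member:
            shutil.copyfileobj(source, member)


# Dummy bytes standing in for the HTTP response body; any binary
# file-like object (including the result of urlopen) can be the source.
payload = b"ListingKey,Term\n" + b"dummy,36\n" * 1000
archive = io.BytesIO()
compress_stream(io.BytesIO(payload), archive, "prosperLoanData.csv")

# Read the member back to check the round trip and the stored size.
with ZipFile(archive) as zf:
    restored = zf.read("prosperLoanData.csv")
    compressed_size = zf.getinfo("prosperLoanData.csv").compress_size
```

In the notebook, the source would be the object returned by `request.urlopen(...)` and the target would be `prosper_load_data_csv_zip`.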

### Loading the dataset

Once the dataset has been downloaded (and compressed), pandas can just read the CSV file straight out of the zip file:

In [7]:
prosper_load_data_df = pd.read_csv(prosper_load_data_csv_zip)
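
Reading straight from the archive works because `pd.read_csv` infers the compression from the `.zip` suffix, provided the archive contains a single data file. The `info()` output for this dataset lists `ListingCreationDate` and `ClosedDate` as `object` columns; if date arithmetic is needed later, `parse_dates` can convert them at load time. The sketch below is a self-contained illustration, assuming a temporary archive named `sample.csv.zip` (a made-up name) that holds three rows copied from the sample output shown in this notebook.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

import pandas as pd

# Three rows and three columns copied from the notebook's sample output.
csv_text = (
    "ListingNumber,ListingCreationDate,ClosedDate\n"
    "590413,2012-05-18 07:20:51.227000000,\n"
    "186444,2007-08-13 12:34:40.233000000,2008-06-17 00:00:00\n"
    "389521,2008-08-27 20:42:51.570000000,2008-12-10 00:00:00\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = Path(tmp_dir) / "sample.csv.zip"
    with ZipFile(zip_path, "w") as zf:
        zf.writestr("sample.csv", csv_text)

    # The ".zip" suffix lets pandas infer the compression; parse_dates
    # converts the two date columns while reading.
    sample_df = pd.read_csv(
        zip_path,
        parse_dates=["ListingCreationDate", "ClosedDate"],
    )

date_dtypes = sample_df[["ListingCreationDate", "ClosedDate"]].dtypes
```

The empty `ClosedDate` in the first row is read as a missing value (`NaT`), which matches how open loans appear in the full dataset.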

### What is the structure of your dataset?

In [8]:
prosper_load_data_df.sample(5)

Unnamed: 0,ListingKey,ListingNumber,ListingCreationDate,CreditGrade,Term,LoanStatus,ClosedDate,BorrowerAPR,BorrowerRate,LenderYield,...,LP_ServiceFees,LP_CollectionFees,LP_GrossPrincipalLoss,LP_NetPrincipalLoss,LP_NonPrincipalRecoverypayments,PercentFunded,Recommendations,InvestmentFromFriendsCount,InvestmentFromFriendsAmount,Investors
57884,097035480642340602D8E57,590413,2012-05-18 07:20:51.227000000,,60,Current,,0.31375,0.287,0.277,...,-240.1,0.0,0.0,0.0,0.0,1.0,0,0,0.0,49
20384,FA3535617532836172A257C,665475,2012-11-06 08:49:22.087000000,,60,Past Due (1-15 days),,0.23656,0.2118,0.2018,...,-166.57,-69.24,0.0,0.0,0.0,1.0,0,0,0.0,185
101251,28EB3397698812114AE7B72,186444,2007-08-13 12:34:40.233000000,B,36,Completed,2008-06-17 00:00:00,0.20735,0.2,0.19,...,-100.67,0.0,0.0,0.0,0.0,1.0,0,0,0.0,127
3378,956B35582635247623C7A06,643427,2012-09-22 07:48:08.197000000,,36,Current,,0.33051,0.2909,0.2809,...,-47.63,0.0,0.0,0.0,0.0,1.0,0,0,0.0,6
102193,26BE3430343675240728EDA,389521,2008-08-27 20:42:51.570000000,AA,36,Completed,2008-12-10 00:00:00,0.127,0.12,0.11,...,-5.7,0.0,0.0,0.0,0.0,1.0,1,1,100.0,191


In [9]:
prosper_load_data_df.shape

(113937, 81)

In [10]:
prosper_load_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss

The dataset comprises 113,937 rows (loans) and 81 columns.
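
The `info()` output also shows that some columns are sparsely populated: `CreditGrade`, for example, has 28953 non-null values out of 113937 rows, so roughly three quarters of its entries are missing. One way to rank columns by their share of missing values is `isna().mean()`. The sketch below is an illustration only; it uses the `CreditGrade`, `ClosedDate` and `Term` values from the five sampled rows shown above instead of the full dataset.

```python
import numpy as np
import pandas as pd

# CreditGrade, ClosedDate and Term values from the five sampled rows above.
sample = pd.DataFrame(
    {
        "CreditGrade": [np.nan, np.nan, "B", np.nan, "AA"],
        "ClosedDate": [
            np.nan,
            np.nan,
            "2008-06-17 00:00:00",
            np.nan,
            "2008-12-10 00:00:00",
        ],
        "Term": [60, 60, 36, 36, 36],
    }
)

# isna() yields booleans; their column-wise mean is the fraction missing.
missing_share = sample.isna().mean().sort_values(ascending=False)
```

Applied to the full frame, the same expression would be `prosper_load_data_df.isna().mean()`.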

### What is/are the main feature(s) of interest in your dataset?

> Your answer here!

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Your answer here!

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.


> **Rubric Tip**: The project (Part I alone) should have at least 15 visualizations distributed over univariate, bivariate, and multivariate plots to explore many relationships in the data set. Use reasoning to justify the flow of the exploration.

> **Rubric Tip**: Use the "Question-Visualization-Observations" framework throughout the exploration. This framework involves **asking a question from the data, creating a visualization to find answers, and then recording observations after each visualization.**

> **Rubric Tip**: Visualizations should depict the data appropriately so that the plots are easily interpretable. You should choose an appropriate plot type, data encodings, and formatting as needed. The formatting may include setting/adding the title, labels, legend, and comments. Also, do not overplot or incorrectly plot ordinal data.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> Your answer here!

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> Your answer here!

## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Your answer here!

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> Your answer here!

## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Your answer here!

### Were there any interesting or surprising interactions between features?

> Your answer here!

## Conclusions
> You can write a summary of the main findings and reflect on the steps taken during the data exploration.



> Remove all Tips mentioned above, before you convert this notebook to PDF/HTML


> At the end of your report, make sure that you export the notebook as an
html file from the `File > Download as... > HTML or PDF` menu. Make sure you keep
track of where the exported file goes, so you can put it in the same folder
as this notebook for project submission. Also, make sure you remove all of
the quote-formatted guide notes like this one before you finish your report!

