# Prosper Loan Data Exploration
## by Yahia Ali Abusaif

## Preliminary Wrangling

> This data set contains 113,937 loans (rows) with 81 variables per loan (loan status, borrower rate, borrower income, etc.).

In [None]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

> Load in your dataset and describe its properties through the questions below.
Try and motivate your exploration goals through this section.

In [None]:
df = pd.read_csv('prosperLoanData.csv')
#pd.reset_option('display.max_rows')
#pd.reset_option('display.max_columns')
#pd.reset_option('display.width')

In [None]:
#set option so we can see all columns 
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
#explore data 
print(df.shape)
print(df.dtypes)
df.head(10)

In [None]:
#df.dropna(inplace=True)
#print(df.shape)
#print(df.dtypes)
#df.head()
###there are variables with  many nans

In [None]:
#drop duplicates (drop_duplicates returns a new frame, so assign it back)
df = df.drop_duplicates()
print(df.shape)

In [None]:
#check for NA for all variables
print(df.isna().sum())

In [None]:
#drop some columns as we don't need them or they have many NA 
df.drop(['ListingKey','ListingNumber','ClosedDate','CreditGrade','GroupKey','TotalProsperLoans','TotalProsperPaymentsBilled',
            'OnTimeProsperPayments','ProsperPaymentsLessThanOneMonthLate','ProsperPaymentsOneMonthPlusLate',
            'ProsperPrincipalBorrowed','ProsperPrincipalOutstanding','ScorexChangeAtTimeOfListing','LoanFirstDefaultedCycleNumber'
            ],inplace=True, axis=1)
df.dropna(inplace=True)
print(df.isna().sum().sum())
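
One way to judge which columns have "many NA" is to look at the share of missing values per column rather than the raw counts. The sketch below is illustrative only: the helper `missing_share`, the toy frame, and the 0.5 cutoff are assumptions and not the rule used to pick the dropped columns above.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

# Toy frame standing in for the loan data (values are made up)
toy = pd.DataFrame({
    'BorrowerAPR': [0.12, 0.25, 0.31, 0.18],
    'CreditGrade': [np.nan, np.nan, np.nan, 'C'],
    'ClosedDate': [np.nan, '2014-01-01', np.nan, np.nan],
})

share = missing_share(toy)
# Hypothetical cutoff: flag columns where more than half of the rows are missing
mostly_missing = share[share > 0.5].index.tolist()
print(share)
print(mostly_missing)
```

On the real data, the same helper could be called as `missing_share(df)` before deciding what to drop.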

In [None]:
df.shape

In [None]:
#add order to Categorical Dtype
ordinal = {'ProsperRating (Alpha)': ['HR','E','D','C','B', 'A', 'AA'],
                    'EmploymentStatus' : [ 'Not employed', 'Other', 'Retired', 'Self-employed', 'Part-time','Full-time', 'Employed']
                    }
for i in ordinal:
    orderedvar = pd.api.types.CategoricalDtype(ordered = True,categories = ordinal[i])
    df[i] = df[i].astype(orderedvar)
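
To see what the ordered dtype buys us, here is a small sketch on a made-up series (the values are assumptions, not taken from the dataset). An ordered categorical sorts and compares by the category order instead of alphabetically, and its integer codes follow the same order. Note that any value missing from the category list becomes NaN after `astype`.

```python
import pandas as pd

# Same rating order as in the cell above, from worst (HR) to best (AA)
rating_order = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']
rating_dtype = pd.api.types.CategoricalDtype(categories=rating_order, ordered=True)

# Made-up ratings for illustration
ratings = pd.Series(['A', 'HR', 'C', 'AA', 'C']).astype(rating_dtype)

sorted_ratings = ratings.sort_values().tolist()  # sorted by rating order, not alphabetically
at_least_c = (ratings >= 'C').tolist()           # comparisons respect the order
codes = ratings.cat.codes.tolist()               # position of each value in rating_order

print(sorted_ratings)
print(at_least_c)
print(codes)
```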

In [None]:
print(df.shape)
df.describe()

In [None]:
df.head(10)

### What is the structure of your dataset?

> The dataset contains 113,937 loans with 81 features. After dropping 14 columns and removing rows with NaN values, it has 76,216 loans with 67 features.

### What is/are the main feature(s) of interest in your dataset?

> The borrower's annual percentage rate for the loan (BorrowerAPR).

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Term, ProsperRating, LoanAmount, APR, the borrower's EmploymentStatus, and MonthlyIncome.

## Univariate Exploration

> In this section, investigate distributions of individual variables. If
you see unusual points or outliers, take a deeper look to clean things up
and prepare yourself to look at relationships between variables.

In [None]:
plt.figure(figsize=(10,8))
plt.subplot(1,1,1)
sb.countplot(data=df, x='ProsperRating (Alpha)', color='blue');
plt.title('Prosper Ratings', fontsize=14)
plt.ylabel('Count', fontsize=12)
plt.xlabel('ProsperRating', fontsize=12);

C and B are the most common ratings.

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2,1,2)
sb.countplot(data=df, x='EmploymentStatus', color='blue');
plt.title('Employment Status', fontsize=14)
plt.ylabel('Log(Count)', fontsize=12)
plt.xlabel('EmploymentStatus', fontsize=12);
plt.yscale('log')
plt.xticks(rotation=45)


plt.subplot(2,1,1)
sb.countplot(data=df, x='Term', color='blue')
plt.title('Loan Terms', fontsize=14)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Terms', fontsize=12);
sb.despine()

Most borrowers are employed or full-time.
Most loans have a term of 36 months.

In [None]:
plt.figure(figsize=(20,30))
plt.subplot(2,1,1)
sb.countplot(data=df, x='ListingCategory (numeric)', color='blue');
plt.yscale('log')
plt.yticks([100,1000,10000], ['100', '1000', '10000'])
plt.title('Listing Distribution', fontsize=14)
plt.xlabel('Listing', fontsize=12)
plt.ylabel('Log(Count)', fontsize=12)
plt.subplots_adjust(hspace = 0.3)

plt.subplot(2,1,2)
sb.countplot(data=df, y='LoanStatus', color='blue');
#plt.xticks(rotation=45)
plt.xscale('log')
plt.xticks([100,1000,10000], ['100', '1000', '10000'])
plt.title('LoanStatus Distribution', fontsize=14)
plt.xlabel('Log(Count)', fontsize=12)
plt.ylabel('Status', fontsize=12);
sb.despine()

The category of the listing that the borrower selected when posting their listing: 
0 - Not Available, 
1 - Debt Consolidation, 
2 - Home Improvement,
3 - Business,
4 - Personal Loan, 
5 - Student Use, 
6 - Auto, 
7 - Other, 
8 - Baby&Adoption, 
9 - Boat, 
10 - Cosmetic Procedure, 
11 - Engagement Ring, 
12 - Green Loans, 
13 - Household Expenses, 
14 - Large Purchases, 
15 - Medical/Dental, 
16 - Motorcycle, 
17 - RV, 
18 - Taxes, 
19 - Vacation, 
20 - Wedding Loans

Debt Consolidation (1), Home Improvement (2), and Other (7) are the most popular listing categories.
Most loans have 'Current' status, followed by 'Completed'.

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
bins = np.arange(0, df.StatedMonthlyIncome.max(), 1000)
plt.hist(data=df, x='StatedMonthlyIncome', bins=bins, color='darkblue');
plt.xlim(0, 35000);
plt.title('Income Distribution', fontsize=14)
plt.xlabel('Monthly Income', fontsize=12)

plt.subplot(1,2,2)
sb.boxplot(data=df, y='StatedMonthlyIncome');
plt.title('Income boxplot', fontsize=14)
plt.ylim(0, 20000)
plt.ylabel('Monthly Income', fontsize=12);
sb.despine()

The monthly income distribution is right-skewed, with a mean of about 5,000.
The boxplot shows many outliers, which need further investigation.
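
A boxplot marks points as outliers when they fall beyond the whiskers, which by default sit 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below computes that upper fence on made-up incomes. The helper `upper_fence` and the sample values are assumptions, and this is only one possible rule of thumb. It is not the cutoff chosen in the next cell, which was picked by trying a few thresholds.

```python
import pandas as pd

def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Upper outlier fence: third quartile plus k times the IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Made-up monthly incomes with one extreme value
income = pd.Series([2000, 3000, 4000, 5000, 6000, 7000, 8000, 60000], dtype=float)

fence = upper_fence(income)
share_above = (income > fence).mean() * 100  # percent of rows above the fence

print(fence)
print(share_above)
```

On the real data, `upper_fence(df['StatedMonthlyIncome'])` would give a data-driven value to compare against the hand-picked thresholds.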

In [None]:
# Reduce Monthly Income outliers by cutting off the highest values
# tried 20000, 17500 and 15000 as cutoffs
# percentage of rows that the 15000 cutoff removes
print(100*df[df['StatedMonthlyIncome']>15000].shape[0] / df.shape[0])
df = df[df['StatedMonthlyIncome']<=15000]

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(1,2,1)
bins = np.arange(0, df.StatedMonthlyIncome.max(), 1000)
plt.hist(data=df, x='StatedMonthlyIncome', bins=bins, color='darkblue');
#plt.xlim(0, 35000);
plt.title('Income Distribution', fontsize=14)
plt.xlabel('Monthly Income', fontsize=12)

plt.subplot(1,2,2)
sb.boxplot(data=df, y='StatedMonthlyIncome');
plt.title('Income boxplot', fontsize=14)
plt.ylim(0, 20000)
plt.ylabel('Monthly Income', fontsize=12);
sb.despine()

In [None]:
plt.figure(figsize=(10,10))
bins = np.arange(0, df.LoanOriginalAmount.max(), 1200)
plt.hist(data=df, x='LoanOriginalAmount', bins=bins, color='darkblue')
plt.title('Loan Amount Distribution', fontsize=14)
plt.xlim(0, 35000);
plt.xlabel('Amount', fontsize=12);
sb.despine()

The mean loan amount appears to be about 5,000; borrowers rarely ask for high-amount loans.

In [None]:
plt.figure(figsize=(10, 8))
bins = np.arange(0, df.BorrowerAPR.max()+0.01, 0.009)
plt.hist(data=df, x='BorrowerAPR', bins=bins, color='darkblue');
plt.title('Borrower APR Distribution', fontsize=14)
plt.xlabel('Borrower APR', fontsize=12);
sb.despine()

The APR distribution is close to unimodal, with a mean around 0.2.

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> APR has a roughly unimodal distribution with a mean around 0.2. No transformations were needed.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> I plotted the counts on a log scale so that the very large categories do not dominate the plots.

> The MonthlyIncome distribution was right-skewed with many outliers. Some borrowers stated very large incomes compared to the rest, which may be made up, so I removed incomes above 15,000, which represented around 2.5% of the data.




## Bivariate Exploration

> In this section, investigate relationships between pairs of variables in your
data. Make sure the variables that you cover here have been introduced in some
fashion in the previous section (univariate exploration).

In [None]:
plt.figure(figsize=(10,15))
plt.subplot(3,1,1)
sb.regplot(data=df, x='LoanOriginalAmount', y='BorrowerAPR', color='darkblue', x_jitter=100,
          scatter_kws={'alpha':0.01}, fit_reg=True)
plt.title('APR vs LoanAmount', fontsize=14)
plt.xlabel('LoanAmount', fontsize=12)
plt.ylabel('APR', fontsize=12);

plt.subplot(3,1,2)
sb.regplot(data=df, x='StatedMonthlyIncome', y='LoanOriginalAmount', color='darkblue', x_jitter=100,
          scatter_kws={'alpha':0.01}, fit_reg=True)
plt.title('LoanAmount vs Income', fontsize=14)
plt.xlabel('MonthlyIncome', fontsize=12)
plt.ylabel('LoanAmount', fontsize=12);
sb.despine()

plt.subplot(3,1,3)
sb.regplot(data=df, x='StatedMonthlyIncome', y='BorrowerAPR', color='darkblue', x_jitter=100,
          scatter_kws={'alpha':0.01}, fit_reg=True)
plt.title('APR vs Income', fontsize=14)
plt.xlabel('MonthlyIncome', fontsize=12)
plt.ylabel('APR', fontsize=12);
sb.despine()

There is a negative relationship between APR and LoanAmount.
There is a positive relationship between LoanAmount and Income.
There is a weak negative relationship between monthly income and APR.
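
The direction and strength of these relationships can also be checked numerically with Pearson correlation coefficients. The sketch below uses synthetic data generated to mimic the signs seen in the plots. The numbers are assumptions and not results from the loan data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the three numeric columns (made-up relationships)
rng = np.random.default_rng(0)
n = 500
income = rng.uniform(1000, 15000, n)
amount = 2000 + 0.5 * income + rng.normal(0, 1500, n)
apr = 0.35 - 0.00001 * amount + rng.normal(0, 0.03, n)

toy = pd.DataFrame({
    'StatedMonthlyIncome': income,
    'LoanOriginalAmount': amount,
    'BorrowerAPR': apr,
})

# Pairwise Pearson correlations; the sign gives the direction of each relationship
corr = toy.corr()
print(corr.round(2))
```

On the real data, the equivalent call would be `df[['BorrowerAPR', 'LoanOriginalAmount', 'StatedMonthlyIncome']].corr()`.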

In [None]:
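# PairGrid creates its own figure, so no separate plt.figure call is needed
g = sb.PairGrid(data=df, y_vars=['BorrowerAPR', 'LoanOriginalAmount', 'StatedMonthlyIncome'], 
                x_vars=['Term', 'ProsperRating (Alpha)', 'EmploymentStatus'], 
                height=3, aspect = 1.5)
g.map(sb.barplot, color='darkblue')
# rotate the tick labels on every bottom-row axis, not only the last one
for ax in g.axes[-1, :]:
    ax.tick_params(axis='x', labelrotation=45)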

Rating has a negative relationship with APR: the higher the rating, the lower the APR.
Large loan amounts tend to have longer terms and better ratings.
Large loan amounts are mostly taken out by Employed borrowers.
Borrowers with higher monthly incomes tend to have higher ratings.

In [None]:
plt.figure(figsize=(10,10))
sb.countplot(data=df, x='ProsperRating (Alpha)', hue='Term')
plt.title('ProsperRating Count', fontsize=14)
plt.xlabel('ProsperRating', fontsize=12)
plt.ylabel('Count', fontsize=12)
sb.despine()

The number of 12-month loans is roughly constant across all ratings.
36-month loans are apparently the most popular, standard term for every rating.
60-month loans tend to have C and B ratings.
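
Raw counts make it hard to compare ratings of very different sizes, so a row-normalized cross-tabulation can complement the clustered bar chart. This is a sketch on made-up rows, and the toy values are assumptions. Each row of the result shows how loans within one rating split across terms.

```python
import pandas as pd

# Made-up rating/term pairs for illustration
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['C', 'C', 'C', 'B', 'B', 'HR', 'HR', 'A'],
    'Term': [36, 60, 36, 36, 60, 36, 36, 12],
})

# normalize='index' turns each row into proportions that sum to 1
share = pd.crosstab(toy['ProsperRating (Alpha)'], toy['Term'], normalize='index')
print(share.round(2))
```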

In [None]:
plt.figure(figsize=(15,10))
sb.countplot(data=df, x='ProsperRating (Alpha)', hue='EmploymentStatus')
plt.title('ProsperRating Count', fontsize=14)
plt.xlabel('ProsperRating', fontsize=12)
plt.ylabel('log(Count)', fontsize=12)
plt.yscale('log')
sb.despine()

Not-employed borrowers appear only in the D rating.
Self-employed borrowers have a lower mean rating.
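
A "mean rating" can be approximated by averaging the integer codes of the ordered rating category per employment status. The sketch below uses made-up rows and assumes that the ratings are equally spaced, which is a simplification since the rating is ordinal. The column `rating_code` is a hypothetical helper column.

```python
import pandas as pd

rating_order = ['HR', 'E', 'D', 'C', 'B', 'A', 'AA']
rating_dtype = pd.api.types.CategoricalDtype(categories=rating_order, ordered=True)

# Made-up borrowers for illustration
toy = pd.DataFrame({
    'EmploymentStatus': ['Employed', 'Employed', 'Self-employed',
                         'Self-employed', 'Not employed'],
    'ProsperRating (Alpha)': ['A', 'B', 'D', 'C', 'D'],
})
toy['ProsperRating (Alpha)'] = toy['ProsperRating (Alpha)'].astype(rating_dtype)

# Codes run from 0 (HR) to 6 (AA); averaging them treats the steps as equal
toy['rating_code'] = toy['ProsperRating (Alpha)'].cat.codes
mean_code = toy.groupby('EmploymentStatus')['rating_code'].mean()
print(mean_code)
```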

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> APR has a negative relationship with ProsperRating.
> APR has a negative relationship with MonthlyIncome.
> APR has a negative relationship with LoanAmount.
         



### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> LoanAmount has a positive correlation with MonthlyIncome.
> LoanAmount has a positive correlation with ProsperRating: high loan amounts tend to have higher ratings.
> Borrowers with higher monthly incomes tend to have higher ratings.
> Large loan amounts are taken out by employed borrowers.
> 36-month terms are the most popular across all ratings.
> HR ratings have only 36-month terms.
> Not-employed and self-employed borrowers have lower mean ratings.



## Multivariate Exploration

> Create plots of three or more variables to investigate your data even
further. Make sure that your investigations are justified, and follow from
your work in the previous sections.

In [None]:
# FacetGrid creates its own figure, so no separate plt.figure call is needed
g=sb.FacetGrid(data=df, col='Term', col_wrap=4, height=5, aspect=1.2)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', color='darkblue', x_jitter=100, scatter_kws={'alpha':0.1});

It seems that Term has no effect on the relation between APR and LoanAmount.

In [None]:
# FacetGrid creates its own figure, so no separate plt.figure call is needed
g=sb.FacetGrid(data=df, col='ProsperRating (Alpha)', col_wrap=4, height=5, aspect=1)
g.map(sb.regplot, 'LoanOriginalAmount', 'BorrowerAPR', color='darkblue', x_jitter=100, scatter_kws={'alpha':0.1});

The negative relationship between APR and LoanAmount becomes slightly positive for the higher ratings!
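
The change in sign can be made concrete by fitting a straight line within each rating and comparing the slopes. The sketch below does this with `np.polyfit` on made-up points. The helper `slope_by_group` and the toy values are assumptions chosen to show a negative slope in one group and a positive slope in another.

```python
import numpy as np
import pandas as pd

def slope_by_group(frame: pd.DataFrame, group: str, x: str, y: str) -> pd.Series:
    """Least-squares slope of y on x, fitted separately within each group."""
    return pd.Series({
        key: np.polyfit(part[x], part[y], 1)[0]
        for key, part in frame.groupby(group)
    })

# Made-up loans: APR falls with amount for HR and rises slightly for AA
toy = pd.DataFrame({
    'ProsperRating (Alpha)': ['HR', 'HR', 'HR', 'AA', 'AA', 'AA'],
    'LoanOriginalAmount': [1000, 2000, 3000, 5000, 10000, 15000],
    'BorrowerAPR': [0.36, 0.34, 0.32, 0.07, 0.08, 0.09],
})

slopes = slope_by_group(toy, 'ProsperRating (Alpha)', 'LoanOriginalAmount', 'BorrowerAPR')
print(slopes)
```

A slope below zero means APR falls as the loan amount grows within that rating, and a slope above zero means it rises.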

In [None]:
plt.figure(figsize = (10,16))
plt.subplot(3,1,1)
sb.barplot(data=df, x='ProsperRating (Alpha)', y='BorrowerAPR', hue = 'Term')
plt.title('APR vs rating across term')
plt.ylabel('Mean APR')

plt.subplot(3,1,2)
sb.barplot(data=df, x='ProsperRating (Alpha)', y='LoanOriginalAmount', hue = 'Term')
plt.title('LoanAmount vs rating across term')
plt.ylabel('Mean LoanAmount')

plt.subplot(3,1,3)
sb.barplot(data=df, x='ProsperRating (Alpha)', y='StatedMonthlyIncome', hue = 'Term')
plt.title('MonthIncome vs rating across term')
plt.ylabel('Mean MonthIncome');

There is a negative relationship between APR and rating across all terms.
There is a positive relationship between LoanAmount and rating for the 36-month and 60-month terms, but no change for the 12-month term.
For monthly income vs. rating, the behaviour is quite similar across all terms, with a weak positive relationship.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> We confirmed the negative correlation between APR and LoanAmount across all terms. However, looking at this relationship across ratings, we found that it is negative for the lower ratings and, surprisingly, becomes slightly positive for the higher ratings. There is also a negative relationship between mean APR and rating across all terms, while LoanAmount and rating have a positive relationship for the 36-month and 60-month terms. Income and rating behave similarly across terms, with a weak positive relationship.

### Were there any interesting or surprising interactions between features?

> Yes: the negative relationship between APR and LoanAmount becomes slightly positive for the higher ratings.