# Kiva Crowdfunding

*Kiva.org is an online crowdfunding platform that extends financial services to poor and financially excluded people around the world. Kiva lenders have provided over $1 billion in loans to over 2 million people. To set investment priorities, inform lenders, and understand its target communities, Kiva needs to know the poverty level of each borrower. However, this must be inferred from a limited set of information about each borrower.*

## Prediction and Models

* Not all loan requests are funded (about 3383 records in the loans dataset have a funded amount of 0). Is there a pattern among the unfunded loans? [Clustering]

^ Krysten: I think this is EDA; should we remove?


* Predict if a loan request is likely to get funded or not [Logistic Regression] 


* Which loans get funded fastest by country [Linear Regression]


* Time taken for repayment: is this predictable from gender, country/region, activity/sector, loan amount, etc.? [Linear Regression]

^ Krysten: not sure which columns we would use for "Time Taken for Repayment". Are we predicting which loans are re-paid most quickly? Which fields would tell us that? 

* Predict loan requests (count/amount) for a country/region based on past trends (year/month) [Linear Regression]


* Estimate the poverty level of a borrower for a given country/region  [???]

^ Krysten: Not sure about this one...sounds more like EDA to me
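
For the funded/not-funded idea, a minimal sketch of the logistic regression setup might look like the following. It runs on a small synthetic frame rather than on `kiva_loans.csv`. The label rule (`funded_amount >= loan_amount`), the two features, and the generated values are all assumptions made for illustration, not decisions the team has taken.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for kiva_loans.csv (values are invented for illustration).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "loan_amount": rng.integers(100, 5000, size=n),
    "term_in_months": rng.integers(3, 36, size=n),
})

# Assumed pattern for the toy data: larger loans are less likely to be funded.
p_funded = 1 / (1 + np.exp((toy["loan_amount"] - 2500) / 800))
toy["funded_amount"] = np.where(rng.random(n) < p_funded, toy["loan_amount"], 0)

# Assumed label definition: a loan counts as funded when it is fully funded.
toy["funded"] = (toy["funded_amount"] >= toy["loan_amount"]).astype(int)

X = toy[["loan_amount", "term_in_months"]]
y = toy["funded"]

# Scale the features, then fit a plain logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Predicted probability of being funded for each loan request.
proba = model.predict_proba(X)[:, 1]
print(proba[:5].round(2))
```

On the real data, categorical fields such as `country` and `sector` would need encoding, and the model should be scored on a held-out split rather than on the training rows.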


In [1]:
%matplotlib inline

# General libraries.
import re
import math
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import seaborn as sns
from mpl_toolkits.basemap import Basemap
color = sns.color_palette()
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.offline as py


ModuleNotFoundError: No module named 'mpl_toolkits.basemap'

In [None]:
os.getcwd()

## EDA

### Kiva Loans

In [None]:
df_kiva_loans = pd.read_csv("kiva_loans.csv")
df_kiva_loans.head()
df_kiva_loans.describe(include = 'all')

In [None]:
df_kiva_loans.describe()

In [None]:
# Distribution of Loan Amount
plt.figure(figsize=(12,8))
sns.distplot(np.log(df_kiva_loans.loan_amount.values), bins=50, kde=False)
plt.xlabel('log(loan_amount)', fontsize=12)
plt.title("Log Loan Amount Histogram")
plt.show()

In [None]:
# Funded amounts sorted in ascending order, plotted against rank to expose outliers
plt.scatter(range(df_kiva_loans.shape[0]), np.sort(df_kiva_loans.funded_amount.values))

In [None]:
# Correlation heatmap for the Kiva loans dataset (numeric columns only)
corr_loan = df_kiva_loans.select_dtypes(include='number').corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr_loan, 
            xticklabels=corr_loan.columns.values,
            yticklabels=corr_loan.columns.values, cmap="YlGnBu",annot=True,square=True)
plt.title('Correlation - Loan Data')
corr_loan

**OBSERVATION: Funded Amount, Loan Amount and Lender Count are highly correlated.**
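
To read the strongly correlated pairs off the matrix instead of the heatmap, one option is to stack the upper triangle and filter it. The sketch below uses a toy frame whose column names mirror the loans dataset; the values and the 0.9 cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric frame; column names mirror the loans dataset, values are invented.
toy = pd.DataFrame({
    "funded_amount": [100, 250, 400, 800, 1200],
    "loan_amount": [100, 300, 400, 900, 1200],
    "lender_count": [4, 9, 15, 27, 40],
    "term_in_months": [14, 8, 20, 6, 12],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

# The 0.9 cutoff is an arbitrary threshold chosen for this sketch.
high_pairs = pairs[pairs.abs() > 0.9]
print(high_pairs)
```

Highly correlated predictors such as these would likely be redundant in a regression model, so it may make sense to keep only one of them as a feature.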


In [None]:
# Identify rows with no funding (funded_amount == 0)
df_nofund = df_kiva_loans[(df_kiva_loans['funded_amount']==0)]
df_nofund.shape


 **OBSERVATION: There are about 3383 rows with funded amount = 0**  <font size="4">$\color{red}{\text{Is there any pattern among loans that are not funded?}}$ </font>


In [None]:
# Plot side by side: top countries by number of loans and by number of unfunded loans
plt.subplot(1, 2, 1)
df_kiva_loans['country'].value_counts().head(10).plot(kind='bar', title="Top 10 Countries by Number of Loans", figsize=(15, 10), legend=True, fontsize=12)
plt.subplot(1, 2, 2)
df_nofund['country'].value_counts().head(10).plot(kind='bar', title="Top 10 Countries by Number of Unfunded Loans", figsize=(15, 10), legend=True, fontsize=12, color='DarkGreen')

**OBSERVATION: The Philippines, Kenya and El Salvador are the top 3 countries in the loans dataset. The United States, Kenya and Pakistan are the top countries for unfunded loans.**

In [None]:
# Plot side by side: top sectors by number of loans and by number of unfunded loans
plt.subplot(1, 2, 1)
df_kiva_loans['sector'].value_counts().head(10).plot(kind='bar', title="Top 10 Sectors by Number of Loans", figsize=(15, 10), legend=True, fontsize=12)
plt.subplot(1, 2, 2)
df_nofund['sector'].value_counts().head(10).plot(kind='bar', title="Top 10 Sectors by Number of Unfunded Loans", figsize=(15, 10), legend=True, fontsize=12, color='DarkGreen')

**OBSERVATION: Agriculture, Food and Retail are the top 3 sectors by number of loans. They are also the top 3 sectors by number of unfunded loans.**
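
Raw counts of unfunded loans tend to track overall loan volume, so a sector topping both charts is not necessarily underfunded. Comparing the share of unfunded loans per group is one way to separate the two effects. This is a sketch on invented data; only the `sector` and `funded_amount` column names come from the dataset.

```python
import pandas as pd

# Invented toy data: Agriculture has the most loans, Housing the fewest.
toy = pd.DataFrame({
    "sector": ["Agriculture"] * 6 + ["Retail"] * 3 + ["Housing"] * 1,
    "funded_amount": [300, 0, 500, 250, 400, 150, 200, 0, 350, 0],
})

# Flag loans that received no funding at all.
toy["unfunded"] = toy["funded_amount"] == 0

# Per sector: number of loans, number unfunded, and the unfunded share.
summary = toy.groupby("sector")["unfunded"].agg(
    loans="size", unfunded="sum", unfunded_rate="mean"
)
summary = summary.sort_values("unfunded_rate", ascending=False)
print(summary)
```

In this toy frame every sector has one unfunded loan, yet the rates differ widely. Groups with very few loans will have noisy rates, so a minimum-count filter may be worth applying on the real data.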

In [None]:
# Plot side by side: top activities by number of loans and by number of unfunded loans
plt.subplot(1, 2, 1)
df_kiva_loans['activity'].value_counts().head(10).plot(kind='bar', title="Top 10 Activities by Number of Loans", figsize=(15, 10), legend=True, fontsize=12)
plt.subplot(1, 2, 2)
df_nofund['activity'].value_counts().head(10).plot(kind='bar', title="Top 10 Activities by Number of Unfunded Loans", figsize=(15, 10), legend=True, fontsize=12, color='DarkGreen')

**OBSERVATION: Farming is the top activity by number of loans, followed by General Store and Personal Housing Expenses.**

In [None]:
# Plot side by side: top uses by number of loans and by number of unfunded loans
plt.subplot(1, 2, 1)
df_kiva_loans['use'].value_counts().head(10).plot(kind='bar', title="Top 10 Uses by Number of Loans", figsize=(15, 10), legend=True, fontsize=12)
plt.subplot(1, 2, 2)
df_nofund['use'].value_counts().head(10).plot(kind='bar', title="Top 10 Uses by Number of Unfunded Loans", figsize=(15, 10), legend=True, fontsize=12, color='DarkGreen')

In [None]:
# Plot log loan amount by sector
plt.figure(figsize=(16,16))
sns.boxplot(x=df_kiva_loans['sector'], y=np.log(df_kiva_loans.loan_amount.values))


In [None]:
#Loan Count - Split by Gender
df_kiva_loans['borrower_genders']=[elem if elem in ['female','male'] else 'Other' for elem in df_kiva_loans['borrower_genders'] ]

df_kiva_loans['borrower_genders'].value_counts().head(3).plot(kind='pie', title ="Loan by Gender", figsize=(15, 10), legend=True, fontsize=12)

**OBSERVATION: There are more female borrowers than male borrowers.**
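
The `Other` bucket above lumps together everything that is not exactly `female` or `male`. If those remaining values are comma-separated lists of genders for group loans (an assumption about the column format, not something verified here), the buckets could be split further. The helper name `classify` and the sample values are hypothetical.

```python
import pandas as pd

# Invented values; assumes group loans list one gender per borrower, comma-separated.
genders = pd.Series(["female", "male", "female, female", "female, male", None])

def classify(value):
    """Label a loan by the mix of borrower genders listed on it."""
    if pd.isna(value):
        return "unknown"
    parts = {p.strip() for p in value.split(",")}
    if parts == {"female"}:
        return "female only"
    if parts == {"male"}:
        return "male only"
    return "mixed"

composition = genders.map(classify)
print(composition.value_counts())
```

This keeps all-female and all-male groups with their single-borrower counterparts and isolates mixed groups and missing values.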

In [None]:
# Determine time taken for funding: funded_time - posted_time

# Copy so the new columns are not written to a view of df_kiva_loans
df_funded = df_kiva_loans[df_kiva_loans['funded_amount'] != 0].copy()
# Drop rows missing either timestamp
df_funded = df_funded.dropna(subset=['funded_time', 'posted_time'])
df_funded.shape

df_funded['funded_time'] = pd.to_datetime(df_funded['funded_time'])
df_funded['posted_time'] = pd.to_datetime(df_funded['posted_time'])
time_to_fund = df_funded.funded_time - df_funded.posted_time
# Convert the timedelta to fractional days
time_to_fund_in_days = time_to_fund.dt.total_seconds() / (3600 * 24)
df_funded = df_funded.assign(time_to_fund=time_to_fund)
df_funded = df_funded.assign(time_to_fund_in_days=time_to_fund_in_days)
df_funded.time_to_fund_in_days.plot.hist();

In [None]:
# Funded Amount vs. Term
plt.scatter(df_kiva_loans['funded_amount'], df_kiva_loans['term_in_months'])
plt.xlabel('Funded Amount')
plt.ylabel('Term in Months')
plt.title('Scatter Plot of Funded Amount and Term')

In [None]:
countries_funded_amount = df_kiva_loans.groupby('country')['funded_amount'].mean().sort_values(ascending=False)
print("Top countries by mean funded amount\n", countries_funded_amount.head(10))

In [None]:
data = [dict(
        type='choropleth',
        locations=countries_funded_amount.index,
        locationmode='country names',
        z=countries_funded_amount.values,
        text=countries_funded_amount.index,
        colorscale='Reds',
        marker=dict(line=dict(width=0.7)),
        colorbar=dict(tickprefix='', title='Mean funded amount'),
)]
layout = dict(title='Mean Funded Amount by Country',
              geo=dict(
                  showframe=False,
                  #showcoastlines = False,
                  projection=dict(
                      type='mercator'
                  )
              ))
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)

**OBSERVATION: Cote D'Ivoire, Mauritania and Bhutan have the highest mean funded amounts.**

In [None]:
funded_time = df_funded.groupby('country')['time_to_fund_in_days'].mean().sort_values(ascending=False)
print("Top countries by mean time to fund (days)\n", funded_time.head(10))

In [None]:
data = [dict(
        type='choropleth',
        locations=funded_time.index,
        locationmode='country names',
        z=funded_time.values,
        text=funded_time.index,
        colorscale='Reds',
        marker=dict(line=dict(width=0.7)),
        colorbar=dict(tickprefix='', title='Mean time to fund (days)'),
)]
layout = dict(title='Mean Time to Fund by Country',
              geo=dict(
                  showframe=False,
                  #showcoastlines = False,
                  projection=dict(
                      type='mercator'
                  )
              ))
fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False)


**OBSERVATION: Loans in the United States, Puerto Rico and Vanuatu take the longest to fund on average.**

## MPI Region/Location

In [None]:
df_region = pd.read_csv("kiva_mpi_region_locations.csv")
df_region.head()

In [None]:
df_region['world_region'].unique()

In [None]:
m=Basemap(llcrnrlon=-160, llcrnrlat=-75,urcrnrlon=160,urcrnrlat=80)
m.drawmapboundary(fill_color='#A6CAE0', linewidth=0)
m.fillcontinents(color='grey', alpha=0.7, lake_color='grey')
m.drawcoastlines(linewidth=0.1, color="white")
 
# Add a marker per region in the data frame (Basemap expects x=longitude, y=latitude)
m.plot(df_region['lon'], df_region['lat'], linestyle='none', marker="o", markersize=16, alpha=0.6, c="orange", markeredgecolor="black", markeredgewidth=1)

In [None]:
mpi_region = df_region.groupby('world_region')['MPI'].mean().sort_values(ascending=False)
print("World regions by mean MPI\n", mpi_region.head(10))
mpi_region.plot(kind='bar')

**OBSERVATION: As expected, Africa has the highest mean MPI (the most poverty), followed by South Asia.**

## Loan Theme

In [None]:
df_theme = pd.read_csv("loan_theme_ids.csv")
df_theme.head()

In [None]:
df_theme['Loan Theme Type'].unique()

In [None]:
df_theme_reg = pd.read_csv("loan_themes_by_region.csv")
df_theme_reg