#### Kiva provided files
- loans.csv
- kiva_mpi_region_locations.csv
- loan_theme_ids.csv
- loan_themes_by_region.csv


- all_kiva_loans: larger version of loans.csv with more rows and some different columns

- mpi_on_regions: amount invested in a region and the biggest problems that region faces.
    - all_loan_theme_merged_with_geo_mpi_regions: A left join from mpi_on_regions on loan_themes_by_region
- Contribution_of_Deprivations: This table shows which dimensions and indicators contribute most to a region's MPI, which is useful for understanding the major source(s) of deprivation in a sub-national region
- SubNational_Decomposition_MPI_2017_18
    - missing 5.1, 5.2, and 5.4 from the datasource?
- MPI_estimations_country_levels: all MPI data, 5 years, not joined with the Kiva tables
- unique_regions_from_kiva_loan_themes: list of unique regions from Kiva dataset


#### Variables OG Kiva Loans
- id Unique ID for loan

- sector High level category
- activity More granular category
- use Exact usage of loan amount. This is manually entered text; it could be mined for key words that give detail (e.g. a word cloud).
- tags
- CONNECT TO LOAN THEME IDS TO GET LOAN THEME TYPE

- borrower_genders Comma separated M,F letters, where each instance represents a single male/female in the group


- country_code ISO country code of country in which loan was disbursed
- country Full country name of country in which loan was disbursed
- region Full region name within the country
- currency The currency in which the loan was disbursed

- partner_id ID of partner organization (field partner with agents)

- funded_amount The amount disbursed by Kiva to the field agent (USD)
- posted_time The time at which the loan is posted on Kiva by the field agent
- funded_time The time at which the loan posted to Kiva gets funded by lenders completely
- loan_amount The amount disbursed by the field agent to the borrower (USD)
- repayment_interval
- lender_count The total number of lenders that contributed to this loan
- disbursed_time The time at which the loan is disbursed by the field agent to the borrower
- term_in_months The duration for which the loan was disbursed in months


- date UNKNOWN


---
- id: Identity value for each row in the dataset.
- loan_amount: The loan amount requested by the borrower from the organization.
- activity: The work in which the borrower is engaged.
- sector: The sector to which the borrowing organization or person belongs.
- country_code: Code of the country to which the borrower belongs.
- country: Name of the country to which the borrower belongs.
- region: The region inside the country where the organization or person resides.
- currency: Currency in which the loan is lent by Kiva.
- partner_id: Unique ID assigned to each field partner.
- posted_time: Date and time when the loan was posted on Kiva.
- disbursed_time: Date and time when the loan was disbursed to the borrower.
- funded_time: Date and time when the loan was funded completely.
- term_in_months: Duration in months after which the loan has to be repaid by the borrower.
- lender_count: Total number of lenders who have collaboratively funded the amount.
- borrower_genders: List of the genders of all the borrowers involved in a loan.
- repayment_interval: How frequently the borrower makes repayments on the loan.
- date: Date on which the loan was posted.

## Import

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

In [107]:
# OG files
og_kiva_loans = pd.read_csv('OG_kiva/kiva_loans.csv') # Kiva product data
og_kiva_mpi_region_locations = pd.read_csv('OG_kiva/kiva_mpi_region_locations.csv') # metadata about location
og_loan_theme_ids = pd.read_csv('OG_kiva/loan_theme_ids.csv') # aggregated theme info
og_loan_themes_by_region = pd.read_csv('OG_kiva/loan_themes_by_region.csv') # detailed partner and theme info
                                                           
# Derived files
kiva_loans = pd.read_csv('mpi-on-regions/all_kiva_loans.csv') # version of kiva_loans
loan_themes_by_region_JOIN_mpi_regions = pd.read_excel('mpi-on-regions/all_loan_theme_merged_with_geo_mpi_regions.xlsx')
country_stats = pd.read_csv('mpi-on-regions/country_stats.csv')
mpi_on_regions = pd.read_excel('mpi-on-regions/mpi_on_regions.xlsx')
unique_region_country = pd.read_excel('mpi-on-regions/unique_regions_from_kiva_loan_themes.xlsx')

# Dirty data. Missing values, unnamed columns
Contribution_of_Deprivations = pd.read_csv('mpi-on-regions/Tables_5.3_Contribution_of_Deprivations.csv', encoding = "ISO-8859-1")
SubNational_Decomposition_MPI_2017_18 = pd.read_excel('mpi-on-regions/Tables_5_SubNational_Decomposition_MPI_2017-18.xlsx')
MPI_estimations_country_levels = pd.read_excel('mpi-on-regions/Tables_7_MPI_estimations_country_levels.xlsx')

## Exploration
- borrower_genders column needs changing
- remove nan
- merge OG datasets to make denormalized dataset at loan level
- group the merged dataset by region, country, industry, gender etc.
- look into date values
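
The grouping step in the list above could look roughly like the following. This is only a sketch on a toy frame: the rows are invented for illustration, and the output column names (`num_loans`, `total_loan_amount`, `mean_lender_count`) are assumptions, not names used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the loan-level table (values are made up for illustration).
loans = pd.DataFrame({
    'country': ['Kenya', 'Kenya', 'Peru'],
    'sector': ['Food', 'Food', 'Retail'],
    'loan_amount': [300.0, 500.0, 1000.0],
    'lender_count': [10, 12, 30],
})

# One row per (country, sector) with a few summary statistics.
summary = (
    loans.groupby(['country', 'sector'], as_index=False)
         .agg(num_loans=('loan_amount', 'size'),
              total_loan_amount=('loan_amount', 'sum'),
              mean_lender_count=('lender_count', 'mean'))
)
print(summary)
```

The same pattern should extend to region or gender flags by adding them to the list of grouping keys.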

#### Drop Null

In [108]:
# don't drop na's: use, tags, funded time and disbursed time (marks incomplete?)
og_kiva_loans.isnull().sum()/og_kiva_loans.shape[0]*100

id                     0.000000
funded_amount          0.000000
loan_amount            0.000000
activity               0.000000
sector                 0.000000
use                    0.630508
country_code           0.001192
country                0.000000
region                 8.462392
currency               0.000000
partner_id             2.012351
posted_time            0.000000
disbursed_time         0.356970
funded_time            7.200632
term_in_months         0.000000
lender_count           0.000000
tags                  25.538546
borrower_genders       0.628869
repayment_interval     0.000000
date                   0.000000
dtype: float64

In [109]:
# let's drop these. It leaves us with 92% of the original dataset.
og_kiva_loans.dropna(inplace=False, subset=['borrower_genders','partner_id','region','country_code','funded_time']).shape[0]/og_kiva_loans.shape[0]*100
og_kiva_loans_drop_na = og_kiva_loans.dropna(inplace=False, subset=['borrower_genders','partner_id','region','country_code','funded_time'])

#### Borrower gender
- num_borrowers
- num_female
- num_male
- has_female
- has_male
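
One possible way to derive the five columns listed above is to split `borrower_genders` into individual tokens and count them. The sketch below runs on a toy frame with invented values; the helper `count_genders` is hypothetical and is not part of this notebook.

```python
import pandas as pd

# Toy stand-in for the borrower_genders column (values are illustrative).
loans = pd.DataFrame({
    'borrower_genders': ['female', 'male', 'male, female', 'female, female, male', None]
})

def count_genders(value):
    """Return (num_borrowers, num_female, num_male) for one borrower_genders value."""
    if pd.isna(value):
        return 0, 0, 0
    # Exact token comparison, so 'male' is never matched inside 'female'.
    tokens = [t.strip() for t in str(value).split(',')]
    return len(tokens), tokens.count('female'), tokens.count('male')

counts = pd.DataFrame(
    loans['borrower_genders'].apply(count_genders).tolist(),
    index=loans.index,
    columns=['num_borrowers', 'num_female', 'num_male'],
)
loans = loans.join(counts)
loans['has_female'] = loans['num_female'] > 0
loans['has_male'] = loans['num_male'] > 0
```

Counting exact tokens rather than searching for substrings avoids confusing `male` with `female`, whichever gender appears first in the string.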

In [110]:
og_kiva_loans_drop_na['num_borrowers'] = og_kiva_loans_drop_na['borrower_genders'].apply(lambda x: len(str(x).split(',')))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
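
The warning above most likely arises because pandas treats the result of `dropna` as a possible slice of the original frame. A common way to avoid it is to take an explicit `.copy()` before adding columns. The sketch below shows this on a toy frame; the data and the name `loans_clean` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for og_kiva_loans (values are made up).
loans = pd.DataFrame({
    'borrower_genders': ['female', None, 'male, female'],
    'region': ['A', 'B', None],
})

# .copy() makes the filtered frame an independent object, so later column
# assignments are unambiguous and leave the original frame untouched.
loans_clean = loans.dropna(subset=['borrower_genders', 'region']).copy()
loans_clean['num_borrowers'] = loans_clean['borrower_genders'].str.split(',').str.len()
```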


In [111]:
# Split into individual gender tokens first. Substring checks such as ' male' in x
# miss a leading 'male' (e.g. 'male, female' would be flagged as female-only).
genders = og_kiva_loans_drop_na['borrower_genders'].apply(lambda x: [g.strip() for g in str(x).split(',')])
has_male = genders.apply(lambda g: 'male' in g)
has_female = genders.apply(lambda g: 'female' in g)

og_kiva_loans_drop_na['only_male_borrowers'] = (has_male & ~has_female).astype(int)
og_kiva_loans_drop_na['only_female_borrowers'] = (has_female & ~has_male).astype(int)
og_kiva_loans_drop_na['both_male_female_borrowers'] = (has_male & has_female).astype(int)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


#### Merge
- Merge all OG datasets with loan id as primary key.
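
A minimal sketch of such a merge, assuming the loans table and the theme table share an `id` column with one row per loan. The toy frames and the `Loan Theme Type` column name are assumptions for illustration; check them against the actual CSV headers before use.

```python
import pandas as pd

# Toy stand-ins for og_kiva_loans and og_loan_theme_ids (values are made up).
loans = pd.DataFrame({'id': [1, 2, 3], 'sector': ['Food', 'Retail', 'Agriculture']})
themes = pd.DataFrame({'id': [1, 3], 'Loan Theme Type': ['General', 'Agriculture']})

# Left join keeps every loan; validate= raises if 'id' is duplicated on either side,
# and indicator= adds a '_merge' column showing which loans found a theme.
merged = loans.merge(themes, on='id', how='left', validate='one_to_one', indicator=True)
unmatched = (merged['_merge'] == 'left_only').sum()
print(f'{unmatched} of {len(merged)} loans have no theme row')
```

A left join keeps loans without a theme record, so the row count stays at loan level, which is what a denormalized loan-level table needs.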

In [112]:
og_kiva_loans_drop_na['time_from_posted_to_funded'] = pd.to_datetime(og_kiva_loans_drop_na['funded_time']) - pd.to_datetime(og_kiva_loans_drop_na['posted_time'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [143]:
# Convert the timedelta to minutes in one vectorized step. Row-wise .iloc with a column
# label raises a ValueError, and .seconds drops whole days; total_seconds() keeps them.
og_kiva_loans_drop_na['time_from_posted_to_funded'] = og_kiva_loans_drop_na['time_from_posted_to_funded'].dt.total_seconds()/60

In [119]:
og_kiva_loans_drop_na.loc[:,['id', 'loan_amount', 'activity', 'sector',
                           'country_code', 'country', 'region', 'currency', 
                           'posted_time', 'funded_time', 
                            'term_in_months',
                           'lender_count', 'repayment_interval',
                           'borrower_genders', 'num_borrowers', 'only_male_borrowers', 'only_female_borrowers',
                           'both_male_female_borrowers',
                           'time_from_posted_to_funded']].sample(200).to_csv('og_kiva_loans_sample.csv', index=False)

In [120]:
og_kiva_loans_drop_na.loc[:,['id', 'loan_amount', 'activity', 'sector',
                           'country_code', 'country', 'region', 'currency', 
                           'posted_time', 'funded_time', 
                            'term_in_months',
                           'lender_count', 'repayment_interval',
                           'borrower_genders', 'num_borrowers', 'only_male_borrowers', 'only_female_borrowers',
                           'both_male_female_borrowers',
                           'time_from_posted_to_funded']].to_csv('og_kiva_loans.csv', index=False)

# IMPORT SUPPLEMENTARY DATA

In [124]:
df = pd.read_csv('og_kiva_loans.csv')