# Calculate Similarity
Given the diverse set of features here (about location, borrower, and the loan itself) let's just start with a really simple similarity metric.  We'll use Jaccard similarity, where each element of a user's set of loans will just get included in a set of loan elements.  Then we'll find a loan that's currently eligible with a large number of overlapping elements.

We could eventually measure similarity across different metrics and weight more important ones more heavily, but to start let's just throw everything in the same set.

The downside of this is that if someone consistently prefers the same country (for example) that's only indirectly accounted for in the small size of the set of elements (because that country would be counted only once, though it occurred many times.)  However, this will work as a proof of concept.

In [1]:
import pandas as pd
import requests

from country import country_to_continent
from utils import eval_string

path = '/Users/brianna/Dropbox/data_project/loan_project/data/'


## Get loan elements for each eligible loan
Include the following features:
- country
- continent
- sector
- tags
- themes

In [2]:
# Make a dictionary with {loan: set of loan elements} for each currently eligible loan.
# Eventually make a cron job to grab new loans every day.
loans = pd.DataFrame.from_csv('%sloans_10k.csv' % path)
loans['tags'] = loans.tags.apply(lambda x: eval_string(x))
loans['themes'] = loans.themes.apply(lambda x: eval_string(x))

all_loan_details = {}
for i in loans.index:
    # Include location and sector info, tags, and themes
    loan_elements = set([loans['country'][i], loans['continent'][i], loans['sector'][i]])
    loan_elements.update(loans['tags'][i] + loans['themes'][i])
    all_loan_details[loans['loan_id'][i]] = loan_elements


## For a specific user, create a set of all of the elements of his/her previous loans.
Let's look at my loan history for an example.  Because I lived in Uganda and I'm a big fan of woman-owned businesses, I tend to focus on East Africa and women.

In [3]:
def get_user_loan_elements(user):
    url = 'http://api.kivaws.org/v1/lenders/{user}/loans.json'.format(user=user)
    response = requests.get(url)
    lender = eval(response.content.replace('false', 'False').replace('true', 'True'))

    user_loan_elements = set()

    for loan in range(len(lender['loans'])):
        if 'country' in lender['loans'][loan]['location']:
            user_loan_elements.update([lender['loans'][loan]['location']['country']])
            user_loan_elements.update([country_to_continent.get(lender['loans'][loan]['location']['country'])])

        if 'sector'in lender['loans'][loan]:
            user_loan_elements.update([lender['loans'][loan]['sector']])

        if 'tags' in lender['loans'][loan]:
            tags = [k['name'].strip('#') for k in lender['loans'][loan]['tags']]
            user_loan_elements.update(tags)

        if 'themes' in lender['loans'][loan]:
            themes = lender['loans'][loan]['themes']
            user_loan_elements.update(themes)
        
    return user_loan_elements

In [4]:
user = 'brianna9306'
user_loan_elements = get_user_loan_elements(user)

In [5]:
user_loan_elements

{'Africa',
 'Agriculture',
 'Animals',
 'Clothing',
 'Eco-friendly',
 'Fabrics',
 'Food',
 'Green',
 'Growing Businesses',
 'Health and Sanitation',
 'Housing',
 'Interesting Photo',
 'Job Creation',
 'Job Creator',
 'Kenya',
 'Parent',
 'Repeat Borrower',
 'Retail',
 'Rural Exclusion',
 'Schooling',
 'Single Parent',
 'Social Enterprise',
 'Technology',
 'Uganda',
 'Unique',
 'Vegan',
 'Vulnerable Groups',
 'Woman Owned Biz',
 'Youth',
 'user_favorite'}

## Find which loans have the highest overlap with the user's loans

In [6]:
def jaccard_distance(x, user_loan_elements):
    intersection = len(set.intersection(x, user_loan_elements))
    union = len(set.union(x, user_loan_elements))
    if union > 0:
        return intersection/float(union)
    else:
        return 0

# Use dict comprehension to find which of those loans has the highest overlap with the user_loan_elements
loan_similarity = {jaccard_distance(v, user_loan_elements): k for k, v in all_loan_details.iteritems()}

print('***TOP FIVE SIMILAR LOANS for %s***' % user)
for best_similarity in sorted(loan_similarity, reverse=True)[:5]:
    loan_id = loan_similarity[best_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, best_similarity))
    print(all_loan_details[loan_id])
    
print('\n\n***TOP FIVE WORST LOANS for %s***' % user)
for worst_similarity in sorted(loan_similarity)[:5]:
    loan_id = loan_similarity[worst_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, worst_similarity))
    print(all_loan_details[loan_id])

***TOP FIVE SIMILAR LOANS for brianna9306***

Loan: 1126189, similarity: 0.354838709677
set(['Widowed', 'user_favorite', 'Parent', 'Food', 'Africa', 'Schooling', 'Vegan', 'Uganda', 'Single Parent', 'Woman Owned Biz', 'Repeat Borrower', 'Rural Exclusion'])

Loan: 1129601, similarity: 0.333333333333
set(['user_favorite', 'Parent', 'Africa', 'Schooling', 'Uganda', 'Single Parent', 'Woman Owned Biz', 'Repeat Borrower', 'Clothing', 'Rural Exclusion'])

Loan: 1135181, similarity: 0.3
set(['user_favorite', 'Parent', 'Africa', 'Schooling', 'Eco-friendly', 'Uganda', 'Woman Owned Biz', 'Clothing', 'Rural Exclusion'])

Loan: 1134815, similarity: 0.290322580645
set(['Widowed', 'user_favorite', 'Parent', 'Food', 'Africa', 'Vegan', 'Single Parent', 'Woman Owned Biz', 'Kenya', 'Repeat Borrower'])

Loan: 1130773, similarity: 0.28125
set(['user_favorite', 'Parent', 'Food', 'Female Education', 'Africa', 'Schooling', 'Vegan', 'Senegal', 'Woman Owned Biz', 'Repeat Borrower', 'Rural Exclusion'])


***TOP F

Four of my top five "similar" loans are women-owned businesses in East Africa, nice!  And conversely, my five least good fits are South America, where I've never given a loan.

In [7]:
# Let's try someone with different preferences to see how it fits.
user = 'rafael7312'
user_loan_elements = get_user_loan_elements(user)
user_loan_elements

{'Agriculture',
 'Animals',
 'Arts',
 'Biz Durable Asset',
 'Clothing',
 'Colombia',
 'Elderly',
 'Fabrics',
 'First Loan',
 'Flexible Credit Study',
 'Food',
 'IPA Study',
 'Innovative Loans',
 'Manufacturing',
 'Parent',
 'Refugees\\/Displaced',
 'Repeat Borrower',
 'Retail',
 'Rural Exclusion',
 'Services',
 'Single Parent',
 'South_America',
 'Vulnerable Groups',
 'Woman Owned Biz',
 'user_favorite',
 'volunteer_like',
 'volunteer_pick'}

In [8]:
# Use dict comprehension to find which of those loans has the highest overlap with the user_loan_elements
loan_similarity = {jaccard_distance(v, user_loan_elements): k for k, v in all_loan_details.iteritems()}

print('***TOP FIVE SIMILAR LOANS for %s***' % user)
for best_similarity in sorted(loan_similarity, reverse=True)[:5]:
    loan_id = loan_similarity[best_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, best_similarity))
    print(all_loan_details[loan_id])
    
print('\n\n***TOP FIVE WORST LOANS for %s***' % user)
for worst_similarity in sorted(loan_similarity)[:5]:
    loan_id = loan_similarity[worst_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, worst_similarity))
    print(all_loan_details[loan_id])

***TOP FIVE SIMILAR LOANS for rafael7312***

Loan: 1135258, similarity: 0.428571428571
set(['Biz Durable Asset', 'IPA Study', 'user_favorite', 'First Loan', 'Innovative Loans', 'Flexible Credit Study', 'South_America', 'Colombia', 'Services', 'Woman Owned Biz', 'Parent', 'Repair Renew Replace', 'Fabrics'])

Loan: 1136283, similarity: 0.407407407407
set(['Biz Durable Asset', 'IPA Study', 'Parent', 'First Loan', 'Food', 'Flexible Credit Study', 'South_America', 'Elderly', 'Innovative Loans', 'Colombia', 'Woman Owned Biz'])

Loan: 1132112, similarity: 0.392857142857
set(['Biz Durable Asset', 'IPA Study', 'user_favorite', 'First Loan', 'Innovative Loans', 'Flexible Credit Study', 'South_America', 'Elderly', 'Colombia', 'Services', 'Parent', 'Technology'])

Loan: 1138986, similarity: 0.37037037037
set(['IPA Study', 'First Loan', 'Innovative Loans', 'Flexible Credit Study', 'South_America', 'Elderly', 'Colombia', 'Single Parent', 'Woman Owned Biz', 'Retail'])

Loan: 1138213, similarity: 0.35

Nice, a lot of "Innovative Loans" in South America.  It would take A/B tests to validate whether these loans would actually be more persuasive to people and result in more loans given, but at face value it's definitely promising.