# Calculate Similarity
Given the diverse set of features here (about location, borrower, and the loan itself), let's start with a really simple similarity metric.  We'll use Jaccard similarity: every element from a user's past loans (country, sector, tags, and so on) goes into one set, and each currently eligible loan gets its own set of elements.  Then we'll look for the eligible loans whose sets overlap most with the user's set.
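To make the metric concrete, here is a minimal sketch of Jaccard similarity on two small sets. The element values and the helper name `jaccard_similarity` are made up for illustration; they are not taken from the Kiva data.

```python
def jaccard_similarity(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))

# Hypothetical element sets, for illustration only
user_elements = {'Kenya', 'Africa', 'Agriculture', 'Retail'}
candidate_elements = {'Kenya', 'Africa', 'Food'}

# 2 shared elements ('Kenya', 'Africa') out of 5 distinct elements overall
score = jaccard_similarity(user_elements, candidate_elements)
print(score)
```

A score of 1.0 means the two sets are identical and 0.0 means they share nothing.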

We could eventually measure similarity across different metrics and weight more important ones more heavily, but to start let's just throw everything in the same set.
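As a rough sketch of what that later version might look like, the snippet below computes a Jaccard score per feature category and combines the scores with weights. The category names, the weights, and the function `weighted_similarity` are all assumptions for illustration; nothing in this notebook uses them yet.

```python
def jaccard_similarity(a, b):
    """Return |a & b| / |a | b|, or 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / float(len(union))

def weighted_similarity(user_features, loan_features, weights):
    """Weighted average of the per-category Jaccard scores."""
    total_weight = float(sum(weights.values()))
    if total_weight == 0:
        return 0.0
    score = 0.0
    for category, weight in weights.items():
        user_set = user_features.get(category, set())
        loan_set = loan_features.get(category, set())
        score += weight * jaccard_similarity(user_set, loan_set)
    return score / total_weight

# Hypothetical weights: country counts twice as much as sector or tags
weights = {'country': 2.0, 'sector': 1.0, 'tags': 1.0}
user_features = {'country': {'Kenya', 'Uganda'},
                 'sector': {'Agriculture', 'Retail'},
                 'tags': {'Eco-friendly'}}
loan_features = {'country': {'Kenya'},
                 'sector': {'Agriculture'},
                 'tags': {'Vegan'}}

# country: 0.5, sector: 0.5, tags: 0.0 -> (2*0.5 + 1*0.5 + 1*0.0) / 4
weighted = weighted_similarity(user_features, loan_features, weights)
print(weighted)
```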

The downside of this is that if someone consistently prefers the same country (for example), that preference is only indirectly accounted for, through the smaller size of the set of elements: the country is counted only once, even though it occurred many times.  However, this will work as a proof of concept.
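The toy example below illustrates that downside and one possible alternative. The loan history is invented, and `frequency_weighted_score` is a hypothetical scoring function (not Jaccard, and not used elsewhere in this notebook) that rewards elements the user has picked often.

```python
from collections import Counter

# Hypothetical history: one list of elements per past loan
past_loans = [
    ['Kenya', 'Africa', 'Agriculture'],
    ['Kenya', 'Africa', 'Retail'],
    ['Kenya', 'Africa', 'Food'],
    ['Peru', 'South_America', 'Food'],
]

element_counts = Counter(e for loan in past_loans for e in loan)
flattened = set(element_counts)  # 'Kenya' (3 loans) and 'Peru' (1 loan) each appear once

def jaccard(a, b):
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

def frequency_weighted_score(candidate, element_counts, n_loans):
    """Average share of past loans that contain each element of the candidate."""
    if not candidate:
        return 0.0
    total = sum(element_counts[e] for e in candidate)
    return total / float(n_loans * len(candidate))

kenya_loan = {'Kenya', 'Africa', 'Clothing'}
peru_loan = {'Peru', 'South_America', 'Clothing'}

# Plain Jaccard on the flattened set cannot tell the two candidates apart
print(jaccard(kenya_loan, flattened), jaccard(peru_loan, flattened))

# The frequency-weighted score prefers the Kenyan loan
kenya_score = frequency_weighted_score(kenya_loan, element_counts, len(past_loans))
peru_score = frequency_weighted_score(peru_loan, element_counts, len(past_loans))
print(kenya_score, peru_score)
```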

In [1]:
import pandas as pd
import pickle
import requests

from country import country_to_continent
from utils import eval_string

path = '/Users/brianna/Dropbox/data_project/loan_project/data/'


## Get loan elements for each eligible loan
Include the following features:
- country
- continent
- sector
- tags
- themes

In [2]:
# Load the dictionary with loan ids and elements for each loan
loan_elements = pickle.load( open( "%sloan_elements.pickle" % path, "rb" ) )
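The pickle is built elsewhere, so the sketch below only shows the shape the rest of this notebook relies on: a dictionary keyed by loan id whose values hold a set under `'elements'`. The loan record, the helper `build_loan_elements`, and the stand-in continent lookup are assumptions for illustration, modeled on the fields read from the API later on.

```python
# Stand-in for country_to_continent (assumed here to behave like a dict)
continent_lookup_example = {'Kenya': 'Africa'}

def build_loan_elements(loan, continent_lookup):
    """Collect country, continent, sector, tags, and themes into one set."""
    elements = set()
    country = loan.get('location', {}).get('country')
    if country:
        elements.add(country)
        continent = continent_lookup.get(country)
        if continent:
            elements.add(continent)
    if 'sector' in loan:
        elements.add(loan['sector'])
    # Tag names carry a leading '#', which is stripped
    elements.update(tag['name'].strip('#') for tag in loan.get('tags', []))
    elements.update(loan.get('themes', []))
    return elements

# Hypothetical loan record with only the fields used above
example_loan = {
    'id': 1,
    'location': {'country': 'Kenya'},
    'sector': 'Agriculture',
    'tags': [{'name': '#Eco-friendly'}],
    'themes': ['Rural Exclusion'],
}

loan_elements_example = {
    example_loan['id']: {
        'elements': build_loan_elements(example_loan, continent_lookup_example)
    }
}
print(loan_elements_example)
```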

## For a specific user, create a set of all of the elements of his/her previous loans.
Let's look at my loan history for an example.  Because I lived in Uganda and I'm a big fan of woman-owned businesses, I tend to focus on East Africa and women.

In [3]:
def get_user_loan_elements(user):
    """Return the set of elements (country, continent, sector, tags, themes) across a lender's loans."""
    url = 'http://api.kivaws.org/v1/lenders/{user}/loans.json'.format(user=user)
    response = requests.get(url)
    # Parse the JSON directly instead of eval()-ing text returned by a remote server
    lender = response.json()

    user_loan_elements = set()

    for loan in lender['loans']:
        if 'country' in loan['location']:
            country = loan['location']['country']
            user_loan_elements.update([country, country_to_continent.get(country)])

        if 'sector' in loan:
            user_loan_elements.add(loan['sector'])

        if 'tags' in loan:
            # Tag names come with a leading '#', which we strip
            user_loan_elements.update(k['name'].strip('#') for k in loan['tags'])

        if 'themes' in loan:
            user_loan_elements.update(loan['themes'])

    return user_loan_elements

In [4]:
user = 'brianna9306'
user_loan_elements = get_user_loan_elements(user)

In [5]:
user_loan_elements

{'Africa',
 'Agriculture',
 'Animals',
 'Clothing',
 'Eco-friendly',
 'Fabrics',
 'Food',
 'Green',
 'Growing Businesses',
 'Health and Sanitation',
 'Housing',
 'Interesting Photo',
 'Job Creation',
 'Job Creator',
 'Kenya',
 'Parent',
 'Repeat Borrower',
 'Retail',
 'Rural Exclusion',
 'Schooling',
 'Single Parent',
 'Social Enterprise',
 'Technology',
 'Uganda',
 'Unique',
 'Vegan',
 'Vulnerable Groups',
 'Woman Owned Biz',
 'Youth',
 'user_favorite'}

## Find which loans have the highest overlap with the user's loans

In [6]:
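def jaccard_distance(x, user_loan_elements):
    # Despite the name, this returns the Jaccard *similarity* (intersection over union):
    # 1.0 for identical sets, 0.0 for disjoint ones. The Jaccard distance would be 1 minus this value.
    intersection = len(set.intersection(x, user_loan_elements))
    union = len(set.union(x, user_loan_elements))
    if union > 0:
        return intersection/float(union)
    else:
        return 0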

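# Find which of the currently active loans has the highest overlap with the user_loan_elements.
# Caveat: the dict is keyed by similarity, so loans with identical scores overwrite each other
# and only one loan per distinct score survives.
loan_similarity = {jaccard_distance(v['elements'], user_loan_elements): k for k, v in loan_elements.items()}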

print('***TOP FIVE SIMILAR LOANS for %s***' % user)
for best_similarity in sorted(loan_similarity, reverse=True)[:5]:
    loan_id = loan_similarity[best_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, best_similarity))
    print(loan_elements[loan_id]['elements'])
    
print('\n\n***TOP FIVE WORST LOANS for %s***' % user)
for worst_similarity in sorted(loan_similarity)[:5]:
    loan_id = loan_similarity[worst_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, worst_similarity))
    print(loan_elements[loan_id]['elements'])

***TOP FIVE SIMILAR LOANS for brianna9306***

Loan: 1220456, similarity: 0.166666666667
set(['Kenya', 'Africa', 'Eco-friendly', 'Agriculture', 'Rural Exclusion'])

Loan: 1220102, similarity: 0.133333333333
set(['Food', 'Africa', 'Eco-friendly', 'Uganda'])

Loan: 1209855, similarity: 0.129032258065
set(['Food', 'Cameroon', 'Africa', 'Eco-friendly', 'Vulnerable Groups'])

Loan: 1220317, similarity: 0.125
set(['Mexico', 'Green', 'Eco-friendly', 'Agriculture', 'North_America', 'Rural Exclusion'])

Loan: 1220020, similarity: 0.1
set(['Kenya', 'Retail', 'Africa'])


***TOP FIVE WORST LOANS for brianna9306***

Loan: 1219913, similarity: 0.0
set(['Philippines', 'Manufacturing', 'Asia'])

Loan: 1220302, similarity: 0.0277777777778
set(['IPA Study', 'Innovative Loans', 'Flexible Credit Study', 'South_America', 'Colombia', 'Services', 'Repeat Borrower'])

Loan: 1219558, similarity: 0.0285714285714
set(['IPA Study', 'Food', 'Flexible Credit Study', 'South_America', 'Innovative Loans', 'Colombia'])

Four of my top five "similar" loans are in Africa, three of them in East Africa, nice!  And conversely, the least similar loans shown are in Asia and South America, where I've never given a loan.

NOTE: This isn't matching quite as well now that I only have 500 active loans to choose from (as opposed to the historical loans that I first tested on), but it should improve once the cron job is up and running and accumulating more active loans.
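One thing to keep in mind when reading these rankings: the dict comprehension above is keyed by similarity score, so loans with the same score overwrite each other. A possible tie-safe alternative is sketched below, keyed by loan id and ranked with `heapq`. The function name `rank_loans` and the sample data are hypothetical.

```python
import heapq

def rank_loans(loan_elements, user_loan_elements, n=5):
    """Return (best, worst) lists of (loan_id, similarity) pairs, keeping ties."""
    scores = {}
    for loan_id, info in loan_elements.items():
        elements = info['elements']
        union = elements | user_loan_elements
        if union:
            scores[loan_id] = len(elements & user_loan_elements) / float(len(union))
        else:
            scores[loan_id] = 0.0
    best = heapq.nlargest(n, scores.items(), key=lambda item: item[1])
    worst = heapq.nsmallest(n, scores.items(), key=lambda item: item[1])
    return best, worst

# Hypothetical data: loans 1 and 2 tie, and both stay in the ranking
sample_loans = {
    1: {'elements': {'Kenya', 'Africa'}},
    2: {'elements': {'Uganda', 'Africa'}},
    3: {'elements': {'Peru', 'South_America'}},
}
sample_user = {'Kenya', 'Uganda', 'Africa'}

best, worst = rank_loans(sample_loans, sample_user, n=2)
print(best)
print(worst)
```

Keying by loan id means every loan keeps its own score, at the cost of a slightly longer function.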

In [None]:
# Let's try someone with different preferences to see how it fits.
user = 'rafael7312'
user_loan_elements = get_user_loan_elements(user)
user_loan_elements

In [None]:
# Use dict comprehension to find which of those loans has the highest overlap with the user_loan_elements
# (keyed by similarity, so only one loan per distinct score is kept)
loan_similarity = {jaccard_distance(v['elements'], user_loan_elements): k for k, v in loan_elements.items()}

print('***TOP FIVE SIMILAR LOANS for %s***' % user)
for best_similarity in sorted(loan_similarity, reverse=True)[:5]:
    loan_id = loan_similarity[best_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, best_similarity))
    print(loan_elements[loan_id])
    
print('\n\n***TOP FIVE WORST LOANS for %s***' % user)
for worst_similarity in sorted(loan_similarity)[:5]:
    loan_id = loan_similarity[worst_similarity]
    print('\nLoan: %s, similarity: %s' % (loan_id, worst_similarity))
    print(loan_elements[loan_id])

Nice, a lot of "Innovative Loans" in South America.  It would take A/B tests to validate whether these loans would actually be more persuasive to people and result in more loans given, but at face value it's definitely promising.

In [None]:
# Get the list of loans to display so we can call them with the API and get their details
def get_loans_to_display(loan_similarity, number_displayed=10):
    # loan_similarity maps similarity score -> loan id. Return the ids with the
    # highest scores, best first, without modifying the caller's dict.
    best_scores = sorted(loan_similarity, reverse=True)[:number_displayed]
    return [loan_similarity[score] for score in best_scores]

def get_loan_details_from_api(loans_to_display):
    # Request all of the loans in one call by joining their ids with commas
    ids = ','.join(str(loan_id) for loan_id in loans_to_display)
    url = 'http://api.kivaws.org/v1/loans/%s.json' % ids

    # Get the details for the loans from the API and parse the JSON response
    response = requests.get(url)
    loan = response.json()

    # Put the details into a dictionary with each loan id as a key
    loan_details = {details['id']: details for details in loan['loans']}

    # Keep only the fields we want to display, in the same order as loans_to_display
    loan_details_to_display = []
    for loan_id in loans_to_display:
        loan_details_to_display.append({'id': loan_details[loan_id]['id'],
                                        'country': loan_details[loan_id]['location']['country'],
                                        'sector': loan_details[loan_id]['sector']})
    return loan_details_to_display
    

In [None]:
# Return a list of loans to display, in order from best to worst
loans_to_display = get_loans_to_display(loan_similarity, 10)
loan_details_to_display = get_loan_details_from_api(loans_to_display)

In [None]:
# loan_details

In [None]:
loan_details_to_display

In [None]:
loan_details[loan_id]


In [None]:
loan_details[1219558]

In [7]:
response = requests.get('http://api.kivaws.org/v1/loans/1220456.json')
# Parse the JSON response directly rather than eval()-ing the raw text
loan = response.json()

In [13]:
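# 'borrowers' is a list with one entry per borrower, so indexing it with 'pictured' directly
# raises "TypeError: list indices must be integers, not str". Read the flag from each entry instead.
[borrower['pictured'] for borrower in loan['loans'][0]['borrowers']]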

In [9]:
loan['loans'][0]['description']['texts']['en']

'Esther and her family live in the Nyahururu area of Kenya. She keeps livestock and grows food crops, fodder and cash crops on her 2.5-acre piece of land. She sells her farm produce at a local market near her home. She has been farming for many years now. Through the years, her biggest growth constraint has been lack of access to capital.<br \\/><br \\/>Esther needs to buy agricultural inputs necessary to increase her farm\\u2019s productivity but lacks adequate funds. She lives in a part of Kenya where there are no banks. She is seeking a loan to buy improved seeds, fertilizers and other important inputs that control weeds and pests. <br \\/><br \\/>With your loan, Esther will be able to access capital to buy important farm inputs that will increase her farm\\u2019s productivity and family income.'


<a href="https://www.kiva.org/lend/1220456">Loan 1220456 on Kiva</a>