# Calculate Similarity
Given the diverse set of features here (about the location, the borrower, and the loan itself), let's start with a really simple similarity metric.  We'll use Jaccard similarity: every element of a user's past loans (country, sector, tags, and so on) goes into a single set of loan elements.  Then we'll find a currently eligible loan whose own elements overlap heavily with that set.

We could eventually measure similarity across different metrics and weight more important ones more heavily, but to start let's just throw everything in the same set.

The downside is that if someone consistently prefers the same country (for example), that preference is only indirectly accounted for, through the small size of the set of elements: the country is counted only once, however many times it occurred.  However, this will work as a proof of concept.

In [1]:
import pandas as pd
import pickle
import pprint
import requests

from country import country_to_continent
from utils import eval_string

path = '/Users/brianna/Dropbox/data_projects/loan_project/data/'


## Get loan elements for each eligible loan
Include the following features:
- country
- continent
- sector
- tags
- themes

In [2]:
# Load the dictionary with loan ids and elements for each loan
with open("%sloan_elements.pickle" % path, "rb") as f:
    loan_elements = pickle.load(f)

## For a specific user, create a set of all of the elements of his/her previous loans.
Let's look at my loan history as an example.  Because I lived in Uganda and I'm a big fan of woman-owned businesses, I tend to focus on East Africa and women.

In [3]:
def add_element(category, element):
    if element not in category:
        category[element] = 1
    else:
        category[element] += 1
    
    return category
        
def get_user_loan_elements(user):                                                                              
    url = 'http://api.kivaws.org/v1/lenders/{user}/loans.json'.format(user=user)                               
    response = requests.get(url)                                                                               
    # Parse the JSON body directly instead of eval()-ing the raw response text
    lender = response.json()
                                                                                                               
    # To speed up computing time, if a user has a ton of loans only use the 20 most recent.
    if len(lender['loans']) > 20:                                                                              
        lender['loans'] = lender['loans'][-20:]                                                                
                                                                                                               
    # Make dictionaries of each of these important categories of the user's loans,
    # where the key is the category (ex. "Woman Owned Biz") and the value is the number
    # of times it's appeared in this user's loans
    countries, continents, sectors, tags, themes = {}, {}, {}, {}, {}                       
                                                                                                               
    for loan in range(len(lender['loans'])):                                                                   
        if 'country' in lender['loans'][loan]['location']:
            country = lender['loans'][loan]['location']['country']
            countries = add_element(countries, country)
            continent = country_to_continent.get(country, 'Unknown')
            continents = add_element(continents, continent)       
                                                                                                               
        if 'sector' in lender['loans'][loan]:
            sector = lender['loans'][loan]['sector']
            sectors = add_element(sectors, sector)                                                
                                                                                                               
        if 'tags' in lender['loans'][loan]:                                                                    
            loan_tags = [k['name'].strip('#') for k in lender['loans'][loan]['tags']]  
            for t in loan_tags:
                tags = add_element(tags, t)                                                                             
                                                                                                               
        if 'themes' in lender['loans'][loan]:                                                                  
            loan_themes = lender['loans'][loan]['themes'] 
            for th in loan_themes:
                themes = add_element(themes, th)                                                                       
                              
        # Find the gender of the borrowers for each loan.  (For this we actually have to return
        # to the loans API)
#         url = 'http://api.kivaws.org/v1/loans/'
#         for i in range(len(lender['loans'])):
#             # API call can only include 50 loans at a time
#             if i > 49:
#                 break    
#             url += '%s,' % lender['loans'][i]['id']

#         url = '%s.json' % url[:-1]

#         response = requests.get(url)                                                                     
#         user_loans = eval(response.content.replace('false', 'False').replace('true', 'True'))['loans'] 

#         borrower_genders_all_loans = []
#         for i in range(len(user_loans)):
#             borrower_genders_this_loan = [k['gender'] for k in user_loans[i]['borrowers']]
#             borrower_genders_all_loans = borrower_genders_all_loans + borrower_genders_this_loan

#         user_borrowers = {'F': borrower_genders_all_loans.count('F'),
#                           'M': borrower_genders_all_loans.count('M')}

    # Build the result once, after the loop (this also covers lenders with no loans)
    user_loan_elements = {'user_countries': countries,
                          'user_continents': continents,
                          'user_sectors': sectors,
                          'user_tags': tags,
                          'user_themes': themes}
                                                                                                               
    return user_loan_elements   

def get_user_loan_elements_and_counts(user_loan_elements):
    # Combine the element counts from every category into a single dict
    # (element -> count), used for the count-weighted similarity
    user_loan_elements_counts = {}
    for category in user_loan_elements:
        user_loan_elements_counts.update(user_loan_elements[category])
    return user_loan_elements_counts

def get_user_loan_elements_categories_only(user_loan_elements):                                         
    # Combine elements from each category into a single set to calculate similarity         
    user_loan_elements_set = set()                                                          
    for category in user_loan_elements:                                                           
        user_loan_elements_set.update(user_loan_elements[category])                               
    return user_loan_elements_set                                                           
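
One caveat with `get_user_loan_elements_and_counts`: `dict.update` overwrites existing keys, so if the same label ever showed up in two categories (a hypothetical case, say a tag named 'Food' alongside the 'Food' sector), only one of the two counts would survive.  Here's a minimal sketch of an alternative that adds the counts together with `collections.Counter`.  The name `merge_counts` and the sample data are made up for illustration and aren't part of this notebook.

```python
from collections import Counter

def merge_counts(user_loan_elements):
    # Sum counts across categories instead of overwriting duplicate labels
    merged = Counter()
    for counts in user_loan_elements.values():
        merged.update(counts)  # Counter.update adds to existing counts
    return dict(merged)

# Hypothetical data in which 'Food' is both a sector and a tag
example = {'user_sectors': {'Food': 7, 'Retail': 6},
           'user_tags': {'Food': 2, 'Parent': 9}}
merged = merge_counts(example)
print(merged['Food'])  # 7 + 2 = 9
```

Whether this matters depends on whether category labels can actually collide in the Kiva data, which I haven't checked.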

In [4]:
user = 'brianna9306'
user_loan_elements = get_user_loan_elements(user)

In [5]:
user_loan_elements

{'user_continents': {'Africa': 18},
 'user_countries': {'Kenya': 1, 'Uganda': 17},
 'user_sectors': {'Agriculture': 2,
  'Clothing': 2,
  'Food': 7,
  'Housing': 1,
  'Retail': 6},
 'user_tags': {'Animals': 1,
  'Eco-friendly': 1,
  'Fabrics': 3,
  'Health and Sanitation': 1,
  'Interesting Photo': 1,
  'Job Creator': 1,
  'Parent': 9,
  'Repeat Borrower': 1,
  'Schooling': 1,
  'Single Parent': 1,
  'Technology': 1,
  'Unique': 1,
  'Vegan': 2,
  'Woman Owned Biz': 9,
  'user_favorite': 10},
 'user_themes': {'Green': 1,
  'Growing Businesses': 1,
  'Job Creation': 1,
  'Rural Exclusion': 9,
  'Social Enterprise': 1,
  'Vulnerable Groups': 4,
  'Youth': 4}}

In [6]:
print(get_user_loan_elements_categories_only(user_loan_elements))
print('\n')
print(get_user_loan_elements_and_counts(user_loan_elements))

set(['Job Creator', 'Job Creation', 'Vulnerable Groups', 'Health and Sanitation', 'Housing', 'Growing Businesses', 'Interesting Photo', 'Youth', 'Green', 'Eco-friendly', 'Single Parent', 'Animals', 'Repeat Borrower', 'Retail', 'Rural Exclusion', 'Agriculture', 'Unique', 'user_favorite', 'Parent', 'Food', 'Clothing', 'Africa', 'Schooling', 'Vegan', 'Uganda', 'Woman Owned Biz', 'Kenya', 'Social Enterprise', 'Technology', 'Fabrics'])


{'Job Creator': 1, 'Job Creation': 1, 'Vulnerable Groups': 4, 'Health and Sanitation': 1, 'Housing': 1, 'Growing Businesses': 1, 'Interesting Photo': 1, 'Youth': 4, 'Green': 1, 'Eco-friendly': 1, 'Single Parent': 1, 'Animals': 1, 'Repeat Borrower': 1, 'Retail': 6, 'Rural Exclusion': 9, 'Agriculture': 2, 'Unique': 1, 'user_favorite': 10, 'Parent': 9, 'Food': 7, 'Clothing': 2, 'Africa': 18, 'Schooling': 1, 'Vegan': 2, 'Uganda': 17, 'Woman Owned Biz': 9, 'Kenya': 1, 'Social Enterprise': 1, 'Technology': 1, 'Fabrics': 3}


# Similarity Measures

### Jaccard Distance
The simplest way to compute similarity: the number of shared elements divided by the total number of distinct elements across both sets (intersection over union).  Strictly speaking this ratio is the Jaccard similarity; the Jaccard distance is one minus this value.  So despite its name, the `jaccard_distance` function below returns a similarity, where higher means a closer match.
   - <b>Strengths</b>: Simple to understand.  The fit is imperfect, but that injects an element of randomness that might be beneficial
   - <b>Weaknesses</b>: Doesn't take into account the nature of the category (ex. treats country the same as sector) or the number of times each element is associated with the loans (ex. if I make 30 loans in Senegal and one in Cambodia, it counts Senegal and Cambodia equally)

In [7]:
def jaccard_distance(x, user_loan_elements):                                          
    user_loan_elements_set = get_user_loan_elements_categories_only(user_loan_elements)           
    intersection = len(set.intersection(x, user_loan_elements_set))                   
    union = len(set.union(x, user_loan_elements_set))                                 
    if union > 0:                                                                     
        return intersection/float(union)                                              
    else:                                                                             
        return 0      
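
To make the weakness concrete, here's a small self-contained sketch with made-up data (the `jaccard` helper and the sets below are illustrative, not taken from the Kiva API).  A lender with 30 loans in Senegal and one in Cambodia ends up with each country in the set exactly once, so a Senegal loan and a Cambodia loan score the same.

```python
def jaccard(a, b):
    # Shared elements divided by all distinct elements across both sets
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0

# Hypothetical lender: 30 loans in Senegal, 1 in Cambodia.
# As a set, each country appears only once.
user_elements = {'Senegal', 'Cambodia', 'Africa', 'Asia', 'Food'}
senegal_loan = {'Senegal', 'Africa', 'Food'}
cambodia_loan = {'Cambodia', 'Asia', 'Food'}

print(jaccard(senegal_loan, user_elements))   # 3 shared / 5 distinct = 0.6
print(jaccard(cambodia_loan, user_elements))  # 3 shared / 5 distinct = 0.6
```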

### Dot Product Similarity
Compute similarity taking into account the number of instances of each element (i.e. if someone gives 10 "Agriculture" sector loans and 2 "Clothing" sector loans, their similarity scores will be more heavily weighted toward "Agriculture").

   - <b>Strengths</b>: If a user has a strong preference in one category, that category will carry more weight, resulting in more loans that are similar in that category
   - <b>Weaknesses</b>: Less intuitive measure, tailored specifically to the eccentricities of the Kiva loan dataset

In [8]:
def get_max_instance(user_loan_elements, category):
    # Get the largest count among the elements of a category (country, continent, sector)
    return max(user_loan_elements[category].values())

def get_sum_of_instances(user_loan_elements, category):
    # Get the total number of instances across all elements of a category (tags, themes)
    total_instances = 0
    for k in user_loan_elements[category]:
        total_instances += user_loan_elements[category][k]
    return total_instances

def max_dp_similarity(user_loan_elements):
    # The maximum "similarity score" would be:
    #    - a match with the most common country, continent, and sector
    #    - all tags and themes shared
    dp_max = 0
    for category in ['user_countries', 'user_continents', 'user_sectors']:
        dp_max += get_max_instance(user_loan_elements, category)
    for category in ['user_tags', 'user_themes']:
        dp_max += get_sum_of_instances(user_loan_elements, category)
    return dp_max
    
def dp_similarity(candidate_loan_elements, user_loan_elements):
    user_loan_elements_set = get_user_loan_elements_and_counts(user_loan_elements)
    candidate_loan_elements = list(candidate_loan_elements)
    
    dot_product = sum([user_loan_elements_set.get(k, 0) for k in candidate_loan_elements])
    
    # Normalize by the maximum dot product of the most ideal loan (which is not the same as sharing
    # every element, since a loan can in theory share all tags and themes but can only have one
    # country, continent, and sector)
    return dot_product * 1.0 / max_dp_similarity(user_loan_elements)

# Find most similar loans

In [9]:
# Find which of the currently active loans has the highest overlap with the user_loan_elements
#similarity_scores = {k: jaccard_distance(v['elements'], user_loan_elements) for k, v in loan_elements.items()}
similarity_scores = {k: dp_similarity(v['elements'], user_loan_elements) for k, v in loan_elements.items()}

# Turn the dict into a list of tuples sorted by similarity score
sorted_similarity_scores = sorted(similarity_scores.items(), key=lambda tup: tup[1], reverse=True)
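
Sorting every score is fine at this scale.  If the pool of eligible loans got much bigger, one option (not what this notebook does) would be to pull only the best and worst few with `heapq.nlargest` and `heapq.nsmallest` from the standard library.  A minimal sketch with made-up scores:

```python
import heapq

# Hypothetical similarity scores keyed by loan id
scores = {101: 0.25, 102: 0.9, 103: 0.5, 104: 0.75}

# Best matches first, without sorting the whole dict
top_two = heapq.nlargest(2, scores.items(), key=lambda tup: tup[1])
# Worst matches first
bottom_two = heapq.nsmallest(2, scores.items(), key=lambda tup: tup[1])

print(top_two)     # [(102, 0.9), (104, 0.75)]
print(bottom_two)  # [(101, 0.25), (103, 0.5)]
```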

In [10]:
print('***TOP FIVE SIMILAR LOANS for %s***' % user)
for sim in sorted_similarity_scores[:5]:
    loan_id, similarity_score = sim[0], sim[1]
    print('\nLoan: %s, similarity: %s' % (loan_id, similarity_score))
    print(loan_elements[loan_id]['elements'])


print('\n\n***TOP FIVE WORST LOANS for %s***' % user)
sorted_similarity_scores.reverse()
for sim in sorted_similarity_scores[:5]:
    loan_id, similarity_score = sim[0], sim[1]
    print('\nLoan: %s, similarity: %s' % (loan_id, similarity_score))
    print(loan_elements[loan_id]['elements'])

***TOP FIVE SIMILAR LOANS for brianna9306***

Loan: 1284122, similarity: 0.660377358491
set([u'user_favorite', u'Food', u'Africa', u'Uganda', u'Woman Owned Biz', u'Rural Exclusion'])

Loan: 1284842, similarity: 0.61320754717
set([u'user_favorite', u'Parent', u'Africa', u'Senegal', u'Woman Owned Biz', u'Repeat Borrower', u'Retail', u'Fabrics', u'Rural Exclusion'])

Loan: 1284849, similarity: 0.61320754717
set([u'user_favorite', u'Parent', u'Food', u'Africa', u'Vegan', u'Senegal', u'Woman Owned Biz', u'Repeat Borrower', u'Rural Exclusion'])

Loan: 1284844, similarity: 0.61320754717
set([u'Supporting Family', u'user_favorite', u'Parent', u'Africa', u'Senegal', u'Woman Owned Biz', u'Repeat Borrower', u'Retail', u'Fabrics', u'Rural Exclusion'])

Loan: 1284925, similarity: 0.61320754717
set([u'user_favorite', u'Parent', u'Food', u'Africa', u'Vegan', u'Senegal', u'Woman Owned Biz', u'Repeat Borrower', u'Rural Exclusion'])


***TOP FIVE WORST LOANS for brianna9306***

Loan: 1277750, similarity

All of the top five "similar" loans are eco-friendly and/or women-owned businesses in Sub-Saharan Africa, nice!  And conversely, my five least good fits are in places where I've never given a loan.
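
As a sanity check, the top score can be reproduced by hand from the counts printed earlier for my loans.  The sketch below copies those counts and the elements of loan 1284122 from the outputs above, then redoes the arithmetic inline (the variable names are just for illustration): the numerator adds up my count for each element the loan has, and the denominator is the best possible score (the largest single country, continent, and sector count, plus every tag and theme count).

```python
# Counts copied from the printed user_loan_elements for brianna9306
user = {
    'user_continents': {'Africa': 18},
    'user_countries': {'Kenya': 1, 'Uganda': 17},
    'user_sectors': {'Agriculture': 2, 'Clothing': 2, 'Food': 7,
                     'Housing': 1, 'Retail': 6},
    'user_tags': {'Animals': 1, 'Eco-friendly': 1, 'Fabrics': 3,
                  'Health and Sanitation': 1, 'Interesting Photo': 1,
                  'Job Creator': 1, 'Parent': 9, 'Repeat Borrower': 1,
                  'Schooling': 1, 'Single Parent': 1, 'Technology': 1,
                  'Unique': 1, 'Vegan': 2, 'Woman Owned Biz': 9,
                  'user_favorite': 10},
    'user_themes': {'Green': 1, 'Growing Businesses': 1, 'Job Creation': 1,
                    'Rural Exclusion': 9, 'Social Enterprise': 1,
                    'Vulnerable Groups': 4, 'Youth': 4},
}
# Elements printed for loan 1284122
candidate = {'user_favorite', 'Food', 'Africa', 'Uganda',
             'Woman Owned Biz', 'Rural Exclusion'}

# Flatten all categories into one element -> count lookup
counts = {}
for category in user:
    counts.update(user[category])

# Numerator: 10 + 7 + 18 + 17 + 9 + 9 = 70
numerator = sum(counts.get(e, 0) for e in candidate)

# Denominator: (17 + 18 + 7) + 43 tag instances + 21 theme instances = 106
single = ['user_countries', 'user_continents', 'user_sectors']
multi = ['user_tags', 'user_themes']
denominator = (sum(max(user[c].values()) for c in single)
               + sum(sum(user[c].values()) for c in multi))

score = numerator / float(denominator)
print('%s / %s = %s' % (numerator, denominator, score))
```

70 / 106 comes out to about 0.6604, which matches the similarity printed for loan 1284122.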

In [11]:
# Let's try someone with different preferences to see how it fits.
user = 'marylynn4377'
user_loan_elements = get_user_loan_elements(user)
user_loan_elements

{'user_continents': {'Asia': 20},
 'user_countries': {'Philippines': 20},
 'user_sectors': {'Agriculture': 10, 'Clothing': 1, 'Food': 7, 'Housing': 2},
 'user_tags': {'Animals': 8,
  'Elderly': 2,
  'Job Creator': 1,
  'Parent': 2,
  'Repair Renew Replace': 1,
  'Repeat Borrower': 1,
  'Woman Owned Biz': 6,
  'user_favorite': 1},
 'user_themes': {}}

In [12]:
# Find which of the currently active loans has the highest overlap with the user_loan_elements
#similarity_scores = {k: jaccard_distance(v['elements'], user_loan_elements) for k, v in loan_elements.items()}
similarity_scores = {k: dp_similarity(v['elements'], user_loan_elements) for k, v in loan_elements.items()}

# Turn the dict into a list of tuples sorted by similarity score
sorted_similarity_scores = sorted(similarity_scores.items(), key=lambda tup: tup[1], reverse=True)

print('***TOP FIVE SIMILAR LOANS for %s***' % user)
for sim in sorted_similarity_scores[:5]:
    loan_id, similarity_score = sim[0], sim[1]
    print('\nLoan: %s, similarity: %s' % (loan_id, similarity_score))
    print(loan_elements[loan_id]['elements'])


print('\n\n***TOP FIVE WORST LOANS for %s***' % user)
sorted_similarity_scores.reverse()
for sim in sorted_similarity_scores[:5]:
    loan_id, similarity_score = sim[0], sim[1]
    print('\nLoan: %s, similarity: %s' % (loan_id, similarity_score))
    print(loan_elements[loan_id]['elements'])

***TOP FIVE SIMILAR LOANS for marylynn4377***

Loan: 1289619, similarity: 0.958333333333
set([u'Animals', u'Parent', u'Philippines', u'Asia', u'Elderly', u'Woman Owned Biz', u'Repeat Borrower', u'Agriculture'])

Loan: 1289748, similarity: 0.944444444444
set([u'Animals', u'user_favorite', u'Philippines', u'Asia', u'Elderly', u'Woman Owned Biz', u'Repeat Borrower', u'Agriculture'])

Loan: 1287152, similarity: 0.930555555556
set([u'Animals', u'Parent', u'Philippines', u'Asia', u'Woman Owned Biz', u'Repeat Borrower', u'Agriculture'])

Loan: 1285204, similarity: 0.930555555556
set([u'Animals', u'Parent', u'Philippines', u'Asia', u'Woman Owned Biz', u'Repeat Borrower', u'Agriculture'])

Loan: 1289604, similarity: 0.930555555556
set([u'Animals', u'Parent', u'Philippines', u'Asia', u'Woman Owned Biz', u'Repeat Borrower', u'Agriculture'])


***TOP FIVE WORST LOANS for marylynn4377***

Loan: 1277897, similarity: 0.0
set([u'El Salvador', u'Retail', u'North_America'])

Loan: 1286050, similarity: 0

Nice, a lot of loans involving agriculture and animals in the Philippines.  It would take A/B tests to validate whether these loans would actually be more persuasive to people and result in more loans given, but at face value it's definitely promising.