# Calculate Similarity
Given the diverse set of features here (about location, borrower, and the loan itself), let's just start with a really simple similarity metric.  We'll use Jaccard similarity: every element of a user's past loans goes into a single set of loan elements.  Then we'll find currently eligible loans that share a large number of elements with that set.

We could eventually measure similarity across different metrics and weight more important ones more heavily, but to start let's just throw everything in the same set.
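As a rough sketch of what that weighted version might look like, the snippet below computes a Jaccard score per category and averages them with weights. The names `CATEGORY_WEIGHTS` and `weighted_similarity`, the weight values, and the sample data are all hypothetical; nothing like this is used in the rest of the notebook.

```python
# Hypothetical per-category weights; the values are illustrative, not tuned.
CATEGORY_WEIGHTS = {'country': 3.0, 'sector': 2.0, 'tags': 1.0}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weighted_similarity(user, loan, weights=CATEGORY_WEIGHTS):
    """Weighted average of per-category Jaccard scores.

    `user` and `loan` map a category name to a set of elements.
    """
    total_weight = sum(weights.values())
    score = sum(w * jaccard(user.get(cat, set()), loan.get(cat, set()))
                for cat, w in weights.items())
    return score / total_weight

# Made-up user history and candidate loan
user = {'country': {'Uganda', 'Kenya'},
        'sector': {'Food', 'Retail'},
        'tags': {'Parent', 'Woman Owned Biz'}}
loan = {'country': {'Uganda'},
        'sector': {'Food'},
        'tags': {'Parent', 'Animals'}}

print(weighted_similarity(user, loan))
```

With these weights a country match moves the score three times as much as a tag match, which is the kind of emphasis the single combined set cannot express.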

The downside of this is that a consistent preference (for example, always lending in the same country) is only indirectly accounted for: that country is counted once in the set no matter how many times it occurred, so the only effect is a smaller set of elements.  However, this will work as a proof of concept.
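A small illustration of that information loss, using a made-up list of countries with the same 17-to-1 split that shows up in my own history later on:

```python
from collections import Counter

# Illustrative loan history: 17 loans in Uganda and 1 in Kenya
countries = ['Uganda'] * 17 + ['Kenya']

as_set = set(countries)          # what a set-based metric sees
as_counts = Counter(countries)   # what a count-aware metric would see

print(sorted(as_set))                            # ['Kenya', 'Uganda']
print(as_counts['Uganda'], as_counts['Kenya'])   # 17 1
```

In the set, Uganda and Kenya are indistinguishable; the counts are what a weighted metric would need.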

In [1]:
import math
import pandas as pd
import pickle
import pprint
import random  # needed by dot_product_plus_random_noise
import requests

from country import country_to_continent
from utils import eval_string

path = '/Users/brianna/Dropbox/data_projects/loan_project/data/'

## Get loan elements for each eligible loan
Include the following features:
- country
- continent
- sector
- tags
- themes

In [2]:
# Load the dictionary with loan ids and elements for each loan
with open("%sloan_elements.pickle" % path, "rb") as f:
    loan_elements = pickle.load(f)

## For a specific user, create a set of all of the elements of his/her previous loans.
Let's look at my loan history for an example.  Because I lived in Uganda and I'm a big fan of woman-owned businesses, I tend to focus on East Africa and women.

In [3]:
max_loans = 50
NORMALIZE = None # None, 'sqrt' or 'random' to introduce random noise into results for more randomness  

In [4]:
def add_element(category, element):
    if element not in category:
        category[element] = 1
    else:
        category[element] += 1
    
    return category
        
def get_user_loan_elements(user):                                                                              
    url = 'http://api.kivaws.org/v1/lenders/{user}/loans.json'.format(user=user)                               
    response = requests.get(url)                                                                               
    # Parse the JSON body directly rather than eval()-ing text from a remote server
    lender = response.json()
                                                                                                               
    # To speed up computing time, if a user has a ton of loans only use the most recent.                    
    if len(lender['loans']) > max_loans:                                                                              
        lender['loans'] = lender['loans'][-max_loans:]                                                                
                                                                                                               
    # Make dictionaries of each of these important categories of the user's loans,
    # where the key is the category (ex. "Woman Owned Biz") and the value is the number
    # of times it's appeared in this user's loans
    countries, continents, sectors, tags, themes = {}, {}, {}, {}, {}                       
                                                                                                               
    for loan in range(len(lender['loans'])):                                                                   
        if 'country' in lender['loans'][loan]['location']:
            country = lender['loans'][loan]['location']['country']
            countries = add_element(countries, country)
            continent = country_to_continent.get(country, 'Unknown')
            continents = add_element(continents, continent)       
                                                                                                               
        if 'sector' in lender['loans'][loan]:
            sector = lender['loans'][loan]['sector']
            sectors = add_element(sectors, sector)                                                
                                                                                                               
        if 'tags' in lender['loans'][loan]:                                                                    
            loan_tags = [k['name'].strip('#') for k in lender['loans'][loan]['tags']]  
            for t in loan_tags:
                tags = add_element(tags, t)                                                                             
                                                                                                               
        if 'themes' in lender['loans'][loan]:                                                                  
            loan_themes = lender['loans'][loan]['themes'] 
            for th in loan_themes:
                themes = add_element(themes, th)                                                                       

    # Build the result once, after the loop, so it is defined even when the user has no loans
    user_loan_elements = {'user_countries': countries,
                          'user_continents': continents,
                          'user_sectors': sectors,
                          'user_tags': tags,
                          'user_themes': themes}
                                                                                                               
    return user_loan_elements   

def get_user_loan_elements_and_counts(user_loan_elements):
    # Merge the per-category count dicts into a single {element: count} dict
    user_loan_elements_counts = {}
    for category in user_loan_elements:
        user_loan_elements_counts.update(user_loan_elements[category])
    return user_loan_elements_counts

def get_user_loan_elements_categories_only(user_loan_elements):                                         
    # Combine elements from each category into a single set to calculate similarity         
    user_loan_elements_set = set()                                                          
    for category in user_loan_elements:                                                           
        user_loan_elements_set.update(user_loan_elements[category])                               
    return user_loan_elements_set                                                           

In [5]:
user = 'brianna9306'
user_loan_elements = get_user_loan_elements(user)

In [6]:
user_loan_elements

{'user_continents': {'Africa': 18},
 'user_countries': {'Kenya': 1, 'Uganda': 17},
 'user_sectors': {'Agriculture': 2,
  'Clothing': 2,
  'Food': 7,
  'Housing': 1,
  'Retail': 6},
 'user_tags': {'Animals': 1,
  'Eco-friendly': 1,
  'Fabrics': 3,
  'Health and Sanitation': 1,
  'Interesting Photo': 1,
  'Job Creator': 1,
  'Parent': 9,
  'Repeat Borrower': 1,
  'Schooling': 1,
  'Single Parent': 1,
  'Technology': 1,
  'Unique': 1,
  'Vegan': 2,
  'Woman Owned Biz': 9,
  'user_favorite': 10},
 'user_themes': {'Green': 1,
  'Growing Businesses': 1,
  'Job Creation': 1,
  'Rural Exclusion': 9,
  'Social Enterprise': 1,
  'Vulnerable Groups': 4,
  'Youth': 4}}

In [7]:
print(get_user_loan_elements_categories_only(user_loan_elements))
print('\n')
print(get_user_loan_elements_and_counts(user_loan_elements))

{'Fabrics', 'Africa', 'Job Creator', 'Repeat Borrower', 'Housing', 'Uganda', 'Vegan', 'Single Parent', 'Agriculture', 'Parent', 'Vulnerable Groups', 'Eco-friendly', 'Animals', 'Health and Sanitation', 'Growing Businesses', 'Rural Exclusion', 'Woman Owned Biz', 'Unique', 'Kenya', 'Schooling', 'Retail', 'Social Enterprise', 'Food', 'Green', 'Job Creation', 'user_favorite', 'Technology', 'Youth', 'Interesting Photo', 'Clothing'}


{'Fabrics': 3, 'Africa': 18, 'Job Creator': 1, 'Housing': 1, 'Uganda': 17, 'Vulnerable Groups': 4, 'Single Parent': 1, 'Agriculture': 2, 'Vegan': 2, 'Eco-friendly': 1, 'Repeat Borrower': 1, 'Animals': 1, 'Growing Businesses': 1, 'Social Enterprise': 1, 'Rural Exclusion': 9, 'Retail': 6, 'Woman Owned Biz': 9, 'Unique': 1, 'Health and Sanitation': 1, 'Kenya': 1, 'Schooling': 1, 'Food': 7, 'Green': 1, 'Job Creation': 1, 'user_favorite': 10, 'Technology': 1, 'Youth': 4, 'Interesting Photo': 1, 'Clothing': 2, 'Parent': 9}


# Similarity Measures

### Jaccard Similarity
The simplest way to compute similarity: the number of shared elements divided by the total number of distinct elements across both sets. (The function below is named `jaccard_distance`, but it returns this similarity; the Jaccard distance proper is one minus it.)
   - <b>Strengths</b>: Simple to understand. The fit is imperfect, but that injects an element of randomness that might be beneficial
   - <b>Weaknesses</b>: Doesn't take into account the nature of the category (ex. treats country the same as sector) or the number of times each element is associated with the loans (ex. if I make 30 loans in Senegal and one in Cambodia, it counts Senegal and Cambodia equally)
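Written out, with $U$ as the set of elements from the user's past loans and $L$ as the set of elements of a candidate loan (the symbols are mine, not from the Kiva data), the score is:

```latex
J(U, L) = \frac{|U \cap L|}{|U \cup L|}, \qquad 0 \le J(U, L) \le 1,
\qquad d_J(U, L) = 1 - J(U, L)
```

The code treats the case of an empty union as a score of 0.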

In [8]:
def jaccard_distance(x, user_loan_elements):                                          
    user_loan_elements_set = get_user_loan_elements_categories_only(user_loan_elements)           
    intersection = len(set.intersection(x, user_loan_elements_set))                   
    union = len(set.union(x, user_loan_elements_set))                                 
    if union > 0:                                                                     
        return intersection/float(union)                                              
    else:                                                                             
        return 0      

### Dot Product Similarity
Compute similarity taking into account the number of instances of each element (ie. if someone gives 10 "Agriculture" sector loans and 2 "Clothing" sector loans, their similarity scores will be more heavily weighted toward "Agriculture".)

   - <b>Strengths</b>: If a user has a strong preference in one category, that category will carry more weight, resulting in more loans that are similar in that category
   - <b>Weaknesses</b>: Less intuitive measure, tailored specifically to the eccentricities of the Kiva loan dataset
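As a reading of the code in the next cell (with `NORMALIZE = None`), the score for a candidate loan $L$ can be summarized as follows. Here $c(e)$ is the number of the user's past loans that carry element $e$ (0 if never seen), $c_k$ is the same count restricted to category $k$, and $\operatorname{top3}$ is the sum of the three largest counts in a category. The notation is my own shorthand.

```latex
\text{score}(L) = \min\!\left(1,\ \frac{\sum_{e \in L} c(e)}{M}\right),
\qquad
M = \sum_{k \in \{\text{country},\,\text{continent},\,\text{sector}\}} \max_{e}\, c_k(e)
  \;+\; \sum_{k \in \{\text{tags},\,\text{themes}\}} \operatorname{top3}(c_k)
```

$M$ is the score of an idealized loan: one that matches the user's most common country, continent, and sector and shares their three most frequent tags and themes.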

In [9]:
def get_max_instance(user_loan_elements, category):
    # Get the largest count for any single element of a category (country, continent, sector).
    # Returns 0 for an empty category instead of failing on max() of an empty sequence.
    max_count = max(user_loan_elements[category].values(), default=0)
    if NORMALIZE == 'sqrt':
        return math.sqrt(max_count)
    elif NORMALIZE is None or NORMALIZE == 'random':
        return max_count
    else:
        raise ValueError('NORMALIZE must be None, sqrt, or random')
                                                                                                         
def get_sum_of_instances(user_loan_elements, category):                                                  
    # Get the sum of number of instances of a particular category (tags, themes)                         
    # Since the maximum number of tags and themes for any given loan is about 3,                         
    # only take the top 3 instances                                                                      
    total_instances = []                                                                                 
    for k in user_loan_elements[category]:                                                               
        if NORMALIZE == 'sqrt':                                                                          
            total_instances.append(math.sqrt(user_loan_elements[category][k]))                           
        elif NORMALIZE is None or NORMALIZE == 'random':
            total_instances.append(user_loan_elements[category][k])                                      
        else:                                                                                            
            raise ValueError('NORMALIZE must be None, sqrt, or random')                                  
    total_instances.sort()                                                                               
    return sum(total_instances[-3:])                                                                     
                                                                                                         

def dot_product_plus_random_noise(user_loan_elements_set, k):                                         
    # Introduce an element of randomness so the user doesn't see 15 loans that are basically          
    # identical and the closest fit.  Pick a random number from zero to the true number and           
    # subtract it from the real number.                                                               
    num_shared = user_loan_elements_set.get(k, 0)                                                     
    noise = random.uniform(0, num_shared)                                                             
    return num_shared - noise                                                                         
                                                                                                      
def dot_product_sqrt(user_loan_elements_set, k):                                                      
    # Use the sqrt of the dp instead of the dp itself to give smaller numbers a fighting chance       
    num_shared = user_loan_elements_set.get(k, 0)                                                     
    return math.sqrt(num_shared)                                                                      
                                                                                                      
def max_dp_similarity(user_loan_elements):                                                            
    # The maximum "similarity score" would be:                                                        
    #    - a match with the most common country, continent, and sector                                
    #    - all tags and themes shared                                                                 
    dp_max = 0                                                                                        
    for category in ['user_countries', 'user_continents', 'user_sectors']:                            
        dp_max += get_max_instance(user_loan_elements, category)                                      
    for category in ['user_tags', 'user_themes']:                                                     
        dp_max += get_sum_of_instances(user_loan_elements, category)                                  
    return dp_max                                                                                     
    
def dp_similarity(candidate_loan_elements, user_loan_elements):
    user_loan_elements_set = get_user_loan_elements_and_counts(user_loan_elements)
    candidate_loan_elements = list(candidate_loan_elements)

    # To get the best possible fits, just take the dot product of each loan's elements with
    # the user's loan elements.
    if NORMALIZE == 'random':
        # To inject some randomness in choices, inject a bit of random noise
        dot_product = sum([dot_product_plus_random_noise(user_loan_elements_set, k) for k in candidate_loan_elements])
    elif NORMALIZE == 'sqrt':
        dot_product = sum([dot_product_sqrt(user_loan_elements_set, k) for k in candidate_loan_elements])
    elif NORMALIZE is None:
        dot_product = sum([user_loan_elements_set.get(k, 0) for k in candidate_loan_elements])
    else:
        raise ValueError('NORMALIZE must be None, sqrt, or random')

    # Normalize by the maximum dot product of the most ideal loan (which is not the same as sharing
    # every element, since a loan can in theory share all tags and themes but can only have one
    # country, continent, and sector). Cap the score at 1.0.
    dp_max = max_dp_similarity(user_loan_elements)
    if dp_max == 0:
        # No loan history to compare against
        return 0.0
    return min(1.0, dot_product / float(dp_max))


# Find most and least similar loans (Jaccard)

In [24]:
def print_best_and_worst_loans(user, similarity_metric):
    
    print('Loan profile for {}\n'.format(user))
    user_loan_elements = get_user_loan_elements(user)
    for element in user_loan_elements:
        print(element)
        for sub_element in user_loan_elements[element]:
            print('\t{}: {}'.format(sub_element, user_loan_elements[element][sub_element]))
        
    # Find which of the currently active loans has the highest overlap with the user_loan_elements
    # Use either the Jaccard or the Dot Product metric to calculate similarity
    if similarity_metric == 'jaccard':
        similarity_scores = {k: jaccard_distance(v['elements'], user_loan_elements) for k, v in loan_elements.items()}
    elif similarity_metric == 'dot_product':
        similarity_scores = {k: dp_similarity(v['elements'], user_loan_elements) for k, v in loan_elements.items()}
    else:
        print("Similarity metric should be either 'jaccard' or 'dot_product'")
        return

    # Turn the dict into a list of tuples sorted by similarity score
    sorted_similarity_scores = sorted(similarity_scores.items(), key=lambda tup: tup[1], reverse=True)
    
    print('\n\n***TOP FIVE SIMILAR LOANS for {}, using {} similarity metric***'.format(user, similarity_metric))
    for sim in sorted_similarity_scores[:5]:
        loan_id, similarity_score = sim[0], sim[1]
        print('\nLoan: {}, similarity: {}'.format(loan_id, similarity_score))
        print(loan_elements[loan_id]['elements'])


    print('\n\n***TOP FIVE WORST LOANS for {}, using {} similarity metric***'.format(user, similarity_metric))
    sorted_similarity_scores.reverse()
    for sim in sorted_similarity_scores[:5]:
        loan_id, similarity_score = sim[0], sim[1]
        print('\nLoan: {}, similarity: {}'.format(loan_id, similarity_score))
        print(loan_elements[loan_id]['elements'])

In [23]:
# Check the best loans for me using the Jaccard similarity metric
print_best_and_worst_loans('brianna9306', 'jaccard')

Loan profile for brianna9306

user_countries
	Kenya: 1
	Uganda: 17
user_tags
	Fabrics: 3
	Schooling: 1
	Job Creator: 1
	Unique: 1
	Vegan: 2
	Repeat Borrower: 1
	Woman Owned Biz: 9
	Animals: 1
	Single Parent: 1
	Parent: 9
	Eco-friendly: 1
	user_favorite: 10
	Technology: 1
	Interesting Photo: 1
	Health and Sanitation: 1
user_sectors
	Retail: 6
	Housing: 1
	Agriculture: 2
	Food: 7
	Clothing: 2
user_continents
	Africa: 18
user_themes
	Social Enterprise: 1
	Growing Businesses: 1
	Rural Exclusion: 9
	Green: 1
	Job Creation: 1
	Youth: 4
	Vulnerable Groups: 4


***TOP FIVE SIMILAR LOANS for brianna9306, using jaccard similarity metric***

Loan: 1315583, similarity: 0.3
{'Africa', 'Repeat Borrower', 'Woman Owned Biz', 'Kenya', 'Retail', 'Schooling', 'Single Parent', 'Parent', 'user_favorite'}

Loan: 1313614, similarity: 0.2903225806451613
{'Rural Exclusion', 'Africa', 'Repeat Borrower', 'Woman Owned Biz', 'Kenya', 'Agriculture', 'Parent', 'user_favorite', 'Biz Durable Asset', 'Animals'}

Loan: 

Four of my top five "best" loan matches are in Kenya, where I have given one loan.  However, the majority of the loans I've given (17 out of 18) were in Uganda.  The Jaccard metric takes into account whether an instance has ever happened, but not the number of times it has happened.
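To make this concrete, the sketch below recomputes the top score by hand from the sets printed above, then swaps Kenya for Uganda in the same loan. The helper `jaccard` and the variable `loan_in_uganda` are my own names for illustration; the loan with Uganda swapped in is hypothetical.

```python
# Element set for brianna9306, copied from the output of cell 7
user_elements = {
    'Fabrics', 'Africa', 'Job Creator', 'Repeat Borrower', 'Housing', 'Uganda',
    'Vegan', 'Single Parent', 'Agriculture', 'Parent', 'Vulnerable Groups',
    'Eco-friendly', 'Animals', 'Health and Sanitation', 'Growing Businesses',
    'Rural Exclusion', 'Woman Owned Biz', 'Unique', 'Kenya', 'Schooling',
    'Retail', 'Social Enterprise', 'Food', 'Green', 'Job Creation',
    'user_favorite', 'Technology', 'Youth', 'Interesting Photo', 'Clothing',
}

# Elements of loan 1315583, copied from the output above
loan_1315583 = {'Africa', 'Repeat Borrower', 'Woman Owned Biz', 'Kenya', 'Retail',
                'Schooling', 'Single Parent', 'Parent', 'user_favorite'}

def jaccard(a, b):
    """Shared elements divided by distinct elements across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# 9 shared elements out of 30 distinct ones
print(jaccard(loan_1315583, user_elements))    # 0.3

# Hypothetical variant: the same loan, but in Uganda instead of Kenya.
# The score does not change, even though 17 of my 18 loans were in Uganda.
loan_in_uganda = (loan_1315583 - {'Kenya'}) | {'Uganda'}
print(jaccard(loan_in_uganda, user_elements))  # 0.3
```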

The five "worst" loans are definitely subjectively different than loans I would normally be interested in.  I've never given (or been interested in) loans for transportation in Asia.

In [27]:
# Let's try someone with different preferences to see how it fits.
print_best_and_worst_loans('marylynn4377', 'jaccard')

Loan profile for marylynn4377

user_countries
	Philippines: 20
user_tags
	Repeat Borrower: 1
	Job Creator: 1
	Parent: 2
	Woman Owned Biz: 6
	user_favorite: 1
	Elderly: 2
	Repair Renew Replace: 1
	Animals: 8
user_sectors
	Housing: 2
	Agriculture: 10
	Food: 7
	Clothing: 1
user_continents
	Asia: 20
user_themes


***TOP FIVE SIMILAR LOANS for marylynn4377, using jaccard similarity metric***

Loan: 1310657, similarity: 0.5714285714285714
{'Asia', 'Woman Owned Biz', 'Elderly', 'Philippines', 'Agriculture', 'Parent', 'user_favorite', 'Animals'}

Loan: 1310753, similarity: 0.5
{'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Philippines', 'Agriculture', 'Parent', 'Animals'}

Loan: 1311963, similarity: 0.5
{'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Elderly', 'Philippines', 'Agriculture', 'Animals'}

Loan: 1309535, similarity: 0.5
{'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Elderly', 'Philippines', 'Agriculture', 'Animals'}

Loan: 1309484, similarity: 0.5
{'Asia', 'Repeat Borrower', 'Wom

Nice, the top five "best" loans all involve women-owned businesses and animals in the Philippines.  If anything, this might be recommending loans that are all too similar.  It will be good to introduce some noise here to add some diversity to the offerings.

The five "worst" loans are unrelated to the Philippines, animals, or women-owned businesses.
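One possible way to add that noise, sketched below in the spirit of `dot_product_plus_random_noise` from earlier: each element's count is reduced by a random amount between zero and the count itself, so near-identical loans no longer tie. The function name `noisy_score` and the trimmed-down `user_counts` dictionary are assumptions for this example, and the seed is there only to make the sketch repeatable.

```python
import random

def noisy_score(loan_elements, user_counts, rng):
    """Sum of the user's counts for a loan's elements, each reduced by uniform noise."""
    score = 0.0
    for element in sorted(loan_elements):  # sorted for a repeatable draw order
        count = user_counts.get(element, 0)
        score += count - rng.uniform(0, count)
    return score

# A subset of the counts for marylynn4377, taken from the loan profile above
user_counts = {'Philippines': 20, 'Asia': 20, 'Agriculture': 10,
               'Animals': 8, 'Woman Owned Biz': 6, 'Elderly': 2,
               'Repeat Borrower': 1}

# Elements of one of the tied loans from the output above
loan = {'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Elderly',
        'Philippines', 'Agriculture', 'Animals'}

rng = random.Random(0)
# Scoring the same loan three times gives three different values,
# each between 0 and the noise-free score of 67.
scores = [noisy_score(loan, user_counts, rng) for _ in range(3)]
print(scores)
```

Because the noise scales with the count, strong preferences still tend to win, but the ordering among close matches changes from one run to the next.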

# Try Dot Product metric instead
A dot product similarity metric will take into account the number of instances of a loan element. For example, if I've given 17 loans in Uganda and 1 in Kenya, Uganda will be weighted 17 times higher than Kenya.

In [28]:
print_best_and_worst_loans('brianna9306', 'dot_product')

Loan profile for brianna9306

user_countries
	Kenya: 1
	Uganda: 17
user_tags
	Fabrics: 3
	Schooling: 1
	Job Creator: 1
	Unique: 1
	Vegan: 2
	Repeat Borrower: 1
	Woman Owned Biz: 9
	Animals: 1
	Single Parent: 1
	Parent: 9
	Eco-friendly: 1
	user_favorite: 10
	Technology: 1
	Interesting Photo: 1
	Health and Sanitation: 1
user_sectors
	Retail: 6
	Housing: 1
	Agriculture: 2
	Food: 7
	Clothing: 2
user_continents
	Africa: 18
user_themes
	Social Enterprise: 1
	Growing Businesses: 1
	Rural Exclusion: 9
	Green: 1
	Job Creation: 1
	Youth: 4
	Vulnerable Groups: 4


***TOP FIVE SIMILAR LOANS for brianna9306, using dot_product similarity metric***

Loan: 1300114, similarity: 0.8045977011494253
{'Rural Exclusion', 'Africa', 'Woman Owned Biz', 'Uganda', 'Schooling', 'Parent', 'Food'}

Loan: 1310375, similarity: 0.7471264367816092
{'Rural Exclusion', 'Africa', 'Repeat Borrower', 'Woman Owned Biz', 'Vegan', 'Mali', 'Parent', 'Food', 'Underfunded Areas', 'user_favorite'}

Loan: 1301754, similarity: 0.712

Four of the top five "similar" loans are rural and/or women-owned businesses in Uganda, nice!  

And conversely, my five worst fits are in places where I've never given a loan.
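To see where the top score of about 0.80 for loan 1300114 comes from, here is the arithmetic redone outside the notebook's functions. The counts and loan elements are copied from the output above; the variable names (`counts`, `flat`, `ideal`, `raw`) are mine.

```python
# Counts for brianna9306, copied from the loan profile above
counts = {
    'countries': {'Kenya': 1, 'Uganda': 17},
    'continents': {'Africa': 18},
    'sectors': {'Agriculture': 2, 'Clothing': 2, 'Food': 7, 'Housing': 1, 'Retail': 6},
    'tags': {'Animals': 1, 'Eco-friendly': 1, 'Fabrics': 3, 'Health and Sanitation': 1,
             'Interesting Photo': 1, 'Job Creator': 1, 'Parent': 9, 'Repeat Borrower': 1,
             'Schooling': 1, 'Single Parent': 1, 'Technology': 1, 'Unique': 1,
             'Vegan': 2, 'Woman Owned Biz': 9, 'user_favorite': 10},
    'themes': {'Green': 1, 'Growing Businesses': 1, 'Job Creation': 1,
               'Rural Exclusion': 9, 'Social Enterprise': 1,
               'Vulnerable Groups': 4, 'Youth': 4},
}

# Flatten into a single {element: count} lookup
flat = {e: c for category in counts.values() for e, c in category.items()}

# Ideal score: most common country, continent and sector (17 + 18 + 7),
# plus the top three tags (10 + 9 + 9) and top three themes (9 + 4 + 4)
ideal = sum(max(counts[c].values()) for c in ('countries', 'continents', 'sectors'))
ideal += sum(sum(sorted(counts[c].values())[-3:]) for c in ('tags', 'themes'))

loan_1300114 = {'Rural Exclusion', 'Africa', 'Woman Owned Biz', 'Uganda',
                'Schooling', 'Parent', 'Food'}

# Each element contributes the number of times it appears in my history
raw = sum(flat.get(e, 0) for e in loan_1300114)

print(raw, ideal)              # 70 87
print(min(1.0, raw / ideal))   # roughly 0.8046
```

Uganda alone contributes 17 of the 70 points, whereas a Kenyan loan with the same tags would get only 1 for its country.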

In [30]:
print_best_and_worst_loans('marylynn4377', 'dot_product')

Loan profile for marylynn4377

user_countries
	Philippines: 20
user_tags
	Repeat Borrower: 1
	Job Creator: 1
	Parent: 2
	Woman Owned Biz: 6
	user_favorite: 1
	Elderly: 2
	Repair Renew Replace: 1
	Animals: 8
user_sectors
	Housing: 2
	Agriculture: 10
	Food: 7
	Clothing: 1
user_continents
	Asia: 20
user_themes


***TOP FIVE SIMILAR LOANS for marylynn4377, using dot_product similarity metric***

Loan: 1310753, similarity: 1.0
{'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Philippines', 'Agriculture', 'Parent', 'Animals'}

Loan: 1311731, similarity: 1.0
{'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Philippines', 'Schooling', 'Agriculture', 'Parent', 'Animals'}

Loan: 1311963, similarity: 1.0
{'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Elderly', 'Philippines', 'Agriculture', 'Animals'}

Loan: 1312459, similarity: 1.0
{'Asia', 'Woman Owned Biz', 'Elderly', 'Philippines', 'Agriculture', 'Animals'}

Loan: 1309535, similarity: 1.0
{'Asia', 'Repeat Borrower', 'Woman Owned Biz', 'Elderly', 

# Add normalization to introduce diversity
Let's introduce some diversity here by using the square root of each element's count in the dot product, rather than the raw count itself.

This way, if I take 17 loans in Uganda and 1 in Kenya, the similarity metric will weight Uganda sqrt(17) ≈ 4.1 times as high as Kenya, instead of 17 times.  Hopefully this will keep us from getting loans that are all nearly identical to a person's most common loan profile.

In [34]:
NORMALIZE = 'sqrt'
print_best_and_worst_loans('brianna9306', 'dot_product')

Loan profile for brianna9306

user_countries
	Kenya: 1
	Uganda: 17
user_tags
	Fabrics: 3
	Schooling: 1
	Job Creator: 1
	Unique: 1
	Vegan: 2
	Repeat Borrower: 1
	Woman Owned Biz: 9
	Animals: 1
	Single Parent: 1
	Parent: 9
	Eco-friendly: 1
	user_favorite: 10
	Technology: 1
	Interesting Photo: 1
	Health and Sanitation: 1
user_sectors
	Retail: 6
	Housing: 1
	Agriculture: 2
	Food: 7
	Clothing: 2
user_continents
	Africa: 18
user_themes
	Social Enterprise: 1
	Growing Businesses: 1
	Rural Exclusion: 9
	Green: 1
	Job Creation: 1
	Youth: 4
	Vulnerable Groups: 4


***TOP FIVE SIMILAR LOANS for brianna9306, using dot_product similarity metric***

Loan: 1310375, similarity: 0.7899117070195137
{'Rural Exclusion', 'Africa', 'Repeat Borrower', 'Woman Owned Biz', 'Vegan', 'Mali', 'Parent', 'Food', 'Underfunded Areas', 'user_favorite'}

Loan: 1300114, similarity: 0.7732270324689271
{'Rural Exclusion', 'Africa', 'Woman Owned Biz', 'Uganda', 'Schooling', 'Parent', 'Food'}

Loan: 1313614, similarity: 0.766

This definitely appears to introduce more diversity into the recommendations: the top match is now a loan in Mali rather than one in Uganda.
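The change in ranking can be checked by hand. The sketch below scores the two top loans with raw counts and with square-rooted counts, using data copied from the output above. The `score` helper is a simplified stand-in for `dp_similarity`, not the notebook's own function.

```python
import math

# Counts for brianna9306, copied from the loan profile above
counts = {
    'countries': {'Kenya': 1, 'Uganda': 17},
    'continents': {'Africa': 18},
    'sectors': {'Agriculture': 2, 'Clothing': 2, 'Food': 7, 'Housing': 1, 'Retail': 6},
    'tags': {'Animals': 1, 'Eco-friendly': 1, 'Fabrics': 3, 'Health and Sanitation': 1,
             'Interesting Photo': 1, 'Job Creator': 1, 'Parent': 9, 'Repeat Borrower': 1,
             'Schooling': 1, 'Single Parent': 1, 'Technology': 1, 'Unique': 1,
             'Vegan': 2, 'Woman Owned Biz': 9, 'user_favorite': 10},
    'themes': {'Green': 1, 'Growing Businesses': 1, 'Job Creation': 1,
               'Rural Exclusion': 9, 'Social Enterprise': 1,
               'Vulnerable Groups': 4, 'Youth': 4},
}
flat = {e: c for category in counts.values() for e, c in category.items()}

def score(loan, transform):
    """Dot-product score with `transform` applied to every count."""
    ideal = sum(transform(max(counts[c].values()))
                for c in ('countries', 'continents', 'sectors'))
    ideal += sum(sum(sorted(transform(v) for v in counts[c].values())[-3:])
                 for c in ('tags', 'themes'))
    raw = sum(transform(flat.get(e, 0)) for e in loan)
    return min(1.0, raw / ideal)

loan_1300114 = {'Rural Exclusion', 'Africa', 'Woman Owned Biz', 'Uganda',
                'Schooling', 'Parent', 'Food'}
loan_1310375 = {'Rural Exclusion', 'Africa', 'Repeat Borrower', 'Woman Owned Biz',
                'Vegan', 'Mali', 'Parent', 'Food', 'Underfunded Areas', 'user_favorite'}

# Raw counts: the Uganda loan wins (about 0.805 vs 0.747)
print(score(loan_1300114, float), score(loan_1310375, float))

# Square-rooted counts: the Mali loan edges ahead (about 0.773 vs 0.790)
print(score(loan_1300114, math.sqrt), score(loan_1310375, math.sqrt))
```

Under the square root, the 17 Uganda loans are worth only about 4.1 points, so the Mali loan's extra matching tags are enough to overtake it.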

It would take A/B tests to validate whether these loans would actually be more persuasive to people and result in more loans given, but at face value it's definitely promising.