# Ensemble Collaborative filtering for personalized project recommendations

## Kernel - 2 for DonorsChoose.org

## Why is this problem different from a standard recommendation engine problem?
In a standard recommendation problem, we recommend a movie or item to a user and want the recommendation to be as personalized as possible. Take Netflix as an example: for each account, Netflix recommends a handful of movies and shows (say 20-25; the scope of view is limited, because a user shown 1000 titles won't scroll through them all) and hopes the user picks one or more of them. This problem is different in two ways. First, we need to recommend donors for each new project, which is the other way around. Second, there is no limit on how many donors can be recommended for a new project: 10000 emails can be sent to 10000 different individuals, and each individual still receives only one email. So why use the standard approach of k recommended donors per project, judged on recall@k and precision@k? Instead, we design a new method and evaluate the success rate of its recommendations. In this kernel we design a method that is memory efficient, runs fast, and gives a set of donors for each project. We then test it on the dataset to see what the success rate would have been had we emailed the recommended donors.

**Why not email all the donors for each project?**
- Each email costs money. Emailing 5000 donors for one project costs around 5000 x 0.003 dollars, i.e. 15 dollars. Without targeting, emailing all 2m donors costs ~6000 dollars, which can be far greater than the donations actually received for that project. (**Of course, we would get multiple donations and might raise 20x the total project cost, so it may still be profitable.**)
- But donors would get irritated and stop paying attention if we kept doing so.
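
The arithmetic above can be checked in a few lines. This is only a sketch: the per-email cost and the donor counts are the figures quoted above, and `campaign_cost` is a hypothetical helper that is not part of the kernel.

```python
# Figures quoted in the text above; the per-email cost is an estimate.
COST_PER_EMAIL = 0.003      # dollars per email
TARGETED_DONORS = 5_000     # donors emailed when targeting
ALL_DONORS = 2_000_000      # roughly the whole donor base ("2m")


def campaign_cost(n_emails, cost_per_email=COST_PER_EMAIL):
    """Return the cost in dollars of sending n_emails emails."""
    return n_emails * cost_per_email


targeted_cost = campaign_cost(TARGETED_DONORS)
blanket_cost = campaign_cost(ALL_DONORS)
print(f"targeted: ${targeted_cost:,.0f}, blanket: ${blanket_cost:,.0f}")
print(f"blanket / targeted = {blanket_cost / targeted_cost:.0f}x")
```

Under these assumptions a blanket campaign costs a few hundred times more per project than a targeted one, before counting the cost of donor fatigue.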


## Why ensemble CF, taking Donors vs project attributes one by one?
### Because the order of SVD on the donor vs project matrix is $O(\min\{mn^2, m^2n\})$, and it would take huge space and time if computed on the whole dataset in one go
Whenever we use collaborative filtering on data as large as that provided by DonorsChoose, it is very difficult to give a space-optimized solution. The Donors vs Projects matrix is very large in this case, and one needs high-capacity systems to compute its SVD. It made me wonder whether we can do it in limited space and time and still get comparable results. This is my attempt to develop a solution that decomposes multiple Donors vs Project_attribute matrices and applies a filter on them to finally provide recommendations.

- **Approach** - Each project has several attributes. We can create categories based on each attribute and apply collaborative filtering on Donors vs project_attribute_1, then on Donors vs project_attribute_2, and so on. This gives multiple sets of donors who could be emailed about a new project. The purpose is not to make the campaign less targeted but to systematically solve the problem of a large matrix. That is why we take the intersection of all the sets: the donors common to every set form our final list, and we email them about the new project.

- **How it helps** - Say each project has 3 attributes A1, A2 and A3, with 100, 100 and 50 categories respectively. Using all three attributes together gives 100*100*50, i.e. 500k, columns in the Donors vs Project(categories) matrix, and SVD on this is much harder than SVD on one attribute at a time, i.e. Donors vs Project_attribute (A1, 100 columns). So we proceed sequentially. In addition, the order of SVD is $O(\min\{mn^2, m^2n\})$, so working with small matrices makes it much faster.
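
To get a feel for the size of the gap, the stated order can be evaluated for both layouts. This is a back-of-the-envelope sketch: it takes the $O(\min\{mn^2, m^2n\})$ order at face value, ignores constants, and assumes m = 2,000,000 donors (the "2m" mentioned earlier). `svd_cost` is a hypothetical helper.

```python
def svd_cost(m, n):
    """Order-of-magnitude operation count min(m*n^2, m^2*n) for an m x n matrix."""
    return min(m * n**2, m**2 * n)


m = 2_000_000                    # assumed number of donors
categories = [100, 100, 50]      # categories in A1, A2 and A3

n_combined = 1
for c in categories:
    n_combined *= c              # one column per combination of categories

combined = svd_cost(m, n_combined)
one_at_a_time = sum(svd_cost(m, c) for c in categories)

print(f"combined columns: {n_combined:,}")
print(f"combined cost   ~ {combined:.1e}")
print(f"sequential cost ~ {one_at_a_time:.1e}")
print(f"ratio           ~ {combined / one_at_a_time:.1e}")
```

The kernel itself uses a truncated SVD (`scipy.sparse.linalg.svds`), whose real cost differs from this estimate, so the ratio should be read only as an indication of why small matrices help.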

## Workflow 

![](https://lh3.googleusercontent.com/-JcJKzgv_9Ek/WyoYdjrA5vI/AAAAAAAAvcQ/dZFjJ0EcHSUzvw0joJcjdvUGfqNn45zKwCL0BGAYYCw/h1101/2018-06-20.png)
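
As a rough sketch of the approach described above, the same steps can be run end to end on a toy table: build one donor vs attribute matrix per attribute, reconstruct it from a rank-2 SVD, take the top donors for the new project's category, and intersect the sets. All data here is invented, and `toy_engine` and `top_donors` are hypothetical helpers that only mirror the idea behind `Reco_CF` and `predictions` below.

```python
import numpy as np
import pandas as pd
from scipy.sparse.linalg import svds

# Toy donation history; all values are invented for illustration.
toy = pd.DataFrame({
    'Donor ID': ['d1', 'd1', 'd2', 'd2', 'd3', 'd3',
                 'd4', 'd4', 'd5', 'd5', 'd6', 'd6'],
    'School State': ['CA', 'NY', 'CA', 'TX', 'NY', 'WA',
                     'TX', 'WA', 'CA', 'WA', 'NY', 'TX'],
    'Project Grade Level Category': [
        'PreK-2', '3-5', 'PreK-2', '6-8', '3-5', '9-12',
        '6-8', '9-12', 'PreK-2', '3-5', '6-8', '9-12'],
    'Donation Amount': [50, 10, 25, 5, 100, 20, 15, 40, 60, 30, 10, 75],
})


def toy_engine(df, var, factors=2):
    """Rank-`factors` reconstruction of the donor x `var` matrix."""
    pivot = (df.groupby(['Donor ID', var])['Donation Amount'].sum()
               .pipe(np.log1p)
               .unstack(fill_value=0.0))
    u, s, vt = svds(pivot.to_numpy(dtype=float), k=factors)
    scores = u @ np.diag(s) @ vt
    # rows = categories, columns = donors
    return pd.DataFrame(scores, index=pivot.index,
                        columns=pivot.columns).T


def top_donors(engine, category, n):
    """The n donors with the highest predicted score for one category."""
    return set(engine.loc[category].nlargest(n).index)


# Attributes of a hypothetical new project.
new_project = {'School State': 'CA',
               'Project Grade Level Category': 'PreK-2'}
per_attribute = {var: top_donors(toy_engine(toy, var), cat, n=3)
                 for var, cat in new_project.items()}
recommended = set.intersection(*per_attribute.values())
print(per_attribute)
print(recommended)
```

Each matrix here has only four columns, so each decomposition stays small no matter how many attributes are added; only the number of decompositions grows.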

In [1]:
# importing packages 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import Reader, Dataset, SVD  # evaluate is unused here and was removed in newer Surprise releases
import warnings; warnings.simplefilter('ignore')
import time
import datetime
from sys import getsizeof
from scipy.sparse.linalg import svds
from bokeh.palettes import Spectral4
from bokeh.plotting import figure, output_notebook, show
output_notebook()
# plotly.plotly moved to the separate chart_studio package in plotly 4; it is not needed for offline plots
import plotly.graph_objs as go
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
import plotly.tools as tls
from matplotlib_venn import venn3
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
import itertools
from sklearn.metrics import confusion_matrix
from functools import reduce
init_notebook_mode()

In [3]:
donations = pd.read_csv("../input/Donations.csv", encoding='latin-1')
donors = pd.read_csv("../input/Donors.csv", encoding='latin-1')
projects = pd.read_csv("../input/Projects.csv", encoding='latin-1')
resources = pd.read_csv("../input/Resources.csv", encoding='latin-1')
schools = pd.read_csv("../input/Schools.csv", encoding='latin-1')
teachers = pd.read_csv("../input/Teachers.csv", encoding='latin-1')
# We won't be doing EDA here, as EDA has been done in Kernel 1

In [5]:
# merging projects and school file
projects = projects.merge(schools, left_on='School ID',
                          right_on='School ID', how='left')
# merging projects and donations file
projects = projects.merge(donations, left_on='Project ID',
                           right_on='Project ID', how='left')
# merging projects and donors file
projects = projects.merge(donors, left_on='Donor ID',
                           right_on='Donor ID', how='left')
# merging in the teachers file to build the master dataframe
master_df = projects.merge(teachers, left_on='Teacher ID',
                           right_on='Teacher ID', how='left')



master_df.head()

In [6]:
sequence_list = ['Project Type', 'School State',
                 'Project Subject Category Tree',
                 'Project Grade Level Category',
                 'School Metro Type']


def Reco_CF(args):
    """Function to calculate the CF based predictions for one project attribute
        Input -
            args - dict with integer keys
            args[0] => df on which one wants to do the filtering
            args[1] => variable to be used for category/item in donor category matrix
            args[2] => number of (donor, category) rows to use;
                       -1 for complete df
            args[3] => number of factors for SV decomposition
        Output -
            cf_preds_df => category vs donor DataFrame of predicted scores
                           (rows are categories, columns are Donor IDs)


        Example -
        cf_matrix = Reco_CF(args)

    """
    print("="*80)
    df = args[0]
    var = args[1]
    sample_size = args[2]
    factors = args[3]
    print("CF for donors vs {}".format(var))
    df2 = pd.DataFrame(df.groupby(['Donor ID', var])['Donation Amount'
                       ].sum())
    df2.reset_index(inplace=True)
    df2['log_amount'] = np.log1p(df2['Donation Amount'])

    # Specifying sample size and making pivot

    if sample_size == -1:
        size = df2.shape[0]
    else:
        size = sample_size
    donors_projects_pivot_matrix_df = \
        df2.head(size).pivot(index='Donor ID', columns=var,
                             values='log_amount').fillna(0)
    print('Pivot formed')

    donors_ids = list(donors_projects_pivot_matrix_df.index)

    # Collaborative filtering: build vectors from the donor vs project-category matrix

    # DataFrame.as_matrix() was removed in pandas 1.0; .values is equivalent
    donors_projects_pivot_matrix = \
        donors_projects_pivot_matrix_df.values

    # Performs matrix factorization of the original donor item matrix

    (U, sigma, Vt) = svds(donors_projects_pivot_matrix, k=factors)
    sigma = np.diag(sigma)
    all_donor_predicted_ratings = np.dot(np.dot(U, sigma), Vt)
    cf_preds_df = pd.DataFrame(all_donor_predicted_ratings,
                               columns=donors_projects_pivot_matrix_df.columns,
                               index=donors_ids).transpose()

    print('CF done')
    return cf_preds_df


## Project features we will be using for ensemble collaborative filtering
- A) 'Project Type'
- B) 'School State'
- C) 'Project Subject Category Tree'
- D) 'Project Grade Level Category'
- E) 'School Metro Type'

### Other features that can also be used after digitizing
- F) '% lunch'
- G) 'Project cost', log of project cost

### Other variables which can be used
- H) 'School district'
- I) 'Teacher's Prefix'
- J) 'Donor's State'
- K) 'Donor is Teacher'
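
A minimal sketch of what "digitizing" could look like: numeric features are cut into bins so that each bin acts as a category, just like a state or a grade level. The column names, values and bin edges below are illustrative assumptions, not taken from the dataset, and base-10 logs are used only to keep the bin labels readable.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric project features; names and values are illustrative.
toy_projects = pd.DataFrame({
    'Project Cost': [150.0, 480.0, 1200.0, 95.0, 3100.0],
    'Percent Lunch': [12, 55, 78, 95, 40],
})

# Fixed-width bins for a percentage feature.
toy_projects['lunch_bin'] = pd.cut(
    toy_projects['Percent Lunch'],
    bins=[0, 25, 50, 75, 100],
    labels=['0-25', '26-50', '51-75', '76-100'],
    include_lowest=True,
)

# Bin the log of the cost, since project costs are skewed.
log_cost = np.log10(toy_projects['Project Cost'])
toy_projects['cost_bin'] = pd.cut(
    log_cost,
    bins=[1, 2, 3, 4],
    labels=['$10-100', '$100-1k', '$1k-10k'],
)
print(toy_projects)
```

The resulting `lunch_bin` and `cost_bin` columns could then be passed to the ensemble in the same way as the categorical attributes listed above.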


## Define prediction ensemble CF functions

In [9]:
def predictions(
    engine,
    test,
    var_,
    n,
    ):
    """Function to generate test predictions for CF_engine passed
        Input - 
            engine - recommendation engine (category vs donor DataFrame)
            test - Test dataset which has the same variables as the train dataset
            var_ - variable which was used in the passed engine for making the
                   Donors vs Project attribute matrix
            n - Number of recommendations per project attribute
        Output -
            test - test dataset with a 'pred_<var_>_list' column holding the
                   n recommended Donor IDs for each row
    """

    # test set prep - making pivot

    def pred_n(row, var_=var_):
        """function to predict the n most probable donors"""

        var = row[var_]

        # sort only the row for this category rather than the whole engine
        pred_ = engine.loc[var].sort_values(ascending=False).index[0:n]
        return list(pred_)

    test['pred_' + var_ + '_list'] = test.apply(lambda row: \
            pred_n(row), axis=1)
    return test


def Ensemble_CF(
    df,
    list_var,
    test,
    n,
    sample_size,
    ):
    """Function to do the ensemble CF
        Input - 
               df - dataframe, usually master df
               list_var - list of variables to be used
               test - test set having projects information
               n - number of donors to be predicted from each CF run
               sample_size - size of the sample: -1 for full sample
        Output - 
                test - test set having predictions using all attributes passed
                engine_list - list of engines (category vs donor DataFrames)
    """
    text_file = open("logfile.txt", "w")
    engine_list = []
    for var_ in list_var:
        start = time.time()
        
        text_file.write("="*80)
        text_file.write("Models specs are ")
        text_file.write('\n')
        text_file.write("Sample size is "+ str(sample_size))
        text_file.write('\n' )
        text_file.write("CF for donors vs "+str(var_))
        text_file.write('\n' )
        factors = 2
        args = {}
        args[0] = df
        args[1] = var_
        args[2] = sample_size
        args[3] = factors
        engine = Reco_CF(args)
        end_1 = time.time()
        text_file.write('Engine made in ' + str(end_1 - start)+ " sec")
        print('Engine made in ' + str(end_1 - start)+ " sec")
        test = predictions(engine, test, var_, n)
        end = time.time()
        text_file.write('\n' )
        text_file.write('predictions Done in ' + str(end - end_1)+ " sec")
        text_file.write('\n' )
        print('predictions Done in ' + str(end - end_1)+ " sec")
        engine_list.append(engine)
    text_file.write('Ensemble Collaborative filtering done')
    text_file.close()
    return (test, engine_list)

## Testing Ensemble collaborative filtering on an example 

In [None]:
sequence_list = [
    'Project Type',
    'School State',
    'Project Subject Category Tree',
    'Project Grade Level Category',
    'School Metro Type',
    'Teacher Prefix',
    ]
k = 5000
(test_ensemble, engine_list) = Ensemble_CF(master_df, sequence_list,
        master_df[6000:6100], k, -1)

### Parallelized functions, in case we need to run this on a large number of projects (maybe in future, or for testing on the full data)

In [None]:
# multiprocessing pool functions 
from multiprocessing import Pool


def predictions_mp(args):
    """Pool-friendly version of predictions() for one chunk of the test set.

    It has its own name so that the serial predictions() used by
    Ensemble_CF is not overwritten.
        Input - 
            args - tuple (engine, test, var_, n)
                engine - recommendation engine (category vs donor DataFrame)
                test - chunk of the test dataset
                var_ - variable which was used in the passed engine
                n - Number of recommendations per project attribute
        Output -
            test - the chunk with a 'pred_<var_>_list' column added
    """
    (engine, test, var_, n) = args
    test = test.copy()

    def pred_n(row):
        """function to predict the n most probable donors"""
        var = row[var_]
        pred_ = engine.loc[var].sort_values(ascending=False).index[0:n]
        return list(pred_)

    test['pred_' + var_ + '_list'] = test.apply(pred_n, axis=1)
    return test


def Ensemble_CF_mp(df, list_var, test, n, sample_size):
    """Function to do the ensemble CF, scoring test chunks in parallel"""
    engine_list = []
    chunk_size = 50
    pool = Pool(processes=3)
    for var_ in list_var:
        start = time.time()
        factors = 2
        args = {}
        args[0] = df
        args[1] = var_
        args[2] = sample_size
        args[3] = factors
        engine = Reco_CF(args)
        print("Engine size is {} KB".format(getsizeof(engine) / 1024))
        end_1 = time.time()
        print("Engine made in " + str(end_1 - start) + " sec")
        # split the test set by position into non-overlapping chunks;
        # the last chunk may be shorter than chunk_size
        test_ = [test.iloc[i:i + chunk_size]
                 for i in range(0, test.shape[0], chunk_size)]
        h = len(test_)
        print(h)
        test_2 = pool.map(predictions_mp,
                          zip([engine] * h, test_, [var_] * h, [n] * h))
        # stitch the chunks back together so the next variable adds its
        # prediction column to the same dataframe
        test = pd.concat(test_2)
        end = time.time()
        print("Predictions done in " + str(end - end_1) + " sec")
        engine_list.append(engine)
    pool.close()
    pool.join()
    return (test, engine_list)

MULTIPROCESSING = False
if MULTIPROCESSING:
    sequence_list = ['Project Type', 'School State',
                     'Project Subject Category Tree']
    k = 5000
    (test_ensemble, engine_list) = Ensemble_CF_mp(master_df,
            sequence_list, master_df[6000:6100], k, -1)

In [26]:
# Function to find the intersection of all predictions 


def intersection_of_ensemble(row):
    """Return the donors that appear in every 'pred_*' list of the row."""
    pred_cols = [x for x in row.index if x.startswith('pred_')]
    if not pred_cols:
        return []
    # works for any number of prediction columns, including just one
    common = set.intersection(*(set(row[c]) for c in pred_cols))
    return list(common)


test_ensemble['ensemble_pred_list'] = test_ensemble.apply(lambda row: \
        intersection_of_ensemble(row), axis=1)

In [27]:
test_ensemble.head()

## Showing it on venn diagram

#### Venn diagram showing intersection of 3 out of 6 variables
Here we are using 6 variables for the ensemble, so the predictions from the first three are shown in one Venn diagram.

In [52]:
# venn diagram for visualisation

p = 6005
sequence_list1 = ['Project Type', 'School State',
                  'Project Subject Category Tree']
set_list = []
for var in sequence_list1:
    var_new = 'pred_' + var + '_list'
    temp = set(test_ensemble.loc[p][var_new])
    set_list.append(temp)

plt.figure(figsize=(12, 12))
venn3(set_list, ('pred_Project Type_list', 'pred_School State_list',
      'pred_Project Subject Category Tree_list'))
plt.show()

#### Taking intersection of predictions on the remaining 3 variables 

In [54]:
sequence_list2 = ['Project Grade Level Category', 'School Metro Type',
                  'Teacher Prefix']
p = 6005
set_list = []
for var in sequence_list2:
    var_new = 'pred_' + var + '_list'
    temp = set(test_ensemble.loc[p][var_new])
    set_list.append(temp)

plt.figure(figsize=(12, 12))
venn3(set_list, ('Project Grade Level Category', 'School Metro Type',
      'Teacher Prefix'))
plt.show()

#### Taking the intersection of the two sets obtained from the intersections above 

In [55]:
p = 6005
set_1 = set(test_ensemble.loc[p]['pred_Project Type_list'
            ]).intersection(set(test_ensemble.loc[p]['pred_School State_list'
                            ]))
set_2 = \
    set_1.intersection(set(test_ensemble.loc[p]['pred_Project Subject Category Tree_list'
                       ]))
set_2

p = 6005
set_3 = \
    set(test_ensemble.loc[p]['pred_Project Grade Level Category_list'
        ]).intersection(set(test_ensemble.loc[p]['pred_School Metro Type_list'
                        ]))
# intersect with set_3 (not set_1) so that set_4 covers the second group of variables
set_4 = \
    set_3.intersection(set(test_ensemble.loc[p]['pred_Teacher Prefix_list'
                       ]))
set_4

import matplotlib.pyplot as plt
from matplotlib_venn import venn2

plt.figure(figsize=(9, 9))
venn2([set_2, set_4], ('Intersection of seq_list1',
      'Intersection of seq_list2'))
plt.show()


In [40]:
# Accuracy Check
# Explain why we are not using recall and precision


def reco_success_rate(row):
    """Function to flag whether the recommendation for one row would have
       succeeded if we had used this model
        Input - row - one row of the test_ensemble dataframe
        Output - 1 if the row's actual donor is in the ensemble
                 prediction list, else 0
    """

    success = 0
    if row['Donor ID'] in row['ensemble_pred_list']:
        success = 1
    return success


test_ensemble['success'] = test_ensemble.apply(lambda row: \
        reco_success_rate(row), axis=1)


def Mean_success_rate(df):
    """Function to calculate the success rate"""
    df['success'] = df.apply(lambda row: reco_success_rate(row), axis=1)
    success = 1.0 * df['success'].sum()
    total = df.shape[0]
    # np.float was removed in NumPy 1.24; the built-in float is equivalent
    return float(success / total)


## Mean Success Rate 
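
A small sketch of the metric on made-up data: a row counts as a success when the donor who actually gave to the project is among the donors the ensemble would have emailed, and the mean success rate is the share of such rows. The toy frame below is purely illustrative; only the column names follow the kernel.

```python
import pandas as pd

# Toy test set: the donor who actually gave, and the recommended donors.
toy_test = pd.DataFrame({
    'Donor ID': ['d1', 'd2', 'd3', 'd4'],
    'ensemble_pred_list': [['d1', 'd7'], ['d5'], ['d3', 'd2', 'd9'], []],
})

# 1 when the actual donor is among the recommended donors, else 0.
hits = toy_test.apply(
    lambda row: int(row['Donor ID'] in row['ensemble_pred_list']), axis=1)
toy_msr = hits.mean()
print(toy_msr)
```

Here two of the four rows are hits, so the toy rate is 0.5. An empty recommendation list always counts as a miss.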

In [41]:
# Mean success rate (MSR) of the recommendations on the test sample
MSR = Mean_success_rate(test_ensemble)
MSR

In [43]:
unsuccessful_attempt = (100- MSR*100)
successful_attempt = (MSR*100)
init_notebook_mode()
trace0 = go.Bar(
    x=['successful Email attempt', 'unsuccessful Email attempt'],
    y=[successful_attempt, unsuccessful_attempt],
    marker=dict(
        color=['rgba(20,204,204,1)', 'rgba(222,45,38,0.8)'],
        line=dict(
            color='rgb(8,48,107)',
            width=0.5,
        )
    )
)

data = [trace0]
layout = go.Layout(
    title='Successful vs unsuccessful Email attempts',
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='color-bar')

## Accuracy variation based on k 
**k is the number of donors we ask each CF run to predict; after taking the intersection, the final recommendations are in general about 1/4 of k**

In [44]:
# Accuracy variation based on emails

list_of_k = [
    1000,
    2000,
    4000,
    8000,
    16000,
    ]
MSR_for_k = []
Deciding_k = False
if Deciding_k:
    sequence_list = [
        'Project Type',
        'School State',
        'Project Subject Category Tree',
        'Project Grade Level Category',
        'School Metro Type',
        'Teacher Prefix',
        ]
    for k in list_of_k:
        print('+=' * 50)
        (test_ensemble, engine_list) = Ensemble_CF(master_df,
                sequence_list, master_df[6000:6100], k, -1)
        test_ensemble['ensemble_pred_list'] = \
            test_ensemble.apply(lambda row: \
                                intersection_of_ensemble(row), axis=1)
        MSR = Mean_success_rate(test_ensemble)
        MSR_for_k.append(MSR)
        del test_ensemble
        del engine_list

### Plot of success rate vs k (k for each CF and is not the final number of recommended donors)

In [46]:
# Results have been copied here for plotting
list_of_k = [
    1000,
    2000,
    4000,
    8000,
    16000,
    ]
MSR_for_k = [0.11, 0.15, 0.19, 0.23, 0.27]

In [47]:
_x = np.array(list_of_k)
_y0 = np.array(MSR_for_k)


# Create traces
trace0 = go.Scatter(
    x = _x,
    y = _y0,
    mode = 'lines+markers',
    name = 'Success_Rate'
)


data = [trace0]

fig = go.Figure(data=data)
iplot(fig, filename='line-mode')

### Link to next kernel 
link - https://www.kaggle.com/maheshdadhich/behaviour-based-content-clustering

#### Thanks...