## <u>Recommendation Engine</u>
A recommendation engine filters data using different algorithms and recommends the most relevant items to users. It first captures a customer's past behavior and, based on that, recommends products the customer is likely to buy.

In [1]:
# Utilities
import warnings
from time import time
from IPython.core.interactiveshell import InteractiveShell

# Data handling
import pandas as pd

# Mathematical calculation
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Configure for any default setting of any library
InteractiveShell.ast_node_interactivity = "all"
warnings.filterwarnings('ignore')
pd.set_option('display.float_format', lambda x: '%.3f' % x)

### <u>Report generation</u>
These calculations are done with Spark, and the results are inserted into a partitioned Reports table in Hive.

A Spark job then generates a CSV file from the partitioned Hive table.

In [2]:
cc = pd.read_csv('CCDMS_dataset.csv')
# Reward multiplier per purchase category (multipliers are paired with categories in order of first appearance)
points_table = dict(zip(cc.purchase_category.unique(), [0.8,1.7,0.85,3,1.1,1.5,2.1,1.3,2,1.5,1.01,2.2,1.5,1]))
# Vectorised column operations instead of row-wise apply
cc['reward_points'] = cc['purchase_category'].map(points_table) * cc['trans_amt']
cc['cust_name'] = cc['first'] + ' ' + cc['last']
cc['trans_date_trans_time'] = pd.to_datetime(cc['trans_date_trans_time'])
cc['trans_month'] = cc['trans_date_trans_time'].dt.strftime('%Y-%m')
report = cc.groupby(['card_num', 'trans_month', 'purchase_category']) \
                                .agg(dict(cust_name='first', trans_id='count', trans_amt='sum', reward_points='sum')) \
                                .rename(columns={'trans_id': 'trans_count'}) \
                                .reset_index()
report.to_csv('report.csv', index=False)
report.head()

Unnamed: 0,card_num,trans_month,purchase_category,cust_name,trans_count,trans_amt,reward_points
0,60416207185,2020-06,entertainment,Mary Diaz,1,102.15,204.3
1,60416207185,2020-06,food,Mary Diaz,1,41.99,88.179
2,60416207185,2020-06,gas_transport,Mary Diaz,7,492.71,739.065
3,60416207185,2020-06,grocery_net,Mary Diaz,1,63.12,63.12
4,60416207185,2020-06,grocery_pos,Mary Diaz,3,277.13,609.686


### <u>Data Load</u>
Load the Hive table into Pandas dataframe

In [3]:
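# Imports needed by this cell
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# Set Env variables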
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

# Create Spark Context
conf = (SparkConf().setMaster("local").setAppName("CCDMS"))
sc = SparkContext(conf=conf)
spark = SparkSession \
    .builder \
    .appName("CCDMS Visualizations") \
    .config("spark.sql.warehouse.dir", "file://tmp") \
    .getOrCreate()

# Load the data
report.head()

Unnamed: 0,card_num,trans_month,purchase_category,cust_name,trans_count,trans_amt,reward_points
0,60416207185,2020-06,entertainment,Mary Diaz,1,102.15,204.3
1,60416207185,2020-06,food,Mary Diaz,1,41.99,88.179
2,60416207185,2020-06,gas_transport,Mary Diaz,7,492.71,739.065
3,60416207185,2020-06,grocery_net,Mary Diaz,1,63.12,63.12
4,60416207185,2020-06,grocery_pos,Mary Diaz,3,277.13,609.686


Load the CSV file generated by the Spark job

In [4]:
# Read the CSV file generated by the Spark job
report = pd.read_csv('report.csv')

---
### <u> Recommendation Types</u>

Broadly, there are two major types of recommendation systems:
- **Popularity Based** - We recommend the items that are most popular among all users
- **Collaborative Filtering** - We segment the users/products based on their preferences (user features) and recommend items based on the segment they belong to

Both methods have drawbacks. In the first case, the most popular items are the same for every user, so everybody sees the same recommendations. In the second case, the number of features grows as the number of users increases, so classifying users into segments becomes very difficult.

**Please Note**: In this scenario we recommend product categories rather than individual products. It is at the discretion of the bank to advertise related products for the recommended category, e.g. another *credit card*, *loans with an attractive rate of interest*, *bearer bonds*, etc.


### 1. <u>Popularity Based Recommendation System (non-personalised)</u>
The easiest way to build a recommendation system is to base it on popularity: simply rank all products by how popular they are or, in this case, rank all categories by how much expenditure they receive.

Let's group by purchase category to find the count and amount of the customers' expenditures. These can be treated as the categories' scores. Products related to the most popular purchase categories can then be recommended.

In [5]:
# Create a method to recommend purchase category based on popularity
def recommend_popular(df, top_n, cardNum=None):
    # Work on a copy so the caller's DataFrame is not modified
    recommendations = df.copy()

    # Generate a recommendation rank based upon score
    recommendations['Rank'] = recommendations['score'].rank(ascending=False, method='first')
    recommendations = recommendations.sort_values(['score', 'purchase_category'], ascending=[False, True])

    # Add card_num column for which the recommendations are being generated
    if cardNum is not None:
        recommendations.insert(0, 'card_num', cardNum)

    # Get the top N recommendations
    return recommendations.head(top_n)

**Method 1**: Credit card count for each unique purchase category as recommendation score

The score for each purchase category in this method is the number of `report` rows (one per credit card and month) in which the category appears. Expenses can be high or low: this method ignores how much was spent on the category and only counts how often credit cards were used for it.
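
Because `report` holds one row per card, month and category, `count` tallies card-month rows rather than distinct cards. The following is a minimal sketch on a made-up frame (`toy_report` and its values are illustrative assumptions, not taken from the dataset) that contrasts `count` with `nunique`, which would count distinct cards instead:

```python
import pandas as pd

# Toy stand-in for `report`: one row per (card_num, trans_month, purchase_category).
# All values are made up for illustration.
toy_report = pd.DataFrame({
    'card_num': [111, 111, 222, 222, 333],
    'trans_month': ['2020-06', '2020-07', '2020-06', '2020-06', '2020-07'],
    'purchase_category': ['food', 'food', 'food', 'travel', 'travel'],
})

scores = toy_report.groupby('purchase_category')['card_num'].agg(
    row_count='count',         # number of card-month rows per category
    distinct_cards='nunique',  # number of different cards per category
).reset_index()
print(scores)
```

In this toy frame card 111 appears in `food` in two different months, so `food` gets a row count of 3 but only 2 distinct cards. Which of the two is the better popularity signal depends on whether repeat usage should count.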

In [6]:
#Count of credit card for each unique purchase category as recommendation score 
pc_grp = report.groupby('purchase_category').agg({'card_num': 'count'}).reset_index()
pc_grp.rename(columns={'card_num': 'score'}, inplace=True)
pc_grp.head()

Unnamed: 0,purchase_category,score
0,entertainment,6067
1,food,6055
2,gas_transport,6062
3,grocery_net,4707
4,grocery_pos,6190


In [7]:
# Find recommendation for top 5 purchase category
recommend_popular(pc_grp, 5)

Unnamed: 0,purchase_category,score,Rank
4,grocery_pos,6190,1.0
6,home,6178,2.0
7,kids_pets,6173,3.0
12,shopping_pos,6160,4.0
0,entertainment,6067,5.0


**Method 2**: Average transaction amount for each unique purchase_category as recommendation score

The score for each purchase_category in this method is the average of the trans_amt values recorded for it in `report`. Unlike Method 1, it takes the amount spent into account.

In [8]:
# Mean of trans_amt for each unique purchase_category as recommendation score
pc_grp = report.groupby(['purchase_category']).agg({'trans_amt': 'mean'}).reset_index()
pc_grp.rename(columns={'trans_amt': 'score'}, inplace=True)
pc_grp.head()

Unnamed: 0,purchase_category,score
0,entertainment,422.952
1,food,329.306
2,gas_transport,591.197
3,grocery_net,221.753
4,grocery_pos,983.865


In [9]:
# Find recommendation for top 5 purchase_category
recommend_popular(pc_grp, 5)

Unnamed: 0,purchase_category,score,Rank
4,grocery_pos,983.865,1.0
12,shopping_pos,621.276,2.0
2,gas_transport,591.197,3.0
11,shopping_net,579.558,4.0
6,home,491.384,5.0


**Method 3**: Sum of transaction amount for each unique purchase_category as recommendation score

The score for each purchase_category in this method is the sum of all trans_amt received. Given that only the users' expenses are available, this is in practice the best of the three approaches for determining the popularity of a purchase_category.

In [10]:
# Sum of trans_amt for each unique purchase_category as recommendation score 
pc_grp = report.groupby(['purchase_category']).agg({'trans_amt': 'sum'}).reset_index()
pc_grp.rename(columns={'trans_amt': 'score'}, inplace=True)
pc_grp.head()

Unnamed: 0,purchase_category,score
0,entertainment,2566048.02
1,food,1993948.08
2,gas_transport,3583835.56
3,grocery_net,1043791.37
4,grocery_pos,6090121.59


In [11]:
# Find recommendation for top 5 purchase_category
recommend_popular(pc_grp, 5)

Unnamed: 0,purchase_category,score,Rank
4,grocery_pos,6090121.59,1.0
12,shopping_pos,3827058.61,2.0
2,gas_transport,3583835.56,3.0
11,shopping_net,3487779.99,4.0
6,home,3035769.89,5.0
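
The three scoring methods can also be computed side by side to see where their rankings diverge. The following is a minimal sketch using pandas named aggregation on a made-up frame (`toy_report`, the score column names and all values are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for `report`; values are made up for illustration.
toy_report = pd.DataFrame({
    'card_num': [111, 111, 222, 222, 333],
    'purchase_category': ['food', 'food', 'food', 'travel', 'travel'],
    'trans_amt': [10.0, 20.0, 30.0, 500.0, 100.0],
})

scores = toy_report.groupby('purchase_category').agg(
    count_score=('card_num', 'count'),  # Method 1
    mean_score=('trans_amt', 'mean'),   # Method 2
    sum_score=('trans_amt', 'sum'),     # Method 3
)

# Most popular category according to each method
top_by_method = {col: scores[col].idxmax() for col in scores.columns}
print(scores)
print(top_by_method)
```

In this toy data `food` is used more often but `travel` attracts more money, so the count-based method and the amount-based methods disagree on the top category.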


In [12]:
# Find recommended purchase categories for a couple of users (users are identified by credit card number)
find_recom = {639046421587: 4,
              3597337756918960: 3,
              4497451418073890000: 5}   # Maps card_num -> number of recommendations (top_n)
for card in find_recom:
    print("Top %d recommendations for the card holder of %d" %(find_recom[card],card))
    recommend_popular(pc_grp,find_recom[card],card)
    print("\n") 

Top 4 recommendations for the card holder of 639046421587


Unnamed: 0,card_num,purchase_category,score,Rank
4,639046421587,grocery_pos,6090121.59,1.0
12,639046421587,shopping_pos,3827058.61,2.0
2,639046421587,gas_transport,3583835.56,3.0
11,639046421587,shopping_net,3487779.99,4.0




Top 3 recommendations for the card holder of 3597337756918960


Unnamed: 0,card_num,purchase_category,score,Rank
4,3597337756918960,grocery_pos,6090121.59,1.0
12,3597337756918960,shopping_pos,3827058.61,2.0
2,3597337756918960,gas_transport,3583835.56,3.0




Top 5 recommendations for the card holder of 4497451418073890000


Unnamed: 0,card_num,purchase_category,score,Rank
4,4497451418073890000,grocery_pos,6090121.59,1.0
12,4497451418073890000,shopping_pos,3827058.61,2.0
2,4497451418073890000,gas_transport,3583835.56,3.0
11,4497451418073890000,shopping_net,3487779.99,4.0
6,4497451418073890000,home,3035769.89,5.0






**Observations:**

- A popularity recommender model works purely on the popularity of each purchase_category.
- The purchase_category with the highest total transaction amount gets recommended irrespective of the user's interests. This model can serve as a basic recommendation even when the user has not yet been onboarded to CCDMS.
- As observed above, all 3 users received the same recommendations, i.e. the top-n ranked purchase categories.

---
### 2. <u>Collaborative Filtering</u>
Collaborative filtering is the process of filtering information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc. It is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption is that if person A has the same opinion as person B on one matter, A is more likely to share B's opinion on a different matter than that of a randomly chosen person.

**Nearest Neighborhood** - The standard method of collaborative filtering is known as the Nearest Neighborhood algorithm.

**Types of Collaborative Filtering (CF)**
- Item Based Collaborative Filtering (IBCF)
  - Compute similarity between Purchase Categories
- User Based Collaborative Filtering (UBCF)
  - Compute similarity between Credit Card Users
  
**IBCF vs. UBCF**
- IBCF is more efficient than UBCF
- Typical applications involve far more users than categories they spend on. Hence the similarity matrix for IBCF is more compact than the one for UBCF
- Similarity estimates between purchase categories are also more likely to converge over time than similarities between users. Hence they can be pre-computed and cached, unlike similarities between users, which need to be recomputed at regular intervals
- However, IBCF recommendations tend to be more conservative than UBCF recommendations
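
The compactness argument can be made concrete with the shape of the card/category matrix built in this notebook (925 cards, 14 purchase categories). A quick back-of-the-envelope sketch:

```python
# Shape of the card x category matrix used in this notebook
n_cards, n_categories = 925, 14

ubcf_entries = n_cards * n_cards            # card-card similarity matrix
ibcf_entries = n_categories * n_categories  # category-category similarity matrix

print(f"UBCF similarity entries: {ubcf_entries:,}")
print(f"IBCF similarity entries: {ibcf_entries:,}")
print(f"UBCF matrix is roughly {ubcf_entries / ibcf_entries:,.0f} times larger")
```

The user-based matrix grows quadratically with the number of cards, while the item-based one stays fixed as long as the set of categories does not change.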

**Similarity Metrics**: The following are two popular similarity metrics used in recommender systems. A third, **Jaccard Similarity**, is useful when the user/item matrix contains binary values.<br/>
- **Cosine Similarity**: Similarity is the cosine of the angle between two vectors, given by:
$$Cosine\;Similarity:CosSim(x,y)=\frac{\sum_i{x_iy_i}}{\sqrt{\sum_i{x_i^2}}\sqrt{\sum_i{y_i^2}}}=\frac{\langle x, y \rangle}{\|x\|\|y\|}$$

- **Pearson Correlation**: Similarity is the Pearson correlation between two vectors, given by:
$$Pearson\;Correlation:Corr(x, y)=\frac{\sum_i{(x_i-\bar{x})(y_i-\bar{y})}}{\sqrt{\sum_i{(x_i-\bar{x})^2}}\sqrt{\sum_i{(y_i-\bar{y})^2}}}=\frac{\langle x-\bar{x}, y-\bar{y} \rangle}{\|x-\bar{x}\|\|y-\bar{y}\|}=CosSim(x-\bar{x},\;y-\bar{y})$$
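
The metrics above can be sketched in a few lines of plain Python. The helper names `cos_sim`, `pearson` and `jaccard` and the sample vectors are illustrative assumptions; the notebook itself relies on `cosine_similarity` from sklearn.

```python
import math

def cos_sim(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def pearson(x, y):
    """Pearson correlation, computed as the cosine similarity of the mean-centred vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cos_sim([a - mean_x for a in x], [b - mean_y for b in y])

def jaccard(x, y):
    """Jaccard similarity of two binary (0/1) vectors: |intersection| / |union|."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0

# Made-up spending vectors: `shifted` follows the same pattern as `spend`,
# but every value is raised by a constant 100
spend = [1.0, 2.0, 3.0]
shifted = [v + 100.0 for v in spend]

print("cosine :", round(cos_sim(spend, shifted), 3))
print("pearson:", round(pearson(spend, shifted), 3))
print("jaccard:", round(jaccard([1, 1, 0, 1], [1, 0, 0, 1]), 3))
```

Because Pearson correlation centres each vector on its own mean, it ignores the constant offset between `spend` and `shifted`, whereas cosine similarity does not. This matters when cards differ mainly in their overall spending level.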

### User Based Collaborative Filtering (UBCF)
This algorithm first finds the similarity score between credit cards (credit card users). Based on this score, it picks the most similar users and recommends purchase categories on which these similar users have spent money in the past. The predicted value for user $u$ and category $i$ is the similarity-weighted average over $N_i^k(u)$, the $k$ nearest neighbors of $u$ that have a value for $i$:
$$\widehat{r}_{ui}=\frac{\sum_{v\in{N_i^k(u)}}{similarity(u,v)\cdot r_{vi}}}{\sum_{v\in{N_i^k(u)}}{similarity(u,v)}}$$
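
As a worked instance of the formula, the following sketch computes the prediction for one card and one category from three neighbours. The function name `predict_user_based` and all numbers are made up for illustration.

```python
def predict_user_based(similarities, neighbour_amounts):
    """Similarity-weighted average of the neighbours' amounts for one category."""
    weighted_sum = sum(s * r for s, r in zip(similarities, neighbour_amounts))
    return weighted_sum / sum(similarities)

# similarity(u, v) for the three nearest neighbours v of card u (made-up values)
sims = [0.9, 0.6, 0.5]
# r_vi: what each of those neighbours spent on category i (made-up values)
amounts = [100.0, 200.0, 400.0]

prediction = predict_user_based(sims, amounts)
print(round(prediction, 2))
```

Here the weighted sum is 90 + 120 + 200 = 410 and the similarities add up to 2.0, giving a prediction of about 205. The most similar neighbour pulls the estimate towards its own amount.
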
### Item Based Collaborative Filtering (IBCF)
This algorithm first finds the similarity score between purchase categories. Based on this score, it picks the categories most similar to those the user has already spent on or transacted in, and recommends them to the user. The predicted value for user $u$ and category $i$ is the similarity-weighted average over $N_u^k(i)$, the $k$ nearest neighbors of $i$ for which $u$ has a value:
$$\widehat{r}_{ui}=\frac{\sum_{j\in{N_u^k(i)}}{similarity(i,j)\cdot r_{uj}}}{\sum_{j\in{N_u^k(i)}}{similarity(i,j)}}$$
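
The item-based formula can be illustrated the same way. In the sketch below the function name `predict_item_based`, the category similarities and the card's amounts are all made-up assumptions; the prediction for a target category is built from the card's own spending on the `k` most similar categories.

```python
def predict_item_based(target, card_amounts, category_similarity, k=2):
    """Predict a card's amount for `target` from its spending on the k most similar categories."""
    # Pair each category the card has spent on with its similarity to the target
    candidates = [
        (category_similarity[target][catg], amount)
        for catg, amount in card_amounts.items()
        if catg != target
    ]
    # Keep the k categories most similar to the target
    neighbours = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    weighted_sum = sum(sim * amount for sim, amount in neighbours)
    return weighted_sum / sum(sim for sim, _ in neighbours)

# Made-up similarities between 'travel' and three other categories
category_similarity = {'travel': {'food': 0.5, 'home': 0.25, 'health': 0.75}}
# Made-up amounts one card has spent on those categories
card_amounts = {'food': 200.0, 'home': 400.0, 'health': 100.0}

prediction = predict_item_based('travel', card_amounts, category_similarity, k=2)
print(prediction)
```

With `k=2` only `health` and `food` are used, so the prediction is (0.75 * 100 + 0.5 * 200) / 1.25 = 140.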

The first step is to create the sparse matrix. Let's attempt both UBCF and IBCF in turn.

In [13]:
# Aggregate the total expenses of each user per purchase category across all months
agg_report = report.groupby(['card_num', 'purchase_category']).agg(
    dict(cust_name='first', trans_count='sum', trans_amt='sum', reward_points='sum')).reset_index()

In [14]:
# Create the credit card - purchase category pivot table
card_catg = agg_report.pivot(index='card_num', columns='purchase_category', values='trans_amt').fillna(0)
print('Shape of credit card - purchase category sparse matrix:', card_catg.shape)
card_catg.head()

Shape of credit card - purchase category sparse matrix: (925, 14)


card_num,entertainment,food,gas_transport,grocery_net,grocery_pos,health,home,kids_pets,misc_net,misc_pos,personal_care,shopping_net,shopping_pos,travel
60416207185,1463.67,961.45,5608.88,1188.51,5766.55,1656.54,3803.92,4502.89,1903.7,1710.08,2740.12,4369.3,8587.06,823.98
60422928733,3051.63,4396.11,5229.08,1155.5,5831.43,1343.35,4040.07,4142.15,206.09,1676.2,2900.16,2009.39,1511.84,929.75
60423098130,748.56,328.3,1602.12,218.41,3102.71,649.24,1808.06,882.38,913.56,344.56,945.19,557.78,304.15,46.9
60427851591,968.69,654.86,653.32,492.56,3133.98,1503.52,1852.79,973.7,214.27,870.52,363.0,1657.58,2413.16,4981.69
60487002085,1534.86,253.02,1700.21,537.08,1419.37,735.2,2564.36,737.25,1258.34,3263.53,362.37,4396.37,393.72,2795.45


In [15]:
# Create the purchase category - credit card pivot table
catg_card = agg_report.pivot(index='purchase_category', columns='card_num', values='trans_amt').fillna(0)
print('Shape of purchase category - credit card sparse matrix:', catg_card.shape)
catg_card.head()

Shape of purchase category - credit card sparse matrix: (14, 925)


purchase_category,60416207185,60422928733,60423098130,60427851591,60487002085,60490596305,60495593109,501802953619,501828204849,501831082224,...,4861310130652560000,4890424426862850000,4897067971111200000,4906628655840910000,4956828990005110000,4958589671582720000,4973530368125480000,4980323467523540000,4989847570577630000,4992346398065150000
entertainment,1463.67,3051.63,748.56,968.69,1534.86,1275.06,612.59,1724.98,1163.14,1776.82,...,6448.73,2658.73,1083.54,4500.72,6576.29,2106.26,1315.58,1099.53,4610.65,4944.36
food,961.45,4396.11,328.3,654.86,253.02,1288.79,961.33,2344.79,449.94,1298.4,...,5141.77,1989.1,1572.1,3234.54,3250.8,3431.27,1717.33,826.69,2333.54,2423.72
gas_transport,5608.88,5229.08,1602.12,653.32,1700.21,3185.29,1587.3,4907.27,2058.56,3796.99,...,2356.08,5037.12,2508.09,7015.64,8842.85,6349.62,3604.87,1498.42,1495.65,7702.7
grocery_net,1188.51,1155.5,218.41,492.56,537.08,238.98,506.38,235.73,94.93,234.94,...,3982.15,2007.3,1270.99,2907.82,3281.09,965.58,731.1,186.32,2108.91,209.64
grocery_pos,5766.55,5831.43,3102.71,3133.98,1419.37,7500.72,2701.37,8423.0,1718.6,4004.01,...,19928.9,5280.43,2922.86,6547.25,5656.38,5795.12,3992.14,2469.66,10527.23,11119.53


Now, we will calculate the similarity. We can use the cosine_similarity function from sklearn to calculate the cosine similarity.

In [16]:
# Calculate the credit card - credit card similarity
card_similarity = cosine_similarity(card_catg)
np.fill_diagonal(card_similarity, 0)
card_similarity_df = pd.DataFrame(card_similarity,index=card_catg.index, columns=card_catg.index)
card_similarity_df.head()

card_num,60416207185,60422928733,60423098130,60427851591,60487002085,60490596305,60495593109,501802953619,501828204849,501831082224,...,4861310130652560000,4890424426862850000,4897067971111200000,4906628655840910000,4956828990005110000,4958589671582720000,4973530368125480000,4980323467523540000,4989847570577630000,4992346398065150000
60416207185,0.0,0.811,0.778,0.689,0.636,0.852,0.709,0.849,0.792,0.806,...,0.79,0.939,0.922,0.848,0.771,0.929,0.906,0.786,0.866,0.845
60422928733,0.811,0.0,0.889,0.637,0.636,0.854,0.682,0.923,0.912,0.957,...,0.833,0.914,0.863,0.942,0.905,0.942,0.926,0.889,0.743,0.913
60423098130,0.778,0.889,0.0,0.613,0.596,0.93,0.701,0.961,0.89,0.944,...,0.88,0.857,0.79,0.853,0.836,0.856,0.877,0.937,0.798,0.958
60427851591,0.689,0.637,0.613,0.0,0.732,0.732,0.526,0.655,0.657,0.574,...,0.665,0.674,0.672,0.624,0.568,0.621,0.748,0.604,0.672,0.622
60487002085,0.636,0.636,0.596,0.732,0.0,0.625,0.787,0.634,0.804,0.591,...,0.707,0.704,0.733,0.652,0.692,0.642,0.804,0.629,0.573,0.621


In [17]:
# Calculate the purchase category - purchase category similarity
catg_similarity = cosine_similarity(catg_card)
np.fill_diagonal(catg_similarity, 0)
catg_similarity_df = pd.DataFrame(catg_similarity, index=catg_card.index, columns=catg_card.index)
catg_similarity_df.head()

purchase_category,entertainment,food,gas_transport,grocery_net,grocery_pos,health,home,kids_pets,misc_net,misc_pos,personal_care,shopping_net,shopping_pos,travel
entertainment,0.0,0.892,0.774,0.891,0.873,0.923,0.886,0.88,0.895,0.914,0.893,0.886,0.922,0.496
food,0.892,0.0,0.78,0.848,0.889,0.888,0.93,0.953,0.784,0.815,0.905,0.85,0.853,0.523
gas_transport,0.774,0.78,0.0,0.664,0.73,0.857,0.818,0.878,0.773,0.764,0.849,0.691,0.696,0.339
grocery_net,0.891,0.848,0.664,0.0,0.867,0.848,0.854,0.841,0.789,0.847,0.834,0.861,0.877,0.396
grocery_pos,0.873,0.889,0.73,0.867,0.0,0.885,0.896,0.893,0.81,0.774,0.942,0.879,0.864,0.431


Now that we have both similarity matrices in hand, let's look at the n nearest neighbors.

In [18]:
# Method to find top N neighbors
def find_n_neighbors(df, n):
    # For each row, keep the labels of the n most similar entries, in descending order of similarity
    return df.apply(axis=1, func=lambda x: pd.Series(x.sort_values(ascending=False).iloc[:n].index,
                                                     index=['top{}'.format(i) for i in range(1, n+1)]))
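
The row-wise `apply` above is easy to read but sorts every row through pandas. A vectorised alternative can be sketched with `np.argsort`, assuming a square similarity frame with zeros on the diagonal as built earlier. The name `top_n_neighbors` and the toy matrix are hypothetical, and the order among tied similarities may differ from `sort_values`.

```python
import numpy as np
import pandas as pd

def top_n_neighbors(sim_df, n):
    """Return the labels of the n most similar columns for every row."""
    # argsort on the negated values yields column positions in descending similarity order
    order = np.argsort(-sim_df.to_numpy(), axis=1)[:, :n]
    # Map column positions back to column labels
    labels = sim_df.columns.to_numpy()[order]
    return pd.DataFrame(labels, index=sim_df.index,
                        columns=['top{}'.format(i) for i in range(1, n + 1)])

# Made-up 3 x 3 similarity matrix with a zero diagonal
toy_sim = pd.DataFrame(
    [[0.0, 0.9, 0.2],
     [0.9, 0.0, 0.5],
     [0.2, 0.5, 0.0]],
    index=['a', 'b', 'c'], columns=['a', 'b', 'c'])

neighbors = top_n_neighbors(toy_sim, 2)
print(neighbors)
```

For row `a` the two nearest neighbours are `b` (0.9) and then `c` (0.2). The zero diagonal keeps an entry from being listed as its own neighbour as long as the other similarities are positive.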

In [19]:
# Find 10 neighbors of each credit card user
card_10_neighbors = find_n_neighbors(card_similarity_df, 10)
card_10_neighbors.head(10)

card_num,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10
60416207185,4810789809665940000,630423337322,371226440126102,4904681492230010,3560336828629930,4104312520615370,639023984367,2235613922823690,5152054598359920,4469777115158230000
60422928733,5596347693144590,343668971234893,4670613943676270,340103199302564,2719496466799410,376262134119629,376944481517097,4384910379661770,2348245054386320,180065479077096
60423098130,4689147265057940000,3576144910346950,4922710831011200,4110266553600170000,375848982312810,6011679934075340,6011388901471800,2233882705243590,180094608895855,4917187576956390
60427851591,3568736585751720,30030380240193,571465035400,4610050989831290,3540210836308420,180049032966888,38199021865320,4777065439639720,4149238353975790,376028110684021
60487002085,3586955669388450,346208242862904,4481131401752,501882822387,3536818734263520,2285066385084290,4173950183554600,3596357274378600,4900628639996,3501509250702460
60490596305,4826655832045230,213120463918358,3564182536169290,30175986190993,30371006069917,4769426683924050000,4254074738931270,4449530933957320,4671727014157740,213114122496591
60495593109,180040027502291,573283817795,2284059275940010,38014427445058,30510856607165,2358122155477950,2288748891690220,4155021259183870,3581130339108560,3563837241599440
501802953619,2248735346244810,375848982312810,4769426683924050000,3546897637165770,6011388901471800,4509142395811240,3547574373318970,2305336922781610,2383461948823900,4294930380592
501828204849,180036456789979,371985236239474,4742883543039280000,6597888193422450,30118423745458,4859525594182530,3517814635263520,4430881574719610,4427805710168,3523843138706400
501831082224,3541687240161490,376262134119629,4874006077381170,4736845434667900000,3542162746848550,4862293128558,4128027264554080,3543591270174050,5559857416065240,639077309909


In [20]:
# Find 10 neighbors of each purchase category
catg_10_neighbors = find_n_neighbors(catg_similarity_df, 10)
catg_10_neighbors.head(10)

purchase_category,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10
entertainment,health,shopping_pos,misc_pos,misc_net,personal_care,food,grocery_net,home,shopping_net,kids_pets
food,kids_pets,home,personal_care,entertainment,grocery_pos,health,shopping_pos,shopping_net,grocery_net,misc_pos
gas_transport,kids_pets,health,personal_care,home,food,entertainment,misc_net,misc_pos,grocery_pos,shopping_pos
grocery_net,entertainment,shopping_pos,grocery_pos,shopping_net,home,health,food,misc_pos,kids_pets,personal_care
grocery_pos,personal_care,home,kids_pets,food,health,shopping_net,entertainment,grocery_net,shopping_pos,misc_net
health,home,entertainment,personal_care,kids_pets,food,grocery_pos,shopping_pos,gas_transport,shopping_net,grocery_net
home,health,food,kids_pets,personal_care,grocery_pos,entertainment,shopping_pos,grocery_net,shopping_net,gas_transport
kids_pets,food,personal_care,home,health,grocery_pos,entertainment,gas_transport,grocery_net,shopping_pos,shopping_net
misc_net,entertainment,misc_pos,health,personal_care,shopping_pos,shopping_net,kids_pets,grocery_pos,grocery_net,food
misc_pos,entertainment,misc_net,shopping_pos,grocery_net,health,personal_care,kids_pets,shopping_net,food,home


Let's verify the similarity on both the item and the user side to find out if our calculations are correct

In [21]:
def get_card_similar_catgs(card1, card2):
    common_catgs = agg_report[agg_report.card_num == card1].merge(
        agg_report[agg_report.card_num == card2],
        on = "purchase_category",
        how = "inner")
    return common_catgs[['trans_amt_x', 'trans_amt_y', 'purchase_category']].head()

In [22]:
# Check the similarity of two users
get_card_similar_catgs(60416207185, 4810789809665940000)

Unnamed: 0,trans_amt_x,trans_amt_y,purchase_category
0,1463.67,1551.4,entertainment
1,961.45,838.79,food
2,5608.88,5462.01,gas_transport
3,1188.51,1497.84,grocery_net
4,5766.55,5668.65,grocery_pos


**Observations**:
- From the above step we can see that the similarity we computed is plausible, since the two credit cards (**60416207185**, **4810789809665940000**) have spent almost the same amounts in the purchase categories shown.

catg_similarity and card_similarity are the catg-catg and card-card similarity matrices, respectively, in array form. The next step is to make predictions based on these similarities. Let’s define a function to do just that.

In [23]:
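# Method to predict the transaction amount
def predict(amount, similarity, type='card'):
    # Work on plain NumPy arrays so that broadcasting behaves the same for DataFrame and array inputs
    amount = np.asarray(amount, dtype=float)
    # Sum of absolute similarities per row, used to normalise the weighted sums
    sim_sums = np.abs(similarity).sum(axis=1)
    if type == 'card':
        # Mean transaction amount per card, kept as a column vector so it broadcasts against amount
        mean_card_txn = amount.mean(axis=1, keepdims=True)
        amt_diff = amount - mean_card_txn
        pred = mean_card_txn + similarity.dot(amt_diff) / sim_sums[:, np.newaxis]
    elif type == 'catg':
        pred = amount.dot(similarity) / sim_sums[np.newaxis, :]
    else:
        raise ValueError("type must be 'card' or 'catg'")
    return pred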
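
To see what the `'card'` branch computes, the following sketch repeats its arithmetic on a tiny made-up example: each card's prediction is its own mean plus the similarity-weighted average of the other cards' deviations from their means. The matrices are illustrative assumptions, not data from the report.

```python
import numpy as np

# Rows are cards, columns are purchase categories (made-up amounts)
amounts = np.array([[10.0, 20.0, 30.0],
                    [20.0, 40.0, 60.0],
                    [ 5.0,  5.0,  5.0]])

# Made-up card-card similarities with a zero diagonal
similarity = np.array([[0.0, 1.0, 1.0],
                       [1.0, 0.0, 1.0],
                       [1.0, 1.0, 0.0]])

card_mean = amounts.mean(axis=1, keepdims=True)            # each card's average spend
deviation = amounts - card_mean                            # spend relative to that average
weights = np.abs(similarity).sum(axis=1, keepdims=True)    # normalising constant per card

pred = card_mean + similarity.dot(deviation) / weights
print(pred)
```

The third card spends a flat 5.0 everywhere, yet its prediction for the first category comes out below zero because its neighbours spend less than their own average there. Mean-centred predictions are therefore not guaranteed to be non-negative, which is one way negative values in a prediction table can arise.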

Finally, we will make predictions based on credit card similarity and purchase category similarity, and recommend purchase categories based on them.

In [24]:
# Predict the transaction amount for both UBCF and IBCF
st=time()
card_prediction = predict(card_catg, card_similarity, type='card')
card_prediction = pd.DataFrame(card_prediction, index=card_catg.index, columns=card_catg.columns)
card_prediction.head()

catg_prediction = predict(card_catg, catg_similarity, type='catg')
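# The IBCF prediction has one row per card and one column per category, i.e. the same shape as card_catg,
# so it is labelled with card_catg's index and columns (catg_card's are transposed and do not match)
catg_prediction = pd.DataFrame(np.asarray(catg_prediction), index=card_catg.index, columns=card_catg.columns)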
print('Time taken %.2fs to find out the credit card and purchase category prediction' % (time()-st))

card_num,entertainment,food,gas_transport,grocery_net,grocery_pos,health,home,kids_pets,misc_net,misc_pos,personal_care,shopping_net,shopping_pos,travel
60416207185,3010.349,2389.047,4192.671,1356.453,6935.67,2371.286,3534.088,3297.007,2554.336,2566.623,2303.185,4012.202,4455.081,2108.652
60422928733,2522.585,1965.771,3799.507,897.123,6448.348,1921.575,3097.188,2886.239,2053.638,2078.175,1866.262,3443.043,3789.271,1654.024
60423098130,657.605,78.202,1929.812,-971.21,4753.436,65.748,1231.69,1004.105,229.366,194.554,21.399,1609.191,1932.568,-284.548
60427851591,1248.288,625.506,2203.616,-445.258,5209.125,611.267,1853.385,1467.477,692.597,711.392,473.526,2264.396,2715.067,1103.254
60487002085,1366.047,723.009,2437.951,-307.883,5121.997,715.929,1897.177,1589.853,891.664,926.673,591.538,2475.548,2696.592,825.035


Time taken 0.03s to find out the credit card and purchase category prediction


In [25]:
# Method to recommend the purchase categories with the highest predicted transaction amounts
def recommend_catgs(card, orig_df, preds_df, top_n):
    # Get and sort the credit card's actual and predicted transaction amounts
    sorted_card_txns = orig_df.loc[card].sort_values(ascending=False)
    sorted_card_predictions = preds_df.loc[card].sort_values(ascending=False)

    # Prepare recommendations (concat aligns the two Series on purchase category)
    recommendations = pd.concat([sorted_card_txns, sorted_card_predictions], axis=1)
    recommendations.index.name = 'Recommended Purchase Category'
    recommendations.columns = ['card_trans_amt', 'card_predictions']

    # Take the purchase categories which the card has NOT spent money on;
    # if there are none, fall back to the top_n categories with the lowest spend
    unspent = recommendations.loc[recommendations.card_trans_amt == 0]
    if unspent.empty:
        recommendations = recommendations.sort_values('card_trans_amt').iloc[:top_n]
    else:
        recommendations = unspent
    recommendations = recommendations.sort_values('card_predictions', ascending=False)
    return recommendations.head(top_n)

In [26]:
# Find recommendations for a couple of credit cards using UBCF
find_recom = {639046421587: 4,
              3597337756918960: 3,
              4497451418073890000: 5}   # Maps card_num -> number of recommendations (top_n)
for card in find_recom:
    print("Top %d recommendations for the card holder of %d" %(find_recom[card],card))
    recommend_catgs(card, card_catg, card_prediction, find_recom[card])
    print("\n") 

Top 4 recommendations for the card holder of 639046421587


Recommended Purchase Category,card_trans_amt,card_predictions
gas_transport,1312.17,4411.274
misc_pos,1359.72,2959.24
food,1635.9,2829.565
grocery_net,734.22,1782.7




Top 3 recommendations for the card holder of 3597337756918960


Recommended Purchase Category,card_trans_amt,card_predictions
shopping_net,604.27,2439.276
misc_net,471.34,1046.66
grocery_net,675.44,-100.415




Top 5 recommendations for the card holder of 4497451418073890000


Recommended Purchase Category,card_trans_amt,card_predictions
gas_transport,921.81,3367.246
misc_net,1474.26,1853.49
misc_pos,1742.55,1840.54
health,1609.96,1649.233
travel,71.12,1362.991






**Observations**:
- Unlike the popularity model, the recommendations here are personalized: different credit card holders receive different sets of recommendations based on their expenses.

---
UBCF and IBCF are **neighborhood approaches** to collaborative recommendation. Based on a chosen similarity metric, the nearest neighbors, i.e. those considered most similar in terms of past history, are calculated.