# Overall Project Description

#### This model will generate Colorado 14er hike recommendations for a given user, first based simply on a standard user-based collaborative filter engine with user similarities calculated via the Pearson coefficient determined from the (user,mountain) matrix of hike checklists. 

#### Next, to address the "long-tail" issue and de-emphasize the importance of a small number of highly popular peaks hiked by many users, the similarity calculation will be updated to more heavily weight common experiences on less popular peaks.

#### Finally, the Pearson similarity metric will be combined in a weighted average with a second similarity metric computed from commonalities between user-provided profile information and hiking preferences.

#### The RMSE on a train-test split is used to evaluate each approach.  However, some tradeoff between minimized RMSE and increased diversity of top recommendations is acceptable in this context.

#### The same approach is repeated in a notebook for 13er hike recommendations.  There is a far wider scope of 13ers in Colorado, and more interesting utility to be gained from this latter recommendation engine.  The 14er recommender is constructed first to test model functionality with a pool of well-known peaks.

As summarized above, the calculation will proceed first using the Pearson coefficient for calculating similarity between users; that is, the similarity between users 1 and 2 is given by  
$s(u_1,u_2)=\frac{\sum_k (x_{1,k}-\bar{x}_1)(x_{2,k}-\bar{x}_2)}{\sqrt{\sum_k (x_{1,k}-\bar{x}_1)^2 \sum_k (x_{2,k}-\bar{x}_2)^2}}$

where $x_{n,k}$ is the rating that a user $n$ gives to a summit of peak $k$, and $\bar{x}_n$ is the average rating over all peaks given by user $n$.  
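As a quick sanity check of the definition, the sketch below evaluates the Pearson similarity for two made-up users over five peaks and compares it with `np.corrcoef`. The helper name `pearson_pair` and the rating vectors are illustrative assumptions, not part of the project data.

```python
import numpy as np

def pearson_pair(x1, x2):
    """Pearson similarity between two users' rating vectors."""
    d1 = x1 - x1.mean()  # mean-subtracted ratings of user 1
    d2 = x2 - x2.mean()  # mean-subtracted ratings of user 2
    return float(np.dot(d1, d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2)))

# Hypothetical ratings over five peaks: 1 = hiked, -1 = not hiked
user_1 = np.array([1, 1, -1, -1, 1], dtype=float)
user_2 = np.array([1, 1, -1, 1, -1], dtype=float)

s = pearson_pair(user_1, user_2)
print(round(s, 4))                                     # 0.1667
print(round(np.corrcoef(user_1, user_2)[0, 1], 4))     # same value from NumPy
```

The two users agree on three of five peaks, which gives a small positive similarity of 1/6.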

Currently, the ratings $x_{n,k}$ are simply binary representations of whether a user successfully summited a peak: -1 (user did not hike the peak) or 1 (user did hike the peak).  Using the actual number of climbs exacerbates the long-tail problem, since many users return repeatedly to the popular peaks simply out of convenience and proximity to cities, which is not intended as a factor in this recommendation system.

The predicted rating of peak $k$ for a given user $n$ is then calculated from the user similarities. The intuition is to start from user $n$'s average likelihood of hiking any peak ($\bar{x}_n$) and add a normalized, similarity-weighted sum of the mean-subtracted ratings of peak $k$ over all users $m$:

$\hat{x}_{n,k} = \bar{x}_n + \frac{\sum_{u_m}s(u_n, u_m)(x_{m,k} - \bar{x}_m)}{\sum_{u_m}\left| s(u_n,u_m) \right|}$
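The prediction rule can be traced on a toy example. The sketch below uses a made-up 3-user by 4-peak matrix, takes the similarities from `np.corrcoef`, and normalizes by the sum of absolute similarities, which is what the `predict` function in this notebook does. All names and values here are illustrative assumptions.

```python
import numpy as np

# Hypothetical ratings: 3 users x 4 peaks, 1 = hiked, -1 = not hiked
ratings = np.array([
    [ 1,  1, -1, -1],
    [ 1,  1,  1, -1],
    [-1, -1,  1,  1],
], dtype=float)

# Pearson similarity between users (rows are treated as variables)
similarity = np.corrcoef(ratings)

user_means = ratings.mean(axis=1, keepdims=True)        # average rating per user
deviations = ratings - user_means                       # mean-subtracted ratings
norm = np.abs(similarity).sum(axis=1, keepdims=True)    # sum of |s| per user

predicted = user_means + similarity @ deviations / norm
print(np.round(predicted, 3))
```

For user 0, who has not hiked peaks 2 and 3, peak 2 scores higher than peak 3 because the positively correlated user 1 has hiked peak 2 but not peak 3.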



In [2]:
import pandas as pd
import numpy as np
import random
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
import sklearn
import json
#sns.reset_orig()
sns.set_style('white')
%matplotlib inline
pd.set_option('mode.chained_assignment',None)

## Import Checklist Data and User Data

Import the checklist data that was previously scraped from a popular hiking website as part of this project.  Also import the user data, which maps each user ID to a username, and merge the two dataframes with a left join.
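To illustrate what the left join does, here is a minimal sketch with made-up data: every checklist row is kept, and a user with no profile entry simply gets a missing `Username`. The toy frames below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical checklist rows; user 3 has no matching profile
checklist = pd.DataFrame({
    'UserId': [1, 1, 2, 3],
    'PeakName': ['GraysPeak', 'TorreysPeak', 'GraysPeak', 'MtElbert'],
})
users = pd.DataFrame({'UserId': [1, 2], 'Username': ['alice', 'bob']})

# A left join keeps all rows of the left frame (the checklist)
merged = pd.merge(checklist, users, on='UserId', how='left')
print(merged)
```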

In [12]:
fname_14erList = r'14erChecklistByUser_df'
fname_13erList = r'13erChecklistByUser_df'

df_14ers = pickle.load(open(fname_14erList,"rb"))
df_14ers = df_14ers[df_14ers['PeakName']!=""]
df_14ers = df_14ers.sort_values('PeakName')
df_14ers['NumClimbs'] = df_14ers['NumClimbs'].apply(lambda x: -1 if x==0 else 1)
#names = pd.Series(np.arange(len(names)), names)

fname_user = r'user_profile_data_FULL'
user_df = pickle.load(open(fname_user,"rb"))
user_df = user_df[['UserId','Username']]

df_14ers = pd.merge(df_14ers, user_df, on='UserId', how='left')
df_14ers.head()

|   | UserId | PeakName | NumClimbs | Username |
|---|---|---|---|---|
| 0 | 7398 | BlancaPeak | 1 | JimS |
| 1 | 20364 | BlancaPeak | 1 | skicrazy2121 |
| 2 | 179 | BlancaPeak | 1 | Steve |
| 3 | 5784 | BlancaPeak | 1 | keflavich |
| 4 | 46480 | BlancaPeak | 1 | JeffSheets |


In [13]:
## Count checklist rows per peak and drop peaks with fewer than 200 entries
climb_counts = df_14ers.groupby('PeakName').agg('count').sort_values('NumClimbs')
dropnames = climb_counts[climb_counts['NumClimbs']<200].index.tolist()
df_14ers = df_14ers[~df_14ers['PeakName'].isin(dropnames)]
#df_14ers[df_14ers['NewUserId']==4].head(40)
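The popularity filter above can be checked on a small made-up frame. This sketch counts rows per peak with `groupby(...).size()` and keeps only peaks that meet a threshold; the threshold `min_logs` and the data are hypothetical.

```python
import pandas as pd

# Hypothetical checklist: peak A has 3 rows, B has 1, C has 2
df = pd.DataFrame({
    'PeakName': ['A', 'A', 'A', 'B', 'C', 'C'],
    'UserId':   [1, 2, 3, 1, 2, 3],
})

min_logs = 2                                   # assumed threshold for this toy example
counts = df.groupby('PeakName').size()         # number of checklist rows per peak
keep = counts[counts >= min_logs].index        # peaks that meet the threshold
filtered = df[df['PeakName'].isin(keep)]
print(filtered['PeakName'].unique())
```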

In [14]:
n_14er_users = df_14ers.UserId.nunique()
n_14ers = df_14ers.PeakName.nunique()
print(n_14er_users)
print(n_14ers)
## Create a dataframe that assigns an index to each peak
index_ind=pd.DataFrame(data={'PeakId':np.arange(len(df_14ers.PeakName.unique())),'PeakName':df_14ers.PeakName.unique()})
index_ind.head(12)

15207
65


|   | PeakId | PeakName |
|---|---|---|
| 0 | 0 | BlancaPeak |
| 1 | 1 | CapitolPeak |
| 2 | 2 | CastlePeak |
| 3 | 3 | ChallengerPoint |
| 4 | 4 | ConundrumPeak |
| 5 | 5 | CrestoneNeedle |
| 6 | 6 | CrestonePeak |
| 7 | 7 | CulebraPeak |
| 8 | 8 | EastLaPlata |
| 9 | 9 | ElDientePeak |


In [15]:
df_14ers = df_14ers.merge(index_ind)
df_14ers.head()

|   | UserId | PeakName | NumClimbs | Username | PeakId |
|---|---|---|---|---|---|
| 0 | 7398 | BlancaPeak | 1 | JimS | 0 |
| 1 | 20364 | BlancaPeak | 1 | skicrazy2121 | 0 |
| 2 | 179 | BlancaPeak | 1 | Steve | 0 |
| 3 | 5784 | BlancaPeak | 1 | keflavich | 0 |
| 4 | 46480 | BlancaPeak | 1 | JeffSheets | 0 |


In [16]:
## We should also redo the user ids since they span the length of the full user database, and we
## have only taken the subset of the users that have logged a 14er hike
index_id=pd.DataFrame(data={'NewUserId':np.arange(len(df_14ers.UserId.unique())),'UserId':df_14ers.UserId.unique()})
df_14ers = df_14ers.merge(index_id)
df_14ers.head()

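|   | UserId | PeakName | NumClimbs | Username | PeakId | NewUserId |
|---|---|---|---|---|---|---|
| 0 | 7398 | BlancaPeak | 1 | JimS | 0 | 0 |
| 1 | 7398 | CapitolPeak | 1 | JimS | 1 | 0 |
| 2 | 7398 | CastlePeak | 1 | JimS | 2 | 0 |
| 3 | 7398 | ChallengerPoint | 1 | JimS | 3 | 0 |
| 4 | 7398 | ConundrumPeak | 1 | JimS | 4 | 0 |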
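The re-indexing of user IDs into a dense `0..n-1` range can also be done with `pd.factorize`, which assigns codes in order of first appearance, the same order that `unique()` returns. A minimal sketch on made-up IDs (the toy frame is an assumption):

```python
import pandas as pd

# Hypothetical sparse user IDs, as they might appear in a checklist
toy = pd.DataFrame({'UserId': [7398, 20364, 7398, 179]})

# codes[i] is the dense index of toy['UserId'][i]; uniques lists the original IDs
codes, uniques = pd.factorize(toy['UserId'])
toy['NewUserId'] = codes
print(toy)
```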


In [17]:
### Run if desired
pickle.dump(df_14ers,open(r'14ers_df_final',"wb"))

In [18]:
### Split df into train/test
from sklearn.model_selection import train_test_split
## random.seed() does not affect scikit-learn's shuffling, so pass the seed via random_state
train_data14ers, test_data14ers = train_test_split(df_14ers, test_size=0.25, random_state=1000)

## Build the similarity metric matrix between users based only on Pearson metric, and then create a peak prediction for each user 

In [22]:
train_data_userids = train_data14ers['NewUserId'].tolist()
test_data_userids = test_data14ers['NewUserId'].tolist()

In [29]:
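## Start with every (user, peak) entry marked as "not hiked" (-1)
full_data_matrix = -1*np.ones((n_14er_users, n_14ers))
for line in df_14ers.itertuples():
    ## rows are users (NewUserId), columns are peaks (PeakId)
    full_data_matrix[line.NewUserId, line.PeakId] = line.NumClimbs
## Zero out every row belonging to a user that appears in the train split
train_data_matrix = np.copy(full_data_matrix)
train_data_matrix[train_data_userids,:]=0
## Zero out every row belonging to a user that appears in the test split
test_data_matrix = np.copy(full_data_matrix)
test_data_matrix[test_data_userids,:]=0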
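For comparison, a common convention in collaborative filtering is to hold out individual (user, peak) entries rather than whole user rows, using 0 to mark "hidden". The sketch below assumes that convention on a small random matrix; the names and data are illustrative and not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical full matrix: 4 users x 5 peaks, 1 = hiked, -1 = not hiked
full = np.where(rng.random((4, 5)) < 0.5, 1.0, -1.0)

# Randomly assign roughly 25% of the entries to the test set
test_mask = rng.random(full.shape) < 0.25

train = np.where(test_mask, 0.0, full)   # test entries are hidden from training
test = np.where(test_mask, full, 0.0)    # only the held-out entries are kept

# Every entry lives in exactly one of the two matrices
print(np.array_equal(train + test, full))
```

With this layout, an RMSE that only looks at the nonzero entries of `test` evaluates exactly the held-out pairs.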


##### Pearson Metric 

$s(u_i,u_j)=\frac{\sum_k (x_{i,k}-\bar{x_i})(x_{j,k}-\bar{x_j})}{\sqrt{\sum_k (x_{i,k}-\bar{x_i})^2 \sum_k (x_{j,k}-\bar{x_j})^2}}$

The (i,j) are user indices and "k" is a peak index

In [31]:
def pearson_correlation(data_matrix,frac=1):
    epsilon = 1e-5  #avoids division by zero for users whose ratings are all identical
    #data_matrix has dimensions of (nusers,n14ers)
    nusers = np.shape(data_matrix)[0]
    pcorr_matrix = np.zeros((nusers,nusers))
    user_means = np.mean(data_matrix,axis=1)*1/frac ##rescale if needed when inputting partially-zeroed matrices from train and test
    mean_subtract = data_matrix - np.reshape(user_means,(nusers,1))
    denom_sum_sq = (np.sum((mean_subtract)**2,axis=1))**0.5 + epsilon
    for i in range(nusers):
        for j in range(nusers):
            pcorr_matrix[i,j] = np.dot(mean_subtract[i,:],mean_subtract[j,:])/(denom_sum_sq[i]*denom_sum_sq[j])
    return pcorr_matrix
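The double loop above makes one `np.dot` call per user pair, which is slow for roughly 15,000 users. As a sketch of an alternative, the same quantity can be computed with a single matrix product. The function name `pearson_matrix` and the toy data are assumptions; the result is checked against `np.corrcoef`.

```python
import numpy as np

def pearson_matrix(data, epsilon=1e-5):
    """Pearson similarity between all pairs of rows, computed in one shot."""
    centered = data - data.mean(axis=1, keepdims=True)      # subtract each user's mean
    norms = np.sqrt((centered**2).sum(axis=1)) + epsilon    # per-user normalization
    return (centered @ centered.T) / np.outer(norms, norms)

# Hypothetical ratings: 3 users x 4 peaks
toy = np.array([
    [1,  1, -1, -1],
    [1, -1,  1, -1],
    [1,  1,  1, -1],
], dtype=float)

sim = pearson_matrix(toy)
print(np.allclose(sim, np.corrcoef(toy), atol=1e-4))  # True
```

Note that the full similarity matrix still needs memory for `nusers * nusers` floats, so the vectorized form trades no extra memory for a large reduction in Python-level looping.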

In [32]:
def cosine_correlation(data_matrix,frac=1):
    #frac is unused here; it is kept so the signature matches pearson_correlation
    epsilon = 1e-5
    #data_matrix has dimensions of (nusers,n14ers)
    nusers = np.shape(data_matrix)[0]
    ccorr_matrix = np.zeros((nusers,nusers))
    #cosine similarity uses the raw ratings, without subtracting the user means
    denom_sum_sq = (np.sum((data_matrix)**2,axis=1))**0.5 + epsilon
    for i in range(nusers):
        for j in range(nusers):
            ccorr_matrix[i,j] = np.dot(data_matrix[i,:],data_matrix[j,:])/(denom_sum_sq[i]*denom_sum_sq[j])
    return ccorr_matrix
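The overview mentions re-weighting the similarity so that shared experiences on less popular peaks count more. That variant is not shown in this section; the sketch below is one possible form, a weighted Pearson correlation with per-peak weights. The inverse-popularity weighting, the function name `weighted_pearson`, and the toy data are all assumptions, not the project's actual scheme.

```python
import numpy as np

def weighted_pearson(data, weights, epsilon=1e-12):
    """Pearson similarity between rows of `data`, with per-peak weights."""
    w = weights / weights.sum()                  # normalize weights to sum to 1
    means = data @ w                             # weighted mean rating per user
    centered = data - means[:, np.newaxis]
    cov = (centered * w) @ centered.T            # weighted covariance between users
    std = np.sqrt(np.diag(cov)) + epsilon
    return cov / np.outer(std, std)

# Hypothetical checklist: 4 users x 4 peaks, 1 = hiked, -1 = not hiked
toy = np.array([
    [1,  1, -1, -1],
    [1, -1,  1, -1],
    [1,  1,  1, -1],
    [1, -1, -1,  1],
], dtype=float)

# Assumed weighting: inverse of (1 + number of users who hiked the peak)
popularity = (toy == 1).sum(axis=0)
weights = 1.0 / (1.0 + popularity)

sim_weighted = weighted_pearson(toy, weights)
sim_uniform = weighted_pearson(toy, np.ones(toy.shape[1]))  # reduces to plain Pearson
print(np.round(sim_weighted, 3))
```

With uniform weights the function reproduces the ordinary Pearson matrix, which gives a simple way to check the implementation before trying other weightings.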

In [None]:
### WARNING: this will take a bit of time (~10 mins).  Pre-Load data if possible
user_similarity_full = pearson_correlation(full_data_matrix,1)
#pickle.dump(user_similarity_full,open(r'user_similarities_14er_pearson_full',"wb"))

In [33]:
user_similarity_full=pickle.load(open(r'user_similarities_14er_pearson_full',"rb"))

In [34]:
print(user_similarity_full[5000,-4000:-3900])

[ 0.23337712  0.14426057  0.02531338  0.23789859  0.06112325  0.03665148
  0.03665148  0.14426057  0.17117169  0.17117169  0.11948398  0.19520464
  0.10030286 -0.02410707  0.28772734  0.08332895  0.16867405  0.28772734
  0.03665148  0.0566243   0.23789859  0.05764809  0.03665148  0.25689865
  0.19702652  0.34394594  0.12513812  0.35167722 -0.01946856  0.22619658
  0.12157116  0.08021728  0.06112325  0.23337712  0.33026892  0.05752817
 -0.03473689  0.23789859  0.11948398 -0.11915439  0.05752817  0.05752817
  0.13189494  0.17518244  0.08021728  0.08837329  0.08023387  0.02531338
 -0.01946856  0.3101754   0.12148316  0.22437477  0.28772734  0.10480361
  0.11948398  0.22437477  0.19520464  0.47602322  0.23789859  0.12157116
  0.15875356  0.25186967  0.08023387  0.10107526  0.16867405  0.05752817
 -0.00170005  0.22437477  0.08021728  0.16226523  0.08021728  0.05764809
 -0.04409555  0.12513812  0.0083602   0.05764809 -0.03650706  0.15640666
  0.25689865  0.03665148  0.14426057  0.04286495  0

##### Spot checks to make sure that the 14er lists of user pairs rated most similar are actually close

In [35]:
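## argsort is ascending, so index [-1] is the user itself and [-2] is the most similar other user
sim_to_10000=user_similarity_full[10000,:].argsort() 
sim_to_5600=user_similarity_full[5600,:].argsort()
print(user_similarity_full[10000,sim_to_5600[-2]])
print(user_similarity_full[5000,sim_to_10000])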
print('Check User 10000')
print(df_14ers[df_14ers['NewUserId']==sim_to_10000[-2]])
print('\n')
print(df_14ers[df_14ers['NewUserId']==10000])
print('\n\n')
print('Check User 5600')
print(df_14ers[df_14ers['NewUserId']==sim_to_5600[-2]])
print('\n')
print(df_14ers[df_14ers['NewUserId']==5600])

0.3806331687339949
[-0.32944555 -0.32944555 -0.32944555 ...  0.29378607  0.19702652
  0.16226523]
Check User 10000
       UserId     PeakName  NumClimbs   Username  PeakId  NewUserId
202215  41300    GraysPeak          1  drewlaski      11       8354
202216  41300  MtBierstadt          1  drewlaski      24       8354
202217  41300     MtElbert          1  drewlaski      29       8354
202218  41300  MtPrinceton          1  drewlaski      36       8354
202219  41300  TorreysPeak          1  drewlaski      59       8354


       UserId          PeakName  NumClimbs  Username  PeakId  NewUserId
220227  74031         GraysPeak          1  acarroll      11      10000
220228  74031  MissouriMountain          1  acarroll      21      10000
220229  74031       MtBierstadt          1  acarroll      24      10000
220230  74031          MtElbert          1  acarroll      29      10000
220231  74031       MtPrinceton          1  acarroll      36      10000
220232  74031       TorreysPeak          1 

In [36]:
def predict(ratings, similarity):
    epsilon = 1e-5
    mean_user_rating = ratings.mean(axis=1)
    ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
    pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / (np.array([np.abs(similarity).sum(axis=1)])+epsilon).T
    return pred

In [39]:
user_prediction_full = predict(full_data_matrix, user_similarity_full)

In [40]:
### The prediction matrix along with the user info dataframe will be used to make the Flask API
pickle.dump(user_prediction_full,open(r'user_prediction_full',"wb"))

### Perform some spot-checks of the recommendations 

In [44]:
np.shape(user_prediction_full)
### For now I just want to explicitly exclude the common peaks
common_peaks = ['GraysPeak', 'MtBierstadt', 'QuandaryPeak', 'TorreysPeak', 'MtDemocrat', 'MtElbert', 'MtBross', 'MtCameron', 'MtEvans', 'MtSherman', 'PikesPeak']
n=12
user_prediction=user_prediction_full.astype('float')

args=((-user_prediction[5000,:]).argsort())
hiked_list_5000 = (df_14ers[df_14ers['NewUserId']==5000])['PeakName'].tolist()
print('User 5000 Hiked:')
print(hiked_list_5000)
rec_list_5000 = index_ind['PeakName'][args].tolist()
new_rec_list_5000 = [elem for elem in rec_list_5000 if elem not in hiked_list_5000 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 5000:')
print(new_rec_list_5000[0:n])
args=((-user_prediction[5600,:]).argsort())
hiked_list_5600 = (df_14ers[df_14ers['NewUserId']==5600])['PeakName'].tolist()
print('\n\n')
print('User 5600 Hiked:')
print(hiked_list_5600)
rec_list_5600 = index_ind['PeakName'][args].tolist()
new_rec_list_5600 = [elem for elem in rec_list_5600 if elem not in hiked_list_5600 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 5600:')
print(new_rec_list_5600[0:n])

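## Recompute the ranking for user 8600; otherwise the ranking for user 5600 would be reused
args=((-user_prediction[8600,:]).argsort())
hiked_list_8600 = (df_14ers[df_14ers['NewUserId']==8600])['PeakName'].tolist()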
print('\n\n')
print('User 8600 Hiked:')
print(hiked_list_8600)
rec_list_8600 = index_ind['PeakName'][args].tolist()
new_rec_list_8600 = [elem for elem in rec_list_8600 if elem not in hiked_list_8600 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 8600:')
print(new_rec_list_8600[0:n])

User 5000 Hiked:
['ChallengerPoint', 'CrestoneNeedle', 'HuronPeak', 'KitCarsonPeak', 'LaPlataPeak', 'MtBierstadt', 'MtHarvard', 'MtLindsey', 'MtMassive']



Recommendation for User 5000:
['LongsPeak', 'MtBelford', 'MtYale', 'MtShavano', 'MtPrinceton', 'MtOxford', 'MtoftheHolyCross', 'MissouriMountain', 'TabeguachePeak', 'MtColumbia', 'HumboldtPeak', 'MtAntero']



User 5600 Hiked:
['ElDientePeak', 'GraysPeak', 'LaPlataPeak', 'LongsPeak', 'MtAntero', 'MtBelford', 'MtBierstadt', 'MtBross', 'MtCameron', 'MtDemocrat', 'MtElbert', 'MtEvans', 'MtMassive', 'MtOxford', 'MtSherman', 'MtWilson', 'MtoftheHolyCross', 'QuandaryPeak', 'RedcloudPeak', 'TorreysPeak', 'WetterhornPeak']



Recommendation for User 5600:
['HuronPeak', 'MtYale', 'MtShavano', 'MtPrinceton', 'MissouriMountain', 'MtHarvard', 'TabeguachePeak', 'HandiesPeak', 'MtColumbia', 'SunshinePeak', 'HumboldtPeak', 'UncompahgrePeak']



User 8600 Hiked:
['GraysPeak', 'MtBross', 'MtCameron', 'MtElbert']



Recommendation for User 8600:
['L
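The spot-check logic repeats the same steps for each user: rank peaks by predicted score, then drop peaks already hiked and peaks on the exclusion list. A small helper could wrap this up; the sketch below is one way to do it, with a hypothetical function name and made-up scores.

```python
import numpy as np

def top_n_recommendations(scores, peak_names, hiked, excluded=(), n=3):
    """Return the n highest-scoring peaks not already hiked or excluded."""
    order = np.argsort(-scores)                  # indices sorted by descending score
    skip = set(hiked) | set(excluded)
    return [peak_names[i] for i in order if peak_names[i] not in skip][:n]

# Hypothetical predicted scores for one user
peak_names = ['GraysPeak', 'LongsPeak', 'MtYale', 'HuronPeak', 'MtElbert']
scores = np.array([0.9, 0.4, 0.7, 0.1, 0.8])

recs = top_n_recommendations(
    scores, peak_names, hiked=['MtElbert'], excluded=['GraysPeak'], n=2
)
print(recs)  # ['MtYale', 'LongsPeak']
```

Computing the ranking inside the helper means each user always gets a ranking built from their own row of predictions.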

## Test Solution

In [45]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))
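To make the masking in `rmse` concrete, the sketch below computes the same quantity with plain NumPy on a tiny made-up example: only entries where the ground truth is nonzero contribute to the error. The function name `masked_rmse` and the arrays are illustrative assumptions.

```python
import numpy as np

def masked_rmse(prediction, ground_truth):
    """RMSE over the entries where ground_truth is nonzero."""
    mask = ground_truth != 0
    return float(np.sqrt(np.mean((prediction[mask] - ground_truth[mask])**2)))

# Hypothetical data: zeros in `truth` mark entries excluded from evaluation
truth = np.array([[1.0, 0.0, -1.0],
                  [0.0, 1.0,  0.0]])
pred = np.array([[0.5, 0.9, -1.0],
                 [0.3, 0.0, -0.2]])

score = masked_rmse(pred, truth)
print(round(score, 4))  # 0.6455
```

Three entries are evaluated here, with squared errors 0.25, 0 and 1, so the result is the square root of 1.25/3.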

In [None]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))
