# Overall Project Description

###### This project generates 14er and 13er hike recommendations for a given user, based initially on just the checklist of other hikes that the user has done.  Similarity between users is computed with the Pearson correlation coefficient; that is, the similarity between users 1 and 2 is given by  
$s(u_1,u_2)=\frac{\sum_k (x_{1,k}-\bar{x}_1)(x_{2,k}-\bar{x}_2)}{\sqrt{\sum_k (x_{1,k}-\bar{x}_1)^2 \sum_k (x_{2,k}-\bar{x}_2)^2}}$

where $x_{n,k}$ is the rating that a user $n$ gives to a summit of peak $k$, and $\bar{x}_n$ is the average rating over all peaks given by user $n$.  
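As a quick illustration of the similarity formula, here is a minimal sketch for two users on a toy set of four peaks. The helper name `pearson_similarity` and the toy rating vectors are made up for this example and are not part of the notebook's pipeline; returning 0 when a user has constant ratings is also just one possible convention.

```python
import numpy as np

def pearson_similarity(x1, x2):
    """Pearson similarity between two users' rating vectors (hypothetical helper)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    d1 = x1 - x1.mean()  # mean-subtracted ratings for user 1
    d2 = x2 - x2.mean()  # mean-subtracted ratings for user 2
    denom = np.sqrt(np.sum(d1**2) * np.sum(d2**2))
    if denom == 0:
        # A user with constant ratings has no variance; treat similarity as 0 (assumption)
        return 0.0
    return float(np.sum(d1 * d2) / denom)

# Toy ratings over four peaks: 1 = hiked, -1 = not hiked
user_a = [1, 1, -1, -1]
user_b = [1, 1, -1, 1]
user_c = [-1, -1, 1, 1]

print(pearson_similarity(user_a, user_b))  # overlapping lists give a positive similarity
print(pearson_similarity(user_a, user_c))  # opposite lists give -1
```

For non-constant vectors this should agree with the off-diagonal entry of `np.corrcoef(x1, x2)`.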

I am trying two different methods for representing the ratings $x_{n,k}$.  In the first, each $x_{n,k}$ is either -1 (user did not hike the peak) or 1 (user did hike the peak).  One issue here, particularly with the 14ers, is that a handful of Front Range peaks are far more popular than all the others (Grays, Torreys, Quandary, and Bierstadt).  As a result, nearly all users share these peaks in common, and they come up as top recommendations for everyone, regardless of the more distinctive hikes that a user has done.

This issue could be handled by simply looking further down the recommendation list, past the top Front Range peaks. Another approach is to re-weight the ratings by overall peak popularity, so that user $n$'s rating for peak $k$ is:

$x_{n,k} = \frac{N_{k,n}}{\sum_{u_m} N_{k,m}}$, where $N_{k,n}$ is the total number of times that user $n$ logged a summit of peak $k$.
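The popularity re-weighting can be sketched on a toy count matrix as follows. The array `N` and its values are invented for illustration, and guarding against peaks with zero total logs is an assumption on my part rather than something the formula specifies.

```python
import numpy as np

# Toy summit counts N[n, k]: rows are users n, columns are peaks k (made-up values)
N = np.array([[1, 0, 2],
              [1, 1, 0],
              [2, 0, 0],
              [1, 0, 0]], dtype=float)

# Total logged summits per peak, summed over all users
peak_totals = N.sum(axis=0)

# Avoid dividing by zero for a peak nobody has logged (assumption)
safe_totals = np.where(peak_totals == 0, 1.0, peak_totals)

# x[n, k] = N[n, k] / sum_m N[m, k]
x = N / safe_totals
print(x)
```

In this toy case the first peak is logged by everyone, so each single log of it contributes only 0.2, while the lone log of the second peak gets the full weight of 1.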

The recommendation score for peak $k$ for a given user $n$ is then calculated from the user similarities. The intuition is that user $n$'s average likelihood of hiking a peak ($\bar{x}_n$) is adjusted by a normalized, similarity-weighted sum of the mean-subtracted ratings of all users $m$:

$\hat{x}_{n,k} = \bar{x}_n + \frac{\sum_{u_m}s(u_n, u_m)(x_{m,k} - \bar{x}_m)}{\sum_{u_m}\left| s(u_n,u_m) \right|}$




###### Next,

In [1]:
import pandas as pd
import numpy as np
import random
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
import sklearn
#sns.reset_orig()
sns.set_style('white')
%matplotlib inline
pd.set_option('mode.chained_assignment',None)

## Import Checklist Data and Split into Train and Test

#### Import checklist data.  Also import user data to map user ID to user Name

In [2]:
fname_14erList = r'14erChecklistByUser_df'
fname_13erList = r'13erChecklistByUser_df'

df_13ers = pickle.load(open(fname_13erList,"rb"))
df_13ers = df_13ers[df_13ers['PeakName']!=""]
df_14ers = pickle.load(open(fname_14erList,"rb"))
df_14ers = df_14ers[df_14ers['PeakName']!=""]
df_14ers = df_14ers.sort_values('PeakName')
df_14ers['NumClimbs'] = df_14ers['NumClimbs'].apply(lambda x: -1 if x==0 else 1)
df_14ers.head(25)
#names = pd.Series(np.arange(len(names)), names)

fname_user = r'user_profile_data_FULL'
user_df = pickle.load(open(fname_user,"rb"))
user_df.head()


Unnamed: 0,UserId,PeakName,NumClimbs
69651,7398,BlancaPeak,1
137885,20364,BlancaPeak,1
2285,179,BlancaPeak,1
56466,5784,BlancaPeak,1
207894,46480,BlancaPeak,1
24385,2079,BlancaPeak,1
14644,1236,BlancaPeak,1
85890,10118,BlancaPeak,1
177026,33105,BlancaPeak,1
56533,5789,BlancaPeak,1


In [4]:
climb_counts = df_14ers.groupby('PeakName').agg('count').sort_values('NumClimbs')
print(climb_counts)
dropnames = climb_counts[climb_counts['NumClimbs']<200].index.tolist()
print(dropnames)
df_14ers = df_14ers[df_14ers['PeakName'].apply(lambda x: x not in dropnames)]
#df_14ers[df_14ers['NewUserId']==4].head(40)

                  UserId  NumClimbs
PeakName                           
EastLaPlata          245        245
NorthwestLindsey     335        335
MassiveGreen         434        434
SouthElbert          499        499
SouthMassive         532        532
NorthMassive         562        562
WestEvans            792        792
SouthBross          1133       1133
NorthEolus          1373       1373
ElDientePeak        1429       1429
MtWilson            1499       1499
CulebraPeak         1540       1540
NorthMaroonPeak     1593       1593
LittleBearPeak      1604       1604
WilsonPeak          1725       1725
CapitolPeak         1781       1781
MtEolus             1795       1795
MaroonPeak          1800       1800
SnowmassMountain    1829       1829
PyramidPeak         1886       1886
SunlightPeak        1996       1996
WindomPeak          2046       2046
EllingwoodPoint     2204       2204
CrestonePeak        2264       2264
ConundrumPeak       2267       2267
CrestoneNeedle      2462    
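The low-count filter in the cell above can also be written with `isin`, which avoids a Python-level `apply` over every row. This is only a sketch on a toy frame; the threshold `min_logs = 2` stands in for the 200 used in the notebook, and the toy data are invented.

```python
import pandas as pd

# Toy checklist: one row per (user, peak) log (made-up values)
df = pd.DataFrame({
    'UserId':   [1, 2, 3, 1, 2, 1],
    'PeakName': ['A', 'A', 'A', 'B', 'B', 'C'],
    'NumClimbs': [1, 1, 1, 1, 1, 1],
})

min_logs = 2  # stand-in for the threshold of 200 used above

# Number of logged rows per peak
counts = df.groupby('PeakName')['UserId'].count()

# Keep only peaks with at least min_logs rows
keep = counts[counts >= min_logs].index
filtered = df[df['PeakName'].isin(keep)]
print(filtered)
```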

In [5]:
n_14er_users = df_14ers.UserId.nunique()
n_14ers = df_14ers.PeakName.nunique()
print(n_14er_users)
print(n_14ers)
index_ind=pd.DataFrame(data={'PeakId':np.arange(len(df_14ers.PeakName.unique())),'PeakName':df_14ers.PeakName.unique()})
index_ind.head(12)

15207
65


Unnamed: 0,PeakId,PeakName
0,0,BlancaPeak
1,1,CapitolPeak
2,2,CastlePeak
3,3,ChallengerPoint
4,4,ConundrumPeak
5,5,CrestoneNeedle
6,6,CrestonePeak
7,7,CulebraPeak
8,8,EastLaPlata
9,9,ElDientePeak


In [6]:
df_14ers = df_14ers.merge(index_ind)
df_14ers.head()

Unnamed: 0,UserId,PeakName,NumClimbs,PeakId
0,7398,BlancaPeak,1,0
1,20364,BlancaPeak,1,0
2,179,BlancaPeak,1,0
3,5784,BlancaPeak,1,0
4,46480,BlancaPeak,1,0


In [7]:
## We should also redo the user ids since they span the length of the full dataframe, and we
## have only taken the subset of the users that have logged a 14er hike
index_id=pd.DataFrame(data={'NewUserId':np.arange(len(df_14ers.UserId.unique())),'UserId':df_14ers.UserId.unique()})
index_id.head()
df_14ers = df_14ers.merge(index_id)
df_14ers.head()

Unnamed: 0,UserId,PeakName,NumClimbs,PeakId,NewUserId
0,7398,BlancaPeak,1,0,0
1,7398,CapitolPeak,1,1,0
2,7398,CastlePeak,1,2,0
3,7398,ChallengerPoint,1,3,0
4,7398,ConundrumPeak,1,4,0
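An alternative way to build contiguous ids, shown here only as a sketch, is `pd.factorize`, which assigns integer codes in order of first appearance. The toy frame below is invented for illustration; unlike the merge above, this does not reorder rows.

```python
import pandas as pd

# Toy checklist rows (made-up values)
df = pd.DataFrame({
    'UserId':   [7398, 179, 7398, 5784],
    'PeakName': ['BlancaPeak', 'BlancaPeak', 'CapitolPeak', 'BlancaPeak'],
})

# codes[i] is the new contiguous id for row i; uniques maps new id -> original UserId
codes, uniques = pd.factorize(df['UserId'])
df['NewUserId'] = codes
print(df)
print(list(uniques))
```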


In [8]:
from sklearn.model_selection import train_test_split
# random.seed() does not control scikit-learn's shuffling, so pass the seed through random_state
train_data14ers, test_data14ers = train_test_split(df_14ers, test_size=0.25, random_state=1000)

## Build the similarity rating between users based only on their common ratings, and then create a peak prediction for each user 

In [9]:
#Create two user-item matrices, one for training and another for testing
#Entries default to -1 (peak not hiked)
train_data_matrix = -1*np.ones((n_14er_users, n_14ers))
for line in train_data14ers.itertuples():
    #itertuples() yields (Index, UserId, PeakName, NumClimbs, PeakId, NewUserId)
    train_data_matrix[line[5], line[4]] = line[3]  
#Users with no hiked peaks in the training split get an all-zero row
for n in range(n_14er_users):
    if all(train_data_matrix[n,:]==-1):
        train_data_matrix[n,:]=0
test_data_matrix = -1*np.ones((n_14er_users, n_14ers))
for line in test_data14ers.itertuples():
    test_data_matrix[line[5], line[4]] = line[3]
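The same matrix fill can be done without an explicit Python loop by using NumPy integer-array indexing. The following is a sketch on a toy frame; the column names match the notebook, but the data and the sizes `n_users` and `n_peaks` are invented.

```python
import numpy as np
import pandas as pd

# Toy training rows (made-up values): user 1 has no training data
train = pd.DataFrame({
    'NewUserId': [0, 0, 2],
    'PeakId':    [0, 1, 1],
    'NumClimbs': [1, 1, 1],
})
n_users, n_peaks = 3, 2

# Start from -1 everywhere (peak not hiked)
matrix = -1 * np.ones((n_users, n_peaks))

# Assign all logged (user, peak) entries in one step
matrix[train['NewUserId'].to_numpy(), train['PeakId'].to_numpy()] = train['NumClimbs'].to_numpy()

# Rows that are still entirely -1 are set to 0, as in the loop above
no_data = (matrix == -1).all(axis=1)
matrix[no_data, :] = 0
print(matrix)
```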

In [26]:
print(np.shape(train_data_matrix))
means = np.mean(train_data_matrix,axis=0)
print(len(means))
np.max(means)
#np.min(means)

(15207, 65)
65


0.08976129414085618

##### Pearson Metric 

$s(u_i,u_j)=\frac{\sum_k (x_{i,k}-\bar{x_i})(x_{j,k}-\bar{x_j})}{\sqrt{\sum_k (x_{i,k}-\bar{x_i})^2 \sum_k (x_{j,k}-\bar{x_j})^2}}$

Here $(i,j)$ are user indices and $k$ is a peak index.

In [27]:
def pearson_correlation(train_data_matrix):
    epsilon = 1e-5
    #train_data_matrix has dimensions of (nusers,n14ers)
    nusers = np.shape(train_data_matrix)[0]
    pcorr_matrix = np.zeros((nusers,nusers))
    user_means = np.mean(train_data_matrix,axis=1)*1/.75 ##rescale because only 75% of the matrix is populated with training data
    #print(user_means)
    #print(train_data_matrix)
    user_means = np.reshape(user_means,(nusers,1))
    mean_subtract = train_data_matrix - user_means
    denom_sum_sq = (np.sum((mean_subtract)**2,axis=1))**0.5
    denom_sum_sq = [elem if elem!=0 else epsilon for elem in denom_sum_sq]
    #print(mean_subtract)
    for i in range(nusers):
        for j in range(nusers):
            pcorr_matrix[i,j] = np.dot(mean_subtract[i,:],mean_subtract[j,:])/(denom_sum_sq[i]*denom_sum_sq[j])
    return pcorr_matrix
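The double loop above performs one dot product per user pair, which is slow for roughly 15,000 users. A vectorized sketch that should give the same matrix is shown below. The function name `pearson_correlation_vectorized` and the `mean_scale` parameter are my own labels; `mean_scale=1/0.75` reproduces the rescaling used in the cell above. Note that the result is still a dense (nusers, nusers) array, so memory use is unchanged.

```python
import numpy as np

def pearson_correlation_vectorized(data, epsilon=1e-5, mean_scale=1/0.75):
    """Vectorized equivalent of the loop-based Pearson similarity (sketch)."""
    data = np.asarray(data, dtype=float)
    # Per-user mean, rescaled the same way as in the loop version
    user_means = data.mean(axis=1, keepdims=True) * mean_scale
    centered = data - user_means
    # Per-user norm of the mean-subtracted ratings; zeros replaced by epsilon
    norms = np.sqrt((centered**2).sum(axis=1))
    norms = np.where(norms == 0, epsilon, norms)
    # Normalize each row, then all pairwise dot products in one matrix multiply
    unit = centered / norms[:, np.newaxis]
    return unit @ unit.T

# Small random check matrix (made-up data)
rng = np.random.default_rng(0)
toy = rng.choice([-1.0, 1.0], size=(6, 5))
sim = pearson_correlation_vectorized(toy)
print(sim.shape)
```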

In [62]:
def cosine_correlation(train_data_matrix):
    epsilon = 1e-5
    #train_data_matrix has dimensions of (nusers,n14ers)
    nusers = np.shape(train_data_matrix)[0]
    cos_matrix = np.zeros((nusers,nusers))
    #cosine similarity uses the raw ratings, with no mean subtraction
    row_norms = (np.sum(train_data_matrix**2,axis=1))**0.5
    row_norms = [elem if elem!=0 else epsilon for elem in row_norms]
    for i in range(nusers):
        for j in range(nusers):
            cos_matrix[i,j] = np.dot(train_data_matrix[i,:],train_data_matrix[j,:])/(row_norms[i]*row_norms[j])
    return cos_matrix

In [69]:
#from sklearn.metrics.pairwise import pairwise_distances
#### Run if needed otherwise load below
user_similarity = pearson_correlation(train_data_matrix)
pickle.dump(user_similarity,open(r'user_similarities_14er_pearson',"wb"))

In [10]:
user_similarity=pickle.load(open(r'user_similarities_14er_pearson',"rb"))

In [68]:
user_similarity_cos = user_similarity#cosine_correlation(train_data_matrix)

In [11]:
print(user_similarity[5000,-4000:-3900])

[ 0.25762622  0.22289028  0.05453595  0.31557175  0.17308552  0.02683448
  0.15684004  0.22437477  0.19358468  0.25762622  0.17308552  0.15684004
  0.17308552  0.02298496  0.31557175  0.10903558  0.2868456   0.47300505
  0.18758039  0.14261112  0.14647674  0.12951055  0.00515491  0.36797046
  0.31557175  0.32436385  0.19358468  0.38131288  0.02298496  0.34394594
  0.12951055  0.08216028  0.02298496  0.          0.25914893  0.10905521
  0.0778101   0.19358468  0.15033707 -0.06728665  0.14647674  0.0778101
  0.15684004  0.0719158   0.08216028  0.14647674  0.18758039  0.04152521
  0.08216028  0.32436385  0.0719158   0.12951055  0.38131288  0.15684004
  0.08216028  0.2868456   0.26456356  0.31557175  0.25762622  0.22437477
  0.12906937  0.17308552  0.12951055  0.17308552  0.10480361  0.10905521
  0.02683448  0.26456356  0.17308552  0.10905521  0.10480361  0.00515491
 -0.13039953  0.19358468  0.10903558  0.00515491  0.00536095  0.00536095
  0.36797046  0.08216028  0.2868456   0.12906937  0.

Spot checks to make sure that the 14er lists of user pairs rated most similar are actually close

In [13]:
sim_to_10000=user_similarity[10000,:].argsort() 
sim_to_5600=user_similarity[5600,:].argsort()
print(user_similarity[10000,sim_to_5600[-2]])
print(user_similarity[5000,sim_to_10000])
print('Check User 10000')
print(df_14ers[df_14ers['NewUserId']==sim_to_10000[-2]])
print('\n')
print(df_14ers[df_14ers['NewUserId']==10000])
print('\n\n')
print('Check User 5600')
print(df_14ers[df_14ers['NewUserId']==sim_to_5600[-2]])
print('\n')
print(df_14ers[df_14ers['NewUserId']==5600])

0.4717069078533
[-0.02298496 -0.00643337 -0.11664799 ...  0.31557175  0.31557175
  0.26456356]
Check User 10000
       UserId     PeakName  NumClimbs  PeakId  NewUserId
193151  12890    GraysPeak          1      11       7521
193152  12890  MtBierstadt          1      24       7521
193153  12890     MtElbert          1      29       7521


       UserId          PeakName  NumClimbs  PeakId  NewUserId
220227  74031         GraysPeak          1      11      10000
220228  74031  MissouriMountain          1      21      10000
220229  74031       MtBierstadt          1      24      10000
220230  74031          MtElbert          1      29      10000
220231  74031       MtPrinceton          1      36      10000
220232  74031       TorreysPeak          1      59      10000



Check User 5600
       UserId      PeakName  NumClimbs  PeakId  NewUserId
177446  52612     GraysPeak          1      11       6035
177447  52612     HuronPeak          1      14       6035
177448  52612   LaPlataPeak    

In [14]:
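def predict(ratings, similarity):
    #ratings: (nusers, npeaks) matrix; similarity: (nusers, nusers) weight matrix
    mean_user_rating = ratings.mean(axis=1)
    ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
    #user mean plus the weighted sum of mean-subtracted ratings, normalized by the sum of |weights|
    pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    return pred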
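To see what `predict` computes, here is a small self-contained sketch on a toy 3-user, 4-peak matrix. The function body is copied from the cell above so the block runs on its own; the toy ratings are invented, and `np.corrcoef` is used here as a stand-in for the notebook's own similarity function.

```python
import numpy as np

def predict(ratings, similarity):
    # Copy of the notebook's predict, repeated so this block is self-contained
    mean_user_rating = ratings.mean(axis=1)
    ratings_diff = ratings - mean_user_rating[:, np.newaxis]
    pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    return pred

# Toy ratings: rows are users, columns are peaks (1 = hiked, -1 = not hiked)
ratings = np.array([[ 1.0,  1.0, -1.0, -1.0],
                    [ 1.0, -1.0, -1.0,  1.0],
                    [-1.0, -1.0,  1.0,  1.0]])

# Row-wise Pearson correlation between users (stand-in similarity)
similarity = np.corrcoef(ratings)

pred = predict(ratings, similarity)
print(pred)
```

Users 0 and 2 have exactly opposite lists, so user 2's ratings push user 0's predictions toward user 0's own choices, while user 1 is uncorrelated with user 0 and contributes nothing.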

In [16]:
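#Note: this passes 1 - similarity (a dissimilarity) as the weights, whereas the formula above weights by the similarity itself
user_prediction = predict(train_data_matrix, 1-user_similarity)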

[[ 0.64615385  0.64615385  0.64615385 ...  0.64615385  0.64615385
   0.64615385]
 [-0.95384615 -0.95384615 -0.95384615 ... -0.95384615 -0.95384615
  -0.95384615]
 [-1.29230769  0.70769231  0.70769231 ...  0.70769231  0.70769231
  -1.29230769]
 ...
 [-0.03076923 -0.03076923 -0.03076923 ... -0.03076923 -0.03076923
   1.96923077]
 [-0.03076923 -0.03076923 -0.03076923 ... -0.03076923 -0.03076923
   1.96923077]
 [-0.03076923 -0.03076923 -0.03076923 ... -0.03076923 -0.03076923
   1.96923077]]


In [35]:
#print(user_prediction)
np.shape(user_prediction)
common_peaks = ['GraysPeak', 'MtBierstadt', 'QuandaryPeak', 'TorreysPeak', 'MtDemocrat', 'MtElbert', 'MtBross', 'MtCameron', 'MtEvans', 'MtSherman', 'PikesPeak']
n=10
user_prediction=user_prediction.astype('float')

args=((-user_prediction[5000,:]).argsort())
hiked_list_5000 = (df_14ers[df_14ers['NewUserId']==5000])['PeakName'].tolist()
print('User 5000 Hiked:')
print(hiked_list_5000)
print('\n\n')
rec_list_5000 = index_ind['PeakName'][args].tolist()
new_rec_list_5000 = [elem for elem in rec_list_5000 if elem not in hiked_list_5000 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 5000:')
print(new_rec_list_5000[0:n])
args=((-user_prediction[5600,:]).argsort())
hiked_list_5600 = (df_14ers[df_14ers['NewUserId']==5600])['PeakName'].tolist()
print('\n\n')

print('User 5600 Hiked:')
print(hiked_list_5600)
rec_list_5600 = index_ind['PeakName'][args].tolist()
new_rec_list_5600 = [elem for elem in rec_list_5600 if elem not in hiked_list_5600 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 5600:')
print(new_rec_list_5600[0:n])

User 5000 Hiked:
['ChallengerPoint', 'CrestoneNeedle', 'HuronPeak', 'KitCarsonPeak', 'LaPlataPeak', 'MtBierstadt', 'MtHarvard', 'MtLindsey', 'MtMassive']






Recommendation for User 5000:
['LongsPeak', 'MtBelford', 'MtYale', 'MtShavano', 'MtPrinceton', 'MtOxford', 'MtoftheHolyCross', 'MissouriMountain', 'TabeguachePeak', 'HandiesPeak']



User 5600 Hiked:
['ElDientePeak', 'GraysPeak', 'LaPlataPeak', 'LongsPeak', 'MtAntero', 'MtBelford', 'MtBierstadt', 'MtBross', 'MtCameron', 'MtDemocrat', 'MtElbert', 'MtEvans', 'MtMassive', 'MtOxford', 'MtSherman', 'MtWilson', 'MtoftheHolyCross', 'QuandaryPeak', 'RedcloudPeak', 'TorreysPeak', 'WetterhornPeak']



Recommendation for User 5600:
['HuronPeak', 'MtYale', 'MtShavano', 'MtPrinceton', 'MtHarvard', 'MissouriMountain', 'TabeguachePeak', 'HandiesPeak', 'HumboldtPeak', 'MtColumbia']
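The per-user recommendation steps above (sort by predicted score, drop peaks already hiked, drop the common peaks, take the top `n`) are repeated for each user, so they could be wrapped in a helper. The function `top_n_recommendations` below is a hypothetical sketch, and the toy scores and peak names are invented.

```python
import numpy as np

def top_n_recommendations(scores, peak_names, hiked, exclude=(), n=10):
    """Return up to n peak names, highest score first, skipping hiked and excluded peaks."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # indices from highest to lowest score
    skip = set(hiked) | set(exclude)
    recs = [peak_names[i] for i in order if peak_names[i] not in skip]
    return recs[:n]

# Toy example (made-up scores and names)
scores = [0.1, 0.9, 0.5, 0.7]
peak_names = ['A', 'B', 'C', 'D']
recs = top_n_recommendations(scores, peak_names, hiked=['B'], exclude=['D'], n=2)
print(recs)
```

With the notebook's variables, a call might look like `top_n_recommendations(user_prediction[5000, :], index_ind['PeakName'].tolist(), hiked_list_5000, common_peaks, n)`.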


# Final 14er Recommender Based on Collaborative Filtering and Ratings *Not* Scaled By Peak Frequency

In [None]:
def recom_14er_noScale():
    raise NotImplementedError  #stub: the recommender function has not been written yet

## Test Solution

In [76]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

In [77]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))


User-based CF RMSE: 0.6166488502620027
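One caveat on the number above: the test matrix is filled with -1 rather than 0, so `ground_truth.nonzero()` selects every entry, including user-peak pairs that were never in the test split. If the intent is to score only held-out entries, one option is to pass an explicit boolean mask. The helper `rmse_masked` below is a hypothetical sketch with invented toy data, not the metric used for the result above.

```python
import numpy as np

def rmse_masked(prediction, ground_truth, mask):
    """RMSE over only the entries where mask is True (hypothetical helper)."""
    diff = prediction[mask] - ground_truth[mask]
    return float(np.sqrt(np.mean(diff**2)))

# Toy predictions and ground truth (made-up values)
pred = np.array([[0.5, -0.5],
                 [1.0,  0.0]])
truth = np.array([[1.0, -1.0],
                  [1.0, -1.0]])

# True where the (user, peak) pair actually belongs to the test split
mask = np.array([[True, False],
                 [True, True]])

print(rmse_masked(pred, truth, mask))
```

With the notebook's data, such a mask could be built by marking the `(NewUserId, PeakId)` pairs that appear in `test_data14ers`.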


### Workbook

In [15]:
mat1 = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
#mat2 = np.array([[1],2,3,4]])
print(mat1)


[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [17]:
np.sum(mat1,axis=0)

array([22, 26, 30])