# Overall Project Description

###### This project generates 14er and 13er hike recommendations for a given user, based initially only on the checklist of other hikes that the user has done.  The calculation uses the Pearson coefficient to measure similarity between users; that is, the similarity between users 1 and 2 is given by  
$s(u_1,u_2)=\frac{\sum_k (x_{1,k}-\bar{x}_1)(x_{2,k}-\bar{x}_2)}{\sqrt{\sum_k (x_{1,k}-\bar{x}_1)^2 \sum_k (x_{2,k}-\bar{x}_2)^2}}$

where $x_{n,k}$ is the rating that a user $n$ gives to a summit of peak $k$, and $\bar{x}_n$ is the average rating over all peaks given by user $n$.  
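
As a quick sanity check of the similarity formula, the sketch below evaluates $s(u_1,u_2)$ for two made-up rating vectors. The helper name `pearson_similarity` and the ratings are illustrative assumptions and are not part of the project code.

```python
import numpy as np

def pearson_similarity(x1, x2):
    """Pearson coefficient between two users' rating vectors (hypothetical helper)."""
    d1 = x1 - x1.mean()   # mean-subtracted ratings of user 1
    d2 = x2 - x2.mean()   # mean-subtracted ratings of user 2
    return np.sum(d1 * d2) / np.sqrt(np.sum(d1**2) * np.sum(d2**2))

# Made-up ratings over five peaks: 1 = hiked, -1 = not hiked
u1 = np.array([1, 1, -1, -1, 1], dtype=float)
u2 = np.array([1, -1, -1, -1, 1], dtype=float)

s12 = pearson_similarity(u1, u2)
print(s12)
```

The result should agree with `np.corrcoef(u1, u2)[0, 1]`, which computes the same coefficient.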

I am trying two different methods for representing the ratings $x_{n,k}$.  In the first, each $x_{n,k}$ is either -1 (user did not hike the peak) or 1 (user did hike the peak).  One issue here, particularly with the 14ers, is that a handful of Front Range peaks are far more popular than all the others (Grays, Torreys, Quandary, and Bierstadt).  As a result, nearly all users share these peaks, and they come out as top recommendations for everyone regardless of the more distinctive hikes that a user has done.

This issue could be handled by simply looking further down the recommendation list than the top front range peaks, but another approach would be to re-weight the ratings based on net peak popularity: the user's rating for peak $k$ is given as:

$x_{n,k} = \frac{N_{n,k}}{\sum_{m} N_{m,k}}$, where $N_{n,k}$ is the total number of times that user $n$ logged a summit of peak $k$.
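
A minimal sketch of this popularity re-weighting, assuming a small made-up count matrix (the array `N` and its values are hypothetical, with rows as users and columns as peaks):

```python
import numpy as np

# Made-up summit counts: N[n, k] = times user n logged a summit of peak k
N = np.array([[3, 0, 1],
              [1, 2, 0],
              [4, 0, 0]], dtype=float)

peak_totals = N.sum(axis=0)   # total logged summits of each peak over all users
x = N / peak_totals           # each user's share of the summits logged for each peak
print(x)
```

Each column of `x` sums to 1, so in this toy data a single summit of the rarely climbed third peak (rating 1.0) outweighs four summits of the popular first peak (rating 0.5). A peak with no logged summits at all would cause a division by zero and would need to be dropped first.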

The recommendation for peak $k$ for a given user $n$ is then calculated from the user similarities. The intuition is that the predicted likelihood of hiking peak $k$ equals user $n$'s average likelihood of hiking a peak ($\bar{x}_n$) plus a normalized, similarity-weighted sum of the mean-subtracted ratings of all users $m$:

$\hat{x}_{n,k} = \bar{x}_n + \frac{\sum_{m}s(u_n, u_m)(x_{m,k} - \bar{x}_m)}{\sum_{m}\left| s(u_n,u_m) \right|}$
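
The prediction step can be sketched on a toy matrix as follows. The ratings are made up, `np.corrcoef` stands in for the Pearson similarity between user rows, and normalizing by the sum of absolute similarities is an assumption of this sketch.

```python
import numpy as np

# Made-up ratings: 3 users x 4 peaks (1 = hiked, -1 = not hiked)
ratings = np.array([[ 1,  1, -1, -1],
                    [ 1,  1,  1, -1],
                    [-1, -1,  1,  1]], dtype=float)

similarity = np.corrcoef(ratings)                  # Pearson similarity between user rows
user_means = ratings.mean(axis=1, keepdims=True)   # average rating of each user, shape (3, 1)
centered = ratings - user_means                    # mean-subtracted ratings

# Similarity-weighted sum of mean-subtracted ratings over all users,
# normalized by the summed absolute similarities, added to each user's mean
pred = user_means + similarity.dot(centered) / np.abs(similarity).sum(axis=1, keepdims=True)
print(pred)
```

Note that the sum here includes the user's own row (similarity 1 with itself), so a user's existing ratings pull the prediction toward what they have already logged.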




###### Next,

In [98]:
import pandas as pd
import numpy as np
import random
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
import sklearn
#sns.reset_orig()
sns.set_style('white')
%matplotlib inline
pd.set_option('mode.chained_assignment',None)

## Import Checklist Data and Split into Train and Test

In [99]:
fname_14erList = r'14erChecklistByUser_df'
fname_13erList = r'13erChecklistByUser_df'

df_13ers = pickle.load(open(fname_13erList,"rb"))
df_13ers = df_13ers[df_13ers['PeakName']!=""]
df_14ers = pickle.load(open(fname_14erList,"rb"))
df_14ers = df_14ers[df_14ers['PeakName']!=""]
df_14ers = df_14ers.sort_values('PeakName')
df_14ers['NumClimbs'] = df_14ers['NumClimbs'].apply(lambda x: -1 if x==0 else 1)
df_14ers.head(25)
#names = pd.Series(np.arange(len(names)), names)


        UserId    PeakName  NumClimbs
69651     7398  BlancaPeak          1
137885   20364  BlancaPeak          1
2285       179  BlancaPeak          1
56466     5784  BlancaPeak          1
207894   46480  BlancaPeak          1
24385     2079  BlancaPeak          1
14644     1236  BlancaPeak          1
85890    10118  BlancaPeak          1
177026   33105  BlancaPeak          1
56533     5789  BlancaPeak          1


In [100]:
## Count the checklist rows for each peak and drop peaks with fewer than 200 entries
climb_counts = df_14ers.groupby('PeakName').agg('count').sort_values('NumClimbs')
print(climb_counts)
dropnames = climb_counts[climb_counts['NumClimbs']<200].index.tolist()
print(dropnames)
df_14ers = df_14ers[~df_14ers['PeakName'].isin(dropnames)]
#df_14ers[df_14ers['NewUserId']==4].head(40)

                   UserId  NumClimbs
PeakName                            
SunlightSpire          25         25
SouthWilson            39         39
NortheastCrestone      67         67
SouthLittleBear        96         96
NorthSnowmass         125        125
WestWilson            126        126
SoutheastLongs        167        167
EastCrestone          189        189
EastLaPlata           245        245
NorthwestLindsey      335        335
MassiveGreen          434        434
SouthElbert           499        499
SouthMassive          532        532
NorthMassive          562        562
WestEvans             792        792
SouthBross           1133       1133
NorthEolus           1373       1373
ElDientePeak         1429       1429
MtWilson             1499       1499
CulebraPeak          1540       1540
NorthMaroonPeak      1593       1593
LittleBearPeak       1604       1604
WilsonPeak           1725       1725
CapitolPeak          1781       1781
MtEolus              1795       1795
M

In [85]:
print(-1*5000.0+1*10000)

5000.0


In [102]:
n_14er_users = df_14ers.UserId.nunique()
n_14ers = df_14ers.PeakName.nunique()
print(n_14er_users)
print(n_14ers)
index_ind=pd.DataFrame(data={'PeakId':np.arange(len(df_14ers.PeakName.unique())),'PeakName':df_14ers.PeakName.unique()})
index_ind.head(12)

15207
65


   PeakId         PeakName
0       0       BlancaPeak
1       1      CapitolPeak
2       2       CastlePeak
3       3  ChallengerPoint
4       4    ConundrumPeak
5       5   CrestoneNeedle
6       6     CrestonePeak
7       7      CulebraPeak
8       8      EastLaPlata
9       9     ElDientePeak


In [109]:
df_14ers = df_14ers.merge(index_ind)
df_14ers.head()

   UserId    PeakName  NumClimbs  PeakId  NewUserId
0    7398  BlancaPeak          1       0          0
1   20364  BlancaPeak          1       0          1
2     179  BlancaPeak          1       0          2
3    5784  BlancaPeak          1       0          3
4   46480  BlancaPeak          1       0          4


In [104]:
## We should also redo the user ids since they span the length of the full dataframe, and we
## have only taken the subset of the users that have logged a 14er hike
index_id=pd.DataFrame(data={'NewUserId':np.arange(len(df_14ers.UserId.unique())),'UserId':df_14ers.UserId.unique()})
index_id.head()
df_14ers = df_14ers.merge(index_id)
df_14ers.head()

   UserId         PeakName  NumClimbs  PeakId  NewUserId
0    7398       BlancaPeak          1       0          0
1    7398      CapitolPeak          1       1          0
2    7398       CastlePeak          1       2          0
3    7398  ChallengerPoint          1       3          0
4    7398    ConundrumPeak          1       4          0


In [105]:
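from sklearn.model_selection import train_test_split
## random.seed() does not control train_test_split, so pass the seed as random_state for a reproducible split
train_data14ers, test_data14ers = train_test_split(df_14ers, test_size=0.25, random_state=1000)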

## Build the similarity rating between users based only on their common ratings, and then create a peak prediction for each user 

In [124]:
#Create two user-item matrices, one for training and another for testing.
#Entries start at -1 (peak not hiked) and are overwritten with NumClimbs where a checklist row exists.
train_data_matrix = -1*np.ones((n_14er_users, n_14ers))
for line in train_data14ers.itertuples():
    train_data_matrix[line.NewUserId, line.PeakId] = line.NumClimbs
#Users left with no hiked peaks in the training split get an all-zero row; m counts them
m=0
for n in range(n_14er_users):
    if all(train_data_matrix[n,:]==-1):
        m=m+1
        train_data_matrix[n,:]=0
print(m)
test_data_matrix = -1*np.ones((n_14er_users, n_14ers))
for line in test_data14ers.itertuples():
    test_data_matrix[line.NewUserId, line.PeakId] = line.NumClimbs

379
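
The row-by-row loops above can also be written with NumPy indexing. The sketch below is one possible vectorized version on made-up data; the three small arrays are hypothetical stand-ins for the `NewUserId`, `PeakId`, and `NumClimbs` columns.

```python
import numpy as np

# Made-up stand-ins for the NewUserId, PeakId and NumClimbs columns
user_ids   = np.array([0, 0, 1, 2])
peak_ids   = np.array([0, 2, 1, 2])
num_climbs = np.array([1, 1, 1, 1])
n_users, n_peaks = 4, 3   # user 3 has no checklist rows at all

matrix = -1 * np.ones((n_users, n_peaks))
# Integer-array indexing assigns every (user, peak) entry in one step
matrix[user_ids, peak_ids] = num_climbs

# Zero out users whose row is entirely -1 (nothing hiked in this split)
empty_rows = (matrix == -1).all(axis=1)
matrix[empty_rows, :] = 0
print(int(empty_rows.sum()))
```

This relies on each (user, peak) pair appearing at most once. With duplicates, only one of the assigned values would be kept.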


In [126]:
print(15000*.25)
np.sum(train_data_matrix[:,11])
(train_data_matrix[1,:])=0
print(train_data_matrix)

3750.0
[[ 1.  1.  1. ...  1.  1.  1.]
 [ 0.  0.  0. ...  0.  0.  0.]
 [-1.  1.  1. ...  1.  1.  1.]
 ...
 [ 0.  0.  0. ...  0.  0.  0.]
 [-1. -1. -1. ... -1. -1.  1.]
 [-1. -1. -1. ... -1. -1.  1.]]


In [110]:
print(np.shape(train_data_matrix))
means = np.mean(train_data_matrix,axis=0)
print(len(means))
np.max(means)
#np.min(means)

(15207, 65)
65


0.08285657920694417

$s(u_1,u_2)=\frac{\sum_k (x_{1,k}-\bar{x}_1)(x_{2,k}-\bar{x}_2)}{\sqrt{\sum_k (x_{1,k}-\bar{x}_1)^2 \sum_k (x_{2,k}-\bar{x}_2)^2}}$


In [132]:
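def pearson_correlation(train_data_matrix):
    ##rescale because only 75% of the matrix is populated with training data
    user_means = np.mean(train_data_matrix,axis=1,keepdims=True)*1/.75
    mean_subtract = train_data_matrix - user_means ##(n_users,1) means broadcast across each user's row
    norms = np.sqrt(np.sum(mean_subtract**2,axis=1,keepdims=True))
    norms[norms==0] = 1.0 ##all-zero rows (users with no training peaks) would otherwise divide by zero
    unit_rows = mean_subtract/norms
    return 1 - unit_rows.dot(unit_rows.T) ##returned as a distance, 1 - s, because the later cells pass 1-user_similarity to predict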

In [78]:
from sklearn.metrics.pairwise import pairwise_distances
user_similarity = pearson_correlation(train_data_matrix)

ValueError: Unknown metric pearson. Valid metrics are ['euclidean', 'l2', 'l1', 'manhattan', 'cityblock', 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'cosine', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'matching', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule', 'wminkowski'], or 'precomputed', or a callable

Spot checks to make sure that the 14er lists of user pairs rated most similar are actually close

In [68]:
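## user_similarity holds distances (the prediction cell passes 1-user_similarity as the similarity),
## so argsort() orders each user's neighbours from most to least similar; index 0 is the user itself
sim_to_5000=user_similarity[5000,:].argsort()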
sim_to_5600=user_similarity[5600,:].argsort()
print(user_similarity[5600,sim_to_5600])
print(user_similarity[5000,sim_to_5000])
print('Check User 5000')
print(df_14ers[df_14ers['NewUserId']==4748])
print('\n')
print(df_14ers[df_14ers['NewUserId']==5000])
print('\n\n')
print('Check User 5600')
print(df_14ers[df_14ers['NewUserId']==sim_to_5600[1]])
print('\n')
print(df_14ers[df_14ers['NewUserId']==5600])

[0.         0.18461538 0.21538462 ... 1.44615385 1.44615385 1.47692308]
[0.         0.15384615 0.15384615 ... 1.53846154 1.53846154 1.56923077]
Check User 5000
       UserId          PeakName  NumClimbs  PeakId  NewUserId
156656  64987   ChallengerPoint          1       3       4748
156657  64987      HumboldtPeak          1      13       4748
156658  64987     KitCarsonPeak          1      15       4748
156659  64987  MissouriMountain          1      21       4748
156660  64987        MtColumbia          1      27       4748
156661  64987          MtElbert          1      29       4748
156662  64987         MtHarvard          1      32       4748
156663  64987         MtMassive          1      34       4748


       UserId         PeakName  NumClimbs  PeakId  NewUserId
161498  13587  ChallengerPoint          1       3       5000
161499  13587   CrestoneNeedle          1       5       5000
161500  13587        HuronPeak          1      14       5000
161501  13587    KitCarsonPeak      

In [69]:
def predict(ratings, similarity):
    mean_user_rating = ratings.mean(axis=1)
    #You use np.newaxis so that mean_user_rating has same format as ratings
    ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
    print(ratings_diff)
    pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    return pred

In [74]:
user_prediction = predict(train_data_matrix, 1-user_similarity)

[[-1.41538462  0.58461538  0.58461538 ...  0.58461538  0.58461538
   0.58461538]
 [ 1.07692308 -0.92307692 -0.92307692 ... -0.92307692 -0.92307692
  -0.92307692]
 [ 0.76923077  0.76923077  0.76923077 ...  0.76923077  0.76923077
  -1.23076923]
 ...
 [-0.03076923 -0.03076923 -0.03076923 ... -0.03076923 -0.03076923
   1.96923077]
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [-0.03076923 -0.03076923 -0.03076923 ... -0.03076923 -0.03076923
   1.96923077]]


In [75]:
#print(user_prediction)
np.shape(user_prediction)
#q = np.max(user_prediction,axis=0)
#print(np.shape(q))
n=10
user_prediction=user_prediction.astype('float')
#print((-user_prediction).argsort(axis=0)[:n, :])
### The predictions for user 5000, followed by those for user 5600
args=((-user_prediction[5000,:]).argsort())
print(user_prediction[5000,args])
print(df_14ers[df_14ers['NewUserId']==5000])
print('\n')
print(index_ind['PeakName'][args])
print('\n')
args=((-user_prediction[5600,:]).argsort())
print(user_prediction[5600,args])
print(df_14ers[df_14ers['NewUserId']==5600])
print('\n')
print(index_ind['PeakName'][args])

[-0.12422266 -0.13157922 -0.22315143 -0.223706   -0.36217535 -0.37780898
 -0.41147563 -0.44179598 -0.45828222 -0.54718343 -0.58077839 -0.59240931
 -0.64300736 -0.66176847 -0.67940913 -0.68111434 -0.70089029 -0.70940881
 -0.72543102 -0.76447739 -0.77216216 -0.77796064 -0.80492715 -0.81823687
 -0.81878767 -0.83829576 -0.84287569 -0.84469031 -0.8462748  -0.84945133
 -0.85923367 -0.8720039  -0.88862219 -0.89158745 -0.89618624 -0.90117738
 -0.90379556 -0.90424073 -0.91305351 -0.91731277 -0.92374128 -0.92639341
 -0.92934358 -0.93821295 -0.93853362 -0.94348704 -0.94386807 -0.94710119
 -0.94870077 -0.94890449 -0.95260541 -0.95355987 -0.95478974 -0.95502364
 -0.9557178  -0.9558687  -0.95627237 -0.95691748 -0.96141442 -0.96217648
 -0.96586608 -0.96731476 -0.96792592 -0.97137784 -0.97393189]
       UserId         PeakName  NumClimbs  PeakId  NewUserId
161498  13587  ChallengerPoint          1       3       5000
161499  13587   CrestoneNeedle          1       5       5000
161500  13587        Huro

## Test Solution

In [76]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    # Score only the entries where ground_truth is nonzero. Note that test_data_matrix is
    # initialised to -1, so as built above every entry is nonzero and all entries are scored.
    mask = ground_truth.nonzero()
    return sqrt(mean_squared_error(ground_truth[mask], prediction[mask]))

In [77]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))


User-based CF RMSE: 0.6166488502620027
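
Because peaks that were not hiked are coded as -1 rather than 0, a nonzero test cannot tell held-out entries apart from the rest. One option, sketched below under that assumption, is to carry an explicit boolean mask of the held-out (user, peak) pairs. The helper `rmse_masked`, the mask, and all values are hypothetical and are not what produced the score above.

```python
import numpy as np

def rmse_masked(prediction, ground_truth, mask):
    """RMSE over the entries where mask is True (hypothetical helper)."""
    diff = prediction[mask] - ground_truth[mask]
    return np.sqrt(np.mean(diff**2))

# Made-up example with 2 users and 3 peaks
ground_truth = np.array([[ 1.0, -1.0, 1.0],
                         [-1.0,  1.0, 1.0]])
prediction   = np.array([[ 0.5, -1.0, 0.0],
                         [-1.0,  0.0, 1.0]])
# True marks the (user, peak) pairs that were held out for testing
held_out = np.array([[True, False, True],
                     [False, True, False]])

score = rmse_masked(prediction, ground_truth, held_out)
print(score)
```

Only the three held-out entries contribute here, so the perfectly predicted entries outside the mask do not lower the error.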


### Workbook

In [131]:
mat1 = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
mat2 = np.array([[1,2,3,4]])
print(mat1-mat2)


ValueError: operands could not be broadcast together with shapes (4,3) (1,4) 
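
The error arises because a `(1,4)` array cannot be broadcast against the columns of a `(4,3)` matrix. To subtract one value per row, the values need shape `(4,1)`. A minimal sketch using the same `mat1`, with row means as the illustrative per-row values:

```python
import numpy as np

mat1 = np.array([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])   # shape (4, 3)
row_means = mat1.mean(axis=1)                            # shape (4,)

# Reshape the per-row values to (4, 1) so they broadcast across the 3 columns
centered = mat1 - row_means[:, np.newaxis]
# keepdims=True yields the (4, 1) shape directly
centered_alt = mat1 - mat1.mean(axis=1, keepdims=True)
print(centered)
```

Both forms give the same result. The same pattern applies when subtracting per-user means from a user-by-peak matrix.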