# Overall Project Description

These models generate Colorado 14er hike recommendations for a given user, first based simply on a standard user-based collaborative filter engine with user similarities calculated via the Pearson coefficient determined from the (user,mountain) matrix of hike checklists. 

Next, to address the "long-tail" issue and de-emphasize the importance of a small number of highly popular peaks hiked by many users, the similarity calculation will be updated to more heavily weight common experiences on less popular peaks.

Finally, the Pearson similarity metric will be appended in a weighted average with a similarity metric computed based on commonalities between user-provided profile information and hiking preferences.

RMSE metric is used to evaluate approaches.  However, some tradeoff between minimized RMSE and increased diversity of top recommendations is allowable in this context.

The same approach is repeated in a notebook for 13er hike recommendations.  There is a far wider scope of 13ers in Colorado, and more interesting utility to be gained from this latter recommendation engine.  The 14er recommender is constructed first to test model functionality with a pool of well-known peaks.




In [1]:
import pandas as pd
import numpy as np
import random
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
import sklearn
import json
#sns.reset_orig()
sns.set_style('white')
%matplotlib inline
pd.set_option('mode.chained_assignment',None)

import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode

import cufflinks
cufflinks.go_offline(connected=True)
init_notebook_mode(connected=True)

### Import Checklist Data and User Data

Import checklist data that was previously scraped from a popular hiking website as part of this project and stored in a dataframe format.  The web scraping class and methods are included in github as "forum_scraper.py".  The user profile data was also scraped with the same script; this data is also imported to map user ID to user Name, and merge dataframes (left join).

The "NumClimbs" field will hold the effective "rating" given by a user to a peak.  Currently, the ratings $x_k$ are simply binary representations of whether a user successfully summited a peak: -1 (user did not hike the peak) or 1 (user did hike the peak).  While some users log multiple climbs, using the actual number of climbs exacerbates the long tail problem since many users return repeatedly to the popular peaks simply out of convenience and proximity to cities.  This recommendation system is instead designed to provide unique new suggestions.

In [72]:
fname_14erList = r'14erChecklistByUser_df_winterfix'
fname_13erList = r'13erChecklistByUser_df_winterfix'

df_14ers = pickle.load(open(fname_14erList,"rb"))
df_14ers = df_14ers[df_14ers['PeakName']!=""]
df_14ers = df_14ers.sort_values('PeakName')
df_14ers['NumClimbs'] = df_14ers['NumClimbs'].apply(lambda x: -1 if x==0 else 1)

fname_user = r'user_profile_data_FULL_winterfix'
user_df = pickle.load(open(fname_user,"rb"))
#user_df = user_df[['UserId','Username']]

df_14ers = pd.merge(df_14ers, user_df[['UserId','Username']], on='UserId', how='left')
df_14ers = df_14ers.drop_duplicates()
df_14ers.head()
#user_df.head()

Unnamed: 0,UserId,PeakName,NumClimbs,Username
0,58801,BlancaPeak,1,Ozziejomi
1,4891,BlancaPeak,1,davidheese
2,11662,BlancaPeak,1,chrishoffbauer
3,4881,BlancaPeak,1,Cleatelite
4,15596,BlancaPeak,1,jwinters


In [73]:
user_df['Age']=user_df['Age'].apply(pd.to_numeric, errors='coerce').fillna(value=np.nan)
nusers_tot = len(user_df)
nusers_age = len(user_df[user_df['Age']>0])
print('Fraction of users providing an age is '+ str(nusers_age/nusers_tot))


Fraction of users providing an age is 0.11770057288977404


##### Plot user distribution by age, for the 12% who gave their age.

In [14]:
user_df['Age'].iplot(kind='hist',xTitle='Age',yTitle='Count')

##### Plot user distribution by join date (all users included)

In [23]:
jd = user_df.copy()
jd = jd.groupby([jd['JoinDate'].dt.year.rename('year'), jd['JoinDate'].dt.month.rename('month')]).agg({'count'})#jd.groupby(jd.JoinDate.dt.to_period("M")).agg('count')
jd['UserId'].iplot(kind='bar',xTitle='Join Date',yTitle='Count')

##### Next I drop some of the more obscure peaks from the 14ers checklist data.  Also plot to see counts distribution.

In [74]:
climb_counts = df_14ers.groupby('PeakName').agg('count').sort_values('NumClimbs')
climb_counts['UserId'].iplot(kind='bar',xTitle='Peak Name',yTitle='Logged Ascents')
dropnames = climb_counts[climb_counts['NumClimbs']<300].index.tolist()
df_14ers = df_14ers[df_14ers['PeakName'].apply(lambda x: x not in dropnames)]
#df_14ers[df_14ers['NewUserId']==4].head(40)

In [60]:
print(dropnames)

['SunlightSpire', 'SouthWilson', 'NortheastCrestone', 'SouthLittleBear', 'NorthSnowmass']


##### Next I users with less than 2 climbs.  This is mainly to keep the dataset size manageable for my desktop.

In [75]:
#user_counts = df_14ers.groupby('UserId').agg('count').sort_values('NumClimbs')
#v = df_14ers['UserId'].apply(pd.to_numeric)
user_counts = df_14ers.UserId.value_counts()
#print(v)
user_counts.iplot(kind='hist',xTitle='Number Of Peaks',yTitle='Count')#climb_counts['UserId'].iplot(kind='bar',xTitle='Peak Name',yTitle='Logged Ascents')
gooduids = user_counts[user_counts >2].index.to_list()
df_14ers = df_14ers[df_14ers['UserId'].isin(gooduids)]

#len(df_14ers_trunc)
#len(user_counts[user_counts['Username']>3])
#dropnames = climb_counts[climb_counts['NumClimbs']<300].index.tolist()
#df_14ers = df_14ers[df_14ers['PeakName'].apply(lambda x: x not in dropnames)]
#df_14ers[df_14ers['NewUserId']==4].head(40)

In [76]:
n_14er_users = df_14ers.UserId.nunique()
n_14ers = df_14ers.PeakName.nunique()
print(n_14er_users)
print(n_14ers)
## Create a dataframe that assigns an index to each peak
index_ind=pd.DataFrame(data={'PeakId':np.arange(len(df_14ers.PeakName.unique())),'PeakName':df_14ers.PeakName.sort_values().unique()})
index_ind.head(12)

15166
66


Unnamed: 0,PeakId,PeakName
0,0,BlancaPeak
1,1,CapitolPeak
2,2,CastlePeak
3,3,ChallengerPoint
4,4,ConundrumPeak
5,5,CrestoneNeedle
6,6,CrestonePeak
7,7,CulebraPeak
8,8,EastCrestone
9,9,EastLaPlata


In [77]:
df_14ers = df_14ers.merge(index_ind)
df_14ers.head()

Unnamed: 0,UserId,PeakName,NumClimbs,Username,PeakId
0,58801,BlancaPeak,1,Ozziejomi,0
1,4891,BlancaPeak,1,davidheese,0
2,11662,BlancaPeak,1,chrishoffbauer,0
3,4881,BlancaPeak,1,Cleatelite,0
4,15596,BlancaPeak,1,jwinters,0


In [78]:
## We should also redo the user ids since they span the length of the full user database, and we
## have only taken the subset of the users that have logged a 14er hike
index_id=pd.DataFrame(data={'NewUserId':np.arange(len(df_14ers.UserId.unique())),'UserId':df_14ers.UserId.unique()})
df_14ers = df_14ers.merge(index_id)
df_14ers.head()

Unnamed: 0,UserId,PeakName,NumClimbs,Username,PeakId,NewUserId
0,58801,BlancaPeak,1,Ozziejomi,0,0
1,58801,EllingwoodPoint,1,Ozziejomi,11,0
2,58801,GraysPeak,1,Ozziejomi,12,0
3,58801,HandiesPeak,1,Ozziejomi,13,0
4,58801,HumboldtPeak,1,Ozziejomi,14,0


In [79]:
### Run if desired
pickle.dump(df_14ers,open(r'14ers_df_final_trunc2',"wb"))

In [80]:
### Split df into train/test
from sklearn.model_selection import train_test_split
random.seed(1000)
train_data14ers, test_data14ers = train_test_split(df_14ers, test_size=0.25)

## Build the similarity metric matrix between users based only on Pearson metric, and then create a peak prediction for each user 

In [81]:
train_data_userids = train_data14ers['NewUserId'].tolist()
test_data_userids = test_data14ers['NewUserId'].tolist()

In [82]:
print(n_14ers)
df_14ers.tail(25)

66


Unnamed: 0,UserId,PeakName,NumClimbs,Username,PeakId,NewUserId
319665,61675,SanLuisPeak,1,tk58,52,15158
319666,61675,WindomPeak,1,tk58,65,15158
319667,51015,QuandaryPeak,1,Cat5Axual,50,15159
319668,51015,RedcloudPeak,1,Cat5Axual,51,15159
319669,51015,UncompahgrePeak,1,Cat5Axual,61,15159
319670,51015,WetterhornPeak,1,Cat5Axual,63,15159
319671,18660,QuandaryPeak,1,Dominion,50,15160
319672,18660,TorreysPeak,1,Dominion,60,15160
319673,18660,UncompahgrePeak,1,Dominion,61,15160
319674,38559,QuandaryPeak,1,maxpika,50,15161


In [83]:
full_data_matrix = -1*np.ones((n_14er_users, n_14ers))
for line in df_14ers.itertuples():
    full_data_matrix[line[6], line[5]] = line[3]
train_data_matrix = np.copy(full_data_matrix)
train_data_matrix[train_data_userids,:]=0
test_data_matrix = np.copy(full_data_matrix)
test_data_matrix[test_data_userids,:]=0


##### Pearson Metric 

$s(u_i,u_j)=\frac{\sum_k (x_{i,k}-\bar{x_i})(x_{j,k}-\bar{x_j})}{\sqrt{\sum_k (x_{i,k}-\bar{x_i})^2 \sum_k (x_{j,k}-\bar{x_j})^2}}$

where $x_{n,k}$ is the rating that a user $(i,j)$ gives to a summit of peak $k$, and $\bar{x}_n$ is the average rating over all peaks given by user $n$.  

In [84]:
def pearson_correlation(full_data_matrix,frac=1.0):
    epsilon = 1e-5
    #train_data_matrix has dimensions of (nusers,n14ers)
    nusers = np.shape(full_data_matrix)[0]
    pcorr_matrix = np.zeros((nusers,nusers))
    user_means = np.mean(full_data_matrix,axis=1)*1/frac ##rescale if needed when inputting partially-zeroed matrices from train and test
    mean_subtract = full_data_matrix - np.reshape(user_means,(nusers,1))
    denom_sum_sq = (np.sum((mean_subtract)**2,axis=1))**0.5 + epsilon
    for i in range(nusers):
        for j in range(nusers):
            pcorr_matrix[i,j] = np.dot(mean_subtract[i,:],mean_subtract[j,:])/(denom_sum_sq[i]*denom_sum_sq[j])
    return pcorr_matrix

In [32]:
def cosine_correlation(full_data_matrix,frac=1):
    epsilon = 1e-5
    #train_data_matrix has dimensions of (nusers,n14ers)
    nusers = np.shape(full_data_matrix)[0]
    pcorr_matrix = np.zeros((nusers,nusers))
    user_means = np.mean(full_data_matrix,axis=1)*1/frac 
    denom_sum_sq = (np.sum((full_data_matrix)**2,axis=1))**0.5 + epsilon
    for i in range(nusers):
        for j in range(nusers):
            pcorr_matrix[i,j] = np.dot(mean_subtract[i,:],mean_subtract[j,:])/(denom_sum_sq[i]*denom_sum_sq[j])
    return pcorr_matrix

In [85]:
### WARNING: this will take a bit of time (~10 mins).  Pre-Load data if possible
user_similarity_full = pearson_correlation(full_data_matrix,1.0)
#pickle.dump(user_similarity_full_winterfix,open(r'user_similarities_14er_pearson_full',"wb"))

In [86]:
### RUN THIS IF FILE EXISTS

pickle.dump(user_similarity_full,open(r'user_similarities_14er_pearson_gt2_winterfix',"wb"))

In [33]:
user_similarity_full=pickle.load(open(r'user_similarities_14er_pearson_full_winterfix',"rb"))

In [36]:
print(user_similarity_full[5001,-4000:-3900])

[0.17754963 0.32557098 0.26042351 0.59858454 0.16639686 0.44203639
 0.52553216 0.45464852 0.2387267  0.17754963 0.26042351 0.43501224
 0.42267587 0.47667529 0.30560675 0.05559635 0.16639686 0.24094688
 0.30250805 0.31143861 0.57688674 0.41204962 0.39787871 0.28152892
 0.32464083 0.46402669 0.40354025 0.26042351 0.18276756 0.07203203
 0.34916702 0.37400517 0.37400517 0.2387267  0.2995029  0.24094688
 0.36813442 0.19254669 0.10344824 0.241648   0.37960945 0.26042351
 0.34916702 0.34916702 0.16639686 0.29729209 0.34868291 0.05559635
 0.46472187 0.40354025 0.35156671 0.42267587 0.17754963 0.36832164
 0.29729209 0.39674777 0.54626989 0.48431882 0.34868291 0.54653288
 0.34916702 0.2387267  0.2387267  0.50678289 0.28941385 0.4474764
 0.241648   0.34868291 0.26218992 0.17754963 0.24827607 0.42145618
 0.12872505 0.26169252 0.35420387 0.11464917 0.10344824 0.40354025
 0.24988942 0.42267587 0.50678289 0.33528421 0.35316872 0.2387267
 0.48431882 0.32464083 0.45464852 0.48431882 0.44144091 0.368134

##### Spot checks to make sure that the 14er lists of user pairs rated most similar are actually close

In [87]:
user_similarity=user_similarity_full
sim_to_10000=user_similarity[10000,:].argsort() 
sim_to_5600=user_similarity[5600,:].argsort()
#print(user_similarity[10000,sim_to_5600[-2]])
#print(user_similarity[5000,sim_to_10000])
print('Check User 10000')
print(df_14ers[df_14ers['NewUserId']==sim_to_10000[-2]])
print('\n')
print(df_14ers[df_14ers['NewUserId']==10000])
print('\n\n')
print('Check User 5600')
print(df_14ers[df_14ers['NewUserId']==sim_to_5600[-2]])
print('\n')
print(df_14ers[df_14ers['NewUserId']==5600])

Check User 10000
       UserId         PeakName  NumClimbs       Username  PeakId  NewUserId
249354  49543        GraysPeak          1  scottdunsmuir      12       8157
249355  49543        MtBelford          1  scottdunsmuir      24       8157
249356  49543          MtBross          1  scottdunsmuir      26       8157
249357  49543        MtCameron          1  scottdunsmuir      27       8157
249358  49543       MtDemocrat          1  scottdunsmuir      29       8157
249359  49543         MtElbert          1  scottdunsmuir      30       8157
249360  49543        MtMassive          1  scottdunsmuir      35       8157
249361  49543         MtOxford          1  scottdunsmuir      36       8157
249362  49543        MtShavano          1  scottdunsmuir      38       8157
249363  49543        MtSherman          1  scottdunsmuir      39       8157
249364  49543     QuandaryPeak          1  scottdunsmuir      50       8157
249365  49543   TabeguachePeak          1  scottdunsmuir      59       

The recommendation for peak "k" for a given user $n$ is then calculated from the user similarities according to the intuition that probable likelihood of hiking a given peak is added to the average likelihood of hiking any peak for user "n" ($\bar{x}_k$) as a normalized, similarity-weighted sum over the mean-subtracted likelihood for all users "m":

$\hat{x}_{k,n} = \bar{x}_k + \frac{\sum_{u_m}s(u_n, u_m)(x_{k,m} - \bar{x}_m)}{\left| \sum_{u_m}{s(u_n,u_m)} \right|}$



In [88]:
def predict(ratings, similarity):
    epsilon = 1e-5
    mean_user_rating = ratings.mean(axis=1)
    ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
    pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / (np.array([np.abs(similarity).sum(axis=1)])+epsilon).T
    return pred

In [89]:
user_prediction_full = predict(full_data_matrix, user_similarity_full)

In [90]:
### The prediction matrix along with the user info dataframe will be used to make the Flask API
pickle.dump(user_prediction_full,open(r'user_prediction_gt2_winterfix',"wb"))

### Perform some spot-checks of the recommendations 

In [93]:
print(len(user_prediction_full))

15166


In [95]:
np.shape(user_prediction_full)
### For now I just want to explicitly exclude the common peaks
common_peaks = ['GraysPeak', 'MtBierstadt', 'QuandaryPeak', 'TorreysPeak', 'MtDemocrat', 'MtElbert', 'MtBross', 'MtCameron', 'MtEvans', 'MtSherman', 'PikesPeak']
n=12
user_prediction=user_prediction_full.astype('float')

args=((-user_prediction[15165,:]).argsort())
hiked_list_15166 = (df_14ers[df_14ers['NewUserId']==15165])['PeakName'].tolist()
print('User 15166 Hiked:')
print(hiked_list_15166)
rec_list_15166 = index_ind['PeakName'][args].tolist()
new_rec_list_15166 = [elem for elem in rec_list_15166 if elem not in hiked_list_15166 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 15166:')
print(new_rec_list_15166[0:n])
args=((-user_prediction[5600,:]).argsort())
hiked_list_5600 = (df_14ers[df_14ers['NewUserId']==5600])['PeakName'].tolist()
print('\n\n')
print('User 5600 Hiked:')
print(hiked_list_5600)
rec_list_5600 = index_ind['PeakName'][args].tolist()
new_rec_list_5600 = [elem for elem in rec_list_5600 if elem not in hiked_list_5600 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 5600:')
print(new_rec_list_5600[0:n])

hiked_list_8600 = (df_14ers[df_14ers['NewUserId']==8600])['PeakName'].tolist()
print('\n\n')
print('User 8600 Hiked:')
print(hiked_list_8600)
rec_list_8600 = index_ind['PeakName'][args].tolist()
new_rec_list_8600 = [elem for elem in rec_list_8600 if elem not in hiked_list_8600 and elem not in common_peaks]
print('\n\n')
print('Recommendation for User 8600:')
print(new_rec_list_8600[0:n])

User 15166 Hiked:
['SanLuisPeak', 'SunlightPeak', 'UncompahgrePeak', 'WetterhornPeak']



Recommendation for User 15166:
['WindomPeak', 'MtEolus', 'HandiesPeak', 'RedcloudPeak', 'SunshinePeak', 'NorthEolus', 'WilsonPeak', 'MtSneffels', 'MtWilson', 'ElDientePeak', 'CulebraPeak', 'SnowmassMountain']



User 5600 Hiked:
['CastlePeak', 'ConundrumPeak', 'GraysPeak', 'HandiesPeak', 'LongsPeak', 'MissouriMountain', 'MtBelford', 'MtBierstadt', 'MtElbert', 'MtEvans', 'MtOxford', 'MtoftheHolyCross', 'PikesPeak', 'QuandaryPeak', 'SunlightPeak', 'TorreysPeak', 'UncompahgrePeak', 'WestEvans']



Recommendation for User 5600:
['MtMassive', 'HuronPeak', 'LaPlataPeak', 'MtYale', 'MtShavano', 'MtPrinceton', 'MtHarvard', 'TabeguachePeak', 'MtAntero', 'MtColumbia', 'RedcloudPeak', 'HumboldtPeak']



User 8600 Hiked:
['GraysPeak', 'MtBelford', 'MtBierstadt', 'MtOxford', 'MtShavano', 'MtYale', 'QuandaryPeak', 'TabeguachePeak', 'TorreysPeak']



Recommendation for User 8600:
['LongsPeak', 'MtMassive', 'Huro

In [62]:
(df_14ers[df_14ers['NewUserId']==5000])['Username']

161498    TexasClimber1492
161499    TexasClimber1492
161500    TexasClimber1492
161501    TexasClimber1492
161502    TexasClimber1492
161503    TexasClimber1492
161504    TexasClimber1492
161505    TexasClimber1492
161506    TexasClimber1492
Name: Username, dtype: object

## Test Solution

#### RMSE

In [45]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

In [None]:
print('User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix)))


### Recommendation Entropy

I am also defining a metric I'll call the "recommendation entropy", equal to the standard information entropy definition of

-$\sum_k p_k \text{log}(p_k)$, where here $p_k$ is a fractional value equal to the number of times a given peak $k$ was recommended as a top-10 peak over the pool of users, divided by the total number of users.

In [111]:
def calc_rec_entropy(user_prediction):
    nusers = np.shape(user_prediction)[0]
    npeaks = np.shape(user_prediction)[1]
    probs = np.zeros((npeaks,1))
    for n in range(npeaks):
        args=((-user_prediction).argsort())
        user_pred_top10 = args[:,0:10]
        probs[n] = np.sum(user_pred_top10 == n)
    probs = probs/float(nusers)
    return -1/10.0*np.dot((probs.flatten()),np.log((probs.flatten())))
        



In [118]:
ent_10=calc_rec_entropy(user_prediction)
print(ent_10)


0.2758027760186843


In [61]:
#phases = 0,90,270,180
mag = 4*dbm_to_vsq(-57.5)#np.sqrt(dbm_to_vsq(-57))*np.exp(1j*0)+np.sqrt(dbm_to_vsq(-57))*np.exp(1j*np.pi/2)+np.sqrt(dbm_to_vsq(-57))*np.exp(1j*270*np.pi/180.0)+np.sqrt(dbm_to_vsq(-57))*np.exp(1j*np.pi)
print(vsq_to_dbm(mag))

-51.479400086720375


In [53]:
mag

(0.0005616748812614789+0j)