#Make Full Recommendation Based on User Input

This notebook makes recommendations based on user input.

Input: 
User ID or user name (str)

Output: destination suggestions (str), about 10 suggested climbs (int climb IDs) in each area, and a description of what kind of climber the user is, based on the latent features of the model.

##Import modules and load in data

In [4]:
import graphlab as gl
import numpy as np
import pandas as pd
import pickle, json
from collections import defaultdict

#user name to user id dict
with open('user_map.p','rb') as f:  #pickle files should be opened in binary mode
    user_map = pickle.load(f)

#star rating and climb data
df_data = pd.read_csv('star5.csv')
with open('df_raw_star5.p','rb') as f:
    df_raw = pickle.load(f)
    
#load in recommendation models
sim_mod = gl.load_model('sim_mod')
rfr_mod = gl.load_model('rfm_mod_15')
rfr_mod_lf = gl.load_model('rfm_mod_features_extracted')

[INFO] This non-commercial license of GraphLab Create is assigned to dmneal@gmail.com and will expire on August 05, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-45254 - Server binary: /Users/datascientist/anaconda/envs/dato-env/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1439499282.log
[INFO] GraphLab Server Version: 1.5.2


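First, we need to determine what the input was. If it's a user id that we have, we proceed with the recommendation. If it's a user name, we look up the user id and proceed as above. If it's a climb, or a series of climbs, we make recommendations based on those climbs; that case is handled by `rec_loc_climb_sim` further down, since `determine_input` only resolves users and returns `None` for anything else.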

In [5]:
def determine_input(input_str, df_data, user_map):
    try:
        id_val = int(input_str)
        if id_val in df_data.User.values:
            return id_val
        else: 
            return None
    except ValueError:
        if input_str in user_map:
            return user_map[input_str]
        else:
            return None

In [6]:
input_str = 'LauraColyer'
user_id = determine_input(input_str, df_data, user_map)
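
The lookup logic can be tried without the pickled data. The following is a minimal sketch in standard-library Python 3 (not the notebook's Python 2.7 environment). `resolve_user`, `known_ids`, and `name_to_id` are hypothetical stand-ins for `determine_input`, `df_data.User.values`, and `user_map`, and the toy data is made up for illustration.

```python
def resolve_user(input_str, known_ids, name_to_id):
    """Return a user id for a numeric id or a user name, else None."""
    try:
        id_val = int(input_str)
    except ValueError:
        # Not numeric: treat the input as a user name.
        return name_to_id.get(input_str)
    # Numeric: accept it only if we have ratings for that id.
    return id_val if id_val in known_ids else None


# Hypothetical toy data standing in for the real tables.
known_ids = {107953067, 100000001}
name_to_id = {'LauraColyer': 107953067}

print(resolve_user('LauraColyer', known_ids, name_to_id))  # 107953067
print(resolve_user('107953067', known_ids, name_to_id))    # 107953067
print(resolve_user('42', known_ids, name_to_id))           # None
print(resolve_user('nobody', known_ids, name_to_id))       # None
```

One consequence of trying `int()` first is that a user name made up only of digits would be treated as an id.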

We suggest destinations and climbs using two models. For each model, we walk down its ranked recommendations until ~3-5 areas in which the user has not rated a climb each have ~10 suggested climbs within the user's climbing ability. We then suggest those areas and the recommended climbs in them.

In [7]:
def user_rated_visited(user_id, df_data, df_raw):
    #Find destinations the climber has rated climbs in
    climbs_rated = df_data[df_data.User==user_id].Climb
    visited = set(df_raw.loc[climbs_rated].sub_location.values)
  
    #Find the user's difficulty ceiling from their starred climbs:
    #mean difficulty rating plus one standard deviation
    user_ratings = df_raw.loc[climbs_rated].rating
    user_rating_std = user_ratings.std()
    user_rating_mean = user_ratings.mean()
    rating_max = user_rating_mean+user_rating_std
    
    return visited, rating_max

In [8]:
visited, rating_max = user_rated_visited(user_id, df_data, df_raw)
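
The difficulty ceiling is the mean of the user's rated climbs plus one standard deviation. Below is a small sketch of that arithmetic in standard-library Python 3; `rating_ceiling` is a hypothetical helper, not part of the notebook. pandas' `Series.std()` is a sample standard deviation and yields NaN for a single value, so a user with exactly one rated climb would presumably end up with a NaN ceiling that no climb passes. The sketch falls back to the single rating in that case.

```python
from statistics import mean, stdev


def rating_ceiling(ratings):
    """Mean plus one sample standard deviation of the given ratings."""
    ratings = list(ratings)
    if not ratings:
        return None
    if len(ratings) < 2:
        # A sample standard deviation needs at least two values.
        return float(ratings[0])
    return mean(ratings) + stdev(ratings)


print(rating_ceiling([13.0, 19.5, 22.75]))  # roughly 23.38
print(rating_ceiling([21]))                 # 21.0
```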


In [9]:
#Find climb locations to recommend
def rec_loc_climb(user_id, visited, rating_max, df_raw,
                  model, verbose=False, n_areas=3, n_climbs=10):
    
    climb_recs = model.recommend(users=[user_id], k=13000)
    loc_climb_recs = defaultdict(list)
    loc_recs = []
    n_recs = 0
    for rec in climb_recs:
        climb = rec['Climb']
        if df_raw.loc[climb].rating < rating_max:
            loc = df_raw.loc[climb].sub_location
            if loc not in (list(visited) + loc_recs):
                loc_climb_recs[loc] += [climb]
                if len(loc_climb_recs[loc]) == n_climbs:
                    loc_recs += [loc]
                    n_recs += 1
                    if n_recs == n_areas:
                        break
    if verbose:
        for loc in loc_recs:
            print loc
            print loc_climb_recs[loc]
    return loc_recs, loc_climb_recs
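
The grouping loop in `rec_loc_climb` can be illustrated on toy data. This is a simplified sketch in standard-library Python 3, assuming the ranked list has already been filtered by difficulty. `group_by_area` and `area_of` are hypothetical names, and unlike the function above it returns only the areas that filled up.

```python
from collections import defaultdict


def group_by_area(ranked_climbs, area_of, visited, n_areas=3, n_climbs=10):
    """Walk a ranked list and collect the first n_areas unvisited areas
    that accumulate n_climbs climbs each."""
    done = []              # areas that reached n_climbs, in order
    skip = set(visited)    # areas we never collect climbs for
    per_area = defaultdict(list)
    for climb in ranked_climbs:
        area = area_of[climb]
        if area in skip:
            continue
        per_area[area].append(climb)
        if len(per_area[area]) == n_climbs:
            done.append(area)
            skip.add(area)  # stop collecting for a completed area
            if len(done) == n_areas:
                break
    return done, {a: per_area[a] for a in done}


# Toy ranking: the first letter of each climb is its area.
ranked = ['a1', 'b1', 'a2', 'c1', 'b2', 'c2', 'd1', 'd2']
area_of = {c: c[0] for c in ranked}
areas, climbs = group_by_area(ranked, area_of, visited={'a'},
                              n_areas=2, n_climbs=2)
print(areas)   # ['b', 'c']
print(climbs)  # {'b': ['b1', 'b2'], 'c': ['c1', 'c2']}
```

Keeping the skipped areas in a set avoids rebuilding `list(visited) + loc_recs` on every iteration.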

In [10]:
#Get climb recommendations from item similarity model
loc_recs, loc_climb_recs = rec_loc_climb(user_id,
                                                  visited,
                                                  rating_max,
                                                  df_raw,
                                                  sim_mod,
                                                  verbose=True)

Yosemite National Park
[105840361, 105862896, 105862915, 105924807, 106154042, 107429493, 105862873, 105912192, 107677399, 105872907]
Custer State Park
[107684661, 108052987, 106053351, 105715232, 105714734, 107775061, 108244256, 107517307, 107810730, 105714761]
Adirondacks
[106532800, 107708237, 106831219, 107564731, 107185274, 106092527, 106542971, 106197345, 106078832, 106594953]


In [11]:
#Get climb recommendations from the factorization model (rfr_mod)
loc_recs_rfr, loc_climb_recs_rfr = rec_loc_climb(user_id,
                                                  visited,
                                                  rating_max,
                                                  df_raw,
                                                  rfr_mod,
                                                  verbose=True)

Yosemite National Park
[105862915, 105924807, 106154042, 105945535, 105862991, 105862873, 105862896, 106167844, 106187777, 105847471]
Joshua Tree National Park
[105721666, 105725788, 105722743, 105722305, 105723325, 105722050, 105725389, 105723478, 105722227, 105722431]
Cathedral Ledge
[105880759, 105919971, 105909672, 105880787, 105922177, 105938087, 105903672, 105920872, 105949212, 105924990]


We also want to characterize the user based on their latent features in the rfr model.

In [12]:
def get_latent_user(user_id, model):
    coefs = model.get('coefficients')
    df_fac_user = pd.DataFrame(np.array(coefs['User']['factors']))
    df_fac_user.set_index(np.array(coefs['User']['User']), inplace=True)
    return df_fac_user.loc[user_id]

In [13]:
get_latent_user(user_id, rfr_mod_lf)

0    0.057407
1   -0.119440
2   -0.098539
3   -0.049110
Name: 107953067, dtype: float64

###Recommend based on one or more climbs 

In [14]:
def rec_loc_climb_sim(model, df_raw, climb_ids=[106129861, 106460891], 
                      rating_max = None,
                      verbose=False, 
                      n_areas=3, n_climbs=10):
    #Recommend locations and climbs based on similarities to input climbs
    rec_SF = model.get_similar_items(climb_ids, k=1000)
    top_recs = rec_SF.to_dataframe().sort(['distance'],
                                          ascending=False).similar
    visited = set(df_raw.loc[climb_ids].sub_location.values)
    
    if rating_max is None:
        #Find the difficulty ceiling from the input climbs
        user_ratings = df_raw.loc[climb_ids].rating
        print 'user_rating', user_ratings
        if len(user_ratings)>1:
            rating_max = user_ratings.std()+user_ratings.mean()
        else:
            #std() of a single value is NaN, so use the rating itself
            rating_max = user_ratings.mean()
    
    print "Rating Max:", rating_max
    
    loc_climb_recs = defaultdict(list)
    loc_recs = []
    n_recs = 0
    for climb in top_recs:
        if df_raw.loc[climb].rating < rating_max:
            loc = df_raw.loc[climb].sub_location
            if loc not in (list(visited) + loc_recs):
                loc_climb_recs[loc] += [climb]
                if len(loc_climb_recs[loc]) == n_climbs:
                    loc_recs += [loc]
                    n_recs += 1
                    if n_recs == n_areas:
                        break
    
    if verbose:
        for loc in loc_recs:
            print loc
            print loc_climb_recs[loc]
    return loc_recs, loc_climb_recs

In [16]:
loc_recs, loc_climb_recs = rec_loc_climb_sim(rfr_mod, 
                        df_raw, 
                        climb_ids=[106125070],
                        verbose=True) 


PROGRESS: Getting similar items completed in 0.16747
user_rating Climb
106125070    21
Name: rating, dtype: float64
Rating Max: 21.0
Spearfish Canyon
[106478790, 107282754, 108135568, 106371703, 107422535, 106476834, 107474601, 106514378, 106131704, 106709530]
Mount Rushmore National Memorial
[105715109, 105715403, 105715715, 108126846, 105715724, 106467636, 106245532, 108036270, 108126873, 105715052]
Yosemite National Park
[105877935, 106042450, 105862936, 105862957, 105865415, 105996954, 106026157, 106007588, 105874615, 105872907]


In [17]:
def get_latent_climbs(climb_ids, model):
    coefs = model.get('coefficients')
    df_fac_climb = pd.DataFrame(np.array(coefs['Climb']['factors']))
    df_fac_climb.set_index(np.array(coefs['Climb']['Climb']), inplace=True)
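    #Note: despite the name, climb_av accumulates the sum of the factor
    #vectors; divide by len(climb_ids) if an average is wanted.
    #The hard-coded 4 is the number of latent factors in the model.
    climb_av = np.zeros(4)
    for climb_id in climb_ids:
        climb_av += df_fac_climb.loc[climb_id]
    return climb_av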

In [77]:
get_latent_climbs([106460891, 106460901, 105875465], rfr_mod_lf)

106460891
106460901
105875465


array([-0.08909002,  0.09612912,  0.10399982,  0.13489014])
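
The variable name `climb_av` suggests an average, while the loop adds the factor vectors without dividing. If an element-wise mean is what is wanted, it could be sketched as follows in standard-library Python 3. `mean_vector` and the toy factor values are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length numeric sequences."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    # zip(*vectors) regroups the values by dimension.
    return [sum(col) / n for col in zip(*vectors)]


# Hypothetical two-factor vectors for two climbs.
factors = {1: [1.0, -2.0], 2: [3.0, 0.0]}
print(mean_vector([factors[1], factors[2]]))  # [2.0, -1.0]
```

Taking the vector length from the data also avoids hard-coding the number of latent factors.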

###Make recommendations using latent features as observed features

In [52]:
df_raw.head()

Climb,location,rating,rating_dif,star_votes,stars,sub_location,type,comment
105834134,Utah,22.75,0.0,6,3.5,Wasatch Range,Trad,15.0
106181256,Colorado,19.5,1.5,12,3.1,Boulder,Trad,2.0
105920111,International,17.5,0.0,13,2.9,North America,Sport,
106208618,California,13.0,0.0,24,3.1,Yosemite National Park,Trad,3.0
105718117,Utah,28.0,0.5,45,4.9,Zion National Park,Sport,23.0


In [83]:
def rec_loc_climb_lf(input_lf, climb_type, model, df_raw, df_data,
                     rating_max=18, star_min=4, n_climbs=10,
                     n_areas=3, verbose=False):
    #Score every climb by the dot product of its latent factors with the
    #requested latent-feature vector (df_data is currently unused)
    scale_ls = 0.05 * input_lf
    coefs = model.get('coefficients')
    df_fac_climb = pd.DataFrame(np.array(coefs['Climb']['factors']))
    df_rec = pd.DataFrame(scale_ls.dot(df_fac_climb.values.T))
    df_rec.set_index(np.array(coefs['Climb']['Climb']), inplace=True)

    #Best-scoring climbs first
    df_rec.sort(0, ascending=False, inplace=True)

    #Keep only well-rated climbs of the requested type
    df_rec = df_rec[df_raw['stars']>star_min]
    df_rec = df_rec[df_raw['type']==climb_type]

    loc_climb_recs = defaultdict(list)
    loc_recs = []
    n_recs = 0
    for climb in df_rec.index:
        if df_raw.loc[climb].rating < rating_max:
            loc = df_raw.loc[climb].sub_location
            if loc not in (loc_recs):
                loc_climb_recs[loc] += [climb]
                if len(loc_climb_recs[loc]) == n_climbs:
                    loc_recs += [loc]
                    n_recs += 1
                    if n_recs == n_areas:
                        break
    if verbose:
        for loc in loc_recs:
            print loc
            print loc_climb_recs[loc]
    return loc_recs, loc_climb_recs


In [84]:
input_lf = np.array([-1,-1,1,-1])
climb_type = 'Trad'
star_min = 3
rating_max = 20
model = rfr_mod_lf
loc_recs, loc_climb_recs = rec_loc_climb_lf(input_lf, climb_type, model,
                                            df_raw, df_data,
                                            rating_max=rating_max,
                                            star_min=star_min,
                                            verbose=True)

Red River Gorge
[106064911, 106789874, 105865483, 105880914, 105919554, 105870326, 105860741, 106915658, 106128420, 105970537]
North America
[107198282, 108344970, 109007909, 108145859, 106798174, 106640630, 106798163, 105857749, 107407350, 105963366]
Acadia National Park
[105975325, 105998178, 106489240, 106034988, 106616203, 106035181, 106813156, 106867373, 105998473, 105975313]
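
The core of `rec_loc_climb_lf` is a dot product between a preference vector and each climb's latent factors. Here is a minimal sketch of that ranking step in standard-library Python 3. `rank_by_preference` and the three toy factor vectors are hypothetical; only the preference vector `[-1, -1, 1, -1]` comes from the cell above.

```python
def rank_by_preference(pref, item_factors, top=None):
    """Rank item ids by the dot product of their factors with pref."""
    def score(vec):
        return sum(p * v for p, v in zip(pref, vec))

    ranked = sorted(item_factors,
                    key=lambda item: score(item_factors[item]),
                    reverse=True)
    return ranked[:top]  # top=None keeps the full ranking


pref = [-1, -1, 1, -1]
# Hypothetical latent factors for three climbs.
item_factors = {
    'A': [-0.1, -0.1, 0.1, -0.1],  # aligned with pref
    'B': [0.1, 0.1, -0.1, 0.1],    # opposite to pref
    'C': [0.0, -0.1, 0.0, 0.0],    # weakly aligned
}
print(rank_by_preference(pref, item_factors))         # ['A', 'C', 'B']
print(rank_by_preference(pref, item_factors, top=1))  # ['A']
```

Multiplying the preference vector by a positive constant, such as the 0.05 used above, rescales every score equally and so leaves the ranking unchanged.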


###Inspect latent features to determine what they mean

In [45]:
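#coefs is assumed to come from the latent-feature model loaded above;
#the cell that originally defined it is not shown
coefs = rfr_mod_lf.get('coefficients')
side_data = pd.DataFrame(np.array(coefs['side_data']['factors']))
side_data.set_index(np.array(coefs['side_data']['index']), inplace=True)
side_data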

index,0,1,2,3
0,0.027357,0.036358,0.003158,0.031058
Utah,0.080573,1.93298,-1.132659,1.685715
International,0.017579,-1.321137,-0.054643,1.309529
Colorado,-0.03842,2.176859,-1.344112,-1.967785
Nevada,0.025285,-0.913047,-1.709671,0.327458
New Hampshire,-0.128422,-1.867894,2.034448,-2.327331
California,0.198598,-1.702715,-2.12658,0.778821
New York,0.181439,-1.864892,1.406713,-2.157098
South Dakota,-0.091835,2.19147,3.084531,-1.390929
Washington,0.027421,-1.318861,2.304941,2.267952
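
One simple way to start reading these factors is to look at which locations sit at the two ends of each latent dimension. The sketch below, in standard-library Python 3, copies the location rows shown above into a dict (the unlabeled `0` row is left out). `extremes` is a hypothetical helper, and the output is only a starting point for interpretation, not a conclusion about what each factor means.

```python
side_factors = {
    'Utah': [0.080573, 1.93298, -1.132659, 1.685715],
    'International': [0.017579, -1.321137, -0.054643, 1.309529],
    'Colorado': [-0.03842, 2.176859, -1.344112, -1.967785],
    'Nevada': [0.025285, -0.913047, -1.709671, 0.327458],
    'New Hampshire': [-0.128422, -1.867894, 2.034448, -2.327331],
    'California': [0.198598, -1.702715, -2.12658, 0.778821],
    'New York': [0.181439, -1.864892, 1.406713, -2.157098],
    'South Dakota': [-0.091835, 2.19147, 3.084531, -1.390929],
    'Washington': [0.027421, -1.318861, 2.304941, 2.267952],
}


def extremes(factors, dim):
    """Return the (lowest, highest) keys along one latent dimension."""
    ordered = sorted(factors, key=lambda k: factors[k][dim])
    return ordered[0], ordered[-1]


for dim in range(4):
    low, high = extremes(side_factors, dim)
    print(dim, low, high)
# 0 New Hampshire California
# 1 New Hampshire South Dakota
# 2 California South Dakota
# 3 New Hampshire Washington
```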


In [41]:
coefs

{'Climb': Columns:
 	Climb	int
 	linear_terms	float
 	factors	array
 
 Rows: 13675
 
 Data:
 +-----------+------------------+-------------------------------+
 |   Climb   |   linear_terms   |            factors            |
 +-----------+------------------+-------------------------------+
 | 105834134 | -0.328962981701  | [-0.0495708845556, -0.0473... |
 | 106181256 | -0.105537422001  | [-0.0724642798305, -0.0232... |
 | 105920111 | -0.148473024368  | [0.0290321689099, 0.063994... |
 | 106208618 | -0.0228561554104 | [-0.0333808325231, 0.02560... |
 | 105718117 |  0.246717840433  | [-0.118238158524, 0.054365... |
 | 105868666 | -0.179359763861  | [-0.044172488153, -0.04059... |
 | 106449034 | 0.0553373508155  | [-0.0781496763229, -0.0576... |
 | 105947723 |  0.114070989192  | [-0.0237222407013, 0.00327... |
 | 105732860 | -0.315011382103  | [-0.0341163463891, 0.03355... |
 | 108873104 | -0.153943553567  | [-0.000562944391277, 0.011... |
 +-----------+------------------+-----------------

In [86]:
%%writefile rec_funcs.py
import graphlab as gl
import numpy as np
import pandas as pd
import pickle, json
from collections import defaultdict


def determine_input(input_str, df_data, user_map):
    try:
        id_val = int(input_str)
        if id_val in df_data.User.values:
            return id_val
        else: 
            return None
    except ValueError:
        if input_str in user_map:
            return user_map[input_str]
        else:
            return None

def user_rated_visited(user_id, df_data, df_raw):
    #Find destinations the climber has rated climbs in
    climbs_rated = df_data[df_data.User==user_id].Climb
    visited = set(df_raw.loc[climbs_rated].sub_location.values)
  
    #Find the user's difficulty ceiling from their starred climbs:
    #mean difficulty rating plus one standard deviation
    user_ratings = df_raw.loc[climbs_rated].rating
    user_rating_std = user_ratings.std()
    user_rating_mean = user_ratings.mean()
    rating_max = user_rating_mean+user_rating_std
    
    return visited, rating_max
   
#Find climb locations to recommend
def rec_loc_climb(user_id, visited, rating_max, df_raw,
                  model, verbose=False, n_areas=3, n_climbs=10):
    
    climb_recs = model.recommend(users=[user_id], k=13000)
    loc_climb_recs = defaultdict(list)
    loc_recs = []
    n_recs = 0
    for rec in climb_recs:
        climb = rec['Climb']
        if df_raw.loc[climb].rating < rating_max:
            loc = df_raw.loc[climb].sub_location
            if loc not in (list(visited) + loc_recs):
                loc_climb_recs[loc] += [climb]
                if len(loc_climb_recs[loc]) == n_climbs:
                    loc_recs += [loc]
                    n_recs += 1
                    if n_recs == n_areas:
                        break
    if verbose:
        for loc in loc_recs:
            print loc
            print loc_climb_recs[loc]
    return loc_recs, loc_climb_recs


def get_latent_user(user_id, model):
    coefs = model.get('coefficients')
    df_fac_user = pd.DataFrame(np.array(coefs['User']['factors']))
    df_fac_user.set_index(np.array(coefs['User']['User']), inplace=True)
    return df_fac_user.loc[user_id]


def rec_loc_climb_sim(model, df_raw, climb_ids=[106129861, 106460891], 
                      rating_max = None,
                      verbose=False, 
                      n_areas=3, n_climbs=10):
    #Recommend locations and climbs based on similarities to input climbs
    rec_SF = model.get_similar_items(climb_ids, k=1000)
    top_recs = rec_SF.to_dataframe().sort(['distance'],
                                          ascending=False).similar
    visited = set(df_raw.loc[climb_ids].sub_location.values)
    
    if rating_max is None:
        #Find the difficulty ceiling from the input climbs
        user_ratings = df_raw.loc[climb_ids].rating
        print 'user_rating', user_ratings
        if len(user_ratings)>1:
            rating_max = user_ratings.std()+user_ratings.mean()
        else:
            #std() of a single value is NaN, so use the rating itself
            rating_max = user_ratings.mean()
    
    print "Rating Max:", rating_max
    
    loc_climb_recs = defaultdict(list)
    loc_recs = []
    n_recs = 0
    for climb in top_recs:
        if df_raw.loc[climb].rating < rating_max:
            loc = df_raw.loc[climb].sub_location
            if loc not in (list(visited) + loc_recs):
                loc_climb_recs[loc] += [climb]
                if len(loc_climb_recs[loc]) == n_climbs:
                    loc_recs += [loc]
                    n_recs += 1
                    if n_recs == n_areas:
                        break
    
    if verbose:
        for loc in loc_recs:
            print loc
            print loc_climb_recs[loc]
    return loc_recs, loc_climb_recs

def get_latent_climbs(climb_ids, model):
    coefs = model.get('coefficients')
    df_fac_climb = pd.DataFrame(np.array(coefs['Climb']['factors']))
    df_fac_climb.set_index(np.array(coefs['Climb']['Climb']), inplace=True)
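    #Note: despite the name, climb_av accumulates the sum of the factor
    #vectors; divide by len(climb_ids) if an average is wanted.
    #The hard-coded 4 is the number of latent factors in the model.
    climb_av = np.zeros(4)
    for climb_id in climb_ids:
        climb_av += df_fac_climb.loc[climb_id]
    return climb_av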

def rec_loc_climb_lf(input_lf, climb_type, model, df_raw, df_data,
                     rating_max=18, star_min=4, n_climbs=10,
                     n_areas=3, verbose=False):
    #Score every climb by the dot product of its latent factors with the
    #requested latent-feature vector (df_data is currently unused)
    scale_ls = 0.05 * input_lf
    coefs = model.get('coefficients')
    df_fac_climb = pd.DataFrame(np.array(coefs['Climb']['factors']))
    df_rec = pd.DataFrame(scale_ls.dot(df_fac_climb.values.T))
    df_rec.set_index(np.array(coefs['Climb']['Climb']), inplace=True)

    #Best-scoring climbs first
    df_rec.sort(0, ascending=False, inplace=True)

    #Keep only well-rated climbs of the requested type
    df_rec = df_rec[df_raw['stars']>star_min]
    df_rec = df_rec[df_raw['type']==climb_type]

    loc_climb_recs = defaultdict(list)
    loc_recs = []
    n_recs = 0
    for climb in df_rec.index:
        if df_raw.loc[climb].rating < rating_max:
            loc = df_raw.loc[climb].sub_location
            if loc not in (loc_recs):
                loc_climb_recs[loc] += [climb]
                if len(loc_climb_recs[loc]) == n_climbs:
                    loc_recs += [loc]
                    n_recs += 1
                    if n_recs == n_areas:
                        break
    if verbose:
        for loc in loc_recs:
            print loc
            print loc_climb_recs[loc]
    return loc_recs, loc_climb_recs


if __name__ == "__main__":
    #user name to user id dict
    with open('user_map.p','rb') as f:  #pickle files should be opened in binary mode
        user_map = pickle.load(f)

    #star rating and climb data
    df_data = pd.read_csv('star5.csv')
    with open('df_raw_star5.p','rb') as f:
        df_raw = pickle.load(f)

    #load in recommendation models
    sim_mod = gl.load_model('sim_mod')
    rfr_mod = gl.load_model('rfm_mod_15')
    rfr_mod_lf = gl.load_model('rfm_mod_features_extracted')
    
    #user input
    input_str = 'LauraColyer'
    user_id = determine_input(input_str, df_data, user_map)
    
    visited, rating_max = user_rated_visited(user_id, df_data, df_raw)

    #Get climb recommendations from item similarity model
    loc_recs_sim, loc_climb_recs_sim = rec_loc_climb(user_id,
                                                  visited,
                                                  rating_max,
                                                  df_raw,
                                                  sim_mod,
                                                  verbose=True)
    
    #Get climb recommendations from the factorization model (rfr_mod)
    loc_recs_rfr, loc_climb_recs_rfr = rec_loc_climb(user_id,
                                                  visited,
                                                  rating_max,
                                                  df_raw,
                                                  rfr_mod,
                                                  verbose=True)
    print get_latent_user(user_id, rfr_mod_lf)
    
    loc_recs, loc_climb_recs = rec_loc_climb_sim(rfr_mod, 
                        df_raw, 
                        climb_ids=[106460891, 106460901, 105875465],
                        rating_max = 18,
                        verbose=True) 
    
    print get_latent_climbs([106460891, 106460901, 105875465], rfr_mod_lf)
    
    input_lf = np.array([-1,-1,1,-1])
    climb_type = 'Trad'
    star_min = 3
    rating_max = 20
    model = rfr_mod_lf
    loc_recs, loc_climb_recs = rec_loc_climb_lf(input_lf, climb_type, model,
                                                df_raw, df_data,
                                                rating_max=rating_max,
                                                star_min=star_min,
                                                verbose=True)

Overwriting rec_funcs.py


In [133]:
!python rec_funcs.py

Yosemite National Park
[105840361, 105862896, 105862915, 105924807, 106154042, 107429493, 105862873, 105912192, 107677399, 105872907]
Custer State Park
[107684661, 108052987, 106053351, 105715232, 105714734, 107775061, 108244256, 107517307, 107810730, 105714761]
Adirondacks
[106532800, 107708237, 106831219, 107564731, 107185274, 106092527, 106542971, 106197345, 106078832, 106594953]
Yosemite National Park
[105862915, 105924807, 106154042, 105945535, 105862991, 105862873, 105862896, 106167844, 106187777, 105847471]
Joshua Tree National Park
[105721666, 1

##Scratch work below

In [15]:
#Find recommended climbs in rating range and in area
m_rfr = gl.load_model('rfm_mod_15')
rfr_recs = m_rfr.recommend(users=[user_id], k=13000)

loc_recs_copy = list(loc_recs)
rfr_climb_recs = defaultdict(list)
for i,rec in enumerate(rfr_recs):
    climb = df_raw.loc[rec['Climb']]
    if (climb.rating > user_rating_range[0]) and \
            (climb.rating < user_rating_range[1]):
        if climb.sub_location in loc_recs_copy:
            rfr_climb_recs[climb.sub_location] += [rec['Climb']]
            #print i,climb.sub_location
            if len(rfr_climb_recs[climb.sub_location]) == 10:
                loc_recs_copy.remove(climb.sub_location)
                if not loc_recs_copy:
                    break
print json.dumps(rfr_climb_recs, indent=2)      

NameError: name 'loc_recs' is not defined