## Cosine Similarity and Recommendation System

In this Jupyter notebook, I use cosine similarity to build a user-to-user recommendation system.

This recommendation system suggests which hiking trails a user should try next, based on how similar users rated them.

I'm building an explicit collaborative filtering system. Explicit means the ratings were given directly by the users, although I had to scrape the web to collect them.

**NOTE:** This code uses a lot of RAM and runs the computer hot. It takes about 30-60 minutes to run on a quad-core computer with 8 GB of RAM, so run it at the risk of overworking your machine.

Structure of this notebook:

1. Read in the CSV files
2. Create a list of all the reviewers (rows)
3. Convert the list to a dataframe, remove duplicates, and create a matrix with the hiking trails as columns
4. Fill the ratings matrix with zeroes
5. Fill in the ratings in the matrix
6. Reduce the dataframe by filtering out reviewers with 3 or fewer reviews
7. Create the cosine similarity matrix
8. Call functions to generate recommended hiking trails based on user-to-user similarities
9. Evaluate the recommendation system

## Libraries

In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import seaborn as sns

### Let's Import the CSV Files

trail_id_df.csv holds the trail names along with the trail_id created in the third Jupyter notebook (the EDA).

In [37]:
trail_id = pd.read_csv('../data/trail_id_df.csv')

### Let's read in the dataframe of all the California hiking trails, with their features, reviewers, and ratings, from 02_Scrapping_Reviews.ipynb

In [2]:
df = pd.read_csv('../data/clean_df.csv')

In [3]:
df.shape

(7727, 13)

## Create a List of All the Reviewers from All the Hiking Trails in California

In [4]:
# This list comprehension parses each trail's reviews once, grabs the reviewer
# name from every review, and collects them (duplicates included) into all_users
all_users = [name
    for reviews in df.reviewers_rating.map(eval)
    for review in reviews
    for name in review]


In [14]:
len(all_users)

321359
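
For reference, the comprehension above assumes that each `reviewers_rating` cell is a string holding a list of single-entry `{reviewer: rating}` dictionaries. Here is a minimal sketch on made-up data (the names and ratings are invented, and `toy_df` and `toy_users` are hypothetical names). It uses `ast.literal_eval` as a safer stand-in for `eval`, which is a swap on my part and not what the notebook ran.

```python
import ast
import pandas as pd

# Toy stand-in for clean_df.csv: one stringified list of {reviewer: rating}
# dicts per trail. Reviewers and ratings are invented for illustration.
toy_df = pd.DataFrame({
    'reviewers_rating': [
        "[{'Ann': 5}, {'Bob': 4}]",
        "[{'Ann': 3}, {'Cy': 5}, {'Bob': 2}]",
    ]
})

# literal_eval only parses Python literals, so it cannot execute arbitrary code
parsed = toy_df['reviewers_rating'].map(ast.literal_eval)

# One entry per review, duplicates included (same idea as all_users)
toy_users = [name for reviews in parsed for review in reviews for name in review]
print(toy_users)       # ['Ann', 'Bob', 'Ann', 'Cy', 'Bob']
print(len(toy_users))  # 5
```

The length counts reviews, not unique reviewers, which is why duplicates are removed in the next step.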

## Convert the List to a Dataframe, Remove Dups, Set Index and Fill it with Zeroes

In [15]:
# Creates a DataFrame of all the hiking trail reviewers
# (columns must be a list; newer pandas versions reject a set here)
all_user_df = pd.DataFrame(all_users, columns=['users'])

# Removes all the duplicates
all_user_df.drop_duplicates(inplace=True)

# Sets the index as the users
all_user_df.set_index(['users'], inplace=True)

# Creates the matrix of all zeros, one column per trail (row of df)
for i in df.index:
    all_user_df[i] = 0

## Fill the Zeroes with the Ratings for Each Reviewer

In [23]:
for col in all_user_df.columns:
    # Parse the trail's review list once; each review is a {reviewer: rating} dict
    for review in eval(df.reviewers_rating[col]):
        for user, rating in review.items():
            all_user_df.at[user, col] = rating

In [24]:
all_user_df.shape

(117292, 7727)

In [25]:
all_user_df.head()

| users | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 7717 | 7718 | 7719 | 7720 | 7721 | 7722 | 7723 | 7724 | 7725 | 7726 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Daniel Cons de León | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| David Loop | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Philip Henke | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Stephanie Silva | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Janelle Tompsett | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
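
The cells above fill the matrix one cell at a time. As a possible alternative (not what this notebook ran), the same users-by-trails matrix could be built from long-format records with `pivot_table`. This is a minimal sketch under that assumption. The toy data and the names `toy_df`, `long_df`, and `toy_matrix` are invented for illustration.

```python
import ast
import pandas as pd

# Invented stand-in for the reviewers_rating column
toy_df = pd.DataFrame({
    'reviewers_rating': [
        "[{'Ann': 5}, {'Bob': 4}]",
        "[{'Ann': 3}, {'Cy': 5}]",
    ]
})

# Flatten to one (user, trail, rating) record per review
records = [
    (user, trail, rating)
    for trail, cell in toy_df['reviewers_rating'].items()
    for review in ast.literal_eval(cell)
    for user, rating in review.items()
]
long_df = pd.DataFrame(records, columns=['users', 'trail', 'rating'])

# Pivot to users x trails; unrated trails become 0, as in all_user_df.
# aggfunc='last' keeps the later rating if a user reviewed a trail twice.
toy_matrix = (long_df
              .pivot_table(index='users', columns='trail', values='rating',
                           aggfunc='last', fill_value=0)
              .reindex(columns=toy_df.index, fill_value=0))
print(toy_matrix)
```

The final `reindex` keeps a column for every trail, including any trail that has no reviews at all.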


## Reduce the DataFrame for Cosine Similarities

This is the formula for Cosine Similarity. I got this formula from the lecture notes at General Assembly.

$$
\cos(\theta) = \frac{A \cdot B}{\left\| A\right\| \left\| B\right\| } = \frac{A \cdot B}{\sqrt{\sum{A_i^2}} \cdot \sqrt{\sum{B_i^2}}}
$$

Cosine similarity measures the angle between two vectors. Here each vector is one user's ratings across all the trails, so it has both a magnitude and a direction. If two users rate the same trails in similar proportions, their vectors point in nearly the same direction and the angle between them is small. If they rate entirely different trails, the vectors are orthogonal (a 90 degree angle). The cosine of 0 degrees is 1, the cosine of 90 degrees is 0, and the cosine of 180 degrees is -1. Because the ratings here are never negative, the similarity falls between 0 and 1: a value close to 1 means the users are similar, and a value close to 0 means they are dissimilar.

The formula states that cosine theta equals the dot product of A and B divided by the product of the Euclidean norms of vector A and vector B.
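
To make the formula concrete, here is a small worked example on made-up rating vectors. The vectors `a`, `b`, `c` and the helper `cosine` are hypothetical. The last line checks the hand-written version against scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings for three users across four trails (0 = not rated)
a = np.array([5, 4, 0, 0])
b = np.array([4, 5, 0, 0])   # rated the same trails as a
c = np.array([0, 0, 5, 3])   # shares no trails with a

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cosine(a, b), 4))   # 0.9756  (40 / 41)
print(cosine(a, c))             # 0.0

# scikit-learn expects 2D input, so each vector is wrapped in a list
print(np.isclose(cosine_similarity([a], [b])[0, 0], cosine(a, b)))   # True
```

Users with no trails in common score exactly 0, which is the lowest value non-negative ratings can produce.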

In [27]:
# This code overwrites the df with a new df that keeps only the rows (users) whose count of
# non-zero values (ratings) is greater than 3. Meaning, keep all users with more than 3 reviews
all_user_df = all_user_df.loc[(all_user_df != 0).sum(axis=1) > 3]

In [28]:
# As you can see, the new shape is now 20,311 rows; before it was 117,292 rows.
all_user_df.shape

(20311, 7727)

In [29]:
print(f'Percentage of users remaining after keeping those with more than 3 reviews: {20311/117292*100}')

Percentage of users remaining after keeping those with more than 3 reviews: 17.31661153360843
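
As a quick illustration of what the filter does, here is a sketch on a tiny invented matrix. The names `toy`, `n_reviews`, and `kept` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame(
    [[5, 4, 3, 4, 0],    # 4 ratings -> kept
     [5, 0, 0, 4, 3],    # 3 ratings -> dropped
     [0, 0, 2, 0, 0]],   # 1 rating  -> dropped
    index=['Ann', 'Bob', 'Cy'])

# Count non-zero entries per row, i.e. the number of trails each user rated
n_reviews = (toy != 0).sum(axis=1)
kept = toy.loc[n_reviews > 3]
print(list(kept.index))   # ['Ann']
```

Note that the comparison is strict, so a user with exactly 3 reviews is dropped.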


## Let's Create the Cosine Similarity Matrix for Users to Users

In [32]:
cosine_mat = cosine_similarity(all_user_df, all_user_df)

In [33]:
cosine_mat.shape

(20311, 20311)
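
A dense 20311 x 20311 matrix of 64-bit floats takes roughly 3.3 GB, which helps explain the RAM warning at the top. One possible way to ease this (an assumption on my part, not something this notebook ran) is to pass a SciPy sparse matrix to `cosine_similarity` with `dense_output=False`. A minimal sketch on invented ratings:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Invented ratings: three users, four trails
ratings = np.array([[5, 4, 0, 0],
                    [4, 5, 0, 0],
                    [0, 0, 5, 3]])

# CSR format stores only the non-zero ratings
ratings_sparse = sparse.csr_matrix(ratings)

# dense_output=False keeps the result sparse when the input is sparse
sim_sparse = cosine_similarity(ratings_sparse, dense_output=False)

# Same values as the dense computation
sim_dense = cosine_similarity(ratings)
print(np.allclose(sim_sparse.toarray(), sim_dense))   # True
```

How much memory this saves depends on how many user pairs share at least one trail, since only those pairs have a non-zero similarity.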

Drop the extra column in trail_id

In [38]:
trail_id.drop(columns='Unnamed: 0', inplace=True)

## Functions

The function **return_users_name_with_trails_and_ratings** takes a user id number and prints the main user's name along with every trail the main user has reviewed and the rating given.

In [57]:
def return_users_name_with_trails_and_ratings(uid):
    '''Print the user's name, then each trail the user rated with its rating.'''
    user_ratings = all_user_df.iloc[uid]
    print(user_ratings.name)
    for k in user_ratings[user_ratings > 0].keys():
        print(trail_id[trail_id.index == int(k)].name.item(), user_ratings[k])

The function **top_similar_users_and_ratings** takes the main user's id number and the number of most similar users you want to see. It then prints those similar users with their reviewed trails and ratings.

In [58]:
def top_similar_users_and_ratings(user_id=0, top=10):
    '''
    user_id: index position of the main user (default 0)
    top: number of most similar users to show (default 10)
    '''
    for user in np.argsort(cosine_mat[user_id])[-2:(-top-2):-1]:
        return_users_name_with_trails_and_ratings(user)
        print('\n')

The function **generate_recommended_trails** takes the main user's id number and the number of most similar users to consider. It prints the main user's reviewed trails and returns the 10 trails the main user hasn't reviewed that have the highest rating totals among those similar users.

In [63]:
def generate_recommended_trails(user_id=0, top=10):
    # List of trails from main user
    main_user_trails = [trail_id[trail_id.index == int(k)].name.item() 
                        for k in all_user_df.iloc[user_id][(all_user_df.iloc[user_id] > 0)].keys()]

    # Index positions of the top most similar users (the -2 skips the main user itself)
    similar_users = np.argsort(cosine_mat[user_id])[-2:(-top-2):-1]

    # List of trails from the most similar users
    empty_list = []
    for user in similar_users:
        for k in all_user_df.iloc[user][(all_user_df.iloc[user] > 0)].keys():
            empty_list.append(trail_id[trail_id.index == int(k)].name.item())

    recommended_trail = list(set(empty_list))

    # Create dataframe of only the recommended trails that are not in the main user list
    recommended_df = pd.DataFrame(list(set(recommended_trail).difference(main_user_trails)), 
                                  columns=['Recommended_Trails'])

    # Create a column for the Total Ratings
    recommended_df['Total_Ratings'] = 0

    # Setting the index to Recommended Trails
    recommended_df.set_index('Recommended_Trails', inplace=True)

    # This loop sums the ratings for the recommended trails
    for user in similar_users:
        for k in all_user_df.iloc[user][(all_user_df.iloc[user] > 0)].keys():
            trail_name = trail_id[trail_id.index == int(k)].name.item()
            # Trails the main user already reviewed are not in recommended_df, so skip them
            if trail_name in recommended_df.index:
                recommended_df.loc[trail_name, 'Total_Ratings'] += int(all_user_df.iloc[user][k])
    return_users_name_with_trails_and_ratings(user_id)    
    return recommended_df.sort_values(by='Total_Ratings', ascending=False).head(10)
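
The two loops in the function above can also be expressed as a column sum over the similar users' rows. Here is a minimal sketch of that idea on invented data. `toy_ratings`, `toy_sim`, and `recommend` are hypothetical names, and the columns are plain trail labels instead of `trail_id` lookups.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Invented ratings: four users, five trails (0 = not rated)
toy_ratings = pd.DataFrame(
    [[5, 4, 0, 0, 0],
     [5, 5, 4, 0, 0],
     [4, 4, 0, 3, 0],
     [0, 0, 0, 0, 5]],
    index=['Ann', 'Bob', 'Cy', 'Dee'],
    columns=['T0', 'T1', 'T2', 'T3', 'T4'])
toy_sim = cosine_similarity(toy_ratings)

def recommend(user_pos, top=2, n=10):
    # Positions of the `top` most similar users, skipping the user itself
    # (assumed to sort into last place, as in the function above)
    similar = np.argsort(toy_sim[user_pos])[-2:(-top - 2):-1]
    # Sum each trail's ratings across the similar users
    totals = toy_ratings.iloc[similar].sum()
    # Drop trails the main user already rated and trails no similar user rated
    unseen = toy_ratings.iloc[user_pos] == 0
    totals = totals[unseen & (totals > 0)]
    return totals.sort_values(ascending=False).head(n)

print(recommend(0))   # T2 totals 4, T3 totals 3
```

For Ann, the two most similar users are Cy and Bob, so the only unseen trails with ratings are T2 (from Bob) and T3 (from Cy).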

## Evaluating the Recommendation System

#### Let's take a look at one random user. This is user #5, Leonardo Dumo.
From hiking experience, I know that most of these hikes are in the San Diego area.

The recommended list shows 10 trails. The majority of the trails on the list are also in San Diego.

In [75]:
generate_recommended_trails(5, 200)

Leonardo Dumo
Potato Chip Rock via Mt. Woodson Trail 5
Iron Mountain Trail 4
Mission Peak Loop from Stanford Avenue Staging Area 5
Torrey Pines - Red Butte, Yucca Point, and Razor Point 4
Lake Miramar Trail 4
San Diego Coastline: Chula Vista to Coronado 3


| Recommended_Trails | Total_Ratings |
|---|---|
| Cowles Mountain Trail | 188 |
| Fortuna Mountain Trail | 154 |
| Kwaay Paay Peak Trail | 134 |
| Lake Hodges Overlook Trail | 118 |
| Three Sisters Waterfalls Trail | 115 |
| El Cajon Mountain Trail | 114 |
| Torrey Pines Beach Trail Loop | 98 |
| Los Penasquitos Canyon Trail | 76 |
| Cowles Mountain from Big Rock Trail | 71 |
| Cedar Creek Falls Trail | 68 |

This code shows the sorted cosine_mat values for user #5. The slice starts at -2 to bypass the user's similarity to itself, which is 1. The closer the value is to 1, the more similar the users are. The closer it is to 0, the more dissimilar they are.

In [85]:
np.sort(cosine_mat[5])[-2:(-12):-1]

array([0.63287759, 0.61462741, 0.56375861, 0.56203901, 0.51601932,
       0.50748949, 0.50748949, 0.50410628, 0.50270297, 0.49603025])
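
The slice `[-2:(-top-2):-1]` is compact but easy to misread, so here is a small sketch on an invented similarity row showing what it selects. The variable names are hypothetical. The last lines show one alternative that excludes the user explicitly, which avoids relying on the user sorting into last place (another user with an identical rating direction would also score 1).

```python
import numpy as np

# Invented similarity row for user 0; position 0 is the user's similarity with itself
sims = np.array([1.0, 0.2, 0.9, 0.0, 0.5])
top = 3

order = np.argsort(sims)          # positions sorted by ascending similarity
print(order)                      # [3 1 4 2 0]

# Start at the second-to-last position and walk backwards `top` steps
print(order[-2:(-top - 2):-1])    # [2 4 1]

# Alternative: remove the user's own position, then take the top from the end
others = order[order != 0][::-1][:top]
print(others)                     # [2 4 1]
```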

This code calls the generate_recommended_trails function for main user #5 and considers the top 20 most similar users. These are the trails it recommends, starting with Torrey Pines Beach Trail Loop because it has the highest rating total.

In [70]:
generate_recommended_trails(5, 20)

Leonardo Dumo
Potato Chip Rock via Mt. Woodson Trail 5
Iron Mountain Trail 4
Mission Peak Loop from Stanford Avenue Staging Area 5
Torrey Pines - Red Butte, Yucca Point, and Razor Point 4
Lake Miramar Trail 4
San Diego Coastline: Chula Vista to Coronado 3


| Recommended_Trails | Total_Ratings |
|---|---|
| Torrey Pines Beach Trail Loop | 14 |
| Lake Hodges Overlook Trail | 12 |
| Mother Miguel Mountain Via Rock House Trail | 10 |
| Los Penasquitos Canyon Trail | 9 |
| Cowles Mountain Trail | 9 |
| Twin Peaks Trail | 9 |
| Snake Trail | 8 |
| Double Peak Trail | 8 |
| Nighthawk Trail Black Mountain Loop | 6 |
| Cowles Mountain from Big Rock Trail | 6 |

## Summary:

1. Successfully generated a cosine similarity matrix.
2. Successfully examined user-to-user similarities based on the trail ratings.
3. Successfully created a function that extracts the trails the main user hasn't reviewed yet and suggests those with the highest rating totals as the trails to try next, based on user-to-user similarities.

    

## Next Steps for Future Improvements:

1. I want to build a Flask app that shows the recommendations and lets anyone enter a description of the type of trail they want to hike, returning a list of possible trails as recommendations.
2. I want to put a threshold on the number of similar users based on the cosine matrix value, e.g. a cutoff at a cos theta value of 0.45.
3. I want to reduce run time and RAM usage, and save all_user_df. The matrices were memory hogs. I need to convert them to a sparse matrix or a NumPy array so the computer can run more efficiently.
4. I didn't get to use features like trail difficulty, distance, and location. I want to combine item-to-item with user-to-user filtering in the future.
5. Finally, I want to validate my recommendation system with a train-test split, predicting a known trail the main user has already reviewed.
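
As a starting point for item 2, here is a rough sketch of selecting similar users by a similarity cutoff instead of a fixed count. The similarity row is invented and the names `sims`, `threshold`, `candidates`, and `similar_users` are hypothetical. The 0.45 cutoff is only the example value mentioned above, not a tuned choice.

```python
import numpy as np

# Invented similarity row for user 0 (position 0 is the user itself)
sims = np.array([1.0, 0.63, 0.30, 0.50, 0.44])
user_id = 0
threshold = 0.45   # example cutoff from item 2

# Candidates at or above the cutoff, excluding the user itself
candidates = np.flatnonzero(sims >= threshold)
candidates = candidates[candidates != user_id]

# Order the survivors from most to least similar
similar_users = candidates[np.argsort(sims[candidates])[::-1]]
print(similar_users)   # [1 3]
```

With a cutoff, the number of similar users varies per main user, so some users may get few or no neighbors.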