# Collaborative Filtering
## Big Idea
Input: the target user's features, a dataframe of user-activity ratings (with user features and activity features), and an integer N.
1. Score every user by how similar they are to the target user.
2. Multiply each user's similarity by their activity ratings to get a weighted match score.
3. From the top N rows by match score, find the most common activity features (assuming the features are independent). These define the "ideal" activity.
4. Score each activity by how similar its features are to those of the "ideal" activity.
5. Return the list of activities ordered by similarity to the "ideal" activity.
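
Steps 1 and 2, followed by a simple count over the top N rows, can be sketched on toy data as below. The user vectors, ratings, and the `top_activities` helper are hypothetical and exist only for illustration; this is a minimal sketch using the standard library, not the notebook's implementation.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity of two equal-length vectors (0.0 if either is all zeros).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical one-hot user features, keyed by user id.
users = {
    0: [1, 0, 1, 0],
    1: [1, 0, 0, 1],
    2: [0, 1, 0, 1],
}
# Hypothetical ratings on a 1-5 scale: (user id, activity, rating).
ratings = [
    (0, "Hiking", 5), (0, "Drawing", 1),
    (1, "Hiking", 4), (1, "Drawing", 3),
    (2, "Hiking", 1), (2, "Drawing", 5),
]

def top_activities(target_id, n):
    target = users[target_id]
    # Steps 1-2: weight every rating by the rater's similarity to the target.
    scored = [(cosine(target, users[uid]) * rating, activity)
              for uid, activity, rating in ratings]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Count how often each activity appears among the n best-scoring rows.
    counts = Counter(activity for _, activity in scored[:n])
    return [activity for activity, _ in counts.most_common()]

ranked = top_activities(target_id=0, n=3)
```

Note that the target user's own ratings take part with similarity 1.0, which is also what happens when the target is one of the rows in the full user matrix.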

In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [51]:

base_path = '/content/drive/My Drive/SCET/Data/'
response_path = base_path + 'SCET Friendship questionnaire (Responses) - Form Responses 1.csv'

In [52]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

# Import Data

In [53]:
deleted_features = ['Favorite movie genre?', 
                    'What qualities do you look for in a friend?', 
                    'Social distance run']
feedback_features = ['Timestamp', 'Email Address', 
                     'Any comments on the overall structure of the survey?', 
                     'Are the questions relevant to qualities you look for in a friend?', 
                     'Any suggestions on question we should be asking?', 
                     'What has been your favorite activity recently and why? ', 
                     'What are some virtual activities that you have been enjoying recently? ']
user_features = ['Age', 'Gender', 'College major?', 'How outdoorsy are you?',
                'When is your preferred time to hang out with friends? ',
                'What is your preferred way of spending time with friends?',
                'How often do you like to spend time with your friends?',
                'How many people do you like to spend time with at once?',
                'What is your top love language?', 'Introvert or extrovert?']
activites_features = ['Rank these activities on how much you enjoy them? (5 is most enjoyable) [Hiking]',
                      'Rank these activities on how much you enjoy them? (5 is most enjoyable) [Journaling]',
                      'Rank these activities on how much you enjoy them? (5 is most enjoyable) [Reading nonfiction]',
                      'Rank these activities on how much you enjoy them? (5 is most enjoyable) [Drawing]',
                      'Rank these activities on how much you enjoy them? (5 is most enjoyable) [Hanging out with friends. (pre-COVID)]',
                      'Social distance exercise (walk, run, etc.)', 
                      'Netflix party',
                      'Video chat hangout', 
                      'Wine tasting/Cocktail shake up',
                      'Trivia contests', 
                      'Virtual escape room', 
                      'Arts and crafts ']
user_id = ['First Name', 'Last Name']

In [54]:
full_data = pd.read_csv(response_path)
# Drop the first six response rows and the survey columns that are not used.
full_data = full_data.drop(range(6)).drop(deleted_features, axis=1)

user_data = full_data[user_id + user_features].copy()
activities_data = full_data[user_id + activites_features].copy()

In [55]:
user_data.drop(columns=user_id).head()

Unnamed: 0,Age,Gender,College major?,How outdoorsy are you?,When is your preferred time to hang out with friends?,What is your preferred way of spending time with friends?,How often do you like to spend time with your friends?,How many people do you like to spend time with at once?,What is your top love language?,Introvert or extrovert?
6,22,Female,Econ + Data Sci,Very outdoorsy,"Weekends during the day, Weekends at night","Grabbing foods or drinks together, Doing an ac...",Once a week,The more the merrier,Quality time,Extrovert
7,20,Female,STEM,Very outdoorsy,"Weekdays during the day, Weekends during the d...","Grabbing foods or drinks together, Doing an ac...",2-3 times a week,Small groups (up to 5 people),Acts of service,Introvert
8,19,Female,Business,Somewhat outdoorsy,Weekdays at night,"Electronically: messaging, calling, video-chat...",2-3 times a week,Small groups (up to 5 people),Acts of service,Introvert
9,21,Female,STEM,Somewhat outdoorsy,"Weekdays during the day, Weekends during the d...","Electronically: messaging, calling, video-chat...",Once a week,Small groups (up to 5 people),Quality time,Extrovert
10,21,Female,Humanities,More or less,"Weekends during the day, Weekends at night","Grabbing foods or drinks together, Doing an ac...",2-3 times a week,One-on-one only,Words of Affirmation,Extrovert


In [56]:
activities_data.drop(columns=user_id).head()

Unnamed: 0,Rank these activities on how much you enjoy them? (5 is most enjoyable) [Hiking],Rank these activities on how much you enjoy them? (5 is most enjoyable) [Journaling],Rank these activities on how much you enjoy them? (5 is most enjoyable) [Reading nonfiction],Rank these activities on how much you enjoy them? (5 is most enjoyable) [Drawing],Rank these activities on how much you enjoy them? (5 is most enjoyable) [Hanging out with friends. (pre-COVID)],"Social distance exercise (walk, run, etc.)",Netflix party,Video chat hangout,Wine tasting/Cocktail shake up,Trivia contests,Virtual escape room,Arts and crafts
6,5,4,3,5,5,3.0,5.0,5.0,5.0,4.0,4.0,
7,5,3,2,3,5,5.0,3.0,3.0,5.0,5.0,3.0,4.0
8,3,1,3,4,3,3.0,5.0,5.0,3.0,4.0,3.0,4.0
9,4,2,3,1,5,4.0,4.0,4.0,3.0,2.0,2.0,4.0
10,4,4,3,3,5,5.0,5.0,5.0,5.0,5.0,4.0,5.0


# Preprocessing

In [57]:
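names_to_id = {}
def anonymize(data, user_id):
  # Assign each row a sequential UserID and record the name -> UserID mapping.
  data['UserID'] = range(data.shape[0])  
  for _, r in data[['UserID'] + user_id].iterrows():
    # Index by column label; positional lookups such as r[1] are deprecated in pandas.
    names_to_id["{} {}".format(r[user_id[0]], r[user_id[1]]).title()] = r['UserID']
  return data.drop(user_id, axis=1)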

In [58]:
def preprocess_users(data, user_id):
  data = anonymize(data, user_id)
  
  columns = list(data.columns)
  one_hot_columns =  ['Gender', 'College major?', 'How outdoorsy are you?',
                      'When is your preferred time to hang out with friends? ',
                      'What is your preferred way of spending time with friends?',
                      'How often do you like to spend time with your friends?',
                      'How many people do you like to spend time with at once?',
                      'What is your top love language?', 'Introvert or extrovert?']
  
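  cat_indices = [columns.index(col) for col in one_hot_columns]

  # One-hot encode the categorical answers. remainder='passthrough' keeps the
  # remaining columns (Age and UserID) as raw numeric features, so they also
  # enter the similarity computation unscaled.
  data_pipeline = ColumnTransformer([
      ('categorical', OneHotEncoder(), cat_indices)
  ], remainder='passthrough')

  return data_pipeline.fit_transform(data)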
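
Several survey questions allow multiple selections (for example "Weekends during the day, Weekends at night"), and `OneHotEncoder` treats each distinct combination as its own category. One possible alternative, shown here only as a sketch on made-up answers, is a multi-hot encoding with pandas `Series.str.get_dummies`. It assumes the options are separated by ", " and that no single option contains a comma, which does not hold for answers such as "Electronically: messaging, calling, video-chat...".

```python
import pandas as pd

# Hypothetical multi-select answers, one string per respondent.
answers = pd.Series([
    "Weekends during the day, Weekends at night",
    "Weekdays at night",
    "Weekends at night",
])

# One 0/1 column per distinct option; columns come back in sorted order.
multi_hot = answers.str.get_dummies(sep=", ")
```

With this encoding, two users who share one of two selected time slots overlap in one column instead of landing in unrelated categories.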

In [59]:
def preprocess_activities(data, user_id):
  data = anonymize(data, user_id)

  table = data.T
  table.columns = table.iloc[-1]
  table = table.iloc[:-1]

  user_ids = []
  user_enjoyment_level = []
  activity_name = []

  for i in table:
    row = table[i]

    for feat, val in zip(row.index, row):
      user_ids.append(i)
      user_enjoyment_level.append(val)
      activity_name.append(feat)

  df = pd.DataFrame({'UserID':user_ids, 'User Enjoyment Level':user_enjoyment_level, 'Activity Name':activity_name})

  return df
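
The wide-to-long reshape done by the loop above can also be written with `DataFrame.melt`. The sketch below uses a small hypothetical table rather than the survey data. The row and column order differ from the loop's output, which may change how ties are broken when the result is sorted later.

```python
import pandas as pd

# Hypothetical wide table: one row per user, one column per activity.
wide = pd.DataFrame({
    "UserID": [0, 1],
    "Hiking": [5, 3],
    "Drawing": [2, 4],
})

# One row per (user, activity) pair.
long = wide.melt(id_vars="UserID",
                 var_name="Activity Name",
                 value_name="User Enjoyment Level")
# melt groups rows by activity; sort so each user's rows sit together.
long = long.sort_values(["UserID", "Activity Name"]).reset_index(drop=True)
```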

# Collaborative Filtering Algorithm

In [60]:
def similarity(x, y):
  return cosine_similarity(x.reshape(1,-1),y.reshape(1,-1))

In [61]:
def similarity_to_user(user, all_users):
  similarity_to_user_array = []
  for curr_user in all_users:
    similarity_to_user_array.extend(similarity(user, curr_user)[0])
  return similarity_to_user_array
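
The per-user loop can be replaced by a single matrix operation. The helper below is a hypothetical NumPy sketch on toy vectors; calling `cosine_similarity(user.reshape(1, -1), all_users)[0]` from scikit-learn should give the same row of scores in one call.

```python
import numpy as np

def similarity_to_user_vectorized(user, all_users):
    # Hypothetical helper: cosine similarity of `user` against every row at once.
    all_users = np.asarray(all_users, dtype=float)
    user = np.asarray(user, dtype=float).ravel()
    norms = np.linalg.norm(all_users, axis=1) * np.linalg.norm(user)
    # Rows with zero norm get similarity 0 instead of a division by zero.
    safe = np.where(norms == 0, 1.0, norms)
    return np.where(norms == 0, 0.0, all_users @ user / safe)

# Toy user matrix: one row per user.
toy_users = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0]])
sims = similarity_to_user_vectorized(toy_users[0], toy_users)
```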

In [62]:
def suggest_activity(N, user, all_users, activities_df, index_of_user_id, activity_score_name):
  # Score users by how similar they are to the given user.
  user_similarity_array = similarity_to_user(user, all_users)

  userIDs = list(range(len(user_similarity_array)))

  user_similarity_df = pd.DataFrame({"User Similarity": user_similarity_array, "UserID": userIDs})
  df = user_similarity_df.merge(activities_df, on='UserID').drop('UserID', axis=1)

  # Multiply the similarity of the users with the activity score. 
  df['Activity Match'] = [a*b for a, b in zip(df['User Similarity'], df[activity_score_name])]

  # Steps 3-5 of the Big Idea (building the "ideal" activity from the top N rows
  # and ranking activities by similarity to it) are not implemented yet.
  # Current shortcut: take the N rows with the highest Activity Match and rank
  # activities by how often they appear among those rows.
  df = df.sort_values('Activity Match', ascending=False)
  top_N = df.iloc[:N]
  sorted_activities = top_N['Activity Name'].value_counts(sort=True, ascending=False)

  return sorted_activities.index
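
Steps 3 to 5 of the Big Idea are not implemented in the function above. One way they might look is sketched below. The feature vectors, the majority threshold of 0.5, the Euclidean metric, and the `rank_by_ideal` helper are all assumptions made for illustration, not the notebook's method.

```python
import numpy as np

# Hypothetical binary feature vectors; the feature meanings are illustrative.
toy_features = {
    "Hiking":  [1, 1, 0],
    "Drawing": [0, 0, 1],
    "Trivia":  [0, 1, 1],
}

def rank_by_ideal(top_activity_names, features):
    # Step 3: a feature belongs to the "ideal" activity if at least half of the
    # top rows have it (each feature is treated independently).
    rows = np.array([features[name] for name in top_activity_names])
    ideal = (rows.mean(axis=0) >= 0.5).astype(int)
    # Steps 4-5: order all activities by Euclidean distance to the ideal vector.
    dist = {name: float(np.linalg.norm(np.array(vec) - ideal))
            for name, vec in features.items()}
    return ideal, sorted(dist, key=dist.__getitem__)

# Activity names of the top N rows, as they would come from the sorted dataframe.
ideal, ranking = rank_by_ideal(["Hiking", "Trivia", "Trivia"], toy_features)
```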

In [63]:
all_users = np.array(preprocess_users(user_data, user_id).todense())
all_activities = preprocess_activities(activities_data, user_id)
N = 10
i = 15
index_of_user_id = 42
given_user = all_users[i]
suggest_activity(N, given_user, all_users, all_activities, index_of_user_id, 'User Enjoyment Level')

Index(['Rank these activities on how much you enjoy them? (5 is most enjoyable) [Hanging out with friends. (pre-COVID)]',
       'Wine tasting/Cocktail shake up', 'Trivia contests',
       'Video chat hangout',
       'Rank these activities on how much you enjoy them? (5 is most enjoyable) [Hiking]',
       'Arts and crafts '],
      dtype='object')

# Apply

In [64]:
def match_prefs(pref1, pref2):
  pref1 = list(pref1)
  pref2 = list(pref2)
  ranked_pref = {}

  for i in pref1:
    if i in pref2:
      ranked_pref[i] = (pref1.index(i) + 1) * (pref2.index(i) + 1)

  sorted_activities = sorted(ranked_pref, key=ranked_pref.__getitem__)

  if not sorted_activities:
    print("May be bad match.")
    return None

  return sorted_activities
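
A small worked example of the rank-product rule used by `match_prefs`: each activity that both users share is scored by the product of its 1-based positions in the two lists, and lower scores rank first. The lists and the `rank_product` helper below are hypothetical.

```python
def rank_product(pref1, pref2):
    # Same scoring rule as match_prefs: multiply the 1-based ranks of shared items.
    pref1, pref2 = list(pref1), list(pref2)
    scores = {a: (pref1.index(a) + 1) * (pref2.index(a) + 1)
              for a in pref1 if a in pref2}
    return scores, sorted(scores, key=scores.__getitem__)

# Hypothetical ranked lists for two users.
u1 = ["Hiking", "Trivia contests", "Drawing"]
u2 = ["Trivia contests", "Netflix party", "Hiking"]
scores, ranked = rank_product(u1, u2)
# "Hiking" scores 1 * 3 = 3 and "Trivia contests" scores 2 * 1 = 2,
# so "Trivia contests" ranks first; "Drawing" is dropped because u2 lacks it.
```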

In [65]:
pairing_path = base_path + 'Random Pairings - Sheet2.csv'
pairings_df = pd.read_csv(pairing_path, header=None)
standardized_pairings = pairings_df.apply(lambda x: x.apply(lambda y: y.title().strip()))
pairings = standardized_pairings.apply(lambda x: x.apply(lambda y: names_to_id[y]))

In [66]:
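def clean_responses(responses):
  # Keep only the activity name inside "[...]" for survey columns that embed it
  # in the question text; leave all other names unchanged.
  new_responses = []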
  for response in responses:
    if "[" in response:
      new_responses.append(response[response.index("[")+1:-1])
    else:
      new_responses.append(response) 
  return new_responses

In [67]:
N = 5
activities_col = []
for _, pair in pairings.iterrows():
  print("User {} and User {}".format(pair[0], pair[1]))
  u1_ranked_pref = suggest_activity(N, all_users[pair[0]], all_users, all_activities, index_of_user_id, 'User Enjoyment Level')
  u2_ranked_pref = suggest_activity(N, all_users[pair[1]], all_users, all_activities, index_of_user_id, 'User Enjoyment Level')
  match = match_prefs(u1_ranked_pref, u2_ranked_pref)
  if not match:
    u1_ranked_pref = suggest_activity(len(activites_features), all_users[pair[0]], all_users, all_activities, index_of_user_id, 'User Enjoyment Level')
    u2_ranked_pref = suggest_activity(len(activites_features), all_users[pair[1]], all_users, all_activities, index_of_user_id, 'User Enjoyment Level')
    match = match_prefs(u1_ranked_pref, u2_ranked_pref)
  activities_col.append(clean_responses(match))
  print("Done")
standardized_pairings["Suggestion"] = activities_col
pairings["Suggestion"] = activities_col

User 1 and User 12
Done
User 2 and User 13
May be bad match.
Done
User 4 and User 14
Done
User 5 and User 15
Done
User 6 and User 16
Done
User 7 and User 17
Done
User 8 and User 18
Done
User 9 and User 19
Done
User 10 and User 20
Done
User 11 and User 0
Done


In [68]:
# Preliminary Result
standardized_pairings
# Anonymize
pairings

Unnamed: 0,0,1,Suggestion
0,1,12,"[Trivia contests, Hanging out with friends. (p..."
1,2,13,"[Hanging out with friends. (pre-COVID), Arts a..."
2,4,14,"[Hanging out with friends. (pre-COVID), Arts a..."
3,5,15,"[Trivia contests, Hanging out with friends. (p..."
4,6,16,"[Hanging out with friends. (pre-COVID), Arts a..."
5,7,17,"[Hanging out with friends. (pre-COVID), Social..."
6,8,18,[Hanging out with friends. (pre-COVID)]
7,9,19,"[Hanging out with friends. (pre-COVID), Wine t..."
8,10,20,"[Wine tasting/Cocktail shake up, Hanging out w..."
9,11,0,"[Drawing, Hanging out with friends. (pre-COVID)]"


In [69]:
activity_to_features = {
  'HIKING':[1, 1, 1, 0, 0, 1, 1, 0],
  'JOURNALING':[0, 0, 1, 1, 1, 0, 0, 0],
  'READING NONFICTION':[0, 0, 1, 1, 0, 0, 0, 0],
  'DRAWING':[0, 0, 1, 0, 1, 0, 0, 0],
  'HANGING OUT WITH FRIENDS. (PRE-COVID)':[1, 1, 1, 0, 0, 1, 1, 1],
  'SOCIAL DISTANCE EXERCISE (WALK, RUN, ETC.)':[1, 1, 1, 0, 0, 1, 1, 0],
  'NETFLIX PARTY':[0, 0, 0, 0, 1, 1, 0, 0],
  'VIDEO CHAT HANGOUT':[0, 0, 1, 0, 0, 1, 1, 0],
  'WINE TASTING/COCKTAIL SHAKE UP':[0, 0, 1, 0, 1, 1, 1, 1],
  'TRIVIA CONTESTS':[0, 0, 0, 0, 1, 1, 1, 0],
  'VIRTUAL ESCAPE ROOM':[0, 0, 1, 0, 1, 1, 1, 0],
  'ARTS AND CRAFTS':[0, 0, 1, 0, 1, 1, 0, 0]
}

suggestions_from_features = {
  'PICNIC':[1, 0, 1, 0, 1, 1, 1, 1],
  'GROUP GAMES (AMONG US, CODE NAMES, ETC.)':[0, 0, 0, 0, 1, 1, 1, 0],
  'GRABBING FOOD OR DRINKS TOGETHER':[1, 0, 0, 0, 0, 0, 1, 1],
  'STUDY TOGETHER':[0, 0, 0, 1, 0, 1, 0, 0],
  'VIDEO GAMES':[0, 0, 0, 0, 1, 1, 1, 0],
  'COOKING/BAKING CLASS':[0, 0, 1, 0, 1, 1, 1, 1],
  'PAINTING SOCIAL':[1, 0, 0, 0, 1, 1, 1, 0],
  'BOOK CLUB':[0, 0, 0, 1, 1, 1, 1, 0],
  'KARAOKE':[0, 0, 0, 0, 1, 1, 1, 0],
  'COOKING/BAKING COMPETITION':[0, 0, 1, 0, 1, 1, 1, 1],
  'WORKOUT SESSION':[1, 1, 1, 0, 0, 1, 0, 0],
  'SELF-CARE SHEET MASK + TEA SESSION':[0, 0, 1, 0, 1, 0, 1, 1],
  'ONLINE SHOPPING SESSION':[0, 0, 0, 0, 1, 1, 0, 0]
}


In [70]:
def distance(x, y):
  # Euclidean distance between two feature vectors.
  return np.sqrt(np.sum((np.array(x) - np.array(y))**2))

def activity_similarity(activity):
  # Rank the candidate suggestions from nearest to farthest in feature space.
  distances = {}
  for suggestion, features in suggestions_from_features.items():
    distances[suggestion.title()] = distance(activity, features)
  return sorted(distances, key=distances.__getitem__)
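
To make the ranking concrete, the sketch below computes distances from the 'TRIVIA CONTESTS' vector to three of the candidate suggestions, using feature vectors copied from the dictionaries above. The shortened key "Group Games" and the `euclidean` helper are introduced here only for illustration.

```python
import numpy as np

# Feature vector of 'TRIVIA CONTESTS' from activity_to_features.
trivia = [0, 0, 0, 0, 1, 1, 1, 0]
# Three entries from suggestions_from_features ("Group Games" is shortened).
candidates = {
    "Group Games": [0, 0, 0, 0, 1, 1, 1, 0],
    "Painting Social": [1, 0, 0, 0, 1, 1, 1, 0],
    "Workout Session": [1, 1, 1, 0, 0, 1, 0, 0],
}

def euclidean(x, y):
    # Square root of the summed squared differences between the two vectors.
    return float(np.sqrt(np.sum((np.array(x) - np.array(y)) ** 2)))

dists = {name: euclidean(trivia, vec) for name, vec in candidates.items()}
nearest_first = sorted(dists, key=dists.__getitem__)
```

Group Games has an identical vector (distance 0), Painting Social differs in one feature (distance 1), and Workout Session differs in five (distance √5), so they come out in that order.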

In [71]:
standardized_pairings["Suggestion"] = standardized_pairings["Suggestion"].apply(lambda y: activity_similarity(activity_to_features[y[0].strip().upper()]))
# Anonymize
pairings["Suggestion"] = pairings["Suggestion"].apply(lambda y: activity_similarity(activity_to_features[y[0].strip().upper()]))

standardized_pairings
# Anonymize
pairings

Unnamed: 0,0,1,Suggestion
0,1,12,"[Group Games (Among Us, Code Names, Etc.), Vid..."
1,2,13,"[Picnic, Workout Session, Grabbing Food Or Dri..."
2,4,14,"[Picnic, Workout Session, Grabbing Food Or Dri..."
3,5,15,"[Group Games (Among Us, Code Names, Etc.), Vid..."
4,6,16,"[Picnic, Workout Session, Grabbing Food Or Dri..."
5,7,17,"[Picnic, Workout Session, Grabbing Food Or Dri..."
6,8,18,"[Picnic, Workout Session, Grabbing Food Or Dri..."
7,9,19,"[Picnic, Workout Session, Grabbing Food Or Dri..."
8,10,20,"[Cooking/Baking Class, Cooking/Baking Competit..."
9,11,0,"[Self-Care Sheet Mask + Tea Session, Online Sh..."


In [72]:
# for _, r in standardized_pairings.iterrows():
for _, r in pairings.iterrows():
  print("Ranked Suggestions for {} and {}".format(r[0], r[1]))
  for i, activity in enumerate(r['Suggestion']):
    # Left-align the "n." prefix in a 4-character field so names line up.
    print("{:<4}{}".format("{}.".format(i+1), activity))
  print()

Ranked Suggestions for 1 and 12
1.  Group Games (Among Us, Code Names, Etc.)
2.  Video Games
3.  Karaoke
4.  Painting Social
5.  Book Club
6.  Online Shopping Session
7.  Cooking/Baking Class
8.  Cooking/Baking Competition
9.  Picnic
10. Study Together
11. Self-Care Sheet Mask + Tea Session
12. Grabbing Food Or Drinks Together
13. Workout Session

Ranked Suggestions for 2 and 13
1.  Picnic
2.  Workout Session
3.  Grabbing Food Or Drinks Together
4.  Cooking/Baking Class
5.  Cooking/Baking Competition
6.  Painting Social
7.  Self-Care Sheet Mask + Tea Session
8.  Group Games (Among Us, Code Names, Etc.)
9.  Video Games
10. Karaoke
11. Study Together
12. Book Club
13. Online Shopping Session

Ranked Suggestions for 4 and 14
1.  Picnic
2.  Workout Session
3.  Grabbing Food Or Drinks Together
4.  Cooking/Baking Class
5.  Cooking/Baking Competition
6.  Painting Social
7.  Self-Care Sheet Mask + Tea Session
8.  Group Games (Among Us, Code Names, Etc.)
9.  Video Games
10. Karaoke
11. Study To