# TripAdvisor Data Challenge
Prompt: https://drive.google.com/file/d/1jW9NvvpqDKOHrtv3PaKSQryMZekVGUny/view?usp=sharing


Author: Jackson Barkstrom

Team: Jackson Barkstrom, Rohan Dalvi, Alex Miao, Eric Liu

Motivation: we look at user viewing patterns to answer the question, "What is the probability that a user viewed this hotel given that they viewed that hotel?" Our final dataframe answers it with a matrix of these conditional probabilities: each row is the hotel that was viewed, and each column is another hotel the same user may also have viewed. Note that the matrix ignores time, so we don't know which hotel was viewed first.
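
As a minimal sketch of what such a matrix contains, assume a toy activity log with the same `user_id` and `hotel_id` column names as the real data (the values are made up for illustration). The entry in row A, column B estimates P(viewed B | viewed A) as the number of users who viewed both hotels divided by the number who viewed A. This sketch counts each user once; the notebook code instead averages over activity rows, so the two should agree only when a user appears at most once per hotel.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; only the column names mirror the real file.
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 4],
    "hotel_id": ["A", "B", "A", "C", "A", "B"],
})

# Binary user x hotel indicator: 1 if the user viewed the hotel at least once.
viewed = (pd.crosstab(toy["user_id"], toy["hotel_id"]) > 0).astype(int)

# co_views.loc[a, b] = number of users who viewed both a and b.
co_views = viewed.T @ viewed

# The diagonal holds the number of users who viewed each hotel.
viewers = pd.Series(np.diag(co_views.to_numpy()), index=co_views.index)

# Divide each row by its hotel's viewer count to get P(column | row).
cond_prob = co_views.div(viewers, axis=0)

print(cond_prob.round(2))
```

Here three users viewed hotel A and one of them also viewed B, so `cond_prob.loc["A", "B"]` is 1/3, while `cond_prob.loc["B", "A"]` is 1/2: the matrix is not symmetric.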

In [313]:
import pandas as pd
import numpy as np

In [314]:
# exploratory data analysis was pretty much already done for us
# see https://drive.google.com/open?id=15Mpsa6-dQ2K9sNtIenR8Gdq29DCYWk6r

In [423]:
hotels = pd.read_csv("./hotel_data.csv")
activity = pd.read_csv("./activity_data.csv")

In [None]:
# drop hotels with few activity rows (we won't be recommending them)

threshold = 2000
value_counts = activity['hotel_id'].value_counts()  # activity rows per hotel
to_remove = value_counts[value_counts <= threshold].index

# keep whole rows (not just the hotel_id column) so user_id is still available below
activity_clean = activity[~activity['hotel_id'].isin(to_remove)]

# start from an all-zero matrix: one row per remaining hotel, one column per user
array = activity_clean['user_id'].unique()
n = len(array)
h = len(activity_clean['hotel_id'].unique())

matrix = np.zeros((h, n))
df = pd.DataFrame(matrix, columns=array)

df.index = activity_clean['hotel_id'].unique()

# create matrix with people and what hotels they viewed

for hotel_id in activity_clean['hotel_id'].unique():
    data = activity_clean.loc[activity_clean['hotel_id'] == hotel_id, 'user_id'].values
    for user_id in data:
        df.loc[hotel_id, user_id] = 1

df = df.transpose()

# export matrix with people as index and the hotels they viewed as columns
# (this is a pickle file, so use a .pkl extension rather than .csv)

df.to_pickle('./person_data.pkl')

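# attach each user's row of hotel indicators to every activity record for that user

result_by_person = pd.merge(activity, df, left_on = 'user_id', right_index = True, how = 'left')

# drop rows with missing values (including users who viewed no hotel above the threshold),
# then keep only hotel_id and the hotel indicator columns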

hotel_groupings = result_by_person.dropna().drop(['user_id', 'user_country', 'device', 'user_action'], axis=1)

# then take the mean per hotel: each row gives, for that hotel's activity rows,
# the share whose user also viewed each column's hotel (no time order is implied)
# for hotels above the threshold, the column for the same hotel is 1 by construction
# results look good overall: some fairly high values, including a couple around 0.2

final_hotel_data = hotel_groupings.groupby(['hotel_id']).mean()

final_hotel_data.to_pickle('./hotels.pkl')
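
One way the saved matrix might be used is to look up the hotels most often co-viewed with a given hotel. The sketch below is an assumption about usage, not part of the original analysis: `top_related` is a hypothetical helper, and the small `demo` matrix is made up so the example runs without the challenge data.

```python
import pandas as pd


def top_related(prob_matrix: pd.DataFrame, hotel_id, k: int = 5) -> pd.Series:
    """Return the k columns with the highest values in the row for `hotel_id`.

    `prob_matrix` is assumed to be laid out like final_hotel_data:
    rows are the hotel that was viewed, columns are the other hotels.
    """
    row = prob_matrix.loc[hotel_id]
    # The hotel's own column is always 1, so exclude it from the ranking.
    row = row.drop(labels=[hotel_id], errors="ignore")
    return row.nlargest(k)


# Made-up values for illustration only.
demo = pd.DataFrame(
    [[1.00, 0.20, 0.05],
     [0.30, 1.00, 0.10],
     [0.15, 0.25, 1.00]],
    index=[101, 102, 103],
    columns=[101, 102, 103],
)

print(top_related(demo, 101, k=2))
```

With the real output, the call would presumably look like `top_related(pd.read_pickle('./hotels.pkl'), some_hotel_id)`, where `some_hotel_id` is a placeholder for an id present in the matrix index.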