# Report
# Project Name - WhatTheRec

## Introduction and Problem Statement
In the present scenario, a plethora of events are being held daily at different places and different times. Hence, it is difficult for a user to keep track of all the events and choose between various events which to attend. The problem worsens because the users do not know which events would be interesting and relevant to them. Hence, it is important that an event sharing social network use recommenders to suggest relevant and interesting events to its users which help them choose between various events. While recommending events, another problem whiich arises is that of unseen data. We do not know for sure which users will attend what events, especially in the case of new users. This is often called the cold start problem. There are many Event Based Social Network (EBSN) websites like meetup.com, eventbrite.com etc., which could deploy this algorithm in order to successfully suggest interesting and relevant events to its users.


## Related Work
There has been some work done in hybrid approaches in event recommendation by Minkov et al. [1]. This paper demonstrates collaborative filtering on a dataset of academic events. Our implementation of the recommender system follows [2], which also utilizes other signals such as a group of users, location, and temporal preferences. Khrouf and Troncy [3] for recommending music related events, utilized category information about different artists from a well-known source. But this approach fails when we do not have this information about every event, as events can be across different domains. Our implementation takes into account RSVP and group information and other contextual data. Recent works [4] have shown that pure matrix factorization (based only on user-event interactions) performs poorly on EBSN data in comparison to other methods due to a high level of the sparsity of these datasets. As per experiments carried out, state-of-the-art matrix factorization algorithms did not perform better than simple collaborative filtering algorithms such as user-based k-NN. Thus the focus is on considering the explicit features than the latent ones.

## Approach & Methods

Data Acquisition - Initially we fetched data from meetup.com using their public apis. As we started developing, we realized the data was too less as this data is super sparse in nature. So, when we again starting getting data, they changed their apis and a lot of them were broken. So, we decided to contact the the researchers and got the raw data from them. We cleaned/parsed the data and used that for our project. We used data from Jan 1, 2010 - Jan 2014. 
The data contains the following information:
1. Events: event id, description of event, event group, users who rsvped to this event.
2. Members: member id, lat long
3. Groups: events organized by the group, members involved in the group

We, then, divided the data into timestamps of 6 months. For example, one timestamp would be July 1, 2010. Then, the first 6 months were used for training and the last 6 months for testing. We extracted the following features to recommend events: time, location, description and group frequency of event. 

a. Content-based - We formed a model of a user using the description of the past events he attended. Then found similarity of potential events with the user model. 

b. Location-based - We fitted a gausian distribution to the location of the past events of a user. Then found the probability of new event according to the curve. 

c. Group-Frequency - The intuition here is that the likelihood of the target user attending an event depends on the number of events this user attended in the group that event belongs to. 

d. Time-Aware - Another important factor that affect users' decision on attending an event is when the event occurs. We capture this intuition by assuming that users that attended events in the past at certain days of the week and at certain hours of the day will likely attend events with a similar temporal profile in the future.

Now, that we have built a model of the data, we then computed similarity scores for each member with the events that they had rsvp'd. Events in the first half were used to train the model and the events in the second half were used to test it. Note that we knew the events which the members had rsvp'd for the both the halves. So, the rsvp's for the second half were used to evaluate the model. Here's our index.py file, which is what you need run. 

First, generate the list of best users who have the most RSVP's

Now, you run our recommender index file. Type the name of cities you want to run the recommender on: 
Note: We have data from only Chicago, San jose and Phoenix.

Insert Parameters Below:

In [2]:
import os
os.chdir('src')
city = 'LCHICAGO' # or 'LSAN JOSE', or 'LPHOENIX'
algolist = ['rf']
number_of_members = 50


In [3]:
from preprocessing import *
import argparse
from partition import *
from content.content_recommender import ContentRecommender
from temporal.time_recommender import TimeRecommender
from location.location_recommender import LocationRecommender
from group_frequency.grp_freq_recommender import GrpFreqRecommender
from hybrid.learning_to_rank import LearningToRank
import datetime
import time

#number of seconds in 6 months
train_data_interval = ((364 / 2) * 24 * 60 * 60)


ImportError: cannot import name MLPClassifier

Training and Testing using Content Features:

In [4]:

def content_classifier(training_repo, test_repo, timestamp, simscores, test_members):

    training_events_dict = training_repo['members_events']
    potential_events = list(test_repo['events_info'].keys())

    contentRecommender = ContentRecommender()
    contentRecommender.train(training_events_dict, training_repo)
    test_events_vec = contentRecommender.get_test_events_wth_description(test_repo, potential_events)

    #TEST FOR BEST USERS
    for member in test_members:
         contentRecommender.test(member, potential_events, test_events_vec, simscores)


Training and Testing using Time Features

In [5]:
def time_classifier(training_repo, test_repo, timestamp, simscores, test_members):

    training_events_dict = training_repo['members_events']
    potential_events = list(test_repo['events_info'].keys())

    timeRecommender = TimeRecommender()
    timeRecommender.train(training_events_dict, training_repo)
    test_events_vec = timeRecommender.get_test_event_vecs_with_time(test_repo, potential_events)

    #TEST FOR BEST USERS
    for member in test_members:
         timeRecommender.test(member, potential_events, test_events_vec, simscores)


Training and Testing using Location feature

In [6]:
def loc_classifier(training_repo, test_repo, timestamp, simscores, test_members):
    training_events_dict = training_repo['members_events']
    potential_events = list(test_repo['events_info'].keys())

    locationRecommender = LocationRecommender()
    locationRecommender.train(training_events_dict, training_repo)

    #TEST FOR BEST USERS
    for member in test_members:
        locationRecommender.test(member, potential_events, test_repo, simscores)
        

Training and Testing using Group Frequency feature:

In [7]:

def grp_freq_classifier(training_repo, test_repo, timestamp, simscores, test_members):
    training_events_dict = training_repo['members_events']
    potential_events = list(test_repo['events_info'].keys())

    grp_freq_recommender = GrpFreqRecommender()
    grp_freq_recommender.train(training_events_dict, training_repo)
    
    for member in test_members:
         grp_freq_recommender.test(member, potential_events, test_repo, simscores)

    


Now, extract the data from csv file into python dictionaries.

In [8]:
def check_and_run_local_crawler():
    if os.path.isdir("../crawler/cities/LCHICAGO") and os.path.isdir("../crawler/cities/LSAN JOSE")\
            and os.path.isdir("../crawler/cities/LPHOENIX"):
        if len(os.listdir("../crawler/cities/LCHICAGO")) >= 5 and len(os.listdir("../crawler/cities/LSAN JOSE"))>=5\
                and len(os.listdir("../crawler/cities/LPHOENIX")) >= 5:
            return
    os.chdir("../crawler")
    os.system("python local_crawler.py")
    os.chdir("../src")

In [9]:
def run_script(number_of_members):
    os.chdir("scripts")
    os.system("python script.py --number " + str(number_of_members))
    os.chdir("..")

In [10]:
def main():
    
    check_and_run_local_crawler()
    print "Building best user database ..."
    run_script(number_of_members)
    print "Best users extracted."

    group_members, group_events, event_group = load_groups("../crawler/cities/" + city + "/group_members.json",
                                                            "../crawler/cities/" + city + "/group_events.json")
    events_info = load_events("../crawler/cities/" + city + "/events_info.json")
    members_info = load_members("../crawler/cities/" + city + "/members_info.json")
    member_events = load_rsvps("../crawler/cities/" + city + "/rsvp_events.json")

    repo = dict()
    repo['group_events'] = group_events
    repo['group_members'] = group_members
    repo['events_info'] = events_info
    repo['members_info'] = members_info
    repo['members_events'] = member_events
    repo['event_group'] = event_group
    
    #simscores_across_features is a dictionary to store similarity score obtained for each feature
    #for each member and for a given event. For example in case of content classifer we will
    #access the similarity score as follows: simscores['content_classifier'][member_id][event_id].
    #We will pass only a specific subdictionary (Ex: simscores['content_classifier']) to the
    #classifier functions, which will work on them and populate them.
    
    simscores_across_features = defaultdict(lambda :defaultdict(lambda :defaultdict(lambda :0)))
    hybrid_simscores = defaultdict(lambda :defaultdict(lambda :0))

    start_time = 1262304000 # 1st Jan 2010
    end_time = 1388534400 # 1st Jan 2014
    timestamps = get_timestamps(start_time, end_time)
    timestamps = sorted(timestamps, reverse=True)
    count_partition = 1

    f_temp = open('temp_result.txt', 'w+')
    f_temp.write("Using classification algorithms : " + str(algolist) + " and number of members as : " +\
                 str(number_of_members) + "\n")

    for t in timestamps:
        start_time = t - train_data_interval
        end_time = t + train_data_interval
        test_members = []
        f = open("scripts/"+city + "_best_users_" + str(start_time) + "_" + str(end_time) + ".txt", "r")
        for users in f:
            test_members.extend(users.split())
        f.close()
        test_members = test_members[:number_of_members]
        print "Partition at timestamp ", datetime.datetime.fromtimestamp(t), " are : "
        training_repo, test_repo = get_partitioned_repo_wrapper(t, repo)
        print "Partitioned Repo retrieved for timestamp : ", datetime.datetime.fromtimestamp(t)

        training_members = set(training_repo['members_events'].keys())
        test_members =  training_members.intersection(set(test_members))
        test_members = list(test_members)
        
        #Calling content based classifer train and test functions from here. Pass the repo
        #as an argument to these functions.
        start = time.clock()
        print "Starting Content Classifier"
        content_classifier(training_repo, test_repo, t, simscores_across_features['content_classifier'],\
                           test_members)
        print "Completed Content Classifier in ", time.clock() - start, " seconds"

        start = time.clock()
        print "Starting Time Classifier"
        time_classifier(training_repo, test_repo, t, simscores_across_features['time_classifier'], test_members)
        print "Completed Time Classifier in ", time.clock() - start, " seconds"

        start = time.clock()
        print "Starting Location Classifier"
        loc_classifier(training_repo, test_repo, t, simscores_across_features['location_classifier'], test_members)
        print "Completed Location Classifier in ", time.clock() - start, " seconds"

        start = time.clock()
        print "Starting Group Frequency Classifier"
        grp_freq_classifier(training_repo, test_repo, t, simscores_across_features['grp_freq_classifier'],\
                            test_members)
        print "Completed Group Frequency Classifier in ", time.clock() - start, " seconds"

        f_temp.write("============== Starting classification for partition : " +  str(count_partition) +\
                     " ===================\n")
        learningToRank = LearningToRank()
        learningToRank.learning(simscores_across_features, test_repo["events_info"].keys(), \
                                test_repo["members_events"], test_members, f_temp, algolist, \
                                number_of_members, count_partition)
        f_temp.write("============== Starting classification for partition : " +  str(count_partition) +\
                     " ===================\n")
        
        count_partition += 1
    
    f_temp.close()

if __name__ == "__main__":
    main()


Building best user database ...
Best users extracted.
Partition at timestamp 

NameError: global name 'datetime' is not defined

## Evaluation

We divided an year's data in 2 slots. So we build the model of the user from the rsvpd events in the first slot, and then for each of the event in the next slot, do the recommendation. We do a k fold setup to evaluate recommendations. For 80% of the users, we train the classifier and then calculate precision, recall and f-score for the remaining 20% of the users.

For a new user(who didn't attend any event in the fist slot), we recommend the event which occurs closest to user's location.

## Conclusion and Next Steps
In sum, we propose a recommender system for event based social networks. We resolved the cold start problem associated with the event recommendations by using content based features: location, time, description and group frequency. We got #### results. 

Further, the next steps would be to use data from other major sites like eventbrite.com, facebook.com and evaluate our recommendation on it. Another work could be to use more features for the system, like: ####

## References
[1] E. Minkov, B. Charrow, J. Ledlie, S. Teller, and T. Jaakkola, Collaborative future event recommendation, In Proc. of CIKM, pages 819-828, 2010.

[2] Augusto Q. Macedo, Leandro B. Marinho and Rodrygo L. T. Santos, Context-Aware Event Recommendation in Event-based Social Networks, In Proc. of Recsys, pages 123-130, 2015.

[3] H. Khrouf and R. Troncy. Hybrid event recommendation using linked data and user diversity, In Proc. of RecSys, pages 185-192, 2013.

[4] A. Q. Macedo and L. B. Marinho, Event recommendation in event-based social networks, In Proc. of Int. Work. on Social Personalization, 2014.