In this assignment, you will build recommender systems based on users' interests and past activities. The dataset was drawn from [**CoMeT**](http://pittsburgh.comettalks.com), a social website for talks and seminars. It lists local Pittsburgh talks and online webinars for research communities and enthusiasts.

The dataset contains two parts: training set and test set.

The training set has 3 files:
- train_talk.txt
- train_user_bookmark.txt
- train_user_view.txt

The test set has 4 files:
- test_talk.txt
- test_user_bookmark.txt
- test_user_view.txt
- test_user.txt

**Note**: The test set excludes any information about the test users.

### Training Dataset
First, randomly split the data into training/validation/test sets at roughly 80/10/10. You can split by users and keep all talks, OR split by talks and keep all users.
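
The user-based variant of the split could look like the following minimal sketch. The helper name `split_users`, the fixed seed, and the toy id range are assumptions made for illustration; they are not part of the assignment.

```python
import numpy as np

def split_users(user_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Randomly partition unique user ids into train/validation/test groups."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(np.asarray(user_ids))   # de-duplicate the ids
    shuffled = rng.permutation(unique_ids)         # random order, reproducible via seed
    n_train = int(round(fractions[0] * len(shuffled)))
    n_val = int(round(fractions[1] * len(shuffled)))
    train = set(shuffled[:n_train].tolist())
    val = set(shuffled[n_train:n_train + n_val].tolist())
    test = set(shuffled[n_train + n_val:].tolist())  # whatever remains
    return train, val, test

# Toy example with 100 hypothetical user ids.
train_users, val_users, test_users = split_users(range(1, 101))
```

With the real data, the resulting sets could then filter the view log, for example `train_user_view_df[train_user_view_df.user_id.isin(train_users)]`.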

#### Task 1: Content-based Recommendation

1.1) TF-IDF Model
- Convert the textual fields (title and detail) of each talk in [train_talk.txt](train_talk.txt) to TF-IDF vectors. To simplify the task, exclude the talk statistics and the talk-type features. Use the [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) or [gensim](https://radimrehurek.com/gensim/models/tfidfmodel.html) library, or write your own implementation.
- Build the "seen" user profile in the vector space model by averaging the vectors of the talks each user has viewed ([train_user_view.txt](train_user_view.txt)).
- Build a recommender that, from what a user has seen (training set), predicts which talks the user will view (in the validation and test sets).
- Report the accuracy on the training/validation/test sets.
- **(Bonus)** Submit the predicted talks from the challenge test set ([challenge_test_talk.txt](challenge_test_talk.txt)) for each of the ([challenge test user](challenge_test_user.txt)).
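
The steps in 1.1 can be sketched as follows, assuming scikit-learn's `TfidfVectorizer` and cosine similarity as the ranking score (the assignment does not prescribe a similarity measure). The `talks` and `seen` dictionaries are toy stand-ins for the real files, and `recommend` is a hypothetical helper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins: talk_id -> concatenated title and detail text.
talks = {
    1: "ruby rails performance optimize",
    2: "scale jenkin docker apache meso",
    3: "model human communication dynamics",
    4: "rails microservice docker deploy",
}
seen = {"u1": [1, 2]}  # hypothetical user -> ids of talks already viewed

talk_ids = list(talks)
row_of = {t: i for i, t in enumerate(talk_ids)}
# Sparse matrix of shape (n_talks, n_terms).
tfidf = TfidfVectorizer().fit_transform([talks[t] for t in talk_ids])

def recommend(user, k=2):
    """Rank unseen talks by cosine similarity to the mean vector of seen talks."""
    rows = [row_of[t] for t in seen[user]]
    profile = np.asarray(tfidf[rows].mean(axis=0))      # "seen" profile, shape (1, n_terms)
    scores = cosine_similarity(profile, tfidf).ravel()  # one score per talk
    ranked = [talk_ids[i] for i in np.argsort(-scores)
              if talk_ids[i] not in seen[user]]
    return ranked[:k]

top = recommend("u1")
```

In this toy case talk 4 ranks first because it shares the terms "rails" and "docker" with the talks the user has seen, while talk 3 shares none.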

1.2) Word Embedding Model
- Convert the textual fields (title and detail) of each talk in train_talk.txt to word vectors, using word2vec (via [TensorFlow](https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html) or [gensim](https://radimrehurek.com/gensim/models/word2vec.html)), [fastText](https://github.com/facebookresearch/fastText), or [GloVe](https://github.com/stanfordnlp/GloVe). To simplify the task, exclude the talk statistics and the talk-type features. The following pre-trained word embedding models were trained on the latest (Nov 2016) English Wikipedia dump.
    - Gensim Word2Vec (Dim. 100): (shared in the Pitt Box)
    - Facebook fastText (Dim. 100): (shared in the Pitt Box)
    - Stanford GloVe (Dim. 300): [glove.42B.300d.zip](http://nlp.stanford.edu/data/wordvecs/glove.42B.300d.zip)
    
- Build the "seen" user profile in the vector space model by averaging the vectors of the talks each user has viewed.
- Build a recommender that, from what a user has seen (training set), predicts which talks the user will view (in the validation and test sets).
- Report the results on the training/validation/test sets.
- **(Bonus)** Submit the predicted talks from the challenge test set ([challenge_test_talk.txt](challenge_test_talk.txt)) for each of the ([challenge test user](challenge_test_user.txt)).
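
One simple way to turn word vectors into talk vectors is to average the vectors of the words in each talk. This is an assumption, since the assignment leaves the aggregation open. The sketch below uses a hypothetical 3-dimensional `embeddings` dictionary in place of a real pre-trained model, and skips out-of-vocabulary words.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real ones would come from a pre-trained model.
embeddings = {
    "rails":  np.array([1.0, 0.0, 0.0]),
    "docker": np.array([0.8, 0.2, 0.0]),
    "human":  np.array([0.0, 0.0, 1.0]),
    "model":  np.array([0.0, 0.3, 0.9]),
}

def text_vector(text, dim=3):
    """Average the vectors of in-vocabulary tokens; return zeros if none are known."""
    vecs = [embeddings[w] for w in text.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

talk_vecs = {
    "t1": text_vector("rails docker"),
    "t2": text_vector("human model"),
    "t3": text_vector("docker rails deploy"),  # "deploy" is out of vocabulary
}
user_profile = np.mean([talk_vecs["t1"]], axis=0)  # this user has seen only t1
scores = {t: cosine(user_profile, v) for t, v in talk_vecs.items() if t != "t1"}
best = max(scores, key=scores.get)  # most similar unseen talk
```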

1.3) **(Challenge Bonus)** Repeat all of the above with the bookmarked talks ([train_user_bookmark.txt](train_user_bookmark.txt)).

#### Task 2: Collaborative Filtering

From the training set, convert the [train_user_view.txt](train_user_view.txt) into a user-talk matrix.
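
A minimal sketch of this conversion with `pandas.crosstab`, on a toy view log. Collapsing repeated views of the same talk into a single 1 is an assumption. Keeping the raw view counts is an equally valid choice.

```python
import pandas as pd

# Toy view log with the same column names as train_user_view.txt.
views = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "talk_id": [10474, 10506, 10506, 10474, 10446],
})

# Rows are users, columns are talks; 1 if the user viewed the talk at least once.
user_talk = (pd.crosstab(views["user_id"], views["talk_id"]) > 0).astype(int)
```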

2.1) User-based/Item-based Collaborative Filtering
- Follow and adapt this [tutorial article](http://blog.ethanrosenthal.com/2015/11/02/intro-to-collaborative-filtering/).
- Report the results on the training/validation/test sets.
- **(Bonus)** Submit the predicted talks from the challenge test set ([challenge_test_talk.txt](challenge_test_talk.txt)) for each of the ([challenge test user](challenge_test_user.txt)).
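
As a rough illustration of the item-based variant, the sketch below computes cosine similarity between talk columns and scores each user-talk pair by a similarity-weighted average of that user's views. The matrix `R` is toy data, and the function names and the small `eps` constant are assumptions for this example.

```python
import numpy as np

# Rows are users, columns are talks; 1 means "viewed" (toy data).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def item_similarity(R, eps=1e-9):
    """Cosine similarity between talk columns; eps avoids division by zero."""
    norms = np.linalg.norm(R, axis=0) + eps
    return (R.T @ R) / np.outer(norms, norms)

def predict_item_based(R):
    """Score each (user, talk) pair as a similarity-weighted average of the user's views."""
    S = item_similarity(R)
    return (R @ S) / (np.abs(S).sum(axis=0) + 1e-9)

scores = predict_item_based(R)
scores[R > 0] = -np.inf            # never recommend a talk the user already viewed
top_talk = scores.argmax(axis=1)   # column index of the best unseen talk per user
```

The user-based variant follows the same pattern with similarities computed between rows instead of columns.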

2.2) Matrix Factorization
- Follow and adapt this [tutorial article](http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/).
- Report the results on the training/validation/test sets.
- **(Bonus)** Submit the predicted talks from the challenge test set ([challenge_test_talk.txt](challenge_test_talk.txt)) for each of the ([challenge test user](challenge_test_user.txt)).
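
A minimal sketch of matrix factorization by stochastic gradient descent, fitting only the non-zero entries of a toy rating matrix. The hyperparameters (`k`, `steps`, `lr`, `reg`) and the helper name `factorize` are illustrative assumptions. Note that a binary view matrix has only 1s as observed values, so in practice it needs sampled negatives or a count-based signal to be informative.

```python
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit R ~ P @ Q.T by SGD over the non-zero (observed) entries of R."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.random((n_users, k))   # user factors
    Q = rng.random((n_items, k))   # talk factors
    observed = np.argwhere(R > 0)
    for _ in range(steps):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()    # use the pre-update user factors for Q's step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

# Toy matrix; 0 marks an unobserved entry.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
P, Q = factorize(R)
R_hat = P @ Q.T                    # predicted scores, including unobserved entries
mask = R > 0
rmse = float(np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2)))
```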

2.3) **(Challenge Bonus)** Repeat all of the above with the bookmarked talks ([train_user_bookmark.txt](train_user_bookmark.txt)).

#### Task 3: Ensemble - Hybrid Recommendation
3.1) Ensemble 
- Combine the talks' statistics and other metadata, along with any new features you create, with the outputs from Task 1 and Task 2.
- Create an ensemble classifier (linear or non-linear).
- Report the results on the training/validation/test sets.
- **(Bonus)** Submit the predicted talks from the challenge test set ([challenge_test_talk.txt](challenge_test_talk.txt)) for each of the ([challenge test user](challenge_test_user.txt)).
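
One possible linear ensemble is a logistic regression stacked on the scores from Tasks 1 and 2 plus talk metadata. The feature columns and all values below are made up for illustration. Any classifier with a `fit`/`predict_proba` interface could replace `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is one (user, talk) pair; the columns are hypothetical features:
# [content-based score, collaborative-filtering score, log(1 + viewno)].
X = np.array([
    [0.9, 0.8, 4.6],
    [0.8, 0.6, 3.9],
    [0.7, 0.9, 5.0],
    [0.2, 0.1, 2.3],
    [0.1, 0.3, 3.0],
    [0.3, 0.2, 1.6],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 if the user viewed the talk, else 0

clf = LogisticRegression().fit(X, y)

# Probability of a view for two unseen (user, talk) pairs.
p_view = clf.predict_proba([[0.85, 0.7, 4.0], [0.15, 0.2, 2.0]])[:, 1]
```

Ranking each user's candidate talks by `p_view` then yields the hybrid recommendation list.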

3.2) **(Challenge Bonus)** Repeat all of the above with the bookmarked talks ([train_user_bookmark.txt](train_user_bookmark.txt)).

### Assignment Due: Dec 2nd, 2016 midnight

In [1]:
import pandas as pd

### train_talk.txt 
It contains the content of talks in the training dataset. It has 22 variables as shown below:

**Talk's information**
- talk_id: talk identifier
- title: talk's title
- detail: talk's description
- date: talk's date
- begintime: talk's begin time
- endtime: talk's end time

**Basic talk statistics**
- viewno: the number of user views
- bookmarkno: the number of user bookmarks
- emailno: the number of times that users email about this talk

**Talk's Interest Areas**
- biological_science 
- computer_science
- general_interest
- education
- engineering
- geosciences
- math_physics
- social_science
- health_sciences
- arts_humanities
- business_industry
- law
- chemistry


In [2]:
train_talk_df = pd.read_table('train_talk.txt')
train_talk_df.head()

| | talk_id | title | detail | date | begintime | endtime | viewno | bookmarkno | emailno | biological_science | ... | education | engineering | geosciences | math_physics | social_science | health_sciences | arts_humanities | business_industry | law | chemistry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10445 | webcast ruby rails performance optimize | alexander dymo discuss make ruby rails applica... | 8/2/16 0:00 | 8/2/16 13:00 | 8/2/16 14:00 | 51 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 10446 | webcast practical color theory people code | simple pick color palette thing scare develope... | 8/11/16 0:00 | 8/11/16 13:00 | 8/11/16 14:00 | 104 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 10447 | webcast scale jenkin docker apache meso | taking advantage apache meso jenkin platform d... | 8/16/16 0:00 | 8/16/16 13:00 | 8/16/16 14:00 | 100 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 10448 | webcast told microservice | drawing case gonzalo maldonado outline rails o... | 8/30/16 0:00 | 8/30/16 13:00 | 8/30/16 14:00 | 20 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 10458 | evolve critical system | increasingly software considered critical busi... | 8/2/16 0:00 | 8/2/16 12:00 | 8/2/16 13:00 | 139 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### train_user_bookmark.txt
It logs which talks each user bookmarked, and when, in the training set.

In [3]:
train_user_bookmark_df = pd.read_table('train_user_bookmark.txt')
train_user_bookmark_df.head()

| | user_id | talk_id | bookmark_time |
|---|---|---|---|
| 0 | 1 | 10446 | 08/07/2016 15:58:15 |
| 1 | 1 | 10522 | 08/20/2016 18:20:53 |
| 2 | 1 | 10523 | 08/20/2016 18:20:46 |
| 3 | 1 | 10558 | 09/05/2016 16:08:01 |
| 4 | 1 | 10572 | 09/24/2016 19:56:24 |


### train_user_view.txt
It logs which talks each user viewed, and when, in the training set.

In [4]:
train_user_view_df = pd.read_table('train_user_view.txt')
train_user_view_df.head()

| | user_id | talk_id | view_time |
|---|---|---|---|
| 0 | 1 | 10474 | 07/27/2016 12:27:35 |
| 1 | 1 | 10488 | 08/05/2016 12:34:01 |
| 2 | 1 | 10506 | 08/07/2016 15:55:57 |
| 3 | 1 | 10446 | 08/07/2016 15:58:09 |
| 4 | 1 | 10506 | 08/07/2016 16:06:26 |


### challenge_test_talk.txt
It contains the content of talks in the test dataset. It has the same 22 variables as *train_talk.txt*.

In [16]:
challenge_test_talk_df = pd.read_table('challenge_test_talk.txt')
challenge_test_talk_df.head()

| | talk_id | title | detail | date | begintime | endtime | viewno | bookmarkno | emailno | biological_science | ... | education | engineering | geosciences | math_physics | social_science | health_sciences | arts_humanities | business_industry | law | chemistry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9830 | importance measurement decision making science... | adapt defense alter environment response adver... | 10/11/16 0:00 | 10/11/16 15:00 | 10/11/16 16:30 | 209 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 10493 | residential remedy | lecture noon school social work conference cen... | 10/19/16 0:00 | 10/19/16 12:00 | 10/19/16 13:30 | 10 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 10497 | children women sale billion industry | lecture held noon 2017 cathedral learning lunc... | 10/12/16 0:00 | 10/12/16 12:00 | 10/12/16 13:30 | 54 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 10512 | operation research analyse humanitarian sector | abstract operation research analyse humanitari... | 10/10/16 0:00 | 10/10/16 12:00 | 10/10/16 13:20 | 21 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 10535 | model human communication dynamics | human communication dance participant continuo... | 10/21/16 0:00 | 10/21/16 12:30 | 10/21/16 13:30 | 71 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### challenge_test_user_bookmark.txt
It logs which talks each user bookmarked, and when, in the test set. **Note**: the file excludes any information about the test users.

In [17]:
challenge_test_user_bookmark_df = pd.read_table('challenge_test_user_bookmark.txt')
challenge_test_user_bookmark_df.head()

| | user_id | talk_id | bookmark_time |
|---|---|---|---|
| 0 | 5 | 10923 | 10/24/2016 21:56:38 |
| 1 | 82 | 10685 | 10/03/2016 13:10:38 |
| 2 | 1396 | 10512 | 10/03/2016 11:14:38 |
| 3 | 1396 | 10598 | 10/03/2016 11:07:43 |
| 4 | 1396 | 10611 | 10/03/2016 11:09:57 |


### test_user_view.txt
It logs which talks each user viewed, and when, in the test set. **Note**: the file excludes any information about the test users.

In [14]:
train_user_view_df = pd.read_table('train_user_view.txt')
train_user_view_df.head()

| | user_id | talk_id | view_time |
|---|---|---|---|
| 0 | 1 | 10474 | 07/27/2016 12:27:35 |
| 1 | 1 | 10488 | 08/05/2016 12:34:01 |
| 2 | 1 | 10506 | 08/07/2016 15:55:57 |
| 3 | 1 | 10446 | 08/07/2016 15:58:09 |
| 4 | 1 | 10506 | 08/07/2016 16:06:26 |


### test_user.txt
It lists the users for whom the model must produce recommendations.

In [15]:
test_user_df = pd.read_table('test_user.txt')
test_user_df.head()

| | user_id |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 1496 |
| 3 | 1634 |
| 4 | 2418 |
