# Talk Recommender - PyCon 2018

With 32 tutorials, 12 sponsor workshops, 16 talks at the education summit, and 95 talks at the main conference, PyCon has a lot to offer. Reading through all the talk descriptions and picking out the ones you should attend is a tedious process. But not anymore.

## Introducing TalkRecommender
Talk Recommender is a recommendation system that recommends talks from this year's PyCon based on the ones that you went to last year. This way you don't waste time preparing a schedule and get to see the talks that matter most to you!

As shown in the demo, users are asked to label the previous year's talks into two categories: the ones that they went to in person, and the ones they watched later online. Talk Recommender uses those labels to predict which of this year's talks will be interesting to them.

We will be using [`pandas`](https://pandas.pydata.org/) and [`scikit-learn`](http://scikit-learn.org/) to build the model.

*Remember to click on Save and Checkpoint from the File menu to save changes you made to the notebook* 

### Exercise A: Load the data
The data directory contains a snapshot of one such user's labeling. Let's load it and start our analysis!

In [None]:
!ls -lrt data

In [None]:
import pandas as pd
import numpy as np
df=pd.read_csv('data/talks.csv')
df.head()

Here is a brief description of the interesting fields.

variable | description  
------|------|
`title`|Title of the talk
`description`|Description of the talk
`year`|Year of the talk: `2017` or `2018`
`label`|`1` indicates the user preferred seeing the talk in person,<br> `0` indicates they would schedule it for later.

Select the 2017 talk descriptions that were labeled by the user for watching in person.

In [None]:
df[df.label==1][['description', 'label']]

### Exercise B: Feature Extraction
In this step we build the feature set by tokenizing the text descriptions of the talks, then counting and normalizing their unigrams and bi-grams (`ngram_range=(1, 2)`) with TF-IDF weights.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
vectorized_text = vectorizer.fit_transform(df['description'])
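
To get a feel for what the vectorizer produces, here is a minimal sketch on a toy corpus. The three descriptions and the `toy_` names are hypothetical stand-ins for `df['description']`, not part of the workshop data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy descriptions, standing in for df['description']
toy_descriptions = [
    "Intro to machine learning with Python",
    "Machine learning in production",
    "Testing Python web applications",
]

# Same settings as the cell above: unigrams and bi-grams, English stop words removed
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)

# One row per description, one column per unigram or bi-gram
print(toy_matrix.shape)
# The vocabulary maps each n-gram to its column index
print(sorted(toy_vectorizer.vocabulary_))
```

Each row of the resulting sparse matrix is the TF-IDF weighted feature vector for one description.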

Split the `vectorized_text` into two parts: the 2017 talks will be used for training and the 2018 talks will be used for predicting.

In [None]:
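# Only the 2017 talks carry the labels we train on
labels = df[df.year == 2017]['label']
count_labeled = len(df[df.year == 2017])
# Assumes the 2017 rows come first in df, so the first count_labeled rows are the labeled ones
vectorized_text_labeled = vectorized_text[:count_labeled]
vectorized_text_predict = vectorized_text[count_labeled:]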
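
The slicing above relies on the 2017 rows coming first in the file. If that ordering cannot be assumed, one possible alternative is to select rows by position from the `year` column. The sketch below uses a hypothetical toy frame with interleaved years; all `toy_` names are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy frame with the 2017 and 2018 rows interleaved
toy_df = pd.DataFrame({
    "description": ["web testing", "data pipelines", "async web", "data science"],
    "year": [2017, 2018, 2017, 2018],
    "label": [1.0, np.nan, 0.0, np.nan],
})
toy_vectors = TfidfVectorizer().fit_transform(toy_df["description"])

# Row positions for each year, independent of how the rows are ordered
rows_2017 = np.flatnonzero((toy_df["year"] == 2017).to_numpy())
rows_2018 = np.flatnonzero((toy_df["year"] == 2018).to_numpy())

toy_labeled = toy_vectors[rows_2017]          # features used for training
toy_predict = toy_vectors[rows_2018]          # features used for prediction
toy_labels = toy_df["label"].iloc[rows_2017]  # labels aligned with toy_labeled
```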

### Exercise C: Split into Training and Testing Set

Next we split our labeled data into a training set and a testing set. Evaluating the model on talks it has not seen during training gives a more honest estimate of its performance and helps us detect overfitting. Use the `train_test_split` function from `sklearn.model_selection` to split `vectorized_text_labeled` into training and testing sets, with the test set holding about 30% of the labeled talks.

[Here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) is the documentation for the function.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(vectorized_text_labeled, labels, test_size=.3)
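
Because the split is random, each run of the cell above gives slightly different results. As a sketch of one way to make it repeatable and keep the label ratio similar in both parts, `train_test_split` accepts `random_state` and `stratify`. The arrays below are hypothetical stand-ins for `vectorized_text_labeled` and `labels`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for vectorized_text_labeled and labels
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0] * 14 + [1] * 6)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.3,    # hold out about 30% of the rows
    random_state=42,  # fixed seed so the split is repeatable
    stratify=toy_y,   # keep the 0/1 ratio similar in both parts
)
print(X_tr.shape, X_te.shape)
```

Stratifying is most useful when one label is much rarer than the other, which is likely when a user attends only a few talks in person.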

### Exercise D: Train the model
Finally we get to training the model. We are going to use a linear support vector machine and check its performance with the `classification_report` function, which reports precision, recall and F1-score for each class. Note that we have not done any parameter tuning yet, so your model might not give you the best results. Feel free to tweak the parameters or use a different model to get a better result.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# Fit a linear SVM on the training split
classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Evaluate on the held-out test split
y_pred = classifier.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
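
As a starting point for the parameter tuning mentioned above, the sketch below searches over the regularization strength `C` of `LinearSVC` with `GridSearchCV`. The synthetic data from `make_classification`, the grid values and the `f1` scoring choice are all assumptions for illustration; with the workshop data you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the training split
toy_X, toy_y = make_classification(n_samples=120, n_features=10, random_state=0)

search = GridSearchCV(
    LinearSVC(max_iter=10000),             # higher max_iter helps convergence
    param_grid={"C": [0.01, 0.1, 1, 10]},  # candidate regularization strengths
    scoring="f1",                          # score each candidate by F1 on the positive class
    cv=5,                                  # 5-fold cross validation
)
search.fit(toy_X, toy_y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all the data passed to `fit` using the best value of `C`.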

### Exercise E: Make Predictions
Use the model to predict which 2018 talks the user should go to. Print out the talk descriptions.

In [None]:
predicted_talks_vector = classifier.predict(vectorized_text_predict)
df_2018 = df[df.year==2018]
# Offset the row positions by count_labeled, since the 2018 rows follow the 2017 rows in df
predicted_talks_indexes = predicted_talks_vector.nonzero()[0] + count_labeled
df_2018.loc[predicted_talks_indexes][['id', 'description', 'presenters', 'title', 'location']]
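
`predict` only gives a yes/no answer for each talk. If you would rather see the talks ordered from most to least recommended, one option is to sort by the signed distance from the decision boundary returned by `decision_function`. The sketch below uses hypothetical toy texts and labels rather than the workshop data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical labeled talks (1 = attended in person) and unlabeled new talks
train_texts = [
    "machine learning with python",
    "deep learning models",
    "web frameworks and templates",
    "deploying web servers",
]
train_labels = [1, 1, 0, 0]
new_texts = ["learning from data", "building web apps", "python packaging"]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_texts), train_labels)

# Larger scores mean the model leans more strongly towards label 1
scores = clf.decision_function(vec.transform(new_texts))
ranking = np.argsort(scores)[::-1]  # positions sorted from highest to lowest score

for position in ranking:
    print(round(float(scores[position]), 3), new_texts[position])
```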

### Exercise F: Expose it as a service

Now that you have the pieces of the code ready, copy the code from the cells above into the body of the `prediction` function in the `model.py` file located in this folder.

Then rebuild the Docker image and start a new container.
```
docker stop <container_name>
docker build -t recommender .
docker run -p 8888:8888 -p 9000:9000 recommender
```

The `api.py` file in this directory is a flask app that makes call to the `model.py` module and exposes the model built in the previous steps as a service. In order to start the flask server, open a new terminal and run the following command:

```
docker exec $(docker ps -ql) python api.py
```
Here `docker ps -ql` returns the ID of the most recently created container on your system.

Finally, go to http://0.0.0.0:9000/predict to see the talks that were recommended for this user.

### Exercise G: Serialize the model

Finally, we do not want to retrain our model every time we have to make predictions. In most real-life data science applications, the training phase is a time-consuming process. Instead, we train and serialize the model separately, and the API loads the serialized model to make predictions.

In [None]:
# joblib is a standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import joblib
with open('talk_recommender.pkl', 'wb') as f:
    joblib.dump(classifier, f)

Use the `joblib.load` function to read the `classifier` back from the `talk_recommender.pkl` file.
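
A minimal round-trip sketch, assuming the standalone `joblib` package is installed. It trains a throwaway classifier on synthetic data and writes to a temporary directory so that it does not overwrite your own `talk_recommender.pkl`; the `toy_` names are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Throwaway classifier trained on synthetic data, standing in for `classifier`
toy_X, toy_y = make_classification(n_samples=60, n_features=8, random_state=0)
toy_classifier = LinearSVC(max_iter=10000).fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "talk_recommender.pkl")
    joblib.dump(toy_classifier, model_path)  # serialize to disk
    restored = joblib.load(model_path)       # read it back

# The restored model should give the same predictions as the original
same_predictions = bool((restored.predict(toy_X) == toy_classifier.predict(toy_X)).all())
print(same_predictions)
```

Only load pickle files that you created yourself or that come from a source you trust, since unpickling can execute arbitrary code.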