# Capstone Check-in 2
---

### Project:

**NLP Recommender System for Movies** (Idea #1 from Check-in 1)

### Problem Statement:
Recommender systems for media often ignore an important question: "What do you actually feel like watching right now?" The proposed solution is a recommender system that takes a free-form text input describing a person's momentary preferences and recommends movies to watch based on that input. Performance will be evaluated and improved using F1 scores that compare machine recommendations to human recommendations. The ranking of the results will not be evaluated, since the user's true preferences cannot be known.

### Methods and Goals:

Data will be collected from Reddit, specifically the subreddit /r/moviesuggestions. Using the posts in this request-suggestion format, I will use NLP (spaCy) to identify and tag movie titles from a list of films and build a database that connects request text to movie suggestions.
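As a rough illustration of the title-tagging step, the sketch below matches comment text against a known list of films. It is a plain-regex stand-in that uses only the standard library, not the spaCy pipeline itself; a spaCy `PhraseMatcher` would be the natural replacement. The helper names (`build_title_pattern`, `tag_titles`) and the short film list are hypothetical.

```python
import re

def build_title_pattern(titles):
    """Compile one case-insensitive pattern that matches any known title."""
    # Longest titles first, so a long title wins over a shorter title it contains.
    ordered = sorted(titles, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # Lookarounds prevent matches inside longer words (e.g. "Hushed").
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)

def tag_titles(text, pattern, canonical):
    """Return the known titles mentioned in text, in order of first appearance."""
    found = []
    for match in pattern.finditer(text):
        title = canonical[match.group(0).lower()]
        if title not in found:
            found.append(title)
    return found

# Hypothetical film list; the real list would come from a movie database.
titles = ["Hush", "They Live", "Time Bandits", "The Ninth Gate"]
canonical = {t.lower(): t for t in titles}
pattern = build_title_pattern(titles)

comment = "Hush (2016) and they live, maybe Time Bandits too"
print(tag_titles(comment, pattern, canonical))
```

One-word titles such as "Hush" are also ordinary words, so a matcher like this will produce false positives that a later filtering step would need to handle.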

From here there are several possible approaches for the recommender. A relatively simple content-based approach that uses important words in the corpus is likely the first step. For some posts it may be possible to combine this with a collaborative recommender, if the user includes movie titles as part of their request. Other possibilities to explore include sentiment analysis and document similarity. Such a multifaceted system would need to decide which models to use and how to weight them, for example by applying sentiment analysis recommendations only to documents that exhibit strong sentiment.
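One way the simple content-based step could look is sketched below: weight words by TF-IDF, find the past request most similar to the new input by cosine similarity, and return the movies suggested for it. This is a minimal standard-library sketch under assumed inputs, not the final design; the function names and the toy request/suggestion pairs are hypothetical, and a library vectorizer such as scikit-learn's `TfidfVectorizer` would likely replace the hand-rolled weighting in practice.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(query, requests, suggestions, top_n=1):
    """Return movies suggested for the top_n past requests most similar to query."""
    vectors, idf = tfidf_vectors(requests)
    # Words never seen in the corpus get zero weight.
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    ranked = sorted(range(len(requests)), key=lambda i: cosine(q, vectors[i]), reverse=True)
    recs = []
    for i in ranked[:top_n]:
        for movie in suggestions[i]:
            if movie not in recs:
                recs.append(movie)
    return recs

# Toy data for illustration only; real pairs would come from the scraped posts.
requests = [
    "Looking for a cozy movie for a rainy evening",
    "I want a scary home invasion thriller",
]
suggestions = [["Time Bandits", "The Ninth Gate"], ["Hush"]]

print(recommend("something cozy for tonight", requests, suggestions))
```

Matching the new input against past requests, rather than against movie descriptions, keeps the sketch within the request-suggestion data described above.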

Despite the complexity above, a working system meeting one or two of the above goals should be achievable with time to build on the system and experiment. More complex NLP tools will be explored if the above goals are all achieved.

### The Data:

Based on early EDA using recent submissions to /r/moviesuggestions, there are likely at least 10,000 submissions that can be used for this system. The forum currently sees over 30 submissions per day, about 85% of which are requests for movies to watch. Recent request posts average about 22 replies, though not every reply contains a recommendation. I will assume that replies with negative scores are bad recommendations and exclude them from the data.

More work is needed here: the comments will require their own dataframe with comment-specific data (more than just text). These can be matched to the submission text/self-text dataframe using the submission's 'id' value.
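A possible shape for the comment dataframe and the join back to submissions is sketched below. The comment column names (`submission_id`, `score`, `body`) and all score values are assumptions made for illustration; only the submission `id` and `title` columns mirror the scraped data.

```python
import pandas as pd

# Toy submissions frame; 'id' and 'title' mirror columns in the scraped data.
submissions = pd.DataFrame({
    "id": ["itl4by", "itjyza"],
    "title": ["Similar to The Purge series or 31", "Cozy vibe movies"],
})

# Hypothetical comment-level frame: column names and scores are made up.
comments = pd.DataFrame({
    "submission_id": ["itl4by", "itl4by", "itjyza"],
    "score": [5, -2, 3],
    "body": ["Hush (2016)", "a downvoted reply", "Time Bandits"],
})

# Drop replies with negative scores, treating them as bad recommendations.
kept = comments[comments["score"] >= 0]

# Attach each remaining comment to its submission via the submission id.
merged = kept.merge(submissions, left_on="submission_id", right_on="id", how="inner")
print(merged[["id", "title", "body", "score"]])
```

Keeping one row per comment makes it straightforward to filter on score or other comment-level fields before any title tagging.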

With this data source, there may be a strong tendency for the recommender to default to very popular movies for most input documents. I will have to research how to deal with that, if it is a problem.

### EDA:

Some information and analysis of this subreddit can be found here:

https://subredditstats.com/r/moviesuggestions

This site has graphs and estimates of posts and comments per day, as well as the history of posting on this subreddit, including a very mysterious spike in activity beginning in April 2020. What could that be about? Were we all just sitting around watching movies? Yes, we were.

We can see that for the past year and a half, activity has generally averaged 30+ posts per day and 300+ comments per day.

In [1]:
import pandas as pd

In [4]:
df = pd.read_csv('./data/moviesuggestions_data.csv')
df.head()

| | index | created_utc | id | link_flair_css_class | num_comments | selftext | title | comments | assigned_comments |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020-09-15 19:28:59 | itl4by | request | 13 | Looking for movie that involveds the hunting ... | Similar to The Purge series or 31 | ['Hush (2016)', '"They Live-1988", does it cou... | 12 |
| 1 | 2 | 2020-09-15 18:40:25 | itkci7 | request | 11 | I am a fan of adult comedies like the American... | Adult Comedies? | ['can’t be judging movies off of trailers smh'... | 11 |
| 2 | 4 | 2020-09-15 18:16:15 | itjyza | request | 11 | Could be any genre. Im looking for the movies ... | Cozy vibe movies | ['Time Bandits', 'The Ninth Gate\n\nCastaway\n... | 11 |
| 3 | 5 | 2020-09-15 17:37:13 | itjc7y | request | 11 | I’m 15, and I really love movies (plus I want ... | Movie(s) that you think everyone should watch,... | ['Tokyo Drifter\n\nSympathy for Lady Vengeance... | 11 |
| 4 | 7 | 2020-09-15 15:28:09 | itgyzn | suggest | 1 | It's an amazing movie. Nor many people have he... | Marrowbone | ['Your Post was Removed because [Marrowbone] h... | 1 |


#### Currently, the comments are not in an easily usable format and lack some useful information. For these reasons, I will build a new database for just the comments, to facilitate filtering and NLP.

In [5]:
df['link_flair_css_class'].value_counts(normalize=True)

request    0.861736
suggest    0.138264
Name: link_flair_css_class, dtype: float64

#### 86% of the latest posts were requests, rather than unsolicited suggestions.

In [7]:
len(df)

316

#### 500 posts were scraped with the API, but only 316 had not been removed (by moderators, administrators, etc.).

#### If these numbers hold as more data is gathered, the past year and a half should yield over 8,000 usable posts. If there are 30 posts per day, (316/500 =) 63.2% of posts are not removed, and 85% of posts are requests, then about 8,800 posts from the last 18 months will be usable. Since I don't have any information about posting activity before that time, I cannot estimate how much more data can be obtained, though it is likely a considerable amount.
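The estimate above can be reproduced with a few lines of arithmetic. The only added assumption is that 18 months is treated as 1.5 × 365 days; the variable names are illustrative.

```python
posts_per_day = 30            # approximate daily submissions
days = 1.5 * 365              # assumption: 18 months taken as 1.5 years
kept_fraction = 316 / 500     # share of scraped posts that were not removed
request_fraction = 0.85       # share of posts that are requests

usable_posts = posts_per_day * days * kept_fraction * request_fraction
print(round(usable_posts))
```

This gives a figure a little above 8,800, consistent with the rounded estimate in the text.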

In [6]:
df.groupby(by='link_flair_css_class').mean(numeric_only=True)

| link_flair_css_class | index | num_comments | assigned_comments |
|---|---|---|---|
| request | 252.402985 | 22.063433 | 20.753731 |
| suggest | 230.046512 | 10.209302 | 9.255814 |


#### Recently, requests for recommendations receive an average of 21 responses, but I don't yet have information about how many recommendations are being made. A comment might contain no recommendations, or several.

### Evaluation

While developing the system, the data will be split into training and testing sets. Recommendations made by the system will be compared to actual recommendations to find precision, sensitivity (recall), and F1 scores for each request. Any metric that counts true negatives (movies that were recommended by neither human nor computer) will always score extremely high, since the vast majority of movies fall into this category for any given document. The number of responses varies from post to post, which will cause some problems here, so some limitations (such as comparing only the top x recommendations) may be necessary when evaluating the recommendations.
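A minimal sketch of this per-request evaluation is shown below, treating the system's output and the human suggestions as sets of titles and optionally keeping only the top `top_k` system recommendations. The function name, the `top_k` parameter, and the example lists are hypothetical.

```python
def precision_recall_f1(predicted, actual, top_k=None):
    """Compare predicted titles to human-suggested titles for one request.

    predicted is assumed to be ordered best-first; top_k keeps only the
    first top_k predictions. True negatives are never counted.
    """
    if top_k is not None:
        predicted = predicted[:top_k]
    pred_set, actual_set = set(predicted), set(actual)
    true_positives = len(pred_set & actual_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(actual_set) if actual_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative lists only, not real system output.
predicted = ["Hush", "They Live", "Time Bandits", "The Ninth Gate"]
actual = ["Hush", "They Live", "Castaway"]

precision, recall, f1 = precision_recall_f1(predicted, actual, top_k=3)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the scores depend only on the overlap between the two sets, the ordering of recommendations affects the result only through the `top_k` cutoff.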