# Insights

### Which site has the more positive reviews based on the data given?

The best model for predicting which forum a post came from was the Vote Ensemble with DecisionTree, Boosters and CountVectorizer, score = 84%.

The best parameters included:
- Setting the vectorizer to ignore words that appear in fewer than 2 posts and words that occur in more than 90% of the posts.
    - It seems there was a wide variety of words across all posts, which made it harder to pin a word to one sentiment versus another.
- There was an advantage to including two-word pairs (bigrams) alongside single words.
- Not removing stop words helped this model.

This model was highest in sensitivity, meaning it made more false-positive errors than false-negative errors.

```python
{
    'cvec__max_df': 0.9,
    'cvec__max_features': 5000,
    'cvec__min_df': 2,
    'cvec__ngram_range': (1, 2),
    'cvec__stop_words': None,
    'vote__ada__n_estimators': 200,
    'vote__grad_boost__n_estimators': 200,
}
```
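For context, here is a minimal sketch of a pipeline that would accept the parameter keys above. The step names `cvec` and `vote` and the estimator names `ada` and `grad_boost` are read off those keys; the name `tree`, the choice of `AdaBoostClassifier` and `GradientBoostingClassifier` as the "Boosters", the default hard voting, and `random_state` are assumptions, not a record of the original notebook.

```python
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# "ada" and "grad_boost" come from the reported parameter keys;
# "tree" is an assumed name for the decision tree member.
vote = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("ada", AdaBoostClassifier(random_state=42)),
        ("grad_boost", GradientBoostingClassifier(random_state=42)),
    ]
)

pipe = Pipeline([("cvec", CountVectorizer()), ("vote", vote)])

# Best parameters as reported in this section.
best_params = {
    "cvec__max_df": 0.9,
    "cvec__max_features": 5000,
    "cvec__min_df": 2,
    "cvec__ngram_range": (1, 2),
    "cvec__stop_words": None,
    "vote__ada__n_estimators": 200,
    "vote__grad_boost__n_estimators": 200,
}

# Apply the reported values; fitting would then be pipe.fit(X_train, y_train),
# where X_train holds the post text and y_train the forum labels.
pipe.set_params(**best_params)
```

The double-underscore keys address nested parameters: `vote__ada__n_estimators` reaches the `ada` estimator inside the `vote` step, which is the form a `GridSearchCV` parameter grid over this pipeline would use.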

### How accurate was I in predicting a site based off of the comments?

The best model for predicting which forum a post came from was Naive Bayes with TfidfVectorizer, score = 81%.

The best parameters included:
- Setting the vectorizer to ignore words that appear in fewer than 2 posts and words that occur in more than 80% of the posts.
    - So it seems there was a sweet spot of commonly used words, and each forum had its own unique words.
- There was an advantage to including two-word pairs (bigrams) alongside single words.
- Removing English stop words helped.

This model scored high on all metrics but lowest on sensitivity, meaning it made more false-negative errors than false-positive errors.

```python
{
    'tvec__max_df': 0.8,
    'tvec__max_features': 5000,
    'tvec__min_df': 2,
    'tvec__ngram_range': (1, 2),
    'tvec__stop_words': 'english',
}
```
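As a rough sketch, the pipeline below matches the `tvec__` keys above and shows how sensitivity and specificity relate to false negatives and false positives. `MultinomialNB` and the step name `nb` are assumptions (the write-up only says "Naive Bayes"), and the labels are made-up numbers, not results from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# "tvec" comes from the reported parameter keys; "nb" and the use of
# MultinomialNB are assumptions.
pipe = Pipeline(
    [
        (
            "tvec",
            TfidfVectorizer(
                max_df=0.8,
                max_features=5000,
                min_df=2,
                ngram_range=(1, 2),
                stop_words="english",
            ),
        ),
        ("nb", MultinomialNB()),
    ]
)

# Made-up labels for illustration only (1 = the forum treated as positive).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

sensitivity = tp / (tp + fn)  # share of actual positives that were caught
specificity = tn / (tn + fp)  # share of actual negatives that were caught

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

In this toy case there are more false negatives than false positives, so sensitivity comes out lower than specificity, which is the pattern described for the Naive Bayes model.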

### What is my final prediction for giving Kari a site for recommendations of a movie to watch? 

___Based on the VADER sentiment analysis, I would recommend that Kari go to AllThingsHorror.___

1. Based on both compound and individual sentiment scores, AllThingsHorror tends to stay more neutral than HorrorMoviesONLY. HorrorMoviesONLY may have more users who are passionate about horror movies and are therefore more likely to give extreme reviews.

<img src='./images/VADERSent.png'>

2. When combined with the winning predictive models, we see that each forum has a somewhat unique vocabulary.

<img src='./images/o_common.png'>

<img src='./images/h_common.png'>

While HorrorMoviesONLY has words like *new* and *trailer* in its top 15 list, pointing toward new releases, AllThingsHorror uses words like *horror*, *movie*, *really* and *good* more frequently than HorrorMoviesONLY. I would associate these words with more positive reviews in the context of the horror genre. And even though AllThingsHorror is more wordy, its sentiment stays more neutral across more data.
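A top-15 list of this kind can be produced with a simple word count. The sketch below uses only the standard library; `top_words`, the tokenizing pattern, and the example posts are illustrative assumptions, and the original charts may have been built differently (for example from the fitted vectorizer).

```python
import re
from collections import Counter


def top_words(posts, n=15, stop_words=frozenset()):
    """Return the n most frequent lowercase words across posts."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z']+", post.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Made-up posts, not the scraped data.
posts = [
    "New trailer for the new sequel",
    "The trailer looks good",
]
print(top_words(posts, n=3, stop_words={"the", "for"}))
```

Calling `top_words` separately on each forum's posts and comparing the two lists shows which words are distinctive to each forum.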

3. Let's look at the body of the text itself, which seems to indicate an overall preference for less wordy statements.


<img src='./images/all_counts.png'>

When looking at word count, AllThingsHorror really sticks to less wordy text. HorrorMoviesONLY has a lot of variation in its wordiness. This could be because its posts are more subjective takes on a movie by people who feel more passionate. AllThingsHorror would be good for Kari, who wants to find a good movie quickly and doesn't want to read lengthy, passionate discussions.



<img src='./images/wordcount_forum.png'>


This difference between subjective and objective is more apparent in the character counts of the two forums.
- There is a lot of variability in HorrorMoviesONLY, perhaps because its posts are more subjective, whereas AllThingsHorror is roughly normally distributed around 38 characters.
- AllThingsHorror's biggest spike is around 10 characters, whereas HorrorMoviesONLY has several spikes around 10, 35, 50, and 70, with the biggest around 38.

<img src='./images/charcount_forum.png'>
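The word and character counts behind these plots can be summarized per forum with a few lines of standard-library Python. This is a sketch: `length_summary` is a hypothetical helper, words are split on whitespace, and the posts are made up.

```python
from statistics import median, pstdev


def length_summary(posts):
    """Summarize post length by words and characters."""
    word_counts = [len(post.split()) for post in posts]
    char_counts = [len(post) for post in posts]
    return {
        "median_words": median(word_counts),
        "median_chars": median(char_counts),
        # Population standard deviation as a simple measure of variability.
        "spread_chars": pstdev(char_counts),
    }


# Made-up posts keyed by forum name, not the scraped data.
posts_by_forum = {
    "AllThingsHorror": ["Best slow burn horror?", "Good classic to watch"],
    "HorrorMoviesONLY": [
        "New trailer",
        "Long thoughts on why the new remake misses what made the original work",
    ],
}

for forum, posts in posts_by_forum.items():
    print(forum, length_summary(posts))
```

A larger `spread_chars` for one forum corresponds to the greater variability described above.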

4. One more thing to mention: I know Kari likes classic horror films, so access to upcoming horror movies is not something she would be particularly interested in. HorrorMoviesONLY leans toward new releases (*trailer* is in its top 15 words), while Kari is more interested in positive discussions about unknown or rediscovered classics.

# Further modeling:

### To improve the chances of finding a movie Kari wants to see, I can:
1. Scrape genre and movie title from IMDb.
2. Match those movies to the Reddit comments.
3. Use NLP to decide which forum matches more closely with the genre she is looking for.

This would direct Kari to a forum that specifically discusses the genre she wants, such as slasher films.
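As a very rough sketch of the matching step, the code below counts genre mentions per forum by looking for known titles in each comment. Everything here is hypothetical: the `title_to_genre` table stands in for scraped IMDb data, the comments are made up, and plain substring matching is used in place of a real NLP approach.

```python
from collections import Counter

# Hypothetical stand-in for a scraped title-to-genre table.
title_to_genre = {
    "halloween": "slasher",
    "the exorcist": "supernatural",
}

# Made-up comments keyed by forum name, not the scraped data.
comments_by_forum = {
    "AllThingsHorror": [
        "Halloween still holds up",
        "Rewatched Halloween last night",
    ],
    "HorrorMoviesONLY": ["The Exorcist remaster looks great"],
}


def genre_counts(comments, title_to_genre):
    """Count how many comments mention a title from each genre."""
    counts = Counter()
    for comment in comments:
        text = comment.lower()
        for title, genre in title_to_genre.items():
            if title in text:
                counts[genre] += 1
    return counts


def best_forum_for(genre, comments_by_forum, title_to_genre):
    """Pick the forum whose comments mention the genre most often."""
    return max(
        comments_by_forum,
        key=lambda forum: genre_counts(comments_by_forum[forum], title_to_genre)[genre],
    )


print(best_forum_for("slasher", comments_by_forum, title_to_genre))
```

Substring matching is crude (a title like "Halloween" also matches the holiday), so a real version would need better title detection.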

### Other ways I can improve these models:
1. Improve my initial scoring with VADER by refining what a neutral, positive, or negative review looks like in the horror genre. Garbage in = garbage out.

*Notes on VADER:*

Using a lexicon to score horror movie comments was a challenge, because in these comments words such as ___horror___ and ___scary___ are GOOD sentiments. I chose VADER because it handles social media comments the best. By removing the variants of ___horror___ and ___scary___ from the lexicon file, I hoped to give more weight to words like ___good___, even when a comment contained words that VADER scores as negative. For a proper analysis, I would rather create a version of the VADER lexicon tailored to the horror genre, one that gives higher weights to words like ___scary___ and ___horror___.
    
