# Lesson 5

## A/B Testing Case Study

This lesson:

- build a user funnel $\rightarrow$ decide on metrics $\rightarrow$ perform experiment sizing

Perform inferential statistics on metrics:

- invariant

- evaluation

Previous lessons covered two kinds of components:

- conceptual

- statistical

These apply to both stages of an experiment:

- design

- analyse

### Scenario Description

#### On Udacity text:

"Let's say that you're working for a fictional productivity software company that is looking for ways to increase the number of people who pay for their software. The way that the software is currently set up, users can download and use the software free of charge, for a 7-day trial. After the end of the trial, users are required to pay for a license to continue using the software.

One idea that the company wants to try is to change the layout of the homepage to emphasize more prominently and higher up on the page that there is a 7-day trial available for the company's software. The current fear is that some potential users are missing out on using the software because of a lack of awareness of the trial period. If more people download the software and use it in the trial period, the hope is that this entices more people to make a purchase after seeing what the software can do.

In this case study, you'll go through steps for planning out an experiment to test the new homepage. You will start by constructing a user funnel and deciding on metrics to track. You'll also perform experiment sizing to see how long it should be run. Afterwards, you'll be given some data collected for the experiment, perform statistical tests to analyze the results, and come to conclusions regarding how effective the new homepage changes were for bringing in more users."

### Building a Funnel

### Deciding on Metrics

### Experiment Sizing

### Validity, Bias, Ethics

### Analyze Data

### Draw Conclusions

---

# Lesson 6

##  Recommendation Engines

---

## Movie Tweeting Data

## First Notebook - L5 - Intro to Recommendation Data

### Recommendations with MovieTweetings: Most Popular Recommendation

#### On Udacity text:

"Now that you have created the necessary columns we will be using throughout the rest of the lesson on creating recommendations, let's get started with the first of our recommendations."

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.

In [None]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import progressbar
import pickle
import udacourse3

import tests as t
import helper as h

from scipy.stats import spearmanr
from scipy.stats import kendalltau
from scipy.sparse import csr_matrix

from time import time
from collections import defaultdict
from IPython.display import HTML
#%matplotlib inline

In [None]:
# Read in the datasets
movie = udacourse3.fn_read_data('data/movies_clean.csv', remove_noisy_cols=True)
review = udacourse3.fn_read_data('data/reviews_clean.csv', remove_noisy_cols=True)

#### Part I: How To Find The Most Popular Movies?

#### On Udacity text:

"For this notebook, we have a single task.  The task is that no matter the user, we need to provide a list of the recommendations based on simply the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* With ties, movies that have more ratings are better
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in both average rating and number of ratings, the movie with the most recent rating ranks higher"

With these criteria, the goal for this notebook is to take a **user_id** and provide back the **num_top** recommendations.  Use the function below as the scaffolding that will be used for all the future recommendations as well.

In [None]:
review.head(1)

In [None]:
review.groupby('movie_id')['rating'].mean().head(2)
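
The four criteria above can be sketched as a `groupby` aggregation followed by a multi-key sort. This is a minimal illustration on made-up data, not the `udacourse3` implementation: the `date` column name and the helper name `rank_movies` are assumptions, and only `movie_id` and `rating` come from the cells above.

```python
import pandas as pd

# Toy reviews. `date` is an assumed column name for the review timestamp.
toy_review = pd.DataFrame({
    "movie_id": [1] * 5 + [2] * 6 + [3] * 5 + [4] * 2,
    "rating": [9] * 5 + [9] * 6 + [8] * 5 + [10] * 2,
    "date": pd.to_datetime(
        ["2020-01-01"] * 5 + ["2019-01-01"] * 6
        + ["2021-01-01"] * 5 + ["2021-06-01"] * 2
    ),
})


def rank_movies(review, min_ratings=5):
    """Rank movies by mean rating, then rating count, then latest review."""
    stats = review.groupby("movie_id").agg(
        avg_rating=("rating", "mean"),
        num_ratings=("rating", "count"),
        last_rating=("date", "max"),
    )
    # Movies with too few ratings are dropped before ranking.
    stats = stats[stats["num_ratings"] >= min_ratings]
    # Sort on all three keys, best first.
    return stats.sort_values(
        by=["avg_rating", "num_ratings", "last_rating"], ascending=False
    )


ranked = rank_movies(toy_review)
top_ids = ranked.index.tolist()
print(top_ids)
```

Movie 4 has the highest average but only two ratings, so it is excluded; movies 1 and 2 tie on average rating and the one with more ratings comes first.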

function `fn_create_ranked_df` created!

function `fn_popular_recommendation` created!

#### On Udacity text:

"Using the criteria above, you should be able to put together the above function.  If you feel confident in your solution, check the results of your function against our solution. On the next page, you can see a walkthrough and you can of course get the solution by looking at the solution notebook available in this workspace."

In [None]:
# Top 20 movies recommended for id 1
ranked_movie = udacourse3.fn_create_ranked_df(movie, 
                                              review,
                                              verbose=True) # only run this once - it is not fast

In [None]:
recs_20_for_1 = udacourse3.fn_popular_recommendation(user_id='1', 
                                                     num_top=20, 
                                                     ranked_movie=ranked_movie,
                                                     verbose=True)
# Top 5 movies recommended for id 53968
recs_5_for_53968 = udacourse3.fn_popular_recommendation(user_id='53968', 
                                                        num_top=5, 
                                                        ranked_movie=ranked_movie,
                                                        verbose=True)
# Top 100 movies recommended for id 70000
recs_100_for_70000 = udacourse3.fn_popular_recommendation(user_id='70000', 
                                                          num_top=100, 
                                                          ranked_movie=ranked_movie,
                                                          verbose=True)
# Top 35 movies recommended for id 43
recs_35_for_43 = udacourse3.fn_popular_recommendation(user_id='43', 
                                                      num_top=35, 
                                                      ranked_movie=ranked_movie,
                                                      verbose=True)

In [None]:
### You Should Not Need To Modify Anything In This Cell
# check 1 
assert t.popular_recommendations('1', 20, ranked_movie) == recs_20_for_1,  "The first check failed..."
# check 2
assert t.popular_recommendations('53968', 5, ranked_movie) == recs_5_for_53968,  "The second check failed..."
# check 3
assert t.popular_recommendations('70000', 100, ranked_movie) == recs_100_for_70000,  "The third check failed..."
# check 4
assert t.popular_recommendations('43', 35, ranked_movie) == recs_35_for_43,  "The fourth check failed..."

print("If you got here, looks like you are good to go!  Nice job!")

#### On Udacity text:

Top rated $\rightarrow$ is a fluid concept, and could depend on:

>- trending news
>- trending social events
>- a time window

**Notice:** 

"This wasn't the only way we could have determined the "top rated" movies. You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!"
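
The time-window idea mentioned above could look like the sketch below: keep only the ratings inside a recent window, then count them per movie. The data, the `date` column name, and the helper name `trending` are assumptions made for illustration.

```python
import pandas as pd

# Toy reviews: movie 1 has the most ratings overall, but all of them are old.
toy_review = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime([
        "2021-01-01", "2021-01-02", "2021-01-03",
        "2021-03-01", "2021-03-02", "2021-03-03",
    ]),
})


def trending(review, now, window_days, num_top):
    """Return the most-rated movie ids inside a recent time window."""
    cutoff = now - pd.Timedelta(days=window_days)
    recent = review[(review["date"] > cutoff) & (review["date"] <= now)]
    # value_counts sorts by count, largest first.
    counts = recent["movie_id"].value_counts()
    return counts.head(num_top).index.tolist()


trending_ids = trending(
    toy_review, now=pd.Timestamp("2021-03-05"), window_days=30, num_top=2
)
print(trending_ids)
```

With a 30-day window, movie 1 drops out entirely even though it leads the all-time count, which is the kind of subjective choice the note refers to.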

### Part II: Adding Filters

#### On Udacity text:

Filters can bring $\rightarrow$ robustness for our model

**Robustness** (asking Google) has two meanings:

>- "the quality or condition of being strong and in good condition"
>- "the ability to withstand or overcome adverse conditions or rigorous testing"

"Now that you have created a function to give back the **num_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**." 

Use the cells below to adjust your existing function to allow for **year** and **genre** arguments as **lists** of **strings**.  Then your ending results are filtered to only movies within the lists of provided years and genres (as `or` conditions).  If no list is provided, there should be no filter applied.

You can adjust other necessary inputs as necessary to retrieve the final results you are looking for!
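
One way to read the task is that each list acts as an `or` condition over its own values, and a missing list means no filter. The sketch below follows that reading on a toy table; the column names (`movie`, `year`, `genre`), the `|` genre separator, and the helper name `popular_filtered` are assumptions, not the layout of `ranked_movie`.

```python
import pandas as pd

# Toy ranked table, already sorted best-first.
toy_ranked = pd.DataFrame({
    "movie": ["A", "B", "C", "D"],
    "year": ["2014", "2015", "2016", "2016"],
    "genre": ["History", "History|War", "Comedy", "News"],
})


def popular_filtered(ranked, num_top, year=None, genre=None):
    """Return the top titles, optionally restricted to given years and genres."""
    result = ranked
    if year is not None:
        # Keep a movie if its year matches any requested year.
        result = result[result["year"].isin(year)]
    if genre is not None:
        # Keep a movie if it carries any requested genre.
        wanted = set(genre)
        has_genre = result["genre"].apply(
            lambda g: bool(wanted & set(g.split("|")))
        ).astype(bool)
        result = result[has_genre]
    return result["movie"].head(num_top).tolist()


no_filter = popular_filtered(toy_ranked, 3)
recent_history = popular_filtered(
    toy_ranked, 3, year=["2015", "2016"], genre=["History", "News"]
)
print(no_filter, recent_history)
```

Because the table is sorted before filtering, the filtered result keeps the popularity order without any re-ranking.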

function `fn_popular_recommendation_filtered` created!

In [None]:
# Top 20 movies recommended for id 1 with years=['2015', '2016', '2017', '2018'], genres=['History']
recs_20_for_1_filtered = udacourse3.fn_popular_recommendation_filtered(user_id='1', 
                                                                       num_top=20, 
                                                                       ranked_movie=ranked_movie,
                                                                       year=['2015', '2016', '2017', '2018'], 
                                                                       genre=['History'])

# Top 5 movies recommended for id 53968 with no genre filter but years=['2015', '2016', '2017', '2018']
recs_5_for_53968_filtered = udacourse3.fn_popular_recommendation_filtered(user_id='53968', 
                                                                          num_top=5, 
                                                                          ranked_movie=ranked_movie, 
                                                                          year=['2015', '2016', '2017', '2018'])

# Top 100 movies recommended for id 70000 with no year filter but genres=['History', 'News']
recs_100_for_70000_filtered = udacourse3.fn_popular_recommendation_filtered(user_id='70000', 
                                                                            num_top=100, 
                                                                            ranked_movie=ranked_movie, 
                                                                            genre=['History', 'News'])

In [None]:
### You Should Not Need To Modify Anything In This Cell
# check 1 
assert t.popular_recs_filtered('1', 20, ranked_movie, years=['2015', '2016', '2017', '2018'], genres=['History']) == recs_20_for_1_filtered,  "The first check failed..."
# check 2
assert t.popular_recs_filtered('53968', 5, ranked_movie, years=['2015', '2016', '2017', '2018']) == recs_5_for_53968_filtered,  "The second check failed..."
# check 3
assert t.popular_recs_filtered('70000', 100, ranked_movie, genres=['History', 'News']) == recs_100_for_70000_filtered,  "The third check failed..."
print("If you got here, looks like you are good to go!  Nice job!")

---

## Ways to Recommend - Knowledge Based

## Second Notebook - L8 - Most Popular Recommendations

### Recommendations with MovieTweetings: Most Popular Recommendation

Now that you have created the necessary columns we will be using throughout the rest of the lesson on creating recommendations, let's get started with the first of our recommendations.

To get started, read in the libraries and the two datasets you will be using throughout the lesson using the code below.

In [None]:
# Read in the datasets
movie = udacourse3.fn_read_data('data/movies_clean.csv', remove_noisy_cols=True)
review = udacourse3.fn_read_data('data/reviews_clean.csv', remove_noisy_cols=True)

#### Part I: How To Find The Most Popular Movies?

For this notebook, we have a single task.  The task is that no matter the user, we need to provide a list of the recommendations based on simply the most popular items.

For this task, we will consider what is "most popular" based on the following criteria:

* A movie with the highest average rating is considered best
* With ties, movies that have more ratings are better
* A movie must have a minimum of 5 ratings to be considered among the best movies
* If movies are tied in both average rating and number of ratings, the movie with the most recent rating ranks higher

With these criteria, the goal for this notebook is to take a **user_id** and provide back the **num_top** recommendations.  Use the function below as the scaffolding that will be used for all the future recommendations as well.

Using the criteria above, you should be able to put together the above function.  If you feel confident in your solution, check the results of your function against our solution. On the next page, you can see a walkthrough and you can of course get the solution by looking at the solution notebook available in this workspace.  

Function `fn_create_ranked_df` created

In [None]:
# only run this once - it is not fast
ranked_movie = udacourse3.fn_create_ranked_df(movie, 
                                              review,
                                              verbose=True)

In [None]:
ranked_movie.head(5)

In [None]:
# Top 20 movies recommended for id 1
recs_20_for_1 = udacourse3.fn_popular_recommendation(
    user_id='1', 
    num_top=20, 
    ranked_movie=ranked_movie,
    verbose=True
)
# Top 5 movies recommended for id 53968
recs_5_for_53968 = udacourse3.fn_popular_recommendation(
    user_id='53968', 
    num_top=5, 
    ranked_movie=ranked_movie,
    verbose=True
)
# Top 100 movies recommended for id 70000
recs_100_for_70000 = udacourse3.fn_popular_recommendation(
    user_id='70000', 
    num_top=100, 
    ranked_movie=ranked_movie,
    verbose=True
)
# Top 35 movies recommended for id 43
recs_35_for_43 = udacourse3.fn_popular_recommendation(
    user_id='43', 
    num_top=35, 
    ranked_movie=ranked_movie,
    verbose=True
)

In [None]:
### You Should Not Need To Modify Anything In This Cell
# check 1 
assert t.popular_recommendations('1', 20, ranked_movie) == recs_20_for_1,  "The first check failed..."
# check 2
assert t.popular_recommendations('53968', 5, ranked_movie) == recs_5_for_53968,  "The second check failed..."
# check 3
assert t.popular_recommendations('70000', 100, ranked_movie) == recs_100_for_70000,  "The third check failed..."
# check 4
assert t.popular_recommendations('43', 35, ranked_movie) == recs_35_for_43,  "The fourth check failed..."
print("If you got here, looks like you are good to go!  Nice job!")

**Notice:** This wasn't the only way we could have determined the "top rated" movies.  You can imagine that in keeping track of trending news or trending social events, you would likely want to create a time window from the current time, and then pull the articles in the most recent time frame.  There are always going to be some subjective decisions to be made.  

If you find that no one is paying any attention to your most popular recommendations, then it might be time to find a new way to recommend, which is what the next parts of the lesson should prepare us to do!

### Part II: Adding Filters

Now that you have created a function to give back the **num_top** movies, let's make it a bit more robust.  Add arguments that will act as filters for the movie **year** and **genre**.  

Use the cells below to adjust your existing function to allow for **year** and **genre** arguments as **lists** of **strings**.  Then your ending results are filtered to only movies within the lists of provided years and genres (as `or` conditions).  If no list is provided, there should be no filter applied.

You can adjust other necessary inputs as necessary to retrieve the final results you are looking for!

In [None]:
# Top 20 movies recommended for id 1 with years=['2015', '2016', '2017', '2018'], genres=['History']
recs_20_for_1_filtered = udacourse3.fn_popular_recommendation_filtered(
    user_id='1', 
    num_top=20, 
    ranked_movie=ranked_movie,
    year=['2015', '2016', '2017', '2018'], 
    genre=['History']
)
# Top 5 movies recommended for id 53968 with no genre filter but years=['2015', '2016', '2017', '2018']
recs_5_for_53968_filtered = udacourse3.fn_popular_recommendation_filtered(
    user_id='53968', 
    num_top=5, 
    ranked_movie=ranked_movie,
    year=['2015', '2016', '2017', '2018']
)
# Top 100 movies recommended for id 70000 with no year filter but genres=['History', 'News']
recs_100_for_70000_filtered = udacourse3.fn_popular_recommendation_filtered(
    user_id='70000', 
    num_top=100, 
    ranked_movie=ranked_movie,
    genre=['History', 'News']
)

In [None]:
### You Should Not Need To Modify Anything In This Cell
# check 1 
assert t.popular_recs_filtered('1', 
                               20, 
                               ranked_movie, 
                               years=['2015', '2016', '2017', '2018'], 
                               genres=['History']) == recs_20_for_1_filtered,  "The first check failed..."
# check 2
assert t.popular_recs_filtered('53968', 
                               5, 
                               ranked_movie, 
                               years=['2015', '2016', '2017', '2018']) == recs_5_for_53968_filtered,\
"The second check failed..."
# check 3
assert t.popular_recs_filtered('70000', 
                               100, 
                               ranked_movie,
                               genres=['History', 'News']) == recs_100_for_70000_filtered,\
"The third check failed..."
print("If you got here, looks like you are good to go!  Nice job!")

## More Personalized Ways - Collaborative Filtering & Content Based

## Third Notebook - L14 - Measuring Similarity

### How to Find Your Neighbor?

As with a k-nearest neighbors classifier, we need some way to identify neighbors $\rightarrow$ similar subjects = similar preferences

#### In Udacity text:

"In neighborhood based collaborative filtering, it is incredibly important to be able to identify an individual's neighbors.  Let's look at a small dataset in order to understand, how we can use different metrics to identify close neighbors."

In [None]:
# create the play data dataframe
play_data = pd.DataFrame({'x1': [-3, -2, -1, 0, 1, 2, 3],
                          'x2': [9, 4, 1, 0, 1, 4, 9],
                          'x3': [1, 2, 3, 4, 5, 6, 7],
                          'x4': [2, 5, 15, 27, 28, 30, 31]})

# keep the columns in a fixed order
play_data = play_data[['x1', 'x2', 'x3', 'x4']]

### Measures of Similarity

#### In Udacity text:

"The first metrics we will look at have similar characteristics:

1. Pearson's Correlation Coefficient
2. Spearman's Correlation Coefficient
3. Kendall's Tau"

### Pearson's Correlation

relation between data in **X-axis** and data in **Y-axis** [statquest](https://www.youtube.com/watch?v=xZ_z8KWkhXE&ab_channel=StatQuestwithJoshStarmer)

green apples vs red apples:

>- normally I draw a line (the slope doesn't matter)
>- looking for **weak** or **strong** relationship (correlation)
>- [-1, 0] for **negative correlations** 

leads to... $R^2$ that can be **not linear**!

#### In Udacity text:

"First, **Pearson's correlation coefficient** is a measure related to the strength and direction of a **linear** relationship.  

If we have two vectors x and y, we can compare their individual elements in the following way to calculate Pearson's correlation coefficient:

$$CORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum\limits_{i=1}^{n}(y_i-\bar{y})^2}} $$

where 

$$\bar{x} = \frac{1}{n}\sum\limits_{i=1}^{n}x_i$$
"

`1.` Write a function that takes in two vectors and returns the Pearson correlation coefficient.  You can then compare your answer to the built-in function in numpy by using the assert statements in the following cell.
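
A direct translation of the formula into numpy might look like the sketch below. The function name `pearson` is illustrative and this is not the `udacourse3.fn_compute_correlation` code; the vectors are the `x1`, `x2`, and `x3` columns of `play_data`.

```python
import numpy as np


def pearson(x, y):
    """Pearson's correlation coefficient, term by term as in the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    numerator = np.sum(x_dev * y_dev)
    denominator = np.sqrt(np.sum(x_dev ** 2)) * np.sqrt(np.sum(y_dev ** 2))
    return numerator / denominator


x1 = [-3, -2, -1, 0, 1, 2, 3]
x2 = [9, 4, 1, 0, 1, 4, 9]
x3 = [1, 2, 3, 4, 5, 6, 7]

r_12 = pearson(x1, x2)  # symmetric parabola: no linear trend
r_13 = pearson(x1, x3)  # x3 = x1 + 4: perfectly linear
print(r_12, r_13)
```

When checking against `np.corrcoef`, comparing with `np.isclose` (or rounding both sides) is safer than `==`, since the two computations can differ in the last floating-point digit.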

In [None]:
# This cell will test your function against the built in numpy function
assert udacourse3.fn_compute_correlation(play_data['x1'], 
                                         play_data['x2'],
                                         corr_type='pearson') == np.corrcoef(play_data['x1'], 
                                                                             play_data['x2'])[0][1],\
'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'\
.format(udacourse3.fn_compute_correlation(play_data['x1'], 
                                          play_data['x2'],
                                          corr_type='pearson'))
assert round(udacourse3.fn_compute_correlation(play_data['x1'], 
                                               play_data['x3'],
                                               corr_type='pearson'), 2) == round(np.corrcoef(play_data['x1'], 
                                                                                             play_data['x3'])[0][1], 2),\
'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'\
.format(np.corrcoef(play_data['x1'], play_data['x3'])[0][1], 
                                     udacourse3.fn_compute_correlation(play_data['x1'], 
                                                                       play_data['x3'],
                                                                       corr_type='pearson'))
assert round(udacourse3.fn_compute_correlation(play_data['x3'], 
                                               play_data['x4'],
                                               corr_type='pearson'), 2) == round(np.corrcoef(play_data['x3'], 
                                                                                             play_data['x4'])[0][1], 
                                                                                 2),\
'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'\
.format(np.corrcoef(play_data['x3'], play_data['x4'])[0][1], 
                                     udacourse3.fn_compute_correlation(play_data['x3'], 
                                                                       play_data['x4'],
                                                                       corr_type='pearson'))
print("If this is all you see, it looks like you are all set!  Nice job coding up Pearson's correlation coefficient!")

`2.` Now that you have computed **Pearson's correlation coefficient**, use the below dictionary to identify statements that are true about **this** measure.

In [None]:
a = True
b = False
c = "We can't be sure."

pearson_dct = {"If when x increases, y always increases, Pearson's correlation will be always be 1.": b,
               "If when x increases by 1, y always increases by 3, Pearson's correlation will always be 1.": a,
               "If when x increases by 1, y always decreases by 5, Pearson's correlation will always be -1.": a,
               "If when x increases by 1, y increases by 3 times x, Pearson's correlation will always be 1.": b
}

t.sim_2_sol(pearson_dct)

### Spearman's Correlation

- Pearson vs Spearman [here](https://www.youtube.com/watch?v=c5ASFOYd918&ab_channel=StatistikinDD)

#### In Udacity text:

"Now, let's look at **Spearman's correlation coefficient**.  Spearman's correlation is what is known as a [non-parametric](https://en.wikipedia.org/wiki/Nonparametric_statistics) statistic, which is a statistic whose distribution doesn't depend on parameters (statistics that follow normal distributions or binomial distributions are examples of parametric statistics).  

Frequently non-parametric statistics are based on the ranks of data rather than the original values collected.  This happens to be the case with Spearman's correlation coefficient, which is calculated similarly to Pearson's correlation.  However, instead of using the raw data, we use the rank of each value."

You can quickly change from the raw data to the ranks using the **.rank()** method as shown here:

In [None]:
print("The ranked values for the variable x1 are: {}".format(np.array(play_data['x1'].rank())))
print("The raw data values for the variable x1 are: {}".format(np.array(play_data['x1'])))

#### In Udacity text:

"If we map each of our data to ranked data values as shown above:

$$\textbf{x} \rightarrow \textbf{x}^{r}$$
$$\textbf{y} \rightarrow \textbf{y}^{r}$$

Here, we let the **r** indicate these are ranked values (this is not raising any value to the power of r).  Then we compute Spearman's correlation coefficient as:

$$SCORR(\textbf{x}, \textbf{y}) = \frac{\sum\limits_{i=1}^{n}(x^{r}_i - \bar{x}^{r})(y^{r}_i - \bar{y}^{r})}{\sqrt{\sum\limits_{i=1}^{n}(x^{r}_i-\bar{x}^{r})^2}\sqrt{\sum\limits_{i=1}^{n}(y^{r}_i-\bar{y}^{r})^2}} $$

where 

$$\bar{x}^r = \frac{1}{n}\sum\limits_{i=1}^{n}x^r_i$$

`3.` Write a function that takes in two vectors and returns the Spearman correlation coefficient.  You can then compare your answer to the built in function in scipy stats by using the assert statements in the following cell."
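
Since Spearman's coefficient is Pearson's formula applied to ranks, a sketch only needs `.rank()` plus the same arithmetic. The function name `spearman` is illustrative; the vectors are the `x3` and `x4` columns of `play_data`.

```python
import numpy as np
import pandas as pd


def spearman(x, y):
    """Spearman's correlation: Pearson's formula applied to the ranks."""
    x_rank = pd.Series(x).rank()
    y_rank = pd.Series(y).rank()
    x_dev = x_rank - x_rank.mean()
    y_dev = y_rank - y_rank.mean()
    return np.sum(x_dev * y_dev) / (
        np.sqrt(np.sum(x_dev ** 2)) * np.sqrt(np.sum(y_dev ** 2))
    )


x3 = [1, 2, 3, 4, 5, 6, 7]
x4 = [2, 5, 15, 27, 28, 30, 31]

rho_34 = spearman(x3, x4)            # x4 always increases with x3
r_34 = np.corrcoef(x3, x4)[0][1]     # Pearson on the raw values, for contrast
print(rho_34, r_34)
```

The relationship between `x3` and `x4` is monotonic but not linear, so the rank-based measure reaches 1 while Pearson's stays below it.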

function `fn_compute_correlation` created

In [None]:
# This cell will test your function against the built in scipy function
assert udacourse3.fn_compute_correlation(play_data['x1'], 
                                         play_data['x2'],
                                         corr_type='spearman') == spearmanr(play_data['x1'], 
                                                                            play_data['x2'])[0],\
'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'\
.format(udacourse3.fn_compute_correlation(play_data['x1'], play_data['x2'], corr_type='spearman'))
assert round(udacourse3.fn_compute_correlation(play_data['x1'], 
                                               play_data['x3'],
                                               corr_type='spearman'), 2) == round(spearmanr(play_data['x1'], 
                                                                                            play_data['x3'])[0], 2),\
'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'\
.format(spearmanr(play_data['x1'], play_data['x3'])[0], udacourse3.fn_compute_correlation(play_data['x1'], play_data['x3'], corr_type='spearman'))
assert round(udacourse3.fn_compute_correlation(play_data['x3'], 
                                               play_data['x4'],
                                               corr_type='spearman'), 2) == round(spearmanr(play_data['x3'], 
                                                                                            play_data['x4'])[0], 2),\
'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'\
.format(spearmanr(play_data['x3'], play_data['x4'])[0], udacourse3.fn_compute_correlation(play_data['x3'], play_data['x4'], corr_type='spearman'))
print("If this is all you see, it looks like you are all set!  Nice job coding up Spearman's correlation coefficient!")

`4.` Now that you have computed **Spearman's correlation coefficient**, use the below dictionary to identify statements that are true about **this** measure.

In [None]:
a = True
b = False
c = "We can't be sure."

spearman_dct = {"If when x increases, y always increases, Spearman's correlation will be always be 1.": a,
               "If when x increases by 1, y always increases by 3, Pearson's correlation will always be 1.": a,
               "If when x increases by 1, y always decreases by 5, Pearson's correlation will always be -1.": a,
               "If when x increases by 1, y increases by 3 times x, Pearson's correlation will always be 1.": a
}

t.sim_4_sol(spearman_dct)

### Kendall's Tau

#### In Udacity notes:

"Kendall's tau is quite similar to Spearman's correlation coefficient.  Both of these measures are nonparametric measures of a relationship.  Specifically both Spearman and Kendall's coefficients are calculated based on ranking data and not the raw data.  

Similar to both of the previous measures, Kendall's Tau is always between -1 and 1, where -1 suggests a strong, negative relationship between two variables and 1 suggests a strong, positive relationship between two variables.

Though Spearman's and Kendall's measures are very similar, there are statistical advantages to choosing Kendall's measure in that Kendall's Tau has smaller variability when using larger sample sizes.  However, Spearman's measure is more computationally efficient, as a direct computation of Kendall's Tau is O(n^2) while Spearman's correlation is O(n log n). You can find more on this topic in [this thread](https://www.researchgate.net/post/Does_Spearmans_rho_have_any_advantage_over_Kendalls_tau).

Let's take a closer look at exactly how this measure is calculated.  Again, we want to map our data to ranks:

$$\textbf{x} \rightarrow \textbf{x}^{r}$$
$$\textbf{y} \rightarrow \textbf{y}^{r}$$

Then we calculate Kendall's Tau as:

$$TAU(\textbf{x}, \textbf{y}) = \frac{2}{n(n -1)}\sum_{i < j}sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

Where $sgn$ takes the sign associated with the difference in the ranked values.  An alternative way to write 

$$sgn(x^r_i - x^r_j)$$ 

is in the following way:

$$
 \begin{cases} 
      -1  & x^r_i < x^r_j \\
      0 & x^r_i = x^r_j \\
      1 & x^r_i > x^r_j 
   \end{cases}
$$

Therefore the possible results of 

$$sgn(x^r_i - x^r_j)sgn(y^r_i - y^r_j)$$

are only 1, -1, or 0, which are summed to give an idea of the proportion of times the ranks of **x** and **y** point in the same direction."

#### Task

`5.` Write a function that takes in two vectors and returns Kendall's Tau.  You can then compare your answer to the built in function in scipy stats by using the assert statements in the following cell.
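
The formula above sums sign products over every pair `i < j`, which a double loop expresses directly. The sketch below is a plain O(n^2) version with an illustrative name, `kendall_tau`. As far as I know, `scipy.stats.kendalltau` defaults to a tie-corrected variant, so the two should agree when there are no ties but may differ otherwise.

```python
import numpy as np
import pandas as pd


def kendall_tau(x, y):
    """Kendall's Tau as the scaled sum of sign products over all pairs i < j."""
    x_rank = pd.Series(x).rank().to_numpy()
    y_rank = pd.Series(y).rank().to_numpy()
    n = len(x_rank)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sign(x_rank[i] - x_rank[j]) * np.sign(y_rank[i] - y_rank[j])
    return 2.0 * total / (n * (n - 1))


x1 = [-3, -2, -1, 0, 1, 2, 3]
x2 = [9, 4, 1, 0, 1, 4, 9]
x4 = [2, 5, 15, 27, 28, 30, 31]

tau_14 = kendall_tau(x1, x4)  # every pair is ordered the same way
tau_12 = kendall_tau(x1, x2)  # concordant and discordant pairs cancel
print(tau_14, tau_12)
```

All 21 pairs of `x1` and `x4` agree in direction, so the sum is 21 and the scaling factor `2 / (7 * 6)` brings it to 1.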

function `fn_compute_correlation` improved for Kendall Tau!

In [None]:
# This cell will test your function against the built in scipy function
assert udacourse3.fn_compute_correlation(play_data['x1'], 
                                         play_data['x2'],
                                         corr_type='kendall_tau') == kendalltau(play_data['x1'], 
                                                                        play_data['x2'])[0],\
'Oops!  The correlation between the first two columns should be 0, but your function returned {}.'\
.format(udacourse3.fn_compute_correlation(play_data['x1'], 
                                          play_data['x2'],
                                          corr_type='kendall_tau'))
assert round(udacourse3.fn_compute_correlation(play_data['x1'], 
                                               play_data['x3'],
                                               corr_type='kendall_tau'), 2) == round(kendalltau(play_data['x1'], 
                                                                                                play_data['x3'])[0], 2),\
'Oops!  The correlation between the first and third columns should be {}, but your function returned {}.'\
.format(kendalltau(play_data['x1'], 
                   play_data['x3'])[0], udacourse3.fn_compute_correlation(play_data['x1'], 
                                                                             play_data['x3'],
                                                                             corr_type='kendall_tau'))
assert round(udacourse3.fn_compute_correlation(play_data['x3'], 
                                               play_data['x4'],
                                               corr_type='kendall_tau'), 2) == round(kendalltau(play_data['x3'], 
                                                                                                play_data['x4'])[0], 
                                                                                     2),\
'Oops!  The correlation between the third and fourth columns should be {}, but your function returned {}.'\
.format(kendalltau(play_data['x3'], play_data['x4'])[0], udacourse3.fn_compute_correlation(play_data['x3'],
                                                                                              play_data['x4'],
                                                                                              corr_type='kendall_tau'))
print("If this is all you see, it looks like you are all set!  Nice job coding up Kendall's Tau!")

`6.` Use your functions (and/or your knowledge of each of the above coefficients) to accurately identify each of the below statements as True or False.  **Note:** There may be some rounding differences due to the way numbers are stored, so it is recommended that you consider comparisons to 4 or fewer decimal places.

In [None]:
a = True
b = False
c = "We can't be sure."

corr_comp_dct = {"For all columns of play_data, Spearman and Kendall's measures match.": a,
                 "For all columns of play_data, Spearman and Pearson's measures match.": b, 
                 "For all columns of play_data, Pearson and Kendall's measures match.": b}

t.sim_6_sol(corr_comp_dct)

### Distance Measures

#### In Udacity notes:

"Each of the above measures are considered measures of correlation.  Similarly, there are distance measures (of which there are many).  [This is a great article](http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/) on some popular distance metrics.  In this notebook, we will be looking specifically at two of these measures.  

1. Euclidean Distance
2. Manhattan Distance

Different than the three measures you built functions for, these two measures take on values between 0 and potentially infinity.  Measures that are closer to 0 imply that two vectors are more similar to one another.  The larger these values become, the more dissimilar two vectors are to one another.

Choosing one of these two `distance` metrics vs. one of the three `similarity` measures above is often a matter of personal preference, audience, and data specificities.  You will see in a bit a case where one of these measures (euclidean or manhattan distance) is preferable to Pearson's correlation coefficient."

### Euclidean Distance

#### In Udacity notes:

"Euclidean distance can also just be considered as straight-line distance between two vectors.

For two vectors **x** and **y**, we can compute this as:

$$ EUC(\textbf{x}, \textbf{y}) = \sqrt{\sum\limits_{i=1}^{n}(x_i - y_i)^2}$$

"

### Manhattan Distance

#### In Udacity notes:

"Different from euclidean distance, Manhattan distance is a 'manhattan block' distance from one vector to another.  Therefore, you can imagine this distance as a way to compute the distance between two points when you are not able to go through buildings.

Specifically, this distance is computed as:

$$ MANHATTAN(\textbf{x}, \textbf{y}) = \sum\limits_{i=1}^{n}|x_i - y_i|$$

Using each of the above, write a function for each to take two vectors and compute the euclidean and manhattan distances.


![distances](graphs/distances.png)

You can see in the above image, the **blue** line gives the **Manhattan** distance, while the **green** line gives the **Euclidean** distance between two points."
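
As a minimal sketch of the two formulas (not the `udacourse3.fn_calculate_distance` implementation, which is not shown in these notes), the helper names `euclidean_distance` and `manhattan_distance` below are hypothetical. Manhattan distance, as commonly defined, sums the absolute differences without taking a square root.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan_distance(x, y):
    """City-block distance: sum of the absolute differences (no square root)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

# Toy example: a 3-4-5 right triangle
x = [0, 0]
y = [3, 4]
print(euclidean_distance(x, y))  # 5.0
print(manhattan_distance(x, y))  # 7.0
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of points, which matches the blue and green lines in the picture.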

#### Task

`7.` Use the below cell to complete a function for each distance metric.  Then test your functions against the built in values using the below.

function `fn_calculate_distance` created!

In [None]:
# Test your functions
assert h.test_eucl(play_data['x1'], play_data['x2']) == udacourse3.fn_calculate_distance(play_data['x1'], 
                                                                                         play_data['x2'],
                                                                                         dist_type='euclidean')
assert h.test_eucl(play_data['x2'], play_data['x3']) == udacourse3.fn_calculate_distance(play_data['x2'], 
                                                                                         play_data['x3'],
                                                                                         dist_type='euclidean')
assert h.test_manhat(play_data['x1'], play_data['x2']) == udacourse3.fn_calculate_distance(play_data['x1'], 
                                                                                           play_data['x2'],
                                                                                           dist_type='manhattan')
assert h.test_manhat(play_data['x2'], play_data['x3']) == udacourse3.fn_calculate_distance(play_data['x2'], 
                                                                                           play_data['x3'],
                                                                                           dist_type='manhattan')
print('test passed!')

### Final Note

#### In Udacity notes:

"It is worth noting that two vectors could be similar by metrics like the three at the top of the notebook, while being incredibly, incredibly different by measures like these final two.  Again, understanding your specific situation will assist in understanding whether your metric is appropriate."

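To make the note above concrete, here is a small sketch with made-up vectors (they are not taken from `play_data`): `y` is a perfect linear function of `x`, so Pearson's correlation is 1, yet the two vectors are far apart in euclidean distance.

```python
import numpy as np

# Illustrative data only: y is perfectly linear in x, but on a very different scale
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10 * x + 100

pearson = np.corrcoef(x, y)[0, 1]   # similarity by correlation
euclid = np.linalg.norm(x - y)      # dissimilarity by distance

print(round(pearson, 4))  # 1.0
print(euclid > 100)       # True
```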
---

## Identifying Recommendations

## Fourth Notebook - L17 - Collaborative Filtering

## Recommendations with MovieTweetings: Collaborative Filtering

#### In Udacity notes:

"One of the most popular methods for making recommendations is **collaborative filtering**.  In collaborative filtering, you are using the collaboration of user-item recommendations to assist in making new recommendations.  

There are two main methods of performing collaborative filtering:

1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

2. **Model Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users to predict ratings and provide recommendations.


In this notebook, you will be working on performing **neighborhood-based collaborative filtering**.  There are two main methods for performing neighborhood-based collaborative filtering:

1. **User-based collaborative filtering:** In this type of recommendation, users related to the user you would like to make recommendations for are used to create a recommendation.

2. **Item-based collaborative filtering:** In this type of recommendation, first you need to find the items that are most related to each other item (based on similar ratings).  Then you can use the ratings of an individual on those similar items to understand if a user will like the new item."

In this notebook you will be implementing **user-based collaborative filtering**.  However, it is easy to extend this approach to make recommendations using **item-based collaborative filtering**.  First, let's read in our data and necessary libraries.

**NOTE**: Because of the size of the datasets, some of your code cells here will take a while to execute, so be patient!

In [None]:
# Read in the datasets
movie = udacourse3.fn_read_data('data/movies_clean.csv', remove_noisy_cols=True)
review = udacourse3.fn_read_data('data/reviews_clean.csv', remove_noisy_cols=True)
review.head()

### Measures of Similarity

#### In Udacity notes:

"When using **neighborhood** based collaborative filtering, it is important to understand how to measure the similarity of users or items to one another.  

There are a number of ways in which we might measure the similarity between two vectors (which might be two users or two items)."  

In this notebook, we will look specifically at two measures used to compare vectors:

* **Pearson's correlation coefficient**

#### In Udacity notes:

"Pearson's correlation coefficient is a measure of the strength and direction of a linear relationship. The value for this coefficient is a value between -1 and 1 where -1 indicates a strong, negative linear relationship and 1 indicates a strong, positive linear relationship. 

If we have two vectors x and y, we can define the correlation between the vectors as:


$$CORR(x, y) = \frac{\text{COV}(x, y)}{\text{STDEV}(x)\text{ }\text{STDEV}(y)}$$

where 

$$\text{STDEV}(x) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

and 

$$\text{COV}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

where n is the length of the vector, which must be the same for both x and y and $\bar{x}$ is the mean of the observations in the vector.  

We can use the correlation coefficient to indicate how alike two vectors are to one another, where the closer to 1 the coefficient, the more alike the vectors are to one another.  There are some potential downsides to using this metric as a measure of similarity.  You will see some of these throughout this workbook."
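
The three formulas above can be translated almost term by term. This is only a sketch: the function name `pearson_corr` is hypothetical, and the result is compared with NumPy's `np.corrcoef` as a sanity check.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson's correlation built from COV and STDEV with the 1/(n-1) factor."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    x_dev, y_dev = x - x.mean(), y - y.mean()
    cov = np.sum(x_dev * y_dev) / (n - 1)
    stdev_x = np.sqrt(np.sum(x_dev ** 2) / (n - 1))
    stdev_y = np.sqrt(np.sum(y_dev ** 2) / (n - 1))
    return cov / (stdev_x * stdev_y)

# Toy vectors, for illustration only
x = [2, 4, 6, 8]
y = [1, 3, 2, 5]
print(np.isclose(pearson_corr(x, y), np.corrcoef(x, y)[0, 1]))  # True
```

Note that the `1/(n-1)` factors cancel between numerator and denominator, so the coefficient does not depend on that choice.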


* **Euclidean distance**

#### In Udacity notes:

"Euclidean distance is a measure of the straightline distance from one vector to another.  Because this is a measure of distance, larger values are an indication that two vectors are different from one another (which is different than Pearson's correlation coefficient).

Specifically, the euclidean distance between two vectors x and y is measured as:

$$ \text{EUCL}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Different from the correlation coefficient, no scaling is performed in the denominator.  Therefore, you need to make sure all of your data are on the same scale when using this metric.

**Note:** Because measuring similarity is often based on looking at the distance between vectors, it is important in these cases to scale your data or to have all data be in the same scale.  In this case, we will not need to scale data because they are all on a 10 point scale, but it is always something to keep in mind!"

------------

### User-Item Matrix

#### In Udacity notes:

"In order to calculate the similarities, it is common to put values in a matrix.  In this matrix, users are identified by each row, and items are represented by columns."

![user x item](graphs/userxitem.png "User Item Matrix")

#### In Udacity notes:

"In the above matrix, you can see that **User 1** and **User 2** both used **Item 1**, and **User 2**, **User 3**, and **User 4** all used **Item 2**.  However, there are also a large number of missing values in the matrix for users who haven't used a particular item.  A matrix with many missing values (like the one above) is considered **sparse**."

---

Our first goal for this notebook is to create the above matrix with the **reviews** dataset.  However, instead of 1 values in each cell, you should have the actual rating.  

The users will indicate the rows, and the movies will exist across the columns. To create the user-item matrix, we only need the first three columns of the **reviews** dataframe, which you can see by running the cell below.

In [None]:
user_item = review[['user_id', 'movie_id', 'rating']]
user_item.head()

### Creating the User-Item Matrix

#### In Udacity notes:

"In order to create the user-items matrix (like the one above), I personally started by using a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html). 

However, I quickly ran into a memory error (a common theme throughout this notebook).  I will help you navigate around many of the errors I had, and achieve useful collaborative filtering results!"

_____

`1.` Create a matrix where the users are the rows, the movies are the columns, and the ratings exist in each cell, or a NaN exists in cells where a user hasn't rated a particular movie. If you get a memory error (like I did), [this link here](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) might help you!
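
One memory-friendlier alternative to a pivot table is `groupby` followed by `unstack`. The sketch below runs on a tiny made-up dataframe (`toy_reviews` is an assumption, it only mimics the `user_id`, `movie_id`, `rating` columns); it is not necessarily how `udacourse3.fn_create_user_movie` works.

```python
import pandas as pd

# Toy stand-in for the three columns of the reviews data
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 10, 30, 20],
    'rating':   [8, 6, 9, 7, 5],
})

# One row per user, one column per movie, NaN where a user has no rating
toy_user_by_movie = (
    toy_reviews.groupby(['user_id', 'movie_id'])['rating']
    .max()        # collapses duplicate (user, movie) pairs, if any
    .unstack()
)
print(toy_user_by_movie)
```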

In [None]:
user_item.head(2)

function `udacourse3.fn_create_user_item` created!

renamed to `udacourse3.fn_create_user_movie`

In [None]:
user_by_movie = udacourse3.fn_create_user_movie(df_user_item=user_item, 
                                                verbose=True)
user_by_movie.head(1)

Check your results below to make sure your matrix is ready for the upcoming sections.

In [None]:
assert movie.shape[0] == user_by_movie.shape[1],\
"Oh no! Your matrix should have {} columns, and yours has {}!".format(movie.shape[0], user_by_movie.shape[1])
assert review.user_id.nunique() == user_by_movie.shape[0],\
"Oh no! Your matrix should have {} rows, and yours has {}!".format(review.user_id.nunique(), user_by_movie.shape[0])
print("Looks like you are all set! Proceed!")
#HTML('<img src="graphs/greatjob.webp">')

`2.` Now that you have a matrix of users by movies, use this matrix to create a dictionary where the key is each user and the value is an array of the movies each user has rated.
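
A minimal sketch of this step on a toy matrix (the names `toy_user_by_movie` and `toy_movies_seen` are assumptions, and this is not the code of `fn_movie_watched`): for each user row, drop the NaN cells and keep the remaining column labels.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie matrix: rows are users, columns are movie ids
toy_user_by_movie = pd.DataFrame(
    [[8.0, 6.0, np.nan],
     [9.0, np.nan, 7.0],
     [np.nan, 5.0, np.nan]],
    index=pd.Index([1, 2, 3], name='user_id'),
    columns=pd.Index([10, 20, 30], name='movie_id'),
)

# key = user, value = array of the movie ids that user has rated
toy_movies_seen = {
    user: row.dropna().index.to_numpy()
    for user, row in toy_user_by_movie.iterrows()
}
print(toy_movies_seen[1])  # [10 20]
```

On the full dataset `iterrows` is slow, which is one reason for caching the result in a pickle file.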

function `fn_movie_watched` created!

- iterate over `user_by_movie` dataset

#### Note: this is big data processing!
    
The first time you run this notebook, uncomment the lines below that create the file `watched.pkl` on your computer. On later runs, comment them out again so that the data is only loaded from disk, which saves processing time!
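
The comment/uncomment routine can also be automated. This is a sketch of one possible helper, not something that exists in `udacourse3`; the name `load_or_compute` is hypothetical.

```python
import os
import pickle

def load_or_compute(path, compute_fn):
    """Load a pickled result from `path`; if the file is missing, compute and save it."""
    if os.path.exists(path):
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    result = compute_fn()              # the slow step only runs when no cache exists
    with open(path, 'wb') as handle:
        pickle.dump(result, handle)
    return result

# Hypothetical usage with a cheap stand-in for the slow computation:
# watched = load_or_compute('watched.pkl', lambda: expensive_function())
```

The trade-off is that a stale file is silently reused, so delete the pickle whenever the underlying data changes.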

In [None]:
#watched = udacourse3.fn_movie_watched(df_user_movie=user_by_movie,
#                                      user_id=66,
#                                      lower_filter=None,
#                                      verbose=True)
#watched[0]

In [None]:
#with open('watched.pkl', 'wb') as handle:
#    pickle.dump(watched, handle)

with open('watched.pkl', 'rb') as handle:
    watched = pickle.load(handle)

watched[0]

function `fn_create_movie_dict` created!

- iterate over `user_by_movie` dataset

- this is a polymorphic function, so you can pass as `df_user_movie` either an already created dictionary or a Pandas dataframe!

#### Note: this is big data processing!
    
The first time you run this notebook, uncomment the lines below that create the file `seen.pkl` on your computer. On later runs, comment them out again so that the data is only loaded from disk, which saves processing time!

In [None]:
#movie_seen = udacourse3.fn_create_user_movie_dict(df_user_movie=user_by_movie,
#                                                   lower_filter=None,
#                                                   verbose=False)
#len(movie_seen)

In [None]:
#with open('seen.pkl', 'wb') as handle:
#    pickle.dump(movie_seen, handle)

with open('seen.pkl', 'rb') as handle:
    movie_seen = pickle.load(handle)
    
len(movie_seen)

`3.` If a user hasn't rated more than 2 movies, we consider these users "too new".  Create a new dictionary that only contains users who have rated more than 2 movies.  This dictionary will be used for all the final steps of this workbook.
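
On an already built dictionary, the filter is a one-line comprehension. The data below is made up, and `min_rated` is a hypothetical name standing in for the `lower_filter` argument.

```python
# Toy dictionary: key = user, value = list of rated movie ids
toy_movies_seen = {1: [10, 20, 30], 2: [10, 30], 3: [20], 4: [10, 20, 30, 40]}

min_rated = 2  # "more than 2 movies" keeps users with 3 or more ratings
toy_to_analyze = {
    user: movies
    for user, movies in toy_movies_seen.items()
    if len(movies) > min_rated
}
print(sorted(toy_to_analyze))  # [1, 4]
```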

In [None]:
#as a dataset
#movie_filtered = udacourse3.fn_create_user_movie_dict(df_user_movie=user_by_movie,
#                                                      lower_filter=2,
#                                                      verbose=True)
#len(movie_filtered)

In [None]:
#using our already created dictionary
movie_to_analyze = udacourse3.fn_create_user_movie_dict(df_user_movie=movie_seen,
                                                        lower_filter=2,
                                                        verbose=True)
#for usr in movie_to_analyze.keys():
#    print(movie_to_analyze[usr])
len(movie_to_analyze)

In [None]:
# Run the tests below to check that your movie_to_analyze matches the solution
assert len(movie_to_analyze) == 23512,\
"Oops!  It doesn't look like your dictionary has the right number of individuals."
assert len(movie_to_analyze[2]) == 23,\
"Oops!  User 2 didn't match the number of movies we thought they would have."
assert len(movie_to_analyze[7])  == 3,\
"Oops!  User 7 didn't match the number of movies we thought they would have."
print("If this is all you see, you are good to go!")

### Calculating User Similarities

#### In Udacity notes:

"Now that you have set up the **movies_to_analyze** dictionary, it is time to take a closer look at the similarities between users. Below is the pseudocode for how I thought about determining the similarity between users:

```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance/similarity metric between ratings on the same movies for the two users
            store the users and the distance metric
```

However, this took a very long time to run, and other methods of performing these operations did not fit on the workspace memory!

Therefore, rather than creating a dataframe with all possible pairings of users in our data, your task for this question is to look at a few specific examples of the correlation between ratings given by two users.  For this question consider you want to compute the [correlation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) between users."
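
For a tiny matrix the pseudocode above is cheap to run, so here is a sketch of it on made-up data. Two simplifications are assumptions of this sketch: `itertools.combinations` visits each unordered pair once (no self-pairs), and the metric stored is the euclidean distance.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Toy user-by-movie matrix (NaN = not rated)
toy_user_by_movie = pd.DataFrame(
    [[8, 6, 7, np.nan],
     [9, 5, 8, 7],
     [np.nan, 5, np.nan, 6]],
    index=[1, 2, 3], columns=[10, 20, 30, 40], dtype=float,
)

records = []
for user1, user2 in combinations(toy_user_by_movie.index, 2):
    # keep only the movies rated by both users
    both = toy_user_by_movie.loc[[user1, user2]].dropna(axis=1)
    if both.shape[1] > 2:  # "more than two movies in common"
        dist = np.linalg.norm(both.loc[user1] - both.loc[user2])
        records.append((user1, user2, dist))

toy_dist = pd.DataFrame(records, columns=['user1', 'user2', 'eucl_dist'])
print(toy_dist)
```

Only users 1 and 2 share more than two movies here, so the result has a single row. The number of pairs grows quadratically with the number of users, which is why the full computation is so slow.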

`4.` Using the **movies_to_analyze** dictionary and **user_by_movie** dataframe, create a function that computes the correlation between the ratings of similar movies for two users.  Then use your function to compare your results to ours using the tests below.  

function `fn_take_correlation` created!

- iterate over `user_by_movie` dataset

In [None]:
user1 = 2
user2 = 66
correlation = udacourse3.fn_take_correlation(for_user1=user_by_movie.loc[user1], 
                                             for_user2=user_by_movie.loc[user2],
                                             verbose=True)
correlation

In [None]:
# Test your function against the solution
assert udacourse3.fn_take_correlation(for_user1=user_by_movie.loc[2],
                                      for_user2=user_by_movie.loc[2]) == 1.0,\
"Oops!  The correlation between a user and itself should be 1.0."
assert round(udacourse3.fn_take_correlation(for_user1=user_by_movie.loc[2],
                                            for_user2=user_by_movie.loc[66]), 2) == 0.76,\
"Oops!  The correlation between user 2 and 66 should be about 0.76."
assert np.isnan(udacourse3.fn_take_correlation(for_user1=user_by_movie.loc[2],
                                               for_user2=user_by_movie.loc[104])),\
"Oops!  The correlation between user 2 and 104 should be a NaN."
print("If this is all you see, then it looks like your function passed all of our tests!")

### Why the NaN's?

#### In Udacity notes:

"If the function you wrote passed all of the tests, then you have correctly set up your function to calculate the correlation between any two users."  

`5.` But one question is, why are we still obtaining **NaN** values?  As you can see in the code cell above, users 2 and 104 have a correlation of **NaN**. Why?

#### In Udacity notes:

"Think and write your ideas here about why these NaNs exist, and use the cells below to do some coding to validate your thoughts. You can check other pairs of users and see that there are actually many NaNs in our data - 2,526,710 of them in fact. **These NaN's ultimately make the correlation coefficient a less than optimal measure of similarity between two users.**

```
In the denominator of the correlation coefficient, we calculate the standard deviation for each user's ratings.  The ratings for user 2 are all the same rating on the movies that match with user 104.  Therefore, the standard deviation is 0.  Because a 0 is in the denominator of the correlation coefficient, we end up with a **NaN** correlation coefficient.  Therefore, a different approach is likely better for this particular situation.
```
"

In [None]:
# Which movies did both user 2 and user 104 see?
set_2 = set(movie_to_analyze[2])
set_104 = set(movie_to_analyze[104])
set_2.intersection(set_104)

In [None]:
# What were the ratings for each user on those movies?
common_movies = list(set_2.intersection(set_104))  # .loc expects a list-like, not a set
print(user_by_movie.loc[2, common_movies])
print(user_by_movie.loc[104, common_movies])
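
The same effect can be reproduced with made-up ratings (these are not the real ratings of users 2 and 104): when one vector has zero spread, the correlation is NaN, while the euclidean distance is still well defined.

```python
import numpy as np

# Illustrative ratings on three shared movies
ratings_a = np.array([8.0, 8.0, 8.0])   # identical ratings: standard deviation is 0
ratings_b = np.array([6.0, 9.0, 7.0])

# Silence the 0/0 warning that NumPy emits for the zero-variance vector
with np.errstate(invalid='ignore', divide='ignore'):
    corr = np.corrcoef(ratings_a, ratings_b)[0, 1]

dist = np.linalg.norm(ratings_a - ratings_b)

print(np.isnan(corr))  # True
print(np.isnan(dist))  # False
```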

`6.` Because the correlation coefficient proved to be less than optimal for relating user ratings to one another, we could instead calculate the euclidean distance between the ratings.  I found [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) particularly helpful when I was setting up my function.  This function should be very similar to your previous function.  When you feel confident with your function, test it against our results.

In [None]:
def compute_euclidean_dist(user1, user2):
    # movies each user has rated
    movies1 = movie_to_analyze[user1]
    movies2 = movie_to_analyze[user2]
    # movies rated by both users
    sim_movs = np.intersect1d(movies1, movies2, assume_unique=True)
    # ratings of both users on the shared movies (kept only for inspection)
    df = user_by_movie.loc[[user1, user2], sim_movs]
    dist = np.linalg.norm(df.loc[user1] - df.loc[user2])
    return (sim_movs, df, df.loc[user1], df.loc[user2], dist)

In [None]:
rtup = compute_euclidean_dist(user1=2, user2=66)
print('euclidean distance:', rtup[4])
print('identical movies id:',rtup[0])
print('series for user1:', rtup[2])
print('series for user2:', rtup[3])
rtup[1]

function `fn_take_euclidean_dist` created!

- iterate over `user_by_movie` dataset

In [None]:
euclidean = udacourse3.fn_take_euclidean_dist(for_user1=user_by_movie.loc[2], 
                                              for_user2=user_by_movie.loc[66],
                                              verbose=True)
euclidean

In [None]:
# Read in solution euclidean distances
df_dist = pd.read_pickle("data/dists.p")

In [None]:
# Test your function against the solution
assert udacourse3.fn_take_euclidean_dist(
    for_user1=user_by_movie.loc[2],
    for_user2=user_by_movie.loc[2]) == df_dist.query("user1 == 2 and user2 == 2")['eucl_dist'][0],\
"Oops!  The distance between a user and itself should be 0.0."
assert round(udacourse3.fn_take_euclidean_dist(
    for_user1=user_by_movie.loc[2],
    for_user2=user_by_movie.loc[66]),2) == round(df_dist.query("user1 == 2 and user2 == 66")['eucl_dist'][1], 2),\
"Oops!  The distance between user 2 and 66 should be about 2.24."
assert np.isnan(udacourse3.fn_take_euclidean_dist(
    for_user1=user_by_movie.loc[2],
    for_user2=user_by_movie.loc[104])) == np.isnan(df_dist.query("user1 == 2 and user2 == 104")['eucl_dist'][4]),\
"Oops!  The distance between user 2 and 104 should be 2."
print("If this is all you see, then it looks like your function passed all of our tests!")

### Using the Nearest Neighbors to Make Recommendations

#### In Udacity notes:

"In the previous question, you read in **df_dists**. Therefore, you have a measure of distance between each user and every other user. This dataframe holds every possible pairing of users, as well as the corresponding euclidean distance.

Because of the **NaN** values that exist within the correlations of the matching ratings for many pairs of users, as we discussed above, we will proceed using **df_dists**. You will want to find the users that are 'nearest' each user.  Then you will want to find the movies the closest neighbors have liked to recommend to each user.

I made use of the following objects:

* df_dists (to obtain the neighbors)
* user_items (to obtain the movies the neighbors and users have rated)
* movies (to obtain the names of the movies)"

`7.` Complete the functions below, which allow you to find the recommendations for any user.  There are five functions which you will need:

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using euclidean distance

* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** - loops through every user and returns a dictionary with the key as a user_id and the value as a list of movie recommendations
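
A compact sketch of how the first functions in this list could fit together, on toy data. Everything here is an assumption made for illustration (the toy frames, the signatures, the default `min_rating=7`); it is not the `udacourse3` code, and it returns movie ids, leaving out the **movie_names** lookup.

```python
import numpy as np
import pandas as pd

# Toy distance table in the (user1, user2, eucl_dist) layout
toy_dist = pd.DataFrame({
    'user1':     [1, 1, 1, 1],
    'user2':     [1, 2, 3, 4],
    'eucl_dist': [0.0, 3.0, 1.0, 2.0],
})
# Toy user-by-movie matrix (NaN = not rated)
toy_user_by_movie = pd.DataFrame(
    [[9, np.nan, np.nan, np.nan],
     [np.nan, 9, 8, np.nan],
     [9, 10, np.nan, 5],
     [np.nan, np.nan, 8, 9]],
    index=[1, 2, 3, 4], columns=[10, 20, 30, 40], dtype=float,
)

def find_closest_neighbors(user, df_dist):
    """Other users ordered from nearest to farthest by euclidean distance."""
    rows = df_dist[(df_dist['user1'] == user) & (df_dist['user2'] != user)]
    return rows.sort_values('eucl_dist')['user2'].tolist()

def movies_liked(user, df_user_movie, min_rating=7):
    """Ids of the movies the user rated above `min_rating`."""
    rated = df_user_movie.loc[user].dropna()
    return rated[rated > min_rating].index.tolist()

def make_recommendations(user, df_dist, df_user_movie, num_rec=2):
    """Walk the neighbors from closest to farthest, collecting unseen liked movies."""
    seen = set(df_user_movie.loc[user].dropna().index)
    recs = []
    for neighbor in find_closest_neighbors(user, df_dist):
        for movie_id in movies_liked(neighbor, df_user_movie):
            if movie_id not in seen and movie_id not in recs:
                recs.append(movie_id)
            if len(recs) >= num_rec:
                return recs
    return recs

print(find_closest_neighbors(1, toy_dist))                    # [3, 4, 2]
print(make_recommendations(1, toy_dist, toy_user_by_movie))   # [20, 30]
```

User 1 has only seen movie 10, so the closest neighbor (user 3) contributes movie 20 and the next one (user 4) contributes movie 30.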

In [None]:
user = 2
#df_dists.head(10)
filt_user1 = df_dist[df_dist['user1'] == user]
filt_user1.head(1)

In [None]:
user1 = filt_user1['user1'].iloc[0]
filt_user2 = filt_user1[filt_user1['user2'] != user1]
filt_user2.head(1)

In [None]:
user1 = 2
closest_user = df_dist[df_dist['user1']==user1].sort_values(by='eucl_dist').iloc[1:]#['user2']
closest_user.head(1)

In [None]:
closest_neighbor = np.array(closest_user)
closest_neighbor[0]

In [None]:
filt_user2.sort_values(by='eucl_dist')[:1]

function `fn_find_closest_neighbor` created!

In [None]:
user = 2
neighbor = udacourse3.fn_find_closest_neighbor(filt_user1=df_dist[df_dist['user1'] == user],
                                               limit=10,
                                               verbose=True)
for i in range (1,5):
    print(neighbor[i])

In [None]:
user_item.head(1)

In [None]:
user = 66
movie_liked = user_item[(user_item['user_id'] == user) & (user_item['rating'] > 7)]
movie_liked.sort_values(by='rating').head(1)

In [None]:
movie_liked.iloc[0]['user_id']

filter with variable value:

    
`.query("user_id == @user_id and rating > (@min_rating - 1)")['movie_id']`
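
A runnable sketch of this filter on a made-up dataframe (`toy_user_item` is an assumption). In `DataFrame.query`, the expression is passed as a string and `@name` refers to a Python variable in the surrounding scope, while a bare name refers to a column.

```python
import pandas as pd

# Toy stand-in for the user_item dataframe
toy_user_item = pd.DataFrame({
    'user_id':  [66, 66, 66, 7],
    'movie_id': [10, 20, 30, 10],
    'rating':   [9, 7, 6, 10],
})

user_id = 66
min_rating = 7

# `user_id` is the column, `@user_id` is the Python variable defined above
liked_ids = toy_user_item.query(
    "user_id == @user_id and rating > (@min_rating - 1)"
)['movie_id']
print(liked_ids.tolist())  # [10, 20]
```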

function `fn_movie_liked` created!

function `fn_movie_liked2` created!

- iterate over `user_by_movie` dataset

In [None]:
user_item.head(1)

In [None]:
user_id = 66
user_item[user_item['user_id'] == user_id].head(1)

In [None]:
user_item[user_item['user_id'] == user_id]['rating']

In [None]:
user_by_movie.head(1)

In [None]:
data2 = user_by_movie.loc[user_id].dropna()
data2.head(1)

In [None]:
data2[data2 > 7].head(1)

In [None]:
#deprecated function!
#user_id = 66
#liked = udacourse3.fn_movie_liked(item=user_item[user_item['user_id'] == user_id],
#                                  verbose=True)
#liked[0]

In [None]:
user_id = 66
liked = udacourse3.fn_movie_liked2(item=user_by_movie.loc[user_id].dropna(),
                                   sort=True,
                                   verbose=True)
#liked[0]
len(liked)

In [None]:
movie.head(1)

In [None]:
movie[movie['movie_id'].isin(liked)].head(1)

Original filtering machine:

`movies[movies['movie_id'].isin(movie_ids)]['movie']`
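
A sketch of the same `isin` filter on a toy table (the titles are invented). One detail worth knowing: the result follows the row order of the dataframe, not the order of the id list.

```python
import pandas as pd

# Toy stand-in for the movie dataframe
toy_movies = pd.DataFrame({
    'movie_id': [10, 20, 30],
    'movie':    ['Alpha (2001)', 'Beta (2010)', 'Gamma (2015)'],
})

liked_ids = [30, 10]  # order of preference
names = toy_movies[toy_movies['movie_id'].isin(liked_ids)]['movie'].tolist()
print(names)  # ['Alpha (2001)', 'Gamma (2015)']
```

If the order of the ids matters (for instance, best rated first), the names need to be re-ordered after the filter.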

In [None]:
movie_retrieved = udacourse3.fn_movie_name(df_movie=movie,
                                           movie_id=liked,
                                           verbose=True)
movie_retrieved[0]

In [None]:
watched = udacourse3.fn_movie_watched(df_user_movie=user_by_movie,
                                      user_id=66,
                                      lower_filter=None,
                                      verbose=True)
watched[0]

In [None]:
user_by_movie.head(1)

In [None]:
neighbor_id = 33854

filt_user = user_by_movie.loc[user_id].dropna()
filt_user.index

item=filt_user

movie_liked = item[item > 7]
movie_liked
np.array(movie_liked.index)

In [None]:
df_dist[df_dist['user1'] == user].head(1)

function `fn_make_recommendation` created!

name altered to `fn_make_recommendation_collab`

In [None]:
user=66
udacourse3.fn_make_recommendation_collab(filt_dist=df_dist[df_dist['user1'] == user],
                                         df_user_movie=user_by_movie,
                                         df_movie=movie,
                                         num_rec=10,
                                         limit=100,
                                         min_rating=7,
                                         sort=True,
                                         verbose=True)

In [None]:
df_dist[df_dist['user1'] == user].head(1)

In [None]:
user_id = 66
isinstance(user_id, int)

In [None]:
user_by_movie.loc[user_id].dropna().head(1)

In [None]:
df_dist.head(1)

function `fn_all_recommendation` created!

renamed to `fn_all_recommendation_collab`

#### Note: this is big data processing!
    
The first time you run this notebook, uncomment the lines below that create the file `recommended.pkl` on your computer. On later runs, comment them out again so that the data is only loaded from disk, which saves processing time!

In [None]:
#all_recs = udacourse3.fn_all_recommendation_collab(
#               df_dist=df_dist,
#               df_user_movie=user_by_movie,
#               df_movie=movie,
#               num_rec=10,
#               limit=100,
#               min_rating=7,
#               sort=False,                                 
#               verbose=False)
#len(all_recs)

In [None]:
#with open('recommended.pkl', 'wb') as handle:
#    pickle.dump(all_recs, handle)

with open('recommended.pkl', 'rb') as handle:
    all_recs = pickle.load(handle)
    
len(all_recs)

In [None]:
#This loads our solution dictionary so you can compare results
#FULL PATH IS "data/Term2/recommendations/lesson1/data/all_recs.p"
all_recs_sol = pd.read_pickle("data/all_recs.p")

In [None]:
assert all_recs[2] == udacourse3.fn_make_recommendation_collab(
                          filt_dist=df_dist[df_dist['user1'] == 2],
                          df_user_movie=user_by_movie,
                          df_movie=movie),\
"Oops!  Your recommendations for user 2 didn't match ours."
assert all_recs[26] == udacourse3.fn_make_recommendation_collab(
                          filt_dist=df_dist[df_dist['user1'] == 26],
                          df_user_movie=user_by_movie,
                          df_movie=movie),\
"Oops!  It actually wasn't possible to make any recommendations for user 26."
assert all_recs[1503] == udacourse3.fn_make_recommendation_collab(
                          filt_dist=df_dist[df_dist['user1'] == 1503],
                          df_user_movie=user_by_movie,
                          df_movie=movie),\
"Oops! Looks like your solution for user 1503 didn't match ours."
print("If you made it here, you now have recommendations for many users using collaborative filtering!")
#HTML('<img src="images/greatjob.webp">')

### Now What?

#### In Udacity notes:

"If you made it this far, you have successfully implemented a solution to making recommendations using collaborative filtering."

`8.` Let's do a quick recap of the steps taken to obtain recommendations using collaborative filtering.  

In [None]:
# Check your understanding of the results by correctly filling in the dictionary below
a = "pearson's correlation and spearman's correlation"
b = 'item based collaborative filtering'
c = "there were too many ratings to get a stable metric"
d = 'user based collaborative filtering'
e = "euclidean distance and pearson's correlation coefficient"
f = "manhattan distance and euclidean distance"
g = "spearman's correlation and euclidean distance"
h = "the spread in some ratings was zero"
i = 'content based recommendation'

sol_dict = {
    'The type of recommendation system implemented here was a ...': d,
    'The two methods used to estimate user similarity were: ': e,
    'There was an issue with using the correlation coefficient.  What was it?': h
}

t.test_recs(sol_dict)

"Additionally, let's take a closer look at some of the results.  There are two solution files that you read in to check your results, and you created these objects

* **df_dists** - a dataframe of user1, user2, euclidean distance between the two users
* **all_recs_sol** - a dictionary of all recommendations (key = user, value = list of recommendations)" 

`9.` Use these two objects along with the cells below to correctly fill in the dictionary below and complete this notebook!

In [None]:
#from importlib import reload 
#import tests as t
#t = reload(tests)

In [None]:
a = 567
b = 1503
c = 1319
d = 1325
e = 2526710
f = 0
g = 'Use another method to make recommendations - content based, knowledge based, model based collaborative filtering'

sol_dict2 = {
    'For how many pairs of users were we not able to obtain a measure of similarity using correlation?': e,
    'For how many pairs of users were we not able to obtain a measure of similarity using euclidean distance?': f,
    'For how many users were we unable to make any recommendations for using collaborative filtering?': c,
    'For how many users were we unable to make 10 recommendations for using collaborative filtering?': d,
    'What might be a way for us to get 10 recommendations for every user?': g   
}

t.test_recs2(sol_dict2)

In [None]:
# Users without recs
users_without_recs = []
for user, movie_recs in all_recs.items():
    if len(movie_recs) == 0:
        users_without_recs.append(user)
    
len(users_without_recs)

In [None]:
# NaN euclidean distance values
df_dist['eucl_dist'].isnull().sum()

In [None]:
# Users with fewer than 10 recs
users_with_less_than_10recs = []
for user, movie_recs in all_recs.items():
    if len(movie_recs) < 10:
        users_with_less_than_10recs.append(user)
    
len(users_with_less_than_10recs)

## Ways to Recommend - Content Based

## Fifth Notebook - L 21 - Content Based Recommendations

### Content Based Recommendations

#### In Udacity notes:

"In the previous notebook, you were introduced to a way to make recommendations using collaborative filtering.  However, using this technique there are a large number of users who were left without any recommendations at all.  Other users were left with fewer than the ten recommendations that were set up by our function to retrieve..."

To help these users out, let's try another technique: **content based** recommendations.  Let's start off where we were in the previous notebook.

In [None]:
# Read in the datasets
movie = udacourse3.fn_read_data('data/movies_clean.csv', remove_noisy_cols=True)
review = udacourse3.fn_read_data('data/reviews_clean.csv', remove_noisy_cols=True)

all_rec = pickle.load(open("data/all_recs.p", "rb"))

### Datasets

#### In Udacity notes:

"From the above, you now have access to three important items that you will be using throughout the rest of this notebook.  

`a.` **movie** - a dataframe of all of the movies in the dataset along with other content related information about the movies (genre and date)


`b.` **review** - this was the main dataframe used before for collaborative filtering, as it contains all of the interactions between users and movies.


`c.` **all_rec** - a dictionary where each key is a user, and the value is a list of movie recommendations based on collaborative filtering

For the individuals in **all_rec** who did receive 10 recommendations using collaborative filtering, we don't really need to worry about them.  However, there were a number of individuals in our dataset who did not receive any recommendations."

-----

`1.` Let's start with finding all of the users in our dataset who didn't get all 10 recommendations we would have liked them to have using collaborative filtering.  

In [None]:
user_with_all_rec = []
for user, movie_rec in all_rec.items():
    if len(movie_rec) > 9:
        user_with_all_rec.append(user)

print("There are {} users with all recommendations from collaborative filtering.".format(len(user_with_all_rec)))

user = np.unique(review['user_id'])
user_who_need_rec = np.setdiff1d(user, user_with_all_rec)

print("There are {} users who still need recommendations.".format(len(user_who_need_rec)))
print("This means that only {}% of users received all 10 of their recommendations using collaborative filtering"\
      .format(round(100 * len(user_with_all_rec) / len(np.unique(review['user_id'])), 2)))   

In [None]:
# Some test here might be nice
assert len(user_with_all_rec) == 22187
print("That's right there were still another 31781 users who needed recommendations \
when we only used collaborative filtering!")

### Content Based Recommendations

#### In Udacity notes:

"You will be doing a bit of a mix of content and collaborative filtering to make recommendations for the users this time.  This will allow you to obtain recommendations in many cases where we didn't make recommendations earlier."

`2.` Before finding recommendations, rank the user's ratings from highest to lowest. You will move through the movies in this order looking for other similar movies.

In [None]:
# create a dataframe similar to reviews, but ranked by rating for each user
ranked_review = review.sort_values(by=['user_id', 'rating'], 
                                   ascending=False)
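
A small sketch of the ranking step on made-up data may help to see what the sort does. The `toy_review` frame and its values are hypothetical stand-ins for `review`. Note that the cell above sorts both columns in descending order; the variant below keeps users ascending and only ranks the ratings from highest to lowest, which is enough for walking through each user's movies in order.

```python
import pandas as pd

# Hypothetical toy data standing in for the `review` dataframe
toy_review = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'movie_id': [10, 11, 10, 12, 13],
    'rating': [7, 9, 5, 10, 8],
})

# Users ascending, ratings descending within each user
toy_ranked = toy_review.sort_values(by=['user_id', 'rating'],
                                    ascending=[True, False])

# Movies of user 2, best rated first
user2_movies = toy_ranked.loc[toy_ranked['user_id'] == 2, 'movie_id'].tolist()
print(user2_movies)  # [12, 13, 10]
```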

### Similarities

#### In Udacity notes:

"In the collaborative filtering sections, you became quite familiar with different methods of determining the similarity (or distance) of two users.  We can perform similarities based on content in much the same way.  

In many cases, it turns out that one of the fastest ways we can find out how similar items are to one another (when our matrix isn't totally sparse like it was in the earlier section) is by simply using matrix multiplication.  If you are not familiar with this, an explanation is available [here by 3blue1brown](https://www.youtube.com/watch?v=LyGKycYT2v0) and another quick explanation is provided [on the post here](https://math.stackexchange.com/questions/689022/how-does-the-dot-product-determine-similarity).

For us to pull out a matrix that describes the movies in our dataframe in terms of content, we might just use the indicator variables related to **year** and **genre** for our movies.  

Then we can obtain a matrix of how similar movies are to one another by taking the dot product of this matrix with itself.  Notice in the below that the dot product where our 1 values overlap gives a value of 2 indicating higher similarity.  In the second dot product, the 1 values don't match up.  This leads to a dot product of 0 indicating lower similarity.

<img src="graphs/dotprod1.png" alt="Dot Product" height="500" width="500">

We can perform the dot product on a matrix of movies with content characteristics to provide a movie by movie matrix where each cell is an indication of how similar two movies are to one another.  In the below image, you can see that movies 1 and 8 are most similar, movies 2 and 8 are most similar and movies 3 and 9 are most similar for this subset of the data.  The diagonal elements of the matrix will contain the similarity of a movie with itself, which will be the largest possible similarity (which will also be the number of 1's in the movie row within the original movie content matrix).

<img src="graphs/moviemat.png" alt="Dot Product" height="500" width="500">

"

`3.` Create a numpy array that is a matrix of indicator variables related to year (by century) and movie genres by movie.  Perform the dot product of this matrix with itself (transposed) to obtain a similarity matrix of each movie with every other movie.  The final matrix should be 31245 x 31245.
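
Before running this on the full data, the idea can be checked on a toy case. The three indicator rows below are hypothetical (the column meanings are made up for illustration); they only show that overlapping 1 values raise the dot product and that the diagonal of the movie-by-movie matrix equals the number of 1's in each row.

```python
import numpy as np

# Hypothetical indicator rows: columns are [comedy, drama, action, 1900s, 2000s]
movie_a = np.array([1, 0, 0, 0, 1])
movie_b = np.array([1, 0, 0, 0, 1])
movie_c = np.array([0, 1, 0, 1, 0])

sim_ab = movie_a.dot(movie_b)  # both 1 values overlap
sim_ac = movie_a.dot(movie_c)  # no 1 values overlap
print(sim_ab, sim_ac)          # 2 0

# Movie-by-movie similarity: content matrix times its transpose
toy_content = np.array([movie_a, movie_b, movie_c])
toy_sim = toy_content.dot(toy_content.T)
print(toy_sim)
```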

In [None]:
movie.iloc[:,4:].head(1)

In [None]:
# Subset so movie_content is only using the dummy variables for each genre and the 3 century based year dummy columns
movie_content = np.array(movie.iloc[:,4:])
movie_content[0]

#### Note: this is big data processing!
    
So, the first time you run this notebook, uncomment the lines below that write the file `dot_prod.pkl` to your computer. On later runs, comment the write lines out again and only load the saved file, which saves processing time!

Take the dot product to obtain a movie x movie matrix of similarities:

*Observation: I could not save `dot_prod.pkl` on my computer. So every time I need it, I have to run the following lines...*

In [None]:
begin = time()

dot_prod_movie = movie_content.dot(np.transpose(movie_content))

end = time()
print('elapsed time: {:.4f}s'.format(end-begin))
dot_prod_movie

In [None]:
#with open('dot_prod.pkl', 'wb') as handle:
#    pickle.dump(dot_prod_movie, handle)

#with open('dot_prod.pkl', 'rb') as handle:
#    dot_prod_movie = pickle.load(handle)
    
#dot_prod_movie
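
One possible reason the file is hard to save is its size: a 31245 x 31245 matrix has about 976 million entries, so if it is stored as 64-bit integers it takes roughly 7.8 GB, against roughly 1 GB with one byte per entry. The sketch below is only an assumption about a workaround, not something taken from the course: it casts a small hypothetical content matrix to `uint8` and uses `np.save` / `np.load` instead of pickle. The cast is safe only while no movie row contains more than 255 ones.

```python
import os
import tempfile
import numpy as np

# Hypothetical small stand-in for `movie_content`
toy_content = np.array([[1, 0, 1],
                        [1, 1, 0],
                        [0, 1, 1]])

# uint8 is safe only while no row contains more than 255 ones
small = toy_content.astype(np.uint8)
toy_dot = small.dot(small.T)

# Save and reload in NumPy's own binary format
path = os.path.join(tempfile.mkdtemp(), 'dot_prod.npy')
np.save(path, toy_dot)
loaded = np.load(path)
print(loaded.dtype, loaded.nbytes)
```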

In [None]:
# create checks for the dot product matrix
assert dot_prod_movie.shape[0] == 31245
assert dot_prod_movie.shape[1] == 31245
assert dot_prod_movie[0, 0] == np.max(dot_prod_movie[0])
print("Looks like you passed all of the tests.")
print("Though they weren't very robust - if you want to write some of your own, I won't complain!")

### For Each User...

#### In Udacity notes:

"Now that you have a matrix where each user has their ratings ordered.  You also have a second matrix where movies are each axis, and the matrix entries are larger where the two movies are more similar and smaller where the two movies are dissimilar.  This matrix is a measure of content similarity. Therefore, it is time to get to the fun part.

For each user, we will perform the following:

    i. For each movie, find the movies that are most similar that the user hasn't seen.

    ii. Continue through the available, rated movies until 10 recommendations or until there are no additional movies.

As a final note, you may need to adjust the criteria for 'most similar' to obtain 10 recommendations.  As a first pass, I used only movies with the highest possible similarity to one another as similar enough to add as a recommendation."
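
The two steps above can be sketched on toy data. Everything here (`movie_ids`, `toy_content`, `sketch_recommend`) is hypothetical and only illustrates the loop; it is not the implementation of `fn_make_recommendation_content`. It uses the strict criterion from the note: only movies tied for the highest similarity count as similar.

```python
import numpy as np

# Hypothetical toy data: 4 movie ids and their content indicator rows
movie_ids = np.array([100, 101, 102, 103])
toy_content = np.array([[1, 0, 1],
                        [1, 0, 1],
                        [0, 1, 0],
                        [1, 0, 1]])
toy_sim = toy_content.dot(toy_content.T)

def sketch_recommend(ranked_seen, n_rec=10):
    """Return up to n_rec unseen movie ids, walking the user's
    movies from best rated to worst rated."""
    recs = []
    for movie_id in ranked_seen:
        idx = np.where(movie_ids == movie_id)[0][0]
        # keep only movies tied for the highest similarity with this movie
        best = np.where(toy_sim[idx] == toy_sim[idx].max())[0]
        for cand in movie_ids[best]:
            if cand not in ranked_seen and cand not in recs:
                recs.append(int(cand))
            if len(recs) >= n_rec:
                return recs
    return recs

print(sketch_recommend([100, 102]))  # [101, 103]
```

In this toy case movie 102 contributes nothing, because the only movie tied for its highest similarity is itself. That is the same effect that leaves some real users without recommendations under the strict rule.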

`3.` In the below cell, complete each of the functions needed for making content based recommendations.

In [None]:
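def find_similar_movie(movie_id):
    # row position of this movie in the movie dataframe
    movie_idx = np.where(movie['movie_id'] == movie_id)[0][0]
    # positions of all movies tied for the highest similarity (includes the movie itself)
    similar_idx = np.where(dot_prod_movie[movie_idx] == np.max(dot_prod_movie[movie_idx]))[0]
    # names of those movies
    similar_movie = np.array(movie.iloc[similar_idx, ]['movie'])
    return similar_movie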

function `fn_find_similar_movie` created!

function `fn_find_similar_movie` adapted to use `fn_get_movie_name` service

In [None]:
similar = udacourse3.fn_find_similar_movie(df_dot_product=dot_prod_movie,
                                           df_movie=movie,
                                           movie_id=2106284,
                                           verbose=True)
print(similar)

function `fn_get_movie_name` created!

- test for default

In [None]:
# movie_id = [2106284, 231122344441]  # alternative input: a list of ids
movie_id = 2106284
udacourse3.fn_get_movie_name(df_movie=movie,
                             movie_id=movie_id,
                             verbose=True)

- test for `fn_find_similar_movie` function

In [None]:
movie_idx = 21310
udacourse3.fn_get_movie_name(df_movie=movie,
                             movie_id=movie_idx,
                             by_id=False,
                             as_list=False,
                             verbose=True)

function `fn_make_recommendation_content` created!

In [None]:
# user = [2, 22]  # alternative input: a list of users
user = 2
rec = udacourse3.fn_make_recommendation_content(
          df_dot_product=dot_prod_movie,
          df_movie=movie,
          user=user,
          verbose=True)
rec

### How Did We Do?

#### In Udacity notes:

"Now that you have made the recommendations, how did we do in providing everyone with a set of recommendations?"

`4.` Use the cells below to see how many individuals you were able to make recommendations for, as well as explore characteristics about individuals who you were not able to make recommendations for.  

In [None]:
# Explore recommendations
user_without_all_rec = []
user_with_all_rec = []
no_rec = []
for user, movie_rec in rec.items():
    if len(movie_rec) < 10:
        user_without_all_rec.append(user)
    if len(movie_rec) > 9:
        user_with_all_rec.append(user)
    if len(movie_rec) == 0:
        no_rec.append(user)

In [None]:
# Some characteristics of my content based recommendations
print("There were {} users without all 10 recommendations we would have liked to have."\
      .format(len(user_without_all_rec)))
print("There were {} users with all 10 recommendations we would like them to have."\
      .format(len(user_with_all_rec)))
print("There were {} users with no recommendations at all!".format(len(no_rec)))

In [None]:
from importlib import reload 
import udacourse3

udacourse3 = reload(udacourse3)

In [None]:
#a closer look at individual user characteristics
user_item = review[['user_id', 'movie_id', 'rating']]
user_by_movie = udacourse3.fn_create_user_movie(df_user_item=user_item, 
                                                verbose=True)

In [None]:
user_item.head(1)

In [None]:
user_by_movie.head(1)

In [None]:
user_id = 189
watched = udacourse3.fn_movie_watched(
              df_user_movie=user_by_movie,
              user_id=user_id,
              verbose=True)
watched

In [None]:
counter = 0
print("Some of the movie lists for users without any recommendations include:")
for user_id in no_rec:
    print('user id:', user_id)
    print(udacourse3.fn_get_movie_name(
              df_movie=movie,
              movie_id=udacourse3.fn_movie_watched(
                  df_user_movie=user_by_movie,
                  user_id=user_id,
                  verbose=True),
              verbose=True))
    counter += 1
    if counter > 10:
        break

### Now What?  

#### In Udacity notes:

"Well, if you were really strict with your criteria for how similar two movies are (like I was initially), then you still have some users that don't have all 10 recommendations (and a small group of users who have no recommendations at all). 

As stated earlier, recommendation engines are a bit of an **art** and a **science**.  There are a number of things we still could look into - how do our collaborative filtering and content based recommendations compare to one another? How could we incorporate user input along with collaborative filtering and/or content based recommendations to improve any of our recommendations?  How can we truly gain recommendations for every user?"

`5.` In this last step feel free to explore any last ideas you have with the recommendation techniques we have looked at so far.  You might choose to make the final needed recommendations using the first technique with just top ranked movies.  You might also loosen up the strictness in the similarity needed between movies.  Be creative and share your insights with your classmates!
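
One way to loosen the strictness is to take the top `k` most similar movies instead of only those tied for the maximum. The sketch below is an assumption about how that could look; `sketch_top_k_similar` and the toy matrix are hypothetical and not part of `udacourse3`.

```python
import numpy as np

def sketch_top_k_similar(sim_matrix, movie_idx, k=10):
    """Return row indices of the k movies most similar to movie_idx,
    excluding the movie itself."""
    scores = sim_matrix[movie_idx].astype(float)
    scores[movie_idx] = -np.inf  # never recommend the movie itself
    # stable sort on negated scores: highest similarity first, ties by index
    order = np.argsort(-scores, kind='stable')
    return order[:k]

# Hypothetical content matrix: 4 movies, 4 indicator columns
toy_content = np.array([[1, 1, 1, 0],
                        [1, 1, 0, 0],
                        [1, 0, 0, 1],
                        [0, 0, 0, 1]])
toy_sim = toy_content.dot(toy_content.T)

top2 = sketch_top_k_similar(toy_sim, movie_idx=0, k=2)
print(top2)  # [1 2]
```

With the strict rule, movie 0 would have no candidates here, since no other movie matches its self-similarity of 3. The top-k variant still returns its nearest neighbours, at the cost of sometimes recommending weakly related movies.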