In [3]:
import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext("local", "pyspark-shell")

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Recommendations Are Everywhere

## Why learn how to build recommendation engines?


This notebook covers building recommendation engines in PySpark using Alternating Least Squares (ALS). Companies offer recommendations based on the preferences of their users.

In [29]:
import pandas as pd

TJ_ALS_recs = pd.read_csv("TJ_ALS_recs.csv")


def get_ALS_recs(list_of_user_names, recs=TJ_ALS_recs):
    """
    Prints recommendations generated by the ALS algorithm.

    Parameters:
      - list_of_user_names: list or array of user names
      - recs: pandas DataFrame of ALS-generated recommendations

    Prints the rows of recs whose userId is in list_of_user_names,
    sorted by user and predicted rating (descending). Returns None.
    """
    if len(list_of_user_names) == 0:
        print("None")
        return None
    # Keep only the requested users; works for any number of names
    selected = recs[recs.userId.isin(list_of_user_names)]
    # The recommendations are keyed by userId (there is no user_name column)
    print(selected.sort_values(by=["userId", "pred_rating"], ascending=[False, False]))
    return None

TJ_ratings = spark.read.csv("TJ_ratings.csv", header=True)

TJ_ratings.show()

get_ALS_recs(["Taylor", "Jane"])

+---------+--------------------+------+
|user_name|          movie_name|rating|
+---------+--------------------+------+
|   Taylor|            Twilight|   4.9|
|   Taylor|  A Walk to Remember|   4.5|
|   Taylor|        The Notebook|   5.0|
|   Taylor|Raiders of the Lo...|   1.2|
|   Taylor|      The Terminator|   1.0|
|   Taylor|      Mrs. Doubtfire|   1.0|
|     Jane|            Iron Man|   4.8|
|     Jane|Raiders of the Lo...|   4.9|
|     Jane|      The Terminator|   4.6|
|     Jane|           Anchorman|   1.2|
|     Jane|        Pretty Woman|   1.0|
|     Jane|           Toy Story|   1.2|
+---------+--------------------+------+

    userId  pred_rating                 title          genres
0   Taylor         3.89   Seven Pounds (2008)           Drama
1   Taylor         3.61       Cure The (1995)           Drama
2   Taylor         3.55   Kiss Me Guido (1997          Comedy
3   Taylor         3.29  You've Got Mail (199  Comedy|Romance
4   Taylor         3.27  10 Things I Hate Abo  Co

## Recommendation engine types and data types

There are two basic types of recommendation engines: collaborative filtering engines and content-based filtering engines. Both aim to offer meaningful recommendations, but they do so in slightly different ways.

Content-based filtering uses the features of the items themselves, such as genre or language, to recommend items similar to those a user already likes.
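As a rough illustration of the content-based idea, the sketch below one-hot encodes genre and language for a handful of movies and ranks them by cosine similarity to a movie the user liked. The feature values and the helper names (`encode`, `cosine`) are assumptions made for this example and are not part of any library.

```python
import math

# Hypothetical item features (genre and language), chosen for illustration
movies = {
    "Finding Nemo":      {"genre": "Animated", "language": "English"},
    "Toy Story":         {"genre": "Animated", "language": "English"},
    "The Terminator":    {"genre": "Action",   "language": "English"},
    "Life Is Beautiful": {"genre": "Drama",    "language": "Italian"},
}

def encode(features, vocabulary):
    """One-hot encode a movie's feature values against a fixed vocabulary."""
    present = set(features.values())
    return [1.0 if term in present else 0.0 for term in vocabulary]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Every distinct feature value becomes one dimension of the item vectors
vocabulary = sorted({value for f in movies.values() for value in f.values()})
vectors = {title: encode(f, vocabulary) for title, f in movies.items()}

# Rank the other movies by similarity to one the user liked
liked = "Finding Nemo"
ranked = sorted(
    ((cosine(vectors[liked], vec), title)
     for title, vec in vectors.items() if title != liked),
    reverse=True,
)
for score, title in ranked:
    print(f"{title}: {score:.2f}")
```

Only item features are used here; no other user's ratings enter the calculation, which is what separates this approach from collaborative filtering.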

Collaborative filtering works differently: it is based on user similarity and preferences. The ALS algorithm can mathematically group you with similar users. ALS can be used in both content-based and collaborative applications.
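To make "user similarity" concrete, here is a minimal neighborhood-based sketch: it finds the user whose ratings most resemble the target user's and suggests movies that neighbor has rated but the target has not. This is a simpler technique than ALS, shown only to illustrate the idea. The users, ratings, and function names are hypothetical.

```python
import math

# Hypothetical explicit ratings: user -> {movie: rating}
ratings = {
    "Ann":  {"Twilight": 5.0, "The Notebook": 4.5, "The Terminator": 1.0},
    "Ben":  {"Twilight": 4.5, "The Notebook": 5.0, "Pretty Woman": 4.0},
    "Cara": {"Twilight": 1.0, "The Terminator": 5.0, "Iron Man": 4.5},
}

def similarity(u, v):
    """Cosine similarity over the movies both users have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in shared)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Suggest unseen movies rated by the most similar other user."""
    neighbor = max((v for v in ratings if v != user),
                   key=lambda v: similarity(user, v))
    unseen = {m: r for m, r in ratings[neighbor].items()
              if m not in ratings[user]}
    return neighbor, sorted(unseen, key=unseen.get, reverse=True)

print(recommend("Ann"))
```

Ann and Ben agree on the two movies they have both rated, so Ben is chosen as Ann's neighbor and his remaining movie becomes the suggestion. With very few shared ratings this similarity measure is unreliable, which is one reason model-based methods such as ALS are preferred on sparse data.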

There are two main types of ratings: explicit ratings and implicit ratings. Explicit ratings are ones you enter yourself, such as stars or a thumbs up or down. Implicit ratings are based on passive tracking of your behavior and are generated from the frequency of your actions. For example, the number of movies a user has watched in each genre can be turned into a confidence rating for that genre.

### Collaborative vs content based filtering

In [30]:
df = spark.read.csv("df.csv", header=True)

df.show()

+------+-------+-----------------+--------+-------------+--------+------+
|UserId|MovieId|      Movie_Title|   Genre|Year_Produced|Language|rating|
+------+-------+-----------------+--------+-------------+--------+------+
| User1|   2112|     Finding Nemo|Animated|         2003| English|     3|
| User1|   2113|   The Terminator|  Action|         1984| English|     0|
| User1|   2114|       Spinal Tap|  Satire|         1984| English|     4|
| User1|   2115|Life Is Beautiful|   Drama|         1998| Italian|     4|
| User2|   2112|     Finding Nemo|Animated|         2003| English|     4|
| User2|   2113|   The Terminator|  Action|         1984| English|     0|
| User2|   2114|       Spinal Tap|  Satire|         1984| English|     0|
| User2|   2115|Life Is Beautiful|   Drama|         1998| Italian|     4|
| User3|   2112|     Finding Nemo|Animated|         2003| English|     1|
| User3|   2113|   The Terminator|  Action|         1984| English|     2|
| User3|   2114|       Spinal Tap|  Sa

Because this dataset includes descriptive tags like genre and language, as well as user ratings, it is suited for both collaborative and content-based filtering.

### Implicit vs explicit data


In [31]:
df1 = spark.read.csv("df1.csv", header=True)
df1.show()

+--------------------+------------------+---------+
|         Movie_Title|             Genre|Num_Views|
+--------------------+------------------+---------+
|        Finding Nemo|Animated Childrens|       12|
|           Toy Story|Animated Childrens|        6|
|            Iron Man|            Action|        1|
|     Captain America|            Action|        1|
|     The Incredibles|Animated Childrens|        9|
|              Frozen|Animated Childrens|       22|
|The Shawshank Red...|             Drama|        2|
|  Rabbit Proof Fence|             Drama|        2|
|Searching for Sug...|       Documentary|        3|
|              Powder|             Drama|        1|
|        The Fugitive|            Action|        2|
+--------------------+------------------+---------+



This dataset includes user behavior counts which are used as implicit ratings.

### Ratings data types

In [35]:
markus_rating = spark.read.csv("markus_ratings.csv", header=True, inferSchema=True)

markus_rating.groupBy("Genre").sum("Num_Views").show()

+------------------+--------------+
|             Genre|sum(Num_Views)|
+------------------+--------------+
|             Drama|             5|
|       Documentary|             3|
|            Action|             4|
|Animated Childrens|            49|
+------------------+--------------+
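One simple way to turn these view counts into implicit ratings is to express each genre as a share of the user's total views. The sketch below does that with the sums from the output above. Scaling by share is an assumption made for illustration and is not necessarily how ALS treats implicit data internally.

```python
# Total views per genre, copied from the groupBy output above
genre_views = {
    "Drama": 5,
    "Documentary": 3,
    "Action": 4,
    "Animated Childrens": 49,
}

total_views = sum(genre_views.values())

# Share of all views per genre: a simple stand-in for an implicit rating
implicit_scores = {
    genre: views / total_views for genre, views in genre_views.items()
}

# Print genres from most to least watched
for genre, score in sorted(implicit_scores.items(),
                           key=lambda kv: kv[1], reverse=True):
    print(f"{genre}: {score:.2f}")
```

Animated Childrens accounts for 49 of the 61 views, so it receives by far the highest score even though the user never rated anything explicitly.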



## Uses for recommendation engines

The ALS algorithm also has uses beyond recommendations. These include latent feature discovery, item grouping, dimensionality reduction, and image compression.

Consumers categorize products based on their own experience. Understanding those categories can strengthen marketing strategies, and ALS can help uncover them.

ALS factors the ratings matrix into two matrices, one for users and one for products. Each factor matrix has an unlabeled axis whose entries are latent features, and the number of latent features is referred to as the "rank" of these matrices. From the ratings alone, ALS can determine that certain movies are of different types. This allows us to see mathematically how users experience these movies and to what degree users feel each product falls into each category.
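The factorization described above can be sketched in a few lines of NumPy. This is a minimal, single-machine illustration under stated assumptions, not Spark's implementation: the ratings matrix is made up, a 0 is treated as "not rated", and the function name `als` and its parameters (`rank`, `reg`, `iterations`, `seed`) are chosen for this example.

```python
import numpy as np

def als(R, mask, rank=2, reg=0.1, iterations=20, seed=0):
    """Factor R (users x items) into U @ P using observed entries only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.random((n_users, rank))   # user factors
    P = rng.random((rank, n_items))   # item (product) factors
    ridge = reg * np.eye(rank)        # regularization keeps the solves stable
    for _ in range(iterations):
        # Hold P fixed and solve a small ridge regression for each user
        for u in range(n_users):
            seen = mask[u]
            A = P[:, seen] @ P[:, seen].T + ridge
            b = P[:, seen] @ R[u, seen]
            U[u] = np.linalg.solve(A, b)
        # Hold U fixed and solve a small ridge regression for each item
        for i in range(n_items):
            seen = mask[:, i]
            A = U[seen].T @ U[seen] + ridge
            b = U[seen].T @ R[seen, i]
            P[:, i] = np.linalg.solve(A, b)
    return U, P

# Hypothetical ratings; 0 marks a movie the user has not rated
R = np.array([
    [5.0, 4.5, 0.0, 1.0],
    [4.5, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.5],
    [0.0, 1.0, 4.5, 5.0],
])
mask = R > 0

U, P = als(R, mask)
predicted = U @ P
rmse = np.sqrt(np.mean((predicted[mask] - R[mask]) ** 2))
print(np.round(predicted, 2))
print(f"RMSE on observed ratings: {rmse:.3f}")
```

The "alternating" in the name refers to the two inner loops: with one factor matrix held fixed, each row or column of the other has a closed-form least-squares solution. The entries of `predicted` at the masked-out positions are the model's estimates for movies the user has not rated.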

### Confirm understanding of latent features


In [37]:
Pi = spark.read.csv("Pi.csv", header=True)
Pi.show()

+---------+--------+------------+--------+---------+------------+------+----------+
| Lat Feat|Iron Man|Finding Nemo|Avengers|Toy Story|Forrest Gump|Wall-E|Green Mile|
+---------+--------+------------+--------+---------+------------+------+----------+
| Animated|     0.2|         2.4|     0.1|      2.4|           0|   2.5|         0|
|    Drama|     1.5|         1.4|     1.4|      1.3|         1.8|   1.8|       2.5|
|Superhero|     2.5|         1.1|     2.4|      0.9|         0.2|   0.9|      0.09|
|   Comedy|     1.9|           2|     1.5|      2.2|         1.2|   0.3|      0.01|
|Tom Hanks|       0|           0|       0|      1.8|         2.2|     0|       2.5|
+---------+--------+------------+--------+---------+------------+------+----------+
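Given an item factor matrix like the one above, a predicted rating is the dot product of a user's latent-feature weights with a movie's column. The sketch below copies the values from the `Pi` table and pairs them with a hypothetical user who responds mostly to animation and somewhat to comedy. The `user_factors` values are invented for illustration, and real latent features would not come with labels.

```python
movies = ["Iron Man", "Finding Nemo", "Avengers", "Toy Story",
          "Forrest Gump", "Wall-E", "Green Mile"]

# Item factor matrix, copied from the Pi table above (one row per latent feature)
item_factors = {
    "Animated":  [0.2, 2.4, 0.1, 2.4, 0.0, 2.5, 0.0],
    "Drama":     [1.5, 1.4, 1.4, 1.3, 1.8, 1.8, 2.5],
    "Superhero": [2.5, 1.1, 2.4, 0.9, 0.2, 0.9, 0.09],
    "Comedy":    [1.9, 2.0, 1.5, 2.2, 1.2, 0.3, 0.01],
    "Tom Hanks": [0.0, 0.0, 0.0, 1.8, 2.2, 0.0, 2.5],
}

# Hypothetical user factors: how strongly this user responds to each feature
user_factors = {"Animated": 1.0, "Drama": 0.1, "Superhero": 0.0,
                "Comedy": 0.5, "Tom Hanks": 0.0}

# Predicted score = dot product of the user's weights with each movie's column
scores = {
    movie: sum(user_factors[f] * item_factors[f][j] for f in item_factors)
    for j, movie in enumerate(movies)
}

# Highest-scoring movies are the ones to recommend first
top_three = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_three)
```

The three animated titles come out on top for this user because the `Animated` row carries most of the weight in the dot product.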

