<a href="https://colab.research.google.com/github/VisshnuPrethi/Pearson-correlation-coefficient-/blob/main/Pyspark_BookPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Building a Memory-Based Book Recommendation System Using Pearson Collaborative Filtering**

Submitted by: Visshnu Prethi Manjere Kumar and Yaamini Sree Dilli Shankar

The project is divided into two major components:
1. Data Cleaning and Preparation
2. Collaborative Filtering and Recommendation Generation

The code below mounts Google Drive in Colab, sets the data folder path, creates the folder if it does not exist, and lists the files inside it.


In [None]:
import os
from google.colab import drive
drive.mount("/content/drive", force_remount=True)
drive_dir = "/content/drive/MyDrive/ens/cnam/data/"
INPUT_CSV = os.path.join(drive_dir, "Books.csv")  # avoids the double slash from drive_dir + "/Books.csv"
os.makedirs(drive_dir, exist_ok=True)
os.listdir(drive_dir)

Mounted at /content/drive


['Books.csv']

In [None]:
#Installing the PySpark and Findspark libraries required to run Spark in Google Colab.
!pip install -q pyspark
!pip install -q findspark

In [None]:
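# Set the system paths.
# Note: the SPARK_HOME path below is tied to Python 3.12; adjust it if the runtime uses another version.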
os.environ["SPARK_HOME"] = "/usr/local/lib/python3.12/dist-packages/pyspark"
os.environ["JAVA_HOME"] = "/usr"

In [None]:
# Import the necessary PySpark functions, set the output directory, and create a Spark session to process the data.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import (
    coalesce, col, count, desc, expr, length, lit,
    regexp_replace, row_number, split, trim, when,
)
OUT_DIR = "./out"

spark = SparkSession.builder \
    .appName("BooksPearsonCF") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN")

In [None]:
#This code sets the folder path where the CSV dataset files are stored
DATASET_DIR="/content/drive/MyDrive/ens/cnam/data"

In [None]:
!head $DATASET_DIR/Books.csv

,user_id,location,age,isbn,rating,book_title,book_author,year_of_publication,publisher,img_s,img_m,img_l,Summary,Language,Category,city,state,country
0,2,"stockton, california, usa",18,195153448,0,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.MZZZZZZZ.jpg,http://images.amazon.com/images/P/0195153448.01.LZZZZZZZ.jpg,"Provides an introduction to classical myths placing the addressed
topics within their historical context, discussion of archaeological
evidence as support for mythical events, and how these themes have
been portrayed in literature, art, ...",en,['Social Science'],stockton,california,usa
1,8,"timmins, ontario, canada",34.74389988,2005018,5,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.01.THUMBZZZ.jpg,http://images.amazon.com/images/P/0002005018.01.MZZZZZZZ.jpg,http://images.amazon.com/

# **Data Cleaning and Preparation**
1. Read the Books.csv file
2. Display the schema
3. Display the columns (attributes)
4. Display the content (5 books)



In [None]:
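# Read every column as a string (inferSchema=False); ratings are cast to numbers later.
# Note: without the multiLine option, quoted Summary fields that span several lines are split into separate rows.
raw = spark.read.option("header", True).option("inferSchema", False).csv(INPUT_CSV)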
raw.show(5, truncate=False)
raw.printSchema()

+---------------------------------------+-----------------------------+-------------------------+-----------+------------------+--------+-------------------+--------------------+-------------------+-----------------------+------------------------------------------------------------+------------------------------------------------------------+------------------------------------------------------------+----------------------------------------------------------------------+--------+--------+----+-----+-------+
|_c0                                    |user_id                      |location                 |age        |isbn              |rating  |book_title         |book_author         |year_of_publication|publisher              |img_s                                                       |img_m                                                       |img_l                                                       |Summary                                                               |Language|C

# **Exercise 1 : Inspect and clean the Raw Data**

In [None]:
# This code selects important columns, renames them to standard names, and displays the first 5 rows.
df = raw.select("user_id", "isbn", "rating", "book_title", "book_author")
df = df.withColumnRenamed("isbn", "ISBN") \
       .withColumnRenamed("book_title", "Title") \
       .withColumnRenamed("book_author", "Author")
df.show(5, truncate=False)


+-----------------------------+------------------+--------+-------------------+--------------------+
|user_id                      |ISBN              |rating  |Title              |Author              |
+-----------------------------+------------------+--------+-------------------+--------------------+
|2                            |195153448         |0       |Classical Mythology|Mark P. O. Morford  |
| discussion of archaeological|NULL              |NULL    |NULL               |NULL                |
| and how these themes have   |NULL              |NULL    |NULL               |NULL                |
| art                         |['Social Science']|stockton|california         |usa                 |
|8                            |2005018           |5       |Clara Callan       |Richard Bruce Wright|
+-----------------------------+------------------+--------+-------------------+--------------------+
only showing top 5 rows
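
The fragment rows above (for example ` discussion of archaeological` in the `user_id` column) suggest that the quoted `Summary` field contains line breaks and that the file was read one physical line at a time. The sketch below uses plain Python rather than Spark, with a made-up sample, to show the difference between splitting on lines and using a quote-aware parser. In Spark, the CSV reader's `multiLine` option is the usual way to get the second behaviour; whether it fully repairs this particular file is an assumption that has not been checked here.

```python
import csv
import io

# Hypothetical sample: the Summary field of the first record contains a line break.
sample = (
    'user_id,isbn,Summary,country\n'
    '2,195153448,"Provides an introduction\nto classical myths, art, ...",usa\n'
    '8,2005018,"A novel",canada\n'
)

# Line-oriented reading: every physical line becomes its own "record".
naive_rows = [line.split(",") for line in sample.splitlines()[1:]]

# Quote-aware reading: the embedded newline stays inside the Summary field.
parsed_rows = list(csv.reader(io.StringIO(sample)))[1:]

print("line-oriented records:", len(naive_rows))
print("quote-aware records:", len(parsed_rows))
print("first record country:", parsed_rows[0][3])
```

The line-oriented version produces one extra fragment row whose first column holds the tail of the summary, which is the same pattern seen in the `df.show` output.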



In [None]:

# Clean the ISBNs by keeping only digits, drop empty ISBNs and ISBNs that appear fewer than 2 times, then show the distinct count and sample rows.

df = df.withColumn("ISBN", regexp_replace(col("ISBN"), r"[^0-9]", ""))
df = df.filter((col("ISBN").isNotNull()) & (length(col("ISBN")) > 0))
isbn_counts = df.groupBy("ISBN").agg(count("*").alias("cnt"))
df = df.join(isbn_counts.filter(col("cnt") >= 2), on="ISBN", how="inner")
print("Unique ISBNs after cleaning:", df.select("ISBN").distinct().count())
df.show(5, truncate=False)


Unique ISBNs after cleaning: 1855
+-------+-------+------+------------+--------------------+---+
|ISBN   |user_id|rating|Title       |Author              |cnt|
+-------+-------+------+------------+--------------------+---+
|2005018|8      |5     |Clara Callan|Richard Bruce Wright|14 |
|2005018|11400  |0     |Clara Callan|Richard Bruce Wright|14 |
|2005018|11676  |8     |Clara Callan|Richard Bruce Wright|14 |
|2005018|41385  |0     |Clara Callan|Richard Bruce Wright|14 |
|2005018|67544  |8     |Clara Callan|Richard Bruce Wright|14 |
+-------+-------+------+------------+--------------------+---+
only showing top 5 rows
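
To make the effect of the `[^0-9]` pattern concrete, here is a small plain-Python sketch. The `clean_isbn` helper and the example strings are hypothetical and chosen only for illustration; Python's `re.sub` is used as a stand-in for Spark's `regexp_replace`, which behaves the same way for this simple pattern.

```python
import re

def clean_isbn(value):
    """Hypothetical helper: keep digits only, like regexp_replace(col, r"[^0-9]", "")."""
    return re.sub(r"[^0-9]", "", value)

examples = ["2005018", "0-19-515344-8", "['Social Science']", "034545104X"]
cleaned = {value: clean_isbn(value) for value in examples}

for value, result in cleaned.items():
    print(repr(value), "->", repr(result))
```

Two consequences follow from this sketch: text fragments with no digits collapse to an empty string and are then removed by the `length(...) > 0` filter, and an ISBN-10 that ends in the check character `X` loses that character.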



In [None]:
# Convert the ratings to numbers and keep only valid ratings between 0 and 10.

df = df.withColumn("rating", col("rating").cast("double"))
df = df.filter((col("rating").isNotNull()) & (col("rating") >= 0) & (col("rating") <= 10))
df.show(5, truncate=False)


+-------+-------+------+------------+--------------------+---+
|ISBN   |user_id|rating|Title       |Author              |cnt|
+-------+-------+------+------------+--------------------+---+
|2005018|8      |5.0   |Clara Callan|Richard Bruce Wright|14 |
|2005018|11400  |0.0   |Clara Callan|Richard Bruce Wright|14 |
|2005018|11676  |8.0   |Clara Callan|Richard Bruce Wright|14 |
|2005018|41385  |0.0   |Clara Callan|Richard Bruce Wright|14 |
|2005018|67544  |8.0   |Clara Callan|Richard Bruce Wright|14 |
+-------+-------+------+------------+--------------------+---+
only showing top 5 rows



In [None]:
# Keep only users with at least 5 ratings and show the count and sample rows.

user_counts = df.groupBy("user_id").agg(count("*").alias("n_ratings"))
active_users = user_counts.filter(col("n_ratings") >= 5)
df = df.join(active_users.select("user_id"), on="user_id", how="inner")

print("Active users count:", active_users.count())
df.show(5, truncate=False)


Active users count: 3876
+-------+-------+------+------------+--------------------+---+
|user_id|ISBN   |rating|Title       |Author              |cnt|
+-------+-------+------+------------+--------------------+---+
|8      |2005018|5.0   |Clara Callan|Richard Bruce Wright|14 |
|11400  |2005018|0.0   |Clara Callan|Richard Bruce Wright|14 |
|11676  |2005018|8.0   |Clara Callan|Richard Bruce Wright|14 |
|85526  |2005018|0.0   |Clara Callan|Richard Bruce Wright|14 |
|96054  |2005018|0.0   |Clara Callan|Richard Bruce Wright|14 |
+-------+-------+------+------------+--------------------+---+
only showing top 5 rows



In [None]:

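# Keep only the highest rating for each (user, book) pair and show sample rows.
# The selected columns carry no timestamp, so "latest" cannot be determined; the window orders by rating instead.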

w = Window.partitionBy("user_id", "ISBN").orderBy(desc("rating"))
df = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")
df.show(5, truncate=False)


+-------+---------+------+------------------------------------------------------------------------+-------------+---+
|user_id|ISBN     |rating|Title                                                                   |Author       |cnt|
+-------+---------+------+------------------------------------------------------------------------+-------------+---+
|100009 |385504209|8.0   |The Da Vinci Code                                                       |Dan Brown    |883|
|100009 |440224675|0.0   |Hannibal                                                                |Thomas Harris|284|
|100009 |440234743|0.0   |The Testament                                                           |John Grisham |422|
|100009 |553582747|0.0   |From the Corner of His Eye                                              |Dean Koontz  |165|
|100009 |60392452 |8.0   |Stupid White Men ...and Other Sorry Excuses for the State of the Nation!|Michael Moore|283|
+-------+---------+------+------------------------------

In [None]:
# This code trims extra spaces and fills missing book titles and authors with default values.

df = df.withColumn("Title", trim(col("Title")))
df = df.withColumn("Title", when((col("Title").isNull()) | (col("Title") == ""), "Unknown Title").otherwise(col("Title")))

df = df.withColumn("Author", trim(col("Author")))
df = df.withColumn("Author", when((col("Author").isNull()) | (col("Author") == ""), "Unknown Author").otherwise(col("Author")))

df.show(5, truncate=False)


+-------+---------+------+------------------------------------------------------------------------+-------------+---+
|user_id|ISBN     |rating|Title                                                                   |Author       |cnt|
+-------+---------+------+------------------------------------------------------------------------+-------------+---+
|100009 |385504209|8.0   |The Da Vinci Code                                                       |Dan Brown    |883|
|100009 |440224675|0.0   |Hannibal                                                                |Thomas Harris|284|
|100009 |440234743|0.0   |The Testament                                                           |John Grisham |422|
|100009 |553582747|0.0   |From the Corner of His Eye                                              |Dean Koontz  |165|
|100009 |60392452 |8.0   |Stupid White Men ...and Other Sorry Excuses for the State of the Nation!|Michael Moore|283|
+-------+---------+------+------------------------------

# **Exercise 2 : Final Inspection**

In [None]:
# OUT_DIR = "/tmp/cleaned_data"
os.makedirs(OUT_DIR, exist_ok=True)

ratings_clean = df.select("user_id", "ISBN", "rating").distinct()
books_clean = df.select("ISBN", "Title", "Author").distinct()
users_active = df.select("user_id").distinct()

ratings_clean.write.mode("overwrite").option("header",True).csv(os.path.join(OUT_DIR, "ratings_clean"))
books_clean.write.mode("overwrite").option("header",True).csv(os.path.join(OUT_DIR, "books_clean"))
users_active.write.mode("overwrite").option("header",True).csv(os.path.join(OUT_DIR, "users_active"))

print("Cleaned datasets saved in:", OUT_DIR)


Cleaned datasets saved in: ./out


In [None]:
# Check ratings are within [0,10]
ratings_out_of_range = ratings_clean.filter((col("rating") < 0) | (col("rating") > 10)).count()
print("Ratings outside [0,10]:", ratings_out_of_range)

Ratings outside [0,10]: 0


In [None]:
# Users with at least 5 ratings
user_counts = ratings_clean.groupBy("user_id").count()
low_users = user_counts.filter(col("count") < 5)
print("Users with <5 ratings:", low_users.count())


Users with <5 ratings: 0


In [None]:
# Books with at least 5 ratings
book_counts = ratings_clean.groupBy("ISBN").count()
low_books = book_counts.filter(col("count") < 5)
print("Books with <5 ratings:", low_books.count())

Books with <5 ratings: 541


In [None]:
# Compute counts again
book_counts = ratings_clean.groupBy("ISBN").count()

# Keep only books with >=5 ratings
valid_books = book_counts.filter(col("count") >= 5).select("ISBN")

ratings_clean = ratings_clean.join(valid_books, on="ISBN", how="inner")
books_less_than_5 = ratings_clean.groupBy("ISBN").count().filter(col("count") < 5)
print("Remaining books with <5 ratings:", books_less_than_5.count())

Remaining books with <5 ratings: 0
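
Removing low-count books can push some users back below the 5-rating threshold, and the user check above was run before this step. One possible remedy, which is not part of the original pipeline, is to repeat both filters until nothing changes. The plain-Python sketch below illustrates the idea on toy data; `filter_until_stable` and its parameters are hypothetical names, and the thresholds are lowered to 2 to keep the example small.

```python
from collections import Counter

def filter_until_stable(ratings, min_user=5, min_book=5):
    """Repeat the user and book count filters until no row is removed.

    ratings: list of (user_id, isbn) tuples (toy input for illustration).
    """
    current = list(ratings)
    while True:
        user_counts = Counter(u for u, _ in current)
        book_counts = Counter(b for _, b in current)
        kept = [
            (u, b) for u, b in current
            if user_counts[u] >= min_user and book_counts[b] >= min_book
        ]
        if len(kept) == len(current):
            return kept
        current = kept

toy = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u2", "b"), ("u3", "a"), ("u3", "c")]
stable = filter_until_stable(toy, min_user=2, min_book=2)
print(sorted(stable))
```

In the toy data, book `c` is dropped first, which leaves `u3` with a single rating, so `u3` is dropped in the next pass.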


In [None]:
# ISBNs should be non-empty strings
invalid_isbn = books_clean.filter((col("ISBN").isNull()) | (col("ISBN") == ""))
print("Invalid ISBN count:", invalid_isbn.count())


Invalid ISBN count: 0


In [None]:
# No duplicated (user_id, ISBN) rows
duplicates = ratings_clean.groupBy("user_id", "ISBN").agg(count("*").alias("cnt")).filter(col("cnt") > 1).count()
print("Duplicate (user_id, ISBN) rows:", duplicates)


Duplicate (user_id, ISBN) rows: 0


In [None]:
# Titles and authors should be valid
titles_missing = books_clean.filter((col("Title").isNull()) | (col("Title") == "")).count()
authors_missing = books_clean.filter((col("Author").isNull()) | (col("Author") == "")).count()
print("Missing titles:", titles_missing)
print("Missing authors:", authors_missing)


Missing titles: 0
Missing authors: 0


In [None]:
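# Check that the number of books matches the number of ISBNs in ratings_clean.
# Note: books_clean was built before low-count books were removed from ratings_clean and was not re-filtered, so a mismatch is expected.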
books_match = books_clean.select("ISBN").distinct().count() == ratings_clean.select("ISBN").distinct().count()
print("Number of books matches ISBNs in ratings_clean:", books_match)


Number of books matches ISBNs in ratings_clean: False


In [None]:
# Original dataset size
total_ratings_orig = raw.count()
total_users_orig = raw.select("user_id").distinct().count()
total_books_orig = raw.select("isbn").distinct().count()

# Cleaned dataset size
total_ratings_clean = ratings_clean.count()
total_users_clean = users_active.count()
total_books_clean = books_clean.count()

# Percentage removed
removed_ratings_pct = 100 * (total_ratings_orig - total_ratings_clean) / total_ratings_orig
removed_users_pct = 100 * (total_users_orig - total_users_clean) / total_users_orig
removed_books_pct = 100 * (total_books_orig - total_books_clean) / total_books_orig

print("\n--- Cleaning Summary ---")
print(f"Remaining users: {total_users_clean}")
print(f"Remaining books: {total_books_clean}")
print(f"Remaining ratings: {total_ratings_clean}")
print(f"Percentage of original ratings removed: {removed_ratings_pct:.2f}%")
print(f"Percentage of original users removed: {removed_users_pct:.2f}%")
print(f"Percentage of original books removed: {removed_books_pct:.2f}%")



--- Cleaning Summary ---
Remaining users: 3876
Remaining books: 1813
Remaining ratings: 65417
Percentage of original ratings removed: 87.31%
Percentage of original users removed: 86.65%
Percentage of original books removed: 65.44%


# **Collaborative Filtering and Recommendation Generation**

# **Exercise 9: Compute Pearson Similarity Between Users**

In [None]:
# Compute average rating per user (ignoring 0 ratings)
user_avg = ratings_clean.filter(col("rating") > 0) \
    .groupBy("user_id") \
    .agg(F.avg("rating").alias("avg_rating"))

user_avg.show(5)


+-------+-----------------+
|user_id|       avg_rating|
+-------+-----------------+
| 249223|              7.0|
| 227447|              9.0|
| 166825|              7.0|
|  37311|7.285714285714286|
| 131182|             10.0|
+-------+-----------------+
only showing top 5 rows



Join Ratings with User Mean

In [None]:
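# Join ratings with the user average and centre each rating on it.
# Note: 0 ratings are excluded from avg_rating but are still centred here, which yields large negative differences.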
ratings_with_avg = ratings_clean.join(user_avg, on="user_id")
ratings_with_avg = ratings_with_avg.withColumn("rating_diff", col("rating") - col("avg_rating"))

ratings_with_avg.show(5)


+-------+---------+------+-----------------+------------------+
|user_id|     ISBN|rating|       avg_rating|       rating_diff|
+-------+---------+------+-----------------+------------------+
| 100009|385504209|   8.0|7.333333333333333| 0.666666666666667|
| 100009|440224675|   0.0|7.333333333333333|-7.333333333333333|
| 100009|440234743|   0.0|7.333333333333333|-7.333333333333333|
| 100009|553582747|   0.0|7.333333333333333|-7.333333333333333|
| 100009| 60392452|   8.0|7.333333333333333| 0.666666666666667|
+-------+---------+------+-----------------+------------------+
only showing top 5 rows



Prepare Pairs of Users Who Rated the Same Book

In [None]:
# Self-join on ISBN to get all user-user pairs who rated the same book
ratings_pairs = ratings_with_avg.alias("a") \
    .join(ratings_with_avg.alias("b"), on="ISBN") \
    .filter(col("a.user_id") != col("b.user_id")) \
    .select(
        col("a.user_id").alias("user_u"),
        col("b.user_id").alias("user_v"),
        col("a.rating_diff").alias("r_diff_u"),
        col("b.rating_diff").alias("r_diff_v")
    )

ratings_pairs.show(5)


+------+------+-----------------+--------+
|user_u|user_v|         r_diff_u|r_diff_v|
+------+------+-----------------+--------+
|100009| 98741|0.666666666666667|   -8.25|
|100009| 98547|0.666666666666667|    -7.0|
|100009| 97290|0.666666666666667|   -10.0|
|100009| 95574|0.666666666666667|    -3.0|
|100009| 95193|0.666666666666667|     0.0|
+------+------+-----------------+--------+
only showing top 5 rows



Compute Numerator and Denominator for Pearson

In [None]:
# Sum(r_diff_u * r_diff_v), sqrt(sum(r_diff_u^2) * sum(r_diff_v^2))
pearson_scores = ratings_pairs.groupBy("user_u", "user_v") \
    .agg(
        F.sum(col("r_diff_u") * col("r_diff_v")).alias("numerator"),
        F.sqrt(F.sum(col("r_diff_u")**2) * F.sum(col("r_diff_v")**2)).alias("denominator")
    ) \
    .withColumn("similarity", col("numerator") / col("denominator"))

pearson_scores.show(5)


+------+------+------------------+-----------------+-------------------+
|user_u|user_v|         numerator|      denominator|         similarity|
+------+------+------------------+-----------------+-------------------+
|100009|100088| 0.666666666666667|0.666666666666667|                1.0|
|100009|100459|-5.055555555555556|7.845931555609487|-0.6443537672643935|
|100009|100644|               0.0|              0.0|               NULL|
|100009|100846|               0.0|              0.0|               NULL|
|100009|100906| -5.56521739130435| 5.56521739130435|               -1.0|
+------+------+------------------+-----------------+-------------------+
only showing top 5 rows
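
The per-pair computation can be followed by hand on a small example. The sketch below is plain Python, not the Spark job itself: `user_similarity` is a hypothetical helper and the rating dictionaries are toy data. It follows the same steps as the cells above, namely a per-user mean that ignores 0 ratings, centring, and sums over the books both users rated.

```python
from math import sqrt

def user_similarity(ratings_u, ratings_v):
    """Pearson-style similarity over co-rated books.

    ratings_u, ratings_v: dict mapping ISBN -> rating (toy input).
    Returns None when the similarity is undefined.
    """
    def mean_nonzero(ratings):
        values = [r for r in ratings.values() if r > 0]
        return sum(values) / len(values) if values else None

    mean_u, mean_v = mean_nonzero(ratings_u), mean_nonzero(ratings_v)
    common = ratings_u.keys() & ratings_v.keys()
    if mean_u is None or mean_v is None or not common:
        return None
    numerator = sum((ratings_u[b] - mean_u) * (ratings_v[b] - mean_v) for b in common)
    denominator = sqrt(
        sum((ratings_u[b] - mean_u) ** 2 for b in common)
        * sum((ratings_v[b] - mean_v) ** 2 for b in common)
    )
    return numerator / denominator if denominator else None

u = {"a": 8, "b": 6, "c": 10}
v = {"a": 4, "b": 2, "c": 6}
flat = {"a": 7, "b": 7}

print(user_similarity(u, v))     # same shape of preferences
print(user_similarity(flat, v))  # zero denominator, so undefined
```

The second call shows where the `NULL` similarities come from: when every centred rating of one user is 0 on the shared books, the denominator is 0 and the ratio is undefined.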



In [None]:
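# Replace NULL similarities (zero denominator, i.e. no variation on the co-rated books) with 0.
# Pairs with no common books never appear here, because the self-join on ISBN only produces pairs that share a book.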
pearson_scores = pearson_scores.fillna({"similarity": 0})
pearson_scores.show(5)

+------+------+------------------+-----------------+-------------------+
|user_u|user_v|         numerator|      denominator|         similarity|
+------+------+------------------+-----------------+-------------------+
|100009|100088| 0.666666666666667|0.666666666666667|                1.0|
|100009|100459|-5.055555555555556|7.845931555609487|-0.6443537672643935|
|100009|100644|               0.0|              0.0|                0.0|
|100009|100846|               0.0|              0.0|                0.0|
|100009|100906| -5.56521739130435| 5.56521739130435|               -1.0|
+------+------+------------------+-----------------+-------------------+
only showing top 5 rows



Select Top-k Neighbors for Each User

In [None]:
k = 5  # number of neighbors
w = Window.partitionBy("user_u").orderBy(col("similarity").desc())

top_k_neighbors = pearson_scores.withColumn("rank", F.row_number().over(w)) \
    .filter(col("rank") <= k) \
    .drop("rank")

top_k_neighbors.show(10)


+------+------+-------------------+-------------------+------------------+
|user_u|user_v|          numerator|        denominator|        similarity|
+------+------+-------------------+-------------------+------------------+
|100009|130571|             56.375|  56.37499999999999|1.0000000000000002|
|100009|156300|              220.0| 219.99999999999997|1.0000000000000002|
|100009|162639| 61.180952380952384|  61.18095238095238|1.0000000000000002|
|100009|170518|  46.44444444444444| 46.444444444444436|1.0000000000000002|
|100009|182459|  71.70370370370371|   71.7037037037037|1.0000000000000002|
|100053|108005| 13.464285714285719| 13.464285714285717|1.0000000000000002|
|100053|210587| 0.9999999999999991|  0.999999999999999|1.0000000000000002|
|100053|274004|0.24999999999999978|0.24999999999999975|1.0000000000000002|
|100053| 38995| 11.071428571428573| 11.071428571428571|1.0000000000000002|
|100053|103336| 53.650000000000006|              53.65|1.0000000000000002|
+------+------+----------
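
Two details in this output are worth noting: many neighbours tie at the maximum similarity, and floating-point rounding pushes some values slightly above 1. The plain-Python sketch below shows one way to handle both, by clamping to [-1, 1] and breaking ties on the neighbour id so the result is reproducible. `top_k_neighbors_py` is a hypothetical helper and the scores are toy values, so this is an illustration rather than a change to the Spark window above.

```python
def top_k_neighbors_py(scores, k):
    """Return the k most similar neighbours with a deterministic tie-break.

    scores: list of (neighbor_id, similarity) tuples (toy input).
    """
    # Clamp rounding noise such as 1.0000000000000002 back into [-1, 1].
    clamped = [(v, max(-1.0, min(1.0, s))) for v, s in scores]
    # Sort by similarity (descending), then by neighbour id (ascending).
    return sorted(clamped, key=lambda pair: (-pair[1], pair[0]))[:k]

toy_scores = [(130571, 1.0000000000000002), (5, 0.3), (7, 1.0), (9, -0.5)]
best = top_k_neighbors_py(toy_scores, k=2)
print(best)
```

In Spark, a similar effect could be obtained by adding a second ordering column to the window, which is left here as a suggestion.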

# **Exercise 10: Predict ratings for books the target user has not read**

Choose Target User and Get Unread Books

In [None]:
target_user = 8
# Books already rated by target user
books_rated = ratings_clean.filter(col("user_id") == target_user).select("ISBN").distinct()

# Books not yet rated by target user
books_unread = ratings_clean.select("ISBN").distinct().subtract(books_rated)
books_unread.show(5)


+---------+
|     ISBN|
+---------+
|014100018|
|307132668|
|380817144|
|440154731|
|345384733|
+---------+
only showing top 5 rows



Get Target User Average Rating

In [None]:
# Compute average rating of target user (ignoring 0 ratings)
target_avg = ratings_clean.filter((col("user_id") == target_user) & (col("rating") > 0)) \
    .agg(F.avg("rating").alias("avg_rating")) \
    .collect()[0]["avg_rating"]
print(f"Target user {target_user} average rating:", target_avg)


Target user 8 average rating: 5.0


In [None]:
# number of neighbors
k = 5

# Get top-k neighbors of target user
neighbors = top_k_neighbors.filter(col("user_u") == target_user) \
    .orderBy(col("similarity").desc()).limit(k)

neighbors.show()


+------+------+-----------------+-----------------+----------+
|user_u|user_v|        numerator|      denominator|similarity|
+------+------+-----------------+-----------------+----------+
|     8| 29526|3.611111111111107|3.611111111111107|       1.0|
|     8|137688|             40.0|             40.0|       1.0|
|     8|209163|            38.75|            38.75|       1.0|
|     8|236322|             37.0|             37.0|       1.0|
|     8|242247|             50.0|             50.0|       1.0|
+------+------+-----------------+-----------------+----------+



In [None]:
# Ratings of neighbors for the books target user hasn't rated
neighbors_ratings = neighbors.join(ratings_clean.alias("r"), neighbors.user_v == col("r.user_id")) \
    .select(
        col("user_v").alias("neighbor_id"),
        col("ISBN"),
        col("similarity"),
        col("r.rating").alias("neighbor_rating")
    )

# Keep only unread books
neighbors_ratings = neighbors_ratings.join(books_unread.alias("b"), "ISBN")
neighbors_ratings.show(5)


+----------+-----------+----------+---------------+
|      ISBN|neighbor_id|similarity|neighbor_rating|
+----------+-----------+----------+---------------+
| 055356451|     137688|       1.0|            8.0|
| 080410753|     137688|       1.0|            0.0|
| 140293248|     137688|       1.0|            0.0|
| 140298479|     137688|       1.0|            0.0|
|1573225487|     137688|       1.0|            0.0|
+----------+-----------+----------+---------------+
only showing top 5 rows



In [None]:
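# Join neighbor averages
# Note: unlike user_avg, this average includes 0 ratings, which lowers neighbor_avg and can push predictions above the 0-10 scale.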
neighbor_avg = ratings_clean.groupBy("user_id").agg(F.avg("rating").alias("avg_rating")) \
    .withColumnRenamed("user_id", "neighbor_id") \
    .withColumnRenamed("avg_rating", "neighbor_avg")

predictions = neighbors_ratings.join(neighbor_avg, on="neighbor_id")

# Weighted formula: r_hat = r_u_bar + sum(sim * (r_vj - r_v_bar)) / sum(|sim|)
predictions = predictions.withColumn(
    "weighted_diff", col("similarity") * (col("neighbor_rating") - col("neighbor_avg"))
)

predicted_ratings = predictions.groupBy("ISBN").agg(
    (F.lit(target_avg) + F.sum("weighted_diff") / F.sum(F.abs(col("similarity")))).alias("predicted_rating")
)

predicted_ratings = predicted_ratings.orderBy(col("predicted_rating").desc())
predicted_ratings.show(10)


+---------+-----------------+
|     ISBN| predicted_rating|
+---------+-----------------+
|671510053|             12.5|
| 64407667|             11.5|
|553210092|             11.5|
|743260244|             11.5|
|743237188|             10.5|
|055356451|             10.5|
|553561278|             10.5|
|312243022|9.916666666666666|
|142000205|9.916666666666666|
|067976402|9.916666666666666|
+---------+-----------------+
only showing top 10 rows
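
Several predicted ratings above exceed the maximum rating of 10. The weighted formula itself does not bound its result, so a common post-processing step is to clip the prediction to the rating scale. The plain-Python sketch below applies the same formula as the cell above to toy numbers; `predict_rating`, its parameters, and the inputs are hypothetical, and clipping is an addition that the original pipeline does not perform.

```python
def predict_rating(target_mean, neighbors, lo=0.0, hi=10.0):
    """r_hat = r_u_bar + sum(sim * (r_vj - r_v_bar)) / sum(|sim|), clipped to [lo, hi].

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples (toy input).
    """
    total_weight = sum(abs(sim) for sim, _, _ in neighbors)
    if total_weight == 0:
        return target_mean  # no usable neighbour: fall back to the user's own mean
    offset = sum(sim * (rating - mean) for sim, rating, mean in neighbors) / total_weight
    return min(hi, max(lo, target_mean + offset))

# One neighbour with a low mean: the unclipped value would be 5.0 + 7.5 = 12.5.
clipped = predict_rating(5.0, [(1.0, 8.0, 0.5)])
# Two neighbours with the same mean and different weights.
blended = predict_rating(5.0, [(1.0, 8.0, 7.0), (0.5, 6.0, 7.0)])
print(clipped, blended)
```

Clipping only hides the symptom; a neighbour mean that is low because it averages in 0 ratings still inflates the offset, so computing both means the same way is the more direct fix.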



# **Recommend top-N books**

In [None]:
top_n = 5
recommendations = predicted_ratings.limit(top_n)

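# Join with book titles
# Note: limit() is applied before the join, and the join does not preserve the descending order of predicted_rating.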
recommendations = recommendations.join(books_clean, on="ISBN")
recommendations.select("ISBN", "Title", "Author", "predicted_rating").show()


+---------+--------------------+-------------------+----------------+
|     ISBN|               Title|             Author|predicted_rating|
+---------+--------------------+-------------------+----------------+
|671510053|       SHIPPING NEWS|       Annie Proulx|            12.5|
|553210092|  The Scarlet Letter|NATHANIEL HAWTHORNE|            11.5|
| 64407667|The Bad Beginning...|     Lemony Snicket|            11.5|
|055356451|          Night Sins|          TAMI HOAG|            10.5|
|743260244|Against All Enemi...|  Richard A. Clarke|            11.5|
+---------+--------------------+-------------------+----------------+



# **Exercise 11: Compute Pearson Similarity Between Books (Item–Item Collaborative Filtering)**

In [None]:
# Create a pivot table: each row = book, each column = user
book_user_matrix = (
    ratings_clean
        .groupBy("ISBN")
        .pivot("user_id")
        .agg(F.first("rating"))
)

print("Book–User matrix created.")
print("Number of books:", book_user_matrix.count())
print("Number of users:", len(book_user_matrix.columns) - 1)

Book–User matrix created.
Number of books: 1272
Number of users: 3875


In [None]:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Create book–user matrix (pivot table)
book_user_matrix = ratings_clean.groupBy("ISBN").pivot("user_id").agg(F.first("rating"))
print("Book–User matrix created:", book_user_matrix.count(), "books,", len(book_user_matrix.columns)-1, "users")

# Manual Pearson similarity between two rating vectors (None = the user did not rate the book)
def pearson_sim(a, b):
    # Keep only the positions where the same user rated both books
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        return None  # correlation is undefined with fewer than 2 co-ratings
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    den_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    den_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    den = den_x * den_y
    return num / den if den != 0 else None  # None when either book's ratings are constant

# Pick target book
target = "671510053"

# Extract target book vector
row = book_user_matrix.filter(col("ISBN")==target).collect()[0]
target_vector = row[1:]

# Compute similarity of target book with all others
matrix_rdd = book_user_matrix.rdd.map(lambda r: (r[0], r[1:]))
all_scores_rdd = matrix_rdd.map(lambda entry: (entry[0], pearson_sim(target_vector, entry[1])))

# Convert back to DataFrame
book_sim_df = all_scores_rdd.toDF(["ISBN","similarity"])

# Filter and sort
similar_clean = book_sim_df.filter(col("similarity").isNotNull())\
                           .filter(col("ISBN")!=target)\
                           .orderBy(col("similarity").desc())
print("Top similar books:")
similar_clean.show(10)


Book–User matrix created: 1272 books, 3875 users
Top similar books:
+---------+------------------+
|     ISBN|        similarity|
+---------+------------------+
|449001954|1.0000000000000002|
|441004016|1.0000000000000002|
|679879242|1.0000000000000002|
|345384350|1.0000000000000002|
|051513399|1.0000000000000002|
|380973650|1.0000000000000002|
|375500766|1.0000000000000002|
|451150244|1.0000000000000002|
|043912042|1.0000000000000002|
|553377876|1.0000000000000002|
+---------+------------------+
only showing top 10 rows
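
A top-10 list in which every similarity equals 1 is what one would expect when many book pairs share only two raters: with exactly two co-ratings, Pearson correlation is always +1, -1, or undefined. The plain-Python sketch below demonstrates this and shows a variant that requires a minimum number of co-ratings. `pearson_min_overlap` and its `min_overlap` parameter are hypothetical, and the value 3 is an arbitrary choice for illustration, not a setting taken from the notebook.

```python
def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences; None if undefined."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = (sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)) ** 0.5
    return num / den if den else None

def pearson_min_overlap(a, b, min_overlap=3):
    """Like pearson_sim, but returns None unless at least min_overlap users rated both books."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < min_overlap:
        return None
    xs, ys = zip(*pairs)
    return pearson(xs, ys)

two_points = pearson([8, 0], [5, 9])                          # only two co-ratings
too_few = pearson_min_overlap([8, 0, None], [5, 9, 3])        # 2 co-ratings < 3
enough = pearson_min_overlap([1, 2, 3, None], [2, 4, 6, 1])   # 3 co-ratings
print(two_points, too_few, enough)
```

Raising the overlap threshold trades coverage for reliability: fewer book pairs receive a score, but the scores that remain are based on more evidence.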



# **Exercise 12: Produce “Because you liked book X, you may like book Y” recommendations**


In [None]:
k = 10
top_books = similar_clean.limit(k)

# Join with metadata
recommendations = top_books.join(books_clean, on="ISBN", how="left")\
                           .select("ISBN","Title","Author","similarity")
recommendations.show(k, truncate=False)

+---------+--------------------------------------------------------------+-----------------+------------------+
|ISBN     |Title                                                         |Author           |similarity        |
+---------+--------------------------------------------------------------+-----------------+------------------+
|425184129|Big Trouble                                                   |Dave Barry       |1.0000000000000002|
|345384350|Icebound                                                      |Dean R. Koontz   |1.0000000000000002|
|380973650|American Gods: A Novel                                        |Neil Gaiman      |1.0000000000000002|
|449001954|Murder at the Library of Congress (Capital Crimes (Paperback))|Margaret Truman  |1.0000000000000002|
|679879242|The Golden Compass (His Dark Materials, Book 1)               |PHILIP PULLMAN   |1.0000000000000002|
|743439775|Flight                                                        |Jan Burke        |1.0000000000