# 📘 Final Project: Data Analysis using MongoDB and Apache Spark

This project works with a real-world dataset, the Amazon Books Reviews dataset, using MongoDB for storage and Apache Spark (PySpark) for processing. It covers schema design, querying, performance optimization, and visualization.

---

In [None]:
# Step 1: Import libraries
import pandas as pd
import numpy as np
import json
from pymongo import MongoClient

In [None]:
# Step 2: Configure Pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

In [None]:
# Step 3: Load CSV files
books_df = pd.read_csv("books_data.csv")
ratings_df = pd.read_csv("Books_rating.csv")

## 📂 Dataset Overview

We use the **Amazon Books Reviews** dataset from Kaggle, which contains user-generated reviews for a wide variety of books available on Amazon. This real-world dataset includes valuable attributes such as:

- **Title**: The name of the book.
- **Author(s)**: The author(s) of the book.
- **Categories**: Genre or subject of the book (e.g., Comics, Fiction, Education).
- **Rating**: User rating scores, typically from 1 to 5.
- **Review Text**: Actual written review provided by users.
- **Review Date**: When the review was posted.
- **ASIN**: Unique Amazon product identifier for each book.

This dataset is a good fit for exploring storage, processing, and analysis with MongoDB and Apache Spark: it mixes structured fields (ratings, categories, dates) with free-form review text, and it is large enough that batching and query performance matter.

In [None]:
# Step 4: Basic exploration
print("📘 Books Data Sample:")
print(books_df.head(), "\n")

print("📝 Ratings Data Sample:")
print(ratings_df.head(), "\n")

print("📊 Shapes:")
print("Books Data Shape:", books_df.shape)
print("Ratings Data Shape:", ratings_df.shape, "\n")

print("🔍 Missing Values in Books Data:")
print(books_df.isnull().sum(), "\n")

print("🔍 Missing Values in Ratings Data:")
print(ratings_df.isnull().sum(), "\n")
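Raw missing-value counts are easier to judge next to percentages. The helper below is a minimal sketch of that idea; `missing_report` is a hypothetical name, the demo frame is made up, and `review/score` is an assumed column name (only `review/text` and `review/summary` appear in this notebook).

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    report = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)

# Tiny made-up frame standing in for ratings_df (values are illustrative only)
demo = pd.DataFrame({
    "review/text": ["Great read", None, "Too long", "Loved it"],
    "review/summary": [None, None, "Meh", "Wonderful"],
    "review/score": [5.0, 4.0, 2.0, 5.0],  # assumed column name
})

report = missing_report(demo)
print(report)
```

Calling `missing_report(ratings_df)` on the real data would show at a glance which columns are worth cleaning before they are stored.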

In [None]:
# Step 5: Optional Cleaning
# Drop rows with missing essential review text
ratings_df = ratings_df.dropna(subset=["review/text"])


# Fill missing summaries with a placeholder (assign the result back instead of using inplace=True)
ratings_df["review/summary"] = ratings_df["review/summary"].fillna("No summary provided")


# You can apply similar cleaning to books_df if needed
# books_df = books_df.dropna()  # example

## 🗃 Storing Dataset in MongoDB
We connect to MongoDB and store each DataFrame in its own collection (`books` and `ratings`), with one document per row.
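For orientation, here is a rough sketch of what one stored review could look like. The values are placeholders, and `Title` and `review/score` are assumed field names; only `review/text` and `review/summary` are confirmed by the cleaning step. The hypothetical `unsafe_keys` helper flags field names containing `.` or starting with `$`, which MongoDB treats specially; names containing `/` are not affected.

```python
import json

# Placeholder document; field names other than "review/text" and
# "review/summary" are assumptions made for illustration.
sample_review = {
    "Title": "Example Book Title",
    "review/score": 5.0,
    "review/summary": "No summary provided",
    "review/text": "Placeholder review text.",
}

def unsafe_keys(doc):
    """Return keys that contain '.' or start with '$'."""
    return [key for key in doc if "." in key or key.startswith("$")]

# Each dict becomes one document; JSON gives a readable preview of its shape.
print(json.dumps(sample_review, indent=2))
print("Keys needing a rename:", unsafe_keys(sample_review))
```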

In [None]:
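# Step 6: Connect to MongoDB
# MongoClient connects lazily, so ping once to fail fast if the server is unreachable
client = MongoClient("mongodb://localhost:27017/", serverSelectionTimeoutMS=5000)  # Change URI if hosted elsewhere
client.admin.command("ping")
db = client["books_database"]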

In [None]:
# Step 7: Convert DataFrames to list of dictionaries
books_data = books_df.to_dict("records")
ratings_data = ratings_df.to_dict("records")
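One thing to watch at this step: pandas represents missing values as float `NaN`, and those are stored in MongoDB as numeric NaN rather than `null`. A text field that is sometimes a string and sometimes a NaN double can complicate schema inference later. The sketch below shows one way to normalize the records first; `nan_to_none` is a hypothetical helper, and the demo records and the `description` field name are made up.

```python
import math

def nan_to_none(records):
    """Return a copy of `records` with float NaN values replaced by None."""
    cleaned = []
    for record in records:
        cleaned.append({
            key: None if isinstance(value, float) and math.isnan(value) else value
            for key, value in record.items()
        })
    return cleaned

# Made-up records shaped like the output of DataFrame.to_dict("records")
demo_records = [
    {"Title": "Book A", "description": float("nan")},
    {"Title": "Book B", "description": "A short description."},
]

cleaned_records = nan_to_none(demo_records)
print(cleaned_records)
```

Applied here, that would mean passing `books_data` and `ratings_data` through the helper before inserting them.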

In [None]:
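# Step 8: Insert the records into MongoDB
from pymongo.errors import BulkWriteError

def insert_in_batches(collection, data, batch_size=100):
    """Insert `data` in slices of `batch_size`; log failed batches and continue."""
    for i in range(0, len(data), batch_size):
        batch = data[i:i+batch_size]
        try:
            collection.insert_many(batch)
        except BulkWriteError as bwe:
            print(f"❌ Bulk write error in batch {i // batch_size}: {bwe.details}")
        except Exception as e:
            print(f"❌ Error inserting batch {i // batch_size}: {e}")

# Start from empty collections so re-running this cell does not duplicate data
db["books"].drop()
db["ratings"].drop()

db["books"].insert_many(books_data)
insert_in_batches(db["ratings"], ratings_data)

# Report what actually landed, since failed batches are only logged above
print(f"✅ MongoDB now holds {db['books'].count_documents({})} books "
      f"and {db['ratings'].count_documents({})} ratings.")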
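The slicing inside `insert_in_batches` can be isolated and checked without a database. This is a minimal sketch; `iter_batches` is a hypothetical helper and the demo documents are made up.

```python
def iter_batches(items, batch_size=100):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 250 made-up documents split into batches of 100
demo_docs = [{"n": n} for n in range(250)]
sizes = [len(batch) for batch in iter_batches(demo_docs, batch_size=100)]
print(sizes)  # [100, 100, 50]
```

The final batch is simply shorter than the rest, so no padding or special case is needed for datasets whose size is not a multiple of `batch_size`.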

## ⚙ Data Processing with PySpark
We use PySpark to read, transform, and analyze the dataset loaded from MongoDB.

In [None]:
from pyspark.sql import SparkSession

# Step 1: Create a Spark session
spark = SparkSession.builder \
    .appName("BooksMiniProject") \
    .getOrCreate()

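# Step 2: Load your CSV files into Spark DataFrames
# Note: this rebinds books_df and ratings_df from pandas to Spark DataFrames.
# Review text can contain quoted commas and line breaks, so the quote-aware
# options below are a precaution; drop them if the files parse cleanly without them.
csv_options = dict(header=True, inferSchema=True, multiLine=True, quote='"', escape='"')
books_df = spark.read.csv("books_data.csv", **csv_options)
ratings_df = spark.read.csv("Books_rating.csv", **csv_options)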

# Optional: Show first few rows for confirmation
books_df.show(5)
ratings_df.show(5)

In [None]:
# Read documents from MongoDB, exclude '_id'
books_docs = list(db["books"].find({}, {"_id": 0}))
ratings_docs = list(db["ratings"].find({}, {"_id": 0}).limit(10000))  # 🔁 Adjust limit as needed

# Convert to Spark DataFrames
df_books = spark.createDataFrame(books_docs)
df_ratings = spark.createDataFrame(ratings_docs)

# Show sample rows
df_books.show(3)
df_ratings.show(3)