In [None]:
# check if this instance of the notebook already has files present
# and thus determine which steps required prior to reading in file and handling the data
!ls


In [None]:
# set-up spark (NB if Apache amend versions on download site we will need to amend path in wget command)
## NOTE that this version would make use of Hadoop if installed BUT that HDFS & Hadoop is not installed on our Colab
## (we are only using a single node (probably as a VM) so we will not be able to benefit from parallelism)
!clear
!echo welcome

!rm -f spark-3.4.*-bin-hadoop3.tgz*
!rm -rf spark-3.4.*-bin-hadoop3

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz
!tar -xf spark-3.4.4-bin-hadoop3.tgz

!ls -alt
print("standalone Spark is now installed")

In [None]:
# init spark (ensure SPARK_HOME set to same version as we download earlier)
!pip3 install findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.4-bin-hadoop3"
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
# the next line gives us 'local' mode. try 'local[2]' to use 2 cores or 'master:NNNN' to run on Spark standalone cluster at port NNNN
spark_conf = SparkConf().setMaster('local[2]').setAppName('MyApp')
sc = SparkContext(conf=spark_conf)
# see what we have by examining the Spark User Interface
from pyspark.sql import *
from pyspark.sql.functions import *
SparkSession.builder.getOrCreate()
##

In [None]:
import pandas as pd
import io

In [None]:
## this is how one could upload a file into colab using the colab GUI (uncomment both lines if want to try it)

# from google.colab import files
# files.upload()


In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
!unzip Big_Data_Assignment_Dataset.zip

In [None]:
# at this point we have Spark initialised and we have a number of CSV files.
# NB you can try also download the zipfile to your host machine and try opening in Excel (Win)
# (in Linux, easiest to open a file manager GUI then double-click on .csv file to open associated spreadsheet app)

# ❓ — Question:
## 1️⃣ Step 1:
- Based on your lectures and labs, your Research Hypothesis is: “In the second quarter of 2014, products given a review rating of 3 or more are significantly different compared to other products”, which you may refine to make it more precise and/or testable from the data available in the provided dataset. You should only use the provided dataset for the Hypothesis.

## 2️⃣Step 2:
- Write your report for the above Research Hypothesis in a clear, concise and consistent
manner. You should not exceed 1000 words and should:
  - include a title page giving your name, MMU ID, and a signed declaration that the work is your own
  - state the Research Hypothesis and your test for determining whether it is true
  - explain the results and discuss your approach and what you have learned from the data
  - detail your test on the required data and state whether the Hypothesis has been found true or not, or what you would need to do next to obtain a conclusive result

## 3️⃣ Step 3:
- End-to-End Big Data Pipeline - This section should not exceed 500 words for the technical report. This section
assesses your competency with the core big data tools and your ability to integrate them into an end-to-end pipeline.
- You have been given a public dataset (or you may choose one of your own from a reputable open data
source). Your task is to design and implement a big data pipeline that showcases your understanding of
the following technologies within your Azure Labs environment or Google Colab if you prefer:
  1. Hadoop (HDFS) – for distributed data storage.
  2. Spark – for data processing or a machine learning task (Spark MLlib).
  3. Kafka – for streaming data ingestion (can be real or simulated).
  4. Scala – as the programming language for your Spark and Kafka integration code
  5. Data Source: Pick a public dataset. For instance:
    - (a). OpenWeather API for streaming weather data, or
    - (b). Mockaroo-generated data

In [None]:
spark = SparkSession.builder.appName("electronics_ratings").getOrCreate()
file = "./ratings_Electronics (1).csv"
column_names = ["userID", "productID", "Rating", "timestamp"]
# Read the CSV file
df = (spark.read.format("csv")
         .option("header", "true")
         .option("inferSchema", "true")
         .load(file)
         .toDF(*column_names))

# Show the top 10 rows
df.show(10)

In [None]:
from pyspark.sql.functions import col, from_unixtime
df = df.withColumn("datetime", from_unixtime(col("timestamp")).cast("timestamp"))


In [None]:
df.show(10)

# Looking at the shape of the dataset

In [None]:
num_rows = df.count()
num_cols = len(df.columns)
print((num_rows, num_cols))


# 🔍 Finding the 👯 Duplicates

In [None]:
from pyspark.sql import functions as F

num_duplicates = (
    df.groupBy(df.columns)
      .count()
      .filter("count > 1")
      .agg(F.sum(F.col("count") - 1).alias("total_duplicates"))
      .collect()[0]["total_duplicates"]
)

print(f"🔢 Total number of duplicate rows: {num_duplicates or 0}")


# 📝 Conclusion:
There are no duplicate rows, meaning each combination of userID, productID, Rating and timestamp appears only once. This does not by itself mean that each product has only one rating.
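The duplicate check above counts surplus copies of fully identical rows. A minimal plain-Python sketch of the same counting logic is shown below, together with a second, stricter check on the (userID, productID) key alone. The sample rows and both helper names are hypothetical and only illustrate the difference between the two checks.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Return the number of surplus copies among fully identical rows."""
    counts = Counter(rows)
    return sum(n - 1 for n in counts.values() if n > 1)


def count_repeat_user_product_pairs(rows):
    """Return surplus rows that share the same (userID, productID) key."""
    counts = Counter((user, product) for user, product, _rating, _ts in rows)
    return sum(n - 1 for n in counts.values() if n > 1)


# Hypothetical sample rows: (userID, productID, Rating, timestamp)
sample_rows = [
    ("U1", "P1", 5.0, 1396310400),
    ("U1", "P1", 5.0, 1396310400),  # exact copy of the row above
    ("U1", "P1", 3.0, 1396396800),  # same user and product, different rating and time
    ("U2", "P1", 4.0, 1396310400),
]

print(count_duplicate_rows(sample_rows))
print(count_repeat_user_product_pairs(sample_rows))
```

A result of zero from the first function only rules out exact copies; the second function would still flag a user who rated the same product more than once.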


# ⚠️ Looking at the Null values

In [None]:
# Use the F namespace so that Python's built-in sum() is not shadowed
from pyspark.sql import functions as F

print("⚠️ Null counts per feature:")
df.select([
    F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).alias(c + "_nulls")
    for c in df.columns
]).show()


## Dividing the datetime into date and time columns


In [None]:
from pyspark.sql import functions as F

df = df.withColumn("date", F.split(F.col("datetime"), " ").getItem(0)) \
             .withColumn("time", F.split(F.col("datetime"), " ").getItem(1)) \
             .drop("datetime")
df.show(10)

In [None]:
df.describe().show()

## 📊 Important Insights from (<font color="yellow">Descriptive Stats Table</font>)


1.   The data spans 1998-12-04 to 2014-07-23.
2.   The mean rating is 4.012 and the standard deviation is 1.38.



## 🧑‍💻 Checking Data Types of Features:

In [None]:
# Show the data types of each column in the DataFrame
df.dtypes


## 🧪 Filtering the data for second Quarter (<font color="#FF5733">April 1 to June 30</font>) of 2014


In [None]:

from pyspark.sql.functions import to_date

# converting string datatype of date into date type
df = df.withColumn("date", to_date(col("date"), "yyyy-MM-dd"))

# Select the desired columns
df_1 = df.select("productId", "Rating", "date")
df_1.show()

# here to_date converts the datatype string to date format

# Filter the DataFrame for Q2 2014 (April to June 2014)
fil_df = df_1.filter((col("date") >= '2014-04-01') & (col("date") <= '2014-06-30'))
# the filter method is used to filter out the desired second quarter of 2014
fil_df.show()

In [None]:
fil_df.describe().show()

In [None]:
fil_df.dtypes

In [None]:
num_rows = fil_df.count()
num_cols = len(fil_df.columns)
print((num_rows, num_cols))


## 📝 Note:
- Before filtering, the dataset shape was (7824481, 5).
- After filtering for Q2 2014 and keeping three columns, the shape is (664014, 3).
- The remaining 7,160,467 rows fall outside the period required by the hypothesis.
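The Q2 2014 boundaries and the row arithmetic above can be checked on raw Unix timestamps with a small plain-Python sketch. It assumes the timestamps are interpreted in UTC; Spark's `from_unixtime` renders values in the session time zone, so rows near the quarter boundaries could land differently if the session is not in UTC. The helper names are hypothetical.

```python
from datetime import date, datetime, timezone

Q2_START = date(2014, 4, 1)
Q2_END = date(2014, 6, 30)


def to_utc_date(unix_seconds):
    """Convert a Unix timestamp in seconds to a calendar date in UTC."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()


def in_q2_2014(unix_seconds):
    """Return True if the timestamp falls between April 1 and June 30, 2014 (UTC)."""
    return Q2_START <= to_utc_date(unix_seconds) <= Q2_END


# Row counts reported by the notebook before and after filtering
total_rows = 7824481
q2_rows = 664014
rows_outside_q2 = total_rows - q2_rows
print(rows_outside_q2)
```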



## Creating Two Groups:
    Rating Category:
    - High for >=3
    - Low for <3

In [None]:
from pyspark.sql.functions import when

fil_df = fil_df.withColumn(
    "rating_category",
    when(fil_df["Rating"] >= 3, "High").otherwise("Low")
)

fil_df.show()

In [None]:
group_high = fil_df[fil_df['rating_category'] == 'High']
group_low = fil_df[fil_df['rating_category'] == 'Low']


In [None]:
fil_df.groupBy("rating_category").count().show()


Conclusion/Insights:
- The number of high ratings (≥ 3) in Q2 2014 is 549,939.
- The number of low ratings (< 3) in Q2 2014 is 114,075.
## Distribution graph

In [None]:
# Count the number of each rating (e.g., 1-5)
rating_counts = fil_df.groupBy("Rating").count().orderBy("Rating")

# Now convert the small aggregated result to Pandas
rating_counts_pd = rating_counts.toPandas()

# Plot
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
sns.barplot(x="Rating", y="count", data=rating_counts_pd, palette="viridis")
plt.title("Rating Distribution")
plt.xlabel("Rating")
plt.ylabel("Number of Reviews")
plt.grid(True)
plt.show()


# 📝 Conclusion:
-  The most frequent rating is 5, followed by 4.
-  The distribution is left-skewed: most ratings sit at 5, while a tail of lower ratings pulls the mean below the most frequent rating, as the descriptive stats also suggest.
-  Note that this graph shows only the filtered data for Q2 2014.
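The skewness claim can be checked numerically from a rating frequency table rather than by eye. The sketch below computes the mean, a simple median, and the moment-based skewness in plain Python. The counts in `hypothetical_counts` are made up for illustration and are not the notebook's actual Q2 2014 counts; in practice the collected `rating_counts` result would be passed in instead.

```python
def summarize(freq):
    """Return (mean, median, skewness) for a {value: count} frequency table."""
    n = sum(freq.values())
    mean = sum(value * count for value, count in freq.items()) / n

    # Simple median: first value whose cumulative count reaches half of n
    cumulative = 0
    median = None
    for value in sorted(freq):
        cumulative += freq[value]
        if cumulative >= n / 2:
            median = value
            break

    # Population second and third central moments
    m2 = sum(count * (value - mean) ** 2 for value, count in freq.items()) / n
    m3 = sum(count * (value - mean) ** 3 for value, count in freq.items()) / n
    skewness = m3 / m2 ** 1.5
    return mean, median, skewness


# Hypothetical counts per star rating, shaped like a typical review distribution
hypothetical_counts = {1: 10, 2: 5, 3: 10, 4: 20, 5: 55}
mean, median, skewness = summarize(hypothetical_counts)
print(mean, median, skewness)
```

A negative skewness together with a mean below the median is the numerical signature of a left-skewed distribution.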

In [None]:
# 2. Aggregate or select the 'Rating' column
rating_df = fil_df.select("Rating")

# 3. Convert to Pandas (works well for small columns like this)
rating_pd = rating_df.toPandas()

# 4. Plot the distribution using seaborn or matplotlib
plt.figure(figsize=(8, 5))
sns.histplot(rating_pd['Rating'], bins=5, kde=True, color='skyblue')
plt.title("Distribution of Product Ratings in Q2 2014")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.grid(True)
plt.show()


In [None]:
# Group by 'productid' and count the number of ratings for each product
rating_count_df = fil_df.groupBy("productid").count()

# Filter the products with more than one rating
products_with_multiple_ratings = rating_count_df.filter(col("count") > 1)

# Show the result
products_with_multiple_ratings.show()

✅ 2. Calculate average rating per product

In [None]:
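from pyspark.sql.functions import avg

avg_ratings = fil_df.groupBy("productId").agg(avg("Rating").alias("avg_rating"))
# show() prints the table itself and returns None, so it is not wrapped in print()
avg_ratings.show()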

- This means that products can have more than one rating.
- Averaging the ratings per product gives each product a single score, which we use to classify it as high- or low-rated.
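The per-product aggregation can be illustrated without Spark. The sketch below groups (productID, rating) pairs, computes the average rating and review count per product, and assigns a category using the same threshold of 3. The function name and sample reviews are hypothetical.

```python
from collections import defaultdict


def product_stats(reviews, threshold=3.0):
    """Aggregate (productID, rating) pairs into per-product statistics."""
    totals = defaultdict(lambda: [0.0, 0])  # productID -> [rating sum, review count]
    for product_id, rating in reviews:
        totals[product_id][0] += rating
        totals[product_id][1] += 1

    stats = {}
    for product_id, (rating_sum, num_reviews) in totals.items():
        avg_rating = rating_sum / num_reviews
        stats[product_id] = {
            "avg_rating": avg_rating,
            "num_reviews": num_reviews,
            "category": "high" if avg_rating >= threshold else "low",
        }
    return stats


# Hypothetical reviews
sample_reviews = [("P1", 5), ("P1", 1), ("P2", 2), ("P2", 2), ("P2", 3), ("P3", 4)]
stats = product_stats(sample_reviews)
for product_id, row in sorted(stats.items()):
    print(product_id, row)
```

Note that P1 has one 5-star and one 1-star review yet lands in the high category, because the category is based on the per-product average rather than on individual reviews.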

✅ 3. Count number of reviews per product

In [None]:
review_counts = fil_df.groupBy("productId").count().withColumnRenamed("count", "num_reviews")


✅ 4. Merge average rating and review count

In [None]:
product_stats = avg_ratings.join(review_counts, on="productId")


✅ 5. Categorize products as high or low rated

In [None]:
product_stats = product_stats.withColumn("Rating_Category",
    when(col("avg_rating") >= 3, "high").otherwise("low"))


✅ 6. Compare number of reviews between high-rated and low-rated products

In [None]:
# If you're using Spark and want to do a quick summary:
product_stats.groupBy("Rating_Category").agg(avg("num_reviews")).show()


In [None]:
pd_df = product_stats.toPandas()

from scipy.stats import ttest_ind

high = pd_df[pd_df['Rating_Category'] == 'high']['num_reviews']
low = pd_df[pd_df['Rating_Category'] == 'low']['num_reviews']

t_stat, p_val = ttest_ind(high, low, equal_var=False)
print("p-value:", p_val)
if p_val < 0.05:
    print("Reject the null hypothesis. There is a significant difference in the average number of reviews between high-rated and low-rated products.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in the average number of reviews between high-rated and low-rated products.")

✅ 7. Interpret the result

- The p-value is less than 0.05, so we reject the null hypothesis.
- There is a significant difference in the average number of reviews between high-rated and low-rated products.
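Because the test was run with `equal_var=False`, it is Welch's t-test, which does not assume equal variances in the two groups. The plain-Python sketch below computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom by hand, which can help when explaining the result in the report. The two samples are hypothetical review counts, and the p-value itself is left to `scipy`.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Sample variance divided by sample size: squared standard error of each mean
    v_a = variance(sample_a) / n_a
    v_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / (v_a + v_b) ** 0.5
    dof = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical review counts per product in each category
high_counts = [1, 2, 3, 4, 10]
low_counts = [1, 1, 2, 2]
t_stat, dof = welch_t(high_counts, low_counts)
print(t_stat, dof)
```

Review counts per product are typically heavily right-skewed, so it is worth reporting the group sizes alongside the p-value.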

🎨 2. Plot the Bar Chart

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plotting the average number of reviews by rating category (high vs low)
plt.figure(figsize=(8,6))
# errorbar=None replaces the deprecated ci=None (seaborn 0.12 and later)
sns.barplot(x='Rating_Category', y='num_reviews', data=pd_df, estimator='mean', errorbar=None, palette='coolwarm')

plt.title('Average Number of Reviews: High vs Low Ratings (Q2 2014)', fontsize=16)
plt.xlabel('Rating Category', fontsize=14)
plt.ylabel('Average Number of Reviews', fontsize=14)

# Show plot
plt.show()


## ⚖️ <font color="white">Hypothesis Testing</font>:
- "Products with ≥ 3-star ratings in Q2 2014 (Apr-Jun 2014) have a significantly higher average number of reviews compared to products with < 3-star ratings in the same period."

## 😐 H₀ (Null Hypothesis):
- There is no significant difference in the average number of reviews between <font color="Green">high-rated products (rating ≥ 3)</font> and <font color="Red">low-rated products (rating < 3)</font>.

## 😮 H₁ (Alternate Hypothesis):
- There is a significant difference in the average number of reviews between <font color="Green">high-rated products (rating ≥ 3)</font> and <font color="Red">low-rated products (rating < 3)</font>.
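Written formally, with $\mu$ denoting the mean number of reviews per product in each group, the two hypotheses and the Welch test statistic used to compare them can be sketched as follows. The symbols $\bar{x}$, $s^2$ and $n$ are the sample mean, sample variance and number of products in each group; this notation is introduced here for illustration.

```latex
\[
H_0:\ \mu_{\text{high}} = \mu_{\text{low}}
\qquad
H_1:\ \mu_{\text{high}} \neq \mu_{\text{low}}
\]
\[
t = \frac{\bar{x}_{\text{high}} - \bar{x}_{\text{low}}}
         {\sqrt{\dfrac{s_{\text{high}}^{2}}{n_{\text{high}}} + \dfrac{s_{\text{low}}^{2}}{n_{\text{low}}}}}
\]
```

This is the two-sided form, which matches H₀ and H₁ as stated above.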

It could also be that products rated above 3 were more common in 2014 than in other years.
Another option is to compare the third quarter with the first quarter.


The two approaches below are compared to decide which one best fits the hypothesis.

### 🧠 **Approach 1: Using the Average Number of Reviews**

In this approach, you're testing the hypothesis that **products with higher ratings (≥ 3)** have a **significantly higher number of reviews** compared to those with lower ratings (< 3).

- **Strengths of this approach:**
  1. **Clear comparison**: You are comparing a measurable metric (number of reviews) directly between two groups (high-rated vs low-rated products).
  2. **Real-world relevance**: It makes sense in a business context—products with more positive reviews could indicate better customer satisfaction or higher sales, making it valuable for analysis.
  3. **Statistical rigor**: A t-test or similar test works well here, as you’re comparing the mean review counts of the two groups.

- **Challenges:**
  1. **Assumption**: This approach assumes the average number of reviews is the best way to represent the "popularity" or "performance" of a product.
  2. **Potential skew**: If a product has just a few reviews but a high rating, it might skew the results. For example, a product with only 3 reviews, all rated 5 stars, might have a higher rating but not be a fair comparison to a product with 100 reviews.
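One way to limit the skew described above is to require a minimum number of reviews before a product enters the comparison. The sketch below shows such a filter in plain Python; the threshold of 5 and the sample products are hypothetical and would need to be justified from the data.

```python
def filter_min_reviews(products, min_reviews=5):
    """Keep only products that have at least min_reviews reviews."""
    return [p for p in products if p["num_reviews"] >= min_reviews]


# Hypothetical per-product statistics
sample_products = [
    {"productId": "P1", "avg_rating": 5.0, "num_reviews": 3},
    {"productId": "P2", "avg_rating": 4.2, "num_reviews": 100},
    {"productId": "P3", "avg_rating": 2.1, "num_reviews": 5},
]

kept = filter_min_reviews(sample_products)
print([p["productId"] for p in kept])
```

Running the comparison both with and without such a filter shows how sensitive the conclusion is to sparsely reviewed products.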

### 🧑‍💻 **Approach 2: Using the Rating Category Only (without counting reviews)**

In this approach, you might look at the **distribution of products based on their rating category**, without considering how many reviews they have. For example:
- **Group A (high ratings)**: Products with ratings ≥ 3
- **Group B (low ratings)**: Products with ratings < 3

You could then compare these two groups in terms of other variables (like their average rating, the number of products, etc.).

- **Strengths of this approach**:
  1. **Simpler analysis**: You might focus purely on the rating values and avoid complications from different numbers of reviews.
  2. **Direct interpretation**: It's easier to understand whether high-rated products (≥ 3) differ from low-rated ones in terms of their ratings.
  3. **No skew from review counts**: Since you're not looking at the number of reviews, you avoid potential skew from a few very high-rated products with few reviews.

- **Challenges**:
  1. **Lack of a performance measure**: You're not testing the **effectiveness** or **popularity** of products (which number of reviews could indicate). You would be missing the connection between the **rating** and **review volume**.
  2. **Less practical**: In many cases, you'd want to understand whether products with better ratings are actually more popular or reviewed more, which this approach doesn’t directly measure.

### 🏆 **Which approach is best?**

**It depends on your research question**:
- If you want to explore **whether higher-rated products are getting more attention or reviews**, **Approach 1** (Average Number of Reviews) would likely be the better fit. This approach aligns with your hypothesis and gives a more **complete picture** of how reviews and ratings interact.

- If you want a **more basic comparison** and are only interested in comparing product ratings without considering the number of reviews, **Approach 2** (Using Rating Category Only) might be sufficient but might **miss out on a more nuanced analysis**.

### 🧑‍🏫 **Optimized Approach:**
Given that your hypothesis suggests that products with **ratings ≥ 3** should differ from others in terms of **number of reviews**, **Approach 1** is probably the **best choice**. You will:
1. Use **number of reviews** as a performance metric.
2. Compare products based on their **rating category** (high vs low).
3. Perform a **statistical test** (like a t-test) to see if there's a significant difference in the average number of reviews.

---

### 🚀 **Next Steps:**

- **If using Approach 1 (number of reviews)**, you can:
  - Test the **average number of reviews** for each group using a **t-test**.
  - Visualize the results using a **bar chart** or **box plot**.
  
- **If you wanted to use Approach 2 (rating category)**, you would compare:
  - The **average rating** for each group.
  - Maybe calculate how many products fall into each category and look for trends.

---

### 💡 **Summary**:  
I recommend **Approach 1** as it provides more detailed insight into whether higher-rated products are reviewed more frequently. This fits well with your hypothesis about **ratings affecting review counts**.
