## Code Executive Summary

---

### Objective:
Use Apache Spark, Confluent Kafka, and Databricks to process IMDb movie data and predict movie ratings (`imdb_score`).

---

### Technologies Used:
- **Apache Spark**: Distributed computing framework for large-scale data processing.
- **Confluent Kafka**: Distributed event streaming platform for real-time data pipelines.
- **Databricks**: Unified analytics platform based on Apache Spark.

---

### Tasks & Implementation:

#### 1. Data Reading and Preprocessing:
- Loaded IMDb movie data from a CSV file into a Spark DataFrame.
- Selected relevant numerical features and target variable (`imdb_score`).
- Dropped rows with missing values and converted selected features and target to `DoubleType`.

#### 2. Data Splitting:
- Partitioned the preprocessed data into training, batch, and stream DataFrames using `randomSplit` with weights 0.8, 0.1, and 0.1 (an approximate 80-10-10 split).

#### 3. Batch Data to Kafka:
- Converted the batch DataFrame (the roughly 10% batch split) to JSON, keeping only the selected feature columns.
- Pushed the JSON data to the Kafka topic "a1" using Confluent Kafka configurations.

#### 4. Kafka Data Reading:
- Read the data from the Kafka topic "a1" into a Spark DataFrame using Confluent Kafka configurations.

#### 5. Model Training:
- Assembled the selected features into a vector with `VectorAssembler`.
- Trained a `RandomForestRegressor` model on the training DataFrame and predicted `imdb_score` for the batch records read from Kafka.

#### 6. Stream Data to Kafka and Prediction:
- Converted the stream DataFrame to JSON.
- Pushed the JSON data to the Kafka topic "a1".
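- Read the topic back from Kafka with a batch read (`spark.read`) and made predictions using the trained model. Because both pushes target topic "a1", this read also returns the batch records pushed earlier.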

## 1. Data Preparation

In [0]:
# Import necessary libraries
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col, current_timestamp, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Read the data from CSV
df1 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/arham.anwar@mail.mcgill.ca/IMDB_data_Fall_2023.csv")

# Display the data
display(df1)

# Define selected numerical features and target
selected_features = [
    'movie_budget',
    'duration',
    'actor1_star_meter',
    'actor2_star_meter',
    'actor3_star_meter',
    'nb_news_articles',
    'movie_meter_IMDBpro'
]

# Convert selected features and imdb_score to DoubleType first, so that
# values that fail the cast (and become null) are dropped as well
df2 = df1.select(selected_features + ['imdb_score'])
for feature in selected_features + ['imdb_score']:
    df2 = df2.withColumn(feature, col(feature).cast('double'))

# Drop rows with missing values in selected features and target
df2 = df2.dropna()


movie_title,movie_id,imdb_link,imdb_score,movie_budget,release_day,release_month,release_year,duration,language,country,maturity_rating,aspect_ratio,distributor,nb_news_articles,director,actor1,actor1_star_meter,actor2,actor2_star_meter,actor3,actor3_star_meter,colour_film,genres,nb_faces,plot_keywords,action,adventure,scifi,thriller,musical,romance,western,sport,horror,drama,war,animation,crime,movie_meter_IMDBpro,cinematographer,production_company
August: Osage County,2,http://www.imdb.com/title/tt1322269/?ref_=fn_tt_tt_1,7.3,25000000,10,Jan,2014,121,English,USA,R,2.35,The Weinstein Company,2141,John Wells,Benedict Cumberbatch,259,Meryl Streep,559,Julia Roberts,513,Color,Drama,3,based on play|incestuous relationship|pedophilia|secret|teenage daughter,0,0,0,0,0,0,0,0,0,1,0,0,0,4000,Adriano Goldman,The Weinstein Company
Radio,12,http://www.imdb.com/title/tt0316465/?ref_=fn_tt_tt_1,6.9,35000000,24,Oct,2003,109,English,USA,PG,1.85,Columbia Pictures Corporation,331,Michael Tollin,Alfre Woodard,2735,Riley Smith,3915,Debra Winger,1845,Color,Biography|Drama|Sport,1,coach|football|football coach|high school|radio,0,0,0,0,0,0,0,1,0,1,0,0,0,8556,Don Burgess,Revolution Studios
Coach Carter,15,http://www.imdb.com/title/tt0393162/?ref_=fn_tt_tt_1,7.2,30000000,14,Jan,2005,136,English,USA,PG-13,2.35,Paramount Pictures,223,Thomas Carter,Channing Tatum,573,Rick Gonzalez,4793,Robert Ri'chard,6729,Color,Drama|Sport,0,basketball|basketball coach|coach|contract|high school,0,0,0,0,0,0,0,1,0,1,0,0,0,3940,Sharone Meir,Coach Carter
The Possession,20,http://www.imdb.com/title/tt0431021/?ref_=fn_tt_tt_1,5.9,14000000,20,Aug,2012,92,English,USA,PG-13,2.35,Lionsgate,620,Ole Bornedal,Kyra Sedgwick,2047,Madison Davenport,1769,Natasha Calis,11963,Color,Horror|Thriller,0,basketball coach|box|jewish|rabbi|yard sale,0,0,0,1,0,0,0,0,1,0,0,0,0,5452,Dan Laustsen,Ghost House Pictures
Escape from Alcatraz,22,http://www.imdb.com/title/tt0079116/?ref_=fn_tt_tt_1,7.6,8000000,22,Jun,1979,112,English,USA,PG,1.85,Paramount Pictures,97,Don Siegel,Clint Eastwood,102,Patrick McGoohan,5062,Fred Ward,5451,Color,Biography|Crime|Drama,0,alcatraz|escape|inmate|island|prison,0,0,0,0,0,0,0,0,0,1,0,0,1,4722,Bruce Surtees,Paramount Pictures
She's the Man,23,http://www.imdb.com/title/tt0454945/?ref_=fn_tt_tt_1,6.4,20000000,17,Mar,2006,105,English,USA,PG-13,1.85,Lakeshore International,173,Andy Fickman,Channing Tatum,573,Alexandra Breckenridge,370,Laura Ramsey,3711,Color,Comedy|Romance,0,disguise|roommate|school|soccer|twin,0,0,0,0,0,1,0,0,0,0,0,0,0,2446,Greg Gardiner,DreamWorks
Spaceballs,26,http://www.imdb.com/title/tt0094012/?ref_=fn_tt_tt_1,7.1,22700000,24,Jun,1987,96,English,USA,PG,1.85,Metro-Goldwyn-Mayer (MGM),408,Mel Brooks,Joan Rivers,12294,Dick Van Patten,13732,Michael Winslow,8419,Color,Adventure|Comedy|Sci-Fi,2,parody|planet enclosed within shield|sci fi spoof|self referential|winnebago,0,1,1,0,0,0,0,0,0,0,0,0,0,2294,Nick McLean,Brooksfilms
No Country for Old Men,31,http://www.imdb.com/title/tt0477348/?ref_=fn_tt_tt_1,8.1,25000000,21,Nov,2007,122,English,USA,R,2.35,Hispanic Education And Media Group,4135,Ethan Coen,Kelly Macdonald,628,Stephen Root,2450,Barry Corbin,3592,Color,Crime|Drama|Thriller,0,coin toss|desert|sheriff|texas|tracking device,0,0,0,1,0,0,0,0,0,1,0,0,1,513,Roger Deakins,Paramount Vantage
Blade,38,http://www.imdb.com/title/tt0120611/?ref_=fn_tt_tt_1,7.1,45000000,21,Aug,1998,110,English,USA,R,2.35,New Line Cinema,1723,Stephen Norrington,Sanaa Lathan,547,Traci Lords,1054,Udo Kier,3001,Color,Action|Horror,1,1990s|blade|blood|vampire|vampire hunter,1,0,0,0,0,0,0,0,1,0,0,0,0,697,Theo van de Sande,Amen Ra Films
Music and Lyrics,39,http://www.imdb.com/title/tt0758766/?ref_=fn_tt_tt_1,6.5,40000000,14,Feb,2007,95,English,USA,PG-13,1.85,Warner Bros.,378,Marc Lawrence,Brad Garrett,358742,Scott Porter,3086,Haley Bennett,642,Color,Comedy|Music|Romance,4,love|lyricist|singer|singing|song,0,0,0,0,1,1,0,0,0,0,0,0,0,6854,Xavier Grobet,Castle Rock Entertainment
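
The order of casting and dropping matters here. The CSV is read without schema inference, so every column arrives as a string, and (with ANSI mode off) Spark's `cast('double')` turns an unparseable string into null. A drop performed before the cast would miss those rows. The sketch below is a plain-Python analogue of that behaviour, not Spark code: `to_double` is a hypothetical helper, and the malformed and missing values are invented for illustration.

```python
def to_double(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

rows = [
    {"duration": "121", "imdb_score": "7.3"},
    {"duration": "n/a", "imdb_score": "6.9"},  # hypothetical malformed value
    {"duration": None, "imdb_score": "7.2"},   # hypothetical missing value
]

# Cast every column first, then drop any row that contains a None.
cast_rows = [{k: to_double(v) for k, v in row.items()} for row in rows]
clean_rows = [row for row in cast_rows if all(v is not None for v in row.values())]

print(clean_rows)  # [{'duration': 121.0, 'imdb_score': 7.3}]
```

Only the first row survives, because both the malformed and the missing value end up as `None` after the cast.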


## 2. Split data for training and testing (batch and streaming)

In [0]:
# Split the DataFrame into train, batch, and stream
train_df, batch_df, stream_df = df2.randomSplit([0.8, 0.1, 0.1], seed=42)

# Print counts of each split
print("Train DataFrame Count:", train_df.count())
print("Batch DataFrame Count:", batch_df.count())
print("Stream DataFrame Count:", stream_df.count())


Train DataFrame Count: 1588
Batch DataFrame Count: 174
Stream DataFrame Count: 168
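
`randomSplit` assigns each row to a split at random according to the normalized weights, so the resulting sizes only approximate the 80-10-10 target. A quick check of the counts printed above, in plain Python (the variable names are illustrative):

```python
# Counts reported by the notebook output above.
counts = {"train": 1588, "batch": 174, "stream": 168}
total = sum(counts.values())

# Observed share of each split, as a percentage of all rows.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)   # 1930
print(shares)  # {'train': 82.3, 'batch': 9.0, 'stream': 8.7}
```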


## 3. Push batch data to Kafka

In [0]:
from pyspark.sql.functions import to_json, struct
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder.appName("KafkaIntegration").getOrCreate()

# Define Kafka configurations
bootstrap_servers = "pkc-4rn2p.canadacentral.azure.confluent.cloud:9092"
topic = "a1"

# Convert DataFrame to JSON and push to Kafka
batch_df_to_kafka = batch_df.select(to_json(struct(selected_features)).alias("value"))
batch_df_to_kafka.write.format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("topic", topic) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="<CONFLUENT_API_KEY>" password="<CONFLUENT_API_SECRET>";""") \
    .save()


## 4. Pull batch data from Kafka

In [0]:
# Read data from Kafka topic into a DataFrame
kafka_df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("subscribe", topic) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="<CONFLUENT_API_KEY>" password="<CONFLUENT_API_SECRET>";""") \
    .load()

# Convert value column from binary to string and then to DataFrame
kafka_df = kafka_df.selectExpr("CAST(value AS STRING)")

# Parse JSON data from Kafka
schema = StructType([StructField(feature, DoubleType()) for feature in selected_features])
kafka_parsed_df = kafka_df.select(from_json(col("value"), schema).alias("data")).select("data.*")
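
The round trip in sections 3 and 4 can be mimicked without a cluster. The sketch below is a plain-Python analogue, not the Spark API: `to_message` and `from_message` are hypothetical helpers standing in for `to_json(struct(...))` and `from_json(..., schema)`. It uses the first row of the displayed data and shows that the target `imdb_score` never reaches the topic, so predictions made from Kafka records cannot be compared with the true ratings unless the target is added to the message.

```python
import json

selected_features = [
    "movie_budget", "duration", "actor1_star_meter", "actor2_star_meter",
    "actor3_star_meter", "nb_news_articles", "movie_meter_IMDBpro",
]

# First displayed row ("August: Osage County"), after the cast to double.
row = {
    "movie_budget": 25000000.0, "duration": 121.0, "actor1_star_meter": 259.0,
    "actor2_star_meter": 559.0, "actor3_star_meter": 513.0,
    "nb_news_articles": 2141.0, "movie_meter_IMDBpro": 4000.0, "imdb_score": 7.3,
}

def to_message(record, fields):
    """Serialize only the given fields to a JSON string."""
    return json.dumps({name: record[name] for name in fields})

def from_message(value, fields):
    """Parse a JSON string into floats; absent or null fields become None."""
    data = json.loads(value)
    return {
        name: float(data[name]) if data.get(name) is not None else None
        for name in fields
    }

message = to_message(row, selected_features)
parsed = from_message(message, selected_features)

print("imdb_score" in parsed)  # False
print(parsed["duration"])      # 121.0
```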


## 5. Use the model trained on the training data to predict on batch data pulled from Kafka

In [0]:
# Train RandomForestRegressor Model
assembler = VectorAssembler(inputCols=selected_features, outputCol="features")
train_df_assembled = assembler.transform(train_df)
rf = RandomForestRegressor(featuresCol="features", labelCol="imdb_score", seed=42)
rf_model = rf.fit(train_df_assembled)

# Assemble features vector for Kafka parsed data
kafka_parsed_df_assembled = assembler.transform(kafka_parsed_df)

# Predict imdb_score for Kafka parsed data
predictions = rf_model.transform(kafka_parsed_df_assembled)

# Display predictions
predictions.select("prediction").show()

+------------------+
|        prediction|
+------------------+
| 5.954377113431841|
| 6.315117064421017|
| 6.358680402143992|
| 6.827163727882019|
| 6.179255396127019|
| 5.859601908692444|
| 6.569571019642531|
| 6.091150932350442|
| 7.184398864278526|
| 6.150184118967506|
| 5.922232430680394|
| 6.446758823378573|
| 6.608259791796632|
| 6.702176093341421|
|6.7424715347722515|
| 6.770984010855118|
| 5.607684647826716|
| 6.885577904012186|
| 6.548390628552016|
|6.7381164423670565|
+------------------+
only showing top 20 rows
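
Section 1 imports `RegressionEvaluator`, but the notebook never calls it, so the predictions above are not scored. The evaluator's default metric is root-mean-square error (RMSE). The sketch below computes the same quantity in plain Python. The `actual` values are the first four `imdb_score` entries from the displayed data, while the `predicted` values are made up purely for illustration and are not model output.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [7.3, 6.9, 7.2, 5.9]     # imdb_score of the first four displayed rows
predicted = [6.8, 6.4, 6.7, 6.4]  # hypothetical predictions, for illustration only

print(round(rmse(actual, predicted), 3))  # 0.5
```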



## 6. Push stream data to Kafka

In [0]:
# Convert stream_df to JSON and push to Kafka
stream_df_to_kafka = stream_df.select(to_json(struct(selected_features)).alias("value"))
stream_df_to_kafka.write.format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("topic", topic) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="<CONFLUENT_API_KEY>" password="<CONFLUENT_API_SECRET>";""") \
    .save()

## 7. Pull stream data from Kafka

In [0]:
# Batch-read the Kafka topic into a DataFrame. With the default offsets this
# returns every record on the topic, including those pushed in section 3.
kafka_df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("subscribe", topic) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.sasl.jaas.config", """kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="<CONFLUENT_API_KEY>" password="<CONFLUENT_API_SECRET>";""") \
    .load()


## 8. Make predictions on the Kafka stream data using our pretrained model

In [0]:
# Convert value column from binary to string and then to DataFrame
kafka_df = kafka_df.selectExpr("CAST(value AS STRING)")

# Parse JSON data from Kafka
schema = StructType([StructField(feature, DoubleType()) for feature in selected_features])
kafka_parsed_df = kafka_df.select(from_json(col("value"), schema).alias("data")).select("data.*")

# Assemble features vector for Kafka parsed data
kafka_parsed_df_assembled = assembler.transform(kafka_parsed_df)

# Predict imdb_score for Kafka parsed data
predictions = rf_model.transform(kafka_parsed_df_assembled)

# Display predictions
predictions.select("prediction").show()

+------------------+
|        prediction|
+------------------+
|  6.81139317021092|
| 6.288038742103215|
| 6.430330446227313|
| 6.423687631521448|
| 7.031589182170767|
| 6.414649562866531|
| 6.765041629974763|
|  7.27430934372303|
|6.0341432887142625|
| 6.856890322186122|
| 6.240286499739469|
|  6.18602752252957|
| 6.153713303774846|
|  5.72934134347489|
|  7.41498092463393|
| 6.366269328666146|
| 6.896620004289369|
|  6.47755823004419|
| 6.666107478756009|
|7.2070441653949455|
+------------------+
only showing top 20 rows

