## Spark ML Exploration Part I

## Learning Outcomes
In this assignment you will: 

-  Use ML pipelines
-  Improve a Random Forest model
-  Perform Hyperparameter tuning

## Analysis 1: Predicting Farmer's Markets Locations

Our preliminary analysis with the ML pipeline revealed some relationships, but the results leave room for optimization. The baseline also rests on assumptions that may be invalid, such as shortening zip codes. One proposal is to consider neighboring zip codes, or to use distance measures to predict the likelihood of a farmers market within a certain radius of affluent zip codes.
The challenge is not only about improving the model's performance, but also about making iterative changes that could lead to improvements. The dataset in focus is the Farmers Markets dataset, available at https://catalog.data.gov/dataset/farmers-markets-directory-and-geographic-data. For a baseline, refer to the model setup in this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/2446126855165611/6085673883631125/latest.html.
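
The distance-based proposal above could be prototyped with the haversine formula, which gives the great-circle distance between two latitude/longitude points. The sketch below is a minimal plain-Python illustration, not the notebook's implementation: the function names, the mean Earth radius of 6371 km, the 25 km radius, and the sample coordinates are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def markets_within_radius(center, markets, radius_km):
    """Count the market coordinates that lie within radius_km of a center point."""
    lat0, lon0 = center
    return sum(1 for lat, lon in markets if haversine_km(lat0, lon0, lat, lon) <= radius_km)


# Hypothetical zip code centroid and market coordinates, for illustration only.
centroid = (40.0, -75.0)
sample_markets = [(40.1, -75.0), (40.0, -75.2), (41.0, -75.0)]
print(markets_within_radius(centroid, sample_markets, radius_km=25.0))
```

A count like this, computed per affluent zip code, could serve as an additional feature or as an alternative target, instead of relying on an exact zip code match.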

## Analysis 2: Predicting Diamond Prices

Use the Apache Spark ML pipeline to build a model that predicts diamond prices from the provided features. Detailed dataset information is available in this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/4396972618536508/6085673883631125/latest.html.

### In 'Spark ML Exploration Part I', I delved into basic data exploration using a Linear Regression model for initial predictions. Despite attempts to enhance the model with hyperparameter tuning, the results were unsatisfactory. In 'Part II', I'm adopting a new approach to the farmers market problem. I've integrated data from zcta_coordinates and the Census Bureau's API to represent income and population by zip code more accurately. Emphasis has been placed on feature engineering, especially with tools like the haversine formula and the square root transformation.

In [0]:
# Read the data into a DataFrame
taxes2013 = spark.read \
  .format("csv") \
  .option("header", "true") \
  .load("dbfs:/databricks-datasets/data.gov/irs_zip_code_data/data-001/2013_soi_zipcode_agi.csv")

# Create a temporary view
taxes2013.createOrReplaceTempView("taxes2013")

# Now you can use Spark SQL to query the data
result = spark.sql("SELECT * FROM taxes2013 LIMIT 10")
result.show()



In [0]:
# Load the data into a DataFrame
markets = spark.read \
  .format("csv") \
  .option("header", "true") \
  .load("dbfs:/databricks-datasets/data.gov/farmers_markets_geographic_data/data-001/market_data.csv")

# Create a temporary view
markets.createOrReplaceTempView("markets")

# Now you can use Spark SQL to query the data
result = spark.sql("SELECT * FROM markets LIMIT 10")
result.show()


+-------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+--------------------+-----------+--------------------+--------------------+-----+--------------------+--------------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+----------+--------------------+------+---+-------+-----+----+-------+----------+------+------+-------+----+-------+-----+----------+-----+----+-----+----+-------+----+------+-------+--------+----+-----+----+------+-----+------+------+------+---------+-------+----+-------------+--------------------+
|   FMID|          MarketName|             Website|            Facebook|             Twitter|Youtube|          OtherMedia|              street|       city|              County|               State|  zip|         Season1Date|         Season1Time|Season2Date|Season2Time|Season3Date|Season3Time|Season4Date|Season4Time|          x|         y|            Location|Credit|WI

In [0]:
# Filter out the desired columns
taxes2013 = taxes2013.select(["STATE", "zipcode", "MARS1", "MARS2", "NUMDEP", "A02650", "A00300", "A00900", "A01000"])

# Create a temporary view
taxes2013.createOrReplaceTempView("taxes2013")

# Execute the SQL to get the various aggregates
taxes_aggregate = spark.sql("""
    SELECT 
        zipcode, 
        SUM(MARS1) AS single_returns, 
        SUM(MARS2) AS joint_returns, 
        SUM(NUMDEP) AS numdep, 
        SUM(A02650) AS total_income_amount, 
        SUM(A00300) AS taxable_interest_amount 
    FROM taxes2013 
    GROUP BY zipcode
""")
taxes_aggregate.createOrReplaceTempView("filtered_taxes2013")
taxes_aggregate.show()

+-------+--------------+-------------+-------+-------------------+-----------------------+
|zipcode|single_returns|joint_returns| numdep|total_income_amount|taxable_interest_amount|
+-------+--------------+-------------+-------+-------------------+-----------------------+
|  35004|        1960.0|       2150.0| 3200.0|           258024.0|                  964.0|
|  35444|         480.0|        710.0| 1190.0|            76844.0|                  152.0|
|  35640|        3770.0|       5270.0| 7460.0|           543352.0|                 3152.0|
|  35670|        1090.0|       1620.0| 2170.0|           150952.0|                  875.0|
|  36067|        4100.0|       4470.0| 8630.0|           543110.0|                 2199.0|
|  36526|        5790.0|       6170.0| 9270.0|           940114.0|                 6004.0|
|  85022|       12100.0|       7040.0|12040.0|          1264372.0|                 8502.0|
|  72472|        1200.0|       1360.0| 2780.0|           123953.0|                  657.0|
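
The join in the next cell reads from a `markets_filtered` view whose definition does not appear in this notebook. Judging from the columns it exposes (`count` and `zip`), it presumably holds the number of markets per zip code. The plain-Python sketch below illustrates that aggregation on invented rows. The function name and sample data are hypothetical, and an equivalent `GROUP BY zip` query over the `markets` view is only an assumption about how the missing view was built.

```python
from collections import Counter

# Invented sample rows standing in for the `markets` table (zip may be missing).
market_rows = [
    {"MarketName": "A", "zip": "35640"},
    {"MarketName": "B", "zip": "36067"},
    {"MarketName": "C", "zip": "35640"},
    {"MarketName": "D", "zip": None},
]


def count_markets_by_zip(rows):
    """Return {zip_as_int: number_of_markets}, skipping rows without a usable zip."""
    counts = Counter()
    for row in rows:
        zip_code = row.get("zip")
        if zip_code and zip_code.isdigit():
            counts[int(zip_code)] += 1
    return dict(counts)


markets_by_zip = count_markets_by_zip(market_rows)
print(markets_by_zip)
```

Zip codes without any market do not appear in the result, which is why the left outer join below produces nulls that are later filled with 0.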

In [0]:
# Perform the join operation
joined_table = spark.sql("""
    SELECT 
        a.zipcode , 
        a.single_returns, 
        a.joint_returns, 
        a.numdep, 
        a.total_income_amount, 
        a.taxable_interest_amount, 
        b.count, 
        b.zip 
    FROM filtered_taxes2013 a 
    LEFT OUTER JOIN markets_filtered b 
    ON(a.zipcode=b.zip)
""")
joined_table.show()


+-------+--------------+-------------+-------+-------------------+-----------------------+-----+-----+
|zipcode|single_returns|joint_returns| numdep|total_income_amount|taxable_interest_amount|count|  zip|
+-------+--------------+-------------+-------+-------------------+-----------------------+-----+-----+
|  35004|        1960.0|       2150.0| 3200.0|           258024.0|                  964.0| null| null|
|  35444|         480.0|        710.0| 1190.0|            76844.0|                  152.0| null| null|
|  35640|        3770.0|       5270.0| 7460.0|           543352.0|                 3152.0|    1|35640|
|  35670|        1090.0|       1620.0| 2170.0|           150952.0|                  875.0| null| null|
|  36067|        4100.0|       4470.0| 8630.0|           543110.0|                 2199.0|    1|36067|
|  36526|        5790.0|       6170.0| 9270.0|           940114.0|                 6004.0| null| null|
|  85022|       12100.0|       7040.0|12040.0|          1264372.0|       

In [0]:
# Display statistical summary of the DataFrame
joined_table.describe().show()

+-------+------------------+------------------+------------------+-----------------+-------------------+-----------------------+------------------+-----------------+
|summary|           zipcode|    single_returns|     joint_returns|           numdep|total_income_amount|taxable_interest_amount|             count|              zip|
+-------+------------------+------------------+------------------+-----------------+-------------------+-----------------------+------------------+-----------------+
|  count|             27692|             27692|             27692|            27692|              27692|                  27692|              5792|             5792|
|   mean| 48846.37787086523|4800.0888343203815| 3822.125162501806|6972.645890509894|  669836.5034305936|     5973.7401415571285|1.2361878453038675|47420.74084944752|
| stddev|27024.237561576883| 399292.4548633762|317849.17313310504|580001.5259658371| 5.57425668769764E7|      497351.9014347161|0.5953969271711737|28902.49633047836|
|   

In [0]:
# Print the data types of all columns
for column in joined_table.dtypes:
    print(f"Column name: {column[0]}, Type: {column[1]}")

Column name: zipcode, Type: string
Column name: single_returns, Type: double
Column name: joint_returns, Type: double
Column name: numdep, Type: double
Column name: total_income_amount, Type: double
Column name: taxable_interest_amount, Type: double
Column name: count, Type: bigint
Column name: zip, Type: int


In [0]:
from pyspark.sql.functions import col
joined_table = joined_table.withColumn("zipcode", col("zipcode").cast("int"))


In [0]:
joined_table = joined_table.na.fill(0)
joined_table.show()

+-------+--------------+-------------+-------+-------------------+-----------------------+-----+-----+
|zipcode|single_returns|joint_returns| numdep|total_income_amount|taxable_interest_amount|count|  zip|
+-------+--------------+-------------+-------+-------------------+-----------------------+-----+-----+
|  35004|        1960.0|       2150.0| 3200.0|           258024.0|                  964.0|    0|    0|
|  35444|         480.0|        710.0| 1190.0|            76844.0|                  152.0|    0|    0|
|  35640|        3770.0|       5270.0| 7460.0|           543352.0|                 3152.0|    1|35640|
|  35670|        1090.0|       1620.0| 2170.0|           150952.0|                  875.0|    0|    0|
|  36067|        4100.0|       4470.0| 8630.0|           543110.0|                 2199.0|    1|36067|
|  36526|        5790.0|       6170.0| 9270.0|           940114.0|                 6004.0|    0|    0|
|  85022|       12100.0|       7040.0|12040.0|          1264372.0|       

In [0]:
from pyspark.ml.feature import VectorAssembler

# Columns that are not features
nonFeatureCols = ["zip", "zipcode", "count"]

# Columns that are features
featureCols = [item for item in joined_table.columns if item not in nonFeatureCols]

# Use VectorAssembler to assemble these feature columns into a single vector column.
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
finalPrep = assembler.transform(joined_table)

# Now, split the dataset into a training set and a test set.
training, test = finalPrep.randomSplit([0.7, 0.3])

# Cache the data to speed up training
training.cache()
test.cache()

# Print the number of data points in the training and test sets
print("Number of training records: " + str(training.count()))
print("Number of testing records : " + str(test.count()))



Number of training records: 19309
Number of testing records : 8383


### At this point I have followed the same steps as the guide notebook, and I will now improve the model. The guide notebook trained the Random Forest model using a pipeline and tuned its hyperparameters with a cross-validator. These are its results: MSE: 11.597986974541147; MAE: 1.2155121373593842; RMSE: 3.4055817380502185; R Squared: -2.932901971002445; Explained Variance: 10.376238335259561.

### Considering Zip Codes as Categorical Features: One problem I noticed was that zip codes were being interpreted as numerical features with intrinsic meaning. To rectify this, I created two new binary features that categorize each zip code. The first tags a zip code as 'wealthy' ('1') if its average income is above the 75th percentile of all zip codes, and '0' otherwise. The second does the same for highly populated zip codes, using the number of single and joint returns as a proxy for population.

In [0]:
# Create new feature columns
joined_table = joined_table.withColumn("population", joined_table["single_returns"] + joined_table["joint_returns"])
joined_table = joined_table.withColumn("average_income", joined_table["total_income_amount"] / joined_table["population"])
joined_table = joined_table.withColumn("dependent_ratio", joined_table["numdep"] / joined_table["population"])

# Print the data types of all columns
for column in joined_table.dtypes:
    print(f"Column name: {column[0]}, Type: {column[1]}")

Column name: zipcode, Type: int
Column name: single_returns, Type: double
Column name: joint_returns, Type: double
Column name: numdep, Type: double
Column name: total_income_amount, Type: double
Column name: taxable_interest_amount, Type: double
Column name: count, Type: bigint
Column name: zip, Type: int
Column name: population, Type: double
Column name: average_income, Type: double
Column name: dependent_ratio, Type: double


In [0]:
from pyspark.sql.functions import col, when

# Calculate the 75th percentile for average income and population
wealthy_threshold = joined_table.approxQuantile("average_income", [0.75], 0)
populated_threshold = joined_table.approxQuantile("population", [0.75], 0)

# Create binary 'wealthy' and 'populated' features
joined_table = (joined_table
              .withColumn("wealthy_zipcode", 
                          when(col("average_income") > wealthy_threshold[0], 1).otherwise(0))
              .withColumn("populated_zipcode", 
                          when(col("population") > populated_threshold[0], 1).otherwise(0)))

# Show the DataFrame with new features
joined_table.show()


+-------+--------------+-------------+-------+-------------------+-----------------------+-----+-----+----------+------------------+-------------------+---------------+-----------------+
|zipcode|single_returns|joint_returns| numdep|total_income_amount|taxable_interest_amount|count|  zip|population|    average_income|    dependent_ratio|wealthy_zipcode|populated_zipcode|
+-------+--------------+-------------+-------+-------------------+-----------------------+-----+-----+----------+------------------+-------------------+---------------+-----------------+
|  35004|        1960.0|       2150.0| 3200.0|           258024.0|                  964.0|    0|    0|    4110.0| 62.77956204379562| 0.7785888077858881|              0|                0|
|  35444|         480.0|        710.0| 1190.0|            76844.0|                  152.0|    0|    0|    1190.0| 64.57478991596639|                1.0|              0|                0|
|  35640|        3770.0|       5270.0| 7460.0|           543352.0

In [0]:
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# Identifier columns to leave out of the correlation analysis
nonFeatureCols = ["zip", "zipcode"]

# Convert the 'count' column to DoubleType
joined_table = joined_table.withColumn("count", col("count").cast(DoubleType()))

for column in joined_table.columns:
    if column not in nonFeatureCols:  # skip the zip code identifier columns
        print(f"Correlation between count and {column}: {joined_table.stat.corr('count', column)}")



Correlation between count and single_returns: 0.010468341870458496
Correlation between count and joint_returns: 0.010007986855027915
Correlation between count and numdep: 0.009918426525105929
Correlation between count and total_income_amount: 0.01043025480499017
Correlation between count and taxable_interest_amount: 0.010653927714819945
Correlation between count and count: 1.0
Correlation between count and population: 0.010264349133589328
Correlation between count and average_income: 0.08135668465016815
Correlation between count and dependent_ratio: -0.056029436392937344
Correlation between count and wealthy_zipcode: 0.10210115592774131
Correlation between count and populated_zipcode: 0.29444683261738497


## Feature Correlations
Analyzing the correlation of each feature with the 'count' variable shows how strongly each one is linearly related to the number of markets.

-  The correlations between 'count' and most of the features ('single_returns', 'joint_returns', 'numdep', 'total_income_amount', 'taxable_interest_amount', and 'population') are extremely low (about 0.01), indicating very little linear relationship.
-  The 'average_income' feature has a correlation of 0.081 with 'count', a weak positive linear relationship.
-  The 'dependent_ratio' feature shows a weak negative linear relationship with 'count', with a correlation of -0.056.
-  The new features I introduced, 'wealthy_zipcode' and 'populated_zipcode', have noticeably higher correlations with 'count' (0.102 and 0.294, respectively) than the other features. This suggests that categorizing zip codes by income and population may indeed benefit the predictive model.
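
The values above are Pearson correlation coefficients. As a rough illustration of why a thresholded flag can correlate better with the target than the raw quantity it is derived from, the sketch below computes Pearson's r by hand on toy data in which one extreme value dominates the raw feature. The data and the threshold are invented for the example, and this is only one plausible explanation for the pattern seen in the notebook.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values, invented for illustration: one extreme population dominates the raw feature.
population = [1, 2, 3, 4, 1000]
market_count = [0, 0, 1, 1, 1]
populated_flag = [1 if p > 2 else 0 for p in population]

print(pearson(population, market_count))      # raw feature
print(pearson(populated_flag, market_count))  # thresholded flag
```

Because Pearson's r measures only linear association, a heavily skewed raw feature can score low even when a simple threshold on it separates the target well.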


In [0]:
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Define the non-feature columns
nonFeatureCols = ["zipcode", "zip", "count"]

# Get a list of feature columns
featureCols = [item for item in joined_table.columns if item not in nonFeatureCols]

# Assemble the feature columns into a feature vector
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")

rfModel = RandomForestRegressor(labelCol="count", featuresCol="features")

stages = [assembler, rfModel]

pipeline = Pipeline().setStages(stages)

# Split the data into training and test sets
train_data, test_data = joined_table.randomSplit([0.7, 0.3])

# Define the CrossValidator
cv = CrossValidator(estimator=pipeline, 
                    estimatorParamMaps=ParamGridBuilder().build(),  # No parameter search
                    evaluator=RegressionEvaluator(labelCol="count"),
                    numFolds=3)  # Use 3+ in practice

# Fit the model on the training data
model = cv.fit(train_data)

# Make predictions on the test data
predictions = model.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(labelCol="count")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) on test data: {rmse}")

mse = evaluator.evaluate(predictions, {evaluator.metricName: "mse"})
mae = evaluator.evaluate(predictions, {evaluator.metricName: "mae"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R Squared (R2): {r2}")



Root Mean Squared Error (RMSE) on test data: 0.5059397283149242
Mean Squared Error (MSE): 0.2559750086873793
Mean Absolute Error (MAE): 0.3306623264521562
R Squared (R2): 0.1747266973055437


## Model Performance Comparison

The enhanced model outperforms the one from the guide notebook on every metric:

-  RMSE: decreased from 3.406 to 0.506
-  MSE: decreased from 11.598 to 0.256
-  MAE: decreased from 1.216 to 0.331
-  R2: increased from -2.933 to 0.175

These improvements most likely stem from better feature engineering, particularly the categorization of zip codes.
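
For reference, the four metrics compared above can be computed from scratch as in the following sketch. It uses invented labels and a constant prediction, and the helper name `regression_metrics` is hypothetical. It also shows that RMSE is simply the square root of MSE, and that R2 turns negative whenever the predictions do worse than always predicting the mean of the labels, which is what the guide notebook's R2 of -2.933 indicates.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R2 for paired lists of labels and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Invented labels and a constant prediction, for illustration only.
metrics = regression_metrics([0, 0, 1, 2], [0.5, 0.5, 0.5, 0.5])
print(metrics)
```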



## Hyperparameter Tuning

In [0]:
# Define the non-feature columns
nonFeatureCols = ["zipcode", "zip", "count"]

# Get a list of feature columns
featureCols = [item for item in joined_table.columns if item not in nonFeatureCols]

# Assemble the feature columns into a feature vector
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")

rfModel = RandomForestRegressor(labelCol="count", featuresCol="features")

stages = [assembler, rfModel]

pipeline = Pipeline().setStages(stages)

# Split the data into training and test sets
train_data, test_data = joined_table.randomSplit([0.7, 0.3])

# Define a grid of hyperparameters to search over
paramGrid = ParamGridBuilder() \
    .addGrid(rfModel.numTrees, [10, 20, 30]) \
    .addGrid(rfModel.maxDepth, [5, 10]) \
    .build()

# Define the CrossValidator
cv = CrossValidator(estimator=pipeline, 
                    estimatorParamMaps=paramGrid, 
                    evaluator=RegressionEvaluator(labelCol="count"),
                    numFolds=3)

# Fit the model on the training data
model = cv.fit(train_data)

# Make predictions on the test data
predictions = model.transform(test_data)

# Evaluate the model
evaluator = RegressionEvaluator(labelCol="count")
rmse = evaluator.evaluate(predictions)
print(f"Root Mean Squared Error (RMSE) on test data: {rmse}")

mse = evaluator.evaluate(predictions, {evaluator.metricName: "mse"})
mae = evaluator.evaluate(predictions, {evaluator.metricName: "mae"})
r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})

print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R Squared (R2): {r2}")


Root Mean Squared Error (RMSE) on test data: 0.5068063382056074
Mean Squared Error (MSE): 0.25685266444537647
Mean Absolute Error (MAE): 0.3292528243028155
R Squared (R2): 0.16109779132549484


### Conclusions: After tuning the model's hyperparameters, the metrics barely changed (R2 moved from 0.175 to 0.161 on a fresh random split), which suggests that the model is close to its predictive limit given the current features and dataset. I made several attempts to enhance the model, some of them computationally intensive, which underscored for me the vital role that feature selection plays.
### Ways to Improve: To enhance the model further, the feature selection process could be refined from the beginning. This could include using geolocation data (possibly from an API) or ZCTAs to handle zip codes better, and exploring additional columns from the tax dataset, such as N1 (number of returns), to gain a deeper understanding of population trends. In the farmers market dataset, the state and county attributes could provide more geographical insight and help build a more comprehensive picture of the farmers market landscape. Finally, trying different models such as Gradient Boosting or Support Vector Regressors could also help improve the model's performance.
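
The computational cost mentioned in the conclusions grows quickly with the size of the search. A back-of-the-envelope sketch, using the grid and fold count from the tuning cell above, is shown below. The helper name `count_fits` is hypothetical, and the extra fit reflects the fact that `CrossValidator` refits the best parameter combination on the full training set once the search is done.

```python
from itertools import product


def count_fits(param_grid, num_folds):
    """Model fits for k-fold cross-validation over a grid, plus one final refit."""
    combinations = list(product(*param_grid.values()))
    return len(combinations) * num_folds + 1


# Grid taken from the hyperparameter tuning cell of the farmers market analysis.
farmers_grid = {"numTrees": [10, 20, 30], "maxDepth": [5, 10]}
print(count_fits(farmers_grid, num_folds=3))
```

Adding a third value to `maxDepth` raises the number of combinations from 6 to 9, so each extra grid value multiplies the training time rather than adding to it.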

## Using the Apache Spark ML pipeline, build a model to predict the price of a diamond based on the available features.

### Data Exploration

In [0]:
# Read the data into a DataFrame
diamonds = spark.read \
  .format("csv") \
  .option("header", "true") \
  .option("inferSchema", "true") \
  .load("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv")

# Create a temporary view
diamonds.createOrReplaceTempView("diamonds")




In [0]:
%sql
SELECT COUNT(*) AS total_diamonds, AVG(price) AS average_price FROM diamonds


total_diamonds,average_price
53940,3932.799721913237


In [0]:
%sql
SELECT 'Cut' AS feature, cut AS category, COUNT(*) AS count FROM diamonds GROUP BY cut UNION ALL
SELECT 'Color' AS feature, color AS category, COUNT(*) AS count FROM diamonds GROUP BY color UNION ALL
SELECT 'Clarity' AS feature, clarity AS category, COUNT(*) AS count FROM diamonds GROUP BY clarity

feature,category,count
Cut,Premium,13791
Cut,Ideal,21551
Cut,Good,4906
Cut,Fair,1610
Cut,Very Good,12082
Color,F,9542
Color,E,9797
Color,D,6775
Color,J,2808
Color,G,11292


In [0]:
%sql
SELECT 'Cut' AS feature, cut AS category, AVG(price) AS average_price FROM diamonds GROUP BY cut UNION ALL
SELECT 'Color' AS feature, color AS category, AVG(price) AS average_price FROM diamonds GROUP BY color UNION ALL
SELECT 'Clarity' AS feature, clarity AS category, AVG(price) AS average_price FROM diamonds GROUP BY clarity

feature,category,average_price
Cut,Premium,4584.2577042999055
Cut,Ideal,3457.541970210199
Cut,Good,3928.864451691806
Cut,Fair,4358.757763975155
Cut,Very Good,3981.759890746565
Color,F,3724.886396981765
Color,E,3076.7524752475247
Color,D,3169.95409594096
Color,J,5323.81801994302
Color,G,3999.135671271697


#### The following SQL query summarizes the distribution of our target variable, 'price'. Knowing the minimum, maximum, average, and standard deviation of diamond prices helps us interpret the performance of our model, particularly the Root Mean Squared Error (RMSE), which is in the same units as the target variable.

In [0]:
%sql
SELECT 
  MIN(price) AS min_price, 
  MAX(price) AS max_price, 
  AVG(price) AS avg_price, 
  STDDEV(price) AS std_dev_price 
FROM diamonds


min_price,max_price,avg_price,std_dev_price
326,18823,3932.799721913237,3989.439738146397


### Preprocess the data, normalize the features, and split the data into training and test sets


In [0]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Convert categorical variables into numerical form
indexer = StringIndexer(inputCols=["cut", "color", "clarity"], outputCols=["cutIndex", "colorIndex", "clarityIndex"])
diamonds = indexer.fit(diamonds).transform(diamonds)

# Assemble all the features into a single vector
assembler = VectorAssembler(inputCols=["carat", "cutIndex", "colorIndex", "clarityIndex", "depth", "table", "x", "y", "z"], outputCol="features")
diamonds = assembler.transform(diamonds)


In [0]:
from pyspark.ml.feature import StandardScaler

# Scale each feature to unit standard deviation (withMean=False: no centering)
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=True, withMean=False)
scalerModel = scaler.fit(diamonds)
diamonds = scalerModel.transform(diamonds)
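
To make the effect of `withStd=True, withMean=False` concrete, the sketch below applies the same idea to a single column in plain Python: every value is divided by the column's sample standard deviation, and nothing is subtracted. The carat values and the helper name `scale_to_unit_std` are invented for the example.

```python
import statistics


def scale_to_unit_std(values):
    """Divide each value by the sample standard deviation, without centering."""
    std = statistics.stdev(values)
    return [v / std for v in values]


# Invented carat values, for illustration only.
carats = [0.23, 0.31, 0.70, 1.01, 1.50]
scaled = scale_to_unit_std(carats)
print(statistics.stdev(scaled))  # spread after scaling
print(statistics.mean(scaled))   # mean is rescaled but not moved to zero
```

This matches what the notebook's output suggests: in the outlier table further down, a carat of 1.17 becomes roughly 2.468 after scaling, which is consistent with a carat standard deviation of about 0.474.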


In [0]:
# Split the data into training and test sets
training, test = diamonds.randomSplit([0.7, 0.3])


### Building and Evaluating Linear Regression and Random Forest Models with all data: The goal here is to establish a Linear Regression baseline and then compare it with a different machine learning model, in this case a Random Forest Regressor.

In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a Linear Regression model
lr = LinearRegression(labelCol="price", featuresCol="scaledFeatures")

# Fit the model to the training data
lrModel = lr.fit(training)

# Make predictions on the test data
predictions = lrModel.transform(test)

# Evaluate the model on the test data
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

Root Mean Squared Error (RMSE) on test data = 1489.49


In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a Linear Regression model
lr = LinearRegression(labelCol="price", featuresCol="scaledFeatures")

# Fit the model to the training data
lrModel = lr.fit(training)

# Make predictions on the training data
train_predictions = lrModel.transform(training)

# Evaluate the model on training data
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="rmse")
train_rmse = evaluator.evaluate(train_predictions)
print("Root Mean Squared Error (RMSE) on training data = %g" % train_rmse)

# Make predictions on the test data
predictions = lrModel.transform(test)

# Evaluate the model on test data
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)

from pyspark.ml.regression import RandomForestRegressor

# Create a Random Forest Regressor model
rf = RandomForestRegressor(labelCol="price", featuresCol="scaledFeatures")

# Fit the model to the training data
rfModel = rf.fit(training)

# Make predictions on the training data
train_predictions_rf = rfModel.transform(training)

# Evaluate the model on training data
train_rmse_rf = evaluator.evaluate(train_predictions_rf)
print("Root Mean Squared Error (RMSE) on training data = %g" % train_rmse_rf)

# Make predictions on the test data
predictions = rfModel.transform(test)

# Evaluate the model on test data
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)


Root Mean Squared Error (RMSE) on test data = 1169.92


#### Linear Regression (RMSE = 1489.49): The RMSE is about 38% of the average diamond price ($3933), indicating a high error rate and a need for improvement.
#### Random Forest (RMSE = 1169.92): This model performs better, with an RMSE of about 30% of the average price, but there is still room for optimization.
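
The percentages quoted here are the RMSE divided by the average price from the SQL summary. A minimal sketch of that arithmetic follows. The helper name `relative_rmse` is hypothetical, and the numbers are copied from the outputs above.

```python
def relative_rmse(rmse, mean_target):
    """Return RMSE as a fraction of the mean target value."""
    return rmse / mean_target


AVERAGE_PRICE = 3932.799721913237  # average price reported by the SQL summary

results = {"Linear Regression": 1489.49, "Random Forest": 1169.92}
for name, rmse in results.items():
    print(f"{name}: RMSE is {relative_rmse(rmse, AVERAGE_PRICE):.1%} of the average price")
```

The same ratio can be reused for any later RMSE, as long as it is compared against an average computed on comparable data.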

### Evaluating Models without Outliers: To enhance the models, I'll assess their performance post outlier removal. This will clarify the outliers' impact on predictive accuracy and determine if their removal improves the models.
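
The outlier rule used in this section is the common 1.5 × IQR fence: a value counts as an outlier if it falls below Q1 - 1.5·IQR or above Q3 + 1.5·IQR. The plain-Python sketch below shows the rule on invented prices. It uses `statistics.quantiles`, whose interpolation differs from Spark's `approxQuantile`, so the exact fences would not match the notebook's.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented prices, for illustration only.
prices = [400, 600, 900, 1500, 2400, 3800, 5300, 18000]
low, high = iqr_fences(prices)
kept = [p for p in prices if low <= p <= high]
print(low, high)
print(kept)
```

For a right-skewed variable such as price, the lower fence is often negative, so in practice only the expensive tail is removed.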

In [0]:
# Calculate Q1, Q3, and IQR
Q1 = diamonds.approxQuantile("price", [0.25], 0)[0]
Q3 = diamonds.approxQuantile("price", [0.75], 0)[0]
IQR = Q3 - Q1

# Filter out the outliers
outliers = diamonds.filter((diamonds["price"] < Q1 - 1.5*IQR) | (diamonds["price"] > Q3 + 1.5*IQR))

# Print out the outliers
outliers.show()

# Count the total number of outliers
print("Total outliers: ", outliers.count())

+-----+-----+---------+-----+-------+-----+-----+-----+----+----+----+--------+----------+------------+--------------------+--------------------+
|  _c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|cutIndex|colorIndex|clarityIndex|            features|      scaledFeatures|
+-----+-----+---------+-----+-------+-----+-----+-----+----+----+----+--------+----------+------------+--------------------+--------------------+
|23821| 1.17|    Ideal|    F|   VVS1| 62.1| 57.0|11886|6.82|6.73|4.21|     0.0|       2.0|         5.0|[1.17,0.0,2.0,5.0...|[2.46829587654107...|
|23822| 2.08|    Ideal|    I|    SI2| 62.0| 56.0|11886|8.21| 8.1|5.06|     0.0|       5.0|         2.0|[2.08,0.0,5.0,2.0...|[4.38808155829524...|
|23823|  1.7|  Premium|    I|    VS2| 62.2| 58.0|11888|7.65| 7.6|4.74|     1.0|       5.0|         1.0|[1.7,1.0,5.0,1.0,...|[3.58641281206822...|
|23824| 1.09|    Ideal|    F|     IF| 61.6| 55.0|11888|6.59|6.65|4.08|     0.0|       2.0|         6.0|[1.09,0.0,2.0,6.0...|

In [0]:
# Filter out the outliers to get the data without outliers
diamonds_no_outliers = diamonds.filter((diamonds["price"] >= Q1 - 1.5*IQR) & (diamonds["price"] <= Q3 + 1.5*IQR))

# Split the data into training and test sets
training_no_outliers, test_no_outliers = diamonds_no_outliers.randomSplit([0.7, 0.3])

In [0]:
# Fit the model to the training data
lrModel_no_outliers = lr.fit(training_no_outliers)

# Make predictions on the test data
predictions_no_outliers = lrModel_no_outliers.transform(test_no_outliers)

# Evaluate the model
rmse = evaluator.evaluate(predictions_no_outliers)
print("Root Mean Squared Error (RMSE) on test data without outliers = %g" % rmse)


Root Mean Squared Error (RMSE) on test data without outliers = 1078.9


In [0]:
# Fit the model to the training data
rfModel_no_outliers = rf.fit(training_no_outliers)

# Make predictions on the test data
predictions_no_outliers = rfModel_no_outliers.transform(test_no_outliers)

# Evaluate the model
rmse = evaluator.evaluate(predictions_no_outliers)
print("Root Mean Squared Error (RMSE) on test data without outliers = %g" % rmse)


Root Mean Squared Error (RMSE) on test data without outliers = 805.622


#### Linear Regression (RMSE = 1078.9): The RMSE is about 27% of the average diamond price in the full dataset ($3933), a moderate error rate. This is an improvement, but there is still potential for further refinement.
#### Random Forest (RMSE = 805.622): This model performs better, with an RMSE of about 20% of the average price. Because the test set also excludes outliers, part of the gain in both models reflects an easier test set rather than a better model, and there is still room for further optimization.


### Tuning Model Parameters: In the next step, I will leverage the CrossValidator and ParamGridBuilder tools to fine-tune the parameters of the Random Forest model without outliers, aiming to enhance its predictive performance.

In [0]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Define the RandomForestRegressor model
rf = RandomForestRegressor(labelCol="price", featuresCol="scaledFeatures")

# Define the parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [10, 20, 30]) \
    .addGrid(rf.maxDepth, [5, 10, 15]) \
    .build()

# Define the cross-validator
crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid,
                          evaluator=RegressionEvaluator(labelCol="price"),
                          numFolds=3)  # use 3+ folds in practice

# Fit the model to the training data
cvModel = crossval.fit(training_no_outliers)

# Make predictions on the test data
predictions_no_outliers = cvModel.transform(test_no_outliers)

# Evaluate the model
rmse = evaluator.evaluate(predictions_no_outliers)
print("Root Mean Squared Error (RMSE) on test data without outliers = %g" % rmse)


Root Mean Squared Error (RMSE) on test data without outliers = 443.957


#### After parameter tuning, the RMSE on the test data (excluding outliers) is 443.957, about 11% of the average diamond price. This highlights the effectiveness of parameter tuning in improving model accuracy.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize the evaluator
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="r2")

# Calculate R^2
r2 = evaluator.evaluate(predictions_no_outliers)
print("R^2: ", r2)


R^2:  0.9742694627926274


#### The R^2 score of 0.974 indicates that the tuned Random Forest model explains approximately 97.4% of the variance in diamond price on the outlier-free test set. This high score signifies a strong fit to the data and high predictive power.