# Multi-Variable Linear Regression
## Supervised Machine Learning with MLlib

Credit: Ezra Benitez, Johnny Castillo, Edwin Figueroa, Yobani Ledezma, Bryan Romero

### Wine Quality Dataset
[Original Source Material](https://archive.ics.uci.edu/ml/datasets/wine+quality)
###### Only the red wine portion of the dataset was used. The delimiter was changed from ";" to ",".
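
The delimiter change described above could be done in several ways; the following is a minimal sketch using Python's standard `csv` module. The helper name `convert_delimiter` and the three-column sample are illustrative assumptions, not the steps actually used to prepare `redwine.csv`.

```python
import csv
import io


def convert_delimiter(src, dst, old=";", new=","):
    """Copy CSV rows from src to dst, switching the field delimiter."""
    reader = csv.reader(src, delimiter=old)
    writer = csv.writer(dst, delimiter=new, lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Hypothetical two-line sample in a semicolon-separated layout
# (only three of the twelve columns are shown, for brevity).
sample = io.StringIO('"fixed acidity";"volatile acidity";"quality"\n7.4;0.7;5\n')
converted = io.StringIO()
convert_delimiter(sample, converted)
print(converted.getvalue())
```

With real files, the two `StringIO` objects would be replaced by file handles opened with `newline=""`.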

#### Data Fields (attributes)
<table>
  <tr>
    <td width="33%">
      1
    </td>
    <td width="33%">
      fixed acidity
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      2
    </td>
    <td width="33%">
      volatile acidity
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      3
    </td>
    <td width="33%">
      citric acid 
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      4
    </td>
    <td width="33%">
      residual sugar
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      5
    </td>
    <td width="33%">
      chlorides 
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      6
    </td>
    <td width="33%">
       free sulfur dioxide
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      7
    </td>
    <td width="33%">
       total sulfur dioxide
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      8
    </td>
    <td width="33%">
      density 
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      9
    </td>
    <td width="33%">
      pH 
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      10
    </td>
    <td width="33%">
      sulphates 
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      11
    </td>
    <td width="33%">
      alcohol 
    </td>
    <td width="33%">
      feature
    </td>
  </tr>
  <tr>
    <td width="33%">
      12
    </td>
    <td width="33%">
       quality (score between 0 and 10)
    </td>
    <td width="33%">
      label
    </td>
  </tr>
</table>

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.ml.regression import LinearRegression

# Create (or reuse) the Spark session for this notebook
spark = SparkSession.builder.appName('RedWineQuality').getOrCreate()

# Explicit schema: eleven double-valued features plus the integer quality label
schema = StructType([
                     StructField("fixed_acidity", DoubleType(), True),
                     StructField("volatile_acidity", DoubleType(), True), 
                     StructField("citric_acid", DoubleType(), True),
                     StructField("residual_sugar", DoubleType(), True),
                     StructField("chlorides", DoubleType(), True),
                     StructField("free_sulfur_dioxide", DoubleType(), True),
                     StructField("total_sulfur_dioxide", DoubleType(), True),
                     StructField("density", DoubleType(), True),
                     StructField("pH", DoubleType(), True),
                     StructField("sulphates", DoubleType(), True),
                     StructField("alcohol", DoubleType(), True),
                     StructField("quality", IntegerType(), True),
                   ])

data_path = "dbfs:/FileStore/tables/redwine.csv"
df = spark.read.format("csv").option("header", True).schema(schema).load(data_path)

print("Initial Data")
df.show(3)



Initial Data
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|fixed_acidity|volatile_acidity|citric_acid|residual_sugar|chlorides|free_sulfur_dioxide|total_sulfur_dioxide|density|  pH|sulphates|alcohol|quality|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+

In [0]:
# Import VectorAssembler to convert the attribute columns into the feature format Spark ML expects
from pyspark.ml.feature import VectorAssembler

# Combine the eleven attribute columns into a single feature vector column
assembler_object = VectorAssembler(inputCols=['fixed_acidity','volatile_acidity', 'citric_acid','residual_sugar',
                                              'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 
                                              'sulphates', 'alcohol'],
                                   outputCol='red_wine_features')

feature_vector_dataframe = assembler_object.transform(df)

#test output
print(feature_vector_dataframe.head(1))
feature_vector_dataframe.printSchema()
formatted_data = feature_vector_dataframe.select('red_wine_features','quality')
print("Consolidated Data with accepted features and labels")
formatted_data.show(3)


[Row(fixed_acidity=7.4, volatile_acidity=0.7, citric_acid=0.0, residual_sugar=1.9, chlorides=0.076, free_sulfur_dioxide=11.0, total_sulfur_dioxide=34.0, density=0.9978, pH=3.51, sulphates=0.56, alcohol=9.4, quality=5, red_wine_features=DenseVector([7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]))]
root
 |-- fixed_acidity: double (nullable = true)
 |-- volatile_acidity: double (nullable = true)
 |-- citric_acid: double (nullable = true)
 |-- residual_sugar: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- free_sulfur_dioxide: double (nullable = true)
 |-- total_sulfur_dioxide: double (nullable = true)
 |-- density: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- quality: integer (nullable = true)
 |-- red_wine_features: vector (nullable = true)

Consolidated Data with accepted features and labels
+--------------------+-------+
|   red_wine_features|quali
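
To make the assembling step concrete, here is a plain-Python sketch of the idea: the values of the input columns are gathered, in order, into one vector per row. The function `assemble` and the dictionary row are illustrative assumptions; this is not how Spark implements `VectorAssembler` internally.

```python
FEATURE_COLUMNS = [
    "fixed_acidity", "volatile_acidity", "citric_acid", "residual_sugar",
    "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide", "density",
    "pH", "sulphates", "alcohol",
]
LABEL_COLUMN = "quality"


def assemble(row, input_cols=FEATURE_COLUMNS):
    """Return the row's feature values as one list, in input_cols order."""
    return [float(row[col]) for col in input_cols]


# First row of the dataset, as shown in the "Initial Data" output.
first_row = {
    "fixed_acidity": 7.4, "volatile_acidity": 0.7, "citric_acid": 0.0,
    "residual_sugar": 1.9, "chlorides": 0.076, "free_sulfur_dioxide": 11.0,
    "total_sulfur_dioxide": 34.0, "density": 0.9978, "pH": 3.51,
    "sulphates": 0.56, "alcohol": 9.4, "quality": 5,
}

features = assemble(first_row)
print(features, first_row[LABEL_COLUMN])
```

The resulting list holds the same eleven values as the `DenseVector` printed for the first row, while the label stays in its own column.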

In [0]:
# Split the data into roughly 60% training data and 40% test data
train_data, test_data = formatted_data.randomSplit([0.6, 0.4])

# Define the linear regression estimator
lireg = LinearRegression(featuresCol='red_wine_features', labelCol='quality')

# Train the model on the training data
lireg_model = lireg.fit(train_data)
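
Because the split is random, the metrics and coefficients shown in this notebook will differ from run to run unless a seed is fixed; PySpark's `randomSplit` accepts an optional `seed` argument for that purpose. The sketch below illustrates the general idea of a seeded, weighted random split in plain Python. The function `random_split` and the 1000 placeholder rows are assumptions for illustration, not Spark's actual sampling code.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by an independent, seeded random draw."""
    total = sum(weights)
    # Cumulative boundaries in (0, 1], e.g. [0.6, 1.0] for weights [0.6, 0.4].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        index = 0
        while index < len(bounds) - 1 and draw >= bounds[index]:
            index += 1
        splits[index].append(row)
    return splits


rows = list(range(1000))  # hypothetical stand-in for DataFrame rows
train, test = random_split(rows, [0.6, 0.4], seed=42)
print(len(train), len(test))
```

Each row is drawn independently, so the split sizes are only approximately 60% and 40%, but the same seed always reproduces the same split.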

In [0]:
# Evaluate the model on the held-out test data
test_results = lireg_model.evaluate(test_data)
print("Residuals info - distance between data points and fitted regression line")
test_results.residuals.show(4)

print("Root Mean Square Error {}".format(test_results.rootMeanSquaredError))
print("R square value {}".format(test_results.r2))


Residuals info - distance between data points and fitted regression line
+-------------------+
|          residuals|
+-------------------+
|-1.8744400707056954|
|-0.6751363008186768|
|  1.377169082859517|
|-0.8626699660468686|
+-------------------+
only showing top 4 rows

Root Mean Square Error 0.6766899929135565
R square value 0.3348626568765454


In [0]:
# Create unlabeled data from test_data by dropping the label column, then get predictions
unlabeled_data = test_data.select('red_wine_features')
predictions = lireg_model.transform(unlabeled_data)
print("\nPredictions for Novel Data")
predictions.show(4)

# Check the model manually against a new value
print("Coefficients are {}".format(lireg_model.coefficients))
print("\nIntercept is {}".format(lireg_model.intercept))

# Hypothetical example
sample_fixed_acidity = 10
sample_volatile_acidity = 0.7
sample_citric_acid = 0
sample_residual_sugar = 1.9
sample_chlorides = 0.076
sample_free_sulfur_dioxide = 11
sample_total_sulfur_dioxide = 34
sample_density = 0.9978
sample_pH = 3.51
sample_sulphates = 0.56
sample_alcohol = 9.4

# Mimic the hypothesis function: intercept plus the sum of coefficient * feature value
sample_values = [sample_fixed_acidity, sample_volatile_acidity, sample_citric_acid, sample_residual_sugar,
                 sample_chlorides, sample_free_sulfur_dioxide, sample_total_sulfur_dioxide, sample_density,
                 sample_pH, sample_sulphates, sample_alcohol]
# Starting the sum at the intercept keeps the same order of additions as writing every term out by hand
quality = sum((coefficient * value for coefficient, value in zip(lireg_model.coefficients.toArray(), sample_values)),
              lireg_model.intercept)

print("\nPredicted red wine quality with the following composition:\n\nFixed Acidity:: \t{}\nVolatile Acidity:: \t{}\nCitric Acid:: \t\t{}\nResidual Sugar:: \t{}\nChlorides:: \t\t{}\nFree Sulfur Dioxide:: \t{}\nTotal Sulfur Dioxide:: \t{}\nDensity:: \t\t{}\nPH:: \t\t\t{}\nSulphates:: \t\t{}\nAlcohol:: \t\t{}\n\nWould have a quality index of:: \t{}".format(sample_fixed_acidity, sample_volatile_acidity,
    sample_citric_acid, sample_residual_sugar, sample_chlorides, sample_free_sulfur_dioxide, sample_total_sulfur_dioxide, sample_density, sample_pH, sample_sulphates, sample_alcohol, quality))


Predictions for Novel Data
+--------------------+-----------------+
|   red_wine_features|       prediction|
+--------------------+-----------------+
|[4.6,0.52,0.15,2....|5.874440070705695|
|[5.0,0.38,0.01,1....|6.675136300818677|
|[5.0,0.42,0.24,2....|6.622830917140483|
|[5.0,1.02,0.04,1....|4.862669966046869|
+--------------------+-----------------+
only showing top 4 rows
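
The predictions above can be related to the residuals printed earlier: a residual is the label minus the prediction, so adding each residual to its prediction should recover the original quality score. The sketch below checks this for the four printed rows, under the assumption that both outputs list the same test rows in the same order.

```python
# Values copied from the two outputs shown in this notebook.
predictions = [5.874440070705695, 6.675136300818677, 6.622830917140483, 4.862669966046869]
residuals = [-1.8744400707056954, -0.6751363008186768, 1.377169082859517, -0.8626699660468686]

# residual = label - prediction, so label = prediction + residual.
# Rounding removes floating-point noise in the last digits.
recovered_labels = [round(p + r, 9) for p, r in zip(predictions, residuals)]
print(recovered_labels)
```

Each recovered value is a whole number, which is consistent with the integer quality scores in the dataset.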

Coefficients are [-0.03028338834524094,-0.9582981662296036,0.13107657193825065,-0.011540329018867914,-1.9092069439173294,0.006591836279829464,-0.004972689741099144,1.8559194821441756,-0.5154490025325412,1.1465607796430188,0.24452384396725804]

Intercept is 3.2114635303301426

Predicted red wine quality with the following composition:

Fixed Acidity:: 	10
Volatile Acidity:: 	0.7
Citric Acid:: 		0
Residual Sugar:: 	1.9
Chlorides:: 		0.076
Free Sulfur Dioxide:: 	11
Total Sulfur Dioxide:: 	34
Density:: 		0.9978
PH:: 			3.51
Sulphates:: 		0.56
Alcohol:: 		9.4

Would have a quality index of:: 	4.9574419558107525
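
The manual calculation can also be reproduced without Spark by plugging the printed coefficients and intercept into the hypothesis function, quality = intercept + Σ coefficientᵢ × featureᵢ. The sketch below does this; the function name `predict_quality` is illustrative, and the numbers apply only to the particular random split that produced the output above.

```python
# Coefficients and intercept copied from the output above
# (they would differ for another random train/test split).
COEFFICIENTS = [
    -0.03028338834524094, -0.9582981662296036, 0.13107657193825065,
    -0.011540329018867914, -1.9092069439173294, 0.006591836279829464,
    -0.004972689741099144, 1.8559194821441756, -0.5154490025325412,
    1.1465607796430188, 0.24452384396725804,
]
INTERCEPT = 3.2114635303301426


def predict_quality(features, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Linear hypothesis: intercept plus the dot product of coefficients and features."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Same hypothetical wine as in the cell above, in feature-column order.
sample = [10, 0.7, 0, 1.9, 0.076, 11, 34, 0.9978, 3.51, 0.56, 9.4]
quality = predict_quality(sample)
print("Predicted quality: {:.4f}".format(quality))
```

Up to floating-point rounding, this gives the same quality index of about 4.957 that the notebook cell printed.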
