# **Regression Algorithms**
A regression algorithm predicts the value of a continuous attribute (the target attribute) by applying a model to the predictive attributes. The regression algorithms available in Spark work only on numerical data.

The input data must be transformed into a DataFrame with the following attributes:
- label: double
    - The continuous numerical value to be predicted
- features: Vector of doubles
    - Predictive features
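
The layout above can be pictured without Spark. The following plain-Python sketch turns hypothetical raw records into (label, features) pairs, which is roughly what the VectorAssembler stage does for DataFrame columns. The record values are made up, and the column names `attr1`, `attr2`, `attr3` are used only for illustration.

```python
# Hypothetical raw records (values invented for illustration).
raw_rows = [
    {"label": 2.0, "attr1": 0.0, "attr2": 1.1, "attr3": 0.1},
    {"label": 5.0, "attr1": 2.0, "attr2": 1.0, "attr3": -1.0},
]
feature_cols = ["attr1", "attr2", "attr3"]


def to_label_features(row, cols):
    """Return (label, features): a double label and a list of double features."""
    return float(row["label"]), [float(row[c]) for c in cols]


prepared = [to_label_features(r, feature_cols) for r in raw_rows]
for label, features in prepared:
    print(label, features)
```

Each pair corresponds to one DataFrame row: the first element plays the role of the `label` column, the second the role of the `features` vector.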
    
## **Linear Regression**
Linear regression is a popular, effective, and efficient regression algorithm. It models the target as a weighted sum of the predictive features plus an intercept.
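
As a rough illustration of what fitting a linear model means, the following sketch fits a line to a single feature using closed-form ordinary least squares in plain Python. It is not Spark's implementation, which handles many features and regularization; the function name and the toy data are assumptions made for this example.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data generated from y = 2x + 1, so the fit should recover those values.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```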

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

# input and output folders
trainingData = "ex_dataregression/trainingData.csv"
unlabeledData = "ex_dataregression/unlabeledData.csv"
outputPath = "predictionsLinearRegressionPipeline/"

In [None]:
# *************************
# Training step
# *************************

# Create a DataFrame from trainingData.csv
# Training data in raw format
trainingData = spark.read.load(trainingData,
                                format="csv", header=True,
                                inferSchema=True)

# Define an assembler to create a column (features) of type Vector
# containing the double values associated with columns attr1, attr2, attr3
assembler = VectorAssembler(inputCols=["attr1", "attr2", "attr3"], outputCol="features")

In [None]:
# Create a LinearRegression object.
lr = LinearRegression()
# Run at most 10 optimization iterations
lr.setMaxIter(10)
# Set the regularization parameter
lr.setRegParam(0.01)

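# Define a pipeline that chains the assembler and the regression algorithm
pipeline = Pipeline().setStages([assembler, lr])

# Execute the pipeline on the training data to fit the regression model
regressionModel = pipeline.fit(trainingData)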
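
The effect of the regularization parameter can be sketched in plain Python. The example below assumes an L2 (ridge) penalty, which is what `regParam` applies when `elasticNetParam` is left at its default of 0.0, and it simplifies heavily: one feature, no intercept, no feature standardization. It is therefore not expected to reproduce Spark's coefficients; the function name and data are hypothetical.

```python
def ridge_weight(xs, ys, reg_param):
    """Minimize (1/(2n)) * sum((w*x - y)^2) + (reg_param/2) * w^2 over w."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sum_xy / (sum_xx + n * reg_param).
    return sum_xy / (sum_xx + n * reg_param)


# Toy data generated from y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for reg_param in (0.0, 0.01, 1.0):
    print(reg_param, ridge_weight(xs, ys, reg_param))
```

A larger penalty pulls the weight further toward zero, trading a slightly worse fit on the training data for a simpler model.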

In [None]:
# Create a DataFrame from unlabeledData.csv
unlabeledData = spark.read.load(unlabeledData,
                    format="csv", header=True, inferSchema=True)

# Make predictions
predictionsDF = regressionModel.transform(unlabeledData)
predictions = predictionsDF.select("attr1", "attr2", "attr3", "prediction")
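
Conceptually, the `prediction` column added by `transform` holds, for each row, the dot product of the learned coefficients with the features vector, plus the intercept. A minimal sketch in plain Python follows; the coefficient and intercept values are hypothetical, since the real ones come from the fitted model.

```python
def predict(features, coefficients, intercept):
    """Linear model prediction: dot(coefficients, features) + intercept."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept


# Hypothetical model parameters for three features (attr1, attr2, attr3).
coefficients = [0.5, -1.0, 2.0]
intercept = 1.0

# Hypothetical unlabeled rows, one feature list per row.
rows = [[2.0, 1.0, 0.5], [0.0, 0.0, 0.0]]
preds = [predict(r, coefficients, intercept) for r in rows]
print(preds)
```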