<a href="https://www.kaggle.com/code/youlikecode/predicting-stock-pyspark?scriptVersionId=122713433" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Problem definition

In this project, we’ll learn how to predict stock prices using Python and PySpark. We’ll create a machine learning model and evaluate it on a held-out test set.

In this case, let’s say that we are trading stocks. We’re interested in making profitable stock trades with minimal risk. So when we buy a stock, we want to be fairly certain that the price will increase. We’ll buy stock when the market opens, and sell it when the market closes.
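
The open-to-close trade described above can be made concrete with a small sketch. The function name `intraday_return` and the sample prices are hypothetical, chosen only for illustration, and transaction costs are ignored.

```python
def intraday_return(open_price: float, close_price: float) -> float:
    """Fractional return from buying at the open and selling at the close."""
    if open_price <= 0:
        raise ValueError("open_price must be positive")
    return (close_price - open_price) / open_price


# Hypothetical prices, not taken from the dataset
print(intraday_return(100.0, 102.0))  # 0.02  (profitable day)
print(intraday_return(50.0, 49.0))    # -0.02 (losing day)
```

A trade is profitable under this definition exactly when the close is above the open, which is the outcome we would like the model to help anticipate.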

In [1]:
!pip install pyspark

In [2]:
# Loading the packages
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator

In [3]:
# Create an instance of your Spark session
spark = SparkSession.builder.appName('stock_price_prediction').getOrCreate()

In [4]:
# Load the data from the csv file
path = "/kaggle/input/daily-historical-stock-prices-1970-2018/historical_stock_prices.csv"
df = spark.read.csv(path, header=True, inferSchema=True)
df = df.withColumn('date', df['date'].cast('timestamp'))

In [5]:
# Pre-process the data
data = df.select('open', 'close', 'low', 'high', 'volume').na.drop()
data = data.select(*(data[c].cast("float").alias(c) for c in data.columns))

In [6]:
# Split the data into a training set and a test set
(trainingData, testData) = data.randomSplit([0.7, 0.3], seed=42)
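
`randomSplit` assigns rows to the two sets at random, so with time-ordered prices the test set can contain days that come before days in the training set. One alternative, shown here only as a plain-Python sketch and not as this notebook's method, is to split at a cutoff date. The helper `split_by_date`, the row layout, and the sample values are all hypothetical.

```python
from typing import List, Tuple

# Simplified, hypothetical row layout: (ISO date string, close price)
Row = Tuple[str, float]


def split_by_date(rows: List[Row], cutoff: str) -> Tuple[List[Row], List[Row]]:
    """Rows dated before `cutoff` go to training, the rest to testing.

    ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    """
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test


rows = [
    ("2018-01-04", 10.4),
    ("2018-01-02", 10.0),
    ("2018-01-05", 10.1),
    ("2018-01-03", 10.2),
]
train, test = split_by_date(rows, "2018-01-04")
print(len(train), len(test))
```

With a split like this, every training row precedes every test row, which is closer to how the model would be used when trading.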

In [7]:
# Define the features and the target variable
assembler = VectorAssembler(inputCols=['open', 'low', 'high', 'volume'], outputCol='features')
trainingData = assembler.transform(trainingData)
testData = assembler.transform(testData)
trainingData = trainingData.select("features", "close")
testData = testData.select("features", "close")
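
The feature list above uses the same day's `low`, `high`, and `volume`, which are only known once the trading session is over. If the goal is a decision at the open, as in the problem definition, one option is to use the previous day's values instead. The sketch below is plain Python under that assumption; the helper `add_previous_day_features`, the `prev_*` names, and the sample rows are hypothetical and are not part of the notebook.

```python
from typing import Dict, List


def add_previous_day_features(days: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """For date-ordered rows of ONE stock, pair each day's open and close
    with the previous day's low, high, and volume.

    The first day is dropped because it has no previous day.
    """
    out = []
    for prev, cur in zip(days, days[1:]):
        out.append({
            "open": cur["open"],            # known at the open
            "prev_low": prev["low"],        # known from yesterday
            "prev_high": prev["high"],
            "prev_volume": prev["volume"],
            "close": cur["close"],          # target
        })
    return out


# Hypothetical rows, already sorted by date
days = [
    {"open": 10.0, "low": 9.8, "high": 10.5, "volume": 1000.0, "close": 10.2},
    {"open": 10.3, "low": 10.1, "high": 10.6, "volume": 1200.0, "close": 10.4},
    {"open": 10.4, "low": 10.0, "high": 10.7, "volume": 900.0, "close": 10.1},
]
lagged = add_previous_day_features(days)
print(lagged[0])
```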

In [8]:
# Create the model using PySpark's linear regression algorithm
lr = LinearRegression(labelCol="close", featuresCol="features", regParam=0.1)
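
With `regParam=0.1` and `elasticNetParam` left at its default of 0, the penalty is pure L2 (ridge). As described in the Spark ML documentation for linear regression, the objective being minimized has the form below. The symbols are notation chosen here for illustration: $\lambda$ is `regParam`, $\alpha$ is `elasticNetParam`, $n$ is the number of training rows, $x_i$ and $y_i$ are the feature vector and `close` value of row $i$, and $w$ is the coefficient vector (the intercept is omitted for brevity).

```latex
\min_{w}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(x_i^{\top} w - y_i\right)^{2}
\;+\;
\lambda\left(\alpha\,\lVert w\rVert_{1}
+ \frac{1-\alpha}{2}\,\lVert w\rVert_{2}^{2}\right)
```

For the settings used here ($\lambda = 0.1$, $\alpha = 0$) the penalty term reduces to $0.05\,\lVert w\rVert_2^2$.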

In [9]:
# Train the model using the training set
lrModel = lr.fit(trainingData)

In [10]:
# Generate predictions on the test set
predictions = lrModel.transform(testData)

In [11]:
# Evaluate the model using evaluation metrics

# mean squared error
evaluator = RegressionEvaluator(labelCol="close", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)

# coefficient of determination
evaluator = RegressionEvaluator(labelCol="close", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)

print('MSE:', mse)
print('R2:', r2)



MSE: 7611.022790527066
R2: 0.9990251050868884
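
To make the two metrics above easier to interpret, here is a minimal plain-Python sketch of how they are defined. The function names and the toy values are hypothetical; the notebook itself relies on `RegressionEvaluator`. The last lines convert the reported MSE into an RMSE of about 87.24, which is in the same units as the close price.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))

# RMSE for the MSE reported above
rmse = math.sqrt(7611.022790527066)
print(round(rmse, 2))
```

Because R2 is measured relative to the variance of `close` across the whole test set, it can be very close to 1 even when the typical error in price units is not small.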