# PySpark Stocks
## Adrián Téllez

In [None]:
!pip install pyspark
!pip install yfinance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:

# Import the required libraries
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
import pandas as pd
import yfinance as yf
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler


In [None]:
spark = SparkSession.builder.appName('stock_prediction').getOrCreate()

In [None]:

def precios_acciones(stock_sym):
  """Fit a linear regression on a ticker's price history and print MSE and R2."""
  print(f'We are going to make projections for the share price of {stock_sym}')
  # Download the full price history into a pandas DataFrame
  ticker = yf.Ticker(stock_sym)
  df = ticker.history(period="max")
  df.info()
  # Convert the pandas DataFrame to a PySpark DataFrame
  df = spark.createDataFrame(df)
  # Use every column except the label, Close, as a feature
  feature_cols = df.drop("Close").columns
  assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
  df = assembler.transform(df)
  df = df.select("Close", "features")
  # Random split: approximately 70% training, 30% test
  train, test = df.randomSplit([0.7, 0.3])
  # Fit the model on the training set and predict on the test set
  lr = LinearRegression(labelCol="Close", featuresCol="features")
  model = lr.fit(train)
  predictions = model.transform(test)
  # Compare the predictions with the actual closing prices
  evaluator = RegressionEvaluator(labelCol="Close")
  mse = evaluator.evaluate(predictions, {evaluator.metricName: "mse"})
  r2 = evaluator.evaluate(predictions, {evaluator.metricName: "r2"})
  print(f'Mean Squared Error (MSE) = {mse:.3f}')
  print(f'Coefficient of Determination (R2) = {r2:.3f}')
  # show() prints the first 20 rows and returns None
  predictions.show()
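
The function above divides the rows with `randomSplit`, which assigns each row to the training or test set at random, regardless of its date. For time-ordered prices, one alternative that is sometimes used is to train on the earliest rows and test on the most recent ones. The following is only a minimal sketch in plain Python, assuming the rows are already sorted oldest first; `chronological_split` and the sample prices are hypothetical and not part of the notebook.

```python
def chronological_split(rows, train_fraction=0.7):
    """Split date-ordered rows into an earlier training part and a later test part."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Ten made-up closing prices, oldest first (illustrative values only)
closes = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3, 11.1, 11.6, 12.0, 12.4]
train_rows, test_rows = chronological_split(closes, 0.7)
print(len(train_rows), len(test_rows))
```

With this kind of split every test row is later in time than every training row, which a random split does not guarantee.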








In [None]:
precios_acciones("MSFT")


We are going to make projections for the share price of MSFT
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9266 entries, 1986-03-13 00:00:00-05:00 to 2022-12-14 00:00:00-05:00
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          9266 non-null   float64
 1   High          9266 non-null   float64
 2   Low           9266 non-null   float64
 3   Close         9266 non-null   float64
 4   Volume        9266 non-null   int64  
 5   Dividends     9266 non-null   float64
 6   Stock Splits  9266 non-null   float64
dtypes: float64(6), int64(1)
memory usage: 837.2 KB
Mean Squared Error (MSE) = 0.327
Coefficient of Determination (R2) = 1.000
+--------------------+--------------------+--------------------+
|               Close|            features|          prediction|
+--------------------+--------------------+--------------------+
| 0.05755116418004036|[0.05646536384981...|0.062122309665051376|
|0.05
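
The output above reports an R2 that rounds to 1.000 alongside a nonzero MSE. As a hedged illustration of how the two metrics relate, the sketch below computes both by hand using the usual textbook definitions (mean of squared residuals, and one minus the ratio of residual to total sum of squares). The helper names `mse` and `r2` and the sample numbers are made up for this example; they are not taken from the notebook's data.

```python
def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """One minus the ratio of residual sum of squares to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

# The same series with every value multiplied by 100
actual_scaled = [a * 100 for a in actual]
predicted_scaled = [p * 100 for p in predicted]

print(mse(actual, predicted), r2(actual, predicted))
print(mse(actual_scaled, predicted_scaled), r2(actual_scaled, predicted_scaled))
```

Rescaling the series changes the MSE by the square of the scale factor but leaves R2 unchanged, so the MSE depends on the price level of the stock while R2 does not.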

In [None]:
precios_acciones("AAPL")

We are going to make projections for the share price of AAPL
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10592 entries, 1980-12-12 00:00:00-05:00 to 2022-12-14 00:00:00-05:00
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          10592 non-null  float64
 1   High          10592 non-null  float64
 2   Low           10592 non-null  float64
 3   Close         10592 non-null  float64
 4   Volume        10592 non-null  int64  
 5   Dividends     10592 non-null  float64
 6   Stock Splits  10592 non-null  float64
dtypes: float64(6), int64(1)
memory usage: 920.0 KB
Mean Squared Error (MSE) = 0.060
Coefficient of Determination (R2) = 1.000
+--------------------+--------------------+--------------------+
|               Close|            features|          prediction|
+--------------------+--------------------+--------------------+
|0.039949383586645126|[0.03994938358664...| 0.04160244898750414|
|0.0

In [None]:
precios_acciones("NKE")

We are going to make projections for the share price of NKE
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 10600 entries, 1980-12-02 00:00:00-05:00 to 2022-12-14 00:00:00-05:00
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          10600 non-null  float64
 1   High          10600 non-null  float64
 2   Low           10600 non-null  float64
 3   Close         10600 non-null  float64
 4   Volume        10600 non-null  int64  
 5   Dividends     10600 non-null  float64
 6   Stock Splits  10600 non-null  float64
dtypes: float64(6), int64(1)
memory usage: 920.5 KB
Mean Squared Error (MSE) = 0.079
Coefficient of Determination (R2) = 1.000
+-------------------+--------------------+-------------------+
|              Close|            features|         prediction|
+-------------------+--------------------+-------------------+
|0.07790081948041916|[0.07931710449456...|0.08162568720846017|
|  0.07931710

In [None]:
precios_acciones("IBM")

We are going to make projections for the share price of IBM
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 15345 entries, 1962-01-02 00:00:00-05:00 to 2022-12-14 00:00:00-05:00
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          15345 non-null  float64
 1   High          15345 non-null  float64
 2   Low           15345 non-null  float64
 3   Close         15345 non-null  float64
 4   Volume        15345 non-null  int64  
 5   Dividends     15345 non-null  float64
 6   Stock Splits  15345 non-null  float64
dtypes: float64(6), int64(1)
memory usage: 1.4 MB
Mean Squared Error (MSE) = 0.098
Coefficient of Determination (R2) = 1.000
+------------------+--------------------+------------------+
|             Close|            features|        prediction|
+------------------+--------------------+------------------+
|0.9017170667648315|[0.92527728237413...|0.9044140715438159|
|0.9138545989990234|[0.

In [None]:
precios_acciones("TSLA")

We are going to make projections for the share price of TSLA
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3139 entries, 2010-06-29 00:00:00-04:00 to 2022-12-14 00:00:00-05:00
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Open          3139 non-null   float64
 1   High          3139 non-null   float64
 2   Low           3139 non-null   float64
 3   Close         3139 non-null   float64
 4   Volume        3139 non-null   int64  
 5   Dividends     3139 non-null   int64  
 6   Stock Splits  3139 non-null   float64
dtypes: float64(5), int64(2)
memory usage: 260.7 KB
Mean Squared Error (MSE) = 7.899
Coefficient of Determination (R2) = 0.999
+------------------+--------------------+------------------+
|             Close|            features|        prediction|
+------------------+--------------------+------------------+
|1.1733330488204956|[1.18666696548461...|1.1645063629466845|
|1.2213330268859863|[
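
MSE is expressed in squared price units, which makes the five runs above hard to compare at a glance. Taking the square root gives the RMSE, which is in the same units as the `Close` column. The sketch below only re-expresses the MSE values printed in the outputs above; the dictionary names are introduced here for illustration.

```python
import math

# MSE values copied from the printed outputs of the five runs above
reported_mse = {
    "MSFT": 0.327,
    "AAPL": 0.060,
    "NKE": 0.079,
    "IBM": 0.098,
    "TSLA": 7.899,
}

# RMSE is the square root of MSE, in the same units as the Close column
rmse = {symbol: math.sqrt(value) for symbol, value in reported_mse.items()}

for symbol, value in rmse.items():
    print(f"{symbol}: RMSE = {value:.3f}")
```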