#**CS 431 Final Project Winter 2021**



Authors:

Arsheya Jain (a76jain) and Gregory Hogg (gahogg)

The cells below install Java 8 and Spark 2.4.7, set the JAVA_HOME and SPARK_HOME environment variables that PySpark needs, and create a SparkContext and a SparkSession.

In [1]:
!apt-get update -qq > /dev/null
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!pip install -q findspark

In [16]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.7-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark.sql import SparkSession
import random

# Build the SparkSession first so that its settings (master, UI port) take effect,
# then reuse its SparkContext instead of creating a second one.
# getOrCreate() also makes this cell safe to re-run.
spark = SparkSession.builder.appName("YourTest").master("local[*]").config('spark.ui.port', random.randrange(4000,5000)).getOrCreate()
sc = spark.sparkContext

##Problem statement 

Which of Microsoft, Nintendo, and Sony is the best company to invest in,
given that you plan to sell exactly x years from today (April 14, 2021)?
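
One way to make the question concrete is to compare the growth factor each stock would deliver over the holding period: the predicted price at the sale date divided by today's price. The sketch below assumes that some model has already produced a predicted price for each company. The function name `best_investment` and the numbers in the example are hypothetical placeholders, not results from this project.

```python
def best_investment(current_prices, predicted_prices):
  """Return (company, growth_factor) for the highest predicted growth.

  Both arguments map company name -> price in USD. The growth factor is
  predicted price / current price, so 1.0 means no change.
  """
  growth = {
    company: predicted_prices[company] / current_prices[company]
    for company in current_prices
  }
  best = max(growth, key=growth.get)
  return best, growth[best]


# Placeholder numbers for illustration only (not real prices or forecasts).
current = {"Microsoft": 100.0, "Nintendo": 50.0, "Sony": 80.0}
predicted = {"Microsoft": 150.0, "Nintendo": 100.0, "Sony": 100.0}
print(best_investment(current, predicted))  # ('Nintendo', 2.0)
```

Ratios are compared rather than absolute prices because share prices differ between companies, so a higher price alone says nothing about the return.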

 

In [17]:
# For converting strings to datetime

from datetime import datetime
def to_date(d):
  return datetime.strptime(d, '%Y-%m-%d')
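
As a quick check of the helper, the snippet below repeats its definition so that it runs on its own and parses one date in the YYYY-MM-DD form that the rest of the notebook assumes for the Date column. The invalid-input case is an added illustration: `datetime.strptime` raises `ValueError` when the string does not match the format.

```python
from datetime import datetime

def to_date(d):
  return datetime.strptime(d, '%Y-%m-%d')

print(to_date('2011-01-03'))  # 2011-01-03 00:00:00

try:
  to_date('01/03/2011')  # wrong layout for '%Y-%m-%d'
except ValueError as err:
  print('could not parse:', err)
```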

In [22]:
# Using Spark to remove unnecessary columns, separate the data into Microsoft, Sony, and Nintendo,
# and keep only data from 2011 onward (roughly the last decade)

original_cols = ['Stock', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
data = sc.textFile('stocks.csv').map(lambda x: x.split(',')).map(lambda x: [x[0], x[1], x[5]])  # keep Stock, Date, Close

def closing_prices(company):
  # [date, close] rows for one company, from 2011 onward
  return data.filter(lambda x: company == x[0]).filter(lambda x: int(x[1][:4]) >= 2011).map(lambda x: [to_date(x[1]), float(x[2])])

microsoft = closing_prices('Microsoft')
sony = closing_prices('Sony')
nintendo = closing_prices('Nintendo')
nintendo.take(5)

[[datetime.datetime(2011, 1, 3, 0, 0), 36.700001],
 [datetime.datetime(2011, 1, 4, 0, 0), 35.5],
 [datetime.datetime(2011, 1, 5, 0, 0), 34.599998],
 [datetime.datetime(2011, 1, 6, 0, 0), 35.25],
 [datetime.datetime(2011, 1, 7, 0, 0), 35.200001]]
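
To see what each step of the Spark code above does to a single record, the same select, filter, and convert logic can be sketched in plain Python. This is only an illustration: the sample rows are made up (only the column order follows `original_cols`), the helper name `closing_prices_plain` is hypothetical, and the `csv` module stands in for `split(',')`.

```python
import csv
import io
from datetime import datetime

# Made-up rows in the column order of original_cols; all values are placeholders.
sample = io.StringIO(
  "Stock,Date,Open,High,Low,Close,Adj Close,Volume\n"
  "Nintendo,2010-12-31,1.0,1.0,1.0,1.0,1.0,100\n"
  "Nintendo,2011-01-03,2.0,2.0,2.0,2.5,2.5,100\n"
  "Sony,2011-01-03,3.0,3.0,3.0,3.5,3.5,100\n"
)

def closing_prices_plain(lines, company, first_year=2011):
  """Return [date, close] pairs for one company from first_year onward."""
  rows = []
  for row in csv.reader(lines):
    stock, date, close = row[0], row[1], row[5]  # Stock, Date, Close columns
    if stock != company:                         # also skips the header row
      continue
    if int(date[:4]) < first_year:               # same year test as the RDD filter
      continue
    rows.append([datetime.strptime(date, '%Y-%m-%d'), float(close)])
  return rows

print(closing_prices_plain(sample, 'Nintendo'))
# [[datetime.datetime(2011, 1, 3, 0, 0), 2.5]]
```

The header line needs no special handling here because its first field never equals a company name, which is presumably also why the Spark version can skip it without an explicit step.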

In [30]:
# Converting processed RDDs to Spark DataFrames

nintendo_df = spark.createDataFrame(nintendo).withColumnRenamed('_1', 'Date').withColumnRenamed('_2', 'Closing Value (USD)')
sony_df = spark.createDataFrame(sony).withColumnRenamed('_1', 'Date').withColumnRenamed('_2', 'Closing Value (USD)')
microsoft_df = spark.createDataFrame(microsoft).withColumnRenamed('_1', 'Date').withColumnRenamed('_2', 'Closing Value (USD)')
microsoft_df.show()

+-------------------+-------------------+
|               Date|Closing Value (USD)|
+-------------------+-------------------+
|2011-01-03 00:00:00|              27.98|
|2011-01-04 00:00:00|              28.09|
|2011-01-05 00:00:00|               28.0|
|2011-01-06 00:00:00|              28.82|
|2011-01-07 00:00:00|               28.6|
|2011-01-10 00:00:00|          28.219999|
|2011-01-11 00:00:00|          28.110001|
|2011-01-12 00:00:00|          28.549999|
|2011-01-13 00:00:00|          28.190001|
|2011-01-14 00:00:00|          28.299999|
|2011-01-18 00:00:00|              28.66|
|2011-01-19 00:00:00|          28.469999|
|2011-01-20 00:00:00|              28.35|
|2011-01-21 00:00:00|              28.02|
|2011-01-24 00:00:00|          28.379999|
|2011-01-25 00:00:00|          28.450001|
|2011-01-26 00:00:00|          28.780001|
|2011-01-27 00:00:00|          28.870001|
|2011-01-28 00:00:00|              27.75|
|2011-01-31 00:00:00|              27.73|
+-------------------+-------------------+

In [52]:
# Converting Spark DataFrames to Pandas DataFrames

nintendo_pd_df = nintendo_df.select('*').toPandas()
microsoft_pd_df = microsoft_df.select('*').toPandas()
sony_pd_df = sony_df.select('*').toPandas()
sony_pd_df

            Date  Closing Value (USD)
0     2011-01-03            36.380001
1     2011-01-04            36.160000
2     2011-01-05            36.209999
3     2011-01-06            36.009998
4     2011-01-07            36.040001
...          ...                  ...
2582  2021-04-08           109.820000
2583  2021-04-09           111.790001
2584  2021-04-12           111.389999
2585  2021-04-13           111.879997


In [53]:
# Save Pandas DataFrames into Pickle files with Date as Pandas Index Column

import os
import pandas as pd

os.makedirs('data', exist_ok=True)  # create the output directory once

dataframes = [["Nintendo", nintendo_pd_df], ["Sony", sony_pd_df], ["Microsoft", microsoft_pd_df]]
for name, pd_df in dataframes:
  pd_df.set_index('Date', drop=True, inplace=True)
  pd_df.to_pickle('data/' + name + '.pkl')
pd.read_pickle('data/Sony.pkl')

            Closing Value (USD)
Date
2011-01-03            36.380001
2011-01-04            36.160000
2011-01-05            36.209999
2011-01-06            36.009998
2011-01-07            36.040001
...                         ...
2021-04-08           109.820000
2021-04-09           111.790001
2021-04-12           111.389999
2021-04-13           111.879997


In [31]:
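import numpy as np

# Small sample of (date, closing value) points.
# np.array stores this mixed input as strings, so both columns are converted explicitly.
test_points = np.array([['1986-03-13', 0.097222],
 ['1986-03-14', 0.100694],
 ['1986-03-17', 0.102431]])
x = np.array([datetime.strptime(a, '%Y-%m-%d') for a in test_points[:, 0]])  # dates
y = test_points[:, 1].astype(float)  # closing values
x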

array([datetime.datetime(1986, 3, 13, 0, 0),
       datetime.datetime(1986, 3, 14, 0, 0),
       datetime.datetime(1986, 3, 17, 0, 0)], dtype=object)

In [32]:
import plotly.express as px
import plotly.graph_objects as go

# Scatter plot of the sample points
fig = px.scatter(x=x, y=y, title="Stock Closing Value vs. Time",
                 labels={'x': 'Time (Daily)', 'y': 'Closing Stock Value (USD)'})
fig.show()