<h1> Problem Statement: Stock Market Analysis and Prediction </h1>

Explanation: Our aim is to build software that analyses the historical stock data of selected companies,
using parameters that affect stock value. We feed these values into data mining algorithms,
which also lets us estimate the value a particular stock will have in the near future.
We will determine the month's high and low with the help of data mining algorithms.
In this project we use five years of stock data for our analysis and prediction.

In [1]:
# Import the dependencies
import quandl
import numpy as np 
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split

In [2]:
# Get the stock data
df = quandl.get("WIKI/MSFT")
# Take a look at the data
print(df.head())

             Open   High   Low  Close     Volume  Ex-Dividend  Split Ratio  \
Date                                                                         
1986-03-13  25.50  29.25  25.5  28.00  3582600.0          0.0          1.0   
1986-03-14  28.00  29.50  28.0  29.00  1070000.0          0.0          1.0   
1986-03-17  29.00  29.75  29.0  29.50   462400.0          0.0          1.0   
1986-03-18  29.50  29.75  28.5  28.75   235300.0          0.0          1.0   
1986-03-19  28.75  29.00  28.0  28.25   166300.0          0.0          1.0   

            Adj. Open  Adj. High  Adj. Low  Adj. Close   Adj. Volume  
Date                                                                  
1986-03-13   0.058941   0.067609  0.058941    0.064720  1.031789e+09  
1986-03-14   0.064720   0.068187  0.064720    0.067031  3.081600e+08  
1986-03-17   0.067031   0.068765  0.067031    0.068187  1.331712e+08  
1986-03-18   0.068187   0.068765  0.065876    0.066454  6.776640e+07  
1986-03-19   0.066454   0.0

In [3]:
# Get the Adjusted Close Price (copy it so that adding a column later does not write to a slice of the original frame)
df = df[['Adj. Close']].copy()
# Take a look at the new data 
print(df.head())

            Adj. Close
Date                  
1986-03-13    0.064720
1986-03-14    0.067031
1986-03-17    0.068187
1986-03-18    0.066454
1986-03-19    0.065298


In [4]:
# Number of rows (trading days) to predict into the future
forecast_out = 30  # n = 30 trading days
# Create another column (the target) shifted 'n' rows up,
# so each row holds the Adj. Close recorded 'n' trading days later
df['Prediction'] = df['Adj. Close'].shift(-forecast_out)
# Print the new data set
print(df.tail())

            Adj. Close  Prediction
Date                              
2018-03-21       92.48         NaN
2018-03-22       89.79         NaN
2018-03-23       87.18         NaN
2018-03-26       93.78         NaN
2018-03-27       89.47         NaN
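
The NaN values in the last rows come from the shift: the final 30 rows have no price 30 rows ahead of them. A minimal pure-Python sketch of the same idea follows. The helper name `make_target` and the toy prices are assumptions made for illustration; they are not part of pandas or of this notebook.

```python
# Sketch of what shifting a column 'n' rows up does.
# `make_target` is a hypothetical helper used only for illustration.

def make_target(prices, n):
    """Pair each price with the price n rows later (None when there is none)."""
    return [prices[i + n] if i + n < len(prices) else None
            for i in range(len(prices))]

prices = [10.0, 11.0, 12.0, 13.0, 14.0]  # toy data, not real quotes
target = make_target(prices, 2)

# The last n rows have no future value, which is why they are dropped
# from X and y before training.
for price, future in zip(prices, target):
    print(price, future)
```

With `n = 2` the last two targets are empty, just as the last 30 rows of `Prediction` are NaN when `forecast_out = 30`.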


In [5]:
### Create the independent data set (X)  #######
# Convert the dataframe to a numpy array (drop the target column by name)
X = np.array(df.drop(columns=['Prediction']))

#Remove the last '30' rows
X = X[:-forecast_out]
print(X)

[[6.47199796e-02]
 [6.70314075e-02]
 [6.81871214e-02]
 ...
 [8.48900000e+01]
 [8.81000000e+01]
 [8.91300000e+01]]


In [6]:
### Create the dependent data set (y)  #####
# Convert the dataframe to a numpy array 
y = np.array(df['Prediction'])
# Get all of the y values except the last '30' rows
y = y[:-forecast_out]
print(y)

[7.80106897e-02 7.85885467e-02 7.62771188e-02 ... 8.71800000e+01
 9.37800000e+01 8.94700000e+01]


In [7]:
# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
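
By default `train_test_split` shuffles the rows, so test rows are interleaved in time with training rows. For time-ordered data, one alternative worth comparing is to hold out the most recent rows instead. The sketch below shows that idea in plain Python. The function `chronological_split` and the toy data are assumptions for illustration, not something this notebook uses.

```python
# Sketch of a time-ordered split: train on the oldest rows, test on the newest.
# `chronological_split` is a hypothetical helper, not a scikit-learn function.

def chronological_split(X, y, test_size=0.2):
    """Hold out the last `test_size` fraction of rows without shuffling."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    n_test = int(round(len(X) * test_size))
    split = len(X) - n_test
    return X[:split], X[split:], y[:split], y[split:]

X_demo = [[float(i)] for i in range(10)]    # toy feature rows in time order
y_demo = [float(i + 3) for i in range(10)]  # toy targets
x_tr, x_te, y_tr, y_te = chronological_split(X_demo, y_demo, test_size=0.2)
print(len(x_tr), len(x_te))
```

scikit-learn's own `train_test_split` also accepts `shuffle=False`, which keeps the row order and takes the test rows from the end.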

In [8]:
# Create and train the Support Vector Machine (Regressor) 
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1) 
svr_rbf.fit(x_train, y_train)

SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.1,
    kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
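
To make the `kernel='rbf'` and `gamma=0.1` arguments more concrete, here is a small sketch of the RBF kernel formula, K(x, z) = exp(-gamma * ||x - z||^2). The function `rbf_kernel` below is written by hand for illustration and is an assumption of this example; the SVR computes the kernel internally.

```python
import math

# Hand-written RBF kernel for illustration only.
def rbf_kernel(x, z, gamma=0.1):
    """Return exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Similarity between single-feature rows (toy prices, not real quotes)
print(rbf_kernel([90.0], [90.0]))   # identical inputs
print(rbf_kernel([90.0], [91.0]))   # one unit apart
print(rbf_kernel([90.0], [100.0]))  # ten units apart
```

The similarity falls off quickly as the inputs move apart, so with `gamma=0.1` on unscaled prices each training point mainly influences predictions for inputs within a few units of it.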

In [9]:
# Testing Model: Score returns the coefficient of determination R^2 of the prediction. 
# The best possible score is 1.0
svm_confidence = svr_rbf.score(x_test, y_test)
print("svm confidence: ", svm_confidence)

svm confidence:  0.9877940261912018
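
The value printed as "confidence" is the coefficient of determination, R^2 = 1 - SS_res / SS_tot, and not a probability. The sketch below computes it by hand on toy numbers. The function `r_squared` is an illustrative helper and an assumption of this example, not the scikit-learn implementation.

```python
# Hand-computed R^2 on toy data, for illustration only.
def r_squared(y_true, y_pred):
    """Return 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

A model that always predicts the mean of the test targets scores 0, and a model that does worse than that scores below 0.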


In [10]:
# Create the Linear Regression model
lr = LinearRegression()
# Train the model
lr.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [11]:
# Testing Model: Score returns the coefficient of determination R^2 of the prediction. 
# The best possible score is 1.0
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)

lr confidence:  0.9860994783583743
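
With a single feature (Adj. Close), linear regression reduces to fitting a straight line, prediction = slope * price + intercept. The sketch below shows the ordinary least-squares formulas for that one-feature case. The helper `fit_line` and the toy data are assumptions for illustration and do not reproduce the numbers in this notebook.

```python
# One-feature ordinary least squares, written out by hand for illustration.
def fit_line(x, y):
    """Return (slope, intercept) minimising the squared error of y ~ slope*x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

x_demo = [1.0, 2.0, 3.0, 4.0]  # toy inputs
y_demo = [3.0, 5.0, 7.0, 9.0]  # toy targets lying exactly on y = 2x + 1
slope, intercept = fit_line(x_demo, y_demo)
print(slope, intercept)
```

After `lr.fit(...)`, the fitted equivalents are available as `lr.coef_` and `lr.intercept_`.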


In [12]:
# Set x_forecast equal to the last 30 rows of the original data set from Adj. Close column
x_forecast = np.array(df.drop(columns=['Prediction']))[-forecast_out:]
print(x_forecast)

[[89.83]
 [90.81]
 [92.66]
 [92.  ]
 [92.72]
 [91.49]
 [91.74]
 [94.06]
 [95.42]
 [94.2 ]
 [93.77]
 [92.85]
 [93.05]
 [93.64]
 [93.32]
 [93.86]
 [94.43]
 [96.54]
 [96.77]
 [94.41]
 [93.85]
 [94.18]
 [94.6 ]
 [92.89]
 [93.13]
 [92.48]
 [89.79]
 [87.18]
 [93.78]
 [89.47]]


In [13]:
# Print linear regression model predictions for the next '30' days
lr_prediction = lr.predict(x_forecast)
print(lr_prediction)
# Print support vector regressor model predictions for the next '30' days
svm_prediction = svr_rbf.predict(x_forecast)
print(svm_prediction)

[92.03460977 93.04002633 94.93800656 94.26088929 94.99956267 93.7376623
 93.99414612 96.37431592 97.76958787 96.51794686 96.07679469 95.13293426
 95.33812131 95.94342311 95.61512383 96.16912887 96.75391197 98.91863536
 99.15460047 96.73339326 96.15886952 96.49742815 96.92832096 95.17397167
 95.42019613 94.75333821 91.99357236 89.31588134 96.08705405 91.66527308]
[89.84202534 91.21869726 94.42226826 93.90527353 94.44076167 92.96480042
 93.4869727  94.68663376 92.90338295 94.70006793 94.6281859  94.47175036
 94.50412763 94.59929856 94.54080639 94.6483585  94.67447481 85.30052234
 83.00426071 94.67987413 94.64615216 94.6991728  94.59634363 94.47934432
 94.51477811 94.34507756 89.85457648 93.96727911 94.63044682 90.17536334]


<h1> Linear regression model </h1>

In [14]:
# Create and train the Linear Regression  Model
lr = LinearRegression()
# Train the model
lr.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [15]:
# Testing Model: Score returns the coefficient of determination R^2 of the prediction. 
# The best possible score is 1.0
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)

lr confidence:  0.9860994783583743


In [16]:
# Set x_forecast equal to the last 30 rows of the original data set from Adj. Close column
x_forecast = np.array(df.drop(columns=['Prediction']))[-forecast_out:]
print(x_forecast)

[[89.83]
 [90.81]
 [92.66]
 [92.  ]
 [92.72]
 [91.49]
 [91.74]
 [94.06]
 [95.42]
 [94.2 ]
 [93.77]
 [92.85]
 [93.05]
 [93.64]
 [93.32]
 [93.86]
 [94.43]
 [96.54]
 [96.77]
 [94.41]
 [93.85]
 [94.18]
 [94.6 ]
 [92.89]
 [93.13]
 [92.48]
 [89.79]
 [87.18]
 [93.78]
 [89.47]]


In [17]:
# Print linear regression model predictions for the next '30' days
lr_prediction = lr.predict(x_forecast)
print(lr_prediction)
# Print support vector regressor model predictions for the next '30' days
svm_prediction = svr_rbf.predict(x_forecast)
print(svm_prediction[:30])

[92.03460977 93.04002633 94.93800656 94.26088929 94.99956267 93.7376623
 93.99414612 96.37431592 97.76958787 96.51794686 96.07679469 95.13293426
 95.33812131 95.94342311 95.61512383 96.16912887 96.75391197 98.91863536
 99.15460047 96.73339326 96.15886952 96.49742815 96.92832096 95.17397167
 95.42019613 94.75333821 91.99357236 89.31588134 96.08705405 91.66527308]
[89.84202534 91.21869726 94.42226826 93.90527353 94.44076167 92.96480042
 93.4869727  94.68663376 92.90338295 94.70006793 94.6281859  94.47175036
 94.50412763 94.59929856 94.54080639 94.6483585  94.67447481 85.30052234
 83.00426071 94.67987413 94.64615216 94.6991728  94.59634363 94.47934432
 94.51477811 94.34507756 89.85457648 93.96727911 94.63044682 90.17536334]
