## Part 3: Implementing an Inner Join on the BSE Stock Dataset and Sentiment Dataset

#### NOTE: To skip directly to Part 4 (Model Building and Evaluation), go to the 3rd notebook.
All the stock data used in this project was retrieved directly from finance.yahoo.com.
Please make sure you clone the entire repository and update all file paths to match your setup to avoid errors.

### Walkthrough:
##### 1. Importing the libraries and reading the datasets (.csv):
Import the pandas and NumPy libraries. Scikit-learn is not required in this part of the series.
##### 2. Correcting the data type for the date values:
Both date columns are read from the CSV files as strings (dtype `object`), so they must be converted to datetime values before merging. Dates parsed without a time component become timezone-naive timestamps at 00:00:00, which keeps the join keys consistent across the two datasets. This can be done using the `to_datetime` function in pandas.
##### 3. Merging the two datasets using an inner join:
An inner join keeps only the rows whose date appears in both datasets. For more information on inner joins, please visit: https://www.w3schools.com/sql/sql_join_inner.asp
##### 4. Exporting:
The final section of this notebook exports the merged data, sorted by date in ascending order.



In [17]:
import os
import numpy as np
import pandas as pd


In [18]:
df1 = pd.read_csv("price_data.csv")
df2 = pd.read_csv("news_data_sentiments.csv")

In [19]:
df1 

Unnamed: 0,symbol,datetime,close,high,low,open,volume,change_in_price,down_days,up_days,RSI,low_14,high_14,k_percent,r_percent,MACD,MACD_EMA,Price_Rate_Of_Change,On Balance Volume,Prediction
0,BSESN,8/1/2003,3815.310059,3831.459961,3779.729980,3800.729980,26000,22.699952,0.000000,22.699952,75.909512,3534.060059,3835.750000,93.224852,-6.775148,20.986077,8.394436,0.068840,143800,1
1,BSESN,8/4/2003,3832.500000,3840.719971,3785.850098,3798.810059,19800,17.189941,0.000000,17.189941,78.098188,3534.060059,3840.719971,97.319516,-2.680484,24.858619,11.782642,0.078323,163600,1
2,BSESN,8/5/2003,3765.820068,3878.719971,3761.840088,3845.929932,26000,-66.679932,66.679932,0.000000,55.521299,3534.060059,3878.719971,67.243100,-32.756900,23.377022,14.154938,0.052525,137600,-1
3,BSESN,8/6/2003,3741.659912,3798.870117,3722.080078,3754.649902,29000,-24.160156,24.160156,0.000000,49.534639,3534.060059,3878.719971,60.233246,-39.766754,20.432082,15.433397,0.020062,108600,-1
4,BSESN,8/7/2003,3806.830078,3816.159912,3733.629883,3749.179932,24200,65.170166,0.000000,65.170166,62.215214,3534.060059,3878.719971,79.141789,-20.858211,22.100937,16.786404,0.021567,132800,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4241,BSESN,11/3/2020,40261.128910,40354.730470,39952.789060,39990.750000,21400,503.550781,0.000000,503.550781,55.461668,39241.871090,41048.050780,56.431695,-43.568305,251.256095,370.293241,-0.010961,63790400,1
4242,BSESN,11/4/2020,40616.140630,40693.511720,40076.468750,40171.710940,20900,355.011719,0.000000,355.011719,62.076880,39241.871090,40976.019530,79.247514,-20.752486,277.246758,351.683944,0.001421,63811300,1
4243,BSESN,11/5/2020,41340.160160,41370.910160,41030.171880,41112.121090,42600,724.019531,0.000000,724.019531,71.898697,39241.871090,41370.910160,98.555686,-1.444314,352.206850,351.788526,0.016091,63853900,1
4244,BSESN,11/6/2020,41893.058590,41954.929690,41383.289060,41438.761720,19000,552.898438,0.000000,552.898438,77.120077,39241.871090,41954.929690,97.719508,-2.280492,451.028372,371.636495,0.043531,63872900,1


In [20]:
df2

Unnamed: 0,datetime,headline,sentiment
0,2003-07-14,Now; get paid for leading an extravagant life,0.0000
1,2003-07-15,Demand for Ayodhya legislation a farce: Cong,-0.4939
2,2003-07-16,MD's fake diploma finally gets him,-0.4767
3,2003-07-17,ONGC plans to bid for Lanka petrol stations,0.0000
4,2003-07-18,Maya welcomes SC order on Taj project,0.4019
...,...,...,...
6192,2020-06-26,Moving Europe troops to counter China threat t...,-0.5267
6193,2020-06-27,Monsoon covers India 12 days in advance; faste...,0.0000
6194,2020-06-28,Flood situation in Assam worsens; 2 more die; ...,-0.8357
6195,2020-06-29,To de-escalate; India and China put faith in t...,0.4215


In [21]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4246 entries, 0 to 4245
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   symbol                4246 non-null   object 
 1   datetime              4246 non-null   object 
 2   close                 4246 non-null   float64
 3   high                  4246 non-null   float64
 4   low                   4246 non-null   float64
 5   open                  4246 non-null   float64
 6   volume                4246 non-null   int64  
 7   change_in_price       4246 non-null   float64
 8   down_days             4246 non-null   float64
 9   up_days               4246 non-null   float64
 10  RSI                   4246 non-null   float64
 11  low_14                4246 non-null   float64
 12  high_14               4246 non-null   float64
 13  k_percent             4246 non-null   float64
 14  r_percent             4246 non-null   float64
 15  MACD                 

In [22]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6197 entries, 0 to 6196
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   datetime   6197 non-null   object 
 1   headline   6197 non-null   object 
 2   sentiment  6197 non-null   float64
dtypes: float64(1), object(2)
memory usage: 145.4+ KB


In [24]:
df1["datetime"] = pd.to_datetime(df1["datetime"])

In [25]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4246 entries, 0 to 4245
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   symbol                4246 non-null   object        
 1   datetime              4246 non-null   datetime64[ns]
 2   close                 4246 non-null   float64       
 3   high                  4246 non-null   float64       
 4   low                   4246 non-null   float64       
 5   open                  4246 non-null   float64       
 6   volume                4246 non-null   int64         
 7   change_in_price       4246 non-null   float64       
 8   down_days             4246 non-null   float64       
 9   up_days               4246 non-null   float64       
 10  RSI                   4246 non-null   float64       
 11  low_14                4246 non-null   float64       
 12  high_14               4246 non-null   float64       
 13  k_percent         

In [26]:
df2["datetime"] = pd.to_datetime(df2["datetime"])

In [29]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6197 entries, 0 to 6196
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   datetime   6197 non-null   datetime64[ns]
 1   headline   6197 non-null   object        
 2   sentiment  6197 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 145.4+ KB
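
The date conversion above can also be written with explicit formats. The following is a minimal sketch on toy rows, not the notebook's own code: the two small frames and the variable `same_keys` are illustrative assumptions, and the format strings are inferred from the date styles shown in the outputs above (`8/1/2003` in the price data, `2003-07-14` in the news data).

```python
import pandas as pd

# Toy rows mimicking the two date styles seen in the data (values are illustrative).
prices = pd.DataFrame({"datetime": ["8/1/2003", "8/4/2003"]})
news = pd.DataFrame({"datetime": ["2003-08-01", "2003-08-04"]})

# An explicit format removes any month/day ambiguity in strings like "8/1/2003".
prices["datetime"] = pd.to_datetime(prices["datetime"], format="%m/%d/%Y")
news["datetime"] = pd.to_datetime(news["datetime"], format="%Y-%m-%d")

# normalize() sets the time component to midnight; it is a no-op for date-only strings.
prices["datetime"] = prices["datetime"].dt.normalize()
news["datetime"] = news["datetime"].dt.normalize()

# Both columns now hold identical datetime values and can serve as join keys.
same_keys = prices["datetime"].equals(news["datetime"])
```

Passing `format` explicitly is optional here, but it makes the parsing fail loudly if a row does not match the expected layout instead of silently guessing.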


In [30]:
df1

Unnamed: 0,symbol,datetime,close,high,low,open,volume,change_in_price,down_days,up_days,RSI,low_14,high_14,k_percent,r_percent,MACD,MACD_EMA,Price_Rate_Of_Change,On Balance Volume,Prediction
0,BSESN,2003-08-01,3815.310059,3831.459961,3779.729980,3800.729980,26000,22.699952,0.000000,22.699952,75.909512,3534.060059,3835.750000,93.224852,-6.775148,20.986077,8.394436,0.068840,143800,1
1,BSESN,2003-08-04,3832.500000,3840.719971,3785.850098,3798.810059,19800,17.189941,0.000000,17.189941,78.098188,3534.060059,3840.719971,97.319516,-2.680484,24.858619,11.782642,0.078323,163600,1
2,BSESN,2003-08-05,3765.820068,3878.719971,3761.840088,3845.929932,26000,-66.679932,66.679932,0.000000,55.521299,3534.060059,3878.719971,67.243100,-32.756900,23.377022,14.154938,0.052525,137600,-1
3,BSESN,2003-08-06,3741.659912,3798.870117,3722.080078,3754.649902,29000,-24.160156,24.160156,0.000000,49.534639,3534.060059,3878.719971,60.233246,-39.766754,20.432082,15.433397,0.020062,108600,-1
4,BSESN,2003-08-07,3806.830078,3816.159912,3733.629883,3749.179932,24200,65.170166,0.000000,65.170166,62.215214,3534.060059,3878.719971,79.141789,-20.858211,22.100937,16.786404,0.021567,132800,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4241,BSESN,2020-11-03,40261.128910,40354.730470,39952.789060,39990.750000,21400,503.550781,0.000000,503.550781,55.461668,39241.871090,41048.050780,56.431695,-43.568305,251.256095,370.293241,-0.010961,63790400,1
4242,BSESN,2020-11-04,40616.140630,40693.511720,40076.468750,40171.710940,20900,355.011719,0.000000,355.011719,62.076880,39241.871090,40976.019530,79.247514,-20.752486,277.246758,351.683944,0.001421,63811300,1
4243,BSESN,2020-11-05,41340.160160,41370.910160,41030.171880,41112.121090,42600,724.019531,0.000000,724.019531,71.898697,39241.871090,41370.910160,98.555686,-1.444314,352.206850,351.788526,0.016091,63853900,1
4244,BSESN,2020-11-06,41893.058590,41954.929690,41383.289060,41438.761720,19000,552.898438,0.000000,552.898438,77.120077,39241.871090,41954.929690,97.719508,-2.280492,451.028372,371.636495,0.043531,63872900,1


In [31]:
df2

Unnamed: 0,datetime,headline,sentiment
0,2003-07-14,Now; get paid for leading an extravagant life,0.0000
1,2003-07-15,Demand for Ayodhya legislation a farce: Cong,-0.4939
2,2003-07-16,MD's fake diploma finally gets him,-0.4767
3,2003-07-17,ONGC plans to bid for Lanka petrol stations,0.0000
4,2003-07-18,Maya welcomes SC order on Taj project,0.4019
...,...,...,...
6192,2020-06-26,Moving Europe troops to counter China threat t...,-0.5267
6193,2020-06-27,Monsoon covers India 12 days in advance; faste...,0.0000
6194,2020-06-28,Flood situation in Assam worsens; 2 more die; ...,-0.8357
6195,2020-06-29,To de-escalate; India and China put faith in t...,0.4215


In [34]:
df3 = df1.set_index("datetime")
df4 = df2.set_index("datetime")

In [35]:
df3

datetime,symbol,close,high,low,open,volume,change_in_price,down_days,up_days,RSI,low_14,high_14,k_percent,r_percent,MACD,MACD_EMA,Price_Rate_Of_Change,On Balance Volume,Prediction
2003-08-01,BSESN,3815.310059,3831.459961,3779.729980,3800.729980,26000,22.699952,0.000000,22.699952,75.909512,3534.060059,3835.750000,93.224852,-6.775148,20.986077,8.394436,0.068840,143800,1
2003-08-04,BSESN,3832.500000,3840.719971,3785.850098,3798.810059,19800,17.189941,0.000000,17.189941,78.098188,3534.060059,3840.719971,97.319516,-2.680484,24.858619,11.782642,0.078323,163600,1
2003-08-05,BSESN,3765.820068,3878.719971,3761.840088,3845.929932,26000,-66.679932,66.679932,0.000000,55.521299,3534.060059,3878.719971,67.243100,-32.756900,23.377022,14.154938,0.052525,137600,-1
2003-08-06,BSESN,3741.659912,3798.870117,3722.080078,3754.649902,29000,-24.160156,24.160156,0.000000,49.534639,3534.060059,3878.719971,60.233246,-39.766754,20.432082,15.433397,0.020062,108600,-1
2003-08-07,BSESN,3806.830078,3816.159912,3733.629883,3749.179932,24200,65.170166,0.000000,65.170166,62.215214,3534.060059,3878.719971,79.141789,-20.858211,22.100937,16.786404,0.021567,132800,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-11-03,BSESN,40261.128910,40354.730470,39952.789060,39990.750000,21400,503.550781,0.000000,503.550781,55.461668,39241.871090,41048.050780,56.431695,-43.568305,251.256095,370.293241,-0.010961,63790400,1
2020-11-04,BSESN,40616.140630,40693.511720,40076.468750,40171.710940,20900,355.011719,0.000000,355.011719,62.076880,39241.871090,40976.019530,79.247514,-20.752486,277.246758,351.683944,0.001421,63811300,1
2020-11-05,BSESN,41340.160160,41370.910160,41030.171880,41112.121090,42600,724.019531,0.000000,724.019531,71.898697,39241.871090,41370.910160,98.555686,-1.444314,352.206850,351.788526,0.016091,63853900,1
2020-11-06,BSESN,41893.058590,41954.929690,41383.289060,41438.761720,19000,552.898438,0.000000,552.898438,77.120077,39241.871090,41954.929690,97.719508,-2.280492,451.028372,371.636495,0.043531,63872900,1


In [36]:
df4

datetime,headline,sentiment
2003-07-14,Now; get paid for leading an extravagant life,0.0000
2003-07-15,Demand for Ayodhya legislation a farce: Cong,-0.4939
2003-07-16,MD's fake diploma finally gets him,-0.4767
2003-07-17,ONGC plans to bid for Lanka petrol stations,0.0000
2003-07-18,Maya welcomes SC order on Taj project,0.4019
...,...,...
2020-06-26,Moving Europe troops to counter China threat t...,-0.5267
2020-06-27,Monsoon covers India 12 days in advance; faste...,0.0000
2020-06-28,Flood situation in Assam worsens; 2 more die; ...,-0.8357
2020-06-29,To de-escalate; India and China put faith in t...,0.4215
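
With `datetime` set as the index on both frames, the inner join can alternatively be done on the index with `DataFrame.join`. This is a sketch on toy data, under the assumption that each date appears once per frame; the frames `prices` and `news` and their values are illustrative stand-ins, not rows from the actual files.

```python
import pandas as pd

# Toy stand-ins for the price and sentiment frames (values are illustrative).
prices = pd.DataFrame({
    "datetime": pd.to_datetime(["2003-08-01", "2003-08-04", "2003-08-05"]),
    "close": [3815.31, 3832.50, 3765.82],
})
news = pd.DataFrame({
    "datetime": pd.to_datetime(["2003-07-31", "2003-08-01", "2003-08-05"]),
    "sentiment": [0.10, -0.38, 0.00],
})

# Index-based inner join: keeps only dates present in both indexes.
joined = prices.set_index("datetime").join(news.set_index("datetime"), how="inner")

# Column-based inner join on the same data, for comparison.
merged = pd.merge(prices, news, on="datetime", how="inner")
```

Both approaches keep the same two dates here; the index-based version returns `datetime` as the index rather than as a regular column.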


In [38]:
df5 = pd.merge(df1, df2, on = "datetime", how = "inner") 
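
Because an inner join silently drops dates that appear in only one dataset, it can be useful to check what was lost. The sketch below uses toy data and two optional `pd.merge` parameters, `validate` and `indicator`; the frame names, the values, and the assumption that each date occurs once per frame are illustrative and not taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the price and sentiment frames (values are illustrative).
prices = pd.DataFrame({
    "datetime": pd.to_datetime(["2003-08-01", "2003-08-04", "2003-08-05"]),
    "close": [3815.31, 3832.50, 3765.82],
})
news = pd.DataFrame({
    "datetime": pd.to_datetime(["2003-07-31", "2003-08-01", "2003-08-05"]),
    "sentiment": [0.10, -0.38, 0.00],
})

# validate="one_to_one" raises a MergeError if a date is repeated in either frame,
# which would otherwise multiply rows in the result.
merged = pd.merge(prices, news, on="datetime", how="inner", validate="one_to_one")

# An outer merge with indicator=True adds a "_merge" column that records
# whether each date came from the left frame, the right frame, or both.
audit = pd.merge(prices, news, on="datetime", how="outer", indicator=True)
dropped = audit[audit["_merge"] != "both"]
```

In this toy case `dropped` holds one date that exists only in the price data and one that exists only in the news data; neither appears in `merged`.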

In [41]:
df_final = df5.drop(["headline"], axis = 1)

In [43]:
df_final

Unnamed: 0,symbol,datetime,close,high,low,open,volume,change_in_price,down_days,up_days,...,low_14,high_14,k_percent,r_percent,MACD,MACD_EMA,Price_Rate_Of_Change,On Balance Volume,Prediction,sentiment
0,BSESN,2003-08-01,3815.310059,3831.459961,3779.729980,3800.729980,26000,22.699952,0.000000,22.699952,...,3534.060059,3835.750000,93.224852,-6.775148,20.986077,8.394436,0.068840,143800,1,-0.3802
1,BSESN,2003-08-04,3832.500000,3840.719971,3785.850098,3798.810059,19800,17.189941,0.000000,17.189941,...,3534.060059,3840.719971,97.319516,-2.680484,24.858619,11.782642,0.078323,163600,1,0.0000
2,BSESN,2003-08-05,3765.820068,3878.719971,3761.840088,3845.929932,26000,-66.679932,66.679932,0.000000,...,3534.060059,3878.719971,67.243100,-32.756900,23.377022,14.154938,0.052525,137600,-1,0.0000
3,BSESN,2003-08-06,3741.659912,3798.870117,3722.080078,3754.649902,29000,-24.160156,24.160156,0.000000,...,3534.060059,3878.719971,60.233246,-39.766754,20.432082,15.433397,0.020062,108600,-1,-0.4767
4,BSESN,2003-08-07,3806.830078,3816.159912,3733.629883,3749.179932,24200,65.170166,0.000000,65.170166,...,3534.060059,3878.719971,79.141789,-20.858211,22.100937,16.786404,0.021567,132800,1,0.0000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4148,BSESN,2020-06-24,34868.980470,35706.550780,34794.929690,35679.738280,26600,-561.449219,561.449219,0.000000,...,32348.099610,35706.550780,75.060816,-24.939184,738.448126,650.151346,0.039674,63506800,-1,0.0000
4149,BSESN,2020-06-25,34842.101560,35081.609380,34499.781250,34525.390630,24600,-26.878906,26.878906,0.000000,...,32348.099610,35706.550780,74.260480,-25.739520,729.259048,665.972886,0.031415,63482200,-1,0.0000
4150,BSESN,2020-06-26,35171.269530,35254.878910,34910.339840,35144.781250,24800,329.167968,0.000000,329.167968,...,32348.099610,35706.550780,84.061663,-15.938337,740.007398,680.779789,0.058457,63507000,1,-0.5267
4151,BSESN,2020-06-29,34961.519530,35032.359380,34662.058590,34926.949220,18300,-209.750000,209.750000,0.000000,...,32348.099610,35706.550780,77.816225,-22.183775,723.263143,689.276459,0.040360,63488700,-1,0.4215


In [44]:
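# Sort by date, then export; note that this overwrites the input file price_data.csv.
df_final.sort_values("datetime").to_csv("price_data.csv")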
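
The export step can be checked with a round trip: write the sorted frame, read it back, and confirm the dates are still in ascending order. This is a minimal sketch on toy data; the output name `merged_data.csv`, the temporary directory, and the use of `index=False` are assumptions made for illustration, not the notebook's settings.

```python
import os
import tempfile

import pandas as pd

# Toy merged frame, deliberately out of date order (values are illustrative).
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2003-08-05", "2003-08-01", "2003-08-04"]),
    "close": [3765.82, 3815.31, 3832.50],
    "sentiment": [0.00, -0.38, 0.00],
})

# Hypothetical output path in a temporary directory, so no existing file is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "merged_data.csv")

# Sort by date, then write without the index to avoid an extra unnamed column on reload.
df.sort_values("datetime").to_csv(out_path, index=False)

# parse_dates restores the datetime dtype, which plain read_csv would load as strings.
reloaded = pd.read_csv(out_path, parse_dates=["datetime"])
```

Writing to a separate file name keeps the original `price_data.csv` intact, which makes the notebook safe to re-run from the top.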