# Final Project: Modeling WSJ Headlines to Predict the Direction of the S&P 500

__Background:__ This is a capstone project for Flatiron School. Each student was given the opportunity to pick the project that interested them most. I find the WSJ and financial markets very interesting, so I decided to combine those two passions of mine. The following notebook goes through the OSEMN framework to do a bit of EDA and NLP modeling on the WSJ homepage. I hope you enjoy!

__Approach:__  
The standard OSEMN framework was used for the entirety of the project. Below are the details of the different packages, websites, and tools used throughout the notebook.


__OSEMN__
1. Obtain: The data needed to be scraped from the WSJ homepage. To do this the Wayback Machine was used: a historical archive of almost any website, often with multiple snapshots of a site each day. Using Selenium, a web browser automation package, I scraped the homepage for the headlines and sub-headings from Jan 1, 2019 until the present day.  
    In addition to the WSJ headlines, stock market data was ultimately needed for the labels, or the Y variable. For this, I used Yahoo Finance's downloadable CSV file.  

  
2. Scrub: The data needed the traditional Natural Language Processing cleaning. This consisted of eliminating stop words and any other English words or symbols that appeared most frequently throughout the WSJ. Along with this, all text was lowercased so that the same word is always counted as the same token. All of the text data was put into a dataframe along with the corresponding label of a positive or negative day, represented as 1 or -1 respectively.


3. Explore: A little bit of exploratory data analysis was conducted before the modeling to gain insight into the extracted data. The term frequency distributions of positive and negative days were compared to each other. Different NLP methods were also compared: bigrams vs. stemming vs. lemmatization.


4. Model: For each of the different NLP strategies, I ran a very basic logistic regression model. Afterwards, I ran a neural network to see if I could get my accuracy up any further.


5. Interpret: I will leave the interpretation to the end of the notebook. If you wish to see just that, scroll all the way to the conclusion at the bottom of the notebook. Thank you!

In [1]:
from selenium import webdriver #Selenium web driver
from selenium.webdriver.common.keys import Keys #Selenium keys
from bs4 import BeautifulSoup #BeautifulSoup for web scraping
import pandas as pd #Pandas dataframes, etc.

## Retrieve Data

First we need to retrieve the financial market data from Yahoo Finance. Historical data in the form of a CSV is available through direct download at the following link: https://finance.yahoo.com/quote/%5EGSPC/history?p=%5EGSPC.

In [2]:
sp_500=pd.read_csv('S&P 500 Data.csv') #Obtaining the market data

In [3]:
sp_500.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,10/22/2018,2773.939941,2778.939941,2749.219971,2755.879883,2755.879883,3307140000
1,10/23/2018,2721.030029,2753.590088,2691.429932,2740.689941,2740.689941,4348580000
2,10/24/2018,2737.870117,2742.590088,2651.889893,2656.100098,2656.100098,4709310000
3,10/25/2018,2674.879883,2722.699951,2667.840088,2705.570068,2705.570068,4634770000
4,10/26/2018,2667.860107,2692.379883,2628.159912,2658.689941,2658.689941,4803150000


In [4]:
sp_500['Change']=sp_500['Close']-sp_500['Open'] #Creating a new column for the intraday change (close minus open) in the S&P 500

In [5]:
sp_500_final= sp_500[['Date','Change']] #creating new dataframe with just the date and change in S&P500

In [6]:
sp_500_final.head()

Unnamed: 0,Date,Change
0,10/22/2018,-18.060058
1,10/23/2018,19.659912
2,10/24/2018,-81.770019
3,10/25/2018,30.690185
4,10/26/2018,-9.170166


In [7]:
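Encode=pd.DataFrame(sp_500_final['Change'].where(sp_500_final['Change']<0, other=1)) #Step 1: keep the negative changes, set everything else (zero or positive) to 1

In [8]:
encoded_data=pd.DataFrame(Encode.where(Encode>0,-1)) #Step 2: keep the 1s, set the remaining negative changes to -1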

In [9]:
encoded_data.columns=['Coded']

In [10]:
sp_500=pd.concat([sp_500_final,encoded_data], axis=1)

In [11]:
sp_500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 3 columns):
Date      251 non-null object
Change    251 non-null float64
Coded     251 non-null float64
dtypes: float64(2), object(1)
memory usage: 6.0+ KB


In [12]:
sp_500.Date = pd.to_datetime(sp_500.Date) #changing the date column to a datetime object for easy merging later on

In [13]:
sp_500.head()

Unnamed: 0,Date,Change,Coded
0,2018-10-22,-18.060058,-1.0
1,2018-10-23,19.659912,1.0
2,2018-10-24,-81.770019,-1.0
3,2018-10-25,30.690185,1.0
4,2018-10-26,-9.170166,-1.0
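
The labelling steps above can also be written in one pass. The following is a minimal sketch, not the notebook's original code: it assumes a small hand-made `prices` frame in place of the downloaded CSV (the three rows are copied from the preview above) and uses `numpy.where`, a swapped-in alternative to the two chained `.where` calls.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Yahoo Finance file (values copied from the preview above)
prices = pd.DataFrame({
    "Date": ["10/22/2018", "10/23/2018", "10/24/2018"],
    "Open": [2773.939941, 2721.030029, 2737.870117],
    "Close": [2755.879883, 2740.689941, 2656.100098],
})

# Parse the dates up front so the later merge on Date compares like with like
prices["Date"] = pd.to_datetime(prices["Date"], format="%m/%d/%Y")

# Intraday change: close minus open
prices["Change"] = prices["Close"] - prices["Open"]

# Negative change -> -1, zero or positive change -> 1
prices["Coded"] = np.where(prices["Change"] < 0, -1.0, 1.0)

labels = prices[["Date", "Change", "Coded"]]
print(labels)
```

As in the original encoding, a day with exactly zero change would be labelled 1.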


Now to use Beautiful Soup to scrape the Wayback Machine. The Wayback Machine is a free service from the Internet Archive that stores historical web crawls of many websites. I am going to use the Wayback Machine to scrape the WSJ front page. From there I will scrape just the headlines and the sub-headings to perform Natural Language Processing.

In [14]:
url= 'https://web.archive.org/web/*/wsj.com' #Wayback Machine calendar page listing the WSJ snapshots for each day
driver = webdriver.Chrome(executable_path=r'C:\Users\GBLS\Downloads\chromedriver_win32\chromedriver') #driver
driver.get(url) #Automation of going to website

In [15]:
my_html = driver.page_source #Getting the HTML
soup = BeautifulSoup(my_html, 'html.parser')

In [16]:
print(soup.prettify())

<html lang="en">
 <head>
  <title>
   Wayback Machine
  </title>
  <script src="//archive.org/includes/jquery-1.10.2.min.js?v1.10.2" type="text/javascript">
  </script>
  <script src="//archive.org/includes/analytics.js?v=91331656" type="text/javascript">
  </script>
  <script src="//archive.org/includes/build/npm/jquery-ui.min.js?v1.12.1" type="text/javascript">
  </script>
  <script src="//archive.org/includes/bootstrap.min.js?v3.0.0" type="text/javascript">
  </script>
  <script src="//archive.org/components/npm/clipboard/dist/clipboard.js?v=91331656" type="text/javascript">
  </script>
  <script src="//archive.org/includes/build/npm/react/umd/react.production.min.js?v16.7.0" type="text/javascript">
  </script>
  <script src="//archive.org/includes/build/npm/react-dom/umd/react-dom.production.min.js?v16.7.0" type="text/javascript">
  </script>
  <script src="//archive.org/includes/build/js/archive.min.js?v=91331656" type="text/javascript">
  </script>
  <script src="//archive.org/in

The output above is the full HTML of the webpage. What we need are the links to the different dates. To get them, Beautiful Soup's find_all method was used to search for the links.

In [17]:
links=soup.find_all('a')

In [18]:
links #looking at all links

[<a data-action="ia-banner-close" href="#"></a>,
 <a class="navia-link home" href="/" target="_top" title="Home">
 <span class="iconochive-logo"></span>
 <span>Home</span>
 </a>,
 <a class="navia-link web" data-top-kind="web" href="https://archive.org/web/" target="_top" title="Web"><span aria-hidden="true" class="iconochive-web"></span><span>web</span></a>,
 <a class="navia-link texts" data-top-kind="texts" href="https://archive.org/details/texts" target="_top" title="Texts"><span aria-hidden="true" class="iconochive-texts"></span><span>books</span></a>,
 <a class="navia-link movies" data-top-kind="movies" href="https://archive.org/details/movies" target="_top" title="Video"><span aria-hidden="true" class="iconochive-movies"></span><span>video</span></a>,
 <a class="navia-link audio" data-top-kind="audio" href="https://archive.org/details/audio" target="_top" title="Audio"><span aria-hidden="true" class="iconochive-audio"></span><span>audio</span></a>,
 <a class="navia-link software" 

Now that all the links have been printed, we can see that the only links needed are the direct links to the WSJ landing page for each specific day. The following for loop narrows the list to links that contain 'wsj'; the few non-daily links that remain at the top of the list are skipped later, when the links are split by month.

In [19]:
links_wsj= []
for x in links:
    href = x.get('href', '') #An anchor without an href attribute is treated as an empty string
    if 'wsj' in href:
        links_wsj.append(href)
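
The `'wsj'` substring test also keeps a few navigation links such as `/web/changes/wsj.com`. As a hedged alternative, a regular expression can keep only the daily snapshot links. This sketch assumes the `/web/YYYYMMDD/wsj.com` pattern visible in the printed list; `hrefs` is a hypothetical sample standing in for the scraped anchors.

```python
import re

# Hypothetical sample of href values, copied from the printed list of links
hrefs = [
    "/web/collections/*/wsj.com",
    "/web/changes/wsj.com",
    "/web/19961203131021/wsj.com",
    "/web/20190101/wsj.com",
    "/web/20190102/wsj.com",
]

# Exactly eight digits (YYYYMMDD) between /web/ and /wsj.com marks a daily link;
# the 14-digit timestamps of individual captures do not match
DAILY_LINK = re.compile(r"^/web/\d{8}/wsj\.com$")

daily_links = [h for h in hrefs if DAILY_LINK.match(h)]
print(daily_links)
```

With a filter like this the list would start at January 1, so the month slices used later in the notebook (which skip the first six entries) would need to shift accordingly.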

In [20]:
links_wsj

['/web/collections/*/wsj.com',
 '/web/changes/wsj.com',
 '/details/wsj.com',
 '/web/sitemap/wsj.com',
 '/web/19961203131021/wsj.com',
 '/web/20191114154136/wsj.com',
 '/web/20190101/wsj.com',
 '/web/20190102/wsj.com',
 '/web/20190103/wsj.com',
 '/web/20190104/wsj.com',
 '/web/20190105/wsj.com',
 '/web/20190106/wsj.com',
 '/web/20190107/wsj.com',
 '/web/20190108/wsj.com',
 '/web/20190109/wsj.com',
 '/web/20190110/wsj.com',
 '/web/20190111/wsj.com',
 '/web/20190112/wsj.com',
 '/web/20190113/wsj.com',
 '/web/20190114/wsj.com',
 '/web/20190115/wsj.com',
 '/web/20190116/wsj.com',
 '/web/20190117/wsj.com',
 '/web/20190118/wsj.com',
 '/web/20190119/wsj.com',
 '/web/20190120/wsj.com',
 '/web/20190121/wsj.com',
 '/web/20190122/wsj.com',
 '/web/20190123/wsj.com',
 '/web/20190124/wsj.com',
 '/web/20190125/wsj.com',
 '/web/20190126/wsj.com',
 '/web/20190127/wsj.com',
 '/web/20190128/wsj.com',
 '/web/20190129/wsj.com',
 '/web/20190130/wsj.com',
 '/web/20190131/wsj.com',
 '/web/20190201/wsj.com',
 '

The links needed to be made whole again so they can easily be fed into Selenium. Thus, I created another for loop to add on the necessary text to create a valid link.

In [21]:
full_link= []
for x in links_wsj:
    full_link.append('https://web.archive.org/'+ x)
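
Because each relative link already starts with `/`, the concatenation yields a double slash after the domain, and the scraping cells further down read the date by position with `[29:37]`, which depends on that exact layout. A position-independent sketch, offered as an alternative rather than what the notebook does, joins the URL with `urllib.parse.urljoin` and pulls the date out with a regular expression. The names `BASE`, `build_link` and `snapshot_date` are hypothetical.

```python
import re
from urllib.parse import urljoin

BASE = "https://web.archive.org/"  # same domain the notebook prepends

def build_link(relative):
    """Join the archive domain and a relative snapshot path without doubling the slash."""
    return urljoin(BASE, relative)

def snapshot_date(link):
    """Return the YYYYMMDD part of a snapshot link, wherever it sits in the string."""
    match = re.search(r"/web/(\d{8})", link)
    return match.group(1) if match else None

link = build_link("/web/20190101/wsj.com")
print(link)
print(snapshot_date(link))

# The regular expression also works on the notebook's double-slash form
print(snapshot_date("https://web.archive.org//web/20190101/wsj.com"))
```

If the single-slash links were adopted, every `[29:37]` slice in the scraping cells would have to be replaced as well, since the date would sit one character earlier.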

In [22]:
full_link

['https://web.archive.org//web/collections/*/wsj.com',
 'https://web.archive.org//web/changes/wsj.com',
 'https://web.archive.org//details/wsj.com',
 'https://web.archive.org//web/sitemap/wsj.com',
 'https://web.archive.org//web/19961203131021/wsj.com',
 'https://web.archive.org//web/20191114154136/wsj.com',
 'https://web.archive.org//web/20190101/wsj.com',
 'https://web.archive.org//web/20190102/wsj.com',
 'https://web.archive.org//web/20190103/wsj.com',
 'https://web.archive.org//web/20190104/wsj.com',
 'https://web.archive.org//web/20190105/wsj.com',
 'https://web.archive.org//web/20190106/wsj.com',
 'https://web.archive.org//web/20190107/wsj.com',
 'https://web.archive.org//web/20190108/wsj.com',
 'https://web.archive.org//web/20190109/wsj.com',
 'https://web.archive.org//web/20190110/wsj.com',
 'https://web.archive.org//web/20190111/wsj.com',
 'https://web.archive.org//web/20190112/wsj.com',
 'https://web.archive.org//web/20190113/wsj.com',
 'https://web.archive.org//web/20190114/

In [23]:
jan=full_link[6:37] #Splitting up the links into each month to make for easy webscraping by month
feb=full_link[37:65]
mar=full_link[65:96]
apr=full_link[96:126]
may=full_link[126:157]
jun=full_link[157:187]
jul=full_link[187:218]
aug=full_link[218:249]
sept=full_link[249:279]
octb=full_link[279:310]
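
The index ranges above were worked out by hand from the number of days in each month. A sketch of a grouping that reads the month from the link itself, under the assumption that every daily link contains `/web/YYYYMMDD/`, is shown below. `sample_links` is a hypothetical stand-in for `full_link`.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for full_link, including one non-daily link
sample_links = [
    "https://web.archive.org//web/changes/wsj.com",
    "https://web.archive.org//web/20190131/wsj.com",
    "https://web.archive.org//web/20190201/wsj.com",
    "https://web.archive.org//web/20190202/wsj.com",
]

links_by_month = defaultdict(list)
for link in sample_links:
    # Year and month are captured; the day only has to be present
    match = re.search(r"/web/(\d{4})(\d{2})\d{2}/", link)
    if match:  # links without an eight-digit date are skipped
        year, month = match.groups()
        links_by_month[f"{year}-{month}"].append(link)

for month, month_links in links_by_month.items():
    print(month, len(month_links))
```

Grouping this way does not depend on how many non-daily links sit at the top of the list.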

The following code was used to obtain all of the text data for each month from the links I created above. It is divided by month, and the for loop in each cell obtains the headlines of each page along with the sub-heading text. From there I put the date and text into a dataframe and merged in the change and code (negative or positive day) from the market data. After the for loop, I check to ensure there are no NaN values.

NaN values appear when the markets are closed, for example on weekends, because those dates have no matching row in the market data. For these instances I used ffill(), the forward fill method for dataframes, which carries the last trading day's values forward. For months that open on a day the markets were closed (January, June and September) there is no earlier row to fill from, so fillna(1) was used. Once I had a complete, clean dataframe, I saved each one to a CSV file.
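
To make the effect of the left merge and the forward fill concrete, here is a small sketch. The headline texts are placeholders and the single market row reuses the May 31 value that appears later in the notebook; the frame names `headlines` and `market` are hypothetical.

```python
import pandas as pd

# Placeholder headlines for a Friday, Saturday and Sunday
headlines = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-31", "2019-06-01", "2019-06-02"]),
    "Text": ["friday headline", "saturday headline", "sunday headline"],
})

# The market data only has a row for the trading day
market = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-31"]),
    "Change": [14.089843],
    "Coded": [1.0],
})

# Left merge keeps every headline; weekend rows get NaN for Change and Coded
merged = headlines.merge(market, how="left", on="Date")
print(merged["Coded"].isna().sum())

# Forward fill carries Friday's values into the weekend rows
filled = merged.ffill()
print(filled["Coded"].tolist())

# If the frame starts on the weekend, there is nothing to carry forward
weekend_only = merged.iloc[1:].ffill()
print(weekend_only["Coded"].isna().sum())
```

The last case is why a month that begins on a non-trading day still shows NaN after `ffill()`.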

In [29]:
date=[] #List created for dates
headline_blurb_data= [] #Headline and subheading list
for link in jan: 
    driver.get(str(link)) #Selenium getting the links from the provided month 
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Obtaining the headings from each link's html
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data}) #Creating a dataframe 
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_jan=df_wsj.merge(sp_500,how='left') #Merging into one dataframe
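
The same loop is repeated for every month below, so it could be wrapped in a function. The following is a sketch under stated assumptions, not the code that produced the results: `SELECTORS`, `extract_rows`, `scrape_month` and the `fetch_html` callable are hypothetical names, the date is taken with a regular expression instead of the `[29:37]` slice, and `get_text` is used for both headlines and summaries (the notebook uses `contents[0]` for headlines). The HTML string is made up for the demonstration.

```python
import re
from bs4 import BeautifulSoup

# (tag, class) pairs searched by the January cell
SELECTORS = [
    ("a", "wsj-headline-link"),
    ("p", "wsj-summary"),
]

def extract_rows(html, link, selectors=SELECTORS):
    """Return (date, text) pairs for one archived page."""
    match = re.search(r"/web/(\d{8})", link)
    if match is None:  # not a dated snapshot link
        return []
    date = match.group(1)
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag, css_class in selectors:
        for node in soup.find_all(tag, class_=css_class):
            rows.append((date, node.get_text(strip=True)))
    return rows

def scrape_month(links, fetch_html, selectors=SELECTORS):
    """Collect (date, text) pairs for every link; fetch_html(link) must return HTML."""
    rows = []
    for link in links:
        rows.extend(extract_rows(fetch_html(link), link, selectors))
    return rows

# Made-up page so the sketch runs without a browser
sample_html = (
    '<a class="wsj-headline-link" href="#">Headline A</a>'
    '<p class="wsj-summary">Summary A</p>'
)
rows = scrape_month(
    ["https://web.archive.org//web/20190101/wsj.com"],
    lambda link: sample_html,
)
print(rows)
```

With Selenium, `fetch_html` could be a small function that calls `driver.get(link)` and returns `driver.page_source`; the resulting pairs can then be passed to `pd.DataFrame(rows, columns=['Date', 'Text'])`.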

In [31]:
df_jan=df_jan.fillna(1) #Filling every missing value with 1, so non-trading days are labelled as positive days

In [32]:
df_jan.isna().sum()

Date      0
Text      0
Change    0
Coded     0
dtype: int64

In [33]:
df_jan.to_csv('January')

In [34]:
date=[] #List created for dates
headline_blurb_data= [] #Headline and subheading list
for link in feb: 
    driver.get(str(link)) #Selenium getting the links from the provided month
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Obtaining the headings from each link's html
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data}) #Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_feb=df_wsj.merge(sp_500,how='left') #Merging into one dataframe

In [35]:
df_feb.head()

Unnamed: 0,Date,Text,Change,Coded
0,2019-02-01,1MDB Scandal Could Hit Pay for Some Top Goldma...,4.209961,1.0
1,2019-02-01,Job Market Powers Past Headwinds as Payrolls E...,4.209961,1.0
2,2019-02-01,Virginia Governor Apologizes for Racist Medica...,4.209961,1.0
3,2019-02-01,Big Oil Companies Finished 2018 Strong Despite...,4.209961,1.0
4,2019-02-01,Inspectors of Collapsed Brazilian Dam Had Clos...,4.209961,1.0


In [36]:
df_feb=df_feb.ffill(axis=0)
df_feb.to_csv('Febuary')

In [37]:
date=[]#List created for dates  
headline_blurb_data= []#Headline and subheading list 
for link in mar: 
    driver.get(str(link))#Selenium getting the links from the provided month
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}):#Obtaining the headings from each link's html
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data})#Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date)#Changing the date to a date time object
df_mar=df_wsj.merge(sp_500,how='left') #Merging into one dataframe

In [38]:
df_mar=df_mar.ffill(axis=0)

In [39]:
df_mar.isna().sum()

Date      0
Text      0
Change    0
Coded     0
dtype: int64

In [40]:
df_mar.to_csv('March')

In [41]:
date=[] #List created for dates
headline_blurb_data= [] #Headline and subheading list
for link in apr:
    driver.get(str(link)) #Selenium getting the links from the provided month
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Obtaining the headings from each link's html
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data})  #Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_apr=df_wsj.merge(sp_500,how='left') #Merging into one dataframe 
df_apr=df_apr.ffill(axis=0) #Fillfoward method

In [42]:
df_apr.isna().sum()

Date      0
Text      0
Change    0
Coded     0
dtype: int64

In [43]:
df_apr.to_csv('April')

In [44]:
df_apr.head()

Unnamed: 0,Date,Text,Change,Coded
0,2019-04-01,"U.S. and Chinese Manufacturing Stabilize, Whil...",18.560058,1.0
1,2019-04-01,Stocks Rise as China Data Ease Concerns Over G...,18.560058,1.0
2,2019-04-01,Slack Picks NYSE for Direct Listing,18.560058,1.0
3,2019-04-01,U.K. Lawmakers Closer to Brexit Alternatives b...,18.560058,1.0
4,2019-04-01,Amazon Cuts More Prices at Whole Foods,18.560058,1.0


In [45]:
date=[] #List created for dates
headline_blurb_data= [] #Headline and subheading list
for link in may: 
    driver.get(str(link)) #Selenium getting the links from the provided month
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Obtaining the headings from each link's html
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data}) #Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_may=df_wsj.merge(sp_500,how='left') #Merging into one dataframe
df_may=df_may.ffill(axis=0) #Fillfoward method

In [46]:
print(df_may.isna().sum())
df_may.head()

Date      0
Text      0
Change    0
Coded     0
dtype: int64


Unnamed: 0,Date,Text,Change,Coded
0,2019-05-01,"Fed Leaves Rates Unchanged, Notes Subdued Infl...",-28.600098,-1.0
1,2019-05-01,Qualcomm to Get at Least $4.5 Billion in Apple...,-28.600098,-1.0
2,2019-05-01,U.K.’s May Fires Defense Secretary Over Huawei...,-28.600098,-1.0
3,2019-05-01,Family Paid $6.5 Million to Get Their Daughter...,-28.600098,-1.0
4,2019-05-01,"Barr, Democrats Clash Over Mueller Report",-28.600098,-1.0


In [47]:
df_may.to_csv('May')

In [130]:
df_may.tail()

Unnamed: 0,Date,Text,Change,Coded
3604,2019-05-31,U.S. and South Korean authorities are looking ...,14.089843,1.0
3605,2019-05-31,"The Trump administration has delayed new, toug...",14.089843,1.0
3606,2019-05-31,A Syrian detention center run by U.S.-backed K...,14.089843,1.0
3607,2019-05-31,Silent retreats offer a respite from our clamo...,14.089843,1.0
3608,2019-05-31,Ask Encore columnist Glenn Ruffenach also answ...,14.089843,1.0


In [48]:
date=[]#List created for dates 
headline_blurb_data= []#Headline and subheading list 
for link in jun:
    driver.get(str(link))#Selenium getting the links from the provided month 
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Obtaining the headings from each link's html
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}):#Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data}) #Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_jun=df_wsj.merge(sp_500,how='left') #Merging into one dataframe
df_jun=df_jun.ffill(axis=0)#Fillfoward method



In [49]:
df_jun.head()

Unnamed: 0,Date,Text,Change,Coded
0,2019-06-01,Justice Department Prepares Antitrust Probe of...,,
1,2019-06-01,Tariffs on Mexican Imports Would Hit More Than...,,
2,2019-06-01,"Factories Stall on Strong Dollar, Trade Tensions",,
3,2019-06-01,FedEx Caught in U.S.-China Tensions,,
4,2019-06-01,Virginia Beach Grieves Deaths of 12 Shooting V...,,


In [50]:
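df_jun=df_jun.fillna(1) #Rows before the first trading day of the month cannot be forward-filled, so the remaining NaN are set to 1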

In [51]:
df_jun.isna().sum()

Date      0
Text      0
Change    0
Coded     0
dtype: int64

In [52]:
df_jun.to_csv('June')

In [53]:
df_jun.head()

Unnamed: 0,Date,Text,Change,Coded
0,2019-06-01,Justice Department Prepares Antitrust Probe of...,1.0,1.0
1,2019-06-01,Tariffs on Mexican Imports Would Hit More Than...,1.0,1.0
2,2019-06-01,"Factories Stall on Strong Dollar, Trade Tensions",1.0,1.0
3,2019-06-01,FedEx Caught in U.S.-China Tensions,1.0,1.0
4,2019-06-01,Virginia Beach Grieves Deaths of 12 Shooting V...,1.0,1.0


In [54]:
date=[] #List created for dates
headline_blurb_data= [] #Headline and subheading list
for link in jul: 
    driver.get(str(link)) #Selenium getting the links from the provided month
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Obtaining the headings from each link's html
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('p', {'class': 'WSJTheme--summary--12br5Svc'}): #Sub-headings under a second class name, searched from July onward
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('h3'): #Headings stored in h3 tags
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data}) #Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_jul=df_wsj.merge(sp_500,how='left') #Merging into one dataframe
df_jul=df_jul.ffill(axis=0) #Fill forward method

In [55]:
df_jul.isna().sum()

Date      0
Text      0
Change    0
Coded     0
dtype: int64

In [56]:
df_jul.to_csv('July')

In [57]:
df_jul.tail()

Unnamed: 0,Date,Text,Change,Coded
4363,2019-07-31,Opinion: The 99% Get a Bigger Raise,-35.840088,-1.0
4364,2019-07-31,Anxiety Looks Different in Men,-35.840088,-1.0
4365,2019-07-31,"A Generation of Siblings, Raised to Be Entrepr...",-35.840088,-1.0
4366,2019-07-31,They Were Huge Franchises. Why Did They Collapse?,-35.840088,-1.0
4367,2019-07-31,One Bookstore Finds the Secret to Succeeding i...,-35.840088,-1.0


In [58]:
date=[] #List created for dates
headline_blurb_data= [] #Headline and subheading list
for link in aug: 
    driver.get(str(link)) #Selenium getting the links from the provided month
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('p', {'class': 'WSJTheme--summary--12br5Svc'}): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('h3'): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data}) #Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_aug=df_wsj.merge(sp_500,how='left') #Merging into one dataframe
df_aug=df_aug.ffill(axis=0) #Fillfoward method

In [59]:
df_aug=df_aug.ffill(axis=0)

In [60]:
df_aug.to_csv('August')

In [61]:
date=[]
headline_blurb_data= []
for link in sept: #Same scraping loop as the previous months, applied to the September links
    driver.get(str(link))
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}):
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('p', {'class': 'WSJTheme--summary--12br5Svc'}):
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('h3'):
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data})
df_wsj.Date= pd.to_datetime(df_wsj.Date)
df_sept=df_wsj.merge(sp_500,how='left') 
df_sept=df_sept.ffill(axis=0)

In [62]:
df_aug.tail()

Unnamed: 0,Date,Text,Change,Coded
3334,2019-08-31,Opinion: A Feminist Capitalist Professor Under...,-10.630127,-1.0
3335,2019-08-31,Rob Gronkowski Lost Weight and Changed His Life,-10.630127,-1.0
3336,2019-08-31,An Obituary Writer Is Writing His Own Obituary,-10.630127,-1.0
3337,2019-08-31,Banks Monitor Older Customers for Cognitive De...,-10.630127,-1.0
3338,2019-08-31,Smart Financial Strategies Between Retirement ...,-10.630127,-1.0


In [63]:
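df_sept=df_sept.fillna(1) #Rows before the first trading day of the month cannot be forward-filled, so the remaining NaN are set to 1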

In [64]:
df_sept.to_csv('September')

In [65]:
date=[] #List created for dates
headline_blurb_data= [] #Headline and subheading list
for link in octb: 
    driver.get(str(link)) #Selenium getting the links from the provided month
    my_html= driver.page_source
    soup= BeautifulSoup(my_html, 'html.parser')
    for div in soup.findAll('a', {'class': 'wsj-headline-link'}): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.contents[0])
        date.append(str(link)[29:37])
    for div in soup.findAll('p', {'class':'wsj-summary'}): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('p', {'class': 'WSJTheme--summary--12br5Svc'}): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
    for div in soup.find_all('h3'): #Getting all the headings/sub-headings
        headline_blurb_data.append(div.text)
        date.append(str(link)[29:37])
df_wsj=pd.DataFrame({'Date':date,'Text':headline_blurb_data}) #Creating a dataframe
df_wsj.Date= pd.to_datetime(df_wsj.Date) #Changing the date to a date time object
df_octb=df_wsj.merge(sp_500,how='left') #Merging into one dataframe
df_octb=df_octb.ffill(axis=0) #Fillfoward method

In [66]:
df_octb.isna().sum()

Date      0
Text      0
Change    0
Coded     0
dtype: int64

October was the last dataframe created. From here, all of the monthly dataframes were read back in and concatenated into one final dataframe that can be used for EDA and modeling.

In [67]:
df_octb.to_csv('October')

In [234]:
df_jan= pd.read_csv('January')

In [236]:
df_feb=pd.read_csv('Febuary')

In [237]:
df_mar=pd.read_csv('March')

In [238]:
df_apr=pd.read_csv('April')

In [239]:
df_may=pd.read_csv('May')

In [240]:
df_jun=pd.read_csv('June')

In [241]:
df_jul=pd.read_csv('July')

In [242]:
df_aug=pd.read_csv('August') #File name matches the one written by df_aug.to_csv('August')

In [243]:
df_sept=pd.read_csv('September')

In [244]:
df_octb=pd.read_csv('October')

In [68]:
df_jan_oct_final=pd.concat([df_jan,df_feb,df_mar,df_apr,df_may,df_jun,df_jul,df_aug,df_sept,df_octb])
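
One detail to keep in mind when reading the monthly files back: `to_csv` writes the dataframe index as an unnamed first column by default, and `read_csv` then returns it as a column called `Unnamed: 0` unless `index_col=0` is passed. The sketch below round-trips two tiny placeholder frames through a temporary folder; the file names and contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Two placeholder monthly frames
frames = {
    "January": pd.DataFrame(
        {"Date": ["2019-01-02"], "Text": ["a"], "Change": [1.5], "Coded": [1.0]}
    ),
    "February": pd.DataFrame(
        {"Date": ["2019-02-01"], "Text": ["b"], "Change": [-2.0], "Coded": [-1.0]}
    ),
}

with tempfile.TemporaryDirectory() as folder:
    for name, frame in frames.items():
        # The index is written as the first, unnamed column
        frame.to_csv(os.path.join(folder, name))

    # index_col=0 turns that first column back into the index
    monthly = [
        pd.read_csv(os.path.join(folder, name), index_col=0, parse_dates=["Date"])
        for name in frames
    ]

# ignore_index=True renumbers the rows, like reset_index(drop=True) after concat
combined = pd.concat(monthly, ignore_index=True)
print(combined.columns.tolist())
print(combined.index.tolist())
```

Passing `index=False` to `to_csv`, as the final export cell of this notebook does, avoids the extra column in the first place.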

In [69]:
df_jan_oct_final=df_jan_oct_final.reset_index(drop=True)

In [70]:
df_jan_oct_final

Unnamed: 0,Date,Text,Change,Coded
0,2019-01-01,Trump Invites Top Lawmakers in Effort To End S...,1.000000,1.0
1,2019-01-01,Kim Jong Un Extends Peace Overture to U.S.,1.000000,1.0
2,2019-01-01,"American Detained in Russia Isn’t a Spy, Famil...",1.000000,1.0
3,2019-01-01,The Money Managers to Watch in 2019,1.000000,1.0
4,2019-01-01,Investors Try Not to Panic Over Stock Volatility,1.000000,1.0
5,2019-01-01,Rewards Credit Cards Gained a Fanatic Followin...,1.000000,1.0
6,2019-01-01,Chesapeake Energy Bet on Oil. Then Crude Price...,1.000000,1.0
7,2019-01-01,Brazil’s Idea to Fix Rampant Gun Violence: Mor...,1.000000,1.0
8,2019-01-01,"Conservative Takes Reins in Brazil, Vows to Re...",1.000000,1.0
9,2019-01-01,"To Woo Millennials, Atlanta Weighs Parks Over ...",1.000000,1.0


In [71]:
df_jan_oct_final.to_csv('January-October',index=False)