# Project and Presentation - Text Analytics
## Aaron Bromeland

### Purpose:
- To obtain data from SEC EDGAR search and the stock market and assess the impact of SEC filings on the stock valuation of Deere & Co. The plan is to develop a model with Deere, then apply it to other companies within the industry, and then potentially to companies outside of the Agricultural/Construction and Forestry industry. 

- Background:
     - Deere & Company is an agriculture, construction, and forestry equipment manufacturer headquartered in Moline, Illinois. Deere & Company started in 1837 after John Deere invented the self-scouring plow in 1836, which allowed Midwest farmers to plow the region's thick black soil without clogging. The company started in the town of Grand Detour, Illinois, and moved to Moline, Illinois, in 1848. It began with a technological advance, a plow made from steel instead of cast iron, and has since led the industry in agricultural technology, up to today's introduction of the first commercially available fully autonomous tractor. The company was formally incorporated as Deere & Company in 1868. John Deere served as founder and president of the company from 1837 to 1886. Descendants of Deere then served as president and chairman from 1886 to 1982 (Charles Deere, William Butterworth, Charles Deere Wiman, William Hewitt). From 1982 to today, John Deere has been run by leaders from outside the Deere family (Robert Hanson, Hans Becherer, Robert Lane, Samuel Allen, and John May).

- Project Goals
    - The project aims to identify trends in SEC filings that correlate with positive or negative stock returns. Specifically, we are looking for long-term trends in the data to determine whether SEC filings can help point to long-term growth opportunities in Deere stock. Additionally, we would like to extend this analysis beyond Deere, first to other companies in the Agricultural and Construction and Forestry industries and, if the results are successful, to filings from companies in other industries as well. 


- Research Questions
    1.	Which standard forms filed with the SEC have the highest sentiment scores?
    2.	Which standard forms filed with the SEC have the lowest sentiment scores? 
    3.	Which standard forms filed with the SEC have the highest variance in sentiment scores?
    4.	Which filings coincide with the greatest change in stock price for the company? 
    5.	Do SEC filings have a material impact on the stock price of Deere & Co?
    6.	Can the sentiment analysis developed for Deere & Co be applied to other companies within the agricultural and construction industry?
    7.	Can the sentiment analysis developed for Deere & Co be applied to other companies outside of Deere’s industry? 


- Data Source
    - The dataset comprises 1,000 filings by Deere with the SEC from October 3rd, 2011 to May 31st, 2022. Initially, the shortest document in the dataset was 37 characters long and the average document length was 24,110 characters. However, 38 documents had under 500 characters. These were removed because, on review, they were placeholders for PDF files that had been uploaded or references to other paper submissions. After removing these documents, the average document length was 25,054 characters, with a maximum of 564,539 and a minimum of 539.


### Required Imports

In [None]:
# imports required
import requests
import pandas as pd
import numpy as np
import json
import re
from lxml import html
#!pip3 install yfinance
import yfinance as yf
import time

#!pip3 install afinn
from afinn import Afinn
#!pip3 install textblob
from textblob import TextBlob
#!pip3 install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#!pip3 install pysentiment2
import pysentiment2 as ps

from sklearn.preprocessing import MinMaxScaler

import matplotlib.pyplot as plt                      # a library for visualization

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.preprocessing import MaxAbsScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LassoCV
from sklearn.svm import l1_min_c
from sklearn.linear_model import LogisticRegressionCV

from sklearn.decomposition import LatentDirichletAllocation  # LDA module from sklearn
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression

#!pip install scipy --upgrade

### Pull SEC Data
- Pull data via the APIs that the SEC makes available. The submissions endpoint is queried by company CIK number and returns the company's recent filings. The accession number and primary document name from that first pull are then used to fetch the text content of each filing. 
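
As a rough illustration of the two-step pull described above, the sketch below only builds the two kinds of URL and makes no network request. The helper names `submissions_url` and `document_url` are hypothetical, and the accession number and file name are made-up placeholders, not a real filing.

```python
# Sketch: build EDGAR URLs from a CIK and an accession number.
# `submissions_url` and `document_url` are hypothetical helper names.

def submissions_url(cik: int) -> str:
    """Return the submissions endpoint for a company; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


def document_url(cik: int, accession_number: str, filename: str) -> str:
    """Return the archive URL of a filing's primary document (dashes are stripped from the accession number)."""
    folder = accession_number.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{filename}"


deere_cik = 315189
print(submissions_url(deere_cik))
# Placeholder accession number and file name, for illustration only.
print(document_url(deere_cik, "0000000000-22-000001", "example.htm"))
```

Keeping URL construction in small functions like these would make it easier to reuse the pull for other companies later, since only the CIK changes.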

In [None]:
# Request Data from Edgar Search. Pulled by Company CIK number
data = requests.get('https://data.sec.gov/submissions/CIK0000315189.json',headers={'User-Agent':'University of Iowa abromeland@uiowa.edu'})
data = json.loads(data.content)['filings']['recent']
df = pd.DataFrame.from_dict(data)
df.sort_values(by='filingDate',ascending=False,inplace=True)
df.reset_index(inplace=True,drop=True)
df.head()

In [None]:
# Pull data by accessionNumber and FileName to be added to the dataframe
for i in range(len(df)):
    accessionNumber = df.loc[i,"accessionNumber"]
    filename = df.loc[i,'primaryDocument']
    URL = f'https://www.sec.gov/Archives/edgar/data/315189/{re.sub("-","",accessionNumber)}/{filename}'
    page = requests.get(URL,headers={'User-Agent':'University of Iowa abromeland@uiowa.edu'})
    root = html.fromstring(page.content)
    text = [s.text_content() for s in root.xpath('/html/body')]
    df.loc[i,'text'] = ' '.join(text)
    time.sleep(.5)
df

In [None]:
# Clean Data and print out example data
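# Strip zero-width spaces, newlines, and non-breaking spaces (inside a character class no '|' separators are needed)
df['text'] = df['text'].str.replace("[\u200b\n\xa0]",'',regex=True)
# Replace dollar amounts such as $1,234.56 with a single placeholder token
df['text'] = df['text'].str.replace(r"\$\d+[\d,\.]*",'moneyToken',regex=True)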
df['text'] = df['text'].str.lower()
print(df.text[0])
print(df.text[1])
print(df.text[2])

In [None]:
# Check for Filing Date nulls, and save filing date so it can be used later in graphing and to merge datasets.
print(df.filingDate.isnull().sum())
print(df.filingDate.dtype)
df["filingDate"] = pd.to_datetime(df["filingDate"],format="%Y-%m-%d")  # filing dates arrive as YYYY-MM-DD with no time component
print(df.filingDate.dtype)
df.head()

In [None]:
# Print the shape and save the data for later use
print(df.shape)
df.to_csv("Deere_SEC.csv",index=False)

### Obtain Stock Market Data
- Obtained from the yfinance API. 
- Will be used to determine whether the stock is increasing or decreasing over a given time period.
- Added the columns 
    1. dayDiff - Holds the change in adjusted close over one trading day
    2. weekDiff - Holds the change in adjusted close over 5 trading days (one week)
    3. monthDiff - Holds the change in adjusted close over 22 trading days (one month). 5 x 4 + 2 = 22: five days over four weeks, plus two days for the quarters. 
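
The notebook computes these columns by sorting prices newest-first and negating `diff()`. The toy example below, with made-up prices, is a small sketch of what that produces: each row ends up holding the price a given number of trading days later minus that row's price.

```python
import pandas as pd

# Toy closing prices for five consecutive trading days (made-up values).
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0, 104.0],
    index=pd.to_datetime(
        ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"]
    ),
    name="Adj Close",
)

# Sort newest-first, as the notebook does before differencing.
newest_first = prices.sort_index(ascending=False)

# diff() subtracts the previous row, which here is a later date, so negating
# it gives "later price minus this day's price", i.e. a forward change.
day_diff = -1 * newest_first.diff()
two_day_diff = -1 * newest_first.diff(periods=2)

print(day_diff)
print(two_day_diff)
```

The most recent rows have no later price to compare against, so they are NaN. The same holds for the last 5 and 22 rows of the weekly and monthly columns.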

In [None]:
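# auto_adjust=False keeps the separate 'Adj Close' column that later cells rely on
df_price = yf.download("DE",start='2011-08-01',end='2022-07-20',progress=False,auto_adjust=False)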
df_price.head()

In [None]:
print(type(df_price.index))
df_price.sort_index(ascending=False,inplace=True)
df_price.head()

In [None]:
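# Rows are sorted newest-first, so diff() subtracts a later date's price from the current row.
# Multiplying by -1 gives the change from each date to 1, 5, and 22 trading days later.
df_price['dayDiff'] = -1*df_price['Adj Close'].diff()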
df_price['weekDiff'] = -1*df_price['Adj Close'].diff(periods=5)
df_price['monthDiff'] = -1*df_price['Adj Close'].diff(periods=22)
df_price.head(40)

### Sentiment Analysis 

- Run sentiment analysis on all documents, then plot the scores against the changes in stock price for the same time period. 
- The sentiment analyzers that were used are below:
    1. AFINN
    2. TextBlob
    3. VADER
    4. LM - Loughran and McDonald Financial Sentiment Dictionaries
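
All four tools are lexicon based: they look words up in a scored dictionary and combine the hits into one number. The sketch below shows one common way to do that, a polarity of (positive - negative) / (positive + negative). It is an illustration under assumptions, not the exact formula of any of the four libraries, and the word lists are hypothetical stand-ins for a real dictionary.

```python
# Hypothetical mini word lists; real dictionaries such as Loughran-McDonald
# contain thousands of finance-specific terms.
POSITIVE = {"gain", "growth", "improve", "strong"}
NEGATIVE = {"loss", "decline", "impairment", "weak"}


def polarity(text: str) -> float:
    """Score text in [-1, 1] as (positive - negative) / (positive + negative)."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos + neg == 0:
        return 0.0  # no sentiment-bearing words found
    return (pos - neg) / (pos + neg)


print(polarity("strong growth offset a small loss"))
print(polarity("the filing lists exhibits only"))
```

A ratio like this is bounded regardless of document length, whereas a raw sum of word scores grows with the length of the filing. That difference is one reason scores from different tools may need rescaling before they are plotted together.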

In [None]:
# WARNING - THIS CELL WILL TAKE A LONG TIME TO RUN

# AFINN Sentiment Analysis Scores
afinn = Afinn(emoticons=True)
df["AFINN"]=[afinn.score(s) for s in df.text]

# TextBlob Sentiment Analysis Scores
df["TextBlob"]=[TextBlob(s).sentiment.polarity for s in df.text]

# VADER Sentiment Analysis Scores
analyzer=SentimentIntensityAnalyzer()
df["VADER"]=[analyzer.polarity_scores(s)['compound'] for s in df.text]

# Loughran and McDonald Sentiment Scores
lm = ps.LM()
df['LMTitle'] = 0.0  # float column, since polarity scores are fractional
for i in range(len(df['text'])):
    tokens = lm.tokenize(df['text'][i])
    score = lm.get_score(tokens)
    df.loc[i,"LMTitle"]=score["Polarity"]

df

In [None]:
# Write Out Results
df.to_csv("Deere_Sentiment_Scores.csv",index=False)

### Visualize Sentiment Analysis 
1. Graph All Sentiment Scores together
2. Graph Daily, Weekly, and Monthly Changes
3. Merge Sentiment Scores with Stock Data 
4. Graph Stock data with Sentiment Scores
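
Before series on different scales can share one axis, they need to be rescaled, and a moving average helps make the trend visible. The sketch below shows both steps on made-up numbers. The column names and the `min_max` helper are hypothetical, and the window of 3 is only for the toy data (the notebook uses 50).

```python
import pandas as pd

# Made-up scores on two very different scales.
scores = pd.DataFrame({
    "raw_count_score": [-20.0, 0.0, 40.0, 60.0, 80.0],  # unbounded sum of word scores
    "bounded_score": [-0.5, 0.0, 0.25, 0.5, 1.0],       # already within [-1, 1]
})


def min_max(series: pd.Series) -> pd.Series:
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())


scores["raw_count_scaled"] = min_max(scores["raw_count_score"])

# A rolling mean smooths document-to-document noise; the first window-1 rows are NaN.
smoothed = scores[["raw_count_scaled", "bounded_score"]].rolling(window=3).mean()
print(smoothed)
```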

In [None]:
# Plot daily, weekly, and monthly difference

plt.plot(df_price.index, df_price['dayDiff'].rolling(window=50).mean(), "-g", label="Daily Difference")
plt.plot(df_price.index, df_price['weekDiff'].rolling(window=50).mean(), "-b", label="Weekly Difference")
plt.plot(df_price.index, df_price['monthDiff'].rolling(window=50).mean(), "-r", label="Monthly Difference")


plt.legend(loc="upper left")
plt.title("Ticker Price - Adjusted Close Difference")
plt.xlabel("Date")
plt.ylabel("Stock Price (Moving Average with Window Size 50)")
plt.grid(axis='both')
plt.show()

#### Re-load Location - Allows for csv to be reloaded and merged with original data.

In [None]:

# Re-load data set location - uncomment to re-load 
#df = pd.read_csv("Deere_Sentiment_Scores.csv")
#print(df.filingDate.isnull().sum())
#print(df.filingDate.dtype)
#df["filingDate"] = pd.to_datetime(df["filingDate"],format="%Y-%m-%d")
#print(df.filingDate.dtype)
#df['AFINN_SCALE'] = MinMaxScaler().fit_transform(np.array(df['AFINN']).reshape(-1,1))

# Merge Stock and Sentiment Data
df_all = pd.merge(df,df_price,how='left',left_on = 'filingDate',right_index = True, copy=True)
df_all

In [None]:
# Write out Data set so it can be used and re-loaded at a later date.
df_all.to_csv("Deere_All.csv",index=False)
df_all

In [None]:
# Graph all stock data with Sentiment Data - First scale AFINN and the difference data so they can be graphed and compared with the bounded sentiment scores.
# AFINN_SCALE is created here because the plots below use it even when the re-load cell was not run.
df_all['AFINN_SCALE'] = MinMaxScaler().fit_transform(np.array(df_all['AFINN']).reshape(-1,1))
df_all['dayDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_all['dayDiff']).reshape(-1,1))
df_all['weekDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_all['weekDiff']).reshape(-1,1))
df_all['monthDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_all['monthDiff']).reshape(-1,1))
df_all['Adj_Close_SCALE'] = MinMaxScaler().fit_transform(np.array(df_all['Adj Close']).reshape(-1,1))


# inline display of plots
%matplotlib inline
plt.figure(figsize=(15, 6))

plt.plot(df_all.filingDate, df_all.AFINN_SCALE.rolling(window=50).mean(), "-b", label="AFINN Scaled")
plt.plot(df_all.filingDate, df_all.TextBlob.rolling(window=50).mean(), "-g", label="TextBlob")
plt.plot(df_all.filingDate, df_all.VADER.rolling(window=50).mean(), "-r", label="VADER")
plt.plot(df_all.filingDate, df_all.LMTitle.rolling(window=50).mean(), "-c", label="LM")
plt.plot(df_all.filingDate, df_all.dayDiff_SCALE.rolling(window=50).mean(), "-m", label="Day Difference")
plt.plot(df_all.filingDate, df_all.weekDiff_SCALE.rolling(window=50).mean(), "-y", label="Week Difference")
plt.plot(df_all.filingDate, df_all.monthDiff_SCALE.rolling(window=50).mean(), "-k", label="Month Difference")
plt.plot(df_all.filingDate, df_all.Adj_Close_SCALE.rolling(window=50).mean(), "-p", label="Adjusted Close")


plt.legend(loc="upper left")
plt.title("Comparison on Sentiment Score")
plt.xlabel("Date")
plt.ylabel("Sentiment Score (Moving Average with Window Size 50)")
plt.grid(axis='both')
plt.show()

In [None]:
# Graph Sentiment scores with Adjusted Closing Values
plt.figure(figsize=(15, 6))

plt.plot(df_all.filingDate, df_all.AFINN_SCALE.rolling(window=50).mean(), "-b", label="AFINN Scaled")
plt.plot(df_all.filingDate, df_all.TextBlob.rolling(window=50).mean(), "-g", label="TextBlob")
plt.plot(df_all.filingDate, df_all.VADER.rolling(window=50).mean(), "-r", label="VADER")
plt.plot(df_all.filingDate, df_all.LMTitle.rolling(window=50).mean(), "-k", label="LM")
plt.plot(df_all.filingDate, df_all.Adj_Close_SCALE.rolling(window=50).mean(), "-c", label="Adjusted Close")


plt.legend(loc="upper left")
plt.title("Comparison on Sentiment Score")
plt.xlabel("Date")
plt.ylabel("Sentiment Score (Moving Average with Window Size 50)")
plt.grid(axis='both')
plt.show()

### Summary Statistics on Data Set

1. Obtain the character counts and throw out data that is too small.
2. Create DTM's
    1. Unigrams
    2. Bigrams
    3. Trigrams
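
To make the DTM idea concrete, here is a small pure-Python sketch that counts n-grams per document for a made-up two-document corpus. The notebook itself uses `CountVectorizer` with the LM tokenizer. The helper names `ngrams` and `document_term_matrix` are hypothetical, and tokens are split on whitespace for simplicity.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (joined with spaces) in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, n):
    """Build (vocabulary, rows) where rows[d][t] counts n-gram t in document d."""
    counts = [Counter(ngrams(doc.split(), n)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Made-up miniature corpus.
docs = [
    "net sales increased net income increased",
    "net sales decreased",
]

vocab, dtm = document_term_matrix(docs, n=2)
print(vocab)
print(dtm)
```

Each row of the matrix is one document and each column one term, so summing down the columns gives the corpus-wide term frequencies that the cells below rank.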
    
    
#### Remove and Review of Small Character Count Documents

In [None]:
df_all['characters'] = [len(s) for s in df_all['text']]
print(f"Average character length: {df_all['characters'].mean()}")
print(f"Minimum character length: {df_all['characters'].min()}")
print(f"Maximum character length: {df_all['characters'].max()}")

In [None]:
df_small = df_all[df_all['characters']<1000].copy()
print(df_small.shape)
df_small

In [None]:
#df_small['text'][556]

In [None]:
#df_small['text'][991]

In [None]:
df_all = df_all[df_all['characters']>500].copy()
df_all.sort_values(by='filingDate',ascending=False,inplace=True)
df_all.reset_index(inplace=True,drop=True)
print(df_all.shape)
df_all

In [None]:
print(f"Average character length: {df_all['characters'].mean()}")
print(f"Minimum character length: {df_all['characters'].min()}")
print(f"Maximum character length: {df_all['characters'].max()}")

#### Creation Of DTM's

##### Unigrams - Use of LM Dictionary

In [None]:
# Stemmed HTML/CSS residue, the money placeholder, and stray single letters
custom_stop_words = ['moneytoken','font','serif','famili','helvetica','size',
                     'includ','arial','px','form','weight','color','text',
                     'align','de','file','solid','e','c','b','k','f','r',
                     'p','g','n','q','h']

In [None]:
# Use LM dictionary for DTM - No Stop word Removal
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer = lm.tokenize)
DTM = vectorizer.fit_transform(df_all['text'])

In [None]:
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)
dffreq.head(10)

In [None]:
# Use LM dictionary for DTM - No Stop word Removal except custom stop words
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer = lm.tokenize,stop_words=custom_stop_words)
DTM = vectorizer.fit_transform(df_all['text'])
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)
dffreq.head(10)

In [None]:
# Use LM dictionary for DTM - Stop Word Removal
nltk.download('stopwords', quiet=True)  # fetch the stop word corpus if it is not already installed
nltk_stopwords = nltk.corpus.stopwords.words("english") + custom_stop_words
vectorizer = CountVectorizer(tokenizer = lm.tokenize,stop_words=nltk_stopwords)
DTM = vectorizer.fit_transform(df_all['text'])
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)
dffreq.head(10)

##### Bi-grams - Use LM Dictionary and custom Stop word List

In [None]:
vectorizer = CountVectorizer(tokenizer = lm.tokenize,ngram_range=(2,2),stop_words=custom_stop_words)
DTM = vectorizer.fit_transform(df_all['text'])
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)
dffreq.head(10)

In [None]:
nltk_stopwords = nltk.corpus.stopwords.words("english") + custom_stop_words
vectorizer = CountVectorizer(tokenizer = lm.tokenize,stop_words=nltk_stopwords,ngram_range=(2,2))
DTM = vectorizer.fit_transform(df_all['text'])
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)
dffreq.head(10)

##### Tri-grams - LM Tokenizer with custom stop word list

In [None]:
vectorizer = CountVectorizer(tokenizer = lm.tokenize,ngram_range=(3,3),stop_words = custom_stop_words)
DTM = vectorizer.fit_transform(df_all['text'])
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)
dffreq.head(10)

In [None]:
nltk_stopwords = nltk.corpus.stopwords.words("english") + custom_stop_words
vectorizer = CountVectorizer(tokenizer = lm.tokenize,stop_words=nltk_stopwords,ngram_range=(3,3))
DTM = vectorizer.fit_transform(df_all['text'])
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                   'Frequency': DTM.sum(axis=0).tolist()[0]
                  })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)
dffreq.head(10)

##### Topic Model

In [None]:
nltk_stopwords = nltk.corpus.stopwords.words("english") + custom_stop_words
vectorizer = CountVectorizer(tokenizer = lm.tokenize,stop_words=nltk_stopwords,ngram_range=(1,1))
DTM =vectorizer.fit_transform(df_all['text'])


num_topics=[1,2,3,4,5,6,7,8,9,10,11,12,13]
lda = LatentDirichletAllocation(n_jobs=-1,   
                                max_iter=10,  
                                random_state=2021 
                               )
perplexity=[]
for i in num_topics:
    print(i)
    lda.set_params(n_components=i)
    lda.fit(DTM)
    perplexity.append(lda.perplexity(DTM))

plt.plot(num_topics, perplexity)
plt.xlabel('Num. of Topics')
plt.ylabel('Perplexity')

In [None]:
lda = LatentDirichletAllocation(n_components=4,
                                n_jobs=-1,   
                                max_iter=20,   
                                random_state=2021 
                               )
lda.fit(DTM)

In [None]:
#Create the top words for each topic and put them together in the same data frame.
temparray = preprocessing.normalize(lda.components_,norm="l1")
TTopicM = pd.DataFrame(np.transpose(temparray), index = vectorizer.get_feature_names_out())
TermOfTopic =pd.DataFrame([])
for i in range(4):
    TermOfTopic[i]=(list(TTopicM.sort_values(by=i,ascending=False).iloc[:10,i].index))
TermOfTopic

In [None]:
df_main = df_all[['form','text','filingDate']]
DTopicM = pd.DataFrame(lda.transform(DTM))
dfnew = pd.concat([df_main, DTopicM], axis=1)
dfnew.sort_values(by=0,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

In [None]:
dfnew.text[0]

In [None]:
dfnew.sort_values(by=1,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(10)

In [None]:
dfnew.text[4]

In [None]:
dfnew.sort_values(by=2,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

In [None]:
dfnew.sort_values(by=3,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

In [None]:
nltk_stopwords = nltk.corpus.stopwords.words("english") + custom_stop_words
vectorizer = CountVectorizer(tokenizer = lm.tokenize,stop_words=nltk_stopwords,ngram_range=(2,2))
DTM =vectorizer.fit_transform(df_all['text'])


num_topics=[1,2,3,4,5,6,7,8,9,10,11,12,13]
lda = LatentDirichletAllocation(n_jobs=-1,   
                                max_iter=10,  
                                random_state=2021 
                               )
perplexity=[]
for i in num_topics:
    print(i)
    lda.set_params(n_components=i)
    lda.fit(DTM)
    perplexity.append(lda.perplexity(DTM))

plt.plot(num_topics, perplexity)
plt.xlabel('Num. of Topics')
plt.ylabel('Perplexity')

In [None]:
lda = LatentDirichletAllocation(n_components=3,
                                n_jobs=-1,   
                                max_iter=20,   
                                random_state=2021 
                               )
lda.fit(DTM)

In [None]:
#Create the top words for each topic and put them together in the same data frame.
temparray = preprocessing.normalize(lda.components_,norm="l1")
TTopicM = pd.DataFrame(np.transpose(temparray), index = vectorizer.get_feature_names_out())
TermOfTopic =pd.DataFrame([])
for i in range(3):
    TermOfTopic[i]=(list(TTopicM.sort_values(by=i,ascending=False).iloc[:10,i].index))
TermOfTopic

In [None]:
df_main = df_all[['form','text','filingDate']]
DTopicM = pd.DataFrame(lda.transform(DTM))
dfnew = pd.concat([df_main, DTopicM], axis=1)
dfnew.sort_values(by=0,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

In [None]:
dfnew.sort_values(by=1,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

In [None]:
dfnew.sort_values(by=2,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(10)


In [None]:
nltk_stopwords = nltk.corpus.stopwords.words("english") + custom_stop_words
vectorizer = CountVectorizer(tokenizer = lm.tokenize,stop_words=nltk_stopwords,ngram_range=(3,3))
DTM =vectorizer.fit_transform(df_all['text'])


num_topics=[1,2,3,4,5,6,7,8,9,10,11,12,13]
lda = LatentDirichletAllocation(n_jobs=-1,   
                                max_iter=10,  
                                random_state=2021 
                               )
perplexity=[]
for i in num_topics:
    print(i)
    lda.set_params(n_components=i)
    lda.fit(DTM)
    perplexity.append(lda.perplexity(DTM))

plt.plot(num_topics, perplexity)
plt.xlabel('Num. of Topics')
plt.ylabel('Perplexity')

In [None]:
lda = LatentDirichletAllocation(n_components=3,
                                n_jobs=-1,   
                                max_iter=20,   
                                random_state=2021 
                               )
lda.fit(DTM)

In [None]:
#Create the top words for each topic and put them together in the same data frame.
temparray = preprocessing.normalize(lda.components_,norm="l1")
TTopicM = pd.DataFrame(np.transpose(temparray), index = vectorizer.get_feature_names_out())
TermOfTopic =pd.DataFrame([])
for i in range(3):
    TermOfTopic[i]=(list(TTopicM.sort_values(by=i,ascending=False).iloc[:10,i].index))
TermOfTopic

In [None]:
df_main = df_all[['form','text','filingDate']]
DTopicM = pd.DataFrame(lda.transform(DTM))
dfnew = pd.concat([df_main, DTopicM], axis=1)
dfnew.sort_values(by=0,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

In [None]:
dfnew.sort_values(by=1,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

In [None]:
dfnew.sort_values(by=2,ascending=False,inplace=True)
dfnew.reset_index(inplace=True,drop=True)
dfnew.head(5)

### Data Questions

1.	Which standard forms filed with the SEC have the highest sentiment scores?
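
The cells below answer this by grouping filings by form type and ranking the aggregated scores. As a minimal sketch of that pattern, the toy example here uses made-up scores and a single hypothetical `score` column.

```python
import pandas as pd

# Made-up scores for a handful of filings.
toy = pd.DataFrame({
    "form": ["8-K", "8-K", "10-Q", "10-Q", "10-K", "4"],
    "score": [0.6, 0.2, 0.1, 0.3, -0.1, 0.0],
})

# Aggregate per form type, then rank from highest to lowest mean score.
by_form = toy.groupby("form")["score"].agg(["mean", "var", "count"])
ranked = by_form.sort_values("mean", ascending=False)
print(ranked)
```

Keeping the `count` column next to the mean and variance is a deliberate choice in this sketch: a form type with only one or two filings can top a ranking by chance, and its variance is undefined when only one filing exists.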

In [None]:
df_forms = df.copy()
df_forms.sort_values(by='form',ascending=False,inplace=True)
df_forms.reset_index(inplace=True,drop=True)
df_forms = df_forms.set_index(['form',df.index])
df_forms

In [None]:
sentiment = df_forms[['AFINN','TextBlob','VADER','LMTitle']].groupby(level='form').mean()
df_sent = pd.DataFrame([])
for i in ['AFINN','TextBlob','VADER','LMTitle']:
    df_sent[i] = (list(sentiment.sort_values(by=i,ascending=False)[i][:10].index))
df_sent

In [None]:
sentiment = df_forms[['AFINN','TextBlob','VADER','LMTitle']].groupby(level='form').max()
df_sent = pd.DataFrame([])
for i in ['AFINN','TextBlob','VADER','LMTitle']:
    df_sent[i] = (list(sentiment.sort_values(by=i,ascending=False)[i][:10].index))
df_sent

In [None]:
sentiment = df_forms[['AFINN','TextBlob','VADER','LMTitle']].groupby(level='form').min()
df_sent = pd.DataFrame([])
for i in ['AFINN','TextBlob','VADER','LMTitle']:
    df_sent[i] = (list(sentiment.sort_values(by=i,ascending=True)[i][:10].index))
df_sent

In [None]:
sentiment = df_forms[['AFINN','TextBlob','VADER','LMTitle']].groupby(level='form').var()
df_sent = pd.DataFrame([])
for i in ['AFINN','TextBlob','VADER','LMTitle']:
    df_sent[i] = (list(sentiment.sort_values(by=i,ascending=False)[i][:10].index))
df_sent

In [None]:
sentiment = df_forms[['AFINN','TextBlob','VADER','LMTitle']].groupby(level='form').std()
df_sent = pd.DataFrame([])
for i in ['AFINN','TextBlob','VADER','LMTitle']:
    df_sent[i] = (list(sentiment.sort_values(by=i,ascending=False)[i][:10].index))
df_sent

In [None]:
#print(df_all[df_all['LMTitle'] == df_all['LMTitle'].max()])
#df_all['text'][842]

In [None]:
df_8K = df_all[df_all['form']=='8-K']
max_form = df_8K[df_8K['LMTitle'] == df_8K['LMTitle'].max()]
max_form = max_form[['form','filingDate','LMTitle','Adj Close']]
max_form


In [None]:
#print(df_all[df_all['LMTitle'] == df_all['LMTitle'].min()])
#print(df_all['text'][139])

In [None]:
df_forms = df_all.copy()
df_forms.sort_values(by='form',ascending=False,inplace=True)
df_forms.reset_index(inplace=True,drop=True)
df_forms = df_forms.set_index(['form',df_all.index])
print(df_forms[['dayDiff','weekDiff','monthDiff']].groupby(level='form').max())
# Take absolute values before aggregating so that large price drops also count as large changes
priceDiff = df_forms[['dayDiff','weekDiff','monthDiff']].abs().groupby(level='form').max()
print(priceDiff)
df_sent = pd.DataFrame([])
for i in ['dayDiff','weekDiff','monthDiff']:
    df_sent[i] = (list(priceDiff.sort_values(by=i,ascending=False)[i][:10].index))
df_sent

In [None]:
print(df_all[df_all['dayDiff']==df_all['dayDiff'].max()])

In [None]:
print(df_all[df_all['weekDiff']==df_all['weekDiff'].max()])

In [None]:
df_8K = df_all[df_all['form'] == '8-K']
print(df_8K[df_8K['weekDiff']==df_8K['weekDiff'].max()])

In [None]:
print(df_all[df_all['monthDiff']==df_all['monthDiff'].max()])

In [None]:
print(df_forms[['dayDiff','weekDiff','monthDiff']].groupby(level='form').mean())
priceDiff = df_forms[['dayDiff','weekDiff','monthDiff']].groupby(level='form').mean().abs()
print(priceDiff)
df_sent = pd.DataFrame([])
for i in ['dayDiff','weekDiff','monthDiff']:
    df_sent[i] = (list(priceDiff.sort_values(by=i,ascending=False)[i][:10].index))
df_sent

## Sparse Logistic Regression

1. Dataset Preparation
    - Add column for daily difference - positive or negative
    - Add column for weekly difference - positive or negative
    - Add column for monthly difference - positive or negative
    - Data was already cleaned in prior steps
    - Create DTMs for training and testing data
2. Feature Engineering
3. Model Training
4. Descriptive Analytics
5. Performance Metric
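
The steps above can be sketched end to end on a made-up four-document corpus. This is an illustration under assumptions, not the notebook's actual configuration: the texts and price changes are invented, the default tokenizer replaces the LM tokenizer, and `C=10` is an arbitrary choice for the toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up miniature corpus and price changes.
texts = [
    "record sales and strong demand",
    "strong orders and record income",
    "weak demand and lower sales",
    "lower income and weak orders",
]
price_change = np.array([1.5, 0.7, -2.0, -0.4])

# Step 1: turn the numeric change into a two-class label.
labels = np.where(price_change > 0, "positive", "negative")

# Steps 2-3: TF-IDF features, then an L1-penalized (sparse) logistic regression.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
model = LogisticRegression(penalty="l1", solver="liblinear", C=10, random_state=0)
model.fit(features, labels)

# Step 4: the L1 penalty can drive coefficients to exactly zero.
n_nonzero = int(np.count_nonzero(model.coef_[0]))
print(f"{n_nonzero} of {features.shape[1]} terms kept")
print(model.predict(vectorizer.transform(["strong record sales"])))
```

A smaller `C` means a stronger penalty and fewer surviving terms, which is the lever to adjust when the list of non-zero betas is too long to interpret.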

### Daily Difference

In [None]:
df = df_all[["text","dayDiff"]].dropna().copy()  # drop filings with no matching price row so NaN is not labeled 'negative'
df['dayDiff'] = np.where(df.dayDiff > 0 ,'positive','negative')
df

In [None]:
df_train, df_test = train_test_split(df, test_size=0.33, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

In [None]:
lm = ps.LM()
vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,stop_words=custom_stop_words)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_train["text"])
train_y = df_train["dayDiff"]
train_x.shape

In [None]:
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_test["text"])
test_y = df_test["dayDiff"]
test_x.shape

In [None]:
pd.DataFrame({'Train': train_y.value_counts(),
              'Test': test_y.value_counts()})

In [None]:
sparselr = LogisticRegression(penalty='l1', 
                              solver='liblinear',
                              random_state=2021,
                              tol=0.0001,
                              max_iter=1000, 
                              C=1)
sparselr.fit(train_x,train_y)

In [None]:
#How many non-zero betas in total
sum(sparselr.coef_[0]!=0)

In [None]:
dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                       'Beta': sparselr.coef_[0]
                     })

In [None]:
#Show the most positive terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
#Show the most negative terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=True)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
# Accuracy
print("Train:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

In [None]:
# AUC Score
print("Train:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Test:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

### Weekly Difference

In [None]:
df = df_all[["text","weekDiff"]].dropna().copy()  # drop filings with no matching price row so NaN is not labeled 'negative'
df['weekDiff'] = np.where(df.weekDiff > 0 ,'positive','negative')

df_train, df_test = train_test_split(df, test_size=0.33, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

lm = ps.LM()
vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,stop_words=custom_stop_words)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_train["text"])
train_y = df_train["weekDiff"]
train_x.shape

#Create the testing DTM and the labels
test_x = vectorizer.transform(df_test["text"])
test_y = df_test["weekDiff"]
test_x.shape

print(pd.DataFrame({'Train': train_y.value_counts(),
              'Test': test_y.value_counts()}))

sparselr = LogisticRegression(penalty='l1', 
                              solver='liblinear',
                              random_state=2021,
                              tol=0.0001,
                              max_iter=1000, 
                              C=1)
sparselr.fit(train_x,train_y)

#How many non-zero betas in total
print(sum(sparselr.coef_[0]!=0))

dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                       'Beta': sparselr.coef_[0]
                     })

In [None]:
#Show the most positive terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
#Show the most negative terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=True)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
# Accuracy
print("Train:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

In [None]:
# AUC Score
print("Train:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Test:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

### Monthly Difference

In [None]:
df = df_all[["text","monthDiff"]].dropna().copy()  # drop filings with no matching price row so NaN is not labeled 'negative'
df['monthDiff'] = np.where(df.monthDiff > 0 ,'positive','negative')

df_train, df_test = train_test_split(df, test_size=0.33, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

lm = ps.LM()
vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,stop_words=custom_stop_words)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_train["text"])
train_y = df_train["monthDiff"]
print(train_x.shape)

#Create the testing DTM and the labels
test_x = vectorizer.transform(df_test["text"])
test_y = df_test["monthDiff"]
print(test_x.shape)

print(pd.DataFrame({'Train': train_y.value_counts(),
              'Test': test_y.value_counts()}))

sparselr = LogisticRegression(penalty='l1', 
                              solver='liblinear',
                              random_state=2021,
                              tol=0.0001,
                              max_iter=1000, 
                              C=1)
sparselr.fit(train_x,train_y)

#How many non-zero betas in total
print(sum(sparselr.coef_[0]!=0))

dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
                       'Beta': sparselr.coef_[0]
                     })

#Show the most positive terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
#Show the most negative terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=True)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
# Accuracy
print("Train:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

# AUC Score
print("Train:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Test:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

### Regression Analysis
- First attempt at regression analysis using Lasso
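
For reference, the objective that scikit-learn's `Lasso` minimizes is written below. Here $n$ is the number of training filings, $X$ is the scaled document-term matrix, $y$ is the adjusted close, $w$ holds the per-term betas, and $\alpha$ is the `alpha` argument. The intercept is omitted for brevity.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

A larger $\alpha$ drives more betas to exactly zero, which is why the count of non-zero coefficients is printed after fitting.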

In [None]:
df = df_all[["text","Adj Close"]].copy()


df_train, df_test = train_test_split(df, test_size=0.33, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

lm = ps.LM()
vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,stop_words=custom_stop_words)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_train["text"])
train_y = df_train["Adj Close"]
print(train_x.shape)

#Create the testing DTM and the labels
test_x = vectorizer.transform(df_test["text"])
test_y = df_test["Adj Close"]
print(test_x.shape)

scaler = MaxAbsScaler()
scaler.fit(train_x)
train_x=scaler.transform(train_x)
test_x=scaler.transform(test_x)

lasso = Lasso(alpha=0.01, #Regularization parameter.
              max_iter=5000
              )
lasso.fit(train_x,train_y)
print(np.count_nonzero(lasso.coef_))
#Check the percentage of non-zero slopes
print(np.count_nonzero(lasso.coef_)/len(lasso.coef_))

dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
                       'Beta': lasso.coef_
                     })
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
dfbeta.sort_values(by="Beta",inplace=True,ascending=True)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
#Apply the model to the testing set and compare predicted and actual adjusted close values
print(lasso.predict(test_x)[0:10])
print(test_y[0:10])

In [None]:
print("Root Mean Squared Error:")  # squared=False returns RMSE, not MSE
print("Training:")
print(mean_squared_error(train_y,lasso.predict(train_x),squared=False))
print()
print("Testing:")
print(mean_squared_error(test_y,lasso.predict(test_x),squared=False))
print()
print("R-Squared:")
print("Training:")
print(r2_score(train_y,lasso.predict(train_x)))
print()
print("Testing:")
print(r2_score(test_y,lasso.predict(test_x)))

### Lasso Regression K-Fold Cross-Validation

- Lasso regression with k-fold cross-validation over a grid of candidate alphas.
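
As a rough illustration of what the cross-validated search does, the sketch below runs a plain K-fold loop over a grid of alphas and keeps the alpha with the lowest mean validation MSE. The helper name `select_alpha_by_cv` and the synthetic data are assumptions made for this example only. `LassoCV` performs a similar search internally but more efficiently, so this is not the code used for the results in this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def select_alpha_by_cv(X, y, alphas, n_splits=5, max_iter=5000):
    """Hypothetical helper: return (best_alpha, mean validation MSE per alpha)."""
    kfold = KFold(n_splits=n_splits, shuffle=False)
    mean_mse = []
    for alpha in alphas:
        fold_mse = []
        for train_idx, valid_idx in kfold.split(X):
            model = Lasso(alpha=alpha, max_iter=max_iter)
            model.fit(X[train_idx], y[train_idx])
            fold_mse.append(
                mean_squared_error(y[valid_idx], model.predict(X[valid_idx]))
            )
        mean_mse.append(np.mean(fold_mse))
    mean_mse = np.array(mean_mse)
    return alphas[int(np.argmin(mean_mse))], mean_mse


# Synthetic stand-in data (not the Deere filings): two informative features
rng = np.random.default_rng(2021)
X_demo = rng.normal(size=(100, 20))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=100)

alpha_grid = np.logspace(start=-4, stop=0, num=10)
best_alpha_demo, mean_mse_demo = select_alpha_by_cv(X_demo, y_demo, alpha_grid)
print(best_alpha_demo)
print(mean_mse_demo)
```

The per-alpha means play the same role as `lasso.mse_path_.mean(axis=1)` in the cells that follow.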

In [None]:
alphaList = np.logspace(start=-4,stop=0,num=10)
alphaList

lasso = LassoCV(alphas=alphaList,  
                   cv=5,       #Number of folds, i.e., K
                   max_iter=5000)
lasso.fit(train_x, train_y)

print(train_x.shape)
print(lasso.alphas_)

In [None]:
pd.DataFrame(lasso.mse_path_)

In [None]:
%matplotlib inline
plt.plot(np.log10(lasso.alphas_), lasso.mse_path_.mean(axis=1))
plt.xlabel('log10(alpha)')
plt.ylabel('Mean CV MSE')  # mse_path_ holds mean squared errors, not accuracy
plt.title('Lasso Path')
plt.axis('tight')
plt.show()
bestalpha=lasso.alpha_
print(bestalpha)
print()
print("Root Mean Squared Error:")  # squared=False returns RMSE, not MSE
print("Training:")
print(mean_squared_error(train_y,lasso.predict(train_x),squared=False))
print()
print("Testing:")
print(mean_squared_error(test_y,lasso.predict(test_x),squared=False))
print()
print("R-Squared:")
print("Training:")
print(r2_score(train_y,lasso.predict(train_x)))
print()
print("Testing:")
print(r2_score(test_y,lasso.predict(test_x)))

In [None]:
print(np.count_nonzero(lasso.coef_))
#Check the percentage of non-zero slopes
print(np.count_nonzero(lasso.coef_)/len(lasso.coef_))

dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
                       'Beta': lasso.coef_
                     })
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

In [None]:
dfbeta.sort_values(by="Beta",inplace=True,ascending=True)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

### Logistic Regression - K-Fold Cross-Validation
- Tried different predictor columns, vectorizers, and N-grams.
- Then printed results by predictor column.
- Only trying to predict whether the differences were positive.
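
The search space described above can be sketched compactly as follows. The helper `make_vectorizer` is a hypothetical name introduced for this illustration. It uses scikit-learn's default tokenizer rather than `lm.tokenize` and the custom stop words, so that the snippet runs on its own.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def make_vectorizer(score, ngram, **shared):
    """Hypothetical helper: build the vectorizer for one scorer / n-gram setting."""
    ngram_range = (ngram, ngram)
    if score == 'TF':
        return CountVectorizer(ngram_range=ngram_range, binary=False, **shared)
    if score == 'Binary':
        return CountVectorizer(ngram_range=ngram_range, binary=True, **shared)
    if score == 'NormTfidf':
        return TfidfVectorizer(ngram_range=ngram_range, norm='l1', **shared)
    if score == 'UnnormalizedTfidf':
        return TfidfVectorizer(ngram_range=ngram_range, norm=None, **shared)
    raise ValueError(f"Unknown scorer: {score}")


columns = ['dayDiff', 'weekDiff', 'monthDiff']
scorers = ['TF', 'Binary', 'NormTfidf', 'UnnormalizedTfidf']
ngrams = [1, 2, 3]

# Every (target column, scorer, n-gram size) combination tried in the search
grid = list(product(columns, scorers, ngrams))
print(len(grid))

# Toy documents, only to show the shape of one document-term matrix
toy_docs = ["net sales increased this quarter", "net sales decreased this quarter"]
dtm = make_vectorizer('Binary', 2).fit_transform(toy_docs)
print(dtm.shape)
```

Each of the 36 combinations gets its own train/test split, vectorizer fit, and cross-validated logistic regression, which is why the search cell is slow to run.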

In [None]:
columns = ['dayDiff','weekDiff','monthDiff']
scorer = ['TF','Binary','NormTfidf','UnnormalizedTfidf']
ngrams = [1,2,3]
lm = ps.LM()
measures = []
vector = []
grams = []
trainAcc = []
testAcc = []
trainROC = []
testROC = []

for column in columns:
    for score in scorer:
        for ngram in ngrams:
            if score == 'TF':
                vectorizer = CountVectorizer(tokenizer = lm.tokenize,
                                             stop_words=custom_stop_words,
                                             ngram_range =(ngram,ngram),
                                             binary=False)
            elif score == 'Binary':
                vectorizer = CountVectorizer(tokenizer = lm.tokenize,
                                             stop_words=custom_stop_words,
                                             ngram_range =(ngram,ngram),
                                             binary=True)
            elif score == 'NormTfidf':
                vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,
                                             stop_words=custom_stop_words,
                                             norm = 'l1',
                                             ngram_range = (ngram,ngram),
                                             binary=False)
            elif score == 'UnnormalizedTfidf':
                vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,
                                             stop_words=custom_stop_words,
                                             norm = None,
                                             ngram_range = (ngram,ngram),
                                             binary=False)
            
            print(f"Starting: Measure - {column}, Scorer - {score}, N-Grams - {ngram}")
            df = df_all[["text",column]].copy()
            df[column] = np.where(df[column]> 0 ,'positive','negative')
            df_train, df_test = train_test_split(df, test_size=0.33, random_state=2021)
            df_train.reset_index(drop=True,inplace=True)
            df_test.reset_index(drop=True,inplace=True)

            #Create the training DTM and the labels
            train_x = vectorizer.fit_transform(df_train["text"])
            train_y = df_train[column]
            print(train_x.shape)

            #Create the testing DTM and the labels
            test_x = vectorizer.transform(df_test["text"])
            test_y = df_test[column]
            print(test_x.shape)

            print(pd.DataFrame({'Train': train_y.value_counts(),
                      'Test': test_y.value_counts()}))

            param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
            print(param_grid)

            sparselr = LogisticRegressionCV(penalty='l1', 
                                        solver='liblinear', 
                                        Cs=param_grid,   #Use the grid generated above
                                        cv=5,            #Number of folds, that is, K
                                        scoring='accuracy', #The performance metric to select the best C.
                                        random_state=2021,  #To make sure the result is reproducible
                                        tol=0.001,
                                        max_iter=1000)
            sparselr.fit(train_x, train_y)

            #All candidates
            print("All candidates")
            print(sparselr.Cs_)
            print("Best value of C among candidates")
            print(sparselr.C_)

            print("Train Accuracy:")
            print(accuracy_score(train_y,sparselr.predict(train_x)))
            print("Test Accuracy:")
            print(accuracy_score(test_y,sparselr.predict(test_x)))
            print("Train AUC:")
            print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
            print("Test AUC:")
            print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))
            
            measures.append(column)
            vector.append(score)
            grams.append(ngram)
            trainAcc.append(accuracy_score(train_y,sparselr.predict(train_x)))
            testAcc.append(accuracy_score(test_y,sparselr.predict(test_x)))
            trainROC.append(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
            testROC.append(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))
            
            print(f"Finishing: Measure - {column}, Scorer - {score}, N-Grams - {ngram}")

In [None]:
results = pd.DataFrame({'Prediction': measures,
                        'Vectorizer': vector,
                        'N-Grams': grams,
                        'Training Accuracy': trainAcc,
                        'Testing Accuracy': testAcc,
                        'Training ROC': trainROC,
                        'Testing ROC': testROC})
results.sort_values(by='Testing ROC',ascending=False,inplace=True)
results.reset_index(inplace=True,drop=True)
results

In [None]:
results_day = results.copy()
results_day = results_day[results_day['Prediction'] == 'dayDiff']
results_day.sort_values(by='Testing ROC',ascending=False,inplace=True)
results_day.reset_index(inplace=True,drop=True)
results_day

In [None]:
results_week = results.copy()
results_week = results_week[results_week['Prediction'] == 'weekDiff']
results_week.sort_values(by='Testing ROC',ascending=False,inplace=True)
results_week.reset_index(inplace=True,drop=True)
results_week

In [None]:
results_month = results.copy()
results_month = results_month[results_month['Prediction'] == 'monthDiff']
results_month.sort_values(by='Testing ROC',ascending=False,inplace=True)
results_month.reset_index(inplace=True,drop=True)
results_month

### Pull in Caterpillar Data and Apply Lasso and Logistic Regression Models
- Testing whether models developed for Deere & Company can be applied to other companies within the same industry.
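
The transfer test follows one pattern throughout: fit the vectorizer and the model on one company's filings only, then score another company's filings with that fitted pair. The sketch below shows the pattern on toy sentences. The helper `evaluate_transfer`, the toy text, and the use of the default tokenizer are all assumptions for illustration, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.pipeline import make_pipeline


def evaluate_transfer(source_text, source_y, target_text, target_y):
    """Hypothetical helper: fit on the source company, score source and target."""
    model = make_pipeline(
        CountVectorizer(),
        LogisticRegression(penalty='l1', solver='liblinear', random_state=2021),
    )
    model.fit(source_text, source_y)
    # Look up the probability column of the 'positive' class explicitly
    positive_col = list(model.classes_).index('positive')
    scores = {}
    for name, text, y in [('source', source_text, source_y),
                          ('target', target_text, target_y)]:
        proba = model.predict_proba(text)[:, positive_col]
        scores[name] = {
            'accuracy': accuracy_score(y, model.predict(text)),
            'auc': roc_auc_score(np.asarray(y) == 'positive', proba),
        }
    return scores


# Toy stand-ins for two companies' filings
source_text = ["record sales growth and higher profit",
               "strong demand and profit growth",
               "impairment charge and net loss",
               "weak demand and sales decline"] * 5
source_y = ['positive', 'positive', 'negative', 'negative'] * 5
target_text = ["higher profit on strong sales", "net loss after impairment",
               "record growth", "decline in demand"]
target_y = ['positive', 'negative', 'positive', 'negative']

scores = evaluate_transfer(source_text, source_y, target_text, target_y)
print(scores)
```

Because the vectorizer is fitted on the source text only, any term that appears only in the target filings is ignored when the target is scored.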

In [None]:
# Request Data from Edgar Search. Pulled by Company CIK number
data = requests.get('https://data.sec.gov/submissions/CIK0000018230.json',headers={'User-Agent':'University of Iowa abromeland@uiowa.edu'})
data = json.loads(data.content)['filings']['recent']
df = pd.DataFrame.from_dict(data)
df.sort_values(by='filingDate',ascending=False,inplace=True)
df.reset_index(inplace=True,drop=True)
df.head()

In [None]:
# Pull data by accessionNumber and FileName to be added to the dataframe
for i in range(len(df)):
    accessionNumber = df.loc[i,"accessionNumber"]
    filename = df.loc[i,'primaryDocument']
    URL = f'https://www.sec.gov/Archives/edgar/data/18230/{re.sub("-","",accessionNumber)}/{filename}'
    page = requests.get(URL,headers={'User-Agent':'University of Iowa abromeland@uiowa.edu'})
    root = html.fromstring(page.content)
    text = [s.text_content() for s in root.xpath('/html/body')]
    df.loc[i,'text'] = ' '.join(text)
    time.sleep(.5)
df

In [None]:
# Clean Data and print out example data
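# Remove zero-width spaces, newlines, and non-breaking spaces ('|' is literal inside a character class, so it is not needed)
df['text'] = df['text'].str.replace(r"[\u200b\n\xa0]",'',regex=True)
# Replace dollar amounts with a single placeholder token (raw string avoids invalid escape warnings)
df['text'] = df['text'].str.replace(r"\$\d+[\d,\.]*",'moneyToken',regex=True)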
df['text'] = df['text'].str.lower()
print(df.filingDate.isnull().sum())
print(df.filingDate.dtype)
df["filingDate"] = pd.to_datetime(df["filingDate"],format="%Y-%m-%d")  # EDGAR filingDate is a date-only string
print(df.filingDate.dtype)

In [None]:
df.sort_values(by='filingDate',ascending=False,inplace=True)
df

In [None]:
df_price = yf.download("CAT",start='2015-08-01',end='2022-07-28',progress=False)
print(type(df_price.index))
df_price.sort_index(ascending=False,inplace=True)
df_price['dayDiff'] = -1*df_price['Adj Close'].diff()
df_price['weekDiff'] = -1*df_price['Adj Close'].diff(periods=5)
df_price['monthDiff'] = -1*df_price['Adj Close'].diff(periods=22)
df_price
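
The sign flip in the difference columns is easy to misread, so here is a small worked example on made-up prices. With the newest row first, `diff()` subtracts the later date's price from the earlier one, and multiplying by -1 turns that into "next trading day's close minus today's close". The four prices below are invented for illustration.

```python
import pandas as pd

# Made-up adjusted closes for four consecutive trading days
prices = pd.DataFrame(
    {'Adj Close': [100.0, 102.0, 101.0, 105.0]},
    index=pd.to_datetime(['2022-07-25', '2022-07-26', '2022-07-27', '2022-07-28']),
)

# Same steps as the notebook: newest first, then negate the row-to-row difference
newest_first = prices.sort_index(ascending=False)
newest_first['dayDiff'] = -1 * newest_first['Adj Close'].diff()

# Equivalent forward difference computed on the oldest-first frame
forward = prices['Adj Close'].shift(-1) - prices['Adj Close']

day_diff_oldest_first = newest_first.sort_index()['dayDiff']
print(day_diff_oldest_first)
print(forward)
```

The two printed series match. The most recent date has no following price, so its difference is NaN. The `periods=5` and `periods=22` variants count rows, that is trading days, rather than calendar days.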

In [None]:
# WARNING - THIS CELL WILL TAKE A LONG TIME TO RUN

# AFINN Sentiment Analysis Scores
afinn = Afinn(emoticons=True)
df["AFINN"]=[afinn.score(s) for s in df.text]

# TextBlob Sentiment Analysis Scores
df["TextBlob"]=[TextBlob(s).sentiment.polarity for s in df.text]

# VADER Sentiment Analysis Scores
analyzer=SentimentIntensityAnalyzer()
df["VADER"]=[analyzer.polarity_scores(s)['compound'] for s in df.text]

# Loughran and McDonald Sentiment Scores
lm = ps.LM()
df['LMTitle'] = 0.0  # float column, since polarity scores are fractional
for i in range(len(df['text'])):
    tokens = lm.tokenize(df['text'][i])
    score = lm.get_score(tokens)
    df.loc[i,"LMTitle"]=score["Polarity"]

df

# Scale AFINN score to compare it with other sentiment analyzers
df['AFINN_SCALE'] = MinMaxScaler().fit_transform(np.array(df['AFINN']).reshape(-1,1))


# inline display of plots
%matplotlib inline
plt.figure(figsize=(15, 6))

plt.plot(df.filingDate, df.AFINN_SCALE.rolling(window=50).mean(), "-r", label="AFINN Scaled")
plt.plot(df.filingDate, df.TextBlob.rolling(window=50).mean(), "-b", label="TextBlob")
plt.plot(df.filingDate, df.VADER.rolling(window=50).mean(), "-g", label="VADER")
plt.plot(df.filingDate, df.LMTitle.rolling(window=50).mean(), "-k", label="LM")


plt.legend(loc="upper left")
plt.title("Comparison of Sentiment Scores")
plt.xlabel("Date")
plt.ylabel("Sentiment Score (Moving Average with Window Size 50)")
plt.grid(axis='both')
plt.show()

In [None]:
df_cat = pd.merge(df,df_price,how='left',left_on = 'filingDate',right_index = True, copy=True)
df_cat

In [None]:
# Graph all stock data with Sentiment Data - First Scale difference data to be graphed and compared with sentiment data.
df_cat['dayDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['dayDiff']).reshape(-1,1))
df_cat['weekDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['weekDiff']).reshape(-1,1))
df_cat['monthDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['monthDiff']).reshape(-1,1))
df_cat['Adj_Close_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['Adj Close']).reshape(-1,1))

# Graph Sentiment scores with Adjusted Closing Values
plt.figure(figsize=(15, 6))

plt.plot(df_cat.filingDate, df_cat.AFINN_SCALE.rolling(window=50).mean(), "-b", label="AFINN Scaled")
plt.plot(df_cat.filingDate, df_cat.TextBlob.rolling(window=50).mean(), "-g", label="TextBlob")
plt.plot(df_cat.filingDate, df_cat.VADER.rolling(window=50).mean(), "-r", label="VADER")
plt.plot(df_cat.filingDate, df_cat.LMTitle.rolling(window=50).mean(), "-k", label="LM")
plt.plot(df_cat.filingDate, df_cat.Adj_Close_SCALE.rolling(window=50).mean(), "-c", label="Adjusted Close")


plt.legend(loc="upper left")
plt.title("Comparison of Sentiment Scores")
plt.xlabel("Date")
plt.ylabel("Sentiment Score (Moving Average with Window Size 50)")
plt.grid(axis='both')
plt.show()

In [None]:
df_cat['characters'] = [len(s) for s in df_cat['text']]
print(f"Average character length: {df_cat['characters'].mean()}")
print(f"Minimum character length: {df_cat['characters'].min()}")
print(f"Maximum character length: {df_cat['characters'].max()}")

In [None]:
df_small = df_cat[df_cat['characters']<1000].copy()
print(df_small.shape)
df_small

In [None]:
df_cat = df_cat[df_cat['characters']>500].copy()
df_cat.sort_values(by='filingDate',ascending=False,inplace=True)
df_cat.reset_index(inplace=True,drop=True)
print(df_cat.shape)
df_cat

In [None]:
print(f"Average character length: {df_cat['characters'].mean()}")
print(f"Minimum character length: {df_cat['characters'].min()}")
print(f"Maximum character length: {df_cat['characters'].max()}")

In [None]:
# Retrain the best Deere model on the entire Deere dataset for use on the Caterpillar dataset
df = df_all[["text","Adj Close"]].copy()


lm = ps.LM()
vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,stop_words=custom_stop_words)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df["text"])
train_y = df["Adj Close"]
print(train_x.shape)


scaler = MaxAbsScaler()
scaler.fit(train_x)
train_x=scaler.transform(train_x)

lasso = Lasso(alpha=bestalpha, #Regularization parameter.
              max_iter=5000
              )
lasso.fit(train_x,train_y)
print(np.count_nonzero(lasso.coef_))
#Check the percentage of non-zero slopes
print(np.count_nonzero(lasso.coef_)/len(lasso.coef_))


In [None]:
df = df_cat[['text','Adj Close']]
df = df[df['Adj Close'].notna()]
test_x = vectorizer.transform(df['text'])
test_y = df['Adj Close']
print(test_x.shape)
print(test_y.shape)
test_x = scaler.transform(test_x)
print("Root Mean Squared Error:")  # squared=False returns RMSE, not MSE
print("Deere:")
print(mean_squared_error(train_y,lasso.predict(train_x),squared=False))
print()
print("Caterpillar:")
print(mean_squared_error(test_y,lasso.predict(test_x),squared=False))
print()
print("R-Squared:")
print("Deere:")
print(r2_score(train_y,lasso.predict(train_x)))
print()
print("Caterpillar:")
print(r2_score(test_y,lasso.predict(test_x)))

In [None]:
df_d = df_all[["text","dayDiff"]].copy()
df_d['dayDiff'] = np.where(df_d.dayDiff > 0 ,'positive','negative')
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer = lm.tokenize,
                             stop_words=custom_stop_words,
                             ngram_range =(3,3),
                             binary=False)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_d["text"])
train_y = df_d["dayDiff"]
print(train_x.shape)

# Drop filings with no matching trading day so NaN differences are not labeled 'negative'
df_c = df_cat[["text",'dayDiff']].dropna().copy()
df_c['dayDiff'] = np.where(df_c.dayDiff > 0 ,'positive','negative')
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_c["text"])
test_y = df_c["dayDiff"]
print(test_x.shape)


param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
print(param_grid)
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=1000)
sparselr.fit(train_x, train_y)

# Accuracy
print("Day Difference")
print("Accuracy:")
print("Deere:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Caterpillar:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

# AUC Score
print("AUC:")
print("Deere:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Caterpillar:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

In [None]:
df_d = df_all[["text","weekDiff"]].copy()
df_d['weekDiff'] = np.where(df_d.weekDiff > 0 ,'positive','negative')
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer = lm.tokenize,
                             stop_words=custom_stop_words,
                             ngram_range =(3,3),
                             binary=True)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_d["text"])
train_y = df_d["weekDiff"]
print(train_x.shape)

# Drop filings with no matching trading day so NaN differences are not labeled 'negative'
df_c = df_cat[["text",'weekDiff']].dropna().copy()
df_c['weekDiff'] = np.where(df_c.weekDiff > 0 ,'positive','negative')
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_c["text"])
test_y = df_c["weekDiff"]
print(test_x.shape)


param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
print(param_grid)
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=1000)
sparselr.fit(train_x, train_y)

# Accuracy
print("Week Difference")
print("Accuracy:")
print("Deere:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Caterpillar:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

# AUC Score
print("AUC:")
print("Deere:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Caterpillar:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

In [None]:
df_d = df_all[["text","monthDiff"]].copy()
df_d['monthDiff'] = np.where(df_d.monthDiff > 0 ,'positive','negative')
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer = lm.tokenize,
                             stop_words=custom_stop_words,
                             ngram_range =(2,2),
                             binary=True)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_d["text"])
train_y = df_d["monthDiff"]
print(train_x.shape)

# Drop filings with no matching trading day so NaN differences are not labeled 'negative'
df_c = df_cat[["text",'monthDiff']].dropna().copy()
df_c['monthDiff'] = np.where(df_c.monthDiff > 0 ,'positive','negative')
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_c["text"])
test_y = df_c["monthDiff"]
print(test_x.shape)


param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
print(param_grid)
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=1000)
sparselr.fit(train_x, train_y)

# Accuracy
print("Month Difference")
print("Accuracy:")
print("Deere:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Caterpillar:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

# AUC Score
print("AUC:")
print("Deere:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Caterpillar:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

### Pull in Amazon Data and Apply Lasso and Logistic Regression Models
- Testing whether models developed for Deere & Company can be applied to companies outside the industry.
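
One quick diagnostic before reading the cross-industry scores is to check how much of the target company's vocabulary the source vocabulary covers, since `vectorizer.transform` silently drops terms it never saw during fitting. The sketch below uses toy sentences. The helper `vocabulary_coverage` is a hypothetical name and the default tokenizer is an assumption; this check is not part of the original analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer


def vocabulary_coverage(source_docs, target_docs):
    """Hypothetical helper: share of distinct target terms also seen in the source."""
    source_terms = set(CountVectorizer().fit(source_docs).vocabulary_)
    target_terms = set(CountVectorizer().fit(target_docs).vocabulary_)
    return len(source_terms & target_terms) / len(target_terms)


# Toy stand-ins for an equipment manufacturer and an online retailer
source_docs = ["tractor sales increased", "equipment sales decreased"]
target_docs = ["cloud sales increased", "retail revenue decreased"]

coverage = vocabulary_coverage(source_docs, target_docs)
print(coverage)
```

A low coverage value would suggest that many of the target's terms carry no weight in the source model, which is worth keeping in mind when comparing the Amazon results with the Caterpillar results.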

In [None]:
# Request Data from Edgar Search. Pulled by Company CIK number
data = requests.get('https://data.sec.gov/submissions/CIK0001018724.json',headers={'User-Agent':'University of Iowa abromeland@uiowa.edu'})
data = json.loads(data.content)['filings']['recent']
df = pd.DataFrame.from_dict(data)
df.sort_values(by='filingDate',ascending=False,inplace=True)
df.reset_index(inplace=True,drop=True)
df.head()

In [None]:
# Pull data by accessionNumber and FileName to be added to the dataframe
for i in range(len(df)):
    accessionNumber = df.loc[i,"accessionNumber"]
    filename = df.loc[i,'primaryDocument']
    URL = f'https://www.sec.gov/Archives/edgar/data/1018724/{re.sub("-","",accessionNumber)}/{filename}'
    page = requests.get(URL,headers={'User-Agent':'University of Iowa abromeland@uiowa.edu'})
    root = html.fromstring(page.content)
    text = [s.text_content() for s in root.xpath('/html/body')]
    df.loc[i,'text'] = ' '.join(text)
    time.sleep(.5)
df

In [None]:
# Clean Data and print out example data
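# Remove zero-width spaces, newlines, and non-breaking spaces ('|' is literal inside a character class, so it is not needed)
df['text'] = df['text'].str.replace(r"[\u200b\n\xa0]",'',regex=True)
# Replace dollar amounts with a single placeholder token (raw string avoids invalid escape warnings)
df['text'] = df['text'].str.replace(r"\$\d+[\d,\.]*",'moneyToken',regex=True)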
df['text'] = df['text'].str.lower()
print(df.filingDate.isnull().sum())
print(df.filingDate.dtype)
df["filingDate"] = pd.to_datetime(df["filingDate"],format="%Y-%m-%d")  # EDGAR filingDate is a date-only string
print(df.filingDate.dtype)

In [None]:
df.sort_values(by='filingDate',ascending=False,inplace=True)
df

In [None]:
df_price = yf.download("AMZN",start='2013-10-01',end='2022-07-28',progress=False)
print(type(df_price.index))
df_price.sort_index(ascending=False,inplace=True)
df_price['dayDiff'] = -1*df_price['Adj Close'].diff()
df_price['weekDiff'] = -1*df_price['Adj Close'].diff(periods=5)
df_price['monthDiff'] = -1*df_price['Adj Close'].diff(periods=22)
df_price

In [None]:
# WARNING - THIS CELL WILL TAKE A LONG TIME TO RUN

# AFINN Sentiment Analysis Scores
afinn = Afinn(emoticons=True)
df["AFINN"]=[afinn.score(s) for s in df.text]

# TextBlob Sentiment Analysis Scores
df["TextBlob"]=[TextBlob(s).sentiment.polarity for s in df.text]

# VADER Sentiment Analysis Scores
analyzer=SentimentIntensityAnalyzer()
df["VADER"]=[analyzer.polarity_scores(s)['compound'] for s in df.text]

# Loughran and McDonald Sentiment Scores
lm = ps.LM()
df['LMTitle'] = 0.0  # float column, since polarity scores are fractional
for i in range(len(df['text'])):
    tokens = lm.tokenize(df['text'][i])
    score = lm.get_score(tokens)
    df.loc[i,"LMTitle"]=score["Polarity"]

df

# Scale AFINN score to compare it with other sentiment analyzers
df['AFINN_SCALE'] = MinMaxScaler().fit_transform(np.array(df['AFINN']).reshape(-1,1))


# inline display of plots
%matplotlib inline
plt.figure(figsize=(15, 6))

plt.plot(df.filingDate, df.AFINN_SCALE.rolling(window=50).mean(), "-r", label="AFINN Scaled")
plt.plot(df.filingDate, df.TextBlob.rolling(window=50).mean(), "-b", label="TextBlob")
plt.plot(df.filingDate, df.VADER.rolling(window=50).mean(), "-g", label="VADER")
plt.plot(df.filingDate, df.LMTitle.rolling(window=50).mean(), "-k", label="LM")


plt.legend(loc="upper left")
plt.title("Comparison of Sentiment Scores")
plt.xlabel("Date")
plt.ylabel("Sentiment Score (Moving Average with Window Size 50)")
plt.grid(axis='both')
plt.show()

In [None]:
df_cat = pd.merge(df,df_price,how='left',left_on = 'filingDate',right_index = True, copy=True)
df_cat

In [None]:
# Graph all stock data with Sentiment Data - First Scale difference data to be graphed and compared with sentiment data.
df_cat['dayDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['dayDiff']).reshape(-1,1))
df_cat['weekDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['weekDiff']).reshape(-1,1))
df_cat['monthDiff_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['monthDiff']).reshape(-1,1))
df_cat['Adj_Close_SCALE'] = MinMaxScaler().fit_transform(np.array(df_cat['Adj Close']).reshape(-1,1))

# Graph Sentiment scores with Adjusted Closing Values
plt.figure(figsize=(15, 6))

plt.plot(df_cat.filingDate, df_cat.AFINN_SCALE.rolling(window=50).mean(), "-b", label="AFINN Scaled")
plt.plot(df_cat.filingDate, df_cat.TextBlob.rolling(window=50).mean(), "-g", label="TextBlob")
plt.plot(df_cat.filingDate, df_cat.VADER.rolling(window=50).mean(), "-r", label="VADER")
plt.plot(df_cat.filingDate, df_cat.LMTitle.rolling(window=50).mean(), "-k", label="LM")
plt.plot(df_cat.filingDate, df_cat.Adj_Close_SCALE.rolling(window=50).mean(), "-c", label="Adjusted Close")


plt.legend(loc="upper left")
plt.title("Comparison of Sentiment Scores")
plt.xlabel("Date")
plt.ylabel("Sentiment Score (Moving Average with Window Size 50)")
plt.grid(axis='both')
plt.show()

In [None]:
df_cat['characters'] = [len(s) for s in df_cat['text']]
print(f"Average character length: {df_cat['characters'].mean()}")
print(f"Minimum character length: {df_cat['characters'].min()}")
print(f"Maximum character length: {df_cat['characters'].max()}")

In [None]:
df_small = df_cat[df_cat['characters']<1000].copy()
print(df_small.shape)
df_small

In [None]:
df_cat = df_cat[df_cat['characters']>500].copy()
df_cat.sort_values(by='filingDate',ascending=False,inplace=True)
df_cat.reset_index(inplace=True,drop=True)
print(df_cat.shape)
df_cat

In [None]:
print(f"Average character length: {df_cat['characters'].mean()}")
print(f"Minimum character length: {df_cat['characters'].min()}")
print(f"Maximum character length: {df_cat['characters'].max()}")

In [None]:
# Retrain the best Deere model on the entire Deere dataset for use on the Amazon dataset (df_cat holds Amazon data in this section)
df = df_all[["text","Adj Close"]].copy()


lm = ps.LM()
vectorizer = TfidfVectorizer(tokenizer = lm.tokenize,stop_words=custom_stop_words)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df["text"])
train_y = df["Adj Close"]
print(train_x.shape)


scaler = MaxAbsScaler()
scaler.fit(train_x)
train_x=scaler.transform(train_x)

lasso = Lasso(alpha=bestalpha, #Regularization parameter.
              max_iter=5000
              )
lasso.fit(train_x,train_y)
print(np.count_nonzero(lasso.coef_))
#Check the percentage of non-zero slopes
print(np.count_nonzero(lasso.coef_)/len(lasso.coef_))


In [None]:
df = df_cat[['text','Adj Close']]
df = df[df['Adj Close'].notna()]
test_x = vectorizer.transform(df['text'])
test_y = df['Adj Close']
print(test_x.shape)
print(test_y.shape)
test_x = scaler.transform(test_x)
print("Root Mean Squared Error:")  # squared=False returns RMSE, not MSE
print("Deere:")
print(mean_squared_error(train_y,lasso.predict(train_x),squared=False))
print()
print("Amazon:")
print(mean_squared_error(test_y,lasso.predict(test_x),squared=False))
print()
print("R-Squared:")
print("Deere:")
print(r2_score(train_y,lasso.predict(train_x)))
print()
print("Amazon:")
print(r2_score(test_y,lasso.predict(test_x)))

In [None]:
df_d = df_all[["text","dayDiff"]].copy()
df_d['dayDiff'] = np.where(df_d.dayDiff > 0 ,'positive','negative')
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer = lm.tokenize,
                             stop_words=custom_stop_words,
                             ngram_range =(3,3),
                             binary=False)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_d["text"])
train_y = df_d["dayDiff"]
print(train_x.shape)

# Drop filings with no matching trading day so NaN differences are not labeled 'negative'
df_c = df_cat[["text",'dayDiff']].dropna().copy()
df_c['dayDiff'] = np.where(df_c.dayDiff > 0 ,'positive','negative')
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_c["text"])
test_y = df_c["dayDiff"]
print(test_x.shape)


param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
print(param_grid)
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=1000)
sparselr.fit(train_x, train_y)

# Accuracy
print("Day Difference")
print("Accuracy:")
print("Deere:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Amazon:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

# AUC Score
print("AUC:")
print("Deere:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Amazon:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

In [None]:
df_d = df_all[["text","weekDiff"]].copy()
df_d['weekDiff'] = np.where(df_d.weekDiff > 0 ,'positive','negative')
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer = lm.tokenize,
                             stop_words=custom_stop_words,
                             ngram_range =(3,3),
                             binary=True)
#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_d["text"])
train_y = df_d["weekDiff"]
print(train_x.shape)

# Drop filings with no matching trading day so NaN differences are not labeled 'negative'
df_c = df_cat[["text",'weekDiff']].dropna().copy()
df_c['weekDiff'] = np.where(df_c.weekDiff > 0 ,'positive','negative')
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_c["text"])
test_y = df_c["weekDiff"]
print(test_x.shape)


param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
print(param_grid)
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=1000)
sparselr.fit(train_x, train_y)

# Accuracy
print("Week Difference")
print("Accuracy:")
print("Deere:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Caterpillar:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

# AUC Score
print("AUC:")
print("Deere:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Caterpillar:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

In [None]:
df_d = df_all[["text","monthDiff"]].copy()
# Label each filing by the sign of the one-month return.
# The training and testing labels below must use the same strings.
df_d['monthDiff'] = np.where(df_d.monthDiff > 0, 'positive', 'negative')
lm = ps.LM()
vectorizer = CountVectorizer(tokenizer=lm.tokenize,
                             stop_words=custom_stop_words,
                             ngram_range=(2, 2),
                             binary=True)
# Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_d["text"])
train_y = df_d["monthDiff"]
print(train_x.shape)

df_c = df_cat[["text", "monthDiff"]].copy()
df_c['monthDiff'] = np.where(df_c.monthDiff > 0, 'positive', 'negative')
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_c["text"])
test_y = df_c["monthDiff"]
print(test_x.shape)


param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
print(param_grid)
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=1000)
sparselr.fit(train_x, train_y)

# Accuracy
print("Month Difference")
print("Accuracy:")
print("Deere:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Caterpillar:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

# AUC Score
print("AUC:")
print("Deere:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Caterpillar:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))
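
The accuracy values printed by the cells above are easier to interpret next to a naive baseline: if most filings are followed by a positive return, a model that always predicts the majority label already scores well. The sketch below computes that baseline. `majority_baseline_accuracy` and the example labels are hypothetical and for illustration only; they are not project data or results.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels, for illustration only (not taken from the filings).
example_labels = ["positive", "positive", "negative", "positive"]
print(majority_baseline_accuracy(example_labels))
```

In the notebook this could be called as `majority_baseline_accuracy(list(test_y))` and compared with the held-out accuracy for the same horizon.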