<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Cleaning:--Earning-Call-Transcripts" data-toc-modified-id="Data-Cleaning:--Earning-Call-Transcripts-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Cleaning:  Earning Call Transcripts</a></span><ul class="toc-item"><li><span><a href="#Retrieve-all-txt-files-from-directories" data-toc-modified-id="Retrieve-all-txt-files-from-directories-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Retrieve all txt files from directories</a></span></li><li><span><a href="#Extract-text-from-each-text-files-by-filename" data-toc-modified-id="Extract-text-from-each-text-files-by-filename-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Extract text from each text files by filename</a></span></li><li><span><a href="#Create-Date-Frame" data-toc-modified-id="Create-Date-Frame-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Create Date Frame</a></span></li><li><span><a href="#Split-transcripts-into-management-discussion-and-Q&amp;A-session" data-toc-modified-id="Split-transcripts-into-management-discussion-and-Q&amp;A-session-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Split transcripts into management discussion and Q&amp;A session</a></span></li><li><span><a href="#Extract-call-release-data,-revenue-status,-and-ESP-status" data-toc-modified-id="Extract-call-release-data,-revenue-status,-and-ESP-status-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Extract call release data, revenue status, and ESP status</a></span></li></ul></li><li><span><a href="#Get-stock-price-and-volume-from-Yahoo-Finance" data-toc-modified-id="Get-stock-price-and-volume-from-Yahoo-Finance-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Get stock price and volume from Yahoo Finance</a></span></li><li><span><a href="#Sentiment-Score-and-Text-Stat-for-MD-and-QA" data-toc-modified-id="Sentiment-Score-and-Text-Stat-for-MD-and-QA-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Sentiment Score and Text Stat for MD 
and QA</a></span><ul class="toc-item"><li><span><a href="#Text-stat-in-transcripts" data-toc-modified-id="Text-stat-in-transcripts-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Text stat in transcripts</a></span></li><li><span><a href="#Text-stat-in-MD-&amp;-QA" data-toc-modified-id="Text-stat-in-MD-&amp;-QA-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Text stat in MD &amp; QA</a></span></li></ul></li></ul></div>

In [1]:
import pandas as pd
import datetime
import pandas_datareader as pdr
from textblob import TextBlob

import re
from textstat.textstat import textstat 

import matplotlib.pyplot as plt
%matplotlib inline

Sources: 
https://github.com/shivam5992/textstat


https://www.mckinleycapital.com/getting-sentimental-conference-call-sentiment-stock-returns/

### Data Cleaning:  Earning Call Transcripts

#### Retrieve all txt files from directories

In [2]:
import glob

# glob.glob already returns a list of matching paths
filenames = glob.glob('../data/*.txt')
filenames[0:5]

['../data/Ford Motor (F) Q3 2018 Results - Earnings Call Transcript.txt',
 '../data/Ford Motor (F) Mark Fields on Q1 2016 Results - Earnings Call Transcript.txt',
 '../data/Tesla Motors (TSLA) Elon Reeve Musk on Q2 2015 Results - Earnings Call Transcript.txt',
 '../data/Tesla (TSLA) Q3 2017 Results - Earnings Call Transcript.txt',
 '../data/General Motors (GM) Q3 2017 Results - Earnings Call Transcript.txt']

#### Extract text from each text files by filename

In [3]:
transcripts = []
for filename in filenames: 
    with open(filename, 'r') as f:
        cont = f.read()
        transcripts.append(cont)
        
print('total files loaded:',len(transcripts))


total files loaded: 61


#### Create Date Frame

In [4]:
df = pd.DataFrame(transcripts, columns=[ 'transcripts'])
df.head()

                                         transcripts
0  Ford Motor (F) Q3 2018 Results - Earnings Call...
1  Ford Motor (F) Mark Fields on Q1 2016 Results ...
2  Tesla Motors (TSLA) Elon Reeve Musk on Q2 2015...
3  Tesla (TSLA) Q3 2017 Results - Earnings Call T...
4  General Motors (GM) Q3 2017 Results - Earnings...


#### Split transcripts into management discussion and Q&A session

In [5]:
marker = 'Question-and-Answer Session'
text_MD = []
text_QA = []

for transcript in df['transcripts']:
    # Split on the first marker only; the Q&A part is '' when the marker is absent
    md, _, qa = transcript.partition(marker)
    text_MD.append(md)
    text_QA.append(qa)

df['tx_MD'] = text_MD
df['tx_QA'] = text_QA
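
The split hinges on the literal marker `'Question-and-Answer Session'` appearing in each transcript. Below is a small sketch of how `str.partition` behaves with and without that marker. The transcript strings are made up, and `split_transcript` is a hypothetical helper name, not part of the notebook.

```python
MARKER = 'Question-and-Answer Session'

def split_transcript(transcript, marker=MARKER):
    """Return (management_discussion, qa); qa is '' if the marker is missing."""
    md, _, qa = transcript.partition(marker)
    return md, qa

# Made-up transcript fragments, for illustration only
with_marker = "Operator: Welcome.\nQuestion-and-Answer Session\nAnalyst: A question."
without_marker = "Operator: Welcome. No questions were taken."

md, qa = split_transcript(with_marker)
md_only, qa_empty = split_transcript(without_marker)
print(repr(md), repr(qa))
print(repr(md_only), repr(qa_empty))
```

An empty `qa` string is easy to filter on afterwards, whereas `split(marker)[1]` raises an `IndexError` for a transcript without the marker.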

#### Extract call release data, revenue status, and ESP status

In [6]:
esp_date = []
esp_status = []
esp_status_all = []
esp = []
company = []

rev_status_all =[]
rev_status = []
rev_total=[]
rev_pct= []

# Compile the EPS and revenue patterns once; each one feeds several fields
eps_pattern = re.compile(r'\n(EPS.*?\$([\-\.\d]+) (misses|beats).+?([\-\.\d]+).+?)\n')
rev_pattern = re.compile(r'(Revenue of.* (misses|beats) .+)')

for text in df['tx_MD']:
    esp_date.append(re.findall(r'\d{2}-\d{2}-\d{2}', text)[0])

    # First EPS line: (full line, EPS value, misses/beats, following number)
    eps_match = eps_pattern.findall(text)[0]
    esp_status_all.append(eps_match[0])
    esp_status.append(eps_match[2])
    esp.append(eps_match[1])

    # The first percentage in the text is taken as the YoY revenue growth
    rev_pct.append(re.findall(r'([\.\d]+)(?=%)', text)[0])
    rev_total.append(re.findall(r'Revenue of.*?\$([\-\.\d]+)', text)[0])

    # First revenue line: (full line, misses/beats)
    rev_match = rev_pattern.findall(text)[0]
    rev_status_all.append(rev_match[0])
    rev_status.append(rev_match[1])

    # Ticker symbol: the first parenthesised token, e.g. the F in "Ford Motor (F)"
    company.append(re.findall(r'\((.*?)\)', text)[0])

# Create columns related to EPS (stored under the esp_* column names)
df['esp_date'] = esp_date
df['esp_date']= pd.to_datetime(df['esp_date'])

df['esp_status_all'] = esp_status_all
df['esp_status'] = esp_status
df['esp'] = esp
df['company'] = company

# Create columns related to YoY revenue
df['rev_yoy_growth_percent'] = rev_pct
df['rev_yoy_growth_percent'] = df['rev_yoy_growth_percent'].astype('float32')

df['rev_status'] = rev_status
df['rev_total_b'] = rev_total
df['rev_total_b'] = df['rev_total_b'].astype('float32')

df['rev_status_all'] = rev_status_all

df = df.sort_values(by='esp_date', ascending=False)
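
To make the extraction patterns easier to follow, here is a sketch that runs the same regular expressions on a single made-up header. The company name, ticker `EXM`, date, and figures are all hypothetical; they are written only to match the shape the patterns look for, not taken from any real transcript.

```python
import re

# Hypothetical transcript header, for illustration only
sample = (
    "Example Motors (EXM) Q3 2018 Results - Earnings Call Transcript\n"
    "10-24-18\n"
    "EPS of $0.29 beats by $0.01 | Revenue of $34.7B (3.3% Y/Y) beats by $1.2B\n"
    "Executives\n"
)

# The EPS pattern has four groups: full line, EPS value, misses/beats, next number
eps_line, eps_value, eps_status, _ = re.findall(
    r'\n(EPS.*?\$([\-\.\d]+) (misses|beats).+?([\-\.\d]+).+?)\n', sample)[0]

call_date = re.findall(r'\d{2}-\d{2}-\d{2}', sample)[0]
rev_pct = re.findall(r'([\.\d]+)(?=%)', sample)[0]           # first number followed by %
rev_total = re.findall(r'Revenue of.*?\$([\-\.\d]+)', sample)[0]
rev_status = re.findall(r'(Revenue of.* (misses|beats) .+)', sample)[0][1]
company = re.findall(r'\((.*?)\)', sample)[0]                # first parenthesised token

print(call_date, company, eps_value, eps_status, rev_total, rev_pct, rev_status)
```

Every lookup ends in `[0]`, so a transcript whose header deviates from this shape raises an `IndexError` rather than producing a missing value.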

### Get stock price and volume from Yahoo Finance 

In [7]:
def get(tickers, startdate, enddate):
    """Download Yahoo Finance data per ticker and stack it under a (Ticker, Date) index."""
    def data(ticker):
        return pdr.get_data_yahoo(ticker, start=startdate, end=enddate)

    datas = [data(ticker) for ticker in tickers]
    return pd.concat(datas, keys=tickers, names=['Ticker', 'Date'])

In [8]:
tickers = ['TSLA','GM','F','FCAU']
df_df = get(tickers, datetime.datetime(2014, 1, 2), datetime.datetime(2019, 5, 13))

In [9]:
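# Compute changes within each ticker so the first row of one ticker
# is not compared with the last row of the previous ticker
df_df['pct_price'] = df_df.groupby(level='Ticker')['Close'].pct_change()
df_df['pct_volume'] = df_df.groupby(level='Ticker')['Volume'].pct_change()
# Sign convention: (Open - Close) / Close is positive when the stock closed below its open
df_df['pct_price_same_day'] = (df_df['Open'] - df_df['Close']) / df_df['Close']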
#df_df ['weekly_return'] = (df_df ['Open'] - df_df['Close'].shift(-5)) / (df_df['Close'].shift(-5))

In [11]:
df_df.reset_index(inplace=True)

In [12]:
df_df = df_df[['Ticker','Date','Open','Close','Volume','pct_price','pct_price_same_day','pct_volume']]

In [13]:
df = df.merge(df_df, left_on=['esp_date','company'], right_on=['Date','Ticker'], how='left')
df.drop_duplicates(subset='transcripts', keep='first', inplace=True)

In [14]:
df['esp_target'] = df['esp_status'].map({'misses':0, 'beats':1})
df['rev_target'] = df['rev_status'].map({'misses':0, 'beats':1})
df["esp"] = df.esp.astype(float)

In [21]:
df['pct_price_target'] = (df['pct_price'] >0).astype(int)
df['pct_price_target_same_day'] = (df['pct_price_same_day'] >0).astype(int)

#df['weekly_return'] = (df['weekly_return'] >0).astype(int)

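# pct_change() compares each row with the row above it in the current order
# (sorted by esp_date, newest first, with all companies mixed together)
df['volatility_percentage_return_esp'] = df.esp.pct_change()
df['volatility_percentage_rev_yoy_growth_percent'] = df.rev_yoy_growth_percent.pct_change()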

df.fillna(0, inplace=True)

In [26]:
df_words_target = df[['esp_date','company','transcripts', 
                      'tx_MD','tx_QA','pct_price_target', 
                      'pct_volume',
                       'volatility_percentage_return_esp',
                      'volatility_percentage_rev_yoy_growth_percent',
                      'pct_price_target_same_day',
                      'esp_target',
                      'rev_target']]
                      
#df_words_target.to_csv('../data/df_transcripts.csv', index=False)

In [27]:
import pickle

# Use a context manager so the file handle is closed after writing
with open("../data/df_words_target.pkl", "wb") as f:
    pickle.dump(df_words_target, f)

### Sentiment Score and Text Stat for MD and QA

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

# Score each text once, then unpack the four VADER scores into separate columns
for col in ['tx_MD', 'tx_QA']:
    scores = df[col].map(analyser.polarity_scores)
    for key in ['neg', 'pos', 'neu', 'compound']:
        df[col + '_' + key] = scores.map(lambda s: s[key])

In [None]:
def detect_polarity(text):
    return TextBlob(text).sentiment.polarity

def detect_subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

In [None]:
# TextBlob polarity ranges from -1 (negative) to 1 (positive).
# Subjectivity ranges from 0 (objective, factual) to 1 (subjective, opinion).
df['polarity'] = df['transcripts'].map(lambda x: detect_polarity(x))
df['subjectivity'] = df['transcripts'].map(lambda x: detect_subjectivity(x))

df['tx_MD_polarity'] = df['tx_MD'].map(lambda x: detect_polarity(x))
df['tx_QA_polarity'] = df['tx_QA'].map(lambda x: detect_polarity(x))

df['tx_MD_subjectivity'] = df['tx_MD'].map(lambda x: detect_subjectivity(x))
df['tx_QA_subjectivity'] = df['tx_QA'].map(lambda x: detect_subjectivity(x))

#### Text stat in transcripts

In [None]:
col = 'transcripts'

# Compute each readability statistic once per transcript
df[col+'_num_syl'] = [textstat.syllable_count(x) for x in df[col]]
df[col+'_avg_sentence_length'] = [textstat.avg_sentence_length(x) for x in df[col]]
df[col+'textstat.lexicon_count'] = [textstat.lexicon_count(x) for x in df[col]]
df[col+'flesch_reading_ease'] = [textstat.flesch_reading_ease(x) for x in df[col]]
df[col+'flesch_kincaid_grade'] = [textstat.flesch_kincaid_grade(x) for x in df[col]]
df[col+'difficult_words'] = [textstat.difficult_words(x) for x in df[col]]
df[col+'linsear_write_formula'] = [textstat.linsear_write_formula(x) for x in df[col]]
df[col+'gunning_fog'] = [textstat.gunning_fog(x) for x in df[col]]
df[col+'automated_readability_index'] = [textstat.automated_readability_index(x) for x in df[col]]
df[col+'coleman_liau_index'] = [textstat.coleman_liau_index(x) for x in df[col]]
df[col+'dale_chall_readability_score'] = [textstat.dale_chall_readability_score(x) for x in df[col]]

#### Text stat in MD & QA

In [None]:
# Same statistics for the management discussion (MD) and Q&A sections,
# computed once per text and per section
for col in ['tx_MD', 'tx_QA']:
    df[col+'_num_syl'] = [textstat.syllable_count(x) for x in df[col]]
    df[col+'_avg_sentence_length'] = [textstat.avg_sentence_length(x) for x in df[col]]
    df[col+'textstat.lexicon_count'] = [textstat.lexicon_count(x) for x in df[col]]
    df[col+'flesch_reading_ease'] = [textstat.flesch_reading_ease(x) for x in df[col]]
    df[col+'flesch_kincaid_grade'] = [textstat.flesch_kincaid_grade(x) for x in df[col]]
    df[col+'difficult_words'] = [textstat.difficult_words(x) for x in df[col]]
    df[col+'linsear_write_formula'] = [textstat.linsear_write_formula(x) for x in df[col]]
    df[col+'gunning_fog'] = [textstat.gunning_fog(x) for x in df[col]]
    df[col+'automated_readability_index'] = [textstat.automated_readability_index(x) for x in df[col]]
    df[col+'coleman_liau_index'] = [textstat.coleman_liau_index(x) for x in df[col]]
    df[col+'dale_chall_readability_score'] = [textstat.dale_chall_readability_score(x) for x in df[col]]
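
For orientation, two of the readability scores above have simple closed forms. The sketch below applies the standard published Flesch reading ease and Flesch-Kincaid grade formulas to raw counts. The counts are made up, and `textstat` may count sentences and syllables differently, so its values need not match a hand calculation exactly.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch reading ease from raw counts (higher means easier to read)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts (approximate US school grade)."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

# Hypothetical passage: 100 words, 5 sentences, 150 syllables
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(round(ease, 3), round(grade, 2))
```

Both scores depend only on average sentence length and average syllables per word, which is why the columns built from them tend to be strongly correlated.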

In [None]:
df_final = df
df_final

In [None]:
import pickle

# Use a context manager so the file handle is closed after writing
with open("../data/df_final.pkl", "wb") as f:
    pickle.dump(df_final, f)

Don’t try to engineer features. The NLP and ML community tried that for 50 years and success was pretty mediocre. Instead, train an LSTM or a dilated CNN. In the simplest case, use a binary cross-entropy loss with ground truth labels ‘stock went up’ and ‘stock went down’ (you’ll have to collect those yourself by checking the price before and after earnings calls). Then use back-propagation through time to train your model. Real models are of course much more complicated, but it’s a start and will likely work better than manual engineering.

If you succeed, it’s still pretty unlikely that you can beat institutions on speed. They analyze these reports in about 150 ms, just to give you a benchmark.
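
The training signal suggested above, binary cross-entropy against "stock went up" / "stock went down" labels, can be illustrated without any model. This is only a sketch: the prices and predicted probabilities are made up, and `make_label` and `binary_cross_entropy` are hypothetical helper names.

```python
import math

def make_label(price_before, price_after):
    """1 if the price rose across the call, else 0 (an unchanged price counts as 0)."""
    return 1 if price_after > price_before else 0

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical (price before call, price after call) pairs
prices = [(10.0, 10.5), (20.0, 19.0), (5.0, 5.0)]
labels = [make_label(before, after) for before, after in prices]

# Hypothetical model outputs: predicted probability that the stock went up
probs = [0.8, 0.3, 0.5]
loss = binary_cross_entropy(labels, probs)
print(labels, loss)
```

A model such as an LSTM would produce the `probs` values, and training would adjust its weights to push this loss down.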

Since you’re doing simple text classification, is there any reason why linear models wouldn’t perform well enough for a task like this? Is there any evaluation suggesting a neural network does significantly better? Also, do we really need to model it as a time series (LSTMs/dilated CNNs)?

It’s not a time series, it’s sequential prediction. There’s absolutely no chance of obtaining state-of-the-art performance with a linear model, and the entire NLP literature supports this view. Many successful approaches use bidirectional LSTMs modeling n-grams nowadays. If the task is topic modeling, check out David Blei’s papers on “Latent Dirichlet Allocation” and “Dynamic Topic Models”. For word embeddings, “Distributed Representations of Words and Phrases and their Compositionality” is a must-read.
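
For reference, the kind of linear baseline the question refers to can be sketched in a few lines: bag-of-words counts fed to a logistic regression trained by stochastic gradient descent. The four toy sentences and their labels are made up, all function names are hypothetical, and the example says nothing about how a linear model compares with an LSTM on real transcripts.

```python
import math
from collections import Counter

def bag_of_words(text, vocab):
    """Word-count vector for `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by per-sample gradient descent on binary cross-entropy."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            z = bias + sum(w * f for w, f in zip(weights, features))
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - label  # gradient of the loss with respect to z
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias

def predict(features, weights, bias):
    """Predict 1 (up) when the linear score is positive, else 0 (down)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if z > 0 else 0

# Made-up snippets with made-up up/down labels
texts = ["record revenue strong demand", "weak demand lower margins",
         "strong margins record quarter", "lower revenue weak quarter"]
labels = [1, 0, 1, 0]

vocab = sorted({word for text in texts for word in text.split()})
X = [bag_of_words(text, vocab) for text in texts]
weights, bias = train_logistic_regression(X, labels)
preds = [predict(x, weights, bias) for x in X]
print(preds)
```

Because word order is discarded, this model cannot tell "beats expectations, not misses" from "misses expectations, not beats", which is the limitation the sequential models in the reply are meant to address.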

https://www.reddit.com/r/investing/comments/a1kk0p/when_applying_nlp_to_earnings_transcripts_what/