Project plan:

1. Figure out the data source - check the API and see if it can actually work
2. If the API does not work, make sure you can access the downloadable data from huggingface - check how to do that
3. Check out the data and do data pre-processing
4. Figure out how to use ModernBERT and set that as your first model
5. Research other earnings models on huggingface and see/test what's out there
6. Try tuning some of those models to see if you can make improvements

# Steps 1 and 2: Checking out the API

In [1]:
import wrds
import pandas as pd
import numpy as np
from tqdm import tqdm
import os

In [2]:
# Establishing a connection to WRDS
db = wrds.Connection()

Enter your WRDS username [danielwang]: dwang89
Enter your password: ········


WRDS recommends setting up a .pgpass file.


Create .pgpass file now [y/n]?:  y


Created .pgpass file successfully.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done


## Testing out initial retrieval

In [None]:
# Retrieve 2015 - 2017 Earnings Conference Call Transcripts with Full-text components
sql_query = '''
            SELECT a.*, b.*, c.componenttext
            FROM (
                  SELECT * 
                  FROM ciq.wrds_transcript_detail
                  WHERE companyid IN (112350, 21835, 24937)
                    AND date_part('year', mostimportantdateutc) BETWEEN 2015 AND 2017
                 ) AS a
            JOIN ciq.wrds_transcript_person AS b
              ON a.transcriptid = b.transcriptid
            JOIN ciq.ciqtranscriptcomponent AS c
              ON b.transcriptcomponentid = c.transcriptcomponentid
            ORDER BY a.transcriptid, b.componentorder;
            '''

data = db.raw_sql(sql_query)

In [None]:
data.head()

In [None]:
data['componenttext'].iloc[2]

## Pulling transcripts from S&P500 in 2015 (for testing)
Testing how to retrieve transcripts of S&P 500 companies using CRSP, CCM, and CIQ Transcripts

In [3]:
# S&P 500 Constituents in the year 2015 from CRSP
# Merge with GVKEY using CCM Linktable
sql_query = '''
            SELECT a.*, b.gvkey, b.liid, b.linkdt, b.linkenddt
            FROM (
                SELECT * 
                FROM crsp.dsp500list
                WHERE start <= make_date(2015, 1, 1)
                  AND ending >= make_date(2015, 12, 31)
            ) AS a
            LEFT JOIN (
                SELECT * 
                FROM crsp.ccmxpf_lnkhist
                WHERE linkdt <= make_date(2015, 1, 1)
                  AND (linkenddt >= make_date(2015, 12, 31) OR linkenddt IS NULL)
            ) AS b
            ON a.permno = b.lpermno
            AND b.linktype IN ('LU', 'LC')
            AND b.linkprim IN ('P', 'C');
            '''

snp500_crsp_gvkey = db.raw_sql(sql_query)
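The date conditions in the query above keep only firms whose index membership covers all of 2015 (in the index on or before January 1 and still in it on December 31), so firms added or removed mid-year are excluded. Here is a minimal pandas sketch of that interval logic. The `permno` values and dates are made up for illustration, and the "any overlap" variant is an alternative I am showing for contrast, not something the query does.

```python
import pandas as pd

# Toy membership table; permno values and dates are hypothetical.
members = pd.DataFrame({
    "permno": [10001, 10002, 10003],
    "start": pd.to_datetime(["2010-03-01", "2015-06-15", "2001-01-02"]),
    "ending": pd.to_datetime(["2020-12-31", "2018-12-31", "2015-09-30"]),
})

year_start = pd.Timestamp(2015, 1, 1)
year_end = pd.Timestamp(2015, 12, 31)

# Same condition as the SQL: in the index for the entire calendar year.
full_year = members[(members["start"] <= year_start) & (members["ending"] >= year_end)]

# Looser alternative: in the index at any point during the year.
any_overlap = members[(members["start"] <= year_end) & (members["ending"] >= year_start)]

full_year_permnos = full_year["permno"].tolist()
any_overlap_permnos = any_overlap["permno"].tolist()
```

Which of the two conditions is appropriate depends on whether mid-year additions and deletions should count as constituents.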

Pull Transcripts data with full-text from 2015

Sample selection:
- Transcripts of earnings conference calls
  - Key Development Event Type ID `keydeveventtypeid` = 48
- The final copy of each transcript that is edited, proofed, or audited
  - Final copy: `transcriptpresentationtypeid` = 5.0

Merge GVKEY with Transcripts data

In [4]:
sql_query = '''
            SELECT a.*,
                   b.symbolvalue AS gvkey,
                   c.*, 
                   d.componenttext
            FROM (
                SELECT *
                FROM ciq.wrds_transcript_detail 
                WHERE keydeveventtypeid = 48 
                  AND transcriptpresentationtypeid = 5 
                  AND date_part('year', mostimportantdateutc) = 2015
            ) AS a
            LEFT JOIN (
                SELECT *
                FROM ciq.wrds_ciqsymbol 
                WHERE symboltypecat = 'gvkey'
            ) AS b
              ON a.companyid = b.companyid
            LEFT JOIN ciq.wrds_transcript_person AS c 
              ON a.transcriptid = c.transcriptid
            LEFT JOIN ciq.ciqtranscriptcomponent AS d 
              ON c.transcriptid = d.transcriptid 
             AND c.transcriptcomponentid = d.transcriptcomponentid
            ORDER BY a.transcriptid, c.transcriptcomponentid, a.companyid;
            '''

tr_detail_gvkey = db.raw_sql(sql_query)

Merge transcripts and permno via gvkey

In [5]:
# Obtain transcripts data for GVKEYs in the S&P 500 Constituent list
snp500_transcripts = tr_detail_gvkey[tr_detail_gvkey.gvkey.isin(snp500_crsp_gvkey.gvkey.tolist())]

# Remove observations with missing GVKEY
snp500_transcripts = snp500_transcripts[pd.notna(snp500_transcripts.gvkey)]
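Because both queries use LEFT JOINs, the gvkey can be missing on either side, and a missing key should never count as a match. Here is a small sketch of the same filter on toy data (all values are made up) that removes missing keys from both sides before matching:

```python
import pandas as pd

# Toy stand-ins for the two query results; all values are hypothetical.
constituents = pd.DataFrame({"permno": [1, 2, 3], "gvkey": ["001", "002", None]})
transcripts = pd.DataFrame({
    "transcriptid": [10, 11, 12, 13],
    "gvkey": ["001", "003", None, "002"],
})

# Drop missing keys from the constituent list so they can never be matched.
valid_keys = constituents["gvkey"].dropna().unique()

# Keep transcript rows with a non-missing gvkey that appears in the constituent list.
matched = transcripts[transcripts["gvkey"].notna() & transcripts["gvkey"].isin(valid_keys)]
matched = matched.reset_index(drop=True)
```

Filtering the key list first also keeps the `isin` lookup small when the constituent table has repeated gvkeys.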

In [6]:
# Resetting the index after filtering
snp500_transcripts = snp500_transcripts.reset_index(drop=True)

In [7]:
snp500_transcripts

Unnamed: 0,companyid,keydevid,transcriptid,headline,mostimportantdateutc,mostimportanttimeutc,keydeveventtypeid,keydeveventtypename,companyname,transcriptcollectiontypeid,...,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,proid,companyofperson,speakertypeid,speakertypename,componenttextpreview,word_count,componenttext
0,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presentation Operator Message,1.0,Operator,,,1,Operator,"Good afternoon. My name is Karen, and I'll be ...",57,"Good afternoon. My name is Karen, and I'll be ..."
1,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,116497.0,Kipp Bedard,2523238.0,,2,Executives,"Thank you, Karen, and welcome to Micron Techno...",221,"Thank you, Karen, and welcome to Micron Techno..."
2,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,182536.0,Unknown Executive,,,2,Executives,"During the course of this meeting, we may make...",175,"During the course of this meeting, we may make..."
3,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,116497.0,Kipp Bedard,2523238.0,,2,Executives,I'll now turn the call over to Mr. Mark Durcan...,13,I'll now turn the call over to Mr. Mark Durcan...
4,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,130734.0,D. Durcan,289059.0,,2,Executives,"Thanks, Kipp. We had another strong quarter be...",1382,"Thanks, Kipp. We had another strong quarter be..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
353774,312648.0,279409950.0,3447492.0,"Vornado Realty Trust, Q2 2015 Earnings Call, A...",2015-08-04,14:00:00,48.0,Earnings Calls,Vornado Realty Trust,8,...,Question and Answer Operator Message,1.0,Operator,,,1,Operator,"At this time, I see we have no further questio...",26,"At this time, I see we have no further questio..."
353775,312648.0,279409950.0,3447492.0,"Vornado Realty Trust, Alexander's Inc., Q2 201...",2015-08-04,14:00:00,48.0,Earnings Calls,Vornado Realty Trust,8,...,Answer,308559.0,Steven Roth,248980.0,,2,Executives,"Thank you all very much. This was 1.5 hours, s...",52,"Thank you all very much. This was 1.5 hours, s..."
353776,312648.0,279409950.0,3447492.0,"Vornado Realty Trust, Q2 2015 Earnings Call, A...",2015-08-04,14:00:00,48.0,Earnings Calls,Vornado Realty Trust,8,...,Answer,308559.0,Steven Roth,248980.0,,2,Executives,"Thank you all very much. This was 1.5 hours, s...",52,"Thank you all very much. This was 1.5 hours, s..."
353777,312648.0,279409950.0,3447492.0,"Vornado Realty Trust, Alexander's Inc., Q2 201...",2015-08-04,14:00:00,48.0,Earnings Calls,Vornado Realty Trust,8,...,Question and Answer Operator Message,1.0,Operator,,,1,Operator,"Thank you, ladies and gentlemen. This conclude...",17,"Thank you, ladies and gentlemen. This conclude..."


I will also display the columns of the data and the values from the first entry to better understand the dataset.

In [8]:
snp500_transcripts.iloc[0]

companyid                                                                  289030.0
keydevid                                                                279483592.0
transcriptid                                                               743348.0
headline                          Micron Technology, Inc., Q1 2015 Earnings Call...
mostimportantdateutc                                                     2015-01-06
mostimportanttimeutc                                                       21:30:00
keydeveventtypeid                                                              48.0
keydeveventtypename                                                  Earnings Calls
companyname                                                 Micron Technology, Inc.
transcriptcollectiontypeid                                                        2
transcriptcollectiontypename                                            Edited Copy
transcriptpresentationtypeid                                                

I will also get the number of unique values for each column:

In [9]:
snp500_transcripts.nunique()

companyid                            467
keydevid                            1850
transcriptid                        4580
headline                            1866
mostimportantdateutc                 184
mostimportanttimeutc                  34
keydeveventtypeid                      1
keydeveventtypename                    1
companyname                          467
transcriptcollectiontypeid             3
transcriptcollectiontypename           3
transcriptpresentationtypeid           1
transcriptpresentationtypename         1
transcriptcreationdate_utc           278
transcriptcreationtime_utc          4366
audiolengthsec                      1248
gvkey                                466
transcriptid                        4580
transcriptcomponentid             349162
componentorder                       199
transcriptcomponenttypeid              5
transcriptcomponenttypename            5
transcriptpersonid                  4815
transcriptpersonname                4522
proid           

# Step 3: Data Preprocessing

## De-duplicating and cleaning the data (initial investigation)
There seem to be many duplicates in the data, and it also looks somewhat messy. I will start by analyzing what is potentially causing the duplicated rows.

### De-duping based on company, date, and text

In [8]:
# Start by creating a copy of the data
snp500_tscrpt_clean = snp500_transcripts.copy()

Firstly, I noticed that there were a few entries from Vornado Realty Trust that looked very similar:

In [9]:
VRT_test_df = snp500_tscrpt_clean.iloc[353773:353775]
VRT_test_df

Unnamed: 0,companyid,keydevid,transcriptid,headline,mostimportantdateutc,mostimportanttimeutc,keydeveventtypeid,keydeveventtypename,companyname,transcriptcollectiontypeid,...,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,proid,companyofperson,speakertypeid,speakertypename,componenttextpreview,word_count,componenttext
353773,312648.0,279409950.0,3447492.0,"Vornado Realty Trust, Alexander's Inc., Q2 201...",2015-08-04,14:00:00,48.0,Earnings Calls,Vornado Realty Trust,8,...,Question and Answer Operator Message,1.0,Operator,,,1,Operator,"At this time, I see we have no further questio...",26,"At this time, I see we have no further questio..."
353774,312648.0,279409950.0,3447492.0,"Vornado Realty Trust, Q2 2015 Earnings Call, A...",2015-08-04,14:00:00,48.0,Earnings Calls,Vornado Realty Trust,8,...,Question and Answer Operator Message,1.0,Operator,,,1,Operator,"At this time, I see we have no further questio...",26,"At this time, I see we have no further questio..."


In [96]:
# Deciphering which fields are different
VRT_test_df.iloc[0] == VRT_test_df.iloc[1]

companyid                          True
keydevid                           True
transcriptid                       True
headline                          False
mostimportantdateutc               True
mostimportanttimeutc               True
keydeveventtypeid                  True
keydeveventtypename                True
companyname                        True
transcriptcollectiontypeid         True
transcriptcollectiontypename       True
transcriptpresentationtypeid       True
transcriptpresentationtypename     True
transcriptcreationdate_utc         True
transcriptcreationtime_utc         True
audiolengthsec                     True
gvkey                              True
transcriptid                       True
transcriptcomponentid              True
componentorder                     True
transcriptcomponenttypeid          True
transcriptcomponenttypename        True
transcriptpersonid                 True
transcriptpersonname               True
proid                             False


In [97]:
# Comparing the headlines, proid, and companyofperson
display(snp500_tscrpt_clean.iloc[353773].headline)
display(snp500_tscrpt_clean.iloc[353774].headline)

display(snp500_tscrpt_clean.iloc[353773].proid)
display(snp500_tscrpt_clean.iloc[353774].proid)

display(snp500_tscrpt_clean.iloc[353773].companyofperson)
display(snp500_tscrpt_clean.iloc[353774].companyofperson)

"Vornado Realty Trust, Alexander's Inc., Q2 2015 Earnings Call, Aug 04, 2015"

'Vornado Realty Trust, Q2 2015 Earnings Call, Aug 04, 2015'

<NA>

<NA>

<NA>

<NA>

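The proid and companyofperson values are missing in both rows (missing values do not compare as equal, which is why proid shows False above), while the headlines differ only slightly: one of them also names Alexander's Inc. These rows can likely be de-duplicated, and I'll start by de-duplicating on the date, company, and text.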
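Since a plain `==` between two rows flags missing values as different, a comparison that treats "missing on both sides" as equal makes the real differences easier to spot. The helper below is a hypothetical sketch (the name `differing_columns` and the toy values are mine, not part of the notebook):

```python
import numpy as np
import pandas as pd

def differing_columns(row_a: pd.Series, row_b: pd.Series) -> list:
    """Return labels where two rows differ, treating missing == missing as equal."""
    both_missing = row_a.isna() & row_b.isna()
    differs = (row_a != row_b) & ~both_missing
    return differs[differs].index.tolist()

# Toy rows loosely modeled on the Vornado example; values are illustrative.
toy = pd.DataFrame({
    "companyid": [312648.0, 312648.0],
    "headline": ["Vornado Realty Trust, Alexander's Inc., Q2 2015 Earnings Call",
                 "Vornado Realty Trust, Q2 2015 Earnings Call"],
    "proid": [np.nan, np.nan],
})

changed = differing_columns(toy.iloc[0], toy.iloc[1])
```

On these toy rows only the headline is reported, which matches the reading above that the two rows are duplicates apart from the headline.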

In [98]:
# De-duplicating based on the companyid, date of call, and the text itself
snp500_tscrpt_clean = snp500_tscrpt_clean.drop_duplicates(subset=['companyid', 'mostimportantdateutc', 'componenttext'])
snp500_tscrpt_clean = snp500_tscrpt_clean.reset_index(drop=True)

snp500_tscrpt_clean

Unnamed: 0,companyid,keydevid,transcriptid,headline,mostimportantdateutc,mostimportanttimeutc,keydeveventtypeid,keydeveventtypename,companyname,transcriptcollectiontypeid,...,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,proid,companyofperson,speakertypeid,speakertypename,componenttextpreview,word_count,componenttext
0,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presentation Operator Message,1.0,Operator,,,1,Operator,"Good afternoon. My name is Karen, and I'll be ...",57,"Good afternoon. My name is Karen, and I'll be ..."
1,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,116497.0,Kipp Bedard,2523238.0,,2,Executives,"Thank you, Karen, and welcome to Micron Techno...",221,"Thank you, Karen, and welcome to Micron Techno..."
2,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,182536.0,Unknown Executive,,,2,Executives,"During the course of this meeting, we may make...",175,"During the course of this meeting, we may make..."
3,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,116497.0,Kipp Bedard,2523238.0,,2,Executives,I'll now turn the call over to Mr. Mark Durcan...,13,I'll now turn the call over to Mr. Mark Durcan...
4,289030.0,279483592.0,743348.0,"Micron Technology, Inc., Q1 2015 Earnings Call...",2015-01-06,21:30:00,48.0,Earnings Calls,"Micron Technology, Inc.",2,...,Presenter Speech,130734.0,D. Durcan,289059.0,,2,Executives,"Thanks, Kipp. We had another strong quarter be...",1382,"Thanks, Kipp. We had another strong quarter be..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206614,285827.0,305704552.0,2482069.0,"Lockheed Martin Corporation, Q2 2015 Earnings ...",2015-07-20,15:00:00,48.0,Earnings Calls,Lockheed Martin Corporation,8,...,Question,184176.0,Seth Seifman,30369356.0,,3,Analysts,"As you guys have mentioned, Sikorsky has won a...",111,"As you guys have mentioned, Sikorsky has won a..."
206615,285827.0,305704552.0,2482069.0,"Lockheed Martin Corporation, Q2 2015 Earnings ...",2015-07-20,15:00:00,48.0,Earnings Calls,Lockheed Martin Corporation,8,...,Answer,88953.0,Bruce L. Tanner,,,2,Executives,"Yes. So Seth, welcome to the call by the way. ...",354,"Yes. So Seth, welcome to the call by the way. ..."
206616,285827.0,305704552.0,2482069.0,"Lockheed Martin Corporation, Q2 2015 Earnings ...",2015-07-20,15:00:00,48.0,Earnings Calls,Lockheed Martin Corporation,8,...,Answer,88953.0,Bruce L. Tanner,,,2,Executives,"The only thing I would add, David, is this is ...",234,"The only thing I would add, David, is this is ..."
206617,285827.0,305704552.0,2482069.0,"Lockheed Martin Corporation, Q2 2015 Earnings ...",2015-07-20,15:00:00,48.0,Earnings Calls,Lockheed Martin Corporation,8,...,Question,53583.0,Myles Walton,,,3,Analysts,"Just a follow-up on the overall long, long ter...",61,"Just a follow-up on the overall long, long ter..."


### Removing unnecessary columns
A number of columns were only needed to filter or join the data in WRDS, and others are redundant or unnecessary for our analysis. We can drop those columns.

In [99]:
# Dropping first round of unnecessary columns
cols_to_drop_1 = ['companyid', 'headline', 'mostimportanttimeutc', 'keydeveventtypeid', 'keydeveventtypename',
                  'transcriptcollectiontypeid', 'transcriptcollectiontypename', 'transcriptpresentationtypeid',
                  'transcriptpresentationtypename', 'transcriptcreationdate_utc', 'transcriptcreationtime_utc', 'audiolengthsec',
                  'transcriptcomponenttypeid', 'proid', 'companyofperson', 'speakertypeid', 'componenttextpreview']

# Dropping columns
snp500_tscrpt_clean = snp500_tscrpt_clean.drop(columns=cols_to_drop_1)

# Dropping redundant transcriptid column
snp500_tscrpt_clean = snp500_tscrpt_clean.loc[:, ~snp500_tscrpt_clean.columns.duplicated(keep='first')] 

Reasons for dropping each column:
- companyid: CIQ ID of company; redundant with companyname
- headline: not necessary since we have the full text
- mostimportanttimeutc: not performing autoregressive modeling, nor operating on intraday time scales
- keydeveventtypeid: constant used for filtering in SQL
- keydeveventtypename: redundant with the above
- transcriptcollectiontypeid: only relevant to WRDS
- transcriptcollectiontypename: only relevant to WRDS
- transcriptpresentationtypeid: only relevant to WRDS
- transcriptpresentationtypename: only relevant to WRDS
- transcriptcreationdate_utc: not necessary, only a preprocessing timestamp
- transcriptcreationtime_utc: not necessary, only a preprocessing timestamp
- audiolengthsec: not necessary, as we have word count if needed
- transcriptid: duplicated transcriptid from above
- transcriptcomponenttypeid: numerical version of transcriptcomponenttypename
- proid: likely only relevant to WRDS
- companyofperson: redundant
- speakertypeid: numerical version of speakertypename
- componenttextpreview: simply a preview
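The duplicated `transcriptid` column presumably comes from selecting both `a.*` and `c.*` in the SQL, since both tables carry that field. Here is a small sketch of how the `columns.duplicated` mask removes the repeat, plus an optional check (my addition, not in the notebook) that the two copies really hold the same values before one is dropped. The toy values are illustrative.

```python
import pandas as pd

# Toy frame with a repeated column label, as produced by SELECT a.*, c.*.
df = pd.DataFrame(
    [[743348.0, "Micron Technology, Inc.", 743348.0, 0],
     [743348.0, "Micron Technology, Inc.", 743348.0, 1]],
    columns=["transcriptid", "companyname", "transcriptid", "componentorder"],
)

# columns.duplicated() flags every repeat of a label after its first occurrence.
dup_mask = df.columns.duplicated(keep="first")
deduped = df.loc[:, ~dup_mask]

# Optional sanity check: the two copies should hold the same values.
copies = df["transcriptid"]  # selecting a repeated label returns a DataFrame
copies_identical = bool((copies.iloc[:, 0] == copies.iloc[:, 1]).all())
```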

In [100]:
snp500_tscrpt_clean.head(2)

Unnamed: 0,keydevid,transcriptid,mostimportantdateutc,companyname,gvkey,transcriptcomponentid,componentorder,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,speakertypename,word_count,componenttext
0,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636492.0,0,Presentation Operator Message,1.0,Operator,Operator,57,"Good afternoon. My name is Karen, and I'll be ..."
1,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636493.0,1,Presenter Speech,116497.0,Kipp Bedard,Executives,221,"Thank you, Karen, and welcome to Micron Techno..."


Conversely, here are the reasons for (potentially) keeping the columns that I did:
- keydevid: unique identifier for full transcript
- transcriptid: unique identifier for recorded transcript
- mostimportantdateutc: date of earnings call
- companyname: name of company
- gvkey: Compustat company ID (helpful for linking to fundamentals)
- transcriptcomponentid: unique identifier for the speech unit (could be helpful for ordering)
- componentorder: necessary to reconstruct speech block
- transcriptcomponenttypename: name of block (e.g., Question, Presenter Speech, etc.)
- transcriptpersonid: id of speaker (in case of duplicated names)
- transcriptpersonname: name of speaker
- speakertypename: type of speaker (e.g., Operator, Executives, etc.)
- word_count: word count
- componenttext: actual text

Note that I added in keydevid afterward, as I didn't realize that keydevid, NOT transcriptid, was actually the true id for the full transcript. I explain how I came to that conclusion below.
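As a sketch of how the kept columns fit together, the component rows can be stitched back into one text per call by sorting on `componentorder` within each `keydevid`. The function name `build_transcripts`, the speaker prefix format, and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy component-level rows; ids and text are made up.
components = pd.DataFrame({
    "keydevid": [1, 1, 1, 2, 2],
    "componentorder": [1, 0, 2, 0, 1],
    "speakertypename": ["Executives", "Operator", "Analysts", "Operator", "Executives"],
    "componenttext": ["Thank you.", "Good afternoon.", "One question.", "Welcome.", "Thanks all."],
})

def build_transcripts(df: pd.DataFrame) -> pd.Series:
    """Join component text per call, in spoken order, one string per keydevid."""
    ordered = df.sort_values(["keydevid", "componentorder"])
    lines = ordered["speakertypename"] + ": " + ordered["componenttext"]
    return lines.groupby(ordered["keydevid"]).agg("\n".join)

full_text = build_transcripts(components)
```

This assumes duplicates have already been removed, so that each (`keydevid`, `componentorder`) pair appears once.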

### Determining what makes up a full transcript (transcriptid vs keydevid)
With 467 companies and one earnings call per quarter, we should have roughly 1,800-1,900 transcripts in total (467 × 4 = 1,868). Looking at the unique-value counts above, only keydevid (1,850) meets that criterion (headline does too, but we've removed it). However, transcriptid seems like the more natural identifier, so let's analyze which column actually identifies the full transcript.
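The unique counts above (4,580 transcriptids against 1,850 keydevids) suggest about 2.5 transcriptids per call on average. A quick way to see that relationship directly is to count transcriptids within each keydevid, sketched here on toy ids (the id values are made up):

```python
import pandas as pd

# Expected number of calls: one per company per quarter.
n_companies = 467
expected_calls = n_companies * 4

# Toy rows: one call (keydevid) recorded under two transcriptids.
toy = pd.DataFrame({
    "keydevid": [100, 100, 100, 200],
    "transcriptid": [1, 1, 2, 3],
})

# How many transcriptids map to each keydevid?
versions_per_call = toy.groupby("keydevid")["transcriptid"].nunique()
calls_with_multiple_versions = int((versions_per_call > 1).sum())
```

Run against the real frame, a count well above zero would indicate that transcriptid identifies versions of a call rather than the call itself.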

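In [102]:
# Number of components under the first transcriptid (run after In [101] below, which defines the list)
len(indiv_trscrpt_index_list)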

44

In [101]:
# Finding out the indices for the first transcriptid
trscrpt_id_list = list(snp500_tscrpt_clean['transcriptid'].unique())
trscrpt_num = 0
indiv_trscrpt_index_list = list(snp500_tscrpt_clean[snp500_tscrpt_clean['transcriptid'] == trscrpt_id_list[trscrpt_num]].index)

# Printing out the speakertypename, transcriptpersonname, and transcriptcomponenttypename, along with the text
for i in indiv_trscrpt_index_list:
    # The list holds index labels, so look rows up with .loc rather than .iloc
    trscrpt_comp_row = snp500_tscrpt_clean.loc[i]
    print(trscrpt_comp_row.speakertypename + ': ' + trscrpt_comp_row.transcriptpersonname + ', Type: ' +
          trscrpt_comp_row.transcriptcomponenttypename)
    print(trscrpt_comp_row.componenttext, '\n')

Operator: Operator, Type: Presentation Operator Message
Good afternoon. My name is Karen, and I'll be your conference facilitator today. And at this time, I would like to welcome everyone to the Micron Technology's First Quarter 2015 Financial Release Conference Call. [Operator Instructions] It is now my pleasure to turn the floor over to your host, Kipp Bedard. Sir, you may begin your conference. 

Executives: Kipp Bedard, Type: Presenter Speech
Thank you, Karen, and welcome to Micron Technology's First Quarter 2015 Financial Release Conference Call. On the call today is Mr. Mark Durcan, CEO and Director; Mark Adams, President; and Ron Foster, Chief Financial Officer and Vice President of Finance. This conference call, including audio and slides, is also available on our website at micron.com.
In addition, our website has a file containing the quarterly operational and financial information and guidance, non-GAAP information with reconciliation, slides used during the conference call 

Now testing out the second transcript to check:

In [103]:
# Finding out the indices for the second transcriptid
trscrpt_num = 1
indiv_trscrpt_index_list = list(snp500_tscrpt_clean[snp500_tscrpt_clean['transcriptid'] == trscrpt_id_list[trscrpt_num]].index)

# Printing the speaker, component type, and text for each component, as above
for i in indiv_trscrpt_index_list:
    # The list holds index labels, so look rows up with .loc rather than .iloc
    trscrpt_comp_row = snp500_tscrpt_clean.loc[i]
    print(trscrpt_comp_row.speakertypename + ': ' + trscrpt_comp_row.transcriptpersonname + ', Type: ' +
          trscrpt_comp_row.transcriptcomponenttypename)
    print(trscrpt_comp_row.componenttext, '\n')

Executives: D. Durcan, Type: Presenter Speech
Thanks, Kipp. We had another strong quarter benefiting from continued favorable market conditions and solid execution from the team. We set a new record for quarterly revenue of $4.6 billion. GAAP net income was $1 billion. Free cash flow was $923 million based on record operating cash flow of $1.6 billion less CapEx of $669 million. The investments we're making in the business are putting us in position to continue generating strong cash flow.
We expect continued favorable market conditions for 2015, led by constrained supply in DRAM and solid demand for both DRAM and NAND. Demand growth in our business continues to be driven by our customers' rapidly increasing memory content to enable them to enhance the performance of their products as opposed to strictly unit growth of end systems. The resulting demand outlook remains very encouraging. A few good examples of this growth include mobile DRAM, server DRAM and solid-state drives, all of wh

In [104]:
# Checking the beginning of the first transcript, the point where the repeated section starts (change in transcriptid),
# the end of the first full transcript, and the beginning of the second full transcript (change in keydevid)
display(snp500_tscrpt_clean[:5])
display(snp500_tscrpt_clean[43:45])
display(snp500_tscrpt_clean[65:69])

Unnamed: 0,keydevid,transcriptid,mostimportantdateutc,companyname,gvkey,transcriptcomponentid,componentorder,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,speakertypename,word_count,componenttext
0,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636492.0,0,Presentation Operator Message,1.0,Operator,Operator,57,"Good afternoon. My name is Karen, and I'll be ..."
1,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636493.0,1,Presenter Speech,116497.0,Kipp Bedard,Executives,221,"Thank you, Karen, and welcome to Micron Techno..."
2,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636494.0,2,Presenter Speech,182536.0,Unknown Executive,Executives,175,"During the course of this meeting, we may make..."
3,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636495.0,3,Presenter Speech,116497.0,Kipp Bedard,Executives,13,I'll now turn the call over to Mr. Mark Durcan...
4,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636496.0,4,Presenter Speech,130734.0,D. Durcan,Executives,1382,"Thanks, Kipp. We had another strong quarter be..."


Unnamed: 0,keydevid,transcriptid,mostimportantdateutc,companyname,gvkey,transcriptcomponentid,componentorder,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,speakertypename,word_count,componenttext
43,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",7343,32636535.0,43,Question and Answer Operator Message,1.0,Operator,Operator,18,Thank you. This concludes today's Micron Techn...
44,279483592.0,743360.0,2015-01-06,"Micron Technology, Inc.",7343,32636999.0,4,Presenter Speech,130734.0,D. Durcan,Executives,1381,"Thanks, Kipp. We had another strong quarter be..."


Unnamed: 0,keydevid,transcriptid,mostimportantdateutc,companyname,gvkey,transcriptcomponentid,componentorder,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,speakertypename,word_count,componenttext
65,279483592.0,743360.0,2015-01-06,"Micron Technology, Inc.",7343,32637036.0,41,Answer,116508.0,Mark Adams,Executives,140,Sure. Thanks for the question. The way I would...
66,279483592.0,743360.0,2015-01-06,"Micron Technology, Inc.",7343,32637037.0,42,Answer,116497.0,Kipp Bedard,Executives,119,"Thank you, Daniel. And with that, we'd like to..."
67,265970735.0,743536.0,2015-01-07,Monsanto Company,140760,32643090.0,0,Presentation Operator Message,1.0,Operator,Operator,53,"Greetings, and welcome to the First Quarter Fi..."
68,265970735.0,743536.0,2015-01-07,Monsanto Company,140760,32643091.0,1,Presenter Speech,299943.0,Laura Meyer,Executives,362,"Thank you, Christine, and good morning to ever..."


What I found was that keydevid is definitively the primary key for the full transcript. Rows under a second transcriptid repeat text already seen (as the componentorder values also show), and for the first full transcript the duplicated text ran from component 4 to 42 (as opposed to 0 to 43, the actual range of the full transcript).

The next step is to see how consistent this issue is - where the first transcriptid for each unique keydevid is the full transcript and the next are simply duplicates. If so, we can simply drop any duplicated componentorders when comparing each unique keydevid.
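The idea of dropping repeated componentorders within a keydevid can be sketched on toy data modeled on the Micron case, where a full version is followed by a partial repeat under a second transcriptid (the ids and word counts here are made up):

```python
import pandas as pd

# Toy data: a full version (transcriptid 10) followed by a partial repeat (11).
toy = pd.DataFrame({
    "keydevid": [1, 1, 1, 1, 1, 1],
    "transcriptid": [10, 10, 10, 10, 11, 11],
    "componentorder": [0, 1, 2, 3, 2, 3],
    "word_count": [57, 221, 175, 13, 174, 13],
})

# Keep the first row seen for each (call, position) pair.
deduped = toy.drop_duplicates(subset=["keydevid", "componentorder"], keep="first")

# Which transcript versions survived?
kept_versions = deduped["transcriptid"].unique().tolist()
```

Note that `keep='first'` only selects the intended version when the rows are ordered so that the full version comes first within each keydevid.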

### Identifying duplicated transcriptids and dropping the appropriate ones

In [105]:
# Checking to see if the first component is always 0 for each unique keydevid
print(snp500_tscrpt_clean.groupby('keydevid').first()['componentorder'].sum())

0
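Summing the first value per group works as a check as long as componentorder is never negative, but `first()` depends on how the rows happen to be sorted. A variant that does not rely on row order, sketched on toy data (assumed values), takes the minimum per call and lists the calls that do not start at 0:

```python
import pandas as pd

toy = pd.DataFrame({
    "keydevid": [1, 1, 1, 2, 2],
    "componentorder": [0, 1, 2, 1, 2],  # call 2 is missing component 0
})

# Smallest componentorder per call; independent of how the rows are sorted.
min_order = toy.groupby("keydevid")["componentorder"].min()
calls_not_starting_at_zero = min_order[min_order != 0].index.tolist()
```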


Now that we've confirmed that the first componentorder is always 0 for each unique keydevid, we can drop duplicates and just keep the first rows for each group.

In [107]:
# Dropping duplicates of the transcripts
snp500_tscrpt_clean = snp500_tscrpt_clean.drop_duplicates(subset=['keydevid', 'componentorder'], keep='first')
snp500_tscrpt_clean

Unnamed: 0,keydevid,transcriptid,mostimportantdateutc,companyname,gvkey,transcriptcomponentid,componentorder,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,speakertypename,word_count,componenttext
0,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636492.0,0,Presentation Operator Message,1.0,Operator,Operator,57,"Good afternoon. My name is Karen, and I'll be ..."
1,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636493.0,1,Presenter Speech,116497.0,Kipp Bedard,Executives,221,"Thank you, Karen, and welcome to Micron Techno..."
2,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636494.0,2,Presenter Speech,182536.0,Unknown Executive,Executives,175,"During the course of this meeting, we may make..."
3,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636495.0,3,Presenter Speech,116497.0,Kipp Bedard,Executives,13,I'll now turn the call over to Mr. Mark Durcan...
4,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636496.0,4,Presenter Speech,130734.0,D. Durcan,Executives,1382,"Thanks, Kipp. We had another strong quarter be..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
206574,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646624.0,6,Presenter Speech,520011.0,William Thomas,Executives,239,"Thanks, Tim. Concerning our macro view, we bel..."
206575,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646629.0,11,Answer,520011.0,William Thomas,Executives,425,Let me give you a good overview. That's a good...
206576,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646632.0,14,Answer,520011.0,William Thomas,Executives,73,"Evan, we definitely want to -- that's a primar..."
206577,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646646.0,28,Answer,520011.0,William Thomas,Executives,114,"Charles, we're going to -- that's a really goo..."


We notice a glaring mistake, however, as we see that the componentorder for the last entry seems to skip a few components. We can compare this with the original dataset, before we dropped any duplicates.

In [116]:
# Current dataset
display(snp500_tscrpt_clean[snp500_tscrpt_clean['keydevid'] == 306412228.0]['componentorder'])

# Original dataset
display(snp500_transcripts[snp500_transcripts['keydevid'] == 306412228.0]['componentorder'])

147042     0
147043     1
147044     3
147045     4
147046     5
          ..
206574     6
206575    11
206576    14
206577    28
206578    36
Name: componentorder, Length: 64, dtype: Int64

247601     0
247602     1
247603     3
247604     4
247605     5
          ..
353373    59
353374    60
353375    61
353376    62
353377    63
Name: componentorder, Length: 180, dtype: Int64

In this case, it does appear that the earlier de-duplication steps affected which rows were kept, so we should go back to our original dataset, create a new copy, and only drop duplicates at the last step (to make sure we include all the data).
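One way to check a cleaned frame for lost components is to look for gaps in componentorder within each keydevid. The helper below is a hypothetical sketch (the name `missing_components` and the toy data are mine). Gaps can also exist in the source data itself, so the result is best compared between the original and cleaned frames rather than read on its own.

```python
import pandas as pd

def missing_components(df: pd.DataFrame) -> dict:
    """Map each keydevid to the componentorder values absent from 0..max."""
    gaps = {}
    for keydevid, orders in df.groupby("keydevid")["componentorder"]:
        present = set(orders.dropna().astype(int))
        expected = set(range(max(present) + 1)) if present else set()
        missing = sorted(expected - present)
        if missing:
            gaps[keydevid] = missing
    return gaps

# Toy data: call 2 skips component 2.
toy = pd.DataFrame({
    "keydevid": [1, 1, 1, 2, 2, 2],
    "componentorder": [0, 1, 2, 0, 1, 3],
})

gaps = missing_components(toy)
```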

## De-duplicating and cleaning the data (final)
This will be the final de-duping of the data, based on the results in the investigation above.

In [10]:
# Creating a new copy of the data
snp500_tscrpt_clean_final = snp500_transcripts.copy()

# Dropping first round of unnecessary columns
cols_to_drop_1 = ['companyid', 'headline', 'mostimportanttimeutc', 'keydeveventtypeid', 'keydeveventtypename',
                  'transcriptcollectiontypeid', 'transcriptcollectiontypename', 'transcriptpresentationtypeid',
                  'transcriptpresentationtypename', 'transcriptcreationdate_utc', 'transcriptcreationtime_utc', 'audiolengthsec',
                  'transcriptcomponenttypeid', 'proid', 'companyofperson', 'speakertypeid', 'componenttextpreview']

# Dropping columns
snp500_tscrpt_clean_final = snp500_tscrpt_clean_final.drop(columns=cols_to_drop_1)

# Dropping redundant transcriptid column
snp500_tscrpt_clean_final = snp500_tscrpt_clean_final.loc[:, ~snp500_tscrpt_clean_final.columns.duplicated(keep='first')] 

# Dropping duplicates based on keydevid and componentorder
snp500_tscrpt_clean_final = snp500_tscrpt_clean_final.drop_duplicates(subset=['keydevid', 'componentorder'], keep='first')

In [11]:
snp500_tscrpt_clean_final

Unnamed: 0,keydevid,transcriptid,mostimportantdateutc,companyname,gvkey,transcriptcomponentid,componentorder,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,speakertypename,word_count,componenttext
0,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636492.0,0,Presentation Operator Message,1.0,Operator,Operator,57,"Good afternoon. My name is Karen, and I'll be ..."
1,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636493.0,1,Presenter Speech,116497.0,Kipp Bedard,Executives,221,"Thank you, Karen, and welcome to Micron Techno..."
2,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636494.0,2,Presenter Speech,182536.0,Unknown Executive,Executives,175,"During the course of this meeting, we may make..."
3,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636495.0,3,Presenter Speech,116497.0,Kipp Bedard,Executives,13,I'll now turn the call over to Mr. Mark Durcan...
4,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636496.0,4,Presenter Speech,130734.0,D. Durcan,Executives,1382,"Thanks, Kipp. We had another strong quarter be..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
353320,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646624.0,6,Presenter Speech,520011.0,William Thomas,Executives,239,"Thanks, Tim. Concerning our macro view, we bel..."
353325,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646629.0,11,Answer,520011.0,William Thomas,Executives,425,Let me give you a good overview. That's a good...
353328,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646632.0,14,Answer,520011.0,William Thomas,Executives,73,"Evan, we definitely want to -- that's a primar..."
353342,306412228.0,2472597.0,2015-08-07,"EOG Resources, Inc.",016478,93646646.0,28,Answer,520011.0,William Thomas,Executives,114,"Charles, we're going to -- that's a really goo..."


As it turns out, the rows at the end of the dataset are simply the components that were missing from the first block of rows for this transcript. Looking at it a bit more closely below...

In [125]:
snp500_tscrpt_clean[snp500_tscrpt_clean['keydevid'] == 306412228.0]['componentorder'][0:15]

147042     0
147043     1
147044     3
147045     4
147046     5
147047     7
147048     8
147049     9
147050    10
147051    12
147052    13
147053    15
147054    16
147055    17
147056    18
Name: componentorder, dtype: Int64

In [127]:
snp500_tscrpt_clean[snp500_tscrpt_clean['keydevid'] == 306412228.0]['componentorder'][-10:]

147096    60
147097    61
147098    62
147099    63
206572     2
206574     6
206575    11
206576    14
206577    28
206578    36
Name: componentorder, dtype: Int64

...we notice that the components at the end (2, 6, 11, 14, ...) are exactly the ones missing from the first block of the transcript. Therefore, we simply need to re-sort the transcript by `componentorder`, and things should be fine.

One thing to note is that skipping the original de-duplicating step did give us some additional rows, so this was still worthwhile to do.

In [12]:
# Sorting the transcripts by date first, then keydevid, then the component order
snp500_tscrpt_clean_final = snp500_tscrpt_clean_final.sort_values(by=['mostimportantdateutc', 'keydevid', 'componentorder'])
snp500_tscrpt_clean_final = snp500_tscrpt_clean_final.reset_index(drop=True)

snp500_tscrpt_clean_final

Unnamed: 0,keydevid,transcriptid,mostimportantdateutc,companyname,gvkey,transcriptcomponentid,componentorder,transcriptcomponenttypename,transcriptpersonid,transcriptpersonname,speakertypename,word_count,componenttext
0,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636492.0,0,Presentation Operator Message,1.0,Operator,Operator,57,"Good afternoon. My name is Karen, and I'll be ..."
1,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636493.0,1,Presenter Speech,116497.0,Kipp Bedard,Executives,221,"Thank you, Karen, and welcome to Micron Techno..."
2,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636494.0,2,Presenter Speech,182536.0,Unknown Executive,Executives,175,"During the course of this meeting, we may make..."
3,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636495.0,3,Presenter Speech,116497.0,Kipp Bedard,Executives,13,I'll now turn the call over to Mr. Mark Durcan...
4,279483592.0,743348.0,2015-01-06,"Micron Technology, Inc.",007343,32636496.0,4,Presenter Speech,130734.0,D. Durcan,Executives,1382,"Thanks, Kipp. We had another strong quarter be..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
140829,318616474.0,915370.0,2015-12-22,"Paychex, Inc.",008402,39070076.0,165,Question,93197.0,Glenn Greene,Analysts,20,Okay. And just a quick math on a full-year run...
140830,318616474.0,915370.0,2015-12-22,"Paychex, Inc.",008402,39070077.0,166,Answer,199526.0,Efrain Rivera,Executives,12,"I see where your math works, and I can't disag..."
140831,318616474.0,915370.0,2015-12-22,"Paychex, Inc.",008402,39070078.0,167,Answer,170118.0,Martin Mucci,Executives,72,"Okay, I think that's all the questions. At thi..."
140832,318616474.0,915370.0,2015-12-22,"Paychex, Inc.",008402,39070079.0,168,Answer,199526.0,Efrain Rivera,Executives,2,Thank you.
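
A quick way to confirm that the re-sort worked is to check that `componentorder` strictly increases within each call. The following is a minimal sketch on made-up data; `is_ordered_within_calls` and `toy` are hypothetical names, not part of the pipeline above.

```python
import pandas as pd

def is_ordered_within_calls(df):
    """Return True if componentorder strictly increases within every keydevid."""
    diffs = df.groupby("keydevid")["componentorder"].diff()
    # The first row of each call has no predecessor (missing diff), so drop it.
    return bool((diffs.dropna() > 0).all())

# Toy data (hypothetical): component 2 of call 1 sits after component 3.
toy = pd.DataFrame({
    "keydevid":       [1, 1, 1, 1, 2, 2],
    "componentorder": [0, 1, 3, 2, 0, 1],
})
toy_sorted = toy.sort_values(by=["keydevid", "componentorder"]).reset_index(drop=True)

print(is_ordered_within_calls(toy), is_ordered_within_calls(toy_sorted))
```

The same function could be applied to `snp500_tscrpt_clean_final` after sorting.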


## Identifying which parts of the transcript to keep (or whether to keep all of it)
Earnings transcripts can be split into multiple sections, and we need to decide whether to keep just the main portion of the speech, or to include the questions and answers as well.

In [133]:
# Identifying the parts of a transcript
snp500_tscrpt_clean_final['transcriptcomponenttypename'].value_counts()

transcriptcomponenttypename
Answer                                  62887
Question                                46094
Question and Answer Operator Message    22545
Presenter Speech                         7407
Presentation Operator Message            1901
Name: count, dtype: Int64

Because we have already de-duplicated the dataset, and because the questions and answers could be an important part of the call, it may be worthwhile to include everything. However, the questions and answers could contain a lot of extraneous information, and our goal is to analyze just the earnings call itself. Logistically speaking, given the length of these transcripts, including the Q&A would also likely add many hours of training time, and the full text may not even fit within the model's maximum input length.

To confirm my hypothesis, I will first check the length of the transcripts.

In [164]:
# Getting the lengths of the transcripts using word count
tscrpt_lengths = snp500_tscrpt_clean_final.groupby('keydevid')['word_count'].sum()

tscrpt_lengths.describe()

count         1850.0
mean     9003.963243
std      1973.063123
min           1980.0
25%           8015.5
50%           8987.0
75%          9964.75
max          29479.0
Name: word_count, dtype: Float64

We note that the mean and the median word counts are both ~9,000, which is enormous and already exceeds the maximum input length of many models (e.g., ModernBERT's max token length is 8,192). A tokenizer will also typically produce more tokens than there are words, so the real gap is larger still. Therefore, it does make sense logistically to exclude the question-and-answer portions.

I will next determine whether it's appropriate to simply get rid of everything starting from the first question, or if there's more cleaning that needs to be done.

In [191]:
# Checking out just the first call's transcriptcomponenttypename and word_count
all_keydevids = list(snp500_tscrpt_clean_final['keydevid'].drop_duplicates())
chosen_keydevid = all_keydevids[0]
indiv_tscrpt_full = snp500_tscrpt_clean_final[snp500_tscrpt_clean_final['keydevid'] == chosen_keydevid]

for orig_idx, row in indiv_tscrpt_full.iterrows():
    print(row['transcriptcomponenttypename'] + ': ' + str(row['word_count']))

Presentation Operator Message: 57
Presenter Speech: 221
Presenter Speech: 175
Presenter Speech: 13
Presenter Speech: 1382
Presenter Speech: 1218
Presenter Speech: 2096
Presenter Speech: 22
Question and Answer Operator Message: 15
Question: 65
Answer: 246
Question: 40
Answer: 86
Question and Answer Operator Message: 12
Question: 89
Answer: 115
Answer: 216
Question and Answer Operator Message: 14
Question: 52
Answer: 235
Question and Answer Operator Message: 12
Question: 134
Answer: 224
Question and Answer Operator Message: 12
Question: 59
Answer: 118
Answer: 14
Question: 21
Answer: 71
Question: 125
Answer: 104
Answer: 210
Question and Answer Operator Message: 12
Question: 213
Answer: 405
Question: 41
Answer: 69
Question: 54
Answer: 139
Question and Answer Operator Message: 14
Question: 95
Answer: 141
Answer: 119
Question and Answer Operator Message: 18


In [188]:
# Checking out just the second call's transcriptcomponenttypename and word_count
chosen_keydevid = all_keydevids[1]
indiv_tscrpt_full = snp500_tscrpt_clean_final[snp500_tscrpt_clean_final['keydevid'] == chosen_keydevid]

for orig_idx, row in indiv_tscrpt_full.iterrows():
    print(row['transcriptcomponenttypename'] + ': ' + str(row['word_count']))

Presentation Operator Message: 53
Presenter Speech: 362
Presenter Speech: 793
Presenter Speech: 1633
Presenter Speech: 889
Presenter Speech: 2855
Presenter Speech: 46
Question and Answer Operator Message: 14
Question: 67
Answer: 137
Answer: 198
Question and Answer Operator Message: 13
Question: 99
Answer: 269
Answer: 63
Answer: 4
Question and Answer Operator Message: 13
Question: 88
Answer: 226
Answer: 280
Answer: 38
Answer: 2
Answer: 1
Answer: 2
Question and Answer Operator Message: 13
Question: 63
Answer: 122
Answer: 237
Question and Answer Operator Message: 12
Question: 77
Answer: 187
Answer: 333
Question and Answer Operator Message: 13
Question: 69
Answer: 358
Answer: 19
Answer: 1
Answer: 1
Answer: 45
Question and Answer Operator Message: 13
Question: 56
Answer: 68
Answer: 197
Answer: 109
Answer: 7
Question and Answer Operator Message: 16
Question: 105
Answer: 124
Answer: 105
Question and Answer Operator Message: 12
Question: 88
Answer: 10
Answer: 219
Question and Answer Operator M

In [189]:
# Checking out just the third call's transcriptcomponenttypename and word_count
chosen_keydevid = all_keydevids[2]
indiv_tscrpt_full = snp500_tscrpt_clean_final[snp500_tscrpt_clean_final['keydevid'] == chosen_keydevid]

for orig_idx, row in indiv_tscrpt_full.iterrows():
    print(row['transcriptcomponenttypename'] + ': ' + str(row['word_count']))

Presentation Operator Message: 85
Presenter Speech: 341
Presenter Speech: 1902
Presenter Speech: 1604
Presenter Speech: 378
Presentation Operator Message: 16


In [190]:
# Checking out just the fourth call's transcriptcomponenttypename and word_count
chosen_keydevid = all_keydevids[3]
indiv_tscrpt_full = snp500_tscrpt_clean_final[snp500_tscrpt_clean_final['keydevid'] == chosen_keydevid]

for orig_idx, row in indiv_tscrpt_full.iterrows():
    print(row['transcriptcomponenttypename'] + ': ' + str(row['word_count']))

Presentation Operator Message: 44
Presenter Speech: 167
Presenter Speech: 1786
Presenter Speech: 1191
Question and Answer Operator Message: 16
Question: 94
Answer: 343
Question: 54
Answer: 100
Question and Answer Operator Message: 14
Question: 56
Answer: 142
Question: 45
Answer: 92
Question: 150
Answer: 213
Question and Answer Operator Message: 13
Question: 89
Answer: 208
Question: 75
Answer: 316
Question and Answer Operator Message: 13
Question: 79
Answer: 137
Question: 50
Answer: 153
Question: 118
Answer: 159
Question and Answer Operator Message: 14
Question: 26
Answer: 101
Question: 53
Answer: 117
Question: 27
Answer: 182
Question and Answer Operator Message: 13
Question: 118
Answer: 100
Answer: 439
Question: 17
Answer: 53
Question: 49
Answer: 26
Question: 9
Answer: 7
Question: 53
Answer: 14
Question and Answer Operator Message: 12
Question: 134
Answer: 298
Question: 25
Answer: 71
Question: 72
Answer: 104
Question and Answer Operator Message: 12
Question: 70
Answer: 228
Question: 42

In the cases above, the order generally starts with the Presentation Operator Message, then leads to the Presenter Speech, then goes into the Question and Answer Operator Message, Question, and Answer portions (the third call has no Q&A at all and simply ends with another Presentation Operator Message). Therefore, it might make sense to get rid of everything from the first Question and Answer Operator Message onward. However, we'll double-check that to be safe.

In [249]:
# Double-checking that the first Presenter Speech section comes before the first Question and Answer Operator Message, Question, and
# Answer sections
cmp_typ_dict = {}
for keydevid, group in snp500_tscrpt_clean_final.groupby('keydevid'):
    first_PS_num = group.loc[group['transcriptcomponenttypename'] == 'Presenter Speech', 'componentorder'].min()
    first_QAOM_num = group.loc[group['transcriptcomponenttypename'] == 'Question and Answer Operator Message', 'componentorder'].min()
    first_Q_num = group.loc[group['transcriptcomponenttypename'] == 'Question', 'componentorder'].min()
    first_A_num = group.loc[group['transcriptcomponenttypename'] == 'Answer', 'componentorder'].min()
    cmp_typ_dict[keydevid] = [first_PS_num, first_QAOM_num, first_Q_num, first_A_num]

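# Running through the dictionary to confirm the ordering
violations = []
for keydevid, cmp_typ_val in cmp_typ_dict.items():
    # Skipping transcripts that lack any of the three Q&A component types
    if pd.isna(cmp_typ_val[1]) or pd.isna(cmp_typ_val[2]) or pd.isna(cmp_typ_val[3]):
        continue
    # Flagging transcripts whose first Presenter Speech comes at or after all three Q&A component types
    elif (cmp_typ_val[0] >= cmp_typ_val[1]) and (cmp_typ_val[0] >= cmp_typ_val[2]) and (cmp_typ_val[0] >= cmp_typ_val[3]):
        violations.append(keydevid)
        print(cmp_typ_dict[keydevid])

# Only reporting success if nothing was flagged
if not violations:
    print('No violations of order.')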

No violations of order.
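
A stricter variant of this check would flag a call as soon as its first Presenter Speech does not come before the earliest Q&A component of any type. The sketch below runs that variant on made-up data; `find_order_violations`, `QA_TYPES`, and `toy` are hypothetical names, and it has not been run on the real dataset.

```python
import pandas as pd

QA_TYPES = ["Question and Answer Operator Message", "Question", "Answer"]

def find_order_violations(df):
    """Return keydevids whose first Presenter Speech is not before the first Q&A row."""
    violations = []
    for keydevid, group in df.groupby("keydevid"):
        types = group["transcriptcomponenttypename"]
        first_speech = group.loc[types == "Presenter Speech", "componentorder"].min()
        first_qa = group.loc[types.isin(QA_TYPES), "componentorder"].min()
        # Calls without a speech or without any Q&A cannot violate the order.
        if pd.isna(first_speech) or pd.isna(first_qa):
            continue
        if first_speech >= first_qa:
            violations.append(keydevid)
    return violations

# Toy data (hypothetical): call 2 opens with a Question, call 3 has no Q&A.
toy = pd.DataFrame({
    "keydevid": [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "componentorder": [0, 1, 2, 3, 0, 1, 2, 0, 1],
    "transcriptcomponenttypename": [
        "Presentation Operator Message", "Presenter Speech",
        "Question and Answer Operator Message", "Question",
        "Question", "Presenter Speech", "Answer",
        "Presentation Operator Message", "Presenter Speech",
    ],
})

violations = find_order_violations(toy)
print(violations)
```

Unlike the cell above, this version does not require all three Q&A component types to be present before it compares positions.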


First, the good news is that no violations were flagged: wherever all three Q&A component types are present, the first Presenter Speech section comes before them. Second, I noticed in my experiment that quite a few transcripts are missing the 'Question and Answer Operator Message', 'Question', or 'Answer' sections. Therefore, I need to make sure that those sections exist before filtering them out. Then (assuming they do exist), I need to find the earliest position among those three component types and remove everything from that point onward, retaining only the bulk of the Presenter Speech.

## Removing the Q&A sections and merging the remaining text
I will now proceed to remove the Q&A sections and only keep the Presenter Speech (and the Presentation Operator Message before that). Then, I will merge the entire transcript into one long section.

In [15]:
from tqdm import tqdm

In [16]:
# Initializing a new df
snp500_tscrpt_full = pd.DataFrame()

# Going through each group and removing all Q&A sections
for keydevid, group in tqdm(snp500_tscrpt_clean_final.groupby('keydevid')):

    # Initializing the Question and Answer Operator Message, Question, and Answer indices
    QAOM_idx = np.inf
    Q_idx = np.inf
    A_idx = np.inf

    # Checking if "Question and Answer Operator Message" index exists (.index[0] raises IndexError if there is no match)
    try:
        QAOM_idx = group[group['transcriptcomponenttypename'] == 'Question and Answer Operator Message'].index[0]
    except IndexError:
        pass

    # Checking if "Question" index exists
    try:
        Q_idx = group[group['transcriptcomponenttypename'] == 'Question'].index[0]
    except IndexError:
        pass

    # Checking if "Answer" index exists
    try:
        A_idx = group[group['transcriptcomponenttypename'] == 'Answer'].index[0]
    except IndexError:
        pass

    # Finding the first Q&A index in group (if exists), then removing all entries at and after that index
    idx_to_remove = np.min([QAOM_idx, Q_idx, A_idx])
    if idx_to_remove != np.inf:
        group = group[group.index < idx_to_remove]
    snp500_tscrpt_full = pd.concat([snp500_tscrpt_full, group])
    
# Resetting the index
snp500_tscrpt_full = snp500_tscrpt_full.reset_index(drop=True)

100%|██████████████████████████████████████| 1850/1850 [00:02<00:00, 836.79it/s]
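
The same filtering can also be written without an explicit loop. The following is a sketch of one possible vectorized alternative, not the code that produced the results here; `drop_qa_sections`, `QA_TYPES`, and `toy` are hypothetical names, and it assumes the rows are already sorted by `componentorder` within each `keydevid`.

```python
import pandas as pd

QA_TYPES = ["Question and Answer Operator Message", "Question", "Answer"]

def drop_qa_sections(df):
    """Drop every row from the first Q&A component onward, per keydevid."""
    is_qa = df["transcriptcomponenttypename"].isin(QA_TYPES)
    # Running count of Q&A rows seen so far in each call; 0 means "before the Q&A".
    qa_seen = is_qa.astype(int).groupby(df["keydevid"]).cumsum()
    return df[qa_seen == 0].reset_index(drop=True)

# Toy data (hypothetical): call 1 has a Q&A section, call 2 does not.
toy = pd.DataFrame({
    "keydevid": [1, 1, 1, 1, 1, 2, 2, 2],
    "componentorder": [0, 1, 2, 3, 4, 0, 1, 2],
    "transcriptcomponenttypename": [
        "Presentation Operator Message", "Presenter Speech",
        "Question and Answer Operator Message", "Question", "Answer",
        "Presentation Operator Message", "Presenter Speech",
        "Presentation Operator Message",
    ],
})

trimmed = drop_qa_sections(toy)
print(trimmed.groupby("keydevid").size().tolist())
```

A call without any Q&A rows is left untouched, which matches the behavior of the loop when all three indices stay at `np.inf`.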


Now with the Q&A sections removed, I can merge the remaining section(s) into one long text.

In [49]:
# Initializing dictionary for final dataframe
full_tscrpt_dict = {
    'Unique_transcript_id': [],
    'Date': [],
    'Company_name': [],
    'Word_count': [],
    'Text': []
}

# Going through each transcript to populate the full_tscrpt_dict for the final df
for keydevid, group in tqdm(snp500_tscrpt_full.groupby('keydevid')):
    
    # Appending the keydevid to get the Unique_transcript_id
    first_line = group.iloc[0]
    full_tscrpt_dict['Unique_transcript_id'].append(keydevid)

    # Getting the date, company name, the word count total, and the merged text
    full_tscrpt_dict['Date'].append(first_line['mostimportantdateutc'])
    full_tscrpt_dict['Company_name'].append(first_line['companyname'])
    full_tscrpt_dict['Word_count'].append(group['word_count'].sum())
    full_tscrpt_dict['Text'].append(" \n".join(group['componenttext'].tolist()))

# Transforming into a dataframe and sorting
full_tscrpt_df = pd.DataFrame.from_dict(full_tscrpt_dict)
full_tscrpt_df = full_tscrpt_df.sort_values(by=['Date', 'Company_name'])
full_tscrpt_df = full_tscrpt_df.reset_index(drop=True)

100%|████████████████████████████████████| 1850/1850 [00:00<00:00, 12091.34it/s]
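
For reference, the merge could also be expressed with a single `groupby(...).agg(...)` call. This is a minimal sketch on made-up rows, with `merge_transcripts` and `toy` as hypothetical names; it is meant to mirror the loop above rather than replace it.

```python
import pandas as pd

def merge_transcripts(df):
    """Collapse component rows into one row per keydevid."""
    merged = (
        df.groupby("keydevid")
          .agg(
              Date=("mostimportantdateutc", "first"),
              Company_name=("companyname", "first"),
              Word_count=("word_count", "sum"),
              # Same separator as the loop above.
              Text=("componenttext", lambda parts: " \n".join(parts)),
          )
          .reset_index()
          .rename(columns={"keydevid": "Unique_transcript_id"})
    )
    return merged.sort_values(by=["Date", "Company_name"]).reset_index(drop=True)

# Toy data (hypothetical) with two short calls.
toy = pd.DataFrame({
    "keydevid": [1, 1, 2],
    "mostimportantdateutc": ["2015-01-06", "2015-01-06", "2015-01-07"],
    "companyname": ["A Corp", "A Corp", "B Corp"],
    "word_count": [2, 2, 1],
    "componenttext": ["Good afternoon.", "Thank you.", "Welcome."],
})

merged = merge_transcripts(toy)
print(merged)
```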


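Checking the word count reveals that only 15 out of the 1850 transcripts (~0.8%) in 2015 exceed 8,192 words, which is the token limit of a model like ModernBERT. Since a tokenizer typically produces more tokens than there are words, the number of transcripts that end up truncated will likely be somewhat higher. However, this sacrifice will likely be worth it for the sake of speed, and the truncated tokens are unlikely to make too big of a difference.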

In [58]:
full_tscrpt_df['Word_count'].sort_values()[-16:]

1274     8081
27       8222
585      8520
1711     8627
1703     8700
471      8746
1483     8842
1800     8866
214      8909
887      8995
939      9105
322      9122
1349    10771
390     12701
295     19422
1833    22068
Name: Word_count, dtype: int64
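
The count of transcripts over the limit can be reproduced from the values printed above. This is a small sketch; `MODERNBERT_MAX_TOKENS`, `top_word_counts`, `n_over`, and `share_over` are illustrative names, and word counts are used only as a rough proxy for token counts.

```python
import pandas as pd

MODERNBERT_MAX_TOKENS = 8192  # limit cited earlier in this notebook

# The 16 largest word counts from the output above.
top_word_counts = pd.Series([
    8081, 8222, 8520, 8627, 8700, 8746, 8842, 8866,
    8909, 8995, 9105, 9122, 10771, 12701, 19422, 22068,
])

# Number of transcripts whose word count alone exceeds the limit.
n_over = int((top_word_counts > MODERNBERT_MAX_TOKENS).sum())
# Share relative to the 1850 transcripts in 2015.
share_over = n_over / 1850

print(n_over, f"{share_over:.2%}")
```

An exact figure would require running the model's tokenizer over the merged text, which this sketch does not do.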

In [61]:
# Confirming that we do get what we expect:
print(full_tscrpt_df['Text'].iloc[1234])

Greetings, and welcome to the Leggett & Platt Second Quarter 2015 Earnings Conference Call. [Operator Instructions] As a reminder, this conference is being recorded. It is now my pleasure to introduce your host, Dave DeSonier, Senior Vice President of Strategy and Investor Relations. Thank you. You may begin. 
Good morning, and thank you for taking part in Leggett & Platt's second quarter conference call. With me this morning are the following: Dave Haffner, our Board Chair and CEO; Karl Glassman, who is President and Chief Operating Officer; Matt Flanigan, our Executive VP and CFO; and Susan McCoy, our VP of  Investor Relations. The agenda for the call this morning is as follows: Dave Haffner will start with a summary of the major statements we made in yesterday's press release; Karl Glassman will provide segment highlights; Matt Flanigan will discuss financial details and address our outlook for 2015; and finally, the group will answer any questions that you have. Dennis Park, who is

Printing an arbitrary transcript from our dataset confirms that we obtain what we were aiming for: one long, merged transcript per call, along with its date and company, which will be suitable for our analysis later on. The only thing left to do is to obtain data for all the years that we want to train on, and we can continue from there.

# Step 3.5: Obtaining all the transcripts and saving them
Now that my transcript cleaning process is complete, I can start downloading all the transcripts that I need. While I will likely not use all the data, I will download transcripts for S&P 500 companies starting from 2000.

In [3]:
import os
os.getcwd()

'/Users/danielwang/Desktop/Berkeley MIDS Stuff/Berkeley MIDS Summer 2025 Stuff/Berkeley MIDS DATASCI 266/Berkeley MIDS DATASCI 266 Project Material'

In [4]:
# Looping through all the years
for year in tqdm(range(2000, 2026)):

    # Checking to see if the df already exists
    csv_name = './Individual_transcripts_by_year/full_tscrpt_df_' + str(year) + '.csv'
    if os.path.exists(csv_name):
        print("Already done with year", year)
        continue

    # SQL queries
    db.connection.rollback()
    sql_query_1 = f'''
            SELECT a.*, b.gvkey, b.liid, b.linkdt, b.linkenddt
            FROM (
                SELECT * 
                FROM crsp.dsp500list
                WHERE start <= make_date({year}, 1, 1)
                  AND ending >= make_date({year}, 12, 31)
            ) AS a
            LEFT JOIN (
                SELECT * 
                FROM crsp.ccmxpf_lnkhist
                WHERE linkdt <= make_date({year}, 1, 1)
                  AND (linkenddt >= make_date({year}, 12, 31) OR linkenddt IS NULL)
            ) AS b
            ON a.permno = b.lpermno
            AND b.linktype IN ('LU', 'LC')
            AND b.linkprim IN ('P', 'C');
            '''
    snp500_crsp_gvkey = db.raw_sql(sql_query_1)

    db.connection.rollback()
    sql_query_2 = f'''
                SELECT a.*,
                       b.symbolvalue AS gvkey,
                       c.*, 
                       d.componenttext
                FROM (
                    SELECT *
                    FROM ciq.wrds_transcript_detail 
                    WHERE keydeveventtypeid = 48 
                      AND transcriptpresentationtypeid = 5 
                      AND date_part('year', mostimportantdateutc) = {year}
                ) AS a
                LEFT JOIN (
                    SELECT *
                    FROM ciq.wrds_ciqsymbol 
                    WHERE symboltypecat = 'gvkey'
                ) AS b
                  ON a.companyid = b.companyid
                LEFT JOIN ciq.wrds_transcript_person AS c 
                  ON a.transcriptid = c.transcriptid
                LEFT JOIN ciq.ciqtranscriptcomponent AS d 
                  ON c.transcriptid = d.transcriptid 
                 AND c.transcriptcomponentid = d.transcriptcomponentid
                ORDER BY a.transcriptid, c.transcriptcomponentid, a.companyid;
                '''
    tr_detail_gvkey = db.raw_sql(sql_query_2)
    
    # Getting the initial data into a df
    tscrpt_year = tr_detail_gvkey[tr_detail_gvkey.gvkey.isin(snp500_crsp_gvkey.gvkey.tolist())]
    tscrpt_year = tscrpt_year[pd.notna(tscrpt_year.gvkey)]
    tscrpt_year = tscrpt_year.reset_index(drop=True)

    # Note if empty
    if tscrpt_year.empty:
        print("Year", year, "has no data.")
        continue

    # Dropping unnecessary and duplicated columns
    cols_to_drop_1 = ['companyid', 'headline', 'mostimportanttimeutc', 'keydeveventtypeid', 'keydeveventtypename',
                      'transcriptcollectiontypeid', 'transcriptcollectiontypename', 'transcriptpresentationtypeid',
                      'transcriptpresentationtypename', 'transcriptcreationdate_utc', 'transcriptcreationtime_utc', 'audiolengthsec',
                      'transcriptcomponenttypeid', 'proid', 'companyofperson', 'speakertypeid', 'componenttextpreview']
    tscrpt_year = tscrpt_year.drop(columns=cols_to_drop_1)
    tscrpt_year = tscrpt_year.loc[:, ~tscrpt_year.columns.duplicated(keep='first')] 
    tscrpt_year = tscrpt_year.drop_duplicates(subset=['keydevid', 'componentorder'], keep='first')
    tscrpt_year = tscrpt_year.dropna()
    
    # Sorting the rows
    tscrpt_year = tscrpt_year.sort_values(by=['mostimportantdateutc', 'keydevid', 'componentorder'])
    tscrpt_year = tscrpt_year.reset_index(drop=True)
    
    # Initializing a new df
    tscrpt_year_full = pd.DataFrame()
    
    # Going through each group and removing all Q&A sections
    for keydevid, group in tscrpt_year.groupby('keydevid'):
    
        # Initializing the Question and Answer Operator Message, Question, and Answer indices
        QAOM_idx = np.inf
        Q_idx = np.inf
        A_idx = np.inf
    
        # Checking if "Question and Answer Operator Message" index exists (.index[0] raises IndexError if there is no match)
        try:
            QAOM_idx = group[group['transcriptcomponenttypename'] == 'Question and Answer Operator Message'].index[0]
        except IndexError:
            pass
    
        # Checking if "Question" index exists
        try:
            Q_idx = group[group['transcriptcomponenttypename'] == 'Question'].index[0]
        except IndexError:
            pass
    
        # Checking if "Answer" index exists
        try:
            A_idx = group[group['transcriptcomponenttypename'] == 'Answer'].index[0]
        except IndexError:
            pass
    
        # Finding the first Q&A index in group (if exists), then removing all entries at and after that index
        idx_to_remove = np.min([QAOM_idx, Q_idx, A_idx])
        if idx_to_remove != np.inf:
            group = group[group.index < idx_to_remove]
        tscrpt_year_full = pd.concat([tscrpt_year_full, group])
        
    # Resetting the index
    tscrpt_year_full = tscrpt_year_full.reset_index(drop=True)

    # Initializing dictionary for final dataframe
    full_tscrpt_dict_year = {
        'Unique_transcript_id': [],
        'Date': [],
        'Company_name': [],
        'Word_count': [],
        'Text': []
    }
    
    # Going through each transcript to populate the full_tscrpt_dict_year for the final df
    for keydevid, group in tscrpt_year_full.groupby('keydevid'):
        
        # Appending the keydevid to get the Unique_transcript_id
        first_line = group.iloc[0]
        full_tscrpt_dict_year['Unique_transcript_id'].append(keydevid)
    
        # Getting the date, company name, the word count total, and the merged text
        full_tscrpt_dict_year['Date'].append(first_line['mostimportantdateutc'])
        full_tscrpt_dict_year['Company_name'].append(first_line['companyname'])
        full_tscrpt_dict_year['Word_count'].append(group['word_count'].sum())
        full_tscrpt_dict_year['Text'].append(" \n".join(group['componenttext'].tolist()))
    
    # Transforming into a dataframe and sorting
    full_tscrpt_df_year = pd.DataFrame.from_dict(full_tscrpt_dict_year)
    full_tscrpt_df_year = full_tscrpt_df_year.sort_values(by=['Date', 'Company_name'])
    full_tscrpt_df_year = full_tscrpt_df_year.reset_index(drop=True)

    # Getting the final df to a csv
    full_tscrpt_df_year.to_csv(csv_name, sep='|')
    print("Done with year", year)

  4%|█▋                                          | 1/26 [00:00<00:11,  2.19it/s]

Year 2000 has no data.


  8%|███▍                                        | 2/26 [00:00<00:07,  3.04it/s]

Year 2001 has no data.


 12%|█████                                       | 3/26 [00:00<00:06,  3.44it/s]

Year 2002 has no data.


 15%|██████▊                                     | 4/26 [00:01<00:06,  3.66it/s]

Year 2003 has no data.


 19%|████████▍                                   | 5/26 [00:19<02:26,  6.96s/it]

Year 2004 has no data.
Already done with year 2005
Already done with year 2006
Already done with year 2007
Already done with year 2008
Already done with year 2009
Already done with year 2010
Already done with year 2011


 50%|█████████████████████▌                     | 13/26 [04:32<05:48, 26.79s/it]

Done with year 2012


 54%|███████████████████████▏                   | 14/26 [08:09<10:07, 50.66s/it]

Done with year 2013


 58%|████████████████████████▊                  | 15/26 [11:53<14:07, 77.08s/it]

Done with year 2014


 62%|█████████████████████████▊                | 16/26 [15:52<17:39, 105.94s/it]

Done with year 2015


 65%|███████████████████████████▍              | 17/26 [19:44<19:44, 131.61s/it]

Done with year 2016


 69%|█████████████████████████████             | 18/26 [24:30<22:09, 166.21s/it]

Done with year 2017


 73%|██████████████████████████████▋           | 19/26 [30:28<24:49, 212.84s/it]

Done with year 2018


 77%|████████████████████████████████▎         | 20/26 [36:01<24:23, 243.94s/it]

Done with year 2019


 81%|█████████████████████████████████▉        | 21/26 [42:49<24:01, 288.24s/it]

Done with year 2020


 85%|███████████████████████████████████▌      | 22/26 [45:51<17:14, 258.67s/it]

Done with year 2021


 88%|█████████████████████████████████████▏    | 23/26 [48:50<11:47, 235.91s/it]

Done with year 2022


 92%|██████████████████████████████████████▊   | 24/26 [51:55<07:22, 221.22s/it]

Done with year 2023


 96%|████████████████████████████████████████▍ | 25/26 [55:08<03:33, 213.15s/it]

Done with year 2024


100%|██████████████████████████████████████████| 26/26 [56:43<00:00, 130.90s/it]

Year 2025 has no data.





In [7]:
pd.read_csv('./Individual_transcripts_by_year/full_tscrpt_df_2015.csv', sep='|')

Unnamed: 0.1,Unnamed: 0,Unique_transcript_id,Date,Company_name,Word_count,Text
0,0,279483592.0,2015-01-06,"Micron Technology, Inc.",5184,"Good afternoon. My name is Karen, and I'll be ..."
1,1,265970735.0,2015-01-07,Monsanto Company,6631,"Greetings, and welcome to the First Quarter Fi..."
2,2,276234923.0,2015-01-08,"20230930-DK-Butterfly-1, Inc.",4326,Welcome to the Bed Bath & Beyond's Third Quart...
3,3,279701680.0,2015-01-08,"Constellation Brands, Inc.",3188,"Ladies and gentlemen, thank you for standing b..."
4,4,273584744.0,2015-01-12,Alcoa Inc.,7601,"Good day, ladies and gentlemen, and welcome to..."
...,...,...,...,...,...,...
1845,1845,319427233.0,2015-12-21,Cintas Corporation,1821,"Good day, everyone, and welcome to the Cintas ..."
1846,1846,313123544.0,2015-12-22,"Conagra Brands, Inc.",3466,"Good morning, and welcome to today's ConAgra F..."
1847,1847,317447411.0,2015-12-22,"Micron Technology, Inc.",4191,"Good afternoon. My name is Jonathan, and I wil..."
1848,1848,318461548.0,2015-12-22,"NIKE, Inc.",5642,"Good afternoon, everyone. Welcome to NIKE's Fi..."
