## Warning

This file is just a set of exercises that help you build `build_sample.ipynb`. Here, we solve the little problems you'll run into as you write it.

When you create `build_sample.ipynb`, do it from scratch, put the pseudocode structure in place, and proceed from there.

## First

- Copy `NEAR_regex.py` into the same folder as this file. [It's here](https://ledatascifi.github.io/ledatascifi-2025/content/04/02d_RegexApplication.html#demo) (click the "+") or in the community codebook. You should name the file `NEAR_regex.py` and not `NEAR_regex.ipynb`.
- Also copy the `10k_files_practice.zip` file from there into this folder.
- Make a `.gitignore` file in this folder with `**10k_files/*` in it.
- Copy [this file](https://github.com/donbowen/Class-Notes-1045/raw/main/Midterm%20sandbox/10k_files.zip) into the `10k_files/` folder here.
- Copy the things in the assignment's input folder into the inputs folder here.
- Optional: You can install `tqdm` (If you don't, then remove it from the code below.)


In [1]:
import fnmatch
import glob
import os
import re
from time import sleep
from zipfile import ZipFile

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from utils.near_regex import * # this imports all the public names from near_regex
from tqdm import tqdm  # progress bar on loops

# if you have tqdm issues, run this in terminal or with ! trick
# jupyter nbextension enable --py widgetsnbextension
# jupyter labextension install @jupyter-widgets/jupyterlab-manager
#
# if that fails, you can disable it

os.makedirs("output", exist_ok=True)

## Load sentiment dictionaries

In [2]:
# Q1 - load the ML negative words into a list called BHR_negative
# BHR stands for the author names on that paper
# "ML" might be a better name, but having "LM" and "ML" is bound
# to cause transcription errors

with open('inputs/ML_negative_unigram.txt', 'r') as file:
    BHR_negative = [line.strip().lower() for line in file]

BHR_negative.sort()
# BHR_negative

In [3]:
# Q2 - load the ML positive words into a list called BHR_positive
with open('inputs/ML_positive_unigram.txt', 'r') as file:
    BHR_positive = [line.strip().lower() for line in file]

len(BHR_negative), len(BHR_positive)
BHR_positive.sort()
# BHR_positive # not exhaustive word forms!

In [4]:
# Q3 - load the LM positive words into a list called LM_positive
file_path = "inputs/LM_MasterDictionary_1993-2021.csv"  # Update with actual path
df = pd.read_csv(file_path)
LM_positive = df[df['Positive'] > 0]['Word'].tolist()
LM_positive = [e.lower() for e in LM_positive] # to be consistent with our BHR input
df.describe() # there are negative numbers in the columns: years the word is removed!
len(LM_positive)
# LM_positive

347

In [5]:
# Q4 - load the LM negative words into a list called LM_negative

LM_negative = df[df['Negative'] > 0]['Word'].tolist()
LM_negative = [e.lower() for e in LM_negative] # to be consistent with our BHR input
# LM_negative

## Looping over a dataframe and adding a variable

In [6]:
import random

# step 1 will load some database and prep it for the loopy parts
# here, we will just use a toy dataset

toy_database = pd.DataFrame({"Security":['3M','TLSA','APPL'],
             "URL":['blahblah.com','wikisomething.com','wiki.com']})
toy_database

  Security                URL
0       3M       blahblah.com
1     TLSA  wikisomething.com
2     APPL           wiki.com


In [7]:
# step 2: figure out how to loop through this dataframe 
# yes, an actual for loop on a dataframe (booooooo)
for index, row in toy_database.iterrows(): 
    # print the row's index to confirm that we are looping correctly
    print("======")
    print(index)
    # print(row)
    # print(row['Security']) # you can easily grab a variable in that row!

    # A. here, you would open the related 10k, but SKIP this for now

    # B. You'd measure the sentiment here. Let's just pretend that 
    # you opened+cleaned+built a sentiment variable 
    # called "sentiment_positive" (a bad name, but this is just example code!)

    sentiment_positive = random.randint(0,10) # a silly line to "simulate" that you got a value for this variable

    toy_database.at[index,'Sentiment Positive'] = sentiment_positive
    print(toy_database)


0
  Security                URL  Sentiment Positive
0       3M       blahblah.com                 8.0
1     TLSA  wikisomething.com                 NaN
2     APPL           wiki.com                 NaN
1
  Security                URL  Sentiment Positive
0       3M       blahblah.com                 8.0
1     TLSA  wikisomething.com                 4.0
2     APPL           wiki.com                 NaN
2
  Security                URL  Sentiment Positive
0       3M       blahblah.com                 8.0
1     TLSA  wikisomething.com                 4.0
2     APPL           wiki.com                10.0


## Measure sentiment

What fraction of the words in this "document" (sentence) are "happy" words?

Answer: 2/13. Let's replicate that with code. 

First, count the length of the document.

Then, count how many times each happy word appears in the document.

In [8]:
happy_sentiment = ['happy','smile','hopeful']

sentence = '''I am happy that you are here. I am all smiles.      
    

So hopeful!'''  # I ripped this up to show split is robust to line breaks and extra spaces 

# q0 count the number of "words" (the doc length)
len(sentence.split())

13

In [9]:
# q1 count how many times "happy" is used in the doc
# hint: https://ledatascifi.github.io/ledatascifi-2025/content/04/02b_regex.html
re.findall(" hap", sentence) # it looks for the sequence of characters you ask for ... which may or may not be words

[' hap']

In [10]:
print('poer .\t.  eoriufh')   # in strings, the meaning of a backslash depends on the next char
print(r'poer .\t.  eoriufh')  # r' means the string is "raw" 
print('poer .\\t.  eoriufh')  # uglier equivalent: \\ means \

poer .	.  eoriufh
poer .\t.  eoriufh
poer .\t.  eoriufh


In [11]:
# q2 count how many times "smile" is used in the doc
len(re.findall("smile", sentence)) # 1 hit, but a false positive: it matches the "smile" inside "smiles"

len(re.findall("smile$", sentence)) # maya v1: look for smile at the end of the string (wrong)

len(re.findall(r"\bsmile\b", sentence)) # correct upgrade: match the full exact word only



0

In [12]:
# q3 count how many times "smile" or "happy" is used in the doc
# hint: similar to q2 answer... 
# the answer is somewhere on this page: https://regexone.com/

# trick: always put r before the quotes, which means the stuff inside
# is interpreted literally
re.findall(r"\b(happy|smile)\b", sentence) # it looks for the sequence of characters you ask for ... which may or may not be words

# equivalent: \\ "means" \ 
re.findall("\\b(happy|smile)\\b", sentence) # it looks for the sequence of characters you ask for ... which may or may not be words

['happy']

In [13]:
# q4 - prof demo - count how many times all the words in happy_sentiment are in the doc
# 4.4.4 has examples + output
# docstring: https://github.com/LeDataSciFi/ledatascifi-2025/blob/main/community_codebook/near_regex.py
# solve 

len(re.findall(r"\b(smile|happy|hopeful)\b", sentence))

re.findall("\\b(happy|smile|hopeful)\\b", sentence) # it looks for the sequence of characters you ask for ... which may or may not be words

['happy', 'hopeful']

In [14]:
# q5 - using py's string functions, convert
# happy_sentiment into the format NEAR_regex() wants 
# hint: 4.4.1
r'\b('+'|'.join(happy_sentiment)+r')\b'

# this would look for exact word matches in an html string and count them
# LM_neg_regx = r'\b('+'|'.join(LM_negative)+r')\b' # works for our sentiment!
# len(re.findall(LM_neg_regx, html_cleaned.lower(), ))

'\\b(happy|smile|hopeful)\\b'

In [15]:
# q6 - calculate the doc's happy_sentiment score

pos_regx = r'\b('+'|'.join(happy_sentiment)+r')\b'
pos_hits = len(re.findall(pos_regx, sentence.lower()))

doc_length = len(sentence.split()) 

pos_hits / doc_length
               

0.15384615384615385
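
The steps in q0 through q6 can be bundled into one helper, which is handy once you repeat this for several word lists. This is only a sketch: `sentiment_score` is a made-up name (it is not from the book or `near_regex.py`), and the `re.escape` call is an extra precaution, assuming a word list might someday contain regex special characters.

```python
import re

def sentiment_score(word_list, document):
    """Share of the document's words that exactly match a word in word_list.

    Hypothetical helper: same logic as q6, wrapped in a function.
    """
    words = document.split()
    if not words or not word_list:   # nothing to count, and avoids dividing by zero
        return 0.0
    # build \b(word1|word2|...)\b so only whole words are counted
    pattern = r"\b(" + "|".join(re.escape(w) for w in word_list) + r")\b"
    hits = len(re.findall(pattern, document.lower()))
    return hits / len(words)

example = "I am happy that you are here. I am all smiles. So hopeful!"
score = sentiment_score(["happy", "smile", "hopeful"], example)
print(score)  # 2 hits / 13 words, the same 2/13 as above
```

In the midterm loop, you could call a helper like this once per word list on the cleaned 10-K text.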

## Anchor phrases

In [16]:
# q7: how many times is (happy or smile) near (face or head)? 

body_parts = ['face','head']

sentence1 = 'I see smile on your face. That is so awesome!'
sentence2 = 'I see smile. That is so awesome!'

# do on sentence1:
# using py's string functions, convert body_parts into the format NEAR_regex() wants
# then use NEAR_regex()

# do on sentence2

NEAR_finder(body_parts, 
           ["happy","smile"], 
           sentence1)

(1, ['smile on your face'])
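
To see what a "near" search does under the hood, here is a stripped-down stand-in written with plain `re`. It is not `NEAR_regex()` or `NEAR_finder()` and makes no claim to match their options or output format; `find_near` and `max_gap` are hypothetical names, and the rule "at most `max_gap` words in between, in either order" is an assumption chosen for illustration.

```python
import re

def find_near(words_a, words_b, text, max_gap=5):
    """Return phrases where a word from words_a and a word from words_b
    appear with at most max_gap words between them, in either order.

    Hypothetical sketch, not the course's NEAR_regex().
    """
    a = "|".join(re.escape(w) for w in words_a)
    b = "|".join(re.escape(w) for w in words_b)
    # up to max_gap "filler" words, each preceded by non-word characters
    gap = r"(?:\W+\w+){0," + str(max_gap) + r"}?\W+"
    pattern = rf"\b(?:(?:{a}){gap}(?:{b})|(?:{b}){gap}(?:{a}))\b"
    return re.findall(pattern, text.lower())

sentence1 = 'I see smile on your face. That is so awesome!'
sentence2 = 'I see smile. That is so awesome!'

hits1 = find_near(["happy", "smile"], ["face", "head"], sentence1)
hits2 = find_near(["happy", "smile"], ["face", "head"], sentence2)
print(len(hits1), hits1)  # → 1 ['smile on your face']
print(len(hits2), hits2)  # → 0 []
```

For the midterm itself, stick with the function from `near_regex.py`; this sketch is only meant to demystify the pattern.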

## Opening a 10-K file

I'm giving everyone this code because dealing with Zips is a headache the first 15 times you do it.
- Open the zip before the loop and get a list of all files already in it
- With that zip open, do your loopy stuff inside it

In [17]:
# open the zip file (do this before the for loop
# so you only open it one time... faster)
with ZipFile('10k_files/10k_files_practice.zip','r') as zipfolder:
    
    # before the loop, get list of files in zipped folder
    file_list = zipfolder.namelist()
        
    # replace this with how you'd loop over the dataframe
    # which you already know...
    for firm in [1800]: # 
        
        # get a list of possible files for this firm
        firm_folder    = f"sec-edgar-filings/{str(firm).zfill(10)}/10-K/*/*.html"
        possible_files = fnmatch.filter(file_list, firm_folder) 
        if len(possible_files) == 0: 
            continue
            
        fpath = possible_files[0] # the first match is the path to the file

        # open the file (this is a little different!)
        with zipfolder.open(fpath) as report_file:
            html = report_file.read().decode(encoding="utf-8")
            
        # do more stuff here...
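
If you want to test the zip logic without the class files, the sketch below builds a tiny zip in memory and runs the same open-once, filter, then read pattern. The file names and contents are made up. Pulling the accession number out of the path assumes the folder layout is `sec-edgar-filings/<cik>/10-K/<accession>/<file>.html`, which matches the pattern used above but is worth checking against your own `file_list`.

```python
import fnmatch
import io
from zipfile import ZipFile

# build a tiny zip in memory (made-up paths and contents, for illustration only)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as z:
    z.writestr("sec-edgar-filings/0000001800/10-K/0000000000-22-000001/primary.html",
               "<html>firm 1800</html>")
    z.writestr("sec-edgar-filings/0000002000/10-K/0000000000-22-000002/primary.html",
               "<html>firm 2000</html>")
buffer.seek(0)

found_html = {}   # firm -> raw html
accession = {}    # firm -> accession number taken from the path

with ZipFile(buffer, "r") as zipfolder:
    file_list = zipfolder.namelist()            # read the listing once
    for firm in [1800, 2000, 9999]:             # 9999 is not in the zip
        pattern = f"sec-edgar-filings/{str(firm).zfill(10)}/10-K/*/*.html"
        matches = fnmatch.filter(file_list, pattern)
        if not matches:
            continue                            # no filing for this firm
        fpath = matches[0]
        accession[firm] = fpath.split("/")[3]   # assumed position of the accession folder
        with zipfolder.open(fpath) as report_file:
            found_html[firm] = report_file.read().decode("utf-8")

print(accession)
```

Firms with no matching file are skipped, so in the real loop their sentiment variables would simply stay missing.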

## Cleaning the html

Print out `html`... 

In [18]:
html[:500]

'<?xml version=\'1.0\' encoding=\'UTF-8\'?>\n\n      <!-- iXBRL document created with: Toppan Merrill Bridge iXBRL 9.6.8042.36810 -->\n      <!-- Based on: iXBRL 1.1 -->\n      <!-- Created on: 2/18/2022 12:53:13 AM -->\n      <!-- iXBRL Library version: 1.0.8042.36816 -->\n      <!-- iXBRL Service Job ID: f92a8d11-abb5-46dc-a356-1f63ff59b8d5 -->\n\n  <html xmlns:us-gaap="http://fasb.org/us-gaap/2021-01-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:country="http://xbrl.sec.gov/country/2021" xmlns:'

Regex won't work on this as is! We need to remove all the html tags, drop the hidden data, and then, with the remaining text, clean it up using the "Good ideas" in 4.4 and 4.4.4 of the book. However, we have to slightly adjust the code. 

1. Use BeautifulSoup() with the `lxml-xml` parser. Call the output `soup`. Don't use `get_text` yet. 
1. Delete the hidden XBRL 

    ```python
    for div in soup.find_all("div", {'style':'display:none'}): 
        div.decompose()
    ```
    
1. Continue on (get the text from the soup, and continue from there...)
2. Check: My cleaned string says ______ in positions ___-___
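
Here is how those steps look on a toy string, so you can see the effect of each one before trying it on a real 10-K. This is a sketch under assumptions: `toy_html` is invented, and the last line (collapse whitespace, strip, lowercase) is one reasonable reading of the book's "Good ideas", not necessarily the exact sequence used there. It needs `bs4` and `lxml` installed.

```python
import re
from bs4 import BeautifulSoup

# made-up miniature "filing": one hidden div and one visible paragraph
toy_html = ('<html><body>'
            '<div style="display:none">HIDDEN XBRL DATA</div>'
            '<p>Total   revenue\n increased.</p>'
            '</body></html>')

soup = BeautifulSoup(toy_html, "lxml-xml")

# delete the hidden XBRL before extracting any text
for div in soup.find_all("div", {"style": "display:none"}):
    div.decompose()

text = soup.get_text()

# collapse runs of whitespace, trim the ends, and lowercase
cleaned = re.sub(r"\s+", " ", text).strip().lower()
print(cleaned)
```

On a real filing, print a slice of `cleaned` and compare it with the rendered 10-K to confirm the hidden data is gone.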

In [19]:
# work here

## Get 10-K dates

We need to know when the 10-K is released to see the stock returns around it.

I'm going to give you most of this code. How I figured it out:
- I know we have the CIK and accession number
- Looked for EDGAR urls that have CIK + accession number, and then list filing date on the page
- https://www.sec.gov/Archives/edgar/data/1122304/0001193125-15-118890-index.html
- `requests_html` ([my listed suggestion here](https://ledatascifi.github.io/ledatascifi-2025/content/04/01_Intro_to_scraping.html#my-suggestion)) is the `requests` module for getting data from the web PLUS the ability to grab parts of the html
    
Exercise:
- I used code straight off the [documentation's home page](https://requests.readthedocs.io/projects/requests-html/en/latest/), adapted slightly. Look for examples that _find_ parts of the html.
- You'll need to figure out the "CSS Selector"
    - right click on the filing date on the webpage, click inspect
    - in the area that popped up, right click on html code containing that date and copy the CSS selector


In [20]:
# before the loop, set up a browser session

from requests_html import HTMLSession

# will use requests_html to look for filing date
# the headers line of code is requested by the SEC servers https://www.sec.gov/os/accessing-edgar-data
# and you should hit at most 10 pages a second, else your bot will start getting bad data (sleep(0.1) between pages)

session = HTMLSession()
session.headers.update({'User-Agent':'Donald Bowen deb219@lehigh.edu'}) #update your name/email

# inside your loop, get the cik and accession number for the filing

cik = 1122304
accession_number = '0001193125-15-118890'

# *one* way to get the filing date... 

url = f'https://www.sec.gov/Archives/edgar/data/{cik}/{accession_number}-index.html'
print(url) # check it out...
r = session.get(url)

https://www.sec.gov/Archives/edgar/data/1122304/0001193125-15-118890-index.html


In [21]:
# EXERCISE: get the filing date out of this "r" object (one line of code will do)


To use this in your actual midterm, save the `accession_number` to your database while doing the main for loop that parses the text.

Then, after that, I wrote a second for loop that loops over the rows, and uses the code above to grab the date. I added some error checking (What if we don't have an accession number for that firm, what if the url is wrong, or the server denies you, or the line of code with filing_date fails?)
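
The error checking can be sketched as a small wrapper around whatever code fetches the date. Everything here is hypothetical: `safe_filing_date` is a made-up name, and `fetch_date` stands in for your own function that requests the URL and pulls out the filing date (the exercise above). The fake fetcher and its return value exist only so the example runs without touching the SEC servers.

```python
from time import sleep

def safe_filing_date(cik, accession_number, fetch_date, pause=0.1):
    """Return the filing date, or None if it can't be retrieved.

    fetch_date is a placeholder for your own function: url -> filing date.
    """
    # missing accession numbers show up as NaN/None in a dataframe, not as strings
    if not isinstance(accession_number, str) or not accession_number:
        return None
    url = f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession_number}-index.html"
    try:
        return fetch_date(url)
    except Exception as err:          # bad url, server refusal, selector not found, ...
        print(f"could not get filing date from {url}: {err}")
        return None
    finally:
        sleep(pause)                  # stay under the request limit after every attempt

# stand-in fetcher so the sketch runs offline (invented behavior and date)
def fake_fetch(url):
    if "bad" in url:
        raise ValueError("page not found")
    return "2022-01-31"

rows = [(111, "good-accession"), (222, "bad-accession"), (333, float("nan"))]
dates = [safe_filing_date(cik, acc, fake_fetch, pause=0) for cik, acc in rows]
print(dates)  # → ['2022-01-31', None, None]
```

Returning `None` instead of raising lets the second loop finish, and you can inspect the rows with missing dates afterwards.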

## Get returns around the 10-K dates

[Returns for 2022 are here.](https://github.com/LeDataSciFi/data/blob/main/Stock%20Returns%20(CRSP))

Before you try to use that, below is a toy dataset of returns and filing dates that mimic the structure of the data you'll actually have. 

Goals, in **reverse** order:
1. What are the [0,2] and [3,10] cumulative returns for each firm? It's easy to actually figure out! Doing so will help you with the pseudocode, and in any case... how can you know if you're right otherwise?
2. Make an intermediate dataset with these variables (which is enough to answer goal 1). 
   - ticker
   - date
   - ret
   - trading_days_since_filing (0 on the filing date or the first trading day after it). This is what the midterm calls for.
3. Find the answer manually in excel. `crsp_example.xlsx` is in the handouts folder. This will help you find the steps you need to take. 

If you figured out the bonus on assignment 4, you're set!

In [22]:
data = {
    'ticker': ['JJSF']*20 + ['TSLA']*20,
    'date': ['2021-12-01', '2021-12-02', '2021-12-03', '2021-12-06', '2021-12-07', '2021-12-08', '2021-12-09', '2021-12-10', '2021-12-13', '2021-12-14', '2021-12-15', '2021-12-16', '2021-12-17', '2021-12-20', '2021-12-21', '2021-12-22', '2021-12-23', '2021-12-27', '2021-12-28', '2021-12-29'] + ['2022-12-02', '2022-12-05', '2022-12-06', '2022-12-07', '2022-12-08', '2022-12-09', '2022-12-12', '2022-12-13', '2022-12-14', '2022-12-15', '2022-12-16', '2022-12-19', '2022-12-20', '2022-12-21', '2022-12-22', '2022-12-23', '2022-12-27', '2022-12-28', '2022-12-29', '2022-12-30'],
    'ret': [-0.011276, 0.030954, 0.000287, 0.014362, 0.012459, 0.017200, -0.010173, 0.011875, 0.012559, 0.002508, 0.022852, 0.012360, 0.017387, -0.008957, 0.016840, -0.000256, -0.002558, 0.009041, -0.002097, 0.010189] + [0.000822, -0.063687, -0.014415, -0.032143, -0.003447, 0.032345, -0.062720, -0.040937, -0.025784, 0.005548, -0.047187, -0.002396, -0.080536, -0.001669, -0.088828, -0.017551, -0.114089, 0.033089, 0.080827, 0.011164]
}

crsp = pd.DataFrame(data)
crsp['date'] = pd.to_datetime(crsp['date'])

fake_filings = pd.DataFrame({'ticker':['JJSF','TSLA'],
                             'filing_date':['2021-12-03','2022-12-13']})

In [23]:
# try here...

# pseudocode first! imagine the structure of the dataset you want and work backwards, you'll struggle otherwise!

# this really is a paper and pencil problem
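
Once you have tried it on paper, one possible structure looks like the sketch below. It uses its own invented one-firm dataset (not the `crsp` data above), and it assumes "cumulative return" means compounding, i.e. the product of (1 + ret) minus 1; check the midterm directions in case a different definition is wanted. `cumulative_return` is a made-up helper name.

```python
import pandas as pd

# invented toy data: one firm, six trading days
returns = pd.DataFrame({
    "ticker": ["AAA"] * 6,
    "date": pd.to_datetime(["2022-03-01", "2022-03-02", "2022-03-03",
                            "2022-03-04", "2022-03-07", "2022-03-08"]),
    "ret": [0.01, 0.02, -0.01, 0.03, 0.00, 0.01],
})
filings = pd.DataFrame({"ticker": ["AAA"],
                        "filing_date": pd.to_datetime(["2022-03-02"])})

# attach the filing date to every return row, then keep days on or after filing
merged = returns.merge(filings, on="ticker", how="inner", validate="m:1")
merged = merged[merged["date"] >= merged["filing_date"]]
merged = merged.sort_values(["ticker", "date"])

# 0 on the filing date (or the first trading day after it), then 1, 2, ...
merged["trading_days_since_filing"] = merged.groupby("ticker").cumcount()

def cumulative_return(df, start, end):
    """Compound ret over trading days start..end (inclusive), per ticker."""
    window = df[df["trading_days_since_filing"].between(start, end)]
    return (1 + window["ret"]).groupby(window["ticker"]).prod() - 1

ret_0_2 = cumulative_return(merged, 0, 2)
ret_3_4 = cumulative_return(merged, 3, 4)   # the toy data is too short for [3,10]
print(ret_0_2)
```

Because the counter is built from rows sorted by date, weekends and holidays are skipped automatically: it counts trading days, not calendar days.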


## Put it all together

The readme shows what the output dataset should look like, roughly. The midterm directions elaborate (10 sentiment variables, 2 return measures). 