# Economics Text Cleaning by Group
This notebook applies text-cleaning functions to a collection of documents that we identified as sharing a similar format. The goal is to strip the fluff around the edges (headers, footers, pictures, reference pages, and the like) and keep only the body text. Because the documents are grouped by formatting, we hope to apply the same cleaning functions to the whole directory.

## Patterns across documents:
Write down patterns that you observe across the documents in the group and that can later be translated into code for cleaning them up.

**Title pages:**
- JSTOR cover page, except in 2020 and 2022
- Sometimes page saying that it will show pictures of presidents
- Then a picture of the president, usually just with a name

**First Page of text:**
- Has link at top like " American Economic Review 2008, 98 1, 5-37 (line break)http //www aeaweb org/articles php?doi=10 1257/aer98 1 5 "
- Then name of the article with a fancy symbol next to it that translates to a "T", "1", or sometimes a lowercase letter: "Evolution and Intelligent DesignT" 
- Author line: "By Thomas J. Sargent*"
- Might just be able to get rid of everything after "Presidential address delivered at"
- Also a line starting with * that gives the author bio, but in some text extractions it doesn't start on its own line

**Headers:**
- Alternates between volume - title - page number and page number - journal name - year
- Sometimes with line breaks, sometimes not, but in all caps and at top of page
- Edge case: OCR error in the page number:
    -  "h36\nTHE AMERICAN ECONOMIC REVIEW\nAPRIL 2012"
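
A minimal sketch of how the two header shapes might be matched, assuming lines have already been stripped and that a header never runs past the first three lines of a page. The function `strip_running_header` and both patterns are hypothetical and tuned only to the examples quoted in these notes.

```python
import re

# Hypothetical pattern for even pages: an optional (possibly OCR-damaged) page
# number, the journal name, then "MONTH 20xx". Stray spaces inside the year
# (e.g. "201 5") are tolerated.
JOURNAL_HEADER = re.compile(
    r"\A\s*(?:\S{1,5}\s+)?THE AMERICAN ECONOMIC REVIEW\s+[A-Z]+\s+2\s?0\s?\d\s?\d[ \t]*\n?"
)
# Hypothetical pattern for odd pages: up to two all-caps lines, then a line
# containing "VOL. <n> NO. <n>".
VOL_HEADER = re.compile(
    r"\A(?:[^a-z\n]*\n){0,2}[^a-z\n]*VOL\.\s*\d+\s+NO\.\s*\d+[^a-z\n]*\n?"
)


def strip_running_header(page_text: str) -> tuple[str, str]:
    """Return (header, body); header is "" when neither pattern matches."""
    for pattern in (JOURNAL_HEADER, VOL_HEADER):
        match = pattern.match(page_text)
        if match:
            return match.group(0), page_text[match.end():]
    return "", page_text


header, body = strip_running_header("h36\nTHE AMERICAN ECONOMIC REVIEW\nAPRIL 2012\nbody text here")
print(repr(header), repr(body))
```

Returning the captured header as well as the body makes it possible to save what was removed and check it by eye.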

**Footers:** "This content downloaded from..." appears on all pages; remove it first

**Page Numbers:** In the headers, except on the first page of text, where the number is at the bottom

**Section headers:** Uses roman numerals for big section headers then letters for subsections.

**Equations:** 
- Numbered; sometimes extracted OK from the LaTeX, sometimes garbled

" (1) ct = yt + Dt - (1 + rDj_x)Dt_x. 

 I let

 
 (2) st = rDj_xDt_x - AD? "

**Weird problems:**
- Subsection headers sometimes end up at the bottom of the page along with the equations (see economics-2011-0-24)
- Words split over line breaks sometimes have "?" instead of "-", or no notation at all
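
A rough sketch of rejoining split words, covering only the "-" and "?" cases. The helper `rejoin_split_words` is hypothetical. It assumes a genuine question mark is followed by a capital letter or a quote, so requiring a lowercase letter after the line break should leave most real questions alone. Words split with no marker at all would need a dictionary lookup and are not handled here.

```python
import re

# A lowercase fragment, then "-" or "?" at the end of a line, then a lowercase
# continuation at the start of the next line.
SPLIT_WORD = re.compile(r"([a-z]+)[-?]\n([a-z]+)")


def rejoin_split_words(text: str) -> str:
    """Join words broken across a line break, dropping the marker and the break."""
    return SPLIT_WORD.sub(r"\1\2", text)


print(rejoin_split_words("the equi?\nlibrium price"))
```

One known limitation: a real hyphenated compound that happens to break at the hyphen (for example "long-" followed by "run") is also merged into one word.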


**Citations:** Bibliography after "REFERENCES" line in all documents
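
A small sketch of cutting a page at the REFERENCES line, assuming the heading sits alone on its own line. The helper `split_at_references` is hypothetical. The length of the returned body gives a quick way to tell whether the bibliography starts on its own page or shares a page with body text.

```python
import re

# "REFERENCES" alone on a line, as in the bibliography heading.
REFERENCES_LINE = re.compile(r"^\s*REFERENCES\s*$", re.MULTILINE)


def split_at_references(page_text: str) -> tuple[str, bool]:
    """Return (text before the REFERENCES line, whether the line was found)."""
    match = REFERENCES_LINE.search(page_text)
    if match is None:
        return page_text, False
    return page_text[: match.start()].rstrip(), True


body, found = split_at_references("Last paragraph.\nREFERENCES\nAcemoglu, D. 2001.")
print(found, len(body))
```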

**Footnotes:** 
- On the first page with symbols for title and author, then numbers later on
- The number usually comes right after a period, like this.6
- First footnote always gets a new line, subsequent ones might not.
- Sometimes space between number and word, sometimes not
- Might be hard to find and replace if there are other numbers on the page (e.g. in tables)
- Sometimes separated with a line and carried on from previous page
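
A hedged sketch of flagging likely footnote markers in body text. The pattern and the helper `footnote_numbers` are hypothetical. It looks for a one- or two-digit number right after sentence punctuation that itself follows a letter, a closing parenthesis, or a quote, which skips decimals such as 3.5. It will still produce false positives, for example a sentence that begins with a number, so it is only meant to mark pages for manual checking.

```python
import re

# Lookbehind: a lowercase letter, ")" or '"' followed by ".", "," or ";".
# Then an optional space, the 1-2 digit marker, and a boundary after it.
FOOTNOTE_MARK = re.compile(r'(?<=[a-z)"][.,;])\s?(\d{1,2})(?=\s|[A-Z]|$)')


def footnote_numbers(text: str) -> list[int]:
    """Return candidate footnote numbers in order of appearance."""
    return [int(m.group(1)) for m in FOOTNOTE_MARK.finditer(text)]


sample = "as shown before.6 The model is simple.7\nCosts rose 3.5 percent."
print(footnote_numbers(sample))
```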


**Appendices:** Presumably wouldn't have been spoken, but still quite a few pages (see Economics-2008-0-24). Maybe treat like references

**Tables and pictures:**
- Pretty common
- Maybe find them by looking for pages that have very few words
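
A minimal sketch of the word-count idea. The helpers and the threshold of 100 words are assumptions, not values taken from the data; the threshold would need tuning against pages already known to be tables or pictures.

```python
import re

# Alphabetic tokens of two or more letters; numbers and stray symbols are ignored.
WORD = re.compile(r"[A-Za-z]{2,}")


def word_count(text: str) -> int:
    """Count word-like tokens on a page."""
    return len(WORD.findall(text))


def likely_table_or_picture(text: str, min_words: int = 100) -> bool:
    """Flag a page whose word count falls below the (hypothetical) threshold."""
    return word_count(text) < min_words


print(word_count("Table 1 3.5 4.2 x y"))
```

With the page texts in a dataframe, the same function could be applied to the text column to add a flag column alongside the other page indicators.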

In [None]:
"""
Remove jstor cover pages 
Split page texts into lines, use strip() function, then join, maintaining line breaks
Remove jstor footers

Look from pages 01-03 for all of the text except 2020 and 2022:
if 01 says "series of photographs":
    mark 02 as picture and 03 as first page
elif 01 doesn't have some indicator that it's a title page ("By (author)"):
    mark 01 as picture and 02 as first page
else: 
    mark 01 as first page
    might also want to check for title page indicator and do something if there's a problem with it
    
How are we marking? Probably dataframe with each page and indicator variable for each step
for each title page:
    find and delete journal title and link at top
    Delete title line (maybe find it based on proximity to "By (author)" line)
    delete line with "By (author)" 
    To get rid of footnotes at the bottom:
        Find line that has "f Presidential address delivered at" and delete everything after that (doesn't work for 2018 on)
        
for every page that isn't a title:
    try to match "THE AMERICAN ECONOMIC REVIEW" and "MONTH 20xx" in the first couple lines and delete up to there
    unfortunate edge case: "APRIL 201 5" in Economics-2015-0-03
    if no match for that header, try to match:
        "VOL. 104 NO. 4 GOLDIN: A GRAND GENDER CONVERGENCE: ITS LAST CHAPTER 1097"
        Sometimes different parts split on different lines, sometimes no page number, but VOL is consistent
        save captured text somewhere for verification


Get rid of everything after REFERENCES:
for each full text:
    Find pages with REFERENCES line and mark
    if there's only one:
        Mark pages after REFERENCES line for deletion
        Figure out if there's body text before REFERENCES or if it has its own page:
            Get the length of the body text before the REFERENCES?
            We don't want to get confused by headers if our previous work didn't catch them
    if there are multiple matches:
        print when it happens to see how common it is, and move pages to separate directory to manually check
    if there are no matches:
        move to directory and check
        
Equations:
Find them by matching pattern of one or two digit number in parentheses after a new line
Maybe save in dataframe the number of matches on each page. We can expect it to be a little wrong, however

"""

In [None]:
"""
Strategies for footnotes
Go in manually and make line --- before footnote section
Then read and save the line number where the break is so you don't need to do it manually again

Maybe use pattern matching to draw the line in the first place, then validate manually
We could also look for the number after the period and indicate that there is likely a footnote on that page
"""

In [3]:
import os
import shutil
import re
import itertools as it
import pandas as pd
os.chdir("/Users/BeckyMarcusMacbook/Thesis/manual_work")
text_dir = "data/groups/E7/texts"
dest_dir = "data/groups_cleaned/E7"
def reset():
    rm_ds(text_dir)
    if os.path.exists(dest_dir):
        shutil.rmtree(dest_dir)
    shutil.copytree(text_dir,dest_dir)
def rm_ds(dir):
    for file in os.listdir(dir):
        if file==".DS_Store":
            os.remove(os.path.join(dir,file))

## Read folder to df

In [4]:

page_ids = []
texts = []
doc_ids = []
for file in sorted(os.listdir(text_dir)):
    if not file.endswith(".txt"):
        continue
    page_id = file.split(".")[0]
    doc_id = page_id[:-3]
    doc_ids.append(doc_id)
    page_ids.append(page_id)
    with open(os.path.join(text_dir,file),'r') as f:
        text = f.read()
    texts.append(text)


In [5]:
import re
"""Look from pages 01-03 for all of the texts except 2020 and 2022:
if 01 says "series of photographs":
    mark 02 as picture and 03 as first page
elif 01 doesn't have some indicator that it's a title page (Like the "By...* line"):
    mark 01 as picture and 02 as first page
else: 
    mark 01 as first page
    might also want to check for title page indicator and do something if there's a problem with it
    """
df=pd.DataFrame({"page_id":page_ids,'doc_id':doc_ids,"text":texts})
## Basic info:
def get_info(name):
    _,year,num,page = name.split("-")
    return int(year),int(num),int(page)
df['text_path'] = df['page_id'].apply(lambda x: os.path.join(text_dir,x+".txt"))
df['pdf_path'] = df['text_path'].apply(lambda x: x.replace("texts","pdfs").replace(".txt",".pdf"))
df[['year','num','page']] = df['page_id'].apply(get_info).to_list()
df["length"] = df['text'].apply(len)
df["num_lines"] = df['text'].apply(lambda x: x.count("\n")+1)
df['text'] = df.pop('text')
df['num_pages'] = df.groupby('doc_id')['page'].transform('max')+1
df['is_jstor_cover'] = df['page']==0
## Take care of exception
df.loc[(df['year'].isin([2020,2022]))&(df['page']==0),'is_jstor_cover'] =False
# Getting picture and cover pages
df[["is_photo_intro","is_author_photo","is_first_page"]]=False
for doc in df['doc_id'].unique():
    if '2020' in doc or '2022' in doc:
        continue
    idxes= df.loc[(df['doc_id']==doc)&(df['page']>0)&(df['page']<4)].index
    i1,i2,i3 = idxes
    t1, t2,t3 = df.loc[idxes,"text"].values #type:ignore
    if "series of photographs" in t1:
        df.at[i1,"is_photo_intro"] = True
        df.at[i2,"is_author_photo"] = True
        df.at[i3,"is_first_page"] = True
    elif len(re.findall(r"By[^\*]+\*",t2,re.DOTALL))>0:
        df.at[i1,'is_author_photo'] = True
        df.at[i2,'is_first_page'] = True
    else:
        if len(re.findall(r"By[^\*]+\*",t1,re.DOTALL))==0:
            print(f"Couldn't figure out page for {doc}")
        df.at[i1,"is_first_page"]=True
df.loc[(df['year'].isin([2020,2022]))&(df["page"]==0),'is_first_page'] = True
if df.groupby("doc_id")['is_first_page'].any().all():
    print("All docs have first page marked")
    
df['is_after_title_page'] = df[['is_jstor_cover','is_photo_intro','is_author_photo','is_first_page']].any(axis=1) ==False

df['is_reference_start'] = df['text'].apply(lambda x: "REFERENCES" in x and "Stable URL" not in x)
### Checking for only one reference page
if (df[df['is_reference_start']]['doc_id'].value_counts() ==1).all():
    print("Got one reference page for each document")
ref_page_for_doc = df.loc[df["is_reference_start"],['doc_id','page']].drop_duplicates().set_index('doc_id').to_dict()['page']
df['is_bibliography'] = df.apply(lambda row: row['page']>=ref_page_for_doc[row['doc_id']],axis=1)
def strip_lines(text):
    lines=text.split('\n')
    stripped=[line.strip() for line in lines]
    return "\n".join(stripped)
def remove_jstor_footer(text):
    return text.rsplit("\nThis content downloaded from",1)[0]

df['basic_cleaned'] = df['text'].apply(strip_lines).apply(remove_jstor_footer)
# Match a one- or two-digit equation number in parentheses at the start of a line
df['has_equation'] = df['basic_cleaned'].apply(lambda x: re.search(r"\n\(\d{1,2}\)",x) is not None)
df['has_equation'] = (df['has_equation']&(df['is_reference_start']|(df['is_bibliography']==False)))
print("num equation pages:", df['has_equation'].sum())

df.sort_values('page_id')



All docs have first page marked
Got one reference page for each document
num equation pages: 50


Unnamed: 0,page_id,doc_id,text_path,pdf_path,year,num,page,length,num_lines,text,num_pages,is_jstor_cover,is_photo_intro,is_author_photo,is_first_page,is_after_title_page,is_reference_start,is_bibliography,basic_cleaned,has_equation
0,Economics-2008-0-00,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-00.txt,data/groups/E7/pdfs/Economics-2008-0-00.pdf,2008,0,0,985,16,Evolution and Intelligent Design \nAuthor(s):...,36,True,False,False,False,False,False,False,Evolution and Intelligent Design\nAuthor(s): T...,False
1,Economics-2008-0-01,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-01.txt,data/groups/E7/pdfs/Economics-2008-0-01.pdf,2008,0,1,230,4,Number 109 of a series of photographs of past...,36,False,True,False,False,False,False,False,Number 109 of a series of photographs of past ...,False
2,Economics-2008-0-02,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-02.txt,data/groups/E7/pdfs/Economics-2008-0-02.pdf,2008,0,2,152,3,This content downloaded from \n�������������73...,36,False,False,True,False,False,False,False,This content downloaded from\n�������������73....,False
3,Economics-2008-0-03,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-03.txt,data/groups/E7/pdfs/Economics-2008-0-03.pdf,2008,0,3,4218,44,"American Economic Review 2008, 98 1, 5-37 \n ...",36,False,False,False,True,False,False,False,"American Economic Review 2008, 98 1, 5-37\nhtt...",False
4,Economics-2008-0-04,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-04.txt,data/groups/E7/pdfs/Economics-2008-0-04.pdf,2008,0,4,3812,45,6 THE AMERICAN ECONOMIC REVIEW \n MARCH 2008...,36,False,False,False,False,True,False,False,6 THE AMERICAN ECONOMIC REVIEW\nMARCH 2008\nd...,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Economics-2022-0-11,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-11.txt,data/groups/E7/pdfs/Economics-2022-0-11.pdf,2022,0,11,3512,50,1086\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,16,False,False,False,False,True,True,True,1086\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,False
457,Economics-2022-0-12,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-12.txt,data/groups/E7/pdfs/Economics-2022-0-12.pdf,2022,0,12,4484,63,1087\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,16,False,False,False,False,True,False,True,1087\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,False
458,Economics-2022-0-13,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-13.txt,data/groups/E7/pdfs/Economics-2022-0-13.pdf,2022,0,13,4192,62,1088\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,16,False,False,False,False,True,False,True,1088\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,False
459,Economics-2022-0-14,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-14.txt,data/groups/E7/pdfs/Economics-2022-0-14.pdf,2022,0,14,4228,63,1089\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,16,False,False,False,False,True,False,True,1089\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,False


In [22]:
df.columns

Index(['page_id', 'doc_id', 'text_path', 'pdf_path', 'year', 'num', 'page',
       'length', 'num_lines', 'text', 'num_pages', 'is_jstor_cover',
       'is_photo_intro', 'is_author_photo', 'is_first_page',
       'is_after_title_page', 'is_reference_start', 'is_bibliography',
       'has_equation', 'basic_cleaned'],
      dtype='object')

In [71]:
temp_dir = "data/groups/E7/temp"
#shutil.rmtree(temp_dir)
for column in ["is_jstor_cover",'is_photo_intro',"is_author_photo",'is_first_page','is_reference_start','is_bibliography','has_equation']:
    txt_dest_dir = os.path.join(temp_dir,column,'texts')
    os.makedirs(txt_dest_dir,exist_ok=True)
    pdf_dest_dir = os.path.join(temp_dir,column,'pdfs')
    os.makedirs(pdf_dest_dir,exist_ok=True)

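    # txt_src is unused: the cleaned text is written out instead of copying the raw file
    for _, txt_src, pdf_src, page_id, text in df.loc[df[column],['text_path','pdf_path','page_id','basic_cleaned']].itertuples():
        # write the cleaned text
        txt_dest = os.path.join(txt_dest_dir,page_id+".txt")
        with open(txt_dest,'w') as f:
            f.write(text)
        # copy the matching PDF page alongside it
        pdf_dest = os.path.join(pdf_dest_dir,page_id+".pdf")
        shutil.copyfile(pdf_src,pdf_dest)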

In [89]:
list(df[df['is_jstor_cover']|df['is_author_photo']|df['is_photo_intro']].page_id+".txt")

['Economics-2008-0-00.txt',
 'Economics-2008-0-01.txt',
 'Economics-2008-0-02.txt',
 'Economics-2009-0-00.txt',
 'Economics-2009-0-01.txt',
 'Economics-2009-0-02.txt',
 'Economics-2010-0-00.txt',
 'Economics-2011-0-00.txt',
 'Economics-2011-0-01.txt',
 'Economics-2012-0-00.txt',
 'Economics-2012-0-01.txt',
 'Economics-2013-0-00.txt',
 'Economics-2013-0-01.txt',
 'Economics-2014-0-00.txt',
 'Economics-2014-0-01.txt',
 'Economics-2015-0-00.txt',
 'Economics-2015-0-01.txt',
 'Economics-2016-0-00.txt',
 'Economics-2016-0-01.txt',
 'Economics-2017-0-00.txt',
 'Economics-2017-0-01.txt',
 'Economics-2017-0-02.txt',
 'Economics-2018-0-00.txt',
 'Economics-2018-0-01.txt',
 'Economics-2019-0-00.txt',
 'Economics-2019-0-01.txt',
 'Economics-2019-0-02.txt']

In [6]:
first_page_dir = 'data/groups/E7/temp/is_first_page/texts-annotated'
d={}
for file in os.listdir(first_page_dir):
    if file[0]=='.':
        continue
    if "2019" in file:
        d[file] = (0,0)
        continue
    path = os.path.join(first_page_dir,file)
    text = open(path,'r').read()
    lines= text.split("\n")
    bounds = []
    for i,line in enumerate(lines):
        if line=='---':
            bounds.append(i)
    if len(bounds)<2:
        bounds.append(None)
    else:
        bounds[1]-=1

    d[file] = tuple(bounds)

    

In [24]:
E7_TITLE_PAGE_BOUNDS: dict[str, tuple[int | None, int | None]] = {
    "Economics-2018-0-02.txt": (4, 11),
    "Economics-2019-0-03.txt": (0, 0),
    "Economics-2013-0-02.txt": (14, 31),
    "Economics-2012-0-02.txt": (17, 28),
    "Economics-2008-0-03.txt": (10, 35),
    "Economics-2009-0-03.txt": (4, 32),
    "Economics-2020-0-00.txt": (28, None),
    "Economics-2014-0-02.txt": (19, 27),
    "Economics-2015-0-02.txt": (15, 31),
    "Economics-2011-0-02.txt": (21, 33),
    "Economics-2010-0-01.txt": (15, 36),
    "Economics-2016-0-02.txt": (4, 35),
    "Economics-2022-0-00.txt": (11, 33),
    "Economics-2017-0-03.txt": (17, 32),
}
new = {}
for key,val in E7_TITLE_PAGE_BOUNDS.items():
    dyn,pagetxt = key.rsplit("-",1)
    page=int(pagetxt[:-4])
    new[dyn] = (page,*val)
new



{'Economics-2018-0': (2, 4, 11),
 'Economics-2019-0': (3, 0, 0),
 'Economics-2013-0': (2, 14, 31),
 'Economics-2012-0': (2, 17, 28),
 'Economics-2008-0': (3, 10, 35),
 'Economics-2009-0': (3, 4, 32),
 'Economics-2020-0': (0, 28, None),
 'Economics-2014-0': (2, 19, 27),
 'Economics-2015-0': (2, 15, 31),
 'Economics-2011-0': (2, 21, 33),
 'Economics-2010-0': (1, 15, 36),
 'Economics-2016-0': (2, 4, 35),
 'Economics-2022-0': (0, 11, 33),
 'Economics-2017-0': (3, 17, 32)}

In [18]:
E7_TITLE_PAGE_BOUNDS:dict[str,tuple[int|None,int|None]]={
    'Economics-2018-0-02.txt': (4, 11),
    'Economics-2019-0-03.txt': (0, 0),
    'Economics-2013-0-02.txt': (14, 31),
    'Economics-2012-0-02.txt': (17, 28),
    'Economics-2008-0-03.txt': (10, 35),
    'Economics-2009-0-03.txt': (4, 32),
    'Economics-2020-0-00.txt': (28, None),
    'Economics-2014-0-02.txt': (19, 27),
    'Economics-2015-0-02.txt': (15, 31),
    'Economics-2011-0-02.txt': (21, 33),
    'Economics-2010-0-01.txt': (15, 36),
    'Economics-2016-0-02.txt': (4, 35),
    'Economics-2022-0-00.txt': (11, 33),
    'Economics-2017-0-03.txt': (17, 32)
    }

# See E7 notebook for how these were determined
E7_INTRO_PAGES:list[str] = ['Economics-2008-0-00.txt',
 'Economics-2008-0-01.txt',
 'Economics-2008-0-02.txt',
 'Economics-2009-0-00.txt',
 'Economics-2009-0-01.txt',
 'Economics-2009-0-02.txt',
 'Economics-2010-0-00.txt',
 'Economics-2011-0-00.txt',
 'Economics-2011-0-01.txt',
 'Economics-2012-0-00.txt',
 'Economics-2012-0-01.txt',
 'Economics-2013-0-00.txt',
 'Economics-2013-0-01.txt',
 'Economics-2014-0-00.txt',
 'Economics-2014-0-01.txt',
 'Economics-2015-0-00.txt',
 'Economics-2015-0-01.txt',
 'Economics-2016-0-00.txt',
 'Economics-2016-0-01.txt',
 'Economics-2017-0-00.txt',
 'Economics-2017-0-01.txt',
 'Economics-2017-0-02.txt',
 'Economics-2018-0-00.txt',
 'Economics-2018-0-01.txt',
 'Economics-2019-0-00.txt',
 'Economics-2019-0-01.txt',
 'Economics-2019-0-02.txt'
 ]

for file in E7_INTRO_PAGES:
    dyn,pagetxt = file.rsplit("-",1)
    for title in E7_TITLE_PAGE_BOUNDS.keys():
        if dyn in title:
            if file<title:
                print(file, 'good')
            else:
                 print(file, 'bad')

Economics-2008-0-00.txt good
Economics-2008-0-01.txt good
Economics-2008-0-02.txt good
Economics-2009-0-00.txt good
Economics-2009-0-01.txt good
Economics-2009-0-02.txt good
Economics-2010-0-00.txt good
Economics-2011-0-00.txt good
Economics-2011-0-01.txt good
Economics-2012-0-00.txt good
Economics-2012-0-01.txt good
Economics-2013-0-00.txt good
Economics-2013-0-01.txt good
Economics-2014-0-00.txt good
Economics-2014-0-01.txt good
Economics-2015-0-00.txt good
Economics-2015-0-01.txt good
Economics-2016-0-00.txt good
Economics-2016-0-01.txt good
Economics-2017-0-00.txt good
Economics-2017-0-01.txt good
Economics-2017-0-02.txt good
Economics-2018-0-00.txt good
Economics-2018-0-01.txt good
Economics-2019-0-00.txt good
Economics-2019-0-01.txt good
Economics-2019-0-02.txt good


In [64]:
{dyn:page for dyn,page in [file.rsplit("-",1) for file in E7_TITLE_PAGE_BOUNDS]}

{'Economics-2018-0': '02.txt',
 'Economics-2019-0': '03.txt',
 'Economics-2013-0': '02.txt',
 'Economics-2012-0': '02.txt',
 'Economics-2008-0': '03.txt',
 'Economics-2009-0': '03.txt',
 'Economics-2020-0': '00.txt',
 'Economics-2014-0': '02.txt',
 'Economics-2015-0': '02.txt',
 'Economics-2011-0': '02.txt',
 'Economics-2010-0': '01.txt',
 'Economics-2016-0': '02.txt',
 'Economics-2022-0': '00.txt',
 'Economics-2017-0': '03.txt'}

In [64]:
x = df.loc[df['page_id']=='Economics-2008-0-04','basic_cleaned'].values[0]
re.search(r"\n\(\d{1,2}\)",x)

<re.Match object; span=(3266, 3270), match='\n(1)'>

## Manual work on cleaning (after E7 part 1)

In [92]:
df

Unnamed: 0,page_id,doc_id,text_path,pdf_path,year,num,page,length,num_lines,text,num_pages,is_jstor_cover,is_photo_intro,is_author_photo,is_first_page,is_after_title_page,is_reference_start,is_bibliography,basic_cleaned,has_equation
0,Economics-2008-0-00,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-00.txt,data/groups/E7/pdfs/Economics-2008-0-00.pdf,2008,0,0,985,16,Evolution and Intelligent Design \nAuthor(s):...,36,True,False,False,False,False,False,False,Evolution and Intelligent Design\nAuthor(s): T...,False
1,Economics-2008-0-01,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-01.txt,data/groups/E7/pdfs/Economics-2008-0-01.pdf,2008,0,1,230,4,Number 109 of a series of photographs of past...,36,False,True,False,False,False,False,False,Number 109 of a series of photographs of past ...,False
2,Economics-2008-0-02,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-02.txt,data/groups/E7/pdfs/Economics-2008-0-02.pdf,2008,0,2,152,3,This content downloaded from \n�������������73...,36,False,False,True,False,False,False,False,This content downloaded from\n�������������73....,False
3,Economics-2008-0-03,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-03.txt,data/groups/E7/pdfs/Economics-2008-0-03.pdf,2008,0,3,4218,44,"American Economic Review 2008, 98 1, 5-37 \n ...",36,False,False,False,True,False,False,False,"American Economic Review 2008, 98 1, 5-37\nhtt...",False
4,Economics-2008-0-04,Economics-2008-0,data/groups/E7/texts/Economics-2008-0-04.txt,data/groups/E7/pdfs/Economics-2008-0-04.pdf,2008,0,4,3812,45,6 THE AMERICAN ECONOMIC REVIEW \n MARCH 2008...,36,False,False,False,False,True,False,False,6 THE AMERICAN ECONOMIC REVIEW\nMARCH 2008\nd...,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
456,Economics-2022-0-11,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-11.txt,data/groups/E7/pdfs/Economics-2022-0-11.pdf,2022,0,11,3512,50,1086\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,16,False,False,False,False,True,True,True,1086\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,False
457,Economics-2022-0-12,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-12.txt,data/groups/E7/pdfs/Economics-2022-0-12.pdf,2022,0,12,4484,63,1087\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,16,False,False,False,False,True,False,True,1087\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,False
458,Economics-2022-0-13,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-13.txt,data/groups/E7/pdfs/Economics-2022-0-13.pdf,2022,0,13,4192,62,1088\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,16,False,False,False,False,True,False,True,1088\nTHE AMERICAN ECONOMIC REVIEW APRIL 2022\...,False
459,Economics-2022-0-14,Economics-2022-0,data/groups/E7/texts/Economics-2022-0-14.txt,data/groups/E7/pdfs/Economics-2022-0-14.pdf,2022,0,14,4228,63,1089\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,16,False,False,False,False,True,False,True,1089\nCARD: WHO SET YOUR WAGE? VOL. 112 NO. 4\...,False


In [177]:
### Went through to 2012 and fixed things, need to strip them now
work_dir = "manual_work/E7"
for file in df.loc[df['year']<=2012,"page_id"]+".txt":
    try:
        path = os.path.join(work_dir,file)
        text = open(path,'r').read()
        text = text.strip()
        with open(path,'w') as f:
            f.write(text)
    except FileNotFoundError:
        pass


### Try writing to one file

In [101]:
work_dir = "manual_work/E7"
dest_file="manual_work/E7_long"
# Start from an empty file; the loop below appends to it
if os.path.exists(dest_file):
    os.remove(dest_file)

for file in df.loc[df['year']>2012,"page_id"]+".txt":
    try:
        path = os.path.join(work_dir,file)
        text = open(path,'r').read()
        new_text=f"\n\n\n---{file}---\n{text.strip()}"
        with open(dest_file,'a') as f:
            f.write(new_text)
    except FileNotFoundError:
        pass
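
A sketch of reading the combined file back into per-page texts, assuming the delimiter lines keep the `---{file}---` shape written by the cell above and that every file name ends in `.txt`. The helper `split_combined` is hypothetical. Anchoring the delimiter to a whole line means a stray `---` inside a page's text does not shift the split.

```python
import re

# Delimiter lines look like "---Economics-2013-0-02.txt---".
DELIMITER = re.compile(r"^---(.+?\.txt)---$", re.MULTILINE)


def split_combined(combined: str) -> dict[str, str]:
    """Map each page file name to its text from a combined file."""
    # With one capture group, re.split returns
    # [preamble, name1, text1, name2, text2, ...]
    parts = DELIMITER.split(combined)
    names, texts = parts[1::2], parts[2::2]
    return {name: text.strip() for name, text in zip(names, texts)}


combined = "\n\n\n---a-01.txt---\nfirst page\n\n\n---a-02.txt---\nsecond --- page"
print(split_combined(combined))
```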

In [8]:
def build_kmp_table(pattern):
    table = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = table[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        table[i] = j
    return table

def kmp_search(text, pattern):
    table = build_kmp_table(pattern)
    j = 0
    for i in range(len(text)):
        while j > 0 and text[i] != pattern[j]:
            j = table[j - 1]
        if text[i] == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - len(pattern) + 1
    return -1

def find_substrings_indices(modified_string, mainstring)->list[tuple[int,int]]:
    indices = []
    start_index = 0  # Start from the beginning of the mainstring
    start_len = len(modified_string)
    while modified_string:
        for i in range(len(modified_string), 0, -1):
            substring = modified_string[:i]
            match_index = kmp_search(mainstring[start_index:], substring)
            
            if match_index != -1:
                actual_start_index = start_index + match_index
                actual_end_index = actual_start_index + len(substring)
                indices.append((actual_start_index, actual_end_index))
                start_index = actual_end_index + 1
                modified_string = modified_string[i:]
                break
        else:
            break
    
    # Validation check: Ensure that the total length of matched substrings covers the entire modified_string
    total_matched_length = sum(end - start for start, end in indices)

    if total_matched_length!=start_len:

        raise ValueError("The matched substrings do not cover the entire modified string.")
    
    return indices
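
For comparison, a sketch of the same span-recovery idea using the standard library's `difflib.SequenceMatcher` in place of the hand-written KMP search. This is a swapped-in technique, not the method used in this notebook, and `kept_spans` is a hypothetical name. `SequenceMatcher` is heuristic, so the spans are not guaranteed to reassemble the edited text exactly on every page; the check at the end is worth keeping.

```python
from difflib import SequenceMatcher


def kept_spans(original: str, edited: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of `original` that survive in `edited`."""
    # autojunk=False turns off the popularity heuristic, which can otherwise
    # ignore frequent characters in long texts.
    matcher = SequenceMatcher(None, original, edited, autojunk=False)
    return [
        (block.a, block.a + block.size)
        for block in matcher.get_matching_blocks()
        if block.size > 0  # the final block is always a zero-length sentinel
    ]


original = "HEADER\nbody one\nFOOTNOTE\nbody two"
edited = "body one\nbody two"
spans = kept_spans(original, edited)
# Sanity check: the kept spans should rebuild the edited text.
print(spans, "".join(original[a:b] for a, b in spans) == edited)
```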

In [9]:
def strip_lines(text):
    lines=text.split('\n')
    stripped=[line.strip() for line in lines]
    return "\n".join(stripped)
def remove_jstor_footer(text):
    return text.rsplit("\nThis content downloaded from",1)[0]


In [29]:
import re
### Get to partially clean
not_yet_dir = 'E7_not_yet'
partial_clean_dir = "data/groups/E7/temp/E7_partial_clean"
og_text = "data/groups/E7/texts"
before_clean = {}
after_clean = {}
not_yet = []
if not os.path.exists(not_yet_dir):
    os.makedirs(not_yet_dir)
for file in sorted(os.listdir(partial_clean_dir)):
    if file[0]=="." or file in E7_TITLE_PAGE_BOUNDS.keys():
        continue
    old_path = os.path.join(og_text,file)
    new_path = os.path.join(partial_clean_dir,file)
    bf = open(old_path).read()
    before_clean[file] = bf
    aft = open(new_path).read()
    new_text = strip_lines(remove_jstor_footer(bf))
    lines = new_text.splitlines()
    for i,line in enumerate(lines):
        if re.search(r"\b[a-z]{2,}\b",line) is not None:
            new_text=  "\n".join(lines[i:])
            break
    if aft.strip()!=new_text.strip():
        not_yet.append(file)
        new_path_attempt = os.path.join(not_yet_dir,file.replace(".","-attempt."))
        with open(new_path_attempt,'w') as f:
            f.write(new_text)
        new_path_true = os.path.join(not_yet_dir,file.replace(".","-actual."))
        
        with open(new_path_true,'w') as f:
            f.write(aft)

    
len(not_yet)





209

In [27]:
not_yet

['Economics-2008-0-30.txt',
 'Economics-2009-0-20.txt',
 'Economics-2010-0-04.txt',
 'Economics-2010-0-15.txt',
 'Economics-2010-0-28.txt',
 'Economics-2011-0-15.txt',
 'Economics-2011-0-35.txt',
 'Economics-2011-0-40.txt',
 'Economics-2012-0-21.txt',
 'Economics-2012-0-22.txt',
 'Economics-2012-0-25.txt',
 'Economics-2012-0-26.txt',
 'Economics-2013-0-11.txt',
 'Economics-2013-0-20.txt',
 'Economics-2013-0-22.txt',
 'Economics-2014-0-03.txt',
 'Economics-2014-0-04.txt',
 'Economics-2014-0-05.txt',
 'Economics-2014-0-06.txt',
 'Economics-2014-0-08.txt',
 'Economics-2014-0-09.txt',
 'Economics-2014-0-12.txt',
 'Economics-2014-0-13.txt',
 'Economics-2014-0-14.txt',
 'Economics-2014-0-15.txt',
 'Economics-2014-0-16.txt',
 'Economics-2014-0-17.txt',
 'Economics-2014-0-18.txt',
 'Economics-2014-0-19.txt',
 'Economics-2014-0-20.txt',
 'Economics-2014-0-21.txt',
 'Economics-2014-0-22.txt',
 'Economics-2014-0-23.txt',
 'Economics-2014-0-24.txt',
 'Economics-2014-0-25.txt',
 'Economics-2014-0-2

In [89]:
import os
import re
os.chdir("/Users/BeckyMarcusMacbook/Thesis/manual_work")
partial_clean_dir = "data/groups/E7/temp/E7_partial_clean"
edited ="E7_long_edited"
cleaned_to_2012 = "E7"
look_over_file = "E7_check_over.txt"
edited_text = open(edited).read().strip()
split = edited_text.split('---')
names = split[1::2]
texts = split[2::2]
ms:dict[str,list[tuple[int,int]]] = {}
problems = []
E7_TITLE_PAGE_BOUNDS:dict[str,tuple[int|None,int|None]]={
    'Economics-2018-0-02.txt': (4, 11),
    'Economics-2019-0-03.txt': (0, 0),
    'Economics-2013-0-02.txt': (14, 31),
    'Economics-2012-0-02.txt': (17, 28),
    'Economics-2008-0-03.txt': (10, 35),
    'Economics-2009-0-03.txt': (4, 32),
    'Economics-2020-0-00.txt': (28, None),
    'Economics-2014-0-02.txt': (19, 27),
    'Economics-2015-0-02.txt': (15, 31),
    'Economics-2011-0-02.txt': (21, 33),
    'Economics-2010-0-01.txt': (15, 36),
    'Economics-2016-0-02.txt': (4, 35),
    'Economics-2022-0-00.txt': (11, 33),
    'Economics-2017-0-03.txt': (17, 32)
    }

def find_indices(old_text:str, edited_text:str)->list[tuple[int,int]]:
    # Try an exact substring match first; fall back to piecewise matching
    # when text was removed from the middle of the page
    if len(edited_text)==0:
        return [(0,0)]
    start_index = old_text.find(edited_text)
    stop_index = start_index + len(edited_text)
    return [(start_index, stop_index)] if start_index!=-1 else find_substrings_indices(edited_text,old_text)

def slice_text(ranges:list[tuple[int,int]],text)->str:
    new_text= ""
    for a,b in ranges:
        new_text+=text[a:b]+"\n"
    return new_text


for name, edited_text in zip(names,texts):
    og_path = os.path.join(partial_clean_dir,name)
    old_text = open(og_path,'r').read()
    try:
        indices = find_indices(old_text.strip(),edited_text.strip())
        if len(indices)>4:
            print(name)
        ms[name]=indices
    except ValueError as e:
        print(e)
        print(name)
print("did first part")
for file in sorted(os.listdir(partial_clean_dir)):
    if file[0]==".":
        continue
    if file not in ms.keys():
        og_path = os.path.join(partial_clean_dir, file)
        old_text = open(og_path,'r').read()
        edited_path = os.path.join(cleaned_to_2012,file)
        edited_text = open(edited_path,'r').read()
        try:
            indices = find_indices(old_text.strip(),edited_text.strip())
            ms[file] = indices
            if len(indices)>4:
                print(file)
        except ValueError:
            print(file)

did first part


In [83]:
def build_kmp_table(pattern):
    table = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = table[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        table[i] = j
    return table

def kmp_search(text, pattern):
    table = build_kmp_table(pattern)
    j = 0
    for i in range(len(text)):
        while j > 0 and text[i] != pattern[j]:
            j = table[j - 1]
        if text[i] == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - len(pattern) + 1
    return -1

def find_substrings_indices(modified_string, mainstring)->list[tuple[int,int]]:
    """Greedily locate successive pieces of modified_string inside mainstring.

    Returns (start, stop) index pairs into mainstring. Raises ValueError if the
    matched pieces do not add up to the full length of modified_string.
    """
    indices = []
    start_index = 0  # Offset in mainstring where the next search begins
    start_len = len(modified_string)
    while modified_string:
        # Try the longest prefix first and shrink it until one is found in the remaining text
        for i in range(len(modified_string), 0, -1):
            substring = modified_string[:i]
            match_index = kmp_search(mainstring[start_index:], substring)

            if match_index != -1:
                print(f"found a match with i={i} and match index of {match_index} ")
                actual_start_index = start_index + match_index
                actual_end_index = actual_start_index + len(substring)
                print(f"New index: {(actual_start_index, actual_end_index)}")
                indices.append((actual_start_index, actual_end_index))
                # Skip one character: the character right after a longest match cannot continue it
                start_index = actual_end_index + 1
                modified_string = modified_string[i:]
                break
        else:
            # No prefix of any length matched, so stop searching
            break

    # Validation check: ensure that the matched substrings together cover the entire modified_string
    total_matched_length = sum(end - start for start, end in indices)
    if total_matched_length != start_len:
        raise ValueError("The matched substrings do not cover the entire modified string.")

    return indices
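
The greedy prefix search above reruns a full KMP scan for every candidate prefix length, which can get slow on long pages. As a rough alternative sketch (not what this notebook uses), the standard library's `difflib.SequenceMatcher` can report the blocks that two strings share. The helper name `matching_ranges` and the `min_len` threshold are assumptions made for illustration, and `difflib` may pick different blocks than the greedy search, so its output should be checked against the edited text in the same way.

```python
from difflib import SequenceMatcher

def matching_ranges(edited: str, original: str, min_len: int = 1) -> list[tuple[int, int]]:
    """Return (start, stop) ranges of `original` that also appear, in order, in `edited`.

    Hypothetical helper for illustration; not part of the notebook's pipeline.
    """
    # autojunk=False disables the heuristic that ignores very frequent characters in long strings
    matcher = SequenceMatcher(None, original, edited, autojunk=False)
    ranges = []
    for block in matcher.get_matching_blocks():
        # The last block is always a zero-length sentinel, so the size check also drops it
        if block.size >= min_len:
            ranges.append((block.a, block.a + block.size))
    return ranges

# Toy page: a header and a footnote were deleted by hand in the edited version
original = "HEADER\nbody text one\nFOOTNOTE\nbody text two"
edited = "body text one\nbody text two"

ranges = matching_ranges(edited, original, min_len=5)
rebuilt = "".join(original[a:b] for a, b in ranges)
print(ranges)
print(rebuilt == edited)
```

Raising `min_len` filters out short accidental matches (single characters or common words), at the cost of possibly dropping short genuine fragments.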

In [91]:
ms

{'Economics-2013-0-02.txt': [(0, 1300)],
 'Economics-2013-0-03.txt': [(0, 2906)],
 'Economics-2013-0-04.txt': [(0, 3448)],
 'Economics-2013-0-05.txt': [(0, 3296)],
 'Economics-2013-0-06.txt': [(0, 3516)],
 'Economics-2013-0-07.txt': [(0, 1431)],
 'Economics-2013-0-08.txt': [(328, 3215)],
 'Economics-2013-0-09.txt': [(0, 3415)],
 'Economics-2013-0-10.txt': [(0, 1584)],
 'Economics-2013-0-11.txt': [(0, 2596)],
 'Economics-2013-0-12.txt': [(0, 2088)],
 'Economics-2013-0-13.txt': [(0, 1592)],
 'Economics-2013-0-14.txt': [(0, 2852)],
 'Economics-2013-0-15.txt': [(0, 2732)],
 'Economics-2013-0-16.txt': [(0, 3267)],
 'Economics-2013-0-17.txt': [(0, 2447)],
 'Economics-2013-0-18.txt': [(0, 2177)],
 'Economics-2013-0-19.txt': [(0, 3362)],
 'Economics-2013-0-20.txt': [(170, 1860)],
 'Economics-2013-0-21.txt': [(0, 559)],
 'Economics-2013-0-22.txt': [(0, 0)],
 'Economics-2014-0-02.txt': [(0, 579)],
 'Economics-2014-0-03.txt': [(0, 3105)],
 'Economics-2014-0-04.txt': [(0, 2513)],
 'Economics-2014-

In [84]:
for name, edited_text in zip(names,texts):
    if name!='Economics-2019-0-17.txt':
        continue
    og_path = os.path.join(partial_clean_dir,name)
    old_text = open(og_path,'r').read()
    s = 0
    indices = find_indices(old_text.strip(),edited_text.strip())
    for a,b in indices:
        print("\n")
        print(f"index: {(a,b)}")
        ot = old_text.strip()[a:b]
        print(f"\n===\n{ot}\n===\n")
        len_of_slice=len(ot)
        print(f"Len = {len_of_slice}")
        s+=len_of_slice
    print("total len", s)
    print("length of edited:",len(edited_text.strip()))

found a match with i=1118 and match index of 0 
New index: (0, 1118)
found a match with i=1189 and match index of 113 
New index: (1232, 2421)


index: (0, 1118)

===
less than 1, the implicit transfer due to the change in input prices increases utility. If
R
t    is greater than 1, the implicit transfer decreases utility.
The reason why it is the risky rate that matters is simple. Capital yields a rate of
return
R
t+1   . The change in prices due to the decrease in capital represents an implicit
transfer with rate of return   R
t+1  / R t   . Thus, whether the implicit transfer increases or
decreases utility depends on whether
R
t    is less or greater than 1.
Putting the two sets of results together: if the safe rate is less than 1, and
the risky rate is greater than 1 (the configuration that appears to be relevant
today) the two terms now work in opposite directions. The first term implies that an
increase in debt increases welfare. The second term implies that an increase in debt i

In [67]:
import subprocess
import logging
def git_commit(text,msg):
    pass
def clean_headers_footers_references(dest_dir:str,commit_changes:bool):
    try:
        reference_first_pages ={}
        for file in sorted(os.listdir(dest_dir)):
            try:
                if file[0] =='.':
                    continue
                doc_id,pagetxt = file.rsplit("-",1)
                page=int(pagetxt[:-4])
                path = os.path.join(dest_dir, file)
                if reference_first_pages.get(doc_id,page)<page: # means that it's after the reference first page
                    os.remove(path)
                    print(f"Removing reference page {file}{'- Staging for removal' if commit_changes else ''}")
                    if commit_changes:
                        subprocess.run(["git", "rm", path], check=True)
                    continue

                title_pnum,start,stop = E7_TITLE_PAGE_BOUNDS[doc_id]
                if title_pnum>page:
                    os.remove(path)
                    print(f"Removing cover page {file}{'- Staging for removal' if commit_changes else ''}")
                    if commit_changes:
                        subprocess.run(["git", "rm", path], check=True)
                    continue
                text = open(path,'r').read()
                text = jstor_and_stripping(text)
                if "REFERENCES" in text:
                    reference_first_pages[doc_id] = page
                    print(f"Found reference page start for {doc_id} at page {page}")
                    text = text.split("REFERENCES")[0].strip()
                lines = text.splitlines()
                if title_pnum!=page:
                    stop = None
                    for i,line in enumerate(lines):
                        if re.search(r"\b[a-z]+\b",line) is not None:
                            start = i
                            break
                text = "\n".join(lines[start:stop])
                with open(path,'w') as f:
                    f.write(text)
            except Exception:
                print(f"Exception when cleaning file {file}")
                raise
        if commit_changes:
            git_commit(dest_dir,"Cleaned headers and footers")
    except Exception as e:
        logging.error(f"Error when cleaning headers and footers: {e}")
        raise

def handle_line_breaks(dest_dir, commit_changes):
    def remove_quest(text:str)->str:
        return re.sub(r"([a-zA-Z]+)\?\n([a-zA-Z]+)([^\w\n\s])?", # Captures 3 groups: first half of word, second half of word, optional trailing punctuation
                      r"\1\2\3\n", # drops the "?" and moves the line break to after the rejoined word
                      text)
    apply_func_to_txt_dir(dest_dir,dest_dir,remove_quest)
    if commit_changes:
        git_commit(dest_dir,"joined words split by ? across lines")


def main(source_dir:str, dest_dir:str, log_file:str, commit_changes:bool):

    setup_logging(log_file)

    initialize_directories(source_dir,dest_dir,commit_changes)

    clean_headers_footers_references(dest_dir,commit_changes)


In [56]:
ms = {'Economics-2013-0-02.txt': [(0, 1300)],
 'Economics-2013-0-03.txt': [(0, 2906)],
 'Economics-2013-0-04.txt': [(0, 3448)],
 'Economics-2013-0-05.txt': [(0, 3296)],
 'Economics-2013-0-06.txt': [(0, 3516)],
 'Economics-2013-0-07.txt': [(0, 1431)],
 'Economics-2013-0-08.txt': [(328, 3215)],
 'Economics-2013-0-09.txt': [(0, 3415)],
 'Economics-2013-0-10.txt': [(0, 1584)],
 'Economics-2013-0-11.txt': [(0, 2596)],
 'Economics-2013-0-12.txt': [(0, 2088)],
 'Economics-2013-0-13.txt': [(0, 1592)],
 'Economics-2013-0-14.txt': [(0, 2852)],
 'Economics-2013-0-15.txt': [(0, 2732)],
 'Economics-2013-0-16.txt': [(0, 3267)],
 'Economics-2013-0-17.txt': [(0, 2447)],
 'Economics-2013-0-18.txt': [(0, 2177)],
 'Economics-2013-0-19.txt': [(0, 3362)],
 'Economics-2013-0-20.txt': [(170, 1860)],
 'Economics-2013-0-21.txt': [(0, 559)],
 'Economics-2013-0-22.txt': [(0, 0)],
 'Economics-2014-0-02.txt': [(0, 579)],
 'Economics-2014-0-03.txt': [(0, 3105)],
 'Economics-2014-0-04.txt': [(0, 2513)],
 'Economics-2014-0-05.txt': [(0, 2811)],
 'Economics-2014-0-06.txt': [(0, 2852)],
 'Economics-2014-0-07.txt': [(0, 0)],
 'Economics-2014-0-08.txt': [(0, 3419)],
 'Economics-2014-0-09.txt': [(1938, 3437)],
 'Economics-2014-0-10.txt': [(0, 1624)],
 'Economics-2014-0-11.txt': [(0, 0)],
 'Economics-2014-0-12.txt': [(846, 2171)],
 'Economics-2014-0-13.txt': [(0, 3198)],
 'Economics-2014-0-14.txt': [(0, 2556)],
 'Economics-2014-0-15.txt': [(0, 2669)],
 'Economics-2014-0-16.txt': [(0, 2104)],
 'Economics-2014-0-17.txt': [(0, 2850)],
 'Economics-2014-0-18.txt': [(2172, 3121)],
 'Economics-2014-0-19.txt': [(0, 3169)],
 'Economics-2014-0-20.txt': [(437, 1711)],
 'Economics-2014-0-21.txt': [(0, 2929)],
 'Economics-2014-0-22.txt': [(0, 2975)],
 'Economics-2014-0-23.txt': [(1952, 3189)],
 'Economics-2014-0-24.txt': [(0, 3413)],
 'Economics-2014-0-25.txt': [(1407, 1733)],
 'Economics-2014-0-26.txt': [(0, 2738)],
 'Economics-2014-0-27.txt': [(0, 3257)],
 'Economics-2014-0-28.txt': [(0, 3495)],
 'Economics-2014-0-29.txt': [(0, 919)],
 'Economics-2015-0-02.txt': [(0, 1223)],
 'Economics-2015-0-03.txt': [(0, 3703)],
 'Economics-2015-0-04.txt': [(0, 3180)],
 'Economics-2015-0-05.txt': [(0, 2557)],
 'Economics-2015-0-06.txt': [(0, 3218)],
 'Economics-2015-0-07.txt': [(47, 3730)],
 'Economics-2015-0-08.txt': [(0, 3649)],
 'Economics-2015-0-09.txt': [(0, 3152)],
 'Economics-2015-0-10.txt': [(0, 3255)],
 'Economics-2015-0-11.txt': [(0, 3326)],
 'Economics-2015-0-12.txt': [(0, 3487)],
 'Economics-2015-0-13.txt': [(590, 2203)],
 'Economics-2015-0-14.txt': [(0, 3350)],
 'Economics-2015-0-15.txt': [(0, 3422)],
 'Economics-2015-0-16.txt': [(0, 3575)],
 'Economics-2015-0-17.txt': [(0, 3551)],
 'Economics-2015-0-18.txt': [(187, 2795)],
 'Economics-2015-0-19.txt': [(796, 3464)],
 'Economics-2015-0-20.txt': [(0, 3253)],
 'Economics-2015-0-21.txt': [(0, 0)],
 'Economics-2015-0-22.txt': [(0, 3015)],
 'Economics-2015-0-23.txt': [(0, 0)],
 'Economics-2015-0-24.txt': [(0, 0)],
 'Economics-2015-0-25.txt': [(0, 3418)],
 'Economics-2015-0-26.txt': [(0, 3587)],
 'Economics-2015-0-27.txt': [(688, 2088)],
 'Economics-2015-0-28.txt': [(965, 1607)],
 'Economics-2015-0-29.txt': [(0, 3371)],
 'Economics-2015-0-30.txt': [(0, 3463)],
 'Economics-2015-0-31.txt': [(0, 1831)],
 'Economics-2016-0-02.txt': [(0, 2314)],
 'Economics-2016-0-03.txt': [(0, 3542)],
 'Economics-2016-0-04.txt': [(0, 3248)],
 'Economics-2016-0-05.txt': [(0, 3511)],
 'Economics-2016-0-06.txt': [(0, 3692)],
 'Economics-2016-0-07.txt': [(0, 2437)],
 'Economics-2016-0-08.txt': [(0, 3397)],
 'Economics-2016-0-09.txt': [(0, 3336)],
 'Economics-2016-0-10.txt': [(0, 3536)],
 'Economics-2016-0-11.txt': [(0, 3173)],
 'Economics-2016-0-12.txt': [(0, 3552)],
 'Economics-2016-0-13.txt': [(0, 3467)],
 'Economics-2016-0-14.txt': [(0, 2282)],
 'Economics-2016-0-15.txt': [(0, 3494)],
 'Economics-2016-0-16.txt': [(0, 3391)],
 'Economics-2016-0-17.txt': [(0, 3015)],
 'Economics-2016-0-18.txt': [(0, 3431)],
 'Economics-2016-0-19.txt': [(0, 3415)],
 'Economics-2016-0-20.txt': [(0, 3709)],
 'Economics-2016-0-21.txt': [(0, 3808)],
 'Economics-2016-0-22.txt': [(0, 2824)],
 'Economics-2017-0-03.txt': [(0, 1182)],
 'Economics-2017-0-04.txt': [(0, 3340)],
 'Economics-2017-0-05.txt': [(0, 3302)],
 'Economics-2017-0-06.txt': [(309, 2657)],
 'Economics-2017-0-07.txt': [(0, 3125)],
 'Economics-2017-0-08.txt': [(0, 3330)],
 'Economics-2017-0-09.txt': [(0, 2753)],
 'Economics-2017-0-10.txt': [(0, 3387)],
 'Economics-2017-0-11.txt': [(0, 3040)],
 'Economics-2017-0-12.txt': [(0, 3436)],
 'Economics-2017-0-13.txt': [(226, 2384)],
 'Economics-2017-0-14.txt': [(0, 3569)],
 'Economics-2017-0-15.txt': [(0, 3458)],
 'Economics-2017-0-16.txt': [(644, 2275)],
 'Economics-2017-0-17.txt': [(0, 3215)],
 'Economics-2017-0-18.txt': [(0, 3414)],
 'Economics-2017-0-19.txt': [(271, 1688)],
 'Economics-2017-0-20.txt': [(0, 3199)],
 'Economics-2017-0-21.txt': [(0, 2979)],
 'Economics-2017-0-22.txt': [(0, 3680)],
 'Economics-2017-0-23.txt': [(0, 3164)],
 'Economics-2017-0-24.txt': [(377, 2138)],
 'Economics-2017-0-25.txt': [(0, 3635)],
 'Economics-2017-0-26.txt': [(388, 2023)],
 'Economics-2017-0-27.txt': [(0, 2933)],
 'Economics-2017-0-28.txt': [(0, 3027)],
 'Economics-2017-0-29.txt': [(0, 3189)],
 'Economics-2017-0-30.txt': [(0, 3033)],
 'Economics-2017-0-31.txt': [(400, 2311)],
 'Economics-2017-0-32.txt': [(0, 3257)],
 'Economics-2017-0-33.txt': [(0, 2974)],
 'Economics-2017-0-34.txt': [(0, 3331)],
 'Economics-2017-0-35.txt': [(0, 1560)],
 'Economics-2018-0-02.txt': [(1164, 1327)],
 'Economics-2018-0-03.txt': [(0, 2119)],
 'Economics-2018-0-04.txt': [(0, 3018)],
 'Economics-2018-0-05.txt': [(0, 1997)],
 'Economics-2018-0-06.txt': [(0, 2932)],
 'Economics-2018-0-07.txt': [(0, 1870)],
 'Economics-2018-0-08.txt': [(0, 2905)],
 'Economics-2018-0-09.txt': [(0, 2282)],
 'Economics-2018-0-10.txt': [(0, 1778)],
 'Economics-2018-0-11.txt': [(0, 1911)],
 'Economics-2018-0-12.txt': [(0, 2888)],
 'Economics-2018-0-13.txt': [(0, 2903)],
 'Economics-2018-0-14.txt': [(0, 2550)],
 'Economics-2018-0-15.txt': [(0, 2870)],
 'Economics-2018-0-16.txt': [(0, 2688)],
 'Economics-2018-0-17.txt': [(0, 2645)],
 'Economics-2018-0-18.txt': [(0, 2289)],
 'Economics-2018-0-19.txt': [(0, 2504)],
 'Economics-2018-0-20.txt': [(0, 2178)],
 'Economics-2018-0-21.txt': [(0, 2301)],
 'Economics-2018-0-22.txt': [(0, 2419)],
 'Economics-2018-0-23.txt': [(0, 2740)],
 'Economics-2018-0-24.txt': [(0, 2290)],
 'Economics-2018-0-25.txt': [(0, 2657)],
 'Economics-2018-0-26.txt': [(0, 2734)],
 'Economics-2018-0-27.txt': [(0, 2298)],
 'Economics-2018-0-28.txt': [(0, 2825)],
 'Economics-2018-0-29.txt': [(0, 2602)],
 'Economics-2018-0-30.txt': [(0, 2541)],
 'Economics-2018-0-31.txt': [(0, 2665)],
 'Economics-2018-0-32.txt': [(0, 2745)],
 'Economics-2018-0-33.txt': [(0, 2440)],
 'Economics-2018-0-34.txt': [(0, 1898)],
 'Economics-2018-0-35.txt': [(0, 3180)],
 'Economics-2018-0-36.txt': [(0, 2194)],
 'Economics-2018-0-37.txt': [(0, 3129)],
 'Economics-2018-0-38.txt': [(0, 2762)],
 'Economics-2018-0-39.txt': [(0, 2471)],
 'Economics-2018-0-40.txt': [(0, 2802)],
 'Economics-2019-0-03.txt': [(0, 0)],
 'Economics-2019-0-04.txt': [(0, 3111)],
 'Economics-2019-0-05.txt': [(0, 3873)],
 'Economics-2019-0-06.txt': [(0, 3976)],
 'Economics-2019-0-07.txt': [(0, 2620)],
 'Economics-2019-0-08.txt': [(0, 788)],
 'Economics-2019-0-09.txt': [(0, 2773)],
 'Economics-2019-0-10.txt': [(0, 1151)],
 'Economics-2019-0-11.txt': [(0, 2708)],
 'Economics-2019-0-12.txt': [(0, 1368)],
 'Economics-2019-0-13.txt': [(0, 1901)],
 'Economics-2019-0-14.txt': [(0, 2382)],
 'Economics-2019-0-15.txt': [(0, 1993)],
 'Economics-2019-0-16.txt': [(0, 2316)],
 'Economics-2019-0-17.txt': [(0, 1117), (1232, 2420)],
 'Economics-2019-0-18.txt': [(0, 1853), (1919, 2668)],
 'Economics-2019-0-19.txt': [(0, 3241)],
 'Economics-2019-0-20.txt': [(0, 1991)],
 'Economics-2019-0-21.txt': [(0, 2256)],
 'Economics-2019-0-22.txt': [(0, 768)],
 'Economics-2019-0-23.txt': [(0, 2828)],
 'Economics-2019-0-24.txt': [(0, 890)],
 'Economics-2019-0-25.txt': [(0, 1767)],
 'Economics-2019-0-26.txt': [(0, 2974)],
 'Economics-2019-0-27.txt': [(0, 0)],
 'Economics-2019-0-28.txt': [(0, 1888)],
 'Economics-2019-0-29.txt': [(0, 2480)],
 'Economics-2019-0-30.txt': [(0, 2848)],
 'Economics-2019-0-31.txt': [(0, 3593)],
 'Economics-2019-0-32.txt': [(0, 3168)],
 'Economics-2019-0-33.txt': [(0, 1494)],
 'Economics-2020-0-00.txt': [(0, 948)],
 'Economics-2020-0-01.txt': [(0, 3229)],
 'Economics-2020-0-02.txt': [(0, 3106)],
 'Economics-2020-0-03.txt': [(0, 3091)],
 'Economics-2020-0-04.txt': [(0, 3761)],
 'Economics-2020-0-05.txt': [(0, 2301)],
 'Economics-2020-0-06.txt': [(0, 3583)],
 'Economics-2020-0-07.txt': [(0, 3560)],
 'Economics-2020-0-08.txt': [(0, 3236)],
 'Economics-2020-0-09.txt': [(0, 3509)],
 'Economics-2020-0-10.txt': [(0, 3383)],
 'Economics-2020-0-11.txt': [(0, 3453)],
 'Economics-2020-0-12.txt': [(0, 3749)],
 'Economics-2020-0-13.txt': [(0, 2768)],
 'Economics-2020-0-14.txt': [(0, 3777)],
 'Economics-2020-0-15.txt': [(0, 3513)],
 'Economics-2020-0-16.txt': [(0, 3719)],
 'Economics-2020-0-17.txt': [(0, 3570)],
 'Economics-2020-0-18.txt': [(0, 3092)],
 'Economics-2020-0-19.txt': [(0, 2653)],
 'Economics-2020-0-20.txt': [(0, 3374)],
 'Economics-2020-0-21.txt': [(0, 3384)],
 'Economics-2020-0-22.txt': [(0, 3262)],
 'Economics-2020-0-23.txt': [(0, 3778)],
 'Economics-2020-0-24.txt': [(0, 3395)],
 'Economics-2020-0-25.txt': [(0, 3665)],
 'Economics-2020-0-26.txt': [(0, 3479)],
 'Economics-2020-0-27.txt': [(0, 1844)],
 'Economics-2020-0-28.txt': [(0, 3260)],
 'Economics-2020-0-29.txt': [(0, 2077)],
 'Economics-2020-0-30.txt': [(0, 3400)],
 'Economics-2020-0-31.txt': [(0, 3698)],
 'Economics-2020-0-32.txt': [(0, 3421)],
 'Economics-2020-0-33.txt': [(0, 3504)],
 'Economics-2020-0-34.txt': [(0, 3597)],
 'Economics-2020-0-35.txt': [(0, 1648)],
 'Economics-2022-0-00.txt': [(0, 1759)],
 'Economics-2022-0-01.txt': [(0, 3022)],
 'Economics-2022-0-02.txt': [(0, 2801)],
 'Economics-2022-0-03.txt': [(0, 2666)],
 'Economics-2022-0-04.txt': [(0, 2976)],
 'Economics-2022-0-05.txt': [(0, 2915)],
 'Economics-2022-0-06.txt': [(0, 3289)],
 'Economics-2022-0-07.txt': [(0, 2977)],
 'Economics-2022-0-08.txt': [(0, 2992)],
 'Economics-2022-0-09.txt': [(0, 3179)],
 'Economics-2022-0-10.txt': [(0, 3390)],
 'Economics-2022-0-11.txt': [(0, 1985)],
 'Economics-2008-0-03.txt': [(0, 2258)],
 'Economics-2008-0-04.txt': [(0, 3371)],
 'Economics-2008-0-05.txt': [(0, 2793)],
 'Economics-2008-0-06.txt': [(0, 2152)],
 'Economics-2008-0-07.txt': [(0, 2238)],
 'Economics-2008-0-08.txt': [(0, 3558)],
 'Economics-2008-0-09.txt': [(0, 2979)],
 'Economics-2008-0-10.txt': [(0, 2916)],
 'Economics-2008-0-11.txt': [(0, 1953)],
 'Economics-2008-0-12.txt': [(0, 1801)],
 'Economics-2008-0-13.txt': [(0, 2622)],
 'Economics-2008-0-14.txt': [(0, 3158)],
 'Economics-2008-0-15.txt': [(0, 3298)],
 'Economics-2008-0-16.txt': [(0, 2469)],
 'Economics-2008-0-17.txt': [(0, 3708)],
 'Economics-2008-0-18.txt': [(0, 3230)],
 'Economics-2008-0-19.txt': [(0, 2280)],
 'Economics-2008-0-20.txt': [(0, 3350)],
 'Economics-2008-0-21.txt': [(215, 2035)],
 'Economics-2008-0-22.txt': [(0, 0)],
 'Economics-2008-0-23.txt': [(0, 3574)],
 'Economics-2008-0-24.txt': [(0, 3152)],
 'Economics-2008-0-25.txt': [(0, 3426)],
 'Economics-2008-0-26.txt': [(0, 3632)],
 'Economics-2008-0-27.txt': [(0, 3783)],
 'Economics-2008-0-28.txt': [(0, 3559)],
 'Economics-2008-0-29.txt': [(130, 2479)],
 'Economics-2008-0-30.txt': [(0, 301)],
 'Economics-2009-0-03.txt': [(0, 2525)],
 'Economics-2009-0-04.txt': [(0, 3550)],
 'Economics-2009-0-05.txt': [(0, 4387)],
 'Economics-2009-0-06.txt': [(0, 3425)],
 'Economics-2009-0-07.txt': [(0, 4018)],
 'Economics-2009-0-08.txt': [(0, 3937)],
 'Economics-2009-0-09.txt': [(69, 3780)],
 'Economics-2009-0-10.txt': [(0, 4078)],
 'Economics-2009-0-11.txt': [(0, 3775)],
 'Economics-2009-0-12.txt': [(45, 2381)],
 'Economics-2009-0-13.txt': [(0, 1242), (1303, 2688)],
 'Economics-2009-0-14.txt': [(0, 3673)],
 'Economics-2009-0-15.txt': [(0, 2784)],
 'Economics-2009-0-16.txt': [(0, 2920)],
 'Economics-2009-0-17.txt': [(0, 4114)],
 'Economics-2009-0-18.txt': [(0, 4011)],
 'Economics-2009-0-19.txt': [(0, 3986)],
 'Economics-2009-0-20.txt': [(0, 824)],
 'Economics-2010-0-01.txt': [(0, 1974)],
 'Economics-2010-0-02.txt': [(0, 4639)],
 'Economics-2010-0-03.txt': [(0, 4243)],
 'Economics-2010-0-04.txt': [(1204, 4060)],
 'Economics-2010-0-05.txt': [(0, 4555)],
 'Economics-2010-0-06.txt': [(262, 2924)],
 'Economics-2010-0-07.txt': [(0, 4233)],
 'Economics-2010-0-08.txt': [(0, 3811)],
 'Economics-2010-0-09.txt': [(0, 4276)],
 'Economics-2010-0-10.txt': [(0, 4022)],
 'Economics-2010-0-11.txt': [(822, 3471)],
 'Economics-2010-0-12.txt': [(0, 2632)],
 'Economics-2010-0-13.txt': [(476, 4411)],
 'Economics-2010-0-14.txt': [(1478, 4520)],
 'Economics-2010-0-15.txt': [(322, 2583)],
 'Economics-2010-0-16.txt': [(0, 3718)],
 'Economics-2010-0-17.txt': [(0, 4310)],
 'Economics-2010-0-18.txt': [(78, 1826)],
 'Economics-2010-0-19.txt': [(2006, 3056)],
 'Economics-2010-0-20.txt': [(0, 4234)],
 'Economics-2010-0-21.txt': [(1154, 3082)],
 'Economics-2010-0-22.txt': [(0, 4269)],
 'Economics-2010-0-23.txt': [(0, 4547)],
 'Economics-2010-0-24.txt': [(0, 4465)],
 'Economics-2010-0-25.txt': [(664, 1800)],
 'Economics-2010-0-26.txt': [(1227, 3348)],
 'Economics-2010-0-27.txt': [(0, 4351)],
 'Economics-2010-0-28.txt': [(0, 3466)],
 'Economics-2011-0-02.txt': [(0, 854)],
 'Economics-2011-0-03.txt': [(78, 2260)],
 'Economics-2011-0-04.txt': [(95, 1790)],
 'Economics-2011-0-05.txt': [(0, 2259)],
 'Economics-2011-0-06.txt': [(47, 1952)],
 'Economics-2011-0-07.txt': [(0, 3218)],
 'Economics-2011-0-08.txt': [(55, 2099)],
 'Economics-2011-0-09.txt': [(0, 3685)],
 'Economics-2011-0-10.txt': [(94, 645)],
 'Economics-2011-0-11.txt': [(89, 1980)],
 'Economics-2011-0-12.txt': [(0, 2990)],
 'Economics-2011-0-13.txt': [(301, 1930)],
 'Economics-2011-0-14.txt': [(188, 1931)],
 'Economics-2011-0-15.txt': [(85, 1773)],
 'Economics-2011-0-16.txt': [(86, 2254)],
 'Economics-2011-0-17.txt': [(0, 3777)],
 'Economics-2011-0-18.txt': [(193, 2106)],
 'Economics-2011-0-19.txt': [(52, 1074)],
 'Economics-2011-0-20.txt': [(0, 1373)],
 'Economics-2011-0-21.txt': [(0, 2863)],
 'Economics-2011-0-22.txt': [(116, 1824)],
 'Economics-2011-0-23.txt': [(619, 2861)],
 'Economics-2011-0-24.txt': [(0, 2823)],
 'Economics-2011-0-25.txt': [(0, 2895)],
 'Economics-2011-0-26.txt': [(0, 2810)],
 'Economics-2011-0-27.txt': [(0, 1731)],
 'Economics-2011-0-28.txt': [(0, 1788)],
 'Economics-2011-0-29.txt': [(0, 1866)],
 'Economics-2011-0-30.txt': [(0, 1616)],
 'Economics-2011-0-31.txt': [(949, 2863)],
 'Economics-2011-0-32.txt': [(0, 2026)],
 'Economics-2011-0-33.txt': [(378, 813)],
 'Economics-2011-0-34.txt': [(515, 1017)],
 'Economics-2011-0-35.txt': [(109, 1628)],
 'Economics-2011-0-36.txt': [(0, 2128)],
 'Economics-2011-0-37.txt': [(0, 3483)],
 'Economics-2011-0-38.txt': [(0, 3418)],
 'Economics-2011-0-39.txt': [(0, 3113)],
 'Economics-2011-0-40.txt': [(0, 0)],
 'Economics-2012-0-02.txt': [(0, 830)],
 'Economics-2012-0-03.txt': [(0, 3639)],
 'Economics-2012-0-04.txt': [(734, 2428)],
 'Economics-2012-0-05.txt': [(0, 3065)],
 'Economics-2012-0-06.txt': [(91, 1946)],
 'Economics-2012-0-07.txt': [(418, 903)],
 'Economics-2012-0-08.txt': [(368, 2129)],
 'Economics-2012-0-09.txt': [(0, 2913)],
 'Economics-2012-0-10.txt': [(0, 2837)],
 'Economics-2012-0-11.txt': [(0, 2712)],
 'Economics-2012-0-12.txt': [(0, 2876)],
 'Economics-2012-0-13.txt': [(0, 3370)],
 'Economics-2012-0-14.txt': [(0, 0)],
 'Economics-2012-0-15.txt': [(493, 1974)],
 'Economics-2012-0-16.txt': [(185, 1504)],
 'Economics-2012-0-17.txt': [(0, 0)],
 'Economics-2012-0-18.txt': [(0, 3485)],
 'Economics-2012-0-19.txt': [(1536, 2587)],
 'Economics-2012-0-20.txt': [(871, 1501)],
 'Economics-2012-0-21.txt': [(260, 580)],
 'Economics-2012-0-22.txt': [(189, 1756)],
 'Economics-2012-0-23.txt': [(1523, 3191)],
 'Economics-2012-0-24.txt': [(0, 0)],
 'Economics-2012-0-25.txt': [(166, 1847)],
 'Economics-2012-0-26.txt': [(0, 560)]}

In [None]:
import os
import re
partially_clean_dir = "data/groups/E5/temp/partially_clean"
edited ="E5_long_edited_FINISHED.txt"

look_over_file = "E5_check_over.txt"
if os.path.exists(look_over_file):
    os.remove(look_over_file)

edited_text = open(edited).read().strip()
split = edited_text.split('---')
names = split[1::2]
texts = split[2::2]
ms:dict[str,list[tuple[int,int]]]  = {}
problems = []
import difflib

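def find_indices(old_text:str, edited_text:str):
    # Try an exact substring match first; fall back to piecewise matching
    # when the edit removed text from the middle of the page
    start_index = old_text.find(edited_text)
    stop_index = start_index + len(edited_text)
    # find_substrings_indices expects (modified_string, mainstring), so the edited text goes first
    return [(start_index, stop_index)] if start_index!=-1 else find_substrings_indices(edited_text,old_text)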

def slice_text(ranges:list[tuple[int,int]],text)->str:
    new_text= ""
    for a,b in ranges:
        new_text+=text[a:b]+"\n"
    return new_text
for name, edited_text in zip(names,texts):
    og_path = os.path.join(partially_clean_dir,name)
    try:
        old_text = open(og_path,'r').read()
    except FileNotFoundError:
        continue
    ms[name] =  find_substrings_indices(edited_text.strip(),old_text.strip())
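
Before the ranges in `ms` are used to slice the original pages, it can help to confirm that they really rebuild the hand-edited text. The sketch below is one way to do that. The names `ranges_reproduce` and `ranges_are_ordered` are hypothetical helpers, not functions defined elsewhere in this notebook. Unlike `slice_text`, the rebuild here joins slices without adding a newline, so that it can be compared directly with the edited text.

```python
def ranges_are_ordered(ranges: list[tuple[int, int]]) -> bool:
    """True if every range is non-negative in length and ranges do not overlap or go backwards."""
    each_valid = all(a <= b for a, b in ranges)
    in_order = all(prev[1] <= nxt[0] for prev, nxt in zip(ranges, ranges[1:]))
    return each_valid and in_order

def ranges_reproduce(original: str, edited: str, ranges: list[tuple[int, int]]) -> bool:
    """True if concatenating original[a:b] for each range gives exactly the edited text."""
    rebuilt = "".join(original[a:b] for a, b in ranges)
    return rebuilt == edited

# Toy example: "XX" was removed from the middle of the original
original = "abcXXdef"
edited = "abcdef"
good = [(0, 3), (5, 8)]
bad = [(0, 3), (4, 8)]

print(ranges_are_ordered(good), ranges_reproduce(original, edited, good))
print(ranges_are_ordered(bad), ranges_reproduce(original, edited, bad))
```

Pages that fail either check could be appended to the `problems` list for manual review.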



In [179]:
## test editing
import shutil
test_dir = 'data/groups/E7/temp/index_slicing_test'
if os.path.exists(test_dir):
    shutil.rmtree(test_dir)
os.makedirs(test_dir,exist_ok=True)
def apply_pre_edited_splits(dest_dir,bounds_dict:dict[str,tuple[int,int]]):
    """Slice every partially cleaned page by its (start, stop) bounds and write the result to dest_dir."""
    for file in sorted(os.listdir(partial_clean_dir)):
        bounds = bounds_dict.get(file)
        if bounds is None:
            print("Not in dictionary:",file)
            continue
        start,stop = bounds
        old_path = os.path.join(partial_clean_dir,file)
        # "-e" marks the edited slice, "-b" marks the untouched copy kept for comparison
        new_edit_path = os.path.join(dest_dir,file.replace(".",'-e.'))
        compare_file_path = os.path.join(dest_dir,file.replace(".",'-b.'))
        with open(old_path,'r') as f:
            old_text = f.read()

        new_text = old_text[start:stop]
        shutil.copyfile(old_path,compare_file_path)
        with open(new_edit_path,'w') as f:
            f.write(new_text)




Not in dictionary: .DS_Store


In [None]:
edit_bounds = {'Economics-2013-0-02.txt': (0, 1300),
 'Economics-2013-0-03.txt': (0, 2906),
 'Economics-2013-0-04.txt': (0, 3448),
 'Economics-2013-0-05.txt': (0, 3296),
 'Economics-2013-0-06.txt': (0, 3516),
 'Economics-2013-0-07.txt': (0, 1431),
 'Economics-2013-0-08.txt': (328, 3215),
 'Economics-2013-0-09.txt': (0, 3415),
 'Economics-2013-0-10.txt': (0, 1584),
 'Economics-2013-0-11.txt': (0, 2596),
 'Economics-2013-0-12.txt': (0, 2088),
 'Economics-2013-0-13.txt': (0, 1592),
 'Economics-2013-0-14.txt': (0, 2852),
 'Economics-2013-0-15.txt': (0, 2732),
 'Economics-2013-0-16.txt': (0, 3267),
 'Economics-2013-0-17.txt': (0, 2447),
 'Economics-2013-0-18.txt': (0, 2177),
 'Economics-2013-0-19.txt': (0, 3362),
 'Economics-2013-0-20.txt': (170, 1860),
 'Economics-2013-0-21.txt': (0, 559),
 'Economics-2013-0-22.txt': (0, 0),
 'Economics-2014-0-02.txt': (0, 579),
 'Economics-2014-0-03.txt': (0, 3105),
 'Economics-2014-0-04.txt': (0, 2513),
 'Economics-2014-0-05.txt': (0, 2811),
 'Economics-2014-0-06.txt': (0, 2852),
 'Economics-2014-0-07.txt': (0, 0),
 'Economics-2014-0-08.txt': (0, 3419),
 'Economics-2014-0-09.txt': (1938, 3437),
 'Economics-2014-0-10.txt': (0, 1624),
 'Economics-2014-0-11.txt': (0, 0),
 'Economics-2014-0-12.txt': (846, 2171),
 'Economics-2014-0-13.txt': (0, 3198),
 'Economics-2014-0-14.txt': (0, 2556),
 'Economics-2014-0-15.txt': (0, 2669),
 'Economics-2014-0-16.txt': (0, 2104),
 'Economics-2014-0-17.txt': (0, 2850),
 'Economics-2014-0-18.txt': (2172, 3121),
 'Economics-2014-0-19.txt': (0, 3169),
 'Economics-2014-0-20.txt': (437, 1711),
 'Economics-2014-0-21.txt': (0, 2929),
 'Economics-2014-0-22.txt': (0, 2975),
 'Economics-2014-0-23.txt': (1952, 3189),
 'Economics-2014-0-24.txt': (0, 3413),
 'Economics-2014-0-25.txt': (1407, 1733),
 'Economics-2014-0-26.txt': (0, 2738),
 'Economics-2014-0-27.txt': (0, 3257),
 'Economics-2014-0-28.txt': (0, 3495),
 'Economics-2014-0-29.txt': (0, 919),
 'Economics-2015-0-02.txt': (0, 1223),
 'Economics-2015-0-03.txt': (0, 3703),
 'Economics-2015-0-04.txt': (0, 3180),
 'Economics-2015-0-05.txt': (0, 2557),
 'Economics-2015-0-06.txt': (0, 3218),
 'Economics-2015-0-07.txt': (47, 3730),
 'Economics-2015-0-08.txt': (0, 3649),
 'Economics-2015-0-09.txt': (0, 3152),
 'Economics-2015-0-10.txt': (0, 3255),
 'Economics-2015-0-11.txt': (0, 3326),
 'Economics-2015-0-12.txt': (0, 3487),
 'Economics-2015-0-13.txt': (590, 2203),
 'Economics-2015-0-14.txt': (0, 3350),
 'Economics-2015-0-15.txt': (0, 3422),
 'Economics-2015-0-16.txt': (0, 3575),
 'Economics-2015-0-17.txt': (0, 3551),
 'Economics-2015-0-18.txt': (187, 2795),
 'Economics-2015-0-19.txt': (796, 3464),
 'Economics-2015-0-20.txt': (0, 3253),
 'Economics-2015-0-21.txt': (0, 0),
 'Economics-2015-0-22.txt': (0, 3015),
 'Economics-2015-0-23.txt': (0, 0),
 'Economics-2015-0-24.txt': (0, 0),
 'Economics-2015-0-25.txt': (0, 3418),
 'Economics-2015-0-26.txt': (0, 3587),
 'Economics-2015-0-27.txt': (688, 2088),
 'Economics-2015-0-28.txt': (965, 1607),
 'Economics-2015-0-29.txt': (0, 3371),
 'Economics-2015-0-30.txt': (0, 3463),
 'Economics-2015-0-31.txt': (0, 1831),
 'Economics-2016-0-02.txt': (0, 2314),
 'Economics-2016-0-03.txt': (0, 3542),
 'Economics-2016-0-04.txt': (0, 3248),
 'Economics-2016-0-05.txt': (0, 3511),
 'Economics-2016-0-06.txt': (0, 3692),
 'Economics-2016-0-07.txt': (0, 2437),
 'Economics-2016-0-08.txt': (0, 3397),
 'Economics-2016-0-09.txt': (0, 3336),
 'Economics-2016-0-10.txt': (0, 3536),
 'Economics-2016-0-11.txt': (0, 3173),
 'Economics-2016-0-12.txt': (0, 3552),
 'Economics-2016-0-13.txt': (0, 3467),
 'Economics-2016-0-14.txt': (0, 2282),
 'Economics-2016-0-15.txt': (0, 3494),
 'Economics-2016-0-16.txt': (0, 3391),
 'Economics-2016-0-17.txt': (0, 3015),
 'Economics-2016-0-18.txt': (0, 3431),
 'Economics-2016-0-19.txt': (0, 3415),
 'Economics-2016-0-20.txt': (0, 3709),
 'Economics-2016-0-21.txt': (0, 3808),
 'Economics-2016-0-22.txt': (0, 2824),
 'Economics-2017-0-03.txt': (0, 1182),
 'Economics-2017-0-04.txt': (0, 3340),
 'Economics-2017-0-05.txt': (0, 3302),
 'Economics-2017-0-06.txt': (309, 2657),
 'Economics-2017-0-07.txt': (0, 3125),
 'Economics-2017-0-08.txt': (0, 3330),
 'Economics-2017-0-09.txt': (0, 2753),
 'Economics-2017-0-10.txt': (0, 3387),
 'Economics-2017-0-11.txt': (0, 3040),
 'Economics-2017-0-12.txt': (0, 3436),
 'Economics-2017-0-13.txt': (226, 2384),
 'Economics-2017-0-14.txt': (0, 3569),
 'Economics-2017-0-15.txt': (0, 3458),
 'Economics-2017-0-16.txt': (644, 2275),
 'Economics-2017-0-17.txt': (0, 3215),
 'Economics-2017-0-18.txt': (0, 3414),
 'Economics-2017-0-19.txt': (271, 1688),
 'Economics-2017-0-20.txt': (0, 3199),
 'Economics-2017-0-21.txt': (0, 2979),
 'Economics-2017-0-22.txt': (0, 3680),
 'Economics-2017-0-23.txt': (0, 3164),
 'Economics-2017-0-24.txt': (377, 2138),
 'Economics-2017-0-25.txt': (0, 3635),
 'Economics-2017-0-26.txt': (388, 2023),
 'Economics-2017-0-27.txt': (0, 2933),
 'Economics-2017-0-28.txt': (0, 3027),
 'Economics-2017-0-29.txt': (0, 3189),
 'Economics-2017-0-30.txt': (0, 3033),
 'Economics-2017-0-31.txt': (400, 2311),
 'Economics-2017-0-32.txt': (0, 3257),
 'Economics-2017-0-33.txt': (0, 2974),
 'Economics-2017-0-34.txt': (0, 3331),
 'Economics-2017-0-35.txt': (0, 1560),
 'Economics-2018-0-02.txt': (1164, 1327),
 'Economics-2018-0-03.txt': (0, 2119),
 'Economics-2018-0-04.txt': (0, 3018),
 'Economics-2018-0-05.txt': (0, 1997),
 'Economics-2018-0-06.txt': (0, 2932),
 'Economics-2018-0-07.txt': (0, 1870),
 'Economics-2018-0-08.txt': (0, 2905),
 'Economics-2018-0-09.txt': (0, 2282),
 'Economics-2018-0-10.txt': (0, 1778),
 'Economics-2018-0-11.txt': (0, 1911),
 'Economics-2018-0-12.txt': (0, 2888),
 'Economics-2018-0-13.txt': (0, 2903),
 'Economics-2018-0-14.txt': (0, 2550),
 'Economics-2018-0-15.txt': (0, 2870),
 'Economics-2018-0-16.txt': (0, 2688),
 'Economics-2018-0-17.txt': (0, 2645),
 'Economics-2018-0-18.txt': (0, 2289),
 'Economics-2018-0-19.txt': (0, 2504),
 'Economics-2018-0-20.txt': (0, 2178),
 'Economics-2018-0-21.txt': (0, 2301),
 'Economics-2018-0-22.txt': (0, 2419),
 'Economics-2018-0-23.txt': (0, 2740),
 'Economics-2018-0-24.txt': (0, 2290),
 'Economics-2018-0-25.txt': (0, 2657),
 'Economics-2018-0-26.txt': (0, 2734),
 'Economics-2018-0-27.txt': (0, 2298),
 'Economics-2018-0-28.txt': (0, 2825),
 'Economics-2018-0-29.txt': (0, 2602),
 'Economics-2018-0-30.txt': (0, 2541),
 'Economics-2018-0-31.txt': (0, 2665),
 'Economics-2018-0-32.txt': (0, 2745),
 'Economics-2018-0-33.txt': (0, 2440),
 'Economics-2018-0-34.txt': (0, 1898),
 'Economics-2018-0-35.txt': (0, 3180),
 'Economics-2018-0-36.txt': (0, 2194),
 'Economics-2018-0-37.txt': (0, 3129),
 'Economics-2018-0-38.txt': (0, 2762),
 'Economics-2018-0-39.txt': (0, 2471),
 'Economics-2018-0-40.txt': (0, 2802),
 'Economics-2019-0-03.txt': (0, 0),
 'Economics-2019-0-04.txt': (0, 3111),
 'Economics-2019-0-05.txt': (0, 3873),
 'Economics-2019-0-06.txt': (0, 3976),
 'Economics-2019-0-07.txt': (0, 2620),
 'Economics-2019-0-08.txt': (0, 788),
 'Economics-2019-0-09.txt': (0, 2773),
 'Economics-2019-0-10.txt': (0, 1002),
 'Economics-2019-0-11.txt': (0, 2708),
 'Economics-2019-0-12.txt': (0, 1045),
 'Economics-2019-0-13.txt': (0, 3139),
 'Economics-2019-0-14.txt': (0, 2382),
 'Economics-2019-0-15.txt': (0, 1993),
 'Economics-2019-0-16.txt': (0, 2193),
 'Economics-2019-0-17.txt': (0, 1239),
 'Economics-2019-0-18.txt': (0, 2669),
 'Economics-2019-0-19.txt': (0, 3241),
 'Economics-2019-0-20.txt': (0, 1991),
 'Economics-2019-0-21.txt': (0, 2256),
 'Economics-2019-0-22.txt': (0, 768),
 'Economics-2019-0-23.txt': (0, 2828),
 'Economics-2019-0-24.txt': (0, 890),
 'Economics-2019-0-25.txt': (0, 1767),
 'Economics-2019-0-26.txt': (0, 2974),
 'Economics-2019-0-27.txt': (0, 0),
 'Economics-2019-0-28.txt': (0, 1888),
 'Economics-2019-0-29.txt': (0, 2480),
 'Economics-2019-0-30.txt': (0, 2848),
 'Economics-2019-0-31.txt': (0, 3593),
 'Economics-2019-0-32.txt': (0, 3168),
 'Economics-2019-0-33.txt': (0, 1494),
 'Economics-2020-0-00.txt': (0, 948),
 'Economics-2020-0-01.txt': (0, 3229),
 'Economics-2020-0-02.txt': (0, 3106),
 'Economics-2020-0-03.txt': (0, 3091),
 'Economics-2020-0-04.txt': (0, 3761),
 'Economics-2020-0-05.txt': (0, 2301),
 'Economics-2020-0-06.txt': (0, 3583),
 'Economics-2020-0-07.txt': (0, 3560),
 'Economics-2020-0-08.txt': (0, 3236),
 'Economics-2020-0-09.txt': (0, 3509),
 'Economics-2020-0-10.txt': (0, 3383),
 'Economics-2020-0-11.txt': (0, 3453),
 'Economics-2020-0-12.txt': (0, 3749),
 'Economics-2020-0-13.txt': (0, 2768),
 'Economics-2020-0-14.txt': (0, 3777),
 'Economics-2020-0-15.txt': (0, 3513),
 'Economics-2020-0-16.txt': (0, 3719),
 'Economics-2020-0-17.txt': (0, 3570),
 'Economics-2020-0-18.txt': (0, 3092),
 'Economics-2020-0-19.txt': (0, 2653),
 'Economics-2020-0-20.txt': (0, 3374),
 'Economics-2020-0-21.txt': (0, 3384),
 'Economics-2020-0-22.txt': (0, 3262),
 'Economics-2020-0-23.txt': (0, 3778),
 'Economics-2020-0-24.txt': (0, 3395),
 'Economics-2020-0-25.txt': (0, 3665),
 'Economics-2020-0-26.txt': (0, 3479),
 'Economics-2020-0-27.txt': (0, 1844),
 'Economics-2020-0-28.txt': (0, 3260),
 'Economics-2020-0-29.txt': (0, 2077),
 'Economics-2020-0-30.txt': (0, 3400),
 'Economics-2020-0-31.txt': (0, 3698),
 'Economics-2020-0-32.txt': (0, 3421),
 'Economics-2020-0-33.txt': (0, 3504),
 'Economics-2020-0-34.txt': (0, 3597),
 'Economics-2020-0-35.txt': (0, 1648),
 'Economics-2022-0-00.txt': (0, 1759),
 'Economics-2022-0-01.txt': (0, 3022),
 'Economics-2022-0-02.txt': (0, 2801),
 'Economics-2022-0-03.txt': (0, 2666),
 'Economics-2022-0-04.txt': (0, 2976),
 'Economics-2022-0-05.txt': (0, 2915),
 'Economics-2022-0-06.txt': (0, 3289),
 'Economics-2022-0-07.txt': (0, 2977),
 'Economics-2022-0-08.txt': (0, 2992),
 'Economics-2022-0-09.txt': (0, 3179),
 'Economics-2022-0-10.txt': (0, 3390),
 'Economics-2022-0-11.txt': (0, 1985),
 'Economics-2008-0-03.txt': (0, 2258),
 'Economics-2008-0-04.txt': (0, 3371),
 'Economics-2008-0-05.txt': (0, 2793),
 'Economics-2008-0-06.txt': (0, 2152),
 'Economics-2008-0-07.txt': (0, 2238),
 'Economics-2008-0-08.txt': (0, 3558),
 'Economics-2008-0-09.txt': (0, 2979),
 'Economics-2008-0-10.txt': (0, 2916),
 'Economics-2008-0-11.txt': (0, 1953),
 'Economics-2008-0-12.txt': (0, 1801),
 'Economics-2008-0-13.txt': (0, 2622),
 'Economics-2008-0-14.txt': (0, 3158),
 'Economics-2008-0-15.txt': (0, 3298),
 'Economics-2008-0-16.txt': (0, 2469),
 'Economics-2008-0-17.txt': (0, 3708),
 'Economics-2008-0-18.txt': (0, 3230),
 'Economics-2008-0-19.txt': (0, 2280),
 'Economics-2008-0-20.txt': (0, 3350),
 'Economics-2008-0-21.txt': (215, 2035),
 'Economics-2008-0-22.txt': (0, 0),
 'Economics-2008-0-23.txt': (0, 3574),
 'Economics-2008-0-24.txt': (0, 3152),
 'Economics-2008-0-25.txt': (0, 3426),
 'Economics-2008-0-26.txt': (0, 3632),
 'Economics-2008-0-27.txt': (0, 3783),
 'Economics-2008-0-28.txt': (0, 3559),
 'Economics-2008-0-29.txt': (130, 2479),
 'Economics-2008-0-30.txt': (0, 301),
 'Economics-2009-0-03.txt': (0, 2525),
 'Economics-2009-0-04.txt': (0, 3550),
 'Economics-2009-0-05.txt': (0, 4387),
 'Economics-2009-0-06.txt': (0, 3425),
 'Economics-2009-0-07.txt': (0, 4018),
 'Economics-2009-0-08.txt': (0, 3937),
 'Economics-2009-0-09.txt': (69, 3780),
 'Economics-2009-0-10.txt': (0, 4078),
 'Economics-2009-0-11.txt': (0, 3775),
 'Economics-2009-0-12.txt': (45, 2381),
 'Economics-2009-0-13.txt': (1302, 2689),
 'Economics-2009-0-14.txt': (0, 3673),
 'Economics-2009-0-15.txt': (0, 2784),
 'Economics-2009-0-16.txt': (0, 2920),
 'Economics-2009-0-17.txt': (0, 4114),
 'Economics-2009-0-18.txt': (0, 4011),
 'Economics-2009-0-19.txt': (0, 3986),
 'Economics-2009-0-20.txt': (0, 824),
 'Economics-2010-0-01.txt': (0, 1974),
 'Economics-2010-0-02.txt': (0, 4639),
 'Economics-2010-0-03.txt': (0, 4243),
 'Economics-2010-0-04.txt': (1204, 4060),
 'Economics-2010-0-05.txt': (0, 4555),
 'Economics-2010-0-06.txt': (262, 2924),
 'Economics-2010-0-07.txt': (0, 4233),
 'Economics-2010-0-08.txt': (0, 3811),
 'Economics-2010-0-09.txt': (0, 4276),
 'Economics-2010-0-10.txt': (0, 4022),
 'Economics-2010-0-11.txt': (822, 3471),
 'Economics-2010-0-12.txt': (0, 2632),
 'Economics-2010-0-13.txt': (476, 4411),
 'Economics-2010-0-14.txt': (1478, 4520),
 'Economics-2010-0-15.txt': (322, 2583),
 'Economics-2010-0-16.txt': (0, 3718),
 'Economics-2010-0-17.txt': (0, 4310),
 'Economics-2010-0-18.txt': (78, 1826),
 'Economics-2010-0-19.txt': (2006, 3056),
 'Economics-2010-0-20.txt': (0, 4234),
 'Economics-2010-0-21.txt': (1154, 3082),
 'Economics-2010-0-22.txt': (0, 4269),
 'Economics-2010-0-23.txt': (0, 4547),
 'Economics-2010-0-24.txt': (0, 4465),
 'Economics-2010-0-25.txt': (664, 1800),
 'Economics-2010-0-26.txt': (1227, 3348),
 'Economics-2010-0-27.txt': (0, 4351),
 'Economics-2010-0-28.txt': (0, 3466),
 'Economics-2011-0-02.txt': (0, 854),
 'Economics-2011-0-03.txt': (78, 2260),
 'Economics-2011-0-04.txt': (95, 1790),
 'Economics-2011-0-05.txt': (0, 2259),
 'Economics-2011-0-06.txt': (47, 1952),
 'Economics-2011-0-07.txt': (0, 3218),
 'Economics-2011-0-08.txt': (55, 2099),
 'Economics-2011-0-09.txt': (0, 3685),
 'Economics-2011-0-10.txt': (94, 645),
 'Economics-2011-0-11.txt': (89, 1980),
 'Economics-2011-0-12.txt': (0, 2990),
 'Economics-2011-0-13.txt': (301, 1930),
 'Economics-2011-0-14.txt': (188, 1931),
 'Economics-2011-0-15.txt': (85, 1773),
 'Economics-2011-0-16.txt': (86, 2254),
 'Economics-2011-0-17.txt': (0, 3777),
 'Economics-2011-0-18.txt': (193, 2106),
 'Economics-2011-0-19.txt': (52, 1074),
 'Economics-2011-0-20.txt': (0, 1373),
 'Economics-2011-0-21.txt': (0, 2863),
 'Economics-2011-0-22.txt': (116, 1824),
 'Economics-2011-0-23.txt': (619, 2861),
 'Economics-2011-0-24.txt': (0, 2823),
 'Economics-2011-0-25.txt': (0, 2895),
 'Economics-2011-0-26.txt': (0, 2810),
 'Economics-2011-0-27.txt': (0, 1731),
 'Economics-2011-0-28.txt': (0, 1788),
 'Economics-2011-0-29.txt': (0, 1866),
 'Economics-2011-0-30.txt': (0, 1616),
 'Economics-2011-0-31.txt': (949, 2863),
 'Economics-2011-0-32.txt': (0, 2026),
 'Economics-2011-0-33.txt': (378, 813),
 'Economics-2011-0-34.txt': (515, 1017),
 'Economics-2011-0-35.txt': (109, 1628),
 'Economics-2011-0-36.txt': (0, 2128),
 'Economics-2011-0-37.txt': (0, 3483),
 'Economics-2011-0-38.txt': (0, 3418),
 'Economics-2011-0-39.txt': (0, 3113),
 'Economics-2011-0-40.txt': (0, 0),
 'Economics-2012-0-02.txt': (0, 830),
 'Economics-2012-0-03.txt': (0, 3639),
 'Economics-2012-0-04.txt': (734, 2428),
 'Economics-2012-0-05.txt': (0, 3065),
 'Economics-2012-0-06.txt': (91, 1946),
 'Economics-2012-0-07.txt': (418, 903),
 'Economics-2012-0-08.txt': (368, 2129),
 'Economics-2012-0-09.txt': (0, 2913),
 'Economics-2012-0-10.txt': (0, 2837),
 'Economics-2012-0-11.txt': (0, 2754),
 'Economics-2012-0-12.txt': (0, 2876),
 'Economics-2012-0-13.txt': (0, 3370),
 'Economics-2012-0-14.txt': (0, 0),
 'Economics-2012-0-15.txt': (493, 1974),
 'Economics-2012-0-16.txt': (185, 1504),
 'Economics-2012-0-17.txt': (0, 0),
 'Economics-2012-0-18.txt': (0, 3485),
 'Economics-2012-0-19.txt': (1536, 2587),
 'Economics-2012-0-20.txt': (871, 1501),
 'Economics-2012-0-21.txt': (260, 580),
 'Economics-2012-0-22.txt': (189, 1756),
 'Economics-2012-0-23.txt': (1523, 3191),
 'Economics-2012-0-24.txt': (0, 0),
 'Economics-2012-0-25.txt': (166, 1847),
 'Economics-2012-0-26.txt': (0, 560)}

# E7 Quest

In [33]:
test_dir = "E7_with_quest"
temp_dir = "E7_with_quest_temp"
shutil.copytree(test_dir,temp_dir)

'E7_with_quest_temp'

In [None]:
import os
import re
import shutil
os.chdir("/Users/BeckyMarcusMacbook/Thesis/manual_work/")
def remove_quest(text:str)->str:
    # Rejoins a word that was split across a line break by a "?" (e.g. "equi?\nlibrium")
    return re.sub(r"([a-zA-Z]+)\?\n([a-zA-Z]+)([^\w\n\s])?", # Captures 3 groups: first half of word, second half of word, optional punctuation
                  r"\1\2\3\n", # removes the question mark and moves the line break
                  text)
for file in sorted(os.listdir(test_dir)):
    if file[0]==".":
        continue
    with open(os.path.join(test_dir,file)) as f:
        text = f.read().strip()
    if len(text)==0:
        continue
    new_text = remove_quest(text)
    if text!=new_text:
        print(file)
        with open(os.path.join(temp_dir,file),'w') as f:
            f.write(new_text)
### Didn't show anything
# last_file_quest = False
# last_file_id = None
# for file in sorted(os.listdir(test_dir)):
#     if file[0]==".":
#         continue
#     text = open(os.path.join(test_dir,file)).read().strip()
#     id = file.rsplit('-',1)[0]
#     if len(text)==0:
#         continue
    
#     if last_file_quest and id==last_file_id:
#         print(text.splitlines()[0])
#         print("---")
#     if text[-1]=="?":
#         print(f"---\n{file}:")
#         print(text.splitlines()[-1])
#         last_file_quest =True
#         last_file_id = id
#     else:
#         last_file_quest = False
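
To see what the `remove_quest` pattern does to a single page, here is a minimal sketch that runs the same regular expression on a made-up string. The sample text and the name `QUEST_BREAK` are assumptions for illustration only. Note that the pattern cannot tell a split word from a genuine question that happens to end a line, so the second case below is joined as well.

```python
import re

# Same pattern as remove_quest: word fragment, "?" + newline, rest of the word,
# and an optional trailing punctuation mark.
QUEST_BREAK = re.compile(r"([a-zA-Z]+)\?\n([a-zA-Z]+)([^\w\n\s])?")

# Made-up sample: one split word ("equi?/librium,") and one real question ("stable?").
sample = "the equi?\nlibrium, which\nis it stable?\nYes"

fixed = QUEST_BREAK.sub(r"\1\2\3\n", sample)
print(repr(fixed))  # both "?" + newline pairs are rewritten, including the real question
```

Because of that false positive, it is probably worth printing the changed files (as the loop above does) and spot-checking them before overwriting anything.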


## Fixing dash errors differently

In [38]:
test_dir2 = "E7_testing/temp_clean"
temp_dir2 = "E7_testing/temp_clean_no_space"
if os.path.exists(temp_dir2):
    shutil.rmtree(temp_dir2)
shutil.copytree(test_dir2,temp_dir2)

'E7_testing/temp_clean_no_space'

In [40]:
from typing import Optional
import subprocess
def git_commit(path: str, message: Optional[str] = None) -> None:
    """
    Commit changes from a specific file or directory in the repository to Git.

    Args:
        path (str): The file or directory containing the changes to commit.
        message (Optional[str]): The commit message. Defaults to "Modified {path}".
    """
    msg: str = message if message is not None else f"Modified {path}"
    try:
        if not os.path.exists(path):
            subprocess.run(["git", "rm", path], check=True)
        else:
            if os.path.isdir(path):
                # Stage changes in the specified directory
                subprocess.run(["git", "add", f"{path}/."], check=True)
            elif os.path.isfile(path):
                # Stage changes in the specified file
                subprocess.run(["git", "add", path], check=True)
            else:
                raise ValueError(f"The specified path '{path}' is neither a file nor a directory.")

        # Check if there are any staged changes
        result = subprocess.run(
            ["git", "diff", "--cached", "--quiet"],
            check=False,
            capture_output=True
        )
        
        if result.returncode == 1:  # Changes are staged
            subprocess.run(["git", "commit", "-m", msg], check=True)
            print(f"Committed changes from {path} with message: '{msg}'")
        else:
            print(f"No changes to commit in path: {path}")

    except subprocess.CalledProcessError as e:
        print(f"Git operation failed: {e}")
        raise
    except ValueError as e:
        print(str(e))
        raise


In [37]:
def fix_dash_errors(text:str)->str:
    new_text = re.sub(r"([a-zA-Z]+)-\n([a-zA-Z]+)([^\w\n\s])?", # Captures 3 groups: first half of word, second half of word, optional punctuation
                      r"\1\2\3\n", #removes dash and moves line break
                      text)
    new_text_lines_stripped=[line.strip() for line in new_text.split('\n')] #remove any extra leading or trailing whitespace
    return "\n".join(new_text_lines_stripped).strip() #join lines back together

def fix_dash_errors2(text:str)->str:
    new_text = re.sub(r"([a-zA-Z]+)\s*-\s*\n([a-zA-Z]+)([^\w\n\s])?", # Captures 3 groups: first half of word, second half of word, optional punctuation
                      r"\1\2\3\n", #removes dash and moves line break
                      text)
    new_text_lines_stripped=[line.strip() for line in new_text.split('\n')] #remove any extra leading or trailing whitespace
    return "\n".join(new_text_lines_stripped).strip() #join lines back together


test_dir2 = "E7_testing/temp_clean"
temp_dir2 = "E7_testing/temp_clean_no_space"
if os.path.exists(temp_dir2):
    shutil.rmtree(temp_dir2)
shutil.copytree(test_dir2,temp_dir2)
git_commit(temp_dir2,"copied in partially clean text")

for file in sorted(os.listdir(test_dir2)):
    if file[0]==".":
        continue
    text = open(os.path.join(test_dir2,file)).read().strip()
    new_text = fix_dash_errors(text)
    if text!=new_text:
        #print(file)
        with open(os.path.join(temp_dir2,file),'w') as f:
             f.write(new_text)
git_commit(temp_dir2,"Fixed dash errors normally (without space)")


shutil.rmtree(temp_dir2)
shutil.copytree(test_dir2,temp_dir2)

for file in sorted(os.listdir(test_dir2)):
    if file[0]==".":
        continue
    text = open(os.path.join(test_dir2,file)).read().strip()
    new_text = fix_dash_errors2(text)
    if text!=new_text:
        #print(file)
        with open(os.path.join(temp_dir2,file),'w') as f:
             f.write(new_text)
git_commit(temp_dir2,"Fixed dash errors differently, looking for changes")





Economics-2008-0-03.txt
Economics-2008-0-04.txt
Economics-2008-0-05.txt
Economics-2008-0-06.txt
Economics-2008-0-07.txt
Economics-2008-0-08.txt
Economics-2008-0-09.txt
Economics-2008-0-10.txt
Economics-2008-0-11.txt
Economics-2008-0-12.txt
Economics-2008-0-13.txt
Economics-2008-0-14.txt
Economics-2008-0-15.txt
Economics-2008-0-16.txt
Economics-2008-0-17.txt
Economics-2008-0-18.txt
Economics-2008-0-19.txt
Economics-2008-0-20.txt
Economics-2008-0-21.txt
Economics-2008-0-23.txt
Economics-2008-0-24.txt
Economics-2008-0-25.txt
Economics-2008-0-26.txt
Economics-2008-0-27.txt
Economics-2008-0-28.txt
Economics-2008-0-29.txt
Economics-2009-0-03.txt
Economics-2009-0-04.txt
Economics-2009-0-05.txt
Economics-2009-0-06.txt
Economics-2009-0-07.txt
Economics-2009-0-08.txt
Economics-2009-0-09.txt
Economics-2009-0-10.txt
Economics-2009-0-11.txt
Economics-2009-0-12.txt
Economics-2009-0-13.txt
Economics-2009-0-14.txt
Economics-2009-0-15.txt
Economics-2009-0-16.txt
Economics-2009-0-17.txt
Economics-2009-0
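
The two dash-fixing functions above differ only in whether whitespace is allowed around the hyphen. A small sketch of that difference on made-up strings follows; the names `STRICT`, `LOOSE`, and `join_breaks` are hypothetical and only mirror the patterns in `fix_dash_errors` and `fix_dash_errors2`.

```python
import re

STRICT = r"([a-zA-Z]+)-\n([a-zA-Z]+)([^\w\n\s])?"        # pattern from fix_dash_errors
LOOSE = r"([a-zA-Z]+)\s*-\s*\n([a-zA-Z]+)([^\w\n\s])?"   # pattern from fix_dash_errors2

def join_breaks(pattern: str, text: str) -> str:
    """Apply one of the patterns with the same replacement both functions use."""
    return re.sub(pattern, r"\1\2\3\n", text)

tight = "mone-\ntary policy"        # hyphen directly before the line break
spaced = "mone - \ntary policy"     # extraction left spaces around the hyphen
range_break = "supply -\ndemand"    # a spaced dash that is NOT a split word

for name, pattern in [("strict", STRICT), ("loose", LOOSE)]:
    for text in (tight, spaced, range_break):
        print(name, repr(text), "->", repr(join_breaks(pattern, text)))
```

The loose pattern catches the spaced case that the strict one misses, but it also merges words on either side of a free-standing dash, which is why comparing the two commits in Git is a reasonable way to review the extra changes.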

# Break

# Reference for code

### How to change and save text files:
All of the text files live in the directory `econ_text_cleaning/GROUP_NAME/texts` and we want the new files to be in `econ_text_cleaning/GROUP_NAME/changed_texts`. A workflow could therefore look like:
```
import os

source_dir_name = 'econ_text_cleaning/E1/texts'
save_dir_name = 'econ_text_cleaning/E1/changed_texts'
os.makedirs(save_dir_name, exist_ok=True)  # make sure the output directory exists
for file in os.listdir(source_dir_name):
    load_path = os.path.join(source_dir_name,file)
    with open(load_path,'r') as f_load:
        text = f_load.read()

    new_text = ... # do some cleaning functions
    save_path  = os.path.join(save_dir_name,file)
    with open(save_path,'w') as f_save:
        f_save.write(new_text)
```

If you want to have intermediate steps before saving, you can **save the loaded text in a data frame:**
```
import os
import pandas as pd

files = []
orig_texts = []
for file in os.listdir(source_dir_name):
    load_path = os.path.join(source_dir_name,file)
    with open(load_path,'r') as f_load:
        text = f_load.read()

    files.append(file)
    orig_texts.append(text)
# One row per file: build the frame from a dict of columns
df = pd.DataFrame({"file_name": files, "orig_text": orig_texts})
df["text_cleaned1"] = df["orig_text"].apply(clean_step1)
df["text_cleaned2"] = df["text_cleaned1"].apply(clean_step2)
...  # further steps, ending with a "text_cleaned_final" column
for path, final_text in df[["file_name","text_cleaned_final"]].values:
    save_path = os.path.join(save_dir_name,path)
    with open(save_path,'w') as f_save:
        f_save.write(final_text)
```


### Alternatively you can use this helper function apply_func_to_txt_dir:

In [None]:
def apply_func_to_txt_dir(start_dir_path, write_to,func, skip_if_exists=False,pass_filename=False, *args, **kwargs):
    """
    Applies a given function to the contents of all .txt files in a directory tree and writes the 
    results to a new directory structure, maintaining the same relative folder structure.

    Parameters:
    -----------
    start_dir_path : str
        The root directory containing subdirectories and .txt files to process.
    write_to : str
        The root directory where the processed .txt files will be written.
    func : function
        The function to apply to the contents of each .txt file. It should take the file's text 
        content as its first argument and return the modified content.
    skip_if_exists : bool, optional
        If True, the function will skip files that already exist in the write_to directory.
        Defaults to False.
    pass_filename : bool, optional
        If True, the filename will also be passed to the function along with the file content.
        Defaults to False.
    *args : tuple
        Additional positional arguments to pass to the `func`.
    **kwargs : dict
        Additional keyword arguments to pass to the `func`.

    Returns:
    --------
    None
    """
    
    # Traverse the directory tree starting from the provided start_dir_path
    for subdir, dirs, files in os.walk(start_dir_path):
        
        # Skip directories that start with a dot (e.g., .ipynb_checkpoints)
        if os.path.basename(subdir).startswith('.'):  # startswith avoids an IndexError when the path ends in a slash
            print(f"Skipping directory {subdir}")
            continue  # Skip processing for hidden directories
        
        # Get the relative directory path to replicate the structure in the target directory
        rel_dir = os.path.relpath(subdir, start=start_dir_path)
        write_to_dir = os.path.join(write_to, rel_dir)  # Create corresponding dir in write_to
        
        # Create the target directory if it doesn't exist
        if not os.path.exists(write_to_dir):
            os.makedirs(write_to_dir)

        # Process each file in the current directory
        for file_name in files:
            cur_path = os.path.join(subdir, file_name)  # Get the full path of the current file
            
            # Check if the file is a .txt file (adjust this condition for other file types)
            if os.path.isfile(cur_path) and cur_path.endswith(".txt"):
                
                new_file_path = os.path.join(write_to_dir, file_name)  # Path for the new file
                
                # Skip the file if it already exists and skip_if_exists is True
                if os.path.exists(new_file_path) and skip_if_exists:
                    continue
                
                # Open the original file and read its content
                with open(cur_path, 'r') as f:
                    old_text = f.read()
                    
                # Apply the provided function to the text content (with or without filename)
                if pass_filename:
                    new_text = func(old_text, file_name, *args, **kwargs)  # Pass file name
                else:
                    new_text = func(old_text, *args, **kwargs)  # Only pass text content
               
                # Write the modified text to the new file path
                with open(new_file_path, 'w') as f:
                    f.write(new_text)
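
Since `apply_func_to_txt_dir` takes a single `func`, several cleaning steps can be bundled into one callable before passing it in. The following is a minimal sketch of that idea; `compose_cleaners` is a hypothetical helper, and the two step functions are simple stand-ins rather than the notebook's final cleaners.

```python
from functools import reduce
from typing import Callable

def compose_cleaners(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Return one function that applies each cleaning step in order (hypothetical helper)."""
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

# Stand-in steps for illustration
def drop_footer(text: str) -> str:
    # Cut everything from the last "This content downloaded from" line onward
    return text.rsplit("\nThis content downloaded from", 1)[0]

def strip_each_line(text: str) -> str:
    # Remove leading and trailing whitespace on every line
    return "\n".join(line.strip() for line in text.split("\n"))

clean_page = compose_cleaners(drop_footer, strip_each_line)

page = "  Body text  \n more text \nThis content downloaded from (address and date)"
cleaned = clean_page(page)
print(repr(cleaned))
```

The composed function could then be passed as `func`, for example `apply_func_to_txt_dir(source_dir_name, save_dir_name, clean_page)`, assuming the two directory variables are defined as in the workflow above.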

## A few cleaning functions

In [None]:
def jstor_remove_cover_pages(dir_path):
    jstore_cover_pages_list = []
    # iterate each file in a directory, including subdirectories
    for subdir, dirs, files in os.walk(dir_path):
        
        # Skip directories that start with a dot (e.g., .ipynb_checkpoints)
        if os.path.basename(subdir).startswith('.'):
            print(f"Skipping directory {subdir}")
            continue  # Skip processing for hidden directories
        for file_name in files:
            cur_path = os.path.join(subdir, file_name)
            # check if it is a file and is page 0 of a document (to be more efficient)
            if os.path.isfile(cur_path) and cur_path.endswith("00.txt"):
                with open(cur_path, 'r') as f:
                    # read all content of the file and search for the JSTOR cover-page marker
                    is_cover_page = 'Stable URL: https://www.jstor.org/stable/' in f.read()
                if is_cover_page:
                    # record the path, then delete the file once it has been closed
                    jstore_cover_pages_list.append(cur_path)
                    os.remove(cur_path)
    # return the list so we can see what got deleted.
    print(f"Removed {len(jstore_cover_pages_list)} pages")
    return jstore_cover_pages_list

In [None]:
def strip_lines(text):
    lines=text.split('\n')
    stripped=[line.strip() for line in lines]
    return "\n".join(stripped)
def remove_jstor_footer(text):
    return text.rsplit("\nThis content downloaded from",1)[0]

def fix_dash_errors(text):
    new_text = re.sub(r"([a-zA-Z]+)-\n([a-zA-Z]+)([^\w\n\s])?", # Captures 3 groups: first half of word, second half of word, optional punctuation
                      r"\1\2\3\n", #removes dash and moves line break
                      text)
    new_text_lines_stripped=[line.strip() for line in new_text.split('\n')] #remove any extra leading or trailing whitespace
    return "\n".join(new_text_lines_stripped) #join lines back together
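
As a starting point for the running headers noted in the patterns section (all caps at the top of the page, sometimes with an OCR-damaged page number such as "h36"), here is a rough sketch. The function names, the page-number pattern, and the `max_lines` cap are all assumptions, and the sample page is made up; the heuristic has not been checked against the actual files.

```python
import re

# Assumed shape of a page number, allowing one stray OCR letter in front (e.g. "h36")
PAGE_NUM = re.compile(r"^[A-Za-z]?\d{1,4}$")

def looks_like_header_line(line: str) -> bool:
    """Heuristic: blank, page-number-like, or containing no lowercase letters."""
    s = line.strip()
    if not s or PAGE_NUM.match(s):
        return True
    letters = [c for c in s if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)

def remove_running_header(text: str, max_lines: int = 4) -> str:
    """Drop up to max_lines leading lines that look like a running header."""
    lines = text.split("\n")
    i = 0
    while i < len(lines) and i < max_lines and looks_like_header_line(lines[i]):
        i += 1
    return "\n".join(lines[i:])

page = "h36\nTHE AMERICAN ECONOMIC REVIEW\nAPRIL 2012\nThe model in Section II implies"
print(remove_running_header(page))
```

The `max_lines` cap is there so that an all-caps section heading further down the page is left alone; the first page of each article would likely need separate handling, since its page number sits at the bottom.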

## Write and implement your new functions here!