## Parsing a Word document
python-docx documentation: https://python-docx.readthedocs.io/en/latest/

In [9]:
from docx import Document
from io import BytesIO
from random import randint

In [2]:
with open("AA final paper copy.docx", "rb") as f:
    source_stream = BytesIO(f.read())

In [3]:
document = Document(source_stream)

In [4]:
document.paragraphs[7].text

"“The Endless Frontiers [sic] Act is a downpayment for future generations of American technological leadership, and I'm proud to introduce it on a bipartisan basis.” – Representative Mike Gallagher (2020).\n\n“The point is that founding and growing a company is fundamentally an act of exploration and colonization… Google took web search… Twitter colonized real-time status updates. Quora is attempting to colonize Q&A… Facebook of course colonized online identity.” – Kevin Simler (2012)."

Some notes:
- Footnotes are hard to reach from scripts: python-docx's `document.paragraphs` returns body paragraphs only and does not include footnote text.
- https://github.com/ShayHill/docx2python appears to support footnotes but does not look very mature. We could either adapt it directly or write our own version.
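If we go the "write our own version" route, the footnote text can be read straight from the .docx package, which is a zip archive. The following is a minimal sketch using only the standard library, not a replacement for docx2python. It assumes the footnotes part sits at the conventional path `word/footnotes.xml` (strictly, the location is defined by the package relationships), and `read_footnotes` is a hypothetical helper name.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the XML).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Footnote types that hold separators or notices rather than author text.
NON_CONTENT_TYPES = {"separator", "continuationSeparator", "continuationNotice"}


def read_footnotes(docx_file):
    """Return {footnote id: text} from a .docx path or binary file-like object.

    Assumption: footnotes are stored at word/footnotes.xml.
    """
    with zipfile.ZipFile(docx_file) as package:
        if "word/footnotes.xml" not in package.namelist():
            return {}  # document has no footnotes part
        root = ET.fromstring(package.read("word/footnotes.xml"))

    footnotes = {}
    for note in root.iter(f"{{{W_NS}}}footnote"):
        if note.get(f"{{{W_NS}}}type") in NON_CONTENT_TYPES:
            continue
        note_id = int(note.get(f"{{{W_NS}}}id"))
        # Concatenate every text node (w:t) inside the footnote.
        text = "".join(t.text or "" for t in note.iter(f"{{{W_NS}}}t"))
        footnotes[note_id] = text
    return footnotes
```

Usage would look like `read_footnotes("AA final paper copy.docx")`. This only reads footnotes; writing them back would also require keeping the footnote references in the document body consistent.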

## Editing a Word document

In [5]:
# Add a paragraph
document.add_paragraph("Random paragraph")
# document.save("AA_edited.docx")

<docx.text.paragraph.Paragraph at 0x7f840f79a350>

In [6]:
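# Edit a paragraph
# Assigning to .text works, but it replaces the paragraph's content with a single run,
# so run-level formatting (bold, italics) is lost; the paragraph style is kept.
# Editing individual runs via paragraph.runs avoids that.
document.paragraphs[7].text = "sdkuhagjsdkfshdakj"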
document.save("AA_edited.docx")

In [7]:
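# Paragraphs can be inserted mid-document too: insert_paragraph_before()
# adds a new paragraph immediately before the one it is called on.
new_paragraph = document.paragraphs[7].insert_paragraph_before("Inserted paragraph")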

In [8]:
print(document.body)

AttributeError: 'Document' object has no attribute 'body'

## Citation fixer class

In [18]:
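class CitationFixer:
    """Prototype: marks placeholder "citation" spans and overwrites them."""

    def __init__(self):
        self.document = None
        # Each entry is (paragraph index, start, end), with end inclusive.
        self.citations = []

    def read_document(self, path):
        with open(path, 'rb') as f:
            source_stream = BytesIO(f.read())
        self.document = Document(source_stream)
        self.citations = []

    def get_citations(self):
        # Placeholder detection: pick one random 6-character span per paragraph.
        self.citations = []  # reset so repeated calls do not accumulate spans
        for i, paragraph in enumerate(self.document.paragraphs):
            text = paragraph.text
            if len(text) > 6:
                start = randint(0, len(text) - 6)
                end = start + 5
                self.citations.append((i, start, end))

    def fix_citations(self):
        # Placeholder fix: overwrite each span with 'a' characters of equal length.
        paragraphs = self.document.paragraphs
        for (i, start, end) in self.citations:
            text = paragraphs[i].text
            # Assigning .text collapses the paragraph into one run (run formatting is lost).
            paragraphs[i].text = text[:start] + 'a' * (end - start + 1) + text[end + 1:]

    def save_document(self, path):
        self.document.save(path)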

In [19]:
citation_fixer = CitationFixer()
citation_fixer.read_document("AA final paper copy.docx")
citation_fixer.get_citations()
citation_fixer.fix_citations()
citation_fixer.save_document("AA final paper copy_fixed.docx")
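The random spans chosen in `get_citations` are only a stand-in for real detection. As a rough sketch of what detection and replacement on a paragraph's text might look like, the snippet below assumes parenthesised author-year citations such as `(Simler 2012)`; the actual citation style of the paper may differ. `CITATION_RE`, `find_citation_spans`, and `replace_spans` are hypothetical names, and the spans use the same inclusive `(start, end)` convention as the class.

```python
import re

# Hypothetical pattern: "(Author 2012)", "(Author, 2012)" or "(Author et al. 2012)".
CITATION_RE = re.compile(r"\(([A-Z][A-Za-z'\-]+(?: et al\.)?),? (\d{4})\)")


def find_citation_spans(text):
    """Return (start, end) pairs, end inclusive, for each citation match."""
    return [(m.start(), m.end() - 1) for m in CITATION_RE.finditer(text)]


def replace_spans(text, spans, replacement):
    """Replace each inclusive (start, end) span with `replacement`.

    Spans are processed right to left so that the offsets of spans
    earlier in the string stay valid after each substitution.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end + 1:]
    return text


sample = "Startups colonize markets (Simler 2012) and need funding (Gallagher, 2020)."
spans = find_citation_spans(sample)
fixed = replace_spans(sample, spans, "[CITATION]")
```

Working right to left matters once a paragraph can hold more than one citation and the replacement differs in length from the original span; the current class sidesteps this by storing one span per paragraph and replacing it with text of equal length.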