# writing new files

### basic structure



In [47]:
import os
from bs4 import BeautifulSoup

In [48]:
text = "Bonjour world!"
with open('HelloWorld.txt', 'w') as f:
    f.write(text)

In [49]:
with open('HelloWorld.txt', 'r') as letuslook:
    print(letuslook.read())

Bonjour world!
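
One thing worth knowing about mode `'w'`: it empties the file first if it already exists, so rerunning the cell replaces the content rather than adding to it. A minimal sketch of the difference between `'w'` and `'a'`, run in a temporary folder with a made-up filename (`demo.txt`) so nothing real is touched:

```python
import os
import tempfile

# Work in a throwaway folder; 'demo.txt' is a made-up name for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # 'w' creates the file, or empties it first if it already exists.
    with open(path, "w") as f:
        f.write("first")
    with open(path, "w") as f:
        f.write("second")
    with open(path, "r") as f:
        after_w = f.read()

    # 'a' appends to whatever is already there.
    with open(path, "a") as f:
        f.write(" third")
    with open(path, "r") as f:
        after_a = f.read()

print(after_w)  # second
print(after_a)  # second third
```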


### one step more advanced: writing from a soup

In [50]:
with open('C:/Users/ASUS/Documents/iPython/data/DREaM/withXML/testcorpus/1698 1698 - T C - The New atlas or Travels.xml') as fDREAM:   
    soup = BeautifulSoup(fDREAM, "xml")
    # soup.TEXT selects the <TEXT> element, so get_text() returns only
    # the body of the work, i.e. everything between the opening and
    # closing TEXT tags, without the header.
    soupedtext = soup.TEXT.get_text()
    stringlist = soupedtext.split()
    finalstring = " ".join(stringlist)
    print(finalstring[:20])
    with open("NEWFILE.txt", "w") as newfile:
        newfile.write(finalstring)
    
    


THE NEW ATLAS: OR, T


In [68]:
with open("NEWFILE.txt", "r") as newfile:
    print(newfile.read()[:500])

THE NEW ATLAS: OR, Travels and Voyages IN Europe, Asia, Africa and America, Through the most Renowned Parts of the WORLD, VIZ. From England to the Dardanelles, thence to Constantinople, Egypt, Palestine, or the Holy Land, Syria, Mesopotamia, Child, Persia, East-India, China, Tartary, Muscovy, and by Poland; the German Empire, Flanders and Holland, to Spain and the West-Indies; with a brief Account of Aethiopia, and the Pilgrimages to Mecha and Medina in Arabia, containing what is Rare and Worthy


## Writing tcpid as the filename
This wasn't working before - let's see if I can put it together.

In [52]:
data_dir = 'C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\minicorpus\\'
listdata_dir = os.listdir(data_dir)
fullname = []
for name in listdata_dir:
    filename = data_dir+name
    fullname.append(filename)
print(fullname)

['C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\minicorpus\\1698 1698 - T C - The New atlas or Travels.xml']
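
Gluing the folder and filename together with `+` only works because `data_dir` ends in a path separator. A small sketch of the same listing step with `os.path.join`, which adds the separator itself; the helper name `list_xml_files` and the `.xml` filter are my additions, not something the cells above rely on:

```python
import os
import tempfile

def list_xml_files(folder):
    """Return the full paths of the .xml files in folder, sorted by name."""
    paths = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".xml"):
            # os.path.join inserts the separator, trailing slash or not
            paths.append(os.path.join(folder, name))
    return paths

# Demo on a temporary folder with made-up filenames.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.xml", "a.xml", "notes.txt"]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write("")
    found = [os.path.basename(p) for p in list_xml_files(tmp)]

print(found)  # ['a.xml', 'b.xml']
```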


### but first, a handy little function

It doesn't get rid of all the repeated code, but it does save some extra copying/pasting/typing. It collapses every run of whitespace in a string into a single space.

In [54]:
def fromsouptoplain(soupedtext):
    stringlist = soupedtext.split()
    finalstring = " ".join(stringlist)
    return finalstring
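
To see what the function does, here is the same logic run on a made-up sample string (the sample is mine, not taken from the corpus):

```python
def fromsouptoplain(soupedtext):
    # split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace
    stringlist = soupedtext.split()
    return " ".join(stringlist)

sample = "\n   THE NEW ATLAS:\n\tOR,   Travels  \n"
cleaned = fromsouptoplain(sample)
print(cleaned)  # THE NEW ATLAS: OR, Travels
```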
    

In [58]:
for file in fullname:
    with open(file) as f:
        soup = BeautifulSoup(f, "xml")
        
        # TCP ID number
        tcpid = soup.tcpid.get_text()
        finaltcpid = fromsouptoplain(tcpid)
        
        # Title of the work
        title = soup.title.get_text()
        finaltitle = fromsouptoplain(title)

        
        # Author of the work
        author = soup.author.get_text()
        finalauthor = fromsouptoplain(author)
        
        # date of the work
        date = soup.date.get_text()
        finaldate = fromsouptoplain(date)
        
        # souped text
        soupedtext = soup.TEXT.get_text()
        finaltext = fromsouptoplain(soupedtext)

        print(finaltcpid)
        print(finaltitle)
        print(finalauthor)
        print(finaldate)
        print('--------')
        
        
        # Let's try to write the new file, using finaltcpid as the filename.
        
        with open(finaltcpid+'.txt', "w") as newfile:
            newfile.write(finaltext)
    

A31298.hdr
The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever.
T. C.
1698.
--------
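
The TCP ID comes out as `A31298.hdr`, so the saved file ends up named `A31298.hdr.txt`. The cell above writes into the current working directory; a possible sketch of a helper that writes into a chosen output folder instead, creating it if needed. `save_text` and `out_dir` are hypothetical names, and passing `encoding="utf-8"` is my assumption about what the output should be:

```python
import os
import tempfile

def save_text(out_dir, name, text):
    """Write text to out_dir/name.txt, creating out_dir if it is missing."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name + ".txt")
    with open(path, "w", encoding="utf-8") as newfile:
        newfile.write(text)
    return path

# Demo in a temporary folder so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text(os.path.join(tmp, "testfiles"), "A31298.hdr", "THE NEW ATLAS")
    saved_name = os.path.basename(saved)
    with open(saved, "r", encoding="utf-8") as f:
        saved_content = f.read()

print(saved_name)  # A31298.hdr.txt
```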


## And now, with multiple files!

Let's see how well it works with three files, and with my if statements built in...

In [59]:
data_dir = 'C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\'
listdata_dir = os.listdir(data_dir)
fullname = []
for name in listdata_dir:
    filename = data_dir+name
    fullname.append(filename)
print(fullname)

['C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\1600 1600 - Constable Henry 1562-1613 - Discoverye of a counterfe-other.xml', 'C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\1601 1601 - Parry William fl 1601 - A new and large discourse.xml', 'C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\1698 1698 - T C - The New atlas or Travels.xml']


In [66]:
for file in fullname:
    with open(file) as f:
        soup = BeautifulSoup(f, "xml")
        
        # depending on whether the tags are upper case or lower case...
        
        if soup.find('TEXT'):
            # TCP ID number
            tcpid = soup.tcpid.get_text()
            finaltcpid = fromsouptoplain(tcpid)

            # Title of the work
            title = soup.title.get_text()
            finaltitle = fromsouptoplain(title)


            # Author of the work
            author = soup.author.get_text()
            finalauthor = fromsouptoplain(author)

            # date of the work
            date = soup.date.get_text()
            finaldate = fromsouptoplain(date)

            # souped text
            soupedtext = soup.TEXTS.get_text()
            finaltext = fromsouptoplain(soupedtext)


            # Let's try to write the new file, using finaltcpid as the filename.

            with open("testfiles/"+finaltcpid+'.txt', "w") as newfile:
                newfile.write(finaltext)
            
            print("This text was saved: " + finaltitle[:25] + "\n")

                
        #######################################
        
        elif soup.find('text'):
            # TCP ID number
            tcpid = soup.tcpid.get_text()
            finaltcpid = fromsouptoplain(tcpid)

            # Title of the work
            title = soup.title.get_text()
            finaltitle = fromsouptoplain(title)


            # Author of the work
            author = soup.author.get_text()
            finalauthor = fromsouptoplain(author)

            # date of the work
            date = soup.date.get_text()
            finaldate = fromsouptoplain(date)

            # souped text
            soupedtext = soup.texts.get_text()
            finaltext = fromsouptoplain(soupedtext)
            
            # Let's try to write the new file, using finaltcpid as the filename
            with open("testfiles/"+finaltcpid+'.txt', "w") as newfile:
                newfile.write(finaltext)
                
            print("This text was saved: " + finaltitle[:25] + "\n")
        
        else:
            # neither tag was found, so nothing was written for this file
            print("No <TEXT> or <text> tag found, nothing saved: " + file + "\n")
        
    

This text was saved: Discoverye of a counterfe

This text was saved: A new and large discourse

This text was saved: The New atlas, or, Travel
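
The two branches in the loop differ only in the name of the tag they read. A rough sketch of folding them into one helper, assuming `soup` is a BeautifulSoup object like the ones above; `first_tag_text` is a hypothetical name, and it relies only on `find()` and `get_text()`:

```python
def first_tag_text(soup, names):
    """Return the whitespace-normalised text of the first tag in names
    that exists in soup, or None if none of them is present.

    soup is assumed to be a BeautifulSoup object.
    """
    for name in names:
        tag = soup.find(name)
        if tag is not None:
            # same cleanup as fromsouptoplain
            return " ".join(tag.get_text().split())
    return None
```

Inside the loop this could be called as `finaltext = first_tag_text(soup, ["TEXT", "text"])`, with a check for `None` before writing the file. The tag names to pass in are whatever the corpus actually uses.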



In [70]:
with open('testfiles/A19224.hdr.txt', 'r') as letuslook:
    print(letuslook.read()[:500])

A DISCOVERY OF A COVNTERFECTE CONFERENCE held at a counterfecte place, by counterfecte travellers, for thadvancement of a counterfecte title , and invented, printed, and published by one (PERSON) that dare not avovve his name. Printed at Collen. 1600. TO THE AVCTOR OF the counterfeit confereÌ„ce &c. ITvvere as easy for meyf Ivvould to discover your name with assured proofs as to detect the devises and dristes of your conterfeat conference made at Amsterdam, but since as it seems you are ashamed 
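
The `Ì„` in `confereÌ„ce` looks like an encoding mismatch rather than something in the text itself: it is what the UTF-8 bytes of a combining macron (U+0304) turn into when they are decoded as Windows-1252. None of the `open()` calls above pass `encoding`, so Python falls back on the platform default, which on Windows is often cp1252. Whether that is what happened here is my assumption. A small sketch reproducing the effect on a made-up string:

```python
# 'e' followed by U+0304 COMBINING MACRON, as in early modern abbreviations
original = "confere\u0304ce"

# Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the two steps recovers the original string
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

If the XML files are UTF-8, passing `encoding="utf-8"` to `open()` both when reading them and when writing the `.txt` files would probably avoid this.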


# yaaaaaaaaaasssss

I could also save this with chunks of the metadata tags still in, but I'll have to see what the best method is.