## CSV experiments

In this notebook, I'm keeping track of some of my experiments with Python's csv writer.

The first thing is to get it to work!

In [49]:
import csv

Here is a basic structure, from the [Python documentation website](https://docs.python.org/3/library/csv.html). Note that I am using DictWriter so that I can have a header.

Also, because I'm using 'with', I don't have to close the file afterwards; it closes automatically when the block ends.

In [50]:
with open('names.csv', 'w') as csvfile:
    fieldnames = ['name', 'title', 'pubdate']
    writer = csv.DictWriter(csvfile, fieldnames = fieldnames)
    writer.writeheader()
    writer.writerow({'name': 'Richardson', 'title': 'Clarissa', 'pubdate': '1748'})

Let's see what it looks like!

In [51]:
with open('names.csv') as names:
    reader = csv.DictReader(names)
    for row in reader:
        print(row['name'], row['title'])

Richardson Clarissa


Okay, cool! So now, what if I want to add to the file? I made sure to use 'a' so I don't overwrite my file...

In [52]:
with open('names.csv', 'a') as csvfile:
    fieldnames = ['name', 'title', 'pubdate']
    writer = csv.DictWriter(csvfile, delimiter=',', fieldnames = fieldnames)
    writer.writerow({'name': 'Sterne', 'title': 'Sentimental Journey', 'pubdate': '1768'})

In [53]:
with open('names.csv') as names:
    reader = csv.DictReader(names)
    for row in reader:
        print(row['name'], row['title'], row['pubdate'])

Richardson Clarissa 1748
Sterne Sentimental Journey 1768


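Great - looks like using "a" worked well. However, I am getting whitespace between the rows, which isn't ideal; the most likely cause is that I opened the file without `newline=''`, which the csv documentation recommends. But for now, let's see if I can start to automate this using my Beautiful Soup info!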
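The extra whitespace between rows is probably a newline issue rather than a csv problem: the csv writer emits its own `\r\n` line endings, and on Windows a file opened in text mode without `newline=''` translates the `\n` a second time. Here is a minimal sketch of the recommended way to open the file. The name `names_demo.csv` is made up so that the sketch doesn't touch `names.csv`.

```python
import csv

fieldnames = ['name', 'title', 'pubdate']

# newline='' stops Python from translating the line endings that csv writes itself
with open('names_demo.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'name': 'Richardson', 'title': 'Clarissa', 'pubdate': '1748'})

# read the raw bytes back to check what the line endings look like
with open('names_demo.csv', 'rb') as raw:
    contents = raw.read()
print(contents)
```

If the bytes show a single `\r\n` at the end of each row, the doubled line endings are gone.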

In [54]:
from bs4 import BeautifulSoup

In [55]:
testFile = open('1698 1698 - T C - The New atlas or Travels.xml')

# for this testfile, I removed most of the body text.

In [56]:
soup = BeautifulSoup(testFile, 'xml')

In [57]:
print(soup)

<?xml version="1.0" encoding="utf-8"?>
<EEBO><dreamheader>
<tcpdata>
<tcpid phase="2">A31298.hdr</tcpid>
<fileDesc>
<titleStmt>
<title>The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever.</title>
<author>T. C.</author>
</titleStmt>
<publicationStmt>
<date>1698.</date>
<pubPlace>London :</pubPlace>
<publisher>Printed for J. Cleave ... and A. Roper ...,</publisher>
</publicationStmt>
<idno type="marc">12251058</idno>
<idno type="stc">Wing C139.</idno>
<idno type="stc">Arber's Term cat. III 138.</idno>
<idno type="vid">57084</idno>
<idno type="DLPS">A31298</idno>
</fileDesc>
</tcpdata>
<opendata>
<teiHeader>
<fileDesc>
<titleStmt>
<title>The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world</title>
<title type="alternative">The New atlas, or, Travels and voyages in

Let's see if I can get out the information that I want:

In [58]:
soup.title.string
# without the .string, it prints the <title> tags as well

"The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever."

In [59]:
soup.author.string

'T. C.'

In [60]:
soup.date.string

'1698.'

I can also strip the trailing period by keeping only the alphanumeric characters:

In [61]:
date = soup.date.string
newdate = "".join(char for char in date if char.isalnum())
print(newdate)

1698
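Keeping only alphanumeric characters works for `'1698.'`, but it would also glue together anything else that happens to be in the tag. As a hedged alternative, a regular expression can pull out just the year. This assumes the date tag always contains a four-digit year somewhere, which I haven't checked against the whole corpus. `extract_year` is a hypothetical helper name.

```python
import re

def extract_year(raw_date):
    """Return the first four-digit run in raw_date, or '' if there is none."""
    match = re.search(r'\d{4}', raw_date)
    return match.group(0) if match else ''

print(extract_year('1698.'))
```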


Now, let's see if I can integrate this back into my csv writer. Once I get that settled, I should be able to:

1. loop through the files with BS
2. for each title, author, and pubdate tag, add to the csv

What I don't know yet is how this will work later on: once I have parsed the files with Beautiful Soup and run, say, an analysis of the use of "I", how will I add the result to the proper row in the csv file? I suppose I will cross that bridge when I come to it.
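One possible way to cross that bridge, sketched with made-up values: read every row into memory, attach the new value to the row that matches on some key, and write the whole file out again. The file name `demo.csv`, the `i_count` column, and the counts themselves are all hypothetical, and matching on `title` assumes that titles are unique.

```python
import csv

# build a small stand-in file so the sketch runs on its own
with open('demo.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'title', 'pubdate'])
    writer.writeheader()
    writer.writerow({'name': 'Richardson', 'title': 'Clarissa', 'pubdate': '1748'})
    writer.writerow({'name': 'Sterne', 'title': 'Sentimental Journey', 'pubdate': '1768'})

# hypothetical analysis results, keyed by title
i_counts = {'Clarissa': 10, 'Sentimental Journey': 20}

# read all existing rows into a list of dicts
with open('demo.csv', newline='') as f:
    rows = list(csv.DictReader(f))

# attach the new value to each matching row; rows without a result get an empty cell
for row in rows:
    row['i_count'] = i_counts.get(row['title'], '')

# rewrite the file with the extra column in the header
with open('demo.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'title', 'pubdate', 'i_count'])
    writer.writeheader()
    writer.writerows(rows)
```

Rewriting the whole file is simple but only sensible while the csv is small enough to hold in memory.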

In [62]:
with open('names.csv') as names:
    reader = csv.DictReader(names)
    for row in reader:
        print(row['name'], row['title'], row['pubdate'])

Richardson Clarissa 1748
Sterne Sentimental Journey 1768


In [63]:
with open('names.csv', 'a') as csvfile:
    fieldnames = ['name', 'title', 'pubdate']
    writer = csv.DictWriter(csvfile, delimiter=',', fieldnames = fieldnames)
    writer.writerow({'name': soup.author.string, 'title': soup.title.string, 'pubdate': "".join(char for char in soup.date.string if char.isalnum())})

In [64]:
with open('names.csv') as names:
    reader = csv.DictReader(names)
    for row in reader:
        print(row['name'], row['title'], row['pubdate'])

Richardson Clarissa 1748
Sterne Sentimental Journey 1768
T. C. The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever. 1698


Huzzah! Let's see if I can write a simple for loop to use for two files!

In [65]:
import glob
import os

In [66]:
data_dir = 'C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\'

In [67]:
listdata_dir = os.listdir(data_dir)

In [68]:
listdata_dir

['1600 1600 - Constable Henry 1562-1613 - Discoverye of a counterfe-other.xml',
 '1601 1601 - Parry William fl 1601 - A new and large discourse.xml',
 '1698 1698 - T C - The New atlas or Travels.xml']

In [69]:
fullname = []
for name in listdata_dir:
    # os.path.join only adds a separator if data_dir doesn't already end with one
    filename = os.path.join(data_dir, name)
    fullname.append(filename)

In [70]:
print(fullname)

['C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\1600 1600 - Constable Henry 1562-1613 - Discoverye of a counterfe-other.xml', 'C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\1601 1601 - Parry William fl 1601 - A new and large discourse.xml', 'C:\\Users\\ASUS\\Documents\\iPython\\data\\DREaM\\withXML\\testcorpus\\1698 1698 - T C - The New atlas or Travels.xml']
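Since `glob` is already imported, another way to build the same kind of list is to let it match the `.xml` files and hand back full paths in one step. This is only a sketch: it assumes every file of interest ends in `.xml`, and it uses a temporary directory with empty stand-in files (hypothetical names) rather than the real corpus.

```python
import glob
import os
import tempfile

# temporary directory with empty stand-in files
demo_dir = tempfile.mkdtemp()
for name in ['a.xml', 'b.xml', 'notes.txt']:
    open(os.path.join(demo_dir, name), 'w').close()

# glob returns paths already joined to the directory; sorted() gives a stable order
xml_paths = sorted(glob.glob(os.path.join(demo_dir, '*.xml')))
print([os.path.basename(p) for p in xml_paths])
```

Unlike `os.listdir`, this skips anything in the folder that isn't an XML file.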


In [71]:
# here's a function to get the plain text without the weird spacing

def fromsouptoplain(soupedtext):
    stringlist = soupedtext.split()
    finalstring = " ".join(stringlist)
    return finalstring
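
As a quick check of what this function does: `split()` with no arguments breaks on any run of whitespace (spaces, tabs, newlines), so joining the pieces with single spaces collapses all of it. The sample string below is made up to imitate the indentation that comes out of the XML.

```python
def fromsouptoplain(soupedtext):
    # split on any whitespace, then rejoin with single spaces
    stringlist = soupedtext.split()
    return " ".join(stringlist)

messy = "Constable, Henry,\n\t\t 1562-1613.  "
print(fromsouptoplain(messy))
```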
    

In [72]:
# Let's print out our list as a before so we can see what it looks like...
with open('names.csv') as names:
    reader = csv.DictReader(names)
    for row in reader:
        print(row['name'], row['title'], row['pubdate'])

Richardson Clarissa 1748
Sterne Sentimental Journey 1768
T. C. The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever. 1698


In [73]:
with open('names.csv', 'a') as csvfile:
    fieldnames = ['name', 'title', 'pubdate']
    writer = csv.DictWriter(csvfile, delimiter=',', fieldnames = fieldnames)
    
    # for each file as listed in fullname:
    for file in fullname:
    
        # open the file
        with open(file) as f:

            # apply BeautifulSoup         
            
            soup = BeautifulSoup(f, "xml")
            
            # assign the variables; get_text() plus fromsouptoplain() removes
            # the newlines and tabs inside the tags
            author = soup.author.get_text()
            finalauthor = fromsouptoplain(author)
            # "".join(char for char in soup.author.string if char != '\n' and char != '\t' and char != '\r')
            
            title = soup.title.get_text()
            finaltitle = fromsouptoplain(title)
            
            # the date only needs its punctuation removed
            date = "".join(char for char in soup.date.string if char.isalnum())
                           
            # keep only the first 75 characters of the title
            writer.writerow({'name': finalauthor, 'title': finaltitle[:75], 'pubdate': date})
                           
            # also tried the code below, but it resulted in odd spacing and tabs!
            # writer.writerow({'name': soup.author.string, 'title': soup.title.string, 'pubdate': "".join(char for char in soup.date.string if char.isalnum())})
        
          

In [74]:
# Now let's print out our list again as an after, so we can see what was added...
with open('names.csv') as names:
    reader = csv.DictReader(names)
    for row in reader:
        print(row['name'], row['title'], row['pubdate'])

Richardson Clarissa 1748
Sterne Sentimental Journey 1768
T. C. The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America, thro' the most renowned parts of the world ... performed by an English gentleman, in nine years travel and voyages, more exact than ever. 1698
Constable, Henry, 1562-1613. Discoverye of a counterfecte conference helde at a counterfecte place, by c 1600
Parry, William, fl. 1601. A new and large discourse of the trauels of sir Anthony Sherley Knight, by  1601
T. C. The New atlas, or, Travels and voyages in Europe, Asia, Africa, and America 1698
