
### Information

Paige Haring

peh40@pitt.edu

September 1, 2017

**Name:** Genesis Corpus

**Authors:** N/A

**Download URL:** http://www.nltk.org/nltk_data/

**Makeup:** This corpus consists of 8 plain-text files, each of which is a translation of the Book of Genesis. The 8 translations are: English (King James Bible), English (World English Bible), Finnish, French, German, lolcat, Portuguese, and Swedish.

**License:** Public Domain

**Noteworthy:** Of these translations, all represent a spoken human language except for lolcat, which is essentially a meme language "spoken" by cats on the internet.

### Self-Assessment
**Summary:** My code reads in the text files of the Genesis corpus using nltk's PlaintextCorpusReader. At first, I had a lot of encoding problems. The nltk site where I downloaded the corpus didn't give any information about which encoding the files use. I kept getting an error with utf-8, so after some research on StackExchange I tried latin-1. That let me read in the files, but the characters displayed for Swedish and Portuguese weren't correct. I eventually found out that the files use three different encodings, and figured out how to choose the right one for each file when reading them in.

I then displayed some basic stats: the number of words and sentences in total and per file. I also printed the first sentence of each file to compare them. For my discovery question, I wanted to see how similar the vocabulary in each text was, so I looked at the five most frequent words in each file, excluding symbols and stopwords by filtering against the imported stopwords corpus. The only problem was that lolcat isn't represented in the stopwords corpus, so I couldn't filter out words like "teh" ("the"). Interestingly, Jacob appears in the top five for every text except the KJV Bible and lolcat, and with roughly similar frequencies (185, 138, 202, 159, 180, 156). I also thought it was pretty interesting how different the frequency distributions of the two English versions were from each other.

**Future Wish:** I'd like to learn how to align parallel data automatically to use for translation. That might not even be possible with this corpus despite it all being translations of Genesis, because each text file has a different number of sentences and words. I believe this is because not only are these bibles written by different authors in different languages, but they were also all written in different time periods. I would like to learn more about what makes parallel data "good" parallel data in general and if it would be possible to align this or make use of this corpus in some other way to aid in translation. I'd also like to learn what the correct way to handle my encoding problem is. If you have a dataset with a bunch of wacky characters, utf-8 can't handle it, and you don't know what was used, what is the best way to proceed?
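
The encoding trial and error described in the summary can be sketched as a small helper that decodes the same raw bytes with several candidate encodings, so the results can be compared by eye. This is only a sketch: `try_decodings` and the two sample words are made up for illustration, and the candidate list is simply the three encodings that turned out to be used in this corpus. One thing worth knowing is that latin-1 assigns a character to every possible byte, so decoding with it never raises an error; that is why it can appear to work while still showing the wrong characters.

```python
def try_decodings(raw, candidates=("utf_8", "cp865", "latin_1")):
    """Decode raw bytes with each candidate encoding.

    Returns a dict mapping each encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for enc in candidates:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Hypothetical samples: one word stored as UTF-8, one stored as cp865
utf8_bytes = "Jacó".encode("utf_8")
cp865_bytes = "få".encode("cp865")

for label, raw in [("utf-8 sample", utf8_bytes), ("cp865 sample", cp865_bytes)]:
    for enc, text in try_decodings(raw).items():
        # A None result means the encoding is ruled out; a garbled string
        # means it decoded without error but is probably the wrong choice.
        print(label, "|", enc, "->", repr(text))
```

With real files, the same idea could be applied by reading a file in binary mode and checking a line that is known to contain accented characters.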

In [1]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = 'data/genesis'

# The text files do not all use the same encoding, and it was hard to find out
# which file uses which. The mapping below comes from the code on this page:
# http://www.nltk.org/_modules/nltk/corpus.html

gens = PlaintextCorpusReader(corpus_root, r'.*\.txt', encoding=[
        ('finnish|french|german', 'latin_1'),
        ('swedish', 'cp865'),
        ('.*', 'utf_8')])

print('There are',len(gens.fileids()),'files in the corpus!\n')

for f in gens.fileids():
    print(f, 'has', len(gens.words(f)), 'words,', len(gens.sents(f)), 'sents')

print('There is a total of', len(gens.words()), 'words and', len(gens.sents()), 'sents in the entire corpus.\n')

print("\nThis is the first sentence of each file:")

for f in gens.fileids():
    print(f+':',' '.join(gens.sents(f)[0]).replace(' .', '.').replace(' ,', ','))



There are 8 files in the corpus!

english-kjv.txt has 44764 words, 1467 sents
english-web.txt has 44054 words, 2232 sents
finnish.txt has 32520 words, 2160 sents
french.txt has 46116 words, 2004 sents
german.txt has 43941 words, 1900 sents
lolcat.txt has 17074 words, 822 sents
portuguese.txt has 45094 words, 1669 sents
swedish.txt has 41705 words, 1386 sents
There is a total of 315268 words and 13640 sents in the entire corpus.


This is the first sentence of each file:
english-kjv.txt: In the beginning God created the heaven and the earth.
english-web.txt: In the beginning God created the heavens and the earth.
finnish.txt: Alussa Jumala loi taivaan ja maan.
french.txt: Au commencement, Dieu créa les cieux et la terre.
german.txt: Am Anfang schuf Gott Himmel und Erde.
lolcat.txt: Oh hai.
portuguese.txt: No princípio, criou Deus os céus e a terra.
swedish.txt: I begynnelsen skapade Gud himmel och jord.
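
The counts above give a rough sense of why sentence-level alignment could be hard, as the Future Wish section suspects: dividing words by sentences gives a different average sentence length for each file. The sketch below only redoes that arithmetic. The `counts` dictionary is copied by hand from the output above rather than read from the corpus, and the "words" are the reader's tokens, which appear to include punctuation marks.

```python
# (words, sentences) per file, copied by hand from the output above
counts = {
    "english-kjv.txt": (44764, 1467),
    "english-web.txt": (44054, 2232),
    "finnish.txt": (32520, 2160),
    "french.txt": (46116, 2004),
    "german.txt": (43941, 1900),
    "lolcat.txt": (17074, 822),
    "portuguese.txt": (45094, 1669),
    "swedish.txt": (41705, 1386),
}

# Average number of tokens per sentence for each file
avg_len = {f: words / sents for f, (words, sents) in counts.items()}

# Print from shortest to longest average sentence
for f, avg in sorted(avg_len.items(), key=lambda item: item[1]):
    print(f"{f:18s} {avg:5.1f} tokens per sentence")
```

Even the two English versions split the same book into very different numbers of sentences, so a one-to-one sentence pairing cannot be assumed.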


### Discovery Question
What are the five most frequent words in each text, and how much do these lists differ?

In [2]:
import nltk
# I want the five most frequent words excluding stopwords, so I'll bring in the stopwords corpus to filter against.
# Every language in the genesis corpus is represented in the stopwords corpus except, obviously, lolcat.
from nltk.corpus import stopwords

# Stopwords for all languages combined, built once instead of on every loop iteration
stops = set(stopwords.words())

# Each file's top-five list, reused in the NumPy cell below
top5 = []

for f in gens.fileids():
    fd = nltk.FreqDist(gens.words(f))
    # Keep alphanumeric tokens that are not stopwords. Scanning the full ranking
    # (not just the first 55 entries) avoids ending up with fewer than five words.
    common = [(x, y) for (x, y) in fd.most_common() if x.isalnum() and x.lower() not in stops]
    top5.append(common[:5])
    print(f, ':', common[:5])
 

english-kjv.txt : [('unto', 590), ('said', 476), ('thou', 272), ('thy', 267), ('thee', 257)]
english-web.txt : [('said', 470), ('father', 270), ('God', 229), ('land', 195), ('Jacob', 185)]
finnish.txt : [('sanoi', 324), ('Jumala', 180), ('Jaakob', 138), ('Herra', 123), ('Joosef', 114)]
french.txt : [('fils', 317), ('Dieu', 230), ('Jacob', 202), ('père', 199), ('pays', 182)]
german.txt : [('szlig', 677), ('sprach', 416), ('Gott', 207), ('h3', 200), ('Jakob', 159)]
lolcat.txt : [('teh', 508), ('2', 189), ('wuz', 171), ('Ceiling', 166), ('Cat', 154)]
portuguese.txt : [('terra', 349), ('Deus', 240), ('filhos', 209), ('pai', 200), ('Jacó', 180)]
swedish.txt : [('sade', 400), ('skall', 283), ('Gud', 211), ('Jakob', 156), ('fader', 139)]
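
One way to put a number on "how much do these lists differ" is the Jaccard overlap between two top-five lists: the size of their intersection divided by the size of their union. The sketch below applies it to the two English lists, copied by hand from the output above. `jaccard` is a helper name introduced here, and lowercasing the words before comparing is an assumption about what should count as the same word.

```python
def jaccard(a, b):
    """Jaccard overlap of two word collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty collections are treated as identical
    return len(a & b) / len(a | b)


# Top-five words for the two English files, copied from the output above
kjv = ["unto", "said", "thou", "thy", "thee"]
web = ["said", "father", "God", "land", "Jacob"]

kjv_set = {w.lower() for w in kjv}
web_set = {w.lower() for w in web}

shared = kjv_set & web_set
overlap = jaccard(kjv_set, web_set)

print("shared words:", shared)
print("Jaccard overlap:", round(overlap, 3))
```

The two lists share only "said", which is 1 of 9 distinct words. Comparing lists across languages would first need the words mapped to a common form, which this measure does not attempt.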


### NumPy??
What is the average frequency of the word Jacob in the texts?

In [3]:
import numpy

# startswith('Ja') picks up Jacob, Jaakob, Jakob, and Jacó. It feels a bit like cheating, but it seems simpler than a regex?
jake_freqs = [freq for lang in top5 for (word, freq) in lang if word.startswith('Ja')]
mean = numpy.mean(jake_freqs)

print(jake_freqs)
print('The average frequency of Jacob is:', mean)


[185, 138, 202, 159, 180, 156]
The average frequency of Jacob is: 170.0
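
An alternative to `startswith('Ja')` is to list the spellings explicitly, so that another word beginning with "Ja" could never slip into the average. The sketch below uses the four spellings that appear in the output above and a hand-copied `top5`. `JACOB_FORMS` is a name introduced here, and the standard library's `statistics.mean` stands in for `numpy.mean` only to keep the example free of dependencies.

```python
from statistics import mean

# Spellings of "Jacob" seen in the top-five output above
JACOB_FORMS = {"Jacob", "Jaakob", "Jakob", "Jacó"}

# Top-five lists copied by hand from the Discovery Question output
top5 = [
    [("unto", 590), ("said", 476), ("thou", 272), ("thy", 267), ("thee", 257)],
    [("said", 470), ("father", 270), ("God", 229), ("land", 195), ("Jacob", 185)],
    [("sanoi", 324), ("Jumala", 180), ("Jaakob", 138), ("Herra", 123), ("Joosef", 114)],
    [("fils", 317), ("Dieu", 230), ("Jacob", 202), ("père", 199), ("pays", 182)],
    [("szlig", 677), ("sprach", 416), ("Gott", 207), ("h3", 200), ("Jakob", 159)],
    [("teh", 508), ("2", 189), ("wuz", 171), ("Ceiling", 166), ("Cat", 154)],
    [("terra", 349), ("Deus", 240), ("filhos", 209), ("pai", 200), ("Jacó", 180)],
    [("sade", 400), ("skall", 283), ("Gud", 211), ("Jakob", 156), ("fader", 139)],
]

# Keep only the frequencies of words that are a known spelling of Jacob
jacob_freqs = [freq for lang in top5 for (word, freq) in lang if word in JACOB_FORMS]

print(jacob_freqs)
print("The average frequency of Jacob is:", mean(jacob_freqs))
```

The trade-off is that the set has to be maintained by hand, whereas the prefix test needs no knowledge of the spellings but accepts anything that starts with "Ja".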
