
### Information

Paige Haring

peh40@pitt.edu

September 1, 2017

**Name:** Genesis Corpus

**Authors:** N/A

**Download URL:** http://www.nltk.org/nltk_data/

**Makeup:** This corpus consists of 8 plain-text files of written data; each file is a translation of the Book of Genesis. The 8 translations are: English (King James Bible), English (World English Bible), Finnish, French, German, lolcat, Portuguese, and Swedish.

**License:** Public Domain

**Noteworthy:** Of these translations, all represent a spoken human language except for lolcat, which is essentially a meme language "spoken" by cats on the internet.

### Self-Assessment
**Summary:**

**Future Wish:** I'd like to learn how to align parallel data automatically for use in translation. That might not even be possible with this corpus, even though every file is a translation of Genesis, because each text file has a different number of sentences and words. I believe this is because the translations were produced not only by different translators in different languages, but also in different time periods. I would like to learn more about what makes parallel data "good" parallel data in general, and whether this corpus could be aligned or used in some other way to aid translation. I'd also like to learn the correct way to handle my encoding problem: if a dataset contains a bunch of unusual characters and utf-8 can't decode it, what is the best way to proceed?

In [53]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = 'data/genesis'

# The encoding is latin-1 because utf-8 couldn't read any of the files besides the two English ones and lolcat.
# I searched around, but I couldn't find the actual encoding for this corpus. latin-1 seemed to work best.
gens = PlaintextCorpusReader(corpus_root, r'.*\.txt', encoding='latin-1')

print('There are',len(gens.fileids()),'files in the corpus!')

for f in gens.fileids():
    print(f, 'has', len(gens.words(f)), 'words,', len(gens.sents(f)), 'sents')

print('There is a total of', len(gens.words()), 'words and', len(gens.sents()), 'sents in the entire corpus.')



There are 8 files in the corpus!
english-kjv.txt has 44764 words, 1467 sents
english-web.txt has 44054 words, 2232 sents
finnish.txt has 32520 words, 2160 sents
french.txt has 46116 words, 2004 sents
german.txt has 43941 words, 1900 sents
lolcat.txt has 17096 words, 821 sents
portuguese.txt has 51349 words, 1669 sents
swedish.txt has 55950 words, 1386 sents
There is a total of 335790 words and 13639 sents in the entire corpus.
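A single encoding for the whole corpus is not the only option: `PlaintextCorpusReader` also accepts a list of `(regexp, encoding)` pairs, matched against each file id in order. The sketch below assumes the per-file encodings that NLTK's bundled `nltk.corpus.genesis` loader is believed to declare (Latin-1 for Finnish, French, and German; cp865 for Swedish; UTF-8 for the rest). They have not been checked against the local copy in `data/genesis`, and `encoding_for` and `load_genesis` are hypothetical helper names.

```python
import re

# Assumed per-file encodings (believed to match NLTK's own genesis loader).
# Order matters: the first pattern that matches a file id wins.
GENESIS_ENCODINGS = [
    (r'finnish|french|german', 'latin_1'),
    (r'swedish', 'cp865'),
    (r'.*', 'utf_8'),
]

def encoding_for(fileid, rules=GENESIS_ENCODINGS):
    """Return the encoding of the first rule whose pattern matches the start of fileid."""
    for pattern, encoding in rules:
        if re.match(pattern, fileid):
            return encoding
    return None

def load_genesis(corpus_root='data/genesis'):
    """Build a reader with per-file encodings (requires NLTK and the corpus files)."""
    from nltk.corpus import PlaintextCorpusReader
    return PlaintextCorpusReader(corpus_root, r'.*\.txt', encoding=GENESIS_ENCODINGS)

# Show which encoding each rule set would pick for a few file ids.
for name in ['english-kjv.txt', 'german.txt', 'portuguese.txt', 'swedish.txt']:
    print(name, '->', encoding_for(name))
```

If the assumed encodings are right, `load_genesis()` could replace the single `encoding='latin-1'` reader; importing `genesis` from `nltk.corpus` directly would be another route, provided the corpus has been downloaded with NLTK's downloader.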


In [19]:
print(gens.readme())

Genesis Corpus

This corpus has been prepared from several web sources; formatting,
markup and verse numbers have been stripped.

CONTENTS

english-kjv.txt - Genesis, King James version (Project Gutenberg)
english-web.txt - Genesis, World English Bible (Project Gutenberg)
finnish.txt - Genesis, Suomen evankelis-luterilaisen kirkon kirkolliskokouksen vuonna 1992 käyttöön ottama suomennos
french.txt - Genesis, Louis Segond 1910
german.txt - Genesis, Luther Translation
lolcat.txt - Genesis, Lolcat version http://www.lolcatbible.com/
portuguese.txt - Genesis, Brazilian Portuguese version - http://www.bibliaonline.com.br
swedish.txt - Genesis, Gamla och Nya Testamentet, 1917 (Project Runeberg)



In [47]:
for f in gens.fileids():
    print(f+':',' '.join(gens.sents(f)[0]).replace(' .', '.').replace(' ,', ','))

# Note: there is an encoding issue with the Portuguese text (its accented letters come out as 'Ã' sequences).

english-kjv.txt: In the beginning God created the heaven and the earth.
english-web.txt: In the beginning God created the heavens and the earth.
finnish.txt: Alussa Jumala loi taivaan ja maan.
french.txt: Au commencement, Dieu créa les cieux et la terre.
german.txt: Am Anfang schuf Gott Himmel und Erde.
lolcat.txt: Oh hai.
portuguese.txt: No princÃ ­ pio, criou Deus os cÃ © us e a terra.
swedish.txt: I begynnelsen skapade Gud himmel och jord.
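The 'Ã' sequences in the Portuguese line are what UTF-8 text typically looks like when it is decoded as Latin-1. A minimal sketch of that effect and its reversal, assuming the intended sentence is the one below (reconstructed from the garbled output, not read from the file):

```python
# Assumed original sentence, reconstructed from the garbled output above.
original = 'No princípio, criou Deus os céus e a terra.'

# Encode as UTF-8, then decode those bytes as Latin-1: each accented letter
# turns into two characters, the first of which is 'Ã'.
garbled = original.encode('utf-8').decode('latin-1')

# Reversing the two steps recovers the text, because Latin-1 maps every byte
# to exactly one character and back.
repaired = garbled.encode('latin-1').decode('utf-8')

print(repr(garbled))
print(repaired)
```

The round trip only works while the garbled string is intact; once a tokenizer has split 'Ã' from the character after it, the original bytes are harder to recover.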


### Discovery Question
What are the five most frequent words in each text, and how much do these lists differ?

In [84]:
import nltk
# I want the five most frequent words excluding stopwords, so I'll bring in the stopwords corpus to filter them out.
from nltk.corpus import stopwords

# stopwords.words() with no argument returns the stopwords of every available language combined.
stops = set(stopwords.words())

for f in gens.fileids():
    fd = nltk.FreqDist(gens.words(f))
    # Look only at the 50 most frequent tokens, then drop punctuation and stopwords.
    common = [(x, y) for (x, y) in fd.most_common(50) if x.isalnum() and x.lower() not in stops]
    print(f, ':', common[:5])

# Clearly there are a lot of problems with encoding='latin-1': the characters aren't mapped to the correct symbols.

english-kjv.txt : [('unto', 590), ('said', 476), ('thou', 272), ('thy', 267), ('thee', 257)]
english-web.txt : [('said', 470), ('father', 270), ('God', 229), ('land', 195), ('Jacob', 185)]
finnish.txt : [('sanoi', 324), ('Jumala', 180), ('Jaakob', 138), ('Herra', 123), ('Joosef', 114)]
french.txt : [('fils', 317), ('Dieu', 230), ('Jacob', 202), ('père', 199), ('pays', 182)]
german.txt : [('szlig', 677), ('sprach', 416), ('Gott', 207), ('h3', 200), ('Jakob', 159)]
lolcat.txt : [('teh', 508), ('2', 189), ('wuz', 171), ('Ceiling', 166), ('Cat', 154)]
portuguese.txt : [('Ã', 571), ('terra', 349), ('Deus', 240), ('nÃ', 240), ('filhos', 209)]
swedish.txt : [('r', 1440), ('f', 1101), ('½', 498), ('sade', 400), ('p', 308)]
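The German tokens 'szlig' and 'h3' look like leftovers of the HTML entity `&szlig;` and of `<h3>` tags rather than real words. That is an inference from the counts above, not something verified against `german.txt`. If it holds, a cleanup pass before tokenizing might look like this sketch, where `strip_markup` is a hypothetical helper and the sample string is invented for illustration:

```python
import html
import re

def strip_markup(text):
    """Remove HTML tags, decode HTML entities, and collapse whitespace."""
    without_tags = re.sub(r'<[^>]+>', ' ', text)   # drops <h3>, </h3>, ...
    decoded = html.unescape(without_tags)          # turns &szlig; into ß
    return ' '.join(decoded.split())

# Invented sample line, not taken from the corpus.
sample = '<h3>Kapitel 1</h3>Gott sah, da&szlig; es gut war.'
print(strip_markup(sample))
```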


**Below is a different attempt to read in the corpus where I'm manually choosing the encoding. This seems like a really silly way to do it.**

In [None]:
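import glob
import os

files = sorted(glob.glob('data/genesis/*.txt'))
files_trimmed = [os.path.basename(f) for f in files]

# Manually chosen encodings: per the note above, only the two English files and
# lolcat read cleanly as utf-8, so every other file falls back to latin-1.
utf8_files = {'english-kjv.txt', 'english-web.txt', 'lolcat.txt'}

raw = []
for file, name in zip(files, files_trimmed):
    encoding = 'utf-8' if name in utf8_files else 'latin-1'
    with open(file, encoding=encoding) as f:
        raw.append(f.read())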
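One way to avoid hard-coding an encoding per file is to try strict decoders in order. This is only a sketch: `decode_with_fallback` is a hypothetical helper, and the candidate list is an assumption. UTF-8 is strict, so a successful UTF-8 decode is good evidence that the file really is UTF-8. Latin-1 accepts any byte sequence, so it works as a last resort but says nothing about whether the result is right.

```python
def decode_with_fallback(data, candidates=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes data without error."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings could decode the data')

# 'céus' stored as UTF-8 decodes on the first try.
print(decode_with_fallback('céus'.encode('utf-8')))
# 'créa' stored as Latin-1 is not valid UTF-8, so the fallback is used.
print(decode_with_fallback('créa'.encode('latin-1')))
```

For a file on disk, the bytes would come from `open(path, 'rb').read()`.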

In [110]:
stopwords.fileids()

['danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'kazakh',
 'norwegian',
 'portuguese',
 'russian',
 'spanish',
 'swedish',
 'turkish']
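Because the stopword lists are split by language, each Genesis file could be filtered with only its own language's list instead of all of them combined. A rough sketch, assuming the language is the part of the file name before the first '-' or '.'. `stopword_language` and `top_content_words` are hypothetical helpers, and `collections.Counter` stands in for `nltk.FreqDist` so the example runs without the corpus:

```python
import re
from collections import Counter

# The language lists reported by stopwords.fileids() above.
STOPWORD_LANGUAGES = ['danish', 'dutch', 'english', 'finnish', 'french', 'german',
                      'hungarian', 'italian', 'kazakh', 'norwegian', 'portuguese',
                      'russian', 'spanish', 'swedish', 'turkish']

def stopword_language(fileid, available=STOPWORD_LANGUAGES):
    """Map e.g. 'english-kjv.txt' to 'english'; return None if there is no such list."""
    language = re.split(r'[-.]', fileid)[0]
    return language if language in available else None

def top_content_words(tokens, stops, n=5):
    """Count alphanumeric, non-stopword tokens and return the n most frequent."""
    kept = [t for t in tokens if t.isalnum() and t.lower() not in stops]
    return Counter(kept).most_common(n)

print(stopword_language('english-kjv.txt'))   # english
print(stopword_language('lolcat.txt'))        # None

# Toy token list and stopword set, invented for illustration.
tokens = 'And God said , Let there be light : and there was light .'.split()
print(top_content_words(tokens, stops={'and', 'there', 'be', 'was'}, n=2))
```

With NLTK available, `stops` would be `set(stopwords.words(language))`. lolcat has no list of its own, so it would need a fallback, for example the English list.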