# Work with text in Pandas

This notebook shows how to load a corpus into Pandas.

We then perform some simple information extraction on the data.

In [1]:
import os
import pandas as pd

# Data files

In [2]:
VERSION = "0.1"
REPO_DIR = os.path.expanduser("~/github/annotation/mobydick")
PANDAS_DIR = f"{REPO_DIR}/pandas"
TEXT_DIR = f"{REPO_DIR}/txt"
TABLE_FILE_PD = f"{PANDAS_DIR}/data-{VERSION}.pd"
TABLE_FILE_TXT = f"{TEXT_DIR}/data-{VERSION}.txt"

# Create the output directory if it does not exist yet
os.makedirs(TEXT_DIR, exist_ok=True)

In [3]:
frame = pd.read_parquet(TABLE_FILE_PD, engine="pyarrow")
print("Done. Size={}".format(frame.size))

Done. Size=5935167


In [4]:
frame.shape

(219821, 27)
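
Note that `size` counts cells, not rows: it is the number of rows times the number of columns, which is why 219821 rows of 27 columns add up to 5935167. A minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative assumptions, not part of the corpus):

```python
import pandas as pd

# Toy frame (hypothetical data): 3 rows, 2 columns.
toy = pd.DataFrame({"nd": [1, 2, 3], "otype": ["word", "word", "chunk"]})

n_rows, n_cols = toy.shape
print(toy.shape)  # (rows, columns)
print(toy.size)   # rows * columns
```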

In [5]:
frame.head(10)

Unnamed: 0,nd,otype,after,str,in_chapter,in_chunk,anchored,chapter,chunk,empty,...,notAfter,rend,rend_cursive,rend_italic,rend_sc,scheme,status,target,type,when
0,213819,chapter,,,213819,,,TEI header,,,...,,,,,,,,,,
1,217164,fileDesc,,,213819,213960.0,,,,,...,,,,,,,,,,
2,213960,chunk,,,213819,213960.0,,TEI header,-1.0,,...,,,,,,,,,,
3,219819,titleStmt,,,213819,213960.0,,,,,...,,,,,,,,,,
4,219814,title,,,213819,213960.0,,,,,...,,,,,,,,,main,
5,1,word,,Moby,213819,213960.0,,,,,...,,,,,,,,,,
6,2,word,,Dick,213819,213960.0,,,,,...,,,,,,,,,,
7,213808,author,,,213819,213960.0,,,,,...,,,,,,,,,,
8,3,word,",",Melville,213819,213960.0,,,,,...,,,,,,,,,,
9,4,word,",",Herman,213819,213960.0,,,,,...,,,,,,,,,,


In [6]:
columnList = frame.columns.values.tolist()
columnList

['nd',
 'otype',
 'after',
 'str',
 'in_chapter',
 'in_chunk',
 'anchored',
 'chapter',
 'chunk',
 'empty',
 'empty_pb',
 'empty_pb_n',
 'id',
 'ident',
 'is_meta',
 'is_note',
 'n',
 'notAfter',
 'rend',
 'rend_cursive',
 'rend_italic',
 'rend_sc',
 'scheme',
 'status',
 'target',
 'type',
 'when']

# Chapters

Let us extract some data.
First, a list of the chapter titles.

In [7]:
chapters = frame[frame.otype == "chapter"].chapter

for chapter in chapters:
    print(chapter)

TEI header
2 div
Preliminary Matter.
4 titlePage
LOOMINGS
THE CARPET-BAG 
THE SPOUTER-INN
THE COUNTERPANE
BREAKFAST
THE STREET
THE CHAPEL
THE PULPIT
THE SERMON
A BOSOM FRIEND
NIGHTGOWN
BIOGRAPHICAL
WHEELBARROW
NANTUCKET
CHOWDER
THE SHIP
THE RAMADAN
HIS MARK
THE PROPHET
ALL ASTIR
GOING ABOARD
MERRY CHRISTMAS
THE LEE SHORE
THE ADVOCATE
POSTSCRIPT
KNIGHTS AND SQUIRES
KNIGHTS AND SQUIRES
AHAB
ENTER AHAB; TO HIM, STUBB
THE PIPE
QUEEN MAB
CETOLOGY
THE SPECKSYNDER
THE CABIN-TABLE
THE MAST-HEAD
THE QUARTER-DECK
SUNSET
DUSK
FIRST NIGHT-WATCH
MIDNIGHT, FORECASTLE
MOBY DICK
THE WHITENESS OF THE WHALE
HARK!
THE CHART
THE AFFIDAVIT
SURMISES
THE MAT-MAKER
THE FIRST LOWERING
THE HYENA
AHAB'S BOAT AND CREW. FEDALLAH
THE SPIRIT-SPOUT
THE ALBATROSS
THE GAM
THE TOWN-HO'S STORY
OF THE MONSTROUS PICTURES OF WHALES
OF THE LESS ERRONEOUS PICTURES OF WHALES, AND THE TRUE  PICTURES OF WHALING SCENES
OF WHALES IN PAINT; IN TEETH; IN WOOD; IN SHEET-IRON; IN  STONE; IN MOUNTAINS; IN STARS
BRIT
SQUID
THE LINE
STUB
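
The selection above relies on boolean indexing: the comparison `frame.otype == "chapter"` produces one True/False value per row, and indexing the frame with that mask keeps only the rows marked True. A small sketch of the same idea on a made-up miniature table (the frame `toy` and its node numbers are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature of the corpus table.
toy = pd.DataFrame(
    {
        "nd": [10, 1, 2, 11, 3],
        "otype": ["chapter", "word", "word", "chapter", "word"],
        "chapter": ["LOOMINGS", None, None, "THE CARPET-BAG", None],
    }
)

# The comparison yields one True/False per row ...
is_chapter = toy.otype == "chapter"

# ... and indexing with that mask keeps only the True rows.
chapter_names = toy[is_chapter].chapter.tolist()
print(chapter_names)
```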

# Text

Now the complete text of the whole book.

In [8]:
words = frame.loc[frame.otype == "word"]
text = words.str + words.after

# The text contains non-ASCII characters, so fix the encoding explicitly
with open(TABLE_FILE_TXT, "w", encoding="utf8") as pt:
    pt.write("".join(text).replace("\\n", "\n"))
    pt.write("\n")

In [9]:
!head {TABLE_FILE_TXT}

Moby DickMelville, Herman, 1819-1891creation of machine-readable versionTriggs, Jefferydeposited byTriggs, Jeffery North American Reading ProjectOxford University Press (NY)c/o Bellcore triggs@bellcore.com   University of Oxford Text Archive Oxford University Computing Services13 Banbury RoadOxfordOX2 6NN ota@oucs.ox.ac.uk http://ota.ox.ac.uk/id/3049110600048X9781106000484 Distributed by the University of Oxford under a Creative Commons Attribution-ShareAlike 3.0 Unported License Revised version of  ​Moby DickMelville, Herman, 1819-1891s.n.s.l.s.dOriginally transcribed and deposited by Prof. Eugene F. Irey, University of Colorado
University of Oxford Text Archive Subject HeadingsLibrary of Congress Subject Headings
  ​EnglishAmerican literature -- 19th century
Header normalised

 Born in New York City, the son of New England merchant. He worked at odd jobs (clerk, farmhand, teacher) before sailing to the South Seas on the whaler Acushnet. He deserted his ship, lived among cannibals, mu
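
The text is rebuilt by gluing each word (`str`) to the material that follows it (`after`) and joining everything into one string. The `replace` call in the cell above suggests that line breaks are stored as a literal backslash followed by `n`; the sketch below assumes exactly that. The frame `toy_words` and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical word rows: `str` holds the word, `after` what follows it.
# ".\\n" is a dot, a backslash and the letter n (an escaped line break).
toy_words = pd.DataFrame(
    {
        "str": ["Call", "me", "Ishmael"],
        "after": [" ", " ", ".\\n"],
    }
)

# Element-wise concatenation, then one join over the whole column.
# Bracket access sidesteps any confusion with the Series `.str` accessor.
pieces = toy_words["str"] + toy_words["after"]
toy_text = "".join(pieces).replace("\\n", "\n")
print(toy_text)
```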

# Drill down to a passage

Let us get the words from the first chunk.

In [10]:
firstChunk = 213960

In [11]:
wordIds = frame[(frame.otype == "word") & (frame.in_chunk == firstChunk)].nd
print(wordIds.values)

<IntegerArray>
[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57,
 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76,
 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
 96, 97, 98, 99]
Length: 99, dtype: Int64


Now the *text* of the first chunk.

In [12]:
words = frame[(frame.otype == "word") & (frame.in_chunk == firstChunk)]
text = words.str + words.after
print(("".join(text)).replace("\\n", "\n"))

Moby DickMelville, Herman, 1819-1891creation of machine-readable versionTriggs, Jefferydeposited byTriggs, Jeffery North American Reading ProjectOxford University Press (NY)c/o Bellcore triggs@bellcore.com   University of Oxford Text Archive Oxford University Computing Services13 Banbury RoadOxfordOX2 6NN ota@oucs.ox.ac.uk http://ota.ox.ac.uk/id/3049110600048X9781106000484 Distributed by the University of Oxford under a Creative Commons Attribution-ShareAlike 3.0 Unported License Revised version of  ​Moby DickMelville, Herman, 1819-1891s.n.s.l.s.dOriginally transcribed and deposited by Prof. Eugene F. Irey, University of Colorado



Let us get the words and text of an arbitrary passage.

First the id of the chunk (i.e. the Text-Fabric node number):

In [13]:
chapterHead = "THE BLACKSMITH"
chunkHead = 6

chunk_id = frame[
    (frame.otype == "chunk")
    & (frame.chapter == chapterHead)
    & (frame.chunk == chunkHead)
].nd.iloc[0]
print(chunk_id)

216067
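
The lookup above ends in `.iloc[0]`, which raises an `IndexError` when no row matches, for instance when the chapter title is misspelled. A hedged sketch of a more forgiving variant follows; the helper name `find_chunk` and the toy frame are assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the chunk rows in the table.
toy = pd.DataFrame(
    {
        "nd": [100, 101, 102],
        "otype": ["chunk", "chunk", "chunk"],
        "chapter": ["LOOMINGS", "LOOMINGS", "THE CHAPEL"],
        "chunk": [1, 2, 1],
    }
)


def find_chunk(table, chapter, chunk):
    """Return the node of the chunk, or None if there is no such chunk."""
    hits = table[
        (table.otype == "chunk")
        & (table.chapter == chapter)
        & (table.chunk == chunk)
    ].nd
    return None if hits.empty else int(hits.iloc[0])


print(find_chunk(toy, "LOOMINGS", 2))
print(find_chunk(toy, "NO SUCH CHAPTER", 1))
```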


Now the word ids of that chunk:

In [14]:
words = frame[(frame.otype == "word") & (frame.in_chunk == chunk_id)]
print(words.nd.values)

<IntegerArray>
[181894, 181895, 181896, 181897, 181898, 181899, 181900, 181901, 181902,
 181903, 181904, 181905, 181906, 181907, 181908, 181909, 181910, 181911,
 181912, 181913, 181914, 181915, 181916, 181917, 181918, 181919, 181920,
 181921, 181922, 181923, 181924, 181925, 181926, 181927, 181928, 181929,
 181930, 181931, 181932, 181933, 181934, 181935, 181936, 181937, 181938,
 181939, 181940, 181941, 181942, 181943, 181944, 181945, 181946, 181947,
 181948, 181949, 181950, 181951, 181952, 181953, 181954, 181955, 181956,
 181957, 181958, 181959, 181960, 181961, 181962, 181963, 181964, 181965,
 181966, 181967, 181968, 181969, 181970, 181971, 181972, 181973, 181974,
 181975, 181976, 181977, 181978, 181979, 181980, 181981, 181982, 181983,
 181984, 181985, 181986, 181987, 181988, 181989, 181990, 181991]
Length: 98, dtype: Int64


And, finally, the text of those words.

In [15]:
text = words.str + words.after
print(("".join(text)).replace("\\n", "\n"))

Why tell the whole? The blows of the basement hammer every day grew more and more between; and each blow every day grew fainter than the last; the wife sat frozen at the window, with tearless eyes, glitteringly gazing into the weeping faces of her children; the bellows fell; the forge choked up with cinders; the house was sold; the mother dived down into the long church-yard grass; her children twice followed her thither; and the houseless, familyless old man staggered off a vagabond in crape; his every woe unreverenced; his grey head a scorn to flaxen curls! 



Now let us organize this into a few functions: two that return the node of a chunk or a chapter given a passage, one that returns the text of the words in a given object, and two wrappers that go straight from a passage to its text.

In [16]:
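def object2text(nd):
    """Return the text of all words contained in the object with node `nd`.

    Works for object types that have an `in_<otype>` column (chapter, chunk).
    """
    otype = frame[frame.nd == nd].otype.iloc[0]
    # The containment column is named after the object type, e.g. "in_chunk"
    inotype = "in_" + otype
    words = frame[(frame.otype == "word") & (frame[inotype] == nd)]
    text = words.str + words.after
    return ("".join(text)).replace("\\n", "\n")


def chunk2object(chapter, chunk):
    """Return the node of the chunk with number `chunk` in `chapter`."""
    return frame[
        (frame.otype == "chunk")
        & (frame.chapter == chapter)
        & (frame.chunk == chunk)
    ].nd.iloc[0]


def chunk2text(chapter, chunk):
    """Return the text of the chunk with number `chunk` in `chapter`."""
    return object2text(chunk2object(chapter, chunk))


def chapter2object(chapter):
    """Return the node of the chapter with title `chapter`."""
    return frame[
        (frame.otype == "chapter") & (frame.chapter == chapter)
    ].nd.iloc[0]


def chapter2text(chapter):
    """Return the text of the chapter with title `chapter`."""
    return object2text(chapter2object(chapter))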
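
The generic part of this design is the derivation of the containment column from the object type: a chunk node is looked up in `in_chunk`, a chapter node in `in_chapter`. The sketch below replays that mechanism on a made-up miniature table; the frame `toy`, its node numbers, and the name `toy_object2text` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature table: one chapter (node 20) with two chunks (10, 11).
toy = pd.DataFrame(
    {
        "nd": [20, 10, 1, 2, 11, 3],
        "otype": ["chapter", "chunk", "word", "word", "chunk", "word"],
        "str": ["", "", "Aye", "aye", "", "sir"],
        "after": ["", "", ", ", "! ", "", "."],
        "in_chapter": [20, 20, 20, 20, 20, 20],
        "in_chunk": [None, 10, 10, 10, 11, 11],
    }
)


def toy_object2text(table, nd):
    # Look up the type of the node, e.g. "chunk" ...
    otype = table[table.nd == nd].otype.iloc[0]
    # ... and derive the containment column from it, e.g. "in_chunk".
    words = table[(table.otype == "word") & (table["in_" + otype] == nd)]
    return "".join(words["str"] + words["after"])


print(toy_object2text(toy, 10))  # words of the first chunk
print(toy_object2text(toy, 20))  # words of the whole chapter
```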

In [17]:
print(chunk2text(chapterHead, chunkHead))

Why tell the whole? The blows of the basement hammer every day grew more and more between; and each blow every day grew fainter than the last; the wife sat frozen at the window, with tearless eyes, glitteringly gazing into the weeping faces of her children; the bellows fell; the forge choked up with cinders; the house was sold; the mother dived down into the long church-yard grass; her children twice followed her thither; and the houseless, familyless old man staggered off a vagabond in crape; his every woe unreverenced; his grey head a scorn to flaxen curls! 



In [18]:
chText = chapter2text(chapterHead)
print(chText[0:500])
print("...")
print(chText[-500:])

THE BLACKSMITHAvailing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do 
...
other life without the guilt of intermediate death; here are wonders supernatural, without dying for them. Come hither! bury thyself in a life which, to your now equally abhorred and abhorring, landed world, is more oblivious than death. Come hither! put up thy grave-stone, too, within the churchyard, and come hither, till we marry thee! 
Hearkening to these voices, East and West, by early sun-rise, and by fall of eve, the blacksmith's soul responded, Aye, I come! And so Perth went a-whalin

# Bi-grams

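We make a list of chunk-bound bi-grams of words. Each word is paired with the word that precedes it in the same chunk, with the current word written first. The two words are separated by an underscore `_`; the first word of a chunk has no predecessor and is paired with an empty string.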

In [19]:
chunkHere = frame[frame.otype == "word"].in_chunk
chunkPrev = frame[frame.otype == "word"].in_chunk.shift(1)
word = frame[frame.otype == "word"].str
# shift(1) moves every value one row down, so this holds the preceding word
wordPrev = frame[frame.otype == "word"].str.shift(1)

In [20]:
# A word whose predecessor lies in another chunk is the first word of its chunk
firstInChunk = chunkPrev != chunkHere
wordPrev[firstInChunk] = ""

In [21]:
bigram = ["{}_{}".format(*p) for p in zip(word, wordPrev)]

In [22]:
bigram[10_000:10_030]

['I_bed',
 'ran_I',
 'up_ran',
 'to_up',
 'him_to',
 'Don_',
 't_Don',
 'be_t',
 'afraid_be',
 'now_afraid',
 'said_now',
 'he_said',
 'grinning_he',
 'again_grinning',
 'Queequeg_again',
 'here_Queequeg',
 'wouldn_here',
 't_wouldn',
 'harm_t',
 'a_harm',
 'hair_a',
 'of_hair',
 'your_of',
 'head_your',
 'Stop_',
 'your_Stop',
 'grinning_your',
 'shouted_grinning',
 'I_shouted',
 'and_I']
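
In the output above each word is paired with the word before it, because `shift(1)` moves values one row down. If forward-looking bi-grams are wanted instead (a word followed by its successor), shifting by `-1` within each chunk is one option. The following is a sketch on made-up data, not a rerun on the corpus; `toy_words`, `next_word` and `forward_bigrams` are hypothetical names.

```python
import pandas as pd

# Hypothetical word rows spread over two chunks.
toy_words = pd.DataFrame(
    {
        "str": ["up", "to", "him", "Don", "t", "be"],
        "in_chunk": [1, 1, 1, 2, 2, 2],
    }
)

# shift(-1) per chunk pulls the following word up one row;
# the last word of each chunk has no successor and gets "".
next_word = toy_words.groupby("in_chunk")["str"].shift(-1).fillna("")

forward_bigrams = [f"{w}_{n}" for w, n in zip(toy_words["str"], next_word)]
print(forward_bigrams)
```

Grouping by `in_chunk` keeps the shift from crossing chunk boundaries, so no separate boundary mask is needed.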