## Overview of Initial Exploration

The goal of this notebook is to explore the corpora that I have cleaned by hand, with further cleaning and feature engineering done in Python. I want a sense of the texts both as data structures and as texts.

I will work on one text to get all the code squared away, importing the other corpora as necessary. The modeling will require that each corpus be in its own dataframe, split into chunks as necessary. Yikes. There is more to do than I thought.

Twain = 1 
Wilde = 2 
Lincoln = 3 
D_Twain = 10 
D_Wilde = 20 
D_Lincoln = 30
Modern = 100

In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import codecs
import nltk
import itertools

%matplotlib inline

In [2]:
corpus_root = '../corpora/'

In [3]:
t = open(corpus_root +'_CLEAN_Twain.txt', 'rU')

In [4]:
raw_t = t.read()

In [None]:
raw_t[0:200]

In [None]:
len(raw_t)
# 14,712,397 characters, which presumably includes the \n's

In [5]:
## New line is as much a function of printing as anything else.
## The newline characters need to be removed in order to have a clean-ish dataset.
clean_t = raw_t.replace("\n", " ")
chars_t = len(clean_t)
## 14,712,397 characters, post cleaning.

In [None]:
clean_t[:200]

In [6]:
words_t = nltk.word_tokenize(clean_t)

In [85]:
wc_t = len(words_t)
wc_t
# 3,143,540 word and punctuation tokens after replacing the newline character with " ". We will ultimately be looking at
# n-grams for LSI/LSA, so the punctuation is of use.

3143540

In [None]:
words_t[:10]

In [7]:
sents_t = nltk.sent_tokenize(clean_t)

In [None]:
sents_t[:10]

In [8]:
sc_t = len(sents_t)
sc_t

136199

In [86]:
twain_counts = ['Mark Twain', chars_t, wc_t, sc_t]

In [None]:
twain_counts

I would like to have a presentation dataset for my three writers containing the name, code, and char, word, and sentence count for each corpus.
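
A counting helper along these lines could produce one such row per corpus. This is only a sketch: `corpus_counts`, `simple_words`, and `simple_sents` are hypothetical names, and the default tokenizers are crude stand-ins. In this notebook `nltk.word_tokenize` and `nltk.sent_tokenize` would be passed in instead, so the numbers from the defaults will not match the NLTK counts.

```python
import re

def simple_words(text):
    # Crude stand-in for nltk.word_tokenize: words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sents(text):
    # Crude stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def corpus_counts(name, raw, word_tok=simple_words, sent_tok=simple_sents):
    # Newlines become spaces, so the character count is unchanged by cleaning.
    clean = raw.replace("\n", " ")
    return [name, len(clean), len(word_tok(clean)), len(sent_tok(clean))]

sample = "It was a bright day.\nWas it not? Indeed!"
row = corpus_counts("Sample", sample)
```

Passing the tokenizers as arguments keeps the helper independent of how the file was opened and decoded, which is where the earlier attempt ran into trouble.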

In [9]:
c = open(corpus_root + '_CLEAN_Lincoln.txt', 'rU')

In [None]:
# # An attempt at writing a function to manually clean up the files. It didn't work because of encoding difficulties.
# clean_text = []
# def text_clean(text):
#     raw = text.read()
#     clean = raw.replace("\n", " ")
#     chars = len(clean)
#     clean_text.append(chars)
#     words = nltk.word_tokenize(clean)
#     wc = len(words)
#     clean_text.append(wc)
#     sents = nltk.sent_tokenize(clean)
#     sc = len(sents)
#     clean_text.append(sc)

In [10]:
raw_c = c.read()

In [11]:
clean_c = raw_c.replace("\n", " ")

In [87]:
chars_c = len(clean_c)

In [12]:
words_c = nltk.word_tokenize(clean_c)

In [88]:
wc_c = len(words_c)

In [13]:
sents_c = nltk.sent_tokenize(clean_c)

In [89]:
sc_c = len(sents_c)

In [90]:
lincoln_counts = ["Abraham Lincoln", chars_c, wc_c, sc_c]

St. Oscar of Wilde posed a unique problem insofar as the encodings were not the same across the documents I used to create the corpus. A great deal of work went into first figuring out what had to be done and then doing it. The first block of code represents about a day's worth of work trying to solve the problem. The second represents the successful approach. If we fail in this, we shall trace it back to Saturday, 15 October AD 2016.

In [None]:
## The unsuccessful approach
# import re
# w1 = codecs.open('../Wilde/_USED/America.txt', encoding='utf-8')
# w2 = codecs.open('../Wilde/_USED/Aphorisms.txt')
# ISO-646-US (US-ASCII)
# raw_w = w1.read()
# raw_w[:100]
# raw_2 = w2.read()
# raw_2[:100]
# w3 = codecs.open('../Wilde/_USED/Essays and Lectures.txt', encoding='UTF-8-sig', errors='replace')
# raw_3 = w3.read()
# raw_3[:100]

In [14]:
# The successful approach
w = codecs.open(corpus_root + '_CLEAN_Wilde.txt', encoding='UTF-8-sig', errors='replace' )
w_raw = w.read()
w_raw = w_raw.encode('ascii', 'replace')

In [130]:
clean_w = w_raw.replace("?s", "'s")
clean_w = clean_w.replace("\r", "")
clean_w = clean_w.replace("\n", " ")
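
To see why the `?s` replacement is needed, here is a small sketch on made-up bytes rather than the real corpus. It assumes Python 3 string semantics (the notebook itself runs on Python 2, where `encode` returns a plain `str`), and `raw_bytes` is a hypothetical sample: a UTF-8 file with a byte-order mark and a curly apostrophe.

```python
# Hypothetical sample bytes: UTF-8 with a BOM, a curly apostrophe, and a Windows line ending.
raw_bytes = b"\xef\xbb\xbf" + u"It\u2019s fine\r\nindeed".encode("utf-8")

# 'utf-8-sig' strips the BOM; errors='replace' keeps decoding past bad bytes.
text = raw_bytes.decode("utf-8-sig", errors="replace")

# Forcing ASCII turns every non-ASCII character into '?'.
ascii_text = text.encode("ascii", "replace").decode("ascii")

# Same clean-up steps as the cell above.
clean = ascii_text.replace("?s", "'s").replace("\r", "").replace("\n", " ")
```

One caveat with this approach: a genuine question mark that happens to sit directly before an "s" is rewritten as well, so the fix trades a small amount of noise for consistent possessives.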

In [91]:
chars_w = len(clean_w)

In [16]:
words_w = nltk.word_tokenize(clean_w)

In [92]:
wc_w = len(words_w)

In [17]:
sents_w = nltk.sent_tokenize(clean_w)

In [93]:
sc_w = len(sents_w)

In [94]:
wilde_counts = ['Oscar Wilde', chars_w, wc_w, sc_w]

And the modern text:

In [648]:
m = codecs.open(corpus_root + 'mod_text.txt', encoding='utf-8-sig', errors='replace')

In [649]:
m_raw = m.read()
m_raw = m_raw.encode('ascii', 'replace')
#m_raw

In [650]:
import re

In [651]:
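# str.replace matches its argument literally, so stripping runs of two or more '?' needs re.sub.
# The counts recorded below were produced with the original literal call and may shift slightly.
clean_m = re.sub(r"\?{2,}", "", m_raw)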
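
The difference between literal and pattern replacement matters for a step like this one. A minimal sketch on a made-up string (`sample` is hypothetical):

```python
import re

sample = "What??? No way?? Really?"

# str.replace looks for the literal characters \?{2,} and finds none.
literal = sample.replace(r"\?{2,}", "")

# re.sub treats the argument as a regular expression: two or more '?' in a row.
pattern = re.sub(r"\?{2,}", "", sample)
```

Here `literal` comes back unchanged, while `pattern` loses the runs of question marks but keeps the single one at the end.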

In [652]:
type(clean_m)

str

In [653]:
chars_m = len(clean_m)
chars_m

362171

In [654]:
words_m = nltk.word_tokenize(clean_m)

In [655]:
wc_m = len(words_m)
wc_m

69327

In [657]:
sents_m = nltk.sent_tokenize(clean_m)
sc_m = len(sents_m)
sc_m

5174

In [659]:
m_1k_H = int(chars_m/1000)
m_1k_W = int(wc_m/1000)
m_1k_S = int(sc_m/1000)

m_500_H = int(chars_m/500)
m_500_W = int(wc_m/500)
m_500_S = int(sc_m/500)

m_100_H = int(chars_m/100)
m_100_W = int(wc_m/100)
m_100_S = int(sc_m/100)

In [663]:
st_m = ['clean_m', 'words_m', 'sents_m']*3
leng_m = [m_1k_H, m_1k_W, m_1k_S, m_500_H, m_500_W, m_500_S, m_100_H, m_100_W, m_100_S]
labels_m = ["m_1k_H", "m_1k_W", "m_1k_S", "m_500_H", "m_500_W", "m_500_S", "m_100_H", "m_100_W", "m_100_S"]

## On to the Good Stuff

I would like to have a simple presentation dataframe to keep track of the various counts here. Indices would also serve as the decoding codes for the authors, as they'll have to be decoded eventually.

In [106]:
counts_list = [twain_counts, wilde_counts, lincoln_counts]

In [107]:
# Copy the per-author rows into a fresh list
authors = list(counts_list)

In [None]:
authors

In [108]:
a_df = pd.DataFrame(authors, columns=['Name', 'char_count', 'word_count', 'sentence_count'], index=[1, 2, 3])

In [109]:
a_df

| | Name | char_count | word_count | sentence_count |
|---|---|---|---|---|
| 1 | Mark Twain | 14712397 | 3143540 | 136199 |
| 2 | Oscar Wilde | 2856299 | 595302 | 38528 |
| 3 | Abraham Lincoln | 2646227 | 531241 | 17649 |


In [112]:
# a_df.to_csv('author_counts.csv', sep='|', encoding='utf-8')

A question presents itself: since we are dealing with three corpora of significantly different lengths, does it make sense to look at absolute or relative lengths of strings? That is, a fixed length that is the same across corpora, or a percentage of the corpus at hand?

My hunch would be that a relative proportion of characters would work best for identifying, but we are interested in short strings, thus strings of absolute rather than relative length.
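
The trade-off can be made concrete with the character counts from the table above. This is a sketch only; `chunks_for_fixed_length` and `length_for_fixed_count` are hypothetical helper names, and the value 1000 is just an example setting.

```python
def chunks_for_fixed_length(total, chunk_len):
    # Absolute: every chunk has the same length, so longer corpora yield more chunks.
    return -(-total // chunk_len)   # ceiling division

def length_for_fixed_count(total, n_chunks):
    # Relative: every corpus yields about n_chunks chunks, so chunk length varies.
    return total // n_chunks

char_counts = {"Twain": 14712397, "Wilde": 2856299, "Lincoln": 2646227}

absolute = {k: chunks_for_fixed_length(v, 1000) for k, v in char_counts.items()}
relative = {k: length_for_fixed_count(v, 1000) for k, v in char_counts.items()}
```

With a fixed length of 1,000 characters, Twain contributes roughly five times as many chunks as either of the others; with a fixed count of 1,000 chunks, a Twain chunk is roughly five times as long.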

The built-in NLP libraries handle a lot of the tokenizing, but we do need to figure out the correct n-gram level for distinguishing among these three writers. The thinking, really, is to generate a range of n-gram lengths and run a logistic regression on each to see which length classifies the authors best. (This is because we're looking to evaluate substrings of various lengths, rather than full corpora.)

In [None]:
# For reference, the storage variables for the pieces of these documents:
# Twain:
# clean_t = characters
# words_t = words, tokenized
# sents_t = sentences, tokenized
###
# Wilde:
# clean_w = characters
# words_w = words, tokenized
# sents_w = sentences, tokenized
###
# Lincoln:
# clean_c = characters
# words_c = words, tokenized
# sents_c = sentences, tokenized

In [489]:
a_df['char_len'] = (a_df['char_count']/1000)
a_df['word_len'] = a_df['word_count']/1000
a_df['sent_len'] = a_df['sentence_count']/1000

In [505]:
a_df['char_len_D'] = a_df['char_count']/500
a_df['word_len_D'] = a_df['word_count']/500
a_df['sent_len_D'] = a_df['sentence_count']/500

In [508]:
a_df['char_len_C'] = a_df['char_count']/100
a_df['word_len_C'] = a_df['word_count']/100
a_df['sent_len_C'] = a_df['sentence_count']/100

In [509]:
a_df.head()

| | Name | char_count | word_count | sentence_count | char_len | word_len | sent_len | char_len_D | word_len_D | sent_len_D | char_len_C | word_len_C | sent_len_C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mark Twain | 14712397 | 3143540 | 136199 | 14712.397 | 3143.54 | 136.199 | 29424.794 | 6287.08 | 272.398 | 147123.97 | 31435.4 | 1361.99 |
| 2 | Oscar Wilde | 2856299 | 595302 | 38528 | 2856.299 | 595.302 | 38.528 | 5712.598 | 1190.604 | 77.056 | 28562.99 | 5953.02 | 385.28 |
| 3 | Abraham Lincoln | 2646227 | 531241 | 17649 | 2646.227 | 531.241 | 17.649 | 5292.454 | 1062.482 | 35.298 | 26462.27 | 5312.41 | 176.49 |


In [510]:
leng_t = a_df.iloc[0,4:13].values

In [None]:
l_w = a_df.iloc[1,4:13].values

In [519]:
l_c = a_df.iloc[2,4:13].values

In [512]:
length_t = []
for i in leng_t:
    length_t.append(int(i))


In [514]:
length_w = []
for i in l_w:
    length_w.append(int(i))

In [520]:
length_c = []
for i in l_c:
    length_c.append(int(i))

In [513]:
length_t

[14712, 3143, 136, 29424, 6287, 272, 147123, 31435, 1361]

In [516]:
length_w

[2856, 595, 38, 5712, 1190, 77, 28562, 5953, 385]

In [521]:
length_c

[2646, 531, 17, 5292, 1062, 35, 26462, 5312, 176]
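
The same nine lengths per author can be derived straight from the three raw counts, without going through the dataframe columns. A sketch, assuming the counts reported earlier in this notebook; `chunk_lengths` is a hypothetical helper name.

```python
counts = {
    "t": (14712397, 3143540, 136199),   # Twain: chars, words, sentences
    "w": (2856299, 595302, 38528),      # Wilde
    "c": (2646227, 531241, 17649),      # Lincoln
}
divisors = (1000, 500, 100)

def chunk_lengths(author_counts, divisors=divisors):
    # Floor division gives the same result as int(count / d) for positive counts.
    return [count // d for d in divisors for count in author_counts]

lengths = {auth: chunk_lengths(c) for auth, c in counts.items()}
```

The ordering (characters, words, sentences for each divisor in turn) matches the lists printed above.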

Twain = 1 
Wilde = 2 
Lincoln = 3 
D_Twain = 10 
D_Wilde = 20 
D_Lincoln = 30
Modern = 100

In [482]:
#import seaborn as sns
#sns.distplot(a_df['char_count'])
#sns.distplot(a_df['word_count'])
#sns.distplot(a_df['sentence_count'])

In [None]:
def var_gram(text, pct):
    t = len(text)
    print "The character ngram degree for this text at ", pct, "percent is:"
    pct = pct/100.0
    print int(pct * t)

In [None]:
var_gram(clean_t, 1)

In [420]:
# A relative length n-grammer
def n_grammr(text, n):
    gram = []
    p = int(n*len(text))
    c = [text[0+i:p+i] for i in range(0, len(text), p)]
    gram.append(c)
    df = pd.DataFrame(gram).T
    return df

In [421]:
n_grammr(clean_w, .20)

| | 0 |
|---|---|
| 0 | LE JARDIN. The lily's withered chalice fa... |
| 1 | even told her own mother. I don't know what ... |
| 2 | es must have been partially naval, ?for Agamem... |
| 3 | d, fogs are carried to excess. They have beco... |
| 4 | lyre,' and the 'famous final victory,' in such... |
| 5 | ce. |
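
A proportion of .20 yields six rows rather than five because `int()` truncates the chunk length, leaving a short remainder at the end. The sketch below reproduces the slicing on a toy string; `chunk_relative` is a hypothetical name, and the guard against a zero-length step is an addition not present in `n_grammr`.

```python
def chunk_relative(text, proportion):
    # Same slicing as n_grammr, returning a plain list instead of a DataFrame.
    p = int(proportion * len(text))
    if p < 1:
        # range() cannot take a step of 0, so fail with a clearer message.
        raise ValueError("proportion too small for a text of this length")
    return [text[i:i + p] for i in range(0, len(text), p)]

# 11 characters at 20 percent: p = int(2.2) = 2, so five full chunks plus a remainder.
chunks = chunk_relative("abcdefghijk", 0.20)
```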


In [None]:
def n_grammr_abs(text, n):
    gram = []
    c = [text[0+i:n+i] for i in range(0, len(text), n)]
    gram.append(c)
    df = pd.DataFrame(gram).T
    return df

In [658]:
# two n_grammrs set up: one for absolute length, one for proportional
def framer(string, n, tpe, auth):
    frame = n_grammr_abs(string, n)
    if auth == 't':
        frame['code'] = 1
    elif auth == 'w':
        frame['code'] = 2
    elif auth == 'c':
        frame['code'] = 3
    elif auth == 'm':
        frame['code'] = 100
    filename = "{}_{}{}.csv".format(auth, n, tpe)
    frame.to_csv(filename, sep="|", encoding="utf-8")
    return filename
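
The `if`/`elif` chain in `framer` could also be written as a dictionary lookup. This is a sketch of that alternative, not the version used to produce the files here; `AUTHOR_CODES`, `author_code`, and `chunk_filename` are hypothetical names, while the codes and the filename pattern are taken from this notebook.

```python
AUTHOR_CODES = {"t": 1, "w": 2, "c": 3, "m": 100}   # codes as listed in this notebook

def author_code(auth):
    # Fail loudly on an unknown key instead of silently leaving 'code' unset.
    try:
        return AUTHOR_CODES[auth]
    except KeyError:
        raise ValueError("unknown author key: {!r}".format(auth))

def chunk_filename(auth, n, tpe):
    # Same naming pattern as framer.
    return "{}_{}{}.csv".format(auth, n, tpe)
```

With the chain, an unrecognized author key produces a CSV with no `code` column and no warning; the lookup makes that case an explicit error.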

In [60]:
# This was a lovely function that killed the kernel a few times... 
    
# def n_grammr(text, n):
#     longstring = itertools.chain(text)
#     laststring = ''
#     for item in longstring: 
#         laststring += item
#     for i in range(5, len(laststring)+1):
#         gram.append(laststring[i-n:i])
#     df = pd.DataFrame(gram)
#     return df


In [None]:
# # Model code to set up a variable ngram generator, as above. Original version killed the kernel

# #string = ['penguin ', 'babboon ', 'puffin ']
# string = ['penguin babboon puffin']
# import itertools 

# longstring = itertools.chain(string)

# laststring = ''

# for item in longstring: 
#     laststring += item
    
# for i in range(5, len(laststring)+1):
#     print laststring[i-5:i]
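
A guess at why the sliding-window version killed the kernel: it appends every overlapping window of a multi-million-character string to a list and then builds a dataframe from it. A generator avoids holding all the windows at once. This is a sketch under that assumption; `sliding_ngrams` is a hypothetical name.

```python
def sliding_ngrams(text, n):
    # Yield overlapping character n-grams one at a time instead of building a list.
    for i in range(len(text) - n + 1):
        yield text[i:i + n]

# Take only the first three windows of the toy string used above.
first_three = []
for gram in sliding_ngrams("penguin babboon puffin", 5):
    first_three.append(gram)
    if len(first_three) == 3:
        break

total = sum(1 for _ in sliding_ngrams("penguin babboon puffin", 5))
```

The windows can then be counted or written out in batches, so memory use stays roughly constant regardless of corpus size.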

## Beginning of procedure for automated 'chunking' of the string sets

In [451]:
def n_grammr_abs(text, n):
    gram = []
    c = [text[0+i:n+i] for i in range(0, len(text), n)]
    gram.append(c)
    df = pd.DataFrame(gram).T
    return df

# framer chunks by absolute length, tags each chunk with the author code, and writes a CSV
def framer(string, n, tpe, auth):
    frame = n_grammr_abs(string, n)
    if auth == 't':
        frame['code'] = 1
    elif auth == 'w':
        frame['code'] = 2
    elif auth == 'c':
        frame['code'] = 3
    elif auth == 'm':
        frame['code'] = 100
    filename = "{}_{}{}.csv".format(auth, n, tpe)
    frame.to_csv(filename, sep="|", encoding="utf-8")
    return filename

In [534]:
a_df.head()

| | Name | char_count | word_count | sentence_count | char_len | word_len | sent_len | char_len_D | word_len_D | sent_len_D | char_len_C | word_len_C | sent_len_C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Mark Twain | 14712397 | 3143540 | 136199 | 14712.397 | 3143.54 | 136.199 | 29424.794 | 6287.08 | 272.398 | 147123.97 | 31435.4 | 1361.99 |
| 2 | Oscar Wilde | 2856299 | 595302 | 38528 | 2856.299 | 595.302 | 38.528 | 5712.598 | 1190.604 | 77.056 | 28562.99 | 5953.02 | 385.28 |
| 3 | Abraham Lincoln | 2646227 | 531241 | 17649 | 2646.227 | 531.241 | 17.649 | 5292.454 | 1062.482 | 35.298 | 26462.27 | 5312.41 | 176.49 |


In [522]:
length_t

[14712, 3143, 136, 29424, 6287, 272, 147123, 31435, 1361]

In [585]:
t_labels = ['t_1k_H', 't_1k_W', 't_1k_S', 't_500_H', 't_500_W', 't_500_S', 't_100_H', 't_100_W', 't_100_S']

In [523]:
length_w

[2856, 595, 38, 5712, 1190, 77, 28562, 5953, 385]

In [586]:
w_labels = ['w_1k_H', 'w_1k_W', 'w_1k_S', 'w_500_H', 'w_500_W', 'w_500_S', 'w_100_H', 'w_100_W', 'w_100_S']

In [524]:
length_c

[2646, 531, 17, 5292, 1062, 35, 26462, 5312, 176]

In [587]:
c_labels = ['c_1k_H', 'c_1k_W', 'c_1k_S', 'c_500_H', 'c_500_W', 'c_500_S', 'c_100_H', 'c_100_W', 'c_100_S']

In [574]:
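# Variable names repeated once per chunk size (1k, 500, 100)
st_t = ['clean_t', 'words_t', 'sents_t']*3
st_w = ['clean_w', 'words_w', 'sents_w']*3
st_c = ['clean_c', 'words_c', 'sents_c']*3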

In [664]:
t_comb = zip(t_labels, st_t, length_t)
w_comb = zip(w_labels, st_w, length_w)
c_comb = zip(c_labels, st_c, length_c)
m_comb = zip(labels_m, st_m, leng_m)

In [607]:
t_comb

[('t_1k_H', 'clean_t', 14712),
 ('t_1k_W', 'words_t', 3143),
 ('t_1k_S', 'sents_t', 136),
 ('t_500_H', 'clean_t', 29424),
 ('t_500_W', 'words_t', 6287),
 ('t_500_S', 'sents_t', 272),
 ('t_100_H', 'clean_t', 147123),
 ('t_100_W', 'words_t', 31435),
 ('t_100_S', 'sents_t', 1361)]

In [590]:
w_comb

[('w_1k_H', 'clean_w', 2856),
 ('w_1k_W', 'words_w', 595),
 ('w_1k_S', 'sents_w', 38),
 ('w_500_H', 'clean_w', 5712),
 ('w_500_W', 'words_w', 1190),
 ('w_500_S', 'sents_w', 77),
 ('w_100_H', 'clean_w', 28562),
 ('w_100_W', 'words_w', 5953),
 ('w_100_S', 'sents_w', 385)]

In [591]:
c_comb

[('c_1k_H', 'clean_c', 2646),
 ('c_1k_W', 'words_c', 531),
 ('c_1k_S', 'sents_c', 17),
 ('c_500_H', 'clean_c', 5292),
 ('c_500_W', 'words_c', 1062),
 ('c_500_S', 'sents_c', 35),
 ('c_100_H', 'clean_c', 26462),
 ('c_100_W', 'words_c', 5312),
 ('c_100_S', 'sents_c', 176)]

In [665]:
m_comb

[('m_1k_H', 'clean_m', 362),
 ('m_1k_W', 'words_m', 69),
 ('m_1k_S', 'sents_m', 5),
 ('m_500_H', 'clean_m', 724),
 ('m_500_W', 'words_m', 138),
 ('m_500_S', 'sents_m', 10),
 ('m_100_H', 'clean_m', 3621),
 ('m_100_W', 'words_m', 693),
 ('m_100_S', 'sents_m', 51)]
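
The label and variable-name lists above follow a regular pattern, so the combined tuples could be generated rather than typed out. A sketch, with `build_combos` as a hypothetical helper; it assumes the lengths arrive in the same order as `length_t` and friends.

```python
def build_combos(prefix, lengths):
    # lengths is ordered (chars, words, sents) for 1k, then 500, then 100.
    sizes = ("1k", "500", "100")
    units = (("H", "clean"), ("W", "words"), ("S", "sents"))
    labels = ["{}_{}_{}".format(prefix, s, u) for s in sizes for u, _ in units]
    names = ["{}_{}".format(stem, prefix) for _ in sizes for _, stem in units]
    # list() makes the result reusable on Python 3, where zip is lazy.
    return list(zip(labels, names, lengths))

m_combos = build_combos("m", [362, 69, 5, 724, 138, 10, 3621, 693, 51])
```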

In [None]:
# Notes to self on what I'm trying to do:
# for t_1k_H the function should be called with clean_t and a chunk length of 14712

In [667]:
st_t_v = {'t_1k_H':clean_t, 't_1k_W':words_t, 't_1k_S':sents_t,
          't_500_H':clean_t, 't_500_W':words_t, 't_500_S':sents_t,
          't_100_H':clean_t, 't_100_W':words_t, 't_100_S':sents_t}

st_w_v = {'w_1k_H':clean_w, 'w_1k_W':words_w, 'w_1k_S':sents_w,
          'w_500_H':clean_w, 'w_500_W':words_w, 'w_500_S':sents_w,
          'w_100_H':clean_w, 'w_100_W':words_w, 'w_100_S':sents_w}

st_c_v = {'c_1k_H':clean_c, 'c_1k_W':words_c, 'c_1k_S':sents_c,
          'c_500_H':clean_c, 'c_500_W':words_c, 'c_500_S':sents_c,
          'c_100_H':clean_c, 'c_100_W':words_c, 'c_100_S':sents_c}

st_m_v = {'m_1k_H':clean_m, 'm_1k_W':words_m, 'm_1k_S':sents_m,
          'm_500_H':clean_m, 'm_500_W':words_m, 'm_500_S':sents_m,
          'm_100_H':clean_m, 'm_100_W': words_m, 'm_100_S': sents_m}

In [597]:
for x in t_comb:
    call = x[0]
    print x
    print framer(st_t_v[call],x[2],x[1],'t')

('t_1k_H', 'clean_t', 14712)
t_14712clean_t.csv
('t_1k_W', 'words_t', 3143)
t_3143words_t.csv
('t_1k_S', 'sents_t', 136)
t_136sents_t.csv
('t_500_H', 'clean_t', 29424)
t_29424clean_t.csv
('t_500_W', 'words_t', 6287)
t_6287words_t.csv
('t_500_S', 'sents_t', 272)
t_272sents_t.csv
('t_100_H', 'clean_t', 147123)
t_147123clean_t.csv
('t_100_W', 'words_t', 31435)
t_31435words_t.csv
('t_100_S', 'sents_t', 1361)
t_1361sents_t.csv


In [603]:
for x in w_comb:
    call = x[0]
    print x
    print framer(st_w_v[call],x[2],x[1],'w')

('w_1k_H', 'clean_w', 2856)
w_2856clean_w.csv
('w_1k_W', 'words_w', 595)
w_595words_w.csv
('w_1k_S', 'sents_w', 38)
w_38sents_w.csv
('w_500_H', 'clean_w', 5712)
w_5712clean_w.csv
('w_500_W', 'words_w', 1190)
w_1190words_w.csv
('w_500_S', 'sents_w', 77)
w_77sents_w.csv
('w_100_H', 'clean_w', 28562)
w_28562clean_w.csv
('w_100_W', 'words_w', 5953)
w_5953words_w.csv
('w_100_S', 'sents_w', 385)
w_385sents_w.csv


In [606]:
for x in c_comb:
    call = x[0]
    print x
    print framer(st_c_v[call],x[2],x[1],'c')

('c_1k_H', 'clean_c', 2646)
c_2646clean_c.csv
('c_1k_W', 'words_c', 531)
c_531words_c.csv
('c_1k_S', 'sents_c', 17)
c_17sents_c.csv
('c_500_H', 'clean_c', 5292)
c_5292clean_c.csv
('c_500_W', 'words_c', 1062)
c_1062words_c.csv
('c_500_S', 'sents_c', 35)
c_35sents_c.csv
('c_100_H', 'clean_c', 26462)
c_26462clean_c.csv
('c_100_W', 'words_c', 5312)
c_5312words_c.csv
('c_100_S', 'sents_c', 176)
c_176sents_c.csv


In [668]:
for x in m_comb:
    call = x[0]
    print x
    print framer(st_m_v[call],x[2],x[1],'m')

('m_1k_H', 'clean_m', 362)
m_362clean_m.csv
('m_1k_W', 'words_m', 69)
m_69words_m.csv
('m_1k_S', 'sents_m', 5)
m_5sents_m.csv
('m_500_H', 'clean_m', 724)
m_724clean_m.csv
('m_500_W', 'words_m', 138)
m_138words_m.csv
('m_500_S', 'sents_m', 10)
m_10sents_m.csv
('m_100_H', 'clean_m', 3621)
m_3621clean_m.csv
('m_100_W', 'words_m', 693)
m_693words_m.csv
('m_100_S', 'sents_m', 51)
m_51sents_m.csv


In [None]:
# # Kept here as an example of something tried and rejected. About 12-15 hours of work is represented here.

In [None]:
# strings_w = ['clean_w', 'words_w', 'sents_w']
# strings_w_vars = {'clean_w':clean_w, 'words_w':words_w, 'sents_w':sents_w}

# strings_c = ['clean_c', 'words_c', 'sents_c']
# strings_c_vars = {'clean_c':clean_c, 'words_c':words_c, 'sents_c':sents_c}

# strings_t = ['clean_t', 'words_t', 'sents_t']
# strings_t_vars = {'clean_t':clean_t, 'words_t':words_t, 'sents_t':sents_t}

In [None]:
# # auth = ['w', 'c']
# # tpe = ['C', 'W', 'S']
# # pct = [10, 20, 30, 40]
# # pcts2 = [.10, .20, .30, .40]
# # lengths = [1000, 5000, 10000, 20000]
# # lengths_t = [10000, 15000, 20000, 30000]

In [571]:
# z_t= itertools.product(strings_t, length_t)

# combos_t = []
# for i,j in z_t:
#     if 'clean' in i:
#         t = 'c'
#     elif 'words' in i:
#         t = 'w'
#     elif 'sents' in i:
#         t = 's'
#     combos_t.append([i,j,t])
# combos_t

# for x in combos_t:
#     call = x[0]
#     print x
#     print framer(strings_t_vars[call],x[1],x[2],'t')

### Chunking Twain
By Characters - 14,712,397

Having done this by hand, I decided it would make a lot more sense to automate it. That decision took about 2 days to bring to fruition.

In [66]:
# Fixed-length chunks, so use the absolute-length chunker (n_grammr expects a proportion).
t_10kC = n_grammr_abs(clean_t, 10000)
t_10kC.to_csv('t_10kC.csv', sep="|", encoding='utf-8')

In [69]:
t_50kC = n_grammr_abs(clean_t, 50000)
t_50kC.to_csv('t_50kC.csv', sep="|", encoding='utf-8')

In [71]:
t_100kC = n_grammr_abs(clean_t, 100000)
t_100kC.to_csv('t_100kC.csv', sep="|", encoding='utf-8')

By Words (words_t) - 3,143,540

In [75]:
# Fixed-length chunks of word tokens, using the absolute-length chunker.
t_1kW = n_grammr_abs(words_t, 1000)
t_1kW.to_csv('t_1kW.csv', sep='|', encoding='utf-8')

In [77]:
t_10kW = n_grammr_abs(words_t, 10000)
t_10kW.to_csv('t_10kW.csv', sep="|", encoding="utf-8")

In [81]:
t_25kW = n_grammr_abs(words_t, 25000)
t_25kW.to_csv('t_25kW.csv', sep="|", encoding="utf-8")

In [83]:
t_50kW = n_grammr_abs(words_t, 50000)
t_50kW.to_csv('t_50kW.csv', sep="|", encoding="utf-8")

By sentences - 136,199

In [None]:
# Fixed-length chunks of sentences, using the absolute-length chunker.
t_1kS = n_grammr_abs(sents_t, 1000)
t_1kS.to_csv('t_1kS.csv', sep="|", encoding="utf-8")

In [115]:
t_5kS = n_grammr_abs(sents_t, 5000)
t_5kS.to_csv('t_5kS.csv', sep="|", encoding="utf-8")

In [120]:
t_10kS = n_grammr_abs(sents_t, 10000)
t_10kS.to_csv('t_10kS.csv', sep="|", encoding="utf-8")

# This is a sidestep to get the wordclouds for my authors. Credit to Sam Stack and Sheena Lee Villaneuva

In [None]:
from stop_words import get_stop_words
stop_words = get_stop_words('en')

In [None]:
twain_words = pd.DataFrame(words_t)

In [None]:
twain_words = twain_words[~twain_words[0].isin(stop_words)]
twain_words.to_csv('twain_wc_clean.csv', encoding='utf-8')
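
One thing to watch with the filter above: `isin` matches exactly, so capitalized stop words and punctuation tokens stay in the word list. The sketch below shows the effect on a toy token list. The stop list here is a hypothetical four-word stand-in, and it assumes the real list from `get_stop_words('en')` is lowercase.

```python
# Hypothetical mini stop list; the notebook uses get_stop_words('en') from stop_words.
stop_words = {"the", "a", "and", "of"}

tokens = ["The", "river", "and", "the", "raft", ",", "of", "course"]

# Exact matching, as with isin: "The" and "," survive.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing before the lookup catches capitalized stop words.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

# Dropping non-alphabetic tokens removes punctuation as well.
alpha_only = [t for t in case_insensitive if t.isalpha()]
```

For a word cloud the last list is usually the one wanted, since sentence-initial stop words and commas would otherwise dominate the picture.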