Notebook to develop test data using the NLTK package and Project Gutenberg. The NLTK Gutenberg corpus consists of 18 works of literature, including novels, plays, poems, and the King James Bible. The version in this notebook uses a balanced training dataset selected only from the subset of novels from the corpus. The notebook stores the DataFrame objects as Parquet files for retrieval by downstream notebooks.

In [190]:

!pip install pydot --quiet
!pip install nltk --quiet
!pip install pyarrow --quiet

In [191]:
import numpy as np
import tensorflow as tf
import pandas as pd

import os
import nltk
from nltk.data import find

import matplotlib.pyplot as plt
import random
import re

nltk.download('gutenberg')
from nltk.corpus import gutenberg

nltk.download('punkt')
from nltk.tokenize import sent_tokenize

import pyarrow as pya
import pyarrow.parquet as pq


[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Dragon\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dragon\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [427]:
DATA_LOC = 'local'

# USE ONLY FOR REMOTE DRIVE
# Mount a google Drive for persistent store
if DATA_LOC == 'remote':
    from google.colab import drive
    drive.mount('/content/drive')

# Load novels from Project Gutenberg

In [192]:
import requests

In [528]:
# Utility support function
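def remove_new_line_tabs(book):
    """Replace newlines, carriage returns and tabs in the text with spaces."""
    # "\d" and "\s" are regex classes, not string escapes; str.replace treated them as literal text
    for char in ["\n", "\r", "\t"]:
        book = book.replace(char, " ")
    return book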

In [531]:
# ########################################################################################
# LOAD INDIVIDUAL NOVELS and remove header and footer info, including title of the book
#
# Process flow:
#   1. load the novel
#   2. search for the end of the novel and cut out the footer info using "split_str"
#   3. of the results from step 2, cut out the header / preamble / table of contents, including title
#      The header info has been analyzed per novel and the starting character value is set at the first
#      text character for the body of the novel.
#   4. pass the body of the novel through the remove_new_line_tabs function to replace newlines, tabs, ... with spaces
#   5. append the processed work to the list of novels
# ########################################################################################
bks_gutenberg = []
split_str = '*** END OF THE PROJECT GUTENBERG EBOOK'

# ########################
# F. Scott Fitzgerald
# ########################
# the great gatsby, start at 1200
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1200:])
bks_gutenberg.append(book)

# this side of paradise
r = requests.get(r'https://www.gutenberg.org/cache/epub/805/pg805.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1700:])
bks_gutenberg.append(book)

# beautiful and damned
r = requests.get(r'https://www.gutenberg.org/cache/epub/9830/pg9830.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[2410:])
bks_gutenberg.append(book)

# ########################
# Hemingway
# ########################
# the sun also rises
r = requests.get(r'https://www.gutenberg.org/cache/epub/67138/pg67138.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[2400:])
bks_gutenberg.append(book)

# Men Without Women
r = requests.get(r'https://www.gutenberg.org/cache/epub/69683/pg69683.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[3800:])
bks_gutenberg.append(book)

# In Our Time
r = requests.get(r'https://www.gutenberg.org/cache/epub/61085/pg61085.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1720:])
bks_gutenberg.append(book)

# ########################
# Thomas Hardy
# ########################
# Mayor of Casterbridge 
r = requests.get(r'https://www.gutenberg.org/cache/epub/143/pg143.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1200:])
bks_gutenberg.append(book)

# Jude the Obscure 
r = requests.get(r'https://www.gutenberg.org/cache/epub/153/pg153.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[3730:])
bks_gutenberg.append(book)

# Return of the Native
r = requests.get(r'https://www.gutenberg.org/cache/epub/122/pg122.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[3500:])
bks_gutenberg.append(book)

# ########################
# Dickens
# ########################
# a tale of two cities
r = requests.get(r'https://www.gutenberg.org/cache/epub/98/pg98.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[2700:])
bks_gutenberg.append(book)

# Great Expectations
r = requests.get(r'https://www.gutenberg.org/cache/epub/1400/pg1400.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1850:])
bks_gutenberg.append(book)

# Bleak House
r = requests.get(r'https://www.gutenberg.org/cache/epub/1023/pg1023.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[2970:])
bks_gutenberg.append(book)

# ########################
# Jane Austen
# ########################
# Emma
r = requests.get(r'https://www.gutenberg.org/cache/epub/158/pg158.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1610:])
bks_gutenberg.append(book)

# Sense
r = requests.get(r'https://www.gutenberg.org/cache/epub/161/pg161.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1590:])
bks_gutenberg.append(book)

# Pride
r = requests.get(r'https://www.gutenberg.org/cache/epub/1342/pg1342.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[2550:])
bks_gutenberg.append(book)

# ########################
# Chesterton
# ########################
# Wisdom of Father Brown
r = requests.get(r'https://www.gutenberg.org/cache/epub/223/pg223.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1400:])
bks_gutenberg.append(book)

# The Man Who Was Thursday
r = requests.get(r'https://www.gutenberg.org/cache/epub/1695/pg1695.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[2570:])
bks_gutenberg.append(book)

# The Ball and the Cross
r = requests.get(r'https://www.gutenberg.org/cache/epub/5265/pg5265.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[1430:])
bks_gutenberg.append(book)

# ########################
# Shakespeare
# ########################
# As You Like It
r = requests.get(r'https://www.gutenberg.org/cache/epub/1786/pg1786.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[11780:])
bks_gutenberg.append(book)

# Caesar
r = requests.get(r'https://www.gutenberg.org/cache/epub/2263/pg2263.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[4950:])
bks_gutenberg.append(book)

# Hamlet
r = requests.get(r'https://www.gutenberg.org/cache/epub/2265/pg2265.txt')
r_split = r.text.split(split_str,1)[0]
book = remove_new_line_tabs(r_split[4900:])
bks_gutenberg.append(book)
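
The per-book steps in the cell above repeat one pattern: cut the footer at the end marker, skip the header, then replace newlines and tabs with spaces. A minimal sketch of that pattern as a single helper follows. The function name `extract_body` and the marker-based fallback are illustrative assumptions, not part of the notebook; the start marker is the string used in the unit-test cell, and the sample text is synthetic.

```python
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"


def extract_body(raw_text, start_offset=None):
    """Return the text between the Gutenberg header and footer.

    If start_offset is given it is used like the hand-picked character
    indices in the cell above. Otherwise (an assumption of this sketch)
    the header is cut after the line containing START_MARKER, which
    still leaves any title page or table of contents in place.
    """
    # Step 2: drop everything from the end marker onwards.
    body = raw_text.split(END_MARKER, 1)[0]

    # Step 3: drop the header.
    if start_offset is not None:
        body = body[start_offset:]
    elif START_MARKER in body:
        after_marker = body.split(START_MARKER, 1)[1]
        # Skip the rest of the marker line (title and closing asterisks).
        body = after_marker.split("\n", 1)[-1]

    # Step 4: replace newlines, carriage returns and tabs with spaces.
    for char in ("\n", "\r", "\t"):
        body = body.replace(char, " ")
    return body


# Synthetic sample, not real Gutenberg text.
sample = (
    "Header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Chapter 1\nIt begins.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "License text"
)
print(extract_body(sample))
```

With a helper like this, each book would reduce to one call such as `extract_body(r.text, start_offset=1200)`, keeping the per-novel offsets that were analyzed by hand.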

In [527]:
# UNIT TEST
#bks_gutenberg[18]
#atotc = requests.get(r'https://www.gutenberg.org/cache/epub/98/pg98.txt')


In [None]:
#UNIT TEST
#print(bks_gutenberg[1][:-2000])
substr = 'START OF THE PROJECT GUTENBERG EBOOK'
new_str = bks_gutenberg[1].split(substr,1)[-1]

substr2 = 'END OF THE PROJECT GUTENBERG EBOOK'
new_str2 = new_str.split(substr2,1)[0]
print(new_str2)

In [419]:
# ######################################################
# OLD OLD OLD OLD OLD ... DO NOT CALL
# ######################################################
bks_gutenberg_processed = []

# Process data and get sentence counts
start_of_ebook = 'START OF THE PROJECT GUTENBERG EBOOK'
end_of_ebook   = 'END OF THE PROJECT GUTENBERG EBOOK'

# Clean up header and footer info
for indx in range(len(bks_gutenberg)):
  #new_text = bks_gutenberg[indx].split(start_of_ebook,1)[-1]
  #new_text = new_text.split(end_of_ebook,1)[0]

  #for char in ["\n", "\r", "\t"]:
  #  new_text = new_text.replace(char, " ")
  new_text = remove_new_line_tabs(bks_gutenberg[indx])

  bks_gutenberg_processed.append(new_text)

In [532]:
# #################################################################
# Combine the 3 novels per author under one concatenated string
# #################################################################
bks_gutenberg_combined = [' '.join(bks_gutenberg[i:i+3]) for i in range(0, len(bks_gutenberg), 3)]

In [534]:
#################################################################################
# Tokenize sentences and create a dataframe
#################################################################################
from nltk.tokenize import sent_tokenize

# ***************************************************************************
# Add Project Gutenberg titles to book list, tokenize sentences
# ***************************************************************************
sens_count = []
bks_gutenberg_sentences = []

df_books = pd.DataFrame({

   'Author':  ['fitzgerald',
           'hemingway',
           'hardy',
           'dickens',
           'austen',
           'chesterton',
           'shakespeare'],
   'Short Title': ['gatsby,this side of paradise,beautiful and damned',
                'sun also rises,men without women,in our time',
                'mayor,jude,native',
                'tale,great expectations,bleak house',
                'emma,sense,pride',
                'wisdom brown,thurday,ball',
                'as you like it,caesar,hamlet'],
   'Title': ['The Great Gatsby,This Side of Paradise,The Beautiful and Damned',
          'The Sun Also Rises,Men Without Women,In Our Time',
          'The Mayor of Casterbridge,Jude the Obscure,Return of the Native',
          'A Tale of Two Cities,Great Expectations,Bleak House',
          'Emma,Sense and Sensibility,Pride and Prejudice',
          'The Wisdom of Father Brown,The Man Who Was Thursday,The Ball and the Cross',
          'As You Like It,Julius Caesar,Hamlet']
})


for indx in range(len(bks_gutenberg_combined)):
  # Get sentence count
  # sent_tokenize returns the text as a list of sentence strings
  sentences = sent_tokenize(bks_gutenberg_combined[indx])
  sens_count.append(len(sentences))
  #bks_gutenberg_processed.append(new_text)
  bks_gutenberg_sentences.append(sentences)

df_books['Sentence Count'] = sens_count
# Wrap each author's sentence list in a one-element list; the chunking cell reads it as book[0]
bks_gutenberg_sentences = [[string] for string in bks_gutenberg_sentences]

#for book in bks_gutenberg_processed:
#  books.append(book)

# Process Data

In [535]:
# ##################################################
# Create sentence groups of size chunk_size
# ##################################################
chunk_size = 3
book_groups = []

for i, book in enumerate(bks_gutenberg_sentences):
  combined_sents = []

  for j in range(0, len(book[0]), chunk_size):
        
    rem = len(book[0]) - j
    
    if rem < chunk_size:
        print("less than chunk_size remaining")
        group = book[0][j:j+rem]
        print(group)
        print(i)
    else:
        group = book[0][j:j+chunk_size]
        
    new_str = " ".join(group)
    combined_sents.append(new_str)

  book_groups.append(combined_sents)

less than chunk_size remaining
['"It was a hard fight, but I didn\'t give  up and I came through!"']
0
less than chunk_size remaining
['The six    works constituting the series are:      Indiscretions _of_ Ezra Pound      Women and Men _by_ Ford Madox Ford      Elimus _by_ B. C. Windeler       with Designs _by_ D. Shakespear      The Great American Novel       _by_ William Carlos Williams      England _by_ B.M.G.-Adams      In Our Time _by_ Ernest Hemingway       with Portrait _by_ Henry Strater']
1
less than chunk_size remaining
['“Such as they were, of course.”    “My dear Dame Durden,” said Allan, drawing my arm through his, “do  you ever look in the glass?”    “You know I do; you see me do it.”    “And don’t you know that you are prettier than you ever were?”    I did not know that; I am not certain that I know it now.', 'But I know  that my dearest little pets are very pretty, and that my darling is  very beautiful, and that my husband is very handsome, and that my  guardian has t
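
The grouping loop above can also be written without the special case for the final short group, because Python slicing past the end of a list simply returns what is left. A minimal sketch, with `chunk_sentences` as a hypothetical helper name and a toy input:

```python
def chunk_sentences(sentences, chunk_size=3):
    """Join consecutive sentences into groups of at most chunk_size."""
    groups = []
    for start in range(0, len(sentences), chunk_size):
        # Slicing past the end is safe, so the last (shorter) group
        # needs no separate branch.
        groups.append(" ".join(sentences[start:start + chunk_size]))
    return groups


demo = ["One.", "Two.", "Three.", "Four."]
print(chunk_sentences(demo))
```

The number of groups is the sentence count divided by `chunk_size`, rounded up, which is consistent with the counts reported in the dataframe below (for example 15988 sentences give 5330 groups).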

In [536]:
# ############################################################################
# Store in a dataframe
# ############################################################################
df_books["Sentence Groups"] = book_groups
df_books["Group Counts"] = df_books["Sentence Groups"].apply(len)
df_books

Unnamed: 0,Author,Short Title,Title,Sentence Count,Sentence Groups,Group Counts
0,fitzgerald,"gatsby,this side of paradise,beautiful and damned","The Great Gatsby,This Side of Paradise,The Bea...",15988,[ I In my younger...,5330
1,hemingway,"sun also rises,men without women,in our time","The Sun Also Rises,Men Without Women,In Our Time",9166,[ BOO...,3056
2,hardy,"mayor,jude,native","The Mayor of Casterbridge,Jude the Obscure,Ret...",18318,"[ I. One evening of late summer, before th...",6106
3,dickens,"tale,great expectations,bleak house","A Tale of Two Cities,Great Expectations,Bleak ...",28184,[ CHAPTER I. The Period It was the best...,9395
4,austen,"emma,sense,pride","Emma,Sense and Sensibility,Pride and Prejudice",14605,[ VOLUME I CHAPTER I Emma W...,4869
5,chesterton,"wisdom brown,thurday,ball","The Wisdom of Father Brown,The Man Who Was Thu...",10083,[ ONE -- The Absence of Mr Glass ...,3361
6,shakespeare,"as you like it,caesar,hamlet","As You Like It,Julius Caesar,Hamlet",6298,[ SCENE: OLIVER'S house; FREDERICK'S c...,2100


In [537]:
# ***************************************************************************
# PREPARE DATAFRAME
#
# Randomly shuffle the sentence groups (each group moves as a unit), then
# store the first train_split fraction as Training and the remaining groups
# as Testing. Because the groups are shuffled first, taking Train and Test
# sequentially still yields a random split.
# ***************************************************************************

# Select Train, Test split
train_split = 0.8
test_split  = 0.2

# Create data structure to put into a dataframe
data_train = []
data_test  = []

for group in book_groups:
  #author = authors[i]
  #short_title = short_titles[i]
  #title = titles[i]

  # group contains the sentence groups for one author
  n = len(group)

  train_split_index = int(n*train_split)

  # use temp_group as temp store in order to preserve order in book_group[i]
  temp_group = group.copy()
  random.shuffle(temp_group)

  #train_group = book_groups[i][:train_split_index]
  train_group = temp_group[:train_split_index]
  test_group  = temp_group[train_split_index:]

  data_train.append(train_group)
  data_test.append(test_group)

df_books["Train"] = data_train
df_books["Test"]  = data_test
df_books

Unnamed: 0,Author,Short Title,Title,Sentence Count,Sentence Groups,Group Counts,Train,Test
0,fitzgerald,"gatsby,this side of paradise,beautiful and damned","The Great Gatsby,This Side of Paradise,The Bea...",15988,[ I In my younger...,5330,[After Gatsby’s death the East was haunted for...,"[He returned hurriedly to 12 University, left ..."
1,hemingway,"sun also rises,men without women,in our time","The Sun Also Rises,Men Without Women,In Our Time",9166,[ BOO...,3056,[I won’t stand it. Who cares if he is a damn b...,[We walked along. “What did you say that for?”...
2,hardy,"mayor,jude,native","The Mayor of Casterbridge,Jude the Obscure,Ret...",18318,"[ I. One evening of late summer, before th...",6106,[And then they had turned from each other in ...,"[“Impudence. Don’t tell folk it was I, mind!” ..."
3,dickens,"tale,great expectations,bleak house","A Tale of Two Cities,Great Expectations,Bleak ...",28184,[ CHAPTER I. The Period It was the best...,9395,[Muttering that I would make the inquiry whet...,"[Moreover, he was a boy whom no man could hurt..."
4,austen,"emma,sense,pride","Emma,Sense and Sensibility,Pride and Prejudice",14605,[ VOLUME I CHAPTER I Emma W...,4869,"[“I know little of the game at present,” said ...","[He is an excellent young man, and will suit H..."
5,chesterton,"wisdom brown,thurday,ball","The Wisdom of Father Brown,The Man Who Was Thu...",10083,[ ONE -- The Absence of Mr Glass ...,3361,[The big man in black was staring at me with t...,[He fled frantically down a long lane with his...
6,shakespeare,"as you like it,caesar,hamlet","As You Like It,Julius Caesar,Hamlet",6298,[ SCENE: OLIVER'S house; FREDERICK'S c...,2100,"[Heere is the Will, and vnder Caesars Seale: ...",[What makes he here? Did he ask for me? Where ...
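
The split cell above calls `random.shuffle` without setting a seed, so every run produces a different Train/Test partition. If reproducible files are wanted, one option is a local seeded generator. This is a sketch only: `split_groups` is a hypothetical helper and the seed value 42 is an arbitrary assumption.

```python
import random


def split_groups(groups, train_fraction=0.8, seed=42):
    """Shuffle a copy of groups and split it into (train, test)."""
    rng = random.Random(seed)   # local generator; the global random state is untouched
    shuffled = list(groups)     # copy, so the original order is preserved
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_groups([f"group {i}" for i in range(10)])
print(len(train), len(test))
```

Calling the helper twice with the same seed returns the same partition, which makes it easier to compare models trained by downstream notebooks.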


In [None]:
# For remote Drive
df_books.to_parquet("gutenberg_corpus_df_3chunk.parquet")
#!mv "nltk_corpus_df_chunks.parquet" "/content/drive/My Drive/w266_Project/ProjectStore/gutenberg_corpus_df_3chunk.parquet"
!mv "gutenberg_corpus_df_3chunk.parquet" "/content/drive/My Drive/w266/data/gutenberg_corpus_df_3chunk.parquet"

In [184]:
# For local drive
data_path = 'D:/MIDS/W266/Project/Data/'
data_file = 'gutenberg_corpus_df_3chunk_case8.parquet'
df_books.to_parquet(data_path+data_file)

In [None]:
# Unit Test parquet file retrieval
# read into a pyarrow table
# NOTE: columns that held Python lists before storing come back as numpy.ndarray objects after reading from Drive
table = pya.parquet.read_table("/content/drive/My Drive/w266_Project/ProjectStore/nltk_corpus_df_chunks.parquet")
df = table.to_pandas()
df

# PREPARE BINARY CLASS DATASETS

In [425]:
# ##################################################################################################
# Function: prepare data for binary classification model
#
# Binary data files differ from the multiclass files only in the label: passages by the author
# at row `index` are labeled 1 and passages by every other author are labeled 0
# ##################################################################################################
def create_bin_data(df,index):

  train = []
  test  = []
  list_of_authors = ['fitzgerald','hemingway','hardy','dickens','austen','chesterton','shakespeare']

  for indx, row in df.iterrows():
    if indx == index:
      label_train = [1]*len(row["Train"])
      label_test = [1]*len(row["Test"])
    else:
      label_train = [0]*len(row["Train"])
      label_test = [0]*len(row["Test"])

    zipped_train = list(zip(row["Train"],label_train))
    zipped_test = list(zip(row["Test"],label_test))

    train.append(zipped_train)
    test.append(zipped_test)

  #flatten the list using list comprehension then shuffle
  train_shuffled = [item for sublist in train for item in sublist]
  random.shuffle(train_shuffled)

  test_shuffled = [item for sublist in test for item in sublist]
  random.shuffle(test_shuffled)

  df_binary_data_train = pd.DataFrame(train_shuffled, columns=['Train Data','Train Label'])
  df_binary_data_test  = pd.DataFrame(test_shuffled,  columns=['Test Data' ,'Test Label'])

  return(df_binary_data_train, df_binary_data_test)


In [538]:
# ############################################################################
# Create binary files and store
# ############################################################################
df_binary_data_train0, df_binary_data_test0 = create_bin_data(df_books,0)
df_binary_data_train1, df_binary_data_test1 = create_bin_data(df_books,1)
df_binary_data_train2, df_binary_data_test2 = create_bin_data(df_books,2)
df_binary_data_train3, df_binary_data_test3 = create_bin_data(df_books,3)
df_binary_data_train4, df_binary_data_test4 = create_bin_data(df_books,4)
df_binary_data_train5, df_binary_data_test5 = create_bin_data(df_books,5)
df_binary_data_train6, df_binary_data_test6 = create_bin_data(df_books,6)

if DATA_LOC == 'local':
    f_name = 'train_case13_bin.parquet'
    f_path = 'D:/MIDS/W266/Project/Data/Bin/'
    
    df_binary_data_train0.to_parquet(f_path+'train_case13_bin_0.parquet')
    df_binary_data_test0.to_parquet(f_path+'test_case13_bin_0.parquet')
    
    df_binary_data_train1.to_parquet(f_path+'train_case13_bin_1.parquet')
    df_binary_data_test1.to_parquet(f_path+'test_case13_bin_1.parquet')
    
    df_binary_data_train2.to_parquet(f_path+'train_case13_bin_2.parquet')
    df_binary_data_test2.to_parquet(f_path+'test_case13_bin_2.parquet')
    
    df_binary_data_train3.to_parquet(f_path+'train_case13_bin_3.parquet')
    df_binary_data_test3.to_parquet(f_path+'test_case13_bin_3.parquet')
    
    df_binary_data_train4.to_parquet(f_path+'train_case13_bin_4.parquet')
    df_binary_data_test4.to_parquet(f_path+'test_case13_bin_4.parquet')
    
    df_binary_data_train5.to_parquet(f_path+'train_case13_bin_5.parquet')
    df_binary_data_test5.to_parquet(f_path+'test_case13_bin_5.parquet')
    
    df_binary_data_train6.to_parquet(f_path+'train_case13_bin_6.parquet')
    df_binary_data_test6.to_parquet(f_path+'test_case13_bin_6.parquet')
else:
    df_binary_data_train0.to_parquet("train_case13_bin_0.parquet")
    df_binary_data_test0.to_parquet("test_case13_bin_0.parquet")
    !mv "train_case13_bin_0.parquet" "/content/drive/My Drive/w266/data/train_case13_bin_0.parquet"
    !mv "test_case13_bin_0.parquet"  "/content/drive/My Drive/w266/data/test_case13_bin_0.parquet"

    df_binary_data_train1.to_parquet("train_case13_bin_1.parquet")
    df_binary_data_test1.to_parquet("test_case13_bin_1.parquet")
    !mv "train_case13_bin_1.parquet" "/content/drive/My Drive/w266/data/train_case13_bin_1.parquet"
    !mv "test_case13_bin_1.parquet"  "/content/drive/My Drive/w266/data/test_case13_bin_1.parquet"

    df_binary_data_train2.to_parquet("train_case13_bin_2.parquet")
    df_binary_data_test2.to_parquet("test_case13_bin_2.parquet")
    !mv "train_case13_bin_2.parquet" "/content/drive/My Drive/w266/data/train_case13_bin_2.parquet"
    !mv "test_case13_bin_2.parquet"  "/content/drive/My Drive/w266/data/test_case13_bin_2.parquet"
    
    df_binary_data_train3.to_parquet("train_case13_bin_3.parquet")
    df_binary_data_test3.to_parquet("test_case13_bin_3.parquet")
    !mv "train_case13_bin_3.parquet" "/content/drive/My Drive/w266/data/train_case13_bin_3.parquet"
    !mv "test_case13_bin_3.parquet"  "/content/drive/My Drive/w266/data/test_case13_bin_3.parquet"

    df_binary_data_train4.to_parquet("train_case13_bin_4.parquet")
    df_binary_data_test4.to_parquet("test_case13_bin_4.parquet")
    !mv "train_case13_bin_4.parquet" "/content/drive/My Drive/w266/data/train_case13_bin_4.parquet"
    !mv "test_case13_bin_4.parquet"  "/content/drive/My Drive/w266/data/test_case13_bin_4.parquet"

    df_binary_data_train5.to_parquet("train_case13_bin_5.parquet")
    df_binary_data_test5.to_parquet("test_case13_bin_5.parquet")
    !mv "train_case13_bin_5.parquet" "/content/drive/My Drive/w266/data/train_case13_bin_5.parquet"
    !mv "test_case13_bin_5.parquet"  "/content/drive/My Drive/w266/data/test_case13_bin_5.parquet"

    df_binary_data_train6.to_parquet("train_case13_bin_6.parquet")
    df_binary_data_test6.to_parquet("test_case13_bin_6.parquet")
    !mv "train_case13_bin_6.parquet" "/content/drive/My Drive/w266/data/train_case13_bin_6.parquet"
    !mv "test_case13_bin_6.parquet"  "/content/drive/My Drive/w266/data/test_case13_bin_6.parquet"
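
The save cell above repeats the same two lines for each of the seven authors. As a sketch of how the file names could be generated in a loop instead, the helper below builds the train and test paths from the author index. `binary_split_paths` is a hypothetical name; the naming pattern and the `case13` tag are taken from the cell above.

```python
from pathlib import Path


def binary_split_paths(author_index, out_dir, case="case13"):
    """Return (train_path, test_path) following the naming pattern used above."""
    out_dir = Path(out_dir)
    train_path = out_dir / f"train_{case}_bin_{author_index}.parquet"
    test_path = out_dir / f"test_{case}_bin_{author_index}.parquet"
    return train_path, test_path


# Possible usage, assuming df_books and create_bin_data from the cells above:
# for i in range(len(df_books)):
#     df_train, df_test = create_bin_data(df_books, i)
#     train_path, test_path = binary_split_paths(i, 'D:/MIDS/W266/Project/Data/Bin/')
#     df_train.to_parquet(train_path)
#     df_test.to_parquet(test_path)

for i in range(2):
    print([p.name for p in binary_split_paths(i, "out")])
```

Passing the Drive folder as `out_dir` would write directly to the mounted drive, which might make the separate `!mv` step unnecessary.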

In [None]:
# UNIT TEST
#df_binary_data_test.to_parquet(model_filename)
#!mv $model_filename "/content/drive/My Drive/w266/"

test_shuffled[15]


('Both of which,” said Joe, quite charmed  with his logical arrangement, “being done, now this to you a true  friend, say. Namely. You mustn’t go a overdoing on it, but you must  have your supper and your wine and water, and you must be put betwixt  the sheets.”    The delicacy with which Joe dismissed this theme, and the sweet tact  and kindness with which Biddy—who with her woman’s wit had found me out  so soon—had prepared him for it, made a deep impression on my mind.',
 2)

# PREPARE MULTICLASS DATASETS

In [390]:
# ##############################################################
# Prepare data for multiclass classification model
#
# index location --> label
# ##############################################################
train = []
test  = []

for indx, row in df_books.iterrows():
  #print(len(row["Train"]))
  label_train = [indx]*len(row["Train"])
  label_test  = [indx]*len(row["Test"])
  #print(len(label_train))
  zipped_train = list(zip(row["Train"],label_train))
  zipped_test = list(zip(row["Test"],label_test))
  train.append(zipped_train)
  test.append(zipped_test)

#flatten the list using list comprehension then shuffle
train_shuffled = [item for sublist in train for item in sublist]
random.shuffle(train_shuffled)

test_shuffled = [item for sublist in test for item in sublist]
random.shuffle(test_shuffled)

df_multi_data_train = pd.DataFrame(train_shuffled, columns=['Train Data','Train Label'])
df_multi_data_test  = pd.DataFrame(test_shuffled,  columns=['Test Data' ,'Test Label'])

In [None]:
# ############################################
# Save to Google Drive
# ############################################
df_multi_data_train.to_parquet("gut_corpus_train_data_multi.parquet")
# No validation split is created in this notebook, so the valid file is not written
#df_multi_data_valid.to_parquet("gut_corpus_valid_data_multi.parquet")
df_multi_data_test.to_parquet("gut_corpus_test_data_multi.parquet")

#!mv "gut_corpus_train_data_multi.parquet" "/content/drive/My Drive/w266/gut_corpus_train_data_multi.parquet"
#!mv "gut_corpus_valid_data_multi.parquet" "/content/drive/My Drive/w266/gut_corpus_valid_data_multi.parquet"
#!mv "gut_corpus_test_data_multi.parquet" "/content/drive/My Drive/w266/gut_corpus_test_data_multi.parquet"
!mv "gut_corpus_train_data_multi.parquet" "/content/drive/My Drive/w266/data/gut_corpus_train_data_multi.parquet"
#!mv "gut_corpus_valid_data_multi.parquet" "/content/drive/My Drive/w266/data/gut_corpus_valid_data_multi.parquet"
!mv "gut_corpus_test_data_multi.parquet" "/content/drive/My Drive/w266/data/gut_corpus_test_data_multi.parquet"


In [186]:
# ############################################
# Save to local drive
# ############################################
data_file_train = 'train_case8.parquet'
#data_file_valid = 'datatest_valid.parquet'
data_file_test  = 'test_case8.parquet'
data_path = 'D:/MIDS/W266/Project/Data/'
df_multi_data_train.to_parquet(data_path+data_file_train)
#df_multi_data_valid.to_parquet(data_path+data_file_valid)
df_multi_data_test.to_parquet(data_path+data_file_test)

# PREPARE MULTICLASS BALANCED DATASETS
## requires portions of previous sections

In [391]:
# ##########################################################################################################
# Balance data sets
#
# we're going to sort the tuple by the second value which is an integer indicating author
# then from the sorted data (which should be shuffled in terms of order of sentences from any given novel)
# select a max number of sentences not greater than the smallest novel
# Then reshuffle and store the training data. Test data can remain at the larger size.
# ##########################################################################################################

# Find smallest group count, take percentage factor of that amount
#MIN_GROUP_COUNT_TRAIN = np.minimum(int(np.min(df_books['Group Counts'])*train_split),100)
sample_factor = 0.2
MIN_GROUP_COUNT_TRAIN = int(sample_factor * np.min(df_books['Group Counts'])*train_split)
#MIN_GROUP_COUNT_VALID = int(sample_factor * np.min(df_books['Group Counts'])*valid_split)  # valid_split is not defined in this notebook
NUM_OF_LABELS = 7

# Sort by label
sorted_train = sorted(train_shuffled, key=lambda x: x[1])
#sorted_valid = sorted(valid_shuffled, key=lambda x: x[1])

train_balanced = []
#valid_balanced = []

for indx in range(NUM_OF_LABELS):
  train_balanced.extend([item for item in sorted_train if item[1] == indx][:MIN_GROUP_COUNT_TRAIN])
  #valid_balanced.extend([item for item in sorted_valid if item[1] == indx][:MIN_GROUP_COUNT_VALID])

# shuffle the balanced (text, label) pairs so the labels are interleaved
random.shuffle(train_balanced)
#random.shuffle(valid_balanced)
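
The balancing step above keeps the first `MIN_GROUP_COUNT_TRAIN` examples of each label and then reshuffles. The same idea can be sketched in a single pass over the data, without sorting or re-filtering the list once per label. `balance_by_label` is a hypothetical helper, the seed is an arbitrary assumption, and the input is toy data.

```python
import random
from collections import defaultdict


def balance_by_label(examples, cap_per_label, seed=0):
    """Keep at most cap_per_label (text, label) pairs per label, then shuffle."""
    kept = defaultdict(list)
    for text, label in examples:
        # Take examples in input order until this label reaches the cap.
        if len(kept[label]) < cap_per_label:
            kept[label].append((text, label))
    balanced = [pair for label in sorted(kept) for pair in kept[label]]
    random.Random(seed).shuffle(balanced)
    return balanced


demo = [("a", 0), ("b", 0), ("c", 0), ("d", 1), ("e", 1), ("f", 2)]
print(sorted(balance_by_label(demo, cap_per_label=2)))
```

A label with fewer examples than the cap keeps all of them, so the result is only fully balanced when the cap does not exceed the smallest class, which is why the cell above derives the cap from the minimum group count.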

In [None]:
# ############################################
# Save to remote Drive
# ############################################
df_multi_data_train = pd.DataFrame(train_balanced, columns=['Train Data','Train Label'])
#df_multi_data_valid = pd.DataFrame(valid_balanced, columns=['Valid Data','Valid Label'])
df_multi_data_test  = pd.DataFrame(test_shuffled,  columns=['Test Data' ,'Test Label'])

df_multi_data_train.to_parquet("gut_corpus_train_data_multi_bal.parquet")
#df_multi_data_valid.to_parquet("gut_corpus_valid_data_multi_bal.parquet")
df_multi_data_test.to_parquet("gut_corpus_test_data_multi_bal.parquet")

!mv "gut_corpus_train_data_multi_bal.parquet" "/content/drive/My Drive/w266/data/gut_corpus_train_data_multi_bal.parquet"
#!mv "gut_corpus_valid_data_multi_bal.parquet" "/content/drive/My Drive/w266/data/gut_corpus_valid_data_multi_bal.parquet"
!mv "gut_corpus_test_data_multi_bal.parquet" "/content/drive/My Drive/w266/data/gut_corpus_test_data_multi_bal.parquet"

In [393]:
# ############################################
# Save to local drive
# ############################################
df_multi_data_train = pd.DataFrame(train_balanced, columns=['Train Data','Train Label'])
#df_multi_data_valid = pd.DataFrame(valid_balanced, columns=['Valid Data','Valid Label'])
df_multi_data_test  = pd.DataFrame(test_shuffled,  columns=['Test Data' ,'Test Label'])

data_path = 'D:/MIDS/W266/Project/Data/'
data_file_train = 'train_bal_case9.parquet'
#data_file_valid = 'datatest_valid_bal.parquet'
data_file_test  = 'test_case9.parquet'

df_multi_data_train.to_parquet(data_path+data_file_train)
#df_multi_data_valid.to_parquet(data_path+data_file_valid)
df_multi_data_test.to_parquet(data_path+data_file_test)

In [401]:
# #########################################
# DEVELOP spaCy MODELS
# #########################################
!pip install spacy


Defaulting to user installation because normal site-packages is not writeable
Collecting spacy
  Downloading spacy-3.7.2-cp39-cp39-win_amd64.whl (12.2 MB)
Collecting srsly<3.0.0,>=2.4.3
  Downloading srsly-2.4.8-cp39-cp39-win_amd64.whl (483 kB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting wasabi<1.2.0,>=0.9.1
  Downloading wasabi-1.1.2-py3-none-any.whl (27 kB)
Collecting thinc<8.3.0,>=8.1.8
  Downloading thinc-8.2.1-cp39-cp39-win_amd64.whl (1.5 MB)
Collecting spacy-legacy<3.1.0,>=3.0.11
  Downloading spacy_legacy-3.0.12-py2.py3-none-any.whl (29 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.8-cp39-cp39-win_amd64.whl (39 kB)
Collecting typer<0.10.0,>=0.3.0
  Downloading typer-0.9.0-py3-none-any.whl (45 kB)
Collecting smart-open<7.0.0,>=5.2.1
  Downloading smart_open-6.4.0-py3-none-any.whl (57 kB)
Collecting catalogue<2.1.0,>=2.0.6
  Downloading catalogue-2.0.10-py3-none-any.whl (17 kB)
Collecting spacy-loggers<2.0.0,>=



In [404]:
# THE FOLLOWING spacy MODULES ARE FOR UNIT TESTING ONLY
# SEE ...NER version of this notebook for spacy processing

!python -m spacy download en_core_web_sm

import spacy

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [402]:
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
r_sub = r.text[0:1000]

In [405]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(r_sub)

In [409]:
#print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
#print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Project Gutenberg eBook PERSON
The Great Gatsby
    
 WORK_OF_ART
the United States GPE
the United States GPE
eBook PRODUCT
Title: The Great Gatsby


Author WORK_OF_ART
F. Scott Fitzgerald

 PERSON
January 17, 2021 DATE
eBook #64317 LAW
English LANGUAGE
The Great Gatsby WORK_OF_ART
F. Scott Fitzgerald


                            PERSON


In [410]:
doc_without_names = ' '.join(['PERSON' if entity.label_ == 'PERSON' else entity.text for entity in doc.ents])


In [411]:
doc_without_names

'PERSON The Great Gatsby\r\n    \r\n the United States the United States eBook Title: The Great Gatsby\r\n\r\n\r\nAuthor PERSON January 17, 2021 eBook #64317 English The Great Gatsby PERSON'
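
The join in the cell above keeps only the entity strings, so the surrounding text is lost. If the goal is to mask names while keeping the rest of the passage, the entity character offsets can be used instead. The sketch below is plain Python over `(start, end, label)` tuples; with spaCy these could be built from `doc.ents` using each entity's `start_char`, `end_char` and `label_`. `mask_entities` is a hypothetical helper and the spans here are constructed by hand rather than produced by a model.

```python
def mask_entities(text, spans, target_label="PERSON"):
    """Replace each span labeled target_label with the label, keeping other text.

    spans is a list of (start_char, end_char, label) tuples.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label != target_label or start < cursor:
            continue  # skip other entity types and overlapping spans
        pieces.append(text[cursor:start])
        pieces.append(target_label)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Author: F. Scott Fitzgerald wrote The Great Gatsby."
name, title = "F. Scott Fitzgerald", "The Great Gatsby"
spans = [
    (text.index(name), text.index(name) + len(name), "PERSON"),
    (text.index(title), text.index(title) + len(title), "WORK_OF_ART"),
]
print(mask_entities(text, spans))
```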