# Doc2Vec Corpus Creation
## Create Text Chunks in <i>The Masses</i>

This notebook sorts and parses two different, though overlapping, datasets, which are then used to create a doc2vec model. The first dataset is structured data from the Modernist Journals Project (MJP), which includes genre information within each issue of a publication. This data was previously parsed using the XML Parser in a separate repository. The second dataset is unstructured data from the <a href="https://repository.library.brown.edu/studio/collections/id_592/">Modernist Journals Project in the Brown University Digital Repository</a>.

This notebook provides three options for data exploration with these different corpora. The first option takes the structured data and chunks texts into strings (currently 300 words each). The resulting dataframe is then joined with the unstructured, web-scraped data. This produces some duplication but increases the number of words in the final, merged corpus and allows me to extract the text-chunks and their similarities to other text-chunks without losing key metadata.

The second option similarly chunks texts into strings but does not merge with the structured data and, therefore, does not have as much bibliographic data.

The third option uses the unstructured data from the web scraper but does not chunk texts. This option can be useful for creating a corpus of issues (in the publication sense).

This particular notebook extracts only text-chunks from the magazine <i>The Masses</i>.
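
To make the chunking rule concrete, here is a minimal sketch in plain Python of what the cells below do to a single text: split it into fixed-size word groups and keep only the full-size ones. The function name `chunk_words`, the `keep_short` flag, and the tiny chunk size are assumptions made for illustration; the notebook itself works with 300-word chunks inside pandas.

```python
def chunk_words(text, size=300, keep_short=False):
    """Split text into consecutive chunks of `size` words.

    By default a final chunk with fewer than `size` words is dropped,
    mirroring the "remove short strings" step in this notebook.
    """
    words = text.split()
    chunks = [' '.join(words[i:i + size]) for i in range(0, len(words), size)]
    if keep_short:
        return chunks
    return [chunk for chunk in chunks if len(chunk.split()) >= size]


# Illustrative input with a chunk size of 3 instead of 300.
sample = "one two three four five six seven"
full_chunks = chunk_words(sample, size=3)
all_chunks = chunk_words(sample, size=3, keep_short=True)
print(full_chunks)
print(all_chunks)
```

One consequence worth keeping in mind: a text shorter than the chunk size contributes no chunks at all, and the remainder at the end of every longer text is discarded.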

In [1]:
import pandas as pd
import numpy as np
import gensim
import re, string
import nltk
# Requires the NLTK "words" corpus; run nltk.download('words') once if it is missing.
englishWords = set(nltk.corpus.words.words())

abs_dir = "/Users/williamquinn/Desktop/DH/Python/MJP/"

## Chunk Data

### Option 1: Parse Structured Data of MJP

In [11]:
%%time

# Load data.
mjp_df = pd.read_csv(abs_dir + 'Output/mjp_documents.txt', sep='\t')[['mjp_id', 'magazine', 'date', 'type', 'text']]

# Subset to include key genres.
# Script will only chunk selected genres and will re-combine with original dataframe.
# The recombination should maintain model's accuracy, 
# which relies on an abundance of text,
# while also modeling chunked data.
mjp_df = mjp_df[mjp_df['type'] \
                .isin(["articles", "letters", "fiction", 
                       "poetry", "drama"])]

# Remove bibliographic information (volume/issue, year) from strings.
mjp_df['text'] = mjp_df['text'].astype(str) \
    .str.lower() \
    .str.strip() \
    .str.replace(r'[^\w\s]','', regex=True) \
    .str.replace(r"pgbrk","", regex=True) \
    .str.replace(r'\.0', '', regex=True) \
    .str.replace(r'vol \w+ no \d+ \w+ \d{4}', '', regex=True) \
    .str.replace(r'\w+ \d{4}', '', regex=True) \
    .str.replace(r'vol \w+ no \d+', '', regex=True) \
    .str.replace(r'\bv ', '', regex=True) \
    .str.replace(r'\bvol ', '', regex=True) \
    .str.replace(r'\bno ', '', regex=True) \
    .str.replace(r'the masses', '', regex=True) \
    .apply(lambda x: ' '.join([item for item in x.split() if item in englishWords]))

mjp_df = mjp_df.rename(columns = {"mjp_id":"mjp_index"})

CPU times: user 11.2 s, sys: 314 ms, total: 11.6 s
Wall time: 11.7 s
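
The effect of the cleaning chain is easier to see on a single string. The sketch below applies a subset of the same patterns with the standard `re` module. The set `toy_vocabulary` is a hypothetical stand-in for the NLTK word list, and both `clean_text` and the sample string are invented for illustration.

```python
import re

# Hypothetical stand-in for the NLTK word list used in the notebook.
toy_vocabulary = {"the", "strike", "began", "in", "spring"}


def clean_text(text, vocabulary):
    """Apply a subset of the notebook's cleaning steps to one string."""
    text = text.lower().strip()
    text = re.sub(r'[^\w\s]', '', text)                   # drop punctuation
    text = re.sub(r'vol \w+ no \d+ \w+ \d{4}', '', text)  # e.g. "vol iv no 3 march 1913"
    text = re.sub(r'\w+ \d{4}', '', text)                 # e.g. "march 1913"
    text = re.sub(r'the masses', '', text)                # running title
    # Keep only tokens found in the vocabulary.
    return ' '.join(word for word in text.split() if word in vocabulary)


raw = "THE MASSES Vol. IV No. 3 March 1913 The strike began in spring."
cleaned = clean_text(raw, toy_vocabulary)
print(cleaned)
```

Because the last step keeps only dictionary words, proper names, archaic spellings, and OCR errors are silently dropped, so it may be worth spot-checking a few cleaned texts against their sources.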


### Option 1 (continued): <br>Incorporate Data with Chunked Texts from <i>The Masses</i>

In [12]:
%%time
    
# Subset dataframe for selected magazine.
# .copy() makes the subset an independent dataframe, so the column assignment
# below does not trigger pandas' SettingWithCopyWarning.
mjp_df_Masses = mjp_df.query('magazine == "the masses"').copy()

# References for splitting a text column into multiple rows:
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
# https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows
# s = df['Seatblocks'].str.split(pat = " ").apply(Series, 1).stack()

# Split text field string into list of 300-word strings.
# The parameter is named `text` to avoid shadowing the imported `string` module.
def splitText(text):
    words = text.split()
    grouped_words = [' '.join(words[i: i + 300]) for i in range(0, len(words), 300)]
    return grouped_words

mjp_df_Masses['text'] = mjp_df_Masses['text'].apply(splitText)

# Unnest list and create row for each list item.
# https://mikulskibartosz.name/how-to-split-a-list-inside-a-dataframe-cell-into-rows-in-pandas-9849d8ff2401
mjp_df_Masses = mjp_df_Masses['text'] \
    .apply(pd.Series) \
    .merge(mjp_df_Masses, right_index = True, left_index = True) \
    .drop(["text"], axis = 1) \
    .melt(id_vars = ['mjp_index', 'magazine', 'date', 'type'], value_name = "text") \
    .drop("variable", axis = 1) \
    .dropna()

# Count words and remove short strings.
mjp_df_Masses['count'] = mjp_df_Masses['text'].str.split().str.len()
mjp_df_Masses = mjp_df_Masses[~(mjp_df_Masses['count'] < 300)]

# Offset the index to create chunk-level IDs.
# Note: this 'mjp_id' column is not kept by the column selection below,
# so the chunk rows retain their original 'mjp_index' values.
mjp_df_Masses['mjp_id'] = mjp_df_Masses.index + 10000

mjp_df_Masses = mjp_df_Masses[["date", "magazine", "mjp_index", "text", "type"]]

# Rejoin dataframes.
mjp_concat_df = pd.concat([mjp_df, mjp_df_Masses], sort = True) \
    .dropna()

mjp_concat_df.to_csv(abs_dir + 'Chapter4-Masses/Masses_Data/d2v/mjp_masses-chunks.csv', sep='\t', index=False)


CPU times: user 2.22 s, sys: 111 ms, total: 2.33 s
Wall time: 2.48 s
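
The unnesting step in the cell above (`apply(pd.Series)` followed by `merge` and `melt`) can also be written with `DataFrame.explode`, which is available in pandas 0.25 and later. The following is a sketch on a toy dataframe, not a drop-in replacement for that cell: the column values, the helper `split_text`, and the chunk size of 2 are invented for illustration.

```python
import pandas as pd


def split_text(text, size):
    """Split text into consecutive strings of `size` words."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


# Invented rows standing in for the structured MJP data.
toy_df = pd.DataFrame({
    'mjp_index': [1, 2],
    'type': ['poetry', 'fiction'],
    'text': ['a b c d e', 'f g'],
})

chunk_size = 2
chunks_df = (
    toy_df
    .assign(text=toy_df['text'].apply(lambda t: split_text(t, chunk_size)))
    .explode('text')           # one row per chunk, metadata repeated
    .reset_index(drop=True)
)

# Count words and keep only full-size chunks.
chunks_df['count'] = chunks_df['text'].str.split().str.len()
chunks_df = chunks_df[chunks_df['count'] >= chunk_size]
print(chunks_df)
```

Unlike the `melt` approach, `explode` keeps the chunks of each text next to one another, so row order differs even though the same chunks are produced.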


### Option 2: Parse MJP Scraped from Web

In [4]:
%%time

mjp_df = pd.read_csv(abs_dir + "Output/doc2vec/mjp_documents-EntireCorpus.csv", 
                     sep=",") \
    .assign(type = "issue")

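# Split text field string into list of 300-word strings.
# The parameter is named `text` to avoid shadowing the imported `string` module.
def splitText(text):
    words = text.split()
    grouped_words = [' '.join(words[i: i + 300]) for i in range(0, len(words), 300)]
    return grouped_words

# .copy() makes the subset an independent dataframe, so the assignment below
# does not raise SettingWithCopyWarning. (That warning was suspected of
# contributing to an error during UMAP: unequal values.)
masses_df = mjp_df.query('magazine == "Masses"').copy()

masses_df['text'] = masses_df['text'].apply(splitText)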

# Unnest list and create row for each list item.
# https://mikulskibartosz.name/how-to-split-a-list-inside-a-dataframe-cell-into-rows-in-pandas-9849d8ff2401
masses_df = masses_df['text'] \
    .apply(pd.Series) \
    .merge(masses_df, right_index = True, left_index = True) \
    .drop(["text"], axis = 1) \
    .melt(id_vars = ['mjp_index', 'magazine', 'date', 'type'], value_name = "text") \
    .drop("variable", axis = 1) \
    .dropna()

# Count words and remove short strings.
masses_df['count'] = masses_df['text'].str.split().str.len()
masses_df = masses_df[~(masses_df['count'] < 300)]

masses_df['mjp_index'] = masses_df['mjp_index'] + 10000

mjp_concat_df = pd.concat([mjp_df, masses_df], sort = True)

mjp_concat_df \
    .dropna() \
    .to_csv(abs_dir + 'Chapter4-Masses/Masses_Data/d2v/mjp_masses-chunks-eC.csv', sep='\t', index=False)

CPU times: user 5.82 s, sys: 953 ms, total: 6.77 s
Wall time: 7.09 s




### Option 3: Merge Entire Corpus with Structured Data

In [17]:
%%time

# Load Data
mjp_df = pd.read_csv(abs_dir + "Output/doc2vec/mjp_documents-EntireCorpus.csv", 
                     sep=",") \
    .assign(type = "issue")

subset_df = pd.read_csv(abs_dir + 'Output/mjp_documents.txt', sep='\t') \
    .query('magazine == "the masses"') \
    .rename(columns = {"mjp_id":"mjp_index"})[['mjp_index', 'magazine', 'date', 'type', 'text']]
    
# Regularize subset to match mjp_df.
subset_df['magazine'] = "Masses"
# Add values to disambiguate structured (>10000) and unstructured data.
subset_df['mjp_index'] = subset_df['mjp_index'] + 10000

# Append dataframes and remove empty text fields.
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0.)
mjp_appended_df = pd.concat([mjp_df, subset_df], ignore_index=True) \
    .dropna(subset=['text'])

mjp_appended_df.to_csv(abs_dir + 'Chapter4-Masses/Masses_Data/d2v/mjp_masses-chunks-eC.csv', 
                       sep='\t', 
                       index=False)

CPU times: user 19.5 s, sys: 14.5 s, total: 34.1 s
Wall time: 41.4 s
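
The `+ 10000` offset only disambiguates the two sources if no identifier in the unstructured data already falls in the offset range. A quick check along the following lines can confirm that before merging. This is a sketch under that assumption: `find_id_collisions` is a hypothetical helper and the ID lists are invented; in the notebook they would come from the `mjp_index` columns of the two dataframes.

```python
def find_id_collisions(base_ids, other_ids, offset=10000):
    """Return offset IDs from other_ids that already occur in base_ids."""
    base = set(base_ids)
    return sorted(i + offset for i in other_ids if i + offset in base)


unstructured_ids = [1, 2, 3, 10002]  # invented values
structured_ids = [1, 2, 5]           # invented values

collisions = find_id_collisions(unstructured_ids, structured_ids)
print(collisions)
```

An empty list means the offset is safe for the data at hand; any returned value identifies two rows that would end up with the same identifier.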


## Train Model on Masses Chunks

In [19]:
%%time

mjp_df = pd.read_csv(abs_dir + 'Chapter4-Masses/Masses_Data/d2v/mjp_masses-chunks-eC.csv', sep='\t')

# Create Corpus
tagged_docs = mjp_df.apply(
    lambda x: gensim.models.doc2vec.TaggedDocument(
        gensim.utils.simple_preprocess(x.text),
        ['doc{}'.format(x.mjp_index)]),
    axis=1)

training_corpus = tagged_docs.values

# Training.
model = gensim.models.doc2vec.Doc2Vec(vector_size=300, min_count=4, epochs=20)

model.build_vocab(training_corpus)

model.train(training_corpus, total_examples = model.corpus_count, epochs = model.epochs)

# Store Model.
model.save(abs_dir + "Chapter4-Masses/Masses_Data/d2v/mjp_masses-chunks_d2v.bin")



CPU times: user 44min 35s, sys: 2min 9s, total: 46min 45s
Wall time: 20min 51s
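
Doc2Vec learns one vector per distinct tag, so rows that share an `mjp_index` also share a single document vector. If the intention is one vector per chunk, it may help to check for repeated tags before training. The sketch below uses only the standard library; `repeated_tags` is a hypothetical helper and the sample IDs are invented.

```python
from collections import Counter


def repeated_tags(ids):
    """Return {tag: count} for tags that would be shared by several rows."""
    counts = Counter('doc{}'.format(i) for i in ids)
    return {tag: n for tag, n in counts.items() if n > 1}


sample_ids = [10001, 10001, 10001, 10002, 7]  # invented values
shared = repeated_tags(sample_ids)
print(shared)
```

In the notebook this could be called as `repeated_tags(mjp_df['mjp_index'])`. A non-empty result is not necessarily a problem, but it does mean the model treats those rows as one document.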
