## Data Preparation

### Determining Sample Size

In Section 3.1 of Hu, Wang, and Wu's paper on applying stylometric analysis to Dream of the Red Chamber, the first 60 chapters are randomly split into 80 samples. A Google search on the character count of the novel shows that the whole novel has 788451 characters, with the first 80 chapters taking up 607342. First, assuming roughly uniform chapter lengths across the novel's 120 chapters, the paper's approach would be akin to splitting the whole novel into 160 samples, or about 4928 characters per sample. However, the search result shows that chapter lengths are not uniform: the first 80 chapters, two thirds of the chapters, take up 77% of the character count. Still using the same approach of splitting $x$ chapters into $\frac{4}{3}x$ samples, we get about 4270 characters per sample. The sample size is only meant as a rough range, since we do not want to cut novels off mid-paragraph to create exact-size samples. Based on these results, 4500 seems like an acceptable cutoff.
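As a quick check of the uniform-length estimate, the arithmetic can be reproduced in a few lines. This is only a sketch: the variable names are made up for illustration, and the 120-chapter total is an assumption implied by the 160-sample figure.

```python
# Character counts quoted in the text
total_chars = 788451     # whole novel
first_80_chars = 607342  # first 80 chapters

# Assumption: the novel has 120 chapters (implied by the 160-sample figure)
total_chapters = 120

# The paper splits 60 chapters into 80 samples, i.e. 4/3 samples per chapter
whole_novel_samples = total_chapters * 80 // 60       # 160
uniform_sample_size = total_chars / whole_novel_samples

# Share of the character count taken up by the first 80 chapters
first_80_share = first_80_chars / total_chars

print(whole_novel_samples)           # 160
print(round(uniform_sample_size))    # 4928
print(round(first_80_share * 100))   # 77
```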

### Segmentation Method

One way to segment the samples would be to write samples as the works are being parsed, cutting off each sample when a character counter indicates the cutoff has been reached (taking care not to cut in the middle of a paragraph). However, this may leave a small amount of text at the end that is too small to constitute a sample.

Since we want to preserve whole paragraphs at all costs (paragraph attributes may be important), this limits the use of more dynamic ways to segment the samples and avoid undersized samples. Two remediation measures are proposed that, while not perfect, may help prevent or "beef up" undersized samples.

1. Instead of segmenting samples while going through an author's various works, we will first consolidate all of their works and tally up the total character count $(X)$. Then, using the preliminary cutoff $(n)$, set to 4500 earlier, we determine how many samples this would produce. After rounding this preliminary number of samples, we divide the total character count by it to get the real cutoff to use for this author's works. **Appendix A** demonstrates that this measure performs extraordinarily well in allowing the final sample to be either large enough to stand on its own or small enough to merge into the previous sample. The rounding function we will use, denoted $r$, is NumPy's `around` function, which follows standard rounding except that floats ending in ".5" are rounded to the nearest even number. In summary, the real cutoff $(RC)$ will be determined by:

$$
RC = r\left(\frac{X}{r(X/n)}\right)
$$

2. A constant $k \in (0,1)$ will be chosen to work in conjunction with measure 1. If the leftover text's character count exceeds $kn$, it becomes a standalone sample; otherwise, it is merged into the previous sample. For this exercise, we arbitrarily choose $k$ to be $\frac{1}{2}$, which **Appendix A** demonstrates is sufficient.
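The two measures can be illustrated with a small worked sketch. The function names `real_cutoff` and `leftover_is_standalone` are hypothetical, and Python's built-in `round` is used here as a stand-in for NumPy's `around` (both round halves to the nearest even number), so this is an illustration under those assumptions rather than the sampler itself.

```python
def real_cutoff(total_chars, prelim_cutoff=4500):
    """RC = r(X / r(X / n)), with the built-in round standing in for r."""
    n_samples = round(total_chars / prelim_cutoff)
    if n_samples == 0:
        # Assumption: texts shorter than half a sample are not segmented
        raise ValueError("text too short to form a single sample")
    return round(total_chars / n_samples)


def leftover_is_standalone(leftover_chars, prelim_cutoff=4500, k=0.5):
    """Measure 2: keep the leftover as its own sample only if it exceeds k*n."""
    return leftover_chars > k * prelim_cutoff


# 100000 / 4500 = 22.2, which rounds to 22 samples of about 4545 characters
print(real_cutoff(100000))           # 4545

# 11250 / 4500 = 2.5, which rounds half to even, giving 2 samples
print(real_cutoff(11250))            # 5625

print(leftover_is_standalone(3000))  # True, since 3000 > 2250
print(leftover_is_standalone(2000))  # False, so it would be merged
```

For a made-up total of 100000 characters, the adjusted cutoff is slightly above 4500, so the text divides into 22 samples of nearly equal size instead of leaving a short remainder.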

### Sample Format and Additional Requirements

In addition to the size of each sample being fixed to some range, the sampler will also need to complete or account for the following: 

- Samples cannot end in the middle of paragraphs
- Exactly 1 paragraph per line
- Exactly 1 whitespace character (a newline) between paragraphs
- Remove all heading and date lines

### Taking a Look at the Data

After looking through the novels downloaded from xstt5.com, we found that the following traits needed to be handled to comply with the listed requirements:

- Text at the start and end of every novel about the site itself (All of these lines include variations of "txt", such as "TXT" and "t/xt". Since these characters are highly unlikely to show up in a 20th century Chinese novel, the sampler will use the characters "T" and "t" to recognize these lines)
- Certain lines were used for section titles and dates, all of which are unnecessary (These lines will be identified by the fact that they contain no punctuation apart from guillemets, brackets, and half commas)
- For some novels, each paragraph of the actual novel content starts with 2-4 spaces, which need to be removed
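The three rules above can be tried on a few made-up lines before running the real sampler. In this sketch the function name `clean_lines` and the toy lines are hypothetical; the punctuation list mirrors the `valid_punc` list used later in this section.

```python
# Punctuation marking a line as real paragraph content
VALID_PUNC = ["。", "！", "？", "“"]


def clean_lines(lines):
    """Apply the three cleaning rules to a list of raw lines."""
    kept = []
    for line in lines:
        # Rule 1: site boilerplate always contains "T" or "t"
        if "T" in line or "t" in line:
            continue
        # Rule 2: headings and dates contain no sentence punctuation
        if not any(p in line for p in VALID_PUNC):
            continue
        # Rule 3: strip leading spaces (and any other whitespace)
        text = "".join(line.split())
        if text:
            kept.append(text)
    return kept


# Made-up lines for illustration only
toy_lines = [
    "本书TXT版本由网站提供。\n",  # site boilerplate
    "第一章\n",                    # section title
    "    他走了。\n",              # indented paragraph
    "“你好！”她说。\n",            # ordinary paragraph
]

cleaned = clean_lines(toy_lines)
print(cleaned)
```

Only the last two lines survive, with the indentation and trailing newlines removed.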

### Creating Samples

In [1]:
import os
import os.path as op
import numpy as np
from tabulate import tabulate

In [2]:
# get paths
dpath = os.getcwd() # path to project directory

wdname = "/works" # directory containing works
wdpath = dpath + wdname # path to works directory 

sdname = "/samples" # directory to store samples
sdpath = dpath + sdname # path to samples directory

# get work names
# sorted so that each author's works are adjacent (os.listdir order is arbitrary)
works = sorted(w for w in os.listdir(wdpath) if op.isfile(op.join(wdpath, w)))

In [3]:
# constants and valid punctuation

n = 4500
k = 1/2
valid_punc = ['。','！','？','“'] # chinese punctuation indicating paragraph validity

In [4]:
# Helper functions

def contains_valid_punc(para):
    """
    Checks para for any valid punctuation from valid_punc
    
    Parameter:
        para: string of one paragraph
    Returns:
        True if para contains at least one valid punctuation char
    """
    
    for p in valid_punc:
        if p in para:
            return True
    return False

def clean_paragraphs(paras):
    """
    Cleans paras by removing invalid paragraphs and whitespace in paragraphs
    
    Parameter:
        paras: list of paragraph strings
        
    Returns:
        paras: list of cleaned paragraph strings
    """
    
    paras = list(filter(contains_valid_punc, paras))
    paras = [p for p in paras if "T" not in p and "t" not in p]
    paras = [''.join(p.split()) for p in paras] # remove all whitespace
    paras = [p for p in paras if p] # remove all now empty strings
    return paras

def real_cut(x, init_samsize):
    """
    Calculates better sample size to use for segmenting samples
    
    Parameters:
        x: character count of string to be segmented
        init_samsize: initial sample size to be modified
        
    Returns:
        Better sample size
    """
    
    return np.around(x/np.around(x/init_samsize))

def divide_samples(mass, au):
    """
    Divides mass into samples and stores each sample into samples directory
    
    Parameters:
        mass: list of paragraph strings
        au: string of author name
        
    Returns:
        rc: rough sample size used as calculated using real_cut
        sc: number of samples created
    """
    
    x = sum(len(p) for p in mass)
    rc = real_cut(x, n)
    cc = 0 # sample character counter
    sc = 0 # author sample counter
    cur = [] # holds sample until dump
    
    if len(mass) <= 1:
        raise ValueError("not enough paragraphs to sample for " + au)

    def write_sample(paras, idx, mode):
        # write one paragraph per line to sample number idx
        sam = sdpath + '/' + "{0}-{1}.txt".format(au, str(idx))
        with open(sam, mode, encoding="utf8") as f:
            for para in paras:
                f.write("{0}\n".format(para))

    for p in mass:
        cur.append(p)
        cc += len(p)
        if cc >= rc: # cut at the adjusted cutoff from real_cut
            sc += 1
            write_sample(cur, sc, 'w')
            cc = 0
            cur = []

    # apply remediation to whatever is left after the last full sample
    if cur:
        if cc < k*n and sc > 0:
            write_sample(cur, sc, 'a') # too small: merge into previous sample
        else:
            sc += 1
            write_sample(cur, sc, 'w') # large enough to stand alone
    return [rc, sc]

In [5]:
# works are arranged by author in the directory

allres = [] # to hold all sampling metric results
mass = [] # to hold all an author's works
curau = works[0].split('-')[0] # current author whose works mass holds

for w in works:
    au = w.split('-')[0] # get potentially new author name
    if au != curau:
        res = divide_samples(mass, curau)
        allres.append([curau] + res)
        mass = []
        curau = au
    fpath = wdpath + '/' + w
    with open(fpath, "r", encoding="utf8") as f: # close the file after reading
        paras = f.readlines()
    paras = clean_paragraphs(paras)
    mass = mass + paras

res = divide_samples(mass, curau)
allres.append([curau] + res)

### Results Overview

In [6]:
print(tabulate(allres, headers=['Author', 'Rough Sample Size', 'Number of Samples']))

Author      Rough Sample Size    Number of Samples
--------  -------------------  -------------------
余华                     4490                   77
冯骥才                   4589                   19
古龙                     4508                  189
巴金                     4490                  181
张爱玲                   4509                   70
汪曾祺                   4519                   59
沈从文                   4522                   65
王安忆                   4496                   51
矛盾                     4511                   64
老舍                     4510                  199
莫言                     4496                   57
路遥                     4506                  173
金庸                     4501                  723
钱钟书                   4537                   51
陈忠实                   4520                   96
鲁迅                     4539                   51


### Note on Cap for Number of Samples Per Author

After generating the first iteration of samples, it was noted that the large differences in the number of samples between authors would skew features such as frequent words going forward. A **hard cap of 70 samples per author** will therefore be set when features are extracted to form the training and testing sets for the project.

Additionally, because Feng Jicai's (冯骥才) collection of novels is too short to generate at least 50 samples, he will be excluded from the model building. His samples can be used after the models have been built to explore closeness in style.
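One possible way to apply the cap and the 50-sample minimum is sketched below. The text does not say how the 70 samples are chosen, so the random selection, the fixed seed, and the name `select_samples` are all assumptions made for illustration.

```python
import random


def select_samples(samples_by_author, cap=70, minimum=50, seed=0):
    """Drop authors below the minimum and cap the rest at `cap` samples.

    Assumption: capped authors are reduced by random selection with a
    fixed seed; the source does not specify the selection rule.
    """
    rng = random.Random(seed)
    selected = {}
    for author, samples in samples_by_author.items():
        if len(samples) < minimum:
            continue  # too few samples: excluded from model building
        if len(samples) > cap:
            samples = rng.sample(samples, cap)
        selected[author] = list(samples)
    return selected


# Hypothetical sample names, using three sample counts from the table
counts = {"金庸": 723, "冯骥才": 19, "鲁迅": 51}
samples_by_author = {
    au: ["{0}-{1}.txt".format(au, i) for i in range(1, c + 1)]
    for au, c in counts.items()
}

selected = select_samples(samples_by_author)
print({au: len(s) for au, s in selected.items()})
```

With these counts, 金庸 is reduced to 70 samples, 鲁迅 keeps all 51, and 冯骥才 is left out.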