## Initial Feature Extraction

### Proposed Initial Set of Features

The study of *Dream of the Red Chamber* by Hu, Wang, and Wu (hereafter referred to in project documents as "the paper") selected the following set of features based on the authors' experience:

- frequency of $a$ characters
- frequency of $b$ words (defined as a combination of characters)
- mean of sentence length
- variance of sentence length
- frequency of direct speech
- frequency of exclamations
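To make the sentence-level features concrete, here is a minimal sketch of how the mean and variance of sentence length and the exclamation frequency might be computed. The function names, the choice of `。！？` as sentence terminators, and the use of population variance are all assumptions made for illustration; the paper's exact definitions may differ.

```python
import re
from statistics import mean, pvariance

def sentence_length_stats(text):
    """Return (mean, variance) of sentence lengths, measured in characters."""
    # Assumption: a sentence ends at one of the full-width terminators 。！？
    sentences = [s for s in re.split(r"[。！？]", text) if s.strip()]
    lengths = [len(s) for s in sentences]
    if not lengths:
        return 0.0, 0.0
    # Assumption: population variance; a sample variance would also be defensible
    return mean(lengths), pvariance(lengths)

def exclamation_frequency(text):
    """Share of characters in text that are full-width exclamation marks."""
    return text.count("！") / len(text) if text else 0.0

demo_text = "他来了。她笑了！你呢？"
sent_mean, sent_var = sentence_length_stats(demo_text)
excl_freq = exclamation_frequency(demo_text)
```

Dividing counts by the sample length keeps the frequencies comparable across samples of different sizes.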

The paper stresses that the characters should be *content independent*, i.e. *function words/characters*. This is achieved through human judgement: the program suggests frequent characters, and a person selects from them. The same goes for words.

Beyond the $a+b+4$ features above, this project will include two more initial features:

- mean of paragraph length
- variance of paragraph length
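The two paragraph-level features could be computed along the following lines. This is only a sketch: it assumes that paragraphs are separated by newline characters and that length is measured in characters, and `paragraph_length_stats` is a hypothetical helper name.

```python
from statistics import mean, pvariance

def paragraph_length_stats(text):
    """Return (mean, variance) of paragraph lengths, measured in characters."""
    # Assumption: a newline ends a paragraph; blank lines are ignored
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    lengths = [len(p) for p in paragraphs]
    if not lengths:
        return 0.0, 0.0
    return mean(lengths), pvariance(lengths)

demo_paragraphs = "第一段。\n\n第二段比较长一些。\n"
para_mean, para_var = paragraph_length_stats(demo_paragraphs)
```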

Ultimately, a subset of the most important features will be determined for each author.

### Determining Function Characters

In [1]:
import os
import os.path as op
import pandas as pd
from collections import Counter
import random

In [2]:
# constants
a_pool = 500
b_pool = 300
train_size = 45 # see "Constructing the Initial DataFrames" below
rdmseed = 101

In [3]:
# Need to reconsolidate samples into a mass blob to determine function chars/words
# paths
dpath = os.getcwd() # path to project directory
sdname = "/samples" # directory to store samples
sdpath = dpath + sdname # path to samples directory

# get sample names
samples = [s for s in os.listdir(sdpath) if op.isfile(op.join(sdpath, s))]

In [4]:
remass = ''
for s in samples:
    fpath = op.join(sdpath, s)
    # use a context manager so that each file is closed after reading
    with open(fpath, "r", encoding="utf8") as f:
        remass = remass + f.read()
print(len(remass))

4484896


In [5]:
c = Counter(remass)
c.most_common(a_pool)

[('，', 260826),
 ('的', 138928),
 ('。', 120238),
 ('一', 78925),
 ('了', 78315),
 ('不', 65143),
 ('是', 59423),
 ('他', 57639),
 ('\n', 49136),
 ('人', 41030),
 ('我', 40109),
 ('在', 38301),
 ('来', 37384),
 ('这', 35855),
 ('有', 35736),
 ('“', 32507),
 ('”', 32169),
 ('上', 31500),
 ('着', 30058),
 ('子', 29949),
 ('说', 29578),
 ('：', 29531),
 ('你', 29224),
 ('个', 28917),
 ('到', 25196),
 ('里', 25180),
 ('她', 24381),
 ('道', 24152),
 ('就', 23819),
 ('也', 23797),
 ('去', 23081),
 ('那', 21484),
 ('大', 21207),
 ('！', 20925),
 ('得', 20758),
 ('们', 19942),
 ('地', 19749),
 ('下', 17619),
 ('出', 17557),
 ('？', 17458),
 ('"', 17246),
 ('么', 17044),
 ('看', 16534),
 ('要', 15807),
 ('时', 15704),
 ('家', 15557),
 ('过', 15326),
 ('好', 14384),
 ('没', 14330),
 ('还', 14215),
 ('小', 14018),
 ('都', 13668),
 ('起', 13363),
 ('天', 13223),
 ('生', 13182),
 ('又', 12811),
 ('然', 12608),
 ('头', 12579),
 ('和', 12316),
 ('心', 12255),
 ('自', 12185),
 ('只', 12127),
 ('中', 11873),
 ('可', 11695),
 ('为', 11513),
 ('把', 11267),
 ('后',

The paper defines function words as "a class of words that in general have little context meaning, but instead serve to express grammatical relationships with other words within a sentence", and function characters likewise. In that study, 144 function characters were ultimately chosen.

Looking at our results (excluding punctuation and the newline character), most characters fit the definition comfortably. There are exceptions, such as “头”, which usually means "head" but can also support direction words, as in “里头”. Since the features will be narrowed down later and characters can be used quite flexibly, we take a lenient approach here and include characters with a count of at least 7000, going out of our way to exclude only “三” and “年”.
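The count threshold and the exclusions can also be expressed in code to produce a candidate list for manual review. The sketch below is an illustration rather than the procedure actually used in this notebook: `candidate_function_chars` is a hypothetical helper, the toy counts are made up, and `str.isalpha()` is assumed to be an adequate filter for punctuation and whitespace.

```python
from collections import Counter

def candidate_function_chars(counts, min_count, exclude=""):
    """Keep characters at or above min_count, dropping punctuation,
    whitespace, and anything listed in exclude."""
    return "".join(
        ch
        for ch, n in counts.most_common()
        if n >= min_count and ch.isalpha() and ch not in exclude
    )

# Toy counts for illustration only; the real counts come from Counter(remass)
toy_counts = Counter({"，": 50, "的": 40, "\n": 30, "三": 20, "了": 15, "头": 5})
picked = candidate_function_chars(toy_counts, min_count=10, exclude="三年")
```

The result is only a list of candidates; deciding whether each character is content independent still takes human judgement.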

In [6]:
function_chars = "的一了不是他人我在这有上着子说你个到里她道就也去那大得们地下出么看要时家过好没还"\
"小都起天生又然头和心自只中可为把后走想会手事老声面样两见儿什以笑回话多太而眼对点知已白能开前很气己身些长无觉给先"
len(function_chars)

96

### Determining Function Words

This step will be more complicated than the previous one, as we now turn to look for recurring combinations of characters.

In [7]:
# TODO: Promising approach: 
# https://stackoverflow.com/questions/37499968/finding-all-repeated-substrings-in-a-string-and-how-often-they-appear
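As a starting point, recurring character combinations can be counted with a sliding window over the text. The sketch below is a plain n-gram count, not necessarily the approach from the linked thread; `count_ngrams` is a hypothetical helper, and skipping any window that contains punctuation or whitespace is an assumption.

```python
from collections import Counter

def count_ngrams(text, n):
    """Count every run of n consecutive characters that contains
    no punctuation or whitespace."""
    counts = Counter()
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Assumption: a candidate word never spans punctuation or whitespace
        if all(ch.isalpha() for ch in gram):
            counts[gram] += 1
    return counts

toy_text = "我们知道，他们不知道。"
bigrams = count_ngrams(toy_text, 2)
top_bigram = bigrams.most_common(1)
```

Frequent n-grams found this way would still need manual screening, since many of them are accidental adjacent pairs rather than words.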

### Constructing the Initial DataFrames

We can now construct the training and testing DataFrames and populate them with extracted features afterwards. The paper aims to keep the portion of the data set aside for testing in the 20%-30% range. Setting aside 20 samples out of 70 would make sense, but this is complicated by the fact that 7 of the 15 authors don't have the full 70 samples; 鲁迅, 钱钟书, and 王安忆 each have 51. Ultimately, we will take 45 samples from each author for training, opting for balance in the training set over balance in the testing set.

In [8]:
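random.seed(rdmseed)
# draw the training IDs from 1-51 (51 is the sample count of 鲁迅, 钱钟书, and 王安忆)
train_index = random.sample(range(1,52), train_size)
# every remaining ID up to 70 goes to the testing set
test_index = list(set(range(1,71)) - set(train_index))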
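The effect of this split on the held-out share can be checked with a little arithmetic. The sketch below assumes that each author's sample IDs run contiguously from 1 up to that author's sample count; `held_out_share` is a hypothetical helper.

```python
def held_out_share(n_samples, n_train=45):
    """Return (number of test samples, test share) for one author."""
    # Assumption: IDs run from 1 to n_samples, so all n_train training IDs exist
    n_test = n_samples - n_train
    return n_test, n_test / n_samples

for n in (70, 51):
    n_test, share = held_out_share(n)
    print(f"{n} samples: {n_test} held out ({share:.1%})")
```

Under that assumption, an author with all 70 samples has 25 held out (about 36%), while an author with 51 samples has 6 held out (about 12%), so the testing set is the unbalanced side of the split.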

In [9]:
train_list = []
test_list = []

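for s in samples:
    # split the file name on '-' into author and ID, dropping the extension
    s_au = s.split('-')[0]
    s_id = s.split('-')[1].split('.')[0]
    # samples whose ID was not drawn for training go to the testing set
    if int(s_id) in test_index:
        test_list.append([s_au, s_id])
    else:
        train_list.append([s_au, s_id])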

train_init_df = pd.DataFrame(train_list, columns = ['Author', 'ID'])
test_init_df = pd.DataFrame(test_list, columns = ['Author', 'ID'])
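A quick sanity check on the balance of the split is to count rows per author. The sketch below works on a list of `[author, id]` pairs shaped like `train_list`; the toy rows and the helper name `samples_per_author` are made up for illustration.

```python
from collections import Counter

def samples_per_author(rows):
    """Count how many [author, id] rows belong to each author."""
    return Counter(author for author, _ in rows)

# Toy rows for illustration only; in the notebook this would be train_list
toy_train_list = [["鲁迅", "3"], ["鲁迅", "7"], ["钱钟书", "3"]]
author_counts = samples_per_author(toy_train_list)
# The set is balanced when every author has the same number of samples
is_balanced = len(set(author_counts.values())) == 1
```

Applied to the real training list, each author would be expected to show 45 samples if the split worked as intended.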

In [None]:
train_init_df.head()

### Extracting Features

In [None]:
# functions that work for both the training and testing sets

def get_freq_fc(s):
    """
    Gets frequencies in s of each function character
    
    Parameters:
        s: sample string
    
    Returns:
        freq_fc: dict with key:value as function char:frequency 
    """