In [11]:
import numpy as np
import pandas as pd 
import matplotlib 
import matplotlib.pyplot as plt


df = pd.read_csv("train.csv")
print(df.head())

        id                                               text author
0  id26305  This process, however, afforded me no means of...    EAP
1  id17569  It never once occurred to me that the fumbling...    HPL
2  id11008  In his left hand was a gold snuff box, from wh...    EAP
3  id27763  How lovely is spring As we looked from Windsor...    MWS
4  id12958  Finding nothing else, not even gold, the Super...    HPL


Writing styles differ in many ways, but one is that some writers are wordier than others, and some use more or less punctuation than others. This seems especially relevant given that the writers being analysed work in several different forms (Poe as a poet, Shelley as a short-story writer, Lovecraft as a novelist). There is definitely crossover between these forms, but I would argue that an author's stylistic habits carry over across their works, so the following functions are designed to assess each text excerpt for wordiness and punctuator usage.

In [37]:
import string


def get_word_counts(text):
    # Counts spaces as a proxy for words; for single-spaced text this is
    # one less than the actual number of words.
    return text.count(" ")


def get_punctuator_counts(text):
    # Map each character in string.punctuation to its number of occurrences.
    return {p: text.count(p) for p in string.punctuation}


def get_punctuator_array(text):
    # 1/0 flags, in string.punctuation order, for whether each punctuator appears.
    return [1 if p in text else 0 for p in string.punctuation]

df["word count"] = df.text.apply(get_word_counts)
df["punctuator_count"] = df.text.apply(get_punctuator_counts)
df["punctuator_array"] = df.text.apply(get_punctuator_array)
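# temp1: total number of punctuation characters in the text.
df["temp1"] = df.punctuator_count.apply(lambda x: sum(x.values()))
# temp2: number of distinct punctuators that appear in the text.
df["temp2"] = df.punctuator_array.apply(sum)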

print(df.head())
print(df.groupby("author", as_index=False)["word count"].mean())
print(df.groupby("author", as_index=False)["temp1"].mean())
print(df.groupby("author", as_index=False)["temp2"].mean())

        id                                               text author  \
0  id26305  This process, however, afforded me no means of...    EAP   
1  id17569  It never once occurred to me that the fumbling...    HPL   
2  id11008  In his left hand was a gold snuff box, from wh...    EAP   
3  id27763  How lovely is spring As we looked from Windsor...    MWS   
4  id12958  Finding nothing else, not even gold, the Super...    HPL   

   word count                                   punctuator_count  \
0          40  {'!': 0, '"': 0, '#': 0, '$': 0, '%': 0, '&': ...   
1          13  {'!': 0, '"': 0, '#': 0, '$': 0, '%': 0, '&': ...   
2          35  {'!': 0, '"': 0, '#': 0, '$': 0, '%': 0, '&': ...   
3          33  {'!': 0, '"': 0, '#': 0, '$': 0, '%': 0, '&': ...   
4          26  {'!': 0, '"': 0, '#': 0, '$': 0, '%': 0, '&': ...   

                                    punctuator_array  \
0  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, ...   
1  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 

The data thus far does not look all that interesting, but it does show that Poe tends to use slightly more punctuators and fewer words (which makes sense for a poet), and that HPL tends to use the least punctuation (somewhat surprising). It also suggests that the ratio of word count to punctuation count might be a good metric to include. Another set of features worth exploring is metrics related to readability. Here, these are looked at briefly with Python's readability package.
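As a rough sketch of the word-count-to-punctuation ratio mentioned above: the function name `words_per_punctuator` is hypothetical, it counts whitespace-separated tokens rather than spaces, and returning NaN for texts with no punctuation is an assumption about how such rows should be handled, not something the analysis above settles.

```python
import math
import string


def words_per_punctuator(text):
    """Return words divided by punctuation characters, or NaN if there are none."""
    n_words = len(text.split())  # whitespace-separated tokens
    n_punct = sum(text.count(p) for p in string.punctuation)
    # Avoid dividing by zero for passages with no punctuation at all.
    return n_words / n_punct if n_punct else math.nan


# Made-up sentences for illustration only (not rows from train.csv).
for sentence in ["One, two, three.", "No punctuation here"]:
    print(repr(sentence), words_per_punctuator(sentence))
```

On the notebook's data this could be applied with `df.text.apply(words_per_punctuator)` and averaged per author with `groupby("author")`; pandas skips NaN values in `mean()` by default, so unpunctuated excerpts would simply drop out of the average.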

In [49]:
import readability


def get_readability_nums(textstr):
    pop_list = ['sentences_per_paragraph', 'type_token_ratio', 'sentences', 'paragraphs']
    res = dict(readability.getmeasures(text=textstr, merge=True))
    for i in pop_list:
        del res[i]
    return res

df["readability"] = df.text.apply(get_readability_nums)
keyslist = list(df.at[0, "readability"].keys())
for i in keyslist:
    # Pull measure i out of each row's dict, then average it per author.
    measure = df.readability.apply(lambda d: d.get(i)).rename(i)
    print(measure.groupby(df.author).mean())