# Predicting literary success through writing style
## The Problem
Publishers and acquisition editors have a history of fancying themselves stylemakers, going with the gut, chasing a feeling when choosing what to publish. And as a result, publishers have followed the business model of operating in the red, while crossing fingers for another Dan Brown or E.L. James to pull them into the black with a runaway hit. When presented with the idea of doing market research or digging into the data, I've heard publishers dig in their heels and repeat the argument that success in the arts is unpredictable.

As the market for big publishing houses shrinks and streamlined self-publishing competitors enter the field, this business model is proving increasingly less viable. In this shifting landscape, the big houses are finally beginning to turn to data, which raises the question: can the success of a novel be predicted?

Clearly, many variables of success are out of predictive bounds -- storyline novelty, fame of the author, social trends, world events. Here, we will focus solely on the structure of the writing style and whether stylistic structure can be used to predict success.



## The Data
The data creation pipeline can be found [here](https://github.com/bishopkd/DSI-SF2-bishopkd/blob/master/projects/capstone/reports/data_creation.ipynb).
### Data collection

For the purposes of this project, I compiled the epub files of XX best-selling titles and XX titles that were less successful. 

Titles were selected from the top two selling genres, Romance and Science-Fiction, and within these genres, restricted to the sub-genres contemporary and space-opera, respectively.

The 2015 sales report from Nielsen BookScan, considered the industry standard, was used to select and categorize titles. 

  * **Best-sellers** were among the top 100 selling titles for their genre in 2015. From this group, I selected titles that have also won awards (the Hugo for the sci-fi titles and the RITA for romance), making them both popular and critical successes.

  * Less-successful titles (affectionately termed **Flops**) were published in the years 2013-2015 and sold fewer than 1000 copies in 2015.
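
The two selection rules above can be written as a small labeling function. This is a minimal sketch, not the code used for the project: the `Title` class and all of its field names are hypothetical and do not correspond to actual Nielsen BookScan columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Title:
    # Hypothetical fields, invented for illustration only
    genre_rank_2015: Optional[int]  # sales rank within genre in 2015, None if unranked
    copies_sold_2015: int           # copies sold during 2015
    year_published: int
    won_genre_award: bool           # Hugo for sci-fi, RITA for romance


def label_title(t: Title) -> Optional[str]:
    """Return 'best_seller', 'flop', or None if the title fits neither group."""
    # Best-seller: top 100 in its genre in 2015 and an award winner
    if t.genre_rank_2015 is not None and t.genre_rank_2015 <= 100 and t.won_genre_award:
        return "best_seller"
    # Flop: published 2013-2015 and fewer than 1000 copies sold in 2015
    if 2013 <= t.year_published <= 2015 and t.copies_sold_2015 < 1000:
        return "flop"
    return None
```

Titles that match neither rule return `None` and would simply be left out of the corpus.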

### Data cleaning

1. Using the Python package **textract**, I wrote functions to loop through a folder of epub files, extract the text, perform some minor cleaning, and save the result as a text file to use as a corpus document.





~~~python
import os
import textract

# Loop through the epub files in a directory, convert each one, and save it in a new folder
def create_text_files(epub_path, txt_path):
    for epub in os.listdir(epub_path):
        try:
            convert_epub_to_text(epub_path, epub, txt_path)
        except Exception as err:
            print(epub, "failed:", err)

# Extract the text from a single epub and save it as a text file
def convert_epub_to_text(epub_path, epub_file, txt_path):
    # Clean up the filename and change the file extension to .txt
    text_name = os.path.splitext(epub_file.replace(' ', '_'))[0] + '.txt'

    # Extract the text from the epub (textract returns bytes)
    text = textract.process(os.path.join(epub_path, epub_file), encoding='utf_8')
    # Strip out the non-ASCII characters and the newline characters
    clean_text = text.decode('ascii', 'ignore').replace('\n', ' ')

    # Save as a text file
    with open(os.path.join(txt_path, text_name), 'w') as text_file:
        text_file.write(clean_text)
~~~

2. Sections removed from each text file include:
  - Acknowledgements
  - Table of contents
  - About the author
  - Appendices 
  - Copyright, ISBN, and Library of Congress Information
  - Other titles by this author
  - Chapter headings
  
The remaining text of each book was unaltered.

### Data dictionary

To analyze the writing style, I created columns to store metrics about each document.
Counts, average lengths, and diversity were calculated via Python functions.
Polarity, subjectivity, and part of speech (POS) tagging were derived using TextBlob's sentiment analysis features.
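
The count-based metrics could be computed along these lines. This is a minimal sketch rather than the project's actual functions: the name `style_metrics`, the regex tokenization, and the naive sentence-splitting rule are all assumptions made for illustration.

```python
import re


def style_metrics(text):
    """Compute simple count-based style metrics for one document."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    # Assumption: a word is a run of letters, optionally with one inner apostrophe
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

    word_count = len(words)
    if word_count == 0:
        return {'word_count': 0, 'avg_sent_len': 0.0,
                'avg_word_len': 0.0, 'lex_diversity': 0.0}

    return {
        'word_count': word_count,
        # Average number of words per sentence
        'avg_sent_len': word_count / len(sentences),
        # Average number of characters per word
        'avg_word_len': sum(len(w) for w in words) / word_count,
        # Unique words over total words
        'lex_diversity': len(set(words)) / word_count,
    }
```

Polarity and subjectivity are not covered by this sketch; with TextBlob they are available as `TextBlob(text).sentiment.polarity` and `TextBlob(text).sentiment.subjectivity`.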


Field|Description|Datatype
------|----------------|-----------
best_seller|Binary best-seller indicator|object
body|Book text|object
sci_fi|Binary indicator: Sci-Fi=1, Romance=0|integer
title|Book title|object
avg_sent_len|Average sentence length|integer
word_count|Total word count|integer
avg_word_len|Average word length|float
lex_diversity|Lexical diversity - the number of unique words over the total word count|float
polarity|Polarity - measure of negativity to positivity, scaled from -1.0 to 1.0|float
subjectivity|Subjectivity - measure of objectivity to subjectivity, scaled from 0.0 to 1.0|float
profanity|Number of profane words|float
profane|Profanity measure - number of profane words over the total word count|float
conj_coord|POS - coordinating conjunction: and, or, but|float
number|POS - cardinal number: five, three, 13%|float
determiner|POS - determiner: the, a, these|float
exist_there|POS - existential there: there were six boys|float
foreign_word|POS - foreign word|float
conj_sub_prep|POS - subordinating conjunction or preposition: of, on, before, unless|float
adj|POS - adjective|float
adj_compare|POS - adjective, comparative|float
adj_sup|POS - adjective, superlative|float
verb_aux|POS - verb, modal auxiliary: may, should|float
noun|POS - noun|float
noun_prop|POS - noun, proper|float
noun_prop_pural|POS - noun, proper plural|float
noun_plural|POS - noun, plural|float
predeterm|POS - predeterminer: both his children|float
pronoun_pers|POS - personal pronoun: me, you, it|float
pronoun_poss|POS - possessive pronoun: my, your, our|float
adv|POS - adverb: extremely, loudly, hard|float
adv_compare|POS - adverb, comparative: better|float
adv_sup|POS - adverb, superlative: best|float
adv_part|POS - adverb, particle: about, off, up|float
inf_to|POS - infinitival to: what to do?|float
interject|POS - interjection: oh, oops, gosh|float
verb_base|POS - verb, base form: think|float
verb_past|POS - verb, past tense: they thought|float
verb_ger|POS - verb, gerund or present participle: thinking is fun|float
verb_pp|POS - verb, past participle: a sunken ship|float
verb_sing_pres|POS - verb, non-3rd person singular present: I think|float
verb_3rd_sing_pres|POS - verb, 3rd person singular present: she thinks|float
wh_determ|POS - wh-determiner: which, whatever, whichever|float
wh_pronoun|POS - wh-pronoun, personal: what, who, whom|float
wh_poss|POS - wh-pronoun, possessive: whose, whosever|float
wh_adv|POS - wh-adverb: where, when|float
poss_ending|POS - possessive ending: 's|float
symbol|POS - symbol: %$#|float
list_marker|POS - list item marker|float
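
The POS fields in the table follow the Penn Treebank tag set, which is what TextBlob's `.tags` property returns as `(word, tag)` pairs. As a rough sketch, the tags can be mapped to field names and normalized by the number of tagged words. Two assumptions here: that the POS columns hold proportions rather than raw counts (suggested by the float datatype, but not stated), and the helper names `pos_proportions` and `profanity_ratio`, which are hypothetical. The mapping shown is deliberately partial.

```python
from collections import Counter

# Partial mapping from Penn Treebank tags to field names in the data dictionary
TAG_TO_FIELD = {
    'CC': 'conj_coord', 'CD': 'number', 'DT': 'determiner', 'JJ': 'adj',
    'NN': 'noun', 'NNP': 'noun_prop', 'PRP': 'pronoun_pers', 'RB': 'adv',
    'VB': 'verb_base', 'VBD': 'verb_past', 'VBG': 'verb_ger',
}


def pos_proportions(tagged_words):
    """Share of each mapped POS tag among all tagged words.

    tagged_words is a list of (word, tag) pairs, e.g. TextBlob(text).tags.
    """
    total = len(tagged_words)
    counts = Counter(TAG_TO_FIELD[tag] for _, tag in tagged_words
                     if tag in TAG_TO_FIELD)
    return {field: (counts[field] / total if total else 0.0)
            for field in TAG_TO_FIELD.values()}


def profanity_ratio(words, profane_words):
    """Number of profane words over the total word count."""
    if not words:
        return 0.0
    return sum(1 for w in words if w.lower() in profane_words) / len(words)
```

The `profane_words` argument is any set of lowercase words chosen by the analyst; no particular word list is implied.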

## EDA

The full EDA report can be found [here](https://github.com/bishopkd/DSI-SF2-bishopkd/blob/master/projects/capstone/reports/EDA.ipynb)

### Metric analysis

### Top n-grams

### Topic modeling

## The Models

Metrics created and results from those:
- Sentiment analysis (insult lab with MultinomialNB)
- Profanity analysis
- POS distribution
- That other metric I can't figure out

KNN

classification_report

calibration plots

## Results

- Held-out test titles
- nano book

## Conclusion