# Problem statement and EDA
In fields such as forensic investigation, discerning the unidentified author of an important document is a serious scientific problem. This task, known as "authorship attribution," can be stated as:

- given a piece of text, such as a paragraph,
- determine the probability that it was written by a particular author, given that author's prior works
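
One common way to write this down, offered as a sketch rather than the formulation the later notebooks commit to: let $d$ be the unattributed text and $A$ the set of candidate authors (both symbols are introduced here for illustration). The attributed author is the candidate with the highest posterior probability, which Bayes' rule expresses in terms of how likely each author is to have produced $d$:

$$
\hat{a} = \arg\max_{a \in A} P(a \mid d),
\qquad
P(a \mid d) = \frac{P(d \mid a)\,P(a)}{\sum_{a' \in A} P(d \mid a')\,P(a')}
$$

Here $P(d \mid a)$ would be estimated from the author's prior works, and $P(a)$ is a prior over the candidates.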

Ultimately, the goal of the next few notebooks is to build a baseline model. To get there, we must:
 1. formalize the problem
 2. gather data
 3. perform EDA
 4. preprocess the data
 5. build a model
 6. test the model

Authorship attribution is a scientific method that, given a piece of text, identifies its author.

## Formalizing the problem
How well will a bag-of-words model work on three novels by three different authors?
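
A bag-of-words model represents each text only by how often each word occurs, ignoring word order. The sketch below is a minimal illustration using the standard library. The helper names `bag_of_words` and `cosine_similarity`, the tokenizing regex, and the toy sentences are assumptions made for this example, not the model built in the later notebooks.

```python
from collections import Counter
import math
import re


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count each word; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors (0.0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy sentences invented for illustration; they are not taken from the novels.
doc_a = bag_of_words("The sun rose. The sun set.")
doc_b = bag_of_words("The sun rose over the quiet town.")

print(doc_a.most_common(2))
print(round(cosine_similarity(doc_a, doc_b), 3))
```

Two texts that use the same words in similar proportions score close to 1, which is one simple way to ask whether two authors "look alike" under this representation.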

As a test, we select the following task: distinguish the writings of two similar authors and one dissimilar author.

Two of the authors are American, with notable books published a year apart; one of them, Ernest Hemingway, was famously terse. The third is a British author whose best seller appeared roughly 65 years earlier and who is popularly said to have been paid by the word, leading to lofty language and run-on sentences.

My expectation is that Hemingway and Fitzgerald will be difficult to tell apart and Dickens will stick out. But let's see!


| Name | Nationality | Book | Published Date |
| --- | --- | --- | --- |
| Ernest Hemingway | American | The Sun Also Rises | 1926 |
| F. Scott Fitzgerald | American | The Great Gatsby | 1925 |
| Charles Dickens | British | A Tale of Two Cities | 1859 |

# Collect Data
We collect the plain-text editions of the three novels from Project Gutenberg:
1. The Great Gatsby
2. The Sun Also Rises
3. A Tale of Two Cities

In [1]:
import requests

In [2]:
# the great gatsby
r = requests.get(r'https://www.gutenberg.org/cache/epub/64317/pg64317.txt')
great_gatsby = r.text
great_gatsby[0:100]

'\ufeffThe Project Gutenberg eBook of The Great Gatsby\r\n    \r\nThis ebook is for the use of anyone anywhere'

In [3]:
# the sun also rises
r = requests.get(r'https://www.gutenberg.org/cache/epub/67138/pg67138.txt')
sar = r.text
sar[869:1200]

'*** START OF THE PROJECT GUTENBERG EBOOK THE SUN ALSO RISES ***\r\n\r\n\r\n\r\n\r\n\r\n                                 ERNEST\r\n                               HEMINGWAY\r\n\r\n\r\n\r\n                                The Sun\r\n                               Also Rises\r\n\r\n\r\n\r\n\r\n                        CHARLES SCRIBNER’S SONS\r\n                          '

In [4]:
# a tale of two cities
r = requests.get(r'https://www.gutenberg.org/cache/epub/98/pg98.txt')
atotc = r.text
atotc[697:850]

'*** START OF THE PROJECT GUTENBERG EBOOK A TALE OF TWO CITIES ***\r\n\r\n\r\n\r\nA TALE OF TWO CITIES\r\n\r\nA STORY OF THE FRENCH REVOLUTION\r\n\r\nBy Charles Dickens\r\n'

In [5]:
# backup book, the wonderful wizard of oz
# https://www.gutenberg.org/cache/epub/55/pg55.txt

# Clean data
Cleaning includes removing newlines and tabs.
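
A minimal sketch of what this cleaning could look like, assuming the text of interest sits between the `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` marker seen in the downloads above and a matching `*** END ... ***` marker (the END marker's exact wording is an assumption here). The helper names `strip_gutenberg` and `normalize_whitespace` and the sample string are hypothetical.

```python
import re

# The START pattern follows the marker visible in the downloaded text;
# the END pattern is assumed to mirror it.
START_RE = re.compile(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*")
END_RE = re.compile(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*? \*\*\*")


def strip_gutenberg(raw: str) -> str:
    """Keep only the text between the START and END markers, if both are found."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start is None or end is None or end.start() < start.end():
        return raw  # markers missing or out of order: leave the text unchanged
    return raw[start.end():end.start()]


def normalize_whitespace(text: str) -> str:
    """Drop a leading byte-order mark and collapse whitespace runs to one space."""
    return " ".join(text.lstrip("\ufeff").split())


# Invented sample that imitates the layout of a downloaded file.
sample = (
    "\ufeffThe Project Gutenberg eBook of Example\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n\r\n"
    "Chapter I\r\n\tIt was a quiet\r\nmorning.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\r\n"
)
cleaned = normalize_whitespace(strip_gutenberg(sample))
print(cleaned)
```

Splitting on any whitespace and re-joining with single spaces handles `\r\n`, tabs, and runs of spaces in one pass, at the cost of discarding paragraph breaks.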

# EDA of raw data
Let's look at the raw data and do some quick checks.
1. What is the number of characters? (This includes extra text from Project Gutenberg, such as the copyright notice.)
2. What is the number of 'words' (as defined by Python's default `str.split()` method, which splits on whitespace)?
3. What is the number of periods, a proxy for the number of sentences?
4. What is the number of words per period, a proxy for average sentence length?

In [6]:
print('====== THE GREAT GATSBY ======')
print(f'number of characters: \t\t{len(great_gatsby):,}')
print(f'number of words: \t\t{len(great_gatsby.split()):,}')
print(f"number of periods: \t\t{great_gatsby.count('.'):,}")
print(f"words per period: \t\t{round(len(great_gatsby.split())/great_gatsby.count('.'),4):,}")
print('\n\n')
print('====== THE SUN ALSO RISES ======')
print(f'number of characters: \t\t{len(sar):,}')
print(f'number of words: \t\t{len(sar.split()):,}')
print(f"number of periods: \t\t{sar.count('.'):,}")
print(f"words per period: \t\t{round(len(sar.split())/sar.count('.'),4):,}")

print('\n\n')
print('====== A TALE OF TWO CITIES ======')
print(f'number of characters: \t\t{len(atotc):,}')
print(f'number of words: \t\t{len(atotc.split()):,}')
print(f"number of periods: \t\t{atotc.count('.'):,}")
print(f"words per period: \t\t{round(len(atotc.split())/atotc.count('.'),4):,}")

len(atotc)

====== THE GREAT GATSBY ======
number of characters: 		296,579
number of words: 		51,225
number of periods: 		3,330
words per period: 		15.3829



====== THE SUN ALSO RISES ======
number of characters: 		395,471
number of words: 		70,954
number of periods: 		7,370
words per period: 		9.6274



====== A TALE OF TWO CITIES ======
number of characters: 		793,153
number of words: 		138,923
number of periods: 		6,821
words per period: 		20.367


793153

In [7]:
!pip install pandas



In [8]:
import pandas as pd
titles = ['The Great Gatsby','The Sun Also Rises','A Tale of Two Cities']
texts = [great_gatsby, sar, atotc]
df_eda = pd.DataFrame({}, columns = titles, index=['raw characters', 'raw word count', 'raw sentences', 'raw characters per word','raw words per sentence'])
for title, text in zip(titles, texts):
    df_eda.loc['raw characters', title] = len(text)
    df_eda.loc['raw word count', title] = len(text.split())
    df_eda.loc['raw sentences', title] = text.count('.')
    df_eda.loc['raw characters per word', title] = len(text)/len(text.split())
    df_eda.loc['raw words per sentence', title] = len(text.split())/text.count('.')
    
    
df_eda.head()

| | The Great Gatsby | The Sun Also Rises | A Tale of Two Cities |
| --- | --- | --- | --- |
| raw characters | 296579.0 | 395471.0 | 793153.0 |
| raw word count | 51225.0 | 70954.0 | 138923.0 |
| raw sentences | 3330.0 | 7370.0 | 6821.0 |
| raw characters per word | 5.789732 | 5.573625 | 5.709299 |
| raw words per sentence | 15.382883 | 9.627408 | 20.366955 |


In [9]:
atotc.count('.')

6821
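
Counting periods is only a rough proxy: it overcounts wherever a period does not end a sentence (for example "Mr." or an ellipsis) and misses sentences that end in "?" or "!". The sketch below shows one slightly stricter heuristic using only the standard library. The abbreviation list, the regex, and the name `count_sentences` are assumptions made for illustration, and the sample sentence is invented.

```python
import re

# Hypothetical, deliberately short abbreviation list; a real one would be longer.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St."}

# A token ends a sentence if it finishes with one or more of . ! ?
# optionally followed by closing quotes or a closing parenthesis.
SENTENCE_END = re.compile(r"[.!?]+[\"'”’)]*$")


def count_sentences(text: str) -> int:
    """Heuristic sentence count that skips a few known abbreviations."""
    return sum(
        1
        for token in text.split()
        if SENTENCE_END.search(token) and token not in ABBREVIATIONS
    )


sample = 'Mr. Lorry waited... "Is it you?" he asked. Nobody answered.'
print(sample.count("."), count_sentences(sample))
```

On this sample the raw period count is 6 while the heuristic counts 4, which illustrates how much the two measures can diverge on dialogue-heavy prose.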

In [10]:
!pip install nltk



In [11]:
# https://stackoverflow.com/questions/15228054/how-to-count-the-amount-of-sentences-in-a-paragraph-in-python
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

text = 'A Turing machine is a device that manipulates symbols on a strip of tape according to a table of rules. Despite its simplicity, a Turing machine can be adapted to simulate the logic of any computer algorithm, and is particularly useful in explaining the functions of a CPU inside a computer. The "Turing" machine was described by Alan Turing in 1936, who called it an "a(utomatic)-machine". The Turing machine is not intended as a practical computing technology, but rather as a hypothetical device representing a computing machine. Turing machines help computer scientists understand the limits of mechanical computation.'

# sent_tokenize returns a list of sentence strings
sentences = sent_tokenize(text)

print(len(sentences))

5


[nltk_data] Downloading package punkt to /home/denny/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
# questions
# what are the most common words?
# what words are most commonly associated with an author
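
One way to approach both questions, sketched with the standard library only: count words with `collections.Counter`, then rank an author's words by how much more frequent they are than in the other texts. The function names, the add-one smoothing, and the toy strings below are assumptions made for illustration; they stand in for the full novels.

```python
from collections import Counter
import re


def word_counts(text: str) -> Counter:
    """Lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def distinctive_words(target: Counter, others: Counter, top_n: int = 3) -> list:
    """Rank words in `target` by their frequency ratio against `others`.

    Add-one smoothing (an assumption of this sketch) keeps words that never
    appear in `others` from causing a division by zero.
    """
    target_total = sum(target.values())
    others_total = sum(others.values())
    vocab_size = len(set(target) | set(others))

    def ratio(word: str) -> float:
        p_target = (target[word] + 1) / (target_total + vocab_size)
        p_others = (others[word] + 1) / (others_total + vocab_size)
        return p_target / p_others

    return sorted(target, key=ratio, reverse=True)[:top_n]


# Toy strings invented for illustration, standing in for two authors' texts.
author_a = word_counts("the bull ran and the bull fell")
author_b = word_counts("the party ran late and the party ended")

print(author_a.most_common(2))
print(distinctive_words(author_a, author_b, top_n=2))
```

Raw counts alone tend to surface function words such as "the", which is why the second function compares relative frequencies across texts instead.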


In [13]:
# first, replace unwanted newline, carriage-return, and tab characters with spaces
for char in ["\n", "\r", "\t"]:
    great_gatsby = great_gatsby.replace(char, " ")