In [23]:
# general process outline

# take dataset files
# read top x number of rows into pandas dataframe (according to percentage_breakdowns_v2.xlsx)
# save new dataframe with only last_name, first_name, title, fraction_total
# mash together 5 dataframes - where records are identical across first 3 columns, add together fraction_total
# take mashed-together dataset records and divide all fraction_total values by 5
# reorganize final author dataset by fraction_total (highest to lowest)


In [101]:
import pandas
import numpy

### LOADING DATA

Here, I'm loading the top ~5% most common authors in each dataset and saving them as a pandas dataframe. From that dataframe, I create a second dataframe with the four relevant columns: last_name, first_name, title, fraction_total.

This process repeats for each dataset. First, Early English Books Online...

In [24]:
# EEBO

eebo_df = pandas.read_csv('../eebo/eebo_dataset_final.csv', nrows=40)
# eebo_df

In [25]:
eebo_df2 = eebo_df.loc[:, ['last_name', 'first_name', 'title', 'fraction_total']]
# eebo_df2

Next, English Short Title Catalog...

In [26]:
# ESTC

estc_df = pandas.read_csv('../estc/estc_dataset_final.csv', nrows=159)
# estc_df

In [27]:
estc_df2 = estc_df.loc[:, ['last_name', 'first_name', 'title', 'fraction_total']]
# estc_df2

Open Syllabus Project...

In [28]:
# OPEN SYLLABUS

open_syllabus_df = pandas.read_csv('../open-syllabus/english-lit/open-syllabus_dataset_final.csv', nrows=250)
# open_syllabus_df

In [29]:
open_syllabus_df2 = open_syllabus_df.loc[:, ['last_name', 'first_name', 'title', 'fraction_total']]
# open_syllabus_df2

Oxford Text Archive...

In [30]:
# OTA

ota_df = pandas.read_csv('../ota/ota_dataset_final.csv', nrows=31)
# ota_df

In [31]:
ota_df2 = ota_df.loc[:, ['last_name', 'first_name', 'title', 'fraction_total']]
# ota_df2

And finally, Project Gutenberg...

In [32]:
# PROJECT GUTENBERG

project_gutenberg_df = pandas.read_csv('../project-gutenberg-2/project-gutenberg_dataset_final.csv', nrows=1057)
# project_gutenberg_df

In [33]:
project_gutenberg_df2 = project_gutenberg_df.loc[:, ['last_name', 'first_name', 'title', 'fraction_total']]
# project_gutenberg_df2
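
Since the same two steps (read the top rows, keep four columns) repeat for every dataset, they could be folded into one helper. This is only a sketch: `load_top_authors`, `COLUMNS`, and the small in-memory CSV are names and data I made up for illustration, and the real files may need extra handling.

```python
import io
import pandas

# The four columns kept from every dataset, in the order used above.
COLUMNS = ['last_name', 'first_name', 'title', 'fraction_total']

def load_top_authors(source, nrows):
    """Read the first `nrows` rows of a CSV and keep only the relevant columns."""
    df = pandas.read_csv(source, nrows=nrows)
    return df.loc[:, COLUMNS]

# Hypothetical stand-in for a real file, so the sketch runs on its own.
demo_csv = io.StringIO(
    "last_name,first_name,title,fraction_total,extra\n"
    "Shakespeare,William,,0.1,x\n"
    "Hopper,William,Mrs,0.05,y\n"
    "Sternhold,Thomas,,0.01,z\n"
)
demo_df = load_top_authors(demo_csv, nrows=2)
```

With the real files, the call would look like `load_top_authors('../eebo/eebo_dataset_final.csv', 40)`, repeated with each dataset's path and row count.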

### PROCESSING LOADED DATA

Now that all the datasets are loaded as dataframes with the relevant columns, I can start processing the loaded data.

This is where I got stuck. I'm trying to merge records that match on the first three columns (last_name, first_name, title) and, in merging, create an average fraction_total. For example, let's say the records in the dataframe below are taken from 2 different datasets. Because they're identical across the first three columns, their fraction_total values should be added together and divided by two.

In [34]:
import pandas

sample_records1 = {'last_name': ['Shakespeare', 'Shakespeare'], 'first_name': ['William', 'William'], 'title': ['',''], 'fraction_total': [.1, .2]}

df_sample1 = pandas.DataFrame(sample_records1)

df_sample1

Unnamed: 0,last_name,first_name,title,fraction_total
0,Shakespeare,William,,0.1
1,Shakespeare,William,,0.2


But in this next dataframe, only the first and second records should be merged. The third differs in the 'title' column and should be read as a different record.

In [35]:
import pandas

sample_records2 = {'last_name': ['Hopper', 'Hopper', 'Hopper'], 'first_name': ['William', 'William', 'William'], 'title': ['','', 'Mrs'], 'fraction_total': [.1, .2, .1]}

df_sample2 = pandas.DataFrame(sample_records2)

df_sample2

Unnamed: 0,last_name,first_name,title,fraction_total
0,Hopper,William,,0.1
1,Hopper,William,,0.2
2,Hopper,William,Mrs,0.1
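
One way the merging rule described above might be expressed is a `groupby` on the three identifying columns. This is a sketch rather than a tested solution for the full datasets: `average_matching_records`, `KEYS`, and `n_datasets` are hypothetical names, and I'm assuming every group should be divided by the number of datasets, as in the outline at the top. Blank names and titles read from a CSV arrive as NaN, and `groupby` drops NaN keys by default, so the sketch fills them with empty strings first.

```python
import pandas

# Columns that must all match for two records to count as the same author.
KEYS = ['last_name', 'first_name', 'title']

def average_matching_records(df, n_datasets):
    """Sum fraction_total over rows sharing the same KEYS, then divide by n_datasets."""
    cleaned = df.copy()
    # Replace missing values so rows with a blank name or title are not dropped.
    cleaned[KEYS] = cleaned[KEYS].fillna('')
    summed = cleaned.groupby(KEYS, as_index=False, sort=False)['fraction_total'].sum()
    summed['fraction_total'] = summed['fraction_total'] / n_datasets
    return summed

# Same records as df_sample2 above.
sample = pandas.DataFrame({
    'last_name': ['Hopper', 'Hopper', 'Hopper'],
    'first_name': ['William', 'William', 'William'],
    'title': ['', '', 'Mrs'],
    'fraction_total': [.1, .2, .1],
})
result = average_matching_records(sample, n_datasets=2)
```

Here `result` should have two rows: the two untitled Hopper records collapse into one with a fraction_total of roughly 0.15, and the 'Mrs' record stays separate at roughly 0.05.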


Ideally I could mash together all 5 datasets at once so the math is more straightforward, but merging dataset-by-dataset shouldn't be too much of a problem. 

I'm having a bit of trouble determining whether the code below actually works. I'm pretty sure it doesn't, so I'm still fiddling. Currently doing some pandas research to see what methods might work for me.

Also considering:
- abandoning pandas and using the Jupyter Notebook to manipulate the actual files themselves (i.e. create new CSVs instead of dataframes and read those in)
- changing the structure of my datasets to suit pandas better / create a more meaningful final dataset (as of now there's no way of determining which dataset(s) the authors come from in the final dataset, but I think that would be pretty cool to see)
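
On the second idea, one possible way to keep track of which dataset(s) each author comes from is to tag every dataframe with a source label before stacking them. The sketch below uses tiny made-up frames; `frames`, `source`, and `sources` are hypothetical names, and the real dataframes (`eebo_df2`, `estc_df2`, and so on) would take the place of the stand-ins.

```python
import pandas

KEYS = ['last_name', 'first_name', 'title']

# Tiny hypothetical stand-ins for two of the per-dataset dataframes.
frames = {
    'eebo': pandas.DataFrame({
        'last_name': ['Shakespeare'], 'first_name': ['William'],
        'title': [None], 'fraction_total': [0.1],
    }),
    'estc': pandas.DataFrame({
        'last_name': ['Shakespeare', 'Hopper'], 'first_name': ['William', 'William'],
        'title': [None, 'Mrs'], 'fraction_total': [0.2, 0.1],
    }),
}

# Tag each row with the dataset it came from, then stack everything.
tagged = [df.assign(source=name) for name, df in frames.items()]
combined = pandas.concat(tagged, ignore_index=True)
# Fill blanks so groupby does not drop rows with a missing name or title.
combined[KEYS] = combined[KEYS].fillna('')

final = combined.groupby(KEYS, as_index=False, sort=False).agg(
    fraction_total=('fraction_total', 'sum'),
    sources=('source', lambda s: ', '.join(sorted(set(s)))),
)
# Divide by the number of datasets, then rank highest to lowest.
final['fraction_total'] = final['fraction_total'] / len(frames)
final = final.sort_values('fraction_total', ascending=False, ignore_index=True)
```

In this toy case the Shakespeare row ends up with `sources` equal to `'eebo, estc'`, while the Hopper row lists only `'estc'`.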

In [36]:
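# Stack the five dataframes; each keeps its own 0-based index, so index values repeat.
merged_df = pandas.concat((eebo_df2, estc_df2, open_syllabus_df2, ota_df2, project_gutenberg_df2))
merged_df
# Not what I want: grouping by index lumps together unrelated rows that share a row position.
# merged_df.groupby(merged_df.index).mean()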


Unnamed: 0,last_name,first_name,title,fraction_total
0,,Anonymous,,0.291590
1,England and Wales,Parliament,,0.011711
2,Sovereign,Charles I,,0.010643
3,,England and Wales,,0.009838
4,,Church of England,,0.009440
5,Sovereign,Charles II,,0.007148
6,Scotland,Privy Council,,0.005735
7,Sovereign,James I,,0.004919
8,Sovereign,Elizabeth I,,0.004490
9,Sternhold,Thomas,,0.004218
