# Getting a minimum number of texts to trim

Code by Morgan Lundy, Peizhen Wu, and Ted Underwood

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
import os


In [3]:
alltexts = pd.read_csv('alltexts.csv') # this file should be in the folder already
alltexts.head()

Unnamed: 0.1,Unnamed: 0,docid,author,authordate,title,latestcomp,hathidate,imprint,gutenstring,enumcron,...,contents,instances,genre,audience,authgender,multiplehtids,comments,coder,Folder,Trimmed
0,0,loc.ark=+13960+t5p851b8s,"Reid, Stuart J.",,Lord John Russell,1895,,New York;Harper & brothers;1,"Reid, Stuart J. | Lord John Russell",<blank>,...,,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed
1,1,hvd.32044070870779,"Smiles, Samuel,",,Lives of the engineers,1879,,London;J. Murray;1874-1877.,"Smiles, Samuel | Lives of the Engineers",v. 5,...,,,bio | short,,m,,"2 people, mixed together (not one per chapter)",morgan,gutenbiotrimmed,Trimmed
2,2,mdp.39015005892362,"Cruttwell, Maud.",,Luca Signorelli,1899,,London;G. Bell & sons;1899.,"Cruttwell, Maud | Luca Signorelli",<blank>,...,,,bio,,f,,,morgan,gutenbiotrimmed,Trimmed
3,3,mdp.39015051108531,"Bettany, George Thomas,",,Life of Charles Darwin,1887,,London;W. Scott;1887.,"Bettany, George Thomas | Life of Charles Darwin",<blank>,...,,,bio,,m,,,morgan,gutenbiotrimmed,Trimmed
4,4,loc.ark=+13960+t6b27z54n,"Gay, Sydney Howard,",,James Madison,1889,,"Boston;New York;Houghton, Mi","Gay, Sydney Howard | James Madison",<blank>,...,,,bio,,u,,,morgan,gutenbiotrimmed,Trimmed


**ML:** Which are our "rarest" texts, the ones we know we will need to trim? 
Biographies by women written in the 18th century (or in our oldest time "chunk")? 
Are these the three parameters we're sticking with: author gender, date, and genre? 

**TU:** Yes; let's define the oldest time chunk as before the median date of fiction.

In [5]:
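# Flag fiction: any row whose genre string starts with 'fic'.
# na=False treats a missing genre as "not fiction", so the mask stays boolean.
alltexts['isfic'] = alltexts['genre'].str.startswith('fic', na=False)
alltexts['isfic'].sum()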

654

In [6]:
np.median(alltexts.loc[alltexts['isfic'], 'latestcomp'])

1905.0

The median date is actually quite late. Let's make this a round number and call the two classes "up to 1899" and "1900 and after."
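The two classes can be made explicit as a column. This is a minimal sketch on a toy stand-in for `alltexts`; the column name `period` and the toy rows are assumptions for illustration, not part of the real metadata.

```python
import numpy as np
import pandas as pd

# Toy stand-in for alltexts; only the two columns used here (hypothetical rows).
toy = pd.DataFrame({
    'genre': ['fic', 'bio', 'fic | short', 'fic', 'bio'],
    'latestcomp': [1850, 1895, 1899, 1900, 1923],
})

toy['isfic'] = toy['genre'].str.startswith('fic', na=False)

# 'period' is a hypothetical column name. Note that a row with a missing
# latestcomp would fail the "< 1900" test and land in '1900 and after'.
toy['period'] = np.where(toy['latestcomp'] < 1900, 'up to 1899', '1900 and after')

# Count fiction texts in each class.
period_counts = toy.loc[toy['isfic'], 'period'].value_counts()
print(period_counts)
```

On the real table, the same `np.where` line applied to `alltexts` should reproduce the split that the cells below count by hand.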

In [7]:
sum(~pd.isnull(alltexts.loc[alltexts['isfic'] & (alltexts['latestcomp'] < 1900), 'Trimmed']))

113

In [9]:
len(alltexts.loc[alltexts['isfic'] & (alltexts['latestcomp'] < 1900), 'Trimmed'])

261

So there are about 150 pre-1900 fiction texts (261 − 113 = 148) that we should probably trim right there.
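To turn that count into a to-do list, one option is to select the rows where `Trimmed` is still empty. A sketch on toy data, assuming (as the counts above do) that an empty `Trimmed` cell means the text has not been trimmed; the `docid` values here are made up.

```python
import pandas as pd

# Toy stand-in for alltexts (hypothetical docids).
toy = pd.DataFrame({
    'docid': ['a1', 'a2', 'a3', 'a4', 'a5'],
    'genre': ['fic', 'fic', 'bio', 'fic', 'fic'],
    'latestcomp': [1850, 1872, 1880, 1899, 1910],
    'Trimmed': ['Trimmed', None, None, None, None],
})

# Pre-1900 fiction ...
is_early_fic = toy['genre'].str.startswith('fic', na=False) & (toy['latestcomp'] < 1900)

# ... that has no entry in the Trimmed column yet.
to_trim = toy.loc[is_early_fic & toy['Trimmed'].isna(), 'docid'].tolist()
print(to_trim)  # ['a2', 'a4']
```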

We could start with those. We're pretty sure we're going to need to trim them, both because we'll need them for the chronological model (20c vs. before 20c) and because we'll need many of them for chronological matching if we're trying to model a *non-chronological* boundary, e.g. comparing novels by women to those by men.

In a situation like that, it's usually a good idea to use matching methods (selecting negative examples that closely match the positive examples by date), because language change is such a powerful signal that it will otherwise almost inevitably contaminate your model. And since pre-1900 texts are spread out across the timeline, we'll probably do a better job of matching if we maximize their density.
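As an illustration of what date matching can look like, here is a rough sketch of greedy nearest-date matching without replacement. It is one simple way to do this, not necessarily the procedure we'll end up using; the function name `match_by_date` and the toy frames are hypothetical.

```python
import pandas as pd

def match_by_date(positives, negatives, date_col='latestcomp'):
    """For each positive row (earliest first), pick the unused negative
    row whose date is closest. Greedy, so not guaranteed to be optimal."""
    available = negatives.copy()
    matched = []
    for _, row in positives.sort_values(date_col).iterrows():
        if available.empty:
            break  # ran out of negative candidates
        gaps = (available[date_col] - row[date_col]).abs()
        best = gaps.idxmin()                    # index label of the closest date
        matched.append(best)
        available = available.drop(index=best)  # each negative is used only once
    return negatives.loc[matched]

# Toy example: three novels by women, five candidate novels by men.
women = pd.DataFrame({'docid': ['w1', 'w2', 'w3'],
                      'latestcomp': [1820, 1875, 1910]})
men = pd.DataFrame({'docid': ['m1', 'm2', 'm3', 'm4', 'm5'],
                    'latestcomp': [1818, 1860, 1880, 1905, 1950]})

matched_men = match_by_date(women, men)
print(matched_men['docid'].tolist())  # ['m1', 'm3', 'm4']
```

The sparser the candidate pool is in a given decade, the larger the date gaps the matcher has to accept, which is the practical reason to trim as many pre-1900 texts as we can.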

So first, I'll need to download some of these texts from Gutenberg! We don't even have them all yet.

Then we can think about 20c texts.

In [10]:
len(alltexts.loc[alltexts['isfic'] & (alltexts['latestcomp'] >= 1900), 'Trimmed'])

393

In [12]:
sum(~pd.isnull(alltexts.loc[alltexts['isfic'] & (alltexts['latestcomp'] >= 1900), 'Trimmed']))

62

We also have some trimming to do here (only 62 of the 393 are trimmed so far), but not as much as it looks like, because we can be strategic about which texts we choose.
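The four counts above (113 of 261 trimmed before 1900, 62 of 393 from 1900 on) can also be produced in one table. A sketch on toy data; the labels `period` and `status` are hypothetical names, and an empty `Trimmed` cell is again assumed to mean "not yet trimmed".

```python
import numpy as np
import pandas as pd

# Toy stand-in for the fiction rows of alltexts (hypothetical values).
toy = pd.DataFrame({
    'latestcomp': [1850, 1872, 1899, 1900, 1910, 1925],
    'Trimmed': ['Trimmed', None, None, 'Trimmed', None, None],
})

period = pd.Series(
    np.where(toy['latestcomp'] < 1900, 'up to 1899', '1900 and after'),
    name='period')
status = pd.Series(
    np.where(toy['Trimmed'].notna(), 'trimmed', 'untrimmed'),
    name='status')

# Rows: time period; columns: trimming status; cells: number of texts.
summary = pd.crosstab(period, status)
print(summary)
```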