# Evaluation Dataset

Generate a spreadsheet (exported as an Excel file in step 5) with the following information:

* brief name
* brief id
* author party
* paragraph of text

The goal is to have Holly classify each paragraph as one (or more) of our classification labels, to serve as a "ground truth" for cross-model comparison.

In [1]:
# import data science packages
import pandas as pd
import numpy as np

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd gdrive/My\ Drive/amicus-iv

Mounted at /content/gdrive
/content/gdrive/My Drive/amicus-iv


## 1. Load Dataset

In [28]:
df = pd.read_csv("data/shortened-amicus-brief-text.csv")
df.head(5)

Unnamed: 0,case,brief,id,txt_short
0,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,many roe v wade killings are murder the eviden...
1,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815187715,"for the 14th time, the supreme court is petiti..."
2,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823786898,in imposing a constitutional standard for pare...
3,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823789298,amici offer this brief for the limited purpose...
4,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823790498,new hampshire's parental notification law is a...


## 2. Split Text

Split each text into overlapping chunks of at most `max_len` words. A new chunk starts every `step` words, so consecutive chunks share `max_len - step` words.

In [24]:
def split_text(data, max_len=512, step=128):
  # split the text into words on whitespace
  words = data.split()
  # take a window of up to max_len words starting every step words,
  # and join each window back into a single string
  return [' '.join(words[i : i + max_len]) for i in range(0, len(words), step)]
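
To make the windowing concrete, here is a small sketch that runs the same logic on a ten-word toy string (the sample string and the small `max_len` and `step` values are made up for illustration). It shows that consecutive chunks share `max_len - step` words, and that the last chunks are shorter than `max_len` because windows keep starting until the text runs out.

```python
def split_text(data, max_len=512, step=128):
    # split the text into words on whitespace
    words = data.split()
    # one window of up to max_len words starting every step words
    return [' '.join(words[i : i + max_len]) for i in range(0, len(words), step)]

# toy input, chosen only for illustration
sample = "a b c d e f g h i j"
chunks = split_text(sample, max_len=4, step=2)

for chunk in chunks:
    print(chunk)
# a b c d
# c d e f
# e f g h
# g h i j
# i j
```

With the notebook's settings (`max_len = 312`, `step = 78`), the same arithmetic gives an overlap of 234 words between neighbouring chunks, so most words appear in up to four chunks.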

Apply the function to our dataset.

In [29]:
max_len = 312 #512
step = 78 #128

# split each text into chunks of up to max_len words, starting a new chunk every step words
df['text'] = df['txt_short'].apply(lambda txt: split_text(txt,
                                                          max_len=max_len,
                                                          step=step))
# expand the list of chunks so that each chunk gets its own row
df = df.explode('text')
df = df.reset_index(drop=True)
df = df.drop(columns=['txt_short'])
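
The apply-then-explode pattern can be checked on a toy frame. The sketch below uses a made-up two-row DataFrame (the `id` values and texts are illustrative only) and small window sizes; it shows that every chunk becomes its own row while the other columns are repeated for each chunk.

```python
import pandas as pd

def split_text(data, max_len=512, step=128):
    words = data.split()
    return [' '.join(words[i : i + max_len]) for i in range(0, len(words), step)]

# toy data, not taken from the real dataset
toy = pd.DataFrame({"id": [1, 2], "txt_short": ["a b c d e", "f g"]})

# each cell of 'text' is a list of chunks
toy["text"] = toy["txt_short"].apply(lambda txt: split_text(txt, max_len=3, step=2))

# one row per chunk; the 'id' value is repeated for every chunk of the same text
toy = toy.explode("text").reset_index(drop=True).drop(columns=["txt_short"])
print(toy)
```

Because `id` is repeated across chunks, it can be used later to trace a sampled chunk back to its source brief.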

Check the result.

In [26]:
df.head(5)

Unnamed: 0,case,brief,id,text
0,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,many roe v wade killings are murder the eviden...
1,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,"killing, with malice aforethought, of a child ..."
2,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,"1868, born alive did not mean natural birth af..."
3,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,evidence shows that the hysterotomy is a commo...
4,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,"must still be murder, and the justices who per..."


## 3. Add 'fem_brief' variable to stratify sample on

Value is 1 if the brief name contains "feminist" (that is, the brief is written in support of the feminist party) and 0 otherwise.

In [36]:
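# True where the brief name contains "feminist" (case-sensitive substring match)
df['fem_brief'] = ["feminist" in brief_name for brief_name in df['brief']]
# convert the boolean flag to 0/1
df['fem_brief'] = [int(is_fem) for is_fem in df['fem_brief']]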

df['fem_brief'].value_counts()

0    30577
1    24947
Name: fem_brief, dtype: int64
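
The flag depends on a plain substring test, which is case-sensitive. The sketch below uses made-up brief names (the real naming scheme is not shown in full above, so these strings are assumptions) to show how a capitalised "Feminist" would be missed, and how lowercasing first avoids that.

```python
import pandas as pd

# hypothetical brief names, for illustration only
toy = pd.DataFrame({"brief": [
    "Case A - amicus brief (feminist)",
    "Case B - amicus brief (opposition)",
    "Case C - amicus brief (Feminist)",
]})

# same logic as the notebook cell, written with the pandas string accessor
toy["fem_brief"] = toy["brief"].str.contains("feminist", regex=False).astype(int)

# case-insensitive variant: lowercase the names before matching
toy["fem_brief_ci"] = (
    toy["brief"].str.lower().str.contains("feminist", regex=False).astype(int)
)
print(toy[["fem_brief", "fem_brief_ci"]])
```

If the two columns ever disagree on the real data, the brief names use inconsistent capitalisation and the case-insensitive version is probably the safer choice.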

## 4. Randomly select 100 rows

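Sample 50 chunks from feminist briefs and 50 chunks from opposing briefs. Each row is a text chunk, so several sampled rows may come from the same brief.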

In [None]:
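# draw 50 rows from each fem_brief group; no random_state is set, so each run gives a different sample
random_rows = df.groupby('fem_brief', group_keys=False).apply(lambda x: x.sample(50))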
random_rows
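
If the evaluation sample needs to be reproducible, one option is to fix a seed. The sketch below is an alternative to the cell above, not the method used in this notebook: it relies on `DataFrameGroupBy.sample`, and the toy frame and the `SEED` value are assumptions made for illustration.

```python
import pandas as pd

# toy data standing in for the chunked DataFrame
toy = pd.DataFrame({
    "fem_brief": [0] * 6 + [1] * 4,
    "text": [f"chunk {i}" for i in range(10)],
})

SEED = 0  # hypothetical seed; any fixed integer works

# draw the same number of rows from each group, with a fixed seed
sample_a = toy.groupby("fem_brief").sample(n=2, random_state=SEED)
sample_b = toy.groupby("fem_brief").sample(n=2, random_state=SEED)

# the same seed yields the same rows on every run
print(sample_a.equals(sample_b))
print(sample_a["fem_brief"].value_counts().to_dict())
```

On the real data this would use `n=50`, giving 100 rows in total with the two groups equally represented.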

## 5. Save

In [38]:
# write the sampled rows to an Excel file for manual labeling
random_rows.to_excel('data/labeled-amicus.xlsx')
#df.to_csv('data/shortened-text-312-chunk.csv')
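
If a CSV is preferred over Excel, the export could look roughly like the sketch below. It keeps only the four fields listed at the top of this notebook and adds an empty column for the annotator. The `label` column name and the toy row are assumptions, and the sketch writes to an in-memory buffer rather than a file so that it runs anywhere.

```python
import io
import pandas as pd

# toy stand-in for random_rows
random_rows = pd.DataFrame({
    "brief": ["Case A - amicus brief"],
    "id": [1],
    "fem_brief": [1],
    "text": ["example paragraph"],
})

# keep brief name, brief id, author party flag, and paragraph text
export = random_rows[["brief", "id", "fem_brief", "text"]].copy()
export["label"] = ""  # hypothetical empty column for the annotator to fill in

# write without the index so no 'Unnamed: 0' column appears when the file is read back
buffer = io.StringIO()
export.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])
```

Passing a file path instead of `buffer` writes the same content to disk.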