# Evaluation Dataset

Generate a spreadsheet with the following information for each sampled paragraph:

* brief name
* brief id
* author party
* paragraph of text

The goal is to have Holly classify each paragraph with one or more of our classification labels, to serve as a "ground truth" for cross-model comparison.

In [None]:
# import data science packages
import pandas as pd
import numpy as np

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd gdrive/My\ Drive/amicus-iv

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/amicus-iv


## 1. Load Dataset

In [None]:
df = pd.read_csv("data/shortened-amicus-brief-text.csv")
df.head(5)

Unnamed: 0,case,brief,id,txt_short
0,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,many roe v wade killings are murder the eviden...
1,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815187715,"for the 14th time, the supreme court is petiti..."
2,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823786898,in imposing a constitutional standard for pare...
3,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823789298,amici offer this brief for the limited purpose...
4,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823790498,new hampshire's parental notification law is a...


## 2. Split Text

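Split each text into chunks of up to `max_len` words. Consecutive chunks start `step` words apart, so neighbouring chunks share up to `max_len - step` words (384 with the defaults below).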

In [None]:
def split_text(data, max_len=512, step=128):
  """Split a string into overlapping word windows.

  Each chunk holds up to `max_len` words, and consecutive chunks start
  `step` words apart, so neighbours share up to `max_len - step` words.
  """
  # split text on whitespace
  words = data.split()
  # slide a window of `max_len` words forward `step` words at a time
  return [' '.join(words[i : i + max_len]) for i in range(0, len(words), step)]
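
To make the windowing concrete, here is a small worked example on a toy string. The function body mirrors `split_text` above; the ten-word input and the small `max_len` and `step` values are chosen only for illustration.

```python
def split_text(data, max_len=512, step=128):
  # same windowing logic as the notebook's split_text
  words = data.split()
  return [' '.join(words[i : i + max_len]) for i in range(0, len(words), step)]

# toy input (illustrative only): the "words" are the digits 0..9
toy = ' '.join(str(n) for n in range(10))
chunks = split_text(toy, max_len=4, step=2)
for chunk in chunks:
  print(chunk)
# prints:
# 0 1 2 3
# 2 3 4 5
# 4 5 6 7
# 6 7 8 9
# 8 9
```

Note that the final chunk is shorter than `max_len` and is fully contained in the chunk before it. With `max_len=512` and `step=128`, the same thing happens to the last few chunks of every text, which may be worth keeping in mind when sampling paragraphs for labeling.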

Apply the function to our dataset.

In [None]:
max_len = 512
step = 128

# split each text into chunks of up to 'max_len' words, advancing 'step' words per chunk
df['text'] = df['txt_short'].apply(split_text, max_len=max_len, step=step)
# give each chunk its own row
df = df.explode('text')
# renumber the rows and drop the original full-text column
df = df.reset_index(drop=True)
df = df.drop(columns=['txt_short'])
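
As a rough illustration of what `explode` does in this step, the sketch below uses a toy DataFrame whose `text` column already holds lists of chunks. The column values are made up for the example and do not come from the amicus dataset.

```python
import pandas as pd

# toy frame (illustrative only): each row holds a list of chunks
toy = pd.DataFrame({
    'id': [1, 2],
    'text': [['a b', 'b c'], ['d e']],
})

# one row per chunk; the other columns are repeated for each chunk
exploded = toy.explode('text').reset_index(drop=True)
print(exploded)
```

Each list element becomes its own row, and `id` is repeated so every chunk can still be traced back to its brief.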

## 3. Add 'fem' variable to stratify sample on

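`fem` is True if the brief was written in support of the feminist party, detected here by checking whether the brief name contains the substring "feminist" (case-sensitive).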

In [None]:
df['fem'] = ["feminist" in brief_name for brief_name in df['brief']]
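
Because the substring check is case-sensitive, a brief name that capitalizes the word would be marked False. A minimal sketch of a case-insensitive alternative follows, assuming that is the desired behavior. The brief names are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# hypothetical brief names, for illustration only
toy = pd.DataFrame({'brief': [
    'Case A - amicus brief (Feminist)',
    'Case A - amicus brief (opposition)',
    'Case B - amicus brief (feminist)',
]})

# case-sensitive check, as in the cell above
strict = ['feminist' in name for name in toy['brief']]

# case-insensitive literal match using pandas string methods
relaxed = toy['brief'].str.contains('feminist', case=False, regex=False)

print(strict)
print(relaxed.tolist())
```

Whether this matters depends on how the brief names were written, so comparing `df['fem'].value_counts()` under both checks is a quick way to find out.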

## 4. Randomly select 100 rows

Sample 50 text chunks from feminist briefs and 50 from opposing briefs (rows where `fem` is True and False, respectively).

In [None]:
random_rows = df.groupby('fem', group_keys=False).apply(lambda x: x.sample(50))
random_rows

Unnamed: 0,case,brief,id,text,fem
29413,Webster v Reproductive Health Services,Webster v Reproductive Health Services. Amicus...,861820441683,value for every human life regardless of its s...,False
4969,Doe v. Bolton,Doe v. Bolton. Motion to File Amici Brief for ...,861822371012,an action against the united states. sox v. un...,False
1051,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861822577570,children of twelve years to 17-year-old teenag...,False
5239,Frisby v. Schultz,Frisby_v_Schultz_AmicusBrief for Appellees_AFL...,861822568216,"issues, in pursuit of the first amendment's la...",False
3829,Diamond v. Charles,Diamond v Charles. Amicus Brief for Appellants...,861823621217,fair procedures. the expansive possibilities o...,False
...,...,...,...,...,...
30294,Williams v. Zbaraz,Williams v Zbaraz. Amici Brief for Appellees (...,861821316676,the court should apply settled principles of s...,True
1198,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861822582370,in an emergency situation to save the life or ...,True
4499,Doe v. Bolton,Doe v. Bolton. Motion to File Amici Brief for ...,861822362612,code draft abortion law suggested that all med...,True
14798,McCullen v. Coakley,McCullen v Coakley. Amicus Brief for Responden...,861823592192,that there is no link between abortion and bre...,True
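
The sample above changes on every run because no seed is set. If a reproducible evaluation set is wanted, one option is to pass `random_state` to the sampler. The sketch below uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) on a toy frame. The seed value and the toy data are assumptions for illustration.

```python
import pandas as pd

# toy frame (illustrative only): 12 'fem' chunks and 8 opposing chunks
toy = pd.DataFrame({
    'text': [f'chunk {i}' for i in range(20)],
    'fem': [True] * 12 + [False] * 8,
})

# draw the same number of rows from each group, with a fixed seed
sample_a = toy.groupby('fem').sample(n=3, random_state=0)
sample_b = toy.groupby('fem').sample(n=3, random_state=0)

# the two draws are identical because the seed is fixed
print(sample_a.equals(sample_b))
print(sample_a['fem'].value_counts())
```

Applied to the notebook's data, the equivalent call would use `n=50` on `df`. The rows selected would differ from the output shown above.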


## 5. Save

In [None]:
random_rows.to_excel('data/labeled_amicus.xlsx')