<a href="https://colab.research.google.com/github/amandakonet/amicus-iv/blob/main/nlp/preprocess_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amicus Data Preparation

In [None]:
# install the Hugging Face libraries
!pip install transformers
!pip install datasets

In [2]:
# deep learning imports
from transformers import pipeline
from datasets import load_dataset, Dataset, ClassLabel, load_from_disk, DatasetDict
from tokenizers import normalizers
from tokenizers.normalizers import BertNormalizer
from huggingface_hub import notebook_login

#import data science packages
import pandas as pd
import numpy as np
import seaborn as sns

#import file helper packages
import glob
import requests

Mount Google Drive and read in the data

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')
%cd gdrive/My\ Drive/amicus-iv

Mounted at /content/gdrive
/content/gdrive/My Drive/amicus-iv


In [4]:
df = pd.read_csv("data/shortened-amicus-brief-text.csv")
df.head(10)

Unnamed: 0,case,brief,id,txt_short
0,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815186515,many roe v wade killings are murder the eviden...
1,Anders v Floyd,Anders v Floyd - amicus brief for appellant (o...,861815187715,"for the 14th time, the supreme court is petiti..."
2,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823786898,in imposing a constitutional standard for pare...
3,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823789298,amici offer this brief for the limited purpose...
4,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823790498,new hampshire's parental notification law is a...
5,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823791698,parental involvement laws are in the best inte...
6,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823792898,1. this court has long recognized that the par...
7,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823794098,twenty-four years after this court first uphel...
8,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823795298,those who seek to prevent a properly-enacted s...
9,Ayotte v. PP,Ayotte v Planned Parenthood of Northern New En...,861823796498,parents are the undisputed guardians of their ...


## Preprocess Data

Preprocessing should be done in R and includes:

1. lowercase
2. remove appendices & footnotes

What preprocessing remains to be done here? Only normalization with `BertNormalizer`, because we'll be using BERT tokenizers and models. It brings the text into the same normalized form as the text that BERT tokenizers and models work with. This will be done later, just before the tokenization step.
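To make "the same normalized form" concrete, here is a rough standard-library sketch of two things the uncased BERT normalization does: lowercasing and accent stripping. It is only an approximation. The real `BertNormalizer` also cleans control characters and whitespace and pads CJK characters, which this sketch leaves out. The function name `approx_bert_normalize` is hypothetical.

```python
import unicodedata


def approx_bert_normalize(text):
    """Rough stand-in for uncased BERT normalization: lowercase, then strip accents."""
    text = text.lower()
    # NFD splits each accented character into a base character plus combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn") and keep the base characters
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(approx_bert_normalize("Résumé of the Naïve Brief"))
```

Since the R preprocessing already lowercases the text, the accent stripping is the part most likely to change anything here.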

## Split Data

Ignore this section. We can split the text into parts and tokenize it directly with tokenize() in Hugging Face. See any of the fine-tuning notebooks.

In [None]:
#def split_text(text, n):
  # split text on space
#  text = text.split()
  # grab tokens back into strings, with n words each 
#  text = [' '.join(text[i:i+n]) for i in range(0,len(text),n)]

#  return text

**Ignore.** Set n (the number of whitespace-separated words per chunk) and split each text into chunks of n words.

In [None]:
#n = 512
#df_512 = df.copy()
#df_512['txt_split'] = df_512.apply(lambda row: split_text(row['txt_short'], n), axis=1)
#df_512 = df_512.explode('txt_split')
#df_512.drop('txt_short', axis=1, inplace=True)
#len(df_512)

## Add brief_party variable

`brief_party` is 1 if the brief name contains "feminist" and 0 otherwise.

In [6]:
# 1 if the brief name contains "feminist" (case-sensitive), 0 otherwise
df['brief_party'] = df['brief'].str.contains('feminist', regex=False).astype(int)

df['brief_party'].value_counts()

0    411
1    330
Name: brief_party, dtype: int64
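The labelling rule is a case-sensitive substring match. The sketch below runs it on a few made-up brief names (hypothetical, not taken from the dataset) and compares it with a case-insensitive variant. Whether the real brief names contain capitalized forms such as "Feminist" is not known here, so treat the second variant as something to check rather than a required fix.

```python
from collections import Counter

# Hypothetical brief names, for illustration only
brief_names = [
    "Example v Example - amicus brief of feminist scholars",
    "Example v Example - amicus brief for appellant",
    "Example v Example - amicus brief of Feminist Law Professors",
]

# Same rule as above: 1 if the lowercase substring "feminist" occurs, else 0
labels = [int("feminist" in name) for name in brief_names]

# Case-insensitive variant: lowercase the name before matching
labels_ci = [int("feminist" in name.lower()) for name in brief_names]

print("case-sensitive:  ", labels, dict(Counter(labels)))
print("case-insensitive:", labels_ci, dict(Counter(labels_ci)))
```

If the two variants disagree on the real data, some feminist briefs are being labelled 0.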

## HuggingFace Dataset

Convert the pandas DataFrame into a Hugging Face `Dataset` for ease of use with transformer models.

In [22]:
data_ds = Dataset.from_pandas(df)
data_ds

Dataset({
    features: ['case', 'brief', 'id', 'txt_short', 'brief_party'],
    num_rows: 741
})

Rename the column holding the text to "text", the standard name.

In [18]:
data_ds = data_ds.rename_column('txt_short', 'text')

In [9]:
data_ds.info

DatasetInfo(description='', citation='', homepage='', license='', features={'case': Value(dtype='string', id=None), 'brief': Value(dtype='string', id=None), 'id': Value(dtype='int64', id=None), 'text': Value(dtype='string', id=None), 'brief_party': Value(dtype='int64', id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name=None, config_name=None, version=None, splits=None, download_checksums=None, download_size=None, post_processing_size=None, dataset_size=None, size_in_bytes=None)

Check that datatypes are correct

In [10]:
data_ds.features

{'brief': Value(dtype='string', id=None),
 'brief_party': Value(dtype='int64', id=None),
 'case': Value(dtype='string', id=None),
 'id': Value(dtype='int64', id=None),
 'text': Value(dtype='string', id=None)}

## Train/Test/Val Split

In [23]:
# start with train/test split
data_ds = data_ds.train_test_split(test_size=0.2, seed=727) #729

# split training into train and validation
train_val_ds = data_ds['train'].train_test_split(test_size=0.3, seed=451)

# update original ds with re-split training and validation
data_ds['train'] = train_val_ds['train']
data_ds['valid'] = train_val_ds['test']

In [24]:
data_ds

DatasetDict({
    train: Dataset({
        features: ['case', 'brief', 'id', 'txt_short', 'brief_party'],
        num_rows: 414
    })
    test: Dataset({
        features: ['case', 'brief', 'id', 'txt_short', 'brief_party'],
        num_rows: 149
    })
    valid: Dataset({
        features: ['case', 'brief', 'id', 'txt_short', 'brief_party'],
        num_rows: 178
    })
})
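The row counts above follow from the two `test_size` values. Here is a small arithmetic sketch, assuming the held-out share is rounded up to a whole row (which is consistent with the counts shown). The helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the held-out share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


n_total = 741
n_train_full, n_test = split_sizes(n_total, 0.2)   # first split: train vs. test
n_train, n_valid = split_sizes(n_train_full, 0.3)  # second split: train vs. valid

for name, n in [("train", n_train), ("test", n_test), ("valid", n_valid)]:
    print(f"{name}: {n} rows ({n / n_total:.1%} of all rows)")
```

Because the validation set is 30% of the remaining training rows rather than of the full data, it ends up larger than the test set.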

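Check the class balance, since the split above was not stratified by `brief_party`.

All three splits stay close to the overall 55:45 ratio of opposing to feminist briefs: 185 of 414 train rows, 68 of 149 test rows, and 77 of 178 valid rows are feminist.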

In [25]:
ds_split = data_ds.filter(lambda x: x['brief_party'] == 1)
ds_split

DatasetDict({
    train: Dataset({
        features: ['case', 'brief', 'id', 'txt_short', 'brief_party'],
        num_rows: 185
    })
    test: Dataset({
        features: ['case', 'brief', 'id', 'txt_short', 'brief_party'],
        num_rows: 68
    })
    valid: Dataset({
        features: ['case', 'brief', 'id', 'txt_short', 'brief_party'],
        num_rows: 77
    })
})
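If a guaranteed balance were needed, the split could be stratified by hand. The following is a minimal standard-library sketch of the idea, not what this notebook does: shuffle the row indices within each label and hold out the same fraction from each. The function names are hypothetical, and the label counts are taken from the `value_counts()` output earlier. Newer `datasets` releases, as far as I know, also accept a `stratify_by_column` argument in `train_test_split` for `ClassLabel` columns, which would be the simpler route if available.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with each label's share roughly preserved."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # shuffle within the label group only
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def feminist_share(labels, indices):
    """Fraction of the given rows that carry label 1."""
    return sum(labels[i] for i in indices) / len(indices)


# Label counts from the value_counts() output: 411 zeros, 330 ones
labels = [0] * 411 + [1] * 330
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=727)

print(f"train: {len(train_idx)} rows, {feminist_share(labels, train_idx):.1%} feminist")
print(f"test:  {len(test_idx)} rows, {feminist_share(labels, test_idx):.1%} feminist")
```

The same function can be applied again to the training indices to carve out a validation set.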

## Sharing data to HuggingFace Hub

This allows us to easily access data in the correct format

In [26]:
# save to disk
data_ds.save_to_disk('./demo_data')

# load from disk
loaded_ds = load_from_disk('./demo_data')
#loaded_ds

In [27]:
# must run this to be able to push to huggingface
!git config --global credential.helper store

Use the write token from your own Hugging Face account. Enter that token here to save your login credentials.

In [28]:
notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token


Push to the Hub under my account

In [29]:
data_ds.push_to_hub('repro-rights-amicus-briefs/repro-rights-amicus', private = True)

Pushing split train to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split test to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split valid to the Hub.
The repository already exists: the `private` keyword argument will be ignored.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]