
In [None]:
%pip install datasets

# SCOTUS
The US Supreme Court (SCOTUS) is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases that have not been sufficiently well resolved by lower courts. This is a single-label multi-class classification task: given a document (a court opinion), the task is to predict its relevant issue area. The 14 issue areas cluster 278 issues whose focus is on the subject matter of the controversy (dispute).

Sources: https://huggingface.co/datasets/coastalcph/lex_glue, http://scdb.wustl.edu/

In [3]:
from datasets import load_dataset


scotus_dataset = load_dataset("coastalcph/lex_glue", "scotus")



In [4]:
scotus_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 1400
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 1400
    })
})
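
The split sizes reported above can be turned into proportions with a little arithmetic. The sketch below hard-codes the row counts from the `DatasetDict` output rather than reading them from the dataset; `split_sizes` and `split_share` are illustrative names, not part of the notebook.

```python
# Row counts copied from the DatasetDict output above
split_sizes = {"train": 5000, "test": 1400, "validation": 1400}
total_rows = sum(split_sizes.values())

# Share of the corpus held by each split, as a percentage
split_share = {name: round(100 * n / total_rows, 1) for name, n in split_sizes.items()}

print(total_rows)
print(split_share)
```

The total matches the 7800 rows that the merged DataFrame reports later in the notebook.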

**Converting Dataset to a pandas DataFrame**

In [5]:
import pandas as pd


def convert_to_dict_with_splits(dataset):
    """Convert a Hugging Face DatasetDict into a dict of pandas DataFrames, one per split."""
    # Iterating a DatasetDict yields its split names ('train', 'test', 'validation')
    return {split: dataset[split].to_pandas() for split in dataset}


scotus_dict = convert_to_dict_with_splits(scotus_dataset)

train_df = scotus_dict['train']
test_df = scotus_dict['test']
validation_df = scotus_dict['validation']

print(train_df.head())


                                                text  label
0  329 U.S. 29\n67 S.Ct. 1\n91 L.Ed. 22\nCHAMPLIN...      7
1  329 U.S. 1\n67 S.Ct. 6\n91 L.Ed. 3\nHALLIBURTO...      7
2  329 U.S. 14\n67 S.Ct. 13\n91 L.Ed. 12\nCLEVELA...      0
3  329 U.S. 40\n67 S.Ct. 167\n91 L.Ed. 29\nUNITED...      1
4  329 U.S. 90\n67 S.Ct. 133\n91 L.Ed. 103\nAMERI...      7


In [6]:
# Concatenate the three splits only if they share the same columns in the same order
if list(train_df.columns) == list(test_df.columns) == list(validation_df.columns):
    merged_df = pd.concat([train_df, test_df, validation_df], ignore_index=True)
    print(merged_df.head())
else:
    print("Error: DataFrames do not have the same columns. Cannot concatenate.")

                                                text  label
0  329 U.S. 29\n67 S.Ct. 1\n91 L.Ed. 22\nCHAMPLIN...      7
1  329 U.S. 1\n67 S.Ct. 6\n91 L.Ed. 3\nHALLIBURTO...      7
2  329 U.S. 14\n67 S.Ct. 13\n91 L.Ed. 12\nCLEVELA...      0
3  329 U.S. 40\n67 S.Ct. 167\n91 L.Ed. 29\nUNITED...      1
4  329 U.S. 90\n67 S.Ct. 133\n91 L.Ed. 103\nAMERI...      7
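
Concatenating the splits this way drops the information about which split each row came from. If that provenance is needed later, one option is to tag each frame before stacking. This is a minimal sketch on toy frames standing in for `train_df`, `test_df` and `validation_df`; the `split` column and the names `toy_splits` and `merged_with_split` are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for train_df, test_df and validation_df (hypothetical rows)
toy_splits = {
    "train": pd.DataFrame({"text": ["a b c", "d e"], "label": [7, 0]}),
    "test": pd.DataFrame({"text": ["f g h i"], "label": [1]}),
    "validation": pd.DataFrame({"text": ["j"], "label": [7]}),
}

# Tag each frame with its split name before stacking them
tagged = [df.assign(split=name) for name, df in toy_splits.items()]
merged_with_split = pd.concat(tagged, ignore_index=True)

print(merged_with_split["split"].value_counts().to_dict())
```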


In [8]:
print(merged_df['label'].unique())
print(merged_df['label'].nunique())

[ 7  0  1  6  3  8 11  2  4 10  9  5 12]
13


**Mapping label IDs to issue-area names**

In [19]:
# Label ID -> issue-area name
labels_dict = {
    0: "Criminal Procedure",
    1: "Civil Rights",
    2: "First Amendment",
    3: "Due Process",
    4: "Privacy",
    5: "Attorneys",
    6: "Unions",
    7: "Economic Activity",
    8: "Judicial Power",
    9: "Federalism",
    10: "Interstate Relations",
    11: "Federal Taxation",
    12: "Miscellaneous",
}
merged_df['Label_text'] = merged_df['label'].map(labels_dict)
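
`Series.map` leaves `NaN` wherever a label ID has no entry in the dictionary, so a quick check for unmapped IDs can catch an incomplete mapping. The sketch below uses a hypothetical three-entry mapping and a toy frame in which ID 99 is deliberately missing; `toy_labels_dict`, `toy_df` and `unmapped_ids` are illustrative names.

```python
import pandas as pd

# Hypothetical mapping and labels; ID 99 is deliberately absent from the mapping
toy_labels_dict = {0: "Criminal Procedure", 1: "Civil Rights", 7: "Economic Activity"}
toy_df = pd.DataFrame({"label": [7, 0, 1, 99]})

toy_df["Label_text"] = toy_df["label"].map(toy_labels_dict)

# Series.map leaves NaN where the key is absent, so collect those IDs explicitly
unmapped_ids = sorted(toy_df.loc[toy_df["Label_text"].isna(), "label"].unique().tolist())

print(unmapped_ids)
```

An empty list means every label ID received a name.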

**Exploring the number of cases per issue area in the merged dataset**

In [20]:
import plotly.express as px

# Count occurrences of each label; rename_axis names the label column
# explicitly so the result is the same across pandas versions
label_counts = (
    merged_df['Label_text']
    .value_counts()
    .rename_axis('Label_text')
    .reset_index(name='Count')
)

# Create the bar plot with text labels
fig = px.bar(label_counts,
             x='Label_text',
             y='Count',
             text='Count',  # Display count as text on bars
             labels={'Label_text': 'Case Type', 'Count': 'Number of Cases'},
             title='Number of Cases in Each Unique Case Type',
             hover_data={'Label_text': True, 'Count': True})

# Update text position (optional: can be 'inside', 'outside', etc.)
fig.update_traces(textposition='outside')

# Display the plot
fig.show()


In [21]:
label_counts

           Label_text  Count
0  Criminal Procedure   1743
1   Economic Activity   1529
2        Civil Rights   1251
3      Judicial Power   1082
4     First Amendment    619
5          Federalism    357
6              Unions    330
7         Due Process    314
8    Federal Taxation    295
9             Privacy     95
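
Raw counts are easier to compare as shares. Among the rows shown, the largest class (Criminal Procedure, 1743) is roughly 18 times the size of the smallest (Privacy, 95). The sketch below shows one way to compute percentages and a simple max/min imbalance ratio. It runs on a toy label column rather than `merged_df`, and `toy_labels`, `label_share` and `imbalance_ratio` are illustrative names.

```python
import pandas as pd

# Hypothetical label column standing in for merged_df['Label_text']
toy_labels = pd.Series(
    ["Criminal Procedure"] * 6 + ["Economic Activity"] * 3 + ["Privacy"] * 1,
    name="Label_text",
)

# normalize=True returns fractions that sum to 1 instead of raw counts
label_share = toy_labels.value_counts(normalize=True).mul(100).round(1)

# Ratio between the most and least frequent class as a rough imbalance measure
counts = toy_labels.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(label_share.to_dict())
print(imbalance_ratio)
```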


In [22]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7800 entries, 0 to 7799
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        7800 non-null   object
 1   label       7800 non-null   int64 
 2   Label_text  7800 non-null   object
dtypes: int64(1), object(2)
memory usage: 182.9+ KB


In [23]:
merged_df.isnull().sum()

text          0
label         0
Label_text    0


# Sample Case

In [24]:
merged_df['text'][0]

"329 U.S. 29\n67 S.Ct. 1\n91 L.Ed. 22\nCHAMPLIN REFINING COv.UNITED STATES et al.\nNo. 21.\nArgued Oct. 18, 21, 1946.\nDecided Nov. 18, 1946.\nRehearing Denied Dec. 16, 1946.\n\nSee 329 U.S. 831, 67 S.Ct. 363.\nAppeal from the District Court of the United States for the Western District of Oklahoma.\nMessrs.Dan Moody, of Austin, Tex., and Harry O. Glasser, of Enid, Okla., for appellant.\nMr. Edward Dumbauld, of Washington, D.C., for appel-\n[Argument of Counsel from page 30 intentionally omitted]\nlees. Mr. Justice JACKSON delivered the opinion of the Court.\n\n\n1\nThe Interstate Commerce Commission, acting under § 19a of the Interstate Commerce Act,1 ordered the appellant to furnish certain inventories, schedules, maps and charts of its pipe line property.2 Champlin's objections that the Act does not authorize the order, or if it be construed to do so is unconstitutional, were overruled by the Commission and again by the District Court which dismissed the company's suit for an injunc

# Average number of words per case

In [27]:
# Whitespace-delimited word count for every document
word_counts = merged_df['text'].str.split().str.len()
print(f'On average there are {round(word_counts.mean(), 0)} words per document')


On average there are 6860.0 words per document
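
A mean alone can hide a skewed length distribution, since a few very long opinions can pull it upward. A hedged sketch of how the mean could be compared with the median and maximum follows. It uses toy strings in place of `merged_df['text']`, and `toy_texts` and `summary` are illustrative names.

```python
import pandas as pd

# Hypothetical documents standing in for merged_df['text']
toy_texts = pd.Series(["one two three", "four five", "six", "seven eight nine ten " * 5])

# Whitespace-delimited word count per document
word_counts = toy_texts.str.split().str.len()

# Comparing mean and median shows whether a few long documents pull the mean up
summary = word_counts.agg(["mean", "median", "max"])

print(summary)
```

In this toy example the single 20-word document lifts the mean well above the median.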


In [34]:
# Average number of whitespace-delimited words per document for each Label_text
average_words_per_label = (
    merged_df.assign(word_count=merged_df['text'].str.split().str.len())
    .groupby('Label_text')['word_count']
    .mean()
    .round()
    .astype(int)
    .reset_index(name='Average_Words_per_case')
)

average_words_per_label


             Label_text  Average_Words_per_case
0             Attorneys                    7319
1          Civil Rights                    7501
2    Criminal Procedure                    6770
3           Due Process                    7717
4     Economic Activity                    6701
5      Federal Taxation                    5134
6            Federalism                    7340
7       First Amendment                    9486
8  Interstate Relations                    4004
9        Judicial Power                    4770
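
Per-label averages are easier to interpret next to the number of documents behind them, because an average for a rare class rests on few cases. The sketch below uses named aggregation to report the count, mean and median together. The toy frame and the names `per_label`, `n_cases`, `mean_words` and `median_words` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for merged_df
toy_df = pd.DataFrame({
    "Label_text": ["Privacy", "Privacy", "Unions", "Unions", "Unions"],
    "text": ["a b", "c d e f", "g", "h i", "j k l"],
})

# One row per label: number of documents plus mean and median word counts
per_label = (
    toy_df.assign(word_count=toy_df["text"].str.split().str.len())
    .groupby("Label_text")["word_count"]
    .agg(n_cases="count", mean_words="mean", median_words="median")
    .reset_index()
)

print(per_label)
```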


In [43]:
import altair as alt


chart = alt.Chart(average_words_per_label).mark_bar().encode(
    x=alt.X('Label_text',
            sort='-y',
            axis=alt.Axis(labelAngle=-45, labelFontSize=14)),
    y=alt.Y('Average_Words_per_case',
            axis=alt.Axis(labelFontSize=14)),
    tooltip=['Label_text', 'Average_Words_per_case']
).properties(
    width=1000,
    height=400,
    title=alt.TitleParams(
        "Average Words per Case Type",
        fontSize=20
    )
).interactive()


chart