We'll be covering text classification and regression methods over the next month; in preparation for this topic, your assignment is to gather labeled data to use for your analysis.

* Find at least 300 documents for some topic that interests you, along with a single binary label for each document.  Aim high if you can; the more data in your collection, the better your classification models will tend to perform on it.

* Split your data into three non-overlapping files (train.tsv, dev.tsv and test.tsv), with train.tsv containing 80% of the documents, dev.tsv 10% and test.tsv 10%.

* All of the data must be in a common format; we'll use a tab-separated format with the label in the first column and the full text in the second column. Replace all newlines in the text with \_NEWLINE\_ and tab characters with \_TAB\_.

See data/text_classification_sample/ for an example.  Execute this Jupyter notebook to verify that your format is correct.
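
The escaping rule above can be sketched as follows. This is a minimal illustration rather than the course's reference implementation: `format_row` is a hypothetical helper name, and treating `\r\n` as a single newline is an assumption.

```python
def format_row(label, text):
    """Return one TSV line: the label, a tab, then the escaped text."""
    # Assumption: normalize Windows and old Mac line endings to "\n" first
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Escape the two characters that would break a one-document-per-line file
    text = text.replace("\n", "_NEWLINE_").replace("\t", "_TAB_")
    return f"{label}\t{text}"


row = format_row("liberal", "First line\nSecond\tline")
print(row)
```

After escaping, the only tab left in each row is the one separating the label from the text, which is what the format check relies on.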

Your choice of documents and labels is completely up to you (except for any data already used in class in the data/ folder).  Possible sources of data:

* Project Gutenberg.  Metadata is available at this [GitHub repo](https://github.com/hugovk/gutenberg-metadata) along with URLs for the texts.  Labels here can be author, subject, author gender, etc.

* Crawl news articles from different domains (e.g., CNN, Fox News); the label for each article is the domain.

* [Movie summary data](http://www.cs.cmu.edu/~ark/personas/).  Labels here can be any categorical metadata aspect (genre, release date); note real-valued metadata (like box office, runtime) can be binarized by selecting some threshold.

* [Download your own tweets](https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive).  Labels here can be any categorical metadata included in the tweet, or labels you add by hand (e.g., sarcasm).
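
Binarizing a real-valued field by a threshold, as suggested for the movie metadata, might look like the sketch below. The frame, its column names, and the choice of the median as the threshold are all assumptions made for illustration.

```python
import pandas as pd

# Toy data; the column names and values are invented for illustration
movies = pd.DataFrame({
    "summary": ["a heist goes wrong", "two friends travel",
                "a long war epic", "a short comedy"],
    "runtime": [95.0, 110.0, 180.0, 80.0],
})

# Assumption: the median is used as the threshold so the classes are roughly balanced
threshold = movies["runtime"].median()
movies["label"] = (movies["runtime"] > threshold).map({True: "long", False: "short"})

print(threshold)
print(movies["label"].tolist())
```

Any other threshold works the same way; the median is only a convenient default when no natural cut-off exists.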


In [1]:
import sys
from collections import Counter

In [2]:
## Data import and cleaning
import pandas as pd
import numpy as np

# Source: https://www.kaggle.com/gqfiddler/scotus-opinions
scotus = pd.read_csv("../data/my_datasets/more_data/scotus_opinions.csv")
scotus.head()

Unnamed: 0,author_name,category,per_curiam,case_name,date_filed,federal_cite_one,absolute_url,cluster,year_filed,scdb_id,scdb_decision_direction,scdb_votes_majority,scdb_votes_minority,text
0,Justice Roberts,majority,False,McCutcheon v. Federal Election Comm'n,2014-04-02,,https://www.courtlistener.com/opinion/2659301/...,https://www.courtlistener.com/api/rest/v3/clus...,2014,2013-033,1.0,5.0,4.0,There is no right more basic in our democracy ...
1,Justice Thomas,concurring,False,McCutcheon v. Federal Election Comm'n,2014-04-02,,https://www.courtlistener.com/opinion/2659301/...,https://www.courtlistener.com/api/rest/v3/clus...,2014,2013-033,1.0,5.0,4.0,I adhere to the view that this Court’s decisio...
2,Justice Breyer,dissenting,False,McCutcheon v. Federal Election Comm'n,2014-04-02,,https://www.courtlistener.com/opinion/2659301/...,https://www.courtlistener.com/api/rest/v3/clus...,2014,2013-033,1.0,5.0,4.0,"Nearly 40 years ago in Buckley v. Valeo, 424 U..."
3,Justice Taney,majority,False,Ex Parte Crenshaw,1841-02-18,40 U.S. 119,https://www.courtlistener.com/opinion/86166/ex...,https://www.courtlistener.com/api/rest/v3/clus...,1841,1841-005,2.0,9.0,0.0,This case was brought here by an appeal from t...
4,Justice Pitney,majority,False,Richards v. Washington Terminal Co.,1914-05-04,233 U.S. 546,https://www.courtlistener.com/opinion/98178/ri...,https://www.courtlistener.com/api/rest/v3/clus...,1914,1913-149,1.0,8.0,1.0,"Plaintiff in error, who was plaintiff below, c..."


In [3]:
# Dropping per curiam opinions b/c they don't necessarily take a side
# and are not reflective of most decisions (according to documentation)
scotus_clean = scotus[scotus["category"] != "per_curiam"]  
scotus_clean = scotus_clean[scotus_clean["per_curiam"] == False] # being extra careful to clean this

# Converting to a binary decision class: agree = majority + concurring; disagree = dissenting + second_dissenting
binary_decision = []
for decision in scotus_clean["category"]:
    if decision == "majority" or decision == "concurring":
        binary_decision.append("agree")
    elif decision == "dissenting" or decision == "second_dissenting":
        binary_decision.append("disagree")

scotus_clean.loc[:,"binary_decision"] = np.array(binary_decision)

# Adding a binary ideology category based on scdb decision criteria
# 1 = conservative, 2 = liberal, 3 = unspecifiable
# http://scdb.wustl.edu/documentation.php?var=decisionDirection 
binary_ideology = []
for decision in scotus_clean["scdb_decision_direction"]:
    if decision == 1.0:
        binary_ideology.append("conservative")
    elif decision == 2.0:
        binary_ideology.append("liberal")
    else:
        binary_ideology.append("unspecifiable")
        
scotus_clean.loc[:,"ideology"] = np.array(binary_ideology)

# Dropping unspecifiable cases
scotus_clean = scotus_clean[scotus_clean["ideology"] != "unspecifiable"]

# cleaning justice name
scotus_clean["author_name"] = [name[8:].lower() for name in scotus_clean["author_name"]]

# finding unanimous cases
unanimous = []
for index in scotus_clean.index:
    majority = scotus_clean["scdb_votes_majority"][index]
    minority = scotus_clean["scdb_votes_minority"][index]
    total = majority + minority
    if majority == total or majority == 0:
        unanimous.append(1)
    else:
        unanimous.append(0)
        
scotus_clean["unanimous"] = unanimous

# Choosing columns of interest
scotus_clean = scotus_clean[["author_name", "binary_decision", "ideology", "case_name", "date_filed", "unanimous", "scdb_votes_majority", "scdb_votes_minority", "text"]]
scotus_clean.head()

Unnamed: 0,author_name,binary_decision,ideology,case_name,date_filed,unanimous,scdb_votes_majority,scdb_votes_minority,text
0,roberts,agree,conservative,McCutcheon v. Federal Election Comm'n,2014-04-02,0,5.0,4.0,There is no right more basic in our democracy ...
1,thomas,agree,conservative,McCutcheon v. Federal Election Comm'n,2014-04-02,0,5.0,4.0,I adhere to the view that this Court’s decisio...
2,breyer,disagree,conservative,McCutcheon v. Federal Election Comm'n,2014-04-02,0,5.0,4.0,"Nearly 40 years ago in Buckley v. Valeo, 424 U..."
3,taney,agree,liberal,Ex Parte Crenshaw,1841-02-18,1,9.0,0.0,This case was brought here by an appeal from t...
4,pitney,agree,conservative,Richards v. Washington Terminal Co.,1914-05-04,0,8.0,1.0,"Plaintiff in error, who was plaintiff below, c..."


In [4]:
import re

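def clean_text(string):
    # Escape tabs and newlines before collapsing whitespace: \s matches both,
    # so collapsing first would turn them into plain spaces
    no_tabs = re.sub(r"\t+", "_TAB_", string)
    no_new_lines = re.sub(r"[\n\r]+", "_NEWLINE_", no_tabs)
    # Collapse any remaining runs of whitespace into a single space
    return re.sub(r"\s+", " ", no_new_lines)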

In [6]:
# separating the data into separate sets
# shuffling first for random selection
shuffled_scotus = scotus_clean.sample(n = len(scotus_clean), replace = False, random_state = 12345)
shuffled_scotus["text"] = [clean_text(text) for text in shuffled_scotus["text"]]

# getting indices for separation
a = int(len(shuffled_scotus) * 0.8)
b = int(len(shuffled_scotus) * 0.1)

In [22]:
### NOTE FOR SELF: THIS HAS BEEN ADDED LATER TO MAKE DATASET EASY TO CLEAN; DELETE TO CREATE FULL DATASET ###
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

In [94]:
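indices_to_keep = []
# Read the dates from the shuffled frame so the positions collected here
# line up with the .iloc lookup on shuffled_scotus
dates = pd.to_datetime(shuffled_scotus["date_filed"])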

for i in np.arange(len(dates)):
    if dates.iloc[i].year >= 1900:
        indices_to_keep.append(i)

In [116]:
post_1900 = shuffled_scotus.iloc[indices_to_keep].sample(600, replace = False)
post_1900.head()

Unnamed: 0,author_name,binary_decision,ideology,case_name,date_filed,unanimous,scdb_votes_majority,scdb_votes_minority,text
13397,ginsburg,agree,conservative,Cleveland v. United States,2000-11-07,1,9.0,0.0,This case presents the question whether the fe...
33823,clark,agree,liberal,Office Employes v. NLRB,1957-06-17,0,5.0,4.0,This case concerns the attempt of the petition...
27952,delivered,agree,liberal,McGuire v. Commonwealth,1866-02-19,1,9.0,0.0,"I. The first motion now made is, that in case ..."
1523,ginsburg,disagree,conservative,Gonzales v. Carhart,2007-04-18,0,5.0,4.0,In Planned Parenthood of Southeastern Pa. v. C...
18697,souter,agree,conservative,"Rowland v. California Men's Colony, Unit II Me...",1993-01-12,0,5.0,4.0,"Title 28 U.S. C. § 1915, providing for appeara..."
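
Filtering by year can also be written with a boolean mask, which keeps rows aligned by label regardless of how the frame was shuffled. A minimal sketch on a toy frame; the column names mirror the notebook, the rows are invented, and `post_1900_demo` is a hypothetical name.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, rows are invented
df = pd.DataFrame({
    "case_name": ["A v. B", "C v. D", "E v. F", "G v. H"],
    "date_filed": ["1866-02-19", "1957-06-17", "2000-11-07", "1899-12-31"],
})

# Shuffle first, as the notebook does, to show the mask does not depend on row order
shuffled = df.sample(frac=1, random_state=0)

# The mask is built from the same frame it filters, so rows stay aligned
years = pd.to_datetime(shuffled["date_filed"]).dt.year
post_1900_demo = shuffled[years >= 1900]

print(sorted(post_1900_demo["case_name"]))
```

Because the mask carries the frame's own index, no list of positions has to be kept in sync by hand.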


In [117]:
# This cell is to create a usable dataset for the assignments

def porter_tokenize_and_join(string):
    # \W matches anything that is not a letter, digit, or underscore
    no_punct = re.sub(r"\W", " ", string)
    # split() with no argument collapses whitespace runs and drops empty tokens;
    # keep only the first 300 tokens, then stem each one
    stems = [ps.stem(w) for w in no_punct.split()[:300]]
    return " ".join(stems)


In [119]:
# apply using pandas on the shuffled_scotus data
post_1900["clean_text"] = post_1900["text"].apply(porter_tokenize_and_join)

In [120]:
post_1900.head()

Unnamed: 0,author_name,binary_decision,ideology,case_name,date_filed,unanimous,scdb_votes_majority,scdb_votes_minority,text,clean_text
13397,ginsburg,agree,conservative,Cleveland v. United States,2000-11-07,1,9.0,0.0,This case presents the question whether the fe...,thi case present the question whether the fede...
33823,clark,agree,liberal,Office Employes v. NLRB,1957-06-17,0,5.0,4.0,This case concerns the attempt of the petition...,thi case concern the attempt of the petition l...
27952,delivered,agree,liberal,McGuire v. Commonwealth,1866-02-19,1,9.0,0.0,"I. The first motion now made is, that in case ...",i the first motion now made is that in case th...
1523,ginsburg,disagree,conservative,Gonzales v. Carhart,2007-04-18,0,5.0,4.0,In Planned Parenthood of Southeastern Pa. v. C...,in plan parenthood of southeastern pa v casey ...
18697,souter,agree,conservative,"Rowland v. California Men's Colony, Unit II Me...",1993-01-12,0,5.0,4.0,"Title 28 U.S. C. § 1915, providing for appeara...",titl 28 u s c 1915 provid for appear in forma ...


In [121]:
a = int(len(post_1900) * 0.8)
b = int(len(post_1900) * 0.1)

train = post_1900.iloc[:a][["ideology", "clean_text"]]
dev = post_1900.iloc[a:a+b][["ideology", "clean_text"]]
test = post_1900.iloc[a+b:][["ideology", "clean_text"]]

In [122]:
train.head()

Unnamed: 0,ideology,clean_text
13397,conservative,thi case present the question whether the fede...
33823,liberal,thi case concern the attempt of the petition l...
27952,liberal,i the first motion now made is that in case th...
1523,conservative,in plan parenthood of southeastern pa v casey ...
18697,conservative,titl 28 u s c 1915 provid for appear in forma ...
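
The 80/10/10 slicing can be sanity-checked for overlap and coverage. The sketch below uses an invented 600-row frame as a stand-in for the sample; the `*_demo` names are hypothetical.

```python
import pandas as pd

# Toy stand-in for a 600-row sample; contents are invented
docs = pd.DataFrame({"label": ["x"] * 600,
                     "text": [f"doc {i}" for i in range(600)]})

a = int(len(docs) * 0.8)  # end of the training slice
b = int(len(docs) * 0.1)  # size of the dev slice

train_demo = docs.iloc[:a]
dev_demo = docs.iloc[a:a + b]
test_demo = docs.iloc[a + b:]

# No row label may appear in more than one split, and no row may be lost
seen = set(train_demo.index) | set(dev_demo.index) | set(test_demo.index)
assert len(seen) == len(train_demo) + len(dev_demo) + len(test_demo) == len(docs)

print(len(train_demo), len(dev_demo), len(test_demo))
```

Letting the test slice run to the end of the frame means any rows lost to rounding in `a` and `b` land in test instead of being dropped.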


In [123]:
# exporting data
train.to_csv('../data/my_datasets/shortened/train.tsv', sep = '\t', index=False, header = False)
dev.to_csv('../data/my_datasets/shortened/dev.tsv', sep = '\t', index=False, header = False)
test.to_csv('../data/my_datasets/shortened/test.tsv', sep = '\t', index=False, header = False)


### NEW STUFF ENDS HERE ###

In [6]:
### RETURNING TO ORIGINAL NOTEBOOK ###

# exporting data
train.to_csv('../data/my_datasets/more_data/train_full.tsv', sep = '\t', index=False)
dev.to_csv('../data/my_datasets/more_data/dev_full.tsv', sep = '\t', index=False)
test.to_csv('../data/my_datasets/more_data/test_full.tsv', sep = '\t', index=False)

In [7]:
# smaller version for class assignment
# saving above dataset for future analysis
scotus_small = scotus_clean[["ideology", "text"]].sample(n = len(scotus_clean), replace = False, random_state = 12345)

scotus_small["text"] = [clean_text(text) for text in scotus_small["text"]]

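# recompute the split points from this frame so they do not depend on earlier cells
a = int(len(scotus_small) * 0.8)
b = int(len(scotus_small) * 0.1)

train_small = scotus_small.iloc[:a]
dev_small = scotus_small.iloc[a:a+b]
test_small = scotus_small.iloc[a+b:]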

train_small.to_csv('../data/my_datasets/train.tsv', sep = "\t", index=False, header = False)
dev_small.to_csv('../data/my_datasets/dev.tsv', sep = '\t', index=False, header = False)
test_small.to_csv('../data/my_datasets/test.tsv', sep = '\t', index=False, header = False)

Q1: Describe your data.  What is the source of the documents, and what do the labels mean?

My dataset comes from Kaggle (https://www.kaggle.com/gqfiddler/scotus-opinions) and contains all Supreme Court of the United States opinions on record from the Court's founding to 2020. I did some data cleaning (described above): mainly collapsing extra whitespace, converting tabs and newlines to the format the assignment specifies, and simplifying some fields (e.g., stripping the title from the authoring justice's name and consolidating the opinion categories).

The final labels come from the Supreme Court Database (Washington University Law) variable `scdb_decision_direction` (described here: http://scdb.wustl.edu/documentation.php?var=decisionDirection), which codes each case as `conservative` (1), `liberal` (2), or `unspecifiable` (3) based on the issue and the Court's decision. I dropped the `unspecifiable` cases and converted the remaining numeric codes to the labels `liberal` and `conservative`, so each opinion carries the label of its case.
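
The label mapping described here can be expressed compactly with `Series.map`. This is a sketch on invented values, not the exact code used above; codes absent from the dictionary (3.0 and missing values) become NaN and are dropped.

```python
import pandas as pd

# Toy values for scdb_decision_direction: 1 = conservative, 2 = liberal, 3 = unspecifiable
directions = pd.Series([1.0, 2.0, 3.0, None, 2.0], name="scdb_decision_direction")

# Codes missing from the dict (3.0 and NaN here) map to NaN and are dropped
labels = directions.map({1.0: "conservative", 2.0: "liberal"}).dropna()

print(labels.tolist())
```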


Q2: Change the directory name below to the directory containing your data and execute the `test()` function below to verify the data is in the correct format:

In [8]:
def test(directory):
    for split in ["train", "dev", "test"]:
        filename="%s/%s.tsv" % (directory, split)
        with open(filename) as file:
            labelCounts=Counter()
            zeroLength=0
            total=0
            for line in file:
                # strip only the newline so an empty text column is counted rather than lost
                cols=line.rstrip("\n").split("\t")
                label=cols[0]
                text=cols[1]
                if len(text) == 0:
                    zeroLength+=1
                total+=1

                labelCounts[label]+=1

            print ("File: %s, Total docs: %s, Total zero length: %s" % (filename, total, zeroLength))
            for label in sorted(labelCounts):
                print ("\t%s %s" % (label, labelCounts[label]))
            print()

In [9]:
directory="../data/text_classification_sample_data"
test(directory)

File: ../data/text_classification_sample_data/train.tsv, Total docs: 2723, Total zero length: 0
	D 1350
	R 1373

File: ../data/text_classification_sample_data/dev.tsv, Total docs: 257, Total zero length: 0
	D 127
	R 130

File: ../data/text_classification_sample_data/test.tsv, Total docs: 858, Total zero length: 0
	D 391
	R 467



In [10]:
my_data = "../data/my_datasets"
test(my_data)

File: ../data/my_datasets/train.tsv, Total docs: 22894, Total zero length: 0
	conservative 11060
	liberal 11834

File: ../data/my_datasets/dev.tsv, Total docs: 2861, Total zero length: 0
	conservative 1413
	liberal 1448

File: ../data/my_datasets/test.tsv, Total docs: 2863, Total zero length: 0
	conservative 1366
	liberal 1497

