## Getting All Cases in the Court History
Supreme Court cases are publicly available (obviously), but difficult to find in full in one place. I eventually found a site where I could pick up all of them. If you've ever used BeautifulSoup before, you know how valuable a single complete source is: one request function scales to every page without any extra work, whereas disparate sources each need their own request and parsing functions.

In [63]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
#23268 = total cases

### Our year and case title df from previous notebook
We will use apply on this df (a really fantastic pandas feature that lets you run a function on every row of a df, or every value of a column, with a single line of code).
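
As a quick illustration of what apply does, here is a minimal sketch on a toy df. The column names mirror the ones used in this notebook, but the two rows and the `title_length` helper are made up for the example.

```python
import pandas as pd

# Toy stand-in for the real df; these rows are invented for illustration.
toy = pd.DataFrame({
    "casetitle": ["A v. B", "C v. D CORP."],
    "years": [1965, 1966],
})

def title_length(title):
    """Hypothetical helper: number of characters in a case title."""
    return len(title)

# Series.apply calls title_length once per value in the column
toy["title_len"] = toy.casetitle.apply(title_length)

# DataFrame.apply with axis=1 passes each whole row to the function instead
toy["label"] = toy.apply(
    lambda row: f"{row['casetitle']} ({row['years']})", axis=1
)
```

The column form is the one used for scraping here, since each request only needs the URL from a single column.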

I had to split this up into three temporary dfs because FindLaw's caselaw site eventually got smart and realized someone was scraping the crap out of it (almost 24k requests within a few hours...). I added a browser-style User-Agent header (which you should absolutely do if you are collecting as much data as I naively thought I could collect without one), which may have solved the problem. However, nothing is worse than waiting 2 hours and coming back to an errored-out screen, which basically means starting over from scratch. So I decided to split the work into chunks and pickle each one as soon as it finished.

In [64]:
supcourt = pd.read_pickle("supcourt_yearlist.pickle")

In [79]:
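# .copy() so that adding a column later does not trigger SettingWithCopyWarning
test_df = supcourt.iloc[5000:15000].copy()
test3_df = supcourt.iloc[0:5000].copy()
test2_df = supcourt.iloc[15000:23268].copy()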
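
The manual three-way split can also be written as a loop that checkpoints each chunk to disk. This is only a sketch of that idea, not what was run for this project: `scrape_in_chunks`, the `chunk_{start}.pickle` file names, and the `fake_fetch` stand-in are all hypothetical, and the demo uses made-up URLs so it runs offline.

```python
import os
import tempfile

import pandas as pd

def scrape_in_chunks(df, fetch, chunk_size, out_dir):
    """Apply fetch to df.caseurl one chunk at a time, pickling each chunk.

    A chunk whose pickle already exists is loaded instead of re-fetched,
    so a crash only costs the chunk that was in progress.
    """
    parts = []
    for start in range(0, len(df), chunk_size):
        path = os.path.join(out_dir, f"chunk_{start}.pickle")
        if os.path.exists(path):
            part = pd.read_pickle(path)
        else:
            part = df.iloc[start:start + chunk_size].copy()
            part["case"] = part.caseurl.apply(fetch)
            part.to_pickle(path)
        parts.append(part)
    return pd.concat(parts)

# Offline demo with a fake fetcher (assumption: stands in for the real scraper)
demo = pd.DataFrame({"caseurl": [f"http://example.com/{i}" for i in range(5)]})
calls = []

def fake_fetch(url):
    calls.append(url)
    return "text of " + url

with tempfile.TemporaryDirectory() as tmp:
    first = scrape_in_chunks(demo, fake_fetch, chunk_size=2, out_dir=tmp)
    # Second run finds every pickle on disk and makes no new fetches
    second = scrape_in_chunks(demo, fake_fetch, chunk_size=2, out_dir=tmp)
```

Smaller chunks mean less lost work after a failure, at the cost of more files to manage.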

In [67]:
def supcourtdescr(link):
    """Fetch one case page and return the text of its opinion body."""
    # Browser-style User-Agent so the request does not look like an obvious bot
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    allitems = []
    response = requests.get(link, headers=headers)
    page = response.text
    soup = BeautifulSoup(page, "lxml")

    # Grab the elements whose class attribute is exactly this string
    pagesoup = soup.find_all(class_="caselawcontent searchable-content")

    for item in pagesoup:
        txtt = item.get_text()
        allitems.append(txtt)
    return ' '.join(allitems)
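
As written, a single failed request raises an exception and kills the whole apply. One way to soften that is a small retry wrapper. This is a sketch under assumptions rather than part of the original run: `with_retries`, its `attempts` and `delay` parameters, and the `flaky_fetch` demo function are hypothetical names introduced here.

```python
import time

def with_retries(func, attempts=3, delay=5.0):
    """Wrap func so that a raised exception triggers a pause and another try.

    After the final failed attempt the wrapper returns None instead of
    raising, so one bad page cannot abort a long-running apply.
    """
    def wrapper(arg):
        for attempt in range(attempts):
            try:
                return func(arg)
            except Exception:
                # Wait before retrying, but not after the last attempt
                if attempt < attempts - 1:
                    time.sleep(delay)
        return None
    return wrapper

# Offline demo: a hypothetical fetcher that fails twice, then succeeds
state = {"calls": 0}

def flaky_fetch(url):
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network failure")
    return "page text for " + url

safe_fetch = with_retries(flaky_fetch, attempts=3, delay=0)
result = safe_fetch("http://example.com/case")
```

With the real scraper this would look like `test_df.caseurl.apply(with_retries(supcourtdescr))`, and any rows that come back as None can be re-fetched later.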

In [None]:
test_df["case"] = test_df.caseurl.apply(supcourtdescr)
test_df.to_pickle("temp2.pickle")

In [None]:
test2_df["case"] = test2_df.caseurl.apply(supcourtdescr)
test2_df.to_pickle("temp1.pickle")

In [None]:
test3_df["case"] = test3_df.caseurl.apply(supcourtdescr)
test3_df.to_pickle("temp3.pickle")

In [77]:
full_project = pd.concat([test2_df, test3_df, test_df])  # putting it all together

## The internet is a dirty place
Web scraping always presents interesting regex challenges. These weren't so bad this time (simple string replacements were enough), but some cleanup is inevitable.

In [83]:
def cleanerup(text):
    # Turn newlines and runs of four non-breaking spaces into single spaces
    text = text.replace("\n", " ")
    text = text.replace("\xa0\xa0\xa0\xa0", " ")
    return text

full_project["case"] = full_project.case.apply(cleanerup)
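
If the leftover whitespace ever gets messier than this, a regex version is one option. The sketch below is more aggressive than cleanerup: it collapses every run of whitespace into one space. `collapse_whitespace` is a hypothetical name and the sample string is invented for the demo.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with one space and trim the ends."""
    # With str patterns, \s is Unicode-aware and also matches "\xa0"
    return re.sub(r"\s+", " ", text).strip()

# Invented sample that mimics the newline and non-breaking-space debris
sample = "United States Supreme Court\n\xa0\xa0\xa0\xa0JORDAN v. SILVER\xa0\xa0(1965)\n"
cleaned = collapse_whitespace(sample)
```

Unlike the two replace calls, this also handles runs of non-breaking spaces that are not exactly four characters long.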

In [86]:
full_project

Unnamed: 0,caseurl,casetitle,years,case
15000,http://caselaw.findlaw.com/us-supreme-court/38...,"ALUMINUM CO. OF AMERICA v. UNITED STATES, 382...",1965,United States Supreme Court JOBE v. CITY OF ...
15001,http://caselaw.findlaw.com/us-supreme-court/38...,JONES & LAUGHLIN STEEL CORP. v. GRIDIRON STEEL...,1965,United States Supreme Court JONES & LAUGHLIN...
15002,http://caselaw.findlaw.com/us-supreme-court/38...,"JORDAN v. SILVER, 381 U.S. 415 (1965)",1965,United States Supreme Court JORDAN v. SILVER...
15003,http://caselaw.findlaw.com/us-supreme-court/38...,"KADANS v. DICKERSON, 382 U.S. 22 (1965)",1965,United States Supreme Court KADANS v. DICKER...
15004,http://caselaw.findlaw.com/us-supreme-court/38...,"METROMEDIA, INC. v. AMERICAN SOCIETY OF COMPOS...",1965,United States Supreme Court KASHARIAN v. MET...
15005,http://caselaw.findlaw.com/us-supreme-court/38...,"KASHARIAN v. SOUTH PLAINFIELD BAPTIST CHURCH, ...",1965,United States Supreme Court KASHARIAN v. SOU...
15006,http://caselaw.findlaw.com/us-supreme-court/38...,FLORIDA EAST COAST RAILWAY CO. v. UNITED STATE...,1965,United States Supreme Court KASHARIAN v. WIL...
15007,http://caselaw.findlaw.com/us-supreme-court/38...,"KENNECOTT COPPER CORP. v. UNITED STATES, 381 ...",1965,United States Supreme Court KENNECOTT COPPER...
15008,http://caselaw.findlaw.com/us-supreme-court/38...,"KILLGORE v. BLACKWELL, 381 U.S. 278 (1965)",1965,United States Supreme Court KILLGORE v. BLAC...
15009,http://caselaw.findlaw.com/us-supreme-court/37...,KITTY HAWK DEVELOPMENT CO. v. CITY OF COLORADO...,1965,United States Supreme Court KITTY HAWK DEVEL...


### Saving the body of our data
This pickle file is about 600 MB. We don't want to run this again!

In [85]:
full_project.to_pickle("full_proj_preproc.pickle")