# Extract data from a PDF - California WARN report

**Problem**: You have data trapped in a PDF.

**Solution**: Use a Python library like [`pdfplumber`](https://github.com/jsvine/pdfplumber) to extract it.

The PDF we're working with is [a report of employers planning to lay off workers in California](http://www.edd.ca.gov/jobs_and_training/Layoff_Services_WARN.htm#ListingofWARNNotices).

Our steps:
1. Import dependencies
2. Extract the data and write to a CSV (it's usually a good idea to cache your data)
3. Load the CSV into a pandas dataframe
4. Analyze!

### 1. Import dependencies

In [2]:
import csv

import pdfplumber
import pandas as pd

We'll also look at the source PDF and define the column headers.

While we're in here, let's define variables for the path to our PDF and to the CSV we'll write out to.

In [3]:
cols = ['notice_date', 'effective_date', 'received_date',
        'company', 'city', 'county', 'employee_no', 'warn_type']

# this one exists already
PDF_FILE = '../../data/warnreport.pdf'

# this one won't exist until we create it
CSV_FILE = '../../data/warnreport.csv'

### 2. Extract the data and write to CSV

👉 For more details on reading and writing CSV files, [check out this notebook](../../reference/Reading%20and%20writing%20delimited%20data%20files%20with%20vanilla%20Python.ipynb).

In [4]:
# open the PDF ~and~ a CSV to write out to
with pdfplumber.open(PDF_FILE) as pdf, open(CSV_FILE, 'w', newline='') as outfile:
    
    # create a list-based writer object
    writer = csv.writer(outfile)
    
    # write the header row
    writer.writerow(cols)

    # loop over the PDF pages
    for page in pdf.pages:
        
        # extract the tables on each page; skip any page that has none
        tables = page.extract_tables()
        if not tables:
            continue

        # select the first table on the page
        table = tables[0]
        
        # loop over the rows in the table
        for row in table:
            # if it's a header row, skip and go to the next iteration
            # see: https://docs.python.org/3/tutorial/controlflow.html#break-and-continue-statements-and-else-clauses-on-loops
            if row[0] == 'Notice Date':
                continue
            # if it's the summary table, we're done, son
            if row[0] == 'Summary By Month':
                break
            # otherwise, write the row to file
            writer.writerow(row)
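
One detail worth knowing about the loop in the cell above: `break` exits only the inner loop over rows, so the outer loop still moves on to the next page. That is fine when the summary table sits on the last page. If you'd rather stop everything at the summary, one option is to move the row filtering into a generator, as in the minimal sketch below. The function name `iter_data_rows` and the sample data are made up for illustration. The sample stands in for what `page.extract_tables()` might return, and the date format in it is a guess rather than something copied from the PDF.

```python
def iter_data_rows(tables, header_label='Notice Date',
                   stop_label='Summary By Month'):
    """Yield data rows from a sequence of tables (each a list of rows)."""
    for table in tables:
        for row in table:
            # skip empty rows and repeated header rows
            if not row or row[0] == header_label:
                continue
            # `return` ends the generator, so later tables are not read
            if row[0] == stop_label:
                return
            yield row


# made-up sample data, NOT taken from the real PDF
sample_tables = [
    [['Notice Date', 'Effective Date'],
     ['06/22/2017', '06/22/2017']],
    [['Notice Date', 'Effective Date'],
     ['06/23/2017', '08/25/2017'],
     ['Summary By Month', None],
     ['July', '12']],
    [['ignored', 'ignored']],
]

rows = list(iter_data_rows(sample_tables))
print(rows)
```

Each yielded row could then be passed to `writer.writerow()` exactly as before.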

### 3. Load the CSV into a pandas dataframe

We'll also pass a list of date columns to the `parse_dates` keyword argument of the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function, so that pandas interprets those columns as dates rather than text.

In [5]:
df = pd.read_csv(CSV_FILE,
                 parse_dates=['notice_date',
                              'received_date',
                              'effective_date'])
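
To see what `parse_dates` changes, here is a small sketch that reads a made-up two-row CSV from memory instead of from `CSV_FILE`. The company names and values are invented for illustration. The point is only that the listed column comes back as a datetime type, which gives you access to the `.dt` accessor.

```python
import io

import pandas as pd

# made-up CSV text standing in for the file written in step 2
sample_csv = io.StringIO(
    'notice_date,company,employee_no\n'
    '2017-06-22,Example Co,173\n'
    '2017-06-23,Another Co,40\n'
)

parsed = pd.read_csv(sample_csv, parse_dates=['notice_date'])

# the date column is now a datetime type rather than plain text
print(parsed.dtypes)

# datetime columns expose the .dt accessor, e.g. to pull out the month
months = parsed['notice_date'].dt.month.tolist()
print(months)
```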

In [6]:
df.head()

| | notice_date | effective_date | received_date | company | city | county | employee_no | warn_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-06-22 | 2017-06-22 | 2017-07-03 | Space Systems/Loral, LLC (SSL) | Palo Alto | Santa Clara | 173 | Layoff Permanent |
| 1 | 2017-06-23 | 2017-08-25 | 2017-07-03 | The Boeing Company | El Segundo | Los Angeles | 40 | Layoff Unknown at this time |
| 2 | 2017-06-23 | 2017-08-25 | 2017-07-03 | The Boeing Company | Huntington | Orange County | 47 | Layoff Unknown at this time |
| 3 | 2017-06-30 | 2017-09-01 | 2017-07-03 | Micron Technology, Inc. | Milpitas | Santa Clara | 57 | Layoff Permanent |
| 4 | 2017-07-05 | 2017-09-03 | 2017-07-05 | Options for Youth-Victor Valley, Inc. | Fontana | San Bernardino | 107 | Closure Permanent |


### 4. Analyze!

What questions do you have?
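
As one possible starting point, you might ask which counties have the most affected employees. The sketch below uses only the five rows shown in the `df.head()` output, typed in by hand, so the totals describe that tiny sample and not the full report. With the real data you would run the same `groupby()` on `df`.

```python
import pandas as pd

# the five rows from the df.head() output, entered by hand
sample = pd.DataFrame({
    'company': ['Space Systems/Loral, LLC (SSL)',
                'The Boeing Company',
                'The Boeing Company',
                'Micron Technology, Inc.',
                'Options for Youth-Victor Valley, Inc.'],
    'county': ['Santa Clara', 'Los Angeles', 'Orange County',
               'Santa Clara', 'San Bernardino'],
    'employee_no': [173, 40, 47, 57, 107],
})

# total affected employees per county, largest first
by_county = (sample.groupby('county')['employee_no']
                   .sum()
                   .sort_values(ascending=False))
print(by_county)
```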