## Parsing a BSN Sports PDF <br/>
**Libraries you will need**
- **pdfplumber**: opens PDFs and lets you parse them page by page
- **re**: finds patterns in text so you can locate specific data (e.g. dates, model numbers)
- **pandas**: used here to build a data frame for export to CSV
- **glob**: iterates through the files ending in .pdf
<br/><br/>
**Note**: Make sure your PDFs are stored in the *put_pdfs_here* folder.
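
Before parsing, it can help to confirm that the folder actually contains PDFs. The snippet below is a minimal sketch of that check; the helper name `find_pdfs` is hypothetical and not part of the notebook.

```python
import glob
import os

def find_pdfs(folder="put_pdfs_here"):
    """Return the PDF paths in `folder`, sorted for a repeatable processing order."""
    # glob.glob returns an empty list (no error) if the folder is missing or empty
    return sorted(glob.glob(os.path.join(folder, "*.pdf")))

pdf_paths = find_pdfs()
if not pdf_paths:
    print("No PDFs found in put_pdfs_here; nothing to parse.")
else:
    print(f"Found {len(pdf_paths)} PDF(s):", *pdf_paths, sep="\n  ")
```

Sorting is optional; it only makes the order of rows in the exported CSV repeatable between runs.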


In [2]:
import pdfplumber
import re
import pandas as pd
import glob

from collections import namedtuple 

In [3]:
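# One record per output row; the field names become the CSV column headers
Line = namedtuple('Line', [
    'sport_id', 'order_date', 'order_ref_number',
    'item_description', 'item_category', 'item_size',
    'qty_ordered', 'chart', 'item_unit',
    'item_pref_vendor', 'item_manuf', 'item_manuf_model',
    'item_price', 'item_serial', 'item_expendable', 'item_universal',
])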

**Regular Expressions**<br/>
Regular expressions can look confusing at first, but they are a very useful way to find patterns in text. You can learn more about them here: https://www.w3schools.com/python/python_regex.asp
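
To make this concrete, here is a small sketch of what the notebook's two main patterns capture. The quantity line is copied from the output printed by the parsing cell; the model line is an assumed example, assembled from the `Item # - ` prefix in the pattern and a model number from the results table.

```python
import re

manuf_model_re = re.compile(r'Item # - (.*)')
quantities_re = re.compile(r'\s*(\d*) ([A-Z]{2})\D*(\d*\.?\d*)')

# Sample lines (the model line is an assumed layout, see the note above)
qty_line = "     40 EA $      20.00  $     800.00 "
model_line = "Item # - NK658068"

# Groups are: quantity, two-letter unit code, unit price
qty_match = quantities_re.search(qty_line)
print(qty_match.groups())    # ('40', 'EA', '20.00')

# Everything after "Item # - " is captured as the model number
model_match = manuf_model_re.search(model_line)
print(model_match.group(1))  # NK658068

# A line with no unit code does not match, so search() returns None
print(quantities_re.search("  20"))  # None
```

Note that `search` returns `None` rather than raising when nothing matches, so calling `.group()` on the result is only safe after checking it.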

In [24]:
manuf_model_re = re.compile(r'Item # - (.*)')
quantities_re = re.compile(r'\s*(\d*) ([A-Z]{2})\D*(\d*\.?\d*)')
print(re.compile(r'(^[a-zA-Z]{3}\s)+').search('THE GREATEST ITEM'))

<re.Match object; span=(0, 4), match='THE '>


In [29]:
lines = []
check_des = False 
check_info = False
check_size_or_des = False
check_qty = False

for file in glob.iglob("put_pdfs_here/*.pdf"):
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            # extract_text() may return None (older pdfplumber versions) for a page with no text
            text = page.extract_text() or ""
            #print(text)
            text_lines = text.split("\n")
            for i in range(len(text_lines)):
                
                # regular expressions - used for identifying patterns in text
                model_num = manuf_model_re.search(text_lines[i])
                
                # Order Date
                if text_lines[i].startswith('Order Date'):
                    order_date = re.compile(r'(\d{2}/\d{2}/\d{4})').search(text_lines[i]).group(1)
                
                # Order Reference Number
                elif text_lines[i].startswith('Cart Name'):
                    ref_num = re.compile(r'Cart Name:(.*)').search(text_lines[i]).group(1)
                
                # A line starting with 'Item description' signals the beginning of the items
                elif text_lines[i].startswith('Item Description'):
                    check_des = True
                
                # Item Description
                elif check_des: 
                    if text_lines[i].startswith('Page:') or text_lines[i].startswith('Subtotal'):
                        check_des = False
                        break
                    item_description = text_lines[i] 
                    check_des = False
                    check_info = True
                    
                # Quantity and Price
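                elif check_info:
                    print(text_lines[i])  # debug: show the line being parsed
                    info = quantities_re.search(text_lines[i])
                    if info is None:
                        # No "<qty> <unit> <price>" pattern on this line (e.g. a bare "20").
                        # Skip it instead of crashing on info.group(); this assumes the
                        # quantity line appears on a later line.
                        continue
                    qty = info.group(1)
                    item_unit = "Each" if info.group(2) == "EA" else "Pair"
                    price = info.group(3)
                    check_info = False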
                    
                # Manufacture Model
                elif model_num:
                    manuf_model = model_num.group(1)
                    check_size_or_des = True
                    
                # Sizes
                elif check_size_or_des:
                    # A line starting with "Subtotal" signals the end of items
                    if text_lines[i].startswith('Subtotal'):
                        lines.append(Line("", order_date, ref_num, item_description, "", "", qty, "", item_unit, "", "BSN", manuf_model, price, "N", "Y", "N"))
                        break 
                    elif re.compile(r'(^[a-zA-Z]{3})+').search(text_lines[i]):   # check sizes
                        sizes = text_lines[i].split()
                        check_qty = True 
                    else:                                         # check description
                        item_description = text_lines[i]
                        check_info = True
                    check_size_or_des = False
                
                # Quantities per size
                elif check_qty:
                    qtys = re.findall(r'\d+', text_lines[i])
                    # Pair each size with its quantity; zip avoids reusing the outer loop index i
                    for size, size_qty in zip(sizes, qtys):
                        lines.append(Line("", order_date, ref_num, item_description, "", size, size_qty, "", item_unit, "", "BSN", manuf_model, price, "N", "Y", "N"))
                    check_qty = False
                    check_des = True
                              

     40 EA $      20.00  $     800.00 
     35 EA $      27.50  $     962.50 
     35 EA $      40.00  $   1,400.00 
     40 EA $      12.50  $     500.00 
     42 EA $      15.00  $     630.00 
     42 EA $      15.00  $     630.00 
      5 EA $      40.00  $     200.00 
      5 EA $      13.75  $      68.75 
     20 EA $      80.00  $   1,600.00 
  20


AttributeError: 'NoneType' object has no attribute 'group'
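
The `AttributeError` above is what happens when `quantities_re.search` finds no match: it returns `None`, and `None` has no `.group` method. The last printed line, `  20`, has no unit code such as `EA`, so it cannot match. Below is a minimal sketch of a guarded version of that step. The helper name `parse_quantity_line` is hypothetical, and skipping unmatched lines is an assumption about the desired behavior, not something the PDF layout confirms.

```python
import re

quantities_re = re.compile(r'\s*(\d*) ([A-Z]{2})\D*(\d*\.?\d*)')

def parse_quantity_line(line):
    """Return (qty, item_unit, price) as strings, or None if the line has no quantity pattern."""
    info = quantities_re.search(line)
    if info is None:
        return None
    qty, unit_code, price = info.groups()
    # Same mapping as the notebook: "EA" means Each, anything else is treated as Pair
    item_unit = "Each" if unit_code == "EA" else "Pair"
    return qty, item_unit, price

for line in ["     20 EA $      80.00  $   1,600.00 ", "  20"]:
    parsed = parse_quantity_line(line)
    if parsed is None:
        print(f"skipped (no quantity pattern): {line!r}")
    else:
        print(parsed)
```

A skipped line is worth inspecting by hand: it may be a per-size quantity row whose size row was read as a description one step earlier.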

In [81]:
df = pd.DataFrame(lines)
df

Unnamed: 0,sport_id,order_date,order_ref_number,item_description,item_category,item_size,qty_ordered,chart,item_unit,item_pref_vendor,item_manuf,item_manuf_model,item_price,item_serial,item_expendable,item_universal
0,,05/09/2019,VB 2/M Fill Ins #17,SUBLIM - NO CUST LOGO-WOMENS DIGITAL SPEED ST,,MED,1,,Each,,,NK658068,52.0,N,Y,N
1,,05/09/2019,VB 2/M Fill Ins #17,SUBLIM - NO CUST LOGO-WOMENS DIGITAL SPEED ST,,MED,1,,Each,,,NK658068,52.0,N,Y,N
2,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,8,2,,Pair,,,NKAT1214,53.63,N,Y,N
3,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,8.5,6,,Pair,,,NKAT1214,53.63,N,Y,N
4,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,9,4,,Pair,,,NKAT1214,53.63,N,Y,N
5,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,9.5,8,,Pair,,,NKAT1214,53.63,N,Y,N
6,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,10,4,,Pair,,,NKAT1214,53.63,N,Y,N
7,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,10.5,6,,Pair,,,NKAT1214,53.63,N,Y,N
8,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,11,2,,Pair,,,NKAT1214,53.63,N,Y,N
9,,06/27/2019,VB Team #2,001 - BLK/WHT-MAMBA FOCUS SHOES,,12.5,2,,Pair,,,NKAT1214,53.63,N,Y,N


In [58]:
df.to_csv('formatted.csv', index=False)
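
If pandas is not available, the same kind of export can be sketched with the standard library's `csv` module, because a namedtuple exposes its column names through `_fields`. This is an alternative to the `to_csv` call above, not what the notebook does; `Row` is a shortened, hypothetical stand-in for `Line`, and the sample values come from the table shown earlier.

```python
import csv
import io
from collections import namedtuple

# Shortened stand-in for the notebook's Line tuple (hypothetical subset of fields)
Row = namedtuple('Row', 'order_date item_description item_size qty_ordered item_price')

rows = [
    Row('06/27/2019', '001 - BLK/WHT-MAMBA FOCUS SHOES', '8', '2', '53.63'),
    Row('06/27/2019', '001 - BLK/WHT-MAMBA FOCUS SHOES', '8.5', '6', '53.63'),
]

def rows_to_csv(rows, out):
    """Write namedtuple rows to a file-like object, using the field names as the header."""
    if not rows:
        return
    writer = csv.writer(out)
    writer.writerow(rows[0]._fields)
    writer.writerows(rows)

# Written to an in-memory buffer here; pass an open file to write to disk instead
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=''` as the `csv` documentation recommends, so that line endings are not doubled on Windows.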