## Formatting an Adidas PDF <br/>

This is a walkthrough of the code behind formatting an Adidas order. If you're not interested in how it works, all you need to do is make sure your PDFs are in the *put_pdfs_here* folder and then run the *formatting-adidas-orders.py* file. A *formatted.csv* file will appear once the code has finished running. Your local machine also needs Python and the libraries/packages listed below installed.

**Libraries/Packages used:**
- **pdfplumber**: opens pdfs and allows them to be parsed page by page
- **re**: finds patterns in text to know where to find specific data (e.g. dates, model numbers, etc.)
- **pandas**: used here for creating a data frame for exporting to a csv
- **glob**: allows for iterating through files ending in .pdf
<br/><br/>
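
If you are unsure whether the third-party packages are installed, a small check like the one below can help. This is only a sketch: the `missing_packages` helper and the `REQUIRED` list are names made up for this illustration and are not part of the original script. `re` and `glob` ship with Python, so only `pdfplumber` and `pandas` are checked.

```python
import importlib.util

# Third-party packages used by the script (re and glob are built in).
REQUIRED = ["pdfplumber", "pandas"]


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages. Try: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```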

In [1]:
import pdfplumber
import re
import pandas as pd
import glob

from collections import namedtuple 

In [2]:
# Creating a named tuple to easily make headings and assign values to a pandas data frame
Line = namedtuple('Line', 'sport_id order_date order_ref_number item_description item_category item_size qty_ordered item_size_chart item_unit item_pref_vendor item_manuf item_manuf_model item_price item_serial item_expendable item_universal')
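
To see why a named tuple is convenient here, the sketch below uses a much smaller, hypothetical `Row` tuple with three of the fields (the real `Line` tuple has 16). The sample values are invented for illustration.

```python
from collections import namedtuple

# Hypothetical three-field record; the real Line tuple above has 16 fields.
Row = namedtuple('Row', 'order_date item_size qty_ordered')

row = Row("01/15/2021", "M", "4")

print(Row._fields)      # field names, in the order they were declared
print(row.item_size)    # values can be read by name instead of by position
print(row._asdict())    # mapping of field name to value
```

When a list of named tuples is passed to `pd.DataFrame`, pandas uses the field names as the column headings, which is why the fields are named after the columns wanted in the CSV.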

**Regular Expressions**<br/>
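The code below might look confusing at first, but regular expressions are a very useful way to find patterns in text. The pattern here matches an item line: a model number, a description, a quantity, a unit price, and a total price.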

You can learn more about regular expressions here: https://www.w3schools.com/python/python_regex.asp

In [3]:
manuf_model_re = re.compile(r'^(\w+) (.*) \d+ \$(\d+\.?\d+) \$\d+\.?\d+ \D+')
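
To make the pattern less abstract, here is a sketch that runs it against a made-up item line. The sample text is an assumption about what a line might look like; real lines extracted from an Adidas PDF may differ.

```python
import re

manuf_model_re = re.compile(r'^(\w+) (.*) \d+ \$(\d+\.?\d+) \$\d+\.?\d+ \D+')

# Hypothetical item line: model, description, quantity, unit price, total, status
sample = "FT1234 TIRO 21 TRAINING JERSEY 10 $22.50 $225.00 Open"

match = manuf_model_re.search(sample)
model = match.group(1)         # first word on the line
description = match.group(2)   # everything up to the quantity
unit_price = match.group(3)    # first dollar amount, without the $ sign

print(model, "|", description, "|", unit_price)

# A line without the two dollar amounts does not match at all
print(manuf_model_re.search("Order Date: 01/15/2021"))
```

Note that `\d+\.?\d+` needs at least two digits, so a single-digit whole-dollar price such as `$5` would not match, while `$5.00` would.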

**Extracting information from the PDF** <br/>
This is where the actual extraction happens. We loop through each PDF file in the *put_pdfs_here* folder, each page in every file, and each line in every page. Each iteration checks for patterns that we previously defined and for other specific repetitions. We compile a list of all the lines that we need in the formatted file so that we can create a data frame out of it and export to a CSV.
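
The size and quantity step is the least obvious part of the loop, so here it is in isolation. This is a sketch on invented lines: it assumes a `Size` line is followed directly by a `Qty` line with the same number of entries, which is what the loop relies on.

```python
import re

# Hypothetical page lines; real pdfplumber output may be laid out differently.
text_lines = [
    "BLACK/WHITE",
    "Size S M L XL",
    "Qty 2 4 3 1",
]

size_re = re.compile(r'Size (.*)')
qty_re = re.compile(r'Qty (.*)')

pairs = []
for i, line in enumerate(text_lines):
    if line.startswith('Size'):
        # Sizes are on this line, quantities on the next one
        sizes = size_re.search(line).group(1).split()
        qtys = qty_re.search(text_lines[i + 1]).group(1).split()
        pairs.extend(zip(sizes, qtys))

print(pairs)  # [('S', '2'), ('M', '4'), ('L', '3'), ('XL', '1')]
```

Each pair becomes one row in the formatted file, so a single item line in the PDF can produce several rows in the CSV.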

In [4]:
lines = []

# Compile the remaining patterns once instead of on every line
order_date_re = re.compile(r'(\d{2}/\d{2}/\d{4})')
ref_num_re = re.compile(r'Customer PO#:(.*) Contact')
size_re = re.compile(r'Size (.*)')
qty_re = re.compile(r'Qty (.*)')

for file in glob.iglob("put_pdfs_here/*.pdf"):
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            # extract_text() may return None for a page without text in some pdfplumber versions
            text = page.extract_text() or ""
            text_lines = text.split("\n")
            for i, line in enumerate(text_lines):

                # regular expressions - used for identifying patterns in text
                model_num = manuf_model_re.search(line)

                # Order Date
                if line.startswith('Order Date'):
                    order_date = order_date_re.search(line).group(1)

                # Order Reference Number
                elif line.startswith('Customer PO#:'):
                    ref_num = ref_num_re.search(line).group(1).strip()

                # Manufacturer model, item description, and unit price
                elif model_num:
                    manuf_model = model_num.group(1)
                    item_description = model_num.group(2)
                    price = model_num.group(3)

                # Sizes: the previous line continues the description, the next line holds the quantities
                elif line.startswith('Size'):
                    item_description += ' ' + text_lines[i - 1]
                    sizes = size_re.search(line).group(1).split()
                    qtys = qty_re.search(text_lines[i + 1]).group(1).split()
                    # One output row per size/quantity pair
                    for size, qty in zip(sizes, qtys):
                        lines.append(Line("", order_date, ref_num, item_description, "", size, qty, "", "Each", "", "Adidas", manuf_model, price, "N", "Y", "N"))

**Creating the data frame** <br/>
Using the pandas library, we create a data frame out of the `lines` list from above. Then we export the formatted file to a CSV and we're done! You should see a new *formatted.csv* appear in the directory the script was run from.

In [5]:
df = pd.DataFrame(lines)

In [6]:
df.to_csv('formatted.csv', index=False)