### Generating a sample dataset

In [16]:
import random
import itertools
import numpy as np
import pandas as pd
import matplotlib
import requests
import re

A PO number has the form PO + seven random digits + C. Below we define a function that returns a random PO number in that format.

In [17]:
def generate_po():
    # Draw a seven-digit integer directly rather than stripping brackets from a list's string form.
    result = random.randint(1000000, 9999999)
    return f"PO{result}C"

In [18]:
generate_po()

'PO1957862C'
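One detail worth noting: because the number is drawn from 1000000 upward, the generator never produces a PO whose first digit is 0. If leading zeros should count as valid digits, or if reproducible output is useful, a variant could look like the minimal sketch below. The name `generate_po_padded` and its `rng` parameter are hypothetical and not part of the original notebook.

```python
import random
import re

def generate_po_padded(rng=random):
    """Return 'PO' + seven digits (leading zeros allowed) + 'C'."""
    number = rng.randrange(10_000_000)  # integer from 0 to 9_999_999
    return f"PO{number:07d}C"           # zero-pad to exactly seven digits

rng = random.Random(42)  # seeded generator so reruns give the same POs
sample_pos = [generate_po_padded(rng) for _ in range(5)]

# Every generated value should match the documented format exactly.
all_valid = all(re.fullmatch(r"PO\d{7}C", po) for po in sample_pos)
print(sample_pos, all_valid)
```

Passing a seeded `random.Random` instance keeps the sample data stable between runs, which makes later outputs easier to compare.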

The eventual goal is to locate PO numbers within the text of emails. For the sample dataset, we define a loop that uses the function above to generate random PO numbers, interpolates two of them into filler text, and appends the resulting 'email' to a list. The loop runs 50 times.

In [19]:
email_list = []
for _ in range(50):  # build 50 sample emails
    sample_email = f'''Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt 
    ut labore et dolore magna aliqua. Magna {generate_po()} ac placerat vestibulum lectus mauris ultrices eros. 
    Duis tristique sollicitudin nibh sit amet. Odio eu feugiat pretium nibh ipsum consequat. Fusce ut placerat orci nulla 
    pellentesque dignissim enim sit amet.  Dictum {generate_po()} varius duis at consectetur. 
    Elementum pulvinar etiam non quam lacus suspendisse. Mi quis hendrerit dolor magna eget est lorem. 
    Pharetra convallis posuere morbi leo urna. Turpis egestas pretium aenean pharetra. 
    Amet cursus sit amet dictum sit amet justo donec.'''
    email_list.append(sample_email)
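A quick way to confirm that a loop like this produced what we expect is to count the PO-shaped tokens in each email. The sketch below rebuilds a tiny dataset so it can run on its own; `make_email` and the shortened filler text are stand-ins for illustration, not the notebook's actual template.

```python
import random
import re

def make_email(po_numbers):
    # Hypothetical helper: place each PO number inside a short filler sentence.
    return " ".join(f"Lorem ipsum {po} dolor sit amet." for po in po_numbers)

rng = random.Random(0)
emails = []
for _ in range(5):
    pos = [f"PO{rng.randint(1_000_000, 9_999_999)}C" for _ in range(2)]
    emails.append(make_email(pos))

# Each email was built with two POs, so each count should be 2.
counts = [len(re.findall(r"PO\d{7}C", email)) for email in emails]
print(counts)
```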


In [20]:
email_list[:3]

['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt \n    ut labore et dolore magna aliqua. Magna PO2515043C ac placerat vestibulum lectus mauris ultrices eros. \n    Duis tristique sollicitudin nibh sit amet. Odio eu feugiat pretium nibh ipsum consequat. Fusce ut placerat orci nulla \n    pellentesque dignissim enim sit amet.  Dictum PO7177994C varius duis at consectetur. \n    Elementum pulvinar etiam non quam lacus suspendisse. Mi quis hendrerit dolor magna eget est lorem. \n    Pharetra convallis posuere morbi leo urna. Turpis egestas pretium aenean pharetra. \n    Amet cursus sit amet dictum sit amet justo donec.',
 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt \n    ut labore et dolore magna aliqua. Magna PO5820922C ac placerat vestibulum lectus mauris ultrices eros. \n    Duis tristique sollicitudin nibh sit amet. Odio eu feugiat pretium nibh ipsum consequat. Fusce ut placerat orci nulla \n    p

Because we'll eventually be searching the description column of the Salesforce report as a plain text file, we join the 50 'emails' in the list above into one string and save it as a text file to use later.

In [21]:
sample_txt_file = ''.join(email_list)

In [22]:
sample_txt_file[:500]

'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt \n    ut labore et dolore magna aliqua. Magna PO2515043C ac placerat vestibulum lectus mauris ultrices eros. \n    Duis tristique sollicitudin nibh sit amet. Odio eu feugiat pretium nibh ipsum consequat. Fusce ut placerat orci nulla \n    pellentesque dignissim enim sit amet.  Dictum PO7177994C varius duis at consectetur. \n    Elementum pulvinar etiam non quam lacus suspendisse. Mi quis hendrerit dolor magna e'

In [23]:
# The context manager closes the file even if the write raises an error.
with open("sample_file.txt", "wt") as text_file:
    text_file.write(sample_txt_file)
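Joining with an empty string means the last word of one email runs straight into the first word of the next ("donec.Lorem"). That does not affect the PO search here, but if email boundaries matter, a separator can be used. The sketch below is one possible variant under that assumption; the two short emails and the temporary output directory are illustrative only.

```python
import tempfile
from pathlib import Path

emails = [
    "First email mentions PO1234567C.",
    "Second email mentions PO7654321C.",
]

# Join with a newline so the end of one email does not run into the next.
combined = "\n".join(emails)

# Written to a temporary directory here; the notebook writes to the working directory.
out_path = Path(tempfile.mkdtemp()) / "sample_file.txt"
out_path.write_text(combined, encoding="utf-8")

# Read the file back to confirm the round trip preserved the text.
round_trip = out_path.read_text(encoding="utf-8")
print(round_trip == combined)
```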

### Extracting the POs from emails using regular expressions

Now that we've generated a random dataset, we can move on to extracting the PO numbers. The first step is to read the file we created above into a string.

In [24]:
filepath = "sample_file.txt"
with open(filepath) as f:
    cases_str = f.read()

In [25]:
cases_str[:500]

'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt \n    ut labore et dolore magna aliqua. Magna PO2515043C ac placerat vestibulum lectus mauris ultrices eros. \n    Duis tristique sollicitudin nibh sit amet. Odio eu feugiat pretium nibh ipsum consequat. Fusce ut placerat orci nulla \n    pellentesque dignissim enim sit amet.  Dictum PO7177994C varius duis at consectetur. \n    Elementum pulvinar etiam non quam lacus suspendisse. Mi quis hendrerit dolor magna e'

Now that we've loaded the file, we search it for anything matching the PO pattern of 'PO + seven digits + C'.

In [26]:
pattern = r"PO\d{7}C"
po_list = re.findall(pattern, cases_str)

In [27]:
po_list[:10]

['PO2515043C',
 'PO7177994C',
 'PO5820922C',
 'PO2256293C',
 'PO6493938C',
 'PO1783010C',
 'PO8677272C',
 'PO6101278C',
 'PO8346848C',
 'PO3282345C']
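In the generated sample every PO is surrounded by spaces, so the pattern above is sufficient. In real email text a PO-shaped substring could sit inside a longer token, and adding word boundaries (`\b`) is one way to guard against that. The sketch below compares the two patterns on a made-up sentence; the text and the token `XPO1234567C9` are invented for illustration.

```python
import re

text = "Ref PO2515043C shipped; see also XPO1234567C9 and PO7177994C."

# Without boundaries, the PO-shaped part of the longer token also matches.
loose = re.findall(r"PO\d{7}C", text)

# With \b on both sides, only standalone PO numbers match.
strict = re.findall(r"\bPO\d{7}C\b", text)

print(loose)   # ['PO2515043C', 'PO1234567C', 'PO7177994C']
print(strict)  # ['PO2515043C', 'PO7177994C']
```

Whether the stricter pattern is appropriate depends on how POs actually appear in the Salesforce descriptions, which the sample data cannot tell us.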

We now have a list of every match of the PO pattern in the sample 'emails'. The same PO can be mentioned more than once, so we load the list into a DataFrame and remove any duplicates.

In [28]:
# Despite its name, po_series is a one-column DataFrame.
po_series = pd.DataFrame(po_list, columns=["PO-Number"])
po_series.head()

    PO-Number
0  PO2515043C
1  PO7177994C
2  PO5820922C
3  PO2256293C
4  PO6493938C


In [29]:
po_series.drop_duplicates(inplace=True)

In [30]:
po_series.head(10)

    PO-Number
0  PO2515043C
1  PO7177994C
2  PO5820922C
3  PO2256293C
4  PO6493938C
5  PO1783010C
6  PO8677272C
7  PO6101278C
8  PO8346848C
9  PO3282345C


In [31]:
po_series.reset_index(inplace=True)
po_series.head()

   index   PO-Number
0      0  PO2515043C
1      1  PO7177994C
2      2  PO5820922C
3      3  PO2256293C
4      4  PO6493938C


In [32]:
po_series.drop(columns='index',inplace=True)
po_series.head()

    PO-Number
0  PO2515043C
1  PO7177994C
2  PO5820922C
3  PO2256293C
4  PO6493938C


In [33]:
po_series.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   PO-Number  100 non-null    object
dtypes: object(1)
memory usage: 928.0+ bytes
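The summary above reports 100 rows, which equals 50 emails times 2 POs each, so no duplicates happened to occur in this random sample. To see the de-duplication step actually do something, the sketch below uses a small hand-made list with repeats. It relies only on the standard library; the list contents are invented for illustration.

```python
from collections import Counter

po_list = ["PO2515043C", "PO7177994C", "PO2515043C", "PO5820922C", "PO7177994C"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
unique_pos = list(dict.fromkeys(po_list))

# Counter shows which POs were mentioned more than once, and how often.
repeated = {po: n for po, n in Counter(po_list).items() if n > 1}

print(unique_pos)  # ['PO2515043C', 'PO7177994C', 'PO5820922C']
print(repeated)    # {'PO2515043C': 2, 'PO7177994C': 2}
```

In pandas, the cells from In [29] to In [32] can be collapsed into `po_series = po_series.drop_duplicates().reset_index(drop=True)`, which avoids creating and then dropping the extra `index` column.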


In [34]:
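# index=False keeps the row index out of the file, so the CSV holds only the PO-Number column.
po_series.to_csv('po_list.csv', index=False)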