Cleaning Script

In [73]:
%%writefile preprocess.py
#!/usr/bin/python

# imports
import re
import numpy as np
from nltk.tokenize import sent_tokenize

# Regexes
HTML = r'</?\w+/?>|>|<'
BR = r'</?br/?>'
BRBR = BR+BR
MARK = r'</?mark/?>|>|<'
WHITE = r'\s+'
HYPHENS = r'---+'
YW = "you wrote:"

EMAIL_TIME = r"[0-9]?[0-9]:[0-9][0-9]\s[AP]M"
EA = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
PH = r"(\d{0,2}[\s\.-]{0,3}\(?\d{0,3}\)?[\s\.-]{0,3}\d{3}[\s\.-]{0,3}\d{4})"
BLACK = r"[^A-Za-z0-9\s\?!,\.;:/-]+"


# Strings
BREAK = 'BREAK'
FORWARD = 'Forwarded by'
SPACE = ' '
EMAIL = ' EMAILADDRESS '
PHONE = ' PHONE '

# components
email_components = [
    'Date:',
    'From:',
    'To:',
    'Subject:',
    'Re:',
    'Mime-Version:',
    'Content-Type:',
    'Content-Transfer-Encoding:',
    '-From:',
    '-To:',
    '-cc:',
    '-bcc:',
    '-Folder:',
    '-Origin:',
    '-FileName:'
]

##### Preprocessing Functions ####

def make_regex(lst):
    return '|'.join(lst)

break_regex = make_regex([BRBR, HYPHENS, YW])
comp_regex = make_regex(email_components+[EMAIL_TIME])

def clean(text):
    text = re.sub(break_regex, BREAK, text)
    text = re.sub(HTML, SPACE, text)
    text = re.sub(BLACK, SPACE, text)
    text = text.strip()
    return text


def clean_info(text):
    text = re.sub(EA, EMAIL, text)
    text = re.sub(PH, PHONE, text)
    text = re.sub(WHITE, SPACE, text)
    text = text.strip()
    return text
        
    
def trim_sents(sents, max_tokens=75):
    """Keep the longest run of trailing sentences whose combined
    whitespace-token count is at most max_tokens.
    """
    lens = [len(s.split()) for s in sents]
    trimmed_sents = [sents[i] for i in range(len(lens)) 
                     if sum(lens[i:]) <= max_tokens]
    return trimmed_sents

Overwriting preprocess.py
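
To make the masking step concrete, here is a minimal sketch of what `clean_info` does to a line containing an email address and a phone number. The patterns and replacement strings are copied from preprocess.py; the sample sentences are invented for illustration.

```python
import re

# Patterns copied from preprocess.py
EA = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
PH = r"(\d{0,2}[\s\.-]{0,3}\(?\d{0,3}\)?[\s\.-]{0,3}\d{3}[\s\.-]{0,3}\d{4})"
WHITE = r"\s+"

def clean_info(text):
    text = re.sub(EA, " EMAILADDRESS ", text)  # mask email addresses
    text = re.sub(PH, " PHONE ", text)         # mask phone-like digit groups
    text = re.sub(WHITE, " ", text)            # collapse runs of whitespace
    return text.strip()

# Invented sample text, not taken from the data set
sample = "Mail jane.doe@example.com or call 713-555-0199 today."
masked = clean_info(sample)
print(masked)  # Mail EMAILADDRESS or call PHONE today.

# Every separator in PH is optional, so a bare run of seven digits is masked too
print(clean_info("ticket 1234567"))  # ticket PHONE
```

Because the phone pattern is this permissive, identifiers of seven or more digits are likely to be replaced with PHONE as well.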


In [88]:
sents = [
    'As I m sure j ksdhs so o ao  i  ii i i i i i i iii i i i i i i i i i i i..',
    'One possible cause that we would like ..', 
    'It is my understanding that positions will be correct even if the curves loaded are spelled incorrectly.. As I m sure j ksdhs so o ao  i  ii i i i i i i iii i i i i i i i i i i i..',
    'However, incorrect curve names will result in the positions not being captured in the VAR calculation.'
]
trim_func(sents)

['However, incorrect curve names will result in the positions not being captured in the VAR calculation.']
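
The tail-trimming rule in `trim_sents` can also be computed in a single backward pass instead of re-summing the remaining lengths for every index. A minimal sketch, assuming the same whitespace tokenization; `trim_tail` is a hypothetical name, not part of preprocess.py.

```python
def trim_tail(sents, max_tokens=75):
    """Keep the longest run of trailing sentences whose combined
    whitespace-token count is at most max_tokens."""
    kept = []
    total = 0
    # Walk from the last sentence backwards, accumulating token counts
    for s in reversed(sents):
        total += len(s.split())
        if total > max_tokens:
            break  # adding this sentence would exceed the budget
        kept.append(s)
    kept.reverse()  # restore original order
    return kept

demo = ["one two three", "four five", "six"]
print(trim_tail(demo, max_tokens=3))  # ['four five', 'six']
print(trim_tail(demo, max_tokens=2))  # ['six']
```

Since token counts are never negative, the sentences that pass the `sum(lens[i:]) <= max_tokens` test always form a suffix of the list, so stopping at the first overflow gives the same result.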

In [98]:
%%writefile contextMapper.py
#!/usr/bin/python

# imports
import sys
import re
import numpy as np
from nltk.tokenize import sent_tokenize

from preprocess import *


for line in sys.stdin:
   
    try:
        # load
        message, task = line.split('\t')
        # clean
        task, message = clean(task), clean(message)    
        sents = sum([sent_tokenize(m) for m in message.split(BREAK)], [])

        # identify entire task sentence; escape the task so '?' and '.' match literally
        i = np.argmax([bool(re.search(re.escape(task), s)) for s in sents])
        ts = sents[i]

        # identify context
        if i > 0 and not re.search(comp_regex, sents[i-1]):
            cs = sents[i-1]
            context_sents = [s for s in sents[:i] if not re.search(comp_regex, s)]
            context_sents = trim_sents(context_sents)
            context = '. '.join(context_sents)
        else:
            cs, context = '',''

        # print
        print(f'{task}\t{clean_info(context)}\t{clean_info(cs)}\t{clean_info(ts)}')
        
#         print(20*'=')
#         print('CONTEXT', clean_info(context))
#         print()
#         print('TASKSEN', clean_info(ts))
#         print()
        
        
    except Exception:
        # skip lines that fail to parse or process
        continue


Overwriting contextMapper.py
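
Two details of the task-sentence lookup are worth illustrating. The cleaned task can still contain `?` and `.`, which are regex metacharacters, and `np.argmax` over an all-`False` list returns 0, so a task that is never found looks the same as a match in the first sentence. A minimal sketch of a literal lookup that reports a miss explicitly; `find_task_sentence` is a hypothetical helper and the sentences are invented.

```python
import re

def find_task_sentence(task, sents):
    """Return the index of the first sentence that contains `task`
    as a literal substring, or None if no sentence contains it."""
    for i, s in enumerate(sents):
        if task in s:
            return i
    return None

sents = ["Thanks for the note.", "Can you help us with this?", "Regards"]
print(find_task_sentence("Can you help us with this?", sents))  # 1
print(find_task_sentence("Please call me", sents))              # None

# Used as a regex, a trailing '?' makes the previous character optional,
# so the pattern "done?" also matches text that only contains "don".
print(bool(re.search("done?", "I am don")))  # True
print("done?" in "I am don")                 # False
```

A caller could then emit empty context fields, or skip the record, when the helper returns `None`.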


In [99]:
!chmod +x contextMapper.py

In [100]:
!python contextMapper.py < ../data/epa_message_task_samp.txt

Please reply  Brent	Please find attached the final versions of the ETA and PA with the changes discussed and agreed on today s conference call.	Please find attached the final versions of the ETA and PA with the changes discussed and agreed on today s conference call.	Please reply Brent Hendry with your approval and let me know if you have any questions/comments Thanks Carlos x5-8705
Please incorporate these changes into the 2000 plan.			Please incorporate these changes into the 2000 plan.
Please forward this information to any one in your organization	The last public seminars that PGS Energy Training will be offering until October have been scheduled at the downtown Houston Hyatt Regency on June 12, 13, 14 15.. Programs listed below .. For more information, call PHONE or visit http://www.pgsenergy.com/schedule.html	For more information, call PHONE or visit http://www.pgsenergy.com/schedule.html	Please forward this information to any one in your organization who might benefit from a bet

Take a look at this	Mike	Mike	Take a look at this
Laurel:  Will you please contact Treasury per Lynn s email below?			Laurel: Will you please contact Treasury per Lynn s email below?
Can you help us with this?	Thanks, Charles, for the information.. We need to show what utilities would have saved if they had hedged their positions.	We need to show what utilities would have saved if they had hedged their positions.	Can you help us with this?
Richard, please review and comment asap.			Richard, please review and comment asap.
Please review and let me know what changes are suggested.	This was forwarded to me by Phil DeMoes.. Dan,. Take a look at this letter prepared by our surety provider on the PEAK pre-pay.. In particular, note the last paragraph.. I m concerned on the unwind if there is a default.. We plan to meet tomorrow with Joe Deffner and Breese tomorrow morning.. Are you available?. FYI.. Phil-Attached is a first round draft of a reference letter for the PEAK project.	Phil-Atta

Please print out the  swap  documents for me ASAP.			Please print out the swap documents for me ASAP.
Please move all the following counterparties financial   natural gas trades only deals from ENA-FT-WT-SOCAL book to the BANKRUPTCY book:			Please move all the following counterparties financial natural gas trades only deals from ENA-FT-WT-SOCAL book to the BANKRUPTCY book: HESSENESER BRIDGELIGASMAR CONSUMERS CORNERSTPROL P IGIRES OCCIDENTENEMAR PRAXAIR PSEGENERES SEMPRAENETRA
Please answer these	If the PMT breaker trip doesn t always cause a given fault then is there any easy way SCADA data or site crew experience for us to estimate what percentage of the trips did fault the turbine offlline due to a downtime or line out fault?	If the PMT breaker trip doesn t always cause a given fault then is there any easy way SCADA data or site crew experience for us to estimate what percentage of the trips did fault the turbine offlline due to a downtime or line out fault?	Please answer these quest

In preparation for our discussion tomorrow, can you run VAR numbers for some			Vlady: In preparation for our discussion tomorrow, can you run VAR numbers for some mini-portfolios:
Please respond to virginia clarkfineart.com	Is there room on the price if I give you a firm commitment.. The ebay price is significantly discounted but they are not ready to close the deal yet either.. If you can t get them down on the price, it is worth me chasing down the other option for a while.. Please let me know via email and I will attempt to be responsive via odd time zones.. Thanks, mike	Thanks, mike	Virginia E. Repasky virginia clarkfineart.com on 09/25/2000 11:04:57 PM Please respond to virginia clarkfineart.com To: Mike.McConnell enron.com cc: Subject: Re: Lichtenstein s Blue Note
Please set up for Jan 1 start	Done.. Deal 210603. I just plugged/fugged/guessed at the demand charge.. When Dana does her voodoo stuff, this deal will show up.. Ms. Franklin, this might be my last contract to set up h

pls print for me.			pls print for me.
Please let her know once a final order	man, do i OWE you a phone call.. please don t hold it against me.. sort of well, you know, a little nuts right now.. but i will reconnect.. i promise.. SCE is awaiting a commission order to auction off its SO2 credits.. Janel Guerrero is very interested in this.	Janel Guerrero is very interested in this.	Please let her know once a final order is issued.
Please review and let me know your thoughts on their comments.			Please review and let me know your thoughts on their comments.
please take a few minutes today to fill out Enron s annual employee feedbac	Forwarded by Scott Neal/HOU/ECT on 10/20/2000 09:06 AM 20. Is available: through Friday, Oct. 27 Is located at: survey.enron.com. You only have one more week to check your pulse.	You only have one more week to check your pulse.	Your input is crucial, so 20 please take a few minutes today to fill out Enron s annual employee feedbac k 20 survey.
Julie  - coul

In [101]:
!python contextMapper.py < ../data/epa_message_task.csv > ../data/lim_context_task_.txt

In [102]:
!wc ../data/lim_context_task_.txt

    6732  429459 2459779 ../data/lim_context_task_.txt


In [103]:
!wc ../data/epa_message_task.csv

    6734 1454298 11971256 ../data/epa_message_task.csv
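
The two line counts differ by two (6734 in, 6732 out), which is consistent with the mapper's `except` branch silently skipping a couple of records. A minimal sketch of counting such skips instead of hiding them, assuming the only requirement is exactly two tab-separated fields per line; `parse_records` is a hypothetical helper and the sample lines are invented.

```python
def parse_records(lines):
    """Split each line into (message, task); count lines that do not
    have exactly two tab-separated fields."""
    records, skipped = [], 0
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            skipped += 1
            continue
        message, task = parts
        records.append((message, task))
    return records, skipped

# Invented sample input
lines = [
    "msg one\ttask one\n",
    "no tab here\n",   # one field: skipped
    "a\tb\tc\n",       # three fields: skipped
    "msg two\ttask two\n",
]
records, skipped = parse_records(lines)
print(len(records), skipped)  # 2 2
```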
