In [1]:
#Importing all necessary libraries for parsing, cleaning and basic pre-processing

import os
import glob
import string
import pandas as pd

from nltk.tokenize import RegexpTokenizer

All the email files are stored in a directory called data. To decide how to clean these files, the contents of an email must be inspected first, so a single email is read and printed.

In [2]:
file_path = "data/176869"
#open the file in a context manager so it is closed after reading
with open(file_path, encoding="utf8", errors='ignore') as file:
    contents = file.readlines()
for content in contents:
    print(content)

Xref: cantaloupe.srv.cs.cmu.edu talk.politics.misc:176869 alt.sex:110443 soc.men:67802 misc.legal:59959

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!husc-news.harvard.edu!hsdndev!yale!gumby!wupost!zaphod.mps.ohio-state.edu!ub!galileo.cc.rochester.edu!uhura.cc.rochester.edu!as010b

Newsgroups: talk.politics.misc,alt.sex,soc.men,misc.legal

Subject: Re: Stop putting down white het males.

Message-ID: <1993Apr5.213327.23802@galileo.cc.rochester.edu>

From: as010b@uhura.cc.rochester.edu (Tree of Schnopia)

Date: 5 Apr 93 21:33:27 GMT

Sender: news@galileo.cc.rochester.edu

References: <DJNg2B1w165w@cybernet.cse.fau.edu> <C50FHG.MEA@ocsmd.ocs.com>

Organization: University of Rochester - Rochester, New York

Nntp-Posting-Host: uhura.cc.rochester.edu

Lines: 32



In <C50FHG.MEA@ocsmd.ocs.com> mark@ocsmd.ocs.com (Mark Wilson) writes:



>Yuri Villanueva (elmo@cybernet.cse.fau.edu) wrote:

>: pbray@envy.reed.edu (Public account) writes:

>: 

>: > In article <1993Apr2.180839.14305@ga

Every email has a similar format, and much of its content is unnecessary for the objective of this task.
Thus, I extract only the subject, the body, the from/sender details, and the message-ID of each email.

A function to parse each email and extract only these contents is implemented.

In [3]:
def parse_files(raw_message):
    
    #split the raw_message into lines using delimiter
    lines = raw_message.split('\n')
    
    message = ''
    
    context = {}
    
    #extract only from, subject and message-id
    keywords = ['from', 'subject','message-id']
    
    #a line without ':' is treated as part of the message body
    #(body lines that do contain ':' are therefore skipped)
    for line in lines:
        
        if ':' not in line:
            message += line.strip()
            context['body'] = message
            
        #the rest is xref, subject, from, to, id, time, etc 
        #we extract only what we need and put it in the context dict
        else:
            #splitting on every ':' means the value stops at the next colon
            pairs = line.split(':')
            key = pairs[0].lower()
            value = pairs[1].strip()
            
            #code to extract subject contents when Re: is used 
            if str(key) == 'subject':
                if len(pairs) > 2:
                    val2 = pairs[2].strip()
                    value = val2
            
            #extract the rest of the key value pairs
            if key in keywords:
                context[key] = value
                
    #return the email context dictionary
    return context

In [4]:
#join the contents and store it in a string
raw = ''.join(str(content) for content in contents)
raw

'Xref: cantaloupe.srv.cs.cmu.edu talk.politics.misc:176869 alt.sex:110443 soc.men:67802 misc.legal:59959\nPath: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!husc-news.harvard.edu!hsdndev!yale!gumby!wupost!zaphod.mps.ohio-state.edu!ub!galileo.cc.rochester.edu!uhura.cc.rochester.edu!as010b\nNewsgroups: talk.politics.misc,alt.sex,soc.men,misc.legal\nSubject: Re: Stop putting down white het males.\nMessage-ID: <1993Apr5.213327.23802@galileo.cc.rochester.edu>\nFrom: as010b@uhura.cc.rochester.edu (Tree of Schnopia)\nDate: 5 Apr 93 21:33:27 GMT\nSender: news@galileo.cc.rochester.edu\nReferences: <DJNg2B1w165w@cybernet.cse.fau.edu> <C50FHG.MEA@ocsmd.ocs.com>\nOrganization: University of Rochester - Rochester, New York\nNntp-Posting-Host: uhura.cc.rochester.edu\nLines: 32\n\nIn <C50FHG.MEA@ocsmd.ocs.com> mark@ocsmd.ocs.com (Mark Wilson) writes:\n\n>Yuri Villanueva (elmo@cybernet.cse.fau.edu) wrote:\n>: pbray@envy.reed.edu (Public account) writes:\n>: \n>: > In article <1993Apr2.180839.14305@g

In [5]:
parse_files(raw)

{'subject': 'Stop putting down white het males.',
 'message-id': '<1993Apr5.213327.23802@galileo.cc.rochester.edu>',
 'from': 'as010b@uhura.cc.rochester.edu (Tree of Schnopia)',
 'body': '>Personally, I doubt she said anything of the kind, but if>someone can provide the ORIGINAL quote, IN CONTEXT, WITH SOURCE>(for, ahem, cross-checking), I would we willing to agree>she is full of sh*t.  Naturally, if no one can provide these>bits of data, the paraphrase listed must be disregarded,>and its poster regarded as full of sh*t.  OK, so which will it be?I followed up without a thought of double-checking...if I double-checkedevery fact people vomited onto the table here on the net, I\'d never havetime to sleep.  But to pass the buck to the person who originally postedthat quote......well, Michael?  Take it away!  (wild applause)Drewcifer------bi    Andrew D. Simchik\t\t\t\t\tSCHNOPIA!\\ ----    as010b@uhura.cc.rochester.edu\t\t\t\tTreeWater\\\\  /\\/     "Words Weren\'t Made For Cowards"--Happy

The function parse_files() returns the four expected fields for this email and can now be applied to all the emails in the data directory.
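
One caveat of the colon test is that body lines containing a colon are skipped and header values are cut at their second colon. A minimal sketch of an alternative, assuming the files follow the header block, blank line, body layout seen above, uses the parser in Python's standard email package. The function name parse_with_email_module and the sample message are made up for illustration, and unlike parse_files() this version keeps the "Re:" prefix in the subject.

```python
from email import message_from_string

def parse_with_email_module(raw_message):
    """Hypothetical alternative to parse_files(), built on the stdlib email parser."""
    msg = message_from_string(raw_message)
    return {
        # header lookups are case-insensitive
        'from': msg['From'],
        'subject': msg['Subject'],
        'message-id': msg['Message-ID'],
        # everything after the first blank line, colons included
        'body': msg.get_payload(),
    }

# Made-up message in the same layout as the files in data/
sample = (
    "Subject: Re: Weitek P9000 ?\n"
    "Message-ID: <example@host>\n"
    "From: someone@example.com (Some One)\n"
    "\n"
    "Note: this line has a colon.\n"
    "Second line.\n"
)

parsed = parse_with_email_module(sample)
print(parsed['subject'])
print(parsed['body'])
```

Because the body is returned with its line breaks intact, words at the end of one line are not fused with the first word of the next.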

All the files in the data directory can be read using glob and parsed using the function above. The parsed content is then stored in a pandas dataframe.

In [6]:
#create an empty pandas dataframe
main = pd.DataFrame()
print(main)

Empty DataFrame
Columns: []
Index: []


In [7]:
#READ THE FILES
#PARSE FILES
#COLLECT THE PARSED RECORDS, THEN BUILD THE DATAFRAME ONCE
#(DataFrame.append was removed in pandas 2.0)

records = []
for file in glob.glob('data/*'):
    #context manager closes each file after reading
    with open(file, encoding="utf8", errors='ignore') as f:
        raw = f.read()
    records.append(parse_files(raw))

#fix the column order explicitly; missing keys become NaN
main = pd.DataFrame(records, columns=['body', 'from', 'message-id', 'subject'])
count = len(records)

print(f'Files Read : {count}')
print('Dataframe Processed\n')

Files Read : 300
Dataframe Processed



In [8]:
main.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 4 columns):
body          300 non-null object
from          300 non-null object
message-id    300 non-null object
subject       300 non-null object
dtypes: object(4)
memory usage: 9.5+ KB


In [9]:
#DROP ROWS WITH NULL VALUES IF ANY
main = main.dropna()
main.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 300 entries, 0 to 299
Data columns (total 4 columns):
body          300 non-null object
from          300 non-null object
message-id    300 non-null object
subject       300 non-null object
dtypes: object(4)
memory usage: 11.7+ KB


No data is lost in the process. All 300 emails are read, parsed, and stored in a pandas dataframe.

The next step is preprocessing the dataframe contents. Grouping emails into similar categories calls for an unsupervised learning approach: clustering. For that, the body of each email has to be tokenized and stripped of punctuation, so that the words can be vectorized and analyzed.
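
To make "vectorized" concrete, here is a minimal sketch of turning token lists into count vectors with the standard library only. It is an illustration under assumptions: the two token lists are made up, the helper to_count_vector is hypothetical, and this notebook does not say which vectorizer the later clustering step uses.

```python
from collections import Counter

# Two made-up token lists standing in for tokenized email bodies
docs = [
    ['the', 'card', 'uses', 'the', 'chip'],
    ['the', 'court', 'ruling'],
]

# Vocabulary: every distinct token, in a fixed (sorted) order
vocab = sorted({tok for doc in docs for tok in doc})

def to_count_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each email becomes a fixed-length list of numbers, which is the form a clustering algorithm can compare.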

I tried cleaning the contents of the email body both with a custom function and with the NLTK preprocessing and tokenizing libraries. NLTK gave better results, so the NLTK tokenizer with regular expression
matching is used to tokenize the email body of every file.

In [10]:
def clean_nltk(text):
    
    #Tokenize and remove punctuation using the RegexpTokenizer from NLTK
    tokenizer = RegexpTokenizer(r'\w+')
    
    #tokenize text
    text = tokenizer.tokenize(text)
    
    return text

In [11]:
#APPLY THE ABOVE CLEAN_NLTK FUNCTION TO THE PANDAS DATAFRAME
main['body'] = main['body'].apply(clean_nltk)
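
To see what the pattern \w+ keeps and drops, the sketch below applies the same regular expression with Python's re module. To my understanding this gives the same tokens as RegexpTokenizer(r'\w+') in its default mode, but treat that equivalence as an assumption; the helper name tokenize_words and the sample sentence are made up for illustration.

```python
import re

def tokenize_words(text):
    # Same pattern as clean_nltk(): keep runs of letters, digits and underscores
    return re.findall(r'\w+', text)

tokens = tokenize_words("and I'm sure that people who were big...")
print(tokens)
```

Note that the apostrophe is treated as punctuation, so a contraction such as "I'm" is split into two tokens, and case is left unchanged.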

In [12]:
#now the body is cleaned and tokenized
main.head()

Unnamed: 0,body,from,message-id,subject
0,"[Anyone, know, about, the, Weitek, P9000, grap...",cavalier@blkbox.COM (Bill Egan),<1993Apr18.031714.3642@nntpxfer.psi.com>,Weitek P9000 ?
1,"[In, article, 1993Mar31, 140529, 10843, news, ...",arc@cco.caltech.edu (Aaron Ray Clements),<1pifnjINNscb@gap.caltech.edu>,ACLU (was Re
2,"[and, I, m, sure, that, people, who, were, big...",hrubin@pop.stat.purdue.edu (Herman Rubin),<C5sFnz.Fo1@mentor.cc.purdue.edu>,Gritz/JBS/Liberty Lobby/LaRouche/Christic Insi...
3,"[In, article, 1r6p8oINN8hi, clem, handheld, co...",wwarf@silver.ucs.indiana.edu (Wayne J. Warf),<C5y2r9.4D7@usenet.ucs.indiana.edu>,BD's did themselves--you're all paranoid freaks
4,"[I, recently, have, become, aware, that, my, h...",mon@cray.com (Muriel Nelson),<1993Apr15.154053.3087@hemlock.cray.com>,ABORTION and private health coverage -- letter...


In [13]:
#THE FINAL DATAFRAME IS THEN EXPORTED AS A CSV FILE FOR CLUSTERING AND LDA SO THE ABOVE STEPS NEED NOT BE 
#REPEATED
main.to_csv('email_input.csv',index=False)
print('Look for email_input.csv in the current directory')

Look for email_input.csv in the current directory
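
One thing to keep in mind when loading the file again: a CSV cell can only hold text, so each token list comes back as a string rather than a list. The sketch below simulates a single row with the standard csv module and converts the string back with ast.literal_eval. It assumes the list was written in its Python string form, which is my understanding of how pandas stringifies list cells; the row contents here are illustrative.

```python
import ast
import csv
import io

# Simulate one row of email_input.csv: the token list is stored as its string form
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['body', 'subject'])
writer.writerow([str(['Anyone', 'know', 'about']), 'Weitek P9000 ?'])

# Read the row back the way a later notebook would
buffer.seek(0)
rows = list(csv.DictReader(buffer))

raw_body = rows[0]['body']            # a str, not a list
tokens = ast.literal_eval(raw_body)   # back to a real list of tokens
print(type(raw_body).__name__, tokens)
```

ast.literal_eval only evaluates Python literals, which makes it a safer choice than eval for this conversion.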
