# HOMEWORK 6: Email Processing

It's easy to recognize an email address, but can you write code to do it for you? In this homework you'll use regular expressions to determine what does or does not qualify as a valid email address. You'll then sort through all of Jeb Bush's emails from 2001 to see what you can learn about his professional network.


# PROBLEM 1

First, we want to find a way to distinguish email addresses from other strings. For example, 'jeb@jeb.org' is a valid email address, while 'jeb' is not. Email addresses may contain additional '.' or '-' characters, though they generally start and end with alphanumeric characters.

The function argument 'email_list' is a list of strings, each of which is a prospective email address. Complete the 'email_validation' function such that it receives the email_list and returns 'validation_results', an ordered list of 'True'/'False' strings indicating whether each string is a valid email address.


EXAMPLE:
    
    email_validation(['jeb@jeb.org','h.rod17@clintonemail.com','hotmail.com']) 
    
    should return:      ['True','True','False']
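
A validation pattern has to cover the whole string, not just a prefix of it. The minimal sketch below uses a deliberately simplified, hypothetical pattern (`toy_pattern`, which is not a complete email pattern and not the assignment's answer) to show the difference between `re.match`, which anchors only at the start, and `re.fullmatch`, which anchors at both ends.

```python
import re

# Deliberately simplified, illustrative pattern: word characters, '@', word characters.
# It is NOT a complete email pattern; it only demonstrates anchoring.
toy_pattern = re.compile(r'\w+@\w+')

candidate = 'jeb@jeb!!!'

# re.match anchors only at the start, so the trailing '!!!' is silently ignored.
prefix_ok = toy_pattern.match(candidate) is not None

# re.fullmatch requires the entire string to fit the pattern.
whole_ok = toy_pattern.fullmatch(candidate) is not None

print(prefix_ok, whole_ok)
```

Ending a pattern with `\Z` and calling `match` has the same effect as `fullmatch`: nothing may follow the matched text.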

In [None]:
def email_validation(email_list):
    
    # YOUR CODE GOES HERE

    
    # ANSWER
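    import re
    validation_results = []
    # Local part and domain: word characters optionally joined by single '.' or '-';
    # the domain must end in a dot-separated suffix, and \Z forbids trailing characters.
    address_identifier = re.compile(r'\w+([.\-]\w+)*@\w+([.\-]\w+)*\.\w+\Z', re.IGNORECASE)
    for email in email_list:
        # match() anchors at the start; together with \Z the whole string must fit
        check = address_identifier.match(str(email))
        if check is not None:
            validation_results.append('True')
        else:
            validation_results.append('False')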
    
    return validation_results
    

# PROBLEM 2

Your next task is to determine how many friends Jeb has. In particular, you want to find all of the unique email addresses that appear within Jeb's emails from 2001. 

The emails are stored within text files in '../Data/Emails/'. You must complete the 'find_unique_emails' function below such that it receives a path name (e.g. '../Data/Emails/') and returns the number of unique email addresses found as an integer variable 'number_unique_emails'. This should be the total number of unique email addresses found within all of the documents combined.


EXAMPLE:
    
    If in '../Data/Emails/file.txt' we have:
         
         From: jeb@jeb.org
         To: hrod17@clintonemails.com
         Subject: Re: jeb@jeb.org is way cooler than hrod17@clintonemails.com
  
  
    Then the function:  find_unique_emails('../Data/Emails/')
              returns:  2      
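
The counting step in this example can be sketched on an in-memory string before any files are involved. The pattern below (`simple_address`) is a simplified assumption made for illustration, not the assignment's answer; the point is only that `finditer` yields every occurrence and a set keeps each address once.

```python
import re

# Simplified, illustrative pattern (an assumption, not the assignment's answer).
simple_address = re.compile(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+')

# The three lines from the example file above.
sample_text = (
    "From: jeb@jeb.org\n"
    "To: hrod17@clintonemails.com\n"
    "Subject: Re: jeb@jeb.org is way cooler than hrod17@clintonemails.com\n"
)

# finditer yields one match object per occurrence; the set keeps each address once.
unique_addresses = {m.group() for m in simple_address.finditer(sample_text)}

print(len(unique_addresses))
```

Each address appears twice in the text, but the set holds it only once, which is why the example returns 2.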
        


In [None]:
def find_unique_emails(path):

    
    # YOUR CODE GOES HERE

    
    # ANSWER
    import re
    import glob
    import codecs
    emails = []
    files = glob.glob(path+'*.txt')
    address_identifier = re.compile(r'\w+([\.\-\w]\w)*@\w+[\w\-]*\.[\w\-]+(\.[\w\-]+)*\w',re.IGNORECASE)   
    for file in files:
        # Read every line, skipping undecodable bytes; the context manager closes the file
        with codecs.open(file, 'r', encoding='utf-8', errors='ignore') as current_file:
            text = current_file.readlines()
        for line in text:
            emails_found = address_identifier.finditer(line)
            for email in emails_found:
                emails.append(email.group())
    number_unique_emails = len(set(emails))
    
    
    return number_unique_emails


# PROBLEM 3

Finally, we want to determine how diverse Jeb's professional and social network is. We'll do this by counting the total number of unique domain names appearing within Jeb's emails.

As before, the function argument 'path' is the path name for a directory containing text files, each of which contains one month of Jeb Bush's email history. We want to find the number of unique domain names among those in correspondence with Jeb (e.g. 'aol.com' is the domain name for 'roxysurfrchick@aol.com'). For simplicity, you can assume that every unique set of characters following the '@' symbol is a single domain, e.g. '@u.northwestern.edu' and '@northwestern.edu' are separate domains. Return the number of unique domain names as an integer, 'unique_domains'.
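
The "everything after the '@'" rule can be sketched without regular expressions. The address list below is hypothetical (only 'roxysurfrchick@aol.com' comes from the text above), and `str.rpartition` is just one possible way to isolate the domain.

```python
# Hypothetical addresses for illustration; only the first appears in the assignment text.
addresses = [
    'roxysurfrchick@aol.com',
    'student@u.northwestern.edu',
    'professor@northwestern.edu',
    'friend@aol.com',
]

# Everything after the last '@' is treated as the domain, exactly as written,
# so 'u.northwestern.edu' and 'northwestern.edu' remain separate entries.
domains = {address.rpartition('@')[2] for address in addresses}

print(sorted(domains))
```

Two of the four addresses share 'aol.com', so the set holds three distinct domains.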


* NOTE: ALL CODE SHOULD BE INDEPENDENT FROM PREVIOUS PROBLEMS. YOU ARE WELCOME TO COPY/PASTE CODE FROM BEFORE, BUT DO NOT CALL FUNCTIONS EXECUTED IN PREVIOUS CELLS. IF YOU DO, YOUR ASSIGNMENT MAY BE GRADED INCORRECTLY.


In [None]:
def find_unique_domains(path):

    
    # YOUR CODE GOES HERE
    
    
    # ANSWER
    import re
    import glob
    import codecs
    domains = []    
    files = glob.glob(path+'*.txt')
    domain_identifier = re.compile(r'@\w+[\w\-]*\.[\w\-]+(\.[\w\-]+)*\w',re.IGNORECASE)   
    for file in files:
        # Read every line, skipping undecodable bytes; the context manager closes the file
        with codecs.open(file, 'r', encoding='utf-8', errors='ignore') as current_file:
            text = current_file.readlines()
        for line in text:
            domains_found = domain_identifier.finditer(line)
            for domain in domains_found:
                domains.append(domain.group())
    unique_domains = len(set(domains)) 
    
    return unique_domains    

In [None]:
# email validation test
email_list = ['abc@def.com','abc.def@g.h.edu','abcdefg','a@b@c.com','a@b','a?@b.com']
email_validation(email_list)

In [None]:
# unique email count test
path = '../Data/Emails/'
find_unique_emails(path)

In [None]:
# unique domain count test
path = '../Data/Emails/'
find_unique_domains(path)

In [None]:
def parse_emails(path):
    '''
    Receives a directory path and returns a dictionary keyed by sender address.
    Each value is a list with one entry per email sent: [set of recipients, timestamp, subject].
    '''
    
    import glob
    import codecs
    import re
    
    address_identifier = re.compile(r'\w+([\.\-\w]\w)*@\w+[\w\-]*\.[\w\-]+(\.[\w\-]+)*\w',re.IGNORECASE)   
    files = glob.glob(path+'*.txt')
    email_dictionary = {}
    
    for file in files:
        # Read every line, skipping undecodable bytes; the context manager closes the file
        with codecs.open(file, 'r', encoding='utf-8', errors='ignore') as current_file:
            current_text = current_file.readlines()
        to_field = False
        
        for line in current_text:
            
            # search from line for email, use it as dictionary key
            if line[:5] == 'From:':
                sender_match = address_identifier.search(line)
                if sender_match is not None:
                    sender = sender_match.group()
                else:
                    # No address on the line: fall back to the raw header text
                    sender = line[6:].strip()
                    
            if line[:5] == 'Sent:':
                time = line[6:].strip()
            
            if line[:3] == 'To:':
                recipients = []
                recipient_matches = address_identifier.finditer(line)
                for match in recipient_matches:
                    recipients.append(match.group())
                to_field = True
            
            if line[:8] == 'Subject:':
                subject = line[8:].strip()
                to_field = False
                current_data = [set(recipients),time,subject]
                # Append to the sender's list, creating the list on first sight
                email_dictionary.setdefault(sender, []).append(current_data)
        
            elif to_field:
                # Any non-Subject line seen while the To: field is open may carry more recipients
                recipient_matches = address_identifier.finditer(line)
                for match in recipient_matches:
                    recipients.append(match.group())
    return email_dictionary

In [None]:
# Counts total emails sent by each email address
emails = parse_emails(path)   # 'path' is set in the test cell above
total_sent = {}
for email in emails:
    total_sent[email] = len(emails[email])
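
Once the per-sender counts exist, ranking them is a short step. The sketch below runs on a small hypothetical dictionary (`sample_emails`, with made-up addresses, timestamps, and subjects) in the same shape that `parse_emails` returns, so it does not depend on the data files.

```python
# Hypothetical data in the shape returned by parse_emails:
# sender -> list of [set_of_recipients, timestamp, subject]
sample_emails = {
    'jeb@jeb.org': [
        [{'a@example.com'}, 'Monday, January 01, 2001 9:00 AM', 'Hello'],
        [{'b@example.com'}, 'Tuesday, January 02, 2001 9:00 AM', 'Re: Hello'],
    ],
    'a@example.com': [
        [{'jeb@jeb.org'}, 'Monday, January 01, 2001 10:00 AM', 'Re: Hello'],
    ],
}

# Number of emails sent by each address
total_sent = {sender: len(messages) for sender, messages in sample_emails.items()}

# Senders ordered from most to fewest emails sent
top_senders = sorted(total_sent.items(), key=lambda item: item[1], reverse=True)

print(top_senders)
```

Sorting the `(sender, count)` pairs by count keeps the result a plain list, which is easy to slice when only the top few senders are of interest.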