# Ngram Generator

This notebook generates email-like sentences from a trigram model built on email messages, run once for female senders and once for male senders.

# Setup

In [3]:
# Load timer
%load_ext autotime

In [4]:
# Import necessary libraries 
import pandas as pd
from IPython.display import display
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
#nltk.download()
import re
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

# Turn off warnings
import warnings 
warnings.simplefilter('ignore')

time: 982 µs


In [5]:
# Import the version of the data with punctuation intact
emails_withpunc = pd.read_csv("emails_v1.csv")

time: 9.74 s


# Data Cleaning and Relabelling

In [6]:
# Create function to get the gender based on the first name 

import gender_guesser.detector as gender
d = gender.Detector()

def get_gender(email):
    name = email.capitalize().split('.')[0]
    sex = d.get_gender(name)
    return sex

time: 383 ms
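
The name-extraction step inside `get_gender` can be checked on its own, without the detector. The sketch below uses a hypothetical helper, `first_name_from_address`, that mirrors the expression `email.capitalize().split('.')[0]`. It only illustrates what string is handed to the detector and is not part of the original notebook.

```python
def first_name_from_address(address):
    """Return the capitalized text before the first '.' in an email address.

    Hypothetical helper that mirrors the expression used in get_gender.
    """
    return address.capitalize().split('.')[0]


# A 'first.last@domain' address yields the first name.
print(first_name_from_address("phillip.allen@enron.com"))  # Phillip

# An address without a 'first.last' local part yields a token that is not a
# first name, so a name-based lookup is unlikely to recognise it.
print(first_name_from_address("outlook-migration-team@enron.com"))
```

The detector can also return labels other than `male` and `female` (for example `unknown`), so an exact match on a single label excludes those rows.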


In [7]:
# Get gender and add new column

emails_withpunc['Gender'] = emails_withpunc['From'].apply(get_gender)

emails_withpunc.head()

| Unnamed: 0.1 | Unnamed: 0 | Date | From | To | Subject | Message-Body | Gender |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2001-05-14 | phillip.allen@enron.com | tim.belden@enron.com | | Here is our forecast | male |
| 1 | 1 | 2001-05-04 | phillip.allen@enron.com | john.lavorato@enron.com | Re: | Traveling to have a business meeting takes the... | male |
| 2 | 2 | 2000-10-18 | phillip.allen@enron.com | leah.arsdall@enron.com | Re: test | test successful. way to go!!! | male |
| 3 | 3 | 2000-10-23 | phillip.allen@enron.com | randall.gay@enron.com | | Randy, Can you send me a schedule of the salar... | male |
| 4 | 4 | 2000-08-31 | phillip.allen@enron.com | greg.piper@enron.com | Re: Hello | Let's shoot for Tuesday at 11:45. | male |


time: 2.41 s


In [29]:
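# Select emails written by one gender.
# Note: the variable is named female_email, but this run filters on 'male'
# (the sentences generated below match the male list).
# Use 'female' here to build the female model.

female_email = emails_withpunc[ emails_withpunc['Gender'] == 'male' ]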

time: 63 ms


In [30]:
# Do some extra cleaning

bad_keywords = ['X-cc:', 'Outlook Migration Team@', 'X-FileName:', 'X-Origin:', 'X-Folder:', 'X-To: ', 'X-bcc:' ]
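def clean_startswith(df, keyword):
    # Drop rows whose message body starts with the given header keyword;
    # na=False keeps rows with a missing body, as the slice comparison did
    return df[ ~df['Message-Body'].str.startswith(keyword, na=False) ]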

for keyword in bad_keywords:
    female_email = clean_startswith(female_email, keyword)

time: 718 ms


In [31]:
# Convert each message into n-gram readable format

sentences = []
bad_symbols = ['@', '=', '\\', ':', '(', ')']
for email in female_email['Message-Body']:
    try:
        splitup = email.split('. ')
    except AttributeError:
        # Skip non-string bodies (e.g. NaN for an empty message)
        continue
    
    for sentence in splitup:
        sentence = sentence.strip().lower()
        sentence = ''.join(sentence.split('\t'))
        if len(sentence) == 0:
            continue

        skip_this_one = False
        for symbol in bad_symbols:
            if sentence.find(symbol) != -1:
                skip_this_one = True
                break
        if skip_this_one:
            continue


        doublesplit = sentence.split('.')
        for subsentence in doublesplit:
            subsentence = subsentence.strip()
            subsentence = ' , '.join(subsentence.split(','))
            subsentence = ' ? '.join(subsentence.split('?'))
            # Collapse runs of spaces left by the padding above
            while '  ' in subsentence:
                subsentence = subsentence.replace('  ', ' ')
            if len(subsentence) > 3:
                subsentence = subsentence + ' .'
                sentences.append(subsentence)
    if len(sentences) > 1000:
        break
        

import pprint 

pprint.pprint(sentences)

['here is our forecast .',
 'traveling to have a business meeting takes the fun out of the trip .',
 'especially if you have to prepare a presentation .',
 'i would suggest holding the business plan meetings here then take a trip '
 'without any formal business meetings .',
 'i would even try and get some honest opinions on whether a trip is even '
 'desired or necessary .',
 'as far as the business meetings , i think it would be more productive to try '
 'and stimulate discussions across the different groups about what is working '
 'and what is not .',
 'too often the presenter speaks and the others are quiet just waiting for '
 'their turn .',
 'the meetings might be better if held in a round table discussion format .',
 'my suggestion for where to go is austin .',
 "play golf and rent a ski boat and jet ski's .",
 'flying somewhere takes too much time .',
 'test successful .',
 'way to go!!! .',
 'randy , can you send me a schedule of the salary and level of everyone in '
 'the sch
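
The per-sentence steps of the cell above (lowercasing, dropping tabs, skipping sentences that contain header-like symbols, padding commas and question marks, collapsing spaces, appending a ` .` terminator) can be condensed into one function. This is a minimal sketch under stated assumptions: `normalize_sentence` and `BAD_SYMBOLS` are hypothetical names, the second split on `'.'` is omitted, and unlike the original loop it strips trailing spaces before adding the terminator.

```python
import re

# Same symbols the notebook treats as a sign of header or address residue.
BAD_SYMBOLS = ('@', '=', '\\', ':', '(', ')')


def normalize_sentence(raw):
    """Return a space-tokenisable sentence ending in ' .', or None if skipped.

    Hypothetical helper; it condenses the per-sentence steps of the loop.
    """
    sentence = raw.strip().lower().replace('\t', '')
    if not sentence or any(sym in sentence for sym in BAD_SYMBOLS):
        return None
    # Surround commas and question marks with spaces so they become tokens.
    sentence = sentence.replace(',', ' , ').replace('?', ' ? ')
    # Collapse runs of spaces left over from the padding.
    sentence = re.sub(r' {2,}', ' ', sentence).strip()
    if len(sentence) <= 3:
        return None
    return sentence + ' .'


print(normalize_sentence("Randy, Can you send me a schedule?"))
# randy , can you send me a schedule ? .
print(normalize_sentence("X-Origin: Allen-P"))  # None
```

Stripping before the terminator avoids the empty token that otherwise appears between `?` and `.`, which is what produces output such as `aeco? .` in the generated sentences.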

# Generate N-grams

In [32]:
# Create tri-grams

ngrams = {}
for sentence in sentences:
    sent_parts = sentence.split(' ')
    two_ago = None
    one_ago = None
    for word in sent_parts:
        if (two_ago, one_ago) not in ngrams:
            ngrams[(two_ago, one_ago)] = []
        ngrams[(two_ago, one_ago)].append(word)
        two_ago = one_ago
        one_ago = word

time: 16.3 ms
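
To make the shape of the table concrete, the same construction can be run on a toy corpus. The sketch below is an illustration, not the notebook's code: `build_trigram_table` is a hypothetical name, and it uses `collections.defaultdict` in place of the explicit membership check.

```python
from collections import defaultdict


def build_trigram_table(sentences):
    """Map each (two_ago, one_ago) context to the words that followed it."""
    table = defaultdict(list)
    for sentence in sentences:
        two_ago, one_ago = None, None  # (None, None) marks a sentence start
        for word in sentence.split(' '):
            table[(two_ago, one_ago)].append(word)
            two_ago, one_ago = one_ago, word
    return dict(table)


toy = ['test successful .', 'test failed .']
table = build_trigram_table(toy)
print(table[(None, None)])    # ['test', 'test']
print(table[(None, 'test')])  # ['successful', 'failed']
```

Repeated words are kept in each list on purpose: sampling uniformly from the list then reproduces the observed frequencies.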


In [33]:
# Show dictionary of ngrams

ngrams

{(None, None): ['here',
  'traveling',
  'especially',
  'i',
  'i',
  'as',
  'too',
  'the',
  'my',
  'play',
  'flying',
  'test',
  'way',
  'randy',
  'plus',
  'greg',
  'to',
  'using',
  'in',
  'we',
  'kwh',
  'buckner',
  'i',
  'her',
  'phillip',
  'follow',
  'click',
  'click',
  'what',
  'although',
  'what',
  'this',
  'also',
  'the',
  'i',
  "we've",
  'ability',
  'ability',
  'inclusion',
  'ability',
  'ability',
  'show',
  'eliminate',
  'position',
  'benchmark',
  'deployment',
  'currency',
  'hopefully',
  'what',
  'although',
  'what',
  'this',
  'also',
  'the',
  'i',
  "we've",
  'ability',
  'ability',
  'inclusion',
  'ability',
  'ability',
  'show',
  'eliminate',
  'position',
  'benchmark',
  'deployment',
  'currency',
  'hopefully',
  'dave',
  'the',
  'phillip',
  'paula',
  'tim',
  'can',
  'thank',
  'the',
  '50',
  'several',
  '25per',
  '30-$1',
  '40',
  'thisproperty',
  'if',
  'the',
  'i',
  'sincerely',
  'richardspresident',

time: 83.7 ms


# Generate Email

In [34]:
import random

def make_random_sentence(nples):
    seed = (None, None)
    word = '?'
    new_sentence = []
    while word != '.':
        # Use the table passed in as `nples` rather than the global `ngrams`
        word = random.choice(nples[seed])
        new_sentence.append(word)
        seed = (seed[1], word)

    cleanup = ' '.join(new_sentence)
    unspace_symbols = ['.', ',', '?']
    for uss in unspace_symbols:
        cleanup = (uss).join(cleanup.split(' '+uss))
    return cleanup

for i in range(0, 10):
    print( make_random_sentence(ngrams) )
    print("\n")

i am tempted to hold for $3000/acre.


it would be $95/month.


the columns below show the volumes under each ticket and the fax is512-338-1103.


phillip.


even though they are getting quotes to put up new rock then we will probably visit in late june or july.


also, if cb cannot find a good idea for both gas and power in ca.


having said all of our current rental and cost data for your info, plus i noticed three days where it appears that pgen kingsgate is mapped to a table you prepared for dealing with subs, vendors and professionals is not feasible.


she has concerns about the correlation to nymex on aeco? .


let me know if we are building 10 town homes.


i have spoken to brenda and everything looks good.


time: 3.58 ms
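
The generation step is a random walk over the trigram table: start at the `(None, None)` context, sample a next word, shift the context, and stop at `.`. The sketch below shows that walk on a hand-written toy table. `generate_sentence`, the `rng` argument, and the `max_words` guard are assumptions added for illustration. They make the walk reproducible and bounded and are not in the original cell.

```python
import random


def generate_sentence(table, rng, max_words=50):
    """Walk the trigram table from the start context until '.' is drawn."""
    context = (None, None)
    words = []
    while len(words) < max_words:
        word = rng.choice(table[context])
        words.append(word)
        if word == '.':
            break
        context = (context[1], word)
    text = ' '.join(words)
    # Re-attach punctuation tokens to the preceding word.
    for mark in ('.', ',', '?'):
        text = text.replace(' ' + mark, mark)
    return text


# Toy table with a single continuation per context, so the walk is fixed.
toy_table = {
    (None, None): ['thanks'],
    (None, 'thanks'): [','],
    ('thanks', ','): ['phillip'],
    (',', 'phillip'): ['.'],
}
print(generate_sentence(toy_table, random.Random(0)))  # thanks, phillip.
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing sentences from the two models.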


## Randomly Generated Sentences: Female Emails

1. corp.
2. the project who were slinging mardi gras beads, hula hoops and limbo sticks.
3. **i feel sad to end this chapter, yet excited as i begin a new business in houston and a white feather boa twirled about her neck.**
4. eric thode said.
5. we invite you or a member of your group to participate in person, via teleconference or via video conference from the highly liquid enrononline system should improve the performance of their letters to dpc, and the meetings will continues executive president of his business today.
6. **phillip - this is not the case.**
7. peter krenkel, president and ceo of turner, collie & braden inc.
8. but how about something very, very dead- end jobs, " the inclusion of enrononline data satisfies the principal claims made by enron unit transwestern, williams cos' kern river transmission, el paso corp.
9. **thanks.**
10. compared with the equity shortfall as per demand.

## Randomly Generated Sentences: Male Emails

1. i am tempted to hold for \$3000/acre.
2. it would be \$95/month.
3. the columns below show the volumes under each ticket and the fax is512-338-1103.
4. phillip.
5. even though they are getting quotes to put up new rock then we will probably visit in late june or july.
6. also, if cb cannot find a good idea for both gas and power in ca.
7. having said all of our current rental and cost data for your info, plus i noticed three days where it appears that pgen kingsgate is mapped to a table you prepared for dealing with subs, vendors and professionals is not feasible.
8. she has concerns about the correlation to nymex on aeco? .
9. **let me know if we are building 10 town homes.**
10. **i have spoken to brenda and everything looks good.**