# Fraud detection using text
In this final chapter, you will use text data, text mining and topic modeling to detect fraudulent behavior.

## Word search with dataframes
In this exercise you're going to work with text data, containing emails from Enron employees. The Enron scandal is a famous fraud case. Enron employees covered up the bad financial position of the company, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific words, such as "sell enron stock".

In [1]:
# Import module
from pandas import read_pickle

# Load Data
df = read_pickle('data/eron.pkl')

# Find all cleaned emails that contain 'sell enron stock'
mask = df['clean_content'].str.contains('sell enron stock', na=False)

# Select the data from df using the mask
print(df.loc[mask])

                           From                          To  \
154  ('sarah.palmer@enron.com')  ('sarah.palmer@enron.com')   

                    Date                                            content  \
154  2002-02-01 14:53:35  \nJoint Venture: A 1997 Enron Meeting Belies O...   

                                         clean_content  
154  joint venture enron meeting belies officers cl...  


You see that searching for particular string values in a dataframe can be relatively easy, and allows you to include textual data in your model or analysis. You can use this word search as an additional flag, or as a feature in your fraud detection model. Let's now have a look at how to filter the data using multiple search terms.
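
To see how the mask treats missing values, here is a minimal sketch on a toy dataframe. The rows are invented for illustration; only the column name clean_content comes from the exercise.

```python
import pandas as pd

# Toy data, invented for illustration; only the column name mirrors the exercise
toy = pd.DataFrame({
    'clean_content': [
        'please sell enron stock today',
        'lunch meeting moved to friday',
        None,  # an email with no cleaned content
    ]
})

# na=False makes missing values count as "no match" instead of propagating NaN
toy_mask = toy['clean_content'].str.contains('sell enron stock', na=False)
print(toy_mask.tolist())

# A boolean mask without NaN can be passed straight to .loc
matches = toy.loc[toy_mask]
print(len(matches))
```

Without na=False, the missing row would produce NaN in the mask, and selecting with .loc would then fail.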

## Using a list of terms
Oftentimes you don't want to search on just one term. You can probably create a full "fraud dictionary" of terms that could potentially flag fraudulent clients and/or transactions. Fraud analysts will often have an idea of what should be in such a dictionary. In this exercise you're going to flag a multitude of terms, and in the next exercise you'll create a new flag variable out of it. The 'flag' can be used either directly in a machine learning model as a feature, or as an additional filter on top of your machine learning model results. Let's first use a list of terms to filter our data on.

In [2]:
# Create a list of terms to search for
searchfor = ['enron stock', 'sell stock', 'stock bonus', 'sell enron stock']

# Filter cleaned emails on searchfor list and select from df
filtered_emails = df.loc[df['clean_content'].str.contains('|'.join(searchfor), na=False)]
print(filtered_emails.head())

                              From                               To  \
0  ('advdfeedback@investools.com')  ('advdfeedback@investools.com')   
1    ('richard.sanders@enron.com')    ('richard.sanders@enron.com')   
2            ('m..love@enron.com')            ('m..love@enron.com')   
3     ('leslie.milosevich@kp.org')     ('leslie.milosevich@kp.org')   
4     ('rtwait@graphicaljazz.com')     ('rtwait@graphicaljazz.com')   

                  Date                                            content  \
0  2002-01-29 23:20:55  INVESTools Advisory\nA Free Digest of Trusted ...   
1  2000-09-20 19:07:00  ----- Forwarded by Richard B Sanders/HOU/ECT o...   
2  2001-10-30 16:15:17  hey you are not wearing your target purple shi...   
3  2002-01-30 17:54:18  Leslie Milosevich\n1042 Santa Clara Avenue\nAl...   
4  2002-01-30 19:36:01  Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...   

                                       clean_content  
0  investools advisory free digest trusted invest...  


By joining the search terms with the 'or' sign, i.e. |, you can search on a multitude of terms in your dataset very easily. Let's now create a flag from this which you can use as a feature in a machine learning model.
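
Because str.contains interprets the joined string as a regular expression by default, terms that contain characters such as parentheses or dots may not match literally. A minimal sketch of one way to guard against that, assuming hypothetical search terms and toy rows that are not part of the Enron data:

```python
import re
import pandas as pd

# Hypothetical terms; the second one contains regex metacharacters
terms = ['stock bonus', 'q4 (revised)']

# Escape each term so it is matched literally, then join with the 'or' sign
pattern = '|'.join(re.escape(t) for t in terms)

toy = pd.DataFrame({
    'clean_content': ['see q4 (revised) numbers', 'stock bonus approved', 'hello world']
})
hits = toy['clean_content'].str.contains(pattern, na=False)
print(hits.tolist())
```

The terms in searchfor contain only letters and spaces, so escaping changes nothing for them; it matters once the dictionary grows to include punctuation.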

## Creating a flag
This time you are going to create an actual flag variable that gives a 1 when the emails get a hit on the search terms of interest, and 0 otherwise. This is the last step you need to make in order to actually use the text data content as a feature in a machine learning model, or as an actual flag on top of model results. You can continue working with the dataframe df containing the emails, and the searchfor list is the one defined in the last exercise.

In [3]:
import numpy as np

# Create flag variable where the emails match the searchfor terms
df['flag'] = np.where(df['clean_content'].str.contains('|'.join(searchfor), na=False), 1, 0)

# Count the values of the flag variable
count = df['flag'].value_counts()
print(count)

0    1776
1     314
Name: flag, dtype: int64


You have now managed to search for a list of strings in several lines of text data. These skills come in handy when you want to flag certain words based on what you discovered in your topic model, or when you know beforehand what you want to search for. In the next exercises you're going to learn how to clean text data and to create your own topic model to further look for indications of fraud in your text data.
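
As a side note, the same 0/1 flag can be built without np.where by casting the boolean mask to integers. The sketch below uses a toy dataframe with invented rows; the search terms are the ones from the exercise.

```python
import pandas as pd

# Toy rows, invented for illustration
toy = pd.DataFrame({'clean_content': ['sell stock now', 'see you at lunch', None]})
terms = ['enron stock', 'sell stock', 'stock bonus', 'sell enron stock']

# True/False becomes 1/0; na=False maps missing emails to 0
toy['flag'] = toy['clean_content'].str.contains('|'.join(terms), na=False).astype(int)
print(toy['flag'].tolist())
```

Both forms give the same column, so the choice is a matter of taste.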

## Removing stopwords
In the following exercises you're going to clean the Enron emails, in order to be able to use the data in a topic model. Text cleaning can be challenging, so you'll learn some steps to do this well. The dataframe containing the emails df is available. In a first step you need to define the list of stopwords and punctuations that are to be removed in the next exercise from the text data. Let's give it a try.

In [4]:
# Import nltk packages and string
from nltk.corpus import stopwords
import string

# Define stopwords to exclude
stop = set(stopwords.words('english'))
stop.update(("to","cc","subject","http","from","sent", "ect", "u", "fwd", "www", "com"))

# Define punctuation characters to exclude
exclude = set(string.punctuation)

## Cleaning text data
Now that you've defined the stopwords and punctuation, let's use them to further clean the Enron emails in the dataframe df. The sets containing stopwords and punctuation are available as stop and exclude. A few more steps are needed before the data is clean, such as lemmatization of words and stemming of verbs. The verbs in the email data are already stemmed, and the lemmatization is already done for you in this exercise.

In [5]:
# Import the lemmatizer from nltk
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
lemma = WordNetLemmatizer()

# Define word cleaning function
def clean(text, stop):
    text = text.rstrip()
    stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
    punc_free = ''.join(i for i in stop_free if i not in exclude)
    normalized = " ".join(lemma.lemmatize(i) for i in punc_free.split())
    return normalized

# Clean the emails in df and print results
text_clean=[]
for text in df['clean_content'].astype(str):
    text_clean.append(clean(text, stop).split())
print(text_clean[0])

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/alejandro.robles/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['investools', 'advisory', 'free', 'digest', 'trusted', 'investment', 'advice', 'unsubscribe', 'free', 'newsletter', 'please', 'see', 'issue', 'fried', 'sell', 'stock', 'gain', 'month', 'km', 'rowe', 'january', 'index', 'confirms', 'bull', 'market', 'aloy', 'small', 'cap', 'advisor', 'earns', 'lbix', 'compounding', 'return', 'pine', 'tree', 'pcl', 'undervalued', 'high', 'yield', 'bank', 'put', 'customer', 'first', 'aso', 'word', 'sponsor', 'top', 'wall', 'street', 'watcher', 'ben', 'zacks', 'year', 'year', 'gain', 'moving', 'best', 'brightest', 'wall', 'street', 'big', 'money', 'machine', 'earned', 'ben', 'zacks', 'five', 'year', 'average', 'annual', 'gain', 'start', 'outperforming', 'long', 'term', 'get', 'zacks', 'latest', 'stock', 'buylist', 'free', 'day', 'trial', 'investools', 'c', 'go', 'zaks', 'mtxtu', 'zakstb', 'investools', 'advisory', 'john', 'brobst', 'investools', 'fried', 'sell', 'stock', 'lock', 'month', 'km', 'david', 'fried', 'know', 'stock', 'undervalued', 'company', '

You have now cleaned your data with all the necessary steps: splitting the text into words, removing stopwords and punctuation, and lemmatizing the words. You are ready to run a topic model on this data. In the following exercises you're going to explore how to do that.
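
To make the order of the cleaning steps concrete, here is a dependency-free sketch of the same pipeline. It uses a tiny hand-written stopword set (an assumption for illustration, not nltk's English list) and leaves out lemmatization; clean_sketch, toy_stop and toy_exclude are hypothetical names.

```python
import string

# Tiny stand-in stopword set, invented for illustration
toy_stop = {'the', 'to', 'from', 'please', 'a'}
toy_exclude = set(string.punctuation)

def clean_sketch(text, stop):
    # Lowercase, split on whitespace, drop stopwords and pure-digit tokens
    kept = [w for w in text.rstrip().lower().split()
            if w not in stop and not w.isdigit()]
    # Strip punctuation characters from what remains
    no_punct = ''.join(ch for ch in ' '.join(kept) if ch not in toy_exclude)
    # Return the cleaned text as a list of tokens (no lemmatization here)
    return no_punct.split()

print(clean_sketch('Please forward the 2001 report to Sarah, thanks!', toy_stop))
print(clean_sketch('to, the', toy_stop))
```

The second call shows a side effect of filtering stopwords before stripping punctuation: the token "to," is not in the stopword set, so it survives and becomes "to" once the comma is removed.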

## Create dictionary and corpus
In order to run an LDA topic model, you first need to define your dictionary and corpus, as both go into the model. You're going to continue working with the text data you cleaned in the previous exercises. That means text_clean is already available, and you'll use it to create your dictionary and corpus.

In [6]:
# Import the packages
import gensim
from gensim import corpora

# Define the dictionary
dictionary = corpora.Dictionary(text_clean)

# Define the corpus
corpus = [dictionary.doc2bow(text) for text in text_clean]

These are the two ingredients you need to run your topic model on the Enron emails. You are now ready for the final step: creating your first fraud detection topic model.
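
If it is unclear what the corpus contains, the idea can be sketched in plain Python: the dictionary maps each distinct token to an integer id, and each document becomes a list of (token id, count) pairs. This illustrates the bag-of-words concept only; it is not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter

# Two toy documents, already tokenized (invented for illustration)
docs = [['sell', 'stock', 'stock'], ['stock', 'bonus']]

# Assign each distinct token an id; sorted order is an arbitrary choice for this sketch
vocab = sorted({tok for doc in docs for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocab)}

def to_bow(doc):
    # Count how often each token id occurs in the document
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only the counts per document remain.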

## LDA model
Now it's time to build the LDA model. Using the dictionary and corpus, you are ready to discover which topics are present in the Enron emails. With a quick print of words assigned to the topics, you can do a first exploration about whether there are any obvious topics that jump out. Be mindful that the topic model is heavy to calculate so it will take a while to run. Let's give it a try!

In [7]:
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=5)

# Save the topics and top 5 words
topics = ldamodel.print_topics(num_words=5)

# Print the results
for topic in topics:
    print(topic)

(0, '0.019*"enron" + 0.011*"company" + 0.006*"employee" + 0.006*"stock" + 0.005*"energy"')
(1, '0.048*"td" + 0.034*"net" + 0.033*"money" + 0.031*"tr" + 0.028*"width"')
(2, '0.037*"image" + 0.009*"se" + 0.009*"ne" + 0.009*"sp" + 0.008*"wscc"')
(3, '0.044*"enron" + 0.010*"hou" + 0.007*"pm" + 0.007*"please" + 0.007*"development"')
(4, '0.014*"message" + 0.010*"original" + 0.009*"e" + 0.008*"mail" + 0.007*"know"')


You have now successfully created your first topic model on the Enron email data. However, the print of words doesn't really give you enough information to find a topic that might lead you to signs of fraud. You'll therefore need to closely inspect the model results in order to be able to detect anything that can be related to fraud in your data. You'll learn more about this in the next video.
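
One small step toward inspecting the results is to turn the printed topic strings into (word, weight) pairs that can be searched. The sketch below parses the five strings from the output above; parse_topic is a hypothetical helper written for this illustration, not part of gensim.

```python
import re

# Topic strings copied from the printed output of the LDA model
topics_printed = [
    (0, '0.019*"enron" + 0.011*"company" + 0.006*"employee" + 0.006*"stock" + 0.005*"energy"'),
    (1, '0.048*"td" + 0.034*"net" + 0.033*"money" + 0.031*"tr" + 0.028*"width"'),
    (2, '0.037*"image" + 0.009*"se" + 0.009*"ne" + 0.009*"sp" + 0.008*"wscc"'),
    (3, '0.044*"enron" + 0.010*"hou" + 0.007*"pm" + 0.007*"please" + 0.007*"development"'),
    (4, '0.014*"message" + 0.010*"original" + 0.009*"e" + 0.008*"mail" + 0.007*"know"'),
]

def parse_topic(topic_str):
    # Each term has the form weight*"word"
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

parsed = {topic_id: parse_topic(s) for topic_id, s in topics_printed}

# Which topics list 'stock' among their top five words?
with_stock = [tid for tid, pairs in parsed.items() if any(w == 'stock' for w, _ in pairs)]
print(parsed[0])
print(with_stock)
```

Keep in mind that LDA training is randomized, so topic numbers and word weights will generally differ from one run to the next unless a random seed is fixed.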

## Finding fraudsters based on topic
In this exercise you're going to link the results from the topic model back to your original data. You have now learned that you want to flag everything related to topic 3. As you will see, this is actually not that straightforward. You'll be given the function get_topic_details(), which takes the arguments ldamodel and corpus and retrieves the details of the dominant topic for each line of text. With that function, you can append the results back to your original data. If you want to learn in more detail how to work with the model results, which is beyond the scope of this course, you're highly encouraged to read this article.

In [8]:
import pandas as pd

def get_topic_details(ldamodel, corpus):
    rows = []
    for doc_topics in ldamodel[corpus]:
        # Sort this document's topics by probability, highest first
        doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
        if not doc_topics:
            continue
        # The first entry is the dominant topic
        topic_num, prop_topic = doc_topics[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
    # Build the dataframe once from the collected rows
    return pd.DataFrame(rows, columns=['Dominant_Topic', '% Score', 'Topic_Keywords'])

In [9]:
# Run get_topic_details function and check the results
print(get_topic_details(ldamodel, corpus).head())

# Add original text to topic details in a dataframe
contents = pd.DataFrame({'Original text': text_clean})
topic_details = pd.concat([get_topic_details(ldamodel, corpus), contents], axis=1)
print(topic_details.head())

# Create flag for text most strongly associated with topic 3
topic_details['flag'] = np.where((topic_details['Dominant_Topic'] == 3.0), 1, 0)
print(topic_details.head())

   Dominant_Topic  % Score                                     Topic_Keywords
0             0.0   0.9714  enron, company, employee, stock, energy, new, ...
1             3.0   0.4260  enron, hou, pm, please, development, corp, mes...
2             3.0   0.6837  enron, hou, pm, please, development, corp, mes...
3             0.0   0.9935  enron, company, employee, stock, energy, new, ...
4             0.0   0.9934  enron, company, employee, stock, energy, new, ...
   Dominant_Topic  % Score                                     Topic_Keywords  \
0             0.0   0.9714  enron, company, employee, stock, energy, new, ...   
1             3.0   0.4260  enron, hou, pm, please, development, corp, mes...   
2             3.0   0.6837  enron, hou, pm, please, development, corp, mes...   
3             0.0   0.9935  enron, company, employee, stock, energy, new, ...   
4             0.0   0.9934  enron, company, employee, stock, energy, new, ...   

                                       Origin

You have now flagged all data that is most strongly associated with topic 3, which seems to cover internal conversation about Enron stock options. You are a true detective. With these exercises you have demonstrated that text mining and topic modeling can be powerful tools for fraud detection.
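
A dominant topic can still carry a fairly low score, as the row with 0.4260 in the output shows. One possible refinement, sketched below, is to flag only rows whose dominant-topic score clears a cutoff. The value of min_score is a hypothetical choice for illustration and not something the course prescribes; the toy rows are copied from the first five lines of the output.

```python
import numpy as np
import pandas as pd

# First five rows of the topic details shown in the output
toy_details = pd.DataFrame({
    'Dominant_Topic': [0.0, 3.0, 3.0, 0.0, 0.0],
    '% Score': [0.9714, 0.4260, 0.6837, 0.9935, 0.9934],
})

min_score = 0.5  # hypothetical cutoff, chosen only for this sketch

# Flag rows where topic 3 dominates and its score is at least min_score
is_topic_3 = toy_details['Dominant_Topic'] == 3.0
is_confident = toy_details['% Score'] >= min_score
toy_details['strict_flag'] = np.where(is_topic_3 & is_confident, 1, 0)
print(toy_details['strict_flag'].tolist())
```

Whether such a cutoff helps depends on the data, so it is best treated as something to evaluate against known fraud cases.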

