<a href="https://colab.research.google.com/github/abyanjan/Fraud-Detection/blob/master/Fraud_detection_using_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fraud detection using text

In [None]:
!unzip chapter_4.zip

Archive:  chapter_4.zip
   creating: chapter_4/
 extracting: chapter_4/cleantext.pickle  
 extracting: chapter_4/ldamodel.pickle  
  inflating: chapter_4/dict.pickle   
  inflating: chapter_4/corpus.pickle  
  inflating: chapter_4/enron_emails_clean.csv  


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# load data
df = pd.read_csv("chapter_4/enron_emails_clean.csv")
df.head()

Unnamed: 0,Message-ID,From,To,Date,content,clean_content
0,<8345058.1075840404046.JavaMail.evans@thyme>,('advdfeedback@investools.com'),('advdfeedback@investools.com'),2002-01-29 23:20:55,INVESTools Advisory\nA Free Digest of Trusted ...,investools advisory free digest trusted invest...
1,<1512159.1075863666797.JavaMail.evans@thyme>,('richard.sanders@enron.com'),('richard.sanders@enron.com'),2000-09-20 19:07:00,----- Forwarded by Richard B Sanders/HOU/ECT o...,forwarded richard b sanders hou ect pm justin ...
2,<26118676.1075862176383.JavaMail.evans@thyme>,('m..love@enron.com'),('m..love@enron.com'),2001-10-30 16:15:17,hey you are not wearing your target purple shi...,hey wearing target purple shirt today mine wan...
3,<10369289.1075860831062.JavaMail.evans@thyme>,('leslie.milosevich@kp.org'),('leslie.milosevich@kp.org'),2002-01-30 17:54:18,Leslie Milosevich\n1042 Santa Clara Avenue\nAl...,leslie milosevich santa clara avenue alameda c...
4,<26728895.1075860815046.JavaMail.evans@thyme>,('rtwait@graphicaljazz.com'),('rtwait@graphicaljazz.com'),2002-01-30 19:36:01,"Rini Twait\n1010 E 5th Ave\nLongmont, CO 80501...",rini twait e th ave longmont co rtwait graphic...


The dataset contains emails from Enron employees. The Enron scandal is a famous fraud case: Enron employees covered up the company's bad financial position, thereby keeping the stock price artificially high. Enron employees sold their own stock options, and when the truth came out, Enron investors were left with nothing. The goal is to find all emails that mention specific words, such as "sell enron stock".

We will be working with the text column "clean_content".
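
As a minimal sketch of what "finding emails that mention specific words" can look like, the snippet below runs a plain substring search over a few invented strings that stand in for rows of `clean_content`. The example emails, the `search_terms` list, and the helper `mentions_any` are all assumptions made for illustration; they are not part of the notebook.

```python
# Toy stand-ins for rows of the 'clean_content' column (invented examples).
emails = [
    "please sell enron stock before the end of the quarter",
    "lunch meeting moved to friday",
    "enron stock price update attached",
]

# Hypothetical search terms; adjust to the phrases of interest.
search_terms = ["sell enron stock", "enron stock"]

def mentions_any(text, terms):
    """Return True if any search term occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)

# One boolean per email, plus the subset of emails that matched.
flags = [mentions_any(text, search_terms) for text in emails]
matches = [text for text, hit in zip(emails, flags) if hit]
print(flags)
```

On the real DataFrame, a comparable filter could be written with pandas, for example `df['clean_content'].str.contains('|'.join(search_terms), na=False)`.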

**Data Preprocessing**

In [None]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords 
import string
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer



In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
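# function to clean the text
# Note: relies on the globals `exclude` and `lemma`, which are defined in the cells below
def clean(text, stop):
    text = str(text).rstrip()
    # lowercase, then drop stopwords and pure-digit tokens
    stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
    # strip punctuation characters
    punc_free = ''.join(i for i in stop_free if i not in exclude)
    # lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(i) for i in punc_free.split())
    return normalized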

In [None]:
# Define stopwords to exclude
stop = set(stopwords.words('english'))
stop.update(("to", "cc", "subject", "http", "from", "sent", "ect", "u", "fwd", "www", "com", 'html'))

In [None]:
# Define punctuations to exclude 
exclude = set(string.punctuation)

In [None]:
# Define the lemmatizer from nltk
lemma = WordNetLemmatizer()

In [None]:
# apply the function to clean the text
text_clean=[]
for text in df['clean_content']:
    text_clean.append(clean(text, stop).split())

In [None]:
text_clean[0][:10]

['investools',
 'advisory',
 'free',
 'digest',
 'trusted',
 'investment',
 'advice',
 'unsubscribe',
 'free',
 'newsletter']

The data is now fully cleaned: the text has been split into words, stopwords and punctuation have been removed, and the words have been lemmatized.
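
To make the order of the cleaning steps concrete, here is a simplified, standard-library-only sketch. It is not the notebook's `clean` function: it uses a tiny invented stopword list (`toy_stop`) in place of NLTK's, and it skips lemmatization entirely.

```python
import string

# Tiny illustrative stopword list (an assumption; the notebook uses
# NLTK's English stopwords plus a few extra tokens).
toy_stop = {"the", "to", "a", "of", "and", "is", "from", "subject"}
punctuation = set(string.punctuation)

def clean_simple(text, stop):
    """Lowercase, drop stopwords and pure-digit tokens, strip punctuation, tokenize."""
    kept = [w for w in str(text).lower().split() if w not in stop and not w.isdigit()]
    no_punct = "".join(ch for ch in " ".join(kept) if ch not in punctuation)
    return no_punct.split()

tokens = clean_simple("Subject: Sell the 500 shares of Enron, today!", toy_stop)
print(tokens)
```

Note that `subject` survives in this example: stopwords are filtered before punctuation is stripped, so the token `subject:` does not match the stopword `subject`. The notebook's `clean` function applies the steps in the same order.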

**Topic modeling on fraud**

In [None]:
import gensim
from gensim import corpora



In [None]:
# Create a dictionary that maps each unique token to an integer id
dictionary = corpora.Dictionary(text_clean)

In [None]:
print(dictionary)

Dictionary(33980 unique tokens: ['account', 'accurate', 'acquiring', 'acre', 'address']...)


In [None]:
# Define the corpus - create bag of words
corpus = [dictionary.doc2bow(text) for text in text_clean]

In [None]:
corpus[0][:10]

[(0, 2),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 6),
 (6, 1),
 (7, 2),
 (8, 4),
 (9, 1)]
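
Each entry in the corpus is a list of `(token_id, count)` pairs. The sketch below reproduces that idea with the standard library only, so the structure is easy to inspect. The toy documents and the id assignment (ids in order of first appearance) are assumptions of this sketch; gensim's `Dictionary` may assign ids differently.

```python
from collections import Counter

# Two toy tokenized documents (invented for illustration).
docs = [
    ["enron", "stock", "sell", "stock"],
    ["meeting", "enron", "friday"],
]

# Assign an integer id to each unique token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow_corpus = [to_bow(doc, token2id) for doc in docs]
print(bow_corpus)
```

Word order is discarded in this representation; only how often each token occurs in a document is kept.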

**LDA model**

Using the dictionary and corpus, we are ready to discover which topics are present in the Enron emails. We will create the topics by using the LDA model. 

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Define the LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=5,random_state=1)

In [None]:
# Save the topics and top 5 words
topics = ldamodel.print_topics(num_words=5)

In [None]:
# Print the results
for topic in topics:
    print(topic)

(0, '0.031*"image" + 0.012*"e" + 0.008*"mail" + 0.008*"enron" + 0.007*"se"')
(1, '0.018*"enron" + 0.009*"company" + 0.005*"energy" + 0.005*"new" + 0.005*"stock"')
(2, '0.013*"message" + 0.012*"original" + 0.008*"market" + 0.007*"gas" + 0.007*"team"')
(3, '0.045*"td" + 0.033*"net" + 0.031*"money" + 0.029*"tr" + 0.027*"width"')
(4, '0.070*"enron" + 0.018*"hou" + 0.012*"development" + 0.011*"corp" + 0.010*"pm"')
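
Each printed topic is a pair of a topic id and a string of `weight*"word"` terms. If the terms are needed as data rather than text, one option is to parse the string, as in this small sketch. The helper `parse_topic` is hypothetical and written only for this illustration.

```python
import re

# One topic string copied from the output above.
topic_str = '0.018*"enron" + 0.009*"company" + 0.005*"energy" + 0.005*"new" + 0.005*"stock"'

def parse_topic(topic_str):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_str)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic(topic_str)
print(terms)
```

gensim also exposes the same information directly: `ldamodel.show_topic(topic_id)` returns a list of `(word, probability)` pairs, which avoids string parsing altogether.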


In [None]:
!pip install pyLDAvis==2.1.2

Collecting pyLDAvis==2.1.2
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97738 sha256=d309c5cc9626a2e1d29e2187d8bad9f5758553ba100a22d044176e72485c3a82
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
Successfully built pyLDAvis
Installing collected packages: pyLDAvis
  Found existing installation: pyLDAvis 3.3.1
    Uninstalling pyLDAvis-3.3.1:
      Successfully uninstalled pyLDAvis-3.3.1
Successfully installed pyLDAvis-2.1.2


In [None]:
# visualizing the topics
import pyLDAvis.gensim

In [None]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)

In [None]:
pyLDAvis.display(lda_display)

In [None]:
# get the topic details
def get_topic_details(ldamodel, corpus):
    rows = []
    for doc_topics in ldamodel[corpus]:
        if not doc_topics:
            continue
        # dominant topic = the (topic id, probability) pair with the highest probability
        topic_num, prop_topic = max(doc_topics, key=lambda x: x[1])
        rows.append((topic_num, prop_topic))
    # dtype=float keeps topic ids as floats (e.g. 4.0), as in the outputs shown below
    return pd.DataFrame(rows, columns=['Dominant_Topic', '% Score'], dtype=float)

In [None]:
# Run get_topic_details function and check the results
topic_details_df = get_topic_details(ldamodel, corpus)

In [None]:
topic_details_df.head()

Unnamed: 0,Dominant_Topic,% Score
0,1.0,0.999217
1,4.0,0.600867
2,4.0,0.63914
3,1.0,0.993458
4,1.0,0.993393


In [None]:
topic_details_df.tail()

Unnamed: 0,Dominant_Topic,% Score
2085,1.0,0.763446
2086,4.0,0.599738
2087,0.0,0.999324
2088,2.0,0.998145
2089,4.0,0.988398
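
The dominant topic of an email is simply the topic with the highest probability in that email's topic distribution. The sketch below shows this selection on hand-written distributions. The `doc_topics` values are invented; in the notebook the distributions come from `ldamodel[corpus]`.

```python
# Each inner list is one document's topic distribution as
# (topic_id, probability) pairs (invented values for illustration).
doc_topics = [
    [(0, 0.05), (1, 0.90), (4, 0.05)],
    [(1, 0.30), (4, 0.60), (2, 0.10)],
]

def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])

dominant = [dominant_topic(row) for row in doc_topics]
print(dominant)
```

A score close to 1, as in several rows above, means the model assigns almost the whole email to a single topic.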


In [None]:
# Add the original text to the topic details in a single dataframe
# (reuses topic_details_df instead of running the LDA inference a second time)
contents = pd.DataFrame({'Original text': df['clean_content']})
topic_details = pd.concat([topic_details_df, contents], axis=1)

In [None]:
topic_details.sort_values(by=['% Score'], ascending=False).head(10).head()

Unnamed: 0,Dominant_Topic,% Score,Original text
154,1.0,0.999957,joint venture enron meeting belies officers cl...
135,1.0,0.999953,lawyers agree order safeguard documents housto...
107,1.0,0.999907,sample articles original message schmidt ann s...
849,4.0,0.999877,original message received thu aug cdt ecthou g...
263,1.0,0.999807,original message cook mary sent thursday octob...


In [None]:
topic_details.sort_values(by=['% Score'], ascending=False).head(10).tail()

Unnamed: 0,Dominant_Topic,% Score,Original text
285,1.0,0.999802,original message vann suzanne sent wednesday o...
1782,1.0,0.99979,national journal congressdaily issue date octo...
271,1.0,0.999775,nancy sheed nancy sheed businessweek com pm pl...
149,1.0,0.999769,electricity trading builds oh slowly france do...
520,1.0,0.99975,venturewire monday november contents companies...


From the visualization above, Topic 4 seems to be more related to fraud cases than the other topics, as it includes words related to prices, stocks, and communication between employees.

In [None]:
# Create a flag for emails whose dominant topic is topic 4
topic_details['flag'] = np.where((topic_details['Dominant_Topic'] == 4.0), 1, 0)

In [None]:
topic_details_fraud = topic_details[topic_details.flag == 1]

In [None]:
pd.set_option('display.max_colwidth', 100)

In [None]:
topic_details_fraud.sort_values(by=['% Score'], ascending=False).head(10)

Unnamed: 0,Dominant_Topic,% Score,Original text,flag
849,4.0,0.999877,original message received thu aug cdt ecthou gt corp enron com following addresses permanent fat...,1
1411,4.0,0.999135,inline attachment follows scasey tfsbrokers com enron mckay jonathan enron ou na cn recipients c...,1
1138,4.0,0.998678,greg great time million club lavo dave forwarded john j lavorato corp enron rosalee fleming pm j...,1
974,4.0,0.998223,forwarded vince j kaminski hou ect pm shirley crenshaw vince j kaminski hou ect ect stinson gibn...,1
1769,4.0,0.99802,please make sure knows asst forwarded jeffrey shankman hou ect tracy l arthur kenneth lay corp e...,1
769,4.0,0.99789,congratulations mark haedicke michael guerriero enron andrea calo sa enron enron carolina waingo...,1
1084,4.0,0.997863,forwarded steven j kean na enron pm katherine brown james bannantine enron development enron dev...,1
1044,4.0,0.997862,forwarded steven j kean na enron pm katherine brown james bannantine enron development enron dev...,1
1749,4.0,0.997672,please check master section changes ss forwarded sara shackleton hou ect pm brent hendry enron d...,1
921,4.0,0.99752,fyi vince forwarded vince j kaminski hou ect pm cathy phillips mark frevert na enron enron mike ...,1
