# Identifying Fraud at Enron Using Emails and Financial Data

## Project Introduction

In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The resulting federal investigation placed a significant amount of typically confidential information into the public record, including tens of thousands of emails and detailed financial data for top executives.
For this project, predictive models were built in Python using the scikit-learn, NumPy, and pandas libraries. The prediction targets were persons of interest (POIs), defined as ‘individuals who were indicted, reached a settlement, or plea deal with the government, or testified in exchange for prosecution immunity.’ Financial compensation data and aggregate email statistics from the Enron Corpus were used as features for prediction.

The goal of this project is to build a prediction model that identifies persons of interest (POIs).

**Importing Necessary Libraries**

In [30]:
import sys
import pickle
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline

In [31]:
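# pickle is already imported above; the context manager closes the file after loading
with open("C:/Users/Geekquad/ud120-projects/final_project/final_project_dataset_modified_unix.pkl", "rb") as data_file:
    enron_data = pickle.load(data_file)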

## Understanding the Dataset

### Data Exploration 

This section covers the most important characteristics of the dataset and uses them to inform the analysis.

**Important characteristics include:**
- Size of the Enron Dataset
- Features in the Enron Dataset
- Finding POIs in the Enron Data
- Queries of the Dataset
- Follow the Money
- Dealing with Unfilled Features
- Missing POIs
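
As a minimal sketch of the "Dealing with Unfilled Features" item: the dataset stores missing values as the string `'NaN'`, so counting unfilled features amounts to scanning each record for that marker. The toy dictionary `sample_data` and the helper `count_unfilled` are hypothetical stand-ins for illustration, not part of the original project code.

```python
from collections import Counter

# Hypothetical toy records shaped like entries of enron_data.
# Missing values are marked with the string 'NaN', as in the real dataset.
sample_data = {
    "PERSON A": {"salary": 250000, "bonus": "NaN", "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": True},
    "PERSON C": {"salary": 180000, "bonus": 50000, "poi": False},
}


def count_unfilled(data, missing_marker="NaN"):
    """Return a Counter mapping feature name -> number of records missing it."""
    counts = Counter()
    for record in data.values():
        for feature, value in record.items():
            if value == missing_marker:
                counts[feature] += 1
    return counts


unfilled = count_unfilled(sample_data)
# Features listed from most to least frequently missing
print(unfilled.most_common())
```

Calling `count_unfilled(enron_data)` on the real dictionary should work the same way, provided every record is a flat feature dictionary like the ones above.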

In [50]:
print('Number of people in the Enron dataset: {0}'.format(len(enron_data)))

Number of people in the Enron dataset: 143


In [51]:
# Convert the data dictionary to a pandas DataFrame
df = pd.DataFrame.from_records(list(enron_data.values()))
persons = pd.Series(list(enron_data.keys()))
print(persons.head())
df.head()

0          METTS MARK
1       BAXTER JOHN C
2      ELLIOTT STEVEN
3    CORDES WILLIAM R
4      HANNON KEVIN P
dtype: object


      bonus deferral_payments deferred_income director_fees  \
0    600000               NaN             NaN           NaN   
1   1200000           1295738        -1386055           NaN   
2    350000               NaN         -400729           NaN   
3       NaN               NaN             NaN           NaN   
4   1500000               NaN        -3117011           NaN   

In [55]:
pois = [x for x, y in enron_data.items() if y['poi']]
print('Number of POIs (persons of interest): {0}'.format(len(pois)))

Number of POIs (persons of interest): 16
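
The two counts reported so far (143 people, 16 POIs) imply a strongly imbalanced target. The short calculation below is only an illustrative sketch that hard-codes those two numbers; the variable names are hypothetical.

```python
# Counts taken from the notebook outputs above
n_people = 143
n_pois = 16

poi_share = n_pois / n_people
non_poi_share = 1 - poi_share

print("POI share: {:.1%}".format(poi_share))
print("Non-POI share: {:.1%}".format(non_poi_share))
```

With roughly one POI for every eight non-POIs, a model that always predicts "not a POI" would still score a high accuracy, so precision and recall are likely to be more informative than accuracy alone.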


In [56]:
print(df.shape)  # a bare df.shape is only displayed when it is the last expression in the cell
df.info()

(143, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143 entries, 0 to 142
Data columns (total 21 columns):
bonus                        143 non-null object
deferral_payments            143 non-null object
deferred_income              143 non-null object
director_fees                143 non-null object
email_address                143 non-null object
exercised_stock_options      143 non-null object
expenses                     143 non-null object
from_messages                143 non-null object
from_poi_to_this_person      143 non-null object
from_this_person_to_poi      143 non-null object
loan_advances                143 non-null object
long_term_incentive          143 non-null object
other                        143 non-null object
poi                          143 non-null bool
restricted_stock             143 non-null object
restricted_stock_deferred    143 non-null object
salary                       143 non-null object
shared_receipt_with_poi      143 non-null object
to_messages    

In [60]:
# Names of all the people in the dataset
enron_data.keys()

dict_keys(['METTS MARK', 'BAXTER JOHN C', 'ELLIOTT STEVEN', 'CORDES WILLIAM R', 'HANNON KEVIN P', 'MORDAUNT KRISTINA M', 'MEYER ROCKFORD G', 'MCMAHON JEFFREY', 'HAEDICKE MARK E', 'PIPER GREGORY F', 'HUMPHREY GENE E', 'NOLES JAMES L', 'BLACHMAN JEREMY M', 'SUNDE MARTIN', 'GIBBS DANA R', 'LOWRY CHARLES P', 'COLWELL WESLEY', 'MULLER MARK S', 'JACKSON CHARLENE R', 'WESTFAHL RICHARD K', 'WALTERS GARETH W', 'WALLS JR ROBERT H', 'KITCHEN LOUISE', 'CHAN RONNIE', 'BELFER ROBERT', 'SHANKMAN JEFFREY A', 'WODRASKA JOHN', 'BERGSIEKER RICHARD P', 'URQUHART JOHN A', 'BIBI PHILIPPE A', 'RIEKER PAULA H', 'WHALEY DAVID A', 'BECK SALLY W', 'HAUG DAVID L', 'ECHOLS JOHN B', 'MENDELSOHN JOHN', 'HICKERSON GARY J', 'CLINE KENNETH W', 'LEWIS RICHARD', 'HAYES ROBERT E', 'KOPPER MICHAEL J', 'LEFF DANIEL P', 'LAVORATO JOHN J', 'BERBERIAN DAVID', 'DETMERING TIMOTHY J', 'WAKEHAM JOHN', 'POWERS WILLIAM', 'GOLD JOSEPH', 'BANNANTINE JAMES M', 'DUNCAN JOHN H', 'SHAPIRO RICHARD S', 'SHERRIFF JOHN R', 'SHELBY REX', 'LEMA

In [61]:
### Queries of the Dataset ###

In [66]:
enron_data['PRENTICE JAMES']

{'bonus': 'NaN',
 'deferral_payments': 564348,
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'james.prentice@enron.com',
 'exercised_stock_options': 886231,
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 208809,
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 564348,
 'total_stock_value': 1095040}

In [67]:
enron_data['PRENTICE JAMES']['total_stock_value']

1095040

In [68]:
enron_data['COLWELL WESLEY']['from_this_person_to_poi']

11

In [75]:
features_list = ['poi', 'salary', 'to_messages', 'deferral_payments', 'total_payments', 
                 'loan_advances', 'bonus', 'restricted_stock_deferred', 
                 'deferred_income', 'total_stock_value', 'expenses', 'from_poi_to_this_person', 
                 'exercised_stock_options', 'from_messages', 'other', 'from_this_person_to_poi', 
                 'long_term_incentive', 'shared_receipt_with_poi', 'restricted_stock', 'director_fees'] 

f = open('C:/Users/Geekquad/ud120-projects/final_project/poi_names.txt', 'r')
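
For the "Missing POIs" item, one option is to count the names in `poi_names.txt` and compare that with the number of POIs found in the dataset. The sketch below is hedged: it assumes a file layout of one header line, a blank line, and then one `(y) Last, First` or `(n) Last, First` line per person, and it reads from an in-memory stand-in with made-up names instead of the real file. The helper `read_poi_names` is hypothetical.

```python
import io

# Hypothetical stand-in for the contents of poi_names.txt (assumed layout,
# made-up names): a header line, a blank line, then one marked name per line.
sample_file = io.StringIO(
    "http://example.com/source-article\n"
    "\n"
    "(y) Doe, Jane\n"
    "(n) Roe, Richard\n"
    "(y) Poe, Alex\n"
)


def read_poi_names(handle):
    """Collect the names from lines that start with a (y) or (n) marker."""
    names = []
    for line in handle:
        line = line.strip()
        if line.startswith(("(y)", "(n)")):
            # Drop the three-character marker and surrounding whitespace
            names.append(line[3:].strip())
    return names


poi_names = read_poi_names(sample_file)
print("Names listed in the file: {0}".format(len(poi_names)))
```

If the real file follows the same layout, passing the open file handle to `read_poi_names` and comparing the result with `len(pois)` would show how many known POIs are absent from the dataset.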


The features in the data fall into three major types, namely financial features, email features and POI labels.

financial features: ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees'] (all units are in US dollars)

email features: ['to_messages', 'email_address', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi'] (units are generally numbers of email messages; the notable exception is 'email_address', which is a text string)

POI label: ['poi'] (boolean, represented as an integer)
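
Because missing values are stored as the string `'NaN'`, the financial and email columns load as `object` dtype rather than as numbers. The following is a minimal sketch of one way to get numeric columns and an integer POI label. It uses a hypothetical toy frame `toy_df` rather than the real data, and the cleaning steps are an assumption, not necessarily what the original project does.

```python
import pandas as pd

# Hypothetical toy frame mimicking the raw data: 'NaN' strings mark missing values
toy_df = pd.DataFrame({
    "salary": [250000, "NaN", 180000],
    "to_messages": ["NaN", 1200, 300],
    "email_address": ["a@enron.com", "NaN", "c@enron.com"],
    "poi": [False, True, False],
})

clean_df = toy_df.copy()

# Every column except the text address and the label should be numeric
numeric_cols = [c for c in clean_df.columns if c not in ("email_address", "poi")]
for col in numeric_cols:
    # errors="coerce" turns anything unparseable into a real NaN
    clean_df[col] = pd.to_numeric(clean_df[col], errors="coerce")

# Turn the 'NaN' placeholder in the text column into a real missing value
clean_df["email_address"] = clean_df["email_address"].where(
    clean_df["email_address"] != "NaN"
)

# Represent the boolean POI label as an integer (0 or 1)
clean_df["poi"] = clean_df["poi"].astype(int)

print(clean_df.dtypes)
```

After this step, `clean_df.isna().sum()` gives the number of unfilled values per column directly.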

I will start with all of the features, then filter them and keep the most useful ones.
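
The text does not say which filtering method is used. As one possible sketch, scikit-learn's `SelectKBest` with the `f_classif` score ranks features by a univariate ANOVA F-test and keeps the top `k`. The data below is synthetic and the feature names are hypothetical; nothing here is drawn from the Enron dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): one feature that tracks
# the label closely and two features of pure noise
rng = np.random.RandomState(42)
n_samples = 200
labels = rng.randint(0, 2, size=n_samples)
informative = labels * 3.0 + rng.normal(size=n_samples)
noise_a = rng.normal(size=n_samples)
noise_b = rng.normal(size=n_samples)

X = np.column_stack([informative, noise_a, noise_b])
feature_names = ["informative", "noise_a", "noise_b"]

# Keep the single feature with the highest F-score
selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, labels)

selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
print("Scores:", dict(zip(feature_names, selector.scores_)))
print("Selected:", selected)
```

On the real data, `f_classif` would need the missing values handled first (for example, filled with zeros), since it does not accept NaN inputs.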