# Datasets and Questions
## Objective
Use machine learning algorithms to identify Persons-of-Interest (POIs) in the Enron email corpus.

## What is a Person-of-Interest (POI)?
Someone who:
* Was indicted
* Settled without admitting guilt
* Testified in exchange for immunity

## Mini-Project
### Question: How many people are in the dataset?

In [1]:
import pickle

# Open the pickle in a context manager so the file handle is closed afterwards.
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)

print('There are {} people in the dataset.'.format(len(enron_data)))

There are 146 people in the dataset.


In [2]:
num_features = max(len(x) for x in enron_data.values())

print('There are {} features in the dataset.'.format(num_features))

There are 21 features in the dataset.
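
The cell above reports only the largest feature count. A minimal sketch, assuming every record is a plain dict of feature name to value as in `enron_data`, shows one way to list the feature names and check that all records share the same keys. The `toy_data` records and the helper names `feature_names` and `has_uniform_features` are made up for illustration and are not part of the project code.

```python
# Toy stand-in for enron_data: person name -> dict of features.
# These names and values are hypothetical, not taken from the real dataset.
toy_data = {
    'DOE JANE': {'salary': 100000, 'poi': False,
                 'email_address': 'jane.doe@example.com'},
    'ROE RICHARD': {'salary': 'NaN', 'poi': True,
                    'email_address': 'NaN'},
}


def feature_names(dataset):
    """Return the sorted union of feature names across all records."""
    names = set()
    for features in dataset.values():
        names.update(features)  # iterating a dict yields its keys
    return sorted(names)


def has_uniform_features(dataset):
    """Return True if every record has exactly the same feature keys."""
    key_sets = {frozenset(features) for features in dataset.values()}
    return len(key_sets) <= 1


print(feature_names(toy_data))
print(has_uniform_features(toy_data))
```

If `has_uniform_features` returned `False` on a dataset, the `max` in the cell above would hide records that are missing features entirely.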


In [3]:
num_pois = sum([x['poi'] == 1 for x in enron_data.values()])

print('There are {} POIs in the dataset.'.format(num_pois))

There are 18 POIs in the dataset.


In [4]:
pois = []
with open('../final_project/poi_names.txt') as f:
    for line in f:
        if line.startswith('('):
            pois.append(line[4:].rstrip().upper().replace(',', ''))

for name, features in enron_data.items():
    if features['poi'] == 1:
        # Treat two names as the same person if they share two or more
        # name components (for example, last name and first name).
        found = any(
            len(set(poi.split()) & set(name.split())) >= 2
            for poi in pois)
        if not found:
            pois.append(name)

print('There are {} POIs.'.format(len(pois)))

There are 35 POIs.
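
The matching rule in the cell above (two or more shared name components) can be isolated into a small helper to see how it behaves. This is a sketch of that heuristic only, not a robust name matcher. The function name `same_person`, the `min_shared` parameter, and the `SMITH` names are hypothetical and chosen for illustration.

```python
def same_person(name_a, name_b, min_shared=2):
    """Heuristic: two names match if they share at least `min_shared` tokens.

    Names are upper-cased and stripped of commas first, mirroring the
    normalization applied to the lines read from poi_names.txt.
    """
    tokens_a = set(name_a.upper().replace(',', '').split())
    tokens_b = set(name_b.upper().replace(',', '').split())
    return len(tokens_a & tokens_b) >= min_shared


print(same_person('Lay, Kenneth', 'LAY KENNETH L'))       # True
print(same_person('Lay, Kenneth', 'SKILLING JEFFREY K'))  # False

# Caveat: a middle initial counts as a token, so two different people who
# share a last name and an initial are merged (hypothetical names).
print(same_person('SMITH JOHN K', 'SMITH JANE K'))        # True
```

The last call shows why the resulting total should be read as an estimate: the heuristic can merge distinct people, and it can also miss the same person when only one name component is spelled identically in both sources.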


In [5]:
print("James Prentice's total stock value is: ${}".format(
    enron_data['PRENTICE JAMES']['total_stock_value']))

James Prentice's total stock value is: $1095040


In [6]:
print('Wesley Colwell sent {} email messages to other POIs.'.format(
    enron_data['COLWELL WESLEY']['from_this_person_to_poi']))

Wesley Colwell sent 11 email messages to other POIs.


In [7]:
print('Jeff Skilling exercised ${} worth of options.'.format(
    enron_data['SKILLING JEFFREY K']['exercised_stock_options']))

Jeff Skilling exercised $19250000 worth of options.


In [8]:
execs = ['LAY KENNETH L', 'SKILLING JEFFREY K', 'FASTOW ANDREW S']
execs = [(x, enron_data[x]['total_payments']) for x in execs]
execs.sort(key=lambda x: x[1], reverse=True)
print('{exec[0]} took home the most with ${exec[1]}'.format(exec=execs[0]))

LAY KENNETH L took home the most with $103559793


In [9]:
salaries = [
    x['salary'] for x in enron_data.values() if x['salary'] != 'NaN']

print('There are {} quantified salaries out of {}.'.format(
    len(salaries),
    len(enron_data)))

There are 95 quantified salaries out of 146.


In [10]:
email_addresses = [
    x['email_address'] for x in enron_data.values() 
    if x['email_address'] != 'NaN']

print('There are {} email addresses.'.format(len(email_addresses)))

There are 111 email addresses.


In [11]:
# Count the records whose total_payments is the string placeholder 'NaN'.
num_nan_payments = sum(
    x['total_payments'] == 'NaN' for x in enron_data.values())

as_percentage = num_nan_payments / len(enron_data)

print(num_nan_payments, 'people are missing total_payments information.')
print('{:.2%} of the total people in the dataset.'.format(as_percentage))

21 people are missing total_payments information.
14.38% of the total people in the dataset.
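
The last few cells each count the `'NaN'` placeholder for a single feature. A minimal sketch, assuming missing values are always stored as the string `'NaN'` as in the cells above, counts them for every feature in one pass. The helper name `missing_counts` and the `toy_data` records are hypothetical.

```python
from collections import Counter


def missing_counts(dataset, missing='NaN'):
    """Count, per feature, how many records hold the `missing` placeholder."""
    counts = Counter()
    for features in dataset.values():
        for key, value in features.items():
            if value == missing:
                counts[key] += 1
    return counts


# Toy stand-in for enron_data; names and values are made up.
toy_data = {
    'DOE JANE': {'salary': 100000, 'total_payments': 250000,
                 'email_address': 'NaN'},
    'ROE RICHARD': {'salary': 'NaN', 'total_payments': 'NaN',
                    'email_address': 'NaN'},
    'POE EDGAR': {'salary': 'NaN', 'total_payments': 50000,
                  'email_address': 'edgar.poe@example.com'},
}

counts = missing_counts(toy_data)
for feature, n in counts.most_common():
    print('{}: {} missing ({:.2%})'.format(feature, n, n / len(toy_data)))
```

Features with no missing values do not appear in the `Counter` at all, but looking one up still returns `0`.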


In [12]:
# Count the POIs whose total_payments is the string placeholder 'NaN'.
num_poi_nan_payments = sum(
    x['total_payments'] == 'NaN' and x['poi'] == 1
    for x in enron_data.values())

poi_as_percentage = num_poi_nan_payments / num_pois

print(num_poi_nan_payments, 'POIs are missing total_payments information.')
print('{:.2%} of the total POIs in the dataset.'.format(poi_as_percentage))

0 POIs are missing total_payments information.
0.00% of the total POIs in the dataset.
