# Explore Enron Data

Starter code for exploring the Enron dataset (emails + finances);
loads up the dataset (pickled dict of dicts).

The dataset has the form:
enron_data["LASTNAME FIRSTNAME MIDDLEINITIAL"] = { features_dict }

{features_dict} is a dictionary of features associated with that person.
You should explore features_dict as part of the mini-project,
but here's an example to get you started:

enron_data["SKILLING JEFFREY K"]["bonus"] = 5600000

In [76]:
import pickle

# Open in binary mode and close the file handle once the pickle is loaded.
with open("../final_project/final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)
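
To make the dict-of-dicts layout concrete, here is a minimal sketch that builds a toy dataset of the same shape, round-trips it through pickle, and reads one feature. The record `DOE JANE Q`, her values, and the temporary file name are hypothetical; only the Skilling bonus comes from the example above. `print` is written in function-call form so the snippet also runs on Python 3.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real dataset: person name -> features dict.
# "DOE JANE Q" and her values are hypothetical.
toy_data = {
    "SKILLING JEFFREY K": {"bonus": 5600000},
    "DOE JANE Q": {"bonus": "NaN"},
}

# Write and re-read the pickle in binary mode ("wb" / "rb").
path = os.path.join(tempfile.mkdtemp(), "toy_dataset.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_data, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# Outer key selects the person, inner key selects the feature.
skilling_bonus = loaded["SKILLING JEFFREY K"]["bonus"]
print(skilling_bonus)
```

Note that a missing value in this sketch is the string `"NaN"`, not a float, which matches how the cells below test for it.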

#### How many data points (people) are in the dataset?

In [60]:
print "There are " + str(len(enron_data)) + " records in the data set."

There are 146 records in the data set.


#### For each person, how many features are available?

In [61]:
print "There are " + str(len(next(iter(enron_data.values())))) + " features for each record."

There are 21 features for each record.


#### How many POIs are there in the E+F dataset?

In [83]:
pois = {p: enron_data[p] for p in enron_data if enron_data[p]['poi']}
print "There are " + str(len(pois)) + " people classified as a 'person of interest'."

There are 18 people classified as a 'person of interest'.


#### How many POIs were there total?

There are 35 POIs in total, according to ../final_project/poi_names.txt.

#### As you can see, we have many of the POIs in our E+F dataset, but not all of them. Why is that a potential problem?

Only 18 of the 35 known POIs, a little over half, carry a POI label in our dataset. If we use POI as the target label for a learning algorithm, the labels cannot be fully trusted as ground truth, which makes evaluation ambiguous: when the model flags a person whose label is wrong, we cannot tell whether it found a real POI or produced a false positive. The inverse also holds: when the label is wrong, an apparent true negative may actually be a false negative.
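
As a quick check on the "a little over half" figure, the coverage can be worked out from the two counts reported above. This is only arithmetic on those numbers; the variable names are illustrative.

```python
pois_in_dataset = 18  # POIs found in the E+F dataset (counted above)
pois_total = 35       # POIs listed in poi_names.txt

coverage = pois_in_dataset / float(pois_total)
missing = pois_total - pois_in_dataset

print("%d of %d POIs are in the dataset (%.1f%%); %d are missing."
      % (pois_in_dataset, pois_total, coverage * 100, missing))
```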

#### What is the total value of the stock belonging to James Prentice?

In [63]:
jp = enron_data['PRENTICE JAMES']
print "James Prentice has a total stock value of $" + str(jp['total_stock_value']) + "."

James Prentice has a total stock value of $1095040.


#### How many email messages do we have from Wesley Colwell to persons of interest?

In [64]:
wc = enron_data['COLWELL WESLEY']
print "Wesley Colwell sent " + str(wc['from_this_person_to_poi']) + " emails to people of interest."

Wesley Colwell sent 11 emails to people of interest.


#### What’s the value of stock options exercised by Jeffrey Skilling?

In [65]:
js = enron_data['SKILLING JEFFREY K']
print "Jeffrey Skilling exercised stock options worth $" + str(js['exercised_stock_options']) + "."

Jeffrey Skilling exercised stock options worth $19250000.


#### Of these three individuals (Lay, Skilling and Fastow), who took home the most money (largest value of “total_payments” feature)? 

In [67]:
leaders = {p: enron_data[p] for p in enron_data if p in ['SKILLING JEFFREY K', 'LAY KENNETH L', 'FASTOW ANDREW S']}
leaders_total_payments = [leaders[p]['total_payments'] for p in leaders]
leader_with_largest_payout = [p for p in leaders if leaders[p]['total_payments'] == max(leaders_total_payments)][0]
print leader_with_largest_payout + " received the largest payout."

LAY KENNETH L received the largest payout.


#### How much money did that person get?

In [69]:
print leader_with_largest_payout + " received $" + str(leaders[leader_with_largest_payout]['total_payments']) + " as his payout."

LAY KENNETH L received $103559793 as his payout.
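
The same lookup can be written more compactly with `max` and a `key` function. The sketch below uses hypothetical names and amounts, and it first drops any `"NaN"` placeholder: on Python 2 a string compares as larger than any integer, so an unfiltered `max` could silently pick a missing value, while on Python 3 the comparison raises a `TypeError`.

```python
# Hypothetical names and payment figures, for illustration only.
toy_leaders = {
    "PERSON A": {"total_payments": 100},
    "PERSON B": {"total_payments": 250},
    "PERSON C": {"total_payments": "NaN"},
}

# Keep only people whose total_payments is actually known.
known = {
    name: record["total_payments"]
    for name, record in toy_leaders.items()
    if record["total_payments"] != "NaN"
}

# max over the names, ranked by their payment amount.
top_name = max(known, key=known.get)
top_amount = known[top_name]

print(top_name + " received the largest payout: $" + str(top_amount))
```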


#### How many folks in this dataset have a quantified salary? What about a known email address?

In [77]:
people_with_quantified_salary = [p for p in enron_data if enron_data[p]['salary'] != 'NaN']
people_with_quantified_email  = [p for p in enron_data if enron_data[p]['email_address'] != 'NaN']

print "There are " + str(len(people_with_quantified_salary)) + " people with quantified salaries."
print "There are " + str(len(people_with_quantified_email)) + " people with known email addresses."

There are 95 people with quantified salaries.
There are 111 people with known email addresses.
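
Since the same "not NaN" test is repeated for every feature, it could be wrapped in a small helper. The function name `count_known` and the toy records below are hypothetical; the sketch assumes, as the cell above does, that missing values are stored as the string `"NaN"`.

```python
def count_known(data, feature):
    """Count records whose `feature` value is not the placeholder string 'NaN'."""
    return sum(1 for record in data.values() if record[feature] != "NaN")


# Hypothetical records, for illustration only.
toy_data = {
    "PERSON A": {"salary": 100000, "email_address": "a@example.com"},
    "PERSON B": {"salary": "NaN", "email_address": "b@example.com"},
    "PERSON C": {"salary": "NaN", "email_address": "NaN"},
}

known_salaries = count_known(toy_data, "salary")
known_emails = count_known(toy_data, "email_address")

print("%d with a known salary, %d with a known email address"
      % (known_salaries, known_emails))
```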


#### How many people in the E+F dataset (as it currently exists) have “NaN” for their total payments? What percentage of people in the dataset as a whole is this?

In [80]:
people_without_quantified_total_payments = [p for p in enron_data if enron_data[p]['total_payments'] == 'NaN']
percent_of_people_without_quantified_total_payments = (len(people_without_quantified_total_payments) / float(len(enron_data))) * 100

print "There are " + str(len(people_without_quantified_total_payments)) + " people without quantified total payments." 
print "This constitutes "+ str(percent_of_people_without_quantified_total_payments) + "% of the total records."

There are 21 people without quantified total payments.
This constitutes 14.3835616438% of the total records.


#### How many POIs in the E+F dataset have “NaN” for their total payments? What percentage of POIs as a whole is this?

In [84]:
pois_without_quantified_total_payments = [p for p in pois if pois[p]['total_payments'] == 'NaN']
percent_of_pois_without_quantified_total_payments = (len(pois_without_quantified_total_payments) / float(len(pois))) * 100

print "There are " + str(len(pois_without_quantified_total_payments)) + " POIs without quantified total payments." 
print "This constitutes "+ str(percent_of_pois_without_quantified_total_payments) + "% of the total POIs."

There are 0 POIs without quantified total payments.
This constitutes 0.0% of the total POIs.


#### If you added in, say, 10 more data points which were all POIs, and put “NaN” for the total payments for those folks, the numbers you just calculated would change.

#### What is the new number of people in the dataset? What is the new number of folks with “NaN” for total payments?

In [85]:
print "There would now be " + str(len(enron_data) + 10) + " records in the data set."
print "There would now be " + str(len(people_without_quantified_total_payments) + 10) + " people without quantified total payments."

There would now be 156 records in the data set.
There would now be 31 people without quantified total payments.


#### What is the new number of POIs in the dataset? What percentage of them have “NaN” for their total payments?

In [87]:
new_percent_of_pois_without_quantified_total_payments = (10 / float(len(pois) + 10)) * 100
print "There are now " + str(len(pois) + 10) + " POIs in the data set."
print "The 10 new POIs without quantified total payments constitute " + str(new_percent_of_pois_without_quantified_total_payments) + "% of the new total POIs."

There are now 28 POIs in the data set.
The 10 new POIs without quantified total payments constitute 35.7142857143% of the new total POIs.
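
To see the effect of the hypothetical 10 extra POIs in one place, the sketch below recomputes the "NaN" rates before and after the addition, using only the counts reported above. The helper name `percent` is illustrative.

```python
def percent(part, whole):
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


n_people, n_people_nan = 146, 21  # all records, and those with NaN total payments
n_pois, n_pois_nan = 18, 0        # POIs, and POIs with NaN total payments
added = 10                        # hypothetical new POIs, all with NaN total payments

before_all = percent(n_people_nan, n_people)
after_all = percent(n_people_nan + added, n_people + added)
before_poi = percent(n_pois_nan, n_pois)
after_poi = percent(n_pois_nan + added, n_pois + added)

print("All records: %.2f%% -> %.2f%%" % (before_all, after_all))
print("POIs only:   %.2f%% -> %.2f%%" % (before_poi, after_poi))
```

The overall rate moves only a few points, while the rate among POIs jumps from zero to over a third. One reading is that a "NaN" in total payments could then start to look like a signal for POI status, even though it would only reflect how the new records were added.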
