### Calculating Errors

Here are two datasets that represent two of the examples you have seen in this lesson.  

One dataset is based on the parachute example, and the second is based on the judicial example.  Neither of these datasets are based on real people.

Use the questions below to assist in answering the quiz questions at the bottom of this page.

In [48]:
import numpy as np
import pandas as pd

jud_data = pd.read_csv('./data/judicial_dataset_predictions.csv')
par_data = pd.read_csv('./data/parachute_dataset.csv')

In [49]:
jud_data.head()

Unnamed: 0,defendant_id,actual,predicted
0,22574,innocent,innocent
1,35637,innocent,innocent
2,39919,innocent,innocent
3,29610,guilty,guilty
4,38273,innocent,innocent


In [50]:
par_data.head()

Unnamed: 0,parachute_id,actual,predicted
0,3956,opens,opens
1,2147,opens,opens
2,2024,opens,opens
3,8325,opens,opens
4,6598,opens,opens


`1.` Above, you can see the actual and predicted columns for each of the datasets.  Using the **jud_data**, find the proportion of errors for the dataset, and furthermore, the percentage of errors of each type.  Use the results to answer the questions in quiz 1 below.  

**Hint for quiz:** an error is any time the prediction doesn't match an actual value.  Additionally, there are Type I and Type II errors to think about.  We also know we can minimize one type of error by maximizing the other type of error.  If we predict all individuals as innocent, how many of the guilty are incorrectly labeled?  Similarly, if we predict all individuals as guilty, how many of the innocent are incorrectly labeled?

In [51]:
# error proportions
nb_defendants = len(jud_data)
raw_error = jud_data[jud_data['actual'] != jud_data['predicted']].count()
raw_error_proportion = raw_error / nb_defendants
guilty_whn_innocent = jud_data[(jud_data['actual'] == 'innocent') & (jud_data['predicted'] == 'guilty')].count()
innocent_whn_guilty = jud_data[(jud_data['actual'] == 'guilty') & (jud_data['predicted'] == 'innocent')].count()
type1_error_proportion = guilty_whn_innocent / nb_defendants
type2_error_proportion = innocent_whn_guilty / nb_defendants
raw_error_proportion, type1_error_proportion, type2_error_proportion

(defendant_id    0.042153
 actual          0.042153
 predicted       0.042153
 dtype: float64, defendant_id    0.00151
 actual          0.00151
 predicted       0.00151
 dtype: float64, defendant_id    0.040643
 actual          0.040643
 predicted       0.040643
 dtype: float64)

In [52]:
# changing base assumptions
# all guilty
guilty_whn_innocent2 = jud_data[(jud_data['actual'] == 'innocent')].count()
innocent_whn_guilty2 = jud_data[(jud_data['actual'] == 'guilty') & (jud_data['actual'] == 'innocent')].count()
type1_error_proportion2 = guilty_whn_innocent2 / nb_defendants
type2_error_proportion2 = innocent_whn_guilty2 / nb_defendants
type1_error_proportion2, type2_error_proportion2

(defendant_id    0.4516
 actual          0.4516
 predicted       0.4516
 dtype: float64, defendant_id    0.0
 actual          0.0
 predicted       0.0
 dtype: float64)

`2.` Using the **par_data**, find the proportion of errors for the dataset, and furthermore, the percentage of errors of each type.  Use the results to answer the questions in quiz 2 below.

These should be very similar operations to those you performed in the previous question.

In [53]:
# error proportions
nb_parachutes = len(par_data)
raw_error = par_data[par_data['actual'] != par_data['predicted']].count()
raw_error_proportion = raw_error / nb_parachutes
fails_whn_ok = par_data[(par_data['actual'] == 'fails') & (par_data['predicted'] == 'opens')].count()
opens_whn_bad = par_data[(par_data['actual'] == 'opens') & (par_data['predicted'] == 'fails')].count()
type1_error_proportion = fails_whn_ok / nb_parachutes
type2_error_proportion = opens_whn_bad / nb_parachutes
raw_error_proportion, type1_error_proportion, type2_error_proportion

(parachute_id    0.039973
 actual          0.039973
 predicted       0.039973
 dtype: float64, parachute_id    0.000172
 actual          0.000172
 predicted       0.000172
 dtype: float64, parachute_id    0.039801
 actual          0.039801
 predicted       0.039801
 dtype: float64)

In [54]:
# changing base assumptions
# all fails
fails_whn_ok2 = par_data[(par_data['actual'] == 'opens')].count()
opens_whn_bad2 = par_data[(par_data['actual'] == 'fails') & (jud_data['actual'] == 'opens')].count()
type1_error_proportion2 = fails_whn_ok2 / nb_defendants
type2_error_proportion2 = opens_whn_bad2 / nb_defendants
type1_error_proportion2, type2_error_proportion2



(parachute_id    0.793766
 actual          0.793766
 predicted       0.793766
 dtype: float64, parachute_id    0.0
 actual          0.0
 predicted       0.0
 dtype: float64)