Our goal is to correctly label budget line items by training a supervised model to predict the probability of each
possible label, taking the most probable label as the correct one.
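
As a rough illustration of "taking the most probable label", the sketch below picks the highest-probability label for each row. The label names and probabilities are made up for the example; they are not taken from the training data.

```python
import numpy as np

# Hypothetical label names and predicted probabilities for two line items.
labels = np.array(["Teacher", "Substitute", "Custodian"])
probas = np.array([
    [0.70, 0.20, 0.10],
    [0.15, 0.25, 0.60],
])

# For each row, pick the label with the highest predicted probability.
predicted = labels[probas.argmax(axis=1)]
print(predicted)
```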

First, we will explore the data.

The training data is stored in a file called "TrainingData.csv".

### Discover Data

In [None]:
import pandas as pd

df = pd.read_csv("TrainingData.csv")

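# Only the last expression in a cell is displayed automatically,
# so print the intermediate results explicitly
print(df.describe())
df.info()  # info() prints its summary directly
print(df.head())
print(df.tail())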

### Summarizing the data

In [None]:
# Print the summary statistics
print(df.describe())

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Create the histogram
plt.hist(df['FTE'].dropna())

# Add title and labels
plt.title('Distribution of %full-time \n employee works')
plt.xlabel('% of full-time')
plt.ylabel('num employees')

# Display the histogram
plt.show()

### Exploring the datatypes and converting datatypes

In [None]:
# Look at the data type of each column
print(df.dtypes)

# Count how many columns there are of each data type
print(df.dtypes.value_counts())

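!!  We call value_counts() on df.dtypes, which is a Series with one entry per column, so the result counts columns per data type. Calling df.value_counts() on the DataFrame itself does something different: in pandas 1.1 and later it counts unique rows, and in earlier versions the method did not exist for DataFrames.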
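
A minimal sketch of the same call on a toy frame may make this clearer. The frame `toy` and its values are hypothetical stand-ins, not the real training data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data.
toy = pd.DataFrame({
    "FTE": [1.0, 0.5, None],
    "Total": [100.0, 250.5, 75.0],
    "Function": ["Teacher", "Aide", "Teacher"],
})

# toy.dtypes is a Series mapping column name -> dtype,
# so value_counts() tallies how many columns share each dtype.
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```

Here two columns are floating point and one holds text, so the result has two entries.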

The dataset has 9 label columns. Each is a categorical variable that can take many possible values. The names of these 9 columns have been loaded into a list called LABELS.

Every column in this list is stored with the object datatype. Because the category datatype is much more memory-efficient for columns whose values repeat, we convert the labels to category types using the .astype() method.
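
To get a feel for the efficiency gain, the sketch below compares the memory used by the same column stored as object and as category. The column contents are invented for illustration; the actual savings on the training data will depend on how many distinct values each label has.

```python
import pandas as pd

# Hypothetical label column: few distinct values repeated many times.
as_object = pd.Series(["Teacher", "Aide", "Custodian"] * 10_000, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the Python string objects behind the object column.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The category version stores each distinct string once plus a small integer code per row, which is why it comes out smaller.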

In [None]:
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

# Print the converted dtypes
print(df[LABELS].dtypes)

In [None]:
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Calculate number of unique values for each label: num_unique_labels
# pandas provides pd.Series.nunique for counting the number of unique values in a Series
num_unique_labels = df[LABELS].apply(pd.Series.nunique)

# Plot number of unique values for each label
num_unique_labels.plot(kind='bar')

# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')

# Display the plot
plt.show()

### Measuring the success of the model with log loss (loss function)

Log Loss provides a steep penalty for predictions that are both wrong and confident, i.e., a high probability is assigned to the incorrect class.

Suppose you have the following 3 examples:

A:y=1,p=0.85
B:y=0,p=0.99
C:y=0,p=0.51

Here y indicates whether the example was classified correctly, and p is the predicted probability.

The log loss (penalty score) ordering, from highest to lowest, is:
B (wrong and highly confident) > C (wrong but barely confident) > A (correct and confident)
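
The ordering can be checked by hand with the standard binary log loss formula, -(y·log(p) + (1-y)·log(1-p)). The sketch below is only a worked example under the assumption that p is the probability assigned to y = 1; the helper name `single_log_loss` is made up for this illustration.

```python
import math

# (y, p) pairs from the three examples above, read as: y is the true
# outcome and p is the probability assigned to y = 1 (an assumption).
examples = {"A": (1, 0.85), "B": (0, 0.99), "C": (0, 0.51)}

def single_log_loss(y, p):
    """Log loss of one binary prediction: -(y*log(p) + (1-y)*log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

losses = {name: single_log_loss(y, p) for name, (y, p) in examples.items()}

# Print from the largest penalty to the smallest.
for name, loss in sorted(losses.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {loss:.3f}")
```

B's loss is -log(0.01), far larger than C's -log(0.49) or A's -log(0.85), which matches the ordering B > C > A.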

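We use a helper function, compute_log_loss(), implemented with NumPy, to calculate the score. It is not part of NumPy itself; it takes the predicted probabilities first and the actual labels second.

Five one-dimensional numeric arrays simulating different types of predictions have been pre-loaded: 
actual_labels, correct_confident, correct_not_confident, wrong_not_confident, and wrong_confident.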
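
Since neither the function nor the arrays are defined in this notebook, here is a minimal sketch of what they might look like, so that the next cell can be run. The clipping constant `eps` and all array values are assumptions chosen for illustration; the real pre-loaded arrays may differ.

```python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Mean binary log loss between predicted probabilities and 0/1 labels."""
    # Clip so that log(0) never occurs for predictions of exactly 0 or 1.
    predicted = np.clip(predicted, eps, 1 - eps)
    return -np.mean(actual * np.log(predicted)
                    + (1 - actual) * np.log(1 - predicted))

# Illustrative stand-ins for the pre-loaded arrays (values are assumptions).
actual_labels = np.array([1., 1., 1., 1., 1., 0., 0., 0., 0., 0.])
correct_confident = np.array([0.95] * 5 + [0.05] * 5)
correct_not_confident = np.array([0.65] * 5 + [0.35] * 5)
wrong_not_confident = np.array([0.35] * 5 + [0.65] * 5)
wrong_confident = np.array([0.05] * 5 + [0.95] * 5)
```

With these stand-ins, the score should grow from correct_confident through to wrong_confident.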

In [None]:
# Compute and print log loss for correct_confident
correct_confident_score = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_score)) 

# Compute and print log loss for correct_not_confident
correct_not_confident_score = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_score)) 

# Compute and print log loss for wrong_not_confident
wrong_not_confident_score = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_score)) 

# Compute and print log loss for wrong_confident
wrong_confident_score = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_score)) 