&nbsp;
&nbsp;

# Welcome to Feature Factory for Telstra Network Disruptions

Feature Factory is an online infrastructure for quickly prototyping and testing features for different machine learning problems.

Before you begin using Feature Factory, we highly recommend that you familiarize yourself with IPython Notebook. IPython Notebook provides an interactive Python kernel that lets you run code in separate cells. Variables created by the code live in the kernel and can be accessed at any time, by any cell. More information can be found at http://ipython.org/notebook.html

# Creating your own IPython Notebook

To get started with Feature Factory, please clone the Template notebook. To do this, click "File"->"Make a Copy". This should spawn a new tab within your browser with the copied notebook. Rename the notebook to your liking and make all edits on that notebook.

&nbsp;
&nbsp;


# Telstra Network Disruptions Machine Learning Competition

## Problem Statement

In this competition, you are challenged to predict the severity of service disruptions on Telstra's network. Using a dataset of features from their service logs, you're tasked with predicting if a disruption is a momentary glitch or a total interruption of connectivity.

## Data

The dataset is in a relational format, split among multiple files. When you use **commands.get_sample_dataset()** to retrieve the dataset, the files are provided as a list of *pandas.DataFrame* objects.

The tables below describe the fields of each file, and the step-by-step example that follows shows how to load and use them.

### Event Type Data


|       Data Fields       | Definition |
|-------------------------|------------|
|ID                   | identifies a unique location-time point|
|Event_Type                | type of event that occurred at that ID (there can be multiple events per ID)|

### Log Feature Data

|       Data Fields       | Definition |
|-------------------------|------------|
|ID                   | identifies a unique location-time point|
|Log_Feature               | type of feature logged for that ID|
|Volume              | number of times the feature was logged for that ID|


### Resource Type Data

|       Data Fields       | Definition |
|-------------------------|------------|
|ID                   | identifies a unique location-time point|
|Resource_Type              | type of resource associated with that ID|

### Severity Type Data

|       Data Fields       | Definition |
|-------------------------|------------|
|ID                   | identifies a unique location-time point|
|Severity_Type                | type of severity level logged for that ID|

### Training Data

|       Data Fields       | Definition |
|-------------------------|------------|
|ID                   | identifies a unique location-time point|
|Location               | identifier of location|
|Fault_Severity               | categorical. 0: no fault, 1: a few faults, 2: many faults|
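
Because the files share an `id` key and some of them hold several rows per `id`, joining them repeats rows. The sketch below illustrates this with small hand-made frames (not the real competition data). It assumes the lowercase column names that appear in the sample output of the step-by-step example; the values are invented.

```python
import pandas as pd

# Toy stand-ins for two of the files; the values are made up for illustration.
event_type = pd.DataFrame({
    "id": [1, 1, 2],
    "event_type": ["event_type 11", "event_type 15", "event_type 15"],
})
train = pd.DataFrame({
    "id": [2, 1],
    "location": ["location 5", "location 7"],
})

# An inner join on id repeats each training row once per matching event row.
joined = pd.merge(train, event_type, on="id", how="inner")

# Count how many joined rows each id produced, in order of first appearance.
rows_per_id = joined.groupby("id", sort=False).size()

print(joined)
print(rows_per_id)
```

Any feature built from such a join therefore needs an aggregation step to get back to one row per `id`.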


&nbsp;
&nbsp;

## Step-by-Step Example

Step 1: Import the feature factory infrastructure

In [1]:
from problems.telstra import commands

&nbsp;
&nbsp;

Step 2: Create a username and password, or log in to an existing account. If you create an account successfully, you don't need to log in afterwards; you are logged in automatically.

In [3]:
commands.create_user('stair_2', 'mazdaa8351')

username already exists


In [2]:
commands.login('stair_2', 'mazdaa8351')

user successfully logged in


&nbsp;
&nbsp;

Step 3: To ensure that this notebook is mapped to your username, you must execute the command below.

In [3]:
commands.add_notebook('TND_FESM02_stair_2')

Notebook already registered


&nbsp;
&nbsp;

Step 4: Get a sample dataset. This will allow you to test your feature before running it on the full data on the server. Remember that the dataset is returned as a list of [Pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

In [4]:
dataset = commands.get_sample_dataset()
# dataset[0] <- this refers to the event type data
# dataset[1] <- this refers to the log feature data
# dataset[2] <- this refers to the resource type data
# dataset[3] <- this refers to the severity type data
# dataset[4] <- this refers to the training data

In [5]:
dataset[0][:5]

Unnamed: 0,id,event_type
0,6597,event_type 11
1,8011,event_type 15
2,2597,event_type 15
3,5022,event_type 15
4,5022,event_type 11


In [8]:
dataset[1][:5]

Unnamed: 0,id,log_feature,volume
0,6597,feature 68,6
1,8011,feature 68,7
2,2597,feature 68,1
3,5022,feature 172,2
4,5022,feature 56,1


In [9]:
dataset[2][:5]

Unnamed: 0,id,resource_type
0,6597,resource_type 8
1,8011,resource_type 8
2,2597,resource_type 8
3,5022,resource_type 8
4,6852,resource_type 8


In [10]:
dataset[3][:5]

Unnamed: 0,id,severity_type
0,6597,severity_type 2
1,8011,severity_type 2
2,2597,severity_type 2
3,5022,severity_type 1
4,6852,severity_type 1


In [11]:
dataset[4][:5]

Unnamed: 0,id,location
0,2425,location 812
1,7034,location 1050
2,2097,location 805
3,9754,location 114
4,10538,location 1023
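
Before writing a feature, it can help to check which tables hold more than one row per `id`. The helper below is a hypothetical convenience, not part of Feature Factory; it is shown on toy frames that mimic a few of the sample rows above, and in practice you would pass it the entries of `dataset`.

```python
import pandas as pd

def summarize_tables(frames, names):
    # Hypothetical helper: build one summary row per table.
    rows = []
    for name, frame in zip(names, frames):
        rows.append({
            "table": name,
            "rows": len(frame),
            "unique_ids": frame["id"].nunique(),
        })
    return pd.DataFrame(rows, columns=["table", "rows", "unique_ids"])

# Toy frames standing in for two entries of the dataset list.
event_type = pd.DataFrame({
    "id": [5022, 5022, 6597],
    "event_type": ["event_type 15", "event_type 11", "event_type 11"],
})
severity_type = pd.DataFrame({
    "id": [5022, 6597],
    "severity_type": ["severity_type 1", "severity_type 2"],
})

summary = summarize_tables([event_type, severity_type],
                           ["event_type", "severity_type"])
print(summary)
```

A table whose `rows` count exceeds its `unique_ids` count is one-to-many and has to be aggregated before it can contribute one feature row per `id`.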


&nbsp;
&nbsp;

Step 5: Define your feature extraction function.

The name you give to the function is the name that will be used later on to register your feature extraction function and the score it obtains.

Your function should simply take in the dataset list as a parameter and output an N x M numpy matrix or pandas dataframe, where N is the number of ids in the training data and M is the number of features that will be used for the prediction.
Bear in mind that sorting is important: in order to evaluate your function's score properly, the extracted features should preserve the order that the ids have in the training data.
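
One way to satisfy the ordering requirement is to aggregate each table to one row per `id` and then reindex the result against the training data. The following is a minimal sketch on invented toy frames, assuming the lowercase `id` and `volume` columns shown in the sample output; the feature name `total_volume` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frames; ids and volumes are invented.
train = pd.DataFrame({"id": [30, 10, 20]})
log_feature = pd.DataFrame({
    "id": [10, 10, 20, 30],
    "volume": [4, 6, 1, 2],
})

# Aggregate to one value per id (groupby sorts the ids by default).
per_id = log_feature.groupby("id")["volume"].sum()

# Reorder the rows so they follow the order of ids in the training data.
features = per_id.reindex(train["id"]).to_frame("total_volume")

print(features)
```

An `id` with no rows in the aggregated table would come out as NaN after `reindex`, so a real feature may also need a fill value.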

Also note that, even though the system allows you to do so, any feature extraction function which makes use of the outcome column will be disqualified.

**WARNING:** Your functions have to be self-contained!

This means that you can use helper functions or import external modules, but any import or variable definition must be made within the functions that use them.

Cross validation is (intentionally) run in a separate process to make sure that this scoping pattern is preserved, and it will fail if the function uses anything defined elsewhere in the notebook.

You might be wondering why we require this. The reason is that the code of your function might be executed and further evaluated in different environments where the variables and modules defined in your notebook will not be available.
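
As an illustration of the self-contained pattern, the sketch below keeps its import and its helper inside the feature function. The function name, the helper, and the two volume features are hypothetical; the list positions follow the index comments in Step 4, and the column names follow the sample output.

```python
def volume_stats_example(dataset):
    # Hypothetical feature function: the import and the helper are defined
    # inside it, so nothing from the surrounding notebook is required.
    import pandas as pd

    def per_id(frame, column, how):
        # Collapse a one-to-many table to a single value per id.
        return frame.groupby("id")[column].agg(how)

    log_feature, train = dataset[1], dataset[4]
    features = pd.DataFrame({
        "volume_sum": per_id(log_feature, "volume", "sum"),
        "volume_max": per_id(log_feature, "volume", "max"),
    })
    # Return the rows in the same order as the training ids.
    return features.reindex(train["id"])


# Caller-side check on toy data; the function does not rely on this import.
import pandas as pd

toy_dataset = [
    None,                                                 # event type (unused here)
    pd.DataFrame({"id": [1, 1, 2], "volume": [3, 5, 2]}), # log feature
    None,                                                 # resource type (unused here)
    None,                                                 # severity type (unused here)
    pd.DataFrame({"id": [2, 1]}),                         # training data
]
result = volume_stats_example(toy_dataset)
print(result)
```

Moving the import or the helper outside the function would still work in the notebook, but by the rule above it would fail when the function is run in a separate process.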

In [196]:
def TND_FEOHE01_stair_2(dataset):
    # Feature extraction using a simple one-hot encoder for Telstra Network Disruptions
    import pandas as pd
    from sklearn import preprocessing

    # Work on copies so the caller's DataFrames are left untouched
    event, log, resource, severity, train = [d.copy() for d in dataset[:5]]

    # Replace each categorical column with integer codes (0, 1, 2, ...)
    severity['st'] = severity['severity_type'].astype('category').cat.codes
    resource['rt'] = resource['resource_type'].astype('category').cat.codes
    log['lf'] = log['log_feature'].astype('category').cat.codes
    event['et'] = event['event_type'].astype('category').cat.codes

    # Join the encoded columns onto the training ids, keeping the training order
    df = pd.merge(train[['id']], severity[['id', 'st']], on='id', how='inner')
    df = pd.merge(df, resource[['id', 'rt']], on='id', how='inner')
    df = pd.merge(df, log[['id', 'lf', 'volume']], on='id', how='inner')
    df = pd.merge(df, event[['id', 'et']], on='id', how='inner')

    # One-hot encode the four code columns; .toarray() densifies the sparse result
    enc = preprocessing.OneHotEncoder()
    encoded = enc.fit_transform(df[['st', 'rt', 'lf', 'et']].values).toarray()
    df_enc = pd.DataFrame(data=encoded)
    df_enc['id'] = df['id']

    # Collapse to one row per id (median), preserving first-appearance order
    df_enc = df_enc.groupby(['id'], sort=False).median()
    df_enc['vol'] = df.groupby(['id'], sort=False)['volume'].median()

    return df_enc

In [197]:
Fdf = TND_FEOHE01_stair_2(dataset)
commands.cross_validate(TND_FEOHE01_stair_2)

Obtaining dataset
Extracting features
Cross validating


0.62633952888594679

&nbsp;
&nbsp;

Step 6: Evaluate the score of your feature extraction function before submitting it.

You can use the cross_validate command as many times as needed to preview what the score of your function will be.

In [199]:
commands.cross_validate(TND_FEOHE01_stair_2)

Obtaining dataset
Extracting features
Cross validating


0.62633952888594679

&nbsp;
&nbsp;

Step 7: Register your function in the system

Once you are satisfied with the results, you can call the add_feature command, passing your function as an argument.
This will cross-validate the function again and store your code and your score for future analysis.

Again, remember that your function code must be self-contained and must import or define everything it needs to run successfully.

In [200]:
commands.add_feature(TND_FEOHE01_stair_2)

Obtaining dataset
Extracting features
Cross validating
Your feature TND_FEOHE01_stair_2 scored 0.6263395288859468
Feature TND_FEOHE01_stair_2 successfully registered


&nbsp;
&nbsp;

Step 8: (Optional) Modify and update your function code.

If you discover that your function can be improved, you can add it to the system again, as many times as required, under the same function name.

However, for clarity, we recommend using this option only to fix problems or make small improvements within a similar approach.

So, if you want to start a different feature extraction strategy, we strongly recommend registering it under a new name.