# Sourcing and Transforming Data

### Connect to workspace

In [None]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file

ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

## Work with a Datastore

In Azure ML, *datastores* are references to storage locations, such as Azure Storage blob containers. Every workspace has a default datastore - usually the Azure storage blob container that was created with the workspace. If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.

### View Datastores

Run the following code to list the datastores in your workspace and see which one is the default:

In [None]:
# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)

You can also view and manage datastores in your workspace on the Datastores page for your workspace in [Azure ML Studio](https://ml.azure.com).

### Upload Data to a Datastore

Now that you have determined the available datastores, you can upload files from your local file system to a datastore so that it will be accessible to experiments running in the workspace, regardless of where the experiment script is actually being run.

In [None]:
default_ds.upload_files(files=['./data/flight_delays_data.csv'], # Upload the flight delays CSV file from ./data
                       target_path='data/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

## Work with Datasets

While you can read data directly from datastores, Azure Machine Learning provides a further abstraction for data in the form of *datasets*. A dataset is a versioned reference to a specific set of data that you may want to use in an experiment. Datasets can be *tabular* or *file*-based.

### Create and Register Tabular Dataset

Let's create a dataset from the flight delays data you uploaded to the datastore. In this case, the data is in a structured format in a CSV file, so we'll use a *tabular* dataset.


Once the dataset that references the flight delays data has been created, you can register it to make it easily accessible to any experiment run in the workspace.

We'll register the tabular dataset as **flight_delays_data**.

In [None]:
from azureml.core import Dataset

default_ds = ws.get_default_datastore()

if 'flight_delays_data' not in ws.datasets:
    # Create a tabular dataset from the path on the datastore (this may take a short while).
    # The path must match the file uploaded earlier to the 'data/' folder.
    csv_path = [(default_ds, 'data/flight_delays_data.csv')]
    tab_data_set = Dataset.Tabular.from_delimited_files(path=csv_path)

    # Register the tabular dataset
    try:
        tab_data_set = tab_data_set.register(workspace=ws, 
                                name='flight_delays_data',
                                description='flight delays data',
                                tags = {'format':'CSV'},
                                create_new_version=True)
        print('Dataset registered.')
    except Exception as ex:
        print(ex)
else:
    print('Dataset already registered.')

Get the **flight_delays_data** dataset and display the first 20 rows to examine its content.

In [None]:
# Get the training dataset
dataset = ws.datasets.get('flight_delays_data')
dataset = dataset.to_pandas_dataframe()
dataset.head(20)

Let's look at summary statistics for the available features.

In [None]:
dataset.describe()

Displaying the column types and non-null counts of the dataset helps us decide which columns need to be engineered.

In [None]:
dataset.info()

### Feature Engineering

Feature engineering here consists of removing target leakers (columns that reveal the outcome we are trying to predict) and features that are not useful to our hypothesis.
We will then make sure the remaining columns (features) have the right data types for the algorithm used for the prediction.

In [None]:
# Get the training dataset
dataset = ws.datasets.get('flight_delays_data')
dataset = dataset.to_pandas_dataframe().dropna()

# Remove target leakers and features that are not useful
target_leakers = ['DepDel15','ArrDelay','Cancelled','Year']
dataset.drop(columns=target_leakers, inplace=True)

# convert some columns to categorical features
columns_as_categorical = ['OriginAirportID','DestAirportID','ArrDel15']
dataset[columns_as_categorical] = dataset[columns_as_categorical].astype('object')

# LabelEncoder and OneHotEncoder are meant for categorical features,
# so first extract the categorical features using a boolean mask.
categorical_feature_mask = dataset.dtypes == object 
categorical_cols = dataset.columns[categorical_feature_mask].tolist()
categorical_cols

`LabelEncoder` converts each distinct class in a feature to an integer code.

Let's go through the steps to see how to do it.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Apply LabelEncoder to each categorical column (the same encoder is refit for every column)
dataset[categorical_cols] = dataset[categorical_cols].apply(lambda col: le.fit_transform(col))
dataset[categorical_cols].head(10)
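
To make the encoding concrete, here is a minimal pure-Python sketch of the idea behind label encoding: each distinct value is given an integer code, assigned in sorted order. It is an illustration only, not scikit-learn's implementation. The helper names `fit_label_mapping` and `encode_column` are hypothetical, and the airport IDs are made up.

```python
# Minimal pure-Python sketch of label encoding (not scikit-learn's implementation).
# fit_label_mapping and encode_column are hypothetical helper names.

def fit_label_mapping(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    return {value: code for code, value in enumerate(classes)}


def encode_column(values, mapping):
    """Replace every value with its integer code."""
    return [mapping[value] for value in values]


# Made-up airport IDs, used only for illustration
columns = {
    "OriginAirportID": [13930, 10397, 13930, 12478],
    "DestAirportID": [10397, 12478, 12478, 13930],
}

# Keep one mapping per column so that the codes can be decoded later
mappings = {name: fit_label_mapping(values) for name, values in columns.items()}
encoded = {name: encode_column(values, mappings[name]) for name, values in columns.items()}

print(encoded["OriginAirportID"])  # [2, 0, 2, 1]
print(encoded["DestAirportID"])    # [0, 1, 1, 2]
```

The sketch keeps a separate mapping for every column. In the cell above, a single `LabelEncoder` instance is refit for each column, so afterwards it only holds the classes of the last column it saw. If you need to decode the values later, keeping one encoder per column is one way to do so.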

In [None]:
# Drop any remaining null values and preview the result
dataset.dropna(inplace=True)
dataset.head(20)

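Split the data into training and test sets based on the `Month` column: rows with a `Month` value below 9 go to the training set, and rows with a value of 9 or higher go to the test set.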

In [None]:
# Rows with Month < 9 form the training set; the remaining rows form the test set
train_ds, test_ds = dataset.loc[dataset['Month'] < 9], dataset.loc[dataset['Month'] >= 9]
train_count = len(train_ds)
test_count = len(test_ds)
print('Test data percentage:', (test_count / (test_count + train_count)) * 100)
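
The split above is based on time rather than on random sampling, which keeps later months out of the training data. As a rough, self-contained sketch of the same logic without pandas, the example below splits a list of month values at a cutoff and reports the test share. The helper `split_by_month` and the month values are hypothetical and only serve to illustrate the arithmetic.

```python
# Hypothetical helper illustrating a month-based split; the month values are made up.

def split_by_month(months, first_test_month=9):
    """Return (train, test): months before the cutoff train, the rest test."""
    train = [m for m in months if m < first_test_month]
    test = [m for m in months if m >= first_test_month]
    return train, test


months = [4, 5, 6, 7, 8, 9, 10, 4, 8, 10]
train, test = split_by_month(months)

# Share of rows held out for testing, as a percentage
test_pct = 100 * len(test) / len(months)
print(f"Train rows: {len(train)}, test rows: {len(test)}, test share: {test_pct:.1f}%")
```

The resulting test share depends entirely on how many rows each month contributes, so it is worth checking the printed percentage before training.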