# Data Preparation and Feature Engineering

Data is critical to training and preparing a model. In this notebook we will cover how to load data into ML.NET and ensure it is in the proper format so that ML.NET can work with it. 

In this notebook you will learn how to... 

- Load data into ML.NET
- Apply data transforms to help ML.NET understand the data

## Loading data in ML.NET

### What is an IDataView?

The [IDataView](https://docs.microsoft.com/dotnet/api/microsoft.ml.idataview?view=ml-dotnet) is the data format ML.NET loads for training. It is a set of interfaces and components that provide efficient, compositional processing of schematized data for machine learning and advanced analytics applications. It is designed to gracefully and efficiently handle high dimensional data and large data sets. 

The IDataView has general schema support, in that a view can have an arbitrary number of columns, each having an associated name, index, data type, and optional annotation.
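As a rough illustration of that schema support, the sketch below builds a tiny IDataView from an in-memory array and prints each column's index, name, and type. The `SchemaDemoRow` type and the variable names are hypothetical and exist only for this example.

```csharp
#r "nuget: Microsoft.ML, 1.7.1"

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type, used only for this illustration.
public class SchemaDemoRow
{
    public string Name { get; set; }
    public float Value { get; set; }
}

var schemaDemoContext = new MLContext();

// Build a tiny IDataView from an in-memory array.
IDataView schemaDemoView = schemaDemoContext.Data.LoadFromEnumerable(new[]
{
    new SchemaDemoRow { Name = "a", Value = 1.0f },
    new SchemaDemoRow { Name = "b", Value = 2.5f },
});

// Each schema column exposes its index, name, data type, and annotations.
foreach (DataViewSchema.Column column in schemaDemoView.Schema)
{
    Console.WriteLine($"{column.Index}: {column.Name} ({column.Type})");
}
```

The same loop works on any IDataView, including the ones loaded from files later in this notebook.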

### How to create an IDataView

You can create an IDataView by using any of the methods for loading data:

- TextLoader
- LoadFromEnumerable
- DatabaseLoader
- LoadFromTextFile

See [Load Data from files and other sources](https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/load-data-ml-net) for further documentation and examples. 

In [None]:
#r "nuget: Microsoft.ML, 1.7.1"

In [None]:
using Microsoft.ML;
using Microsoft.ML.Data;

#### Loading from a file

A [TextLoader](https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.textloader?view=ml-dotnet) can load a structured file into the IDataView format. Structured data is organized into rows and columns. The loader needs to be told whether the file has a header and which separator character the file uses. 

Using the loader instead of the direct method `LoadFromTextFile` gives you the option to load from multiple files. 

In [None]:
public class ModelInput
{
    [LoadColumn(0)]
    [ColumnName(@"vendor_id")]
    public string Vendor_id { get; set; }

    [LoadColumn(1)]
    [ColumnName(@"rate_code")]
    public float Rate_code { get; set; }

    [LoadColumn(2)]
    [ColumnName(@"passenger_count")]
    public float Passenger_count { get; set; }

    [LoadColumn(3)]
    [ColumnName(@"trip_time_in_secs")]
    public float Trip_time_in_secs { get; set; }

    [LoadColumn(4)]
    [ColumnName(@"trip_distance")]
    public float Trip_distance { get; set; }

    [LoadColumn(5)]
    [ColumnName(@"payment_type")]
    public string Payment_type { get; set; }

    [LoadColumn(6)]
    [ColumnName(@"fare_amount")]
    public float Fare_amount { get; set; }

}

In [None]:
//Create MLContext
MLContext mlContext = new MLContext();

// Create TextLoader based on the Model Input type. 
TextLoader textLoader = mlContext.Data.CreateTextLoader<ModelInput>(separatorChar: ',', hasHeader: true);

// Load the data into an IDataView. The Load() method can take multiple files. 
// The files must have the same separator character, header, column names, etc. 
IDataView data = textLoader.Load("data/taxi-fare.csv");

data.Preview(1); 

CanShuffle,Schema
False,"[ { vendor_id: String: Name: vendor_id, Index: 0, IsHidden: False, Type: { String: RawType: System.ReadOnlyMemory<System.Char> }, Annotations: { : Schema: [ ] } }, { rate_code: Single: Name: rate_code, Index: 1, IsHidden: False, Type: { Single: RawType: System.Single }, Annotations: { : Schema: [ ] } }, { passenger_count: Single: Name: passenger_count, Index: 2, IsHidden: False, Type: { Single: RawType: System.Single }, Annotations: { : Schema: [ ] } }, { trip_time_in_secs: Single: Name: trip_time_in_secs, Index: 3, IsHidden: False, Type: { Single: RawType: System.Single }, Annotations: { : Schema: [ ] } }, { trip_distance: Single: Name: trip_distance, Index: 4, IsHidden: False, Type: { Single: RawType: System.Single }, Annotations: { : Schema: [ ] } }, { payment_type: String: Name: payment_type, Index: 5, IsHidden: False, Type: { String: RawType: System.ReadOnlyMemory<System.Char> }, Annotations: { : Schema: [ ] } }, { fare_amount: Single: Name: fare_amount, Index: 6, IsHidden: False, Type: { Single: RawType: System.Single }, Annotations: { : Schema: [ ] } } ]"
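To make the difference between the direct method and the reusable loader concrete, here is a minimal sketch that loads one file with `LoadFromTextFile` and two files with a single `TextLoader`. The `TripSample` type, the temporary folder, and the two small CSV files it writes are all hypothetical; they stand in for real data files with the same layout.

```csharp
#r "nuget: Microsoft.ML, 1.7.1"

using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical two-column row type, used only for this illustration.
public class TripSample
{
    [LoadColumn(0)]
    public string VendorId { get; set; }

    [LoadColumn(1)]
    public float TripDistance { get; set; }
}

// Write two small CSV files with an identical layout to a temporary folder.
string sampleDir = Path.Combine(Path.GetTempPath(), "mlnet-loader-demo");
Directory.CreateDirectory(sampleDir);
string firstFile = Path.Combine(sampleDir, "trips-1.csv");
string secondFile = Path.Combine(sampleDir, "trips-2.csv");
File.WriteAllText(firstFile, "vendor_id,trip_distance\nCMT,3.8\nCMT,1.5\n");
File.WriteAllText(secondFile, "vendor_id,trip_distance\nVTS,2.1\nVTS,0.9\n");

var loaderContext = new MLContext();

// Direct method: one call, one file.
IDataView singleFileView = loaderContext.Data.LoadFromTextFile<TripSample>(
    firstFile, separatorChar: ',', hasHeader: true);

// Reusable loader: the same settings applied to several files at once.
TextLoader sampleLoader = loaderContext.Data.CreateTextLoader<TripSample>(
    separatorChar: ',', hasHeader: true);
IDataView multiFileView = sampleLoader.Load(firstFile, secondFile);

// Materialize the rows to count them.
int singleCount = loaderContext.Data
    .CreateEnumerable<TripSample>(singleFileView, reuseRowObject: false).Count();
int multiCount = loaderContext.Data
    .CreateEnumerable<TripSample>(multiFileView, reuseRowObject: false).Count();

Console.WriteLine($"Rows from one file: {singleCount}, rows from both files: {multiCount}");
```

Because IDataView is evaluated lazily, the files are only read when the rows are enumerated, which is why the counts are computed through `CreateEnumerable`.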


#### Loading from an in-memory collection

ML.NET supports loading data from an in-memory collection. This makes it easy to load data from a JSON or XML file using C#. Learn how to [deserialize JSON with C#](https://docs.microsoft.com/en-us/dotnet/standard/serialization/system-text-json-how-to?pivots=dotnet-6-0#how-to-read-json-as-net-objects-deserialize) or use the [XML serializer](https://docs.microsoft.com/en-us/dotnet/api/system.xml.serialization?view=net-6.0) to get those files into memory. 

Once you have the data collection in memory, you can load it into ML.NET with the `LoadFromEnumerable` method. 

In [None]:
ModelInput[] inMemoryCollection = new ModelInput[]
{
    new ModelInput
    {
        Vendor_id = "CMT",
        Rate_code = 1,
        Passenger_count = 1,
        Trip_time_in_secs = 1271,
        Trip_distance = 3.8f,
        Payment_type = "CRD",
        Fare_amount = 17.5f,
    },
    new ModelInput
    {
        Vendor_id = "CMT",
        Rate_code = 1,
        Passenger_count = 1,
        Trip_time_in_secs = 474,
        Trip_distance = 1.5f,
        Payment_type = "CRD",
        Fare_amount = 8, 
    }
};

In [None]:
// Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromEnumerable<ModelInput>(inMemoryCollection);

data.Preview(1);
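The JSON route mentioned earlier might look like the following sketch. It assumes a hypothetical `JsonTrip` type whose property names match the JSON keys exactly, and it uses an inline JSON string in place of the contents of a real file.

```csharp
#r "nuget: Microsoft.ML, 1.7.1"

using System;
using System.Collections.Generic;
using System.Text.Json;
using Microsoft.ML;

// Hypothetical row type whose property names match the JSON keys below.
public class JsonTrip
{
    public string VendorId { get; set; }
    public float TripDistance { get; set; }
    public float FareAmount { get; set; }
}

// Inline JSON stands in for text read from a file, e.g. with File.ReadAllText.
string tripsJson = @"[
  { ""VendorId"": ""CMT"", ""TripDistance"": 3.8, ""FareAmount"": 17.5 },
  { ""VendorId"": ""CMT"", ""TripDistance"": 1.5, ""FareAmount"": 8.0 }
]";

// Deserialize into an in-memory list of .NET objects.
List<JsonTrip> jsonTrips = JsonSerializer.Deserialize<List<JsonTrip>>(tripsJson);

// Any IEnumerable<T> can be handed to LoadFromEnumerable.
var jsonContext = new MLContext();
IDataView jsonView = jsonContext.Data.LoadFromEnumerable(jsonTrips);

Console.WriteLine($"Loaded {jsonTrips.Count} rows with {jsonView.Schema.Count} columns.");
```

System.Text.Json matches property names case-sensitively by default, so keys that differ in casing from the C# properties would need `JsonSerializerOptions.PropertyNameCaseInsensitive` or `[JsonPropertyName]` attributes.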

### What's the difference between a DataFrame and IDataView?

You may have heard of the [DataFrame](https://docs.microsoft.com/en-us/dotnet/api/microsoft.data.analysis.dataframe?view=ml-dotnet-preview) type. It is another tool for loading, viewing, and manipulating data, and it is commonly used in notebooks. It implements IDataView, so it can easily be passed to ML.NET.

DataFrame and IDataView are similar in that both represent data in a tabular format and let you apply transformations to it. Some key differences:

- DataFrame only supports loading delimited files.
- DataFrame runs in memory, so you're limited by the amount of memory on your PC.

The DataFrame is recommended for tasks like exploratory data analysis on a sample of your data. See the reference notebook REF - Data Processing for an example of using DataFrames to manipulate a data file for training.

IDataView is recommended for training on larger datasets, and it is what we will use here for training examples. 

## Data Transformations

ML.NET supports a variety of data transformations that will convert data into the required format and help you make corrections to your data. Some common operations are manipulating columns, normalizing values, replacing missing values, converting values, and more. 

Look for more information on [data transformations available](https://docs.microsoft.com/en-us/dotnet/machine-learning/resources/transforms). 

Below are a few common transformations. 

### One Hot Encoding 

One hot encoding is an important step for data containing strings. ML algorithms require numerical data; they don't know how to process a string. The vendor_id and payment_type columns are categorical: vendor_id can be "CMT" or "VTS", and payment_type can be "CRD" (credit) or "CSH" (cash). One hot encoding takes the string values passed in and converts them into numerical data.

In [None]:
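// OutputKind is nested in OneHotEncodingEstimator (Microsoft.ML.Transforms namespace),
// so it is fully qualified here.
var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(
    new[] { new InputOutputColumnPair(@"vendor_id", @"vendor_id"), 
    new InputOutputColumnPair(@"payment_type", @"payment_type")},
    outputKind: Microsoft.ML.Transforms.OneHotEncodingEstimator.OutputKind.Binary); 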
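To see what the encoding produces, the sketch below fits a one hot encoder on a three-row in-memory sample and prints the resulting vectors. The `VendorRow` and `EncodedVendorRow` types are hypothetical. The example uses `OutputKind.Indicator`, which gives one slot per category and is easier to read than the `Binary` output kind used in the pipeline cell, which instead encodes the category index as binary digits.

```csharp
#r "nuget: Microsoft.ML, 1.7.1"

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms;

// Hypothetical input and output row types, used only for this illustration.
public class VendorRow
{
    public string VendorId { get; set; }
}

public class EncodedVendorRow
{
    public string VendorId { get; set; }
    public float[] VendorEncoded { get; set; }
}

var encodingContext = new MLContext();
IDataView vendorView = encodingContext.Data.LoadFromEnumerable(new[]
{
    new VendorRow { VendorId = "CMT" },
    new VendorRow { VendorId = "VTS" },
    new VendorRow { VendorId = "CMT" },
});

// Indicator output: one slot per distinct category, with a 1 in the matching slot.
var vendorEncoder = encodingContext.Transforms.Categorical.OneHotEncoding(
    outputColumnName: "VendorEncoded",
    inputColumnName: "VendorId",
    outputKind: OneHotEncodingEstimator.OutputKind.Indicator);

// Fit learns the set of categories; Transform applies the encoding.
IDataView encodedView = vendorEncoder.Fit(vendorView).Transform(vendorView);

var encodedRows = encodingContext.Data
    .CreateEnumerable<EncodedVendorRow>(encodedView, reuseRowObject: false)
    .ToList();

foreach (var row in encodedRows)
{
    Console.WriteLine($"{row.VendorId} -> [{string.Join(", ", row.VendorEncoded)}]");
}
```

With the default key ordering (by first occurrence), "CMT" takes the first slot and "VTS" the second, so each row's vector has a single 1 marking its category.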
                 

### Replace missing values 

Another common operation is to replace missing values. Here we use the default replacement mode, which replaces the value with the default value for its type.
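As a small, self-contained illustration of replacement modes, the sketch below runs the default mode and the mean mode over a hypothetical `DistanceRow` sample in which one value is missing. For `float` columns, ML.NET represents a missing value as `NaN`.

```csharp
#r "nuget: Microsoft.ML, 1.7.1"

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms;

// Hypothetical row type, used only for this illustration.
public class DistanceRow
{
    public float TripDistance { get; set; }
}

var missingContext = new MLContext();
IDataView distanceView = missingContext.Data.LoadFromEnumerable(new[]
{
    new DistanceRow { TripDistance = 2.0f },
    new DistanceRow { TripDistance = float.NaN }, // NaN marks a missing float
    new DistanceRow { TripDistance = 4.0f },
});

// Default mode: missing values become default(float), which is 0.
var defaultReplacer = missingContext.Transforms.ReplaceMissingValues(
    outputColumnName: "TripDistance",
    replacementMode: MissingValueReplacingEstimator.ReplacementMode.DefaultValue);

// Mean mode: missing values become the mean of the non-missing values.
var meanReplacer = missingContext.Transforms.ReplaceMissingValues(
    outputColumnName: "TripDistance",
    replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);

float[] withDefault = missingContext.Data
    .CreateEnumerable<DistanceRow>(
        defaultReplacer.Fit(distanceView).Transform(distanceView), reuseRowObject: false)
    .Select(r => r.TripDistance)
    .ToArray();

float[] withMean = missingContext.Data
    .CreateEnumerable<DistanceRow>(
        meanReplacer.Fit(distanceView).Transform(distanceView), reuseRowObject: false)
    .Select(r => r.TripDistance)
    .ToArray();

Console.WriteLine($"Default mode: {string.Join(", ", withDefault)}");
Console.WriteLine($"Mean mode:    {string.Join(", ", withMean)}");
```

Which mode is appropriate depends on the column: a default of 0 can be misleading for values such as trip distance, where 0 is itself a meaningful measurement.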

In [None]:
// Append returns a new estimator chain instead of modifying pipeline, so keep the result.
var pipelineWithMissingValues = pipeline.Append(mlContext.Transforms.ReplaceMissingValues(
    new[] { new InputOutputColumnPair(@"rate_code", @"rate_code"), 
    new InputOutputColumnPair(@"passenger_count", @"passenger_count"), 
    new InputOutputColumnPair(@"trip_time_in_secs", @"trip_time_in_secs"), 
    new InputOutputColumnPair(@"trip_distance", @"trip_distance") })); 


Now let's concatenate all of our feature columns into a single `Features` column. 

In [None]:
// Append to the chain from the previous cell; featurePipeline is the complete pipeline.
var featurePipeline = pipelineWithMissingValues.Append(mlContext.Transforms.Concatenate(
    @"Features", new[] { @"vendor_id", @"payment_type", @"rate_code", @"passenger_count", @"trip_time_in_secs", @"trip_distance" }));

We now have a loaded IDataView and Pipeline to use in training. 
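A pipeline only describes the transformations; nothing runs until it is fitted and applied. The sketch below shows that step on a reduced, hypothetical version of the data (`MiniTrip` with one categorical and one numeric column), chaining the same three kinds of transforms and reading back the resulting `Features` vector. It is an illustration under those assumptions, not the notebook's full pipeline.

```csharp
#r "nuget: Microsoft.ML, 1.7.1"

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Transforms;

// Hypothetical, reduced row types used only for this illustration.
public class MiniTrip
{
    public string VendorId { get; set; }
    public float TripDistance { get; set; }
}

public class MiniTripFeatures
{
    public float[] Features { get; set; }
}

var miniContext = new MLContext();
IDataView miniView = miniContext.Data.LoadFromEnumerable(new[]
{
    new MiniTrip { VendorId = "CMT", TripDistance = 3.8f },
    new MiniTrip { VendorId = "VTS", TripDistance = float.NaN },
});

// Each Append call returns a new chain, so the calls are chained in one expression.
var miniPipeline = miniContext.Transforms.Categorical.OneHotEncoding(
        outputColumnName: "VendorEncoded",
        inputColumnName: "VendorId",
        outputKind: OneHotEncodingEstimator.OutputKind.Indicator)
    .Append(miniContext.Transforms.ReplaceMissingValues("TripDistance"))
    .Append(miniContext.Transforms.Concatenate("Features", "VendorEncoded", "TripDistance"));

// Fit learns the categories; Transform produces the engineered columns.
ITransformer miniModel = miniPipeline.Fit(miniView);
IDataView miniTransformed = miniModel.Transform(miniView);

var featureRows = miniContext.Data
    .CreateEnumerable<MiniTripFeatures>(miniTransformed, reuseRowObject: false)
    .ToList();

foreach (var row in featureRows)
{
    Console.WriteLine($"[{string.Join(", ", row.Features)}]");
}
```

Each `Features` vector here has three slots: two from the one hot encoded vendor and one for the trip distance, with the missing distance replaced by 0.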

# Continue learning

> [⏩ Next Module - Training and AutoML](https://raw.githubusercontent.com/JakeRadMSFT/csharp-notebooks/main/machine-learning/03-Training%20and%20AutoML.ipynb)  
> [⏪ Last Module - Intro to Machine Learning](https://raw.githubusercontent.com/JakeRadMSFT/csharp-notebooks/main/machine-learning/01-Intro%20to%20Machine%20Learning.ipynb)  
