# Data Preparation and Feature Engineering

Data is critical to training and preparing a model. In this notebook we will cover how to load data into ML.NET and ensure it is in the proper format so that ML.NET can work with it. 

In this notebook you will learn how to:

- Load data into ML.NET
- Apply data transforms to help ML.NET understand the data

## Loading data in ML.NET

### What is an IDataView?

The [IDataView](https://docs.microsoft.com/dotnet/api/microsoft.ml.idataview?view=ml-dotnet) is the data format that ML.NET uses for training. It is a set of interfaces and components that provide efficient, compositional processing of schematized data for machine learning and advanced analytics applications. It is designed to handle high-dimensional data and large data sets gracefully and efficiently.

The IDataView has general schema support, in that a view can have an arbitrary number of columns, each having an associated name, index, data type, and optional annotation.

### How to create an IDataView

You can create an IDataView by using any of the methods for loading data:

- TextLoader
- LoadFromEnumerable
- DatabaseLoader
- LoadFromTextFile

See [Load Data from files and other sources](https://docs.microsoft.com/dotnet/machine-learning/how-to-guides/load-data-ml-net) for further documentation and examples. 

In [1]:
#i "nuget:https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json"
#r "nuget: Microsoft.ML, 2.0.0-preview.22356.1"

In [1]:
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms; 

## Download or Locate Data
The following code looks for the data file in a few known locations and, if it is not found, downloads it from its known GitHub location.

In [1]:
using System;
using System.IO;
using System.Net;

string EnsureDataSetDownloaded(string fileName)
{

	// This is the path if the repo has been checked out.
	var filePath = Path.Combine(Directory.GetCurrentDirectory(),"data", fileName);

	if (!File.Exists(filePath))
	{
		// This is the path if the file has already been downloaded.
		filePath = Path.Combine(Directory.GetCurrentDirectory(), fileName);
	}

	if (!File.Exists(filePath))
	{
		using (var client = new WebClient())
		{
			client.DownloadFile($"https://raw.githubusercontent.com/dotnet/csharp-notebooks/main/machine-learning/data/{fileName}", filePath);
		}
		Console.WriteLine($"Downloaded {fileName} to: {filePath}");
	}
	else
	{
		Console.WriteLine($"{fileName} found here: {filePath}");
	}

	return filePath;
}

#### Loading from a file

A [TextLoader](https://docs.microsoft.com/dotnet/api/microsoft.ml.data.textloader?view=ml-dotnet) can load a structured file into an IDataView. Structured information is represented as columns and rows of data. 

You define the schema for your data with a plain old CLR object (POCO), that is, an ordinary class.

A few things to notice about the ModelInput class:
- The `LoadColumn` attribute specifies the index of the column in the file. This attribute is required when loading from a file.
- The `ColumnName` attribute sets the name of the column to something other than the property name. In-memory objects use the property name, but for data processing and building machine learning models, ML.NET refers to the column by the value provided in the `ColumnName` attribute.

In [1]:
public class ModelInput
{
    [LoadColumn(0)]
    [ColumnName(@"vendor_id")]
    public string Vendor_id { get; set; }

    [LoadColumn(1)]
    [ColumnName(@"rate_code")]
    public float Rate_code { get; set; }

    [LoadColumn(2)]
    [ColumnName(@"passenger_count")]
    public float Passenger_count { get; set; }

    [LoadColumn(3)]
    [ColumnName(@"trip_time_in_secs")]
    public float Trip_time_in_secs { get; set; }

    [LoadColumn(4)]
    [ColumnName(@"trip_distance")]
    public float Trip_distance { get; set; }

    [LoadColumn(5)]
    [ColumnName(@"payment_type")]
    public string Payment_type { get; set; }

    [LoadColumn(6)]
    [ColumnName(@"fare_amount")]
    public float Fare_amount { get; set; }

}

All ML.NET operations start in the [MLContext](https://docs.microsoft.com/dotnet/api/microsoft.ml.mlcontext) class. Initializing `mlContext` creates a new ML.NET environment that can be shared across the model creation workflow objects. It is conceptually similar to `DbContext` in Entity Framework.

In [1]:
//Create MLContext
MLContext mlContext = new MLContext();

Create a [TextLoader](https://docs.microsoft.com/dotnet/api/microsoft.ml.data.textloader?view=ml-dotnet) based on the ModelInput type. Then use the text loader to load from the data file. At minimum, the loader will need to be told if the file has a header, and what separator character the file uses. 

You could also use the direct method [LoadFromTextFile](https://docs.microsoft.com/dotnet/api/microsoft.ml.textloadersavercatalog.loadfromtextfile?view=ml-dotnet). The advantage of the TextLoader is that a single loader can load data from multiple files in different locations.

In [1]:
var trainDataPath = EnsureDataSetDownloaded("taxi-fare.csv");

// Create TextLoader based on the Model Input type. 
TextLoader textLoader = mlContext.Data.CreateTextLoader<ModelInput>(separatorChar: ',', hasHeader: true);

// Load the data into an IDataView. The Load() method can take multiple files.
// The files must share the same separator character, header, column names, etc.
IDataView data = textLoader.Load(trainDataPath);

data.Preview(1); 
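
The two loading routes described above can be compared side by side. The following is a minimal sketch, not the notebook's own code: the `TripRow` class, the temporary file names, and the sample values are hypothetical, and it uses its own `context` variable so it does not interfere with the cells in this notebook. It assumes the Microsoft.ML package is already referenced.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

// Write two small CSV files with the same layout (illustrative data only).
string partA = Path.Combine(Path.GetTempPath(), "trips-a.csv");
string partB = Path.Combine(Path.GetTempPath(), "trips-b.csv");
File.WriteAllText(partA, "distance,fare\n1.5,8\n3.8,17.5\n");
File.WriteAllText(partB, "distance,fare\n2.0,9.5\n");

var context = new MLContext();

// Route 1: load a single file directly, without creating a TextLoader first.
IDataView singleFile = context.Data.LoadFromTextFile<TripRow>(
    partA, separatorChar: ',', hasHeader: true);

// Route 2: create a TextLoader once, then pass it several paths.
TextLoader loader = context.Data.CreateTextLoader<TripRow>(
    separatorChar: ',', hasHeader: true);
IDataView multipleFiles = loader.Load(partA, partB);

// Materialize the rows so they can be counted.
int singleCount = context.Data
    .CreateEnumerable<TripRow>(singleFile, reuseRowObject: false).Count();
int multiCount = context.Data
    .CreateEnumerable<TripRow>(multipleFiles, reuseRowObject: false).Count();
Console.WriteLine($"Rows from one file: {singleCount}, rows from both files: {multiCount}");

// Hypothetical two-column row type, used only for this sketch.
public class TripRow
{
    [LoadColumn(0)]
    public float Distance { get; set; }

    [LoadColumn(1)]
    public float Fare { get; set; }
}
```

`LoadFromTextFile` is the shorter choice for a single file; keeping a `TextLoader` around is useful when the same schema and options apply to many files.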

#### Loading from an in-memory collection

ML.NET supports loading data from an in-memory collection. This makes it easy to load data that originates in a JSON or XML file using C#. Learn how to [deserialize JSON with C#](https://docs.microsoft.com/dotnet/standard/serialization/system-text-json-how-to?pivots=dotnet-6-0#how-to-read-json-as-net-objects-deserialize) or use the [XML serializer](https://docs.microsoft.com/dotnet/api/system.xml.serialization?view=net-6.0) to get those files into memory.

Once you have the data collection in memory, you can load it into ML.NET with the [LoadFromEnumerable](https://docs.microsoft.com/dotnet/api/microsoft.ml.dataoperationscatalog.loadfromenumerable?view=ml-dotnet) method. 

In [1]:
ModelInput[] inMemoryCollection = new ModelInput[]
{
    new ModelInput
    {
        Vendor_id = "CMT",
        Rate_code = 1,
        Passenger_count = 1,
        Trip_time_in_secs = 1271,
        Trip_distance = 3.8f,
        Payment_type = "CRD",
        Fare_amount = 17.5f,
    },
    new ModelInput
    {
        Vendor_id = "VST",
        // missing Rate_code
        Passenger_count = 1,
        Trip_time_in_secs = 474,
        Trip_distance = 1.5f,
        Payment_type = "CSH",
        Fare_amount = 8, 
    }
};

In [1]:
// Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromEnumerable<ModelInput>(inMemoryCollection);

data.Preview(1);
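
As a rough illustration of the JSON route mentioned above, the sketch below deserializes a JSON string with `System.Text.Json`, loads the resulting list with `LoadFromEnumerable`, and then prints the index, name, and type of each column in the view's schema. The `JsonTrip` class, its property names, and the JSON payload are hypothetical and exist only for this example; the order in which columns are listed is not something the sketch relies on.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative JSON payload; in practice this text would be read from a file.
string json = @"[
  { ""VendorId"": ""CMT"", ""TripDistance"": 3.8, ""FareAmount"": 17.5 },
  { ""VendorId"": ""VST"", ""TripDistance"": 1.5, ""FareAmount"": 8 }
]";

// Deserialize the JSON into ordinary .NET objects.
List<JsonTrip> trips = JsonSerializer.Deserialize<List<JsonTrip>>(json);

// Load the in-memory collection into an IDataView.
var context = new MLContext();
IDataView tripsView = context.Data.LoadFromEnumerable(trips);

// Every column in the schema carries an index, a name, and a data type.
foreach (DataViewSchema.Column column in tripsView.Schema)
{
    Console.WriteLine($"{column.Index}: {column.Name} ({column.Type})");
}

// Hypothetical row type; the property names match the JSON keys above.
public class JsonTrip
{
    public string VendorId { get; set; }
    public float TripDistance { get; set; }
    public float FareAmount { get; set; }
}
```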

### What's the difference between a DataFrame and IDataView?

You may have heard of the [DataFrame](https://docs.microsoft.com/dotnet/api/microsoft.data.analysis.dataframe?view=ml-dotnet-preview) type. It is another tool for loading, viewing, and manipulating data that is common in notebooks. It implements `IDataView`, so it can easily be passed to ML.NET.

DataFrame and IDataView are similar in that both represent data in a tabular format and let you apply transformations to it. Some key differences:

- DataFrame only supports loading delimited files.
- DataFrame holds all of its data in memory, so you're limited by the amount of memory on your machine.

The DataFrame is recommended when performing tasks like exploratory data analysis on a sample of your data. Look at the reference notebook [REF-Data Processing](https://github.com/dotnet/csharp-notebooks/blob/main/machine-learning/REF-Data%20Processing.ipynb) for an example of using DataFrames to manipulate a data file for training.

IDataView is recommended for training on larger datasets, and it is what the examples in this notebook use. Larger datasets in this case are datasets that can't fit into memory.

## Data Transformations

ML.NET supports a variety of data transformations that will convert data into the required format and help you make corrections to your data. Some common operations are manipulating columns, normalizing values, replacing missing values, converting values, and more. 

For more information, see [data transformations](https://docs.microsoft.com/dotnet/machine-learning/resources/transforms). 

Below are a few common transformations. 

### Categorical data

One hot encoding is an important transformation for data containing categories. ML algorithms require numerical data; they don't know how to process strings that represent categories. The vendor_id and payment_type columns are categorical: the vendor can be "CMT" or "VST", and the payment type can be "CRD" (credit) or "CSH" (cash). One hot encoding takes the string values passed in and converts them into numerical data.

In [1]:
var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(
    new[] 
    { new InputOutputColumnPair(@"vendor_id"), 
    new InputOutputColumnPair(@"payment_type")},
    OneHotEncodingEstimator.OutputKind.Binary); 

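Let's test the above transform on the vendor_id and payment_type columns. The result should be a vector value for each category. Because the transform uses `OutputKind.Binary`, each category is written as a binary-coded vector rather than a vector with one slot per category. For Vendor_Id, CMT becomes `000` and VST becomes `001`. We'll create a new ModelInputTransformed class for the converted types.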

In [1]:
using System;            // Single
using Microsoft.ML.Data; // VBuffer<T> is defined here

public class ModelInputTransformed
{
    [LoadColumn(0)]
    [ColumnName(@"vendor_id")]
    public VBuffer<Single> Vendor_id { get; set; }

    [LoadColumn(1)]
    [ColumnName(@"rate_code")]
    public float Rate_code { get; set; }

    [LoadColumn(2)]
    [ColumnName(@"passenger_count")]
    public float Passenger_count { get; set; }

    [LoadColumn(3)]
    [ColumnName(@"trip_time_in_secs")]
    public float Trip_time_in_secs { get; set; }

    [LoadColumn(4)]
    [ColumnName(@"trip_distance")]
    public float Trip_distance { get; set; }

    [LoadColumn(5)]
    [ColumnName(@"payment_type")]
    public VBuffer<Single> Payment_type { get; set; }

    [LoadColumn(6)]
    [ColumnName(@"fare_amount")]
    public float Fare_amount { get; set; }
}

In [1]:
// Run the transform
IDataView transformedData = pipeline.Fit(data).Transform(data);
var convertedData = mlContext.Data.CreateEnumerable<ModelInputTransformed>(transformedData, true);

// One Hot Encoding of two columns 'vendor_id' and 'payment_type'.
Console.WriteLine("Vendor_Id" +"\t" + "Payment_Type"); 
foreach (ModelInputTransformed item in convertedData)
{    
    Console.WriteLine("{0}\t\t{1}", string.Join(" ", item.Vendor_id.DenseValues()),
                    string.Join(" ", item.Payment_type.DenseValues()));
}

0 0 1		0 0 1
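
To make the difference between the two styles of encoding concrete, here is a library-free sketch. It is a conceptual illustration only: the `Indicator` and `Binary` helpers are hypothetical, the category order is assumed, and ML.NET's actual vector length and bit assignment may differ from what this toy version produces.

```csharp
using System;
using System.Linq;

// Category values, assumed here to be indexed in this order.
string[] categories = { "CMT", "VST" };

// Indicator encoding: one slot per category, with a single 1 marking the match.
float[] Indicator(string value) =>
    categories.Select(c => c == value ? 1f : 0f).ToArray();

// Binary encoding: the category's index written in base 2 with a fixed number of bits.
float[] Binary(string value, int bits)
{
    int index = Array.IndexOf(categories, value);
    var result = new float[bits];
    for (int i = 0; i < bits; i++)
    {
        // Most significant bit first.
        result[i] = (index >> (bits - 1 - i)) & 1;
    }
    return result;
}

Console.WriteLine(string.Join(" ", Indicator("VST")));  // 0 1
Console.WriteLine(string.Join(" ", Binary("VST", 3)));  // 0 0 1
```

Indicator vectors grow by one slot for every distinct category, while binary-coded vectors grow only with the number of bits needed to represent the largest index, which keeps them compact for columns with many distinct values.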


### Replace missing values 

Another common operation is to replace missing values. Here we use the default replacement mode, which replaces the value with the default value for its type.

In [1]:
// Append returns a new estimator chain instead of modifying `pipeline`, so keep the result.
var missingValuesPipeline = pipeline.Append(mlContext.Transforms.ReplaceMissingValues(
    new[] { new InputOutputColumnPair(@"rate_code", @"rate_code"), 
    new InputOutputColumnPair(@"passenger_count", @"passenger_count"), 
    new InputOutputColumnPair(@"trip_time_in_secs", @"trip_time_in_secs"), 
    new InputOutputColumnPair(@"trip_distance", @"trip_distance") })); 

Again, let's run the transform and take a look at the filled-in value. The second dummy object never set its rate_code. Note that an unset `float` property already holds 0, whereas ML.NET treats `NaN` as the missing value for `float` columns, so set the property to `float.NaN` if you want to see an actual replacement.

In [1]:
IDataView transformedData = missingValuesPipeline.Fit(data).Transform(data);
var convertedData = mlContext.Data.CreateEnumerable<ModelInputTransformed>(transformedData, true);

"Rate_code: " + convertedData.ElementAt(1).Rate_code

Rate_code: 0

Now let's concatenate all of our feature columns into one vector column. Many ML trainers expect features to be of vector type because applying operations to a vector is more efficient.

In [1]:
// Chain onto the previous result so the final pipeline contains every step.
var trainingPipeline = missingValuesPipeline.Append(mlContext.Transforms.Concatenate(
    @"Features", new[] { @"vendor_id", @"payment_type", @"rate_code", @"passenger_count", @"trip_time_in_secs", @"trip_distance" }));

We now have a loaded IDataView and pipeline to use for training. 
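
Returning briefly to the missing-value step, the sketch below shows the default replacement mode acting on a value that is explicitly marked as missing with `float.NaN`. It is a standalone illustration under stated assumptions: the `RateRow` class and the sample values are hypothetical, and it uses its own `context` variable rather than the notebook's `mlContext`.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

var context = new MLContext();

// Illustrative rows; the second one marks RateCode as missing with NaN.
var rows = new[]
{
    new RateRow { RateCode = 1f },
    new RateRow { RateCode = float.NaN },
};
IDataView rateView = context.Data.LoadFromEnumerable(rows);

// The default replacement mode substitutes the type's default value (0 for float).
var replaceMissing = context.Transforms.ReplaceMissingValues("RateCode");
IDataView cleaned = replaceMissing.Fit(rateView).Transform(rateView);

// Read the transformed column back into plain objects.
float[] values = context.Data
    .CreateEnumerable<RateRow>(cleaned, reuseRowObject: false)
    .Select(row => row.RateCode)
    .ToArray();
Console.WriteLine(string.Join(", ", values));  // 1, 0

// Hypothetical single-column row type for this sketch.
public class RateRow
{
    public float RateCode { get; set; }
}
```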

# Continue learning

> [⏩ Next Module - Training and AutoML](./03-Training%20and%20AutoML.ipynb)  
> [⏪ Last Module - Intro to Machine Learning](./01-Intro%20to%20Machine%20Learning.ipynb)  