# IDataView

In this notebook, we'll cover:

- What is an IDataView?
- What's the difference between a DataFrame and an IDataView?
- How to create an IDataView?
- How to inspect data in an IDataView?

## What is an IDataView?

The [IDataView](https://docs.microsoft.com/dotnet/api/microsoft.ml.idataview?view=ml-dotnet) system is a set of interfaces and components that provide efficient, compositional processing of schematized data for machine learning and advanced analytics applications. It is designed to gracefully and efficiently handle high dimensional data and large data sets. It does not directly address distributed data and computation, but is suitable for single node processing of data partitions belonging to larger distributed data sets.

### Schema

IDataView has general schema support, in that a view can have an arbitrary number of columns, each having an associated name, index, data type, and optional annotation.

Column names are case sensitive. Multiple columns can share the same name, in which case one of the columns hides the others: the name maps to a single column index, that of the visible column.

All user interaction with columns should be via name, not index, so the hidden columns are generally invisible to the user. However, hidden columns are often useful for diagnostic purposes.

### Supported Data Types

The set of supported column data types forms an open type system, in the sense
that additional types can be added at any time and in any assembly. However,
there is a precisely defined set of standard types including:

- Text
- Boolean
- Single and double precision floating point
- Signed integer values using 1, 2, 4, or 8 bytes
- Unsigned integer values using 1, 2, 4, or 8 bytes
- Values for ids and probabilistically unique hashes, using 16 bytes
- Date time, date time zone, and timespan
- Key types
- Vector types
- Image types
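As a rough illustration of how the standard types above surface in code, the sketch below lists some of the `DataViewType` objects from the `Microsoft.ML.Data` namespace and prints the .NET type each one stores. The key count (10) and vector length (3) are arbitrary values chosen for the example, and image types are left out because they ship in a separate package.

```csharp
#r "nuget:Microsoft.ML,1.7.1"
using System;
using Microsoft.ML.Data;

// A sample of the standard column types; labels are for display only.
var exampleTypes = new (string Label, DataViewType Type)[]
{
    ("Text", TextDataViewType.Instance),
    ("Boolean", BooleanDataViewType.Instance),
    ("Single", NumberDataViewType.Single),
    ("Double", NumberDataViewType.Double),
    ("Signed 4-byte integer", NumberDataViewType.Int32),
    ("Unsigned 8-byte integer", NumberDataViewType.UInt64),
    ("16-byte id", RowIdDataViewType.Instance),
    ("Date time", DateTimeDataViewType.Instance),
    ("Date time zone", DateTimeOffsetDataViewType.Instance),
    ("Timespan", TimeSpanDataViewType.Instance),
    // Key type backed by uint with 10 possible values (arbitrary count).
    ("Key", new KeyDataViewType(typeof(uint), 10)),
    // Fixed-size vector of 3 floats (arbitrary length).
    ("Vector", new VectorDataViewType(NumberDataViewType.Single, 3)),
};

// Print the underlying .NET type that values of each column type use.
foreach (var (label, type) in exampleTypes)
{
    Console.WriteLine($"{label}: {type.RawType}");
}
```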


## What's the difference between a DataFrame and IDataView?

DataFrame and IDataView are similar in that both represent data in a tabular format and let you apply transformations to it. Some key differences:

- DataFrame only supports loading delimited files.
- DataFrame holds its data in memory, so you're limited by the amount of memory on your machine. An IDataView is evaluated lazily and streams rows as they are needed, so the data does not have to fit in memory.

The DataFrame is recommended for tasks like exploratory data analysis on a sample of your data.

IDataView is recommended for training on larger datasets.

## How to create an IDataView

You can create an IDataView by using any of the methods for loading data:

- TextLoader
- LoadFromTextFile
- LoadFromEnumerable
- Load
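Of the options above, `LoadFromEnumerable` is the one that needs no file, so it makes a convenient first example. The following is a minimal sketch, not the notebook's own code: the `StudentScore` class and the variable names are hypothetical, and the rows reuse the sample student scores from this notebook.

```csharp
#r "nuget:Microsoft.ML,1.7.1"
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical row type; property names become column names.
public class StudentScore
{
    public string StudentName { get; set; }
    public float Score { get; set; }
}

// In-memory rows to wrap in an IDataView.
var inMemoryScores = new List<StudentScore>
{
    new StudentScore { StudentName = "Jane", Score = 80 },
    new StudentScore { StudentName = "John", Score = 75 },
    new StudentScore { StudentName = "Jack", Score = 90 },
    new StudentScore { StudentName = "Sally", Score = 100 },
};

// The schema is inferred from the public properties of StudentScore.
var enumerableContext = new MLContext();
IDataView inMemoryDataView = enumerableContext.Data.LoadFromEnumerable(inMemoryScores);
```

Here the class doubles as the schema definition, so no separate schema has to be supplied.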

### Defining Schema

IDataViews are schematized, so you need to provide a schema. There are two ways to define it:

- Manually
- Classes

#### Manually defining IDataView Schema

To define the schema manually, you can use `DataViewSchema.Builder`.

In [1]:
#r "nuget:Microsoft.ML,1.7.1"

In [1]:
using Microsoft.ML;
using Microsoft.ML.Data;

Let's say that we have data that looks like the following:

| Student Name | Score | 
| --- | --- |
| Jane | 80 |
| John | 75 | 
| Jack | 90 |
| Sally | 100 |

We can define the schema as follows:

In [1]:
var schemaBuilder = new DataViewSchema.Builder();
schemaBuilder.AddColumn("StudentName", TextDataViewType.Instance);
schemaBuilder.AddColumn("Score", NumberDataViewType.Single);
var schema = schemaBuilder.ToSchema();

When we inspect the schema we can see its different properties.

In [1]:
schema

| index | Name | Index | IsHidden | Type | Annotations |
| --- | --- | --- | --- | --- | --- |
| 0 | StudentName | 0 | False | RawType: `System.ReadOnlyMemory<System.Char>` | Schema: [ ] |
| 1 | Score | 1 | False | RawType: `System.Single` | Schema: [ ] |
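To make the earlier points about case-sensitive names and hidden columns concrete, here is a small sketch that builds a schema in which `Score` is deliberately added twice and then looks columns up by name. The variable names are made up for this example, and it assumes the `Microsoft.ML` package referenced earlier in the notebook.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Illustrative schema: "Score" is added twice on purpose.
var duplicateBuilder = new DataViewSchema.Builder();
duplicateBuilder.AddColumn("StudentName", TextDataViewType.Instance);
duplicateBuilder.AddColumn("Score", NumberDataViewType.Single);
duplicateBuilder.AddColumn("Score", NumberDataViewType.Double);
var duplicateSchema = duplicateBuilder.ToSchema();

// Looking a column up by name resolves to the visible column.
DataViewSchema.Column visibleScore = duplicateSchema["Score"];
Console.WriteLine($"Visible: index {visibleScore.Index}, type {visibleScore.Type}");

// GetColumnOrNull returns null instead of throwing for an unknown name.
// Names are case sensitive, so "score" is a different name from "Score".
DataViewSchema.Column? lowercaseScore = duplicateSchema.GetColumnOrNull("score");
Console.WriteLine($"Lowercase lookup found a column: {lowercaseScore.HasValue}");

// Enumerating by index still reaches every column, including hidden ones.
foreach (var column in duplicateSchema)
{
    Console.WriteLine($"{column.Index}: {column.Name}, hidden = {column.IsHidden}");
}
```

Enumerating the schema like this is one way to see hidden columns when diagnosing a pipeline.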


#### Defining schema with classes

You also have the option of creating new classes or using existing classes to define your schema. Using the same student data above, you can define the schema as follows:

In [1]:
public class TestScores
{
    // Text column holding the student's name
    public string StudentName { get; set; }

    // Numeric score, stored as a single-precision float
    public float Score { get; set; }
}

### Loading data

You can load data from a flat file using either a `TextLoader` or the `LoadFromTextFile` method.

#### Loading data from a TextLoader

In [1]:
// Initialize MLContext
var mlContext = new MLContext();

In [1]:
// Define TextLoader
var textLoader =
    mlContext.Data.CreateTextLoader(
        columns: new TextLoader.Column[]
        {
            new TextLoader.Column("StudentName", DataKind.String, 0),
            new TextLoader.Column("Score", DataKind.Single, 1)
        },
        separatorChar: ',',
        hasHeader: true);

In [1]:
// Create IDataView
var textLoaderDataView = textLoader.Load("student-scores.csv");

In [1]:
textLoaderDataView.Schema

| index | Name | Index | IsHidden | Type | Annotations |
| --- | --- | --- | --- | --- | --- |
| 0 | StudentName | 0 | False | RawType: `System.ReadOnlyMemory<System.Char>` | Schema: [ ] |
| 1 | Score | 1 | False | RawType: `System.Single` | Schema: [ ] |


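You can also describe the columns with a class by annotating its properties with the `LoadColumn` attribute:

In [1]:
// Specify column index from file via LoadColumn attribute
public class TestScoresAttributes
{
    [LoadColumn(0)]
    public string StudentName { get; set; }

    [LoadColumn(1)]
    public float Score { get; set; }
}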

In [1]:
var textLoaderAttributes = 
	mlContext.Data.CreateTextLoader<TestScoresAttributes>(separatorChar: ',', hasHeader:true);
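The `LoadFromTextFile` method mentioned at the start of this section combines creating the loader and loading the file into one call. The sketch below assumes the `mlContext`, the `TestScoresAttributes` class, and the `student-scores.csv` file from the cells above; the variable names on the left-hand side are chosen for this example only.

```csharp
using Microsoft.ML;

// Option 1: load with the typed TextLoader created above.
IDataView attributesDataView = textLoaderAttributes.Load("student-scores.csv");

// Option 2: let LoadFromTextFile build the loader and load the file in one step.
// Column positions come from the LoadColumn attributes on TestScoresAttributes.
IDataView textFileDataView =
    mlContext.Data.LoadFromTextFile<TestScoresAttributes>(
        "student-scores.csv",
        separatorChar: ',',
        hasHeader: true);
```

Both options should produce a view with the same schema, since both read it from the same class.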

## Inspecting data in IDataView

There are several ways to inspect the data in an IDataView:

- Use cursors
- Convert to IEnumerable

### Use cursors