## .NET DataFrame

The .NET DataFrame is a data structure provided by the Microsoft.Data.Analysis library. It is designed to handle tabular data in memory through a familiar API. A DataFrame is similar to a table in a relational database or to a data frame in R or Python (pandas), exposed through a strongly typed .NET API.

Here are some key features of .NET DataFrame:

1. **Ease of use**: You can easily manipulate data and perform statistical functions on it. It supports operations like group by, join, sort, filter, and others.

2. **High performance**: DataFrame is column-oriented. Primitive columns are kept in typed buffers rather than as boxed objects, which keeps memory use low and makes column-wise operations fast.

3. **Flexibility**: It can handle different data types (integers, strings, floats, etc.) and lets you add, edit, or delete columns.

4. **Integration with .NET**: Since it's a .NET library, it can be used with other .NET libraries and tools, and it benefits from .NET's strong type checking.

Here is a simple example of how to use it:



In [1]:
// APSToolkit already includes the Microsoft.Data.Analysis library, so installing APSToolkit is enough
#r "nuget:APSToolkit"

Loading extensions from `C:\Users\vho2\.nuget\packages\microsoft.data.analysis\0.21.1\interactive-extensions\dotnet\Microsoft.Data.Analysis.Interactive.dll`

In [2]:
using Microsoft.Data.Analysis;

// Create a DataFrame with two columns
DataFrame df = new DataFrame(
    new StringDataFrameColumn("Name", new string[] { "John", "Sue", "Bob" }),
    new Int32DataFrameColumn("Age", new int[] { 33, 45, 21 })
);

// Filter the data
DataFrame filtered = df.Filter(df["Age"].ElementwiseLessThan(40));

// Display the result
Console.WriteLine(filtered);

Name      Age       
John      33        
Bob       21        





In this example, a DataFrame is created with two columns, "Name" and "Age". Then, a filter is applied to get only the rows where the age is less than 40. The result is then printed to the console.
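
Other operations from the feature list, such as sorting and adding or removing columns, follow the same pattern. The following is a minimal sketch, assuming the Microsoft.Data.Analysis package is already referenced (for example through APSToolkit). The `IsMember` column and its values are hypothetical sample data.

```csharp
using System;
using Microsoft.Data.Analysis;

// Same sample data as in the example above
DataFrame people = new DataFrame(
    new StringDataFrameColumn("Name", new string[] { "John", "Sue", "Bob" }),
    new Int32DataFrameColumn("Age", new int[] { 33, 45, 21 })
);

// Sort the rows by age in ascending order (returns a new DataFrame)
DataFrame sorted = people.OrderBy("Age");

// Add a column (hypothetical values), then remove it again by name
people.Columns.Add(new BooleanDataFrameColumn("IsMember", new bool[] { true, false, true }));
int columnsAfterAdd = people.Columns.Count;
people.Columns.Remove("IsMember");
int columnsAfterRemove = people.Columns.Count;

Console.WriteLine(sorted);
```

`OrderBy` leaves the original DataFrame untouched, while `Columns.Add` and `Columns.Remove` modify it in place.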

## APSToolkit DataFrame

APSToolkit adds helper functions that load a DataFrame from the formats in which data is returned. Some of the possible formats are:
- DataTable (System.Data.DataTable)
- Excel - The Excel file extracted from the data
- CSV - The CSV file extracted from the data
- Parquet - The Parquet file extracted from the data
- ... and more

#### DataTable
The DataTable is a .NET class that represents a table of in-memory data. It is a powerful and flexible data structure that can be used to store, manipulate, and analyze data. It is part of the System.Data namespace and is widely used in .NET applications for working with databases and other data sources.

In [3]:
using System.Data;
using APSToolkit.Utils;
DataTable dataTable = new DataTable();
dataTable.Columns.Add("Name", typeof(string));
dataTable.Columns.Add("Age", typeof(int));
dataTable.Rows.Add("John", 33);
dataTable.Rows.Add("Sue", 45);
dataTable.Rows.Add("Bob", 21);
// load into DataFrame
Microsoft.Data.Analysis.DataFrame df = APSToolkit.Utils.DataFrame.LoadFromDataTable(dataTable);
// visualize the DataFrame
df

index,Name,Age
0,John,33
1,Sue,45
2,Bob,21
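
To make the conversion less of a black box, here is a rough sketch of how a DataTable could be turned into a DataFrame by hand, using only System.Data and Microsoft.Data.Analysis. The `ToDataFrame` helper is hypothetical and is not APSToolkit's implementation. It assumes that only `int` columns need to stay numeric and stores everything else as text.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;
using Microsoft.Data.Analysis;

// Hypothetical helper, not APSToolkit's implementation:
// int columns stay numeric, every other column is stored as text.
static DataFrame ToDataFrame(DataTable table)
{
    var columns = new List<DataFrameColumn>();
    foreach (DataColumn source in table.Columns)
    {
        IEnumerable<DataRow> rows = table.Rows.Cast<DataRow>();
        if (source.DataType == typeof(int))
        {
            // DBNull cells become null entries in the column
            columns.Add(new Int32DataFrameColumn(source.ColumnName,
                rows.Select(r => r.IsNull(source) ? (int?)null : (int)r[source])));
        }
        else
        {
            columns.Add(new StringDataFrameColumn(source.ColumnName,
                rows.Select(r => r.IsNull(source) ? null : Convert.ToString(r[source]))));
        }
    }
    return new DataFrame(columns);
}

// Usage with a small table like the one above
var table = new DataTable();
table.Columns.Add("Name", typeof(string));
table.Columns.Add("Age", typeof(int));
table.Rows.Add("John", 33);
table.Rows.Add("Sue", 45);
DataFrame converted = ToDataFrame(table);
Console.WriteLine(converted);
```

A real helper would need to cover more column types (dates, doubles, booleans), which is one reason to prefer the library function.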


#### Excel

APSToolkit supports loading a DataFrame from an Excel file. The function requires the file path and the name of the sheet to load.

In [None]:
// load excel into dataframe
Microsoft.Data.Analysis.DataFrame df = APSToolkit.Utils.DataFrame.LoadFromExcel("path_to_excel.xlsx", "sheet_name");

#### Parquet

Parquet is a columnar storage file format that is optimized for use with big data processing frameworks like Apache Hadoop, Apache Spark, and others. It's compatible with most of the data processing frameworks in the Hadoop environment and is designed to perform best with complex data in bulk.

Here are some key features of Parquet:

1. **Columnar Storage**: Unlike row-based formats such as CSV or TSV, Parquet stores the values of each column together. Values of the same type sit next to each other, which allows efficient compression and encoding, and a query can read only the columns it needs.

2. **Schema Evolution**: Parquet supports complex nested data structures and allows for schema evolution, where you can add, remove, or modify columns.

3. **Compression and Encoding**: Parquet provides efficient compression and encoding schemes to store data more compactly. It also allows different encoding and compression schemes to be specified for different columns.

4. **Language Independent**: Parquet is defined as a file format specification rather than tied to one runtime, so it can be read and written from many languages, including Java, C++, Python, and .NET.
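
The difference between row-based and columnar layouts can be illustrated with plain C# arrays. This is only a conceptual sketch using the sample names and ages from earlier. It does not read or write Parquet, and the variable names are illustrative.

```csharp
using System;
using System.Linq;

// Row-oriented layout (CSV-like): each record keeps all of its fields together
var rows = new (string Name, int Age)[] { ("John", 33), ("Sue", 45), ("Bob", 21) };

// Column-oriented layout (Parquet-like): each column is stored as its own array
string[] names = { "John", "Sue", "Bob" };
int[] ages = { 33, 45, 21 };

// Averaging ages in the row layout walks over every record, Name field included
double avgFromRows = rows.Average(r => r.Age);

// In the column layout only the Age array is read
double avgFromColumns = ages.Average();

Console.WriteLine(avgFromRows == avgFromColumns); // True
```

Both layouts hold the same data. The columnar one lets a query skip the columns it does not use.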

Here is an example of how to load a Parquet file into a DataFrame:



In [None]:
Microsoft.Data.Analysis.DataFrame df = APSToolkit.Utils.DataFrame.LoadFromParquet("path_to_parquet.parquet");



In this example, the Parquet file at the given path is read and its contents are loaded into a DataFrame.

#### Other Formats Supported by Microsoft.Data.Analysis

In [None]:
Microsoft.Data.Analysis.DataFrame df = Microsoft.Data.Analysis.DataFrame.LoadCsv("path_to_csv.csv");
// Microsoft.Data.Analysis.DataFrame df = Microsoft.Data.Analysis.DataFrame.LoadCsvFromString("csv_string");
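
As a small self-contained illustration of the commented-out variant, the sketch below loads CSV text from a string instead of a file. The CSV content reuses the sample names and ages from earlier, and the column types are left to the library's type guessing.

```csharp
using System;
using Microsoft.Data.Analysis;

// Sample CSV text with a header row
string csv = "Name,Age\nJohn,33\nSue,45\nBob,21";

// Parse the text; the first line is treated as the header by default
DataFrame fromCsv = DataFrame.LoadCsvFromString(csv);

Console.WriteLine(fromCsv.Rows.Count);    // number of data rows
Console.WriteLine(fromCsv.Columns.Count); // number of columns
```

Loading from a string is handy in a notebook when the data is small enough to paste inline.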


## Data Visualization

In [None]:
// Try counting how many elements there are in each category
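
One possible way to get such counts is to group by a category column. This is a sketch under assumptions: the `Category` and `Name` columns and their values are hypothetical sample data, not the result of an actual APSToolkit query.

```csharp
using System;
using Microsoft.Data.Analysis;

// Hypothetical element data: one row per element
DataFrame elements = new DataFrame(
    new StringDataFrameColumn("Name", new string[] { "Wall-1", "Door-1", "Wall-2" }),
    new StringDataFrameColumn("Category", new string[] { "Walls", "Doors", "Walls" })
);

// Group the rows by category and count the rows in each group
DataFrame countsByCategory = elements.GroupBy("Category").Count();

Console.WriteLine(countsByCategory);
```

The resulting DataFrame has one row per distinct category and could then be passed to a charting library.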

## Data Analysis