
3.1.0 elastacloud release (#343)

* small cleanups

* define data model and extension methods for working with row-based data

* changed my mind around table - no expand/contract

* row-based reader prototype for simple tables

* 3.0.6?

* poor man's implementation for repeatable fields in rows model

* map type validation

* thoughts on rep and def levels

* try dotnet housework as a global tool

* rearrange commands slightly

* author CLI

* add virtual schema command to CLI

* disable failing test temporarily

* read simple dictionaries into table

* confirm that maps from apache spark can be read back

* moving to new rows idea, string formatting on table is improved too

* read and write simple maps

* adding docs

* add rows to readme

* this will be 3.1.0

* CLI schema tests

* display list schema in parquet CLI

* first attempt at head CLI

* organise tests in regions

* extracting converter

* read simple structures

* slightly better formatting

* parquet cli json

* finished decomposing RowMatrix

* Add in full table display from Parq

* add appinsights

* add hardcoded key

* force flush

* read/write for simple structs

* don't flush to disk

* update docs on structs

* reading structs implementation

* small refactoring

* Update rows.md

Typos

* housework scripts

* quick fix on handling nulls

* list dies in recurse loop

* prevent infinite loop when reading list of structs

* it's called parq now

* package updates

* error handling in CLI

* run tests under net core 2.1

* delete old files

* renaming according to .net standards

* improving DCE to deal with nested repetition levels

* move interactive view from parq

* we can read list of structs again

* try complex objects

* small bug fix

* greatly simplified rows->columns converter
ListField calculates RLs and DLs

* try to optimise tests

* don't include test files as embedded resource

* renaming some test files

* move alltypes

* update docs

* docs update

* create failing test cases and resurrect some tests

* read files with multiple row groups

* handle null dictionary cases

* removed rowCount parameter in CreateRowGroupWriter completely and calculating rows count automatically (including repeated columns). Fixes issue #334

* add failing test for list of structures

* save it

* try enabling sourcelink

* read/write list of structures

* skip 3 tests for now

* update cli package description

* remove disk write

* upgrade CPF

* a rough prototype of conversion to json

* encode json string values

* versioning and release notes

* enumerator fix
aloneguid committed Oct 3, 2018
1 parent c144190 commit 62cd82624bb6f923ef94aa613a9cbd5f6b104e47
Showing with 4,931 additions and 1,632 deletions.
  1. +3 −1 .gitignore
  2. +2 −0 README.md
  3. +8 −10 appveyor.yml
  4. +4 −3 build.ini
  5. BIN doc/diagrams.vsdx
  6. BIN doc/img/parq-schema.png
  7. BIN doc/img/rows-general.png
  8. +25 −0 doc/parq.md
  9. +187 −0 doc/rows.md
  10. +19 −0 src/Parquet.CLI/Commands/ConvertToJsonCommand.cs
  11. +82 −0 src/Parquet.CLI/Commands/DisplayFullCommand.cs
  12. +40 −0 src/Parquet.CLI/Commands/FileInputCommand.cs
  13. +38 −0 src/Parquet.CLI/Commands/HeadCommand.cs
  14. +112 −0 src/Parquet.CLI/Commands/SchemaCommand.cs
  15. +144 −0 src/Parquet.CLI/Help.Designer.cs
  16. +147 −0 src/Parquet.CLI/Help.resx
  17. +59 −0 src/Parquet.CLI/Models/ColumnDetails.cs
  18. +12 −0 src/Parquet.CLI/Models/ConsoleFold.cs
  19. +13 −0 src/Parquet.CLI/Models/ConsoleSheet.cs
  20. +16 −0 src/Parquet.CLI/Models/Input.cs
  21. +17 −0 src/Parquet.CLI/Models/ViewModel.cs
  22. +25 −0 src/Parquet.CLI/Models/ViewPort.cs
  23. +15 −0 src/Parquet.CLI/Models/ViewSettings.cs
  24. +46 −0 src/Parquet.CLI/Parquet.CLI.csproj
  25. +160 −0 src/Parquet.CLI/Program.cs
  26. +88 −0 src/Parquet.CLI/Views/FullConsoleView.cs
  27. +12 −0 src/Parquet.CLI/Views/IDrawViews.cs
  28. +394 −0 src/Parquet.CLI/Views/InteractiveConsoleView.cs
  29. +0 −50 src/Parquet.Json.Test/Parquet.Json.Test.csproj
  30. +2 −2 src/Parquet.Runner/Parquet.Runner.csproj
  31. +1 −1 src/Parquet.Test/DocRef.cs
  32. +2 −60 src/Parquet.Test/ListTest.cs
  33. +0 −57 src/Parquet.Test/MapsTest.cs
  34. +0 −22 src/Parquet.Test/MetadataTest.cs
  35. +2 −2 src/Parquet.Test/NonSeekableWriterTEst.cs
  36. +16 −195 src/Parquet.Test/Parquet.Test.csproj
  37. +3 −201 src/Parquet.Test/ParquetReaderTest.cs
  38. +57 −55 src/Parquet.Test/ParquetWriterTest.cs
  39. +1 −1 src/Parquet.Test/PrimitiveTypesTest.cs
  40. +6 −6 src/Parquet.Test/Reader/TestDataTest.cs
  41. +1 −2 src/Parquet.Test/RepeatableFieldsTest.cs
  42. +476 −0 src/Parquet.Test/RowsModelTest.cs
  43. +56 −0 src/Parquet.Test/SchemaTest.cs
  44. +1 −1 src/Parquet.Test/Serialisation/SchemaReflectorTest.cs
  45. +2 −53 src/Parquet.Test/StructureTest.cs
  46. +40 −9 src/Parquet.Test/TestBase.cs
  47. +0 −13 src/Parquet.Test/data/ResourceReader.cs
  48. 0 src/Parquet.Test/data/{nested1.json → all_var1.1.json}
  49. 0 src/Parquet.Test/data/{nested2.json → all_var1.2.json}
  50. BIN src/Parquet.Test/data/{nested.parquet → all_var1.parquet}
  51. BIN src/Parquet.Test/data/{listofitems-empty-onerow.parquet → list_empty.parquet}
  52. 0 src/Parquet.Test/data/{simplerepeated.json → list_simple.json}
  53. BIN src/Parquet.Test/data/{simplerepeated.parquet → list_simple.parquet}
  54. 0 src/Parquet.Test/data/{repeatedstruct.json → list_structs.json}
  55. BIN src/Parquet.Test/data/{repeatedstruct.parquet → list_structs.parquet}
  56. BIN src/Parquet.Test/data/{map.parquet → map_simple.parquet}
  57. 0 src/Parquet.Test/data/{ → real}/nation.csv
  58. BIN src/Parquet.Test/data/{ → real}/nation.dict.parquet
  59. BIN src/Parquet.Test/data/{ → real}/nation.impala.parquet
  60. BIN src/Parquet.Test/data/{ → real}/nation.plain.parquet
  61. BIN src/Parquet.Test/data/{ → special}/all_nulls.parquet
  62. BIN src/Parquet.Test/data/{ → special}/all_nulls_no_booleans.parquet
  63. BIN src/Parquet.Test/data/{ → special}/decimallegacy.parquet
  64. BIN src/Parquet.Test/data/{ → special}/decimalnulls.parquet
  65. BIN src/Parquet.Test/data/struct_plain.parquet
  66. 0 src/Parquet.Test/data/{ → types}/alltypes.csv
  67. BIN src/Parquet.Test/data/{ → types}/alltypes.gzip.parquet
  68. BIN src/Parquet.Test/data/{ → types}/alltypes.plain.parquet
  69. BIN src/Parquet.Test/data/{ → types}/alltypes.snappy.parquet
  70. 0 src/Parquet.Test/data/{ → types}/alltypes_dictionary.csv
  71. BIN src/Parquet.Test/data/{ → types}/alltypes_dictionary.gzip.parquet
  72. BIN src/Parquet.Test/data/{ → types}/alltypes_dictionary.plain-spark21.parquet
  73. BIN src/Parquet.Test/data/{ → types}/alltypes_dictionary.plain.parquet
  74. BIN src/Parquet.Test/data/{ → types}/alltypes_dictionary.snappy.parquet
  75. 0 src/Parquet.Test/data/{ → types}/alltypes_no_headers.csv
  76. +9 −0 src/Parquet.sln
  77. +1 −1 src/Parquet/Data/BasicDataTypeHandler.cs
  78. +48 −48 src/Parquet/Data/Concrete/ListDataTypeHandler.cs
  79. +60 −59 src/Parquet/Data/Concrete/MapDataTypeHandler.cs
  80. +31 −30 src/Parquet/Data/Concrete/StructureDataTypeHandler.cs
  81. +24 −5 src/Parquet/Data/DataColumn.cs
  82. +1 −16 src/Parquet/Data/DataType.cs
  83. +1 −1 src/Parquet/Data/DataTypeFactory.cs
  84. +7 −0 src/Parquet/Data/IDataTypeHandler.cs
  85. +0 −26 src/Parquet/Data/RepeatedDataColumn.cs
  86. +70 −0 src/Parquet/Data/Rows/DataColumnAppender.cs
  87. +102 −0 src/Parquet/Data/Rows/DataColumnEnumerator.cs
  88. +220 −0 src/Parquet/Data/Rows/DataColumnsToRowsConverter.cs
  89. +334 −0 src/Parquet/Data/Rows/Row.cs
  90. +124 −0 src/Parquet/Data/Rows/RowValidator.cs
  91. +112 −0 src/Parquet/Data/Rows/RowsToDataColumnsConverter.cs
  92. +12 −0 src/Parquet/Data/Rows/StringFormat.cs
  93. +332 −0 src/Parquet/Data/Rows/Table.cs
  94. +24 −0 src/Parquet/Data/Rows/TableReader.cs
  95. +85 −0 src/Parquet/Data/Rows/TreeList.cs
  96. +10 −3 src/Parquet/Data/Schema/DataField.cs
  97. +42 −3 src/Parquet/Data/Schema/Field.cs
  98. +16 −4 src/Parquet/Data/Schema/ListField.cs
  99. +31 −0 src/Parquet/Data/Schema/MapField.cs
  100. +26 −58 src/Parquet/Data/Schema/Schema.cs
  101. +7 −7 src/Parquet/Data/Schema/StructField.cs
  102. +2 −29 src/Parquet/Extensions/OtherExtensions.cs
  103. +103 −0 src/Parquet/Extensions/StringBuilderExtensions.cs
  104. +2 −0 src/Parquet/File/DataColumnReader.cs
  105. +13 −2 src/Parquet/Parquet.csproj
  106. +1 −1 src/Parquet/ParquetActor.cs
  107. +1 −1 src/Parquet/ParquetConvert.cs
  108. +65 −3 src/Parquet/ParquetExtensions.cs
  109. +29 −0 src/Parquet/ParquetReader.cs
  110. +1 −1 src/Parquet/ParquetRowGroupReader.cs
  111. +14 −7 src/Parquet/ParquetRowGroupWriter.cs
  112. +13 −5 src/Parquet/ParquetWriter.cs
  113. +88 −0 src/Parquet/Serialization/HttpEncoder.cs
  114. +1 −1 src/Parquet/Serialization/SchemaReflector.cs
  115. +422 −368 src/Parquet/Thrift/SchemaElement.cs
  116. +0 −19 src/PreCommit.ps1
  117. +8 −5 src/SharpArrow.Test/SharpArrow.Test.csproj
  118. +2 −2 src/spark-experiments/pom.xml
  119. +43 −0 src/spark-experiments/src/main/scala/alltestdata.sc
  120. +60 −0 src/spark-experiments/src/main/scala/com/ivan/parquet/ScalaApp.scala
  121. +0 −50 src/spark-experiments/src/main/scala/compat.sc
  122. +0 −33 src/spark-experiments/src/main/scala/maps.sc
  123. +0 −6 src/spark-experiments/src/main/scala/nested-records.sc
  124. +0 −48 src/spark-experiments/src/main/scala/perf.sc
  125. +0 −13 src/spark-experiments/src/main/scala/read-device.sc
  126. +0 −33 src/spark-experiments/src/main/scala/read-file-metadata.sc
@@ -288,4 +288,6 @@ __pycache__/
*.btm.cs
*.odx.cs
*.xsd.cs
.vscode
.vscode

target/
@@ -31,9 +31,11 @@ This project is aimed to fix this problem. We support all the popular server and
- [Reading Data](doc/reading.md)
- [Writing Data](doc/writing.md)
- [Complex Types](doc/complex-types.md)
- [Utilities for row-based access](doc/rows.md)
- [Fast Automatic Serialisation](doc/serialisation.md)
- [Declaring Schema](doc/schema.md)
- [Supported Types](doc/types.md)
- **[parq!!!](doc/parq.md)**

You can track the [amount of features we have implemented so far](doc/features.md).

@@ -7,27 +7,25 @@ configuration: Release
platform: Any CPU
before_build:
- ps: >-
dotnet restore src/Parquet.sln
dotnet tool install -g housework -g
cd src/Parquet
housework setbuildnumber %CiVersion% -s build.ini
dotnet housework setbuildnumber %CiVersion% ../../build.ini
housework author ./src/Parquet/Parquet.csproj -s build.ini
dotnet housework author .\Parquet.csproj ../../build.ini
housework author ./src/SharpArrow/SharpArrow.csproj -s build.ini
dotnet housework author ..\SharpArrow\SharpArrow.csproj ../../build.ini
housework author ./src/Parquet.CLI/Parquet.CLI.csproj -s build.ini
dotnet housework substitute ThriftFooter.cs ../../build.ini -r
housework substitute ./src/Parquet/ThriftFooter.cs -s build.ini -r
cd ../..
dotnet restore src/Parquet.sln
build:
project: src/Parquet.sln
verbosity: minimal
test_script:
- cmd: >-
dotnet test src\Parquet.Test -c release
dotnet test src\Parquet.Json.Test -c release
artifacts:
- path: src/Parquet/bin/**/*.nupkg
- path: src/**/*.nupkg
deploy: off
@@ -1,17 +1,18 @@
VersionMajor=3
VersionMinor=0
VersionPatch=5
VersionMinor=1
VersionPatch=0
BuildNo=%APPVEYOR_BUILD_NUMBER%
CiVersion=%VersionMajor%.%VersionMinor%.%VersionPatch%.%APPVEYOR_REPO_BRANCH%-%BuildNo%

;NuGet specific
;Version=%VersionMajor%.%VersionMinor%.%VersionPatch%-preview-%BuildNo%
Version=%VersionMajor%.%VersionMinor%.%VersionPatch%
FileVersion=%VersionMajor%.%VersionMinor%.%VersionPatch%.%BuildNo%
AssemblyVersion=%VersionMajor%.0.0.0
Copyright=Copyright (c) 2017-%date:yyyy% by Elastacloud Ltd.
PackageIconUrl=http://i.isolineltd.com/nuget/parquet.png
PackageProjectUrl=https://github.com/elastacloud/parquet-dotnet
RepositoryUrl=https://github.com/elastacloud/parquet-dotnet
Authors=Ivan Gavryliuk (@aloneguid); Richard Conway (@azurecoder)
Authors=Ivan Gavryliuk (@aloneguid); Richard Conway (@azurecoder); Andy Cross (@andyelastacloud)
PackageLicenseUrl=https://github.com/elastacloud/parquet-dotnet/blob/master/LICENSE
RepositoryType=GitHub
BIN +50.1 KB doc/diagrams.vsdx (binary files not shown)
@@ -0,0 +1,25 @@
# PARQ (Global Tool)

Since v3.1 the parquet repository includes an amazing [.NET Core Global Tool](https://docs.microsoft.com/en-us/dotnet/core/tools/global-tools) called **parq** which serves as a first-class command-line client for performing various functions on parquet files.

## Installing

Installing is super easy with *global tools*: just go to the terminal, type `dotnet tool install -g parq`, and it's done. Note that you need at least the **.NET Core 2.1 SDK** installed on your machine, which you probably have as a hero .NET developer.

## Commands

### Viewing Schema

To view the schema, type

```powershell
parq schema <path-to-file>
```

which produces an output similar to:

![Parq Schema](img/parq-schema.png)

### More Commands

More commands are coming soon; please leave a comment in the issue tracker telling us what you would like to see next.
@@ -0,0 +1,187 @@
# Row Based Access

Parquet, of course, is a columnar format and doesn't store data in rows. However, row-wise access is sometimes essential, both in processing algorithms and when displaying data to a user; we as humans understand rows better than columns.

Parquet.Net provides out-of-the-box helpers to represent data in row format; however, before using them consider the following:

- Can you avoid row-based access? If yes, don't use it.
- Row-based helpers add a lot of overhead on top of parquet data, as columns need to be transformed into rows on the fly internally, and this cannot be done in a performant way.
- If your data access code is slow, row-based access is a likely culprit.
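The overhead these points warn about can be pictured with a minimal sketch in plain C# (no Parquet.Net types involved; the class and method names here are purely illustrative): pivoting typed columns into rows forces every value-type cell through an `object` box.

```csharp
using System.Collections.Generic;

// Plain C# only - no Parquet.Net types. Each typed column value is copied
// into an object[] cell, so value types like int are boxed along the way.
public static class ColumnsToRowsSketch
{
   public static List<object[]> ToRows(int[] ids, string[] cities)
   {
      var rows = new List<object[]>(ids.Length);
      for (int i = 0; i < ids.Length; i++)
      {
         rows.Add(new object[] { ids[i], cities[i] }); // ids[i] is boxed here
      }
      return rows;
   }
}
```

Every cell is touched once on the way in and unboxed again on the way out, which is why column-based access is always cheaper.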

## Table

A `Table` is the root of the row-based hierarchy. A table is simply a collection of `Row`s and implements the `IList<Row>` interface, meaning you can perform any operation you normally would on an `IList<T>` in .NET. A row is just a collection of untyped objects:

![Rows General](img/rows-general.png)

## Row

A `Row` is the central structure holding data during row-based access. Essentially, a row is an array of untyped objects. The fact that a row holds untyped objects *adds a performance penalty to working with rows and tables* throughout Parquet.Net, because every data cell needs to be boxed/unboxed on reading and writing. If you can work with *column-based data*, please don't use row-based access at all. However, if you absolutely need it, these helpers are still better than writing your own.

**Everything in a parquet file can be represented as a set of rows**, including plain flat data, arrays, maps, lists and structures.
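Because a row is just an array of untyped objects, nesting falls out naturally: a cell can hold a plain value, another row (a structure), or a list of rows (a map or a list). A hypothetical miniature of that shape in plain C# (this is *not* the library's actual `Row` class):

```csharp
using System.Collections.Generic;

// Hypothetical miniature of the rows model - not the library's Row class.
// A "row" is only an object[], so a cell can hold a plain value, another
// row (a structure), or a list of rows (a map or a list).
public static class NestedRowSketch
{
   public static object[] CityWithPopulation()
   {
      return new object[]
      {
         "London",                       // plain value cell
         new List<object[]>              // nested "map" cell
         {
            new object[] { 234, 100L },  // areaId -> count
            new object[] { 235, 110L }
         }
      };
   }
}
```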

## Flat Data

Representing flat data is the most obvious case: you simply create a row where each element is a cell value. For instance, let's say you need to store a list of cities with their ids, looking similar to this:

|id|city|
|--|----|
|1|London|
|2|New York|

The corresponding code to create a table with rows is:

```csharp
var table = new Table(
   new Schema(
      new DataField<int>("id"),
      new DataField<string>("city")));
table.Add(new Row(1, "London"));
table.Add(new Row(2, "New York"));
```

Both `ParquetReader` and `ParquetWriter` have plenty of extension methods to read and write tables.
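For illustration, a full round trip through a memory stream looks roughly like the sketch below, reusing the `table` from the example above. It assumes the `Write(Table)` and `ReadAsTable()` extension methods added in this release's `ParquetExtensions.cs`; treat the exact names as assumptions and check the current API surface before copying.

```csharp
using System.IO;
using Parquet;
using Parquet.Data.Rows;

// Sketch only: Write(Table) and ReadAsTable() are assumed extension methods.
using (var ms = new MemoryStream())
{
   using (var writer = new ParquetWriter(table.Schema, ms))
   {
      writer.Write(table); // rows are converted to columns internally
   }

   ms.Position = 0;
   using (var reader = new ParquetReader(ms))
   {
      Table table2 = reader.ReadAsTable(); // columns converted back to rows
   }
}
```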

## Arrays (Repeatable fields)

Parquet has an option to store an array of values in a single cell, sometimes called a *repeatable field*. With row-based access you can simply add an array to each cell. For instance, let's say you need to create the following table:

|ids|
|---|
|1,2,3|
|4,5,6|

The corresponding code to populate this table is:

```csharp
var table = new Table(
   new Schema(
      new DataField<IEnumerable<int>>("ids")));
table.Add(new Row(new[] { 1, 2, 3 }));
table.Add(new Row(new[] { 4, 5, 6 }));
```

## Dictionaries (Maps)

Suppose you declare the following schema:

```csharp
var schema = new Schema(
   new DataField<string>("city"),
   new MapField("population",
      new DataField<int>("areaId"),
      new DataField<long>("count")));
```

and you need to write a row that has *London* as the city and a *population* map of *234 => 100, 235 => 110*.

The table should look like:

| |Column 0|Column 1|
|---------|--------|--------|
|**Row 0**|London|`List<Row>`|

where the last cell is the data for your map. As we're in the row-based world, this needs to be represented as a list of rows as well:

| |Column 0|Column 1|
|---------|--------|--------|
|**Row 0**|234|100|
|**Row 1**|235|110|

To express this in code:

```csharp
table.Add("London",
   new List<Row>
   {
      new Row(234, 100L),
      new Row(235, 110L)
   });
```

## Structures

Structures are again represented as `Row` objects. When you read or write a structure, it is embedded into another row's value as a nested row. To demonstrate, the following schema

```csharp
var table = new Table(
   new Schema(
      new DataField<string>("isbn"),
      new StructField("author",
         new DataField<string>("firstName"),
         new DataField<string>("lastName"))));
```

represents a table with two columns - *isbn* and *author* - where *author* is a structure of two fields - *firstName* and *lastName*. To add the following data into the table

|isbn|author|
|----|------|
|12345-6|Ivan; Gavryliuk|
|12345-7|Richard; Conway|

you would write:

```csharp
table.Add(new Row("12345-6", new Row("Ivan", "Gavryliuk")));
table.Add(new Row("12345-7", new Row("Richard", "Conway")));
```

## Lists

Lists are easily confused with repeatable fields, because they also essentially repeat some data in a cell. That is true for simple data types like string, int etc.; however, lists are special in that a list item can be anything, not just a plain data type. In general, *when repeated data can be represented as a plain type, always use a repeatable field*: repeatable fields are lighter and faster than lists, which carry extra serialisation and performance overhead.

### Simple Lists

In simple cases, when a list contains a single data element, it is mapped to a collection of those elements. For instance, the following schema

```csharp
var table = new Table(
   new Schema(
      new DataField<int>("id"),
      new ListField("cities",
         new DataField<string>("name"))));
```

and the following set of data:

|id|cities|
|--|------|
|1|London, Derby|
|2|Paris, New York|

can be represented in code as:

```csharp
table.Add(1, new[] { "London", "Derby" });
table.Add(2, new[] { "Paris", "New York" });
```

As you can see, it's no different from repeatable fields (in this case a repeatable string); however, it will perform much slower because the transformation costs are higher.

### Lists of Structures

A more complicated use case, where lists actually make sense, is lists of structures (although lists can contain any subclass of `Field`). Let's say you have the following schema definition:

```csharp
var t = new Table(
   new DataField<int>("id"),
   new ListField("structs",
      new StructField("mystruct",
         new DataField<int>("id"),
         new DataField<string>("name"))));
```

and would like to add the following data:

|id|structs|
|--|-------|
|1|id: 1, name: Joe; id: 2, name: Bloggs|
|2|id: 3, name: Star; id: 4, name: Wars|

which essentially creates a list of structures with two fields - *id* and *name* - in a single table cell. To add the data to the table:

```csharp
t.Add(1, new[] { new Row(1, "Joe"), new Row(2, "Bloggs") });
t.Add(2, new[] { new Row(3, "Star"), new Row(4, "Wars") });
```
@@ -0,0 +1,19 @@
using System;

namespace Parquet.CLI.Commands
{
   class ConvertToJsonCommand : FileInputCommand
   {
      private const ConsoleColor BracketColor = ConsoleColor.Yellow;
      private const ConsoleColor NameColor = ConsoleColor.DarkGray;

      public ConvertToJsonCommand(string path) : base(path)
      {
      }

      public void Execute()
      {
      }
   }
}
