csvflatten prettify is now default
dannguyen committed Dec 23, 2020
1 parent e4b2d03 commit 1e33aca
Showing 17 changed files with 1,140 additions and 830 deletions.
127 changes: 122 additions & 5 deletions TODOS.md
@@ -3,21 +3,132 @@

## 0.0.9.14

**thoughts 2020-12-11**

while working on a data project, wondered whether:

- csvflatten
- [x] --prettify should be default
- [ ] option to replace record separator with empty row
- csvnorm
- [ ] should have the --max-length option, not csvflatten


**thoughts 2020-11-24**
- terms across frameworks
- csvpivot:
- long data to wide data
- pivot_wider == tidyr.spread == reshape.cast/dcast
- csvmelt: pivot_longer == tidyr.gather == reshape.melt
- wide data to long data
- a csvmelt (i.e. unpivot) would be super useful, especially for real-world examples
- for now, link to other resources that explain pivot tables and wide data. don't write my own guide
- pandas.pivot and reshape.cast/tidyr/pivot_wider refer to an `index` argument rather than rows
- do i want to follow that, or stick with spreadsheet conventions?
- very good reshape2 guide: https://seananderson.ca/2013/10/19/reshape/

```R
melt(airquality, id.vars = c("month", "day"))

# from:
## ozone solar.r wind temp month day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3

# to:

## month day variable value
## 1 5 1 ozone 41
## 2 5 2 ozone 36
## 3 5 3 ozone 12
## 4 5 4 ozone 18
## 5 5 5 ozone NA
## 6 5 6 ozone 28
```
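
For comparison, a rough Python sketch of the same melt using only the standard library. `melt_rows` and its parameters are hypothetical names for illustration (not part of csvmedkit, pandas, or reshape2), and the sample rows are the first three shown above. Values stay as strings because `csv.DictReader` does no type inference.

```python
import csv
import io


def melt_rows(rows, id_vars, var_name="variable", value_name="value"):
    """Unpivot wide rows (dicts) into long rows, one per measured column.

    Hypothetical helper: every column not listed in id_vars is treated
    as a measured variable.
    """
    rows = list(rows)
    if not rows:
        return []
    measured = [col for col in rows[0] if col not in id_vars]
    long_rows = []
    # Iterate column-first so the output is grouped by variable,
    # matching the ordering in the reshape2 output above.
    for col in measured:
        for row in rows:
            out = {key: row[key] for key in id_vars}
            out[var_name] = col
            out[value_name] = row[col]
            long_rows.append(out)
    return long_rows


WIDE_CSV = """ozone,solar.r,wind,temp,month,day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
12,149,12.6,74,5,3
"""

wide = list(csv.DictReader(io.StringIO(WIDE_CSV)))
long_rows = melt_rows(wide, id_vars=["month", "day"])
for row in long_rows[:3]:
    print(row)
```

Three wide rows with four measured columns become twelve long rows, which is the shape a csvmelt would need to emit.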

**general documentation**
- given how lengthy usage overview is for csvpivot, maybe every tool should have a Quickstart?
- [ ] wrote a basic one for csvpivot
- [ ] do it for csvslice
- write a top level tutorial like csvkit?
- https://csvkit.readthedocs.io/en/latest/tutorial.html#

- Getting started congress.csv
- combine with csvlook and csvjoin
- csvflatten to view data
- csvsed to replace values
- data exploration: narrative data like env inspects
- csvslice + csvflatten
- data wrangling: census
- csvheader to replace header with custom names
- csvslice to cut out metadata
- to and from with csvsqlite and in2csv
- LESO data stack

- **get inspiration from tidyverse**
- method reference page: https://tidyr.tidyverse.org/reference/pivot_wider.html#details
- simple, elegant, with just 4 content headers: Arguments, Details, See Also, and Examples
- Details is just a short graf, giving background and relation to other methods
- See Also is a single line and link: pivot_wider_spec() to pivot "by hand" with a data frame that defines a pivotting specification.
- pivot_wider spec: https://tidyr.tidyverse.org/reference/pivot_wider_spec.html
- Intro graf includes "Learn more in vignette("pivot")", which links to a different page with more text and elaboration:
- https://tidyr.tidyverse.org/articles/pivot.html
- ggplot2 is good too: https://ggplot2.tidyverse.org/reference/geom_bar.html

**csvflatten**

- --prettify as the default, in the way that csvstat has a --csv option: https://csvkit.readthedocs.io/en/latest/scripts/csvstat.html

**csvpivot**


**pivot readings**

Read pandas docs on pandas.pivot and DataFrame.pivot_table
- pandas.pivot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html
- DataFrame.pivot_table: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table
- Described as `Create a spreadsheet-style pivot table as a DataFrame.`

Reshape2 (Hadley Wickham's general reshaping library)
- tidyr vs reshape2: https://jtr13.github.io/spring19/hx2259_qz2351.html
- reshape2 does aggregation (like csvpivot), whereas tidyr does not
- journal article: Reshaping Data with the reshape Package
- https://www.jstatsoft.org/article/view/v021i12
- study the theory/context sections, e.g. Conceptual Framework
- 4. Casting molten data
- study how example data is shown, then referred to in each example

Read tidyverse's article on pivoting:
- Main https://tidyr.tidyverse.org/articles/pivot.html
- Wider section: https://tidyr.tidyverse.org/articles/pivot.html#wider
- Nutgraf

> pivot_wider() is the opposite of pivot_longer(): it makes a dataset wider by increasing the number of columns and decreasing the number of rows. It’s relatively rare to need pivot_wider() to make tidy data, but it’s often useful for creating summary tables for presentation, or data in a format needed by other tools.
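
A minimal sketch of the long-to-wide direction described in the quote, in plain Python. `pivot_wider` here is a hypothetical helper, not the tidyr function; the `names_from`/`values_from` parameter names are borrowed from tidyr and `index` from pandas purely for familiarity. It assumes each index/name pair appears at most once and raises otherwise, since it does no aggregation.

```python
def pivot_wider(rows, index, names_from, values_from):
    """Spread long rows (dicts) into wide rows keyed by the index columns."""
    records = {}      # index-key tuple -> {new column name: value}
    new_columns = []  # new column names, in order of first appearance
    for row in rows:
        key = tuple(row[col] for col in index)
        cells = records.setdefault(key, {})
        name = row[names_from]
        if name not in new_columns:
            new_columns.append(name)
        if name in cells:
            # Assumption: duplicates are an error here; a real tool
            # would need an aggregation step instead.
            raise ValueError(f"duplicate value for {key!r} / {name!r}")
        cells[name] = row[values_from]

    wide_rows = []
    for key, cells in records.items():
        wide_row = dict(zip(index, key))
        for name in new_columns:
            wide_row[name] = cells.get(name)  # None when a cell is missing
        wide_rows.append(wide_row)
    return wide_rows


long_rows = [
    {"month": 5, "day": 1, "variable": "ozone", "value": 41},
    {"month": 5, "day": 2, "variable": "ozone", "value": 36},
    {"month": 5, "day": 1, "variable": "temp", "value": 67},
    {"month": 5, "day": 2, "variable": "temp", "value": 72},
]

wide_rows = pivot_wider(
    long_rows, index=["month", "day"], names_from="variable", values_from="value"
)
for row in wide_rows:
    print(row)
```

Four long rows collapse into two wide rows: more columns, fewer rows.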

**R guides**

- An introduction to reshape https://seananderson.ca/2013/10/19/reshape/
- 'What makes data wide or long?'
- very well formatted and written guide

- https://ademos.people.uic.edu/Chapter8.html
- reshape.cast is the equivalent to a Pivot:
> Casting will transform long format back into wide format. This will, essentially, make your data look as it did in the beginning (or in any other way you’d prefer).


- cli
- simplify command-line opts to '--column' and '--rows' from '--pivot-column' and '--pivot-rows'

- [ ] documentation
- terminology
- look at tidyverse writeup for pivot_wider
- https://tidyr.tidyverse.org/reference/pivot_wider.html
- *`pivot_wider()` "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is `pivot_longer()`.*
- look at tidyverse writeup for spread() (spread is now deprecated):
- https://rstudio-pubs-static.s3.amazonaws.com/282405_e280f5f0073544d7be417cde893d78d0.html
- "key: The column you want to split apart (Field)"
- usage overview
- [ ] simple row count
- [ ] multiple row count
@@ -35,8 +146,10 @@
- [ ] write options/flags section
- [ ] write comparison section
- [ ] write scenarios/use-cases


- other references about Pivot Tables
- Pivot Tables in Google Sheets: A Beginner’s Guide: https://www.benlcollins.com/spreadsheets/pivot-tables-google-sheets/#one
- https://business.tutsplus.com/tutorials/how-to-use-pivot-tables-in-google-sheets--cms-28887
- https://support.microsoft.com/en-us/office/create-a-pivottable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576

**csvslice**

@@ -132,9 +245,13 @@ Overall stuff
## 0.2


- csvmelt:
- csvmelt/csvgather:
- pandas uses "melt()" to refer to an "unpivot" https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
- >This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

- r lang: https://www.rdocumentation.org/packages/reshape2/versions/1.4.4/topics/melt
- tidyr: gather/pivot_longer https://tidyr.tidyverse.org/reference/pivot_longer.html
-
- tidyverse https://uc-r.github.io/tidyr#gather
- pt.normalize('gender', ['white', 'black', 'asian', 'latino'])
| gender | property | value |
| ------ | -------- | ----- |
2 changes: 1 addition & 1 deletion csvmedkit/__about__.py
@@ -1,7 +1,7 @@
 __title__ = "csvmedkit"
 __description__ = """The unofficial extended family of csvkit, i.e. even more tools for command-line data parsing and wrangling"""
 __url__ = "https://github.com/dannguyen/csvmedkit"
-__version__ = "0.0.9.13"
+__version__ = "0.0.9.14"
 __short_version__ = __version__.split("-")[0]
 __author__ = "Dan Nguyen"
 __author_email__ = "dansonguyen@gmail.com"
28 changes: 15 additions & 13 deletions csvmedkit/utils/csvflatten.py
@@ -25,7 +25,7 @@
 }

 FLAT_COL_PADDING = 4
-FLAT_COL_WIDTH = len("field") + FLAT_COL_PADDING # e.g. '| field |' and '| recid |'
+FLAT_COL_WIDTH = len("field") + FLAT_COL_PADDING


 class CSVFlatten(UniformReader, CmkUtil):
@@ -35,12 +35,13 @@ class CSVFlatten(UniformReader, CmkUtil):
     override_flags = ["l"]

     def add_arguments(self):
+
         self.argparser.add_argument(
-            "-P",
-            "--prettify",
-            dest="prettify",
+            "-c",
+            "--csv",
+            dest="csvify",
             action="store_true",
-            help="""Print output in Markdown tabular format instead of CSV""",
+            help="""Print output in CSV format""",
         )

         self.argparser.add_argument(
@@ -123,8 +124,8 @@ def rec_ids_mode(self):
         return self.args.rec_ids_mode

     @property
-    def prettify(self):
-        return self.args.prettify
+    def csvify(self):
+        return self.args.csvify

     def read_input(self):
         self._rows = agate.csv.reader(self.skip_lines(), **self.reader_kwargs)
Expand Down Expand Up @@ -158,7 +159,7 @@ def main(self):
self.args.max_field_length or self.args.max_field_length == 0
): # 0 is considered to be infinite/no-wrap
self.max_field_length = self.args.max_field_length
elif self.prettify and not self.args.max_field_length:
elif not self.csvify and not self.args.max_field_length:
# user wants it pretty but didn't specify a max_field_length, so we automatically figure it out
# TODO: this is ugly
termwidth = get_terminal_size().columns
@@ -204,7 +205,12 @@ def main(self):
)
outrows.append(o_row + [fieldname, chunk])

-        if self.prettify:
+        if self.csvify:
+            writer = agate.csv.writer(self.output_file, **self.writer_kwargs)
+            writer.writerow(self.output_flat_column_names)
+            writer.writerows(outrows)
+
+        else:
             outtable = agate.Table(
                 outrows,
                 column_names=self.output_flat_column_names,
@@ -217,10 +223,6 @@ def main(self):
                 max_rows=None,
                 max_columns=None,
             )
-        else:
-            writer = agate.csv.writer(self.output_file, **self.writer_kwargs)
-            writer.writerow(self.output_flat_column_names)
-            writer.writerows(outrows)


 def launch_new_instance():
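
The flag inversion in this file can be sketched roughly as follows with plain `argparse`: the pretty table is what you get with no flags, and `-c/--csv` opts out of it. This is a stand-in under stated assumptions, not the csvmedkit code: `build_parser` and `render` are hypothetical names, and the Markdown-style table is hand-rolled rather than produced by agate's `print_table`.

```python
import argparse
import csv
import io


def build_parser():
    """Parser with an opt-in CSV flag; pretty output is the default."""
    parser = argparse.ArgumentParser(prog="flatten-sketch")
    parser.add_argument(
        "-c",
        "--csv",
        dest="csvify",
        action="store_true",  # defaults to False, i.e. pretty output
        help="Print output in CSV format",
    )
    return parser


def render(rows, header, csvify=False):
    """Return the rows as CSV text, or as a padded pipe-delimited table."""
    if csvify:
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
        return out.getvalue()

    # Hand-rolled table (assumption: stand-in for the real pretty printer).
    widths = [max(len(str(cell)) for cell in col) for col in zip(header, *rows)]

    def fmt(cells):
        padded = (str(c).ljust(w) for c, w in zip(cells, widths))
        return "| " + " | ".join(padded) + " |"

    lines = [fmt(header), fmt(["-" * w for w in widths])]
    lines += [fmt(row) for row in rows]
    return "\n".join(lines) + "\n"


header = ["field", "value"]
rows = [["act", "1"], ["speaker", "Horatio"]]

args = build_parser().parse_args([])  # no flags: pretty table
print(render(rows, header, csvify=args.csvify), end="")

args = build_parser().parse_args(["--csv"])  # opt in to CSV
print(render(rows, header, csvify=args.csvify), end="")
```

Using a `store_true` flag for the non-default mode keeps the common case flag-free, which is the same trade-off csvstat makes with its `--csv` option.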
1 change: 0 additions & 1 deletion csvmedkit/utils/csvsed.py
@@ -11,7 +11,6 @@

class Parser:
description = """Replaces all instances of [PATTERN] with [REPL]"""

override_flags = [
"f",
]
147 changes: 147 additions & 0 deletions docs/utils/csvflatten/comparison.rstinc
@@ -0,0 +1,147 @@


How it compares to existing tools
=================================


Compared to csvkit's ``csvlook``
--------------------------------


`csvlook <https://csvkit.readthedocs.io/en/latest/scripts/csvlook.html>`_ doesn't pretty-format multi-line fields, and can also result in very wide tables without ``--max-column-width``::

$ csvlook examples/hamlet.csv --max-column-width 50

| act | scene | speaker | lines |
| --- | ----- | -------- | -------------------------------------------------- |
| 1 | 5 | Horatio | Propose the oath, my lord. |
| 1 | 5 | Hamlet | Never to speak of this that you have seen,
Swea... |
| 1 | 5 | Ghost | [Beneath] Swear. |
| 3 | 4 | Gertrude | O, speak to me no more;
These words, like dagge... |
| 4 | 7 | Laertes | Know you the hand? |


Compared to ``xsv flatten``
---------------------------

`xsv flatten <https://github.com/BurntSushi/xsv#available-commands>`_ does do auto-wrapping of long entries, but doesn't produce tableized output::

$ xsv flatten examples/hamlet.csv

act 1
scene 5
speaker Horatio
lines Propose the oath, my lord.
#
act 1
scene 5
speaker Hamlet
lines Never to speak of this that you have seen,
Swear by my sword.
#
act 1
scene 5
speaker Ghost
lines [Beneath] Swear.
#
act 3
scene 4
speaker Gertrude
lines O, speak to me no more;
These words, like daggers, enter in mine ears;
No more, sweet Hamlet!
#
act 4
scene 7
speaker Laertes
lines Know you the hand?


Compared to ``tabulate``
------------------------

`python-tabulate <https://pypi.org/project/tabulate/>`_ is a command-line tool for producing a variety of tabular outputs, including ``rst``, ``grid``, and ``html`` formats. However, it does not handle multi-line fields well, nor does it natively parse CSV (e.g. double-quoted values that contain commas). Hence the use of csvkit's `csvformat <https://csvkit.readthedocs.io/en/latest/scripts/csvformat.html>`_ to change delimiters to ``\t`` in the example below::



$ csvformat -T examples/hamlet.csv | tabulate -f grid -1 -s '\t'

+------------------------------------------------+---------+-----------+---------------------------------------------+
| act | scene | speaker | lines |
+================================================+=========+===========+=============================================+
| 1 | 5 | Horatio | Propose the oath, my lord. |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 1 | 5 | Hamlet | "Never to speak of this that you have seen, |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| Swear by my sword." | | | |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 1 | 5 | Ghost | [Beneath] Swear. |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 3 | 4 | Gertrude | "O, speak to me no more; |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| These words, like daggers, enter in mine ears; | | | |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| No more, sweet Hamlet!" | | | |
+------------------------------------------------+---------+-----------+---------------------------------------------+
| 4 | 7 | Laertes | Know you the hand? |
+------------------------------------------------+---------+-----------+---------------------------------------------+


That said, if you like ``tabulate``'s table-formatting options, such as ``-f grid``, you can pipe :command:`csvflatten` (and :command:`csvformat` to convert to tab delimiters) into ``tabulate`` like so::


$ csvflatten --eor 'none' examples/hamlet.csv | csvformat -T \
| tabulate -f grid -1 -s '\t'

+---------+------------------------------------------------+
| field | value |
+=========+================================================+
| act | 1 |
+---------+------------------------------------------------+
| scene | 5 |
+---------+------------------------------------------------+
| speaker | Horatio |
+---------+------------------------------------------------+
| lines | Propose the oath, my lord. |
+---------+------------------------------------------------+
| act | 1 |
+---------+------------------------------------------------+
| scene | 5 |
+---------+------------------------------------------------+
| speaker | Hamlet |
+---------+------------------------------------------------+
| lines | Never to speak of this that you have seen, |
+---------+------------------------------------------------+
| | Swear by my sword. |
+---------+------------------------------------------------+
| act | 1 |
+---------+------------------------------------------------+
| scene | 5 |
+---------+------------------------------------------------+
| speaker | Ghost |
+---------+------------------------------------------------+
| lines | [Beneath] Swear. |
+---------+------------------------------------------------+
| act | 3 |
+---------+------------------------------------------------+
| scene | 4 |
+---------+------------------------------------------------+
| speaker | Gertrude |
+---------+------------------------------------------------+
| lines | O, speak to me no more; |
+---------+------------------------------------------------+
| | These words, like daggers, enter in mine ears; |
+---------+------------------------------------------------+
| | No more, sweet Hamlet! |
+---------+------------------------------------------------+
| act | 4 |
+---------+------------------------------------------------+
| scene | 7 |
+---------+------------------------------------------------+
| speaker | Laertes |
+---------+------------------------------------------------+
| lines | Know you the hand? |
+---------+------------------------------------------------+
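
The field/value layout above can be approximated in a few lines of standard-library Python. This is only a sketch of the idea, not how ``csvflatten`` is implemented: ``flatten_records`` is a hypothetical helper, it emits no record separators, and it does not wrap long lines.

```python
import csv
import io


def flatten_records(rows, header):
    """Return [field, value] pairs, one per line of each cell.

    Continuation lines of a multi-line value get an empty field name,
    mirroring the blank cells in the grid above.
    """
    flat = []
    for row in rows:
        for fieldname, value in zip(header, row):
            lines = value.splitlines() or [""]  # keep empty cells as one row
            for i, line in enumerate(lines):
                flat.append([fieldname if i == 0 else "", line])
    return flat


SAMPLE = '''act,scene,speaker,lines
1,5,Hamlet,"Never to speak of this that you have seen,
Swear by my sword."
'''

reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
flat = flatten_records(reader, header)
for fieldname, value in flat:
    print(f"{fieldname:<8}{value}")
```

Because the CSV parsing happens before the split, the quoted multi-line field stays attached to its ``lines`` column instead of spilling into a new record.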
