csvflatten prettify is now default

dannguyen · Dec 23, 2020 · 1e33aca
1 parent e4b2d03

Showing 17 changed files with 1,140 additions and 830 deletions.
diff --git a/TODOS.md b/TODOS.md
@@ -3,21 +3,132 @@
 
 ## 0.0.9.14
 
+**thoughts 2020-12-11**
+
+while working on a data project, wondered whether:
+
+- csvflatten 
+    - [x] --prettify should be default
+    - [ ] option to replace record separator with empty row
+- csvnorm
+    - [ ] should have the --max-length option, not csvflatten
+
+
+**thoughts 2020-11-24**
+- terms across frameworks
+    - csvpivot: 
+        - long data to wide data
+        - pivot_wider == tidyr.spread == reshape.cast/dcast 
+    - csvmelt: pivot_longer == tidyr.gather == reshape.melt
+        - wide data to long data
+- a csvmelt (i.e. unpivot) would be super useful, especially for real-world examples
+- for now, link to other resources that explain pivot tables and wide data. don't write my own guide
+- pandas.pivot and reshape.cast/tidyr/pivot_wider refer to an `index` argument rather than rows
+    - do i want to follow that, or stick with spreadsheet conventions?
+- very good reshape2 guide: https://seananderson.ca/2013/10/19/reshape/
+
+    ```R
+    melt(airquality, id.vars = c("month", "day"))
+
+    # from:
+        ##   ozone solar.r wind temp month day
+        ## 1    41     190  7.4   67     5   1
+        ## 2    36     118  8.0   72     5   2
+        ## 3    12     149 12.6   74     5   3
+
+    # to:
+
+        ##   month day variable value
+        ## 1     5   1    ozone    41
+        ## 2     5   2    ozone    36
+        ## 3     5   3    ozone    12
+        ## 4     5   4    ozone    18
+        ## 5     5   5    ozone    NA
+        ## 6     5   6    ozone    28
+    ```
 
 **general documentation**
 - given how lengthy usage overview is for csvpivot, maybe every tool should have a Quickstart?
     - [ ] wrote a basic one for csvpivot
     - [ ] do it for csvslice
 - write a top level tutorial like csvkit?
     - https://csvkit.readthedocs.io/en/latest/tutorial.html#
-
+    - Getting started congress.csv
+        - combine with csvlook and csvjoin
+        - csvflatten to view data
+        - csvsed to replace values
+    - data exploration: narrative data like env inspects
+        - csvslice + csvflatten
+    - data wrangling: census
+        - csvheader to replace header with custom names
+        - csvslice to cut out metadata
+    - to and from with csvsqlite and in2csv
+        - LESO data stack
+
+- **get inspiration from tidyverse**
+    - method reference page: https://tidyr.tidyverse.org/reference/pivot_wider.html#details
+        - simple, elegant, with just 4 content headers: Arguments, Details, See Also, and Examples
+        - Details is just a short graf, giving background and relation to other methods
+        - See Also is a single line and link: pivot_wider_spec() to pivot "by hand" with a data frame that defines a pivotting specification.
+            - pivot_wider spec: https://tidyr.tidyverse.org/reference/pivot_wider_spec.html
+        - Intro graf includes "Learn more in vignette("pivot")", which links to a different page with more text and elaboration: 
+         - https://tidyr.tidyverse.org/articles/pivot.html
+    - ggplot2 is good too: https://ggplot2.tidyverse.org/reference/geom_bar.html
+
+**csvflatten**
+
+- --prettify as the default, in the way that csvstat has a --csv option: https://csvkit.readthedocs.io/en/latest/scripts/csvstat.html
 
 **csvpivot**
 
+
+**pivot readings**
+
+Read pandas docs on pandas.pivot and DataFrame.pivot_table
+- pandas.pivot: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot.html
+- DataFrame.pivot_table: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot_table.html#pandas.DataFrame.pivot_table
+    - Described as `Create a spreadsheet-style pivot table as a DataFrame.`
+
+Reshape2 (Hadley Wickham's general reshaping library)
+- tidyr vs reshape2: https://jtr13.github.io/spring19/hx2259_qz2351.html
+    - reshape2 does aggregation (like csvpivot), whereas tidyr does not
+- journal article: Reshaping Data with the reshape Package
+    - https://www.jstatsoft.org/article/view/v021i12
+    - study the theory/context sections, e.g. Conceptual Framework
+    - 4. Casting molten data
+    - study how example data is shown, then referred to in each example
+
+Read tidyverse's article on pivoting: 
+- Main https://tidyr.tidyverse.org/articles/pivot.html
+- Wider section: https://tidyr.tidyverse.org/articles/pivot.html#wider
+- Nutgraf
+
+    > pivot_wider() is the opposite of pivot_longer(): it makes a dataset wider by increasing the number of columns and decreasing the number of rows. It’s relatively rare to need pivot_wider() to make tidy data, but it’s often useful for creating summary tables for presentation, or data in a format needed by other tools.
+
+
+**R guides**
+
+- An introduction to reshape https://seananderson.ca/2013/10/19/reshape/
+    - 'What makes data wide or long?'
+    - very well formatted and written guide
+
+- https://ademos.people.uic.edu/Chapter8.html
+    - reshape.cast is the equivalent to a Pivot: 
+    > Casting will transform long format back into wide format. This will, essentially, make your data look as it did in the beginning (or in any other way you’d prefer).
+
+
+
 - cli
     - simplify command-line opts to '--column' and '--rows' from '--pivot-column' and '--pivot-rows'
 
 - [ ] documentation
+    - terminology
+        - look at tidyverse writeup for pivot_wider
+            - https://tidyr.tidyverse.org/reference/pivot_wider.html
+            - *`pivot_wider() "widens" data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().`*
+        - look at tidyverse writeup for spread() (spread is now deprecated): 
+            - https://rstudio-pubs-static.s3.amazonaws.com/282405_e280f5f0073544d7be417cde893d78d0.html
+            - "key: The column you want to split apart (Field)"
     - usage overview
         - [ ] simple row count
         - [ ] multiple row count
@@ -35,8 +146,10 @@
     - [ ] write options/flags section
     - [ ] write comparison section
     - [ ] write scenarios/use-cases 
-
-
+    - other references about Pivot Tables
+        - Pivot Tables in Google Sheets: A Beginner’s Guide: https://www.benlcollins.com/spreadsheets/pivot-tables-google-sheets/#one
+        - https://business.tutsplus.com/tutorials/how-to-use-pivot-tables-in-google-sheets--cms-28887
+        - https://support.microsoft.com/en-us/office/create-a-pivottable-to-analyze-worksheet-data-a9a84538-bfe9-40a9-a8e9-f99134456576
 
 **csvslice**
 
@@ -132,9 +245,13 @@ Overall stuff
 ## 0.2
 
 
-- csvmelt:
+- csvmelt/csvgather:
+    - pandas uses "melt()" to refer to an "unpivot" https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
+        - >This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
+
+    - r lang: https://www.rdocumentation.org/packages/reshape2/versions/1.4.4/topics/melt
     - tidyr: gather/pivot_longer https://tidyr.tidyverse.org/reference/pivot_longer.html
-    - 
+    - tidyverse https://uc-r.github.io/tidyr#gather
     - pt.normalize('gender', ['white', 'black', 'asian', 'latino'])
     | gender | property | value |
     | ------ | -------- | ----- |

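The csvmelt/csvgather notes above describe an "unpivot": identifier columns are kept, and every other column becomes a (variable, value) pair. A minimal sketch of that idea in plain Python follows. The `melt` function and its parameter names are hypothetical illustrations borrowed from the pandas/reshape2 vocabulary quoted in the notes; this is not csvmedkit code. The sample rows are the first two airquality rows from the reshape2 example in TODOS.md.

```python
import csv
import io


def melt(rows, id_vars, var_name="variable", value_name="value"):
    """Unpivot wide rows (dicts) into long rows, one per non-identifier column.

    Hypothetical helper for illustration only; not part of csvmedkit.
    """
    long_rows = []
    for row in rows:
        # Identifier columns are copied onto every output row.
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column in id_vars:
                continue
            long_rows.append({**ids, var_name: column, value_name: value})
    return long_rows


WIDE_CSV = """ozone,solar.r,wind,temp,month,day
41,190,7.4,67,5,1
36,118,8.0,72,5,2
"""

wide = list(csv.DictReader(io.StringIO(WIDE_CSV)))
long_rows = melt(wide, id_vars=["month", "day"])
```

Note that this sketch emits rows record by record, whereas the reshape2 output shown in the notes groups all rows for one variable together; the set of rows is the same, only the ordering differs.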
diff --git a/csvmedkit/__about__.py b/csvmedkit/__about__.py
@@ -1,7 +1,7 @@
 __title__ = "csvmedkit"
 __description__ = """The unofficial extended family of csvkit, i.e. even more tools for command-line data parsing and wrangling"""
 __url__ = "https://github.com/dannguyen/csvmedkit"
-__version__ = "0.0.9.13"
+__version__ = "0.0.9.14"
 __short_version__ = __version__.split("-")[0]
 __author__ = "Dan Nguyen"
 __author_email__ = "dansonguyen@gmail.com"
diff --git a/csvmedkit/utils/csvflatten.py b/csvmedkit/utils/csvflatten.py
@@ -25,7 +25,7 @@
 }
 
 FLAT_COL_PADDING = 4
-FLAT_COL_WIDTH = len("field") + FLAT_COL_PADDING  # e.g. '| field |' and '| recid |'
+FLAT_COL_WIDTH = len("field") + FLAT_COL_PADDING
 
 
 class CSVFlatten(UniformReader, CmkUtil):
@@ -35,12 +35,13 @@ class CSVFlatten(UniformReader, CmkUtil):
     override_flags = ["l"]
 
     def add_arguments(self):
+
         self.argparser.add_argument(
-            "-P",
-            "--prettify",
-            dest="prettify",
+            "-c",
+            "--csv",
+            dest="csvify",
             action="store_true",
-            help="""Print output in Markdown tabular format instead of CSV""",
+            help="""Print output in CSV format""",
         )
 
         self.argparser.add_argument(
@@ -123,8 +124,8 @@ def rec_ids_mode(self):
         return self.args.rec_ids_mode
 
     @property
-    def prettify(self):
-        return self.args.prettify
+    def csvify(self):
+        return self.args.csvify
 
     def read_input(self):
         self._rows = agate.csv.reader(self.skip_lines(), **self.reader_kwargs)
@@ -158,7 +159,7 @@ def main(self):
             self.args.max_field_length or self.args.max_field_length == 0
         ):  # 0 is considered to be infinite/no-wrap
             self.max_field_length = self.args.max_field_length
-        elif self.prettify and not self.args.max_field_length:
+        elif not self.csvify and not self.args.max_field_length:
             # user wants it pretty but didn't specify a max_field_length, so we automatically figure it out
             # TODO: this is ugly
             termwidth = get_terminal_size().columns
@@ -204,7 +205,12 @@ def main(self):
                         )
                     outrows.append(o_row + [fieldname, chunk])
 
-        if self.prettify:
+        if self.csvify:
+            writer = agate.csv.writer(self.output_file, **self.writer_kwargs)
+            writer.writerow(self.output_flat_column_names)
+            writer.writerows(outrows)
+
+        else:
             outtable = agate.Table(
                 outrows,
                 column_names=self.output_flat_column_names,
@@ -217,10 +223,6 @@ def main(self):
                 max_rows=None,
                 max_columns=None,
             )
-        else:
-            writer = agate.csv.writer(self.output_file, **self.writer_kwargs)
-            writer.writerow(self.output_flat_column_names)
-            writer.writerows(outrows)
 
 
 def launch_new_instance():

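The csvflatten.py change above inverts the flag: pretty output becomes the default and `-c/--csv` opts back into CSV. A minimal sketch of that pattern using only `argparse` is below. The flag names, `dest`, and help text mirror the diff; `build_parser`, `output_mode`, and the `flatten-sketch` program name are hypothetical and stand in for the real `CSVFlatten` class.

```python
import argparse


def build_parser():
    # Hypothetical stand-alone parser; the real tool registers this
    # argument through csvkit's argparser inside CSVFlatten.add_arguments().
    parser = argparse.ArgumentParser(prog="flatten-sketch")
    parser.add_argument(
        "-c",
        "--csv",
        dest="csvify",
        action="store_true",
        help="Print output in CSV format",
    )
    return parser


def output_mode(argv):
    """Return 'csv' only when the user asked for it; otherwise 'pretty'."""
    args = build_parser().parse_args(argv)
    return "csv" if args.csvify else "pretty"
```

Because `store_true` defaults to `False`, omitting the flag selects the pretty table, which matches the `if self.csvify: ... else: ...` ordering in `main()`.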
diff --git a/csvmedkit/utils/csvsed.py b/csvmedkit/utils/csvsed.py
@@ -11,7 +11,6 @@
 
 class Parser:
     description = """Replaces all instances of [PATTERN] with [REPL]"""
-
     override_flags = [
         "f",
     ]

diff --git a/docs/utils/csvflatten/comparison.rstinc b/docs/utils/csvflatten/comparison.rstinc
@@ -0,0 +1,147 @@
+
+
+How it compares to existing tools
+=================================
+
+
+Compared to csvkit's ``csvlook``
+--------------------------------
+
+
+`csvlook <https://csvkit.readthedocs.io/en/latest/scripts/csvlook.html>`_  doesn't pretty-format multi-line fields, and can also result in very wide tables without ``--max-column-width``::
+
+    $ csvlook examples/hamlet.csv --max-column-width 50
+
+    | act | scene | speaker  | lines                                              |
+    | --- | ----- | -------- | -------------------------------------------------- |
+    |   1 |     5 | Horatio  | Propose the oath, my lord.                         |
+    |   1 |     5 | Hamlet   | Never to speak of this that you have seen,
+    Swea... |
+    |   1 |     5 | Ghost    | [Beneath] Swear.                                   |
+    |   3 |     4 | Gertrude | O, speak to me no more;
+    These words, like dagge... |
+    |   4 |     7 | Laertes  | Know you the hand?                                 |
+
+
+Compared to ``xsv flatten``
+---------------------------
+
+`xsv flatten <https://github.com/BurntSushi/xsv#available-commands>`_ does do auto-wrapping of long entries, but doesn't produce tableized output::
+
+    $ xsv flatten examples/hamlet.csv
+
+    act      1
+    scene    5
+    speaker  Horatio
+    lines    Propose the oath, my lord.
+    #
+    act      1
+    scene    5
+    speaker  Hamlet
+    lines    Never to speak of this that you have seen,
+    Swear by my sword.
+    #
+    act      1
+    scene    5
+    speaker  Ghost
+    lines    [Beneath] Swear.
+    #
+    act      3
+    scene    4
+    speaker  Gertrude
+    lines    O, speak to me no more;
+    These words, like daggers, enter in mine ears;
+    No more, sweet Hamlet!
+    #
+    act      4
+    scene    7
+    speaker  Laertes
+    lines    Know you the hand?
+
+
+Compared to ``tabulate``
+------------------------
+
+`python-tabulate <https://pypi.org/project/tabulate/>`_ is a command-line tool for producing a variety of tabular outputs, including ``rst``, ``grid``, and ``html`` formats. However, it does not handle multi-line fields well, nor does it natively parse CSV, e.g. double-quoted values that contain commas. That is why the example below uses csvkit's `csvformat <https://csvkit.readthedocs.io/en/latest/scripts/csvformat.html>`_ to change delimiters to ``\t``::
+
+
+
+    $ csvformat -T examples/hamlet.csv | tabulate -f grid -1 -s '\t'
+
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | act                                            |   scene | speaker   | lines                                       |
+    +================================================+=========+===========+=============================================+
+    | 1                                              |       5 | Horatio   | Propose the oath, my lord.                  |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | 1                                              |       5 | Hamlet    | "Never to speak of this that you have seen, |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | Swear by my sword."                            |         |           |                                             |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | 1                                              |       5 | Ghost     | [Beneath] Swear.                            |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | 3                                              |       4 | Gertrude  | "O, speak to me no more;                    |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | These words, like daggers, enter in mine ears; |         |           |                                             |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | No more, sweet Hamlet!"                        |         |           |                                             |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+    | 4                                              |       7 | Laertes   | Know you the hand?                          |
+    +------------------------------------------------+---------+-----------+---------------------------------------------+
+
+
+That said, if you like ``tabulate``'s table-formatting options, such as ``-f grid``, you can pipe :command:`csvflatten` (and :command:`csvformat` to convert to tab-delimiters) into ``tabulate`` like so::
+
+
+    $ csvflatten --csv --eor 'none' examples/hamlet.csv | csvformat -T | \
+        tabulate -f grid -1 -s '\t'
+
+    +---------+------------------------------------------------+
+    | field   | value                                          |
+    +=========+================================================+
+    | act     | 1                                              |
+    +---------+------------------------------------------------+
+    | scene   | 5                                              |
+    +---------+------------------------------------------------+
+    | speaker | Horatio                                        |
+    +---------+------------------------------------------------+
+    | lines   | Propose the oath, my lord.                     |
+    +---------+------------------------------------------------+
+    | act     | 1                                              |
+    +---------+------------------------------------------------+
+    | scene   | 5                                              |
+    +---------+------------------------------------------------+
+    | speaker | Hamlet                                         |
+    +---------+------------------------------------------------+
+    | lines   | Never to speak of this that you have seen,     |
+    +---------+------------------------------------------------+
+    |         | Swear by my sword.                             |
+    +---------+------------------------------------------------+
+    | act     | 1                                              |
+    +---------+------------------------------------------------+
+    | scene   | 5                                              |
+    +---------+------------------------------------------------+
+    | speaker | Ghost                                          |
+    +---------+------------------------------------------------+
+    | lines   | [Beneath] Swear.                               |
+    +---------+------------------------------------------------+
+    | act     | 3                                              |
+    +---------+------------------------------------------------+
+    | scene   | 4                                              |
+    +---------+------------------------------------------------+
+    | speaker | Gertrude                                       |
+    +---------+------------------------------------------------+
+    | lines   | O, speak to me no more;                        |
+    +---------+------------------------------------------------+
+    |         | These words, like daggers, enter in mine ears; |
+    +---------+------------------------------------------------+
+    |         | No more, sweet Hamlet!                         |
+    +---------+------------------------------------------------+
+    | act     | 4                                              |
+    +---------+------------------------------------------------+
+    | scene   | 7                                              |
+    +---------+------------------------------------------------+
+    | speaker | Laertes                                        |
+    +---------+------------------------------------------------+
+    | lines   | Know you the hand?                             |
+    +---------+------------------------------------------------+
+
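The flattened output in the comparison above has one row per field, with each extra line of a multi-line value placed on its own row under a blank field name. A rough sketch of that transformation with the standard `csv` module is below. The `flatten` function is a hypothetical illustration of the output shape, not csvflatten's actual implementation, and it ignores record separators and max-length wrapping. The sample text is Hamlet's line from `examples/hamlet.csv` as shown above.

```python
import csv
import io


def flatten(csv_text):
    """Return [field, value] rows; continuation lines get a blank field.

    Hypothetical helper for illustration only; not part of csvmedkit.
    """
    out = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        for field, value in record.items():
            # An empty value still produces one row for its field.
            lines = value.splitlines() or [""]
            for i, line in enumerate(lines):
                out.append([field if i == 0 else "", line])
    return out


SAMPLE = (
    "speaker,lines\n"
    'Hamlet,"Never to speak of this that you have seen,\n'
    'Swear by my sword."\n'
)
flat = flatten(SAMPLE)
```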
diff --git a/docs/utils/csvflatten/files/images/excel-flatfruits-csvmode.png b/docs/utils/csvflatten/files/images/excel-flatfruits-csvmode.png