# Demonstration of `df_binary_states2summary_df.py`

This notebook demonstrates the use of `df_binary_states2summary_df.py`; see [here](https://github.com/fomightez/text_mining) for more information.

This script converts a dataframe or data table of text into a summary.  

**This script only works with data that has two (binary) states or subgroups; it reports the specified state and assumes the other state accounts for the remainder of 100%.** One of the two states is typically considered the 'positive' or featured one, such as 'present' in 'present/not present' or 'yes' in 'yes/no'.  
See [this notebook](index.ipynb) demonstrating `df_subgroups_states2summary_df.py` if you have more than two states / subgroups.
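
As a minimal sketch of that binary assumption (the `states` list below is made-up illustrative data, not something taken from the script), the share of the featured state fixes the share of the other state as the remainder of 100%:

```python
# Hypothetical two-state data; 'yes' is treated as the featured state.
states = ["yes", "yes", "no", "yes"]

featured = "yes"
featured_fraction = states.count(featured) / len(states)
# With only two states, the other state is whatever is left of 100%.
other_fraction = 1 - featured_fraction

print(f"{featured}: {featured_fraction:.2%}, other: {other_fraction:.2%}")
```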

-----

The two main ways of using the script are covered, featuring several of the options available.

## Preparation and displaying USAGE block

Let's get the script and run 'Help' on it to see the basic USAGE block.

(If you are running this notebook in the session launched from the repo that includes the script, this step is not necessary. However, it is included because there is no harm in running it here and you may be wanting to run this elsewhere or see how to easily acquire the script. If you are on the actual command line, you'd leave off the exclamation point.)

In [1]:
import os
file_needed = "df_binary_states2summary_df.py"
if not os.path.isfile(file_needed):
    !curl -OL https://raw.githubusercontent.com/fomightez/text_mining/master/df_binary_states2summary_df.py
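
If you want to do the same thing outside a notebook, where the `!curl` shell escape is not available, a plain-Python sketch along these lines should work. The helper name `fetch_if_missing` is an assumption made for illustration; only the URL comes from the cell above.

```python
import os
import urllib.request

# URL copied from the curl command above.
SCRIPT_URL = ("https://raw.githubusercontent.com/fomightez/text_mining/"
              "master/df_binary_states2summary_df.py")


def fetch_if_missing(url, filename):
    """Download `url` to `filename` unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    (`fetch_if_missing` is an illustrative helper, not part of the script.)
    """
    if os.path.isfile(filename):
        return False
    urllib.request.urlretrieve(url, filename)
    return True

# Example call (commented out so this block makes no network request when run):
# fetch_if_missing(SCRIPT_URL, "df_binary_states2summary_df.py")
```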

In [2]:
%run df_binary_states2summary_df.py -h

usage: df_binary_states2summary_df.py [-h] [-olsp] [-bc]
                                      DF_FILE GROUPS STATES_COL STATE_TO_SHOW

df_binary_states2summary_df.py takes a dataframe, and some information about
columns in the dataframe and makes a summary data table with the percent for a
specified state per total and each group / category. **** Script by Wayne
Decatur (fomightez @ github) ***

positional arguments:
  DF_FILE               Name of file containing the dataframe. Whether it is
                        in the form of a pickled dataframe, tab-separated
                        text, or comma-separated text needs to be indicated by
                        the file extension. So `.pkl`, `.tsv`, or `.csv` for
                        the file extension.
  GROUPS                Text indicating column in dataframe to use as main
                        grouping categories.
  STATES_COL            Text indicating column in dataframe to use as the
                        binary st

## Use the script by calling it from the command line

A dataframe or text data table will be used as input data. To fully demonstrate the options for the script, we'll use a toy dataframe and also convert it to a text table.

In [3]:
import pandas as pd
sales = [('Jones LLC', 177887, 'yes'),
         ('Alpha Co', 157987, 'yes'),
         ('Alpha Co', 158981, 'yes'),
         ('Alpha Co', 159983, 'yes'),
         ('Alpha Co', 167987, 'yes'),
         ('Alpha Co', 158117, 'yes'),
         ('Alpha Co', 159333, 'no'),
         ('Alpha Co', 256521, 'no'),
         ('Blue Inc', 111947, 'no')]
labels = ['Manufacturer', 'Item', 'In_Stock']
df = pd.DataFrame.from_records(sales, columns=labels)
df.head()

Unnamed: 0,Manufacturer,Item,In_Stock
0,Jones LLC,177887,yes
1,Alpha Co,157987,yes
2,Alpha Co,158981,yes
3,Alpha Co,159983,yes
4,Alpha Co,167987,yes


Let's save that dataframe as tabular text and also as a pickled dataframe. The former is human-readable and the latter is not. The pickled form is more efficient for storage, though, if that is an issue.

The tabular text is saved in tab-separated form. You could change it to be comma-separated (CSV) if you choose.

In [4]:
df.to_pickle("data.pkl")
df.to_csv('data.tsv', sep='\t',index = False)
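
For the comma-separated alternative, here is a small sketch that writes both forms and reads them back as a check. It uses a shortened copy of the toy data and a temporary directory so it does not touch the `data.tsv` or `data.pkl` files used in the rest of this notebook; the variable names are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Shortened copy of the toy data, three rows for brevity.
toy = pd.DataFrame.from_records(
    [("Jones LLC", 177887, "yes"),
     ("Alpha Co", 159333, "no"),
     ("Blue Inc", 111947, "no")],
    columns=["Manufacturer", "Item", "In_Stock"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "data.tsv")
    csv_path = os.path.join(tmp_dir, "data.csv")
    toy.to_csv(tsv_path, sep="\t", index=False)  # tab-separated
    toy.to_csv(csv_path, index=False)            # comma-separated (the default)

    # Read each file back while the temporary directory still exists.
    from_tsv = pd.read_csv(tsv_path, sep="\t")
    from_csv = pd.read_csv(csv_path)

# Both round trips should give back the same rows and columns.
print(from_tsv.equals(toy), from_csv.equals(toy))
```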

Now that we have files with input data, we have something we can point the script at for running it.

In addition to providing the data input file name, three other items need to be provided when calling the script. You also need to provide:  
(1) the text corresponding to the column heading of the groupings,  
(2) the text corresponding to the column containing the states or subgroups, and  
(3) the text corresponding to the 'positive' or featured state or subgroup.
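
To make the mapping between those items and the positional arguments named in the USAGE block explicit, here is a small sketch that only assembles and prints the command line; it does not run the script. The `positional` dictionary is an illustrative construct, and the values are the ones used for the toy data in this notebook.

```python
import shlex

# Positional arguments in the order given by the USAGE block.
positional = {
    "DF_FILE": "data.pkl",        # input file saved above
    "GROUPS": "Manufacturer",     # (1) column heading of the groupings
    "STATES_COL": "In_Stock",     # (2) column containing the two states
    "STATE_TO_SHOW": "yes",       # (3) the 'positive' or featured state
}

command = ["python", "df_binary_states2summary_df.py", *positional.values()]
print(shlex.join(command))
```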

In [1]:
%run df_binary_states2summary_df.py data.pkl Manufacturer In_Stock yes



In [6]:
t = pd.read_pickle("summary_data.pkl")
t

The text in the displayed view of the dataframe can be styled better without changing the actual underlying data. Here we format the `%` column to show it as a percent with two decimal places. At the same time, a title can also be added for display.

In [21]:
# This would change the view to be nicer; note the underlying dataframe remains untouched
t_styl = t.style.format("{:.2%}",subset=[('yes','%')]) # based on https://stackoverflow.com/a/56411982/8508004
# and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.format.html
# Trick to add a title to the dataframe
from IPython.display import display, HTML
# trick from https://stackoverflow.com/a/29665452/8508004
display(HTML('<b>Items in stock by Manufacturer:</b>'))
with pd.option_context('display.multi_sparse', False):
    display(t_styl)

Unnamed: 0_level_0,Unnamed: 1_level_0,yes,yes
Unnamed: 0_level_1,[n],count,%
ALL,9,6,66.67%
Alpha Co,7,5,71.43%
Blue Inc,1,0,0.00%
Jones LLC,1,1,100.00%


The sparsification of the column names, combined with not all the states/subgroups being represented, makes it a little hard to read. In other words, you might ask, "Why is there an `[n]` and a `count`?" and not realize that `count` belongs to `yes`. Using the suggestion [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html), you can un-"sparsify" the display style:

In [8]:
with pd.option_context('display.multi_sparse', False):
    display(t)

Unnamed: 0_level_0,Unnamed: 1_level_0,yes,yes
Unnamed: 0_level_1,[n],count,%
ALL,9,6,0.666667
Alpha Co,7,5,0.714286
Blue Inc,1,0,0.0
Jones LLC,1,1,1.0
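
The numbers in the table above can be reproduced directly in pandas, which may help clarify what each column means. This is only a sketch of an equivalent calculation on the toy data, not the script's actual implementation, and it uses flat column names rather than the two-level headers the script produces.

```python
import pandas as pd

sales = [('Jones LLC', 177887, 'yes'), ('Alpha Co', 157987, 'yes'),
         ('Alpha Co', 158981, 'yes'), ('Alpha Co', 159983, 'yes'),
         ('Alpha Co', 167987, 'yes'), ('Alpha Co', 158117, 'yes'),
         ('Alpha Co', 159333, 'no'), ('Alpha Co', 256521, 'no'),
         ('Blue Inc', 111947, 'no')]
df = pd.DataFrame.from_records(
    sales, columns=['Manufacturer', 'Item', 'In_Stock'])

is_yes = df['In_Stock'].eq('yes')  # True where the featured state occurs

# Per-group totals ([n]) and counts of the featured state.
summary = is_yes.groupby(df['Manufacturer']).agg(['size', 'sum'])
summary.columns = ['[n]', 'count']

# Prepend an 'ALL' row covering the whole dataframe.
all_row = pd.DataFrame(
    {'[n]': [len(df)], 'count': [int(is_yes.sum())]}, index=['ALL'])
summary = pd.concat([all_row, summary])

# Fraction of each row's total that is in the featured state.
summary['%'] = summary['count'] / summary['[n]']
print(summary)
```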


Related to the sparsification of column names: in the classic notebook interface, the default dataframe display shows the top-level column names on the left, and they shift to being right-aligned once the format styling is applied to the percent column. If you are seeing this and want more consistent behavior, switch to the JupyterLab interface, where the top-level column headings seem to stay right-aligned. It is easy to go from the classic interface to JupyterLab: first go to the dashboard/file browser in the classic interface by clicking on the Jupyter logo in the upper left. From the dashboard, change the end of the URL from `/tree` to `/lab`. The screen will refresh and you'll be in the JupyterLab interface. (Just change `/lab` at the end back to `/tree` to switch back to the classic interface.)

That covers the basics. However, the script can be called with several arguments to specify the output style.

In [10]:
l = pd.read_pickle("data.pkl")
l

Unnamed: 0,Manufacturer,Item,In_Stock
0,Jones LLC,177887,yes
1,Alpha Co,157987,yes
2,Alpha Co,158981,yes
3,Alpha Co,159983,yes
4,Alpha Co,167987,yes
5,Alpha Co,158117,yes
6,Alpha Co,159333,no
7,Alpha Co,256521,no
8,Blue Inc,111947,no


In [11]:
%run df_binary_states2summary_df.py data.pkl Manufacturer In_Stock yes --bracket_counts

Summary dataframe saved as a text table easily opened in
different software; file named: `summary_data.tsv`. This version meant for presenation only.

Summary dataframe saved in pickled form for ease of use within
Python; file named: `summary_data.pkl`. This version meant for
presentation only.


**Also saving data table as forms easier to handle for subsequent steps:**
Summary dataframe saved as a text table easily opened in
different software; file named: `summary_basic_data.tsv`

Summary dataframe saved in pickled form for ease of use within
Python; file named: `summary_basic_data.pkl`. This will retain the column headers/names formatting best.

In [12]:
bc = pd.read_pickle("summary_data.pkl")
bc

Unnamed: 0,[n],In_Stock
ALL,9.0,66.67% [6]
Alpha Co,7.0,71.43% [5]
Blue Inc,1.0,0.00% [0]
Jones LLC,1.0,100.00% [1]
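
To show how a combined "percent [count]" string like the ones above relates to the plain numbers, here is a sketch that builds the same strings from the totals and counts in the unformatted summary shown earlier. It is an illustration of the formatting only and is not taken from the script's source.

```python
import pandas as pd

# Totals and counts taken from the unformatted summary shown earlier.
basic = pd.DataFrame(
    {"[n]": [9, 7, 1, 1], "count": [6, 5, 0, 1]},
    index=["ALL", "Alpha Co", "Blue Inc", "Jones LLC"],
)

fraction = basic["count"] / basic["[n]"]
# Combine percent and count into one presentation string per row.
basic["In_Stock"] = [
    f"{frac:.2%} [{count}]" for frac, count in zip(fraction, basic["count"])
]
print(basic[["[n]", "In_Stock"]])
```

Because the result is text rather than numbers, a column like this suits presentation only, which matches the note in the script's output about the bracketed version.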


In [13]:
%run df_binary_states2summary_df.py data.pkl Manufacturer In_Stock yes --only_subgrp_perc

Summary dataframe saved as a text table easily opened in
different software; file named: `summary_data.tsv`

Summary dataframe saved in pickled form for ease of use within
Python; file named: `summary_data.pkl`. This will retain the column headers/names formatting best.

In [15]:
po = pd.read_pickle("summary_data.pkl")
po

Unnamed: 0,[n],In_Stock
ALL,9,0.666667
Alpha Co,7,0.714286
Blue Inc,1,0.0
Jones LLC,1,1.0


In [18]:
# This would change the view to be nicer; note the underlying dataframe remains untouched
po_styl = po.style.format("{:.2%}", subset=['In_Stock']) # based on https://stackoverflow.com/a/56411982/8508004
# and https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.format.html
# Trick to add a title to the dataframe
from IPython.display import display, HTML
# trick from https://stackoverflow.com/a/29665452/8508004
display(HTML('<b>Inventory by Manufacturer:</b>'))
display(po_styl)

Unnamed: 0,[n],In_Stock
ALL,9,66.67%
Alpha Co,7,71.43%
Blue Inc,1,0.00%
Jones LLC,1,100.00%


---

This is the last notebook in the series.

Go back a notebook in the series by clicking [here &#11013;](index.ipynb).

----
----

In [None]:
import time

def executeSomething():
    # print a dot, then pause for 8 minutes
    print('.')
    time.sleep(480)  # 480 seconds = 60 seconds times 8 minutes

while True:
    executeSomething()

.
