## 1: Csvkit

So far, we've been using the default command line tools to clean, munge, and explore data. Tools like wc and head are useful, but they weren't designed for working with datasets and lack features specific to tabular data, like parsing the header row or understanding the row and column layout. Because of this, in the Data Munging Using the Command Line challenge, you had to compute the number of lines in each CSV file using the wc tool and use that number to select just the non-header rows using the tail tool. You then had to repeat this for each CSV file you were merging into the single resulting file!
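For reference, the manual approach described above might look roughly like the following. This is a minimal sketch, not the exact commands from the earlier challenge: the scratch directory, the file names a.csv, b.csv, and merged.csv, and the sample rows are all made up for illustration.

```shell
# Build two tiny sample files (hypothetical names and data) in a scratch directory.
workdir=$(mktemp -d)
printf 'year,AGE1\n2005,43\n2005,44\n' > "$workdir/a.csv"
printf 'year,AGE1\n2007,58\n' > "$workdir/b.csv"

# Start the merged file with the header row from the first file.
head -n 1 "$workdir/a.csv" > "$workdir/merged.csv"

for f in "$workdir/a.csv" "$workdir/b.csv"; do
    # Count the lines in the file, then keep all but the first (header) line.
    total=$(wc -l < "$f")
    tail -n "$((total - 1))" "$f" >> "$workdir/merged.csv"
done

cat "$workdir/merged.csv"
```

Every additional file needs its own line count and tail call, which is the repetition that csvkit removes.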

In this mission, we'll learn about the Csvkit library, which supercharges your workflow by adding 13 new command line tools specifically for working with CSV files. We'll focus on these 5 tools from Csvkit:

- csvstack: for stacking rows from multiple CSV files.
- csvlook: for rendering CSV data in a pretty table format.
- csvcut: for selecting specific columns from a CSV file.
- csvstat: for calculating descriptive statistics for some or all columns.
- csvgrep: for filtering tabular data using specific criteria.

We'll be using csvkit version 0.9.1 in this mission and you can read about the installation procedure in the documentation. We'll continue to work with the same 3 datasets on housing affordability:

- Hud_2005.csv,
- Hud_2007.csv,
- Hud_2013.csv.

In [16]:
%%bash

cd data/
ls

Hud_2005.csv
Hud_2007.csv
Hud_2013.csv


## 2: Csvstack

To start, let's circle back to the task of merging 3 CSV files into 1 file. We can use the csvstack tool to consolidate the rows from multiple CSV files and redirect the stdout to a new file:

    csvstack file1.csv file2.csv file3.csv > final.csv

As long as every file passed to csvstack has the same header row, the first row in the resulting file will match this header row. After the header row, final.csv will contain all of the non-header rows from file1.csv, then those from file2.csv, and finally those from file3.csv. If you don't redirect the stdout of csvstack to a file or a tool like head, the full output will be rendered in the terminal. With large files, this can cause your terminal to grind to a halt as it tries to process and display all of the output, so take care to avoid it.
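Before stacking, it can be worth confirming that the header rows really are identical. The following is a minimal sketch using only standard tools; the scratch directory, file names, and sample rows are made up, and it compares the raw text of the first line of each file rather than parsing the CSV.

```shell
# Two tiny sample files (hypothetical names and data) with identical headers.
workdir=$(mktemp -d)
printf 'year,AGE1\n2005,43\n' > "$workdir/file1.csv"
printf 'year,AGE1\n2007,58\n' > "$workdir/file2.csv"

# Collect the first line of every file and count the distinct header rows.
# tr strips the padding that some versions of wc add around the count.
distinct_headers=$(for f in "$workdir"/*.csv; do head -n 1 "$f"; done | sort -u | wc -l | tr -d ' ')

if [ "$distinct_headers" -eq 1 ]; then
    echo "headers match"
else
    echo "headers differ"
fi
```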

If you peeked at the documentation, you may have noticed that the behavior of csvstack can be modified using a few different flags. For example, if you want to be able to trace which file each row in the merged file originated from, you can use the -g flag to specify a grouping value for each filename. When stacking the rows from a file, csvstack will add the corresponding value in a new column. You can also use the -n flag to specify the name of this new column. The following code will create a new column named origin, containing the values 1, 2, or 3 depending on which file that row originated from:

    csvstack -n origin -g 1,2,3 file1.csv file2.csv file3.csv > final.csv

The rows in final.csv that originated from file1.csv will contain the value 1 in the origin column and those from file2.csv will contain the value 2 in the origin column. Let's now use csvstack to combine the 3 datasets on U.S. housing affordability from the last challenge.

In [17]:
%%bash
#head -1 data/*.csv
cd data/

csvstack -n year -g 2005,2007,2013 Hud_2005.csv Hud_2007.csv Hud_2013.csv > Combined_hud.csv

In [18]:
%%bash

cd data/
ls

Combined_hud.csv
Hud_2005.csv
Hud_2007.csv
Hud_2013.csv


In [19]:
%%bash

cd data/
head -1 Combined_hud.csv
echo
wc -l Combined_hud.csv

year,CONTROL,AGE1,BEDRMS,PER,REGION,LMED,FMR,L30,L50,L80,IPOV,BUILT,STATUS,VACANCY,TENURE,NUNITS,TYPE,VALUE,ZINC2,ROOMS,ZADEQ,ZSMHC,WEIGHT,METRO3,STRUCTURETYPE,OWNRENT,UTILITY,OTHERCOST,COST06,COST12,COST08,COSTMED,TOTSAL,ASSISTED,GLMED,GL30,GL50,GL80,APLMED,ABL30,ABL50,ABL80,ABLMED,BURDEN,INCRELAMIPCT,INCRELAMICAT,INCRELPOVPCT,INCRELPOVCAT,INCRELFMRPCT,INCRELFMRCAT,COST06RELAMIPCT,COST06RELAMICAT,COST06RELPOVPCT,COST06RELPOVCAT,COST06RELFMRPCT,COST06RELFMRCAT,COST08RELAMIPCT,COST08RELAMICAT,COST08RELPOVPCT,COST08RELPOVCAT,COST08RELFMRPCT,COST08RELFMRCAT,COST12RELAMIPCT,COST12RELAMICAT,COST12RELPOVPCT,COST12RELPOVCAT,COST12RELFMRPCT,COST12RELFMRCAT,COSTMedRELAMIPCT,COSTMedRELAMICAT,COSTMedRELPOVPCT,COSTMedRELPOVCAT,COSTMedRELFMRPCT,COSTMedRELFMRCAT,FMTZADEQ,FMTMETRO3,FMTBUILT,FMTSTRUCTURETYPE,FMTBEDRMS,FMTOWNRENT,FMTCOST06RELPOVCAT,FMTCOST08RELPOVCAT,FMTCOST12RELPOVCAT,FMTCOSTMEDRELPOVCAT,FMTINCRELPOVCAT,FMTCOST06RELFMRCAT,FMTCOST08RELFMRCAT,FMTCOST12RELFMRCAT,FMTCOSTMEDRELFMRCAT,FMTIN

## 3: Csvlook

While head allows you to quickly observe the first few rows in a file, it doesn't attempt to format the rendered output at all. CSV files are tabular, and it's incredibly useful to see this structure; other data tools like Pandas and Microsoft Excel take this into account when displaying tabular data. Thankfully, we can use the csvlook tool to display tabular data in the table format we're used to.

The csvlook tool parses CSV formatted data from its stdin and outputs a pretty formatted table representation of that data to its stdout:

    head -10 final.csv | csvlook

Let's use csvlook to explore the first few rows from the CSV file we created in the last screen.

In [20]:
%%bash
cd data/

head -2 Combined_hud.csv | csvlook # too many cols to display table properly

|  year |         CONTROL | AGE1 | BEDRMS |  PER | REGION |   LMED | FMR |    L30 |    L50 |    L80 |  IPOV | BUILT | STATUS | VACANCY | TENURE | NUNITS | TYPE |  VALUE |  ZINC2 | ROOMS | ZADEQ | ZSMHC |     WEIGHT | METRO3 | STRUCTURETYPE | OWNRENT |  UTILITY | OTHERCOST |   COST06 |     COST12 |   COST08 |  COSTMED | TOTSAL | ASSISTED |  GLMED |   GL30 |   GL50 |   GL80 |   APLMED |       ABL30 |       ABL50 |  ABL80 |    ABLMED | BURDEN | INCRELAMIPCT | INCRELAMICAT | INCRELPOVPCT | INCRELPOVCAT | INCRELFMRPCT | INCRELFMRCAT | COST06RELAMIPCT | COST06RELAMICAT | COST06RELPOVPCT | COST06RELPOVCAT | COST06RELFMRPCT | COST06RELFMRCAT | COST08RELAMIPCT | COST08RELAMICAT | COST08RELPOVPCT | COST08RELPOVCAT | COST08RELFMRPCT | COST08RELFMRCAT | COST12RELAMIPCT | COST12RELAMICAT | COST12RELPOVPCT | COST12RELPOVCAT | COST12RELFMRPCT | COST12RELFMRCAT | COSTMedRELAMIPCT | COSTMedRELAMICAT | COSTMedRELPOVPCT | COSTMedRELPOVCAT | COSTMedRELFMRPCT | COSTMedRELFMRCAT | FMTZADEQ   | FMTMETRO3 | F

## 4: Csvcut

Csvlook returned a table formatted output of the merged CSV file. Let's now explore individual columns using the csvcut tool. Using the csvcut command with just the -n flag parses and displays all the columns in a CSV file along with a unique integer identifier for each column:

    csvcut -n Combined_hud.csv

will output a numbered list of the column names. For example, for a version of the file with only 7 columns:

    1: year
    2: AGE1
    3: BURDEN
    4: FMR
    5: FMTBEDRMS
    6: FMTBUILT
    7: TOTSAL

You can use the integer identifier for each column and the -c flag to select just a specific column:

    csvcut -c 1 Combined_hud.csv

will output just the year column. Avoid displaying the entire column: it contains 154118 rows, and your terminal may grind to a halt trying to display them all. Instead, you can pipe the column output to head to preview just the first n rows.

In [21]:
%%bash
cd data/

# display all column numbers and names
csvcut -n Combined_hud.csv

  1: year
  2: CONTROL
  3: AGE1
  4: BEDRMS
  5: PER
  6: REGION
  7: LMED
  8: FMR
  9: L30
 10: L50
 11: L80
 12: IPOV
 13: BUILT
 14: STATUS
 15: VACANCY
 16: TENURE
 17: NUNITS
 18: TYPE
 19: VALUE
 20: ZINC2
 21: ROOMS
 22: ZADEQ
 23: ZSMHC
 24: WEIGHT
 25: METRO3
 26: STRUCTURETYPE
 27: OWNRENT
 28: UTILITY
 29: OTHERCOST
 30: COST06
 31: COST12
 32: COST08
 33: COSTMED
 34: TOTSAL
 35: ASSISTED
 36: GLMED
 37: GL30
 38: GL50
 39: GL80
 40: APLMED
 41: ABL30
 42: ABL50
 43: ABL80
 44: ABLMED
 45: BURDEN
 46: INCRELAMIPCT
 47: INCRELAMICAT
 48: INCRELPOVPCT
 49: INCRELPOVCAT
 50: INCRELFMRPCT
 51: INCRELFMRCAT
 52: COST06RELAMIPCT
 53: COST06RELAMICAT
 54: COST06RELPOVPCT
 55: COST06RELPOVCAT
 56: COST06RELFMRPCT
 57: COST06RELFMRCAT
 58: COST08RELAMIPCT
 59: COST08RELAMICAT
 60: COST08RELPOVPCT
 61: COST08RELPOVCAT
 62: COST08RELFMRPCT
 63: COST08RELFMRCAT
 64: COST12RELAMIPCT
 65: COST12RELAMICAT
 66: COST12RELPOVPCT
 67: COST12RELPOVCAT
 68: COST12RELFMRPCT
 69: COST12RELFMRCA

In [22]:
%%bash
cd data/

# Keep only seven columns: write them to a temporary file, then replace the original file with it
csvcut -c 1,3,45,8,80,78,34 Combined_hud.csv > newfile.csv && mv newfile.csv Combined_hud.csv
ls

Combined_hud.csv
Hud_2005.csv
Hud_2007.csv
Hud_2013.csv


In [26]:
%%bash
cd data/

# display column numbers and names
csvcut -n Combined_hud.csv

# display the first 10 lines (header + 9 rows) as a table
head Combined_hud.csv | csvlook

  1: year
  2: AGE1
  3: BURDEN
  4: FMR
  5: FMTBEDRMS
  6: FMTBUILT
  7: TOTSAL

|  year | AGE1 |  BURDEN | FMR | FMTBEDRMS | FMTBUILT  | TOTSAL |
| ----- | ---- | ------- | --- | --------- | --------- | ------ |
| 2,005 |   43 |  0.513… | 680 | 3 3BR     | 1980-1989 | 20,000 |
| 2,005 |   44 |  0.223… | 760 | 4 4BR+    | 1980-1989 | 71,000 |
| 2,005 |   58 |  0.218… | 680 | 3 3BR     | 1980-1989 | 63,000 |
| 2,005 |   22 |  0.217… | 519 | 1 1BR     | 1980-1989 | 27,040 |
| 2,005 |   48 |  0.283… | 600 | 1 1BR     | 1980-1989 | 14,000 |
| 2,005 |   42 |  0.292… | 788 | 3 3BR     | 1980-1989 | 42,000 |
| 2,005 |   -9 | -9.000… | 702 | 2 2BR     | 1980-1989 |     -9 |
| 2,005 |   23 |  0.145… | 546 | 2 2BR     | 1980-1989 | 48,000 |
| 2,005 |   51 |  0.296… | 680 | 3 3BR     | 1980-1989 | 58,000 |


In [29]:
%%bash
cd data/

# display the first 10 lines (header + 9 rows) of the AGE1 column, selected by column number
csvcut -c 2 Combined_hud.csv | head

# display the first 5 lines (header + 4 rows) of the year column, selected by name
csvcut -c year Combined_hud.csv | head -5

AGE1
43
44
58
22
48
42
-9
23
51
year
2005
2005
2005
2005


## 5: Csvstat

Now that we know how to select specific columns, we can select a column and pipe it to the csvstat tool to calculate summary statistics for that column:

    csvcut -c 4 Combined_hud.csv | csvstat

This calculates a full suite of summary statistics, including:

- max,
- min,
- sum,
- mean,
- median,
- standard deviation.

Depending on the size of the data, calculating the full summary statistics for a column can take a long time, and you often just want a specific summary statistic. You can use flags like --max or --mean to choose specific summary statistics, which will greatly improve the speed:

    # Just the max value.
    csvcut -c 2 Combined_hud.csv | csvstat --max
    # Just the mean value.
    csvcut -c 2 Combined_hud.csv | csvstat --mean
    # Just the number of null values.
    csvcut -c 2 Combined_hud.csv | csvstat --nulls
    
You can see a full list of flags in the documentation. If you want to calculate summary statistics over all the columns in a CSV file, you can pass the file to csvstat directly:

    csvstat Combined_hud.csv

In [33]:
%%bash
cd data/

# mean for all columns
csvstat --mean Combined_hud.csv

  1. year: 2,008.904
  2. AGE1: 46.511
  3. BURDEN: 5.304
  4. FMR: 7,966.32
  5. FMTBEDRMS: None
  6. FMTBUILT: None
  7. TOTSAL: 44,041.842


## 6: Csvcut | Csvstat

Let's use csvcut and csvstat to search for any problematic values in the AGE1 column.

In [34]:
%%bash
cd data/

# all summary statistics for AGE1 column (notice problematic values)
csvcut -c 2 Combined_hud.csv | csvstat

  1. "AGE1"

	Type of data:          Number
	Contains null values:  False
	Unique values:         80
	Smallest value:        -9
	Largest value:         93
	Sum:                   7,168,169
	Mean:                  46.511
	Median:                48
	StDev:                 23.049
	Most common values:    -9 (11553x)
	                       50 (3208x)
	                       45 (3056x)
	                       40 (3040x)
	                       48 (3006x)

Row count: 154117


## 7: Csvgrep

You'll notice that -9 is the most common value in the AGE1 column, which is problematic since age values have to be greater than 0. We can use csvgrep to select all the rows that match a specific pattern to dive a bit deeper. Csvgrep searches every row in the dataset, and we use the -c flag (just like with csvcut) to specify which column or columns to search. We then use the -m flag to specify the pattern:

    csvgrep -c 2 -m -9 Combined_hud.csv
    
This command will return all rows from Combined_hud.csv with -9 as the value for the AGE1 column. The behavior of csvgrep can be customized using flags. For example, you can use the -r flag to pass in a regular expression as the pattern instead. We're now going to combine several of the tools we've talked about so far so that you can see the real power of using the csvkit tools together with other CLI tools.
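As a rough illustration of the -r flag, the sketch below selects every row whose AGE1 value starts with a minus sign. It assumes csvkit is installed (the check on the first line simply skips the demo otherwise), and the scratch directory, the file names sample.csv and negative_ages.csv, and the sample rows are all made up.

```shell
# Assumes csvkit is installed; the demo is skipped if csvgrep is not on the PATH.
if command -v csvgrep >/dev/null 2>&1; then
    # Tiny sample file (made-up data) with AGE1 as the second column.
    workdir=$(mktemp -d)
    printf 'year,AGE1,TOTSAL\n2005,43,20000\n2005,-9,-9\n2007,-6,-6\n' > "$workdir/sample.csv"

    # -r takes a regular expression instead of a literal pattern:
    # '^-' matches any AGE1 value that starts with a minus sign.
    csvgrep -c 2 -r '^-' "$workdir/sample.csv" > "$workdir/negative_ages.csv"

    cat "$workdir/negative_ages.csv"
fi
```

Unlike -m -9, which only matches that one literal value, the regular expression also catches any other negative value in the column.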

In [37]:
%%bash
cd data/

# Display the first 10 lines (header + 9 rows) from Combined_hud.csv where the value for the AGE1 column is -9, in table format.
csvgrep -c 2 -m -9 Combined_hud.csv | head | csvlook

|  year | AGE1 | BURDEN |   FMR | FMTBEDRMS | FMTBUILT  | TOTSAL |
| ----- | ---- | ------ | ----- | --------- | --------- | ------ |
| 2,005 |   -9 |     -9 |   702 | 2 2BR     | 1980-1989 |     -9 |
| 2,005 |   -9 |     -9 |   531 | 1 1BR     | 1980-1989 |     -9 |
| 2,005 |   -9 |     -9 | 1,034 | 3 3BR     | 2000-2009 |     -9 |
| 2,005 |   -9 |     -9 |   631 | 1 1BR     | 1980-1989 |     -9 |
| 2,005 |   -9 |     -9 |   712 | 4 4BR+    | 1990-1999 |     -9 |
| 2,005 |   -9 |     -9 | 1,006 | 3 3BR     | 2000-2009 |     -9 |
| 2,005 |   -9 |     -9 |   631 | 1 1BR     | 1980-1989 |     -9 |
| 2,005 |   -9 |     -9 |   712 | 3 3BR     | 2000-2009 |     -9 |
| 2,005 |   -9 |     -9 | 1,087 | 3 3BR     | 2000-2009 |     -9 |


## 8: Filtering Out Problematic Rows

Let's now filter out all of these problematic rows from the dataset since they have data quality issues. Csvkit wasn't developed with a sharp focus on editing existing files, and the easiest way to filter rows is to create a separate file with just the rows we're interested in. To accomplish this, we can redirect the output of csvgrep to a file. So far, we've only used csvgrep to select rows that match a specific pattern. We need to instead select the rows that don't match a pattern, which we can specify with the -i flag. You can read more about this flag in the documentation.

In [52]:
%%bash
cd data/

# csvgrep -i flag inverts the selection (rows where pattern does not match)
csvgrep -c 2 -m -9 -i Combined_hud.csv > positive_ages_only.csv
# display the first 10 lines (header + 9 rows)
head positive_ages_only.csv

year,AGE1,BURDEN,FMR,FMTBEDRMS,FMTBUILT,TOTSAL
2005,43,0.513,680,'3 3BR','1980-1989',20000
2005,44,0.2225915493,760,'4 4BR+','1980-1989',71000
2005,58,0.2178312657,680,'3 3BR','1980-1989',63000
2005,22,0.2174556213,519,'1 1BR','1980-1989',27040
2005,48,0.2828571429,600,'1 1BR','1980-1989',14000
2005,42,0.2922857143,788,'3 3BR','1980-1989',42000
2005,23,0.14475,546,'2 2BR','1980-1989',48000
2005,51,0.2962,680,'3 3BR','1980-1989',58000
2005,47,0.1896,1081,'4 4BR+','2000-2009',125000


## 9: Next Steps

In this mission, you learned how to use the csvkit library to explore and clean CSV files. You should use csvkit whenever you need to quickly transform or explore data from the command line, but remember that it has a few limitations:

- Csvkit is not optimized for speed and struggles to run some commands over larger files.

- Csvkit has very limited capabilities for actually editing problematic values in a dataset, since the community behind the library aspired to keep the library small and lightweight.