# Data Preparation I

*Note: this notebook is the complete (wordy) version of the [slide-oriented notebook](https://drive.google.com/file/d/1EOSzX3u4fyvYyHWzdM0WhQCduJnq8YkJ/view?usp=sharing) shown in lecture.*

Data preparation is a *very* broad subject, covering everything from data models to statistical assessments of data to string algorithms to scalable data processing. In some sense, most of Data Engineering---most of data science!---boils down to Data Preparation.

In today's first lecture we'll cover a few key topics:
1. Data "Unboxing": parsing & structure assessment
2. Structural Data Transformations
3. Type Induction/Coercion
4. String Manipulation

## 1. "Unboxing" Data
Recall some basic tools:
- `du -h`: show the disk size of a file in human-readable form
- `head -c 1024`: show me the first 1024 bytes of a file
- your eyes: in addition to being the window to your soul, they're a very common tool for understanding data

### Assessing Structure
Start by running `du -h` and `head -c 1024` on some files, or have a peek in your Finder/Explorer and favorite (scale-savvy) text editor.
1. Look out for header info, metadata, comments, etc.
  - They may be inline, or in external documentation, "data dictionaries", etc.
2. Most files you'll run across fall into one of these categories:
  1. **Record per line**: newline-delimited rows of uniform symbol-delimited data.
    - E.g. `csv` and `tsv` files
    - Also newline-delimited rows of uniform but ad-hoc structured text
  2. **Dictionaries/Objects**: explicit key:value pairs, may be nested! Two common cases:
    - Object-per-line: e.g. newline-delimited rows of JSON, XML, YAML, etc. (JSON in this format is sometimes called [json lines or jsonl](https://jsonlines.org/)).
    - Complex object: the entire dataset is one fully-nested JSON, XML or YAML object
  3. **Unions**: a mixture of rows from *k* distinct schemas. Two common cases:
    - *Tagged Unions*: each row has an ID or name identifying its schema. Often the tag is in the first column.
    - *Untagged Unions*: the schema for the row must be classified by its content
  4. **Natural Language (prose)**: A lot of data files are mostly or entirely natural language intended for human consumption.
  5. **Everything else**: There is a long tail of file formats in the world. If they're not readable as text, there's likely a commercial or open source tool to translate them to a readable text format.
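
The categories above can be roughly sketched as a heuristic "sniffer" in Python. This is only an illustration, not a robust classifier: the function name `guess_structure` and its return labels are made up for this example, it assumes comma-delimited text, and it assumes the sample ends on a complete line (the output of `head -c` usually does not, so you would drop the last line first).

```python
import csv
import json


def guess_structure(sample: str) -> str:
    """Heuristically guess the structural category of a text sample."""
    lines = [ln for ln in sample.splitlines() if ln.strip()]
    if not lines:
        return "empty"

    # Object-per-line: every non-empty line is a JSON object on its own.
    try:
        if all(isinstance(json.loads(ln), dict) for ln in lines):
            return "object-per-line"
    except json.JSONDecodeError:
        pass

    # Complex object: the whole sample is one JSON object or array.
    try:
        if isinstance(json.loads(sample), (dict, list)):
            return "complex object"
    except json.JSONDecodeError:
        pass

    # Delimited text: compare the number of comma-separated fields per line.
    field_counts = {len(row) for row in csv.reader(lines)}
    if field_counts == {1}:
        return "prose or other"
    if len(field_counts) == 1:
        return "record-per-line"
    return "possible union"


print(guess_structure('{"a": 1}\n{"a": 2}'))
print(guess_structure("Year,OCT,NOV\n2002,0.00,6.03"))
print(guess_structure("2002, 'REGION'\n2002, 'ID', 'Location'"))
```

Prose that happens to contain commas will fool this sketch, which is one reason your eyes remain part of the toolkit.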
  
One final note: text formats themselves are a complex issue. Traditionally there were two encodings of roman-alphabet characters: EBCDIC and ASCII. ASCII mostly won. But in our multilingual world, we now deal with characters from multiple languages and beyond 😟! There are many character encodings now. We won't dwell on this subject in this class, but some day you may have to be aware of issues like Unicode, UTF-8 and more. You can search the web for resources on these issues.
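
If you are curious what these encoding issues look like in practice, here is a small sketch using only Python's built-in `str.encode` and `bytes.decode`. The byte string `with_bom` is a made-up example, not the contents of any file in this lecture.

```python
# Plain ASCII text has the same bytes in ASCII and in UTF-8.
assert "Jeopardy".encode("ascii") == "Jeopardy".encode("utf-8")

# A non-ASCII character takes more than one byte in UTF-8...
name = "Honor\u00e9"  # the letter e with an acute accent
utf8_bytes = name.encode("utf-8")
print(len(name), "characters,", len(utf8_bytes), "bytes")

# ...and cannot be written as ASCII at all.
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Some UTF-8 files start with a byte-order mark (BOM).
# Decoding as "utf-8" keeps it as an invisible first character;
# decoding as "utf-8-sig" strips it.
with_bom = b"\xef\xbb\xbfYear,OCT"
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(repr(kept[0]), repr(stripped))
```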
  
For the rest of this class, we will focus on the common case of text that can be read and written easily in tools like Jupyter, Python, PostgreSQL, etc.

### Examples
Without further ado, let's unbox some data!

In [1]:
!du -h data/*

  0B	data/Icon
1.3M	data/jc1.txt
 50M	data/jq2.txt
4.0K	data/mm.txt
696K	data/mmp.txt
4.0M	data/mmr.txt
824K	data/mpf.txt


In [2]:
!head -c 1024 data/jc1.txt

﻿_input,_num,_widgetName,_source,_resultNumber,_pageUrl,game_number,bio1,bio2,bio3,contestant1_name,contestant1_score,contestant2_name,contestant2_score,contestant3_name,contestant3_score
,1,Jeopardy_Winners,Jeopardy_Winners,1,http://www.j-archive.com/showgame.php?game_id=3350,Show #5883 - Wednesday March 24 2010,Derek Honoré an attorney from Inglewood California,Tatiana Walton a graphic designer from Cutler Bay Florida,Regina Robbins an arts teacher from New York New York (whose 1-day cash winnings total $38500),Regina,$19401,Tatiana,$7100,Derek,$11900
,2,Jeopardy_Winners,Jeopardy_Winners,2,http://www.j-archive.com/showgame.php?game_id=4400,Show #6756 - Monday January 20 2014,Jon McGuire a software-development manager from Matthews North Carolina,Blake Perkins an environmental scientist from Baton Rouge Louisiana,Sarah McNitt a study abroad adviser originally from Ann Arbor Michigan (whose 4-day cash winnings total $69199),Sarah,$20199,Blake,$0,Jon,$8380
,3,Jeopardy_Winners,Jeopard

What category of data is the file above? Any observations about the data?

Let's look at another file.

In [3]:
!head -c 1024 data/jq2.txt

{"category":"HISTORY","air_date":"2004-12-31","question":"'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'","value":"$200","answer":"Copernicus","round":"Jeopardy!","show_number":"4680"}
{"category":"ESPN's TOP 10 ALL-TIME ATHLETES","air_date":"2004-12-31","question":"'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'","value":"$200","answer":"Jim Thorpe","round":"Jeopardy!","show_number":"4680"}
{"category":"EVERYBODY TALKS ABOUT IT...","air_date":"2004-12-31","question":"'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'","value":"$200","answer":"Arizona","round":"Jeopardy!","show_number":"4680"}
{"category":"THE COMPANY LINE","air_date":"2004-12-31","question":"'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'","value":"$200","answer":"McDonald\\'s","round":"Jeopardy!","show_number":"4680"}
{"c

What do you see this time? Category? Interesting features of the data?

Keep in mind: this process is a form of *data visualization*: just because it's not pretty graphs doesn't mean you aren't interpreting the data based on a visual representation! This happens to be a text-based visualization, but be aware of the power and biases of your eyeballs and cognition. You are probably making all kinds of assumptions based on what your eyeballs are sensing! Mostly good, I'm sure. But you're working with limited information given this fairly lean data visualization.

### Richer Visualizations and Low-Code Interaction in Trifacta
In DS100 you learned a bit about how to use pandas and some Python graphing packages, and you could definitely use those skills to ingest and plot this data. But to move things along more quickly, let's load this into a purpose-built visual data preparation tool called [Trifacta](https://cloud.trifacta.com), also known as [Google Cloud Dataprep](https://cloud.google.com/dataprep). This is a visual environment for "low-code" dataprep that is the commercialization of joint research at Berkeley and Stanford in the [Wrangler](http://vis.stanford.edu/wrangler/) and [Potter's Wheel](http://control.cs.berkeley.edu/abc) projects.

We could simply drag-and-drop our files into the Trifacta web UI, but to stick with our familiar Jupyter notebook we can instead use Trifacta's Python library to pop open Trifacta at will. We may have to be a bit patient as Trifacta uploads the data to the cloud. The `tf.wrangle` call will upload the data and return a "flow" object from Trifacta. We can then call the `open()` method on that flow object to open the Trifacta UI in a new browser tab.

In [4]:
import trifacta as tf
import pandas as pd

jc1 = tf.wrangle('data/jc1.txt')





In [5]:
jc1.open()

Opening https://tfcso.cloud.trifacta.com/data/94137/454413?minimalView=true


'https://tfcso.cloud.trifacta.com/data/94137/454413?minimalView=true'

In Trifacta you see more data visualizations---including visualizations of some analyses. What do you see that's different? What do you see that's additional?

### Let's Look at More Files!

OK, moving on to another file. How would you describe this one?

In [6]:
!head -c 1024 data/mm.txt

Year,OCT,NOV,DEC,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP
2002,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2003,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2004,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2005,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2006,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2007,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2008,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2009,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2010,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2011,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2012,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2013,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2014,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2015,0.00,6.03,7.18,0.82,2.01,9.96,4.74,0.78,0.13,0.00,0.09,0.00
2016,0.00,6.03,7.18,0.

Looks like a matrix! When we get "rectangular" data, we should be able to distinguish whether it's in (dense) matrix form, or in a relational form. 

How does that differ from the next file?

In [7]:
!head -c 1024 data/mmp.txt

"Year","Location","Station Name","OCT","NOV","DEC","JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","0.86","0.49","2.12","3.42","1.38","1.00","0.36","2.30","1.54","0.00","0.00","0.16"
"2002","CAVE JUNCTION","SOUTHERN OREGON COASTAL","","","","","","","","","","","",""
"2002","GOLD BEACH","SOUTHERN OREGON COASTAL","","","","","","","","","","","",""
"2002","GRANTS PASS KAJO","SOUTHERN OREGON COASTAL","0.61","1.21","4.19","6.31","0.24","0.77","0.58","2.02","0.87","0.00","0.00","0.20"
"2002","GREEN SPRINGS PP","SOUTHERN OREGON COASTAL","0.35","0.75","2.44","4.14","0.66","","0.26","2.59","","","0.00","0.20"
"2002","LEMOLO LAKE","SOUTHERN OREGON COASTAL","3.68","1.81","5.59","18.19","4.34","4.32","3.37","4.05","2.52","0.00","0.02","2.33"
"2002","MEDFORD","SOUTHERN OREGON COASTAL","0.65","0.24","2.86","3.43","0.51","0.74","0.46","2.50","1.20","0.00","0.00","0.05"
"2002","NORTH BEND","SOUTHERN OREGON COASTAL","2.43","2.10","9.16","7.73",

And how about this one?

In [8]:
!head -c 1024 data/mmr.txt

"Year","Location","Station Name","Month","Inches of Precipitation"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","OCT","0.86"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","NOV","0.49"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","DEC","2.12"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","JAN","3.42"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","FEB","1.38"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","MAR","1.00"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","APR","0.36"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","MAY","2.30"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","JUN","1.54"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","JUL","0.00"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","AUG","0.00"
"2002","ASHLAND","SOUTHERN OREGON COASTAL","SEP","0.16"
"2002","CAVE JUNCTION","SOUTHERN OREGON COASTAL","OCT",""
"2002","CAVE JUNCTION","SOUTHERN OREGON COASTAL","NOV",""
"2002","CAVE JUNCTION","SOUTHERN OREGON COASTAL","DEC",""
"2002","CAVE JUNCTION","SOUTHERN OREGON COASTAL","JAN",""
"2002","CAVE

## 2. Structural Transformation: From Relations to Matrices and Back
We discussed in class last time that we can convert matrices to relations, and that we can convert some conveniently-formed relations to matrices. Pivot/Unpivot can be confusing in the abstract, but it's easy to understand visually. So we'll learn it in Trifacta.

To start, let's take our matrix in `mm.txt`, and load it into Trifacta.

In [9]:
mm = tf.wrangle('data/mm.txt')





Now let's use Trifacta to convert it to a relation, with one row for each (year, month, value) triple. This is called an "UNPIVOT" operation, and you'll find an icon for it in Trifacta: it looks like an arrow swiveling down from column *headers* to a new column: <img src="files/unpivot.png"> Click on that icon and you'll get a dialog to choose the columns to unpivot. 

But for now we'll use an alternative "no-code" technique: Simply click on the header of the `OCT` column, then scroll all the way to the right and shift-click on the header of the `SEP` column. 12 columns should be highlighted in blue, *and a list of suggested transformations should pop up on the right-hand-side of the screen*. You can mouse over these to see what they would do, but eventually click on the suggestion to `Unpivot columns into rows`. You will get a split-screen preview of the result, which should look right. Click `Add` to accept the suggestion. You should now see one grid of data, transformed the way we like. You will also now see a natural-language summary of the "Recipe" (script) that Trifacta is tracking---note that it did some work automatically before we unpivoted, and included it in the recipe in case we want to override it.

Once you're done unpivoting, you can close the Trifacta tab if you like and return here---Trifacta will save your work, so you can call `mm.open()` again later. Or you can leave that tab open and click back and forth.

In [10]:
mm.open()

Opening https://tfcso.cloud.trifacta.com/data/94138/454415?minimalView=true


'https://tfcso.cloud.trifacta.com/data/94138/454415?minimalView=true'
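
For comparison, the same unpivot can be sketched in pandas, which calls the operation `melt`. The snippet below inlines the first two rows of `mm.txt`, cut down to three months for brevity; the output column names `Month` and `Inches` are choices made for this example.

```python
import io

import pandas as pd

# First two rows of mm.txt, truncated to three month columns.
mm_text = """Year,OCT,NOV,DEC
2002,0.00,6.03,7.18
2003,0.00,6.03,7.18
"""
matrix = pd.read_csv(io.StringIO(mm_text))

# UNPIVOT: the month column headers become values in a new "Month" column,
# giving one row per (Year, Month, Inches) triple.
relation = matrix.melt(id_vars="Year", var_name="Month", value_name="Inches")
print(relation)
```

Note that `id_vars` names the columns to keep as-is; every other column is unpivoted.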

So UNPIVOT translates matrices into relations. As you might expect, PIVOT translates relations into matrices! Let's do that to our result in Trifacta. Open `mm` in Trifacta again, and this time let's use the PIVOT menu item: <img src="files/pivot.png">  For variety, let's flip things so the row labels are months, and the column labels are years. 
What familiar matrix operation was achieved by this pair of UNPIVOT followed by PIVOT?


In [11]:
mm.open()

Opening https://tfcso.cloud.trifacta.com/data/94138/454415?minimalView=true


'https://tfcso.cloud.trifacta.com/data/94138/454415?minimalView=true'
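
A pandas sketch of the same PIVOT, with months as row labels and years as column labels, might look like the following. The small relation is typed in by hand from the first values of `mm.txt`; the column names are assumptions for this example.

```python
import pandas as pd

# A tiny (Year, Month, Inches) relation, as produced by an unpivot.
relation = pd.DataFrame({
    "Year": [2002, 2002, 2003, 2003],
    "Month": ["OCT", "NOV", "OCT", "NOV"],
    "Inches": [0.00, 6.03, 0.00, 6.03],
})

# PIVOT: distinct Month values become row labels,
# distinct Year values become column labels.
by_month = relation.pivot(index="Month", columns="Year", values="Inches")
print(by_month)
```

One thing to watch for: pandas sorts the new row and column labels, so the months come out alphabetically rather than in water-year order.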

### Extra Columns
Now, as an exercise, load up that more complex version of the dataset in `mmp.txt` that we looked at above. Is it a matrix or relation? What's going on? Try doing some PIVOT/UNPIVOT work in Trifacta on your own. Feel free to play with the data, and the Trifacta interface.

In [12]:
mmp = tf.wrangle('data/mmp.txt')





In [13]:
mmp.open()

Opening https://tfcso.cloud.trifacta.com/data/94139/454417?minimalView=true


'https://tfcso.cloud.trifacta.com/data/94139/454417?minimalView=true'

### Duplicate Entries and Aggregation
This time let's load up a slightly different version of this data in relational form from the `mmr.txt` file, and PIVOT it into `year`x`month` form.

In [14]:
mmr = tf.wrangle('data/mmr.txt')





Once again, we'd like to PIVOT this data into matrix form, with rows labeled by `year`, and columns labeled by `month`. Click the PIVOT icon and go to it! What problem do we face that we didn't see before? 

Well, the header above gives it away: we have many rows that have the same `(year, month)` pair, which means our PIVOT needs to pack many values into a single cell. To do this, Trifacta asks us to choose an aggregate function -- a reasonable choice might be `AVERAGE({Inches of Precipitation})`. If you prefer, Trifacta (like Postgres) actually has an aggregate function that will just store a nested list (array) of all the values in a single cell---this is the `LIST` aggregate. Play with it and see what you get!

In [15]:
mmr.open()

Opening https://tfcso.cloud.trifacta.com/data/94140/454419?minimalView=true


'https://tfcso.cloud.trifacta.com/data/94140/454419?minimalView=true'
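
The duplicate-entry problem shows up in pandas too. In this sketch, four rows are copied by hand from the data above (two locations, two months); the DataFrame name `rain` and the shortened column name `Inches` are just for illustration. Plain `pivot` refuses to guess, while `pivot_table` takes an aggregate function, and a group-by with `list` plays the role of a `LIST`-style aggregate.

```python
import pandas as pd

rain = pd.DataFrame(
    [
        ("2002", "ASHLAND", "OCT", 0.86),
        ("2002", "GRANTS PASS KAJO", "OCT", 0.61),
        ("2002", "ASHLAND", "NOV", 0.49),
        ("2002", "GRANTS PASS KAJO", "NOV", 1.21),
    ],
    columns=["Year", "Location", "Month", "Inches"],
)

# Plain pivot fails: two rows share each (Year, Month) pair.
try:
    rain.pivot(index="Year", columns="Month", values="Inches")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# Aggregate duplicates into one number per cell...
avg = rain.pivot_table(index="Year", columns="Month", values="Inches", aggfunc="mean")

# ...or keep all the values as a nested list per cell.
lists = rain.groupby(["Year", "Month"])["Inches"].agg(list).unstack("Month")

print(avg)
print(lists)
```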

### Spreadsheets
As an exercise on your own, load this data into a spreadsheet and play with PIVOT/UNPIVOT in the spreadsheet. Beware that some spreadsheets only support PIVOT but not UNPIVOT: Excel on Mac is an example. (Excel on Windows supports UNPIVOT, but you have to go into the "Power Query" interface to do it.)

### PIVOT/UNPIVOT and the Relational Model??!
We can do PIVOT/UNPIVOT in Trifacta, in Pandas, and in Spreadsheets. But can we do it in SQL?
Before we answer that question, let's go to the foundations. Can we do this in Relational Algebra?

Here the answer is a resounding *no*. Think about how we declare column values in relational algebra: we write an expression like $\pi_{c1, c2, c3}(T)$. The subscripts of the $\pi$ operator are part of the *syntax* of your relational expression---they *do not change* as the relation instance (the data in the database!) changes. 

By contrast, for PIVOT the subscript of the $\pi$ operator essentially needs to be "the set of distinct values in the relation instance", which *absolutely changes* as the relation instance changes. Similarly, UNPIVOT returns data values (an output instance) that come from the input *schema*, which isn't allowed.

*If you don't know what First Order Logic is, no sweat: you can safely skip this paragraph!*
UNPIVOT hints at what's going on here. Relational languages are based on First Order Logic, in which the EXISTS and FORALL quantifiers range over the data in the instance. By contrast, UNPIVOT somehow has a quantifier that ranges over the variable (column) names. Logics that quantify over variable names are called Second Order Logic. They are strictly more expressive than First Order Logic, and as such also more computationally complex.

Anyhow, "pure" SQL as we've learned it---an equivalent to the relational algebra---shouldn't be able to express PIVOT or UNPIVOT. However, given how useful this is, many SQL systems have (proprietary) extensions. They are often called PIVOT/UNPIVOT, though in Postgres the extension is called [crosstab](https://www.postgresql.org/docs/current/tablefunc.html).
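
To see the limitation concretely, here is a sketch of a "manual" pivot in plain SQL using conditional aggregation. It runs against an in-memory SQLite database via Python's `sqlite3` module, used here only as a convenient stand-in for Postgres; the table name `rain` and its columns are made up for this example. Notice that the output columns `oct` and `nov` have to be spelled out in the query text. If a new month value showed up in the data, the query itself would need to change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rain (year INTEGER, month TEXT, inches REAL)")
conn.executemany(
    "INSERT INTO rain VALUES (?, ?, ?)",
    [(2002, "OCT", 0.00), (2002, "NOV", 6.03),
     (2003, "OCT", 0.00), (2003, "NOV", 6.03)],
)

# Each output column is hard-coded: it is part of the query's syntax,
# not something derived from the data in the instance.
pivot_sql = """
SELECT year,
       MAX(CASE WHEN month = 'OCT' THEN inches END) AS oct,
       MAX(CASE WHEN month = 'NOV' THEN inches END) AS nov
FROM rain
GROUP BY year
ORDER BY year
"""
result = conn.execute(pivot_sql).fetchall()
print(result)  # [(2002, 0.0, 6.03), (2003, 0.0, 6.03)]
conn.close()
```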

### What about performance and scale?
Have you ever seen a table with 10 million rows? Sure! Have you ever seen one with 10 million columns? 

Many database systems and other data tools scale wonderfully in the number of rows, but lay an egg if you generate too many columns, or simply prevent you from doing so. There are exceptions to this rule, but it's a very common limitation. It's also sensible from a user experience perspective: wide tables are unwieldy to query (imagine almost any SELECT clause other than `SELECT *`!) Some tools like Trifacta will truncate your PIVOTs (and related operators that generate new columns) to prevent disasters, but the result is often non-deterministic in terms of which columns get added and which don't.


Mathematically, "wide matrices" can also be unattractive. A wide matrix is basically a set of *very* high-dimensional vectors, and high dimensionality is traditionally hard in statistics (the so-called [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)). However in Machine Learning in recent years, high dimensionality has become more useful---in part because massive real-world data sets aren't uniformly distributed in the high-dimensional spaces that cause the curse. In essence, most real-world data has some lower-dimensional embedding latent in it, and modern ML techniques can kind of find it. We will come back to this point later.

## 3. Type Induction and Coercion
To begin let's review "statistical" data types. This is a slight refinement from the terms in DS100:
- *nominal* / *categorical*: types that have no inherent ordering, used as names for categories
- *ordinals*: types that are used to encode order. Typically the integers $1, 2, \ldots$
- *cardinals*: types that are used to express cardinality ("how many"). Typically the integers $0,1,\ldots$. Cardinals are common as the output of statistics (frequencies).
- *numerical* / *measures*: types that capture a numerical value or measurement. Typically a real (floating point) number.

### Data types in the wild
We've seen that some systems like databases keep data types as metadata, and enforce strong typing in storage when data is inserted or modified. Used carefully, databases will carry the data-type metadata along with the data when they communicate with tools or other databases. 

But it's very, very common to work with data that has little or no metadata. In that case, we *have* to interpret the data somehow. As a very first step, we need to guess ("induce") types for the data.

### Techniques for Type Induction
Suppose I give you a column of potentially dirty data. Suppose you have a set of candidate types $H$. You need to write an algorithm to choose a type. How does it choose?
- "Hard" Rules: E.g. Occam's razor. 
  - Try types from most- to least-specific. (e.g. boolean, int, float, string)
  - Choose the first one that matches *all* the values.
- Minimum Description Length (MDL): See below
- Classification (i.e. Supervised Learning): You know how this goes.
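
The "hard rules" approach is simple enough to sketch in a few lines of Python. The type names and the `matches` / `induce_type` helpers below are invented for this illustration, and the values are assumed to arrive as strings, as they would from a text file.

```python
# Candidate types, ordered from most to least specific.
TYPES = ["boolean", "int", "float", "string"]


def matches(value: str, type_name: str) -> bool:
    """Return True if the string `value` can be read as `type_name`."""
    if type_name == "boolean":
        return value.strip().lower() in {"true", "false"}
    if type_name == "int":
        try:
            int(value)
            return True
        except ValueError:
            return False
    if type_name == "float":
        try:
            float(value)
            return True
        except ValueError:
            return False
    return True  # every value is a valid string


def induce_type(column: list[str]) -> str:
    """Pick the first (most specific) type that matches *all* values."""
    for type_name in TYPES:
        if all(matches(v, type_name) for v in column):
            return type_name
    return "string"


print(induce_type(["2002", "2003"]))          # int
print(induce_type(["0.86", "0.49", "2.12"]))  # float
print(induce_type(["0.86", "M", "2.12"]))     # string
```

The last call shows the weakness of hard rules: a single stray `M` demotes the whole column to string.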

#### MDL
Intuition is similar to Occam's razor, but accounts for the "weight" or "penalty" of encoding exceptions to the best type. The "fitness" of a type to some data is the description length of the data using that type---including the cost of "explicitly" storing the data that doesn't fit the type as a string. Let's say $\mathrm{len}(v)$ is the bit-length for encoding of a value $v$ "explicitly". Given a type $T$ with $|T|$ distinct values, the bit-length of encoding a value in that type is $\log_2|T|$. (E.g. there are $2^{64}$ 64-bit integers, and each one is $\log_2(2^{64})=64$ bits long.)

Let's say that indicator variable $I_T(v) = 1$ if $v \in T$, and $0$ otherwise. 

For MDL, we choose the type that minimizes the description length for the set of data $c$ in a column:

$$\min_{T \in H} \sum_{v \in c}\left(I_T(v)\log_2|T| + (1-I_T(v))\,\mathrm{len}(v)\right)$$

Consider a "column" of values: $\{\mbox{'Joe'}, 2, 12, 4750\}$. Assume the default type is "string", which costs us 8 bits per character. 
- We can encode this as 3 16-bit integers and 'Joe': length is $3*16 + 3*8 = 72$
- Or we can encode it all as strings: $(3 + 1 + 2 + 4)*8 = 80$.

MDL would favor "int16" over "string" in this example. 
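
The arithmetic in this example can be checked with a short Python sketch. The `CANDIDATES` table and helper names are made up for illustration; "string" is modeled as the default type in which every value is stored explicitly at 8 bits per character, as assumed above.

```python
import math


def string_bits(v) -> int:
    """Cost of storing a value explicitly as text: 8 bits per character."""
    return 8 * len(str(v))


# Each candidate type: (number of distinct values |T|, membership test).
# "string" is the fallback: nothing is "in" it, so everything is stored explicitly.
CANDIDATES = {
    "int16": (2**16, lambda v: isinstance(v, int) and -2**15 <= v < 2**15),
    "string": (None, lambda v: False),
}


def description_length(column, type_name):
    """Sum of log2|T| for values in the type, len(v) for the exceptions."""
    size, is_member = CANDIDATES[type_name]
    total = 0
    for v in column:
        total += math.log2(size) if is_member(v) else string_bits(v)
    return total


column = ["Joe", 2, 12, 4750]
costs = {name: description_length(column, name) for name in CANDIDATES}
best = min(costs, key=costs.get)
print(costs, best)
```

Running this gives 72 bits for `int16` and 80 bits for `string`, matching the hand calculation.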

Note that one can enhance MDL in various ways. One approach that's interesting to consider is to induce *compound* types: i.e. the string "12/31/2021" could be *string* or *date* or a compound type like *int4 '/' int8 '/' int16*. Another approach is to use compression techniques to get tighter measures for the length of encoding---for both type-matches and for strings.

#### In practice
Some systems will break if the chosen type doesn't fit all the data in the column, in which case they'll choose a "hard rules" approach. For systems that can handle a mix of types, something like MDL is not unusual, though it may be more naive (e.g. pick the type that matched the largest number of entries).

### Type Coercion/Casting
You can explicitly set the type of a column. If the column has values that don't match the type, you will have to live with a lossy representation of those values (often NULL).

Typecasting can be useful for ensuring that a system treats your type right *statistically*. For example, ID columns are often arbitrary integers. These are not really numeric columns, they're categorical. To ensure the system/processes don't get confused, you can cast them to strings.
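
Both situations can be sketched in pandas. The series below are toy data: `raw` mimics a precipitation column with an `M` placeholder for a missing reading, and `ids` is a made-up ID column.

```python
import pandas as pd

# Coercing to a numeric type: values that don't fit become NaN (lossy!).
raw = pd.Series(["0.86", "M", "2.12"])
inches = pd.to_numeric(raw, errors="coerce")
print(inches)

# Casting integer IDs to strings so nobody averages them by accident.
ids = pd.Series([1001, 1002, 1003])
id_labels = ids.astype(str)
print(id_labels)
```

With `errors="coerce"`, the original `M` is gone after the cast, so keep the raw column around if the reason for the missing value matters.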

## 4. String Manipulation
I don't have a lot to add here that wasn't already covered in DS100 and SQL.

- Typical transforms include the following (names may vary across systems/DSLs):
  - **Split** a string into separate rows/columns
      - Often by position or delimiter
      - Sometimes via parsing: e.g. counting nested parentheses (e.g. JSON/XML rowsplits)
  - **CountMatches**: Create a new column with the count of matches of a pattern in a string column
  - **Extract**: create a column of substrings derived from another column 
  - **Replace**: a (sub)string with a constant, a "captured group", or any string formula (e.g. lowercase, trim, etc)
- Facility with regular expressions goes a VERY long way.
- All of these can be done directly in SQL
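
As a quick refresher, here is what the four transforms look like with Python's built-in `re` module (rather than SQL), applied to strings taken from the Jeopardy file we unboxed earlier. The variable names are just for this example.

```python
import re

game = "Show #5883 - Wednesday March 24 2010"

# Split: by delimiter, into separate pieces.
show, date = game.split(" - ")

# CountMatches: how many runs of digits appear in the string?
digit_runs = len(re.findall(r"\d+", game))

# Extract: pull out the show number with a capture group.
show_number = re.search(r"#(\d+)", game).group(1)

# Replace: drop the dollar sign, keeping the captured digits.
score = re.sub(r"\$(\d+)", r"\1", "$19401")

print(show, "|", date, "|", digit_runs, "|", show_number, "|", score)
```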


## Cleaning The Real Data
The previous few files were the results of wrangling a raw dataset. Let's look at that dataset now! It's a scrape of [rainfall data](https://www.cnrfc.noaa.gov/monthly_precip_2020.php) from the website of the National Oceanic and Atmospheric Administration (NOAA).

In [16]:
!head -c 1024 data/mpf.txt

2002, 'SOUTHERN OREGON COASTAL'
2002, 'ID', 'Location', 'OCT', 'NOV', 'DEC', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'WY to Date', 'Pct Avg to Date', 'Pct Tot WY'
2002, 'ASHO3', 'ASHLAND', '0.86', '0.49', '2.12', '3.42', '1.38', '1.00', '0.36', '2.30', '1.54', '0.00', '0.00', '0.16', '13.63', '68', '68'
2002, 'CVJO3', 'CAVE JUNCTION', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M'
2002, 'GOLO3', 'GOLD BEACH', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M'
2002, 'GPSO3', 'GRANTS PASS KAJO', '0.61', '1.21', '4.19', '6.31', '0.24', '0.77', '0.58', '2.02', '0.87', '0.00', '0.00', '0.20', '17.00', '55', '55'
2002, 'GSPO3', 'GREEN SPRINGS PP', '0.35', '0.75', '2.44', '4.14', '0.66', 'M', '0.26', '2.59', 'M', 'M', '0.00', '0.20', 'M'
2002, 'LEMO3', 'LEMOLO LAKE', '3.68', '1.81', '5.59', '18.19', '4.34', '4.32', '3.37', '4.05', '2.52', '0.00', '0.02', '2.33', '50.22', '76', '76'
2002, 'MFR', 'MEDFORD', '0.65', '0.24', '2.86', '3.

Looks kinda messy. You can play with it in bash or pandas if you like. Since it's pretty nasty we'll clean it up in Trifacta. This will illustrate some of the benefits of having tooling at hand that combines live visualization with AI-based recommendations (software synthesis) to speed you on your way. In essence, we are leveraging computer science techniques (HCI, AI, Database languages) to reduce the work needed to do data science!
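
If you do want to try it by hand first, here is one possible starting point in plain Python. It is a sketch under several assumptions: the sample is typed in from the first lines of the output above and cut down to three months (the real file has twelve months plus trailing summary columns you would need to slice off), rows with two fields are taken to be region headers, rows whose second field is `ID` are taken to be column headers, and `M` is taken to mean "missing".

```python
import csv

raw = """\
2002, 'SOUTHERN OREGON COASTAL'
2002, 'ID', 'Location', 'OCT', 'NOV', 'DEC'
2002, 'ASHO3', 'ASHLAND', '0.86', '0.49', '2.12'
2002, 'CVJO3', 'CAVE JUNCTION', 'M', 'M', 'M'
"""

records = []
region, months = None, []
reader = csv.reader(raw.splitlines(), quotechar="'", skipinitialspace=True)
for fields in reader:
    if not fields:
        continue
    if len(fields) == 2:
        # Region header row: remember it and attach it to later rows.
        region = fields[1]
    elif fields[1] == "ID":
        # Column header row: everything after 'ID', 'Location' is a month.
        months = fields[3:]
    else:
        # Data row: emit one (year, location, region, month, inches) record.
        year, _station_id, location, *values = fields
        for month, value in zip(months, values):
            inches = None if value == "M" else float(value)
            records.append((year, location, region, month, inches))

for rec in records:
    print(rec)
```

This file is an untagged union of three row types, so the code has to classify each row by its content before it can do anything else.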

In [17]:
mpf = tf.wrangle('data/mpf.txt')





In [18]:
mpf.open()

Opening https://tfcso.cloud.trifacta.com/data/94141/454421?minimalView=true


'https://tfcso.cloud.trifacta.com/data/94141/454421?minimalView=true'