*Note: This notebook is the slide-oriented version; a [fully literate (wordy) version](https://drive.google.com/file/d/1GU-JXlpCT85dqucPieoJxErsKRZ7xN7L/view?usp=sharing) of the notebook is also available. This notebook was designed to be viewed as slides via the [RISE](https://rise.readthedocs.io/en/stable/) notebook extension!*

# Data Preparation I

Data preparation is a *very* broad and central subject in Data Science.

In today's first lecture we'll cover a few key topics:
1. Data "Unboxing"
2. Structural Data Transformations
3. Type Induction/Coercion
4. String Manipulation

## 1. "Unboxing" Data
Recall some basic tools:
- `du -h`: show the disk size of a file in human-readable form
- `head -c 1024`: show the first 1024 bytes of a file
- your eyes

### Assessing Structure
1. Look out for header info, metadata, comments, etc.
2. Most files you'll run across fall into one of these categories:
  1. **Record per line**: newline-delimited rows of uniform symbol-delimited data.
    - E.g. CSV and TSV files
    - Also newline-delimited rows of uniform but ad-hoc structured text
  2. **Dictionaries/Objects**: explicit key:value pairs, may be nested! Two common cases:
    - Object-per-line: e.g. newline-delimited rows of JSON, XML, YAML, etc. (JSON in this format is sometimes called [JSON Lines or JSONL](https://jsonlines.org/)).
    - Complex object: the entire dataset is one fully-nested JSON, XML or YAML object
  

  3. **Unions**: a mixture of rows from *k* distinct schemas. Two common cases:
    - *Tagged Unions*: each row has an ID or name identifying its schema. Often the tag is in the first column.
    - *Untagged Unions*: the schema for the row must be classified by its content
  4. **Natural Language (prose)**: intended for human consumption.
  5. **Everything else**: A long tail of file formats! If a file is not readable as text, some commercial or open source tool can likely translate it.
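
As a rough illustration of telling the first two categories apart, here is a minimal Python sketch. The function name `guess_structure` and the list of candidate delimiters are assumptions made for this example. The delimiter check simply counts characters and ignores quoting, so treat the answer as a first guess, not a parser.

```python
import json


def guess_structure(sample, delimiters=",\t|;"):
    """Guess the structure of a text sample (hypothetical helper, heuristic only)."""
    lines = [ln for ln in sample.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    # Object-per-line: every non-blank line parses as a JSON object.
    try:
        if all(isinstance(json.loads(ln), dict) for ln in lines):
            return "object-per-line"
    except json.JSONDecodeError:
        pass
    # Record-per-line: some delimiter occurs the same nonzero number of times on every line.
    for d in delimiters:
        counts = {ln.count(d) for ln in lines}
        if len(counts) == 1 and counts != {0}:
            return f"record-per-line (delimiter {d!r})"
    return "other"


csv_guess = guess_structure("Year,OCT,NOV\n2002,1.5,2.5\n2003,3.5,4.5\n")
json_guess = guess_structure('{"a": 1}\n{"a": 2}\n')
prose_guess = guess_structure("Once upon a time.\nThe end, really.\n")
```

If the sample comes from `head -c`, the last line is probably cut off mid-record, so it is worth dropping it before calling a check like this.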

### Text formats
Be aware of:
- EBCDIC vs. ASCII.
- Multibyte character encodings 😟: Unicode, UTF-8 and more. You can search the web for resources on these issues.

You can ignore these issues for this class (and often, but not always, in life).
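
To make these issues concrete, here is a small Python sketch using only built-in codecs. The byte strings are toy values chosen for illustration: a UTF-8 byte-order mark in front of a header name, a name with an accented character, and the letter "A" in ASCII versus one EBCDIC code page (`cp037`).

```python
# A UTF-8 byte-order mark (BOM) followed by the start of a CSV header.
raw = b"\xef\xbb\xbf_input,_num"
with_bom = raw.decode("utf-8")         # the BOM survives as the character U+FEFF
without_bom = raw.decode("utf-8-sig")  # the "-sig" codec strips a leading BOM

# Multibyte encoding: one character can take more than one byte.
name = "Honoré"
char_count = len(name)
byte_count = len(name.encode("utf-8"))  # 'é' takes two bytes in UTF-8

# The same letter has different byte values in ASCII and EBCDIC (code page 037).
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")
```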

### Examples
Without further ado, let's unbox some data!

In [1]:
!du -h data/*

  0B	data/Icon
4.0K	data/README.md
8.0K	data/flow_CalDataEng_Example.zip
1.3M	data/jc1.txt
 50M	data/jq2.txt
4.0K	data/mm.txt
600K	data/mmp.txt
4.3M	data/mmr.txt
1.1M	data/monthly_precip_full.csv
628K	data/mpf.txt
4.0K	data/simple_scrape.py


In [2]:
!head -c 2048 data/jc1.txt

﻿_input,_num,_widgetName,_source,_resultNumber,_pageUrl,game_number,bio1,bio2,bio3,contestant1_name,contestant1_score,contestant2_name,contestant2_score,contestant3_name,contestant3_score
,1,Jeopardy_Winners,Jeopardy_Winners,1,http://www.j-archive.com/showgame.php?game_id=3350,Show #5883 - Wednesday March 24 2010,Derek Honoré an attorney from Inglewood California,Tatiana Walton a graphic designer from Cutler Bay Florida,Regina Robbins an arts teacher from New York New York (whose 1-day cash winnings total $38500),Regina,$19401,Tatiana,$7100,Derek,$11900
,2,Jeopardy_Winners,Jeopardy_Winners,2,http://www.j-archive.com/showgame.php?game_id=4400,Show #6756 - Monday January 20 2014,Jon McGuire a software-development manager from Matthews North Carolina,Blake Perkins an environmental scientist from Baton Rouge Louisiana,Sarah McNitt a study abroad adviser originally from Ann Arbor Michigan (whose 4-day cash winnings total $69199),Sarah,$20199,Blake,$0,Jon,$8380
,3,Jeopardy_Winners,Jeopard

What category of data is the file above? Any observations about the data?

Let's look at another:

In [3]:
!head -c 1024 data/jq2.txt

{"category":"HISTORY","air_date":"2004-12-31","question":"'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'","value":"$200","answer":"Copernicus","round":"Jeopardy!","show_number":"4680"}
{"category":"ESPN's TOP 10 ALL-TIME ATHLETES","air_date":"2004-12-31","question":"'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'","value":"$200","answer":"Jim Thorpe","round":"Jeopardy!","show_number":"4680"}
{"category":"EVERYBODY TALKS ABOUT IT...","air_date":"2004-12-31","question":"'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'","value":"$200","answer":"Arizona","round":"Jeopardy!","show_number":"4680"}
{"category":"THE COMPANY LINE","air_date":"2004-12-31","question":"'In 1963, live on \"The Art Linkletter Show\", this company served its billionth burger'","value":"$200","answer":"McDonald\\'s","round":"Jeopardy!","show_number":"4680"}
{"c

What do you see this time? Category? Interesting features of the data?

Keep in mind: this is *data visualization*! 

### Richer Visualizations and Low-Code Interaction in Trifacta
- We *could* use pandas and matplotlib/seaborn/plotly/etc. I.e. we could write code.
- Instead, try a purpose-built visual data preparation tool: [Trifacta](https://cloud.trifacta.com), also known as [Google Cloud Dataprep](https://cloud.google.com/dataprep). It is a visual environment for "low-code" dataprep that commercializes joint research at Berkeley and Stanford in the [Wrangler](http://vis.stanford.edu/wrangler/) and [Potter's Wheel](http://control.cs.berkeley.edu/abc) projects.

- You can interact with Trifacta in its Web UI, or via the `trifacta` Python library.
- `wf = tf.wrangle(path)` uploads the file at `path` and returns a "wrangle flow" object from Trifacta.
- The `wf.open()` method opens the Trifacta UI in a new browser tab.
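
For comparison, the "write code" route mentioned above might start with something like the following sketch, which profiles each field of some JSON Lines records with plain Python. The records are toy data abridged from the `jq2.txt` output shown earlier (the third record is invented for the example), and the name `profile` is an assumption of this sketch.

```python
import json
from collections import Counter

# Toy JSON Lines sample; the real file has more fields per record.
sample = """{"category":"HISTORY","value":"$200","round":"Jeopardy!"}
{"category":"THE COMPANY LINE","value":"$200","round":"Jeopardy!"}
{"category":"HISTORY","value":"$400","round":"Jeopardy!"}
"""

records = [json.loads(line) for line in sample.splitlines()]

# For each field, count how often each distinct value occurs:
# a text-only stand-in for a per-column histogram.
profile = {key: Counter(r[key] for r in records) for key in records[0]}
```

Each `Counter` holds the same information a per-column bar chart would show, just without the picture.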

In [4]:
import trifacta as tf
import pandas as pd

In [5]:
jc1 = tf.wrangle('data/jq2.txt')

In [6]:
jc1.open()

Opening https://tfcso.cloud.trifacta.com/data/97476/490004?minimalView=true


'https://tfcso.cloud.trifacta.com/data/97476/490004?minimalView=true'

What do you see in Trifacta that's additional to `head -c`?

### Let's Look at More Files!

OK, moving on to another file. How would you describe this one?

In [7]:
import trifacta as tf
import pandas as pd

In [8]:
!head -c 1024 data/mm.txt

Year,OCT,NOV,DEC,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP
2002,545.9200000000003,3115.08,3996.760000000001,1815.7399999999998,1204.1399999999994,1644.020000000001,795.9200000000001,540.24,112.61999999999999,79.52000000000002,22.200000000000003,171.70000000000016
2003,55.41000000000004,1242.2300000000007,2976.9399999999973,797.7199999999999,836.0099999999996,1026.1100000000004,1571.270000000001,468.59000000000026,24.930000000000003,98.33000000000001,267.40000000000015,99.19999999999999
2004,55.900000000000034,834.4000000000009,2311.719999999997,942.7500000000002,2019.2199999999987,399.51999999999964,339.17999999999995,251.6400000000001,72.37999999999998,55.569999999999986,116.74000000000007,97.48000000000002
2006,347.2199999999998,908.4399999999991,2981.1600000000017,1793.970000000001,995.2699999999998,2031.1899999999987,1602.5499999999997,287.2099999999997,102.43999999999994,90.31000000000002,18.749999999999986,33.75999999999999
2005,1449.23,619.7699999999999,1789.9299999999998,1777.23

How does that differ from the next file?

In [9]:
!head -c 1024 data/mmp.txt

Year,ID,Location,Station,OCT,NOV,DEC,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,12.86,29.06,34.64,34.64,18.2,12.1,13.24,7.3,7.36,0.04,0.06,2.9
2002,ASHO3,ASHLAND,SOUTHERN OREGON COASTAL,0.76,7,6.82,2.64,2.58,2.58,5.84,3.76,0.16,0.56,0,0.4
2002,COPO3,COPPER 4NE,SOUTHERN OREGON COASTAL,0.58,13.36,13.96,6.84,3.98,3.6,0,0,0,0,0,0.54
2002,CVJO3,CAVE JUNCTION,SOUTHERN OREGON COASTAL,4.92,27.2,29.62,19.52,12.92,9.26,3.88,1.78,0,0,0,0.66
2002,GOLO3,GOLD BEACH,SOUTHERN OREGON COASTAL,9.26,23.44,33.18,29.16,17.78,13.24,9.46,3,4.18,0.04,0,1.24
2002,GPSO3,GRANTS PASS KAJO,SOUTHERN OREGON COASTAL,0.78,12.74,13.88,8.62,5.78,2.72,1.74,1.24,0.04,0,0,0.1
2002,GSPO3,GREEN SPRINGS PP,SOUTHERN OREGON COASTAL,0.72,9.58,11.8,5.04,2.94,3.48,4.82,2.3,0.22,0.02,0,0.4
2002,LEMO3,LEMOLO LAKE,SOUTHERN OREGON COASTAL,10.26,20.6,25.44,22.96,9.98,12.64,10.98,2.28,4.44,0.8,0.4,4.26
2002,MFR,MEDFORD,SOUTHERN OREGON COASTAL,0.38,8.32,8.68,3.18,3.3,2.66,2.98,1.06,0.06,0.16,0,

And how about this one?

In [10]:
!head -c 1024 data/mmr.txt

Year,ID,Location,Station,Month,Inches of Precipitation
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,OCT,6.43
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,NOV,14.53
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,DEC,17.32
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,JAN,17.32
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,FEB,9.1
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,MAR,6.05
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,APR,6.62
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,MAY,3.65
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,JUN,3.68
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,JUL,0.02
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,AUG,0.03
2002,4BK,BROOKINGS,SOUTHERN OREGON COASTAL,SEP,1.45
2002,ASHO3,ASHLAND,SOUTHERN OREGON COASTAL,OCT,0.38
2002,ASHO3,ASHLAND,SOUTHERN OREGON COASTAL,NOV,3.5
2002,ASHO3,ASHLAND,SOUTHERN OREGON COASTAL,DEC,3.41
2002,ASHO3,ASHLAND,SOUTHERN OREGON COASTAL,JAN,1.32
2002,ASHO3,ASHLAND,SOUTHERN OREGON COASTAL,FEB,1.29
2002,ASHO3,ASHLAND,SOUTHERN OREGON COASTAL

## 2. Structural Transformation: From Relations to Matrices and Back
- Matrix $\rightarrow$ Relational works.
- Relational $\rightarrow$ Matrix sometimes works!
- But how?

To start, let's take our matrix in `mm.txt`, and load it into Trifacta.

In [11]:
mm = tf.wrangle('data/mm.txt')

- Matrix $\rightarrow$ Relation: UNPIVOT <img src="files/unpivot.png">

In [12]:
mm.open()

Opening https://tfcso.cloud.trifacta.com/data/97477/490006?minimalView=true


'https://tfcso.cloud.trifacta.com/data/97477/490006?minimalView=true'

- Relation $\rightarrow$ Matrix: PIVOT <img src="files/pivot.png">
- PIVOT(UNPIVOT) = ??

In [13]:
mm.open()

Opening https://tfcso.cloud.trifacta.com/data/97477/490006?minimalView=true


'https://tfcso.cloud.trifacta.com/data/97477/490006?minimalView=true'
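
Outside Trifacta, the same round trip can be sketched in pandas, where `melt` plays the role of UNPIVOT and `pivot` the role of PIVOT. The tiny matrix below is made-up data in the shape of `mm.txt`, and the column names `Month` and `Precip` are assumptions of this sketch.

```python
import pandas as pd

# Toy matrix: one row per year, one column per month (values are invented).
matrix = pd.DataFrame(
    {"Year": [2002, 2003], "OCT": [5.0, 1.0], "NOV": [3.0, 2.0], "DEC": [4.0, 6.0]}
)

# UNPIVOT: the month column names become data values in a new "Month" column.
relation = matrix.melt(id_vars="Year", var_name="Month", value_name="Precip")

# PIVOT: the data values in "Month" become column names again.
back = relation.pivot(index="Year", columns="Month", values="Precip")

# Select the months in their original order, then turn the Year index back into a column.
back = back[["OCT", "NOV", "DEC"]].reset_index()
back.columns.name = None
```

In this toy case the round trip recovers the original matrix once the column order is restored by hand.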

### Extra Columns
Let's go back to `mmp.txt`. 
- Matrix or relation? 
- Try doing some PIVOT/UNPIVOT work in Trifacta on this.

In [14]:
mmp = tf.wrangle('data/mmp.txt')

In [15]:
mmp.open()

Opening https://tfcso.cloud.trifacta.com/data/97478/490008?minimalView=true


'https://tfcso.cloud.trifacta.com/data/97478/490008?minimalView=true'

### Duplicate Entries and Aggregation
Now let's take the relational version in the `mmr.txt` file and PIVOT it into `year`x`month` form.

In [16]:
mmr = tf.wrangle('data/mmr.txt')

What do we need to do here that we didn't before? What are our choices?

In [17]:
mmr.open()

Opening https://tfcso.cloud.trifacta.com/data/97479/490010?minimalView=true


'https://tfcso.cloud.trifacta.com/data/97479/490010?minimalView=true'
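
A pandas sketch of the same situation, assuming a toy relation in the shape of `mmr.txt` with two stations per year and month: a plain `pivot` on year and month has more than one value per cell and fails, while `pivot_table` asks us to pick an aggregate. The choice of `sum` and `mean` here is only illustrative.

```python
import pandas as pd

# Toy relation: two stations report for the same (Year, Month) pairs.
relation = pd.DataFrame(
    {
        "Year": [2002, 2002, 2002, 2002],
        "ID": ["4BK", "4BK", "ASHO3", "ASHO3"],
        "Month": ["OCT", "NOV", "OCT", "NOV"],
        "Precip": [6.43, 14.53, 0.38, 3.50],
    }
)

# A plain pivot cannot decide which of the duplicate entries goes into a cell.
try:
    relation.pivot(index="Year", columns="Month", values="Precip")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table resolves duplicates with an explicit aggregate function.
by_sum = relation.pivot_table(index="Year", columns="Month", values="Precip", aggfunc="sum")
by_mean = relation.pivot_table(index="Year", columns="Month", values="Precip", aggfunc="mean")
```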

### Spreadsheets
- You should try this in your favorite spreadsheet; you may see some differences.
- Beware: there may be no UNPIVOT, or it may be hard to find/use

### PIVOT/UNPIVOT and the Relational Model??!
- In SQL? How about in Relational Algebra?
- Consider how we get "output" column names: $\pi_{c1, c2, c3}(T)$. What's true about the subscripts of $\pi$?
- By contrast, where do output column names come from in PIVOT?
- UNPIVOT: quantifier over column names (variables). This is *second-order logic*. Relational languages are based on first-order logic (quantifiers over data).
- So *no* PIVOT/UNPIVOT in "pure" SQL
- BUT: most DBMSs extend SQL to do it. E.g. in Postgres, the [`tablefunc`](https://www.postgresql.org/docs/current/tablefunc.html) extension provides a `crosstab` function.
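
To see where output column names come from in plain SQL, here is a small sketch that runs a hand-written "pivot" through Python's built-in `sqlite3` module. The table `precip` and its values are made up for the example. Note that `OCT` and `NOV` have to be typed into the query text: the query cannot discover them from the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE precip (year INTEGER, month TEXT, inches REAL)")
conn.executemany(
    "INSERT INTO precip VALUES (?, ?, ?)",
    [(2002, "OCT", 6.0), (2002, "NOV", 14.5), (2003, "OCT", 1.0), (2003, "NOV", 2.5)],
)

# Each output column is spelled out by hand; a new month in the data
# would not produce a new column unless the query text is edited.
query = """
SELECT year,
       SUM(CASE WHEN month = 'OCT' THEN inches END) AS OCT,
       SUM(CASE WHEN month = 'NOV' THEN inches END) AS NOV
FROM precip
GROUP BY year
ORDER BY year
"""
cursor = conn.execute(query)
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
conn.close()
```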

### What about performance and scale?
- Have you ever seen a table with 10 million rows? 
- Have you ever seen a table with 10 million columns? 

- Performance
- Usability/Queryability
- Statistics: the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) (and the blessing?!)

## 3. Type Induction and Coercion
To begin, let's review "statistical" data types. This is a slight refinement of the terms used in DS100:
- *nominal* / *categorical*: types that have no inherent ordering, used as names for categories
- *ordinals*: types that are used to encode order. Typically the integers $1, 2, \ldots$
- *cardinals*: types that are used to express cardinality ("how many"). Typically the integers $0,1,\ldots$. Cardinals are common as the output of statistics (frequencies).
- *numerical* / *measures*: types that capture a numerical value or measurement. Typically a real (floating point) number.

### Data types in the wild
- Some systems (DBMSs) enforce/export types
- Very, very common to work with data that has little or no metadata. Must interpret the data somehow! Type "induction".

### Techniques for Type Induction
- Given: a column $c$ of potentially "dirty" data values
- A set of types H. 
- You need to write an algorithm to choose a type. How does it choose?

- "Hard" Rules: E.g. Occam's razor. 
  - Try types from most- to least-specific. (e.g. boolean, int, float, string)
  - Choose the first one that matches *all* the values.
- Minimum Description Length (MDL): See below
- Classification (i.e. Supervised Learning): You know how this goes.
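
The "hard rules" approach above can be sketched in a few lines of Python. The helper names and the exact candidate list (boolean, int, float, string) are assumptions of this sketch, and real systems recognize more types (dates, for example).

```python
def is_bool(text):
    return text.strip().lower() in {"true", "false"}


def is_int(text):
    try:
        int(text)
        return True
    except ValueError:
        return False


def is_float(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


# Candidate types, ordered from most to least specific.
CANDIDATES = [
    ("boolean", is_bool),
    ("int", is_int),
    ("float", is_float),
    ("string", lambda text: True),  # every value is a valid string
]


def induce_type(column):
    """Return the first candidate type that matches *all* values in the column."""
    for name, matches in CANDIDATES:
        if all(matches(value) for value in column):
            return name


int_guess = induce_type(["1", "2", "3"])
float_guess = induce_type(["1", "2.5"])
dirty_guess = induce_type(["1", "M"])
```

A single dirty value such as `'M'` is enough to push the whole column down to "string", which is the weakness MDL tries to address.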

#### MDL
- Like Occam's razor, but account for the "weight" or "penalty" of encoding exceptions. 
- Say $\mathrm{len}(v)$ is the bit-length for encoding a value $v$ "explicitly".
- Given a type $T$ with $|T|$ distinct values, the bit-length of encoding a value in that type is $\log_2|T|$.
- Let's say that indicator variable $I_T(v) = 1$ if $v \in T$, and $0$ otherwise.

$$\mbox{MDL} = \min_{T \in H} \sum_{v \in c}\left(I_T(v)\log_2|T| + (1-I_T(v))\,\mathrm{len}(v)\right)$$

- Example: $\{\mbox{'Joe'}, \mbox{'Aditya'}, 12, 4750\}$. Assume the default type is "string", which costs us 8 bits per character.
- 16-bit integers: the two numbers cost 16 bits each; the two names are exceptions, encoded as strings.

In [18]:
c = ['Joe', 'Aditya', '12', '4750']
# The two names are exceptions at 8 bits per character; the two numbers fit in 16 bits each.
len('JoeAditya')*8 + 2*16

104

- All strings: 

In [19]:
c = ['Joe', 'Aditya', '12', '4750']
# Every value is encoded as a string at 8 bits per character.
sum([len(x)*8 for x in c])

120
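
The two hand computations above can be folded into one function that follows the MDL sum term by term. This is a sketch under the same assumptions as the example (8 bits per character for explicit values, base-2 logarithm for the type cost); the names `mdl_cost` and `is_int16` are made up for illustration.

```python
import math


def mdl_cost(column, in_type, type_size, bits_per_char=8):
    """Bits needed to encode `column` under one candidate type."""
    type_bits = math.log2(type_size)  # cost of a value that belongs to the type
    total = 0
    for v in column:
        if in_type(v):
            total += type_bits               # I_T(v) = 1
        else:
            total += len(v) * bits_per_char  # I_T(v) = 0: encode explicitly
    return total


def is_int16(v):
    try:
        return -2**15 <= int(v) < 2**15
    except ValueError:
        return False


c = ['Joe', 'Aditya', '12', '4750']
int16_cost = mdl_cost(c, is_int16, 2**16)
# Under the default "string" type every value is encoded explicitly.
string_cost = mdl_cost(c, lambda v: False, 1)
best = min([("int16", int16_cost), ("string", string_cost)], key=lambda pair: pair[1])
```

With these inputs the 16-bit integer hypothesis is cheaper, matching the 104 versus 120 bits computed above.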

#### In practice
- Many systems *need* hard rules
- Others lean MDLish

### Type Coercion/Casting
- Can be done. You may lose data fidelity (e.g. set to NULL)
- Can be useful. E.g. for *statistically* correct treatment, convert IDs to string rather than integer.
- Also useful for smoothing (e.g. quantization)
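
A small plain-Python sketch of the three points above. The raw strings mimic the padded values and the `'M'` marker seen in the scraped rainfall file, the numeric station IDs are hypothetical, and rounding stands in for quantization.

```python
def to_float_or_none(text):
    """Cast to float; values that do not parse become None (our stand-in for NULL)."""
    try:
        return float(text)
    except ValueError:
        return None


raw = ['   6.43', 'M', '  14.53']
inches = [to_float_or_none(v) for v in raw]  # the 'M' marker is lost in the cast

# Hypothetical numeric IDs: casting to string prevents accidental arithmetic on them.
station_ids = [str(i) for i in [101, 102]]

# Quantization as a crude form of smoothing: round to whole inches.
quantized = [None if v is None else round(v) for v in inches]
```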

## 4. String Manipulation
We covered a lot in DS100 and SQL already.

- Typical transforms include the following (names may vary across systems/DSLs):
  - **Split** a string into separate rows/columns
      - Often by position or delimiter
      - Sometimes via parsing: e.g. counting nested parentheses (e.g. JSON/XML rowsplits)
  - **CountMatches**: Create a new column with the count of matches of a pattern in a string column
  - **Extract**: create a column of substrings derived from another column 
  - **Replace**: a (sub)string with a constant, a "captured group", or any string formula (e.g. lowercase, trim, etc)

- Facility with regular expressions goes a VERY long way.
  - Most languages/tools support regex
- All of the string manipulation can be done directly in SQL
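
As a quick illustration of the four transforms, here is a sketch using Python's `re` module on one `game_number` value from `jc1.txt`. The variable names are assumptions of this sketch.

```python
import re

game = "Show #5883 - Wednesday March 24 2010"

# Split: by delimiter.
split_parts = game.split(" - ")

# CountMatches: how many runs of digits appear in the string?
digit_runs = len(re.findall(r"\d+", game))

# Extract: pull out the show number with a capture group.
show_number = re.search(r"#(\d+)", game).group(1)

# Replace: rewrite using the captured group, then apply a string function.
replaced = re.sub(r"Show #(\d+)", r"show_\1", game).lower()
```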

## Cleaning The Real Data
The previous few files were the results of wrangling a raw dataset. Let's look at that dataset now! It's a scrape of [rainfall data](https://www.cnrfc.noaa.gov/monthly_precip_2020.php) from the website of the National Oceanic and Atmospheric Administration (NOAA).

In [20]:
!head -c 1024 data/monthly_precip_full.csv

2002, 'SOUTHERN OREGON COASTAL'
2002, 'ID', 'Location', 'OCT', 'NOV', 'DEC', 'JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'WY Tota', 'Pct Avg to Date'
2002, '4BK', 'BROOKINGS', '   6.43', '  14.53', '  17.32', '  17.32', '   9.10', '   6.05', '   6.62', '   3.65', '   3.68', '   0.02', '   0.03', '   1.45', ' 86.20', ' 117'
2002, 'ASHO3', 'ASHLAND', '   0.38', '   3.50', '   3.41', '   1.32', '   1.29', '   1.29', '   2.92', '   1.88', '   0.08', '   0.28', '   0.00', '   0.20', ' 16.55', '  84'
2002, 'COPO3', 'COPPER 4NE', '   0.29', '   6.68', '   6.98', '   3.42', '   1.99', '   1.80', 'M', 'M', 'M', 'M', 'M', '   0.27', 'M'
2002, 'CVJO3', 'CAVE JUNCTION', '   2.46', '  13.60', '  14.81', '   9.76', '   6.46', '   4.63', '   1.94', '   0.89', '   0.00', '   0.00', '   0.00', '   0.33', ' 54.88', '  88'
2002, 'GOLO3', 'GOLD BEACH', '   4.63', '  11.72', '  16.59', '  14.58', '   8.89', '   6.62', '   4.73', '   1.50', '   2.09', '   0.02', '   0.00', '   0.62'

Messy! You can play with it in bash or pandas if you like. I'll clean it up in Trifacta to illustrate the benefits of such tools:
- live visualization 
- transform recommendations (software synthesis) 
- Use CS (HCI, AI, Database languages) to reduce the work in DS!
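
For those who do want to try the code route, here is a minimal sketch of a first pass in plain Python. It assumes the three kinds of lines visible in the output above (a region line with two fields, a column-header line whose second field is `ID`, and data lines) and uses an abridged toy sample. The rule for telling the line kinds apart is a guess from that output, not a documented format.

```python
import csv
import io

# Abridged toy sample in the shape of monthly_precip_full.csv.
raw = """2002, 'SOUTHERN OREGON COASTAL'
2002, 'ID', 'Location', 'OCT', 'NOV'
2002, '4BK', 'BROOKINGS', '   6.43', '  14.53'
2002, 'COPO3', 'COPPER 4NE', '   0.29', 'M'
"""

rows = []
station = None
header = None
reader = csv.reader(io.StringIO(raw), quotechar="'", skipinitialspace=True)
for fields in reader:
    fields = [f.strip() for f in fields]  # drop the padding inside the quotes
    if len(fields) == 2:
        station = fields[1]               # region line: remember it for later rows
    elif fields[1] == "ID":
        header = ["Year"] + fields[1:]    # column-header line
    else:
        record = dict(zip(header, fields))
        record["Station"] = station
        rows.append(record)
```

Values are still strings at this point, and `'M'` is still a marker rather than a missing value; type induction and coercion would come next.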

In [21]:
mpf = tf.wrangle('data/monthly_precip_full.csv')

In [None]:
mpf.open()