# DSCI 330 - The Management of Unstructured Data

## Structured and Unstructured Data

* **Structured Data** fits in a table/database
* All other data is **unstructured**
* Examples
    * Web pages, XML, JSON
    * Text
    * Audio/Music
    * Images and Video

## Python

* Popular programming language in DSCI
* Features
    * Multi-paradigm
    * Interpreted
    * Extensible
    * Strongly, dynamically typed
    * Free
    * Many DSCI-related libraries
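
To make "strongly, dynamically typed" concrete, here is a minimal sketch. The variable names are illustrative only. A name can be rebound to values of different types, but Python does not silently mix incompatible types.

```python
# Dynamic typing: a name can be rebound to values of different types.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__
print(first_type, second_type)

# Strong typing: incompatible types are not silently converted.
try:
    result = "1" + 1
except TypeError as err:
    result = None
    print("TypeError:", err)

# An explicit conversion is required instead.
total = int("1") + 1
print(total)
```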

## Sources of Unstructured Data

* Claims/Transactions
* Hierarchical Data
* Multidimensional Data

## Claims Data

<img src="./img/transactions.png">

## Hierarchical Data

* The type or number of some variables depends on others

In [None]:
houses = [{"Garage":None},
          {"Garage":{"Cars":2,
                     "Size":500}}]
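
One way to work with records like these is to check for the missing branch before reading the nested values. The helper `garage_cars` below is a hypothetical name introduced for illustration, and treating "no garage" as zero cars is an assumption.

```python
houses = [{"Garage": None},
          {"Garage": {"Cars": 2,
                      "Size": 500}}]


def garage_cars(house):
    """Return the number of cars the garage holds, or 0 if there is no garage."""
    garage = house.get("Garage")
    if garage is None:
        return 0
    return garage.get("Cars", 0)


# The "Cars" variable only exists for houses that have a garage.
car_counts = [garage_cars(house) for house in houses]
print(car_counts)
```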

## Multidimensional Data

<img src="https://www.codeproject.com/KB/graphics/Face_Recognition/Main1.jpg">

## Examples

* Web scraping [The Current](https://github.com/WSU-DataScience/WiCS_workshop/blob/master/labs/case_study_1_the_current_key_short.ipynb)
* Music: [Music Alignment](https://github.com/stevetjoa/musicinformationretrieval.com/blob/gh-pages/dtw_example.ipynb)
* Images: [Edge Detection](https://github.com/dalgu90/opencv-tutorial/blob/master/2_edge_detection.ipynb)

# The File System and Command Line Interaction

## The Command Line

* Text-based interaction with the OS:
    * Type commands
        * Execute commands
        * Run another program
    * Display the output
* Three Windows command line tools
    * cmd.exe
    * Powershell.exe
    * Git Bash (same as macOS/Linux)
* Our focus: Linux/Unix/macOS Terminal

## Installing Git and Git Bash 

* Follow the instructions on https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

## The File System

Consists of

* Files
* Directories
* "Everything that doesn't go away when you reboot"

## Files

* Have two-part names, like
    * notes.txt or home.html
    
        <img src="./img/file_ext.png">
        

## File Extension 

* Tells computer how to open
    * .txt  →  editor
    * .html → browser
* File extensions in Linux are optional
    * but encouraged
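
As a rough illustration in Python, `pathlib` can split a file name into its two parts. The file names are the ones used above, plus a hypothetical `README` to show a name with no extension.

```python
from pathlib import PurePosixPath

names = ["notes.txt", "home.html", "README"]

for name in names:
    path = PurePosixPath(name)
    # stem is the part before the last dot; suffix is the extension, dot included.
    print(f"{name!r}: stem={path.stem!r}, suffix={path.suffix!r}")

suffixes = [PurePosixPath(name).suffix for name in names]
```

A name without an extension simply has an empty suffix, which matches the point that extensions are optional.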

## The File System is a Tree

* Files are stored in directories 
* Directories contain 
    * files 
    * directories
* Result: directory tree
* In one directory 
    * → unique names
* Different directories 
    * → can have same name
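
The tree structure can be explored from Python as well. The sketch below builds a small throwaway tree in a temporary directory and walks it with `os.walk`. The function `list_tree` and the user names are illustrative assumptions, not part of any library.

```python
import os
import tempfile
from pathlib import Path


def list_tree(root):
    """Return every file and directory under root as sorted relative paths."""
    root = Path(root)
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            entries.append((Path(dirpath) / name).relative_to(root).as_posix())
    return sorted(entries)


with tempfile.TemporaryDirectory() as tmp:
    # Two files share the name intro.txt because they live in different directories.
    for user in ["hpotter", "rweasley"]:
        thesis = Path(tmp) / "home" / user / "thesis"
        thesis.mkdir(parents=True)
        (thesis / "intro.txt").touch()
    tree = list_tree(tmp)

for entry in tree:
    print(entry)
```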

## Important bash Commands

<img src="./img/bash_cmds.png">

In [None]:
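# Print the current working directory
pwd
# Change to the home directory and confirm the move
cd ~
pwd
# List all entries, including hidden ones, in long format
ls -al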

## Exercise - Make the Folders

<img src="http://www.cgl.ucsf.edu/Outreach/bmi219/slides/swc/lec/img/shell01/directory_tree.png">

In [None]:
mkdir home
mkdir home/hpotter
mkdir home/hpotter/thesis
mkdir home/rweasley
mkdir home/rweasley/thesis

## Text Files


* **File Extension:** `.txt`
    * Can also be `.html`, `.py`, etc.
* **Contents:** text and symbols
* Create an empty text file with `touch`
* Edited with a text editor
    * The Jupyter terminal comes with nano

## Installing `nano` on Windows

In [None]:
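# Download the Windows build of nano 2.2.6
curl -L -O http://www.nano-editor.org/dist/v2.2/NT/nano-2.2.6.zip
mkdir -p ~/bin
unzip nano-2.2.6.zip -d ~/bin/nano
# Assumes the archive unpacks nano.exe directly into ~/bin/nano
~/bin/nano/nano.exe test.txt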

## Exercise - Create the Files

In [None]:
touch home/hpotter/addresses.html
touch home/hpotter/thesis/intro.txt
touch home/hpotter/thesis/spells.txt
nano home/hpotter/thesis/spells.txt
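
The same steps can be sketched in Python with `pathlib`. This version works in a temporary directory so it leaves nothing behind, and the spell names written to the file are placeholder content standing in for whatever you would type in `nano`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    thesis = Path(tmp) / "home" / "hpotter" / "thesis"
    thesis.mkdir(parents=True)           # like `mkdir -p home/hpotter/thesis`

    intro = thesis / "intro.txt"
    intro.touch()                        # like `touch`: creates an empty file
    intro_size = intro.stat().st_size

    spells = thesis / "spells.txt"
    spells.write_text("Lumos\nNox\n")    # stands in for editing the file by hand
    spell_lines = spells.read_text().splitlines()

print(intro_size, spell_lines)
```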

## Paths

* **Path:** A file's location
* **Absolute Path:** Full address
    * Windows: `C:/nbuser/home/hpotter`
    * Mac: `/home/nbuser/home/hpotter/addresses.html`
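
A small sketch of how Python tells these apart, using the "pure" path classes so the example behaves the same on any operating system. The relative path in the last line is an added example for contrast.

```python
from pathlib import PurePosixPath, PureWindowsPath

posix_path = PurePosixPath("/home/nbuser/home/hpotter/addresses.html")
windows_path = PureWindowsPath("C:/nbuser/home/hpotter")
relative_path = PurePosixPath("home/hpotter/addresses.html")

# An absolute POSIX path starts at the root "/".
print(posix_path.is_absolute(), posix_path.parts)
# An absolute Windows path has both a drive and a root.
print(windows_path.is_absolute(), windows_path.drive)
# Without a leading root the path is relative.
print(relative_path.is_absolute())
```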

## Relative Paths

* **Current working directory (CWD):** The program is "working" here.
    * Use `pwd` in bash
    * Can change over time
* **Relative path:** file location relative to CWD
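
The following sketch, run in a temporary directory, shows that the same relative path points to different places as the CWD changes. The directory layout mirrors the exercise tree and is otherwise an assumption.

```python
import os
import tempfile
from pathlib import Path

start = Path.cwd()                       # like `pwd`
with tempfile.TemporaryDirectory() as tmp:
    thesis = Path(tmp) / "home" / "hpotter" / "thesis"
    thesis.mkdir(parents=True)
    (thesis / "spells.txt").touch()

    os.chdir(thesis)                     # like `cd`: CWD is now thesis
    try:
        relative = Path("spells.txt")
        found_from_thesis = relative.exists()

        os.chdir("..")                   # CWD is now hpotter
        found_from_hpotter = relative.exists()
        found_with_new_path = Path("thesis/spells.txt").exists()
    finally:
        os.chdir(start)                  # leave the temporary tree before cleanup

print(found_from_thesis, found_from_hpotter, found_with_new_path)
```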

## Shortcuts

* Current Directory: `.`
* Parent Directory: `..`
* Home: `~`
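
As an illustration, Python's `posixpath` module can resolve `.` and `..` in a path string without touching the disk. The CWD below is a hypothetical value, not read from the real system.

```python
import os
import posixpath

cwd = "/home/hpotter/thesis"  # hypothetical CWD for illustration

# "." refers to the directory itself.
same_dir = posixpath.normpath(posixpath.join(cwd, "."))
# ".." steps up one level to the parent directory.
parent_file = posixpath.normpath(posixpath.join(cwd, "../addresses.html"))
# "~" expands to the current user's home directory.
home = os.path.expanduser("~")

print(same_dir)
print(parent_file)
print(home)
```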

## Exercise

<img src="http://www.cgl.ucsf.edu/Outreach/bmi219/slides/swc/lec/img/shell01/directory_tree.png">

* CWD is `hpotter/thesis`
* Where is `spells.txt`?
* Where is `rweasley`?

In [None]:
nano ??
cd ??

## Using paths

In [None]:
# Navigation
cd ../../rweasley/thesis
# Edit files
nano ./spells.txt

## Copy and move files

* Copy a file with `cp`
* Move a file with `mv`

In [None]:
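# Copy spells.txt into rweasley's thesis directory
cp ./spells.txt ../../rweasley/thesis
# Rename a file by moving it (here, a misnamed spells.text becomes spells.txt)
mv ./spells.text ./spells.txt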
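
For comparison, a Python sketch of the same two operations using `shutil`. It runs in a temporary directory, and the file content is a placeholder.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "hpotter" / "thesis"
    dst_dir = Path(tmp) / "rweasley" / "thesis"
    src_dir.mkdir(parents=True)
    dst_dir.mkdir(parents=True)
    (src_dir / "spells.text").write_text("Lumos\n")

    # Like `mv spells.text spells.txt`: moving within a directory renames the file.
    shutil.move(str(src_dir / "spells.text"), str(src_dir / "spells.txt"))
    # Like `cp spells.txt ../../rweasley/thesis`: the original stays in place.
    shutil.copy(src_dir / "spells.txt", dst_dir)

    old_name_exists = (src_dir / "spells.text").exists()
    source_still_exists = (src_dir / "spells.txt").exists()
    copied_text = (dst_dir / "spells.txt").read_text()

print(old_name_exists, source_still_exists, repr(copied_text))
```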

## Deleting Files and Folders

* Use `rm` to delete files
    * `rm -i` gives a prompt
    * `rm -f` forces deletion
    * `rm -r` deletes a directory and its contents recursively
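
A rough Python counterpart, run in a temporary directory so nothing real is deleted. It shows why a recursive delete is needed for folders that still contain files.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    thesis = Path(tmp) / "thesis"
    thesis.mkdir()
    (thesis / "intro.txt").touch()
    (thesis / "spells.txt").touch()

    (thesis / "intro.txt").unlink()      # like `rm intro.txt`

    try:
        thesis.rmdir()                   # only works on an empty directory
        rmdir_failed = False
    except OSError:
        rmdir_failed = True              # spells.txt is still inside

    shutil.rmtree(thesis)                # like `rm -r thesis`
    thesis_exists = thesis.exists()

print(rmdir_failed, thesis_exists)
```

As with `rm -r`, `shutil.rmtree` does not ask for confirmation, so it is worth double-checking the path before running it.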