# Chapter 2: Getting to Know Your Data

## Data Objects & Attributes

- **Data set**: a set of data objects
    - e.g., students, courses, customers, products
- **Data object**
    - an entity with certain attributes/features/dimensions/variables
    - e.g., a patient described by patient_id, name, DOB, address, office visits, lab tests
- **Attribute type**: nominal, binary, ordinal, numeric

### Attribute Types

- **Nominal** (categorical)
    - e.g., major, occupation, city 
- **Binary** (boolean, symmetric or asymmetric)
    - e.g., CS major? professor? Boulder? 
- **Ordinal**
    - e.g., degree, professional rank, vehicle size class
- **Numeric**
    - quantitative 

### Numeric Attributes

- **Interval-scaled**
    - e.g., 50 or 100 degrees Fahrenheit; year 2000 or 2020
- **Ratio-scaled** (true zero-point)
    - e.g., age, dollars, number of books, number of cars 
- **Discrete vs. continuous**
    - discrete: finite or countably infinite set of values (e.g., integers); continuous: real-valued

## Statistical Description of Data

- Motivation: better understanding of the data
    - e.g., sales, traffic volume, #likes
- **Basics**: N, min, max
- **Central tendency**: mean, median, mode, midrange
- **Dispersion**: quartiles, interquartile range, variance

### Central Tendency

- **Mean** $\displaystyle \quad \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$
    - weighted arithmetic mean $\displaystyle \quad \bar{x} = \frac{\sum_{i=1}^N w_ix_i}{\sum_{i=1}^N w_i} = \frac{w_1x_1 + w_2x_2 + \cdots + w_Nx_N}{w_1 + w_2 + \cdots + w_N}$
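    - trimmed mean: mean computed after chopping off the extreme values at both ends
- **Median**
    - middle value of the sorted data if N is odd; otherwise the average of the middle two values
    - for grouped data, approximated by interpolation: $\displaystyle median = L_1 + \left(\frac{N/2 - (\sum freq)_l}{freq_{median}}\right) width$
        - $L_1$: lower boundary of the median interval, $(\sum freq)_l$: sum of the frequencies of all intervals below the median interval, $freq_{median}$: frequency of the median interval, $width$: width of the median interval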
- **Mode**: value that occurs most frequently
    - unimodal, bimodal, trimodal, multimodal
- **Midrange**: avg. of min and max
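
The measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`weighted_mean`, `trimmed_mean`, `midrange`, `grouped_median`) and the sample values are assumptions made for the example, and the trimmed mean here simply drops a fixed number `k` of values from each end.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2


def grouped_median(lower, n, freq_below, freq_median, width):
    """Interpolated median: lower + (n/2 - freq_below) / freq_median * width."""
    return lower + (n / 2 - freq_below) / freq_median * width


# Illustrative sample, not taken from the notes.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(statistics.mean(data))       # 58
print(statistics.median(data))     # 54.0 (average of the middle two: 52 and 56)
print(statistics.multimode(data))  # [52, 70], so this sample is bimodal
print(midrange(data))              # 70.0
print(trimmed_mean(data, k=1))     # 55.6
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```

Note how the single large value (110) pulls the mean and midrange upward, while the median and the trimmed mean are less affected.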

### Data Dispersion

- How much numeric data tend to spread
- **Range**: difference between max and min
- **Quartiles**: Q1 (25th percentile), Q3 (75th)
- **Interquartile range**
    - IQR = Q3 - Q1
- **Five number summary**: min, Q1, median, Q3, max
- **Outlier**: value more than 1.5 × IQR above Q3 or below Q1
- **Variance**: $\displaystyle \quad \sigma^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \bar{x})^2 = \frac{1}{N} \left[ \sum x_i^2 - \frac{1}{N} \left(\sum x_i \right)^2 \right]$
- **Standard deviation**: square root of variance
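- **Boxplot**
    - box: spans Q1 to Q3 (length = IQR), with the median M marked inside
    - whiskers: extend to the min and max values that lie within 1.5 × IQR of the box
    - outliers: points beyond the whiskers, plotted individually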

![Image: Boxplot](img/2.1.png)
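
A short Python sketch of the dispersion measures follows. Quartiles can be computed under several conventions; the `method="inclusive"` option of `statistics.quantiles` used here is an assumption for the example, and other conventions give slightly different Q1 and Q3. The function names and sample data are also illustrative.

```python
import math
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


def population_variance(values):
    """Definition: mean squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Equivalent form: (sum(x^2) - (sum(x))^2 / N) / N."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / n


# Illustrative sample, not taken from the notes.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(five_number_summary(data))  # (30, 49.25, 54.0, 64.75, 110)
print(iqr_outliers(data))         # [110]
variance = population_variance(data)
print(round(variance, 2), round(math.sqrt(variance), 2))
```

Both variance functions divide by N (population variance), matching the formula above; `statistics.pvariance` computes the same quantity.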

### Graphic Displays

- **Histogram**

![Image: Histogram](img/2.2.png)

- **Scatter plot**

![Image: Scatter Plot](img/2.3.png)

## Data Visualization

- Why data visualization?
    - gain insights, qualitative overview, explore
- Visualization methods
    - pixel-oriented, icon-based, hierarchical
    - geometric projection
    - visualizing complex data and relations

## Object Similarity/Dissimilarity

- **Data matrix**
    - object-by-attribute
    - two modes
    - $\begin{bmatrix}
        x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
        \vdots & & \vdots & & \vdots \\
        x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
        \vdots & & \vdots & & \vdots \\
        x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
    \end{bmatrix}$
- **Dissimilarity matrix**
    - object-by-object
    - one mode
    - $\begin{bmatrix}
        0 \\
        d(2,1) & 0 \\
        d(3,1) & d(3,2) & 0 \\
        \vdots & \vdots & \vdots \\
        d(n,1) & d(n,2) & \cdots & \cdots & 0
    \end{bmatrix}$
- Usually measured by **distance**
- **Minkowski** distance ($L_h$ norm, $h \geq 1$; $p$ is the number of attributes, as in the data matrix)
    - $\displaystyle d(i,j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h \right)^{1/h}$
- **Euclidean** distance ($L_2$ norm, $h = 2$)
    - $\displaystyle d(i,j) = \sqrt{ (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$
- **Manhattan** distance ($L_1$ norm, $h = 1$)
    - $\displaystyle d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
- Weighted distance: each attribute's difference is multiplied by a weight reflecting that attribute's importance
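
The distance formulas and the lower-triangular dissimilarity matrix can be sketched as follows. The function names, the `order` parameter (the exponent of the Minkowski distance), and the way `weights` enters the sum are assumptions for illustration; the weighting shown is one common form, not necessarily the one intended in the notes.

```python
def minkowski(x, y, order, weights=None):
    """Minkowski distance between two equal-length numeric vectors.

    order=1 gives Manhattan distance, order=2 gives Euclidean distance.
    Optional weights scale each attribute's term (assumed form).
    """
    if weights is None:
        weights = [1.0] * len(x)
    total = sum(w * abs(a - b) ** order for w, a, b in zip(weights, x, y))
    return total ** (1 / order)


def dissimilarity_matrix(objects, order=2):
    """Lower triangle of the object-by-object matrix.

    Row i holds d(i, 0), ..., d(i, i), where the last entry is always 0.
    """
    return [
        [minkowski(objects[i], objects[j], order) for j in range(i + 1)]
        for i in range(len(objects))
    ]


# Illustrative points, not taken from the notes.
a, b = (1, 2), (3, 5)
print(minkowski(a, b, order=1))  # 5.0 (Manhattan: 2 + 3)
print(minkowski(a, b, order=2))  # sqrt(13), about 3.61 (Euclidean)

for row in dissimilarity_matrix([(0, 0), (3, 4), (6, 8)]):
    print(row)
```

Only the lower triangle is stored because the matrix is symmetric with a zero diagonal.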

### Distance Measure

- Euclidean distance vs. Manhattan distance
- Properties
    - $d(i,j) \geq 0$ (non-negativity)
    - $d(i,i) = 0$ (identity)
    - $d(i,j) = d(j,i)$ (symmetry)
    - $d(i,j) \leq d(i,k) + d(k,j)$ (triangle inequality)

![Image: Euclidean vs. Manhattan Distance](img/2.4.png)
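
As a quick sanity check, the four properties can be tested by brute force on a handful of points. This is only a sketch on sample data, not a proof; the helper names and the points are assumptions. Squared Euclidean distance is included as a counterexample because it fails the last property.

```python
import itertools
import math


def manhattan(x, y):
    """L1 distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x, y):
    """Squared L2 distance; not a metric."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_properties(dist, points, tol=1e-12):
    """Check the four properties on every triple (i, j, k) of the given points."""
    for i, j, k in itertools.product(points, repeat=3):
        if dist(i, j) < 0:
            return False
        if dist(i, i) != 0:
            return False
        if not math.isclose(dist(i, j), dist(j, i)):
            return False
        if dist(i, j) > dist(i, k) + dist(k, j) + tol:
            return False
    return True


# Illustrative points, not taken from the notes.
points = [(0, 0), (1, 0), (2, 0), (3, 4)]

print(satisfies_metric_properties(math.dist, points))          # True (Euclidean)
print(satisfies_metric_properties(manhattan, points))          # True
print(satisfies_metric_properties(squared_euclidean, points))  # False
```

The squared distance fails because going from (0, 0) to (2, 0) directly costs 4, while the detour through (1, 0) costs only 1 + 1 = 2.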

## Nominal Attributes

- E.g., courses taken by different students
- **Method 1**: simple matching
    - $d(i,j) = (p - m) / p$
    - $m$: # of matches, $p$: total # of variables
- **Method 2**: view each state as a binary variable
    - e.g., degree (BS,BA,MS,PhD); then (0,1,0,0) means BA
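
Both methods are easy to express in Python. The sketch below is illustrative: the function names and the student records are hypothetical, and only the degree states come from the example above.

```python
def simple_matching_dissimilarity(a, b):
    """d(i, j) = (p - m) / p for two objects with p nominal attributes."""
    p = len(a)
    m = sum(1 for u, v in zip(a, b) if u == v)  # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Encode one nominal value as a binary vector, one position per state."""
    return tuple(1 if value == state else 0 for state in states)


# Hypothetical records: (major, city, degree).
student_a = ("CS", "Boulder", "MS")
student_b = ("CS", "Denver", "PhD")
print(simple_matching_dissimilarity(student_a, student_b))  # 2 of 3 attributes differ

degrees = ("BS", "BA", "MS", "PhD")
print(one_hot("BA", degrees))  # (0, 1, 0, 0)
```

After one-hot encoding, the binary-variable measures in the next section apply to the resulting vectors.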

## Binary Variables

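- Contingency table for objects $i$ and $j$: $q$ = # of attributes equal to 1 for both, $r$ = # equal to 1 for $i$ and 0 for $j$, $s$ = # equal to 0 for $i$ and 1 for $j$, $t$ = # equal to 0 for both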

![Image: Contingency Table](img/2.5.png)

- **Symmetric** binary variables $\displaystyle \quad d(i,j) = \frac{r + s}{q + r + s + t}$
- **Asymmetric** binary variables $\displaystyle \quad d(i,j) = \frac{r + s}{q + r + s}$
    - Jaccard coefficient $\displaystyle \quad sim(i,j) = \frac{q}{q + r + s} = 1 - d(i,j)$

### Example

![Image: Binary Variables Example](img/2.6.png)

- Gender: symmetric
- Others: asymmetric
- Consider only asymmetric binary variables
- Code Y (yes) and P (positive) as 1 and N as 0, then apply $\displaystyle \quad d(i,j) = \frac{r + s}{q + r + s}$
    - $d(\text{Jack, Mary}) = (0+1)/(2+0+1) = 0.33$
    - $d(\text{Jack, Jim}) = (1+1)/(1+1+1) = 0.67$
    - $d(\text{Jim, Mary}) = (1+2)/(1+1+2) = 0.75$
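
The three distances can be reproduced with a short script. The 0/1 vectors below are an assumption: they are chosen to yield the same $q$, $r$, $s$ counts as the worked values above, since the underlying table is only available in the figure. Function names are illustrative.

```python
def binary_counts(a, b):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for u, v in zip(a, b) if u == 1 and v == 1)
    r = sum(1 for u, v in zip(a, b) if u == 1 and v == 0)
    s = sum(1 for u, v in zip(a, b) if u == 0 and v == 1)
    t = sum(1 for u, v in zip(a, b) if u == 0 and v == 0)
    return q, r, s, t


def symmetric_dissimilarity(a, b):
    """(r + s) / (q + r + s + t): both states count equally."""
    q, r, s, t = binary_counts(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(a, b):
    """(r + s) / (q + r + s): joint absences (t) are ignored."""
    q, r, s, _ = binary_counts(a, b)
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    """q / (q + r + s), which equals 1 - asymmetric dissimilarity."""
    q, r, s, _ = binary_counts(a, b)
    return q / (q + r + s)


# Assumed encodings of the six asymmetric attributes (1 = Y or P, 0 = N).
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_dissimilarity(jim, mary), 2))   # 0.75
```

By this measure Jack and Mary are the most similar pair, and Jim and Mary the least.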

## Ordinal Variables

- E.g., gold, silver, bronze
- Order is important: **rank**
- Treat like interval-scaled variables
    - map to their ranks
    - map to range [0, 1]
- Replace each value $x_{if}$ by its rank $r_{if} \in \{1,\ldots,M_f\}$, then normalize: $\displaystyle z_{if} = \frac{r_{if} - 1}{M_f - 1}$
- e.g., with $M_f = 3$: $(1,2,3) \to (0.0, 0.5, 1.0)$
- Compute dissimilarity on $z_{if}$ as for interval-scaled variables
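
A minimal sketch of the rank normalization follows. The function name is illustrative, and the ordering of the medal states (bronze lowest, gold highest) is an assumption made for the example.

```python
def normalize_ranks(values, ordered_states):
    """Map ordinal values to [0, 1] via z = (rank - 1) / (M - 1)."""
    m = len(ordered_states)
    rank = {state: position + 1 for position, state in enumerate(ordered_states)}
    return [(rank[value] - 1) / (m - 1) for value in values]


# Assumed order, lowest to highest.
medals = ("bronze", "silver", "gold")

z = normalize_ranks(["bronze", "silver", "gold"], medals)
print(z)  # [0.0, 0.5, 1.0]

# After normalization, an ordinary numeric distance applies.
print(abs(z[2] - z[0]))  # 1.0 (gold vs. bronze)
print(abs(z[1] - z[0]))  # 0.5 (silver vs. bronze)
```

Normalizing to [0, 1] keeps an ordinal attribute with many states from outweighing one with few states when several attributes are combined.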

## Variables of Mixed Types

- Data may contain different types of variables
- **Weighted combination** $\displaystyle \quad d(i,j) = \frac{\sum_{f=1}^p \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^p \delta_{ij}^{(f)}}$
- $\delta_{ij}^{(f)} = 0$ if either
    - $x_{if}$ or $x_{jf}$ is missing, or
    - $x_{if} = x_{jf} = 0$ and $f$ is an asymmetric binary variable
- $\delta_{ij}^{(f)} = 1$ otherwise
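
The weighted combination can be sketched as below. The notes do not spell out the per-attribute term $d_{ij}^{(f)}$, so the choices here are assumptions: 0/1 mismatch for nominal and binary attributes, and absolute difference divided by the attribute's range for numeric ones. Missing values are represented by `None`; the function name, the `kinds` labels, and the sample records are all hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average the per-attribute dissimilarities whose indicator delta is 1.

    kinds[f] is "nominal", "asymmetric_binary", or "numeric".
    ranges[f] is max - min of attribute f (used for numeric attributes only).
    """
    numerator = 0.0
    denominator = 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # delta = 0: a value is missing
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # delta = 0: joint absence carries no information
        if kind == "numeric":
            d = abs(a - b) / ranges[f]
        else:
            d = 0.0 if a == b else 1.0
        numerator += d
        denominator += 1
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical attributes: (color, test result, age); age spans a range of 50.
kinds = ("nominal", "asymmetric_binary", "numeric")
ranges = (None, None, 50.0)

print(mixed_dissimilarity(("red", 0, 20.0), ("blue", 0, 45.0), kinds, ranges))  # 0.75
print(mixed_dissimilarity(("red", 1, None), ("red", 0, 30.0), kinds, ranges))   # 0.5
```

In the first call the test attribute is skipped (both values are 0), leaving (1 + 0.5) / 2. In the second the age is skipped (missing), leaving (0 + 1) / 2.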

## Cosine Similarity Example

![Image: Cosine Similarity Example](img/2.7.png)

$\displaystyle s(x,y) = \frac{x \cdot y}{\|x\| \, \|y\|}$

- $D1 = (5,0,3,0,2,0,0,2,0,0)$
- $D2 = (3,0,2,0,1,1,0,1,0,1)$
- $D1 \cdot D2 = 5 \cdot 3 + 0 \cdot 0 + \ldots + 0 \cdot 1 = 25$
- $||D1|| = (5 \cdot 5 + 0 \cdot 0 + \ldots + 0 \cdot 0)^{1/2} = 6.481$
- $||D2|| = (3 \cdot 3 + 0 \cdot 0 + \ldots + 1 \cdot 1)^{1/2} = 4.123$
- $\cos(D1, D2) = 25 / (6.481 \cdot 4.123) = 0.936$
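
The arithmetic above can be checked with a few lines of Python. The function names are illustrative; the vectors are the two documents from the example.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(dot(x, x))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    return dot(x, y) / (norm(x) * norm(y))


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(dot(d1, d2))                          # 25
print(round(norm(d1), 3))                   # 6.481
print(round(norm(d2), 3))                   # 4.123
print(round(cosine_similarity(d1, d2), 3))  # 0.936
```

Cosine similarity depends only on the direction of the vectors, so scaling either document's term counts by a positive constant leaves the result unchanged.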