### Data Objects and Attributes

- **Data Object**: A collection of attributes that describe an entity.
    - Also known as: record, point, case, sample, entity, or instance.

- **Attribute**: A property or characteristic of an object.
    - Examples: eye color of a person, temperature, etc.
    - Also known as: variable, field, characteristic, dimension, or feature.

- **Attribute value**: A number or symbol assigned to an attribute for a particular object.

**Attributes vs. attribute values**:
 - The same attribute can be mapped to different attribute values.
    - Example: height can be measured in feet or meters.

| Object   | Attribute 1 (Height in cm) | Attribute 2 (Eye Color) |
|----------|----------------------------|-------------------------|
| Person 1 | 170                        | Brown                   |
| Person 2 | 160                        | Blue                    |
| Person 3 | 180                        | Green                   |
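As a minimal sketch of the table above, each data object can be held as a dictionary of attribute values. The key names (`height_cm`, `eye_color`) and the conversion helper are illustrative choices, not part of the original notes. The last step shows the same attribute (height) mapped to a different set of values by changing the unit.

```python
# Each data object is a dict that maps attribute names to attribute values.
people = [
    {"object": "Person 1", "height_cm": 170, "eye_color": "Brown"},
    {"object": "Person 2", "height_cm": 160, "eye_color": "Blue"},
    {"object": "Person 3", "height_cm": 180, "eye_color": "Green"},
]

CM_PER_FOOT = 30.48  # one international foot is defined as 30.48 cm

# Same attribute (height), different attribute values (feet instead of cm).
heights_ft = [round(p["height_cm"] / CM_PER_FOOT, 2) for p in people]

for person, feet in zip(people, heights_ft):
    print(person["object"], person["height_cm"], "cm =", feet, "ft")
```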

### Types of Attributes

| Attribute Type | Description | Examples | Possible Operations |
|----------------|-------------|----------|---------------------|
| Nominal        | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex (male, female) | mode, entropy, contingency correlation |
| Ordinal        | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation, run tests, sign tests |
| Interval       | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) There is no defined zero point. | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests |
| Ratio          | For ratio attributes, both differences and ratios are meaningful. (*, /) A defined zero point exists. | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation |
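The sketch below illustrates one permitted operation per attribute type from the table, using Python's standard `statistics` module. The sample values and the `rank` mapping are made up for illustration.

```python
import statistics

eye_colors = ["Brown", "Blue", "Brown", "Green"]        # nominal
grades = ["good", "better", "best", "good", "better"]   # ordinal
temps_c = [10.0, 20.0, 30.0]                            # interval
masses_kg = [2.0, 8.0]                                  # ratio

# Nominal: only equality is defined, so the mode is a valid summary.
mode_color = statistics.mode(eye_colors)

# Ordinal: values can be ordered, so the median is valid.
# The rank mapping encodes the order good < better < best.
rank = {"good": 0, "better": 1, "best": 2}
sorted_grades = sorted(grades, key=rank.get)
median_grade = sorted_grades[len(sorted_grades) // 2]

# Interval: differences are meaningful, so the mean is valid.
mean_temp = statistics.mean(temps_c)

# Ratios of Celsius values are not meaningful: 20 C is not "twice as hot"
# as 10 C. On the Kelvin scale (a ratio attribute) the ratio is about 1.04.
ratio_kelvin = (20.0 + 273.15) / (10.0 + 273.15)

# Ratio: a true zero exists, so ratios and the geometric mean are valid.
geo_mean = statistics.geometric_mean(masses_kg)

print(mode_color, median_grade, mean_temp, round(ratio_kelvin, 2), geo_mean)
```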


### Discrete and continuous attributes
#### Discrete Attribute
 - Has only a finite or countably infinite set of values
 - Examples: zip codes, counts, or the set of words in a collection of documents
 - Often represented as integer variables

#### Continuous Attribute
 - Has real numbers as attribute values
 - Examples: temperature, height, or weight.
 - Practically, real values can only be measured and represented using a finite number of digits.
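A small sketch of the distinction, with made-up values: word counts are discrete and fit naturally in integers, while a continuous measurement is recorded with a finite number of digits and stored with finite floating-point precision.

```python
from collections import Counter

# Discrete: word counts over a tiny, hypothetical document collection.
docs = ["data mining", "data objects"]
word_counts = Counter(word for doc in docs for word in doc.split())

# Continuous: a hypothetical height measurement in meters, recorded
# to two decimal places because the instrument has limited precision.
measured_height_m = 1.7034982
recorded_height_m = round(measured_height_m, 2)

# Floating-point numbers also have finite precision, so some decimal
# sums are not represented exactly.
sum_is_exact = (0.1 + 0.2 == 0.3)

print(word_counts["data"], recorded_height_m, sum_is_exact)
```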

### Datasets
 - A collection of data samples
 - Used to train and evaluate models for a given task

![image.png](attachment:image.png)

#### Dataset characteristics
 - Dimensionality (number of attributes)
    - High dimensional data brings a number of challenges

 - Distribution
    - Skewness (class imbalance): disproportionate representation of classes (e.g., 99% class A and 1% class B)

    - Unfairness (sampling bias): disproportionate representation of the sources the data comes from
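One quick way to check for class skew is to count the labels. The sketch below uses a hypothetical label list that matches the 99% / 1% split mentioned above.

```python
from collections import Counter

# Hypothetical labels: 99 samples of class A and 1 sample of class B.
labels = ["A"] * 99 + ["B"] * 1

counts = Counter(labels)
proportions = {cls: n / len(labels) for cls, n in counts.items()}

# Ratio of the most frequent class to the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())

print(proportions, imbalance_ratio)
```

A large imbalance ratio suggests that plain accuracy may be misleading, since a model that always predicts class A would already be right on 99% of these samples.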

### Data matrix
 - If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multidimensional space

 - Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

| Country   | Population (millions) | Area (sq km) | GDP (trillions USD) |
|-----------|-----------------------|--------------|---------------------|
| USA       | 331                   | 9,834,000    | 21.43               |
| China     | 1439                  | 9,597,000    | 14.34               |
| India     | 1380                  | 3,287,000    | 2.87                |
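The table above can be sketched as an m by n data matrix using plain nested lists, with one row per object and one column per attribute. The variable names are illustrative, and a numeric array library could be used instead.

```python
countries = ["USA", "China", "India"]  # row labels (objects)
attributes = ["population_millions", "area_sq_km", "gdp_trillions_usd"]

# Data matrix X: one row per country, one column per attribute.
X = [
    [331.0, 9_834_000.0, 21.43],
    [1439.0, 9_597_000.0, 14.34],
    [1380.0, 3_287_000.0, 2.87],
]

m, n = len(X), len(X[0])  # m objects (rows), n attributes (columns)

# Each row is a point in n-dimensional space; each column is one attribute.
gdp_column = [row[attributes.index("gdp_trillions_usd")] for row in X]

print(m, n, gdp_column)
```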


### Graph data
 - The graph captures relationships among data objects (1)
 - The data objects themselves are represented as graphs (2)

![image.png](attachment:image.png)
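Both cases can be sketched with simple Python structures. The examples are assumptions chosen for illustration and are not taken from the figure: linked web pages for case (1), and a water molecule for case (2).

```python
# (1) The graph captures relationships among data objects:
# hypothetical web pages (objects) connected by hyperlinks (edges),
# stored as an adjacency list.
links = {
    "page_a": ["page_b", "page_c"],
    "page_b": ["page_c"],
    "page_c": [],
}
out_degree = {page: len(targets) for page, targets in links.items()}

# (2) The data object itself is a graph:
# a water molecule with atoms as nodes and bonds as edges.
water = {
    "nodes": {0: "O", 1: "H", 2: "H"},
    "edges": [(0, 1), (0, 2)],  # each O-H bond
}
bonded_to_oxygen = [water["nodes"][b] for a, b in water["edges"] if a == 0]

print(out_degree, bonded_to_oxygen)
```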