# Data 1: Introduction to Data

## Types of data

In this section, when I talk about the "type" of the data, I do not mean the `dtype` (`int`, `float`, `bool`, `str`) used to represent the data in a NumPy array or Pandas DataFrame. Here, "type" is used in a more abstract sense: it describes what the values mean and which comparisons and operations on them make sense.

### Nominal

* Non-numerical
* Usually, but not always, strings
* Non-ordered
* Cannot be averaged

Examples:

In [1]:
states = ["Wyoming", "Idaho", "California", "Vermont", "Kansas"]
food = ["peanut butter", "pizza", "carrot", "apple", "ice cream"]
gender = ["male", "female"]
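
To make "cannot be averaged" concrete, here is a minimal sketch using only the Python standard library. The `favorite_food` responses are made up for illustration. Counting categories and finding the most frequent one works, while an arithmetic mean has no meaning for labels.

```python
from collections import Counter

# Hypothetical survey responses (illustrative values, not real data)
favorite_food = ["pizza", "apple", "pizza", "carrot", "ice cream", "pizza"]

# Counting how often each category occurs is meaningful for nominal data
counts = Counter(favorite_food)

# The mode (most frequent category) is a natural summary of nominal data
most_common_food, most_common_count = counts.most_common(1)[0]

# An arithmetic mean is not defined: adding string labels to numbers fails
try:
    sum(favorite_food) / len(favorite_food)
    mean_is_defined = True
except TypeError:
    mean_is_defined = False

print(most_common_food, most_common_count, mean_is_defined)
```

If the labels were stored as integer codes instead of strings, the mean would compute without error, but the result would still be meaningless, which is why the abstract data type matters more than the `dtype`.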

### Ordinal

* Non-numerical
* Usually, but not always, strings
* Natural ordering
* Sometimes can be averaged
* Can be assigned a numerical scale, but the spacing of that scale is arbitrary

Examples:

In [4]:
opinion = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
temperature = ["cold", "cool", "warm", "hot"]
age_group = ["baby", "toddler", "child", "teenager", "adult", "senior"]
height = ["short", "medium", "tall"]
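
The following sketch illustrates why a numerical scale for ordinal data is arbitrary. The `responses` list and the two codings (`equal_steps`, `unequal_steps`) are assumptions chosen for illustration. Both codings preserve the natural ordering, so the median response is the same under each, but the mean changes with the spacing.

```python
from statistics import mean

# Ordered categories, from lowest to highest
opinion_levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical responses (illustrative values, not real data)
responses = ["agree", "neutral", "strongly agree", "disagree", "agree"]

# Two order-preserving numerical codings; both are arbitrary choices
equal_steps = {level: rank for rank, level in enumerate(opinion_levels)}
unequal_steps = dict(zip(opinion_levels, [0, 1, 2, 5, 10]))

def median_label(values, coding):
    """Return the middle response after sorting by the given coding."""
    ordered = sorted(values, key=coding.get)
    return ordered[len(ordered) // 2]  # assumes an odd number of values

# The median depends only on the ordering, so both codings agree
median_equal = median_label(responses, equal_steps)
median_unequal = median_label(responses, unequal_steps)

# The mean depends on the spacing, so it changes with the coding
mean_equal = mean([equal_steps[r] for r in responses])
mean_unequal = mean([unequal_steps[r] for r in responses])

print(median_equal, median_unequal, mean_equal, mean_unequal)
```

This is one reason the median is often a safer summary than the mean for ordinal data.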

### Interval

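* Numerical data on an equally spaced scale, so differences between values are meaningful
* Ordered
* Can be either discrete (int) or continuous (float)
* No meaningful zero point, so ratios between values are not meaningful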

Examples:

In [5]:
temp_in_C = [-20.0, -10.0, 0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0]
world_cup_years = [1994, 1998, 2002, 2006, 2010, 2014, 2018, 2022]
pH = [3.00, 3.25, 3.50, 3.75, 4.00, 4.25]
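
One way to see what "no meaningful zero point" implies is to change units. The sketch below converts a few Celsius temperatures to Fahrenheit. The helper `celsius_to_fahrenheit` and the three sample temperatures are illustrative choices. Comparisons of differences survive the change of scale, but ratios of the values themselves do not.

```python
from math import isclose

def celsius_to_fahrenheit(temp_c):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9 / 5 + 32

# Three illustrative temperatures in Celsius
cold_c, warm_c, hot_c = 10.0, 20.0, 40.0
cold_f, warm_f, hot_f = (celsius_to_fahrenheit(t) for t in (cold_c, warm_c, hot_c))

# Comparing differences gives the same answer in both scales
diff_ratio_c = (warm_c - cold_c) / (hot_c - warm_c)
diff_ratio_f = (warm_f - cold_f) / (hot_f - warm_f)

# Comparing the values themselves does not: the zero point is arbitrary
value_ratio_c = warm_c / cold_c
value_ratio_f = warm_f / cold_f

print(isclose(diff_ratio_c, diff_ratio_f))    # differences are comparable
print(isclose(value_ratio_c, value_ratio_f))  # "twice as hot" is not meaningful
```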

### Ratio

* Equally spaced, ordered numerical data
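* Can be either discrete or continuous
* Meaningful zero point that indicates an absence of the measured entity, so ratios between values are meaningful (a 20-year-old is twice as old as a 10-year-old)
* Typical ratio variables: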
  - Age in years
  - Height in inches

Examples:

In [6]:
age = [0, 5, 10, 15, 20, 25, 30]
height_in_inches = [0, 12, 24, 36, 48, 60, 72]
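
As a small check of the "meaningful zero point" property, the sketch below converts the heights above from inches to centimeters. The name `INCHES_TO_CM` is introduced here for illustration. Because a change of units for ratio data is a pure rescaling, zero stays zero and ratios between values are unchanged.

```python
# One inch is defined as exactly 2.54 centimeters
INCHES_TO_CM = 2.54

height_in_inches = [0, 12, 24, 36, 48, 60, 72]
height_in_cm = [h * INCHES_TO_CM for h in height_in_inches]

# "Twice as tall" holds regardless of the unit: compare 72 in with 36 in
ratio_in_inches = height_in_inches[6] / height_in_inches[3]
ratio_in_cm = height_in_cm[6] / height_in_cm[3]

# Zero means "no height" in either unit
zero_is_preserved = height_in_cm[0] == 0

print(ratio_in_inches, ratio_in_cm, zero_is_preserved)
```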

### Categorical

* Data is labelled by well-separated categories
* Often used as an umbrella term for nominal and ordinal data, which are unordered and ordered categorical data respectively

## Variables

* A **variable** is some quantity that is measured, such as "age"
* A single variable can be measured in different ways that give different data types:
  - "young" or "old" = ordinal
  - Age ranges (0-9, 10-19, ...) = ordinal
  - Age in years = ratio
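
The sketch below shows the same variable, age, recorded three ways. The sample ages, the helper names `age_range` and `young_or_old`, and the cutoff of 40 years are all assumptions made for illustration. Note that each coarser representation can be derived from the ratio data, but not the other way around.

```python
# Hypothetical ages in years (ratio data)
ages_in_years = [3, 17, 25, 42, 68]

def age_range(age, width=10):
    """Map an age in years to a range label such as "10-19" (ordinal data)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def young_or_old(age, threshold=40):
    """Map an age in years to "young" or "old"; the threshold is arbitrary."""
    return "young" if age < threshold else "old"

age_ranges = [age_range(a) for a in ages_in_years]
age_labels = [young_or_old(a) for a in ages_in_years]

print(age_ranges)
print(age_labels)
```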

## Records and data sets

* A **record** or **sample** is one measurement of a set of variables
* A **data set** is a set of records that measure the same set of variables in the same way
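
A minimal way to picture these definitions in plain Python is a list of dictionaries, where each dictionary is one record and each key is a variable. The variable names and values below are invented for illustration, and this layout is only one of several reasonable representations.

```python
# Each dictionary is one record: a single measurement of every variable
records = [
    {"state": "Idaho", "opinion": "agree", "temp_in_C": 10.0, "age": 25},
    {"state": "Kansas", "opinion": "neutral", "temp_in_C": 20.0, "age": 40},
    {"state": "Vermont", "opinion": "disagree", "temp_in_C": -10.0, "age": 15},
]

# In a data set, every record measures the same set of variables
variables = set(records[0])
same_variables = all(set(record) == variables for record in records)

# Collecting one variable across all records gives a column of values
ages = [record["age"] for record in records]

print(sorted(variables), same_variables, ages)
```

Here the four variables happen to cover the nominal, ordinal, interval, and ratio types described earlier.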

## Resources

* [Data+Design](https://infoactive.co/data-design) online book, Infoactive, RJI.