# Data 1: Introduction to Data

## Types of data

In this section, when I talk about the "type" of the data, I do not mean the `dtype` (`int`, `float`, `bool`, `str`) used to represent the data in a NumPy array or Pandas DataFrame. Here, "type" is used in a more abstract sense: it describes what the values mean and which operations make sense on them.

### Nominal

* Non-numerical
* Usually, but not always, strings
* Non-ordered
* Cannot be averaged

Operations typically seen: `==`, `!=`

Examples:

In [1]:
states = ["Wyoming", "Idaho", "California", "Vermont", "Kansas"]
states

['Wyoming', 'Idaho', 'California', 'Vermont', 'Kansas']

In [2]:
food = ["peanut butter", "pizza", "carrot", "apple", "ice cream"]
food

['peanut butter', 'pizza', 'carrot', 'apple', 'ice cream']

In [3]:
gender = ["male", "female"]
gender

['male', 'female']

In [4]:
import numpy as np
gender2 = np.random.choice(['m', 'f'], 10, p=[0.75, 0.25])
gender2

array(['m', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm'], 
      dtype='|S1')
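
A minimal sketch of the operations that make sense on nominal data, using only the standard library. The repeated state names below are made up for illustration. Equality tests and counting are meaningful; Python will happily compare strings with `<`, but that is only alphabetical order and says nothing about the data.

```python
from collections import Counter

# Hypothetical survey answers: nominal data with repeated values.
home_state = ["Wyoming", "Idaho", "Wyoming", "Vermont", "Idaho", "Wyoming"]

# Equality is meaningful for nominal data.
same = home_state[0] == home_state[2]

# Counting is meaningful too, so the mode (most common value) is a valid summary.
counts = Counter(home_state)
mode_value, mode_count = counts.most_common(1)[0]

print(same)
print(mode_value, mode_count)
```

The mode stands in for an "average" here, since a mean or median of state names is undefined.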

### Ordinal

* Non-numerical
* Usually, but not always, strings
* Natural ordering
* Can sometimes be averaged (a median only needs the ordering)
* Can be assigned a numerical scale, but the scale is arbitrary

Operations typically seen: `==`, `!=`, `<`, `>`, `<=`, `>=`

Examples:

In [5]:
opinion = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
opinion

['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree']

In [6]:
temperature = ["cold", "cool", "warm", "hot"]
temperature

['cold', 'cool', 'warm', 'hot']

In [7]:
age_group = ["baby", "toddler", "child", "teenager", "adult", "senior"]
age_group

['baby', 'toddler', 'child', 'teenager', 'adult', 'senior']

In [8]:
height = ["short", "medium", "tall"]
height

['short', 'medium', 'tall']

In [9]:
nba_draft = np.random.choice(
    ['Hall of Famer', 'All-Star', 'Starter', 'Sixth Man', 'Role Player', 'Bust'],
    60,
    p=[0.01, 0.04, .2, .15, .4, .2])
nba_draft

array(['Starter', 'Role Player', 'Role Player', 'Starter', 'Bust',
       'Sixth Man', 'All-Star', 'Role Player', 'Role Player', 'Sixth Man',
       'Bust', 'Role Player', 'Bust', 'Starter', 'Starter', 'Role Player',
       'Sixth Man', 'Sixth Man', 'Role Player', 'Bust', 'Starter',
       'Sixth Man', 'Role Player', 'Bust', 'Starter', 'Sixth Man',
       'Starter', 'Starter', 'Role Player', 'Role Player', 'Starter',
       'Starter', 'Role Player', 'Role Player', 'Sixth Man', 'Role Player',
       'Role Player', 'Bust', 'Role Player', 'Starter', 'Role Player',
       'Sixth Man', 'Role Player', 'Starter', 'Sixth Man', 'Role Player',
       'Sixth Man', 'Role Player', 'Role Player', 'Role Player',
       'Role Player', 'Role Player', 'Bust', 'Bust', 'Role Player',
       'Role Player', 'Role Player', 'Bust', 'Starter', 'Starter'], 
      dtype='|S13')
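
Plain Python lists of strings do not know about the natural ordering, so `"cold" < "hot"` would only compare alphabetically. One way to attach the ordering, sketched here under the assumption that pandas is available, is an ordered `pd.Categorical`. The readings are made up for illustration.

```python
import statistics

import pandas as pd

# The levels, listed from lowest to highest.
levels = ["cold", "cool", "warm", "hot"]

# Hypothetical observations stored as an ordered categorical.
readings = pd.Categorical(
    ["warm", "cold", "hot", "cool", "warm"],
    categories=levels,
    ordered=True,
)

# Ordering comparisons now follow the levels, not the alphabet.
warm_or_hotter = (readings >= "warm").tolist()

# min/max are defined because the categories are ordered.
lowest, highest = readings.min(), readings.max()

# The integer codes are positions in `levels`; a median of the codes
# only relies on the ordering, not on the spacing between levels.
median_level = levels[statistics.median_low(readings.codes.tolist())]

print(warm_or_hotter)
print(lowest, highest, median_level)
```

The codes 0 to 3 are exactly the kind of arbitrary numerical scale mentioned above: they preserve order, but nothing says "hot" is three times "cool".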

### Interval

* Equally spaced numerical data
* Ordered
* Can either be discrete (int) or continuous (float)
* No meaningful zero point

Operations typically seen: `==`, `!=`, `<`, `>`, `<=`, `>=`, `-`

Examples:

In [10]:
temp_in_C = [-20.0, -10.0, 0.0, 10.0, 20.0, 30.0, 40.0,
             50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0]
temp_in_C

[-20.0,
 -10.0,
 0.0,
 10.0,
 20.0,
 30.0,
 40.0,
 50.0,
 60.0,
 70.0,
 80.0,
 90.0,
 100.0,
 110.0]

In [11]:
world_cup_years = [1994, 1998, 2002, 2006, 2010, 2014, 2018, 2022]
world_cup_years

[1994, 1998, 2002, 2006, 2010, 2014, 2018, 2022]

In [12]:
pH = [3.00, 3.25, 3.50, 3.75, 4.00, 4.25]
pH

[3.0, 3.25, 3.5, 3.75, 4.0, 4.25]
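
To see why subtraction is listed for interval data but division is not, here is a small sketch that converts two Celsius temperatures to Kelvin. The difference survives the change of scale, while the ratio does not, because the zero of the Celsius scale is a convention rather than an absence of temperature.

```python
import math

# Two temperatures in degrees Celsius.
t1_c, t2_c = 10.0, 20.0

# The same temperatures in Kelvin (offset of 273.15).
t1_k, t2_k = t1_c + 273.15, t2_c + 273.15

# Differences agree across the two scales.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios do not: 20 C is not "twice as hot" as 10 C.
ratio_c = t2_c / t1_c
ratio_k = t2_k / t1_k

print(diff_c, round(diff_k, 6))
print(ratio_c, round(ratio_k, 4))
```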

### Ratio

* Equally spaced, ordered numerical data
* Can either be discrete or continuous
* Meaningful zero point that indicates an absence of the measured entity
* Examples:
  - Age in years
  - Height in inches

Operations typically seen: `==`, `!=`, `<`, `>`, `<=`, `>=`, `-`, `/`

Examples:

In [13]:
age = [0, 5, 10, 15, 20, 25, 30]
age

[0, 5, 10, 15, 20, 25, 30]

In [14]:
height_in_inches = [0, 12, 24, 36, 48, 60, 72]
height_in_inches

[0, 12, 24, 36, 48, 60, 72]

In [15]:
digits = list(range(10))
digits

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
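
For ratio data, division is meaningful because zero means "none of the quantity". A quick sketch: the ratio of two heights is the same whether they are measured in inches or centimeters, since changing units only rescales the values and leaves zero where it is.

```python
import math

CM_PER_INCH = 2.54

# Two heights in inches.
h1_in, h2_in = 36, 72

# The same heights in centimeters.
h1_cm, h2_cm = h1_in * CM_PER_INCH, h2_in * CM_PER_INCH

# The ratio does not depend on the unit.
ratio_in = h2_in / h1_in
ratio_cm = h2_cm / h1_cm

print(ratio_in, ratio_cm)
```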

### Categorical

* Data is labelled by well-separated categories
* Often used as an umbrella term for nominal and ordinal, which are unordered and ordered categorical data types, respectively

## Variables

* A **variable** is some quantity that is measured, such as "age"
* A single variable can be measured in different ways that give different data types:
  - "young" or "old" = ordinal
  - Age ranges (0-9, 10-19, ...) = ordinal
  - Age in years = ratio
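
As a sketch of how one variable can move between data types, the snippet below bins ages in years (ratio) into age ranges (ordinal) with `pd.cut`. The ages, bin edges, and labels are illustrative assumptions, and pandas is assumed to be installed.

```python
import pandas as pd

# Hypothetical ages in years (ratio data).
age_in_years = [3, 17, 25, 30, 8]

# Bin into decade ranges. right=False makes each bin include its lower
# edge, so 30 falls in "30-39" rather than "20-29".
age_range = pd.cut(
    age_in_years,
    bins=[0, 10, 20, 30, 40],
    right=False,
    labels=["0-9", "10-19", "20-29", "30-39"],
)

print(list(age_range))
print(age_range.ordered)
```

The conversion only goes one way: the ranges can be computed from the exact ages, but the exact ages cannot be recovered from the ranges.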

## Records and data sets

* A **record** or **sample** is one measurement of a set of variables
* A **data set** is a set of records that measure the same set of variables in the same way
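
One common way to hold these in Python, sketched here with made-up values and assuming pandas is installed, is a dictionary per record and a DataFrame for the data set. Each row is a record and each column is a variable.

```python
import pandas as pd

# Hypothetical records: each dict measures the same three variables.
records = [
    {"state": "Wyoming", "opinion": "agree", "age": 34},
    {"state": "Idaho", "opinion": "neutral", "age": 27},
    {"state": "Vermont", "opinion": "strongly agree", "age": 51},
]

# The data set: one row per record, one column per variable.
data_set = pd.DataFrame(records)

print(data_set.shape)
print(list(data_set.columns))
```

In this example the three columns are nominal (`state`), ordinal (`opinion`), and ratio (`age`), so a single data set can mix data types across variables.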

## Resources

* [Data+Design](https://infoactive.co/data-design) online book, Infoactive, RJI.