# Intro to Variable Types

## Quantitative Variables

Variables that represent amounts, such as counts or measurements

### Discrete Variables

Whole numbers representing counts

### Continuous Variables

Measurements that can be represented using decimal numbers

## Categorical Variables

Variables that focus on different ways that data can be grouped rather than
counted or measured

### Ordinal Variables

Groups that represent some kind of order (one group is better/worse than the other)

### Nominal Variables

Groups with no relational order between them

### Binary Variables

Special kind of nominal variables with only two groups
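
As a rough illustration of the five variable types, the sketch below builds a small pandas DataFrame with one column per type. The column names and values are invented for this example and are not part of the notes above.

```python
import pandas as pd

# Hypothetical toy data: one column per variable type described above.
people = pd.DataFrame({
    "n_siblings": [0, 2, 1, 3],                        # discrete (counts)
    "height_cm": [162.5, 180.1, 174.0, 169.3],         # continuous (measurements)
    "shirt_size": ["S", "L", "M", "M"],                # ordinal (ordered groups)
    "eye_color": ["brown", "blue", "green", "brown"],  # nominal (unordered groups)
    "has_pet": [True, False, True, True],              # binary (two groups)
})

# Mark the ordinal column as an ordered categorical so comparisons respect the order.
people["shirt_size"] = pd.Categorical(
    people["shirt_size"], categories=["S", "M", "L"], ordered=True
)
# The nominal column is a categorical without any order.
people["eye_color"] = pd.Categorical(people["eye_color"])

print(people.dtypes)
print("Smallest size:", people["shirt_size"].min())
print("Largest size:", people["shirt_size"].max())
```

Only the ordered categorical supports min and max; asking for the "largest" eye color would raise an error, which matches the idea that nominal groups have no order.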

# Calculating Quantitative Variables

## Mean

$ \overline{x} = \frac{x_1 + x_2 + ... + x_n}{n} $

The mean, also called the average, is one measure of the center of a dataset (not to be confused with the median).
In NumPy, np.mean and np.average return the same value for unweighted data; np.average additionally accepts optional weights.

In [1]:
import numpy as np

dataset = np.array([2, 4, 6, 8])
print("Mean Value:", np.mean(dataset))
print("Average Value:", np.average(dataset))


Mean Value: 5.0
Average Value: 5.0
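
To connect the formula to the code, here is a small sketch that computes the mean by hand and compares it with NumPy. It also shows the optional weights argument of np.average; the weights are made up purely for illustration.

```python
import numpy as np

dataset = np.array([2, 4, 6, 8])

# Mean by the formula: sum of all values divided by the number of values.
manual_mean = dataset.sum() / len(dataset)

# Weighted average: each value counts in proportion to its weight.
# These weights are hypothetical; the last value counts three times.
weights = np.array([1, 1, 1, 3])
weighted_mean = np.average(dataset, weights=weights)

print("Manual mean:", manual_mean)
print("NumPy mean:", np.mean(dataset))
print("Weighted mean:", weighted_mean)
```

With the weights above, the weighted result is (2 + 4 + 6 + 3 * 8) / 6 = 6.0, while the plain mean stays at 5.0.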


## Median

The value that falls in the middle when the dataset is ordered from smallest to largest.
If there is an even number of values in a dataset, you either report both
of the middle two values or their average.

In [4]:
import numpy as np

dataset = np.array([9,5,2,7,1,3,6,4,8])
median = int(np.median(dataset))
print("Median:", median)

Median: 5
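
The even-length case can be sketched as follows, using a made-up dataset of eight values. Note that np.median returns the average of the two middle values, so wrapping it in int() as in the cell above would silently truncate the result here.

```python
import numpy as np

dataset = np.array([9, 5, 2, 7, 1, 3, 6, 4])  # 8 values, so no single middle value

ordered = np.sort(dataset)
n = len(ordered)

# The two middle values sit at positions n/2 - 1 and n/2 of the sorted data.
middle_low, middle_high = ordered[n // 2 - 1], ordered[n // 2]
median = np.median(dataset)

print("Middle values:", middle_low, middle_high)
print("Median:", median)
```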


## Mode

The most frequently occurring observation in the dataset. A dataset can have multiple modes 
if there is more than one value with the same maximum frequency.

In [9]:
import numpy as np
from scipy import stats

data = np.array([3, 1, 2, 5, 5, 2, 5, 4, 1])
mode = stats.mode(data)

# The result exposes the modal value and its frequency as named fields
print("Mode:", mode.mode, "Count:", mode.count)

Mode: 5 Count: 3
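
Since stats.mode reports a single value, a dataset with several modes needs a different approach. One possible sketch, using np.unique on a made-up dataset in which three values tie:

```python
import numpy as np

data = np.array([3, 1, 2, 5, 5, 2, 1, 4])  # 1, 2, and 5 each appear twice

# np.unique returns the sorted distinct values and how often each one occurs.
values, counts = np.unique(data, return_counts=True)

# Keep every value whose count equals the maximum count.
max_count = counts.max()
modes = values[counts == max_count]

print("Modes:", modes, "Count:", max_count)
```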


## Variance

Variance is a descriptive statistic that describes how spread out the points in a dataset are.
We calculate the variance of our dataset by including every point in the calculation.
First, we find the difference between each data point ($X_i$) and the mean ($\mu$).
Note: We square each difference so that negative and positive differences do not cancel each other out.
<br>$ difference = (X_i - \mu)^2 $
<br>The variance ($\sigma^2$) of a whole dataset with $N$ points is the average of these squared differences:
<br>$ \sigma^2 = \frac{\sum \limits _{i=1} ^{N} (X_i - \mu)^2}{N}  $

In [13]:
import numpy as np

grades1 = np.array([1, 2, 3, 3, 4, 5, 3])
mean1 = np.average(grades1)
variance1 = np.var(grades1)

grades2 = np.array([1, 6, 6, 1, 2, 2])
mean2 = np.average(grades2)
variance2 = np.var(grades2)

print("Variance 1:", round(variance1, 2), "Mean 1:", round(mean1, 2))
print("Variance 2:", round(variance2, 2), "Mean 2:", round(mean2, 2)) 

Variance 1: 1.43 Mean 1: 3.0
Variance 2: 4.67 Mean 2: 3.0
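
The variance formula can also be followed step by step. The sketch below recomputes the first variance by hand and checks it against np.var. As an aside that goes beyond the notes above, np.var also accepts ddof=1, which divides by N - 1 and gives the sample variance instead.

```python
import numpy as np

grades = np.array([1, 2, 3, 3, 4, 5, 3])

# Step 1: the mean of the dataset.
mu = grades.mean()

# Step 2: squared difference between every data point and the mean.
squared_differences = (grades - mu) ** 2

# Step 3: average the squared differences (divide by N).
population_variance = squared_differences.sum() / len(grades)

# Aside: dividing by N - 1 instead gives the sample variance (np.var with ddof=1).
sample_variance = squared_differences.sum() / (len(grades) - 1)

print("By hand:", round(population_variance, 2))
print("np.var:", round(np.var(grades), 2))
print("Sample variance:", round(sample_variance, 2))
```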


## Standard Deviation

Variance is hard to compare to the other measures (mean, median, mode) since its value and unit
are squared. That's where standard deviation comes into play: taking the square root of the variance
brings the measure back to the original unit.
<br> $ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum \limits _{i=1} ^{N} (X_i - \mu)^2 }{N}} $

In [17]:
import numpy as np

grades1 = np.array([1, 2, 3, 3, 4, 5, 3])
std1 = np.std(grades1)

grades2 = np.array([1, 6, 6, 1, 2, 2])
std2 = np.std(grades2)

print("Standard Deviation 1:", round(std1, 2))
print("Standard Deviation 2:", round(std2, 2))

Standard Deviation 1: 1.2
Standard Deviation 2: 2.16


By finding the number of standard deviations a data point is away from the mean,
we can begin to investigate how unusual that data point truly is. For data that is approximately
normally distributed, you can expect around 68% of your data to fall within one standard deviation of the mean,
95% of your data to fall within two standard deviations of the mean, and 99.7% of your data
to fall within three standard deviations of the mean. A data point that is more than
three standard deviations away from the mean is therefore very unusual for such data.

In [19]:
import numpy as np

grades = np.array([1, 6, 6, 1, 2, 2])
mean = np.average(grades)
std = np.std(grades)

difference_two = abs(2 - mean)
difference_six = abs(6 - mean)
difference_two_std = difference_two / std
difference_six_std = difference_six / std
print("The difference of 2 in standard deviations is: ", round(difference_two_std, 2))
print("The difference of 6 in standard deviations is: ", round(difference_six_std, 2))

The difference of 2 in standard deviations is:  0.46
The difference of 6 in standard deviations is:  1.39
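
The 68%, 95%, and 99.7% figures apply to normally distributed data. A quick way to see this is to simulate such data and count how much of it falls within one, two, and three standard deviations. The sample size, seed, mean, and scale below are arbitrary choices for this sketch.

```python
import numpy as np

# Simulated, normally distributed data (all parameters are arbitrary).
rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=50, scale=10, size=100_000)

# Distance of every point from the mean, measured in standard deviations.
z_scores = (sample - sample.mean()) / sample.std()

shares = {}
for k in (1, 2, 3):
    # Fraction of points at most k standard deviations away from the mean.
    shares[k] = float(np.mean(np.abs(z_scores) <= k))
    print(f"Within {k} standard deviation(s): {shares[k]:.3f}")
```

The printed fractions should land close to 0.68, 0.95, and 0.997. For small or skewed datasets, such as the six grades above, the shares can differ noticeably.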


# Calculating Categorical Variables

## Nominal Categories

### Mode

In [24]:
import pandas as pd

pokemon = [
    {"name": "Pikachu", "type": "Electric", "evolution": "Base", "height_in_cm": 40.6},
    {"name": "Charizard", "type": "Fire", "evolution": "Final", "height_in_cm": 170},
    {"name": "Bulbasaur", "type": "Grass", "evolution": "Base", "height_in_cm": 70},
    {"name": "Ivysaur", "type": "Grass", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Venusaur", "type": "Grass", "evolution": "Final", "height_in_cm": 200},
    {"name": "Squirtle", "type": "Water", "evolution": "Base", "height_in_cm": 50},
    {"name": "Wartortle", "type": "Water", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Blastoise", "type": "Water", "evolution": "Final", "height_in_cm": 160},
    {"name": "Eevee", "type": "Normal", "evolution": "Base", "height_in_cm": 30},
    {"name": "Jolteon", "type": "Electric", "evolution": "Final", "height_in_cm": 80}
]
df = pd.DataFrame(pokemon)
type_counts = df["type"].value_counts()
print(type_counts)
# Grass and Water are tied, so this reports only the first of the two modes
print("Mode:", type_counts.index[0])


type
Grass       3
Water       3
Electric    2
Fire        1
Normal      1
Name: count, dtype: int64
Mode: Grass
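
Because Grass and Water both occur three times, this data has two modes. A minimal sketch that reports all of them uses Series.mode, which returns every value tied for the highest count; the Series below repeats the type column from the cell above.

```python
import pandas as pd

types = pd.Series([
    "Electric", "Fire", "Grass", "Grass", "Grass",
    "Water", "Water", "Water", "Normal", "Electric",
])

# Series.mode returns all values that share the highest frequency, sorted.
modes = types.mode()
print("Modes:", modes.tolist())
```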


## Ordinal Categories

### Median

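The median of an ordinal variable is the category in the middle once the values are ordered.
The code below maps each category to an integer code and takes the median of the codes.
Averaging two different middle codes is only meaningful if we assume equal spacing between the categories.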

In [27]:
import numpy as np
import pandas as pd

pokemon = [
    {"name": "Pikachu", "type": "Electric", "evolution": "Base", "height_in_cm": 40.6},
    {"name": "Charizard", "type": "Fire", "evolution": "Final", "height_in_cm": 170},
    {"name": "Bulbasaur", "type": "Grass", "evolution": "Base", "height_in_cm": 70},
    {"name": "Ivysaur", "type": "Grass", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Venusaur", "type": "Grass", "evolution": "Final", "height_in_cm": 200},
    {"name": "Squirtle", "type": "Water", "evolution": "Base", "height_in_cm": 50},
    {"name": "Wartortle", "type": "Water", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Blastoise", "type": "Water", "evolution": "Final", "height_in_cm": 160},
    {"name": "Eevee", "type": "Normal", "evolution": "Base", "height_in_cm": 30},
    {"name": "Jolteon", "type": "Electric", "evolution": "Final", "height_in_cm": 80}
]
df = pd.DataFrame(pokemon)
print(df["evolution"].unique())
evolution_categories = ["Base", "Middle", "Final"]
df["evolution"] = pd.Categorical(df["evolution"], evolution_categories, ordered=True)
median_index = int(np.median(df["evolution"].cat.codes))
median_evolution = evolution_categories[median_index]
print(median_evolution)

['Base' 'Final' 'Middle']
Middle


### Ordinal Categories Spread - Inter-quartile Range (IQR)

In [38]:
import numpy as np
import pandas as pd

pokemon = [
    {"name": "Pikachu", "type": "Electric", "evolution": "Base", "height_in_cm": 40.6, "height_category": "medium"},
    {"name": "Charizard", "type": "Fire", "evolution": "Final", "height_in_cm": 170, "height_category": "huge"},
    {"name": "Bulbasaur", "type": "Grass", "evolution": "Base", "height_in_cm": 70, "height_category": "medium"},
    {"name": "Ivysaur", "type": "Grass", "evolution": "Middle", "height_in_cm": 100, "height_category": "tall"},
    {"name": "Venusaur", "type": "Grass", "evolution": "Final", "height_in_cm": 200, "height_category": "huge"},
    {"name": "Squirtle", "type": "Water", "evolution": "Base", "height_in_cm": 50, "height_category": "medium"},
    {"name": "Wartortle", "type": "Water", "evolution": "Middle", "height_in_cm": 100, "height_category": "tall"},
    {"name": "Blastoise", "type": "Water", "evolution": "Final", "height_in_cm": 160, "height_category": "tall"},
    {"name": "Eevee", "type": "Normal", "evolution": "Base", "height_in_cm": 30, "height_category": "small"},
    {"name": "Jolteon", "type": "Electric", "evolution": "Final", "height_in_cm": 80, "height_category": "medium"}
]
df = pd.DataFrame(pokemon)
print(df["height_category"].unique())
height_categories = ["small", "medium", "tall", "huge"]
df["height_category"] = pd.Categorical(df["height_category"], height_categories, ordered=True)

p25_height_index = int(np.percentile(df["height_category"].cat.codes, 25))
p25_height_category = height_categories[p25_height_index]
print("25th percentile:", p25_height_category)

p75_height_index = int(np.percentile(df["height_category"].cat.codes, 75))
p75_height_category = height_categories[p75_height_index]
print("75th percentile:", p75_height_category)

p95th_height_index = int(np.percentile(df["height_category"].cat.codes, 95))
p95th_height_category = height_categories[p95th_height_index]
print("95th percentile:", p95th_height_category)

['medium' 'huge' 'tall' 'small']
25th percentile: medium
75th percentile: tall
95th percentile: huge
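
The interquartile range of an ordinal variable is reported as the span between the 25th and 75th percentile categories. The sketch below is one possible variant: it passes method="nearest" to np.percentile (available in NumPy 1.22 and later), so the result is always an observed category code rather than an interpolated value that has to be truncated. It reuses the height categories from the cell above.

```python
import numpy as np
import pandas as pd

height_categories = ["small", "medium", "tall", "huge"]
heights = pd.Categorical(
    ["medium", "huge", "medium", "tall", "huge",
     "medium", "tall", "tall", "small", "medium"],
    categories=height_categories,
    ordered=True,
)

# method="nearest" picks an existing code instead of interpolating between two.
q1_code, q3_code = np.percentile(heights.codes, [25, 75], method="nearest")

q1 = height_categories[int(q1_code)]
q3 = height_categories[int(q3_code)]
print(f"IQR spans from {q1} to {q3}")
```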


### Table of proportions

In [45]:
import pandas as pd

pokemon = [
    {"name": "Pikachu", "type": "Electric", "evolution": "Base", "height_in_cm": 40.6},
    {"name": "Charizard", "type": "Fire", "evolution": "Final", "height_in_cm": 170},
    {"name": "Bulbasaur", "type": "Grass", "evolution": "Base", "height_in_cm": 70},
    {"name": "Ivysaur", "type": "Grass", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Venusaur", "type": "Grass", "evolution": "Final", "height_in_cm": 200},
    {"name": "Squirtle", "type": "Water", "evolution": "Base", "height_in_cm": 50},
    {"name": "Wartortle", "type": "Water", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Blastoise", "type": "Water", "evolution": "Final", "height_in_cm": 160},
    {"name": "Eevee", "type": "Normal", "evolution": "Base", "height_in_cm": 30},
    {"name": "Jolteon", "evolution": "Final", "height_in_cm": 80}
]
df = pd.DataFrame(pokemon)
type_proportions = df.type.value_counts(normalize=True)
print("Ignoring n/a \n", type_proportions)

type_proportions_with_na = df.type.value_counts(normalize=True, dropna=False)
print("With n/a \n", type_proportions_with_na)

Ignoring n/a 
 type
Grass       0.333333
Water       0.333333
Electric    0.111111
Fire        0.111111
Normal      0.111111
Name: proportion, dtype: float64
With n/a 
 type
Grass       0.3
Water       0.3
Electric    0.1
Fire        0.1
Normal      0.1
NaN         0.1
Name: proportion, dtype: float64
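
The two tables differ only in the denominator. As a sketch, the same proportions can be rebuilt by hand from the raw counts, using the type column from the cell above with the missing value written as None.

```python
import pandas as pd

types = pd.Series([
    "Electric", "Fire", "Grass", "Grass", "Grass",
    "Water", "Water", "Water", "Normal", None,
])

# Ignoring missing values: divide by the number of non-missing entries (9).
counts = types.value_counts()
proportions = counts / counts.sum()

# Counting missing values as their own group: divide by all entries (10).
counts_with_na = types.value_counts(dropna=False)
proportions_with_na = counts_with_na / len(types)

print(proportions)
print(proportions_with_na)
```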


## Binary Categorical Variables

In [47]:
import pandas as pd

pokemon = [
    {"name": "Pikachu", "type": "Electric", "evolution": "Base", "height_in_cm": 40.6},
    {"name": "Charizard", "type": "Fire", "evolution": "Final", "height_in_cm": 170},
    {"name": "Bulbasaur", "type": "Grass", "evolution": "Base", "height_in_cm": 70},
    {"name": "Ivysaur", "type": "Grass", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Venusaur", "type": "Grass", "evolution": "Final", "height_in_cm": 200},
    {"name": "Squirtle", "type": "Water", "evolution": "Base", "height_in_cm": 50},
    {"name": "Wartortle", "type": "Water", "evolution": "Middle", "height_in_cm": 100},
    {"name": "Blastoise", "type": "Water", "evolution": "Final", "height_in_cm": 160},
    {"name": "Eevee", "type": "Normal", "evolution": "Base", "height_in_cm": 30},
    {"name": "Jolteon", "type": "Electric", "evolution": "Final", "height_in_cm": 80}
]
df = pd.DataFrame(pokemon)

large_frequency = (df.height_in_cm > 150).sum()
large_proportion = (df.height_in_cm > 150).mean()

print("Frequency :", large_frequency)
print("Proportion:", large_proportion)

Frequency : 3
Proportion: 0.3
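
The reason sum and mean give the frequency and the proportion is that a comparison produces a binary variable whose True and False values behave like 1 and 0. A short sketch with the same heights:

```python
import numpy as np

heights_cm = np.array([40.6, 170, 70, 100, 200, 50, 100, 160, 30, 80])

# The comparison yields one True/False value per entry: a binary variable.
is_large = heights_cm > 150
print(is_large.astype(int))

# Summing counts the True values; the mean is that count divided by the total.
frequency = is_large.sum()
proportion = is_large.mean()
print("Frequency :", frequency)
print("Proportion:", proportion)
```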
