# What Is a Variable?

Understanding the characteristics behind measurable data

## Python Libraries

In [2]:
import pandas as pd

## Situation 0: No Variables at All

Imagine:

1. You have two rows, each representing a person (unit = individual).
2. No characteristics are recorded, only names.

You might try to analyze or summarize the data, but then questions arise:

> What can you actually measure or compare?
> How do you summarize anything?


🔺 At this point, you realize there's nothing to analyze.

**Final question:**

If you have units of observation but no information to describe them, what allows you to characterize, compare, or summarize them?

### Problem Demonstration

Imagine the initial dataset without any variables, only the names of the individuals:

In [3]:
# Dataset with only units, no variables
data_no_variable = {
    "person": ["Alice", "Bob"]
}

df_no_variable = pd.DataFrame(data_no_variable)
df_no_variable

|   | person |
|---|--------|
| 0 | Alice  |
| 1 | Bob    |


🔺 We have two units of observation, but there is no variable to describe them. You cannot calculate averages, sums, counts of characteristics, or make any meaningful comparisons.
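
The same point can be checked in plain Python. This is only an illustrative sketch: the `records` list and the `describing_fields` helper are hypothetical names that do not appear in the notebook, and built-in types are used instead of pandas so the snippet stays self-contained.

```python
# Hypothetical records mirroring the dataset above: an identifier and nothing else
records = [{"person": "Alice"}, {"person": "Bob"}]

def describing_fields(rows, id_field="person"):
    """Return the field names that describe a unit, excluding its identifier."""
    fields = set()
    for row in rows:
        fields.update(key for key in row if key != id_field)
    return sorted(fields)

variables = describing_fields(records)
print("Units:", len(records))      # how many units of observation we have
print("Variables:", variables)     # which fields describe those units
```

With no describing fields, there is nothing to pass to a mean, a sum, or a comparison.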

### Solution: Introduce a Variable

Let's add a variable, for example `age`, so that we can start analyzing the data:

In [5]:
df_variable = df_no_variable.copy()

# Adding a variable to characterize each person
df_variable["age"] = [30, 25]

cols_variable = ["person", "age"]
df_variable = df_variable[cols_variable]
df_variable

|   | person | age |
|---|--------|-----|
| 0 | Alice  | 30  |
| 1 | Bob    | 25  |


### Demonstrating Analysis

Now that we have a variable, we can perform simple analyses:

In [6]:
# Average age
average_age = df_variable["age"].mean()
print("Average age:", average_age)

# Age difference
age_diff = df_variable["age"].max() - df_variable["age"].min()
print("Age difference:", age_diff)

Average age: 27.5
Age difference: 5


## Situation 1: Inconsistent Representation

Imagine:

1. You have three students (unit = individual).
2. You record their ages, but the values are inconsistent: 15, "sixteen", "17 years".

You might try to analyze or summarize the data, but then a question arises:

> How do you compare those values?

🔺 At this point, you realize the variable cannot be analyzed or summarized reliably.

**Final Question:**

If a variable exists but its values are represented inconsistently, how can you perform meaningful analysis or comparison?

### Problem Demonstration

We start with a small dataset with inconsistent representation of the variable age:

In [19]:
data_age_inconsistent = {
    "student": ["S1", "S2", "S3"],
    "age": [15, "sixteen", "17 years"]  # mixed formats
}

df_age_inconsistent = pd.DataFrame(data_age_inconsistent)
df_age_inconsistent

|   | student | age      |
|---|---------|----------|
| 0 | S1      | 15       |
| 1 | S2      | sixteen  |
| 2 | S3      | 17 years |


🔺 We have a variable (age), but its values are represented inconsistently. Any calculation (mean, max) is ambiguous at best and may fail outright, because numbers and text cannot be combined.
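
A minimal sketch of what goes wrong, using plain Python rather than pandas. The names `ages_raw` and `naive_mean` are hypothetical and introduced only for this illustration.

```python
# Hypothetical list holding the same mixed-format ages as the table above
ages_raw = [15, "sixteen", "17 years"]

def naive_mean(values):
    """Average the values as-is, with no cleaning step."""
    return sum(values) / len(values)

try:
    result = naive_mean(ages_raw)
except TypeError as exc:
    # Adding an int to a str is undefined, so the summary cannot be computed
    result = None
    print("Cannot compute mean:", exc)
```

The failure is useful here: it signals that the values need a common representation before any summary makes sense.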

### Solution: Standardize the Variable

Let's convert all ages to years as integers:

In [20]:
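# Function to standardize age to whole years
def convert_age(value):
    """Return the age in whole years, or None if the value cannot be parsed."""
    if isinstance(value, int):
        # Heuristic for this student dataset: integers above 30 are assumed to be months
        return value // 12 if value > 30 else value
    elif isinstance(value, str):
        # Spelled-out ages, e.g. "sixteen"
        text_to_num = {"fifteen": 15, "sixteen": 16, "seventeen": 17}
        for word, num in text_to_num.items():
            if word in value.lower():
                return num
        # Otherwise keep only the digits, e.g. "17 years" -> 17
        digits = ''.join(c for c in value if c.isdigit())
        if digits:
            return int(digits)
    return None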

# Make a copy of the dataframe
df_age = df_age_inconsistent.copy()

# Apply conversion
df_age["age_standardized"] = df_age["age"].apply(convert_age)
df_age

|   | student | age      | age_standardized |
|---|---------|----------|------------------|
| 0 | S1      | 15       | 15               |
| 1 | S2      | sixteen  | 16               |
| 2 | S3      | 17 years | 17               |


### Demonstrating Analysis

Now you can, for example, compute meaningful summaries:

In [21]:
# Average age
average_age = df_age["age_standardized"].mean()
print("Average age:", average_age)

# Maximum age
max_age = df_age["age_standardized"].max()
print("Maximum age:", max_age)

Average age: 16.0
Maximum age: 17
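
Before trusting summaries like these, it can help to confirm that every value was actually parsed. `convert_age` returns `None` for anything it cannot interpret, and `mean()` in pandas skips missing values by default, so an unparsed age could drop out of the average without any warning. The sketch below shows one possible check in plain Python. The `unparsed` helper and the extra `"unknown"` entry are hypothetical and added only to illustrate the failure case.

```python
# Hypothetical raw and standardized ages; None marks a value that could not be parsed
raw_ages = [15, "sixteen", "17 years", "unknown"]
standardized = [15, 16, 17, None]

def unparsed(raw, clean):
    """Pair each raw value with its position when standardization returned None."""
    return [(i, r) for i, (r, c) in enumerate(zip(raw, clean)) if c is None]

problems = unparsed(raw_ages, standardized)
print("Unparsed values:", problems)

# Summarize only the values that were parsed successfully
valid = [c for c in standardized if c is not None]
print("Average of parsed ages:", sum(valid) / len(valid))
```

Reporting the unparsed rows alongside the summary makes it clear how many units the average really covers.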


## Situation 2: Mixed Scales / Non-Standard Values

Imagine:

1. You have four students (unit = individual).
2. You record their scores, but they are in different formats: 90, "B", 7.5, "A+"

You might try to calculate averages, rank students, or compare performances, but then questions arise:

> How can you calculate the average score?
> How do you compare those values?
> Which scale should you use to summarize performance?

🔺 At this point, you realize the variable cannot be analyzed or compared reliably.

**Final Question:**

If a variable exists but its values are not standardized, how can you perform meaningful analysis or comparison?

### Problem Demonstration

We start with a small dataset with mixed representations of scores:

In [15]:
data_scores_mixed = {
    "student": ["S1", "S2", "S3", "S4"],
    "score": [90, "B", 7.5, "A+"]
}

df_scores_mixed = pd.DataFrame(data_scores_mixed)
df_scores_mixed

|   | student | score |
|---|---------|-------|
| 0 | S1      | 90    |
| 1 | S2      | B     |
| 2 | S3      | 7.5   |
| 3 | S4      | A+    |


🔺 We have a variable (score), but its values are not standardized. Any calculation (mean, ranking) is ambiguous.
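
To make the ambiguity concrete, here is a small sketch in plain Python. The names `scores_raw`, `ranking`, and `numeric_only` are hypothetical and exist only for this illustration.

```python
# Hypothetical list holding the same mixed-format scores as the table above
scores_raw = [90, "B", 7.5, "A+"]

try:
    ranking = sorted(scores_raw, reverse=True)
except TypeError as exc:
    # Python defines no ordering between numbers and strings
    ranking = None
    print("Cannot rank scores:", exc)

# Even the purely numeric values mislead: 90 and 7.5 sit on different scales
numeric_only = [s for s in scores_raw if isinstance(s, (int, float))]
print("Numeric values ranked as-is:", sorted(numeric_only, reverse=True))
```

Ranking fails on the mixed types, and ranking only the numbers would place 7.5 far below 90 even though it may come from a 0-10 scale.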

### Solution: Standardize the Variable

Let's convert all scores to a numeric scale from 0 to 100:

In [16]:
# Function to standardize scores
def convert_score(value):
    letter_to_num = {"A+": 100, "A": 95, "B": 85, "C": 75, "D": 65, "F": 50}
    if isinstance(value, (int, float)):
        # Assume 0-10 scale if <=10
        return value * 10 if value <= 10 else value
    elif isinstance(value, str):
        value = value.strip().upper()
        if value in letter_to_num:
            return letter_to_num[value]
    return None

# Copy the dataframe
df_scores = df_scores_mixed.copy()

# Apply conversion
df_scores["score_standardized"] = df_scores["score"].apply(convert_score)
df_scores

|   | student | score | score_standardized |
|---|---------|-------|--------------------|
| 0 | S1      | 90    | 90.0               |
| 1 | S2      | B     | 85.0               |
| 2 | S3      | 7.5   | 75.0               |
| 3 | S4      | A+    | 100.0              |


### Demonstrating Analysis

Now you can, for example, compute meaningful summaries:

In [17]:
# Average score
average_score = df_scores["score_standardized"].mean()
print("Average score:", average_score)

# Maximum score
max_score = df_scores["score_standardized"].max()
print("Maximum score:", max_score)


Average score: 87.5
Maximum score: 100.0
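
These summaries rest on the guess inside `convert_score` that any number at or below 10 belongs to a 0-10 scale. A genuinely low mark on the 0-100 scale, such as 8, would be inflated to 80 under that rule. One possible alternative, sketched below under the assumption that the source scale is recorded next to each score, is to rescale explicitly. The `rescale` function, the `scale_max` field, and student `S5` are hypothetical and not part of the notebook.

```python
def rescale(value, source_max, target_max=100):
    """Linearly map a score from [0, source_max] onto [0, target_max]."""
    if not 0 <= value <= source_max:
        raise ValueError(f"{value} is outside the 0-{source_max} scale")
    return value * target_max / source_max

# Hypothetical records that store the scale alongside each score
scored = [
    {"student": "S1", "score": 90, "scale_max": 100},
    {"student": "S3", "score": 7.5, "scale_max": 10},
    {"student": "S5", "score": 8, "scale_max": 100},  # a genuinely low mark
]

for row in scored:
    row["score_standardized"] = rescale(row["score"], row["scale_max"])
    print(row["student"], row["score_standardized"])
```

Storing the scale with the value removes the need to guess, at the cost of one extra field per record.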


## Conclusions

### Variables are essential 

Without them, data cannot be characterized, measured, or summarized.

### Consistency matters

All values of a variable must be represented in a uniform and interpretable format.

### Standardization enables analysis

Values must be on the same scale to allow meaningful comparison, aggregation, and summary.

## 🔹 Key takeaway

Properly defined, consistent, and standardized variables are the foundation for any meaningful data analysis.

## Formal Definition of a Variable

A variable is a characteristic or attribute of a unit of observation that can take on different values across units, is represented consistently, and allows meaningful analysis, comparison, and summarization.

## Mathematical Definition of a Variable

Let 

$$
U = \{u_1, u_2, \dots, u_n\}
$$

be the set of **units of observation**.  

A **variable** $X$ is a **function**:

$$
X: U \rightarrow V
$$

where:  

- $U$ is the set of units (e.g., students, individuals, households)  
- $V$ is the set of possible **values** the variable can take (e.g., integers, real numbers, categories)  

For each unit $u_i \in U$, the variable assigns a value:

$$
X(u_i) \in V
$$

**Example:**  

- Units: Alice, Bob, Charlie  
- Variable: X = age  
- Values: Alice -> 15, Bob -> 16, Charlie -> 17  

This captures the essential properties: **every unit has a value**, values are **well-defined**, and the variable can be **analyzed, compared, and summarized reliably**.
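
The definition can be mirrored in a few lines of plain Python. This is a sketch under stated assumptions: $V$ is taken to be the non-negative integers, the variable is stored as a dictionary, and the helper names `in_V` and `is_variable` are hypothetical.

```python
# Units of observation U
U = ["Alice", "Bob", "Charlie"]

def in_V(value):
    """Membership test for V, assumed here to be the non-negative integers."""
    return isinstance(value, int) and value >= 0

# The variable X = age, stored as a mapping from units to values
X = {"Alice": 15, "Bob": 16, "Charlie": 17}

def is_variable(mapping, units, in_value_set):
    """Check that the mapping is defined on every unit and only takes values in V."""
    total = all(u in mapping for u in units)
    well_typed = all(in_value_set(v) for v in mapping.values())
    return total and well_typed

print(is_variable(X, U, in_V))
print(is_variable({"Alice": 15, "Bob": "sixteen"}, U, in_V))
```

The second mapping fails on both counts: Charlie has no value, and "sixteen" is not an element of $V$. These correspond to the missing-variable and inconsistent-representation situations described earlier.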
