
# From Messy to Meaningful: Mastering the Art of Data Cleaning
### Data Science:  A Practical and Philosophical Introduction | Brendan Shea, PhD

The journey from raw data to actionable insights is fraught with challenges. At the core of this journey lies a critical, often underappreciated process: data cleaning. Also known as data cleansing or scrubbing, this fundamental step transforms chaotic, error-prone datasets into reliable foundations for analysis.

Data cleaning is not merely about tidying up; it's about ensuring the integrity and validity of your entire analytical process. Consider a dataset of high school sports statistics. At first glance, it might seem like a goldmine of information: player performance metrics, team records, and game results. However, lurking beneath the surface are potential pitfalls that could derail even the most sophisticated analysis:

1.  Duplicate Entries--A star player's 30-goal season accidentally recorded twice, inflating their stats to an impossible 60 goals.
2.  Missing Values--Half the forwards missing assist data, making fair comparisons impossible.
3.  Inconsistent Formats--"Soccersaurus Rex" and "Soccer-saurus Rex" recorded as different teams, fragmenting their data.
4.  Outliers--A goalkeeper with 50 goals, challenging the bounds of plausibility.
5.  Data Type Mismatches--Goals recorded as text instead of numbers, preventing proper statistical analysis.
6.  Standardization Issues--Heights recorded in a mix of feet, inches, and centimeters.

These issues are not unique to sports analytics. Whether you're working in finance, healthcare, marketing, or any other data-rich field, the challenges of "dirty" data are universal. Left unaddressed, they can lead to flawed analyses, misguided decisions, and missed opportunities.

In this chapter, we'll dive deep into the world of data cleaning, using a high school sports dataset as our illustrative playground. We'll explore a comprehensive range of data quality issues and learn techniques to transform messy data into a robust foundation for analysis. You'll discover how to leverage both SQL and Python's Pandas library to tackle these challenges, giving you a versatile toolkit for data cleaning across various scenarios.

We'll start by examining the anatomy of dirty data, learning to identify common issues that plague datasets across industries. From there, we'll explore the data cleaning process step-by-step, covering techniques such as:

-   Handling missing data through imputation and deletion strategies
-   Identifying and removing duplicate records
-   Standardizing inconsistent data formats and units
-   Dealing with outliers through detection and treatment methods
-   Correcting data type mismatches and parsing errors
-   Normalizing and scaling data for consistency

Throughout the chapter, we'll compare and contrast two primary approaches to data cleaning: the Extract, Transform, Load (ETL) process typically associated with Python and Pandas, and the Extract, Load, Transform (ELT) process often used with SQL in database environments. You'll gain hands-on experience with both methods, learning when and why to choose one over the other.

Learning Outcomes: By the end of this chapter, you will be able to:

1.  Identify and categorize common types of data quality issues in diverse datasets.
2.  Apply a wide range of SQL techniques for data cleaning in an ELT (Extract, Load, Transform) workflow.
3.  Utilize Python's Pandas library for comprehensive data cleaning in an ETL (Extract, Transform, Load) process.
4.  Transform raw, messy data into clean, analysis-ready datasets using both programmatic and visual inspection methods.
5.  Compare and contrast ETL and ELT approaches, understanding their respective strengths and appropriate use cases.
6.  Implement various data cleaning strategies including handling missing values, correcting data types, and normalizing data.
7.  Detect and handle outliers using statistical methods and domain knowledge.
8.  Standardize inconsistent data formats and units across large datasets.
9.  Create documented, reproducible data cleaning pipelines for ensuring data quality in ongoing projects.

Keywords: Data cleaning, data quality, ETL, ELT, SQL, Pandas, Python, duplicates, missing values, outliers, normalization, standardization, imputation, data types, consistency, automation, reproducibility, pipelines, time series, unstructured data, documentation, scalability, data integrity, transformation, validation, data profiling, data wrangling, data preprocessing, data scrubbing, data munging, data quality assurance

## Sample Data: High School Soccer League

To illustrate common data cleaning challenges, let's create a sample dataset for a fictitious high school soccer league. This dataset will intentionally include various data quality issues that we'll address throughout the chapter.

In [None]:
import pandas as pd
import numpy as np
import sqlite3
import random

# Generate sample data
np.random.seed(42)

teams = [
    "Goal Getters", "Soccersaurus Rex", "Soccer-saurus Rex", "Kick It Up",
    "Net Navigators", "Dribble Trouble", "Goal Diggers", "Turf Titans",
    "Cleat Commanders", "Ball Busters"
]

players = [
    "Alex", "Beckham", "Cristiano", "Diego", "Eden", "Fernando", "Grace", "Hope",
    "Isco", "Javier", "Kristine", "Lionel", "Marta", "Neymar", "Oscar", "Paul",
    "Quincy", "Raheem", "Sam", "Tobin"
]


positions = ["Forward", "Midfielder", "Defender", "Goalkeeper"]

# Create DataFrame
data = []
for _ in range(200):  # Generate 200 rows of data
    team = np.random.choice(teams)
    player = np.random.choice(players) + " " + chr(np.random.randint(65, 91)) + "."
    position = np.random.choice(positions)
    goals = np.random.randint(0, 30) if position != "Goalkeeper" else np.random.randint(0, 3)
    assists = np.random.randint(0, 20) if position != "Goalkeeper" else np.random.randint(0, 5)
    saves = np.random.randint(0, 100) if position == "Goalkeeper" else 0

    # Introduce data quality issues
    if np.random.random() < 0.1:  # 10% chance of issues
        issue = np.random.choice(["duplicate", "missing", "invalid", "outlier", "negative", "type_error"])
        if issue == "duplicate":
            data.append([team, player, position, goals, assists, saves])  # Duplicate entry
            data.append([team, player, position, goals, assists, saves])  # Duplicate entry
        elif issue == "missing":
            data.append([team, player, position, goals, None, saves])  # Missing assists
        elif issue == "invalid":
            data.append([team, player, "Striker", goals, assists, saves])  # Invalid position
        elif issue == "outlier":
            data.append([team, player, position, 100, 50, 200])  # Unrealistic stats
        elif issue == "negative":
            data.append([team, player, position, goals * -1, assists * -1, saves * -1])  # Negative stats
        elif issue == "type_error":
            data.append([team, player, position, f"{goals} goals", assists, saves])  # Type error
    else:
        data.append([team, player, position, goals, assists, saves])

df = pd.DataFrame(data, columns=["Team", "Player", "Position", "Goals", "Assists", "Saves"])
# Insert primary keys:
df["ID"] = np.random.randint(10000, 99999, size=len(df))
# put it at front
df = df[["ID", "Team", "Player", "Position", "Goals", "Assists", "Saves"]]

# Save to CSV
df.to_csv("high_school_soccer_data.csv", index=False)

# Create SQLite database and load data
conn = sqlite3.connect("soccer_league.db")
df.to_sql("player_stats", conn, if_exists="replace", index=False)
conn.close()

print("Data generated and saved to 'high_school_soccer_data.csv' and 'soccer_league.db'")

Data generated and saved to 'high_school_soccer_data.csv' and 'soccer_league.db'


In [None]:
%reload_ext sql
%config SqlMagic.autopandas=True
%sql sqlite:///soccer_league.db

In [None]:
%%sql
SELECT * FROM player_stats LIMIT 10;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves
0,22676,Goal Diggers,Tobin O.,Defender,7,6.0,0
1,20816,Goal Diggers,Kristine K.,Goalkeeper,0,3.0,23
2,42733,Net Navigators,Beckham X.,Goalkeeper,1,1.0,63
3,95680,Goal Getters,Lionel Z.,Striker,28,11.0,0
4,12050,Ball Busters,Paul O.,Midfielder,29,14.0,0
5,43933,Goal Diggers,Tobin Y.,Defender,4,18.0,0
6,34951,Cleat Commanders,Grace R.,Goalkeeper,0,3.0,13
7,42413,Cleat Commanders,Beckham T.,Goalkeeper,2,3.0,70
8,44801,Turf Titans,Oscar C.,Midfielder,16,3.0,0
9,65619,Kick It Up,Beckham F.,Midfielder,9,3.0,0


This Python script generates our sample dataset with intentional problems and saves it as both a CSV file and an SQLite database. Let's break down the data generation process and the issues we've introduced:

1.  We've created a list of 10 teams. Note that "Soccersaurus Rex" appears twice with slightly different spellings.
2.  We have a pool of 20 player names (with a random initial for a last name) that will be randomly assigned to teams.
3.  We're using four standard soccer positions.
4.  The script creates 200 rows of data, randomly assigning players to teams and positions, and generating plausible statistics for goals, assists, and saves.
5.  *Intentional Issues*. With a 10% probability, the script introduces one of six types of data quality problems:
    -   **Duplicates**: Identical rows are added.
    -   **Missing Values**: The 'Assists' field is set to `None`, which Pandas represents as NaN (Not a Number) and SQLite stores as `NULL`.
    -   **Invalid Data**: A nonexistent position, "Striker", is used instead of one of the valid positions.
    -   **Outliers**: Unrealistically high values are set for goals, assists, and saves.
    -   **Negative Values**: Goals, assists, and saves are multiplied by -1, producing impossible negative statistics.
    -   **Type Errors**: The goal count, which should be numeric, is stored as a string such as "12 goals".
6. The generated data is saved as a CSV file named "high_school_soccer_data.csv" and loaded into an SQLite database named "soccer_league.db".

This dataset now serves as our playground for exploring various data cleaning techniques. In the following sections, we'll use both SQL and Pandas to identify and address these data quality issues, demonstrating the importance and methods of data cleaning in a sports analytics context.

First, we'll show how we can diagnose and correct issues using SQL (a common practice if one is using Extract-Load-Transform). Then, we'll show how we could do the same thing using Python and the Pandas library (a common practice if one is using Extract-Transform-Load). See the previous chapter for a discussion of the advantages and disadvantages of each approach.

## Reasoning for Cleaning Data
Data cleaning is a crucial step in the data analysis process. In our high school soccer league dataset, we've intentionally introduced several types of data quality issues. Let's examine each of these issues and understand why addressing them is essential for accurate analysis.

### Duplicate Data
Duplicate data refers to identical or very similar records that appear multiple times in a dataset.  Example from our dataset:

In [None]:
%%sql
-- A duplicate is a row that appears more than once
SELECT Team, Player, Position, Goals, Assists, Saves, COUNT(*) as Count
FROM player_stats
GROUP BY Team, Player, Position, Goals, Assists, Saves
HAVING Count > 1;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team,Player,Position,Goals,Assists,Saves,Count
0,Cleat Commanders,Grace A.,Defender,12,16.0,0,2
1,Dribble Trouble,Kristine E.,Goalkeeper,2,2.0,19,2
2,Goal Diggers,Sam H.,Defender,20,14.0,0,2
3,Net Navigators,Marta I.,Forward,14,12.0,0,2
4,Soccer-saurus Rex,Eden V.,Forward,28,13.0,0,2
5,Soccersaurus Rex,Quincy N.,Midfielder,12,8.0,0,2


This query might reveal duplicate entries for some players.

*Why it's a problem--* Duplicate data can lead to overestimation of player performance or team strength. If a player's 30-goal season is recorded twice, it could incorrectly suggest they scored 60 goals, skewing individual and team statistics.

Let's fix this by simply deleting the "copy" with the higher ID:

In [None]:
%%sql
-- Delete duplicates while keeping one instance
DELETE FROM player_stats
WHERE ID NOT IN (
    SELECT MIN(ID)
    FROM player_stats
    GROUP BY Team, Player, Position, Goals, Assists, Saves
);


 * sqlite:///soccer_league.db
6 rows affected.


This SQL query is designed to remove duplicate rows from the `player_stats` table while keeping one instance of each unique combination. Let's break it down step by step:

1.  `DELETE FROM player_stats`. This is the main operation, indicating we're going to delete rows from the `player_stats` table.
2.  `WHERE ID NOT IN (...)`: This clause specifies which rows to delete. It will delete any row whose `ID` is not in the set of IDs returned by the subquery.
3.  The subquery groups the rows by `Team`, `Player`, `Position`, `Goals`, `Assists`, and `Saves`.
    -   For each unique combination of these fields, it selects the minimum `ID`.
    -   `ID` is the identifier column that our generation script added to every row. Because those IDs were drawn at random, "lowest ID" is just a consistent rule for choosing which copy to keep; it says nothing about which row was entered first. The approach also relies on the IDs being unique, which the script does not enforce. (SQLite's built-in `ROWID` column is an alternative tie-breaker.)
4.  The effect of this query:
    -   It groups all rows that have identical values for `Team`, `Player`, `Position`, `Goals`, `Assists`, and `Saves`.
    -   For each group of identical rows, it keeps the one with the lowest `ID` and deletes the rest.

In essence, this query removes duplicate entries from the `player_stats` table, keeping only one instance of each unique combination of player statistics. It's an efficient way to de-duplicate data while ensuring that you retain one copy of each unique record.

In [None]:
%%sql
--verify the change
SELECT Team, Player, Position, Goals, Assists, Saves, COUNT(*) as Count
FROM player_stats
GROUP BY Team, Player, Position, Goals, Assists, Saves
HAVING Count > 1;

 * sqlite:///soccer_league.db
Done.
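For comparison, here is a minimal Pandas sketch of the same de-duplication idea. The three-row `stats` table and the shortened `key_cols` list are made up for illustration; on the real data you would list all six grouping columns.

```python
import pandas as pd

# Hypothetical mini-table: the two "Sam H." rows are copies with different IDs.
stats = pd.DataFrame({
    "ID": [205, 101, 310],
    "Team": ["Goal Diggers", "Goal Diggers", "Kick It Up"],
    "Player": ["Sam H.", "Sam H.", "Marta P."],
    "Goals": [20, 20, 12],
})

# Columns that define "the same record" (shortened here for the example).
key_cols = ["Team", "Player", "Goals"]

# Sorting by ID first makes keep="first" mean "keep the lowest ID",
# which mirrors SELECT MIN(ID) ... GROUP BY in the SQL version.
deduped = (
    stats.sort_values("ID")
         .drop_duplicates(subset=key_cols, keep="first")
         .reset_index(drop=True)
)
print(deduped)
```

Calling `stats.duplicated(subset=key_cols, keep=False)` beforehand is a handy way to inspect every copy before deleting anything.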


### Redundant Data
Redundant data is information that is repeated unnecessarily or can be derived from other data points.  In our current dataset, we don't have explicit redundant data. However, if we had included both "Goals+Assists" and separate "Goals" and "Assists" columns, that would be an example of redundancy. Example:

| Player   | Goals | Assists | Goals+Assists |
|----------|-------|---------|---------------|
| Alex     | 10    | 5       | 15            |
| Beckham  | 8     | 7       | 15            |
| Cristiano| 12    | 3       | 15            |
| Marta    | 9     | 6       | 15            |
| Lionel   | 11    | 4       | 15            |


*Why it's a problem--* Redundant data increases storage requirements and can lead to inconsistencies if one instance of the data is updated but not the other.
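One way to catch this kind of inconsistency is to recompute the derived column and compare it with the stored one. The sketch below uses a made-up table in which one stored total has deliberately gone stale; the column names follow the example table above.

```python
import pandas as pd

# Hypothetical table with a redundant "Goals+Assists" column.
table = pd.DataFrame({
    "Player": ["Alex", "Beckham", "Cristiano"],
    "Goals": [10, 8, 12],
    "Assists": [5, 7, 3],
    "Goals+Assists": [15, 15, 16],  # last value is deliberately out of date
})

# Recompute the total and keep only the rows where the stored value disagrees.
recomputed = table["Goals"] + table["Assists"]
inconsistent = table[recomputed != table["Goals+Assists"]]
print(inconsistent)
```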

### Missing (NULL) Values
Missing values are data points that are not present for some records. Example from our dataset:

In [None]:
%%sql
-- select where ANY column is null
SELECT * FROM player_stats
WHERE
  Team IS NULL
  OR Player IS NULL
  OR Position IS NULL
  OR Goals IS NULL
  OR Assists IS NULL
  OR Saves IS NULL;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves
0,28297,Goal Getters,Hope N.,Goalkeeper,1,,86
1,98801,Cleat Commanders,Tobin U.,Goalkeeper,0,,38
2,36292,Dribble Trouble,Oscar X.,Midfielder,10,,0
3,41598,Cleat Commanders,Kristine H.,Midfielder,24,,0


We can also get a "null count" for each column as follows:

In [None]:
%%sql
-- null count by column
-- each time a null occurs, we add one to its count
SELECT
  COUNT(CASE WHEN Team IS NULL THEN 1 END) AS Team_Nulls,
  COUNT(CASE WHEN Player IS NULL THEN 1 END) AS Player_Nulls,
  COUNT(CASE WHEN Position IS NULL THEN 1 END) AS Position_Nulls,
  COUNT(CASE WHEN Goals IS NULL THEN 1 END) AS Goals_Nulls,
  COUNT(CASE WHEN Assists IS NULL THEN 1 END) AS Assists_Nulls,
  COUNT(CASE WHEN Saves IS NULL THEN 1 END) AS Saves_Nulls
FROM player_stats;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team_Nulls,Player_Nulls,Position_Nulls,Goals_Nulls,Assists_Nulls,Saves_Nulls
0,0,0,0,0,4,0


As you can see, we are missing some data for assists.

*Why this matters--* Missing values can skew analyses and make it difficult to compare players or teams fairly. For instance, if assist data is missing for some forwards, it becomes challenging to evaluate their overall offensive contribution.

#### What is `CASE`?
The query that counts nulls uses the `CASE` statement, which you might not have seen before. `CASE` in SQL is a way to perform conditional logic within your queries. It allows you to create conditions and return different values based on whether those conditions are met. The basic syntax of a `CASE` statement is:

```sql
CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    ...
    ELSE resultN
END
```

-   `WHEN` specifies a condition to evaluate.
-   `THEN` specifies the result to return if the condition is true.
-   `ELSE` specifies the result to return if none of the conditions are true (optional; without it, `CASE` returns `NULL` when nothing matches).
-   `END` marks the end of the `CASE` statement.

Now let's look at how the `CASE` statement is used in the provided query to count `NULL` values in each column.

1.  `CASE WHEN Team IS NULL THEN 1 END`: This `CASE` statement checks if the `Team` column is `NULL`.

    -   `WHEN Team IS NULL`: This condition checks if the value in the `Team` column is `NULL`.
    -   `THEN 1`: If the condition is true (the value is `NULL`), the `CASE` statement returns `1`.
    -   `END`: Marks the end of the `CASE` statement.

    This `CASE` statement is nested inside the `COUNT` function.

2.  `COUNT(CASE ... END)`: The `COUNT` function counts the number of non-`NULL` values returned by the `CASE` statement. Since the `CASE` statement only returns `1` for `NULL` values, `COUNT` effectively counts the number of `NULL` values in the column.

3.  `AS Team_Nulls`: This assigns the result of the `COUNT` function a meaningful alias (`Team_Nulls`), indicating it represents the number of `NULL` values in the `Team` column.

The same logic applies to the other columns (`Player`, `Position`, `Goals`, `Assists`, and `Saves`), each using a similar `CASE` statement to count `NULL` values.
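The counting trick described above can be checked on a tiny in-memory table. This is only a sketch: the `demo` table and its three rows are invented, and the second expression (`COUNT(*) - COUNT(Assists)`) is an alternative formulation, not the one used in the chapter's query.

```python
import sqlite3

# Throwaway in-memory database; nothing is written to soccer_league.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Player TEXT, Assists REAL)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [("Hope N.", None), ("Oscar X.", None), ("Tobin O.", 6.0)],
)

# CASE yields 1 for NULL assists and NULL otherwise; COUNT skips the NULLs.
# COUNT(*) counts all rows, COUNT(Assists) counts only non-NULL assists.
case_count, diff_count = conn.execute(
    """
    SELECT
      COUNT(CASE WHEN Assists IS NULL THEN 1 END),
      COUNT(*) - COUNT(Assists)
    FROM demo
    """
).fetchone()
conn.close()

print(case_count, diff_count)
```

Both expressions should agree, since each one counts the rows where `Assists` is missing.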

### Invalid Data

**Invalid data** refers to values that don't conform to the expected format or fall outside the realm of possibility. Example from our dataset:

In [None]:
%%sql
SELECT * FROM player_stats WHERE Position = 'Striker';

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves
0,95680,Goal Getters,Lionel Z.,Striker,28,11.0,0
1,79274,Turf Titans,Neymar Z.,Striker,2,1.0,25


This query will reveal records where we've used 'Striker' instead of one of our four valid positions.

Invalid data can lead to errors in analysis or visualization. In our case, grouping by position would incorrectly separate 'Striker' from 'Forward', potentially understating the performance of forwards as a group.

More generally, we can detect invalid data by looking at the "distinct" values in each column:

In [None]:
%%sql
--get distinct values in positions
SELECT DISTINCT Position FROM player_stats;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Position
0,Defender
1,Goalkeeper
2,Striker
3,Midfielder
4,Forward
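In Pandas, a comparable check is to test each value against an explicit list of allowed categories. The sketch below assumes the four valid positions from the data generation script; the small `players` table is made up.

```python
import pandas as pd

# The four positions the generation script treats as valid.
VALID_POSITIONS = {"Forward", "Midfielder", "Defender", "Goalkeeper"}

# Hypothetical sample rows, two of which use an unexpected label.
players = pd.DataFrame({
    "Player": ["Lionel Z.", "Tobin O.", "Neymar Z.", "Grace R."],
    "Position": ["Striker", "Defender", "Striker", "Goalkeeper"],
})

# Keep the rows whose Position is NOT in the allowed set.
invalid = players[~players["Position"].isin(VALID_POSITIONS)]
print(invalid)
```

Unlike `SELECT DISTINCT`, this approach flags unexpected values without requiring you to spot them by eye.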


### Non-parametric Data

**Non-parametric data** refers to data that doesn't follow a specific probability distribution, often due to inconsistencies in data entry. In our dataset, this might manifest as inconsistent team names:

In [None]:
%%sql
SELECT DISTINCT Team FROM player_stats;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team
0,Goal Diggers
1,Net Navigators
2,Goal Getters
3,Ball Busters
4,Cleat Commanders
5,Turf Titans
6,Kick It Up
7,Soccersaurus Rex
8,Soccer-saurus Rex
9,Dribble Trouble


This query might reveal both "Soccersaurus Rex" and "Soccer-saurus Rex".

*Why it's a problem--* Non-parametric data can lead to fragmentation of what should be unified categories, making it difficult to aggregate data correctly. In our case, the team's performance might be split across two names, understating their true record.
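With ten teams, near-duplicate names are easy to spot by eye. For longer lists, a string-similarity score can suggest candidates for review. The sketch below uses Python's standard `difflib` module; the 0.9 cutoff is an assumption chosen for this example, and any pair it flags still needs a human decision.

```python
from difflib import SequenceMatcher
from itertools import combinations

teams = [
    "Goal Getters", "Soccersaurus Rex", "Soccer-saurus Rex",
    "Kick It Up", "Goal Diggers",
]

THRESHOLD = 0.9  # assumed cutoff; tune for your own data


def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity score between two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Compare every pair of names and keep the ones that look almost identical.
suspects = [
    (a, b) for a, b in combinations(teams, 2)
    if similarity(a, b) >= THRESHOLD
]
print(suspects)
```

Here only the two spellings of Soccersaurus Rex clear the cutoff, while genuinely different teams such as "Goal Getters" and "Goal Diggers" do not.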

### Data Outliers

**Outliers** are data points that differ significantly from other observations. For example, let's see if we can find any player with unrealistically high numbers of goals or assists.

In [None]:
%%sql
SELECT * FROM player_stats
WHERE Goals > 50
OR Assists > 30
OR (Position = 'Goalkeeper' AND Goals > 5)
LIMIT 10;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves
0,22676,Goal Diggers,Tobin O.,Defender,7,6.0,0
1,65619,Kick It Up,Beckham F.,Midfielder,9,3.0,0
2,90688,Soccer-saurus Rex,Alex E.,Midfielder,6,8.0,0
3,74754,Ball Busters,Paul Y.,Goalkeeper,100,50.0,200
4,44157,Net Navigators,Eden W.,Forward,8,2.0,0
5,50042,Goal Getters,Hope D.,Midfielder,7,19.0,0
6,92856,Dribble Trouble,Cristiano P.,Forward,8,3.0,0
7,82750,Goal Getters,Cristiano R.,Forward,9,2.0,0
8,56600,Ball Busters,Beckham Z.,Forward,7,0.0,0
9,86898,Dribble Trouble,Paul G.,Forward,9,2.0,0


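This query is meant to reveal players with unusually high stats, including goalkeepers with an improbable number of goals. Notice, however, that most of the rows returned have ordinary single-digit goal totals. At this point `Goals` is stored as `TEXT` (an issue we diagnose in the Data Type Validation section), so SQLite compares it to `50` as a string, and '7' sorts after '50'. Only Paul Y. (100 goals, 50 assists, 200 saves) is a genuine outlier here. Writing the condition as `CAST(Goals AS INTEGER) > 50`, or correcting the column type first, avoids these false positives.

*Why it's a problem--* While outliers can sometimes represent genuinely exceptional performances, they often indicate data entry errors. Including these without verification can significantly skew averages and other statistical measures.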
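The SQL query above relies on hand-picked thresholds. A common statistical alternative, shown here as a sketch rather than as this chapter's method, is the 1.5 × IQR rule: flag anything far outside the middle half of the data. The goal totals below are made up.

```python
import pandas as pd

# Hypothetical goal totals, including one suspicious value.
goals = pd.Series([7, 0, 1, 28, 29, 4, 16, 9, 100], name="Goals")

# Interquartile range: the spread of the middle 50% of values.
q1 = goals.quantile(0.25)
q3 = goals.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR beyond the quartiles are flagged for review.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = goals[(goals < lower) | (goals > upper)]
print(outliers)
```

A flagged value is a candidate for verification, not automatically an error; domain knowledge (for example, position-specific limits for goalkeepers) still matters.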

### Specification Mismatch
**Specification mismatch** occurs when data doesn't adhere to the expected format or rules.

In our dataset, this could be exemplified by negative values for goals, assists, or saves:

In [None]:
%%sql
SELECT *
FROM player_stats
WHERE
  Goals < 0
  OR Assists < 0
  OR Saves < 0;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves
0,84553,Soccersaurus Rex,Marta F.,Forward,-24,0.0,0
1,51846,Goal Diggers,Marta M.,Midfielder,-2,-6.0,0


*Why it's a problem--* Specification mismatches can lead to errors in calculations or visualizations. For instance, negative goals would not make sense in the context of soccer statistics and could cause issues when calculating team totals or player averages.

### Data Type Validation

**Data type validation** ensures that each column contains the expected type of data (e.g., integers for goals, text for player names).  We could check for this in our dataset. First, let's double check the type of each column:

In [None]:
%%sql
-- get table schema
PRAGMA table_info(player_stats);

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,ID,INTEGER,0,,0
1,1,Team,TEXT,0,,0
2,2,Player,TEXT,0,,0
3,3,Position,TEXT,0,,0
4,4,Goals,TEXT,0,,0
5,5,Assists,REAL,0,,0
6,6,Saves,INTEGER,0,,0


This shows a few unexpected things:

- The data type of Goals is `TEXT`, when it should be `INTEGER`. As we'll discover shortly, this is because of some data entry errors. This is a serious issue, as it could cause problems for attempts to sort, analyze or filter data.
- The data type of Assists is `REAL` even though we expected it to be `INTEGER`. While this isn't as serious a problem, it results from the fact that Assists (at this point) contains null values, which led Pandas to store the column as floating-point numbers before it was written to the database.

Let's see if we can locate the problematic rows in the "Goals" column:

In [None]:
%%sql
SELECT * FROM player_stats
WHERE Goals GLOB '*[A-Za-z]*';

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves
0,40448,Soccersaurus Rex,Sam Y.,Midfielder,29 goals,16.0,0
1,19903,Kick It Up,Marta P.,Forward,12 goals,13.0,0
2,13020,Soccersaurus Rex,Sam E.,Defender,24 goals,1.0,0


It looks like we have found the problem! Some rows have the word "goals" inappropriately appended to the number.

*Why it's a problem--* Incorrect data types can lead to errors in calculations or sorting. For example, if goals were stored as text, sorting by goals might not work as expected, listing "10" before "2".
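The effect of text storage on sorting and filtering can be seen on a small in-memory table. This sketch uses an invented `demo` table with a `TEXT` column, mirroring how `Goals` is currently stored; casting to `INTEGER` restores numeric behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Goals TEXT)")
# Integers inserted into a TEXT column are stored as text.
conn.executemany("INSERT INTO demo VALUES (?)", [(2,), (10,), (7,), (100,)])


def column(sql: str) -> list:
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# Sorting: text order goes character by character, numeric order by value.
text_order = column("SELECT Goals FROM demo ORDER BY Goals")
int_order = column(
    "SELECT CAST(Goals AS INTEGER) FROM demo ORDER BY CAST(Goals AS INTEGER)"
)

# Filtering: the same "> 50" condition selects different rows.
text_filter = column("SELECT Goals FROM demo WHERE Goals > 50")
int_filter = column(
    "SELECT CAST(Goals AS INTEGER) FROM demo WHERE CAST(Goals AS INTEGER) > 50"
)
conn.close()

print(text_order, int_order)
print(text_filter, int_filter)
```

With the text column, "7" passes the `> 50` filter while "100" does not, which is the opposite of what a numeric comparison gives.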

You also might notice that the SQL query introduces a new idea: `GLOB`. `GLOB` is an operator in SQLite that is similar to `LIKE`, but it is case-sensitive and uses Unix-style **wildcard (glob) patterns**, including character ranges. These are not full regular expressions, although the idea is similar.
-   `'*[A-Za-z]*'` is the overall pattern we are using to check the `Goals` column.
-   `*` is a wildcard character that matches zero or more of any character.
-   `[A-Za-z]` specifies a range of characters: any uppercase letter (`A-Z`) or any lowercase letter (`a-z`).
-   `*` (again) matches zero or more of any character after the specified range.

Putting it all together, `'*[A-Za-z]*'` matches any string that contains at least one alphabetic character, regardless of what other characters are present before or after the letter(s).

We can also use this kind of pattern matching to fix the problem, which makes the next part of our data cleaning a bit easier.
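Before moving on, the pattern can be tried against a few sample strings without touching the real table. In this sketch the sample values are made up, and SQLite evaluates the `GLOB` expression directly on each bound parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical values that might appear in a messy Goals column.
samples = ["29", "29 goals", "-6", "twelve", ""]

# GLOB returns 1 when the pattern matches and 0 when it does not.
matches = {
    value: bool(conn.execute("SELECT ? GLOB '*[A-Za-z]*'", (value,)).fetchone()[0])
    for value in samples
}
conn.close()

for value, hit in matches.items():
    print(repr(value), "->", hit)
```

Only the strings containing a letter match, so purely numeric entries (including negative ones) are left alone.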


## Data Manipulation

Data manipulation involves transforming, reorganizing, and cleaning our dataset to make it more suitable for analysis. We'll use both SQL and Pandas to demonstrate these techniques, explaining each step in detail.



### Correcting Data Types
At the end of the last section, we saw how pattern matching is a powerful tool for identifying certain kinds of errors in data.

Now, let's put them to work in correcting the data type error we noticed earlier.

In [None]:
%%sql
-- remove "goals" from column text
UPDATE player_stats
SET Goals = REPLACE(Goals, 'goals', '')
WHERE Goals GLOB '*[A-Za-z]*';

-- now, we can recreate the goals column as an integer
ALTER TABLE player_stats
RENAME COLUMN Goals TO Goals_old;

ALTER TABLE player_stats
ADD COLUMN Goals INTEGER;

UPDATE player_stats
SET Goals = CAST(Goals_old AS INT);

ALTER TABLE player_stats
DROP COLUMN Goals_old;

-- select table schema
PRAGMA table_info(player_stats);

 * sqlite:///soccer_league.db
3 rows affected.
Done.
Done.
200 rows affected.
Done.
Done.


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,ID,INTEGER,0,,0
1,1,Team,TEXT,0,,0
2,2,Player,TEXT,0,,0
3,3,Position,TEXT,0,,0
4,4,Assists,REAL,0,,0
5,5,Saves,INTEGER,0,,0
6,6,Goals,INTEGER,0,,0


Here's what we do in this short SQL script:

1.  *Clean text from the 'Goals' column.* `UPDATE player_stats SET Goals = REPLACE(Goals, 'goals', '')` removes the word 'goals' from any entries containing letters.
2.  *Prepare for data type change.* `ALTER TABLE player_stats RENAME COLUMN Goals TO Goals_old` and `ALTER TABLE player_stats ADD COLUMN Goals INTEGER` create a new integer 'Goals' column.
3.  *Convert data to integer.* `UPDATE player_stats SET Goals = CAST(Goals_old AS INT)` populates the new 'Goals' column with integer values.
4.  *Remove old column.* `ALTER TABLE player_stats DROP COLUMN Goals_old` deletes the original text-based 'Goals' column.
5.  *Verify changes.* `PRAGMA table_info(player_stats)` displays the updated table structure, confirming the 'Goals' column is now of INTEGER type.

This process effectively cleans the 'Goals' data and converts it from text to integer format, a common task in data cleaning and preparation.
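A rough Pandas counterpart to the SQL steps above is sketched below. It takes a slightly different route: instead of removing the word "goals", it extracts the leading number with a regular expression. The `raw` table and its values are made up for the example.

```python
import pandas as pd

# Hypothetical column mixing integers and strings, as in our CSV.
raw = pd.DataFrame({"Goals": [7, "29 goals", 12, "-24", "12 goals"]})

# Convert everything to text, then pull out the first run of digits
# (with an optional minus sign). Non-matching entries become missing.
digits = raw["Goals"].astype(str).str.extract(r"(-?\d+)", expand=False)

# Convert to numbers; "Int64" is Pandas' integer type that allows missing values.
raw["Goals"] = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(raw["Goals"].tolist())
print(raw["Goals"].dtype)
```

Using the nullable `Int64` type keeps the column integer-valued even if some entries cannot be parsed.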

### Recoding Data (Numeric and Categorical)

Recoding involves changing the values in a dataset, often to standardize or categorize data. Let's look at examples for both numeric and categorical data.

#### Numeric Recoding: Creating Performance Tiers

Let's create performance tiers for players based on their goal-scoring record.

In [None]:
%%sql
SELECT
    Player,
    Goals,
    CASE
        WHEN Goals >= 20 THEN 'Elite Scorer'
        WHEN Goals >= 10 THEN 'Top Scorer'
        WHEN Goals >= 5 THEN 'Regular Scorer'
        ELSE 'Occasional Scorer'
    END AS Scoring_Tier
FROM player_stats
WHERE Position != 'Goalkeeper'
LIMIT 10;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Player,Goals,Scoring_Tier
0,Tobin O.,7,Regular Scorer
1,Lionel Z.,28,Elite Scorer
2,Paul O.,29,Elite Scorer
3,Tobin Y.,4,Occasional Scorer
4,Oscar C.,16,Top Scorer
5,Beckham F.,9,Regular Scorer
6,Lionel B.,29,Elite Scorer
7,Marta I.,14,Top Scorer
8,Alex E.,6,Regular Scorer
9,Hope L.,0,Occasional Scorer


This SQL query uses a `CASE` statement to create a new column called `Scoring_Tier`. It then assigns tiers based on the number of goals:
-   20 or more goals: 'Elite Scorer'
-   10-19 goals: 'Top Scorer'
-   5-9 goals: 'Regular Scorer'
-   Fewer than 5 goals: 'Occasional Scorer'

Goalkeepers are excluded from this categorization. This recoding allows us to group players into meaningful categories based on their goal-scoring performance.
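The same tiers can be sketched in Pandas with `pd.cut`, which assigns each value to a labeled bin. The player names and goal totals below are invented, chosen to include the boundary values 5, 10, and 20.

```python
import pandas as pd

players = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F", "G", "H"],  # placeholder names
    "Goals": [0, 4, 5, 9, 10, 19, 20, 28],
})

# Bin edges matching the CASE thresholds. With right=False each bin
# includes its lower edge and excludes its upper edge: [5, 10), [10, 20), ...
bins = [float("-inf"), 5, 10, 20, float("inf")]
labels = ["Occasional Scorer", "Regular Scorer", "Top Scorer", "Elite Scorer"]

players["Scoring_Tier"] = pd.cut(
    players["Goals"], bins=bins, labels=labels, right=False
)
print(players)
```

One difference from the SQL version: a missing `Goals` value falls into the `ELSE` branch of `CASE` ('Occasional Scorer'), whereas `pd.cut` leaves the tier missing.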

#### Categorical Recoding: Standardizing Team Names
Now, let's standardize team names to address the issue with "Soccersaurus Rex" and "Soccer-saurus Rex".

In [None]:
%%sql
UPDATE player_stats
SET Team = CASE
    WHEN Team = 'Soccer-saurus Rex' THEN 'Soccersaurus Rex'
    ELSE Team
END;

 * sqlite:///soccer_league.db
200 rows affected.


This SQL query:

1.  Uses an UPDATE statement to modify the player_stats table.
2.  Uses a CASE statement to check each Team value.
3.  If the Team is 'Soccer-saurus Rex', it changes it to 'Soccersaurus Rex'.
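4.  All other team names remain unchanged. (Because the statement has no `WHERE` clause, every row is rewritten with either the new or the existing value, which is why the output reports 200 rows affected.)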

After running this query, we can verify the change:

In [None]:
%%sql
SELECT DISTINCT Team FROM player_stats;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team
0,Goal Diggers
1,Net Navigators
2,Goal Getters
3,Ball Busters
4,Cleat Commanders
5,Turf Titans
6,Kick It Up
7,Soccersaurus Rex
8,Dribble Trouble


We can do the same thing to fix the issue with `Striker` and `Forward`.

In [None]:
%%sql
UPDATE player_stats
SET Position = CASE
    WHEN Position = 'Striker' THEN 'Forward'
    ELSE Position
END;

 * sqlite:///soccer_league.db
200 rows affected.
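Both updates above rewrite every row because the `CASE ... ELSE` form has no filter. A variant with a `WHERE` clause touches only the rows that need changing. The sketch below shows this on an invented four-row `demo` table; it is an alternative formulation, not the query used in this chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Team TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [("Soccersaurus Rex",), ("Soccer-saurus Rex",),
     ("Kick It Up",), ("Soccer-saurus Rex",)],
)

# Only rows with the misspelled name are updated.
cursor = conn.execute(
    "UPDATE demo SET Team = 'Soccersaurus Rex' WHERE Team = 'Soccer-saurus Rex'"
)
rows_changed = cursor.rowcount

teams_after = sorted(row[0] for row in conn.execute("SELECT DISTINCT Team FROM demo"))
conn.close()

print(rows_changed, teams_after)
```

The reported row count now reflects how many values were actually corrected, which makes the cleaning step easier to audit.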


### Derived Variables
Derived variables are new data points created from existing data. Let's create an 'OffensiveContribution' metric for each player.

In [None]:
%%sql
ALTER TABLE player_stats
ADD COLUMN OffensiveContribution INTEGER;

UPDATE player_stats
SET OffensiveContribution = Goals + Assists;

SELECT Player, Position, Goals, Assists, OffensiveContribution
FROM player_stats
LIMIT 5;

 * sqlite:///soccer_league.db
Done.
200 rows affected.
Done.


Unnamed: 0,Player,Position,Goals,Assists,OffensiveContribution
0,Tobin O.,Defender,7,6.0,13
1,Kristine K.,Goalkeeper,0,3.0,3
2,Beckham X.,Goalkeeper,1,1.0,2
3,Lionel Z.,Forward,28,11.0,39
4,Paul O.,Midfielder,29,14.0,43


This series of SQL commands:

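1.  Adds a new column called 'OffensiveContribution' to our table.
2.  Updates this column with the sum of Goals and Assists for each player. For rows where Assists is `NULL`, the result is also `NULL`, since adding a number to `NULL` yields `NULL`.
3.  Selects and displays the first five rows with the new metric. There is no `ORDER BY`, so these are not the top-ranked players.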

Creating derived variables like this can provide new insights into player performance that aren't immediately obvious from the raw data.
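Missing values deserve attention when building derived variables. The Pandas sketch below uses made-up rows to show that a missing `Assists` value propagates into the sum, and one possible workaround. Treating a missing assist count as zero is an assumption made for the example; whether it is appropriate depends on why the value is missing.

```python
import pandas as pd

# Hypothetical rows; the second player has no recorded assists.
df = pd.DataFrame({
    "Player": ["Tobin O.", "Oscar X.", "Beckham X."],
    "Goals": [7, 10, 1],
    "Assists": [6.0, None, 1.0],
})

# A missing input makes the derived value missing too.
df["OffensiveContribution"] = df["Goals"] + df["Assists"]

# Alternative (an assumption, not a rule): count missing assists as zero.
df["OffensiveContribution_filled"] = df["Goals"] + df["Assists"].fillna(0)

print(df)
```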

### Data Merge

Data merging involves combining data from multiple tables. Let's imagine we have another table with coach information for each team. We'll create this table and then merge it with our player data.

First, let's create and populate a coaches table:

In [None]:
%%sql
DROP TABLE IF EXISTS team_coaches;
CREATE TABLE team_coaches (
    Team TEXT PRIMARY KEY,
    Coach TEXT
);

INSERT INTO team_coaches (Team, Coach) VALUES
('Goal Getters', 'Alex Johnson'),
('Soccersaurus Rex', 'Samantha Lee'),
('Kick It Up', 'Mike Chen'),
('Net Navigators', 'Emily Wong'),
('Dribble Trouble', 'Chris Taylor'),
('Soccer-saurus Rex', 'Michael Lee');

-- Add more coaches for the remaining teams...

 * sqlite:///soccer_league.db
Done.
Done.
6 rows affected.
Done.


Now, let's merge this data with our player stats:

In [None]:
%%sql
SELECT
    ps.Player,
    ps.Team,
    ps.Position,
    ps.Goals,
    ps.Assists,
    tc.Coach
FROM
    player_stats ps
LEFT JOIN
    team_coaches tc ON ps.Team = tc.Team
LIMIT 10;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Player,Team,Position,Goals,Assists,Coach
0,Tobin O.,Goal Diggers,Defender,7,6.0,
1,Kristine K.,Goal Diggers,Goalkeeper,0,3.0,
2,Beckham X.,Net Navigators,Goalkeeper,1,1.0,Emily Wong
3,Lionel Z.,Goal Getters,Forward,28,11.0,Alex Johnson
4,Paul O.,Ball Busters,Midfielder,29,14.0,
5,Tobin Y.,Goal Diggers,Defender,4,18.0,
6,Grace R.,Cleat Commanders,Goalkeeper,0,3.0,
7,Beckham T.,Cleat Commanders,Goalkeeper,2,3.0,
8,Oscar C.,Turf Titans,Midfielder,16,3.0,
9,Beckham F.,Kick It Up,Midfielder,9,3.0,Mike Chen


This SQL query:

1.  Uses a LEFT JOIN to combine data from player_stats (ps) and team_coaches (tc).
2.  Matches records based on the Team column in both tables.
3.  Selects relevant columns from both tables.

The LEFT JOIN ensures that all players are included in the result, even if their team doesn't have a coach listed in the team_coaches table.
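
Because unmatched rows come back with NULL in the right-hand columns, the same LEFT JOIN can be used to audit which teams still lack a coach. Here is a small sketch of that idea using Python's built-in `sqlite3` module. The tables `demo_players` and `demo_coaches` are hypothetical stand-ins for `player_stats` and `team_coaches`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE demo_players (Player TEXT, Team TEXT);
    CREATE TABLE demo_coaches (Team TEXT PRIMARY KEY, Coach TEXT);
    INSERT INTO demo_players VALUES
        ('P1', 'Goal Getters'), ('P2', 'Goal Diggers'), ('P3', 'Goal Diggers');
    INSERT INTO demo_coaches VALUES ('Goal Getters', 'Alex Johnson');
    """
)

unmatched = conn.execute(
    """
    SELECT p.Team, COUNT(*) AS players_without_coach
    FROM demo_players p
    LEFT JOIN demo_coaches c ON p.Team = c.Team
    WHERE c.Team IS NULL          -- keep only rows the join could not match
    GROUP BY p.Team
    """
).fetchall()
print(unmatched)
conn.close()
```

Running a check like this before relying on the merged result helps separate genuinely missing coaches from mismatches caused by inconsistent team names.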

### Data Blending
Data blending involves combining data from different sources or formats. For example, suppose we have a JSON file that contains more detailed information on several teams, which we have saved to our database:

In [None]:
%%sql
DROP TABLE IF EXISTS Teams;
CREATE TABLE Teams(
    Team TEXT PRIMARY KEY,
    Data JSON
);

-- Insert JSON data into the Teams table
INSERT INTO Teams (Team, Data) VALUES
('Goal Getters', '{"Founded": 1998, "City": "Springfield"}'),
('Soccersaurus Rex', '{"Founded": 2005, "City": "Shelbyville"}'),
('Kick It Up', '{"Founded": 2010, "City": "Capital City"}'),
('Net Navigators', '{"Founded": 2003, "City": "Ogdenville"}'),
('Dribble Trouble', '{"Founded": 2012, "City": "North Haverbrook"}'),
('Goal Diggers', '{"Founded": 2008, "City": "Brockway"}'),
('Turf Titans', '{"Founded": 2015, "City": "Cypress Creek"}'),
('Cleat Commanders', '{"Founded": 1995, "City": "Monorail"}'),
('Ball Busters', '{"Founded": 2018, "City": "Springfield Heights"}'),
('Soccer-saurus Rex', '{"Founded": 2005, "City": "Shelbyville"}');

SELECT * FROM Teams;

 * sqlite:///soccer_league.db
Done.
Done.
10 rows affected.
Done.


Unnamed: 0,Team,Data
0,Goal Getters,"{""Founded"": 1998, ""City"": ""Springfield""}"
1,Soccersaurus Rex,"{""Founded"": 2005, ""City"": ""Shelbyville""}"
2,Kick It Up,"{""Founded"": 2010, ""City"": ""Capital City""}"
3,Net Navigators,"{""Founded"": 2003, ""City"": ""Ogdenville""}"
4,Dribble Trouble,"{""Founded"": 2012, ""City"": ""North Haverbrook""}"
5,Goal Diggers,"{""Founded"": 2008, ""City"": ""Brockway""}"
6,Turf Titans,"{""Founded"": 2015, ""City"": ""Cypress Creek""}"
7,Cleat Commanders,"{""Founded"": 1995, ""City"": ""Monorail""}"
8,Ball Busters,"{""Founded"": 2018, ""City"": ""Springfield Heights""}"
9,Soccer-saurus Rex,"{""Founded"": 2005, ""City"": ""Shelbyville""}"


This SQL script accomplishes the following:

1.  Drops the `Teams` table if it already exists to ensure a fresh start.
2.  Creates the `Teams` table with two columns: `Team` (a text column that serves as the primary key) and `Data` (a JSON column to store the team's additional information).
3.  Inserts JSON data for each team into the `Teams` table.
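
JSON stored in a text column can also be unpacked on the Python side rather than inside the database. The sketch below assumes a hypothetical table `demo_teams` holding two of the rows shown above, and uses only the standard-library `json` and `sqlite3` modules.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_teams (Team TEXT PRIMARY KEY, Data TEXT)")
conn.executemany(
    "INSERT INTO demo_teams VALUES (?, ?)",
    [
        ("Goal Getters", '{"Founded": 1998, "City": "Springfield"}'),
        ("Kick It Up", '{"Founded": 2010, "City": "Capital City"}'),
    ],
)

# Parse each JSON string into a Python dict keyed by team name.
team_info = {}
for team, raw in conn.execute("SELECT Team, Data FROM demo_teams"):
    team_info[team] = json.loads(raw)

print(team_info["Kick It Up"]["City"])
conn.close()
```

Parsing in Python is convenient for small tables. For larger ones, extracting the fields inside the database avoids transferring every JSON string.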

Now, let's merge the data!

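Just as we did above, we'll use a `LEFT JOIN` to merge the `player_stats` table with the `Teams` table on the `Team` column. A `LEFT JOIN` ensures all records from the `player_stats` table are included, even if there is no matching record in the `Teams` table. The subquery uses `json_extract` to pull the `Founded` and `City` fields out of the JSON column so they can be selected like ordinary columns.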

In [None]:
%%sql
SELECT
    ps.Player,
    ps.Team,
    t.Founded,
    t.City
FROM
    player_stats ps
LEFT JOIN
    (SELECT
         Team,
         json_extract(Data, '$.Founded') AS Founded,
         json_extract(Data, '$.City') AS City
     FROM Teams) t
ON ps.Team = t.Team
LIMIT 10;


 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Player,Team,Founded,City
0,Tobin O.,Goal Diggers,2008,Brockway
1,Kristine K.,Goal Diggers,2008,Brockway
2,Beckham X.,Net Navigators,2003,Ogdenville
3,Lionel Z.,Goal Getters,1998,Springfield
4,Paul O.,Ball Busters,2018,Springfield Heights
5,Tobin Y.,Goal Diggers,2008,Brockway
6,Grace R.,Cleat Commanders,1995,Monorail
7,Beckham T.,Cleat Commanders,1995,Monorail
8,Oscar C.,Turf Titans,2015,Cypress Creek
9,Beckham F.,Kick It Up,2010,Capital City


### Data Concatenation

Data concatenation involves combining datasets vertically, adding rows rather than columns. Let's imagine we receive mid-season transfer data and need to add these new players to our dataset.

First, let's create a table for new transfers:

In [None]:
%%sql
DROP TABLE IF EXISTS mid_season_transfers;
CREATE TABLE mid_season_transfers (
    Team TEXT,
    Player TEXT,
    Position TEXT,
    Goals INTEGER,
    Assists INTEGER,
    Saves INTEGER
);

INSERT INTO mid_season_transfers (Team, Player, Position, Goals, Assists, Saves) VALUES
('Turf Titans', 'Oliver', 'Forward', 5, 3, 0),
('Ball Busters', 'Penny', 'Midfielder', 2, 4, 0),
('Cleat Commanders', 'Quincy', 'Defender', 1, 1, 0);

 * sqlite:///soccer_league.db
Done.
Done.
3 rows affected.


Now, let's concatenate this data with our existing player_stats:

In [None]:
%%sql
INSERT INTO player_stats (Team, Player, Position, Goals, Assists, Saves)
SELECT Team, Player, Position, Goals, Assists, Saves
FROM mid_season_transfers;

SELECT * FROM player_stats
WHERE Player IN ('Oliver', 'Penny', 'Quincy');

 * sqlite:///soccer_league.db
3 rows affected.
Done.


Unnamed: 0,ID,Team,Player,Position,Assists,Saves,Goals,OffensiveContribution
0,,Turf Titans,Oliver,Forward,3.0,0,5,
1,,Ball Busters,Penny,Midfielder,4.0,0,2,
2,,Cleat Commanders,Quincy,Defender,1.0,0,1,


This SQL operation:

1.  Uses INSERT INTO ... SELECT to add rows from mid_season_transfers to player_stats.
2.  The second SELECT statement verifies that the new players have been added to player_stats.
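
Listing the columns explicitly on both sides matters here, because `INSERT INTO ... SELECT` matches values by position, not by name, and `player_stats` and `mid_season_transfers` store Goals and Assists in different orders. The following sketch uses Python's built-in `sqlite3` module with two hypothetical tables, `demo_stats` and `demo_transfers`, to show what goes wrong with `SELECT *`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Same columns, different order (Goals and Assists are swapped).
    CREATE TABLE demo_stats     (Player TEXT, Assists INTEGER, Goals INTEGER);
    CREATE TABLE demo_transfers (Player TEXT, Goals INTEGER, Assists INTEGER);
    INSERT INTO demo_transfers VALUES ('Oliver', 5, 3);

    -- Positional copy: Goals lands in Assists and vice versa.
    INSERT INTO demo_stats SELECT * FROM demo_transfers;

    -- Named copy: each value goes to the intended column.
    INSERT INTO demo_stats (Player, Goals, Assists)
    SELECT Player, Goals, Assists FROM demo_transfers;
    """
)

rows = conn.execute(
    "SELECT Player, Goals, Assists FROM demo_stats ORDER BY rowid"
).fetchall()
print(rows)
conn.close()
```

The first inserted row has its goals and assists swapped without any error being raised, which is why the named form is the safer habit.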

### Data Append

Data append is similar to concatenation but typically involves adding new columns to existing rows. Let's add a 'JoinedMidSeason' flag to our player_stats table.

In [None]:
%%sql
ALTER TABLE player_stats ADD COLUMN JoinedMidSeason BOOLEAN DEFAULT 0;

UPDATE player_stats
SET JoinedMidSeason = 1
WHERE Player IN (SELECT Player FROM mid_season_transfers);

SELECT Player, Team, Position, Goals, Assists, JoinedMidSeason
FROM player_stats
WHERE JoinedMidSeason = 1 OR Player LIKE 'A%'
ORDER BY JoinedMidSeason DESC, Goals DESC;

 * sqlite:///soccer_league.db
Done.
3 rows affected.
Done.


Unnamed: 0,Player,Team,Position,Goals,Assists,JoinedMidSeason
0,Oliver,Turf Titans,Forward,5,3.0,1
1,Penny,Ball Busters,Midfielder,2,4.0,1
2,Quincy,Cleat Commanders,Defender,1,1.0,1
3,Alex Y.,Goal Getters,Defender,29,19.0,0
4,Alex X.,Dribble Trouble,Defender,27,9.0,0
5,Alex E.,Kick It Up,Defender,25,9.0,0
6,Alex W.,Dribble Trouble,Defender,14,0.0,0
7,Alex N.,Soccersaurus Rex,Defender,14,17.0,0
8,Alex Q.,Net Navigators,Forward,10,17.0,0
9,Alex E.,Soccersaurus Rex,Midfielder,6,8.0,0


This series of SQL commands:

1.  Adds a new column 'JoinedMidSeason' to player_stats, defaulting to 0 (FALSE).
2.  Updates this column to 1 (TRUE) for players who were in the mid_season_transfers table.
3.  Selects and displays the mid-season transfers alongside the original players whose names start with "A" for comparison.

### Imputation

Imputation is the process of replacing missing data with substituted values. We noticed earlier that some of our players were missing assist data. We'll impute these missing values with the average number of assists for their position.

To make it easier, let's intentionally introduce some nulls. For example, perhaps every player named "Alex" with over 10 goals has null assists.

In [None]:
%%sql
UPDATE player_stats
SET Assists = NULL
WHERE Player LIKE 'Alex%'
AND Goals > 10;

SELECT * FROM player_stats
WHERE Player LIKE 'Alex%'
LIMIT 10;

 * sqlite:///soccer_league.db
5 rows affected.
Done.


Unnamed: 0,ID,Team,Player,Position,Assists,Saves,Goals,OffensiveContribution,JoinedMidSeason
0,14494,Turf Titans,Alex L.,Goalkeeper,2.0,80,2,4,0
1,90688,Soccersaurus Rex,Alex E.,Midfielder,8.0,0,6,14,0
2,39757,Goal Getters,Alex Y.,Defender,,0,29,48,0
3,24199,Soccersaurus Rex,Alex O.,Goalkeeper,4.0,60,0,4,0
4,64028,Kick It Up,Alex E.,Defender,,0,25,34,0
5,81719,Dribble Trouble,Alex X.,Defender,,0,27,36,0
6,96206,Kick It Up,Alex H.,Forward,19.0,0,2,21,0
7,97263,Dribble Trouble,Alex W.,Defender,,0,14,14,0
8,11435,Net Navigators,Alex Q.,Forward,17.0,0,10,27,0
9,23507,Soccersaurus Rex,Alex N.,Defender,,0,14,31,0


Now, let's fix this:

In [None]:
%%sql
-- Now, let's perform imputation
UPDATE player_stats
SET Assists = (
    SELECT AVG(Assists)
    FROM player_stats AS sub
    WHERE sub.Position = player_stats.Position
      AND sub.Assists IS NOT NULL
)
WHERE Assists IS NULL;

SELECT * FROM player_stats
WHERE Player LIKE 'Alex%'
LIMIT 10;

 * sqlite:///soccer_league.db
9 rows affected.
Done.


Unnamed: 0,ID,Team,Player,Position,Assists,Saves,Goals,OffensiveContribution,JoinedMidSeason
0,14494,Turf Titans,Alex L.,Goalkeeper,2.0,80,2,4,0
1,90688,Soccersaurus Rex,Alex E.,Midfielder,8.0,0,6,14,0
2,39757,Goal Getters,Alex Y.,Defender,8.769231,0,29,48,0
3,24199,Soccersaurus Rex,Alex O.,Goalkeeper,4.0,60,0,4,0
4,64028,Kick It Up,Alex E.,Defender,8.769231,0,25,34,0
5,81719,Dribble Trouble,Alex X.,Defender,8.769231,0,27,36,0
6,96206,Kick It Up,Alex H.,Forward,19.0,0,2,21,0
7,97263,Dribble Trouble,Alex W.,Defender,8.769231,0,14,14,0
8,11435,Net Navigators,Alex Q.,Forward,17.0,0,10,27,0
9,23507,Soccersaurus Rex,Alex N.,Defender,8.769231,0,14,31,0


This SQL operation:

1.  Introduces NULL values for Assists for specific players.
2.  Uses a subquery to calculate the average Assists for each Position, excluding NULL values.
3.  Updates the NULL Assists with these position-specific averages.
4.  Verifies the imputation by selecting the updated rows and a comparison row.
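
The same correlated-subquery pattern can be tried on a table small enough to check by hand. This is a minimal sketch using Python's built-in `sqlite3` module. The table `demo_stats` and its five rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_stats (Player TEXT, Position TEXT, Assists REAL)")
conn.executemany(
    "INSERT INTO demo_stats VALUES (?, ?, ?)",
    [
        ("D1", "Defender", 4.0),
        ("D2", "Defender", 8.0),
        ("D3", "Defender", None),
        ("F1", "Forward", 10.0),
        ("F2", "Forward", None),
    ],
)

conn.execute(
    """
    UPDATE demo_stats
    SET Assists = (
        SELECT AVG(sub.Assists)           -- AVG already skips NULLs
        FROM demo_stats AS sub
        WHERE sub.Position = demo_stats.Position
    )
    WHERE Assists IS NULL
    """
)

imputed = dict(conn.execute("SELECT Player, Assists FROM demo_stats"))
print(imputed["D3"], imputed["F2"])
conn.close()
```

Since `AVG` ignores NULL values on its own, the explicit `IS NOT NULL` filter in the query above is not strictly required, though keeping it makes the intent clearer.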

#### Table: Impute This
Depending on the type of data one has, one might use a number of different methods of "imputing" data:

| Imputation Method | Description | When to Use |
| --- | --- | --- |
| Mean Imputation | Replace missing values with the mean of the column. | For numerical data when the distribution is roughly symmetric and without significant outliers. |
| Median Imputation | Replace missing values with the median of the column. | For numerical data when the distribution is skewed or has outliers. |
| Mode Imputation | Replace missing values with the most frequent value in the column. | For categorical data or discrete numerical data. |
| Last Observation Carried Forward (LOCF) | Replace missing values with the last observed value. | For time series data where values are expected to remain stable over time. |
| Next Observation Carried Backward (NOCB) | Replace missing values with the next observed value. | For time series data, often used in combination with LOCF. |
| Regression Imputation | Use other variables to predict and impute the missing values. | When there's a strong correlation between the variable with missing data and other variables in the dataset. |
| Multiple Imputation | Create multiple plausible imputed datasets and combine the results. | When data are missing at random (MAR) and the analysis should reflect the uncertainty in the imputed values, or when preserving the relationships between variables is crucial. |
| K-Nearest Neighbors (KNN) Imputation | Impute values based on the K most similar data points. | When there's a meaningful way to measure similarity between data points and the missing data is missing at random (MAR). |
| Random Sample Imputation | Randomly select a value from the observed data to fill in missing values. | When you want to maintain the distribution of the data and the missing data is missing completely at random (MCAR). |
| Constant Value Imputation | Replace missing values with a constant, predefined value. | When missing data has a specific meaning (e.g., using -1 for 'No Response' in a survey). |

When using these methods, it's important to consider:

1.  The nature of your data (numerical, categorical, time series)
2.  The pattern of missingness (MCAR, MAR, MNAR)
3.  The potential impact on subsequent analyses
4.  The proportion of missing data

In our soccer example, we used a variant of mean imputation by calculating the average assists per position. This method preserves the overall structure of the data while accounting for position-specific differences in assist rates.
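
To see why the table recommends the median for skewed data, here is a small sketch using only Python's standard `statistics` module. The assist counts are invented, with `None` marking a missing value and 100 acting as an outlier.

```python
from statistics import mean, median

# Invented assist counts; None marks a missing value, 100 is an outlier.
assists = [1, 2, 3, 100, None]
observed = [a for a in assists if a is not None]

mean_fill = mean(observed)      # pulled upward by the outlier
median_fill = median(observed)  # robust to the outlier

mean_imputed = [mean_fill if a is None else a for a in assists]
median_imputed = [median_fill if a is None else a for a in assists]

print(mean_fill, median_fill)
```

Here the mean fill is 26.5 while the median fill is 2.5, so a single extreme value changes what gets written into every missing cell.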

### Reduction and Aggregation

Reduction and aggregation involve summarizing data to provide insights at a higher level. Let's create some summary statistics for our teams.

In [None]:
%%sql
SELECT
    Team,
    SUM(CASE WHEN Position = 'Forward' THEN Goals ELSE 0 END) AS Forward_Goals,
    SUM(CASE WHEN Position = 'Midfielder' THEN Goals ELSE 0 END) AS Midfielder_Goals,
    SUM(CASE WHEN Position = 'Defender' THEN Goals ELSE 0 END) AS Defender_Goals,
    SUM(CASE WHEN Position = 'Goalkeeper' THEN Goals ELSE 0 END) AS Goalkeeper_Goals
FROM
    player_stats
GROUP BY
    Team
ORDER BY
    Team;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team,Forward_Goals,Midfielder_Goals,Defender_Goals,Goalkeeper_Goals
0,Ball Busters,86,153,81,101
1,Cleat Commanders,33,87,70,5
2,Dribble Trouble,64,16,64,3
3,Goal Diggers,180,25,97,2
4,Goal Getters,200,70,115,110
5,Kick It Up,65,49,153,3
6,Net Navigators,37,42,32,6
7,Soccersaurus Rex,99,158,318,9
8,Turf Titans,74,80,64,6


This query:

1.  Groups data by Team.
2.  Uses CASE expressions to create separate columns for goals scored by each position.
3.  Effectively pivots the Position column into separate columns for each position's goals.

###  Normalize Data

Data normalization involves adjusting values measured on different scales to a common scale. Let's focus on normalizing the 'Goals' column to a 0-1 scale, which is a common and easily understandable form of normalization.

In [None]:
%%sql
WITH stats AS (
    SELECT
        Player,
        Goals,
        MIN(Goals) OVER () AS MinGoals,
        MAX(Goals) OVER () AS MaxGoals
    FROM
        player_stats
)
SELECT
    Player,
    Goals,
    ROUND(CAST(Goals - MinGoals AS FLOAT) / NULLIF(MaxGoals - MinGoals, 0), 2) AS NormalizedGoals
FROM
    stats
LIMIT 10;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Player,Goals,NormalizedGoals
0,Tobin O.,7,0.25
1,Kristine K.,0,0.19
2,Beckham X.,1,0.2
3,Lionel Z.,28,0.42
4,Paul O.,29,0.43
5,Tobin Y.,4,0.23
6,Grace R.,0,0.19
7,Beckham T.,2,0.21
8,Oscar C.,16,0.32
9,Beckham F.,9,0.27


Here's what happens in this query:

-   The query begins with a **Common Table Expression (CTE)** named `stats`. CTEs are temporary named result sets that exist only within the scope of a single SQL statement, essentially creating a virtual table for use in the main query.
-   Inside the CTE, window functions `MIN(Goals) OVER ()` and `MAX(Goals) OVER ()` are used. Window functions perform calculations across a set of rows related to the current row, in this case, the entire table. The empty `OVER ()` clause indicates the window is the whole dataset.
-   The main query selects from this CTE, including `Player` and `Goals` columns directly.
-   For `NormalizedGoals`, it uses the min-max normalization formula: `(x - min(x)) / (max(x) - min(x))`. This is implemented as `ROUND(CAST(Goals - MinGoals AS FLOAT) / NULLIF(MaxGoals - MinGoals, 0), 2)`.
-   `CAST(...AS FLOAT)` is a type conversion function ensuring floating-point division. `NULLIF(..., 0)` is a null-handling function that returns null if its arguments are equal, preventing division by zero.
-   `ROUND(..., 2)` is an arithmetic function that rounds to two decimal places.

#### Min-Max Scaling
The normalization method shown above, often called Min-Max scaling, transforms the 'Goals' data to a scale between 0 and 1, where:

-   0 represents the player(s) with the minimum number of goals
-   1 represents the player(s) with the maximum number of goals
-   All other players fall somewhere between 0 and 1

The formula used is: (x - min(x)) / (max(x) - min(x))

Here's why this normalization is useful:

1.  It puts all players on a comparable scale, regardless of the absolute number of goals scored.
2.  It preserves the relative differences between players.
3.  It can make it easier to compare goal-scoring performance across different seasons or leagues with varying numbers of games played.

For example, if the raw data showed:

-   Player A: 30 goals
-   Player B: 15 goals
-   Player C: 0 goals

After normalization, it might look like:

-   Player A: 1.0
-   Player B: 0.5
-   Player C: 0.0

This clearly shows that Player A scored the most goals (1.0), Player B sits exactly halfway between the minimum and the maximum (0.5), and Player C scored the least (0.0).
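
The three-player example can be reproduced with a few lines of plain Python. The helper `min_max_scale` below is a hypothetical name introduced for this sketch. It returns `None` values when all inputs are equal, mirroring the role of `NULLIF(..., 0)` in the SQL version.

```python
def min_max_scale(values):
    """Scale numbers to the 0-1 range; returns None for each value if all are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Mirrors NULLIF(..., 0) in the SQL version: no range, no defined scale.
        return [None] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([30, 15, 0]))        # the three-player example above
print(min_max_scale([100, 29, 15, 0]))   # one extreme value compresses the rest
```

The second call illustrates a known weakness of this method: a single extreme value stretches the range, so every other player is squeezed toward the low end of the scale.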

## Data Cleaning in ETL using Pandas

While we've explored data cleaning using SQL in an ELT (Extract, Load, Transform) process, many data scientists prefer to use Python and Pandas in an ETL (Extract, Transform, Load) workflow. Let's revisit our high school soccer dataset and perform similar cleaning operations using Pandas.

First, let's load our data and take a look:

In [None]:
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('high_school_soccer_data.csv')

df.head()

Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves
0,22676,Goal Diggers,Tobin O.,Defender,7,6.0,0
1,20816,Goal Diggers,Kristine K.,Goalkeeper,0,3.0,23
2,42733,Net Navigators,Beckham X.,Goalkeeper,1,1.0,63
3,95680,Goal Getters,Lionel Z.,Striker,28,11.0,0
4,12050,Ball Busters,Paul O.,Midfielder,29,14.0,0


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ID        206 non-null    int64  
 1   Team      206 non-null    object 
 2   Player    206 non-null    object 
 3   Position  206 non-null    object 
 4   Goals     206 non-null    object 
 5   Assists   202 non-null    float64
 6   Saves     206 non-null    int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 11.4+ KB


Now that we have our data loaded, let's go through the same cleaning steps we did with SQL, but using Pandas:

### Correcting Data Types

In [None]:
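# Strip everything except digits and decimal points, then convert to numeric.
# Note: this pattern also removes minus signs, so "-3" would become 3.
df['Goals'] = pd.to_numeric(df['Goals'].str.replace(r'[^\d.]', '', regex=True), errors='coerce')

# Convert 'Goals' to integer type (this raises an error if any NaN values remain)
df['Goals'] = df['Goals'].astype(int)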

df.dtypes

ID            int64
Team         object
Player       object
Position     object
Goals         int64
Assists     float64
Saves         int64
dtype: object
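
The conversion above works because every entry in 'Goals' contains at least one digit. If an entry had none, `errors='coerce'` would turn it into NaN and `astype(int)` would fail. The following sketch, using an invented three-value Series, shows one way to handle that case with Pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Invented raw values: digits with stray text, plus one entry with no digits at all.
raw = pd.Series(["7", "12 goals", "n/a"])

cleaned = raw.str.replace(r"[^\d.]", "", regex=True)   # "n/a" becomes ""
numeric = pd.to_numeric(cleaned, errors="coerce")       # "" becomes NaN

# astype(int) would raise here because NaN has no integer representation;
# the nullable "Int64" dtype keeps the gap as <NA> instead.
goals = numeric.astype("Int64")
print(goals)
```

Keeping the gap visible as a missing value, rather than silently filling it with 0, leaves the decision about imputation to a later, explicit step.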

### Recoding Data

In [None]:
# Numeric recoding: Creating performance tiers
def get_scoring_tier(goals):
    if goals >= 20:
        return 'Elite Scorer'
    elif goals >= 10:
        return 'Top Scorer'
    elif goals >= 5:
        return 'Regular Scorer'
    else:
        return 'Occasional Scorer'

# Goalkeepers are excluded, so their Scoring_Tier is left as NaN
df['Scoring_Tier'] = df.loc[df['Position'] != 'Goalkeeper', 'Goals'].apply(get_scoring_tier)

# Categorical recoding: Standardizing team names
df['Team'] = df['Team'].replace('Soccer-saurus Rex', 'Soccersaurus Rex')

df['Team'].unique()

array(['Goal Diggers', 'Net Navigators', 'Goal Getters', 'Ball Busters',
       'Cleat Commanders', 'Turf Titans', 'Kick It Up',
       'Soccersaurus Rex', 'Dribble Trouble'], dtype=object)

### Derived variables

In [None]:
# Create 'OffensiveContribution' metric
df['OffensiveContribution'] = df['Goals'] + df['Assists']

print(df[['Player', 'Position', 'Goals', 'Assists', 'OffensiveContribution']].head())

        Player    Position  Goals  Assists  OffensiveContribution
0     Tobin O.    Defender      7      6.0                   13.0
1  Kristine K.  Goalkeeper      0      3.0                    3.0
2   Beckham X.  Goalkeeper      1      1.0                    2.0
3    Lionel Z.     Striker     28     11.0                   39.0
4      Paul O.  Midfielder     29     14.0                   43.0


### Data Merge

In [None]:
# Create a coach dataframe
coaches_data = {
    'Team': ['Goal Getters', 'Soccersaurus Rex', 'Kick It Up', 'Net Navigators'],
    'Coach': ['Alex Johnson', 'Samantha Lee', 'Mike Brown', 'Sarah Davis']
}
coaches_df = pd.DataFrame(coaches_data)

# Merge with player stats
df_merged = df.merge(coaches_df, on='Team', how='left')

df_merged[['Player', 'Team', 'Position', 'Goals', 'Assists', 'Coach']].head()

Unnamed: 0,Player,Team,Position,Goals,Assists,Coach
0,Tobin O.,Goal Diggers,Defender,7,6.0,
1,Kristine K.,Goal Diggers,Goalkeeper,0,3.0,
2,Beckham X.,Net Navigators,Goalkeeper,1,1.0,Sarah Davis
3,Lionel Z.,Goal Getters,Striker,28,11.0,Alex Johnson
4,Paul O.,Ball Busters,Midfielder,29,14.0,


### Data Blending

In [None]:
# Create a dataframe with additional team info
teams_data = [
    {'Team': 'Goal Getters', 'Founded': 1998, 'City': 'Springfield'},
    {'Team': 'Soccersaurus Rex', 'Founded': 2005, 'City': 'Shelbyville'},
    # ... add other teams ...
]
teams_df = pd.DataFrame(teams_data)

# Merge with player stats
df_blended = df.merge(teams_df, on='Team', how='left')

df_blended[['Player', 'Team', 'Founded', 'City']].head()

Unnamed: 0,Player,Team,Founded,City
0,Tobin O.,Goal Diggers,,
1,Kristine K.,Goal Diggers,,
2,Beckham X.,Net Navigators,,
3,Lionel Z.,Goal Getters,1998.0,Springfield
4,Paul O.,Ball Busters,,


### Data Concatenation


In [None]:
# Create a dataframe for mid-season transfers
transfers_data = [
    {'Team': 'Turf Titans', 'Player': 'Oliver', 'Position': 'Forward', 'Goals': 5, 'Assists': 3, 'Saves': 0},
    {'Team': 'Ball Busters', 'Player': 'Penny', 'Position': 'Midfielder', 'Goals': 2, 'Assists': 4, 'Saves': 0},
    {'Team': 'Cleat Commanders', 'Player': 'Quincy', 'Position': 'Defender', 'Goals': 1, 'Assists': 1, 'Saves': 0}
]
transfers_df = pd.DataFrame(transfers_data)

# Concatenate with existing data
df_concat = pd.concat([df, transfers_df], ignore_index=True)

df_concat[df_concat['Player'].isin(['Oliver', 'Penny', 'Quincy'])]

Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves,Scoring_Tier,OffensiveContribution
206,,Turf Titans,Oliver,Forward,5,3.0,0,,
207,,Ball Busters,Penny,Midfielder,2,4.0,0,,
208,,Cleat Commanders,Quincy,Defender,1,1.0,0,,


### Imputation

In [None]:
# Introduce some null values
df.loc[(df['Player'].str.startswith('Alex')) & (df['Goals'] > 10), 'Assists'] = np.nan

# Impute missing values with position-specific means
df['Assists'] = df.groupby('Position')['Assists'].transform(lambda x: x.fillna(x.mean()))

df[df['Player'].str.startswith('Alex')].head(10)

Unnamed: 0,ID,Team,Player,Position,Goals,Assists,Saves,Scoring_Tier,OffensiveContribution
14,14494,Turf Titans,Alex L.,Goalkeeper,2,2.0,80,,4.0
15,90688,Soccersaurus Rex,Alex E.,Midfielder,6,8.0,0,Regular Scorer,14.0
24,39757,Goal Getters,Alex Y.,Defender,29,9.150943,0,Elite Scorer,48.0
44,24199,Soccersaurus Rex,Alex O.,Goalkeeper,0,4.0,60,,4.0
65,64028,Kick It Up,Alex E.,Defender,25,9.150943,0,Elite Scorer,34.0
109,81719,Dribble Trouble,Alex X.,Defender,27,9.150943,0,Elite Scorer,36.0
119,96206,Kick It Up,Alex H.,Forward,2,19.0,0,Occasional Scorer,21.0
143,97263,Dribble Trouble,Alex W.,Defender,14,9.150943,0,Top Scorer,14.0
163,11435,Net Navigators,Alex Q.,Forward,10,17.0,0,Top Scorer,27.0
167,23507,Soccersaurus Rex,Alex N.,Defender,14,9.150943,0,Top Scorer,31.0


### Reduction and Aggregation


In [None]:
# Create summary statistics for teams
team_summary = df.pivot_table(
    values='Goals',
    index='Team',
    columns='Position',
    aggfunc='sum',
    fill_value=0
).reset_index()

team_summary

Position,Team,Defender,Forward,Goalkeeper,Midfielder,Striker
0,Ball Busters,81,86,101,151,0
1,Cleat Commanders,81,33,5,87,0
2,Dribble Trouble,64,64,5,16,0
3,Goal Diggers,117,180,2,29,0
4,Goal Getters,115,172,110,70,28
5,Kick It Up,153,65,3,49,0
6,Net Navigators,32,51,6,42,0
7,Soccersaurus Rex,318,175,9,170,0
8,Turf Titans,64,67,6,80,2


### Normalize Data

In [None]:
# Normalize 'Goals' column
df['NormalizedGoals'] = (df['Goals'] - df['Goals'].min()) / (df['Goals'].max() - df['Goals'].min())

df[['Player', 'Goals', 'NormalizedGoals']].sort_values('NormalizedGoals', ascending=False).head(10)

Unnamed: 0,Player,Goals,NormalizedGoals
205,Lionel Y.,100,1.0
100,Isco C.,100,1.0
153,Marta H.,100,1.0
185,Javier M.,100,1.0
36,Paul Y.,100,1.0
77,Diego L.,100,1.0
193,Lionel L.,29,0.29
4,Paul O.,29,0.29
24,Alex Y.,29,0.29
166,Quincy M.,29,0.29


### Comparing ETL (Pandas) and ELT (SQL) Approaches

1.  Pandas offers more flexibility in data manipulation, especially for complex operations or when working with multiple data sources.
2.  SQL can be faster for large datasets, especially when working with data already in a database.
3.  Pandas might be more intuitive for those familiar with Python, while SQL is often preferred by those with a database background.
4.  Pandas integrates well with other Python libraries for data analysis and machine learning, while SQL is native to database environments.
5.  SQL is generally better for handling very large datasets that don't fit in memory, while Pandas is more suitable for smaller to medium-sized datasets.
6.  Python scripts using Pandas are easier to version control and share, enhancing reproducibility.
7.  SQL is often better for real-time data processing and querying.
8.  Learning curve: Pandas might have a steeper learning curve for those new to Python, while SQL syntax is often considered more straightforward.

By understanding both approaches, data scientists can choose the most appropriate tool for their specific data cleaning and transformation needs.
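
The two approaches can also be combined in a single workflow. As a rough sketch, assuming a hypothetical in-memory table `demo_stats`, the database can handle the aggregation while Pandas picks up the result through `pd.read_sql_query`:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_stats (Team TEXT, Goals INTEGER)")
conn.executemany(
    "INSERT INTO demo_stats VALUES (?, ?)",
    [("Goal Getters", 28), ("Goal Getters", 2), ("Kick It Up", 9)],
)

# Let the database do the aggregation...
team_goals = pd.read_sql_query(
    "SELECT Team, SUM(Goals) AS TotalGoals FROM demo_stats GROUP BY Team ORDER BY Team",
    conn,
)
conn.close()

# ...then continue in Pandas.
team_goals["Share"] = team_goals["TotalGoals"] / team_goals["TotalGoals"].sum()
print(team_goals)
```

This division of labor keeps large scans inside the database and moves only the much smaller summary into memory.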

## Table: Data Cleaning in SQL and Pandas
Here is a table of some common operations in data cleaning, with "template" code for both SQL and Pandas. Note that this template code doesn't apply to every situation, and that some of it won't run in SQLite: for example, SQLite has no built-in `STDDEV` function and does not support `ALTER TABLE ... ALTER COLUMN`.

| Operation | SQL | Pandas |
| --- | --- | --- |
| Load Data | `SELECT * FROM table_name;` | `df = pd.read_csv('file.csv')` |
| View Data | `SELECT * FROM table_name LIMIT 5;` | `df.head()` |
| Data Info | `PRAGMA table_info(table_name);` | `df.info()` |
| Rename Column | `ALTER TABLE table_name RENAME COLUMN old_name TO new_name;` | `df = df.rename(columns={'old_name': 'new_name'})` |
| Drop Column | `ALTER TABLE table_name DROP COLUMN column_name;` | `df = df.drop('column_name', axis=1)` |
| Handle Missing Values | `UPDATE table_name SET column = COALESCE(column, default_value) WHERE column IS NULL;` | `df['column'] = df['column'].fillna(default_value)` |
| Remove Duplicates | `DELETE FROM table_name WHERE id NOT IN (SELECT MIN(id) FROM table_name GROUP BY column1, column2, ...);` | `df = df.drop_duplicates(subset=['column1', 'column2', ...])` |
| Filter Data | `SELECT * FROM table_name WHERE condition;` | `df = df[df['column'] > value]` |
| Sort Data | `SELECT * FROM table_name ORDER BY column ASC/DESC;` | `df = df.sort_values('column', ascending=True/False)` |
| Group By and Aggregate | `SELECT column1, AVG(column2) as avg_col2 FROM table_name GROUP BY column1;` | `df = df.groupby('column1')['column2'].mean().reset_index()` |
| Join Tables | `SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;` | `df = pd.merge(df1, df2, on='id', how='inner')` |
| Create New Column | `ALTER TABLE table_name ADD COLUMN new_column datatype; UPDATE table_name SET new_column = expression;` | `df['new_column'] = df['column1'] + df['column2']` |
| String Manipulation | `UPDATE table_name SET column = REPLACE(column, 'old', 'new');` | `df['column'] = df['column'].str.replace('old', 'new')` |
| Handling Outliers | `DELETE FROM table_name WHERE column < (SELECT AVG(column) - 3 * STDDEV(column) FROM table_name) OR column > (SELECT AVG(column) + 3 * STDDEV(column) FROM table_name);` | `Q1 = df['column'].quantile(0.25); Q3 = df['column'].quantile(0.75); IQR = Q3 - Q1; df = df[(df['column'] >= Q1 - 1.5*IQR) & (df['column'] <= Q3 + 1.5*IQR)]` |
| Data Type Conversion | `ALTER TABLE table_name ALTER COLUMN column_name TYPE new_datatype;` | `df['column'] = df['column'].astype('new_datatype')` |
| Normalize Data | `UPDATE table_name SET column = (column - (SELECT MIN(column) FROM table_name)) * 1.0 / ((SELECT MAX(column) FROM table_name) - (SELECT MIN(column) FROM table_name));` | `df['column'] = (df['column'] - df['column'].min()) / (df['column'].max() - df['column'].min())` |
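
Since SQLite lacks a built-in standard deviation function, one workaround when driving SQLite from Python is to register a custom aggregate with `sqlite3.Connection.create_aggregate`. The sketch below is one possible approach, not the only one. The class name `PopulationStdDev` is hypothetical, and it computes the population standard deviation, whereas `STDDEV` in other database engines may mean the sample version.

```python
import sqlite3
from statistics import pstdev


class PopulationStdDev:
    """Aggregate collecting non-NULL values and returning their population SD."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row; skip NULLs like built-in aggregates do.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        # Called once at the end; return NULL when there was nothing to aggregate.
        return pstdev(self.values) if self.values else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("STDDEV", 1, PopulationStdDev)

conn.execute("CREATE TABLE demo (x REAL)")
conn.executemany("INSERT INTO demo VALUES (?)", [(v,) for v in (2, 4, 4, 4, 5, 5, 7, 9)])

(sd,) = conn.execute("SELECT STDDEV(x) FROM demo").fetchone()
print(sd)
conn.close()
```

The function is registered only on this connection, so it has to be set up again each time a new connection is opened.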

## Data Cleaning Lab: Cleaning the Messy Tiny Titans Pet Store Dataset

In this lab, you'll work with a dataset from a fictional pet store chain run by the Tiny Titans. The dataset contains information about pet adoptions across different store locations, but it's quite messy and needs your data cleaning expertise!

### Step 1: Load and Inspect the Data

Use the following code to create a sample dataset:

In [1]:
import pandas as pd
import numpy as np

data = {
    'store_id': ['S001', 'S002', 'S003', 'S001', 'S002', 'S003', 'S001', 'S002', 'S003', 'S001'],
    'date': ['2023-01-15', '2023-01-15', '2023-01-15', '2023-01-16', '2023-01-16', '2023-01-16', '2023-01-17', '2023-01-17', '2023-01-17', '2023-01-18'],
    'pet_type': ['Dog', 'Cat', 'Bird', 'Dog', 'Cat', 'Bird', 'Dog', 'Cat', 'Bird', 'Dog'],
    'num_adopted': [5, 3, '2', 4, '3', 1, '6', 2, '1', 'Five'],
    'total_sales': ['$500.00', '$300', '$150.50', '$400', '$350.75', '$75.25', '$600', '$200.50', '$80', '$550.25'],
    'employee': ['Robin', 'Starfire', 'Beast Boy', 'Robin', 'Starfire', 'Beast Boy', 'robin', 'starfire', 'Beast Boy', 'Robin'],
    'customer_satisfaction': [4.5, 4.0, 'N/A', 4.8, 3.9, 4.2, 4.6, '3.7', 4.1, 4.7]
}

df = pd.DataFrame(data)
df

Unnamed: 0,store_id,date,pet_type,num_adopted,total_sales,employee,customer_satisfaction
0,S001,2023-01-15,Dog,5,$500.00,Robin,4.5
1,S002,2023-01-15,Cat,3,$300,Starfire,4.0
2,S003,2023-01-15,Bird,2,$150.50,Beast Boy,
3,S001,2023-01-16,Dog,4,$400,Robin,4.8
4,S002,2023-01-16,Cat,3,$350.75,Starfire,3.9
5,S003,2023-01-16,Bird,1,$75.25,Beast Boy,4.2
6,S001,2023-01-17,Dog,6,$600,robin,4.6
7,S002,2023-01-17,Cat,2,$200.50,starfire,3.7
8,S003,2023-01-17,Bird,1,$80,Beast Boy,4.1
9,S001,2023-01-18,Dog,Five,$550.25,Robin,4.7


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   store_id               10 non-null     object
 1   date                   10 non-null     object
 2   pet_type               10 non-null     object
 3   num_adopted            10 non-null     object
 4   total_sales            10 non-null     object
 5   employee               10 non-null     object
 6   customer_satisfaction  10 non-null     object
dtypes: object(7)
memory usage: 688.0+ bytes


After running this code, answer the following questions:

-   a) How many rows and columns does the dataset have?
-   b) What data types are assigned to each column?
-   c) Are there any columns with missing values?

Hint: Use df.shape, df.dtypes, and df.isnull().sum() to help answer these questions.


### Step 2: Handle Missing Values

Identify any missing or 'N/A' values in the dataset. Replace 'N/A' with NaN and then use an appropriate method to fill these missing values. Explain your chosen method and why you think it's suitable for this data.

Hint: Look into df.replace() and df.fillna() functions.


### Step 3: Correct Data Types

The 'num_adopted' column contains a mix of integers and strings. Convert this column to integer type. Handle any errors that occur during this conversion and explain how you dealt with them.

Hint: pd.to_numeric() with an error parameter might be useful here.


### Step 4: Standardize Text Data

The 'employee' column has inconsistent capitalization. Standardize this column so that all names are in title case (e.g., "Robin"). Write the code to do this and show the unique values in this column before and after your changes.

Hint: The string method .title() could be helpful.

### Step 5: Clean and Convert Numeric Data

The 'total_sales' column is stored as strings with dollar signs and varying decimal places. Convert this column to a float data type. Write the code to do this and display the first few rows of this column before and after your changes.

Hint: Consider using str.replace() and astype() methods.


### Step 6: Handle Outliers

Use a box plot to visualize the 'num_adopted' column and identify any outliers. Describe what you see and suggest a method to handle these outliers if they exist.

Hint: Look into df.boxplot() for visualization and consider using quantiles for outlier detection.
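
As a numeric companion to the box plot, the sketch below applies the common 1.5 × IQR rule to a hypothetical column. The values are invented, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

# Hypothetical toy data with one suspiciously large value
toy = pd.DataFrame({"num_adopted": [2, 3, 3, 4, 4, 5, 5, 6, 40]})

q1 = toy["num_adopted"].quantile(0.25)
q3 = toy["num_adopted"].quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = toy[(toy["num_adopted"] < lower) | (toy["num_adopted"] > upper)]
print(outliers)

# toy.boxplot(column="num_adopted")  # requires matplotlib; draws the plot itself
```

Flagged points still need a human decision: a value may be a typo to correct, or a genuine busy day to keep.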


### Step 7: Create a Derived Variable

Create a new column called 'average_sale' by dividing 'total_sales' by 'num_adopted'. Add this column to your dataframe and display the first few rows of the updated dataframe.

Hint: You can perform arithmetic operations directly on DataFrame columns.
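
A minimal sketch with invented numbers. Masking zero counts before dividing is an assumption added here so that a row with no adoptions yields a missing value rather than infinity.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "total_sales": [100.0, 250.0, 90.0],
    "num_adopted": [4, 5, 0],
})

# Keep only positive counts; zeros become NaN so the division stays meaningful
safe_counts = toy["num_adopted"].where(toy["num_adopted"] > 0)

# Column arithmetic is applied row by row
toy["average_sale"] = toy["total_sales"] / safe_counts

print(toy.head())
```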


### Step 8: Data Validation

Check if all values in the 'customer_satisfaction' column are between 1 and 5. Because this column was read in as text (object dtype), convert it to a numeric type first. If any values fall outside this range, replace them with NaN. Write the code to do this and report how many values were replaced.

Hint: Boolean indexing and the .loc accessor can be useful here.
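
One way to sketch the range check, using invented scores. Casting to float before assigning NaN is a choice made here because an integer column cannot hold NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data stored as text, with two out-of-range scores
toy = pd.DataFrame({"customer_satisfaction": ["4", "5", "7", "0", "3"]})

# Convert to float so the column can hold NaN
scores = pd.to_numeric(toy["customer_satisfaction"], errors="coerce").astype(float)
toy["customer_satisfaction"] = scores

# True where a value is present but outside the inclusive range 1..5
out_of_range = scores.notna() & ~scores.between(1, 5)
n_replaced = int(out_of_range.sum())

# Boolean indexing with .loc replaces only the flagged rows
toy.loc[out_of_range, "customer_satisfaction"] = np.nan

print(n_replaced)
print(toy)
```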


### Step 9: Duplicate Detection

Check for any duplicate rows in the dataset. If duplicates exist, remove them and keep only the first occurrence. Write the code to do this and report how many duplicates were removed.

Hint: Look into df.duplicated() and df.drop_duplicates() methods.
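
The two methods in the hint can be sketched on a hypothetical frame like this. The rows are invented, and resetting the index afterward is optional.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first
toy = pd.DataFrame({
    "employee": ["Robin", "Casey", "Robin"],
    "num_adopted": [3, 5, 3],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates)
print(deduped)
```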


### Step 10: Export Clean Data

Now that you've cleaned the data, export it to a new CSV file named 'clean_tiny_titans_pet_store_data.csv'. Write the code to do this.

Hint: df.to_csv() is the function you need.
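
A small sketch of the export step on invented data. The file name `toy_clean.csv` and the temporary directory are hypothetical choices for illustration; `index=False` leaves the row index out of the file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "employee": ["Robin", "Casey"],
    "total_sales": [100.5, 250.0],
})

# Write to a temporary directory so the sketch leaves the workspace untouched
out_path = Path(tempfile.mkdtemp()) / "toy_clean.csv"
toy.to_csv(out_path, index=False)

# Reading the file back is a quick check that the export worked
round_trip = pd.read_csv(out_path)
print(round_trip)
```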


## Key Points Summary:

-   Data cleaning is a crucial, foundational step in the data science workflow, ensuring data quality and reliability for all subsequent analyses.
-   Common data issues include duplicates, missing values, inconsistent formats, outliers, data type mismatches, and standardization problems.
-   Both SQL (for ELT) and Python's Pandas (for ETL) offer powerful tools for data cleaning, each with its own strengths and ideal use cases.
-   Key cleaning techniques include correcting data types, recoding data, handling missing values, removing duplicates, treating outliers, and normalizing data.
-   The choice between SQL, Pandas, or a combination depends on factors like data size, integration needs, processing requirements, and personal or team expertise.
-   Data cleaning is an iterative process that requires both attention to detail and a broader understanding of the data's context and intended use.
-   Properly cleaned data forms the foundation for accurate analysis, reliable insights, and trustworthy machine learning models in any field of data science.
-   Documenting the data cleaning process and creating reproducible pipelines are essential for transparency and future data maintenance.
-   While powerful tools exist, effective data cleaning often requires a combination of automated processes and human judgment.
-   The effect of data cleaning on downstream analyses and model performance should be evaluated continuously to confirm that the cleaning is actually helping.

### Review With Quizlet

In [None]:
%%html
<iframe src="https://quizlet.com/927302936/learn/embed?i=psvlh&x=1jj1" height="600" width="100%" style="border:0"></iframe>

## Glossary

| Term | Definition |
| --- | --- |
| CASE | In the SQL query `SELECT ____ WHEN condition THEN result END FROM table`, this keyword introduces a conditional expression. It allows for different results based on specified conditions within a single query. |
| Data cleaning | The process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. Essential for ensuring data quality and reliability in analysis. |
| (Data) Blend | The process of combining data from different sources or formats, often involving more complex integration than simple merging or concatenation. Allows for comprehensive analysis across diverse data sources. |
| (Data) Concatenate | The process of combining datasets vertically, adding rows rather than columns. Useful for aggregating data from multiple periods or sources into a single dataset. |
| (Data) Merge | The process of combining data from multiple tables or datasets based on a common field. Essential for integrating information from different sources in relational databases. |
| Data type validation | The process of ensuring each column contains the expected type of data (e.g., integers for numerical fields, text for names). Critical for preventing errors in data manipulation and analysis. |
| Derived variable | A new data point created from existing data through calculation or transformation. Can provide new insights or simplify complex relationships in the data. |
| Duplicate data | Identical or very similar records that appear multiple times in a dataset. Can lead to overestimation of certain data points and skew analysis results. |
| ELSE | In the SQL query `SELECT CASE WHEN condition THEN result ____ default END FROM table`, this keyword specifies the result to return if none of the WHEN conditions are true. It's optional but often used to handle all other cases in a CASE statement. |
| END | In the SQL query `SELECT CASE WHEN condition THEN result ELSE default ____ FROM table`, this keyword marks the conclusion of a CASE statement. It ensures proper closure of the conditional logic block in the query. |
| Imputation | The process of replacing missing data with substituted values. Aims to create a complete dataset for analysis while minimizing bias introduction. |
| INNER JOIN | In the SQL query `SELECT * FROM table1 ____ table2 ON table1.id = table2.id`, this clause returns only the rows that have matching values in both tables being joined. It's useful for finding records that satisfy conditions in multiple tables. |
| Invalid data | Values that don't conform to the expected format or fall outside the realm of possibility for a given field. Can lead to errors in analysis or visualization if not addressed. |
| LEFT JOIN | In the SQL query `SELECT * FROM table1 ____ table2 ON table1.id = table2.id`, this clause returns all rows from the left table and matched rows from the right table. It's useful when you want to keep all records from one table even if there are no matches in the other. |
| LOCF | Last Observation Carried Forward, an imputation technique where missing values are filled with the last observed value in time series data. Assumes stability in values over time. |
| Mean imputation | A method of imputation where missing values are replaced with the mean of the available data for that variable. Simple but can distort the distribution and underestimate variance. |
| Median imputation | An imputation technique where missing values are replaced with the median of the available data. More robust to outliers than mean imputation, especially for skewed distributions. |
| Mode imputation | An imputation method where missing values are replaced with the most frequent value (mode) of the variable. Commonly used for categorical data or discrete numerical data. |
| Nonparametric data | Data that is not assumed to follow a specific probability distribution (such as the normal distribution). Typically analyzed with rank- or median-based methods rather than techniques that rely on distributional assumptions. |
| Normalization | The process of adjusting values measured on different scales to a common scale, typically between 0 and 1. Enables fair comparison between variables with different ranges or units. |
| Null | A special marker in databases indicating the absence of a value or unknown information. Different from zero or an empty string, and requires special handling in queries and analysis. |
| Outlier | A data point that differs significantly from other observations in a dataset. Can represent genuinely exceptional cases or indicate data entry errors, requiring careful examination. |
| Recoding data | The process of changing values in a dataset, often to standardize or categorize data. Useful for creating meaningful groups or correcting inconsistencies in categorical data. |
| Redundant data | Information that is repeated unnecessarily or can be derived from other data points. Increases storage requirements and can lead to inconsistencies if not managed properly. |
| Regular expression | A sequence of characters defining a search pattern, often used for pattern matching with strings. Powerful tool for identifying and manipulating specific data patterns. |
| Regression imputation | An imputation method that uses other variables to predict and impute missing values through regression analysis. Can preserve relationships between variables but may overstate correlations. |
| RIGHT JOIN | In the SQL query `SELECT * FROM table1 ____ table2 ON table1.id = table2.id`, this clause returns all rows from the right table and matched rows from the left table. It's similar to LEFT JOIN but with the tables' roles reversed. |
| Specification mismatch | Occurs when data doesn't adhere to the expected format or rules defined for a field. Can lead to errors in calculations or visualizations if not corrected. |
| THEN | In the SQL query `SELECT CASE WHEN condition ____ result END FROM table`, this keyword follows a WHEN clause to specify the result if the condition is true. It's an integral part of the conditional logic in CASE statements. |
| WHEN | In the SQL query `SELECT CASE ____ condition THEN result END FROM table`, this keyword specifies a condition to be evaluated in a CASE statement. It's followed by the result to be returned if the condition is true. |

