<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/DataScience_04_DataCleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

In the world of sports analytics, the difference between victory and defeat often lies in the details. But what happens when those details are muddled, incomplete, or just plain wrong? Welcome to the crucial yet often overlooked realm of **data cleaning**.

**Data cleaning**, also known as **data cleansing** or **data scrubbing**, is the process of identifying and correcting (or removing) errors, inconsistencies, and inaccuracies in datasets. It's the unsung hero of data analysis, laying the foundation for all the exciting insights and predictions that follow.

Imagine you're analyzing high school soccer data to identify the next rising star. You've got a treasure trove of information: player statistics, team performances, match results. But hidden within this goldmine are potential pitfalls:

-   Duplicate entries of star players (did they really score 60 goals, or was the same 30-goal performance recorded twice?)
-   Missing values for key stats (how do you compare forwards if half of them are missing assist data?)
-   Inconsistent team names ("Soccersaurus Rex" and "Soccer-saurus Rex" are probably the same team)
-   Outliers that seem too good to be true (a goalkeeper with 50 goals?)

These issues, if left unchecked, can lead to flawed analyses, misguided strategies, and missed opportunities. That's where data cleaning comes in.

In this chapter, we'll dive into the nitty-gritty of data cleaning using two powerful tools: **SQL** (Structured Query Language) and **Pandas** (Python Data Analysis Library). We'll work with a sample dataset of high school soccer statistics, tackling common data quality issues and learning techniques to transform messy data into an analyst's dream.

By the end of this chapter, you'll be equipped with the skills to:

1.  Identify various types of data quality issues
2.  Use SQL and Pandas to clean and manipulate data
3.  Transform raw, messy data into clean, analysis-ready datasets

So lace up your cleats and get ready to kick those data problems into touch. It's time to clean up the beautiful game's data!

## Sample Data: High School Soccer League

To illustrate common data cleaning challenges, let's create a sample dataset for a fictitious high school soccer league. This dataset will intentionally include various data quality issues that we'll address throughout the chapter.

In [30]:
import pandas as pd
import numpy as np
import sqlite3

# Generate sample data
np.random.seed(42)

teams = [
    "Goal Getters", "Soccersaurus Rex", "Soccer-saurus Rex", "Kick It Up",
    "Net Navigators", "Dribble Trouble", "Goal Diggers", "Turf Titans",
    "Cleat Commanders", "Ball Busters"
]

players = [
    "Alex", "Beckham", "Cristiano", "Diego", "Eden", "Fernando", "Grace", "Hope",
    "Isco", "Javier", "Kristine", "Lionel", "Marta", "Neymar", "Oscar", "Paul",
    "Quincy", "Raheem", "Sam", "Tobin"
]


positions = ["Forward", "Midfielder", "Defender", "Goalkeeper"]

# Create DataFrame
data = []
for _ in range(200):  # Generate 200 rows of data
    team = np.random.choice(teams)
    player = np.random.choice(players) + " " + chr(np.random.randint(65, 91)) + "."
    position = np.random.choice(positions)
    goals = np.random.randint(0, 30) if position != "Goalkeeper" else np.random.randint(0, 3)
    assists = np.random.randint(0, 20) if position != "Goalkeeper" else np.random.randint(0, 5)
    saves = np.random.randint(0, 100) if position == "Goalkeeper" else 0

    # Introduce data quality issues
    if np.random.random() < 0.1:  # 10% chance of issues
        issue = np.random.choice(["duplicate", "missing", "invalid", "outlier", "negative", "type_error"])
        if issue == "duplicate":
            data.append([team, player, position, goals, assists, saves])  # Duplicate entry
            data.append([team, player, position, goals, assists, saves])  # Duplicate entry
        elif issue == "missing":
            data.append([team, player, position, goals, np.nan, saves])  # Missing assists
        elif issue == "invalid":
            data.append([team, player, "Striker", goals, assists, saves])  # Invalid position
        elif issue == "outlier":
            data.append([team, player, position, 100, 50, 200])  # Unrealistic stats
        elif issue == "negative":
            data.append([team, player, position, goals * -1, assists * -1, saves * -1])  # Negative stats
        elif issue == "type_error":
            data.append([team, player, position, f"{goals} goals", assists, saves])  # Type error
    else:
        data.append([team, player, position, goals, assists, saves])

df = pd.DataFrame(data, columns=["Team", "Player", "Position", "Goals", "Assists", "Saves"])

# Save to CSV
df.to_csv("high_school_soccer_data.csv", index=False)

# Create SQLite database and load data
conn = sqlite3.connect("soccer_league.db")
df.to_sql("player_stats", conn, if_exists="replace", index=False)
conn.close()

print("Data generated and saved to 'high_school_soccer_data.csv' and 'soccer_league.db'")

Data generated and saved to 'high_school_soccer_data.csv' and 'soccer_league.db'


In [31]:
%reload_ext sql
%config SqlMagic.autopandas=True
%sql sqlite:///soccer_league.db

In [32]:
%%sql
SELECT * FROM player_stats LIMIT 10;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team,Player,Position,Goals,Assists,Saves
0,Goal Diggers,Tobin O.,Defender,7,6.0,0
1,Goal Diggers,Kristine K.,Goalkeeper,0,3.0,23
2,Net Navigators,Beckham X.,Goalkeeper,1,1.0,63
3,Goal Getters,Lionel Z.,Striker,28,11.0,0
4,Ball Busters,Paul O.,Midfielder,29,14.0,0
5,Goal Diggers,Tobin Y.,Defender,4,18.0,0
6,Cleat Commanders,Grace R.,Goalkeeper,0,3.0,13
7,Cleat Commanders,Beckham T.,Goalkeeper,2,3.0,70
8,Turf Titans,Oscar C.,Midfielder,16,3.0,0
9,Kick It Up,Beckham F.,Midfielder,9,3.0,0


This Python script generates our sample dataset with intentional problems and saves it as both a CSV file and an SQLite database. Let's break down the data generation process and the issues we've introduced:

1.  We've created a list of 10 teams. Note that "Soccersaurus Rex" appears twice with slightly different spellings.
2.  We have a pool of 20 player names (with a random initial for a last name) that will be randomly assigned to teams.
3.  We're using four standard soccer positions.
4.  The script creates 200 rows of data, randomly assigning players to teams and positions, and generating plausible statistics for goals, assists, and saves.
5.  *Intentional Issues*. With a 10% probability, the script introduces one of six types of data quality problems:
    -   **Duplicates**: Identical rows are added.
    -   **Missing Values**: The 'Assists' field is set to NaN (Not a Number).
    -   **Invalid Data**: A nonexistent position, "Striker", is used instead of one of the four valid positions.
    -   **Outliers**: Unrealistically high values are set for goals, assists, and saves.
    -   **Negative Values**: Goals, assists, and saves are multiplied by -1, producing impossible negative stats.
    -   **Type Errors**: The 'Goals' value is stored as a string (for example, "12 goals") instead of a number.
6. The generated data is saved as a CSV file named "high_school_soccer_data.csv" and loaded into an SQLite database named "soccer_league.db".
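
The last two issue types in the list above (negative values and type errors) can be spotted in Pandas with a numeric coercion. The following is a minimal sketch, not the chapter's own method: the three-row `sample` frame and its player rows are hypothetical stand-ins for `player_stats`, and only the `Goals` column is checked.

```python
import pandas as pd

# Hypothetical mini-sample: one clean value, one type error, one negative value
sample = pd.DataFrame(
    {
        "Player": ["Alex A.", "Diego L.", "Marta I."],
        "Goals": [12, "8 goals", -5],
    }
)

# errors="coerce" turns anything that is not a plain number into NaN
goals_num = pd.to_numeric(sample["Goals"], errors="coerce")

# Rows whose Goals value could not be parsed as a number (type errors)
type_errors = sample[goals_num.isna()]

# Rows whose Goals value parsed but is below zero (impossible stats)
negatives = sample[goals_num < 0]

print(type_errors)
print(negatives)
```

Note that this check would also flag genuinely missing `Goals` values as type errors, so on real data it may be worth testing `isna()` on the raw column first.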

This dataset now serves as our playground for exploring various data cleaning techniques. In the following sections, we'll use both SQL and Pandas to identify and address these data quality issues, demonstrating the importance and methods of data cleaning in a sports analytics context.

First, we'll show how to diagnose and correct issues using SQL (a common practice in Extract-Load-Transform workflows). Then, we'll show how to do the same thing using Python and the Pandas library (a common practice in Extract-Transform-Load workflows). See the previous chapter for a discussion of the advantages and disadvantages of each approach.

## Reasoning for Cleaning Data
Data cleaning is a crucial step in the data analysis process. In our high school soccer league dataset, we've intentionally introduced several types of data quality issues. Let's examine each of these issues and understand why addressing them is essential for accurate analysis.

### Duplicate Data
Duplicate data refers to identical or very similar records that appear multiple times in a dataset.  Example from our dataset:

In [33]:
%%sql
-- A duplicate is a row that appears more than once
SELECT Team, Player, Position, Goals, Assists, Saves, COUNT(*) as Count
FROM player_stats
GROUP BY Team, Player, Position, Goals, Assists, Saves
HAVING Count > 1;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team,Player,Position,Goals,Assists,Saves,Count
0,Cleat Commanders,Grace A.,Defender,12,16.0,0,2
1,Dribble Trouble,Kristine E.,Goalkeeper,2,2.0,19,2
2,Goal Diggers,Sam H.,Defender,20,14.0,0,2
3,Net Navigators,Marta I.,Forward,14,12.0,0,2
4,Soccer-saurus Rex,Eden V.,Forward,28,13.0,0,2
5,Soccersaurus Rex,Quincy N.,Midfielder,12,8.0,0,2


Each row returned is a record that appears more than once. Here, six player records were each entered twice.

*Why it's a problem--* Duplicate data can lead to overestimation of player performance or team strength. If a player's 30-goal season is recorded twice, it could incorrectly suggest they scored 60 goals, skewing individual and team statistics.
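
For comparison, here is a rough Pandas equivalent of the duplicate check. It is a sketch under stated assumptions: the three-row `sample` frame is hypothetical and only mirrors the columns of `player_stats`, and a "duplicate" is taken to mean a row that matches another row in every column.

```python
import pandas as pd

# Hypothetical mini-sample mirroring the player_stats columns
sample = pd.DataFrame(
    {
        "Team": ["Goal Diggers", "Goal Diggers", "Turf Titans"],
        "Player": ["Sam H.", "Sam H.", "Oscar C."],
        "Position": ["Defender", "Defender", "Midfielder"],
        "Goals": [20, 20, 16],
        "Assists": [14.0, 14.0, 3.0],
        "Saves": [0, 0, 0],
    }
)

# keep=False marks every copy of a repeated row, not just the later ones
dupes = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each repeated row
deduped = sample.drop_duplicates().reset_index(drop=True)

print(dupes)
print(deduped)
```

Whether dropping exact duplicates is safe depends on the data: two different players could, in principle, share a name, team, and stat line.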

### Redundant Data
Redundant data is information that is repeated unnecessarily or can be derived from other data points.  In our current dataset, we don't have explicit redundant data. However, if we had included both "Goals+Assists" and separate "Goals" and "Assists" columns, that would be an example of redundancy. Example:

| Player   | Goals | Assists | Goals+Assists |
|----------|-------|---------|---------------|
| Alex     | 10    | 5       | 15            |
| Beckham  | 8     | 7       | 15            |
| Cristiano| 12    | 3       | 15            |
| Marta    | 9     | 6       | 15            |
| Lionel   | 11    | 4       | 15            |


*Why it's a problem--* Redundant data increases storage requirements and can lead to inconsistencies if one instance of the data is updated but not the other.
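
One way to handle a derived column like this in Pandas is to first check it against its source columns and then drop it. The sketch below is illustrative only: the `stats` frame is hypothetical, and one `Goals+Assists` value has been made deliberately wrong to show the kind of inconsistency redundancy can hide.

```python
import pandas as pd

# Hypothetical frame with a redundant, derived column
stats = pd.DataFrame(
    {
        "Player": ["Alex", "Beckham", "Cristiano"],
        "Goals": [10, 8, 12],
        "Assists": [5, 7, 3],
        "Goals+Assists": [15, 15, 16],  # last value is deliberately inconsistent
    }
)

# Rows where the stored total disagrees with the recomputed total
recomputed = stats["Goals"] + stats["Assists"]
inconsistent = stats[recomputed != stats["Goals+Assists"]]

# Drop the derived column; it can always be recomputed when needed
clean = stats.drop(columns=["Goals+Assists"])

print(inconsistent)
print(clean)
```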

### Missing Values
Missing values are data points that are not present for some records. Example from our dataset:

In [34]:
%%sql
-- Select rows where ANY column is NULL
SELECT * FROM player_stats
WHERE
  Team IS NULL
  OR Player IS NULL
  OR Position IS NULL
  OR Goals IS NULL
  OR Assists IS NULL
  OR Saves IS NULL;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team,Player,Position,Goals,Assists,Saves
0,Goal Getters,Hope N.,Goalkeeper,1,,86
1,Cleat Commanders,Tobin U.,Goalkeeper,0,,38
2,Dribble Trouble,Oscar X.,Midfielder,10,,0
3,Cleat Commanders,Kristine H.,Midfielder,24,,0


We can also get a "null count" for each column as follows:

In [35]:
%%sql
-- null count by column
-- each time a null occurs, we add one to its count
SELECT
  COUNT(CASE WHEN Team IS NULL THEN 1 END) AS Team_Nulls,
  COUNT(CASE WHEN Player IS NULL THEN 1 END) AS Player_Nulls,
  COUNT(CASE WHEN Position IS NULL THEN 1 END) AS Position_Nulls,
  COUNT(CASE WHEN Goals IS NULL THEN 1 END) AS Goals_Nulls,
  COUNT(CASE WHEN Assists IS NULL THEN 1 END) AS Assists_Nulls,
  COUNT(CASE WHEN Saves IS NULL THEN 1 END) AS Saves_Nulls
FROM player_stats;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team_Nulls,Player_Nulls,Position_Nulls,Goals_Nulls,Assists_Nulls,Saves_Nulls
0,0,0,0,0,4,0


Missing values can skew analyses and make it difficult to compare players or teams fairly. For instance, if assist data is missing for some forwards, it becomes challenging to evaluate their overall offensive contribution.
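
The same two checks (rows with any null, and a null count per column) can be sketched in Pandas. The `sample` frame here is a hypothetical three-row stand-in for `player_stats`, with a reduced set of columns.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with two missing Assists values
sample = pd.DataFrame(
    {
        "Player": ["Hope N.", "Oscar X.", "Paul O."],
        "Position": ["Goalkeeper", "Midfielder", "Midfielder"],
        "Goals": [1, 10, 29],
        "Assists": [np.nan, np.nan, 14.0],
    }
)

# Null count per column (the analogue of the COUNT(CASE ...) query)
null_counts = sample.isna().sum()

# Rows where at least one column is missing (the analogue of the OR chain)
rows_with_nulls = sample[sample.isna().any(axis=1)]

print(null_counts)
print(rows_with_nulls)
```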

### Invalid Data

**Invalid data** refers to values that don't conform to the expected format or fall outside the realm of possibility. Example from our dataset:

In [23]:
%%sql
SELECT * FROM player_stats WHERE Position = 'Striker';

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team,Player,Position,Goals,Assists,Saves
0,Goal Getters,Lionel Z.,Striker,28,11.0,0
1,Net Navigators,Marta I.,Striker,14,12.0,0
2,Soccer-saurus Rex,Eden V.,Striker,28,13.0,0
3,Goal Getters,Diego L.,Striker,8,18.0,0
4,Dribble Trouble,Kristine E.,Striker,2,2.0,19
5,Soccersaurus Rex,Quincy N.,Striker,12,8.0,0


This query will reveal records where we've used 'Striker' instead of one of our four valid positions.

Invalid data can lead to errors in analysis or visualization. In our case, grouping by position would incorrectly separate 'Striker' from 'Forward', potentially understating the performance of forwards as a group.

More generally, we can detect invalid data by looking at the "distinct" values in each column:

In [24]:
%%sql
-- Get the distinct values in the Position column
SELECT DISTINCT Position FROM player_stats;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Position
0,Defender
1,Goalkeeper
2,Striker
3,Midfielder
4,Forward
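
In Pandas, a similar check can be written by comparing a column against the set of allowed values. This is a minimal sketch: the `sample` frame is hypothetical, and mapping "Striker" to "Forward" is an assumption about what the data entry intended, not something the data itself confirms.

```python
import pandas as pd

# The four valid positions used when generating the dataset
VALID_POSITIONS = {"Forward", "Midfielder", "Defender", "Goalkeeper"}

# Hypothetical mini-sample containing one invalid position
sample = pd.DataFrame(
    {
        "Player": ["Lionel Z.", "Paul O.", "Grace R."],
        "Position": ["Striker", "Midfielder", "Goalkeeper"],
    }
)

# Rows whose Position is not in the allowed set
invalid = sample[~sample["Position"].isin(VALID_POSITIONS)]

# Assumed correction: treat "Striker" as a synonym for "Forward"
fixed = sample.assign(Position=sample["Position"].replace({"Striker": "Forward"}))

print(invalid)
print(fixed)
```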


### Non-parametric Data

**Non-parametric data** refers to data that doesn't follow a specific probability distribution, often due to inconsistencies in data entry. In our dataset, this might manifest as inconsistent team names:

In [25]:
%%sql
SELECT DISTINCT Team FROM player_stats;

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team
0,Goal Diggers
1,Net Navigators
2,Goal Getters
3,Ball Busters
4,Cleat Commanders
5,Turf Titans
6,Kick It Up
7,Soccersaurus Rex
8,Soccer-saurus Rex
9,Dribble Trouble


This query might reveal both "Soccersaurus Rex" and "Soccer-saurus Rex".

*Why it's a problem--* Non-parametric data can lead to fragmentation of what should be unified categories, making it difficult to aggregate data correctly. In our case, the team's performance might be split across two names, understating their true record.
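
A simple way to unify such variants in Pandas is an explicit mapping from each known misspelling to a canonical name. The sketch below assumes "Soccersaurus Rex" is the correct spelling, which the dataset does not actually tell us; the `teams` series is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample of team names, including the hyphenated variant
teams = pd.Series(
    ["Soccersaurus Rex", "Soccer-saurus Rex", "Goal Getters", "Soccersaurus Rex"]
)

# Assumed canonical spelling; extend this mapping as new variants are found
canonical = {"Soccer-saurus Rex": "Soccersaurus Rex"}

# replace() swaps exact matches and leaves every other value untouched
cleaned = teams.replace(canonical)

print(cleaned.value_counts())
```

An explicit mapping is more tedious than a blanket rule such as stripping all hyphens, but it avoids accidentally merging names that are legitimately different.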

### Data Outliers

**Outliers** are data points that differ significantly from other observations. Example from our dataset:

In [36]:
%%sql
SELECT * FROM player_stats
WHERE Goals > 50 OR Assists > 30 OR (Position = 'Goalkeeper' AND Goals > 5);

 * sqlite:///soccer_league.db
Done.


Unnamed: 0,Team,Player,Position,Goals,Assists,Saves
0,Goal Diggers,Tobin O.,Defender,7,6.0,0
1,Kick It Up,Beckham F.,Midfielder,9,3.0,0
2,Soccer-saurus Rex,Alex E.,Midfielder,6,8.0,0
3,Ball Busters,Paul Y.,Goalkeeper,100,50.0,200
4,Net Navigators,Eden W.,Forward,8,2.0,0
5,Goal Getters,Hope D.,Midfielder,7,19.0,0
6,Dribble Trouble,Cristiano P.,Forward,8,3.0,0
7,Goal Getters,Cristiano R.,Forward,9,2.0,0
8,Ball Busters,Beckham Z.,Forward,7,0.0,0
9,Dribble Trouble,Paul G.,Forward,9,2.0,0


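This query is meant to reveal players with unusually high stats, including goalkeepers with an improbable number of goals. Only one returned row, Paul Y. with 100 goals, 50 assists, and 200 saves, is a genuine outlier. The other rows, all with single-digit goal totals, most likely appear because the type errors cause the `Goals` column to be stored as text, so `Goals > 50` compares strings character by character and `'7'` sorts after `'50'`. Casting the column first, for example with `CAST(Goals AS INTEGER) > 50`, avoids this trap.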

*Why it's a problem--* While outliers can sometimes represent genuinely exceptional performances, they often indicate data entry errors. Including these without verification can significantly skew averages and other statistical measures.
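
The same threshold check can be sketched in Pandas, with the goal values converted to numbers before comparing. The `sample` frame is hypothetical, the thresholds are copied from the SQL query above, and pulling the leading integer out of strings like "8 goals" is an assumption about how the type errors should be interpreted.

```python
import pandas as pd

# Hypothetical mini-sample; Goals is stored as text, as a type error would cause
sample = pd.DataFrame(
    {
        "Player": ["Paul Y.", "Tobin O.", "Grace R.", "Eden W."],
        "Position": ["Goalkeeper", "Defender", "Goalkeeper", "Forward"],
        "Goals": ["100", "7", "0", "8 goals"],
        "Assists": [50.0, 6.0, 3.0, 2.0],
    }
)

# Extract the leading (possibly negative) integer, then convert to a number
goals_text = sample["Goals"].astype(str).str.extract(r"(-?\d+)")[0]
goals_num = pd.to_numeric(goals_text, errors="coerce")

# Same thresholds as the SQL query, now applied to numeric values
is_outlier = (
    (goals_num > 50)
    | (sample["Assists"] > 30)
    | ((sample["Position"] == "Goalkeeper") & (goals_num > 5))
)
outliers = sample[is_outlier]

print(outliers)
```

Fixed thresholds like these encode domain knowledge about soccer; a statistical rule such as an interquartile-range cutoff is an alternative when no such knowledge is available.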