# 02: Data Cleaning

**Objective:** Identify and handle missing or invalid values, detect outliers, and standardize data for our recidivism dataset.

**Key Steps:**

<table width="100%">
  <tr>
    <td style="vertical-align: top; text-align: left; width: 60%; padding-right: 20px;">
      <ol style="font-size: 20px; line-height: 1.4;">
        <li>Identify &amp; quantify missing data</li>
        <li>Handle missing values (imputation, removal, flagging)</li>
        <li>Validate &amp; correct data types</li>
        <li>Detect &amp; treat outliers</li>
        <li>Standardize &amp; normalize</li>
        <li>Document transformations</li>
      </ol>
    </td>
    <td style="vertical-align: top; text-align: left; width: 40%;">
      <!-- relative path to cleaning.png -->
      <img src="../slides/cleaning.png" alt="Data Cleaning Image" width="1000" />
    </td>
  </tr>
</table>

---
<audio controls src="../audio/Cleaning.m4a">



Now that we’ve laid out our data‑cleaning roadmap, let’s put it into practice with a real dataset. For the rest of this lesson, we’ll work with the **COMPAS recidivism data** (`compas‑scores‑raw.csv`), which contains demographic and risk‑assessment scores for individuals screened by the COMPAS tool. 

We’ll start by running through each cleaning step in **Python**, then you are encouraged to repeat the same process in **R** so you can see both workflows side by side.



Now that we’ve laid out our data‑cleaning roadmap, let’s put it into practice with a real dataset. For the rest of this lesson, we’ll work with the **COMPAS recidivism data** (`compas‑scores‑raw.csv`), which contains demographic and risk‑assessment scores for individuals screened by the COMPAS tool. 

We’ll start by running through each cleaning step in **Python**, then you are encouraged to repeat the same process in **R** so you can see both workflows side by side.


## Python

Since this is your first time, you’ll need to download all of the little helper programs (called “packages”) we’ll use—just run one simple command to get them all at once:

```bash
!pip install -r requirements_Python.txt




In [1]:
# only run this block your first time doing this training.

!pip install -r ..\requirements_Python.txt



#### Loading the Pandas Library

Before we can work with our data in Python, we need to load the **Pandas** library. The line below does two things:

1. **Imports** the Pandas module so its functions and classes become available.  
2. **Aliases** it as `pd` so we can reference it concisely in our code.

```python
import pandas as pd


In [2]:
import pandas as pd

### 2.1 Identify & Quantify Missing Data

First, load the data and get a sense of where values are missing.


In [3]:
df = pd.read_csv("../data/compas_scores_raw.csv")

Finally, we call .head() on the DataFrame to display the first five rows. This gives us a quick peek at the structure and contents of our dataset, including column names and sample values.

In [4]:
df.head()

Unnamed: 0,Person_ID,AssessmentID,Case_ID,Agency_Text,LastName,FirstName,MiddleName,Sex_Code_Text,Ethnic_Code_Text,ScaleSet_ID,...,MaritalStatus,RecSupervisionLevel,RecSupervisionLevelText,Scale_ID,DisplayText,RawScore,DecileScore,ScoreText,AssessmentType,Age
0,50844,57167,51950,PRETRIAL,Fisher,Kevin,,Male,Caucasian,22,...,Single,1,Low,18,Risk of Failure to Appear,15.0,1,Low,New,20.0
1,50848,57174,51956,PRETRIAL,KENDALL,KEVIN,,Male,Caucasian,22,...,Married,1,Low,18,Risk of Failure to Appear,19.0,3,Low,New,28.0
2,50855,57181,51963,PRETRIAL,DAYES,DANIEL,,Male,African-American,22,...,Single,4,High,8,Risk of Recidivism,0.18,8,High,New,18.0
3,50855,57181,51963,PRETRIAL,DAYES,DANIEL,,Male,African-American,22,...,Single,4,High,18,Risk of Failure to Appear,13.0,1,Low,New,18.0
4,50850,57176,51958,PRETRIAL,Debe,Mikerlie,George,Female,African-American,22,...,Significant Other,2,Medium,18,Risk of Failure to Appear,11.0,1,Low,New,18.0


> **What to look for in the output:**
> - **Column names** such as `Person_ID`, `Case_ID`, `LastName`, `FirstName`, `MiddleName`, `Sex_Code_Text`, `Ethnic_Code_Text`, `RawScore`, `DecileScore`, `ScoreText`, `Age`  
> - **Data types** (numeric vs. categorical) and any surprising or missing values  
> - **Structure** of the dataset—how it's organized before we begin cleaning and analysis  


Now that we’ve successfully loaded the COMPAS dataset into our `df` object, it’s time to get a quick overview of its structure. In the next step, we’ll check how big the table is and see where any missing values might be hiding.


#### Checking Dataset Size and Missing Data

In this step, we first look at the overall size of our dataset and then count any empty or missing values:


In [5]:
  print(f"Dataset shape: {df.shape}")

Dataset shape: (24272, 24)


 **What does the output mean?**

 - **Dataset shape**  
   This tells us how many rows (records) and columns (fields) we have.  
   For example, `(24272, 24)` means 24,072 rows and 24 columns.

### 2.2 Handle Missing Values

#### Counting Missing Values

First, let’s see which columns have missing data and how many blanks each contains:


In [6]:
missing = df.isna().sum()
print("Missing values per column:")
print(missing[missing > 0])

Missing values per column:
MiddleName    17942
ScoreText        45
Age              56
dtype: int64


#### Options for Handling Missing Values
- **Removal:** drop rows or columns with too many nulls  
- **Imputation:** fill with mean/median or a constant  

##### MiddleName: 17,942 missing values  
  Most records don’t include a middle name—since this field isn’t critical to our analysis, we’ll drop the entire column.

##### ScoreText: 45 missing values  
  A small number of records lack the textual risk label. We’ll remove any rows where `ScoreText` is missing, since we need that label for downstream analyses.

##### Age: 56 missing values
  A handful of entries are missing age information. Since age is important for our analysis, we’ll impute these missing ages with the median age value.

By printing only `missing[missing > 0]`, we focus on the columns that actually have missing entries, allowing us to target our cleaning efforts precisely.  

Below we’ll:
1. Drop columns with >50% missing  
2. Impute `Age` with the median  


##### Step 1: Drop Mostly Empty Columns

Our dataset has a “MiddleName" column that’s almost entirely blank. To avoid clutter, we remove any column where more than half of its values are missing. This will automatically drop “MiddleName” and any other mostly empty fields.


In [7]:
threshold = len(df) * 0.5
df = df.dropna(axis=1, thresh=threshold)

##### Step 2: Remove Rows with Blank `ScoreText` and Fill Missing `Age`

First, we drop any rows where `ScoreText` is blank, since those entries don’t tell us whether someone was classified as “Low,” “Medium,” or “High” risk. After that, we flag rows with missing `Age`, compute the median age, and fill those gaps.


In [8]:
# 1. Drop rows where ScoreText is blank
df = df[df["ScoreText"].str.strip() != ""]

# 2. Flag rows where Age was missing
df["age_missing"] = df["Age"].isna()

# 3. Compute median Age and fill missing ages
median_age = df["Age"].median()
df["Age"] = df["Age"].fillna(median_age)


##### Now we have:
- Removed all rows lacking a valid risk category in `ScoreText`.  
- Marked originally missing `Age` values in a new `age_missing` column.  
- Filled those missing ages with the dataset’s median age.



### 2.3 Validate & Correct Data Types

Before we go further, let’s make sure each column uses the right data type:

- Numbers as integers or floats  
- Dates as datetime objects  
- Text fields we’ll analyze categorically as “category”


In [9]:
# Convert Score and Age to integers
df["RawScore"]    = df["RawScore"].astype(int)
df["DecileScore"] = df["DecileScore"].astype(int)
df["Age"]         = df["Age"].astype(int)


# Convert text fields to categorical
for col in [
    "ScoreText", "Sex_Code_Text", "Ethnic_Code_Text",
    "Language", "MaritalStatus", "RecSupervisionLevelText"
]:
    df[col] = df[col].astype("category")

# Verify types without any warnings
df.dtypes


Person_ID                     int64
AssessmentID                  int64
Case_ID                       int64
Agency_Text                  object
LastName                     object
FirstName                    object
Sex_Code_Text              category
Ethnic_Code_Text           category
ScaleSet_ID                   int64
ScaleSet                     object
Language                   category
LegalStatus                  object
CustodyStatus                object
MaritalStatus              category
RecSupervisionLevel           int64
RecSupervisionLevelText    category
Scale_ID                      int64
DisplayText                  object
RawScore                      int64
DecileScore                   int64
ScoreText                  category
AssessmentType               object
Age                           int64
age_missing                    bool
dtype: object

> **What we’ve done:**  
> - Forced `RawScore`, `DecileScore`, and `Age` into integer form  
> - Marked risk categories and demographic fields as categorical  
> - Verified the changes by inspecting `df.dtypes`  

### Step 2.4: Find and Remove Extreme Values (Outliers) in `RawScore`

Even smart data can hide a few extreme values—called **outliers**—that skew our insights. We’ll clean those out by:

1. **Sorting** all `RawScore` values and finding the 25th percentile (Q1) and 75th percentile (Q3).  
2. Calculating the **interquartile range (IQR)** = Q3 − Q1, which covers the middle 50% of the data.  
3. Defining an **acceptable range** as   [Q1 − 3×IQR, Q3 + 3×IQR]
4. **Keeping** only rows with `RawScore` inside that range and dropping the rest.

This preserves most of the data while removing the very highest or lowest scores that could mislead our analysis.




In [10]:
# 1. Calculate Q1 (25th percentile) and Q3 (75th percentile) for RawScore
Q1 = df["RawScore"].quantile(0.25)
Q3 = df["RawScore"].quantile(0.75)

# 2. Compute the IQR
IQR = Q3 - Q1

# 3. Define acceptable lower and upper bounds
lower_bound = Q1 - 3 * IQR
upper_bound = Q3 + 3 * IQR

# 4. Filter the dataframe to keep only non-outliers
df_clean = df[(df["RawScore"] >= lower_bound) & (df["RawScore"] <= upper_bound)]

# 5. Report how many rows we kept versus removed
print(f"Rows before cleaning: {len(df)}")
print(f"Rows after removing outliers: {len(df_clean)}")


Rows before cleaning: 24272
Rows after removing outliers: 24263


> **In plain terms:**  
> - We measured how wide the middle 50% of scores is.  
> - We dropped any scores more than three times that range away from the center.  
> - Now `df_clean` has most of our original rows minus the extreme cases that could distort averages or trends.


### Step 2.5: Standardize & Normalize Key Variables

Some analysis methods work best when numbers share a common scale. We’ll use two popular Python libraries:

- **SciPy** (`scipy`): a scientific computing library that includes statistical functions, like calculating z‑scores.  
- **scikit‑learn** (`sklearn`): a machine‑learning library that provides tools for data preprocessing, including min‑max scaling.

Here’s what we’ll do:

1. **Z‑score Standardization** on `Age`  
   - Subtract the average age, then divide by the age’s standard deviation  
   - Result is a new column `age_z` where most values fall between –3 and +3  

2. **Min‑Max Scaling** on `RawScore`  
   - Rescales `RawScore` to lie between 0 (lowest score) and 1 (highest score)  
   - Result is a new column `rawscore_scaled` that preserves relative differences  

We’ll apply both to our cleaned data in `df_clean`.



In [11]:
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

# 1. Z‑score standardization on Age
df_clean.loc[:, "age_z"] = stats.zscore(df_clean["Age"])

# 2. Min‑max scaling on RawScore
scaler = MinMaxScaler()
df_clean.loc[:, "rawscore_scaled"] = scaler.fit_transform(df_clean[["RawScore"]])

# Show the new columns alongside the originals
df_clean[["Age", "age_z", "RawScore", "rawscore_scaled"]].head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean.loc[:, "age_z"] = stats.zscore(df_clean["Age"])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean.loc[:, "rawscore_scaled"] = scaler.fit_transform(df_clean[["RawScore"]])


Unnamed: 0,Age,age_z,RawScore,rawscore_scaled
0,20,-1.057784,15,0.333333
1,28,-0.387237,19,0.422222
2,18,-1.225421,0,0.0
3,18,-1.225421,13,0.288889
4,18,-1.225421,11,0.244444


## Interpreting Our Transformed Data



| Age | age_z | RawScore | rawscore_scaled |
| --- | ----- | -------- | --------------- |
| 20  | -1.057784 | 15       | 0.3333      |
| 28  | -0.387237 | 19       | 0.4222      |
| 18  | -1.225421 | 0        | 0.0000      |
| 18  | -1.225421 | 13       | 0.2888      |

### What each column is

- **Age**  
  The actual age in years

- **age_z**  
  A “z‑score” shows how far each age is from the average age, measured in standard units  
  - 0 means exactly average  
  - Negative means below average  
  - Positive means above average  
  - −-1.05778 means about -1.05778 units younger than average

- **RawScore**  
  The original score before any changes (for example a test result or rating)

- **rawscore_scaled**  
  The RawScore rescaled to lie between 0 and 1  
  - 0 is the lowest RawScore in our group  
  - 1 is the highest RawScore in our group  
  - 0.03 means very close to the lowest  
  - 0.29 means 29 percent of the way from lowest to highest

### Why we do this

1. **Fair comparison**  
   When columns use very different units or ranges our analysis can get biased. Standardizing and scaling puts all numbers on the same footing.

2. **Better results**  
   Many data tools and machine learning methods work best when inputs live on the same scale.

3. **Clear interpretation**  
   - Z‑scores tell us how far a value is from the average in comparable units  
   - Scaled scores between 0 and 1 make it easy to see relative position without worrying about original units

---

In plain terms, we’ve “translated” every number so they all speak the same language. That helps our next analysis steps work properly and gives you a fair way to compare apples to apples.


### Step 2.6: Record Your Cleaning Steps

Finally, it’s best practice to keep a log of every transformation. Below we build a simple dictionary summarizing what we did:

- Which columns we dropped because they were mostly empty  
- The median age we used for imputation  
- The bounds we used to remove outliers in `RawScore`  
- Which columns we standardized and normalized  

At the end, we print this log in a clear, readable format.


In [12]:
# Build a list of (Step, Details)
log_items = [
    ("Dropped Columns", list(missing[missing > threshold].index)),
    ("Imputed Age", median_age),
    ("Outlier Bounds (RawScore)", (lower_bound, upper_bound)),
    ("Standardized", ["age_z"]),
    ("Normalized", ["rawscore_scaled"])
]

# Turn it into a DataFrame and display
log_df = pd.DataFrame(log_items, columns=["Step", "Details"])
display(log_df)


Unnamed: 0,Step,Details
0,Dropped Columns,[MiddleName]
1,Imputed Age,29.0
2,Outlier Bounds (RawScore),"(-11.0, 45.0)"
3,Standardized,[age_z]
4,Normalized,[rawscore_scaled]


 **How to read this table:**  
 - **Step:** what we did  
 - **Details:** the exact values or columns affected by that step  


> **Why this matters:**  
> A clear transformation log makes your work reproducible and lets others (or future you) understand exactly how the data was prepared before any analysis or modeling.


## Next Steps: Continue in Python or Switch to R

Our data is now clean and properly formatted. You have two options:

1. **Try the R version** of this cleaning workflow by opening `02_data_cleaning_R.ipynb`.  
2. **Move on to summary statistics in Python** by opening `03_summary_statistics_Python.ipynb`.  

Choose the notebook that matches your preferred language, and let’s continue exploring our recidivism dataset!

