# Data Preprocessing for KMeans Clustering

<!-- TABLE OF CONTENTS -->
<details>
  <summary>Table of Contents</summary>
  <ol>
    <li><a href="#import-needed-filepaths-and-libraries">Import Needed Filepaths and Libraries</a></li>
    <li><a href="#create-directory-to-store-artifacts">Create Directory to Store Artifacts</a></li>
    <li><a href="#load-dataset">Load Dataset</a></li>
    <li>
      <a href="#build-a-feature-matrix">Build a Feature Matrix</a>
      <ul>
        <li><a href="#build-a-feature-matrix-that-includes-all-features">Build a Feature Matrix that Includes ALL Features</a></li>
        <li><a href="#build-a-feature-matrix-that-does-not-include-racial-composition-features">Build a Feature Matrix that Does NOT Include Racial Composition Features</a></li>
      </ul>
    </li>
    <li><a href="#save-artifacts">Save Artifacts</a></li>
  </ol>
</details>

In [9]:
ART_DIR  = "artifacts"
RACE_COLS = ["pct_white", "pct_black", "pct_asian", "pct_hispanic"]
ID_COL = "district_id"

## Import Needed Filepaths and Libraries

In [10]:
import os, json
import numpy as np
import pandas as pd
import joblib
from sklearn.preprocessing import StandardScaler
from texas_gerrymandering_hb4.config import FINAL_CSV, RACE

## Create Directory to Store Artifacts

In [11]:
os.makedirs(ART_DIR, exist_ok=True)

## Load Dataset

In [12]:
df = pd.read_csv(FINAL_CSV)

## Build a Feature Matrix
* The helper function `build_features` selects the feature columns from our pandas DataFrame and returns both the feature matrix and the list of column names it kept.
* `drop_race` is a boolean flag that controls whether the racial composition columns listed in `RACE_COLS` are excluded from the feature matrix.
* The `district_id` column is always dropped, because it is an identifier rather than a feature.

In [13]:
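def build_features(df, drop_race: bool):
    """Return (feature DataFrame, column names), optionally without race columns."""
    # The district identifier is not a feature, so it is always excluded.
    cols = [c for c in df.columns if c != ID_COL]
    if drop_race:
        cols = [c for c in cols if c not in RACE_COLS]
    # Copy so that later changes to the features never modify `df`.
    return df[cols].copy(), cols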
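
To see what the flag does in isolation, here is a minimal sketch that runs the helper on a hypothetical two-row frame. The toy values and the reduced column set are made up for illustration and are not taken from the real dataset; the helper and constants are repeated only so the snippet runs on its own.

```python
import pandas as pd

ID_COL = "district_id"
RACE_COLS = ["pct_white", "pct_black", "pct_asian", "pct_hispanic"]

def build_features(df, drop_race: bool):
    # Same logic as the notebook helper, repeated so this snippet is standalone.
    cols = [c for c in df.columns if c != ID_COL]
    if drop_race:
        cols = [c for c in cols if c not in RACE_COLS]
    return df[cols].copy(), cols

# Hypothetical toy data: values are invented purely for illustration.
toy = pd.DataFrame({
    "district_id": [1, 2],
    "polsby_popper": [0.21, 0.35],
    "pct_white": [0.40, 0.55],
    "pct_black": [0.20, 0.10],
    "pct_asian": [0.05, 0.08],
    "pct_hispanic": [0.35, 0.27],
    "dem_share": [0.52, 0.41],
})

X_all, all_cols = build_features(toy, drop_race=False)
X_nr, nr_cols = build_features(toy, drop_race=True)

print(all_cols)  # every column except district_id
print(nr_cols)   # ['polsby_popper', 'dem_share']
```

In both calls the original column order is preserved, which matters later because the saved column lists are the only record of which matrix column is which feature.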

### Build a Feature Matrix that Includes All Features
* This matrix contains the columns `polsby_popper`, `schwartzberg`, `convex_hull_ratio`, `reock`, `pct_white`, `pct_black`, `pct_asian`, `pct_hispanic`, `dem_share`, and `rep_share`.
* The full feature matrix therefore has 10 columns; with the dataset's 38 rows, its shape is 38 × 10.
* The matrix is then standardized with `StandardScaler`.

In [14]:
X_full, full_cols = build_features(df, drop_race=False)
scaler_full = StandardScaler()
X_full_scaled = scaler_full.fit_transform(X_full)
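
Standardization matters here because KMeans relies on Euclidean distances, so features on larger numeric scales would otherwise dominate the clustering. As a rough sanity check, the sketch below applies `StandardScaler` to synthetic data with the same number of rows and confirms that each column ends up with mean 0 and unit standard deviation. The array `X_demo` and its distribution parameters are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 38 rows, two features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.3, 50.0], scale=[0.1, 12.0], size=(38, 2))

scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

# fit_transform preserves the shape of the input.
assert X_demo_scaled.shape == (38, 2)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std().
assert np.allclose(X_demo_scaled.mean(axis=0), 0.0)
assert np.allclose(X_demo_scaled.std(axis=0), 1.0)
```

The same assertions could be run against `X_full_scaled`, assuming no column in the real dataset is constant (a constant column would scale to all zeros and have a standard deviation of 0).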

### Build a Feature Matrix that Does NOT Include Racial Composition Features
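* This matrix excludes the racial composition features and contains the columns `polsby_popper`, `schwartzberg`, `convex_hull_ratio`, `reock`, `dem_share`, and `rep_share`.
* It therefore has 6 columns; with the dataset's 38 rows, its shape is 38 × 6.
* It is standardized with its own `StandardScaler`, fitted separately from the one used for the full matrix.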

In [15]:
X_norace, norace_cols = build_features(df, drop_race=True)
scaler_norace = StandardScaler()
X_norace_scaled = scaler_norace.fit_transform(X_norace)

## Save Artifacts

In [16]:
# Full feature set: fitted scaler, scaled matrix, and column order.
joblib.dump(scaler_full, f"{ART_DIR}/scaler_full.joblib")
np.savez(f"{ART_DIR}/X_full_scaled.npz", X=X_full_scaled)
with open(f"{ART_DIR}/full_columns.json", "w") as f:
    json.dump(full_cols, f)

# The same three artifacts for the feature set without racial composition columns.
joblib.dump(scaler_norace, f"{ART_DIR}/scaler_norace.joblib")
np.savez(f"{ART_DIR}/X_norace_scaled.npz", X=X_norace_scaled)
with open(f"{ART_DIR}/norace_columns.json", "w") as f:
    json.dump(norace_cols, f)

# District IDs (in the same row order as the scaled matrices) and a copy of the input data.
df[[ID_COL]].to_csv(f"{ART_DIR}/district_ids.csv", index=False)
df.to_csv(f"{ART_DIR}/dataset_snapshot.csv", index=False)
print("Artifacts saved in", ART_DIR)

Artifacts saved in artifacts
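
A downstream clustering notebook would presumably read these files back. The sketch below shows one way the round trip could work, using a temporary directory and a tiny made-up matrix so that it runs without the real artifacts. The file names mirror the ones saved above, while `X`, `cols`, and the temporary directory are illustrative assumptions.

```python
import json
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as art_dir:
    # Made-up stand-in for the real feature matrix and column list.
    X = np.array([[0.2, 10.0], [0.4, 30.0], [0.6, 20.0]])
    cols = ["feature_a", "feature_b"]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Save using the same three formats as the notebook.
    joblib.dump(scaler, os.path.join(art_dir, "scaler_full.joblib"))
    np.savez(os.path.join(art_dir, "X_full_scaled.npz"), X=X_scaled)
    with open(os.path.join(art_dir, "full_columns.json"), "w") as f:
        json.dump(cols, f)

    # Load everything back.
    scaler_loaded = joblib.load(os.path.join(art_dir, "scaler_full.joblib"))
    with np.load(os.path.join(art_dir, "X_full_scaled.npz")) as data:
        X_loaded = data["X"]  # key matches the X=... keyword used in np.savez
    with open(os.path.join(art_dir, "full_columns.json")) as f:
        cols_loaded = json.load(f)

# The reloaded scaler reproduces the saved matrix from the raw features.
roundtrip_ok = np.allclose(scaler_loaded.transform(X), X_loaded)
print(roundtrip_ok, cols_loaded)
```

Saving the fitted scaler alongside the scaled matrix means cluster centers found later can be mapped back to original units with `inverse_transform`, and any new rows can be scaled consistently.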
