# Hands‑On CSV Data Tutorial (Load → Explore → Clean → Split)

This mini‑tutorial shows how to:
1) **Read** a CSV
2) **View** rows/columns
3) **Check** types & missing values
4) **Compute** simple stats
5) **Clean** the data (basic numeric imputation & scaling)
6) **Split** into train/validation sets (`train_test_split`)

**Assumptions**
- The CSV file is named `breast-cancer.csv` and sits in a `data/` folder next to this notebook (this matches `CSV_PATH` in step 2; adjust it if your layout differs).
- The target column is `diagnosis` (common for breast cancer datasets).
- We will encode `diagnosis`: **M → 1**, **B → 0**.


## 1) Imports
We use **pandas** for data handling, **numpy** for numerics, and **scikit‑learn** for splitting.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

## 2) Read the CSV
Change the filename if yours is different. `head()` shows the first few rows.

In [2]:
CSV_PATH = './data/breast-cancer.csv'  # path relative to this notebook; edit if your file lives elsewhere
df = pd.read_csv(CSV_PATH)
df.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
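If the path is wrong, `pd.read_csv` raises a `FileNotFoundError` on its own, but an explicit check can report the resolved path and make the mistake easier to spot. The following is a minimal sketch, not part of the original notebook: the helper name `load_csv_checked` and the toy file contents are assumptions made for illustration.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_checked(path):
    """Read a CSV, failing early with the resolved path if the file is missing.

    Hypothetical helper for illustration; the tutorial itself calls pd.read_csv directly.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"CSV not found: {path.resolve()}")
    return pd.read_csv(path)


# Toy demonstration with a temporary file (made-up rows, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.csv"
    toy_path.write_text("id,diagnosis,radius_mean\n1,M,17.99\n2,B,11.42\n")
    toy_df = load_csv_checked(toy_path)

print(toy_df.head())
```

In the notebook you would call `load_csv_checked(CSV_PATH)` in place of `pd.read_csv(CSV_PATH)`.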


## 3) Rows, Columns, and Column Names
`shape` returns `(rows, columns)`. We also list column names.

In [3]:
print('Shape (rows, columns):', df.shape)
print('\nColumns:')
print(df.columns.tolist())

Shape (rows, columns): (569, 32)

Columns:
['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
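Since later steps rely on specific column names, it can help to confirm they are present right after loading. The sketch below runs on a toy frame; the `required` set is an assumption chosen for illustration, so replace it with the columns your own analysis needs.

```python
import pandas as pd

# Toy frame standing in for the real data (made-up rows)
toy_df = pd.DataFrame({
    "id": [1, 2],
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 11.42],
})

# Hypothetical list of columns the rest of the notebook depends on
required = {"diagnosis", "radius_mean"}

missing = required - set(toy_df.columns)
if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}")

# shape unpacks directly into row and column counts
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```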


## 4) Data Types and Missing Values
`info()` shows dtypes and non‑null counts. Then we count missing values per column.

In [4]:
df.info()  # info() prints its report itself and returns None, so wrapping it in print() adds a stray "None"
print('\nMissing values per column:')
print(df.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14  perimeter_se             5
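With many columns, the raw `isna().sum()` listing can be long. One way to condense it is to put counts and percentages side by side and sort by the number of missing values. This is a sketch on a toy frame with deliberately inserted gaps; the names `toy_df` and `missing_summary` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with some gaps added on purpose (made-up values)
toy_df = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 19.69, 11.42],
    "texture_mean": [10.38, 17.77, np.nan, np.nan],
    "diagnosis": ["M", "M", "B", "B"],
})

# isna().sum() counts missing cells; isna().mean() gives the missing fraction
missing_summary = pd.DataFrame({
    "n_missing": toy_df.isna().sum(),
    "pct_missing": toy_df.isna().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing_summary)
```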

## 5) Simple Descriptive Statistics
`describe()` summarizes numeric columns: count, mean, std, min, quartiles, max.
If you want to include non‑numeric columns, pass `include='all'`.

In [5]:
df.describe()

,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075
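To see what `include='all'` adds, here is a small sketch on a toy frame (made-up values). Text columns get `unique`, `top`, and `freq` rows, while the numeric-only rows such as `mean` are left as NaN for them.

```python
import pandas as pd

# Toy frame with one text column and one numeric column (made-up values)
toy_df = pd.DataFrame({
    "diagnosis": ["M", "M", "B", "B", "B"],
    "radius_mean": [17.99, 20.57, 11.42, 12.45, 13.08],
})

# include="all" summarizes every column, not only the numeric ones
summary = toy_df.describe(include="all")
print(summary)

# The most frequent label and how often it occurs
print(summary.loc["top", "diagnosis"], summary.loc["freq", "diagnosis"])
```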


## 6) Inspect/Encode the Target Column
We assume the target column is `diagnosis` and encode **M → 1** (malignant), **B → 0** (benign).

In [6]:
print('Unique values in diagnosis:', df['diagnosis'].unique())
print('\nClass balance:')
print(df['diagnosis'].value_counts())

# Encode target
y = (df['diagnosis'].astype(str).str.upper().str.strip() == 'M').astype(int)
y.head()

Unique values in diagnosis: ['M' 'B']

Class balance:
diagnosis
B    357
M    212
Name: count, dtype: int64


0    1
1    1
2    1
3    1
4    1
Name: diagnosis, dtype: int64
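Note that the comparison above maps every value that is not `M` to 0, including unexpected labels or missing entries. If you would rather catch such values, a stricter variant uses an explicit mapping and reports anything it could not map. This is a sketch on made-up labels, not the notebook's own method.

```python
import pandas as pd

# Made-up labels, including messy whitespace/case and one unexpected value
labels = pd.Series(["M", "b ", "B", " m", "?"], name="diagnosis")

# Normalize whitespace and case first, as the tutorial does
cleaned = labels.astype(str).str.strip().str.upper()

# An explicit mapping leaves NaN wherever the label is neither M nor B
encoded = cleaned.map({"M": 1, "B": 0})

# Collect the labels that failed to map so they can be inspected
unmapped = cleaned[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

On the real data shown above, only `M` and `B` occur, so both approaches give the same result.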

## 7) Build the Feature Matrix `X`
- Drop the target column and any obvious ID columns (like `id`).
- Keep only numeric features (simplest approach for tabular ML).


In [7]:
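# Substring match: 'id' in c also matches names such as 'width', so inspect id_like before dropping.
# In this dataset it only matches 'id'.
id_like = [c for c in df.columns if ('id' in c.lower() or 'unnamed' in c.lower())]
# Drop the target and ID-like columns, keep numeric features, and store them as float32
X = df.drop(columns=['diagnosis'] + id_like, errors='ignore').select_dtypes(include=[np.number]).astype(np.float32)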
X.head()

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,17.99,10.38,122.800003,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.379999,17.33,184.600006,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.899994,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.800003,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.690001,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.530001,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.379999,77.580002,386.100006,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.870003,567.700012,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.290001,14.34,135.100006,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.540001,16.67,152.199997,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678
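Values such as `122.800003` in the output above are not data errors. They come from the `float32` cast, which keeps roughly seven significant decimal digits. A short sketch of the effect, using one value from the table:

```python
import numpy as np

value64 = np.float64(122.8)   # pandas' default float precision
value32 = np.float32(122.8)   # precision after .astype(np.float32)

# Printing with six decimals exposes the rounding introduced by float32
print(f"float64: {float(value64):.6f}")
print(f"float32: {float(value32):.6f}")

# The absolute difference is tiny relative to the value itself
error = abs(float(value32) - float(value64))
print("absolute difference:", error)
```

For most baseline models this loss of precision is negligible, and `float32` halves memory use and matches the default dtype of frameworks such as PyTorch.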


## 8) Basic Cleaning (Numeric Only)
- Fill missing numeric values with the **column mean**.
- (Optional) Standardize features to zero mean / unit variance.

These simple steps are enough for many baseline models.

In [8]:
# Fill missing values with column means
X = X.fillna(X.mean())

# Standardize (mean 0, std 1)
X = (X - X.mean()) / X.std(ddof=0)

# If any std was zero → NaNs; replace those with 0
X = X.fillna(0.0)
X.describe().T.head(10)  # quick peek at feature stats

,count,mean,std,min,25%,50%,75%,max
radius_mean,569.0,2.011264e-08,1.00088,-2.029648,-0.689385,-0.215082,0.469393,3.971288
texture_mean,569.0,-3.385628e-07,1.00088,-2.229249,-0.725963,-0.104637,0.584175,4.651888
perimeter_mean,569.0,4.022528e-08,1.00088,-1.984504,-0.691956,-0.23598,0.499677,3.97613
area_mean,569.0,1.910701e-07,1.00088,-1.454443,-0.667195,-0.295187,0.363508,5.250529
smoothness_mean,569.0,6.704213e-09,1.00088,-3.112085,-0.710963,-0.034891,0.636199,4.770911
compactness_mean,569.0,-2.011264e-08,1.00088,-1.610136,-0.747086,-0.221941,0.493857,4.568425
concavity_mean,569.0,2.681685e-08,1.00088,-1.114873,-0.743748,-0.34224,0.526062,4.243589
concave points_mean,569.0,2.681685e-08,1.00088,-1.26182,-0.737944,-0.397721,0.646935,3.92793
symmetry_mean,569.0,-8.715477e-08,1.00088,-2.744117,-0.70324,-0.071627,0.530779,4.484751
fractal_dimension_mean,569.0,-5.313089e-07,1.00088,-1.819866,-0.72264,-0.17828,0.470983,4.910918
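The `std` column reads 1.00088 rather than exactly 1 because the standardization divides by the population standard deviation (`ddof=0`), while `describe()` reports the sample standard deviation (`ddof=1`). With n = 569 rows the two differ by a factor of sqrt(n / (n - 1)). The sketch below reproduces this on synthetic data; the column name `a` and the random values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

n = 569  # same number of rows as the dataset
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(loc=10.0, scale=3.0, size=n)})

# Standardize with the population std, as the tutorial does
z = (toy - toy.mean()) / toy.std(ddof=0)

pop_std = z["a"].std(ddof=0)      # what the standardization targets
sample_std = z["a"].std(ddof=1)   # what describe() reports
expected = np.sqrt(n / (n - 1))   # ratio between the two definitions

print(pop_std, sample_std, expected)
```

The means in the table are likewise around 1e-7 instead of exactly 0 because of `float32` rounding.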


## 9) Train/Validation Split
Finally, split into train/validation sets. We stratify on `y` to preserve class balance.

In [9]:
X_train, X_val, y_train, y_val = train_test_split(
    X.values, y.values,
    test_size=0.2,
    random_state=42,
    stratify=y.values
)
print('Train shapes:', X_train.shape, y_train.shape)
print('Validation shapes:', X_val.shape, y_val.shape)

Train shapes: (455, 30) (455,)
Validation shapes: (114, 30) (114,)
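One caveat: the tutorial standardizes `X` before splitting, so the validation rows contribute to the means and standard deviations. For a quick baseline this is often acceptable, but a common alternative is to split first and compute the scaling statistics on the training rows only. The following is a sketch of that variant on synthetic data, with all variable names chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the unscaled feature matrix and labels
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = np.array([0] * 60 + [1] * 40)

# Split first, stratifying on the labels as in the tutorial
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Compute scaling statistics from the training rows only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)       # population std (ddof=0), matching the tutorial
sigma[sigma == 0] = 1.0        # guard against constant columns

# Apply the same training statistics to both splits
X_tr_scaled = (X_tr - mu) / sigma
X_va_scaled = (X_va - mu) / sigma

print("train mean per feature:", X_tr_scaled.mean(axis=0).round(6))
print("validation mean per feature:", X_va_scaled.mean(axis=0).round(3))
```

The validation means will generally not be exactly zero, which is expected: those rows played no part in computing `mu` and `sigma`.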


### ✅ You now have:
- `X_train, y_train` — training features and labels
- `X_val, y_val` — validation features and labels

From here, plug these into any model you like (e.g., logistic regression, random forest, or your PyTorch `BinaryClassifier`).
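As one possible next step, here is a minimal logistic regression baseline with scikit‑learn. It is a sketch on synthetic data generated by `make_classification`, so the accuracy it prints says nothing about the breast cancer dataset; with the real arrays you would pass `X_train, y_train` to `fit` and `X_val, y_val` to `score`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# max_iter is raised from the default to reduce convergence warnings
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)

# score() returns mean accuracy on the given data
val_acc = model.score(X_va, y_va)
print(f"Validation accuracy (synthetic data): {val_acc:.3f}")
```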