# Introduction and dataset description

This notebook performs preprocessing for a lung cancer prediction dataset. The goal is to prepare the data for exploratory analysis and modeling (cleaning, transformation and encoding).

- Source: file `survey lung cancer.csv` (loaded into `df`)
- Shape: 309 rows × 16 columns
- Missing values: none (all columns have 309 non-null values)
- Main dtypes: 14 integer columns (`AGE` plus 13 binary features), 2 object columns (`GENDER`, `LUNG_CANCER`)

Columns and meaning
- `GENDER`: Patient sex. Values: `'M'` = male, `'F'` = female.
- `AGE`: Patient age (integer).
- `SMOKING`: Smoking. Encoded as 1 = NO, 2 = YES.
- `YELLOW_FINGERS`: Yellow fingers. Encoded as 1 = NO, 2 = YES.
- `ANXIETY`: Anxiety. Encoded as 1 = NO, 2 = YES.
- `PEER_PRESSURE`: Peer pressure. Encoded as 1 = NO, 2 = YES.
- `CHRONIC DISEASE`: Chronic disease. Encoded as 1 = NO, 2 = YES.
- `FATIGUE`: Fatigue. Encoded as 1 = NO, 2 = YES.
- `ALLERGY`: Allergy. Encoded as 1 = NO, 2 = YES.
- `WHEEZING`: Wheezing. Encoded as 1 = NO, 2 = YES.
- `ALCOHOL CONSUMING`: Alcohol consumption. Encoded as 1 = NO, 2 = YES.
- `COUGHING`: Coughing. Encoded as 1 = NO, 2 = YES.
- `SHORTNESS OF BREATH`: Shortness of breath. Encoded as 1 = NO, 2 = YES.
- `SWALLOWING DIFFICULTY`: Difficulty swallowing. Encoded as 1 = NO, 2 = YES.
- `CHEST PAIN`: Chest pain. Encoded as 1 = NO, 2 = YES.
- `LUNG_CANCER`: Target label. Values: `'YES'`, `'NO'`.

Note: Thirteen of the features use a 1/2 encoding (1 → NO, 2 → YES). Keep a record of this mapping so that results stay interpretable.

Preprocessing notes
- Scale or standardize `AGE` depending on the model choice (e.g., StandardScaler or MinMaxScaler).
- Save and document all mapping dictionaries and preprocessing steps for reproducibility and interpretability.
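
The `AGE` scaling mentioned above can be sketched without extra dependencies. The snippet below is a minimal illustration, assuming a toy `age` series built from the five ages visible in the first rows of the dataset. It computes by hand the quantities that scikit-learn's `StandardScaler` (zero mean, unit variance using the population standard deviation) and `MinMaxScaler` (range 0 to 1) would produce for a single column. The variable names are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data: the AGE values of the first five rows of the dataset.
age = pd.Series([69, 74, 59, 63, 63], name="age")

# Standardization: zero mean, unit variance. ddof=0 selects the population
# standard deviation, which is the convention StandardScaler follows.
age_standardized = (age - age.mean()) / age.std(ddof=0)

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
age_minmax = (age - age.min()) / (age.max() - age.min())

print(age_standardized.round(3).tolist())
print(age_minmax.round(3).tolist())
```

If the data is later split into training and test sets, the mean, standard deviation, minimum and maximum would normally be computed on the training split only and then reused on the test split, so that no test information leaks into preprocessing.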

This cell documents the essential dataset information before applying transformations.

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [18]:
df = pd.read_csv('C:/Users/a_esp/Documents/GitHub/lung_cancer/data/survey lung cancer.csv')

In [19]:
df.head()

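|   | GENDER | AGE | SMOKING | YELLOW_FINGERS | ANXIETY | PEER_PRESSURE | CHRONIC DISEASE | FATIGUE | ALLERGY | WHEEZING | ALCOHOL CONSUMING | COUGHING | SHORTNESS OF BREATH | SWALLOWING DIFFICULTY | CHEST PAIN | LUNG_CANCER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 69 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | YES |
| 1 | M | 74 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | YES |
| 2 | F | 59 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | NO |
| 3 | M | 63 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 | 2 | NO |
| 4 | F | 63 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 1 | NO |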


In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   GENDER                 309 non-null    object
 1   AGE                    309 non-null    int64 
 2   SMOKING                309 non-null    int64 
 3   YELLOW_FINGERS         309 non-null    int64 
 4   ANXIETY                309 non-null    int64 
 5   PEER_PRESSURE          309 non-null    int64 
 6   CHRONIC DISEASE        309 non-null    int64 
 7   FATIGUE                309 non-null    int64 
 8   ALLERGY                309 non-null    int64 
 9   WHEEZING               309 non-null    int64 
 10  ALCOHOL CONSUMING      309 non-null    int64 
 11  COUGHING               309 non-null    int64 
 12  SHORTNESS OF BREATH    309 non-null    int64 
 13  SWALLOWING DIFFICULTY  309 non-null    int64 
 14  CHEST PAIN             309 non-null    int64 
 15  LUNG_CANCER            309 non-null    object

In [21]:
# Normalize column names to snake_case (e.g., `chronic_disease`, `alcohol_consuming`) to avoid spaces and uppercase.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
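
To make the effect of this normalization concrete, here is a small sketch on a toy frame. It assumes a few of the original column names plus one hypothetical padded name (`" AGE "`) that is not taken from the dataset and is included only to show why `str.strip()` comes first in the chain.

```python
import pandas as pd

# Toy frame: three real column names and one hypothetical padded name.
demo = pd.DataFrame(
    columns=["GENDER", " AGE ", "CHRONIC DISEASE", "ALCOHOL CONSUMING"]
)

# Same chain as in the notebook: trim surrounding whitespace, lowercase,
# then replace inner spaces with underscores.
demo.columns = demo.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(demo.columns))
```

Stripping before replacing matters: without it, a padded name would end up with leading or trailing underscores.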

In [22]:
# Check for missing values
df.isnull().sum()

gender                   0
age                      0
smoking                  0
yellow_fingers           0
anxiety                  0
peer_pressure            0
chronic_disease          0
fatigue                  0
allergy                  0
wheezing                 0
alcohol_consuming        0
coughing                 0
shortness_of_breath      0
swallowing_difficulty    0
chest_pain               0
lung_cancer              0
dtype: int64

In [23]:
# Count fully duplicated rows (this cell only counts them; nothing is removed)
df.duplicated().sum()

np.int64(33)
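
The cell above reports 33 duplicated rows but does not drop them. If removal is wanted, one possible approach is sketched below on a toy frame. The frame `demo` and the counters are illustrative names, and whether the duplicates should be removed at all is an assumption of this sketch, not a decision made in the notebook.

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are illustrative).
demo = pd.DataFrame({
    "age": [69, 74, 69],
    "smoking": [1, 2, 1],
    "lung_cancer": ["YES", "YES", "YES"],
})

n_before = len(demo)
n_dupes = int(demo.duplicated().sum())  # rows identical to an earlier row

# keep="first" (the default) retains the first occurrence in each group
# of identical rows; reset_index gives the result a clean 0..n-1 index.
demo = demo.drop_duplicates().reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) removed: {n_before} -> {len(demo)}")
```

Because the dataset has no respondent identifier, identical rows could be either repeated entries or different people with the same answers, so it is worth recording which choice was made.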

In [None]:
# Map binary columns: 1→0 (NO) and 2→1 (YES).
# Run this cell once on freshly loaded data: re-running it would map the new 1s to 0.
binary_cols = [
    'chronic_disease', 'smoking', 'yellow_fingers', 'anxiety', 'peer_pressure',
    'fatigue', 'allergy', 'wheezing', 'alcohol_consuming',
    'coughing', 'shortness_of_breath', 'swallowing_difficulty', 'chest_pain'
]

for col in binary_cols:
    # Show original values before transformation
    print(f"Original values in {col}: {sorted(df[col].unique())}")

    # Map 1→0 and 2→1, then downcast to an unsigned 8-bit integer
    df[col] = df[col].replace({1: 0, 2: 1}).astype('uint8')

    # Show transformed values after mapping
    print(f"Transformed values in {col}: {sorted(df[col].unique())}\n")

Original values in chronic_disease: [np.int64(1), np.int64(2)]
Transformed values in chronic_disease: [np.uint8(0), np.uint8(1)]

Original values in smoking: [np.int64(1), np.int64(2)]
Transformed values in smoking: [np.uint8(0), np.uint8(1)]

Original values in yellow_fingers: [np.int64(1), np.int64(2)]
Transformed values in yellow_fingers: [np.uint8(0), np.uint8(1)]

Original values in anxiety: [np.int64(1), np.int64(2)]
Transformed values in anxiety: [np.uint8(0), np.uint8(1)]

Original values in peer_pressure: [np.int64(1), np.int64(2)]
Transformed values in peer_pressure: [np.uint8(0), np.uint8(1)]

Original values in fatigue: [np.int64(1), np.int64(2)]
Transformed values in fatigue: [np.uint8(0), np.uint8(1)]

Original values in allergy: [np.int64(1), np.int64(2)]
Transformed values in allergy: [np.uint8(0), np.uint8(1)]

Original values in wheezing: [np.int64(1), np.int64(2)]
Transformed values in wheezing: [np.uint8(0), np.uint8(1)]
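
The two object columns (`gender`, `lung_cancer`) are not encoded in the cells shown here, and the introduction recommends keeping a record of every mapping. The sketch below combines both ideas on a toy frame. The 0/1 assignments, the `mappings` dictionary and the `mapping_record` string are assumptions made for illustration, not something the notebook defines.

```python
import json

import pandas as pd

# Hypothetical mappings for the two object columns.
mappings = {
    "gender": {"M": 1, "F": 0},
    "lung_cancer": {"YES": 1, "NO": 0},
}

# Toy frame using the values of the first five rows of the dataset.
demo = pd.DataFrame({
    "gender": ["M", "M", "F", "M", "F"],
    "lung_cancer": ["YES", "YES", "NO", "NO", "NO"],
})

for col, mapping in mappings.items():
    # Series.map yields NaN for any value missing from the mapping,
    # so check for that before casting to an integer dtype.
    encoded = demo[col].map(mapping)
    if encoded.isna().any():
        raise ValueError(f"Unmapped values in {col!r}")
    demo[col] = encoded.astype("uint8")

# Serialize the mappings so the encoding can be audited or reversed later.
mapping_record = json.dumps(mappings, indent=2)
print(mapping_record)
```

Using `map` with an explicit dictionary, rather than `replace`, makes unexpected categories surface as an error instead of passing through unchanged.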

In [8]:
df.to_csv('C:/Users/a_esp/Documents/GitHub/lung_cancer/data/survey lung cancer_clean.csv', index=False)
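
As a quick check on the export step, the following sketch writes a toy frame to CSV and reads it back. It assumes a temporary directory as a stand-in for the project's data folder, and `data_dir`, `out_path` and `reloaded` are illustrative names. Building the path with `pathlib` is one way to avoid hard-coding a machine-specific absolute path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the project's data directory, so the sketch runs anywhere.
data_dir = Path(tempfile.mkdtemp())
out_path = data_dir / "survey lung cancer_clean.csv"

# Toy frame with one uint8 column, mirroring the binary features.
demo = pd.DataFrame({"age": [69, 74], "smoking": [0, 1]})
demo["smoking"] = demo["smoking"].astype("uint8")

# index=False keeps the row index out of the file.
demo.to_csv(out_path, index=False)

# Read the file back and compare values and dtypes.
reloaded = pd.read_csv(out_path)
print(reloaded.dtypes)
```

CSV files do not store dtypes, so a `uint8` column is inferred as a default integer type on reload. If the compact dtype matters downstream, it can be restored by passing `dtype=` to `pd.read_csv`.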