# Part 1: Pre-processing and Data Cleaning

In [1]:
import pandas as pd

## Understanding the data
Before cleaning anything, we look at what the data contains.

1. What does the dataset look like?
2. What are the features?


In [2]:
dataset = pd.read_csv("../data/raw/lung_cancer.csv")

In [3]:
dataset.columns = dataset.columns.str.strip()
dataset.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO


### Glimpse of the data
From the data above, we can see a number of features that may be associated with lung cancer.

Right now we have
1. Categorical data in the `GENDER` column and Boolean categorical data in the `LUNG_CANCER` column
2. Numerical (continuous) data in the `AGE` column
3. Boolean data encoded numerically as `1` and `2` in the rest of the columns (`SMOKING`, `YELLOW_FINGERS`, etc.)


We need to bring the Boolean columns onto the same scale
1. The `GENDER` column changes to `0` and `1` (F, M)
2. The `LUNG_CANCER` column changes to `0` and `1` (NO, YES)
3. `AGE` is left as it is
4. The remaining numerically encoded Boolean columns also change to `0` and `1` (1, 2)
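
Before recoding, it can help to confirm which distinct values each column actually holds. The following is a minimal sketch on a toy frame; `sample` and `distinct_values` are illustrative names, and the values are made up rather than taken from the dataset.

```python
import pandas as pd

# Toy frame that mimics the layout above; the values are illustrative only.
sample = pd.DataFrame(
    {
        "GENDER": ["M", "F", "M"],
        "AGE": [69, 59, 63],
        "SMOKING": [1, 2, 2],
        "LUNG_CANCER": ["YES", "NO", "NO"],
    }
)

# Collect the sorted distinct values of every column except the continuous one.
distinct_values = {
    column: sorted(sample[column].unique().tolist())
    for column in sample.columns
    if column != "AGE"
}
print(distinct_values)
```

On the real data, the same dictionary comprehension could be run against `dataset` to check that no column contains an unexpected category before it is mapped to `0` and `1`.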

#### Check for null and NaN values

First, let's check whether there are any null or NaN values, and drop them if there are.
`isna()` and `isnull()` are aliases in pandas, so the two cells below report the same counts.

In [4]:
# Count missing values (NaN) per column
dataset.isna().sum()

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

In [5]:
dataset.isnull().sum()

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64

#### Data is OK

No null or NaN values were found, so let's continue and convert the data into the correct format.
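
Had any missing values been found, the drop step could look roughly like this. It is a sketch on a hypothetical frame with one missing `AGE` value; `sample`, `missing_per_column`, and `cleaned` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing AGE value, for illustration only.
sample = pd.DataFrame({"AGE": [69.0, np.nan, 59.0], "SMOKING": [1, 2, 1]})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Drop incomplete rows only when at least one missing value exists.
if missing_per_column.sum() > 0:
    cleaned = sample.dropna().reset_index(drop=True)
else:
    cleaned = sample

print(missing_per_column)
print(len(cleaned))
```

Dropping rows is only one option; with a dataset this small, it may be worth checking how many rows would be lost before committing to it.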

#### Change `GENDER` to 0 and 1

> 📝 Run this cell only once, because it replaces the values in place. If you accidentally run it twice, rerun the notebook from the top.

In [6]:
# Encode "M" as 1 and every other value (here only "F") as 0
dataset["GENDER"] = (dataset["GENDER"] == "M").astype(int)
dataset

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,1,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,1,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,0,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,1,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,0,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,0,56,1,1,1,2,2,2,1,1,2,2,2,2,1,YES
305,1,70,2,1,1,1,1,2,2,2,2,2,2,1,2,YES
306,1,58,2,1,1,1,1,1,2,2,2,2,1,1,2,YES
307,1,67,2,1,2,1,1,2,2,1,2,2,2,1,2,YES
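
The run-once caveat exists because a second run compares the already encoded integers against `"M"` and turns every row into `0`. One way to make the step safe to rerun is to recode only while the column still holds the original strings. This is a sketch, not the notebook's method; `encode_gender` is a hypothetical helper and `sample` is toy data.

```python
import pandas as pd

sample = pd.DataFrame({"GENDER": ["M", "F", "M"]})


def encode_gender(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: recode GENDER only while it still holds strings."""
    result = frame.copy()
    # After the first run the column holds integers, so this check fails
    # and the column is left untouched.
    if result["GENDER"].isin(["M", "F"]).all():
        result["GENDER"] = result["GENDER"].map({"F": 0, "M": 1}).astype(int)
    return result


once = encode_gender(sample)
twice = encode_gender(once)
print(once["GENDER"].tolist(), twice["GENDER"].tolist())
```

Because the helper returns a copy and guards on the original labels, calling it twice gives the same result as calling it once.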


#### Change `LUNG_CANCER` to 0 and 1

> 📝 Run this cell only once, because it replaces the values in place. If you accidentally run it twice, rerun the notebook from the top.

In [7]:
# Encode "YES" as 1 and every other value (here only "NO") as 0
dataset["LUNG_CANCER"] = (dataset["LUNG_CANCER"] == "YES").astype(int)
dataset

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,1,69,1,2,2,1,1,2,1,2,2,2,2,2,2,1
1,1,74,2,1,1,1,2,2,2,1,1,1,2,2,2,1
2,0,59,1,1,1,2,1,2,1,2,1,2,2,1,2,0
3,1,63,2,2,2,1,1,1,1,1,2,1,1,2,2,0
4,0,63,1,2,1,1,1,1,1,2,1,2,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,0,56,1,1,1,2,2,2,1,1,2,2,2,2,1,1
305,1,70,2,1,1,1,1,2,2,2,2,2,2,1,2,1
306,1,58,2,1,1,1,1,1,2,2,2,2,1,1,2,1
307,1,67,2,1,2,1,1,2,2,1,2,2,2,1,2,1


In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   GENDER                 309 non-null    int32
 1   AGE                    309 non-null    int64
 2   SMOKING                309 non-null    int64
 3   YELLOW_FINGERS         309 non-null    int64
 4   ANXIETY                309 non-null    int64
 5   PEER_PRESSURE          309 non-null    int64
 6   CHRONIC DISEASE        309 non-null    int64
 7   FATIGUE                309 non-null    int64
 8   ALLERGY                309 non-null    int64
 9   WHEEZING               309 non-null    int64
 10  ALCOHOL CONSUMING      309 non-null    int64
 11  COUGHING               309 non-null    int64
 12  SHORTNESS OF BREATH    309 non-null    int64
 13  SWALLOWING DIFFICULTY  309 non-null    int64
 14  CHEST PAIN             309 non-null    int64
 15  LUNG_CANCER            309 non-null    i

#### Change the rest of the columns to 0 and 1

> 📝 Run this cell only once, because it replaces the values in place. If you accidentally run it twice, rerun the notebook from the top.

In [9]:
# Columns that are still encoded as 1 and 2
binary_columns = [
    "SMOKING",
    "YELLOW_FINGERS",
    "ANXIETY",
    "PEER_PRESSURE",
    "CHRONIC DISEASE",
    "FATIGUE",
    "ALLERGY",
    "WHEEZING",
    "ALCOHOL CONSUMING",
    "COUGHING",
    "SHORTNESS OF BREATH",
    "SWALLOWING DIFFICULTY",
    "CHEST PAIN",
]

# Recode each column: 2 becomes 1, every other value (here only 1) becomes 0
for column in binary_columns:
    dataset[column] = (dataset[column] == 2).astype(int)

dataset

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,1,69,0,1,1,0,0,1,0,1,1,1,1,1,1,1
1,1,74,1,0,0,0,1,1,1,0,0,0,1,1,1,1
2,0,59,0,0,0,1,0,1,0,1,0,1,1,0,1,0
3,1,63,1,1,1,0,0,0,0,0,1,0,0,1,1,0
4,0,63,0,1,0,0,0,0,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,0,56,0,0,0,1,1,1,0,0,1,1,1,1,0,1
305,1,70,1,0,0,0,0,1,1,1,1,1,1,0,1,1
306,1,58,1,0,0,0,0,0,1,1,1,1,0,0,1,1
307,1,67,1,0,1,0,0,1,1,0,1,1,1,0,1,1


#### Glimpse of the dataset

The data is now fully numeric, and every true/false column uses the same `0` and `1` scale.
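
A quick sanity check can confirm that claim. The sketch below runs on a toy frame with the same column conventions; `sample`, `binary_columns`, and `only_zero_one` are illustrative names, and `AGE` is excluded because it stays continuous.

```python
import pandas as pd

# Toy frame in the recoded layout; the values are illustrative only.
sample = pd.DataFrame(
    {
        "GENDER": [1, 0, 1],
        "AGE": [69, 59, 63],
        "SMOKING": [0, 1, 1],
        "LUNG_CANCER": [1, 0, 0],
    }
)

# Every column except AGE is expected to contain only 0 and 1.
binary_columns = [column for column in sample.columns if column != "AGE"]
only_zero_one = sample[binary_columns].isin([0, 1]).all()

print(only_zero_one)
```

Applied to `dataset`, a `False` entry in `only_zero_one` would point to a column that was missed or recoded twice.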

#### Save the data to `pre-processed/lung_cancer.csv`

The data is now ready for the EDA part.
Let's save it and continue with EDA to understand which features can be used for prediction.

In [10]:
dataset.to_csv("../data/pre-processed/lung_cancer.csv", index=False)
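
`to_csv` does not create missing folders, so the save fails if `../data/pre-processed` does not exist yet. A minimal sketch of creating the folder first and verifying the file by reading it back is shown below; it writes to a temporary directory as a stand-in for `../data`, and `sample`, `output_path`, and `round_trip_ok` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame in the recoded layout; the values are illustrative only.
sample = pd.DataFrame({"GENDER": [1, 0], "AGE": [69, 59], "LUNG_CANCER": [1, 0]})

# A temporary directory stands in for ../data so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    output_path = Path(tmp) / "pre-processed" / "lung_cancer.csv"

    # Create the target folder if it is missing, then write without the index.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    sample.to_csv(output_path, index=False)

    # Read the file back to confirm nothing changed on the way to disk.
    reloaded = pd.read_csv(output_path)

round_trip_ok = reloaded.equals(sample)
print(round_trip_ok)
```

Writing with `index=False`, as the notebook does, keeps the row index out of the file so that reloading it does not add an extra unnamed column.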