Dataset Topic: Lung Cancer

Target Variable: LUNG_CANCER (YES/NO)

Variables: GENDER, AGE, SMOKING, YELLOW_FINGERS, ANXIETY, PEER_PRESSURE, CHRONIC DISEASE, FATIGUE, ALLERGY, WHEEZING, ALCOHOL CONSUMING, COUGHING, SHORTNESS OF BREATH, SWALLOWING DIFFICULTY, CHEST PAIN, LUNG_CANCER

Training Model used: Random Forest

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

import warnings
warnings.filterwarnings("ignore")

In [2]:
data = pd.read_csv(r"survey lung cancer.csv", encoding ='latin-1')

In [3]:
data

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,M,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES
1,M,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES
2,F,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO
3,M,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO
4,F,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
304,F,56,1,1,1,2,2,2,1,1,2,2,2,2,1,YES
305,M,70,2,1,1,1,1,2,2,2,2,2,2,1,2,YES
306,M,58,2,1,1,1,1,1,2,2,2,2,1,1,2,YES
307,M,67,2,1,2,1,1,2,2,1,2,2,2,1,2,YES


In [4]:
data.shape

(309, 16)

In [5]:
data.duplicated().sum()

np.int64(33)

In [6]:
# Drop duplicate rows so the Random Forest does not train on repeated records.
data = data.drop_duplicates()

The dataset contains 33 duplicate rows. Dropping them keeps the model from training on repeated records.
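As a small illustration of what `duplicated()` and `drop_duplicates()` do, here is a sketch on a made-up three-row frame (the values are invented for the example, not taken from the survey file):

```python
import pandas as pd

# Toy frame (invented values): rows 0 and 2 are identical.
toy = pd.DataFrame({"AGE": [69, 74, 69], "SMOKING": [1, 2, 1]})

# duplicated() flags every repeat after the first occurrence.
n_dupes = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence and the original index labels.
deduped = toy.drop_duplicates()

print(n_dupes)
print(deduped.shape)
```

In the notebook, 309 - 33 = 276 rows remain, which matches the class counts reported later (238 + 38). Because the original index labels are kept, the index is no longer contiguous after this step.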

In [7]:
print(data.isnull().sum())

GENDER                   0
AGE                      0
SMOKING                  0
YELLOW_FINGERS           0
ANXIETY                  0
PEER_PRESSURE            0
CHRONIC DISEASE          0
FATIGUE                  0
ALLERGY                  0
WHEEZING                 0
ALCOHOL CONSUMING        0
COUGHING                 0
SHORTNESS OF BREATH      0
SWALLOWING DIFFICULTY    0
CHEST PAIN               0
LUNG_CANCER              0
dtype: int64


Before we split the data into training and test sets, we have to clean it.

## Convert Data Type (Object to Integer)

Convert the "GENDER" and "LUNG_CANCER" columns from object to "int64" with LabelEncoder, and recode the remaining survey columns from 1/2 to 0/1 so that every categorical column uses the same coding.

In [8]:

data["GENDER"].unique()

array(['M', 'F'], dtype=object)

M = Male |  F = Female

In [9]:
data["LUNG_CANCER"].unique()

array(['YES', 'NO'], dtype=object)

"YES" if the person has lung cancer. "NO" if the person does not have lung cancer.

In [10]:
data.dtypes

GENDER                   object
AGE                       int64
SMOKING                   int64
YELLOW_FINGERS            int64
ANXIETY                   int64
PEER_PRESSURE             int64
CHRONIC DISEASE           int64
FATIGUE                   int64
ALLERGY                   int64
WHEEZING                  int64
ALCOHOL CONSUMING         int64
COUGHING                  int64
SHORTNESS OF BREATH       int64
SWALLOWING DIFFICULTY     int64
CHEST PAIN                int64
LUNG_CANCER              object
dtype: object

In [11]:
# Encode every column except AGE: 'F'/'M' -> 0/1, 'NO'/'YES' -> 0/1,
# and the 1/2 survey codes -> 0/1.
# Looping over data.columns also handles 'FATIGUE ' and 'ALLERGY ',
# whose names carry a trailing space in the CSV.
le = preprocessing.LabelEncoder()
for col in data.columns.drop('AGE'):
    data[col] = le.fit_transform(data[col])
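To make the encoding explicit, the sketch below records which original label maps to which integer. It uses a made-up three-row frame and a hypothetical `mappings` dictionary. `LabelEncoder` sorts the labels it sees, so 'F' comes before 'M' and 'NO' before 'YES'.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (invented values) with the same kinds of columns as the survey.
toy = pd.DataFrame({
    "GENDER": ["M", "F", "M"],
    "LUNG_CANCER": ["YES", "NO", "YES"],
    "SMOKING": [1, 2, 2],
})

mappings = {}  # hypothetical helper: column name -> {original label: code}
for col in toy.columns:
    le = LabelEncoder()
    toy[col] = le.fit_transform(toy[col])
    # classes_ holds the sorted original labels; position = encoded value.
    mappings[col] = {label: code for code, label in enumerate(le.classes_.tolist())}

print(mappings)
```

Keeping the mapping around makes it easier to read later outputs, for example that `LUNG_CANCER == 1` means "YES".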

In [12]:
data

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,1,69,0,1,1,0,0,1,0,1,1,1,1,1,1,1
1,1,74,1,0,0,0,1,1,1,0,0,0,1,1,1,1
2,0,59,0,0,0,1,0,1,0,1,0,1,1,0,1,0
3,1,63,1,1,1,0,0,0,0,0,1,0,0,1,1,0
4,0,63,0,1,0,0,0,0,0,1,0,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,0,59,0,1,1,1,0,0,1,1,0,1,0,1,0,1
280,0,59,1,0,0,0,1,1,1,0,0,0,1,0,0,0
281,1,55,1,0,0,0,0,1,1,0,0,0,1,0,1,0
282,1,46,0,1,1,0,0,0,0,0,0,0,0,1,1,0


In [13]:
print(data[['GENDER', 'LUNG_CANCER']].head())

   GENDER  LUNG_CANCER
0       1            1
1       1            1
2       0            0
3       1            0
4       0            0


In [14]:
data['LUNG_CANCER'].value_counts()

LUNG_CANCER
1    238
0     38
Name: count, dtype: int64

In [15]:
data['GENDER'].value_counts()

GENDER
1    142
0    134
Name: count, dtype: int64

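value_counts() confirms that each encoded column contains only 0 and 1. It also shows that the classes are imbalanced: 238 rows are labeled 1 (YES) and only 38 are labeled 0 (NO).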

## Exploratory Data Analysis (EDA)

## Normalize Data

In [16]:
data.head()

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
0,1,69,0,1,1,0,0,1,0,1,1,1,1,1,1,1
1,1,74,1,0,0,0,1,1,1,0,0,0,1,1,1,1
2,0,59,0,0,0,1,0,1,0,1,0,1,1,0,1,0
3,1,63,1,1,1,0,0,0,0,0,1,0,0,1,1,0
4,0,63,0,1,0,0,0,0,0,1,0,1,1,0,0,0


In [22]:
# Separate the target (LUNG_CANCER) from the features, then hold out 20% of the rows for testing.

y = data["LUNG_CANCER"]
X = data.drop(["LUNG_CANCER"], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
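With only 38 negative rows out of 276, a plain random split can leave the test set with an unrepresentative share of negatives. One common option, shown here as a sketch rather than as the notebook's method, is to pass `stratify=y` to `train_test_split`. The data below is a synthetic stand-in that only reproduces the class counts; the single `AGE` column is a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the notebook's class counts (238 positives, 38 negatives).
y = pd.Series([1] * 238 + [0] * 38, name="LUNG_CANCER")
X = pd.DataFrame({"AGE": np.arange(len(y))})  # placeholder feature, not real data

# stratify=y keeps the class ratio (about 86% positives) in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(y_train), len(y_test))
print(y_train.mean(), y_test.mean())
```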

In [23]:
X = data.iloc[:, 0:15]

In [24]:
y = data.iloc[:, 15]

In [25]:
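from sklearn.preprocessing import StandardScaler

# Note: this standardizes the whole frame, including the LUNG_CANCER target.
# X_scaled is not used by the models below, which are fit on the unscaled
# X_train; tree-based models do not require feature scaling.
scaler = StandardScaler()

X_scaled = scaler.fit_transform(data)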

In [26]:
X_scaled

array([[ 0.97142265,  0.72817582, -1.09108945, ...,  1.06748999,
         0.89006056,  0.39957961],
       [ 0.97142265,  1.32596442,  0.91651514, ...,  1.06748999,
         0.89006056,  0.39957961],
       [-1.02941804, -0.46740138, -1.09108945, ..., -0.93677693,
         0.89006056, -2.5026302 ],
       ...,
       [ 0.97142265, -0.94563226,  0.91651514, ..., -0.93677693,
         0.89006056, -2.5026302 ],
       [ 0.97142265, -2.02165174, -1.09108945, ...,  1.06748999,
         0.89006056, -2.5026302 ],
       [ 0.97142265, -0.34784366, -1.09108945, ...,  1.06748999,
         0.89006056,  0.39957961]])
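If scaled features are needed, for example for a model that is sensitive to feature scale, a common pattern is to fit the scaler on the training features only and reuse it on the test features, so that no test-set statistics leak into training. The sketch below assumes synthetic data; the array shapes and random values are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=60, scale=8, size=(100, 3))  # synthetic features only, no target
y = rng.integers(0, 2, size=100)                # synthetic binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # apply the same mean/std to test rows

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```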

## Random Forest

In [27]:
rf = RandomForestClassifier()  # default hyperparameters; no random_state, so scores can vary between runs

In [28]:
rf.fit(X_train, y_train)

In [29]:
y_pred = rf.predict(X_test)

In [30]:
rf.score(X_test, y_test)

0.8571428571428571

In [31]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.33      0.50        12
           1       0.85      1.00      0.92        44

    accuracy                           0.86        56
   macro avg       0.92      0.67      0.71        56
weighted avg       0.88      0.86      0.83        56
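The report implies a specific confusion matrix. Class 0 has support 12 with recall 0.33 and precision 1.00, so 4 negatives are predicted correctly and 8 are predicted as positive. Class 1 has support 44 with recall 1.00, so all positives are predicted correctly. The sketch below rebuilds label vectors with exactly those counts (the row order is arbitrary and not taken from the notebook) and passes them to `confusion_matrix`, which the notebook imports but does not call.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels reconstructed from the counts implied by the report, not the real test rows.
y_true = np.array([0] * 12 + [1] * 44)
y_hat = np.array([0] * 4 + [1] * 8 + [1] * 44)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
acc = accuracy_score(y_true, y_hat)

print(cm)
print(acc)
```

The matrix makes the weakness visible: the model labels two thirds of the non-cancer cases as cancer, even though overall accuracy is about 0.86.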



In [32]:
features = pd.DataFrame(rf.feature_importances_, index = X.columns)

In [33]:
features.head(15)

Unnamed: 0,0
GENDER,0.042326
AGE,0.216551
SMOKING,0.034216
YELLOW_FINGERS,0.067566
ANXIETY,0.054329
PEER_PRESSURE,0.071494
CHRONIC DISEASE,0.074353
FATIGUE,0.053452
ALLERGY,0.085395
WHEEZING,0.055311


In [34]:
rf2 = RandomForestClassifier(n_estimators = 1000,
                             criterion = 'entropy',
                             min_samples_split = 10,
                             max_depth = 14,
                             random_state = 42
                            )

In [None]:
rf2.fit(X_train, y_train)

In [None]:
rf2.score(X_test, y_test)

In [None]:
y_pred2 = rf2.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred2))
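The cells for `rf2` have not been run, so there is no result to compare against the default model. One way to compare the two configurations, sketched here on synthetic data, is stratified cross-validation scored with balanced accuracy, which is less flattering than plain accuracy on imbalanced classes. The dataset, the fold count, and the reduced `n_estimators=200` (chosen to keep the example fast) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 276 rows, 15 features, roughly 14% negatives.
X, y = make_classification(
    n_samples=276, n_features=15, weights=[0.14, 0.86], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "default": RandomForestClassifier(random_state=42),
    "tuned": RandomForestClassifier(
        n_estimators=200,        # reduced from 1000 for speed (assumption)
        criterion="entropy",
        min_samples_split=10,
        max_depth=14,
        random_state=42,
    ),
}

# Mean balanced accuracy across the five folds for each configuration.
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}

print(scores)
```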