In [177]:
# dataset manipulation
import pandas as pd
# numerical operations
import numpy as np

[Dataset](https://www.kaggle.com/datasets/utkarshx27/smoking-dataset-from-uk) (link to dataset)

This dataset consists of smoking data gathered in the UK and contains a good deal of demographic data along with the type of tobacco consumed. This is my first foray into categorical variables in quite some time, so it will be a good welcome back. I plan to use the `smoke` column as my target variable; it records whether the respondent currently smokes (`Yes`) or not (`No`), which I will treat as **1** and **0**. All the columns left after removing it form my feature space. I may do some feature engineering along the way to help decide which features to keep and which to remove. This is my second health dataset in a row, and I am interested in the trends, as I have a smoker in my family.
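As a minimal sketch of that target/feature split, here is how it could look on a toy frame. The rows below are made up for illustration, `toy_df`, `X`, and `y` are hypothetical names, and the mapping assumes `smoke` only ever holds the strings `Yes` and `No`.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy_df = pd.DataFrame({
    "age": [38, 42, 40],
    "gender": ["Male", "Female", "Male"],
    "smoke": ["No", "Yes", "No"],
})

# Target: map the Yes/No strings to 1/0 (assumes no other values occur).
y = toy_df["smoke"].map({"Yes": 1, "No": 0})

# Feature space: every column except the target.
X = toy_df.drop(columns=["smoke"])

print(y.tolist())
print(list(X.columns))
```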

In [178]:
## loading in dataset
smoking_df = pd.read_csv('smoking.csv', index_col=0)

smoking_df.head()  # Looks like we have some missing values as well, interesting!

Unnamed: 0,gender,age,marital_status,highest_qualification,nationality,ethnicity,gross_income,region,smoke,amt_weekends,amt_weekdays,type
1,Male,38,Divorced,No Qualification,British,White,"2,600 to 5,200",The North,No,,,
2,Female,42,Single,No Qualification,British,White,"Under 2,600",The North,Yes,12.0,12.0,Packets
3,Male,40,Married,Degree,English,White,"28,600 to 36,400",The North,No,,,
4,Female,40,Married,Degree,English,White,"10,400 to 15,600",The North,No,,,
5,Female,39,Married,GCSE/O Level,British,White,"2,600 to 5,200",The North,No,,,


In [179]:
smoking_df.dtypes

gender                    object
age                        int64
marital_status            object
highest_qualification     object
nationality               object
ethnicity                 object
gross_income              object
region                    object
smoke                     object
amt_weekends             float64
amt_weekdays             float64
type                      object
dtype: object

As you can see, most of the dataset is categorical, so we will need to encode those columns. Before we do that, let's check whether any columns contain NaN values. That is our first worry.

In [180]:
contains_nan = {}

for each_column in smoking_df.columns:
    contains_nan[each_column] = smoking_df[each_column].hasnans

contains_nan

{'gender': False,
 'age': False,
 'marital_status': False,
 'highest_qualification': False,
 'nationality': False,
 'ethnicity': False,
 'gross_income': False,
 'region': False,
 'smoke': False,
 'amt_weekends': True,
 'amt_weekdays': True,
 'type': True}

So three columns contain NaN values: `amt_weekends`, `amt_weekdays`, and `type`. That is the first part of the dataset we want to fix. After that, we can begin tackling the categorical variables!
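The same check can also be written without an explicit loop. This is only a sketch on a made-up toy frame (`toy_df` and its values are assumptions, not the real data); `isna().sum()` additionally reports how many values are missing in each column.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real data; values are invented.
toy_df = pd.DataFrame({
    "smoke": ["No", "Yes", "No"],
    "amt_weekdays": [np.nan, 12.0, np.nan],
    "type": [None, "Packets", None],
})

# Number of missing values per column.
nan_counts = toy_df.isna().sum()

# Names of the columns that have at least one missing value.
columns_with_nan = nan_counts[nan_counts > 0].index.tolist()

print(nan_counts)
print(columns_with_nan)
```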

In [181]:
# One small issue: one of the NaN columns (`type`) holds categorical data, not numerical data, so we fill it with the most frequent value instead of a mean.

from sklearn.impute import SimpleImputer

type_imputer = SimpleImputer(strategy='most_frequent')

type_imputer.fit(smoking_df[['type']])

values = type_imputer.transform(smoking_df[['type']])

type_imputed_smoking_df = smoking_df.copy()

type_imputed_smoking_df['type'] = values

# Yay! The missing values have been filled in, so we can move on to the other 2 numerical columns
type_imputed_smoking_df['type'].hasnans

False

In [182]:
# Do the same thing for amt_weekdays, except use the `mean` strategy to impute the missing values

amt_weekdays_imputer = SimpleImputer(strategy='mean')

amt_weekdays_imputer.fit(smoking_df[['amt_weekdays']])

values = amt_weekdays_imputer.transform(smoking_df[['amt_weekdays']])

amt_weekdays_imputed_smoking_df = type_imputed_smoking_df.copy()

amt_weekdays_imputed_smoking_df['amt_weekdays'] = values

amt_weekdays_imputed_smoking_df['amt_weekdays'].hasnans

False

In [183]:
# Do the same thing for the amt_weekends, except use the `mean` strategy to impute the missing values

amt_weekends_imputer = SimpleImputer(strategy='mean')

amt_weekends_imputer.fit(smoking_df[['amt_weekends']])

values = amt_weekends_imputer.transform(smoking_df[['amt_weekends']])

amt_weekends_imputed_smoking_df = amt_weekdays_imputed_smoking_df.copy()

amt_weekends_imputed_smoking_df['amt_weekends'] = values

amt_weekends_imputed_smoking_df['amt_weekends'].hasnans

False
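For comparison, both numerical columns can be mean-filled in one step. Note that this sketch swaps in pandas `fillna` rather than `SimpleImputer`, and it runs on a made-up toy frame (`toy_df`, `numeric_columns`, and the values are assumptions).

```python
import numpy as np
import pandas as pd

# Toy frame; the values are invented for illustration.
toy_df = pd.DataFrame({
    "amt_weekends": [np.nan, 12.0, 20.0],
    "amt_weekdays": [np.nan, 12.0, 18.0],
})

numeric_columns = ["amt_weekends", "amt_weekdays"]

# Column means are computed over the non-missing values only.
column_means = toy_df[numeric_columns].mean()

# Fill each column's gaps with that column's own mean.
imputed = toy_df.copy()
imputed[numeric_columns] = toy_df[numeric_columns].fillna(column_means)

print(imputed)
```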

In [184]:
imputed_smoking_df = amt_weekends_imputed_smoking_df.copy()

contains_nan = {}

for each_column in imputed_smoking_df.columns:
    contains_nan[each_column] = imputed_smoking_df[each_column].hasnans

contains_nan

# We've successfully filled in all NaN values!

{'gender': False,
 'age': False,
 'marital_status': False,
 'highest_qualification': False,
 'nationality': False,
 'ethnicity': False,
 'gross_income': False,
 'region': False,
 'smoke': False,
 'amt_weekends': False,
 'amt_weekdays': False,
 'type': False}

In [185]:
# Now, let's check whether we have any odd standard deviations across our dataset

# Doesn't look too bad at all, no insanely high standard deviations
imputed_smoking_df.head(), imputed_smoking_df.dtypes, imputed_smoking_df.describe()

(   gender  age marital_status highest_qualification nationality ethnicity  \
 1    Male   38       Divorced      No Qualification     British     White   
 2  Female   42         Single      No Qualification     British     White   
 3    Male   40        Married                Degree     English     White   
 4  Female   40        Married                Degree     English     White   
 5  Female   39        Married          GCSE/O Level     British     White   
 
        gross_income     region smoke  amt_weekends  amt_weekdays     type  
 1    2,600 to 5,200  The North    No     16.410926     13.750594  Packets  
 2       Under 2,600  The North   Yes     12.000000     12.000000  Packets  
 3  28,600 to 36,400  The North    No     16.410926     13.750594  Packets  
 4  10,400 to 15,600  The North    No     16.410926     13.750594  Packets  
 5    2,600 to 5,200  The North    No     16.410926     13.750594  Packets  ,
 gender                    object
 age                        int64

Let's look at how to process the `gross_income` column. We should be able to convert it to numerical data, so let's see which values the column contains and come up with a way to parse them.

In [186]:
imputed_smoking_df['gross_income'].unique()

array(['2,600 to 5,200', 'Under 2,600', '28,600 to 36,400',
       '10,400 to 15,600', '15,600 to 20,800', 'Above 36,400',
       '5,200 to 10,400', 'Refused', '20,800 to 28,600', 'Unknown'],
      dtype=object)

In [187]:
gross_income_df = imputed_smoking_df.copy()


def process_gross_income(income_amount: str) -> float:
    """Map an income bracket label to a representative number."""
    income_amount = income_amount.replace(',', '')
    if ' to ' in income_amount:
        # Closed bracket: use the midpoint
        start, end = income_amount.split(' to ')
        return (int(start) + int(end)) / 2
    elif 'Above' in income_amount:
        # Open-ended top bracket: use 1.5 times the lower bound
        above_amount = int(income_amount.split('Above ')[1])
        return above_amount + (above_amount / 2)
    elif 'Under' in income_amount:
        # Open-ended bottom bracket: use half the upper bound
        under_amount = int(income_amount.split('Under ')[1])
        return under_amount / 2
    else:
        # 'Refused' and 'Unknown': placeholder, replaced with the mean below
        return 0.0


def further_process(income_amount: float, mean: float) -> float:
    """Replace the 0.0 placeholder with the column mean."""
    if income_amount == 0:
        return mean
    return income_amount


gross_income_df['gross_income'] = gross_income_df['gross_income'].map(process_gross_income)

# Compute the mean once (note that it includes the 0.0 placeholders)
income_mean = float(np.mean(gross_income_df['gross_income']))

gross_income_df['gross_income'] = gross_income_df['gross_income'].map(
    lambda x: further_process(x, income_mean))

gross_income_df['gross_income'] = np.log(gross_income_df['gross_income'])

imputed_smoking_df['gross_income'] = gross_income_df['gross_income']

gross_income_df['gross_income'].describe()

count    1691.000000
mean        9.205058
std         0.908441
min         7.170120
25%         8.961879
50%         9.472705
75%         9.809177
max        10.907789
Name: gross_income, dtype: float64

We've now processed the gross income column and converted it from categorical to numerical data. Next we must decide which column to preprocess, so let's look at the columns available to us right now.
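To make the mapping concrete, here is the arithmetic for three of the brackets, followed by the natural log that `np.log` applies. It is a small sketch using only the standard library; the bracket labels come from the `unique()` output above, while the `representative` and `log_values` names are just illustrative.

```python
import math

# Representative value chosen for each kind of bracket.
representative = {
    "2,600 to 5,200": (2600 + 5200) / 2,   # midpoint of a closed bracket
    "Under 2,600": 2600 / 2,               # half the upper bound
    "Above 36,400": 36400 + 36400 / 2,     # 1.5 times the lower bound
}

# Natural log of each representative value, rounded like the notebook output.
log_values = {label: round(math.log(value), 6)
              for label, value in representative.items()}

for label, value in log_values.items():
    print(f"{label!r}: {representative[label]} -> {value}")
```

These logs line up with the `min` and `max` in the summary above and with the `gross_income` values that appear for the first rows later on.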

In [188]:
gross_income_df.dtypes

gender                    object
age                        int64
marital_status            object
highest_qualification     object
nationality               object
ethnicity                 object
gross_income             float64
region                    object
smoke                     object
amt_weekends             float64
amt_weekdays             float64
type                      object
dtype: object

We can try to tackle the `region` column now. It is strictly categorical data, and there is no natural way to interpret it as a number. With gross income, a bracket such as `Above 36,400` can be mapped to some representative number. With region, the best we could do is label region x as 0, y as 1, z as 2, and so on, which would imply an ordering that does not exist. One-hot encoding avoids that problem.
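The difference between integer labels and one-hot columns can be sketched on a toy frame. The region names are taken from the dataset, but the four rows and the names `toy_df`, `label_codes`, and `one_hot` are made up; `dtype=int` is passed only so the indicators print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy frame; the rows are invented for illustration.
toy_df = pd.DataFrame({"region": ["The North", "Scotland", "Wales", "Scotland"]})

# Integer labels: compact, but they imply Scotland < The North < Wales.
label_codes = toy_df["region"].astype("category").cat.codes

# One-hot columns: one 0/1 indicator per region, with no implied order.
one_hot = pd.get_dummies(toy_df, columns=["region"], dtype=int)

print(label_codes.tolist())
print(one_hot)
```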

In [191]:
# One-hot encode the region column with pandas and create a new dataframe from it.
# drop_first=True drops the first category so the indicator columns are not redundant.
dummies = pd.get_dummies(imputed_smoking_df, columns=['region'], drop_first=True)

region_imputed_smoking_df = dummies.copy()

region_imputed_smoking_df

Unnamed: 0,gender,age,marital_status,highest_qualification,nationality,ethnicity,gross_income,smoke,amt_weekends,amt_weekdays,type,region_Midlands & East Anglia,region_Scotland,region_South East,region_South West,region_The North,region_Wales
1,Male,38,Divorced,No Qualification,British,White,8.268732,No,16.410926,13.750594,Packets,0,0,0,0,1,0
2,Female,42,Single,No Qualification,British,White,7.170120,Yes,12.000000,12.000000,Packets,0,0,0,0,1,0
3,Male,40,Married,Degree,English,White,10.388995,No,16.410926,13.750594,Packets,0,0,0,0,1,0
4,Female,40,Married,Degree,English,White,9.472705,No,16.410926,13.750594,Packets,0,0,0,0,1,0
5,Female,39,Married,GCSE/O Level,British,White,8.268732,No,16.410926,13.750594,Packets,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1687,Male,22,Single,No Qualification,Scottish,White,8.268732,No,16.410926,13.750594,Packets,0,1,0,0,0,0
1688,Female,49,Divorced,Other/Sub Degree,English,White,8.268732,Yes,20.000000,20.000000,Hand-Rolled,0,1,0,0,0,0
1689,Male,45,Married,Other/Sub Degree,Scottish,White,8.961879,No,16.410926,13.750594,Packets,0,1,0,0,0,0
1690,Female,51,Married,No Qualification,English,White,8.268732,Yes,20.000000,20.000000,Packets,0,1,0,0,0,0
