### `3. Missing Value`

**Definition:** Missing data is the absence of values in certain observations of a variable. It is an unavoidable problem in most data sources and may have a significant impact on the conclusions we derive from the data.

#### **3.1. Why Missing Data Matters**

Two important reasons are:

- certain algorithms cannot work when missing values are present
- even for algorithms that can handle missing data, leaving it untreated can lead to inaccurate conclusions
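A quick way to see why gaps cannot simply be ignored is the small sketch below. It assumes NumPy and a made-up array: a single `NaN` propagates through an ordinary mean, so the missing entry has to be handled explicitly.

```python
import numpy as np

# Hypothetical measurements with one missing entry
values = np.array([1.0, 2.0, np.nan])

plain_mean = values.mean()           # NaN propagates through the computation
nan_aware_mean = np.nanmean(values)  # explicitly skips the missing entry

print(plain_mean, nan_aware_mean)
```

Many estimators behave like the plain mean: they either fail or silently produce `NaN` unless the gaps are treated first.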

A study on the effect of missing data on different ML algorithms can be found here.

**Note:** Some algorithms, such as XGBoost, incorporate missing-data treatment into their model-building process, so you don't need to handle it as a separate step. However, it's important to make sure you understand how the algorithm treats missing values and can explain it to the business team.

#### **3.2. Why is the Data Missing**

The source of missing data can vary. These are just some examples:

- The value was forgotten, lost, or not stored properly.
- The value does not exist.
- The value can't be known or identified.

It's important to understand why data is missing, in other words, the mechanism of missing data (MCAR, MAR, or MNAR). We may process the missing information differently depending on this mechanism. Furthermore, identifying the source of missing data allows us to take steps to regulate that source and reduce the amount of missing data as data collection progresses.

#### **3.3. Missing Mechanism**

It is important to understand the mechanisms by which missing fields are introduced in a dataset. Depending on the mechanism, we may choose to process the missing values differently. The mechanisms were first introduced by Rubin [2].

**Missing Completely at Random**

A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data is MCAR, there is absolutely no relationship between the missingness and any other values, observed or missing, within the dataset. In other words, the missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

If values for observations are missing completely at random, then disregarding those cases would not bias the inferences made.

**Missing at Random**

Missing at Random (MAR) occurs when there is a systematic relationship between the propensity of a value to be missing and the observed data. In other words, the probability of an observation being missing depends only on available information (other variables in the dataset), but not on the variable itself.

For example, if men are more likely to disclose their weight than women, weight is MAR (on variable gender). The weight information will be missing at random for those men and women that decided not to disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

In a situation like the above, if we decide to proceed with the variable with missing values, we might benefit from including gender to control the bias in weight for the missing observations.
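A minimal sketch of that idea, assuming pandas and a made-up table (the column names and values are hypothetical): filling the gaps with the mean of each gender group instead of the overall mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data: weight is missing for one man and one woman
people = pd.DataFrame({
    "gender": ["M", "M", "M", "F", "F", "F"],
    "weight": [80.0, 90.0, np.nan, 60.0, np.nan, 70.0],
})

# Overall mean ignores gender; the group mean conditions on it
overall_mean = people["weight"].mean()
gender_mean = people.groupby("gender")["weight"].transform("mean")

people["weight_overall"] = people["weight"].fillna(overall_mean)
people["weight_by_gender"] = people["weight"].fillna(gender_mean)

print(people)
```

With the overall mean both gaps receive the same value, while the group-wise version gives the man and the woman different, gender-specific estimates.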

**Missing Not At Random - Depends on Unobserved Predictors**

Missingness depends on information that has not been recorded, and this information also predicts the missing values. E.g., if a particular treatment causes discomfort, a patient is more likely to drop out of the study (and 'discomfort' is not measured).

In this situation, the data sample is biased if we drop the cases with missing values.

**Missing Not At Random - Depends on Missing Value Itself**

Missingness depends on the (potentially missing) variable itself. E.g., people with higher earnings are less likely to reveal them.
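The simulation below is a rough sketch of how the three mechanisms affect a naive estimate. It assumes NumPy, and all numbers (group means, missingness rates, the logistic curve) are invented for illustration. It compares the mean of the observed weights with the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic population: gender is always observed, weight may go missing
gender = rng.integers(0, 2, size=n)  # 0 = man, 1 = woman
weight = np.where(gender == 0, 80.0, 65.0) + rng.normal(0, 10, size=n)

# MCAR: the same 30% chance of being missing for everyone
mcar = rng.random(n) < 0.3
# MAR: women withhold their weight more often (depends on observed gender only)
mar = rng.random(n) < np.where(gender == 1, 0.5, 0.1)
# MNAR: heavier people withhold their weight more often (depends on the value itself)
mnar = rng.random(n) < 1 / (1 + np.exp(-(weight - 72.5) / 5))

true_mean = weight.mean()
mean_mcar = weight[~mcar].mean()
mean_mar = weight[~mar].mean()
mean_mnar = weight[~mnar].mean()

# Under MAR, conditioning on the observed variable removes the bias
mean_women_true = weight[gender == 1].mean()
mean_women_mar = weight[(gender == 1) & ~mar].mean()

print(f"true={true_mean:.2f}  MCAR={mean_mcar:.2f}  MAR={mean_mar:.2f}  MNAR={mean_mnar:.2f}")
print(f"women: true={mean_women_true:.2f}  observed under MAR={mean_women_mar:.2f}")
```

In this setup, dropping missing cases leaves the MCAR estimate close to the truth, shifts the MAR estimate unless gender is taken into account, and biases the MNAR estimate in a way the observed data alone cannot correct.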

#### **3.4. How to Assume a Missing Mechanism**

**By business understanding.** In many situations we can infer the mechanism by probing into the business logic behind that variable.

Put more simply, when you have missing data:

- Do you know why it is missing, and have you measured the cause? If yes, it's MAR.
- Do you know it's missing purely by chance? If yes, it's MCAR.
- Do you have no idea why it's missing? Or do you know why, but didn't measure the cause? If yes, it's MNAR.

**By statistical test.** Divide the dataset into the observations with and without missing values, then perform a t-test on another variable to see whether the two groups differ significantly. If they do, we can assume that the data is not missing completely at random.
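The test described above can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic, the column names `age` and `income` are hypothetical, and the Welch t-statistic is computed by hand with NumPy, using |t| > 2 as a rough rule of thumb rather than an exact p-value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2_000

# Synthetic data: income goes missing more often for people older than 45
age = rng.normal(40, 10, size=n)
income = 1_000 + 50 * age + rng.normal(0, 200, size=n)
missing = rng.random(n) < np.where(age > 45, 0.6, 0.1)
df_demo = pd.DataFrame({"age": age, "income": np.where(missing, np.nan, income)})

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    standard_error = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / standard_error

# Compare the observed variable (age) between rows with and without income
is_missing = df_demo["income"].isna()
t_stat = welch_t(df_demo.loc[is_missing, "age"], df_demo.loc[~is_missing, "age"])

print(f"t-statistic: {t_stat:.2f}")
print("evidence against MCAR" if abs(t_stat) > 2 else "no evidence against MCAR")
```

A large statistic only rules out MCAR; it cannot distinguish MAR from MNAR, because that would require the unobserved values.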

**Note:** We can hardly ever be 100% sure whether data is MCAR, MAR, or MNAR, because unobserved predictors (lurking variables) are, by definition, unobserved.

#### **3.5. How to Handle Missing Data**

| **Method** | **Definition** |
| --- | --- |
| Mean/Median/Mode Imputation | Missing values are imputed with the mean or median (numerical features) or the most frequent value (categorical features) of that variable |
| KNN Imputation | Missing values are estimated as the average value of the K nearest neighbours |
| Multivariate Imputation | Missing values are imputed based on the observed values for a given individual and the relations observed in the data for other participants |

In [1]:
import pandas as pd 
import numpy as np  

#### **3.6. Applying the Missing Data Handling**

> SimpleImputer

- `SimpleImputer` works for numerical and categorical variables.
- It works like `fillna`: we can impute the mean, median, mode, or a constant.
- It is univariate: each column is imputed using only that column's own values.
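To make the `fillna` comparison concrete, here is a minimal pandas-only sketch on a small made-up frame. It is an equivalent written by hand, not how `SimpleImputer` is implemented.

```python
import numpy as np
import pandas as pd

# Small made-up frame with one gap per column
toy = pd.DataFrame({
    'x1': [4, 5, np.nan, 6, 7, 9],
    'x2': [3, 5, 6, 5, np.nan, 5],
    'x4': ['A', 'A', 'C', 'C', 'D', np.nan],
})

filled = toy.copy()
filled['x1'] = toy['x1'].fillna(toy['x1'].mean())    # mean imputation
filled['x2'] = toy['x2'].fillna(toy['x2'].median())  # median imputation
# mode() returns every tied value in sorted order; take the first one
filled['x4'] = toy['x4'].fillna(toy['x4'].mode()[0])

print(filled)
```

The practical difference is that an imputer object remembers the statistics learned during `fit`, so the same fill values can be reused on new data.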

In [2]:
df = pd.DataFrame({
    'x1' : [4, 5, np.nan, 6, 7, 9],
    'x2' : [3, 5, 6, 5, np.nan, 5],
    'x3' : [10, 11, 12, 9, 8, 11],
    'x4' : ['A', 'A', 'C', 'C', 'D', np.nan],
    'x5' : ['X', 'Y', 'X', 'X', np.nan, 'Y'],
    'x6' : ['M', 'M', np.nan, 'M', 'N', np.nan],
})

df

Unnamed: 0,x1,x2,x3,x4,x5,x6
0,4.0,3.0,10,A,X,M
1,5.0,5.0,11,A,Y,M
2,,6.0,12,C,X,
3,6.0,5.0,9,C,X,M
4,7.0,,8,D,,N
5,9.0,5.0,11,,Y,


**Numerical Variable: Mean**

In [3]:
from sklearn.impute import SimpleImputer

In [4]:
imp_mean = SimpleImputer(strategy='mean')

imp_mean.fit(df[['x1']])

df['x1'] = imp_mean.transform(df[['x1']])

In [5]:
df[['x1']]

Unnamed: 0,x1
0,4.0
1,5.0
2,6.2
3,6.0
4,7.0
5,9.0


**Numerical Variable: Median**

In [6]:
imp_median = SimpleImputer(strategy='median')

imp_median.fit(df[['x2']])

df['x2'] = imp_median.transform(df[['x2']])

In [7]:
df[['x2']]

Unnamed: 0,x2
0,3.0
1,5.0
2,6.0
3,5.0
4,5.0
5,5.0


**Categorical Variable: Mode**

In [8]:
# Cast column 'x4' to a single, consistent dtype.
# Caution: astype(str) turns NaN into the literal string 'nan', which the
# imputer no longer treats as missing, so the gap in row 5 is not filled here.
df['x4'] = df['x4'].astype(str)

# Use SimpleImputer with the 'most_frequent' strategy
imp_mod = SimpleImputer(strategy='most_frequent')

# Fit and transform column 'x4'
df[['x4']] = imp_mod.fit_transform(df[['x4']])

In [9]:
df[['x4']]

Unnamed: 0,x4
0,A
1,A
2,C
3,C
4,D
5,


**Categorical Variable: Constant**

In [11]:
# One fill value per column: 'Other' for x5 and 'Unknown' for x6 (see the output below)
imp_constant = SimpleImputer(strategy='constant', fill_value=['Other', 'Unknown'])

df[['x5', 'x6']] = imp_constant.fit_transform(df[['x5', 'x6']])

df

Unnamed: 0,x1,x2,x3,x4,x5,x6
0,4.0,3.0,10,A,X,M
1,5.0,5.0,11,A,Y,M
2,6.2,6.0,12,C,X,Unknown
3,6.0,5.0,9,C,X,M
4,7.0,5.0,8,D,Other,N
5,9.0,5.0,11,,Y,Unknown



**Using Column Transformer**

In [27]:
df = pd.DataFrame({
    'x1' : [4, 5, np.nan, 6, 7, 9],
    'x2' : [3, 5, 6, 5, np.nan, 5],
    'x3' : [10, 11, 12, 9, 8, 11],
    'x4' : ['A', 'A', 'C', 'C', 'D', np.nan],
    'x5' : ['X', 'Y', 'X', 'X', np.nan, 'Y'],
    'x6' : ['M', 'M', np.nan, 'M', 'N', np.nan],
})

df

Unnamed: 0,x1,x2,x3,x4,x5,x6
0,4.0,3.0,10,A,X,M
1,5.0,5.0,11,A,Y,M
2,,6.0,12,C,X,
3,6.0,5.0,9,C,X,M
4,7.0,,8,D,,N
5,9.0,5.0,11,,Y,


In [31]:
from sklearn.compose import ColumnTransformer

# Each entry is (name, transformer, columns); unlisted columns are passed through
transformer = ColumnTransformer([
    ('imp_mean', SimpleImputer(strategy='mean'), ['x1']),
    ('imp_median', SimpleImputer(strategy='median'), ['x2']),
    ('imp_mode', SimpleImputer(strategy='most_frequent'), ['x4']),
    ('imp_constant', SimpleImputer(strategy='constant', fill_value='Other'), ['x5', 'x6']),
], remainder='passthrough')

transformer.fit(df)

# Output columns follow the transformer order (x1, x2, x4, x5, x6);
# the passthrough column x3 comes last
transformer.transform(df)

array([[4.0, 3.0, 'A', 'X', 'M', 10],
       [5.0, 5.0, 'A', 'Y', 'M', 11],
       [6.2, 6.0, 'C', 'X', 'Other', 12],
       [6.0, 5.0, 'C', 'X', 'M', 9],
       [7.0, 5.0, 'D', 'Other', 'N', 8],
       [9.0, 5.0, 'A', 'Y', 'Other', 11]], dtype=object)

In [32]:
df

Unnamed: 0,x1,x2,x3,x4,x5,x6
0,4.0,3.0,10,A,X,M
1,5.0,5.0,11,A,Y,M
2,,6.0,12,C,X,
3,6.0,5.0,9,C,X,M
4,7.0,,8,D,,N
5,9.0,5.0,11,,Y,


> IterativeImputer

- `IterativeImputer` works for numerical variables only.
- It is multivariate: it fills missing values based on other variables.
- It works using regression.
- It can fill missing values in several columns simultaneously.
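The regression idea can be sketched in a few lines of NumPy. This is a deliberately simplified, single-pass illustration on made-up data: one column is regressed on another using the complete rows, and the fitted line predicts the gaps. The library's imputer repeats this kind of step across columns over several rounds and uses its own default estimator, so treat this only as intuition.

```python
import numpy as np

# Hypothetical data: y follows y = 2x + 1 wherever it is observed
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, np.nan, 9.0, np.nan])

observed = ~np.isnan(y)

# Fit a straight line on the complete rows only
slope, intercept = np.polyfit(x[observed], y[observed], deg=1)

# Predict the missing entries from the fitted line
y_imputed = y.copy()
y_imputed[~observed] = slope * x[~observed] + intercept

print(y_imputed)
```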

In [37]:
df = pd.DataFrame({
    'x1' : [4.3, 5.1, np.nan, 6.3, 7.4, 9.1],
    'x2' : [2.9, 5.1, 6.3, 4.9, np.nan, 5.4],
    'x3' : [9, 11.1, np.nan, 8.9, 9.1, 11],
    'x4' : ['A', 'A', 'C', 'C', 'D', 'D'],
})

df

Unnamed: 0,x1,x2,x3,x4
0,4.3,2.9,9.0,A
1,5.1,5.1,11.1,A
2,,6.3,,C
3,6.3,4.9,8.9,C
4,7.4,,9.1,D
5,9.1,5.4,11.0,D


In [39]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp_iterative = IterativeImputer(random_state=0)
imp_iterative.fit_transform(df[['x1','x2','x3']])


array([[ 4.3       ,  2.9       ,  9.        ],
       [ 5.1       ,  5.1       , 11.1       ],
       [ 7.18362974,  6.3       ,  9.82338941],
       [ 6.3       ,  4.9       ,  8.9       ],
       [ 7.4       ,  5.07386586,  9.1       ],
       [ 9.1       ,  5.4       , 11.        ]])

> KNNImputer

- `KNNImputer` (nearest-neighbour imputer) works for numerical variables only.
- It works using the nearest neighbours.
- It can fill missing values in several columns simultaneously.
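A hand-rolled sketch of the nearest-neighbour idea, assuming NumPy and made-up numbers: distances are computed on the features the incomplete row does have, and the gap is filled with the average of the `k` closest complete rows. The library version additionally handles donors that themselves contain gaps, which this sketch does not.

```python
import numpy as np

# Complete rows act as potential donors (hypothetical values)
donors = np.array([
    [1.0, 2.0, 10.0],
    [2.0, 3.0, 20.0],
    [8.0, 9.0, 80.0],
    [9.0, 8.0, 90.0],
])
target = np.array([1.5, 2.5, np.nan])  # the last feature is missing

observed = ~np.isnan(target)

# Euclidean distance computed on the observed features only
distances = np.sqrt(((donors[:, observed] - target[observed]) ** 2).sum(axis=1))

# Average the missing feature over the k closest donors
k = 2
nearest = np.argsort(distances)[:k]
imputed_value = donors[nearest][:, ~observed].mean()

print(imputed_value)
```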

In [40]:
df = pd.DataFrame({
    'x1' : [4.3, 5.1, np.nan, 6.3, 7.4, 9.1],
    'x2' : [2.9, 5.1, 6.3, 4.9, np.nan, 5.4],
    'x3' : [9, 11.1, np.nan, 8.9, 9.1, 11],
    'x4' : ['A', 'A', 'C', 'C', 'D', 'D'],
})

df

Unnamed: 0,x1,x2,x3,x4
0,4.3,2.9,9.0,A
1,5.1,5.1,11.1,A
2,,6.3,,C
3,6.3,4.9,8.9,C
4,7.4,,9.1,D
5,9.1,5.4,11.0,D


In [46]:
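from sklearn.impute import KNNImputer

# n_neighbors=1 reproduces the output shown below: each gap copies the value of
# its single nearest neighbour. With n_neighbors=2 the two nearest values would
# be averaged instead.
imputer_knn = KNNImputer(n_neighbors=1, weights='uniform')
imputer_knn.fit_transform(df[['x1','x2','x3']])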

array([[ 4.3,  2.9,  9. ],
       [ 5.1,  5.1, 11.1],
       [ 9.1,  6.3, 11. ],
       [ 6.3,  4.9,  8.9],
       [ 7.4,  4.9,  9.1],
       [ 9.1,  5.4, 11. ]])

___

### `4. Outlier`

**Definition:** An outlier is a data point that is significantly different from the remaining data.

#### **4.1. Why Outliers Matter**

The presence of outliers may:

- introduce noise into the dataset
- make the sample less representative
- prevent algorithms from working properly

Some algorithms are very sensitive to outliers. For example, AdaBoost may treat outliers as 'hard' cases and put tremendous weight on them, producing a model that generalizes poorly. Any algorithm that relies on means or variances is sensitive to outliers, since those statistics are greatly influenced by extreme values.
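A small numeric illustration of that sensitivity, assuming NumPy and made-up values: one extreme point drags the mean and standard deviation far away, while the median barely moves.

```python
import numpy as np

clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.append(clean, 1000.0)  # add a single extreme value

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:>12}: mean={values.mean():.2f}  "
          f"std={values.std(ddof=1):.2f}  median={np.median(values):.2f}")
```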

On the other hand, some algorithms are more robust to outliers. For example, decision trees tend to ignore the presence of outliers when creating their branches. Typically, trees make splits by asking whether variable x >= value t, so an outlier simply falls on one side of the split and is treated the same as the remaining values on that side, regardless of its magnitude.

#### **4.2. Should Outliers be Removed**

Depending on the context, outliers either deserve special attention or should be ignored. Take the example of revenue forecasting: if unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. In the same way, an unusual transaction on a credit card might be a sign of fraudulent activity, which is what the credit card issuer wants to prevent. So, in instances like these, it is useful to look for and further investigate the outlier values.

If outliers are, however, introduced by mechanical or measurement error, it is a good idea to remove them before training the model. Why? Because some algorithms are sensitive to outliers.

#### **4.3. Outlier Detection**

Outlier analysis and anomaly detection form a huge field of research. All the methods listed here are for univariate outlier detection.

| **Method** | **Definition** |
| --- | --- |
| Detect by arbitrary boundary | Identify outliers based on arbitrary boundaries |
| Mean & Standard Deviation method | Flag values that lie more than a chosen number of standard deviations (e.g. 3) from the mean |
| IQR method | Flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (interquartile range rule) |
| MAD method | Flag values whose modified z-score, based on the median and the median absolute deviation, exceeds a threshold |

However, beyond these methods, it's more important to keep in mind that the business context should govern how you define and react to these outliers. The meanings of your findings should be dictated by the underlying context, rather than the number itself.

#### **4.4. How to Handle Outliers**

There are many strategies for dealing with outliers in data, and depending on the context and data set, any could be the right or the wrong way. It’s important to investigate the nature of the outlier before deciding.

| **Method** | **Definition** |
| --- | --- |
| Mean/Median/Mode Imputation | Replace the outlier with the mean, median, or most frequent value of that variable |
| Discretization | Transform continuous variables into discrete variables |
| Imputation with arbitrary value | Replace outliers with an arbitrary value |
| Winsorization | Top-coding and bottom-coding: cap the maximum of a distribution at an arbitrarily set value, and likewise the minimum |
| Discard outliers | Drop all observations that are outliers |
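As a sketch of the discretization row, assuming pandas, the snippet below bins a small numeric series into three equal-frequency groups with `pd.qcut`. The labels `low`, `mid`, and `high` are arbitrary names chosen for this example. After binning, an extreme value is just another member of the top bin.

```python
import pandas as pd

values = pd.Series([1, 11, 12, 13, 14, 15, 10, 22, 30])

# Equal-frequency binning into three groups; labels are illustrative
binned = pd.qcut(values, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"value": values, "bin": binned}))
```

The magnitude of the extreme values is lost on purpose here, which is the trade-off of this approach.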

**Note:** A detailed guide to winsorization can be found here.


#### **4.5. Applying the Outliers Detection**

In [47]:
data = pd.DataFrame(np.array([1, 11, 12, 13, 14, 15, 10, 22, 30]))

> IQR Method

In [49]:
Q1,Q3 = np.percentile(data,[25,75])

IQR = Q3-Q1

upper_bound = Q3 + 1.5*IQR
lower_bound = Q1 - 1.5*IQR

upper_bound,lower_bound

(21.0, 5.0)

In [58]:
#data[(data[0]<lower_bound) | (data[0]>upper_bound)]  # outliers
data[(data[0]>lower_bound) & (data[0]<upper_bound)]  # non-outliers

Unnamed: 0,0
1,11
2,12
3,13
4,14
5,15
6,10


> Mean & Std Method

In [52]:
avg = data[0].mean()
std = data[0].std()

upper_bound = avg + 3*std
lower_bound = avg - 3*std

upper_bound,lower_bound

(38.398657029695386, -9.954212585250943)

In [55]:
len(data[(data[0]<lower_bound) | (data[0]>upper_bound)])  # no outliers for the data above

0

> MAD Method

In [62]:
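median = data[0].median()
# median absolute deviation (MAD)
mad = np.median([np.abs(x-median) for x in data[0]])
# modified z-score; 0.6745 rescales the MAD so it is comparable to a
# standard deviation for normally distributed data
modified_z_score = pd.Series([0.6745*(x-median)/mad for x in data[0]])
data[np.abs(modified_z_score)>3]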

Unnamed: 0,0
0,1
7,22
8,30


> Winsorizing

In [64]:
data = pd.DataFrame(np.array([1, 11, 12, 13, 14, 15, 10, 22, 30]))

In [65]:
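from scipy.stats.mstats import winsorize

# limits are the fractions to cap at the lower and upper tails; with only
# 9 values, 10% rounds down to 0 observations, so the output is unchanged
winsorize(data[0],limits=[0.1,0.1])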

masked_array(data=[ 1, 11, 12, 13, 14, 15, 10, 22, 30],
             mask=False,
       fill_value=999999)
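For comparison, here is a minimal sketch of percentile-based capping with `np.clip`, assuming NumPy. This is a variant of the idea, not what `scipy.stats.mstats.winsorize` does internally: it caps values at interpolated 10th and 90th percentiles, so even a 9-point sample is affected.

```python
import numpy as np

values = np.array([1, 11, 12, 13, 14, 15, 10, 22, 30], dtype=float)

# Interpolated 10th and 90th percentiles serve as the caps
lower_cap, upper_cap = np.percentile(values, [10, 90])

# Values below the lower cap or above the upper cap are pulled to the cap
capped = np.clip(values, lower_cap, upper_cap)

print(lower_cap, upper_cap)
print(capped)
```

Only the smallest and the largest value change; everything between the caps is left as it was.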

___

### `Application`

`Load Dataset`



In [66]:
import seaborn as sns

data = sns.load_dataset('titanic')
df = data[['sex', 'age', 'parch', 'fare', 'class', 'embark_town', 'alone', 'survived']].copy()

print('Number of rows and columns:', df.shape)
df.head()

Number of rows and columns: (891, 8)


Unnamed: 0,sex,age,parch,fare,class,embark_town,alone,survived
0,male,22.0,0,7.25,Third,Southampton,False,0
1,female,38.0,0,71.2833,First,Cherbourg,False,1
2,female,26.0,0,7.925,Third,Southampton,True,1
3,female,35.0,0,53.1,First,Southampton,False,1
4,male,35.0,0,8.05,Third,Southampton,True,0


`Data Cleaning`

**Duplicated Values**: detect and quantify duplicated data.

In [69]:
print(f'number of duplicated rows: {df.duplicated().sum()}')
print(f'percentage of duplicated rows: {df.duplicated().sum() / len(df)*100:.2f}%')

number of duplicated rows: 111
percentage of duplicated rows: 12.46%


About 12.46% of the rows are flagged as duplicates, so we keep only one copy of each.

Handling the duplicated data:

In [70]:
df.drop_duplicates(keep='first',inplace=True,ignore_index=True)

In [71]:
df.duplicated().sum()

0

**Missing Values**: detect and quantify missing values.

In [73]:
df.isna().sum()

sex              0
age            104
parch            0
fare             0
class            0
embark_town      2
alone            0
survived         0
dtype: int64

There are missing values in the `age` and `embark_town` columns.

**Outliers**: an outlier is a data point that is significantly different from the rest of the data. Linear models, linear regression in particular, are sensitive to outliers, so we need to detect and handle them.

First, we detect and quantify outliers univariately.

In [75]:
numerical = ['age','fare']

In [78]:
def calculate_outliers(data, column):
    """Return the rows outside the 1.5*IQR fences, plus the lower and upper fence."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

outlier_data = []

# Note: `data` is the original 891-row dataset loaded above, not the deduplicated
# `df`, so the counts and percentages below are relative to 891 rows.
for column in numerical:
    outlier_result, lower_bound, upper_bound = calculate_outliers(data, column)
    total_outlier = len(outlier_result)
    outlier_percentage = round(total_outlier / len(data[column]) * 100, 2)
    outlier_data.append([column, total_outlier, outlier_percentage, lower_bound, upper_bound])

outlier_df = pd.DataFrame(outlier_data, columns=["Column", "Total Outliers", "Percentage (%)", "Lower Bound", "Upper Bound"])
outlier_df

Unnamed: 0,Column,Total Outliers,Percentage (%),Lower Bound,Upper Bound
0,age,11,1.23,-6.6875,64.8125
1,fare,116,13.02,-26.724,65.6344


Outliers make up about 1.23% of the `age` column and 13.02% of the `fare` column.

`Data Splitting`

Split the dataset into train and test sets with an 80:20 ratio.

In [80]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='survived')
y = df['survived']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = .2,random_state = 0, stratify=y)
X_train.shape,X_test.shape

((624, 7), (156, 7))

In [82]:
y.value_counts(normalize=True)

survived
0    0.587179
1    0.412821
Name: proportion, dtype: float64

In [83]:
y_train.value_counts(normalize=True)

survived
0    0.586538
1    0.413462
Name: proportion, dtype: float64

`Data Preprocessing`

Next, we apply the following preprocessing steps:

Imputation:
- SimpleImputer (median): `age`
- SimpleImputer (mode): `embark_town`

Encoding:
- One Hot Encoding: `sex, embark_town, alone`
- Ordinal Encoding: `class`
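To illustrate the difference between the two encodings, here is a pandas-only sketch on a made-up three-row frame. It is a hand-written stand-in for the encoder classes, and the numeric order assigned to `class` is an assumption for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "class": ["Third", "First", "Second"],
})

# Ordinal encoding: categories have a natural order, so map them to ranks
class_order = {"Third": 1, "Second": 2, "First": 3}
toy["class_encoded"] = toy["class"].map(class_order)

# One-hot encoding: no order, so create indicator columns
# (drop_first=True drops one level to avoid a redundant column)
sex_dummies = pd.get_dummies(toy["sex"], prefix="sex", drop_first=True)

print(pd.concat([toy, sex_dummies], axis=1))
```

Dropping one indicator level matters for linear models with an intercept, where keeping all levels would make the columns perfectly collinear.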

In [85]:
from sklearn.preprocessing import OneHotEncoder
from category_encoders import OrdinalEncoder
from sklearn.pipeline import Pipeline


In [92]:
df

Unnamed: 0,sex,age,parch,fare,class,embark_town,alone,survived
0,male,22.0,0,7.2500,Third,Southampton,False,0
1,female,38.0,0,71.2833,First,Cherbourg,False,1
2,female,26.0,0,7.9250,Third,Southampton,True,1
3,female,35.0,0,53.1000,First,Southampton,False,1
4,male,35.0,0,8.0500,Third,Southampton,True,0
...,...,...,...,...,...,...,...,...
775,female,39.0,5,29.1250,Third,Queenstown,False,0
776,female,19.0,0,30.0000,First,Southampton,True,1
777,female,,2,23.4500,Third,Southampton,False,0
778,male,26.0,0,30.0000,First,Cherbourg,True,1


In [89]:
pipe_mode_onehot = Pipeline([  # steps run in sequence
    ('mode_imputer',SimpleImputer(strategy='most_frequent')),
    ('ohe',OneHotEncoder(drop='first'))
])

In [90]:
pipe_mode_onehot

In [86]:
ordinal_mapping = [
    {'col':'class','mapping':{None:0,'First':3,'Second':2,'Third':1}}
]

In [91]:
transformer = ColumnTransformer([  # transformers are applied side by side, each to its own columns
    ('median_impute',SimpleImputer(strategy='median'),['age']),
    ('impute_ohe',pipe_mode_onehot,['embark_town']),
    ('ohe',OneHotEncoder(drop='first'),['sex','alone']),
    ('ce',OrdinalEncoder(cols='class',mapping=ordinal_mapping),['class']),
],remainder='passthrough')

transformer

In [96]:
X_train_prep = transformer.fit_transform(X_train)
X_test_prep = transformer.transform(X_test)

### `Modeling`

In [100]:
import statsmodels.api as sm

In [101]:
logreg = sm.Logit(y_train,sm.add_constant(X_train_prep))

logreg_result = logreg.fit()

Optimization terminated successfully.
         Current function value: 0.486682
         Iterations 6


In [107]:
print(logreg_result.summary())

                           Logit Regression Results                           
Dep. Variable:               survived   No. Observations:                  624
Model:                          Logit   Df Residuals:                      615
Method:                           MLE   Df Model:                            8
Date:                Tue, 27 Aug 2024   Pseudo R-squ.:                  0.2823
Time:                        21:04:03   Log-Likelihood:                -303.69
converged:                       True   LL-Null:                       -423.13
Covariance Type:            nonrobust   LLR p-value:                 3.906e-47
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6781      0.423      1.603      0.109      -0.151       1.507
x1            -0.0367      0.008     -4.386      0.000      -0.053      -0.020
x2            -0.2119      0.456     -0.465      0.6

Prediction on the test set.

In [105]:
from sklearn.metrics import accuracy_score

y_pred_proba =  logreg_result.predict(sm.add_constant(X_test_prep))

y_pred_class = np.where(y_pred_proba>.5,1,0)

print(f'model accuracy : {accuracy_score(y_test,y_pred_class)*100:.2f}%')

model accuracy : 82.05%


In [109]:
print(f'Out of {len(y_test)} passengers, the model predicted {accuracy_score(y_test,y_pred_class)*len(y_test)} correctly')

Out of 156 passengers, the model predicted 128.0 correctly
