# **`Machine Learning - Regression Case`**

As an MES Data Analyst at PT. Shimano Batam, you have been asked by a client to produce an analysis report consisting of a dashboard and a simple machine learning model that predicts the NC (Non-Conformance / Unacceptable) percentage of the manufactured products.

Based on this case, the task is as follows:

`Regression Scenario`
    
- Predict a numeric value, here 'NC %' (the percentage of non-conforming units).
    
- Feature columns: all columns except 'NC %', including the categorical columns once they have been encoded.

- Target column: 'NC %'

## **Import Libraries**

In [1]:
# basic - EDA
import pandas as pd 
import numpy as np 
import klib
import seaborn as sns 
import matplotlib.pyplot as plt  

# model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# other
import warnings 
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 30)

## **Load Dataset**

In [2]:
# preview dataset
raw = pd.read_excel(r'D:\Project\Factory_Prediction\Data.xlsx')
raw.head(15).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
Years,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014
Month,Jan,Jan,Jan,Jan,Jan,Jan,Jan,Jan,Jan,Jan,Feb,Feb,Feb,Feb,Feb
Week,2nd,2nd,2nd,2nd,3rd,3rd,3rd,3rd,4th,4th,1st,1st,1st,2nd,2nd
FindingArea,FG,FG,OTC,FG,FG,FG,FG,FG,FG,FG,FG,FG,FG,FG,FG
Factory,ABC,ABC,ABC,ABC,ABC,ABC,ABC,ABC,HCL,ABC,ABC,ABC,ABC,ABC,ABC
GroupingFactory,ABC,ABC,ABC,ABC,ABC,ABC,ABC,ABC,SUBCON,ABC,ABC,ABC,ABC,ABC,ABC
DepartmentResponsible,Mr. Februari,Mr. Februari,Mr. Maret,Mr. April,Mr. April,Mr. Mei,Mr. Januari,Mr. Januari,Mr. Dafit,Mr. Januari,Mr. Januari,Mr. Januari,Mr. Januari,Mr. Januari,Mr. June
Product,SG,SM,FD,BB,BB,RD,SL,SL,CS,SL,SL,SL,SL,SL,DH
MainModel,SG 3C41,SM-CJ8S20,FD-R2030,BB-ES300,BB-UN26K,RD M310,ST-EF41,ST-EF41,CS-HG201,SL-MT500,SL-M3100,ST-EF65,ST-EF41,ST-EF65,DH-C30003N
Model/Production Code,SG 3C41 (168mm) (LH non-turn) \n235U7010326,SM-CJ8S20 Unit Set GP1\n30015685,FD-R2030 (BRAZED ON) OTC (Packing)\n22B22400037,BB-ES300 113 73\n21V90003056,BB-UN26K LL123 68\n21S1D120356,RD M310 OTC (Packing)\n25W87200237,ST-EF41 (F) 3 SPEED\n26UC2000056,ST-EF41 (F) 3 SPEED\n26UC2000056,CS-HG201 BO\n40050112,SL-MT500-IL\n20L72001056,SL-M3100-2L\n20LJ2000066,ST-EF65-2A (F) 3 SPEED\n26UG2001256,ST-EF41 (F) 3 SPEED\n26UC2000056,ST-EF65-2A (F) 3 SPEED\n26UG2001256,DH-C30003N (Nut Type)\n22AV8001126


In [3]:
# show general information about the dataset
# (info() prints directly; wrapping it in print() would also print "None")
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   Years                  287 non-null    int64         
 1   Month                  287 non-null    object        
 2   Week                   287 non-null    object        
 3   FindingArea            287 non-null    object        
 4   Factory                287 non-null    object        
 5   GroupingFactory        287 non-null    object        
 6   DepartmentResponsible  287 non-null    object        
 7   Product                287 non-null    object        
 8   MainModel              287 non-null    object        
 9   Model/Production Code  287 non-null    object        
 10  QTY                    286 non-null    object        
 11  SamplingChek           287 non-null    int64         
 12  SamplingNC             287 non-null    int64         
 13  NC % 

In [4]:
# view basic descriptive statistics of the dataset
raw.describe()

Unnamed: 0,Years,SamplingChek,SamplingNC,NC %,DateProduce,TimeProduce
count,287.0,287.0,287.0,287.0,171,171.0
mean,2016.020906,59.195122,3.121951,0.1889,2018-09-24 19:22:06.315789568,11.123977
min,2014.0,1.0,1.0,0.001667,2016-12-31 00:00:00,0.05
25%,2014.0,20.0,1.0,0.033333,2017-11-08 00:00:00,8.31
50%,2016.0,25.0,1.0,0.05,2018-12-13 00:00:00,10.31
75%,2018.0,50.0,2.0,0.2,2019-05-15 00:00:00,14.235
max,2018.0,3000.0,53.0,1.0,2024-04-22 00:00:00,23.56
std,1.584314,224.852423,5.897947,0.285212,,5.152131


In [5]:
# show duplicate rows in the dataset
duplicates = raw[raw.duplicated()]
print("Number of duplicate rows:", duplicates.shape[0])
print(duplicates)

Number of duplicate rows: 0
Empty DataFrame
Columns: [Years, Month, Week, FindingArea , Factory, GroupingFactory, DepartmentResponsible, Product, MainModel, Model/Production Code, QTY, SamplingChek, SamplingNC, NC %, NCDescription, TypeOfNC, Factor, GroupingFactor, LeaderName, DateProduce, TimeProduce, Shift]
Index: []


In [6]:
# collect the unique values of every column
def display_unique_values(df):
    unique_values = {}
    for column in df.columns:
        unique_values[column] = df[column].unique()
    return unique_values

unique_values = display_unique_values(raw)

# print the results
for column, values in unique_values.items():
    print(f"Unique values in column '{column}':\n{values}\n")

Unique values in column 'Years':
[2014 2015 2016 2017 2018]

Unique values in column 'Month':
['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']

Unique values in column 'Week':
['2nd' '3rd' '4th' '1st' '1s']

Unique values in column 'FindingArea ':
['FG' 'OTC' 'SL' 'SM' 'RD' 'CQ' 'FD']

Unique values in column 'Factory':
['ABC' 'HCL' 'SIP' 'SOLO']

Unique values in column 'GroupingFactory':
['ABC' 'SUBCON']

Unique values in column 'DepartmentResponsible':
['Mr. Februari' 'Mr. Maret' 'Mr. April' 'Mr. Mei' 'Mr. Januari'
 'Mr. Dafit' 'Mr. June' 'Mr. July' 'Mr. Zulkifli' 'Mr. Gorga'
 'Mr. Hardiansyah' 'Mr. Hariyanto' 'Mr. Nirwan' 'Mr. Agustus'
 'Mr. September' 'Mr. Angga' 'Mr. Didik' 'Mr. Lufdi' 'Mr. Rachmat FD']

Unique values in column 'Product':
['SG' 'SM' 'FD' 'BB' 'RD' 'SL' 'CS' 'DH' 'SC' 'RT']

Unique values in column 'MainModel':
['SG 3C41' 'SM-CJ8S20' 'FD-R2030' 'BB-ES300' 'BB-UN26K' 'RD M310'
 'ST-EF41' 'CS-HG201' 'SL-MT500' 'SL-M3100' 'ST-EF65' 'DH-C30003

## **Data Distribution**

In [None]:
# Numeric variables: histogram with a KDE overlay
numeric_columns = raw.select_dtypes(include=['int64', 'float64']).columns
for col in numeric_columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(raw[col], kde=True)
    plt.title(f'Distribution of {col}')
    plt.show()

# Categorical variables: bar chart of value counts
categorical_columns = raw.select_dtypes(include=['object']).columns
for col in categorical_columns:
    plt.figure(figsize=(10, 6))
    raw[col].value_counts().plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.xticks(rotation=45)
    plt.show()

## **Preprocessing**

### a. Check Missing Values

In [None]:
# count missing values per column
missing_values = raw.isnull().sum()

# display the counts
missing_values

In [None]:
# share of missing entries in the most affected columns: 116 of 287 rows, in percent
missing_num = 116 / 287 * 100
missing_num

The structure of the dataset is shown above: it has **22 columns** and **287 entries**. Columns such as `LeaderName`, `DateProduce`, `TimeProduce`, and `Shift` are each missing 116 of 287 entries, or **>40%** of the data, so handling these missing values requires further analysis.
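
The percentage above is computed by hand for a single count. A minimal sketch of how the same figure could be obtained for every column at once follows; the toy frame and its values are made up for illustration, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "SamplingChek": [20, 25, 50, 30],
    "LeaderName": ["A", None, None, "B"],
    "TimeProduce": [8.3, np.nan, 10.3, 14.2],
})

# isnull().mean() is the fraction of missing entries per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```

Sorting the result makes it easy to see which columns cross a chosen threshold (for example 40%) and may need a different strategy than simple imputation.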

### b. Handling Missing Values

In [None]:
# fill missing values with the median
# QTY is stored as object (it contains strings such as '6 lots'),
# so the median is computed on a numeric view of the column
raw['QTY'] = raw['QTY'].fillna(pd.to_numeric(raw['QTY'], errors='coerce').median())
raw['DateProduce'] = raw['DateProduce'].fillna(raw['DateProduce'].median())
raw['TimeProduce'] = raw['TimeProduce'].fillna(raw['TimeProduce'].median())

# fill missing values with the mode
raw['Factor'] = raw['Factor'].fillna(raw['Factor'].mode()[0])
raw['GroupingFactor'] = raw['GroupingFactor'].fillna(raw['GroupingFactor'].mode()[0])
raw['Shift'] = raw['Shift'].fillna(raw['Shift'].mode()[0])  # the column is named 'Shift', not 'shift'

### c. Check & Visualize Outlier

In [None]:
def detect_outliers(df, columns):
    outliers = {}
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        outliers[col] = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index.tolist()
    return outliers


# Detect outliers in the numeric columns
numeric_columns = raw.select_dtypes(include=['int64', 'float64']).columns
outliers = detect_outliers(raw, numeric_columns)

# Visualize outliers with a box plot
plt.figure(figsize=(15, 5))
sns.boxplot(data=raw[numeric_columns])
plt.title('Box Plot of Numeric Columns')
plt.xticks(rotation=45)
plt.show()

# Print the number of outliers per column
for col, indices in outliers.items():
    print(f"Column {col} has {len(indices)} outliers")

### d. Handling Outlier

In [None]:
def handle_outliers(df, columns):
    for col in columns:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df[col] = df[col].clip(lower_bound, upper_bound)
    return df

raw = handle_outliers(raw, numeric_columns)
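
To make the IQR rule used in the two functions above concrete, here is a small worked example on a made-up series. The values are hypothetical and chosen only so that one point clearly falls outside the fences.

```python
import pandas as pd

# Hypothetical values with one extreme point
values = pd.Series([1, 1, 1, 2, 2, 3, 53], dtype="float64")

q1 = values.quantile(0.25)   # 1.0
q3 = values.quantile(0.75)   # 2.5
iqr = q3 - q1                # 1.5

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr       # -1.25
upper = q3 + 1.5 * iqr       # 4.75

# Flag values outside the fences, then clip them to the fences
is_outlier = (values < lower) | (values > upper)
clipped = values.clip(lower, upper)

print("Outliers:", values[is_outlier].tolist())
print("After clipping:", clipped.tolist())
```

Clipping keeps every row but replaces the extreme value with the fence value. When the same rule is applied to the target column, the model is trained on a capped target, which may or may not be desirable.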

### e. Handle Inconsistent Values

In [None]:
# normalize inconsistent labels in the 'Factor' column
raw['Factor'] = raw['Factor'].replace({'Single part': 'Single Part'})

# fix the typo '1s' in the 'Week' column
raw['Week'] = raw['Week'].replace({'1s': '1st'})

# replace the 'lots' entry, assuming 1 lot = 100 units, then store QTY as numeric
raw['QTY'] = pd.to_numeric(raw['QTY'].replace('6 lots', 600))

# show the first few rows to check the changes
print(raw[['Shift', 'Factor', 'Week']].head())

### f. Encoding

In [None]:
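# identify categorical columns
categorical_columns = raw.select_dtypes(include=['object']).columns

# ordinal encoding for ordered columns
# LabelEncoder assigns codes alphabetically (Apr=0, Aug=1, ...), which loses
# the calendar order, so the order is mapped explicitly instead.
# Assumes the '1s' typo in 'Week' has already been corrected to '1st'.
ordinal_columns = ['Month', 'Week']
ordinal_orders = {
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    'Week': ['1st', '2nd', '3rd', '4th'],
}

for column in ordinal_columns:
    mapping = {label: code for code, label in enumerate(ordinal_orders[column], start=1)}
    raw[column] = raw[column].map(mapping)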
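
One point worth checking when encoding ordered categories: `LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, so month names do not come out in calendar order. The sketch below compares it with an explicitly ordered pandas categorical on a made-up series of four months.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of month labels, already in calendar order
months = pd.Series(["Jan", "Feb", "Mar", "Apr"])

# LabelEncoder sorts the classes alphabetically: Apr, Feb, Jan, Mar
alphabetical_codes = LabelEncoder().fit_transform(months)

# An ordered categorical keeps the order that is passed in
calendar_order = ["Jan", "Feb", "Mar", "Apr"]
calendar_codes = pd.Categorical(months, categories=calendar_order, ordered=True).codes

print("LabelEncoder codes:", list(alphabetical_codes))
print("Ordered codes:     ", list(calendar_codes))
```

With the alphabetical codes, 'Apr' would be treated as smaller than 'Jan', which can mislead models that rely on the numeric order of a feature.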

In [None]:
# one-hot encoding for nominal columns
nominal_columns = [col for col in categorical_columns if col not in ordinal_columns]

# scikit-learn >= 1.2 API; older releases use sparse=False and get_feature_names
onehot = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
onehot_encoded = onehot.fit_transform(raw[nominal_columns])
onehot_columns = onehot.get_feature_names_out(nominal_columns)

In [None]:
# combine the one-hot encoded columns with the original dataset
onehot_df = pd.DataFrame(onehot_encoded, columns=onehot_columns, index=raw.index)
raw = pd.concat([raw.drop(columns=nominal_columns), onehot_df], axis=1)

# display the first few rows and the structure of the processed dataset
print(raw.head())
raw.info()

## **Correlation**

In [None]:
klib.corr_plot(raw, split='pos')

In [None]:
klib.corr_plot(raw, split='neg')

In [None]:
klib.corr_plot(raw, target='NC %')

In [None]:
# 1. Correlation of each feature with the target
# numeric_only=True restricts the computation to numeric columns (pandas >= 1.5)
correlation_matrix = raw.corr(numeric_only=True)
correlation_with_target = correlation_matrix['NC %'].abs().sort_values(ascending=False)
print("Correlation with target 'NC %':")
print(correlation_with_target)

# 2. Correlation heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

# 3. Features with low correlation to the target
low_correlation_threshold = 0.1  # adjust as needed
low_correlation_features = correlation_with_target[correlation_with_target < low_correlation_threshold].index.tolist()

# 4. Features highly correlated with another feature (multicollinearity)
# The target is excluded so that a strong predictor is not flagged as redundant
high_correlation_threshold = 0.5  # adjust as needed
feature_corr = correlation_matrix.drop(index='NC %', columns='NC %')
high_correlation_features = set()
for i in range(len(feature_corr.columns)):
    for j in range(i):
        if abs(feature_corr.iloc[i, j]) > high_correlation_threshold:
            high_correlation_features.add(feature_corr.columns[i])

print("Low-correlation features:", low_correlation_features)
print("Highly inter-correlated features:", sorted(high_correlation_features))
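
The two threshold checks above can be illustrated on synthetic data where the relationships are known in advance. This is only a sketch: the column names `a`, `a_copy`, `noise`, and `target` are hypothetical, and the thresholds are chosen for the toy data rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)

demo = pd.DataFrame({
    "a": a,                                          # drives the target
    "a_copy": a + rng.normal(scale=0.05, size=n),    # nearly duplicates 'a'
    "noise": rng.normal(size=n),                     # unrelated to the target
})
demo["target"] = 2 * a + rng.normal(scale=0.5, size=n)

# Absolute correlation of each feature with the target
corr_target = demo.corr()["target"].drop("target").abs()
low = corr_target[corr_target < 0.1].index.tolist()

# Pairwise feature correlations; keep only the upper triangle so each
# pair is inspected once and the later column of the pair is flagged
feat_corr = demo.drop(columns="target").corr().abs()
upper = feat_corr.where(np.triu(np.ones(feat_corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.9).any()]

print("Low correlation with target:", low)
print("Redundant features:", redundant)
```

A feature that appears in `redundant` still correlates well with the target; it is flagged only because another feature carries almost the same information.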


## **Feature Engineering**

In [None]:
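# Drop features judged to have little influence
# The list is left empty here; fill it in after reviewing the correlation results
features_to_drop = []
raw_cleaned = raw.drop(columns=features_to_drop)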

## **Model**

### a. Split Dataset to Train-Test Set
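
A minimal sketch of how this step might look, using `train_test_split` and `StandardScaler` from the imports at the top of the notebook. Synthetic data stands in for the processed dataset; the feature names `feature_a` and `feature_b`, the 80/20 split, and the random seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset (hypothetical columns)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
})
demo["NC %"] = rng.uniform(0, 1, size=100)

# Separate the features from the target
X = demo.drop(columns=["NC %"])
y = demo["NC %"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training rows alone keeps information from the test set out of the preprocessing step.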

### b. Training

#### `b.1. Linear Regression`
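
A minimal sketch of fitting `LinearRegression`, assuming synthetic data with a known linear relationship in place of the processed dataset. The coefficients 3 and -2 and the noise level are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): y = 3*x1 - 2*x2 + small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit ordinary least squares on the training set
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predict on the held-out rows
y_pred_lr = lin_reg.predict(X_test)

print("Coefficients:", lin_reg.coef_)
print("Intercept:", lin_reg.intercept_)
```

On data like this the fitted coefficients should land close to the values used to generate it, which is a quick sanity check before moving to real data.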

#### `b.2. Random Forest`
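
A minimal sketch of fitting `RandomForestRegressor`, again on synthetic data rather than the processed dataset. The non-linear target, `n_estimators=100`, and the random seed are assumptions for illustration, not tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): only the first feature drives the target,
# and the relationship is non-linear
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(300, 3))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit an ensemble of decision trees on the training set
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict on the held-out rows
y_pred_rf = rf.predict(X_test)

# Importances sum to 1 and indicate which features the trees split on most
print("Feature importances:", rf.feature_importances_)
```

Tree ensembles do not require feature scaling, and the feature importances give a rough view of which inputs the model relies on.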

### c. Evaluate Model
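
A minimal sketch of computing the metrics imported at the top of the notebook (MAE, MSE, and R²), with RMSE derived from MSE. The true and predicted values are made up so that the results can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted NC fractions
y_true = np.array([0.10, 0.20, 0.30, 0.40])
y_pred = np.array([0.10, 0.25, 0.25, 0.50])

mae = mean_absolute_error(y_true, y_pred)   # mean of |error|
mse = mean_squared_error(y_true, y_pred)    # mean of error^2
rmse = np.sqrt(mse)                         # same units as the target
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.5f}")
print(f"RMSE: {rmse:.4f}")
print(f"R2:   {r2:.2f}")
```

Here the absolute errors are 0, 0.05, 0.05, and 0.10, so MAE is 0.05 and MSE is 0.00375. The total sum of squares around the mean of 0.25 is 0.05, which gives R² = 1 - 0.015 / 0.05 = 0.7. Applying the same calls to the test-set predictions of each model allows a like-for-like comparison.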