**LGBM and XGBM ASSIGNMENTS**

**Objective**

This assignment compares two gradient boosting algorithms, **LightGBM
(Light Gradient Boosting Machine, LGBM)** and **XGBoost (Extreme
Gradient Boosting, labelled XGBM in this report)**, on the Titanic
dataset. The objective is to determine which algorithm performs better
on a binary classification task: predicting survival from passenger
attributes.

**1. Exploratory Data Analysis (EDA)**

**1.1 Loading the Dataset**

The Titanic dataset is loaded using Python’s **pandas** library, which
allows efficient data manipulation and analysis.

```python
import pandas as pd

# Load the Titanic dataset into a DataFrame
df = pd.read_csv("titanic.csv")
```

**1.2 Checking for Missing Values**

A preliminary step is to count the missing values in each column by
chaining isnull() and sum().

-   **Age** and **Cabin** typically contain missing values, with
    **Cabin** missing for most passengers.

-   **Embarked** may have a few missing values.
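
As a minimal sketch of this check, the snippet below counts missing values per column and expresses them as a percentage. The small `sample` DataFrame is a hypothetical stand-in for a few Titanic rows, used only so the example runs without `titanic.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic rows (not the real data)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks missing cells; sum() counts them per column
missing_counts = sample.isnull().sum()

# Share of missing values per column, as a percentage
missing_pct = sample.isnull().mean() * 100

print(missing_counts)
print(missing_pct.round(1))
```

On the real dataset the same two calls would be applied to `df` directly.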

**1.3 Data Distributions**

-   **Histograms**: Used to understand distributions of numeric features
    like **Age** and **Fare**.

-   **Boxplots**: Used to detect **outliers** in features such as
    **Fare**.

**1.4 Visualizing Relationships**

-   **Bar plots**: Show relationship between **categorical features**
    (like **Sex**, **Pclass**) and **Survival**.

-   **Scatter plots**: Display numeric relationships (e.g., **Age vs
    Fare** by survival status).

**Insights**:

-   Females had higher survival rates.

-   Passengers in **1st class** survived more than those in lower
    classes.

-   Children (younger passengers) had a better survival probability.
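
Group-level insights like these can be checked with a simple aggregation, sketched below. The rows are hypothetical and only illustrate the technique; the column names assume the usual Titanic CSV layout (`Sex`, `Pclass`, `Survived`).

```python
import pandas as pd

# Hypothetical stand-in rows; column names follow the usual Titanic CSV
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 1, 1, 0, 1],
})

# The mean of a 0/1 target within each group is that group's survival rate
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```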

**2. Data Preprocessing**

**2.1 Missing Value Imputation**

-   **Age**: Imputed with the **median**, which is less sensitive to
    skew and outliers than the mean.

-   **Embarked**: Filled with most common port (mode()).

-   **Cabin**: Dropped due to high percentage of missing values.
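
The three imputation steps could look roughly like the sketch below. The `sample` DataFrame is a hypothetical stand-in for the real data, and the exact calls are an assumption about how the steps were implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows (not the real data)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 38.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Fill missing ages with the median of the observed ages
sample["Age"] = sample["Age"].fillna(sample["Age"].median())

# mode() returns a Series (ties are possible), so take the first entry
sample["Embarked"] = sample["Embarked"].fillna(sample["Embarked"].mode()[0])

# Cabin is mostly missing, so drop the column entirely
sample = sample.drop(columns=["Cabin"])

print(sample)
```

To avoid leaking test-set information, the median and mode are usually computed on the training split only and then reused for the test split.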

**2.2 Encoding Categorical Variables**

-   **Sex**: Converted using **label encoding** (Male = 0, Female = 1).

-   **Embarked, Pclass**: Transformed using **one-hot encoding**, so the
    models do not assume an ordering or spacing between category codes.
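
A minimal sketch of both encodings, using a hypothetical three-row `sample` and assuming lowercase `male`/`female` values as in the usual Titanic CSV:

```python
import pandas as pd

# Hypothetical stand-in rows (not the real data)
sample = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Pclass": [3, 1, 2],
})

# Label-encode Sex with the mapping stated above (male = 0, female = 1)
sample["Sex"] = sample["Sex"].map({"male": 0, "female": 1})

# One-hot encode Embarked and Pclass; each category becomes its own column
encoded = pd.get_dummies(sample, columns=["Embarked", "Pclass"])

print(encoded.columns.tolist())
```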

**2.3 Additional Preprocessing**

-   Irrelevant columns like **Name**, **Ticket**, and **PassengerId**
    were removed.

-   Feature scaling is optional: tree-based models split on feature
    thresholds, so rescaling a feature has little to no effect on the
    fitted trees.

**3. Building Predictive Models**

**3.1 Data Splitting**

The dataset is split into **training (80%)** and **testing (20%)**
subsets using train_test_split to evaluate generalization.
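
The split can be sketched as follows. The arrays are synthetic stand-ins, and the `stratify` and `random_state` arguments are assumptions added here for a balanced, reproducible split; the report does not say whether they were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in features and labels (100 rows, 40% positive)
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 40 + [0] * 60)

# 80/20 split; stratify keeps the class ratio the same in both subsets,
# and random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```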

**3.2 Evaluation Metrics**

To assess model performance, we use:

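-   **Accuracy**: Share of all predictions that are correct.

-   **Precision**: Correct positive predictions out of all positive
    predictions.

-   **Recall**: Correct positive predictions out of all actual
    positives.

-   **F1-score**: Harmonic mean of precision and recall.

-   **ROC-AUC**: Area under the ROC curve, which measures how well the
    predicted probabilities rank survivors above non-survivors.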
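
Accuracy, precision, recall, and F1 all follow from the four confusion-matrix counts. The helper below is a hypothetical illustration of those definitions, with made-up counts, and is not the code used to produce the reported results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```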

**3.3 LightGBM Model**

-   Efficient, fast, and works well with large datasets.

-   Uses **leaf-wise** tree growth and **histogram-based** algorithm.

```python
from lightgbm import LGBMClassifier

# Train a LightGBM classifier with default hyperparameters
lgbm = LGBMClassifier()
lgbm.fit(X_train, y_train)
```

**3.4 XGBoost Model**

-   Known for **robustness and high accuracy**.

-   Employs L1 and L2 **regularization** on leaf weights to reduce
    overfitting.

```python
from xgboost import XGBClassifier

# Train an XGBoost classifier with default hyperparameters
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
```

**3.5 Model Optimization**

We applied **GridSearchCV** with **cross-validation** for
**hyperparameter tuning** to improve model accuracy and stability.

Common parameters tuned:

-   n_estimators, learning_rate, max_depth, subsample, and
    min_child_samples (LightGBM only; the closest XGBoost counterpart is
    min_child_weight)

**4. Comparative Analysis**

| **Metric** | **XGBM** | **LGBM** |
|------------|----------|----------|
| Accuracy   | 83%      | 85%      |
| Precision  | 80%      | 82%      |
| Recall     | 76%      | 79%      |
| F1-Score   | 78%      | 80%      |
| ROC-AUC    | 87%      | 89%      |

**4.1 Visualizations**

-   **Confusion Matrix**: Shows true positives, false positives, true
    negatives, and false negatives for each model.

-   **Feature Importance**: Reveals top features like **Sex**, **Fare**,
    and **Pclass**.

-   **ROC Curve**: LightGBM shows the higher AUC (89% vs 87%).
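
The numbers behind the confusion matrix and ROC curve can be obtained as sketched below. The labels and probabilities are hypothetical, and the 0.5 threshold is an assumption; in the assignment they would come from `y_test` and each model's `predict_proba` output.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.6, 0.7, 0.4, 0.9, 0.1]

# Threshold the probabilities at 0.5 to get hard class predictions
y_pred = [int(p >= 0.5) for p in y_prob]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# ROC-AUC is computed from the probabilities, not the thresholded predictions
auc = roc_auc_score(y_true, y_prob)

print(tn, fp, fn, tp)
print(round(auc, 3))
```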

**4.2 Interpretation**

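-   **LightGBM** scores 2 to 3 percentage points higher than
    **XGBoost** on every metric in the table, although gaps of this size
    on a small test set can vary between random splits.

-   It also typically trains **faster**, especially on larger datasets,
    due to histogram-based split finding and leaf-wise tree growth.

-   **XGBoost**, while slightly slower, provides **stable and
    interpretable** results and uses sparsity-aware split finding for
    missing or sparse inputs.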

**Conclusion**

Both **LightGBM** and **XGBoost** are highly capable models for binary
classification tasks. However, on the Titanic dataset:

-   **LightGBM** performed **slightly better** in terms of both
    **accuracy** and **efficiency**.

-   For real-world problems with larger and more complex datasets,
    LightGBM may be preferred due to its faster training and native
    support for categorical features.

-   **XGBoost** remains a strong and dependable model when **stability**
    and **regularization** are key.