# Test-4

## Algorithm Questions

### 1. How does regularization (L1 and L2) help prevent overfitting?

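Regularization adds a penalty on the size of the weights to the loss function, which discourages overly complex models that fit noise in the training data.

- L1 (Lasso): Penalizes the sum of the absolute values of the weights. It can drive some weights to exactly zero, which acts as built-in feature selection.
- L2 (Ridge): Penalizes the sum of the squared weights. It shrinks all weights toward zero without eliminating them, making the model simpler and less sensitive to noise.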
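
The difference between the two penalties can be seen on a small synthetic problem. The sketch below is illustrative only: the dataset comes from `make_regression`, and the penalty strength `alpha=5.0` is an arbitrary assumption, not a recommended value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 of which actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# alpha is the penalty strength; the value is illustrative, not tuned.
lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Count coefficients that were driven to exactly zero by each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(f"Lasso (L1) zero coefficients: {n_zero_lasso} of {X.shape[1]}")
print(f"Ridge (L2) zero coefficients: {n_zero_ridge} of {X.shape[1]}")
```

L1 should remove some of the uninformative features entirely, while L2 keeps every feature with a smaller weight.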

### 2. Why is feature scaling important in gradient descent?

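Feature scaling puts all features on a comparable scale (for example, zero mean and unit variance). Without it, features with larger values dominate the gradient, so the loss surface becomes elongated and a single learning rate cannot suit every weight. Gradient descent then zigzags and converges slowly, or diverges if the learning rate is raised.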
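
To make the effect concrete, here is a minimal sketch that runs plain batch gradient descent on a linear regression twice, once on raw features and once on standardized ones. The data, the helper `gradient_descent_steps`, the tolerance, and the step cap are all assumptions made for illustration.

```python
import numpy as np


def gradient_descent_steps(X, y, max_steps=20_000, tol=1e-6):
    """Batch gradient descent on mean squared error for a linear model.

    Returns the number of steps taken before the gradient norm drops
    below `tol`, or `max_steps` if that never happens.
    """
    n = len(y)
    # A step size that is safe for this quadratic loss:
    # 1 / (largest eigenvalue of the Hessian).
    hessian = 2.0 / n * X.T @ X
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()

    w = np.zeros(X.shape[1])
    for step in range(max_steps):
        grad = 2.0 / n * X.T @ (X @ w - y)
        if np.linalg.norm(grad) < tol:
            return step
        w -= lr * grad
    return max_steps


# Hypothetical data: one feature in [0, 1], another in [0, 1000].
rng = np.random.default_rng(0)
n = 200
X_raw = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 1000, n)])
y = 3.0 * X_raw[:, 0] + 0.002 * X_raw[:, 1] + rng.normal(0, 0.1, n)

# Center X and y so no intercept is needed; only the feature scale differs.
X_centered = X_raw - X_raw.mean(axis=0)
y_centered = y - y.mean()
X_scaled = X_centered / X_raw.std(axis=0)

steps_raw = gradient_descent_steps(X_centered, y_centered)
steps_scaled = gradient_descent_steps(X_scaled, y_centered)

print(f"Steps without scaling: {steps_raw} (20000 means the cap was hit)")
print(f"Steps with scaling:    {steps_scaled}")
```

The division by the per-feature standard deviation is the same transformation that scikit-learn's `StandardScaler` applies.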

## Problem Solving

### 1. How would you handle missing values in a dataset?

You can handle missing data in a few ways:

- Remove rows or columns with missing data.
- Impute missing values using the mean/median (for numerical data) or mode (for categorical data).
- Use models that handle missing values natively, such as XGBoost. Random Forest support depends on the implementation and version, so check before relying on it.
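
The first two options can be sketched with pandas alone. The toy DataFrame and its column names (`age`, `city`) are hypothetical and exist only to show the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Rome", np.nan, "Rome"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: impute. Median for the numeric column,
# most frequent value (mode) for the categorical column.
imputed = df.copy()
imputed["age"] = df["age"].fillna(df["age"].median())
imputed["city"] = df["city"].fillna(df["city"].mode()[0])

print(dropped)
print(imputed)
```

In a real project the median and mode would be computed on the training set only and then reused for the test set.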

### 2. Design a pipeline for building a classification model.

- Collect data (e.g., load CSV).
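- Split data into training and test sets (e.g., 80/20).
- Preprocess data: handle missing values, encode categorical variables, scale features. Fit these steps on the training set only and apply them to the test set, so no information leaks from the test data.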
- Choose a model (e.g., Logistic Regression, Decision Tree).
- Train the model on the training set.
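- Evaluate performance (e.g., accuracy, or precision, recall, and F1 when the classes are imbalanced).
- Tune hyperparameters with cross-validation on the training data, then re-evaluate on the test set.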
- Deploy the model for predictions.
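
The steps above can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This is one possible arrangement, not the only one: the generated DataFrame stands in for a loaded CSV, the column names (`income`, `age`, `segment`, `target`) are hypothetical, and the grid of `C` values is arbitrary. Deployment is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data standing in for a loaded CSV.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, n),
    "age": rng.integers(18, 65, n).astype(float),
    "segment": rng.choice(["a", "b", "c"], n),
})
df["target"] = (df["income"] + rng.normal(0, 5_000, n) > 50_000).astype(int)
df.loc[rng.choice(n, 20, replace=False), "age"] = np.nan  # inject missing values

# Split first so that preprocessing is fit on training data only.
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Numeric columns: impute then scale. Categorical columns: one-hot encode.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the regularization strength with 5-fold cross-validation.
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Evaluate the best model once on the held-out test set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(f"Best C: {search.best_params_['clf__C']}, test accuracy: {test_accuracy:.2f}")
```

Keeping preprocessing inside the pipeline means each cross-validation fold refits the imputer, scaler, and encoder on its own training portion.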

## Coding

### 1. Implement a Decision Tree classifier using Scikit-learn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the data and separate the features from the label column
df = pd.read_csv('dataset.csv')
X = df.drop('target', axis=1)
y = df['target']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the tree; random_state makes the result reproducible
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Score predictions on the held-out test set
y_pred = clf.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
```

### 2. Split the data into training and test sets (80-20 split).

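```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data and separate the features from the label column
df = pd.read_csv('dataset.csv')
X = df.drop('target', axis=1)
y = df['target']

# test_size=0.2 gives the 80-20 split; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```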

## Case Study

### 1. A company wants to predict employee attrition. What kind of problem is this? Which algorithms would you use and why?

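This is a binary classification problem: each employee either leaves or stays.

- Logistic Regression: Simple and interpretable; a good baseline whose coefficients show which factors are associated with attrition.
- Decision Trees: Can capture non-linear patterns and produce rules that are easy to explain.
- Random Forest: Combines multiple trees, which reduces overfitting and usually improves accuracy over a single tree.
- Gradient Boosting (e.g., XGBoost): Often among the most accurate options for structured (tabular) data.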
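
One way to choose among these candidates is to compare them with cross-validation. The sketch below makes several assumptions: the data is synthetic (`make_classification` with roughly 15% positives as a stand-in for an attrition rate), scikit-learn's `GradientBoostingClassifier` is used in place of XGBoost to avoid an extra dependency, and ROC AUC is picked as the metric because the classes are imbalanced.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for an attrition dataset (about 15% leavers).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Mean ROC AUC over 5 folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: ROC AUC = {score:.3f}")
```

On real attrition data the ranking may differ, so interpretability requirements should be weighed alongside the scores.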