# K-Fold Cross Validation Notebook Explanation

In this notebook we use k-fold cross-validation to evaluate classification models. Each cell's code is shown below together with its theory and a short explanation. You can paste this markdown directly into a Jupyter Notebook.

---
## Cell 1: Libraries & Models Import

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits
```
- `LogisticRegression`, `SVC`, and `RandomForestClassifier`: three popular classification algorithms.
- `numpy` (`np`): arrays and numerical operations.
- `load_digits`: scikit-learn's built-in handwritten digits dataset.

**Theory:** Logistic Regression is a linear model that estimates class probabilities; an SVM (Support Vector Machine) builds a hyperplane to separate the classes; Random Forest is an ensemble of decision trees, which reduces overfitting compared with a single tree.

---
## Cell 2: Load Dataset

```python
digits = load_digits()
```
- `load_digits()` returns a Bunch object whose `.data` holds the features and `.target` holds the labels.

**Theory:** The digits dataset contains 8×8 pixel images of handwritten digits (0–9), 1797 samples in total.
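
As a quick check of the description above, here is a minimal sketch that inspects the loaded Bunch object. It only uses attributes that `load_digits()` provides; the printing is purely illustrative.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# .data holds each 8x8 image flattened into one row of 64 pixel values
print(digits.data.shape)
# .images keeps the same pixels in their original 8x8 layout
print(digits.images.shape)
# .target holds one label per sample, the digits 0 through 9
print(sorted(set(digits.target.tolist())))
```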

---
## Cell 3: Train/Test Split

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)
```
- `train_test_split`: randomly splits the dataset into training and testing subsets.
- `test_size=0.3` means 30% of the data is held out for testing.

**Theory:** We want to evaluate the model on unseen data as well, so we hold out a test set; scoring on it reveals overfitting that the training score would hide.
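
Because the split is random, the scores change from run to run unless a seed is fixed. The sketch below shows one way to make the split reproducible and to check the subset sizes. The `random_state=42` seed is an arbitrary assumption of this sketch and is not used in the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# random_state=42 is an arbitrary seed chosen for this sketch, not from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=42)

# Sizes of the training and testing subsets
print(len(X_train), len(X_test))
```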

---
## Cell 4: Logistic Regression Evaluation

```python
lr = LogisticRegression()
lr.fit(X_train, y_train)
score_lr = lr.score(X_test, y_test)
```
- `fit()`: trains the model on the training data.
- `score()`: computes accuracy on the test data.

**Theory:** Accuracy = (correct predictions)/(total predictions).
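
The accuracy formula above can be worked through by hand on a tiny example. The labels below are hypothetical values made up for this illustration; they are not taken from the digits dataset.

```python
import numpy as np

# Hypothetical true and predicted labels, made up for this illustration
y_true = np.array([3, 8, 1, 0, 7, 7, 2, 9])
y_pred = np.array([3, 8, 1, 0, 1, 7, 2, 4])

# Count matching positions, then divide by the total number of predictions
correct = int(np.sum(y_true == y_pred))
accuracy = correct / len(y_true)
print(correct, accuracy)
```

For a classifier, `score()` reports this same ratio, computed on the data passed to it.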

---
## Cell 5: SVM Evaluation

```python
svm = SVC()
svm.fit(X_train, y_train)
score_svm = svm.score(X_test, y_test)
```
- The default kernel of `SVC` is `rbf`.
- `score()` returns the accuracy.

**Theory:** An SVM separates the classes with a maximum-margin boundary; the support vectors are the training points closest to that boundary, and they alone determine it.

---
## Cell 6: Random Forest Evaluation

```python
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
score_rf = rf.score(X_test, y_test)
```
- `RandomForestClassifier()`: 100 trees by default.
- `score()`: accuracy on the test set.

**Theory:** The final prediction combines the trees in the forest, by averaging their outputs or by majority vote.

---
## Cell 7: Manual KFold Demo (Commented)

```python
# from sklearn.model_selection import KFold
# kf = KFold(n_splits=3)
# for i, (train_index, test_index) in enumerate(kf.split(X_train)):
#     ...
```
- The commented-out code shows how the KFold splitting could be done manually.

**Theory:** KFold splits the data into `n_splits` parts; each fold serves as the test set once while the remaining folds are used for training.
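
To see the fold rotation concretely, here is a runnable sketch of the commented-out idea on a toy array. The 6-sample array `X` is a made-up stand-in for `X_train`, chosen so that the indices are easy to read.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(12).reshape(6, 2)  # 6 toy samples with 2 features each

kf = KFold(n_splits=3)  # no shuffling, so each test fold is a contiguous block
folds = []
for i, (train_index, test_index) in enumerate(kf.split(X)):
    folds.append((train_index.tolist(), test_index.tolist()))
    print(f"Fold {i}: train={train_index.tolist()} test={test_index.tolist()}")
```

Every sample index appears in exactly one test fold, and in the training set of the other two folds.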

---
## Cell 8: Utility Function for Scoring

```python
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)
```
- A reusable function that does training and scoring in a single call.

**Theory:** The DRY principle (Don't Repeat Yourself) keeps the code clean.

---
## Cell 9: Single Fold Score Check

```python
get_score(LogisticRegression(), X_train, X_test, y_train, y_test)
```
- Checks Logistic Regression's performance on just one train/test split.
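
In the notebook's output, this call with a default `LogisticRegression()` prints a convergence warning that suggests raising `max_iter` or scaling the data. The sketch below combines both suggestions using the `get_score` helper from Cell 8. The scaling pipeline, the `random_state=0` seed, and the names `scaled_lr` and `scaled_score` are assumptions of this sketch and do not appear in the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def get_score(model, X_train, X_test, y_train, y_test):
    # Same helper as Cell 8: fit on the training data, return test accuracy
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


digits = load_digits()
# random_state=0 is an arbitrary seed for this sketch
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# Scale the features first, and allow more solver iterations than the default
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_score = get_score(scaled_lr, X_train, X_test, y_train, y_test)
print(scaled_score)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into preprocessing.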

---
## Cell 10: Stratified K-Fold Cross-Validation

```python
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)

score_l = []
score_svm = []
score_rf = []
for train_index, test_index in skf.split(digits.data, digits.target):
    X_tr, X_te = digits.data[train_index], digits.data[test_index]
    y_tr, y_te = digits.target[train_index], digits.target[test_index]
    score_l.append(get_score(LogisticRegression(max_iter=1000), X_tr, X_te, y_tr, y_te))
    score_svm.append(get_score(SVC(), X_tr, X_te, y_tr, y_te))
    score_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_tr, X_te, y_tr, y_te))
```
- `StratifiedKFold`: keeps the class distribution approximately the same in every fold.
- The loop computes the accuracy of all three models for each fold and appends it to the corresponding list.

**Theory:** Stratification preserves the class proportions in each fold, which matters most when classes are imbalanced; cross-validation gives a more reliable performance estimate than a single split.
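
The effect of stratification is easiest to see on a small imbalanced label array. The sketch below uses made-up toy labels (8 samples of class 0 and 4 of class 1) rather than the digits data, and counts the classes in each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 8 samples of class 0, 4 samples of class 1
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # feature values do not affect how the split is made

skf = StratifiedKFold(n_splits=4)
fold_counts = []
for train_index, test_index in skf.split(X, y):
    # Number of class-0 and class-1 samples in this test fold
    counts = np.bincount(y[test_index], minlength=2).tolist()
    fold_counts.append(counts)
    print(counts)
```

Each test fold keeps the 2:1 ratio of the full label array, which a plain unshuffled `KFold` would not do for labels sorted like this.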

---
## Cell 11–13: Fold-wise Scores

```python
score_l    # Logistic scores for each fold
score_svm  # SVM scores for each fold
score_rf   # Random Forest scores for each fold
```
- Each variable displays a list of 5 accuracy values.

**Theory:** The spread of the scores gives an idea of the model's stability and variance.
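
One simple way to summarize that spread is the mean and standard deviation of each list. The sketch below uses the fold-wise accuracies recorded in the notebook output; the `summary` dictionary is a helper name introduced only for this illustration.

```python
import numpy as np

# Fold-wise accuracies copied from the notebook output
score_l = [0.9222222222222223, 0.8722222222222222, 0.9415041782729805,
           0.9415041782729805, 0.8969359331476323]
score_svm = [0.9611111111111111, 0.9444444444444444, 0.9832869080779945,
             0.9888579387186629, 0.9387186629526463]
score_rf = [0.9277777777777778, 0.9, 0.9610027855153204,
            0.9693593314763231, 0.9275766016713092]

# Mean shows the typical accuracy; standard deviation shows fold-to-fold variation
summary = {}
for name, scores in [("LogisticRegression", score_l),
                     ("SVC", score_svm),
                     ("RandomForest", score_rf)]:
    summary[name] = (float(np.mean(scores)), float(np.std(scores)))
    print(f"{name}: mean={summary[name][0]:.4f} std={summary[name][1]:.4f}")
```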

---
## Cell 14–16: `cross_val_score` Shortcut

```python
from sklearn.model_selection import cross_val_score
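# Each call returns an array with one accuracy per fold
cross_val_score(LogisticRegression(), digits.data, digits.target, cv=5)
cross_val_score(SVC(), digits.data, digits.target, cv=5)
# The Random Forest cell in the notebook uses 60 trees and 3 folds
cross_val_score(RandomForestClassifier(n_estimators=60), digits.data, digits.target, cv=3)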
```
- A single function runs the built-in cross-validation and returns the scores.

**Theory:** When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses StratifiedKFold under the hood and returns the per-fold scores directly.
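
This equivalence can be checked directly. The sketch below compares `cv=5` with an explicitly constructed `StratifiedKFold(n_splits=5)` for `SVC`, which gives identical results on identical folds; the variable names `shortcut` and `explicit` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

digits = load_digits()

# Integer cv: the splitter is chosen by cross_val_score itself
shortcut = cross_val_score(SVC(), digits.data, digits.target, cv=5)
# Explicit splitter: the same stratified, unshuffled 5-fold split
explicit = cross_val_score(SVC(), digits.data, digits.target,
                           cv=StratifiedKFold(n_splits=5))

print(shortcut)
print(np.allclose(shortcut, explicit))
```

This matches the notebook's own output, where the SVM scores from the manual `StratifiedKFold` loop and from `cross_val_score` are the same up to display rounding.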

---
*That was the detailed explanation and theory for the K-Fold Cross Validation notebook. Paste it straight into your Jupyter Notebook and try it out!*


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from sklearn.datasets import load_digits

digits=load_digits()

In [74]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)

In [75]:
lr=LogisticRegression()
lr.fit(X_train,y_train)
lr.score(X_test,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9592592592592593

In [76]:
svm=SVC()
svm.fit(X_train,y_train)
svm.score(X_test,y_test)

0.9888888888888889

In [77]:
rf=RandomForestClassifier()
rf.fit(X_train,y_train)
rf.score(X_test,y_test)

0.9796296296296296

In [78]:
# from sklearn.model_selection import KFold
# kf=KFold(n_splits=3)
# kf.get_n_splits(X_train)
# for i, (train_index, test_index) in enumerate(kf.split(X_train)):
#     print(f"Fold {i}:")
#     print(f"  Train: index={train_index}")
#     print(f"  Test:  index={test_index}")

In [79]:
def get_score(model,X_train,X_test,y_train,y_test):
    model.fit(X_train,y_train)
    return model.score(X_test,y_test)

In [80]:
get_score(LogisticRegression(),X_train,X_test,y_train,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9592592592592593

In [81]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
score_l=[]
score_svm=[]
score_rf=[]
for train_index, test_index in skf.split(digits.data, digits.target):
    X_train, X_test = digits.data[train_index], digits.data[test_index]
    y_train, y_test = digits.target[train_index], digits.target[test_index]
    score_l.append(get_score(LogisticRegression(max_iter=1000), X_train, X_test, y_train, y_test))
    score_svm.append(get_score(SVC(), X_train, X_test, y_train, y_test))
    score_rf.append(get_score(RandomForestClassifier(n_estimators=40), X_train, X_test, y_train, y_test))

In [82]:
score_l

[0.9222222222222223,
 0.8722222222222222,
 0.9415041782729805,
 0.9415041782729805,
 0.8969359331476323]

In [83]:
score_svm

[0.9611111111111111,
 0.9444444444444444,
 0.9832869080779945,
 0.9888579387186629,
 0.9387186629526463]

In [84]:
score_rf

[0.9277777777777778,
 0.9,
 0.9610027855153204,
 0.9693593314763231,
 0.9275766016713092]

In [85]:
from sklearn.model_selection import cross_val_score
cross_val_score(LogisticRegression(),digits.data,digits.target,cv=5)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

array([0.92222222, 0.86944444, 0.94150418, 0.93871866, 0.89693593])

In [86]:
cross_val_score(SVC(),digits.data,digits.target,cv=5)

array([0.96111111, 0.94444444, 0.98328691, 0.98885794, 0.93871866])

In [87]:
cross_val_score(RandomForestClassifier(n_estimators=60),digits.data,digits.target,cv=3)

array([0.92821369, 0.94991653, 0.93823038])