## **BIAS AND VARIANCE**

**Bias and variance** are two fundamental concepts in machine learning that describe different aspects of a model's performance.

->Bias refers to the error introduced by the simplifying assumptions a model makes. A high-bias model tends to underfit: it fails to capture the underlying patterns, so it predicts poorly on the training data and on new data alike.

->Variance, on the other hand, refers to how much the model's predictions change when it is trained on a different sample of data. A high-variance model tends to overfit: it fits the training data too closely, noise included, and therefore performs poorly on new, unseen data.
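The two failure modes can be illustrated on a toy problem. The sketch below is not part of the notebook's analysis: it assumes synthetic data drawn from a noisy sine curve, and the names `make_data` and `errors_for_degree` are hypothetical helpers introduced only for this illustration. It fits polynomials of increasing degree and records the mean squared error on the training sample and on a separate test sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical toy data: a sine curve plus Gaussian noise (assumption, not the notebook's dataset)
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_small, y_small = make_data(15)     # small training sample
x_fresh, y_fresh = make_data(500)    # unseen data from the same process

def errors_for_degree(degree):
    # Least-squares polynomial fit on the training sample only
    coefs = np.polyfit(x_small, y_small, degree)
    train_mse = np.mean((np.polyval(coefs, x_small) - y_small) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_fresh) - y_fresh) ** 2)
    return train_mse, test_mse

# Degree 1 is too rigid for a sine curve, degree 9 is very flexible for 15 points
results = {d: errors_for_degree(d) for d in (1, 3, 9)}
for d, (train_mse, test_mse) in results.items():
    print(f"degree {d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. The straight line keeps a large error on both samples (underfitting), while the most flexible fit typically shows a small training error alongside a larger test error (overfitting).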

Striking a balance between bias and variance is crucial for building models that generalize effectively to new data while accurately capturing the underlying patterns in the training data, a concept known as the bias-variance trade-off.
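For squared-error loss the trade-off can be written down exactly. The notation here is an assumption introduced for illustration: the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, $\hat{f}$ is the model fitted on a random training set, and the expectation is taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Making a model more flexible usually shrinks the first term and grows the second, which is why neither can be minimized in isolation. For classification with 0-1 loss the decomposition takes a different form, so this identity should be read as the motivating case rather than a formula that applies directly to accuracy.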

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [3]:
data = 'drive/MyDrive/breast_cancer.csv'

df = pd.read_csv(data)

In [4]:
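# Recode the labels: 2 -> 0 and 4 -> 1.
# Assigning the result back avoids a chained in-place replace, which newer pandas versions warn about and may not apply.
df['Class'] = df['Class'].replace({2: 0, 4: 1})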

In [5]:
X = df.drop(['Class'], axis=1)

y = df['Class']

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

cols = X_train.columns
scaler = StandardScaler()

# Fit the scaler on the training split only, then apply the same transform to the test split
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [7]:
# The scaler returns NumPy arrays, so restore the column names
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)

In [8]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(random_state=0)

# fit the model
logreg.fit(X_train, y_train)

## **CALCULATE BIAS AND VARIANCE**

In [9]:
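cv_scores = cross_val_score(logreg, X_train, y_train, cv=5, scoring='accuracy')
# Rough proxies, not the formal decomposition: mean held-out error and spread of fold accuracies
bias = 1 - np.mean(cv_scores)  # mean misclassification rate over the 5 held-out folds
variance = np.std(cv_scores)   # standard deviation (not variance) of the fold accuracies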

In [10]:
print(f"Bias: {bias}")
print(f"Variance: {variance}")


Bias: 0.033238149628783575
Variance: 0.00793866333949474
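The two numbers printed above are cross-validation proxies. A closer analogue of bias and variance for a classifier can be sketched by retraining the model on many bootstrap resamples and comparing each model's predictions with the majority vote. The sketch below makes several assumptions that are not in the notebook: it uses synthetic data from `make_classification` as a stand-in for the CSV, treats the observed test labels as noise-free, and uses hypothetical names (`preds`, `main_pred`, `bias_01`, `variance_01`). Here "bias" is the error of the majority-vote prediction and "variance" is how often an individual model disagrees with that vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the notebook's breast_cancer.csv)
X_syn, y_syn = make_classification(n_samples=600, n_features=9,
                                   n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_rounds = 100
preds = np.empty((n_rounds, len(y_te)), dtype=int)

for i in range(n_rounds):
    # Bootstrap: resample the training rows with replacement and refit
    idx = rng.integers(0, len(y_tr), size=len(y_tr))
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    preds[i] = clf.predict(X_te)

# Majority vote across the bootstrap models for each test point
main_pred = (preds.mean(axis=0) >= 0.5).astype(int)

avg_error = np.mean(preds != y_te)        # average 0-1 loss of the individual models
bias_01 = np.mean(main_pred != y_te)      # error of the majority-vote prediction
variance_01 = np.mean(preds != main_pred) # disagreement with the majority vote

print(f"Average error: {avg_error:.4f}")
print(f"Bias (0-1 loss): {bias_01:.4f}")
print(f"Variance (0-1 loss): {variance_01:.4f}")
```

Unlike the squared-error case, these two terms do not simply add up to the average error, so they are best read side by side rather than summed.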


## **CONCLUSION**
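  ->**Low Bias**: The bias value of 0.033 is the mean misclassification rate on the held-out cross-validation folds, so the model classifies about 96.7% of held-out samples correctly. This suggests that the model captures the underlying patterns reasonably well and is not overly simplistic. The figure is a proxy rather than bias in the strict sense, because held-out error also includes variance and noise.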

  ->**Low Variance**: The variance value of 0.0079 is the standard deviation of the accuracy across the five folds, which means fold accuracies typically differ from their mean by less than one percentage point. This suggests that the model is stable and not highly sensitive to which subset of the training data it is fitted on.
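A common way to check both conclusions at once is to compare training accuracy with validation accuracy inside the same cross-validation run: a high training error points to bias, and a large gap between training and validation error points to variance. The sketch below is one possible way to do this, under stated assumptions: it uses scikit-learn's built-in `load_breast_cancer` data as a stand-in for the notebook's CSV (which is not available here), and it wraps the scaler and the classifier in a pipeline so that scaling is refitted inside each fold. The names `X_demo`, `y_demo`, `train_error`, `val_error` and `gap` are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption: not the same file as breast_cancer.csv)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler on each training fold, so no validation data leaks into scaling
model = make_pipeline(StandardScaler(),
                      LogisticRegression(random_state=0, max_iter=1000))

scores = cross_validate(model, X_demo, y_demo, cv=5,
                        scoring='accuracy', return_train_score=True)

train_error = 1 - scores['train_score'].mean()  # high value would suggest underfitting
val_error = 1 - scores['test_score'].mean()     # error on the held-out folds
gap = val_error - train_error                   # large gap would suggest overfitting

print(f"Training error:   {train_error:.4f}")
print(f"Validation error: {val_error:.4f}")
print(f"Gap:              {gap:.4f}")
```

If both errors are small and the gap is close to zero, that is consistent with the low-bias, low-variance reading given above.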