# Scikit-learn

'''
What is Scikit-learn?
Scikit-learn (sklearn) is a popular Python library for machine learning.

It includes tools for:
  Classification (e.g., spam detection)
  Regression (e.g., price prediction)
  Clustering (e.g., customer segmentation)
  Dimensionality reduction
  Model evaluation and selection
'''

In [8]:
#Install Scikit-learn
!pip install scikit-learn



In [9]:
!python -m pip install --upgrade pip



In [10]:
#Basic Imports from Scikit-learn
from sklearn import datasets  # to load built-in datasets
from sklearn.model_selection import train_test_split  # to split data
from sklearn.preprocessing import StandardScaler  # to scale features
from sklearn.linear_model import LinearRegression  # ML model
from sklearn.metrics import mean_squared_error  # evaluation
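
As a rough sketch of how these five imports fit together, the cell below loads the diabetes dataset, splits it, scales the features, fits a linear regression, and computes the test error. The variable names are illustrative, and the error value depends on the split, so none is quoted here.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load features and target directly as NumPy arrays
X, y = datasets.load_diabetes(return_X_y=True)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn the scaling from the training rows only, then apply it to both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the model and evaluate it on the held-out rows
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
print("Test MSE:", mse)
```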

In [12]:
#Built-in Datasets in Scikit-learn
from sklearn.datasets import load_iris, load_diabetes, load_digits
# Note: load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so we use other datasets such as load_diabetes() or load_iris().

In [13]:
#Let’s Try It – Load a Sample Dataset
from sklearn.datasets import load_diabetes

# Load dataset
diabetes = load_diabetes()

# View available keys
print(diabetes.keys())

# View feature names
print("Feature names:", diabetes.feature_names)

# Shape of the data
print("Data shape:", diabetes.data.shape)
print("Target shape:", diabetes.target.shape)

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename', 'data_module'])
Feature names: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
Data shape: (442, 10)
Target shape: (442,)
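
The 'frame' key listed above is only populated when the loader is called with as_frame=True. A minimal sketch, assuming pandas is installed (the name diabetes_df is just illustrative):

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks the loader to return pandas objects;
# .frame holds the ten feature columns plus a 'target' column
diabetes_df = load_diabetes(as_frame=True).frame

print(diabetes_df.shape)
print(diabetes_df.head())
```

A DataFrame can be handier than the raw arrays for a first look at the data, since the column names travel with the values.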


'''
Data Preparation in Scikit-learn

We'll go through the following:
  Splitting the dataset into training and testing,
  Scaling the features

Why Split the Data?
To train and evaluate a model properly:
  Training data is used to teach the model.
  Testing data is used to check how well the model performs on unseen data.
'''

In [15]:
#Splitting the Data using train_test_split
from sklearn.model_selection import train_test_split

# Features (X) and target (y)
# data (X): the features, or inputs, used to predict
# target (y): the output, or label, that we want to predict
X = diabetes.data    # shape (442, 10): each row is a patient, each column is a feature (age, BMI, BP, etc.)
y = diabetes.target  # shape (442,): disease progression measure (real numbers)

# Split into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
'''
test_size=0.2 → 20% of the data for testing (89 samples), the rest for training (353 samples)
random_state=42 → Fixes the random seed so that the same split occurs on every run (for reproducibility)
Returns:
X_train → Feature data for training (353 × 10)
X_test → Feature data for testing (89 × 10)
y_train → Target values for training (353,)
y_test → Target values for testing (89,)
'''

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)  # 20% of 442 = 88.4, which is rounded up to 89 test samples

Train shape: (353, 10)
Test shape: (89, 10)
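
The 353/89 split can be reproduced by hand. The sketch below assumes that a fractional test_size is turned into a sample count by taking the ceiling, with the training set receiving the remainder; that assumption matches the shapes printed above.

```python
import math

n_samples = 442
test_size = 0.2

# 0.2 * 442 = 88.4, and the ceiling of that is the test-set size
n_test = math.ceil(test_size * n_samples)

# Whatever is left over goes to the training set
n_train = n_samples - n_test

print("Test samples:", n_test)
print("Train samples:", n_train)
```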


In [17]:
#Feature Scaling using StandardScaler
#Why? Features should be on the same scale, especially for ML algorithms like SVM or KNN.
from sklearn.preprocessing import StandardScaler

# Create a scaler
scaler = StandardScaler()

# Fit on the training data, then transform both train and test
X_train_scaled = scaler.fit_transform(X_train)  # fit computes the mean and std of the training data; transform uses them to scale it
X_test_scaled = scaler.transform(X_test)  # Important: only call transform on the test data, so it is scaled with the training data's statistics (mean and std), not its own; this prevents data leakage

print("Scaled training data (first row):", X_train_scaled[0])

Scaled training data (first row): [ 1.49836523  1.06136988  0.21990201  1.13887373  0.72847289  1.05589332
 -0.82445065  0.71103773  0.54748197 -0.06144896]
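
To see why scale matters for distance-based methods such as KNN, here is a small sketch on made-up data. The four "patients", their age and insulin values, and the helper nearest_to_first are all hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical patients: columns are [age, serum insulin]
X_toy = np.array([
    [25.0, 100.0],
    [75.0, 105.0],
    [26.0, 250.0],
    [50.0,  20.0],
])

def nearest_to_first(X):
    """Return the index of the row closest (Euclidean) to row 0, excluding row 0."""
    distances = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(distances)) + 1

# On raw values, insulin's larger numeric range drives the distance
raw_nn = nearest_to_first(X_toy)

# After standardization, both columns contribute on a comparable scale
scaled_nn = nearest_to_first(StandardScaler().fit_transform(X_toy))

print("Nearest neighbour of row 0 (raw):   ", raw_nn)
print("Nearest neighbour of row 0 (scaled):", scaled_nn)
```

On this toy data the nearest neighbour of the first patient changes once the columns are standardized, which is the effect scaling is meant to control.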


In [18]:
'''
Why Should We Scale Features?
Many ML algorithms are sensitive to the scale (range) of their input features. If the scales differ, features with larger values can dominate the learning process.

For Example:
If you have:
  age: ranges from 20–80
  blood pressure: ranges from 80–150
  serum insulin: ranges from 0.1–300
The algorithm might think insulin is more important just because its numbers are larger, which is misleading.

Especially Important For:
  K-Nearest Neighbors (KNN) → Uses distance between points
  Support Vector Machine (SVM) → Uses dot product and distances
  Gradient Descent based models → Better convergence

What Does StandardScaler Do?
It transforms the features so that each column (feature) has:
  Mean = 0
  Standard Deviation = 1
This is called Z-score normalization or standardization.
Formula: z = (x - μ) / σ, where x is the original value, μ is the mean of the feature, and σ is its standard deviation
'''

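
The z-score formula can be checked against StandardScaler on a tiny array. This is a sketch on made-up numbers (X_demo is hypothetical); it relies on StandardScaler using the population standard deviation, which is also NumPy's default (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on different scales
X_demo = np.array([
    [20.0,  80.0],
    [50.0, 120.0],
    [80.0, 150.0],
])

# Manual z-score: subtract the column mean, divide by the column std
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)  # population std (ddof=0)
z_manual = (X_demo - mu) / sigma

# The same transformation via scikit-learn
scaler = StandardScaler()
z_sklearn = scaler.fit_transform(X_demo)

print("Manual and sklearn agree:", np.allclose(z_manual, z_sklearn))
print("Column means after scaling:", z_sklearn.mean(axis=0))
print("Column stds after scaling: ", z_sklearn.std(axis=0))
```

After fitting, the learned statistics are available as scaler.mean_ and scaler.scale_, which is what transform reuses on the test data.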