1. What is a parameter?

A parameter is an internal variable of the model whose value the learning algorithm estimates from data during training (unlike a hyperparameter, which is set before training).
Examples: weights in linear regression, coefficients in logistic regression, or the weights and biases of a neural network.
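
Eg: a minimal sketch of where learned parameters live in scikit-learn. The toy data is made up for illustration and follows y = 2x + 1, so the fitted slope and intercept should come out close to 2 and 1.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that follows y = 2x + 1 exactly
X = [[1], [2], [3], [4]]
y = [3, 5, 7, 9]

model = LinearRegression()
model.fit(X, y)

# The parameters estimated during training:
# one weight per feature (coef_) plus an intercept
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```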

2. What is correlation? What does negative correlation mean?

Correlation measures the strength and direction of a linear relationship between two variables, usually expressed by the correlation coefficient (r) ranging from –1 to +1.
Negative correlation (r < 0) means that when one variable increases, the other tends to decrease.
Example: Hours of exercise vs. body fat percentage.
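
Eg: a small sketch of computing r with NumPy. The exercise and body-fat numbers below are invented purely to illustrate a negative relationship.

```python
import numpy as np

# Made-up sample: more weekly exercise, lower body fat
exercise_hours = np.array([1, 2, 3, 4, 5, 6])
body_fat_pct = np.array([30, 28, 27, 24, 22, 21])

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is the Pearson r between the two arrays
r = np.corrcoef(exercise_hours, body_fat_pct)[0, 1]
print(round(r, 3))  # negative, close to -1
```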

3. Define Machine Learning and its main components.

Machine Learning is the science of enabling computers to learn patterns from data without explicit programming.
Main components:

Data (training & testing sets)

Features (input variables)

Model/Algorithm (e.g., decision tree, neural net)

Loss function

Optimizer (to minimize loss)

Evaluation (metrics such as accuracy, RMSE).

4. How does loss value indicate model quality?

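Lower loss means the model’s predictions are closer to actual outcomes. A consistently high loss signals poor fit. Training loss alone can mislead, though: a very low training loss paired with a much higher loss on held-out data signals overfitting.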
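
Eg: a short sketch comparing the mean squared error (one common loss) of two hypothetical sets of predictions against the same true values. All numbers are made up for illustration.

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, 5.0, 7.0, 9.0]
good_preds = [3.1, 4.9, 7.2, 8.8]   # close to the true values
poor_preds = [5.0, 2.0, 10.0, 6.0]  # far from the true values

good_loss = mean_squared_error(y_true, good_preds)
poor_loss = mean_squared_error(y_true, poor_preds)

# The better predictions produce the smaller loss
print(good_loss, poor_loss)
```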

5. What are continuous and categorical variables?

Continuous: Numeric values with infinite possible points within a range (e.g., temperature, weight).

Categorical: Discrete groups or labels (e.g., gender, city).

6. How do we handle categorical variables in Machine Learning? What are the common techniques?

Categorical variables represent labels or categories (e.g., colors, countries, types). Most ML algorithms need numerical input, so we convert categories into numbers.
Common techniques:

Label Encoding

One-Hot Encoding

Ordinal Encoding (when order matters)

Target/Mean Encoding (advanced).
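
Eg: a sketch of ordinal and target (mean) encoding on a tiny invented DataFrame. The column names and values are assumptions for illustration. In practice the target means should be computed on the training split only, otherwise information about the target leaks into the features.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data
df = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small"],
    "city": ["Paris", "Rome", "Paris", "Rome"],
    "price": [100, 300, 200, 120],
})

# Ordinal encoding: state the category order explicitly
ordinal = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
df["size_encoded"] = ordinal.fit_transform(df[["size"]]).ravel()

# Target (mean) encoding: replace each city by the mean price seen for it
city_means = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(city_means)

print(df)
```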

7. What do you mean by training and testing a dataset?

1. Training (fitting a model on the training set)

Purpose: To teach the machine learning model using historical or sample data.

What Happens: The model learns patterns and relationships between input features and the target/output.

Data Used: Called the training set — usually 70–80% of the total dataset.

Example: You train a model to predict house prices using data like size, location, and number of rooms.

2. Testing (evaluating the model on the test set)

Purpose: To evaluate the model’s performance on unseen data.

What Happens: The model uses what it learned during training to make predictions, and you compare those predictions to actual outcomes.

Data Used: Called the test set — usually 20–30% of the total dataset.

Example: After training the house price model, you test it on new houses to see how well it predicts their prices.
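
Eg: a minimal sketch of the train-then-test cycle. It uses synthetic data generated with NumPy rather than real house prices, so the features, coefficients, and noise level are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: target depends linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train on the training set only
model = LinearRegression().fit(X_train, y_train)

# Evaluate on both sets (R^2 score); the test score estimates
# performance on unseen data
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(train_r2, test_r2)
```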


8. What is sklearn.preprocessing?

sklearn.preprocessing is a module in Scikit-learn (a popular Python machine learning library) that provides tools to prepare and transform data before feeding it into machine learning models.

9. What is a Test set?
A reserved portion of the dataset, typically 20–30%, that remains untouched during training.
It provides an unbiased final evaluation of the model’s performance.

10. How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
test_size=0.2 means 20% of data is for testing.

stratify=y preserves class distribution in classification tasks.
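
Eg: a sketch of what stratify=y does on an imbalanced, made-up label vector (90 negatives, 10 positives). Both splits should keep roughly the same 10% share of positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of the positive class in each split
train_pos = y_train.mean()
test_pos = y_test.mean()
print(train_pos, test_pos)
```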

--> Approach to a Machine Learning Problem

Problem Understanding – Define objective and metrics.

Data Collection – Gather structured/unstructured data.

Data Cleaning – Handle missing values, outliers.

Exploratory Data Analysis (EDA) – Visualize distributions, detect correlations.

Feature Engineering – Transform and create informative features.

Model Selection & Training – Choose algorithms and tune hyperparameters.

Evaluation – Use cross-validation, confusion matrix, etc.

Deployment & Monitoring – Serve predictions, monitor drift and performance.
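
Eg: a compressed sketch of the middle steps (split, preprocessing, training, cross-validation, final evaluation). It uses the Iris dataset bundled with scikit-learn and a scaled logistic regression. Both are illustrative choices, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data collection: a small built-in dataset
X, y = load_iris(return_X_y=True)

# Keep a test set aside before any modelling decisions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model bundled into one pipeline
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Evaluation during development: 5-fold cross-validation on the training set
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Final check on the untouched test set
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```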

11. Why do we have to perform EDA before fitting a model to the data?

Exploratory Data Analysis (EDA) before fitting a model is a critical step in any machine learning workflow, for the following reasons:

1. Understand the Data

Helps you understand the structure, types, and distribution of data.

Example: Are the features numerical or categorical? Are there outliers?

2. Detect Missing or Corrupted Data

Identifies missing values, NaNs, or data entry errors.

You can decide whether to fill, drop, or impute such values before modeling.

3. Uncover Patterns and Relationships

Helps identify correlations and dependencies between variables.

Example: A strong correlation between a feature and the target suggests the feature may be a useful predictor.

4. Identify Outliers and Noise

Outliers can distort model predictions and affect accuracy.

EDA helps you decide whether to remove, transform, or leave them.

5. Feature Selection & Engineering

Reveals which features may be useful, redundant, or irrelevant.

Helps with creating new, more meaningful features.

6. Choose the Right Algorithms

Data insights help you choose the best model type.

Example: If the target is imbalanced → Use special classifiers or resampling.

7. Avoid Wrong Assumptions

Prevents blindly applying algorithms to unsuitable data (e.g., non-numeric columns or heavily skewed distributions).

Reduces chances of garbage-in, garbage-out results.     
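
Eg: a sketch of a few of these checks in pandas. The DataFrame, its column names, and the 1.5 x IQR outlier rule are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data with one missing value and one extreme value
df = pd.DataFrame({
    "size_sqft": [850, 900, 1200, np.nan, 1500, 9000],
    "rooms": [2, 2, 3, 3, 4, 4],
    "city": ["Paris", "Rome", "Paris", "Rome", "Paris", "Rome"],
})

# 1. Understand the data: column types and summary statistics
print(df.dtypes)
print(df.describe())

# 2. Detect missing data: count of NaNs per column
missing_counts = df.isna().sum()
print(missing_counts)

# 4. Identify outliers with the 1.5 * IQR rule
q1, q3 = df["size_sqft"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["size_sqft"] < q1 - 1.5 * iqr) | (df["size_sqft"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```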


12. What is correlation?        

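Correlation quantifies how strongly two variables move together. A correlation matrix, shown as a heatmap, displays it for every pair of numeric columns:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df is assumed to be an existing pandas DataFrame
corr = df.corr(numeric_only=True)  # skip non-numeric columns (pandas 1.5+)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()

df.corr() computes the pairwise Pearson correlation matrix of the numeric columns.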

Heatmaps visualize strength and direction.  

13. What does negative correlation mean?

A negative correlation means that as one variable increases, the other variable tends to decrease, and vice versa.

Direction of Relationship

Inverse relationship

When X goes up, Y tends to go down

When X goes down, Y tends to go up

Correlation Coefficient (r) Range

For a negative correlation, r is between -1 and 0

Closer to -1 → Stronger negative correlation

r = -1 → Perfect negative correlation

r = 0 → No linear correlation

14. How can you find correlation between variables in Python?

Using pandas DataFrame.corr()
import pandas as pd

# Load or create a DataFrame
df = pd.read_csv("your_data.csv")

# Compute correlation matrix (numeric_only=True skips text columns; pandas 1.5+)
correlation_matrix = df.corr(numeric_only=True)

print(correlation_matrix)

By default, it uses Pearson correlation.

You can also specify other methods:

df.corr(method='pearson')

df.corr(method='kendall')

df.corr(method='spearman')
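
Eg: a sketch of why the method matters. On made-up data where y grows monotonically but not linearly with x, Spearman (rank-based) reports a perfect relationship while Pearson (linear) does not.

```python
import pandas as pd

# Hypothetical data: y = x cubed, monotonic but non-linear
df = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6]})
df["y"] = df["x"] ** 3

pearson_r = df.corr(method="pearson").loc["x", "y"]
spearman_r = df.corr(method="spearman").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship gives 1.0
print(pearson_r, spearman_r)
```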

15. What is causation? Explain difference between correlation and causation with an example.

Correlation: Two variables move together.

Causation: One variable directly influences the other.

Example:
Hot weather increases ice-cream sales and swimming, leading to more drowning incidents.
Ice-cream sales correlate with drownings but do not cause them.

To establish causation, controlled experiments or causal inference methods are needed.
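
Eg: a simulated sketch of the ice-cream example. The data is synthetic, and the coefficients and noise levels are assumptions: temperature drives both variables, and neither affects the other. The raw correlation is clearly positive, but it mostly disappears once the effect of temperature is removed from both variables (a simple partial correlation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Temperature is the common cause (confounder)
temperature = rng.normal(size=n)
ice_cream_sales = 2.0 * temperature + rng.normal(size=n)
drownings = 1.0 * temperature + rng.normal(size=n)

# Raw correlation between the two effects
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(target, predictor):
    """Part of target left over after a straight-line fit on predictor."""
    slope, intercept = np.polyfit(predictor, target, 1)
    return target - (slope * predictor + intercept)

# Correlation after controlling for temperature
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(raw_r, partial_r)
```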

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

An optimizer updates model parameters to minimize the loss.

Common Optimizers

Gradient Descent (GD): Updates weights using the entire dataset each step.

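Stochastic Gradient Descent (SGD): Updates using a single sample per step; each step is cheaper, and the added noise can help escape shallow local minima.

Mini-batch GD: Updates using a small batch of samples per step, a balance between GD and SGD.

Adam: Adapts the learning rate per parameter using running averages of the gradient and its square; a widely used default for deep learning.

RMSprop & Adagrad: Adjust learning rates per parameter based on past gradient magnitudes.

Eg (Keras; model is assumed to be an already-defined Keras model):
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])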
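
Eg: a from-scratch sketch of GD, mini-batch GD, and SGD for a one-feature linear regression with squared-error loss. The three differ only in batch size. The synthetic data, learning rate, and epoch count are assumptions chosen so that all three variants converge.

```python
import numpy as np

# Synthetic data following y = 2x + 1 with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.05, size=200)

def fit_linear(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Minimise mean squared error of w*x + b with (mini-batch) gradient descent."""
    shuffle_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(y)
    for _ in range(epochs):
        order = shuffle_rng.permutation(n)  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = w * X[idx, 0] + b - y[idx]
            # Gradients of the mean squared error on this batch
            grad_w = 2.0 * np.mean(error * X[idx, 0])
            grad_b = 2.0 * np.mean(error)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

w_gd, b_gd = fit_linear(X, y, batch_size=len(y))  # full-batch gradient descent
w_mb, b_mb = fit_linear(X, y, batch_size=32)      # mini-batch gradient descent
w_sgd, b_sgd = fit_linear(X, y, batch_size=1)     # stochastic gradient descent

print(w_gd, b_gd)
print(w_mb, b_mb)
print(w_sgd, b_sgd)
```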

17. What is sklearn.linear_model?

sklearn.linear_model is a module in Scikit-learn that contains linear models for both regression and classification tasks.

It provides tools to model relationships between input variables (features) and the target variable (output) using linear equations.
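
Eg: a sketch of a classifier from this module. LogisticRegression is one of the linear classifiers it provides. The study-hours data is invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> passed (1) or failed (0)
hours = [[1], [2], [3], [4], [6], [7], [8], [9]]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(hours, passed)

# Predict class labels for two new students
preds = clf.predict([[2], [8]])
print(preds)
```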

18. What does model.fit() do? What arguments must be given?

The .fit() method trains a machine learning model on your dataset.
It learns the patterns in the data by finding the best model parameters.

It builds the internal structure of the model (e.g., coefficients in linear regression, tree splits in decision trees).

A model must be fitted before you can call .predict() or .score().

model.fit(X, y)

X: feature matrix of shape (n_samples, n_features).

y: target values, one per sample. Unsupervised estimators (e.g., clustering, scalers) need only X.

Eg:
from sklearn.linear_model import LinearRegression

# Sample data
X = [[1], [2], [3], [4]]   # Features
y = [2, 4, 6, 8]           # Target

# Create model
model = LinearRegression()

# Train the model
model.fit(X, y)

19. What does model.predict() do? What arguments must be given?

The .predict() method makes predictions using a trained machine learning model.
It uses the patterns learned in .fit() to predict the target/output for new or unseen data.

It takes one required argument: a feature matrix X with the same number of columns as the training data.

Outputs either:

Numerical values (for regression)

Class labels (for classification)

Eg:
from sklearn.linear_model import LinearRegression

# Training data
X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

# Train model
model = LinearRegression()
model.fit(X, y)

# Predict for new input
X_new = [[5], [6]]
predictions = model.predict(X_new)

print(predictions)  # Output: [10. 12.]


20. What are continuous and categorical variables?

1. Continuous Variables

Definition: Variables that can take any numeric value within a range (including decimals).

Also called: Quantitative or numerical variables

Can be measured, not just counted

Examples:

Height (e.g., 172.5 cm)

Temperature (e.g., 36.6°C)

Income (e.g., $45,678.90)

Time (e.g., 2.75 hours)

2. Categorical Variables

Definition: Variables that represent categories or groups. Values are labels and not numerical in a meaningful way.

Also called: Qualitative variables

Can be counted but not measured

Examples: gender, city, color


21. What is feature scaling? How does it help in Machine Learning?

Feature scaling is the process of standardizing or normalizing the range of independent variables (features) in your dataset.
Different features can have different units and scales (e.g., age in years vs. income in thousands).

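Algorithms that compute distances (e.g., k-nearest neighbours, k-means, SVM) or are trained with gradient descent can be dominated by the features with the largest numeric range and perform poorly if features are on different scales. Tree-based models are largely unaffected.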
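
Eg: a sketch of the effect on Euclidean distance. The five people below (age in years, income) are made up. Before scaling, income differences swamp age differences; after standardization, both features contribute, and the nearest neighbour of the first person changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: [age in years, income]
X = np.array([
    [25, 50_000],   # person 0
    [60, 51_000],   # person 1: similar income, very different age
    [26, 53_000],   # person 2: similar age, slightly higher income
    [40, 90_000],
    [50, 20_000],
], dtype=float)

def distance(a, b):
    return np.linalg.norm(a - b)

# Raw features: income dominates, so person 1 looks closer to person 0
raw_d01 = distance(X[0], X[1])
raw_d02 = distance(X[0], X[2])

# Standardized features: person 2 is now the closer one
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = distance(X_scaled[0], X_scaled[1])
scaled_d02 = distance(X_scaled[0], X_scaled[2])

print(raw_d01, raw_d02)
print(scaled_d01, scaled_d02)
```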


22. How do we perform scaling in Python?

Common Scaling Techniques in Python
1. Standardization (StandardScaler)

Scales features to have mean = 0 and standard deviation = 1
Eg: 
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Example data: 2D array (samples x features)
X = [[1, 50], [2, 60], [3, 70]]

# Fit scaler on data and transform it
X_scaled = scaler.fit_transform(X)

print(X_scaled)

2. Normalization (MinMaxScaler)

Scales features to a fixed range (default 0 to 1)

eg:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X = [[1, 50], [2, 60], [3, 70]]

X_scaled = scaler.fit_transform(X)

print(X_scaled)
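
Eg: a sketch of scaling together with a train/test split. The scaler is fitted on the training rows only and then reused on the test rows, so no statistics from the test set leak into training. The data is the same style of toy array as above, extended to six rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.array(
    [[1, 50], [2, 60], [3, 70], [4, 80], [5, 90], [6, 100]], dtype=float
)

X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled)
print(X_test_scaled)
```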


23. What is sklearn.preprocessing?

sklearn.preprocessing is a module in Scikit-learn that provides tools to prepare and transform your data before feeding it into machine learning models. It includes functions and classes for tasks like:

Scaling features (e.g., StandardScaler, MinMaxScaler)

Encoding categorical variables (e.g., OneHotEncoder and OrdinalEncoder for features, LabelEncoder for target labels)

Normalizing data (e.g., Normalizer)

Generating polynomial features (e.g., PolynomialFeatures)

These preprocessing steps help improve model performance and ensure the data is in the right format for algorithms.

24. How do we split data for model fitting (training and testing) in Python?

Splitting data into training and testing sets is a crucial step to evaluate your machine learning model’s performance on unseen data. Here’s how to do it in Python using Scikit-learn:

Using train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Suppose X = features, y = target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

25. Explain data encoding.

Data encoding is the process of transforming categorical variables (non-numeric data) into a numerical format that machine learning models can understand and use.

Most ML algorithms require numerical input.

Most models can’t work directly with text labels like "Red", "Blue", or "Small".

Encoding converts categories into numbers while preserving information.

Eg:
from sklearn.preprocessing import LabelEncoder

# Label Encoding example
le = LabelEncoder()
labels = ['Red', 'Blue', 'Green']
encoded_labels = le.fit_transform(labels)  # Output: [2, 0, 1]

# One-Hot Encoding example (using pandas for convenience)
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
one_hot_encoded = pd.get_dummies(df['Color'])
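
Eg: a sketch of the same one-hot idea with scikit-learn's OneHotEncoder, which remembers the categories seen during fit and can be reused on new data. handle_unknown="ignore" is one possible choice here: it encodes an unseen category as all zeros instead of raising an error. The colour values are made up.

```python
from sklearn.preprocessing import OneHotEncoder

# Training categories (one column, so each row is a one-element list)
train_colors = [["Red"], ["Blue"], ["Green"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_colors)

# Learned categories are stored in sorted order
categories = encoder.categories_[0].tolist()

# "Purple" was not seen during fit, so its row is all zeros
encoded = encoder.transform([["Blue"], ["Purple"]]).toarray()

print(categories)
print(encoded)
```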
