# **Assignment Questions**

**Q1. What is a parameter?**

In machine learning, a parameter is a value that the model learns from the training data to make predictions.

**Examples:**

* In linear regression, the slope (weights) and intercept are parameters.

* In neural networks, the weights and biases in each layer are parameters.

These values are adjusted during training to minimize the error between the model's predictions and the actual outcomes.
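
As a small illustration, the sketch below fits a straight line to made-up data with NumPy's `np.polyfit`. The data and variable names are assumptions chosen for the example; the returned slope and intercept are the learned parameters.

```python
import numpy as np

# Hypothetical training data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The values of `slope` and `intercept` are not set by hand; they are estimated from the data, which is what distinguishes parameters from settings chosen before training.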

**Q2. What is correlation, and what does negative correlation mean?**

**Correlation** is a statistical measure that shows how two variables are related.

* If both increase or decrease together, it's a positive correlation.
* If one increases while the other decreases, it's a negative correlation.
* If there is no linear relationship, the correlation is close to zero.
* It is usually measured by the Pearson correlation coefficient, a value between -1 and +1.

**Negative correlation** means that as one variable increases, the other variable decreases.
In other words, they move in opposite directions.
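
To make the sign of the coefficient concrete, here is a minimal sketch that computes the Pearson coefficient from its definition in plain Python. The `pearson` helper and the three small data lists are hypothetical and exist only for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours_studied = [1, 2, 3, 4, 5]        # hypothetical data
exam_score = [52, 58, 65, 70, 78]      # rises with hours -> positive correlation
hours_of_tv = [9, 7, 6, 4, 2]          # falls with hours -> negative correlation

r_pos = pearson(hours_studied, exam_score)
r_neg = pearson(hours_studied, hours_of_tv)
print(round(r_pos, 3), round(r_neg, 3))
```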

**Q3. Define Machine Learning. What are the main components in Machine Learning?**

**Machine Learning (ML)** is a branch of artificial intelligence that enables systems to learn from data and improve their performance without being explicitly programmed.

**Main Components of Machine Learning:**

1. **Data**
Raw input used for training and testing the model.

2. **Model**
The mathematical structure (like a decision tree, neural network, etc.) that makes predictions or decisions based on data.

3. **Features**
The input variables or attributes extracted from data that are used for learning.

4. **Labels (Targets)**
The correct outputs used in supervised learning to guide the model.

5. **Algorithm**
The procedure (e.g., linear regression, gradient boosting) that trains the model by adjusting its parameters.

6. **Training**
The process of feeding data into the model and adjusting its parameters to minimize prediction errors.

7. **Evaluation**
Testing the model’s accuracy and performance on unseen data using metrics like accuracy, precision, recall, etc.

8. **Prediction**
Using the trained model to make decisions or forecasts on new, unseen data.

**Q4. How does loss value help in determining whether the model is good or not?**

The **loss value** measures how far the model's predictions are from the actual values. It is the quantity that training tries to minimize.

A low loss means the model's predictions are close to the true values, indicating better performance.

A high loss means the predictions are far from the true values, indicating poor performance.

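By monitoring the loss value, we can tell if the model is learning well (loss decreases) or if it needs improvement (loss stays high or fluctuates).
Lower loss generally means a better fit, but a low training loss alone is not enough: if the loss on held-out validation data is much higher than the training loss, the model is overfitting.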
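
As an illustration, the sketch below computes one common loss, the mean squared error, for two sets of predictions. The target values and both prediction lists are made up for the example.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]         # hypothetical targets
good_preds = [3.1, 4.9, 7.2, 8.8]     # close to the targets
poor_preds = [5.0, 2.0, 10.0, 6.0]    # far from the targets

loss_good = mean_squared_error(y_true, good_preds)
loss_poor = mean_squared_error(y_true, poor_preds)
print(f"good model loss: {loss_good:.3f}")
print(f"poor model loss: {loss_poor:.3f}")
```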

**Q5. What are continuous and categorical variables?**

**Continuous variables** are numerical and can take any value within a range **(e.g., height, weight, temperature).**
They are measurable and often involve decimals or fractions.

**Categorical variables** represent groups or categories **(e.g., gender, color, product type).**
They can be nominal (no order) or ordinal (with order).


Categorical variables are used for grouping and classification; arithmetic on them (such as averaging colors) is not meaningful.

**Q6. How do we handle categorical variables in Machine Learning? What are the common techniques?**

**Handling Categorical Variables in Machine Learning**

Categorical variables must be **converted into numeric form** to be used in most ML models.

**Common Techniques:**
1. **Label Encoding**

* Converts categories into integers

* Best for ordinal data (with order); on nominal data the integers imply an order that does not exist

**Example: Low = 0, Medium = 1, High = 2**

2. **One-Hot Encoding**

* Creates separate binary columns for each category

* Best for nominal data (no order)

**Example: Color → [Red, Green, Blue] becomes 3 columns with 0s and 1s**

3. **Ordinal Encoding**

* Similar to label encoding but explicitly maintains order

* Used when order matters and can be manually defined

4. **Binary Encoding / Target Encoding (less common)**

* Advanced methods for high-cardinality categories (many unique values)


These techniques are typically applied using sklearn.preprocessing or libraries like pandas.
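
A minimal sketch of the first two ideas using pandas. The DataFrame, its column names, and the `size_order` mapping are hypothetical; the ordinal column is mapped by hand so that the order is explicit, and the nominal column is one-hot encoded with `pd.get_dummies`.

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["Low", "High", "Medium", "Low"],
    "color": ["Red", "Green", "Blue", "Red"],
})

# Ordinal feature: map each level to an integer that respects the order
size_order = {"Low": 0, "Medium": 1, "High": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal feature: one binary column per category
color_dummies = pd.get_dummies(df["color"], prefix="color", dtype=int)

encoded = pd.concat([df[["size_encoded"]], color_dummies], axis=1)
print(encoded)
```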

**Q7. What do you mean by training and testing a dataset?**

**Training on a dataset** means fitting a machine learning model on one portion of the data so that it learns patterns and relationships.

**Testing on a dataset** means evaluating the trained model on a separate portion of the data that it has not seen.

This shows how well the model generalizes and helps detect overfitting (memorizing the training data).

**Q8. What is sklearn.preprocessing?**

*sklearn.preprocessing* is a module in Scikit-learn used for scaling, transforming, and encoding data before training a machine learning model.

It helps prepare the data to improve model accuracy and performance.

**Common Tools in sklearn.preprocessing:**
* **StandardScaler** - Standardizes features (mean = 0, std = 1)

* **MinMaxScaler** - Scales features to a fixed range (e.g., 0 to 1)

* **LabelEncoder** - Converts categorical labels to numeric form

* **OneHotEncoder** - Converts categories to binary columns

* **PolynomialFeatures** - Generates polynomial and interaction terms

These tools are essential for transforming and normalizing data before model training.
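
The sketch below exercises two of the listed tools on tiny made-up inputs. The label list and the single-row matrix are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures

# LabelEncoder: maps class labels to integers (classes are sorted first)
labels = ["spam", "ham", "spam", "ham"]
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
print(label_encoder.classes_, encoded_labels)

# PolynomialFeatures: adds a bias column, squares, and interaction terms
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)   # columns: 1, x1, x2, x1^2, x1*x2, x2^2
print(X_poly)
```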

**Q9. What is a Test set?**

A **test set** is a portion of the dataset kept separate from training, used to **evaluate the performance** of a trained machine learning model.

It contains **unseen data** that helps check how well the model generalizes to new examples, ensuring it doesn’t just memorize the training data.

**Q10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?**

**We can use train_test_split from Scikit-learn:**
    
    from sklearn.model_selection import train_test_split

    # X = features, y = target variable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

* *test_size=0.2* means 20% of data is for testing, 80% for training

* *random_state* ensures reproducibility

**Approach to a Machine Learning Problem**

1. **Understand the problem:** Define objectives and success criteria.

2. **Collect and explore data:** Perform EDA to understand data quality and patterns.

3. **Preprocess data:** Handle missing values, encode categorical variables, scale features.

4. **Split data:** Create training and testing sets.

5. **Choose a model:** Select algorithm(s) based on the problem type (regression, classification).

6. **Train the model:** Fit on training data.

7. **Evaluate the model:** Use test data and metrics (accuracy, RMSE, etc.) to assess performance.

8. **Tune and optimize:** Adjust hyperparameters or try different models.

9. **Deploy and monitor:** Use the model in production and track performance over time.
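
The middle steps of this workflow can be sketched end to end as follows. This is one possible arrangement, not a prescribed recipe: the Iris dataset bundled with Scikit-learn, the choice of `LogisticRegression`, and the use of a pipeline are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a small, already-clean dataset bundled with Scikit-learn
X, y = load_iris(return_X_y=True)

# Step 4: hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3, 5, 6: scale the features, then fit a classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 7: evaluate on data the model has not seen
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the test set leaks into preprocessing.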

**Q11. Why do we have to perform EDA before fitting a model to the data?**

We perform **Exploratory Data Analysis (EDA)** before fitting a model to:

* **Understand the data’s structure, patterns, and relationships.**

* **Detect missing values, outliers, or errors** that might affect the model

* **Choose relevant features** and decide how to preprocess them

* **Get insights** that help select the right model and improve accuracy

* **Avoid surprises** by knowing the data well before modeling

In short, EDA helps ensure better, more reliable machine learning results.
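
A few typical first checks can be sketched with pandas. The DataFrame below is invented for the example and deliberately contains one missing value and one suspicious outlier; the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one suspicious outlier
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51, 29],
    "income": [40_000, 52_000, 61_000, 58_000, 1_000_000, 45_000],
})

print(df.describe())                 # summary statistics per column
missing_counts = df.isna().sum()     # missing values per column
print(missing_counts)

# Flag outliers with the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```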

**Q12. What is correlation?**

**Correlation** is a statistical measure that shows the **strength and direction** of the relationship between two variables.

* If both variables increase or decrease together, they have a **positive correlation.**

* If one increases while the other decreases, they have a **negative correlation.**

* A correlation close to **0** means no linear relationship.

Correlation values range from **-1 to +1**, where:

* **+1** = perfect positive correlation

* **0** = no linear correlation

* **-1** = perfect negative correlation

**Q13. What does negative correlation mean?**

**Negative correlation** means that as one variable increases, the other decreases.

**They move in opposite directions.**

**Example:**
As the **price of a product increases,** its sales may decrease.
This shows a negative relationship between price and sales.

**Q14. How can you find correlation between variables in Python?**

We can find the correlation between variables in Python using **Pandas** or **NumPy**.

    import pandas as pd

    # Assuming df is your DataFrame.
    # numeric_only=True (pandas 1.5+) skips non-numeric columns.
    correlation_matrix = df.corr(numeric_only=True)
    print(correlation_matrix)

**For correlation between two specific variables:**

    correlation = df['variable1'].corr(df['variable2'])
    print(correlation)

**Using NumPy:**

    import numpy as np

    correlation = np.corrcoef(array1, array2)[0, 1]
    print(correlation)

This helps you understand how strongly variables are related.
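
For a version that runs as-is, the sketch below builds a small DataFrame and applies the same calls. The `price` and `sales` figures are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: sales fall as the price rises
df = pd.DataFrame({
    "price": [10, 12, 14, 16, 18, 20],
    "sales": [200, 185, 160, 150, 120, 100],
})

corr_matrix = df.corr()                                  # pairwise Pearson correlations
r_pandas = df["price"].corr(df["sales"])                 # a single pair
r_numpy = np.corrcoef(df["price"], df["sales"])[0, 1]    # same value via NumPy

print(corr_matrix)
print(f"price vs sales: {r_pandas:.3f}")
```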

**Q15. What is causation? Explain difference between correlation and causation with an example.**

**Causation** means that **one variable directly causes a change in another.**
It implies a **cause-and-effect** relationship.


**Correlation** means two variables change together, but one does not necessarily cause the other.


**Example:**
Ice cream sales and sunburns are correlated because both rise in summer, but neither causes the other; hot, sunny weather drives both.
Smoking causes lung disease: that is causation.
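
The ice cream example can be mimicked with a small simulation. Everything here is an assumption made for illustration: the temperature range, the coefficients, and the noise levels are invented, and the `residuals` helper is hypothetical. The point is only that two variables driven by a common cause correlate strongly, and that the correlation largely disappears once the common cause is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: temperature drives both quantities
temperature = rng.uniform(15, 35, size=500)
ice_cream_sales = 2.0 * temperature + rng.normal(0, 1, size=500)
sunburn_cases = 0.5 * temperature + rng.normal(0, 1, size=500)

# The two outcomes are strongly correlated with each other...
r_raw = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# ...but not once the effect of temperature is removed from both
r_adjusted = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(sunburn_cases, temperature),
)[0, 1]

print(f"raw correlation: {r_raw:.2f}")
print(f"after controlling for temperature: {r_adjusted:.2f}")
```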

**Q16. What is an Optimizer? What are different types of optimizers? Explain each with an example.**

An **optimizer** is an algorithm used in machine learning, especially in training models like neural networks, to minimize the loss function by adjusting the model's parameters (weights and biases).
Its goal is to find the best set of parameters that reduces prediction errors.

**Different Types of Optimizers:**
1. **Gradient Descent (GD)**

* Updates parameters by moving in the direction of the negative gradient of the loss function.

* Example: For a simple linear regression, it updates weights step-by-step to reduce error.

* Limitation: Can be slow on large datasets, because every update uses the entire dataset.

2. **Stochastic Gradient Descent (SGD)**

* Similar to GD but updates parameters using one data point at a time.

* Faster and works well with large datasets but can be noisy.

3. **Mini-batch Gradient Descent**

* Updates parameters using a small batch of data points (between 1 and total dataset size).

* Balances speed and stability.

4. **Adam (Adaptive Moment Estimation)**

* Combines ideas from Momentum and RMSProp optimizers.

* Adjusts learning rates for each parameter adaptively using estimates of first and second moments of gradients.

* Widely used due to fast convergence and good performance.

* Example: Used in training deep neural networks.

5. **RMSProp**

* Adapts the learning rate for each parameter by dividing it by the square root of a moving average of recent squared gradients.

* Works well for non-stationary objectives.
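
The update rules above can be sketched in NumPy on a toy problem: fitting `y = 3x + 2` with a mean squared error loss. This is a teaching sketch under stated assumptions, not how a deep learning library implements its optimizers; the function names, learning rates, and step counts are hypothetical choices. The first function covers batch, stochastic, and mini-batch gradient descent through its `batch_size` argument; the second applies the Adam update with its commonly used default decay rates.

```python
import numpy as np

def fit_line(x, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Fit y ~ w*x + b by gradient descent on the mean squared error.

    batch_size == len(x) -> batch gradient descent
    batch_size == 1      -> stochastic gradient descent (SGD)
    anything in between  -> mini-batch gradient descent
    """
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)            # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * x[idx] + b - y[idx]     # prediction error on the batch
            w -= lr * 2.0 * np.mean(err * x[idx])   # step against dL/dw
            b -= lr * 2.0 * np.mean(err)            # step against dL/db
    return w, b

def fit_line_adam(x, y, lr=0.05, steps=2000, beta1=0.9, beta2=0.999, eps=1e-8):
    """Same problem, full-batch gradients, Adam update rule."""
    params = np.zeros(2)                      # [w, b]
    m = np.zeros(2)                           # first-moment estimate
    v = np.zeros(2)                           # second-moment estimate
    for t in range(1, steps + 1):
        err = params[0] * x + params[1] - y
        grad = np.array([2.0 * np.mean(err * x), 2.0 * np.mean(err)])
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        params -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params[0], params[1]

x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 2.0                             # noise-free toy data

results = {
    "batch GD": fit_line(x, y, batch_size=len(x)),
    "SGD": fit_line(x, y, batch_size=1),
    "mini-batch": fit_line(x, y, batch_size=10),
    "Adam": fit_line_adam(x, y),
}
for name, (w, b) in results.items():
    print(f"{name:>10}: w = {w:.3f}, b = {b:.3f}")
```

On this noise-free problem all four variants should end up close to `w = 3` and `b = 2`; they differ in how many samples each update looks at and in how the step size is chosen.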

**Q17. What is sklearn.linear_model?**

**sklearn.linear_model** is a module in **Scikit-learn** that provides **linear models** for regression and classification tasks.

**Common models in sklearn.linear_model:**
* **LinearRegression** - For predicting continuous values (e.g., house prices)

* **LogisticRegression** - For binary or multi-class classification (e.g., spam detection)

* **Ridge, Lasso** - Variants of linear regression with regularization to prevent overfitting

It's widely used for building simple and interpretable machine learning models.
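
To illustrate the effect of regularization, the sketch below fits `LinearRegression` and `Ridge` to the same made-up data. The data and the value `alpha=10.0` are assumptions chosen so the shrinkage is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data: y is roughly 3 * x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 5.9, 9.2, 11.8, 15.1])

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha sets the regularization strength

print(f"LinearRegression slope: {ols.coef_[0]:.3f}")
print(f"Ridge slope:            {ridge.coef_[0]:.3f}")
```

The Ridge slope is pulled toward zero relative to the unregularized fit, which is the mechanism that limits overfitting when there are many features.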

**Q18. What does model.fit() do? What arguments must be given?**

*model.fit()* is used to train a machine learning model by learning from the training data.
It adjusts the model’s parameters to best fit the relationship between input features and target outputs.

**Arguments you must provide:**
* Input data (features): Usually a 2D array, DataFrame, or similar structure (X_train)

* Target labels: Corresponding output values (y_train)

* Any further arguments are optional and depend on the model or library, such as sample weights, number of iterations, or validation data.

**Q19. What does model.predict() do? What arguments must be given?**

*model.predict()* is used to make predictions using a trained machine learning model on new or unseen data.

**Arguments Required:**
* You must provide input data (features) in the same format and structure as used during training.

* Typically passed as a NumPy array, list of lists, or Pandas DataFrame.
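
A minimal sketch of `fit()` followed by `predict()`, assuming a Scikit-learn `LinearRegression` model and made-up data that follows `y = 2x + 1`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: X must be 2D (n_samples, n_features); y holds the targets
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])      # follows y = 2x + 1

model = LinearRegression()
model.fit(X_train, y_train)                   # learns coef_ and intercept_

# New inputs must have the same number of columns as X_train
X_new = np.array([[5.0], [6.0]])
predictions = model.predict(X_new)
print(predictions)
```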

**Q20. What are continuous and categorical variables?**

**Continuous variables** are numerical values that can take any value within a range, like height or temperature.
They are measurable and often include decimals.

**Categorical variables** represent categories or groups, like gender or color.
They can be nominal (no order) or ordinal (ordered), and are usually non-numeric or encoded as numbers.

**Q21. What is feature scaling? How does it help in Machine Learning?**

**Feature scaling** is the process of standardizing or normalizing the range of independent variables (features) in a dataset.

**How It Helps in Machine Learning:**
* Ensures **all features contribute equally** to the model (especially distance-based ones like KNN, SVM).

* Speeds up **model training** and improves **convergence** for gradient-based algorithms.

* Prevents features with large values from **dominating** those with smaller ones.

**Common methods:**

* *StandardScaler* (mean = 0, std = 1)

* *MinMaxScaler* (scales to 0-1 range)
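
The arithmetic behind both methods can be written out by hand in NumPy. The `heights_cm` array is a made-up feature used only for illustration.

```python
import numpy as np

heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # hypothetical feature

# Standardization: subtract the mean, divide by the standard deviation
standardized = (heights_cm - heights_cm.mean()) / heights_cm.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max = (heights_cm - heights_cm.min()) / (heights_cm.max() - heights_cm.min())

print(standardized)
print(min_max)
```

Note that `np.std` uses the population standard deviation by default, which matches what StandardScaler computes.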

**Q22. How do we perform scaling in Python?**

We can perform feature scaling in Python easily using Scikit-learn’s preprocessing module.

**Example with StandardScaler (standardization):**

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)  # data can be a NumPy array or DataFrame

**Example with MinMaxScaler (scaling to range 0-1):**

    from sklearn.preprocessing import MinMaxScaler

    scaler = MinMaxScaler()
    scaled_data = scaler.fit_transform(data)
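
When the data is split into training and testing sets, a common practice is to fit the scaler on the training rows only and reuse it on the test rows, so that test-set statistics do not leak into training. A minimal sketch, with a made-up feature matrix and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(10, 2)   # hypothetical feature matrix
y = np.array([0, 1] * 5)                        # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse them; do not refit on test rows
```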

**Q23. What is sklearn.preprocessing?**

*sklearn.preprocessing* is a module in Scikit-learn used to prepare and transform data before feeding it into a machine learning model.

**What It Does:**
* **Scales numerical features** (e.g., StandardScaler, MinMaxScaler)

* **Encodes categorical variables** (e.g., LabelEncoder, OneHotEncoder)

* **Transforms features** (e.g., PolynomialFeatures, Normalizer)

It ensures the data is in the right format and scale for optimal model performance. Missing values are handled separately, by the *sklearn.impute* module (e.g., SimpleImputer).

**Q24. How do we split data for model fitting (training and testing) in Python?**

**We can split data into training and testing sets using Scikit-learn’s train_test_split function:**

    from sklearn.model_selection import train_test_split

    # X = features, y = target
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

* *test_size=0.2* means 20% of data is for testing, 80% for training

* *random_state* ensures reproducibility
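
For classification with imbalanced classes, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses a made-up dataset with an 80/20 class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Class counts in each part
print(np.bincount(y_train), np.bincount(y_test))
```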

**Q25. Explain data encoding?**

**Data encoding** is the process of converting categorical (non-numeric) data into numerical format so it can be used by machine learning models.

**Most ML algorithms work only with numbers, not text or labels.**

**Common Encoding Methods:**
1. **Label Encoding**

* Converts categories to numbers **(e.g., "Male" → 0, "Female" → 1)**

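* Suitable for **ordinal data** and for target labels; on nominal features with more than two categories, the integers imply an order that does not exist.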

2. **One-Hot Encoding**

* Creates binary columns for each category

* Suitable for **nominal (non-ordered) data**

Encoding helps models understand and process categorical data effectively.
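
Both methods can be sketched with Scikit-learn's encoders. The `sizes` and `colors` lists are made-up inputs; `OrdinalEncoder` is used here instead of `LabelEncoder` because it lets the category order be stated explicitly.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: pass the category order explicitly
sizes = [["Low"], ["High"], ["Medium"]]
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
size_codes = ordinal.fit_transform(sizes)

# Nominal feature: one binary column per category (categories are sorted)
colors = [["Red"], ["Green"], ["Blue"], ["Red"]]
one_hot = OneHotEncoder()
color_columns = one_hot.fit_transform(colors).toarray()  # sparse matrix by default

print(size_codes.ravel())
print(one_hot.categories_[0])
print(color_columns)
```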