# Theory Questions

1. What is a parameter?
   - A parameter is an internal variable of a machine learning model that is learned from the training data.
   - It defines the relationship between input features and the predicted output.
   - Examples include the weights in a linear regression model or the split points in a decision tree.
   - Parameters are updated during training to minimize the model's error.
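
As a small illustration of learned parameters, the sketch below fits scikit-learn's `LinearRegression` to toy data generated from y = 2x + 1. The data values are assumptions made up for this example. After fitting, the learned weight and intercept can be read from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (assumed for illustration): y = 2x + 1 with no noise
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

model = LinearRegression()
model.fit(X, y)

# These values were learned from the data, not set by hand
weight = model.coef_[0]
bias = model.intercept_
print(weight, bias)
```

The fitted weight and bias should land very close to 2 and 1, the values used to generate the data.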

2. What is correlation? What does negative correlation mean?
   - Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.
   - It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 means no linear correlation (a nonlinear relationship may still exist).
   - It's commonly used in data analysis to identify linear relationships.
   - Correlation doesn't imply causation.
   - Negative correlation means that as one variable increases, the other decreases.
   - The correlation value lies between -1 and 0. For example, if time spent watching TV increases and academic performance decreases, they have a negative correlation.
   - It shows an inverse relationship.
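
A minimal sketch of the TV example above, using NumPy's `corrcoef` to compute the Pearson coefficient. The numbers are hypothetical and chosen only so that the two variables move in opposite directions.

```python
import numpy as np

# Hypothetical values for illustration only
tv_hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
exam_score = np.array([90.0, 85.0, 80.0, 70.0, 65.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r = np.corrcoef(tv_hours, exam_score)[0, 1]
print(round(r, 3))
```

Because the score falls as TV time rises, `r` comes out negative and close to -1.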

3. Define Machine Learning. What are the main components in Machine Learning?
   - Machine Learning is a subset of AI that allows systems to learn patterns from data and make decisions without being explicitly programmed.
   - The main components are data, features, a model, a learning algorithm, and an evaluation metric.
   - These components work together to train the model to make predictions.
   - The goal is to minimize error and generalize well to unseen data.

4. How does loss value help in determining whether the model is good or not?
   - The loss value measures the difference between the predicted output and the actual output.
   - A lower loss generally indicates a better fit, but the value is only comparable across models that use the same loss function and the same data.
   - It guides the optimization process to adjust parameters and improve accuracy.
   - Monitoring loss helps detect overfitting or underfitting.
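
To make the idea concrete, here is a sketch that computes mean squared error (one common loss, chosen here as an assumption) for two hypothetical sets of predictions against the same targets. The helper name `mse` and all values are made up for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
close_preds = [2.5, 5.0, 7.5]  # hypothetical model A, near the targets
far_preds = [0.0, 9.0, 4.0]    # hypothetical model B, far from the targets

loss_a = mse(y_true, close_preds)
loss_b = mse(y_true, far_preds)
print(loss_a, loss_b)
```

Model A's predictions sit closer to the targets, so its loss is the smaller of the two.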

5. What are continuous and categorical variables?
   - Continuous variables are numerical values that can take any value within a range (e.g., height, temperature).
   - Categorical variables represent discrete groups or categories (e.g., gender, color).
   - Continuous variables are usually measured, while categorical variables are labeled.
   - Different preprocessing techniques are applied to each type in ML.

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
   - Categorical variables are typically converted into numerical formats using encoding methods.
   - Common techniques include label encoding, one-hot encoding, and ordinal encoding.
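   - One-hot encoding suits nominal categories with no inherent order, and ordinal encoding suits categories with a meaningful order. Label encoding assigns an arbitrary integer to each category, which tree-based models generally tolerate but which can mislead linear or distance-based models into assuming an order.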
   - Proper encoding helps algorithms interpret categorical data.
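
The sketch below shows two of these techniques on a tiny, made-up DataFrame: one-hot encoding with `pandas.get_dummies` for a nominal column, and scikit-learn's `OrdinalEncoder` with an explicit category order for an ordered column. The column names and values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "color" has no order, "size" does
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: the category order is stated explicitly
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(size_codes)
```

Stating the order explicitly keeps `small < medium < large`; without it, the encoder would sort the categories alphabetically.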

7. What do you mean by training and testing a dataset?
   - Training means fitting a model to a dataset so that it learns how to make predictions.
   - Testing evaluates the model's performance on unseen data to check its generalization ability.
   - The training set is used to fit the model, and the test set assesses how well it performs.
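   - Keeping the two sets separate makes overfitting visible: a model that has memorized the training data scores well there but poorly on the test set.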

8. What is sklearn.preprocessing?
   - sklearn.preprocessing is a module in Scikit-learn that provides functions for scaling, transforming, and encoding features.
   - It helps standardize or normalize input data for better model performance.
   - It includes methods like StandardScaler, MinMaxScaler, and OneHotEncoder.
   - These transformations prepare raw data for machine learning.

9. What is a test set? How do we split data for model fitting (training and testing) in Python?
   - A test set is a portion of the dataset used to evaluate the performance of a trained model.
   - It simulates how the model will perform on real-world unseen data. The model does not learn from this set; it only predicts outcomes.
   - It's crucial for assessing model generalization.
   - In Python, the train_test_split() function from sklearn.model_selection is used to split the data.
   - You specify the test size (e.g., 0.2 for 20%) and optionally use a random state for reproducibility.
   - It returns four datasets: X_train, X_test, y_train, and y_test.
   - This ensures that the model is trained and evaluated separately.
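
A minimal sketch of the split described above, using placeholder arrays in place of a real dataset. The array contents and the choice of `random_state=42` are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```

With 10 samples and `test_size=0.2`, eight rows go to training and two to testing, and no row appears in both.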

10. How do you approach a Machine Learning problem?
   - Start by understanding the problem and collecting relevant data.
   - Then, perform exploratory data analysis (EDA), preprocessing, and feature engineering.
   - After that, select and train a model, evaluate it using appropriate metrics, and tune hyperparameters.
   - Finally, validate and deploy the model.

11. Why do we have to perform EDA before fitting a model to the data?
   - EDA helps in understanding the structure, patterns, and anomalies in the dataset.
   - It identifies missing values, outliers, and relationships between variables.
   - This insight guides data cleaning and feature selection.
   - EDA ensures better model performance and avoids issues during training.
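
A few of these checks can be sketched with pandas as follows. The DataFrame is made up, and the 1.5 × IQR rule used to flag outliers is one common heuristic chosen for this example, not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one extreme income
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [40000, 52000, 48000, 1000000, 45000],
})

# Count missing values per column
missing_counts = df.isna().sum()

# Summary statistics (count, mean, std, quartiles, min, max)
summary = df.describe()

# Flag income outliers with the 1.5 * IQR rule (an assumed heuristic)
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

print(missing_counts)
print(outliers)
```

Here the missing age and the extreme income would both need a decision (impute, drop, or keep) before fitting a model.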

12. What is correlation?
   - Correlation is a measure of how two variables move in relation to each other.
   - Positive correlation means they increase together; negative correlation means one increases while the other decreases.
   - It is useful for identifying linear relationships in EDA.
   - It does not imply causation.

13. What does negative correlation mean?
   - Negative correlation means that an increase in one variable tends to be accompanied by a decrease in the other; it does not by itself show that one causes the other.
   - The correlation value is less than 0. For example, if the number of hours spent exercising increases and weight decreases, the variables are negatively correlated.
   - It's an inverse relationship.

14. How can you find correlation between variables in Python?
   - You can use the .corr() method on a Pandas DataFrame to compute correlation coefficients.
   - It returns a correlation matrix showing relationships between all numeric features.
   - You can also visualize it using a heatmap with Seaborn (sns.heatmap()).
   - This helps identify strong or weak linear dependencies.
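
A short sketch of the `.corr()` call on a made-up DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical study data for illustration
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_tv": [5, 4, 4, 2, 1],
})

# Pairwise Pearson correlation between all numeric columns
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

With Seaborn installed, passing `corr_matrix` to `sns.heatmap()` gives the visual form mentioned above.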

15. What is causation? Explain difference between correlation and causation with an example.
   - Causation means one variable directly affects another, while correlation only shows a relationship.
   - For example, ice cream sales and drowning deaths may be correlated, but one does not cause the other; the underlying cause is hot weather.
   - Correlation does not imply a cause-effect link.
   - Establishing causation typically requires a controlled experiment or a carefully designed study that rules out confounding factors.

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
   - An optimizer adjusts model parameters to minimize loss during training.
   - Common types include SGD (Stochastic Gradient Descent), Adam, and RMSProp.
   - SGD updates weights using the gradient of the loss computed on a single sample or a small batch, while Adam combines momentum and adaptive learning rates for faster convergence.
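   - For example, optimizer = Adam(learning_rate=0.001) in Keras creates an Adam optimizer that updates the weights using the gradients computed by backpropagation (older Keras versions spelled the argument lr).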
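
To show the update rule that these optimizers build on, here is a plain-Python sketch of stochastic gradient descent fitting a single weight `w` so that `w * x` matches `y`. It is a hand-written illustration, not how Keras implements its optimizers; the data, learning rate, and epoch count are assumptions.

```python
import random

random.seed(0)

# Toy data generated from y = 2x, so the ideal weight is 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0
learning_rate = 0.01

for epoch in range(200):
    indices = list(range(len(xs)))
    random.shuffle(indices)  # visit the samples in random order
    for i in indices:
        error = w * xs[i] - ys[i]
        gradient = 2.0 * error * xs[i]  # d/dw of (w*x - y)^2
        w -= learning_rate * gradient   # step against the gradient

print(w)
```

Each step nudges `w` against the gradient of one sample's squared error, and `w` settles near 2. Momentum and adaptive methods such as Adam change how the step is computed but keep this overall loop.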

17. What is sklearn.linear_model?
   - sklearn.linear_model is a Scikit-learn module for implementing linear models like Linear Regression, Logistic Regression, and Ridge Regression.
   - It provides classes for training models on linearly related data.
   - For example, LinearRegression() fits a line to predict continuous values.
   - It supports both regression and classification.

18. What does model.fit() do? What arguments must be given?
   - model.fit() trains the model using the provided input (X) and target (y) data.
   - It learns the relationship between features and labels.
   - For supervised models, the required arguments are the training features X_train and the target labels y_train; unsupervised estimators such as clustering models need only X.
   - Optionally, you can specify parameters like epochs, batch size (in deep learning), or sample weights.

19. What does model.predict() do? What arguments must be given?
   - model.predict() generates output predictions for the given input data.
   - It takes new input data (e.g., X_test) and uses the trained model to estimate the target values.
   - This is used for testing or deploying the model.
   - It helps in evaluating how well the model generalizes.
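
The `fit()` and `predict()` calls can be sketched together with scikit-learn's `LogisticRegression`. The training data below is made up and deliberately well separated, so the outcome is easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: small values are class 0, large values class 1
X_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X_train, y_train)  # learn from features and labels

# predict() takes only features and returns one label per row
X_new = np.array([[0.5], [9.5]])
predictions = model.predict(X_new)
print(predictions)
```

Note that `predict()` receives no labels: it returns one predicted class per input row, which can then be compared with the true labels to evaluate the model.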

20. What are continuous and categorical variables?
   - Continuous variables are numeric and can take an infinite number of values within a range (e.g., age, salary).
   - Categorical variables represent distinct groups or categories (e.g., city, product type).
   - Each type requires different preprocessing for machine learning models.
   - Algorithms may interpret them differently.

21. What is feature scaling? How does it help in Machine Learning?
   - Feature scaling transforms features to be on a similar scale to improve model performance.
   - It prevents features with large magnitudes from dominating the model.
   - It's essential for algorithms like KNN, SVM, and gradient descent-based models.
   - Common methods include normalization and standardization.

22. How do we perform scaling in Python?
   - In Python, scaling is done using StandardScaler or MinMaxScaler from sklearn.preprocessing.
   - StandardScaler standardizes features to have zero mean and unit variance.
   - MinMaxScaler scales data to a fixed range (usually 0 to 1).
   - Fit the scaler on the training data only, then use it to transform both the training and test sets, so that no information from the test set leaks into training.
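
A minimal sketch of both scalers on placeholder data, following the fit-on-train, transform-both pattern. The arrays are made up, with the second feature on a much larger scale than the first.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Placeholder data: the second feature is 100x larger than the first
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Standardization: statistics come from the training data only
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)  # reuses the training mean and std

# Min-max scaling to the default range [0, 1]
minmax = MinMaxScaler()
X_train_mm = minmax.fit_transform(X_train)

print(X_train_std)
print(X_train_mm)
```

After standardization each training column has mean 0 and unit variance. Test values are scaled with the training statistics, so they can fall outside the range seen in training.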

23. What is sklearn.preprocessing?
   - sklearn.preprocessing is a Scikit-learn module that provides tools to transform input data before training.
   - It includes functions for scaling, encoding, binarization, and generating polynomial features.
   - These transformations improve data quality and model performance.
   - It ensures consistency and prepares data for model consumption.

24. How do we split data for model fitting (training and testing) in Python?
   - Use train_test_split() from sklearn.model_selection to divide data.
   - Pass in features, labels, and specify test_size and random_state for reproducibility.
   - It returns training and testing sets: X_train, X_test, y_train, y_test.
   - This helps validate the model properly.

25. Explain data encoding?
   - Data encoding converts categorical values into numeric form for model compatibility.
   - Techniques include one-hot encoding, label encoding, and ordinal encoding.
   - Encoding helps machine learning algorithms interpret non-numeric data correctly.
   - It's a crucial preprocessing step before model training.