**1. What is feature engineering, and how does it work? Explain the various aspects of feature engineering in depth.**

**Ans:** Feature engineering is the process of transforming raw data into informative features that enhance a machine learning model's performance. Aspects of feature engineering include:

- **Feature Extraction:** Creating new features from existing ones, e.g., combining or transforming attributes.
- **Feature Transformation:** Applying mathematical operations (e.g., log or power transforms, normalization) so that a feature's distribution is easier for the model to learn from and training converges faster.
- **Handling Missing Values:** Dealing with missing data through imputation or removal.
- **Encoding Categorical Variables:** Converting categorical attributes into numerical format.
- **Creating Interaction Features:** Incorporating interactions between features.
- **Feature Scaling:** Ensuring features are on similar scales.
- **Domain-Specific Engineering:** Introducing domain knowledge to create relevant features.
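
Several of the aspects above can be sketched on a toy table. This is a minimal illustration in plain Python, not a recommended pipeline; the records, column names (`age`, `income`, `city`), and derived feature names are all made up for the example.

```python
from statistics import mean, pstdev

# Toy records; the column names and values are hypothetical.
rows = [
    {"age": 25, "income": 40000.0, "city": "Pune"},
    {"age": 32, "income": None, "city": "Delhi"},
    {"age": 47, "income": 90000.0, "city": "Pune"},
]

# Handling missing values: impute missing income with the column mean.
income_mean = mean(r["income"] for r in rows if r["income"] is not None)
for r in rows:
    if r["income"] is None:
        r["income"] = income_mean

# Encoding categorical variables: one-hot encode "city".
cities = sorted({r["city"] for r in rows})
for r in rows:
    for c in cities:
        r[f"city_{c}"] = 1 if r["city"] == c else 0

# Feature scaling: z-score standardization of "age".
ages = [r["age"] for r in rows]
mu, sigma = mean(ages), pstdev(ages)
for r in rows:
    r["age_z"] = (r["age"] - mu) / sigma

# Interaction feature: product of two existing features.
for r in rows:
    r["age_x_income"] = r["age"] * r["income"]
```

Mean imputation and z-scoring are only one choice each; median imputation or min-max scaling would slot into the same places.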

**2. What is feature selection, and how does it work? What is the aim of it? What are the various methods of feature selection?**

**Ans:** Feature selection is the process of selecting a subset of relevant features from a larger set of features in a dataset. The aim is to improve model performance, reduce complexity, and enhance interpretability by keeping only the most informative attributes. This helps avoid overfitting, lowers computational cost, and improves the model's ability to generalize to new data.

**Methods of Feature Selection:**

1. **Filter Methods:** These methods assess feature relevance independently of the learning algorithm. Common metrics include correlation, mutual information, and chi-squared. Features are ranked and selected based on these metrics.
    
2. **Wrapper Methods:** Wrapper methods evaluate the performance of a machine learning model using subsets of features. This involves training and testing the model on different feature combinations. Common wrapper methods include forward selection (add one feature at a time), backward elimination (remove one feature at a time), and recursive feature elimination (eliminate less important features in iterations).
    
3. **Embedded Methods:** These methods combine feature selection with the model training process. They are integrated into the learning algorithm and optimize the selection of features while building the model. Examples include LASSO (L1 regularization), decision tree-based methods (like Random Forest feature importance), and gradient boosting feature importance.
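
As a rough illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target and keeps the top one. The feature names and numbers are invented for the example, and the helper names `pearson` and `rank_by_correlation` are hypothetical; constant columns (zero variance) are not handled.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_by_correlation(features, target):
    """Rank feature names by |correlation| with the target, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Made-up data: one feature tracks the target, the other does not.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [7, 9, 6, 8, 7],
}
target = [52, 58, 61, 70, 74]

ranking = rank_by_correlation(features, target)
top_k = ranking[:1]  # keep only the best-scoring feature
```

No model is trained at any point, which is what makes filter methods cheap.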

**3. Describe the filter and wrapper approaches to feature selection. State the pros and cons of each approach.**

**Ans:** 

**Filter Approach for Feature Selection:** 

In the filter approach, feature selection is based on statistical measures calculated independently of a specific machine learning algorithm. Features are evaluated using metrics like correlation, mutual information, chi-squared, or ANOVA. These metrics quantify the relationship between features and the target variable without considering the learning algorithm's behavior. Features are ranked according to these metrics, and a threshold is set to select the top-ranked features.

- **Pros:** Efficient (no model training), scalable to large datasets, algorithm-independent.
- **Cons:** Univariate scores ignore feature interactions and can keep redundant features; the selected subset is not tuned to the model that will use it.

**Wrapper Approach for Feature Selection:** 

In the wrapper approach, feature selection is treated as a search problem. It involves training and evaluating a machine learning model using different subsets of features. Wrapper methods employ a specific learning algorithm (e.g., decision trees, SVM) to assess feature subsets. The model's performance on validation or cross-validation sets guides the selection process. The algorithm iteratively adds or removes features based on their impact on model performance.

- **Pros:** Accounts for feature interactions and is tuned to the chosen model, so it often yields better predictive performance than filter methods.
- **Cons:** Computationally expensive (one model fit per candidate subset), risk of overfitting with small datasets.
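
The search view of the wrapper approach can be sketched as a greedy forward selection loop. This is a simplified sketch: `score_fn` stands in for "train the model on this subset and return its cross-validated score", and `toy_score` below is a hypothetical scoring function with hard-coded gains, not a real model.

```python
def forward_selection(candidates, score_fn, max_features):
    """Greedy wrapper search: repeatedly add the feature that most improves score_fn."""
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining and len(selected) < max_features:
        # Score every subset obtained by adding one more feature.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

def toy_score(subset):
    """Hypothetical stand-in for a cross-validated model score."""
    gain = {"a": 0.30, "b": 0.20, "c": 0.0}
    return 0.5 + sum(gain[f] for f in subset) - 0.01 * len(subset)

selected, score = forward_selection(["a", "b", "c"], toy_score, max_features=3)
```

In a real setting each call to `score_fn` involves fitting a model, which is where the computational cost of wrapper methods comes from.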

**4. i. Describe the overall feature selection process.**

**Ans:**

1. Collect Data
2. Preprocess Data (remove outliers, handle missing values)
3. Extract Features
4. Select Features (filter, wrapper, or embedded methods)
5. Train Model
6. Evaluate Model
7. Adjust Feature Selection if necessary

**ii. Explain the key underlying principle of feature extraction using an example. What are the most widely used feature extraction algorithms?**

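**Ans:** Feature extraction aims to reduce the number of dimensions while preserving relevant information: it derives a smaller set of new features as combinations or transformations of the original ones. Example: In image recognition, extracting edge or texture features simplifies the data while retaining crucial image patterns. Widely used algorithms include Principal Component Analysis (PCA), which is unsupervised and keeps the directions of greatest variance, and Linear Discriminant Analysis (LDA), which is supervised and keeps the directions that best separate the classes.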
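
To make the idea concrete, here is a small sketch that finds the first principal component of a 2-D toy data set by power iteration on the covariance matrix, then projects the points onto it (two features reduced to one). The data points are invented, the function name `first_principal_component` is hypothetical, and real PCA implementations typically use SVD rather than this loop.

```python
from math import sqrt

def first_principal_component(rows, iterations=100):
    """Approximate the leading eigenvector of the covariance matrix by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # starting vector; assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One extracted feature per point: its coordinate along the component.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected

# Made-up points lying close to the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
direction, scores = first_principal_component(points)
```

Because the points lie near the diagonal, `direction` comes out close to (0.707, 0.707), and the single value in `scores` retains almost all of the spread of each original pair.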

**5. Describe the feature engineering process in the context of a text categorization problem.**

**Ans:** Text categorization involves assigning predefined categories or labels to text documents. Effective feature engineering is crucial for transforming raw text into a format suitable for machine learning models. The process for a text categorization problem works as follows:

1. **Data Collection and Preprocessing:**
    
    - Collect a dataset of text documents and associated labels.
    - Preprocess the text: remove special characters, convert to lowercase, and tokenize (split into words).
2. **Text Processing and Transformation:**
    
    - Stop Word Removal: Eliminate common, non-informative words (e.g., "and," "the").
    - Stemming or Lemmatization: Reduce words to their root form (e.g., "running" to "run").
    - N-grams: Consider sequences of adjacent words to capture local context (e.g., "not good").
3. **Feature Extraction:**
    
    - Create a vocabulary: Collect the unique terms from the training documents.
    - Bag-of-Words (BoW): Create a matrix where each row represents a document and each column holds a term's frequency in that document.
    - Term Frequency-Inverse Document Frequency (TF-IDF): Reweight the counts so that terms frequent in a document but rare across the corpus receive high importance.
4. **Feature Scaling:**
    
    - Normalize each document vector so that long and short documents are comparable. L2 (unit-length) normalization is common for text; z-score normalization is usually avoided because centering destroys the sparsity of the matrix.
5. **Dimensionality Reduction (Optional):**
    
    - Apply techniques like truncated SVD (latent semantic analysis) or Principal Component Analysis (PCA) to reduce the dimensionality of the feature space while preserving relevant information.
6. **Model Training and Evaluation:**
    
    - Use the preprocessed and engineered features to train a machine learning model.
    - Evaluate the model's performance using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, etc.
7. **Fine-Tuning:**
    
    - Iteratively refine the feature engineering process based on the model's performance.
    - Experiment with different preprocessing techniques, n-gram configurations, and feature extraction methods.
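8. **Validation and Testing:**
    
    - Tune the feature engineering choices using a validation set or cross-validation.
    - Assess generalization once, on a held-out test set that played no part in tuning; adjusting features based on test results leaks information and inflates the estimate.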
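
The preprocessing and feature extraction steps can be sketched in plain Python as follows. This is a minimal illustration: the three documents and the tiny stop-word list are made up, the helper names `tokenize` and `tfidf_matrix` are hypothetical, and the IDF used here is the plain `log(N / df)` form, whereas libraries often use smoothed variants.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "is", "on"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def tfidf_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF of term j in doc i."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)  # assumes every document has at least one token
        rows.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, rows

docs = [
    "The match ended and the team won the cup",
    "The team signed a new striker",
    "The election is on Sunday",
]
vocab, matrix = tfidf_matrix(docs)
```

In the first document, "cup" receives a higher weight than "team" because "team" also appears in the second document and is therefore less distinctive.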

**6. What makes cosine similarity a good metric for text categorization? A document-term matrix has two rows with values of (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1). Find the cosine similarity.**

**Ans:** Cosine similarity measures how similar two documents are irrespective of their length. It depends only on the angle between the term vectors, not on their magnitudes, so two documents on the same topic remain close even when one is much longer and the Euclidean distance between them is large.

Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space. It is the dot product of the two vectors divided by the product of the two vectors' lengths (or magnitudes).

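The formula for calculating the cosine similarity is:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

For the two rows in the question, $x \cdot y = 4 + 3 + 0 + 0 + 6 + 6 + 3 + 0 + 1 = 23$, $\lVert x \rVert = \sqrt{40}$, and $\lVert y \rVert = \sqrt{29}$, so

$$\cos(x, y) = \frac{23}{\sqrt{40}\,\sqrt{29}} \approx 0.675$$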
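
The calculation can be checked with a few lines of Python. This is a small sketch using only the standard library; the function name `cosine_similarity` is chosen here for illustration.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (2, 3, 2, 0, 2, 3, 3, 0, 1)
y = (2, 1, 0, 0, 3, 2, 1, 3, 1)

similarity = cosine_similarity(x, y)
print(round(similarity, 3))  # 0.675

# Doubling every count (a document twice as long) leaves the similarity unchanged.
doubled = tuple(2 * a for a in x)
same_direction = cosine_similarity(x, doubled)
```

The last two lines illustrate the length invariance: `same_direction` is 1 up to floating-point rounding, even though `doubled` is far from `x` in Euclidean terms.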

**7. i. What is the formula for calculating Hamming distance? Calculate the Hamming distance between 10001011 and 11001111.**

**Ans:** The Hamming distance between two equal-length vectors is the number of positions at which they differ, i.e., the number of bits that must be flipped to turn one into the other: $d(x, y) = |\{ i : x_i \neq y_i \}|$. For example, `01101010` and `11011011` differ in four places (positions 1, 3, 4, and 8), so `d(01101010, 11011011) = 4`. For the vectors in the question, `10001011` and `11001111` differ in positions 2 and 6, so the Hamming distance is 2.
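
A short Python sketch of the same count, working directly on the bit strings (the function name `hamming_distance` is chosen here for illustration):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))

print(hamming_distance("10001011", "11001111"))  # 2
print(hamming_distance("01101010", "11011011"))  # 4

# Equivalent for bit strings: count the 1s in the XOR of the two numbers.
xor_count = bin(int("10001011", 2) ^ int("11001111", 2)).count("1")
```

The XOR form gives the same result because a bit of the XOR is 1 exactly where the two inputs disagree.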

**ii. Compare the Jaccard index and simple matching coefficient (SMC) of two features with values (1, 1, 0, 0, 1, 0, 1, 1) and (1, 1, 0, 0, 0, 1, 1, 1), respectively (1, 0, 0, 1, 1, 0, 0, 1).**

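**Ans:** For two binary vectors, let $f_{11}$ be the number of positions where both are 1, $f_{00}$ the number where both are 0, and $f_{10}$, $f_{01}$ the numbers of positions where they differ.

- Jaccard index = $f_{11} / (f_{11} + f_{10} + f_{01})$
- Simple matching coefficient (SMC) = $(f_{11} + f_{00}) / (f_{11} + f_{10} + f_{01} + f_{00})$

For (1, 1, 0, 0, 1, 0, 1, 1) and (1, 1, 0, 0, 0, 1, 1, 1): $f_{11} = 4$, $f_{00} = 2$, $f_{10} = 1$, $f_{01} = 1$, so the Jaccard index is 4/6 ≈ 0.667 (66.7%) and the SMC is 6/8 = 0.75 (75%). The SMC is higher because it also counts the two 0-0 matches, which the Jaccard index ignores.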
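
Both coefficients can be computed from the four match counts, as in the sketch below. The function names are chosen for illustration. The question's wording around the third vector is ambiguous, so comparing it against the first vector here is an assumption.

```python
def match_counts(a, b):
    """Return (f11, f10, f01, f00) for two equal-length binary vectors."""
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    f10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    f01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    f00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return f11, f10, f01, f00

def jaccard_index(a, b):
    """1-1 matches over positions where at least one vector has a 1."""
    f11, f10, f01, _ = match_counts(a, b)
    return f11 / (f11 + f10 + f01)

def simple_matching_coefficient(a, b):
    """All matches (1-1 and 0-0) over all positions."""
    f11, f10, f01, f00 = match_counts(a, b)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

a = (1, 1, 0, 0, 1, 0, 1, 1)
b = (1, 1, 0, 0, 0, 1, 1, 1)
c = (1, 0, 0, 1, 1, 0, 0, 1)  # third vector from the question (pairing with a is assumed)

print(round(jaccard_index(a, b), 3), simple_matching_coefficient(a, b))  # 0.667 0.75
print(jaccard_index(a, c), simple_matching_coefficient(a, c))            # 0.5 0.625
```

`jaccard_index` divides by zero when both vectors are all zeros; a production version would need to decide what to return in that case.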

**8. What is meant by a "high-dimensional data set"? Could you offer a few real-life examples? What are the difficulties in using machine learning techniques on a data set with many dimensions? What can be done about it?**

**Ans:** 
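- A high-dimensional data set has a very large number of attributes (features), often comparable to or greater than the number of instances.
- Examples: DNA sequences, text documents with many terms, images with many pixels.
- Difficulties: the curse of dimensionality (data become sparse and distances between points become less informative), higher computational cost, and a greater risk of overfitting.
- Solutions: feature selection, dimensionality reduction (e.g., PCA; t-SNE mainly for visualization), and domain knowledge to discard irrelevant attributes.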
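
One way to get a feel for the curse of dimensionality is to look at how little of a hypercube lies near its center as the number of dimensions grows. The sketch below computes the fraction of the cube $[-1, 1]^d$ occupied by the inscribed unit ball, using the standard volume formula $\pi^{d/2} / \Gamma(d/2 + 1)$; the function name is chosen for illustration.

```python
from math import gamma, pi

def ball_to_cube_volume_ratio(d):
    """Fraction of the hypercube [-1, 1]^d occupied by the inscribed unit ball."""
    ball = pi ** (d / 2) / gamma(d / 2 + 1)
    cube = 2.0 ** d
    return ball / cube

ratios = {d: ball_to_cube_volume_ratio(d) for d in (2, 3, 5, 10, 20)}
for d, r in ratios.items():
    print(d, f"{r:.2e}")
```

The ratio starts at about 0.79 in two dimensions and falls below 0.3% by ten, so uniformly scattered points end up almost entirely in the corners, far from each other. This sparsity is why many more samples are needed as dimensions are added.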

**9. Make a few quick notes on:**

**PCA** 

Principal component analysis (PCA) transforms a set of possibly correlated variables into a smaller number of uncorrelated variables, called principal components, ordered by how much of the data's variance each one captures. The technique is widely used to emphasize variation and capture strong patterns in a data set.

**Use of vectors**

Vectors combine magnitude and direction and are drawn as arrows. In physics, they most commonly represent displacement, velocity, and acceleration. In machine learning, each data instance is represented as a feature vector, so similarity between instances can be measured with vector operations such as the dot product or cosine similarity.

**Embedded technique**

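In feature selection, an embedded technique performs the selection as part of model training itself rather than as a separate step. Examples include LASSO (L1 regularization), which drives the coefficients of uninformative features to zero, and tree-based models such as random forests and gradient boosting, which yield feature importances as a by-product of training. Embedded techniques are typically cheaper than wrapper methods while still taking the learning algorithm into account.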

**10. Make a comparison between:**

**1. Sequential backward exclusion vs. sequential forward selection**

|Comparison|Sequential Backward Exclusion|Sequential Forward Selection|
|---|---|---|
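|**Definition**|Starts from all features and iteratively removes one feature at a time|Starts from no features and iteratively adds one feature at a time|
|**Objective**|Identify and discard the least useful features|Identify and keep the most useful features|
|**Advantages**|Evaluates each feature in the context of all others, so it can retain features that are only useful in combination|Cheaper when only a few features are needed, since early models are small|
|**Disadvantages**|Expensive when the feature set is large, since early models use nearly all features; a removed feature is never reconsidered|Can miss features that are only useful in combination; an added feature is never reconsidered|
|**Application**|Moderate feature counts where most features are expected to be kept|Large feature counts where a small subset is expected to suffice|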
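
A minimal sketch of sequential backward exclusion is shown below. As with any wrapper method, `score_fn` stands in for "train and validate a model on this subset"; the `interaction_score` function is a hypothetical toy in which features `a` and `b` only help when used together.

```python
def backward_elimination(features, score_fn, min_features=1):
    """Start from all features and greedily drop the one whose removal scores best."""
    selected = list(features)
    best_score = score_fn(selected)
    while len(selected) > min_features:
        # Score every subset obtained by removing exactly one feature.
        scored = [(score_fn([f for f in selected if f != drop]), drop)
                  for drop in selected]
        score, drop = max(scored)
        if score < best_score:
            break  # every removal hurts, so stop
        selected.remove(drop)
        best_score = score
    return selected, best_score

def interaction_score(subset):
    """Hypothetical score: 'a' and 'b' are useful only as a pair; 'c' is noise."""
    base = 0.9 if {"a", "b"} <= set(subset) else 0.5
    return base - 0.01 * len(subset)

kept, score = backward_elimination(["a", "b", "c"], interaction_score)
```

Here the search drops `c` and then stops, keeping the pair. A forward search on the same toy score would see no benefit from adding `a` or `b` alone, which illustrates why the starting point of the search matters.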

**2. Feature selection methods: filter vs. wrapper**

|Comparison|Filter Methods|Wrapper Methods|
|---|---|---|
|**Definition**|Use statistical measures to assess feature importance|Use models to evaluate feature subsets|
|**Objective**|Efficiently rank features based on metrics|Optimize model performance through feature selection|
|**Advantages**|Computationally efficient, algorithm-independent|Considers feature interactions, often better predictive performance for the chosen model|
|**Disadvantages**|Ignores feature interactions, not tuned to the final model|Computationally expensive, prone to overfitting with small datasets|
|**Application**|Preliminary feature ranking|Feature selection, model tuning|

**3. SMC vs. Jaccard coefficient**

|Comparison|SMC (Simple Matching Coefficient)|Jaccard Coefficient|
|---|---|---|
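|**Definition**|Fraction of positions where the two binary vectors agree, counting both 1-1 and 0-0 matches|Fraction of positions with a 1-1 match among positions where at least one vector has a 1 (intersection over union)|
|**Advantages**|Simple to compute; appropriate when 0 and 1 are equally informative (symmetric attributes)|Ignores 0-0 matches, so it suits sparse or asymmetric data where shared absence says little|
|**Disadvantages**|Inflated by 0-0 matches on sparse data, making unrelated items look similar|Discards information when shared absence is meaningful; undefined when both vectors are all zeros|
|**Application**|Symmetric binary attributes (e.g., yes/no survey answers)|Asymmetric binary data (e.g., market baskets, term presence in documents)|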