# Yashwant Desai –  ML_Assignment_9

# 1. What is feature engineering, and how does it work? Explain the various aspects of feature engineering in depth.

Feature engineering is the process of creating new features or modifying existing ones in a dataset to improve the performance of machine learning models. 

It involves:

Feature Creation: Generating new features from existing ones. For example, creating a "total income" feature by adding "salary" and "bonus."

Feature Transformation: Modifying features through mathematical operations, such as taking the logarithm of a variable to make its distribution more Gaussian.

Feature Selection: Choosing the most relevant features to reduce dimensionality.

Handling Missing Data: Dealing with missing values in features using techniques like imputation.

Encoding Categorical Variables: Converting categorical variables into numerical format, e.g., one-hot encoding.

Scaling and Normalization: Bringing features onto comparable ranges, e.g., standardization to zero mean and unit variance.

Feature Extraction: Creating new features through techniques like Principal Component Analysis (PCA).
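A minimal sketch of a few of these steps on toy data. The values and the `city` column are hypothetical, chosen only for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy data: three people with a salary, a bonus, and a city label.
salary = np.array([50_000.0, 64_000.0, 120_000.0])
bonus = np.array([5_000.0, 6_000.0, 30_000.0])
city = ["Pune", "Delhi", "Pune"]

# Feature creation: combine two existing columns into a new one.
total_income = salary + bonus

# Feature transformation: log1p compresses the long right tail of income.
log_income = np.log1p(total_income)

# Encoding: one-hot encode the categorical column (columns sorted by name).
categories = sorted(set(city))
one_hot = np.array([[float(c == cat) for cat in categories] for c in city])

# Scaling: standardize to zero mean and unit variance.
standardized = (log_income - log_income.mean()) / log_income.std()

print(total_income)
print(categories, one_hot.tolist())
print(standardized)
```

In practice a library such as pandas or scikit-learn would usually handle these steps; the plain NumPy version is used here only to keep each operation visible.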

# 2. What is feature selection, and how does it work? What is the aim of it? What are the various methods of feature selection?

Feature selection is the process of choosing a subset of the most relevant features from a dataset to improve model performance, reduce dimensionality, and decrease computational cost. The aim is to retain the most informative features. 

Methods of feature selection include:

Filter Methods: Use statistical tests to score and rank features based on their individual relationship with the target variable (e.g., Chi-squared, Information Gain).

Wrapper Methods: Evaluate subsets of features by training and testing the model on different combinations (e.g., Forward Selection, Backward Elimination).

Embedded Methods: Feature selection is integrated into the model training process (e.g., LASSO for linear regression, decision tree feature importance).
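As a rough illustration of a filter method, the sketch below ranks features by the absolute Pearson correlation with the target. Correlation is swapped in here as a simple stand-in for the chi-squared or information gain scores named above, and the synthetic data and the helper name `correlation_scores` are assumptions made for the example.

```python
import numpy as np

def correlation_scores(X, y):
    """Score each column by |Pearson correlation| with the target."""
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

# Synthetic data: column 0 is strongly informative, column 2 weakly, column 1 is noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=n)

scores = correlation_scores(X, y)
ranking = np.argsort(scores)[::-1]  # best-scoring feature first

print("scores:", scores.round(3))
print("ranking:", ranking)
```

Each feature is scored on its own, with no model in the loop, which is what makes filter methods fast.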

# 3. Describe the feature selection filter and wrapper approaches. State the pros and cons of each approach.

Filter Approach: In filter methods, features are scored and ranked using statistical measures, independently of any machine learning model.

Pros include efficiency, speed, and independence from the model. Cons include scoring each feature in isolation, so feature interactions can be missed, and the selected features may not be the best ones for the model that is eventually trained.

Wrapper Approach: In wrapper methods, feature subsets are evaluated by training and testing the specific machine learning model being used.

Pros include capturing feature interactions and selecting features tuned to the model. Cons include high computational cost and potential overfitting to the model and evaluation data used.
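The wrapper idea can be sketched as greedy forward selection around a concrete model. This is only an illustration under assumptions: the model is ordinary least squares, the score is mean squared error on a held-out split, and the function names `validation_mse` and `forward_selection` are hypothetical.

```python
import numpy as np

def validation_mse(X_train, y_train, X_val, y_val, features):
    """Fit least squares on the chosen columns and return the validation MSE."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train[:, features]])
    A_val = np.column_stack([np.ones(len(X_val)), X_val[:, features]])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return float(np.mean((A_val @ coef - y_val) ** 2))

def forward_selection(X_train, y_train, X_val, y_val, k):
    """Greedily add the feature that lowers validation error the most."""
    selected = []
    remaining = list(range(X_train.shape[1]))
    while remaining and len(selected) < k:
        best = min(
            remaining,
            key=lambda j: validation_mse(X_train, y_train, X_val, y_val, selected + [j]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data: only columns 1 and 3 influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 1] - 3.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

chosen = forward_selection(X[:200], y[:200], X[200:], y[200:], k=2)
print(sorted(chosen))
```

Every candidate subset requires a model fit, which is where the computational cost of wrapper methods comes from.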

# 4. i. Describe the overall feature selection process. ii. Explain the key underlying principle of feature extraction using an example. What are the most widely used feature extraction algorithms?


i. The feature selection process typically involves three main steps:

1. Feature Ranking: Features are scored based on their relevance to the target variable.

2. Subset Selection: A subset of top-ranked features is chosen based on a predefined criterion (e.g., number of features to select).

3. Model Evaluation: The selected subset of features is used to train and test a machine learning model, and its performance is assessed.
    

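ii. The key underlying principle of feature extraction is to transform high-dimensional data into a lower-dimensional representation that retains most of the useful information. Unlike feature selection, the new features are combinations of the original ones rather than a subset of them.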

For example, Principal Component Analysis (PCA) finds orthogonal directions (principal components) along which the data varies the most and projects the data onto the first few of them, preserving as much variance as possible in fewer dimensions. Widely used feature extraction algorithms include PCA, Linear Discriminant Analysis (LDA), and Independent Component Analysis (ICA).
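A minimal sketch of PCA via the singular value decomposition, assuming NumPy. The function name `pca` and the synthetic two-column data are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, components, explained_variance_ratio[:n_components]

# Two strongly correlated features: nearly all variance lies along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.05, size=200)])

Z, components, ratio = pca(X, n_components=1)
print(Z.shape)
print(ratio)
```

Here two correlated columns collapse into a single extracted feature while almost all of the variance is kept.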

# 5. Describe the feature engineering process in the sense of a text categorization issue.

In text categorization, feature engineering involves converting text data into numerical features. Typical steps are tokenizing and cleaning the text, then representing each document with methods like TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings, or indicator features for the presence of specific words or phrases.
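To make the TF-IDF idea concrete, here is a small sketch in plain Python. It assumes whitespace tokenization, non-empty documents, and the unsmoothed weighting tf × log(N / df); libraries typically use smoothed variants, so their numbers will differ. The function name `tf_idf` and the sample sentences are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vocabulary, vectors = tf_idf(docs)
print(vocabulary)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

A word such as "the" that occurs in every document gets weight 0, while words that are specific to one document get the largest weights.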

# 6. What makes cosine similarity a good metric for text categorization? A document-term matrix has two rows with values of (2, 3, 2, 0, 2, 3, 3, 0, 1) and (2, 1, 0, 0, 3, 2, 1, 3, 1). Find the resemblance in cosine.

In [1]:
import numpy as np

# Define the two vectors A and B
A = np.array([2, 3, 2, 0, 2, 3, 3, 0, 1])
B = np.array([2, 1, 0, 0, 3, 2, 1, 3, 1])

# Calculate the dot product of A and B
dot_product = np.dot(A, B)

# Calculate the Euclidean norms (magnitudes) of A and B
norm_A = np.linalg.norm(A)
norm_B = np.linalg.norm(B)

# Calculate the cosine similarity
cosine_similarity = dot_product / (norm_A * norm_B)

print(f"Cosine Similarity between A and B: {cosine_similarity}")

Cosine Similarity between A and B: 0.6753032524419089


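Cosine similarity is a good metric for text categorization because it measures the cosine of the angle between two vectors, so it depends on the direction of the vectors and not on their magnitude. Two documents with similar relative term frequencies score as similar even if one is much longer than the other, and the measure works well with the sparse, high-dimensional vectors found in document-term matrices.

To calculate cosine similarity between the two rows A and B:

Calculate the dot product of the two vectors: (2 * 2) + (3 * 1) + (2 * 0) + (0 * 0) + (2 * 3) + (3 * 2) + (3 * 1) + (0 * 3) + (1 * 1) = 4 + 3 + 0 + 0 + 6 + 6 + 3 + 0 + 1 = 23

Calculate the magnitude of each vector:

|A| = sqrt(2^2 + 3^2 + 2^2 + 0^2 + 2^2 + 3^2 + 3^2 + 0^2 + 1^2) = sqrt(40) ≈ 6.3246

|B| = sqrt(2^2 + 1^2 + 0^2 + 0^2 + 3^2 + 2^2 + 1^2 + 3^2 + 1^2) = sqrt(29) ≈ 5.3852

Calculate cosine similarity: 23 / (sqrt(40) * sqrt(29)) = 23 / sqrt(1160) ≈ 23 / 34.0588 ≈ 0.6753, which matches the value computed in the code above.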
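The point about direction versus magnitude can be checked with a short sketch. Doubling every count in row A (as if the document were repeated twice) is an assumption made for illustration; the helper name `cosine_similarity` is also illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([2, 3, 2, 0, 2, 3, 3, 0, 1])
B = np.array([2, 1, 0, 0, 3, 2, 1, 3, 1])

# A document repeated twice has the same direction as the original.
same_direction = cosine_similarity(A, 2 * A)

# Scaling one vector does not change its similarity to another document.
scaled = cosine_similarity(2 * A, B)

# Euclidean distance, by contrast, grows with document length.
euclidean_gap = float(np.linalg.norm(A - 2 * A))

print(same_direction, scaled, euclidean_gap)
```

Cosine similarity treats the repeated document as identical to the original, whereas Euclidean distance reports a large gap between them.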

# 7. i. What is the formula for calculating Hamming distance? Between 10001011 and 11001111, calculate the Hamming gap. ii. Compare the Jaccard index and similarity matching coefficient of two features with values (1, 1, 0, 0, 1, 0, 1, 1) and (1, 1, 0, 0, 0, 1, 1, 1), respectively (1, 0, 0, 1, 1, 0, 0, 1).


The Hamming distance between two binary strings of equal length is the number of positions at which the corresponding bits are different. 

The formula is:

Hamming Distance = Σ (bit1 XOR bit2)

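For 10001011 and 11001111, the strings differ only in the 2nd and 6th positions, so the Hamming distance is:
(1 XOR 1) + (0 XOR 1) + (0 XOR 0) + (0 XOR 0) + (1 XOR 1) + (0 XOR 1) + (1 XOR 1) + (1 XOR 1) = 0 + 1 + 0 + 0 + 0 + 1 + 0 + 0 = 2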

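The Jaccard index and similarity (simple) matching coefficient

For the binary features (1, 1, 0, 0, 1, 0, 1, 1) and (1, 1, 0, 0, 0, 1, 1, 1), count the positions by type of agreement:

f11 (both 1) = 4, f00 (both 0) = 2, f10 (first 1, second 0) = 1, f01 (first 0, second 1) = 1

Jaccard Index = f11 / (f11 + f10 + f01)

Jaccard Index = 4 / (4 + 1 + 1) = 4/6 ≈ 0.667

Similarity Matching Coefficient = (f11 + f00) / (f11 + f10 + f01 + f00)

Similarity Matching Coefficient = (4 + 2) / 8 = 6/8 = 0.75

The matching coefficient is higher because it also counts the two 0-0 matches, which the Jaccard index ignores.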
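A minimal sketch that computes both measures from the four agreement counts. The helper names are illustrative, and the Jaccard function assumes at least one position holds a 1 in either vector (otherwise the denominator is zero).

```python
def match_counts(a, b):
    """Count positions by agreement type for two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("The vectors must have the same length")
    f11 = sum(x == 1 and y == 1 for x, y in zip(a, b))
    f00 = sum(x == 0 and y == 0 for x, y in zip(a, b))
    f10 = sum(x == 1 and y == 0 for x, y in zip(a, b))
    f01 = sum(x == 0 and y == 1 for x, y in zip(a, b))
    return f11, f10, f01, f00

def jaccard_index(a, b):
    f11, f10, f01, _ = match_counts(a, b)
    return f11 / (f11 + f10 + f01)

def simple_matching_coefficient(a, b):
    f11, f10, f01, f00 = match_counts(a, b)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

u = (1, 1, 0, 0, 1, 0, 1, 1)
v = (1, 1, 0, 0, 0, 1, 1, 1)

print(match_counts(u, v))                 # (4, 1, 1, 2)
print(jaccard_index(u, v))                # 4 / 6, about 0.667
print(simple_matching_coefficient(u, v))  # 6 / 8 = 0.75
```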

In [2]:
# Define the two binary strings
string1 = "10001011"
string2 = "11001111"

# Check that the strings are of the same length
if len(string1) != len(string2):
    raise ValueError("The strings must have the same length")

# Calculate the Hamming distance
hamming_distance = sum(bit1 != bit2 for bit1, bit2 in zip(string1, string2))

print(f"Hamming Distance: {hamming_distance}")

Hamming Distance: 2


# 8. State what is meant by  "high-dimensional data set"? Could you offer a few real-life examples? What are the difficulties in using machine learning techniques on a data set with many dimensions? What can be done about it?

A high-dimensional dataset refers to a dataset with a large number of features or dimensions. 

Real-life examples include:

Medical data with many patient attributes.

Image data with pixel values as features.

Genomic data with gene expression levels.

Difficulties in using machine learning techniques on high-dimensional data include:

Increased computational complexity.

Overfitting due to the "curse of dimensionality."

Difficulty in visualizing data.

To address these issues, techniques like feature selection, dimensionality reduction (e.g., PCA), and regularization can be applied.
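One symptom of the curse of dimensionality is that distances between points become nearly indistinguishable as dimensions are added. The sketch below illustrates this under assumptions: points are drawn uniformly from the unit hypercube, and the helper name `relative_contrast` is hypothetical.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(farthest - nearest) / nearest distance from a random query point."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return float((distances.max() - distances.min()) / distances.min())

rng = np.random.default_rng(0)
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(500, n_dims, rng), 3))
```

The contrast shrinks sharply as the dimension grows, so the nearest and farthest neighbours end up almost equally far away, which weakens distance-based methods.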

# 9. Make a few quick notes on: 1. PCA is an acronym for Personal Computer Analysis. 2. Use of vectors 3. Embedded technique


PCA is an acronym for Personal Computer Analysis.

Incorrect. PCA stands for Principal Component Analysis, a technique for reducing the dimensionality of data while preserving as much of its variance as possible.

Use of vectors

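In machine learning, each data point is usually represented as a vector with one component per feature. Operations on these vectors, such as dot products, norms, and distances, underlie similarity measures like cosine similarity and most learning algorithms.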

Embedded technique

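Embedded techniques integrate feature selection into the model training process itself, so feature importance is assessed while the model is being fit. Examples include LASSO, whose L1 penalty drives the coefficients of unhelpful features to zero, and decision tree feature importance.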
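As a rough sketch of an embedded method, the code below fits LASSO with coordinate descent, minimizing (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1. The function name, the synthetic data, and the choice `alpha=0.1` are assumptions made for illustration; a library implementation would normally be used instead.

```python
import numpy as np

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            z = X[:, j] @ X[:, j] / n
            # Soft-thresholding sets small coefficients exactly to zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w

# Synthetic data: only columns 0 and 2 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

w = lasso_coordinate_descent(X, y, alpha=0.1)
selected = np.flatnonzero(w).tolist()
print(w.round(3))
print(selected)
```

Training the model and selecting features happen in the same step: the features whose coefficients remain nonzero are the selected ones.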

# 10. Make a comparison between: 1. Sequential backward exclusion vs. sequential forward selection 2. Feature selection methods: filter vs. wrapper 3. SMC vs. Jaccard coefficient

Sequential backward exclusion vs. sequential forward selection

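Sequential Backward Exclusion: Starts with all features and iteratively removes the one whose removal hurts model performance the least. It judges each feature in the presence of all the others, but it is more expensive when there are many features because the early models are trained on nearly the full set.

Sequential Forward Selection: Starts with an empty set of features and iteratively adds the one that improves model performance the most. It is cheaper when only a few features are needed, but it can miss features that are useful only in combination with others.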

Feature selection methods: filter vs. wrapper

Filter Methods: Use statistical measures to rank and select features. Fast but may miss feature interactions.

Wrapper Methods: Use a specific machine learning model to evaluate feature subsets. Slow but can capture feature interactions.

SMC vs. Jaccard coefficient

SMC (Simple Matching Coefficient): The number of matching positions (both 1-1 and 0-0) divided by the total number of positions. It treats shared presence and shared absence as equally informative, so it suits symmetric binary attributes.

Jaccard Coefficient: The number of 1-1 matches divided by the number of positions where at least one value is 1. It ignores 0-0 matches, so it suits sparse or asymmetric binary data, where shared absence carries little information.

# Done all 10 questions 

# Regards,Yashwant