#Q1. What is the difference between AI, ML, DL, and Data Science? Provide a brief explanation of each.


Ans. AI (Artificial Intelligence), ML (Machine Learning), DL (Deep Learning), and Data Science are related fields but have distinct meanings and applications.

Artificial Intelligence (AI)

AI refers to the broader concept of machines or systems that can perform tasks that typically require human intelligence, such as reasoning, learning, problem-solving, and decision-making. AI systems can mimic human behavior and are used in applications like chatbots, robotics, and autonomous vehicles.

Machine Learning (ML)

ML is a subset of AI that focuses on algorithms and statistical models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed for each task. Examples include recommendation systems and fraud detection.

Deep Learning (DL)

DL is a specialized form of ML that uses artificial neural networks with multiple layers (hence "deep") to model complex patterns in large amounts of data. DL is behind technologies like image and speech recognition, and natural language processing.


Data Science

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines statistics, programming, and domain expertise to analyze data, uncover patterns, and support decision-making. Data Science often uses ML and AI techniques, but it also includes data cleaning, visualization, and storytelling.

In summary:

AI is about mimicking human intelligence.

ML is about learning from data.

DL is a subset of ML that uses multi-layer neural networks.

Data Science is about extracting insights from data, often using ML and AI methods.

#Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent them?


Overfitting occurs when a model learns the training data and its noise too well, resulting in high accuracy on training data but poor performance on new, unseen data. Underfitting happens when a model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test datasets.

Overfitting in ML

Explanation: An overfit model is overly complex and memorizes specific data points and random fluctuations (noise) in the training set rather than the general patterns. It fails to generalize to new examples.
Detection:
Performance Metrics: The model shows very high accuracy or low error on the training data, but significantly lower accuracy or high error on the test/validation data.
Learning Curves: Plotting the learning curves reveals a large gap between the training performance (which keeps improving) and the validation performance (which plateaus or starts to worsen).
Prevention:
Regularization: Use techniques like L1 (Lasso) and L2 (Ridge) regularization, which add a penalty to the model's coefficients, discouraging excessive complexity.
Cross-Validation: Employ methods like k-fold cross-validation to ensure the model's performance is consistent across different subsets of data.
Increase Training Data: Providing more diverse and representative data helps the model learn general patterns rather than specific examples.
Simplify Model: Use a less complex model architecture (e.g., reducing the number of layers in a neural network or pruning a decision tree).
Early Stopping: Monitor the model's performance on a validation set during training and stop the process when the performance starts to degrade.
Feature Selection/Pruning: Remove irrelevant or redundant features to reduce noise in the training data.

Dropout: For neural networks, randomly drop some neurons during training to prevent over-reliance on specific connections.
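The early-stopping rule listed above can be sketched in plain Python. This is a minimal illustration rather than any framework's API: the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, scanning until the validation
    loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving: stop training
    return best_epoch

# Made-up validation losses: they improve, then worsen as the model overfits.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(val_losses, patience=2)
print(best)
```

In a real training loop the model weights from the best epoch would typically be saved and restored once training stops.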

Underfitting in ML

Explanation: An underfit model is too simple to capture the complexity of the data, exhibiting high bias. It makes strong assumptions about the data and fails to learn the relationships between input features and output labels effectively.

Detection:

Performance Metrics: The model has low accuracy and high error rates on both the training data and the test/validation data.
Learning Curves: Both the training and validation performance curves stay flat and low (or high error), indicating the model is not learning enough.

Prevention:

Increase Model Complexity: Use a more complex model that has enough capacity to capture the underlying patterns in the data (e.g., using polynomial regression instead of linear regression for non-linear data).

Add More Relevant Features: Engineer or include more input features that are relevant to the target variable to help the model learn better.
Reduce Regularization: If regularization is being used, decrease its strength to allow the model more flexibility to fit the data.
Train for Longer Time: The model might require more training iterations (epochs) to converge to an optimal solution.
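The detection rules for both problems can be condensed into a small helper. This is only a sketch: `diagnose_fit` is a hypothetical name, and the `min_score` and `max_gap` thresholds are illustrative assumptions that would need tuning for a real task.

```python
def diagnose_fit(train_score, val_score, min_score=0.70, max_gap=0.10):
    """Classify a model from its training and validation accuracy.
    The thresholds are illustrative assumptions, not standard values."""
    if train_score < min_score and val_score < min_score:
        return "underfitting"   # poor on both sets: high bias
    if train_score - val_score > max_gap:
        return "overfitting"    # large train/validation gap: high variance
    return "reasonable fit"

print(diagnose_fit(0.99, 0.72))
print(diagnose_fit(0.61, 0.59))
print(diagnose_fit(0.88, 0.85))
```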

#Question 3: How would you handle missing values in a dataset? Explain at least three methods with examples.


Ans. Handling missing values is a critical step in data preprocessing. How you treat them depends on the nature of the data, the amount of missingness, and the chosen algorithm. Here are three common methods:

1. Deletion Methods

Deletion involves removing entire rows (samples) or columns (features) that contain missing values. This method is straightforward but can lead to information loss.
When to use:
If only a tiny fraction of the data is missing (e.g., less than 5%).
If a specific column is missing a very large percentage of values (e.g., >70%), making it useless for analysis.
Examples:
Row-wise deletion (Listwise Deletion): In a dataset of customer surveys, if only 20 out of 1000 entries are missing the "Age" field, you simply delete those 20 rows.
Column-wise deletion: If a dataset has 50 columns, and the "Favorite Color" column is empty for 95% of records, you drop the entire "Favorite Color" column from the analysis.
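Both deletion variants can be sketched with plain Python dictionaries. The helper names, the 70% threshold parameter, and the survey rows below are made up for illustration; with a DataFrame library the same idea is usually a one-line call.

```python
def drop_rows_with_missing(rows, column):
    """Listwise deletion: keep only rows where `column` is present."""
    return [row for row in rows if row.get(column) is not None]

def drop_sparse_columns(rows, max_missing_fraction=0.7):
    """Column-wise deletion: drop columns that are missing in more than
    the given fraction of rows."""
    columns = sorted({key for row in rows for key in row})
    keep = [
        col for col in columns
        if sum(row.get(col) is None for row in rows) / len(rows)
        <= max_missing_fraction
    ]
    return [{col: row.get(col) for col in keep} for row in rows]

# Made-up survey rows; None marks a missing value.
survey = [
    {"Age": 34, "City": "Pune", "Favorite Color": None},
    {"Age": None, "City": "Delhi", "Favorite Color": None},
    {"Age": 29, "City": "Mumbai", "Favorite Color": None},
    {"Age": 41, "City": "Chennai", "Favorite Color": "Blue"},
]
complete_age = drop_rows_with_missing(survey, "Age")
trimmed = drop_sparse_columns(survey)
print(len(complete_age), sorted(trimmed[0]))
```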

2. Imputation Methods (Central Tendency)

Imputation involves filling in the missing values with a substitute value derived from the existing data. Methods using central tendency (mean, median, mode) are simple and fast.
When to use:
The data is missing completely at random (MCAR).
You need a quick solution that preserves the full dataset size.
Examples:
Mean Imputation (for numerical data): A dataset of house prices is missing values in the "Square Footage" column. You calculate the average square footage of all non-missing entries and fill the gaps with that average value. Note that this shrinks the column's variance, because every imputed value sits exactly at the mean.
Median Imputation (for numerical data): Often preferred over the mean when the data has outliers or is skewed, as the median is more robust.
Mode Imputation (for categorical data): In a "Blood Type" column, you fill missing entries with the most frequently occurring blood type (the mode) in the dataset.
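The three central-tendency strategies differ only in the fill value, as the following sketch shows. It uses the standard `statistics` module; the `impute` helper and the sample columns are hypothetical.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Fill None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Made-up columns; None marks a missing value.
square_footage = [1000, 1200, None, 1400, 5000]   # 5000 is an outlier
blood_type = ["O", "A", None, "O", "B"]

print(impute(square_footage, "mean"))
print(impute(square_footage, "median"))
print(impute(blood_type, "mode"))
```

With the outlier present, the mean fill (2150) lands well above most observed values, while the median fill (1300) stays close to them, which is why the median is preferred for skewed data.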

3. Advanced Imputation Methods (Prediction-Based)
These methods use statistical or machine learning models to predict the missing values based on the other available features in the dataset.
When to use:
When data is missing in a non-random pattern.
You want a more accurate estimate of the missing data than simple central tendency methods provide.
You have a large enough dataset to build a predictive model.

Examples:

K-Nearest Neighbors (KNN) Imputation: To estimate a missing "Salary" value for one employee, the algorithm finds the 3 or 5 employees most similar to them (e.g., same job title, experience level) and uses the average salary of those neighbors as the imputed value.
Regression Imputation: You build a linear regression model where "Salary" is the target variable and "Years of Experience" and "Education Level" are features. The trained model is then used to predict the missing "Salary" values for entries where it is unknown.
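The KNN idea can be sketched in a few lines of plain Python. This is a simplified illustration, not a library implementation: `knn_impute`, the numeric encoding of education level, and the employee records are all assumptions.

```python
import math

def knn_impute(rows, target, features, k=3):
    """Fill missing `target` values with the mean target of the k rows
    that are closest (Euclidean distance) on the given feature columns."""
    known = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        neighbors = sorted(
            known,
            key=lambda other: math.dist(
                [row[f] for f in features], [other[f] for f in features]
            ),
        )[:k]
        estimate = sum(n[target] for n in neighbors) / len(neighbors)
        filled.append({**row, target: estimate})
    return filled

# Made-up employee records; None marks the missing salary.
employees = [
    {"experience": 1, "education": 1, "salary": 30000},
    {"experience": 2, "education": 1, "salary": 35000},
    {"experience": 3, "education": 2, "salary": 40000},
    {"experience": 10, "education": 3, "salary": 90000},
    {"experience": 2, "education": 2, "salary": None},
]
result = knn_impute(employees, "salary", ["experience", "education"], k=3)
print(result[-1]["salary"])
```

Because the distance mixes all features, the features should be on comparable scales before the neighbors are searched.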

#Question 4: What is an imbalanced dataset? Describe two techniques to handle it (theoretical + practical).


An imbalanced dataset is one where the distribution of observations across the known classes is unequal, meaning one class (the majority class) has significantly more examples than the other class (the minority class).

This imbalance is common in real-world scenarios such as fraud detection (very few fraudulent transactions), disease screening (very few positive cases), or hard drive failure prediction (very few failures). Standard machine learning algorithms tend to be biased towards the majority class because they typically optimize overall accuracy, so they often perform poorly on the minority class.

Here are two effective techniques to handle imbalanced datasets:

1. Resampling Techniques (Undersampling and Oversampling)
Resampling techniques involve modifying the composition of the dataset to achieve a more balanced class distribution.
Theoretical Description
Oversampling aims to increase the number of examples in the minority class by duplicating existing samples or creating synthetic ones.
Undersampling aims to decrease the number of examples in the majority class by randomly removing samples.
Practical Example: Fraud Detection
Consider a banking dataset with 10,000 transactions:
Class 0 (Not Fraudulent): 9,800 transactions (Majority)
Class 1 (Fraudulent): 200 transactions (Minority)
A. Undersampling Example: Random Undersampling
You randomly select and remove 9,600 non-fraudulent transactions to match the number of fraudulent transactions.
New Dataset Size: 400 transactions (200 of each class).
Pros: Simple, reduces training time.
Cons: Discards valuable data, potentially losing important information about non-fraudulent patterns.
B. Oversampling Example: SMOTE (Synthetic Minority Over-sampling Technique)
Instead of just duplicating existing fraud examples, SMOTE creates synthetic examples that are similar to existing minority samples but not identical. It draws a line between a minority sample and its neighbors and creates new data points along that line.
New Dataset Size: 19,600 transactions (9,800 of each class).
Pros: Doesn't lose information like undersampling, introduces variety into the minority class data.
Cons: Can introduce noise if neighbors are too spread out.
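The two resampling strategies can be sketched as follows, using the class sizes from the example. This is a simplified, SMOTE-like interpolation written for illustration, not the reference algorithm or a library API; the function names and the randomly generated 2-D feature vectors are assumptions.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, the same size as the minority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority

def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Simplified SMOTE: create each synthetic point on the line segment
    between a minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Precompute each minority point's k nearest minority neighbors.
    neighbors = [
        sorted((q for j, q in enumerate(minority) if j != i),
               key=lambda q: sq_dist(p, q))[:k]
        for i, p in enumerate(minority)
    ]
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base, neighbor = minority[i], rng.choice(neighbors[i])
        t = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbor)))
    return synthetic

# Made-up 2-D feature vectors with the class sizes from the example.
rng = random.Random(42)
non_fraud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(9800)]
fraud = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(200)]

kept_majority, _ = random_undersample(non_fraud, fraud)
new_fraud = smote_like_oversample(fraud, n_new=len(non_fraud) - len(fraud))
print(len(kept_majority) + len(fraud))               # undersampled total
print(len(non_fraud) + len(fraud) + len(new_fraud))  # oversampled total
```

Resampling should be applied to the training split only, so that the test set keeps the real class distribution.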

2. Using Different Evaluation Metrics

The standard accuracy metric can be misleading in imbalanced datasets. If you predict every transaction is "not fraudulent," you still achieve 98% accuracy in the example above, but you fail to detect any actual fraud.
Theoretical Description
We switch focus from simple accuracy to metrics that assess performance on the minority class specifically:
Precision: How many selected items are relevant? (Of all predicted frauds, how many were truly fraud?)
Recall (Sensitivity): How many relevant items are selected? (Of all actual frauds, how many did we detect?)
F1-Score: The harmonic mean of precision and recall.
Area Under the ROC Curve (AUC-ROC): Measures the classifier's ability to distinguish between classes.

Practical Example

In the fraud detection scenario, our goal is to minimize missed fraud cases (maximize Recall).

Poor Model (predicts only "Not Fraud"):

Accuracy: 98%

Recall for Fraud Class: 0% (It missed all 200 cases)

Good Model (detects half the fraud cases):

Accuracy: 97% (Slightly lower overall accuracy)

Recall for Fraud Class: 50% (It caught 100 cases)

By using metrics like the F1-Score or AUC-ROC score during model training and selection, you ensure the model optimizes for detecting the rare events, rather than simply getting the majority class correct.
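The numbers in this comparison can be reproduced from confusion-matrix counts. In the sketch below, `classification_metrics` is a hypothetical helper, and the 200 false positives for the good model are an assumption chosen so that its accuracy matches the 97% quoted above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive (fraud) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Poor model: predicts "not fraud" for all 10,000 transactions.
poor = classification_metrics(tp=0, fp=0, fn=200, tn=9800)

# Good model: catches 100 of the 200 frauds. The 200 false alarms are an
# assumption that makes the accuracy come out at 97%.
good = classification_metrics(tp=100, fp=200, fn=100, tn=9600)

print(poor)
print(good)
```

Under these assumed counts the good model has lower accuracy than the poor one (0.97 vs. 0.98) but an F1-score of 0.4 instead of 0, which is the behavior accuracy alone hides.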

#Question 5: Why is feature scaling important in ML? Compare Min-Max scaling and Standardization.

Ans. Feature scaling is crucial in machine learning because it ensures that all features contribute equally to the model's learning process, prevents bias due to differences in feature scales, improves algorithm convergence, and enhances model accuracy, especially for algorithms that rely on distance calculations or gradient-based optimization.

Importance of Feature Scaling
Prevents features with larger ranges from dominating those with smaller ranges, which could lead to biased model predictions.

Speeds up convergence for gradient descent-based algorithms.

Ensures that distance-based algorithms (like KNN, SVM, K-Means) perform reliably, as they are sensitive to feature scales.

Min-Max Scaling
Transforms features to a fixed range, typically [0, 1] or [-1, 1].

Formula:

$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$


Useful when the distribution of the data is not Gaussian and when algorithms do not assume any specific distribution.

Sensitive to outliers, as extreme values can compress the range of other data points.
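A minimal sketch of Min-Max scaling in plain Python follows. The helper name `min_max_scale`, its `new_min`/`new_max` parameters, and the sample values are made up for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the
    largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant feature: nothing to spread
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]          # made-up values
print(min_max_scale(ages))

incomes = [30, 40, 50, 1000]     # made-up values with one outlier
print(min_max_scale(incomes))
```

In the second list the single outlier takes the value 1.0 and squeezes the other three values into roughly the bottom 2% of the range, which illustrates the outlier sensitivity noted above.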

Standardization (Z-score Normalization)
Transforms features to have a mean of 0 and a standard deviation of 1.

Formula:

$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$


Suitable for features that follow a normal (Gaussian) distribution.

Less affected by outliers compared to Min-Max scaling.
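Standardization can be sketched the same way with the standard `statistics` module. The helper name `standardize` and the sample values are hypothetical, and the population standard deviation (`pstdev`) is used here as an assumption; some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: every z-score is 0
    return [(v - mu) / sigma for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values with mean 5 and std dev 2
z = standardize(scores)
print(z)
```

The result is unbounded: values far from the mean simply receive large positive or negative z-scores instead of being forced into a fixed range.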

Comparison Table

| Method | Range/Scale | Distribution Assumption | Outlier Sensitivity | Typical Use Cases |
|---|---|---|---|---|
| Min-Max Scaling | [0, 1] or [-1, 1] | None | High | Neural networks, image processing |
| Standardization | Mean=0, SD=1 | Gaussian | Lower | PCA, clustering, algorithms assuming normality |

In summary, feature scaling is essential for fair feature contribution and optimal model performance. Min-Max scaling is ideal for bounded ranges and non-Gaussian data, while standardization is preferred for normally distributed data and robustness to outliers.

#Question 6: Compare Label Encoding and One-Hot Encoding. When would you prefer one over the other?


Ans. Label encoding and one-hot encoding are two common techniques used to convert categorical variables into numerical form so machine learning models can process them. The choice between them depends on the nature of the data and the requirements of the model.

Label Encoding

Assigns a unique integer to each category in a categorical feature.
Creates a single numerical column, keeping the dimensionality low and memory usage efficient.
Best suited for ordinal data where categories have a natural order (e.g., "Low", "Medium", "High").
Can introduce artificial ordinal relationships in nominal data, potentially misleading models that treat the integers as magnitudes (e.g., KNN, linear regression).

One-Hot Encoding

Creates a new binary column for each unique category, with 1 indicating presence and 0 indicating absence.
Increases dimensionality, which can be computationally expensive for features with many categories.
Ideal for nominal data where categories have no inherent order (e.g., colors, countries).
Prevents models from assuming any ordinal relationship among categories, making it suitable for most algorithms.

When to Prefer Which?

Use label encoding when:

The categorical feature is ordinal.
Memory or computational efficiency is important.
Using tree-based models (e.g., decision trees, random forests), which can handle integer-encoded categories well.

Use one-hot encoding when:

The categorical feature is nominal.
The number of categories is not too large (to avoid high dimensionality).
Using models that are sensitive to ordinal relationships, such as linear regression or KNN.

In summary, label encoding is efficient and works well for ordered categories, while one-hot encoding is preferred for unordered categories to avoid misleading the model.
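The two encodings can be compared side by side in a short sketch. The helpers `label_encode` and `one_hot_encode` and the sample values are hypothetical; they only illustrate the shape of each encoding.

```python
def label_encode(values, order=None):
    """Map each category to an integer. Pass `order` for ordinal data so
    the integers respect the natural ranking."""
    categories = order if order is not None else sorted(set(values))
    mapping = {cat: i for i, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Create one 0/1 column per category (columns in sorted order)."""
    categories = sorted(set(values))
    return [[int(v == cat) for cat in categories] for v in values], categories

levels = ["Low", "High", "Medium", "Low"]        # ordinal feature
codes, mapping = label_encode(levels, order=["Low", "Medium", "High"])
print(codes, mapping)

colors = ["red", "green", "blue", "green"]       # nominal feature
matrix, columns = one_hot_encode(colors)
print(columns, matrix)
```

The ordinal feature stays a single column whose integers preserve Low < Medium < High, while the nominal feature expands into one column per color, so no color is treated as larger than another.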

#Q7. Analyze the relationship between app categories and ratings. Which categories have the highest/lowest average ratings, and what could be the possible reasons?
Dataset: https://github.com/MasteriNeuron/datasets.git


Ans. To analyze the relationship between app categories and ratings, one must perform an exploratory data analysis on the provided googleplaystore.csv dataset. Analysis performed by various data scientists on similar versions of this dataset generally found that the overall average rating across all apps is high (around 4.17 to 4.3 out of 5).
Categories with the Highest and Lowest Average Ratings
While specific results can vary slightly based on data cleaning methods (e.g., handling null values or outliers), several analyses highlight a consistent trend for certain categories.
Highest Average Ratings
Categories with consistently high average ratings tend to include:
COMICS: Analyses show this category often has the highest average sentiment or rating score.
EVENTS: This category also frequently ranks very highly in user sentiment and average ratings.
BEAUTY and AUTO_AND_VEHICLES: These are often cited as having high user satisfaction ratings.
Possible Reasons for High Ratings:
Niche Audience: These apps may cater to very specific interests or communities, meaning users who download them are already highly engaged and satisfied with specialized content.
Simple Utility: Apps in categories like "Beauty" or "Auto & Vehicles" might offer straightforward, functional value with minimal complexity, leading to positive user experiences.
Less Competition/Fewer Expectations: Compared to saturated markets like games, users might have lower or more easily met expectations for apps in these categories.
Lowest Average Ratings
Categories with relatively lower average ratings often include:
GAME and FAMILY: Despite being the most numerous categories and having the most installs, they often receive lower average ratings.
SOCIAL and TOOLS: These are also mentioned in some analyses as having lower user sentiment scores.
Possible Reasons for Low Ratings:
High User Expectations and Competition: The "Game" and "Family" markets are highly saturated and competitive. Users have high expectations for engaging content and seamless performance, leading to more critical reviews when expectations aren't met.
Subjectivity and Opinion: Games and social apps involve personal preference and opinion, leading to a wider range of user sentiment (both highly positive and highly negative).
Performance Issues: Complex apps or games might have bugs, performance issues, or feature requests that lead to lower scores compared to simpler, utility-focused apps.
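The category ranking itself comes from a group-and-average step, sketched below in plain Python. The column names `Category` and `Rating` are assumed to match the CSV header, and the rows are made up purely to exercise the function; they are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

def mean_rating_by_category(rows):
    """Average the Rating column per Category, skipping missing or
    out-of-range ratings, and return (category, mean) pairs sorted
    from highest to lowest."""
    ratings = defaultdict(list)
    for row in rows:
        try:
            rating = float(row["Rating"])
        except (TypeError, ValueError):
            continue                      # missing or non-numeric rating
        if 1.0 <= rating <= 5.0:          # also filters NaN and corrupted values
            ratings[row["Category"]].append(rating)
    return sorted(((cat, mean(vals)) for cat, vals in ratings.items()),
                  key=lambda pair: pair[1], reverse=True)

# Made-up rows for illustration only; real rows could come from
# csv.DictReader over googleplaystore.csv.
rows = [
    {"Category": "EVENTS", "Rating": "4.5"},
    {"Category": "EVENTS", "Rating": "4.3"},
    {"Category": "GAME", "Rating": "4.0"},
    {"Category": "GAME", "Rating": "NaN"},
    {"Category": "TOOLS", "Rating": "19"},
    {"Category": "TOOLS", "Rating": "3.9"},
]
ranking = mean_rating_by_category(rows)
print(ranking)
```

Categories with very few rated apps can top such a ranking by chance, so it helps to report the number of apps per category alongside the mean.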

#Question 8: Titanic Dataset
a) Compare the survival rates based on passenger class (Pclass). Which class had the highest
survival rate, and why do you think that happened?
b) Analyze how age (Age) affected survival. Group passengers into children (Age < 18) and
adults (Age ≥ 18). Did children have a better chance of survival?
Dataset: https://github.com/MasteriNeuron/datasets.git


Ans.

a) Survival Rates Based on Passenger Class (Pclass)

Based on analysis of the Titanic dataset, there were significant disparities in survival rates among the different passenger classes.

1st Class: ~63% survival rate.
2nd Class: ~47% survival rate.
3rd Class: ~24% survival rate.

 The class with the highest survival rate was the 1st class.

 Possible Reasons:

Proximity to Lifeboats: First-class cabins were typically located on the upper decks, closer to where the lifeboats were stored and launched, allowing for faster and easier access during the evacuation.
Priority and Social Status: There was an element of social hierarchy; first-class passengers were often given priority during the evacuation process. Wealthier passengers had paid significantly more for their tickets, and the crew prioritized assisting them.
Better Information: First-class passengers had better access to information from high-ranking officers and were more likely to understand the English-speaking crew's instructions, unlike many third-class passengers, who came from many countries and did not all speak English.

b) Age and Survival (Children vs. Adults)

Analyzing survival rates based on age groups shows that age was a factor, often combined with class and gender.

Overall: The general "women and children first" protocol meant children had a better chance of survival compared to adult men.
Children (Age < 18): Children in first and second class had very high survival rates (nearly 100% in second class, for example). Children in third class, however, had a significantly lower survival rate of around 34%, similar to first-class men's rates, due to access issues.
Adults (Age ≥ 18): The overall survival rate for adults was lower, particularly for men in third class, where the rate was just 14%.

Yes, children generally had a better chance of survival than adults, especially when considering the significant impact of the evacuation protocol and their access to lifeboats.
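Both comparisons reduce to computing a survival rate for a filtered group, which can be sketched as follows. The column name `Survived` (1 = survived, 0 = did not) is assumed alongside `Pclass` and `Age`, and the passenger records are made up for illustration, so the printed rates do not reflect the real dataset.

```python
def survival_rate(passengers, predicate):
    """Share of passengers matching `predicate` who survived, or None
    if no passenger matches."""
    group = [p for p in passengers if predicate(p)]
    if not group:
        return None
    return sum(p["Survived"] for p in group) / len(group)

# Made-up passengers for illustration only; Age may be missing (None).
passengers = [
    {"Pclass": 1, "Age": 38, "Survived": 1},
    {"Pclass": 1, "Age": 54, "Survived": 0},
    {"Pclass": 2, "Age": 14, "Survived": 1},
    {"Pclass": 3, "Age": 2, "Survived": 0},
    {"Pclass": 3, "Age": 22, "Survived": 0},
    {"Pclass": 3, "Age": None, "Survived": 1},
]

by_class = {c: survival_rate(passengers, lambda p, c=c: p["Pclass"] == c)
            for c in (1, 2, 3)}
children = survival_rate(
    passengers, lambda p: p["Age"] is not None and p["Age"] < 18)
adults = survival_rate(
    passengers, lambda p: p["Age"] is not None and p["Age"] >= 18)
print(by_class, children, adults)
```

Passengers with a missing age are left out of both age groups here; imputing their ages instead is a separate design choice that would change the group rates.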

#Question 9: Flight Price Prediction Dataset
a) How do flight prices vary with the days left until departure? Identify any exponential price
surges and recommend the best booking window.
b)Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are
consistently cheaper/premium, and why?
Dataset: https://github.com/MasteriNeuron/datasets.git

Ans. Analysis of the flight_price.csv dataset, which generally contains data on domestic Indian flights from early 2019, provides insights into pricing dynamics based on booking time and airline carrier.

a) How Flight Prices Vary with Days Left Until Departure
Flight prices in the dataset are heavily influenced by supply and demand. Analysis of the data reveals a general trend where prices increase rapidly as the departure date approaches.

General Trend: Prices tend to be lower when booking well in advance.
Exponential Price Surges: A significant price surge is typically observed in the final few days before departure (often within 1 to 7 days of the flight date). This happens because airlines identify last-minute travelers as those with urgent needs who are willing to pay premium prices, driven by high demand for remaining seats.

Best Booking Window Recommendation: The optimal booking window recommended by analyses of this data is typically between 3 weeks and 3 months before the departure date. Booking during this period often yields the lowest average fares, as airlines work to fill seats without signaling distress pricing.
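One way to look for the surge is to average prices within booking windows. The sketch below assumes each record has `days_left` and `price` fields, which may not match the actual column names in the CSV; the window edges and the fares are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def mean_price_by_window(flights, edges=(7, 21, 90)):
    """Average price per booking window. `edges` are ascending upper
    bounds in days; each label covers the days above the previous edge,
    and anything beyond the last edge falls into a final window."""
    labels = [f"<= {e} days" for e in edges] + [f"> {edges[-1]} days"]
    buckets = defaultdict(list)
    for flight in flights:
        index = sum(flight["days_left"] > e for e in edges)
        buckets[labels[index]].append(flight["price"])
    return {label: mean(buckets[label]) for label in labels if buckets[label]}

# Made-up fares for illustration only.
flights = [
    {"days_left": 1, "price": 12000},
    {"days_left": 5, "price": 9000},
    {"days_left": 15, "price": 6000},
    {"days_left": 30, "price": 4500},
    {"days_left": 60, "price": 4700},
]
summary = mean_price_by_window(flights)
print(summary)
```

Plotting the mean price for every individual value of days left, rather than a few windows, would show more precisely where the curve starts to climb.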

b) Comparison of Prices Across Airlines (e.g., Delhi-Mumbai)
The dataset covers several airlines operating across major Indian routes, including Delhi-Mumbai. Airlines in the dataset can be broadly categorized as budget carriers and premium carriers.

Consistently Cheaper Airlines:

IndiGo, SpiceJet, AirAsia, and GoAir are consistently identified as low-cost carriers in the Indian market. IndiGo frequently appears as an airline offering cheaper fares, although prices can fluctuate based on specific routes and timing.
Reason: These airlines operate on a low-cost model, often offering no-frills service, unbundled amenities (meals cost extra), and optimized routes to keep operating costs down, passing savings onto the customer.

Premium/More Expensive Airlines:

Jet Airways, Air India, and Vistara are generally found to have higher average ticket prices compared to budget carriers. Jet Airways, which had a significant presence in the 2019 dataset, often accounted for the most expensive fares recorded.

Reason: These carriers often provide full-service amenities (e.g., in-flight meals, better cabin class options like Business Class, better baggage allowances), operate more complex routes with more stops, and position themselves as premium brands.
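A per-route comparison can be sketched by filtering to one route and taking the median fare per airline. The field names `airline`, `source`, `destination`, and `price` are assumptions about the data layout, and the fares are made up for illustration.

```python
from collections import defaultdict
from statistics import median

def median_price_by_airline(flights, source, destination):
    """Median fare per airline on one route, cheapest first."""
    prices = defaultdict(list)
    for f in flights:
        if f["source"] == source and f["destination"] == destination:
            prices[f["airline"]].append(f["price"])
    return sorted(((airline, median(p)) for airline, p in prices.items()),
                  key=lambda pair: pair[1])

# Made-up fares for illustration only.
flights = [
    {"airline": "IndiGo", "source": "Delhi", "destination": "Mumbai", "price": 4000},
    {"airline": "IndiGo", "source": "Delhi", "destination": "Mumbai", "price": 4400},
    {"airline": "Vistara", "source": "Delhi", "destination": "Mumbai", "price": 7000},
    {"airline": "Vistara", "source": "Delhi", "destination": "Mumbai", "price": 9000},
    {"airline": "Vistara", "source": "Delhi", "destination": "Kolkata", "price": 3000},
]
ranking = median_price_by_airline(flights, "Delhi", "Mumbai")
print(ranking)
```

The median is used instead of the mean so that a few very expensive tickets do not dominate an airline's typical fare.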

#Question 10: HR Analytics Dataset
a). What factors most strongly correlate with employee attrition? Use visualizations to show key
drivers (e.g., satisfaction, overtime, salary).
b). Are employees with more projects more likely to leave?
Dataset: hr_analytics

Ans.

Analysis of the hr_analytics.csv dataset, which is typically a version of the fictional attrition dataset created by IBM data scientists, reveals key drivers for employee attrition.

a) Factors Most Strongly Correlated with Employee Attrition
The most significant factors influencing attrition are typically monthly income, overtime, age, and job satisfaction.
Monthly Income: A strong negative correlation exists; employees with lower monthly incomes are significantly more likely to leave. A large number of leavers earn below $5,000 monthly, suggesting that competitive compensation is a key retention factor.
Overtime: Employees who frequently work overtime have a higher attrition rate. The added stress often leads to a poor work-life balance, which is a strong predictor of departure.
Age: Younger employees, particularly those under 35, have a higher attrition rate. They are often seeking career advancement opportunities and are more willing to switch jobs.
Job Satisfaction: Employees with lower job satisfaction levels are significantly more likely to leave. As job involvement increases, the attrition rate decreases.
Total Working Years/Years at Company: Employees with less experience and shorter tenure at the company are more prone to leaving.
(Note: The visualizations themselves are produced during the analysis from the provided CSV file; the observations listed above summarize what those plots show.)
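Each driver listed above can be checked by comparing attrition rates across the groups of one column, which is also the quantity a bar chart of that driver would display. The sketch below assumes `Attrition` and `OverTime` columns holding "Yes"/"No" values; the records are made up for illustration.

```python
from collections import defaultdict

def attrition_rate_by(employees, column):
    """Share of employees with Attrition == "Yes" in each group of `column`."""
    counts = defaultdict(lambda: [0, 0])        # group -> [leavers, total]
    for e in employees:
        counts[e[column]][0] += e["Attrition"] == "Yes"
        counts[e[column]][1] += 1
    return {group: left / total for group, (left, total) in counts.items()}

# Made-up records for illustration only.
employees = [
    {"OverTime": "Yes", "Attrition": "Yes"},
    {"OverTime": "Yes", "Attrition": "Yes"},
    {"OverTime": "Yes", "Attrition": "No"},
    {"OverTime": "No", "Attrition": "No"},
    {"OverTime": "No", "Attrition": "No"},
    {"OverTime": "No", "Attrition": "Yes"},
    {"OverTime": "No", "Attrition": "No"},
]
rates = attrition_rate_by(employees, "OverTime")
print(rates)
```

The same function works for any categorical column; numeric columns such as monthly income would first need to be binned into ranges.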
b) Are employees with more projects more likely to leave?
The specific dataset provided does not contain a variable named "Projects" or a direct count of projects. However, the data does include a "JobInvolvement" variable.
Generally, a higher level of job involvement is associated with a lower likelihood of attrition. This implies that engaged and involved employees tend to stay with the company. Therefore, while not directly tied to a "number of projects" metric, employees who are more invested in their roles (which might correlate with being involved in multiple projects or tasks) are generally less likely to seek opportunities elsewhere.
Conversely, some analyses indicate that employees with very high performance ratings leave at the same rate as others, suggesting that high performers might not be adequately rewarded or motivated to stay, which could be related to workload or projects.