Question 1: What is the difference between AI, ML, DL, and Data Science?

Artificial Intelligence (AI): Broad field that aims to make machines mimic human intelligence such as reasoning, problem solving, and decision-making.  
Machine Learning (ML): Subset of AI that lets systems learn from data to make predictions or decisions without explicit programming.  
Deep Learning (DL): Subset of ML that uses multi-layered neural networks to automatically extract complex patterns from large datasets.  
Data Science: An interdisciplinary field combining statistics, mathematics, programming, and domain knowledge to extract insights and build predictive models.

AI > ML > DL is a nesting by scope: DL is a subset of ML, which is a subset of AI. Data Science overlaps all three, using their methods alongside data handling, visualization, and statistical modeling.


Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent them?

Underfitting occurs when a model is too simple to capture the underlying data patterns — it performs poorly on both training and test data (high bias).

Overfitting happens when a model learns the noise in the training data — it performs well on training data but poorly on unseen data (high variance).

Detection:
- Compare training and validation accuracy.
- If training accuracy is high but validation accuracy is low → overfitting.
- If both accuracies are low → underfitting.
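
The comparison above can be sketched with a model whose complexity is easy to vary. This is a minimal illustration, assuming a synthetic dataset and a decision tree (neither comes from the question); the exact numbers will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration only)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for depth in (1, 4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    results[depth] = (train_acc, val_acc)
    print(f"max_depth={depth}: train={train_acc:.2f}, "
          f"val={val_acc:.2f}, gap={train_acc - val_acc:.2f}")
```

A large train/validation gap points to overfitting (typically the unrestricted tree), while low scores on both sets point to underfitting (typically the depth-1 tree).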

Prevention Methods:
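1. Cross-validation: Evaluate on several train/validation folds to get a reliable estimate of generalization and to tune model complexity.
2. Regularization: Add penalties (L1/L2) to prevent overly complex models.
3. Early Stopping: Stop training when validation error stops improving or starts to rise.
4. Simplify model (reduce parameters) or collect more data.
5. Use Dropout or Data Augmentation (for neural networks).
6. For underfitting: use a more expressive model, add informative features, or reduce regularization.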
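
As a rough sketch of how regularization and cross-validation work together, the snippet below scores an L2-regularized logistic regression at three penalty strengths. The dataset and the `C` values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with many uninformative features (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

cv_means = {}
for C in (0.01, 1.0, 100.0):  # smaller C means a stronger L2 penalty
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=C, max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    cv_means[C] = scores.mean()
    print(f"C={C}: mean CV accuracy={scores.mean():.3f} (std {scores.std():.3f})")
```

Picking the `C` with the best mean cross-validated score is one simple way to choose the penalty strength without touching the test set.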

Bias–Variance Trade-off:
Aim for a balance between low bias (enough complexity) and low variance (good generalization).
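
The trade-off can be written out for squared error. This is the standard decomposition, stated under the usual assumption that y = f(x) + ε, with noise ε of zero mean and variance σ²:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Making the model more complex usually lowers the bias term and raises the variance term; the noise term cannot be reduced by any model.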


Question 3: How would you handle missing values in a dataset? Explain at least three methods with examples.

Missing values occur when no data is stored for a variable in an observation. Handling them properly is essential to maintain model accuracy and data integrity.

Common Methods:

1. Deletion (Listwise or Pairwise)
   - Remove rows or columns with missing values.
   - Best when missing data is very small and random.
   - Example: df.dropna()

2. Imputation (Mean / Median / Mode)
   - Replace missing values with a representative statistic.
   - Mean or median for numerical data; mode for categorical.
   - Example: df["Age"] = df["Age"].fillna(df["Age"].median())

3. Predictive Imputation
   - Use other features to predict the missing value.
   - Techniques include KNN Imputer or regression models.
   - Example: KNNImputer(n_neighbors=2).fit_transform(df[["Age", "Salary"]])

Other methods:
- Use a constant (e.g., “Unknown”) for categorical variables.
- Create an indicator column showing where data was missing.
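
The two extra methods above can be sketched in plain pandas. The small frame and the `Age_missing` column name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (values are assumptions)
df = pd.DataFrame({
    "Age": [25, 30, np.nan, 22, np.nan],
    "City": ["Delhi", None, "Mumbai", "Delhi", "Kolkata"],
})

# Indicator column: record where Age was missing *before* filling it
df["Age_missing"] = df["Age"].isna().astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())

# Constant fill for a categorical column
df["City"] = df["City"].fillna("Unknown")

print(df)
```

Keeping the indicator lets a model learn from the fact that a value was missing, which a plain fill would hide.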


In [1]:
# Handling missing values in Python

import pandas as pd
from sklearn.impute import KNNImputer

# Sample Data
data = {
    'Age': [25, 30, None, 22, None],
    'Salary': [50000, None, 40000, 35000, 45000],
    'City': ['Delhi', None, 'Mumbai', 'Delhi', 'Kolkata']
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# 1. Deletion
df_drop = df.dropna()
print("\nAfter Deletion:\n", df_drop)

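# 2. Imputation with Median / Mean / Mode
# Assign the result back instead of using chained inplace=True, which pandas has deprecated
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
df['City'] = df['City'].fillna(df['City'].mode()[0])
print("\nAfter Simple Imputation:\n", df)

# 3. KNN Imputation (Predictive)
# Note: df has no missing values left after step 2, so this call leaves the values unchanged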
imputer = KNNImputer(n_neighbors=2)
numeric_cols = ['Age', 'Salary']
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
print("\nAfter KNN Imputation:\n", df)


Original Data:
     Age   Salary     City
0  25.0  50000.0    Delhi
1  30.0      NaN     None
2   NaN  40000.0   Mumbai
3  22.0  35000.0    Delhi
4   NaN  45000.0  Kolkata

After Deletion:
     Age   Salary   City
0  25.0  50000.0  Delhi
3  22.0  35000.0  Delhi

After Simple Imputation:
     Age   Salary     City
0  25.0  50000.0    Delhi
1  30.0  42500.0    Delhi
2  25.0  40000.0   Mumbai
3  22.0  35000.0    Delhi
4  25.0  45000.0  Kolkata

After KNN Imputation:
     Age   Salary     City
0  25.0  50000.0    Delhi
1  30.0  42500.0    Delhi
2  25.0  40000.0   Mumbai
3  22.0  35000.0    Delhi
4  25.0  45000.0  Kolkata
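
In the cell above, KNN imputation runs after the median/mean fill, so it has nothing left to impute and its table matches the previous one. A minimal sketch of applying `KNNImputer` directly to the un-imputed numeric columns is shown here; the frame name `raw` is an assumption, and the values mirror the sample data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Numeric columns of the sample data, with missing values still present
raw = pd.DataFrame({
    "Age": [25, 30, np.nan, 22, np.nan],
    "Salary": [50000, np.nan, 40000, 35000, 45000],
})

# Each missing value is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows have
imputer = KNNImputer(n_neighbors=2)
knn_filled = pd.DataFrame(imputer.fit_transform(raw),
                          columns=raw.columns, index=raw.index)
print(knn_filled)
```

Age and Salary are on very different scales, so on real data it is usually worth scaling the features before a distance-based imputer.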



Question 4: What is an imbalanced dataset? Describe two techniques to handle it.

An imbalanced dataset is one where the classes are represented very unequally, for example 95% of samples in class 0 and only 5% in class 1.  
A model trained on such data tends to favor the majority class, so it can report high accuracy while performing poorly on the minority class.

Two techniques to handle imbalance:

1. Resampling Techniques
   - Oversampling: Increase the minority class samples (e.g., SMOTE).
   - Undersampling: Reduce samples from the majority class.
   Example: Use SMOTE from imbalanced-learn library.

2. Class Weight Adjustment
   - Give higher penalty to misclassification of minority class.
   - Many models (like Logistic Regression, Random Forest) have `class_weight='balanced'` parameter.

Other helpful steps:
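- Use metrics such as Precision, Recall, F1-score, and ROC-AUC rather than Accuracy alone, which can look high even when the model ignores the minority class.
- When using cross-validation, stratify the folds and apply any resampling only to the training portion of each fold, so the validation data keeps its original distribution.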
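
A minimal sketch of stratified cross-validation scored with imbalance-aware metrics follows. To stay within scikit-learn it uses class weights rather than resampling; if SMOTE is used instead, it should be fitted inside each training fold (for example through imbalanced-learn's `Pipeline`). The dataset parameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic 90/10 imbalanced data (assumption for illustration)
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.9, 0.1], random_state=42)

# Stratified folds keep roughly the same class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(class_weight="balanced", max_iter=1000)

metrics = ["accuracy", "f1", "recall", "roc_auc"]
scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
for name in metrics:
    print(f"{name}: {scores['test_' + name].mean():.3f}")
```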


In [2]:
# Handling imbalanced dataset using SMOTE and Class Weights

!pip install imbalanced-learn --quiet

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42)
print("Original Class Distribution:", {0: sum(y==0), 1: sum(y==1)})

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Without balancing
model_unbalanced = LogisticRegression()
model_unbalanced.fit(X_train, y_train)
y_pred_unbalanced = model_unbalanced.predict(X_test)
print("\nWithout Balancing:\n", classification_report(y_test, y_pred_unbalanced))

# 2. Using SMOTE (Oversampling minority class)
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print("\nAfter SMOTE Resampling:", {0: sum(y_res==0), 1: sum(y_res==1)})

model_smote = LogisticRegression()
model_smote.fit(X_res, y_res)
y_pred_smote = model_smote.predict(X_test)
print("\nWith SMOTE:\n", classification_report(y_test, y_pred_smote))

# 3. Using Class Weights
model_weighted = LogisticRegression(class_weight='balanced')
model_weighted.fit(X_train, y_train)
y_pred_weighted = model_weighted.predict(X_test)
print("\nWith Class Weights:\n", classification_report(y_test, y_pred_weighted))


Original Class Distribution: {0: np.int64(895), 1: np.int64(105)}

Without Balancing:
               precision    recall  f1-score   support

           0       0.94      1.00      0.97       272
           1       1.00      0.36      0.53        28

    accuracy                           0.94       300
   macro avg       0.97      0.68      0.75       300
weighted avg       0.94      0.94      0.93       300


After SMOTE Resampling: {0: np.int64(623), 1: np.int64(623)}

With SMOTE:
               precision    recall  f1-score   support

           0       0.98      0.90      0.94       272
           1       0.47      0.82      0.60        28

    accuracy                           0.90       300
   macro avg       0.72      0.86      0.77       300
weighted avg       0.93      0.90      0.91       300


With Class Weights:
               precision    recall  f1-score   support

           0       0.98      0.90      0.94       272
           1       0.46      0.86      0.60        2

Question 6: Compare Label Encoding and One-Hot Encoding. When would you prefer one over the other?

Label Encoding:
- Converts categories to integers.
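- Suits ordinal variables (ordered), provided the integers follow the real order. scikit-learn's LabelEncoder assigns codes alphabetically and is intended for target labels, so an explicit mapping is needed to preserve the order.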

One-Hot Encoding:
- Creates binary columns for each category.
- Best for nominal variables (unordered).

Preference:
- Ordinal → Label Encoding
- Nominal → One-Hot Encoding


In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Education': ['High School', 'Bachelor', 'Master', 'PhD'],
    'Color': ['Red', 'Green', 'Blue', 'Green']
})

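# Label Encoding (LabelEncoder assigns codes in alphabetical order:
# Bachelor=0, High School=1, Master=2, PhD=3, so the true ordering is not preserved)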
le = LabelEncoder()
data['Education_Label'] = le.fit_transform(data['Education'])

# One-Hot Encoding for nominal
data = pd.concat([data, pd.get_dummies(data['Color'], prefix='Color')], axis=1)

print(data)


     Education  Color  Education_Label  Color_Blue  Color_Green  Color_Red
0  High School    Red                1       False        False       True
1     Bachelor  Green                0       False         True      False
2       Master   Blue                2        True        False      False
3          PhD  Green                3       False         True      False
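
In the output above, High School is encoded as 1 and Bachelor as 0 because `LabelEncoder` sorts categories alphabetically. One way to keep the intended order is `OrdinalEncoder` with an explicit category list. This is a sketch; the `education_order` list and the `Education_Ordinal` column name are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({"Education": ["High School", "Bachelor", "Master", "PhD"]})

# Spell out the order explicitly, lowest to highest (assumed ordering)
education_order = ["High School", "Bachelor", "Master", "PhD"]
encoder = OrdinalEncoder(categories=[education_order])

# OrdinalEncoder expects a 2-D input, hence the double brackets
encoded = encoder.fit_transform(data[["Education"]])
data["Education_Ordinal"] = encoded.ravel().astype(int)

print(data)
```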


Question 7: Analyze the relationship between app categories and ratings. Which categories have the highest/lowest average ratings, and what could be the possible reasons?

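- Group apps by category and calculate average ratings.
- Ratings are on a 1–5 scale, so a category mean above 5 signals bad data. The `1.9` entry with a mean of 19.0 in the output below comes from a malformed row and should be excluded; among the genuine categories, EVENTS ranks highest (about 4.44).
- Categories with the highest average ratings may have better app quality or user engagement.
- Categories with the lowest average ratings may have low-quality apps or inconsistent updates.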
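
A sketch of guarding the group-by against out-of-range ratings is shown below. The tiny frame is hypothetical stand-in data, not the real Play Store file; only the `Category` and `Rating` column names are taken from the dataset.

```python
import pandas as pd

# Hypothetical mini-sample, including one malformed row
apps = pd.DataFrame({
    "Category": ["EVENTS", "EVENTS", "TOOLS", "TOOLS", "1.9"],
    "Rating": ["4.5", "4.3", "4.0", "NaN", "19"],
})

# Coerce to numbers, then keep only ratings inside the valid 1-5 range
# (between() is False for NaN, so missing ratings are dropped as well)
apps["Rating"] = pd.to_numeric(apps["Rating"], errors="coerce")
valid = apps[apps["Rating"].between(1, 5)]

# Report the count next to the mean: small categories give noisy averages
category_ratings = (valid.groupby("Category")["Rating"]
                    .agg(["mean", "count"])
                    .sort_values("mean", ascending=False))
print(category_ratings)
```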


In [5]:
import pandas as pd

# Load dataset from GitHub URL
url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/googleplaystore.csv"
data = pd.read_csv(url)

# Clean data: convert 'Rating' to numeric (non-numeric entries become NaN),
# then drop rows with a missing rating
data['Rating'] = pd.to_numeric(data['Rating'], errors='coerce')
data = data.dropna(subset=['Rating'])

# Group by 'Category' and calculate average rating
category_ratings = data.groupby('Category')['Rating'].mean().sort_values(ascending=False)

print("Average Ratings by Category:\n", category_ratings)

# Highest and lowest rated categories
print("\nHighest rated category:", category_ratings.idxmax(), "->", category_ratings.max())
print("Lowest rated category:", category_ratings.idxmin(), "->", category_ratings.min())


Average Ratings by Category:
 Category
1.9                    19.000000
EVENTS                  4.435556
EDUCATION               4.389032
ART_AND_DESIGN          4.358065
BOOKS_AND_REFERENCE     4.346067
PERSONALIZATION         4.335987
PARENTING               4.300000
GAME                    4.286326
BEAUTY                  4.278571
HEALTH_AND_FITNESS      4.277104
SHOPPING                4.259664
SOCIAL                  4.255598
WEATHER                 4.244000
SPORTS                  4.223511
PRODUCTIVITY            4.211396
HOUSE_AND_HOME          4.197368
FAMILY                  4.192272
PHOTOGRAPHY             4.192114
AUTO_AND_VEHICLES       4.190411
MEDICAL                 4.189143
LIBRARIES_AND_DEMO      4.178462
FOOD_AND_DRINK          4.166972
COMMUNICATION           4.158537
COMICS                  4.155172
NEWS_AND_MAGAZINES      4.132189
FINANCE                 4.131889
ENTERTAINMENT           4.126174
BUSINESS                4.121452
TRAVEL_AND_LOCAL        4.109292
LIFE

Question 8: Titanic Dataset Analysis

a) Compare the survival rates based on passenger class (Pclass).
- Lower class passengers (3rd class) had lower survival rates due to limited access to lifeboats.
- Higher class passengers (1st class) had the highest survival rates because of priority during evacuation.

b) Analyze how age (Age) affected survival.
- Group passengers into children (Age < 18) and adults (Age ≥ 18).
- Children generally had a higher survival rate due to 'women and children first' policy during evacuation.


In [6]:
import pandas as pd

# Load dataset from GitHub URL
url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/titanic.csv"
titanic = pd.read_csv(url)

# a) Survival rate by Pclass
pclass_survival = titanic.groupby('Pclass')['Survived'].mean()
print("Survival Rate by Passenger Class:\n", pclass_survival)
print("Highest survival class:", pclass_survival.idxmax(), "->", pclass_survival.max())

# b) Survival rate by age group
# Note: passengers with a missing Age fall into 'Adult' here, because NaN < 18 is False
titanic['AgeGroup'] = titanic['Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')
age_survival = titanic.groupby('AgeGroup')['Survived'].mean()
print("\nSurvival Rate by Age Group:\n", age_survival)
print("Children had better chance of survival:", age_survival['Child'] > age_survival['Adult'])


Survival Rate by Passenger Class:
 Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
Highest survival class: 1 -> 0.6296296296296297

Survival Rate by Age Group:
 AgeGroup
Adult    0.361183
Child    0.539823
Name: Survived, dtype: float64
Children had better chance of survival: True
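
The grouping in the cell above places passengers with a missing Age in the 'Adult' group. A sketch of an alternative that leaves unknown ages out of both groups uses `pd.cut`. The mini-sample is hypothetical, so its rates say nothing about the real Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with two unknown ages
passengers = pd.DataFrame({
    "Age": [4, 17, 18, 35, np.nan, 60, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 1],
})

# right=False gives the bins [0, 18) and [18, inf), i.e. Age < 18 vs Age >= 18;
# rows with a missing Age get a missing AgeGroup instead of 'Adult'
passengers["AgeGroup"] = pd.cut(passengers["Age"], bins=[0, 18, np.inf],
                                labels=["Child", "Adult"], right=False)

age_survival = (passengers.groupby("AgeGroup", observed=True)["Survived"]
                .agg(["mean", "count"]))
unknown_age = int(passengers["AgeGroup"].isna().sum())

print(age_survival)
print("Rows with unknown age:", unknown_age)
```

Reporting the count alongside the rate also shows how many passengers each group's estimate rests on.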


Question 9: Flight Price Prediction Dataset

a) Flight prices vs. days left until departure:
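- Prices typically rise as the departure date approaches, often steeply in the final days before departure.
- Booking a few weeks ahead (for example 2–3 weeks) is commonly cheaper than booking in the last days. The dataset failed to load in the cell below (HTTP 404), so neither point is verified here.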

b) Compare prices across airlines for the same route (Delhi-Mumbai):
- Some airlines are consistently cheaper due to budget operations.
- Premium airlines charge more due to better services and amenities.
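
One way to check the days-left pattern, once the data is available, is to compare median prices across booking windows. The sketch below uses made-up rows and assumes the `DaysLeft` and `Price` column names used in the cell that follows; the bin edges are also an assumption.

```python
import pandas as pd

# Made-up rows for illustration only; not real fares
flights = pd.DataFrame({
    "Airline": ["A", "A", "B", "B", "A", "B"],
    "DaysLeft": [2, 20, 3, 25, 40, 45],
    "Price": [9000, 5000, 7000, 4000, 4500, 3800],
})

# Bucket days-left into booking windows: (0, 7], (7, 30], (30, 60]
bins = [0, 7, 30, 60]
labels = ["0-7", "8-30", "31-60"]
flights["Window"] = pd.cut(flights["DaysLeft"], bins=bins, labels=labels)

# The median is less sensitive to a few very expensive tickets than the mean
median_by_window = flights.groupby("Window", observed=True)["Price"].median()
print(median_by_window)
```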


In [7]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset from GitHub URL
url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/FlightPrices.csv"
flights = pd.read_csv(url)

# a) Flight prices vs days left until departure
plt.figure(figsize=(8,5))
plt.scatter(flights['DaysLeft'], flights['Price'], alpha=0.5)
plt.xlabel('Days Left Until Departure')
plt.ylabel('Price')
plt.title('Flight Price vs Days Left')
plt.show()

# b) Compare prices across airlines for Delhi-Mumbai
route = flights[flights['Route'] == 'Delhi-Mumbai']
airline_prices = route.groupby('Airline')['Price'].mean().sort_values()
print("Average Prices by Airline for Delhi-Mumbai:\n", airline_prices)
print("\nCheapest Airline:", airline_prices.idxmin(), "->", airline_prices.min())
print("Premium Airline:", airline_prices.idxmax(), "->", airline_prices.max())


HTTPError: HTTP Error 404: Not Found

Question 10: HR Analytics Dataset

a) Factors correlating with employee attrition:
- Satisfaction level, overtime, and salary are strong indicators.
- Low satisfaction, high overtime, and low salary increase attrition risk.

b) Are employees with more projects more likely to leave?
- Employees handling too many projects may have higher attrition due to stress and workload.


In [8]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset from GitHub URL
url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/hr_analytics.csv"
data = pd.read_csv(url)

# a) Correlation with attrition
corr = data.corr(numeric_only=True)['left'].sort_values(ascending=False)
print("Correlation with Attrition:\n", corr)

# Visualize key drivers
plt.figure(figsize=(8,5))
sns.barplot(x=corr.index, y=corr.values)
plt.xticks(rotation=45)
plt.title('Correlation of Features with Attrition')
plt.show()

# b) Attrition vs number of projects
projects_attrition = data.groupby('number_project')['left'].mean()
print("\nAttrition Rate by Number of Projects:\n", projects_attrition)

plt.figure(figsize=(6,4))
projects_attrition.plot(kind='bar')
plt.xlabel('Number of Projects')
plt.ylabel('Attrition Rate')
plt.title('Attrition Rate by Number of Projects')
plt.show()

FileNotFoundError: [Errno 2] No such file or directory: 'hr_analytics.csv'