Question 1 - Difference between AI, ML, DL, and Data Science (brief)

AI (Artificial Intelligence)
Scope: Broad field that builds systems that perform tasks normally requiring human intelligence (planning, reasoning, perception, language).
Techniques: Rule-based systems, search, logic, planning, ML.
Applications: Chatbots, game playing, automated planning, expert systems.

ML (Machine Learning)
Scope: Subset of AI focused on algorithms that learn patterns from data and make predictions/decisions.
Techniques: Supervised (classification/regression), unsupervised (clustering), reinforcement learning.
Applications: Spam detection, recommendation engines, fraud detection.

DL (Deep Learning)
Scope: Subset of ML using multi-layer neural networks (deep neural nets) to learn hierarchical features from large data.
Techniques: CNNs, RNNs/LSTMs, Transformers, autoencoders.
Applications: Image recognition, speech-to-text, large language models.

Data Science
Scope: Interdisciplinary field using statistics, ML, domain knowledge, and data engineering to extract insights and build data products.
Techniques: Data cleaning, EDA, visualization, statistical inference, ML modeling, deployment/monitoring.
Applications: Business analytics, A/B testing, forecasting, dashboards.

Short comparison: AI is the broad goal; ML is the data-driven approach within AI; DL is a set of powerful ML models (neural nets) for complex data; Data Science is the practice of analyzing data (may use ML/DL) to answer questions and make decisions.

Question 2 - Overfitting and Underfitting (detect & prevent)

Definitions

Underfitting: Model is too simple to capture underlying patterns → poor performance on train and test.

Overfitting: Model fits training data (including noise) too closely → low train error but high test error.

Bias-Variance tradeoff

Underfitting ⇨ high bias, low variance.

Overfitting ⇨ low bias, high variance.
Goal: find the sweet spot that minimizes expected generalization error (bias² + variance + irreducible noise).
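
For squared-error loss, the three terms in that goal can be written out explicitly. This assumes data generated as y = f(x) + ε with zero-mean noise of variance σ²; the symbols f, f̂, ε, and σ are notation introduced here, not taken from the answer above. The expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```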

Detection

Compare training vs validation/test error. Underfitting → both errors high. Overfitting → training error much lower than validation error.

Learning curves (plot error vs training set size): overfitting shows large gap between train/val; underfitting shows both high and close.
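
The train-versus-validation comparison can be tried on a toy problem. This is a minimal sketch, assuming a made-up dataset (noisy samples of sin(πx)) and polynomial fits of degree 1, 5, and 12 as stand-ins for a too-simple, reasonable, and too-flexible model; none of these choices come from the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(pi * x) on [-1, 1] (hypothetical toy problem)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# degree -> (train MSE, validation MSE)
results = {}
for degree in (1, 5, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))

for degree, (train_err, val_err) in results.items():
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  "
          f"val MSE={val_err:.3f}  gap={val_err - train_err:.3f}")
```

Degree 1 should show both errors high (underfitting). Degree 12 has the lowest training error and typically the widest train/validation gap (overfitting).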

Prevention / fixes

For overfitting:

More data (if possible).

Regularization (L1/L2), dropout for NNs.

Simpler model (reduce parameters/features).

Early stopping (stop training once validation loss stops improving and starts to rise).

Cross-validation to pick hyperparameters.

For underfitting:

Increase model complexity (add features, deeper model).

Reduce regularization.

Feature engineering.

Cross-validation (k-fold) helps evaluate generalization reliably and tune hyperparameters to avoid overfitting.
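
A minimal sketch of k-fold cross-validation used to pick a hyperparameter, here the polynomial degree on a made-up noisy sine dataset. The helper names kfold_indices and cv_mse are hypothetical, written for illustration; in practice sklearn's KFold or cross_val_score would usually do this job.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is in validation exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cv_mse(x, y, degree, k=5):
    """Mean validation MSE of a polynomial fit of the given degree across k folds."""
    errors = []
    for train_idx, val_idx in kfold_indices(len(x), k):
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(errors))

# Hypothetical toy data: noisy sin(pi * x)
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=60)

# Score each candidate degree and keep the one with the lowest CV error
scores = {degree: cv_mse(x, y, degree) for degree in (1, 3, 5, 9)}
best_degree = min(scores, key=scores.get)
print(scores, "-> chosen degree:", best_degree)
```

Because every candidate is scored on data it was not fitted to, the chosen degree reflects generalization rather than fit to the training set.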

Question 3 - Handling missing values (≥3 methods, examples)

Deletion (listwise / row removal)

Remove rows with missing values (or columns with many missing).

Use when missingness is small and random (MCAR).

Example: df = df.dropna(subset=['Age', 'Salary'])

Imputation with summary statistics (mean/median/mode)

Numeric: mean or median (median robust to outliers).

Categorical: mode (most frequent).

Example:

# Assign the result back; inplace=True on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(df['Age'].median())
df['City'] = df['City'].fillna(df['City'].mode()[0])


Predictive modeling (regression / KNN imputation / MICE)

Fit model to predict missing values using other features (e.g., regress Age on other columns).

KNN imputer uses nearest neighbors to impute. MICE (multiple imputation by chained equations) creates multiple plausible imputed datasets.

Example: from sklearn.impute import KNNImputer; imputer = KNNImputer(n_neighbors=5); X_imputed = imputer.fit_transform(X) (X must be numeric).
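
A small runnable version of the KNN imputation call, as a sketch. The 4×3 matrix is made-up illustration data and n_neighbors=2 is an arbitrary choice for such a tiny example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric feature matrix with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows, with distances computed on the features both rows have
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Observed values pass through unchanged; only the NaN cells are filled. Features on very different scales should be scaled first, since the neighbor search is distance-based.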

Indicator for missingness (helpful when missingness itself is informative)

Create boolean column Age_missing = df['Age'].isnull().astype(int) and then impute Age (with median). This preserves signal that value was missing.
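
The indicator-plus-imputation pattern, sketched on a made-up five-row frame (the column values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0, np.nan],
    "Salary": [30000, 42000, 58000, 61000, 39000],
})

# Record missingness first, because imputation erases it
df["Age_missing"] = df["Age"].isna().astype(int)

# Then fill the gaps with the median of the observed ages
df["Age"] = df["Age"].fillna(df["Age"].median())
print(df)
```

The order matters: building the flag after imputing would yield a column of zeros.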

Which to choose? Depends on cause of missingness (MCAR, MAR, MNAR), proportion missing, and importance of variable.



Question 4 - Imbalanced dataset & two techniques

Imbalanced dataset: Class distribution heavily skewed (e.g., 95% negative, 5% positive). Standard classifiers may be biased toward majority class.

Techniques

Resampling

Oversampling (increase minority class): random oversampling or SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates synthetic minority examples by interpolating between a minority sample and one of its nearest minority-class neighbors. Practical: imblearn.over_sampling.SMOTE()

Undersampling (reduce majority class): random undersampling or more advanced methods (NearMiss). Practical: imblearn.under_sampling.RandomUnderSampler()

Pros/cons: oversampling can overfit, undersampling may lose information; SMOTE reduces overfitting risk vs naive oversampling.
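
To make the interpolation idea concrete, here is a simplified NumPy sketch of SMOTE-style sampling. It is not the imblearn implementation; the function name smote_like, the toy minority points, and k=3 are all assumptions made for illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating minority samples with neighbors.

    Requires k < len(X_min).
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest minority neighbors per row
    # Pick a random base sample and one of its k neighbors for each new row
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # New point = base + lambda * (neighbor - base), with lambda in [0, 1)
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Hypothetical minority-class points in 2D
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [1.1, 1.4], [0.9, 0.7]])
X_synthetic = smote_like(X_minority, n_new=10)
print(X_synthetic.round(2))
```

Each synthetic row lies on the segment between two real minority points, which is why this adds variety that plain duplication does not. Resampling should be applied to the training split only, so validation data keeps the real class distribution.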

Algorithmic approaches / class weighting

Use class weights in the loss function or model APIs (e.g., class_weight='balanced' in sklearn classifiers, or scale_pos_weight in XGBoost). This penalizes misclassification of the minority class more heavily.

Good when you want to keep original data distribution and avoid synthetic data.
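
As a sketch of what those weights look like for the 95%/5% split mentioned above: sklearn's 'balanced' mode uses n_samples / (n_classes * class_count), and a common starting value for XGBoost's scale_pos_weight is the negative-to-positive ratio. The label array here is made up.

```python
import numpy as np

# Hypothetical labels: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)

counts = np.bincount(y)
n_samples, n_classes = len(y), len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_weights = n_samples / (n_classes * counts)

# Binary-classification ratio often used for scale_pos_weight
scale_pos_weight = counts[0] / counts[1]

print("class weights:", class_weights.round(3))
print("scale_pos_weight:", scale_pos_weight)
```

The minority class ends up weighted 19 times more heavily than the majority class in both schemes, so an error on a positive example costs as much as 19 errors on negatives.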

Question 5 - Feature scaling importance; Min-Max vs Standardization

Why scale?

Many ML algorithms rely on distances or gradient magnitudes (KNN, SVM, k-means, PCA, neural networks). Features on different scales distort distances, letting the largest-valued feature dominate, and make gradient-descent training slow or unstable. Scaling makes features comparable.

Min-Max scaling (Normalization)

Formula: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$ → maps to [0,1] (or other range).

Pros: preserves shape of original distribution, bounded range.

Cons: sensitive to outliers (min/max changes if outliers present).

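Standardization (Z-score)

Formula: $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the feature's mean and standard deviation → zero mean, unit variance.

Pros: less sensitive to outliers than min-max (but still affected), good for algorithms that benefit from zero-centered, comparably scaled inputs (e.g., regularized logistic regression, PCA).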

Use when inputs approximate Gaussian or when using regularized models.

Impact examples

KNN/SVM: both are sensitive to feature scales, so scale features before fitting.

Gradient descent: standardization often improves convergence speed.
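
Both scalers can be sketched in a few lines of NumPy. The five-value array, with one deliberate outlier, is made up for illustration.

```python
import numpy as np

# Hypothetical feature with one outlier (the last value)
x = np.array([10.0, 12.0, 11.0, 13.0, 100.0])

# Min-max scaling: maps the smallest value to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the (population) standard deviation
x_std = (x - x.mean()) / x.std()

print("min-max:     ", x_minmax.round(3))
print("standardized:", x_std.round(3))
```

With the outlier present, min-max squeezes the four ordinary values into the bottom few percent of the [0,1] range, which illustrates the sensitivity noted above. In a real pipeline the statistics (min, max, mean, standard deviation) should be computed on the training split only and then reused on validation and test data.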

Question 6 - Label Encoding vs One-Hot Encoding (comparison & when to use)

Label Encoding

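Maps categories to integers (e.g., A→0, B→1, C→2).

Use when the categorical variable is ordinal (there is a natural ordering, e.g., low, medium, high) so the numeric mapping preserves order. Set the mapping explicitly: automatic encoders such as sklearn's LabelEncoder assign integers in sorted (alphabetical) order, which may not match the intended ranking.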

One-Hot Encoding

Creates binary columns for each category (e.g., is_A, is_B, is_C).

Use for nominal variables (no natural order). Avoid implying ordinal relationships.

Cons: increases dimensionality (use sparse representation if many categories). For high-cardinality, consider target encoding or embedding.

When prefer one over other

Ordinal categorical → Label Encoding.

Nominal categorical feeding linear or distance-based models → One-Hot Encoding, since integer codes would imply an order and spacing that do not exist.

For tree-based models (random forest, XGBoost), label encoding can work acceptably because splits only threshold the values, but if categories have no order, many practitioners still prefer one-hot or target encoding.
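
A short pandas sketch of both encodings side by side. The column names, category values, and the "is" prefix are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one ordinal and one nominal column
df = pd.DataFrame({
    "size": ["low", "high", "medium", "low"],      # ordinal
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],  # nominal
})

# Ordinal: explicit mapping so the integers follow the real order
size_order = {"low": 0, "medium": 1, "high": 2}
df["size_encoded"] = df["size"].map(size_order)

# Nominal: one binary column per category
city_dummies = pd.get_dummies(df["city"], prefix="is", dtype=int)

encoded = pd.concat([df[["size_encoded"]], city_dummies], axis=1)
print(encoded)
```

For linear models, passing drop_first=True to pd.get_dummies drops one column and avoids perfectly collinear indicators.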



Question 7 - Google Play Store: Category vs Rating (code + expected output)

In [None]:
# Q7: Google Play Store - category vs rating
import pandas as pd
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/googleplaystore.csv"
df = pd.read_csv(url)

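# Clean rating: coerce to numeric, then blank out values outside the valid 1-5 range
# (guards against malformed rows skewing a category's average)
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df['Rating'] = df['Rating'].where(df['Rating'].between(1, 5))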

# Average rating by category
cat_rating = df.groupby('Category', observed=True)['Rating'].mean().sort_values(ascending=False)

print("Top 10 categories by average rating:")
print(cat_rating.head(10).round(3))

print("\nBottom 10 categories by average rating:")
print(cat_rating.tail(10).round(3))

# Plot (top 10)
top10 = cat_rating.head(10)
plt.figure(figsize=(10,5))
top10.plot(kind='bar')
plt.title("Top 10 Categories by Average Rating")
plt.ylabel("Average Rating")
plt.tight_layout()
plt.show()



Question 8 - Titanic: survival by Pclass and Age groups (code + expected output)

In [None]:
# Q8: Titanic dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/titanic.csv"
t = pd.read_csv(url)

# Ensure types
t['Survived'] = pd.to_numeric(t['Survived'], errors='coerce')
t['Pclass'] = pd.to_numeric(t['Pclass'], errors='coerce')
t['Age'] = pd.to_numeric(t['Age'], errors='coerce')

# a) Survival rate by Pclass
survival_by_pclass = t.groupby('Pclass')['Survived'].mean().sort_index()
print("Survival rate by Pclass (%):")
print((survival_by_pclass * 100).round(2))

# Plot
plt.figure(figsize=(6,4))
survival_by_pclass.plot(kind='bar')
plt.title("Survival Rate by Pclass")
plt.ylabel("Survival rate")
plt.tight_layout()
plt.show()

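# b) Age effect: children (<18) vs adults (>=18)
# Passengers with unknown Age get NaN (not 'Adult'), so groupby/value_counts skip them
age_group = pd.Series(np.where(t['Age'] < 18, 'Child', 'Adult'), index=t.index)
t['AgeGroup'] = age_group.where(t['Age'].notna())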
survival_by_agegroup = t.groupby('AgeGroup')['Survived'].mean()
print("\nSurvival rate by Age group:")
print((survival_by_agegroup * 100).round(2))

print("\nCounts:")
print(t['AgeGroup'].value_counts())

# Plot
plt.figure(figsize=(6,4))
survival_by_agegroup.plot(kind='bar')
plt.title("Survival Rate: Children vs Adults")
plt.ylabel("Survival rate")
plt.tight_layout()
plt.show()



Question 9 - Flight Price Prediction: days left and airline comparison (code + expected output)

In [None]:
# Q9: Flight price analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/flight_price.csv"
f = pd.read_csv(url)

print("Columns:", list(f.columns))
# Attempt to locate a days-left column
candidates = [c for c in f.columns if any(k in c.lower() for k in ['day','days','left','dep','departure'])]
print("Candidates for days-like columns:", candidates)

# Example common column names: 'days_left' or 'days' or 'Days_left'
# If dataset has a column e.g. 'days_left' (numeric):
days_col = None
for c in candidates:
    if pd.api.types.is_numeric_dtype(f[c]):
        days_col = c
        break

if days_col is None:
    # If there is a departure date and booking date, compute days difference
    possible_date_cols = [c for c in f.columns if 'date' in c.lower()]
    print("No direct numeric days-left found. Date columns:", possible_date_cols)
else:
    # Relationship between days left and price: compute median price per days-left
    med_price = f.groupby(days_col)['Price'].median().reset_index().sort_values(by=days_col)
    print(med_price.head(10))
    # Plot median price vs days-left
    plt.figure(figsize=(8,4))
    plt.plot(med_price[days_col], med_price['Price'], marker='o')
    plt.gca().invert_xaxis()  # usually to show days->near departure
    plt.title("Median Price vs Days Left")
    plt.xlabel("Days left until departure")
    plt.ylabel("Median Price")
    plt.tight_layout()
    plt.show()

# b) Compare airlines for a route (e.g., 'Delhi' and 'Mumbai' appear in route columns)
# Find route-like column or source & destination columns:
route_cols = [c for c in f.columns if any(k in c.lower() for k in ['route','from','to','source','destination'])]
print("Potential route columns:", route_cols)

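# Filter rows that mention both Delhi and Mumbai in any column
# (column-wise scan; same result as a row-wise apply but much faster)
text = f.astype(str)
has_delhi = text.apply(lambda col: col.str.contains('Delhi', case=False, na=False)).any(axis=1)
has_mumbai = text.apply(lambda col: col.str.contains('Mumbai', case=False, na=False)).any(axis=1)
mask = has_delhi & has_mumbai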
if mask.any():
    dm = f[mask]
    print("Records for Delhi-Mumbai:", len(dm))
    # find airline column
    airline_col = next((c for c in f.columns if 'air' in c.lower()), None)
    if airline_col:
        median_by_air = dm.groupby(airline_col)['Price'].median().sort_values()
        print("Median price by airline for Delhi-Mumbai:\n", median_by_air)
        median_by_air.plot(kind='bar', figsize=(8,4))
        plt.title("Median Price by Airline for Delhi-Mumbai")
        plt.ylabel("Median Price")
        plt.tight_layout()
        plt.show()
    else:
        print("No airline column detected.")
else:
    print("No explicit Delhi-Mumbai rows found. Inspect route columns manually.")


Question 10 - HR Analytics: attrition drivers & projects relation (code + expected output)

In [None]:
# Q10: HR analytics - correlations & projects vs attrition
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/MasteriNeuron/datasets/main/hr_analytics.csv"
h = pd.read_csv(url)

print("Columns:", list(h.columns))

# Normalize attrition to binary
if 'Attrition' in h.columns:
    h['AttritionB'] = h['Attrition'].map({'Yes':1,'No':0})
else:
    # fallback: if first column is Attrition
    h['AttritionB'] = (h.iloc[:,0].astype(str).str.lower() == 'yes').astype(int)

# Numeric correlations
numeric = h.select_dtypes(include=[np.number]).drop(columns=['AttritionB'], errors='ignore')
corr_with_attr = numeric.corrwith(h['AttritionB']).abs().sort_values(ascending=False)
print("Top features correlated with Attrition:")
print(corr_with_attr.head(10).round(3))

# Visualize a few important factors if present
for col in ['JobSatisfaction','OverTime','MonthlyIncome','TotalWorkingYears','DistanceFromHome','WorkLifeBalance']:
    if col in h.columns:
        if h[col].dtype == 'object':
            agg = h.groupby(col)['AttritionB'].mean().sort_values(ascending=False)
            print(f"\nAttrition rate by {col}:")
            print((agg*100).round(2))
            agg.plot(kind='bar', figsize=(6,3))
            plt.title(f"Attrition rate by {col}")
            plt.tight_layout()
            plt.show()
        else:
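            # bin numeric column into quintiles and show attrition rate
            # (ranking first breaks ties, so the bin labels are rank ranges, not raw values)
            bins = pd.qcut(h[col].rank(method='first'), q=5, duplicates='drop')
            agg = h.groupby(bins, observed=True)['AttritionB'].mean()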
            print(f"\nAttrition by binned {col}:")
            print((agg*100).round(2))
            agg.plot(kind='bar', figsize=(6,3))
            plt.title(f"Attrition by binned {col}")
            plt.tight_layout()
            plt.show()

# b) Are employees with more projects more likely to leave?
# 'project' as a substring already covers names like NumProjects, no_of_projects, num_projects
proj_candidates = [c for c in h.columns if 'project' in c.lower()]
print("Project-like columns detected:", proj_candidates)
if proj_candidates:
    pc = proj_candidates[0]
    grp = h.groupby(pc)['AttritionB'].mean().sort_index()
    print("\nAttrition rate by # projects:")
    print((grp*100).round(2))
    grp.plot(kind='bar', figsize=(6,3))
    plt.title("Attrition rate by Number of Projects")
    plt.tight_layout()
    plt.show()
else:
    print("No explicit project-count column; check dataset for similar fields (e.g., 'NumProjects' or 'No_of_Projects').")
