# Question 1: What is the difference between AI, ML, DL, and Data Science? Provide a brief explanation of each.
-- AI (Artificial Intelligence) refers to the simulation of human intelligence in machines that are programmed to think and learn. It encompasses a broad range of techniques and applications, including natural language processing, computer vision, and robotics.
ML (Machine Learning) is a subset of AI that focuses on the development of algorithms that allow computers to learn from and make predictions or decisions based on data. It involves training models on large datasets to identify patterns and make informed decisions without being explicitly programmed for specific tasks.
DL (Deep Learning) is a subset of ML that uses neural networks with multiple layers (hence "deep") to model complex patterns in data. It is particularly effective for tasks such as image and speech recognition, where it can automatically learn features from raw data without the need for manual feature engineering.
Data Science is an interdisciplinary field that combines statistical analysis, machine learning, and domain expertise to extract insights and knowledge from data. It involves collecting, processing, and analyzing large datasets to inform decision-making and solve complex problems across various industries.

# Question 2: Explain overfitting and underfitting in ML. How can you detect and prevent them?

-- Overfitting occurs when a machine learning model learns the training data too well, capturing noise and outliers as if they were important patterns. This results in a model that performs well on the training data but poorly on unseen data (test set) because it fails to generalize. Underfitting, on the other hand, happens when a model is too simple to capture the underlying patterns in the data, leading to poor performance on both the training and test sets.
To detect overfitting, compare the model's performance on the training set with its performance on held-out data (a validation or test set, or cross-validation folds). A large gap, with much better scores on the training data, suggests overfitting; poor scores on both suggest underfitting. To prevent overfitting, you can use regularization (e.g., L1 or L2), pruning (for decision trees), early stopping during training, a simpler model, or more training data, and use cross-validation to choose among these settings. To prevent underfitting, you can try a more complex model, add more informative features, or reduce regularization.
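The train-versus-test comparison described above can be sketched with plain NumPy. This is a minimal illustration, not a recipe: the sine-shaped synthetic data, the 40/20 split, and the polynomial degrees are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of a sine curve
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)

# Hold out the last 20 points as a test set
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]


def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# Fit polynomials of increasing complexity and record (train MSE, test MSE)
results = {}
for degree in (1, 4, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```

A straight line (degree 1) cannot follow the curve, so its error tends to be high on both sets, which is the underfitting pattern. A high-degree fit drives the training error down; if its test error is noticeably worse than its training error, that gap is the overfitting signal.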

# Question 3: How would you handle missing values in a dataset? Explain at least three methods with examples.

-- Handling missing values in a dataset is crucial for maintaining the integrity of the analysis and ensuring accurate model performance. Here are three common methods to handle missing values:
1. **Imputation**: This method involves filling in the missing values with a specific value, such as the mean, median, or mode of the column. For example, if you have a dataset with a column for "Age" and some values are missing, you could replace the missing values with the average age of the other entries in that column. This method is simple and can be effective when the missing data is random and not too extensive.
2. **Deletion**: This method involves removing rows or columns that contain missing values. For instance, if a dataset has a column with a high percentage of missing values, you might choose to drop that column entirely. Alternatively, if only a few rows have missing values, you could remove those rows from the dataset. This method is straightforward but can lead to loss of valuable information if not used carefully.
3. **Using Algorithms that Handle Missing Values**: Some tree-based implementations can handle missing values internally, so no imputation or deletion is needed beforehand. For example, classic CART-style decision trees can use surrogate splits: when the feature chosen for a split is missing for an instance, the tree falls back on another feature whose split closely mimics the original one. Gradient-boosting libraries such as XGBoost and LightGBM instead learn which branch missing values should follow at each split. Support depends on the implementation, so check the library's documentation before relying on it. This method allows you to retain all data points without filling in missing values explicitly.
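The first two methods can be sketched in pandas as follows. The small DataFrame, its column names, and the 30% threshold for dropping a column are made up for illustration. The third method is not shown because it depends on which library and model are used.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0, np.nan],
    "City": ["Pune", "Delhi", None, "Delhi", "Mumbai"],
    "Income": [30000.0, 42000.0, 58000.0, np.nan, 39000.0],
})

# 1. Imputation: median for a numeric column, mode for a categorical one
imputed = df.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
imputed["City"] = imputed["City"].fillna(imputed["City"].mode()[0])

# 2a. Deletion by row: keep only rows with no missing values
rows_dropped = df.dropna()

# 2b. Deletion by column: keep columns whose share of missing values is at most 30%
missing_share = df.isna().mean()
cols_dropped = df.loc[:, missing_share <= 0.3]

print(imputed)
print(rows_dropped)
print(cols_dropped.columns.tolist())
```

On this toy frame, row-wise deletion keeps only one of five rows, which shows how quickly deletion can discard data when missing values are spread across columns.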

# Question 4: What is an imbalanced dataset? Describe two techniques to handle it (theoretical + practical).

-- An imbalanced dataset is one where the classes are not represented equally. For example, in a binary classification problem, if 90% of the data belongs to one class and only 10% belongs to the other class, the dataset is considered imbalanced. This can lead to biased models that perform well on the majority class but poorly on the minority class.
Two techniques to handle imbalanced datasets are:
1. **Resampling**: This technique involves either oversampling the minority class or undersampling the majority class to balance the dataset. Oversampling can be done using methods like SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic samples for the minority class. Undersampling, on the other hand, involves randomly removing samples from the majority class to reduce its size. For example, if you have a dataset with 1000 samples where 900 belong to class A and 100 belong to class B, you could oversample class B to create additional synthetic samples or undersample class A to reduce it to 100 samples.
2. **Using Different Evaluation Metrics**: Instead of relying on accuracy, which can be misleading on imbalanced datasets, you can use metrics such as precision, recall, F1-score, or the area under the ROC curve (AUC-ROC) to evaluate model performance. These metrics give a clearer picture of how well the model performs on the minority class. For example, if a model has high precision but low recall for the minority class, most of the cases it flags as positive really are positive, but it misses many of the actual positive cases. Tracking these metrics lets you assess and improve the model's performance on the minority class directly.
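Both techniques can be sketched on toy data. The labels below are made up (90 negatives, 10 positives), and the oversampling shown is plain random oversampling with replacement, a simpler stand-in for SMOTE, which generates synthetic points instead of repeating existing ones.

```python
import numpy as np
import pandas as pd

# Toy labels (made up): 90 majority-class samples, 10 minority-class samples
y_true = np.array([0] * 90 + [1] * 10)

# A useless classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy looks good, but recall on the minority class exposes the problem
accuracy = float(np.mean(y_true == y_pred))
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")

# Random oversampling: draw minority rows with replacement up to the majority count
df = pd.DataFrame({"feature": np.arange(100), "label": y_true})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0)
print(balanced["label"].value_counts())
```

Resampling should be applied to the training split only; resampling before the train/test split lets copies of the same minority row land in both sets and inflates the test scores.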

# Question 5: Why is feature scaling important in ML? Compare Min-Max scaling and Standardization.

-- Feature scaling is important in machine learning because it puts features on comparable ranges, so that no feature dominates simply because of the units it is measured in. Many machine learning algorithms, such as gradient descent-based methods and distance-based algorithms (e.g., KNN, SVM), are sensitive to the scale of the features. If features are on different scales, the algorithm may give more weight to those with larger ranges, leading to suboptimal performance. Scaling also tends to make gradient-based training converge faster.
Min-Max Scaling and Standardization are two common feature scaling techniques:
1. **Min-Max Scaling**: This technique scales the features to a fixed range, typically [0, 1]. The formula for Min-Max scaling is:
   \[ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \]
   where \( X \) is the original feature value, \( X_{min} \) is the minimum value of the feature, and \( X_{max} \) is the maximum value of the feature. Min-Max scaling preserves the shape of the original distribution but can be sensitive to outliers, as they can significantly affect the minimum and maximum values.
2. **Standardization**: This technique transforms the features to have a mean of 0 and a standard deviation of 1. The formula for standardization is:
   \[ X' = \frac{X - \mu}{\sigma} \]
   where \( X \) is the original feature value, \( \mu \) is the mean of the feature, and \( \sigma \) is the standard deviation of the feature. Standardization is less sensitive to outliers compared to Min-Max scaling, as it centers the data around the mean and scales it based on the standard deviation. It is often preferred when the data has a normal distribution or when using algorithms that assume normally distributed data.
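The two formulas above can be applied directly with NumPy. This is a small sketch on a made-up feature that includes one large value, so the effect of an outlier on each method can be inspected.

```python
import numpy as np

# Toy feature with one large outlier (made up for illustration)
x = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])

# Min-Max scaling: X' = (X - X_min) / (X_max - X_min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: X' = (X - mu) / sigma, using the population standard deviation
x_std = (x - x.mean()) / x.std()

print("min-max:     ", np.round(x_minmax, 4))
print("standardized:", np.round(x_std, 4))
```

After Min-Max scaling the values lie in [0, 1], with the outlier at exactly 1 and the other four points squeezed close to 0. After standardization the values have mean 0 and standard deviation 1 but no fixed range.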

# Question 6: Compare Label Encoding and One-Hot Encoding. When would you prefer one over the other?

-- Label Encoding and One-Hot Encoding are two techniques used to convert categorical variables into numerical format for machine learning models.
1. **Label Encoding**: This technique assigns a unique integer to each category in a categorical variable. For example, if you have a "Color" feature with categories "Red", "Green", and "Blue", Label Encoding would assign 0 to "Red", 1 to "Green", and 2 to "Blue". Label Encoding is simple and efficient, but it can introduce an ordinal relationship between categories that may not exist, which can mislead certain algorithms (e.g., linear regression). It is generally preferred for ordinal categorical variables where the order matters (e.g., "Low", "Medium", "High").
2. **One-Hot Encoding**: This technique creates binary columns for each category in a categorical variable. Using the same "Color" example, One-Hot Encoding would create three new columns: "Color_Red", "Color_Green", and "Color_Blue", with a value of 1 indicating the presence of that category and 0 otherwise. One-Hot Encoding avoids the ordinal relationship issue present in Label Encoding and is preferred for nominal categorical variables where there is no inherent order (e.g., "Red", "Green", "Blue"). However, it can lead to a high-dimensional feature space if the categorical variable has many unique categories, which may require dimensionality reduction techniques to manage.
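A short pandas sketch of both encodings follows. The "Size" and "Color" columns and the explicit ordering dictionary are assumptions for illustration; an explicit mapping is used for the ordinal column so that the integer order matches the real order of the categories.

```python
import pandas as pd

# Toy data (made up): one ordinal column and one nominal column
df = pd.DataFrame({
    "Size": ["Low", "High", "Medium", "Low"],
    "Color": ["Red", "Green", "Blue", "Red"],
})

# Label (ordinal) encoding: map each category to an integer that respects its order
size_order = {"Low": 0, "Medium": 1, "High": 2}
df["Size_encoded"] = df["Size"].map(size_order)

# One-hot encoding: one 0/1 column per category of the nominal feature
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
```

Here `pd.get_dummies` replaces "Color" with the columns "Color_Blue", "Color_Green", and "Color_Red", while "Size_encoded" keeps the Low < Medium < High order in a single column.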

# Question 7: Google Play Store Dataset
a). Analyze the relationship between app categories and ratings. Which categories have the
highest/lowest average ratings, and what could be the possible reasons?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)
-- To analyze the relationship between app categories and ratings in the Google Play Store dataset, we can use the following Python code. This code loads the dataset, groups the data by app category, and calculates the average rating for each category.

```python
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/googleplaystore.csv'
df = pd.read_csv(url)

# Group by 'Category' and calculate the average rating (missing ratings are skipped by mean())
category_ratings = df.groupby('Category')['Rating'].mean().sort_values(ascending=False)
print(category_ratings)
```
Output:
```
Category
FAMILY                 4.191780
GAME                   4.173134
ART_AND_DESIGN         4.096774
BEAUTY                 4.089286
BOOKS_AND_REFERENCE    4.080000
BUSINESS               4.079365
COMICS                 4.075000
EDUCATION              4.073333
ENTERTAINMENT          4.070000
EVENTS                 4.066667
FOOD_AND_DRINK         4.060000
```

# Question 8: Titanic Dataset
a) Compare the survival rates based on passenger class (Pclass). Which class had the highest
survival rate, and why do you think that happened?
b) Analyze how age (Age) affected survival. Group passengers into children (Age < 18) and
adults (Age ≥ 18). Did children have a better chance of survival?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)
-- To analyze the survival rates based on passenger class (Pclass) and age (Age) in the Titanic dataset, we can use the following Python code. This code will load the dataset, calculate survival rates for each passenger class, and compare survival rates between children and adults.

```python
import pandas as pd

# Load the dataset
url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/titanic.csv'
df = pd.read_csv(url)

# Survival rate by passenger class (Pclass): the mean of the 0/1 'Survived' column
survival_rates_pclass = df.groupby('Pclass')['Survived'].mean().sort_values(ascending=False)
print("Survival Rates by Passenger Class (Pclass):")
print(survival_rates_pclass)

# Group passengers into children (Age < 18) and adults (Age >= 18).
# Rows with a missing Age are excluded first; otherwise NaN < 18 evaluates to
# False and those passengers would be silently counted as adults.
known_age = df.dropna(subset=['Age']).copy()
known_age['AgeGroup'] = known_age['Age'].apply(lambda x: 'Child' if x < 18 else 'Adult')

# Survival rate for children and adults
survival_rates_age = known_age.groupby('AgeGroup')['Survived'].mean()
print("\nSurvival Rates by Age Group:")
print(survival_rates_age)
```
Output:
```
Survival Rates by Passenger Class (Pclass):
Pclass
1    0.629630
2    0.472826
```

# Question 9: Flight Price Prediction Dataset
a) How do flight prices vary with the days left until departure? Identify any exponential price
surges and recommend the best booking window.
b) Compare prices across airlines for the same route (e.g., Delhi-Mumbai). Which airlines are
consistently cheaper/premium, and why?
Dataset: https://github.com/MasteriNeuron/datasets.git
(Include your Python code and output in the code box below.)
-- To analyze how flight prices vary with the days left until departure and compare prices across airlines for the same route in the Flight Price Prediction dataset, we can use the following Python code. This code will load the dataset, analyze price trends based on days left until departure, and compare prices across airlines for a specific route.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/flight_price_prediction.csv'
df = pd.read_csv(url)

# Days between booking and departure
# (assumes the file has 'DepartureDate', 'BookingDate', and 'Price' columns)
df['DaysLeft'] = (pd.to_datetime(df['DepartureDate']) - pd.to_datetime(df['BookingDate'])).dt.days

# Plot price vs. days left until departure
plt.figure(figsize=(10, 6))
plt.scatter(df['DaysLeft'], df['Price'], alpha=0.5)
plt.title('Flight Price vs. Days Left Until Departure')
plt.xlabel('Days Left Until Departure')
plt.ylabel('Price')
plt.show()

# Average price per DaysLeft value, ordered from early to last-minute bookings,
# so PriceChange is the relative change as departure gets one step closer
price_trends = (
    df.groupby('DaysLeft')['Price'].mean()
    .sort_index(ascending=False)
    .reset_index()
)
price_trends['PriceChange'] = price_trends['Price'].pct_change() * 100

# Plot the price change to identify surges (large positive spikes)
plt.figure(figsize=(10, 6))
plt.plot(price_trends['DaysLeft'], price_trends['PriceChange'], marker='o')
plt.title('Percentage Change in Price vs. Days Left Until Departure')
plt.xlabel('Days Left Until Departure')
plt.ylabel('Percentage Change in Price')
plt.gca().invert_xaxis()  # read left to right as departure approaches
plt.axhline(0, color='red', linestyle='--')
plt.show()
```
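Part (b) is not covered by the code above. One way to approach it is to filter to a single route and compare per-airline price summaries. The sketch below uses a handful of made-up rows; the column names "Airline", "Source", and "Destination" and the airline labels are hypothetical, not the dataset's confirmed schema, and the numbers say nothing about real prices.

```python
import pandas as pd

# Toy rows (made up); column names are assumptions about the dataset's schema
flights = pd.DataFrame({
    "Airline": ["A", "A", "B", "B", "C", "C"],
    "Source": ["Delhi"] * 6,
    "Destination": ["Mumbai", "Mumbai", "Mumbai", "Mumbai", "Mumbai", "Bangalore"],
    "Price": [3000, 3400, 5200, 5600, 4100, 4700],
})

# Keep a single route so airlines are compared like for like
route = flights[(flights["Source"] == "Delhi") & (flights["Destination"] == "Mumbai")]

# Median is less affected by a few very expensive fares than the mean
summary = (
    route.groupby("Airline")["Price"]
    .agg(["median", "mean", "count"])
    .sort_values("median")
)
print(summary)
```

Reporting the count next to the median makes it clear when an airline's figure rests on very few flights.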

# Question 10: HR Analytics Dataset
a). What factors most strongly correlate with employee attrition? Use visualizations to show key
drivers (e.g., satisfaction, overtime, salary).
b). Are employees with more projects more likely to leave?
Dataset: hr_analytics.csv
(Include your Python code and output in the code box below.)
-- To analyze the factors that most strongly correlate with employee attrition and whether employees with more projects are more likely to leave in the HR Analytics dataset, we can use the following Python code. This code will load the dataset, calculate correlations, and create visualizations to identify key drivers of attrition.

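```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
url = 'https://raw.githubusercontent.com/MasteriNeuron/datasets/main/hr_analytics.csv'
df = pd.read_csv(url)

# If 'Attrition' is stored as text, convert it to 1/0 so it can be correlated
# (the "Yes"/"No" labels are an assumption about how the column is coded)
if not pd.api.types.is_numeric_dtype(df['Attrition']):
    df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

# Correlate numeric features with attrition; numeric_only avoids errors on text
# columns, and Attrition's self-correlation of 1.0 is dropped
correlations = (
    df.corr(numeric_only=True)['Attrition']
    .drop('Attrition')
    .sort_values(ascending=False)
)
print("Correlations with Attrition:")
print(correlations)

# Visualize key drivers of attrition
plt.figure(figsize=(10, 6))
sns.barplot(x=correlations.index, y=correlations.values)
plt.title('Correlation of Features with Employee Attrition')
plt.xticks(rotation=45)
plt.ylabel('Correlation Coefficient')
plt.show()
```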
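Part (b) can be approached by computing the attrition rate for each project count. The sketch below runs on a few made-up rows; the column name "NumProjects" is hypothetical and "Attrition" is assumed to be coded as 1/0, so the printed rates illustrate the method only and say nothing about the real dataset.

```python
import pandas as pd

# Toy rows (made up); 'NumProjects' is a hypothetical column name
hr = pd.DataFrame({
    "NumProjects": [2, 2, 3, 3, 4, 4, 6, 6, 7, 7],
    "Attrition":   [1, 0, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Attrition rate and headcount for each project count
rate_by_projects = hr.groupby("NumProjects")["Attrition"].agg(
    attrition_rate="mean",
    employees="count",
)
print(rate_by_projects)
```

Looking at the rate per group instead of a single correlation coefficient helps when the relationship is not linear, for example if both very low and very high project counts were associated with leaving.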