# Q1. What are the key features of the wine quality data set? Discuss the importance of each feature in
# predicting the quality of wine.

"""
Key features of the wine quality dataset:
- Fixed acidity: The non-volatile acids in wine (mainly tartaric acid), which shape tartness and help preservation.
- Volatile acidity: Mostly acetic acid; higher levels give an unpleasant, vinegar-like smell and tend to lower quality.
- Citric acid: Present in small amounts; adds freshness to the flavor and aids preservation.
- Residual sugar: The sugar left after fermentation; it determines sweetness and influences the taste profile.
- Chlorides: Measures salt content and can affect wine taste.
- Free sulfur dioxide: The unbound SO2 that acts as a preservative by limiting microbial growth and oxidation, which influences shelf life.
- Total sulfur dioxide: Free plus bound SO2; high levels can become noticeable in the smell and taste.
- Density: Depends mainly on sugar and alcohol content; more sugar raises density, more alcohol lowers it.
- pH: Describes how acidic the wine is; impacts taste and stability.
- Sulphates: An additive that contributes to SO2 levels and acts as an antimicrobial and antioxidant.
- Alcohol: Affects the flavor and body of the wine.
- Quality: The target variable, representing wine quality on a scale from 0 to 10.

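No single feature determines quality. Together the features describe acidity, sweetness, preservation, and alcohol content, and a model predicts quality from their combination; alcohol and volatile acidity tend to show the strongest correlations with the quality score.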
"""
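
# One way to back up the discussion above is to rank features by their
# correlation with quality. The sketch below is only illustrative: it uses a
# small synthetic table (NOT real wine measurements) in which quality is
# constructed to rise with alcohol and fall with volatile acidity. The name
# `demo` and all numeric parameters are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (hypothetical values, not real measurements)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, n),
    "volatile acidity": rng.normal(0.5, 0.15, n),
    "pH": rng.normal(3.3, 0.15, n),  # deliberately unrelated to quality here
})
demo["quality"] = (
    5.5
    + 0.8 * (demo["alcohol"] - 10.5)
    - 3.0 * (demo["volatile acidity"] - 0.5)
    + rng.normal(0, 0.5, n)
)

# Pearson correlation of each feature with the target
corr_with_quality = demo.corr()["quality"].drop("quality")

# Rank features by the absolute strength of that correlation
ranking = corr_with_quality.abs().sort_values(ascending=False)
print(corr_with_quality.loc[ranking.index])
```

# On the real data set, the same two lines (`corr()` and `sort_values`) can be
# applied to `wine_data` once it is loaded.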

# Q2. How did you handle missing data in the wine quality data set during the feature engineering process?
# Discuss the advantages and disadvantages of different imputation techniques.

"""
Techniques for handling missing data:
1. **Mean/Median Imputation**: Replace missing values with the mean or median of the respective feature.
   - **Advantages**: Simple and quick, keeps dataset size intact.
   - **Disadvantages**: Shrinks the feature's variance and weakens its correlations with other features, especially when many values are missing.
2. **KNN Imputation**: Use k-nearest neighbors to impute missing values based on the nearest samples.
   - **Advantages**: More sophisticated, captures relationships between features.
   - **Disadvantages**: Computationally expensive on large datasets, sensitive to feature scaling, and may introduce bias if k is poorly chosen.
3. **Regression Imputation**: Predict missing values using regression models on the other features.
   - **Advantages**: Leverages existing data patterns for imputation.
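   - **Disadvantages**: More complex; imputed values fall exactly on the fitted line, which can overstate relationships between features, and the method adds little if those relationships are weak.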
"""
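
# The three techniques above can be compared side by side with scikit-learn.
# This is a minimal sketch on a toy matrix (hypothetical values, not the wine
# data). `IterativeImputer` is used here as a stand-in for regression
# imputation; it is still marked experimental in scikit-learn, hence the
# explicit enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy matrix with correlated columns and two missing entries (hypothetical data)
X_demo = np.array([
    [1.0, 2.1, 0.5],
    [2.0, np.nan, 1.1],
    [3.0, 6.2, 1.4],
    [4.0, 7.9, 2.1],
    [np.nan, 10.1, 2.4],
    [6.0, 12.0, 3.0],
])

imputers = {
    "median": SimpleImputer(strategy="median"),      # column-wise median
    "knn": KNNImputer(n_neighbors=2),                # average of the 2 nearest rows
    "regression": IterativeImputer(random_state=0),  # model each column from the others
}

# Fit each imputer and fill the gaps
imputed = {name: imp.fit_transform(X_demo) for name, imp in imputers.items()}

# Show the two filled-in cells for each technique
for name, values in imputed.items():
    print(name, round(values[1, 1], 2), round(values[4, 0], 2))
```

# The median imputer ignores the other columns, while the KNN and regression
# variants use them, so their filled-in values follow the trend in the data.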

# Q3. What are the key factors that affect students' performance in exams? How would you go about
# analyzing these factors using statistical techniques?

"""
Key factors affecting student performance:
- Study hours and time management.
- Attendance and participation in classes.
- Social environment and family background.
- Availability of learning resources.
- Psychological factors such as stress or motivation.

To analyze these factors:
- **Descriptive Statistics**: To summarize data and identify trends.
- **Correlation Analysis**: To check the relationships between factors and performance.
- **Regression Analysis**: To model the relationship between student performance and other factors.
"""
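
# The three analysis steps above can be sketched as follows. The data is
# synthetic and the column names (`study_hours`, `attendance_pct`,
# `exam_score`) are assumptions made for illustration, not columns of a
# specific data set. The regression is an ordinary least-squares fit done
# with NumPy.

```python
import numpy as np
import pandas as pd

# Synthetic student records (hypothetical; scores are generated from a known formula)
rng = np.random.default_rng(42)
n = 150
students = pd.DataFrame({
    "study_hours": rng.uniform(0, 10, n),
    "attendance_pct": rng.uniform(50, 100, n),
})
students["exam_score"] = (
    30
    + 4.0 * students["study_hours"]
    + 0.3 * students["attendance_pct"]
    + rng.normal(0, 5, n)
)

# 1. Descriptive statistics: mean, spread, and quartiles per column
summary = students.describe()

# 2. Correlation analysis: linear association of each factor with the score
correlations = students.corr()["exam_score"].drop("exam_score")

# 3. Regression analysis: least-squares fit of score on both factors
A = np.column_stack([
    np.ones(n),
    students[["study_hours", "attendance_pct"]].to_numpy(),
])
coef, *_ = np.linalg.lstsq(A, students["exam_score"].to_numpy(), rcond=None)
intercept, beta_hours, beta_attendance = coef

print(summary.loc[["mean", "std"]])
print(correlations)
print(f"score ~ {intercept:.1f} + {beta_hours:.2f}*hours + {beta_attendance:.2f}*attendance")
```

# Because the scores were generated from known coefficients, the fitted slopes
# should land near 4.0 and 0.3; with real data they would be the estimates of
# interest.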

# Q4. Describe the process of feature engineering in the context of the student performance data set. How
# did you select and transform the variables for your model?

"""
Feature engineering process for student performance:
1. **Data Cleaning**: Handling missing data, removing outliers, and addressing inconsistencies.
2. **Feature Selection**: Choose the most relevant features based on domain knowledge or statistical tests.
3. **Feature Transformation**:
   - **Normalization/Scaling**: To ensure features have similar magnitudes (e.g., using Min-Max scaling or standardization).
   - **Encoding**: Convert categorical variables to numerical format. Use one-hot encoding for unordered categories (e.g., gender) and ordinal or label encoding only where the categories have a natural order (e.g., education level).
   - **Binning**: Group continuous features into categories, such as age ranges.
"""
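
# A minimal pandas sketch of the transformations listed above. The table and
# its column names (`age`, `gender`, `parent_education`, `study_hours`) are
# hypothetical, as are the chosen age bins and the education ordering.

```python
import pandas as pd

# Tiny hypothetical student table
students = pd.DataFrame({
    "age": [15, 17, 16, 19, 22],
    "gender": ["F", "M", "F", "M", "F"],
    "parent_education": ["high school", "bachelor", "master", "high school", "bachelor"],
    "study_hours": [2.0, 5.5, 3.0, 8.0, 1.0],
})

# Min-Max scaling: map study_hours onto [0, 1]
hours = students["study_hours"]
students["study_hours_scaled"] = (hours - hours.min()) / (hours.max() - hours.min())

# Ordinal encoding: integer codes that respect an explicit category order
education_order = ["high school", "bachelor", "master"]
students["parent_education_level"] = pd.Categorical(
    students["parent_education"], categories=education_order, ordered=True
).codes

# One-hot encoding for an unordered category
students = pd.get_dummies(students, columns=["gender"], dtype=int)

# Binning: group ages into ranges (intervals are closed on the right)
students["age_group"] = pd.cut(
    students["age"], bins=[0, 16, 18, 120], labels=["<=16", "17-18", "19+"]
)

print(students)
```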

# Q5. Load the wine quality data set and perform exploratory data analysis (EDA) to identify the distribution
# of each feature. Which feature(s) exhibit non-normality, and what transformations could be applied to
# these features to improve normality?

"""
EDA steps:
1. **Load the dataset** and inspect the first few rows and columns.
2. **Visualizations**: Use histograms or box plots to check the distribution of each feature.
3. **Statistical Tests**: Apply tests like Shapiro-Wilk or Anderson-Darling for normality testing.

Transformation options:
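- For right-skewed, non-negative features, apply a **log transformation** (log1p if zeros occur) or a **square root transformation** to reduce skewness; **Box-Cox** or **Yeo-Johnson** power transforms are more general options.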
"""
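
# The normality check and transformation described above can be sketched like
# this. The feature is a synthetic right-skewed sample (hypothetical, standing
# in for a skewed column such as residual sugar), and SciPy's `shapiro` and
# `skew` functions are used for the test and the skewness measure.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed feature (hypothetical stand-in for a skewed wine column)
rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=1.0, sigma=0.8, size=500)

# Log transform; log1p also handles zeros safely
transformed = np.log1p(skewed)

# Compare skewness and the Shapiro-Wilk statistic before and after
for label, values in [("raw", skewed), ("log1p", transformed)]:
    w_stat, p_value = stats.shapiro(values)
    print(f"{label}: skew={stats.skew(values):.2f}, "
          f"Shapiro-Wilk W={w_stat:.3f}, p={p_value:.3g}")
```

# A W statistic closer to 1 and a skewness closer to 0 indicate a distribution
# closer to normal. With large samples the test rejects normality for even
# small deviations, so it is best read alongside a histogram.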

# Q6. Using the wine quality data set, perform principal component analysis (PCA) to reduce the number of
# features. What is the minimum number of principal components required to explain 90% of the variance in
# the data?

# Steps for PCA:
# 1. **Standardize the dataset**: Ensure all features are on the same scale (e.g., using StandardScaler).
# 2. **Apply PCA**: Use the PCA function in scikit-learn to reduce dimensionality.
# 3. **Analyze explained variance**: Plot a cumulative explained variance graph to determine the number of components needed to explain 90% of the variance.

# Code:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset (the original UCI files are semicolon-delimited; pass sep=';' if using them)
wine_data = pd.read_csv('winequality.csv')

# Keep only the features; PCA is unsupervised, so the target column is dropped
X = wine_data.drop('quality', axis=1)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA()
pca.fit(X_scaled)

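# Cumulative explained variance, computed once and reused below
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance with the 90% threshold marked
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.axhline(y=0.90, color='gray', linestyle='--', label='90% threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.legend()
plt.show()

# First index where the cumulative variance reaches 90%, converted to a 1-based count
n_components = int(np.argmax(cumulative_variance >= 0.90)) + 1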
print(f"Minimum number of components to explain 90% of the variance: {n_components}")
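
# As an alternative to computing the cutoff by hand, scikit-learn's PCA accepts
# a float between 0 and 1 for `n_components` and keeps just enough components
# to exceed that share of the variance. The sketch below uses synthetic data
# (hypothetical, not the wine data set) so that it runs on its own; the names
# `X_demo` and `pca_90` are assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 observed features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(300, 8))

# Standardize, as in the steps above
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float n_components asks PCA to retain enough components for that variance share
pca_90 = PCA(n_components=0.90)
X_reduced = pca_90.fit_transform(X_demo_scaled)

print(f"Components kept: {pca_90.n_components_}")
print(f"Variance explained: {pca_90.explained_variance_ratio_.sum():.3f}")
```

# On the wine data, replacing `X_demo_scaled` with `X_scaled` should give the
# same count as the manual calculation, which makes it a quick cross-check.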
