In [1]:
#### Q1. What are the key features of the wine quality data set? Discuss the importance of each feature in predicting the quality of wine.
###The Wine Quality dataset is a popular dataset used for regression and classification tasks, containing physicochemical characteristics of red and white wines. The key features of the Wine Quality dataset include:

#1. Fixed Acidity: Measures the acidity level, which affects the wine's taste and quality.
#2. Volatile Acidity: Reflects the wine's acidity and can indicate spoilage or poor winemaking practices.
#3. Citric Acid: Contributes to the wine's acidity and flavor.
#4. Residual Sugar: Affects the wine's sweetness and quality.
#5. Chlorides: Influences the wine's acidity and can impact the fermentation process.
#6. Free Sulfur Dioxide: Acts as an antioxidant and preservative, affecting the wine's quality and shelf life.
#7. Total Sulfur Dioxide: Includes both free and bound forms, impacting the wine's quality and stability.
#8. Density: Relates to the wine's alcohol content and can affect its quality.
#9. pH: Measures the wine's acidity level, influencing its quality and microbial stability.
#10. Sulfates: Contributes to the wine's bitterness and can impact its quality.
#11. Alcohol: Affects the wine's flavor, quality, and overall character.
#12. Quality: The target variable, scored by wine experts (0-10), representing the wine's overall quality.



In [5]:
###### Q2. How did you handle missing data in the wine quality data set during the feature engineering process?Discuss the advantages and disadvantages of different imputation techniques.

#When handling missing data in the Wine Quality dataset during feature engineering, I employed various imputation techniques to minimize the impact of missing values. Here are some common methods and their advantages and disadvantages:

#1. Mean/Median Imputation:
#    - Advantage: Simple and easy to implement.
#    - Disadvantage: Can mask the true distribution of data, leading to biased models.
#2. Regression Imputation:
#    - Advantage: Uses the relationship between features to estimate missing values.
#    - Disadvantage: Can be time-consuming and may not work well with complex relationships.
#3. K-Nearest Neighbors (KNN) Imputation:
#    - Advantage: Considers the similarity between data points to estimate missing values.
#    - Disadvantage: Computationally expensive and may not perform well with high-dimensional data.
#4. Multiple Imputation:
#    - Advantage: Accounts for uncertainty in imputed values, providing more accurate estimates.
#    - Disadvantage: Can be computationally intensive and may require careful tuning.
#5. Listwise Deletion:
#    - Advantage: Simple and fast, removing rows with missing values.
#    - Disadvantage: Can lead to biased estimates and reduced sample size.
#6. Forward/Backward Fill:
#    - Advantage: Preserves the temporal relationships in time-series data.
#    - Disadvantage: May not be suitable for non-time-series data and can propagate errors.
import pandas as pd
import numpy as np

# Load the Wine Quality dataset
wine_df = pd.read_csv('winequality-red.csv')

# Check for missing values
print(wine_df.isnull().sum())

# Mean imputation for numerical features
wine_df['pH'] = wine_df['pH'].fillna(wine_df['pH'].mean())
wine_df['sulfates'] = wine_df['sulfates'].fillna(wine_df['sulfates'].mean())

# Median imputation for categorical features
wine_df['quality'] = wine_df['quality'].fillna(wine_df['quality'].median())

# K-Nearest Neighbors (KNN) imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
wine_df[['pH', 'sulfates']] = imputer.fit_transform(wine_df[['pH', 'sulfates']])

# Multiple imputation (using MICE)
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=42)
wine_df[['pH', 'sulfates']] = imputer.fit_transform(wine_df[['pH', 'sulfates']])



fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


KeyError: 'sulfates'

In [None]:
#### Q3. What are the key factors that affect students' performance in exams? How would you go about analyzing these factors using statistical techniques?

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm

# Load the Wine Quality dataset
wine_df = pd.read_csv('winequality-red.csv')

# Descriptive statistics
print(wine_df.describe())

# Correlation analysis
corr_matrix = wine_df.corr()
print(corr_matrix)

# Scatter plots to visualize relationships
import seaborn as sns
sns.pairplot(wine_df)

# Regression analysis (predicting quality using pH and sulfates)
X = wine_df[['pH', 'sulfates']]
y = wine_df['quality']
X = sm.add_constant(X)  # add intercept
model = sm.OLS(y, X).fit()
print(model.summary())

# Hypothesis testing (testing the significance of pH and sulfates)
print(stats.ttest_ind(wine_df['pH'], wine_df['sulphates']))

# Principal Component Analysis (PCA) for dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
wine_df_pca = pca.fit_transform(wine_df.drop('quality', axis=1))
print(wine_df_pca)

# Clustering analysis (K-Means clustering)
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
wine_df['cluster'] = kmeans.fit_predict(wine_df.drop('quality', axis=1))
print(wine_df['cluster'])



       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

KeyError: "['sulfates'] not in index"

In [8]:
wine_df

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5
