This project performs an exploratory data analysis of the wine quality dataset to understand which physicochemical factors influence wine quality.
- Data Source: Wine Quality Dataset
- Data Size: 1,143 records
- Data Type: Wine physicochemical test data
- Target Variable: quality (quality score, range 3-8)
The dataset contains the following 11 input variables (physicochemical test data):
- fixed acidity - Fixed acidity
- volatile acidity - Volatile acidity
- citric acid - Citric acid
- residual sugar - Residual sugar
- chlorides - Chlorides
- free sulfur dioxide - Free sulfur dioxide
- total sulfur dioxide - Total sulfur dioxide
- density - Density
- pH - pH value
- sulphates - Sulphates
- alcohol - Alcohol content
- Python 3.7+
- Dependencies:
  - pandas>=1.3.0
  - numpy>=1.21.0
  - matplotlib>=3.4.0
  - seaborn>=0.11.0
  - scikit-learn>=1.0.0
  - jupyter>=1.0.0
├── WineQT.csv # Original dataset
├── wine_quality_analysis.ipynb # Data exploration analysis notebook (Chinese)
├── wine_quality_analysis_en.ipynb # Data exploration analysis notebook (English)
├── README.md # Project documentation (Chinese)
├── README_EN.md # Project documentation (English)
├── requirements.txt # Dependencies list
- Dataset shape and basic statistics
- Data type checking
- Missing values and duplicate analysis
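
The checks listed above can be sketched with pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the `summarize` helper and the toy values are assumptions made for the example, and in practice the frame would come from `pd.read_csv("WineQT.csv")`.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Collect shape, dtypes, missing-value count and duplicate-row count."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": int(df.isna().sum().sum()),      # total missing cells
        "duplicates": int(df.duplicated().sum()),   # rows identical to an earlier row
    }

# Toy frame with made-up values (hypothetical, not taken from the dataset).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.20, None],
    "quality": [5, 5, 5, 6],
})

summary = summarize(toy)
print(summary)
print(toy.describe())  # basic statistics per numeric column
```

On the real data, a `missing` and `duplicates` count of zero would correspond to the "no missing values" and "no duplicate records" findings.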
- Target variable distribution (bar chart)
- Feature distributions (histograms)
- Outlier detection (box plots)
- Feature correlation analysis (heatmap)
- Feature-target relationship (scatter plots)
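
As a rough sketch of two of the plots listed above (the target bar chart and the correlation heatmap), the following uses plain matplotlib. The function name `plot_overview` and the toy values are assumptions for illustration; the notebook may well use seaborn instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def plot_overview(df: pd.DataFrame, target: str = "quality"):
    """Draw a bar chart of the target and a heatmap of feature correlations."""
    fig, (ax_bar, ax_heat) = plt.subplots(1, 2, figsize=(11, 4))

    # Bar chart: number of records per quality score.
    counts = df[target].value_counts().sort_index()
    ax_bar.bar([str(v) for v in counts.index], counts.values)
    ax_bar.set_xlabel(target)
    ax_bar.set_ylabel("count")

    # Heatmap: Pearson correlation matrix of the numeric columns.
    corr = df.select_dtypes("number").corr()
    im = ax_heat.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
    ticks = range(len(corr.columns))
    ax_heat.set_xticks(ticks)
    ax_heat.set_xticklabels(corr.columns, rotation=90)
    ax_heat.set_yticks(ticks)
    ax_heat.set_yticklabels(corr.columns)
    fig.colorbar(im, ax=ax_heat)

    fig.tight_layout()
    return fig

# Toy frame with made-up values (hypothetical, not taken from the dataset).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
    "volatile acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
    "quality": [5, 5, 5, 6, 6],
})
fig = plot_overview(toy)
# fig.savefig("overview.png") would write the figure to disk.
```

Histograms and box plots per feature follow the same pattern with `ax.hist` and `ax.boxplot`.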
- Data completeness check
- Data imbalance analysis
- Outlier identification
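
One simple way to quantify the imbalance check is to tabulate the count and share of each quality score and compare the largest class with the smallest. The sketch below assumes this ratio as the imbalance measure; the helper name `class_balance` and the toy scores are illustrative only.

```python
import pandas as pd

def class_balance(y: pd.Series):
    """Return per-class counts and shares, plus the max/min class-size ratio."""
    counts = y.value_counts().sort_index()
    share = counts / counts.sum()
    table = pd.DataFrame({"count": counts, "share": share})
    ratio = counts.max() / counts.min()  # 1.0 would mean perfectly balanced classes
    return table, ratio

# Made-up quality scores (hypothetical, not taken from the dataset).
quality = pd.Series([5, 5, 5, 6, 6, 7], name="quality")
table, ratio = class_balance(quality)
print(table)
print(f"imbalance ratio: {ratio:.1f}")
```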
- No missing values
- No duplicate records
- Class imbalance: quality scores are concentrated at 5 and 6
- Some features contain outliers
- alcohol shows a positive correlation with the quality score
- volatile acidity shows a negative correlation with the quality score
- citric acid shows a positive correlation with the quality score
- Most features follow a normal or near-normal distribution
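
Correlation findings like the ones above can be reproduced by ranking each feature's correlation with `quality`. The sketch below assumes Pearson correlation (the pandas default); the helper name `target_correlations` and the toy values are illustrative and do not reproduce the project's numbers.

```python
import pandas as pd

def target_correlations(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each numeric feature with the target, highest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.sort_values(ascending=False)

# Toy frame with made-up values (hypothetical, not taken from the dataset).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
    "volatile acidity": [0.9, 0.7, 0.6, 0.4, 0.3],
    "quality": [4, 5, 5, 6, 7],
})

ranked = target_correlations(toy)
print(ranked)

# Skewness near 0 is one rough indicator of a near-normal distribution.
print(toy.drop(columns="quality").skew())
```

Because `quality` is an ordinal score, `df.corr(method="spearman")` is a reasonable alternative to compare against.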
- Outlier Treatment: Use the IQR or Z-score method to detect and handle outliers
- Feature Scaling: Standardize or normalize numerical features
- Data Balancing: Consider using SMOTE or other resampling techniques for imbalanced data
- Feature Engineering: Select important features based on correlation analysis
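
The first two recommendations could be sketched as below, using pandas only. This is one possible treatment under stated assumptions: outliers are clipped to the IQR fences rather than dropped, the multiplier `k=1.5` is the conventional default, and the helper names and toy values are hypothetical. SMOTE is left out because it requires the separate imbalanced-learn package, which is not in the dependency list.

```python
import pandas as pd

def clip_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest fence."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Z-score scaling: zero mean and unit (population) standard deviation per column."""
    return (df - df.mean()) / df.std(ddof=0)

# Toy frame with made-up values (hypothetical); 0.60 plays the role of an outlier.
toy = pd.DataFrame({
    "chlorides": [0.07, 0.08, 0.08, 0.09, 0.60],
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
})

clipped = toy.apply(clip_iqr)   # apply the IQR rule column by column
scaled = standardize(clipped)
print(clipped)
print(scaled.round(3))
```

With `ddof=0` the scaling matches what scikit-learn's `StandardScaler` computes. When training a model, the fences and scaling statistics should be computed on the training split only and then reused for the test split.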
- Data Analysis: diw034@ucsd.edu