dwu025/Cse151a

Wine Quality Dataset Exploratory Data Analysis

Project Overview

This project performs exploratory data analysis (EDA) on the wine quality dataset to understand which physicochemical factors influence wine quality.

Dataset Information

  • Data Source: Wine Quality Dataset (WineQT.csv)
  • Data Size: 1,143 records
  • Data Type: Wine physicochemical test data
  • Target Variable: quality (quality score, range 3-8)

Feature Description

The dataset contains the following 11 input variables (physicochemical test data):

  1. fixed acidity - Fixed acidity
  2. volatile acidity - Volatile acidity
  3. citric acid - Citric acid
  4. residual sugar - Residual sugar
  5. chlorides - Chlorides
  6. free sulfur dioxide - Free sulfur dioxide
  7. total sulfur dioxide - Total sulfur dioxide
  8. density - Density
  9. pH - pH value
  10. sulphates - Sulphates
  11. alcohol - Alcohol content

Environment Requirements

  • Python 3.7+
  • Dependencies:
    pandas>=1.3.0
    numpy>=1.21.0
    matplotlib>=3.4.0
    seaborn>=0.11.0
    scikit-learn>=1.0.0
    jupyter>=1.0.0
    

Project Structure

├── WineQT.csv                         # Original dataset
├── wine_quality_analysis.ipynb        # Data exploration analysis notebook (Chinese)
├── wine_quality_analysis_en.ipynb     # Data exploration analysis notebook (English)
├── README.md                          # Project documentation (Chinese)
├── README_EN.md                       # Project documentation (English)
└── requirements.txt                   # Dependencies list

Exploratory Data Analysis Contents

1. Basic Data Information

  • Dataset shape and basic statistics
  • Data type checking
  • Missing-value and duplicate-record analysis
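
The checks in this step can be sketched in a few lines of pandas. This is a minimal illustration rather than the notebook's actual code: the helper name `basic_info` and the small `sample` frame are hypothetical stand-ins for the real data.

```python
import pandas as pd

def basic_info(df: pd.DataFrame) -> dict:
    """Collect shape, dtypes, missing values, and duplicate rows in one dict."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Hypothetical miniature frame standing in for WineQT.csv
sample = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, None],
    "pH": [3.51, 3.20, 3.20, 3.16],
    "quality": [5, 5, 5, 6],
})

info = basic_info(sample)
print(info["shape"], info["missing_per_column"], info["duplicate_rows"])
```

On the real file, the same helper would presumably be called as `basic_info(pd.read_csv("WineQT.csv"))`, and `df.describe()` supplies the basic statistics.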

2. Data Visualization

  • Target variable distribution (bar chart)
  • Feature distributions (histograms)
  • Outlier detection (box plots)
  • Feature correlation analysis (heatmap)
  • Feature-target relationship (scatter plots)
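
As a rough sketch of how some of these plots might be produced, the example below draws the target bar chart, the box plots, and the correlation heatmap with plain matplotlib. The function name `plot_overview` and the `demo` frame are assumptions made for illustration, and the notebook may well use seaborn instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

def plot_overview(df: pd.DataFrame, target: str = "quality"):
    """Draw a target bar chart, feature box plots, and a correlation heatmap."""
    features = [c for c in df.columns if c != target]
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Bar chart of the target variable distribution
    counts = df[target].value_counts().sort_index()
    axes[0].bar(counts.index.astype(str).tolist(), counts.to_numpy())
    axes[0].set_title(f"{target} distribution")

    # Box plots: points beyond the whiskers are candidate outliers
    axes[1].boxplot([df[c].dropna().to_numpy() for c in features])
    axes[1].set_xticklabels(features, rotation=45, ha="right")
    axes[1].set_title("feature box plots")

    # Heatmap of pairwise Pearson correlations
    corr = df.corr()
    im = axes[2].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    axes[2].set_xticks(range(len(corr)))
    axes[2].set_xticklabels(corr.columns, rotation=45, ha="right")
    axes[2].set_yticks(range(len(corr)))
    axes[2].set_yticklabels(corr.columns)
    axes[2].set_title("correlation heatmap")
    fig.colorbar(im, ax=axes[2])

    fig.tight_layout()
    return fig, axes

# Hypothetical values, not taken from the real dataset
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.5],
    "volatile acidity": [0.70, 0.88, 0.60, 0.40, 0.30, 0.66],
    "quality": [5, 5, 6, 6, 7, 5],
})
fig, axes = plot_overview(demo)
```

Histograms and scatter plots are left out to keep the sketch short. `df.hist()` and `df.plot.scatter(x=..., y="quality")` are the usual pandas shortcuts for those.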

3. Data Quality Assessment

  • Data completeness check
  • Class imbalance analysis
  • Outlier identification
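
The imbalance and outlier checks can be quantified with two small helpers. The sketch below assumes the common 1.5 × IQR fence for outliers. The function names and the `demo` values are hypothetical.

```python
import pandas as pd

def class_shares(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Fraction of rows per quality score, ordered by score."""
    return df[target].value_counts(normalize=True).sort_index()

def iqr_outlier_counts(df: pd.DataFrame, target: str = "quality",
                       k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each feature."""
    features = df.drop(columns=[target])
    q1 = features.quantile(0.25)
    q3 = features.quantile(0.75)
    iqr = q3 - q1
    outside = (features < q1 - k * iqr) | (features > q3 + k * iqr)
    return outside.sum()

# Hypothetical values: one extreme residual sugar reading
demo = pd.DataFrame({
    "residual sugar": [1.9, 2.0, 2.1, 2.2, 2.3, 15.0],
    "quality": [5, 5, 5, 6, 6, 7],
})
print(class_shares(demo))
print(iqr_outlier_counts(demo))
```

A share that is much larger for one or two scores than for the rest signals imbalance, and a nonzero count flags features worth inspecting in the box plots.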

Key Findings

Data Quality

  • No missing values
  • No duplicate records
  • Class imbalance: quality scores are concentrated at 5 and 6
  • Some features contain outliers

Feature Characteristics

  • Alcohol content is positively correlated with quality score
  • Volatile acidity is negatively correlated with quality score
  • Citric acid is positively correlated with quality score
  • Most features follow a normal or near-normal distribution
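
Findings of this kind can be checked with a correlation ranking and a skewness summary. The following is a sketch under the assumption that Pearson correlation is the measure used. The helper names and `demo` values are illustrative, not results from the dataset.

```python
import pandas as pd

def target_correlations(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    return corr.sort_values(key=abs, ascending=False)

def feature_skewness(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Sample skewness per feature; values near 0 suggest a symmetric shape."""
    return df.drop(columns=[target]).skew()

# Hypothetical values chosen so alcohol rises and acidity falls with quality
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 9.5],
    "volatile acidity": [0.70, 0.88, 0.60, 0.40, 0.30, 0.66],
    "quality": [5, 5, 6, 6, 7, 5],
})
corr = target_correlations(demo)
print(corr)
print(feature_skewness(demo))
```

Low skewness alone does not establish normality, so the histograms remain the primary check on distribution shape.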

Data Preprocessing Recommendations

  1. Outlier Treatment: Use the IQR or Z-score method to detect and handle outliers
  2. Feature Scaling: Standardize or normalize numerical features
  3. Data Balancing: Consider SMOTE or another resampling technique to address the class imbalance
  4. Feature Engineering: Select important features based on the correlation analysis
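
The first three recommendations could be chained roughly as follows. This is a sketch under stated assumptions, not the project's pipeline: it uses the Z-score method with a threshold of 3, standardizes with the population standard deviation, and substitutes plain random oversampling for SMOTE, which lives in the separate imbalanced-learn package that the requirements do not list. The `preprocess` name and the `demo` values are hypothetical.

```python
import pandas as pd

def preprocess(df: pd.DataFrame, target: str = "quality",
               z_thresh: float = 3.0, random_state: int = 0) -> pd.DataFrame:
    """Drop Z-score outliers, standardize features, oversample minority classes."""
    X = df.drop(columns=[target])

    # 1. Outlier treatment: keep rows where every feature has |z| <= z_thresh
    z = (X - X.mean()) / X.std(ddof=0)
    kept = df[(z.abs() <= z_thresh).all(axis=1)]

    # 2. Feature scaling: zero mean, unit variance on the remaining rows
    Xk = kept.drop(columns=[target])
    scaled = (Xk - Xk.mean()) / Xk.std(ddof=0)
    scaled[target] = kept[target]

    # 3. Balancing: sample every class with replacement up to the largest class
    #    (random oversampling, a simpler stand-in for SMOTE)
    n_max = scaled[target].value_counts().max()
    balanced = scaled.groupby(target).sample(
        n=n_max, replace=True, random_state=random_state
    )
    return balanced.reset_index(drop=True)

# Hypothetical values: the last row carries an extreme sulfur dioxide reading
demo = pd.DataFrame({
    "total sulfur dioxide": [30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 400],
    "alcohol": [9.0, 9.2, 9.4, 9.6, 9.8, 10.0, 10.2, 10.4, 10.6, 10.8, 11.0, 10.0],
    "quality": [5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 5],
})
balanced = preprocess(demo)
print(balanced["quality"].value_counts())
```

When a model is trained afterwards, the scaling statistics and the resampling should be computed on the training split only, so that information from the test split does not leak into the features.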
