<a href="https://colab.research.google.com/github/Yedzinovich/Data-607/blob/main/homework.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 7**

Weeks 8 & 9 - Pandas, Inna Yedzinovich


### **Part 1: Introduction**:

Last semester, I used a wine quality dataset to investigate how various chemical properties affect the quality of Portuguese "Vinho Verde" red wine. That code was written in R, and since it was a final project, I dedicated a lot of effort to it. For this assignment, I've decided to reuse some concepts from that project and implement them in Python with pandas. Let's see how it goes!

The quality of wine is a complex attribute influenced by various chemical properties. Understanding these influences can help wine producers enhance their products and meet consumer expectations. This project aims to investigate how different chemical properties affect the quality of Portuguese "Vinho Verde" red wine.

The research question guiding this study is: "How do various chemical properties of wine influence its quality, and can we predict wine quality based on these properties?" This question is addressed using a dataset of wine samples, which includes measurements of properties such as fixed acidity, volatile acidity, and alcohol content, along with quality ratings provided by wine experts.

By analyzing the relationships between these chemical properties and the quality ratings, this study seeks to identify significant predictors of wine quality. The findings will provide valuable insights for wine producers, enabling them to make data-driven decisions to improve the quality of their wines. The use of descriptive statistics, correlation analysis, and multiple linear regression ensures a comprehensive examination of the data, allowing for robust conclusions to be drawn.
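As a rough illustration of the correlation and regression steps named above, the sketch below ranks predictors by their correlation with a target column and fits an ordinary least squares model with NumPy. The helper names `correlations_with` and `fit_linear` are hypothetical, and the toy dataframe is made up for the example (its `quality` column is built from a known formula so the fit can be checked); it is not the WineQT data and not the method of the original R project.

```python
import numpy as np
import pandas as pd


def correlations_with(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of each numeric column with `target`,
    ordered from strongest to weakest absolute correlation."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def fit_linear(df: pd.DataFrame, predictors: list, target: str) -> pd.Series:
    """Ordinary least squares fit of `target` on `predictors` plus an intercept."""
    X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy(dtype=float)])
    y = df[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(coef, index=["intercept", *predictors])


# Toy data for illustration only (not taken from WineQT.csv).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.7, 0.4, 0.6, 0.3, 0.5, 0.2],
})
# Synthetic target with known coefficients, so the fit can be verified.
toy["quality"] = 1.0 + 0.5 * toy["alcohol"] - 2.0 * toy["volatile acidity"]

ranked = correlations_with(toy, "quality")
coefs = fit_linear(toy, ["alcohol", "volatile acidity"], "quality")
print(ranked)
print(coefs)
```

On the real dataset, the same helpers could be called with `wines` and `"quality"` once the data has been loaded; a dedicated statistics library would also report standard errors and p-values, which this sketch does not.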


### **Part 2: Data Exploration**:





In [18]:
# Import necessary packages
import pandas as pd

# Load data
wines = pd.read_csv("https://raw.githubusercontent.com/Yedzinovich/FALL2024TIDYVERSE/refs/heads/main/WineQT.csv")

# Display column names
print(wines.columns)

# Remove unnecessary columns (id column)
wines = wines.drop(columns=['Id'])

# Display the first few rows of the dataframe
print(wines.head())

# Summary statistics
print("\nSummary Statistics:")
print(wines.describe())

# Missing value information
print("\nMissing Value Information:")
print(wines.isnull().sum())

# Additional relevant information about the dataset
print("\nAdditional Information:")
wines.info()  # info() prints its report directly and returns None

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'Id'],
      dtype='object')
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    ph  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3  

### **Part 3: Data Wrangling**:

In [23]:
print(wines.columns)
# Note: add a test column with string values; we will use it later
wines['wine type'] = ['red', 'white', 'rose'] * (len(wines) // 3) + ['red'] * (len(wines) % 3)
print(wines.head())

# Create a subset of the data
subset_wines = wines.sample(frac = 0.5, random_state=1)
print(subset_wines.head())

# Check the structure of the data
print(subset_wines.dtypes)

# Handle missing values by filling them with a constant (0)
subset_wines = subset_wines.fillna(0)
print(subset_wines.head())

# Create new columns based on existing columns or calculations
subset_wines['total acidity'] = subset_wines['fixed acidity'] + subset_wines['volatile acidity']
print(subset_wines.head())

# Drop a column from the dataset
subset_wines = subset_wines.drop(columns=['density'])
print(subset_wines.head())
print(subset_wines.dtypes)

# Drop a row from the dataset
subset_wines = subset_wines.drop(subset_wines.index[0])
print(subset_wines.head())

# Sort data based on multiple variables
subset_wines = subset_wines.sort_values(by=['quality', 'alcohol'], ascending=[True, False])
print(subset_wines)

# Filter data based on some condition
filtered_wines = subset_wines[subset_wines['quality'] > 2]
print(filtered_wines)

# Convert all string values in one column to uppercase
subset_wines['wine type'] = subset_wines['wine type'].str.upper()
print(subset_wines.head())

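# Check whether every value in a given column of the dataframe is numeric
# (errors='coerce' turns non-numeric entries into NaN instead of raising)
numeric_check = pd.to_numeric(subset_wines['alcohol'], errors='coerce').notnull().all()
print(f"\nAll values in the 'alcohol' column are numeric: {numeric_check}")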

# Group dataset by one column, and get the mean, min, and max values by group
grouped_quality = subset_wines.groupby('quality').agg({'alcohol': ['mean', 'min', 'max']})
print("\ngrouped by quality (mean, min, max of alcohol):")
print(grouped_quality)

# Group dataset by two columns and then sort the aggregated results within the groups
grouped_quality_alcohol = subset_wines.groupby(['quality', 'alcohol']).size().reset_index(name='counts')
print(grouped_quality_alcohol.head())

sorted_grouped_quality_alcohol = grouped_quality_alcohol.sort_values(by=['quality', 'counts'], ascending=[False, False])
print("\ngrouped by quality and alcohol, sorted by counts:")
print(sorted_grouped_quality_alcohol.head())

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'ph', 'sulphates', 'alcohol', 'quality', 'wine.type', 'wine type'],
      dtype='object')
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    ph  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.

### **Part 4: Conclusion**:

The dataset shows a wide range of values for each chemical property, with most values clustering around the mean. The quality scores are mostly 5 or 6, indicating that the wines are generally of average quality.
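The claim about quality scores can be checked directly. The sketch below computes the share of rows at each score with `value_counts(normalize=True)`; the helper name `quality_share` is hypothetical and the toy dataframe is made up for illustration, so the numbers it prints say nothing about the real data.

```python
import pandas as pd


def quality_share(df: pd.DataFrame, column: str = "quality") -> pd.Series:
    """Return the fraction of rows at each score, ordered by score."""
    return df[column].value_counts(normalize=True).sort_index()


# Toy data for illustration only (not taken from WineQT.csv).
toy = pd.DataFrame({"quality": [5, 5, 6, 6, 6, 7, 4, 5]})

shares = quality_share(toy)
# reindex with fill_value=0 avoids a KeyError if a score is absent.
mid_share = shares.reindex([5, 6], fill_value=0).sum()

print(shares)
print(f"Share of scores 5 or 6: {mid_share:.2f}")
```

Running `quality_share(wines)` on the loaded dataset would give the actual proportions behind the statement above.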

As a possible extension, we could add:
- Histograms: create a histogram for each variable to visualize its distribution. This helps in identifying the shape of the data, the presence of outliers, and skewness.

- Box plots: create a box plot for each variable to visualize its spread and identify potential outliers.
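A minimal sketch of both plot types, assuming matplotlib is available. The function name `plot_distributions`, its parameters, and the toy dataframe are hypothetical and only for illustration; the function draws one panel per numeric column and skips string columns such as `wine type`.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_distributions(df: pd.DataFrame, kind: str = "hist", ncols: int = 4, bins: int = 20):
    """Draw one histogram or box plot per numeric column on a grid of axes."""
    numeric = df.select_dtypes(include="number")
    if numeric.shape[1] == 0:
        raise ValueError("dataframe has no numeric columns to plot")
    nrows = math.ceil(numeric.shape[1] / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)
    flat = axes.ravel()
    for ax, col in zip(flat, numeric.columns):
        values = numeric[col].dropna().to_numpy()
        if kind == "hist":
            ax.hist(values, bins=bins)
        else:
            ax.boxplot(values)
        ax.set_title(col)
    # Hide any unused panels in the last row of the grid.
    for ax in flat[numeric.shape[1]:]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig


# Toy data for illustration only (not taken from WineQT.csv).
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.40],
    "wine type": ["red", "white", "rose", "red", "white"],
})

fig_hist = plot_distributions(toy, kind="hist")
fig_box = plot_distributions(toy, kind="box")
```

In the notebook, `plot_distributions(wines)` followed by `plt.show()` would display the histograms, and passing `kind="box"` would display the box plots.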