# **Project Name**    -



##### **Project Type**    - Classification
##### **Contribution**    - Team
##### **Team Member 1 -** Anchal Gupta
##### **Team Member 2 -** Serena Balyan
##### **Team Member 3 -** Tanisha Mohapatra

# **Project Summary -**
The Wine Quality Dataset project involves an in-depth analysis of a dataset consisting of 1143 entries, each representing a wine sample. The dataset records the chemical attributes that make up each wine's composition (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) together with a quality rating assigned to each wine.

Objectives:
The primary goal of this project is to investigate the relationship between the chemical composition of wines and their associated quality ratings. By examining the provided attributes, researchers aim to uncover underlying patterns, correlations, and insights that may influence the perceived quality of wine. Understanding these relationships can provide valuable information for both the wine industry and consumers alike.

Methodology:

Exploratory Data Analysis (EDA):

The project begins with exploratory data analysis to gain a comprehensive understanding of the dataset's structure and characteristics. This involves examining summary statistics, distributions, and correlations among the various attributes.

Feature Engineering:

Feature engineering techniques may be applied to derive new features or transform existing ones, potentially enhancing the predictive power of the model. For example, combinations of existing attributes or normalization techniques can be used to create new features.

Model Development:

Machine learning models, such as regression or classification algorithms, are employed to predict wine quality based on its chemical composition. These models are trained on the dataset and optimized to accurately predict quality ratings.

Model Evaluation:

The trained models undergo rigorous evaluation using appropriate metrics to assess their performance and generalization capabilities. Techniques such as cross-validation may be used to ensure the models' robustness and reliability.
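As a minimal sketch of the evaluation step described above, the snippet below runs 5-fold stratified cross-validation on a scaled logistic regression. The data is synthetic, and the `demo` frame, the binary `good` target, and the two-feature setup are assumptions made for illustration; the notebook's own features and labels would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: the real notebook would use df1).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, n),
    "volatile acidity": rng.normal(0.53, 0.18, n),
})
# Hypothetical binary target: 1 for a "good" wine, driven mostly by alcohol.
demo["good"] = (demo["alcohol"] + rng.normal(0, 0.5, n) > 10.4).astype(int)

X = demo[["alcohol", "volatile acidity"]]
y = demo["good"]

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
# and nothing from the validation fold leaks into the scaler.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

Keeping the scaler in the pipeline is a design choice: scaling the full dataset before splitting would let validation rows influence the training statistics.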

Conclusion:
In conclusion, the Wine Quality Dataset project presents a valuable opportunity to explore the intricate relationship between wine composition and quality ratings. Through systematic analysis and modeling, researchers aim to uncover insights that may inform winemaking practices, quality assessment methodologies, and consumer preferences. By leveraging data science techniques, this project contributes to a deeper understanding of the factors influencing wine quality, ultimately benefiting both producers and consumers in the wine industry.









# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**



Problem Statement:
Predicting Wine Quality

Wine quality is influenced by various chemical properties such as acidity, residual sugar, pH level, alcohol content, and more. Understanding how these properties affect the perceived quality of wine is crucial for winemakers to produce high-quality products.

Objective:

The objective of this project is to build a predictive model that can accurately classify the quality of wine based on its chemical properties. By analyzing the dataset containing several physicochemical properties of wines along with their quality ratings, we aim to develop a model that can assist winemakers in assessing and improving the quality of their products.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each chart, the following questions are answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using an Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with an updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb

import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC



### Dataset Loading

In [None]:
# Load Dataset
from google.colab import files
uploaded=files.upload()

### Dataset First View

In [None]:
# Dataset First Look
# Reading data from a CSV file into a DataFrame.
df1 = pd.read_csv('WineQT.csv')
print(df1)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
#To fetch the number of rows and columns in the DataFrame.
print(df1.shape)
print(df1.shape[0])
print(df1.shape[1])

### Dataset Information

In [None]:
# Dataset Info
# Fetching information about the DataFrame's structure.
df1.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# To Display the count of duplicate rows
duplicate_counts = df1.duplicated().value_counts()
print("Count of duplicate rows:")
print(duplicate_counts)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df1.isnull().sum()

In [None]:
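# Visualizing the missing values
# Heatmap of the null mask: any missing cell shows up as a contrasting mark.
sb.heatmap(df1.isnull(), cbar=False)
plt.show()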

### What did you know about your dataset?

The dataset relates to wine quality: it contains various chemical properties of wines along with a quality rating. It consists of 1143 rows and 13 columns.

From the given dataset we can potentially explore relationships between these chemical properties and the quality of the wine, identifying which properties contribute most to a wine's quality rating. We can also perform predictive modeling to predict the quality of wine based on its chemical composition.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
# List the column names (count() would return non-null counts instead).
df1.columns

In [None]:
# Dataset Describe
df1.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# pandas is already imported as pd at the top of the notebook.
unique_values = df1.nunique()
print(unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
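# Drop rows with missing values, then confirm the structure and inspect the last rows.
df1 = df1.dropna()
df1.info()  # info() prints directly and returns None, so it is not wrapped in print()
df1.tail()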

In [None]:
# Coerce 'alcohol' to numeric; errors='coerce' turns unparseable entries into NaN.
df1['alcohol'] = pd.to_numeric(df1['alcohol'], errors='coerce')
print(df1.isnull().sum())
df1.info()

In [None]:
# Count rows that are exact duplicates of an earlier row.
df1.duplicated().sum()

In [None]:
# Summary statistics, transposed so that each variable is a row.
df1.describe().T

### What all manipulations have you done and insights you found?

The dataset relates to wine quality, with various chemical properties (such as fixed acidity, volatile acidity, and citric acid) and quality ratings.

Manipulations and insights that could be derived from this dataset include:

Data Cleaning:

Checking for missing values: We would inspect each column to ensure there are no missing values that could affect our analysis.

Removing duplicate rows: If there are any duplicate rows, they should be removed to avoid skewing the analysis.
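One caveat with duplicate checks is that a row identifier makes every row look unique. The sketch below illustrates this on a tiny hand-made frame; the `Id` column name and all values are assumptions for illustration, not taken from the project data.

```python
import pandas as pd

# Tiny hypothetical frame; "Id" stands in for a row-identifier column (an assumption).
demo = pd.DataFrame({
    "Id": [0, 1, 2, 3],
    "alcohol": [9.4, 9.8, 9.4, 10.0],
    "pH": [3.51, 3.20, 3.51, 3.16],
    "quality": [5, 5, 5, 6],
})

# With the identifier included, every row is unique, so nothing is flagged.
print(demo.duplicated().sum())

# Compare only the measurement columns to find repeated samples.
measure_cols = [c for c in demo.columns if c != "Id"]
dup_mask = demo.duplicated(subset=measure_cols, keep="first")
print(dup_mask.sum())

# Keep the first occurrence of each repeated sample.
deduped = demo.drop_duplicates(subset=measure_cols, keep="first").reset_index(drop=True)
print(deduped.shape)
```

Whether repeated measurements should actually be dropped depends on whether they are data-entry repeats or genuinely distinct wines with identical readings.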

Exploratory Data Analysis (EDA):

Statistical summary: Calculating basic statistics like mean, median, standard deviation, etc., for each numerical column to understand the distribution and variability of data.

Data Visualization: Creating visualizations such as histograms, box plots, or scatter plots to explore the relationships between different variables and their distributions. For instance, visualizing the distribution of wine quality ratings or the relationship between alcohol content and quality.

Feature Engineering:

Creating new features: Based on domain knowledge or insights gained during EDA, new features could be engineered. For example, a total sulfur dioxide to free sulfur dioxide ratio could be created to assess the balance between these two compounds.
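The ratio feature mentioned above could be derived roughly as follows. This is a sketch on made-up values; the new column name `total_to_free_so2` is hypothetical, and mapping a zero denominator to NaN is one possible convention rather than the project's rule.

```python
import numpy as np
import pandas as pd

# Hypothetical values using the dataset's column names.
demo = pd.DataFrame({
    "free sulfur dioxide": [11.0, 25.0, 15.0, 0.0],
    "total sulfur dioxide": [34.0, 67.0, 54.0, 0.0],
})

# Replace a zero denominator with NaN so the ratio is NaN instead of inf.
free = demo["free sulfur dioxide"].replace(0, np.nan)
demo["total_to_free_so2"] = demo["total sulfur dioxide"] / free

print(demo)
```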

Correlation Analysis:

Determining the correlation between different features and the target variable (quality). This can be done using correlation matrices or visualizations like heatmaps.

Identifying significant correlations: Understanding which features have the strongest correlation with wine quality can provide insights into which factors contribute most to the overall quality rating.
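A ranking of features by their correlation with quality could look like the sketch below. The data is synthetic and deliberately constructed so that alcohol helps and volatile acidity hurts; those effect sizes are assumptions, so the printed numbers say nothing about the real dataset. Spearman correlation is used here because the rating is ordinal, which is a choice made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
alcohol = rng.normal(10.4, 1.0, n)
volatile = rng.normal(0.53, 0.18, n)
# Hypothetical quality score, rounded and clipped to a 3-8 rating scale.
raw = 5.6 + 0.6 * (alcohol - 10.4) - 2.0 * (volatile - 0.53) + rng.normal(0, 0.5, n)
quality = np.clip(np.round(raw), 3, 8).astype(int)
demo = pd.DataFrame({"alcohol": alcohol, "volatile acidity": volatile, "quality": quality})

# Rank-based correlation of each feature with quality, strongest first.
corr_with_quality = (
    demo.corr(method="spearman")["quality"]
    .drop("quality")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_quality)
```

Sorting by absolute value keeps strong negative relationships near the top instead of burying them at the bottom of the list.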

Insights:

Identification of key factors affecting wine quality: Through data analysis and modeling, we can identify which chemical properties have the most significant impact on wine quality.

Recommendations: Based on the insights gained, recommendations can be made to improve wine quality. For example, if acidity is found to strongly correlate with quality, winemakers could adjust acidity levels during production to enhance quality.

These are just some of the potential manipulations and insights that could be derived from the provided dataset. The specific analyses and conclusions would depend on the goals of the analysis and the questions being addressed.







## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
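# Chart - 1 visualization code
# Visualization of the relationship between wine quality and alcohol content.
# Calling plt.bar on the raw columns overdraws one bar per row, so only the
# tallest (maximum) value per rating would be visible; plot the mean instead.
mean_alcohol_by_quality = df1.groupby('quality')['alcohol'].mean()
plt.bar(mean_alcohol_by_quality.index, mean_alcohol_by_quality.values, color='#00bfa0')
plt.xlabel('quality')
plt.ylabel('mean alcohol')
plt.show()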

##### 1. Why did you pick the specific chart?

A bar chart is a suitable choice here because it effectively displays discrete categories (quality ratings) against a continuous variable (alcohol content). It allows for easy comparison between different quality levels and their corresponding alcohol content.

##### 2. What is/are the insight(s) found from the chart?

Insights:

It appears that higher quality ratings generally correspond to higher alcohol content, as we observe a general trend of increasing alcohol content with increasing quality ratings.
There might be exceptions where certain quality ratings have lower alcohol content than expected.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Understanding the relationship between quality and alcohol content can help in several ways. For example, if customers prefer higher alcohol content in wines of higher quality, producers can adjust their production processes accordingly to meet consumer preferences and potentially increase customer satisfaction and loyalty.

Negative Impact: If there are instances where higher quality wines actually have lower alcohol content, it could lead to dissatisfaction among consumers who expect higher alcohol content in premium wines. This could potentially lead to negative reviews, decreased sales, and damage to the brand's reputation.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
#To create histograms for each column in the given DataFrames.
df1.hist(bins=20, figsize=(10, 10),color='#beb9db')
plt.show()

##### 1. Why did you pick the specific chart?
Histograms are suitable for visualizing the distribution of a single variable, making them useful for exploring the distribution of data within a DataFrame. The choice of 20 bins allows for a reasonable level of granularity without making the chart too cluttered or difficult to interpret.

##### 2. What is/are the insight(s) found from the chart?

Insights from the chart could include:

Distribution of Data: The histogram would provide insights into the distribution of values within the DataFrame, including any skewness, central tendency, or spread.

Identifying Outliers: Any outliers or unusual patterns in the data may be visually apparent in the histogram.

Data Quality: It may also reveal data quality issues such as implausible or erroneous values. Missing values do not appear in a histogram, so they need to be checked separately.
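Skewness and outliers can also be quantified rather than read off the plot. The sketch below uses a small made-up series labelled `residual sugar` (the values are assumptions) and applies Tukey's 1.5 × IQR rule, which is one common convention among several.

```python
import pandas as pd

# Small hypothetical sample with one unusually large value.
s = pd.Series([1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 15.5], name="residual sugar")

# Positive skewness indicates a long right tail.
print("skewness:", round(s.skew(), 2))

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)
```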

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Insights gained from the chart could inform business decisions, such as targeting specific customer segments, optimizing pricing strategies, or identifying areas for process improvement.

Negative Growth: If the insights reveal significant issues such as a highly skewed distribution or a large number of outliers, this could signal potential challenges or areas of concern for the business. For example, a long right tail in one of the chemical measurements would mean that a small number of atypical wines could distort averages and any model trained on them, so those samples would need to be reviewed before drawing conclusions.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Plot the mean density for each quality rating in df1.
mean_density_by_quality = df1.groupby('quality')['density'].mean()

#line plot:
plt.figure(figsize=(10, 6))
plt.plot(mean_density_by_quality.index, mean_density_by_quality.values, marker='o', linestyle='-',color='#7c1158')
plt.title('Mean Density by Quality')
plt.xlabel('Quality')
plt.ylabel('Mean Density')
plt.xticks(mean_density_by_quality.index)
plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?

The line plot is suitable for showing the trend of mean density across different quality levels. It allows for easy visualization of any patterns or trends in the data over the quality scale.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart: From the chart, we can gather insights such as:

*   Whether there's a correlation between quality and density.
*   Any significant changes or trends in density across different quality levels.
*   Whether there are quality levels where density consistently deviates from the overall trend.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business Impact of Insights:

Positive Impact: If there's a positive correlation between quality and density, businesses could use this information to improve quality control processes. For instance, they might adjust production methods to ensure higher density for higher quality products, potentially leading to improved customer satisfaction and brand reputation.

Negative Impact: If the insights reveal that higher quality products have lower density or no correlation between quality and density, it could pose challenges. This might imply inefficiencies or inconsistencies in production processes that need to be addressed. However, such insights, though initially negative, could ultimately lead to positive impacts if addressed effectively. For example, identifying and rectifying production flaws could lead to higher-quality products and increased customer satisfaction in the long term.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# To create two subplots side by side
# ->1. Scatter plot to visualize quality vs alcohol component in wine.
# ->2. Histogram to visualize frequency of alcohol.
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))
# Scatter plot:
ax[0].scatter(df1['quality'], df1['alcohol'], marker='o', color='#a86464')
ax[0].grid(ls="--")
ax[0].set_xlabel('Quality')
ax[0].set_ylabel('Alcohol')
ax[0].set_title('Scattering Quality and Alcohol')

# Histogram:
ax[1].hist(df1['alcohol'], color='#e27c7c')
ax[1].set_xlabel('Alcohol')
ax[1].set_ylabel('Frequency')
ax[1].set_title('Histogram analysis for Alcohol')

plt.show()


##### 1. Why did you pick the specific chart?

We picked a scatter plot to visualize the relationship between quality and alcohol content in wine because it shows how alcohol values are spread within each quality rating and reveals any pattern or trend between the two. Because quality is a discrete rating, the points line up in vertical bands. We also chose a histogram to display the frequency distribution of alcohol content because it is effective at illustrating the distribution of a single variable.


##### 2. What is/are the insight(s) found from the chart?

From the scatter plot, we can see if there's any correlation between wine quality and alcohol content. If there's a discernible pattern, such as higher-quality wines tending to have higher or lower alcohol content, it could provide insights into production practices or consumer preferences.

From the histogram, we can observe the distribution of alcohol content across the dataset. This can help identify common alcohol levels and potentially outliers. Understanding the distribution is essential for quality control and production decisions.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from these charts can indeed have a positive business impact.

For example:

Correlation between Quality and Alcohol Content: If a positive correlation is observed between wine quality and alcohol content, winemakers could adjust their production processes to optimize alcohol levels for higher-quality wines, potentially increasing customer satisfaction and brand reputation.

Understanding Alcohol Distribution: Knowing the distribution of alcohol content in wines can help winemakers tailor their product offerings to meet consumer preferences. They can adjust marketing strategies, pricing, or even develop new products targeting specific segments of the market.

However, there are potential negative impacts as well:

Overemphasis on Alcohol Content: If the analysis suggests a strong correlation between quality and alcohol content, there might be a temptation to prioritize alcohol levels over other factors affecting wine quality, such as grape variety, terroir, or aging process. This could lead to a homogenization of wine styles and a loss of diversity in offerings, potentially alienating certain customer segments who prefer different characteristics in their wines.






#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Histogram to analyze the distribution of the 'fixed acidity' feature in df1.
fig, ax = plt.subplots()

# Histogram:
ax.hist(df1['fixed acidity'],bins=10, color='#df979e')
ax.set_xlabel('fixed acidity')
ax.set_ylabel('Frequency')
ax.set_title('Histogram analysis')
ax.axis([0, 16, 0, 400])

plt.show()

##### 1. Why did you pick the specific chart?

We picked a histogram because it's suitable for visualizing the distribution of a single numerical variable, such as 'fixed acidity' in this case. Histograms allow us to see the frequency or count of data points within different intervals or bins, which is useful for understanding the distribution of the variable.

##### 2. What is/are the insight(s) found from the chart?

From the histogram, we can glean several insights:

Distribution of Fixed Acidity: We can observe the distribution of fixed acidity values across the dataset. By setting the number of bins to 10, we've divided the range of fixed acidity values into 10 intervals, allowing us to see how the values are spread out.

Central Tendency: We can infer the central tendency of the fixed acidity values, such as whether they tend to cluster around a specific range or if they are evenly distributed across the range.

Skewness or Symmetry: Depending on the shape of the histogram, we can identify whether the distribution is symmetric or skewed to the left or right. This provides insights into the overall pattern of fixed acidity values.
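To make the binning concrete, the sketch below computes the 10 equal-width bins numerically with `np.histogram`, which applies the same equal-width rule over the data range. The ten values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical fixed acidity readings.
values = np.array([4.6, 6.8, 7.1, 7.4, 7.8, 7.9, 8.3, 9.9, 11.2, 15.9])

# 10 equal-width bins spanning [min, max]; the last bin includes its right edge.
counts, edges = np.histogram(values, bins=10)
print("bin width:", round(edges[1] - edges[0], 2))
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.2f} to {right:5.2f}: {c}")

# A mean above the median is a simple hint of right skew.
print("mean:", values.mean().round(2), "median:", np.median(values).round(2))
```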

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As for the impact on business:

Positive Impact: If the distribution of fixed acidity is centered around a desirable range for a product (such as wine), it could inform decisions related to production processes, quality control, or marketing strategies. For example, if customers prefer wines with a certain acidity level, this information could guide product development.

Negative Impact: If the histogram reveals that most of the fixed acidity values are concentrated in undesirable ranges or if there are outliers indicating quality issues, it could signal the need for process improvements or quality control measures. Addressing these issues could lead to improved product quality and customer satisfaction, ultimately benefiting the business.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Bar plot to visualize the frequency of wine quality ratings.
import seaborn as sns
quality_counts = df1['quality'].value_counts()
#bar plot:
sns.barplot(x=quality_counts.index, y=quality_counts.values)
plt.xlabel('Quality')
plt.ylabel('Frequency')
plt.title('Frequency of Wine Quality Ratings')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a bar plot because it's suitable for visualizing the frequency distribution of categorical data, such as wine quality ratings. Each quality rating is represented by a separate bar, and the height of each bar indicates the frequency of that rating.

##### 2. What is/are the insight(s) found from the chart?

The insight gained from the chart is the distribution of wine quality ratings. This information can be used to understand the overall quality distribution of the wines in the dataset. For instance, it can help identify whether most wines are rated highly or poorly, or if there's a relatively even distribution across different quality ratings.
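Since quality is the classification target, it can help to express the distribution as class shares. The sketch below does this on a short made-up list of ratings; the `imbalance_ratio` name is hypothetical and the numbers are not the dataset's.

```python
import pandas as pd

# Hypothetical ratings; the real notebook would use df1['quality'].
quality = pd.Series([5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 4, 8], name="quality")

# Relative frequency of each rating, in rating order.
shares = quality.value_counts(normalize=True).sort_index()
print(shares.round(3))

# Ratio of the most to the least frequent class: a quick imbalance indicator.
imbalance_ratio = shares.max() / shares.min()
print("imbalance ratio:", round(imbalance_ratio, 1))
```

A large ratio suggests that plain accuracy may be misleading and that per-class metrics deserve a look.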

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impacts could arise if, for example, the majority of wines in the dataset are rated highly, indicating a high-quality product that could attract more customers and potentially lead to increased sales. Conversely, if there's a skew towards lower quality ratings, it might prompt the business to investigate and improve the quality of their wines to remain competitive.

However, if the distribution shows a significant number of low-quality ratings without a clear explanation (such as being a dataset of inexpensive wines), it could signal a potential issue with the product quality or market perception. This could lead to negative growth if not addressed, as it may result in decreased customer satisfaction, diminished brand reputation, and ultimately lower sales.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Scatter plot of volatile acidity vs citric acid for the first 100 rows of df1.
plt.scatter(df1['volatile acidity'].head(100), df1['citric acid'].head(100), color='#48446e', marker='o')
plt.xlabel('Volatile Acidity')
plt.ylabel('Citric Acid')
plt.title('Correlation between Volatile Acidity and Citric Acid')
plt.show()



##### 1. Why did you pick the specific chart?

We chose a scatter plot because it effectively visualizes the relationship between two continuous variables, volatile acidity and citric acid. Each point on the plot represents a combination of these two variables, allowing for the observation of any patterns or trends in their relationship.

##### 2. What is/are the insight(s) found from the chart?

The insights gained from the scatter plot may include:

Identification of any correlation between volatile acidity and citric acid: If the points on the scatter plot exhibit a clear trend (e.g., moving from bottom-left to top-right or vice versa), it suggests a correlation between the two variables. For instance, a negative correlation would indicate that as volatile acidity increases, citric acid tends to decrease, and vice versa.

Presence of outliers: Outlying points that do not follow the general trend of the data may indicate unusual or extreme observations, which could be of interest for further investigation.

Clustering or grouping: Patterns may emerge where the data points cluster together, indicating potential subgroups within the data.
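The direction and strength of such a trend can be summarized with a correlation coefficient. The sketch below builds synthetic data with an inverse relationship baked in (an assumption for illustration only) and computes Pearson's r with `Series.corr`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 150
volatile = rng.uniform(0.2, 1.0, n)
# Hypothetical inverse relationship plus noise; citric acid cannot go below zero.
citric = np.clip(0.6 - 0.5 * volatile + rng.normal(0, 0.08, n), 0, None)
demo = pd.DataFrame({"volatile acidity": volatile, "citric acid": citric})

# Series.corr computes Pearson's r by default.
r = demo["volatile acidity"].corr(demo["citric acid"])
print("Pearson r:", round(r, 2))
```

On the real data, computing the coefficient over all rows rather than a subset would give a more representative figure.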

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially lead to positive business impacts:

Quality control: Understanding the relationship between volatile acidity and citric acid can help in quality control processes in industries like winemaking or food production. For instance, if a negative correlation is observed, it might suggest that adjusting one variable could help in optimizing the other, leading to better product quality.

Product development: Insights from the scatter plot can inform product development strategies. For example, if there is a positive correlation between volatile acidity and citric acid, it might suggest formulations where adjustments in one component can be made to achieve desired flavor profiles.

Process optimization: In industrial processes where volatile acidity and citric acid are critical parameters, understanding their relationship can aid in process optimization efforts, leading to improved efficiency and cost savings.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
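# Subplots to analyse wine quality with two different types of graphs:
# -> 1. Pie chart to analyse wine quality
# -> 2. Bar plot to analyse wine quality
quality_counts = df1['quality'].value_counts()

# Pie chart:
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
# Size explode to the number of wedges so the call does not fail if the set of ratings changes.
explode = [0.1] + [0] * (len(quality_counts) - 1)
# Use a palette with enough distinct colors so that wedges do not share a color.
plt.pie(quality_counts, labels=quality_counts.index, colors=plt.cm.tab10.colors, explode=explode, shadow=True, autopct="%.2f%%")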
plt.title('Wine Quality Distribution')

# Bar plot:
plt.subplot(1, 2, 2)
sns.barplot(x=quality_counts.index, y=quality_counts.values, palette='husl')
plt.xlabel('Quality')
plt.ylabel('Frequency')
plt.title('Wine Quality Distribution')

plt.show()

##### 1. Why did you pick the specific chart?

The pie chart was chosen to visualize the distribution of wine quality because it effectively shows the proportion of each quality category in relation to the whole. This type of chart is ideal for displaying the composition of a categorical variable like wine quality. The bar plot, on the other hand, was chosen because it allows for a comparison of the frequencies of different wine quality categories in a more straightforward manner.

##### 2. What is/are the insight(s) found from the chart?

From the charts, we can observe the following insights:

Pie Chart: It provides a visual representation of the distribution of wine quality categories. We can see the proportion of each quality category relative to the total number of wines sampled.

Bar Plot: This plot shows the frequency of each wine quality category in a more quantitative manner. It's easier to compare the number of wines in each quality category using this plot.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

Understanding the distribution of wine quality can help businesses tailor their marketing strategies and product offerings. For example, if a particular quality category is more prevalent, businesses can focus on producing wines that cater to that specific segment of the market.

By knowing which quality categories are more popular, businesses can adjust their pricing strategies accordingly. Wines of higher quality may command a premium price, while those of lower quality may be positioned as budget-friendly options.

Negative Growth:

While the insights gained from the charts can be valuable for decision-making, there aren't inherently negative insights from the distribution of wine quality. However, if a significant portion of the wines fall into lower quality categories, it may indicate potential issues with product quality or production processes. Addressing these issues would be necessary to prevent negative impacts on the business.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# (a) Two subplots to visualize the relationship of fixed acidity with quality of wine
#        -> Scatter plot
#        -> Box plot
# (b) Two subplots to visualize the relationship of volatile acidity with quality of wine
#        -> Scatter plot
#        -> Box plot
# Scatter plot of 'fixed acidity' vs 'quality':
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.scatter(df1['quality'], df1['fixed acidity'], marker='.', color='blue')
plt.xlabel('Quality')
plt.ylabel('Fixed Acidity')
plt.title('Relation of Fixed Acidity with Quality')

# Box plot of 'fixed acidity':
plt.subplot(1, 2, 2)
sns.boxplot(x=df1['quality'], y=df1['fixed acidity'], color='cyan')
plt.title("Box plotting of Fixed Acidity")

plt.show()

# Scatter plot of 'volatile acidity' vs 'quality':
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
plt.scatter(df1['quality'], df1['volatile acidity'], marker='.', color='green')
plt.xlabel('Quality')
plt.ylabel('Volatile Acidity')
plt.title('Relation of Volatile Acidity with Quality')

# Box plot of 'volatile acidity':
plt.subplot(1, 2, 2)
sns.boxplot(x=df1['quality'], y=df1['volatile acidity'], color='lightgreen')
plt.title("Box plotting of Volatile Acidity")

plt.show()


##### 1. Why did you pick the specific chart?

The scatter plot was chosen to visualize the relationship between fixed acidity (or volatile acidity) and the quality of wine because it helps to identify any patterns or trends in the data, particularly in terms of how the quality of wine varies with different levels of acidity. The box plot was chosen to provide a visual summary of the distribution of fixed acidity (or volatile acidity) across different quality levels, showing any potential outliers and the spread of the data.

##### 2. What is/are the insight(s) found from the chart?

From the scatter plot of fixed acidity versus quality, it seems that there might not be a strong linear relationship between fixed acidity and wine quality. However, it appears that wines with higher quality ratings tend to have slightly lower fixed acidity levels. Similarly, in the case of volatile acidity versus quality, there seems to be a slight trend indicating that higher quality wines have lower volatile acidity levels.
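Visual impressions like these could be cross-checked with a per-rating summary. The sketch below computes the median of both acidity columns for each quality level on a small made-up frame; the values are assumptions chosen for illustration and do not confirm or contradict the observations above.

```python
import pandas as pd

# Hypothetical samples using the dataset's column names.
demo = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6, 7, 7, 7],
    "fixed acidity": [7.4, 7.8, 8.1, 7.9, 8.3, 7.2, 8.8, 7.5, 8.0],
    "volatile acidity": [0.70, 0.66, 0.58, 0.52, 0.49, 0.60, 0.36, 0.40, 0.38],
})

# Median per rating; the median matches the center line of each box in a box plot.
summary = demo.groupby("quality")[["fixed acidity", "volatile acidity"]].median()
print(summary)
```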

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights could support decisions in wine production or quality control. Based on the observed relationship, winemakers might adjust acidity levels, for instance by targeting lower acidity in wines intended to be of higher quality. This could have a positive business impact by improving the perception and marketability of their wines. Because the plots show an association rather than a causal effect, any such adjustment would need to be validated first.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Visualize the distribution of sulphates levels in the first 10 rows of the DataFrame using a pie chart.
sulphates_counts = df1['sulphates'].head(10).value_counts()
plt.figure(figsize=(8, 6))
sulphates_counts.plot(kind='pie', autopct='%1.1f%%', colors=plt.cm.tab10.colors)
plt.title('Distribution of Sulphates Levels')
plt.ylabel('')
plt.show()

##### 1. Why did you pick the specific chart?

The pie chart was chosen to visualize the distribution of sulphates levels because it shows the proportion of each category relative to the whole. This works best with a small number of categories, which is why the chart is restricted to the sulphates values in the first 10 records of the DataFrame.


##### 2. What is/are the insight(s) found from the chart?

The chart shows how sulphates levels are distributed among the first 10 records in the DataFrame, so viewers can see the share of each sulphates value relative to the total. Because only 10 rows are used, the chart describes this small sample and not the dataset as a whole.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Whether these insights create a positive business impact depends on the context and the goals of the business. If understanding the distribution of sulphates levels is important for quality control or product development, the insights could be valuable. If the distribution shows unexpected patterns (e.g., a significant share of records with unusually high sulphates levels), it might signal quality issues that could lead to negative growth if not addressed. Further analysis on the full dataset is needed to determine the exact impact.
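Since sulphates is a continuous measurement, one way to get category shares over the whole column, instead of only the first 10 rows, is to bin the values first. The sketch below assumes hypothetical bin edges and labels and uses made-up values; it is an illustration, not the analysis performed above.

```python
import pandas as pd

# Illustrative values only (NOT taken from the wine dataset).
sulphates = pd.Series([0.45, 0.52, 0.58, 0.63, 0.68, 0.75, 0.82, 1.10], name="sulphates")

# Hypothetical bin edges and labels, chosen only for this example.
# With the default right=True, the bins are (0.0, 0.6], (0.6, 0.8] and (0.8, 2.0].
bins = [0.0, 0.6, 0.8, 2.0]
labels = ["low", "medium", "high"]

binned = pd.cut(sulphates, bins=bins, labels=labels)

# Share of records in each bin, ordered low -> high.
shares = binned.value_counts(normalize=True).sort_index()
print(shares)
```

The resulting `shares` Series can be passed to `shares.plot(kind='pie', autopct='%1.1f%%')` in the same way as `sulphates_counts` above.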

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Create a subplot grid of histograms (with KDE curves) for the columns of the DataFrame using seaborn and matplotlib.
rows = 2
cols = 7

fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 4))

col = df1.columns

for i in range(rows):
    for j in range(cols):
        if (i * cols + j) < len(col):
            sns.histplot(df1[col[i * cols + j]], ax=ax[i, j], kde=True)
        else:
            ax[i, j].axis('off')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

The chart chosen here is a histogram. Histograms are suited to visualizing the distribution of a single variable, showing the frequency or probability density of its values. That fits the goal here, which is to visualize the distribution of each column in the DataFrame.

##### 2. What is/are the insight(s) found from the chart?

The histograms show the distribution, central tendency, spread, and potential outliers of each column in the DataFrame. For example, they reveal whether a column is approximately normally distributed or skewed, and whether it has any unusual patterns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the histograms can help create a positive business impact. Understanding the distribution of the data supports informed decisions, such as identifying areas for improvement, detecting anomalies, or optimizing processes. For instance, a strongly skewed feature or one with many outliers may need further investigation or a preprocessing step to improve model performance.

If the histograms reveal poor or undesirable distributions across several variables, this could point to problems with data quality, data collection, or bias in the dataset. Fixing such problems may cost additional resources at first, but in the long term it leads to more accurate analyses and better-informed decisions.
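Skew seen in a histogram can also be quantified with `DataFrame.skew()`. The sketch below uses made-up columns and flags those whose absolute skewness exceeds 1; that threshold is a common rule of thumb assumed here for illustration, not a value taken from this project.

```python
import pandas as pd

# Illustrative values only (NOT taken from the wine dataset).
toy = pd.DataFrame({
    "symmetric":    [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 10],
})

# Sample skewness of every numeric column.
skewness = toy.skew()

# Assumed rule of thumb: |skewness| > 1 marks a strongly skewed column.
SKEW_THRESHOLD = 1.0
flagged = skewness[skewness.abs() > SKEW_THRESHOLD].index.tolist()

print(skewness)
print("Strongly skewed columns:", flagged)
```

Columns flagged this way are candidates for a transformation such as a logarithm before modelling.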


#### Chart - 12

In [None]:
# Chart - 12 visualization code
import matplotlib.pyplot as plt
selected_columns = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "quality"]
for column in selected_columns:
    plt.plot(df1[column].head(20), label=column)
plt.xlabel("Index")
plt.ylabel("Values")
plt.title("Line Plot of Selected Columns")
plt.legend()
plt.show()


##### 1. Why did you pick the specific chart?

A line plot was chosen because it shows how several variables vary along a common index, here the first 20 rows of the dataset. The row index is not a time axis, so the lines show row-to-row variation rather than a trend over time.

##### 2. What is/are the insight(s) found from the chart?

Insights from the chart:

The chart shows how the selected columns (fixed acidity, volatile acidity, citric acid, residual sugar, quality) vary over the first 20 rows of the dataset.

It also allows the columns to be compared, for example to see whether two of them rise and fall together across these rows.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impact:

Understanding the trends and relationships between these variables can help in quality control and optimization in various industries, especially in sectors like food and beverage where these attributes (acidity, sugar content) are crucial for product quality.

Identifying correlations between these variables and quality can guide decision-making processes for product improvement or optimization of production processes, potentially leading to increased customer satisfaction and loyalty.

Negative growth:

If the analysis reveals unfavorable trends such as consistently high volatile acidity or low quality ratings across the sampled data, it could indicate potential issues in product quality or production processes. Addressing these issues promptly is essential to prevent negative impacts on business growth, such as decreased customer satisfaction, increased product returns, or reputational damage.


#### Chart - 13

In [None]:
# Chart - 13 visualization code
import matplotlib.pyplot as plt
import pandas as pd

# The dataset is assumed to be loaded in the DataFrame 'df1'

# Selecting columns for box plots
columns_for_boxplot = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                       'pH', 'sulphates', 'alcohol', 'quality']

# Creating box plots
plt.figure(figsize=(12, 8))
boxplot = df1[columns_for_boxplot].boxplot()
plt.ylim(0, 50)  # Limit the y-axis to 0-50; any values above 50 are not shown
plt.title('Box Plot of Various Features')
plt.xticks(rotation=45)
plt.ylabel('Values')
plt.xlabel('Features')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Box plots were chosen because they provide a concise way to visualize the distribution and spread of several variables at once. They show the median, quartiles, and potential outliers, which makes it easy to compare the distributions of different features at a glance.

##### 2. What is/are the insight(s) found from the chart?

Insights from the box plot:

It gives an overview of the spread and central tendency of each feature.

Outliers can be identified, indicating potential data anomalies or extreme values.

Comparisons between features can be made in terms of their central tendencies and variability.

It provides an understanding of the range of values each feature spans.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive business impacts:

Understanding the distribution of various chemical properties in wine can help in quality control during production.

Identification of outliers might signal areas for further investigation, potentially leading to process improvements.

Repeating the comparison per quality rating (this chart pools all ratings) could inform marketing strategies or production adjustments that target specific quality segments more effectively.

Negative growth:

If the box plot reveals consistently high levels of certain undesirable properties (e.g., high volatile acidity or chlorides), it could indicate issues with product quality that might lead to negative customer feedback or decreased sales.

Outliers at extreme ends of the scale may suggest problems in production or storage that need to be addressed to maintain or improve product quality and customer satisfaction.


#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Create a seaborn heatmap of the correlation matrix of the DataFrame df1.
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(12, 8))
fig.add_subplot(1, 1, 1)
sns.heatmap(df1.corr(), annot=True, cmap='ocean')
plt.show()


##### 1. Why did you pick the specific chart?

I chose a heatmap because it's an effective way to visualize the correlation matrix of the DataFrame. Heatmaps use colors to represent the magnitude of correlation values, making it easy to identify patterns and relationships between variables.

##### 2. What is/are the insight(s) found from the chart?

Insights from the chart could include:

Strong positive correlations (coefficients close to +1), suggesting that as one variable increases, the other tends to increase as well.

Strong negative correlations (coefficients close to -1), suggesting that as one variable increases, the other tends to decrease.

Weak correlations (coefficients close to 0), suggesting little to no linear relationship between the variables.

Because the cells are annotated (annot=True), the printed coefficient is the most reliable guide; the colour bar next to the heatmap shows how shades of the 'ocean' colormap map to values.
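To read off which features relate most strongly to quality without scanning the whole heatmap, the `quality` column of the correlation matrix can be sorted by absolute value. This is a minimal sketch on made-up data; with the real data the same expression would be applied to `df1.corr()`.

```python
import pandas as pd

# Illustrative values only (NOT taken from the wine dataset).
toy = pd.DataFrame({
    "alcohol":          [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.70, 0.66, 0.58, 0.60, 0.40, 0.36],
    "quality":          [5, 5, 6, 6, 7, 7],
})

# Pearson correlation of every feature with 'quality',
# ordered from the strongest to the weakest absolute correlation.
corr_with_quality = (
    toy.corr()["quality"]
    .drop("quality")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_quality)
```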






#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
selected_data = df1[['density', 'pH',
                      'alcohol', 'quality']]

# Create the pair plot
sns.pairplot(selected_data, diag_kind='kde', markers='*')
plt.show()

##### 1. Why did you pick the specific chart?

The pair plot was chosen because it allows a quick visual examination of the relationships between pairs of variables. The diagonal shows the distribution of each variable (as a kernel density estimate, or KDE), and the off-diagonal panels show a scatter plot for each pair of variables. Together they give insight into potential correlations or patterns in the data.

##### 2. What is/are the insight(s) found from the chart?


Some insights that can be gleaned from the pair plot include:

Correlation between variables: You can visually assess whether there are any apparent correlations between pairs of variables. For example, you might observe a positive or negative correlation between certain pairs of features.

Distribution of variables: The diagonal plots (KDE plots) show the distribution of each variable individually. This can help in understanding the overall distribution of the data and identifying any potential outliers.

Multivariate relationships: The scatter plots off the diagonal show the relationships between pairs of variables. This can help in understanding how one variable might be influenced by another.

Quality assessment: Since 'quality' is included as one of the variables, you can also assess how the quality of the wine relates to other chemical properties. For example, you might find that higher quality wines tend to have certain combinations of chemical properties.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.
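One way to obtain a p-value for a statement that compares two groups of wines is a permutation test on the difference in means. The sketch below assumes a hypothetical statement (good wines have a different mean alcohol content than the rest), a hypothetical helper `permutation_test_mean_diff`, and made-up values; it is not the test performed in this project.

```python
import numpy as np

def permutation_test_mean_diff(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means (hypothetical helper).

    H0: both samples come from the same distribution.
    H1: the means of the two samples differ.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        # Small tolerance guards against floating-point rounding in the means.
        if abs(diff) >= abs(observed) - 1e-12:
            count += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# Illustrative alcohol values only (NOT taken from the wine dataset).
good_wines = [12.1, 11.8, 12.5, 11.9, 12.3, 12.0]
other_wines = [9.8, 10.1, 9.5, 10.4, 9.9, 10.2]

observed, p_value = permutation_test_mean_diff(good_wines, other_wines)
print(f"Observed difference in means: {observed:.2f}")
print(f"Estimated p-value: {p_value:.4f}")
```

A permutation test makes no normality assumption, which is convenient for skewed chemical measurements; the null hypothesis is rejected when the p-value falls below the chosen significance level.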

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Answer Here.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value

##### Which statistical test have you done to obtain P-Value?

Answer Here.

##### Why did you choose the specific statistical test?

Answer Here.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
df = df1.dropna()
print(df.info())
df.tail()

#### What all missing value imputation techniques have you used and why did you use those techniques?

Answer Here.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
import numpy as np

def handle_outliers_iqr(df, column_name):
  """
  This function identifies outliers in a column using IQR and removes them.

  Args:
      df: The pandas DataFrame containing the data.
      column_name: The name of the column to analyze for outliers.

  Returns:
      A new DataFrame with outliers removed from the specified column.
  """
  # Use the 'df' argument (not the global df1) so the function works on any DataFrame
  Q1 = df[column_name].quantile(0.25)
  Q3 = df[column_name].quantile(0.75)
  IQR = Q3 - Q1
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  return df[(df[column_name] > lower_bound) & (df[column_name] < upper_bound)]

# Example usage:
df1_filtered = handle_outliers_iqr(df1.copy(), "fixed acidity")  # Replace "fixed acidity" with your column name

##### What all outlier treatment techniques have you used and why did you use those techniques?

Outliers are treated with the interquartile range (IQR) rule. The function handle_outliers_iqr takes the data and a column name, computes the quartiles Q1 and Q3 and the IQR (Q3 - Q1) for that column, and sets the lower and upper bounds at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. It returns a new DataFrame containing only the rows whose value in that column lies strictly between these bounds. The IQR rule is a common choice because it does not assume normally distributed data. Note that the filtered result is stored in df1_filtered, while the steps below continue to use df1.
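The arithmetic behind the IQR rule can be followed on a tiny example. The values below are made up for illustration and are not taken from the wine dataset.

```python
import pandas as pd

# Illustrative values only: four ordinary values and one extreme value.
values = pd.Series([1, 2, 3, 4, 100], name="toy_feature")

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0

lower_bound = q1 - 1.5 * iqr   # 2.0 - 3.0 = -1.0
upper_bound = q3 + 1.5 * iqr   # 4.0 + 3.0 = 7.0

# Keep only values strictly inside the bounds; 100 is dropped as an outlier.
kept = values[(values > lower_bound) & (values < upper_bound)]
print(kept.tolist())  # [1, 2, 3, 4]
```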

### 3. Categorical Encoding

### 8. Data Splitting

In [None]:
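# Binarize the target: 1 = good wine (quality >= 7), 0 = otherwise.
# Run this cell only once: re-running it on the already binarized column maps every row to 0.
df1['quality'] = df1.quality.apply(lambda x:1 if x>=7 else 0)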

In [None]:
df1['quality'].value_counts()

In [None]:
X = df1.drop(columns=['quality'])  # Features
y = df1['quality']  # Target variable

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Optional: Split training data further into training and validation sets (e.g., 70/30 split)
# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)

# Displaying the shape of the training and testing sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

In [None]:
df1.shape

##### What data splitting ratio have you used and why?

The dataset is split 70/30 into training and testing sets: test_size=0.3 allocates 30% of the rows to testing and the remaining 70% to training. This leaves most of the data for fitting the models while keeping a held-out set for evaluation.

X_train and y_train hold the features and target variable of the training set, and X_test and y_test hold those of the testing set. Setting random_state=42 makes the split reproducible, so every run of the code produces the same split.
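Because the binarized target is imbalanced, a stratified split is one option for keeping the share of good wines equal in both sets. The sketch below is an assumption for illustration: `stratify=y` is not used in the split above, and the labels are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced labels: 80 zeros and 20 ones (NOT the wine dataset).
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Train size:", len(y_train), "share of class 1:", y_train.mean())
print("Test size:", len(y_test), "share of class 1:", y_test.mean())
```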

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# ML Model - 1 Implementation
#Logistic Regression Model
# Fit the Algorithm
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
logreg_acc = accuracy_score(y_test, logreg_pred)  # argument order is (y_true, y_pred)
print("test accuracy is: {:.2f}%".format(logreg_acc*100))

In [None]:
print(classification_report(y_test, logreg_pred))

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Logistic Regression is a classification algorithm that models the probability of a class from the input features.
# Despite its name, it is used for classification: natively for binary tasks, and for multiclass tasks
# through techniques such as one-vs-rest or softmax (multinomial) regression.

# Here the model is trained on features such as fixed acidity, volatile acidity and citric acid to predict
# the binarized quality label (1 = quality >= 7, 0 = otherwise). It estimates the probability that a wine
# belongs to class 1 and predicts the class from that probability.
# style.use('classic')
cm = confusion_matrix(y_test, logreg_pred, labels=logreg.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix= cm, display_labels=logreg.classes_)
disp.plot()
print("TN: ", cm[0][0])
print("FN: ", cm[1][0])
print("TP: ", cm[1][1])
print("FP: ", cm[0][1])

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.
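A grid search with cross-validation is one of the techniques named in the cell above. The sketch below shows how it could be set up for logistic regression; the synthetic stand-in data, the pipeline with `StandardScaler`, the grid of `C` values, and the F1 scoring are all assumptions made for illustration, not the configuration or results of this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with an imbalanced target (NOT the wine dataset).
X, y = make_classification(
    n_samples=300, n_features=11, weights=[0.85, 0.15], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Scaling inside the pipeline is re-fitted on each training fold,
# so no information from the validation fold leaks into the scaler.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
print("F1 on the test set:", search.score(X_test, y_test))
```

With the real data, `X_train`, `y_train`, `X_test` and `y_test` from the Data Splitting step would replace the synthetic arrays.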

### ML Model - 2

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
# Decision tree
# A Decision Tree is a tree-like structure in which each internal node tests a feature, each branch represents
# an outcome of that test, and each leaf node holds a class label (in classification) or a value (in regression).
# Decision Trees are versatile and easy to interpret, which makes them popular for both kinds of task.

# Here the Decision Tree is trained on the same features as the Logistic Regression model to predict the
# binarized quality label. The tree splits the data into subsets feature by feature, aiming to minimize
# impurity (such as Gini impurity or entropy) at each node.
dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)
dtree_pred = dtree.predict(X_test)
dtree_acc = accuracy_score(y_test, dtree_pred)  # argument order is (y_true, y_pred)
print("Test accuracy: {:.2f}%".format(dtree_acc*100))

In [None]:
print(classification_report(y_test, dtree_pred))

In [None]:
# style.use('classic')
cm = confusion_matrix(y_test, dtree_pred, labels=dtree.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix= cm, display_labels=dtree.classes_)
disp.plot()
print("TN: ", cm[0][0])
print("FN: ", cm[1][0])
print("TP: ", cm[1][1])
print("FP: ", cm[0][1])

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.

Answer Here.

### ML Model - 3

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm
svm = SVC()
svm.fit(X_train, y_train)

# Predict on the model
svm_pred = svm.predict(X_test)

# Print evaluation metrics
accuracy = accuracy_score(y_test, svm_pred)
report = classification_report(y_test, svm_pred)  # text report with precision, recall and F1 per class
print("Test accuracy: {:.2f}%".format(accuracy*100))
print(report)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

Support Vector Machine (SVM) is a powerful supervised learning algorithm used for classification, regression, and outlier detection. In classification tasks, SVM separates data points into different classes by finding the hyperplane that maximizes the margin between the classes. It works well in high-dimensional spaces and is effective even in cases where the number of dimensions exceeds the number of samples.
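Because the margin is defined through distances between data points, an SVM is sensitive to the scale of its features, and standardizing them first is a common precaution. The sketch below shows this with a pipeline; the scaling step and the synthetic stand-in data are assumptions for illustration and are not part of the model fitted above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the wine dataset).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Inflate one column to mimic features measured on very different scales.
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# StandardScaler is fitted on the training data only, then applied before the SVC.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_train, y_train)

test_accuracy = scaled_svm.score(X_test, y_test)
print("Test accuracy: {:.2f}%".format(test_accuracy * 100))
```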

In [None]:
# Visualizing evaluation Metric Score chart
cm = confusion_matrix(y_test, svm_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=svm.classes_)

# Plot the confusion matrix on an explicit Axes so that no empty extra figure is created
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(cmap='Blues', ax=ax)
ax.set_title("Confusion Matrix")
ax.set_xlabel("Predicted Label")
ax.set_ylabel("True Label")
plt.show()

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm

# Predict on the model

##### Which hyperparameter optimization technique have you used and why?

Answer Here.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

Answer Here.

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

One evaluation metric that would have a positive business impact is Precision.

Precision measures the proportion of correctly predicted positive cases among all predicted positive cases. High precision means the model makes few false positive predictions, which matters wherever false positives are costly (e.g., medical diagnosis, fraud detection). In this project a false positive is a wine predicted to be good (quality of 7 or higher) that is not, so maximizing precision helps avoid the costs and risks of labelling ordinary wines as good.
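The definition can be made concrete by computing precision, TP / (TP + FP), from confusion-matrix counts and comparing it with scikit-learn's `precision_score`. The labels below are made up for illustration and are not predictions from the models above.

```python
from sklearn.metrics import confusion_matrix, precision_score

# Illustrative labels only: 1 = good wine, 0 = otherwise.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: share of predicted positives that are truly positive.
precision_manual = tp / (tp + fp)   # 3 / (3 + 1) = 0.75

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision (manual):", precision_manual)
print("Precision (sklearn):", precision_score(y_true, y_pred))
```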

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

For the final prediction model, I would choose the Support Vector Machine (SVM) model.

I chose SVM because:

Performance: SVM typically performs well on data with several numeric features, such as the chemical attributes in this dataset.

Robustness: Maximizing the margin acts as a form of regularization, which helps limit overfitting on a dataset of modest size.

Flexibility: SVM offers flexibility through different kernel functions (linear, polynomial, radial basis function), which can capture complex relationships in the data.

Balanced Performance: SVM balances a wide margin against classification errors on the training data, which gives control over the trade-off between bias and variance.

Interpretability: SVM is less interpretable than a decision tree. With a non-linear kernel it does not expose feature importances directly, so a model explainability tool is needed to understand how features contribute to predictions.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model I used for the final prediction is the Support Vector Machine (SVM). SVM is a powerful supervised learning algorithm used for classification tasks. It works by finding the hyperplane that best separates the data points of different classes in the feature space.

To explain the model and feature importance, I will use the SHAP (SHapley Additive exPlanations) library, which is a model-agnostic method for explaining predictions of any machine learning model. SHAP values provide a way to explain individual predictions by calculating the contribution of each feature to the prediction.
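As a lighter-weight, model-agnostic alternative to SHAP, the sketch below uses scikit-learn's `permutation_importance`, which measures how much the test score drops when one feature's values are shuffled. This is a different technique from SHAP, and the synthetic stand-in data and generic feature names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the wine dataset).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Shuffle each feature 10 times and record the mean drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(
        f"feature_{idx}: "
        f"{result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}"
    )
```

With the real data, the fitted `svm` model together with `X_test` and `y_test` would be passed in, and the indices could be mapped to column names through `X_test.columns`.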


# **Conclusion**

In this project, we used a wine quality dataset to build predictive models of wine quality from chemical properties. We trained logistic regression, decision tree, and SVM classifiers to separate good wines (quality of 7 or higher) from the rest, and explored how individual features relate to wine quality.

Key Findings:

Feature Importance: The analysis suggested that features such as alcohol content, volatile acidity, and sulphates are related to the quality scores of wines. These insights can guide winemakers in adjusting these properties to enhance wine quality.

Class Imbalance: The dataset exhibited a class imbalance, with most wines clustered around quality scores 5 and 6, so wines rated 7 or higher form the minority class. This imbalance can influence model performance, often biasing predictions towards the majority class.

Model Performance: Despite the class imbalance, the logistic regression model performed reasonably well at classifying wines by quality. However, the performance metrics suggested room for improvement, especially in predicting the minority class.

Model Limitations: Logistic regression has limitations in handling non-linear relationships and complex patterns in the data. This highlights the potential benefit of non-linear models such as the SVM used here, and of exploring Random Forest or other ensemble methods.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***