# ANOVA Analysis on Diamond Dataset

This Jupyter notebook aims to explore the differences among various groups within the 'diamonds' dataset using ANOVA (Analysis of Variance). This statistical method helps in understanding how different factors like diamond color and cut influence diamond prices. 

This notebook is designed for individuals looking to understand basic to advanced statistical analysis in Python.

## Background
The 'diamonds' dataset contains price and other attributes of approximately 54,000 diamonds. It is an excellent dataset for demonstrating ANOVA because it contains both categorical variables (such as color and cut) and continuous variables (like price).

### Source
The data is loaded from the seaborn library's example datasets via `sns.load_dataset("diamonds")`.


### Objective and Applicability 
This notebook uses ANOVA (Analysis of Variance) to explore the impact of two categorical variables, the color and cut of diamonds, on their price. Understanding these influences can support decision-making for businesses involved in the trading, valuation, or marketing of diamonds.

#### Practical Applications
Strategic Pricing: ANOVA helps identify which factors (color, cut) significantly impact diamond prices. Businesses can use this information to strategically price their products based on these attributes to maximize profit.

Inventory Management: By understanding the price variability associated with different cuts and colors, retailers can better manage their inventory by stocking products that are more likely to sell at higher prices.

Marketing and Promotions: Insight into which attributes (color and cut) significantly affect price can guide targeted marketing campaigns. For instance, if certain colors or cuts are found to significantly enhance a diamond's value, marketing efforts can be tailored to highlight these premium features to potential customers.

Customer Segmentation: The interaction effect between color and cut on price can help businesses understand consumer preferences in more depth. This knowledge can be used to segment customers based on their likely preferences for specific combinations of color and cut, enabling more personalized marketing and sales strategies.

## Setup and Data Preparation
In this section, we load necessary Python libraries and prepare the data by selecting relevant subsets and transforming variables to meet the requirements of ANOVA tests.

In [30]:
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the dataset bundled with seaborn
diamonds = sns.load_dataset("diamonds")

# Data preparation
# Keep color grades D through I (drops grade J); .copy() avoids a
# SettingWithCopyWarning when the new column is added below.
# Note: 'color' is a pandas categorical, so J stays declared as an empty level.
diamonds = diamonds[diamonds['color'].isin(['D', 'E', 'F', 'G', 'H', 'I'])].copy()

# Model the natural log of price, since raw prices are strongly right-skewed
diamonds['log_price'] = np.log(diamonds['price'])


In [25]:
# Manage warnings in your Python script
import warnings

# Suppress all UserWarnings
warnings.filterwarnings('ignore', category=UserWarning)

# Alternatively, suppress only specific messages by part of their message text
warnings.filterwarnings('ignore', message='covariance of constraints does')
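
The global filters above stay active for the rest of the session, which can also hide warnings that point to a real problem in a later model. A sketch of a scoped alternative using the standard library's `warnings.catch_warnings()`; `noisy_step` is a hypothetical stand-in for a statsmodels call, not part of the original notebook:

```python
import warnings

def noisy_step():
    # Hypothetical stand-in for a library call that emits a UserWarning.
    warnings.warn("covariance of constraints does not have full rank", UserWarning)
    return 42

# Suppress the warning only inside this block; filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    value = noisy_step()

# Outside that block the warning is emitted again and can be recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(value, len(caught))
```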


### One-way ANOVA
One-way ANOVA is used here to compare the means of the logarithmic prices among different color grades of diamonds.

In [31]:
# Run the ANOVA
model = ols('log_price ~ C(color)', data=diamonds).fit()
anova_results = sm.stats.anova_lm(model, typ=1)
print(anova_results)


               df        sum_sq     mean_sq           F         PR(>F)
C(color)      6.0   1052.065696  175.344283  175.521753  5.820088e-222
Residual  51126.0  51074.306444    0.998989         NaN            NaN
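
Each F value in an ANOVA table is the factor's mean square divided by the residual mean square; for the row above, 175.344283 / 0.998989 ≈ 175.52. A minimal sketch of that calculation on made-up data (the three groups and their values are invented for illustration and are not taken from the diamonds dataset; `one_way_f` is a hypothetical helper):

```python
import numpy as np

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Between-group sum of squares: group size times the squared distance
    # of each group mean from the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group (residual) sum of squares.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical log prices for three color grades.
toy_groups = [[7.1, 7.4, 7.9], [7.6, 8.0, 8.3], [8.2, 8.5, 9.0]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f"F = {f_stat:.3f} on ({df_b}, {df_w}) degrees of freedom")
```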


### Two-Way ANOVA
The two-way ANOVA examines how cut and color jointly influence diamond prices, including the interaction between these factors.

In [32]:
# Prepare data for two-way ANOVA

# Keep only diamonds whose cut is 'Ideal', 'Premium', or 'Very Good',
# the higher-quality cut grades in this dataset.
diamonds_cut = diamonds[diamonds['cut'].isin(['Ideal', 'Premium', 'Very Good'])]

# Run two-way ANOVA
model2 = ols('log_price ~ C(color) + C(cut) + C(color):C(cut)', data=diamonds_cut).fit()
anova2_results = sm.stats.anova_lm(model2, typ=2)
print(anova2_results)


                       sum_sq       df         F        PR(>F)
C(color)                  NaN      6.0       NaN           NaN
C(cut)                    NaN      4.0       NaN           NaN
C(color):C(cut)    116.543781     24.0  4.866272  4.739459e-07
Residual         44928.873326  45024.0       NaN           NaN


  F /= J


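The main effects (color and cut) come back as NaN for the sum of squares, F-statistic, and p-value. The degrees of freedom point to the most likely cause: the table reports 6 for color and 4 for cut, although only six color grades and three cuts remain after filtering, which should give 5 and 2. Both columns are pandas categoricals, so the filtered-out levels (J, Good, Fair) are still declared and the formula builds all-zero columns for them. The design matrix is then rank deficient and the main-effect F-tests cannot be computed. The `F /= J` line in the output appears to be the tail of the accompanying RuntimeWarning. Until the model is refit without the empty levels, the interaction row should also be read with caution.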
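
One likely cause, assuming `color` and `cut` are pandas categoricals (as the 6 and 4 degrees of freedom in the table suggest), is that filtering rows does not drop the excluded levels, so the formula still builds columns for empty groups. A minimal sketch on a made-up frame (values invented for illustration) showing the behaviour and one way to drop the empty levels with `cat.remove_unused_categories()`:

```python
import pandas as pd

# Toy frame standing in for the diamonds data (values are invented).
toy = pd.DataFrame({
    "cut": pd.Categorical(
        ["Ideal", "Premium", "Good", "Fair", "Ideal"],
        categories=["Ideal", "Premium", "Very Good", "Good", "Fair"],
    ),
    "price": [500, 650, 400, 350, 520],
})

# Filtering rows keeps every declared level, including the empty ones.
subset = toy[toy["cut"].isin(["Ideal", "Premium"])].copy()
levels_before = list(subset["cut"].cat.categories)

# Drop levels that no longer occur in the data.
subset["cut"] = subset["cut"].cat.remove_unused_categories()
levels_after = list(subset["cut"].cat.categories)

print(levels_before)
print(levels_after)
```

Applying the same call to `color` and `cut` in `diamonds_cut` before fitting would be expected to give 5 and 2 degrees of freedom for the main effects. The table above was not re-run with this change.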

### Key Questions Addressed by ANOVA

Does the color of a diamond significantly influence its price when the cut quality is accounted for? This question explores the intrinsic value added by color variations in the market perception of diamonds.

How does the quality of the cut impact the price independently of color? This addresses the craftsmanship aspect of diamond processing and its effect on pricing.

Is there a significant interaction effect between color and cut on diamond prices? This examines whether the combined effect of color and cut differs from the sum of their individual effects, which can indicate complex consumer preferences in the market.
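
To make the interaction question concrete: with no interaction, the price gap between two cuts is the same for every color. A small sketch with invented cell means (not estimates from the diamonds data) that measures the departure from additivity in a 2x2 layout:

```python
import pandas as pd

# Hypothetical mean log prices per (color, cut) cell.
cell_means = pd.DataFrame(
    {"Ideal": [7.8, 8.0], "Premium": [8.0, 8.5]},
    index=pd.Index(["D", "I"], name="color"),
)

# Effect of cut (Premium minus Ideal) within each color.
cut_gap = cell_means["Premium"] - cell_means["Ideal"]

# For a 2x2 layout the interaction contrast is the difference between the
# two gaps; zero would mean the factors combine additively.
interaction_contrast = cut_gap["I"] - cut_gap["D"]
print(cut_gap)
print(f"interaction contrast: {interaction_contrast:.2f}")
```

The two-way model tests whether contrasts of this kind, taken over all color and cut levels, are jointly distinguishable from zero.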

In [34]:
# Identify which color groups differ from each other

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Post hoc analysis (Tukey HSD) after the one-way ANOVA
# Each row of the output compares two color groups; meandiff is the mean
# log price of group2 minus that of group1.
#
# Significant differences: rows where reject is True. For instance, the mean
# log price of color D is significantly lower than those of colors F, G, H, and I.
# Non-significant differences: rows where reject is False (D vs. E and F vs. G)
# show no significant difference in mean log price at alpha = 0.05.


post_hoc_results = pairwise_tukeyhsd(endog=diamonds['log_price'], groups=diamonds['color'], alpha=0.05)
print(post_hoc_results.summary())


Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     D      E  -0.0375 0.1652 -0.0825 0.0075  False
     D      F   0.1455    0.0  0.1003 0.1908   True
     D      G   0.1727    0.0  0.1289 0.2165   True
     D      H   0.3015    0.0  0.2549 0.3482   True
     D      I   0.4061    0.0  0.3542  0.458   True
     E      F    0.183    0.0  0.1421  0.224   True
     E      G   0.2102    0.0  0.1709 0.2495   True
     E      H    0.339    0.0  0.2966 0.3815   True
     E      I   0.4436    0.0  0.3953 0.4918   True
     F      G   0.0271 0.3698 -0.0125 0.0668  False
     F      H    0.156    0.0  0.1133 0.1988   True
     F      I   0.2605    0.0  0.2121  0.309   True
     G      H   0.1289    0.0  0.0877   0.17   True
     G      I   0.2334    0.0  0.1863 0.2804   True
     H      I   0.1045    0.0  0.0548 0.1542   True
---------------------------------------------------
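
When reading the `meandiff` column, the sign convention matters: statsmodels reports the mean of `group2` minus the mean of `group1`, so a positive value means `group2` is higher. A short check on invented samples (the group labels and distributions are assumptions made for illustration):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Invented samples: group "A" is centred lowest, group "C" highest.
values = np.concatenate([
    rng.normal(7.0, 0.1, 50),
    rng.normal(8.0, 0.1, 50),
    rng.normal(9.0, 0.1, 50),
])
labels = np.array(["A"] * 50 + ["B"] * 50 + ["C"] * 50)

result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)

mean_a = values[labels == "A"].mean()
mean_b = values[labels == "B"].mean()
# The first pair is (A, B); meandiffs holds mean(group2) - mean(group1).
print(result.meandiffs[0], mean_b - mean_a)
```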


### Conclusion and Real-Case Application
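The one-way ANOVA shows that color has a significant effect on log price (F ≈ 175.5, p < 0.001), and the Tukey HSD test finds significant differences between all pairs of color grades except D vs. E and F vs. G. The two-way model reports a significant color-by-cut interaction (F ≈ 4.87, p < 0.001), which suggests that the effect of color on price depends on the cut. Its main effects for color and cut were not estimated (NaN), so the separate contribution of cut is not confirmed by this analysis.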

