# Data Visualisations with statistical tests

# Objectives

This notebook aims to visually explore and statistically test the hypotheses that have been outlined in my readme file.

1. Analyse the impact of discounts on profitability: determine whether higher discount rates lead to reduced profits.
2. Evaluate the profitability of different product categories by identifying which categories and sub-categories generate the most profit.
3. Determine whether certain shipping modes or regions result in longer delivery times.
4. Examine whether larger orders lead to higher profits.

# Preparing the data for visualisation

In [3]:
import os

# Define the correct project root
project_root = "C:\\Users\\conor\\Desktop\\DA course\\SuperstoreSales"

# Move up only if currently in the 'jupyter_notebooks' folder
if "jupyter_notebooks" in os.getcwd():
    os.chdir(project_root)
    print(f"Changed working directory to: {os.getcwd()}")
else:
    print(f"Already in the correct directory: {os.getcwd()}")

Changed working directory to: C:\Users\conor\Desktop\DA course\SuperstoreSales


In [4]:
# Import libraries that will be used throughout the notebook
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Load cleaned dataset and ensure it is displaying correctly
df = pd.read_csv('Cleaned_data\\superstore_cleaned.csv')
df.head()

Unnamed: 0,Order ID,Order Date,Ship Date,Ship Mode,Segment,City,State,Region,Product ID,Category,Sub-Category,Sales,Quantity,Discount,Profit,Delivery Time,Profit Margin,Total Discount Effect,Order Month
0,CA-2016-152156,2016-11-08,2016-11-11,Second Class,Consumer,Henderson,Kentucky,South,FUR-BO-10001798,Furniture,Bookcases,261.96,2,0.0,41.9136,3,0.16,0.0,2016-11
1,CA-2016-152156,2016-11-08,2016-11-11,Second Class,Consumer,Henderson,Kentucky,South,FUR-CH-10000454,Furniture,Chairs,731.94,3,0.0,219.582,3,0.3,0.0,2016-11
2,CA-2016-138688,2016-06-12,2016-06-16,Second Class,Corporate,Los Angeles,California,West,OFF-LA-10000240,Office Supplies,Labels,14.62,2,0.0,6.8714,4,0.47,0.0,2016-06
3,US-2015-108966,2015-10-11,2015-10-18,Standard Class,Consumer,Fort Lauderdale,Florida,South,FUR-TA-10000577,Furniture,Tables,957.5775,5,0.45,-319.264953,7,-0.333409,2.25,2015-10
4,US-2015-108966,2015-10-11,2015-10-18,Standard Class,Consumer,Fort Lauderdale,Florida,South,OFF-ST-10000760,Office Supplies,Storage,22.368,2,0.2,2.5164,7,0.1125,0.4,2015-10


# Section 1

Hypothesis: Higher discount levels negatively impact profitability.
Null hypothesis: There is no relationship between discount levels and profits.

In [None]:
# nbformat had to be installed manually to fix an error message; this only worked when run here, not through the requirements file or the terminal
!pip install nbformat --upgrade

Collecting nbformat
  Using cached nbformat-5.10.4-py3-none-any.whl.metadata (3.6 kB)
Using cached nbformat-5.10.4-py3-none-any.whl (78 kB)
Installing collected packages: nbformat
  Attempting uninstall: nbformat
    Found existing installation: nbformat 4.2.0
    Uninstalling nbformat-4.2.0:
      Successfully uninstalled nbformat-4.2.0
Successfully installed nbformat-5.10.4


In [None]:
# Scatter plot showing Discount vs Profit

fig = px.scatter(df, x='Discount', y='Profit', 
                 color='Category', # Color by category to show which products are most affected by discount
                 title='Discount vs Profit', 
                 labels={'Discount': 'Discount (Fractional)', 'Profit': 'Profit ($)'})
fig.show()


This scatter plot shows a clear negative relationship between discount rate and profit: every item sold at a discount greater than 45% was sold at a loss.
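
The claim about discounts above 45% can be checked directly rather than read off the plot. The following is a minimal sketch only: the helper `loss_share_above` and the small `sample` frame are hypothetical and not part of the original notebook. In the notebook, `df` would be passed in place of `sample`.

```python
import pandas as pd


def loss_share_above(data: pd.DataFrame, threshold: float) -> float:
    """Return the fraction of rows with Discount > threshold that have Profit < 0."""
    discounted = data[data["Discount"] > threshold]
    if discounted.empty:
        return float("nan")
    return float((discounted["Profit"] < 0).mean())


# Hypothetical rows for illustration only (not taken from the Superstore dataset)
sample = pd.DataFrame({
    "Discount": [0.0, 0.2, 0.45, 0.5, 0.7, 0.8],
    "Profit": [41.9, 2.5, -319.3, -12.0, -50.0, -5.0],
})

share = loss_share_above(sample, 0.45)
print(f"Share of rows above a 45% discount sold at a loss: {share:.0%}")
```

A result of 100% on the real data would support the statement above; anything lower would mean some heavily discounted items were still sold at a profit.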

In [None]:
# Create the boxplot in Plotly
fig = px.box(df, x="Discount", y="Profit",
             title="Profit Distribution Across Discount Levels",
             labels={"Discount": "Discount (Fractional)", "Profit": "Profit ($)"},  # keys must match the column names exactly
             color="Discount")  # Different colors for each different discount rate

fig.show()

The boxplots also show a clear, albeit imperfect, downward trend in profit as the discount increases.
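
One way to put numbers on the trend seen in the boxplots is to compare the median profit at each discount level. This is a rough sketch using a hypothetical `sample` frame; in the notebook the same `groupby` could be run on `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; in the notebook, use df instead
sample = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.5, 0.5],
    "Profit": [40.0, 20.0, 10.0, 2.0, -15.0, -30.0],
})

# Median profit per discount level, ordered by discount
median_profit = sample.groupby("Discount")["Profit"].median().sort_index()
print(median_profit)

# True only if the median never rises as the discount increases
print(median_profit.is_monotonic_decreasing)
```

The median is used here because it is the centre line of each box and is less affected by the extreme profits and losses than the mean.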

# Statistical test to prove/disprove hypothesis

I looked through our course content to decide which statistical test to use for this hypothesis, but none of them seemed to fit, so I asked ChatGPT which would be best. It suggested the Spearman test, which was briefly mentioned in one video in our learning content. I couldn't find a demonstration of it, so the following code was taken directly from ChatGPT.

In [20]:
from scipy.stats import spearmanr
# Compute Spearman correlation
corr, p_value = spearmanr(df["Discount"], df["Profit"])

# Print result
print(f"🔹 Spearman Correlation (rho): {corr:.2f}")
print(f"🔹 p-value: {p_value:.4f}")

🔹 Spearman Correlation (rho): -0.54
🔹 p-value: 0.0000


The results of this test show a moderate to strong negative correlation between discount rate and profit, with a rho of -0.54 (a score of -1 would be a perfect negative correlation, 0 would mean no correlation and 1 would mean a perfect positive correlation).
The p-value is below 0.05 (it prints as 0.0000 at four decimal places), so the result is statistically significant: a correlation this strong would be very unlikely if there were no relationship between discount and profit.
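
To make these two numbers more concrete, the sketch below uses hypothetical data (not the Superstore dataset) to show two things: Spearman's rho is the Pearson correlation computed on the ranks of the values, and a very small p-value is easier to read in scientific notation than with `:.4f`. The variable names `discount` and `profit` are stand-ins for the notebook's columns.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical data for illustration only: a downward trend plus repeating noise
discount = [v / 100 for v in range(40)]
profit = [-v + 4 * ((7 * v) % 5) for v in range(40)]

rho, p_value = spearmanr(discount, profit)

# Spearman's rho is the Pearson correlation of the ranks (ties get average ranks)
rank_r, _ = pearsonr(rankdata(discount), rankdata(profit))

print(f"Spearman rho:     {rho:.3f}")
print(f"Pearson on ranks: {rank_r:.3f}")

# Four decimal places hide very small p-values; scientific notation does not
print(f"p-value (.4f): {p_value:.4f}")
print(f"p-value (.2e): {p_value:.2e}")
```

Because the test works on ranks, it only assumes that profit tends to move in one direction as the discount rises, not that the relationship is a straight line.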

Based on this statistical test and the visuals above, we can conclude that higher discount rates do tend to lead to reduced profits in general, so we reject the null hypothesis in this case.