Section 1: Analysis
A friend of yours owns a frozen drink shop. On hot days, she seems extra happy. She says that she sees a line extending around the block and knows that means more sales. However, even before your friend started the business, she always seemed to love summer and thrived in the heat, so you think her jubilant attitude might be intrinsic. She makes a bet with you that sales really are higher on hotter days. She gets data on the sales numbers (in dollars) and the daily temperatures (in degrees Fahrenheit).
    1.	What is the outcome?
    2.	What is the main effect/predictor she wants to understand the impact of?
    3.	What is the hypothesis?

Use the data she collected to conduct an analysis, test the hypothesis, and report results. The dataset is drinks.xlsx. Your analysis should have the following elements:
    4.	An explanation of why the analysis is being conducted and what the hypothesis is
    5.	Descriptive information about the data, including summary statistics (such as number of observations, measures of central tendency, & measures of dispersion) and plots of the data distributions
    6.	Descriptive information about the relationships between the two variables, including correlation and scatterplots
    7.	A regression analysis to test the hypothesis. If you have trouble getting the regression analysis to work, look closely at the data. Your friend wasn’t always able to get sales data for each day. Choose a method to handle rows with missing data.
    8.	A description of the results of the analysis. Included in this description should be an interpretation of the coefficients, description of the goodness of fit, and a discussion of whether the results are statistically significant.


In [None]:
"""
1. What is the outcome?
    The outcome (dependent variable) is daily sales in dollars, which we are trying to predict from temperature.
    More importantly, I want to know if the outcome is that I will win the bet!
"""

In [None]:
"""
2. What is the main effect/predictor she wants to understand the impact of?
    Temperature in degrees Fahrenheit is the independent variable, our (presumed) predictor.
"""

In [None]:
"""
3. What is the hypothesis?
    Null hypothesis (H₀) is that temperature has no effect on sales (the slope coefficient, β = 0).
    My losing hypothesis (H₁) is that higher temperature leads to higher sales (the slope coefficient, β > 0).
"""

In the following cell I have generated several analyses for understanding the data and testing our hypothesis.  Some are analytical and others are attempting to understand potential outcomes/conclusions from the analysis.

In [None]:
# import the libraries needed for this project
# (I had to add most of these to my Snowflake environment)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# load the Excel file (that was already uploaded to the session) into a Pandas dataframe
df = pd.read_excel("drinks.xlsx")

# drop rows with missing sales values
df_clean = df.dropna(subset=["Sales"])

# show the summary statistics on the clean data
print("Summary Statistics:\n")
print(df_clean.describe())

# show the distribution plots for temperature and sales
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# first is the temperature distribution
sns.histplot(df_clean["Temperature_F"], bins=20, kde=True, ax=axes[0], color='orange')
axes[0].set_title("Distribution of Temperature (°F)")
axes[0].set_xlabel("Temperature (°F)")

# next is the sales distribution
sns.histplot(df_clean["Sales"], bins=20, kde=True, ax=axes[1], color='teal')
axes[1].set_title("Distribution of Sales ($)")
axes[1].set_xlabel("Sales ($)")

# put them side-by-side
plt.tight_layout()
plt.show()

# what is the correlation?
correlation = df_clean["Temperature_F"].corr(df_clean["Sales"])
print(f"\nCorrelation between Temperature and Sales: {correlation:.3f}")

# show the scatterplot for our visual
# I'm looking for outliers and don't see any
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_clean, x="Temperature_F", y="Sales")
plt.title("Sales vs. Temperature")
plt.xlabel("Temperature (°F)")
plt.ylabel("Sales ($)")
plt.grid(True)
plt.show()

# calculate the linear regression
X = sm.add_constant(df_clean["Temperature_F"])  # adds intercept term
y = df_clean["Sales"]

# show the regression results
model = sm.OLS(y, X).fit()
print("\nRegression Results:\n")
print(model.summary())

Section 1: Analysis

4.	An explanation of why the analysis is being conducted and what the hypothesis is.

The analysis is being conducted to evaluate the belief of our friend, who owns a frozen drink shop and believes that she has higher sales on days when the temperature is higher. In fact, we've placed a wager on it: she is wagering that higher temperatures lead to higher drink sales. The purpose of the analysis is to test this observation objectively using the data she has provided, and so determine who wins the wager. In statistical terms, our Null Hypothesis (H₀) is that temperature has no effect on sales (slope = 0). Our Alternative Hypothesis (H₁) is that higher temperatures lead to higher sales (slope > 0).

5.	Descriptive information about the data, including summary statistics (such as number of observations, measures of central tendency, & measures of dispersion) and plots of the data distributions.

The dataset contains 180 observations of daily temperatures and corresponding frozen drink sales. We also note that there are two missing sales entries, so we'll need to appropriately "clean" our dataset before proceeding with the analysis. Summary statistics are as follows:

	•	Temperature: Mean = 79.3°F, Std Dev = 10.38°F, Min = 57°F, Max = 105°F
	•	Sales: Mean = $2,936, Std Dev = $399, Min = $1,959, Max = $3,971

Temperature values are approximately normally distributed, while sales are slightly skewed to the right. Distribution plots for both variables visually support these findings.

6.	Descriptive information about the relationships between the two variables, including correlation and scatterplots.

The Pearson correlation coefficient between temperature and sales is 0.823, indicating a strong, positive linear relationship. A scatterplot of sales versus temperature visually confirms the upward trend, showing that as temperature increases, sales generally rise as well.
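
As a rough illustration of what the correlation coefficient measures, the sketch below computes Pearson's r by hand on a handful of made-up temperature/sales pairs. The numbers are toy values for illustration only, not rows from drinks.xlsx, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by the spread of each."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy data (assumed for illustration, NOT taken from drinks.xlsx)
toy_temps = [60, 70, 80, 90, 100]
toy_sales = [2300, 2650, 2950, 3300, 3600]

r_toy = pearson_r(toy_temps, toy_sales)
print(f"Toy correlation: {r_toy:.3f}")
```

On the real data, `df_clean["Temperature_F"].corr(df_clean["Sales"])` in the code cell performs the same calculation.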


7.	A regression analysis to test the hypothesis. If you have trouble getting the regression analysis to work, look closely at the data. Your friend wasn’t always able to get sales data for each day. Choose a method to handle rows with missing data.

To test the hypothesis, an Ordinary Least Squares (OLS) linear regression was performed using temperature as the independent variable and sales as the dependent variable. Two records with missing sales data were removed before running the regression. The regression model used the following specification: 

    Sales = β₀ + β₁ * Temperature + ε
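
For a single predictor, the OLS estimates have a closed form: the slope is the x-y co-variation divided by the variation in x, and the intercept makes the fitted line pass through the point of means. A minimal sketch on toy values (assumed for illustration, not taken from drinks.xlsx; `fit_simple_ols` is a hypothetical helper):

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return intercept, slope


# Toy data (assumed for illustration, NOT taken from drinks.xlsx)
toy_temps = [60, 70, 80, 90, 100]
toy_sales = [2300, 2650, 2950, 3300, 3600]

b0, b1 = fit_simple_ols(toy_temps, toy_sales)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

This should agree with `sm.OLS` for one predictor plus a constant, although statsmodels additionally reports standard errors, p-values, and R².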


8.	A description of the results of the analysis. Included in this description should be an interpretation of the coefficients, description of the goodness of fit, and a discussion of whether the results are statistically significant.

The regression output is:
	•	Intercept (β₀): 423.26 (p = 0.002)
	•	Temperature Coefficient (β₁): 31.65 (p < 0.001)

This means that for every 1°F increase in temperature, sales are expected to increase by approximately $31.65.
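
To make the coefficient interpretation concrete, the sketch below plugs a few temperatures into the fitted line using the two coefficients reported above. The function name `predicted_sales` is an illustrative choice, and predictions are only reasonable within the observed temperature range (57°F to 105°F).

```python
# Coefficients as reported in the regression output
INTERCEPT = 423.26  # beta_0, in dollars
SLOPE = 31.65       # beta_1, in dollars per degree Fahrenheit


def predicted_sales(temp_f):
    """Predicted daily sales (dollars) at a given temperature (degrees F)."""
    return INTERCEPT + SLOPE * temp_f


for temp in (70, 80, 90):
    print(f"{temp}°F -> ${predicted_sales(temp):,.2f}")

# A 10°F difference corresponds to 10 * SLOPE dollars in predicted sales
gap = predicted_sales(90) - predicted_sales(80)
print(f"Difference between 90°F and 80°F: ${gap:,.2f}")
```

The intercept is the predicted value at 0°F, far outside the observed range, so it is best read as an anchor for the line and not as a realistic sales figure.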

The goodness of fit of the model, as measured by the R-squared value (R²), is 0.677. This means that approximately 67.7% of the variation in daily sales can be explained by the variation in temperature. R² ranges from 0 to 1, with values closer to 1 indicating a better fit, so temperature alone accounts for roughly two-thirds of the variation in sales while about one-third remains unexplained.
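
In a simple regression with one predictor and an intercept, R² equals the square of the Pearson correlation, which gives a quick consistency check between the two figures reported in this write-up:

```python
# Values as reported in this write-up
r = 0.823                   # Pearson correlation between temperature and sales
r_squared_reported = 0.677  # R-squared from the regression output

r_squared_from_r = r ** 2
consistent = round(r_squared_from_r, 3) == r_squared_reported

print(f"r squared = {r_squared_from_r:.4f}")
print(f"Matches reported R-squared to 3 decimals: {consistent}")
```

This identity holds only for the one-predictor case; with additional predictors, R² would generally exceed the squared correlation with any single one of them.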

To me, this suggests that temperature is a strong and statistically significant predictor of frozen drink sales.

[SEE MY CONCLUSION, REGARDING THE WAGER, BELOW IN THE CONCLUSION MARKDOWN CELL.]


The model is statistically significant overall, and the coefficients are meaningful and precise (as indicated by low standard errors and small p-values). Therefore, we can confidently reject the null hypothesis and conclude that higher temperatures are strongly associated with increased sales.
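
One detail worth noting: the p-values in a standard OLS summary are two-sided, whereas H₁ here is one-sided (slope > 0). The usual conversion is sketched below. The inputs in the example call are hypothetical, since this write-up reports only p < 0.001 and not the t-statistic, and `one_sided_p_value` is an illustrative helper, not a statsmodels function.

```python
def one_sided_p_value(two_sided_p, t_stat):
    """Convert a two-sided p-value into a one-sided one for H1: slope > 0."""
    if t_stat > 0:
        # Estimate lies in the direction H1 predicts: keep only that tail
        return two_sided_p / 2
    # Estimate lies in the opposite direction: evidence against H1
    return 1 - two_sided_p / 2


# Hypothetical inputs for illustration (not values from the regression output)
example_p = one_sided_p_value(two_sided_p=0.0008, t_stat=3.4)
print(f"One-sided p-value: {example_p:.4f}")
```

Because the estimated slope is positive, the one-sided p-value is half the two-sided one, so the decision to reject H₀ is unchanged.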

HOWEVER, I'm not sure I lost the wager! Do we really know that higher temperature is the reason for higher sales? There could be confounding factors. For example, recall that our shop owner is more jubilant on warmer days. Could her cheerful attitude create an environment more conducive to employee/customer interaction, which in turn leads to higher sales?

Is there a way to test for this?  YES!  We don't have the data now, but it would also be interesting to conduct the same analysis with an indication of whether or not our shop owner was working at that location on a given day.  Furthermore, it might be interesting to somehow know how much customer interaction she had that day.  Alternatively, is it her enthusiastic engagement with her team on warmer days that perks up sales?