<a href="https://www.kaggle.com/code/benzilla987/data445-kaggle-project-2?scriptVersionId=142449961" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<img src="https://i.imgur.com/h9KNfIq.jpg">


# Analyzing & Predicting Retail Sales Data

#### 🛒 What relationships exist between customer demographics and sales figures?

#### 🛒 Are there interesting distributions amongst different product categories?

#### 🛒 Do trends occur on a daily, weekly, or monthly basis among total sales in 2023?

#### 🛒 How well can we predict sales figures using customer age, quantity sold, and price per unit?


<img src="https://i.imgur.com/zF04gPM.png">


# 🛒 Initial Exploratory Analysis



#### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import plotly.express as px
import plotly.graph_objects as go

In [2]:
# Reading in the data 
retail=pd.read_csv("/kaggle/input/retail-sales-dataset/retail_sales_dataset.csv")
retail['Date'] = pd.to_datetime(retail['Date'])
retail.head()

| | Transaction ID | Date | Customer ID | Gender | Age | Product Category | Quantity | Price per Unit | Total Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-11-24 | CUST001 | Male | 34 | Beauty | 3 | 50 | 150 |
| 1 | 2 | 2023-02-27 | CUST002 | Female | 26 | Clothing | 2 | 500 | 1000 |
| 2 | 3 | 2023-01-13 | CUST003 | Male | 50 | Electronics | 1 | 30 | 30 |
| 3 | 4 | 2023-05-21 | CUST004 | Male | 37 | Clothing | 1 | 500 | 500 |
| 4 | 5 | 2023-05-06 | CUST005 | Male | 30 | Beauty | 2 | 50 | 100 |


## Looking at the shape of the data 

In [3]:
shape_str = f"Number of Rows: {retail.shape[0]}, Number of Columns: {retail.shape[1]}"
print(shape_str)


Number of Rows: 1000, Number of Columns: 9


## Checking datatypes

In [4]:
retail['Date'] = pd.to_datetime(retail['Date'])
retail.dtypes

Transaction ID               int64
Date                datetime64[ns]
Customer ID                 object
Gender                      object
Age                          int64
Product Category            object
Quantity                     int64
Price per Unit               int64
Total Amount                 int64
dtype: object

In [5]:
# Create an interactive scatterplot
fig = px.scatter(retail, x='Age', y='Total Amount', title='Scatterplot of Age vs. Total Amount', labels={'Age': 'Age', 'Total Amount': 'Total Amount'})

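# Enlarge the markers and show the x and y values on hover.
# (A selector of mode='markers+text' matches no trace here, because px.scatter
# creates traces with mode='markers', so the update would silently do nothing.)
fig.update_traces(marker=dict(size=10),
                  hovertemplate='Age: %{x}<br>Total Amount: %{y}')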

# Show the plot
fig.show()

## There is no discernible relationship between age and total amount

<img src="https://i.imgur.com/zF04gPM.png">

# 🛒 Examining Age vs. Quantity

In [6]:
# Create an interactive scatterplot
fig = px.scatter(retail, x='Age', y='Quantity', title='Scatterplot of Age vs. Quantity', labels={'Age': 'Age', 'Quantity': 'Quantity'})

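# Enlarge the markers and show the x and y values on hover.
# (A selector of mode='markers+text' matches no trace here, because px.scatter
# creates traces with mode='markers', so the update would silently do nothing.)
fig.update_traces(marker=dict(size=10),
                  hovertemplate='Age: %{x}<br>Quantity: %{y}')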

# Show the plot
fig.show()

## There is no discernible relationship between age and quantity
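The visual impression from the two scatterplots can be backed by a number. The sketch below computes the Pearson correlation of `Age` with the other numeric columns. It is only an illustration: the `sample` frame holds the five rows printed by `retail.head()`, and the helper name `correlation_with_age` is hypothetical. On the full dataset, pass `retail` instead of `sample`.

```python
import pandas as pd

# Stand-in data: the five rows shown by retail.head(), not the full dataset
sample = pd.DataFrame({
    "Age": [34, 26, 50, 37, 30],
    "Quantity": [3, 2, 1, 1, 2],
    "Total Amount": [150, 1000, 30, 500, 100],
})


def correlation_with_age(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of Age with every other numeric column."""
    corr = df.corr(numeric_only=True)["Age"]
    return corr.drop("Age")


age_corr = correlation_with_age(sample)
print(age_corr)
```

Values close to 0 on the full dataset would support the "no discernible relationship" reading. Five rows are far too few to draw that conclusion from this stand-in.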

<img src="https://i.imgur.com/zF04gPM.png">

# 🛒 Examining Distribution of Gender

In [7]:
# Create an interactive histogram
fig = px.histogram(retail, x='Gender', title='Histogram of Gender', labels={'Gender': 'Gender', 'count': 'Frequency'})

# Show the frequency as a text label above each bar
fig.update_traces(texttemplate='%{y}', textposition='outside')

# Show the plot
fig.show()

## The distribution of gender is relatively evenly split.
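"Relatively evenly split" is easier to judge with counts and percentage shares next to the histogram. A minimal sketch, assuming a hypothetical helper `category_shares` and using the five rows printed by `retail.head()` as stand-in data (so the shares below are not those of the full dataset):

```python
import pandas as pd

# Stand-in data: the five rows shown by retail.head(), not the full dataset
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Male"],
    "Product Category": ["Beauty", "Clothing", "Electronics", "Clothing", "Beauty"],
})


def category_shares(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and percentage share of each value in `column`."""
    counts = df[column].value_counts()
    shares = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_%": shares})


gender_shares = category_shares(sample, "Gender")
print(gender_shares)
```

Calling `category_shares(retail, 'Gender')` on the full data would give the actual split. The same helper works for any categorical column.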

<img src="https://i.imgur.com/zF04gPM.png">

# 🛒 Examining distribution of Product Category

In [8]:
# Create an interactive histogram
fig = px.histogram(retail, x='Product Category', title='Histogram of Product Category', labels={'Product Category': 'Product Category', 'count': 'Frequency'})

# Show the frequency as a text label above each bar
fig.update_traces(texttemplate='%{y}', textposition='outside')

# Show the plot
fig.show()

## The distribution of product category is relatively evenly split.

<img src="https://i.imgur.com/zF04gPM.png">

# 🛒 Examining time series data

#### Monthly Total Amount

In [9]:
retail=pd.read_csv("/kaggle/input/retail-sales-dataset/retail_sales_dataset.csv")
retail['Date'] = pd.to_datetime(retail['Date'])

# Set the 'Date' column as the index
retail.set_index('Date', inplace=True)

# Resample data by month and sum the 'Total Amount' per month
monthly_sales = retail[['Total Amount']].resample('M').sum()

# Create an interactive time series plot for Monthly Total Amount
fig = px.line(monthly_sales, x=monthly_sales.index, y='Total Amount', title='Monthly Total Amount')
fig.update_xaxes(title_text='Date (Monthly)')
fig.update_yaxes(title_text='Total Amount')

# Add hover-over values with the date
fig.update_traces(mode='lines+markers', hovertemplate='%{y:.2f} USD<br>%{x|%Y-%m}')

# Show the plot
fig.show()

#### Weekly Total Amount

In [10]:
retail=pd.read_csv("/kaggle/input/retail-sales-dataset/retail_sales_dataset.csv")
retail['Date'] = pd.to_datetime(retail['Date'])

# Set the 'Date' column as the index
retail.set_index('Date', inplace=True)

# Resample data by week and sum the 'Total Amount' per week
weekly_sales = retail[['Total Amount']].resample('W').sum()

# Create an interactive time series plot for Weekly Total Amount
fig = px.line(weekly_sales, x=weekly_sales.index, y='Total Amount', title='Weekly Total Amount')
fig.update_xaxes(title_text='Date (Weekly)')
fig.update_yaxes(title_text='Total Amount')

# Add hover-over values with the date
fig.update_traces(mode='lines+markers', hovertemplate='%{y:.2f} USD<br>%{x|%Y-%m-%d}')

# Show the plot
fig.show()

#### Daily Total Amount

In [11]:
retail=pd.read_csv("/kaggle/input/retail-sales-dataset/retail_sales_dataset.csv")
retail['Date'] = pd.to_datetime(retail['Date'])

# Set the 'Date' column as the index
retail.set_index('Date', inplace=True)

# Resample data by day and sum the 'Total Amount' per day
daily_sales = retail[['Total Amount']].resample('D').sum()

# Create an interactive time series plot for Daily Total Amount
fig = px.line(daily_sales, x=daily_sales.index, y='Total Amount', title='Daily Total Amount')
fig.update_xaxes(title_text='Date (Daily)')
fig.update_yaxes(title_text='Total Amount')

# Add hover-over values with the date
fig.update_traces(mode='lines+markers', hovertemplate='%{y:.2f} USD<br>%{x|%Y-%m-%d}')

# Show the plot
fig.show()
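The three cells above re-read the CSV each time because `set_index(..., inplace=True)` changes `retail`. A possible alternative is to resample with the `on` argument, which leaves the DataFrame's index alone. The sketch below uses a small synthetic `sales` frame (made-up values) and a hypothetical helper `total_by_period`. It uses the month-start alias `'MS'` for the monthly totals, which labels each month by its first day.

```python
import pandas as pd

# Synthetic stand-in for the retail data (values are made up for illustration)
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-10", "2023-02-01"]),
    "Total Amount": [100, 50, 30, 500],
})


def total_by_period(df: pd.DataFrame, rule: str) -> pd.Series:
    """Sum 'Total Amount' per period without changing the DataFrame's index."""
    return df.resample(rule, on="Date")["Total Amount"].sum()


daily = total_by_period(sales, "D")     # one value per calendar day
weekly = total_by_period(sales, "W")    # weeks ending on Sunday
monthly = total_by_period(sales, "MS")  # labelled by the first day of each month

print(monthly)
```

Each returned Series can be passed to `px.line` in the same way as the resampled frames above, for example with `x=monthly.index` and `y=monthly.values`.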

<img src="https://i.imgur.com/zF04gPM.png">


# 🛒 Creating a baseline model

In [12]:
retail=pd.read_csv("/kaggle/input/retail-sales-dataset/retail_sales_dataset.csv")

# Creating train/test split
X = retail[['Age', 'Quantity', 'Price per Unit']]
y = retail['Total Amount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating model
model = LinearRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating model performance
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f"R-squared (R2) Score: {r2}")

# Create a DataFrame for visualization
df = pd.DataFrame({'True Values': y_test, 'Predictions': y_pred})

# Create a scatter plot using Plotly Express
fig = px.scatter(df, x='True Values', y='Predictions', title='True vs. Predicted Values')
fig.update_traces(marker=dict(size=10, opacity=0.5, line=dict(width=2, color='black')))

# Add hover data (x and y coordinates)
fig.update_traces(customdata=df.index)

# Customize the hover tooltip
fig.update_traces(hovertemplate='True Value: %{x}<br>Predicted Value: %{y}<br>Sample Index: %{customdata}')

# Add a reference line where the prediction equals the true value (y = x)
fig.add_trace(go.Scatter(x=df['True Values'], y=df['True Values'], mode='lines', name='Perfect prediction (y = x)'))

# Customize the layout
fig.update_layout(
    xaxis_title="True Values",
    yaxis_title="Predictions"
)

# Show the interactive plot in the Jupyter Notebook
fig.show()



Mean Squared Error: 41896.21322134358
Root Mean Squared Error: 204.68564488342503
R-squared (R2) Score: 0.8568772264250432
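One possible reason the fit is good but not perfect: in the five rows printed by `retail.head()`, `Total Amount` equals `Quantity` times `Price per Unit` (for example, 3 × 50 = 150). A linear model adds weighted features together and cannot represent a product of two features. The sketch below illustrates the effect on synthetic data, assuming the product rule holds. The `demo` frame, its random values, and the `Quantity x Price` column are all hypothetical, and the scores are in-sample.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data that follows the assumed rule Total Amount = Quantity * Price per Unit
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Age": rng.integers(18, 65, size=n),
    "Quantity": rng.integers(1, 5, size=n),
    "Price per Unit": rng.choice([30, 50, 500], size=n),
})
demo["Total Amount"] = demo["Quantity"] * demo["Price per Unit"]
y = demo["Total Amount"]

# Model 1: the additive features used by the baseline
additive = ["Age", "Quantity", "Price per Unit"]
r2_additive = r2_score(y, LinearRegression().fit(demo[additive], y).predict(demo[additive]))

# Model 2: the same features plus an explicit product term
demo["Quantity x Price"] = demo["Quantity"] * demo["Price per Unit"]
with_product = additive + ["Quantity x Price"]
r2_product = r2_score(y, LinearRegression().fit(demo[with_product], y).predict(demo[with_product]))

print(f"R2 without product term: {r2_additive:.3f}")
print(f"R2 with product term:    {r2_product:.3f}")
```

Whether the same gain appears on the real data depends on the product rule holding for all 1000 rows, which the notebook does not check.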


# 🛒 Making Predictions

In [13]:
# Predicting total amount for a customer whose age is 34, quantity is 3, and price per unit is 50

# New observation
new_observation = [[34, 3, 50]]

# Make a prediction
predicted_total_amount = model.predict(new_observation)

# Print the predicted total amount
print(f"Predicted Total Amount: {predicted_total_amount[0]}")

Predicted Total Amount: 231.96935865263845



X does not have valid feature names, but LinearRegression was fitted with feature names
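The warning above appears because the model was fitted on a DataFrame with named columns but asked to predict from a plain nested list. Passing the new observation as a DataFrame with the same column names avoids it. A minimal sketch, using the five rows from `retail.head()` as stand-in training data and a hypothetical `demo_model`, so the predicted value differs from the one printed above:

```python
import warnings

import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Age", "Quantity", "Price per Unit"]

# Stand-in training data: the five rows shown by retail.head()
train = pd.DataFrame({
    "Age": [34, 26, 50, 37, 30],
    "Quantity": [3, 2, 1, 1, 2],
    "Price per Unit": [50, 500, 30, 500, 50],
    "Total Amount": [150, 1000, 30, 500, 100],
})
demo_model = LinearRegression().fit(train[features], train["Total Amount"])

# New observation with the same column names the model was fitted on
new_observation = pd.DataFrame([[34, 3, 50]], columns=features)

# Record any warnings raised during prediction so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    prediction = demo_model.predict(new_observation)

feature_name_warnings = [w for w in caught if "feature names" in str(w.message)]
print(f"Feature-name warnings raised: {len(feature_name_warnings)}")
```

In the notebook itself, the equivalent change would be to build `new_observation` with `columns=X.columns` before calling `model.predict`.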



## Predicted total amount is approximately $231.97