# Example usage

Welcome to the `sales_analyzer` package! This package is designed to help small businesses analyze their retail sales data efficiently, without needing extensive data analytics expertise. If you've ever felt overwhelmed by tools like pandas or scikit-learn, or wished for more retail-specific functions, you're in the right place.

In this notebook, we'll walk through how to use the `sales_analyzer` package to extract valuable insights from your sales data. We'll demonstrate the key functions on a small, randomly generated dataset, so you can start applying them to your own business decisions right away.

## Imports

In [7]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

from salesanalyzer.sales_summary_statistics import sales_summary_statistics
from salesanalyzer.predict_sales import predict_sales

## Create sample data

We'll first create some sample data to work with.

In [8]:
def generate_random_dates(n):
    random.seed(1)
    # Get the current date
    today = datetime.now()
    # Calculate the date two years ago
    two_years_ago = today - timedelta(days=730)
    
    # Generate n random dates
    random_dates = [
        two_years_ago + timedelta(days=random.randint(0, (today - two_years_ago).days))
        for _ in range(n)
    ]

    return random_dates  

In [9]:
def anonymize_data(obs=50):
    """Build a DataFrame of `obs` randomly generated retail transactions."""
    # Seed both generators so repeated calls give the same random values
    random.seed(1)
    np.random.seed(1)

    df = pd.DataFrame({})

    fake_products = [
        "Laptop", "Monitor", "Headphone"
    ]

    fake_cities = [
        "Vancouver", "Toronto", "Calgary"
    ]

    # Random invoice numbers (not guaranteed to be unique)
    df['InvoiceNo'] = [f'INV-{random.randint(100000, 999999)}' for _ in range(obs)]

    # Random alphanumeric stock codes
    df['StockCode'] = [f'SC{random.randint(1000, 9999)}' for _ in range(obs)]

    # Random fake product names
    df['Description'] = [random.choice(fake_products) for _ in range(obs)]

    # Small positive quantities drawn from an exponential distribution
    df['Quantity'] = [int(np.random.exponential(2)) + 1 for _ in range(obs)]

    # Random dates within the last two years
    df['InvoiceDate'] = generate_random_dates(obs)

    # Random unit prices between 0.50 and 50.00
    df['UnitPrice'] = [round(random.uniform(0.5, 50), 2) for _ in range(obs)]

    # Random five-digit customer identifiers (not guaranteed to be unique)
    df['CustomerID'] = [random.randint(10000, 99999) for _ in range(obs)]

    # Random location for the Country column (city names are used here)
    df['Country'] = [random.choice(fake_cities) for _ in range(obs)]

    return df

In [10]:
sample_data = anonymize_data()
sample_data.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | INV-240891 | SC8174 | Laptop | 5 | 2023-06-09 15:51:36.351493 | 41.96 | 85732 | Toronto |
| 1 | INV-696853 | SC9123 | Laptop | 8 | 2024-08-27 15:51:36.351493 | 28.04 | 56304 | Vancouver |
| 2 | INV-988598 | SC4818 | Laptop | 1 | 2023-03-28 15:51:36.351493 | 32.29 | 70179 | Toronto |
| 3 | INV-941235 | SC6663 | Headphone | 1 | 2023-10-11 15:51:36.351493 | 9.7 | 45294 | Vancouver |
| 4 | INV-900875 | SC4782 | Headphone | 3 | 2023-05-23 15:51:36.351493 | 49.63 | 96404 | Toronto |


## Get Summary Statistics

One of the key features of `sales_analyzer` is its ability to quickly generate a sales summary. Use the `sales_summary_statistics` function to generate insights like total revenue, average order value, and top-selling products.

In [11]:
sales_summary_statistics(sample_data)

| | total_revenue | unique_customers | average_order_value | top_selling_product_quantity | top_selling_product_revenue | average_revenue_per_customer |
|---|---|---|---|---|---|---|
| 0 | 3422.78 | 50 | 68.4556 | Monitor | Laptop | 68.4556 |
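
To make the columns above more concrete, here is a minimal sketch of how statistics like these could be computed with plain pandas. It is not the package's implementation: the tiny `orders` table is invented for illustration, and the definition of average order value (total revenue divided by the number of invoices) is an assumption.

```python
import pandas as pd

# Hypothetical hand-made data using the same column names as the sample data.
orders = pd.DataFrame({
    "InvoiceNo": ["INV-1", "INV-2", "INV-3", "INV-4"],
    "Description": ["Laptop", "Monitor", "Monitor", "Headphone"],
    "Quantity": [1, 3, 2, 4],
    "UnitPrice": [60.0, 10.0, 10.0, 5.0],
    "CustomerID": [101, 102, 101, 103],
})

# Revenue of each row is quantity times unit price.
orders["Revenue"] = orders["Quantity"] * orders["UnitPrice"]

total_revenue = orders["Revenue"].sum()
unique_customers = orders["CustomerID"].nunique()
# Assumption: one invoice is one order.
average_order_value = total_revenue / orders["InvoiceNo"].nunique()
# Product with the largest summed quantity, and with the largest summed revenue.
top_by_quantity = orders.groupby("Description")["Quantity"].sum().idxmax()
top_by_revenue = orders.groupby("Description")["Revenue"].sum().idxmax()
average_revenue_per_customer = total_revenue / unique_customers

print(total_revenue, unique_customers, average_order_value)
print(top_by_quantity, top_by_revenue, round(average_revenue_per_customer, 2))
```

As in the output above, the product that sells the most units does not have to be the one that brings in the most revenue.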


## Predict Future Sales

Now that you have a summary of your **past** sales, suppose you want to peek into the **future** and predict how your products will sell in a month, two months, or even a year. You can do this with the `predict_sales()` function, which uses a Random Forest machine learning model to predict a target you specify (e.g. quantity sold). The output is a dictionary containing the predicted values and the model's performance score (Mean Squared Error).

> **Important** <br>
> `predict_sales()` checks for duplicate entries and only considers unique data points. <br>
> By default, the function uses 70% of the data for training and 30% for testing. To change the split, pass `test_size`; for example, `test_size=0.2` raises the training share to 80%, which can help if your dataset is small.
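
The ideas in the note (dropping duplicates, a 70/30 split, and scoring by Mean Squared Error) can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the internals of `predict_sales()`: the synthetic `data` table, the one-hot encoding via `pd.get_dummies`, and the model settings are all choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 40

# Hypothetical data: 40 distinct rows with one numeric and one categorical feature.
data = pd.DataFrame({
    "UnitPrice": np.linspace(1.0, 40.0, n),
    "Description": rng.choice(["Laptop", "Monitor", "Headphone"], n),
    "Quantity": rng.integers(1, 10, n),
})
# Append exact copies of the first five rows to simulate duplicate entries.
data = pd.concat([data, data.head(5)], ignore_index=True)

# Keep only unique rows before modelling.
unique_rows = data.drop_duplicates()

# One-hot encode the categorical column (an assumption for this sketch).
X = pd.get_dummies(unique_rows[["UnitPrice", "Description"]], columns=["Description"])
y = unique_rows["Quantity"]

# test_size=0.3 holds out roughly 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))

print(f"rows: {len(data)} -> {len(unique_rows)} after dropping duplicates")
print(f"train/test sizes: {len(X_train)}/{len(X_test)}, test MSE: {mse:.2f}")
```

Lowering `test_size` leaves more rows for training, at the cost of a noisier error estimate from the smaller test set.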

In [13]:
new_data = anonymize_data(3)
predict_sales(sample_data, new_data, ['UnitPrice'], ['Description', 'Country'], 'Quantity', 'InvoiceDate')

{'MSE of the model': 2.24,
 'Predicted values': [np.float64(1.68), np.float64(4.87), np.float64(2.91)]}
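
This notebook does not show how `predict_sales()` turns the date column into model inputs. One common approach, sketched here purely as an assumption, is to expand a datetime column into numeric parts that a tree-based model can split on. The `date_features` table and its column names are hypothetical.

```python
import pandas as pd

# Three dates taken from the sample data shown earlier (time of day dropped).
dates = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2023-06-09", "2024-08-27", "2023-03-28"]),
})

# Expand the datetime column into numeric features.
date_features = pd.DataFrame({
    "year": dates["InvoiceDate"].dt.year,
    "month": dates["InvoiceDate"].dt.month,
    "day_of_week": dates["InvoiceDate"].dt.dayofweek,  # Monday is 0
})

print(date_features)
```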

If you don't want to include a date feature in your analysis, you can omit it from the arguments.


In [14]:
predict_sales(sample_data, new_data, ['UnitPrice'], ['Description', 'Country'], 'Quantity')

{'MSE of the model': 1.29,
 'Predicted values': [np.float64(1.93), np.float64(3.1), np.float64(2.37)]}