<a href="https://colab.research.google.com/github/aravind2060/Retail_Sales_Analytics_and_Demand_Forecasting/blob/master/Retail_Sales_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro


This project offers hands-on experience building a large-scale data processing pipeline for a retail company, covering demand forecasting, customer sentiment analysis, and real-time analytics. It combines historical data analysis, real-time monitoring, and predictive modeling in a single workflow.

Objectives

* Process, analyze, and visualize historical retail data
* Forecast demand and analyze customer sentiment
* Implement real-time transaction monitoring using Spark Streaming



# Task 1 : Data Loading, Cleansing, and Exploration

* *Data Ingestion* : Load the sales and review datasets into PySpark DataFrames, then explore them and check for missing or inconsistent data.
* *Data Cleansing* : Handle missing values, drop duplicates, and parse dates for consistent data.
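
The cleansing steps listed above can be sketched on a small inline sample. This is only an illustration: it uses pandas (as the loading cells in this notebook do) rather than PySpark, the sample rows and the `clean_sales` helper are hypothetical, and the choices to fill a missing `Description` with `"UNKNOWN"` and to drop rows without a `CustomerID` are assumptions, not decisions taken from the project.

```python
import pandas as pd

# Hypothetical sample that mimics the sales columns shown in this notebook.
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536365", "536366"],
    "StockCode": ["85123A", "85123A", "71053", "22633"],
    "Description": [
        "WHITE HANGING HEART T-LIGHT HOLDER",
        "WHITE HANGING HEART T-LIGHT HOLDER",  # exact duplicate of the first row
        None,                                   # missing description
        "HAND WARMER UNION JACK",
    ],
    "Quantity": [6, 6, 6, 6],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 8:28"],
    "UnitPrice": [2.55, 2.55, 3.39, 1.85],
    "CustomerID": [17850.0, 17850.0, 17850.0, None],  # last row has no customer
    "Country": ["United Kingdom"] * 4,
})


def clean_sales(df):
    """Drop duplicates, handle missing values, and parse invoice dates."""
    cleaned = df.drop_duplicates().copy()
    # Assumption: keep rows with a missing description, using a placeholder.
    cleaned["Description"] = cleaned["Description"].fillna("UNKNOWN")
    # Assumption: rows without a customer cannot be used downstream, so drop them.
    cleaned = cleaned.dropna(subset=["CustomerID"])
    cleaned["CustomerID"] = cleaned["CustomerID"].astype("int64")
    # The dates look like month/day/year hour:minute, e.g. "12/1/2010 8:26".
    cleaned["InvoiceDate"] = pd.to_datetime(
        cleaned["InvoiceDate"], format="%m/%d/%Y %H:%M", errors="coerce"
    )
    return cleaned.reset_index(drop=True)


cleaned_sample = clean_sales(sample)
print(cleaned_sample.dtypes)
```

With `errors="coerce"`, any date that does not match the format becomes `NaT` instead of raising, so unparseable rows can be inspected afterwards.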

In [47]:

# !pip install ucimlrepo

# from ucimlrepo import fetch_ucirepo

# # fetch dataset
# online_retail = fetch_ucirepo(id=352)

# # data (as pandas dataframes)
# X = online_retail.data.features
# y = online_retail.data.targets

# # metadata
# print(online_retail.metadata)

# print(X)
# print(y)

# # variable information
# print(online_retail.variables)

{'uci_id': 352, 'name': 'Online Retail', 'repository_url': 'https://archive.ics.uci.edu/dataset/352/online+retail', 'data_url': 'https://archive.ics.uci.edu/static/public/352/data.csv', 'abstract': 'This is a transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.', 'area': 'Business', 'tasks': ['Classification', 'Clustering'], 'characteristics': ['Multivariate', 'Sequential', 'Time-Series'], 'num_instances': 541909, 'num_features': 6, 'feature_types': ['Integer', 'Real'], 'demographics': [], 'target_col': None, 'index_col': ['InvoiceNo', 'StockCode'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2015, 'last_updated': 'Mon Oct 21 2024', 'dataset_doi': '10.24432/C5BW33', 'creators': ['Daqing Chen'], 'intro_paper': {'ID': 361, 'type': 'NATIVE', 'title': 'Data mining for the online retail industry: A case study of RFM model-based customer segmenta

In [49]:
# Download the historical sales data: transaction records for an online retail store.
# More info: https://archive.ics.uci.edu/dataset/352/online+retail

import pandas as pd

historical_sales_data_path = "https://archive.ics.uci.edu/static/public/352/data.csv"

historical_sales_df = pd.read_csv(historical_sales_data_path)

# print(historical_sales_df.info())
print(historical_sales_df.head())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

      InvoiceDate  UnitPrice  CustomerID         Country  
0  12/1/2010 8:26       2.55     17850.0  United Kingdom  
1  12/1/2010 8:26       3.39     17850.0  United Kingdom  
2  12/1/2010 8:26       2.75     17850.0  United Kingdom  
3  12/1/2010 8:26       3.39     17850.0  United Kingdom  
4  12/1/2010 8:26       3.39     17850.0  United Kingdom  


In [50]:
# Download the customer reviews data: review sentences used for sentiment analysis.
# More info: https://registry.opendata.aws/helpful-sentences-from-reviews/

import requests
import pandas as pd
import json

# Define the URLs for the train and test files
train_file_path = "https://helpful-sentences-from-reviews.s3.amazonaws.com/train.json"
test_file_path  = "https://helpful-sentences-from-reviews.s3.amazonaws.com/test.json"

# Function to load JSON from a URL where each line is a separate JSON object (JSON Lines)
def load_json_from_url(url):
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
    data = []

    # Read each line from the response text as a separate JSON object
    for line in response.text.splitlines():
        try:
            data.append(json.loads(line))  # Convert each line into a dictionary
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON (including blank lines)
            continue

    return data

# Load the train data
train_data = load_json_from_url(train_file_path)
train_df = pd.DataFrame(train_data)

# Load the test data
test_data = load_json_from_url(test_file_path)
test_df = pd.DataFrame(test_data)

# Print the first few rows of the train DataFrame
print(train_df.head())

# Optionally, print the test DataFrame as well
# print(test_df.head())

         asin                                           sentence  helpful  \
0  B000AO3L84                      this flash is a superb value.     1.70   
1  B001SEQPGK                The pictures were not sharp at all.     1.30   
2  0553386697                  A very good resource for parents.     1.90   
3  B006SUWZH2  We have it in a child's room, and will be swit...     0.25   
4  B000W7F5SS  Again the makers are too lazy to bring in the ...     0.90   

                                      main_image_url  \
0  http://ecx.images-amazon.com/images/I/41XAEKR9...   
1  http://ecx.images-amazon.com/images/I/71KLvmtc...   
2  http://ecx.images-amazon.com/images/I/81HdbmkR...   
3  http://ecx.images-amazon.com/images/I/61A2WQOL...   
4  http://ecx.images-amazon.com/images/I/91E7TPDb...   

                                       product_title  
0  Canon 430EX Speedlite Flash for Canon EOS SLR ...  
1  Sony Cyber-shot DSC-W290 12 MP Digital Camera ...  
2  The Whole-Brain Child: 12 Revolu
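
Because the loader above silently skips lines it cannot parse, it can be useful to know how many were dropped. The following is a minimal offline sketch of the same line-by-line parsing that also records the skipped line numbers. The helper name `parse_json_lines` and the deliberately truncated third line are hypothetical; the valid records reuse values shown in the output above.

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON and return (records, skipped_line_numbers)."""
    records, skipped = [], []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored and not counted as errors
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped.append(line_number)  # remember where parsing failed
    return records, skipped


# Hypothetical input: two valid records, one blank line, one truncated record.
sample_text = "\n".join([
    '{"asin": "B000AO3L84", "sentence": "this flash is a superb value.", "helpful": 1.7}',
    "",
    '{"asin": "B001SEQPGK", "sentence": "The pictures were not sharp at all.", "helpful": 1.3',
    '{"asin": "0553386697", "sentence": "A very good resource for parents.", "helpful": 1.9}',
])

records, skipped = parse_json_lines(sample_text)
print(len(records), skipped)
```

Returning the skipped line numbers alongside the records keeps the lenient behaviour while making data loss visible.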

Checking for null values

In [51]:
# Columns containing null values in historical_sales_df
missing_columns = historical_sales_df.columns[historical_sales_df.isnull().any()].tolist()
print(missing_columns)

# Columns containing null values in the customer reviews training set (train_df)
missing_columns = train_df.columns[train_df.isnull().any()].tolist()
print(missing_columns)

# Columns containing null values in the customer reviews test set (test_df)
missing_columns = test_df.columns[test_df.isnull().any()].tolist()
print(missing_columns)

['Description', 'CustomerID']
[]
[]
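
The check above only names the affected columns. A possible follow-up, sketched here on a hypothetical `demo` frame with an assumed helper `missing_summary`, is to count the missing values per column and express them as a share of all rows, which helps when deciding between filling and dropping.

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and row shares for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical data using column names from the sales dataset.
demo = pd.DataFrame({
    "Description": ["WHITE METAL LANTERN", None, "CREAM CUPID HEARTS COAT HANGER", "RED WOOLLY HOTTIE WHITE HEART."],
    "CustomerID": [17850.0, None, None, 17850.0],
    "Quantity": [6, 6, 8, 6],
})

summary = missing_summary(demo)
print(summary)
```

The same function could be called on `historical_sales_df` to see how much of `Description` and `CustomerID` is actually missing.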
