#### Problem Statement:

### **Predicting High-Profit Sales Transactions Using Machine Learning**

---

#### Objective 
***
This project develops a machine learning model that predicts whether a sales transaction will yield a high profit, defined as a `Total Profit` greater than $10,000. The model is intended to help the business prioritize its sales efforts and maximize profitability.
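
The target implied by this definition can be derived directly from the `Total Profit` column. The following is a minimal sketch, not necessarily the labeling code used later in the project: the column name `High Profit` and the constant `PROFIT_THRESHOLD` are hypothetical names introduced for illustration, and the five profit values are those of the first five rows of the dataset.

```python
import pandas as pd

# Hypothetical constant name; the value comes from the objective stated above
PROFIT_THRESHOLD = 10_000

# Total Profit of the first five rows of the dataset
sample = pd.DataFrame(
    {"Total Profit": [3839.13, 338631.84, 20592.0, 41273.28, 62217.18]}
)

# 1 = high profit (strictly greater than the threshold), 0 = otherwise
sample["High Profit"] = (sample["Total Profit"] > PROFIT_THRESHOLD).astype(int)
print(sample)
```

The comparison is strict, so a transaction with a profit of exactly $10,000 would be labeled 0 under this reading of "greater than".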

#### Data Description
***
The dataset contains 500,000 sales records with the following attributes:

- Region: The geographical region where the sale was made.
- Country: The country where the sale was made.
- Item Type: The type of item sold (e.g., Fruits, Clothes, Meat, etc.).
- Sales Channel: The sales channel used (Online or Offline).
- Order Priority: The priority of the order (High, Medium, Low, Critical), stored as single-letter codes (e.g., H, M, L).
- Order Date: The date when the order was placed.
- Order ID: A unique identifier for the order.
- Ship Date: The date when the order was shipped.
- Units Sold: The number of units sold.
- Unit Price: The price per unit.
- Unit Cost: The cost per unit.
- Total Revenue: The total revenue from the sale.
- Total Cost: The total cost of the sale.
- Total Profit: The total profit from the sale.

---
#### **Load Libraries and Dataset**

In [2]:
# Data handling
import pandas as pd
import numpy as np

# Preprocessing, modeling, and evaluation
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [12]:
# Loading the dataset
data = pd.read_csv("500000 Sales Records.csv")
data.head()

| | Region | Country | Item Type | Sales Channel | Order Priority | Order Date | Order ID | Ship Date | Units Sold | Unit Price | Unit Cost | Total Revenue | Total Cost | Total Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Sub-Saharan Africa | South Africa | Fruits | Offline | M | 7/27/2012 | 443368995 | 7/28/2012 | 1593 | 9.33 | 6.92 | 14862.69 | 11023.56 | 3839.13 |
| 1 | Middle East and North Africa | Morocco | Clothes | Online | M | 9/14/2013 | 667593514 | 10/19/2013 | 4611 | 109.28 | 35.84 | 503890.08 | 165258.24 | 338631.84 |
| 2 | Australia and Oceania | Papua New Guinea | Meat | Offline | M | 5/15/2015 | 940995585 | 6/4/2015 | 360 | 421.89 | 364.69 | 151880.4 | 131288.4 | 20592.0 |
| 3 | Sub-Saharan Africa | Djibouti | Clothes | Offline | H | 5/17/2017 | 880811536 | 7/2/2017 | 562 | 109.28 | 35.84 | 61415.36 | 20142.08 | 41273.28 |
| 4 | Europe | Slovakia | Beverages | Offline | L | 10/26/2016 | 174590194 | 12/4/2016 | 3973 | 47.45 | 31.79 | 188518.85 | 126301.67 | 62217.18 |
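
In these first five rows, the three total columns appear to be derived from the unit-level columns: revenue is units times unit price, cost is units times unit cost, and profit is revenue minus cost. The sketch below checks these identities on the five rows only; whether they hold across all 500,000 records is an assumption that would need the same check on the full `data` frame.

```python
import numpy as np
import pandas as pd

# The five rows shown by data.head(), numeric columns only
sample = pd.DataFrame({
    "Units Sold": [1593, 4611, 360, 562, 3973],
    "Unit Price": [9.33, 109.28, 421.89, 109.28, 47.45],
    "Unit Cost": [6.92, 35.84, 364.69, 35.84, 31.79],
    "Total Revenue": [14862.69, 503890.08, 151880.4, 61415.36, 188518.85],
    "Total Cost": [11023.56, 165258.24, 131288.4, 20142.08, 126301.67],
    "Total Profit": [3839.13, 338631.84, 20592.0, 41273.28, 62217.18],
})

# np.isclose tolerates small floating-point rounding differences
revenue_ok = np.isclose(sample["Units Sold"] * sample["Unit Price"], sample["Total Revenue"])
cost_ok = np.isclose(sample["Units Sold"] * sample["Unit Cost"], sample["Total Cost"])
profit_ok = np.isclose(sample["Total Revenue"] - sample["Total Cost"], sample["Total Profit"])

print(revenue_ok.all(), cost_ok.all(), profit_ok.all())
```

If the identities do hold for the full file, `Total Profit` (and therefore the high-profit label) is a deterministic function of `Units Sold`, `Unit Price`, and `Unit Cost`, which is worth keeping in mind when choosing model features.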


---
#### **Data Exploration and Analysis**

In [13]:
data.shape

(500000, 14)

In [14]:
# check the datatypes of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Region          500000 non-null  object 
 1   Country         500000 non-null  object 
 2   Item Type       500000 non-null  object 
 3   Sales Channel   500000 non-null  object 
 4   Order Priority  500000 non-null  object 
 5   Order Date      500000 non-null  object 
 6   Order ID        500000 non-null  int64  
 7   Ship Date       500000 non-null  object 
 8   Units Sold      500000 non-null  int64  
 9   Unit Price      500000 non-null  float64
 10  Unit Cost       500000 non-null  float64
 11  Total Revenue   500000 non-null  float64
 12  Total Cost      500000 non-null  float64
 13  Total Profit    500000 non-null  float64
dtypes: float64(5), int64(2), object(7)
memory usage: 53.4+ MB


In [15]:
# Count the unique values in each column
data.nunique()

Region                 7
Country              185
Item Type             12
Sales Channel          2
Order Priority         4
Order Date          2766
Order ID          500000
Ship Date           2816
Units Sold         10000
Unit Price            12
Unit Cost             12
Total Revenue     118520
Total Cost        118360
Total Profit      118404
dtype: int64
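
Two things stand out in these counts. `Order ID` has as many unique values as there are rows, so it behaves as a row identifier. `Unit Price` and `Unit Cost` each have 12 unique values, the same number as `Item Type`, which suggests (but does not prove) that each item type has one fixed price and cost. A minimal sketch of how that could be checked, run here on the five rows from `data.head()` only:

```python
import pandas as pd

# Item type, unit price, and unit cost of the first five rows of the dataset
sample = pd.DataFrame({
    "Item Type": ["Fruits", "Clothes", "Meat", "Clothes", "Beverages"],
    "Unit Price": [9.33, 109.28, 421.89, 109.28, 47.45],
    "Unit Cost": [6.92, 35.84, 364.69, 35.84, 31.79],
})

# Number of distinct prices and costs observed for each item type
prices_per_item = sample.groupby("Item Type")[["Unit Price", "Unit Cost"]].nunique()

# True when every item type maps to exactly one price and one cost
one_price_per_item = bool((prices_per_item == 1).all().all())
print(prices_per_item)
print(one_price_per_item)
```

On the full dataset the same `groupby` would be applied to `data`; the uniqueness of the identifier can be confirmed with `data["Order ID"].is_unique`.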

In [16]:
# Find the percentage of missing values in each column
(data.isnull().sum()/data.shape[0] * 100)

Region            0.0
Country           0.0
Item Type         0.0
Sales Channel     0.0
Order Priority    0.0
Order Date        0.0
Order ID          0.0
Ship Date         0.0
Units Sold        0.0
Unit Price        0.0
Unit Cost         0.0
Total Revenue     0.0
Total Cost        0.0
Total Profit      0.0
dtype: float64
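
The mean of a boolean mask is the fraction of `True` values, so the same percentages can also be computed with `isnull().mean()`. A small sketch on a toy frame (the toy values are made up for illustration and are not from the sales dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value out of four in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["x", "y", "z", "w"]})

# Form used above: null count divided by the number of rows
pct_sum = toy.isnull().sum() / toy.shape[0] * 100

# Equivalent form: mean of the boolean mask
pct_mean = toy.isnull().mean() * 100

print(pct_mean)
```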

In [17]:
# Summary statistics for the numerical columns
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Order ID | 500000.0 | 550131900.0 | 259960500.0 | 100002900.0 | 325181400.0 | 549184300.0 | 775629100.0 | 999999500.0 |
| Units Sold | 500000.0 | 4999.136 | 2884.024 | 1.0 | 2502.0 | 4999.0 | 7497.0 | 10000.0 |
| Unit Price | 500000.0 | 266.0367 | 216.9376 | 9.33 | 81.73 | 154.06 | 421.89 | 668.27 |
| Unit Cost | 500000.0 | 187.5286 | 175.624 | 6.92 | 56.67 | 97.44 | 263.33 | 524.96 |
| Total Revenue | 500000.0 | 1330096.0 | 1468090.0 | 9.33 | 278305.9 | 786242.6 | 1824236.0 | 6682700.0 |
| Total Cost | 500000.0 | 937616.3 | 1148684.0 | 6.92 | 162059.7 | 467712.0 | 1198736.0 | 5249600.0 |
| Total Profit | 500000.0 | 392480.0 | 378751.7 | 2.41 | 95385.06 | 281749.2 | 565392.3 | 1738700.0 |
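
The `Total Profit` quartiles already say something about the target: the 25th percentile is 95,385.06, well above the $10,000 threshold, so more than 75% of transactions count as high-profit and the classes are imbalanced. Accuracy should therefore be read alongside a metric such as F1. The sketch below shows one way to measure the positive-class share; `high_profit_share` is a hypothetical helper, and it is applied here only to the five profits from `data.head()`, not to the full dataset.

```python
import pandas as pd


def high_profit_share(profits: pd.Series, threshold: float = 10_000) -> float:
    """Return the fraction of transactions whose profit exceeds the threshold."""
    return float((profits > threshold).mean())


# Total Profit of the first five rows of the dataset
sample_profits = pd.Series([3839.13, 338631.84, 20592.0, 41273.28, 62217.18])

share = high_profit_share(sample_profits)
print(f"High-profit share in the sample: {share:.0%}")
```

On the full dataset the call would be `high_profit_share(data["Total Profit"])`.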


In [18]:
# List of the categorical columns
categorical_columns = ["Region", "Country", "Item Type", "Sales Channel", "Order Priority"]

# Number of unique observations in each category
for column in categorical_columns:
    print(f"\nAnalysis of '{column}' column:")
    print("-" * 30)
    
    # Unique values
    unique_values = data[column].unique()
    print(f"Unique values ({len(unique_values)}): {unique_values[:5]}{'...' if len(unique_values) > 5 else ''}")
    
    # Value counts
    value_counts = data[column].value_counts()
    print("\nValue counts:")
    print(value_counts.head(10))  # Display top 10 for brevity
    
    # Missing values
    missing_values = data[column].isnull().sum()
    print(f"\nMissing values: {missing_values}")


Analysis of 'Region' column:
------------------------------
Unique values (7): ['Sub-Saharan Africa' 'Middle East and North Africa'
 'Australia and Oceania' 'Europe' 'Asia']...

Value counts:
Sub-Saharan Africa                   130422
Europe                               129286
Asia                                  72958
Middle East and North Africa          62020
Central America and the Caribbean     53964
Australia and Oceania                 40508
North America                         10842
Name: Region, dtype: int64

Missing values: 0

Analysis of 'Country' column:
------------------------------
Unique values (185): ['South Africa' 'Morocco' 'Papua New Guinea' 'Djibouti' 'Slovakia']...

Value counts:
Cape Verde     2840
Liberia        2805
Guinea         2805
Singapore      2804
New Zealand    2797
Malta          2791
Namibia        2789
Panama         2787
Algeria        2780
Lesotho        2778
Name: Country, dtype: int64

Missing values: 0

Analysis of 'Item Type' column:
----

In [24]:
# Convert the date columns from strings (object dtype) to datetime
data['Order Date'] = pd.to_datetime(data['Order Date'])
data['Ship Date'] = pd.to_datetime(data['Ship Date'])
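
Once both columns are datetimes, they can be subtracted to get the time between order and shipment. The sketch below is an illustration on the five rows from `data.head()`; the column name `Days to Ship` is hypothetical, and the explicit `format="%m/%d/%Y"` is an assumption based on how the dates look in those rows.

```python
import pandas as pd

# Order and ship dates of the first five rows of the dataset
sample = pd.DataFrame({
    "Order Date": ["7/27/2012", "9/14/2013", "5/15/2015", "5/17/2017", "10/26/2016"],
    "Ship Date": ["7/28/2012", "10/19/2013", "6/4/2015", "7/2/2017", "12/4/2016"],
})

# Parse month/day/year strings; an explicit format avoids ambiguous parsing
for col in ["Order Date", "Ship Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%m/%d/%Y")

# Hypothetical derived column: whole days between order and shipment
sample["Days to Ship"] = (sample["Ship Date"] - sample["Order Date"]).dt.days
print(sample)
```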