# **Introduction**
Welcome to the documentation for the Sales Forecasting dataset. These pages describe the features of the data, share useful insights, and answer relationship and business questions that may help you in your decision making!

# **Data Source**

* The data for this project comes from Kaggle, a respected platform for sharing and discovering datasets across a wide range of fields.
* The dataset, titled "Sales Forecasting", was uploaded by Rohit Sahoo and can be accessed through the following Kaggle link: https://www.kaggle.com/datasets/vivek468/superstore-dataset-final

# **Data Description**
* The "Sales Forecasting" dataset contains several attributes that offer insight into sales patterns.
* It covers sales transactions, customer details, and product information.

* Let's look at the fundamental columns, what they mean, and the type of data they hold.
 * **Order ID**: A unique identifier for each order placed.

 * **Order Date**: The date when the order was placed.

 * **Ship Date**: The date when the ordered items were shipped.

 * **Ship Mode**: The shipping mode chosen for the order (e.g., Second Class).

 * **Customer ID**: A unique identifier for each customer.

 * **Customer Name**: The name of the customer placing the order.

 * **Segment**: The market segment to which the customer belongs (e.g., Consumer, Corporate).

 * **Country**: The country where the order was placed (e.g., United States).

 * **City**: The city where the order was placed.

 * **State**: The state within the country where the order was placed.

 * **Postal Code**: The postal code associated with the order location.

 * **Region**: The region of the country where the order was placed (e.g., South, West).

 * **Product ID**: A unique identifier for each product.

 * **Category**: The broad category to which the product belongs (e.g., Furniture, Office Supplies).

 * **Sub-Category**: The specific sub-category to which the product belongs (e.g., Bookcases, Chairs).

 * **Product Name**: The name of the product ordered.

 * **Sales**: The sales amount associated with the order.

In [None]:
#importing the important libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import warnings
#warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('/kaggle/input/superstore-dataset-final/Sample - Superstore.csv', encoding='windows-1252') #reading the data #cp1252
df

In [None]:
data = df.copy()
df.sample()

In [None]:
# descriptive statistics for numerical features
df.describe()

In [None]:
df.describe(include='O')

In [None]:
df.columns

In [None]:
#df.dtypes
df.info()

In [None]:
print("Shipping modes: ")
print(df['Ship Mode'].unique())
print("Countries: ")
print(df['Country'].unique())
print('Number of states: ' + str(df['State'].unique().shape[0]))
print("Region: ")
print(df['Region'].unique())
print('Number of cities: ' + str(df['City'].unique().shape[0]))
print("Categories: ")
print(df['Category'].unique())
print("Number of Sub categories: " + str(df['Sub-Category'].unique().shape[0]))
print("Number of Products: " + str(df['Product ID'].unique().shape[0]))

In [None]:
# Check for null values
print(df.isnull().sum())
print(f'There is a total of {df.isnull().sum().sum()} NaN values')
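
A per-column view often makes the null check easier to read than a single total. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `report` name are illustrative assumptions, not part of the dataset); it reports both the count and the percentage of missing values for each column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Sales": [10.0, np.nan, 30.0, 40.0],
    "Region": ["West", "East", None, "South"],
    "Quantity": [1, 2, 3, 4],
})

missing_count = toy.isnull().sum()          # number of NaN values per column
missing_share = toy.isnull().mean() * 100   # percentage of NaN values per column

report = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(report)
```

The same two lines can be applied to `df` directly; `isnull().mean()` works because `True` counts as 1 when averaged.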

In [None]:
#check if there are duplicated rows
df.duplicated().sum()

In [None]:
##display duplicated values
#df[df.duplicated(keep='last')]
#remove duplicated values
#df.drop_duplicates()

In [None]:
#Country has a single unique value, and Row ID and Postal Code are not used in the analysis,
#so we can drop these columns because they are superfluous
df.drop(columns=['Row ID','Postal Code','Country'], inplace=True)
df.columns

In [None]:
# Data extraction - creation of new columns:
df['Order Date'] = pd.to_datetime(df['Order Date'], format='%m/%d/%Y')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%m/%d/%Y')

df['Year'] = df['Order Date'].dt.year
df['Month'] = df['Order Date'].dt.month_name()
df['Day'] = df['Order Date'].dt.day
df['Year-Month'] = df['Year'].astype(str) + '-' + df['Month'].astype(str)
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format='%Y-%B') #explicit format; gives the first day of each order month

#delivery day
df['Delivery Days'] = df['Ship Date'] - df['Order Date']
df.info()

In [None]:
df['Delivery Days'].describe()

In [None]:
#to get the days only
df['Delivery Days']=df['Delivery Days'].dt.days
df['Delivery Days'].describe()

# Data Visualization

## Feature Analysis

In [None]:
# Plot distribution of the Numerical columns
sales = go.Box(x=df['Sales'],name='Sales')
Quantity = go.Box(x=df['Quantity'],name='Quantity')
Discount = go.Box(x=df['Discount'],name='Discount')
Profit = go.Box(x=df['Profit'],name='Profit')
Delivery_days = go.Box(x=df['Delivery Days'],name='Delivery Days')


fig = make_subplots(rows=3, cols=2) #five box plots need only three rows
fig.append_trace(sales, row = 1, col = 1)
fig.append_trace(Quantity, row = 1, col = 2)
fig.append_trace(Discount, row = 2, col = 1)
fig.append_trace(Profit, row = 2, col = 2)
fig.append_trace(Delivery_days, row = 3, col = 1)
fig.update_layout(
    title_text = 'Distribution of the numerical data',
    title_font_size = 24,
    title_x=0.45)

fig.show()

In [None]:
# Plot distribution of the categorical columns
ship_mode = go.Histogram(x=df["Ship Mode"],name="Ship Mode")
segment = go.Histogram(x=df["Segment"],name="Segment")
Region = go.Histogram(x=df["Region"],name="Region")
category = go.Histogram(x=df["Category"],name="Category")

fig = make_subplots(rows=2, cols=2)
fig.append_trace(ship_mode, row = 1, col = 1)
fig.append_trace(segment, row = 1, col = 2)
fig.append_trace(Region, row = 2, col = 1)
fig.append_trace(category, row = 2, col = 2)
fig.update_layout(
    title_text = 'Number of occurrences of each category in the categorical variables',
    title_x=0.45
)

fig.show()

In [None]:
state_ship_mode_counts = df.groupby(['State', 'Ship Mode']).size().reset_index(name='Count')
fig = px.treemap(state_ship_mode_counts, path=['State', 'Ship Mode'], values='Count')
fig.update_layout( title='Treemap of State vs. Ship Mode',title_x=0.45,width=1600,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
# Countplot of ship modes
plt.figure(figsize=(10, 7))
sns.countplot(x='Ship Mode', data=df)
plt.title('Count of Ship Modes')
plt.xticks(rotation=45)
plt.show()


In [None]:
# What is the average order value for each shipping mode?
for i in df['Ship Mode'].unique():
    average = df[df['Ship Mode']==i]['Sales'].mean()
    print(60*'-')
    print(f'Average order value for "{i}" shipping mode is {average:.2f}')

In [None]:
# What is the average sales value for each product sub-category?
for i in df['Sub-Category'].unique():
    average = df[df['Sub-Category']==i]['Sales'].mean()
    print(60*'-')
    print(f'Average sales value for "{i}" sub-category is {average:.2f}')

In [None]:
# How does the average order value differ across customer segments?
for i in df['Segment'].unique():
    average = df[df['Segment']==i]['Sales'].mean()
    print(60*'-')
    print(f'Average sale for {i} is {average:.2f}')
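
The three loops above all compute a mean of `Sales` per group, so they can also be expressed with `groupby`. This is a minimal sketch on made-up data; the `toy` frame and the `mean_sales_by` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy order lines (values are made up for illustration).
toy = pd.DataFrame({
    "Ship Mode": ["First Class", "Second Class", "First Class", "Standard Class"],
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office"],
    "Sales": [100.0, 50.0, 300.0, 20.0],
})

def mean_sales_by(frame, column):
    """Average Sales per value of `column`, highest first (hypothetical helper)."""
    return (
        frame.groupby(column)["Sales"]
        .mean()
        .round(2)
        .sort_values(ascending=False)
    )

by_mode = mean_sales_by(toy, "Ship Mode")
by_segment = mean_sales_by(toy, "Segment")
print(by_mode)
print(by_segment)
```

Compared with the loops, this returns a sorted Series that can be plotted or joined to other tables directly.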

### ***What is the relation between the region and shipment mode?***

In [None]:
plt.figure(figsize=(15,6))
sns.countplot(df, x='Region', hue='Ship Mode', palette='coolwarm') #Paired #muted
plt.title('Count of Ship Modes by Region')
plt.show()


In [None]:
df['Delivery Days'].value_counts()

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(df, x='Delivery Days')
plt.title('Distribution of Delivery days')
plt.xlabel('Delivery days')
plt.ylabel('Count') #countplot shows counts, not densities
plt.show()


***As we already noticed from the describe table, the average delivery time is about 4 days.***

In [None]:
# Does the ship mode affect the delivery duration?
box_fig = px.box(df, x='Ship Mode', y='Delivery Days', color='Ship Mode')
box_fig.update_layout(title="Impact of Ship Mode on Delivery Days",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
box_fig.show()

* *The average delivery time for First Class is 2 days.*
* *The average delivery time for Second Class is 3 days.*
* *The average delivery time for Standard Class is 5 days.*
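
A box plot draws the median rather than the mean, so it can help to compute both per ship mode instead of reading them off the chart. The sketch below uses a made-up toy frame (the numbers are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy delivery times per ship mode (values are made up for illustration).
toy = pd.DataFrame({
    "Ship Mode": [
        "First Class", "First Class",
        "Second Class", "Second Class",
        "Standard Class", "Standard Class", "Standard Class",
    ],
    "Delivery Days": [1, 3, 2, 4, 4, 5, 6],
})

# Mean, median and number of order lines for each ship mode.
summary = toy.groupby("Ship Mode")["Delivery Days"].agg(["mean", "median", "count"])
print(summary)
```

On the real data the same call would be `df.groupby('Ship Mode')['Delivery Days'].agg(['mean', 'median', 'count'])`.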

In [None]:
#Are there dominant products in each sub-category?
#number of order lines per sub-category
df.groupby("Sub-Category")["Product ID"].count().sort_values(ascending=False)

In [None]:
fig = px.histogram(df, y="Sub-Category", color="Product ID").update_yaxes(categoryorder='total ascending')
fig.show()

* There are more than 1,000 order lines for 'Binders' and 'Paper'.
* Between 500 and 1,000 order lines for 'Phones', 'Storage', 'Art', 'Accessories' and 'Chairs'.
* Between 200 and 500 order lines for 'Appliances', 'Labels', 'Tables', 'Envelopes', 'Bookcases', and 'Fasteners'.
* Fewer than 200 order lines for 'Supplies', 'Machines', and 'Copiers'.
* No single product dominates any sub-category.
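
Counting rows per sub-category gives the number of order lines, not the number of different products. A minimal sketch of how the two can be separated, and how the share of the most frequent product can be checked, is shown below on made-up data (the `toy` frame and the column names `order_lines`, `distinct_products` and `top_product_share` are illustrative assumptions).

```python
import pandas as pd

# Toy order lines (values are made up for illustration).
toy = pd.DataFrame({
    "Sub-Category": ["Binders", "Binders", "Binders", "Paper", "Paper", "Copiers"],
    "Product ID": ["B-1", "B-1", "B-2", "P-1", "P-2", "C-1"],
})

grouped = toy.groupby("Sub-Category")["Product ID"]

# Order lines versus distinct products per sub-category.
per_sub = grouped.agg(order_lines="count", distinct_products="nunique")

# Share of order lines taken by the single most frequent product.
per_sub["top_product_share"] = grouped.agg(
    lambda ids: ids.value_counts(normalize=True).iloc[0]
)
print(per_sub)
```

A `top_product_share` close to 1 would point to a dominant product; values spread evenly across many products point the other way.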

# Sales-Wise analysis

### 1.1 Distribution of sales per segment

In [None]:
#distribution of sales per segment
fig = px.histogram(df, x='Sales', color='Segment')
fig.update_layout(
    height=400, width=1200,
    title_text = 'Distribution of Sales per segment',
    title_x=0.45
)
fig.show()

### *We can see that most of the sales are below 50 USD, and the biggest client segment is 'Home Office', then 'Corporate' and finally 'Consumer'.*

In [None]:
#sales-profit relation
plt.figure(figsize=(10, 6))
sns.scatterplot(df, x='Sales', y='Profit')
plt.title('Sales-Profits Relationship')
plt.show()

## 1.2 What are the sales of each category over the past four years?

In [None]:
#How are total sales distributed over the past years?
yearly_sales = df.groupby('Year')['Sales'].sum().reset_index()
fig = px.line(yearly_sales, x='Year', y='Sales')
fig.update_layout(title='Distribution of Sales Over the Years',title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
category_sales = df.groupby(['Year-Month', 'Category'])['Sales'].sum().reset_index()
category_sales = category_sales.sort_values(by='Year-Month') #keep the sorted result
category_sales

### What is the sales distribution in each category?

In [None]:
plt.figure(figsize=(15, 7))
plt.title('Sales Distribution for each category')
sns.lineplot(category_sales, x='Year-Month', y='Sales', hue='Category')
plt.xticks(rotation=45)
plt.show()

******

### 1.3 Which month has the largest sales?

In [None]:
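#grouping by the month number as well keeps the months in calendar order instead of alphabetical order
monthly_sales = df.groupby(['Year', df['Order Date'].dt.month.rename('Month Number'), 'Month'])[['Sales', 'Profit']].sum().reset_index()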
monthly_sales

In [None]:
fig = px.area(monthly_sales, x='Month', y='Sales', color='Year',
              labels={'Year': 'Year', 'Month': 'Month', 'Sales': 'Total Sales'})

fig.update_layout(
    height=600, width=1300,
    title_text = 'Distribution of Sales per month ',
    title_x=0.45,title_font=dict(size=20)
)

fig.show()

In [None]:
fig = px.area(monthly_sales, x='Month', y='Profit', color='Year',
              labels={'Year': 'Year', 'Month': 'Month', 'Profit': 'Total Profit'})

fig.update_layout(
    height=600, width=1300,
    title_text = 'Distribution of Profit per month ',
    title_x=0.45,title_font=dict(size=20)
)

fig.show()

***November and December have the highest sales across the years, while February has the lowest.***
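
The best and worst months can also be read from a table instead of a chart. The sketch below uses made-up order data (the `toy` frame is an illustrative assumption) and reindexes the monthly totals so that they appear in calendar order rather than alphabetical order.

```python
import pandas as pd

# Toy orders (values are made up for illustration).
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(
        ["2016-11-05", "2016-02-10", "2017-11-20", "2017-12-01", "2017-02-03"]
    ),
    "Sales": [500.0, 40.0, 700.0, 650.0, 60.0],
})
toy["Month"] = toy["Order Date"].dt.month_name()

# Calendar order of month names: January, February, ..., December.
month_order = list(pd.date_range("2000-01-01", periods=12, freq="MS").month_name())

# Total sales per month across all years, in calendar order;
# months without any orders are dropped.
sales_by_month = toy.groupby("Month")["Sales"].sum().reindex(month_order).dropna()

best_month = sales_by_month.idxmax()
worst_month = sales_by_month.idxmin()
print(sales_by_month)
print("Best:", best_month, "| Worst:", worst_month)
```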

### 1.4 Which category has the highest sales?

In [None]:
#the relation between sales and category
color2=['#0a2d2e',  '#1c4e4f',  '#436e6f',  '#6a8e8f',  '#879693',  '#a49e97',  '#deae9f',  '#efd7cf',  '#f7ebe7', '#ffffff']
color1=[ '#44a9b8',  '#c55087',   '#91b49f','#f3f3f3', '#cbb079',  '#cec78e',  '#273454',  '#d79493',  '#556255' , '#c1d0d1','#273454',  '#d79493',  '#556255' , '#c1d0d1']
color=['#30039c',  '#22026d',  '#d3c0ff',  '#a782ff',  '#ff9600',  '#b36900',  '#ffe5bf',  '#ffcb80', '#fffc00','#b3b000']
color3=['#730220',  '#a60326',  '#b53737',  '#c46a47',  '#95998c',  '#65c8d0',  '#2968ba',  '#1b48a0',  '#124787', '#0e2450','#730220',  '#a60326',  '#b53737',  '#c46a47',  '#95998c',  '#65c8d0',  '#2968ba',  '#1b48a0',  '#124787']
color4=['#de324c',  '#e95e56',  '#f4895f',  '#f8e16f',  '#2ac196',  '#95cf92',  '#66b5af',  '#369acc',  '#6678b7',  '#9656a2','#2ac196',  '#95cf92',  '#66b5af',  '#369acc',  '#6678b7',  '#9656a2']
color5=['#2a3e4b',  '#4cbbb3',  '#dcdfd8',  '#8f9c9d',  '#3d6f6d',  '#5a5f4e',  '#70d0cc',  '#428c7c',  '#44546c',  '#248c84']
c1=['#7C3E66','#A5BECC','#243A73','#A76F6F','#D7C0AE']
plt.figure(figsize=(10, 6))
sns.barplot(x='Category', y='Sales', data=df, palette=c1)
plt.title('Sales by Category')
plt.xticks(rotation=45)
plt.show()


***The Technology category has the highest sales, while the Office Supplies category has the lowest.***

### 1.5 Distribution of sub-category sales

In [None]:
#sorting sales of sub_category
sub_category_sales = pd.DataFrame(df.groupby('Sub-Category')['Sales'].sum())
sub_sales = pd.DataFrame(sub_category_sales.sort_values('Sales', ascending=False))
sub_sales

In [None]:
sub_df = df.groupby('Sub-Category', as_index=False)['Sales'].sum().sort_values(by = 'Sales')
fig = go.Figure(data=[go.Bar(
    x=sub_df['Sales'],
    y=sub_df['Sub-Category'],
    marker_color=color3,
    orientation='h',
    showlegend=False
)])
fig.update_layout(xaxis_title = 'Sales', yaxis_title = 'Sub-Category',title="sales by sub-category",title_x=0.45,width=1000,title_font=dict(size=24, family="Arial", color="black"))
fig.show()


*Phones achieve the highest sales among the sub-categories, while Fasteners have the lowest.*

In [None]:
scatter_plot = px.scatter(
    df,
    x='Sub-Category',
    y='Sales',
    title='Sales vs. sub-category',
    color='Category',

)

# Set the title font properties
scatter_plot.update_layout(
    title_font_family="Arial",
    title_font_size=24,
    title_font_color="black",
    title_x=0.5,width=1200
)

scatter_plot.show()


* *All the Furniture sub-categories stay under 5k in sales per order line.*
* *The lowest sales are in the Fasteners sub-category of the Office Supplies category.*

In [None]:
fig = px.box(df, x="Sales",color="Sub-Category",facet_col="Category")
fig.update_layout(title_text="Sales by category",title_font_size = 20, title_x=0.45)
fig.show()

In [None]:
store_sales = df.groupby(['Region'])[['Sales', 'Profit']].sum().reset_index()

plt.figure(figsize=(10, 6))
plt.pie(store_sales['Sales'], labels=store_sales['Region'],autopct='%1.1f%%', startangle=160,colors=c1)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('Sales Distribution by Store Location ')
plt.axis('equal')
plt.gca().set_xticks([])
plt.gca().set_yticks([])
plt.show()


***The West region has the highest sales.***

In [None]:
store_sales = df.groupby(['Region'])[['Sales', 'Profit']].sum().reset_index()
store_sales

plt.figure(figsize=(10, 6))
plt.pie(store_sales['Profit'], labels=store_sales['Region'],autopct='%1.1f%%', startangle=160,colors=c1)
my_circle=plt.Circle( (0,0), 0.7, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('Profit Distribution by Store Location ')
plt.axis('equal')
plt.gca().set_xticks([])
plt.gca().set_yticks([])
plt.show()


### 1.7 Sales across states

In [None]:
state_df= df.groupby(['State', 'Region'], as_index=False)['Sales'].sum()
state_df

In [None]:
import json

with open(r'/kaggle/input/usa-states-geojson/us-states.json', encoding='utf8') as code:
    usa_states = json.load(code)

fig_states = px.choropleth(
    state_df,
    geojson = usa_states,
    color = 'Sales',
    locations = 'State',
    featureidkey = 'properties.name',
    projection = 'transverse mercator',
    color_continuous_scale = 'blues'
    )

fig_states.update_geos(fitbounds = 'locations', visible = False, showframe = True)
fig_states.update_layout(margin={'r':0, 't':0, 'l':0, 'b':0},
        title={'text': 'Sales distributed across USA states',
        'y':0.95,
        'x':0.55,
        'xanchor': 'center',
        'yanchor': 'top'},
        title_font_color = 'blue',
        coloraxis_colorbar_x = -0.1)
fig_states.show()


### 1.8 Top states with highest sales

In [None]:
state_sales = df.groupby('State')['Sales'].sum().reset_index()

# Find the state with the highest sales
state_with_highest_sales = state_sales[state_sales['Sales'] == state_sales['Sales'].max()]

print("State with the highest sales:", state_with_highest_sales['State'].values[0])
print("Total sales in that state:", state_with_highest_sales['Sales'].values[0])
print(f"with total of {round(state_with_highest_sales['Sales'].values[0] / (df['Sales'].sum()) * 100, 2)}% of sales")

*California has the highest sales.*

In [None]:
state_sales_sorted = state_sales.sort_values(by='Sales', ascending=False)
top_10_states = state_sales_sorted.head(10)
fig = px.bar(top_10_states, x='Sales', y='State', orientation='h',color='State')
fig.update_layout(title="Sales Distribution of the Top 10 States",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

***Among the top 10 states, California achieves the highest sales while Virginia has the lowest.***

### 1.9 Does this variance of delivery days affect the sales?

In [None]:
scatter_fig = px.scatter(df, x='Delivery Days', y='Sales', color='Ship Mode',
                         title='Impact of Delivery Days Variance on Sales',
                         labels={'Delivery Days': 'Delivery Days', 'Sales': 'Sales'})
scatter_fig.update_layout(title="Impact of Delivery Days Variance on Sales",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
scatter_fig.show()

***The answer appears to be no: the scatter plot shows no visible relationship between delivery days and sales.***
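
A scatter plot gives a visual impression only; a correlation coefficient is one simple way to put a number on it. The sketch below uses made-up values (the `toy` frame is an illustrative assumption) and computes the Pearson correlation plus a Spearman rank correlation, obtained here as the Pearson correlation of the ranks. A value near zero is consistent with no linear or monotonic relationship, though it does not prove full independence.

```python
import pandas as pd

# Toy order lines (values are made up for illustration).
toy = pd.DataFrame({
    "Delivery Days": [0, 1, 2, 3, 4, 5, 6, 7],
    "Sales": [120.0, 35.0, 410.0, 18.0, 95.0, 260.0, 12.0, 150.0],
})

# Pearson correlation: strength of a linear relationship.
pearson = toy["Delivery Days"].corr(toy["Sales"])

# Spearman correlation: Pearson correlation of the ranks,
# which captures any monotonic relationship.
spearman = toy["Delivery Days"].rank().corr(toy["Sales"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```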

### 1.11 Sales Distribution over the week

In [None]:
df['Day of Week'] = df['Order Date'].dt.day_name()
daily_sales = df.groupby('Day of Week')['Sales'].sum().reset_index()
days_of_week = ['Saturday', 'Sunday','Monday', 'Tuesday', 'Wednesday','Thursday', 'Friday']
daily_sales['Day of Week'] = pd.Categorical(daily_sales['Day of Week'], categories=days_of_week, ordered=True)
daily_sales = daily_sales.sort_values(by='Day of Week')

fig = go.Figure(data=[go.Bar(
    x=daily_sales['Day of Week'],
    y=daily_sales['Sales'],
)])
fig.update_traces(marker_color=color1, marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)
fig.update_layout(title='Sales by Day of the Week',title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

***Monday and Friday rank first in sales.***

In [None]:
fig = px.violin(df, x='Day of Week', y='Sales', color='Year',
                category_orders={'Day of Week': days_of_week})
fig.update_layout(title='Sales by Day of the Week over the years',title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
pivot_table = data.pivot_table(index = "Segment", columns = "Ship Mode", values = 'Sales', aggfunc = 'sum')

#DataFrame.plot opens its own figure, so the size is passed here instead of calling plt.figure()
pivot_table.plot(kind='bar', stacked=False, color=c1, figsize=(20, 6))
plt.title('Sales by Segment and Ship Mode')

plt.ylabel('Sales')
plt.xticks(rotation=0)
plt.show()

In [None]:
#How are sales distributed among different cities?
city_sales = df.groupby(['State', 'City'])['Sales'].sum().reset_index()
top_10_cities = city_sales.nlargest(10, 'Sales')
fig = px.sunburst(
    top_10_cities,
    values='Sales',
    path=['State', 'City'],
    hover_name='City',
    color='Sales',
    title='Top 10 Cities by Total Sales',
)
fig.update_layout(title_text="Top 10 Cities by Total Sales",title_font_size = 22, title_x=0.5,width=1000)

fig.show()

### 1.12 Top selling products

In [None]:
#Top selling products
df.groupby('Product Name')['Sales'].sum().sort_values(ascending=True).tail(10).plot.barh(color=color1, title='Top ten selling products')
plt.show()

***The Canon imageCLASS 2200 Advanced Copier is the top-selling product.***

## Profit-wise

In [None]:
#sorting profit of sub_category
sub_category = pd.DataFrame(df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending =False))
p = pd.DataFrame(sub_category.sort_values('Profit',ascending=False))
p

In [None]:
#checking profit of segment
segment = pd.DataFrame(df.groupby('Segment')['Profit'].sum())
segment

In [None]:
#profit/ sales by region
region = df.groupby(['Region'])[['Sales', 'Profit']].sum().sort_values(by='Sales')
region

In [None]:
#profit/sales by subcategories
sub_category = df.groupby(['Sub-Category'])[['Sales', 'Profit']].sum().sort_values(by='Sales', ascending=False)
sub_category

In [None]:
#adding sales by category
category = df.groupby(['Category'])[['Sales', 'Profit', 'Quantity']].sum()
category

In [None]:
df.groupby(['State'])[['Profit', 'Sales']].sum().reset_index().sort_values(by='Sales', ascending=False)

In [None]:
state = df.groupby(['State'])[['Profit', 'Sales']].sum().reset_index().sort_values(by='Profit', ascending=False)
state

In [None]:
print("least profitable states")
state.tail(10)

### 1.6 What are the total profits of each region?

In [None]:
plt.figure(figsize=(10,7))
plt.title('top 10 profitable sub-categories')
plt.xlabel('Sub-Category')
plt.ylabel('Profit')
df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending=False)[0:10].plot(kind='bar',color=c1,ls='dashed',edgecolor='Black')
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.title('10 least profitable sub-categories')
plt.xlabel('Sub-Category')
plt.ylabel('Profit')
df.groupby('Sub-Category')['Profit'].sum().sort_values(ascending=True)[0:10].plot(kind='bar',color=c1,ls='dashed',edgecolor='Black')
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.title('Top 10 profitable states')
plt.xlabel('State')
plt.ylabel('Profit')
df.groupby('State')['Profit'].sum().sort_values(ascending=False)[0:10].plot(kind='bar',color=c1,ls='dashed',edgecolor='Black')
plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.title('Top 10 profitable Cities')
plt.xlabel('City')
plt.ylabel('Profit')
df.groupby('City')['Profit'].sum().sort_values(ascending=False)[0:10].plot(kind='bar',color=c1,ls='dashed',edgecolor='Black')
plt.show()

### Top 10 profitable states

In [None]:
state_sales = df.groupby('State')['Profit'].sum().reset_index()

# Find the state with the highest profit
state_with_highest_sales = state_sales[state_sales['Profit'] == state_sales['Profit'].max()]
state_sales_sorted = state_sales.sort_values(by='Profit', ascending=False)
top_10_states = state_sales_sorted.head(10)
fig = px.bar(top_10_states, x='Profit', y='State', orientation='h',color='State')
fig.update_layout(title="Profit Distribution of the Top 10 States",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
df['Profit per unit'] = df['Profit'] / df['Quantity']
fig = px.box(df, x='Profit per unit', color="Sub-Category", facet_col="Category")
fig.update_layout(title_text="Profit per unit by category",title_font_size = 22, title_x=0.45)
fig.show()

In [None]:
pivot_table = data.pivot_table(index = "Segment", columns = "Ship Mode", values = 'Profit', aggfunc = 'sum')
plt.figure(figsize=(20, 6))

pivot_table.plot(kind = 'bar', stacked = False,color=c1)
plt.title('Profits by Segment and Ship Mode')
plt.ylabel("Profit")
plt.xticks(rotation=0)
plt.show()

In [None]:
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Discounts and profit', 'Discounts and sales'))
fig.add_trace(go.Bar(x=df['Discount'], y=df['Profit']),
              1, 1)
fig.add_trace(go.Bar(x=df['Discount'], y=df['Sales']),
              1, 2)

fig.update_layout(title="Discounts vs sales and profits",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
fig = px.box(df, x="Discount",color="Sub-Category",facet_col="Category")
fig.update_layout(title="Discount by category",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
#Does the discount affect the profit?
fig = px.scatter(df, x="Profit per unit", y="Discount", color="Sub-Category",facet_col="Category")
fig.update_layout(title="Discount vs profit in each category",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
region_sales = df.groupby('Region')['Sales'].sum()
region_sales

In [None]:
region_profit = df.groupby('Region')['Profit'].sum()
region_profit

In [None]:
fig = px.box(df, x="Quantity", color="Sub-Category", facet_col="Category")

fig.update_layout(title="Quantity sold by category",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))

fig.show()

In [None]:
# Discounts impact on profit and sales
discounts = data.groupby("Discount")[['Sales', 'Profit']].sum().reset_index() #sum only the numeric columns we need
fig = px.bar(discounts,
             y=discounts['Discount'],
             x=discounts['Sales'],
             color='Sales',
             color_continuous_scale=["green", "blue"],
             orientation='h')
fig.update_layout(xaxis_title = 'Sales', yaxis_title = 'Discounts',title="Discounts impact on Sales",title_x=0.45,width=1000,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
fig = px.bar(discounts,
             y=discounts['Discount'],
             x=discounts['Profit'],
             color='Profit',
             color_continuous_scale=['green', 'blue'],
             orientation='h')
fig.update_layout(xaxis_title = 'Profit', yaxis_title = 'Discounts',title="Discounts impact on profits",title_x=0.45,width=1000,title_font=dict(size=24, family="Arial", color="black"))

fig.show()

In [None]:
#LEAST SELLING CITIES (total sales per city)
least_cities = df.groupby('City', as_index=False)['Sales'].sum().sort_values(by = ["Sales"], ascending = True).head(10)
least_cities

In [None]:
#HIGHEST SELLING CITIES (total sales per city)
top_cities = df.groupby('City', as_index=False)['Sales'].sum().sort_values(by = ["Sales"], ascending = False).head(10)
top_cities

In [None]:
#MOST PROFITABLE CITIES (total profit per city)
profit_cities = df.groupby('City', as_index=False)['Profit'].sum().sort_values(by = ["Profit"], ascending = False).head(10)
profit_cities


In [None]:
#NON-PROFITABLE CITIES
nonprofit_cities = pd.DataFrame(df.groupby('City')['Profit'].sum()).sort_values(by='Profit', ascending=True).head(20)
nonprofit_cities

# Frequent customers

In [None]:
# Which customers have made the highest number of orders?
high_order = df['Customer ID'].value_counts().idxmax()
name = df[df['Customer ID'] == high_order][['Customer ID','Customer Name']].drop_duplicates()
name

In [None]:
customer_order_counts = df.groupby('Customer Name')['Order ID'].nunique()
frequents = customer_order_counts[customer_order_counts > 10]
num_frequents = len(frequents)
print("Number of frequent customers (with more than 10 orders):", num_frequents)


In [None]:
#frequents=list(frequents)

In [None]:
plt.figure(figsize=(14, 6))
frequents.plot(kind='bar')
plt.title('Number of Frequent Customers with More Than 10 Orders')
plt.xlabel('Customer Name')
plt.ylabel('Number of Orders')
plt.xticks(rotation=90)
plt.show()

In [None]:
nonfrequent=customer_order_counts[customer_order_counts <= 10]
num_nonfrequents=len(nonfrequent)

*There are 750 customers with 10 orders or fewer.*

In [None]:
plt.figure(figsize=(6, 6))
plt.pie([num_frequents, num_nonfrequents], labels=['Frequent', 'Not Frequent'], autopct='%1.1f%%', colors=['#8cb5db', '#8c8c8c'])
plt.title('Relationship Between Frequent and Not Frequent Clients')
plt.show()

## **Give offers to the frequent clients**

In [None]:
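#frequents is indexed by customer name, so test membership against its index explicitly
df['Offer'] = df['Customer Name'].apply(lambda name: 'Offer' if name in frequents.index else 'No Offer')
df
#we could create different types of offers based on the number of orders or other factors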

In [None]:
#flag every order line that belongs to a frequent customer
df['Frequent'] = df['Customer Name'].isin(frequents.index)
state_frequent_counts = df.groupby('State')['Frequent'].sum().reset_index()
state_frequent_counts

In [None]:
fig = px.bar(state_frequent_counts, x='State', y='Frequent')
fig.update_layout(title="Number of frequent customers in Each State",title_x=0.45,width=1300,title_font=dict(size=24, family="Arial", color="black"))
fig.show()

In [None]:
frequents

In [None]:
frequents.value_counts()
#df['customers frequency']=frequents


In [None]:
df['num_of_orders'] = df['Customer Name'].map(customer_order_counts).astype(int)
df['num_of_orders'].value_counts()

In [None]:
df['num_frequents'] = df['Customer Name'].apply(lambda x: 1 if x in frequents.index else 0)
df['num_frequents'].value_counts()

In [None]:
df['Customer Name'].map(customer_order_counts).astype(int)


In [None]:
dups = df.pivot_table(index = ['Customer Name'], aggfunc ='size')
dups

## **Is there a relation between customer segment, product category, and sales?**

In [None]:
grouped_data = df.groupby(['Segment', 'Category'])['Sales'].sum().reset_index()
grouped_data

In [None]:
pivot_df = grouped_data.pivot(index='Segment', columns='Category', values='Sales')
pivot_df

In [None]:
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_df, annot=True, fmt=".2f", cmap="YlGnBu")
plt.title('Relationship between Customer Segment, Product Category, and Sales')
plt.xlabel('Product Category')
plt.ylabel('Customer Segment')
plt.show()

*Consumers who buy Technology products account for the highest sales.*
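
The segment-by-category table can also be built in one step with `pivot_table`, and the strongest combination can be looked up programmatically rather than read from the heatmap. This is a minimal sketch on made-up data; the `toy` frame and its values are illustrative assumptions.

```python
import pandas as pd

# Toy order lines (values are made up for illustration).
toy = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Corporate", "Corporate", "Consumer"],
    "Category": ["Technology", "Furniture", "Technology", "Furniture", "Technology"],
    "Sales": [300.0, 100.0, 150.0, 50.0, 200.0],
})

# Total sales for every (Segment, Category) pair in one call.
pivot = toy.pivot_table(
    index="Segment", columns="Category", values="Sales",
    aggfunc="sum", fill_value=0,
)

# unstack() flattens the table into a Series indexed by (Category, Segment),
# so idxmax() returns the pair with the largest total.
top_category, top_segment = pivot.unstack().idxmax()
print(pivot)
print(f"Highest sales: {top_segment} / {top_category}")
```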

In [None]:
df['Products'] = df['Product Name'].str.split('\n') #to separate each product
product_df = df.explode('Products') #new dataframe with separated products
most_required_product = product_df['Products'].value_counts().idxmax() #the product that occurs most often
print(f"The most required product is: **{most_required_product}**")
products_num = product_df['Products'].value_counts().reset_index()
products_num.columns = ['Product', 'Count']
products_num

In [None]:
plt.figure(figsize=(14, 6))
frequents.plot(kind='line')
plt.title('Number of Frequent Customers with More Than 10 Orders')
plt.xlabel('Customer Name')
plt.ylabel('Number of Orders')
plt.xticks(rotation=90)
plt.show()

In [None]:
state_df = df.groupby('State')['num_frequents'].sum()
state_df

In [None]:
dups = df.pivot_table(index = ['Customer Name', 'State'], aggfunc ='size')
dups

# Data Preprocessing

In [None]:
data.columns

In [None]:
data.drop(['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Country', 'Postal Code', 
           'Region', 'Product ID', 'Product Name'], axis=1, inplace=True)
data.columns

In [None]:
#encoding of categorical features
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in data.columns:
    if(data.dtypes[i] == 'object'):
        data[i] = le.fit_transform(data[i])

data.info()

In [None]:
#representing the outliers
data.plot(kind='box', subplots=True, figsize=(20,20), layout=(6,5))
plt.show()

***We observe that Sales and Profit have the largest number of outliers to deal with.***

In [None]:
#deleting some of the outliers while keeping the data consistent
print("Old Shape: ", data.shape[0])
data.drop(index= data[(data['Sales'] > 3000)].index, inplace=True)
data.drop(index= data[(data['Profit'] > 1000) | (data['Profit'] < -500)].index, inplace=True)
data.drop(index= data[(data['Discount'] >= 0.6)].index, inplace=True)
data.drop(index= data[(data['Quantity'] > 11)].index, inplace=True)
print("NEW Shape: ", data.shape[0])
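
The cut-offs above are fixed thresholds. A common alternative, shown here only as a sketch and not as the method used in this notebook, is the 1.5 × IQR rule, which derives the bounds from the data itself. The `iqr_bounds` helper and the `toy` values are hypothetical.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Lower and upper outlier bounds from the k * IQR rule (hypothetical helper)."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy sales with one obvious outlier (values are made up for illustration).
toy = pd.DataFrame({"Sales": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 5000.0]})

low, high = iqr_bounds(toy["Sales"])
kept = toy[toy["Sales"].between(low, high)]

print(f"Bounds: [{low}, {high}]")
print(f"Rows before: {len(toy)}, rows after: {len(kept)}")
```

For heavily skewed columns such as sales amounts, this rule can remove a large share of legitimate high-value orders, so the bounds are worth inspecting before dropping rows.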

In [None]:
data.plot(kind='box', subplots=True, figsize=(20,20), layout=(6,5))
plt.show()

In [None]:
#distribution of sales per segment
fig = px.histogram(data, x='Sales', color='Segment')
fig.update_layout(
    height=400, width=1200,
    title_text = 'Distribution of Sales per segment',
    title_x=0.45
)
fig.show()

#### ***Now that the number of outliers has been reduced, we can start modeling.***

#### ***After trying different scaling methods, we found that StandardScaler works best for our data.***

In [None]:
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score #cross validation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, r2_score, mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
from mlxtend.plotting import plot_confusion_matrix

#using standard scaler 
sc = StandardScaler()
x = data.drop(['Sales'] , axis = 1).values
y =data['Sales'].values

# train/test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=42)
# Fit the scaler on the training set only, then reuse its statistics on the test set
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Modeling

#### ***We will apply several modeling algorithms and compare their scores***
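
Every model in this section goes through the same steps: fit on the training set, report R² on both splits, then compute MSE, RMSE, MAE, and MAPE on the test set. A sketch of how those steps could be wrapped once is shown here. `evaluate_regressor` is a hypothetical helper that does not appear in the original notebook, and the synthetic data stands in for the scaled `x_train`/`x_test` arrays.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

def evaluate_regressor(model, x_tr, y_tr, x_te, y_te):
    """Fit `model` and return the metrics printed for each model in this section."""
    model.fit(x_tr, y_tr)
    y_hat = model.predict(x_te)
    mse = mean_squared_error(y_te, y_hat)
    return {
        "train_r2": model.score(x_tr, y_tr),
        "test_r2": r2_score(y_te, y_hat),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": mean_absolute_error(y_te, y_hat),
        "mape": mean_absolute_percentage_error(y_te, y_hat),
    }

# Synthetic stand-in data (an assumption; the notebook uses the Superstore features).
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

report = evaluate_regressor(LinearRegression(), x_tr, y_tr, x_te, y_te)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Routing every model through one function of this kind would also rule out copy-paste slips, such as predicting with a different estimator than the one just fitted.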

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression() #Pipeline([('std_scaler', StandardScaler()),('linear_regressor',LinearRegression(normalize=True,fit_intercept=False))])
lr.fit(x_train, y_train)
print("linear regression")
print(lr.score(x_train, y_train))
print(lr.score(x_test, y_test))
y_pred = lr.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from sklearn.linear_model import RidgeCV
rid = RidgeCV(alphas=np.arange(70,100,0.1), fit_intercept=True)
rid.fit(x_train,y_train)
print('Ridge regression')
print(rid.score(x_train,y_train))
print(rid.score(x_test , y_test))
y_pred = rid.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from sklearn.linear_model import LassoLars
lasso = LassoLars()
lasso.fit(x_train,y_train)
print('Lasso Regression')
print(lasso.score(x_train,y_train))
print(lasso.score(x_test , y_test))
y_pred = lasso.predict(x_test)  # predict with the Lasso model, not the Ridge model
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(max_depth= 4 , max_features= 6)
rf.fit(x_train,y_train)
print('RF')
print(rf.score(x_train,y_train))
print(rf.score(x_test , y_test))
y_pred = rf.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from xgboost import XGBRegressor
# 'reg:squarederror' is the current name of the objective formerly called 'reg:linear'
xgb = XGBRegressor(objective='reg:squarederror', n_estimators=7, random_state=123) #Pipeline([('std_scaler', StandardScaler()),('xgboost_regressor',XGBRegressor(n_estimators=50,max_depth=3))])
xgb.fit(x_train, y_train)
print('XGB')
print(xgb.score(x_train, y_train))
print(xgb.score(x_test, y_test))
y_pred = xgb.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(max_depth= 6, max_features= 7, min_samples_split= 10, random_state= 2)
param_grid = {"max_depth": [3, None,6,9],
              "max_features": [5, 7,11, 15],
              "min_samples_split": [2, 3, 10],
             "random_state":[2,4,6]}
grid = GridSearchCV(estimator = dt, param_grid = param_grid, cv = 5)
grid_result = grid.fit(x_train, y_train)
dt = grid_result.best_estimator_  # use the tuned tree; GridSearchCV refits it on the full training set
print('DT')
print(dt.score(x_train,y_train))
print(dt.score(x_test , y_test))
y_pred = dt.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor() #Pipeline([('knn_regressor',KNeighborsRegressor(n_neighbors=15, weights='distance'))])
knn.fit(x_train, y_train)
print('KNN')
print(knn.score(x_train,y_train))
print(knn.score(x_test , y_test))
y_pred = knn.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from lightgbm import LGBMRegressor
gbm= LGBMRegressor()#Pipeline([('std_scaler', StandardScaler()),('lightgbm_regressor',LGBMRegressor())])
gbm.fit(x_train, y_train)
print('LightGBM')
print(gbm.score(x_train,y_train))
print(gbm.score(x_test, y_test))
y_pred = gbm.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from catboost import CatBoostRegressor
cat= CatBoostRegressor() #Pipeline([('std_scaler', StandardScaler()),('catboost_regressor',CatBoostRegressor(iterations=20))])
cat.fit(x_train, y_train)
print('CAT')
print(cat.score(x_train,y_train))
print(cat.score(x_test, y_test))
y_pred = cat.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor() #Pipeline([('std_scaler', StandardScaler()),('adaboost_regressor', AdaBoostRegressor(random_state=5, n_estimators=100,loss='square',learning_rate=1))])
ada.fit(x_train, y_train)
print('ADABoost')
print(ada.score(x_train,y_train))
print(ada.score(x_test , y_test))
y_pred = ada.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
fig = plt.figure(figsize=(5,5))
ax = fig.add_subplot(111)
ax.errorbar(y_test, y_pred, fmt='o')
ax.errorbar([1, y_test.max()], [1, y_test.max()])
plt.show()
print()

In [None]:
from sklearn.ensemble import BaggingRegressor
reg = BaggingRegressor()
reg.fit(x_train, y_train)
print('REG')
print(reg.score(x_train,y_train))
print(reg.score(x_test , y_test))
y_pred = reg.predict(x_test)
mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
rmse = np.sqrt(mse)                                           #root mean squared error
reg_score = r2_score(y_test , y_pred)                         #reg_score
mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
print('Reg_score:', reg_score )
print('Mean Squared Error:', mse)
print('Mean absolute percentage error:', mape )
print('Mean absolute error:', mae)
print('rmse:', rmse)
print()

In [None]:
col = [LinearRegression(), RidgeCV(), LassoLars(),RandomForestRegressor(), XGBRegressor(), DecisionTreeRegressor(), KNeighborsRegressor(), LGBMRegressor(), AdaBoostRegressor()] #CatBoostRegressor()

from sklearn.ensemble import BaggingRegressor

for i in col:
    reg = BaggingRegressor(estimator=i, n_estimators=8)
    reg.fit(x_train, y_train)
    print(f'{i}')
    print(reg.score(x_train,y_train))
    print(reg.score(x_test , y_test))
    y_pred = reg.predict(x_test)
    mse = mean_squared_error(y_test , y_pred)                     #Mean squared error
    rmse = np.sqrt(mse)                                           #root mean squared error
    reg_score = r2_score(y_test , y_pred)                         #reg_score
    mape = mean_absolute_percentage_error(y_test , y_pred)        #mean absolute percentage error
    mae = mean_absolute_error(y_test , y_pred)                    #mean absolute error
    print('Reg_score:', reg_score )
    print('Mean Squared Error:', mse)
    print('Mean absolute percentage error:', mape )
    print('Mean absolute error:', mae)
    print('rmse:', rmse)
    print()

### ***We see that LGBMRegressor gives the best score with an acceptable root mean squared error***
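
Since the choice rests on R² and RMSE, it may help to recall how these metrics are computed. The sketch below re-implements them in plain Python on made-up numbers. `regression_metrics` is a hypothetical helper for illustration, and it assumes no true value is zero (scikit-learn's MAPE guards against that case with a small epsilon instead).

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, MAPE and R^2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Made-up sales values and predictions, each off by exactly 10.
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 310.0, 390.0])
print(metrics)
```

RMSE is in the same units as Sales, which makes it easier to judge than MSE, while R² is unitless and measures the share of variance explained relative to always predicting the mean.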

In [None]:
!pip install mljar-supervised

In [None]:
from supervised.automl import AutoML
automl = AutoML()
automl.fit(x_train,y_train)
print(automl.score(x_train, y_train))
print(automl.score(x_test, y_test))
y_pred = automl.predict(x_test)
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)

In [None]:
models = [LinearRegression(), RidgeCV(), LassoLars(),RandomForestRegressor(), XGBRegressor(), DecisionTreeRegressor(), KNeighborsRegressor(), LGBMRegressor(), AdaBoostRegressor(), CatBoostRegressor()]
#mod = [lr, rid, lasso, rf, xgb, dt, knn, gbm, cat, ada, bagging] #svr
train=[]
test=[]
MSE=[]
RMSE=[]
R_Squared=[]
MAPE=[]
MAE=[]

for i in models:
    reg = BaggingRegressor(estimator=i, n_estimators=10)
    reg.fit(x_train, y_train)
    train.append(reg.score(x_train, y_train))
    test.append(reg.score(x_test, y_test))
    y_pred = reg.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)                           #Mean squared error
    MSE.append(mse)
    RMSE.append(np.sqrt(mse))                                          #root mean squared error
    R_Squared.append(r2_score(y_test, y_pred))                         #r_squared
    MAPE.append(mean_absolute_percentage_error(y_test, y_pred))        #mean absolute percentage error
    MAE.append(mean_absolute_error(y_test, y_pred))                    #mean absolute error

conclusion = pd.DataFrame({'Model':[type(m).__name__ for m in models],  # readable class names instead of object reprs
                           'train_score':train,
                           'test_score': test,
                           'MS_error':MSE,
                           'RMS_error':RMSE,
                           'R_Squared':R_Squared,
                           'MAP_error':MAPE,
                           'MA_error':MAE
                           })

In [None]:
# Rank by test score: the train score alone rewards overfitting
conclusion.sort_values(by='test_score', ascending=False, inplace=True)
conclusion