<a href="https://www.kaggle.com/code/arunjangir245/supermarket-sales-prediction-and-eda?scriptVersionId=143427179" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="text-align: center; background-color: #d8b26e; color: #006600; padding: 20px; border-radius: 5px;">
    <h2 style="margin: 0; font-size: 13px;">Don't forget to upvote if you liked the notebook</h2>
</div>


![](https://content.pymnts.com/wp-content/uploads/2017/07/e-commerce-robot.jpg)

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

<h2 align="left"><font color=#006600>Why supermarkets are popular nowadays:</font></h2>

Supermarkets have gained immense popularity in recent years due to their unmatched convenience, offering a diverse range of products all under one roof, from groceries to electronics and clothing. Shoppers are drawn to the abundant variety and choices available, enabling them to select from numerous brands and sizes. Furthermore, supermarkets leverage their purchasing power to provide competitive prices and frequent discounts, making them an attractive option for budget-conscious consumers. With extended operating hours, including late evenings and weekends, they cater to busy schedules. Additionally, their commitment to offering fresh produce, seamless technology integration, exceptional customer experiences, and community engagement initiatives have solidified their appeal. Supermarkets continuously innovate to meet evolving consumer preferences, including the demand for organic and eco-friendly products. Their globalization efforts have also made these shopping havens a familiar and trusted presence in international markets.

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Importing Necessary Libraries</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
First of all, I will import all the necessary libraries that we will use throughout the project. This generally includes libraries for data manipulation, data visualization, and others based on the specific needs of the project:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import scipy as sp
import warnings
import datetime
warnings.filterwarnings("ignore")
%matplotlib inline

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error


<a id="libraries"></a>
# <b><span style='color:#c78a44'> Loading the Dataset</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
Next, I will load the dataset into a pandas DataFrame which will facilitate easy manipulation and analysis:

In [None]:
df = pd.read_csv("/kaggle/input/super-market-sales/supermarket_sales.csv")

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

<h2 align="left"><font color=#006600>What each column in the dataset means:</font></h2>

| **Variable** | **Description** |
|---|---|
| **Invoice ID** | A unique identifier for each invoice or transaction. |
| **Branch** | The branch or location where the transaction occurred. |
| **City** | The city where the branch is located. |
| **Customer Type** | Indicates whether the customer is a regular or new customer. |
| **Gender** | The gender of the customer. |
| **Product Line** | The category or type of product purchased. |
| **Unit Price** | The price of a single unit of the product. |
| **Quantity** | The number of units of the product purchased. |
| **Tax 5%** | The amount of tax (5% of the total cost) applied to the transaction. |
| **Total** | The total cost of the transaction, including tax. |
| **Date** | The date when the transaction took place. |
| **Time** | The time of day when the transaction occurred. |
| **Payment** | The payment method used (e.g., credit card, cash). |
| **COGS (Cost of Goods Sold)** | The direct costs associated with producing or purchasing the products sold. |
| **Gross Margin Percentage** | The profit margin percentage for the transaction. |
| **Gross Income** | The total profit earned from the transaction. |
| **Rating** | Customer satisfaction rating or feedback on the transaction. |
    


Which column serves as the label depends on the question being asked. For instance, if you were interested in predicting customer satisfaction, Rating might be a suitable label. If you were trying to predict sales or revenue, Total or Gross Income could be a potential label.
</div>
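The monetary columns described above look arithmetically related, and a quick check can confirm that before any modelling. The sketch below runs such a check on a small hand-made frame. The identities it tests (`cogs` = `Unit price` × `Quantity`, `Tax 5%` = 5% of `cogs`, `Total` = `cogs` + `Tax 5%`) and the helper name `check_money_columns` are assumptions made for illustration, so it is worth running the same function on `df` to see whether they actually hold.

```python
import numpy as np
import pandas as pd


def check_money_columns(frame: pd.DataFrame) -> dict:
    """Report which of the ASSUMED column identities hold for every row."""
    expected_cogs = frame["Unit price"] * frame["Quantity"]
    expected_tax = 0.05 * frame["cogs"]
    expected_total = frame["cogs"] + frame["Tax 5%"]
    return {
        "cogs == Unit price * Quantity": bool(np.allclose(frame["cogs"], expected_cogs)),
        "Tax 5% == 0.05 * cogs": bool(np.allclose(frame["Tax 5%"], expected_tax)),
        "Total == cogs + Tax 5%": bool(np.allclose(frame["Total"], expected_total)),
    }


# Hand-made rows (not taken from the real file) that satisfy the identities.
toy = pd.DataFrame({
    "Unit price": [10.0, 20.0],
    "Quantity": [2, 3],
    "cogs": [20.0, 60.0],
    "Tax 5%": [1.0, 3.0],
    "Total": [21.0, 63.0],
})

checks = check_money_columns(toy)
print(checks)
```

If every check comes back `True` on the real data, several of these columns are deterministic functions of one another, which matters later when looking at correlations.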

In [None]:
df.head()


<a id="libraries"></a>
# <b><span style='color:#c78a44'> Initial Data Analysis</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
First I will perform a preliminary analysis to understand the structure and types of data columns:

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['Customer type'].nunique()

In [None]:
df['Customer type'].value_counts()

In [None]:
df['Branch'].value_counts()

In [None]:
df['City'].value_counts()

In [None]:
df['Product line'].value_counts()

In [None]:
df['Payment'].value_counts()

In [None]:
from wordcloud import WordCloud
plt.subplots(figsize=(20,8))
wordcloud = WordCloud(background_color='White',width=1920,height=1080).generate(" ".join(df['Product line']))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('cast.png')
plt.show()

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Checking if there are any missing values</span></b>

![](https://e7.pngegg.com/pngimages/875/142/png-clipart-missing-data-diagram-information-imputation-marketing-others-miscellaneous-company-thumbnail.png)

In [None]:
df.isnull().sum()
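Raw counts are easier to judge next to the share of rows they represent. Here is a minimal sketch that reports both per column, using a hand-made frame rather than the real file; the helper name `missing_summary` is hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    return summary.sort_values("missing", ascending=False)


# Hand-made example rows with two deliberate gaps.
toy = pd.DataFrame({
    "Rating": [9.1, np.nan, 7.4, 8.0],
    "City": ["Yangon", "Mandalay", None, "Yangon"],
    "Quantity": [1, 2, 3, 4],
})

report = missing_summary(toy)
print(report)
```

Calling `missing_summary(df)` on the notebook's DataFrame would give the same table for the real columns.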

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Exploratory Data Analysis (EDA)</span></b>


<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
EDA (Exploratory Data Analysis) in simple words is like being a detective for data. It's the process of examining and understanding a dataset before you start building models or making decisions based on the data.

EDA is like exploring a new place, looking for clues, and making sense of what you find before making any important decisions. It's a crucial step in the data analysis process.

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

<h2 align="left"><font color=#006600>SCATTER PLOT</font></h2>

A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axis indicates values for an individual data point. Scatter plots are used to observe relationships between variables.

In [None]:
sns.scatterplot(data=df, x='Unit price', y='Rating',hue='Gender',style='Customer type')

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

<h2 align="left"><font color=#006600>BOXPLOT</font></h2>

A boxplot is a standardized way of displaying the distribution of data based on a five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It can also tell you whether your data is symmetrical, how tightly it is grouped, and if and how it is skewed.

In [None]:
plt.figure(figsize=(14,10))
sns.set_style(style='whitegrid')
plt.subplot(2,3,1)
sns.boxplot(x='Unit price',data=df)
plt.subplot(2,3,2)
sns.boxplot(x='Quantity',data=df)
plt.subplot(2,3,3)
sns.boxplot(x='Total',data=df)
plt.subplot(2,3,4)
sns.boxplot(x='cogs',data=df)
plt.subplot(2,3,5)
sns.boxplot(x='Rating',data=df)
plt.subplot(2,3,6)
sns.boxplot(x='gross income',data=df)
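The numbers behind a boxplot can also be computed directly, which helps when reading the plots. The sketch below works on a small made-up sample and assumes the common 1.5 × IQR rule for the whisker limits; the function name `five_number_summary` is my own.

```python
import numpy as np


def five_number_summary(values):
    """Five-number summary plus the 1.5 * IQR fences used to flag outliers."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    return {
        "min": values.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values.max(),
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


# Made-up sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(sample)

# Points outside the fences are the ones a boxplot draws as individual dots.
outliers = [v for v in sample if v < summary["lower_fence"] or v > summary["upper_fence"]]
print(summary)
print(outliers)
```

The same function can be applied to any numeric column, for example `five_number_summary(df['Total'])`.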

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

<h2 align="left"><font color=#006600>KDEPLOT</font></h2>
    
kdeplot is a data visualization technique that employs Kernel Density Estimation (KDE) to estimate and display the probability density function of continuous data. It produces a smoothed, continuous curve that reveals the underlying distribution's shape and characteristics. This method is particularly useful for exploring data patterns, identifying peaks, and visualizing the density of both univariate and bivariate data. kdeplot offers a complementary perspective to histograms and aids in understanding the distribution of data in a more detailed and visually appealing manner.

In [None]:
plt.figure(figsize=(14,10))
sns.set_style(style='whitegrid')
plt.subplot(2,3,1)
sns.kdeplot(x='Unit price',data=df)
plt.subplot(2,3,2)
sns.kdeplot(x='Quantity',data=df)
plt.subplot(2,3,3)
sns.kdeplot(x='Total',data=df)
plt.subplot(2,3,4)
sns.kdeplot(x='cogs',data=df)
plt.subplot(2,3,5)
sns.kdeplot(x='Rating',data=df)
plt.subplot(2,3,6)
sns.kdeplot(x='gross income',data=df)
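To make the idea of kernel density estimation concrete, here is a minimal sketch that builds a density curve by hand with `scipy.stats.gaussian_kde`. It assumes a Gaussian kernel with SciPy's default bandwidth and uses randomly generated numbers rather than the supermarket data, so the curve will not match the plots above exactly.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic sample standing in for a continuous column.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=10.0, size=500)

# Fit the estimator; the bandwidth is picked by SciPy's default rule.
kde = gaussian_kde(sample)

# Evaluate the smoothed density on an evenly spaced grid.
grid = np.linspace(sample.min() - 30, sample.max() + 30, 1000)
density = kde(grid)

# A probability density should integrate to roughly 1 over its range.
step = grid[1] - grid[0]
area = float(density.sum() * step)

# The location of the highest point of the curve (the estimated mode).
peak = float(grid[np.argmax(density)])
print(round(area, 3), round(peak, 1))
```

Plotting `grid` against `density` with `plt.plot` gives the same kind of smooth curve that `sns.kdeplot` draws.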

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

<h2 align="left"><font color=#006600>PAIRPLOT</font></h2>

A pairplot plots the pairwise relationships in a dataset. The pairplot function creates a grid of Axes such that each variable in the data is shared across the y-axes of a single row and the x-axes of a single column.

In [None]:
sns.pairplot(data=df)

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

<h2 align="left"><font color=#006600>BARPLOT</font></h2>

A barplot (or barchart) is one of the most common types of graphic. It shows the relationship between a numeric and a categoric variable. Each entity of the categoric variable is represented as a bar. The size of the bar represents its numeric value.



In [None]:
plt.style.use("default")
plt.figure(figsize=(5,5))
sns.barplot(x="Rating", y="Unit price", data=df[170:180])
plt.title("Rating vs Unit Price",fontsize=15)
plt.xlabel("Rating")
plt.ylabel("Unit Price")
plt.show()

In [None]:
plt.style.use("default")
plt.figure(figsize=(5,5))
sns.barplot(x="Rating", y="Quantity", data=df[170:180])
plt.title("Rating vs Quantity",fontsize=15)
plt.xlabel("Rating")
plt.ylabel("Quantity")
plt.show()

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Correlation</span></b>

![](https://www.mathsisfun.com/data/images/correlation-examples.svg)

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">

When we train any algorithm, the number of features and the correlation between them play an important role. If many features are highly correlated with each other, they carry largely redundant information, and training on all of them can hurt some models and make the results harder to interpret. Feature selection should therefore be done carefully. This dataset has few features, but it is still worth looking at the correlations.

In [None]:
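# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
df.corr(numeric_only=True)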

In [None]:
plt.figure(figsize=(12, 10))

# Correlate the numeric columns only; text columns cannot be correlated directly
sns.heatmap(df.corr(numeric_only=True), annot=True)
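A heatmap is good for an overview, but a short list of the strongly correlated pairs is easier to act on. The following is a rough sketch on synthetic data; the helper `high_corr_pairs` and the 0.9 threshold are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic columns: "Total" is an exact multiple of "cogs", "Rating" is unrelated noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "cogs": base,
    "Total": 1.05 * base,
    "Rating": rng.normal(size=200),
    "City": ["x"] * 200,  # text column, ignored by numeric_only=True
})

pairs = high_corr_pairs(toy)
print(pairs)
```

Pairs that show up here are candidates for dropping one of the two columns before training.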

In [None]:
# Let's find the categorical features
list_1=list(df.columns)

In [None]:
list_cate=[]
for i in list_1:
    if df[i].dtype=='object':
        list_cate.append(i)

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Label encoding</span></b>

![](https://miro.medium.com/max/772/1*QQe-4476Oy3_dI1vhb3dDg.png)

In [None]:
le=LabelEncoder()

In [None]:
for i in list_cate:
    df[i]=le.fit_transform(df[i])

In [None]:
df
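`LabelEncoder` assigns integer codes to the sorted unique values of a column. The sketch below shows the resulting mapping on a tiny hand-made frame and keeps one encoder per column, which is an optional variation on the loop above: with a separate encoder for each column, the codes can later be translated back with `inverse_transform`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hand-made rows for illustration only.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Payment": ["Cash", "Ewallet", "Credit card", "Cash"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for column in toy.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(toy[column])

# classes_ holds the original labels in the order of their integer codes.
gender_mapping = {str(label): code for code, label in enumerate(encoders["Gender"].classes_)}

# Translate the Payment codes back to the original strings.
restored = encoders["Payment"].inverse_transform(encoded["Payment"])

print(gender_mapping)
print(encoded)
```

Note that the integer codes imply an ordering that the original categories do not have, which some models are sensitive to.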

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Splitting The Data into Training And Testing Dataset</span></b>

In [None]:
y=df['Gender']
x=df.drop('Gender',axis=1)

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0,test_size=0.2)
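When the target classes are not equally frequent, it can help to keep their proportions the same in both parts of the split. This sketch shows the `stratify` argument of `train_test_split` on made-up data with a 70/30 class balance; it is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and an imbalanced binary target (70 zeros, 30 ones).
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 70 + [1] * 30)

# stratify=y keeps the class proportions equal in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

train_share = float(y_train.mean())  # share of class 1 in the training part
test_share = float(y_test.mean())    # share of class 1 in the test part
print(len(y_train), len(y_test), train_share, test_share)
```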

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Building Machine Learning Models</span></b>

<a id="libraries"></a>
# <b><span style='color:#c78a44'> 1. K Nearest Neighbor</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms based on the supervised learning technique.

The K-NN algorithm assumes that similar cases lie close together: it compares a new case with the available cases and puts it into the category of the cases it is most similar to.

The K-NN algorithm stores all the available data and classifies a new data point based on its similarity to the stored points. This means that when new data appears, it can easily be assigned to the best-suited category.

![](https://cambridgecoding.files.wordpress.com/2016/01/knn2.jpg)

In [None]:
knn=KNeighborsClassifier(n_neighbors=7)
knn.fit(x_train,y_train)

In [None]:
y_pred=knn.predict(x_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",knn.score(x_train,y_train)*100)
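Because K-NN measures similarity through distances, a feature with a large numeric range can dominate the result. One common remedy, shown here as a sketch on synthetic data rather than as part of the original workflow, is to standardize the features inside a pipeline before the classifier. The sketch also reports the test score next to the training score.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data; one feature is blown up to mimic mixed scales.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training data only, then applied before K-NN.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
scaled_knn.fit(X_train, y_train)

train_accuracy = scaled_knn.score(X_train, y_train)
test_accuracy = scaled_knn.score(X_test, y_test)
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```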

<a id="libraries"></a>
# <b><span style='color:#c78a44'> 2. Decision Tree</span></b>


<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g. whether a coin flip comes up heads or tails), each leaf node represents a class label (the decision taken after computing all features), and branches represent conjunctions of features that lead to those class labels. The paths from root to leaf represent classification rules.

![](https://regenerativetoday.com/wp-content/uploads/2022/04/dt.png)

In [None]:
dtree = DecisionTreeClassifier(max_depth=6, random_state=123,criterion='entropy')
dtree.fit(x_train,y_train)

In [None]:
y_pred=dtree.predict(x_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",dtree.score(x_train,y_train)*100)
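The classification rules a fitted tree has learned can be printed as text, which makes the "paths from root to leaf" idea visible. This sketch uses scikit-learn's built-in iris data and a shallow tree purely for illustration; the same `export_text` call should work on `dtree` with `feature_names=list(x.columns)`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Small built-in dataset used only to keep the example self-contained.
iris = load_iris()

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, criterion="entropy", random_state=123)
tree.fit(iris.data, iris.target)

# Each indented line is a test on a feature; "class:" lines are the leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```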

<a id="libraries"></a>
# <b><span style='color:#c78a44'> 3. Random Forest</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes becomes our model’s prediction.

A large number of relatively uncorrelated models (trees) operating as a committee will usually outperform any of the individual constituent models.

![](https://av-eks-blogoptimized.s3.amazonaws.com/33019random-forest-algorithm287548.png)

In [None]:
rfc=RandomForestClassifier()
rfc.fit(x_train,y_train)

In [None]:
y_pred=rfc.predict(x_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",rfc.score(x_train,y_train)*100)
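A high training score from a random forest needs careful reading, because fully grown trees can memorize their training data. The sketch below illustrates this with deliberately meaningless labels: the features are random noise and the target is independent of them, so there is nothing real to learn. All data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Random features and random labels: no genuine relationship exists.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(random_state=0)
forest.fit(X_train, y_train)

# Training accuracy reflects memorization; test accuracy reflects generalization.
train_accuracy = forest.score(X_train, y_train)
test_accuracy = forest.score(X_test, y_test)
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

The gap between the two numbers is the reason to compare models on `x_test` rather than on `x_train`.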

<a id="libraries"></a>
# <b><span style='color:#c78a44'> 4. Gradient Boosting Classifier</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
    
The GradientBoostingClassifier is a machine learning model designed for classification tasks. It utilizes gradient boosting, an ensemble technique, to combine the predictions of multiple weak classifiers sequentially. With features like weighted voting, adjustable learning rates, and regularization parameters, it provides robust and accurate solutions for a wide range of classification problems. It is particularly useful when dealing with complex datasets and has applications in spam detection, fraud prevention, and image classification, among others.

![](https://www.researchgate.net/publication/351542039/figure/fig1/AS:11431281172877200@1688685833363/Flow-diagram-of-gradient-boosting-machine-learning-method-The-ensemble-classifiers.png)

In [None]:
gbc=GradientBoostingClassifier()
gbc.fit(x_train,y_train)

In [None]:
y_pred=gbc.predict(x_test)
print("Classification Report is:\n",classification_report(y_test,y_pred))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred))
print("Training Score:\n",gbc.score(x_train,y_train)*100)
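The sequential nature of boosting can be observed by scoring the ensemble after each added tree. Here is a minimal sketch using `staged_predict` on synthetic data; the settings `n_estimators=50` and `learning_rate=0.1` are arbitrary values chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data so the example runs on its own.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

booster = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=0)
booster.fit(X_train, y_train)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 trees.
staged_accuracy = [
    accuracy_score(y_test, stage_pred)
    for stage_pred in booster.staged_predict(X_test)
]

for n_trees in (1, 10, 50):
    print(n_trees, "trees -> test accuracy", staged_accuracy[n_trees - 1])
```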

<a id="libraries"></a>
# <b><span style='color:#c78a44'> Which Is the Best Model?</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #d8b26e; font-size:130%; text-align:left">
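Judging by the training score, the Random Forest classifier comes out on top (100% training accuracy). Keep in mind that a perfect training score mainly shows that the trees have memorized the training data; the test-set classification reports above are the fairer basis for comparing the models.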
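A side-by-side comparison is easier when every model is scored the same way on held-out data. The sketch below does this for the four model types used in this notebook, but on synthetic data, so the ranking it prints says nothing about the supermarket dataset. Swapping in `x_train`, `x_test`, `y_train` and `y_test` would give the real comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=7),
    "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=123, criterion="entropy"),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "test": accuracy_score(y_test, model.predict(X_test)),
    }

# Rank by accuracy on the held-out test set, not on the training set.
best = max(scores, key=lambda name: scores[name]["test"])
for name, result in scores.items():
    print(f"{name}: train={result['train']:.3f}, test={result['test']:.3f}")
print("Best on test data:", best)
```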

<div style="border-radius:10px; padding: 15px; background-color: #9d8cd1; font-size:120%; text-align:left">
    
If you've made it this far, I hope you found my analysis enjoyable and informative.

If you found it helpful, please consider upvoting!

As a beginner, I welcome any suggestions and feedback in the comments section. Your input is highly valuable.

If you have any questions or uncertainties about any part of the notebook, please don't hesitate to leave a comment with your inquiries.

**Thank you for your time and attention!**