<a href="https://www.kaggle.com/code/aayushsin7a/big-mart-sales-prediction?scriptVersionId=143800469" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Objectives
- Perceive challenges early
- Motivate Sales Team
- Plan recruitment Strategies
- Aid Future Marketing plans
- Predict Sales Revenue

## Importing the dependencies 

In [None]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn import metrics
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")


## Data collection & Analysis 

In [None]:
# loading the dataset from csv file to a pandas Dataframe
big_mart_data = pd.read_csv(r'/kaggle/input/bigmart-sales-data/Train.csv')

In [None]:
# First 5 rows of a dataframe
big_mart_data.head(5)

# In Item Identifier column 
- FD - represents Food 
- DR - represents Drinks
- NC - represents Non-Consumables 
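
The prefix rule above can be sketched as a small lookup. This is only an illustration: `PREFIX_TO_CATEGORY`, `item_category`, and the sample identifiers are hypothetical names and values introduced here, not part of the notebook.

```python
# Hypothetical lookup table built from the three prefixes listed above
PREFIX_TO_CATEGORY = {
    "FD": "Food",
    "DR": "Drinks",
    "NC": "Non-Consumables",
}

def item_category(item_identifier):
    """Return the broad category encoded in the first two characters."""
    return PREFIX_TO_CATEGORY.get(item_identifier[:2], "Unknown")

# Illustrative identifiers in the style of the Item_Identifier column
sample_ids = ["FDA15", "DRC01", "NCD19"]
categories = [item_category(i) for i in sample_ids]
print(categories)
```

On the DataFrame itself, the same idea could be written as `big_mart_data['Item_Identifier'].str[:2].map(PREFIX_TO_CATEGORY)`.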

In [None]:
big_mart_data.Item_Identifier.nunique()

In [None]:
# The category prefix (FD, DR, NC) is the first two characters of the identifier
unique_items_identifier = big_mart_data.Item_Identifier.str[:2].unique()

In [None]:
unique_items_identifier

In [None]:
# Number of data points & features
big_mart_data.shape

In [None]:
# More dataset info
big_mart_data.info()

The output of `big_mart_data.info()` reads as follows:

- RangeIndex: 8523 entries, 0 to 8522: the DataFrame has 8523 rows, with row indices running from 0 to 8522.
- Data columns (total 12 columns): the DataFrame has 12 columns.
- Column Name: the name of each column.
- Non-Null Count: the number of non-null (non-missing) values in each column. For example, "Item_Weight" has 7060 non-null values out of 8523 rows, so some of its values are missing.
- Dtype: the data type of the values in each column. For example, "Item_Weight" is float64, "Outlet_Establishment_Year" is int64, and the text columns are object (usually strings or mixed data types).

From this information you can see which columns have missing values (e.g., "Item_Weight" and "Outlet_Size") and the data type of each column. Both are useful when cleaning and preparing the data.
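
As a quick worked example of reading the Non-Null Count, the number of missing "Item_Weight" values follows from the two figures quoted above. The variable names are made up for this sketch.

```python
n_rows = 8523                 # from the RangeIndex line
non_null_item_weight = 7060   # Non-Null Count reported for Item_Weight

# Missing values are simply the rows that are not counted as non-null
missing_item_weight = n_rows - non_null_item_weight
missing_share = missing_item_weight / n_rows

print(missing_item_weight)
print(f"{missing_share:.1%}")
```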

# Let us list categorical and numerical features separately

In [None]:
categorical_features = big_mart_data.select_dtypes(include=['object']).columns.to_list()
numerical_features = big_mart_data.select_dtypes(exclude=['object']).columns.to_list()

print("Categorical Features:",categorical_features)
print("Numerical Features:",numerical_features)

In [None]:
# Checking for missing values
big_mart_data.isnull().sum()

# Handling Missing Values 
- Mean -> average value
- Median -> Mid value
- Mode -> Most Occurring

## In our case we will impute the missing values with 'Mean' for Item_Weight & 'Mode' for Outlet_Size
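
To make the three statistics concrete, here is a minimal sketch using Python's standard `statistics` module on a toy list. The values are invented for illustration and are not taken from the dataset.

```python
import statistics

weights = [4, 8, 8, 10, 20]  # toy values, not from Train.csv

mean_w = statistics.mean(weights)      # average value
median_w = statistics.median(weights)  # middle value once sorted
mode_w = statistics.mode(weights)      # most frequent value

print(mean_w, median_w, mode_w)
```

The mean is pulled upward by the single large value, while the median and mode are not, which is one reason the choice of statistic matters for imputation.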

In [None]:
# mean value of Item_Weight column
big_mart_data.Item_Weight.mean()

In [None]:
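# Filling the missing "Item_Weight" values with the column mean
# (assigning the result back avoids relying on in-place modification of a column selection)
big_mart_data['Item_Weight'] = big_mart_data['Item_Weight'].fillna(big_mart_data['Item_Weight'].mean())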

In [None]:
# Computing the mode of "Outlet_Size" for each "Outlet_Type" (used for imputation below)
mode_of_outlet_size = big_mart_data.pivot_table(values='Outlet_Size',columns='Outlet_Type',aggfunc=(lambda x:x.mode()[0]))

In [None]:
mode_of_outlet_size

- We should not impute 'Outlet_Size' with a single overall mode.
- Instead, we take the mode of 'Outlet_Size' within each 'Outlet_Type'.
- For instance, the mode for Outlet_Type -> Grocery Store is Small, for Supermarket Type1 it is Small, and for Supermarket Type2 it is Medium.
- So we impute each missing Outlet_Size based on that row's Outlet_Type value.
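
The group-wise idea can be sketched on a tiny made-up frame. This uses `groupby` and `map` rather than the pivot table from the notebook, purely as an alternative illustration. The `toy` data and `mode_by_type` name are assumptions of this sketch.

```python
import pandas as pd

# Toy data (not from Train.csv): two outlet types, one missing size in each
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type2", "Supermarket Type2", "Supermarket Type2"],
    "Outlet_Size": ["Small", "Small", None, "Medium", None, "Medium"],
})

# Most frequent size within each outlet type (mode() ignores missing values)
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

# Fill only the missing entries, using the mode for that row's outlet type
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(toy["Outlet_Type"].map(mode_by_type))

print(toy)
```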

In [None]:
missing_values = big_mart_data.Outlet_Size.isnull()
print(missing_values)

In [None]:
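# mode_of_outlet_size[x] is a one-element column of the pivot table; .iloc[0] extracts the scalar mode
big_mart_data.loc[missing_values,'Outlet_Size'] = big_mart_data.loc[missing_values,'Outlet_Type'].apply(lambda x: mode_of_outlet_size[x].iloc[0])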


- The cell above fills in each missing "Outlet_Size" value with the mode of "Outlet_Size" for that row's "Outlet_Type".

In [None]:
# Re-check for missing values after Imputation
big_mart_data.isnull().sum()

# Data Analysis 


In [None]:
# Statistical Measures about the data
big_mart_data.describe()

# Numerical Features

In [None]:
numerical_features

In [None]:
sns.set()

In [None]:
# Item weight distribution
plt.figure(figsize=(4,4))
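# distplot is deprecated in recent seaborn releases; histplot with kde=True is the suggested replacement
sns.histplot(big_mart_data.Item_Weight, kde=True)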
plt.show()

- Average Item_Weight is around 12-13

In [None]:
# Item_Visibility distribution
plt.figure(figsize=(4,4))
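# distplot is deprecated in recent seaborn releases; histplot with kde=True is the suggested replacement
sns.histplot(big_mart_data.Item_Visibility, kde=True)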
plt.show()

- Average Item Visibility is around 0.06 

In [None]:
# Item_MRP distribution 
plt.figure(figsize=(4,4))
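# distplot is deprecated in recent seaborn releases; histplot with kde=True is the suggested replacement
sns.histplot(big_mart_data.Item_MRP, kde=True)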
plt.show()

- Average MRP is around 140 

In [None]:
# Outlet_Establishment_Year
plt.figure(figsize=(6,4))
sns.countplot(x=big_mart_data.Outlet_Establishment_Year)
plt.show()

- The average Outlet_Establishment_Year falls around the late 1990s to early 2000s
- The most records come from outlets established in 1985
- The fewest come from outlets established in 1998


In [None]:
# Item_Outlet_Sales
plt.figure(figsize=(4,4))
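# distplot is deprecated in recent seaborn releases; histplot with kde=True is the suggested replacement
sns.histplot(big_mart_data.Item_Outlet_Sales, kde=True)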
plt.show()

- Average Item_Outlet_Sales is close to 2000

# Categorical Features 

In [None]:
categorical_features

In [None]:
# Item_Fat_Content 
plt.figure(figsize=(6,6))
sns.countplot(x = big_mart_data.Item_Fat_Content)
plt.show()

- We need to clean the Item_Fat_Content column: 'Low Fat', 'low fat', and 'LF' all denote the same Low Fat category, and 'reg' denotes Regular

In [None]:
big_mart_data.Item_Fat_Content.value_counts()

In [None]:
# Let's standardize the values in the 'Item_Fat_Content' column
big_mart_data['Item_Fat_Content'] = big_mart_data['Item_Fat_Content'].replace(
{
    'low fat' : 'Low Fat',
    'LF' : 'Low Fat',
    'reg' : 'Regular'
})

In [None]:
# Item_Fat_Content 
plt.figure(figsize=(6,6))
sns.countplot(x = big_mart_data.Item_Fat_Content)
plt.show()

In [None]:
# Item_Type
# Calculate the count of each item type and sort in descending order
item_type_counts = big_mart_data['Item_Type'].value_counts().sort_values(ascending=False)

plt.figure(figsize=(12, 6))
ax = sns.countplot(data=big_mart_data, x='Item_Type', order=item_type_counts.index)
plt.xticks(rotation=45)
plt.xlabel('Item Type')
plt.ylabel('Count')
plt.title('Count of Items by Type (Descending Order)')

# Add count labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')

plt.show()

In [None]:
# Calculate the count of each Outlet_Size and sort in descending order
Outlet_Size_counts = big_mart_data['Outlet_Size'].value_counts().sort_values(ascending=False)

# Create a countplot for "Outlet_Size" in descending order
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=big_mart_data, x='Outlet_Size', order=Outlet_Size_counts.index)
plt.xticks(rotation=45)
plt.xlabel('Outlet_Size')
plt.ylabel('Count')
plt.title('Count of Outlet by size (Descending Order)')

# Add count labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')

plt.show()


In [None]:
# Calculate the count of each Outlet_Location_Type and sort in descending order
Outlet_Location_Type_counts = big_mart_data['Outlet_Location_Type'].value_counts().sort_values(ascending=False)

# Create a countplot for "Outlet_Location_Type" in descending order
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=big_mart_data, x='Outlet_Location_Type', order=Outlet_Location_Type_counts.index)
plt.xticks(rotation=45)
plt.xlabel('Outlet_Location_Type')
plt.ylabel('Count')
plt.title('Count of Outlet by Location Type (Descending Order)')

# Add count labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')

plt.show()

In [None]:
# Calculate the count of each Outlet_Type and sort in descending order
Outlet_Type_counts = big_mart_data['Outlet_Type'].value_counts().sort_values(ascending=False)

# Create a countplot for "Outlet_Type" in descending order
plt.figure(figsize=(12, 6))
ax = sns.countplot(data=big_mart_data, x='Outlet_Type', order=Outlet_Type_counts.index)
plt.xticks(rotation=45)
plt.xlabel('Outlet_Type')
plt.ylabel('Count')
plt.title('Count of Outlet by Type (Descending Order)')

# Add count labels on top of each bar
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', fontsize=10, color='black', xytext=(0, 5), textcoords='offset points')

plt.show()

# Data pre-processing

## Label Encoding

In [None]:
encoder = LabelEncoder()

In [None]:
categorical_features

In [None]:
# Item_Identifier
big_mart_data['Item_Identifier'] = encoder.fit_transform(big_mart_data['Item_Identifier'])
# Item_Fat_Content
big_mart_data['Item_Fat_Content'] = encoder.fit_transform(big_mart_data['Item_Fat_Content'])
# Item_Type
big_mart_data['Item_Type'] = encoder.fit_transform(big_mart_data['Item_Type'])
# Outlet_Identifier
big_mart_data['Outlet_Identifier'] = encoder.fit_transform(big_mart_data['Outlet_Identifier'])
# Outlet_Size
big_mart_data['Outlet_Size'] = encoder.fit_transform(big_mart_data['Outlet_Size'])
# Outlet_Location_Type
big_mart_data['Outlet_Location_Type'] = encoder.fit_transform(big_mart_data['Outlet_Location_Type'])
# Outlet_Type
big_mart_data['Outlet_Type'] = encoder.fit_transform(big_mart_data['Outlet_Type'])


In [None]:
big_mart_data.head(10)
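
To show what the integer codes above mean, here is a dependency-free sketch of label encoding. It assigns codes by the sorted order of the distinct values, which is how scikit-learn's `LabelEncoder` orders its classes. The helper `label_encode` and the sample list are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to an integer code, using sorted order of the distinct values."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes

sizes = ["Small", "Medium", "High", "Medium"]  # toy Outlet_Size-like values
codes, classes = label_encode(sizes)

print(classes)
print(codes)
```

Note that the codes follow alphabetical order, not any natural ordering of the categories, so "High" receives a smaller code than "Small". Tree-based models tend to cope with this, whereas linear models may read the codes as magnitudes.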

# Splitting Features and Target 

In [None]:
X = big_mart_data.drop(columns='Item_Outlet_Sales')
# columns= already names the axis, so axis=1 is only needed with the positional form drop('Item_Outlet_Sales', axis=1)

Y = big_mart_data['Item_Outlet_Sales']

print(X)


In [None]:
print(Y)

# Splitting the data into train and test


- X_train holds the training features
- Y_train holds the matching target values

- We hold out a test set (X_test, Y_test) to check how the model performs on data it has not seen
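
A rough sketch of what an 80/20 split does to the row indices, written with the standard library only. The helper `split_indices` is hypothetical. Rounding the test size up is an assumption intended to mirror how scikit-learn sizes the test set, and the shuffle uses Python's `random`, so the rows selected will differ from those chosen by `train_test_split`.

```python
import math
import random

def split_indices(n_samples, test_fraction, seed):
    """Shuffle row indices and cut them into a train part and a test part."""
    n_test = math.ceil(test_fraction * n_samples)  # assumed rounding rule
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # reproducible shuffle
    return indices[n_test:], indices[:n_test]

# 8523 rows, as reported by big_mart_data.info()
train_idx, test_idx = split_indices(8523, 0.2, seed=2)
print(len(train_idx), len(test_idx))
```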


In [None]:
X_train, X_test, Y_train, Y_test  = train_test_split(X,Y,random_state=2,test_size=0.2)


In [None]:
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

In [None]:
print(X.shape,X_train.shape, X_test.shape)

# Model Training 

In [None]:
regressor = XGBRegressor(n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42)

In [None]:
regressor.fit(X_train,Y_train)

# Evaluating the model 

# Prediction on training data

In [None]:

training_data_prediction = regressor.predict(X_train)

- Y_train is our actual Output values
- training_data_prediction is what the model predicted for the given input X_train data
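
Comparing actual and predicted values is usually summarized with R-squared, defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. Below is a minimal hand-rolled sketch. `r2_score_manual` and the toy numbers are made up for illustration, and the notebook itself relies on `metrics.r2_score`.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
good = r2_score_manual(y_true, [2.5, 0.0, 2.0, 8.0])   # predictions close to the truth
bad = r2_score_manual(y_true, [7.0, 7.0, 7.0, 7.0])    # worse than predicting the mean

print(good, bad)
```

The second case shows that R-squared is not bounded below by 0: predictions that do worse than the mean of the targets give a negative score.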

In [None]:
# R-squared value (at most 1; it can be negative when the model fits worse than predicting the mean)

In [None]:
r2_train = metrics.r2_score(Y_train,training_data_prediction)

In [None]:
print('R Squared value (train) : ', r2_train)

# Prediction on testing data 


In [None]:
test_data_prediction = regressor.predict(X_test)

In [None]:
r2_test = metrics.r2_score(Y_test,test_data_prediction)

In [None]:
print('R Squared value (test) : ', r2_test)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the model
linear_reg = LinearRegression()

# Fit the model
linear_reg.fit(X_train, Y_train)

# Make predictions
y_pred_linear = linear_reg.predict(X_test)

# Evaluate the model
mse_linear = mean_squared_error(Y_test, y_pred_linear)
r2_linear = r2_score(Y_test, y_pred_linear)

print("Linear Regression:")
print("Mean Squared Error:", mse_linear)
print("R-squared:", r2_linear)


In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the model
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
random_forest_reg.fit(X_train, Y_train)

# Make predictions
y_pred_rf = random_forest_reg.predict(X_test)

# Evaluate the model
mse_rf = mean_squared_error(Y_test, y_pred_rf)
r2_rf = r2_score(Y_test, y_pred_rf)

print("Random Forest Regression:")
print("Mean Squared Error:", mse_rf)
print("R-squared:", r2_rf)


In [None]:
#from sklearn.ensemble import RandomForestRegressor

# Initialize the model
random_forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
random_forest_reg.fit(X_train, Y_train)

# Get feature importances
feature_importances = random_forest_reg.feature_importances_

# Associate feature names with their importances
feature_names = X_train.columns

# Create a DataFrame to display feature importances
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Visualize feature importances (e.g., bar plot)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importance')
plt.show()


In [None]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Initialize the model with scaling
svr_reg = make_pipeline(StandardScaler(), SVR(C=1.0, epsilon=0.2))

# Fit the model
svr_reg.fit(X_train, Y_train)

# Make predictions
y_pred_svr = svr_reg.predict(X_test)

# Evaluate the model
mse_svr = mean_squared_error(Y_test, y_pred_svr)
r2_svr = r2_score(Y_test, y_pred_svr)

print("Support Vector Regression (SVR):")
print("Mean Squared Error:", mse_svr)
print("R-squared:", r2_svr)
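
The scaling step in the pipeline above standardizes each feature to zero mean and unit variance before the SVR sees it. A minimal sketch of that transformation for a single feature, using the population standard deviation. `standardize` and the toy values are assumptions of this sketch.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

raw = [2, 4, 4, 4, 5, 5, 7, 9]  # toy feature values
scaled = standardize(raw)

print(scaled)
```

Scaling matters for SVR because its kernel depends on distances between feature vectors, so a feature on a large scale (such as an MRP in the hundreds) would otherwise dominate one on a small scale (such as a visibility fraction).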


# Catboost

In [None]:
import catboost
from catboost import CatBoostRegressor, Pool


In [None]:
# Assuming you have your data in X_train, X_test, Y_train, and Y_test
train_data = Pool(data=X_train, label=Y_train)
test_data = Pool(data=X_test, label=Y_test)

In [None]:
# Initialize the CatBoostRegressor with optional hyperparameters
catboost_reg = CatBoostRegressor(iterations=1000,  # Number of boosting iterations
                                 depth=6,         # Depth of each tree
                                 learning_rate=0.1,  # Learning rate
                                 loss_function='RMSE',  # Loss function
                                 verbose=200)     # Print progress every 200 iterations

# Parameter notes:
#   iterations: the number of boosting iterations
#   depth: the depth of each tree
#   learning_rate: the learning rate for gradient boosting
#   loss_function: the loss function to optimize (e.g., 'RMSE' for regression)
#   verbose: print progress every specified number of iterations
#   early_stopping_rounds: stop training if the evaluation metric has not improved
#       on the eval set for this many rounds

# Fit the model to the training data.
# Note: the test set is also used as the early-stopping eval set here,
# so the test metrics reported below may be somewhat optimistic.
catboost_reg.fit(train_data, eval_set=test_data, early_stopping_rounds=50, verbose=200)

# Make predictions on the test data
y_pred = catboost_reg.predict(test_data)

# Evaluate the Model:

- You can evaluate the model using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared.
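
For reference, MSE is the mean of the squared errors and RMSE is its square root, which brings the error back to the units of the target. A small hand-computed sketch on toy numbers follows. `mse_manual` and the values are illustrative only.

```python
import math

def mse_manual(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, -0.5, 2.0, 7.0]
y_hat = [2.5, 0.0, 2.0, 8.0]

mse_value = mse_manual(y_true, y_hat)
rmse_value = math.sqrt(mse_value)

print(mse_value, rmse_value)
```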


In [None]:
#from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(Y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(Y_test, y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)


# Perform Cross Validation and then check the result 
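
As a sketch of what 5-fold cross-validation does under the hood, the rows are cut into k folds and each fold takes one turn as the validation set while the rest are used for training. The generator below assumes contiguous, unshuffled folds, which is my understanding of scikit-learn's default for an integer `cv` with a regressor. `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # the first `extra` folds get one more row
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(kfold_indices(10, 3))
for train, val in folds:
    print(len(train), val)
```

Every row lands in exactly one validation fold, so each fold's score is computed on data the model did not train on in that round.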

In [None]:
from sklearn.model_selection import cross_val_score
# Perform cross-validation
cv_scores = cross_val_score(catboost_reg, X_train, Y_train, cv=5, scoring='neg_mean_squared_error')

# Negate the scores (scikit-learn reports negative MSE) and take the square root to get RMSE
rmse_scores = np.sqrt(-cv_scores)

# Print RMSE scores for each fold
print("Cross-Validation RMSE Scores:", rmse_scores)
print("Mean RMSE:", np.mean(rmse_scores))

# Fit the model to the entire training data
catboost_reg.fit(X_train, Y_train)

# Make predictions on the test data
y_pred = catboost_reg.predict(X_test)
