<a href="https://colab.research.google.com/github/chrishunt11/Prediction-of-Product-Sales/blob/main/Prediction_of_Product_Sales.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction of Product Sales
Christopher Hunt

## Project Overview


### Link to original dataset from Analytics Vidhya: 
- [Analytics Vidhya Link](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/)

### Data Dictionary for this dataset:
- [Data Dictionary](https://drive.google.com/file/d/1zTSwo2__MqZsTqetwajXurSwDbwGnMd5/view?usp=drive_link)


## Load and Inspect Data

In [3]:
# Importing the necessary libraries
import pandas as pd
import numpy as np

# Reading the csv file using pandas then assigning it to df
df = pd.read_csv('/content/drive/MyDrive/CodingDojo/01-Fundamentals/Week02/Data/sales_predictions_2023.csv')

# Viewing the first 5 rows in the DataFrame
df.head()

|   | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [4]:
# Taking a look at how many rows, columns in the dataset
df.shape

(8523, 12)

In [5]:
# looking at the basic information in the dataset (rows, columns, non-null count, dtype)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [6]:
# Creating a copy of the dataframe before making any changes
df2 = df.copy()

## Clean Data

#### Checking the number of rows and columns

In [7]:
# Checking for rows, columns
df.shape

(8523, 12)

There are 8,523 rows and 12 columns.

#### Data types of each variable



In [8]:
# checking the data types for each variable
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

#### Checking for duplicates

In [9]:
# Checking for any duplicates 
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
8518    False
8519    False
8520    False
8521    False
8522    False
Length: 8523, dtype: bool

In [10]:
# Finding the sum for duplicated values
df.duplicated().sum()

0

There are no duplicate rows in this dataset.
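
To make the check above concrete, here is a minimal sketch of what `duplicated().sum()` counts. It uses a made-up three-row frame named `toy` (hypothetical values, not the sales data), so the numbers only illustrate the mechanics.

```python
import pandas as pd

# Toy frame (hypothetical values, not the sales data): rows 0 and 2 are identical
toy = pd.DataFrame({'Item_Identifier': ['FDA15', 'DRC01', 'FDA15'],
                    'Item_MRP': [249.81, 48.27, 249.81]})

# duplicated() flags each row that repeats an earlier row,
# so the sum is the number of extra copies, not the number of distinct rows
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print(n_dupes, deduped.shape)
```

Because `duplicated()` compares whole rows by default, two rows that share an `Item_Identifier` but differ in any other column are not counted as duplicates.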

#### Identifying Missing Values

##### Separating numerical and categorical columns

In [11]:
# Separating the numeric and categorical columns
cat_cols = df.select_dtypes('object').columns
num_cols = df.select_dtypes('number').columns
print(f'Categorical columns: {cat_cols} \n\n Numeric columns: {num_cols}')

Categorical columns: Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object') 

 Numeric columns: Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Item_Outlet_Sales'],
      dtype='object')


##### Addressing the missing values with a placeholder (categorical) or the column mean (numerical)

###### Categorical columns

In [12]:
# Checking the sum of the NaN values in cat_cols
df[cat_cols].isna().sum()

Item_Identifier            0
Item_Fat_Content           0
Item_Type                  0
Outlet_Identifier          0
Outlet_Size             2410
Outlet_Location_Type       0
Outlet_Type                0
dtype: int64

In [13]:
# Taking a look at Outlet Size values
df['Outlet_Size'].value_counts(dropna=False)

Medium    2793
NaN       2410
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [14]:
# Filling in 'MISSING' for the NaN values in Outlet_Size
df['Outlet_Size'] = df['Outlet_Size'].fillna('MISSING')
df['Outlet_Size'].value_counts(dropna=False)

Medium     2793
MISSING    2410
Small      2388
High        932
Name: Outlet_Size, dtype: int64
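
One way to sanity-check a placeholder fill is to confirm that the placeholder count equals the earlier NaN count and that no NaN values remain. The sketch below does this on a small made-up column (the frame `toy` and its values are assumptions for illustration, not the sales data).

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): two of the five outlet sizes are missing
toy = pd.DataFrame({'Outlet_Size': ['Medium', np.nan, 'Small', np.nan, 'High']})

# Record the NaN count before filling so it can be compared afterwards
n_missing_before = int(toy['Outlet_Size'].isna().sum())

# Same placeholder strategy as the cell above
toy['Outlet_Size'] = toy['Outlet_Size'].fillna('MISSING')

# After the fill, 'MISSING' should appear exactly as often as NaN did
n_placeholder = int((toy['Outlet_Size'] == 'MISSING').sum())
n_missing_after = int(toy['Outlet_Size'].isna().sum())

print(n_missing_before, n_placeholder, n_missing_after)
```

The placeholder keeps the affected rows in the dataset and turns "size unknown" into its own category, rather than guessing one of the existing sizes.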

###### Numerical Columns

In [15]:
# Checking the sum of the NaN values in num_cols
df[num_cols].isna().sum()

Item_Weight                  1463
Item_Visibility                 0
Item_MRP                        0
Outlet_Establishment_Year       0
Item_Outlet_Sales               0
dtype: int64

In [16]:
# Taking a look at 'Item Weight' values
df['Item_Weight'].value_counts(dropna=False)

NaN       1463
12.150      86
17.600      82
13.650      77
11.800      76
          ... 
7.275        2
7.685        1
9.420        1
6.520        1
5.400        1
Name: Item_Weight, Length: 416, dtype: int64

In [17]:
# Finding the stats behind the Item_Weight column
df['Item_Weight'].describe()

count    7060.000000
mean       12.857645
std         4.643456
min         4.555000
25%         8.773750
50%        12.600000
75%        16.850000
max        21.350000
Name: Item_Weight, dtype: float64

In [24]:
# Assigning the mean of 'Item_Weight' to a variable
item_mean = df['Item_Weight'].mean()
item_mean

12.857645184135976

In [23]:
# Filling the missing values in the 'Item_Weight' column with the mean weight
df['Item_Weight'] = df['Item_Weight'].fillna(item_mean)
df['Item_Weight'].value_counts(dropna=False)

12.857645    1463
12.150000      86
17.600000      82
13.650000      77
11.800000      76
             ... 
7.275000        2
7.685000        1
9.420000        1
6.520000        1
5.400000        1
Name: Item_Weight, Length: 416, dtype: int64
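
Filling every gap with the overall mean leaves the column mean unchanged but gives all 1,463 filled rows the same weight. As a point of comparison only (this is not what the notebook does), the sketch below contrasts the global-mean fill with a per-item fill on a made-up frame. It assumes that rows sharing an `Item_Identifier` share a weight, which has not been verified for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): each item has one known and one missing weight
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'FDA15', 'DRC01', 'DRC01'],
    'Item_Weight': [9.3, np.nan, 5.92, np.nan],
})

# Strategy used in the notebook: one global mean for every missing value
global_fill = toy['Item_Weight'].fillna(toy['Item_Weight'].mean())

# Alternative (assumption: the same item has the same weight everywhere):
# compute the mean weight per item, aligned row by row with the original frame
per_item_mean = toy.groupby('Item_Identifier')['Item_Weight'].transform('mean')
item_fill = toy['Item_Weight'].fillna(per_item_mean)

print(global_fill.tolist())
print(item_fill.tolist())
```

An item with no known weight in any row would still be NaN after the per-item fill, so a global fallback would be needed for those rows.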

##### Confirming no more missing values

In [None]:
# Checking the entire DataFrame for any missing values
df.isna().sum()

#### Finding and fixing any inconsistencies

In [None]:
# Creating a variable for string columns
string_cols = df.select_dtypes('object').columns
string_cols

In [None]:
# Using a for loop to print out the value counts of each string column
for col in string_cols:
  print(f'Value count: {col}')
  print(df[col].value_counts())
  print('\n')

In [None]:
# Taking a look at 'Item_Fat_Content' values
df['Item_Fat_Content'].value_counts()

In [None]:
# Fixing the inconsistencies
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF' : 'Low Fat',
                                                         'reg' : 'Regular',
                                                         'low fat' : 'Low Fat'})

In [None]:
# Checking value counts for Item_Fat_Content again
df['Item_Fat_Content'].value_counts()
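
The effect of the mapping can be illustrated on a small made-up Series. The labels `'LF'`, `'reg'`, and `'low fat'` come from the cell above; the Series `toy` and its contents are hypothetical.

```python
import pandas as pd

# Toy Series (hypothetical rows) containing the inconsistent labels handled above
toy = pd.Series(['Low Fat', 'LF', 'reg', 'low fat', 'Regular'],
                name='Item_Fat_Content')

# Same mapping as the notebook: variant spelling -> canonical label
fat_map = {'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'}

# replace() only touches values that appear as keys; everything else is left as-is
cleaned = toy.replace(fat_map)

# Only the two canonical categories should remain
remaining = sorted(cleaned.unique())
print(remaining)
```

Listing the unique values after the replacement is a quick way to catch a variant spelling that the mapping missed.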

#### Printing the stats for the numerical columns

In [None]:
# Using a for loop to print out the stats for any numerical column
for col in num_cols:
  print(f'Stats for: {col}')
  print(df[col].describe())
  print('\n')

## Exploratory Data Analysis

## Explanatory Data Analysis