# Installation Command & Importing Libraries

Install the dataprep library, which is a data preparation library for data scientists.
The code snippet installs and imports several libraries and modules that are commonly used in data science and machine learning.
pandas is used for data manipulation; plotly.express, matplotlib.pyplot, and seaborn are used for data visualization.
LabelEncoder from sklearn.preprocessing is used for encoding categorical variables.
train_test_split from sklearn.model_selection is used for splitting data into training and testing sets.
mean_squared_error and r2_score from sklearn.metrics are used for evaluating the performance of regression models.

In [None]:
%pip install dataprep
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
#from dataprep.eda import create_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Reading CSV Files into DataFrames

**Loading the Features Data:**
Reading a CSV file named Features_data_set.csv from the specified Google Drive path using the pandas function read_csv.
The data from the CSV file is loaded into a pandas DataFrame named Features_Data.
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In [None]:
Features_Data = pd.read_csv("/content/drive/MyDrive/Data Set/Features_data_set.csv")

**Loading the Sales Data:**
Reading a CSV file named sales_data_set.csv from the specified Google Drive path using the pandas function read_csv.
The data from the CSV file is loaded into a pandas DataFrame named Sales_Data.

In [None]:
Sales_Data = pd.read_csv("/content/drive/MyDrive/Data Set/sales_data_set.csv")

**Loading the Stores Data:**
Reading a third CSV file named stores_data_set.csv from the specified Google Drive path using the pandas function read_csv.
The data from the CSV file is loaded into a pandas DataFrame named Stores_Data.

In [None]:
Stores_Data = pd.read_csv("/content/drive/MyDrive/Data Set/stores_data_set.csv")

**Purpose:** These lines of code are used to import data from three different CSV files into three separate pandas DataFrames for further analysis or manipulation.
**Function Used:** pd.read_csv is a pandas function that reads a comma-separated values (CSV) file into a DataFrame.
**File Paths:** The paths provided (/content/drive/MyDrive/Data Set/Features_data_set.csv, /content/drive/MyDrive/Data Set/sales_data_set.csv, /content/drive/MyDrive/Data Set/stores_data_set.csv) are assumed to be locations in Google Drive, typically used when running code on Google Colab.
By loading the data into DataFrames, you can now use various pandas functionalities to manipulate, analyze, and visualize the data as needed.
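
As a minimal sketch of what read_csv does, the example below parses a small in-memory CSV instead of a file on Google Drive. The column names and values are hypothetical stand-ins chosen for illustration; the real files may contain different columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the CSV files; the real columns may differ.
csv_text = """Store,Type,Size
1,A,151315
2,A,202307
3,B,37392
"""

# read_csv accepts a path or any file-like object, so StringIO works for a demo.
stores_demo = pd.read_csv(io.StringIO(csv_text))

print(stores_demo.shape)   # (rows, columns)
print(stores_demo.dtypes)  # dtypes are inferred per column
```

The same call works unchanged with a file path, which is how the three cells above use it.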

# Exploring the DataFrames

**View the First Few Rows of Features_Data:**
The head() method in pandas returns the first 5 rows of the Features_Data DataFrame by default.
This is useful to quickly inspect the structure and some initial records of the DataFrame.

In [None]:
Features_Data.head()

**View the Last Few Rows of Features_Data:**
The tail() method in pandas returns the last 5 rows of the Features_Data DataFrame by default.
This allows you to see the end of the DataFrame, which holds the most recent entries only if the rows are sorted by date.

In [None]:
Features_Data.tail()

**Get a Summary of Features_Data:**
The info() method provides a concise summary of the DataFrame, including:
The number of non-null entries for each column.
The data type of each column.
The memory usage of the DataFrame.
This is helpful for understanding the general structure and completeness of the data.

In [None]:
Features_Data.info()

**View the First Few Rows of Sales_Data:**
Similar to Features_Data.head(), this line returns the first 5 rows of the Sales_Data DataFrame.

In [None]:
Sales_Data.head()

**View the Last Few Rows of Sales_Data:**
Similar to Features_Data.tail(), this line returns the last 5 rows of the Sales_Data DataFrame.

In [None]:
Sales_Data.tail()

**Get a Summary of Sales_Data:**
Similar to Features_Data.info(), this line provides a concise summary of the Sales_Data DataFrame.

In [None]:
Sales_Data.info()

**View the First Few Rows of Stores_Data:**
This line returns the first 5 rows of the Stores_Data DataFrame.

In [None]:
Stores_Data.head()

**View the Last Few Rows of Stores_Data:**
This line returns the last 5 rows of the Stores_Data DataFrame.

In [None]:
Stores_Data.tail()

**Get a Summary of Stores_Data:**
Similar to Features_Data.info(), this line provides a concise summary of the Stores_Data DataFrame.

In [None]:
Stores_Data.info()

**Get Descriptive Statistics of Stores_Data:**
The describe() method generates descriptive statistics of the DataFrame, such as:
Count,
Mean,
Standard Deviation,
Minimum,
25th percentile,
Median (50th percentile),
75th percentile &
Maximum.
The ".T" transposes the result, switching rows and columns for a more readable format.
This is useful for understanding the distribution and basic statistics of the numerical columns in the DataFrame.

In [None]:
Stores_Data.describe().T

**Get Descriptive Statistics of the 'Type' Column in Stores_Data:**
The describe() method, when applied to a specific column, generates descriptive statistics for that column.
For the 'Type' column, which is likely categorical, it will provide:
Count,
Unique values count,
Top (most frequent) value &
Frequency of the top value.
Because describing a single column returns a pandas Series, ".T" has no effect here; it only changes the output when multiple columns are described together.

In [None]:
Stores_Data.Type.describe().T

**Purpose:** These lines of code are used to explore and understand the structure, completeness, and basic statistics of the data in each DataFrame.
**Methods Used:**
**head():** View the first few rows.
**tail():** View the last few rows.
**info():** Get a summary including data types and non-null counts.
**describe():** Generate descriptive statistics.
**.T:** Transpose the result for better readability.
**Outcome:** These commands help to quickly get an overview of the data, which is essential before proceeding with further analysis or manipulation.
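
The sketch below runs describe() on a small made-up table to show the two output shapes discussed above: a transposed summary for numeric columns and a count/unique/top/freq summary for a text column. The frame stores_demo and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for Stores_Data; values are made up for illustration.
stores_demo = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Type": ["A", "A", "B", "C"],
    "Size": [150000, 200000, 40000, 35000],
})

numeric_summary = stores_demo.describe().T     # one row per numeric column
type_summary = stores_demo["Type"].describe()  # count, unique, top, freq

print(numeric_summary)
print(type_summary)

# A Series has a single axis, so .T hands back the same data unchanged.
print(type_summary.T.equals(type_summary))
```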

# Merging DataFrames

**Merge Stores_Data and Sales_Data:**
**pd.merge():** This function is used to merge two DataFrames.
**Stores_Data:** The first DataFrame to be merged.
**Sales_Data:** The second DataFrame to be merged.
**on='Store':** This specifies that the merge should be done on the 'Store' column, which should be present in both DataFrames.
**how='outer':** This specifies the type of merge to be performed. 'outer' merge returns all rows from both DataFrames, with NaNs in places where a row from one DataFrame does not have a corresponding row in the other DataFrame.
The result, df, is a new DataFrame that contains all columns from both Stores_Data and Sales_Data, merged on the 'Store' column.

In [None]:
df = pd.merge(Stores_Data, Sales_Data, on='Store', how='outer')

**Merge the Resulting DataFrame (df) with Features_Data & View the Last Few Rows of df1:**
**pd.merge():** Again, this function is used to merge DataFrames.
**df:** The DataFrame resulting from the previous merge (combination of Stores_Data and Sales_Data).
**Features_Data:** The third DataFrame to be merged.
**on=['Store','Date']:** This specifies that the merge should be done on both the 'Store' and 'Date' columns. Both these columns should be present in the DataFrames being merged.
**how='outer':** Similar to the previous merge, this is an 'outer' merge. It ensures that all rows from df and Features_Data are included in the final DataFrame, with NaNs in places where a row from one DataFrame does not have a corresponding row in the other.
The result, df1, is a new DataFrame that contains all columns from df and Features_Data, merged on the 'Store' and 'Date' columns.


**df1.tail():** This method returns the last 5 rows of the df1 DataFrame by default.
This is useful to inspect the final rows of the merged DataFrame and verify that the merge operation was performed correctly.

In [None]:
df1 = pd.merge(df, Features_Data, on=['Store','Date'], how='outer')
df1.tail()

**First Merge:** Combines Stores_Data and Sales_Data based on the 'Store' column.
The result (df) contains all rows from both DataFrames with NaNs where there are mismatches.

**Second Merge:** Combines df (the result of the first merge) with Features_Data based on both 'Store' and 'Date' columns.
The result (df1) includes all rows from both DataFrames with NaNs where there are mismatches.

**Inspecting the Result:** df1.tail(): Shows the last few rows of the merged DataFrame df1 to verify the merge operation.

These steps ensure that all relevant data from Stores_Data, Sales_Data, and Features_Data are combined into a single DataFrame (df1), which can then be used for further analysis or processing.
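
To make the outer-merge behaviour concrete, here is a small sketch on two made-up tables. It assumes, as the later IsHoliday_x and IsHoliday_y column names suggest, that both inputs carry an IsHoliday column; pandas then appends its default suffixes _x and _y. The frames sales_demo and features_demo and the indicator=True flag are additions for illustration and are not part of the notebook's own merge.

```python
import pandas as pd

# Hypothetical miniature versions of the sales and features tables.
sales_demo = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": ["05/02/2010", "12/02/2010", "05/02/2010"],
    "Weekly_Sales": [24924.50, 46039.49, 35034.06],
    "IsHoliday": [False, True, False],
})
features_demo = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": ["05/02/2010", "12/02/2010", "05/02/2010", "12/02/2010"],
    "Temperature": [42.31, 38.51, 40.19, 38.49],
    "IsHoliday": [False, True, False, True],
})

# indicator=True adds a _merge column recording where each row came from.
merged_demo = pd.merge(
    sales_demo, features_demo, on=["Store", "Date"], how="outer", indicator=True
)

print(merged_demo.columns.tolist())
print(merged_demo["_merge"].value_counts())
print(merged_demo["Weekly_Sales"].isnull().sum())  # rows with no matching sale
```

Rows marked right_only exist only in the features table, which is why their Weekly_Sales value is NaN after the merge.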

# Converting Date Column to DateTime Format

**Convert 'Date' Column to DateTime:**
pd.to_datetime(): This function converts a column to datetime format.
df1.Date: The 'Date' column in the df1 DataFrame.
format="%d/%m/%Y": Specifies the format of the date strings in the 'Date' column (day/month/year).
This line converts the 'Date' column from string format to pandas datetime format, which makes it easier to perform date-related operations.

In [None]:
df1.Date = pd.to_datetime(df1.Date, format="%d/%m/%Y")

**View the First Few Rows of df1:**
df1.head(): This method returns the first 5 rows of the df1 DataFrame.
This is useful for verifying that the date conversion has been applied correctly and inspecting the initial rows of the DataFrame.

In [None]:
df1.head()

**View Unique Dates:**
df1.Date.unique(): This method returns an array of unique values in the 'Date' column.
This is useful for checking the range and distinct values of dates in the DataFrame.

In [None]:
df1.Date.unique()

**Drop Rows with Dates After November 1, 2012:**
df1['Date'] > '2012-11-01': This condition checks each row's 'Date' column to see if it is greater than '2012-11-01'.
df1[df1['Date'] > '2012-11-01'].index: This extracts the indices of rows where the condition is true.
df1.drop(..., inplace=True): This drops the rows with the specified indices from the DataFrame and inplace=True means the operation is performed on the original DataFrame (df1) without returning a new DataFrame.
This line removes all rows from df1 where the date is later than November 1, 2012.

In [None]:
df1.drop(df1[df1['Date'] > '2012-11-01'].index, inplace=True)

**View the Last Few Rows of df1:**
df1.tail(): This method returns the last 5 rows of the df1 DataFrame.
This is useful for verifying that the rows with dates after November 1, 2012, have been successfully removed and inspecting the final rows of the DataFrame.

In [None]:
df1.tail()

**View Unique Dates Again:**
df1.Date.unique(): This method returns an array of unique values in the 'Date' column.
This is useful for verifying the range and distinct values of dates in the DataFrame after the rows with dates later than November 1, 2012, have been dropped.

In [None]:
df1.Date.unique()

**Convert 'Date' Column:** Converts the 'Date' column from string to datetime format.
**View Initial Rows:** Inspects the first few rows to verify the conversion.
**Check Unique Dates:** Lists unique dates to understand the range and distinct values.
**Drop Later Dates:** Removes rows where the date is later than November 1, 2012.
**View Final Rows:** Inspects the last few rows to confirm the rows have been dropped.
**Check Unique Dates Again:** Verifies the range and distinct values of dates after dropping rows.
These steps ensure that the DataFrame df1 has the 'Date' column in the correct format and only includes data up to November 1, 2012. This is important for maintaining a consistent and relevant dataset for further analysis.
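
The conversion and the date filter can be sketched on a few made-up rows. This version keeps rows with a boolean mask instead of dropping by index; the two approaches give the same result when the index labels are unique. The frame dates_demo and the name cutoff are hypothetical.

```python
import pandas as pd

# Hypothetical rows with day/month/year strings, as in the 'Date' column.
dates_demo = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": ["26/10/2012", "02/11/2012", "09/11/2012"],
})

# An explicit format avoids ambiguity between day-first and month-first dates.
dates_demo["Date"] = pd.to_datetime(dates_demo["Date"], format="%d/%m/%Y")

cutoff = pd.Timestamp("2012-11-01")
kept = dates_demo[dates_demo["Date"] <= cutoff].copy()  # keep rows up to the cutoff

print(kept)
```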

# Exploratory Data Analysis (EDA)

View the First Few Rows of df1.
View the Last Few Rows of df1.
Get DataFrame Information.
List Column Names.

In [None]:
df1.head()
df1.tail()
df1.info()
df1.columns

**Define Continuous Columns:**
A list of column names that represent continuous numerical data.
These are columns that will likely be used for numerical analysis and statistical operations.

In [None]:
continuous_columns = ['Size', 'Weekly_Sales', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
                      'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment']

**Define Categorical Columns:**
A list of column names that represent categorical data.
These columns are typically used for grouping, aggregation, and categorical analysis.

In [None]:
category_columns = ["Store",  "Type", "IsHoliday_x", "IsHoliday_y", "Dept"]

**Select String Columns:**
df1.select_dtypes(exclude=['int64', "float64", "datetime64"]): Selects columns that are not of types int64, float64, or datetime64.
.columns: Gets the column names of the selected columns.
This identifies the non-numeric columns: object/string columns and any other remaining dtype, such as boolean.

In [None]:
string_columns = df1.select_dtypes(exclude=['int64', "float64", "datetime64"]).columns

**Select Numeric Columns:**
df1.select_dtypes(include=['int64', "float64"]): Selects columns that are of types int64 or float64.
.columns: Gets the column names of the selected columns.
This identifies columns that are numerical.

In [None]:
numeric_columns = df1.select_dtypes(include=['int64', "float64"]).columns

**Descriptive Statistics for Numeric Columns:**
df1.describe(): This method generates descriptive statistics for numeric columns (e.g., count, mean, std, min, max, and percentiles).
.T: Transposes the DataFrame for a better view (columns become rows).
Provides a statistical summary of the numerical columns in the DataFrame.

In [None]:
df1.describe().T

**Descriptive Statistics for String Columns:**
df1[string_columns].describe(): Generates descriptive statistics for string columns (e.g., count, unique, top, freq).
.T: Transposes the DataFrame for a better view (columns become rows).
Provides a summary of the categorical/string columns.

In [None]:
df1[string_columns].describe().T

**Count Missing Values:**
df1.isnull(): Checks each element of the DataFrame for missing values (returns a DataFrame of the same shape with True for missing values and False otherwise).
.sum(): Sums the True values for each column (True is treated as 1).
This line provides the number of missing values in each column, helping identify columns that need data cleaning or imputation.

In [None]:
df1.isnull().sum()

**View Data:** head(), tail(), info() and columns help understand the structure, data types, and initial entries of the DataFrame.
**Define Column Categories:** continuous_columns and category_columns categorize the columns for easier analysis.
**Select Columns by Data Type:** select_dtypes helps identify string and numeric columns separately.
**Descriptive Statistics:** describe().T gives a summary of numeric and string columns.
**Check Missing Values:** isnull().sum() identifies columns with missing values, guiding the need for data cleaning.
These steps are foundational for EDA, helping you understand your dataset's structure, contents, and any potential issues such as missing values.
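
A small sketch of the dtype selection and missing-value count on made-up data follows. It illustrates that excluding int64, float64, and datetime64 keeps boolean columns as well as text columns. The frame eda_demo and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing integer, text, boolean, and float columns.
eda_demo = pd.DataFrame({
    "Store": [1, 2, 3],
    "Type": ["A", "B", "A"],
    "IsHoliday_y": [False, True, False],
    "MarkDown1": [np.nan, 10.5, np.nan],
}).astype({"Store": "int64", "MarkDown1": "float64"})

non_numeric = eda_demo.select_dtypes(
    exclude=["int64", "float64", "datetime64"]
).columns.tolist()
numeric = eda_demo.select_dtypes(include=["int64", "float64"]).columns.tolist()

missing = eda_demo.isnull().sum()  # True counts as 1, so this counts NaNs per column

print(non_numeric)
print(numeric)
print(missing)
```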

# Correlation Matrix for Numeric Columns

**Calculate Correlation Matrix for Numeric Columns:**
df1[numeric_columns].corr(): This computes the pairwise correlation of columns specified in numeric_columns.
Correlation is a statistical measure that indicates the extent to which two variables fluctuate together.

In [None]:
df1[numeric_columns].corr()

**Display String Columns:**
string_columns: This simply outputs the list of string columns identified earlier.
This helps verify which columns are considered string-type for further processing.

In [None]:
string_columns

**Create a Copy of df1:**
df2 = df1.copy(): This creates a copy of df1 named df2.
It ensures that any operations performed on df2 do not affect the original DataFrame df1.

In [None]:
df2 = df1.copy()

**Initialize Label Encoder:**
encode = LabelEncoder(): This initializes a LabelEncoder instance from sklearn.preprocessing.
Label encoding converts categorical string data into numerical data, which is necessary for many machine learning algorithms.

In [None]:
encode = LabelEncoder()

**Encode Each String Column:**
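This loop iterates over each column in string_columns.
encode.fit(df2[column]): Fits the encoder to the unique values in the column. LabelEncoder expects 1-dimensional input, so the column is passed as a Series (df2[column]) rather than as a one-column DataFrame (df2[[column]]).
df2[column] = encode.transform(df2[column]): Replaces the original values with their integer labels.

In [None]:
for column in string_columns:
  # LabelEncoder expects 1-D input, so pass the Series rather than a one-column DataFrame
  encode.fit(df2[column])
  df2[column] = encode.transform(df2[column])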

**Display the First Few Rows of df2:**
df2.head(): Shows the first 5 rows of the DataFrame df2 to verify the encoding process.

In [None]:
df2.head()

**Create and Display Correlation Matrix Plot:**
correlation_matrix = df2.corr(): Computes the correlation matrix for df2.
fig = px.imshow(correlation_matrix, color_continuous_scale='Viridis', title="Correlation Matrix"): Uses Plotly Express to create a heatmap of the correlation matrix.
fig.show(): Displays the heatmap.

In [None]:
correlation_matrix = df2.corr()
fig = px.imshow(correlation_matrix, color_continuous_scale='Viridis', title="Correlation Matrix")
fig.show()

**Generate EDA Report (Commented Out):**
create_report(df1): This would generate an EDA report using dataprep.eda (if uncommented).
It provides comprehensive insights into the DataFrame, including data distributions, missing values, and correlations.

In [None]:
#create_report(df1)

**Define Function to Plot Box Plot and Histogram:**
def plot(df, column): Defines a function plot that takes a DataFrame df and a column name column.
plt.figure(figsize=(20,5)): Creates a new figure with specified size.
plt.subplot(1,2,1): Creates the first subplot in a 1x2 grid.
sns.boxplot(data=df, x=column): Creates a box plot for the specified column.
plt.title(f'Box Plot for {column}'): Sets the title for the box plot.
plt.subplot(1,2,2): Creates the second subplot in a 1x2 grid.
sns.histplot(data=df, x=column, kde=True, bins=50): Creates a histogram with a KDE (Kernel Density Estimate) for the specified column.
plt.title(f'Distribution Plot for {column}'): Sets the title for the histogram.

In [None]:
def plot(df, column):
    plt.figure(figsize=(20,5))
    plt.subplot(1,2,1)
    sns.boxplot(data=df, x=column)
    plt.title(f'Box Plot for {column}')

    plt.subplot(1,2,2)
    sns.histplot(data=df, x=column, kde=True, bins=50)
    plt.title(f'Distribution Plot for {column}')

**Plot for Each Numeric Column:**
This loop iterates over each column in numeric_columns.
plot(df1, i): Calls the plot function for each column, creating a box plot and histogram.

In [None]:
for i in numeric_columns:
    plot(df1, i)

**Summary**

**Calculate Correlations:** The correlation matrix for numeric columns helps understand the relationships between variables.
**Label Encoding:** Encodes categorical string data into numerical format, making it suitable for analysis and machine learning.
**Correlation Plot:** Visualizes correlations using a heatmap, aiding in identifying strong relationships between features.
**EDA Report:** (commented out) Provides an automated, comprehensive analysis of the data.
**Plotting Function:** Creates box plots and histograms for visualizing the distribution and spread of numeric data.
**Iterate Over Numeric Columns:** Generates plots for each numeric column, offering insights into the data's distribution and potential outliers.
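
As a worked check of what corr() reports, the sketch below computes the Pearson coefficient for one pair of columns by hand and compares it with the matrix entry. The data in corr_demo are made up; Pearson is the pandas default method.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns for illustration.
corr_demo = pd.DataFrame({
    "Size": [100.0, 150.0, 200.0, 250.0],
    "Weekly_Sales": [10.0, 14.0, 22.0, 24.0],
    "Temperature": [70.0, 55.0, 60.0, 40.0],
})

corr_matrix = corr_demo.corr()  # pairwise Pearson correlation

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
x = corr_demo["Size"]
y = corr_demo["Weekly_Sales"]
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr_matrix)
print(r_manual)
```

The matrix is symmetric with ones on the diagonal, since every column correlates perfectly with itself.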

# Data Pre-Processing

**Display the First Few Rows of df1:**
df1.head(): Returns the first 5 rows of df1.

**Check for Duplicates:**
df1.duplicated().sum(): Counts the rows that are exact duplicates of an earlier row.

**Count Unique Values for 'IsHoliday_x':**
df1.IsHoliday_x.value_counts(): Counts the occurrences of each unique value in the 'IsHoliday_x' column.

**Count Unique Values for 'IsHoliday_y':**
df1.IsHoliday_y.value_counts(): Counts the occurrences of each unique value in the 'IsHoliday_y' column.

**Drop 'IsHoliday_x' Column:**
df1= df1.drop(columns=['IsHoliday_x']): Removes the 'IsHoliday_x' column from the DataFrame.

**Check Shape of DataFrame:**
df1.shape: Outputs the shape (number of rows and columns) of df1.

**Check for Missing Values:**
df1.isnull().sum(): Counts the number of missing values in each column of df1.

In [None]:
df1.head()
df1.duplicated().sum()
df1.IsHoliday_x.value_counts()
df1.IsHoliday_y.value_counts()
df1= df1.drop(columns=['IsHoliday_x'])
df1.shape
df1.isnull().sum()

**Calculate Proportion of Missing Values:**
These lines calculate the proportion of missing values in the 'MarkDown' columns by dividing the number of missing values by the total number of rows (421,570).
print(Markdown1,Markdown2,Markdown3,Markdown4,Markdown5): Prints the calculated proportions.

In [None]:
Markdown1=270889/421570
Markdown2=310322/421570
Markdown3=284479/421570
Markdown4=286603/421570
Markdown5=270138/421570
print(Markdown1,Markdown2,Markdown3,Markdown4,Markdown5)
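
The same proportions can be computed directly from the data rather than from typed-in counts, which keeps them correct if the row count changes. This is a sketch on a made-up frame (md_demo is hypothetical): isnull() gives booleans, and their mean is the share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical MarkDown columns with some missing values.
md_demo = pd.DataFrame({
    "MarkDown1": [np.nan, 5.0, np.nan, 7.0],
    "MarkDown2": [np.nan, np.nan, np.nan, 1.0],
})

# Mean of a boolean column = fraction of True values = fraction missing.
missing_share = md_demo.isnull().mean()

print(missing_share)
```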

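**Fill Missing Values with Zero:**
For each 'MarkDown' column, missing values are replaced with 0 using the fillna method. The result is assigned back to df1, because calling fillna(0, inplace=True) on a column selected as df1.MarkDown1 is a chained operation that recent pandas versions warn about and that does not update df1 under Copy-on-Write.

In [None]:
markdown_columns = ['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5']
# Assign back to df1 rather than calling fillna(inplace=True) on a selected column
df1[markdown_columns] = df1[markdown_columns].fillna(0)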

In [None]:
# Rechecking for missing values again
df1.isnull().sum()

**Extract Date Components:**
df1['month'] = df1['Date'].dt.month: Extracts the month from the 'Date' column and creates a new column 'month'.
df1['year'] = df1['Date'].dt.year: Extracts the year from the 'Date' column and creates a new column 'year'.
df1['day'] = df1['Date'].dt.day: Extracts the day from the 'Date' column and creates a new column 'day'.
df1.head(): Displays the first 5 rows of df1 to verify the new columns.

In [None]:
df1['month'] = df1['Date'].dt.month
df1['year'] = df1['Date'].dt.year
df1['day'] = df1['Date'].dt.day
df1.head()

**Drop 'Date' Column:**
df1= df1.drop(columns=['Date']): Removes the 'Date' column from the DataFrame since its components have been extracted.

In [None]:
df1= df1.drop(columns=['Date'])

**Define Outlier Handling Function:**
def outlier(df, column): Defines a function outlier that takes a DataFrame df and a column name column.

iqr = df[column].quantile(0.75) - df[column].quantile(0.25): Calculates the Interquartile Range (IQR) for the column.

upper_threshold = df[column].quantile(0.75) + (1.5*iqr): Calculates the upper threshold for detecting outliers.

lower_threshold = df[column].quantile(0.25) - (1.5*iqr): Calculates the lower threshold for detecting outliers.

df[column] = df[column].clip(lower_threshold, upper_threshold): Clips the column values to be within the thresholds.

In [None]:
def outlier(df, column):
    iqr = df[column].quantile(0.75) - df[column].quantile(0.25)
    upper_threshold = df[column].quantile(0.75) + (1.5*iqr)
    lower_threshold = df[column].quantile(0.25) - (1.5*iqr)
    df[column] = df[column].clip(lower_threshold, upper_threshold)
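
A worked example of the same IQR rule on five made-up values is sketched below. The helper clip_iqr is a hypothetical, non-mutating variant that returns a new Series instead of writing back into the DataFrame.

```python
import pandas as pd


def clip_iqr(series, k=1.5):
    """Return a copy of series clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(q1 - k * iqr, q3 + k * iqr)


values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
clipped = clip_iqr(values)

# Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7 and only 100 is changed.
print(clipped.tolist())
```

Clipping keeps every row and only caps extreme values, unlike dropping, which removes the rows entirely.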

**Handle Outliers and Plot:**
For each specified column, the outlier function is called to clip the values within thresholds.
After handling outliers, the plot function (defined earlier) is called to create box plots and histograms for each column.
This process is repeated for 'Unemployment', 'Temperature', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', and 'MarkDown5'.

In [None]:
# Clip outliers in each column, then plot the result
for column in ['Unemployment', 'Temperature', 'MarkDown1', 'MarkDown2',
               'MarkDown3', 'MarkDown4', 'MarkDown5']:
    outlier(df1, column)
    plot(df1, column)

**Calculate IQR and Thresholds for 'Weekly_Sales':**
Calculates the IQR for 'Weekly_Sales'.
Computes the upper and lower thresholds for outlier detection.
Outputs the IQR, upper threshold, and lower threshold values.

In [None]:
iqr = df1['Weekly_Sales'].quantile(0.75) - df1['Weekly_Sales'].quantile(0.25)
upper_threshold = df1['Weekly_Sales'].quantile(0.75) + (1.5*iqr)
lower_threshold = df1['Weekly_Sales'].quantile(0.25) - (1.5*iqr)
iqr, upper_threshold, lower_threshold

**Describe 'Weekly_Sales':**
df1.Weekly_Sales.describe(): Outputs summary statistics for the 'Weekly_Sales' column, including count, mean, std, min, 25%, 50%, 75%, and max.

In [None]:
df1.Weekly_Sales.describe()

**Plot 'Weekly_Sales':**
Calls the plot function to create a box plot and histogram for 'Weekly_Sales'.

In [None]:
plot(df1, 'Weekly_Sales')

**Count Negative 'Weekly_Sales' Values:**
len(df1[df1['Weekly_Sales'] < 0]): Counts the number of rows where 'Weekly_Sales' is negative.

**Count Extremely High 'Weekly_Sales' Values:**
len(df1[df1['Weekly_Sales'] >= 250000]): Counts the number of rows where 'Weekly_Sales' is greater than or equal to 250,000.

In [None]:
len(df1[df1['Weekly_Sales'] < 0])

len(df1[df1['Weekly_Sales'] >= 250000])

In [None]:
# Check Shape of DataFrame
df1.shape

**Drop Rows with Negative 'Weekly_Sales' Values:**
df1.drop(df1[df1['Weekly_Sales'] < 0].index, inplace=True): Drops rows where 'Weekly_Sales' is negative.

**Drop Rows with Extremely High 'Weekly_Sales' Values:**
df1.drop(df1[df1['Weekly_Sales'] >= 250000].index, inplace=True): Drops rows where 'Weekly_Sales' is greater than or equal to 250,000.

In [None]:
df1.drop(df1[df1['Weekly_Sales'] < 0].index, inplace=True)
df1.drop(df1[df1['Weekly_Sales'] >= 250000].index, inplace=True)

In [None]:
# Re-check Shape of DataFrame
df1.shape

# Re-plot 'Weekly_Sales'
plot(df1, 'Weekly_Sales')

**Summary**

**Initial Inspection:** Displayed first few rows, checked for duplicates, and value counts for holiday indicators.
**Cleaning and Preprocessing:** Removed 'IsHoliday_x', checked shape, and missing values.
**Missing Values Handling:** Filled missing 'MarkDown' values with zeros.
**Date Feature Engineering:** Extracted month, year, and day from 'Date' and then dropped 'Date' column.
**Outlier Handling:** Defined and applied a function to clip outliers for multiple columns.
**'Weekly_Sales' Analysis:** Calculated IQR and thresholds, described, plotted, and removed extreme values, then re-checked the shape and plotted again.

# Encoding

In [None]:
# Display the First Two Rows of df1
df1.head(2)

**Define Columns to Encode:**
columns=["Type", "IsHoliday_y"]: Creates a list named columns containing the names of the columns that will be encoded. In this case, 'Type' and 'IsHoliday_y'.

**Initialize the LabelEncoder:**
encode=LabelEncoder(): Initializes an instance of the LabelEncoder from scikit-learn, which will be used to convert categorical labels into numeric form.

In [None]:
columns=["Type", "IsHoliday_y"]
encode=LabelEncoder()

**Loop Through Each Column in the List:**
for column in columns: Starts a loop that iterates over each column name in the columns list.

**Fit the Encoder to the Column:**
encode.fit(df1[column]): Fits the LabelEncoder to the column data. This step learns the unique values in the column and assigns a numeric label to each one.
Note: LabelEncoder's fit method expects a 1-dimensional array, so the column is passed as a Series (df1[column]). Passing a one-column DataFrame (df1[[column]]) may work in some versions, typically with a warning, but is best avoided.

**Transform the Column Data:**
df1[column] = encode.transform(df1[column]): Transforms the categorical data in the column into numeric labels. The transformed numeric values replace the original values in the DataFrame. Like fit, transform takes the 1-dimensional Series.

In [None]:
for column in columns:
  # Pass the Series (1-D), as LabelEncoder expects
  encode.fit(df1[column])
  df1[column] = encode.transform(df1[column])

In [None]:
# Display the First Two Rows of df1 After Encoding
df1.head(2)

**Summary**

**Initial Display:** The first two rows of df1 are displayed to show the original data.
**Columns to Encode:** A list of columns to be encoded is defined.
**Label Encoder Initialization:** A LabelEncoder object is initialized.
**Encoding Loop:** For each column in the list, the encoder is fitted to the column's data, and the column's categorical values are transformed into numeric labels.
**Final Display:** The first two rows of df1 are displayed again to show the transformed numeric values in the specified columns.
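
The sketch below shows label encoding on a few made-up rows. Unlike the loop above, which reuses one encoder, it keeps a separate encoder per column in a hypothetical encoders dictionary so the integer labels can be mapped back to the original values later. That is a variation for illustration, not the notebook's approach.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; the column names follow the ones encoded above.
enc_demo = pd.DataFrame({
    "Type": ["B", "A", "C", "A"],
    "IsHoliday_y": [False, True, False, False],
})

encoders = {}  # one fitted encoder per column, kept for decoding later
for column in ["Type", "IsHoliday_y"]:
    encoder = LabelEncoder()
    enc_demo[column] = encoder.fit_transform(enc_demo[column])  # 1-D Series in
    encoders[column] = encoder

print(enc_demo)
print(encoders["Type"].classes_)                       # sorted unique values
print(encoders["Type"].inverse_transform([0, 1, 2]))   # labels back to values
```

Labels are assigned in sorted order of the unique values, so they carry no meaning beyond identity; tree-based models tolerate this, while distance-based models may read the integers as an ordering.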

# Machine Learning

**Define Features and Target Variable:**

x: Contains the feature variables by dropping the "Weekly_Sales" column from the DataFrame df1.

y: Contains the target variable "Weekly_Sales".

**Split the Data into Training and Testing Sets:**

train_test_split(x, y, test_size=0.25, random_state=42): Splits the dataset into training and testing sets. 75% of the data will be used for training (x_train and y_train), and 25% will be used for testing (x_test and y_test). The random_state parameter ensures reproducibility by fixing the random seed.

In [None]:
x=df1.drop("Weekly_Sales",axis=1)
y=df1["Weekly_Sales"]

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25, random_state=42)

**Initialize and Train the SVR Model:**

SVR(): Initializes a Support Vector Regression (SVR) model.
.fit(x_train, y_train): Fits the SVR model to the training data (x_train and y_train).

**Make Predictions:**
y_pred_train and y_pred_test: Predicts the target variable for both the training and testing datasets.

**Evaluate the Model:**
Calculates the coefficient of determination (R^2) for both the training and testing sets.

In [None]:
from sklearn.svm import SVR
Model2=SVR().fit(x_train,y_train)
y_pred_train = Model2.predict(x_train)
y_pred_test = Model2.predict(x_test)

r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

r2_train, r2_test

**Initialize and Train the Decision Tree Regressor Model:**

DecisionTreeRegressor(): Initializes a Decision Tree Regressor model.

.fit(x_train, y_train): Fits the Decision Tree Regressor model to the training data (x_train and y_train).

**Make Predictions and Evaluate the Model:**
Same as the SVR model, predictions are made and the R^2 scores are calculated for both the training and testing datasets.

In [None]:
from sklearn.tree import DecisionTreeRegressor
Model4=DecisionTreeRegressor().fit(x_train,y_train)
y_pred_train = Model4.predict(x_train)
y_pred_test = Model4.predict(x_test)

r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

r2_train, r2_test

**Initialize and Train the Gradient Boosting Regressor Model:**

GradientBoostingRegressor(): Initializes a Gradient Boosting Regressor model.

.fit(x_train, y_train): Fits the Gradient Boosting Regressor model to the training data (x_train and y_train).

**Make Predictions and Evaluate the Model:**
Predictions are made and the R^2 scores are calculated for both the training and testing datasets.

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
Model6=GradientBoostingRegressor().fit(x_train,y_train)
y_pred_train = Model6.predict(x_train)
y_pred_test = Model6.predict(x_test)

r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

r2_train, r2_test

**Initialize and Train the Random Forest Regressor Model:**

RandomForestRegressor(): Initializes a Random Forest Regressor model.

.fit(x_train, y_train): Fits the Random Forest Regressor model to the training data (x_train and y_train).

**Make Predictions and Evaluate the Model:**
Predictions are made and the R^2 scores are calculated for both the training and testing datasets.

In [None]:
from sklearn.ensemble import RandomForestRegressor
Model5=RandomForestRegressor().fit(x_train,y_train)
y_pred_train = Model5.predict(x_train)
y_pred_test = Model5.predict(x_test)

r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

r2_train, r2_test

**Calculate Mean Squared Error (MSE):**
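Calculates the MSE between the true target values (y_test) and the predicted target values (y_pred_test). Because y_pred_test was last assigned in the Random Forest cell, this is the test-set MSE of Model5.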

In [None]:
mse = mean_squared_error(y_test, y_pred_test)
mse
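
To show what the two metrics measure, the sketch below recomputes MSE and R^2 by hand on four made-up values and compares them with the scikit-learn functions. The arrays y_true and y_hat are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

residuals = y_true - y_hat

# MSE: mean of the squared residuals.
mse_manual = np.mean(residuals ** 2)

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

rmse = np.sqrt(mse_manual)  # same units as the target, often easier to read

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(rmse)
```

Here the squared residuals sum to 1.5, giving an MSE of 0.375 and an R^2 of 1 - 1.5/20 = 0.925.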

**Save the Trained Model:**
Saves the trained Random Forest Regressor model (Model5) using the pickle module. The model is serialized and stored in a file named "WS_Pred_Model".

In [None]:
import pickle

# The with-block closes the file even if pickle.dump raises an error
with open("WS_Pred_Model", "wb") as model_file:
    pickle.dump(Model5, model_file)
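
Loading the model back is the mirror image of saving it. The sketch below round-trips a small Random Forest through a temporary file and checks that the restored model predicts identically. The synthetic data, the file name WS_Pred_Model_demo, and the small n_estimators value are assumptions chosen to keep the example quick.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

model_demo = RandomForestRegressor(n_estimators=10, random_state=0)
model_demo.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "WS_Pred_Model_demo")
    with open(path, "wb") as model_file:   # save
        pickle.dump(model_demo, model_file)
    with open(path, "rb") as model_file:   # load
        restored = pickle.load(model_file)

same = np.allclose(model_demo.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is generally expected to be loaded with the same library version that saved it.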

**Summary**

**Data Preparation:**
The dataset is split into features (x) and the target variable (y), which is the "Weekly_Sales" column.
The data is split into training and testing sets using the train_test_split function from scikit-learn.

**Model Training:**
Four regression models are trained on the training data:
Support Vector Regression (SVR)
Decision Tree Regressor
Gradient Boosting Regressor
Random Forest Regressor

**Model Evaluation:**
For each model, predictions are made on both the training and testing datasets.
The coefficient of determination (R^2) is calculated for both the training and testing sets to evaluate the model's performance.

**Mean Squared Error (MSE):**
The mean squared error (MSE) is calculated to quantify the average squared difference between the true and predicted values for the target variable.

**Model Serialization:**
The trained Random Forest Regressor model (Model5) is serialized and saved using the pickle module for future use.

Overall, the machine learning part involves training multiple regression models, evaluating their performance, and selecting the best model based on the evaluation metrics. Finally, the best model is saved for deployment or further analysis.
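
The train-and-compare pattern can also be written as a single loop, which makes the model selection explicit. This is a sketch on synthetic data from make_regression, not the notebook's dataset; the candidates dictionary, the choice of test R^2 as the selection rule, and the omission of SVR (left out to keep the example fast) are all assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the features and target.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=42
)
x_train, x_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

candidates = {
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "GradientBoosting": GradientBoostingRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}  # name -> (train R^2, test R^2)
for name, model in candidates.items():
    model.fit(x_train, y_train)
    scores[name] = (
        r2_score(y_train, model.predict(x_train)),
        r2_score(y_test, model.predict(x_test)),
    )

# Pick the model with the highest R^2 on the held-out test set.
best_name = max(scores, key=lambda name: scores[name][1])

for name, (r2_train, r2_test) in scores.items():
    print(f"{name}: train R^2={r2_train:.3f}, test R^2={r2_test:.3f}")
print("best on test:", best_name)
```

A large gap between train and test R^2, as an unpruned decision tree usually shows, is a sign of overfitting, which is why the comparison uses the test score.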

# Overall Summary

**Project Overview:**

This project involves the analysis and prediction of weekly sales data for a retail store chain. The dataset includes information about store features, sales, and store types. The project follows a structured approach, including data exploration, preprocessing, visualization, and machine learning modeling.

**Key Steps:**

***Data Exploration:***
The project starts with data exploration to understand the structure and features of the datasets. Exploratory Data Analysis (EDA) techniques such as summary statistics, correlation analysis, and visualization are used to gain insights into the data.

***Data Preprocessing:***
Data preprocessing involves handling missing values, encoding categorical variables, and feature engineering. Techniques like label encoding, handling outliers, and transforming date variables are applied to prepare the data for modeling.

***Machine Learning Modeling:***
Several regression models, including Support Vector Regression (SVR), Decision Tree Regressor, Gradient Boosting Regressor, and Random Forest Regressor, are trained on the preprocessed data to predict weekly sales.
The performance of each model is evaluated using metrics such as R^2 score and Mean Squared Error (MSE).
The best-performing model, Random Forest Regressor, is serialized and saved for future use.

**Conclusion:**

The project concludes with a summary of the machine learning process and the selection of the best model for predicting weekly sales. The serialized model can be deployed for making predictions on new data or integrated into a production environment for decision-making purposes.

Overall, this project demonstrates the application of data analysis and machine learning techniques to solve real-world business problems, such as sales forecasting for retail stores.