# **Sales forecasting of Corporation Favorita Stores using Time Series Regression.**

## **Objective:** 
### To develop a predictive model for store sales for Corporation Favorita, a large grocery retailer headquartered in Ecuador. The model aims to predict the unit sales of numerous items across various Favorita stores, enabling more precise estimation of sales performance.

## Hypotheses for testing:
Hypothesis 1: <br>
```Null```: The promotional activities, oil prices, and holidays/events do not have a significant impact on store sales for Corporation Favorita.<br>
```Alternate```: The promotional activities, oil prices, and holidays/events have a significant impact on store sales for Corporation Favorita.

Hypothesis 2: <br>
```Null```: Sales increase over time. <br>
```Alternate```: Sales do not increase with time.

Hypothesis 3: <br>
```Null```: Situating a store in a particular city does not influence sales.<br>
```Alternate```: Situating a store in a particular city significantly affects sales.

Hypothesis 4: <br>
```Null```: The more the transactions the higher the sales. <br>
```Alternate```: Transactions don't have an impact on sales.



### **Import packages**

In [None]:
# Data Handling
import pyodbc
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from scipy import stats
from dotenv import dotenv_values


# Statistical Analysis
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from scipy.stats import ttest_ind
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.arima.model import ARIMA


# Visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.patches as mpatches
import seaborn as sns
import plotly.express as px
from matplotlib.dates import MonthLocator


# Other Packages
import warnings

warnings.filterwarnings("ignore")

## **1. Data Acquisition**

In [None]:
# Load environment variables from .env file into a dictionary
environment_variables = dotenv_values('.env')


# Get the values for the credentials you set in the '.env' file
database = environment_variables.get("database")
server = environment_variables.get("server")
username = environment_variables.get("user")
password = environment_variables.get("password")


connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"
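If a key is missing from the `.env` file, `environment_variables.get(...)` returns `None` and the connection string silently contains the text `None`. A minimal sketch of an early check is shown below; the helper name `build_connection_string` and the placeholder values are assumptions made for illustration, not part of the original notebook.

```python
REQUIRED_KEYS = ("server", "database", "user", "password")


def build_connection_string(env):
    """Build the ODBC connection string, failing early on missing settings."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing settings in .env: {', '.join(missing)}")
    return (
        "DRIVER={SQL Server};"
        f"SERVER={env['server']};DATABASE={env['database']};"
        f"UID={env['user']};PWD={env['password']}"
    )


# Placeholder values for illustration only (hypothetical, not real credentials)
example_env = {
    "server": "example-host",
    "database": "example_db",
    "user": "example_user",
    "password": "example_password",
}
connection_string_example = build_connection_string(example_env)
```

In the notebook, the same function could be called with the dictionary returned by `dotenv_values('.env')`.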



In [None]:
# Use the connect method of the pyodbc library and pass in the connection string.
# This will connect to the server and might take a few seconds to be complete. 
# Check your internet connection if it takes more time than necessary

connection = pyodbc.connect(connection_string)

In [None]:
# Define SQL queries for each table
query1 = 'SELECT * FROM dbo.oil'
query2 = 'SELECT * FROM dbo.holidays_events'
query3 = 'SELECT * FROM dbo.stores'

# Read data from tables into pandas DataFrames
oil = pd.read_sql(query1, connection)
holidays_events = pd.read_sql(query2, connection)
stores = pd.read_sql(query3, connection)

# Close the database connection
connection.close()
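If one of the queries raises an error, `connection.close()` is never reached. One way to guarantee the connection is closed is `contextlib.closing`. The sketch below uses an in-memory SQLite database as a stand-in for SQL Server, with made-up table contents, so that it runs without credentials; the same pattern should apply to a `pyodbc` connection.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory SQLite database standing in for the SQL Server instance
with closing(sqlite3.connect(":memory:")) as demo_connection:
    demo_connection.execute("CREATE TABLE oil (date TEXT, dcoilwtico REAL)")
    demo_connection.executemany(
        "INSERT INTO oil VALUES (?, ?)",
        [("2013-01-02", 50.0), ("2013-01-03", 51.5)],  # made-up prices
    )
    # Read the table while the connection is still open
    demo_oil = pd.read_sql("SELECT * FROM oil", demo_connection)

# The connection is closed here, even if a query above had failed
```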

In [None]:
oil.head()

 <div class="alert alert-block alert-danger" style ="background-color : #e6ebef;">
    <h4 style="padding: 15px;
              color:black;">📌 Renaming the type in holiday data to holiday type
    </h4>
  </div>

In [None]:
# Display the first few rows of the DataFrame
holidays_events.head()


# Rename the 'type' column to 'holiday_type'
holidays_events.rename(columns={
    'type': 'holiday_type'
}, inplace=True)

# Print the modified DataFrame to see the changes
holidays_events

Rename `type` to `holiday_type` here, then combine it with the daily oil prices (`oil['dcoilwtico']`).

In [None]:
stores.head()  # View the first 5 rows of the stores dataframe

In [None]:
# sample_submission = pd.read_csv('data/sample_submission.csv')
# sample_submission.head()

The sample submission set is not used because its sales column has no values.

In [None]:
transactions = pd.read_csv('data/transactions.csv')   # load the transactions data
transactions.head()   # View the first 5 rows

In [None]:
train = pd.read_csv('data/train.csv')   # load the train data
train.sample(5)    # View a random sample of 5 rows

In [None]:
train[(train['sales'] == 770) & (train['store_nbr'] == 25) ]  # Check for rows in train data whose sales is 770 and store number is 25
                                                              # This is to confirm if sales is same as transactions.


...

In [None]:
# oil.to_csv('data/oil.csv',index=False)
# transactions.to_csv('data/transactions.csv',index=False)
# holidays_events.to_csv('data/holidays_events.csv',index=False)
# stores.to_csv('data/stores.csv',index=False)

## Join Tables

### Join to display data contained in both dataframes

In [None]:
# Read data from the 'transactions.csv' file and create a DataFrame named 'transactions'
transactions = pd.read_csv('data/transactions.csv')

# Merge the 'transactions' DataFrame with the 'train' DataFrame
# The join keys are the columns the two DataFrames share ('date' and 'store_nbr'); the result is named 'full_transaction'
full_transaction = pd.merge(transactions, train, on=['date', 'store_nbr'])

# Display a random sample of 5 rows from the 'full_transaction' DataFrame
# The 'sample()' function extracts a random subset of rows from the DataFrame for inspection
full_transaction.sample(5)
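A merge can silently multiply rows when the join keys are not unique on the side where they are expected to be. The sketch below, on made-up data, names the keys explicitly and uses the `validate` argument of `pd.merge` so that pandas raises an error if a (`date`, `store_nbr`) pair appears more than once in the transactions table. The `_demo` names are hypothetical.

```python
import pandas as pd

# One row per (date, store_nbr) in the transactions table
transactions_demo = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-02", "2013-01-03"],
    "store_nbr": [1, 2, 1],
    "transactions": [100, 200, 150],
})
# Several product families per (date, store_nbr) in the train table
train_demo = pd.DataFrame({
    "date": ["2013-01-02", "2013-01-02", "2013-01-02", "2013-01-03"],
    "store_nbr": [1, 1, 2, 1],
    "family": ["BEVERAGES", "DAIRY", "BEVERAGES", "DAIRY"],
    "sales": [10.0, 5.0, 7.0, 3.0],
})

# validate="one_to_many" checks that the keys are unique in the left table
merged_demo = pd.merge(
    transactions_demo,
    train_demo,
    on=["date", "store_nbr"],
    how="inner",
    validate="one_to_many",
)
```

Comparing `len(merged_demo)` with `len(train_demo)` afterwards is a quick way to see whether any rows were dropped or duplicated.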

## Join the full transactions based on stores

In [None]:
# Merge the 'full_transaction' DataFrame with the 'stores' DataFrame
# This combines the data based on the 'store_nbr' column, using an 'inner' join type
# The result is a new DataFrame named 'result'

result = pd.merge(full_transaction, stores, on='store_nbr', how='inner')
result.head(5)


## Join the full transactions based on oil data for each date

In [None]:
# Merge the 'result' DataFrame with the 'oil' DataFrame
# This combines the data based on the 'date' column, using an 'inner' join type
# The result is a new DataFrame named 'result1'
result1= pd.merge(result, oil, on='date', how='inner')
result1.sample(5)


## Join the full transactions based on holidays

In [None]:
# Merge the 'result1' DataFrame with the 'holidays_events' DataFrame
# This combines the data based on the 'date' column, using an 'inner' join type
# Note: an inner join keeps only the dates that appear in 'holidays_events'
# The result is a new DataFrame named 'salesdata'

salesdata = pd.merge(result1, holidays_events, on='date', how='inner')

# Reset the index of the 'salesdata' DataFrame
# The 'drop=True' parameter removes the current index, and 'inplace=True' applies the change directly to the DataFrame
salesdata.reset_index(drop=True,inplace=True)
salesdata.head(5)
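Because the join on holidays is an inner join, dates with no entry in the holidays table drop out of `salesdata`. The sketch below contrasts an inner join with a left join on made-up data; the `"No holiday"` label is a hypothetical fill value chosen for illustration.

```python
import pandas as pd

sales_demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03"]),
    "sales": [0.0, 120.0, 95.0],
})
holidays_demo = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01"]),
    "holiday_type": ["Holiday"],
})

# Inner join: only dates present in both tables survive
inner_demo = pd.merge(sales_demo, holidays_demo, on="date", how="inner")

# Left join: every sales date is kept; non-holiday dates get a missing holiday_type
left_demo = pd.merge(sales_demo, holidays_demo, on="date", how="left")
left_demo["holiday_type"] = left_demo["holiday_type"].fillna("No holiday")
```

Which join is appropriate depends on whether ordinary days should take part in the later comparisons.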


## Drop some columns (id column)

In [None]:
#salesdata.drop(columns='id', inplace=True)

## Rename columns

In [None]:
# Rename store_nbr as store_number and dcoilwtico as oil_prices

salesdata.rename(columns={
    'store_nbr': 'store_number',
    'dcoilwtico': 'oil_prices',
}, inplace=True)
salesdata.sample(5)

In [None]:
salesdata.columns  # Get the column names of the salesdata

In [None]:
salesdata = salesdata[['id','date',  'store_number', 'transactions', 'family', 'sales',
       'onpromotion', 'city', 'state', 'type', 'cluster', 'oil_prices',
       'holiday_type', 'locale', 'locale_name', 'description', 'transferred']] # Rearrange columns of the data

In [None]:
salesdata.head()   # Get first 5 rows

## **Generate summary statistics and transpose the resulting DataFrame for a more detailed view.**

In [None]:
# Display descriptive statistics for the 'salesdata' DataFrame
# The 'describe()' function computes various summary statistics for numerical columns
# The 'T' attribute is used to transpose the summary statistics for better readability
salesdata.describe().T

## **Checking for duplicate rows.**

In [None]:
# Check for duplicated rows in the 'salesdata' DataFrame
# The 'duplicated()' function returns a boolean Series indicating whether each row is a duplicate
# The 'sum()' function then counts the number of 'True' (duplicated) values in the Series
salesdata.duplicated().sum()

 <div class="alert alert-block alert-danger" style ="background-color : #e6ebef;">
    <h4 style="padding: 15px;
              color:black;">📌 There are no duplicate rows!
    </h4>
  </div>

In [None]:
salesdata.to_csv('data/FavoritaStores_Data.csv', index=False)  # Save new data frame as FavoritaStores_Data which is a csv


 <div class="alert alert-block alert-danger" style ="background-color : #e6ebef;">
    <h4 style="padding: 15px;
              color:black;">📌 Data saved to a csv file for further analysis in BI
    </h4>
  </div>

## **2. Univariate Data Analysis**

### **Histograms and boxplots are plotted to show the distributions and any outliers**

>### **2.1. Sales column**



In [None]:
# Create a figure and two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot the histogram on the first subplot (ax1)
ax1.hist(salesdata['sales'], bins=20)
ax1.set_xlabel('Sales')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Sales')

# Plot the boxplot on the second subplot (ax2)
ax2.boxplot(salesdata['sales'])
ax2.set_ylabel('Sales')
ax2.set_title('Boxplot of Sales')

# Adjust layout to avoid overlapping labels
plt.tight_layout()

# Show the plots
plt.show()

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
    From our plots:<br><br>
        📌 Sales is positively skewed. <br> <br>
        📌 The median value is thus closer to the first quartile. <br><br>
        📌 The boxplot shows presence of very extreme values. <br><br>        
        📌 The values span a wide range.
    </h4>
</div>
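The visual reading of skew and extreme values can be backed by numbers. The sketch below computes the sample skewness and counts points above the usual boxplot fence (third quartile plus 1.5 times the interquartile range) on a small made-up series; in the notebook the same calls could be applied to `salesdata['sales']`.

```python
import pandas as pd

# Made-up values with one extreme observation
sales_demo = pd.Series([0, 1, 2, 2, 3, 3, 4, 5, 8, 250], dtype=float)

# Positive skewness indicates a long right tail
skewness = sales_demo.skew()

# Boxplot-style upper fence: Q3 + 1.5 * IQR
q1, q3 = sales_demo.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = sales_demo[sales_demo > upper_fence]

print(f"skewness={skewness:.2f}, upper fence={upper_fence:.2f}, outliers={len(outliers)}")
```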


>### **2.2. Transactions column**



In [None]:
# Create a figure and two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot the histogram on the first subplot (ax1)
ax1.hist(salesdata['transactions'], bins=20)
ax1.set_xlabel('transactions')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of transactions')
ax1.grid(False)


# Plot the boxplot on the second subplot (ax2)
ax2.boxplot(salesdata['transactions'])
ax2.set_ylabel('transactions')
ax2.set_title('Boxplot of transactions')

# Adjust layout to avoid overlapping labels
plt.tight_layout()

# Show the plots
plt.grid(False)
plt.show()

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
    From the plots:<br>
        📌 The transactions are positively skewed.<br><br>
        📌 Transactions in the interval 500 - 1500 occurred most often.<br><br>
        📌 The skew points to the presence of outliers, as confirmed by the boxplot.
    </h4>
</div>


>### **2.3. Oil Prices column**



In [None]:
# Create a figure and two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot the histogram on the first subplot (ax1)
ax1.hist(salesdata['oil_prices'], bins=20)
ax1.set_xlabel('oil_prices')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of oil_prices')
ax1.grid(False)


# Plot the boxplot on the second subplot (ax2)
ax2.boxplot(salesdata['oil_prices'])
ax2.set_ylabel('oil_prices')
ax2.set_title('Boxplot of oil_prices')

# Adjust layout to avoid overlapping labels
plt.tight_layout()

# Show the plots
plt.grid(False)
plt.show()

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        📌 The histogram displays a bimodal distribution with two prominent peaks. <br><br>
        📌 The first peak lies in the interval between 40 and 55, indicating a concentration of data points in this range. <br><br>
        📌 A large share of the observations therefore falls within this interval.<br><br>
        📌 The second peak lies in the interval between 98 and 100. <br><br>
        📌 This peak marks another concentration of data points, distinct from the first. <br><br>
        📌 The two distinct peaks suggest two modes or clusters within the dataset.<br><br>
        📌 This pattern may be due to missing values in the oil price data.
    </h4>
</div>
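Whether missing values play a role can be checked directly. The sketch below counts missing oil prices and fills them by linear interpolation on a made-up series; interpolation is only one possible choice and is an assumption here, not the notebook's stated method.

```python
import numpy as np
import pandas as pd

# Made-up daily oil prices with gaps
oil_demo = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=6, freq="D"),
    "oil_prices": [np.nan, 93.0, np.nan, 95.0, 94.0, np.nan],
})

# Number of missing prices
missing_count = oil_demo["oil_prices"].isna().sum()

# Fill gaps linearly; limit_direction="both" also fills the leading and trailing gaps
oil_demo["oil_prices_filled"] = oil_demo["oil_prices"].interpolate(
    method="linear", limit_direction="both"
)
```

In the notebook, `salesdata['oil_prices'].isna().sum()` would give the corresponding count.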


>### **2.4. Onpromotion column**



In [None]:
# Create a figure and two subplots side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot the histogram on the first subplot (ax1)
ax1.hist(salesdata['onpromotion'], bins=20)
ax1.set_xlabel('onpromotion')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of onpromotion')
ax1.grid(False)


# Plot the boxplot on the second subplot (ax2)
ax2.boxplot(salesdata['onpromotion'])
ax2.set_ylabel('onpromotion')
ax2.set_title('Boxplot of onpromotion')

# Adjust layout to avoid overlapping labels
plt.tight_layout()

# Show the plots
plt.grid(False)
plt.show()

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
    From our plots:<br><br>
        📌 onpromotion is positively skewed. <br> <br>
        📌 The median value is thus closer to the first quartile. <br><br>
        📌 The boxplot shows presence of very extreme values. <br><br>        
        📌 The values span a wide range.
    </h4>
</div>

## **3. Bivariate Data Analysis**

>### **3.1. Trend of Daily average sales**



In [None]:
# Convert the date column to pandas datetime format
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Group by date and obtain mean sales
salesdata_daily = salesdata.groupby('date')['sales'].mean()
# Define the size of the plot area
plt.figure(figsize=(12, 6))
# Plot mean sales against date
plt.plot(salesdata_daily.index, salesdata_daily.values)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Daily Average Sales over Time')
plt.show()


In [None]:

# Convert the 'date' column in the 'salesdata' DataFrame to datetime format
salesdata['date'] = pd.to_datetime(salesdata['date'])

# Extract the year from the 'date' column and create a new 'year' column
salesdata['year'] = salesdata['date'].dt.year

# Group the sales data by 'year', summing up the 'sales' column for each year
salesdata_yearly = salesdata.groupby('year')['sales'].sum()

# Create a new figure for the plot with a specified size
plt.figure(figsize=(12, 6))

# Create a line plot using years as x-axis and their corresponding total sales as y-axis
plt.plot(salesdata_yearly.index, salesdata_yearly.values, marker='o')

# Set the label for the x-axis
plt.xlabel('Year')

# Set the label for the y-axis
plt.ylabel('Total Sales')

# Set the title of the plot
plt.title('Total Sales by Year')

# Display the plot
plt.show()

In [None]:
# Converting the 'date' column to datetime format
salesdata['date'] = pd.to_datetime(salesdata['date'])

# Grouping the sales data by date and calculating the mean sales for each day
salesdata_daily = salesdata.groupby('date')['sales'].mean()
salesdata_daily

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
    From our plot:<br><br>
        📌 Daily average sales exhibit an upward trend over the years, except for 2017. <br> <br>
        📌 There are some seasonal peaks in each year as well, especially at the end of each year. <br><br>        
    </h4>
</div>

>### **3.2. Trend of Yearly Average Oil Prices**



In [None]:
# Converting the 'date' column to datetime format
salesdata['date'] = pd.to_datetime(salesdata['date'])

# Grouping the data by year and calculating the mean oil prices for each year
salesdata_yearly = salesdata.groupby(salesdata['date'].dt.year)['oil_prices'].mean()

# Creating a new figure for the plot with a specified size
plt.figure(figsize=(12, 6))

# Creating a line plot of mean oil prices by year
plt.plot(salesdata_yearly.index, salesdata_yearly.values)

# Adding a label to the x-axis
plt.xlabel('Year')

# Adding a label to the y-axis
plt.ylabel('Mean Oil Prices')

# Adding a title to the plot
plt.title('Mean Oil Prices by Year')

# Displaying the plot
plt.show()


In [None]:
salesdata_yearly 

In [None]:
# Converting the 'date' column to datetime format
salesdata['date'] = pd.to_datetime(salesdata['date'])

# Grouping the data by year and calculating the mean oil prices for each year
salesdata_yearly = salesdata.groupby(salesdata['date'].dt.year)['oil_prices'].mean()

# Creating a new figure for the plot with a specified size
plt.figure(figsize=(12, 6))

# Creating a bar plot of mean oil prices by year
plt.bar(salesdata_yearly.index, salesdata_yearly.values)

# Adding a label to the x-axis
plt.xlabel('Year')

# Adding a label to the y-axis
plt.ylabel('Mean Oil Prices')

# Adding a title to the plot
plt.title('Mean Oil Prices by Year')

# Displaying the plot
plt.show()

>### **3.3. Sales against holiday type**



In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Total sales for each holiday type
salesdata_daily = salesdata.groupby('holiday_type')['sales'].sum()
plt.figure(figsize=(12, 6))
plt.bar(salesdata_daily.index, salesdata_daily.values)
plt.xlabel('Holiday Type')
plt.ylabel('Total Sales')
plt.title('Total Sales by Holiday Type')


In [None]:
salesdata_daily

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        📌 Holidays recorded more sales than any other day type, with bridge days recording the least.
    </h4>
</div>


>### **3.3. Sales against store number**



In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Rank stores by total sales and keep the ten highest
# (head(10) would only keep the first ten store numbers)
salesdata_daily = salesdata.groupby('store_number')['sales'].sum().nlargest(10)
plt.figure(figsize=(12, 6))
# Cast store numbers to strings so the bars keep the ranked order
plt.bar(salesdata_daily.index.astype(str), salesdata_daily.values)
plt.xlabel('Store Number')
plt.ylabel('Total Sales')
plt.title('Top 10 Stores by Total Sales')


In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Rank stores by total sales and keep the ten lowest
# (tail(10) would only keep the last ten store numbers)
salesdata_daily = salesdata.groupby('store_number')['sales'].sum().nsmallest(10)
salesdata_daily = salesdata_daily.sort_values(ascending=False)
plt.figure(figsize=(12, 6))
# Cast store numbers to strings so the bars keep the ranked order
plt.bar(salesdata_daily.index.astype(str), salesdata_daily.values)
plt.xlabel('Store Number')
plt.ylabel('Total Sales')
plt.title('Bottom 10 Stores by Total Sales')


<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        📌 Across the top 10 and bottom 10 charts, store number 3 recorded the highest sales and store number 52 the lowest.
    </h4>
</div>
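When ranking groups, it matters whether the selection is made by position or by value. After `groupby(...).sum()`, the result is ordered by the group key, so `head(n)` returns the first `n` keys, while `nlargest(n)` and `nsmallest(n)` rank by the totals. A short sketch on made-up numbers:

```python
import pandas as pd

store_sales_demo = pd.DataFrame({
    "store_number": [1, 1, 2, 3, 3, 4],
    "sales": [10.0, 5.0, 40.0, 2.0, 1.0, 30.0],
})

# Totals per store, ordered by store number: 1 -> 15, 2 -> 40, 3 -> 3, 4 -> 30
totals = store_sales_demo.groupby("store_number")["sales"].sum()

first_two = totals.head(2)        # stores 1 and 2 (by position)
top_two = totals.nlargest(2)      # stores 2 and 4 (by value)
bottom_two = totals.nsmallest(2)  # stores 3 and 1 (by value)
```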


> ### **3.4 Sales against Product**

In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Rank product families by total sales and keep the ten highest
salesdata_daily = salesdata.groupby('family')['sales'].sum().nlargest(10)
plt.figure(figsize=(15, 6))
plt.bar(salesdata_daily.index, salesdata_daily.values)
plt.xlabel('Product Family')
plt.ylabel('Total Sales')
plt.title('Top 10 Product Families by Total Sales')


<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        📌 Among the top 10 products by sales, beverages led, followed by cleaning products.
    </h4>
</div>


> ### **3.5 Sales against State**

In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Rank states by total sales and keep the ten highest
salesdata_daily = salesdata.groupby('state')['sales'].sum().nlargest(10)
plt.figure(figsize=(15, 6))
plt.bar(salesdata_daily.index, salesdata_daily.values)
plt.xlabel('State')
plt.ylabel('Total Sales')
plt.title('Top 10 States by Total Sales')


In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Keep only the stores located in Guayas
guayas_sales = salesdata[salesdata['state'] == 'Guayas']
# Total sales per city within Guayas, ranked from highest to lowest
salesdata_daily = guayas_sales.groupby('city')['sales'].sum().nlargest(10)
plt.figure(figsize=(15, 6))
plt.bar(salesdata_daily.index, salesdata_daily.values)
plt.xlabel('Cities in Guayas')
plt.ylabel('Total Sales')
plt.title('Total Sales by City in Guayas')



<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        📌 Most sales were recorded in the state of Guayas. Within Guayas, the city with the most sales is Guayaquil.
    </h4>
</div>


> ### **3.6. Sales against type**

In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Total sales per store type, ranked from highest to lowest
salesdata_daily = salesdata.groupby('type')['sales'].sum().sort_values(ascending=False)
plt.figure(figsize=(15, 6))
plt.bar(salesdata_daily.index, salesdata_daily.values)
plt.xlabel('Store Type')
plt.ylabel('Total Sales')
plt.title('Total Sales by Store Type')


<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        📌 Most sales were recorded in stores of type D and the fewest in stores of type E.
    </h4>
</div>


## **4. Time Series Analysis of sales by resampling**

### We visualize the time series of sales across time

In [None]:
# Choose date and sales columns
timeseriesdata=salesdata[['sales','date']]
timeseriesdata.index = timeseriesdata['date']
timeseriesdata
# make date the index
del timeseriesdata['date']
timeseriesdata



>### **4.1. Yearly Series of Total Sales**

In [None]:
sales_per_year= timeseriesdata.resample('Y').sum()
plt.figure(figsize= (15,6))
sns.lineplot(sales_per_year)
plt.ylabel('Sales')


>### **4.2. Analyzing monthly sales across each year**

In [None]:
timeseriesdata= timeseriesdata.resample('M').sum()



>## **4.2.1. Year 2013**

In [None]:
data2013 = timeseriesdata[timeseriesdata.index.year == 2013]
# Set the figure size
plt.figure(figsize=(15, 6))
# Create the line plot using Seaborn
sns.lineplot(data=data2013)
# Set x-axis locator to one-month interval
plt.gca().xaxis.set_major_locator(MonthLocator(interval=1))
plt.ylabel('2013 Sales')
plt.title('Sales Data for the Year 2013')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# Display the plot
plt.show()



>## **4.2.2. Year 2014**

In [None]:
data2014 = timeseriesdata[timeseriesdata.index.year == 2014]
# Set the figure size
plt.figure(figsize=(15, 6))
# Create the line plot using Seaborn
sns.lineplot(data=data2014)
# Set x-axis locator to one-month interval
plt.gca().xaxis.set_major_locator(MonthLocator(interval=1))
plt.ylabel('2014 Sales')
plt.title('Sales Data for the Year 2014')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# Display the plot
plt.show()



>## **4.2.3. Year 2015**

In [None]:
data2015 = timeseriesdata[timeseriesdata.index.year == 2015]
# Set the figure size
plt.figure(figsize=(15, 6))
# Create the line plot using Seaborn
sns.lineplot(data=data2015)
# Set x-axis locator to one-month interval
plt.gca().xaxis.set_major_locator(MonthLocator(interval=1))
plt.ylabel('2015 Sales')
plt.title('Sales Data for the Year 2015')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# Display the plot
plt.show()



>## **4.2.4. Year 2016**

In [None]:
data2016 = timeseriesdata[timeseriesdata.index.year == 2016]
# Set the figure size
plt.figure(figsize=(15, 6))
# Create the line plot using Seaborn
sns.lineplot(data=data2016)
# Set x-axis locator to one-month interval
plt.gca().xaxis.set_major_locator(MonthLocator(interval=1))
plt.ylabel('2016 Sales')
plt.title('Sales Data for the Year 2016')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# Display the plot
plt.show()



>## **4.2.5. Year 2017**

In [None]:
data2017 = timeseriesdata[timeseriesdata.index.year == 2017]
# Set the figure size
plt.figure(figsize=(15, 6))
# Create the line plot using Seaborn
sns.lineplot(data=data2017)
# Set x-axis locator to one-month interval
plt.gca().xaxis.set_major_locator(MonthLocator(interval=1))
plt.ylabel('2017 Sales')
plt.title('Sales Data for the Year 2017')
# Rotate x-axis labels for better visibility
plt.xticks(rotation=45)
# Display the plot
plt.show()
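The five yearly cells above follow the same pattern, so the per-year slicing could also be done in one pass. The sketch below builds monthly totals from a made-up daily series and splits them by year into a dictionary; plotting is left out to keep it short, and the `_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up daily sales: one unit sold per day over two years
dates = pd.date_range("2013-01-01", "2014-12-31", freq="D")
daily_demo = pd.Series(np.ones(len(dates)), index=dates, name="sales")

# Monthly totals, grouped by calendar month
monthly_demo = daily_demo.groupby(daily_demo.index.to_period("M")).sum()

# One monthly series per year, keyed by year
monthly_by_year = {
    year: group
    for year, group in monthly_demo.groupby(monthly_demo.index.year)
}

for year, series in monthly_by_year.items():
    print(year, "months:", len(series), "total:", series.sum())
```

Each entry of `monthly_by_year` can then be passed to the same plotting code inside a loop.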



## **4.3. Sales series across months**

In [None]:

# Group by month and calculate the sum of sales
monthly_sales = salesdata.groupby(salesdata['date'].dt.strftime('%B'))['sales'].sum()
# List of month names in order
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# Convert month names to categorical with specified order
monthly_sales.index = pd.Categorical(monthly_sales.index, categories=month_order, ordered=True)
# Sort the index to order the months
monthly_sales = monthly_sales.sort_index()
# Set the figure size
plt.figure(figsize=(15, 6))
# Create the line plot using Matplotlib
plt.plot(monthly_sales)
plt.show()





<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        Sales start increasing in September and rise sharply from October to December and from March to April. The highest sales were recorded in December and the lowest in September.
    </h4>
</div>


## **4.4. Sales Series by Day of Month**

In [None]:
salesdata['date'] = pd.to_datetime(salesdata['date'])
# Total sales for each day of the month (1-31)
daily_sales = (
    salesdata.groupby(salesdata['date'].dt.day)['sales']
    .sum()
    .rename_axis('day_of_month')
    .reset_index()
)

# Create a line plot with a range slider
fig = px.line(daily_sales, x='day_of_month', y='sales')
fig.update_xaxes(rangeslider_visible=True)
fig.update_layout(title='Total Sales by Day of Month', title_x=0.5)
fig.show()


<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        The highest sales are recorded at the beginning and at the end of the month.
    </h4>
</div>


## **Sales Series Quarterly**

In [None]:
sales_per_quarter=timeseriesdata.resample('Q').sum()
plt.figure(figsize= (15,6))
sns.lineplot(sales_per_quarter)
plt.ylabel('Sales')


>## **MultiVariate Analysis**

In [None]:
# Select numerical variables for correlation analysis
numerical_vars = ['sales', 'transactions', 'oil_prices','onpromotion']

# Compute correlation matrix
corr_matrix = salesdata[numerical_vars].corr()

# Plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap='Blues')
plt.title('Correlation Matrix')
plt.show()

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        There is a notable positive correlation between the number of items on promotion and sales, and a weak correlation between transactions and sales.
    </h4>
</div>
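`DataFrame.corr()` defaults to the Pearson coefficient, which is sensitive to extreme values such as those seen in the sales boxplot. A rank-based alternative is Spearman's coefficient, available through `method="spearman"`. The sketch below compares the two on made-up data with one extreme sales value.

```python
import pandas as pd

# Made-up data: sales rise with onpromotion, with one extreme value
demo = pd.DataFrame({
    "sales": [1.0, 2.0, 4.0, 8.0, 1000.0],
    "onpromotion": [0, 1, 2, 3, 4],
    "transactions": [500, 400, 450, 300, 600],
})

pearson_corr = demo.corr(method="pearson")    # linear association
spearman_corr = demo.corr(method="spearman")  # monotonic (rank) association

print(pearson_corr.loc["sales", "onpromotion"])
print(spearman_corr.loc["sales", "onpromotion"])
```

Here the relationship is perfectly monotonic, so the Spearman coefficient is 1 while the Pearson coefficient is pulled below 1 by the extreme value.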


## **5. Testing Hypothesis**

### Before hypothesis testing, we explore the distribution of sales using the Shapiro-Wilk test.

In [None]:
# Total sales for each date
grouped_data = salesdata.groupby('date')['sales'].sum()
grouped_data
# Perform the Shapiro-Wilk test on the daily totals
statistic, p_value = stats.shapiro(grouped_data)
print("Shapiro-Wilk Test Results:")
print("Statistic:", statistic)
print("P-value:", p_value)
if p_value < 0.05:
    print("The data does not follow a normal distribution.")
else:
    print("The data follows a normal distribution.")

<div class="alert alert-block alert-danger" style="background-color: #e6ebef;">
    <h4 style="padding: 15px; color: black;">
        The sales do not follow normal distribution.
    </h4>
</div>


### **The distribution is not normal, hence a non-parametric alternative to ANOVA is used.**
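The notebook does not name the specific test. One common non-parametric counterpart to one-way ANOVA is the Kruskal-Wallis H test, available as `scipy.stats.kruskal`. The sketch below applies it to made-up sales grouped by holiday type; treating this as the intended test is an assumption.

```python
import pandas as pd
from scipy import stats

# Made-up sales for three holiday types
demo = pd.DataFrame({
    "holiday_type": ["Holiday"] * 5 + ["Event"] * 5 + ["Bridge"] * 5,
    "sales": [10, 12, 11, 13, 14, 30, 28, 33, 31, 29, 50, 52, 49, 51, 53],
})

# One array of sales per holiday type
groups = [group["sales"].to_numpy() for _, group in demo.groupby("holiday_type")]

# Kruskal-Wallis H test: null hypothesis is that all groups share the same distribution
statistic, p_value = stats.kruskal(*groups)

if p_value < 0.05:
    print("Reject the null hypothesis: at least one group differs.")
else:
    print("Fail to reject the null hypothesis.")
```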


### **Hypothesis 1: <br>**
#### ```Null```: The promotional activities, oil prices, and holidays/events do not have a significant impact on store sales for Corporation Favorita.<br>
#### ```Alternate```: The promotional activities, oil prices, and holidays/events have a significant impact on store sales for Corporation Favorita.

>### On promotion: here we use a scatterplot analysis
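A scatterplot can be complemented by a rank-based test. As a sketch under the assumption that rows are split into "some items on promotion" and "no items on promotion", the Mann-Whitney U test (`scipy.stats.mannwhitneyu`) compares sales between the two groups without assuming normality. The data below are made up.

```python
import pandas as pd
from scipy import stats

# Made-up rows: number of items on promotion and the corresponding sales
demo = pd.DataFrame({
    "onpromotion": [0, 0, 0, 0, 0, 2, 3, 1, 4, 5],
    "sales": [1.0, 2.0, 0.0, 3.0, 1.5, 9.0, 12.0, 7.0, 15.0, 20.0],
})

promoted = demo.loc[demo["onpromotion"] > 0, "sales"]
not_promoted = demo.loc[demo["onpromotion"] == 0, "sales"]

# Two-sided test: null hypothesis is that both groups share the same distribution
u_statistic, p_value = stats.mannwhitneyu(
    promoted, not_promoted, alternative="two-sided"
)
print("U:", u_statistic, "p-value:", p_value)
```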



### **Hypothesis 2: <br>**
### ```Null```: Sales increase over time. <br>
### ```Alternate```: Sales do not increase with time.
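One simple non-parametric way to check for a monotonic trend is to correlate the sales values with their time order using Kendall's tau (`scipy.stats.kendalltau`). This is an assumed approach for illustration, shown on a made-up series that rises with small oscillations.

```python
import numpy as np
from scipy import stats

# Made-up series: upward trend with a small alternating wiggle
n_periods = 24
time_order = np.arange(n_periods)
sales_values = 100 + 5 * time_order + 3 * (-1.0) ** time_order

# Kendall's tau between time order and sales; tau > 0 suggests an increasing trend
tau, p_value = stats.kendalltau(time_order, sales_values)
print("tau:", tau, "p-value:", p_value)
```

In the notebook, `sales_values` could be the daily or monthly totals computed earlier, taken in date order.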



### **Hypothesis 3: <br>**
### ```Null```: Situating a store in a particular city does not influence sales.<br>
### ```Alternate```: Situating a store in a particular city significantly affects sales.



### **Hypothesis 4: <br>**
### ```Null```: The more the transactions the higher the sales. <br>
### ```Alternate```: Transactions don't have an impact on sales.
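Since sales are not normally distributed, the relation between transactions and sales can be examined with Spearman's rank correlation, which also returns a p-value through `scipy.stats.spearmanr`. The sketch below uses made-up numbers and is one possible way to test this hypothesis.

```python
import pandas as pd
from scipy import stats

# Made-up store-day totals
demo = pd.DataFrame({
    "transactions": [500, 800, 650, 1200, 900, 1500, 400, 1100],
    "sales": [200.0, 340.0, 310.0, 600.0, 380.0, 790.0, 150.0, 520.0],
})

# Spearman's rho measures monotonic association between the two columns
rho, p_value = stats.spearmanr(demo["transactions"], demo["sales"])
print("rho:", rho, "p-value:", p_value)
```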