# ANOVA Analysis on Profit Forecast and Reported Data

### Dataset Information:

1. **Forecasted Profit Values** for 30 days: These are the expected profit values predicted ahead of time.
2. **Reported Profit Values from 2 Sources** for the same 30 days: These are the actual profit values reported by two different sources.
3. **Day**: A column representing the day index (1 to 30).

### Explanation of Columns:
- **Day**: The day for which the profit is reported (from 1 to 30).
- **Forecasted_Profit**: The expected profit value for that day.
- **Reported_Profit_Source1**: The actual profit value reported by Source 1 for that day.
- **Reported_Profit_Source2**: The actual profit value reported by Source 2 for that day.

### Objective:
The goal is to determine, using hypothesis testing, whether the reported values deviate significantly from the forecasted values.

In this notebook, we use one-way ANOVA in two steps over the 30-day period: first to compare the forecasted profits with the profits reported by each source, and then to compare the two sources' deviations from the forecast with each other.

## Loading Data

In [1]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("datasets-1/profit_forecast_reported.csv")
df.head()

| | Day | Forecasted_Profit | Reported_Profit_Source1 | Reported_Profit_Source2 |
|---|---|---|---|---|
| 0 | 1 | 500.0 | 507.45 | 487.97 |
| 1 | 2 | 518.96 | 516.88 | 556.0 |
| 2 | 3 | 537.86 | 547.57 | 537.59 |
| 3 | 4 | 556.65 | 579.49 | 535.49 |
| 4 | 5 | 575.28 | 571.76 | 591.73 |
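
Before running any tests, it can help to confirm that the loaded frame has the columns described above and no missing values. The following is a minimal sketch under stated assumptions: the CSV file itself is not reproduced here, so the five rows shown by `df.head()` stand in for it, and `validate_profit_frame` is a hypothetical helper name, not part of the original notebook.

```python
import io

import pandas as pd

# The five rows displayed by df.head(), used as a stand-in for the full CSV.
SAMPLE_CSV = """Day,Forecasted_Profit,Reported_Profit_Source1,Reported_Profit_Source2
1,500.0,507.45,487.97
2,518.96,516.88,556.0
3,537.86,547.57,537.59
4,556.65,579.49,535.49
5,575.28,571.76,591.73
"""

EXPECTED_COLUMNS = [
    "Day",
    "Forecasted_Profit",
    "Reported_Profit_Source1",
    "Reported_Profit_Source2",
]


def validate_profit_frame(frame):
    """Raise ValueError if an expected column is missing or contains NaN."""
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if frame[EXPECTED_COLUMNS].isna().any().any():
        raise ValueError("expected columns contain missing values")
    return frame


sample = validate_profit_frame(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(sample.shape)
```

The same helper could be called on the full `df` right after `pd.read_csv`, so that a renamed or incomplete column is caught before it reaches `f_oneway`.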


## Comparing Source 1 and Source 2 with forecasted values to observe deviation

In [2]:
# One-way ANOVA: forecasted profit vs. profit reported by Source 1
from scipy import stats
anova = stats.f_oneway(df.Forecasted_Profit.values, df.Reported_Profit_Source1.values)
print(anova)

F_onewayResult(statistic=0.007944609948917544, pvalue=0.9292836462736932)


In [3]:
# One-way ANOVA: forecasted profit vs. profit reported by Source 2
from scipy import stats
anova = stats.f_oneway(df.Forecasted_Profit.values, df.Reported_Profit_Source2.values)
print(anova)

F_onewayResult(statistic=0.02640104786592417, pvalue=0.8714895040541173)
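
Both p-values above (about 0.93 and 0.87) are far above 0.05, so neither test detects a difference in mean profit between the forecast and a source. Note that `f_oneway` treats the two columns as independent groups, whereas here each reported value is matched to a forecast for the same day. One possible alternative, not used in the original analysis, is a paired t-test, which is equivalent to a one-sample t-test on the day-by-day differences. The sketch below uses only the five rows shown by `df.head()`, so its numbers are illustrative and will differ from a run on all 30 days.

```python
import numpy as np
from scipy import stats

# First five days only, copied from the df.head() output in this notebook.
forecast = np.array([500.0, 518.96, 537.86, 556.65, 575.28])
reported = {
    "Source 1": np.array([507.45, 516.88, 547.57, 579.49, 571.76]),
    "Source 2": np.array([487.97, 556.0, 537.59, 535.49, 591.73]),
}

paired_results = {}
one_sample_results = {}
for name, values in reported.items():
    # Paired t-test: compares reported and forecast values matched by day.
    paired_results[name] = stats.ttest_rel(values, forecast)
    # Same test expressed as a one-sample t-test on the differences.
    one_sample_results[name] = stats.ttest_1samp(values - forecast, popmean=0.0)
    print(
        f"{name}: t = {paired_results[name].statistic:.3f}, "
        f"p = {paired_results[name].pvalue:.3f}"
    )
```

On the full dataset, the same calls could be applied to `df['Reported_Profit_Source1']` and `df['Forecasted_Profit']` directly.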


## Calculating Deviations from Forecasted Profit

In [4]:
# Calculating deviations from forecasted profits
df['Deviation_Source1'] = df['Reported_Profit_Source1'] - df['Forecasted_Profit']
df['Deviation_Source2'] = df['Reported_Profit_Source2'] - df['Forecasted_Profit']

df[['Day', 'Forecasted_Profit', 'Deviation_Source1', 'Deviation_Source2']].head()

| | Day | Forecasted_Profit | Deviation_Source1 | Deviation_Source2 |
|---|---|---|---|---|
| 0 | 1 | 500.0 | 7.45 | -12.03 |
| 1 | 2 | 518.96 | -2.08 | 37.04 |
| 2 | 3 | 537.86 | 9.71 | -0.27 |
| 3 | 4 | 556.65 | 22.84 | -21.16 |
| 4 | 5 | 575.28 | -3.52 | 16.45 |


## Performing ANOVA Analysis

In [5]:
from scipy import stats

# Performing one-way ANOVA between the deviations from the two sources
anova_result = stats.f_oneway(df['Deviation_Source1'], df['Deviation_Source2'])

anova_result

F_onewayResult(statistic=2.9421128233345986, pvalue=0.09163624507158695)

## Interpretation of Results

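The ANOVA test returns two key values:
- **F-statistic**: The ratio of the variance between the group means to the variance within the groups. Larger values indicate that the group means differ by more than the within-group variation alone would explain.
- **p-value**: The probability of observing an F-statistic at least this large if the group means were actually equal. A small p-value (conventionally below 0.05) indicates a statistically significant difference in the group means.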
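
To make the two definitions concrete, the sketch below computes the F-statistic and p-value by hand and checks them against `scipy.stats.f_oneway`. It is an illustration under stated assumptions: `one_way_f` is a hypothetical helper name, and the inputs are only the five deviation rows displayed earlier, so the numbers will not match the 30-day result.

```python
import numpy as np
from scipy import stats

# First five deviations only, copied from the table shown earlier.
dev1 = np.array([7.45, -2.08, 9.71, 22.84, -3.52])
dev2 = np.array([-12.03, 37.04, -0.27, -21.16, 16.45])


def one_way_f(*groups):
    """Return (F, p) for a one-way ANOVA computed from sums of squares."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = all_values.size      # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    f_value = ms_between / ms_within
    # Upper-tail probability of the F distribution with (k-1, n-k) d.o.f.
    p = stats.f.sf(f_value, k - 1, n - k)
    return f_value, p


f_stat, p_value = one_way_f(dev1, dev2)
reference = stats.f_oneway(dev1, dev2)
print(f"manual: F = {f_stat:.4f}, p = {p_value:.4f}")
print(f"scipy:  F = {reference.statistic:.4f}, p = {reference.pvalue:.4f}")
```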

In this case:
- **F-statistic**: 2.94
- **p-value**: 0.0916

Since the p-value (0.0916) is greater than the 0.05 significance level, we fail to reject the null hypothesis that the two sources have the same mean deviation from the forecast. In other words, the data do not provide enough evidence to conclude that the deviations of the reported profits differ significantly between the two sources.
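
The same decision rule can be applied to all three tests run in this notebook. The snippet below is a small sketch: the p-values are copied from the outputs printed above, while `ALPHA` and `decide` are illustrative names that do not appear in the original notebook.

```python
ALPHA = 0.05  # significance level used in the interpretation above

# p-values as printed by the three f_oneway calls in this notebook.
p_values = {
    "Forecast vs Source 1": 0.9292836462736932,
    "Forecast vs Source 2": 0.8714895040541173,
    "Deviation Source 1 vs Deviation Source 2": 0.09163624507158695,
}


def decide(p_value, alpha=ALPHA):
    """Return the test decision for a given p-value and significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


decisions = {label: decide(p) for label, p in p_values.items()}
for label, decision in decisions.items():
    print(f"{label}: p = {p_values[label]:.4f} -> {decision}")
```

At the 0.05 level, none of the three comparisons leads to rejecting the null hypothesis; the comparison of the deviations comes closest, with a p-value just under 0.10.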