## 2 Data Understanding 
---

In this phase, I will try to understand and explore the raw data in order to devise my approach towards how to clean and reformat it. I will use the following steps to understand the data:

1. Data Collection and Setup
2. Data Description
3. Data Exploration
4. Feature Analysis
5. Conclusion

### 2.1 Data Collection & Setup
---

First, I need to load the data and the necessary libraries. After that, I will take a look at the data to understand its structure.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from matplotlib import rcParams
from statsmodels.tsa.stattools import adfuller
import warnings

warnings.filterwarnings('ignore', category=DeprecationWarning)

avocados_df = pd.read_csv("../data/raw/avocado.csv")
avocados_df.head()

### 2.2 Data Description

---

In order to understand the data better, I must first analyse its characteristics, each of its columns, and their properties. In this section, I will describe the dataset, covering its attributes, types, and any missing values.

In [None]:
avocados_df.info()

Check the shape of the dataset to understand the number of records and columns.

In [None]:
print(avocados_df.shape)

Describe each column's count, mean, std, min, 25%, 50%, 75% and max to gain a high-level overview of each column. This can help spot any anomalies, missing values or extreme values.

In [None]:
avocados_df.describe()

Check the number of missing values in each column.

In [None]:
print(avocados_df.isnull().sum())

Check for duplicate entries using the built-in `duplicated()` method. By default this method only flags rows that are identical in every column, so it can miss records that repeat the same observation with slightly different values. Hence, I will also check for other duplicates manually in later steps.

In [None]:
print(avocados_df.duplicated().sum(), "duplicate entries found.")
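
One way such a manual check could look is to test for repeats on a set of key columns only. The sketch below is a minimal illustration on made-up rows; the helper name `count_key_duplicates` and the choice of `Date`, `region` and `type` as the identifying key are assumptions, not something the dataset documentation prescribes.

```python
import pandas as pd

def count_key_duplicates(df: pd.DataFrame, keys: list[str]) -> int:
    """Count rows that repeat an earlier row on the given key columns only."""
    return int(df.duplicated(subset=keys).sum())

# Tiny illustrative frame (made-up values, not taken from the dataset).
sample = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-04", "2015-01-11"],
    "region": ["Albany", "Albany", "Albany"],
    "type": ["organic", "organic", "organic"],
    "AveragePrice": [1.20, 1.25, 1.30],
})

full_row_dupes = int(sample.duplicated().sum())   # rows identical in every column
key_dupes = count_key_duplicates(sample, ["Date", "region", "type"])
print(full_row_dupes, "full-row duplicates,", key_dupes, "key duplicates")
```

In this toy frame the first two rows describe the same date, region and type with different prices, so only the key-based check flags them.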

As the `describe()` method only summarizes the numerical columns by default, I will check the unique values of the categorical columns to understand them as well.

In [None]:
print('Unique types:', avocados_df['type'].unique())
print('Number of unique regions:', avocados_df['region'].nunique())

Based on the output and some help from the dataset documentation, I can understand the following about the dataset:

**Dataset columns description:**

- `Date` *(object)* - The date of the observation.
- `AveragePrice` *(float64)* - The average price of a single avocado.
- `Total Volume` *(float64)* - Total number of avocados sold.
- `4046` *(float64)* - Total number of avocados with PLU 4046 sold.
- `4225` *(float64)* - Total number of avocados with PLU 4225 sold.
- `4770` *(float64)* - Total number of avocados with PLU 4770 sold.
- `Total Bags` *(float64)* - Total number of bags sold.
- `Small Bags` *(float64)* - Total number of small bags sold.
- `Large Bags` *(float64)* - Total number of large bags sold.
- `XLarge Bags` *(float64)* - Total number of extra-large bags sold.
- `year` *(int64)* - The year.
- `type` *(object)* - Conventional or organic.
- `region` *(object)* - The city or region of the observation.

**Dataset information:**

Based on the overview that we have seen so far, we can deduce the following general facts about the whole dataset for now:

- **Count:** The total number of records in this dataset is 18,249
- **Column `Unnamed: 0`:** This column seems to be a per-source index, as it resets when it reaches 51. Hence this file was most likely created by appending multiple datasets of about 51 records each. This also means that there might be duplicate or anomalous records in the dataset.
- **Year:** The dataset documentation says that the data was expanded starting in 2013, but the earliest record here is from 2015 and the last record is from 2018. We might only have data for a span of about 3 years.
- **Missing Values:** Interestingly, there are no missing values in this dataset. This could be a good sign, but we should still check for placeholder values that stand in for missing data.
- **Duplicate Values:** The built-in check found 0 duplicate entries, which suggests that all records are unique. As this check was automatic, we cannot be sure and need to investigate further.
- **Types:** There are 3 types of data in this dataset: float64(9), int64(1), object(3)
- **Regions:** There are 54 unique regions in the dataset, but the USA only has 50 states, so the values cannot all be states and the column probably contains duplicate or invalid region values.
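
The observation about the restarting index can be checked more directly by splitting the counter into blocks wherever it drops back. This is only a sketch on a made-up counter; the helper name `block_sizes` is hypothetical.

```python
import pandas as pd

def block_sizes(counter: pd.Series) -> pd.Series:
    """Split a repeatedly restarting counter into blocks and return each block's length."""
    restarts = counter.diff() < 0      # True wherever the counter drops back
    block_id = restarts.cumsum()       # running label, one per block
    return counter.groupby(block_id).size()

# Made-up counter that restarts twice (0..3, 0..3, 0..2).
counter = pd.Series([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2])
sizes = block_sizes(counter)
print(sizes.tolist())
```

On the real data this could be called as `block_sizes(avocados_df['Unnamed: 0'])` while that column is still present, and `sizes.value_counts()` would then show whether all blocks have the same length.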

From now on, I will remove the 'Unnamed: 0' column as it is not necessary for the analysis. I will also change the name of all columns to lowercase and to use underscores instead of spaces for consistency.

In [None]:
avocados_df.drop('Unnamed: 0', axis=1, inplace=True)
avocados_df.columns = avocados_df.columns.str.lower().str.replace(' ', '_')
avocados_df.rename(columns={'averageprice': 'average_price'}, inplace=True)
avocados_df.head()

### 2.3 Data Exploration

---

Since the main goal is to predict the average price of avocados in the next month, the `average_price` column is the target variable. In this section, I will explore the data to understand the relationships between the features and the target variable.


**Brevity**

In the later stages of my analysis I found that one particular region, `TotalUS`, is most likely a sum of most of the other regions. This makes it redundant, so it can be removed. I will remove this region from the dataset at an early stage for brevity, and I will justify my decision with the following analysis. If I do not do this, the `TotalUS` region will skew the results of the analysis and I will have to repeat all of these steps again.

First, let's visualize the total volume of avocados sold per region:

In [None]:
plt.figure(figsize=(20, 6))
avocados_df.groupby('region')['total_volume'].sum().plot(kind='bar', title="Total Volume of Avocados Sold by Region")
plt.xlabel('Region')
plt.ylabel('Total Volume')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

The `TotalUS` region is clearly an outlier in terms of total volume sold. This is most likely because it is the sum of all regions. I will remove this region from the dataset and visualize the data again:

In [None]:
avocados_df_no_totalus = avocados_df.drop(avocados_df[avocados_df['region'] == 'TotalUS'].index)
plt.figure(figsize=(20, 6))
avocados_df_no_totalus.groupby('region')['total_volume'].sum().plot(kind='bar', title="Total Volume of Avocados Sold by Region")
plt.xlabel('Region')
plt.ylabel('Total Volume')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

The result is extremely different now. The maximum total volume sold was about 6m, but now it is about 1.2m, which is a significant difference. However, I need to confirm my hypothesis. I will calculate the sum of total volume sold for all regions except `TotalUS` and compare it with the total volume sold in `TotalUS`. I will also do the same for other regions that might be aggregations.

Other regions that might be aggregations:

- `West`
- `SouthCentral`
- `Southeast`
- `Northeast`
- `Plains`
- `GreatLakes`
- `Midsouth`

> Note: The total avocado sales of `California` are almost as big as those of `SouthCentral` and `West`. This might mean that California is also an aggregation; however, I will not include it in my aggregation list as it is a state and not a region.

In [None]:
# Regions that look like aggregations of other regions,
# spelled exactly as they appear in the `region` column.
excluded_regions = [
    'TotalUS', 'West', 'SouthCentral', 'Southeast',
    'Northeast', 'Plains', 'GreatLakes', 'Midsouth'
]

is_excluded = avocados_df['region'].isin(excluded_regions)
total_df = avocados_df['total_volume'].sum()
total_us_volume = avocados_df[avocados_df['region'] == 'TotalUS']['total_volume'].sum()
total_no_excluded_regions = avocados_df[~is_excluded]['total_volume'].sum()
total_only_excluded_regions = avocados_df[is_excluded]['total_volume'].sum()

plt.figure(figsize=(12, 6))
totals = {
    'All Regions': total_df,
    'TotalUS Only': total_us_volume,
    'Non-Aggregated\nRegions': total_no_excluded_regions,
    'Aggregated\nRegions': total_only_excluded_regions
}

# Scale to billions so the bar heights are readable.
totals_billions = {k: v / 1e9 for k, v in totals.items()}
plt.bar(list(totals_billions.keys()), list(totals_billions.values()))
plt.title('Comparison of Total Sale Volumes for Aggregated and Non-Aggregated Regions')
plt.ylabel('Total Volume (Billions)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

print(f"All regions (sum): {total_df:,.0f}")
print(f"TotalUS only (sum): {total_us_volume:,.0f}")
print(f"Non-aggregated regions (sum): {total_no_excluded_regions:,.0f}")
print(f"Aggregated regions (sum): {total_only_excluded_regions:,.0f}")


Summed together, the other possible aggregations mentioned above do not exactly match the `TotalUS` region. This can mean that the other aggregations are not consistent throughout the data. Removing all of them would destroy too much of the data and make it unusable.
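
Comparing grand totals can hide weeks in which the numbers disagree, so a per-date comparison may be more informative. The following is a minimal sketch on made-up volumes; the helper name `aggregate_ratio` is hypothetical, and the column names `date`, `region` and `total_volume` are assumed to match the renamed dataframe.

```python
import pandas as pd

def aggregate_ratio(df: pd.DataFrame, total_region: str, part_regions: list[str]) -> pd.Series:
    """Per date, divide the summed volume of part_regions by the volume of total_region."""
    parts = df[df["region"].isin(part_regions)].groupby("date")["total_volume"].sum()
    total = df[df["region"] == total_region].groupby("date")["total_volume"].sum()
    return (parts / total).rename("parts_over_total")

# Made-up volumes: the parts match the total in week 1 but not in week 2.
sample = pd.DataFrame({
    "date": ["2015-01-04"] * 3 + ["2015-01-11"] * 3,
    "region": ["West", "Plains", "TotalUS"] * 2,
    "total_volume": [60.0, 40.0, 100.0, 50.0, 30.0, 100.0],
})
ratio = aggregate_ratio(sample, "TotalUS", ["West", "Plains"])
print(ratio)
```

A ratio that stays close to 1 on every date would support the aggregation hypothesis, while a ratio that drifts would point to inconsistent aggregations.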

Hence, I will only remove the `TotalUS` region and keep the other aggregations for now.

In [49]:
avocados_df = avocados_df.drop(avocados_df[avocados_df['region'] == 'TotalUS'].index)

**Outlier Detection**

First, I will check the distribution of the target variable, `average_price`, for each `type` of avocado.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='type', y='average_price', data=avocados_df)
plt.title("Distribution of Average Price of Avocados by Type")
plt.xlabel("Type of Avocado")
plt.ylabel("Average Price")
plt.grid(True)
plt.show()

The boxplot shows that the average price of organic avocados is higher than conventional avocados. It also highlights the presence of some outliers in the data. Hence, I must investigate the outliers further.
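
To put a number on the outliers, the same 1.5 × IQR rule that a boxplot uses for its whiskers by default can be applied per type. This is a sketch on made-up prices; the helper name `iqr_outlier_mask` is hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up prices with one obvious outlier in the conventional group.
sample = pd.DataFrame({
    "type": ["conventional"] * 5 + ["organic"] * 5,
    "average_price": [1.0, 1.1, 1.2, 1.3, 3.0, 1.5, 1.6, 1.7, 1.8, 1.9],
})
outliers_per_type = (
    sample.groupby("type")["average_price"]
    .apply(lambda s: int(iqr_outlier_mask(s).sum()))
)
print(outliers_per_type)
```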

Let's print the number of records present for each type of avocado:

In [None]:
print(avocados_df['type'].value_counts())

Both avocado types have almost the same number of records, which means the data is well distributed between the two types and no balancing is required.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='type', y='total_volume', data=avocados_df)
plt.title("Distribution of Total Volume of Avocados by Type")
plt.xlabel("Type of Avocado")
plt.ylabel("Total Volume")
plt.grid(True)
plt.show()

This boxplot outlines some important points:

- The total sold volume of *organic* avocados is much more consistent. As the number of records for each avocado type is almost the same, this difference cannot be explained by sample size.
- The *conventional* avocados probably have a lot of outliers, even though the number of records for each avocado type is almost the same. This could be due to the presence of different regions in the dataset.

I will now check the distribution of avocado PLU codes.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=avocados_df[['4046', '4225', '4770']])
plt.title("Distribution of Avocado PLU Codes")
plt.xlabel("PLU Codes")
plt.ylabel("Values")
plt.grid(True)
plt.xticks([0, 1, 2], ['4046', '4225', '4770'])
plt.tight_layout()
plt.show()

The first thing this boxplot shows is that the distributions of codes `4046` and `4225` are very similar, while code `4770` is very different from the other two. It also highlights the presence of many outliers with extreme values, as the difference between the ranges is in the millions (0-1m vs 6m).

**Data Frequency**

It is important to understand the frequency of the data in the dataset, as knowing its granularity will help me interpret the data correctly.

In [None]:
avocados_df['date'] = pd.to_datetime(avocados_df['date'])
plt.figure(figsize=(20, 6))
monthly_counts = avocados_df['date'].dt.to_period('M').value_counts().sort_index()
monthly_counts.plot(kind='bar')
plt.title("Number of Records in the Dataset per Month of Each Year")
plt.xlabel("Date")
plt.ylabel("Record Count")
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

The number of records per month varies between about 425 and 530 and seems to be stable. Most likely no data balancing is required.
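
Record counts per month do not reveal how far apart the observations are. A small sketch like the one below, run here on made-up dates, counts the gaps between consecutive distinct dates; the helper name `observation_gaps` is hypothetical.

```python
import pandas as pd

def observation_gaps(dates: pd.Series) -> pd.Series:
    """Count the gaps (in days) between consecutive distinct observation dates."""
    unique_dates = pd.Series(pd.to_datetime(dates).unique()).sort_values()
    return unique_dates.diff().dropna().dt.days.value_counts()

# Made-up dates: two 7-day gaps and one 14-day gap (a missing week).
sample_dates = pd.Series(
    ["2015-01-04", "2015-01-04", "2015-01-11", "2015-01-18", "2015-02-01"]
)
gaps = observation_gaps(sample_dates)
print(gaps)
```

If a single gap size dominates on the real `date` column, that size is the sampling interval, and any other gap sizes point to missing observation dates.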

Sales volume trend by year:

In [None]:
avocados_df.groupby("year")['total_volume'].sum().plot(kind='bar', figsize=(20, 6))
plt.title("Total Sales Volume of Avocados by Year")
plt.xlabel("Year")
plt.ylabel("Total Sales Volume")
plt.grid(True)
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

It is evident that either the total sales declined sharply in 2018 or, more likely, the data for 2018 is incomplete. To find out which is the case, I will check the number of records for each month in each year.

In [None]:
sales_per_year = avocados_df.copy()
sales_per_year['date'] = pd.to_datetime(avocados_df['date'])
sales_per_year['month'] = sales_per_year['date'].dt.month
sales_per_year['year'] = sales_per_year['date'].dt.year
plt.figure(figsize=(20, 6))
sns.countplot(x='month', data=sales_per_year, hue='year')
plt.title("Number of Records by Month per Year")
plt.xlabel("Month")
plt.ylabel("Number of Records")
plt.grid(True)
plt.show()

This count plot indicates that my hypothesis was correct: the number of records for 2018 is significantly lower than for the other years. This is why the total sales volume for 2018 is lower than in the other years.

The lack of data in 2018 could affect the model's performance if I decided to predict the price over a whole year. However, as it is much better to increase the granularity of the data by labeling seasons or months and to predict the price for, e.g., each month, the lack of data in 2018 will most likely not be a problem.
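
The same conclusion can be read off as numbers by counting how many distinct calendar months have records in each year. This is a sketch on made-up dates, and the helper name `months_covered` is hypothetical.

```python
import pandas as pd

def months_covered(dates: pd.Series) -> pd.Series:
    """Number of distinct calendar months with at least one record, per year."""
    dates = pd.to_datetime(dates)
    return dates.dt.month.groupby(dates.dt.year).nunique()

# Made-up dates: three months present in 2017, two in 2018.
sample_dates = pd.Series(
    ["2017-01-01", "2017-06-04", "2017-12-31", "2018-01-07", "2018-03-25"]
)
coverage = months_covered(sample_dates)
print(coverage)
```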

**Price Trends**

In this section, I will investigate the price trends of avocados by type over the whole dataset to see how they change over the time and if there are seasonal patterns in the pricing.

In [None]:
plt.figure(figsize=(20, 6))
sns.lineplot(x='date', y='average_price', hue='type', data=avocados_df)
plt.axhline(y=avocados_df[avocados_df['type']=='conventional']['average_price'].mean(), 
           color='blue', linestyle='--', alpha=0.5, 
           label='Conventional Average')
plt.axhline(y=avocados_df[avocados_df['type']=='organic']['average_price'].mean(),
           color='orange', linestyle='--', alpha=0.5,
           label='Organic Average') 
plt.title("Average Price Trends Over Time by Avocado Type")
plt.xlabel("Date")
plt.ylabel("Average Price")
plt.grid(True)
plt.tight_layout()
plt.xticks(rotation=0)
plt.show()

This line plot highlights some critical points as follows:

- Even though the average price of organic avocados is higher than that of conventional avocados, they seem to share the same trend and seasonality. However, this does not mean we can use one type instead of both, or use a weighted average of the two types.

- Prices are highest in 2017 and lowest in 2015, which suggests that external factors not present in this dataset (such as poor harvests, weather, marketing, supply shortages or other economic factors) strongly affect the prices of avocados.

- After each rise and fall, the prices seem to return to their mean. This could suggest that the prices revolve around a mean value and that there could be a seasonal pattern in the data.

However, I will have to explore further in later steps to understand the seasonality in the dataset.
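
One simple way to look at how prices move around their mean is to smooth the per-date average price of each type with a trailing rolling mean. The sketch below uses made-up prices; the helper name `rolling_price_by_type` and the default window of 4 observations are assumptions chosen for illustration.

```python
import pandas as pd

def rolling_price_by_type(df: pd.DataFrame, window: int = 4) -> pd.Series:
    """Mean price per (type, date), smoothed with a trailing rolling mean within each type."""
    per_date = df.groupby(["type", "date"])["average_price"].mean()
    return per_date.groupby(level="type").transform(
        lambda s: s.rolling(window, min_periods=1).mean()
    )

# Made-up prices for a single type over four weeks.
sample = pd.DataFrame({
    "type": ["organic"] * 4,
    "date": pd.to_datetime(["2015-01-04", "2015-01-11", "2015-01-18", "2015-01-25"]),
    "average_price": [1.0, 2.0, 3.0, 4.0],
})
smoothed = rolling_price_by_type(sample, window=2)
print(smoothed.tolist())
```

Plotting the smoothed series next to the overall mean of each type would make it easier to judge whether prices really drift back to that mean.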

### 2.4 Feature Analysis

---

#### Pearson Correlation Matrix

First I will visualize the Pearson correlation matrix to understand the relationships between the features.

In [None]:
avocados_df_corr = avocados_df.drop(['date', 'type', 'region'], axis=1)

corr_matrix = avocados_df_corr.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Correlation Matrix of Avocado Dataset")
plt.show()

Based on the heatmap, there are not many meaningful correlations. This could suggest that the data in its current form is not ready to be analyzed. However, we can deduce the following:

**`Total Volume` and `PLU 4046`, `PLU 4225`:**

- The total volume of avocados sold is most strongly correlated with the number of PLU 4046 avocados sold, followed by PLU 4225 and PLU 4770. This might mean that the majority of sales are of PLU 4046 and 4225 avocados.

**`Total Volume` and `Bags`:**

- This strong positive correlation could suggest that the total volume of avocados sold is highly correlated with the total number of bags sold. 

**`Total Volume` and `Average Price`:**

- This weak negative correlation could suggest that total volume of avocados sold increases as the average price of avocados decreases or vice versa.

For now, I can mark some features to be important for further analysis and modeling:

- `Total Volume`
- `Average Price`
- `PLU codes`
- `Total Bags`

As these features have strong positive correlation with the `Total Volume` and a weak negative correlation with the `Average Price`, I will focus on these features in general.

**Bags**

Since the bag columns are potential features, `total_bags` seems to be the sum of the other bag size columns, and the correlations of the individual bag sizes are almost the same as those of their total, it is better to discard the redundant columns.

I will first test my hypothesis and, upon validation, remove the individual bag size columns and keep only `total_bags`, as the others do not add much value to my analysis.

Verify whether the `total_bags` column is the sum of the other 3 columns.

In [None]:
print("Total Bags Sales Sum:", avocados_df['total_bags'].sum())
print("Bag Size Sales Sum:  ", avocados_df['small_bags'].sum() + avocados_df['large_bags'].sum() + avocados_df['xlarge_bags'].sum())

The difference between the `total_bags` column and the sum of the other 3 columns is only about 10 sales. This difference is negligible and can be ignored. Therefore, I will drop the other 3 columns and keep the `total_bags` column.
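
Matching grand totals could in principle hide rows whose errors cancel out, so before dropping the columns a row-wise comparison is a useful extra check. This is a sketch on made-up rows; the helper name `bags_mismatch` and the tolerance of 1.0 are assumptions.

```python
import pandas as pd

def bags_mismatch(df: pd.DataFrame, tol: float = 1.0) -> pd.DataFrame:
    """Return the rows where total_bags differs from the sum of the bag sizes by more than tol."""
    parts = df[["small_bags", "large_bags", "xlarge_bags"]].sum(axis=1)
    diff = (df["total_bags"] - parts).abs()
    return df.loc[diff > tol]

# Made-up rows: only the last one is inconsistent.
sample = pd.DataFrame({
    "total_bags": [10.0, 20.0, 30.0],
    "small_bags": [6.0, 10.0, 10.0],
    "large_bags": [3.0, 5.0, 5.0],
    "xlarge_bags": [1.0, 5.0, 5.0],
})
mismatches = bags_mismatch(sample)
print(len(mismatches), "inconsistent rows")
```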

In [None]:
avocados_df = avocados_df.drop(columns=['small_bags', 'large_bags', 'xlarge_bags'])
avocados_df.head()

**`Total Volume` and `Average Price`**

In almost every business, these two fields are naturally inversely related. As the correlation between them is negative (hence, natural), I will leave them as they are.

**`Year`**

The `year` field seems to be almost neutral in its correlation with other fields: slightly positive for some and slightly negative for others. This might suggest that `year` is not a good feature to use for modeling. However, I will keep it as it is for now.

As it is crucial to discover more useful features, I will perform another correlation analysis after the data preparation step.

#### Regions

In order to get a better understanding of regions, first I will quickly check all the available region data and identify any inconsistencies or missing information. This will help in validating the regions better.

In [None]:
region_count = avocados_df.groupby('region')['date'].count().sort_values()
region_count

In the bar chart from the **Brevity** section, we saw some anomalies in the regions. Values such as `TotalUS`, `Plains` and `West` are not states or cities. They seem to be aggregations, duplicates or wrong values that need to be addressed.

In order to use the regions for modeling, I will need to clean them by aggregating them into more meaningful groups. As doing this manually is time consuming and error prone, I asked ChatGPT to analyze and group the regions into bigger *areas*.

The following is the result of ChatGPT's analysis:

**Anomalies**

- **TotalUS:** This is not a region, but a sum of all regions.

And the following regions cover multiple states:

- **Plains:** Kansas, Nebraska, etc...
- **West:** California, Washington, etc...
- **SouthCentral:** Texas, Oklahoma, etc...
- **GreatLakes:** Illinois, Michigan, etc...
- **Midsouth:** Mississippi, Arkansas, etc...
- **Northeast:** New York, Pennsylvania, etc...
- **Southeast:** Florida, Georgia, etc...

**Aggregation**

The regions can be grouped into the following areas:

1. **Northeast:** New York, Philadelphia, Boston, BaltimoreWashington, HartfordSpringfield, Buffalo-Rochester, Pittsburgh, Syracuse, Harrisburg-Scranton, Albany, Northern New England.
2. **Midwest:** Chicago, Detroit, Columbus, CincinnatiDayton, St. Louis, Grand Rapids, Indianapolis, Great Lakes, Plains.
3. **South:** Atlanta, Charlotte, MiamiFtLauderdale, DallasFtWorth, Houston, Nashville, RaleighGreensboro, Tampa, Orlando, Jacksonville, NewOrleansMobile, Louisville, RichmondNorfolk, SouthCarolina, SouthCentral, Midsouth, Southeast, Roanoke.
4. **West:** LosAngeles, SanFrancisco, SanDiego, Seattle, Portland, Denver, LasVegas, Sacramento, Spokane, Boise, PhoenixTucson, West, WestTexNewMexico, California.
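
Because a grouping like this comes from a language model, it is worth validating before it is applied: every observed region should be mapped exactly once, and the spelling must match the values in the dataset. The sketch below shows one way to do that on a deliberately flawed toy grouping; the helper name `validate_mapping` is hypothetical.

```python
from collections import Counter

def validate_mapping(groups: dict[str, list[str]], observed: set[str]):
    """Return regions listed more than once, observed regions left unmapped, and listed names never observed."""
    counts = Counter(region for members in groups.values() for region in members)
    duplicated = sorted(r for r, n in counts.items() if n > 1)
    unmapped = sorted(observed - set(counts))
    unused = sorted(set(counts) - observed)
    return duplicated, unmapped, unused

# Toy grouping with deliberate mistakes, for illustration only.
groups = {
    "Northeast": ["Albany", "Boston"],
    "West": ["Boise", "Boston", "St. Louis"],
}
observed = {"Albany", "Boston", "Boise", "StLouis"}
duplicated, unmapped, unused = validate_mapping(groups, observed)
print(duplicated, unmapped, unused)
```

On the real data, `observed` would be `set(avocados_df['region'].unique())`, and all three returned lists should be empty before the mapping is trusted.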

In [None]:
northeast = [
    "Albany", "BaltimoreWashington", "Boston", "BuffaloRochester", "HarrisburgScranton",
    "HartfordSpringfield", "NewYork", "Philadelphia", "Pittsburgh", "Syracuse",
    "Northeast", "NorthernNewEngland"
]

midwest = [
    "Chicago", "CincinnatiDayton", "Columbus", "Detroit", "GrandRapids",
    "GreatLakes", "Indianapolis", "Plains", "StLouis"
]

south = [
    "Atlanta", "Charlotte", "DallasFtWorth", "Houston", "Jacksonville",
    "Louisville", "MiamiFtLauderdale", "Midsouth", "Nashville",
    "NewOrleansMobile", "Orlando", "RaleighGreensboro", "RichmondNorfolk",
    "SouthCarolina", "SouthCentral", "Southeast", "Tampa", "Roanoke"
]

west = [
    "Boise", "California", "Denver", "LasVegas", "LosAngeles", "PhoenixTucson",
    "Portland", "Sacramento", "SanDiego", "SanFrancisco", "Seattle", "Spokane",
    "West", "WestTexNewMexico"
]

region_to_area = {region: "Northeast" for region in northeast}
region_to_area.update({region: "Midwest" for region in midwest})
region_to_area.update({region: "South" for region in south})
region_to_area.update({region: "West" for region in west})

avocados_df["area"] = avocados_df["region"].map(region_to_area).fillna("Unknown")

print("Number of 'Unknown' imputations: ", avocados_df[avocados_df['area'] == 'Unknown'].shape[0])
avocados_df.head()

Now, I will visualize the data again to see if the area aggregation was successful:

In [None]:
plt.figure(figsize=(20, 6))
sns.boxplot(data=avocados_df, x='area', y='average_price')
plt.title('Distribution of Average Price of Avocados by Area')
plt.xlabel('Area')
plt.ylabel('Average Price')
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

It is true that we lose granularity by going from `region` to `area`. However, this will help us understand the data better and make more accurate predictions, as the region values contained too many duplicated, vague and ambiguous values that were practically useless in their original state.

It seems that the new `area` feature does not provide much valuable information regarding the effect of region on average price. However, it is also not completely useless. I will keep it for now to see if the model can utilize its values after encoding.

Also, it seems like the *Northeast* area has the most expensive avocados in the US!

**Redundant Column**

Now that I have introduced the `area` column, I believe that the `region` column is redundant and can be removed. I will remove this column.

In [None]:
avocados_df = avocados_df.drop(columns=['region'])
avocados_df.head()

#### Seasonality

In order to introduce seasonal granularity to the dataset, I will add two new columns called `month_name` and `season`. The regions have already been aggregated into their respective areas above.

**Months**

In [None]:
avocados_df['month_name'] = avocados_df['date'].dt.strftime('%B')
avocados_df.head()

Now we can check the distribution of the data based on the month:

In [None]:
avocados_df = avocados_df.sort_values('date')
plt.figure(figsize=(20, 6))
sns.lineplot(x='month_name', y='total_volume', hue='year', data=avocados_df, errorbar=None)
plt.title("Total Volume of Avocados Sold by Month per Year")
plt.xlabel("Month")
plt.ylabel("Total Volume")
plt.grid(True)
plt.tight_layout()
plt.show()

This line chart suggests the possibility of cyclical patterns in the data. The sales volume seems to be more consistent in the summer months, which could suggest that the prices of avocados are also more stable then. However, the data may not contain a clear trend or seasonality, as the patterns do not seem to occur at fixed intervals and there are only one or two peaks in the data.
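
A plot like this depends on the months appearing in calendar order, which here relies on the rows being sorted by date first. A more robust option, sketched below, is to store the month names as an ordered categorical; the helper name `as_ordered_months` is hypothetical, and the sketch assumes English month names as produced by `strftime('%B')` in the default locale.

```python
import calendar
import pandas as pd

# 'January' ... 'December' (calendar.month_name[0] is an empty string).
MONTH_ORDER = list(calendar.month_name)[1:]

def as_ordered_months(month_names: pd.Series) -> pd.Series:
    """Convert month names to an ordered categorical that sorts in calendar order."""
    return month_names.astype(pd.CategoricalDtype(MONTH_ORDER, ordered=True))

sample = pd.Series(["March", "January", "December", "February"])
ordered = as_ordered_months(sample).sort_values()
print(ordered.tolist())
```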

**Seasons**

In [None]:
seasons = {
    1: 'Winter',
    2: 'Winter',
    3: 'Spring',
    4: 'Spring',
    5: 'Spring',
    6: 'Summer',
    7: 'Summer',
    8: 'Summer',
    9: 'Autumn',
    10: 'Autumn',
    11: 'Autumn',
    12: 'Winter'
}
avocados_df['season'] = avocados_df['date'].dt.month.map(seasons)

avocados_df['season'].value_counts()

Now we can check the distribution of the data based on the season:

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='season', y='average_price', data=avocados_df)
plt.title("Distribution of Average Price of Avocados Sold by Season")
plt.xlabel("Season")
plt.ylabel("Average Price")
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(20, 6))
sns.lineplot(x='season', y='total_volume', hue='year', data=avocados_df, errorbar=None)
plt.title("Total Volume of Avocados Sold by Season per Year")
plt.xlabel("Season")
plt.ylabel("Total Volume")
plt.grid(True)
plt.tight_layout()
plt.show()

This chart is very interesting, as it outlines key trends in the data that can help us understand customer behavior better.

- It seems like the data for 2018 might be entirely misleading, and it might be worth removing it.
- During the other years, autumn had the lowest sales volume.

**Validate**

Check that the new columns do not have any missing values:

In [None]:
print(avocados_df[['month_name', 'season']].isnull().sum())

**Conclusion**

Adding seasons does not add much value beyond the months. Also, as the goal is to predict the price of avocados by month and not by season, I will not use the `season` column in later steps and will only keep the `month_name` column that I have added.

In [71]:
avocados_df = avocados_df.drop(columns=['season'])

**Seasonal Decomposition**

In order to understand the seasonality in the dataset better, I will decompose the data into its seasonal, trend, and residual components.

In [None]:
rcParams['figure.figsize'] = 18, 8
s_decomposition = avocados_df.copy()
s_decomposition["date"] = pd.to_datetime(s_decomposition["date"])
s_decomposition = s_decomposition.sort_values(by="date").set_index("date")

# Resample to monthly means so that period=12 corresponds to one yearly cycle.
# (With weekly data, period=12 would describe a 12-week cycle instead.)
y = s_decomposition["average_price"].resample("MS").mean()

decomposition = sm.tsa.seasonal_decompose(y, model='additive', period=12)  # 12 monthly observations per year

fig = decomposition.plot()
plt.show()

Trend:

- The trend component decreases at first and then increases a few times over time. However, the trend is not very strong and also not linear.

Seasonal:

- The price of avocados clearly fluctuates seasonally.

Residual (random noise):

- It is hard to say if there is a pattern in the residual component. If there is a pattern, other factors are influencing the price of avocados. If not, the model will probably fit well. My personal opinion is that there is almost a pattern in the residual component.

The previous observations can help with choosing the right model for the job. The seasonality suggests that a model such as SARIMA or Prophet is a good idea. If the trend is clearly linear, it can be a good idea to use XGBoost.

**Stationarity**

It is important to check the stationarity of the data in case I decide to use certain statistical models such as SARIMA. To check whether the data is stationary, I will use the Augmented Dickey-Fuller test on the `average_price` of avocados.

In [None]:
result = adfuller(avocados_df["average_price"])

print("ADF Statistic:", result[0])
print("p-value:        {:.14f}".format(result[1]))

if result[1] > 0.05:
    print("The data is non-stationary (p > 0.05)")
else:
    print("The data is stationary (p ≤ 0.05)")

Based on the result, the data is stationary, as the p-value is lower than 0.05. If this value is correct, it means that I can skip differencing the data and use it directly for modeling with either ARIMA, SARIMA or Prophet.

**Save Output**

Save the output to a new CSV file for the next step (3 Data Preparation).

In [74]:
avocados_df = avocados_df.sort_index(axis=1)
avocados_df.to_csv("../data/interim/avocado_cleaned.csv", index=False)

### 2.5 Conclusion

---

So far I have analyzed the data as much as I could and have made the following changes:

- Removed the `Unnamed: 0` column as it was not necessary.
- Lowercased the column names and replaced spaces with underscores for consistency.
- Removed the `TotalUS` rows, as they are most likely an aggregate of the other regions.
- Visualized the data to understand its distribution better.
- Analyzed the correlation matrix to understand the relationships between the features.
- Removed the `small_bags`, `large_bags` and `xlarge_bags` columns, keeping `total_bags`.
- Grouped the regions into more meaningful `area` values and removed the `region` column.
- Added new `month_name` and `season` columns, but removed `season` as it did not add much value beyond `month_name`.
- Visualized the seasonal decomposition to understand the seasonality of the data.
- Checked the stationarity of the data.
- Saved the output to a new CSV file for the next step (3 Data Preparation).