# <center><b>AVOCADO PRICE PREDICTION</b></center>

---
# **Table of Contents**
---
**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
**4.** [**Data Acquisition & Description**](#Section4)<br>
**5.** [**Data Profiling & Pre-Processing**](#Section5)<br>
**6.** [**Exploratory Data Analysis**](#Section6)<br>
**7.** [**Insights and Conclusion**](#Section7)<br>
**8.** [**ML Model**](#Section8)<br>

<a name = Section1></a>
## **1. Introduction**

#### WHAT IS AN AVOCADO:
- An avocado is a bright green fruit with a large pit and dark leathery skin. They’re also known as alligator pears or butter fruit. 
- They’re the go-to ingredient for guacamole dips. And they're turning up in everything from salads and wraps to smoothies and even brownies.

![image.png](attachment:image.png)
#### NUTRITION:
* Avocados have a lot of calories. The recommended serving size is smaller than you’d expect: 1/3 of a medium avocado (50 grams or 1.7 ounces). One ounce has 50 calories. 

* Avocados are high in fat. But it's monounsaturated fat, which is a "good" fat that helps lower bad cholesterol, as long as you eat them in moderation.

* Avocados are low in sugar. And they contain fiber, which helps you feel full longer. In one study, people who added a fresh avocado half to their lunch were less interested in eating during the next 3 hours than those who didn’t have the fruit.

#### HEALTH BENEFITS:
* A healthy lifestyle that includes nutritious food can help prevent and reverse disease. Avocados are a healthy food you can add. The vitamins, minerals, and healthy fats you get from avocados help prevent disease and keep your body in good working order.  

#### AVOCADOS MAY HELP WARD OFF:

- Cancer: The folate you get from avocados may lower your risk of certain cancers, such as prostate and colon cancer. Nutrients in avocados may also treat cancer. 
- Arthritis and Osteoporosis: Studies on oil extracts from avocados show they can reduce osteoarthritis symptoms. The vitamin K in avocados boosts your bone health by slowing down bone loss and warding off osteoporosis.
- Depression: Folate helps block the buildup of a substance called homocysteine in your blood. Homocysteine slows down the flow of nutrients to your brain and ramps up depression. The high levels of folate in avocados may help keep depression symptoms at bay.
- Inflammation: Chronic inflammation can kick off many diseases, including diabetes, Alzheimer’s disease, and arthritis. The vitamin E in avocados lowers inflammation in your body.
- Digestion: Avocados are high in insoluble fiber, the kind that helps move waste through your body. Fiber keeps you regular and can prevent constipation.
- Blood Pressure: Avocados help level out your blood pressure by lowering sodium levels in your blood and easing tension in your blood vessel walls.
- Heart: Avocados lower cardiovascular inflammation and cholesterol levels.
- Vision: Avocados help protect the tissue in your eyes from UV light damage and help prevent both cataracts and macular degeneration.
- Pregnancy: Avocados help prevent birth defects in your baby's brain and spine.

<a name = Section2></a>
## **2. Problem Statement**

- **Avacard-corp** avocados are sourced from over 1000 growers owning over 65,000 acres across California, Mexico, Chile, and Peru.

- With generations of experience growing, packing, and shipping avocados, they have a deep understanding of the avocado industry.

- Their aim is to source quality fruit that’s sustainably grown and handled in the most efficient, shortest supply route possible.

- They want to increase their supply throughout the United States and need to make sure that they are selling their products at the best possible price.

- Avocado prices have rocketed in recent years by up to 129%, with the average national price in the US of a single **Hass avocado** reaching **2.10 dollars** in 2019, almost doubling in just one year.

- Due to this uncertainty in the prices, the company is not able to sell their produce at the optimal price.

- Your task is to **predict the optimal price of the avocado** using the previous sales data of avocados across different regions.

<a name = Section3></a>
## **3. Installing and Importing Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

<a name = Section4></a>
## **4. Data Acquisition & Description**

In [2]:
data = pd.read_csv("avocado_data.csv")
data.head()

,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


In [3]:
data.columns

Index(['Unnamed: 0', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225',
       '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type',
       'year', 'region'],
      dtype='object')

In [4]:
data.shape

(18249, 14)

In [5]:
data = data.drop("Unnamed: 0", axis = 1)

In [6]:
data.head()

,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


- The dataset contains weekly retail scan data for National Retail Volume (units) and price.

- Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados.

- The column AveragePrice is the average price of a single avocado.

- This is the data that we have to predict for future samples.

<br>

| # | Feature Name | Feature Description |
|:--:|:--|:--| 
|1| Date | The date of observation. |
|2| AveragePrice | The average price of a single avocado. |
|3| Total Volume | Total number of avocados sold. |
|4| 4046 | Total number of avocados with PLU 4046 sold. |
|5| 4225 | Total number of avocados with PLU 4225 sold. |
|6| 4770 | Total number of avocados with PLU 4770 sold. |
|7| Total Bags | Total number of bags sold |
|8| Small Bags | Total number of small bags sold |
|9| Large Bags | Total number of large bags sold |
|10| XLarge Bags | Total number of extra-large bags sold |
|11| type | Type of avocado (conventional or organic). |
|12| year | The year of observation. |
|13| region | The city or region of the observation. |
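The feature descriptions suggest that the volume columns are related to each other. A minimal sketch, using only the first three rows shown in `data.head()` above, checks two relationships that appear to hold in that sample: `Total Volume` equals the three PLU columns plus `Total Bags`, and `Total Bags` equals the three bag sizes. Whether this holds for every row of the full dataset is an assumption that should be verified on `data` itself.

```python
import pandas as pd

# First three rows of the dataset, copied from the data.head() output.
sample = pd.DataFrame({
    "Total Volume": [64236.62, 54876.98, 118220.22],
    "4046": [1036.74, 674.28, 794.70],
    "4225": [54454.85, 44638.81, 109149.67],
    "4770": [48.16, 58.33, 130.50],
    "Total Bags": [8696.87, 9505.56, 8145.35],
    "Small Bags": [8603.62, 9408.07, 8042.21],
    "Large Bags": [93.25, 97.49, 103.14],
    "XLarge Bags": [0.0, 0.0, 0.0],
})

# Sum of the PLU columns and the bagged volume, row by row.
plu_plus_bags = sample[["4046", "4225", "4770", "Total Bags"]].sum(axis=1)
# Sum of the three bag sizes, row by row.
bag_sizes = sample[["Small Bags", "Large Bags", "XLarge Bags"]].sum(axis=1)

# Absolute differences; values near zero mean the columns add up.
volume_gap = (sample["Total Volume"] - plu_plus_bags).abs()
bags_gap = (sample["Total Bags"] - bag_sizes).abs()

print(volume_gap.max(), bags_gap.max())
```

If the same check passes on the full dataset, the volume columns are linearly dependent, which is worth keeping in mind when selecting features for a model.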

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          18249 non-null  object 
 1   AveragePrice  18249 non-null  float64
 2   Total Volume  18249 non-null  float64
 3   4046          18249 non-null  float64
 4   4225          18249 non-null  float64
 5   4770          18249 non-null  float64
 6   Total Bags    18249 non-null  float64
 7   Small Bags    18249 non-null  float64
 8   Large Bags    18249 non-null  float64
 9   XLarge Bags   18249 non-null  float64
 10  type          18249 non-null  object 
 11  year          18249 non-null  int64  
 12  region        18249 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 1.8+ MB


#### Observations:
- The dataset has 18249 observations and 13 features.
- There are no missing values.
- 3 columns have the object data type: Date, type, and region.
- Date should be converted to datetime for analysis.
- 9 columns have a float data type.
- 1 column has an integer data type.
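The observations above are read off the `data.info()` output by eye. A small sketch of how the same checks could be made programmatically is given here; the two-row frame is a stand-in built from values in `data.head()`, so that the block runs without the CSV file, and the variable names are illustrative.

```python
import pandas as pd

# Stand-in frame using two rows and four columns from data.head().
sample = pd.DataFrame({
    "Date": ["2015-12-27", "2015-12-20"],
    "AveragePrice": [1.33, 1.35],
    "type": ["conventional", "conventional"],
    "year": [2015, 2015],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# Names of the numeric (float or integer) columns.
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

print(missing_per_column)
print(numeric_columns)
```

On the real dataset the same two calls would be applied to `data` instead of `sample`.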

In [8]:
data.describe()

,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,year
count,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0,18249.0
mean,1.405978,850644.0,293008.4,295154.6,22839.74,239639.2,182194.7,54338.09,3106.426507,2016.147899
std,0.402677,3453545.0,1264989.0,1204120.0,107464.1,986242.4,746178.5,243966.0,17692.894652,0.939938
min,0.44,84.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0
25%,1.1,10838.58,854.07,3008.78,0.0,5088.64,2849.42,127.47,0.0,2015.0
50%,1.37,107376.8,8645.3,29061.02,184.99,39743.83,26362.82,2647.71,0.0,2016.0
75%,1.66,432962.3,111020.2,150206.9,6243.42,110783.4,83337.67,22029.25,132.5,2017.0
max,3.25,62505650.0,22743620.0,20470570.0,2546439.0,19373130.0,13384590.0,5719097.0,551693.65,2018.0


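The statistical report shows the following:
1. The minimum price of an avocado is 0.44 and the maximum is 3.25.
2. The mean price of an avocado is 1.41 and the median is 1.37.
3. Because the mean and median are close, the price distribution is roughly symmetric (approximately normal, with a slight right skew).
4. The standard deviation of the price is 0.40.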
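The reading of the price distribution as close to normal can be sanity-checked from the summary statistics alone. The sketch below computes two simple skewness measures from the `AveragePrice` figures in `data.describe()`; it is a rough check under the assumption that these summary values are representative, not a formal normality test.

```python
# Summary statistics for AveragePrice, copied from data.describe().
mean, median, std = 1.405978, 1.37, 0.402677
q1, q3 = 1.10, 1.66

# Pearson's second skewness coefficient: 3 * (mean - median) / std
pearson_skew = 3 * (mean - median) / std

# Bowley (quartile) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)
bowley_skew = (q3 + q1 - 2 * median) / (q3 - q1)

print(f"Pearson skewness: {pearson_skew:.3f}")
print(f"Bowley skewness:  {bowley_skew:.3f}")
```

Both measures come out small and positive (roughly 0.27 and 0.04), which is consistent with a nearly symmetric distribution that leans slightly to the right.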

<a name = Section5></a>
## **5. Data Profiling & Pre-Processing**

In [9]:
#date
data['Date']

0        2015-12-27
1        2015-12-20
2        2015-12-13
3        2015-12-06
4        2015-11-29
            ...    
18244    2018-02-04
18245    2018-01-28
18246    2018-01-21
18247    2018-01-14
18248    2018-01-07
Name: Date, Length: 18249, dtype: object

In [10]:
# Date is stored as an object; convert it to datetime format

data['Date'] = pd.to_datetime(data['Date'])

In [11]:
data['Date'].dtype

dtype('<M8[ns]')

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18249 entries, 0 to 18248
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   Date          18249 non-null  datetime64[ns]
 1   AveragePrice  18249 non-null  float64       
 2   Total Volume  18249 non-null  float64       
 3   4046          18249 non-null  float64       
 4   4225          18249 non-null  float64       
 5   4770          18249 non-null  float64       
 6   Total Bags    18249 non-null  float64       
 7   Small Bags    18249 non-null  float64       
 8   Large Bags    18249 non-null  float64       
 9   XLarge Bags   18249 non-null  float64       
 10  type          18249 non-null  object        
 11  year          18249 non-null  int64         
 12  region        18249 non-null  object        
dtypes: datetime64[ns](1), float64(9), int64(1), object(2)
memory usage: 1.8+ MB
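Once `Date` is a datetime column, calendar features can be derived from it through the `.dt` accessor. The sketch below uses three dates taken from the output above; the `month` and `day_of_week` column names are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Three dates copied from the Date column shown above.
dates = pd.Series(["2015-12-27", "2015-12-20", "2018-01-07"], name="Date")

# An explicit format avoids ambiguity between day-first and month-first parsing.
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

features = pd.DataFrame({
    "Date": parsed,
    "month": parsed.dt.month,            # 1 = January ... 12 = December
    "day_of_week": parsed.dt.dayofweek,  # 0 = Monday ... 6 = Sunday
})

print(features)
```

All three sample dates fall on a Sunday, which fits the description of the data as weekly observations.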


In [13]:
from pandas_profiling import ProfileReport

In [14]:
profile_report = ProfileReport(data)
profile_report.to_file('avocado_report.html')

#### Observations:
1. The Avocado dataset has **18249 observations**.
2. The mean **AveragePrice** is **1.41 dollars**; the **minimum sale price was 0.44 dollars** and the **maximum sale price was 3.25 dollars**.
3. AveragePrice is approximately normally distributed.
4. **Total Volume** is **highly correlated** with the other volume features (4046, 4225, 4770, Large Bags, Small Bags).
5. **Total Volume** contains extreme values, with a maximum of **62505646** and a minimum as low as **84.56**.
6. **4046** is **highly correlated** with Total Volume and right skewed due to the presence of extreme values.
7. **4225** is **highly correlated** with Total Volume and right skewed due to the presence of extreme values.
8. **4770** is **highly correlated** with Total Volume and right skewed due to the presence of extreme values.
9. **Bags** (Total Bags, XLarge Bags, Large Bags, Small Bags) are highly correlated and right skewed.
10. There are two types of avocados in terms of growing method: **Conventional** and **Organic**.
11. **Region** has high cardinality (54 distinct values), and the observations are evenly distributed across regions.
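Right-skewed volume features with extreme values are often easier to work with on a logarithmic scale. The sketch below is only an illustration: it applies `np.log1p` to the five-number summary of `Total Volume` from `data.describe()` rather than to the full column, and compares the skewness before and after. Applying a log transform to the real features is a suggestion, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Min, quartiles and max of Total Volume, copied from data.describe().
volume_summary = pd.Series([84.56, 10838.58, 107376.8, 432962.3, 62505650.0])

# log1p(x) = log(1 + x); it is defined at 0, which matters for columns
# such as XLarge Bags whose minimum value is 0.
log_volume = np.log1p(volume_summary)

# Sample skewness before and after the transform.
raw_skew = volume_summary.skew()
log_skew = log_volume.skew()

print(f"skew before: {raw_skew:.2f}, skew after: {log_skew:.2f}")
```

On the full dataset the same comparison would use `data['Total Volume']` in place of `volume_summary`.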

<a name = Section6></a>
## **6. Exploratory Data Analysis**
- Objective:
    - Find the features best suited to our ML model.
    - Observe the relationship and association of the features with the price.
    - Ask the relevant questions and answer them.

Q.1 How has the average price of Hass avocados changed over time?

In [15]:
fig, ax = plt.subplots(figsize = (12, 8))

sns.lineplot(x = 'Date', y ='AveragePrice', data = data)

plt.title('Average Price 2015-2018')
plt.xlabel('')
plt.ylabel('Average Price')

plt.show()

Q.2 What is the total sale of avocados from 2015 to 2018?

In [16]:
fig, ax = plt.subplots(figsize = (12, 8))

sns.lineplot(x = 'Date', y ='Total Volume', data = data)

plt.title('Avocado Sale 2015-2018')
plt.xlabel('')
plt.ylabel('Volume Sold')

plt.show()

Q.3 In which month of the year are sales highest?

In [17]:
fig, ax = plt.subplots(figsize = (12, 8))

sns.lineplot(x = data['Date'].dt.month, y = data['Total Volume'], color = 'g')

plt.title('Avocado Sale 2015-2018')
plt.xlabel('Month')
plt.ylabel('Volume Sold')

plt.show()
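By default `sns.lineplot` draws the mean of `y` for each `x` value, so the plot above shows the mean volume per calendar month. The same aggregation can be computed as a table, which makes the peak month explicit. The sketch uses only the five Albany rows from `data.head()`, so its result says nothing about the full dataset; `monthly_volume` and `peak_month` are illustrative names.

```python
import pandas as pd

# Five rows copied from data.head(); they cover November and December 2015 only.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2015-12-27", "2015-12-20", "2015-12-13",
                            "2015-12-06", "2015-11-29"]),
    "Total Volume": [64236.62, 54876.98, 118220.22, 78992.15, 51039.60],
})

# Mean volume per calendar month, matching what the line plot aggregates.
monthly_volume = sample.groupby(sample["Date"].dt.month)["Total Volume"].mean()
monthly_volume.index.name = "month"

# Month number with the highest mean volume.
peak_month = monthly_volume.idxmax()

print(monthly_volume)
print("Peak month:", peak_month)
```

Replacing `sample` with `data` would give the month-by-month means over all regions and years.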

Q.4 What is the average sale of avocados in each region?

In [18]:
data['region'].unique()

array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)

In [30]:
# Mean weekly volume per region (the question asks for the average).
# The figure size must be passed as a keyword argument.
data.groupby('region')['Total Volume'].mean().plot(kind = 'bar', 
                                                   figsize = (12, 8))
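The region list above includes `TotalUS`, which by its name appears to be a national roll-up rather than a single market. If so, it would dominate any per-region bar chart. The sketch below shows one way to exclude such labels before aggregating; the volumes are made-up numbers for illustration only, and the choice of which labels count as aggregates is an assumption to check against the data.

```python
import pandas as pd

# Made-up volumes for illustration only; the region names come from the list above.
toy = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston", "Boston", "TotalUS", "TotalUS"],
    "Total Volume": [100.0, 300.0, 400.0, 600.0, 5000.0, 7000.0],
})

# Assumption: these labels are roll-ups of other rows, not individual markets.
aggregate_labels = ["TotalUS"]

# Keep only rows for individual markets.
markets = toy[~toy["region"].isin(aggregate_labels)]

# Mean volume per remaining region, largest first.
avg_volume = (markets.groupby("region")["Total Volume"]
                     .mean()
                     .sort_values(ascending=False))

print(avg_volume)
```

The resulting series can be plotted with `avg_volume.plot(kind='bar', figsize=(12, 8))` in the same way as the cell above.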