# Avocado Prices and Sales Volume 2015-2023

In [None]:
from IPython.display import Image
Image(url='https://www.buygrow.co.za/cdn/shop/products/FuerteAvocadoWB.png?v=1675280423')


# Table of Contents
- [1. Project Overview](#1-project-overview)
  - [1.1 Introduction](#11-introduction)
  - [1.2 Problem Statement](#12-problem-statement)
  - [1.3 Objectives](#13-objectives)
- [2. Importing Packages](#2-importing-packages)
- [3. Loading Data](#3-loading-data)
- [4. Data Cleaning](#4-data-cleaning)
- [5. Exploratory Data Analysis (EDA)](#5-exploratory-data-analysis-eda)
- [6. Conclusion](#6-conclusion)

### 1. Project Overview

##### 1.1 Introduction

This project focuses on data cleaning and exploratory data analysis (EDA) of historical avocado prices. The primary goal is to clean the dataset and explore trends, seasonal patterns, and potential factors affecting avocado prices, setting a foundation for future predictive modeling.

##### 1.2 Problem Statement

##### 1.3 Objectives

- Data Cleaning: Ensure data quality by handling missing values, correcting data types and adding new columns.

- Exploratory Data Analysis (EDA): Identify trends, seasonal variations, and key differences in avocado prices by region and type (conventional vs. organic).

### 2. Importing Packages

To carry out data cleaning, manipulation, and visualization, we’ll use the following Python libraries:

* pandas: Provides data structures and functions needed to efficiently clean and manipulate the dataset.

* numpy: Adds support for numerical operations, including handling arrays and mathematical functions for outlier treatment.

* matplotlib and seaborn: Libraries for data visualization. matplotlib is the core plotting library, while seaborn builds on it to provide higher-level statistical plots with more polished default styling.

In [2]:
# Libraries for data loading, manipulation and analysis

import numpy as np
import pandas as pd
import csv
import seaborn as sns
import matplotlib.pyplot as plt

# Displays output inline
%matplotlib inline

# Suppress warning messages to keep the output readable
import warnings
warnings.filterwarnings('ignore')

### 3. Loading Data

The data for this project is stored in the file Avocado_HassAvocadoBoard_20152023v1.0.1.csv. To make it easier to manipulate and analyse, the file is loaded into a pandas DataFrame with the pandas function .read_csv().

In [3]:
# loading dataset
df = pd.read_csv("Avocado_HassAvocadoBoard_20152023v1.0.1.csv", index_col=False)

Check the DataFrame to see if it loaded correctly.

In [4]:
df.head() 

| | Date | AveragePrice | TotalVolume | plu4046 | plu4225 | plu4770 | TotalBags | SmallBags | LargeBags | XLargeBags | type | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-04 | 1.22 | 40873.28 | 2819.5 | 28287.42 | 49.9 | 9716.46 | 9186.93 | 529.53 | 0.0 | conventional | Albany |
| 1 | 2015-01-04 | 1.79 | 1373.95 | 57.42 | 153.88 | 0.0 | 1162.65 | 1162.65 | 0.0 | 0.0 | organic | Albany |
| 2 | 2015-01-04 | 1.0 | 435021.49 | 364302.39 | 23821.16 | 82.15 | 46815.79 | 16707.15 | 30108.64 | 0.0 | conventional | Atlanta |
| 3 | 2015-01-04 | 1.76 | 3846.69 | 1500.15 | 938.35 | 0.0 | 1408.19 | 1071.35 | 336.84 | 0.0 | organic | Atlanta |
| 4 | 2015-01-04 | 1.08 | 788025.06 | 53987.31 | 552906.04 | 39995.03 | 141136.68 | 137146.07 | 3990.61 | 0.0 | conventional | BaltimoreWashington |


In [8]:
# Displays the number of rows and columns
df.shape

(53415, 12)

Result: The dataset contains 53,415 rows (observations) and 12 columns (features).

In [26]:
# Display summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53415 entries, 0 to 53414
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          53415 non-null  object 
 1   AveragePrice  53415 non-null  float64
 2   TotalVolume   53415 non-null  float64
 3   plu4046       53415 non-null  float64
 4   plu4225       53415 non-null  float64
 5   plu4770       53415 non-null  float64
 6   TotalBags     53415 non-null  float64
 7   SmallBags     41025 non-null  float64
 8   LargeBags     41025 non-null  float64
 9   XLargeBags    41025 non-null  float64
 10  type          53415 non-null  object 
 11  region        53415 non-null  object 
dtypes: float64(9), object(3)
memory usage: 4.9+ MB
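
The summary above shows `Date` stored as `object` (plain strings), which blocks date-based grouping and resampling. Below is a minimal sketch of one possible conversion, assuming the dates are ISO-formatted as in the `df.head()` output. The small `sample` frame is a hypothetical stand-in built from rows shown earlier, and the `Year` and `Month` column names are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df, built from rows shown in the df.head() output
sample = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-04", "2015-01-04"],
    "AveragePrice": [1.22, 1.79, 1.00],
    "type": ["conventional", "organic", "conventional"],
    "region": ["Albany", "Albany", "Atlanta"],
})

# Parse the date strings into datetime64 values (assumes YYYY-MM-DD format)
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d")

# Store the two-valued 'type' column as a categorical
sample["type"] = sample["type"].astype("category")

# Illustrative derived columns for later seasonal analysis
sample["Year"] = sample["Date"].dt.year
sample["Month"] = sample["Date"].dt.month

print(sample.dtypes)
```

The same calls would apply to the full `df`; whether `region` should also become a categorical is a judgment call left open here.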


In [28]:
df.describe()

| | AveragePrice | TotalVolume | plu4046 | plu4225 | plu4770 | TotalBags | SmallBags | LargeBags | XLargeBags |
|---|---|---|---|---|---|---|---|---|---|
| count | 53415.0 | 53415.0 | 53415.0 | 53415.0 | 53415.0 | 53415.0 | 41025.0 | 41025.0 | 41025.0 |
| mean | 1.42891 | 869447.4 | 298270.7 | 222217.0 | 20531.95 | 217508.3 | 103922.2 | 23313.16 | 2731.811796 |
| std | 0.393116 | 3545274.0 | 1307669.0 | 955462.4 | 104097.7 | 867694.7 | 569260.8 | 149662.2 | 22589.096454 |
| min | 0.44 | 84.56 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.119091 | 16264.65 | 694.725 | 2120.8 | 0.0 | 7846.52 | 0.0 | 0.0 | 0.0 |
| 50% | 1.4 | 120352.5 | 14580.58 | 17516.63 | 90.05 | 36953.1 | 694.58 | 0.0 | 0.0 |
| 75% | 1.69 | 454238.0 | 128792.4 | 93515.6 | 3599.735 | 111014.6 | 37952.98 | 2814.92 | 0.0 |
| max | 3.44083 | 61034460.0 | 25447200.0 | 20470570.0 | 2860025.0 | 16298300.0 | 12567160.0 | 4324231.0 | 679586.8 |


### 4. Data Cleaning

Handling Missing Values
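
The `df.info()` output reports 41,025 non-null values out of 53,415 for `SmallBags`, `LargeBags`, and `XLargeBags`, so each of these columns is missing 12,390 entries (roughly 23%). A minimal sketch of how the gaps could be inspected before choosing a treatment is given below. The `sample` frame is a hypothetical stand-in: its values come from the `df.head()` output, and the missing entries in the last two rows are inserted purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: values taken from the df.head() output, with the
# bag-size breakdown blanked out in the last two rows purely for illustration
sample = pd.DataFrame({
    "TotalBags": [9716.46, 1162.65, 46815.79, 1408.19],
    "SmallBags": [9186.93, 1162.65, np.nan, np.nan],
    "LargeBags": [529.53, 0.0, np.nan, np.nan],
    "XLargeBags": [0.0, 0.0, np.nan, np.nan],
})

# Count and percentage of missing values per column
missing_count = sample.isna().sum()
missing_share = sample.isna().mean() * 100

report = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(report[report["missing"] > 0])

# Check whether the three bag-size columns are missing on the same rows
bag_cols = ["SmallBags", "LargeBags", "XLargeBags"]
rows_all_missing = sample[bag_cols].isna().all(axis=1).sum()
print(rows_all_missing)
```

If the three columns turn out to be missing on exactly the same rows of the real data, that suggests the bag-size breakdown was simply not recorded for those observations. This matters when deciding between dropping the columns, leaving the gaps, or imputing.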

### 5. Exploratory Data Analysis (EDA)
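
One possible starting point for the comparison named in the objectives (conventional vs. organic prices) is a grouped summary. The sketch below assumes only the `AveragePrice`, `type`, and `region` columns. The `sample` frame is a hypothetical stand-in holding the five rows from the `df.head()` output, so its numbers say nothing about the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for df: the five rows shown in the df.head() output
sample = pd.DataFrame({
    "AveragePrice": [1.22, 1.79, 1.00, 1.76, 1.08],
    "type": ["conventional", "organic", "conventional", "organic", "conventional"],
    "region": ["Albany", "Albany", "Atlanta", "Atlanta", "BaltimoreWashington"],
})

# Summarise the average price for each avocado type
price_by_type = sample.groupby("type")["AveragePrice"].agg(["mean", "median", "count"])
print(price_by_type)
```

Replacing `"type"` with `"region"`, or grouping by both, would give the regional breakdown in the same way.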

### 6. Conclusion