# Avocado price forecasting

The dataset, which contains retail scan data for avocado sales, was downloaded from the Hass Avocado Board website in May 2018 and compiled into a single CSV. The data can easily be retrieved from the [Kaggle page](https://www.kaggle.com/neuromusic/avocado-prices). Let's start by loading the dataset so we can explore the data inside!

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('./data/avocado.csv')
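
By default, read_csv keeps the Date column as plain strings. As a minimal sketch, and not necessarily how this notebook proceeds, the dates can be parsed while loading. The tiny in-memory CSV below is made-up sample data standing in for `./data/avocado.csv`, and the name `sample` is only illustrative.

```python
import io
import pandas as pd

# Toy stand-in for ./data/avocado.csv (values are illustrative only)
csv_text = """,Date,AveragePrice,type,region
0,2015-12-27,1.33,conventional,Albany
1,2015-12-20,1.35,conventional,Albany
"""

# index_col=0 uses the unnamed first column as the index,
# parse_dates converts the Date strings into datetime values
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=['Date'])
print(sample.dtypes)
```

Parsing the dates up front makes later time-based operations, such as sorting or resampling by week, work without an extra conversion step.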

Let's check the size of our dataset first:

In [4]:
print('Training set shape: ', df.shape)

Training set shape:  (18249, 14)


Let's start by taking a peek at our dataset through the head function to see what kind of variables are present.

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


Let's drop the redundant Unnamed: 0 column, which only duplicates the row index:

In [15]:
df = df.drop(['Unnamed: 0'], axis=1)

It's quick to check that we have no missing entries in our dataset. Kaggle likes to make things easier for us, but real datasets aren't that kind.

In [21]:
df.isna().sum()

Date            0
AveragePrice    0
Total Volume    0
4046            0
4225            0
4770            0
Total Bags      0
Small Bags      0
Large Bags      0
XLarge Bags     0
type            0
year            0
region          0
dtype: int64
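
When a dataset does contain gaps, the share of missing values per column is often more telling than the raw count. The following is a small sketch on made-up toy data (the frame `toy` and its values are assumptions for illustration, not part of the avocado dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing entries (made-up values)
toy = pd.DataFrame({
    'AveragePrice': [1.33, np.nan, 0.93, 1.08],
    'region': ['Albany', 'Albany', None, 'Albany'],
})

# isna() gives a boolean mask; the column-wise mean is the missing fraction
missing_share = toy.isna().mean()
print(missing_share)
```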

Our data has two categorical features, type and region. We can quickly check which categories each one contains and how many entries each category has:

In [28]:
df['type'].value_counts()

conventional    9125
organic         9123
Name: type, dtype: int64
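
To read the split as proportions rather than raw counts, value_counts accepts normalize=True. A minimal sketch on a made-up toy column (in the notebook the same call would be applied to df['type']):

```python
import pandas as pd

# Toy column with made-up values standing in for df['type']
types = pd.Series(
    ['conventional', 'organic', 'conventional', 'organic', 'organic'],
    name='type',
)

# normalize=True returns fractions of the total instead of counts
shares = types.value_counts(normalize=True)
print(shares)
```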

We notice that the two types are almost perfectly balanced: 9125 entries are conventional and 9123 are organic. Let's check the regions:

In [36]:
df['region'].value_counts()

NorthernNewEngland     338
Houston                338
Roanoke                338
Jacksonville           338
Chicago                338
Southeast              338
Portland               338
Philadelphia           338
RaleighGreensboro      338
SouthCarolina          338
Boston                 338
Plains                 338
Boise                  338
Midsouth               338
GrandRapids            338
Detroit                338
Denver                 338
Tampa                  338
Orlando                338
GreatLakes             338
LosAngeles             338
Pittsburgh             338
SouthCentral           338
NewYork                338
TotalUS                338
Atlanta                338
SanDiego               338
Spokane                338
HarrisburgScranton     338
PhoenixTucson          338
Columbus               338
California             338
BuffaloRochester       338
Syracuse               338
Sacramento             338
RichmondNorfolk        338
Indianapolis           338
M

In [39]:
print('The total number of region is ' + str(df['region'].nunique()))

The total number of region is 54


We have a grand total of 54 different regions, each with about 338 entries. All of them are in the USA, but they mix different levels of granularity: cities such as Albany, broader areas such as Southeast or California, and the TotalUS aggregate.
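
To find out which regions have fewer entries than the rest, the counts can be filtered against their maximum. This is a sketch on made-up toy labels (the series `regions` is an assumption for illustration; in the notebook it would be df['region']):

```python
import pandas as pd

# Toy labels with made-up values standing in for df['region']
regions = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], name='region')

counts = regions.value_counts()
# Keep only the regions with fewer rows than the most frequent one
short = counts[counts < counts.max()]
print(short)
```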

For the numerical features, we can quickly take a look at the summary statistics through the describe function.

In [41]:
df.describe()

Unnamed: 0,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,year
count,18248.0,18248.0,18248.0,18248.0,18248.0,18248.0,18248.0,18248.0,18248.0,18248.0
mean,1.405983,850687.1,293024.4,295167.8,22840.98,239651.9,182204.2,54341.06,3106.596741,2016.147961
std,0.402687,3453635.0,1265022.0,1204152.0,107466.9,986267.9,746197.9,243972.3,17693.364516,0.939926
min,0.44,84.56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2015.0
25%,1.1,10837.88,853.905,3008.552,0.0,5087.33,2848.935,127.8,0.0,2015.0
50%,1.37,107404.0,8646.205,29058.88,184.995,39752.49,26369.42,2648.225,0.0,2016.0
75%,1.66,433009.8,111028.9,150220.4,6243.62,110784.9,83337.73,22031.51,132.6675,2017.0
max,3.25,62505650.0,22743620.0,20470570.0,2546439.0,19373130.0,13384590.0,5719097.0,551693.65,2018.0
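
By default, describe only summarizes the numeric columns. Calling it on the categorical columns alone gives a different summary: count, number of unique values, most frequent value, and its frequency. A minimal sketch on made-up toy data (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Toy frame with made-up values mirroring a few avocado columns
toy = pd.DataFrame({
    'type': ['conventional', 'organic', 'organic', 'organic'],
    'region': ['Albany', 'Albany', 'Boston', 'Albany'],
    'AveragePrice': [1.0, 1.5, 1.2, 1.4],
})

# Selecting only the text columns yields count / unique / top / freq
cat_summary = toy[['type', 'region']].describe()
print(cat_summary)
```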


Let's create a temporary dataset we'll use to experiment. 

In [23]:
fakedf = df.copy()
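
The call to copy matters here: a plain assignment would only create a second name for the same DataFrame, so experiments would silently change the original. A small sketch with made-up values (the names `original`, `alias`, and `scratch` are illustrative only):

```python
import pandas as pd

original = pd.DataFrame({'AveragePrice': [1.33, 1.35]})  # made-up values

alias = original            # same object under another name
scratch = original.copy()   # independent copy (deep by default)

# Modifying the copy leaves the original untouched
scratch.loc[0, 'AveragePrice'] = 99.0
print(original.loc[0, 'AveragePrice'])
```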

## Data visualization

It's finally time to start analyzing our data.