## Data exploration and cleaning


### First, I'll load the data and look at the first few entries
* To create the list of types I first ran genfromtxt with *dtype=None* and then manually assigned types and string sizes
* I'm ignoring the first column, as it seems to be an index for the week of the year in which the sample was observed. It should be redundant given that we have the date
* I can't see a reason to include the year field either, since it is also redundant with the date
* I imagine the relationship with the date will have a strong non-linear element, i.e. when are avocados in season? I'm going to add another categorical feature for month to try to capture this
* I gather that 4046, 4225, and 4770 are Hass avocado sizes
  * I was thinking about removing the total volume, but the sum of the volumes for the three types does not equal the total
    * I see the same characteristic with the bag size breakdown


In [120]:
import numpy as np

filepath = '/home/ben/Data/morsum/avocado.csv'
types = ["|S10", float, float, float, float, float, float, float, float, float, "|S20", "|S20"]
columnsToIgnore = [0,12]
rawData = np.genfromtxt(filepath, delimiter=',', usecols=np.setdiff1d(range(14), columnsToIgnore), dtype=types, names=True)
rawData[:3]

array([('2015-12-27', 1.33,  64236.62, 1036.74,  54454.85,  48.16, 8696.87, 8603.62,  93.25, 0., 'conventional', 'Albany'),
       ('2015-12-20', 1.35,  54876.98,  674.28,  44638.81,  58.33, 9505.56, 9408.07,  97.49, 0., 'conventional', 'Albany'),
       ('2015-12-13', 0.93, 118220.22,  794.7 , 109149.67, 130.5 , 8145.35, 8042.21, 103.14, 0., 'conventional', 'Albany')],
      dtype=[('Date', 'S10'), ('AveragePrice', '<f8'), ('Total_Volume', '<f8'), ('4046', '<f8'), ('4225', '<f8'), ('4770', '<f8'), ('Total_Bags', '<f8'), ('Small_Bags', '<f8'), ('Large_Bags', '<f8'), ('XLarge_Bags', '<f8'), ('type', 'S20'), ('region', 'S20')])
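
The mismatch between the totals and their breakdowns can be checked directly. This is a minimal sketch that uses only the three sample rows printed above, copied by hand, so it says nothing about the rest of the file. The variable names and the 0.01 tolerance are my own assumptions.

```python
import numpy as np

# The three sample rows shown in the output above, copied by hand.
# This is NOT the full dataset, only an illustration.
total_volume = np.array([64236.62, 54876.98, 118220.22])
plu_volumes = np.array([
    [1036.74, 54454.85, 48.16],    # columns: 4046, 4225, 4770
    [674.28, 44638.81, 58.33],
    [794.70, 109149.67, 130.50],
])
total_bags = np.array([8696.87, 9505.56, 8145.35])
bag_sizes = np.array([
    [8603.62, 93.25, 0.0],         # columns: Small, Large, XLarge
    [9408.07, 97.49, 0.0],
    [8042.21, 103.14, 0.0],
])

# Volume not accounted for by the three PLU size columns
volume_gap = total_volume - plu_volumes.sum(axis=1)
# Bags not accounted for by the three bag size columns
bags_gap = total_bags - bag_sizes.sum(axis=1)

tolerance = 0.01  # assumed rounding tolerance for values stored to 2 decimals
print(volume_gap)
print(np.abs(volume_gap - total_bags) < tolerance)
print(np.abs(bags_gap) < tolerance)
```

On these three rows the unexplained volume matches Total_Bags to within rounding, and the bag sizes do add up to Total_Bags, so any mismatch in the bag breakdown would have to come from other rows. Running the same check over the full columns would show whether that pattern holds everywhere.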

### Next I need to reformat some of the features for regression
* This mostly means converting categorical features into one-hot vectors and converting the date to epoch time
* I am performing this column by column but it may be more computationally efficient to go row by row
* To minimise memory use I'm deleting columns from the raw data once they are no longer needed
  * If memory were still an issue, I would need to look at not loading whole columns at once
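
For reference, the date-to-epoch step could look like this under Python 3. It is only a sketch: the function name is hypothetical, and I am assuming the dates should be treated as midnight UTC, which is what subtracting `datetime(1970,1,1)` does implicitly. Calling `timestamp()` on a naive datetime would use the local timezone instead, so UTC is attached explicitly.

```python
from datetime import datetime, timezone


def date_string_to_epoch_seconds(date_string):
    """Convert a 'YYYY-MM-DD' string to whole seconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(date_string, '%Y-%m-%d')
    # Attach UTC explicitly; timestamp() on a naive datetime assumes local time
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# In Python 3 an 'S10' field comes back as bytes, so it needs decoding first
raw_date = b'2015-12-27'
print(date_string_to_epoch_seconds(raw_date.decode('ascii')))
```

For `2015-12-27` this gives 1451174400, the same value as the first entry of the row printed at the end of this section.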

In [144]:
from datetime import datetime

# I would use datetime.timestamp if this was Python 3
# I would consider doing this in two steps (convert to datetime, then to epoch), or in a for loop
epochDT = datetime(1970,1,1)
reformattedData = np.array([((datetime.strptime(dataSample['Date'], '%Y-%m-%d')) - epochDT).total_seconds()
                       for dataSample in rawData]).astype(int)
monthColumn = np.array([dataSample['Date'].split('-')[1] for dataSample in rawData])

# If I were doing this in production code I would use a one-hot helper function from the likes of SKLearn
def convert_categorical_feature_to_onehot(categoricalFeatureVector):
    """Encodes a column of categorical features into one-hot vectors"""
    categoryValues, categoryIndices = np.unique(categoricalFeatureVector, return_inverse=True)
    # The np.eye function should generate a one-hot vector for each category
    categoryOneHotOptions = np.eye(len(categoryValues))
    # I use the indices from 'return_inverse=True' to determine which category each sample is
    return categoryOneHotOptions[categoryIndices]

onehotMonths = convert_categorical_feature_to_onehot(monthColumn) # Convert months to one-hot vectors
reformattedData = np.append(reformattedData.reshape(-1,1), onehotMonths, axis=1) # Append month features

#rawData = rawData[list(rawData.dtype.names[1:])] # Remove date from raw data

targetPrices = rawData['AveragePrice']
#rawData = rawData[list(rawData.dtype.names[1:])] # Remove prices from raw data

# Some features are fine to add as they are
simpleNumberFeatureNames = ['Total_Volume', '4046', '4225', '4770', 'Total_Bags','Small_Bags',
                            'Large_Bags', 'XLarge_Bags']
reformattedData = np.append(reformattedData, np.array(rawData[simpleNumberFeatureNames].tolist()), axis=1)
# Remove standard features from raw data
#rawData = rawData[list(rawData.dtype.names[len(simpleNumberFeatureNames):])] 



(18249, 13)
(18249, 8)
(18249, 21)


array([1.4511744e+09, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
       1.0000000e+00, 6.4236620e+04, 1.0367400e+03, 5.4454850e+04,
       4.8160000e+01, 8.6968700e+03, 8.6036200e+03, 9.3250000e+01,
       0.0000000e+00])
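
To make the column order of the one-hot block easier to read, here is the same `np.unique` plus `np.eye` idea run on a toy month column. The four values and the helper name are made up for illustration.

```python
import numpy as np


def one_hot(values):
    """Return the sorted categories and one one-hot row per input value."""
    # return_inverse gives, for each value, its index into the sorted categories
    categories, indices = np.unique(values, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i
    return categories, np.eye(len(categories))[indices]


# Toy month column (hypothetical values, not taken from the dataset)
months = np.array(['12', '12', '01', '11'])
categories, encoded = one_hot(months)
print(categories)
print(encoded)
```

Because `np.unique` sorts the categories, the zero-padded month strings end up in calendar order. With all twelve months present, December takes the twelfth slot, which is consistent with the row printed above, where the 1 sits in the last of the twelve month columns.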