# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. You may need to parse the date columns appropriately.

In [4]:
from __future__ import division
%matplotlib inline

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime


#### Load the data into a DataFrame

#### Drop NaN values

#### Transform the dates to datetime format


#### Rename columns to remove whitespaces


#### Set category columns to category types

#### Set prices columns to floats
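
The cleaning steps named in the headings above can be sketched end to end as follows. This is a minimal sketch, not the notebook's original code: the inline CSV is a tiny made-up sample standing in for the downloaded file, and the column names (`Date`, `Store Number`, `Category Name`, `State Bottle Cost`, `State Bottle Retail`, `Sale (Dollars)`) as well as the `$`-prefixed price strings are assumptions about the export format.

```python
import io
import pandas as pd

# Made-up sample standing in for the downloaded CSV (column names are assumed).
raw_csv = io.StringIO(
    "Date,Store Number,City,Category Name,State Bottle Cost,"
    "State Bottle Retail,Bottles Sold,Sale (Dollars)\n"
    "01/05/2015,2633,DES MOINES,VODKA 80 PROOF,$6.00,$9.00,12,$108.00\n"
    "02/10/2015,4829,DES MOINES,CANADIAN WHISKIES,$10.00,$15.00,6,$90.00\n"
    "03/03/2016,2633,DES MOINES,,$6.00,$9.00,3,$27.00\n"
)

# Load the data into a DataFrame.
df = pd.read_csv(raw_csv)

# Drop rows containing NaN values (the third sample row has no category).
df = df.dropna()

# Transform the dates to datetime format (month/day/year is assumed).
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Rename columns: replace spaces with underscores and drop parentheses.
df.columns = [
    c.replace(" ", "_").replace("(", "").replace(")", "") for c in df.columns
]

# Set category-like columns to the pandas category dtype.
for col in ["Store_Number", "City", "Category_Name"]:
    df[col] = df[col].astype("category")

# Set price columns to floats after stripping the assumed leading "$".
for col in ["State_Bottle_Cost", "State_Bottle_Retail", "Sale_Dollars"]:
    df[col] = df[col].str.replace("$", "", regex=False).astype(float)

print(df.dtypes)
```

With the real file, the `io.StringIO` object would be replaced by the path to the downloaded CSV.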

# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.
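
Most of the summaries listed in the headings that follow come down to a `groupby` followed by a sum, or to a count of unique values. The sketch below illustrates the pattern on a few made-up rows; the column names are assumptions about the cleaned DataFrame, not names confirmed by the dataset.

```python
import pandas as pd

# Made-up rows; column names are assumed, values are illustrative only.
df = pd.DataFrame({
    "Store_Number": [2633, 2633, 4829, 3000],
    "City": ["DES MOINES", "DES MOINES", "WINDSOR HEIGHTS", "AMES"],
    "Category_Name": ["VODKA 80 PROOF", "CANADIAN WHISKIES",
                      "VODKA 80 PROOF", "VODKA 80 PROOF"],
    "Bottle_Volume_ml": [750, 1750, 750, 375],
    "Bottles_Sold": [12, 6, 24, 3],
    "Sale_Dollars": [108.0, 90.0, 216.0, 20.0],
})

# Bottles sold per bottle volume, then the count for 750 mL bottles.
bottles_by_volume = df.groupby("Bottle_Volume_ml")["Bottles_Sold"].sum()
n_750 = bottles_by_volume.loc[750]

# Total sales in dollars per category and per store, in descending order.
sales_by_category = (
    df.groupby("Category_Name")["Sale_Dollars"].sum().sort_values(ascending=False)
)
sales_by_store = (
    df.groupby("Store_Number")["Sale_Dollars"].sum().sort_values(ascending=False)
)

# Number of distinct cities and stores.
n_cities = df["City"].nunique()
n_stores = df["Store_Number"].nunique()

print(sales_by_category)
print(n_750, n_cities, n_stores)
```

A histogram of any numeric column can then be drawn with `Series.hist()`, for example `df["Sale_Dollars"].hist()`.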

#### Explore volumes of the bottles sold


#### Number of 750mL bottles sold


#### Total sales in dollars per liquor category sorted in descending order

#### Total sales in dollars per vendors (company for the brand of liquor ordered) sorted in descending order


#### Total sales in dollars per store sorted in descending order


#### Number of cities listed in the dataset

#### Number of counties listed in the dataset


#### Number of stores listed in the dataset


#### Total sales in dollars per county sorted in descending order


#### Sales histogram


#### Bottle retail prices histogram


#### Relationship between bottle price as bought by the Alcoholic Beverages Division and bottle price sold to shops


## Record your findings

This dataset contains liquor sales from the Alcoholic Beverages Division to 1378 liquor stores in Iowa from 5th January 2015 to 3rd March 2016 in 386 cities.   
   
Total sales amount to 349,854,916 dollars and represent 24,173,278.5 liters (6,386,460.6 gallons) of liquor.   

Bottles sold last year ranged from 50 milliliters to 225 liters. They cost the Alcoholic Beverages Division between 0.89 and 6,468 dollars (9.82 dollars on average) and were sold to Iowa stores for between 1.34 and 9,702 dollars (14.74 dollars on average). The retail price is 50% higher than the state price.  
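
The 50% markup can be checked directly against the minimum, maximum and mean prices quoted above. The snippet is only a quick arithmetic check on those three pairs of figures, not a computation over the full dataset.

```python
# Prices quoted in the findings above, in dollars: minimum, maximum, mean.
state_costs = [0.89, 6468.00, 9.82]
retail_prices = [1.34, 9702.00, 14.74]

# Ratio of retail price to state cost for each pair.
markups = [round(retail / cost, 2)
           for cost, retail in zip(state_costs, retail_prices)]

# Each ratio should be close to 1.5, i.e. a 50% markup.
print(markups)
```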
   
1,227,979 bottles of 750mL were sold. Other frequent bottle volumes include 375mL (272,113 bottles sold), 500mL (121,004 bottles sold), 1.75L (541,448 bottles sold) and 1L (367,592 bottles sold).  
   
Top selling liquor categories are CANADIAN WHISKIES and VODKA 80 PROOF.  
Top selling vendor (company for brand of liquor sold) is number 260 with over $77m sales.
  
Biggest buyers are stores number 2633 and number 4829.

## Scenario 1 problem statement:

Based on reported sales from 2015 and the first quarter of 2016, what is the current state of liquor sales in Iowa? What are the expected liquor sale values in 2016? Which strategy should the Iowa State legislature consider in terms of liquor tax rates?

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to
compute total sales per store from Jan to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
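
As a rough illustration of the grouping described above, the sketch below derives year and month columns, filters the 2015 orders, and builds a per-store summary table. The rows are made up and the column names (`Store_Number`, `Sale_Dollars`, `Bottles_Sold`, `Volume_Sold_Liters`) are assumptions about the cleaned data.

```python
import pandas as pd

# Made-up orders; column names are assumed.
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-05", "2015-02-10", "2015-07-01",
                            "2015-03-15", "2016-01-20"]),
    "Store_Number": [2633, 2633, 2633, 4829, 2633],
    "Sale_Dollars": [100.0, 50.0, 200.0, 80.0, 60.0],
    "Bottles_Sold": [10, 5, 20, 4, 6],
    "Volume_Sold_Liters": [7.5, 3.75, 15.0, 3.0, 4.5],
})

# Year and month of each sale in separate columns.
sales["Year"] = sales["Date"].dt.year
sales["Month"] = sales["Date"].dt.month

# Orders in 2015.
orders_2015 = sales[sales["Year"] == 2015]

# Total sales per store per month: one row per store, one column per month.
monthly = orders_2015.pivot_table(
    index="Store_Number", columns="Month",
    values="Sale_Dollars", aggfunc="sum", fill_value=0,
)

# Yearly totals per store, plus the average price per bottle.
totals = orders_2015.groupby("Store_Number")[
    ["Sale_Dollars", "Bottles_Sold", "Volume_Sold_Liters"]
].sum()
totals["Avg_Price_Per_Bottle"] = totals["Sale_Dollars"] / totals["Bottles_Sold"]

print(monthly)
print(totals)
```

The same steps with `sales["Year"] == 2016` would give the corresponding 2016 table.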

#### Get year and month of each sale in a separate column


#### Orders in 2015

#### Total sales per store per month in 2015


#### Add volumes sold per store in 2015


#### Add bottles sold per store in 2015

#### Add average price per bottle


#### Orders in 2016

#### Total sales per store per month in 2016


#### Add volumes sold per store in 2016


#### Add bottles sold per store in 2016


#### Add average price per bottle


# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.

#### Total liquor sales in 2015



#### Total liquor sales from Jan 2015 to March 2016


#### Correlations between sales parameters in 2015


#### Correlation coefficients between sales parameters in 2015


Yearly sales are strongly correlated with the sales of January, February and March. They are also strongly correlated with number of bottles sold and total volumes sold.   
They are much less correlated with the average bottle price so we will not consider this parameter in our model.
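
Correlation coefficients of this kind can be obtained with `DataFrame.corr()`. The sketch below uses synthetic per-store figures, generated so that yearly sales depend on Q1 sales but not on average bottle price; the numbers and column names are illustrative assumptions, not values from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic per-store data (illustrative only).
rng = np.random.RandomState(0)
q1_sales = rng.uniform(1000, 50000, size=200)
yearly_sales = 4 * q1_sales + rng.normal(0, 2000, size=200)
avg_price = rng.uniform(8, 20, size=200)  # independent of sales by construction

stores = pd.DataFrame({
    "Q1_Sales": q1_sales,
    "Yearly_Sales": yearly_sales,
    "Avg_Price_Per_Bottle": avg_price,
})

# Pearson correlation matrix between all pairs of columns.
corr = stores.corr()
print(corr["Yearly_Sales"].sort_values(ascending=False))
```

Since `seaborn` is already imported in this notebook, `sns.heatmap(corr, annot=True)` is one way to display the matrix.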

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

### Model 1: Linear regression with sales, Bottles Sold and Volume Sold from January to March

In [None]:
from sklearn import linear_model
from sklearn.model_selection import train_test_split

#### Build linear model to predict sales based on data from 2015


#### Test model on test data


In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


The r^2 score is high, which suggests that the model is a good fit. However, the mean squared error is also very high, so we will try fitting a Lasso regression to the data instead to see if some parameters could be ignored in the model.
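
A minimal sketch of this comparison, assuming scikit-learn 0.18 or later (where `train_test_split` lives in `sklearn.model_selection`), is shown below. The data are synthetic stand-ins for the per-store Q1 features, and `alpha=1.0` is an arbitrary choice rather than the value used in the original analysis.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Synthetic per-store data: Q1 sales, bottles and volume predict yearly sales.
rng = np.random.RandomState(42)
n = 300
q1_sales = rng.uniform(1000, 50000, n)
q1_bottles = q1_sales / 12 + rng.normal(0, 50, n)
q1_volume = q1_bottles * 0.9 + rng.normal(0, 30, n)
X = np.column_stack([q1_sales, q1_bottles, q1_volume])
y = 4.2 * q1_sales + rng.normal(0, 3000, n)

# Hold out 30% of the stores as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit an ordinary linear regression and a Lasso on the same training data.
linear = LinearRegression().fit(X_train, y_train)
lasso = Lasso(alpha=1.0, max_iter=10000).fit(X_train, y_train)

# Score both models on the held-out stores.
results = {}
for name, model in [("linear", linear), ("lasso", lasso)]:
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
        "mse": mse,
        "rmse": mse ** 0.5,
    }
    print(name, results[name])

# Lasso coefficients shrunk to zero mark features that could be dropped.
print(lasso.coef_)
```

The mean squared error is expressed in squared dollars, so it is large whenever sales are large. Its square root (RMSE) or the mean absolute error is easier to compare with typical store sales.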

### Model 2: Linear model with Lasso regression 

#### Linear model with Lasso regression

### Cross Validation of model 1

#### Evaluate model fit with 5-fold cross-validation


In [None]:
from sklearn.model_selection import cross_val_score

### Cross Validation of model 2

#### Evaluate model fit with 5-fold cross-validation


In [None]:
from sklearn.model_selection import cross_val_score

Cross-validation r^2 scores are high, so our model seems to be a good fit.  
The Lasso regression did not improve the errors or r^2 scores, so we will use our initial linear model to predict sales values in 2016.
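
For reference, a 5-fold cross-validation of both models can be sketched as below, assuming `cross_val_score` from `sklearn.model_selection`. The data are synthetic and the Lasso `alpha` is an arbitrary assumption, so the scores only illustrate the procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic per-store data (illustrative only).
rng = np.random.RandomState(7)
n = 300
q1_sales = rng.uniform(1000, 50000, n)
q1_bottles = q1_sales / 12 + rng.normal(0, 50, n)
q1_volume = q1_bottles * 0.9 + rng.normal(0, 30, n)
X = np.column_stack([q1_sales, q1_bottles, q1_volume])
y = 4.2 * q1_sales + rng.normal(0, 3000, n)

# r^2 on each of the 5 folds, for each model.
scores_linear = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
scores_lasso = cross_val_score(
    Lasso(alpha=1.0, max_iter=10000), X, y, cv=5, scoring="r2"
)

print("linear:", scores_linear.mean(), scores_linear.std())
print("lasso: ", scores_lasso.mean(), scores_lasso.std())
```

Comparing the mean and spread of the fold scores is what supports choosing one model over the other.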

## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.

Based on this plot, the model seems to be performing well in predicting yearly sales values. We will therefore be able to predict 2016 sales based on sales, number of bottles sold and total volume sold in Q1 of 2016.
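
A predicted-versus-actual plot of this kind can be sketched as follows. The values are synthetic stand-ins for the test-set sales and the model's predictions, and the `Agg` backend is selected only so that the sketch also runs outside a notebook.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Synthetic true and predicted yearly sales per store (illustrative only).
rng = np.random.RandomState(1)
y_true = rng.uniform(5000, 200000, 100)
y_pred = y_true + rng.normal(0, 8000, 100)

fig, ax = plt.subplots(figsize=(6, 6))

# One point per store: actual sales on x, predicted sales on y.
ax.scatter(y_true, y_pred, alpha=0.6)

# Diagonal reference line: points on it are predicted perfectly.
lims = [0, max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, color="red", linestyle="--", label="perfect prediction")

ax.set_xlabel("Actual yearly sales (dollars)")
ax.set_ylabel("Predicted yearly sales (dollars)")
ax.set_title("Predicted vs. actual yearly sales per store")
ax.legend()
```

Points far from the diagonal are the stores for which the model is least reliable.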

### 2016 sales prediction

# Present the Results

Present your conclusions and results. If you have more than one interesting model feel free to include more than one along with a discussion. Use your work in this notebook to prepare your write-up.

Based on our model predicting 2016 sale values from sales, number of bottles sold and total volume sold so far in 2016 (Q1), liquor sales in Iowa are expected to decrease by 7.54% between 2015 and 2016.  
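
The percentage change is the difference between the two yearly totals divided by the 2015 total. As a worked example, the snippet below applies the predicted change to the 2015 total quoted in this notebook's write-up; the resulting 2016 figure is only the value implied by those two numbers, not an output of the original model run.

```python
# Figures quoted in this notebook's write-up.
total_2015 = 26655007.70     # total 2015 sales, in USD
expected_change_pct = -7.54  # predicted change between 2015 and 2016, in %


def apply_pct_change(value, pct):
    """Return value after a percentage change (negative pct is a decrease)."""
    return value * (1 + pct / 100.0)


def pct_change(old, new):
    """Return the percentage change from old to new."""
    return (new - old) / old * 100.0


# 2016 total implied by the quoted 2015 total and predicted change.
implied_2016 = apply_pct_change(total_2015, expected_change_pct)
print(round(implied_2016, 2))

# Round trip: recovers the predicted change from the two totals.
print(round(pct_change(total_2015, implied_2016), 2))
```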

# Write Up Summary

The Iowa State tax board has reviewed liquor sales across the state in 2015 and in the first quarter of 2016 to study their evolution.  
  
Based on our analysis, the total liquor sales in Iowa in 2015 amount to 26,655,007.70 USD.  
This amount is expected to decrease by 7.54% in 2016, based on quarter 1 figures.
  
In the next steps, we will consider the trend county by county in order to detect whether some counties are encountering specific difficulties in 2016. There are significant discrepancies between the predicted sales of individual stores, and we would like to verify whether these are linked to any geographical pattern.