# Getting started

Once you've chosen your scenario, download the data from [the Iowa website](https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy) in CSV format. Start by loading the data with pandas. You may need to parse the date column explicitly, since pandas reads it in as plain strings by default.

In [90]:
import pandas as pd

## Load the data into a DataFrame
df = pd.read_csv('assets/Iowa_Liquor_sales_sample_10pct.csv')
df.columns = [x.lower().replace(' ','_') for x in df.columns]
df.columns

Index([u'date', u'store_number', u'city', u'zip_code', u'county_number',
       u'county', u'category', u'category_name', u'vendor_number',
       u'item_number', u'item_description', u'bottle_volume_(ml)',
       u'state_bottle_cost', u'state_bottle_retail', u'bottles_sold',
       u'sale_(dollars)', u'volume_sold_(liters)', u'volume_sold_(gallons)'],
      dtype='object')

In [91]:
## Transform the dates if needed, e.g.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
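If you are not sure every row follows the `%m/%d/%Y` pattern, one option is to coerce unparseable values to `NaT` and count them before moving on. The sketch below uses a made-up three-value Series (not the real file) purely to show the behaviour; `raw_dates`, `parsed` and `n_bad` are illustrative names.

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately malformed.
raw_dates = pd.Series(['11/04/2015', '03/02/2016', 'not a date'])

# errors='coerce' turns anything that does not match the format into NaT
# instead of raising, so bad rows can be counted and inspected.
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')
n_bad = parsed.isna().sum()

print(parsed.dtype)
print('unparseable dates:', n_bad)
```

A non-zero count is a cue to look at those rows before grouping by year or month.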

# Explore the data

Perform some exploratory statistical analysis and make some plots, such as histograms of transaction totals, bottles sold, etc.

In [65]:
import seaborn as sns
import matplotlib.pyplot as plt
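As a starting point for the exploratory plots, here is a minimal sketch of two histograms drawn with matplotlib. The data are synthetic (randomly generated, not taken from the Iowa file) and the frame `toy` is a stand-in; with the real data you would pass `df['sale_convert']` and `df['bottles_sold']` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data: right-skewed sale amounts and small integer bottle counts.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'sale_convert': rng.lognormal(mean=4.0, sigma=1.0, size=500),
    'bottles_sold': rng.integers(1, 49, size=500),
})

# Summary statistics first, then one histogram per variable.
print(toy.describe())

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(toy['sale_convert'], bins=30)
axes[0].set_xlabel('transaction total (dollars)')
axes[0].set_ylabel('number of transactions')
axes[1].hist(toy['bottles_sold'], bins=30)
axes[1].set_xlabel('bottles sold')
fig.tight_layout()
```

Sale amounts tend to be strongly right-skewed, so it may be worth trying a log-scaled axis as well.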

#### Calculate the yearly liquor sales for each store

In [92]:
df.columns.get_loc('date')

0

In [93]:
# convert money from string to float in new column, inserted next to original
def convert_money(x):
    x = x.strip('$')
    return float(x)

df.insert(16, 'sale_convert', df['sale_(dollars)'].apply(convert_money))
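On a large frame, a vectorized conversion is usually faster than `apply` with a Python function. The following is a sketch on a made-up Series; the value with a thousands separator is an assumption added to show how to handle commas, not something observed in the sample output here.

```python
import pandas as pd

# Hypothetical dollar strings; '$1,234.50' is included only to illustrate comma handling.
sales = pd.Series(['$81.00', '$41.26', '$1,234.50'])

# Strip the literal '$' and ',' characters (regex=False treats them as plain text),
# then convert the remaining strings to floats in one pass.
cleaned = sales.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
sale_convert = pd.to_numeric(cleaned)

print(sale_convert)
```

The same pattern should apply to the other dollar-formatted columns, such as `state_bottle_cost` and `state_bottle_retail`.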

In [112]:
# create new column with just the year; the guard keeps a re-run of this cell from raising
# "ValueError: cannot insert year, already exists"
if 'year' not in df.columns:
    df.insert(1, 'year', df['date'].dt.year)

In [114]:
df.insert(2, 'month',  df['date'].dt.month)

In [118]:
# create new dataframe with 2016 data removed
d_2015 = df.loc[df['year'] == 2015]

In [139]:
df.head(2)

| | date | year | month | store_number | city | zip_code | county_number | county | category | category_name | ... | item_number | item_description | bottle_volume_(ml) | state_bottle_cost | state_bottle_retail | bottles_sold | sale_(dollars) | sale_convert | volume_sold_(liters) | volume_sold_(gallons) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-11-04 | 2015 | 11 | 3717 | SUMNER | 50674 | 9.0 | Bremer | 1051100.0 | APRICOT BRANDIES | ... | 54436 | Mr. Boston Apricot Brandy | 750 | $4.50 | $6.75 | 12 | $81.00 | 81.0 | 9.0 | 2.38 |
| 1 | 2016-03-02 | 2016 | 3 | 2614 | DAVENPORT | 52807 | 82.0 | Scott | 1011100.0 | BLENDED WHISKIES | ... | 27605 | Tin Cup | 750 | $13.75 | $20.63 | 2 | $41.26 | 41.26 | 1.5 | 0.4 |


In [142]:
# create new dataframe for 2016 of just first three months
d_2016 = df[(df['year'] == 2016) & (df['month'] <= 3)]

In [145]:
# total 2015 sales for each store (a Series indexed by year and store_number)
store_sales_2015 = d_2015.groupby(['year','store_number'])['sale_convert'].sum()

In [146]:
# total 2016 sales (first 3 months only) for each store (a Series indexed by year and store_number)
store_sales_2016 = d_2016.groupby(['year','store_number'])['sale_convert'].sum()

#### Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of each store. 
- You must use the sales from Jan to March per store as one of your variables.
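One way to set this up is to build a per-store table with the January to March total as a feature and the full-year total as the target. The sketch below uses a tiny hand-made frame (`toy`) with invented numbers; the column names `q1_sales` and `total_sales` are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Invented transactions for two stores; the 2016 row should be excluded.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-15', '2015-02-20', '2015-07-04',
                            '2015-03-01', '2015-11-30', '2016-01-10']),
    'store_number': [1, 1, 1, 2, 2, 2],
    'sale_convert': [100.0, 50.0, 200.0, 80.0, 120.0, 999.0],
})

is_2015 = toy['date'].dt.year == 2015
is_q1 = toy['date'].dt.month <= 3

# Feature: Jan to March 2015 sales per store. Target: all 2015 sales per store.
q1_2015 = toy[is_2015 & is_q1].groupby('store_number')['sale_convert'].sum()
total_2015 = toy[is_2015].groupby('store_number')['sale_convert'].sum()

# Align on store_number; a store with no Q1 transactions gets 0 rather than NaN.
model_data = pd.concat(
    [q1_2015.rename('q1_sales'), total_2015.rename('total_sales')], axis=1
).fillna({'q1_sales': 0.0})

print(model_data)
```

Each row of `model_data` is then one store, which is the unit the yearly prediction is made for.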

In [151]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline


## Record your findings

Be sure to write out any observations from your exploratory analysis.

# Mine the data
Now you are ready to compute the variables you will use for your regression from the data. For example, you may want to compute total sales per store from Jan to March of 2015, mean price per bottle, etc. Refer to the readme for more ideas appropriate to your scenario.

Pandas is your friend for this task. Take a look at the operations [here](http://pandas.pydata.org/pandas-docs/stable/groupby.html) for ideas on how to make the best use of pandas and feel free to search for blog and Stack Overflow posts to help you group data by certain variables and compute sums, means, etc. You may find it useful to create a new data frame to house this summary data.
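As an illustration of the kind of summary frame described above, here is a sketch using `groupby` with named aggregation (available in pandas 0.25 and later). The input frame `toy` is made up, and it assumes the dollar columns have already been converted to floats; `total_sales`, `total_bottles` and `mean_bottle_price` are hypothetical output names.

```python
import pandas as pd

# Made-up transactions; prices are assumed to be numeric already.
toy = pd.DataFrame({
    'store_number': [1, 1, 2, 2],
    'state_bottle_retail': [6.75, 20.63, 10.00, 12.00],
    'bottles_sold': [12, 2, 6, 4],
    'sale_convert': [81.00, 41.26, 60.00, 48.00],
})

# One row per store, one column per summary statistic.
summary = toy.groupby('store_number').agg(
    total_sales=('sale_convert', 'sum'),
    total_bottles=('bottles_sold', 'sum'),
    mean_bottle_price=('state_bottle_retail', 'mean'),
).reset_index()

print(summary)
```

Calling `reset_index()` turns `store_number` back into an ordinary column, which makes later merges simpler.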

# Refine the data
Look for any statistical relationships, correlations, or other relevant properties of the dataset.
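A quick first check is the pairwise Pearson correlation between candidate features and the target. The sketch below runs `DataFrame.corr()` on a small invented per-store table; the numbers carry no meaning beyond showing the call.

```python
import pandas as pd

# Invented per-store summary values for illustration only.
features = pd.DataFrame({
    'q1_sales': [150.0, 80.0, 300.0, 40.0, 220.0],
    'total_bottles': [120, 70, 260, 30, 150],
    'total_sales': [610.0, 350.0, 1150.0, 190.0, 900.0],
})

# Pearson correlation matrix; the 'total_sales' column shows how strongly
# each variable moves with the target.
corr = features.corr()
print(corr['total_sales'].sort_values(ascending=False))
```

Strongly correlated predictors can make individual coefficients hard to interpret, so it is worth checking the correlations among the features too, not only with the target.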

# Build your models

Using scikit-learn or statsmodels, build the necessary models for your scenario. Evaluate model fit.

In [6]:
from sklearn import linear_model
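For reference, a minimal sketch of fitting and evaluating a one-feature linear model with scikit-learn follows. The data are simulated (yearly sales generated as roughly four times Q1 sales plus noise), so the slope and score describe the simulation only, not the Iowa data. Holding out a test set is one reasonable way to evaluate fit; cross-validation is another.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated stores: the "4.0" slope and noise level are arbitrary assumptions.
rng = np.random.default_rng(0)
q1_sales = rng.uniform(1000, 50000, size=200)
total_sales = 4.0 * q1_sales + rng.normal(0, 5000, size=200)

X = q1_sales.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = total_sales

# Fit on 75% of the stores and score on the held-out 25%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
r2_test = model.score(X_test, y_test)

print('slope:', model.coef_[0])
print('intercept:', model.intercept_)
print('test R^2:', r2_test)
```

With the real data, `X` would hold the per-store features you computed (including Q1 sales) and `y` the 2015 yearly totals.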


## Plot your results

Again, make sure that you record any valuable information. For example, in the tax scenario, did you find the sales from the first three months of the year to be a good predictor of the total sales for the year? Plot the predictions versus the true values and discuss the successes and limitations of your models.
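A predicted-versus-true scatter plot with a diagonal reference line is a common way to show this. The sketch below uses two short invented arrays in place of real model output; with your own model, `y_true` would be the actual yearly sales and `y_pred` the result of `model.predict(...)`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented values standing in for true and predicted yearly sales.
y_true = np.array([190.0, 350.0, 610.0, 900.0, 1150.0])
y_pred = np.array([210.0, 330.0, 640.0, 860.0, 1180.0])

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_true, y_pred)

# Points on the dashed diagonal are perfect predictions.
lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
ax.plot(lims, lims, linestyle='--', color='gray')
ax.set_xlabel('true yearly sales')
ax.set_ylabel('predicted yearly sales')

# Root-mean-squared error, in the same units as the target.
residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))
print('RMSE:', rmse)
```

Systematic drift away from the diagonal at the high or low end is one of the limitations worth discussing.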

# Present the results

Present your conclusions and results. If you have more than one interesting model, feel free to include each of them along with a discussion. Use your work in this notebook to prepare your write-up.