# Project 1 - Iowa Liquor 

You are a data scientist in residence at the Iowa State tax board. The Iowa State legislature is considering changes in the liquor tax rates and wants a report of current liquor sales by county and projections for the rest of the year. 

Your task is as follows:

* Calculate the yearly liquor sales for each store using the provided data. You can add up the transactions for each year, and store sales in 2015 specifically will be used later as your target variable.
* Use the data from 2015 to make a linear model using as many variables as you find useful to predict the yearly sales of all stores. You must use the sales from Jan to March as one of your variables.
* Use your model for 2015 to estimate total sales in 2016, extrapolating from the sales so far for Jan-March of 2016.
* Report your findings, including any projected increase or decrease in total sales (over the entire state) for the tax committee of the Iowa legislature.
* Use cross-validation to check how well your model predicts held-out data, compared to the model metrics on the full dataset.
* Fit your model(s) using one or both of the regularization tactics covered. Explain whether the regularized or the non-regularized model performed better and what the selected regression(s) are doing.



# Part 2

### Feature Engineering, Model Building, and Tuning

In Part 2 of this two-part project, you will use the insights gained from your Exploratory Data Analysis (EDA) to build a linear regression model predicting end-of-year total sales using Q1 data. You will use 2015 data to train and tune your model, then make final predictions using Q1 2016 data to make your best estimates for end of year 2016!

### Requirements:


**Mine the data**
- Create necessary derived columns from the data
- Format, clean, slice, and combine the data in Python

**Build a data model**
- Complete linear regressions using scikit-learn or statsmodels and interpret your findings
- Calculate and plot predicted probabilities and/or present tables of results
- Describe the bias-variance tradeoff of your model and error metrics
- Evaluate model fit by using loss functions, including mean absolute error, mean squared error, and root mean squared error, or r-squared

**Present the results**
- Create a Jupyter Notebook hosted on GitHub that provides a dataset overview with visualizations, statistical analysis, data cleaning methodologies, and models
- Create a write-up on the interpretation of findings including an executive summary with conclusions and next steps

***Bonus!:***
- Handle outliers, use regularization (Ridge & Lasso regressions)
- Brainstorm ways to improve your analysis; for example:
 - Add additional breakdowns and models, e.g. by month.
 - Recommend additional data that might improve your models
 - Can you think of other uses for the dataset? e.g. healthcare / disease estimates

In [None]:
import os

os.path.isfile('../Assets/Iowa_Liquor_sample.csv') 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Import and Clean data:

This time, we've cleaned the data set and column names for you; however, we have not touched missing values.

In [None]:
# Import, convert 'Date' col to datetime
liquor = pd.read_csv('../Assets/Iowa_Liquor_sample.csv', parse_dates=['Date'])

In [None]:
# format column names
import re

liquor.columns = [re.sub("[^a-zA-Z]+", "", x) for x in liquor.columns]

In [None]:
# remove '$' in values and convert to numeric
adjust_cols = ['StateBottleCost','StateBottleRetail','SaleDollars']

for col in adjust_cols:
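    # regex=False treats '$' as a literal character rather than the regex end-of-string anchor
    liquor[col] = pd.to_numeric(liquor[col].str.replace('$','',regex=False),errors='coerce')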

### Null Values

Handle null values as you see fit

In [None]:
# liquor = liquor.dropna()
# liquor.isnull().sum()

## Split Data to Create Features and Targets

The goal of this project is to predict **total year-end 2015 sales for each store** using **first-quarter 2015 data**

Our data is currently formatted as total purchases for each product per day per store for every day in the year. We will need to group our data by store when we perform our aggregations.

In order to accomplish our goal, we need two sets of data:
* Total full-year  sales for each store in 2015 (our target / y)
* Data from Q1 2015 (will become our features / X)

Create two dataframes, 'liquor2015_fy' and 'liquor2015_q1'

'liquor2015_fy' should contain only store numbers and the full year sales for that store

'liquor2015_q1' should contain all your features, but only for Q1
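
The filtering and grouping steps described above can be sketched on a small toy table. This is only an illustration, not the project solution: the column names `StoreNumber` and `SaleDollars` are assumed to match the cleaned data, and the toy values are made up.

```python
import pandas as pd

# Toy transactions (made-up values; column names are assumptions)
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2015-01-15", "2015-02-20", "2015-07-04", "2015-03-01", "2016-01-10"]
    ),
    "StoreNumber": [1, 1, 1, 2, 2],
    "SaleDollars": [100.0, 50.0, 25.0, 40.0, 999.0],
})

# Keep only 2015 rows using the .dt accessor on the datetime column
toy2015 = toy[toy["Date"].dt.year == 2015]

# Full-year target: one row per store with its total sales
toy_fy = toy2015.groupby("StoreNumber", as_index=False)["SaleDollars"].sum()

# Q1 rows only (January through March)
toy_q1 = toy2015[toy2015["Date"].dt.quarter == 1]

print(toy_fy)
print(toy_q1)
```

Filtering on `.dt.month <= 3` would select the same Q1 rows; `.dt.quarter` simply reads more directly.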


In [None]:
# Filter to only 2015:
# hint: liquor.Date.dt._______

liquor2015 = 

In [None]:
# Calculate the sum of sales for each store in 2015 by grouping the full year data
# hint: what columns do you need? what is your aggregating function? 
liquor2015_fy =

In [None]:
# Filter to just Q1 data: 
# hint: df[df.Date.dt.___ == __]

# Feature Engineering

Using the insight you gained into your dataset while performing *exploratory data analysis* in Part 1 of the project, aggregate the liquor2015_q1 data frame to create cross-sectional features from our longitudinal data.


[Aggregation functions in pandas](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html)

In addition to aggregation, you may choose to create columns for more advanced measures of the data, such as sales for a particular product or category, measures of profitability, daily or weekly sales statistics, etc.

Combine your aggregations and other engineered features into a dataframe called 'liquor2015_q1_features'

*At a minimum, you will need to aggregate your features by Store in order to proceed*
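
As a rough sketch of per-store aggregation, pandas named aggregation (available in pandas 0.25 and later) can build several features in one `groupby` call. The column `BottlesSold` and all output feature names here are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy Q1 transactions (made-up values; 'BottlesSold' is an assumed column name)
q1 = pd.DataFrame({
    "StoreNumber": [1, 1, 2, 2, 2],
    "SaleDollars": [100.0, 50.0, 40.0, 10.0, 30.0],
    "BottlesSold": [10, 5, 4, 1, 3],
})

# One row per store; each keyword is new_column=(source_column, function)
features = q1.groupby("StoreNumber").agg(
    q1_sales=("SaleDollars", "sum"),
    q1_mean_sale=("SaleDollars", "mean"),
    q1_bottles=("BottlesSold", "sum"),
    q1_transactions=("SaleDollars", "count"),
).reset_index()

print(features)
```

Calling `reset_index()` turns the store number back into an ordinary column, which keeps a later merge on that column straightforward.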



In [None]:
# liquor2015_q1.groupby()

In [None]:
# Dataframe of your Q1 features
liquor2015_q1_features = 

## Combine Q1 Features with Full Year Target

Now that you've created a set of features using the Q1 data, we must combine it with the full-year data so that our Xs (features) are matched up to their corresponding ys (targets).

Pandas' 'merge' function allows us to combine two dataframes, using SQL-like joins.

[Pandas Merge/Join Documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging)

We will create a new dataframe, called 'liquor2015_combined' by merging our 'liquor2015_fy' and 'liquor2015_q1_features' dataframes on Store Number - giving us a dataframe which in each row has the Q1 features you've developed for each store, and the year-end total sales for that store.

#### In pandas, merge can take two forms:

pd.merge(left_dataframe,right_dataframe, \*\*args)

*or*

left_dataframe.merge(right_dataframe,\*\*args)

Both of these return the merged dataframe. For arguments, you will need to choose which column(s) from your right and left dataframes you're merging on.

Args:
* left: your left-dataframe
* right: your right-dataframe
* on= : if your dataframes have a common column name that you're merging on, use this arg
* left_on= / right_on= : if your dataframes do not have a common column name, you can specify the names
* left_index= / right_index= : these are boolean (True/False) flags for whether to use the dataframe's index as the merging column.
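
A minimal sketch of the merge arguments listed above, using two toy dataframes with made-up values and an assumed shared column `StoreNumber`. An inner join keeps only stores that appear in both frames, which is one reasonable choice when a store needs both Q1 features and a full-year target.

```python
import pandas as pd

# Toy full-year targets and Q1 features (made-up values)
fy = pd.DataFrame({"StoreNumber": [1, 2, 3], "fy_sales": [175.0, 40.0, 60.0]})
q1_features = pd.DataFrame({"StoreNumber": [1, 2, 4], "q1_sales": [150.0, 40.0, 5.0]})

# Shared column name, so 'on=' is enough; validate raises if a store is duplicated
combined = pd.merge(
    q1_features, fy, on="StoreNumber", how="inner", validate="one_to_one"
)

print(combined)
```

Stores 3 and 4 drop out here because each is missing from one side; `how="left"` or `how="outer"` would keep them with missing values instead.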



In [None]:
liquor2015_combined = 

## Cross Validation

As we build our model, we will use cross-validation techniques to help navigate the bias/variance tradeoff, with a goal of producing the best model which will generalize to new data. 

![crossval](../Assets/validation.png)

### Step 1: Hold Out / Testing Data

In order to evaluate our final model performance, we will separate out a small amount of data which we will not touch while we train and test our model (labeled in red as "Testing Data" in the image above). 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
training_data,holdout = train_test_split(liquor2015_combined,shuffle=True,test_size=0.10,random_state=123)

### Step 2: K-Folds

With our holdout set removed, we can set up **K-Folds** cross-validation
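
To see what a K-Folds splitter produces, here is a small sketch on ten placeholder rows. It assumes the `sklearn.model_selection` version of `KFold`, where the splitter is configured with `n_splits` and yields index arrays from `.split()`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder rows standing in for a training set
rows = np.arange(10)

# shuffle=True is needed for random_state to have any effect
kf_demo = KFold(n_splits=5, shuffle=True, random_state=123)

fold_sizes = []
seen_in_test = []
for train_idx, test_idx in kf_demo.split(rows):
    # Each fold yields positional indices for its train and test portions
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_in_test.extend(test_idx.tolist())

print(fold_sizes)
```

Every row lands in the test portion of exactly one fold, so each observation is used for evaluation once and for training in the remaining folds.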

In [None]:
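from sklearn.model_selection import KFold

In [None]:
# Number of folds you wish to train
folds = 

# Number of rows in your dataframe (handy for checking fold sizes; KFold in model_selection does not take it as an argument)
n = training_data.shape[0]

# hint: pass n_splits=folds; shuffle=True is required when random_state is set
kf = KFold(<FILL IN ARGS>,shuffle=True,random_state=123)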

# Model Building - Linear Regression

With features prepared and a cross-validation framework in place, train and tune a linear regressor to predict year-end sales using your Q1 data

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
import numpy as np

In [None]:
#select your feature column names

feature_cols = [<FEATURES>]

In [None]:
# define your X (features) and y (target)
# hint - make sure your y is not in your X!

X = training_data[feature_cols]
y = training_data[<TARGET>]


Instantiate your model

In [None]:
lr = LinearRegression()

Use the kfolds iterator to **train** and **evaluate** your model, using Mean Squared Error (MSE) as your evaluation metric

In [None]:
# Create a blank list to store fold scores
scores =
# Fill-in the kfolds-loop:

for train,test in kf.split(X):
    # Set up your training and testing sets
    x_train = X.iloc[train]
#     x_test =
    y_train = y.iloc[train]
#     y_true =
    
    # Fit your model on your training x and training y
    lr.fit(x_train,y_train)
    
    # Make Predictions
    y_preds = 
    
    # Score your predictions vs. your true values using mean_squared_error
    fold_score = mean_squared_error()
    
    # Append your score 
    scores.append()


In [None]:
# View your fold scores, and calculate the mean score across your folds
# np.mean()
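
As a cross-check on a hand-written fold loop, scikit-learn's `cross_val_score` runs the same train/score cycle in one call. The sketch below uses synthetic data (not the liquor data), so the numbers carry no meaning beyond the mechanics. Note that the `neg_mean_squared_error` scorer returns negated MSE values, so the sign is flipped before averaging.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: one feature with a linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(60, 1))
y_demo = 4.0 * X_demo[:, 0] + rng.normal(0, 5, size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=123)

# Scores come back as negative MSE, one value per fold
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error"
)

fold_mse = -neg_mse
mean_mse = fold_mse.mean()
print(fold_mse, mean_mse)
```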

### Coefficients and Intercept

View the coefficients of your model - what do the coefficients tell you about the relationships between your features and your target?
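
One way to build intuition for reading coefficients is to fit a model on data where the true relationship is known. In this toy sketch (made-up numbers), full-year sales are constructed as exactly 4 times Q1 sales plus 50, so the fitted slope and intercept should recover those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data built so that fy_sales = 4 * q1_sales + 50 exactly
q1_sales = np.array([[100.0], [200.0], [300.0], [400.0]])
fy_sales = np.array([450.0, 850.0, 1250.0, 1650.0])

demo = LinearRegression().fit(q1_sales, fy_sales)

# Slope: expected change in full-year sales per extra dollar of Q1 sales
slope = demo.coef_[0]
# Intercept: predicted full-year sales for a store with zero Q1 sales
intercept = demo.intercept_

print(slope, intercept)
```

With several features, each coefficient is read the same way but "holding the other features fixed", and its size depends on the units of that feature.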

In [None]:
list(zip(feature_cols,lr.coef_))

In [None]:
lr.intercept_

### Tuning Your Model

So far, you've trained a basic linear model and evaluated it using Mean Squared Error. Use the same process as above to evaluate your model using: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and calculate the R2 score of your predictions.
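
The four metrics can be computed from any pair of true and predicted arrays. The values below are made up so the arithmetic is easy to follow by hand; RMSE is taken as the square root of MSE rather than through a dedicated function.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions; absolute errors are 10, 10, 20, 20
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(mae, mse, rmse, r2)
```

MAE and RMSE are in dollars, which makes them easier to explain to a non-technical audience than MSE; RMSE penalizes large misses more heavily than MAE.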

Try some of the parameters available for your linear model, and different sets of features to find a model that you feel will **perform best on new, out of sample data**

In [None]:
feature_cols_new = []

In [None]:
X = training_data[feature_cols_new]

In [None]:
# Use K-Folds cross validation to train your model
# Evaluate your model using MAE, MSE, RMSE and R2 

In [None]:
# Compare your MAE, MSE, RMSE and R2 values for your folds; describe anything that stands out.
# How do your metrics respond to different feature sets?

In [None]:
# Evaluate your coefficients and your intercept

## Test against your hold-out set

Before you built your model, you set aside some of your data for testing. Your model has never trained against these data points or been evaluated against them.

Use **ALL** of your training data to train, then test your model against your holdout set.

In [None]:
# Pick your best set of feature columns
features = [<FEATURES>]

X_train = training_data[features]
y_train = 

x_holdout = holdout[features]
y_holdout = 

In [None]:
lr = LinearRegression()

In [None]:
# Fit your model using all of your training data
lr.fit()

In [None]:
# Create predictions using your holdout set (x_holdout)
holdout_preds =

In [None]:
# score your model using MAE, MSE, RMSE, and R2
# hint: what is y_test and what is your y_true?

MAE_score =
MSE_score =
RMSE_score =
R2_score =


In [None]:
# print your scores



In [None]:
# Create a scatter plot of your predicted values vs. their true values
# Describe anything you observe


In [None]:
# Calculate your residuals (prediction - actual)


In [None]:
# Create a histogram of your residuals. Describe anything you observe



# Final Predictions

You've created a model that predicts 2015 year end sales based on Q1 2015 data. 

In the data source, we have included data for Q1 of 2016. Apply your feature engineering process to the 2016 Q1 data, then use your trained 2015 model to predict the 2016 year end values for those stores.

Note: you do not have the 2016 year end values to evaluate against.


### Feature Engineering
Perform the same aggregation and feature creation you used on 2015 data on the 2016 data 

In [None]:
# liquor[liquor.Date.dt.year == 2016]

### Make Predictions

Once you have your 2016 features, use your trained 2015 model on the 2016 Q1 data to get your predictions for 2016

*Do not retrain a model on the 2016 data*
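
A sketch of applying an already-fitted model to a new year's features, using toy data with hypothetical column names. The key point it illustrates is selecting the same feature columns, in the same order, that the model was trained on, and calling only `predict` on the new data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy "2015" training table (made-up values; fy_sales is exactly 4 * q1_sales)
train = pd.DataFrame({
    "q1_sales": [100.0, 200.0, 300.0],
    "q1_bottles": [10, 25, 28],
    "fy_sales": [400.0, 800.0, 1200.0],
})
cols = ["q1_sales", "q1_bottles"]
model = LinearRegression().fit(train[cols], train["fy_sales"])

# Toy "2016" Q1 features; note the columns arrive in a different order
new = pd.DataFrame({
    "StoreNumber": [1, 2],
    "q1_bottles": [15, 25],
    "q1_sales": [150.0, 250.0],
})

# Index with the training column list so the order matches; no refitting
new["predicted_fy_sales"] = model.predict(new[cols])
total_predicted = new["predicted_fy_sales"].sum()

print(new)
print(total_predicted)
```

Summing the per-store predictions gives a statewide estimate, which can then be compared with the 2015 total to report a projected increase or decrease.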

In [None]:
# Make Predictions

In [None]:
# Show your 2016 year-end prediction for each store

# Evaluation:

Do your best to answer the following questions:

* What was the best set of features you found for your model?
* Describe the relationships between your features and your target
* How did your model perform in the training phase? Against the holdout set? 
* Did it perform better or worse against the holdout set?

Finally:
* Write a short description of your analysis, describing the process you went through and your confidence in your model's predictive ability
* Include any data, or visualizations you feel would help support your findings

# Bonus - Regularization & Grid Search

As a bonus, experiment with the effect of Lasso (L1) and Ridge (L2) regularization on your linear model. Use GridSearchCV to tune your additional parameters.

See [GridSearchCV 'scoring' options](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter) for a list of scoring function strings recognized by GridSearchCV
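
A minimal sketch of a grid search over `alpha` for Ridge regression, run on synthetic data rather than the liquor data. The alpha grid is an arbitrary illustration. As with other `neg_` scorers, `best_score_` is the negated MSE, and per-candidate results live in the `cv_results_` dictionary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 0.5, size=80)

# Illustrative alpha grid; a real search would pick values suited to the data
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
mean_scores = search.cv_results_["mean_test_score"]  # one entry per alpha
best_mse = -search.best_score_                        # flip sign back to MSE

print(best_alpha, best_mse)
```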

In [None]:
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

In [None]:
# Instantiate models
lr_ridge = 
lr_lasso = 

In [None]:
# Use your post-holdout training data, so you can evaluate on the holdout later
X = 
y = 

Experiment with values of alpha, scoring functions, and L1/L2 regularization

In [None]:
params = {'alpha':[0.2,1.0]}

In [None]:
gs = GridSearchCV(<model>,params,cv=5,scoring='neg_mean_squared_error')

In [None]:
# Gridsearch incorporates k-folds validation
# You do not have to create training/testing splits
gs.fit(X,y)

In [None]:
# View all permutation scores
gs.cv_results_

In [None]:
# use the best set of parameters
lr_best = gs.best_estimator_

In [None]:
# try the best estimator on your holdout set

## Evaluation:

Did regularization improve your model? What was the impact of regularization on your features? Did regularization make any features stand out?
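
When comparing the two penalties, it can help to look at the fitted coefficients side by side. The sketch below uses synthetic data in which only the first feature matters; the alpha value of 0.5 is arbitrary. Lasso typically drives irrelevant coefficients to exactly zero, while Ridge only shrinks them toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first of three features drives the target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=200)

ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Ridge keeps small nonzero weights on the noise features;
# Lasso removes them and also shrinks the informative coefficient
print("ridge:", ridge_demo.coef_)
print("lasso:", lasso_demo.coef_)
```

On real features with very different scales, standardizing before fitting is usually worth considering, since both penalties act on raw coefficient sizes.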