## Python Demo
### A prelude to xxx Python Training

We're going to follow a normal analyst workflow:
1. Pull data from Exadata.
2. View the data.
3. Split the data into training and test.
4. Clean the data and remove outliers.
5. Train a model.
6. Test the model.
7. Export the model.

This demo is intended to be something of a [whirlwind](http://www.oreilly.com/programming/free/files/a-whirlwind-tour-of-python.pdf), so don't worry if you aren't able to keep up. This should just give you a taste of the look and feel of Python, along with its power.

We're going to model a store's sales in June 2017, based on its sales in April and May 2017.

### 1. Pull Data

In [None]:
# Import the Oracle connection module from the py_effo library.
import py_effo.oracle_connection as oracle_connection
# Instantiate a connection object.
con = oracle_connection.OracleConnection('an_cm_ws29')
# Construct a query for what data we want to pull back.
query = '''SELECT store_id,
                  -- Half-open date ranges (>= first day, < first day of the next month)
                  -- so transactions after midnight on a month's last day are included.
                  -- April Spend
                  SUM(CASE WHEN transaction_dttm >= TO_DATE('20170401', 'YYYYMMDD')
                            AND transaction_dttm <  TO_DATE('20170501', 'YYYYMMDD')
                      THEN basket_net_spend_amt
                      ELSE 0
                  END) AS apr_spend,
                  -- May Spend
                  SUM(CASE WHEN transaction_dttm >= TO_DATE('20170501', 'YYYYMMDD')
                            AND transaction_dttm <  TO_DATE('20170601', 'YYYYMMDD')
                      THEN basket_net_spend_amt
                      ELSE 0
                  END) AS may_spend,
                  -- June Spend
                  SUM(CASE WHEN transaction_dttm >= TO_DATE('20170601', 'YYYYMMDD')
                            AND transaction_dttm <  TO_DATE('20170701', 'YYYYMMDD')
                      THEN basket_net_spend_amt
                      ELSE 0
                  END) AS june_spend
            FROM transaction_basket_fct
            WHERE transaction_dttm >= TO_DATE('20170401', 'YYYYMMDD')
              AND transaction_dttm <  TO_DATE('20170701', 'YYYYMMDD')
            -- Use this modulus to take a sample (just to make this simpler)
            AND MOD(store_id, 23) = 0
            GROUP BY store_id'''
# Execute the query.
store_spend = con.query(query)
# View a sample of the data
store_spend.head()
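
For readers less familiar with SQL, the `SUM(CASE WHEN ...)` pattern in the query turns one row per transaction into one row per store, with a spend column for each month. A rough pure-Python sketch of the same logic is below. The transactions are made up for illustration, `monthly_spend_by_store` is a hypothetical helper, and this is not how Oracle or `py_effo` actually execute the query.

```python
from collections import defaultdict
from datetime import datetime

# Made-up transactions: (store_id, transaction_dttm, basket_net_spend_amt)
transactions = [
    (23, datetime(2017, 4, 3, 9, 15), 20.0),
    (23, datetime(2017, 5, 10, 17, 40), 35.5),
    (23, datetime(2017, 6, 15, 18, 5), 12.0),
    (46, datetime(2017, 4, 20, 12, 0), 8.25),
    (46, datetime(2017, 7, 1, 8, 0), 99.0),   # outside the April-June window
    (24, datetime(2017, 4, 1, 10, 0), 50.0),  # dropped by the MOD(store_id, 23) sample
]

# Month number -> output column name (uppercase, as used later in this demo)
MONTH_COLUMNS = {4: 'APR_SPEND', 5: 'MAY_SPEND', 6: 'JUNE_SPEND'}

def monthly_spend_by_store(rows):
    """Sum spend per store and month, mirroring the SUM(CASE WHEN ...) columns."""
    totals = defaultdict(lambda: {name: 0.0 for name in MONTH_COLUMNS.values()})
    for store_id, dttm, spend in rows:
        in_window = dttm.year == 2017 and dttm.month in MONTH_COLUMNS  # WHERE on dates
        in_sample = store_id % 23 == 0                                 # MOD(store_id, 23) = 0
        if in_window and in_sample:
            totals[store_id][MONTH_COLUMNS[dttm.month]] += spend
    return dict(totals)

store_totals = monthly_spend_by_store(transactions)
for store_id, columns in sorted(store_totals.items()):
    print(store_id, columns)
```

Store 46 still gets a row with zeros for May and June, which is the job of the `ELSE 0` branch in the query.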

### 2. View the data
If you want to take a look at the whole thing as a table, run the following cell.

In [None]:
store_spend

Let's make density plots of store spend in each month.

In [None]:
# Set up for plotting
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# Create 3 vertical plots that share an x axis
fig, ax = plt.subplots(nrows=3, sharex=True)
for i, col_name in enumerate(['APR_SPEND', 'MAY_SPEND', 'JUNE_SPEND']):
    sns.distplot(store_spend[col_name], ax=ax[i])

Yes, we could spend time to make this prettier, but that's not what we're after.

### 3. Split into training and test
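First, separate the response and the features into their own pandas objects (a DataFrame of features and a Series for the response), then split each into training and test sets.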

In [None]:
from sklearn.model_selection import train_test_split
# Separate response from features
X = store_spend[['APR_SPEND', 'MAY_SPEND']]
y = store_spend['JUNE_SPEND']
# Split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y)
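
`train_test_split` shuffles the rows and, by default, holds out 25% of them as the test set. A minimal pure-Python sketch of that idea is below. `simple_train_test_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it will not reproduce scikit-learn's exact row assignment.

```python
import math
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=8451):
    """Shuffle row indices, then hold out `test_fraction` of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# Made-up (april, may, june) spend rows for illustration.
rows = [(100 + i, 110 + i, 120 + i) for i in range(20)]
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # 15 5
```

Fixing the seed makes the split repeatable. With scikit-learn, the equivalent is passing `random_state` to `train_test_split`.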

### 4. Clean and Remove outliers
We'll use a method called Isolation Forest to find outliers.

In [None]:
from sklearn.ensemble import IsolationForest

isolation_forest = IsolationForest(max_samples=100, random_state=8451)
iso_fit = isolation_forest.fit(X_train)
# Get outlier ratings for all of the points in the training data
outlier_ratings = iso_fit.decision_function(X_train)
# Take a look at the first 50 outlier ratings
outlier_ratings[:50]
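
For some intuition about the method: an Isolation Forest repeatedly splits the data at random values and records how many splits it takes to isolate each point. Outliers sit far from everything else, so they tend to be isolated in fewer splits. The one-dimensional sketch below illustrates only that idea on made-up numbers. `isolation_depth` and `mean_depth` are hypothetical helpers. scikit-learn's `IsolationForest` additionally subsamples the data, splits on randomly chosen features, and converts path lengths into the scores returned by `decision_function`.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed to isolate `point` within 1-D `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [v for v in data if v < split]
    else:
        side = [v for v in data if v >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=8451):
    """Average isolation depth over many random 'trees'."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Made-up spend values: nine similar stores and one extreme one.
spend_values = [97, 98, 99, 99.5, 100, 100.5, 101, 102, 103, 500]

outlier_depth = mean_depth(500, spend_values)
inlier_depth = mean_depth(100, spend_values)
print('outlier:', outlier_depth, 'inlier:', inlier_depth)
```

The extreme value is usually cut off by the very first split, so its average depth is much smaller than that of a typical store.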

Plot the outlier ratings to see if there is an obvious cutoff (lower scores mean more likely to be an outlier).

In [None]:
sns.distplot(outlier_ratings)

Let's cut off at -0.10; it looks like there's a gap there. Remove rows that meet this outlier condition.

In [None]:
outlier_filter = (outlier_ratings <= -.1)
X_train = X_train[~outlier_filter]
y_train = y_train[~outlier_filter]
print('X_train shape: %s' % str(X_train.shape))
print('y_train shape: %s' % str(y_train.shape))

### 5. Train a model
For the sake of simplicity, let's just do a linear regression.

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
model = lin_reg.fit(X=X_train, y=y_train)
print('Coef: %s' % str(model.coef_))
print('Intercept: %s' % str(model.intercept_))
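
The fitted coefficients and intercept define the prediction directly: predicted June spend is the intercept plus each coefficient times its feature, in the same column order as `X`. A small sketch with made-up numbers follows. The values are illustrative assumptions, not output from this model, and `predict_june_spend` is a hypothetical helper.

```python
def predict_june_spend(apr_spend, may_spend, coef, intercept):
    """Linear prediction: intercept + coef[0] * April + coef[1] * May."""
    return intercept + coef[0] * apr_spend + coef[1] * may_spend

# Hypothetical values standing in for model.coef_ and model.intercept_.
example_coef = [0.5, 0.5]
example_intercept = 1000.0

prediction = predict_june_spend(200000.0, 220000.0, example_coef, example_intercept)
print(prediction)  # 211000.0
```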

That was easy. What's the training MSE?

In [None]:
from sklearn.metrics import mean_squared_error
y_pred_train = model.predict(X_train)
mean_squared_error(y_train, y_pred_train)
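
Mean squared error is the average of the squared differences between actual and predicted values, so it is expressed in squared spend units. A minimal sketch of the calculation is below, using made-up numbers rather than the store data. `mse` is a hypothetical helper, not scikit-learn's function.

```python
import math

def mse(actual, predicted):
    """Average of squared differences between paired actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

example_actual = [3.0, -0.5, 2.0, 7.0]
example_predicted = [2.5, 0.0, 2.0, 8.0]

example_mse = mse(example_actual, example_predicted)
print(example_mse)             # 0.375
print(math.sqrt(example_mse))  # root MSE, back in the original units
```

Taking the square root gives a number on the same scale as the spend values, which is often easier to interpret.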

### 6. Test the model
First, make predictions.

In [None]:
y_pred_test = model.predict(X_test)

Now see how good they are.

In [None]:
mean_squared_error(y_test, y_pred_test)

### 7. Export the model
Let's just "pickle" the model and save it. Pickling saves Python objects in files, so you can reload them in a later session.
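
A minimal sketch of that round trip with the standard-library `pickle` module is below. The `save_object` and `load_object` helpers and the file name are hypothetical, and a plain dictionary stands in for the fitted model so the example runs on its own. In this notebook you would pass `model` instead.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Serialize a Python object to a file."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_object(path):
    """Load a previously pickled Python object from a file."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for the fitted model (an assumption, so the example is self-contained).
stand_in_model = {'coef': [0.5, 0.5], 'intercept': 1000.0}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'june_spend_model.pkl')
    save_object(stand_in_model, model_path)
    restored = load_object(model_path)

print(restored == stand_in_model)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code. When reloading a scikit-learn model, use the same scikit-learn version that saved it.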