### AI Workflow Capstone Project (final part)

#### (i). The following Python code loads about 21 JSON-formatted files into a dataframe and prints the top 10 countries (ordered by revenue)

#### (ii). Please review the 12 AI Workflow questions below

In [1]:
"""
collection of functions for the final case study solution
"""

import os
import sys
import re
import shutil
import time
import pickle
from collections import defaultdict
from datetime import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

In [2]:
COLORS = ["darkorange","royalblue","slategrey"]

In [3]:
data_dir = os.path.join(".","cs-train")

In [4]:
file_list = [os.path.join(data_dir,f) for f in os.listdir(data_dir) if re.search(r"\.json",f)]
correct_columns = ['country', 'customer_id', 'day', 'invoice', 'month',
                       'price', 'stream_id', 'times_viewed', 'year']

## read data into a temp structure
all_months = {}
for file_name in file_list:
    df = pd.read_json(file_name)
    all_months[os.path.split(file_name)[-1]] = df

## ensure the data are formatted with correct columns
for f,df in all_months.items():
    cols = set(df.columns.tolist())
    if 'StreamID' in cols:
        df.rename(columns={'StreamID':'stream_id'},inplace=True)
    if 'TimesViewed' in cols:
        df.rename(columns={'TimesViewed':'times_viewed'},inplace=True)
    if 'total_price' in cols:
        df.rename(columns={'total_price':'price'},inplace=True)

    cols = df.columns.tolist()
    if sorted(cols) != correct_columns:
        raise Exception("columns name could not be matched to correct cols")

## concat all of the data
df = pd.concat(list(all_months.values()),sort=True)
years,months,days = df['year'].values,df['month'].values,df['day'].values 
dates = ["{}-{}-{}".format(years[i],str(months[i]).zfill(2),str(days[i]).zfill(2)) for i in range(df.shape[0])]
df['invoice_date'] = np.array(dates,dtype='datetime64[D]')
df['invoice'] = [re.sub(r"\D+","",i) for i in df['invoice'].values]
    
## sort by date and reset the index
df.sort_values(by='invoice_date',inplace=True)
df.reset_index(drop=True,inplace=True)

In [5]:
# data preprocessing ...and obtaining the original dataframe
df.head()

|   | country | customer_id | day | invoice | month | price | stream_id | times_viewed | year | invoice_date |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | United Kingdom | 13085.0 | 28 | 489434 | 11 | 6.95 | 85048 | 12 | 2017 | 2017-11-28 |
| 1 | United Kingdom | 13085.0 | 28 | 489434 | 11 | 6.75 | 79323W | 12 | 2017 | 2017-11-28 |
| 2 | United Kingdom | 13085.0 | 28 | 489434 | 11 | 2.1 | 22041 | 21 | 2017 | 2017-11-28 |
| 3 | United Kingdom | 13085.0 | 28 | 489434 | 11 | 1.25 | 21232 | 5 | 2017 | 2017-11-28 |
| 4 | United Kingdom | 13085.0 | 28 | 489434 | 11 | 1.65 | 22064 | 17 | 2017 | 2017-11-28 |


In [6]:
"""
country = None
df_orig = df

if country:
    if country not in np.unique(df_orig['country'].values):
        raise Exception("country not found")
    
    mask = df_orig['country'] == country
    df = df_orig[mask]
else:
    df = df_orig
"""
 
## use a date range to ensure all days are accounted for in the data
invoice_dates = df['invoice_date'].values
start_month = '{}-{}'.format(df['year'].values[0],str(df['month'].values[0]).zfill(2))
stop_month = '{}-{}'.format(df['year'].values[-1],str(df['month'].values[-1]).zfill(2))
df_dates = df['invoice_date'].values.astype('datetime64[D]')
days = np.arange(start_month,stop_month,dtype='datetime64[D]')
    
purchases = np.array([np.where(df_dates==day)[0].size for day in days])
invoices = [np.unique(df[df_dates==day]['invoice'].values).size for day in days]
streams = [np.unique(df[df_dates==day]['stream_id'].values).size for day in days]
views =  [df[df_dates==day]['times_viewed'].values.sum() for day in days]
revenue = [df[df_dates==day]['price'].values.sum() for day in days]
year_month = ["-".join(re.split("-",str(day))[:2]) for day in days]

df_time = pd.DataFrame({'date':days,
                        'purchases':purchases,
                        'unique_invoices':invoices,
                        'unique_streams':streams,
                        'total_views':views,
                        'start_month':start_month,
                        'stop_month':stop_month,
                        'year_month':year_month,
                        'revenue':revenue})

In [7]:
df_time.head()

|   | date | purchases | unique_invoices | unique_streams | total_views | start_month | stop_month | year_month | revenue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-11-01 | 0 | 0 | 0 | 0 | 2017-11 | 2019-07 | 2017-11 | 0.0 |
| 1 | 2017-11-02 | 0 | 0 | 0 | 0 | 2017-11 | 2019-07 | 2017-11 | 0.0 |
| 2 | 2017-11-03 | 0 | 0 | 0 | 0 | 2017-11 | 2019-07 | 2017-11 | 0.0 |
| 3 | 2017-11-04 | 0 | 0 | 0 | 0 | 2017-11 | 2019-07 | 2017-11 | 0.0 |
| 4 | 2017-11-05 | 0 | 0 | 0 | 0 | 2017-11 | 2019-07 | 2017-11 | 0.0 |


In [8]:
# the total days within the date span
len(df_time.date.unique())

607
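
The per-day list comprehensions in cell 6 rescan the full dataframe once for each of the 607 days. As a possible alternative, the sketch below computes the same numeric columns with a single `groupby` and then reindexes onto the full date range so that days without invoices appear as zeros. The function name `daily_summary` and the tiny sample frame are assumptions made for illustration, and the string columns (`start_month`, `stop_month`, `year_month`) are left out.

```python
import numpy as np
import pandas as pd

def daily_summary(df, start, stop):
    """Aggregate invoice rows to one row per day in [start, stop).

    Hypothetical helper: expects the columns built in cell 4
    (invoice_date, invoice, stream_id, times_viewed, price).
    """
    # full calendar of days, stop excluded, as in the original np.arange call
    days = pd.DatetimeIndex(
        np.arange(start, stop, dtype="datetime64[D]").astype("datetime64[ns]")
    )

    # group on a nanosecond-resolution copy of the invoice date
    keyed = df.assign(date=df["invoice_date"].values.astype("datetime64[ns]"))
    grouped = keyed.groupby("date")

    out = pd.DataFrame({
        "purchases": grouped.size(),
        "unique_invoices": grouped["invoice"].nunique(),
        "unique_streams": grouped["stream_id"].nunique(),
        "total_views": grouped["times_viewed"].sum(),
        "revenue": grouped["price"].sum(),
    })

    # days with no invoices become rows of zeros
    out = out.reindex(days, fill_value=0)
    out.index.name = "date"
    return out.reset_index()

# made-up sample data, only to show the shape of the result
sample = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2017-11-02", "2017-11-02", "2017-11-04"]),
    "invoice": ["1", "1", "2"],
    "stream_id": ["a", "b", "a"],
    "times_viewed": [1, 2, 3],
    "price": [1.0, 2.0, 3.5],
})
daily = daily_summary(sample, "2017-11-01", "2017-11-05")
print(daily)
```

With the notebook's variables, the call would be `daily_summary(df, start_month, stop_month)`.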

In [9]:
# total revenue per country
revenue = df.groupby('country')['price'].sum()
dat = pd.DataFrame({'revenue': revenue})

# the top 10 countries (ordered by revenue)
_top_10 = dat.sort_values('revenue', ascending=False)

# print out
_top_10[:10]

| country | revenue |
|---|---|
| United Kingdom | 3521514.0 |
| EIRE | 107069.2 |
| Germany | 49271.82 |
| France | 40565.14 |
| Norway | 38494.75 |
| Spain | 16040.99 |
| Hong Kong | 14452.57 |
| Portugal | 13528.67 |
| Singapore | 13175.92 |
| Netherlands | 12322.8 |


In [None]:
def feature_engineering(df,training=True):
    """
    for any given day the target becomes the summed revenue of the next 30 days (starting with that day)
    for that day we engineer several features that help predict the summed revenue
    
    the 'training' flag will trim data that should not be used for training
    when set to false all data will be returned

    """

    ## extract dates
    dates = df['date'].values.copy()
    dates = dates.astype('datetime64[D]')

    ## engineer some features
    eng_features = defaultdict(list)
    previous =[7, 14, 28, 70]  #[7, 14, 21, 28, 35, 42, 49, 56, 63, 70]
    y = np.zeros(dates.size)
    for d,day in enumerate(dates):

        ## use windows in time back from a specific date
        for num in previous:
            current = np.datetime64(day, 'D') 
            prev = current - np.timedelta64(num, 'D')
            mask = np.isin(dates, np.arange(prev,current,dtype='datetime64[D]'))
            eng_features["previous_{}".format(num)].append(df[mask]['revenue'].sum())

        ## get the target revenue (sum over the 30 days starting at the current day)
        plus_30 = current + np.timedelta64(30,'D')
        mask = np.isin(dates, np.arange(current,plus_30,dtype='datetime64[D]'))
        y[d] = df[mask]['revenue'].sum()

        ## attempt to capture monthly trend with previous years data (if present)
        start_date = current - np.timedelta64(365,'D')
        stop_date = plus_30 - np.timedelta64(365,'D')
        mask = np.isin(dates, np.arange(start_date,stop_date,dtype='datetime64[D]'))
        eng_features['previous_year'].append(df[mask]['revenue'].sum())

        ## add some non-revenue features
        minus_30 = current - np.timedelta64(30,'D')
        mask = np.isin(dates, np.arange(minus_30,current,dtype='datetime64[D]'))
        eng_features['recent_invoices'].append(df[mask]['unique_invoices'].mean())
        eng_features['recent_views'].append(df[mask]['total_views'].mean())

    X = pd.DataFrame(eng_features)
    ## combine features in to df and remove rows with all zeros
    X.fillna(0,inplace=True)
    mask = X.sum(axis=1)>0
    X = X[mask]
    y = y[mask]
    dates = dates[mask]
    X.reset_index(drop=True, inplace=True)

    if training:
        ## remove the last 30 days (because the target is not reliable)
        mask = np.arange(X.shape[0]) < np.arange(X.shape[0])[-30]
        X = X[mask]
        y = y[mask]
        dates = dates[mask]
        X.reset_index(drop=True, inplace=True)
    
    return X, y, dates


if __name__ == "__main__":

    run_start = time.time() 
    data_dir = os.path.join("..","data","cs-train")
    print("...fetching data")

    ## NOTE: fetch_ts is not defined in the cells above; define or import it before running this block
    ts_all = fetch_ts(data_dir,clean=False)

    m, s = divmod(time.time()-run_start,60)
    h, m = divmod(m, 60)
    print("load time:", "%d:%02d:%02d"%(h, m, s))

    for key,item in ts_all.items():
        print(key,item.shape)


### Answers to the 12 AI Workflow Questions

1. Are there unit tests for the API?

   Yes. Unit testing is the process of testing small portions of the software, also known as units. This is done one test at a time, to verify that an expected result is returned under controlled conditions. Importantly, the unit tests are usually organized as a suite and return objective evidence, in the form of a boolean value, which is a key element that enables workflow automation.
   One of the reasons to create unit tests is to ensure that iterative improvements to code do not break the functionality of the API. 

2. Are there unit tests for the model?

   Yes. Unit testing is the process of testing small portions of the software, also known as units. This is done one test at a time, to verify that an expected result is returned under controlled conditions. Importantly, the unit tests are usually organized as a suite and return objective evidence, in the form of a boolean value, which is a key element that enables workflow automation.
   One of the reasons to create unit tests is to ensure that iterative improvements to code do not break the functionality of the model.

3. Are there unit tests for the logging?

No. Like all problems in data science, performance monitoring starts with collecting the right data in the right format. Data for performance monitoring is generally collected using log files. For most model deployment projects, the key fields to log are: runtime, timestamp, prediction, input_data_summary and model_version_number.
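
As an illustration of the logging fields listed above, the following is a minimal sketch of a function that appends one record per prediction to a CSV log file. The function name `update_predict_log`, the CSV layout, the file name, and the example values are all assumptions, not the project's actual logger.

```python
import csv
import os
import tempfile
from datetime import datetime

# field names follow the list in the answer above; everything else is hypothetical
LOG_FIELDS = ["timestamp", "runtime", "prediction",
              "input_data_summary", "model_version_number"]

def update_predict_log(log_path, prediction, runtime,
                       input_data_summary, model_version_number):
    """Append one prediction record to a CSV log, writing the header only once."""
    write_header = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "runtime": runtime,
            "prediction": prediction,
            "input_data_summary": input_data_summary,
            "model_version_number": model_version_number,
        })

# usage with made-up values, written to a temporary directory
log_path = os.path.join(tempfile.mkdtemp(), "predict-example.log")
update_predict_log(log_path, prediction=1234.5, runtime="00:00:01",
                   input_data_summary="country=all, rows=30",
                   model_version_number=0.1)
update_predict_log(log_path, prediction=987.0, runtime="00:00:02",
                   input_data_summary="country=EIRE, rows=30",
                   model_version_number=0.1)

# read the log back to confirm one header row and two records
with open(log_path, newline="") as handle:
    rows = list(csv.DictReader(handle))
print(len(rows), "records logged")
```

A unit test for logging could call such a function against a temporary path, as the usage lines do, and assert on the rows that are read back.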

4. Can all of the unit tests be run with a single script and do all of the unit tests pass?

Yes. However, if one of the tests were more comprehensive, for example an API test that exercised multiple functions, it would likely fall under the umbrella of integration testing. Both unit tests and integration tests are part of the CI/CD pipeline.
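
One way to run every test from a single script is to collect the test cases into one suite and reduce the outcome to a single boolean. The sketch below uses Python's standard `unittest` module; the two test classes are hypothetical placeholders standing in for the project's real API and model tests.

```python
import unittest

class ApiTest(unittest.TestCase):
    """Hypothetical placeholder for the API tests."""
    def test_placeholder(self):
        self.assertTrue(True)

class ModelTest(unittest.TestCase):
    """Hypothetical placeholder for the model tests."""
    def test_placeholder(self):
        self.assertEqual(1 + 1, 2)

def run_all(test_cases):
    """Run the given TestCase classes as one suite; return True if all pass."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite(
        loader.loadTestsFromTestCase(case) for case in test_cases
    )
    result = unittest.TextTestRunner(verbosity=1).run(suite)
    return result.wasSuccessful()

all_passed = run_all([ApiTest, ModelTest])
print("all tests passed:", all_passed)
```

When the tests live in separate files, `unittest.TestLoader().discover(start_dir)` can build the suite from a directory instead of an explicit list of classes.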

5. Is there a mechanism to monitor performance?

Yes. Because performance monitoring is a concern in nearly all customer-facing computer systems, there is a well-established set of tools and techniques for collecting this data. Data for performance monitoring is generally collected using log files. 

6. Was there an attempt to isolate the read/write unit tests from production models and logs?

Yes - but in this situation we should assume that the data science team has decided to keep containers as isolated as possible. One reason for this approach would be that the company uses a hybrid cloud or multicloud architecture of storage and services.

7. Does the API work as expected? For example, can you get predictions for a specific country as well as for all countries combined?

Yes. The API works well: we can get predictions for a specific country as well as for all countries combined.

8. Does the data ingestion exist as a function or script to facilitate automation?

Yes. Any form of data movement from source to target can be considered data ingestion. In practice, a common database as a target is next to impossible due to logistical and privacy concerns, but API keys could be a viable route towards automation. More comprehensive automation can be achieved with scripting, and cron jobs are a powerful way to schedule the process.

9. Were multiple models compared?

Yes - they were compared. How well a model performs can be decomposed as bias, variance and noise. The bias of a model is its average error when a model is subjected to different training sets and it comes from the underlying model assumptions. The variance of a model is reflective of how sensitive it is to variations in the training data.
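
For squared-error loss, the decomposition described above can be written out explicitly. This is the standard textbook form, stated under the assumption that the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and that the expectation is taken over training sets and noise; the symbols are introduced here for illustration.

```latex
$$
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$
```

Only the first two terms depend on the choice of model, which is why comparing several models is informative: the noise term sets a floor that no model can go below.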

10. Did the EDA investigation use visualizations?

Yes. The first task in data science is always data visualization. The data visualization deliverables have become an important part of a playback! It is part of one of the key principles of design thinking: Observation and Reflection.

11. Is everything containerized within a working Docker image?

Yes. In data science today, Docker is the industry standard for containerization of machine learning models and AI services. A Docker container is a running process that is kept isolated from the host and from other containers. One important consequence of this isolation is that each container interacts with its own private filesystem. A Docker image includes everything needed to run an application: code, runtime libraries, and a private filesystem.

12. Did they use a visualization to compare their model to the baseline model?

Yes. The model was compared to the baseline model using a visualization.