# Business understanding
------------

This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.


## Determine business objectives
----------

### task

The first objective of the data analyst is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The analyst's goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.

### output

#### background

Record the information that is known about the organization's business situation at the beginning of the project.

#### business objectives 

Describe the customer's primary objective, from a business perspective. In addition to the primary business objective, there are typically other related business questions that the customer would like to address. For example, the primary business goal might be to keep current customers by predicting when they are prone to move to a competitor. Examples of related business questions are "How does the primary channel (e.g., ATM, visit branch, internet) a bank customer uses affect whether they stay or go?" or "Will lower ATM fees significantly reduce the number of high-value customers who leave?"

#### business success criteria 

Describe the criteria for a successful or useful outcome to the project from the business point of view. These may be specific and objectively measurable, such as reducing customer churn to a certain level, or general and subjective, such as "give useful insights into the relationships." In the latter case, indicate who makes the subjective judgment.


## Assess situation
-------------

### task

This task involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, your objective was to quickly get to the crux of the situation. Here, you want to flesh out the details.

### output

#### inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, data mining personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (data mining tools, other relevant software).

#### requirements, assumptions and constraints

List all requirements of the project, including schedule of completion, comprehensibility and quality of results, and security as well as legal issues. As part of this output, make sure that you are allowed to use the data. List the assumptions made by the project.

These may be assumptions about the data that can be checked during data mining, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.

List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.

#### Risks and contingencies 

List the risks or events that might occur to delay the project or cause it to fail. List the corresponding contingency plans; what action will be taken if the risks happen.

#### Terminology

Compile a glossary of terminology relevant to the project. This may include two components: 
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of data mining terminology, illustrated with examples relevant to the business problem in question.

#### Costs and benefits

Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible, for example using monetary measures in a commercial situation.


## Determine data mining goals
-----------

### task

A business goal states objectives in business terminology. A data mining goal states project objectives in technical terms. For example, the business goal might be "Increase catalog sales to existing customers." A data mining goal might be "Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.) and the price of the item."

### output

#### data mining goals

Describe the intended outputs of the project that enable the achievement of the business objectives.

#### data mining success criteria

Define the criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity to purchase profile with a given degree of "lift." As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified.
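As a rough illustration of a "lift" criterion, the sketch below compares the purchase rate among the top-scoring records with the overall purchase rate. The function name `lift_at_fraction` and the toy scores are assumptions made for this example, not part of the template.

```python
import pandas as pd

def lift_at_fraction(y_true, y_score, fraction=0.1):
    """Lift of the top-scoring `fraction` of records over the base rate."""
    scored = pd.DataFrame({"y": list(y_true), "score": list(y_score)})
    scored = scored.sort_values("score", ascending=False, kind="stable")
    n_top = max(1, int(round(len(scored) * fraction)))
    top_rate = scored["y"].head(n_top).mean()   # positive rate among the top-scored records
    base_rate = scored["y"].mean()              # positive rate over all records
    return top_rate / base_rate

# Toy data (hypothetical): 10 customers, 4 of whom purchased (y = 1).
y_true = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
lift_top_20 = lift_at_fraction(y_true, y_score, fraction=0.2)
print(lift_top_20)
```

A success criterion could then be phrased as a minimum lift for a given fraction of the customer base, with the threshold agreed with the business side.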


## Produce project plan
----------------

### task

Describe the intended plan for achieving the data mining goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### project plan

List the stages to be executed in the project, together with duration, resources required, inputs, outputs and dependencies. Where possible make explicit the large-scale iterations in the data mining process, for example repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks appear.

Note: the project plan contains detailed plans for each phase. For example, decide at this point which evaluation strategy will be used in the evaluation phase. The project plan is a dynamic document in the sense that at the end of each phase a review of progress and achievements is necessary and an update of the project plan accordingly is recommended. Specific review points for these reviews are part of the project plan, too.

#### initial assessment of tools and techniques

At the end of the first phase, the project also performs an initial assessment of tools and techniques. Here, for example, you select a data mining tool that supports various methods for different stages of the process. It is important to assess tools and techniques early in the process, since their selection may influence the entire project.


# Data understanding
------------
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data or to detect interesting subsets to form hypotheses for hidden information.

## Collect initial data
----------

### task

Acquire within the project the data (or access to the data) listed in the
project resources. This initial collection includes data loading if necessary
for data understanding. For example, if you apply a specific tool for data
understanding, it makes perfect sense to load your data into this tool.
This effort possibly leads to initial data preparation steps.

Note: if you acquire multiple data sources, integration is an additional
issue, either here or in the later data preparation phase.

### output

List the dataset (or datasets) acquired, together with their locations
within the project, the methods used to acquire them and any problems
encountered. Record problems encountered and any solutions achieved
to aid with future replication of this project or with the execution of
similar future projects.

In [None]:
# read csv file (path and file_name are placeholders to be set by the user)
import pandas as pd
df = pd.read_csv(path + file_name)

In [None]:
# read excel file
df = pd.read_excel(path + file_name, sheet_name=sheet_name)

## Describe data
----------

### task

Examine the “gross” or “surface” properties of the acquired data and
report on the results.

### output

Describe the data which has been acquired, including: the format of
the data, the quantity of data, for example number of records and fields
in each table, the identities of the fields and any other surface features
of the data which have been discovered. Does the data acquired satisfy
the relevant requirements?

In [None]:
# number of records and fields
df.shape # (records, fields)

In [None]:
# head
df.head(n=5)

In [None]:
# tail
df.tail(n=5)

In [None]:
# data types
df.dtypes

## Explore data
----------

### task

This task tackles the data mining questions, which can be addressed
using querying, visualization and reporting. These include: distribution
of key attributes, for example the target attribute of a prediction task;
relations between pairs or small numbers of attributes; results of
simple aggregations; properties of significant sub-populations; simple
statistical analyses. These analyses may address directly the data mining goals; they may also contribute to or refine the data description
and quality reports and feed into the transformation and other data
preparation needed for further analysis.

### output

Describe results of this task including first findings or initial hypotheses and their impact on the remainder of the project. If appropriate,
include graphs and plots, which indicate data characteristics or lead
to interesting data subsets for further examination.

In [None]:
# convert data type
df.shipping_date = pd.to_datetime(df.shipping_date, format='%Y/%m/%d')
df.price = df.price.astype('int64')
df.quantity = df.quantity.astype('int64')

In [None]:
# describe
df.describe()

In [None]:
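# histogram (sns.distplot is deprecated; histplot with kde=True draws a similar plot)
import seaborn as sns
sns.histplot(df.price, kde=True)

In [None]:
# histogram
sns.histplot(df.quantity, kde=True)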

In [None]:
# aggregate price and quantity per shipping date and plot the price total over time
data = df.groupby('shipping_date', as_index=False).agg({'price': 'sum', 'quantity': 'sum'})
sns.lineplot(data=data, x="shipping_date", y="price")

In [None]:
# pandas profiling (the package is now published as ydata-profiling)
from ydata_profiling import ProfileReport
profile = ProfileReport(df, minimal=True)
profile.to_file(path + "data_understanding_report.html")
profile

## Verify data quality
----------

### task

Examine the quality of the data, addressing questions such as: is the
data complete (does it cover all the cases required)? Is it correct or
does it contain errors and if there are errors how common are they?
Are there missing values in the data? If so how are they represented,
where do they occur and how common are they?

### output

List the results of the data quality verification; if quality problems
exist, list possible solutions. Solutions to data quality problems
generally depend heavily on both data and business knowledge.
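Missing values are not always stored as NaN, so it can help to count placeholder tokens as well. The following is a minimal sketch: the token list `MISSING_TOKENS`, the helper `missing_report` and the example frame are assumptions for illustration and should be adapted to the source systems at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder tokens; replace with those used in your data sources.
MISSING_TOKENS = ["", "N/A", "NA", "null", "-"]

def missing_report(df, tokens=MISSING_TOKENS):
    """Count true NaN values and placeholder tokens per column."""
    n_nan = df.isna().sum()
    # Compare the stripped text form of each value with the token list.
    n_token = df.apply(lambda col: col.astype(str).str.strip().isin(tokens).sum())
    report = pd.DataFrame({"nan": n_nan, "placeholder": n_token})
    report["share_missing"] = (report["nan"] + report["placeholder"]) / len(df)
    return report

example = pd.DataFrame({
    "price": [100, np.nan, 250, 300],
    "channel": ["ATM", "N/A", "", "branch"],
})
report = missing_report(example)
print(report)
```

The resulting table shows, per column, how missing values are represented and how common they are, which are the questions raised in the task description.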

In [None]:
# check duplicate row
df[df.duplicated(keep=False)]

In [None]:
# check outlier (rows where any numeric column has |z-score| above z_thr)
import numpy as np
from scipy import stats
z_thr = 3.0
df[(np.abs(stats.zscore(df.select_dtypes(include='number'))) > z_thr).any(axis=1)]

In [None]:
# check missing data
df.isnull().sum()

In [None]:
# check number of unique value
df.nunique()

In [None]:
# check unique value
for feature in df.columns:
    print('------' + feature + '------')
    # drop NaN first so that sorting does not fail on text columns with missing values
    print(np.sort(df[feature].dropna().unique()))
    print()

# Data preparation
---------------------
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modeling tools.

## Select data
----------

### task

Decide on the data to be used for analysis. Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.

### output

List the data to be included/excluded and the reasons for these decisions.

#### select data source

| # | data | included/excluded | reasons | quality | volume/data types |
|:---:|:---|:---|:---|:---|:---|
| 1 |  | included |  |  |  |

#### select attributes & records
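A minimal sketch of selecting attributes (columns) and records (rows) in one step. The frame `orders`, the column names `order_id` and `internal_note`, and the cut-off date are hypothetical and only serve the illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "shipping_date": pd.to_datetime(
        ["2020-01-05", "2020-02-10", "2021-03-15", "2021-04-20"]
    ),
    "price": [100, 250, 80, 120],
    "internal_note": ["a", "b", "c", "d"],
})

# Attribute selection: keep only the columns relevant to the data mining goal.
keep_columns = ["order_id", "shipping_date", "price"]
# Record selection: keep only rows inside the period of interest (assumed cut-off).
in_scope = orders["shipping_date"] >= pd.Timestamp("2021-01-01")

selected = orders.loc[in_scope, keep_columns].reset_index(drop=True)
print(selected)
```

Keeping the column list and the row filter in named variables makes it easier to document the reasons for each inclusion or exclusion alongside the code.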



## Clean data
------------------

### task

Raise the data quality to the level required by the selected analysis techniques. This may involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling.

### output

Describe what decisions and actions were taken to address the data quality problems reported during the verify data quality task of the data understanding phase. Transformations of the data for cleaning purposes and the possible impact on the analysis results should be considered.
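One way to insert suitable defaults while keeping the impact on later analysis traceable is sketched below. The column names, the median as the numeric default and the `"unknown"` category are assumptions for this example; the right choice depends on the data and the business context.

```python
import numpy as np
import pandas as pd

df_example = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 200.0],
    "channel": ["ATM", None, "branch", "ATM"],
})

cleaned = df_example.copy()
# Keep a flag so the effect of imputation on later results can be checked.
cleaned["price_was_missing"] = cleaned["price"].isna()
# Numeric default: the column median (an assumption, not a general rule).
cleaned["price"] = cleaned["price"].fillna(cleaned["price"].median())
# Categorical default: an explicit "unknown" category.
cleaned["channel"] = cleaned["channel"].fillna("unknown")
print(cleaned)
```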


In [None]:
# remove unnecessary columns
columns = [] # remove no columns if array is empty
df = df.drop(columns=columns)

In [None]:
# clean missing data


In [None]:
# clean outlier

## remove rows with negative values in the given column
column = ''  # placeholder: set to the name of the column to check
del_flg = df[column] < 0
df = df[~del_flg]

In [None]:
# clean duplicate records
df = df.drop_duplicates()

## Construct data
----------

### task

This task includes constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes.

### output

#### derived attributes

Derived attributes are new attributes that are constructed from one or more existing attributes in the same record. 

Examples: area = length * width
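In pandas, the area example can be written as a column-wise operation. The frame `rooms` and its values are made up for illustration.

```python
import pandas as pd

rooms = pd.DataFrame({"length": [2.0, 3.5], "width": [4.0, 2.0]})
# Each derived value uses only attributes of the same record.
rooms["area"] = rooms["length"] * rooms["width"]
print(rooms)
```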

#### generated records

Describe the creation of completely new records. 

Example: create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modeling purposes it might make sense to explicitly represent the fact that certain customers made zero purchases.
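The zero-purchase case can be sketched by reindexing a per-customer summary against the full customer list. The customer identifiers and amounts below are hypothetical.

```python
import pandas as pd

# Full list of known customers, including those without any purchase.
all_customers = pd.Index(["c1", "c2", "c3"], name="customer_id")
purchases = pd.DataFrame({
    "customer_id": ["c1", "c1", "c3"],
    "amount": [20, 35, 10],
})

per_customer = (
    purchases.groupby("customer_id")["amount"]
    .agg(n_purchases="count", total_amount="sum")
    .reindex(all_customers, fill_value=0)   # adds a zero-valued record for c2
    .reset_index()
)
print(per_customer)
```

Customer `c2` does not appear in the raw purchases but receives an explicit record with zero purchases.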


In [None]:
# derive attributes


In [None]:
# generate records


## Integrate data
-------------

### task

These are methods whereby information is combined from multiple tables or records to create new records or values.

### output

Merging tables refers to joining together two or more tables that have different information about the same objects. 

Example: a retail chain has one table with information about each store's general characteristics (e.g., floor space, type of mall), another table with summarized sales data (e.g., profit, percent change in sales from previous year) and another with information about the demographics of the surrounding area. Each of these tables contains one record for each store. These tables can be merged together into a new table with one record for each store, combining fields from the source tables.
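A minimal sketch of the store example with pandas. The table contents and the key name `store_id` are assumptions; `validate="one_to_one"` makes the merge fail loudly if a table unexpectedly holds more than one record per store.

```python
import pandas as pd

stores = pd.DataFrame({"store_id": [1, 2, 3], "floor_space": [1200, 800, 1500]})
sales = pd.DataFrame({"store_id": [1, 2, 3], "profit": [50.0, 20.0, 75.0]})
demographics = pd.DataFrame({"store_id": [1, 2, 3], "median_age": [34, 41, 29]})

# Join the three tables on the shared key; the result has one record per store.
merged = (
    stores
    .merge(sales, on="store_id", how="left", validate="one_to_one")
    .merge(demographics, on="store_id", how="left", validate="one_to_one")
)
print(merged)
```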

Merged data also covers aggregations. Aggregation refers to operations where new values are computed by summarizing together information from multiple records and/or tables. For example, converting a table of customer purchases where there is one record for each purchase into a new table where there is one record for each customer, with fields such as number of purchases, average purchase amount, percent of orders charged to credit card, percent of items under promotion, etc.
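The purchase-to-customer aggregation might look like the following sketch. The column names and values are hypothetical.

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c2"],
    "amount": [20.0, 40.0, 10.0, 30.0, 50.0],
    "paid_by_card": [True, False, True, True, False],
})

# One record per customer, summarizing that customer's purchase records.
per_customer = purchases.groupby("customer_id").agg(
    n_purchases=("amount", "count"),
    avg_amount=("amount", "mean"),
    pct_card=("paid_by_card", "mean"),
).reset_index()
# The mean of a boolean column is a share; convert it to a percentage.
per_customer["pct_card"] = per_customer["pct_card"] * 100
print(per_customer)
```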


## Format data
----------------

### task

Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool.

### output

Some tools have requirements on the order of the attributes, such as the first field being a unique identifier for each record or the last field being the outcome field the model is to predict.

It might be important to change the order of the records in the dataset. Perhaps the modeling tool requires that the records be sorted according to the value of the outcome attribute. A common situation is that the records of the dataset are initially ordered in some way but the modeling algorithm needs them to be in a fairly random order. For example, when using neural networks it is generally best for the records to be presented in a random order, although some tools handle this automatically without explicit user intervention.

Additionally, there are purely syntactic changes made to satisfy the requirements of the specific modeling tool. 

Examples: removing commas from within text fields in comma-delimited data files, trimming all values to a maximum of 32 characters.
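The formatting steps described above can be sketched in pandas as follows. The column names, the fixed seed and the example values are assumptions for illustration.

```python
import pandas as pd

df_example = pd.DataFrame({
    "target": [0, 1, 0, 1],
    "comment": ["fast, cheap", "x" * 40, "ok", "late, damaged"],
    "record_id": [101, 102, 103, 104],
})

# Attribute order: identifier first, outcome field last.
formatted = df_example[["record_id", "comment", "target"]]
# Record order: shuffle the rows; the fixed seed keeps the order reproducible.
formatted = formatted.sample(frac=1, random_state=0).reset_index(drop=True)
# Syntactic changes: remove commas and trim text to at most 32 characters.
formatted["comment"] = (
    formatted["comment"].str.replace(",", "", regex=False).str.slice(0, 32)
)
print(formatted)
```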


# Modeling
---------------------
In this phase, various modeling techniques are selected and applied and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.

## Select modeling technique
----------

### task

As the first step in modeling, select the actual modeling technique that is to be used. Whereas you possibly already selected a tool in business understanding, this task refers to the specific modeling technique, e.g., decision tree building with C4.5 or neural network generation with back propagation. If multiple techniques are applied, perform this task for each technique separately.

### output

#### modeling technique

Document the actual modeling technique that is to be used.

#### modeling assumptions

Many modeling techniques make specific assumptions on the data, e.g., all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic etc. Record any such assumptions made.

## Generate test design
------------------

### task

Before we actually build a model, we need to generate a procedure or mechanism to test the model's quality and validity. For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the dataset into a training set and a test set, build the model on the training set and estimate its quality on the separate test set.

### output

Describe the intended plan for training, testing and evaluating the models. A primary component of the plan is to decide how to divide the available dataset into training data, test data and validation datasets.
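A three-way division can be sketched with pandas alone. The 60/20/20 proportions, the seed and the toy frame are assumptions; scikit-learn's `train_test_split`, applied twice, is a common alternative.

```python
import pandas as pd

data = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Assumed proportions: 60% training, 20% validation, 20% test.
train = data.sample(frac=0.6, random_state=0)
rest = data.drop(train.index)
validation = rest.sample(frac=0.5, random_state=0)
test = rest.drop(validation.index)

print(len(train), len(validation), len(test))
```

The three parts are disjoint by construction, because each later part is drawn only from the rows not yet assigned.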


## Build model
----------

### task

Run the modeling tool on the prepared dataset to create one or more models.

### output

#### parameter settings 

With any modeling tool, there are often a large number of parameters that can be adjusted. List the parameters and their chosen value, along with the rationale for the choice of parameter settings. 

#### models 

These are the actual models produced by the modeling tool, not a report.

#### model description

Describe the resultant model. Report on the interpretation of the models and document any difficulties encountered with their meanings.


In [None]:
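# split data (run only one of the two options; if both run, the random split wins)
## closed test: evaluate on the same data used for training
X_train, X_test, y_train, y_test = X, X, y, y

## random split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)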

#### modeling

In [None]:
# SVM
from sklearn.svm import SVC
model = SVC(kernel='linear', random_state=None)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
# LightGBM
!pip install optuna
import optuna.integration.lightgbm as lgb
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
params = {
    "objective" : "multiclass",
    "metric" : "multi_logloss",
    "num_class" : len(y.unique())
}
model = lgb.train(params, lgb_train, valid_sets=lgb_eval)
y_prob = model.predict(X_test, num_iteration=model.best_iteration)
y_pred = np.argmax(y_prob, axis=1)

## Assess model
-------------

### task

The data mining engineer interprets the models according to their domain knowledge, the data mining success criteria and the desired test design. This task overlaps with the subsequent evaluation phase. Whereas the data mining engineer judges the success of the application of modeling and discovery techniques in technical terms, they contact business analysts and domain experts later in order to discuss the data mining results in the business context. Moreover, this task only considers models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project. The data mining engineer tries to rank the models, assessing them according to the evaluation criteria and, as far as possible, taking into account business objectives and business success criteria. In most data mining projects, the data mining engineer applies a single technique more than once or generates data mining results with different alternative techniques. In this task, they also compare all results according to the evaluation criteria.

### output

#### model assessment

Summarize results of this task, list qualities of generated models (e.g., in terms of accuracy) and rank their quality in relation to each other.

#### revised parameter settings

According to the model assessment, revise parameter settings and tune them for the next run in the Build Model task. Iterate model building and assessment until you strongly believe that you found the best model(s). Document all such revisions and assessments.


### classification
----------------


In [None]:
# accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print('{:.1f}%'.format(accuracy * 100))

In [None]:
# confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

#### binary classification
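For the binary case, precision, recall and F1 can be read off a 2x2 confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, class 1 is positive); the helper `binary_metrics` and the example counts are hypothetical.

```python
import numpy as np

def binary_metrics(cm):
    """Precision, recall and F1 from a 2x2 confusion matrix.

    Assumes rows are true classes, columns are predicted classes,
    and class 1 is the positive class.
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 50 true negatives, 10 false positives,
# 5 false negatives, 35 true positives.
cm_example = [[50, 10], [5, 35]]
metrics = binary_metrics(cm_example)
print(metrics)
```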

#### multi-class classification

# Evaluation
---------------------
At this stage in the project you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

# Deployment
---------------------

Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. It often involves applying "live" models within an organization's decision making processes, for example in real-time personalization of Web pages or repeated scoring of marketing databases. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. In many cases it is the customer, not the data analyst, who carries out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions need to be carried out in order to actually make use of the created models.


## Plan deployment
----------

### task

In order to deploy the data mining result(s) into the business, this task takes the evaluation results and derives a strategy for deployment. If a general procedure has been identified to create the relevant model(s), this procedure is documented here for later deployment.

### output

Summarize deployment strategy including necessary steps and how to perform them.

## Plan monitoring & maintenance
------------------

### task

Monitoring and maintenance are important issues if the data mining result becomes part of the day-to-day business and its environment. A careful preparation of a maintenance strategy helps to avoid unnecessarily long periods of incorrect usage of data mining results. In order to monitor the deployment of the data mining result(s), the project needs a detailed plan on the monitoring process. This plan takes into account the specific type of deployment.

### output

Summarize monitoring and maintenance strategy including necessary steps and how to perform them.


## Produce final report
----------

### task

At the end of the project, the project leader and the team write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences (if they have not already been documented as an ongoing activity) or it may be a final and comprehensive presentation of the data mining result(s).

### output

#### final report 

This is the final written report of the data mining engagement. It includes all of the previous deliverables and summarizes and organizes the results.

#### final presentation 

There will also often be a meeting at the conclusion of the project where the results are verbally presented to the customer.




## Review project
-------------

### task

Assess what went right and what went wrong, what was done well and what needs to be improved.

### output

Summarize important experiences gained during the project. For example, pitfalls, misleading approaches or hints for selecting the best suited data mining techniques in similar situations could be part of this documentation. In ideal projects, experience documentation also covers any reports that have been written by individual project members during the project phases and their tasks.



## Evaluate result
----------

### task

Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and seeks to determine if there is some business reason why this model is deficient. Another option of evaluation is to test the model(s) on test applications in the real application if time and budget constraints permit.

Moreover, evaluation also assesses other data mining results generated. Data mining results cover models which are necessarily related to the original business objectives and all other findings which are not necessarily related to the original business objectives but might also unveil additional challenges, information or hints for future directions.

### output

#### assessment of data mining results with respect to business success criteria 

Summarize assessment results in terms of business success criteria including a final statement whether the project already meets the initial business objectives.

#### approved models 

After model assessment with respect to business success criteria, the generated models that meet the selected criteria become approved models.


## Review process
------------------

### task

At this point the resultant model hopefully appears to be satisfactory and to satisfy business needs. It is now appropriate to do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues, e.g., did we correctly build the model? Did we only use attributes that we are allowed to use and that are available for future analyses?

### output

Summarize the process review and highlight activities that have been missed and/or should be repeated. 


## Determine next steps
----------

### task

According to the assessment results and the process review, the project decides how to proceed at this stage. The project needs to decide whether to finish this project and move on to deployment if appropriate or whether to initiate further iterations or set up new data mining projects. This task includes analyses of remaining resources and budget that influence the decisions.

### output

#### list of possible actions 

List the potential further actions along with the reasons for and against each option.

#### Decision 

Describe the decision as to how to proceed along with the rationale.
