# Capstone 1 Springboard
## Predictive Maintenance Utilizing CRISP-DM

### Background

Now that the hype surrounding Data Science has slightly diminished, we can affirm that this is not a drill but rather an exhilarating reality. Governments, large organizations, and start-ups alike have seen and understood the value this discipline brings to the table and are fervently competing for talent and dominance in this space. As of 2019, we find ourselves on the threshold of the Fourth Industrial Revolution: the Real-Time Enterprise! Just like the transitions of the past (the First in 1784, water and steam; the Second in 1870, the first conveyor belt; and the Third in 1969, electronics and information technology), this one will bring with it both challenges and new opportunities. Data stewards in the discipline have realized that the associated costs of data storage have compounded the challenge while widening talent gaps. As their respective organizations continue to ingest voluminous amounts of data, they must become more tactical with the data they create and use in order to sustain or improve Return on Investment (ROI).
Moreover, the approach to advancing insight through analysis of mountains of data must be clear and consistent so that results are reproducible and can be automated. One way to do this is to use the Cross-Industry Standard Process for Data Mining (CRISP-DM). This logical method enables data stewards and stakeholders to clearly understand what, when, where, why, and how they are mining data through six easy-to-follow phases:

![crisp-dm-phases](Data/crisp-dm-phases.png)

>	1. Business understanding
>	2. Data understanding
>	3. Data preparation
>	4. Modeling & Application Development
>	5. Evaluation
>	6. Model Deployment

As talent turns over in organizations, accurately recording the steps taken in each phase keeps the work reproducible. This structure not only helps firms create better business outcomes; it also keeps workflows consistent and reproducible as data science practitioners transition in and out.

### Phase I. Business Understanding

The following project demonstrates how to use CRISP-DM from a practical standpoint: it applies the process to simulated manufacturing data to predict maintenance failures for a theoretical client's manufacturing operation. Predictive maintenance has a clear use case for data mining and analytics, primarily due to breakthroughs in applied machine learning. With the continuous advancements in the Real-Time Enterprise fueled by the Internet of Things (IoT), telemetry, low-cost digital storage, and increasing computing power, among others, the capability to transform voluminous data into insight in this space does not appear to be slowing. The growth of Artificial Intelligence (AI), alongside increasing levels of automation in manufacturing, allows firms to be more resilient in connecting fixed assets while improving productivity through data-driven decisions and insights that were not possible before. As automation continues to augment and take over manufacturing, the response time required to deal with maintenance issues will shrink below the speed at which humans can intervene, requiring sophisticated, automated optimization decisions, especially about maintenance schedules. To cope with this transition, executives and managers in both the public and private sectors must speed up training initiatives that groom new tech talent to manage it with tools and a structured method, or else the cost of doing business will outweigh the profits from its outputs.

**Who might care?** Maintenance managers, operations managers, capital expenditure planners, and organizations such as Exxon Mobil Corporation, General Motors, Ford Motor Company, Apple, Boeing, and the Department of Defense can use such a model to predict the likelihood of equipment failures and allocate resources better. They can then proactively inform their maintainers and/or customers well in advance of potential disruptions in their respective operations. Understanding the probability of material failures will help sustain customer service or level-of-service efforts. From the customer's point of view, it would be very convenient to know whether a supply, production, or other disruption may occur so that they can, in turn, proactively mitigate risk. From the manufacturer's side, such a predictive model would enhance the product base and the performance of the organization's operations. Moreover, there is the possibility of developing an app or other front-end communication channel that customers and/or internal users can consult to understand the likelihood of issues well in advance.

> **Cost and benefits**

> • Every hour of human maintenance labor saved will save the company an average of 75.69, which includes fringe benefits. The company's current maintenance budget covers a staff of 70 maintainers, each overseeing 10 machines (700 machines total) and working 40 hours a week, at an operating expense of 211,932.00 a week. Due to high attrition, the firm would like to augment the operations team with a data scientist, at a fully burdened cost of 125.00 an hour, to help optimize and allocate maintenance labor hours. The operations team has refitted each of the 700 machines with telemetry sensors.
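
The figures above can be checked with a few lines of arithmetic. The sketch below reproduces the weekly operating expense from the stated headcount, hours, and hourly rate, and then estimates how many maintenance hours per week would need to be saved to cover the data scientist's cost. The 40-hour week for the data scientist is an assumption (the source gives only the hourly rate), and all variable names are illustrative.

```python
# Figures taken from the cost-and-benefits statement above.
MAINTAINERS = 70
MACHINES_PER_MAINTAINER = 10
HOURS_PER_WEEK = 40
MAINTAINER_RATE = 75.69        # average cost per labor hour, fringe benefits included
DATA_SCIENTIST_RATE = 125.00   # fully burdened hourly cost

# ASSUMPTION: the data scientist also works a 40-hour week (not stated in the source).
DATA_SCIENTIST_HOURS = 40

total_machines = MAINTAINERS * MACHINES_PER_MAINTAINER
weekly_labor_hours = MAINTAINERS * HOURS_PER_WEEK
weekly_maintenance_cost = round(weekly_labor_hours * MAINTAINER_RATE, 2)
weekly_data_scientist_cost = DATA_SCIENTIST_RATE * DATA_SCIENTIST_HOURS

# Maintenance hours that must be saved each week to offset the data scientist.
break_even_hours = weekly_data_scientist_cost / MAINTAINER_RATE
break_even_share = break_even_hours / weekly_labor_hours

print(f"Machines covered:           {total_machines}")
print(f"Weekly maintenance cost:    {weekly_maintenance_cost:,.2f}")
print(f"Weekly data scientist cost: {weekly_data_scientist_cost:,.2f}")
print(f"Break-even hours saved:     {break_even_hours:.1f} "
      f"({break_even_share:.1%} of labor hours)")
```

Under that assumption, the break-even point is roughly 66 saved labor hours per week, about 2.4% of the 2,800 maintenance hours currently scheduled.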

__Set objectives:__

> 1. In this case, the client has furnished publicly available sensor data on 1,900 machines, collected over a span of 4 years. The machines recorded observations over time that include the following sample features: Device ID; Time Stamp; Warning Flags; Error/Issues; and the Target Variable: Fail/No Fail. Based on the data provided, the client wants us to predict which mix of sensor readings triggers each response: Normal, Fail, or Recovering.

> 2. Are there any leading conditions in the sensor readings before a failure occurs?
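
One simple way to start on the second objective is to compare sensor readings in the window just before each failure against readings taken at other Normal times. The sketch below does this for a single made-up sensor on a single machine. The readings, the two-step window, and the function name `split_by_lead_window` are all hypothetical; the real feature names and values are not given here.

```python
from statistics import mean

# HYPOTHETICAL readings for one machine, in time order: (status, sensor value).
readings = [
    ("Normal", 1.0), ("Normal", 1.1), ("Normal", 0.9), ("Normal", 1.0),
    ("Normal", 1.6), ("Normal", 1.9), ("Fail", 2.4),
    ("Recovering", 1.5), ("Normal", 1.0), ("Normal", 1.1),
]


def split_by_lead_window(rows, window=2):
    """Split Normal readings into those within `window` steps before a Fail
    and all other Normal readings."""
    pre_fail_idx = set()
    for i, (status, _) in enumerate(rows):
        if status == "Fail":
            pre_fail_idx.update(range(max(0, i - window), i))
    pre_fail, baseline = [], []
    for i, (status, value) in enumerate(rows):
        if status != "Normal":
            continue  # only compare Normal readings with each other
        (pre_fail if i in pre_fail_idx else baseline).append(value)
    return pre_fail, baseline


pre_fail, baseline = split_by_lead_window(readings, window=2)
print(f"Mean reading just before a failure: {mean(pre_fail):.2f}")
print(f"Mean reading at other Normal times: {mean(baseline):.2f}")
```

A clear gap between the two means for a given sensor would mark that sensor as a candidate leading indicator, to be confirmed against the real data in the later phases.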


__Constraints, limitations, and assumptions__

> • _Constraint 1:_ The success of the model depends on how well material failures can be predicted from the historical observations in the curated data. For instance, after analyzing X months of observations for machines with consistent utilization hours, the model's predictions assume that future occurrences will be consistent with, and similar to, the recorded observations.

> • _Limitation 1:_ To predict the likelihood of a future material failure for a machine after X running hours, we would need to know how many hours the machine is expected to run in the future, which we do not know "accurately" today.

> • _Assumption 1:_ The classification model will be able to communicate probabilities at various run levels to address this limitation.
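
To illustrate what "probabilities at various run levels" could look like, the sketch below groups made-up observations into run-hour buckets and reports the observed share of failures in each. This is a simple empirical stand-in, not the project's classification model; the observations, the 500-hour bucket size, and the function name are hypothetical.

```python
from collections import defaultdict

BUCKET_SIZE = 500  # ASSUMPTION: width of each run-level bucket, in run hours

# HYPOTHETICAL observations: (cumulative run hours, failed?). Illustrative only.
observations = [
    (120, False), (340, False), (480, False), (510, False),
    (760, True), (820, False), (990, False), (1100, True),
    (1250, True), (1400, False), (1450, True), (1490, True),
]


def failure_rate_by_run_level(rows, bucket_size=BUCKET_SIZE):
    """Return {bucket start hour: observed share of failures in that bucket}."""
    counts = defaultdict(lambda: [0, 0])  # bucket start -> [failures, total]
    for hours, failed in rows:
        bucket = (hours // bucket_size) * bucket_size
        counts[bucket][0] += int(failed)
        counts[bucket][1] += 1
    return {b: fails / total for b, (fails, total) in sorted(counts.items())}


rates = failure_rate_by_run_level(observations)
for start, rate in rates.items():
    end = start + BUCKET_SIZE - 1
    print(f"{start:>5}-{end} run hours: estimated failure probability {rate:.2f}")
```

Because future run hours are not known precisely, a table like this lets stakeholders read off the estimated risk for whichever run level they expect. A fitted classifier's predicted probabilities would replace the empirical shares.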

__Risk and contingencies__
                
> • _Risk 1 Scheduling: What if the project takes longer than anticipated?_ The consultant will monitor work performance and proactively partner with the stakeholder to seek additional time or alternative recommendations.

> • _Risk 2 Financial: What if the project sponsor encounters budgetary problems?_ The consultant will monitor work performance and proactively partner with the stakeholder to seek alternative recommendations on trade-space opportunities.

> • _Risk 3 Data: What if the data are of poor quality or coverage?_ The consultant will monitor work performance and proactively partner with the stakeholder to seek alternative recommendations on trade-space opportunities, including a scope change to acquire additional data applicable to this project.

> • _Risk 4 Results: What if the initial results are less dramatic than expected?_ The consultant will monitor work performance and project phases and proactively partner with the stakeholder to evaluate go/no-go impasses, so that all partners are in agreement before moving on to the next phase of CRISP-DM. The stakeholder will sign off on each phase to confirm that all concerned parties agree on progress.


__Terminology__
                
> • All definitions used in this project can be found [here](Data/Data_Mining_Glossary.csv).

        
**Business/Data mining goals** describe the intended outputs of the project that enable the achievement of the business objectives.
> a.    Business success criteria
> b.    Data mining success criteria

**Project plan** describes the intended plan for achieving the data mining goals and thereby the business goals. The plan specifies the steps to be performed during the rest of the project, including the initial selection of tools and techniques.
> a.    Project plan:

This overview will be structured using the CRISP-DM framework: 

**Phase I. Business Understanding:**
_Current situation assessment_
> a.    Set objectives
> b.    Inventory of resources
> c.    Requirements, assumptions and constraints
> d.    Risks and contingencies
> e.    Terminology
> f.    Cost and benefits
        
**Business/Data mining goals**
> a.    Business success criteria
> b.    Data mining success criteria
        
_Project plan_
> a.    Project plan
> b.    Initial assessment of tools and techniques  

*Review and approval point*  


**Phase II. Data Understanding:**
_Data collection_
> a.    Initial data collection report

_Data Description_
> a.    Data description report

_Data Exploration_
> a.    Data exploration report

_Data Quality_
> a.    Data quality report

_Review and approval point_


**Phase III. Data Preparation:** 
> a.    Rationale for inclusion/exclusion
> b.    Data cleaning report
> c.    Derived attributes
> d.    Generated records
> e.    Merged data
> f.    Aggregations

_Review and approval point_


**Phase IV. Data Modeling & Application Development:**
_Technique_
> a.    Modeling technique
> b.    Modeling assumptions

_Test design_
> a.    Test design

_Build model_
> a.    Parameter settings
> b.    Models
        
_Model assessment_
> a.    Model assessment
> b.    Revised parameter settings

_Review and approval point_


**Phase V. Model Evaluation:**
_Evaluation_
> a.    Assessment of data mining results
> b.    Approved model results

_Review process_
> a.    Review of process

_Next steps_
> a.    List of courses of action (COAs)
> b.    Final Decision/Selection of COA

_Review and approval point_

**Phase VI. Model Deployment and Communication:**
_Plan deployment_
> a.    Plan deployment

_Plan monitoring and maintenance_
> a.    Monitoring and maintenance plan
        
_Produce final report_
> a.    Final report
> b.    Final presentation

_Review/Archiving of process_
> a.    Experience documentation


***
__Initial assessment of resources and tools:__

> 1.  __Resource:__ Data Scientist, Alfred Hull
> 2.  __Resource:__ The data in this overview is simulated, at about 37 megabytes (MB), to demonstrate how machines performed over a time frame of five months: 31-March-2018 to 31-August-2018. The data forms a 222k by 55 matrix: roughly 222,000 observations by 55 fields.
> 3.  __Tool:__ The computing resource used for this example was a Microsoft Surface Book 2 for Business (15" display, 256 GB storage, Intel Core i7 processor, NVIDIA GeForce GTX graphics) running Windows 10 Pro.
> 4.  __Tool:__ The software used for these examples was Windows 10, Anaconda, IPython, and JupyterLab.


__Reference:__

> 1.  GitHub repository: https://github.com/ahull002/wrangling_csv.git
> 2.  folium documentation: https://python-visualization.github.io/folium/