### Overview of Machine Learning Workflow Steps

Although all ML projects differ in terms of emphasis, most incorporate some version of the following steps:

1. **Problem Definition**
2. **Data Collection**
3. **Exploratory Data Analysis (EDA)**
4. **Data Cleaning and Preprocessing**
5. **Feature Engineering**
6. **Model Selection**
7. **Model Training**
8. **Model Evaluation**
9. **Model Tuning and Optimization**
10. **Model Deployment**
11. **Monitoring and Maintenance**

Note that these steps are rarely strictly sequential! In most cases you will need to go back and revisit earlier steps as you discover new things about your data and the performance of your model.

Below I offer a bit more detail and motivation for each of these steps. The remainder of this lesson includes a worked example, drawn from the textbook (adapted from the [author's GitHub repository](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb)).


#### 1. Problem Definition
- **What and Why**: Clearly define what you are trying to solve. This involves understanding the business or research question and determining if and how ML can provide a solution. It is useful here to identify specific stakeholders, and then figure out what their needs are.  These might be existing needs, or needs they didn't know they had (e.g., who knew they needed a mobile phone in 1880?).  In either case, it is important to be as specific as possible here about what your goals are!

- **Considerations**: The problem definition guides the choice of data to collect, the ML model to use, and how the solution will be evaluated.  It will also inform your strategies for data analysis and cleaning.

#### 2. Data Collection
- **What and Why**: Gather the data needed to train the model. This could be from existing databases, public datasets, or through new data collection methods like surveys or sensors.
- **Considerations**: The quality and quantity of data collected will directly impact the performance of the ML model. More diverse and comprehensive data can lead to more robust models. Very often, some of the best ML results come from combining different datasets in new ways!
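
To make "combining different datasets" concrete, here is a minimal sketch of a left join between two small tables. The datasets, the `district_id` key, and the column names are all hypothetical; in practice a library call such as pandas `DataFrame.merge` would typically do this work.

```python
# Two hypothetical datasets keyed by the same district identifier.
housing = [
    {"district_id": 1, "median_price": 250},
    {"district_id": 2, "median_price": 180},
    {"district_id": 3, "median_price": 320},
]
schools = [
    {"district_id": 1, "school_rating": 8},
    {"district_id": 3, "school_rating": 9},
]

# Index the second dataset by key for fast lookup.
ratings_by_district = {row["district_id"]: row for row in schools}

# Left join: keep every housing row; the rating is None where no match exists.
combined = []
for row in housing:
    match = ratings_by_district.get(row["district_id"], {})
    combined.append({**row, "school_rating": match.get("school_rating")})

for row in combined:
    print(row)
```

Note that district 2 has no school record, so the join introduces a missing value that a later cleaning step has to handle.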

#### 3. Exploratory Data Analysis (EDA)
- **What and Why**: Analyze the data to understand patterns, anomalies, trends, and relationships within the data.  Perhaps the most important and powerful technique you have here is visualization! You should almost *never* engage in data analysis / ML without some initial attempts to visualize your data to understand distributions and correlations.
- **Considerations**: EDA informs the feature engineering and model selection process. It's also vital for gaining insights and understanding the data's nature.  This stage of analysis may cause you to abandon a project, or go back to prior steps before you sink resources into subsequent stages.
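
Alongside plots, a first numeric pass over the data usually looks at summary statistics and pairwise correlations. The sketch below assumes a tiny made-up dataset (the `rooms` and `price` values are hypothetical) and uses only the standard library so the arithmetic is visible; with real data you would more likely reach for pandas and a plotting library.

```python
import statistics

# Hypothetical toy data: number of rooms and sale price (in thousands).
rooms = [3, 4, 5, 6, 7]
price = [150, 200, 240, 310, 350]

def summarize(values):
    """Return the basic summary statistics usually inspected first."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rooms_summary = summarize(rooms)
r = pearson(rooms, price)
print(rooms_summary)
print(f"correlation(rooms, price) = {r:.3f}")
```

A correlation close to 1 here only says the relationship is roughly linear in this toy sample; a scatter plot is still the better way to spot outliers or non-linear structure.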


#### 4. Data Cleaning and Preprocessing
- **What and Why**: Clean and preprocess the data to a format suitable for analysis. This includes handling missing values, noise, and irrelevant data. This is one of the more elaborate and well-supported stages in ML, and we will cover this in detail.  It includes:
    - Removing outliers
    - Scaling data
    - Dealing with missing values by removal or imputation
    - Rebalancing data for skewed datasets
    - Encoding data in a manner that an ML model can work with
    
- **Considerations**: This step is crucial for ensuring data quality and integrity. Poor data quality can lead to misleading results.
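
Three of the operations listed above (imputation, scaling, and encoding) can be sketched on a toy table as follows. The records and column names are hypothetical, and the code is written in plain Python only to expose each step; scikit-learn's `SimpleImputer`, `StandardScaler`, and `OneHotEncoder` are the usual tools for the same job.

```python
import statistics

# Hypothetical raw records; None marks a missing value.
records = [
    {"income": 40.0, "city": "A"},
    {"income": None, "city": "B"},
    {"income": 60.0, "city": "A"},
    {"income": 80.0, "city": "C"},
]

# 1. Impute missing incomes with the median of the observed values.
observed = [r["income"] for r in records if r["income"] is not None]
median_income = statistics.median(observed)
incomes = [r["income"] if r["income"] is not None else median_income
           for r in records]

# 2. Standardize: subtract the mean and divide by the standard deviation.
mean_income = statistics.mean(incomes)
std_income = statistics.pstdev(incomes)  # population standard deviation
scaled = [(x - mean_income) / std_income for x in incomes]

# 3. One-hot encode the categorical column (one indicator per category).
categories = sorted({r["city"] for r in records})
one_hot = [[1 if r["city"] == c else 0 for c in categories] for r in records]

# Each row: scaled income followed by the category indicators.
features = [[s] + oh for s, oh in zip(scaled, one_hot)]
for row in features:
    print(row)
```

In a real project the median, mean, and standard deviation should be computed on the training data only and then reused on validation and test data, so that no information leaks from the held-out sets.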

#### 5. Feature Engineering
- **What and Why**: Create or select the most relevant features (input variables) from the data to train the model. Feature engineering is much of the "art" of machine learning. This includes turning continuous features into ordinal data (binning), turning categorical data into continuous features, creating "meta-features" that combine existing ones, and many other steps.
- **Considerations**: Effective feature engineering is one of the most powerful methods you have for improving model performance.
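
As a small illustration of two such transformations, the sketch below derives a ratio feature from two raw columns and bins a continuous column into an ordinal code. The district records, the column names, and the bin edges (20 and 40 years) are hypothetical choices made for this example only.

```python
# Hypothetical housing districts; the column names are illustrative only.
districts = [
    {"total_rooms": 880, "households": 126, "median_age": 41},
    {"total_rooms": 7099, "households": 1138, "median_age": 21},
    {"total_rooms": 1467, "households": 177, "median_age": 52},
]

def age_bucket(age):
    """Bin a continuous age into an ordinal code: 0 = new, 1 = mid, 2 = old."""
    if age < 20:
        return 0
    if age < 40:
        return 1
    return 2

engineered = []
for d in districts:
    engineered.append({
        # Ratio feature: combines two raw columns into a per-household quantity.
        "rooms_per_household": d["total_rooms"] / d["households"],
        # Ordinal feature derived from a continuous one.
        "age_bucket": age_bucket(d["median_age"]),
    })

for row in engineered:
    print(row)
```

The raw room count mostly reflects district size, whereas rooms per household says something about the homes themselves, which is why a derived ratio can be more useful to a model than either input alone.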

#### 6. Model Selection
- **What and Why**: Choose an appropriate ML model or models based on the problem type (e.g., classification, regression).  This course should equip you with a rich toolkit of models you can apply.
- **Considerations**: The choice of model depends on the nature of the problem, the size and type of data, and the desired outcome.

#### 7. Model Training
- **What and Why**: Train the model using the prepared dataset. This is where the model learns from the data.
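- **Considerations**: Training requires a careful balance to avoid underfitting (a model too simple to capture the underlying pattern) or overfitting (a model so complex that it fits noise in the training data and generalizes poorly).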
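
The underfitting/overfitting trade-off can be seen with a very small experiment. The sketch below uses a k-nearest-neighbours regressor, chosen here only because it fits in a few lines, on hypothetical noisy data, and compares training error with error on held-out validation points for three values of `k`.

```python
def knn_predict(x_train, y_train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(x_train, y_train, xs, ys, k):
    """Mean squared error of k-NN predictions on the points (xs, ys)."""
    preds = [knn_predict(x_train, y_train, x, k) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Hypothetical data: roughly y = 2x, with noise added to the targets.
x_train = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y_train = [3, -1, 7, 3, 11, 7, 15, 11, 19, 15]
x_val = [0.5, 2.5, 4.5, 6.5, 8.5]
y_val = [2, 4, 10, 12, 18]

# errors[k] = (training MSE, validation MSE)
errors = {}
for k in (1, 2, 10):
    errors[k] = (
        mse(x_train, y_train, x_train, y_train, k),
        mse(x_train, y_train, x_val, y_val, k),
    )
    print(f"k={k:2d}  train MSE={errors[k][0]:6.2f}  val MSE={errors[k][1]:6.2f}")
```

With `k=1` the model memorizes the training set (zero training error) but does worse on the validation points, which is the signature of overfitting. With `k=10` every prediction is the global mean, so both errors are large, which is underfitting. The intermediate setting generalizes best on this toy data.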

#### 8. Model Evaluation
- **What and Why**: Assess the model's performance using metrics appropriate to the task (e.g., accuracy, precision, recall, and F1 score for classification; RMSE or MAE for regression) to determine whether it meets the desired objectives. As with data cleaning, this is a well-supported activity in existing libraries, and it is important to understand the nuances of both measurement and sampling.
- **Considerations**: Critically, evaluation should be done using a _separate test set_ that the model has not seen during training.
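
The following sketch shows both ideas in miniature: holding out a test set before training, and computing common classification metrics from true and predicted labels. The helper names `split_train_test` and `classification_metrics`, the 20% test fraction, the seed, and the labels are all assumptions made for illustration; scikit-learn provides equivalents such as `train_test_split` and the functions in `sklearn.metrics`.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hold out 20% of ten hypothetical row indices before any training happens.
train_rows, test_rows = split_train_test(range(10))

# Hypothetical true labels and model predictions on a held-out test set.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here accuracy looks respectable at 0.7, yet recall shows that the model finds only half of the positive cases, which is why a single metric is rarely enough.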

#### 9. Model Tuning and Optimization
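- **What and Why**: Optimize the model by tuning hyperparameters (settings chosen before training, such as tree depth or regularization strength, rather than values learned from the data) to improve performance.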
- **Considerations**: This step involves finding the right balance between model complexity and model generalizability.
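
One common way to tune a hyperparameter is a grid search scored by cross-validation. The sketch below is a minimal hand-rolled version, assuming a k-nearest-neighbours regressor, hypothetical data, five interleaved folds, and an arbitrary candidate grid; scikit-learn's `GridSearchCV` offers the same idea for real models.

```python
def knn_predict(x_train, y_train, x, k):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def cross_val_mse(xs, ys, k, n_folds=5):
    """Mean validation MSE of k-NN over n_folds interleaved folds."""
    fold_errors = []
    for fold in range(n_folds):
        # Every n_folds-th point (offset by `fold`) is held out for validation.
        val_idx = set(range(fold, len(xs), n_folds))
        x_tr = [x for i, x in enumerate(xs) if i not in val_idx]
        y_tr = [y for i, y in enumerate(ys) if i not in val_idx]
        sq_errs = [(knn_predict(x_tr, y_tr, xs[i], k) - ys[i]) ** 2
                   for i in val_idx]
        fold_errors.append(sum(sq_errs) / len(sq_errs))
    return sum(fold_errors) / n_folds

# Hypothetical data: roughly y = 2x, with noise added to the targets.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [3, -1, 7, 3, 11, 7, 15, 11, 19, 15]

# Candidate values for the hyperparameter k (the "grid").
candidates = [1, 2, 3, 4, 8]
scores = {k: cross_val_mse(xs, ys, k) for k in candidates}
best_k = min(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}  cross-validated MSE={score:.2f}")
print("best k:", best_k)
```

Because the folds are carved out of the training data, the final test set stays untouched during tuning and can still give an honest estimate of performance afterwards.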

#### 10. Model Deployment
- **What and Why**: Deploy the model into a production environment where it can start making predictions on new data.
- **Considerations**: Deployment requires integration with existing systems and ensuring the model performs reliably in real-world conditions.
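
At its simplest, deployment means persisting what the model learned and reloading it wherever predictions are served. The sketch below assumes a hypothetical linear model whose parameters are stored as JSON; the feature names and weights are made up, and for scikit-learn models a serialization tool such as `joblib` or `pickle` is the more common choice.

```python
import json

# Hypothetical "trained" linear model: intercept + sum(weight * feature).
model = {
    "feature_names": ["rooms", "age"],
    "weights": [50.0, -1.5],
    "intercept": 20.0,
}

def predict(params, features):
    """Apply stored linear-model parameters to one feature dictionary."""
    return params["intercept"] + sum(
        w * features[name]
        for name, w in zip(params["feature_names"], params["weights"])
    )

# "Training side": serialize the parameters (in practice, written to a file
# or a model registry rather than kept in a string).
serialized = json.dumps(model)

# "Serving side": reload the parameters and predict on a new observation.
loaded = json.loads(serialized)
new_observation = {"rooms": 4, "age": 10}
print(predict(loaded, new_observation))
```

Storing the feature names with the weights is a deliberate choice: it lets the serving code check that incoming data has the columns the model was trained on.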

#### 11. Monitoring and Maintenance
- **What and Why**: Continuously monitor the model’s performance to ensure it remains effective over time and update or retrain as needed.
- **Considerations**: Model performance can degrade over time as data patterns change (often called data drift or concept drift), so regular checks and updates are required.
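
A very simple monitoring check compares the distribution of a feature in production against what the model saw during training. The sketch below is a crude heuristic, not a standard test: the `drift_score` helper, the data, and the alert threshold of 2.0 are all hypothetical.

```python
import statistics

def drift_score(train_values, live_values):
    """Shift of the live mean from the training mean, in training std units."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma

# Hypothetical feature values seen at training time and in production.
train_income = [48, 52, 50, 49, 51, 50, 53, 47]
live_ok = [49, 51, 50, 52]
live_shifted = [58, 61, 60, 59]

THRESHOLD = 2.0  # hypothetical alert threshold

for name, batch in [("live_ok", live_ok), ("live_shifted", live_shifted)]:
    score = drift_score(train_income, batch)
    status = "ALERT: consider retraining" if score > THRESHOLD else "ok"
    print(f"{name}: drift score = {score:.2f} ({status})")
```

A check like this only catches shifts in the inputs. Where true labels eventually become available, tracking the evaluation metrics themselves over time is the more direct signal.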