# How to solve any machine learning problem using the PPDAC problem solving cycle

### Why? - is this important

Many companies hiring data scientists require candidates to have a PhD or master's degree in a STEM (Science, Technology, Engineering and Maths) subject. In my opinion, this is largely because graduates in these subjects have had rigorous training in deriving mathematical theory and solving complex problems. While these skills are undoubtedly applicable to data science, they are just a small part of the modern data scientist's toolbox, with arguably communication and...


Data science projects are inherently research-based, given that a large chunk of the work is exploring data, building models and seeing what works and what doesn't. However, if the project is not clearly defined, it can lead to lots of work being done with no results or product to show for your hard work.

Data science has been adopted at an exponential rate over the last 15 years and, as a result, there is a large amount of data illiteracy in key decision-making parts of businesses. This leads to misaligned expectations of what data science is, what problems it can solve and what it takes to implement at scale. The result can be angry executives disillusioned with data science and frustrated developers unable to complete the work they were hired to do.

### Who? - should care

To make sure expectations are managed, it is the job of a data scientist to clearly define a project so that its exact purpose is set unambiguously and the key objectives needed to achieve it are agreed.


### What? - is the solution I offer

To demonstrate how these issues can be avoided, I'm going to walk through the PPDAC (Problem, Plan, Data, Analysis, Conclusion) method as outlined in The Art of Statistics by David Spiegelhalter. This process complements the iterative nature of the data science life cycle and shows that, by taking a problem-first approach, we can create results for any data science project.


## Problem - What are we trying to solve?

Understanding and defining the problem. How do we go about answering this question? 

Before diving into the data, it's important to frame your problem as one that can be solved using data science - specifically machine learning. The difficulty of your task can vary wildly depending on how you frame it. 

For example, take a dataset from the World Bank that details all contract awards financed by The World Bank under Investment Project Financing (IPF) operations. Suppose you work for a contractor who has recently heard about machine learning and is keen to adopt it within the business to help increase revenue. There are a number of things to scope out. Firstly, what is the problem here? Your boss wants to increase revenue, and a good way to do that is to analyse contract prices to assess how we could save costs by increasing pricing accuracy. Secondly, what has been done before? Is there work already done to price contracts? If so, can that solution be used or built upon? Thirdly, can this be defined as a machine learning problem? In this case, because we are trying to estimate price, a continuous variable, we can frame this as a regression problem.

## Plan - How are we going to solve it? 

What to measure and how? Study design? recording? collecting?

It is easy to skip the planning stage of a project and go straight to exploring the data and building models, but there are several things to consider before getting to the fun part.

1. Project objective. It is vital to agree on what the objective is so everyone involved, from developers to stakeholders, understands what we're trying to accomplish. In this use case, we want to estimate contract prices accurately within a 5% margin of error (this figure can vary depending on the project), and understand the most important features in estimating contract prices.

2. Choose a metric to optimise. Now that we know our objectives, we can choose a performance metric to optimise. As this is a regression task we will use either Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE) - depending on the distribution of contract prices. 

3. Data collection. Collecting the right data is arguably the most important part of developing predictive models, so identifying the data needed, the data sources and any potential limitations early will prevent future delays. For the purpose of this article, we already have a dataset, but in real projects it is vital to acquire as much data as possible and consult with industry experts to understand what features are important, and to address how to clean the data.

4. Resource management. Depending on the size of the task the amount of compute power, personnel and time will vary. Therefore, identifying these issues early will save time downstream. 

5. Methodology. This stems from defining the task and the metric we want to optimise allowing us to identify the tools needed for the project to be successful. Considerations include:
- What Python packages to use?
- Do we need to use ML? Is there a simpler method? 
- What model training method will we use? 
- What models should we utilise?
- How should we interpret feature importances?
- How will we communicate our results? 

6. Identifying potential risks. As suggested previously, risks can arise around insufficient data, lack of resources to complete the task, and poor model performance. It is important to flag these early so stakeholders are aware of them and the scope of the project stays realistic.

7. Ethical considerations. Perhaps the most important question when using data and machine learning is to ask whether you should at all. It's important to consider the consequences of your analysis and whether there will be any negative impact on society. Hence, removing features that introduce bias such as gender, using appropriate modelling techniques to remove your own bias, and creating transparency about the model and data used are some of the essential considerations when planning for your use case.

8. Reliability & maintainability. The reliability of a model is paramount: if your prediction is wrong, how do you know? Good code documentation and reproducibility mean agreeing on the coding standards everyone in your team should adhere to, and following data and model versioning to increase the traceability of your analysis.
Examples of coding practices include:
- Documentation - following PEP 257, adding a README.md to the project repo, adding docstrings for modules, classes and functions.
- Coding Standards - using formatters such as Black, consistent naming conventions (I use snake_case for functions and PascalCase for classes), using type hints, keeping functions small by focusing them on one specific thing.
- Reproducibility - using virtual environments (I use Poetry to manage my environments and packages), setting a random seed for reproducing results, using logging to track code execution for easier debugging, and using configuration files for project management.
- Version Control - using Git and GitHub to track changes and store code, tracking experiments using MLflow or Weights & Biases, versioning data using DVC (Data Version Control), making small code commits to easily track changes, and creating Git branches to develop code before reviewing and merging into main (usually done as part of an established code review process).

9. Scalability & adaptability. This point extends beyond the initial research phase of the project to plan for how the project could scale in a production environment. The key considerations here are which technologies we can use to manage scalable compute resources and document model artifacts, and whether model improvements can be made in production without server downtime, so the user experience isn't impacted.
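To make the metric choice in point 2 more concrete, here is a minimal sketch of RMSE and MAE in plain Python. The contract prices are made-up values for illustration, not figures from the World Bank dataset. It shows how a single large miss inflates RMSE far more than MAE, which is why the distribution of contract prices matters when choosing between them.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root Mean Squared Error: errors are squared, so large misses dominate."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean Absolute Error: every unit of error counts equally."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# hypothetical contract prices in USD (illustrative values only)
actual = [5_000.0, 10_000.0, 20_000.0, 100_000.0]
# the last prediction is a deliberately large miss
predicted = [5_500.0, 9_000.0, 21_000.0, 60_000.0]

print(f"MAE:  {mae(actual, predicted):,.2f}")
print(f"RMSE: {rmse(actual, predicted):,.2f}")
```

If contract prices are heavily skewed, with a few very large contracts, RMSE will mostly reflect how well those few are priced, whereas MAE treats every contract's error the same.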

## Data - What data do we need?

Collection. Management. Cleaning

Now our data set has been identified, we need to understand how we intend to collect, store and process the data. This process is commonly referred to as ETL (Extract, Transform, Load). This is where good coding skills in Python and SQL come into play to automate the preparation of data for downstream analytics. 

[Figure placeholder: diagram of the ETL process]

1. Extract. Collecting the data is usually straightforward once the sources have been identified. The World Bank data source can be accessed via a downloadable CSV file or an API. Once we've collected the raw data, we need to validate it and make sure it conforms to our business requirements. If it doesn't, notifications should be sent to the sources so they can make appropriate changes. The accepted data can be uploaded to our database to keep track of the raw data, or it can be sent through to the transform stage.

2. Transform. In a research setting, this is often done as part of the data exploration stage, where we investigate the data for signal to see if it can be used to solve our problem. In a production environment, because vast amounts of data can be generated, the cleaning of data needs to be automated. Data cleaning can involve joining datasets, recoding values, standardising features, imputation, feature mutation and introducing business logic.

3. Load. The transformed data is then written out to a file, a database, a data warehouse such as Google BigQuery or Snowflake, or object storage such as Amazon S3.

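```python
import httpx
import pandas as pd

# World Bank procurement contracts endpoint, limited to 50k rows
url = 'https://finances.worldbank.org/resource/kdui-wcs3.json?$limit=50000'
resp = httpx.get(url, timeout=60)
resp.raise_for_status()  # fail early on HTTP errors

# load the JSON records into a dataframe and keep a raw copy on disk
df = pd.DataFrame(resp.json())
df.to_parquet('../data/raw/raw_world_bank_procurement_contracts.parquet')

df.head()
```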

| | as_of_date | fiscal_year | region | borrower_country | borrower_country_code | project_id | project_name | major_sector | procurement_category | procurement_method | wb_contract_number | contract_description | borrower_contract_reference_number | contract_signing_date | supplier_id | supplier | supplier_country | supplier_country_code | supplier_contract_amount_usd | rvw_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-09-07T00:00:00.000 | 2024 | EAST ASIA AND PACIFIC | Samoa | WS | P173920 | Samoa COVID-19 Emergency Response Project | Health | Goods | Request for Quotations | 1794856 | Procurement of 3 laptops to support JEE proces... | WS-MOH-414434-GO-RFQ | 2024-04-08T00:00:00.000 | 426026 | CONNECTIT | Samoa | WS | 4964.69 | Post |
| 1 | 2024-09-07T00:00:00.000 | 2024 | LATIN AMERICA AND CARIBBEAN | Mexico | MX | P159835 | Mexico: Sustainable Productive Landscapes Project | Agriculture, Fishing and Forestry;Industry, Tr... | Consultant Services | Direct Selection | 1794855 | Consultor?a para Capacitaci?n, Evaluaci?n y Ce... | MX-SEMARNAT-412622-CS-CDS | 2024-03-25T00:00:00.000 | 872867 | INSTITUTO NACIONAL PARA EL DESARROLLO DE CAPAC... | Mexico | MX | 23872.33 | Post |
| 2 | 2024-09-07T00:00:00.000 | 2025 | Western and Central Africa | Congo, Republic of | CG | P175592 | Congo Digital Acceleration Project | Information and Communications Technologies | Goods | Request for Quotations | 1794854 | l'acquisition des ?quipement informatique pour... | N?066/MPTEN/PATN-UCP/F/2024 | 2024-08-13T00:00:00.000 | 888585 | BUROTOP IRIS | Congo, Republic of | CG | 113143.94 | Post |
| 3 | 2024-09-07T00:00:00.000 | 2025 | LATIN AMERICA AND CARIBBEAN | Peru | PE | P179037 | Irrigation for Climate Resilient Agriculture | Agriculture, Fishing and Forestry | Consultant Services | Individual Consultant Selection | 1794853 | ESPECIALISTA FINANCIERO (ADMINISTRADOR) | PE-PSI-434019-CS-INDV | 2024-09-03T00:00:00.000 | 177724 | GIGIMA ROCIO ROSAS LEON | Peru | PE | 9604.8 | Post |
| 4 | 2024-09-07T00:00:00.000 | 2025 | EAST ASIA AND PACIFIC | Kiribati | KI | P165821 | Kiribati: Pacific Islands Regional Oceanscape ... | Agriculture, Fishing and Forestry | Consultant Services | Direct Selection | 1794852 | Legal adviser CFD -CBFM reviewing fisheries ma... | KI-MFMRD-423413-CS-CDS | 2024-09-02T00:00:00.000 | 888855 | RURIA ITERAERA | Kiribati | KI | 6099.3 | Post |
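The validation step described under Extract could look something like the sketch below. The list of required columns and the rule that the contract amount must be numeric are assumptions I've made for illustration; real business requirements would come from stakeholders. `validate_raw` is a hypothetical helper, not part of any library.

```python
import pandas as pd

# hypothetical business rule: columns every raw extract must contain
REQUIRED_COLUMNS = ["project_id", "borrower_country", "supplier_contract_amount_usd"]


def validate_raw(df: pd.DataFrame, required: list[str] = REQUIRED_COLUMNS) -> list[str]:
    """Return a list of readable problems; an empty list means the extract passed."""
    problems = []

    # every required column must be present
    missing = [col for col in required if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # the contract amount arrives as text, so check that it parses as a number
    if "supplier_contract_amount_usd" in df.columns:
        amounts = pd.to_numeric(df["supplier_contract_amount_usd"], errors="coerce")
        n_bad = int(amounts.isna().sum())
        if n_bad:
            problems.append(f"{n_bad} rows have a missing or non-numeric contract amount")

    return problems


# tiny made-up extract: the second amount is deliberately invalid
sample = pd.DataFrame(
    {
        "project_id": ["P173920", "P159835"],
        "borrower_country": ["Samoa", "Mexico"],
        "supplier_contract_amount_usd": ["4964.69", "not available"],
    }
)
print(validate_raw(sample))
```

Returning a list of problems, rather than raising on the first one, makes it easier to send a single notification back to the data source.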


## Analysis - What data science techniques can we use?

construct tables and graphs, look for patterns, construct hypotheses.

- Split data before investigating - train / validation / test split
- study the percentage of missing values, data types, noisiness and type of noise, data distributions, identify the useful relationships between variables.
- clean data (if needed)
- feature engineering with good domain knowledge
- add transformations to data if necessary
- encode categorical features
- aggregate features to capture non-linear relationships
- write out processed data
- build a quick statistical model to assess features and suitability of data for problem - don't do this
- shortlist some models to train data on - automate this process
- compare training performance of each model
- analyse the most significant variables for each algorithm
- analyse the types of errors made by each method
- quickly go through a round of feature selection / engineering
- shortlist 3 to 5 models with uncorrelated errors - make it better for ensemble model building
- perform some hyperparameter tuning using Randomised search on validation set
- create ensemble of best uncorrelated models for final prediction on test set
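The first step in the list above, splitting the data before investigating it, can be sketched with pandas alone. The 70/15/15 proportions and the seed are assumptions for illustration, and `train_val_test_split` is a hypothetical helper. scikit-learn's `train_test_split` is a common alternative.

```python
import pandas as pd


def train_val_test_split(
    df: pd.DataFrame,
    val_frac: float = 0.15,
    test_frac: float = 0.15,
    seed: int = 42,
) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """Shuffle the rows once, then slice them into train, validation and test sets."""
    # a fixed seed makes the shuffle, and therefore the split, reproducible
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)

    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test : n_test + n_val]
    train = shuffled.iloc[n_test + n_val :]
    return train, val, test


# toy frame standing in for the contracts data
toy = pd.DataFrame({"contract_id": range(100), "amount": range(100)})
train, val, test = train_val_test_split(toy)
print(len(train), len(val), len(test))
```

The test set is then put aside until the final prediction, while the validation set is used for model comparison and hyperparameter tuning.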

```python
df.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype 
---  ------                              --------------  ----- 
 0   as_of_date                          50000 non-null  object
 1   fiscal_year                         50000 non-null  object
 2   region                              50000 non-null  object
 3   borrower_country                    50000 non-null  object
 4   borrower_country_code               45069 non-null  object
 5   project_id                          50000 non-null  object
 6   project_name                        50000 non-null  object
 7   major_sector                        50000 non-null  object
 8   procurement_category                50000 non-null  object
 9   procurement_method                  50000 non-null  object
 10  wb_contract_number                  50000 non-null  object
 11  contract_description                50000 non-null  ob
```

We can see that all dtypes are currently labelled as 'object' which means some types will need to be changed before we do modelling.
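As a rough sketch of what changing the types might involve, the function below converts a handful of columns. The choice of which columns become dates, numbers or categories is my assumption based on the sample rows shown earlier, and `convert_dtypes` is a hypothetical helper name. The sample values are taken from those rows.

```python
import pandas as pd


def convert_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with date, numeric and categorical columns given proper dtypes."""
    out = df.copy()

    # timestamps arrive as ISO 8601 strings such as '2024-04-08T00:00:00.000'
    for col in ["as_of_date", "contract_signing_date"]:
        out[col] = pd.to_datetime(out[col], errors="coerce")

    # numbers arrive as strings; values that cannot be parsed become missing
    out["fiscal_year"] = pd.to_numeric(out["fiscal_year"], errors="coerce").astype("Int64")
    out["supplier_contract_amount_usd"] = pd.to_numeric(
        out["supplier_contract_amount_usd"], errors="coerce"
    )

    # low-cardinality text columns are cheaper to store as categories
    for col in ["region", "procurement_category", "procurement_method"]:
        out[col] = out[col].astype("category")

    return out


# two rows copied from the head of the dataset, with only the columns we convert
sample = pd.DataFrame(
    {
        "as_of_date": ["2024-09-07T00:00:00.000", "2024-09-07T00:00:00.000"],
        "fiscal_year": ["2024", "2024"],
        "region": ["EAST ASIA AND PACIFIC", "LATIN AMERICA AND CARIBBEAN"],
        "procurement_category": ["Goods", "Consultant Services"],
        "procurement_method": ["Request for Quotations", "Direct Selection"],
        "contract_signing_date": ["2024-04-08T00:00:00.000", "2024-03-25T00:00:00.000"],
        "supplier_contract_amount_usd": ["4964.69", "23872.33"],
    }
)
typed = convert_dtypes(sample)
print(typed.dtypes)
```

Using `errors="coerce"` turns unparseable values into missing values instead of raising, so it's worth re-checking the null counts after conversion.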

To get a better look at the percentage of null values and other useful stats, we can create our own function.

```python
def describe_data(df: pd.DataFrame) -> pd.DataFrame:
    """Custom function to describe the data including null values, data types, and
    unique values.

    Args:
        df (pd.DataFrame): pandas dataframe

    Returns:
        pd.DataFrame: descriptive stats of the dataframe, one row per column
    """
    # count null values per column and express them as a percentage of rows
    null_counts = df.isnull().sum()
    null_percentages = (null_counts / len(df) * 100).round(2)

    # count unique values per column and express them as a percentage of rows
    unique_vals = df.nunique()
    unique_vals_pct = (unique_vals / len(df) * 100).round(2)

    # combine everything side by side, aligned on the column names
    return pd.concat(
        [df.dtypes, null_counts, null_percentages, unique_vals, unique_vals_pct],
        axis=1,
        keys=[
            "Data Types",
            "Null Counts",
            "Null Counts (%)",
            "Unique Values",
            "Unique Values (%)",
        ],
    )


describe_data(df)
```

| Column | Data Types | Null Counts | Null Counts (%) | Unique Values | Unique Values (%) |
|---|---|---|---|---|---|
| as_of_date | object | 0 | 0.0 | 1 | 0.0 |
| fiscal_year | object | 0 | 0.0 | 9 | 0.02 |
| region | object | 0 | 0.0 | 8 | 0.02 |
| borrower_country | object | 0 | 0.0 | 139 | 0.28 |
| borrower_country_code | object | 4931 | 9.86 | 123 | 0.25 |
| project_id | object | 0 | 0.0 | 1586 | 3.17 |
| project_name | object | 0 | 0.0 | 1577 | 3.15 |
| major_sector | object | 0 | 0.0 | 177 | 0.35 |
| procurement_category | object | 0 | 0.0 | 4 | 0.01 |
| procurement_method | object | 0 | 0.0 | 13 | 0.03 |


We can see that approximately 10% of borrower country codes are missing, which is strange given that borrower country is never null. This suggests that we should use the borrower country column instead.
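One way to dig into this is to count which borrower countries the missing codes belong to. The helper below is a hypothetical sketch, and the toy rows are made up, so the real pattern in the dataset may look different.

```python
import pandas as pd


def countries_missing_code(df: pd.DataFrame) -> pd.Series:
    """Count rows with a missing borrower_country_code, grouped by borrower_country."""
    missing = df["borrower_country_code"].isna()
    return df.loc[missing, "borrower_country"].value_counts()


# made-up rows for illustration only
toy = pd.DataFrame(
    {
        "borrower_country": ["Samoa", "Peru", "Country A", "Country A", "Country B"],
        "borrower_country_code": ["WS", "PE", None, None, None],
    }
)
print(countries_missing_code(toy))
```

If the missing codes turn out to be concentrated in a few countries, that points to a data-entry or encoding issue rather than values missing at random.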

### Investigate Distributions

It's important to get a sense of the distributions of our data so that we can decide how to split it before investigating further. Splitting before exploring stops us from making modelling decisions based on patterns in the test data, a problem known as data snooping bias.

```python
# categorical columns whose distributions we want to inspect
cat_cols = ['region', 'procurement_category', 'procurement_method']
```

```python
import matplotlib.pyplot as plt
import seaborn as sns


def cat_distribution(df: pd.DataFrame) -> None:
    """Plot the distribution of the categorical columns.

    Args:
        df (pd.DataFrame): pandas dataframe
    """

    def plot(col: str, ax: plt.Axes) -> plt.Axes:
        """Create a barplot of the value counts of one column."""
        counts = df[col].value_counts()
        return sns.barplot(x=counts.index, y=counts.values, ax=ax)

    # configure the plot: two panels on top, one wide panel below
    fig = plt.figure(figsize=(12, 7))
    ax1 = fig.add_subplot(221)
    ax2 = fig.add_subplot(222)
    ax3 = fig.add_subplot(212)

    # plot the data
    plot('region', ax1)
    ax1.set_title('Distribution of Region', fontsize=14)
    ax1.set_xlabel('Region', fontsize=12)

    plot('procurement_category', ax2)
    ax2.set_title('Distribution of Procurement Category', fontsize=14)
    ax2.set_xlabel('Procurement Category', fontsize=12)

    plot('procurement_method', ax3)
    ax3.set_title('Distribution of Procurement Method', fontsize=14)
    ax3.set_xlabel('Procurement Method', fontsize=12)

    # rotate the tick labels so long category names stay readable
    for ax in (ax1, ax2, ax3):
        ax.tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.show()


cat_distribution(df)
```

## Conclusion - What insights can be drawn?

interpretation, conclusions, new ideas, communication

Arguably, communicating results is the hardest part, and possibly the most underemphasised part, of data science. Being able to explain to a non-technical audience what you have done, in a way that relates the findings back to the overall business objective, is the whole point of having data science in a business in the first place.

Key things to articulate:
- Recap on what the business requirement for data science work is.
- Explain why your work has or has not achieved the objective.
- Highlight the results and reasons for why you got the result you did.
- Outline the assumptions you made about the data and limitations of it.
- Visualise the key findings including model performance, the most important features, and other interesting visualisations that may provide business intelligence. 
- Express questions that the work has raised and offer avenues for future work, including gathering more data or using different techniques, or say so if the work is actually not suitable for machine learning and another approach might be more beneficial.