# **1 - IMPORTS**

*Prepare notebooks for upcoming sections*

## **Libraries**

*Load required libraries for the project*

In [1]:
import os
from   pathlib import Path

# date
from datetime import datetime

# data manipulation
import numpy      as     np
import pandas     as     pd

# plot and images
import seaborn              as sns
import matplotlib.pyplot    as plt

# modelling
from sklearn.preprocessing   import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, learning_curve
from sklearn.dummy           import DummyClassifier
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import f1_score, confusion_matrix, ConfusionMatrixDisplay

## **Functions**

*Define functions that will be used on the project*

In [2]:
# TO-DO

## **Setup**

*Set some configurations for the whole notebook*

In [3]:
# e.g.
# display floats with thousands separators and 3 decimals (avoids scientific notation)
pd.set_option(
    "display.float_format", "{:,.3f}".format
)

## **Constants**

*Define constants that will be used throughout the project*

In [4]:
# define the path to root of tasks folder
TASKS_PATH = Path.cwd().parent.parent.parent

# check tasks path
# print(f"Tasks path: {TASKS_PATH}")

# define path to data folder inside tasks folder, regardless of operating system
DATA_PATH = os.path.join(TASKS_PATH, "data")

# check data path
# print(f"Data path: {DATA_PATH}")

# check available datasets
os.listdir(DATA_PATH)

['README.md', 'merged_saudi_github_thabtah_toddler_child.csv']

In [5]:
# define seed and random states
seed = 12345

# **2 - DATA EXTRACTION**

*Extract data from data sources*

## **Data dictionary**

*Describe the meaning of every column of the dataset*

**Variable in Dataset** - **Corresponding Q-chat-10-Toddler Features**

- A1 - Does your child look at you when you call his/her name?
- A2 - How easy is it for you to get eye contact with your child? 
- A3 - Does your child point to indicate that s/he wants something? (e.g. a toy that is out of reach) 
- A4 - Does your child point to share interest with you? (e.g. pointing at an interesting sight)
- A5 - Does your child pretend? (e.g. care for dolls, talk on a toy phone) 
- A6 - Does your child follow where you’re looking? 
- A7 - If you or someone else in the family is visibly upset, does your child show signs of wanting to comfort them? (e.g. stroking hair, hugging them)
- A8 - Would you describe your child’s first words as: 
- A9 - Does your child use simple gestures? (e.g. wave goodbye) 
- A10 - Does your child stare at nothing with no apparent purpose? 

## **Data Loading**

*Load data from required files*

**For this first inspection**, we will use the following dataset: 

https://drive.google.com/file/d/15kJI5ucE0TlWqVxZ4SiQ5IVF5cLBYeYY/view

**Note that the dataset may change in the upcoming weeks!**

In [6]:
# load data from csv file
df_extraction = pd.read_csv(filepath_or_buffer = os.path.join(DATA_PATH, "merged_saudi_github_thabtah_toddler_child.csv"), 
                                 low_memory = False) # don't load data in chunks

# inspect shape and column data types
display(
    f'Dataframe shape: {df_extraction.shape}', 
    df_extraction.dtypes, 
    df_extraction.sample(3)
)

'Dataframe shape: (2566, 12)'

A1        int64
A2        int64
A3        int64
A4        int64
A5        int64
A6        int64
A7        int64
A8        int64
A9        int64
A10       int64
Class     int64
src      object
dtype: object

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
319,0,0,0,1,1,0,1,1,0,1,1,Saudi
1574,0,1,0,0,0,0,1,1,0,1,1,thabtah_toddler
1789,1,1,1,1,1,1,1,0,0,1,1,thabtah_toddler


# **3 - DATA DESCRIPTION**

*Perform a general overview about the data*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [7]:
# create a restore point for the previous section dataframe
df_description = df_extraction.copy()

# check dataframe
df_description.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
1607,1,1,1,1,0,1,1,0,1,1,1,thabtah_toddler
2322,1,0,0,1,1,0,1,0,1,0,1,github
1860,0,0,0,0,0,0,1,0,0,1,0,github


## **Column Names**

*Search for misleading or error-prone column names*

In [8]:
# check column names
df_description.columns

Index(['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'Class',
       'src'],
      dtype='object')

## **Data Types**

*Check if the data types in the dataframe make sense according to the data dictionary*

In [9]:
# check data types
df_description.dtypes

A1        int64
A2        int64
A3        int64
A4        int64
A5        int64
A6        int64
A7        int64
A8        int64
A9        int64
A10       int64
Class     int64
src      object
dtype: object

## **Check Duplicated Rows**

*Inspect duplicated rows (based on dataframe granularity) and handle them properly*

In [10]:
# TO-DO
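
A minimal sketch of how duplicated rows could be inspected with pandas. The frame `toy` is hypothetical and only mimics the dataset layout. Whether repeats should be dropped is a judgment call here: with ten binary answers, identical rows may well belong to different children.

```python
import pandas as pd

# hypothetical toy frame that mimics the dataset layout (not real data)
toy = pd.DataFrame(
    {
        "A1":    [0, 1, 0, 1],
        "A2":    [0, 1, 0, 0],
        "Class": [0, 1, 0, 1],
        "src":   ["Saudi", "github", "Saudi", "github"],
    }
)

# flag every row that repeats an earlier row across all columns
dup_mask = toy.duplicated(keep="first")
n_duplicated = int(dup_mask.sum())

# drop the repeats only if identical rows are known to be the same record
toy_dedup = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(f"Duplicated rows: {n_duplicated} of {len(toy)}")
```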

## **Check Missing Values**

*Inspect the number and percentage of missing values per column to decide what to do with them*

In [11]:
# check NAs
df_description.isna().sum()

A1       0
A2       0
A3       0
A4       0
A5       0
A6       0
A7       0
A8       0
A9       0
A10      0
Class    0
src      0
dtype: int64

## **Handle Missing Values**

*Handle missing values in each column*

In [12]:
# TO-DO

# **4 - FEATURE ENGINEERING**

*Create features relevant for the given problem*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [13]:
# create a restore point for the previous section dataframe
df_f_eng = df_description.copy()

# check dataframe
df_f_eng.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
75,0,0,0,0,0,0,0,0,0,1,0,Saudi
230,0,0,0,0,0,0,0,0,0,1,0,Saudi
380,1,1,0,1,1,1,0,1,1,0,1,Saudi


## **Feature Creation**

*Create new features (columns) that can be meaningful for EDA and, especially, machine learning modelling.*

In [14]:
# TO-DO
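
One candidate feature, sketched under the assumption that the ten items are coded so that 1 marks an at-risk answer: the total score across A1 to A10. The column name `qchat_score` and the toy rows are hypothetical.

```python
import pandas as pd

item_cols = [f"A{i}" for i in range(1, 11)]

# hypothetical rows, not taken from the dataset
toy = pd.DataFrame(
    [
        [0, 0, 0, 1, 1, 0, 1, 1, 0, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ],
    columns=item_cols,
)

# total number of at-risk answers per row (ranges from 0 to 10)
toy["qchat_score"] = toy[item_cols].sum(axis=1)

print(toy["qchat_score"].tolist())
```

If the `Class` label was itself derived from such a score, this feature would leak the target, so that is worth checking before using it for modelling.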

# **5 - DATA FILTERING**

*Inspect and remove misleading rows and/or columns*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [15]:
# create a restore point for the previous section dataframe
df_filter = df_f_eng.copy()

# check dataframe
df_filter.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
1279,1,1,1,1,1,1,1,1,1,1,1,thabtah_toddler
1471,1,0,1,1,1,1,1,1,1,1,1,thabtah_toddler
1458,1,1,1,1,1,1,1,0,1,1,1,thabtah_toddler


## **Rows Filtering**

*Remove rows with meaningless (or unimportant) data*

In [16]:
# TO-DO

In [17]:
# e.g. 
# remove (???) rows where age is -1

## **Columns Filtering**

*Remove auxiliary columns or columns that won't be available in the prediction moment*

In [18]:
# TO-DO

In [19]:
# e.g.
# remove "ethnicity" and "sex" columns???

# **6 - Exploratory Data Analysis**

*Explore data for insights*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [20]:
# create a restore point for the previous section dataframe
df_eda = df_filter.copy()

# check dataframe
df_eda.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
2404,0,0,1,0,0,1,1,1,1,1,1,github
2417,1,0,0,0,1,0,1,1,1,0,1,github
541,0,1,0,0,0,0,0,0,0,0,0,UCI_child


## **Univariate Analysis**

*Check the distribution of variables taking into account their data types*

### Numeric Variables

In [21]:
# TO-DO

### Categorical Variables

In [22]:
# TO-DO

### Time Variables

In [23]:
# TO-DO

## **Bivariate Analysis**

*Check the correlation among variables*

### Numeric vs Numeric Variables

In [24]:
# TO-DO

### Categorical vs Categorical Variables

In [25]:
# TO-DO
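
A possible starting point for two categorical columns is a cross-tabulation, for example the class balance per data source. The sketch below uses a hypothetical frame `toy` with made-up rows.

```python
import pandas as pd

# hypothetical rows, not taken from the dataset
toy = pd.DataFrame(
    {
        "Class": [1, 1, 0, 1, 0, 0, 1, 1],
        "src": [
            "Saudi", "Saudi", "github", "github",
            "github", "UCI_child", "UCI_child", "thabtah_toddler",
        ],
    }
)

# absolute counts of each class per source
class_counts = pd.crosstab(toy["src"], toy["Class"])

# same table as row-wise shares, so sources of different size are comparable
class_share = pd.crosstab(toy["src"], toy["Class"], normalize="index")

print(class_counts)
print(class_share.round(3))
```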

### Numerical vs Categorical Variables

In [26]:
# TO-DO

# **7 - Synthetic Data**

*Generate synthetic data for modelling purpose*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [27]:
# create a restore point for the previous section dataframe
df_synth = df_eda.copy()

# check dataframe
df_synth.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
1379,1,1,1,1,1,1,1,1,1,1,1,thabtah_toddler
1768,1,0,0,0,0,0,1,1,0,1,1,thabtah_toddler
346,0,1,1,1,1,1,0,0,0,0,1,Saudi


*Do we need to create synthetic data?*

In [28]:
# TO-DO

*Is it possible to create synthetic data for this problem?*

In [29]:
# TO-DO

*What are the pros and cons of synthetic data?*

In [30]:
# TO-DO

*Techniques*:
- oversampling
- undersampling
- SMOTE
- model loss function weighting

In [31]:
# TO-DO
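
Two of the techniques listed above can be sketched with scikit-learn alone: random oversampling of the minority class and per-class loss weights. The frame `toy` is hypothetical. SMOTE is not shown because it needs an extra package.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

# hypothetical imbalanced toy data: 8 positives, 2 negatives
toy = pd.DataFrame(
    {
        "A1":    [1, 1, 0, 1, 1, 0, 1, 1, 0, 0],
        "Class": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    }
)

majority = toy[toy["Class"] == 1]
minority = toy[toy["Class"] == 0]

# random oversampling: draw minority rows with replacement up to the majority size
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=12345
)
toy_balanced = pd.concat([majority, minority_up], ignore_index=True)

# alternative: keep the data unchanged and weight the loss per class
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy["Class"].to_numpy()
)

print(toy_balanced["Class"].value_counts().to_dict())
print(dict(zip([0, 1], weights)))
```

Resampling would normally be applied to the training split only, so that validation and test data keep the original class balance.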

# **8 - Data Preparation**

*Prepare data for modelling*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [32]:
# create a restore point for the previous section dataframe
df_prep = df_synth.copy()

# check dataframe
df_prep.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
1553,1,1,1,1,1,1,1,1,1,0,1,thabtah_toddler
669,1,0,0,1,1,1,1,0,1,1,1,UCI_child
2066,1,1,1,0,0,1,1,0,1,1,1,github


In [33]:
# define X and y data
X = df_prep[['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10']]
y = df_prep["Class"]

# check X and y
display("X data", X, "y data", y)

'X data'

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10
0,0,0,0,0,0,1,1,1,0,0
1,0,0,1,0,1,0,0,1,0,0
2,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
2561,0,0,0,0,0,0,0,0,0,0
2562,0,0,0,0,0,0,0,0,0,1
2563,1,0,1,1,1,1,1,1,1,1
2564,1,0,0,0,0,0,0,1,0,1


'y data'

0       0
1       0
2       0
3       0
4       0
       ..
2561    0
2562    0
2563    1
2564    0
2565    1
Name: Class, Length: 2566, dtype: int64

## **Train-validation-test split**

*Split data into train, validation and test*

In [34]:
# TO-DO
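
A minimal sketch of a stratified 60/20/20 split using two calls to `train_test_split`. The proportions are an assumption, and `X` and `y` below are random stand-ins with a made-up label rule, not the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 12345
rng = np.random.default_rng(seed)

# hypothetical stand-in for X and y (random binary answers, toy label rule)
item_cols = [f"A{i}" for i in range(1, 11)]
X = pd.DataFrame(rng.integers(0, 2, size=(200, 10)), columns=item_cols)
y = (X.sum(axis=1) > 3).astype(int)

# first cut: hold out 20% of the rows as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=seed
)

# second cut: 25% of the remaining 80% gives 60/20/20 overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=seed
)

print(len(X_train), len(X_val), len(X_test))
```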

## **Scaling**

*Make sure numerical variables have a comparable range of values*

In [35]:
# TO-DO
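
A short sketch of the usual scaling pattern: fit `StandardScaler` on the training split only, then reuse the fitted statistics on validation data. The arrays are made up. Since A1 to A10 are already 0/1, scaling may matter little for this dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical numeric data with two columns on very different ranges
X_train = np.array([[0.0, 10.0], [1.0, 20.0], [1.0, 30.0], [0.0, 40.0]])
X_val = np.array([[1.0, 25.0]])

scaler = StandardScaler()

# learn mean and standard deviation from the training split only
X_train_scaled = scaler.fit_transform(X_train)

# apply the training statistics to validation data (no refitting)
X_val_scaled = scaler.transform(X_val)

print(X_val_scaled)
```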

## **Discretizing**

*Discretize variables with bins*

In [36]:
# TO-DO

## **Categorical encoding**

*Prepare categorical variables for modelling*

In [37]:
# TO-DO

## **Time-cyclic variables**

*Transform cyclic time variables into sin and cos variables*

In [38]:
# TO-DO
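
The current dataset has no time columns, so the following is only an illustration of the sin/cos idea on a hypothetical `hour` column. The encoding places hour 23 next to hour 0 instead of far away from it.

```python
import numpy as np
import pandas as pd

# hypothetical column, not present in the dataset
toy = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

period = 24  # length of one full cycle
angle = 2 * np.pi * toy["hour"] / period

# each hour becomes a point on the unit circle
toy["hour_sin"] = np.sin(angle)
toy["hour_cos"] = np.cos(angle)

print(toy.round(3))
```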

## **Response Variable Transformation**

*Make sure target variable is ready for modelling*

In [39]:
# TO-DO

## **sklearn.pipeline**

*Use sklearn.pipeline.Pipeline instead of "manual" transformations*

In [40]:
# TO-DO
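
A minimal sketch of chaining preprocessing and a model in one `Pipeline`, so the scaler is fitted only on the data passed to `fit`. The step names, the choice of `LogisticRegression` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

seed = 12345
rng = np.random.default_rng(seed)

# hypothetical stand-in for X and y (random binary answers, toy label rule)
item_cols = [f"A{i}" for i in range(1, 11)]
X = pd.DataFrame(rng.integers(0, 2, size=(400, 10)), columns=item_cols)
y = (X.sum(axis=1) > 3).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=seed
)

# every step before the last one transforms the data, the last step predicts
pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000, random_state=seed)),
    ]
)

pipe.fit(X_train, y_train)
val_pred = pipe.predict(X_val)

print(val_pred[:10])
```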

## **Final preparation check**

*Final check after all data preparation steps*

In [41]:
# TO-DO

# **9 - Metrics**

*Define the most relevant metrics for the given problem*

*What will be the model output?* 
- Label or probability?

In [42]:
# TO-DO

*What will be the main model metric? And the auxiliary metrics?*
- Accuracy
- Recall
- Precision
- F1-score
- ROC-AUC
- Precision-recall AUC
- Cross Entropy (log-loss)
- confusion matrix

In [43]:
# TO-DO
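
To compare the candidate metrics listed above, they can be computed side by side on the same predictions. The labels and probabilities below are made up, and the 0.5 threshold is an assumption.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# hypothetical true labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_proba = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]

# turn probabilities into labels with an assumed threshold of 0.5
y_pred = [int(p >= 0.5) for p in y_proba]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # ROC-AUC is threshold-free, so it takes the probabilities
    "roc_auc": roc_auc_score(y_true, y_proba),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```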

# **10 - Feature Selection**

*Select features for modelling*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [44]:
# create a restore point for the previous section dataframe
df_f_selection = df_prep.copy()

# check dataframe
df_f_selection.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
1894,0,0,0,0,0,0,0,0,0,0,0,github
886,1,0,0,0,1,0,1,0,1,0,1,thabtah_toddler
2320,1,1,0,1,1,1,1,0,1,0,1,github


*Techniques:*
- Permutation importance
- Feature coefficients
- Feature importance
- Boruta
- etc

In [45]:
# TO-DO
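
Of the techniques listed above, permutation importance is available directly in scikit-learn. The sketch below uses toy data where, by construction, only A1 to A3 drive the label. That rule is an assumption for illustration and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

seed = 12345
rng = np.random.default_rng(seed)

# hypothetical data: the label depends only on A1, A2 and A3
item_cols = [f"A{i}" for i in range(1, 11)]
X = pd.DataFrame(rng.integers(0, 2, size=(400, 10)), columns=item_cols)
y = (X[["A1", "A2", "A3"]].sum(axis=1) >= 2).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=seed
)

model = LogisticRegression(max_iter=1000, random_state=seed).fit(X_train, y_train)

# shuffle one column at a time and measure the drop in validation F1
result = permutation_importance(
    model, X_val, y_val, scoring="f1", n_repeats=10, random_state=seed
)

importance = pd.Series(result.importances_mean, index=item_cols)
importance = importance.sort_values(ascending=False)

print(importance.round(3))
```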

# **11 - Modelling**

*Use ML models to find patterns in the data*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [46]:
# create a restore point for the previous section dataframe
df_modelling = df_f_selection.copy()

# check dataframe
df_modelling.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
2026,1,1,1,1,1,1,1,0,1,1,1,github
1871,1,1,1,0,1,0,1,1,0,1,1,github
1683,1,1,1,1,1,1,1,1,1,1,1,thabtah_toddler


## **Dummy model**

*The easiest model without any coding*

In [47]:
# TO-DO
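
A dummy model sets the floor that any real model has to beat. A minimal sketch with `DummyClassifier`, assuming the "most_frequent" strategy and made-up labels:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# hypothetical labels; the features are ignored by the dummy model
y_train = np.array([1, 1, 1, 0, 1, 0, 1, 1])
y_val = np.array([1, 0, 1, 1])
X_train = np.zeros((len(y_train), 1))
X_val = np.zeros((len(y_val), 1))

# always predict the most frequent class seen during training
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

dummy_pred = dummy.predict(X_val)
f1_dummy = f1_score(y_val, dummy_pred)

print(f"Dummy F1 on validation: {f1_dummy:.3f}")
```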

## **Baseline model**

*The fastest model to build*

In [48]:
# TO-DO

## **Best model**

*The best model of all*

In [49]:
# TO-DO

# **12 - Hyper-parameter tuning**

*Find out the best hyper-parameters for the ML models*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [50]:
# create a restore point for the previous section dataframe
df_tunning = df_modelling.copy()

# check dataframe
df_tunning.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
1441,0,0,0,1,0,1,1,0,0,0,0,thabtah_toddler
2217,1,0,0,0,1,1,1,1,1,0,1,github
1496,1,0,1,1,1,1,1,0,0,1,1,thabtah_toddler


*What strategy to use?*
- Grid Search
- Random search
- Bayesian search
- Optuna

In [51]:
# TO-DO
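
Of the strategies listed above, grid search is the simplest to sketch with scikit-learn. The grid over `C`, the F1 scoring and the toy data are assumptions. Optuna and Bayesian search need extra packages and are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

seed = 12345
rng = np.random.default_rng(seed)

# hypothetical stand-in for X and y (random binary answers, toy label rule)
item_cols = [f"A{i}" for i in range(1, 11)]
X = pd.DataFrame(rng.integers(0, 2, size=(400, 10)), columns=item_cols)
y = (X.sum(axis=1) > 3).astype(int)

# candidate values for the inverse regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000, random_state=seed),
    param_grid=param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```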

*How to keep track of model performances?*
- mlflow

In [52]:
# TO-DO

# **13 - Error Analysis**

*Perform error analysis to check model performance*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [53]:
# create a restore point for the previous section dataframe
df_error = df_tunning.copy()

# check dataframe
df_error.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
1230,1,1,1,1,0,1,1,1,1,1,1,thabtah_toddler
1824,0,0,0,0,0,0,0,0,0,1,0,thabtah_toddler
1899,0,0,0,0,0,0,0,0,0,1,0,github


## **Cross-validation**

*Perform cross-validation to ensure good model predictions on unseen data*

In [54]:
# TO-DO
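
A minimal sketch of stratified 5-fold cross-validation with `cross_val_score`. The number of folds, the model and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

seed = 12345
rng = np.random.default_rng(seed)

# hypothetical stand-in for X and y (random binary answers, toy label rule)
item_cols = [f"A{i}" for i in range(1, 11)]
X = pd.DataFrame(rng.integers(0, 2, size=(400, 10)), columns=item_cols)
y = (X.sum(axis=1) > 3).astype(int)

# stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

scores = cross_val_score(
    LogisticRegression(max_iter=1000, random_state=seed),
    X,
    y,
    cv=cv,
    scoring="f1",
)

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

A large spread across folds would suggest that a single train-validation split gives an unstable estimate.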

## **Overfitting vs Wellfitting vs Underfitting**

*Make sure the model is not underfitting or overfitting*

In [55]:
# TO-DO

## **Confusion Matrix**

*Inspect the confusion matrix to understand model errors*

In [56]:
# TO-DO
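
A short sketch of reading the four cells of a binary confusion matrix. The labels are made up. In a screening setting, false negatives (missed cases) are often the cell to watch, though that priority is an assumption here.

```python
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# for binary labels [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```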

## **Learning Curves**

*Check the learning curve to understand the impact of dataset size as well as underfitting and overfitting issues*

In [57]:
# TO-DO
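
A minimal sketch of computing learning-curve scores with `learning_curve`, without plotting. The train sizes, the model and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

seed = 12345
rng = np.random.default_rng(seed)

# hypothetical stand-in for X and y (random binary answers, toy label rule)
item_cols = [f"A{i}" for i in range(1, 11)]
X = pd.DataFrame(rng.integers(0, 2, size=(400, 10)), columns=item_cols)
y = (X.sum(axis=1) > 3).astype(int)

# refit the model on growing fractions of each training fold
train_sizes, train_scores, val_scores = learning_curve(
    estimator=LogisticRegression(max_iter=1000, random_state=seed),
    X=X,
    y=y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
    scoring="f1",
)

# one row per train size: mean score over the folds
summary = pd.DataFrame(
    {
        "train_size": train_sizes,
        "train_f1": train_scores.mean(axis=1),
        "val_f1": val_scores.mean(axis=1),
    }
)

print(summary.round(3))
```

A persistent gap between `train_f1` and `val_f1` points towards overfitting, while two low, converged curves point towards underfitting.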

## **Model Calibration**

*Check if model predictions are calibrated*

In [58]:
# TO-DO
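
Calibration can be checked by comparing predicted probabilities with observed frequencies per bin, for example with `calibration_curve` and the Brier score. The probabilities below are made up and the two-bin setting is only for readability.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_proba = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

# per bin: observed share of positives vs mean predicted probability
prob_true, prob_pred = calibration_curve(
    y_true, y_proba, n_bins=2, strategy="uniform"
)

# mean squared difference between probability and outcome (lower is better)
brier = brier_score_loss(y_true, y_proba)

print(prob_true, prob_pred)
print(f"Brier score: {brier:.3f}")
```

For a well-calibrated model, `prob_true` and `prob_pred` stay close to each other in every bin.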

# **14 - Model Performance**

*Evaluate model performance*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [59]:
# create a restore point for the previous section dataframe
df_perform = df_error.copy()

# check dataframe
df_perform.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
263,1,0,0,1,0,1,1,1,0,1,1,Saudi
2536,1,0,0,1,1,0,1,0,1,0,1,github
2517,1,0,0,0,0,0,1,1,0,1,1,github


## **Training performance**

*Evaluate model performance on training dataset*

In [60]:
# TO-DO

## **Validation performance**

*Evaluate model performance on validation dataset*

In [61]:
# TO-DO

## **Test performance**

*Evaluate model performance on test dataset*

In [62]:
# TO-DO

# **15 - Model diagnostics**

*Understand model errors and which features are relevant for predictions*

## **Restore Point**

*Create a checkpoint of the last dataframe from previous section*

In [63]:
# create a restore point for the previous section dataframe
df_diag = df_perform.copy()

# check dataframe
df_diag.sample(3)

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,Class,src
640,1,1,1,1,1,1,0,1,1,1,1,UCI_child
318,1,1,1,1,1,1,1,1,1,0,1,Saudi
983,0,0,0,0,1,0,1,0,1,1,1,thabtah_toddler


## **Feature importance**

*Check the importance of features for model prediction*

In [64]:
# TO-DO

## **Bias analysis**

*Check for bias on model predictions*

In [65]:
# TO-DO

## **Uncertainty analysis**

*Check for uncertainty on model predictions*

In [66]:
# TO-DO

## **SHAP values**

*Inspect shap values for model prediction*

In [67]:
# TO-DO

# **16 - Prepare model for production**

*Convert notebooks (.ipynb) to scripts (.py)*

In [68]:
# TO-DO