![image](ml_workflow.png)


## 1. Problem definition

### 1.1 Types of Machine Learning Problems

There are three main types of machine learning problems:

1. **Supervised Learning**: The algorithm is trained on a labeled dataset, where each input is paired with the correct output. The goal is to learn a mapping function that predicts the output for new inputs.

2. **Unsupervised Learning**: The algorithm is trained on an unlabeled dataset, with no correct outputs provided. The goal is to discover the underlying structure or patterns in the data.

3. **Reinforcement Learning**: The algorithm (the agent) learns by interacting with an environment. The goal is to learn a policy that maximizes the cumulative reward over time.

**Transfer Learning** is often mentioned alongside these. It is a technique rather than a separate problem type: knowledge gained from one task is reused to improve performance on a different but related task. This is useful when little labeled data is available for the target task.

### 1.2 Examples of Machine Learning Problems

Here are some examples of machine learning problems:

1. **Classification**: Predicting whether an email is spam or not.

2. **Regression**: Predicting the price of a house based on its features.

3. **Clustering**: Grouping customers based on their purchasing behavior.

4. **Dimensionality Reduction**: Reducing the number of features in a dataset while preserving its structure.

5. **Recommendation**: Recommending products or services to users based on their past behavior.

6. **Natural Language Processing**: Analyzing and generating human language.

7. **Computer Vision**: Analyzing and understanding visual data, such as images and videos.

8. **Anomaly Detection**: Identifying unusual patterns or outliers in data.

These are just a few examples of the many types of machine learning problems that exist. Each problem requires a different approach and set of techniques to solve effectively.

## 2. Data

In machine learning, there are different types of data that can be used as input for algorithms. These data types include:

- Numerical data: This type of data consists of numbers and can be either continuous or discrete. Examples include age, height, weight, temperature, etc.
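- Categorical data: This type of data consists of categories or labels. Nominal categories, such as gender, race, or occupation, have no inherent order, while ordinal categories, such as education level or a satisfaction rating, can be ranked even though the distances between ranks are not numerically meaningful.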
- Text data: This type of data consists of words or sentences and is used in natural language processing (NLP) tasks such as sentiment analysis, text classification, etc.
- Image data: This type of data consists of pictures or images and is used in computer vision tasks such as object detection, image recognition, etc.
- Audio data: This type of data consists of sound or speech and is used in speech recognition, speaker identification, etc.
- Time-series data: This type of data consists of observations recorded over time, such as stock prices, weather data, etc. Because the order of the observations matters, time-series data usually calls for methods that can capture trends, seasonality, and other patterns over time.

Streaming data is a type of data that is generated continuously over time and needs to be processed in real-time. Examples of streaming data include social media posts, sensor data, clickstream data, and financial data.

Streaming data can be seen as a type of time-series data, but with some key differences. Time-series data is typically stored in a database or a file and analyzed offline, whereas streaming data is analyzed as it is being generated.
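The difference between offline and streaming analysis can be illustrated with a running statistic that is updated one observation at a time, without storing the full history. The sketch below uses Welford's method; the `RunningStats` class and the `sensor_stream` generator are hypothetical names introduced only for illustration.

```python
class RunningStats:
    """Incrementally tracks count, mean, and variance (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Each new observation adjusts the statistics in O(1) time and memory.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one value has been seen.
        return self._m2 / self.count if self.count else 0.0


def sensor_stream():
    """Hypothetical sensor readings arriving one at a time."""
    for reading in [20.0, 22.0, 21.0, 23.0, 24.0]:
        yield reading


stats = RunningStats()
for reading in sensor_stream():
    stats.update(reading)  # processed as it arrives, never stored

print(stats.count, stats.mean, stats.variance)
```

An offline analysis would instead load all readings into memory and compute the mean and variance in one pass over the stored data.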

## 3. Types of evaluation

In machine learning, metrics are used to measure the performance of a model on a given task. The choice of metric depends on the specific problem and the desired outcome. Here are some commonly used metrics in machine learning:

- Accuracy: This is the most common metric used in classification problems. It measures the percentage of correctly classified instances out of all instances.
- Precision: This metric measures the percentage of true positives (correctly predicted positive instances) out of all predicted positives.
- Recall: This metric measures the percentage of true positives out of all actual positive instances.
- F1 score: This is the harmonic mean of precision and recall, and is used to balance the trade-off between them.
- Mean Squared Error (MSE): This metric is used in regression problems and measures the average squared difference between the predicted and actual values.
- Mean Absolute Error (MAE): This metric is also used in regression problems and measures the average absolute difference between the predicted and actual values.
- R-squared: This metric measures the proportion of variance in the target variable that is explained by the model. A higher R-squared value indicates a better fit of the model to the data.
- Receiver Operating Characteristic (ROC) curve: This is a graphical representation of the trade-off between true positive rate and false positive rate for different classification thresholds.
- Area Under the Curve (AUC): This metric is used to evaluate the performance of binary classification models based on the ROC curve. A higher AUC value indicates better performance.
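As a minimal sketch of the regression metrics listed above, the following plain-Python functions compute MSE, MAE, and R-squared directly from their definitions. The function names and the sample values are illustrative assumptions, not part of any particular library.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(mse, mae, r2)
```

Because MSE squares each error, it penalizes large errors more heavily than MAE does, which matters when outliers are present.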

For example, accuracy may be a good metric for a balanced dataset, while precision, recall, and F1 score are usually more informative for imbalanced datasets, where a model can reach high accuracy simply by always predicting the majority class. Similarly, MSE, MAE, and R-squared apply to regression problems, while ROC curves and AUC apply to binary classification problems.
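The effect of class imbalance on accuracy can be sketched with a made-up dataset of 95 negative and 5 positive instances. The helper names and both sets of predictions below are assumptions chosen for illustration; undefined ratios (no predicted positives) are reported as 0.0 by convention here.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 score as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Imbalanced labels: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority (negative) class.
always_negative = [0] * 100

# Model B raises 6 false alarms but finds 4 of the 5 positives.
detector = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1

metrics_a = classification_metrics(y_true, always_negative)
metrics_b = classification_metrics(y_true, detector)
print(metrics_a)
print(metrics_b)
```

Model A scores higher on accuracy even though it never finds a positive instance, while Model B's recall and F1 score show that it is the more useful of the two.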

## 4. Features

![features](features.png)


In machine learning, features are the measurable properties or characteristics of the input data that are used to train a model. These features are used to represent the input data in a way that the machine learning algorithm can understand and make predictions on.

Three common types of features are:
- Numeric features: These are features that are represented by numerical values. Examples include age, income, temperature, and height.
- Categorical features: These are features that take on a limited number of possible values, such as color, gender, or country of origin.
- Text features: These are features that are represented by text data, such as email messages or social media posts.
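Most algorithms expect numeric input, so categorical features are commonly converted to numbers before training. One common approach is one-hot encoding, sketched below in plain Python; the `one_hot_encode` helper and the sample colors are hypothetical and only illustrate the idea.

```python
def one_hot_encode(values):
    """Map each category to a binary vector with a single 1."""
    categories = sorted(set(values))  # fixed, reproducible column order
    column = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[column[value]] = 1
        encoded.append(row)
    return categories, encoded


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # column names, in sorted order
for color, row in zip(colors, encoded):
    print(color, row)
```

One-hot encoding avoids implying an order between categories, which a simple integer mapping such as red=0, green=1, blue=2 would do.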

Features can also be transformed or engineered to create new features that may be more informative or relevant to the problem at hand. For example, in image recognition, features can be extracted from the raw pixel values, such as edges, corners, or texture patterns, to create higher-level features that capture more meaningful information.
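As a rough sketch of extracting a higher-level feature from raw pixel values, the example below computes a very simple edge response (the absolute difference between horizontally adjacent pixels) on a tiny made-up grayscale image. Real pipelines typically use more robust filters; the function name and the image are assumptions for illustration.

```python
def horizontal_edges(image):
    """Absolute difference between horizontally adjacent pixels."""
    return [
        [abs(row[col + 1] - row[col]) for col in range(len(row) - 1)]
        for row in image
    ]


# A 3x4 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

edges = horizontal_edges(image)

# Summarize the edge map as a single engineered feature: mean edge strength.
num_values = sum(len(row) for row in edges)
edge_strength = sum(sum(row) for row in edges) / num_values

print(edges)
print(edge_strength)
```

The edge map is non-zero only where the brightness changes, so it captures the vertical boundary in the image far more compactly than the raw pixels do.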

## 5. Modelling

![modelling](modelling.png)


![data_split](data_split.png)

### 5.1 Picking the model

![model](model.png)


Choosing the right model is a critical step in the machine learning process and can greatly impact the performance of the system. Here are some important things to consider when choosing a model:

**Problem type**: The type of problem you are trying to solve will dictate the type of model that is best suited for the task. For example, if you are working on a classification problem, you may want to consider using a decision tree, support vector machine (SVM), or logistic regression model. On the other hand, if you are working on a regression problem, you may want to consider using a linear regression, decision tree, or neural network model.

**Dataset size**: The size of the dataset can also affect the choice of model. Some models work better with large datasets, while others may be more appropriate for smaller datasets. For example, deep learning models, such as convolutional neural networks (CNNs), may require large datasets to achieve good performance.

**Complexity of the model**: More complex models may be able to capture more intricate patterns in the data, but they may also be more prone to overfitting, especially if the dataset is small. It's important to balance model complexity with model performance.

**Interpretability**: Some models, such as decision trees, are more interpretable than others, such as deep neural networks. If interpretability is important for your application, you may want to consider using a model that is more transparent.

**Scalability**: If you plan to deploy your model in a production environment, it's important to consider its scalability. Some models may be more efficient than others when it comes to processing large amounts of data or handling multiple requests simultaneously.

**Resources**: The choice of model may also depend on the available resources, such as computing power or memory. For example, deep learning models may require powerful GPUs to train effectively.

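**Regularization**: Regularization techniques, such as L1 and L2 regularization, can help prevent overfitting and improve model generalization. Some models have well-known regularized variants (for example, lasso and ridge regression add L1 and L2 penalties to linear regression), while others rely on different mechanisms, such as limiting tree depth or using dropout, that require additional tuning.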
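To make the effect of L2 regularization concrete, here is a minimal sketch for the simplest possible case: a one-feature linear model without an intercept, where minimizing `sum((y - w*x)**2) + l2_strength * w**2` has the closed-form solution `w = sum(x*y) / (sum(x*x) + l2_strength)`. The function name, the data, and the penalty values are assumptions for illustration.

```python
def fit_ridge_1d(xs, ys, l2_strength):
    """Closed-form weight for y ~ w * x with an L2 penalty on w."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + l2_strength)


# Made-up data that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# A stronger penalty shrinks the learned weight toward zero.
for l2_strength in (0.0, 1.0, 14.0):
    weight = fit_ridge_1d(xs, ys, l2_strength)
    print(l2_strength, weight)
```

With no penalty the weight matches the true slope; as the penalty grows, the weight shrinks, trading a little bias for lower variance.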

### 5.2 Tuning the model

Model tuning, also known as **hyperparameter tuning**, is the process of finding the best combination of hyperparameters for a machine learning model to achieve optimal performance on a given task. Here are some steps you can follow to tune a model:

**Define the hyperparameters**: Hyperparameters are parameters that are not learned during the training process, but instead are set by the user before training the model. These can include learning rate, regularization strength, batch size, and number of hidden layers in a neural network, among others.

**Define a search space**: Once you have defined the hyperparameters, you need to define a search space for each hyperparameter. The search space can be continuous or discrete, and can include a range of values or a set of values to test.

**Choose a search algorithm**: There are several algorithms you can use to search the hyperparameter space, including grid search, random search, and Bayesian optimization. Each algorithm has its own advantages and disadvantages, and the choice of algorithm will depend on the size of the search space and the resources available.

**Train and evaluate the model**: For each combination of hyperparameters, train the model on a training set and evaluate its performance on a validation set. You can use metrics such as accuracy, precision, recall, or F1 score to evaluate the model.

**Select the best model**: Once you have trained and evaluated the model for all combinations of hyperparameters, select the combination that gives the best performance on the validation set.

**Test the model**: Finally, test the performance of the best model on a test set that was not used during the hyperparameter tuning process. This will give you an estimate of how well the model will perform on new, unseen data.

It's important to note that hyperparameter tuning can be a time-consuming and computationally expensive process, especially for deep learning models with many hyperparameters. However, by tuning the hyperparameters, you can significantly improve the performance of the model and achieve better results on your task.
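The tuning steps above can be sketched as a small grid search. The example assumes a deliberately tiny model (one-feature ridge regression without an intercept), made-up train, validation, and test splits, and a hand-picked search space; all names are hypothetical.

```python
def fit_ridge_1d(xs, ys, l2_strength):
    """Closed-form weight for y ~ w * x with an L2 penalty on w."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + l2_strength)


def mse(weight, xs, ys):
    """Mean squared error of the model y = weight * x."""
    return sum((y - weight * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up splits: noisy training data, clean validation and test data.
train = ([1.0, 2.0, 3.0], [2.5, 3.5, 6.5])
validation = ([4.0, 5.0], [8.0, 10.0])
test = ([6.0], [12.0])

# Steps 1-2: the hyperparameter (L2 strength) and its search space.
search_space = [0.0, 0.5, 1.0, 10.0]

# Steps 3-4: grid search, training on the training set and
# scoring each candidate on the validation set.
results = []
for l2_strength in search_space:
    weight = fit_ridge_1d(*train, l2_strength)
    results.append((mse(weight, *validation), l2_strength, weight))

# Step 5: keep the candidate with the lowest validation error.
best_val_mse, best_l2, best_weight = min(results)

# Step 6: report performance on the untouched test set.
test_mse = mse(best_weight, *test)
print(best_l2, best_val_mse, test_mse)
```

Grid search evaluates every candidate, which is fine for one hyperparameter with four values but grows quickly with more hyperparameters; random search or Bayesian optimization may be preferable then.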

### 5.3 Model comparison - testing the model

![model_testing](model_testing.png)

![fitting](fitting.png)

Overfitting occurs when a machine learning model fits the training data too closely and captures noise or random fluctuations in the data rather than the underlying pattern. This can lead to poor generalization performance on new, unseen data.

The main reasons for overfitting include:
1. Insufficient training data: If the training data is too small or unrepresentative of the true distribution, the model may learn to fit the noise in the data instead of the underlying pattern. Collecting more data or using data augmentation techniques can help to mitigate this problem.
2. Overly complex model: If the model is too complex, it may have too many parameters and be prone to overfitting. Simplifying the model architecture or adding regularization can help to prevent overfitting.
3. Lack of regularization: Regularization techniques such as L1 or L2 regularization can help to prevent overfitting by adding a penalty term to the loss function that discourages large weights. Dropout regularization can also be used in neural networks to randomly drop out some of the neurons during training.
4. Data leakage: If information from the validation or test set leaks into training (for example, through duplicated records or preprocessing fitted on the full dataset), validation scores become overly optimistic and the model performs poorly on new data. It's important to keep the validation set independent of the training set and to partition the data properly, shuffling it first unless the order carries meaning, as in time-series data.
5. Incorrect model selection: Choosing an inappropriate model for the task at hand can also lead to overfitting. It's important to choose a model that is not too complex and that can capture the underlying patterns in the data without overfitting.

To address overfitting, monitor the model's performance on both the training and validation sets during training: a training error that keeps falling while the validation error stalls or rises is the typical warning sign. Apply appropriate regularization, make sure the data is properly partitioned, and choose a model suited to the task at hand.
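The gap between training and validation error can be illustrated with a model that memorizes its training data. The sketch below compares a 1-nearest-neighbor regressor with a least-squares line on made-up noisy data whose true relationship is `y = x`; the helper names and the data are assumptions for illustration.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_one_nn(train_xs, train_ys):
    """1-nearest-neighbor regressor: returns the label of the closest point."""
    def predict(x):
        nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
        return train_ys[nearest]
    return predict


def fit_line(xs, ys):
    """Ordinary least-squares line with slope and intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Training labels are y = x plus alternating noise of +0.5 / -0.5.
train_xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_ys = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]

# Validation labels follow the true relationship y = x.
val_xs = [0.4, 1.4, 2.4, 3.4, 4.4]
val_ys = [0.4, 1.4, 2.4, 3.4, 4.4]

one_nn = make_one_nn(train_xs, train_ys)
line = fit_line(train_xs, train_ys)

nn_train, nn_val = mse(one_nn, train_xs, train_ys), mse(one_nn, val_xs, val_ys)
line_train, line_val = mse(line, train_xs, train_ys), mse(line, val_xs, val_ys)

print("1-NN  train/val MSE:", nn_train, nn_val)
print("line  train/val MSE:", line_train, line_val)
```

The 1-nearest-neighbor model reproduces the noisy training labels perfectly but does worse on validation data than the simpler line, which is the train/validation gap that signals overfitting.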

Underfitting occurs when a machine learning model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and validation sets. The main reasons for underfitting include:
1. Insufficient model complexity: If the model is too simple and has too few parameters, it may not be able to capture the complexity of the underlying patterns in the data. Increasing the model complexity by adding more layers or neurons, increasing the number of decision trees, or adding more features to the dataset can help to address underfitting.
2. Insufficient training time: If the model is not trained for long enough, it may not have had enough time to learn the underlying patterns in the data. Increasing the number of training epochs or adjusting the learning rate can help to address this problem.
3. Unrepresentative training data: If the training data does not cover the true distribution, or its features carry too little signal, the model may not be able to learn the underlying patterns in the data. Collecting more representative data or adding more informative features can help to mitigate this problem.
4. Over-regularization: If the model is too heavily regularized, it may not be able to capture the underlying patterns in the data. Decreasing the strength of the regularization or using a different regularization technique can help to address this problem.
5. Incorrect model selection: Choosing an inappropriate model for the task at hand can also lead to underfitting. It's important to choose a model that is complex enough to capture the underlying patterns in the data without overfitting.

To address underfitting, monitor the model's performance on both the training and validation sets during training: high error on both is the typical sign. Reduce the regularization strength if it is too high, make sure the model has sufficient complexity to capture the underlying patterns in the data, and choose an appropriate model for the task at hand.
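Underfitting can be sketched by fitting a model that is too simple for the data. Below, a constant predictor (always the training mean) is compared with a least-squares line on made-up data that follows `y = 2x + 1` exactly; the helper names and the data are assumptions for illustration.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_constant(xs, ys):
    """Too-simple model: ignores x and always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares line with slope and intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Data generated from y = 2x + 1 with no noise.
train_xs = [0.0, 2.0, 4.0, 6.0, 8.0]
train_ys = [1.0, 5.0, 9.0, 13.0, 17.0]
val_xs = [1.0, 3.0, 5.0, 7.0]
val_ys = [3.0, 7.0, 11.0, 15.0]

constant = fit_constant(train_xs, train_ys)
line = fit_line(train_xs, train_ys)

const_train = mse(constant, train_xs, train_ys)
const_val = mse(constant, val_xs, val_ys)
line_train = mse(line, train_xs, train_ys)
line_val = mse(line, val_xs, val_ys)

print("constant train/val MSE:", const_train, const_val)
print("line     train/val MSE:", line_train, line_val)
```

The constant model has a large error on both the training and validation sets, while the slightly more complex line fits both, which is the pattern that distinguishes underfitting from overfitting.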

## 6. Experimentation

Experimentation in machine learning involves a cycle of designing, executing, and analyzing experiments to improve the performance of machine learning models. It is an iterative process that repeats steps 1-5 (problem definition, data, evaluation, features, and modelling) until the results are satisfactory.

## A1. Tools

![tools](tools.png)


## A2. Credits

- Andrei Neagoie, Daniel Bourke - Complete Machine Learning & Data Science Bootcamp 2023
- ChatGPT