# Introduction to Machine Learning

### What is Machine Learning?

Machine Learning is a subset of Artificial Intelligence concerned with designing computing systems that learn from data without being explicitly programmed for the task. For example, instead of telling a computer "if the email contains these exact words, it's spam" (think about it, this is how you would do it in plain Python, right?), we show it thousands of examples of spam and non-spam emails, and it figures out the patterns itself.
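
To make the contrast concrete, here is a minimal sketch of a hand-written rule next to a "model" that learns word counts from examples. The tiny dataset, the function names, and the scoring scheme are all made up for illustration; real spam filters are far more sophisticated.

```python
from collections import Counter

# Hand-written rule: only catches the exact phrase we thought of in advance.
def rule_based_is_spam(text):
    return "free money" in text.lower()

# Tiny, made-up labeled dataset (illustrative only).
examples = [
    ("win a free prize now", "spam"),
    ("claim your free prize today", "spam"),
    ("cheap loans win big", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch today with the team", "ham"),
    ("project report due monday", "ham"),
]

# "Training": count how often each word appears in each class.
word_counts = {"spam": Counter(), "ham": Counter()}
for text, label in examples:
    word_counts[label].update(text.split())

def learned_is_spam(text):
    # Score a new email by how often its words were seen in each class.
    words = text.lower().split()
    spam_score = sum(word_counts["spam"][w] for w in words)
    ham_score = sum(word_counts["ham"][w] for w in words)
    return spam_score > ham_score

new_email = "win a cheap prize"
print(rule_based_is_spam(new_email))  # False: the fixed rule misses it
print(learned_is_spam(new_email))     # True: the learned counts catch it
```

The rule only knows what we typed into it, while the learned version picks up on words it saw in the spam examples.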

#### Types of Machine Learning
Machine learning can be broadly classified into three types, based on the kind of feedback the learning system receives:

1. Supervised Machine Learning
In supervised learning, we train our model on labeled data - data where we already know the correct answer. The model learns the relationship between inputs (features) and outputs (labels), so it can predict the output for new, unseen inputs.

    * Regression Tasks - when your target variable is numerical
    * Classification Tasks - when your target variable is categorical

2. Unsupervised Machine Learning
In unsupervised learning, we have input data but no labels, no "correct answers." We ask the algorithm to discover the underlying structure or patterns in the data on its own.
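    * Dimensionality Reduction
    * Clustering

Note that time series forecasting is usually framed as a supervised task (the future values act as the labels), and deep learning is a family of models used across all three types rather than a kind of unsupervised learning.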
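
Clustering is a good way to see "no labels" in action. Below is a minimal one-dimensional k-means sketch: it is given only numbers, never the "right" groups. The data, the starting centers, and the function name `kmeans_1d` are assumptions made for this example.

```python
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Made-up, unlabeled data: monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(centers)   # [11.0, 100.0]
print(clusters)  # [[10, 12, 11], [95, 100, 105]]
```

Nobody told the algorithm there were "low spenders" and "high spenders"; it found the two groups from the structure of the data.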


3. Reinforcement Learning
The algorithm (the agent) learns by trial and error while interacting with an environment, guided by rewards and penalties. A robotic vacuum is a useful way to picture it: as it navigates the house (aka, interacts with its environment), it bumps into objects and gradually learns a clear path.
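
Here is a toy sketch of trial-and-error learning using tabular Q-learning. The "house" is reduced to a five-cell corridor with a goal at the far end; the environment, the reward of 1.0, and all names are assumptions for illustration, not how any real vacuum is programmed.

```python
import random

N_STATES = 5                 # positions 0..4 in a corridor; position 4 is the goal
ACTIONS = ["left", "right"]
GAMMA = 0.9                  # discount factor: later rewards are worth less

def step(state, action):
    """Environment: move one cell; bumping into the wall at 0 keeps you at 0."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

random.seed(0)
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(50):
    state = 0
    while state != N_STATES - 1:
        action = random.choice(ACTIONS)  # explore by trial and error
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # A learning rate of 1 is enough here because this toy world is deterministic.
        q[(state, action)] = reward + GAMMA * best_next
        state = next_state

# After training, pick the highest-valued action in each non-goal position.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)  # ['right', 'right', 'right', 'right']
```

The agent is never told that "right" is correct; it discovers this from the reward it receives at the goal.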




![image.png](attachment:image.png)

### Understanding the Machine Learning Workflow

##### 1. Problem Definition
Before writing a single line of code, you need to clearly define what you're trying to solve. A vague problem leads to wasted effort and unusable models. Be as specific as possible: "prevent customer churn" is very vague, but "predict which customers will churn in the next 30 days" is something you can actually build and measure.

Another thing: keep it simple. Always ask yourself, is this actually a machine learning problem? Not everything needs ML. If you can solve it with simple rules or a database query, do that instead. Use ML when the patterns are too complex for explicit rules.

What does success look like? Define concrete metrics. How will you know the model is good? Which evaluation metric best suits the business case? Don't worry, we will get to evaluation metrics later.


##### 2. Data Collection
What kind of data do I need to solve this problem using ML? Is it structured or unstructured? The data format heavily influences the type of model you can use. Is the quality of the data up to par? Remember GIGO (garbage in, garbage out)! You can have the best algorithm in the world, but without good data, it's useless.
Machine learning is fundamentally data-hungry. Your model is only as good as the data you feed it.

What makes good training data?
* Relevant features - Your data needs to contain information that actually relates to what you're predicting. If you're predicting loan defaults, you need financial information, not favorite colors.
* Sufficient quantity - How much is enough? It depends on complexity. Simple problems might need hundreds of examples. Complex image recognition might need millions. A rough rule: more complex patterns require more data.
* Representative - Your training data should reflect the real-world situations where you'll use the model. If you train a face recognition system only on well-lit professional photos, it will fail on grainy security camera footage.
* Labeled correctly (for supervised learning) - Labels need to be accurate. If 20% of your "spam" training emails are actually legitimate, your model will learn the wrong patterns.
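
Some of these properties can be checked with a few lines of code before any modeling starts. The sketch below counts rows, label balance, and missing values; the loan records and column names are invented for illustration. Relevance and representativeness still need human judgment.

```python
from collections import Counter

# Made-up loan records; None marks a missing value.
rows = [
    {"income": 52000, "debt": 12000, "defaulted": "no"},
    {"income": 31000, "debt": None, "defaulted": "yes"},
    {"income": 78000, "debt": 5000, "defaulted": "no"},
    {"income": None, "debt": 20000, "defaulted": "no"},
    {"income": 45000, "debt": 30000, "defaulted": "yes"},
]

# Quantity: how many examples do we have?
n_rows = len(rows)

# Label balance: a heavily skewed label distribution is worth knowing about early.
label_counts = Counter(row["defaulted"] for row in rows)

# Missing values per feature.
missing = {col: sum(row[col] is None for row in rows) for col in ("income", "debt")}

print(n_rows)
print(label_counts)
print(missing)
```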

##### 3. Data Preprocessing
This is where you'll spend 60-80% of your time. It's not glamorous, but it's absolutely critical. The goal at this stage is to transform raw data into a clean, structured and ***machine-readable format***. Data preprocessing can be broadly broken down into the following phases:

* Data Cleaning - this includes all the steps you learned in PyData: handling missing values, making data types consistent, removing duplicates and handling outliers.

* Data Transformations
    * Feature Scaling - adjusting the scales of your features to a common range, so that features with large numeric ranges do not dominate the model's outcome simply because of their units.
    * Feature Encoding - transforming categorical features into numerical features. All models are 'math' models, so they need numbers as input.
    * Log Transformations - used to compress skewed features with a long right tail (such as income), reducing the influence of very large values and often making relationships closer to linear.

* Feature Engineering - creating new features from existing ones
* Handling class imbalance
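
The three transformations above can be sketched in plain Python as follows. These are minimal, illustrative versions with made-up data and function names; in practice you would normally reach for a library implementation.

```python
import math

# Feature scaling (min-max): map values to the range [0, 1].
# Assumes the values are not all identical (otherwise hi - lo is zero).
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Feature encoding (one-hot): one 0/1 column per category.
def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Log transformation: log1p(x) = log(1 + x) compresses large values
# and is still defined at x = 0.
def log_transform(values):
    return [math.log1p(v) for v in values]

ages = [20, 30, 40]
colors = ["red", "green", "red"]
incomes = [0, 9, 99, 999]

print(min_max_scale(ages))     # [0.0, 0.5, 1.0]
print(one_hot(colors))         # [[0, 1], [1, 0], [0, 1]]  (columns: green, red)
print(log_transform(incomes))  # roughly [0.0, 2.30, 4.61, 6.91]
```

Notice how the incomes, which span three orders of magnitude, end up within a range of about 7 after the log transformation.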


 


##### 4. Model Selection & Training
Choosing the right algorithm:
This depends on your problem type (classification vs regression), data size (neural networks need lots of data, simpler models can work with less), interpretability needs (doctors might need to understand why a prediction was made, so a decision tree might be better than a neural network), and computational resources (training a large neural network requires powerful hardware).

***Common approach: start with simple, parsimonious models.***
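
One way to "start simple" is to compare a trivial baseline against a very small model on held-out data. The sketch below uses a majority-class baseline and a one-feature threshold rule (a "decision stump"); the churn data and all names are hypothetical.

```python
from collections import Counter

# Made-up data: (hours of weekly product usage, churned?).
train = [(1, "yes"), (2, "yes"), (3, "yes"), (8, "no"), (9, "no"), (10, "no"), (12, "no")]
test = [(2, "yes"), (4, "yes"), (9, "no"), (11, "no")]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# Baseline: always predict the most common training label.
majority = Counter(y for _, y in train).most_common(1)[0][0]

def baseline(x):
    return majority

# Simple model: predict "yes" when usage is at or below a threshold.
def stump_predict(x, threshold):
    return "yes" if x <= threshold else "no"

def fit_stump(data):
    # Try every observed value as a threshold; keep the best on the training set.
    candidates = sorted({x for x, _ in data})
    return max(candidates, key=lambda t: accuracy(lambda x: stump_predict(x, t), data))

threshold = fit_stump(train)

def stump(x):
    return stump_predict(x, threshold)

print(accuracy(baseline, test))  # 0.5
print(accuracy(stump, test))     # 0.75
```

If a more complex model cannot clearly beat numbers like these, the extra complexity is probably not worth it.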


##### 5. Performance Evaluation & Hyperparameter tuning

* Choosing the right evaluation metric (this choice is crucial):
For classification, accuracy seems obvious - what percentage did we get right? But it can be misleading. If only 2% of transactions are fraudulent and your model just predicts "not fraud" every time, it's 98% accurate but completely useless.

The right evaluation metric depends on your business goal. When predicting whether someone has cancer, which error is more costly: a false positive or a false negative? Or do both matter?
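
The fraud example can be checked with a few lines of arithmetic. The sketch below builds 1,000 made-up transactions with 2% fraud and scores a "model" that always predicts "not fraud". Treating precision as 0.0 when there are no positive predictions is a convention chosen for this example.

```python
# 1,000 made-up transactions, 2% fraudulent (1 = fraud, 0 = not fraud).
y_true = [1] * 20 + [0] * 980
# A "model" that always predicts "not fraud".
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # fraud caught
fn = sum(t == 1 and p == 0 for t, p in pairs)  # fraud missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
tn = sum(t == 0 and p == 0 for t, p in pairs)  # correctly ignored

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(accuracy)  # 0.98
print(recall)    # 0.0: not a single fraudulent transaction was caught
```

Recall (the share of actual fraud cases the model catches) exposes the problem that accuracy hides.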

* Hyperparameter tuning
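Hyperparameters are settings that you, as the data scientist, choose before training rather than values the model learns from the data, for example the number of trees in a random forest or the maximum depth of a decision tree. Tuning means searching for the combination of hyperparameters that gives the best performance on validation data.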
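
A minimal sketch of hyperparameter tuning is a grid search: try each candidate value and keep the one that scores best on validation data. The example below tunes `k` for a tiny k-nearest-neighbors classifier; the data (with two deliberately noisy training labels) and the grid are made up for illustration.

```python
from collections import Counter

# Made-up 1-D data: (feature, label). The labels at 3.0 and 8.0 are deliberately noisy.
train = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 0), (9.0, 1)]
validation = [(2.9, 0), (3.4, 0), (7.9, 1), (6.5, 1), (1.5, 0), (8.6, 1)]

def knn_predict(x, k):
    """Majority vote among the k training points closest to x."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def validation_accuracy(k):
    return sum(knn_predict(x, k) == y for x, y in validation) / len(validation)

# k is a hyperparameter: we choose it, the model does not learn it.
grid = [1, 3, 5]
scores = {k: validation_accuracy(k) for k in grid}
best_k = max(grid, key=lambda k: scores[k])  # on ties, the first (smallest) k wins

print(scores)  # {1: 0.5, 3: 1.0, 5: 1.0}
print(best_k)  # 3
```

With `k = 1` the model copies the noisy labels and does poorly on validation data; larger values of `k` smooth them out. In a real project the chosen setting would still be confirmed on a separate test set.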

* Interpreting Model Performance
This is where the bias-variance tradeoff comes in: is my model overfitting (high variance) or underfitting (high bias)?

##### 6. Deployment and Monitoring
Deployment means packaging your model so that its intended users can access it.

Deployment options:
* Batch prediction - run the model periodically (daily, weekly) on large batches of data. A hospital might score all current patients every morning.
* Real-time API - the model serves predictions instantly when requested. A fraud detection system needs to evaluate transactions in milliseconds.
* Edge deployment - the model runs on the device itself (like facial recognition on your phone) rather than in the cloud.
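
To give a feel for "packaging" and for the difference between batch and real-time scoring, here is a deliberately tiny sketch. The "model" is just a threshold stored as JSON, and the field names and values are hypothetical; real deployments typically serialize a trained model object and put it behind a scheduler or a web service.

```python
import json

# A "trained model" reduced to its learned parameters (hypothetical values).
model = {"feature": "risk_score", "threshold": 0.7}

# Packaging: serialize the model so another process can load it.
packaged = json.dumps(model)
loaded = json.loads(packaged)

def predict_one(record, model=loaded):
    """Real-time style: score a single request as it arrives."""
    return record[model["feature"]] >= model["threshold"]

def predict_batch(records, model=loaded):
    """Batch style: score many stored records in one scheduled run."""
    return [predict_one(r, model) for r in records]

patients = [{"risk_score": 0.2}, {"risk_score": 0.9}, {"risk_score": 0.7}]
print(predict_batch(patients))            # [False, True, True]
print(predict_one({"risk_score": 0.95}))  # True
```

The prediction logic is the same in both cases; what changes is when and how often it is called.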