# Final Report :

#### In this notebook, we apply machine learning algorithms to the "Meteo12" dataset: an unsupervised algorithm first groups the data, and supervised algorithms then learn to classify it.

## **1 - Dataset** :

#### **Introduction** :

The **"Meteo12"** dataset contains the following features: ['No', 'year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station']. The dataset does not contain a target column.

To apply supervised machine learning algorithms, we must therefore first apply an unsupervised algorithm to create a target column.

These columns provide various measurements and indicators of weather conditions, air quality and atmospheric properties.

![dataset](ref/img0.png)


#### **The meaning of each column in the dataset**:

1. **year**: The year in which the weather data was recorded.
2. **month**: The month in which the weather data was recorded.
3. **day**: The day of the month on which the weather data was recorded.
4. **hour**: The hour of the day at which the weather data was recorded.
5. **PM2.5**: Particulate matter (PM) with a diameter of 2.5 micrometers or less, which can penetrate deep into the respiratory system and pose health risks when present in high concentrations.
6. **PM10**: Particulate matter (PM) with a diameter of 10 micrometers or less, which can also cause adverse health effects when present in high concentrations.
7. **SO2**: Sulfur dioxide, a gas primarily emitted by the burning of fossil fuels such as coal and oil. It can contribute to respiratory problems and is a precursor to acid rain.
8. **NO2**: Nitrogen dioxide, a reddish-brown gas produced mainly by fuel combustion, when the nitric oxide (NO) that is emitted is oxidized in the air. It can irritate the airways and worsen respiratory conditions.
9. **CO**: Carbon monoxide, a colorless and odorless gas produced by the incomplete combustion of fuels containing carbon. It can be harmful when inhaled in large quantities, leading to carbon monoxide poisoning.
10. **O3**: Ozone, a gas composed of three oxygen atoms. While ozone in the atmosphere protects us from the sun's ultraviolet rays, ground-level ozone is a pollutant that can cause respiratory problems and exacerbate lung conditions.
11. **TEMP**: Temperature, a measure of how hot or cold the air is.
12. **PRES**: Atmospheric pressure, the force exerted by the weight of air molecules above a particular point on the Earth's surface.
13. **DEWP**: Dew point, the temperature at which the air becomes saturated with water vapor and dew begins to form.
14. **RAIN**: Precipitation, the amount of rain recorded during the specified period.
15. **WSPM**: Wind speed, the rate at which air moves horizontally past a given point.
16. **wd**: Wind direction, indicating the direction the wind is blowing from. It is usually indicated in cardinal directions (e.g., N for North, S for South, etc.) or in degrees (e.g., 0° for North, 90° for East, etc.).
17. **station**: The name or identifier of the weather station where the data was recorded. This column specifies the location or source of weather observations.

#### **Data pre-processing**:

1- **Data cleaning**:

- We handle the missing data: some features contain missing values.

![missing values](ref/img1.png)

=> We replace the missing values of each feature with the mean of that feature.

- We delete the duplicate rows.
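
The two cleaning steps can be sketched with pandas as follows. This is a minimal illustration and not the notebook's actual code: the small frame and its values are made up, and only the numeric columns are imputed because a mean is not defined for text columns such as `wd`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real data (values are illustrative only).
df = pd.DataFrame({
    "PM2.5": [10.0, np.nan, 30.0, 30.0],
    "TEMP": [1.0, 2.0, np.nan, np.nan],
    "wd": ["N", "S", "E", "E"],
})

# Replace missing values in each numeric column with that column's mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Drop exact duplicate rows and renumber the index.
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

Imputing before removing duplicates means that rows which only differed by a missing value can become identical and are then dropped; swapping the two steps is an equally reasonable choice.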
  

2- **Feature selection**: We then keep the useful features and delete the unnecessary ones, based on the relevance of each feature and on the correlation matrix.

=> The features that are unnecessary for this analysis are: **"No"**, **"wd"** and **"station"**.

=> The least useful features according to the correlation matrix are: **"year"**, **"month"**, **"day"**, **"hour"**, **"RAIN"** and **"WSPM"**.
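
As a rough sketch of this step, the columns named above can be dropped with `DataFrame.drop` and the correlation matrix computed with `DataFrame.corr`. The data below is random and only reuses the column names from the dataset; it is an assumption made so the example runs on its own.

```python
import numpy as np
import pandas as pd

columns = ['No', 'year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'SO2', 'NO2',
           'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station']

# Random placeholder data with the same column names as the dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, len(columns))), columns=columns)
df["wd"] = "N"        # text column, as in the real data
df["station"] = "A"   # hypothetical station label

# Columns listed in the text as unnecessary or least useful.
to_drop = ["No", "wd", "station", "year", "month", "day", "hour", "RAIN", "WSPM"]
features = df.drop(columns=to_drop)

# Pearson correlation between the remaining features.
corr = features.corr()
print(corr.round(2))
```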


![correlation matrix](ref/img2.png)


3- **Standardization**: Finally, we standardize the data so that variables measured in different units can be compared fairly, which removes the bias introduced by the different measurement scales.
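
Standardization rescales each feature to zero mean and unit variance, z = (x - mean) / std. A minimal sketch with scikit-learn's `StandardScaler` is shown below; using this particular scaler is an assumption, and the two columns are made-up values on deliberately different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (e.g. a temperature and a pressure).
X = np.array([
    [10.0, 1000.0],
    [20.0, 1010.0],
    [30.0, 1020.0],
])

# Fit the per-column mean and standard deviation, then rescale.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```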

#### **Visualization** :

Visualization of data after pre-processing:

![Visualization of data](ref/img3.png)



And the correlation matrix after pre-processing:

![correlation matrix](ref/img4.png)



---------------

## 2 - **Apply Unsupervised algorithm (K-means)**:

After pre-processing the data, and because our dataset does not contain a target column, we now apply an unsupervised algorithm (K-means) to group the data and create one.

#### **Choose the number of clusters**:

Before applying the K-Means algorithm, we must first choose the number of clusters (centroids). We determine it from the **elbow point** of a graph of “silhouette scores” computed for different numbers of clusters:

![graph of “silhouette scores” for different numbers of clusters](ref/img5.png)

=> In this case, the elbow point corresponds to **4** clusters.
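
The selection procedure can be sketched as follows: compute a silhouette score for several values of k, then fit K-Means with the retained value and use the cluster labels as the target column. The data comes from `make_blobs` and is purely synthetic, and the range of k and the `random_state` are assumptions; only the value 4 is taken from this report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized features.
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))

# Fit the final model with the number of clusters retained in this report.
chosen_k = 4
kmeans = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit(X)

# The cluster index of each row serves as the target column.
target = kmeans.labels_
```

Note that silhouette scores are commonly read by looking for their maximum, while the elbow criterion is more often applied to the K-Means inertia; the sketch only produces the scores and leaves their interpretation to the graph above.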

#### **Visualization of the K-Means model after training**:

Pair Plot :

![k-means visualisation](ref/img9.png)

Scatter Plot : 

![k-means visualisation](ref/img7.png)

![k-means visualisation](ref/img8.png)



## 3 - **Apply Supervised algorithms**:

### i- **The KNN algorithm**:

The **KNN** algorithm is a simple and intuitive method used for classification and regression in machine learning. It assigns a class label (classification) or predicts a value (regression) based on the closest examples in the feature space. KNN is non-parametric, easy to understand and implement, but can be computationally expensive with large amounts of data.
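
A minimal sketch of training and evaluating a KNN classifier with scikit-learn is given below. The data is synthetic, and the split ratio and `n_neighbors=5` are assumptions, not the settings used in the notebook. It produces the same three metrics that are reported for each model in this section.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the features and the K-Means target.
X, y = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Each test point receives the majority class of its 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(acc)
print(cm)
print(report)
```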

After training the KNN model, the evaluation results are as follows:

**accuracy_score** : 0.966348210466277


**confusion_matrix** :

![KNN confusion_matrix](ref/img10%20-%20KNN.png)

**classification_report** :

                  precision    recall  f1-score   support

               0       0.97      0.99      0.98      2810
               1       0.93      0.95      0.94      1755
               2       0.98      0.97      0.97      1869
               3       0.98      0.92      0.95       579

        accuracy                           0.97      7013
       macro avg       0.97      0.96      0.96      7013
    weighted avg       0.97      0.97      0.97      7013



### ii- **The SVM algorithm**:

The **SVM** (Support Vector Machine) algorithm is a supervised learning method used for classification and regression. It seeks the hyperplane that optimally separates the data of different classes by maximizing the margin between the closest examples of each class. SVMs are efficient in high-dimensional spaces and can handle both linearly and non-linearly separable data through the use of kernels.
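
To illustrate the role of kernels, the sketch below trains scikit-learn's `SVC` with a linear kernel and with an RBF kernel on a synthetic dataset that is not linearly separable (`make_moons`). The dataset, the `C` value and the choice of kernels are assumptions for illustration; the report does not state which kernel was used.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaved half-moons: a straight line cannot separate them cleanly.
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train one SVM per kernel and record its accuracy on the test set.
scores = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    model.fit(X_train, y_train)
    scores[kernel] = model.score(X_test, y_test)

print(scores)
```

On data of this shape the RBF kernel typically scores higher than the linear one, because it can draw a curved decision boundary.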

After training the SVM model, the evaluation results are as follows:

**accuracy_score** : 0.9964351917866818


**confusion_matrix** :

![SVM confusion_matrix](ref/img11%20-%20SVM.png)

**classification_report** :

                  precision    recall  f1-score   support

               0       1.00      1.00      1.00      2155
               1       1.00      1.00      1.00      1944
               2       0.99      1.00      1.00      1026
               3       1.00      1.00      1.00      1888

        accuracy                           1.00      7013
       macro avg       1.00      1.00      1.00      7013
    weighted avg       1.00      1.00      1.00      7013



### iii- **The CART algorithm**:

The **CART** (Classification And Regression Trees) algorithm is a supervised learning method used for classification and regression. It builds a decision tree by recursively dividing the data into subgroups based on feature values. CART decision trees are easy to interpret and effective in processing different types of data, providing a robust and versatile solution for predictive modeling.
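
The interpretability of a decision tree can be seen by printing its learned rules. The sketch below uses scikit-learn's `DecisionTreeClassifier`, which its documentation describes as an optimized version of CART, on synthetic data; the feature names, `max_depth=3` and the Gini criterion are assumptions made for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic 4-class data; the feature names are hypothetical.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)
feature_names = ["feature_a", "feature_b"]

# A shallow tree keeps the printed rules short and readable.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Text rendering of the recursive splits, one threshold test per line.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```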

After training the CART model, the evaluation results are as follows:

**accuracy_score** : 0.9446741765293027


**confusion_matrix** :

![CART confusion_matrix](ref/img12%20-%20CART.png)

**classification_report** :
              
                  precision    recall  f1-score   support

               0       0.96      0.96      0.96      2156
               1       0.95      0.96      0.95      1879
               2       0.94      0.94      0.94      1029
               3       0.92      0.92      0.92      1949

        accuracy                           0.94      7013
       macro avg       0.94      0.94      0.94      7013
    weighted avg       0.94      0.94      0.94      7013



## 4- **Conclusion** :

#### In conclusion, all three models achieve very good accuracy, but the most accurate is the **SVM** model (accuracy of about 0.996, compared with about 0.966 for KNN and about 0.945 for CART).