**Input**

* `N` - ratio of Nitrogen (NH4+) content in soil
* `P` - ratio of Phosphorus (P) content in soil
* `K` - ratio of Potassium (K) content in soil
* `ph` - soil acidity (pH)
* `ec` - electrical conductivity
* `oc` - organic carbon
* `S` - sulfur (S)
* `zn` - Zinc (Zn)
* `fe` - Iron (Fe)
* `cu` - Copper (Cu)
* `Mn` - Manganese (Mn)
* `B` - Boron (B)

**Output**

* Fertility class (0 "Less Fertile", 1 "Fertile", 2 "Highly Fertile")

## Importing Libraries and Modules

In [None]:
!pip install scikit-learn

In [None]:
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree, ensemble
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

## Loading the Dataset

In [None]:
data = pd.read_csv('/content/dataset1.csv')
data.head()

## Data Exploration

In [None]:
data.shape

In [None]:
data.info()

This dataset has 13 columns, organized as follows:

*   12 columns representing the soil characteristics.
*   The last column (`Output`) is the label, i.e. the soil's fertility class.



In [None]:
data.describe()

In [None]:
data['Output'].value_counts()

The target column (`Output`) contains 3 classes:

*   `0` : a soil with very low fertility
*   `1` : a moderately fertile soil
*   `2` : a highly fertile soil
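
To make the class counts easier to read, the numeric codes can be mapped to the label names given in the **Output** description. This is a minimal sketch: `FERTILITY_LABELS` and `sample_output` are hypothetical names, and the sample values are made up to stand in for `data['Output']`.

```python
import pandas as pd

# Label names taken from the Output description at the top of the notebook.
FERTILITY_LABELS = {0: "Less Fertile", 1: "Fertile", 2: "Highly Fertile"}

# Hypothetical sample of target values, standing in for data['Output'].
sample_output = pd.Series([0, 2, 1, 1, 0], name="Output")

# Map each numeric code to its label, then count the classes.
sample_labels = sample_output.map(FERTILITY_LABELS)
label_counts = sample_labels.value_counts()
print(label_counts)
```

On the real dataset, the same call would presumably be `data['Output'].map(FERTILITY_LABELS).value_counts()`.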



In [None]:
corr = data.corr()

In [None]:
f, ax = plt.subplots(figsize=(6, 6))
sns.heatmap(corr,
            cmap=sns.color_palette('BuGn_r'),
            vmin=-1.0,
            vmax=1.0,
            square=True,
            ax=ax)

## Data Processing

In [None]:
# Rename the target column
data.rename(columns={'Output': 'fertility'}, inplace=True)
data.head()

In [None]:
features = data.drop('fertility', axis=1)
label = data.loc[:, 'fertility']

In [None]:
features.columns

In [None]:
features.shape

In [None]:
type(features)

In [None]:
features.head()

In [None]:
print(label.shape)

In [None]:
label.head()

In [None]:
features.hist(bins=50, figsize=(10, 10), color ='green', grid=False)
plt.show()

## Preparing the Data for Training the Machine Learning Model

We apply a logarithmic transformation to reduce the skew of the feature distributions and bring them closer to a normal distribution.

In [None]:
def transform_features(features):
    """
    Transform the features by taking the base-10 logarithm of every numeric column.

    Args:
    - features : DataFrame containing the features to transform

    Returns:
    - transformed_features : DataFrame containing the transformed features
    """
    def log_transform(x):
        if np.issubdtype(x.dtype, np.number):
            return np.log10(x)
        else:
            return x

    transformed_features = features.apply(log_transform)
    return transformed_features

In [None]:
featuresTransformed = transform_features(features)
featuresTransformed.hist(bins=50, figsize=(10, 10), color='green', grid=False)
plt.show()
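
Besides comparing histograms, the effect of the transformation can be checked numerically with the sample skewness. The sketch below uses a synthetic log-normal column (`demo_feature`, a hypothetical name) rather than the soil data, so the numbers only illustrate the idea. It also counts non-positive values, since `np.log10` returns `-inf` for zeros and `NaN` for negative inputs.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed column (log-normal), standing in for a soil feature.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"demo_feature": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Sample skewness before and after the base-10 log transform.
skew_before = demo["demo_feature"].skew()
skew_after = np.log10(demo["demo_feature"]).skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")

# log10 is undefined for zero and negative values; count them before transforming.
non_positive = int((demo["demo_feature"] <= 0).sum())
print("non-positive values:", non_positive)
```

If a real feature does contain zeros, one common option (an assumption here, not something the notebook does) is `np.log1p`, which computes `log(1 + x)`.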

### Splitting the Data into Training and Validation Sets

In [None]:
trainInput, validationInput, trainTarget, validationTarget = train_test_split(featuresTransformed, label, test_size = 0.2, shuffle=True, random_state = 42)

In [None]:
trainInput.shape, validationInput.shape, trainTarget.shape, validationTarget.shape

## Selecting the Best Model

In [None]:
trainTarget = trainTarget.values.ravel()

In [None]:
svcClf = SVC()
svcClf.fit(trainInput, trainTarget)

In [None]:
forestClf = ensemble.RandomForestClassifier()
forestClf.fit(trainInput, trainTarget)

In [None]:
nbClf = GaussianNB()
nbClf.fit(trainInput, trainTarget)

In [None]:
knnClf = KNeighborsClassifier()
knnClf.fit(trainInput, trainTarget)

In [None]:
treeClf = tree.DecisionTreeClassifier()
treeClf.fit(trainInput, trainTarget)

In [None]:
models = [svcClf, forestClf, nbClf, knnClf, treeClf]
accs = []
titles = []

for model in models:
    pred = model.predict(validationInput)
    model_acc = accuracy_score(validationTarget, pred)
    accs.append(model_acc)
    titles.append(type(model).__name__)
    print(type(model).__name__, "accuracy is", model_acc)

fig = plt.figure(figsize=(10, 5))
sns.barplot(x = titles, y=accs)
plt.xlabel("Models")  # x-axis label
plt.ylabel("Accuracies")


We observe that the RandomForest model has the best accuracy. We will keep it and, in what follows, tune its hyperparameters to improve its accuracy.
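
A single train/validation split can make the ranking depend on which rows land in the validation set. As a hedged complement, the sketch below compares a few of the same model types with 5-fold cross-validation. It runs on synthetic data from `make_classification` (standing in for `trainInput` and `trainTarget`), so the scores say nothing about the soil dataset; `X_demo`, `y_demo`, `candidates`, and `cv_results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with 12 features, standing in for trainInput / trainTarget.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=42
)

candidates = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

# Mean and standard deviation of accuracy over 5 folds for each candidate.
cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread alongside the mean helps judge whether the gap between two models is larger than the fold-to-fold variation.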

## Fine-Tuning Our RandomForest Model

In [None]:
forestClassifier = ensemble.RandomForestClassifier(random_state=42)

In [None]:
paramGrid = {
    'n_estimators': [200, 300, 500],
    'max_features': ['sqrt', 'log2'],  # 'auto' dropped: recent scikit-learn versions reject it, and for classifiers it was equivalent to 'sqrt'
    'max_depth' : [4, 5, 6 ,7 ,8, 9, 10],
    'criterion' :['gini', 'entropy']
}

In [None]:
clf = GridSearchCV(
    estimator=forestClassifier,
    param_grid=paramGrid,
    cv= 5
  )
clf.fit(trainInput, trainTarget)

In [None]:
clf.best_params_
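
With the default `refit=True`, `GridSearchCV` retrains the best parameter combination on the full training data and exposes it as `best_estimator_`, so the winning settings do not have to be copied by hand. The sketch below illustrates this on synthetic data with a deliberately small grid to keep it fast; `X_demo`, `y_demo`, `small_grid`, and `search` are hypothetical names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for trainInput / trainTarget (12 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=6, n_classes=3, random_state=42
)

# Deliberately small grid so the example runs quickly.
small_grid = {"n_estimators": [20, 40], "max_depth": [4, 8]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid=small_grid, cv=3)
search.fit(X_demo, y_demo)

# best_estimator_ is already refit on all of X_demo with the best parameters.
best_model = search.best_estimator_
print(search.best_params_)
print(f"mean CV accuracy of best setting: {search.best_score_:.3f}")
```

In this notebook, the equivalent object would presumably be `clf.best_estimator_`.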

## Training the Model

In [None]:
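# max_features='sqrt' replaces 'auto', which meant the same thing for classifiers
# and is rejected by recent scikit-learn versions.
randomForestModel = ensemble.RandomForestClassifier(
    criterion='gini',
    max_depth=10,
    max_features='sqrt',
    n_estimators=300,
    random_state=42,
)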

In [None]:
randomForestModel.fit(trainInput, trainTarget)

In [None]:
predictions = randomForestModel.predict(validationInput)

In [None]:
report = classification_report(validationTarget, predictions)
print(report)  # print so the report's line breaks render as a table

In [None]:
model_acc = accuracy_score(validationTarget, predictions)
model_acc

#### Saving the Model in pkl Format

In [None]:
# with open('random_forest_pkl.pkl', 'wb') as file:
#     pickle.dump(randomForestModel, file)
import joblib

# Save the tuned model (not `model`, which is the leftover loop variable from the comparison cell)
joblib.dump(randomForestModel, 'random_forest_pkl.pkl')
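
To confirm that a saved model is usable, it can be loaded back with `joblib.load` and checked against the in-memory model. The sketch below does this round trip on a small model trained on synthetic data, writing to a temporary directory; `demo_model` and the file name `demo_random_forest.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (12 features, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=6, n_classes=3, random_state=42
)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save to a temporary file, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_random_forest.pkl")  # hypothetical file name
    joblib.dump(demo_model, model_path)
    reloaded_model = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo), reloaded_model.predict(X_demo))
print("predictions match:", same_predictions)
```

Pickled scikit-learn models are generally expected to be loaded with the same scikit-learn version that saved them, so it is worth recording the version alongside the file. Remember that new samples must go through the same log10 transformation before being passed to `predict`.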

In [None]:
import sklearn
print(sklearn.__version__)