# Prediction of Soil Viability for Sustainable Agriculture

🎯 The goal of this challenge is to train a model that classifies soils as viable or not for sustainable agriculture.

💡 As part of an initiative to promote sustainable agriculture worldwide, experiments were conducted at different locations.

Each experiment consisted of an analysis of the soil.  
The results of these analyses are our features.

After the analysis, a small agriculture project was launched at the location:    
- If the project was successful, the soil was labeled as viable.  
- If the project failed, the soil was labeled as not viable.  

The viability of the soil is our target.

💡 Small test projects were used for data collection, but the ambition is to launch projects of much larger scale.  

The cost and time investment of these large-scale projects are extremely high.  

🎯 To be valuable, our model should be right at least 90% of the time when it identifies a viable soil.

Here is a description of the fields:
- **id**: Unique identification number of the experiment
- **scientist**: Name of the scientist responsible for the experiment
- **measure_index**: Engineered measure of soil characteristics
- **measure_moisture**: Moisture level of the soil
- **measure_temperature**: Temperature of the soil, in Celsius degrees
- **measure_chemicals**: Index of chemical presence in the soil
- **measure_biodiversity**: Index of biodiversity in the soil
- **measure_flora**: Index of flora diversity in the soil
- **main_element**: Symbol of the main chemical element found in the soil
- **past_agriculture**: Indicates the presence of past agriculture on the soil
- **soil_condition**: Overall indicator of the soil fertility
- **datetime_start**: Timestamp of experiment's start 
- **datetime_end**: Timestamp of experiment's end
- **target**: Viability of the soil  
    - 1: means the soil was viable, i.e. the test project was a success  
    - 0: means the soil was not viable, i.e. the test project was a failure

## Data Collection

*C5 - Prepare the data for training so that it is clean*

**📝 Load the csv provided at this URL: https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_train.csv.**

In [1]:
import pandas as pd
import numpy as np
url = "https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_train.csv"
df = pd.read_csv(url)
df.head(20)



Unnamed: 0,id,scientist,measure_index,measure_moisture,measure_temperature,measure_chemicals,measure_biodiversity,measure_flora,main_element,past_agriculture,soil_condition,datetime_start,datetime_end,target
0,493,Kathryn Owens,1.875085,24.442232,18.510316,5.715697,521.074105,,Na,yes,normal,2017-06-27 16:53:42,2017-06-27 20:05:36,1
1,2340,Andrea Pratt,7.658911,30.121175,17.05025,1.973804,314.443474,,Ca,no,rich,2018-12-10 07:06:56,2018-12-10 11:43:29,1
2,5434,Kaitlyn Jackson,18.000212,34.188025,17.157393,3.658506,361.79618,,Al,yes,normal,2018-10-04 18:45:29,2018-10-04 23:20:38,0
3,2304,Brett Rosario,4.056764,37.462768,13.275961,6.666983,402.016494,,Ca,no,normal,2018-10-03 08:03:36,2018-10-03 10:56:40,0
4,1911,Craig Thompson,53.271676,31.425482,17.433458,1.940748,978.383654,,Si,no,poor,2018-07-20 09:27:34,2018-07-20 13:48:30,0
5,4733,Stacy Elliott,5.446522,31.946289,14.820124,1.531663,775.368114,,Ca,yes,rich,2017-08-06 14:23:51,2017-08-06 19:20:46,1
6,1597,Scott Morris,4.986124,28.368372,19.055275,1.095355,806.166614,,Si,yes,poor,2018-03-19 10:47:24,2018-03-19 12:12:49,0
7,1258,Denise Duffy,19.539513,39.871162,16.362313,0.488977,388.962889,,Na,yes,normal,2018-09-14 10:32:44,2018-09-14 13:14:49,1
8,6526,Casey Rivera,45.656502,33.024977,15.167387,1.579964,405.198078,,C,no,normal,2018-08-15 06:06:40,2018-08-15 10:15:31,0
9,331,Christopher Sullivan,20.629855,38.786551,16.189421,3.075686,528.724173,,Ca,,normal,2018-03-04 02:13:01,2018-03-04 03:23:59,1


In [6]:
df.shape

(8302, 13)

**📝 Clean the dataset and store the resulting dataset in the `data` variable:**

In [2]:
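# measure_flora is empty in the rows shown above, so drop it
df = df.drop(columns='measure_flora')

# Fill gaps in numerical columns with the column median.
# Assign the result back instead of using a chained `inplace=True`,
# which acts on a temporary copy and stops working in pandas 3.0.
columns_to_fill = ['measure_moisture', 'measure_temperature', 'measure_chemicals', 'measure_biodiversity']
for column in columns_to_fill:
    df[column] = df[column].fillna(df[column].median())

# Fill gaps in the categorical column with its most frequent value
df['past_agriculture'] = df['past_agriculture'].fillna(df['past_agriculture'].mode()[0])

# Parse the timestamps
df['datetime_start'] = pd.to_datetime(df['datetime_start'])
df['datetime_end'] = pd.to_datetime(df['datetime_end'])

data = df
data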


Unnamed: 0,id,scientist,measure_index,measure_moisture,measure_temperature,measure_chemicals,measure_biodiversity,main_element,past_agriculture,soil_condition,datetime_start,datetime_end,target
0,493,Kathryn Owens,1.875085,24.442232,18.510316,5.715697,521.074105,Na,yes,normal,2017-06-27 16:53:42,2017-06-27 20:05:36,1
1,2340,Andrea Pratt,7.658911,30.121175,17.050250,1.973804,314.443474,Ca,no,rich,2018-12-10 07:06:56,2018-12-10 11:43:29,1
2,5434,Kaitlyn Jackson,18.000212,34.188025,17.157393,3.658506,361.796180,Al,yes,normal,2018-10-04 18:45:29,2018-10-04 23:20:38,0
3,2304,Brett Rosario,4.056764,37.462768,13.275961,6.666983,402.016494,Ca,no,normal,2018-10-03 08:03:36,2018-10-03 10:56:40,0
4,1911,Craig Thompson,53.271676,31.425482,17.433458,1.940748,978.383654,Si,no,poor,2018-07-20 09:27:34,2018-07-20 13:48:30,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8297,6275,James Carter,6.616369,999.000000,17.283441,7.693774,683.214251,Al,no,poor,2017-04-23 04:10:57,2017-04-23 06:11:58,0
8298,9851,Benjamin Rodriguez,23.578767,999.000000,18.456903,1.009824,313.740816,O,no,rich,2017-07-21 16:57:54,2017-07-21 20:30:17,1
8299,1453,Colin Baxter,9.967761,999.000000,14.949271,3.484480,370.533784,O,no,rich,2017-01-07 04:00:02,2017-01-07 06:26:16,1
8300,4265,Christina Ortega,5.195543,999.000000,21.811990,2.261112,403.970941,Si,no,poor,2018-12-12 10:12:04,2018-12-12 13:02:11,0
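The last rows of the output show `measure_moisture` equal to `999.0`, which looks more like a placeholder than a real reading. That is an assumption: the field description does not say so. If it holds, a median fill would leave those rows untouched, because they are not `NaN`. A minimal sketch on made-up values (the `toy` frame and the `SENTINEL` name are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 999.0 is ASSUMED to mark a missing moisture reading.
SENTINEL = 999.0
toy = pd.DataFrame({"measure_moisture": [24.4, 30.1, SENTINEL, np.nan, 34.2]})

# Turn the assumed sentinel into NaN so it is treated like any other gap.
toy["measure_moisture"] = toy["measure_moisture"].replace(SENTINEL, np.nan)

# Fill every gap with the median of the genuine readings.
median_moisture = toy["measure_moisture"].median()
toy["measure_moisture"] = toy["measure_moisture"].fillna(median_moisture)

print(toy)
```

Whether the same treatment is right for the real dataset depends on how many rows carry that value and on what the data provider meant by it.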


### 💾 Save your results

Run the cell below to save your results.

In [4]:
pip install nbresult

Collecting nbresult
  Downloading nbresult-0.0.9-py3-none-any.whl (4.3 kB)
Installing collected packages: nbresult
Successfully installed nbresult-0.0.9
Note: you may need to restart the kernel to use updated packages.




In [5]:
from nbresult import ChallengeResult
results = ChallengeResult(
    "data_cleaning",
    columns=data.columns,
    shape=data.shape,
    samples=data.loc[7000:,:]
)
results.write()


## Target, Baseline & Metrics

*C9 - Train a supervised learning model to optimize a prediction function from annotated examples (50%)*

**📝 Check the number of target classes and their distribution.**

In [None]:
target_classes_repartition = data['target'].value_counts(normalize=True)

print(target_classes_repartition)

In [None]:
target_counts = data['target'].value_counts()
print(target_counts)

❓ Is the dataset balanced?

> The ratio between the two classes is close to 1 (balance: 0.991), so the dataset is balanced and we can proceed to training without rebalancing.

🎯 Recall our initial requirement:

**"To be valuable, our model should be right at least 90% of the time when it predicts a viable soil."**

📝 Store the name of the metric we should use for this purpose in a variable `metric` from the list proposed by [Scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).


In [None]:
metric = "precision"
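To see why precision matches the requirement, it helps to compute it by hand. The sketch below uses invented confusion-matrix counts (they are not results from this dataset) and shows that a model can meet the 90% bar on its "viable" predictions while its overall accuracy is much lower.

```python
def precision(tp: int, fp: int) -> float:
    """Share of 'viable' predictions that were actually viable."""
    return tp / (tp + fp)


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical counts: 50 soils predicted viable, 45 of them correctly.
tp, fp, tn, fn = 45, 5, 30, 20

print(precision(tp, fp))         # 0.9
print(accuracy(tp, fp, tn, fn))  # 0.75
```

Precision ignores the viable soils the model missed (`fn`), which fits a setting where a wrongly launched large-scale project is the expensive mistake.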

**📝 Compute the baseline score and store the result as a floating number in the `baseline_score` variable.**


In [None]:
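# Share of the most frequent class: the score of a naive model that always predicts it.
# With near-balanced classes this is also close to the precision of always predicting "viable".
baseline_score = max(data['target'].value_counts(normalize=True))
baseline_score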

**📝 Store the target in a variable named `y`.**

In [None]:
y = data['target']

### 💾 Save your results

Run the cell below to save your results.

In [None]:
results = ChallengeResult(
    "baseline",
    metric=metric,
    baseline=baseline_score
)
results.write()

## Features

In [None]:
from sklearn import set_config; set_config(display='diagram')

**📝 Store the features in a DataFrame `X`.**


In [None]:
X = data.drop('target', axis=1)

💡 Two features in there are useless.

- `id`: serves a technical need and does not carry any information.  
- `scientist`: almost all experiments were conducted by different scientists, we assume they all followed the same protocol for the experiment.

**📝 Drop these two features.**

In [None]:
X = X.drop(['id', 'scientist'], axis=1)

**📝 Create variables to store feature names according to their types.**

- `feat_num`: list of numerical features' name
- `feat_cat`: list of categorical features' name
- `feat_time`: list of time features' name

In [None]:
feat_num = ['measure_index', 'measure_moisture', 'measure_temperature', 'measure_chemicals', 'measure_biodiversity']
feat_cat = ['main_element', 'past_agriculture', 'soil_condition']
feat_time = ['datetime_start', 'datetime_end']

💡 We will ignore date-like features for the basic preprocessing.

**📝 Create `X_basic` that contains only numerical and categorical features.**


In [None]:
X_basic = X[feat_num + feat_cat]

### 💾 Save your results

Run the cell below to save your results.

In [None]:
from nbresult import ChallengeResult
result = ChallengeResult(
    "features",
    columns=X.columns,
    shape=X.shape,
    target=y.ndim
)
result.write()

## Preprocessing

*C6 - Transform the input data to satisfy the constraints inherent to the model (Preprocessing)*

In [None]:
from sklearn import set_config; set_config(display='diagram')

**📝 Scale and Encode your features.**

Prepare a ColumnTransformer that:
- Scales the numerical features between $0$ and $1$
- Encodes the categorical features

Store it in a variable `preprocessing_basic`


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

preprocessing_basic = ColumnTransformer(
    transformers=[
        ('num', MinMaxScaler(), feat_num),  # Scale the numerical features to [0, 1]
        ('cat', OneHotEncoder(handle_unknown='ignore'), feat_cat)  # One-hot encode the categorical features
    ],
    remainder='drop'  # Drop any column not listed in `transformers`
)

# Display the transformer
preprocessing_basic
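To check what such a transformer produces, it can be run on a tiny frame where the result is easy to verify by eye. This is a sketch with invented values; `toy` and `toy_preprocessing` are hypothetical names, and only two of the real columns are mimicked.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Invented values: one numerical and one categorical column.
toy = pd.DataFrame({
    "measure_temperature": [10.0, 15.0, 20.0],
    "main_element": ["Ca", "Na", "Ca"],
})

toy_preprocessing = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), ["measure_temperature"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["main_element"]),
    ],
    remainder="drop",
)

toy_encoded = toy_preprocessing.fit_transform(toy)
# The result can be sparse depending on its density; densify it for display.
if hasattr(toy_encoded, "toarray"):
    toy_encoded = toy_encoded.toarray()

# Columns: scaled temperature, then one indicator column per element (Ca, Na).
print(toy_encoded)
```

The first column runs from 0 to 1 and each row has exactly one indicator set, which is the behaviour expected from `preprocessing_basic` on the real features.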

## Linear Model

*C9 - Train a supervised learning model to optimize a prediction function from annotated examples (50%)*

**📝 Cross-validate a linear model on `X_basic` to see how it compares to your baseline.**

Inside a pipeline, apply the basic preprocessing, then use a basic **linear** model with **no penalty**.

Cross-validate your pipeline and store the scores in `scores_linear` as a `numpy.ndarray`.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Classification problem: use an unpenalized logistic regression.
# `penalty=None` needs scikit-learn >= 1.2; older versions use penalty='none'.
pipeline_linear = Pipeline(steps=[
    ('preprocessing', preprocessing_basic),  # Preprocessing defined above
    ('linear_model', LogisticRegression(penalty=None, max_iter=1000))  # Linear classifier
])

# Cross-validate with the metric chosen earlier
scores_linear = cross_val_score(pipeline_linear, X_basic, data['target'], cv=5, scoring='precision')

print(scores_linear)


**❓ Does your model beat the baseline? Do you reach your goal?**

> Yes, the model clearly beats the baseline: its cross-validated precision is well above the proportion of the majority class, so it does better than a naive prediction. It does not reach the goal, though. The requirement is to be right at least 90% of the time when the model identifies a soil as viable, and the linear model's precision falls between roughly 78% and 81%. Reaching 90% will likely require a more advanced model, more careful feature selection, or additional feature engineering.

### 💾 Save your results

Run the cell below to save your results.

In [None]:
from nbresult import ChallengeResult
X_preproc=preprocessing_basic.fit_transform(X_basic)
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val = train_test_split(X_basic,y,test_size=0.3,random_state=10)
pipe=pipeline_linear.fit(X_,y_)

result = ChallengeResult(
    'basic_pipeline',
    preproc=preprocessing_basic,
    preproc_shape=X_preproc.shape,
    pipe=pipeline_linear,
    y=y_val,
    y_pred=pipeline_linear.predict(X_val),
    scores=scores_linear
)
result.write()

## Feature Engineering

*C7 - Generate input data to satisfy the constraints inherent to the model (Feature Engineering)*

💡 We are going to look more closely at the features and try to enhance our preprocessing.

### Enhanced `soil_condition` Encoding

**📝 Check the possible values of the feature `soil_condition`**

In [None]:
soil_condition_values = data['soil_condition'].unique()
soil_condition_values

**❓ Can you think of a better way to encode the `soil_condition` feature?**

> Yes. The values have a natural order, so an ordinal encoding can map them to ranked numbers (e.g. 3 = 'rich', 2 = 'normal', 1 = 'poor'). Unlike one-hot encoding, this keeps the ordering information in the feature.

In [None]:
soil_condition_mapping = {'rich': 3, 'normal': 2, 'poor': 1}
data['soil_condition_encoded'] = data['soil_condition'].map(soil_condition_mapping)

**📝 Select a transformer keeping a sense of the order of the values of `soil_condition` to encode that feature.**

Encode `soil_condition` from `X` with that relevant encoder and store the result in `X_soil_condition_encoded` as a `numpy.ndarray`.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder(categories=[['poor', 'normal', 'rich']])
soil_condition = X['soil_condition'].values.reshape(-1, 1)
X_soil_condition_encoded = ordinal_encoder.fit_transform(soil_condition)

**📝 Make sure that it works properly.**

Check the value counts for the feature `soil_condition`

In [None]:
value_counts_before = X['soil_condition'].value_counts()
print("Value counts before transformation:\n", value_counts_before)

**📝 Check it again,  after transformation with the relevant encoder:**

In [None]:
encoded_series = pd.Series(X_soil_condition_encoded.flatten())

value_counts_after = encoded_series.value_counts().sort_index()
print("Value counts after transformation:\n", value_counts_after)
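A quick way to confirm the order is respected is to encode a handful of known values and read the codes back. The sketch below uses invented inputs; `toy_conditions` and `toy_encoder` are hypothetical names. Note that `OrdinalEncoder` numbers the categories from 0, so its codes are shifted by one compared with a manual 1-to-3 mapping, while the ranking stays the same.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Invented sample of soil conditions, as a single-column 2D array.
toy_conditions = np.array([["rich"], ["poor"], ["normal"], ["poor"]])

# The explicit category list fixes the order: poor < normal < rich.
toy_encoder = OrdinalEncoder(categories=[["poor", "normal", "rich"]])
toy_codes = toy_encoder.fit_transform(toy_conditions)

print(toy_codes.ravel())  # [2. 0. 1. 0.]
```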

### Custom Time Transformers

#### Datetime Features Extraction

💡 We want to extract two pieces of information from our time features:

📅 The `month` of the experiment's start

⏳ The `duration` of the experiment in an appropriate unit

**📝 Compute the `duration` of experiments, and look at the statistics.**

In [None]:
X['experiment_duration'] = (pd.to_datetime(X['datetime_end']) - pd.to_datetime(X['datetime_start']))
duration = X['experiment_duration'].dt.total_seconds().describe()
print(duration)

**❓ What is the most accurate time unit to use to describe the `duration` feature?**

**📝 Choose between `['days', 'hours', 'minutes', 'seconds']` and store your choice in the `duration_time_unit` variable:**

In [None]:
duration_time_unit = 'seconds'
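When weighing the candidate units, it can help to express one known experiment in each of them. The sketch below takes the start and end timestamps of the first row displayed in the data sample and converts the elapsed time; it only illustrates the conversion and does not settle which unit the exercise expects. Whatever unit is stored in `duration_time_unit` should match the one the extractor actually returns.

```python
from datetime import datetime

# Timestamps of the first experiment shown in the data sample above.
fmt = "%Y-%m-%d %H:%M:%S"
start = datetime.strptime("2017-06-27 16:53:42", fmt)
end = datetime.strptime("2017-06-27 20:05:36", fmt)

seconds = (end - start).total_seconds()

# The same duration expressed in each candidate unit.
durations = {
    "seconds": seconds,
    "minutes": seconds / 60,
    "hours": seconds / 3600,
    "days": seconds / 86400,
}

for unit, value in durations.items():
    print(f"{unit}: {value:.3f}")
```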

**📝 Create a `TimeFeaturesExtractor` class that transforms `datetime_start` and `datetime_end` into `month` and `duration`:**
- `month` as a number from 1 to 12
- `duration` as a float in the relevant `duration_time_unit`

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class TimeFeaturesExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_transformed = pd.DataFrame()
        X_transformed['month'] = pd.to_datetime(X['datetime_start']).dt.month
        duration = (pd.to_datetime(X['datetime_end']) - pd.to_datetime(X['datetime_start'])).dt.total_seconds()
        X_transformed['duration'] = duration
        return X_transformed

time_feature_extractor = TimeFeaturesExtractor()
X_time_features = time_feature_extractor.transform(X.head(100))

**📝 Apply your `TimeFeaturesExtractor` to _100 rows_ of `X` and store the result in a DataFrame `X_time_features`**

Double check that it has **2 columns**: `month` and `duration`, and **100 rows**

In [None]:
print(X_time_features.shape)
print(X_time_features.head())

#### Cyclical Encoding & Scaling

💡 We now have to encode and scale the extracted time features!  

You should scale the `duration` between 0 and 1.  

However we need to build a **Cyclical Encoder** for the `month`.

**📝 Create a `CyclicalEncoder` class that transforms `month` into `month_cos` and `month_sin`.**

Recall the equations:  

$month\_norm = 2\pi\frac{month}{12}$  
$month\_cos = \cos({month\_norm})$  
$month\_sin = \sin({month\_norm})$

In [None]:
class CyclicalEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        
        X_encoded = pd.DataFrame()
        month_norm = 2 * np.pi * X['month'] / 12
        X_encoded['month_cos'] = np.cos(month_norm)
        X_encoded['month_sin'] = np.sin(month_norm)
        return X_encoded
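The point of the cosine and sine pair is that December and January end up as neighbours, which a plain 1 to 12 number cannot express. A small standard-library sketch of the same equations (the helper names `encode_month` and `distance` are hypothetical):

```python
import math


def encode_month(month):
    """Map a month number (1-12) to a point on the unit circle."""
    month_norm = 2 * math.pi * month / 12
    return math.cos(month_norm), math.sin(month_norm)


def distance(month_a, month_b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(month_a), encode_month(month_b))


print(distance(12, 1))  # December to January: one step apart
print(distance(6, 7))   # June to July: the same distance
print(distance(12, 6))  # December to June: opposite sides of the circle
```

Consecutive months are always the same distance apart, including across the year boundary, and months six apart are at the maximum distance of 2.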

**📝 Apply your `CyclicalEncoder` to `X_time_features` and store the result in a DataFrame `X_time_cyclical`.**

Double check that it has **2 columns**: `month_cos` and `month_sin`, and **100 rows**

In [None]:
cyclical_encoder = CyclicalEncoder()
X_time_cyclical = cyclical_encoder.transform(X_time_features[['month']])
print(X_time_cyclical.shape)
print(X_time_cyclical.head())

**📝 Build a pipeline, that contains all the steps for time features.**

Store it in a variable `preprocessing_time`

**Steps**

- Extraction of `month` and `duration` from  `datetime_start` and `datetime_end`  
- Scaling of `duration` between 0 and 1
- Cyclical encoding of `month`

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

preprocessing_time = Pipeline([
    ('time_features_extractor', TimeFeaturesExtractor()),
    ('feature_transform', ColumnTransformer([
        ('duration_scaler', MinMaxScaler(), ['duration']),
        ('cyclical_encoder', CyclicalEncoder(), ['month'])
    ], remainder='drop'))
])
preprocessing_time

### 💾 Save your results

Run the cell below to save your results.

In [None]:
from nbresult import ChallengeResult
results = ChallengeResult(
    'feature_engineering',
    x_soil_condition=X_soil_condition_encoded,
    X_time_features=X_time_features,
    X_time_cyclical= X_time_cyclical,
    X_time=preprocessing_time.fit_transform(X)
)
results.write()

## Advanced Pipeline

**📝  Build a full preprocessing pipeline and store it in `preprocessing_advanced`.**

Here are its steps; they should run in parallel inside a ColumnTransformer:

- Scale all numerical features between 0 and 1
- Encode `main_element`  
- Better encode `soil_condition`
- Apply the `preprocessing_time` pipeline on `datetime_start` and `datetime_end`

In [None]:
preprocessing_advanced = ColumnTransformer(
    [
        ('num', MinMaxScaler(), feat_num),
        ('element', OneHotEncoder(handle_unknown='ignore'), ['main_element']),
        ('soil', OrdinalEncoder(categories=[['poor', 'normal', 'rich']]), ['soil_condition']),
        # preprocessing_time extracts month and duration, scales duration to [0, 1]
        # and cyclically encodes month, so duration is kept alongside month_cos/month_sin
        ('time', preprocessing_time, feat_time)
    ],
    remainder='passthrough'
)

## Regularized Linear Model

*C8 - Master the different learning algorithms in order to provide a suitable answer to an organization's problem (company, laboratory, etc.)*

**📝 Build a pipeline that uses `preprocessing_advanced` and then a _Regularized Linear_ model.**

Cross-validate your pipeline and store the scores in a list `scores_regularized`

In [None]:
X['past_agriculture']= X['past_agriculture'].replace({'yes': 1,  'no':0})

In [None]:
X['experiment_duration'] = X['experiment_duration'].apply(lambda x: x.total_seconds())

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The target is binary, so use a regularized linear classifier (L2 penalty)
# rather than a regressor
pipeline_regularized = Pipeline([
    ('preprocessing_advanced', preprocessing_advanced),
    ('logistic_l2', LogisticRegression(penalty='l2', C=1.0, max_iter=1000))
])

# Score with the metric chosen earlier
scores_regularized = cross_val_score(pipeline_regularized, X, y, cv=5, scoring='precision', error_score='raise')
print(scores_regularized)

### 💾 Save your results

Run the cell below to save your results.

In [None]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val = train_test_split(X,y,test_size=0.3,random_state=7)
pipe=pipeline_regularized.fit(X_,y_)

result = ChallengeResult(
    'advanced_pipeline',
    steps=str(pipeline_regularized.steps),
    scores=scores_regularized,
    y=y_val,
    y_pred=pipeline_regularized.predict(X_val)
)
result.write()

## Dimensionality Reduction

*C10 - Train an unsupervised learning model to detect underlying structures from unlabeled data*

**📝 Add a dimensionality reduction step as the last step of your `preprocessing_advanced`. Make sure your dimensionality reduction keeps _only 12 features_.**

In [None]:
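from sklearn.decomposition import PCA

# A ColumnTransformer has no `steps` attribute, so chain it with PCA inside a Pipeline
preprocessing_advanced = Pipeline([('preprocessing', preprocessing_advanced),
                                   ('dimensionality_reduction', PCA(n_components=12))])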

**📝 Apply your `preprocessing_advanced` to `X` and store the result in the `X_preproc_adv` variable.**

In [None]:
X_preproc_adv = preprocessing_advanced.fit_transform(X)
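A useful sanity check after a reduction step is to confirm the output width and see how much variance the kept components retain. The sketch below runs PCA on random data only (the 20-column `toy_features` matrix is invented and unrelated to the soil features), so the printed ratio says nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented data: 200 rows, 20 uncorrelated columns.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(200, 20))

toy_pca = PCA(n_components=12)
toy_reduced = toy_pca.fit_transform(toy_features)

# Exactly 12 columns remain after the projection.
print(toy_reduced.shape)

# Fraction of the total variance carried by the 12 kept components.
print(toy_pca.explained_variance_ratio_.sum())
```

On the real pipeline, the same two checks can be applied to `X_preproc_adv` and to the fitted PCA step.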

### 💾 Save your results

Run the cell below to save your results.

In [None]:
from nbresult import ChallengeResult
results=ChallengeResult(
    'unsupervised',
    algorithm=preprocessing_advanced.steps[-1],
    X_preproc_adv=X_preproc_adv
)
results.write()

## Non-linear Model

**📝 Build a pipeline that uses `preprocessing_advanced` and then an _Ensemble_ model.**

Store this pipeline in the variable `pipeline_ensemble`

Cross-validate your pipeline and store the scores in a list `scores_ensemble`

In [None]:
from sklearn.ensemble import RandomForestClassifier

# The target is binary, so use a classifier and score it with precision
pipeline_ensemble = Pipeline([
    ('preprocessing', preprocessing_advanced),
    ('ensemble_model', RandomForestClassifier())
])

scores_ensemble = cross_val_score(pipeline_ensemble, X, y, cv=5, scoring='precision')

**❓ Does this non-linear model satisfy the goal of the study?**

> YOUR ANSWER HERE

💡 Wait, did our feature engineering help us ❓

**📝 Build a pipeline that uses `preprocessing_basic` and the same Ensemble model as above.**

In [None]:
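from sklearn.ensemble import RandomForestClassifier

# Same classifier as above, with the basic preprocessing only
pipeline_ensemble_basic = Pipeline([
    ('preprocessing', preprocessing_basic),
    ('ensemble_model', RandomForestClassifier())
])

scores_ensemble_basic = cross_val_score(pipeline_ensemble_basic, X, y, cv=5, scoring='precision')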

**❓ What is your conclusion?**

> YOUR ANSWER HERE

### 💾 Save your results

Run the cell below to save your results.

In [None]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val=train_test_split(X,y,test_size=0.3,random_state=7)
pipeline_ensemble.fit(X_,y_)
y_pred=pipeline_ensemble.predict(X_val)

results=ChallengeResult(
    'ensemble',
    steps=str(pipeline_ensemble.steps),
    scores=scores_ensemble,
    y=y_val,
    y_pred=y_pred
)
results.write()

## Fine-Tuning

*C11 - Improve the predictive capabilities of a system by selecting a different model or by modifying its hyperparameters in order to correct errors (hyperparameter tuning)*

💡 To improve the model as much as we can, it's time to grid search for optimal hyperparameters

**📝 Look at the hyperparameters of your estimator**

In [None]:
# YOUR CODE HERE

**📝 Try to fine tune some hyperparameters to improve your model!**

In [None]:
# YOUR CODE HERE

**📝 Store the _fitted_ grid search in the `search` variable:**

In [None]:
# YOUR CODE HERE

**📝 Store the _cross-validated results_ of your grid search in the `cv_results` variable:**

In [None]:
# YOUR CODE HERE

**📝 Store the _best model_ of your grid search in a variable `tuned_model`.**

In [None]:
# YOUR CODE HERE
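As a point of reference for the tuning steps above, here is one possible shape for a grid search, shown on synthetic data so that it runs on its own. Everything prefixed with `toy_` is hypothetical, the grid values are arbitrary, and the choice of a random forest is an assumption. In the notebook itself the estimator would be the full pipeline, so parameter names would carry the step prefix (for example `ensemble_model__max_depth`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, unrelated to the soil dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# get_params() lists every hyperparameter the estimator exposes.
available = RandomForestClassifier().get_params()
print(sorted(available)[:5])

# Arbitrary small grid: 2 x 2 = 4 candidate settings.
toy_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

toy_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=toy_grid,
    scoring="precision",  # same metric as the study's requirement
    cv=3,
)
toy_search.fit(X_toy, y_toy)

toy_cv_results = toy_search.cv_results_       # per-candidate scores and timings
toy_tuned_model = toy_search.best_estimator_  # refitted on all of X_toy

print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))
```

The three attributes used here (`cv_results_`, `best_estimator_`, `best_params_`) correspond to what the save cell below expects in `cv_results`, `tuned_model` and `search`.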

### 💾 Save your results

Run the cell below to save your results.

In [None]:
from nbresult import ChallengeResult
from sklearn.model_selection import train_test_split
X_,X_val,y_,y_val=train_test_split(X,y,test_size=0.3,random_state=10)
tuned_model.fit(X_,y_)

result = ChallengeResult(
    'model_tuning',
    scores_ensemble=scores_ensemble,
    scoring=search.scorer_,
    params=search.best_params_,
    cv_results=cv_results,
    y=y_val,
    y_pred=tuned_model.predict(X_val)
)
result.write()

## Prediction

**📝 Use your newly fine-tuned model to predict on a test set.**

Load the test set provided at this url: "https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_test.csv".

Create `X_test` and `y_test`

Use your fine-tuned model to predict on `X_test`

Print a full classification report with your prediction and `y_test`

In [None]:
df_test = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/certification/soils_viability/soils_viability_test.csv")

In [None]:
# YOUR CODE HERE
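For the report itself, `classification_report` prints precision, recall and F1 per class, and the row for class `1` is the one that relates to the 90% requirement. The sketch below uses invented labels (`y_true_toy` and `y_pred_toy` are hypothetical) purely to show how to read it. On the real test set, the same cleaning steps used for the training data would presumably need to be applied before predicting.

```python
from sklearn.metrics import classification_report, precision_score

# Invented labels: 4 soils predicted viable, 3 of them truly viable.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 0, 1, 1, 0]

# Full report: one row per class plus averages.
print(classification_report(y_true_toy, y_pred_toy))

# Precision for the positive class only (class 1 = viable).
toy_precision = precision_score(y_true_toy, y_pred_toy)
print(toy_precision)  # 0.75
```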

**❓ Comment on your results:**

> YOUR ANSWER HERE

## API 

Time to put a pipeline in production!

👉 Go back to the certification interface and follow the instructions about the API challenge.

**This final part is independent from the above notebook**

*C12 - Deploy the resulting supervised or unsupervised learning model to production as an API*