<a href="https://colab.research.google.com/github/darqheartX/newProject/blob/branch_one/Radiant_Earth_Spot_the_Crop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src='https://radiant-assets.s3-us-west-2.amazonaws.com/PrimaryRadiantMLHubLogo.png' alt='Radiant MLHub Logo' width='300'/>

# A Baseline Model for the Radiant Earth Spot the Crop Challenge

This notebook walks you through the steps to load the data and build a baseline model using Random Forests for `Radiant Earth Spot the Crop Challenge`.

## Radiant MLHub API


The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the [Radiant MLHub site](https://mlhub.earth) and about the organization behind it at the [Radiant Earth Foundation site](https://radiant.earth).

Full documentation for the API is available at [docs.mlhub.earth](docs.mlhub.earth).

Each item in our collection is explained in json format compliant with [STAC](https://stacspec.org/) [label extension](https://github.com/radiantearth/stac-spec/tree/master/extensions/label) definition.

## Dependencies

All the dependencies for this notebook are included in the `requirements.txt` file included in this folder.


**You must replace the `YOUR_API_KEY_HERE` text with your API key which you can obtain by creating a free account on the [MLHub Dashboard](https://dashboard.mlhub.earth/) within the `API Keys` tab at the top of the page.**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install rasterio

In [None]:
!pip install radiant-mlhub

In [1]:
from radiant_mlhub import Collection
import tarfile
import os
from pathlib import Path
import json

import datetime
import rasterio
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedShuffleSplit

## Downloading and Loading the Data

In this part, we will download the data from Radiant MLHub and load the properties of each item in the dataset into a DataFrame


In [5]:
os.environ['MLHUB_API_KEY'] = 'f83bfa09f61af148f68190b6e8d4839ef1358256dac814960922cd9c43d89e8e'

collections = [
    'ref_south_africa_crops_competition_v1_train_labels',
    'ref_south_africa_crops_competition_v1_train_source_s1', 
    'ref_south_africa_crops_competition_v1_test_labels',
    'ref_south_africa_crops_competition_v1_test_source_s1', 
#     'ref_south_africa_crops_competition_v1_test_source_s2' # Uncomment this out if you want to download the Sentinel-2 Data (not needed for the Hackathon)
#     'ref_south_africa_crops_competition_v1_train_source_s2', # Uncomment this out if you want to download the Sentinel-2 Data (not needed for the Hackathon)
]

def download(collection_id):
    print(f'Downloading {collection_id}...')
    collection = Collection.fetch(collection_id)
    path = collection.download('.')
    path = collection_id + '.tar.gz'
    tar = tarfile.open(path, "r:gz")
    tar.extractall()
    tar.close()
    os.remove(path)
    
def resolve_path(base, path):
    return Path(os.path.join(base, path)).resolve()
    
def load_df(collection_id):
    collection = json.load(open(f'{collection_id}/collection.json', 'r'))
    rows = []
    item_links = []
    for link in collection['links']:
        if link['rel'] != 'item':
            continue
        item_links.append(link['href'])
        
    for item_link in item_links:
        item_path = f'{collection_id}/{item_link}'
        current_path = os.path.dirname(item_path)
        item = json.load(open(item_path, 'r'))
        tile_id = item['id'].split('_')[-1]
        for asset_key, asset in item['assets'].items():
            rows.append([
                tile_id,
                None,
                None,
                asset_key,
                str(resolve_path(current_path, asset['href']))
            ])
            
        for link in item['links']:
            if link['rel'] != 'source':
                continue
            link_path = resolve_path(current_path, link['href'])
            source_path = os.path.dirname(link_path)
            try:
                source_item = json.load(open(link_path, 'r'))
            except FileNotFoundError:
                continue
            datetime = source_item['properties']['datetime']
            satellite_platform = source_item['collection'].split('_')[-1]
            for asset_key, asset in source_item['assets'].items():
                rows.append([
                    tile_id,
                    datetime,
                    satellite_platform,
                    asset_key,
                    str(resolve_path(source_path, asset['href']))
                ])
    return pd.DataFrame(rows, columns=['tile_id', 'datetime', 'satellite_platform', 'asset', 'file_path'])

for c in collections:
    download(c)

competition_train_df = load_df('ref_south_africa_crops_competition_v1_train_labels')
competition_test_df = load_df('ref_south_africa_crops_competition_v1_test_labels')

Downloading ref_south_africa_crops_competition_v1_train_labels...


  0%|          | 0/31.4 [00:00<?, ?M/s]

Downloading ref_south_africa_crops_competition_v1_train_source_s1...


  0%|          | 0/5987.8 [00:00<?, ?M/s]

Downloading ref_south_africa_crops_competition_v1_test_labels...


  0%|          | 0/10.9 [00:00<?, ?M/s]

Downloading ref_south_africa_crops_competition_v1_test_source_s1...


  0%|          | 0/2566.1 [00:00<?, ?M/s]

In [None]:
competition_train_df.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/train.csv', index=False)
competition_test_df.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/test.csv', index=False)

In [2]:
competition_train_df=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/train.csv')
competition_test_df=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/test.csv')

In [None]:
competition_train_df

In [4]:
# This DataFrame lists all types of assets including documentation of the data. 
# In the following, we will use the Sentinel-1 bands (VV and VH) as well as labels. 
competition_train_df['asset'].unique()

array(['documentation', 'field_ids', 'field_info_train', 'labels',
       'raster_values', 'VH', 'VV'], dtype=object)

In [3]:
tile_ids_train = competition_train_df['tile_id'].unique()
tile_ids_train, len(tile_ids_train)

(array([2587, 1302, 1130, ...,   99, 1379, 2198]), 2650)

In [4]:
# For simplicty of this baseline model, we will use only 5 observations throughout the growing season
# You can choose to use all of them, select a few of them at specifc intervals, or 
# load as many as you want and interpolate between them to have a regular temporal frequency.
n_obs = 17

# 1

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))

for tile_id in tqdm.notebook.tqdm(tile_ids_train[:150]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/150 [00:00<?, ?it/s]

In [7]:
data1 = pd.DataFrame(X)
data1['label'] = y.astype(int)
data1['field_id'] = field_ids
data1 = data1[data1.field_id != 0]

data_grouped = data1.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata1.csv', index=False)
data_grouped

# 2

In [6]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[150:300]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/150 [00:00<?, ?it/s]

In [8]:
data2 = pd.DataFrame(X)
data2['label'] = y.astype(int)
data2['field_id'] = field_ids
data2 = data2[data2.field_id != 0]

data_grouped = data2.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata2.csv', index=False)
data_grouped

# 3

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[300:450]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/150 [00:00<?, ?it/s]

In [None]:
data3 = pd.DataFrame(X)
data3['label'] = y.astype(int)
data3['field_id'] = field_ids
data3 = data3[data3.field_id != 0]

data_grouped = data3.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata3.csv', index=False)
data_grouped

# 4

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[450:650]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data4 = pd.DataFrame(X)
data4['label'] = y.astype(int)
data4['field_id'] = field_ids
data4 = data4[data4.field_id != 0]

data_grouped = data4.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata4.csv', index=False)
data_grouped

# 5

In [7]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))



for tile_id in tqdm.notebook.tqdm(tile_ids_train[650:850]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data5 = pd.DataFrame(X)
data5['label'] = y.astype(int)
data5['field_id'] = field_ids
data5 = data5[data5.field_id != 0]

data_grouped = data5.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata5.csv', index=False)
data_grouped

# 6

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))



for tile_id in tqdm.notebook.tqdm(tile_ids_train[850:1050]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data6 = pd.DataFrame(X)
data6['label'] = y.astype(int)
data6['field_id'] = field_ids
data6 = data6[data6.field_id != 0]

data_grouped = data6.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata6.csv', index=False)
data_grouped

# 7

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[1050:1250]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data7 = pd.DataFrame(X)
data7['label'] = y.astype(int)
data7['field_id'] = field_ids
data7 = data7[data7.field_id != 0]

data_grouped = data7.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata7.csv', index=False)
data_grouped

# 8

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[1250:1450]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data8 = pd.DataFrame(X)
data8['label'] = y.astype(int)
data8['field_id'] = field_ids
data8 = data8[data8.field_id != 0]

data_grouped = data8.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata8.csv', index=False)
data_grouped

# 9

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[1450:1650]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data9 = pd.DataFrame(X)
data9['label'] = y.astype(int)
data9['field_id'] = field_ids
data9 = data9[data9.field_id != 0]

data_grouped = data9.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata9.csv', index=False)
data_grouped

# 10

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[1650:1850]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data10 = pd.DataFrame(X)
data10['label'] = y.astype(int)
data10['field_id'] = field_ids
data10 = data10[data10.field_id != 0]

data_grouped = data10.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata10.csv', index=False)
data_grouped

# 11

In [6]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[1850:2050]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data11 = pd.DataFrame(X)
data11['label'] = y.astype(int)
data11['field_id'] = field_ids
data11 = data11[data11.field_id != 0]

data_grouped = data11.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata11.csv', index=False)
data_grouped

# 12

In [7]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[2050:2250]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data12 = pd.DataFrame(X)
data12['label'] = y.astype(int)
data12['field_id'] = field_ids
data12 = data12[data12.field_id != 0]

data_grouped = data12.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata12.csv', index=False)
data_grouped

# 13

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[2250:2450]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data13 = pd.DataFrame(X)
data13['label'] = y.astype(int)
data13['field_id'] = field_ids
data13 = data13[data13.field_id != 0]

data_grouped = data13.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata13.csv', index=False)
data_grouped

# 14

In [5]:
import tqdm

X = np.empty((0, 2 * (n_obs-1)))
y = np.empty((0, 1))
field_ids = np.empty((0, 1))


for tile_id in tqdm.notebook.tqdm(tile_ids_train[2450:]):
    if tile_id != '1951': # avoid using this specific tile for the Hackathon as it might have a missing file
        
        tile_df = competition_train_df[competition_train_df['tile_id']==tile_id]

        label_src = rasterio.open(tile_df[tile_df['asset']=='labels']['file_path'].values[0])
        label_array = label_src.read(1)
        y = np.append(y, label_array.flatten())

        field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
        field_id_array = field_id_src.read(1)
        field_ids = np.append(field_ids, field_id_array.flatten())

        tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()

        X_tile = np.empty((256 * 256, 28))
        for date_time in tile_date_times[ : 2 * n_obs : n_obs]:

            vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
            vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)

            vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
            vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)

            X_tile = np.append(X_tile, vv_array, axis = 1)
            X_tile = np.append(X_tile, vh_array, axis = 1)

        X = np.append(X, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data14 = pd.DataFrame(X)
data14['label'] = y.astype(int)
data14['field_id'] = field_ids
data14 = data14[data14.field_id != 0]

data_grouped = data14.groupby('field_id').mean().reset_index()
data_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata14.csv', index=False)
data_grouped

## Building the Model

In [None]:
# Each field has several pixels in the data. Here our goal is to build a Random Forest (RF) model using the average values
# of the pixels within each field. So, we use `groupby` to take the mean for each field_id
data_grouped = data.groupby('field_id').mean().reset_index()
data_grouped

Unnamed: 0,field_id,0,1,2,3,4,5,6,7,label
0,29.0,13.049180,3.032787,11.311475,3.344262,13.459016,3.868852,16.639344,4.721311,4.0
1,78.0,34.942857,9.085714,21.957143,5.614286,19.000000,4.614286,14.042857,3.485714,4.0
2,92.0,11.702970,4.504950,14.118812,4.207921,23.475248,6.207921,18.851485,5.415842,1.0
3,104.0,15.441284,3.189908,18.779817,4.483486,17.837615,3.785321,28.519266,8.812844,4.0
4,114.0,8.675676,2.243243,15.378378,3.081081,11.540541,3.135135,16.432432,3.783784,4.0
...,...,...,...,...,...,...,...,...,...,...
3349,122419.0,16.659292,5.216814,17.300885,4.243363,22.265487,5.699115,21.415929,5.115044,4.0
3350,122436.0,9.101974,1.532895,8.582237,2.003289,8.355263,1.723684,8.625000,1.986842,5.0
3351,122615.0,11.488889,2.385185,14.362963,2.088889,19.607407,3.511111,11.851852,3.800000,2.0
3352,122704.0,8.971173,2.785288,8.131213,2.137177,14.989066,3.524851,11.246521,3.647117,5.0


In [None]:
# Split train and test
# We use field_ids to split the data to train and test. Note that the test portion for training is different than the test 
# portion provided as part of the competition. 
train_per = 0.7

n_fields = len(data_grouped['field_id'])
np.random.seed(10)
train_fields = np.random.choice(data_grouped['field_id'], int(n_fields * train_per), replace=False)
test_fields = data_grouped['field_id'][~np.in1d(data_grouped['field_id'], train_fields)]

In [None]:
X_train, X_test = data_grouped[data_grouped['field_id'].isin(train_fields)], data_grouped[data_grouped['field_id'].isin(test_fields)]
X_train = X_train.drop(columns=['label', 'field_id'])
X_test = X_test.drop(columns=['label', 'field_id'])
y_train, y_test = data_grouped[data_grouped['field_id'].isin(train_fields)]['label'], data_grouped[data_grouped['field_id'].isin(test_fields)]['label']

In [None]:
# We ran a simple hyperparameter tuning for the number of trees, and concluded to use:
n_trees = 50

In [None]:
# Fitting the RF model
rf = RandomForestClassifier(n_estimators = n_trees, random_state = 0, n_jobs = 3)
rf.fit(X_train, y_train.astype(int))

RandomForestClassifier(n_estimators=50, n_jobs=3, random_state=0)

## Competition Test Data

In this part we will load the competition test data (which does not have labels) and predict the crop class for each field

In [5]:
tile_ids_test = competition_test_df['tile_id'].unique()

# Test1

In [8]:
import tqdm

X_competition_test = np.empty((0, 2 * (n_obs-1)))
field_ids_test = np.empty((0, 1))

for tile_id in tqdm.notebook.tqdm(tile_ids_test[:200]):
    tile_df = competition_test_df[competition_test_df['tile_id']==tile_id]
    
    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids_test = np.append(field_ids_test, field_id_array.flatten())
    
    tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()
    
    X_tile = np.empty((256 * 256, 28))
    for date_time in tile_date_times[ : 2 * n_obs : n_obs]:
        vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
        vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)
        
        vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
        vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)
        
        X_tile = np.append(X_tile, vv_array, axis = 1)
        X_tile = np.append(X_tile, vh_array, axis = 1)
        
    X_competition_test = np.append(X_competition_test, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data_test = pd.DataFrame(X_competition_test)
data_test['field_id'] = field_ids_test
data_test = data_test[data_test.field_id != 0]

data_test_grouped = data_test.groupby('field_id').mean().reset_index()

data_test_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest1.csv')

data_test_grouped

# Test2

In [8]:
import tqdm

X_competition_test = np.empty((0, 2 * (n_obs-1)))
field_ids_test = np.empty((0, 1))

for tile_id in tqdm.notebook.tqdm(tile_ids_test[200:400]):
    tile_df = competition_test_df[competition_test_df['tile_id']==tile_id]
    
    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids_test = np.append(field_ids_test, field_id_array.flatten())
    
    tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()
    
    X_tile = np.empty((256 * 256, 28))
    for date_time in tile_date_times[ : 2 * n_obs : n_obs]:
        vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
        vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)
        
        vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
        vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)
        
        X_tile = np.append(X_tile, vv_array, axis = 1)
        X_tile = np.append(X_tile, vh_array, axis = 1)
        
    X_competition_test = np.append(X_competition_test, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data_test = pd.DataFrame(X_competition_test)
data_test['field_id'] = field_ids_test
data_test = data_test[data_test.field_id != 0]

data_test_grouped = data_test.groupby('field_id').mean().reset_index()

data_test_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest2.csv')

data_test_grouped

# Test3

In [6]:
import tqdm

X_competition_test = np.empty((0, 2 * (n_obs-1)))
field_ids_test = np.empty((0, 1))

for tile_id in tqdm.notebook.tqdm(tile_ids_test[400:600]):
    tile_df = competition_test_df[competition_test_df['tile_id']==tile_id]
    
    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids_test = np.append(field_ids_test, field_id_array.flatten())
    
    tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()
    
    X_tile = np.empty((256 * 256, 28))
    for date_time in tile_date_times[ : 2 * n_obs : n_obs]:
        vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
        vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)
        
        vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
        vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)
        
        X_tile = np.append(X_tile, vv_array, axis = 1)
        X_tile = np.append(X_tile, vh_array, axis = 1)
        
    X_competition_test = np.append(X_competition_test, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data_test = pd.DataFrame(X_competition_test)
data_test['field_id'] = field_ids_test
data_test = data_test[data_test.field_id != 0]

data_test_grouped = data_test.groupby('field_id').mean().reset_index()

data_test_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest3.csv')

data_test_grouped

# Test4

In [6]:
import tqdm

X_competition_test = np.empty((0, 2 * (n_obs-1)))
field_ids_test = np.empty((0, 1))

for tile_id in tqdm.notebook.tqdm(tile_ids_test[600:800]):
    tile_df = competition_test_df[competition_test_df['tile_id']==tile_id]
    
    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids_test = np.append(field_ids_test, field_id_array.flatten())
    
    tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()
    
    X_tile = np.empty((256 * 256, 28))
    for date_time in tile_date_times[ : 2 * n_obs : n_obs]:
        vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
        vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)
        
        vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
        vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)
        
        X_tile = np.append(X_tile, vv_array, axis = 1)
        X_tile = np.append(X_tile, vh_array, axis = 1)
        
    X_competition_test = np.append(X_competition_test, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data_test = pd.DataFrame(X_competition_test)
data_test['field_id'] = field_ids_test
data_test = data_test[data_test.field_id != 0]

data_test_grouped = data_test.groupby('field_id').mean().reset_index()

data_test_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest4.csv')

data_test_grouped

# Test5

In [6]:
import tqdm

X_competition_test = np.empty((0, 2 * (n_obs-1)))
field_ids_test = np.empty((0, 1))

for tile_id in tqdm.notebook.tqdm(tile_ids_test[800:1000]):
    tile_df = competition_test_df[competition_test_df['tile_id']==tile_id]
    
    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids_test = np.append(field_ids_test, field_id_array.flatten())
    
    tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()
    
    X_tile = np.empty((256 * 256, 28))
    for date_time in tile_date_times[ : 2 * n_obs : n_obs]:
        vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
        vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)
        
        vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
        vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)
        
        X_tile = np.append(X_tile, vv_array, axis = 1)
        X_tile = np.append(X_tile, vh_array, axis = 1)
        
    X_competition_test = np.append(X_competition_test, X_tile, axis=0)

  0%|          | 0/200 [00:00<?, ?it/s]

In [None]:
data_test = pd.DataFrame(X_competition_test)
data_test['field_id'] = field_ids_test
data_test = data_test[data_test.field_id != 0]

data_test_grouped = data_test.groupby('field_id').mean().reset_index()

data_test_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest5.csv')

data_test_grouped

# Test6

In [6]:
import tqdm

X_competition_test = np.empty((0, 2 * (n_obs-1)))
field_ids_test = np.empty((0, 1))

for tile_id in tqdm.notebook.tqdm(tile_ids_test[1000:]):
    tile_df = competition_test_df[competition_test_df['tile_id']==tile_id]
    
    field_id_src = rasterio.open(tile_df[tile_df['asset']=='field_ids']['file_path'].values[0])
    field_id_array = field_id_src.read(1)
    field_ids_test = np.append(field_ids_test, field_id_array.flatten())
    
    tile_date_times = tile_df[tile_df['satellite_platform']=='s1']['datetime'].unique()
    
    X_tile = np.empty((256 * 256, 28))
    for date_time in tile_date_times[ : 2 * n_obs : n_obs]:
        vv_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VV')]['file_path'].values[0])
        vv_array = np.expand_dims(vv_src.read(1).flatten(), axis=1)
        
        vh_src = rasterio.open(tile_df[(tile_df['datetime']==date_time) & (tile_df['asset']=='VH')]['file_path'].values[0])
        vh_array = np.expand_dims(vh_src.read(1).flatten(), axis=1)
        
        X_tile = np.append(X_tile, vv_array, axis = 1)
        X_tile = np.append(X_tile, vh_array, axis = 1)
        
    X_competition_test = np.append(X_competition_test, X_tile, axis=0)

  0%|          | 0/137 [00:00<?, ?it/s]

In [None]:
data_test = pd.DataFrame(X_competition_test)
data_test['field_id'] = field_ids_test
data_test = data_test[data_test.field_id != 0]

data_test_grouped = data_test.groupby('field_id').mean().reset_index()

data_test_grouped.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest6.csv')

data_test_grouped

# Traininng Bridge

In [9]:
tr1=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata1.csv')
tr2=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata2.csv')
tr3=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata3.csv')
tr4=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata4.csv')
tr5=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata5.csv')
tr6=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata6.csv')
tr7=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata7.csv')
tr8=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata8.csv')
tr9=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata9.csv')
tr10=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata10.csv')
tr11=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata11.csv')
tr12=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata12.csv')
tr13=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata13.csv')
tr14=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xdata14.csv')


te1=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest1.csv')
te2=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest2.csv')
te3=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest3.csv')
te4=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest4.csv')
te5=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest5.csv')
te6=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/Xtest6.csv')

In [10]:
full_train=pd.concat([tr1, tr2, tr3, tr4, tr5, tr6, tr7, tr8, tr9, tr10, tr11, tr12, tr13, tr14]).reset_index(drop=True)
full_test=pd.concat([te1, te2, te3, te4, te5, te6]).reset_index(drop=True)

full_train.shape, full_test.shape

((87113, 34), (35295, 34))

In [None]:
full_train.label.value_counts()

In [None]:
full_test=full_test.drop('Unnamed: 0', axis=1)
full_test.head()

In [13]:
full_train.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/train_data.csv', index=False)
full_test.to_csv('/content/drive/MyDrive/zindi/radiant_ml/data/test_data.csv', index=False)

In [14]:
full_train.columns

Index(['field_id', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10',
       '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22',
       '23', '24', '25', '26', '27', '28', '29', '30', '31', 'label'],
      dtype='object')

In [None]:
full_train.isna().sum()

In [53]:
from sklearn.model_selection import train_test_split as tts

seed=59
main_cols=full_train.columns.difference(['field_id', 'label'])

X=full_train[main_cols]
y=np.round(full_train.label)

x_train, x_test, y_train, y_test=tts(X, y, test_size=0.2, random_state=seed, stratify=y)
x_train.shape, x_test.shape

((69690, 32), (17423, 32))

In [61]:
from sklearn.ensemble import RandomForestClassifier

model=RandomForestClassifier(n_estimators=50)

model.fit(x_train, y_train)

  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


ValueError: ignored

In [70]:
y.value_counts()

4.0    24225
2.0    13917
7.0    10712
1.0     8340
6.0     8249
5.0     8137
3.0     7915
9.0     4124
8.0     1494
Name: label, dtype: int64

In [69]:
from lightgbm import LGBMClassifier

model=LGBMClassifier(n_estimators=50)

model.fit(x_train, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=50, n_jobs=-1, num_leaves=31, objective=None,
               random_state=None, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [64]:
y_pred=model.predict_proba(x_test)

In [65]:
y_pred = [np.argmax(line) for line in y_pred]
len(y_pred)

17423

In [66]:
y_pred=pd.Series(y_pred)

In [67]:
y_test.unique(), y_pred.unique()

(array([4., 2., 6., 7., 5., 3., 9., 1., 8.]),
 array([3, 1, 0, 6, 5, 4, 8, 2, 7]))

In [68]:
from sklearn.metrics import accuracy_score as acc
from sklearn.metrics import log_loss

print(acc(y_test, y_pred))

0.08792974803420765


In [36]:
ss=pd.read_csv('/content/drive/MyDrive/zindi/radiant_ml/data/SampleSubmission.csv')
ss.shape

(35389, 10)

In [43]:
t=[1, 2, 3, 4, 5, 6, 7, 8, 9]

print(t[:3], t[3:])

[1, 2, 3] [4, 5, 6, 7, 8, 9]


In [35]:
full_test.shape

(35295, 33)

In [None]:
y_competition_prob = rf.predict_proba(data_test_grouped.drop(columns=['field_id']))

In [None]:
# In this part we format the DataFrame to have column names and order similar to the sample submission file. 
pred_df = pd.DataFrame(y_competition_prob)
pred_df = pred_df.rename(columns={
    0:'Crop_ID_1',
    1:'Crop_ID_2', 
    2:'Crop_ID_3',
    3:'Crop_ID_4',
    4:'Crop_ID_5',
    5:'Crop_ID_6',
    6:'Crop_ID_7',
    7:'Crop_ID_8',
    8:'Crop_ID_9'
})
pred_df['field_id']=data_test_grouped['field_id']
pred_df = pred_df[['field_id', 'Crop_ID_1', 'Crop_ID_2', 'Crop_ID_3', 'Crop_ID_4', 'Crop_ID_5', 'Crop_ID_6', 'Crop_ID_7', 'Crop_ID_8', 'Crop_ID_9']]
pred_df

Unnamed: 0,field_id,Crop_ID_1,Crop_ID_2,Crop_ID_3,Crop_ID_4,Crop_ID_5,Crop_ID_6,Crop_ID_7,Crop_ID_8,Crop_ID_9
0,56.0,0.06,0.12,0.04,0.34,0.02,0.04,0.30,0.02,0.06
1,60.0,0.26,0.06,0.14,0.10,0.06,0.14,0.10,0.10,0.04
2,97.0,0.10,0.32,0.06,0.06,0.12,0.10,0.04,0.00,0.20
3,103.0,0.04,0.26,0.02,0.08,0.16,0.24,0.00,0.04,0.16
4,123.0,0.04,0.04,0.00,0.82,0.02,0.00,0.02,0.06,0.00
...,...,...,...,...,...,...,...,...,...,...
2884,122658.0,0.00,0.04,0.02,0.00,0.00,0.14,0.74,0.06,0.00
2885,122689.0,0.04,0.20,0.02,0.48,0.10,0.02,0.06,0.06,0.02
2886,122698.0,0.04,0.28,0.04,0.24,0.10,0.08,0.10,0.00,0.12
2887,122703.0,0.06,0.00,0.02,0.88,0.00,0.02,0.00,0.00,0.02


In [None]:
# Write the predicted probabilites to a csv for submission
pred_df.to_csv('baseline_submission.csv', index=False)