# Problem statement
The oil and gas industry has several important characteristics that drive the need for innovative solutions, chiefly the continuity and high complexity of its technological chain, which begins with geological exploration and ends with the delivery of oil and gas to consumers.

The exploration and production sector covers the search for commercially viable oil and gas deposits onshore and offshore, the drilling of exploration wells, and the operation of wells that produce oil, gas, and liquid reservoir products or mixtures of them.

Machine learning algorithms can help solve a range of problems in the oil and gas industry, in particular forecasting the development of new commercially viable fields.

The task is to implement a machine learning algorithm that uses various parameters to determine where an oil and gas deposit is located: onshore or offshore.

# Feature description
Field name - name of the field

Reservoir unit - reservoir unit of the field

Country - country where the field is located

Region - region where the field is located

Basin name - name of the geological basin

Tectonic regime - tectonic regime

Latitude - latitude

Longitude - longitude

Operator company - name of the operating company

**Onshore or offshore - target variable (ONSHORE - 1, OFFSHORE - 0, ONSHORE-OFFSHORE - 2)**

Hydrocarbon type (main) - hydrocarbon type

Reservoir status (current) - current status of the reservoir

Structural setting - structural setting

Depth (top reservoir ft TVD) - depth

Reservoir period - geological period of the reservoir

Lithology (main) - lithology

Thickness (gross average ft) - gross thickness

Thickness (net pay average ft) - net pay (effective) thickness

Porosity (matrix average %) - porosity

Permeability (air average mD) - permeability

# Data preparation

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler, StandardScaler, LabelEncoder

In [2]:
df = pd.read_csv('train_oil.csv')
df_initial = df.copy()

In [3]:
df.head(10)

Unnamed: 0,Field name,Reservoir unit,Country,Region,Basin name,Tectonic regime,Latitude,Longitude,Operator company,Onshore/Offshore,Hydrocarbon type,Reservoir status,Structural setting,Depth,Reservoir period,Lithology,Thickness (gross average ft),Thickness (net pay average ft),Porosity,Permeability
0,ZHIRNOV,MELEKESKIAN,RUSSIA,FORMER SOVIET UNION,VOLGA-URAL,COMPRESSION/EVAPORITE,51.0,44.8042,NIZHNEVOLZHSKNET,ONSHORE,OIL,DECLINING PRODUCTION,FORELAND,1870,CARBONIFEROUS,SANDSTONE,262.0,33.0,24.0,30.0
1,LAGOA PARDA,LAGOA PARDA (URUCUTUCA),BRAZIL,LATIN AMERICA,ESPIRITO SANTO,EXTENSION,-19.6017,-39.8332,PETROBRAS,ONSHORE,OIL,NEARLY DEPLETED,PASSIVE MARGIN,4843,PALEOGENE,SANDSTONE,2133.0,72.0,23.0,350.0
2,ABQAIQ,ARAB D,SAUDI ARABIA,MIDDLE EAST,THE GULF,COMPRESSION/EVAPORITE,26.08,49.81,SAUDI ARAMCO,ONSHORE,OIL,REJUVENATING,FORELAND,6050,JURASSIC,LIMESTONE,250.0,184.0,21.0,410.0
3,MURCHISON,BRENT,UK /NORWAY,EUROPE,NORTH SEA NORTHERN,EXTENSION,61.3833,1.75,CNR,OFFSHORE,OIL,NEARLY DEPLETED,RIFT,8988,JURASSIC,SANDSTONE,425.0,300.0,22.0,750.0
4,WEST PEMBINA,NISKU (PEMBINA L POOL),CANADA,NORTH AMERICA,WESTERN CANADA,COMPRESSION,53.2287,-115.8008,NUMEROUS,ONSHORE,OIL,UNKNOWN,FORELAND,9306,DEVONIAN,DOLOMITE,233.0,167.0,11.8,1407.0
5,UCHKYR,XV-1,UZBEKISTAN,FORMER SOVIET UNION,AMU DARYA,INVERSION/COMPRESSION/EXTENSION/EVAPORITE,40.1494,62.9906,SREDAZGAZPROM,ONSHORE,GAS,DECLINING PRODUCTION,INVERSION/RIFT,5443,JURASSIC,DOLOMITE,82.0,59.0,16.0,61.0
6,WESTHOPE SOUTH,CHARLES,USA,NORTH AMERICA,WILLISTON,COMPRESSION,48.8521,-101.013,AMERADA HESS,ONSHORE,OIL,MATURE PRODUCTION,INTRACRATONIC,3275,CARBONIFEROUS,DOLOMITIC LIMESTONE,43.0,15.5,10.0,2.6
7,MCALLEN RANCH,LOWER VICKSBURG,USA,NORTH AMERICA,GULF OF MEXICO NORTHERN ONSHORE,GRAVITY/EXTENSION/SHALE/SYNSEDIMENTATION,26.6226,-98.3153,SHELL,ONSHORE,GAS,MATURE PRODUCTION,DELTA/PASSIVE MARGIN,9400,PALEOGENE,SANDSTONE,300.0,140.0,16.0,0.6
8,ELLIS RANCH,CLEVELAND,USA,NORTH AMERICA,ANADARKO,COMPRESSION/EROSION,36.2724,-100.7161,NUMEROUS,ONSHORE,GAS,DECLINING PRODUCTION,FORELAND,6500,CARBONIFEROUS,SHALY SANDSTONE,150.0,26.0,14.0,0.4
9,STRACHAN,LEDUC,CANADA,NORTH AMERICA,WESTERN CANADA,COMPRESSION,52.45,-115.4286,BANF AND AQUITAINE,ONSHORE,GAS,NEARLY DEPLETED,FORELAND,12068,DEVONIAN,DOLOMITE,900.0,535.0,7.8,11.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Field name                      309 non-null    object 
 1   Reservoir unit                  309 non-null    object 
 2   Country                         282 non-null    object 
 3   Region                          271 non-null    object 
 4   Basin name                      271 non-null    object 
 5   Tectonic regime                 309 non-null    object 
 6   Latitude                        282 non-null    float64
 7   Longitude                       279 non-null    float64
 8   Operator company                309 non-null    object 
 9   Onshore/Offshore                309 non-null    object 
 10  Hydrocarbon type                309 non-null    object 
 11  Reservoir status                309 non-null    object 
 12  Structural setting              309 

Some features contain missing values. We fill numeric features with the median and store it in a separate variable so that the same value can fill the gaps in the test dataset. Missing values in categorical features are filled with the category 'Unknown'.

In [5]:
columns_to_fill = ['Country', 'Region', 'Basin name']
df[columns_to_fill] = df[columns_to_fill].fillna('Unknown')
median_latitude = df['Latitude'].median()
median_longitude = df['Longitude'].median()
df['Latitude'] = df['Latitude'].fillna(median_latitude)
df['Longitude'] = df['Longitude'].fillna(median_longitude)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 309 entries, 0 to 308
Data columns (total 20 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Field name                      309 non-null    object 
 1   Reservoir unit                  309 non-null    object 
 2   Country                         309 non-null    object 
 3   Region                          309 non-null    object 
 4   Basin name                      309 non-null    object 
 5   Tectonic regime                 309 non-null    object 
 6   Latitude                        309 non-null    float64
 7   Longitude                       309 non-null    float64
 8   Operator company                309 non-null    object 
 9   Onshore/Offshore                309 non-null    object 
 10  Hydrocarbon type                309 non-null    object 
 11  Reservoir status                309 non-null    object 
 12  Structural setting              309 
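
The cell above computes the fill values on the training data only and keeps them for later use on the test data. A minimal sketch of that pattern on made-up toy frames (the names `train` and `test` and all values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from train_oil.csv.
train = pd.DataFrame({
    "Country": ["USA", None, "CANADA"],
    "Latitude": [10.0, np.nan, 30.0],
})
test = pd.DataFrame({
    "Country": [None, "EGYPT"],
    "Latitude": [np.nan, 50.0],
})

# The statistic is computed on the training set only.
median_latitude = train["Latitude"].median()

# The same fill values are applied to both frames.
for frame in (train, test):
    frame["Country"] = frame["Country"].fillna("Unknown")
    frame["Latitude"] = frame["Latitude"].fillna(median_latitude)

print(median_latitude)
print(test)
```

Reusing the training median keeps the test set from influencing preprocessing, which is why the notebook stores it in a variable instead of recomputing it.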

Let us confirm that the target variable has more than two classes.

In [6]:
df['Onshore/Offshore'].unique()

array(['ONSHORE', 'OFFSHORE', 'ONSHORE-OFFSHORE'], dtype=object)

The field name column effectively serves as an ID, so we drop it from the dataset. We also encode the target variable and the remaining categorical features numerically.

In [7]:
# Encode the target as integers (classes are sorted alphabetically)
label_encoder = LabelEncoder()
df['Onshore/Offshore'] = label_encoder.fit_transform(df['Onshore/Offshore'])
# 'Field name' acts as an ID, so it is dropped
df = df.drop(columns=['Field name'])

# One fitted binarizer per column, reused later on the test set
mlb_dict = {}
columns_to_encode = df.select_dtypes(include=['object']).columns

for column in columns_to_encode:
    mlb = MultiLabelBinarizer()
    # Values such as 'COMPRESSION/EVAPORITE' hold several labels separated by '/'
    df[column + '_list'] = df[column].astype(str).str.split('/')
    encoded = mlb.fit_transform(df[column + '_list'])
    encoded_df = pd.DataFrame(encoded, columns=[f"{column}_{x}" for x in mlb.classes_])
    df = pd.concat([df, encoded_df], axis=1)
    mlb_dict[column] = mlb
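
To make the encoding step concrete, here is a minimal sketch of what `MultiLabelBinarizer` produces for '/'-separated values. The toy series `regime` is hypothetical; only the splitting and naming scheme follow the cell above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy column with '/'-separated values.
regime = pd.Series(["COMPRESSION/EVAPORITE", "EXTENSION", "COMPRESSION"])

mlb = MultiLabelBinarizer()
# Each row becomes a list of labels; every distinct label gets its own 0/1 column.
encoded = mlb.fit_transform(regime.str.split("/"))
encoded_df = pd.DataFrame(
    encoded, columns=[f"Tectonic regime_{c}" for c in mlb.classes_]
)
print(encoded_df)
```

A value without a '/' behaves like ordinary one-hot encoding. Note that splitting does not strip whitespace, so a value such as 'UK /NORWAY' yields the label 'UK ' with a trailing space. Stripping the parts would normalize this, but it would also change the resulting features.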

In [8]:
df.head(10)

Unnamed: 0,Reservoir unit,Country,Region,Basin name,Tectonic regime,Latitude,Longitude,Operator company,Onshore/Offshore,Hydrocarbon type,...,Lithology_DOLOMITE,Lithology_DOLOMITIC LIMESTONE,Lithology_LIMESTONE,Lithology_LOW-RESISTIVITY SANDSTONE,Lithology_SANDSTONE,Lithology_SHALE,Lithology_SHALY SANDSTONE,Lithology_SILTSTONE,Lithology_THINLY-BEDDED SANDSTONE,Lithology_VOLCANICS
0,MELEKESKIAN,RUSSIA,FORMER SOVIET UNION,VOLGA-URAL,COMPRESSION/EVAPORITE,51.0,44.8042,NIZHNEVOLZHSKNET,1,OIL,...,0,0,0,0,1,0,0,0,0,0
1,LAGOA PARDA (URUCUTUCA),BRAZIL,LATIN AMERICA,ESPIRITO SANTO,EXTENSION,-19.6017,-39.8332,PETROBRAS,1,OIL,...,0,0,0,0,1,0,0,0,0,0
2,ARAB D,SAUDI ARABIA,MIDDLE EAST,THE GULF,COMPRESSION/EVAPORITE,26.08,49.81,SAUDI ARAMCO,1,OIL,...,0,0,1,0,0,0,0,0,0,0
3,BRENT,UK /NORWAY,EUROPE,NORTH SEA NORTHERN,EXTENSION,61.3833,1.75,CNR,0,OIL,...,0,0,0,0,1,0,0,0,0,0
4,NISKU (PEMBINA L POOL),CANADA,NORTH AMERICA,WESTERN CANADA,COMPRESSION,53.2287,-115.8008,NUMEROUS,1,OIL,...,1,0,0,0,0,0,0,0,0,0
5,XV-1,UZBEKISTAN,FORMER SOVIET UNION,AMU DARYA,INVERSION/COMPRESSION/EXTENSION/EVAPORITE,40.1494,62.9906,SREDAZGAZPROM,1,GAS,...,1,0,0,0,0,0,0,0,0,0
6,CHARLES,USA,NORTH AMERICA,WILLISTON,COMPRESSION,48.8521,-101.013,AMERADA HESS,1,OIL,...,0,1,0,0,0,0,0,0,0,0
7,LOWER VICKSBURG,USA,NORTH AMERICA,GULF OF MEXICO NORTHERN ONSHORE,GRAVITY/EXTENSION/SHALE/SYNSEDIMENTATION,26.6226,-98.3153,SHELL,1,GAS,...,0,0,0,0,1,0,0,0,0,0
8,CLEVELAND,USA,NORTH AMERICA,ANADARKO,COMPRESSION/EROSION,36.2724,-100.7161,NUMEROUS,1,GAS,...,0,0,0,0,0,0,1,0,0,0
9,LEDUC,CANADA,NORTH AMERICA,WESTERN CANADA,COMPRESSION,52.45,-115.4286,BANF AND AQUITAINE,1,GAS,...,1,0,0,0,0,0,0,0,0,0
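
The integer codes in the Onshore/Offshore column come from `LabelEncoder`, which sorts the class labels and assigns each its position in the sorted order. A short sketch on a hand-written label list (the list itself is illustrative) shows how to recover the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; the three distinct values match the target column.
labels = ["ONSHORE", "OFFSHORE", "ONSHORE-OFFSHORE", "ONSHORE"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(labels)

# classes_ is sorted; the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform turns predicted codes back into the original labels.
print(label_encoder.inverse_transform([0, 1, 2]))
```

The resulting mapping agrees with the feature description: OFFSHORE is 0, ONSHORE is 1, and ONSHORE-OFFSHORE is 2.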


In [9]:
df = df.select_dtypes(include=np.number)
X = df.drop(columns=['Onshore/Offshore'])
y = df['Onshore/Offshore']



In [10]:
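from sklearn.model_selection import GridSearchCV

# Hyperparameter grid: regularization strength and class weighting
param_grid = {
    'C': [0.1, 1, 10, 100],
    'class_weight': ['balanced', None]
}

# Standardize the features; the fitted scaler is reused on the test set
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
X_std = pd.DataFrame(X_std, columns=X.columns)
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='f1_weighted')
grid_search.fit(X_std, y)
# best_estimator_ is already refit on the full training set (refit=True by default)
model = grid_search.best_estimator_
print(grid_search.best_params_)
model.fit(X_std, y)  # redundant: repeats the refit above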

{'C': 0.1, 'class_weight': 'balanced'}


Parameter,Value
penalty,'l2'
dual,False
tol,0.0001
C,0.1
fit_intercept,True
intercept_scaling,1
class_weight,'balanced'
random_state,None
solver,'lbfgs'
max_iter,100
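
In the cell above the scaler is fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. One possible alternative, shown here only as a sketch and not as the approach used in this notebook, is to put the scaler and the classifier in a `Pipeline` so that scaling is refit inside every fold. The data below is synthetic (`X_demo`, `y_demo` are hypothetical stand-ins), and `max_iter=1000` is an assumption added to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (hypothetical data, three classes).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=4, n_classes=3, random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The "clf__" prefix routes each parameter to the classifier step.
param_grid = {
    "clf__C": [0.1, 1, 10, 100],
    "clf__class_weight": ["balanced", None],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_weighted")
search.fit(X_demo, y_demo)  # the best pipeline is refit on all data by default

print(search.best_params_)
```

With this layout `search.best_estimator_` can be applied to unscaled test features directly, since it carries its own fitted scaler. It would not necessarily select the same parameters or reproduce the score reported in this notebook.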


In [11]:
df_test = pd.read_csv('oil_test.csv')
df_test.head()

Unnamed: 0,Field name,Reservoir unit,Country,Region,Basin name,Tectonic regime,Latitude,Longitude,Operator company,Hydrocarbon type,Reservoir status,Structural setting,Depth,Reservoir period,Lithology,Thickness (gross average ft),Thickness (net pay average ft),Porosity,Permeability
0,ABU GHARADIG,BAHARIYA,EGYPT,AFRICA,ABU GHARADIG,EXTENSION,29.7422,28.4925,GUPCO,GAS-CONDENSATE,MATURE PRODUCTION,RIFT,10282,CRETACEOUS,SANDSTONE,745.0,144.0,10.0,8.0
1,ABU MADI-EL QARA,ABU MADI (LEVEL III),EGYPT,AFRICA,NILE DELTA,STRIKE-SLIP/TRANSTENSION/SHALE/EVAPORITE/BASEM...,31.4382,31.3616,IEOC,GAS,DECLINING PRODUCTION,WRENCH/DELTA,10499,NEOGENE,SANDSTONE,509.0,410.0,20.0,300.0
2,ALIBEKMOLA,KT I,KAZAKHSTAN,FORMER SOVIET UNION,CASPIAN NORTH,COMPRESSION/EVAPORITE,48.474,57.6667,KAZAKHOIL AKTOBE,OIL,DEVELOPING,SUB-SALT/FORELAND,6000,CARBONIFEROUS,LIMESTONE,300.0,105.0,10.0,20.0
3,ALWYN NORTH,BRENT (BRENT EAST),UK,EUROPE,NORTH SEA NORTHERN,INVERSION/COMPRESSION/EXTENSION,60.7833,1.7333,TOTAL,OIL,NEARLY DEPLETED,RIFT,9790,JURASSIC,SANDSTONE,886.0,344.0,17.0,500.0
4,ANKLESHWAR,ANKLESHWAR (HAZAD-ARDOL),INDIA,FAR EAST,CAMBAY,STRIKE-SLIP/TRANSPRESSION/BASEMENT-I,21.6,72.9167,ONGC,OIL,MATURE PRODUCTION,WRENCH/RIFT,2950,PALEOGENE,SANDSTONE,670.0,0.0,21.0,250.0


In [12]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Field name                      133 non-null    object 
 1   Reservoir unit                  133 non-null    object 
 2   Country                         120 non-null    object 
 3   Region                          117 non-null    object 
 4   Basin name                      125 non-null    object 
 5   Tectonic regime                 133 non-null    object 
 6   Latitude                        120 non-null    float64
 7   Longitude                       117 non-null    float64
 8   Operator company                133 non-null    object 
 9   Hydrocarbon type                133 non-null    object 
 10  Reservoir status                133 non-null    object 
 11  Structural setting              133 non-null    object 
 12  Depth                           133 

In [13]:
columns_to_fill = ['Country', 'Region', 'Basin name']
df_test[columns_to_fill] = df_test[columns_to_fill].fillna('Unknown')
df_test['Latitude'] = df_test['Latitude'].fillna(median_latitude)
df_test['Longitude'] = df_test['Longitude'].fillna(median_longitude)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133 entries, 0 to 132
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Field name                      133 non-null    object 
 1   Reservoir unit                  133 non-null    object 
 2   Country                         133 non-null    object 
 3   Region                          133 non-null    object 
 4   Basin name                      133 non-null    object 
 5   Tectonic regime                 133 non-null    object 
 6   Latitude                        133 non-null    float64
 7   Longitude                       133 non-null    float64
 8   Operator company                133 non-null    object 
 9   Hydrocarbon type                133 non-null    object 
 10  Reservoir status                133 non-null    object 
 11  Structural setting              133 non-null    object 
 12  Depth                           133 

In [14]:
df_test = df_test.drop(columns=['Field name'])
for column in columns_to_encode:
    if column in df_test.columns:
        df_test[column + '_list'] = df_test[column].astype(str).str.split('/')
        mlb = mlb_dict[column]
        encoded = mlb.transform(df_test[column + '_list'])
        encoded_df = pd.DataFrame(encoded, columns=[f"{column}_{x}" for x in mlb.classes_])
        df_test = pd.concat([df_test, encoded_df], axis=1)
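
The test set can contain labels that the binarizers never saw during fitting. The sketch below, on hypothetical toy labels, illustrates how `MultiLabelBinarizer.transform` treats them: unknown labels are ignored with a warning, and the output keeps the training columns.

```python
import warnings
from sklearn.preprocessing import MultiLabelBinarizer

# Fit on hypothetical "training" labels.
mlb = MultiLabelBinarizer()
mlb.fit([["RIFT"], ["FORELAND"], ["WRENCH", "RIFT"]])

# 'SUB-SALT' and 'DELTA' were not seen during fitting.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    encoded = mlb.transform([["RIFT", "SUB-SALT"], ["DELTA"]])

print(list(mlb.classes_))
print(encoded)
print(len(caught))  # number of warnings raised for unknown labels
```

A row whose labels are all unseen becomes all zeros for that feature group, so the model receives no information from that column for such rows.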



In [15]:
df_test.head(5)

Unnamed: 0,Reservoir unit,Country,Region,Basin name,Tectonic regime,Latitude,Longitude,Operator company,Hydrocarbon type,Reservoir status,...,Lithology_DOLOMITE,Lithology_DOLOMITIC LIMESTONE,Lithology_LIMESTONE,Lithology_LOW-RESISTIVITY SANDSTONE,Lithology_SANDSTONE,Lithology_SHALE,Lithology_SHALY SANDSTONE,Lithology_SILTSTONE,Lithology_THINLY-BEDDED SANDSTONE,Lithology_VOLCANICS
0,BAHARIYA,EGYPT,AFRICA,ABU GHARADIG,EXTENSION,29.7422,28.4925,GUPCO,GAS-CONDENSATE,MATURE PRODUCTION,...,0,0,0,0,1,0,0,0,0,0
1,ABU MADI (LEVEL III),EGYPT,AFRICA,NILE DELTA,STRIKE-SLIP/TRANSTENSION/SHALE/EVAPORITE/BASEM...,31.4382,31.3616,IEOC,GAS,DECLINING PRODUCTION,...,0,0,0,0,1,0,0,0,0,0
2,KT I,KAZAKHSTAN,FORMER SOVIET UNION,CASPIAN NORTH,COMPRESSION/EVAPORITE,48.474,57.6667,KAZAKHOIL AKTOBE,OIL,DEVELOPING,...,0,0,1,0,0,0,0,0,0,0
3,BRENT (BRENT EAST),UK,EUROPE,NORTH SEA NORTHERN,INVERSION/COMPRESSION/EXTENSION,60.7833,1.7333,TOTAL,OIL,NEARLY DEPLETED,...,0,0,0,0,1,0,0,0,0,0
4,ANKLESHWAR (HAZAD-ARDOL),INDIA,FAR EAST,CAMBAY,STRIKE-SLIP/TRANSPRESSION/BASEMENT-I,21.6,72.9167,ONGC,OIL,MATURE PRODUCTION,...,0,0,0,0,1,0,0,0,0,0


In [16]:
df_test = df_test.select_dtypes(include=np.number)

# Align the test features with the training features: same columns, same order.
# Columns absent from the test set are filled with 0.
X_test_aligned = df_test.reindex(columns=X.columns, fill_value=0)

# Apply the scaler fitted on the training data (no refitting on the test data)
X_std_test = scaler.transform(X_test_aligned)
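
The alignment above aims to give the test matrix exactly the training columns, in the same order, with zeros for anything absent. A minimal sketch of that idea using `DataFrame.reindex` on hypothetical toy data (column names and values are made up):

```python
import pandas as pd

# Hypothetical training column order.
train_columns = pd.Index(["Latitude", "Region_EUROPE", "Region_AFRICA"])

# Hypothetical test features: different order, one missing and one extra column.
test_features = pd.DataFrame({
    "Region_AFRICA": [1, 0],
    "Latitude": [29.7, 60.8],
    "Extra": [5, 6],
})

# reindex keeps only the requested columns, in the requested order,
# and creates any missing ones filled with 0.
aligned = test_features.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Matching the column order matters because the fitted scaler and model expect the features in the same layout they were trained on.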

In [17]:
y_pred = model.predict(X_std_test)



In [18]:
results = pd.DataFrame({
    'index': range(len(y_pred)),
    'Onshore/Offshore': y_pred
})
results.to_csv('oil_predictions_4.csv', index=False)

Kaggle score (F1-score): 0.84284
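
The grid search in this notebook optimizes `f1_weighted`. How Kaggle averages the F1-score for this competition is not stated here, so the following is only a sketch of the weighted variant, computed on made-up labels (`y_true` and `y_pred_demo` are hypothetical):

```python
from sklearn.metrics import f1_score

# Hypothetical labels: class 2 is never predicted.
y_true = [0, 1, 1, 2, 1, 0]
y_pred_demo = [0, 1, 1, 1, 1, 0]

# Weighted F1: per-class F1 averaged with weights equal to class support.
# zero_division=0 scores a class with no predictions as 0 without a warning.
score = f1_score(y_true, y_pred_demo, average="weighted", zero_division=0)
print(round(score, 4))
```

In the weighted average a rare class such as ONSHORE-OFFSHORE contributes little, so a model can score well while still missing it. Per-class scores (`average=None`) are one way to check that.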