# Forecasting Brazil’s Oil Production using an ARIMA model and ANP data
<b>Luís Eduardo Anunciado Silva</b> - BS in Information Technology, Federal University of Rio Grande do Norte<br>

<b>Date</b>: November 12, 2018

## Introduction
This project aims to predict the yearly oil production of a given petroleum reservoir over a given time period, using an <b>ARIMA model</b> for <b>time series forecasting</b> and the [ANP public database](http://www.anp.gov.br/dados-estatisticos).

<p>I also aim to predict the top-10 petroleum reservoirs in Brazil that will produce the largest amount of oil over the next 10 years.</p>


### OBJECTIVE 1: Predicting the Oil Production of a Petroleum Reservoir across a Specified Time Period
<p>Using the ANP public database, we follow the steps below:</p>

<b>Pre-Processing:</b><br>
<ol>
<li>Build a dataframe with the columns: dt (in DateTime format), well, latitude, longitude.</li>
<li>Drop irrelevant columns and remove rows with NaN values.</li>
</ol>
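
The two pre-processing steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the sample rows and the `date` and `operator` columns are hypothetical stand-ins, and only the target columns (`dt`, `well`, `latitude`, `longitude`) come from the description.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records; the values and the "date"/"operator" column
# names are invented for illustration only.
raw = pd.DataFrame({
    "date": ["2017-01-01", "2017-02-01", "2017-03-01", None],
    "well": ["W-1", "W-1", "W-1", "W-1"],
    "latitude": [-5.1, -5.1, np.nan, -5.1],
    "longitude": [-36.5, -36.5, -36.5, -36.5],
    "operator": ["X", "X", "X", "X"],  # stands in for an irrelevant column
})

def preprocess(frame):
    """Build the dt/well/latitude/longitude frame and drop incomplete rows."""
    df = frame.drop(columns=["operator"])      # drop irrelevant columns
    df = df.rename(columns={"date": "dt"})
    df["dt"] = pd.to_datetime(df["dt"])        # missing dates become NaT
    df = df.dropna().reset_index(drop=True)    # remove rows with NaN/NaT
    return df[["dt", "well", "latitude", "longitude"]]

clean = preprocess(raw)
print(clean)
```

Here the row with a missing latitude and the row with a missing date are both removed, leaving two complete records.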
<b>Processing:</b><br>
ARIMA models need the data to be stationary, i.e. the data must not exhibit trend or seasonality. To check for trend and seasonality, we used the following methods:
<ol>
<li>Plotting the time series to visually check for trend and seasonality</li>
<li>Checking whether the histogram of the data fits a Gaussian curve, then splitting the data into two parts and comparing their means and variances</li>
<li>Calculating the Augmented Dickey-Fuller Test statistic and using the p-value to determine stationarity</li>
</ol>
If the data was not stationary, we performed <b>differencing</b> to make it stationary.
<br><br>
<b>Fitting the ARIMA model:</b><br>
We performed a grid search to estimate the best p and q values for the given data.<br>
We then fit the ARIMA model using the calculated p, q values.
<br><br>
<b>Evaluation:</b><br>
We calculated the <b>Mean Squared Error (MSE)</b> to estimate the performance of the model.
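
For reference, the MSE is the average of the squared differences between observed and forecast values. The short sketch below computes it with NumPy on made-up numbers; the helper name `mse_score` is hypothetical, and `sklearn.metrics.mean_squared_error` computes the same quantity.

```python
import numpy as np

def mse_score(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    return float(np.mean((actual - predicted) ** 2))

# Made-up values for illustration only.
observed = [10.0, 12.0, 11.0, 13.0]
forecast = [11.0, 12.0, 9.0, 13.0]

# Squared errors are 1, 0, 4, 0, so the mean is 5 / 4 = 1.25.
mse = mse_score(observed, forecast)
print(mse)
```

Because the errors are squared, the MSE is expressed in squared units of the series and penalizes large misses more heavily than small ones.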

In [2]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA  # the older arima_model module was removed in statsmodels 0.13
from sklearn.metrics import mean_squared_error
import ipywidgets as widgets
import plotly.plotly as py
import seaborn as sns
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# hide warnings
import warnings
warnings.simplefilter("ignore")

In [28]:
import pandas as pd

# read the file with the wells
dfs = pd.read_excel('data/tabela_de_pocos_abril_2018.xlsx', sheet_name='Plan1')
# drop the columns that are not needed for the analysis
dfs = dfs.drop(columns=['CADASTRO', 'OPERADOR', 'POCO_OPERADOR', 'ESTADO', 'BACIA', 'BLOCO', 'SIG_CAMPO', 'CAMPO', 'TERRA_MAR', 
                'POCO_POS_ANP', 'INICIO', 'TERMINO', 'CONCLUSAO', 'TITULARIDADE', 'LATITUDE_BASE_4C', 
                'LONGITUDE_BASE_4C', 'DATUM_HORIZONTAL', 'TIPO_DE_COORDENADA_DE_BASE', 'DIRECAO', 'PROFUNDIDADE_VERTICAL_M',
                'PROFUNDIDADE_SONDADOR_M','PROFUNDIDADE_MEDIDA_M', 'REFERENCIA_DE_PROFUNDIDADE', 'MESA_ROTATIVA', 
                'COTA_ALTIMETRICA_M','LAMINA_D_AGUA_M', 'DATUM_VERTICAL', 'UNIDADE_ESTRATIGRAFICA', 'GEOLOGIA_GRUPO_FINAL',
                'GEOLOGIA_FORMACAO_FINAL', 'GEOLOGIA_MEMBRO_FINAL', 'CDPE', 'AGP', 'PC', 'PAG', 'PERFIS_CONVENCIONAIS',
                'DURANTE_PERFURACAO', 'PERFIS_DIGITAIS', 'PERFIS_PROCESSADOS', 'PERFIS_ESPECIAIS', 'AMOSTRA_LATERAL', 
                'SISMICA', 'TABELA_TEMPO_PROFUNDIDADE', 'DADOS_DIRECIONAIS', 'TESTE_A_CABO', 'CANHONEIO', 'TESTEMUNHO', 
                'GEOQUIMICA', 'SIG_SONDA', 'NOM_SONDA', 'DHA_ATUALIZACAO'])

In [29]:
def get_produtors(dfs):
    '''
    Filters the producing wells in a field
    '''
    return dfs[(dfs['TIPO']==u'Explotatório') & (dfs['SITUACAO']=='PRODUZINDO') & 
               (dfs['CATEGORIA']=='Desenvolvimento') & (dfs['RECLASSIFICACAO']==u'PRODUTOR COMERCIAL DE PETRÓLEO')]

In [30]:
def get_injections(dfs):
    '''
    Filters the injection wells in a field
    '''
    return dfs[(dfs['TIPO']==u'Explotatório') & (dfs['SITUACAO']=='INJETANDO') & \
            (dfs['CATEGORIA']==u'Injeção') & (dfs['RECLASSIFICACAO']==u'INJEÇÃO DE ÁGUA')]

In [33]:
produtors = get_produtors(dfs)
injetors = get_injections(dfs)
# combine producers and injectors; drop=True discards the old row labels
dfs = pd.concat([produtors, injetors]).reset_index(drop=True)

In [34]:
dfs

### OBJECTIVE 2: Top-10 Petroleum Reservoirs in Brazil by Forecast Oil Production

## Conclusion

In this project, we set out to:

<ul>
<li>Forecast the oil production of a given petroleum reservoir over a given period of time</li>
<li>Predict the top-10 petroleum reservoirs in Brazil that will produce the most oil over the next 10 years</li>
</ul>
