# Scraper for PAJ Oil Statistics Weekly

Weekly Japanese oil statistics from the Petroleum Association of Japan (PAJ).

Source: https://stats.paj.gr.jp/en/d-member/index.html

Authentication: HTTP Digest authentication.

The Japanese weekly data are usually published on Wednesdays around lunchtime. More details and the release schedule can be found on the Petroleum Association of Japan weekly data website.

We subscribe to the Member data: the "Data for Analyzing" course membership (charged).

Data is split by year (2018, 2019 and 2020) and by region (All Japan, East Japan and West Japan). We use the All Japan data only.

Column names are defined in the template file: https://stats.paj.gr.jp/en/d-member/csvs/FormY_EP.xlt


In [1]:
%cd ..

C:\Users\ROSA_L\PycharmProjects\scraper


In [5]:
# Testing with Session object, getting login from settings
import requests
from requests.auth import HTTPDigestAuth

from scraper.settings import PAJ_USERNAME, PAJ_PASSWORD

ROOT_URL = 'https://stats.paj.gr.jp/en/d-member'

s = requests.Session()
s.auth=HTTPDigestAuth(PAJ_USERNAME, PAJ_PASSWORD)
result = s.get(f'{ROOT_URL}/index.html')

print(result)

SSLError: HTTPSConnectionPool(host='stats.paj.gr.jp', port=443): Max retries exceeded with url: /en/d-member/index.html (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)'),))
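
The `SSLError` above means `requests` could not verify the server's certificate chain. This often happens behind a corporate proxy or when an intermediate certificate is missing from the local trust store, although the exact cause here is not known. The following is a minimal sketch of one possible way to handle it, assuming a CA bundle covering the site is available locally. `make_session` and the `ca_bundle` path are hypothetical names, not part of the scraper. Setting `Session.verify` to a bundle path keeps verification enabled instead of switching it off with `verify=False`.

```python
import requests
from requests.auth import HTTPDigestAuth


def make_session(username, password, ca_bundle=None):
    """Build a Session with Digest auth and an optional custom CA bundle.

    `ca_bundle` is a hypothetical path to a PEM file that contains the
    certificates needed to verify stats.paj.gr.jp.
    """
    session = requests.Session()
    session.auth = HTTPDigestAuth(username, password)
    if ca_bundle is not None:
        # requests accepts a path to a CA bundle in place of True/False
        session.verify = ca_bundle
    return session


# No request is sent here; the session is only configured.
s = make_session('user', 'secret', ca_bundle='certs/ca-bundle.pem')
print(s.verify)
```

With a session built this way, `s.get(f'{ROOT_URL}/index.html')` would be verified against the given bundle.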

In [6]:
# Get URLs to the csv files
from bs4 import BeautifulSoup

files_to_download = [a['href'] for a in BeautifulSoup(result.text, 'html.parser').find_all('a') if a.text == 'All Japan']
print(files_to_download)

['./csvs/2020.csv', './csvs/2019.csv', './csvs/2018.csv']


In [26]:
from pathlib import Path

file_prefix = 'jp_gr_paj_all_japan'
filestore_dir = Path('./filestore')

for f in files_to_download:
    p = Path(f)
    print(p.name)
    year = p.stem
    print(f'{ROOT_URL}/{f}')
    r = s.get(f'{ROOT_URL}/{f}')
    print(r)
    if r.ok:
        target_file = filestore_dir / f'{file_prefix}_{year}.csv'
        target_file.write_bytes(r.content)
        print(f'file {target_file} written.')
    

2020.csv
https://stats.paj.gr.jp/en/d-member/./csvs/2020.csv
<Response [200]>
file filestore\jp_gr_paj_all_japan_2020.csv written.
2019.csv
https://stats.paj.gr.jp/en/d-member/./csvs/2019.csv
<Response [200]>
file filestore\jp_gr_paj_all_japan_2019.csv written.
2018.csv
https://stats.paj.gr.jp/en/d-member/./csvs/2018.csv
<Response [200]>
file filestore\jp_gr_paj_all_japan_2018.csv written.
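
The URLs printed above still contain the `./` segment from the relative links. As a minimal sketch, `urllib.parse.urljoin` can resolve each link against the index page URL so that the request URL is normalised. `resolve_download` and `INDEX_URL` are hypothetical names introduced for illustration only.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin

INDEX_URL = 'https://stats.paj.gr.jp/en/d-member/index.html'
FILE_PREFIX = 'jp_gr_paj_all_japan'


def resolve_download(href, index_url=INDEX_URL, prefix=FILE_PREFIX):
    """Return (absolute URL, local file name) for a relative CSV link."""
    # urljoin resolves './csvs/2020.csv' relative to the index page
    url = urljoin(index_url, href)
    # The file stem ('2020') is the year used in the target file name
    year = PurePosixPath(href).stem
    return url, f'{prefix}_{year}.csv'


for href in ['./csvs/2020.csv', './csvs/2019.csv', './csvs/2018.csv']:
    print(resolve_download(href))
```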


In [39]:
import pandas as pd

header_list = {'Current Week': str,
               'Refinery Operations - Crude Input(kl)': float, 
               'Refinery Operations - Weekly Average Capacity(BPSD)': float,
               'Refinery Operations - Util. Rate against BPSD': float,
               'Refinery Operations - Designed Capacity(BPCD)': float,
               'Refinery Operations - Util. Rate against BPCD': float,
               'Products Stocks(kl) - Crude Oil': float,
               'Products Stocks(kl) - Gasoline': float,
               'Products Stocks(kl) - Naphtha': float,
               'Products Stocks(kl) - Jet': float,
               'Products Stocks(kl) - Kerosene': float,
               'Products Stocks(kl) - Gas Oil(Diesel)': float,
               'Products Stocks(kl) - LSA': float,
               'Products Stocks(kl) - HSA': float,
               'Products Stocks(kl) - AFO': float,
               'Products Stocks(kl) - LSC': float,
               'Products Stocks(kl) - HSC': float,
               'Products Stocks(kl) - CFO': float,
               'Products Stocks(kl) - Total': float,
               'Unfinished Oil Stocks(kl) - Unfinished Gasoline': float,
               'Unfinished Oil Stocks(kl) - Unfinished Kerosene': float,
               'Unfinished Oil Stocks(kl) - Unfinished Gas Oil': float,
               'Unfinished Oil Stocks(kl) - Unfinished AFO': float,
               'Unfinished Oil Stocks(kl) - Feed Stocks': float,
               'Unfinished Oil Stocks(kl) - Total': float,
               'Refinery Production(kl) - Gasoline': float, 
               'Refinery Production(kl) - Naphtha': float,
               'Refinery Production(kl) - Jet': float,
               'Refinery Production(kl) - Kerosene': float,
               'Refinery Production(kl) - Gas Oil(Diesel)': float,
               'Refinery Production(kl) - LSA': float,
               'Refinery Production(kl) - HSA': float,
               'Refinery Production(kl) - AFO': float,
               'Refinery Production(kl) - LSC': float,
               'Refinery Production(kl) - HSC': float,
               'Refinery Production(kl) - CFO': float,
               'Refinery Production(kl) - Total': float,
               'Imports(kl) - Gasoline': float,
               'Imports(kl) - Naphtha': float,
               'Imports(kl) - Jet': float,
               'Imports(kl) - Kerosene': float,
               'Imports(kl) - Gas Oil(Diesel)': float,
               'Imports(kl) - LSA': float,
               'Imports(kl) - HSA': float,
               'Imports(kl) - AFO': float,
               'Imports(kl) - LSC': float,
               'Imports(kl) - HSC': float,
               'Imports(kl) - CFO': float,
               'Imports(kl) - Total': float,
               'Exports(kl) - Gasoline': float,
               'Exports(kl) - Naphtha': float,
               'Exports(kl) - Jet': float,
               'Exports(kl) - Kerosene': float,
               'Exports(kl) - Gas Oil(Diesel)': float,
               'Exports(kl) - LSA': float,
               'Exports(kl) - HSA': float,
               'Exports(kl) - AFO': float,
               'Exports(kl) - LSC': float,
               'Exports(kl) - HSC': float,
               'Exports(kl) - CFO': float,
               'Exports(kl) - Total': float}
display(*header_list)

df = pd.read_csv(filestore_dir / 'jp_gr_paj_all_japan_2020.csv', skip_blank_lines=True, names=header_list.keys(), dtype=header_list, na_values='n.a.')
df.head()

'Current Week'

'Refinery Operations - Crude Input(kl)'

'Refinery Operations - Weekly Average Capacity(BPSD)'

'Refinery Operations - Util. Rate against BPSD'

'Refinery Operations - Designed Capacity(BPCD)'

'Refinery Operations - Util. Rate against BPCD'

'Products Stocks(kl) - Crude Oil'

'Products Stocks(kl) - Gasoline'

'Products Stocks(kl) - Naphtha'

'Products Stocks(kl) - Jet'

'Products Stocks(kl) - Kerosene'

'Products Stocks(kl) - Gas Oil(Diesel)'

'Products Stocks(kl) - LSA'

'Products Stocks(kl) - HSA'

'Products Stocks(kl) - AFO'

'Products Stocks(kl) - LSC'

'Products Stocks(kl) - HSC'

'Products Stocks(kl) - CFO'

'Products Stocks(kl) - Total'

'Unfinished Oil Stocks(kl) - Unfinished Gasoline'

'Unfinished Oil Stocks(kl) - Unfinished Kerosene'

'Unfinished Oil Stocks(kl) - Unfinished Gas Oil'

'Unfinished Oil Stocks(kl) - Unfinished AFO'

'Unfinished Oil Stocks(kl) - Feed Stocks'

'Unfinished Oil Stocks(kl) - Total'

'Refinery Production(kl) - Gasoline'

'Refinery Production(kl) - Naphtha'

'Refinery Production(kl) - Jet'

'Refinery Production(kl) - Kerosene'

'Refinery Production(kl) - Gas Oil(Diesel)'

'Refinery Production(kl) - LSA'

'Refinery Production(kl) - HSA'

'Refinery Production(kl) - AFO'

'Refinery Production(kl) - LSC'

'Refinery Production(kl) - HSC'

'Refinery Production(kl) - CFO'

'Refinery Production(kl) - Total'

'Imports(kl) - Gasoline'

'Imports(kl) - Naphtha'

'Imports(kl) - Jet'

'Imports(kl) - Kerosene'

'Imports(kl) - Gas Oil(Diesel)'

'Imports(kl) - LSA'

'Imports(kl) - HSA'

'Imports(kl) - AFO'

'Imports(kl) - LSC'

'Imports(kl) - HSC'

'Imports(kl) - CFO'

'Imports(kl) - Total'

'Exports(kl) - Gasoline'

'Exports(kl) - Naphtha'

'Exports(kl) - Jet'

'Exports(kl) - Kerosene'

'Exports(kl) - Gas Oil(Diesel)'

'Exports(kl) - LSA'

'Exports(kl) - HSA'

'Exports(kl) - AFO'

'Exports(kl) - LSC'

'Exports(kl) - HSC'

'Exports(kl) - CFO'

'Exports(kl) - Total'

| | Current Week | Refinery Operations - Crude Input(kl) | Refinery Operations - Weekly Average Capacity(BPSD) | Refinery Operations - Util. Rate against BPSD | Refinery Operations - Designed Capacity(BPCD) | Refinery Operations - Util. Rate against BPCD | Products Stocks(kl) - Crude Oil | Products Stocks(kl) - Gasoline | Products Stocks(kl) - Naphtha | Products Stocks(kl) - Jet | ... | Exports(kl) - Jet | Exports(kl) - Kerosene | Exports(kl) - Gas Oil(Diesel) | Exports(kl) - LSA | Exports(kl) - HSA | Exports(kl) - AFO | Exports(kl) - LSC | Exports(kl) - HSC | Exports(kl) - CFO | Exports(kl) - Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 29/Dec/2019-04/Jan/2020 | 3569258.0 | 3448800.0 | 93.0 | 3518800.0 | 91.1 | 11743285.0 | 1657964.0 | 1639051.0 | 760513.0 | ... | 116323.0 | 49190.0 | 106367.0 | 1100.0 | 43068.0 | 44168.0 | 69253.0 | 76128.0 | 145381.0 | 530461.0 |
| 1 | 05/Jan/2020-11/Jan/2020 | 3506928.0 | 3448800.0 | 91.4 | 3518800.0 | 89.6 | 10534954.0 | 1698530.0 | 1514410.0 | 843806.0 | ... | 148421.0 | 0.0 | 136856.0 | 1949.0 | 49078.0 | 51027.0 | 49130.0 | 83241.0 | 132371.0 | 493477.0 |
| 2 | 12/Jan/2020-18/Jan/2020 | 3497244.0 | 3448800.0 | 91.1 | 3518800.0 | 89.3 | 10850893.0 | 1751681.0 | 1556528.0 | 862632.0 | ... | 168103.0 | 41730.0 | 213146.0 | 10236.0 | 3456.0 | 13692.0 | 98899.0 | 105605.0 | 204504.0 | 768068.0 |
| 3 | 19/Jan/2020-25/Jan/2020 | 3416796.0 | 3345943.0 | 91.8 | 3518800.0 | 87.3 | 11400924.0 | 1795447.0 | 1709336.0 | 775379.0 | ... | 162500.0 | 66560.0 | 142576.0 | 4016.0 | 12835.0 | 16851.0 | 95255.0 | 77015.0 | 172270.0 | 574920.0 |
| 4 | 26/Jan/2020-01/Feb/2020 | 3347139.0 | 3328800.0 | 90.4 | 3518800.0 | 85.5 | 10903010.0 | 1811192.0 | 1727520.0 | 810927.0 | ... | 237592.0 | 0.0 | 72690.0 | 6348.0 | 56895.0 | 63243.0 | 58849.0 | 63730.0 | 122579.0 | 612601.0 |
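
The `Current Week` column is read as a plain string such as `29/Dec/2019-04/Jan/2020`. As a minimal sketch, assuming every label follows the `DD/Mon/YYYY-DD/Mon/YYYY` pattern seen above, the label can be split into start and end dates with the standard library. `parse_week` is a hypothetical helper, and `%b` relies on English month abbreviations in the active locale.

```python
from datetime import datetime

WEEK_FORMAT = '%d/%b/%Y'  # e.g. 29/Dec/2019 (English month names assumed)


def parse_week(label):
    """Split a 'Current Week' label into (start_date, end_date)."""
    # The dates use '/' internally, so the only '-' separates start and end
    start_text, end_text = label.split('-')
    start = datetime.strptime(start_text, WEEK_FORMAT).date()
    end = datetime.strptime(end_text, WEEK_FORMAT).date()
    return start, end


start, end = parse_week('29/Dec/2019-04/Jan/2020')
print(start, end, (end - start).days)  # prints: 2019-12-29 2020-01-04 6
```

Having real dates makes it easier to sort rows and to detect overlaps between the yearly files, since a week that spans the year boundary belongs to both calendar years.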


In [34]:
# List the column names (the keys of header_list)
[*header_list]

['Current Week',
 'Refinery Operations - Crude Input(kl)',
 'Refinery Operations - Weekly Average Capacity(BPSD)',
 'Refinery Operations - Util. Rate against BPSD',
 'Refinery Operations - Designed Capacity(BPCD)',
 'Refinery Operations - Util. Rate against BPCD',
 'Products Stocks(kl) - Crude Oil',
 'Products Stocks(kl) - Gasoline',
 'Products Stocks(kl) - Naphtha',
 'Products Stocks(kl) - Jet',
 'Products Stocks(kl) - Kerosene',
 'Products Stocks(kl) - Gas Oil(Diesel)',
 'Products Stocks(kl) - LSA',
 'Products Stocks(kl) - HSA',
 'Products Stocks(kl) - AFO',
 'Products Stocks(kl) - LSC',
 'Products Stocks(kl) - HSC',
 'Products Stocks(kl) - CFO',
 'Products Stocks(kl) - Total',
 'Unfinished Oil Stocks(kl) - Unfinished Gasoline',
 'Unfinished Oil Stocks(kl) - Unfinished Kerosene',
 'Unfinished Oil Stocks(kl) - Unfinished Gas Oil',
 'Unfinished Oil Stocks(kl) - Unfinished AFO',
 'Unfinished Oil Stocks(kl) - Feed Stocks',
 'Unfinished Oil Stocks(kl) - Total',
 'Refinery Production(kl) -

In [19]:
# Read the history from the Excel file maintained by stocks

df_history = pd.read_excel(r'G:\OMRstocks\Weekly PAJ\PAJ Weekly Data Download.xls', sheet_name='Data', skiprows=6, names=list(header_list), na_values='n.a.')
df_history

| | Current Week | Refinery Operations - Crude Input(kl) | Refinery Operations - Weekly Average Capacity(BPSD) | Refinery Operations - Util. Rate against BPSD | Refinery Operations - Designed Capacity(BPCD) | Refinery Operations - Util. Rate against BPCD | Products Stocks(kl) - Crude Oil | Products Stocks(kl) - Gasoline | Products Stocks(kl) - Naphtha | Products Stocks(kl) - Jet | ... | Exports(kl) - Jet | Exports(kl) - Kerosene | Exports(kl) - Gas Oil(Diesel) | Exports(kl) - LSA | Exports(kl) - HSA | Exports(kl) - AFO | Exports(kl) - LSC | Exports(kl) - HSC | Exports(kl) - CFO | Exports(kl) - Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 30/Dec/2007-05/Jan/2008 | 4862019.0 | 4742924.0 | 92.1 | 4894924.0 | 89.3 | 15245363.0 | 2134723.0 | 1702903.0 | 734501.0 | ... | 125223.0 | 2470.0 | 44542.0 | 0.0 | 1306.0 | 1306.0 | 0.0 | 79787.0 | 79787.0 | 253657.0 |
| 1 | 06/Jan/2008-12/Jan/2008 | 4923734.0 | 4742924.0 | 93.3 | 4894924.0 | 90.4 | 15121042.0 | 2281485.0 | 1768893.0 | 765975.0 | ... | 143620.0 | 1999.0 | 144299.0 | 17145.0 | 2191.0 | 19336.0 | 0.0 | 178025.0 | 178025.0 | 495084.0 |
| 2 | 13/Jan/2008-19/Jan/2008 | 4845262.0 | 4800067.0 | 90.7 | 4894924.0 | 88.9 | 15825234.0 | 2190311.0 | 1661387.0 | 791587.0 | ... | 97450.0 | 22575.0 | 107246.0 | 5223.0 | 3687.0 | 8910.0 | 0.0 | 185523.0 | 185523.0 | 421995.0 |
| 3 | 20/Jan/2008-26/Jan/2008 | 4795522.0 | 4842924.0 | 89.0 | 4894924.0 | 88.0 | 16045710.0 | 2223607.0 | 1781535.0 | 805229.0 | ... | 110341.0 | 2038.0 | 207326.0 | 19587.0 | 3986.0 | 23573.0 | 0.0 | 206998.0 | 206998.0 | 553880.0 |
| 4 | 27/Jan/2008-02/Feb/2008 | 4691786.0 | 4811495.0 | 87.6 | 4894924.0 | 86.1 | 15486007.0 | 2135441.0 | 1857759.0 | 795095.0 | ... | 182257.0 | 32164.0 | 214991.0 | 6308.0 | 4960.0 | 11268.0 | 0.0 | 221057.0 | 221057.0 | 667521.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 670 | 01/Nov/2020-07/Nov/2020 | 2633282.0 | 2985329.0 | 79.3 | 3457800.0 | 68.4 | 12414465.0 | 1936245.0 | 1419365.0 | 859147.0 | ... | 49993.0 | 24950.0 | 16.0 | 721.0 | 26297.0 | 27018.0 | 45426.0 | 64204.0 | 109630.0 | 240417.0 |
| 671 | 08/Nov/2020-14/Nov/2020 | 2757488.0 | 3074529.0 | 80.6 | 3457800.0 | 71.7 | 11996740.0 | 1917909.0 | 1351268.0 | 883946.0 | ... | 48840.0 | 24950.0 | 4886.0 | 71.0 | 44192.0 | 44263.0 | 44602.0 | 82907.0 | 127509.0 | 314694.0 |
| 672 | 15/Nov/2020-21/Nov/2020 | 2839968.0 | 3099100.0 | 82.3 | 3457800.0 | 73.8 | 10922929.0 | 1895164.0 | 1280093.0 | 806323.0 | ... | 27613.0 | 72515.0 | 20636.0 | 1098.0 | 12263.0 | 13361.0 | 33377.0 | 100159.0 | 133536.0 | 309516.0 |
| 673 | 22/Nov/2020-28/Nov/2020 | 2839684.0 | 3099100.0 | 82.3 | 3457800.0 | 73.8 | 11578269.0 | 1992945.0 | 1485961.0 | 809920.0 | ... | 65543.0 | 29800.0 | 28249.0 | 1474.0 | 9955.0 | 11429.0 | 44827.0 | 106762.0 | 151589.0 | 380922.0 |


## Ideas on how to implement it

The behaviour depends on the `full_load` flag:

- True: 
    * it loads the history from the Excel file
    * it loads the 3 files from the website and overwrites the values from Excel that already exist in the database

- False:
    * it loads only the most recent year from the files available on the website, overwriting existing values in the database
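
The two modes can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `plan_sources`, `upsert`, the source labels and the dict standing in for the database are all hypothetical, and rows are assumed to be keyed by their `Current Week` label.

```python
def plan_sources(full_load, website_years):
    """Return the sources to load, in the order they should be applied."""
    years = sorted(website_years)
    if full_load:
        # History first, then every website file, so website values win
        return ['excel_history'] + [f'website_{y}' for y in years]
    # Incremental run: only the most recent year published on the website
    return [f'website_{years[-1]}']


def upsert(store, rows):
    """Insert rows into `store`, overwriting any row with the same week."""
    for row in rows:
        store[row['Current Week']] = row
    return store


print(plan_sources(True, [2020, 2019, 2018]))
# ['excel_history', 'website_2018', 'website_2019', 'website_2020']
print(plan_sources(False, [2020, 2019, 2018]))
# ['website_2020']
```

Applying the sources in this order with an overwrite-on-key update means that a week present in both the Excel history and a website file ends up with the website value.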


In [40]:
display(df.info())
display(df_history.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 61 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Current Week                                         49 non-null     object 
 1   Refinery Operations - Crude Input(kl)                49 non-null     float64
 2   Refinery Operations - Weekly Average Capacity(BPSD)  49 non-null     float64
 3   Refinery Operations - Util. Rate against BPSD        49 non-null     float64
 4   Refinery Operations - Designed Capacity(BPCD)        49 non-null     float64
 5   Refinery Operations - Util. Rate against BPCD        49 non-null     float64
 6   Products Stocks(kl) - Crude Oil                      49 non-null     float64
 7   Products Stocks(kl) - Gasoline                       49 non-null     float64
 8   Products Stocks(kl) - Naphtha                        49 non-null     flo

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 61 columns):
 #   Column                                               Non-Null Count  Dtype  
---  ------                                               --------------  -----  
 0   Current Week                                         675 non-null    object 
 1   Refinery Operations - Crude Input(kl)                672 non-null    float64
 2   Refinery Operations - Weekly Average Capacity(BPSD)  672 non-null    float64
 3   Refinery Operations - Util. Rate against BPSD        672 non-null    float64
 4   Refinery Operations - Designed Capacity(BPCD)        672 non-null    float64
 5   Refinery Operations - Util. Rate against BPCD        672 non-null    float64
 6   Products Stocks(kl) - Crude Oil                      672 non-null    float64
 7   Products Stocks(kl) - Gasoline                       672 non-null    float64
 8   Products Stocks(kl) - Naphtha                        672 non-null    f

None