# India Crude Processed by Refineries

In this notebook we prototype how to download the "India Crude Processed by Refineries" data and load it into External DB.

Sources:

* Historical data: https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_crude_H%20.xls
* Current season:  https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_crude.xls

## Notes

* Data is reported by _season_: from April of year N to March of year N + 1.
* Data is in _thousand metric tonnes_.
* Data is grouped by company (in **bold** in the file).
* Detail level is refining unit location.
* We assume that files are downloaded to the PROJECT_ROOT/filestore directory.
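
As a minimal sketch of the season convention noted above, the helper below maps a season's start year and a month abbreviation to a calendar date. The name `season_month_to_date` is hypothetical and not part of the scraper; it only illustrates the April-to-March rule.

```python
from datetime import date

# Months in season order: April of year N through March of year N + 1.
SEASON_MONTHS = ['APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP',
                 'OCT', 'NOV', 'DEC', 'JAN', 'FEB', 'MAR']


def season_month_to_date(start_year, month):
    """Return the first day of `month` in the season starting in April of `start_year`.

    Hypothetical helper for illustration. Only the first three letters of
    `month` are used, so 'JULY' and 'SEPT' are accepted too.
    """
    position = SEASON_MONTHS.index(month[:3].upper())
    # APR..DEC (positions 0-8) fall in the start year, JAN..MAR in the next one.
    year = start_year if position < 9 else start_year + 1
    calendar_month = (position + 3) % 12 + 1
    return date(year, calendar_month, 1)


print(season_month_to_date(2019, 'APR'))   # 2019-04-01
print(season_month_to_date(2019, 'FEB'))   # 2020-02-01
```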

**Our first goal**: produce a file with these fields:

company,location,product,period,value

**Second goal**: load it into External DB.

## Setup

In the following cells, we do some standard setup:

* go to the PROJECT_ROOT directory
* setup logging

To run them, select the cell and press _Shift_ + _Enter_.

In [1]:
# this goes back to project root directory
%cd ..


C:\Users\ROSA_L\PycharmProjects\scraper


In [2]:
# this makes Jupyter lab reload all imported python modules before executing each cell
%load_ext autoreload
%autoreload 2

In [3]:
# this sets up logging with DEBUG level
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)

## Time to play with data

Let's experiment with data!

In [4]:
# let's download data

import requests
from pathlib import Path

logger = logging.getLogger(__name__)
provider = 'in_gov_ppac'

download_dir = Path('.') / 'filestore'

urls = ['https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_crude_H%20.xls',
        'https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_crude.xls']

for url in urls:
    logger.info(f'Downloading {url}')
    response = requests.get(url)
    if response.ok:
        file = download_dir / f"{provider}_{url.split('/')[-1]}"
        logger.debug(f'Response OK: downloading to {file}')
        logger.info(f'Saving file to {file}')
        file.write_bytes(response.content)

2020-06-02 11:39:41,273 - __main__ - INFO - Downloading https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_crude_H%20.xls
2020-06-02 11:39:41,284 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): www.ppac.gov.in:443
2020-06-02 11:39:43,333 - urllib3.connectionpool - DEBUG - https://www.ppac.gov.in:443 "GET /WriteReadData/userfiles/file/PT_crude_H%20.xls HTTP/1.1" 200 533504
2020-06-02 11:39:44,544 - __main__ - DEBUG - Response OK: downloading to filestore\in_gov_ppac_PT_crude_H%20.xls
2020-06-02 11:39:44,547 - __main__ - INFO - Saving file to filestore\in_gov_ppac_PT_crude_H%20.xls
2020-06-02 11:39:44,555 - __main__ - INFO - Downloading https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_crude.xls
2020-06-02 11:39:44,566 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): www.ppac.gov.in:443
2020-06-02 11:39:45,469 - urllib3.connectionpool - DEBUG - https://www.ppac.gov.in:443 "GET /WriteReadData/userfiles/file/PT_crude.xls HTTP/1.1" 200 8

### ... But the names of the files are dynamic

So, in real life, we first need to find the correct URL for each file.

In [51]:
from bs4 import BeautifulSoup
import requests

response = requests.get('https://www.ppac.gov.in/content/146_1_ProductionPetroleum.aspx')
if not response.ok:
    raise Exception(f"Problem accessing website {response.url}")

page = response.content
soup = BeautifulSoup(page, 'html.parser')
crude_processing = None
for title in soup.find_all('h5'):
    print(title.text)
    if title.text == 'Crude Processing':
        break

Indigenous Crude Oil Production 
Crude Processing


In [71]:
ul = title.find_next('ul')
if not ul:
    raise AttributeError("Element <ul> not found after 'Crude Processing'. Check if the website has changed.")
list_a = [li.select('a')[0] for li in ul.find_all('li')]
dict_url = {a.text.split(' ')[0]: a['href'] for a in list_a}
display(dict_url)


[<a href="/WriteReadData/userfiles/file/PT_CRUDE_22-5-2020.xls" target="_blank">Current  <img alt="View Document" height="15" src="/images/excel.png" width="15"/> 73  Kb</a>,
 <a href="/WriteReadData/userfiles/file/PT_crude_H_22-5-2020.xls" target="_blank">Historical  <img alt="View Document" height="15" src="/images/excel.png" width="15"/>550  Kb</a>]

{'Current': '/WriteReadData/userfiles/file/PT_CRUDE_22-5-2020.xls',
 'Historical': '/WriteReadData/userfiles/file/PT_crude_H_22-5-2020.xls'}
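
The links scraped above are relative to the site root, so they still have to be turned into absolute URLs before downloading. A minimal sketch, assuming the page URL used in the request above as the base, with `urljoin` from the standard library (the dictionary is copied from the output so the example is self-contained):

```python
from urllib.parse import urljoin

PAGE_URL = 'https://www.ppac.gov.in/content/146_1_ProductionPetroleum.aspx'

# Relative links as scraped above (copied here so the example runs on its own).
dict_url = {
    'Current': '/WriteReadData/userfiles/file/PT_CRUDE_22-5-2020.xls',
    'Historical': '/WriteReadData/userfiles/file/PT_crude_H_22-5-2020.xls',
}

# urljoin resolves a root-relative href against the scheme and host of the page URL.
absolute_urls = {name: urljoin(PAGE_URL, href) for name, href in dict_url.items()}
print(absolute_urls)
```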

## Transform the files

From PT_crude.xls we get the current numbers, and from PT_crude_H .xls the history.

In [4]:
import xlwings as xw
import sys
import pandas as pd
from pathlib import Path
import logging

logger = logging.getLogger(__name__)

PERIOD_CELL = 'A7'
TABLE_HEADER_START = (9, 1)

download_dir = Path('.') / 'filestore'
file = download_dir / 'in_gov_ppac_PT_crude.xls'

app = xw.App(visible=False, add_book=False)
app.display_alerts = False

df = None

try:
    logger.info(f"Opening {file}")
    wb = app.books.open(file, update_links=False)
    logger.debug("Opening first sheet")
    sheet = wb.sheets[0]

    # the start year of the season is the text before the first '-' in the period cell
    period = sheet.range(PERIOD_CELL).value
    year = period.split('-')[0]

    logger.debug(f"Period: {period}, start year: {year}")

    # find the table boundaries: last header column and last non-empty row in column A
    last_col = sheet.range(TABLE_HEADER_START).expand('right').end('right').column
    last_row = sheet.range('A' + str(sheet.cells.last_cell.row)).end('up').row

    logger.debug(f'last_col: {last_col} last_row: {last_row}')
    rng = sheet.range(TABLE_HEADER_START,(last_row, last_col))
    logger.debug(f'range: {rng.address}')

    # convert range to data frame
    df = rng.options(pd.DataFrame, index=False).value

    display(df)

    logger.info(f'Closing file {file}')
    wb.close()
finally:
    # always quit Excel, even if something above failed
    app.quit()


2020-06-11 14:44:31,597 - comtypes - DEBUG - CoInitializeEx(None, 2)
2020-06-11 14:44:33,096 - matplotlib - DEBUG - $HOME=C:\Users\ROSA_L
2020-06-11 14:44:33,105 - matplotlib - DEBUG - CONFIGDIR=C:\Users\ROSA_L\.matplotlib
2020-06-11 14:44:33,108 - matplotlib - DEBUG - matplotlib data path: c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\matplotlib\mpl-data
2020-06-11 14:44:33,159 - matplotlib - DEBUG - loaded rc file c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\matplotlib\mpl-data\matplotlibrc
2020-06-11 14:44:33,170 - matplotlib - DEBUG - matplotlib version 3.0.0
2020-06-11 14:44:33,173 - matplotlib - DEBUG - interactive is False
2020-06-11 14:44:33,177 - matplotlib - DEBUG - platform is win32
2020-06-11 14:44:33,426 - matplotlib - DEBUG - CACHEDIR=C:\Users\ROSA_L\.matplotlib
2020-06-11 14:44:33,468 - matplotlib.font_manager - DEBUG - Using fontManager instance from C:\Users\ROSA_L\.matplotlib\fontlist-v300.json
2020-06-11 14:44:47,298 - __main__ - IN

Unnamed: 0,OIL COMPANIES,APR,MAY,JUN,JULY,AUG,SEP,OCT,NOV,DEC,JAN,FEB,MAR,TOTAL
0,Indian Oil Corporation Ltd.(IOCL),,,,,,,,,,,,,
1,"IOCL-BARAUNI,BIHAR",568.619,582.78,569.518,581.487,461.788,434.612,538.976,594.799,572.995,556.909,541.376,,6003.859
2,"IOCL-KOYALI, GUJARAT",637.655,817.111,937.449,1117.818,993.803,1086.221,1256.184,1271.088,1299.229,1262.295,1228.934,,11907.787
3,"IOCL-HALDIA, WEST BENGAL",684.222,703.712,687.906,708.032,599.918,515.586,373.197,363.607,376.048,339.85,517.431,,5869.509
4,"IOCL-MATHURA, UTTAR PRADESH",847.237,889.678,870.179,899.136,858.56,843.126,866.429,829.502,82.814,430.699,747.478,,8164.838
5,"IOCL-PANIPAT, HARYANA",1154.681,1374.667,1291.537,1401.681,1329.78,580.47,1267.255,1292.905,1401.98,1395.113,1332.449,,13822.518
6,"IOCL-GUWAHATI,ASSAM",74.71,74.162,80.307,86.44,92.773,89.596,81.411,90.414,87.209,82.953,52.428,,892.403
7,"IOCL-DIGBOI,ASSAM",52.163,52.566,50.878,59.917,50.1,53.687,60.411,58.916,48.501,61.685,55.97,,604.794
8,"IOCL-BONGAIGAON,ASSAM",198.33,196.55,194.917,206.787,207.609,206.515,216.232,219.587,64.998,108.318,90.338,,1910.181
9,"IOCL-PARADIP,ODISHA",1316.818,1235.475,1138.974,1354.654,1365.025,1352.301,1382.581,1369.257,1429.69,1404.748,1157.823,,14507.346


2020-06-11 14:44:48,735 - __main__ - INFO - Closing file filestore\in_gov_ppac_PT_crude.xls


## Transform data frame

From the data frame:

* separate company list and location list
* parse periods correctly
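
The first step relies on the label conventions visible in the table above: company rows look like "Indian Oil Corporation Ltd.(IOCL)" and location rows like "IOCL-KOYALI, GUJARAT". A minimal sketch of that parsing on plain strings follows; the helper names `parse_company` and `parse_location` are hypothetical and only illustrate the idea.

```python
def parse_company(label):
    """Split 'Indian Oil Corporation Ltd.(IOCL)' into ('IOCL', 'Indian Oil Corporation Ltd.')."""
    long_name, _, rest = label.partition('(')
    # drop the closing parenthesis and any surrounding whitespace
    code = rest.strip().rstrip(')')
    return code.strip(), long_name.strip()


def parse_location(label):
    """Split 'IOCL-KOYALI, GUJARAT' into ('IOCL', 'KOYALI, GUJARAT')."""
    # Split only on the first '-' so that hyphens inside the location survive.
    company_code, _, location = label.partition('-')
    return company_code.strip(), location.strip()


print(parse_company('Indian Oil Corporation Ltd.(IOCL)'))
print(parse_location('IOCL-KOYALI, GUJARAT'))
```

Splitting on the first separator only is a design choice for this sketch: it keeps any later hyphen or parenthesis inside the location or name intact.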


In [5]:
# remove the last 2 rows (copy, so that the in-place operations below act on our own frame)
df = df.iloc[:-2,:].copy()

# drop rows and columns that are entirely null
df.dropna(how='all', inplace=True)
df.dropna(axis='columns', how='all', inplace=True)
# drop the TOTAL rows
df = df[~(df['OIL COMPANIES '].str.contains('TOTAL'))]

# remove TOTAL column
cols = list(df.columns)
df = df[cols[:-1]]

df


Unnamed: 0,OIL COMPANIES,APR,MAY,JUN,JULY,AUG,SEP,OCT,NOV,DEC,JAN,FEB
0,Indian Oil Corporation Ltd.(IOCL),,,,,,,,,,,
1,"IOCL-BARAUNI,BIHAR",568.619,582.78,569.518,581.487,461.788,434.612,538.976,594.799,572.995,556.909,541.376
2,"IOCL-KOYALI, GUJARAT",637.655,817.111,937.449,1117.818,993.803,1086.221,1256.184,1271.088,1299.229,1262.295,1228.934
3,"IOCL-HALDIA, WEST BENGAL",684.222,703.712,687.906,708.032,599.918,515.586,373.197,363.607,376.048,339.85,517.431
4,"IOCL-MATHURA, UTTAR PRADESH",847.237,889.678,870.179,899.136,858.56,843.126,866.429,829.502,82.814,430.699,747.478
5,"IOCL-PANIPAT, HARYANA",1154.681,1374.667,1291.537,1401.681,1329.78,580.47,1267.255,1292.905,1401.98,1395.113,1332.449
6,"IOCL-GUWAHATI,ASSAM",74.71,74.162,80.307,86.44,92.773,89.596,81.411,90.414,87.209,82.953,52.428
7,"IOCL-DIGBOI,ASSAM",52.163,52.566,50.878,59.917,50.1,53.687,60.411,58.916,48.501,61.685,55.97
8,"IOCL-BONGAIGAON,ASSAM",198.33,196.55,194.917,206.787,207.609,206.515,216.232,219.587,64.998,108.318,90.338
9,"IOCL-PARADIP,ODISHA",1316.818,1235.475,1138.974,1354.654,1365.025,1352.301,1382.581,1369.257,1429.69,1404.748,1157.823


In [14]:
logger.debug("Processing Entity dimension (companies).")
# Get oil company names:
# company rows are the ones with NaN in the period columns
df_companies = df[df.iloc[:, 1].isnull()][['OIL COMPANIES ']]
df_companies.columns = ['name']
# 'Indian Oil Corporation Ltd.(IOCL)' -> code 'IOCL', long_name 'Indian Oil Corporation Ltd.'
df_companies['code'] = df_companies['name'].map(lambda x: x.split('(')[1][:-1])
df_companies['long_name'] = df_companies['name'].map(lambda x: x.split('(')[0].strip())
del df_companies['name']

nel = pd.DataFrame([{'code': 'NEL', 'long_name': 'Nayara Energy Ltd.'}])

df_companies = pd.concat([df_companies, nel])
df_companies['category'] = 'company'

df_companies.reset_index(drop=True, inplace=True)
logger.debug(f"Number of companies: {len(df_companies)}")
df_companies

2020-06-11 15:50:10,220 - __main__ - DEBUG - Processing Entity dimension (companies).
2020-06-11 15:50:10,239 - __main__ - DEBUG - Number of companies: 7


Unnamed: 0,code,long_name,category
0,IOCL,Indian Oil Corporation Ltd.,company
1,CPCL,Chennai Petroleum Corporation Ltd.,company
2,BPCL,Bharat Petroleum Corporation Ltd.,company
3,ONGC,Oil & Natural Gas Corporation Ltd.,company
4,HPCL,Hindustan Petroleum Corporation Ltd.,company
5,RIL,Reliance Industries Ltd.,company
6,NEL,Nayara Energy Ltd.,company


In [15]:
# locations are the rows where months have numbers
df_location = df[~(df.iloc[:, 1].isnull())]
df_location = df_location.rename(columns={'OIL COMPANIES ': 'location_code'})
# company is the text before '-' in OIL COMPANIES
# but first let's fix RIL,JAMNAGAR,GUJARAT -> RIL-JAMNAGAR,GUJARAT
df_location['location_code'] = df_location['location_code'].replace(to_replace={'RIL,JAMNAGAR,GUJARAT': 'RIL-JAMNAGAR,GUJARAT'})
df_location['company_code'] = df_location['location_code'].map(lambda x: x.split('-')[0])
df_location['location_code'] = df_location['location_code'].map(lambda x: x.split('-')[1])

# months APR..DEC (positions 0-8) belong to the season's start year, JAN..MAR to the following year
month_names = list(df_location.columns)[1:-1]
month_cols_dict = {x: f'{x[:3]} {int(year) + (0 if i < 9 else 1)}' for i, x in enumerate(month_names)}

df_location = df_location.rename(columns=month_cols_dict)

loc_cols = ['company_code', 'location_code'] + list(month_cols_dict.values())
logger.debug(f'location columns: {loc_cols} dict: {month_cols_dict}')
df_location = df_location[loc_cols]
df_location

2020-06-11 16:48:26,644 - __main__ - DEBUG - location columns: ['company_code', 'location_code', 'APR 2019', 'MAY 2019', 'JUN 2019', 'JUL 2019', 'AUG 2019', 'SEP 2019', 'OCT 2019', 'NOV 2019', 'DEC 2019', 'JAN 2020', 'FEB 2020'] dict: {'APR': 'APR 2019', 'MAY': 'MAY 2019', 'JUN': 'JUN 2019', 'JULY': 'JUL 2019', 'AUG': 'AUG 2019', 'SEP': 'SEP 2019', 'OCT': 'OCT 2019', 'NOV': 'NOV 2019', 'DEC': 'DEC 2019', 'JAN': 'JAN 2020', 'FEB': 'FEB 2020'}


Unnamed: 0,company_code,location_code,APR 2019,MAY 2019,JUN 2019,JUL 2019,AUG 2019,SEP 2019,OCT 2019,NOV 2019,DEC 2019,JAN 2020,FEB 2020
1,IOCL,"BARAUNI,BIHAR",568.619,582.78,569.518,581.487,461.788,434.612,538.976,594.799,572.995,556.909,541.376
2,IOCL,"KOYALI, GUJARAT",637.655,817.111,937.449,1117.818,993.803,1086.221,1256.184,1271.088,1299.229,1262.295,1228.934
3,IOCL,"HALDIA, WEST BENGAL",684.222,703.712,687.906,708.032,599.918,515.586,373.197,363.607,376.048,339.85,517.431
4,IOCL,"MATHURA, UTTAR PRADESH",847.237,889.678,870.179,899.136,858.56,843.126,866.429,829.502,82.814,430.699,747.478
5,IOCL,"PANIPAT, HARYANA",1154.681,1374.667,1291.537,1401.681,1329.78,580.47,1267.255,1292.905,1401.98,1395.113,1332.449
6,IOCL,"GUWAHATI,ASSAM",74.71,74.162,80.307,86.44,92.773,89.596,81.411,90.414,87.209,82.953,52.428
7,IOCL,"DIGBOI,ASSAM",52.163,52.566,50.878,59.917,50.1,53.687,60.411,58.916,48.501,61.685,55.97
8,IOCL,"BONGAIGAON,ASSAM",198.33,196.55,194.917,206.787,207.609,206.515,216.232,219.587,64.998,108.318,90.338
9,IOCL,"PARADIP,ODISHA",1316.818,1235.475,1138.974,1354.654,1365.025,1352.301,1382.581,1369.257,1429.69,1404.748,1157.823
13,CPCL,"MANALI, TAMILNADU",839.910596,904.236365,877.784881,925.238056,898.160256,743.429828,711.089169,788.625779,936.78167,946.86017,735.328386


In [31]:
# let's convert column names to date
from datetime import datetime

date_cols = [datetime.strptime(col.title(), '%b %Y') if col not in ('company_code', 'location_code') else col for col in list(df_location.columns)]
df_location.columns = date_cols
df_location

Unnamed: 0,company_code,location_code,2019-04-01 00:00:00,2019-05-01 00:00:00,2019-06-01 00:00:00,2019-07-01 00:00:00,2019-08-01 00:00:00,2019-09-01 00:00:00,2019-10-01 00:00:00,2019-11-01 00:00:00,2019-12-01 00:00:00,2020-01-01 00:00:00,2020-02-01 00:00:00
1,IOCL,"BARAUNI,BIHAR",568.619,582.78,569.518,581.487,461.788,434.612,538.976,594.799,572.995,556.909,541.376
2,IOCL,"KOYALI, GUJARAT",637.655,817.111,937.449,1117.818,993.803,1086.221,1256.184,1271.088,1299.229,1262.295,1228.934
3,IOCL,"HALDIA, WEST BENGAL",684.222,703.712,687.906,708.032,599.918,515.586,373.197,363.607,376.048,339.85,517.431
4,IOCL,"MATHURA, UTTAR PRADESH",847.237,889.678,870.179,899.136,858.56,843.126,866.429,829.502,82.814,430.699,747.478
5,IOCL,"PANIPAT, HARYANA",1154.681,1374.667,1291.537,1401.681,1329.78,580.47,1267.255,1292.905,1401.98,1395.113,1332.449
6,IOCL,"GUWAHATI,ASSAM",74.71,74.162,80.307,86.44,92.773,89.596,81.411,90.414,87.209,82.953,52.428
7,IOCL,"DIGBOI,ASSAM",52.163,52.566,50.878,59.917,50.1,53.687,60.411,58.916,48.501,61.685,55.97
8,IOCL,"BONGAIGAON,ASSAM",198.33,196.55,194.917,206.787,207.609,206.515,216.232,219.587,64.998,108.318,90.338
9,IOCL,"PARADIP,ODISHA",1316.818,1235.475,1138.974,1354.654,1365.025,1352.301,1382.581,1369.257,1429.69,1404.748,1157.823
13,CPCL,"MANALI, TAMILNADU",839.910596,904.236365,877.784881,925.238056,898.160256,743.429828,711.089169,788.625779,936.78167,946.86017,735.328386


In [16]:
df_location[['location_code']].drop_duplicates()


Unnamed: 0,location_code
1,"BARAUNI,BIHAR"
2,"KOYALI, GUJARAT"
3,"HALDIA, WEST BENGAL"
4,"MATHURA, UTTAR PRADESH"
5,"PANIPAT, HARYANA"
6,"GUWAHATI,ASSAM"
7,"DIGBOI,ASSAM"
8,"BONGAIGAON,ASSAM"
9,"PARADIP,ODISHA"
13,"MANALI, TAMILNADU"


In [32]:
# Now time to unpivot the data frame
df_location = df_location.melt(id_vars = ['company_code', 'location_code'], var_name = 'timestamp')
df_location

Unnamed: 0,company_code,location_code,timestamp,value
0,IOCL,"BARAUNI,BIHAR",2019-04-01,568.619000
1,IOCL,"KOYALI, GUJARAT",2019-04-01,637.655000
2,IOCL,"HALDIA, WEST BENGAL",2019-04-01,684.222000
3,IOCL,"MATHURA, UTTAR PRADESH",2019-04-01,847.237000
4,IOCL,"PANIPAT, HARYANA",2019-04-01,1154.681000
...,...,...,...,...
248,HPCL,"VISAKH,ANDHRA PRADESH",2020-02-01,798.285067
249,HMEL,"GGSR, BATHINDA, PUNJAB",2020-02-01,1028.208000
250,RIL,"JAMNAGAR,GUJARAT",2020-02-01,2880.927013
251,RIL,"(SEZ), JAMNAGAR,GUJARAT",2020-02-01,2786.852000


In [34]:
# let's write to csv
download_dir = Path('.') / 'filestore'
prefix = 'in_gov_ppac'
companies_file = download_dir / f'{prefix}_companies.csv'
data_file = download_dir / f'{prefix}_data.csv'

df_companies.to_csv(companies_file, index=False)
logger.info(f'{len(df_companies)} rows written to {companies_file}')
df_location.to_csv(data_file, index=False)
logger.info(f'{len(df_location)} rows written to {data_file}')

2020-06-03 18:02:42,145 - __main__ - INFO - 7 rows written to filestore\in_gov_ppac_companies.csv
2020-06-03 18:02:42,164 - __main__ - INFO - 253 rows written to filestore\in_gov_ppac_data.csv


## Loading External DB

To load the current file into External DB, we have to create a new class derived from scraper.core.job.Job.

We create this class in a new Python package, scraper.jobs.in_gov_ppac (a new directory containing an __init__.py file).

Then we have to implement at least the following methods:

* get_sources()
* transform()

In [None]:
from scraper.core.job import Job


class CrudeOilProcJob(Job):
    """
    Scraper for loading Crude Oil Processed by Refineries data from ppac.gov.in.
    """
    
    def get_sources(self):
        """
        Method returning the list of files to download and process.
        :return: 
        """
        pass

    def transform(self):
        """
        Method defining the transformations to be applied on the source data.
        :return: 
        """
        pass

Let's implement get_sources() and also add a constructor to the class, so that we can choose between loading only the current file and a full history load.

Let's also add some constants at the beginning.

Usually I declare them as module-level variables in UPPER CASE at the top of the file (Python has no equivalent of C's _const_, so upper case is just a convention).

This time, as an alternative, I will declare them as class variables instead.

The result is the same, but these variables are encapsulated by the class.
We can access these class variables in the following ways:

* inside the class:
    * cls.BASE_URL in class methods 
    * self.BASE_URL in normal methods
* outside of the class:
    * CrudeOilProcJob.BASE_URL 

The constructor below (__init__()) defines an instance variable **full_load**.
It belongs to each instance, so inside instance methods we access it with self.full_load.
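
The access patterns listed above can be tried on a small stand-in class. This is only a sketch: `ExampleJob`, `url_for` and `describe` are hypothetical names for illustration, and the real job derives from scraper.core.job.Job.

```python
class ExampleJob:
    """Stand-in class for illustration only."""
    # class variable: shared by the class and all of its instances
    BASE_URL = "https://www.ppac.gov.in/WriteReadData/userfiles/file/"

    def __init__(self, full_load=False):
        # instance variable: one value per object
        self.full_load = full_load

    @classmethod
    def url_for(cls, file_name):
        # class variable accessed through cls in a class method
        return cls.BASE_URL + file_name

    def describe(self):
        # class variable accessed through self in an instance method
        mode = 'full load' if self.full_load else 'current only'
        return f'{self.BASE_URL} ({mode})'


print(ExampleJob.BASE_URL)                     # access from outside the class
print(ExampleJob.url_for('PT_crude.xls'))
print(ExampleJob(full_load=True).describe())
```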

In [4]:
from scraper.core.job import Job

# I could have declared the constant like this...
BASE_URL = "https://www.ppac.gov.in/WriteReadData/userfiles/file/"

class CrudeOilProcJob(Job):
    """
    Scraper for loading Crude Oil Processed at Refineries data from ppac.gov.in.
    """
    # ... but this time I will do like this:
    BASE_URL = "https://www.ppac.gov.in/WriteReadData/userfiles/file/"

    def __init__(self, full_load=False):
        """
        Constructor.
        @param full_load: True for full-load (since 1941).
                          False loads latest available month (current month - PUBLICATION_DELAY).
        """
        self.full_load = full_load
        super().__init__()

And here we have a first get_sources() method:

In [None]:
    def get_sources(self):
        """
        Method returning the list of files to download and process.
        :return:
        """
        # if full_load, process both files
        files = []
        if self.full_load:
            files.append(self.history_file)
        files.append(self.current_file)
    
        # let's add files to process to self.sources
        for file in files:
            dest_file = f'{self.provider}_{file}'
            
            base_source = BaseSource(url=f'{self.BASE_URL}{file}',
                                     code=dest_file.split('.')[0],
                                     path=dest_file,
                                     long_name=f"{self.area} {self.provider} Crude Oil Processed at Refineries")
            
            # append it to self.sources (files to be processed)
            self.sources.append(base_source)

            # add dictionary to dynamic dims. 
            # dynamic dims will be updated in external DB through API.
            # self.dynamic_dim expects dictionaries as elements.
            dicto = vars(copy(base_source))
            self.dynamic_dim['source'] += [dicto]

        # run self.remove_existing_dynamic_dim to remove existing sources
        self.remove_existing_dynamic_dim('source')

Time to test it:

In [5]:
from scraper.jobs.in_gov_ppac.crude_oil_proc import CrudeOilProcJob

india = CrudeOilProcJob(full_load=False)
india.get_sources()
display([vars(source) for source in india.sources])

#india = CrudeOilProcJob(full_load=True)
#india.get_sources()
#display([vars(source) for source in india.sources])

ModuleNotFoundError: No module named 'scraper.jobs.in_gov_ppac.crude_oil_proc'

In [5]:
from scraper.jobs.in_gov_ppac.crudeoil_proc_job import CrudeOilProcJob

india = CrudeOilProcJob(full_load=True)
india.parse_urls()


2020-06-16 18:27:27,577 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Parsing file URLs to download from https://www.ppac.gov.in/content/146_1_ProductionPetroleum.aspx
2020-06-16 18:27:27,581 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): www.ppac.gov.in:443
2020-06-16 18:27:31,596 - urllib3.connectionpool - DEBUG - https://www.ppac.gov.in:443 "GET /content/146_1_ProductionPetroleum.aspx HTTP/1.1" 200 33698
2020-06-16 18:27:31,791 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - Title found: Indigenous Crude Oil Production 
2020-06-16 18:27:31,792 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - Title found: Crude Processing
2020-06-16 18:27:31,793 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - List of 'a' elements: [<a href="/WriteReadData/userfiles/file/PT_CRUDE_22-5-2020.xls" target="_blank">Current  <img alt="View Document" height="15" src="/images/excel.png" width="15"/> 73  Kb</a>, <a href="/WriteReadData/userfiles/file/PT_crude_

{'Current': 'https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_CRUDE_22-5-2020.xls',
 'Historical': 'https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_crude_H_22-5-2020.xls'}

## Time to test the full scraper

Let's test the full scraper by calling run().

In [4]:
from scraper.jobs.in_gov_ppac.crudeoil_proc_job import CrudeOilProcJob

india = CrudeOilProcJob()
india.run()

2020-09-24 17:14:15,508 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Getting sources...
2020-09-24 17:14:15,509 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Parsing file URLs to download from https://www.ppac.gov.in/content/146_1_ProductionPetroleum.aspx
2020-09-24 17:14:15,513 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): www.ppac.gov.in:443


SSLError: HTTPSConnectionPool(host='www.ppac.gov.in', port=443): Max retries exceeded with url: /content/146_1_ProductionPetroleum.aspx (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)'),))

In [5]:
import ssl
ssl.get_default_verify_paths()

DefaultVerifyPaths(cafile='C:\\Users\\ROSA_L\\.certificates\\checkpoint_https_inspection_cert_b64.crt', capath='C:\\Users\\ROSA_L\\.certificates', openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='/usr/local/ssl/cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='/usr/local/ssl/certs')

In [6]:
india.dynamic_dim['entity']

[]

In [13]:
india.data

[{'entity': 'IOCL',
  'detail': 'BARAUNI,BIHAR',
  'APR2020': 246.52600000000004,
  'area': 'INDIA',
  'flow': 'REFINOBS',
  'provider': 'IN_GOV_PPAC',
  'source': 'IN_GOV_PPAC_Current',
  'frequency': 'Monthly',
  'unit': 'KT',
  'original': True},
 {'entity': 'IOCL',
  'detail': 'KOYALI, GUJARAT',
  'APR2020': 578.306,
  'area': 'INDIA',
  'flow': 'REFINOBS',
  'provider': 'IN_GOV_PPAC',
  'source': 'IN_GOV_PPAC_Current',
  'frequency': 'Monthly',
  'unit': 'KT',
  'original': True},
 {'entity': 'IOCL',
  'detail': 'HALDIA, WEST BENGAL',
  'APR2020': 305.29900000000004,
  'area': 'INDIA',
  'flow': 'REFINOBS',
  'provider': 'IN_GOV_PPAC',
  'source': 'IN_GOV_PPAC_Current',
  'frequency': 'Monthly',
  'unit': 'KT',
  'original': True},
 {'entity': 'IOCL',
  'detail': 'MATHURA, UTTAR PRADESH',
  'APR2020': 505.71500000000003,
  'area': 'INDIA',
  'flow': 'REFINOBS',
  'provider': 'IN_GOV_PPAC',
  'source': 'IN_GOV_PPAC_Current',
  'frequency': 'Monthly',
  'unit': 'KT',
  'original': T

## Historical File

The historical file is a bit more complicated, as it has yearly and monthly data sheets.
We can identify the file by the H in its name.

In [11]:
from pathlib import Path
import pandas as pd

download_dir = Path('.') / 'filestore'
file = download_dir / 'IN_GOV_PPAC_Historical.xls'

xl = pd.ExcelFile(file)

In [12]:
xl.sheet_names

['PT_crude_H',
 'Monthwise 2019-20',
 'Monthwise 2018-19',
 'Monthwise 2017-18',
 'Monthwise 2016-17 ',
 'Monthwise 2015-16',
 'Monthwise 2014-15',
 'Monthwise 2013-14',
 'Monthwise 2012-13',
 'Monthwise 2011-12',
 'Monthwise 2010-11']

In [42]:
df = xl.parse(sheet_name='Monthwise 2015-16')
df.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Petroleum Planning & Analysis Cell,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,
4,2015-16 (Apr - March),,,,,,,,('000 Metric Tonnes),,,,,


In [43]:
# Let's use the row whose first cell contains 'OIL COMPANIES' as the column header.

# create a boolean array testing whether the first column contains 'OIL COMPANIES'
header = df.iloc[:,0].str.contains('OIL COMPANIES')
# replace nulls by False in the array
header = header.fillna(False)
header_index = df[header].index
cols = df.iloc[header_index].values.tolist()
print(cols[0])
# set the column names to the values of the header row
df.columns = cols[0]
# keep only the rows below the header row
df = df.iloc[header_index.values[0] + 1:,:]

['OIL COMPANIES ', 'APR', 'MAY', 'JUN', 'JULY', 'AUG', 'SEPT', 'OCT', 'NOV', 'DEC', 'JAN', 'FEB', 'MAR', 'TOTAL']


In [44]:
df

Unnamed: 0,OIL COMPANIES,APR,MAY,JUN,JULY,AUG,SEPT,OCT,NOV,DEC,JAN,FEB,MAR,TOTAL
7,Indian Oil Corporation Ltd.(IOCL),,,,,,,,,,,,,
8,"IOCL-KOYALI, GUJARAT",603.785,1204.39,1176.2,1256.23,1190.82,1065.66,1264.21,1218.07,1205.53,1188.98,1177.5,1268.55,13819.9
9,"IOCL-MATHURA, UTTAR PRADESH",675.394,699.131,732.791,757.257,703.472,695.8,775.503,735.969,759.98,770.797,739.98,814.207,8860.28
10,"IOCL-PANIPAT, HARYANA",1293.08,1366.43,1212.38,1191.24,1086.83,1268.99,1264.46,1300.59,1298.52,1357.2,1296.62,1346.02,15282.4
11,"IOCL-HALDIA, WEST BENGAL",647.111,662.709,667.002,663.362,667.202,491.446,534.321,696.871,726.618,722.924,640.623,656.282,7776.47
12,"IOCL-BARAUNI,BIHAR",474.419,527.142,541.215,533.943,565.999,526.941,559.226,578.696,602.386,561.59,506.437,566.759,6544.75
13,"IOCL-GUWAHATI,ASSAM",93.522,95.257,93.288,84.597,78.504,71.93,76.726,72.355,74.637,40.265,37.799,84.643,903.523
14,"IOCL-DIGBOI,ASSAM",29.174,55.185,51.361,49.06,53.402,44.598,48.387,42.595,42.035,43.004,44.029,59.003,561.833
15,"IOCL-BONGAIGAON,ASSAM",235.884,230.727,200.486,213.414,212.166,209.637,182.795,154.282,208.194,200.447,191.727,201.905,2441.66
16,"IOCL-PARADIP,ODISHA*",,,,,,,,,,809.9,503.17,503.603,1816.67


In [57]:
df.loc[df.iloc[:, 1:].isna().all(axis='columns')]

Unnamed: 0,OIL COMPANIES,APR,MAY,JUN,JULY,AUG,SEPT,OCT,NOV,DEC,JAN,FEB,MAR,TOTAL
7,Indian Oil Corporation Ltd.(IOCL),,,,,,,,,,,,,
18,,,,,,,,,,,,,,
19,Hindustan Petroleum Corporation Ltd.(HPCL),,,,,,,,,,,,,
24,,,,,,,,,,,,,,
25,Bharat Petroleum Corporation Ltd (BPCL),,,,,,,,,,,,,
31,,,,,,,,,,,,,,
32,Chennai Petroleum Corporation Ltd (CPCL),,,,,,,,,,,,,
36,,,,,,,,,,,,,,
37,Oil & Natural Gas Corporation Ltd.(ONGC),,,,,,,,,,,,,
41,,,,,,,,,,,,,,


In [97]:
df[(df.iloc[:, 0].str.contains('NAYARA')).fillna(False)].iloc[:, 0].values

array(['NAYARA ENERGY LTD.\nVADINAR, GUJARAT, (Formerly ESSAR OIL LTD.)'],
      dtype=object)

In [110]:
companies_to_replace = {
        'NAYARA ENERGY LTD.\nVADINAR, GUJARAT, (Formerly ESSAR OIL LTD.)': 'NEL-VADINAR,GUJARAT',
        'ESSAR OIL LTD.,VADINAR,GUJARAT': 'EOL-VADINAR,GUJARAT'
    }

# assign the result back: an in-place replace on a selection may not modify df itself
df.iloc[:, 0] = df.iloc[:, 0].replace(companies_to_replace)
df

Unnamed: 0,OIL COMPANIES,APR,MAY,JUN,JULY,AUG,SEP,OCT,NOV,DEC,JAN,FEB,MAR,TOTAL
8,Indian Oil Corporation Ltd.(IOCL),,,,,,,,,,,,,
9,"IOCL-KOYALI, GUJARAT",1214.96,1240.31,1233.47,1273.44,1228.66,1084.32,1134.24,1185.01,1129.92,1007.13,1028.21,1234.72,13994.4
10,"IOCL-MATHURA, UTTAR PRADESH",797.483,813.746,796.809,754.922,701.061,748.541,804.47,714.072,761.365,786.215,721.468,829.847,9230.0
11,"IOCL-PANIPAT, HARYANA",1316.01,1386.0,1356.6,1400.08,1048.49,1233.88,1301.8,1294.07,1328.77,1330.37,1247.57,1394.4,15638.0
12,"IOCL-HALDIA, WEST BENGAL",692.329,713.191,681.656,659.372,630.331,665.081,678.899,544.002,534.156,550.093,624.213,715.974,7689.3
13,"IOCL-BARAUNI,BIHAR",559.264,576.665,547.059,576.496,570.971,512.54,481.318,523.376,579.355,554.047,487.235,557.719,6526.05
14,"IOCL-GUWAHATI,ASSAM",75.424,72.838,67.746,85.65,66.32,68.189,72.223,72.169,71.244,65.413,67.884,78.481,863.581
15,"IOCL-DIGBOI,ASSAM",51.753,45.866,43.155,23.701,40.318,46.023,50.998,46.562,45.141,44.508,42.016,53.406,533.447
16,"IOCL-BONGAIGAON,ASSAM",211.582,214.646,199.569,214.852,202.454,192.799,212.071,212.765,219.568,206.369,188.788,210.328,2485.79
17,"IOCL-PARADIP,ODISHA",394.268,538.27,256.932,697.353,464.844,444.836,806.829,553.805,1012.7,926.1,929.088,1204.91,8229.93


In [120]:
[int(sheet.strip().split(' ')[1].split('-')[0]) for sheet in xl.sheet_names if "Monthwise" in sheet]

[2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010]
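
The split-based one-liner above works as long as every matching sheet is named exactly like `Monthwise 2017-18`. A slightly more defensive variant is sketched below. It is only a sketch: `fiscal_start_year` is a hypothetical helper, and the sample sheet names are assumptions modelled on the ones seen in this notebook.

```python
import re

# Matches names such as "Monthwise 2017-18" and captures the starting year.
SHEET_PATTERN = re.compile(r"Monthwise\s+(\d{4})-\d{2}")

def fiscal_start_year(sheet_name):
    """Return the first year of the season, or None if the sheet is not monthly."""
    match = SHEET_PATTERN.search(sheet_name)
    return int(match.group(1)) if match else None

# Assumed sample names, including one non-monthly sheet and stray whitespace.
sheet_names = ["PT_crude_H", "Monthwise 2019-20", " Monthwise 2018-19 "]
years = [y for y in map(fiscal_start_year, sheet_names) if y is not None]
print(years)
```

Sheets that do not match are skipped instead of raising an `IndexError` or `ValueError`.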

## Time to test the full scraper

Below we run a full load.

In [16]:
from scraper.jobs.in_gov_ppac.crudeoil_proc_job import CrudeOilProcJob

india = CrudeOilProcJob(full_load=True)
india.run(download=False)

2020-06-17 18:04:03,023 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Getting sources...
2020-06-17 18:04:03,025 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Parsing file URLs to download from https://www.ppac.gov.in/content/146_1_ProductionPetroleum.aspx
2020-06-17 18:04:03,031 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): www.ppac.gov.in:443


[autoreload of scraper.jobs.in_gov_ppac.crudeoil_proc_job failed: Traceback (most recent call last):
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 384, in superreload
    update_generic(old_obj, new_obj)
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 323, in update_generic
    update(a, b)
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 278, in update_class
    if old_obj == new_obj:
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\pandas\core\generic.py", line 1479, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.e

2020-06-17 18:04:07,009 - urllib3.connectionpool - DEBUG - https://www.ppac.gov.in:443 "GET /content/146_1_ProductionPetroleum.aspx HTTP/1.1" 200 33714
2020-06-17 18:04:07,274 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - Title found: Indigenous Crude Oil Production 
2020-06-17 18:04:07,277 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - Title found: Crude Processing
2020-06-17 18:04:07,281 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - List of 'a' elements: [<a href="/WriteReadData/userfiles/file/PT_CRUDE_22-5-2020.xls" target="_blank">Current  <img alt="View Document" height="15" src="/images/excel.png" width="15"/> 73  Kb</a>, <a href="/WriteReadData/userfiles/file/PT_crude_H_22-5-2020.xls" target="_blank">Historical  <img alt="View Document" height="15" src="/images/excel.png" width="15"/>550  Kb</a>]
2020-06-17 18:04:07,284 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Dictionary of file URLs: {'Current': 'https://www.ppac.gov.in/WriteReadData/userf

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(how='all', inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(axis='columns', how='all', inplace=True)


2020-06-17 18:04:09,583 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Number of processed rows in year 2018: 276
2020-06-17 18:04:09,586 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Processing Monthwise 2017-18 for monthly data for year 2017 source_code IN_GOV_PPAC_Historical
2020-06-17 18:04:09,600 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - Cleaning fiscal year. Number of rows before cleaning: 54
2020-06-17 18:04:09,605 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - columns = Index(['OIL COMPANIES', 'APR', 'MAY', 'JUN', 'JULY', 'AUG', 'SEP', 'OCT',
       'NOV', 'DEC', 'JAN', 'FEB', 'MAR', 'TOTAL'],
      dtype='object')
2020-06-17 18:04:09,615 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - end_index: 35
2020-06-17 18:04:09,623 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - before standardisation: 0                     Indian Oil Corporation Ltd.(IOCL)
1                                  IOCL-KOYALI, GUJARAT
2                        

In [14]:
from scraper.jobs.in_gov_ppac.crudeoil_proc_job import CrudeOilProcJob
import pandas as pd

xl = pd.ExcelFile(r"C:\Users\ROSA_L\PycharmProjects\scraper\filestore\IN_GOV_PPAC_Historical.xls")
india = CrudeOilProcJob(full_load=True)
india.process_sheet("IN_GOV_PPAC_Historical", xl, "PT_crude_H")

[autoreload of scraper.jobs.in_gov_ppac.crudeoil_proc_job failed: Traceback (most recent call last):
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 384, in superreload
    update_generic(old_obj, new_obj)
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 323, in update_generic
    update(a, b)
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\IPython\extensions\autoreload.py", line 278, in update_class
    if old_obj == new_obj:
  File "c:\users\rosa_l\pycharmprojects\scraper\venv\lib\site-packages\pandas\core\generic.py", line 1479, in __nonzero__
    f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a DataFrame is ambiguous. Use a.e

2020-06-17 17:52:43,891 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - INFO - Processing PT_crude_H for annual data history source_code IN_GOV_PPAC_Historical
2020-06-17 17:52:43,901 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - Cleaning fiscal year. Number of rows before cleaning: 58
2020-06-17 17:52:43,905 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - columns = Index(['OIL COMPANIES', '1998-99', '1999-00', '2000-01', '2001-02', '2002-03',
       '2003-04', '2004-05', '2005-06', '2006-07', '2007-08', '2008-09',
       '2009-10', '2010-11', '2011-12', '2012-13', '2013-14', '2014-15',
       '2015-16', '2016-17', '2017-18', '2018-19', '2019-20(P)'],
      dtype='object')
2020-06-17 17:52:43,912 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - end_index: 37
2020-06-17 17:52:43,918 - scraper.jobs.in_gov_ppac.crudeoil_proc_job - DEBUG - before standardisation: 0                    Indian Oil Corporation Ltd. (IOCL)
1                                  IOCL-KOYALI, GUJ

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(how='all', inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.dropna(axis='columns', how='all', inplace=True)


Unnamed: 0,entity,detail,period,value,area,flow,provider,source,product,frequency,unit,original
0,IN_GOV_PPAC-IOCL,"KOYALI, GUJARAT",1998,10935,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
1,IN_GOV_PPAC-IOCL,"MATHURA, UTTAR PRADESH",1998,8909,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
2,IN_GOV_PPAC-IOCL,"PANIPAT, HARYANA",1998,2208,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
3,IN_GOV_PPAC-IOCL,"HALDIA, WEST BENGAL",1998,4714,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
4,IN_GOV_PPAC-IOCL,"BARAUNI,BIHAR",1998,2204,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
...,...,...,...,...,...,...,...,...,...,...,...,...
350,IN_GOV_PPAC-NRL,"NUMALIGARH, ASSAM",2018,2900.39,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
351,IN_GOV_PPAC-CPCL,"MANALI, TAMILNADU",2018,10271.3,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
352,IN_GOV_PPAC-CPCL,"NARIMANAM,TAMILNADU",2018,423.367,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True
355,IN_GOV_PPAC-MRPL,"MANGALORE,KARNATAKA",2018,16231,INDIA,REFINOBS,IN_GOV_PPAC,IN_GOV_PPAC_Historical,CRUDEOIL,Annual,KT,True


In [6]:
df = xl.parse(sheet_name='Monthwise 2012-13')

end_index = df[df.iloc[:,0] == 'GRAND TOTAL'].index.values[0]
display(end_index)
df[:end_index]

46

Unnamed: 0,Petroleum Planning & Analysis Cell,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,,,,,,,,,,,,,,
1,2012-13 (April_March),,,,,,,,,,,,,
2,('000 Metric Tonnes),,,,,,,,,,,,,
3,Crude Oil Processed by Refineries,,,,,,,,,,,,,
4,OIL COMPANIES,APR,MAY,JUN,JULY,AUG,SEPT,OCT,NOV,DEC,JAN,FEB,MAR,TOTAL
5,Indian Oil Corporation Ltd.(IOCL),,,,,,,,,,,,,
6,"IOCL-KOYALI, GUJARAT",983.153,876.744,1149.48,1055.38,1194.97,1121.16,1285.65,1223.47,1227.88,1191.12,1091.69,754.248,13154.9
7,"IOCL-MATHURA, UTTAR PRADESH",763.218,766.52,683.344,700.063,732.937,665.129,691.043,730.372,703.402,700.005,653.306,771.548,8560.89
8,"IOCL-PANIPAT, HARYANA",1243.37,1395.73,1262.69,1196.88,1307.32,984.308,1148.44,1272.51,1299.99,1355.07,1268.66,1391.05,15126
9,"IOCL-HALDIA, WEST BENGAL",678.145,703.93,456.34,577.474,543.192,526.968,654.994,655.038,675.131,688.98,632.59,697.465,7490.25
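
Since a season runs from April of year N to March of year N + 1, each month column has to be mapped to the right calendar year before the wide table can become `period,value` rows. The sketch below shows one way to do it in plain Python. `fiscal_period` and `unpivot` are hypothetical helpers, not part of the scraper, and the `TOTAL` column is assumed to have been dropped beforehand.

```python
# Month labels as they appear in the sheets (JULY, and both SEP and SEPT occur).
MONTH_NUMBERS = {
    "APR": 4, "MAY": 5, "JUN": 6, "JULY": 7, "AUG": 8, "SEP": 9, "SEPT": 9,
    "OCT": 10, "NOV": 11, "DEC": 12, "JAN": 1, "FEB": 2, "MAR": 3,
}

def fiscal_period(start_year, month_label):
    """Return 'YYYY-MM' for a month column of the season starting in April of start_year."""
    month = MONTH_NUMBERS[month_label.strip().upper()]
    # April to December belong to the start year, January to March to the next one.
    year = start_year if month >= 4 else start_year + 1
    return f"{year}-{month:02d}"

def unpivot(start_year, location, monthly_values):
    """Turn one wide row into (location, period, value) records."""
    return [(location, fiscal_period(start_year, label), value)
            for label, value in monthly_values.items()]

# Two cells taken from the 'Monthwise 2012-13' row for IOCL-KOYALI shown above.
records = unpivot(2012, "IOCL-KOYALI, GUJARAT", {"APR": 983.153, "MAR": 754.248})
for record in records:
    print(record)
```

The April value lands in 2012-04 and the March value in 2013-03, which is the behaviour the season definition requires.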


In [17]:
a = dict()
a['test'] = a.get('test', []) + [1]
a['test'] = a.get('test', []) + [1]
a



{'test': [1, 1]}
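
The `a.get(key, []) + [value]` idiom above builds a new list on every call. As a possible alternative (not what the scraper is known to use), `collections.defaultdict` gives the same result while appending in place:

```python
from collections import defaultdict

# Missing keys start out as an empty list, so append works on first use.
a = defaultdict(list)
a['test'].append(1)
a['test'].append(1)
print(dict(a))
```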

In [25]:
df = pd.DataFrame(india.data)
df[df['value'].isnull()]

Unnamed: 0,entity,detail,period,value,area,flow,provider,source,product,frequency,unit,original


## Let's have a look at the annual data sheet

Let's see what's inside the annual data sheet.

In [8]:
from pathlib import Path
import pandas as pd

download_dir = Path('.') / 'filestore'
file = download_dir / 'IN_GOV_PPAC_Historical.xls'

xl = pd.ExcelFile(file)
df = xl.parse(sheet_name='PT_crude_H', na_values=[' -', '-'])
df

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,,,Petroleum Planning & Analysis Cell,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,Period : Since 1998 -99,('000 Metric Tonnes),,,,,,,,,...,,,,,,,,,,
7,Crude Oil Processed by Refineries,,,,,,,,,,...,,,,,,,,,,
8,OIL COMPANIES,1998-99,1999-00,2000-01,2001-02,2002-03,2003-04,2004-05,2005-06,2006-07,...,2010-11,2011-12,2012-13,2013-14,2014-15,2015-16,2016-17,2017-18,2018-19,2019-20(P)
9,Indian Oil Corporation Ltd. (IOCL),,,,,,,,,,...,,,,,,,,,,


Understood. We need to:

* define '-' and ' -' as null values
* standardise company names
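
Both points can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the scraper's actual cleaning code: `to_value`, `standardise` and `split_unit` are hypothetical helpers, and the choice to drop spaces around hyphens and commas is an assumption made for the example.

```python
import re

NULL_MARKERS = {"-"}

def to_value(cell):
    """Map the '-' and ' -' placeholders to None and leave other cells untouched."""
    if isinstance(cell, str) and cell.strip() in NULL_MARKERS:
        return None
    return cell

def standardise(name):
    """Upper-case a label and normalise its whitespace, hyphens and commas."""
    cleaned = re.sub(r"\s+", " ", name).strip().upper()  # newlines, double spaces
    cleaned = re.sub(r"\s*-\s*", "-", cleaned)           # "IOCL -PARADIP" -> "IOCL-PARADIP"
    cleaned = re.sub(r"\s*,\s*", ",", cleaned)           # "KOYALI, GUJARAT" -> "KOYALI,GUJARAT"
    return cleaned

def split_unit(name):
    """Split a refining-unit label into (company code, location) on the first hyphen."""
    company, _, location = standardise(name).partition("-")
    return company, location

print(to_value(" -"))
print(standardise("IOCL -PARADIP"))
print(split_unit("BPCL- BORL-BINA"))
```

Splitting on the first hyphen only keeps locations such as `BORL-BINA` intact.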


In [15]:
df.iloc[:, 0].str.upper().values

array([nan, nan, nan, nan, nan, nan, 'PERIOD : SINCE 1998 -99',
       'CRUDE OIL PROCESSED BY REFINERIES', 'OIL COMPANIES ',
       'INDIAN OIL CORPORATION LTD. (IOCL)', 'IOCL-KOYALI, GUJARAT',
       'IOCL-MATHURA, UTTAR PRADESH', 'IOCL-PANIPAT, HARYANA',
       'IOCL-HALDIA, WEST BENGAL', 'IOCL-BARAUNI,BIHAR',
       'IOCL-DIGBOI,ASSAM', 'IOCL-GUWAHATI,ASSAM',
       'IOCL-BONGAIGAON,ASSAM', 'IOCL -PARADIP', 'IOCL TOTAL', nan,
       'HINDUSTAN PETROLEUM CORPORATION LTD.(HPCL)',
       'HPCL-MUMBAI,MAHARASHTRA', 'HPCL-VISAKH,ANDHRA PRADESH',
       'HMEL-GGSR, BATHINDA, PUNJAB', 'HPCL-TOTAL', nan,
       'BHARAT PETROLEUM CORPORATION LTD (BPCL)',
       'BPCL-MUMBAI, MAHARASHTRA', 'BPCL-KOCHI,KERALA', 'BPCL- BORL-BINA',
       'BPCL-TOTAL', nan,
       'NUMALIGARH REFINERY LTD.(NRL)\nNUMALIGARH, ASSAM', nan,
       'CHENNAI PETROLEUM CORPORATION LTD (CPCL)',
       'CPCL-MANALI, TAMILNADU', 'CPCL-NARIMANAM,TAMILNADU', 'CPCL-TOTAL',
       nan, 'KOCHI REFINERY LTD.-KOCHI, KERALA', na

## Fix filename for 25-08-2020

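This month, the file name pattern has changed from PT_CRUDE_DD-MM-YYYY.xls to PT_CRUDE-DD-M-YYYY.xls: the separator after PT_CRUDE is now a hyphen instead of an underscore, so splitting the name on '_' no longer isolates the date.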

In [12]:
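import re
from datetime import datetime

filename = r"https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_CRUDE-20-8-2020.xls"

print(filename)

# old approach: split on '_' (no longer isolates the date with the new pattern)
date = filename.split('/')[-1] \
        .split('.')[0] \
        .split('_')[-1]

print(date)

# new approach: accept '-' or '_' as the separator after PT_CRUDE
# (no comma inside the character class, and the dot before 'xls' is escaped)
date_search = re.search(r'PT_CRUDE[-_](.*)\.xls', filename, re.IGNORECASE)
date = date_search.group(1)

timestamp = datetime.strptime(date, "%d-%m-%Y")

print(date)

print(timestamp)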

https://www.ppac.gov.in/WriteReadData/userfiles/file/PT_CRUDE-20-8-2020.xls
CRUDE-20-8-2020
20-8-2020
2020-08-20 00:00:00
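
The same idea can be wrapped in a small function that accepts both separators and also the historical file name (`PT_crude_H_22-5-2020.xls`, as seen in the scraper log earlier). This is a hedged sketch: `parse_file_date` is a hypothetical helper, not the function used in `crudeoil_proc_job`.

```python
import re
from datetime import datetime

# Optional "_H" for the historical file, then '-' or '_', then D-M-YYYY.
DATE_PATTERN = re.compile(
    r"PT_CRUDE(?:_H)?[-_](\d{1,2}-\d{1,2}-\d{4})\.xls$", re.IGNORECASE
)

def parse_file_date(url):
    """Return the date embedded in a PPAC file URL, or None if there is none."""
    match = DATE_PATTERN.search(url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%d-%m-%Y")

base = "https://www.ppac.gov.in/WriteReadData/userfiles/file/"
for name in ("PT_CRUDE-20-8-2020.xls", "PT_CRUDE_22-5-2020.xls",
             "PT_crude_H_22-5-2020.xls", "PT_crude.xls"):
    print(name, parse_file_date(base + name))
```

Returning `None` for undated names such as `PT_crude.xls` lets the caller decide whether a missing date is an error.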
