# HDFS


The transaction log of an online store is stored as several JSON files under the path `/data/transactions/`
in HDFS on the cluster.

File names are `site-transactions-01.json, ..., site-transactions-08.json`.

The file structure is as follows:
```json
{
    "transactions":
    [
        {
            "commitTimestamp": "2016-03-29 10:28:31",
            "customerId": 24,
            "trackId": 838680843675,
            "goods": 
            [
                {
                    "amount": 1, 
                    "pricePerUnit": 811.45, 
                    "vendorCode": "44"
                },
                {
                    "amount": 1, 
                    "pricePerUnit": 365.86, 
                    "vendorCode": "60"
                },
                ...
            ]
        },
        ...
    ]
}
```

You have to calculate how much the customer with `customerId=42` spent in total on the product with `vendorCode="104"`.

## Output

The resulting number (e.g. 123) as JSON with the following structure:
```json
{
    "q1": 123
}
```
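As a small illustration of the output format, the sketch below serializes a result with the standard `json` module. The value `123` is only the placeholder from the example above, not an actual answer.

```python
import json

answer = 123  # placeholder from the task statement, not a real result

# Wrap the number in the required structure and serialize it.
payload = json.dumps({"q1": answer}, indent=4)
print(payload)
```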


### Comment
It is guaranteed that the size of one file does not exceed the size of one HDFS block.

The correctness of the transaction log file is also guaranteed.

It is assumed that an item with the same `vendorCode` may appear several times in the list of items of the same transaction.
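To make this assumption concrete, here is a minimal sketch on in-memory data. The helper name `spent_on_vendor` and all sample values are made up for illustration. Every matching item contributes `amount * pricePerUnit`, so repeated vendor codes within one transaction are simply summed.

```python
def spent_on_vendor(transactions, customer_id, vendor_code):
    """Sum amount * pricePerUnit over matching items (hypothetical helper)."""
    total = 0.0
    for transaction in transactions:
        if transaction["customerId"] != customer_id:
            continue
        for item in transaction.get("goods", []):
            if item["vendorCode"] == vendor_code:
                total += item["amount"] * item["pricePerUnit"]
    return total


# Made-up sample: vendorCode "104" appears twice in the same transaction.
sample = [
    {
        "customerId": 42,
        "goods": [
            {"amount": 2, "pricePerUnit": 10.0, "vendorCode": "104"},
            {"amount": 1, "pricePerUnit": 5.0, "vendorCode": "104"},
            {"amount": 3, "pricePerUnit": 7.0, "vendorCode": "60"},
        ],
    },
    {
        "customerId": 24,
        "goods": [{"amount": 1, "pricePerUnit": 99.0, "vendorCode": "104"}],
    },
]

print(spent_on_vendor(sample, 42, "104"))  # 25.0
```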

Try to solve the problem using the HDFS Python API.

Before getting started, you need to set up your environment.
You need to create the file `~/.hdfscli.cfg` containing the following:
```
[global]
default.alias = default

[default.alias]
url = http://localhost:50070/
user = <YOUR_USER_LOGIN>
```
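If you prefer to generate this file from Python, the following sketch builds the same content with the standard `configparser` module. The function name `render_hdfscli_config` is hypothetical, and the user name is left as the placeholder from the listing above.

```python
import configparser
import io


def render_hdfscli_config(user, url="http://localhost:50070/"):
    """Return config text with a single alias named 'default' (hypothetical helper)."""
    config = configparser.ConfigParser()
    config["global"] = {"default.alias": "default"}
    config["default.alias"] = {"url": url, "user": user}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Replace the placeholder with your own login before saving the text.
text = render_hdfscli_config("<YOUR_USER_LOGIN>")
print(text)
```

The returned text can then be saved to `~/.hdfscli.cfg`, for example with `pathlib.Path.home().joinpath(".hdfscli.cfg").write_text(text)`.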

Now you can use the Python API.
```
>>> from hdfs import Config
>>> client = Config().get_client()
>>> client.list('/data')
['access_logs', ..., 'lsml' ... ]
```

More examples (reading, writing, downloading to and uploading from the local filesystem):
https://hdfscli.readthedocs.io/en/latest/quickstart.html#reading-and-writing-files

Getting a file's status:
```
>>> client.status('/data/wiki')
{'accessTime': 0, 'length': 0, ...}
```
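The status dictionary can also be used to check the guarantee that a file fits in one HDFS block. The sketch below assumes the dictionary carries the `type`, `length` and `blockSize` fields of a WebHDFS file status; the helper name `fits_in_one_block` and the sample values are made up.

```python
def fits_in_one_block(status):
    """True if `status` describes a regular file no longer than one block."""
    return status["type"] == "FILE" and status["length"] <= status["blockSize"]


# Made-up status dicts in the shape assumed for client.status(path).
small_file = {"type": "FILE", "length": 10_000_000, "blockSize": 134_217_728}
directory = {"type": "DIRECTORY", "length": 0, "blockSize": 0}

print(fits_in_one_block(small_file))  # True
print(fits_in_one_block(directory))   # False
```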

More libraries for Python: `hadoopy, pydoop, dumbo, mrjob`.

In [None]:
# HERE IS SOLUTION

In [7]:
from hdfs import Config
import json

client = Config().get_client()

client.list('/data')

['clickstream.csv', 'transactions']

In [21]:
transactions_list = client.list('/data/transactions/transactions')

In [23]:
transactions_list

['site-transactions-01.json',
 'site-transactions-02.json',
 'site-transactions-03.json',
 'site-transactions-04.json',
 'site-transactions-05.json',
 'site-transactions-06.json',
 'site-transactions-07.json',
 'site-transactions-08.json']

In [35]:
customer_id = 42
vendor_code = "104"
total = 0

for file in transactions_list:
    path = f'/data/transactions/transactions/{file}'

    # Each file fits in one HDFS block, so reading it whole is fine.
    with client.read(path) as reader:
        content = json.loads(reader.read())
        transactions = content.get('transactions', [])

        for transaction in transactions:
            # Only the target customer's transactions count.
            if transaction['customerId'] == customer_id:
                goods = transaction.get('goods', [])

                for item in goods:
                    # The same vendorCode may repeat within a transaction: sum every match.
                    if item['vendorCode'] == vendor_code:
                        total += item['amount'] * item['pricePerUnit']

print(total)

312931.3200000004
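
The trailing digits in this total come from accumulating binary floating-point values. One possible way to avoid them, shown here only as a sketch on made-up data, is to parse prices as `decimal.Decimal` through the `parse_float` argument of `json.loads`. The helper name `total_spent` is hypothetical.

```python
import json
from decimal import Decimal

# Made-up data in the same shape as the transaction files.
raw = """
{"transactions": [
    {"customerId": 42, "goods": [
        {"amount": 1, "pricePerUnit": 0.1, "vendorCode": "104"},
        {"amount": 1, "pricePerUnit": 0.2, "vendorCode": "104"}
    ]}
]}
"""


def total_spent(text, customer_id, vendor_code, parse_float=float):
    """Sum matching items; `parse_float` selects the number type for prices."""
    data = json.loads(text, parse_float=parse_float)
    total = parse_float("0")
    for transaction in data["transactions"]:
        if transaction["customerId"] != customer_id:
            continue
        for item in transaction.get("goods", []):
            if item["vendorCode"] == vendor_code:
                total += item["amount"] * item["pricePerUnit"]
    return total


print(total_spent(raw, 42, "104"))           # 0.30000000000000004
print(total_spent(raw, 42, "104", Decimal))  # 0.3
```

Note that `json.dumps` cannot serialize `Decimal` directly, so the exact total would need to be converted (for example with `float(...)`) before it is written out.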


In [37]:
res = {'q1': total}
result = json.dumps(res)
print(result)

{"q1": 312931.3200000004}


In [38]:
with open("result.json", "w") as f:
    f.write(result)