# Working with the purchases.txt file

## Load data

Load the data in `sales/salesdata.txt` into an RDD:

In [None]:
sales = sc.textFile('sales/salesdata.txt')

In [None]:
sales.take(5)

Remove every line that does not split into exactly 5 tab-separated fields:

In [None]:
salesclean = sales.filter(lambda line: len(line.split('\t')) == 5)
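The predicate used in the filter can be tried on plain Python strings before running it on the RDD. The following is a minimal sketch: the sample lines are made up for illustration, and the meaning of the first two fields (date, city) is an assumption, since the file's layout is not described here.

```python
# Hypothetical sample lines; the real contents of the file are not shown in this notebook.
sample_lines = [
    "2012-01-01\tFerrol\tElectronica\t120.50\tefectivo",  # 5 fields: kept
    "2012-01-01\tFerrol\tElectronica",                     # 3 fields: dropped
    "",                                                    # empty line: dropped
]

def has_five_fields(line):
    """Return True when the line splits into exactly 5 tab-separated fields."""
    return len(line.split('\t')) == 5

# Same logic as the RDD filter, applied to a plain list
kept = [line for line in sample_lines if has_five_fields(line)]
print(len(kept))
```

A named function like `has_five_fields` (a hypothetical helper) can be passed to `sales.filter` in place of the lambda.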

## Filter 'Ferrol' data

Filter the RDD, keeping only the lines that contain "Ferrol":

In [None]:
ferrol = salesclean.filter(lambda line: 'Ferrol' in line)
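Note that `'Ferrol' in line` matches the text anywhere in the line, not only in the city field. If that matters, a stricter comparison against a single field could look like the sketch below. The helper `field_equals`, the constant `CITY_INDEX`, and the sample lines are all hypothetical; in particular, which column holds the city is an assumption.

```python
def field_equals(line, index, value):
    """Return True when the tab-separated field at `index` equals `value`."""
    fields = line.split('\t')
    return index < len(fields) and fields[index] == value

CITY_INDEX = 1  # assumption: the notebook does not say which column holds the city

# Made-up lines for illustration only
lines = [
    "2012-01-01\tFerrol\tRopa\t10.00\ttarjeta",
    "2012-01-01\tVigo\tRopa Ferrol\t20.00\ttarjeta",  # 'Ferrol' appears outside the city field
]

substring_matches = [l for l in lines if 'Ferrol' in l]
field_matches = [l for l in lines if field_equals(l, CITY_INDEX, 'Ferrol')]
print(len(substring_matches), len(field_matches))
```

With the RDD, the same predicate could be used as `salesclean.filter(lambda line: field_equals(line, CITY_INDEX, 'Ferrol'))`.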

## Count the number of sales in Ferrol

In [None]:
ferrol.count()

## Find the maximum sale

Extract the column with the cost strings:

In [None]:
costs = salesclean.map(lambda line: line.split('\t')[3])

And now we can convert them to floats:

In [None]:
# float can be passed directly; no lambda wrapper is needed
fcosts = costs.map(float)
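`float` raises `ValueError` on a string that is not numeric, so a single malformed cost would make the job fail when an action runs. A minimal sketch of a more tolerant conversion, assuming that skipping bad values is acceptable for this analysis (`parse_cost` and the sample values are hypothetical):

```python
def parse_cost(text):
    """Convert a cost string to float, returning None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None

# Made-up cost strings, including two that cannot be converted
raw_costs = ["120.50", "9.99", "n/a", ""]
parsed = [parse_cost(c) for c in raw_costs]
valid = [c for c in parsed if c is not None]
print(valid)
```

On the RDD this would correspond to `costs.map(parse_cost).filter(lambda c: c is not None)`.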

Finally we can calculate the maximum sale:

In [None]:
fcosts.reduce(lambda x, y: x if x>y else y)

Or directly with the **max** function:

In [None]:
fcosts.max()
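The two-argument lambda passed to `reduce` keeps the larger of each pair, so folding it over all values gives the same result as `max`. The sketch below shows this on a plain Python list with `functools.reduce`; the sample values are made up.

```python
from functools import reduce

# Made-up costs for illustration
sample_costs = [12.5, 300.0, 7.25, 300.0, 45.0]

# Pairwise "keep the larger one", folded over the whole list
largest_by_reduce = reduce(lambda x, y: x if x > y else y, sample_costs)
largest_builtin = max(sample_costs)

print(largest_by_reduce == largest_builtin)
```

The function given to an RDD's `reduce` should be commutative and associative, because partial results from different partitions are combined in no guaranteed order. Taking the larger of two values satisfies both properties.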

## Find the minimum sale

In [None]:
# Same reduce pattern as for the maximum, with the comparison flipped
fcosts.reduce(lambda x, y: x if x < y else y)

In [None]:
# Or directly with the min action
fcosts.min()

## Sum the total sales in Electronics paid with Cash

In [None]:
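# Extract the category column (third field)
categorias = salesclean.map(lambda line: line.split('\t')[2])

In [None]:
# List the distinct categories to find the exact spelling used for Electronics
categorias.distinct().collect()

In [None]:
# List the distinct payment methods (fifth field) to find the labels used for cash
salesclean.map(lambda line: line.split('\t')[4]).distinct().collect()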

Option 1) Filter Electronics first, then filter Cash:

In [None]:
selectronics = salesclean.filter(lambda line: 'Electronica' in line)

In [None]:
scash = selectronics.filter(lambda line: 'efectivo' in line or 'metalico' in line or 'cash' in line)

In [None]:
# Use a new name so the all-sales fcosts RDD defined earlier is not overwritten
ecash_costs = scash.map(lambda line: line.split('\t')[3]).map(lambda x: float(x))

In [None]:
ecash_costs.reduce(lambda x, y: x+y)

Option 2) Combine both filters and all the steps in a single cell:

In [None]:
salesclean.filter(lambda line: 'Electronica' in line and ('efectivo' in line or 'metalico' in line or 'cash' in line)) \
    .map(lambda line: line.split('\t')[3]) \
    .map(lambda x: float(x)) \
    .reduce(lambda x, y: x+y)
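Both options above test for substrings anywhere in the line and split each line again to get the cost. A possible variation splits each line once and compares the category and payment fields exactly. This is only a sketch: `electronics_cash_cost`, `CASH_LABELS`, and the sample lines are hypothetical, and it assumes the category field is exactly `Electronica` and the payment field is exactly one of the three cash labels.

```python
CASH_LABELS = {'efectivo', 'metalico', 'cash'}  # assumption: exact payment labels

def electronics_cash_cost(line):
    """Return the cost as a float for an Electronica sale paid in cash, else None."""
    fields = line.split('\t')
    if len(fields) != 5:
        return None
    category, cost, payment = fields[2], fields[3], fields[4]
    if category == 'Electronica' and payment in CASH_LABELS:
        return float(cost)
    return None

# Made-up lines for illustration only
lines = [
    "2012-01-01\tFerrol\tElectronica\t100.0\tefectivo",
    "2012-01-01\tVigo\tElectronica\t50.5\tcash",
    "2012-01-02\tVigo\tElectronica\t70.0\ttarjeta",  # not cash
    "2012-01-02\tLugo\tRopa\t20.0\tmetalico",        # not Electronica
]

total = sum(c for c in map(electronics_cash_cost, lines) if c is not None)
print(total)
```

On the RDD, the equivalent chain would be `salesclean.map(electronics_cash_cost).filter(lambda c: c is not None).sum()`.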