# Working with the purchases.txt file

## Load data

Load the data in `???/purchases.txt` into an RDD:

In [5]:
purchases = sc.textFile("purchases.txt")

In [6]:
purchases.toDebugString()

'(2) purchases.txt MapPartitionsRDD[9] at textFile at NativeMethodAccessorImpl.java:0 []\n |  purchases.txt HadoopRDD[8] at textFile at NativeMethodAccessorImpl.java:0 []'

In [7]:
purchases.takeSample(withReplacement=True, num=5)

[u'2012-08-18\t11:08\tGarland\tPet Supplies\t55.42\tMasterCard',
 u'2012-10-10\t14:58\tNorth Las Vegas\tPet Supplies\t242.58\tAmex',
 u'2012-02-09\t09:23\tBoise\tComputers\t379.82\tCash',
 u'2012-11-14\t10:29\tFremont\tDVDs\t23.6\tCash',
 u'2012-02-19\t13:27\tChesapeake\tHealth and Beauty\t179.35\tDiscover']
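
Each line in the sample above has six tab-separated fields. A minimal sketch of turning one line into a named record, assuming the fields are date, time, store, item category, cost, and payment method. The field names are guesses from the sample output, not something the file states, and `Purchase` and `parse_purchase` are hypothetical helpers:

```python
from collections import namedtuple

# Field names are assumptions inferred from the sample lines.
Purchase = namedtuple("Purchase", ["date", "time", "store", "item", "cost", "payment"])

def parse_purchase(line):
    """Split one tab-separated line into a Purchase with a numeric cost."""
    date, time, store, item, cost, payment = line.split("\t")
    return Purchase(date, time, store, item, float(cost), payment)

sample = "2012-08-18\t11:08\tGarland\tPet Supplies\t55.42\tMasterCard"
record = parse_purchase(sample)
print(record.store, record.cost)
```

With Spark, the same function could be applied to every line as `purchases.map(parse_purchase)`.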

## Filter 'San Jose' data

Keep only the lines of the RDD that contain "San Jose":

In [9]:
purchases_sanjose = purchases.filter(lambda line: 'San Jose' in line)

In [10]:
purchases_sanjose.take(5)

[u"2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex",
 u"2012-01-01\t09:00\tSan Jose\tWomen's Clothing\t215.82\tCash",
 u'2012-01-01\t09:09\tSan Jose\tToys\t337.71\tCash',
 u'2012-01-01\t09:17\tSan Jose\tGarden\t192.82\tCash',
 u'2012-01-01\t09:19\tSan Jose\tCameras\t95.81\tCash']
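
The filter above matches "San Jose" anywhere in the line. A stricter sketch compares only the store field, assuming it is the third tab-separated column as in the output above (`is_store` is a hypothetical helper):

```python
def is_store(line, store):
    """Return True when the third tab-separated field equals `store` exactly."""
    fields = line.split("\t")
    return len(fields) > 2 and fields[2] == store

lines = [
    "2012-01-01\t09:00\tSan Jose\tMen's Clothing\t214.05\tAmex",
    "2012-02-09\t09:23\tBoise\tComputers\t379.82\tCash",
]
san_jose_lines = [line for line in lines if is_store(line, "San Jose")]
print(len(san_jose_lines))
```

On the RDD this could be used as `purchases.filter(lambda line: is_store(line, "San Jose"))`.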

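## Count the number of purchases in San Jose

Note that the cell below calls `count()` on `purchases`, not on `purchases_sanjose`, so the result is the total number of purchases in the file. Counting only the San Jose purchases would be `purchases_sanjose.count()`.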

In [14]:
num_purchases = purchases.count()
num_purchases

4138476

## Find the maximum cost

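The maximum can be computed in a single expression that splits each line on tabs, converts the fifth field (the cost) to a float, and reduces with `max`: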

In [16]:
max_purchase = purchases.map(lambda line: float(line.split("\t")[4])).reduce(lambda x, y: max(x, y))

In [37]:
# Step by step: first extract the column with the cost strings
cost_strings = purchases.map(lambda line: line.split("\t")[4])

In [38]:
cost_strings

PythonRDD[28] at RDD at PythonRDD.scala:53

And now we can convert them to floats:

In [39]:
costs = cost_strings.map(lambda t: float(t))

Finally, we can calculate the maximum cost:

In [40]:
most_expensive = costs.reduce(lambda x, y: x if x > y else y)

Or directly with the RDD's `max` method:

In [41]:
most_expensive_max = costs.max()
most_expensive_max

499.99
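
The conversion above raises an exception on any line that lacks a fifth field or whose cost is not numeric. A hedged sketch of a more tolerant parser, which returns `None` for such lines so they can be filtered out before reducing. `safe_cost` is a hypothetical helper, the two bad lines are invented for illustration, and whether the real file contains malformed lines is not known:

```python
def safe_cost(line):
    """Return the cost in the fifth tab-separated field, or None if unusable."""
    fields = line.split("\t")
    try:
        return float(fields[4])
    except (IndexError, ValueError):
        return None

lines = [
    "2012-11-14\t10:29\tFremont\tDVDs\t23.6\tCash",
    "truncated line",                                   # invented: too few fields
    "2012-02-09\t09:23\tBoise\tComputers\tn/a\tCash",   # invented: non-numeric cost
]
valid_costs = [c for c in map(safe_cost, lines) if c is not None]
print(max(valid_costs))
```

On the RDD, the same idea would read `purchases.map(safe_cost).filter(lambda c: c is not None).max()`.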

## Find the minimum cost

In [42]:
least_expensive = costs.reduce(lambda t1, t2: t1 if t1 < t2 else t2)

In [45]:
least_expensive

0.0
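
Computing the minimum and the maximum with separate actions runs a job over the data for each one. A sketch of getting both in one pass by reducing over `(min, max)` pairs, shown here on a local list with `functools.reduce`. The cost values are taken from the sample output earlier, and `merge_bounds` is a hypothetical helper:

```python
from functools import reduce

def merge_bounds(a, b):
    """Combine two (min, max) pairs into one pair covering both."""
    return (min(a[0], b[0]), max(a[1], b[1]))

local_costs = [55.42, 242.58, 379.82, 23.6, 179.35]
# Turn each cost into a (min, max) pair, then merge the pairs.
lowest, highest = reduce(merge_bounds, [(c, c) for c in local_costs])
print(lowest, highest)
```

On the RDD, the equivalent would be `costs.map(lambda c: (c, c)).reduce(merge_bounds)`. The merge function is associative and commutative, which `reduce` on an RDD requires.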