Use `parallelize` to create an RDD named `products` that contains the elements shown below.

In [3]:
products = sc.parallelize(["Apple", "Apple", "Cheese", "Apple", "Orange"])

Count the number of elements in `products`.

In [4]:
products.count()

5

Count the number of apples in `products`. Tip: use filter.

In [5]:
appleCount = products.filter(lambda x: x == "Apple")
appleCount.count()

3

Show the distinct products.

In [6]:
# distinct() removes duplicates inside Spark; sorted() makes the displayed order stable
sorted(products.distinct().collect())

['Apple', 'Cheese', 'Orange']

Load the contents of the file `babynames.csv` from the data folder into an RDD called `babyNames` with `textFile`. Show the first 5 lines.

In [7]:
babyNames = sc.textFile("babynames.csv")
babyNames.take(5)

['Year,First Name,County,Sex,Count',
 '2013,GAVIN,ST LAWRENCE,M,9',
 '2013,LEVI,ST LAWRENCE,M,9',
 '2013,LOGAN,NEW YORK,M,44',
 '2013,HUDSON,NEW YORK,M,49']

The first line in the file is a header. Filter it out to keep only the lines with actual data.

In [8]:
firstline = babyNames.first()
babyNamesNoHeader = babyNames.filter(lambda x: x != firstline)
babyNamesNoHeader.take(5)

['2013,GAVIN,ST LAWRENCE,M,9',
 '2013,LEVI,ST LAWRENCE,M,9',
 '2013,LOGAN,NEW YORK,M,44',
 '2013,HUDSON,NEW YORK,M,49',
 '2013,GABRIEL,NEW YORK,M,50']
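
The filter above compares every line with the header text, so it would also drop any data line that happens to be identical to the header. An index-based filter removes only the first line. The following is a minimal plain-Python sketch of the difference on a small hypothetical sample (no Spark needed); in PySpark, `babyNames.zipWithIndex().filter(lambda pair: pair[1] > 0).keys()` should express the same idea, although that variant was not run against this dataset.

```python
# Small hypothetical sample that mimics the first lines of the file.
lines = [
    "Year,First Name,County,Sex,Count",
    "2013,GAVIN,ST LAWRENCE,M,9",
    "2013,LEVI,ST LAWRENCE,M,9",
]

# Value-based filter, as in the cell above: drop every line equal to the header.
header = lines[0]
by_value = [line for line in lines if line != header]

# Index-based filter: drop only the first line, whatever it contains.
by_index = [line for index, line in enumerate(lines) if index > 0]

print(by_value == by_index)
```

For this file both approaches keep the same lines; they differ only if the header text reappears further down.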

Each element in this RDD is a line of text. Transform each element into a tuple or list of the 5 CSV columns by splitting the line on commas. Show the first 5. Tip: you need `map` and the `split` method of Python strings.

In [9]:
babySplit = babyNamesNoHeader.map(lambda x: x.split(','))
babySplit.take(5)

[['2013', 'GAVIN', 'ST LAWRENCE', 'M', '9'],
 ['2013', 'LEVI', 'ST LAWRENCE', 'M', '9'],
 ['2013', 'LOGAN', 'NEW YORK', 'M', '44'],
 ['2013', 'HUDSON', 'NEW YORK', 'M', '49'],
 ['2013', 'GABRIEL', 'NEW YORK', 'M', '50']]

Count how many rows in the RDD are for male babies.

In [12]:
maleBabies = babySplit.filter(lambda x: x[3] == 'M')
maleBabies.count()

70137
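
Note that `count()` returns a number of rows, not a number of babies. Assuming the last column (`Count` in the header) holds the number of babies for that row, the total would be the sum of that column over the male rows. The plain-Python sketch below shows the difference on the five rows displayed earlier; in PySpark, `maleBabies.map(lambda x: int(x[4])).sum()` should give the corresponding total, but that result is not part of the original notebook.

```python
# The five split rows shown in the earlier output.
rows = [
    ["2013", "GAVIN", "ST LAWRENCE", "M", "9"],
    ["2013", "LEVI", "ST LAWRENCE", "M", "9"],
    ["2013", "LOGAN", "NEW YORK", "M", "44"],
    ["2013", "HUDSON", "NEW YORK", "M", "49"],
    ["2013", "GABRIEL", "NEW YORK", "M", "50"],
]

male_rows = [row for row in rows if row[3] == "M"]

# Number of rows, which is what count() reports.
row_count = len(male_rows)

# Sum of the Count column (assumption: it is the number of babies per row).
baby_total = sum(int(row[4]) for row in male_rows)

print(row_count, baby_total)
```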

The next objective is to find the most frequently given baby name.

First, convert the RDD into a key-value structure. Since we need nothing but the name, convert every element into `(name, 1)`. Show the first 5.

In [15]:
nameValueBaby = babySplit.map(lambda x: (x[1], 1))
nameValueBaby.take(5)

[('GAVIN', 1), ('LEVI', 1), ('LOGAN', 1), ('HUDSON', 1), ('GABRIEL', 1)]

Now aggregate the elements that share a key and sum their values to get the number of occurrences per name. Show the result; the order may differ from the output below. Tip: use `reduceByKey`.

In [18]:
# aggregateByKey is used here; reduceByKey(lambda a, b: a + b) gives the same counts.
namesCount = nameValueBaby.aggregateByKey(
    0,                      # initial value of each per-key accumulator
    lambda r, v: r + v,     # adds one value to an accumulator
    lambda r1, r2: r1 + r2  # merges two accumulators
)
namesCount.collect()


[('GAVIN', 262),
 ('LEVI', 148),
 ('LOGAN', 386),
 ('HUDSON', 100),
 ('GABRIEL', 243),
 ('ELIZA', 59),
 ('MADELEINE', 51),
 ('ZARA', 40),
 ('DAISY', 53),
 ('JONATHAN', 190),
 ('JACKSON', 262),
 ('JUDY', 11),
 ('DAVID', 239),
 ('SEBASTIAN', 148),
 ('SAMUEL', 232),
 ('DEVORA', 24),
 ('JAYDEN', 273),
 ('MICHAEL', 315),
 ('MATTHEW', 294),
 ('CHARLES', 168),
 ('LUNA', 62),
 ('ADELE', 24),
 ('LIAM', 319),
 ('DYLAN', 279),
 ('DANIEL', 260),
 ('RYAN', 335),
 ('ETHAN', 327),
 ('WYATT', 181),
 ('SURI', 17),
 ('ZISSY', 20),
 ('YIDES', 21),
 ('WILLIAM', 286),
 ('ALEXANDER', 315),
 ('LENA', 51),
 ('CORA', 57),
 ('GIA', 64),
 ('MADELINE', 143),
 ('ANDREA', 77),
 ('TRINITY', 70),
 ('LEILANI', 54),
 ('HARMONY', 34),
 ('AMANDA', 88),
 ('RACHEL', 112),
 ('MARGOT', 18),
 ('NOA', 37),
 ('JESSICA', 97),
 ('ABBY', 29),
 ('JENNY', 24),
 ('MILANA', 19),
 ('ADDISON', 209),
 ('MACKENZIE', 141),
 ('ADRIANNA', 108),
 ('ATHENA', 47),
 ('HANNA', 42),
 ('ANIYAH', 59),
 ('CRYSTAL', 39),
 ('JULIET', 71),
 ('VALERIE', 
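
To make the three arguments of `aggregateByKey` concrete, here is a plain-Python model of what happens conceptually: each partition folds its values into a per-key accumulator, and the per-partition accumulators are then merged. This is only an illustrative sketch, not how Spark is implemented; `aggregate_by_key` and the two-partition sample are hypothetical.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Plain-Python model of aggregateByKey over a list of partitions."""
    # Step 1: inside each partition, fold values into a per-key accumulator.
    partials = []
    for partition in partitions:
        acc = {}
        for key, value in partition:
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)

    # Step 2: merge the per-partition accumulators key by key.
    merged = {}
    for acc in partials:
        for key, value in acc.items():
            merged[key] = comb_op(merged[key], value) if key in merged else value
    return merged

# Hypothetical (name, 1) pairs spread over two partitions.
partitions = [
    [("LOGAN", 1), ("GAVIN", 1), ("LOGAN", 1)],
    [("LOGAN", 1), ("LEVI", 1)],
]

counts = aggregate_by_key(
    partitions,
    0,                       # initial accumulator value
    lambda r, v: r + v,      # add one value to an accumulator
    lambda r1, r2: r1 + r2,  # merge two accumulators
)
print(counts)
```

When the accumulator has the same type as the values and both functions are the same operation, as with addition here, `reduceByKey` expresses the same computation with a single function.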

Now `map` the (name, frequency) pairs so that only the values remain, then use the `max` action to get the highest value.

In [24]:
namesCountValue = namesCount.map(lambda x: x[1])

# Use the RDD max action instead of collecting every value to the driver
maxValue = namesCountValue.max()
print(maxValue)

386


Now go back to the (name, frequency) pairs and keep the pair(s) whose frequency equals the max you found.

In [25]:
maxAmountNames = namesCount.filter(lambda x: x[1] == maxValue)
maxAmountNames.collect()

[('LOGAN', 386)]
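
The two-step approach (find the max, then filter) keeps every name that shares the highest frequency. If a single winner is enough, taking the max of the pairs by their second element does it in one step. The sketch below shows both variants in plain Python on four pairs from the output above; in PySpark, `namesCount.max(key=lambda x: x[1])` should be the one-step counterpart, returning one pair even when there are ties.

```python
# Four (name, frequency) pairs taken from the output above.
pairs = [("GAVIN", 262), ("LEVI", 148), ("LOGAN", 386), ("HUDSON", 100)]

# Two-step approach: highest value first, then every pair that matches it.
top_value = max(value for _, value in pairs)
top_pairs = [pair for pair in pairs if pair[1] == top_value]

# One-step approach: compare pairs by frequency and return a single winner.
top_single = max(pairs, key=lambda pair: pair[1])

print(top_pairs)
print(top_single)
```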