<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Spark RDD Practice

*Authors: Adapted and modified from Dave Yerrington (SF) by Christoph Rahmede (LDN)*

Basic RDDs: A Quick Reference
=============================

Common RDD Constructors
-----------------------

Expression                               |Meaning
----------                               |-------
`sc.parallelize(iterable)`               |Create RDD of elements of some iterable
`sc.textFile(path)`                      |Create RDD of lines from file

Common Transformations
----------------------

Expression                               |Meaning
----------                               |-------
`filter(predicate)`                      |Keep only the elements for which the predicate returns True
`map(some function)`                     |Apply some function to every element
`flatMap(some function)`                 |Apply some function that returns an iterable to every element and flatten the results
`sample(withReplacement, fraction)`      |Randomly sample roughly the given fraction of the data
`distinct()`                             |Remove duplicates in RDD
`sortBy(key function, ascending=True)`   |Sort elements by the key that the function returns, in the designated order
`randomSplit([weight1, weight2], seed)`  |Randomly split the data into one RDD per weight
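
The difference between `map` and `flatMap` is easy to miss in a table. Here is a minimal plain-Python sketch of the two behaviours; it uses ordinary lists instead of RDDs, and the sample strings are made up for illustration.

```python
from itertools import chain

lines = ["a,b", "c"]  # hypothetical input lines

# map: exactly one output element per input element
mapped = [line.split(",") for line in lines]

# flatMap: each input yields an iterable, and the results are flattened
flat_mapped = list(chain.from_iterable(line.split(",") for line in lines))

print(mapped)       # [['a', 'b'], ['c']]
print(flat_mapped)  # ['a', 'b', 'c']
```
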

Common Key Pair RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`groupByKey()`                           |Collapse a key value RDD by the key, keeping the values in an iterable
`reduceByKey(some function)`             |Collapse a key value RDD by the key, combining the values with some function
`mapValues(some function)`               |Apply some function to the values of some key value RDD
`flatMapValues(some function)`           |Apply some function that returns an iterable to each value, and emit one key value pair per returned element
`keys()`                                 |Returns the keys of a key value RDD
`values()`                               |Returns the values of a key value RDD

Common Multiple RDD Transformations
----------------------------------

Expression                               |Meaning
----------                               |-------
`union(another rdd)`                     |Append another RDD to current RDD (duplicates are kept)
`join(another rdd)`                      |Inner join: keep only the keys present in both RDDs
`leftOuterJoin(another rdd)`             |Keep every key of the current RDD; values missing from the other RDD become None
`rightOuterJoin(another rdd)`            |Keep every key of the other RDD; values missing from the current RDD become None
`zip(another rdd)`                       |Pair up the elements of two RDDs by position to form a key value pair RDD
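
To make the join variants concrete, the following plain-Python sketch imitates `join` and `leftOuterJoin` on small lists of key value pairs. The helper names and the sample data are hypothetical, and a real RDD makes no promise about the order of the results.

```python
def inner_join(left, right):
    """Pairs (k, (v, w)) for every key present in both lists."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]

def left_outer_join(left, right):
    """Like inner_join, but keeps every left key; a missing right value is None."""
    right_keys = {k for k, _ in right}
    unmatched = [(k, (v, None)) for k, v in left if k not in right_keys]
    return inner_join(left, right) + unmatched

plans = [("KS", "basic"), ("OH", "premium")]  # hypothetical data
calls = [("KS", 110), ("NJ", 114)]            # hypothetical data

print(inner_join(plans, calls))       # [('KS', ('basic', 110))]
print(left_outer_join(plans, calls))  # [('KS', ('basic', 110)), ('OH', ('premium', None))]
```

A right outer join is the mirror image: every key of `calls` would survive, and `NJ` would be paired with `None` on the left.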

Common Actions
--------------

Expression                             |Meaning
----------                             |-------
`collect()`                            |Convert RDD to in-memory list
`take(n)`                              |First n elements of RDD
`top(n)`                               |The n largest elements of RDD, in descending order
`takeSample(withReplacement, n)`       |Random sample of n elements, with or without replacement
`sum()`                                |Find element sum (assumes numeric elements)
`mean()`                               |Find element mean (assumes numeric elements)
`stdev()`                              |Find element standard deviation (assumes numeric elements)
`takeOrdered(n, key function)`         |The n smallest elements, in ascending order of the value returned by the key function
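
`top` and `takeOrdered` are easy to confuse. As a rough analogy in plain Python (not Spark itself), `heapq.nlargest` behaves like `top` and `heapq.nsmallest` like `takeOrdered`; the numbers below are made up.

```python
import heapq

values = [10, 4, 2, 12, 3]  # hypothetical elements

# Like top(3): the largest elements, in descending order
top3 = heapq.nlargest(3, values)

# Like takeOrdered(3): the smallest elements, in ascending order
ordered3 = heapq.nsmallest(3, values)

# Like takeOrdered(3, lambda x: -x): negating the key reverses the order
ordered3_desc = heapq.nsmallest(3, values, key=lambda x: -x)

print(top3)           # [12, 10, 4]
print(ordered3)       # [2, 3, 4]
print(ordered3_desc)  # [12, 10, 4]
```
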

### Step 1: Import PySpark

In [1]:
import pyspark as ps    # for the pyspark suite
import warnings         # for displaying warning
from pyspark.sql import SQLContext

### Step 2: Initialize a Spark context (RDD manager)

In [2]:
sc = ps.SparkContext('local[*]')

In [3]:
sc

### Step 3: Construct an RDD with the data (we will be using churn.csv)

In [4]:
churn_rdd = sc.textFile('data/churn.csv')

In [5]:
churn_rdd.first()

"State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?"

In [6]:
churn_rdd.take(5)

["State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?",
 'KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.',
 'OH,107,415,371-7191,no,yes,26,161.600000,123,27.470000,195.500000,103,16.620000,254.400000,103,11.450000,13.700000,3,3.700000,1,False.',
 'NJ,137,415,358-1921,no,no,0,243.400000,114,41.380000,121.200000,110,10.300000,162.600000,104,7.320000,12.200000,5,3.290000,0,False.',
 'OH,84,408,375-9999,yes,no,0,299.400000,71,50.900000,61.900000,88,5.260000,196.900000,89,8.860000,6.600000,7,1.780000,2,False.']

### Step 4: Let's look at the first two lines to understand the format that textFile creates

In [7]:
churn_rdd.take(2)

["State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?",
 'KS,128,415,382-4657,no,yes,25,265.100000,110,45.070000,197.400000,99,16.780000,244.700000,91,11.010000,10.000000,3,2.700000,1,False.']

### Step 5: We need to split the data by commas.

In [8]:
churn_rdd = churn_rdd.map(lambda x: x.split(','))

In [9]:
churn_rdd.take(2)

[['State',
  'Account Length',
  'Area Code',
  'Phone',
  "Int'l Plan",
  'VMail Plan',
  'VMail Message',
  'Day Mins',
  'Day Calls',
  'Day Charge',
  'Eve Mins',
  'Eve Calls',
  'Eve Charge',
  'Night Mins',
  'Night Calls',
  'Night Charge',
  'Intl Mins',
  'Intl Calls',
  'Intl Charge',
  'CustServ Calls',
  'Churn?'],
 ['KS',
  '128',
  '415',
  '382-4657',
  'no',
  'yes',
  '25',
  '265.100000',
  '110',
  '45.070000',
  '197.400000',
  '99',
  '16.780000',
  '244.700000',
  '91',
  '11.010000',
  '10.000000',
  '3',
  '2.700000',
  '1',
  'False.']]
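
Splitting on every comma works here because the rows shown contain no quoted fields. If a file did contain a quoted field with a comma inside it, `split(',')` would cut that field in two. The sketch below uses Python's standard `csv` module on a made-up line to show the difference; a parser like `parse_line` (a hypothetical helper name) could be passed to `map` in place of the lambda.

```python
import csv

def parse_line(line):
    """Parse one CSV line, respecting quoted fields."""
    return next(csv.reader([line]))

line = 'KS,128,"Smith, John",no'  # hypothetical row with a quoted comma

naive = line.split(',')
parsed = parse_line(line)

print(naive)   # ['KS', '128', '"Smith', ' John"', 'no']
print(parsed)  # ['KS', '128', 'Smith, John', 'no']
```
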

### Step 6: Extract the headers

In [10]:
headers = churn_rdd.first() # this is a list for reference

In [11]:
headers

['State',
 'Account Length',
 'Area Code',
 'Phone',
 "Int'l Plan",
 'VMail Plan',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls',
 'Churn?']

### Step 7: Remove the header from the data

In [12]:
churn_rdd = churn_rdd.filter(lambda x: x != headers)

In [13]:
churn_rdd.first()

['KS',
 '128',
 '415',
 '382-4657',
 'no',
 'yes',
 '25',
 '265.100000',
 '110',
 '45.070000',
 '197.400000',
 '99',
 '16.780000',
 '244.700000',
 '91',
 '11.010000',
 '10.000000',
 '3',
 '2.700000',
 '1',
 'False.']

### Step 8: Finding total churn

In [14]:
(
    churn_rdd.map(lambda x: x[-1] != 'False.') 
             .sum()
)

483

In [15]:
(
    churn_rdd.filter(lambda x: x[-1] != 'False.')
             .count()
)

483
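
Both cells give the same answer because Python treats `True` as 1 and `False` as 0 when adding, so summing the flags is the same as counting the rows that pass the filter. A small plain-Python check on made-up labels:

```python
labels = ['False.', 'True.', 'False.', 'True.', 'True.']  # hypothetical last column

# Approach 1: map to booleans, then sum (True counts as 1)
total_via_sum = sum(label != 'False.' for label in labels)

# Approach 2: filter, then count
total_via_count = len([label for label in labels if label != 'False.'])

print(total_via_sum, total_via_count)  # 3 3
```
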

### Step 9: Finding total churn per State

In [16]:
(
    churn_rdd.map(lambda x: (x[0], x[-1] != 'False.'))
             .reduceByKey(lambda x, y: x + y)
             .take(5)
)

[('KS', 13), ('OH', 10), ('MO', 7), ('LA', 4), ('WV', 10)]

In [17]:
(
    churn_rdd.map(lambda x: (x[0], x[-1] != 'False.'))
             .reduceByKey(lambda x, y: x + y)
             .sortBy(lambda x: x[1], ascending=False)
             .take(5)
)

[('TX', 18), ('NJ', 18), ('MD', 17), ('MI', 16), ('NY', 15)]
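
To see what `reduceByKey` is doing with the boolean values, here is a plain-Python stand-in. `reduce_by_key` is a hypothetical helper, not part of PySpark, and the records are made up. It also shows a subtlety: a key with only one record never passes through the function, so its value stays a boolean unless it is converted to an integer first. `reduceByKey` should behave the same way for such a key, since there is nothing to combine the value with.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: fold the values of each key with func."""
    result = {}
    for key, value in pairs:
        # The first value seen for a key is stored untouched; func only runs
        # once a second value for that key arrives.
        result[key] = func(result[key], value) if key in result else value
    return result

# Hypothetical (state, churned?) records
records = [('KS', True), ('OH', False), ('KS', True), ('OH', True), ('NJ', False)]

as_bools = reduce_by_key(records, lambda x, y: x + y)
as_ints = reduce_by_key([(k, int(v)) for k, v in records], lambda x, y: x + y)

print(as_bools)  # {'KS': 2, 'OH': 1, 'NJ': False}
print(as_ints)   # {'KS': 2, 'OH': 1, 'NJ': 0}
```
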

### Step 10: Finding the average number of Customer Service Calls among churned customers for each State

In [18]:
service_calls = (churn_rdd.filter(lambda x: ((x[0]=='SC') and (x[-1]=='True.')))
                .map(lambda x: int(x[-2])))

service_calls.mean()

2.0714285714285716

In [19]:
(
    churn_rdd.filter(lambda x: x[-1]=='True.')
             .map(lambda x: (x[0], (int(x[-2]), 1)))
             .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
             .take(20)
)

[('NY', (40, 15)),
 ('CO', (23, 9)),
 ('WY', (17, 9)),
 ('TX', (30, 18)),
 ('DC', (9, 5)),
 ('LA', (9, 4)),
 ('ME', (26, 13)),
 ('OH', (10, 10)),
 ('AK', (10, 3)),
 ('VA', (8, 5)),
 ('KS', (22, 13)),
 ('SD', (12, 8)),
 ('CT', (17, 12)),
 ('RI', (12, 6)),
 ('CA', (16, 9)),
 ('MO', (19, 7)),
 ('WI', (13, 7)),
 ('DE', (13, 9)),
 ('WV', (28, 10)),
 ('SC', (29, 14))]
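
Why carry a (sum, count) pair instead of averaging directly inside `reduceByKey`? The reducing function is applied pairwise, and a pairwise average is not the overall average. The plain-Python sketch below, on made-up call counts, compares the two.

```python
from functools import reduce

calls = [1, 4, 4]  # hypothetical customer service call counts for one state

# Pairwise averaging: ((1 + 4) / 2 + 4) / 2, which is not the true mean
wrong = reduce(lambda x, y: (x + y) / 2, calls)

# Carry (sum, count) pairs, combine them, and divide once at the end
pairs = [(c, 1) for c in calls]
total, count = reduce(lambda x, y: (x[0] + y[0], x[1] + y[1]), pairs)
right = total / count

print(wrong)  # 3.25
print(right)  # 3.0
```
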

In [20]:
(
    churn_rdd.filter(lambda x: x[-1]=='True.')
             .map(lambda x: (x[0], (int(x[-2]), 1)))
             .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
             .mapValues(lambda x: x[0] / x[1])
             .take(20)
)

[('NY', 2.6666666666666665),
 ('CO', 2.5555555555555554),
 ('WY', 1.8888888888888888),
 ('TX', 1.6666666666666667),
 ('DC', 1.8),
 ('LA', 2.25),
 ('ME', 2.0),
 ('OH', 1.0),
 ('AK', 3.3333333333333335),
 ('VA', 1.6),
 ('KS', 1.6923076923076923),
 ('SD', 1.5),
 ('CT', 1.4166666666666667),
 ('RI', 2.0),
 ('CA', 1.7777777777777777),
 ('MO', 2.7142857142857144),
 ('WI', 1.8571428571428572),
 ('DE', 1.4444444444444444),
 ('WV', 2.8),
 ('SC', 2.0714285714285716)]

In [21]:
(
    churn_rdd.filter(lambda x: x[-1]=='True.')
             .map(lambda x: (x[0], (int(x[-2]), 1)))
             .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
             .mapValues(lambda x: x[0] / x[1]) 
             .sortBy(lambda x: x[1], ascending=False)
             .take(20)
)

[('AR', 3.8181818181818183),
 ('AK', 3.3333333333333335),
 ('IN', 3.111111111111111),
 ('TN', 3.0),
 ('HI', 3.0),
 ('ID', 3.0),
 ('AL', 3.0),
 ('KY', 3.0),
 ('IA', 3.0),
 ('NH', 3.0),
 ('VT', 2.875),
 ('WV', 2.8),
 ('MN', 2.7333333333333334),
 ('MO', 2.7142857142857144),
 ('NY', 2.6666666666666665),
 ('UT', 2.6),
 ('MT', 2.5714285714285716),
 ('CO', 2.5555555555555554),
 ('NM', 2.5),
 ('MI', 2.4375)]

In [22]:
def mean_func(iterator):
    total, count = 0, 0
    for x in iterator:
        total += x
        count += 1
    return total / count

(churn_rdd.filter(lambda x: x[-1]=='True.')
         .map(lambda x: (x[0], int(x[-2])))
         .groupByKey()
         .map(lambda x: (x[0], mean_func(x[1])))
         .sortBy(lambda x: x[1], ascending=False)
         .collect())

[('AR', 3.8181818181818183),
 ('AK', 3.3333333333333335),
 ('IN', 3.111111111111111),
 ('TN', 3.0),
 ('HI', 3.0),
 ('ID', 3.0),
 ('AL', 3.0),
 ('KY', 3.0),
 ('IA', 3.0),
 ('NH', 3.0),
 ('VT', 2.875),
 ('WV', 2.8),
 ('MN', 2.7333333333333334),
 ('MO', 2.7142857142857144),
 ('NY', 2.6666666666666665),
 ('UT', 2.6),
 ('MT', 2.5714285714285716),
 ('CO', 2.5555555555555554),
 ('NM', 2.5),
 ('MI', 2.4375),
 ('OR', 2.3636363636363638),
 ('MS', 2.2857142857142856),
 ('NC', 2.272727272727273),
 ('LA', 2.25),
 ('AZ', 2.25),
 ('GA', 2.25),
 ('MD', 2.235294117647059),
 ('NJ', 2.1666666666666665),
 ('NV', 2.142857142857143),
 ('FL', 2.125),
 ('SC', 2.0714285714285716),
 ('WA', 2.0714285714285716),
 ('ME', 2.0),
 ('RI', 2.0),
 ('WY', 1.8888888888888888),
 ('PA', 1.875),
 ('WI', 1.8571428571428572),
 ('DC', 1.8),
 ('CA', 1.7777777777777777),
 ('KS', 1.6923076923076923),
 ('TX', 1.6666666666666667),
 ('OK', 1.6666666666666667),
 ('VA', 1.6),
 ('SD', 1.5),
 ('ND', 1.5),
 ('DE', 1.4444444444444444),
 ('

### Step 11: What's the min, mean, and max night charge for users that churned?

In [23]:
headers[-1]

'Churn?'

In [24]:
headers.index('Night Charge')

15

In [25]:
churn_rdd.map(lambda x:(x[-1])=='True.').take(20)

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False]

In [26]:
churned = churn_rdd.filter(lambda x: x[-1]=='True.')

In [27]:
churned.map(lambda x: float(x[15])).mean()

9.23552795031056

In [28]:
churned.map(lambda x: float(x[15])).max()

15.97

In [29]:
churned.map(lambda x: float(x[15])).min()

2.13
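
The three cells above each scan the data separately. All three statistics can also be gathered in one pass by folding the values into a single accumulator, which is the idea behind RDD methods such as `aggregate` and `stats`. The sketch below shows that idea in plain Python on made-up charges; `merge` is a hypothetical helper name.

```python
from functools import reduce

charges = [9.5, 2.13, 15.97, 7.0]  # hypothetical night charges

def merge(acc, x):
    """Fold one value into a (min, max, total, count) accumulator."""
    lo, hi, total, count = acc
    return (min(lo, x), max(hi, x), total + x, count + 1)

start = (float('inf'), float('-inf'), 0.0, 0)
lo, hi, total, count = reduce(merge, charges, start)
mean = total / count

print(lo, hi, count)  # 2.13 15.97 4
print(round(mean, 2)) # 8.65
```
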

### Step 12: How many of the churned users have a VMail plan?

In [30]:
headers.index('VMail Plan')

5

In [31]:
churned.map(lambda x: x[5]).take(5)

['no', 'no', 'no', 'no', 'yes']

In [32]:
churned.filter(lambda x: x[5]=='yes').count()

80

### Step 13: Which state has the most day calls?

In [33]:
headers.index('Day Calls')

8

In [34]:
(
    churn_rdd.map(lambda x: (x[0], int(x[8])))
             .reduceByKey(lambda x, y: x + y)
             .sortBy(lambda x: x[1], ascending=False)
             .first()
)

('WV', 11001)
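
Sorting every state just to read off the first one does more work than needed; taking a maximum with a key function gives the same record. PySpark RDDs offer a `max` action that accepts a key function for this purpose. The plain-Python sketch below compares the two approaches; only the WV total comes from the output above, and the other totals are made up.

```python
# (state, total day calls); KS and OH totals are hypothetical
totals = [('KS', 7000), ('WV', 11001), ('OH', 7800)]

# Approach used above: sort in descending order, then take the first element
best_by_sort = sorted(totals, key=lambda kv: kv[1], reverse=True)[0]

# Alternative: a single pass that keeps only the largest element
best_by_max = max(totals, key=lambda kv: kv[1])

print(best_by_sort)  # ('WV', 11001)
print(best_by_max)   # ('WV', 11001)
```
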