# Assignment 1
In this assignment we will reduce large datasets, which can take a long time, so we should apply the map-reduce logic learned in the sessions.

Load the directory `activations_xml` in your file system.

Each XML file contains data for the devices activated by customers; the files cover the 12 months of several years, with many activations per file. Sample input data:
```xml
<activations>
	<activation timestamp="2013-1-18 11:53.000" type="phone">
		<account-number>454721</account-number>
		<device-id>d051735e-cd2d-11e8-a49d-b86b239563ce</device-id>
		<phone-number>66641135</phone-number>
		<model>WIKO</model>
	</activation>
	<activation timestamp="2013-1-21 03:55.000" type="phone">
		<account-number>239096</account-number>
		<device-id>d051735f-cd2d-11e8-a2c0-b86b239563ce</device-id>
		<phone-number>42743767</phone-number>
		<model>HTC</model>
	</activation>
    ...............
</activations>
```
For convenience, you have been provided with functions to parse the XML, as that is not the focus of this exercise. These functions are:
```python
import xml.etree.ElementTree as ElementTree

# Given a string containing XML, parse the string, and 
# return an iterator of activation XML records (Elements) contained in the string
def getActivations(s):
    filetree = ElementTree.fromstring(s)
    # iter() replaces getiterator(), which was removed in Python 3.9
    return filetree.iter('activation')
    
# Given an activation record (XML Element), return the model name
def getModel(activation):
    return activation.find('model').text 

# Given an activation record (XML Element), return the account number 
def getAccount(activation):
    return activation.find('account-number').text 
```
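The helpers can be tried on a small XML string before involving Spark. The sketch below repeats them so that it runs on its own, and it walks the tree with `iter()`, since `Element.getiterator()` is no longer available from Python 3.9 onward. The `sample` string is a trimmed copy of the sample input above, not real data from the dataset.

```python
import xml.etree.ElementTree as ElementTree

def getActivations(s):
    # Parse the XML string and iterate over its <activation> elements
    return ElementTree.fromstring(s).iter('activation')

def getModel(activation):
    return activation.find('model').text

def getAccount(activation):
    return activation.find('account-number').text

# Trimmed copy of the sample input shown above
sample = """<activations>
  <activation timestamp="2013-1-18 11:53.000" type="phone">
    <account-number>454721</account-number>
    <model>WIKO</model>
  </activation>
  <activation timestamp="2013-1-21 03:55.000" type="phone">
    <account-number>239096</account-number>
    <model>HTC</model>
  </activation>
</activations>"""

# One (account, model) tuple per activation record
records = [(getAccount(a), getModel(a)) for a in getActivations(sample)]
print(records)
```

Note that both helpers return strings, so account numbers compare as text unless they are converted with `int()`.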
To process the files you should:
1. Use `wholeTextFiles` to create an RDD from the activations dataset. The resulting RDD will consist of tuples, in which the first value is the name of the file and the second value is the contents of the file (XML) as a string.
2. Each XML file can contain many activation records; use `flatMap` to map the contents of each file to a collection of XML records by calling the provided `getActivations` function. `getActivations` takes an XML string, parses it, and returns a collection of XML records; `flatMap` maps each record to a separate RDD element.
3. Map each activation record to a string in the format account-number:model. Use the provided `getAccount` and `getModel` functions to find the values from the activation record.
4. Save the formatted strings to a text file in the directory.
5. Map each activation record to a tuple in the format (account-number, model) to build a `pairRDD` to answer the following questions. Use the provided `getAccount` and `getModel` functions to find the values from the activation record.
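The difference between `map` and `flatMap` in step 2 can be sketched in plain Python, without Spark. This is only an emulation under stated assumptions: the `files` list stands in for the (filename, content) tuples that `wholeTextFiles` produces, its file names and contents are made up, and `get_activations` is a hypothetical stand-in for the provided `getActivations`.

```python
import xml.etree.ElementTree as ElementTree
from itertools import chain

def get_activations(xml_string):
    # Hypothetical stand-in for the provided getActivations helper
    return list(ElementTree.fromstring(xml_string).iter('activation'))

# Made-up (filename, content) tuples, shaped like the output of wholeTextFiles
files = [
    ("2013-01.xml",
     "<activations>"
     "<activation><account-number>1</account-number><model>HTC</model></activation>"
     "<activation><account-number>2</account-number><model>LG</model></activation>"
     "</activations>"),
    ("2013-02.xml",
     "<activations>"
     "<activation><account-number>1</account-number><model>ZTE</model></activation>"
     "</activations>"),
]

# map: one output element per input element (here, one list of records per file)
mapped = [get_activations(content) for _, content in files]

# flatMap: the per-file collections are flattened into a single sequence of records
flat_mapped = list(chain.from_iterable(mapped))

# Step 3 format (string) and step 5 format (tuple)
as_strings = [a.find('account-number').text + ':' + a.find('model').text for a in flat_mapped]
as_pairs = [(a.find('account-number').text, a.find('model').text) for a in flat_mapped]

print(len(mapped), len(flat_mapped))
print(as_strings)
```

With `map` the result has as many elements as there are files; with `flatMap` it has as many elements as there are activation records, which is what the later steps need.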

QUESTIONS:
1. Get the top 5 accounts with the highest number of activations. Hint: build the proper `pairRDD` and use `reduceByKey()` and `sortBy()`.
2. Get the top 5 models with the highest number of activations. Hint: build the proper `pairRDD` and use `reduceByKey()` and `sortBy()`.
3. How many distinct accounts have activated a model in these years? Hint: build the proper `pairRDD` and use `distinct()`.
4. Get the number of accounts for each activation frequency (that is, how many accounts have 1 activation, how many have 2, and so on). Hint: try to reverse (account-number,activations) to (activations,account-number) and try `countByKey()`.
5. Get the models of the top 5 accounts that have activated the most models. Hint: from question 1 we know the accounts with the most activations; now we want to know their models. Build the proper `pairRDD` and use `groupByKey()` and `sortBy()`. To print the models, use Python statements to extract the data from the PySpark result.
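The counting pattern behind the hints for questions 1 and 2 can be emulated in plain Python. The sketch below is not Spark code: `reduce_by_key` is a hypothetical helper that mimics what `reduceByKey()` computes on a single machine, and the `pairs` list is made-up data.

```python
def reduce_by_key(pairs, func):
    """Plain-Python emulation: merge all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

# Made-up (account-number, model) pairs
pairs = [('1', 'HTC'), ('2', 'LG'), ('1', 'ZTE'), ('1', 'HTC')]

# Replace each model with a 1, then add the ones up per account
ones = [(account, 1) for account, _ in pairs]
counts = reduce_by_key(ones, lambda x, y: x + y)

# Highest count first, as sortBy(..., ascending=False) would do
ranked = sorted(counts, key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In Spark the merge function is applied within and across partitions, so it should be associative and commutative; addition satisfies both.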

In [11]:
import os
import sys

os.environ['SPARK_HOME'] = "C:\\spark-2.3.2-bin-hadoop2.7\\"

# Create a variable for our root path
SPARK_HOME = os.environ['SPARK_HOME']

#Add the following paths to the system path. Please check your installation
#to make sure that these zip files actually exist. The names might change
#as versions change.
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.10.7-src.zip"))

#Initialize SparkSession and SparkContext
from pyspark.sql import SparkSession

#Create a Spark Session
spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("MiPrimer") \
    .config("spark.executor.memory", "6g") \
    .config("spark.cores.max","4") \
    .getOrCreate()


#Get the Spark Context from Spark Session    
sc = spark.sparkContext

In [12]:
import xml.etree.ElementTree as ElementTree

# Given a string containing XML, parse the string, and 
# return an iterator of activation XML records (Elements) contained in the string
def getActivations(s):
    filetree = ElementTree.fromstring(s)
    # iter() replaces getiterator(), which was removed in Python 3.9
    return filetree.iter('activation')

# Given an activation record (XML Element), return the model name
def getModel(activation):
    return activation.find('model').text 

# Given an activation record (XML Element), return the account number 
def getAccount(activation):
    return activation.find('account-number').text 

### 1. Use `wholeTextFiles()` with xml directory

In [13]:
filepath='../data/activations-xml' # directory with the XML files; wholeTextFiles reads every file in it
xmlRDD_raw=sc.wholeTextFiles(filepath)
xmlRDD_raw.take(1)

[('file:/C:/Users/breog/Documents/IE_MBD/2_Spark/Assignment 1/data/activations-xml/2013-01.xml',
  '<activations>\r\n\t<activation timestamp="2013-1-24 18:39.000" type="phone">\r\n\t\t<account-number>6845</account-number>\r\n\t\t<device-id>8126d422-f3c7-11e8-897c-b86b239563ce</device-id>\r\n\t\t<phone-number>6610058</phone-number>\r\n\t\t<model>PLUM</model>\r\n\t</activation>\r\n\t<activation timestamp="2013-1-11 19:38.000" type="phone">\r\n\t\t<account-number>361</account-number>\r\n\t\t<device-id>8127220c-f3c7-11e8-8a29-b86b239563ce</device-id>\r\n\t\t<phone-number>6240295</phone-number>\r\n\t\t<model>XIAOMI</model>\r\n\t</activation>\r\n\t<activation timestamp="2013-1-09 05:47.000" type="phone">\r\n\t\t<account-number>8724</account-number>\r\n\t\t<device-id>8127220d-f3c7-11e8-9e2a-b86b239563ce</device-id>\r\n\t\t<phone-number>6189944</phone-number>\r\n\t\t<model>LEECO</model>\r\n\t</activation>\r\n\t<activation timestamp="2013-1-04 18:58.000" type="phone">\r\n\t\t<account-number>584

### 2. Use `flatMap()` to load one activation into each RDD element

In [14]:
xmlRDD_activations=xmlRDD_raw.map(lambda x: x[1]).flatMap(lambda x:getActivations(x))
xmlRDD_activations.take(5)

[<Element 'activation' at 0x0000017FA22C9CC8>,
 <Element 'activation' at 0x0000017FA22D5868>,
 <Element 'activation' at 0x0000017FA22D59F8>,
 <Element 'activation' at 0x0000017FA22D5B88>,
 <Element 'activation' at 0x0000017FA22D5D18>]

### 3. Map each activation record to "account-number:model-name"

In [15]:
listRDD=xmlRDD_activations.map(lambda x: getAccount(x)+':'+getModel(x))
listRDD.take(10)

['6845:PLUM',
 '361:XIAOMI',
 '8724:LEECO',
 '5840:ACER',
 '6563:LG',
 '2121:LENOVO',
 '901:HTC',
 '3850:LEECO',
 '3223:VIVO',
 '6322:ZTE']

### 4. Save the data to a file

In [17]:
directory='../data/listRDD_account_model.xml' # output path; Spark creates a directory with this name
listRDD.coalesce(1).saveAsTextFile(directory)
# coalesce(1) merges the data into one partition, so a single part file is written.
# Note: running this cell twice raises an error because the output directory already exists.
# Open question: how can the output be updated without deleting the directory first?
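One way to make the cell re-runnable is to delete the previous output directory before saving. The sketch below assumes the output lives on the local file system (it would not work for HDFS paths), and `remove_if_exists` is a hypothetical helper name, not part of PySpark.

```python
import os
import shutil

def remove_if_exists(path):
    """Delete a previous output directory so the save can run again.

    Returns True if a directory was removed, False if there was nothing to remove.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False

# Hypothetical usage in the cell above (left commented out, as it needs Spark):
# remove_if_exists(directory)
# listRDD.coalesce(1).saveAsTextFile(directory)
```

An alternative that avoids manual deletion is to convert the RDD to a DataFrame and write it with the DataFrame writer in overwrite mode.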

### 5. New `map` for the questions

In [18]:
pairRDD=xmlRDD_activations.map(lambda x:(getAccount(x),getModel(x)))
pairRDD.take(5)

[('6845', 'PLUM'),
 ('361', 'XIAOMI'),
 ('8724', 'LEECO'),
 ('5840', 'ACER'),
 ('6563', 'LG')]

#### 5.1 Get the top 5 accounts with the highest number of activations.

In [19]:
top_5_accounts=pairRDD.map(lambda x: (x[0],1))\
    .reduceByKey(lambda x,y: x+y)\
    .sortBy(lambda x: x[1], ascending=False)

top_5_accounts.take(5)

[('3829', 12), ('2219', 12), ('4084', 12), ('5158', 11), ('3007', 11)]
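The result above contains ties, and the histogram in section 5.4 reports 11 accounts with 11 activations, so the count alone does not determine which two of them fill positions 4 and 5. A possible way to make the ranking reproducible is a composite sort key. The sketch below shows the idea in plain Python; the counts are illustrative, and account `'1200'` is made up.

```python
# Illustrative (account, activations) counts with ties; '1200' is a made-up account
counts = [('3007', 11), ('4084', 12), ('5158', 11),
          ('2219', 12), ('1200', 11), ('3829', 12)]

# Sort by descending count, then by ascending numeric account number to break ties
ranked = sorted(counts, key=lambda kv: (-kv[1], int(kv[0])))
top_5 = ranked[:5]
print(top_5)
```

The same key function could be passed to `sortBy()` (with the default ascending order) to get a deterministic top 5 from the RDD.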

#### 5.2 Get the top 5 models with the highest number of activations.

In [20]:
top_5_models=pairRDD.map(lambda x: (x[1],1))\
    .reduceByKey(lambda x,y: x+y)\
    .sortBy(lambda x: x[1], ascending=False)

top_5_models.take(5)

[('HP', 1073), ('LEECO', 1064), ('LAVA', 1060), ('APPLE', 1058), ('HTC', 1042)]

#### 5.3 How many distinct accounts have activated a model in these years?

In [22]:
# The question asks for all distinct accounts with at least one activation in these years, so:

total_distinct_accounts=pairRDD.map(lambda x: x[0])\
        .distinct()\
        .count()
print('The total number of distinct accounts is: {}'.format(total_distinct_accounts))

The total number of distinct accounts is: 9731


#### 5.4 Get the number of accounts for each activation frequency

In [23]:
pairRDD.take(5)

[('6845', 'PLUM'),
 ('361', 'XIAOMI'),
 ('8724', 'LEECO'),
 ('5840', 'ACER'),
 ('6563', 'LG')]

In [32]:
# Build a histogram: x = activation frequency, y = number of accounts with that frequency
# Pipeline: map -> reduceByKey -> map -> reduceByKey -> sortByKey
histogram=pairRDD.map(lambda x:(x[0],1))\
    .reduceByKey(lambda x,y:x+y)\
    .map(lambda x:(x[1],1))\
    .reduceByKey(lambda x,y:x+y)\
    .sortByKey()
    
histogram.take(12)

[(1, 989),
 (2, 1759),
 (3, 2160),
 (4, 1925),
 (5, 1339),
 (6, 798),
 (7, 428),
 (8, 219),
 (9, 75),
 (10, 25),
 (11, 11),
 (12, 3)]

In [40]:
# Another option: swap to (activations, account-number) and use countByKey(), as the hint suggests
histogram=pairRDD.map(lambda x:(x[0],1))\
    .reduceByKey(lambda x,y:x+y)\
    .map(lambda x:(x[1],x[0]))\
    .sortByKey() 
histogram.countByKey()

defaultdict(int,
            {1: 989,
             2: 1759,
             3: 2160,
             4: 1925,
             5: 1339,
             6: 798,
             7: 428,
             8: 219,
             9: 75,
             10: 25,
             11: 11,
             12: 3})
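The histogram can be cross-checked against the earlier results with plain arithmetic. The sketch below copies the values printed above: the number of accounts should match the distinct-account count from section 5.3, and weighting each frequency by its number of accounts gives the total number of activations. That total is derived here, not printed anywhere in the notebook, so it is worth comparing with `pairRDD.count()`.

```python
# Values copied from the countByKey() output above: {frequency: number of accounts}
histogram = {1: 989, 2: 1759, 3: 2160, 4: 1925, 5: 1339, 6: 798,
             7: 428, 8: 219, 9: 75, 10: 25, 11: 11, 12: 3}

# Every account falls in exactly one frequency bucket
total_accounts = sum(histogram.values())

# Each account in bucket f contributes f activations
total_activations = sum(freq * n for freq, n in histogram.items())

print(total_accounts)     # 9731, the distinct-account count from 5.3
print(total_activations)  # 36000
```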

#### 5.5 Get the models of the top 5 accounts that have activated the most models

In [42]:
# For the top 5 accounts, get the models

top_5_ids=top_5_accounts.map(lambda x:x[0]).take(5) # renamed from `list` to avoid shadowing the built-in

# Models for those top 5 accounts
value=pairRDD.filter(lambda x:x[0] in top_5_ids)\
.groupByKey()

for row in value.take(5):
    print (row[0], ":")
    for models in row[1]:
        print ("\t",models)


3829 :
	 MEIZU
	 WIKO
	 MOTOROLA
	 ALCATEL
	 PANASONIC
	 ZTE
	 VODAFONE
	 MOTOROLA
	 ONEPLUS
	 XIAOMI
	 YU
	 ONEPLUS
5158 :
	 PANASONIC
	 ACER
	 MICROSOFT
	 VIVO
	 ENERGIZER
	 NOKIA
	 HUAWEI
	 BLU
	 MICROMAX
	 VIVO
	 ALCATEL
3007 :
	 MEIZU
	 LAVA
	 ONEPLUS
	 SAMSUNG
	 VODAFONE
	 ENERGIZER
	 ALCATEL
	 ONEPLUS
	 XIAOMI
	 HUAWEI
	 SAMSUNG
2219 :
	 LAVA
	 XIAOMI
	 VIVO
	 APPLE
	 MAXWEST
	 TOSHIBA
	 PLUM
	 PLUM
	 ALCATEL
	 LG
	 YU
	 ACER
4084 :
	 ZTE
	 ONEPLUS
	 OPPO
	 YU
	 LENOVO
	 HTC
	 BLU
	 PANASONIC
	 MOTOROLA
	 HTC
	 ZTE
	 ALCATEL
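The lists above keep one entry per activation, so a model activated twice appears twice. If the question is read as asking for distinct models, the duplicates can be dropped with a set. The sketch below does this in plain Python for account 3829, using the models printed above; whether distinct models are what the question intends is an assumption.

```python
# Models printed above for account 3829, one entry per activation
models_3829 = ['MEIZU', 'WIKO', 'MOTOROLA', 'ALCATEL', 'PANASONIC', 'ZTE',
               'VODAFONE', 'MOTOROLA', 'ONEPLUS', 'XIAOMI', 'YU', 'ONEPLUS']

# A set drops the repeated models; sorting makes the printout stable
distinct_models = sorted(set(models_3829))

print(len(models_3829), len(distinct_models))
print(distinct_models)
```

In Spark the same effect could be obtained by calling `distinct()` on the pair RDD before `groupByKey()`. Note also that `groupByKey()` does not keep the ranking order from section 5.1, which is why the accounts above are not listed by number of activations.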


In [43]:
sc.stop()