# Social Media Analytics

This notebook performs analytics on social media data using PySpark. The data comes from [this Udemy course](https://www.udemy.com/taming-big-data-with-apache-spark-hands-on/learn/v4/t/lecture/3710452?start=0).

#### Problem 1:
Which age groups have the highest average number of friends?
This problem is an exercise in managing key-value pairs in Spark.
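
The key-value pattern can be sketched in plain Python before running it on Spark. This is only an illustration under assumptions: the `(age, num_friends)` pairs below are invented, and a `defaultdict` stands in for what `mapValues` and `reduceByKey` do across a cluster.

```python
from collections import defaultdict

# Hypothetical (age, num_friends) pairs, invented for illustration only.
pairs = [(30, 100), (30, 200), (25, 50), (25, 150), (25, 100)]

# Counterpart of mapValues(lambda x: (x, 1)): attach a count of 1 to each value.
with_counts = [(age, (friends, 1)) for age, friends in pairs]

# Counterpart of reduceByKey: add the sums and the counts element-wise per key.
totals = defaultdict(lambda: (0, 0))
for age, (friends, count) in with_counts:
    running_sum, running_count = totals[age]
    totals[age] = (running_sum + friends, running_count + count)

# Counterpart of the final mapValues: divide each sum by its count.
averages = {age: round(s / c, 2) for age, (s, c) in totals.items()}
print(averages)
```

Carrying a `(sum, count)` tuple through the reduction is what makes the average computable: averages themselves cannot be combined pairwise, but sums and counts can.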

In [11]:
%%file codes/average_frieds_age.py

def parser(line):
    data = line.strip().split(',')
    age = int(data[2])
    num_friends = int(data[3])
    return (age, num_friends)
    


import findspark
findspark.init()
from pyspark import SparkContext, SparkConf  # for configuring and connecting to Spark
conf = SparkConf().setMaster('local').setAppName('average friends')  # configuration
sc = SparkContext(conf = conf)  # connection to Spark
line = sc.textFile('file:///Users/Amin/Dropbox/Career Deveoment/Data Science/PySpark/Pyspark_Social_Media/data/raw/fakefriends.csv')
line_parser = line.map(parser)  # refer to parser function above
friends_counter = line_parser.mapValues(lambda x: (x,1))
# up to this point nothing has been computed: map and mapValues are lazy transformations
# reduceByKey and mapValues below are lazy too; Spark only builds a directed acyclic graph (DAG) of the steps
total_friends = friends_counter.reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) # returns a (sum, count) tuple (important)
final_results = total_friends.mapValues(lambda x: round(x[0] / x[1], 2)) # average calculation
results = final_results.collect() # action: runs the job and returns the results as a list
sorted_results = sorted(results, reverse=True, key = lambda x : x[1])[:10] # top 10 ages by average number of friends
for result in sorted_results:
    print(result)

Overwriting codes/average_frieds_age.py
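
The script's comments rely on lazy evaluation: transformations only describe work, and an action such as `collect()` triggers it. As a rough analogy only (Python's built-in `map` is not Spark, and the sample rows are invented), the same deferral can be observed in plain Python:

```python
calls = []  # records every line that actually gets parsed

def parse(line):
    calls.append(line)
    _, _, age, friends = line.split(',')
    return int(age), int(friends)

# Hypothetical rows in the assumed id,name,age,num_friends layout.
lines = ['0,Ann,30,100', '1,Bob,25,50']

lazy = map(parse, lines)      # like a transformation: nothing is parsed yet
calls_before = len(calls)

parsed = list(lazy)           # like an action: forces the parsing to happen
calls_after = len(calls)

print(calls_before, calls_after, parsed)
```

In Spark the deferral additionally lets the scheduler plan the whole chain of steps before any data is read.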


In [12]:
! python codes/average_frieds_age.py

(63, 384.0)
(21, 350.88)
(18, 343.38)
(52, 340.64)
(33, 325.33)
(45, 309.54)
(56, 306.67)
(42, 303.5)
(51, 302.14)
(65, 298.2)
SUCCESS: The process with PID 5920 (child process of PID 20976) has been terminated.
SUCCESS: The process with PID 20976 (child process of PID 17392) has been terminated.
SUCCESS: The process with PID 17392 (child process of PID 3312) has been terminated.


19/03/10 16:48:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).



#### Problem 2:
Who is the most popular comic book character in the Marvel world?
We load a data set that records the co-appearances of all the Marvel characters.
We have two data files:

``hero_graph`` lists the co-appearances of each hero with other heroes.

``hero_table`` is a lookup table from hero_id to hero name, which we need to broadcast to all the nodes.
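
How the two files fit together can be sketched without Spark. The sample lines below are invented and only assume the layout described above (an id followed by a quoted name, and a hero id followed by the ids it co-appears with); `Counter` stands in for the distributed reduction, and the plain `dict` for the broadcast lookup table.

```python
from collections import Counter

# Hypothetical sample lines, invented for illustration; the real files are much larger.
name_lines = ['1 "HERO A"', '2 "HERO B/ALIAS"', '3 "HERO C"']
graph_lines = ['1 2 3', '2 1', '1 3', '3 1 2']

def parse_name(line):
    # Split once so that whitespace inside the name is preserved.
    hero_id, name = line.strip().split(' ', 1)
    return hero_id, name.strip('"')

# Lookup table {hero_id: hero_name}; on a cluster this is what gets broadcast.
lookup = dict(parse_name(line) for line in name_lines)

# A hero can appear on several lines, so the per-line counts must be summed.
counts = Counter()
for line in graph_lines:
    hero_id, *others = line.strip().split(' ')
    counts[hero_id] += len(others)

# Replace ids with names and rank by total co-appearances, highest first.
ranked = [(lookup[hero_id], n) for hero_id, n in counts.most_common()]
print(ranked)
```

Broadcasting matters on a real cluster because each executor then receives the lookup table once instead of having it shipped with every task.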

In [29]:
try:
    sc.stop() # stop any existing SparkContext
except Exception:
    pass # no context exists yet, so there is nothing to stop

# build a lookup table of heroes: {hero_id: hero_name}
import os
hero_name_file = os.path.join(os.getcwd(), '..', 'data', 'raw', 'Marvel-Names.txt')  # portable across operating systems
def look_up_table():
    with open(hero_name_file) as f:
        hero_dict = {}
        for line in f:
            hero_id, *hero_name = line.strip().split(' ') # the hero name field itself contains whitespace
            hero_name = " ".join(hero_name)[1:-1]  # rejoin the name and drop its first and last characters (the enclosing quotes)
            hero_dict[hero_id] = hero_name
    return hero_dict
hero_lut = look_up_table()

def parser(line):
    hero, *friends = line.strip().split(' ')
    return (hero, len(friends)) # returns the hero id and the number of co-appearances on this line
    

# find and import pyspark
import findspark
findspark.init()
from pyspark import SparkConf, SparkContext
# get connected to pyspark context
conf = SparkConf().setMaster('local[*]').setAppName('popular hero')
sc = SparkContext(conf = conf)
# import and parse the data into correct format
hero_graph = sc.textFile(os.path.join(os.getcwd(), '..', 'data', 'raw', 'Marvel-Graph.txt'))
hero_friends = hero_graph.map(parser)
# broadcast the look up table
broadcast_lut = sc.broadcast(hero_lut)
# sum the counts per hero, map ids to names, and sort in descending order
hero_total_friends = (
    hero_friends.reduceByKey(lambda x, y: x + y)
    .map(lambda x: (broadcast_lut.value[x[0]], x[1]))
    .sortBy(lambda x: x[1], ascending=False)
)
# collect the data
results = hero_total_friends.take(10)
# print the data
print('Superhero:\t\t#coappearance')
for res in results:
    print(res[0] + ':\t\t', res[1])
# stop the spark context
sc.stop()

Superhero:		#coappearance
CAPTAIN AMERICA:		 1933
SPIDER-MAN/PETER PAR:		 1741
IRON MAN/TONY STARK :		 1528
THING/BENJAMIN J. GR:		 1426
WOLVERINE/LOGAN :		 1394
MR. FANTASTIC/REED R:		 1386
HUMAN TORCH/JOHNNY S:		 1371
SCARLET WITCH/WANDA :		 1345
THOR/DR. DONALD BLAK:		 1289
BEAST/HENRY &HANK& P:		 1280
