-
Notifications
You must be signed in to change notification settings - Fork 0
Module 2: Lab #1
Generates key value pairs.
Remaining friends are present in the value.
Grouping of keys is done before reduce task.
Intersection of same keys is done and final result is obtained.
Spark code runs more faster than Hadoop Map-reduce code due to in-memory computation.
FIFA dataset is used to implement this.
a.
Dataframe is created by loading dataset with schema containing various struct types.
b.
1. Number of teams in various stages
2. Goals scored in range of [2,5] by home teams
3. Finding attendance for current day and maximum attendance of that stadium
4. Left outer join
5. Pattern matching: playing in finals and homeTeamInitials like 'ITA'
6. Average goals scored by each team
7. Performing UNIONALL and displaying row count
8. FULL JOIN
9. Number of times, when a country scored more than 6 goals.
10. Goals scored by countries in particular period
11. Total goals scored by each country
c.
Queries on Spark RDD's and Spark DataFrame's
RDD is created and displayed.
DataFrame is created from RDD by specifying the underlying schema.
1. Displaying total goals by each nation in descending order
2. Displaying the years in which host wins
4. Displaying the row for a particular year.
5. Selecting countries who have played more matches
Output is same for both RDD and DataFrames.
For RDD, Schema should be supplied explicitly, where as data frame implicitly extracts the schema from the given data.
To use Twitter Streaming API, developer account is required which provides the credentials to download real time streaming data.
Stream is created and data is filtered.
Filtered data is divided into words.
Output:
Part-4: Spark GraphX
Wordgame dataset is used for this.
Input:
Dataset is loaded and vertices and edges are identified.
Graph frame is created based on vertices and edges.
Schema and vertices output:
Edges output:
Pagerank is calculated for the graph frame constructed above.
GraphX is Spark API to implement parallel and iterative graph processing.
Graph frame is constructed from vertices and edges.
https://spark.apache.org/docs/latest/graphx-programming-guide.html#pagerank
https://www.toptal.com/apache/apache-spark-streaming-twitter
https://www.programcreek.com/scala/org.apache.spark.streaming.twitter.TwitterUtils
https://www.kaggle.com/anneloes/wordgame
https://www.edureka.co/community/12000/what-the-difference-between-rdd-and-dataframes-apache-spark