**GHTorrent Data Analytics with PySpark RDD: An unstructured case study**
===================================================================================================================================================================================

### Udemy Course: Best Hands-on Big Data Practices and Use Cases using PySpark

### Author: Amin Karami (PhD, FHEA)

##### source 1: <https://ghtorrent.org>

##### source 2: <https://ghtorrent.org/downloads.html>

In \[ \]:

    ########## ONLY in Colab ##########
    !pip3 install pyspark
    ########## ONLY in Colab ##########

In \[ \]:

    ########## ONLY in Ubuntu Machine ##########
    # Load Spark engine
    !pip3 install -q findspark
    import findspark
    findspark.init()
    ########## ONLY in Ubuntu Machine ##########

In \[ \]:

    from pyspark import SparkContext, SparkConf

    # Initializing Spark
    conf = SparkConf().setAppName("GHTorrent_PySpark").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    print(sc)
    print("Ready to go!")

In \[ \]:

    ########## ONLY in Colab ##########
    from google.colab import drive
    drive.mount('/content/drive')
    ########## ONLY in Colab ##########

In \[ \]:

    # Read and Load Data to Spark
    rdd = sc.textFile('ghtorrent-logs.txt.gz')

In \[ \]:

    # Repartition and Cache Data:
    rdd = rdd.repartition(8) # shuffle all data
    print(sc.defaultParallelism)
    print(rdd.getNumPartitions())

    from pyspark import StorageLevel
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

Question 1: Count the number of records and get twenty records randomly.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In \[ \]:

    # '\033[1m' switches the terminal to bold; '\033[0m' resets it afterwards
    print("The number of records is:", '\033[1m', rdd.count(), '\033[0m')
    rdd.takeSample(False, 20, 1234)

**GHTorrent data format**
===================================================================================

Every line of this log file includes:

1.  Logging level, one of `DEBUG`, `INFO`, `WARN`, `ERROR`
2.  A timestamp
3.  The downloader id
4.  The logging stage including at least one of the following names:
    -   `event_processing`
    -   `ght_data_retrieval`
    -   `api_client`
    -   `retriever`
    -   `ghtorrent`

Question 2: Get the number of lines with both `Transaction` and `Repo` information.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In \[ \]:

    import re

    def collect_words(line):
        # Raw string avoids an invalid-escape warning for \w
        return re.findall(r'\w+', line.lower())

    print(collect_words('we are TESTING GHTorrent! ?, OK!'))
    # ['we', 'are', 'testing', 'ghtorrent', 'ok']

In \[ \]:

    rdd_Transactions = rdd.filter(lambda line: "transaction" in collect_words(line))
    rdd_Repo = rdd.filter(lambda line: "repo" in collect_words(line))

    # Note: intersection() returns distinct elements, so identical log lines are counted once
    rdd_intersect = rdd_Transactions.intersection(rdd_Repo)
    rdd_intersect.count()

In \[ \]:

    rdd_intersect.collect()

Question 3: Get the number of lines that include a `web link` for the `WARN` logging level.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In \[ \]:

    #import re
    def get_URLs(line):
        return re.findall(r'http[s]?://(?:[-\w.]|(?:%[\da-zA-Z]{2,}))+', line)

In \[ \]:

    # Keep WARN lines that contain at least one URL
    rdd.filter(lambda line: line.split(',')[0] == 'WARN') \
       .filter(lambda line: len(get_URLs(line)) > 0) \
       .count()
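
To see what the URL pattern captures before running it over the whole RDD, the check can be tried on a few lines in plain Python. This is a minimal sketch; the sample lines are invented, and `is_warn_with_url` is a hypothetical helper that treats a line as containing a web link when the pattern matches at least once.

```python
import re

# Same pattern as get_URLs above: scheme plus host, stopping at the first '/'
URL_PATTERN = re.compile(r'http[s]?://(?:[-\w.]|(?:%[\da-zA-Z]{2,}))+')


def get_urls(line):
    return URL_PATTERN.findall(line)


def is_warn_with_url(line):
    """True for WARN lines in which the URL pattern matches at least once."""
    level = line.split(',')[0].strip()
    return level == 'WARN' and len(get_urls(line)) > 0


# Invented sample lines, not taken from the real log file
lines = [
    "WARN, 2017-03-23T20:04:28+00:00, ghtorrent-13 -- api_client.rb: "
    "Failed request. URL: https://api.github.com/repos/foo/bar",
    "WARN, 2017-03-23T20:05:00+00:00, ghtorrent-13 -- ghtorrent.rb: Not enough information",
    "INFO, 2017-03-23T20:06:00+00:00, ghtorrent-2 -- api_client.rb: "
    "Successful request. URL: https://api.github.com/users/foo",
]

print(get_urls(lines[0]))                        # ['https://api.github.com']
print(sum(is_warn_with_url(l) for l in lines))   # 1
```

Because `/` is not part of the character class, the pattern returns only the scheme and host of each link, which is enough for a presence test but not for recovering the full path.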

Question 4: What is the most active `downloader id` for `Failed` connections?
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In \[ \]:

    rdd_Failed = rdd.filter(lambda line: 'failed' in collect_words(line))

    # Create Key-Value pairs: (downloader id, 1); strip() removes the padding around the id
    rdd_active_ids = rdd_Failed.map(lambda line: (line.replace('--', ',').split(',')[2].split('-')[1].strip(), 1))

In \[ \]:

    # groupByKey
    rdd_active_ids.groupByKey(4).mapValues(sum).sortBy(lambda x: x[1], ascending=False).first()

In \[ \]:

    # reduceByKey
    rdd_active_ids.reduceByKey(lambda a,b: a+b).sortBy(lambda x: x[1], ascending=False).first()
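
Both cells return the same answer, but they move different amounts of data: `groupByKey` shuffles every `(key, 1)` pair before summing, while `reduceByKey` first merges values inside each partition and shuffles only the partial sums. The difference can be sketched in plain Python; the partitions and ids below are invented, and the two functions are simplified stand-ins rather than Spark's actual implementation.

```python
from collections import defaultdict

# Hypothetical (downloader_id, 1) pairs spread over two partitions
partitions = [
    [("13", 1), ("2", 1), ("13", 1)],
    [("13", 1), ("2", 1)],
]


def group_by_key_then_sum(parts):
    """Mimic groupByKey + mapValues(sum): every pair is moved, then summed."""
    grouped = defaultdict(list)
    for part in parts:
        for key, value in part:
            grouped[key].append(value)  # all 5 pairs cross the "shuffle"
    return {key: sum(values) for key, values in grouped.items()}


def reduce_by_key(parts, func):
    """Mimic reduceByKey: combine inside each partition, then merge partial results."""
    partials = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        partials.append(local)  # at most one entry per key per partition
    merged = {}
    for local in partials:
        for key, value in local.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


by_group = group_by_key_then_sum(partitions)
by_reduce = reduce_by_key(partitions, lambda a, b: a + b)
print(by_group == by_reduce)                          # True
print(max(by_reduce.items(), key=lambda kv: kv[1]))   # ('13', 3)
```

For a simple count like this one, `reduceByKey` is usually the better fit, since the per-partition merge shrinks the data before it is shuffled.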

Question 5: What is the most active `repository`?
-------------------------------------------------------------------------------------------------------------------------------------

In \[ \]:

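    # A plain substring test is used because re.findall(' \w+ ') returns
    # non-overlapping matches and can skip " repo " when the previous word
    # already consumed the shared space.
    # Splitting on ' repo ' (not 'repo') avoids cutting inside names or URLs
    # that merely contain "repo"; the token right after it is the repository.
    rdd.filter(lambda line: " repo " in line.lower()) \
       .map(lambda line: line.lower().split(' repo ', 1)[1].split(' ')[0]) \
       .map(lambda repo: (repo, 1)) \
       .reduceByKey(lambda a,b: a+b) \
       .sortBy(lambda x: x[1], ascending=False) \
       .first()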

\[challenge\] Question 6: Get the number of `Failed HTTP` requests per `hour`.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
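
One possible starting point, offered as a sketch rather than the course's solution: read "per hour" as the hour of the day (0 to 23), take the second comma-separated field as an ISO 8601 timestamp, and call a line a failed HTTP request when it contains the word `failed` and an `http` link. All three readings are assumptions, the sample lines are invented, and `hour_of` and `is_failed_http` are hypothetical helper names.

```python
import re
from collections import Counter
from datetime import datetime


def is_failed_http(line):
    """Assumed definition: the word 'failed' plus an http(s) link on the same line."""
    lowered = line.lower()
    return 'failed' in re.findall(r'\w+', lowered) and 'http' in lowered


def hour_of(line):
    """Return the hour of day (0-23) from the timestamp field, or None if unparsable."""
    fields = line.split(',')
    if len(fields) < 2:
        return None
    try:
        return datetime.fromisoformat(fields[1].strip()).hour
    except ValueError:
        return None


# Invented sample lines, not taken from the real log file
lines = [
    "WARN, 2017-03-23T20:04:28+00:00, ghtorrent-13 -- api_client.rb: "
    "Failed request. URL: https://api.github.com/repos/foo/bar",
    "WARN, 2017-03-23T20:59:01+00:00, ghtorrent-2 -- api_client.rb: "
    "Failed request. URL: https://api.github.com/repos/foo/baz",
    "WARN, 2017-03-23T21:10:00+00:00, ghtorrent-13 -- api_client.rb: "
    "Failed request. URL: https://api.github.com/users/foo",
    "INFO, 2017-03-23T21:11:00+00:00, ghtorrent-2 -- api_client.rb: "
    "Successful request. URL: https://api.github.com/users/foo",
    "failed http line with no timestamp field",
]

hours = (hour_of(l) for l in lines if is_failed_http(l))
counts = Counter(h for h in hours if h is not None)
print(sorted(counts.items()))   # [(20, 2), (21, 1)]

# With the notebook's RDD, the same helpers could be chained as:
# rdd.filter(is_failed_http) \
#    .map(lambda line: (hour_of(line), 1)) \
#    .filter(lambda kv: kv[0] is not None) \
#    .reduceByKey(lambda a, b: a + b) \
#    .sortByKey() \
#    .collect()
```

To count per calendar hour instead of per hour of day, `hour_of` could return a string such as `stamp.strftime('%Y-%m-%d %H')` and the rest of the pipeline would stay the same.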

In \[ \]: