### Task-I: Build and populate necessary tables (30% of course project grade)
1) Ingest both train and test data into one Postgres Database table. Use the augmented datasets that are provided under Final CSV folder.
2) Add a field to your database table that distinguishes between train and test datasets.
3) Identify constraints as needed and document them in your README.md file.
4) Your tables should be created in a schema named “mqtt”.
5) In your README.md, add a description of the features in the dataset.
6) Use the reduced version of the data if your laptop’s memory can’t handle the original dataset.

In [1]:
# findspark is needed on Windows; comment out the next three lines on other systems if PySpark already imports cleanly.
import findspark
findspark.init()
findspark.find()

import pyspark

from pyspark.sql import SparkSession
from pyspark import SparkContext, SQLContext

appName = "Big Data Analytics"
master = "local"

# Create Configuration object for Spark.
conf = pyspark.SparkConf()\
    .set('spark.driver.host','127.0.0.1')\
    .setAppName(appName)\
    .setMaster(master)

# Create Spark Context with the new configurations rather than relying on the default one
sc = SparkContext.getOrCreate(conf=conf)

# You need to create SQL Context to conduct some database operations like what we will see later.
sqlContext = SQLContext(sc)

# If you have SQL context, you create the session from the Spark Context
spark = sqlContext.sparkSession.builder.getOrCreate()



The schema below was written after manually inspecting the values in the .csv files. Hexadecimal fields (such as tcp_flags and mqtt_hdrflags) are stored as strings, and so is mqtt_msg.
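Keeping the hexadecimal fields as strings preserves the raw values, but they can still be interpreted numerically when needed. The sketch below is a minimal, standalone illustration of that idea for `tcp_flags`, assuming the standard TCP flag bit layout; the names `TCP_FLAG_BITS` and `decode_tcp_flags` are hypothetical helpers and not part of the notebook.

```python
# Standard TCP flag bits (low six bits of the flags field).
TCP_FLAG_BITS = {
    0x01: "FIN",
    0x02: "SYN",
    0x04: "RST",
    0x08: "PSH",
    0x10: "ACK",
    0x20: "URG",
}


def decode_tcp_flags(hex_value):
    """Return the names of the flag bits set in a hex string such as '0x00000018'."""
    value = int(hex_value, 16)  # int() accepts the '0x' prefix when base is 16
    return [name for bit, name in TCP_FLAG_BITS.items() if value & bit]


# Values taken from the tcp_flags column shown in this notebook's output
for raw in ["0x00000018", "0x00000010", "0x00000011", "0x00000002"]:
    print(raw, decode_tcp_flags(raw))
```

Under these assumptions, `0x00000018` decodes to PSH and ACK, which may be a useful description to include alongside the feature notes in the README.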

In [2]:
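# Note: spark.sql() runs against Spark's own catalog, so this statement
# creates the schema there rather than in the Postgres database.
sqlWay = spark.sql("""
CREATE SCHEMA mqtt;
""")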

sqlWay.show()

++
||
++
++



In [3]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType


mqtt = StructType([
    StructField("tcp_flags", StringType(), True),
    StructField("tcp_time_delta", DoubleType(), True),
    StructField("tcp_len", IntegerType(), True),
    StructField("mqtt_conack_flags", StringType(), True),
    StructField("mqtt_conack_flags_reserved", IntegerType(), True),
    StructField("mqtt_conack_flags_sp", IntegerType(), True),
    StructField("mqtt_conack_val", IntegerType(), True),
    StructField("mqtt_conflag_cleansess", IntegerType(), True),
    StructField("mqtt_conflag_passwd", IntegerType(), True),
    StructField("mqtt_conflag_qos", IntegerType(), True),
    StructField("mqtt_conflag_reserved", IntegerType(), True),
    StructField("mqtt_conflag_retain", IntegerType(), True),
    StructField("mqtt_conflag_uname", IntegerType(), True),
    StructField("mqtt_conflag_willflag", IntegerType(), True),
    StructField("mqtt_conflags", StringType(), True),
    StructField("mqtt_dupflag", IntegerType(), True),
    StructField("mqtt_hdrflags", StringType(), True),
    StructField("mqtt_kalive", IntegerType(), True),
    StructField("mqtt_len", IntegerType(), True),
    StructField("mqtt_msg", StringType(), True),
    StructField("mqtt_msgid", IntegerType(), True),
    StructField("mqtt_msgtype", IntegerType(), True),
    StructField("mqtt_proto_len", IntegerType(), True),
    StructField("mqtt_protoname", StringType(), True),
    StructField("mqtt_qos", IntegerType(), True),
    StructField("mqtt_retain", IntegerType(), True),
    StructField("mqtt_sub_qos", IntegerType(), True),
    StructField("mqtt_suback_qos", IntegerType(), True),
    StructField("mqtt_ver", IntegerType(), True),
    StructField("mqtt_willmsg", IntegerType(), True),
    StructField("mqtt_willmsg_len", IntegerType(), True),
    StructField("mqtt_willtopic", IntegerType(), True),
    StructField("mqtt_willtopic_len", IntegerType(), True),
    StructField("target", StringType(), True)
])

In [4]:
# df_train = spark.read.csv("train70_reduced.csv" ,header=True, inferSchema= False)
# df_test = spark.read.csv("test30_reduced.csv", header=True, inferSchema=False)

# df_train = spark.read.csv("train70_reduced.csv" ,header=True, inferSchema= False).schema(mqtt)
# df_test = spark.read.csv("test30_reduced.csv", header=True, inferSchema=False).schema(mqtt)

df_train = spark.read.format('csv').schema(mqtt).option('header', True).load('train70_reduced.csv')
df_test = spark.read.format('csv').schema(mqtt).option('header', True).load('test30_reduced.csv')

In [5]:
from pyspark.sql.functions import lit

df_train = df_train.withColumn('Train', lit(1))
df_test = df_test.withColumn('Train', lit(0))

In [6]:
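import os

db_properties={}
#update your db username
db_properties['username']="postgres"
#read your db password from the environment rather than hard-coding it
#(PG_PASSWORD is a placeholder variable name; set it before starting the notebook)
db_properties['password']=os.environ["PG_PASSWORD"]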
#make sure you got the right port number here
db_properties['url']= "jdbc:postgresql://localhost:5432/postgres"
#make sure you had the Postgres JAR file in the right location
db_properties['driver']="org.postgresql.Driver"
db_properties['table']= "mqtt"


df_train.write.format("jdbc")\
.mode("overwrite")\
.option("url", db_properties['url'])\
.option("dbtable", db_properties['table'])\
.option("user", db_properties['username'])\
.option("password", db_properties['password'])\
.option("Driver", db_properties['driver'])\
.save()

df_test.write.format("jdbc")\
.mode("append")\
.option("url", db_properties['url'])\
.option("dbtable", db_properties['table'])\
.option("user", db_properties['username'])\
.option("password", db_properties['password'])\
.option("Driver", db_properties['driver'])\
.save()

In [7]:
# df = sqlContext.read.format("jdbc")\
#     .option("url", db_properties['url'])\
#     .option("dbtable", db_properties['table'])\
#     .option("user", db_properties['username'])\
#     .option("password", db_properties['password'])\
#     .option("Driver", db_properties['driver'])\
#     .load()

# df.show(1, vertical=True)

In [8]:
# cols = df.columns

# for column in cols:
#     new_column = column.replace('.', '_')
#     df = df.withColumnRenamed(column, new_column)
    
# df.printSchema()

In [9]:
print(df_train.count())
print(df_test.count())
# print(f'{df.count()} {df_train.count()+df_test.count()}')

231646
99290


### Task-II: Conduct analytics on your dataset (20% of course project grade)
Develop Python functions that run Spark to answer the following questions. All of the core analysis and data ingestion should be conducted via PySpark. Ingest all the data to answer the following questions from the Postgres Database table.
1. What is the average length of an MQTT message captured in the training dataset?
2. For each target value, what is the average length of the TCP message? (Conduct this process programmatically and don’t hardcode any of the target values in your command)
3. Build a Python function that uses PySpark to list the most frequent X TCP flags where X is a user-provided parameter.
    - Make sure to handle this scenario as well: if the user requests the 5 most frequent TCP flags but 3 flags share the same count at rank 5, include all of them in your output.
4. Among the listed targets, what is the most popular target on Google News? (Use a 5-minute Google News feed to justify your answer.)
    - Use this query: https://news.google.com/rss/search?q=popular+cyber+attacks
    - You may need to map the target values in the dataset to their proper English equivalents, for example “bruteforce” to “brute force”.
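The tie rule in question 3 can be sketched in plain Python before translating it to Spark. This is only an illustration on made-up data: `top_x_with_ties` is a hypothetical helper, and the sample values are not from the dataset. The idea is to take the count of the X-th entry as a cutoff and keep every value whose count is at least that cutoff.

```python
from collections import Counter


def top_x_with_ties(values, x):
    """Return (value, count) pairs for the x most frequent values, keeping ties at rank x."""
    if x <= 0:
        return []
    ranked = Counter(values).most_common()  # sorted by count, descending
    if x >= len(ranked):
        return ranked
    cutoff = ranked[x - 1][1]  # count of the x-th most frequent value
    return [(value, count) for value, count in ranked if count >= cutoff]


# Made-up sample: 'c' and 'd' tie for second place
sample = ["a"] * 5 + ["b"] * 2 + ["c"] * 3 + ["d"] * 3
print(top_x_with_ties(sample, 2))
```

In PySpark the same cutoff could be expressed with `rank()` over a `Window` ordered by the count column in descending order, keeping the rows whose rank is at most X.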

In [10]:
df = sqlContext.read.format("jdbc")\
    .option("url", db_properties['url'])\
    .option("dbtable", db_properties['table'])\
    .option("user", db_properties['username'])\
    .option("password", db_properties['password'])\
    .option("Driver", db_properties['driver'])\
    .load()

df.show(1, vertical=True)

-RECORD 0--------------------------------
 tcp_flags                  | 0x00000018 
 tcp_time_delta             | 0.998867   
 tcp_len                    | 10         
 mqtt_conack_flags          | 0          
 mqtt_conack_flags_reserved | null       
 mqtt_conack_flags_sp       | null       
 mqtt_conack_val            | null       
 mqtt_conflag_cleansess     | null       
 mqtt_conflag_passwd        | null       
 mqtt_conflag_qos           | null       
 mqtt_conflag_reserved      | null       
 mqtt_conflag_retain        | null       
 mqtt_conflag_uname         | null       
 mqtt_conflag_willflag      | null       
 mqtt_conflags              | 0          
 mqtt_dupflag               | null       
 mqtt_hdrflags              | 0x00000030 
 mqtt_kalive                | null       
 mqtt_len                   | null       
 mqtt_msg                   | 32         
 mqtt_msgid                 | null       
 mqtt_msgtype               | null       
 mqtt_proto_len             | null

In [11]:
# (1)
from pyspark.sql.functions import length, avg

def avg_len_mqtt_msg_train(df):
    df_train = df[df['Train'] == 1]
    df_mqtt_msg_avg = df_train.agg(avg(length('mqtt_msg')).alias('mqtt_avg_len'))
    avg_length = df_mqtt_msg_avg.collect()[0]["mqtt_avg_len"]
    print(f'Average length of the mqtt msg in training set is {avg_length}')
    
avg_len_mqtt_msg_train(df)


Average length of the mqtt msg in training set is 33.87447657201074


In [12]:
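# (2)

def avg_len_tcp_cmd_all_targets(df):
    # Group by target so that no target value is hard-coded.
    # Caveat: length() casts the integer tcp_len to a string, so this averages
    # the number of digits in tcp_len; avg('tcp_len') would average the field itself.
    df_tcp_msg_avg_len = df.groupBy('target').agg(avg(length('tcp_len')).alias('tcp_msg_avg_len'))
    df_tcp_msg_avg_len.show()
    
avg_len_tcp_cmd_all_targets(df)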

+----------+------------------+
|    target|   tcp_msg_avg_len|
+----------+------------------+
|   slowite|1.1112801564877202|
|bruteforce|1.1034411419902075|
|     flood|2.8890701468189235|
| malformed|1.2652874404979861|
|       dos|2.0167238718297207|
|legitimate|1.6122577252920594|
+----------+------------------+
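The averages in this table all fall between 1 and 3, which is what one would expect if `length()` is counting the characters of the integer `tcp_len` after it is cast to a string. The plain-Python sketch below, on made-up values, shows how far the average digit count can be from the average of the values themselves; the list `tcp_lens` is purely illustrative.

```python
from statistics import mean

# Made-up tcp_len values, for illustration only
tcp_lens = [0, 10, 1460, 4]

# What avg(length(col)) measures: the mean number of digits
avg_digits = mean(len(str(v)) for v in tcp_lens)

# What avg(col) measures: the mean of the values
avg_value = mean(tcp_lens)

print(avg_digits)
print(avg_value)
```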



In [13]:
from pyspark.sql.functions import desc

def get_frequent_tcp_flags(x, df):
    # Count rows per flag value, most frequent first
    df_temp = df.groupBy('tcp_flags').count().orderBy(desc('count'))
    num_distinct_flags = df_temp.count()
    if x <= 0:
        print(f'{x} is too small, give a legitimate number')
        return
    if x > num_distinct_flags:
        print(f'{x} is out of bounds, give a smaller number')
        return
    print(f'{x} most frequent tcp flags are given below')
    # show() prints the table itself and returns None, so it is not wrapped in print().
    # Note: show(x) stops at exactly x rows, so flags tied at rank x are not all included.
    df_temp.show(x)

for x in range(10):
    get_frequent_tcp_flags(x, df)

0 is too small, give a legitimate number
1 most frequent tcp flags are given below
+----------+------+
| tcp_flags| count|
+----------+------+
|0x00000018|183076|
+----------+------+
only showing top 1 row

2 most frequent tcp flags are given below
+----------+------+
| tcp_flags| count|
+----------+------+
|0x00000018|183076|
|0x00000010|134547|
+----------+------+
only showing top 2 rows

3 most frequent tcp flags are given below
+----------+------+
| tcp_flags| count|
+----------+------+
|0x00000018|183076|
|0x00000010|134547|
|0x00000011|  4198|
+----------+------+
only showing top 3 rows

4 most frequent tcp flags are given below
+----------+------+
| tcp_flags| count|
+----------+------+
|0x00000018|183076|
|0x00000010|134547|
|0x00000011|  4198|
|0x00000002|  3372|
+----------+------+
only showing top 4 rows

5 most frequent tcp flags are given below
+----------+------+
| tcp_flags| count|
+----------+------+
|0x00000018|183076|
|0x00000010|134547|
|0x0000001

In [14]:
from confluent_kafka import Producer
import socket
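import os
# Initialize your parameters here. Keep the variable values as is for the ones you can't find on the Confluent-Kafka connection.
KAFKA_CONFIG = {
    "bootstrap.servers":"pkc-lzvrd.us-west4.gcp.confluent.cloud:9092",
    "security.protocol":"SASL_SSL",
    "sasl.mechanisms":"PLAIN",
    # The API key and secret are read from the environment rather than hard-coded
    # (KAFKA_API_KEY and KAFKA_API_SECRET are placeholder variable names)
    "sasl.username":os.environ["KAFKA_API_KEY"],
    "sasl.password":os.environ["KAFKA_API_SECRET"],
    "session.timeout.ms":"45000",
    # group.id and auto.offset.reset are consumer settings; they are kept here
    # because the same dict also configures the Consumer later in the notebook
    "group.id":"python-group-1",
    'auto.offset.reset': 'smallest',
    'client.id': socket.gethostname()
}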

# Update your topic name
topic_name = "topic_0"
producer = Producer(KAFKA_CONFIG)


In [15]:
import feedparser
import time

# Search Google News for "popular cyber attacks"
feed_url = "https://news.google.com/rss/search?q=popular+cyber+attacks"
duration_minutes = 5


def extract_news_feed(feed_url):
    start_time = time.time()
    end_time = start_time + (duration_minutes * 60)
    
    extracted_articles = set()  # link-title ids already sent to Kafka
    
    seconds = 20
    print('Start time: 0 sec...')
    while time.time() < end_time:
        # Re-fetch the feed on every pass so that articles published
        # during the window are picked up
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            link = entry.link
            title = entry.title.encode('ascii', 'ignore').decode()
            unique_id = f'{link}-{title}'
            if unique_id in extracted_articles:
                continue
            extracted_articles.add(unique_id)
            producer.produce(topic_name, key=title, value=link)
        producer.flush()
        if time.time()-start_time > seconds:
            print(f'Checkpoint {time.time()-start_time} sec...')
            seconds += 20
        time.sleep(1)  # avoid a busy loop between polls
    print(f'End time: {time.time()-start_time} sec...')

    
extract_news_feed(feed_url)

Start time: 0 sec...
Checkpoint 20.00419044494629 sec...
Checkpoint 40.014198541641235 sec...
Checkpoint 60.014158964157104 sec...
Checkpoint 80.00529956817627 sec...
Checkpoint 100.00778102874756 sec...
Checkpoint 120.0051155090332 sec...
Checkpoint 140.0073642730713 sec...
Checkpoint 160.01463890075684 sec...
Checkpoint 180.00393629074097 sec...
Checkpoint 200.00091552734375 sec...
Checkpoint 220.00052618980408 sec...
Checkpoint 240.00517797470093 sec...
Checkpoint 260.01083540916443 sec...
Checkpoint 280.01047539711 sec...
End time: 300.0124797821045 sec...


In [16]:
from confluent_kafka import Consumer
from pyspark.sql.types import *
import string
import time


# Clean the punctuation by making a translation table that maps punctuation to empty strings
translator = str.maketrans("", "", string.punctuation)


emp_RDD = spark.sparkContext.emptyRDD()
# Defining the schema of the DataFrame
columns = StructType([StructField('key', StringType(), False),
                      StructField('value', StringType(), False)])

# Creating an empty DataFrame
df2 = spark.createDataFrame(data=emp_RDD,
                                   schema=columns)
 
# Printing the DataFrame with no data
df2.show()

consumer = Consumer(KAFKA_CONFIG)
consumer.subscribe([topic_name])

rows = []  # collected (key, value) pairs; one DataFrame is built at the end

try:
    i = 0
    # Stop after 10 polls that return no message
    while i < 10:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            i = i + 1
            print("Waiting...")
            continue
        if msg.error():
            # Skip error events instead of trying to decode them
            print(f"Consumer error: {msg.error()}")
            continue
        key = msg.key().decode('utf-8').lower().translate(translator)
        cleaned_key = " ".join(key.split())
        value = msg.value().decode('utf-8')
        rows.append((cleaned_key, value))

except KeyboardInterrupt:
    pass
finally:
    consumer.close()
    if rows:
        df2 = spark.createDataFrame(rows, columns)
    df2.show()


+---+-----+
|key|value|
+---+-----+
+---+-----+

Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
Waiting...
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|16th af san anton...|https://news.goog...|
|the cyberwar betw...|https://news.goog...|
|nsa releases a re...|https://news.goog...|
|honeywell unveils...|https://news.goog...|
|second annual pon...|https://news.goog...|
|gillibrand encour...|https://news.goog...|
|hackers could exp...|https://news.goog...|
|cyberattacks know...|https://news.goog...|
|microsoft how cyb...|https://news.goog...|
|google amazon fac...|https://news.goog...|
|why user buyin is...|https://news.goog...|
|us electrical gri...|https://news.goog...|
|researchers uncov...|https://news.goog...|
|suspected cyberat...|https://news.goog...|
|cyber attacks on ...|https://news.goog...|
|cybersecurity tre...|https://news.goog...|
|subm

In [17]:
# Count attack mentions in the streamed titles
from pyspark.sql.functions import col, explode, split

attacks = ['slowite', 'brute force', 'flood', 'malformed', 'dos', 'ddos']
# not including "legitimate" as it is not an attack
# Caveat: titles are split into single words below, so the two-word entry
# 'brute force' can never match a token.


streamed_data = df2.withColumn('word', explode(split(col('key'), ' '))) \
                .filter(col('word').isin(attacks)) \
                .groupBy('word') \
                .count() \
                .sort('count', ascending=False)
    
streamed_data.show()

+----+-----+
|word|count|
+----+-----+
+----+-----+
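One limitation of token-level matching is that a multi-word name such as "brute force" can never equal a single word. A possible alternative is to look for whole phrases in each cleaned title. The plain-Python sketch below illustrates that idea on made-up headlines; `TARGET_PHRASES`, its phrase lists, and `count_target_mentions` are assumptions for illustration and not part of the notebook. Unlike the consumer cell, it replaces punctuation with spaces so that hyphenated terms stay separable.

```python
import string
from collections import Counter

# Dataset target -> phrases to look for in headlines.
# The phrase lists are illustrative assumptions, not part of the dataset.
TARGET_PHRASES = {
    "slowite": ["slowite"],
    "bruteforce": ["brute force", "bruteforce"],
    "flood": ["flood"],
    "malformed": ["malformed"],
    "dos": ["dos", "denial of service"],
}

# Replace punctuation with spaces so "denial-of-service" becomes "denial of service"
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def count_target_mentions(titles):
    """Count, per target, the titles that mention one of its phrases as whole words."""
    counts = Counter()
    for title in titles:
        words = title.lower().translate(_PUNCT_TO_SPACE).split()
        padded = " " + " ".join(words) + " "  # padding makes whole-word checks simple
        for target, phrases in TARGET_PHRASES.items():
            if any(f" {phrase} " in padded for phrase in phrases):
                counts[target] += 1
    return counts


# Made-up headlines, for illustration only
titles = [
    "Brute-force attacks on routers rise",
    "New DoS flood hits MQTT brokers",
    "What a denial-of-service attack looks like",
]
print(count_target_mentions(titles).most_common())
```

Because the check is on whole words, a title that only mentions "DDoS" is not counted toward "dos" in this sketch; whether that is the desired behavior is a judgment call for the analysis.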

