# Spark Project

**Dataset** - https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset

Step 1 - Create a producer with the Python connector for Confluent Kafka and
stream your data.

Step 2 - Consume your data through the Python connector and store it in
MongoDB Atlas.<br>
Note: The dataset contains multiple files; use all of them for the Kafka and
MongoDB steps.
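
Step 2 can be sketched as follows. Confluent's schema-aware serializers prefix each message with a 5-byte header (a zero magic byte and a 4-byte big-endian schema ID) before the JSON body, so the consumer has to strip that header before storing the record. The helper names `decode_confluent_json` and `consume_to_mongo`, the database name `covid`, and the idle-poll limit are assumptions made for illustration; the library's own `JSONDeserializer` is an alternative that also validates against the schema.

```python
import json
import struct


def decode_confluent_json(payload: bytes):
    """Split a Confluent-framed message into (schema_id, record dict).

    The framing is 1 magic byte (0) + 4-byte big-endian schema ID + JSON body.
    """
    if len(payload) < 5 or payload[0] != 0:
        raise ValueError("payload is not in Confluent wire format")
    (schema_id,) = struct.unpack(">I", payload[1:5])
    return schema_id, json.loads(payload[5:].decode("utf-8"))


def consume_to_mongo(consumer_conf, topic, mongo_uri, db_name="covid",
                     max_messages=100, max_idle_polls=10):
    """Read up to max_messages from a topic and insert them into MongoDB.

    Third-party imports are local so the decoder above can be used without
    Kafka or MongoDB installed. db_name is an assumed name; the collection
    is named after the topic.
    """
    from confluent_kafka import Consumer
    from pymongo import MongoClient

    collection = MongoClient(mongo_uri)[db_name][topic]
    # consumer_conf needs 'group.id' in addition to the SASL settings.
    consumer = Consumer(consumer_conf)
    consumer.subscribe([topic])
    stored, idle = 0, 0
    try:
        while stored < max_messages and idle < max_idle_polls:
            msg = consumer.poll(1.0)
            if msg is None:  # nothing arrived within the timeout
                idle += 1
                continue
            idle = 0
            if msg.error():
                print("Consumer error: {}".format(msg.error()))
                continue
            _, record = decode_confluent_json(msg.value())
            collection.insert_one(record)
            stored += 1
    finally:
        consumer.close()
    return stored
```

Calling `consume_to_mongo` once per topic would cover all the files; the connection string for MongoDB Atlas is passed in as `mongo_uri`.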

Step 3 - Collect your data as a PySpark dataframe and perform different
operations.<br>
Note: Of all the files, use only three to create dataframes: Case, Region
and TimeProvince.<br>
1. Read the data, show it and count the number of records.
2. Describe the data with the describe function.
3. If there are any duplicate rows, drop them.
4. Use the limit function to show a limited number of records.
5. If a column name is not suitable, rename the column. [optional]
6. Select a subset of the columns.
7. If there are any null values, fill them with a value of your choice or
drop them.
8. Filter the data on different columns or variables and do the best
analysis you can.
<br>For example: a dataframe can be filtered on multiple conditions combined
with AND (&), OR (|) and NOT (~). We may want to find all the different
infection_case values in Daegu province with more than 10 confirmed cases.
9. Sort by the number of confirmed cases (the confirmed column in the
dataset). Try a descending sort as well.
10. If a column has the wrong data type, cast it, for example from integer
to string or from string to integer.
11. Group by the province and city columns and aggregate with the sum of
confirmed cases. For example:
`df.groupBy(["province","city"]).agg(function.sum("confirmed"))`
12. For joins we need one more file; you can use the Region file. Try
different join methods. For example:
`cases.join(regions, ['province','city'], how='left')`
<br>You can do your best analysis.

Step 5 - If you want, you can also use SQL with dataframes. Let us try to
run some SQL on the cases table.<br>
For example:<br>
cases.registerTempTable('cases_table')<br>
newDF = sqlContext.sql('select * from cases_table where confirmed>100')<br>
newDF.show()<br>
<br>
This example shows how to query a dataframe with SQL. From here you can
perform various operations with GROUP BY, HAVING and ORDER BY.

Step 6 - Create Spark UDFs<br>
Create a function casehighlow().<br>
If the number of cases is less than 50, return low; otherwise return high.<br>
Convert it into a UDF and specify the return type of the function.<br>
Note: You can create as many UDFs as your analysis needs.
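
One possible shape for this step is sketched below. The plain function carries the rule, and a small wrapper turns it into a UDF with an explicit `StringType` return type. The wrapper name `add_case_level`, the output column `case_level` and the handling of nulls are assumptions; the task itself only specifies the threshold of 50.

```python
def casehighlow(confirmed):
    """Label a confirmed-case count: 'low' below 50, otherwise 'high'."""
    if confirmed is None:  # assumption: nulls are passed through unchanged
        return None
    return "low" if confirmed < 50 else "high"


def add_case_level(df, column="confirmed"):
    """Return df with a 'case_level' column computed by the UDF."""
    # Imported here so casehighlow stays testable without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    # The second argument states the UDF's return type explicitly.
    casehighlow_udf = udf(casehighlow, StringType())
    return df.withColumn("case_level", casehighlow_udf(col(column)))
```

With a dataframe `cases` that has an integer `confirmed` column, `add_case_level(cases).show()` would display the new label next to each row.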

### Step 1 - Create a producer with a python connector in confluent kafka and stream your data.

### Importing all needed packages


In [1]:
# importing packages

import argparse
from uuid import uuid4
from six.moves import input
from confluent_kafka import Producer
from confluent_kafka.serialization import StringSerializer, SerializationContext, MessageField
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.json_schema import JSONSerializer

import pandas as pd
import json
from typing import List
from zipfile import ZipFile

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType, DoubleType, IntegerType

In [2]:
# Kafka essentials details

# Variables
API_KEY = 'AMTQUJ4OYGGMNYOJ'
ENDPOINT_SCHEMA_URL  = 'https://psrc-8qyy0.eastus2.azure.confluent.cloud'
API_SECRET_KEY = '4HTAJgLsRtyeQjCpKZItrYiCs7eduapoIzAkFApIZer4nzPQ2i53tPWe58TGuPY/'
BOOTSTRAP_SERVER = 'pkc-n00kk.us-east-1.aws.confluent.cloud:9092'
SECURITY_PROTOCOL = 'SASL_SSL'
SSL_MACHENISM = 'PLAIN'
SCHEMA_REGISTRY_API_KEY = 'BS3PHBXGYPRUDGFP'
SCHEMA_REGISTRY_API_SECRET = 'gMwNVdkw1Cn5pFHmtFKrZam64E5HeDEo1dJu2hxCcUSSjaIkuscBUw6ucAlmjT2l'
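
Keys written directly into a notebook are easy to leak when the notebook is shared. A minimal sketch of reading the same settings from environment variables instead; the variable names and the helper `load_kafka_settings` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; adjust to your own setup.
REQUIRED_VARS = (
    "KAFKA_API_KEY",
    "KAFKA_API_SECRET",
    "KAFKA_BOOTSTRAP_SERVER",
    "SCHEMA_REGISTRY_URL",
    "SCHEMA_REGISTRY_API_KEY",
    "SCHEMA_REGISTRY_API_SECRET",
)


def load_kafka_settings(environ=os.environ):
    """Return the connection settings as a dict, failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

The returned dict could then feed the producer and schema registry configuration in place of the module-level constants.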

## Kafka Producer class

In [87]:
def sasl_conf():
    # connection settings for the producer to Confluent Kafka
    sasl_conf = {'sasl.mechanism': SSL_MACHENISM,
                 # Set to SASL_SSL to enable TLS support.
                #  'security.protocol': 'SASL_PLAINTEXT'}
                'bootstrap.servers':BOOTSTRAP_SERVER,
                'security.protocol': SECURITY_PROTOCOL,
                'sasl.username': API_KEY,
                'sasl.password': API_SECRET_KEY
                }
    return sasl_conf


def schema_config():
    # schema registry authentication
    return {'url':ENDPOINT_SCHEMA_URL,
    'basic.auth.user.info':f"{SCHEMA_REGISTRY_API_KEY}:{SCHEMA_REGISTRY_API_SECRET}"
    }

def delivery_report(err, msg):
    """
    Reports the success or failure of a message delivery.
    Args:
        err (KafkaError): The error that occurred, or None on success.
        msg (Message): The message that was produced or failed.
    """

    if err is not None:
        print("Delivery failed for User record {}: {}".format(msg.key(), err))
        return
    print('User record {} successfully produced to {} [{}] at offset {}'.format(
        msg.key(), msg.topic(), msg.partition(), msg.offset()))

class Car: 
    # constructor  
    def __init__(self,record:dict):
        for k,v in record.items():
            setattr(self,k,v)
        
        self.record=record
   
    @staticmethod
    def dict_to_car(data:dict,ctx):
        return Car(record=data)

    def __str__(self):
        return f"{self.record}"

def car_to_dict(car:Car, ctx):
    """
    Returns a dict representation of a Car instance for serialization.
    Args:
        car (Car): Car instance wrapping one CSV record.
        ctx (SerializationContext): Metadata pertaining to the serialization
            operation.
    Returns:
        dict: Dict populated with the record's fields to be serialized.
    """

    # the record dict is already in serializable form
    return car.record

# read the data from a csv file and yield one Car per row
def get_car_instance(file_path, columns):
    df = pd.read_csv(file_path)
    # fill missing values: 0 for integer columns, "*miss*" for all others
    for col_name, has_nan in df.isna().any().items():
        if has_nan:
            if df[col_name].dtype.name in ('int64', 'int32'):
                # assign back instead of inplace=True on a column selection,
                # which newer pandas versions do not apply to the dataframe
                df[col_name] = df[col_name].fillna(0)
            else:
                df[col_name] = df[col_name].fillna("*miss*")
    for data in df.values:
        yield Car(dict(zip(columns, data)))

def streamingToKafka(FILE_PATH, topic, schema_id, columns):
    schema_registry_conf = schema_config()
    schema_registry_client = SchemaRegistryClient(schema_registry_conf)

    schema_str = schema_registry_client.get_schema(schema_id).schema_str

    string_serializer = StringSerializer('utf_8')
    json_serializer = JSONSerializer(schema_str, schema_registry_client, car_to_dict)

    producer = Producer(sasl_conf())

    print("Producing user records to topic {}. ^C to exit.".format(topic))
    try:
        for car in get_car_instance(FILE_PATH, columns):
            # Serve on_delivery callbacks from previous calls to produce()
            producer.poll(0.0)
            print(car)
            # the key is a random UUID string; the value is the JSON-serialized record
            producer.produce(topic=topic,
                            key=string_serializer(str(uuid4())),
                            value=json_serializer(car, SerializationContext(topic, MessageField.VALUE)),
                            on_delivery=delivery_report)
    except KeyboardInterrupt:
        pass
    except ValueError:
        print("Invalid input, discarding record...")

    # block until all queued messages are delivered
    print("\nFlushing records...")
    producer.flush()

In [99]:
file_name_list = []
file_name = "dataset.zip"
with ZipFile(file_name, 'r') as zipfile:
    for zipInfo in zipfile.filelist:
        print("Added: ", zipInfo.filename)
        file_name_list.append(zipInfo.filename)
    #zipfile.extractall()
    print("done")

Added:  Case.csv
Added:  PatientInfo.csv
Added:  Policy.csv
Added:  Region.csv
Added:  SearchTrend.csv
Added:  SeoulFloating.csv
Added:  Time.csv
Added:  TimeAge.csv
Added:  TimeGender.csv
Added:  TimeProvince.csv
Added:  Weather.csv
done


In [4]:
from pandas.io.json import build_table_schema

In [5]:
kafka_basic_schema = {
  "$id": "http://example.com/myURI.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "additionalProperties": False,
  "description": "Sample schema to help you get started.",
  "title": "SampleRecord",
  "type": "object"
}

## Case file processing

In [82]:
case_df = pd.read_csv('Case.csv')
case_df_schema = build_table_schema(case_df, index=False, version=False)['fields']

In [88]:
#renaming the dictionary key from name to description
# to match the kafka schema string
for value in case_df_schema:
    value['description'] = value.pop('name')

case_df_schema

[{'type': 'integer', 'description': 'case_id'},
 {'type': 'string', 'description': 'province'},
 {'type': 'string', 'description': 'city'},
 {'type': 'boolean', 'description': 'group'},
 {'type': 'string', 'description': 'infection_case'},
 {'type': 'integer', 'description': 'confirmed'},
 {'type': 'string', 'description': 'latitude'},
 {'type': 'string', 'description': 'longitude'}]

In [90]:
case_df_final_schema = kafka_basic_schema.copy()
case_df_final_schema['properties'] = {}
for value in case_df_schema:
    name = value['description']
    value.pop('description')
    case_df_final_schema['properties'][name] = value

In [91]:
case_df_final_schema

{'$id': 'http://example.com/myURI.schema.json',
 '$schema': 'http://json-schema.org/draft-07/schema#',
 'additionalProperties': False,
 'description': 'Sample schema to help you get started.',
 'title': 'SampleRecord',
 'type': 'object',
 'properties': {'case_id': {'type': 'integer'},
  'province': {'type': 'string'},
  'city': {'type': 'string'},
  'group': {'type': 'boolean'},
  'infection_case': {'type': 'string'},
  'confirmed': {'type': 'integer'},
  'latitude': {'type': 'string'},
  'longitude': {'type': 'string'}}}
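
The rename-and-nest steps above are repeated for each CSV file, so they can be folded into one helper. The sketch below is an assumption about how that might look, not code from the notebook: `fields_to_json_schema` is a hypothetical name, it takes the field list produced by `build_table_schema` before any renaming, and it keeps only each field's `type`.

```python
import copy
import json


def fields_to_json_schema(fields, base_schema):
    """Build a JSON Schema dict from Table Schema style field descriptors.

    `fields` is a list like [{'name': 'case_id', 'type': 'integer'}, ...],
    the shape found under the 'fields' key of build_table_schema's result.
    Neither input is modified.
    """
    schema = copy.deepcopy(base_schema)
    schema["properties"] = {
        field["name"]: {"type": field["type"]} for field in fields
    }
    return schema


if __name__ == "__main__":
    base = {"title": "SampleRecord", "type": "object", "additionalProperties": False}
    fields = [{"name": "case_id", "type": "integer"},
              {"name": "province", "type": "string"}]
    # json.dumps gives the string form that a schema registry expects
    print(json.dumps(fields_to_json_schema(fields, base)))
```

Because the base schema is deep-copied, the same `kafka_basic_schema` dict can be reused for every file without one file's properties leaking into the next.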

In [93]:
# schema_registry_conf = schema_config()
# schema_registry_client = SchemaRegistryClient(schema_registry_conf)

In [109]:
# Schema.schema_str = json.dumps(case_df_final_schema)
# Schema.schema_type = 'JSON'

In [None]:
# id = schema_registry_client.register_schema("case_schema", Schema)
# id

In [88]:
FILE_PATH = 'Case.csv'
topic = "topic_case"
schema_id = 100005
cols = list(pd.read_csv(FILE_PATH).columns)

In [None]:
streamingToKafka(FILE_PATH, topic, schema_id, cols)

## PatientInfo CSV file

In [53]:
patient_info_df = pd.read_csv('PatientInfo.csv')
patient_info_df_schema = build_table_schema(patient_info_df, index=False, version=False)['fields']

In [7]:
#renaming the dictionary key from name to description
# to match the kafka schema string
for value in patient_info_df_schema:
    value['description'] = value.pop('name')

patient_info_df_schema

[{'type': 'integer', 'description': 'patient_id'},
 {'type': 'string', 'description': 'sex'},
 {'type': 'string', 'description': 'age'},
 {'type': 'string', 'description': 'country'},
 {'type': 'string', 'description': 'province'},
 {'type': 'string', 'description': 'city'},
 {'type': 'string', 'description': 'infection_case'},
 {'type': 'string', 'description': 'infected_by'},
 {'type': 'string', 'description': 'contact_number'},
 {'type': 'string', 'description': 'symptom_onset_date'},
 {'type': 'string', 'description': 'confirmed_date'},
 {'type': 'string', 'description': 'released_date'},
 {'type': 'string', 'description': 'deceased_date'},
 {'type': 'string', 'description': 'state'}]

In [8]:
patient_info_df_final_schema = kafka_basic_schema.copy()
patient_info_df_final_schema['properties'] = {}
for value in patient_info_df_schema:
    name = value['description']
    value.pop('description')
    patient_info_df_final_schema['properties'][name] = value

In [11]:
print(json.dumps(patient_info_df_final_schema))

{"$id": "http://example.com/myURI.schema.json", "$schema": "http://json-schema.org/draft-07/schema#", "additionalProperties": false, "description": "Sample schema to help you get started.", "title": "SampleRecord", "type": "object", "properties": {"patient_id": {"type": "integer"}, "sex": {"type": "string"}, "age": {"type": "string"}, "country": {"type": "string"}, "province": {"type": "string"}, "city": {"type": "string"}, "infection_case": {"type": "string"}, "infected_by": {"type": "string"}, "contact_number": {"type": "string"}, "symptom_onset_date": {"type": "string"}, "confirmed_date": {"type": "string"}, "released_date": {"type": "string"}, "deceased_date": {"type": "string"}, "state": {"type": "string"}}}


In [57]:
patient_info_df.isna().any()

patient_id            False
sex                   False
age                   False
country               False
province              False
city                  False
infection_case        False
infected_by           False
contact_number        False
symptom_onset_date    False
confirmed_date        False
released_date         False
deceased_date         False
state                 False
dtype: bool

In [59]:
FILE_PATH = 'PatientInfo.csv'
topic = "topic_patientinfo"
schema_id = 100006
cols = list(pd.read_csv(FILE_PATH).columns)

In [None]:
%%capture output
streamingToKafka(FILE_PATH, topic, schema_id, cols)


## Policy CSV file

