# Import required libraries
This section imports the libraries the notebook needs.

In [1]:
import requests
import json
from pyspark.sql import SparkSession
from google.cloud import storage
from google.oauth2 import service_account

# Initialize the Spark session
This section prepares the Spark session, including the GCS connector.

In [2]:
# Example of adding the GCS connector to a local Spark session
spark = SparkSession.builder \
    .appName("MockServerToGCS") \
    .config("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/Users/Kevin/Documents/GitHub/Transferarbeit/Prototyp_Transferarbeit_Lokal/Setup/prototyp-etl-pipline-d6cbb438aa70.json") \
    .getOrCreate()
    
# Inspect the SparkSession
spark

24/06/16 20:48:15 WARN Utils: Your hostname, MacBook-Pro-3.local resolves to a loopback address: 127.0.0.1; using 192.168.1.229 instead (on interface en0)
24/06/16 20:48:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /Users/Kevin/.ivy2/cache
The jars for the packages stored in: /Users/Kevin/.ivy2/jars
com.google.cloud.bigdataoss#gcs-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-eba9226e-19de-44a3-9a78-29d1d9a3f928;1.0
	confs: [default]


:: loading settings :: url = jar:file:/opt/anaconda3/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found com.google.cloud.bigdataoss#gcs-connector;hadoop3-2.2.2 in central
	found com.google.api-client#google-api-client-jackson2;1.31.3 in central
	found com.google.api-client#google-api-client;1.31.3 in central
	found com.google.oauth-client#google-oauth-client;1.31.2 in central
	found com.google.http-client#google-http-client;1.39.0 in central
	found org.apache.httpcomponents#httpclient;4.5.13 in central
	found org.apache.httpcomponents#httpcore;4.4.14 in central
	found commons-logging#commons-logging;1.2 in central
	found commons-codec#commons-codec;1.15 in central
	found com.google.code.findbugs#jsr305;3.0.2 in central
	found com.google.guava#guava;30.1-jre in central
	found com.google.guava#failureaccess;1.0.1 in central
	found com.google.guava#listenablefuture;9999.0-empty-to-avoid-conflict-with-guava in central
	found org.checkerframework#checker-qual;3.5.0 in central
	found com.google.errorprone#error_prone_annotations;2.5.1 in central
	found com.google.j2objc#j2objc-annotatio
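
The session above hard-codes the service account key path in the notebook. A minimal sketch of one alternative, assuming the path is supplied through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable instead; the helper names `gcs_spark_conf` and `apply_conf` are hypothetical and not part of the original notebook:

```python
import os


def gcs_spark_conf(keyfile_env="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the GCS connector options from the cell above as a dict.

    Assumption: the key file path lives in an environment variable
    rather than in the notebook source.
    """
    keyfile = os.environ.get(keyfile_env)
    if not keyfile:
        raise RuntimeError(f"Environment variable {keyfile_env} is not set")
    return {
        "spark.jars.packages": "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.2",
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile,
    }


def apply_conf(builder, conf):
    """Call builder.config(key, value) for each option and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With these helpers the session could be created as `apply_conf(SparkSession.builder.appName("MockServerToGCS"), gcs_spark_conf()).getOrCreate()`, which keeps the key path out of the notebook and out of version control.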

# Fetch the data from the mock server
This section fetches the data from a mock server.


In [3]:
# Mock server URL
mockserver_url = 'http://localhost:3000/applicants'

# Fetch the data from the mock server; the timeout keeps the cell from hanging
response = requests.get(mockserver_url, timeout=10)
if response.status_code != 200:
    raise Exception(f"Failed to retrieve data: {response.status_code}")

try:
    data = response.json()
except ValueError as e:
    # The JSON decode errors raised by response.json() all subclass ValueError
    print("Error decoding JSON:", e)
    print("Response text:", response.text)
    raise


24/06/16 20:48:27 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
24/06/16 23:28:59 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 5185506 ms exceeds timeout 120000 ms
24/06/16 23:28:59 WARN SparkContext: Killing executors is not supported by current scheduler.
24/06/16 23:28:59 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:56)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:310)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.util.Rp
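
The request cell above stores whatever the server returns in `data` without checking its shape. A minimal sketch of a check that could run before the data is written to disk, assuming the `/applicants` endpoint returns a JSON array of objects; the notebook does not show the payload, so both that shape and the helper name `validate_records` are assumptions:

```python
def validate_records(data):
    """Return data unchanged if it is a non-empty list of dicts, else raise.

    Assumption: the mock server responds with a JSON array of objects.
    """
    if not isinstance(data, list):
        raise TypeError(f"Expected a list of records, got {type(data).__name__}")
    if not data:
        raise ValueError("The response contained no records")
    for index, record in enumerate(data):
        if not isinstance(record, dict):
            raise TypeError(f"Record {index} is {type(record).__name__}, expected dict")
    return data
```

Calling `data = validate_records(response.json())` would then fail early in the notebook rather than uploading an empty or malformed file.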

# Save the data to a temporary JSON file
This section saves the fetched data to a temporary JSON file.


In [5]:
# Temporary path for the JSON file
temp_json_path = '/tmp/mock_data.json'

# Save the data to a JSON file
with open(temp_json_path, 'w', encoding='utf-8') as f:
    json.dump(data, f)
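
The cell above writes to the fixed path `/tmp/mock_data.json`, which exists only on Unix-like systems and is overwritten on every run. A minimal sketch of a variant that lets Python's `tempfile` module pick a unique path; the helper name `dump_to_temp_json` is hypothetical:

```python
import json
import tempfile


def dump_to_temp_json(data, prefix="mock_data_"):
    """Write data as UTF-8 JSON to a new temporary file and return its path."""
    # delete=False keeps the file on disk after the with-block closes it,
    # so it can still be uploaded afterwards
    with tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8", prefix=prefix, suffix=".json", delete=False
    ) as f:
        json.dump(data, f, ensure_ascii=False)
        return f.name
```

Used as `temp_json_path = dump_to_temp_json(data)`, the rest of the notebook would work unchanged. Because `delete=False` is set, the file has to be removed explicitly (for example with `os.remove`) once the upload has finished.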


# Configure Google Cloud Storage and upload the JSON file
This section configures the Google Cloud Storage client and uploads the JSON file.


In [6]:
# Set the names of the Google Cloud Storage bucket and the destination object
bucket_name = 'prod_prototype'
destination_blob_name = 'bronze/applicant_data_raw'

# Load the service account key file
service_account_json = '/Users/Kevin/Documents/GitHub/Transferarbeit/Prototyp_Transferarbeit_Lokal/Setup/prototyp-etl-pipline-d6cbb438aa70.json'
credentials = service_account.Credentials.from_service_account_file(service_account_json)

# Initialize the Google Cloud Storage client
client = storage.Client(credentials=credentials, project='prototyp-etl-pipline')
bucket = client.bucket(bucket_name)
blob = bucket.blob(destination_blob_name)

# Upload the JSON file
blob.upload_from_filename(temp_json_path)
print(f"File successfully uploaded to {destination_blob_name} in bucket {bucket_name}.")



File successfully uploaded to bronze/applicant_data_raw in bucket prod_prototype.


24/06/16 18:08:23 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
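
The upload above always targets the object name `bronze/applicant_data_raw`, so each run replaces the previous file, and the name carries no file extension. A minimal sketch of a date-partitioned naming scheme; the layout and the helper name `build_blob_name` are assumptions for illustration, not something the original notebook prescribes:

```python
from datetime import datetime, timezone


def build_blob_name(prefix, stem, when=None, extension="json"):
    """Build an object name of the form '<prefix>/<YYYY-MM-DD>/<stem>.<extension>'."""
    # Default to the current UTC time so runs on different days do not collide
    when = when or datetime.now(timezone.utc)
    return f"{prefix}/{when:%Y-%m-%d}/{stem}.{extension}"
```

The result could be assigned to `destination_blob_name` before `bucket.blob(...)` is called. Passing `content_type="application/json"` to `blob.upload_from_filename` would additionally record the file type on the uploaded object.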
