# Working with a Database

This lesson requires a running database, so prepare one first.

Follow the instructions in *./postgres_in_docker/README.md*.

### Verify the DB is up
Run `$ docker ps | grep postg`

You should see output similar to:
```
ca09314c8dad   postgres   "docker-entrypoint.s…"   3 minutes ago    Up 3 minutes    5432/tcp    psqlserver
```

## Install the DB driver

Spark connects to a relational database through JDBC, so before using a database we need to install the JDBC *driver* for that specific database.

In our example, we use PostgreSQL.

The driver from https://jdbc.postgresql.org/download/ is already downloaded into the ./jars folder.

### Copy the driver to the Spark node (a Docker container in our case)
```
$ docker cp jars/postgresql-42.5.4.jar spark-lab:/usr/local/spark/jars
```
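
To confirm that the copy worked, one option is to list the driver jars in the Spark jars directory. The following is a minimal sketch meant to run inside the `spark-lab` container; the helper name `find_postgres_driver_jars` is hypothetical, and the default directory is the target path from the `docker cp` command above.

```python
from pathlib import Path


def find_postgres_driver_jars(jars_dir="/usr/local/spark/jars"):
    """Return the sorted file names of PostgreSQL JDBC driver jars in jars_dir."""
    directory = Path(jars_dir)
    if not directory.is_dir():
        # The directory does not exist (for example, when run outside the container)
        return []
    return sorted(p.name for p in directory.glob("postgresql-*.jar"))


if __name__ == "__main__":
    jars = find_postgres_driver_jars()
    print(jars if jars else "No PostgreSQL driver jar found")
```

An empty result suggests the jar was copied to a different path or into a different container.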



## Reading from and Writing to an RDBMS

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .config("spark.jars", "/usr/local/spark/jars/postgresql-42.5.4.jar")\
    .getOrCreate()

# Which database server are we connecting to?
# When running in local Docker, the Spark and Postgres containers share the same Docker network ('spark_backend').
# If the Postgres server is used ONLY by the Spark container, there is no need to expose its ports to the host.
hostname = "db"  # the service name in docker-compose.yml
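
Before attempting the JDBC connection, it can help to check that the database hostname resolves and accepts TCP connections from the Spark container. This is a minimal sketch using only the standard library; the helper name `can_reach` is hypothetical, and port 5432 is assumed because it is the port shown in the `docker ps` output.

```python
import socket


def can_reach(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

Calling `can_reach("db")` from inside the Spark container should return `True` when both containers are on the same Docker network. A `False` result usually points to a network or service-name problem rather than a Spark problem.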

In [2]:
server_name = f"jdbc:postgresql://{hostname}/"
database_name = "bids_db"
url = server_name + database_name
table_name = "players"
username = "postgres"  # your_dbuser_name_here
password = "postgres"

# There is no need to add 'option("driver", "org.postgresql.Driver")':
# the driver is located from the "jdbc:postgresql:" prefix of the URL.

jdbcDF = spark.read\
    .format("jdbc")\
    .option("url", url)\
    .option("dbtable", table_name)\
    .option("user", username)\
    .option("password", password).load()
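
The chain of `.option(...)` calls can also be collected into a dictionary and passed in a single `options(**...)` call. In addition, `dbtable` accepts anything that is valid in a SQL `FROM` clause, so a parenthesized subquery with an alias lets the database do the filtering. The sketch below assumes the same connection details as the cell above; the helper name `jdbc_options` and the `age >= 18` filter are illustrative only.

```python
def jdbc_options(hostname, database, table, user, password, port=5432):
    """Build the option dict for Spark's JDBC data source (PostgreSQL URL format)."""
    return {
        "url": f"jdbc:postgresql://{hostname}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Same connection details as in the cell above
opts = jdbc_options("db", "bids_db", "players", "postgres", "postgres")

# 'dbtable' may be a parenthesized subquery with an alias,
# so the filter runs in the database rather than in Spark.
adult_opts = dict(
    opts, dbtable="(SELECT name, age FROM players WHERE age >= 18) AS adults"
)

# With a live SparkSession named `spark`, the dict is passed in one call:
# adultsDF = spark.read.format("jdbc").options(**adult_opts).load()
```

Keeping the options in one place makes it easy to reuse the same connection settings for both reading and writing.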

In [3]:
jdbcDF.toPandas()

|   | name    | age | occupation |
|---|---------|-----|------------|
| 0 | Alice   | 25  | Rocker     |
| 1 | Bob     | 30  | Assasin    |
| 2 | Charlie | 50  | politician |
| 3 | David   | 10  | racer      |


In [4]:
# Add a few rows
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
playerSchema = StructType([StructField('name', StringType(), False),
                           StructField('Age', IntegerType(), False),
                           StructField('occupation', StringType(), False)
                           ])
newcomers = [('נעם', 59, 'Witcher'), ('Helga', 140, 'hag')]
newPlayers = spark.createDataFrame(data=newcomers, schema=playerSchema)
newPlayers.toPandas()

|   | name  | Age | occupation |
|---|-------|-----|------------|
| 0 | נעם   | 59  | Witcher    |
| 1 | Helga | 140 | hag        |


In [5]:

try:
    newPlayers.write \
        .format("jdbc") \
        .mode("append") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except Exception as error:  # JDBC failures surface as Py4J/Spark exceptions, not ValueError
    print("Connector write failed", error)
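
The same write can be expressed with `DataFrameWriter.jdbc`, where the save mode decides what happens when the table already exists (`append` adds rows, `overwrite` replaces the contents, `ignore` does nothing, `error` fails). The following is a minimal sketch; the helper name `write_table` and its return convention are hypothetical and not part of the lesson code.

```python
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_table(df, url, table, user, password, mode="append"):
    """Write df to a JDBC table; return True on success and False on failure."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported save mode: {mode!r}")
    try:
        df.write.jdbc(
            url=url,
            table=table,
            mode=mode,
            properties={"user": user, "password": password},
        )
        return True
    except Exception as error:  # broad catch: Spark/JDBC failures come in several types
        print("Connector write failed:", error)
        return False
```

With the variables defined earlier, a call might look like `write_table(newPlayers, url, table_name, username, password)`. Be careful with `mode="overwrite"`, since it replaces the existing rows of the table.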
    

And you can check in the *dbclient*:
```
bids_db=# select * from players;
  name   | age | occupation  
---------+-----+-------------
 Alice   |  25 |  Rocker
 Bob     |  30 |  Assasin
 Charlie |  50 |  politician
 David   |  10 |  racer
 נעם     |  59 | Witcher
 Helga   | 140 | hag
(6 rows)
```
PS: the Hebrew name appears misaligned in the output above. This is a display issue in the terminal (right-to-left text), not a problem with the stored data.

## Cleanup
Now that we are done, stop the database and remove it by following the steps in *./postgres_in_docker/README.md*.

# Reading / Writing to other databases

In the lesson on Streaming, we read from a Kafka source.
Similarly, we can read from other sources, such as MongoDB, using a *connector* supplied by the database vendor.
    

This will write to a default container in the database you connected to before.