# Getting Started

The base containers include:

- a PostgreSQL database;
- a Jupyter notebook server containing a range of scientific Python packages.

To get these scripts up and running, you will need to install Docker on your computer or be able to run Docker containers on a third-party cloud server.


Note: there seems to be a problem running newly downloaded versions of Docker on Macs :-(


The following `docker-compose.yml` file contents will download and install several useful containers. (Containers are like self-contained applications that run inside a virtual machine on your computer; they shouldn't upset or unnecessarily interact with any other software on your machine.)

    #----------------- START docker-compose.yml ------------------
    #mongodata:
    #    image: busybox
    #    volumes: 
    #        - /data/db

    #mongodbdd:
    #    image: mongo
    #    ports:
    #        - "27017:27017"
    #    volumes_from:
    #        - mongodata
    #    command: --smallfiles

    postgresdatadd:
        image: busybox
        volumes: 
            - /var/lib/postgresql/data

    postgresdd:
        image: postgres #mdillon/postgis 
        environment:
            - POSTGRES_PASSWORD=PGPass
        ports:
            - "5432:5432"
        volumes_from:
            - postgresdatadd

    #neo4jdd:
    #  image: neo4j
    #  ports:
    #    - "7474:7474"
    #    - "1337:1337"
    #  volumes:
    #    - /opt/data

    datadive:
        image: jupyter/scipy-notebook
        user: root
        environment:
            - GRANT_SUDO=yes
        ports:
            - "9988:8888"
        links:
            - postgresdd:postgres
            #- mongodbdd:mongodb
            #- neo4jdd:neo4j
        volumes:
            - .:/home/jovyan/work

    #----------------- END docker-compose.yml ------------------

Download and install Kitematic, a GUI for working with containers.

From the bottom right-hand corner of Kitematic you can open a Docker command line.

![](kitematic-dd.png)


On the Docker command line, change directory to the directory containing the `docker-compose.yml` file and run the command:

    docker-compose up -d

This will launch a database container linked to a Jupyter notebook server; the compose file publishes the notebook server on port 9988 of the Docker host. You should be able to launch it from within Kitematic.

To stop the containers: `docker-compose stop`

To start the containers back up again: `docker-compose start`

To destroy the containers and their contents: `docker-compose down`


The notebook container is lacking some libraries and packages we need to make life easier, so we install them from a code cell. The `-qq` and `--quiet` flags suppress most of the install output.

To run the code cell, click on it and press the run button in the tool bar at the top of the notebook or press *shift-return*. (See other keyboard shortcuts from the notebook *Help* menu.)

In [None]:
!sudo apt-get -qq update && sudo apt-get -qq install -y libpq-dev python-dev
!pip install --quiet ipython-sql psycopg2

Jupyter notebooks support SQL magics via the `ipython-sql` extension installed above. This lets us run SQL commands in a code cell.

In [5]:
%load_ext sql
#This is how we connect to a sql database
#Monolithic VM addressing style
%sql postgresql://postgres:PGPass@postgres:5432/postgres

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


'Connected: postgres@postgres'

We can also run SQL commands on the database using the `pandas` Python package, which connects through a SQLAlchemy engine.

In [None]:
from sqlalchemy import create_engine
engine = create_engine("postgresql://postgres:PGPass@postgres:5432/postgres")
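
As a minimal sketch of how *pandas* can run a query through a SQLAlchemy engine, the cell below uses `pd.read_sql_query`. It assumes an in-memory SQLite engine (`demo_engine`) as a stand-in so that it runs without the containers, and the `demo_companies` table and its contents are made up for illustration; in the notebook you would pass the PostgreSQL `engine` instead.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumption: an in-memory SQLite engine stands in for the PostgreSQL engine
# so the sketch runs anywhere. In the notebook, reuse `engine` instead.
demo_engine = create_engine("sqlite://")

# Write a tiny, made-up table so there is something to query
pd.DataFrame(
    {"name": ["A Ltd", "B Ltd"], "country": ["GB", "FR"]}
).to_sql("demo_companies", demo_engine, index=False, if_exists="replace")

# Run a SQL query and get the result back as a dataframe
result = pd.read_sql_query(
    "SELECT name FROM demo_companies WHERE country = 'GB'", demo_engine
)
print(result)
```

The result is an ordinary dataframe, so it can be filtered, charted or merged like any other.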

## Importing the basic orientation data

Loading CSV data directly into a PostgreSQL database often requires the CSV to be well behaved; *pandas* is a bit more forgiving on the import and can be used to load the data from the original CSV file and then insert it into the PostgreSQL database. As *pandas* dataframes are stored in memory, we need to load the data in from the source file in chunks of several thousand lines at a time.

In [None]:
#Define the path to the sample data
#Look for it in the current directory: companies-with-controlling-entities v6.csv
!ls

In [None]:
#If it's not in the current directory, also specify the path to the file
datafile="companies-with-controlling-entities v6.csv"

In [None]:
#preview the first 5 rows of the file contents
!head -n 5 "$datafile"

To make life easier working with the data, we'll use *pandas*, a library designed for working with tabular data.

In [None]:
#Import the data from the CSV file
import pandas as pd

*pandas* dataframes are held in memory, which can cause your machine to fall over if the datafile is a large one.

To work with the data, we can load it into a database.

*pandas* can create a database table, and populate it, from a dataframe without you having to create the database table first.

The PostgreSQL database table will be defined based on the dataframe column names and properties. We may need to force column datatypes when we read in the datafile, but for now we'll go with whatever the defaults turn out to be...
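
If the defaults do need overriding, one possible approach is sketched below: read every column as a string with `dtype=str` in `pd.read_csv`, then pass a `dtype` mapping to `to_sql` to set the database column types. The sample CSV, the `demo_types` table name and the in-memory SQLite engine are assumptions made purely so the sketch is self-contained; they are not part of the original data.

```python
import io
import pandas as pd
from sqlalchemy import create_engine, inspect
from sqlalchemy.types import Text

# Assumption: in-memory SQLite stands in for the PostgreSQL engine
demo_engine = create_engine("sqlite://")

# Made-up sample data with identifiers that have leading zeros
csv_text = "company_id,name\n001,A Ltd\n002,B Ltd\n"

# Reading everything as str stops pandas turning "001" into the integer 1
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Force the database column types rather than relying on the defaults
df.to_sql(
    "demo_types",
    demo_engine,
    index=False,
    if_exists="replace",
    dtype={"company_id": Text, "name": Text},
)

# Inspect the column types that were actually created in the database
columns = {
    col["name"]: str(col["type"])
    for col in inspect(demo_engine).get_columns("demo_types")
}
print(columns)
```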

Rather than load a possibly stupidly large datafile into a dataframe and from there into the database, *pandas* will let you load the data a bit at a time, chunking it a few hundred or thousand rows at a time, creating the database table for the first set of rows and then appending the later ones.

Balance the chunk size between speed (it can be quicker to load lots of rows at once) and not making your machine fall over (because you've loaded too many rows into memory via a *pandas* dataframe).

If the machine hangs for too long, restart the Python kernel (from the notebook *Kernel* menu). If that doesn't work, select and *Shutdown* the notebook from the notebook homepage.

Note that the data ingest may take some time - go and grab a coffee once you start to run the cell below.

In [None]:
#Import the base data into a table called: sigcontrol
#Drop the table if it already exists so we start from a blank state
%sql DROP TABLE IF EXISTS sigcontrol

#Load the data in ten thousand rows at a time
chunks=pd.read_csv(datafile,chunksize=10000,dtype=str)

#This may take some time - go and grab a coffee....

#Note that we can also read in from a compressed file - as long as it isn't password protected - e.g. for a gzip file:
#pd.read_csv(fn,chunksize=10000,dtype=str, compression='gzip')
for chunk in chunks:
    #Pop the data into the database table
    chunk.to_sql('sigcontrol',engine,index=False, if_exists='append')

Preview the data to check the ingest worked...

In [None]:
%sql SELECT * FROM sigcontrol LIMIT 3
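
Beyond eyeballing a few rows, a simple sanity check is to compare the number of rows read from the CSV with the number of rows that ended up in the table. The sketch below illustrates the idea on made-up data, assuming an in-memory SQLite engine and a hypothetical `demo_ingest` table so that it runs on its own; against the real data you would count rows in `sigcontrol` using the PostgreSQL `engine`.

```python
import io
import pandas as pd
from sqlalchemy import create_engine

# Assumption: in-memory SQLite stands in for the PostgreSQL engine
demo_engine = create_engine("sqlite://")

# Made-up CSV with 25 data rows
csv_text = "id,name\n" + "".join(f"{i},company {i}\n" for i in range(25))

# Load in chunks of 10 rows, keeping a running total of rows read
rows_read = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10, dtype=str):
    chunk.to_sql("demo_ingest", demo_engine, index=False, if_exists="append")
    rows_read += len(chunk)

# Count the rows that actually landed in the table
rows_in_table = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM demo_ingest", demo_engine
)["n"].iloc[0]

print(rows_read, rows_in_table)
```

If the two numbers differ, the ingest was probably interrupted part way through, or the table was not empty before loading.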

Now you should be good to go...

Try out the exercises in the orientation notebook to start to familiarise yourself with the data.