<img style="width:80%;" src="images/header.png" alt="Python"/>
<br>
<br>

by <a href="mailto:german.priks@nhs.net">German Priks</a>

In this tutorial we will look at how you can use Python, a generalist's programming language with a strong data science pedigree, to answer Information Requests, FoIs, and other data queries.

Before diving in, it's worth spending a moment on why we would use Python for this set of tasks in the first place. Without a compelling answer to that question, the "how" part becomes mostly irrelevant. We already have `SPSS`, `SQL`, `Excel VBA` and `R` to contend with, so why add another tool or language?

Here are the main advantages of Python that come to my mind after having worked with it in data-related contexts for close to 4 years:
***
 - <h4>Python is good <i>to</i> beginners</h4>

It's hard to get off the ground with a new language. In the beginning, there are many potential stumbling blocks: from the initial setup and unfamiliar error messages to idiosyncratic syntax that appears to have more in common with machine code than with a human-readable interface. Python is not completely immune to these, but its designers have made a concerted effort to keep things simple. To quote the [Zen of Python](https://www.python.org/dev/peps/pep-0020/): "Simple is better than complex. Complex is better than complicated."
***
 - <h4>Python is fast <i>enough</i></h4>
    
Fast enough to quickly iterate on a new idea, fast enough to crunch through GBs of data in seconds, fast enough for _most_ production-ready projects. And when your Python code is already as performant as it can possibly be, and you still need more speed, Python makes it easy to compile into [faster, closer-to-the-metal languages](https://www.youtube.com/watch?v=x58W9A2lnQc) or [run chunks of code in parallel](https://dask.org/). 
***
 - <h4>Python has a mature data science ecosystem</h4>

No programming language is an island. Libraries, extensions and IDEs grow around a language as the needs of its users evolve, and Python is no exception. One of the strengths of Python's data science (DS) stack is that, over the years, the community has aligned on a few key libraries to provide the baseline functionality that meets 95% of user DS needs. From the quality point of view, this concentrates developer effort and expertise, and from the user point of view, it means you only need to learn the conventions and APIs of one or two libraries. Contrast that with `Javascript` and its framework wars.
*** 
 - <h4>Python is open source and has a great community</h4>
    
Python is not a niche language: it's used extensively by commercial giants like Netflix and [AstraZeneca](https://stxnext.com/blog/2020/03/17/most-interesting-companies-using-python/), machine learning start-ups, web developers, IoT hobbyists and many others. It has consistently been voted among the [most loved programming languages](https://insights.stackoverflow.com/survey/2019#technology-_-most-loved-dreaded-and-wanted-languages) in Stack Overflow's developer survey and is the fastest-growing language. Thanks to this community, getting help when starting out with Python is easy, and when you want to expand your knowledge beyond basic scripting, there is a wealth of advanced tutorials online and in print.
***

### Let's get started
Let's move on to the main body of the tutorial, which shows how Python can be used to answer common IR queries using SMR datamarts. I won't go into detail on how to set up Python on your machine (a complete guide is available [here](https://nbviewer.jupyter.org/github/Health-SocialCare-Scotland/Python-Resources/blob/master/Python%20Guidance%20for%20PHI.ipynb?flush_cache=true)), but it basically goes like this: ask NSS IT to install Anaconda on your machine, then start coding.

The tutorial is written in a Jupyter notebook which lets you combine HTML elements (Markdown), in-line graphics and cells with code snippets. It's part of the standard Anaconda distribution and is a feature-rich interactive development environment.
***

## Police requests

Let's imagine that Paddington Bear has gone missing and the police are looking for him. They ask us to check whether Paddington has had any contact with the health service in Scotland. So it's our job to run a search for the itinerant bear in our SMR datasets.

#### The first step in any Python script is to import the necessary libraries

In [1]:
import pandas as pd           #pandas is the main data-processing library in Python
import pyodbc                 #pyodbc is a library to connect to ODBC databases

from getpass import getpass   #optional standard library import for hiding login details

#### Next, we connect to SMRA using our login & password

In [2]:
login = getpass("Login")
password = getpass("Password")
cnxn = pyodbc.connect(f'DSN=SMRA; UID={login}; PWD={password}') #f-string enables the injection of variables

Login ········
Password ··········


#### Write SQL to search for Paddington Bear hits in SMR01 using a special Paddington CHI number

In [3]:
#triple quotes allow us to write multi-line strings
sql = """
SELECT FIRST_FORENAME, SURNAME, PREVIOUS_SURNAME, CI_CHI_NUMBER, LINK_NO, DOB, ADMISSION_DATE
FROM ANALYSIS.SMR01_PI
WHERE (CI_CHI_NUMBER = '7233464866'
OR UPI_NUMBER = 7233464866
OR (FIRST_FORENAME = 'PADDINGTON' AND SURNAME = 'BEAR'
AND DOB = to_date('1958-10-13', 'yyyy-MM-dd')))
AND ADMISSION_DATE >= to_date('2020-01-01', 'yyyy-MM-dd')
"""

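When a `WHERE` clause mixes `AND` and `OR`, SQL evaluates `AND` first, so a date filter added at the end of a chain of `OR`s applies only to the last alternative unless the alternatives are wrapped in parentheses. The sketch below illustrates the difference using Python's built-in `sqlite3` module and a small in-memory table. The `admissions` table, its columns and its rows are made up for illustration and are not part of SMRA.

```python
import sqlite3

# In-memory stand-in table; names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admissions (chi TEXT, surname TEXT, admission_date TEXT)"
)
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [
        ("7233464866", "BEAR", "2019-06-01"),  # CHI match, old admission
        ("7233464866", "BEAR", "2020-03-15"),  # CHI match, recent admission
        ("0000000000", "BEAR", "2020-05-20"),  # surname match only, recent
        ("0000000000", "BEAR", "2018-01-10"),  # surname match only, old
    ],
)

# No parentheses: read as  chi = ... OR (surname = ... AND date >= ...),
# so the old admission with a matching CHI is still returned.
ungrouped = conn.execute(
    """
    SELECT COUNT(*) FROM admissions
    WHERE chi = '7233464866'
    OR surname = 'BEAR'
    AND admission_date >= '2020-01-01'
    """
).fetchone()[0]

# Parentheses around the alternatives: the date filter applies to every match.
grouped = conn.execute(
    """
    SELECT COUNT(*) FROM admissions
    WHERE (chi = '7233464866' OR surname = 'BEAR')
    AND admission_date >= '2020-01-01'
    """
).fetchone()[0]

print(ungrouped, grouped)
conn.close()
```

Which grouping is right depends on the request: if the date restriction is meant to cover every way of identifying the person, the parenthesised form is the one to use.
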
#### Use Pandas to run our SQL query and fetch any rows that it returns

In [4]:
df = pd.read_sql(sql, cnxn)

In [5]:
df #no rows returned

| FIRST_FORENAME | SURNAME | PREVIOUS_SURNAME | CI_CHI_NUMBER | LINK_NO | DOB | ADMISSION_DATE |
|---|---|---|---|---|---|---|
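
The query-and-check steps can also be tried without SMRA access. The following is a minimal sketch that assumes an in-memory SQLite table in place of `ANALYSIS.SMR01_PI`; the `smr01_demo` table and its single row are hypothetical. It also passes the search values through the `params` argument of `pd.read_sql` instead of writing them into the SQL text. pyodbc uses `?` as its parameter marker too, so the same pattern should carry over to the `cnxn` connection, although that has not been tested here.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the SMR01 table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE smr01_demo "
    "(first_forename TEXT, surname TEXT, admission_date TEXT)"
)
conn.execute("INSERT INTO smr01_demo VALUES ('WINNIE', 'POOH', '2020-02-02')")

# Placeholders (?) keep the search values separate from the SQL text.
sql = """
SELECT first_forename, surname, admission_date
FROM smr01_demo
WHERE first_forename = ? AND surname = ?
"""

df = pd.read_sql(sql, conn, params=("PADDINGTON", "BEAR"))

# df.empty is True when the query returned no rows (the columns are still there).
if df.empty:
    message = "No matching records found"
else:
    message = f"{len(df)} matching record(s) found"

print(message)
conn.close()
```

Checking `df.empty` in code, instead of eyeballing the output, makes the "no hits" result explicit and easy to reuse when the same search is run for many requests.
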
