# Introduction

Having looked at some basic string and file manipulation, let's turn our attention to data handling. So far we've tended to think in terms of lists of data, but what happens when we want tables of data or higher-dimensional data? We've learnt some things that help, but of course Python has some specific modules for dealing with this kind of problem.





# Data Handling
Welcome to this Python notebook on data handling. We will start by looking at some slightly more advanced data structures in a package called numpy before moving into a package called pandas.

We will look at some standard ways to manipulate data and how to work on data files we might create, or obtain from third party sources.

First, we will start with a quick recap of some basic data structures.

Copy this code into the cell below a bit at a time and check you understand what it is doing.

```
# Let's start by creating a list of numbers

v = [1.2,3.5,4.4,5.2,6.7]
print(f"v : {v}\n")

a = [v,v,[3,4,6,7,9]]
print(f"a : {a}\n")

# let's try and multiply all the numbers in a by 2
print(f"a*2 {a*2}\n")

# let's try again
for i in a:
  for j in i:
    print(j*2)

# well that kind of worked, but if you've used vectors and arrays in maths you know it's not really what I wanted.
```



In [9]:
# Copy the code into this cell



## NumPy
Let me introduce NumPy, short for Numerical Python. We've seen how libraries can introduce functions that we reuse; some packages (including NumPy) also introduce new types to the language. These new types are aware of what their purpose is and allow us to do clever things with them very quickly.
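To make the difference between the two types concrete, here is a minimal sketch. The variable names are purely illustrative and are not used elsewhere in the notebook. It contrasts how `*` and `+` behave on a plain list and on a NumPy array.

```python
import numpy as np

values = [1.2, 3.5, 4.4]      # a plain Python list
arr = np.array(values)        # the same numbers as a NumPy array

repeated = values * 2         # list: the contents are repeated
doubled = arr * 2             # array: every element is multiplied by 2
summed = arr + arr            # arrays of the same shape add element by element

print(repeated)
print(doubled)
print(summed)
print(arr.dtype)              # the array also records the type of its elements
```

The list version grows to six items, while both array results still have three elements.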

In [6]:
# Let's first import our package and give it a short name np.
import numpy as np

# We can take the list we already created and convert it to an array like this.
v2 = np.array(v)

print(f"Original {v} , multiplied array {v2*2}")


In [8]:
# and of course we can create 2 dimensional arrays

ar = np.array([[1.2, 2.3, 3.4],
              [2.2, 3.2, 4.6],
              [5.2, 1.2, 2.3]])
print(f"{ar=}")
print(f"{ar*2=}")

In [None]:
# we can also create arrays of a particular size using some special numpy functions.

x1 = np.random.randint(10, size=6)
x2 = np.random.randint(10, size=(2,2))


print(x1)
print(x2)

# if you ever want to see the size of an array you can ask for its shape
print(f"shape of x1 : {x1.shape}")  # (6,) : one dimension with 6 elements
print(f"shape of x2 : {x2.shape}")  # 2 rows, 2 columns


In [None]:
# Numpy gives us lots of tools for looking at data.
# Let's start by creating an artificial set of data.
# imagine we have surveyed 100 people and get their ages

# Notice that lots of functions have named parameters; this allows us to
# be a bit more descriptive in our code.
# Note that 'high' is exclusive, so the ages run from 18 to 69.

ages = np.random.randint(low = 18, high= 70, size = 100)

In [None]:
# having created our data, what can we do with it?

print(f"Maximum age is {np.max(ages)}")

# Try the following np summary statistic functions
# np.min()
# np.std()
# np.mean()
# np.median()
# np.size()

# Remember the ? in Jupyter notebooks for help.


In [None]:
# We also have some nice data manipulation functions.
# look at the following.

arr = np.array([45, 32,19,99,67])
arr.sort()
print(arr)
# notice that sort() works in place, i.e. the array itself is sorted when we run the function;
# we don't need to assign the output to a new variable.

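# Try the following list methods and see what they do.
# remember that the environment will give you hints about parameters etc.
# Note: these belong to Python lists, not NumPy arrays, so try them on a list
# such as arr.tolist(). For arrays, see np.flip(), np.append() and np.delete().

# 1. reverse()
# 2. append()
# 3. pop()
# 4. remove()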



## Data as tables

So having seen how we can create an array of numbers it's not hard to see how this could be viewed as a table. Imagine a set of information about people where each row is a person and each column a feature, e.g. age, height, ...

In this next section we will look at accessing information in the table and taking slices of the table to work with.

In [None]:
# Let's start by creating a table of random numbers which has 4 rows and 6 columns
np.random.seed(0) # this is a little trick to make sure we always get the same sequence from our random number generator
tbl = np.random.randint(10, size=(4,6))
print(tbl)

# to access a single value from the table we will simply give it the row and column we want
# notice that rows and columns are 0 indexed
print(tbl[0,1])


# change the values in the square brackets to check you understand what is happening.
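The seed trick above uses NumPy's global random state. NumPy also offers a newer `Generator` interface, created with `np.random.default_rng`. The sketch below is an optional alternative and not what this notebook uses; the names `rng_a`, `rng_b`, `tbl_a` and `tbl_b` are just for illustration. It shows that two generators built from the same seed produce the same table.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

# integers() draws whole numbers from low (inclusive) to high (exclusive)
tbl_a = rng_a.integers(low=0, high=10, size=(4, 6))
tbl_b = rng_b.integers(low=0, high=10, size=(4, 6))

print(tbl_a)
print(np.array_equal(tbl_a, tbl_b))   # same seed, same sequence
```

Keeping the generator in a variable means the seed only affects the code that uses that particular generator.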

In [None]:
# of course we might want more than just one element.

# Can you explain what these do?

print(tbl[:, 2])
print("")

print(tbl[1, :])
print("")

print(tbl[0:2, 1:4])
print("")

print(tbl[0,3:])

# What does the colon operator do?
# are the parameters inclusive or exclusive?

## Challenge

In [10]:
# Challenge
# Use the summary statistics you learnt earlier to find summaries (min max and mean) for
# The whole table
# Each Column
# Each Row

# Use loops and formatted printing to show your results



# Pandas

So we've seen some basic numeric arrays, let's turn our attention to pandas.

Pandas is a module which specialises in data table operations. You can think of it a little like Python's version of Excel: pretty much anything you would like to do in Excel can be done programmatically in Python. But, you won't be surprised to hear, it's much more powerful.

In [11]:
# First, just like we converted a list to an array, let's turn an array into a pandas dataframe
import pandas as pd

# Uncomment the code to create a data frame

#df = pd.DataFrame(tbl)
#print(df)
#print("")

# you can see the same data we had before but now every row and column has an id.
# now let's try adding something meaningful to the column headings
#df.columns = ['A', 'B', 'C', 'D', 'E', 'F']
#print(df)
#print("")

# note we could also have just said
#df = pd.DataFrame(tbl, columns = ['A', 'B', 'C', 'D', 'E', 'F'])
#print(df)
#print("")


In [12]:
# Alternatively we can create a dataframe using a dictionary
# Uncomment the code to see how this works


#df = pd.DataFrame(tbl, columns = ['A', 'B', 'C', 'D', 'E', 'F'])
#data = {'Name':['Alex', 'Bob', 'Charlie', 'Samj'],
#        'Age':[27, 45, 22, 32],
#        'Address':['London', 'Paris', 'York', 'Newcastle'],
#        'Qualification':['Msc', 'MA', 'BEng', 'Phd']}

#df = pd.DataFrame(data)
#print(df)

In [None]:
# Because this is a pandas dataframe we also get access to some new special functions
# Uncomment each of these and explain what they do

# df.head(2)
# df.loc[1:3, ['Name', 'Qualification']]

# remember you can use the ? in Jupyter


# Score Book data

Before you start running code cells in this part of the notebook you will need to upload three files: ScoreBook.csv, ScoreBook2.csv and ScoreBook3.csv.

These are simple comma-separated value files that I created offline; imagine they came from a Google Form or Excel.

In [None]:
# Pandas gives us a nice read_csv function which knows how to deal with files in this form


df = pd.read_csv('ScoreBook.csv')

# having read in the data we can look at the table
print(df)

In [None]:
# If we want to know some quick summary statistics for the table then we can use
# the Pandas describe function.
df.describe()

In [None]:
# A couple of notes.
#
# The first participant id is only 4 characters while the others are 5.
# The describe function is telling us the mean of the participant id. That makes no sense.
# Similarly we are seeing preference as 0 and 1 and not True and False.
#
# Why?
#
# It's because the reader is trying to be helpful and assumes that the columns
# are integers. We can help pandas by giving it the correct data types.

df = pd.read_csv('ScoreBook.csv', dtype={'participant' : 'string', 'pref': 'bool'})
df
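One plausible cause of a shortened id is a leading zero being lost when the column is read as an integer. This is an assumption, since the contents of ScoreBook.csv are not shown here. The sketch below uses a tiny made-up CSV held in memory (the values are invented, only the column names mirror the notebook) to show the effect of the `dtype` argument.

```python
import io
import pandas as pd

# A small invented CSV; io.StringIO lets read_csv treat the text like a file
csv_text = "participant,score_1\n01234,55\n56789,70\n"

guessed = pd.read_csv(io.StringIO(csv_text))
typed = pd.read_csv(io.StringIO(csv_text), dtype={'participant': 'string'})

print(guessed.dtypes)
print(guessed['participant'].tolist())   # ids read as numbers, leading zero gone

print(typed.dtypes)
print(typed['participant'].tolist())     # ids kept exactly as written
```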

In [None]:
# Now let's run the describe function
df.describe()

In [None]:
# Of course I brought in this data because I want to manipulate it.
#
# I want to create a new column on this data which is calculated
# from the values in other columns. Maybe we have a weighted 'total' score
# where score_3 is worth twice as much as the other two


df['total'] = df['score_1'] + df['score_2'] + df['score_3']*2


# Putting the name of a new column in the square brackets says we are
# going to add a column using whatever is on the right side of the
# = sign.

# I also want the total as a percentage

df['percentage'] = df['total']/4

print(df)

In [None]:
# Sometimes you might want to delete a column. Maybe it's confidential and you
# want to remove it before you circulate the data.

# To drop a column, we need to give drop the name of the column and an axis of 1, which means column

df.drop(['age'], axis=1)
print(df)

In [None]:
# Well that didn't seem to work! Why not?

# The answer is that the drop function doesn't change the data frame. It
# returns a dataframe where the column has been dropped.

df_noage = df.drop(['age'], axis=1)
print(df_noage)
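As a small aside, `drop` also accepts a `columns=` keyword, which some people find easier to read than remembering which axis number means what. The sketch below uses an invented three-column table (the name `people` and its values are hypothetical) and confirms that the original frame is left untouched.

```python
import pandas as pd

# Invented data, just to illustrate the call
people = pd.DataFrame({'id': ['a1', 'b2'],
                       'age': [27, 45],
                       'score_1': [60, 72]})

# Equivalent to people.drop(['age'], axis=1)
without_age = people.drop(columns=['age'])

print(people.columns.tolist())        # the original still has 'age'
print(without_age.columns.tolist())   # the returned copy does not
```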

In [None]:
# Of course we might also want to remove rows from the data set.
# Maybe there is an individual who has removed consent.

# Here we change to axis=0 which means row.
# the number in the square brackets is the row index.
result = df_noage.drop([3], axis=0)

print(result)

In [None]:
# Removing a row by row number is fine, but we might end up shuffling row numbers
# it would be better to use information in the row.
# for example the participant id,
# This takes a little explaining, so bear with me.

# Firstly we can get a simple boolean check to see if a value in a column matches a value we are interested in
df['participant'] == '46923'

# notice this is run for every row, we don't need a loop, that's one of the
# powerful things about pandas. Removing the loops makes it quick for even quite
# large data sets.

In [None]:
# if we pass this Series of booleans into the dataframe we get

df[df['participant']== '46923']

# i.e. it will return the rows where the boolean is True.

In [None]:
# Then we can ask for the index of this row
idx = df[df['participant']== '46923'].index
print(f"Index : {idx.values}")


In [None]:
# Having got the index we can now drop the
# participant that is a problem and save the new data set

df_dropped_person = df.drop(idx)
print(df_dropped_person)

In [None]:
# Of course this isn't just about deleting rows. This kind of search functionality can
# be really useful in finding subsets of the data.
#
# Why are we doing it this complicated way? The reason is that we can select
# multiple rows

older = df[df['age'] >= 30]
younger = df[df['age']<30]

print('Older participants')
print(older)
print()
print('Younger participants')
print(younger)
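The same idea extends to more than one condition. The sketch below is a minimal illustration on invented data (the `people` table is hypothetical): conditions are combined with `&` (and) or `|` (or), each wrapped in its own brackets, and `isin` checks a column against a list of values.

```python
import pandas as pd

# Invented data, just to illustrate the filters
people = pd.DataFrame({'id': ['a1', 'b2', 'c3', 'd4'],
                       'age': [27, 45, 22, 32],
                       'score_1': [60, 72, 48, 90]})

# Both conditions must hold
older_high = people[(people['age'] >= 30) & (people['score_1'] > 80)]

# Either condition may hold
young_or_low = people[(people['age'] < 25) | (people['score_1'] < 50)]

# Match any id in a list
named = people[people['id'].isin(['a1', 'd4'])]

print(older_high)
print(young_or_low)
print(named)
```

The brackets around each condition matter, because `&` and `|` bind more tightly than the comparison operators.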

In [None]:
# If you find yourself wanting to do this a lot, remember you
# can always write a function to drop rows based on a column's value.

def drop_row(df, value, col_name='participant'):
  idx = df[df[col_name]== value].index
  df2 = df.drop(idx)
  return df2

result = drop_row(df, '46923')
result = drop_row(result, '85243')
result = drop_row(result, '10674')

print(result)

## Saving your data

Having made these changes let's write the data to a new file so we can distribute it.

In [None]:
# having edited our file, let's save it.
# remember you'll need to download it from Jupyter if you want to keep it.
result.to_csv('new_score_book.csv', encoding='utf-8')

# Once you've saved it download it and open it in Excel.
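When you open the saved file you may notice an extra unnamed first column: by default `to_csv` also writes the row index. If you would rather leave it out, `index=False` does that. The sketch below uses an invented two-row table and calls `to_csv` without a file name, which returns the CSV text instead of writing a file, so the two versions can be compared directly.

```python
import pandas as pd

# Invented data, just to illustrate the output format
people = pd.DataFrame({'id': ['a1', 'b2'], 'total': [250, 310]})

with_index = people.to_csv()                # no path given, so the text is returned
without_index = people.to_csv(index=False)  # same data without the row numbers

print(with_index)
print(without_index)
```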

In [None]:
# Before we move on I just want to highlight a useful function for when you're
# reading in some data. You might find that the column names are a bit cumbersome,
# especially when you want to include them in lines of code.


# In that case we can rename them to something more friendly.


df = df.rename(columns={"participant":"id", "pref":"preference"})
print(df.head(2))

In [None]:
# So a few times now I've used the df = df.... form of a function call.
# We can force what are called in-place operations for quite a few
# pandas functions like this.

df.rename(columns={ df.columns[2]: "pref" }, inplace=True)
print(df.head(2))

## Combining Files

Of course sometimes our data is split between multiple sources and we need to combine them. In this next section we are going to look at the different forms of combination we can do in Pandas.

In [None]:
# Let's start by loading in a second data file
# and having a look at it.

df_city = pd.read_csv('ScoreBook2.csv', dtype={'participant' : 'string'})
df_city = df_city.rename(columns={"participant":"id"})
df_city

# You can see that this is the city associated with each participant
# Notice that I renamed the id column that we are going to match to the
# first data set.

In [None]:
# Joining data sets: commonly we will have two files that we need to join
# using the 'merge' function.

result = pd.merge(df, df_city, on='id')
print(result)

In [None]:
# Well that was nice because the data was well behaved.
# what about this?

df_missingone = drop_row(df, '61378', col_name='id')
df_missingone

result = pd.merge(df_missingone, df_city, on='id')
result

# Notice that we lose a row from the data set.
# not only that but the resulting data set has had a new set
# of row numbers generated.

In [None]:
# if we want to keep all the rows then we can use the 'how'
# parameter, obviously there is no data for the extra participants apart from
# the city and therefore the data is replaced with NaN (not a number) which
# we can interpret as missing.

result = pd.merge(df_missingone, df_city, on='id', how='outer')
print(result)

In [None]:
# Let's try that again but this time drop a row from the city data set

missing_city = drop_row(df_city, '61378', col_name='id')
result = pd.merge(df, missing_city, on='id', how='outer')
result
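To see the different join types side by side, here is a minimal sketch on two invented tables (`scores` and `cities` are hypothetical and deliberately mismatched). It also uses the `indicator=True` option, which adds a `_merge` column saying where each row came from.

```python
import pandas as pd

# Invented data: 'a1' has no city and 'd4' has no score
scores = pd.DataFrame({'id': ['a1', 'b2', 'c3'], 'total': [250, 310, 190]})
cities = pd.DataFrame({'id': ['b2', 'c3', 'd4'], 'city': ['York', 'Paris', 'London']})

inner = pd.merge(scores, cities, on='id')               # default: only ids in both
left = pd.merge(scores, cities, on='id', how='left')    # every row of scores
outer = pd.merge(scores, cities, on='id', how='outer',  # every id from either table
                 indicator=True)

print(inner)
print(left)
print(outer)
```

Rows that only exist on one side get NaN in the columns that come from the other table.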

## Adding new rows

Sometimes we simply want to add more rows to an existing set. Maybe the data collection was undertaken over multiple days, or by multiple people.


In [None]:
# Let's start by reading in the two csv files we are interested in.

df_first = pd.read_csv('ScoreBook.csv', dtype={'participant' : 'string'})
df_first = df_first.rename(columns={"participant":"id"})
df_first

df_second = pd.read_csv('ScoreBook3.csv', dtype={'participant' : 'string'})
df_second = df_second.rename(columns={"participant":"id"})
df_second

# Next check that they have the same number of columns.

print(f"shape 1 {df_first.shape}")
print(f"shape 2 {df_second.shape}")
print()

print(df_second)

In [None]:
# Now we can do the work we need to and join the two files together using the concat function
df_full = pd.concat([df_first,df_second])
print(df_full.shape)
print()

# Now let's take a look at the dataframe
print(df_full)

In [None]:
# oops, it looks like the row numbers are being reused.
# well we can fix that.

df_full = pd.concat([df_first,df_second], ignore_index=True)
print(df_full)

In [None]:
# Now Pandas can undertake a massive amount of manipulation and I'll leave that
# for another day (or your own investigation) but let me show you something
# useful

# Data Aggregation and grouping
#
# we might well want to group our data and get information.
# here we want to know how many people live in each city (for our earlier tables).
# so we will group the participants by city and then count how many in each group

print(result)


s = result.groupby('city').count()
print(s)

In [None]:
# As well as count we can add things up for each group using the sum function

s = result.groupby('city').agg('sum')
print(s)

In [None]:
# Finally you might want to group by more than one column.
# let's try that

s = result.groupby(['city', 'pref']).agg('sum')
print(s)


# and of course we could just look at a single column of the result
print("------ age column only ------")
print(s['age'])



In [None]:
# Just before we finish this bit, let's try and get the mean of the columns when grouped

# s = result.groupby(['city', 'pref']).agg('mean')


# This throws an error telling us that we can't use mean on string data.
# Well all we wanted was the mean of age and total, so uncomment the next line
# and comment out the previous code

s = result.groupby(['city', 'pref'])[['age', 'total']].agg('mean')
print(s)

# if you want to see more of this kind of operation search the internet for
# Pandas grouping and aggregation.
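As a starting point for that search, here is a minimal sketch of asking for several different summaries in one go, using what pandas calls named aggregation. The `people` table and the output column names (`n`, `mean_age`, `max_total`) are invented for illustration.

```python
import pandas as pd

# Invented data, just to illustrate the grouping
people = pd.DataFrame({'city': ['York', 'York', 'Paris', 'Paris', 'Paris'],
                       'age': [20, 30, 40, 50, 60],
                       'total': [100, 200, 300, 300, 300]})

# Each keyword is a new column name: (source column, function to apply)
summary = people.groupby('city').agg(
    n=('age', 'count'),
    mean_age=('age', 'mean'),
    max_total=('total', 'max'),
)

print(summary)
```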


# NHS Data

In this section I'm going to make use of some data taken from the internet.
You can find the complete dataset at the NHS website, but I've created a smaller
set from this data and put it on my university website for convenience.

https://data.england.nhs.uk/dataset/a-e-synthetic-data

https://www-users.york.ac.uk/~cap508/resources/NHS_Sample.csv



In [None]:
# Because this data has been put onto a website Pandas gives us the opportunity
# to load it directly from the url

url = 'https://www-users.york.ac.uk/~cap508/resources/NHS_Sample.csv'

df = pd.read_csv(url)

In [None]:
# What is in this data set? The columns command will give you the name of each column

print(df.columns)

In [None]:
# The head command will show us the top of the data frame; can you guess how you might see the bottom?

df.head(10)



In [None]:
# If we only want to see part of the data we can use a slice

df[3:5] # Show only rows 3 and 4; notice that the upper bound is non-inclusive

In [None]:
df.loc[0:5,["AE_Arrive_Date", "AE_HRG"]]
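One detail worth checking for yourself: a plain slice and `iloc` work on positions and leave out the upper bound, whereas `loc` works on labels and includes it. The sketch below shows this on an invented single-column table (`frame` is hypothetical) whose labels happen to match its positions.

```python
import pandas as pd

# Invented data with the default 0, 1, 2, ... row labels
frame = pd.DataFrame({'value': [10, 11, 12, 13, 14, 15, 16]})

by_slice = frame[3:5]        # positions 3 and 4
by_iloc = frame.iloc[3:5]    # positions 3 and 4
by_loc = frame.loc[3:5]      # labels 3, 4 and 5 (the end label is included)

print(len(by_slice), len(by_iloc), len(by_loc))
print(by_loc)
```

So the `loc[0:5, ...]` call above returns six rows, not five.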

In [None]:
# Of course we can also use the filtering commands we saw earlier

df[df.AE_HRG == "Nothing"]

In [None]:
# We can then use this to select information. So the question might be:
# what is the hour of the day for people having an AE_Time_Mins of > 200?

subset = df[df.AE_Time_Mins > 200]
print(f"{subset.shape[0]} patients had a wait of over 200 minutes")
print("Here are their arrival times")
print(subset.loc[:,['AE_Arrive_HourOfDay', 'AE_Time_Mins']])



In [None]:
# There is also a very useful command called query
# covering it in detail is a bit much, but you should know it exists.

df.query('AE_Time_Mins > 200 and AE_Num_Diagnoses == 2 and AE_Num_Investigations > 3')


In [None]:
# We can then just look at the columns we are interested in
df.query('AE_Time_Mins > 200 and AE_Num_Diagnoses == 2 and AE_Num_Investigations > 3')[["Age_Band", "AE_Num_Treatments"]]


# So the first bit returns the subset of data then the next bit says these are the columns we want to see.
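For a feel of how `query` relates to the boolean filtering used earlier, here is a minimal sketch on an invented table. The `visits` data and its column names are hypothetical and are not the NHS columns. It also shows that a Python variable can be referenced inside the query string by prefixing it with `@`.

```python
import pandas as pd

# Invented data, just to illustrate the two styles
visits = pd.DataFrame({'wait_mins': [30, 250, 410, 90],
                       'num_diagnoses': [1, 2, 2, 3]})

threshold = 200

# The query string reads almost like a sentence; @threshold is the variable above
by_query = visits.query('wait_mins > @threshold and num_diagnoses == 2')

# The same selection written with boolean masks
by_mask = visits[(visits['wait_mins'] > threshold) & (visits['num_diagnoses'] == 2)]

print(by_query)
print(by_query.equals(by_mask))
```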

That's it for this worksheet. Next we're going to move on to visualising data, which is great for creating graphs for use in your presentations, reports and papers.