In Python, a great deal is done with variables. We can store data in variables in many forms, including individual items of data, lists, dictionaries, and dataframes. We can also store more complicated things like classes and operations, but we won't worry about those for now. Let's first look at how variables store individual items of data, and some of the quirks of doing so.

In [17]:
x = 5
y = 10
z = x + y
print(z) # print is a function that, for our simple purposes, will show you in the terminal (or on the page in Jupyter) the data stored in a variable
j = y / x
print(j)
print(type(j)) # printing the type of a variable tells us what data is stored in it, floats are numbers with decimals
j = int(j)
print(type(j)) # we can also convert j to an integer, and check it's worked by printing its new type

15
2.0
<class 'float'>
<class 'int'>
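
One quirk of the int() conversion used above is that it drops the decimal part rather than rounding. A minimal sketch, using made-up values that are not part of the original cell:

```python
# int() truncates towards zero, round() goes to the nearest whole number
ratio = 7 / 2              # dividing with / always gives a float
truncated = int(ratio)     # the .5 is simply dropped
negative = int(-3.5)       # truncation goes towards zero, not downwards
rounded = round(2.7)       # rounds to the nearest whole number
floor_div = 7 // 2         # floor division keeps ints as ints

print(ratio)       # 3.5
print(truncated)   # 3
print(negative)    # -3
print(rounded)     # 3
print(floor_div)   # 3
```

So if we want the nearest whole number, it is safer to call round() before (or instead of) int().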


In [18]:
name = 'Will' # This is a string, we use a string to tell Python the data stored should be text, not code or numbers. As will be seen, strings act differently to floats/ints
print(type(name))
hobby = 'Running'
long_string = name + hobby
print(long_string) # As we can see, addition of strings concatenates those strings
# The below is an 'f string' which allows us to insert variables INSIDE a string
phrase = f'My name is {name}, and I like {hobby}.' 
# String behaviour and f strings can be useful when we want to join up variables of strings, really useful when making dynamic titles for visualisations
print(phrase)

<class 'str'>
WillRunning
My name is Will, and I like Running.


In [19]:
string_slice = phrase[1:10] # strings are interesting as we can easily select certain parts of them by selecting a range in square brackets
# When using square brackets to slice a range, the number to the left is INCLUSIVE, the number to the right is EXCLUSIVE
# You'll notice this new string starts at y, why? In most instances, things in Python count from 0 as the first index location.
print(string_slice)
filename = 'heres_a_file.csv'
shortened_name = filename[:-4] # We can also count from the end of the string, here -4 means stop 4 characters before the end, which drops '.csv'.
# When we don't put a number on one side of the colon, it means 'take everything this side of the range', so string[4:] would mean take everything
# after the 4th character
print(shortened_name)

y name is
heres_a_file
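
The slicing rules above can be sketched on one more string. The file name here is hypothetical, chosen only for illustration, and the pathlib line is an alternative we are suggesting rather than something used elsewhere in this notebook:

```python
from pathlib import Path

filename = 'report_2023.csv'   # hypothetical file name

first_six = filename[:6]       # start is inclusive and defaults to 0
extension = filename[-4:]      # the last 4 characters
stem = filename[:-4]           # everything except the last 4 characters
year = filename[7:11]          # characters at index 7, 8, 9 and 10

print(first_six)   # report
print(extension)   # .csv
print(stem)        # report_2023
print(year)        # 2023

# Slicing off 4 characters assumes a 3-letter extension.
# pathlib finds the extension for us, whatever its length.
safer_stem = Path(filename).stem
print(safer_stem)  # report_2023
```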


Let's have a look at lists now. We can initialise a list by using square brackets and assigning them to a variable. We separate elements of a list with commas. We can put pretty much whatever we want in a list: ints, strings, dictionaries, other lists, a whole range of things. A full list of list methods can be found here: https://docs.python.org/3/tutorial/datastructures.html

In [20]:
list_1 = [] # Even though there's nothing in here, it's a list, we can check by printing type(list_1)
print(type(list_1))

colours_list = ['red', 'blue', 'green']
print(colours_list)

colours_list.append('purple') # The append method adds whatever is being appended to a list as one item in a list
print(colours_list)

extra_colours = ['turquoise', 'teal']
colours_list.append(extra_colours)
print(colours_list) # You can see here it's appended the whole list extra_colours as one item in the list

# Instead, we can add each element of a list as a new element of another list with the .extend() method
colours_list.extend(extra_colours)
print(colours_list)

print(colours_list[3:7]) # We can also access elements in a list the same way we can strings

colours_list.remove('teal') # We can remove list elements using .remove()
print(colours_list)


<class 'list'>
['red', 'blue', 'green']
['red', 'blue', 'green', 'purple']
['red', 'blue', 'green', 'purple', ['turquoise', 'teal']]
['red', 'blue', 'green', 'purple', ['turquoise', 'teal'], 'turquoise', 'teal']
['purple', ['turquoise', 'teal'], 'turquoise', 'teal']
['red', 'blue', 'green', 'purple', ['turquoise', 'teal'], 'turquoise']
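
Two details of .remove() are worth a quick sketch: it only removes the first matching element, and it raises an error if the element isn't there. The list below is made up for illustration:

```python
colours = ['red', 'teal', 'blue', 'teal']   # made-up list with a repeated value

colours.remove('teal')   # only the FIRST 'teal' is removed
print(colours)           # ['red', 'blue', 'teal']

# Removing something that isn't in the list raises a ValueError,
# so we can either check first...
if 'pink' in colours:
    colours.remove('pink')

# ...or catch the error
try:
    colours.remove('pink')
    removed = True
except ValueError:
    removed = False

print(removed)   # False
```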


In [21]:
numbers_list = [2,3,4,5,1,6]
numbers_list.sort() # we can sort lists too, here using a method
print(numbers_list) 
numbers_list.sort(reverse=True)
print(numbers_list) 

sorted_list = sorted([3,2,3,4,5]) # Here we sort a list using a function and returning it to a new variable
print(sorted_list)

colours_list = ['red', 'blue', 'green', 'amber']
sorted_colours = sorted(colours_list, key=str.lower)
print(sorted_colours)



[1, 2, 3, 4, 5, 6]
[6, 5, 4, 3, 2, 1]
[2, 3, 3, 4, 5]
['amber', 'blue', 'green', 'red']
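
The difference between the .sort() method and the sorted() function is easy to trip over, so here is a small sketch with made-up numbers: sorted() hands back a new list and leaves the original alone, while .sort() changes the list in place and returns None.

```python
scores = [3, 1, 2]          # made-up values

ordered = sorted(scores)    # a new, sorted list
print(ordered)              # [1, 2, 3]
print(scores)               # [3, 1, 2], the original is untouched

result = scores.sort()      # sorts in place...
print(result)               # None, ...and returns nothing useful
print(scores)               # [1, 2, 3]
```

This is why writing something like my_list = my_list.sort() leaves us with None instead of a list.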


Now let's look at dictionaries. Dictionaries may initially seem a little confusing, and not overly helpful, but they are an incredibly powerful tool. Dictionaries allow us to store things in key:value pairs. Keys are generally strings or numbers; they must be unique within a dictionary and must be immutable objects (objects that can't be changed, such as strings, numbers, or tuples, but not lists). The value is... pretty much whatever we want it to be. We build a dictionary using curly brackets {}, separating each key from its value with a colon, and entries with commas.

In [22]:
favourite_colours = {'Will':'teal',
                     'Andy':['Purple', 'Green', 'Azure'],
                     'Annie':{'Mondays':'Red',
                              'Sundays':'Orange',
                              'Other days':'Black'}}
print(favourite_colours) # Here you can see I've stored strings, lists, and even entire dictionaries inside a dictionary
print(list(favourite_colours)) # list(d) gives the keys of a dictionary as a list, in the order they were added
print(favourite_colours.items()) # Prints all key:value pairs
print(favourite_colours.keys()) # Prints all keys
print(favourite_colours.values()) # Prints all values




{'Will': 'teal', 'Andy': ['Purple', 'Green', 'Azure'], 'Annie': {'Mondays': 'Red', 'Sundays': 'Orange', 'Other days': 'Black'}}
['Will', 'Andy', 'Annie']
dict_items([('Will', 'teal'), ('Andy', ['Purple', 'Green', 'Azure']), ('Annie', {'Mondays': 'Red', 'Sundays': 'Orange', 'Other days': 'Black'})])
dict_keys(['Will', 'Andy', 'Annie'])
dict_values(['teal', ['Purple', 'Green', 'Azure'], {'Mondays': 'Red', 'Sundays': 'Orange', 'Other days': 'Black'}])


In [23]:
print(favourite_colours['Will']) # We can access the values associated with a key by passing it between square brackets after the dictionary's name

favourite_colours['Naiomi'] = 'Yellow' # We can also add/update key value pairs in a similar way
print(favourite_colours)

print('Mo' in favourite_colours) # We can check if a key is in a dict or not like so

teal
{'Will': 'teal', 'Andy': ['Purple', 'Green', 'Azure'], 'Annie': {'Mondays': 'Red', 'Sundays': 'Orange', 'Other days': 'Black'}, 'Naiomi': 'Yellow'}
False
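
To make the rule about keys concrete, here is a small sketch with a made-up dictionary. A tuple works as a key because it can't be changed, a list does not, and the .get() method is a handy way to look up a key that might be missing:

```python
favourite = {'Will': 'teal'}   # made-up dictionary for illustration

# A tuple is immutable, so it is allowed as a key
favourite[('Will', 'Monday')] = 'red'

# A list can be changed, so Python refuses to use it as a key
try:
    favourite[['Will', 'Monday']] = 'red'
    list_key_failed = False
except TypeError:
    list_key_failed = True
print(list_key_failed)   # True

# Square brackets raise a KeyError for a missing key, .get() does not
missing = favourite.get('Mo')              # None when the key is absent
fallback = favourite.get('Mo', 'unknown')  # or a default of our choosing
print(missing)    # None
print(fallback)   # unknown
```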


Now we'll look at dataframes, which are probably going to be the most powerful way of storing, manipulating, and using data for us. Dataframes are one of the key ways of working with tabular data in Python. To use them, we need to import the package that provides them into our workspace. It's called pandas. Every time we use the package we have to tell Python so by writing 'pandas', but this is a bit long to type over and over, so the accepted convention is to import pandas as pd, and use pd as short for pandas. This will become clear as we go.

Dataframes can be made in lots of ways: we can build them from scratch using things like lists and dictionaries, we can read in data from files like CSVs and Excel files, and we can even use extra packages like SQLAlchemy to query SQL databases directly into dataframes. For now, we'll build up our skills by hard coding them from scratch, first using dictionaries. In this case, we will make a dictionary in which each key is a column name and each value is a list holding that column's values, one per row.

In [24]:
import pandas as pd

d = {"here's a column":['Val 1', 2, 'Val 3'], # Notice here we made a string, but the string had an apostrophe in, to make sure the apostrophe was inside the string, we used double quotes
     "Here's another":[4, 5, 6],
     "and another":['foo', 'bar', 'baz']}

df_1  = pd.DataFrame(d)
print(df_1)

df_2 = pd.DataFrame([  # Perhaps a more intuitive way of building a df, one dict per row
    {'letter':'A',
     'word':'Foo',
     'number':1},
     {'letter':'B',
     'word':'Bar',
     'number':2},
     {'letter':'C',
     'word':'Baz',
     'number':3},
])

print(df_2)


  here's a column  Here's another and another
0           Val 1               4         foo
1               2               5         bar
2           Val 3               6         baz
  letter word  number
0      A  Foo       1
1      B  Bar       2
2      C  Baz       3
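
The two ways of building a dataframe shown above (one dictionary of columns, or a list with one dictionary per row) should give the same result when they describe the same data. A minimal sketch with made-up values, using the .equals() method to compare them:

```python
import pandas as pd

# One dictionary: each key is a column, each list holds that column's values
by_column = pd.DataFrame({'letter': ['A', 'B'],
                          'number': [1, 2]})

# A list of dictionaries: each dictionary is one row
by_row = pd.DataFrame([{'letter': 'A', 'number': 1},
                       {'letter': 'B', 'number': 2}])

same = by_column.equals(by_row)   # compares values, column names and types
print(same)   # True
```

Which one to use mostly comes down to how the data arrives: column by column, or record by record.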


We will often want to interrogate the data in dataframes. A good way to start is the .info() method, and .dtypes is also useful for finding out what type of data is stored in each column. .head(n)/.tail(n) will show the top/bottom n rows respectively. A fuller overview of pandas functionality can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html

In [25]:
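print(df_2.info()) # .info() prints its summary itself and returns None, which is why None appears at the end of its output below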
print(df_2.dtypes)
print(df_2.tail(1))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   letter  3 non-null      object
 1   word    3 non-null      object
 2   number  3 non-null      int64 
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
None
letter    object
word      object
number     int64
dtype: object
  letter word  number
2      C  Baz       3


In [26]:
print(df_2.sum()) # See how it sums strings, exactly as we did earlier?
print(df_2['number'].mean()) # You obviously can't take the mean of strings, so let's look at just the number column, which we access using square braces
print(df_2['number'].cumsum()) # cumulative sum is also useful
print([df_2['number'].min(), df_2.number.max()]) # Min and max. Notice we can access a column using . notation rather than square braces if we want, though square braces is more common

letter          ABC
word      FooBarBaz
number            6
dtype: object
2.0
0    1
1    3
2    6
Name: number, dtype: int64
[1, 3]


In [27]:
df_2['product 3'] = df_2['number'] * 3 # we can also perform actions on columns easily, for example, multiplying each row in the number column by 3 to make a new column
print(df_2)

def multiple_of_3(x):
    if x % 3 == 0: # x % 3 is the remainder after dividing by 3; a remainder of 0 means x is a multiple of 3
        return 'Yes'
    else:
        return 'No'
    
df_2['multiple of 3'] = df_2['number'].map(multiple_of_3) # mapping is really useful if we want to perform an action for every row that changes based on the values in each row
print(df_2)

  letter word  number  product 3
0      A  Foo       1          3
1      B  Bar       2          6
2      C  Baz       3          9
  letter word  number  product 3 multiple of 3
0      A  Foo       1          3            No
1      B  Bar       2          6            No
2      C  Baz       3          9           Yes
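
For a simple rule like "is this a multiple of 3", a Yes/No label can also be built without writing a function, by comparing the whole column at once and then mapping True/False to labels. This is a sketch of that alternative on a made-up dataframe, not a replacement for .map() with a function, which remains the more flexible option:

```python
import pandas as pd

numbers = pd.DataFrame({'number': [1, 2, 3, 6]})   # made-up values

# The comparison runs on every row at once and gives a column of True/False
is_multiple = numbers['number'] % 3 == 0

# .map() also accepts a dictionary, swapping each value for its label
numbers['multiple of 3'] = is_multiple.map({True: 'Yes', False: 'No'})
print(numbers)
```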


In [28]:
# Here we're just making a large dataframe to work from for examples going forward
df = pd.DataFrame({'ChildId':['id1', 'id2', 'id3', 'id4', 'id5'],
                   'Age first contact':[6,12,11,1,19],
                   'Gender':['M','m', 'F', '', 'F' ],
                   'Birthday':['01/01/2002', '02/02/2003', pd.NA, '03/03/2023', '06/01/2012'],
                   'CP Plan?':['Y', 'n', 'N', 'No', 'yES'],})

print(df)
print(df.info())

  ChildId  Age first contact Gender    Birthday CP Plan?
0     id1                  6      M  01/01/2002        Y
1     id2                 12      m  02/02/2003        n
2     id3                 11      F        <NA>        N
3     id4                  1         03/03/2023       No
4     id5                 19      F  06/01/2012      yES
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   ChildId            5 non-null      object
 1   Age first contact  5 non-null      int64 
 2   Gender             5 non-null      object
 3   Birthday           4 non-null      object
 4   CP Plan?           5 non-null      object
dtypes: int64(1), object(4)
memory usage: 328.0+ bytes
None


So, let's use this df to go through some key dataframe processes: making selections based on criteria, some basic data cleaning, and making sure data is of the right type.

In [29]:
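# Making sure stuff is the right type: our dates are strings, we need them to be dates!
# Note: without dayfirst=True, pandas reads an ambiguous date like '06/01/2012' month first, so it becomes 1 June 2012
df['Birthday'] = pd.to_datetime(df['Birthday'])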
print(df['Birthday'].dtypes) # Now it's a datetime object, we can do a calculation to work out current ages!

df['Age'] = pd.to_datetime('13/09/2023', dayfirst=True) - df['Birthday']
print(df['Age']) # It's now a time delta object, we want age in years!


import numpy as np
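df['Age'] = df['Age'] / pd.Timedelta('365 days') # Divide by a 365-day year to get an approximate age in years (this ignores leap years)
df['Age'] = df['Age'].round().astype('int', errors='ignore') # Round to the nearest year and try to convert to integers. The missing age can't be stored as an int, so errors='ignore' leaves the column as floats (22.0 rather than 22)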
print(df)

# Let's do an error check and take rows where the age at first contact is older than their current age
error_df = df[df['Age first contact'] > df['Age']]
print(error_df) # We can see, based on our selection, child id5 has an error

datetime64[ns]
0   7925 days
1   7528 days
2         NaT
3    194 days
4   4121 days
Name: Age, dtype: timedelta64[ns]
  ChildId  Age first contact Gender   Birthday CP Plan?   Age
0     id1                  6      M 2002-01-01        Y  22.0
1     id2                 12      m 2003-02-02        n  21.0
2     id3                 11      F        NaT        N   NaN
3     id4                  1        2023-03-03       No   1.0
4     id5                 19      F 2012-06-01      yES  11.0
  ChildId  Age first contact Gender   Birthday CP Plan?   Age
4     id5                 19      F 2012-06-01      yES  11.0
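
Dividing by 365 days and rounding gives the nearest year, which is why id1 shows as 22 even though their 22nd birthday hasn't happened yet on 13/09/2023. If we need age in completed years, one option is to compare the dates directly. This is only a sketch using Python's built-in datetime module; age_in_years is a hypothetical helper name, not something from pandas:

```python
from datetime import date

def age_in_years(birthday, on_date):
    """Completed years between birthday and on_date."""
    # Has this year's birthday already happened by on_date?
    had_birthday = (on_date.month, on_date.day) >= (birthday.month, birthday.day)
    return on_date.year - birthday.year - (0 if had_birthday else 1)

reference = date(2023, 9, 13)   # the same reference date used above

print(age_in_years(date(2002, 1, 1), reference))   # 21
print(age_in_years(date(2012, 6, 1), reference))   # 11
print(age_in_years(date(2023, 3, 3), reference))   # 0
```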


We've converted to dates, calculated ages, and found a row with errors, so let's do a little cleaning now.

In [30]:
# Convert all gender values to lower case
df['Gender'] = df['Gender'].str.lower()
print(df)

# convert all cp plan? to lower... but we still need the right things in each column
df['CP Plan?'] = df['CP Plan?'].str.lower()

df['CP Plan?'] = df['CP Plan?'].apply(lambda x: 'y' if 'y' in x else 'n' if 'n' in x else pd.NA) # This has got a bit more complicated: we've used a lambda function to change values based on what they contain
print(df)

df = df.fillna(pd.NA) # .fillna() can be used to fill missing values with whatever we want. Standard would be pd.NA, which pandas understands as a missing value
# Note, this hasn't worked for our empty value in gender as an empty string is still a string!
print(df)

df = df.replace(r'^\s*$', pd.NA, regex=True) # The easiest way to fill it is to replace anything matching the regular expression for an empty (or whitespace-only) string with an NA
# We can see in the table below we have a NaN, an <NA>, and an NaT, these are the Not a Number, Empty, and Not a Time NAs for each data type.

  ChildId  Age first contact Gender   Birthday CP Plan?   Age
0     id1                  6      m 2002-01-01        Y  22.0
1     id2                 12      m 2003-02-02        n  21.0
2     id3                 11      f        NaT        N   NaN
3     id4                  1        2023-03-03       No   1.0
4     id5                 19      f 2012-06-01      yES  11.0
  ChildId  Age first contact Gender   Birthday CP Plan?   Age
0     id1                  6      m 2002-01-01        y  22.0
1     id2                 12      m 2003-02-02        n  21.0
2     id3                 11      f        NaT        n   NaN
3     id4                  1        2023-03-03        n   1.0
4     id5                 19      f 2012-06-01        y  11.0
  ChildId  Age first contact Gender   Birthday CP Plan?   Age
0     id1                  6      m 2002-01-01        y  22.0
1     id2                 12      m 2003-02-02        n  21.0
2     id3                 11      f        NaT        n   NaN
3     id
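
One thing to watch with the 'y' in x check used above is that it matches a y anywhere in the text, so a value like 'maybe' would be cleaned to 'y'. A slightly stricter sketch, on a made-up column, checks only the first letter and turns anything else into a missing value:

```python
import pandas as pd

raw = pd.Series(['Y', 'n', 'No', 'yES', 'maybe', ''])   # made-up messy values

lowered = raw.str.lower()

# Look only at how the value starts; anything unexpected becomes pd.NA
cleaned = lowered.map(
    lambda v: 'y' if v.startswith('y') else 'n' if v.startswith('n') else pd.NA
)

print(cleaned)
print(cleaned.isna().sum())   # 2, the 'maybe' and the empty string
```

Whether 'maybe' should count as missing is a judgement call for the data in question; the point is only that the rule we write decides it.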

In [31]:
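# Let's say we have another child
new_child = {
    'ChildId': ['id8'],
    'Age': [10],
    'Gender': ['m'],
    'Birthday': [pd.to_datetime('05/12/1993')], # read month first, so this is 12 May 1993
    'NHS Number': '666', # a single value rather than a list is fine here, pandas repeats it for every row
}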

print(new_child)
df = pd.concat([df, pd.DataFrame(new_child)], ignore_index=True)
print(df) # this has allowed us to add a new child, and has also added a new column to include the new data
print(df.dtypes)

{'ChildId': ['id8'], 'Age': [10], 'Gender': ['m'], 'Birthday': [Timestamp('1993-05-12 00:00:00')], 'NHS Number': '666'}
  ChildId  Age first contact Gender   Birthday CP Plan?   Age NHS Number
0     id1                6.0      m 2002-01-01        y  22.0        NaN
1     id2               12.0      m 2003-02-02        n  21.0        NaN
2     id3               11.0      f        NaT        n   NaN        NaN
3     id4                1.0   <NA> 2023-03-03        n   1.0        NaN
4     id5               19.0      f 2012-06-01        y  11.0        NaN
5     id8                NaN      m 1993-05-12      NaN  10.0        666
ChildId                      object
Age first contact           float64
Gender                       object
Birthday             datetime64[ns]
CP Plan?                     object
Age                         float64
NHS Number                   object
dtype: object
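
Notice that 'Age first contact' changed from int64 to float64 after the concat. The new row has no value for that column, and an ordinary integer column can't hold a missing value, so pandas switches to floats. A minimal sketch on a made-up series, showing the nullable 'Int64' type (capital I) as one way to keep whole numbers alongside gaps:

```python
import pandas as pd

ages = pd.Series([6, 12, None])   # made-up values with one gap
print(ages.dtype)                 # float64, the gap is stored as NaN

nullable = ages.astype('Int64')   # pandas' nullable integer type
print(nullable.dtype)             # Int64
print(nullable)                   # whole numbers, with <NA> for the gap
```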


In [32]:
nhs_numbers = pd.DataFrame([
                            {'ChildId':'id1',
                            'NHS Number': '303',},
                            {'ChildId':'id2',
                            'NHS Number': '3u5029',},
                            {'ChildId':'id3',
                            'NHS Number': 'gqw3',},
                            {'ChildId':'id4',
                            'NHS Number': 'avsgvb',},
                            {'ChildId':'id5',
                            'NHS Number': 'varwvw',},
                            ])

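df = pd.merge(df, nhs_numbers, how='left', on='ChildId') # A left merge keeps every row of df and adds matching rows from nhs_numbers
# Both dataframes have an 'NHS Number' column, so pandas adds the suffixes _x (from df) and _y (from nhs_numbers)
df['NHS Number_x'] = df['NHS Number_x'].fillna(df['NHS Number_y']) # Fill the gaps in our original column with the merged-in values
print(df)
df.drop('NHS Number_y', axis=1, inplace=True) # We no longer need the _y column
df.rename({'NHS Number_x': 'NHS Number'}, axis=1, inplace=True) # Put the original column name back
print(df)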

  ChildId  Age first contact Gender   Birthday CP Plan?   Age NHS Number_x  \
0     id1                6.0      m 2002-01-01        y  22.0          303   
1     id2               12.0      m 2003-02-02        n  21.0       3u5029   
2     id3               11.0      f        NaT        n   NaN         gqw3   
3     id4                1.0   <NA> 2023-03-03        n   1.0       avsgvb   
4     id5               19.0      f 2012-06-01        y  11.0       varwvw   
5     id8                NaN      m 1993-05-12      NaN  10.0          666   

  NHS Number_y  
0          303  
1       3u5029  
2         gqw3  
3       avsgvb  
4       varwvw  
5          NaN  
  ChildId  Age first contact Gender   Birthday CP Plan?   Age NHS Number
0     id1                6.0      m 2002-01-01        y  22.0        303
1     id2               12.0      m 2003-02-02        n  21.0     3u5029
2     id3               11.0      f        NaT        n   NaN       gqw3
3     id4                1.0   <NA> 2023-0