# Introduction to Python 
## Part 1

<br><br>

**NISR** <br>
2021

## Table of Contents
<a href="#Background"><font size="+1">Background</font></a>
* Learning Objectives
* Working with Python
* Jupyter Notebooks  
* Markdown

<a href="#Basic-Python"><font size="+1">Basic Python</font></a>
* Numeric Values
* Text Values
* Logical Values
* Casting (Type Conversions) in Python
* Variable Assignment
* Simple Conditional Statements

<a href="#Python-Objects"><font size="+1">Python Objects</font></a>
* Lists
* Tuples
* Accessing Values in Lists and Tuples
* Dictionaries
* Keeping Track of Objects
* Iterable Objects
* Getting Help
* General Purpose Objects

<a href="#The-Pandas-Library---Objects-for-Data-Science"><font size="+1">The Pandas Library - Objects for Data Science</font></a>
* Series Objects
* DataFrame Objects
* Importing Python Modules (pandas)
* Setting a Working Directory
* Reading in Data
* Exploring the Data

<a href="#Selecting-Columns-from-DataFrames"><font size="+1">Selecting Columns from DataFrames</font></a>
* Selecting single and multiple columns.
* <a href="#Exercise-1">Exercise 1</a>

<a href="#Filtering-rows-from-Dataframes"><font size="+1">Filtering Rows from DataFrames</font></a>
* Simple filtering
* Conditional filtering
* <a href="#Exercise-2">Exercise 2</a>
* Using multiple conditions to filter
* <a href="#Exercise-3">Exercise 3</a>

<a href="#Generating-New-Variables"><font size="+1">Generating New Variables</font></a>
* Creating Binary Variables
* Constant Value Variables
* Creating Variables Based on Existing Columns
* Classifying Numerical Values using pd.cut() and pd.qcut()
* Removing (Dropping) Columns
* <a href="#Exercise-4">Exercise 4</a>

<a href="#Consolidation"><font size="+1">Consolidation</font></a>
* Today's Learning Objectives
* Plan For Day 2

# Background

## Learning objectives

The goal of this session is to:
* Familiarise yourself with the Jupyter notebook, which is what we'll use to interact with python
* Learn about python's basic data types - integers, floats, strings and Booleans.
* Explore python's general data structure objects - lists, tuples, and dictionaries.
* Take a first look at 'pandas', python's data manipulation and analysis library.
    * The Series and DataFrame objects.
    * Reading data into python using pandas.
    * Getting information about DataFrames.
* Learn some basic DataFrame manipulation approaches:
    * Select columns
    * Filter rows
    * Generate new variables

## Working with Python

Python is:

* A general purpose programming language which has been designed for readability. 
* Created by Guido van Rossum, who chose the name (referring to Monty Python) to indicate that Python should be fun to use.
* Interoperable across multiple computing platforms (e.g. Windows, Mac, Linux).
* Used by many people, giving it a large user base.
* Open-source - It is free to use, which makes it more cost effective than licensed software such as SAS and SPSS.
* Extensible - A broad range of modules: Data Management, Visualisation, Statistics, Machine Learning, Scientific Computing...
* High-level - Concise expressions compared to lower-level languages (e.g. C, Java etc.) giving greater productivity.

There are a number of different ways of interacting with python. In fact, your Anaconda install gives you three different options:
* Jupyter notebook - a straightforward browser-based interactive python environment great for data analysis tasks.
* IPython qtconsole - a stripped-down, command line python application, suitable for more advanced users. 
* Spyder - a python IDE (integrated development environment), for fully-featured software development.

In [1]:
# Execute this code cell for the 'zen' of python
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## Jupyter Notebooks

You're going to be using Jupyter 'notebooks' in this training because they allow you to author and execute python code to perform data analysis tasks, and to write descriptive material that supports the code using a simple language called 'markdown'.

Jupyter Notebook gives the opportunity to combine code and description in a single human-readable notebook. You can conduct analysis and give interpretation side-by-side. 

This means that your entire analytical approach can be documented together, from raw data input to analysis model outputs, and from initial assumptions to interpretation of results and conclusions. This also means that notebooks are great for sharing analysis, as the approach taken can be clearly and transparently documented both in code, and with supporting text.

### Working with notebooks

* **Creating New Notebooks** - Navigate to where you'd like to create a new notebook, select the 'New &#9660;' button towards the top-right of the screen and choose 'Python 3'. This will open a new, untitled notebook in a new tab.

***Basic tools***

* **Saving Notebooks** File -> Save and Checkpoint, or click the save icon (the floppy disk), or Ctrl + s.
 * Notebooks are saved into your user area (e.g. C:/Users/[User_Name]) by default
 

* **(re)Naming Notebooks** Just click on the title (e.g. 'Intro to Python - Day 1', 'Untitled' etc.) at the top of the page and type the name you'd like. This will be used to save the notebook, as the chosen name + '.ipynb' (ipython notebook, the file type).


* **Adding New Cells** Either click the **+** button, or use Insert -> Insert Cell Above/Below.

Importantly, there are two types of cell you should be aware of: code cells, and markdown cells.

* **Code Cells** are what you type python code into.
* **Markdown Cells** are where you write supporting text, equations etc.

You can toggle between Code and Markdown cells (and others that we won't cover here) using the drop-down box on the toolbar above.

Finally, it is useful to mention that notebooks have 2 'modes': edit mode and command mode.

* A cell in **Edit Mode** has a green border and a cursor in the cell indicating that you can write in the cell.
* A cell in **Command Mode** has a blue border. This indicates that you are not editing a specific cell, but can instead control the properties of the whole notebook using keyboard shortcuts.

Most of the time you'll be using edit mode, which is the default when you click into a cell and start typing code. Command mode is really for more advanced users who have memorised some of the notebook keyboard shortcuts. To enter command mode, you can either click outside of the cell space (i.e. on the grey border space), or press Esc when you are editing a cell.

Speaking of which, you can see all the keyboard shortcuts, and get other help on notebooks from Help in the menu bar above. Try Help -> Keyboard Shortcuts.

## Markdown

Markdown is what we've used to produce all the descriptive text in this notebook. 

Markdown is a really simple 'markup' language. This means that you can write in plain text and provide some formatting instructions which change how the text is displayed once the cell is executed. Here are a few pointers on how to use it to create your own markdown supporting text.

* Try creating a new cell and make it a 'Markdown' cell using the drop-down box on the toolbar at the top of the notebook.
* Then, type some text in and press Ctrl + Enter. See how executing the cell renders the text?

There are a range of ways you can decorate plain text to produce formatted text when executed. Try:

* Putting 1 or more # at the start of a line, then typing after it (e.g. ### A third level header)
* Surrounding a word or phrase with asterisks, e.g. \*hello world\*
* What about two asterisks? e.g. \*\*hello world\*\*
* Try creating a bullet list by starting a line with an asterisk (e.g. * item 1) or a numbered list (e.g. 1. item 1)

There's loads that Markdown can represent, including embedding images and equations. From the toolbar choose Help -> Markdown to see some detailed information on using Markdown.

You can edit existing Markdown too: just double-click into the cell, amend it, and then press Ctrl + Enter or Shift + Enter to render it again.

### Images

You can display an image by adding ! and wrapping the alt text in [ ]. Then wrap the link for the image in parentheses ().
```
![This is an image](https://myoctocat.com/assets/images/base-octocat.svg)  
```


![This is an image](https://myoctocat.com/assets/images/base-octocat.svg)    


## Python is very easy to learn. 🐍
## It does not take long to learn. ⌚️😍
## Python is used by developers 👨🏽‍💻, researchers 👩🏽‍🔬, and instructors 👨🏿‍🏫.
## Thank you 🙏🏽👋


# Basic Python

## Entering Values in a Notebook Cell

We can use Python as a basic calculator by entering values in the cells. This doesn't store any of the information other than in the notebook cell; you're just passing information to the python interpreter and receiving an answer.

## Numeric Values

Python has two basic numeric types:
* ints (integers, whole numbers) e.g. 6, 12, 100000, 8960005060030102
* floats (floating-point numbers, decimal numbers) e.g. 3.4, 7.8, 3.14, 98.99997

We can use mathematical expressions

* \+ for addition
* \- for subtraction
* \* for multiplication
* / for division
* \*\* for powers and roots

Plus, loads of other useful operators, including:

* () Brackets for order of operations  
* % modulo - remainder in integer division (e.g. 11%4 = 3, because 11 fully divides by 4 twice, with 3 (of 4) remaining)  
* // Floor Division (Rounds down to the nearest integer, e.g. 12/5 = 2.4 but 12//5 = 2)

And, some built-in functions:
* `abs(`_number_`)` - get the absolute value of a number.
* `divmod(`_dividend_, _divisor_`)` - get the quotient and remainder from integer division. Combines floor division and modulo.
* `pow(`_number_, _power_`)` - function version of _number_\*\*_power_
* `round(`_number_, _ndigits_`)` - round a floating-point number to _ndigits_ number of decimal places.
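
As a quick sketch of the operators and built-in functions listed above, the snippet below applies each of them to some small numbers. The values and variable names are chosen purely for illustration.

```python
# Operators (values chosen for illustration).
remainder = 11 % 4         # modulo: 3
floored = 12 // 5          # floor division: 2
true_division = 12 / 5     # ordinary division always gives a float: 2.4
square_root = 16 ** 0.5    # a fractional power gives a root: 4.0

# Built-in functions.
distance = abs(-7.5)                 # 7.5
quotient, leftover = divmod(11, 4)   # 2 and 3
cubed = pow(2, 3)                    # 8
rounded = round(3.14159, 2)          # 3.14

print(remainder, floored, true_division, square_root)
print(distance, quotient, leftover, cubed, rounded)
```

Note that `divmod()` returns two values at once, which is why it is assigned to two names.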

In [4]:
# Addition of ints (produces an int)
x = 1
y = 1
z = x+y
print(f'x ({type(x)}) + y ({type(y)}) = z ({type(z)})')

x (<class 'int'>) + y (<class 'int'>) = z (<class 'int'>)


In [5]:
# Addition of Floats (produces a float)
x = 1.0
y = 1.0
z = x+y
print(f'x ({type(x)}) + y ({type(y)}) = z ({type(z)})')

x (<class 'float'>) + y (<class 'float'>) = z (<class 'float'>)


In [6]:
# Addition of Ints and Floats (produces...?)
x = 1.0
y = 1
z = x+y
print(f'x ({type(x)}) + y ({type(y)}) = z ({type(z)})')

x (<class 'float'>) + y (<class 'int'>) = z (<class 'float'>)


## Text Values

In python, we refer to text as 'string' data, because it is a sequence of characters strung together.
* Strings are ordered sequences of characters (letters, numbers, punctuation, whitespace and other special symbols).
* Strings need to be enclosed in apostrophes (' ') or quotation marks (" ")
    * It doesn't matter which, so long as you're consistent. You can't start with ' and end with "!
    * 'hello world!' is the same as "hello world!"

Strings can be manipulated with some arithmetic operators:
* \+ concatenates strings (joins them together).
* \* repeats a given string *n* times.

Strings are *objects* and so have a number of built-in *methods* for manipulating them.
* Use a '.' to access the string object's methods, e.g.
    * *string*`.lower()` transforms a string to all lowercase.
    * *string*`.upper()` transforms a string to all UPPERCASE.
    * *string*`.title()` transforms a string to Title Case.
* Some string methods require parameters, e.g.
    * *string*`.count('a')` counts the number of 'a' characters in the string, returning an integer.
    * *string*`.replace(" ","_")` replaces any spaces in the string with underscores. 

There are loads more string methods, see: https://docs.python.org/3/library/stdtypes.html#string-methods for more info.
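
The methods listed above can also be chained one after another. The following is a small sketch using a made-up example string; it also shows that string methods return a new string and leave the original unchanged.

```python
greeting = "hello world"   # hypothetical example value

title_case = greeting.title()                    # 'Hello World'
shouted = greeting.upper().replace(" ", "_")     # methods chained: 'HELLO_WORLD'
letter_count = greeting.count("l")               # 3

# The original string has not been modified by any of the calls above.
print(title_case, shouted, letter_count, greeting)
```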

In [7]:
# Enter a string here with quotation marks ""
"String"

'String'

In [8]:
# Now with apostrophes ''
'String'

'String'

In [9]:
# We can use the + symbol here too - but it works differently!
"Data" + " Science " + "Campus"

'Data Science Campus'

In [10]:
# Multiplication too!
"NISR" * 5

'NISRNISRNISRNISRNISR'

In [11]:
#.lower()
'NISR'.lower()

'nisr'

In [12]:
#.upper()
'nisr'.upper()

'NISR'

In [13]:
#.count()
'nisr'.count('n')

1

In [14]:
#.replace()
'nisr'.replace('n','N')

'Nisr'

## Logical Values

Logical values are also known as "Boolean values" (NB Boolean is capitalised as it refers to the logician George Boole).

In Python these are written as `True` and `False`.

They are special python data types - not strings! So they don't need ' ' or " " around them.

They also have numeric values behind them - True is 1 and False is 0.

In [15]:
True

True

In [16]:
False

False

In [17]:
# You can add them together
True + True

2

In [18]:
# In fact you can attempt any arithmetic operations with booleans
# This owes to their underlying numeric value of either 1 or 0.
True * 2.5

2.5

In [22]:
# Try testing True and True, then False and False
print(f'True and True: {True and True}')
print(f'False and False: {False and False}')

True and True: True
False and False: False


In [23]:
# Try testing False and True
True and False

False

## Casting (Type Conversion) in Python

Python doesn't require you to set the data type when you create an object. Instead, it figures out what the best data type is for the object you are creating - int, float, string, boolean etc.

Sometimes you want to ensure that a particular object is actually a certain type, rather than leaving it up to python. This is done using casting. Python has a range of in-built functions that enable you to cast data from one type to another.

Firstly, the `type()` function will tell you what a given object is.

Then, if it is allowed, you can convert that object to a specific data type using:
* `int()`
* `float()`
* `str()`
* `bool()`

You can't turn strings into numbers unless the string in question is a numeric value stored as a string (e.g. '3' instead of 3).

Some transformations make sense, such as ints into floats and vice versa.

However, try turning numbers and text into Booleans, or making `True` or `False` a string. What do you think will happen?

In [24]:
# Learn the type of different objects
type(True)

bool

In [25]:
# Int from whole number float
int(4.0)

4

In [26]:
# Int from decimal number float
int(7.8)

7

In [27]:
# Float from int
float(7)

7.0

In [28]:
# int from string: int('5.0') raises a ValueError, so convert to a float first
int(float('5.0'))

5

In [29]:
# float from string
float("5")

5.0

In [None]:
# Try to find some transformation that python doesn't like.
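
One possible answer to the prompt above, given as a sketch rather than the only solution: numbers and text can be cast to Booleans, but text that does not look like a number cannot be cast to an int. The string `"five"` is an arbitrary example.

```python
# Booleans from numbers and text: zero and the empty string are False,
# any other number or non-empty string is True.
print(bool(0), bool(42), bool(""), bool("False"))   # False True False True

# Booleans to strings.
true_text = str(True)    # 'True'

# A cast that is not allowed raises a ValueError.
try:
    int("five")
except ValueError as error:
    message = str(error)
    print("Cannot convert:", message)
```

Note that `bool("False")` is `True`, because any non-empty string counts as true regardless of what it says.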


## Variable Assignment

* Variables store data values under a specific name:
    * `name = 'Laura'` - here the variable `name` is storing the string `'Laura'`.
    * `age = 21` - here the variable `age` is storing the integer `21`.
    * `height = 1.80` - here the variable height is storing the float `1.80`.
* Variables have to be assigned. Assignment is done with the `=` (equals) sign.
* Once created, using a variable means getting the value of whatever it is storing.
    * Inputting `height` into python will return `1.80` until the variable is either changed or deleted.
* We can change the value of variables at any point by assigning it a new value.
    * `height = 1.75` - reassigns the height variable to the value `1.75`
    * `height = "egg"` - reassigns the height variable to the value `"egg"`
    * `del height` - deletes the height variable, it is no longer in python's memory.
* Variable names are case sensitive
* Variable names can't start with a number (must start with an _ or a letter)
* Can't use reserved words - like `True`, `class`, `yield` which already mean something to python.
* Should be descriptive (ideally).
* Can be any length - but should be sensible!
* Can look like this
    * norwegianblue (lowercase)
    * NORWEGIANBLUE (UPPERCASE)
    * norwegianBlue (camelCase)
    * NorwegianBlue (UpperCamelCase)
    * norwegian_blue (snake_case, can also be Norwegian_Blue)
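
The assignment rules above can be sketched in a few lines. The values reuse the `height` examples from the list; the `deleted` flag is a hypothetical helper added only to show the outcome.

```python
height = 1.80          # assign a float
height = 1.75          # reassign: the old value is replaced
height = "egg"         # a variable can be reassigned to a different type

Height = 1.80          # case sensitive: Height and height are different variables
print(height, Height)

del height             # delete the lowercase variable
try:
    print(height)
except NameError:
    deleted = True     # height no longer exists, so using it raises an error
    print("height has been deleted")
```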

In [154]:
name = ''
surname = ""

print(name)
print(surname)





In [155]:
an_integer = 4

In [32]:
a_float = float(an_integer)

In [33]:
a_float

4.0

## Simple Conditional Statements

Conditional statements are basic parts of 'control flow' that allow for different actions depending on whether a given condition evaluates to `True` or `False`. Python implements conditional statements in the format:
```python
if <condition>:
    # Do something.
elif <condition>:
    # Do something else.
else:
    # Do another thing.
```
Depending on context you might see:
* A single `if` with a condition. `False` evaluations are ignored.
* An `if... else...` condition. `False` evaluations are all handled by `else`.
* An `if...elif...else...` condition.
    * There can be one or more `elif` conditions.
    * The conditions are evaluated in order, with further evaluation terminated once a condition is met.

Conditions in python are usually written using the comparison operators, which return Boolean values:
* == to mean 'is equal to'
* != to mean 'is not equal to'
* \>, <, >=, <= to give inequalities

Methods that return booleans can also be used to determine different courses of action, e.g.
* *string*`.isnumeric()` returns `True` if all characters in a string are numeric.
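
As a minimal sketch of the comparison operators in an `if...elif...else` block, the example below sorts an age into a group. The variable names, thresholds and group labels are made up for illustration.

```python
age = 21   # hypothetical value for illustration

if age < 18:
    group = "child"
elif age >= 65:
    group = "older adult"
else:
    group = "adult"

print(group)                                    # adult
print(age == 21, age != 21)                     # True False
print("2021".isnumeric(), "NISR".isnumeric())   # True False
```

Only the first condition that evaluates to `True` has its block executed; the remaining conditions are skipped.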

In [34]:
# Conditional example
test = 'data science'

if len(test) < 5: # NB len() gives the 'length' of an object
    print("'" + test + "'" + " has very few characters")
elif test.istitle():
    print("'" + test + "'" + " is in title case")
else:
    print("That's a dull object!")

That's a dull object!


In [35]:
# Conditional example 2
string_number = 'Dan'

if string_number.isnumeric():
    string_number = float(string_number)

type(string_number)

str

# Python Objects

Object is the general name for data types, data structures, functions and so on that python stores in memory and references with an identifier. Technically, an object is a specific instance of a 'class', which is an abstract template for a data type, data structure etc. A class is a pattern for creating new objects that can be reused and extended.

```python
days_list = ["Mon", "Tues", "Weds", "Thurs", "Fri", "Sat", "Sun"]
```
`days_list` is the identifier for a python 'list' data structure (object) that in this case stores some text values.

All objects have properties and methods, which relate to data that the object stores and behaviours (procedures) that can be performed on the object to get a specific output.

Accessing the properties and methods of an object usually means using the '.' (dot notation). This means we type the name of the object we are interested in, put a dot, then type the name of the property or method we want to call.

For instance, calling the `clear()` method of a list object named `fruit` would be achieved like this:
```python
fruit.clear()
```
This would have the effect of removing any items stored by the list `fruit`, emptying the list.

Similarly, if we wanted to use the count method to count the number of times a particular item occurred in a list, we might do the following:
```python
fruit = ['apple','apple','banana','orange'] # create a list called fruit.
fruit.count('apple') # count the number of times that the string 'apple' occurs in the fruit list
```
In the above, we would expect the `.count()` method, with the parameter 'apple' to return the value 2, because 'apple' appears twice in the list `fruit`.

Additionally, there are some special in-built python functions that interact with objects and perform special behaviours, like `print()` and `len()`, or the numeric functions (e.g. `pow()`) you've seen already.

For example, calling the python `print()` function enables us to print a representation of an object based on how a given object behaves with respect to that function.
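
Putting the pieces above together, here is a short sketch that uses dot notation for methods and the built-in `len()` function on the same `fruit` list. The variable names other than `fruit` are illustrative.

```python
fruit = ['apple', 'apple', 'banana', 'orange']

apples = fruit.count('apple')   # method call with a parameter
size_before = len(fruit)        # built-in function applied to the object
fruit.clear()                   # method call that empties the list in place
size_after = len(fruit)

print(apples, size_before, size_after, fruit)
```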

## Lists

* The python list object is (possibly) the most versatile of the built in data structures.

* It can hold any sequence of objects, including mixes of objects like strings, integers and Boolean variables together.

* Lists can also hold other lists and dictionaries.

* They can also hold custom data structures, and can be used to create custom data structures.

* Lists are generally created, or 'instantiated', using square brackets `[]`.

* A new list is considered to be an 'instance' of the list object.

* Each item in the list is separated by a comma.

* Lists can be amended (they are 'mutable' - they can be changed).

* A string is not a list, but it is also a sequence (of characters). This becomes clear if we explicitly convert it to a list.

In [36]:
# Create a list.
days_list = ["Mon", "Tues", "Weds", "Thurs", "Fri", "Sat", "Sun"]
print(type(days_list))
print(days_list)

<class 'list'>
['Mon', 'Tues', 'Weds', 'Thurs', 'Fri', 'Sat', 'Sun']


In [37]:
# A list of mixed objects
mixed = [2,2.3,'string',True]
mixed

[2, 2.3, 'string', True]

In [38]:
# We can use the inbuilt function list() to convert a string to a list.
mp = list("Monty Python")
mp

['M', 'o', 'n', 't', 'y', ' ', 'P', 'y', 't', 'h', 'o', 'n']

There are a number of methods we can call on list objects, including -  
 * _list_`.append(object)` - adds an object to the end of the list.
 * _list_`.extend([objects])` - similar to append, but the items inside the given list are added one by one, rather than the list itself.
 * _list_`.insert(index, object)` - allows you to insert an object at a particular position in the list.
 * _list_`.pop(index)` - removes and returns the object at a particular position.

In [39]:
list_1 = [10, 20, 30, 40, 50, 60]
list_1.append(70) # append integer object 70 to end of list
list_1

[10, 20, 30, 40, 50, 60, 70]

In [40]:
list_2 = [10, 20, 30, 40, 50, 60]
list_2.extend(list_1) # extend list with items in second list.
list_2

[10, 20, 30, 40, 50, 60, 10, 20, 30, 40, 50, 60, 70]

In [41]:
# What will happen here?
list_2.append(list_1)
list_2

[10,
 20,
 30,
 40,
 50,
 60,
 10,
 20,
 30,
 40,
 50,
 60,
 70,
 [10, 20, 30, 40, 50, 60, 70]]

In [42]:
list_1 = [10, 20, 30, 40, 50, 60]
list_1.insert(0, 5) # insert the value 5 at index zero (the 1st position in python).
list_1

[5, 10, 20, 30, 40, 50, 60]

In [43]:
list_1 = [10, 20, 30, 40, 50, 60]
value = list_1.pop(3) # remove the value at index 3 (the 4th position) and store it as a variable called value.
print(value)
print(list_1)

40
[10, 20, 30, 50, 60]


In [44]:
# Note that arithmetic approaches can have similar results.
list_1 += [80]
list_1

[10, 20, 30, 50, 60, 80]

In the code above there is a common programming shortcut being employed: `+=`  
The following two assignments give the same result here:
```python
list_1 = list_1 + [80]
list_1 += [80]
```
Other possible assignments operators include: `-=`, `*=`, `/=`, `//=`, `**=` and a few others.

Only `+` and `*` are meaningful for list operations, so the other assignment operators give errors when used on lists. They will all work for updating a numeric value, however.
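
To illustrate, here is a small sketch of the assignment operators on a number and on a list. The variable names and values are hypothetical.

```python
total = 10
total -= 3      # 7
total *= 2      # 14
total //= 4     # 3
total **= 2     # 9

numbers = [1, 2]
numbers *= 2    # repetition works for lists: [1, 2, 1, 2]

try:
    numbers -= [1]   # subtraction is not defined for lists
except TypeError as error:
    list_error = str(error)
    print("Not allowed:", list_error)

print(total, numbers)
```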

In [45]:
# Starting values for cell below.
lump_sum = 100
interest = 1.025 # 2.5%

In [46]:
# Imagine this is a compound interest calculation.
# Run this cell a few times and see the output value change.
lump_sum *= interest
lump_sum

102.49999999999999

## Tuples

* Tuples are similar to lists, they can also hold any mixed sequence of objects.

* However, tuples cannot be edited once created (immutable).

* Tuples are created using parentheses () with items separated by commas.

* The tuple() function can convert lists or strings to tuples.
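
The points above can be sketched as follows. The example values are made up, and the `unchanged` flag is a hypothetical helper used only to record that the edit was refused.

```python
weekend = ("Sat", "Sun")
letters = tuple("abc")           # from a string: ('a', 'b', 'c')
from_list = tuple([1, 2, 3])     # from a list: (1, 2, 3)

try:
    weekend[0] = "Fri"           # tuples do not support item assignment
except TypeError:
    unchanged = True
    print("Tuples cannot be edited:", weekend)
```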


In [47]:
tuple1 = ("Monday", "Tuesday", "Wednesday")
type(tuple1)

tuple

In [48]:
tuple1

('Monday', 'Tuesday', 'Wednesday')

In [50]:
# You can convert to a list using list
list_from_tuple = list(tuple1)
type(list_from_tuple)

list

## Accessing Values in Lists and Tuples

In addition to creating list objects, python also uses square brackets as an indexing operator.

Therefore, to access a value in a list or tuple use the square brackets, along with the index of the value you want to obtain to return that value. e.g.
```python
vegetables= ['Carrot','Onion','Potato', 'Cauliflower']
vegetables[1]
```
In the above list, the value obtained is `'Onion'`. This is because python counts from zero:
* `vegetables[0]` is 'Carrot'
* `vegetables[1]` is 'Onion'
* `vegetables[2]` is 'Potato'
* `vegetables[3]` is 'Cauliflower'

The same would apply if we had made a tuple instead of a list.

If you are dealing with lists, you can also update the values in lists using indexing. This is because lists are mutable. e.g.
```python
vegetables= ['Carrot','Onion','Potato','Cauliflower']
vegetables[1] = 'Courgette'
```
Will give: `['Carrot', 'Courgette', 'Potato', 'Cauliflower']`

As `'Onion'` has been updated with `'Courgette'`

Ways to index:

Python Expression | Result | Explanation
------------------|--------|------------
vegetables[1]     |'Courgette'| Indexing from start of list at 0
vegetables[-2] | 'Potato' | Backwards indexing from end of list
vegetables[2:] | ['Potato','Cauliflower'] | Slicing a section from a given start index to the end of the list.
vegetables[:2] | ['Carrot','Courgette'] | Slicing a section from the start of the list until a given end index.
vegetables[1:3] | ['Courgette','Potato'] | Slicing a section from a given start index to a given end index.
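
The rows of the table can be checked with a short, self-contained sketch that rebuilds the `vegetables` list and applies the same update as in the text.

```python
vegetables = ['Carrot', 'Onion', 'Potato', 'Cauliflower']
vegetables[1] = 'Courgette'     # update by index, as in the text above

print(vegetables[1])     # Courgette
print(vegetables[-2])    # Potato
print(vegetables[2:])    # ['Potato', 'Cauliflower']
print(vegetables[:2])    # ['Carrot', 'Courgette']
print(vegetables[1:3])   # ['Courgette', 'Potato']
```

Slices include the start index but stop just before the end index, which is why `vegetables[1:3]` returns two items rather than three.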

Try some of these approaches below.

In [51]:
# Try indexing list_1 and list_2
list_1[0]

10

In [52]:
# slicing
list_1[3:5]

[50, 60]

In [53]:
# Backwards indexing
tuple1[-1]

'Wednesday'

## Dictionaries

* A dictionary has a set of 'keys' and each key has a single associated value.

* Instead of indexing by position as with lists and tuples, dictionaries are indexed by keys.

* When presented with a key, the dictionary will return the associated value.

* A dictionary key must be an immutable data type. Usually this will mean integers or strings, but it is possible to use a tuple as a dictionary key.

* In contrast, a dictionary value can hold any combination of objects. Including other dictionaries to create 'nested' dictionaries.

* Dictionaries are 'key-value pair' data structures, also known as a _hash_, a _map_, or a _hashmap_ in other programming languages.

* Python dictionaries are created using curly brackets {}.
```python
dictionary = {key1: value1, key2:value2, ...}
```
    
* Values can be indexed by using a key within square brackets.
```python
dictionary[key2]
```
    
* New keys can be assigned to a dictionary by providing a new key-value pair. However, if the key already exists this will overwrite the existing value, as dictionary keys must be unique.
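
A brief sketch of the behaviours described above: adding a key, overwriting an existing key, using tuples as keys, and nesting dictionaries. All names and values here are hypothetical examples.

```python
stock = {'apple': 3, 'banana': 5}
stock['orange'] = 2       # new key: a new key-value pair is added
stock['apple'] = 10       # existing key: the old value 3 is overwritten

# Tuples are immutable, so they can be used as dictionary keys.
grid = {(0, 0): 'origin', (1, 2): 'a point'}

# Values can be dictionaries themselves, giving a nested dictionary.
nested = {'steve': {'occupation': 'musician'}}

print(stock)
print(grid[(1, 2)])
print(nested['steve']['occupation'])
```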

In [54]:
# Example 
person1 = {'name': 'Steve', 'occupation': 'musician', 'instrument' : ['keyboards', 'guitar', 'vocals'], 
      'nationality': 'English'}

print("Name is" , person1['name'], 'and',
      "instrument 1 is ", person1['instrument'][1])

Name is Steve and instrument 1 is  guitar


In [55]:
# Example 2
person2 = {'name': 'John', 'occupation': 'Programmer', 'languages': ['python','C++','scala'], 'nationality': 'UK'}
person2

{'name': 'John',
 'occupation': 'Programmer',
 'languages': ['python', 'C++', 'scala'],
 'nationality': 'UK'}

In [56]:
# Add a new key to a dictionary
person1['age'] = 59
person1

{'name': 'Steve',
 'occupation': 'musician',
 'instrument': ['keyboards', 'guitar', 'vocals'],
 'nationality': 'English',
 'age': 59}

In [57]:
# Update an existing key
person2['occupation'] = 'Data Scientist'

## Keeping Track of Objects

The Jupyter Notebook has a 'magic' command `%whos` which shows the current objects in memory and some information about them.

In [58]:
# Have a look at currently instantiated objects.
%whos

Variable          Type      Data/Info
-------------------------------------
a_float           float     4.0
an_integer        int       4
days_list         list      n=7
interest          float     1.025
list_1            list      n=6
list_2            list      n=14
list_from_tuple   list      n=3
lump_sum          float     102.49999999999999
mixed             list      n=4
mp                list      n=12
name              str       
os                module    <module 'os' from '/opt/m<...>da3/lib/python3.9/os.py'>
person1           dict      n=5
person2           dict      n=4
string_number     str       Dan
surname           str       
sys               module    <module 'sys' (built-in)>
test              str       data science
this              module    <module 'this' from '/opt<...>3/lib/python3.9/this.py'>
tuple1            tuple     n=3
value             int       40
x                 float     1.0
y                 int       1
z                 float     2.0


## Iterable Objects

In python, some objects are iterable. This means that they can provide their items one at a time, in sequence. Iterable objects can therefore be looped over using a `for` loop. Iteration, in this context, means taking each item of an object, one after another.

A string is one of the simplest iterable objects. Using a for loop we can iterate over each character in the string and perform a task.

In [59]:
# iterate over characters in a string.
for char in "Data Science Campus":
    print(char)

D
a
t
a
 
S
c
i
e
n
c
e
 
C
a
m
p
u
s


In the code above the iterable object is the string "Data Science Campus". The `for` loop creates a variable, in this case called `char`, which holds the current item for a given iteration. In this case the `for` loop takes each character in the string in turn. The variable `char` can then be used within the loop (notice the indents). Here we just print `char`, printing out each character in the string in turn.

Here's a slightly more complex case:

In [60]:
# iterate over characters in a string, print true for vowels.
for char in "Data Science Campus":
    print(char, char.lower() in ['a','e','i','o','u'])

D False
a True
t False
a True
  False
S False
c False
i True
e True
n False
c False
e True
  False
C False
a True
m False
p False
u True
s False


Lists and tuples are also easily iterable using `for` loops, providing each item in turn as the local variable:

In [61]:
# iterate over days list
for item in days_list:
    print(item)

Mon
Tues
Weds
Thurs
Fri
Sat
Sun


Dictionaries are more complex objects and have a number of possible iteration behaviours.

The basic iteration behaviour iterates over keys:

In [62]:
# iterate over keys and get values by indexing.
for key in person1:
    #print(key)
    print(key, person1[key])

name Steve
occupation musician
instrument ['keyboards', 'guitar', 'vocals']
nationality English
age 59


In [63]:
for key in person1.keys():
    print(key)

name
occupation
instrument
nationality
age


_dictionary_`.keys()` produces the same behaviour if used as an iterable. Similarly _dictionary_`.values()` allows the values of a dictionary to be iterated over in the same order as the keys.

In [64]:
# Iterate over person1 values
for value in person1.values():
    print(value)

Steve
musician
['keyboards', 'guitar', 'vocals']
English
59


Dictionaries also have a more complex iterator, _dictionary_`.items()`, which produces a (key, value) tuple at each iteration. Because this produces a tuple, we can 'unpack' the tuple into two variables:

In [65]:
person1.items()

dict_items([('name', 'Steve'), ('occupation', 'musician'), ('instrument', ['keyboards', 'guitar', 'vocals']), ('nationality', 'English'), ('age', 59)])

In [66]:
# Iterate over person1 key:value pairs
for key, value in person1.items():
    print(key,"---", value)

name --- Steve
occupation --- musician
instrument --- ['keyboards', 'guitar', 'vocals']
nationality --- English
age --- 59


Python has a useful built-in iterable object called `range()`. It allows you to specify start, stop and step parameters to produce a sequence of integers.

In [67]:
# simple use of range
for i in range(5):
    print(i)

0
1
2
3
4


In [68]:
# intermediate use of range
for i in range(3,7):
    print(i)

3
4
5
6


In [69]:
# Advanced use of range
for i in range(0,10,2):
    print(i)

0
2
4
6
8


In [70]:
# advanced use of range
for i in range(5,0,-1):
    print(i)

5
4
3
2
1


## While Loops

`while` loops are different to `for` loops. `for` loops do definite iteration, where you know *a priori* how many loops you are doing, whereas `while` loops do indefinite iteration: they require a stop condition, and until that stop condition is met they simply keep iterating.

You start a while loop with `while` then you provide the condition that, when falsified, breaks the loop.

There is an example below:

In [71]:
# Set an initial condition
i = 0
# Now use a while loop to increment the variable i
while i < 6:
    print(i)
    i += 1 # += 1 means 'increment by 1 on each loop occasion'.

0
1
2
3
4
5
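
To illustrate indefinite iteration, here is a minimal sketch of a loop where the number of iterations is not known in advance. The variable names `value` and `steps` are made up for this illustration.

```python
# Halve a number until it drops below 1, counting the steps taken.
# We do not know in advance how many iterations this will need.
value = 100   # made-up starting value
steps = 0

while value >= 1:
    value = value / 2   # halve the value on each pass
    steps += 1          # keep count of the iterations

print(steps, value)
```

The loop stops as soon as the condition `value >= 1` is no longer true. If the condition could never become false, the loop would run forever, so it is worth checking that something inside the loop moves towards the stop condition.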


## Getting Help

Most python modules include information on how to use their functions in something called a 'docstring'. You can look at docstrings using python's in-built `help()` function. Run the code below to look at the docstring for the string object `str`'s `lower` method.

In [72]:
# Have a look at the docstring using python.
help(str.lower)

Help on method_descriptor:

lower(self, /)
    Return a copy of the string converted to lowercase.



When you're using a notebook, you can also access the docstring by pressing Shift-Tab with the cursor somewhere in the object. This produces a **tool-tip**: a small pop-up box with relevant information.

Pressing Shift-Tab once shows the parameters available for the particular object you are looking at.  
Pressing Shift-Tab twice shows the whole docstring.  
Pressing Shift-Tab a third time makes the tool-tip linger for 10 seconds.  
Pressing Shift-Tab a fourth time puts the tool-tip into a larger pane in the browser.  

The tool-tip provides the same information as the `help()` function, but with nicer formatting!

In [75]:
help(str.replace)

Help on method_descriptor:

replace(self, old, new, count=-1, /)
    Return a copy with all occurrences of substring old replaced by new.
    
      count
        Maximum number of occurrences to replace.
        -1 (the default value) means replace all occurrences.
    
    If the optional argument count is given, only the first count occurrences are
    replaced.



Note that in order to get help, we are using the 'abstract' object, rather than a specific instance of that object. Getting the help for the replace method of a string means using `str.replace` rather than say `"Data Science".replace()`.

### Docstrings

Docstrings look a bit scary at first, but once you learn how to read them they're super useful!

Take the docstring for the pandas function `read_csv`, which we use later in this session, as an example. The first block of text is the function we're looking at, `read_csv`, followed by all the possible parameters you can use to modify how pandas reads a csv file.

Note that the first parameter is 'filepath_or_buffer'. That is all we pass to this function later on: the file path of 'titanic.csv'.

If you scroll down a little, you can then see all of the possible parameters listed and described.

E.g. if your data were semi-colon delimited you'd have to specify the 'sep' parameter.  
No header row (i.e. no column names)? You could set 'header' to None.

At the very bottom of the docstring, you can see that read_csv function returns a 'DataFrame'.

Don't get too hung up on docstrings for now though. As you become more experienced they'll begin to make more sense!
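
One way to see where docstrings come from is to write one yourself. The sketch below uses a made-up function, `add_vat`, purely for illustration; it is not part of any library.

```python
def add_vat(price, rate=0.2):
    """Return price with VAT added.

    price
        The price before tax.
    rate
        The tax rate as a fraction (0.2 means 20%).
    """
    # add_vat is a hypothetical example function, not a library function
    return price * (1 + rate)

# help() prints the function signature followed by the docstring written above.
help(add_vat)

# The raw docstring text is also stored on the function itself.
print(add_vat.__doc__)
```

The text between the triple quotes is what `help()` and the notebook tool-tip display, which is why well-documented libraries are much easier to explore.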

## General Purpose Objects

The data structures above (lists, tuples, and dictionaries) are the core, in-built data structures available in python. They are always available for use and easily created and destroyed. They are general purpose, which means they are useful in many situations. However, they may not always be able to do everything we want them to do for specific purposes.

Nonetheless, it is really important to know a bit about lists, tuples and dictionaries because they are used throughout python, and you will be very likely to encounter them in some form. Perhaps as an output from a function or method; as a way of providing input to a procedure; or, as the way that the package you are using stores properties.

Many advanced python libraries that have been built for specific purposes, like data science, take advantage of the fact that objects are extensible. Programmers can create new classes based on existing objects and add in specific properties or methods that do specific things. This means that general purpose objects can be customised for particular tasks.

In data science applications, the Pandas library provides specialised objects that make working with data tables a lot easier than if we had to rely on python's general purpose objects. However, because pandas objects are specialised, they are also more complicated than general purpose objects.

# The Pandas Library - Objects for Data Science

Pandas adds two important data structures that we need to know about: the one-dimensional `Series` and the two-dimensional `DataFrame`.

## Series Objects

* A Series is like a one-dimensional array (effectively a list) with a label for each observation.

* You can think of a Series as being similar to a single column in a spreadsheet: a list of values.

* However, the items in a Series must all be of the same data type, unlike a list.

* The Series index provides a label for each observation, the default is for consecutive integer labels starting at 0.

* However, a Series index does not have to be consecutive numbers. The labels can be non-unique numbers, non-consecutive numbers, as well as string objects or tuples.

* As such, the Series itself is a bit like a list, whereas the Series index is a bit like a dictionary, and is able to return a value in the Series based on the index, like a key-value pair.

* The Series object implements a large number of special behaviours (methods, procedures) that can be called on the data held in a Series to get useful outputs. This includes mathematical functions, like taking the average of all values in a numerical Series.
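
The points above can be sketched with a small, made-up Series. The name `ages` and its labels are assumptions for illustration only.

```python
import pandas as pd

# A made-up Series with string labels instead of the default 0, 1, 2, ...
ages = pd.Series([29, 41, 35], index=["ann", "bob", "cat"])

print(ages["bob"])   # look up a value by its label, like a dictionary key
print(ages.mean())   # Series methods work on all the values at once
print(ages.dtype)    # the values in a Series share one data type
```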

##  DataFrame Objects

* A DataFrame is a two-dimensional version of a Series object.

* In effect, a DataFrame is a collection of Series objects - one Series for each column in the DataFrame.

* A DataFrame is thus similar to a whole spreadsheet, like an Excel file, a STATA .dta file, a SAS file, an SPSS .sav file etc.

* A DataFrame can hold columns with different data types, but as per Series above, any single column must contain values of the same data type.

* The dimensions are labeled in a similar way to the Series object:

    * **index** - refers to the row labels
    * **columns** - refers to the column labels
    
 * Having indexed data allows fast look up and powerful relational operations 
 
 * Each row has a label and each column has a label

Most of the time you'll be working with DataFrame objects in Pandas. However, as DataFrames are composed of Series, it is very likely that through indexing, selection, creation of new columns, and analysis you'll encounter Series objects too. Luckily, DataFrame and Series objects are similar. DataFrames effectively extend the Series into two dimensions and as such implement some additional properties and methods that may not be relevant to Series on their own.
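
As a rough sketch of how these pieces fit together, the example below builds a tiny DataFrame by hand. The name `pets` and its contents are made up for illustration.

```python
import pandas as pd

# Build a small, made-up DataFrame from a dictionary: each key becomes a column.
pets = pd.DataFrame({
    "name": ["Rex", "Tom", "Kiki"],
    "age": [3, 5, 2],
})

print(pets.index)          # row labels (0, 1, 2 by default)
print(pets.columns)        # column labels
print(type(pets["age"]))   # a single column comes back as a Series
```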


## Importing Python Modules (Pandas)

Python is great on its own, but most of the time you need to add useful functionality by 'importing' modules which expand python  along particular lines.

We'll import a module called 'pandas'.

Pandas massively expands what we can do with data. It allows for: the easy reading and writing of data in loads of different formats; the manipulation, selection, aggregation, merging and joining of data tables; the creation of statistical summaries; and, analysis and modeling.

In [76]:
# importing pandas is this simple!
import pandas as pd
import numpy as np

The above code cell has three lines.

The first line is a comment - Python (and other languages) "skip" anything with a # in front of it.

On the second line I've written my import statement. There are two parts to this statement:
1. import pandas
2. as pd

The third line imports a second module, numpy, under the alias 'np' in the same way.

In python, it is sufficient just to write 'import pandas'.

'pd' is a common nickname or 'alias' used for pandas - rather than write 'pandas' each time I can just write 'pd.'

Many packages have nicknames that are consistently applied by members of the wider python community, such as pandas being pd. You don't have to adhere to these conventions, but you'll probably see them used a lot in training and help materials online.
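
Aliases work the same way for any module. Here is a minimal sketch using `statistics` from python's standard library; the alias `stats` is just a name chosen for this example, not a community convention.

```python
# Import a standard-library module under a shorter alias.
import statistics as stats

values = [2, 4, 6]   # made-up data

# The alias is used exactly as the full module name would be.
print(stats.mean(values))
```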

When you successfully import a module, you don't usually get any feedback. However, if you get it wrong you'll get an error message that should give you some idea of what happened, for instance:

In [77]:
# If you run this code cell you should get a 'ModuleNotFoundError' as there is no 'pandahs' module!
import pandahs # by the way, I can also comment on the end of lines too!

ModuleNotFoundError: No module named 'pandahs'

## Setting a Working Directory

The 'working directory' is the place that Jupyter will try to load files from, or save files to, without any more explicit instructions. The default is usually the same folder that the notebook you are using is in.  

* On Windows this might look like: C:/users/[yourName]/Intro_to_Python/Notebooks  

* On Macs it might look like: /Users/[yourName]/Intro_to_Python/Notebooks

We can check the working directory by typing `%pwd` in a cell. pwd stands for 'print working directory'.

In [None]:
%pwd

If the working directory is not in the notebooks folder, you can use %cd (change directory) and provide an absolute or relative file path to change the working directory to.

In [None]:
%cd C:\Users\Nisr\Desktop\Python\Intro to Python\Data science team\Data

In [None]:
%pwd

When we want to find something - like data, folders, or notebooks - we can type an **absolute file path**, e.g. `"C:/users/[yourName]/Documents/Training/Intro_to_Python/Notebooks"`

These give the complete address of the file or folder, however, this is often only correct for your computer. If you share your notebook the person receiving the notebook will need to update all of the file paths to reflect where the relevant files are on their system.

We can also use what's called a **relative file path**. These filepaths are *relative* to whatever the working directory is at the time.

Think of a file structure like a tree diagram. From a given folder you can either move up or down the file structure.

```
Intro_to_Python
│
├─── Introduction to Python Part 1.ipynb
├─── Introduction to Python Part 2.ipynb
│
├─── Data
│      └─── titanic.csv
│
└─── Solutions
       ├─── part_1_exercise_1.py
       ├─── part_1_exercise_2.py
       ├─── exercise1.py
       ├─── exercise2.py
       └─── ...
```

If our current working directory is 'Notebooks', and we want to switch to 'Data', we can use the relative path:
```terminal
%cd ../Data
```
This effectively says, first go up a directory (..), so from `Notebooks` to `Intro_to_Python`. Then go down a directory into `Data`.

Going up a directory always requires .., whereas going down a directory always requires you to name the intended directory.
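
The same ideas can be explored in plain python, without the `%` magic commands. This is a sketch only: the starting folder `Intro_to_Python/Notebooks` is a hypothetical path used to show how `..` is resolved, and nothing on disk is changed.

```python
import os
import posixpath

# The python equivalent of %pwd: where are we right now?
print(os.getcwd())

# Resolve '../Data' relative to a hypothetical starting folder.
# posixpath is used so the result looks the same on every operating system.
start = "Intro_to_Python/Notebooks"
target = posixpath.normpath(posixpath.join(start, "../Data"))
print(target)
```

`normpath` applies the "go up one directory" step for us, so the result names the `Data` folder that sits alongside `Notebooks`.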

## Reading in Data

Now that you've loaded pandas, let's read some data into python and see what that looks like.

Pandas can read data in a variety of different formats. In the code cell below type `pd.read` then press **Tab**.

Pressing Tab gives you all the things you can do with pandas that start with read. You probably recognise some of the data input types (e.g. Excel, SAS, Stata), while others you may not have heard of (e.g. pickle, json etc.) You can then select whichever method you wish to use.

**Tab** is a really useful shortcut to know for finding or completing commands in notebooks. 

In [81]:
# type pd.read then without adding a space press Tab.
pd.read_csv

<function pandas.io.parsers.readers.read_csv(filepath_or_buffer: 'FilePathOrBuffer', sep=<no_default>, delimiter=None, header='infer', names=<no_default>, index_col=None, usecols=None, squeeze=False, prefix=<no_default>, mangle_dupe_cols=True, dtype: 'DtypeArg | None' = None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal: 'str' = '.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors: 'str | None' = 'strict', dialect=None, error_bad_lines=None, warn_bad_lines=None, on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False,

As you can see we can read in a large variety of data using pandas readers.

(NB pandas 0.25 and later include a reader for SPSS .sav files, `pd.read_spss`, which relies on a separate package called pyreadstat. With older versions of pandas you need another module, such as savReaderWriter.)

## Reading a CSV file

Now that you've seen the different options, try reading a simple csv (comma separated value) dataset. 

To read a csv file, the only piece of information we absolutely need is the location of the file.

The file is called 'titanic.csv' and is in the data folder.

The code below demonstrates how you can read these data.

In [83]:
# remember to include the file extension (.csv, .xlsx, etc.)

titanic = pd.read_csv('./data/titanic.csv')

There are a number of variables within this dataset:
* pclass = Passenger class of travel.
* survived = 1 if the passenger survived the sinking, 0 if not.
* name = Full name of the passenger, including title.
* sex = Passenger gender.
* age = Passenger age.
* sibsp = Count of siblings or spouse also aboard.
* parch = Count of parents or children also aboard.
* ticket = Ticket reference.
* fare = Fare paid.
* cabin = Cabin number.
* embarked = Port of embarkation. (S = Southampton (UK); C = Cherbourg (France); Q = Queenstown (Cobh, Ireland))
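
Although the file location is the only thing `read_csv` strictly needs, its optional parameters are useful when a file is not laid out in the default way. The sketch below uses made-up, semi-colon delimited text held in memory rather than a real file; the column names are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up, semi-colon delimited data with no header row.
# io.StringIO lets us treat a string as if it were a file.
raw = io.StringIO("Rex;3\nTom;5\n")

# sep sets the delimiter; header=None says the data has no column names,
# so we supply our own with the names parameter.
pets = pd.read_csv(raw, sep=";", header=None, names=["name", "age"])
print(pets)
```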

## Exploring the Data

The first thing we may want to do having read in a DataFrame is have a quick look at it to check it looks right.

There are DataFrame methods we might use to do this: `.head()`, `.tail()` and `.sample()`  

In [84]:
# head shows the first n rows of the dataframe (5 by default).
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


In [85]:
# tail shows the last n rows of a dataframe (5 by default)
titanic.tail(3)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S


In [86]:
# sample shows n rows chosen at random from the dataframe (1 by default)
titanic.sample(5)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
938,3,0,"Klasen, Mr. Klas Albin",male,18.0,1,1,350404,7.8542,,S
1203,3,0,"Sivic, Mr. Husein",male,40.0,0,0,349251,7.8958,,S
242,1,1,"Rosenbaum, Miss. Edith Louise",female,33.0,0,0,PC 17613,27.7208,A11,C
234,1,0,"Reuchlin, Jonkheer. John George",male,38.0,0,0,19972,0.0,,S
337,2,1,"Beane, Mrs. Edward (Ethel Clarke)",female,19.0,1,0,2908,26.0,,S
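
The three methods above can be tried on any DataFrame. Here is a small sketch with a made-up DataFrame called `numbers`; passing `random_state` to `.sample()` is optional and simply makes the random draw repeatable.

```python
import pandas as pd

# A made-up DataFrame with ten rows.
numbers = pd.DataFrame({"n": range(10)})

print(numbers.head(3))   # first three rows
print(numbers.tail(2))   # last two rows

# random_state fixes the random draw so the same rows come back each time.
picked = numbers.sample(4, random_state=1)
print(picked)
```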


### DataFrame Size

Knowing how many rows and columns your DataFrame has is important. Luckily, this is a *property* of the DataFrame object called `.shape`.

In [87]:
# get the number of rows and columns
# NB This line 'unpacks' the tuple returned by shape
numrows, numcols = titanic.shape

In [88]:
# print out the number of rows and columns
print("There are {} rows and {} columns in the titanic dataset".format(numrows, numcols))

There are 1309 rows and 11 columns in the titanic dataset


In [89]:
# number of rows using python's inbuilt len() function (len meaning 'length')
numrows = len(titanic)
# print the number of rows using f-string formatting.
print(f"There are {numrows} rows in the titanic dataframe")

There are 1309 rows in the titanic dataframe


In [90]:
x=len(titanic)
x

1309

In [91]:
titanic.shape

(1309, 11)

### DataFrame Data Types

When we want to know what type of object a particular object is we can use python's `type()` function. However, `type(titanic)` will only tell us that `titanic` is a pandas DataFrame object. It won't tell us the kinds of data that are in the DataFrame. To find this out we can use the DataFrame object's `.dtypes` property.

Accessing `.dtypes` on a DataFrame object, like `titanic`, returns a pandas Series, in which the row indices are the column names in `titanic` and the values are the type of data stored in each column.

In [92]:
# get column data types
titanic.dtypes

pclass        int64
survived      int64
name         object
sex          object
age         float64
sibsp         int64
parch         int64
ticket       object
fare        float64
cabin        object
embarked     object
dtype: object

Remember our different data types:

* **int** indicates integers. 'parch', for instance, records counts as whole numbers, such as 0, 1, 2 etc.
* **float** indicates 'floating-point numbers', effectively decimal numbers like in the age column.
* **object** indicates text, also known as 'string' data. The 'name' column gives passenger full names and titles.
* **bool** indicates Boolean values, which encode True or False.

Other data types you might see include:
* **datetime** which encodes date and time values.
* **category** which is a special Pandas datatype for categorical or factor variables.

Sometimes the datatype will have some numbers after it, e.g. float64, int32. These numbers tell you something about the range of storage capacity as they relate to the number of bits used to encode the data. An int64 can hold a very large number, much larger than an int32 which can hold a number as large as 2,147,483,647 but no larger.

Most of the time you won't have to worry about the range of the data type, but it's useful to be aware of.
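
If you are curious, the limits mentioned above can be checked with numpy. This is a minimal sketch; the Series `counts` is made up for illustration.

```python
import numpy as np
import pandas as pd

# The largest values each integer type can store.
print(np.iinfo(np.int32).max)
print(np.iinfo(np.int64).max)

# astype converts a Series (or column) to another data type.
counts = pd.Series([0, 1, 2])   # made-up data
small = counts.astype("int32")
print(small.dtype)
```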

### DataFrame Column Names

It can be useful to extract a list of column names from a DataFrame. This is easily done by accessing the `.columns` property.

In [93]:
# Column names are really easy to get!
colnames = titanic.columns
colnames

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked'],
      dtype='object')

In [94]:
# colnames is an index object, if we wanted a list we could use:
#colnames = colnames.tolist()
# or
colnames = list(colnames)
colnames

['pclass',
 'survived',
 'name',
 'sex',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked']

As standard, pandas returns 'columns' as a special object called an 'index'. 

I've chosen to use some additional code to return it as a 'list'. Lists are inbuilt python objects. Remember:

* a **tuple** or an **index** is 'immutable', this means you can't change it, it's a fixed thing in a fixed order.
* a **list** is 'mutable', you can pretty much do whatever you want with it!

Pandas returns a tuple for the `.shape` property, because the number of rows and number of columns is fixed. 

Other things, like a list of column names can actually be changed and updated, a list allows this whereas a tuple wouldn't.
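
The difference between the immutable index and a mutable list can be seen in a short sketch. The column names here are made up for illustration.

```python
import pandas as pd

cols = pd.Index(["name", "age"])   # made-up column labels

try:
    cols[0] = "full_name"          # an Index cannot be edited in place
except TypeError as err:
    print("Index error:", err)

col_list = list(cols)              # convert to a list, which is mutable
col_list[0] = "full_name"
col_list.append("sex")
print(col_list)
```

The original `cols` index is left untouched; only the list copy changes.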

### DataFrame Info

DataFrames also have an `.info()` method which prints a concise summary of information about the data.

In [95]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   name      1309 non-null   object 
 3   sex       1309 non-null   object 
 4   age       1046 non-null   float64
 5   sibsp     1309 non-null   int64  
 6   parch     1309 non-null   int64  
 7   ticket    1309 non-null   object 
 8   fare      1308 non-null   float64
 9   cabin     295 non-null    object 
 10  embarked  1307 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB


Note the different counts in the information above. They suggest that some variables are completely observed (e.g. pclass, survived), while others have missing data values (e.g. age, embarked, cabin).
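
Missing values can also be counted directly. The sketch below assumes a small, made-up DataFrame called `people` rather than the titanic data, and uses the `.isna()` method.

```python
import numpy as np
import pandas as pd

# A made-up DataFrame with one missing age.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],
    "age": [29.0, np.nan, 35.0],
})

# isna() marks missing values as True; sum() counts the Trues in each column.
missing = people.isna().sum()
print(missing)
```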

### Saving a DataFrame

When you read a DataFrame into pandas, the data are loaded in memory. This means that any changes you make won't be reflected in the original file you loaded. If you want to preserve the changes you make to the dataset you have to export the DataFrame object to a file.

Pandas has a number of file writers. Let's save the `titanic` DataFrame as an Excel file.

In [96]:
# File writers are prefixed with .to_ press tab to find available options.
titanic.to_excel('./output/titanic.xlsx')
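
Other writers work in a similar way. As a sketch, the example below writes a made-up DataFrame to a csv file with `.to_csv()` and reads it back. It saves into a temporary folder, which is an assumption made here so that nothing on your system is overwritten.

```python
import os
import tempfile
import pandas as pd

pets = pd.DataFrame({"name": ["Rex", "Tom"], "age": [3, 5]})   # made-up data

# Write to a temporary folder so no existing file is touched.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "pets.csv")

# index=False leaves the row labels out of the file.
pets.to_csv(path, index=False)

# Read the file back to confirm the data survived the round trip.
reloaded = pd.read_csv(path)
print(reloaded)
```

Without `index=False`, the row labels are written as an extra, unnamed first column, which then appears as a column when the file is read back in.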

# Selecting and Filtering DataFrames

Over the following sections, we will learn how to select and filter data using pandas DataFrames. This is one of the most useful and powerful features of pandas.  

It is useful for a range of reasons, from simply cutting down a large dataset into the specific sub-sets of data required for analysis, to managing the various components of a model (e.g. dependent and independent variables, training and test data etc.) and conducting specific subgroup analyses.

Selecting and filtering can be done by using the indexing operator. Pandas uses the same indexing operator as lists, tuples, and dictionaries - `[]` (square brackets).

However, the DataFrame indexing operator is more sophisticated than the one used for the python built-in data structures. As you'll see, its behaviour depends on what you pass to it. This allows you to pass different kinds of information to the same indexing operator and get specific outputs.

# Selecting Columns from DataFrames

The simplest way to select a column from a dataframe is to use the name of that column!

In [98]:
# Select by passing the name of a column as a string.
passengers = titanic['name']
passengers.head()

0                      Allen, Miss. Elisabeth Walton
1                     Allison, Master. Hudson Trevor
2                       Allison, Miss. Helen Loraine
3               Allison, Mr. Hudson Joshua Creighton
4    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
Name: name, dtype: object

In the code above, the following things happen:
1. We index the DataFrame `titanic` using the column header.
2. Assuming 'name' is a legitimate column header, a pandas Series is returned.
3. This Series, representing the 'name' column of the `titanic` Dataframe, is assigned to a variable called `passengers`
4. Finally, we look at the first 5 rows of the Series object `passengers`

If we want to select multiple columns, we have to first collect the column names together using a list, and then pass that to the DataFrame indexing operator.

In [99]:
# First make a list of column names called cols
cols = ['name','age']
# Use cols to select multiple columns from titanic.
passenger_cols = titanic[cols]
passenger_cols.head()

Unnamed: 0,name,age
0,"Allen, Miss. Elisabeth Walton",29.0
1,"Allison, Master. Hudson Trevor",0.9167
2,"Allison, Miss. Helen Loraine",2.0
3,"Allison, Mr. Hudson Joshua Creighton",30.0
4,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0


The code above is very similar to the code for selecting a single column. The main difference is that to select multiple columns we pass a list of string objects, rather than a single string object directly.

However, because we have selected multiple columns, the `passenger_cols` object is actually a DataFrame, rather than a Series object.

Note that we don't have to create the `cols` list first. We can create it on-the-fly in the indexing operator; you just have to learn to distinguish the list constructor square brackets from the indexing square brackets!

In [100]:
# select multiple columns directly.
passenger_cols = titanic[['name','age','sex']]
passenger_cols.head()

Unnamed: 0,name,age,sex
0,"Allen, Miss. Elisabeth Walton",29.0,female
1,"Allison, Master. Hudson Trevor",0.9167,male
2,"Allison, Miss. Helen Loraine",2.0,female
3,"Allison, Mr. Hudson Joshua Creighton",30.0,male
4,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",25.0,female


In [101]:
# Just have a look, rather than assigning to a variable
titanic[['name','age','sex']].sample(9)

Unnamed: 0,name,age,sex
1173,"Sage, Miss. Constance Gladys",,female
281,"Stengel, Mrs. Charles Emil Henry (Annie May Mo...",43.0,female
826,"Goodwin, Master. Sidney Leonard",1.0,male
1011,"McNamee, Mrs. Neal (Eileen O'Leary)",19.0,female
846,"Hampe, Mr. Leon",20.0,male
1253,"Torfa, Mr. Assad",,male
1067,"Nysten, Miss. Anna Sofia",22.0,female
699,"Cacic, Mr. Luka",38.0,male
1069,"O'Brien, Mr. Thomas",,male
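
The difference between passing a string and passing a list to the indexing operator can be checked directly. This sketch uses a made-up DataFrame called `pets`.

```python
import pandas as pd

pets = pd.DataFrame({"name": ["Rex", "Tom"], "age": [3, 5]})   # made-up data

one = pets["age"]      # a string gives a Series
many = pets[["age"]]   # a list gives a DataFrame, even with one name in it

print(type(one).__name__)
print(type(many).__name__)
```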


# Exercise 1

1. Refresh your memory of the titanic data by getting:
    * The number of rows and columns in the DataFrame
    * The datatypes of the columns.
    * A list of the column names for the titanic dataset.
2. Select the 'fare' column from the `titanic` data and show the tail of the data.
3. Select just the last column, try using the list of column names you made earlier.
4. Select the second, third and fourth columns, try doing it using DataFrame columns property directly.

In [119]:
%load ./solutions/part_1_exercise_1.py

# Filtering rows from Dataframes

Filtering rows from a pandas dataframe works in a very similar way to selecting columns. Simple filtering can be achieved by passing a slice to the DataFrame indexer, just like slicing a list.

## Simple Filtering

The code below does the same thing as `.head()` and `.tail()` and can be used to show any arbitrary range of rows in a given DataFrame.

Note that this is identical to slicing a list.

However, we can't get individual rows by indexing as we would with a list, because a column could be named with an integer. This would mean that `dataframe[0]` is ambiguous and could refer to the first row, or a column named 0. Hence it is not allowed: `dataframe[0]` only works if you have a column named `0`, which is a default for some operations in pandas.

This means that selecting a single row also requires a slice.
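
A quick sketch of this behaviour, using a made-up DataFrame called `pets` that has no column named `0`:

```python
import pandas as pd

pets = pd.DataFrame({"name": ["Rex", "Tom", "Kiki"], "age": [3, 5, 2]})   # made-up data

try:
    pets[0]                # looks for a column named 0, which does not exist
except KeyError as err:
    print("KeyError:", err)

first_row = pets[0:1]      # a slice selects rows by position instead
print(first_row)
```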

In [120]:
# first 5 rows
titanic[:5] # or - titanic[0:5]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


In [121]:
# last 5 rows
titanic[-5:]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
1304,3,0,"Zabour, Miss. Hileni",female,14.5,1,0,2665,14.4542,,C
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5,0,0,2656,7.225,,C
1307,3,0,"Zakarian, Mr. Ortin",male,27.0,0,0,2670,7.225,,C
1308,3,0,"Zimmerman, Mr. Leo",male,29.0,0,0,315082,7.875,,S


In [122]:
# arbitrary slice
titanic[102:109]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
102,1,1,"Earnshaw, Mrs. Boulton (Olive Potter)",female,23.0,0,1,11767,83.1583,C54,C
103,1,1,"Endres, Miss. Caroline Louise",female,38.0,0,0,PC 17757,227.525,C45,C
104,1,1,"Eustis, Miss. Elizabeth Mussey",female,54.0,1,0,36947,78.2667,D20,C
105,1,0,"Evans, Miss. Edith Corse",female,36.0,0,0,PC 17531,31.6792,A29,C
106,1,0,"Farthing, Mr. John",male,,0,0,PC 17483,221.7792,C95,S
107,1,1,"Flegenheim, Mrs. Alfred (Antoinette)",female,,0,0,PC 17598,31.6833,,S
108,1,1,"Fleming, Miss. Margaret",female,,0,0,17421,110.8833,,C


In [123]:
# 1 row
titanic[123:124]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
123,1,1,"Frolicher-Stehli, Mr. Maxmillian",male,60.0,1,1,13567,79.2,B41,C


## Conditional Filtering

Most filtering is a more involved operation where we first specify the condition(s) that must be met in order for rows to be included in or excluded from the output dataframe.

## The Filtering Process

The way that pandas filters rows can be thought of as a two-step process.

1. Create a 'mask' that specifies inclusion and exclusion for each row in the dataframe.
2. Mask the dataframe to return the subset of rows that are included.

This sounds a bit abstract, so let's consider what this might look like in practice.

Imagine you have the following (very simple) dataframe, called 'catdog':

index | Animal | Name
---| --- | ---
0 | Cat | Catalie Portman
1 | Cat | Pico de Gato
2 | Dog | Chewbarka
3 | Cat | JK Meowling


You want to filter so you just have 'Cat' rows. Therefore you design the following condition:

```python 
mask = catdog['Animal'] == 'Cat'
```

There are a lot of = (equals) in the above statement.
* The first = indicates assignment, we are assigning the outcome of the expression on the right to the variable on the left of the equals sign.
* The second double equals sign, ==, indicates a comparison, in this case it assesses whether each value in the 'Animal' column of catdog is equal to the text 'Cat'. If python finds that the column value and 'Cat' are the same it assigns a True value, and if not a False value.

This produces a 'mask' which is a Series of `True` and `False` values against the DataFrame index.

index | &#xfeff;
---|---
0 | True
1 | True
2 | False
3 | True

Now, you just have to pass the mask to the original dataframe to complete the filtering process.

```python
catdog2 = catdog[mask]
catdog2
```
This subsets the catdog dataframe based on the True (include) and False (exclude) values, producing:

index | Animal | Name
---| --- | ---
0 | Cat | Catalie Portman
1 | Cat | Pico de Gato
3 | Cat | JK Meowling

The row that had a 'Dog' value for 'Animal', has been removed. Note though that the index has remained the same as the original. Sometimes it is important to reset the index after filtering to restore the index to sequential integers starting at 0.

If you want to reset the index, you can do so by using this code:

```python
catdog2 = catdog2.reset_index(drop = True)
catdog2
```
index | Animal | Name
---| --- | ---
0 | Cat | Catalie Portman
1 | Cat | Pico de Gato
2 | Cat | JK Meowling

See above that the index has been reset to be sequential.
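
The whole catdog walkthrough can be run end to end. The sketch below builds the `catdog` DataFrame by hand from the table above, since it does not exist as a file.

```python
import pandas as pd

# Recreate the catdog table shown above.
catdog = pd.DataFrame({
    "Animal": ["Cat", "Cat", "Dog", "Cat"],
    "Name": ["Catalie Portman", "Pico de Gato", "Chewbarka", "JK Meowling"],
})

mask = catdog["Animal"] == "Cat"           # step 1: build the mask
catdog2 = catdog[mask]                     # step 2: apply the mask
print(list(catdog2.index))                 # the original row labels are kept

catdog2 = catdog2.reset_index(drop=True)   # renumber the rows from 0
print(list(catdog2.index))
```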

## Filtering Data

Just like with simple conditional statements, we filter by using logical comparison operators:

* `==` 'is equal to'. Notice the double `==` and watch out! A single `=` would be assigning to a variable!
* `!=` 'does not equal', the opposite of `==`
* `>` greater than
* `<` less than
* `>=` greater than or equal to
* `<=` less than or equal to

In addition, pandas includes some functions to make particular comparisons easier:

* .isin(list) which we can use for multiple conditions 
* .between() which we can use to specify upper and lower bounds

Finally, the ~ (tilde) allows us to flip or invert an expression. Basically, if an expression returns `[True, True, False]`, the same expression with a ~ in front of it will return `[False, False, True]`.

However, we'll concentrate on the simple operators in the top list for now.
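
As a quick illustration of the comparison operators and the `~` inversion described above, here is a small sketch. The `ages` Series is made up for this example and is not part of the course data.

```python
import pandas as pd

# A small made-up Series of ages, used only for illustration.
ages = pd.Series([3.0, 5.0, 1.0, 7.0, 11.0])

over_four = ages > 4          # element-wise comparison
not_over_four = ~over_four    # ~ flips every True/False value

print(over_four.tolist())      # [False, True, False, True, True]
print(not_over_four.tolist())  # [True, False, True, False, False]

# With no missing values, inverting > gives the same result as <=.
# (A missing value is False for both > and <=, so the two would differ there.)
print(not_over_four.equals(ages <= 4))   # True
```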

In [124]:
titanic.head(2)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S


In [125]:
third=titanic['pclass']==3
third.head()

0    False
1    False
2    False
3    False
4    False
Name: pclass, dtype: bool

In [126]:
class3=titanic[third]

In [127]:
class3.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
600,3,0,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S
601,3,0,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.25,,S
602,3,0,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S
603,3,1,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.25,,S
604,3,1,"Abelseth, Miss. Karen Marie",female,16.0,0,0,348125,7.65,,S


In [128]:
cl3=titanic[titanic['pclass']==3]
cl3.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
600,3,0,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S
601,3,0,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.25,,S
602,3,0,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S
603,3,1,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.25,,S
604,3,1,"Abelseth, Miss. Karen Marie",female,16.0,0,0,348125,7.65,,S


In [129]:
# Filter titanic for 3rd class passengers only

# First make the mask
mask = titanic['pclass'] == 3
# Have a quick look at the mask
mask.sample(5) # 5 randomly chosen rows from the mask.

461    False
410    False
681     True
562    False
91     False
Name: pclass, dtype: bool

In [130]:
# Now filter the titanic dataframe with this mask
thirdclass = titanic[mask]
thirdclass.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
600,3,0,"Abbing, Mr. Anthony",male,42.0,0,0,C.A. 5547,7.55,,S
601,3,0,"Abbott, Master. Eugene Joseph",male,13.0,0,2,C.A. 2673,20.25,,S
602,3,0,"Abbott, Mr. Rossmore Edward",male,16.0,1,1,C.A. 2673,20.25,,S
603,3,1,"Abbott, Mrs. Stanton (Rosa Hunt)",female,35.0,1,1,C.A. 2673,20.25,,S
604,3,1,"Abelseth, Miss. Karen Marie",female,16.0,0,0,348125,7.65,,S


In [131]:
thirdclass.shape

(709, 11)

In [132]:
# Use the same approach for other logical statements.
mask = titanic['fare'] > 200
titanic[mask].head(7)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
10,1,0,"Astor, Col. John Jacob",male,47.0,1,0,PC 17757,227.525,C62 C64,C
11,1,1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18.0,1,0,PC 17757,227.525,C62 C64,C
16,1,0,"Baxter, Mr. Quigg Edmond",male,24.0,0,1,PC 17558,247.5208,B58 B60,C
17,1,1,"Baxter, Mrs. James (Helene DeLaudeniere Chaput)",female,50.0,0,1,PC 17558,247.5208,B58 B60,C
23,1,1,"Bidois, Miss. Rosalie",female,42.0,0,0,PC 17757,227.525,,C
24,1,1,"Bird, Miss. Ellen",female,29.0,0,0,PC 17483,221.7792,C97,S



# Exercise 2

1. Show the row for the passenger named: 'Birkeland, Mr. Hans Martin Monsen'
2. How many passengers in the dataset are male?
3. How many passengers are under 18 years of age?
4. What proportion of passengers in the dataset survived?

In [None]:
# exercise 2 solutions
%load ./solutions/part_1_exercise_2.py

## Using Multiple Conditions to Filter

So far, we've only filtered on a single condition applied to a single column, but there is no reason we can't filter on several conditions and/or columns at once. We do need to think about how the conditions relate to each other, and we have two options for expressing these relationships:

* **and** relationships are given by the **&** (ampersand) symbol. This implies both/all conditions must be met for a row to evaluate to True.
* **or** relationships are given by the **|** (pipe) symbol. This implies that if _any_ of the conditions is met a given row evaluates to True.

You can think of `.isin()` and `.between()` as being special versions of multiple condition filters.

* isin() is basically just a lot of linked **or** statements - *value1* **or** *value2* **or** *value3* etc.
* between() is an **and** condition - greater than (or equal to) the lower bound **and** less than (or equal to) the upper bound.
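
The two equivalences in the list above can be checked directly. The sketch below uses two made-up Series (`ports` and `fares`) rather than the course data, and compares each special method with the longer form written out using `|` and `&`.

```python
import pandas as pd

# Made-up values, used only for illustration.
ports = pd.Series(['S', 'C', 'Q', 'S'])
fares = pd.Series([50.0, 100.0, 175.0, 300.0])

# .isin() behaves like a chain of == comparisons joined by | (or).
isin_mask = ports.isin(['C', 'Q'])
or_mask = (ports == 'C') | (ports == 'Q')
print(isin_mask.tolist())             # [False, True, True, False]
print(isin_mask.equals(or_mask))      # True

# .between() includes both bounds by default, like >= joined to <= by & (and).
between_mask = fares.between(100, 250)
and_mask = (fares >= 100) & (fares <= 250)
print(between_mask.tolist())          # [False, True, True, False]
print(between_mask.equals(and_mask))  # True
```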

Let's again take a simple example to illustrate this with the `catdog` dataframe:

index | Animal | Name | Age
---| --- | --- | ---
0 | Cat | Catalie Portman | 3.0
1 | Cat | Pico de Gato | 5.0
2 | Dog | Chewbarka | 1.0
3 | Cat | JK Meowling | 7.0
4 | Dog | K-9 | 11.0

If you wanted to select all animals that are cats **and** who are over 4 years old, you could do the following:

```python
mask = (catdog['Animal'] == 'Cat') & (catdog['Age'] > 4.0)

catdog[mask]
```
index | Animal | Name | Age
---| --- | --- | ---
1 | Cat | Pico de Gato | 5.0
3 | Cat | JK Meowling | 7.0

Only Cats over 4 years old have been included in the filter.
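
To see why only those two rows survive, it can help to look at each condition separately before combining them. This sketch rebuilds `catdog` from the table above; the names `is_cat` and `over_four` are chosen only for this example.

```python
import pandas as pd

# Stand-in for catdog, rebuilt from the table above.
catdog = pd.DataFrame({
    'Animal': ['Cat', 'Cat', 'Dog', 'Cat', 'Dog'],
    'Name': ['Catalie Portman', 'Pico de Gato', 'Chewbarka', 'JK Meowling', 'K-9'],
    'Age': [3.0, 5.0, 1.0, 7.0, 11.0],
})

is_cat = catdog['Animal'] == 'Cat'   # hypothetical name for the first condition
over_four = catdog['Age'] > 4.0      # hypothetical name for the second condition

print(is_cat.tolist())      # [True, True, False, True, False]
print(over_four.tolist())   # [False, True, False, True, True]

# & compares the two masks row by row: True only where both are True.
mask = is_cat & over_four
print(mask.tolist())        # [False, True, False, True, False]
print(catdog[mask]['Name'].tolist())   # ['Pico de Gato', 'JK Meowling']
```

When both conditions are written on one line, as in the code above this example, each one needs its own parentheses, because `&` binds more tightly than `==` and `>` in python.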

However, if you wanted to select all animals that are either cats **or** are over 4 years old, you could instead do:

```python
mask = (catdog['Animal'] == 'Cat') | (catdog['Age'] > 4.0)

catdog[mask]
```
index | Animal | Name | Age
---| --- | --- | ---
0 | Cat | Catalie Portman | 3.0
1 | Cat | Pico de Gato | 5.0
3 | Cat | JK Meowling | 7.0
4 | Dog | K-9 | 11.0


In [134]:
# Let's try some multiple condition filters with the titanic data
# First class passengers who are women.
mask = (titanic['pclass'] == 1) & (titanic['sex']== 'female')
titanic[mask].head()
len(titanic[mask])

144

In [135]:
# Women or children
mask = (titanic['sex'] == 'female') | (titanic['age'] < 18)
titanic[mask].head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S


In [136]:
# Try the special functions for multiple selection. First .isin()
# Passengers from Cherbourg ('C') or Queenstown ('Q')
titanic[titanic['embarked'].isin(['C','Q'])].sample(7)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
992,3,0,"Mangan, Miss. Mary",female,30.5,0,0,364850,7.75,,Q
702,3,0,"Canavan, Miss. Mary",female,21.0,0,0,364846,7.75,,Q
480,2,0,"Laroche, Mr. Joseph Philippe Lemercier",male,25.0,1,2,SC/Paris 2123,41.5792,,C
204,1,1,"Meyer, Mrs. Edgar Joseph (Leila Saks)",female,,1,0,PC 17604,82.1708,,C
716,3,0,"Chronopoulos, Mr. Apostolos",male,26.0,1,0,2680,14.4542,,C
1078,3,1,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q
164,1,1,"Homer, Mr. Harry (""Mr E Haven"")",male,35.0,0,0,111426,26.55,,C


In [137]:
# Now, .between()
# passengers who paid between 100 and 250
titanic[titanic['fare'].between(100,250)].head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


In [138]:
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S


# Exercise 3

1. Select passengers who are in classes 2 and 3. What percentage of all passengers is this?
2. How many passengers do not have siblings or spouses ('sibsp'), or parents or children ('parch'), on the boat?
3. What proportion of passengers who 'embarked' in Cherbourg ('C') or Queenstown ('Q') survived? 

In [None]:
# Exercise 3 solutions
%load ./solutions/part_1_exercise_3.py

# Generating New Variables

There are a number of approaches to adding new columns of data to a DataFrame, and depending on what you want to achieve some of these can be quite complicated. We start with simple examples and build up to more sophisticated approaches later.

## Creating Binary Variables

We can assign a condition to a new column to create a binary variable in much the same way as we might create a mask for filtering rows.

Imagine that, instead of filtering rows, we wanted to assign a 1 or a 0 to a new column depending on whether that condition was met. We can do this quite simply because in python (and many other languages) a `True` value is equivalent to 1, and a `False` value is equivalent to 0.
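
The equivalence between Booleans and the integers 1 and 0 can be seen in a few lines. This is a small sketch with a made-up Boolean Series, not the course data.

```python
import pandas as pd

# Plain Python: Booleans behave like the integers 1 and 0.
print(True == 1, False == 0)   # True True
print(True + True)             # 2

# A made-up Boolean Series, like the masks used for filtering.
cond = pd.Series([True, False, True, True])

print(cond.astype(int).tolist())   # [1, 0, 1, 1]
print(cond.sum())                  # 3, the number of True values
print(cond.mean())                 # 0.75, the share of True values
```

Because of this, `.sum()` and `.mean()` on a mask give a count and a proportion without any conversion step.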

In [139]:
# First create the condition as a True/False Series
cond = titanic['sex'] == 'female'

# Now assign the condition variable cond to a new column, but as an integer type.
titanic['female'] = cond.astype(int)
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,1
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,0
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,1
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,0
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,1


In [140]:
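# Condition: True where the passenger is under 18 (missing ages give False)
b=titanic['age']<18
# Store the condition as 1/0 ...
titanic['child']=b.astype(int)

# ... or overwrite the same column with text labels instead
titanic['child']=b.map({True:'Child',False:'Adult'})
titanic.head()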

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,1,Adult
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,0,Child
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,1,Child
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,0,Adult
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,1,Adult


In the 'female' example above, a condition is specified, returning a Boolean Series. This Series is converted to an integer data type and assigned to a new column in the `titanic` DataFrame called 'female'. If a 'female' column had already existed, this code would have overwritten its contents.

Using a condition we can either store `True` and `False` values directly, or convert them to their integer representations `1` and `0`. If we want to use arbitrary values for our new variable, we can create a dictionary and `.map()` the dictionary to the column.

In [141]:
# Use YES and NO instead of True and False.
titanic['female'] = cond.map({True:'YES',False:'NO'})
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,Adult


Note that this `.map()` approach is also good for aggregating (dissolving) or recoding categorical variables. The dictionary can be of arbitrary length and acts as a lookup table, which is a benefit of the key:value structure. Pandas also implements its own `category` data type. We're not going to discuss it here as it's not strictly necessary, but it can be useful. Check the pandas docs for more information.
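
As a sketch of using `.map()` to aggregate categories, the example below collapses four detailed labels into two groups. The `animals` Series and the `lookup` dictionary are made up for illustration. One behaviour worth knowing is that any value missing from the dictionary becomes a missing value (`NaN`) in the result.

```python
import pandas as pd

# Made-up categories; 'Fish' is deliberately missing from the lookup below.
animals = pd.Series(['Cat', 'Kitten', 'Dog', 'Puppy', 'Fish'])

# Hypothetical lookup: several detailed categories collapse into one group.
lookup = {'Cat': 'Feline', 'Kitten': 'Feline', 'Dog': 'Canine', 'Puppy': 'Canine'}

grouped = animals.map(lookup)
print(grouped.tolist())       # ['Feline', 'Feline', 'Canine', 'Canine', nan]
print(grouped.isna().sum())   # 1 value had no match in the dictionary
```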

## Constant Value Variables

A single number or string can also be assigned to a new column, giving every row in the DataFrame the same constant value. The data type of the column is determined by the type of the value being assigned. This may be useful in the context of updating cells, which we'll discuss later on.

In [142]:
# New column of ints
titanic['int_zeroes'] = 0
# New column of floats
titanic['float_ones'] = 1.0 # note floating point.
# New column of strings
titanic['string_twos'] = 'two'
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,int_zeroes,float_ones,string_twos
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,0,1.0,two
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child,0,1.0,two
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child,0,1.0,two
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult,0,1.0,two
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,Adult,0,1.0,two


## Creating Variables Based on Existing Columns

We can simply assign the values of existing columns to new columns using assignment.

In [143]:
titanic['name2'] = titanic['name']
titanic.head(6)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,int_zeroes,float_ones,string_twos,name2
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,0,1.0,two,"Allen, Miss. Elisabeth Walton"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child,0,1.0,two,"Allison, Master. Hudson Trevor"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child,0,1.0,two,"Allison, Miss. Helen Loraine"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult,0,1.0,two,"Allison, Mr. Hudson Joshua Creighton"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,Adult,0,1.0,two,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,NO,Adult,0,1.0,two,"Anderson, Mr. Harry"


We can also use mathematical expressions with one or more existing columns to create new columns.

In [144]:
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1
titanic.head(6)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,int_zeroes,float_ones,string_twos,name2,family_size
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,0,1.0,two,"Allen, Miss. Elisabeth Walton",1
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child,0,1.0,two,"Allison, Master. Hudson Trevor",4
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child,0,1.0,two,"Allison, Miss. Helen Loraine",4
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult,0,1.0,two,"Allison, Mr. Hudson Joshua Creighton",4
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,Adult,0,1.0,two,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",4
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,NO,Adult,0,1.0,two,"Anderson, Mr. Harry",1


## Classifying Numerical Values using pd.cut() and pd.qcut()

New columns can be created directly when classifying numerical data using `pd.cut()` or `pd.qcut()`.

`cut()` has two behaviours. Given a number of bins, it creates that many equal-width bins (i.e. the width of every bin is the same). Given a list of bin edges instead, it cuts according to those edges.

`qcut()` is similar, but rather than equal-width bins it creates bins that have (roughly) equal numbers of observations in them. These bins could have very different widths.
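
The difference between the two functions is easiest to see on skewed data. The sketch below uses a made-up Series with five small values and one large one, and shows `cut()` with a bin count, `qcut()` with a bin count, and `cut()` with explicit edges.

```python
import pandas as pd

# Made-up, skewed values: five small numbers and one large one.
values = pd.Series([1, 2, 3, 4, 5, 100])

# cut: 2 bins of equal width across the range 1 to 100.
by_width = pd.cut(values, 2, labels=['low', 'high'])
print(by_width.tolist())   # ['low', 'low', 'low', 'low', 'low', 'high']

# qcut: 2 bins holding (roughly) equal numbers of observations.
by_count = pd.qcut(values, 2, labels=['low', 'high'])
print(by_count.tolist())   # ['low', 'low', 'low', 'high', 'high', 'high']

# cut with explicit bin edges: (0, 3] and (3, 100].
by_edges = pd.cut(values, bins=[0, 3, 100], labels=['low', 'high'])
print(by_edges.tolist())   # ['low', 'low', 'low', 'high', 'high', 'high']
```

With equal-width bins the single large value ends up alone in the top class, which mirrors what happens to the `fare` column in the cells that follow.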

In [145]:
titanic.head(1)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,int_zeroes,float_ones,string_twos,name2,family_size
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,0,1.0,two,"Allen, Miss. Elisabeth Walton",1


In [146]:
# Divide fare into 3 equal-width classes.
titanic['fare_3class'] = pd.cut(titanic['fare'], 3, labels=['low','mid','high'])
titanic.head(4)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,int_zeroes,float_ones,string_twos,name2,family_size,fare_3class
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,0,1.0,two,"Allen, Miss. Elisabeth Walton",1,mid
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child,0,1.0,two,"Allison, Master. Hudson Trevor",4,low
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child,0,1.0,two,"Allison, Miss. Helen Loraine",4,low
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult,0,1.0,two,"Allison, Mr. Hudson Joshua Creighton",4,low


In [147]:
# Divide fare into tertiles (3 classes with roughly equal counts).
titanic['fare_tertiles'] = pd.qcut(titanic['fare'], 3, labels=['low','mid','high'])
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,int_zeroes,float_ones,string_twos,name2,family_size,fare_3class,fare_tertiles
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,0,1.0,two,"Allen, Miss. Elisabeth Walton",1,mid,high
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child,0,1.0,two,"Allison, Master. Hudson Trevor",4,low,high
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child,0,1.0,two,"Allison, Miss. Helen Loraine",4,low,high
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult,0,1.0,two,"Allison, Mr. Hudson Joshua Creighton",4,low,high
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,Adult,0,1.0,two,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",4,low,high


In [148]:
titanic['fare_3class'] = pd.cut(titanic['fare'], 3)
titanic.head(4)
titanic['fare_3class'].value_counts()

(-0.512, 170.776]     1270
(170.776, 341.553]      34
(341.553, 512.329]       4
Name: fare_3class, dtype: int64

In [149]:
titanic['fare_3class'] = pd.qcut(titanic['fare'], 3)
titanic.head(4)
titanic['fare_3class'].value_counts()

(-0.001, 8.662]    454
(8.662, 26.0]      428
(26.0, 512.329]    426
Name: fare_3class, dtype: int64

## Removing (Dropping) Columns

Notwithstanding the fact that we could simply select the columns we want and leave out the ones we don't, we also have a couple of options for dropping or deleting a column from a pandas DataFrame.

Let's delete the constant valued columns we established earlier: 'int_zeroes', 'float_ones', 'string_twos'.

The first option we have is the built-in python statement `del`. Otherwise we can use the `.drop()` method.

Note that pandas is in active development, so parameters sometimes change as the library matures. If your `.drop()` method doesn't understand the `columns=` argument used in the `.drop()` cell further down, you may have an older version of pandas. Try this instead:
```python
titanic.drop(['int_zeroes','float_ones','string_twos'], axis=1, inplace=True)
```
Note that this syntax is still compatible with newer versions of pandas, so old code won't break in this case.
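
The sketch below compares the two `.drop()` spellings and `del` on a tiny made-up DataFrame (`df` is not part of the course data). It also shows an alternative to `inplace=True`: `.drop()` returns a new DataFrame by default, so the result can simply be assigned to a variable.

```python
import pandas as pd

# Made-up DataFrame, used only for illustration.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Both calls return a new DataFrame and leave df untouched.
newer_style = df.drop(columns=['b', 'c'])
older_style = df.drop(['b', 'c'], axis=1)

print(newer_style.columns.tolist())     # ['a']
print(newer_style.equals(older_style))  # True
print(df.columns.tolist())              # ['a', 'b', 'c']

# del always changes the DataFrame in place.
del df['c']
print(df.columns.tolist())              # ['a', 'b']
```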

In [150]:
# drop a column with del
del titanic['int_zeroes']
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,float_ones,string_twos,name2,family_size,fare_3class,fare_tertiles
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,1.0,two,"Allen, Miss. Elisabeth Walton",1,"(26.0, 512.329]",high
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child,1.0,two,"Allison, Master. Hudson Trevor",4,"(26.0, 512.329]",high
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child,1.0,two,"Allison, Miss. Helen Loraine",4,"(26.0, 512.329]",high
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult,1.0,two,"Allison, Mr. Hudson Joshua Creighton",4,"(26.0, 512.329]",high
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,Adult,1.0,two,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",4,"(26.0, 512.329]",high


In [151]:
# drop using the drop method
titanic.drop(columns=['float_ones','string_twos'], inplace = True)
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,child,name2,family_size,fare_3class,fare_tertiles
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,Adult,"Allen, Miss. Elisabeth Walton",1,"(26.0, 512.329]",high
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,Child,"Allison, Master. Hudson Trevor",4,"(26.0, 512.329]",high
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,Child,"Allison, Miss. Helen Loraine",4,"(26.0, 512.329]",high
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,Adult,"Allison, Mr. Hudson Joshua Creighton",4,"(26.0, 512.329]",high
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,Adult,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",4,"(26.0, 512.329]",high


# Exercise 4

1. Create a new binary variable called 'child'. It should have the value 1 when the passenger is under 18 and 0 otherwise.
2. Create a new variable called 'embarked_city' that maps 'S' to 'Southampton', 'C' to 'Cherbourg', and 'Q' to 'Queenstown'.
3. Create a new variable called 'surname'. The value should be the surname part of the 'name' field.
    * Use `titanic['name'].str.split(',',expand=True)[0]` to get surnames.
    * Explore this code and make sure you understand what's going on.
    * How many unique surnames are there? (Hint: try `.unique()` or `.nunique()` on the new column.)

In [None]:
# Exercise 4 solutions
%load ./solutions/part_1_exercise_4.py

In that last question we made use of a special accessor on the `Series` object called `str`. This exposes the string methods we encountered previously to a column of string data, but in a vectorised form. This means you can manipulate the text in every row with a single method call. For example:

In [None]:
# lower case names
titanic['name'].str.lower().head()

In [None]:
# Press tab after the last fullstop to see the string methods available.
# Select one and use shift-tab to explore it further.
pd.Series.str.

In [None]:
# Extract the titles from the name field: take the text after the comma,
# then keep the part before the first full stop.
titanic['name'].str.split(',',expand=True)[1].str.split('.',expand=True)[0].head()
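
To follow the chained `.str` calls step by step, here is a sketch on three names copied from the titanic output shown earlier. The variable names `names`, `parts`, and `titles` are chosen only for this example, and the final `.str.strip()` is an addition that removes the space left behind after the comma.

```python
import pandas as pd

# Three names copied from the titanic output shown earlier.
names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Anderson, Mr. Harry',
])

# Split on the comma; expand=True gives a DataFrame with one column per piece.
parts = names.str.split(',', expand=True)
print(parts[0].tolist())   # ['Allen', 'Allison', 'Anderson']

# The second piece still starts with a space: ' Miss. Elisabeth Walton'.
# Split it on the full stop, keep the first piece, then strip the space.
titles = parts[1].str.split('.', expand=True)[0].str.strip()
print(titles.tolist())     # ['Miss', 'Master', 'Mr']
```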

# Consolidation

## Reminder of Learning Objectives

* Familiarise yourself with the Jupyter notebook, which is what we'll use to interact with python
* Learn about python's basic data types - integers, floats, strings and Booleans.
* Explore python's general data structure objects - lists, tuples, and dictionaries.
* Take a first look at 'pandas', python's data manipulation and analysis library.
    * The Series and DataFrame objects.
    * Reading data into python using pandas.
    * Getting information about DataFrames.
* Learn some basic DataFrame manipulation approaches:
    * Select columns
    * Filter rows
    * Generate new variables
    
## Next Session

Focus on basic analysis
* Descriptive statistics.
* Aggregation and cross-tabulation.
* Merging and updating.

We will then look at visualisation and/or simple statistics in python using matplotlib and statsmodels.