<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

In [1]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('../..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

In [4]:
#dependencies
import pandas as pd
titanic = pd.read_csv('../../data/titanic.csv')
toggle_code(title = "dependencies")

   pclass  survived                                             name     sex  \
0       1         1                    Allen, Miss. Elisabeth Walton  female   
1       1         1                   Allison, Master. Hudson Trevor    male   
2       1         0                     Allison, Miss. Helen Loraine  female   
3       1         0             Allison, Mr. Hudson Joshua Creighton    male   
4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   

       age  sibsp  parch  ticket      fare    cabin embarked  
0  29.0000      0      0   24160  211.3375       B5        S  
1   0.9167      1      2  113781  151.5500  C22 C26        S  
2   2.0000      1      2  113781  151.5500  C22 C26        S  
3  30.0000      1      2  113781  151.5500  C22 C26        S  
4  25.0000      1      2  113781  151.5500  C22 C26        S  


# 3. Generating New Variables

There are a number of approaches to adding new columns of data to a dataframe. Depending on what you want to achieve, some of these can be quite complicated. We start with simple examples and build up to more sophisticated approaches later.

## 3.1 Creating Binary Variables


We can assign a condition to a new column to create a binary variable in much the same way as we might create a mask for filtering rows.

Imagine that, instead of filtering rows, we wanted to assign a 1 or a 0 to a new column depending on whether that condition was met. This is straightforward because in Python (and many other languages) `True` is equivalent to 1 and `False` is equivalent to 0.

In [4]:
# First create the condition as a True/False Series
cond = titanic['sex'] == 'female'

# Now assign the condition variable cond to a new column, but as an integer type.
titanic['female'] = cond.astype(int)
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,1
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,0
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,1
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,0
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,1


In the code above, a condition is specified, returning a Boolean Series. This Series is converted to an integer data type and assigned to a new column in the `titanic` dataframe called 'female'. If 'female' had already existed, this code would have overwritten its contents.
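Because `True` counts as 1 and `False` as 0, the same Boolean Series can also be summed or averaged to count the matching rows. The sketch below assumes a small made-up dataframe (`toy` and its values are illustrative, not the Titanic data).

```python
import pandas as pd

# Toy data for illustration only; not taken from titanic.csv
toy = pd.DataFrame({'sex': ['female', 'male', 'female', 'male']})

# Boolean Series: True where the condition is met
is_female = toy['sex'] == 'female'

# True counts as 1, so sum() gives the number of matching rows
n_female = is_female.sum()

# mean() gives the proportion of matching rows
share_female = is_female.mean()

# Casting to int stores the flags as 1 and 0 in a new column
toy['female'] = is_female.astype(int)

print(n_female, share_female)
print(toy)
```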

Using a condition, we can either store the `True` and `False` values directly or convert them to their integer representations `1` and `0`. If we want to use arbitrary values for our new variable, we can create a dictionary and `.map()` it onto the Series.

In [5]:
# Use YES and NO instead of True and False.
titanic['female'] = cond.map({True:'YES',False:'NO'})
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES


Note that this `.map()` approach is also good for aggregating (dissolving) or recoding categorical variables. The dictionary can be of arbitrary length and acts as a lookup, which is a benefit of the key:value structure. Any value without a matching key in the dictionary becomes `NaN`. Note, though, that pandas also implements its own `category` data type. We're not going to discuss it here as it's not strictly necessary, but it can be useful. Check the [pandas docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) for more information.
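As a rough sketch of dissolving categories with a lookup dictionary, the example below groups passenger classes into two bands. The class values, the band labels and the `'unknown'` fallback are all made up for illustration.

```python
import pandas as pd

# Toy passenger classes; 4 is deliberately missing from the lookup
pclass = pd.Series([1, 2, 3, 4])

# Several keys can share one value, which merges (dissolves) categories
lookup = {1: 'first', 2: 'other', 3: 'other'}

# Values with no matching key come back as NaN
grouped = pclass.map(lookup)

# One possible way to handle unmatched values: fill them with a label
grouped_filled = grouped.fillna('unknown')

print(grouped)
print(grouped_filled)
```

Checking for `NaN` after a `.map()` call is a quick way to spot values that the dictionary did not cover.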

## 3.2 Constant Value Variables

A single number or string can also be assigned to a new column, which fills every row with that constant. The column's data type is determined by the type of the value assigned. This may be useful when updating cells, which we'll discuss later on.

In [6]:
# New column of ints
titanic['int_zeroes'] = 0
# New column of floats
titanic['float_ones'] = 1.0 # note floating point.
# New column of strings
titanic['string_twos'] = 'two'
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,int_zeroes,float_ones,string_twos
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,0,1.0,two
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,0,1.0,two
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,0,1.0,two
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,0,1.0,two
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,0,1.0,two
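To confirm that the data type follows the assigned value, we can inspect `.dtypes` after the assignment. This is a minimal sketch on a made-up dataframe (`toy` and its `id` column are illustrative). The exact name reported for the string column can differ between pandas versions.

```python
import pandas as pd

# Toy dataframe for illustration only
toy = pd.DataFrame({'id': [1, 2, 3]})

# Each constant is broadcast to every row
toy['int_zeroes'] = 0         # integer column
toy['float_ones'] = 1.0       # floating point column
toy['string_twos'] = 'two'    # string column

# Show the data type pandas chose for each column
print(toy.dtypes)
```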


## 3.3 Creating Variables Based on Existing Columns

We can copy the values of an existing column into a new column using assignment.

In [7]:
titanic['name2'] = titanic['name']
titanic.head(6)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,int_zeroes,float_ones,string_twos,name2
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,0,1.0,two,"Allen, Miss. Elisabeth Walton"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,0,1.0,two,"Allison, Master. Hudson Trevor"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,0,1.0,two,"Allison, Miss. Helen Loraine"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,0,1.0,two,"Allison, Mr. Hudson Joshua Creighton"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,0,1.0,two,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)"
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,NO,0,1.0,two,"Anderson, Mr. Harry"


We can also use mathematical expressions with one or more existing columns to create new columns.

In [8]:
# The + 1 constant accounts for the passenger themself.
titanic['family_size'] = titanic['sibsp'] + titanic['parch'] + 1
titanic.head(6)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,int_zeroes,float_ones,string_twos,name2,family_size
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,0,1.0,two,"Allen, Miss. Elisabeth Walton",1
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,0,1.0,two,"Allison, Master. Hudson Trevor",4
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,0,1.0,two,"Allison, Miss. Helen Loraine",4
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,0,1.0,two,"Allison, Mr. Hudson Joshua Creighton",4
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,0,1.0,two,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",4
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,NO,0,1.0,two,"Anderson, Mr. Harry",1
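An alternative to square-bracket assignment is the `.assign()` method, which returns a new dataframe rather than modifying the existing one. The sketch below uses made-up values, and the `fare_per_person` column is a hypothetical example rather than something used elsewhere in this course.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'sibsp': [0, 1, 1],
    'parch': [0, 2, 0],
    'fare': [30.0, 120.0, 50.0],
})

# assign() returns a copy with the new columns added.
# A callable receives the dataframe built so far, so it can
# refer to family_size created earlier in the same call.
toy2 = toy.assign(
    family_size=toy['sibsp'] + toy['parch'] + 1,
    fare_per_person=lambda d: d['fare'] / d['family_size'],
)

print(toy2)
print(toy.columns.tolist())  # the original dataframe is unchanged
```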


## 3.4 Removing (Dropping) Columns

Although we could simply select the columns we want and leave out the ones we don't, we also have a couple of options for dropping or deleting a column from a pandas dataframe.

Let's delete the constant valued columns we established earlier: 'int_zeroes', 'float_ones', 'string_twos'.

The first option is the built-in Python statement `del`. Otherwise, we can use the `.drop()` method.

Note that pandas is in active development, so parameters sometimes change as the library matures. If your `.drop()` method doesn't understand the `columns` argument used in the code below, you may have an older version. Try this instead:
```python
titanic.drop(['int_zeroes','float_ones','string_twos'], axis=1, inplace=True)
```
Note that this syntax is still compatible with newer versions of pandas, so old code won't break in this case.

In [9]:
# drop a column with del
del titanic['int_zeroes']
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,float_ones,string_twos,name2,family_size
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,1.0,two,"Allen, Miss. Elisabeth Walton",1
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,1.0,two,"Allison, Master. Hudson Trevor",4
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,1.0,two,"Allison, Miss. Helen Loraine",4
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,1.0,two,"Allison, Mr. Hudson Joshua Creighton",4
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,1.0,two,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",4


In [10]:
# drop using the drop method
titanic.drop(columns=['float_ones','string_twos'], inplace = True)
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,female,name2,family_size
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,YES,"Allen, Miss. Elisabeth Walton",1
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,NO,"Allison, Master. Hudson Trevor",4
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,YES,"Allison, Miss. Helen Loraine",4
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,NO,"Allison, Mr. Hudson Joshua Creighton",4
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,YES,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",4
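Two details of `.drop()` are worth a quick sketch: without `inplace=True` it returns a new dataframe, and asking it to drop a column that doesn't exist raises a `KeyError` unless `errors='ignore'` is passed. The dataframe and column names below are made up for illustration.

```python
import pandas as pd

# Toy dataframe for illustration only
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Without inplace=True, drop() returns a new dataframe and leaves toy as it was
smaller = toy.drop(columns=['b'])

# Dropping a column that does not exist raises a KeyError by default
try:
    toy.drop(columns=['not_a_column'])
except KeyError as err:
    print('KeyError:', err)

# errors='ignore' skips any labels that are not found
same = toy.drop(columns=['not_a_column'], errors='ignore')

print(smaller.columns.tolist())
print(toy.columns.tolist())
```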


## Exercise 4

1. Create a new binary variable called 'child'. It should have the value 1 when the passenger is under 18 and 0 otherwise.
2. Create a new variable called 'embarked_city' that maps 'S' to 'Southampton', 'C' to 'Cherbourg', and 'Q' to 'Queenstown'.
3. Create a new variable called 'surname'. The value should be the surname part of the 'name' field.
    * Use `titanic['name'].str.split(',',expand=True)[0]` to get surnames.
    * Explore this code and make sure you understand what's going on.
    * How many unique surnames are there? (Hint: try `.unique()` or `.nunique()` on the new column.)

In [11]:
## Question 1

#titanic['child'] = (titanic['age'] < 18).astype(int)
#print(titanic['child'].head(),'\n')

## Question 2

#titanic['embarked_city'] = titanic['embarked'].map({'S':'Southampton','C':'Cherbourg','Q':'Queenstown'})
#print(titanic['embarked_city'].sample(5),'\n')

## Question 3

#titanic['surname'] = titanic['name'].str.split(',',expand=True)[0]
#print("The number of unique surnames in the dataset is {}".format(titanic['surname'].nunique()))

toggle_code()

In that last question we made use of a special accessor on the `Series` object called `str`. It exposes the string methods we've encountered previously to a column of string data, but in a vectorised form. This means you can manipulate the text in every row with a single method call. For example:

In [12]:
# lower case names
titanic['name'].str.lower().head()

0                      allen, miss. elisabeth walton
1                     allison, master. hudson trevor
2                       allison, miss. helen loraine
3               allison, mr. hudson joshua creighton
4    allison, mrs. hudson j c (bessie waldo daniels)
Name: name, dtype: object
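A few more `.str` methods, sketched on a two-name Series taken from the rows shown earlier. The variable names are illustrative.

```python
import pandas as pd

# Two names from the head of the dataset, used as a small example
names = pd.Series(['Allen, Miss. Elisabeth Walton', 'Anderson, Mr. Harry'])

# Upper case every name
upper = names.str.upper()

# Number of characters in each name
lengths = names.str.len()

# True where the name contains the text 'Miss'
is_miss = names.str.contains('Miss')

print(upper)
print(lengths)
print(is_miss)
```

Each call returns a new Series with the same index, so the results can be assigned straight to a new column.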

In [13]:
# Press tab after the last full stop to see the string methods available.
# Select one and use shift-tab to explore it further.
# The line below is deliberately incomplete, so running this cell as-is raises a SyntaxError.
pd.Series.str.

SyntaxError: invalid syntax (<ipython-input-13-59377a988749>, line 3)

In [None]:
# In this example we extract the title (Miss, Mr, Mrs, ...) from the name field.
# Splitting on the comma leaves a leading space, so .str.strip() removes it.
titanic['name'].str.split(',',expand=True)[1].str.split('.',expand=True)[0].str.strip().head()
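The chained call above can be easier to follow when broken into steps. This is a sketch on two names from the dataset, with illustrative variable names. It assumes every name follows the 'Surname, Title. Forenames' pattern.

```python
import pandas as pd

# Two names from the head of the dataset, used as a small example
names = pd.Series(['Allen, Miss. Elisabeth Walton',
                   'Allison, Master. Hudson Trevor'])

# Step 1: split on the comma. Column 0 holds the surname,
# column 1 holds everything after the comma.
parts = names.str.split(',', expand=True)
after_comma = parts[1]

# Step 2: split the remainder on the full stop. Column 0 holds the title,
# still carrying the space that followed the comma.
title_raw = after_comma.str.split('.', expand=True)[0]

# Step 3: strip the surrounding whitespace
title = title_raw.str.strip()

print(title_raw.tolist())
print(title.tolist())
```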