# Intro to Jupyter notebooks


## Markdown and REPL (read–eval–print loop) things ✎

In [1]:
print("banana")

banana


In [1]:
my_list = ["a", "b", "c", "d", "e", "f"]

my_list

['a', 'b', 'c', 'd', 'e', 'f']

In [2]:
# Get the last item in the list
my_list[-1]

'f'

In [3]:
# Slice it!
my_list[1:3]

['b', 'c']

In [4]:
# Get an end piece!
my_list[:3]

['a', 'b', 'c']

In [5]:
# Get the other end piece!
my_list[-3:]

['d', 'e', 'f']

## `%magic` tricks ✵彡

Magic commands give us some sweet features from IPython (Interactive Python). There are two types of magic:

- `%` prefix: line magic, which applies to a single line
- `%%` prefix: cell magic, which applies to the whole cell (multiple lines)

In [2]:
# Get more info on a magic
%pwd?

In [3]:
# Print working directory
%pwd

'/Users/fei.phoon/projects/repositories/intro-to-jupyter-notebooks'

In [4]:
# We can assign %magic output to variables
cwd = %pwd 
cwd

'/Users/fei.phoon/projects/repositories/intro-to-jupyter-notebooks'

In [5]:
dict = {}
dict["banana"]

KeyError: 'banana'

In [6]:
# Opens an interactive (post-mortem) debugger at the point where the LAST exception was raised
%debug

> <ipython-input-5-7d4a03e2aeb5>(2)<module>()
      1 dict = {}
----> 2 dict["banana"]

ipdb> exit()


In [None]:
# Toggle automatically entering the debugger after ANY exception
%pdb

In [7]:
# Reports the execution time of a single statement, run once
%time "banana".count("a")

CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 6.91 µs


3

In [8]:
# Reports the average execution time of a single statement, run many times
%timeit "banana".count("a")

135 ns ± 3.95 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
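
Outside a notebook, the standard library's `timeit` module gives a comparable measurement. The sketch below is a rough plain-Python analogue of the cell above, not a description of what IPython does internally; the `loops` and `repeat` counts are arbitrary values chosen for illustration.

```python
import timeit

# Time the same statement: 5 runs of 100,000 loops each.
# (These counts are illustrative; %timeit chooses its own automatically.)
loops = 100_000
runs = timeit.repeat('"banana".count("a")', number=loops, repeat=5)

# Each entry in `runs` is the total time for `loops` executions, in seconds.
per_loop_ns = [total / loops * 1e9 for total in runs]
best_ns = min(per_loop_ns)
print(f"best of {len(runs)} runs: {best_ns:.0f} ns per loop")
```

The exact numbers will differ from machine to machine, so treat them as relative rather than absolute.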


In [9]:
%%time
def count_a(string):
    return string.count("a")
count_a("banana")

# A cell magic that reports the execution time of the whole cell.
# Warning: a cell magic must be the first line of its cell; leading blank, comment, or code lines break it.

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 10 µs


In [10]:
# Run an external script here
%run banana.py

Bananas for everyone!


In [11]:
# View all %magic commands
%magic

In [None]:
cwd

In [None]:
# Reset all variables. Use `%reset -f` to skip confirmation
%reset

In [None]:
cwd

More:
https://ipython.readthedocs.io/en/stable/interactive/magics.html

## Meet `pandas` ʕ •ᴥ•ʔ

`pandas` is a Python library for exploring and analysing tabular data.

We've already made `pandas` available in this environment with `pip install`. To use it in our notebook, we'll need to import it.

In [1]:
import pandas as pd

### `pandas` objects: Series ﾟﾟ┌┴oﾟﾟﾟﾟ°

A Series is a one-dimensional array-like data structure with indexing.

In [23]:
# The simplest Series.
# Note that the index is automatically created, and the value type is inferred.
series1 = pd.Series([10, 5, 6, 7, 18, 10, 4, 13])

series1

0    10
1     5
2     6
3     7
4    18
5    10
6     4
7    13
dtype: int64

In [6]:
series1.values

array([10,  5,  6,  7, 18, 10,  4, 13])

In [7]:
series1.index

RangeIndex(start=0, stop=8, step=1)

In [8]:
# Check the shape of the Series
series1.shape

(8,)

In [9]:
# Produces a summary of the Series.
series1.describe()

count     8.000000
mean      9.125000
std       4.673252
min       4.000000
25%       5.750000
50%       8.500000
75%      10.750000
max      18.000000
dtype: float64

In [11]:
# Custom indexing with labels
series2 = pd.Series([1, 2, 50], index=['banana', 'bananas', 'much bananas'])

series2

banana           1
bananas          2
much bananas    50
dtype: int64

In [12]:
# We can have mixed types in a Series
series3 = pd.Series([1, 'potato', 2, 'potato'])

series3

0         1
1    potato
2         2
3    potato
dtype: object

**Note: A Series with the `object` dtype is slower to process than one with a numeric dtype.**
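
A small sketch of where the `object` dtype comes from. The variable names here are made up for illustration; the point is only that mixing types forces `pandas` to store generic Python objects instead of a compact numeric array.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])
mixed = pd.Series([1, "potato", 2, "potato"])

# A homogeneous Series gets a compact integer dtype...
print(numbers.dtype)
# ...while mixed values fall back to `object`.
print(mixed.dtype)

# Each element of an object Series keeps its own Python type.
element_types = [type(value).__name__ for value in mixed]
print(element_types)
```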

In [14]:
# We can use array-style positions to fetch values.
# (On a labelled index, newer `pandas` versions deprecate this in favour of `.iloc`.)
series2[2]

50

In [15]:
# We can fetch ranges by slicing
series2[:2]

banana     1
bananas    2
dtype: int64

In [16]:
# We can fetch values by integer indexing
series2.iloc[1]

2

In [19]:
# We can also use the index labels to fetch values
series2['much bananas']

# We can fetch values by label indexing.
# This is the `pandas` way to do the previous command
# series2.loc['much bananas']

50

In [20]:
# We can fetch values using multiple labels, called in a custom order
series2[['much bananas', 'banana', 'bananas']]

much bananas    50
banana           1
bananas          2
dtype: int64
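
To keep the two access styles apart, here is a minimal side-by-side sketch on the same data as `series2` (rebuilt under the hypothetical name `bananas` so the block runs on its own). The main gotcha is that positional slices exclude the end, while label slices include it.

```python
import pandas as pd

bananas = pd.Series([1, 2, 50], index=["banana", "bananas", "much bananas"])

# .iloc selects by integer position (0-based, end-exclusive slices).
by_position = bananas.iloc[2]
first_two = bananas.iloc[:2]

# .loc selects by label (slices include BOTH endpoints).
by_label = bananas.loc["much bananas"]
label_slice = bananas.loc["banana":"bananas"]

print(by_position, by_label)
print(first_two.equals(label_slice))
```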

In [27]:
# We can filter with conditions

# Flashback: series1 = pd.Series([10, 5, 6, 7, 18, 10, 4, 13])
# Note the indexes.
series1[series1<8]

1    5
2    6
3    7
6    4
dtype: int64
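
The filter above happens in two steps, which this sketch spells out. The second condition (`middle`) is an extra illustration that is not in the original cell.

```python
import pandas as pd

series1 = pd.Series([10, 5, 6, 7, 18, 10, 4, 13])

# Step 1: the comparison yields a boolean Series with the same index.
mask = series1 < 8

# Step 2: indexing with the mask keeps only the rows where it is True.
small = series1[mask]

# Masks combine with & (and), | (or), ~ (not); wrap each condition in parentheses.
middle = series1[(series1 > 5) & (series1 < 13)]

print(mask.tolist())
print(small.index.tolist())
print(middle.tolist())
```

Note that the surviving rows keep their original index labels, which is why the output skips from 3 to 6.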

In [28]:
# We can load dictionaries as Series
fruit_dict = {'banana': 3, 'apple': 2, 'lemon': 5}

fruit_series1 = pd.Series(fruit_dict)

fruit_series1

banana    3
apple     2
lemon     5
dtype: int64

In [30]:
# And you can pre-specify index labels and get the Series in a desired order
fruit_labels = ['apple', 'banana', 'lemon', 'tomato']

fruit_series2 = pd.Series(fruit_dict, index=fruit_labels)

fruit_series2 # Suddenly we get floats. Why?

apple     2.0
banana    3.0
lemon     5.0
tomato    NaN
dtype: float64
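
One way to investigate the float question: compare the Series with and without the missing label. In this sketch the `tomato` entry has no value, so `pandas` fills the gap with `NaN`, and `NaN` is a floating-point value.

```python
import math
import pandas as pd

fruit_dict = {"banana": 3, "apple": 2, "lemon": 5}
labels = ["apple", "banana", "lemon", "tomato"]

complete = pd.Series(fruit_dict)
with_gap = pd.Series(fruit_dict, index=labels)

# No missing labels: the integers stay integers.
print(complete.dtype)
# 'tomato' is missing, so the whole Series is upcast to float64 to hold NaN.
print(with_gap.dtype)
print(isinstance(float("nan"), float), math.isnan(with_gap["tomato"]))
```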

In [31]:
fruit_series2.isnull()

apple     False
banana    False
lemon     False
tomato     True
dtype: bool

In [32]:
fruit_series2.notnull()

apple      True
banana     True
lemon      True
tomato    False
dtype: bool

In [33]:
# Grab the Index object from one of our Series
fruit_series2_index = fruit_series2.index

fruit_series2_index

Index(['apple', 'banana', 'lemon', 'tomato'], dtype='object')

In [34]:
type(fruit_series2_index)

pandas.core.indexes.base.Index

In [35]:
fruit_series2_index[1]

'banana'

In [36]:
# `pandas` Index object items are immutable
fruit_series2_index[1] = 'no banana'

TypeError: Index does not support mutable operations

In [37]:
# `pandas` Index object can contain duplicates
orphan_index = pd.Index([1, 'potato', 2, 'potato'])

orphan_index

Index([1, 'potato', 2, 'potato'], dtype='object')
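
Duplicate labels are allowed, but they change what a lookup returns. A short sketch, with made-up values attached to the index above (the Series `spuds` is hypothetical):

```python
import pandas as pd

orphan_index = pd.Index([1, "potato", 2, "potato"])
print(orphan_index.is_unique)

# Hypothetical values, attached to the duplicated labels for illustration.
spuds = pd.Series([10, 20, 30, 40], index=orphan_index)

# A unique label returns a single value...
single = spuds.loc[1]
# ...but a duplicated label returns every matching row as a Series.
both = spuds.loc["potato"]

print(single)
print(both.tolist())
```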

### ┏━━━━━━━ʕ•㉨•ʔ━━━━━━━━┓
### &nbsp;&nbsp; &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`pandas` objects: DataFrame
### ┗━━━━━━━━☆━━━━━━━━━┛

A `pandas` DataFrame is a table of data with row and column indexes. Wes McKinney (the creator of `pandas`) describes it as a dictionary of Series that all share the same index. Under the hood, the data is stored as one or more two-dimensional blocks.
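
The "dictionary of Series" picture can be tried out directly. This is a minimal sketch with made-up fruit data (the names `quantity`, `colour`, and `fruit_df` are hypothetical), not how the Titanic data is loaded below.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels.
labels = ["apple", "banana", "lemon"]
quantity = pd.Series([2, 3, 5], index=labels)
colour = pd.Series(["green", "yellow", "yellow"], index=labels)

# A dict of Series becomes a DataFrame: keys -> columns, shared index -> rows.
fruit_df = pd.DataFrame({"quantity": quantity, "colour": colour})

# Pulling a column back out gives a Series on that same index.
print(type(fruit_df["quantity"]).__name__)
print(fruit_df.index.equals(quantity.index))
print(fruit_df.loc["banana", "colour"])
```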

In [38]:
# Now we'll load some data
file_path = "./titanic/train.csv"

df = pd.read_csv(file_path) # defaults: `sep=",", header=0`

In [39]:
# Check the data type
type(df)

pandas.core.frame.DataFrame

In [40]:
# Check the shape of the DataFrame
df.shape

(891, 12)

In [41]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [43]:
# Quick peek at the dataset via Python slicing
df[:5]

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [42]:
# This is the `pandas` way to do it
df.head()
# df.head(8)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [44]:
df.tail()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |


In [45]:
df.sample(3)

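|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 267 | 268 | 1 | 3 | Persson, Mr. Ernst Ulrik | male | 25.0 | 1 | 0 | 347083 | 7.775 | NaN | S |
| 204 | 205 | 1 | 3 | Cohen, Mr. Gurshon "Gus" | male | 18.0 | 0 | 0 | A/5 3540 | 8.05 | NaN | S |
| 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.05 | NaN | S |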


In [46]:
# Produces a summary of the DataFrame.
# By default, this ignores columns with non-numerical values.
# The numbers aren't meaningful for every column (e.g. `PassengerId`); we're just experimenting.
df.describe()

|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
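
The text columns are only skipped by default. As a sketch on a tiny made-up frame (`toy` is hypothetical and stands in for the Titanic data), passing `include="all"` brings them back into the summary:

```python
import pandas as pd

# A tiny hypothetical frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Sex": ["female", "male", "male"],
    "Age": [22.0, 38.0, 26.0],
})

# Default: only the numeric column is summarised.
numeric_only = toy.describe()
print(numeric_only.columns.tolist())

# include="all" summarises every column; statistics that don't apply
# to a column (e.g. the mean of a name) show up as NaN.
everything = toy.describe(include="all")
print(everything.columns.tolist())
```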


In [47]:
# Examine a certain column
df['Name'].head(10)

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object

In [48]:
df.rename(columns={"PassengerId": "passenger_id"}) # Check whether this changed the column name in `df` itself.

|   | passenger_id | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
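
To check what `rename` did, compare the column names before and after. This sketch uses a two-column made-up frame (`toy` is hypothetical) so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical two-column frame for illustration.
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

renamed = toy.rename(columns={"PassengerId": "passenger_id"})

# The original is untouched; only the returned DataFrame has the new name.
original_columns = toy.columns.tolist()
renamed_columns = renamed.columns.tolist()
print(original_columns)
print(renamed_columns)

# To keep the change, assign the result back.
toy = toy.rename(columns={"PassengerId": "passenger_id"})
print(toy.columns.tolist())
```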


In [49]:
# Different ways of selecting the first row

df.head(1)
# df[:1]
# df.iloc[0]
# df.iloc[0, :]
# df.loc[0]
# df.loc[[0], df.columns]

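|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |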
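
The alternatives in the cell above select the same row, but they don't all return the same type. A small sketch on a made-up frame (`toy` is hypothetical): a single position or label gives a Series, while a list or slice gives a one-row DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Ann", "Bob"], "Age": [22.0, 38.0]})

# Scalar selector -> one row as a Series (column names become its index).
row_as_series = toy.iloc[0]

# List or slice selector -> a one-row DataFrame.
row_as_frame = toy.iloc[[0]]

print(type(row_as_series).__name__, row_as_series.index.tolist())
print(type(row_as_frame).__name__, row_as_frame.shape)
print(toy.head(1).equals(row_as_frame))
```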


In [50]:
# Different ways of selecting a particular column (Name)

df['Name']
# df.loc[:, 'Name']
# df.iloc[:, 3]

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Wil

![Accessing things in pandas](pandas_access.png)

### Exercise ᕦ(ò_óˇ)ᕤ

Use your newfound knowledge of `iloc` and `loc` to select the first 5 rows from the first 4 columns.

In [7]:
# Different ways of selecting (5 rows from) the first 4 columns

# Using head()

# Using iloc[]

# Using loc[]
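
If you get stuck, here is one possible approach, sketched on a made-up frame rather than the Titanic data (the frame `toy` and its column names `a` to `e` are hypothetical). Try the exercise on `df` first; the thing to watch is that `iloc` slices exclude the end while `loc` slices include it.

```python
import pandas as pd

# Hypothetical 6x5 frame so the example runs without the Titanic file.
toy = pd.DataFrame(
    [[r * 10 + c for c in range(5)] for r in range(6)],
    columns=["a", "b", "c", "d", "e"],
)

# Using head(): take 5 rows, then the first 4 columns by position.
via_head = toy.head(5).iloc[:, :4]

# Using iloc[]: both slices are positional and end-exclusive.
via_iloc = toy.iloc[:5, :4]

# Using loc[]: both slices are by label and INCLUDE the end label.
via_loc = toy.loc[:4, "a":"d"]

print(via_iloc.shape)
print(via_head.equals(via_iloc) and via_iloc.equals(via_loc))
```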

## I can haz codez pls ( ᐕ)

To generate a `.py` file from this notebook:

`File > Download as > Python (.py)`

GitHub is nice and renders `.ipynb` files without needing to run Jupyter - great for quick sharing. E.g. https://github.com/ibm-et/jupyter-samples/blob/master/airline/Exploration%20of%20Airline%20On-Time%20Performance.ipynb

Bitbucket doesn't do this, so for the same effect, you can generate and commit a Markdown version of your notebook to your repo:

`File > Download as > Markdown (.md)`

## Questions? ʕ•ᴥ•ʔʃ

## What's next?  υ´• ﻌ •`υ

Extra credit: Try the Titanic tutorial on Kaggle: https://www.kaggle.com/c/titanic/overview/tutorials