When presenting, vertical whitespace matters. I tend to both maximize my browser (`F11`) and go into single document mode. To get to single document mode, we can use the command palette, either by clicking it in the left sidebar or by typing `Ctrl+Shift+c`. The command palette is great because it also shows the shortcut we could use to get into single document mode directly: `Ctrl+Shift+d`. When we're done with the sidebar, we can close it with `Ctrl+b`.

First create a markdown cell with a header.

# MDS seminar

I usually add a few imports that I am sure I will use up front and then more as I go. If I do a lot of prototyping, I just add them in the cell I am currently working in and then move them to the first cell when I am ready to commit something.

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np
from sinfo import sinfo


sinfo() # Writes dependencies to `sinfo-requirements.txt` by default

-----
numpy     	1.16.4
pandas    	0.24.2
seaborn   	0.9.0
-----
IPython   	7.5.0
jupyter_client	5.2.4
jupyter_core	4.4.0
jupyterlab	0.35.6
notebook  	5.7.8
-----
Python 3.7.3 (default, Jun 18 2019, 09:42:59) [GCC 8.3.0]
Linux-5.1.9-arch1-1-ARCH-x86_64-with-arch
8 logical CPU cores
-----
Session information updated at 2019-06-18 22:40


# Initial textual EDA

There are many ways to get sample data to work with, including scikit-learn, statsmodels, and quilt. For small examples, I tend to use seaborn.

In [2]:
iris = sns.load_dataset('iris')
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


It is a little bit annoying to type `head` every time I want to look at a dataframe. Pandas has options to control the displayed data frame output and even a nice search interface to find them.

In [3]:
pd.describe_option('row')

display.latex.multirow : bool
    This specifies if the to_latex method of a Dataframe uses multirows
    to pretty-print MultiIndex rows.
    Valid values: False,True
    [default: False] [currently: False]

display.max_info_rows : int or None
    df.info() will usually show null-counts for each column.
    For large frames this can be quite slow. max_info_rows and max_info_cols
    limit this null check only to frames with smaller dimensions than
    specified.
    [default: 1690785] [currently: 1690785]

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, 

In [4]:
pd.set_option('display.max_rows', 9)

We can see that this has changed the current value.

In [5]:
pd.describe_option('max_row')

display.max_rows : int
    If max_rows is exceeded, switch to truncate view. Depending on
    `large_repr`, objects are either centrally truncated or printed as
    a summary view. 'None' value means unlimited.

    In case python/IPython is running in a terminal and `large_repr`
    equals 'truncate' this can be set to 0 and pandas will auto-detect
    the height of the terminal and print a truncated object which fits
    the screen height. The IPython notebook, IPython qtconsole, or
    IDLE do not run in a terminal and hence it is not possible to do
    correct auto-detection.
    [default: 60] [currently: 9]




And if we type `iris` now, we won't get flooded with 60 rows.

In [6]:
iris

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
...,...,...,...,...,...
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


I like that this shows the beginning and the end of the data frame, as well as the dimensions which don't show up with head. The only drawback is that you need to set it back if you want to display more rows, or override it temporarily with the context manager. So that is worth keeping in mind.

To get the default back, we could use `pd.reset_option('max_row')`.
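
The temporary override mentioned above can be sketched with `pd.option_context`, which puts the previous value back when the `with` block exits. The small frame `df` below is made up for illustration.

```python
import pandas as pd

# A made-up frame that is long enough to be truncated when displayed.
df = pd.DataFrame({'a': range(100)})

before = pd.get_option('display.max_rows')

# Inside the block, pandas truncates the output after at most 4 rows.
with pd.option_context('display.max_rows', 4):
    inside = pd.get_option('display.max_rows')
    short_repr = repr(df)

# The previous setting is back once the block has exited.
after = pd.get_option('display.max_rows')

print(inside, before == after)
print(short_repr)
```

This avoids having to remember to call `pd.reset_option` afterwards.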

A good follow up would be to check if there are any NaNs and which data types pandas has identified (we already have a good idea from the above).

In [7]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


We can see that there are no NaNs since every column has the same number of non-null entries as the number of entries in the index (150). The data types and index type match up with what we might expect from glancing at the values previously. We can find out the number of unique values in each column via `nunique()`.

In [8]:
iris.nunique()

sepal_length    35
sepal_width     23
petal_length    43
petal_width     22
species          3
dtype: int64

`describe()` shows descriptive summary statistics.

In [9]:
iris.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


Note that `describe()` by default only shows numerical columns (if there are any), but we can specify that we want to include other column types.

In [10]:
iris.describe(include='object')

Unnamed: 0,species
count,150
unique,3
top,virginica
freq,50


We can also tell it to include all columns and control the displayed percentiles.

In [11]:
iris.describe(percentiles=[0.5], include='all')

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
count,150.0,150.0,150.0,150.0,150
unique,,,,,3
top,,,,,virginica
freq,,,,,50
mean,5.843333,3.057333,3.758,1.199333,
std,0.828066,0.435866,1.765298,0.762238,
min,4.3,2.0,1.0,0.1,
50%,5.8,3.0,4.35,1.3,
max,7.9,4.4,6.9,2.5,


# Using apply

Aggregation functions can be specified in many different ways in pandas, from highly optimized built-in functions to highly flexible arbitrary functions.

In [12]:
iris.mean()

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

`agg()` is a different interface to summary functions, which also allows for multiple functions to be passed in the same call.

In [13]:
iris.agg('mean')

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

In [14]:
iris.agg(['mean', 'median'])

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
mean,5.843333,3.057333,3.758,1.199333
median,5.8,3.0,4.35,1.3
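
As a further sketch of the `agg()` interface, a dictionary can be passed to use a different function per column. The frame and its column names below are made up for illustration.

```python
import pandas as pd

# A small made-up frame with two numerical columns.
df = pd.DataFrame({
    'length': [1.0, 2.0, 3.0],
    'width': [10.0, 20.0, 60.0],
})

# One aggregation function per column, given as {column: function name}.
per_column = df.agg({'length': 'mean', 'width': 'max'})
print(per_column)
```

The result is a Series indexed by column name, here holding the mean of `length` and the maximum of `width`.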


If we want to use a function that is not available through pandas, we can use apply.

In [15]:
iris[['sepal_length', 'sepal_width']].apply(np.mean)

sepal_length    5.843333
sepal_width     3.057333
dtype: float64

The built-in aggregation functions automatically drop non-numerical columns. `apply` does not, so an error is thrown with non-numerical columns.
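
A minimal sketch of that failure, assuming a made-up frame and a hypothetical helper `manual_mean`: the helper works on the numerical column but raises a `TypeError` as soon as `apply` hands it the string column.

```python
import pandas as pd

# A made-up frame with one numerical and one string column.
df = pd.DataFrame({
    'length': [1.0, 3.0],
    'species': ['setosa', 'virginica'],
})


def manual_mean(col):
    """Hypothetical helper: a mean computed by hand."""
    return col.sum() / len(col)


# apply passes every column to the function, including the strings,
# and dividing a string by a number is not possible.
try:
    df.apply(manual_mean)
    raised = False
except TypeError:
    raised = True

# Restricting the frame to numerical columns first avoids the error.
numeric_means = df.select_dtypes('number').apply(manual_mean)
print(raised)
print(numeric_means)
```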

We could drop the string columns if there are just a few and we know which.

In [16]:
iris.drop(columns='species').apply(np.mean)

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

If there are many, it is easier to use `.select_dtypes()`.

In [17]:
iris.select_dtypes('number').apply(np.mean)

sepal_length    5.843333
sepal_width     3.057333
petal_length    3.758000
petal_width     1.199333
dtype: float64

## Lambda functions

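Lambda functions are unnamed functions that can be written inline, without a separate `def` statement. For comparison, here is a regular named function first.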

In [18]:
def add_one(x):
    return x + 1

add_one(5)

6

Lambda functions can be used without being named, so they are effective for throwaway functions that you are likely to use only once.

In [19]:
(lambda x: x + 1)(5)

6

Lambda functions can be assigned to a variable name if so desired.

In [20]:
my_lam = lambda x: x + 1

my_lam(5)

6

Just as with functions, there is nothing special about the letter `x`. It is just a variable name and you can call it whatever you prefer.

In [21]:
(lambda a_good_descriptive_name: a_good_descriptive_name + 1)(5)

6

Custom functions, both named and unnamed, can be used together with `apply` to create any transformation of the dataframe values.

In [22]:
iris_num = iris.select_dtypes('number')
iris_num.apply(add_one)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,6.1,4.5,2.4,1.2
1,5.9,4.0,2.4,1.2
2,5.7,4.2,2.3,1.2
3,5.6,4.1,2.5,1.2
...,...,...,...,...
146,7.3,3.5,6.0,2.9
147,7.5,4.0,6.2,3.0
148,7.2,4.4,6.4,3.3
149,6.9,4.0,6.1,2.8


In [23]:
iris_num.apply(lambda x: x + 1)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,6.1,4.5,2.4,1.2
1,5.9,4.0,2.4,1.2
2,5.7,4.2,2.3,1.2
3,5.6,4.1,2.5,1.2
...,...,...,...,...
146,7.3,3.5,6.0,2.9
147,7.5,4.0,6.2,3.0
148,7.2,4.4,6.4,3.3
149,6.9,4.0,6.1,2.8


We can check that the two give the same result by comparing them for equality. Surrounding the comparison with parentheses lets us call `.all()` on the result.

In [24]:
iris_num.apply(lambda x: x + 1) == iris_num.apply(add_one)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
...,...,...,...,...
146,True,True,True,True
147,True,True,True,True
148,True,True,True,True
149,True,True,True,True


In [25]:
(iris_num.apply(lambda x: x + 1) == iris_num.apply(add_one)).all()

sepal_length    True
sepal_width     True
petal_length    True
petal_width     True
dtype: bool

We could also have checked if the new df minus the old one ends up with all 1.

In [26]:
iris_num.apply(lambda x: x + 1) - iris_num

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0
...,...,...,...,...
146,1.0,1.0,1.0,1.0
147,1.0,1.0,1.0,1.0
148,1.0,1.0,1.0,1.0
149,1.0,1.0,1.0,1.0


It looks like all the differences are 1, but when we check for equality some of them are not.

#TODO short explanation and link

In [27]:
(iris_num.apply(lambda x: x + 1) - iris_num) == 1

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,True,True,True,True
1,True,True,True,True
2,True,True,False,True
3,True,False,True,True
...,...,...,...,...
146,True,True,True,True
147,True,True,True,True
148,True,False,True,True
149,True,True,True,False


This is because of floating point error, which is good to be aware of. It can be handled by comparing with `np.isclose` instead of `==`.

In [28]:
np.isclose(iris_num.apply(lambda x: x + 1) - iris_num, 1).all()

True
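
A minimal sketch of the same effect with plain Python floats: most decimal fractions cannot be represented exactly in binary, so a result can differ from the expected value in the last digits. `math.isclose` is the standard library counterpart of `np.isclose`.

```python
import math

# 0.1 and 0.2 are both stored as the nearest binary fraction,
# so their sum is not exactly the float closest to 0.3.
total = 0.1 + 0.2

exact_match = total == 0.3
close_match = math.isclose(total, 0.3)

print(repr(total))   # 0.30000000000000004
print(exact_match)   # False
print(close_match)   # True
```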

### Row and column wise apply

By default, `.apply` (and many other methods) works column-wise, but it can be set to work row-wise instead.

In [29]:
# The highest value in any of the rows for each column.
iris_num.apply(lambda col: col.max())

sepal_length    7.9
sepal_width     4.4
petal_length    6.9
petal_width     2.5
dtype: float64

In [30]:
# The highest value in any of the columns for each row.
iris_num.apply(lambda row: row.max(), axis=1)

0      5.1
1      4.9
2      4.7
3      4.6
      ... 
146    6.3
147    6.5
148    6.2
149    5.9
Length: 150, dtype: float64

In [31]:
# The row label of the highest value in each column.
iris_num.idxmax()

sepal_length    131
sepal_width      15
petal_length    118
petal_width     100
dtype: int64

In [32]:
# The column name of the highest value in each row.
iris_num.idxmax(axis=1)

0      sepal_length
1      sepal_length
2      sepal_length
3      sepal_length
           ...     
146    sepal_length
147    sepal_length
148    sepal_length
149    sepal_length
Length: 150, dtype: object

Sepal length seems to be the highest value for all rows.

In [33]:
iris_num.idxmax(axis=1).value_counts()

sepal_length    150
dtype: int64

### Testing performance

Built-in pandas methods are optimized to be faster on pandas dataframes than applying a standard Python function, so always use these when possible.

In [208]:
iris = pd.read_csv('iris.csv', index_col=0).select_dtypes('number')

In [209]:
%%timeit
iris.iloc[0, :] #200

135 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [210]:
%%timeit
iris.iloc[:, 0] #140

58.8 µs ± 851 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [221]:
%%timeit
iris.mean(axis=0)

252 µs ± 7.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [222]:
%%timeit
iris.mean(axis=1)

256 µs ± 6.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [227]:
%%timeit
iris.head(4).apply(lambda x: x.mean(), axis=1)

1.52 ms ± 28.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [228]:
%%timeit
iris.head(4).apply(lambda x: x.mean(), axis=0)

1.54 ms ± 43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [213]:
%%timeit
iris.apply(lambda x: x['sepal_length'] + x['sepal_width'], axis=1)

4.2 ms ± 68.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [214]:
%%timeit
iris['sepal_length'] + iris['sepal_width']

121 µs ± 917 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
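
The last two timings compare a row-wise `apply` with the equivalent vectorized addition. As a rough sketch outside the notebook, the same comparison can be made with the standard `timeit` module. The random data below is a stand-in for the iris columns, and the measured times will differ between machines.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data with the same shape as two of the iris columns.
rs = np.random.RandomState(0)
df = pd.DataFrame(rs.rand(150, 2), columns=['sepal_length', 'sepal_width'])


def add_with_apply():
    # Calls the lambda once per row.
    return df.apply(lambda row: row['sepal_length'] + row['sepal_width'], axis=1)


def add_vectorized():
    # Adds the two columns in one vectorized operation.
    return df['sepal_length'] + df['sepal_width']


row_wise = add_with_apply()
vectorized = add_vectorized()
same = np.allclose(row_wise, vectorized)

# Total time in seconds for 20 calls of each version.
t_apply = timeit.timeit(add_with_apply, number=20)
t_vectorized = timeit.timeit(add_vectorized, number=20)
print(same, t_apply, t_vectorized)
```

Both versions return the same values, so the built-in arithmetic can replace the `apply` call without changing the result.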


A pandas data frame stores each column as a numpy array. This has the expected
effect that pulling out one column from the dataframe is faster than
pulling out one row (about 59 µs versus 135 µs in the timings above).


Numpy behaves as expected.


In [153]:
a = np.ones((10000, 10000), order='F')
df = pd.DataFrame(a)

In [122]:
%%timeit
a.mean(axis=0)

81.7 ms ± 3.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [123]:
%%timeit
a.mean(axis=1)

55 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [154]:
%%timeit
df.iloc[:, 0]

67.9 µs ± 2.45 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [155]:
%%timeit
df.iloc[0, :]

139 µs ± 1.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [156]:
np.info(df.iloc[:, 0].values)

class:  ndarray
shape:  (10000,)
strides:  (8,)
itemsize:  8
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x7f7931e5a010
byteorder:  little
byteswap:  False
type: float64


In [157]:
np.info(df.iloc[0, :].values)

class:  ndarray
shape:  (10000,)
strides:  (80000,)
itemsize:  8
aligned:  True
contiguous:  False
fortran:  False
data pointer: 0x7f7931e5a010
byteorder:  little
byteswap:  False
type: float64
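
The strides reported above say how many bytes separate consecutive elements of the array. A minimal sketch with a small made-up array in the same column-major (`order='F'`) layout shows why a column is contiguous in memory while a row is not.

```python
import numpy as np

# Column-major layout: the elements of one column sit next to each other.
a = np.ones((3, 4), order='F')

col = a[:, 0]
row = a[0, :]

# 8 bytes to the next row, 3 * 8 = 24 bytes to the next column.
print(a.strides)                                # (8, 24)
# Consecutive elements of a column are adjacent in memory.
print(col.strides, col.flags['C_CONTIGUOUS'])   # (8,) True
# Consecutive elements of a row are a whole column apart.
print(row.strides, row.flags['C_CONTIGUOUS'])   # (24,) False
```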


In [141]:
np.info(one_col)

class:  ndarray
shape:  (10000,)
strides:  (80000,)
itemsize:  8
aligned:  True
contiguous:  False
fortran:  False
data pointer: 0x7f799143c010
byteorder:  little
byteswap:  False
type: float64


In [124]:
%%timeit
df.mean(axis=0)

987 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [125]:
%%timeit
df.mean(axis=1)

975 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [132]:
%%timeit
df.apply(np.mean, axis=0)

3.7 s ± 64.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [133]:
%%timeit
df.apply(np.mean, axis=1)

1.84 s ± 19.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [62]:
# TODO change to iris if there is no diff with axis=1, there should be based on the mem layout...
rs = np.random.RandomState(0)
many_cols = pd.DataFrame(rs.rand(2, 5000))
many_rows = pd.DataFrame(rs.rand(5000, 2))
square_np = rs.rand(10000, 10000)
square = pd.DataFrame(square_np)

Columns are faster than rows since each column is a numpy array.

In [35]:
%%timeit
many_rows.mean()

477 µs ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [36]:
%%timeit
many_rows.mean(axis=1)

476 µs ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [37]:
%%timeit
many_cols.mean()

307 µs ± 9.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [38]:
%%timeit
many_cols.mean(axis=1)

284 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [71]:
np.info(square_np[:, 0])

class:  ndarray
shape:  (10000,)
strides:  (80000,)
itemsize:  8
aligned:  True
contiguous:  False
fortran:  False
data pointer: 0x7f7a2050f010
byteorder:  little
byteswap:  False
type: float64


In [96]:
np.info(iris['sepal_length'].values)

class:  ndarray
shape:  (150,)
strides:  (8,)
itemsize:  8
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x562670313ef0
byteorder:  little
byteswap:  False
type: float64


In [107]:
%%timeit
iris.drop(columns='species').head(4).mean(axis=0)

1.02 ms ± 26.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [108]:
%%timeit
iris.drop(columns='species').head(4).mean(axis=1)

997 µs ± 19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [82]:
# %%timeit
square_np[:, 0].mean()

0.49930881308747355

In [72]:
np.info(square_np[0, :])

class:  ndarray
shape:  (10000,)
strides:  (8,)
itemsize:  8
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x7f7a2050f010
byteorder:  little
byteswap:  False
type: float64


In [81]:
# %%timeit
square_np[0, :].mean()

0.5007089026674952

In [76]:
square_np.mean(axis=0)

array([0.49930881, 0.49916254, 0.50014099, ..., 0.50121883, 0.49942237,
       0.49973079])

In [80]:
square_np.mean(axis=1)

array([0.5007089 , 0.49822079, 0.49985381, ..., 0.49442171, 0.49509447,
       0.5053317 ])

In [65]:
np.info(square_np)

class:  ndarray
shape:  (10000, 10000)
strides:  (80000, 8)
itemsize:  8
aligned:  True
contiguous:  True
fortran:  False
data pointer: 0x7f7a2050f010
byteorder:  little
byteswap:  False
type: float64


In [79]:
%%timeit
square_np.mean(axis=0)

80.8 ms ± 2.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [78]:
%%timeit
square_np.mean(axis=1)

56 ms ± 524 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [39]:
%%timeit
square.apply(np.mean, axis=1)

807 ms ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [40]:
%%timeit
square.apply(np.mean, axis=0)

1.21 s ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [54]:
%%timeit
square.iloc[:, 0]

66.9 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [55]:
%%timeit
square.iloc[0, :]

135 µs ± 2.66 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [41]:
%%timeit
square.apply(lambda x: sum(x) / len(x), axis=1)

2.35 s ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [42]:
%%timeit
square.apply(lambda x: sum(x) / len(x), axis=0)

2.72 s ± 38.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [43]:
%%timeit
square.apply(lambda x: x ** x, axis=0)

3.01 s ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [44]:
%%timeit
square.apply(lambda x: x ** x, axis=1)

2.45 s ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [45]:
# snakeviz

# Working with categorical data

In [46]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


How should we interpret the `+` sign under memory usage? In the help we can see that there is one option that affects memory usage, so let's use it.

In [47]:
iris.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 14.3 KB


What happened? How could the data frame almost triple in size? The `info()` method's docstring explains why:

> Without deep introspection a memory estimation is made based in column dtype and number of rows assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources.

So deep memory introspection shows the real memory usage, but it is still a bit cryptic what part of the dataframe is responsible for this extra size. To find this out, it is helpful to understand that pandas dataframes essentially consist of numpy arrays held together with some super smart glue. Knowing that, it would be interesting to inspect whether any of the columns report different size measures with and without deep memory introspection. Instead of the more general `info()` method, we can use one specific to memory usage to find this out.

In [48]:
iris.memory_usage()

Index             80
sepal_length    1200
sepal_width     1200
petal_length    1200
petal_width     1200
species         1200
dtype: int64

In [49]:
iris.memory_usage(deep=True)

Index             80
sepal_length    1200
sepal_width     1200
petal_length    1200
petal_width     1200
species         9800
dtype: int64

From this, it is clear that it is the species column that changes; everything else remains the same. Above we saw that this column is of dtype "object". To understand what is happening, we first need to know that a numpy array is stored in the computer's memory as a contiguous (uninterrupted) segment. This is one of the reasons why numpy is so fast: it only needs to find the start of the array and then access a sequential length from the start point instead of going to find every single object (which is how lists work in Python). However, in order for numpy to store objects sequentially in memory, it needs to allocate a certain space for each object. This is fine for integers up to a certain size or floats up to a certain precision, but with strings (or more complex objects such as lists and dictionaries), numpy cannot fit them into the same sized chunks in an effective manner and the actual object is stored outside the array. So what is inside the array? Just a reference (also called a pointer) to where in memory the actual object is stored, and these references are of a fixed size:

![](./img/int-vs-pointer-memory-lookup.png)

[Image source](https://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object/21020411#21020411)

What happens when we specify deep memory introspection is that pandas finds and calculates the size of each of the objects in memory. With shallow introspection, it simply reports the size of the references that are actually stored in the array.
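
A small sketch of the difference, using a made-up object column with the same three species names repeated 50 times. The shallow number is just the number of rows times the size of one reference, while the deep number also counts the strings that the references point to.

```python
import sys

import pandas as pd

# A made-up object column: 150 references to Python strings.
species = pd.Series(['setosa', 'versicolor', 'virginica'] * 50, dtype=object)

# Size in bytes of one reference stored in the underlying array.
pointer_size = species.to_numpy().itemsize

shallow = species.memory_usage(index=False)
deep = species.memory_usage(index=False, deep=True)

# Size of the string objects themselves, which live outside the array.
string_sizes = sum(sys.getsizeof(s) for s in species)

print(pointer_size, shallow, deep, string_sizes)
```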

Note that memory usage is not the same as disk usage. Objects can take up additional space in memory depending on how they are constructed.

In [50]:
iris.to_csv('iris.csv')
!ls -lh iris*

-rw-r--r-- 1 joel joel 4.3K Jun 18 22:42 iris.csv


In [51]:
# titanic = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv') #sns.load_dataset('titanic')
titanic = pd.read_csv('/home/joel/Downloads/train.csv')
titanic

FileNotFoundError: [Errno 2] File b'/home/joel/Downloads/train.csv' does not exist: b'/home/joel/Downloads/train.csv'

Some of these columns I will not touch, so we're dropping them to fit the df on the screen.

In [None]:
titanic = titanic.drop(columns=['SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])
titanic

In [None]:
titanic.info()

In [None]:
titanic.memory_usage(deep=True) #.sum()

In [None]:
titanic.select_dtypes('number').head()

`Survived` and `Pclass` are not numerical variables; they are categorical.

In [None]:
# import re
# titanic.rename(columns=lambda x: re.sub('(?!^)([A-Z]+)', r'_\1', x).lower())
titanic = titanic.rename(columns=str.lower)
titanic = titanic.set_index('passengerid')
titanic['survived'] = titanic['survived'] == 1
titanic['pclass'] = titanic['pclass'].map({1: 'first', 2: 'second', 3: 'third'})
titanic

In [None]:
titanic.memory_usage(deep=True)

Boolean takes less space and strings take more.

In [None]:
titanic.dtypes

In [None]:
pd.Categorical(titanic['sex'])

In [None]:
titanic['sex'] = pd.Categorical(titanic['sex'])

In [None]:
titanic.memory_usage(deep=True)

In [None]:
# Stored as integers with a mapping, which can be seen with the cat accessor
titanic['sex'].cat.codes
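
Since the titanic file is not available in this run, here is a self-contained sketch of the same idea on a made-up column: the categorical version stores one small integer code per row plus a short list of categories, which takes far less memory than one string object per row.

```python
import pandas as pd

# A made-up string column with only two distinct values.
sex = pd.Series(['male', 'female', 'female', 'male'] * 100, dtype=object)
sex_cat = sex.astype('category')

# The distinct values are stored once, in sorted order.
print(list(sex_cat.cat.categories))         # ['female', 'male']
# Each row only stores the integer position of its category.
print(sex_cat.cat.codes.head(4).tolist())   # [1, 0, 0, 1]

object_bytes = sex.memory_usage(deep=True)
category_bytes = sex_cat.memory_usage(deep=True)
print(object_bytes, category_bytes)
```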

Categories can be ordered which allows comparisons.

In [None]:
titanic['pclass'] = pd.Categorical(titanic['pclass'], categories=['third', 'second', 'first'], ordered=True)

In [None]:
# Note that comparisons with string also work, but it is just comparing alphabetical order.
titanic['pclass'] > 'third'
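
A self-contained sketch of ordered comparisons on a made-up column, using the same category order as above (`third` lowest, `first` highest). Comparisons, `min`, `max` and sorting all follow the declared order rather than the alphabet.

```python
import pandas as pd

# Made-up passenger classes with an explicit order: third < second < first.
pclass = pd.Series(pd.Categorical(
    ['third', 'first', 'second', 'third'],
    categories=['third', 'second', 'first'],
    ordered=True,
))

above_third = (pclass > 'third').tolist()
lowest, highest = pclass.min(), pclass.max()
in_order = pclass.sort_values().tolist()

print(above_third)      # [False, True, True, False]
print(lowest, highest)  # third first
print(in_order)         # ['third', 'third', 'second', 'first']
```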

The order is also respected by pandas and seaborn.

In [None]:
# mode, min and max work
titanic['pclass'].mode()

In [None]:
titanic.groupby('pclass').size()

In [None]:
sns.catplot(x='pclass', y='age', data=titanic, kind='swarm')

In [None]:
# Value counts sorts based on value, not index.
titanic['pclass'].value_counts(normalize=True)

In [None]:
titanic.dtypes

In [None]:
# titanic.apply(lambda x: x + 1)
titanic.select_dtypes('number').apply(lambda x: x + 1)

In [None]:
titanic.describe()

In [None]:
# 'number', 'category', 'object' ,'bool'
titanic.select_dtypes('category').describe()

In [None]:
# describe has a built-in way of doing this also, but it is more versatile to learn select_dtypes
titanic.describe(include='category')

# String processing

We could use lambda and the normal Python string methods.

In [None]:
'First Last'.lower()

In [None]:
titanic['name'].apply(lambda x: x.lower())

Pandas has built-in accessor methods for many string methods so that we don't have to use lambda.

In [None]:
titanic['name'].str.lower()

Note that these work on Series, not dataframes. So either use them on one Series at a time or on a dataframe with a lambda expression.
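
A minimal sketch of both options on a made-up frame: the `.str` accessor on a single Series, and a lambda passed to `apply` that uses the accessor on each column in turn.

```python
import pandas as pd

# A made-up frame with two string columns.
people = pd.DataFrame({
    'first': ['Ada', 'Grace'],
    'last': ['Lovelace', 'Hopper'],
})

# The accessor exists on a Series ...
one_column = people['last'].str.lower()

# ... but not on the dataframe itself.
frame_has_str = hasattr(people, 'str')

# apply hands each column to the lambda as a Series.
whole_frame = people.apply(lambda col: col.str.lower())

print(one_column.tolist())   # ['lovelace', 'hopper']
print(frame_has_str)         # False
print(whole_frame)
```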

In [None]:
titanic

## What are the longest lastnames

In [None]:
titanic['name'].str.split(',')

In [None]:
titanic['name'].str.split(',', expand=True)

Can be assigned to multiple columns, or select one column with indexing.

In [None]:
titanic[['lastname', 'firstname']] = titanic['name'].str.split(',', expand=True)
titanic

In [None]:
titanic['lastname_length'] = titanic['lastname'].str.len()
titanic

In [None]:
titanic.sort_values('lastname_length', ascending=False).head()

In [None]:
# Shortcut for sorting
titanic.nlargest(5, 'lastname_length')

In [None]:
sns.distplot(titanic['lastname_length'], bins=20)

How many times are lastnames duplicated?

In [None]:
titanic['lastname'].value_counts().value_counts()

How can we view the duplicated ones?

In [None]:
titanic[titanic.duplicated('lastname', keep=False)].sort_values(['lastname'])
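
The two idioms used here can be tried on a made-up frame. Chaining `value_counts()` twice counts how many names occur once, twice, and so on, and `keep=False` marks every member of a duplicated group instead of only the later repeats.

```python
import pandas as pd

# Made-up last names where only 'Smith' occurs more than once.
people = pd.DataFrame({'lastname': ['Smith', 'Jones', 'Smith', 'Brown']})

# First count how often each name occurs, then how often each count occurs.
name_counts = people['lastname'].value_counts()
counts_of_counts = name_counts.value_counts().to_dict()
print(counts_of_counts)  # two names occur once, one name occurs twice

# By default only the later occurrences are flagged as duplicates.
later_only = people.duplicated('lastname').tolist()
# keep=False flags every row that belongs to a duplicated group.
all_marked = people.duplicated('lastname', keep=False).tolist()

print(later_only)   # [False, False, True, False]
print(all_marked)   # [True, False, True, False]
```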

Duplication is often due to women being registered under their husband's name.

We can get an idea by checking how many values include a parenthesis.

In [None]:
titanic.loc[titanic['name'].str.contains(r'\('), 'sex'].value_counts()

In [None]:
titanic.loc[titanic['name'].str.contains(r'\('), 'sex'].value_counts(normalize=True)

A boolean expression can be negated with `~`.

In [None]:
titanic.loc[~titanic['name'].str.contains(r'\('), 'sex'].value_counts()

There seem to be several reasons for a parenthesis in the name. The ones we want to change are the ones that have 'Mrs' and a parenthesis in the name.

In [None]:
# It is beneficial to break long method or indexing chains into several rows surrounded by parentheses.
(titanic
    .loc[(titanic['name'].str.contains(r'\('))
        & (titanic['name'].str.contains('Mrs'))
        , 'sex']
    .value_counts()
)

Dropped all male and 4 female passengers. Which females were dropped?

In [None]:
(titanic
    .loc[(titanic['name'].str.contains(r'\('))
        & (~titanic['name'].str.contains('Mrs'))
        & (titanic['sex'] == 'female')
        , 'name']
)

Even more precisely, we only want to keep the ones with a last and first name in the parenthesis. We can use the fact that these seem to be separated by a space.

In [None]:
# Explain regex above
# titanic.loc[(titanic['name'].str.contains(r'\(')) & (titanic['sex'] == 'female'), 'sex'].value_counts()
titanic.loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'sex'].value_counts()
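
To unpack the pattern: `Mrs` matches literally, `.*` matches any run of characters, `\(` and `\)` match literal parentheses, and the space between the two inner `.*` requires at least one space inside the parentheses. A small sketch with the standard `re` module on made-up names (not taken from the dataset) shows which ones match.

```python
import re

# 'Mrs', then anything, then a parenthesis whose content contains a space.
pattern = re.compile(r'Mrs.*\(.* .*\)')

# Made-up names in the same "Last, Title. First (Other name)" style.
names = [
    'Doe, Mrs. John (Jane Smith)',   # Mrs and two words in the parentheses
    'Roe, Mrs. Richard (Janet)',     # Mrs, but no space inside the parentheses
    'Poe, Miss. Ann (Annie Lee)',    # space inside the parentheses, but no Mrs
    'Doe, Mr. John',                 # no parentheses at all
]

# search looks for the pattern anywhere in the string.
matches = [bool(pattern.search(name)) for name in names]
print(matches)  # [True, False, False, False]

# A group in parentheses captures whatever is between the literal ones.
inner = re.search(r'\((.+)\)', names[0]).group(1)
print(inner)  # Jane Smith
```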

From these passengers, we can extract the name in the parenthesis.

In [None]:
(titanic
    .loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'name']
    .str.partition('(')[2]
)

In [None]:
(titanic
    .loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'name']
    .str.partition('(')[2]
    .str.partition(')')[0]
)

In this case I could also have used string indexing to strip the last character, but this would give us issues if there are spaces at the end.

In [None]:
(titanic
    .loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'name']
    .str.partition('(')[2]
    .str[:-1]
)

There is a more advanced way of getting this with regex directly, using a matching group to find anything in the parenthesis.

In [None]:
# %%timeit
(titanic
    .loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'name']
    .str.extract(r"\((.+)\)")
)

The two-way partition method is just fine, and regex can feel a bit magical sometimes, but it is good to know about if you end up working a lot with strings or need to extract complicated patterns.

Now let's get just the last names from this column and assign them back to the dataframe.

In [None]:
(titanic
    .loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'name']
    .str.partition('(')[2]
    .str.partition(')')[0]
    .str.rsplit(n=1, expand=True)
)
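
The chain above can be followed step by step on a made-up Series of names (not taken from the dataset): `partition` splits around the first occurrence of a separator into three columns, and `rsplit(n=1)` splits once from the right, so that everything before the last space stays together.

```python
import pandas as pd

# Made-up names in the same style as the titanic 'name' column.
names = pd.Series([
    'Doe, Mrs. John (Jane Smith)',
    'Roe, Mrs. Richard (Mary Ann Jones)',
])

# Column 2 of the first partition is everything after '(';
# column 0 of the second partition is everything before ')'.
inside = names.str.partition('(')[2].str.partition(')')[0]

# Split once from the right: column 1 holds the last name.
parts = inside.str.rsplit(n=1, expand=True)

print(inside.tolist())     # ['Jane Smith', 'Mary Ann Jones']
print(parts[0].tolist())   # ['Jane', 'Mary Ann']
print(parts[1].tolist())   # ['Smith', 'Jones']
```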

All the lastnames without parentheses will remain the same.

In [None]:
titanic['real_last'] = titanic['lastname']

Overwrite only the relevant rows.

In [None]:
titanic.loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'real_last'] = (
    titanic
        .loc[titanic['name'].str.contains(r'Mrs.*\(.* .*\)'), 'name']
        .str.partition('(')[2]
        .str.partition(')')[0]
        .str.rsplit(n=1, expand=True)
        [1]
)

In [None]:
titanic

In [None]:
titanic['lastname'].value_counts().value_counts()

In [None]:
titanic['real_last'].value_counts().value_counts()

## Extras

In [52]:
# For easier version control
!jupyter-nbconvert mds-seminar-apply-cat-str.ipynb --to python

[NbConvertApp] Converting notebook mds-seminar-apply-cat-str.ipynb to python
[NbConvertApp] Writing 18197 bytes to mds-seminar-apply-cat-str.py
