# Introduction

Hi, my nice friend Josh asked me a question:

    "Given the dataset defined by the following code, return the features that contain a non-zero value in at least 40 of the 48 items."


So here is my book report about the answer.

# Setup

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame({'Item': np.arange(0, 48)})
for i in range(50):
    df['F' + str(i)] = np.random.randint(0, i % 10 + 1, 48)

## Step 0: The 'Item' column is useless, so die
https://external-preview.redd.it/a8CoNByIoUUJRp4qq3pt2g0HwWYMA2S0cuC0dc8JJ1U.gif?format=mp4&s=56d7263ae4e24115d5d6a03980028c06755153ea

In [3]:
df = df.loc[:, 'F0':]

# Take a look at the data

In [4]:
df.shape

(48, 50)

In [5]:
df.head()

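|   | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | ... | F40 | F41 | F42 | F43 | F44 | F45 | F46 | F47 | F48 | F49 |
|---|----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 0 | 0 | 1 | 2 | 2 | 3 | 1 | 6 | 2 | 1 | ... | 0 | 0 | 2 | 0 | 0 | 4 | 0 | 5 | 2 | 4 |
| 1 | 0 | 0 | 0 | 2 | 1 | 1 | 5 | 4 | 7 | 2 | ... | 0 | 0 | 0 | 2 | 0 | 5 | 2 | 3 | 5 | 5 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 4 | 0 | 5 | 4 | ... | 0 | 1 | 0 | 1 | 1 | 2 | 5 | 1 | 0 | 6 |
| 3 | 0 | 1 | 1 | 1 | 2 | 5 | 1 | 1 | 8 | 5 | ... | 0 | 0 | 2 | 3 | 3 | 2 | 2 | 7 | 3 | 1 |
| 4 | 0 | 0 | 1 | 3 | 1 | 5 | 0 | 5 | 8 | 5 | ... | 0 | 1 | 1 | 1 | 1 | 3 | 4 | 6 | 2 | 8 |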


## Ok, cool

# Methods

## Method 1: For loop, using value_counts

In [6]:
def method1(df):
    """ For loop, using value_counts"""
    good_columns = []

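    for column in df:
        # At least 40 non-zero values out of 48 means at most 8 zeros.
        # .get(0, 0) counts a column with no zeros as 0 instead of raising KeyError.
        if df[column].value_counts().get(0, 0) <= 8: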
            good_columns.append(column)
    
    return good_columns
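
The `<= 8` comes from flipping the question around: at least 40 non-zero values out of 48 is the same as at most 8 zeros. Here is a quick sketch on a made-up column (the `toy` Series is my own illustration, not part of Josh's dataset). It uses `.get(0, 0)` so that a column with no zeros at all would count as 0 rather than blow up.

```python
import pandas as pd

# Hypothetical column: 48 rows, 6 zeros and 42 non-zero values
toy = pd.Series([0] * 6 + [3] * 42)

zero_count = toy.value_counts().get(0, 0)   # number of zeros; 0 if there are none
nonzero_count = int((toy != 0).sum())       # number of non-zero values

# The two ways of asking the question agree
passes_by_zeros = zero_count <= 8
passes_by_nonzeros = nonzero_count >= 40
print(zero_count, nonzero_count, passes_by_zeros, passes_by_nonzeros)
```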

## Method 2: List comprehension, using value_counts

In [7]:
def method2(df):
    """ List comprehension, using value_counts"""
    # .get(0, 0) counts a column with no zeros as 0 instead of raising KeyError
    return [column for column in df if df[column].value_counts().get(0, 0) <= 8]

## Method 3: apply to make mask, then select columns using mask

In [8]:
def method3(df):
    """ apply to make mask, then select columns using mask"""
    
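    # A column with no zeros shows up as NaN in the counts table, and NaN <= 8 is False,
    # so fill with 0 before comparing.
    mask = df.apply(lambda s: s.value_counts()).fillna(0).loc[0] <= 8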

    return [column for column in df.loc[:, mask]]
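
It helps to see what the intermediate table looks like. Applying `value_counts` to every column gives one row per distinct value and one column per feature, with `NaN` wherever a value never shows up in that column. A small sketch on a made-up frame (`small` is hypothetical, just for illustration):

```python
import pandas as pd

small = pd.DataFrame({
    'A': [0, 0, 1],   # two zeros
    'B': [1, 1, 2],   # no zeros at all
})

# One row per distinct value, one column per feature
counts = small.apply(lambda s: s.value_counts())

# The row labelled 0 is the "how many zeros" row; 'B' is NaN there because it has no zeros
zero_row = counts.loc[0].fillna(0)

print(zero_row.to_dict())  # {'A': 2.0, 'B': 0.0}
```

Without the `fillna(0)`, a comparison like `NaN <= 8` comes out `False`, so a column with no zeros would be dropped even though it should pass.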

## Method 4: Get mad, turn on wall hax

In [9]:
def method4(df):
    """
    Pandas DataFrames are built on NumPy arrays, which are optimized for vectorized operations:
        one call works on a whole column (or the whole frame) in compiled code instead of a Python loop, for SICK GAINS.
    
    To take advantage of this, basically use built-in functions instead of DIY loops and apply
    
    Procedure
    1) reduce to bool of 0 or not (vectorized)
    2) sum bools to get count (vectorized)
    3) create bool mask of sums which are >=40 (vectorized)
    4) select using this mask
    5) list columns
    """
    return list(df.loc[:, (df != 0).sum() >= 40].columns)
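
The five steps in the docstring can be watched one at a time on a tiny made-up frame. The `tiny` frame and the threshold of 3 are mine, scaled down from 40 of 48, so treat this as a sketch of the idea rather than the real dataset.

```python
import pandas as pd

# Hypothetical 4-row frame; threshold scaled down to "at least 3 non-zero values"
tiny = pd.DataFrame({
    'A': [0, 0, 1, 2],   # 2 non-zero
    'B': [1, 2, 3, 0],   # 3 non-zero
    'C': [5, 6, 7, 8],   # 4 non-zero
})

is_nonzero = tiny != 0              # 1) boolean frame: True where the value is not 0
counts = is_nonzero.sum()           # 2) per-column count of True values
mask = counts >= 3                  # 3) boolean mask over the columns
selected = tiny.loc[:, mask]        # 4) keep only the columns where the mask is True
result = list(selected.columns)     # 5) column names as a plain list

print(result)  # ['B', 'C']
```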

# Gotta Time Em All

In [10]:
%%timeit
method1(df)

35.8 ms ± 963 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [11]:
%%timeit
method2(df)

35.6 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [12]:
%%timeit
method3(df)

54.3 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [13]:
%%timeit
method4(df)

872 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
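
`%%timeit` only works inside IPython or Jupyter. In a plain script, the standard-library `timeit` module does roughly the same job. The sketch below rebuilds a dataset shaped like the one above with a fixed seed (the seed and the names `loop_version` and `vectorized_version` are my additions), then times a per-column loop against the vectorized one-liner. The numbers depend on your machine, so I'm not going to promise any.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild a dataset shaped like the one above; the fixed seed is my addition
rng = np.random.default_rng(0)
data = pd.DataFrame({'F' + str(i): rng.integers(0, i % 10 + 1, 48) for i in range(50)})

def loop_version(frame):
    """Per-column loop, counting zeros one column at a time."""
    return [col for col in frame if (frame[col] == 0).sum() <= 8]

def vectorized_version(frame):
    """Whole-frame comparison and sum."""
    return list(frame.columns[(frame != 0).sum() >= 40])

# Both approaches should pick the same columns
assert loop_version(data) == vectorized_version(data)

# Average seconds per call over 100 calls each
loop_time = timeit.timeit(lambda: loop_version(data), number=100) / 100
vec_time = timeit.timeit(lambda: vectorized_version(data), number=100) / 100
print(f"loop: {loop_time * 1e3:.3f} ms, vectorized: {vec_time * 1e3:.3f} ms")
```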


# Your face when you see that the vectorized version is roughly 40 to 60 times faster than looping over columns:
https://i.pinimg.com/originals/05/51/f5/0551f506725ac1deeaa85d46f8b9a5fd.jpg

# Output of methods (if you don't believe my code works :-(

In [14]:
for method in [method1, method2, method3, method4]:
    print(method(df))

['F5', 'F6', 'F7', 'F8', 'F9', 'F13', 'F15', 'F16', 'F17', 'F18', 'F19', 'F24', 'F26', 'F27', 'F28', 'F29', 'F34', 'F37', 'F38', 'F39', 'F45', 'F46', 'F47', 'F48', 'F49']
['F5', 'F6', 'F7', 'F8', 'F9', 'F13', 'F15', 'F16', 'F17', 'F18', 'F19', 'F24', 'F26', 'F27', 'F28', 'F29', 'F34', 'F37', 'F38', 'F39', 'F45', 'F46', 'F47', 'F48', 'F49']
['F5', 'F6', 'F7', 'F8', 'F9', 'F13', 'F15', 'F16', 'F17', 'F18', 'F19', 'F24', 'F26', 'F27', 'F28', 'F29', 'F34', 'F37', 'F38', 'F39', 'F45', 'F46', 'F47', 'F48', 'F49']
['F5', 'F6', 'F7', 'F8', 'F9', 'F13', 'F15', 'F16', 'F17', 'F18', 'F19', 'F24', 'F26', 'F27', 'F28', 'F29', 'F34', 'F37', 'F38', 'F39', 'F45', 'F46', 'F47', 'F48', 'F49']


# Post Script: If you can tell me how to insert a jpg or especially a gif directly into a notebook, I will buy you 5 packs of Magic: The Gathering cards, or 5 ice cream cones, your choice.