# Slicing Data in Python: An In-depth Overview

#### The many ways to slice data in Python.

### Abstract:

While performing exploratory data analysis, one often moves quickly between different data structures, and the question arises: "How do I extract or filter the data I need?" This overview of Python data structures presents the various methods in concise chunks so that they can be compared, contrasted, and committed to memory! By the end of the talk, the hope is that slicing and filtering data in Python will be as intuitive as slicing bread.

### Some Resources to Share:
- Are you just getting started with Python?
    - [Google Python Edu Course](https://developers.google.com/edu/python/)
- [DataCamp String Tutorial](https://www.datacamp.com/community/tutorials/python-string-tutorial)
- [DataCamp Dictionary Tutorial](https://www.datacamp.com/community/tutorials/python-dictionary-tutorial)
- [DataCamp NumPy Array Tutorial](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)
- [Python Data Structures](https://docs.python.org/3.7/tutorial/datastructures.html)

----

## Lists

----

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Generating a list of random numbers as strings</h4>
<p style="color: #cccccc;">In the following example, the <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.randn.html#numpy.random.randn">randn</a> function of the NumPy random module is used to generate 14 random values from the "standard normal" distribution. Then string formatting is used to round each value to 4 decimal places, which also turns each number into a string. The resulting list is saved to a variable called <em>my_list</em>.</p>
</div>

In [2]:
import numpy as np

my_list = [f'{x:.4f}' for x in np.random.randn(14)]

In [4]:
print(my_list)

['0.5472', '-1.3950', '0.3047', '-1.0847', '-0.1700', '0.4376', '-1.1538', '0.6338', '-1.6537', '1.3397', '-0.0013', '-0.1586', '0.6316', '-0.3730']


<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Slicing single values from a list</h4>
<p style="color: #cccccc;">To pull out items from a list, square bracket notation is used to index the list. In the following example, first the fifth item is printed and then the fourth item from the end. Python sequences are zero-indexed, so the fifth item lives at index 4, and negative indexes count backwards from the end of the list.</p>
</div>

In [5]:
print("fifth list element:", my_list[4], "\nfourth last list element:", my_list[-4])

fifth list element: -0.1700 
fourth last list element: -0.0013
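Since `my_list` holds random values, the same idea is easier to verify on a fixed list. The following is a minimal sketch; the `letters` list is made up for illustration and is not part of the original notebook.

```python
# A fixed list, so the results are reproducible (unlike the random values above).
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

fifth = letters[4]          # index 4 is the fifth item because counting starts at 0
fourth_last = letters[-4]   # negative indexes count backwards from the end

# A negative index -k points at the same item as len(seq) - k.
same_item = letters[len(letters) - 4]

print(fifth, fourth_last, same_item)  # e d d
```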


<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Slicing an index range from a list</h4>
<p style="color: #cccccc;">To get any subset of sequential items from a list, square bracket notation is used again with a colon separating the starting index value and the <strong>exclusive</strong> ending index value.</p>
</div>

In [6]:
print(my_list[3:7])

['-1.0847', '-0.1700', '0.4376', '-1.1538']
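The exclusive stop index is easier to trust once you see what it buys you. The sketch below uses a hypothetical `numbers` list (not from the notebook) to show that omitted bounds default to the ends of the list, that out-of-range bounds are clipped rather than raising an error, and that adjacent slices fit together without overlap.

```python
numbers = list(range(10))   # [0, 1, 2, ..., 9]

middle = numbers[3:7]       # items at indexes 3, 4, 5, 6; index 7 is excluded
head = numbers[:3]          # an omitted start defaults to the beginning
tail = numbers[7:]          # an omitted stop defaults to the end
clipped = numbers[8:100]    # out-of-range bounds are clipped, not an error

# Because the stop is exclusive, adjacent slices rebuild the original list,
# and the length of a slice is simply stop - start.
assert head + middle + tail == numbers
assert len(middle) == 7 - 3

print(middle, clipped)  # [3, 4, 5, 6] [8, 9]
```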


<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Non-sequential index range from a list</h4>
<p style="color: #cccccc;">To get a non-sequential subset of items from a list, add a second colon followed by a step (the interval between returned items), giving <em>start:stop:step</em>. The stop index is still <strong>exclusive</strong>. For example, a step of 2 returns the first item, skips the second, returns the third, and so on in a return-skip pattern. A step of 4 produces a return-skip-skip-skip pattern.</p>
</div>

In [7]:
print(my_list[::2])

['0.5472', '0.3047', '-0.1700', '-1.1538', '-1.6537', '-0.0013', '0.6316']
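The step patterns described above can be checked on a list whose values equal their indexes. This is a small sketch with a made-up `numbers` list; the negative step at the end is an extra illustration that the notebook itself does not use.

```python
numbers = list(range(10))        # each value equals its own index

every_second = numbers[::2]      # return-skip pattern
every_fourth = numbers[::4]      # return-skip-skip-skip pattern
odd_window = numbers[1:8:2]      # start, exclusive stop, and step together
reversed_copy = numbers[::-1]    # a negative step walks the list backwards

print(every_second)   # [0, 2, 4, 6, 8]
print(every_fourth)   # [0, 4, 8]
print(odd_window)     # [1, 3, 5, 7]
```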


<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Generating a 2-D list of lists</h4>
<p style="color: #cccccc;">The following code generates a list of lists with 5 random numbers in each of the 5 lists. The list of lists, <em>l_of_l</em>, is then printed.</p>
</div>

In [8]:
# Each row of the 5x5 array is a 1-D NumPy array
list_of_lists = [x for x in np.random.randn(5, 5)]

# Convert each row to a plain Python list
l_of_l = []
for x in list_of_lists:
    l_of_l.append(list(x))

In [10]:
# A 2-D list
print(l_of_l)

[[-1.459458523162747, -0.11161976861055396, 0.7286473159816925, -0.45019633949927335, -1.480284675504651], [-1.4623590735762781, 0.49269906139539904, -0.2807141147098116, 0.04426059310797786, 0.149451615853121], [-0.5049336780755084, 0.24960882372001447, 0.06343745373630551, 0.2533567425192111, 0.69160565229463], [-1.707563606142647, -0.9145527113511979, -0.4091646169982658, -0.8439119023746744, -0.3773618689196677], [0.2949757592196722, 1.5749739828620137, 1.004092104315791, -2.1705220499626603, 0.6462539550293249]]


<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Slicing data from a list of lists</h4>
<p style="color: #cccccc;">In the following example we slice the fourth random number from the fourth list. We achieve this by first using square bracket notation to select the list (row) and then following it with a second pair of square brackets to select the item within that list. Remember this notation because it contrasts with the more concise and more efficient NumPy syntax for 2-dimensional arrays.</p>
</div>

In [11]:
print(l_of_l[3][3])

-0.8439119023746744


<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Performing element-wise operations on lists of lists</h4>
<p style="color: #cccccc;">If we want the fourth element from the first 3 lists, we don't have a concise syntax within the above-mentioned square bracket notation. To perform this task we need a for loop, and operating on every element of every list would require a nested for loop.</p>
</div>

In [12]:
for l in l_of_l[:3]:
    print(l[3])

-0.45019633949927335
0.04426059310797786
0.2533567425192111


<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Pythonic Zen</h4>
<p style="color: #cccccc;">A more Pythonic and efficient approach would utilize a list comprehension for the above task.</p>
</div>

In [13]:
[l[3] for l in l_of_l[:3]]

[-0.45019633949927335, 0.04426059310797786, 0.2533567425192111]
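The same comprehension pattern is easier to follow on a small fixed grid. The sketch below is illustrative only: `grid` is a made-up list of lists, and the `zip(*grid)` transpose is an additional standard idiom for pulling out whole columns, not something the notebook uses.

```python
grid = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Fourth element (index 3) from each of the first three rows.
fourth_of_first_three = [row[3] for row in grid[:3]]

# zip(*grid) pairs up the items at the same position in every row,
# which turns rows into columns.
columns = [list(column) for column in zip(*grid)]

print(fourth_of_first_three)  # [4, 8, 12]
print(columns[3])             # [4, 8, 12, 16]
```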

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Some background information</h4>
<p style="color: #cccccc;">It is worth mentioning some background information that should augment your understanding of slicing operations. The <a href="https://docs.python.org/3.7/library/functions.html?highlight=slice#slice"><em>slice</em></a> built-in class is the underlying workhorse of extended indexing syntax and it is utilized throughout the numerical Python package (NumPy). See the link to read more about this built-in Python class. The following example builds a slice object that is equivalent to the extended indexing syntax <em>my_list[1:8:2]</em>.</p>
</div>

In [83]:
my_slice = slice(1, 8, 2)
my_list[my_slice]

['-1.3950', '-1.0847', '0.4376', '0.6338']
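A slice object is handy when the same window has to be applied to several sequences. The following sketch uses made-up data (`letters` and the name `odd_positions` are illustrative) and also shows the `indices` method, which reports the concrete start, stop, and step a slice resolves to for a sequence of a given length.

```python
letters = list('abcdefghij')
odd_positions = slice(1, 8, 2)

# A slice object inside the brackets behaves exactly like the colon syntax.
from_object = letters[odd_positions]
from_syntax = letters[1:8:2]

# The same slice object can be reused on any sequence type.
from_string = 'abcdefghij'[odd_positions]

# For a sequence of length 5 the stop of 8 is clipped to 5.
resolved = odd_positions.indices(5)

print(from_object)  # ['b', 'd', 'f', 'h']
print(from_string)  # bdfh
print(resolved)     # (1, 5, 2)
```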

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">An alternate slice object that iterates</h4>
<p style="color: #cccccc;">A <em>slice</em> object only describes a range of indexes; it is not itself iterable. If you need a slice that you can iterate over, this is where our first standard library module comes to the rescue. Itertools has a function called <a href="https://docs.python.org/3/library/itertools.html#itertools.islice"><em>islice</em></a> or, to help you conceptualize, "iterable-slice." The code below illustrates that the object returned by slice() is not iterable, while the object returned by islice() is.</p>
</div>

In [84]:
iter(my_slice)

TypeError: 'slice' object is not iterable

In [85]:
from itertools import islice

In [86]:
iter_slice = islice(my_list, 1, 8, 2)

[float(i) for i in iter_slice]

[-1.395, -1.0847, 0.4376, 0.6338]

In [87]:
iter(iter_slice)

<itertools.islice at 0x11b725048>
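Because `islice` works on any iterable, it can take a slice of something that has no length or indexes at all. The sketch below uses `itertools.count`, an infinite counter chosen here purely for illustration, and also shows that an `islice` object is consumed as it is read.

```python
from itertools import count, islice

# count() yields 0, 1, 2, ... forever, so bracket slicing is not an option.
lazy = islice(count(), 1, 8, 2)

first_pass = list(lazy)    # pulls only the items the slice asks for
second_pass = list(lazy)   # the iterator is now exhausted

print(first_pass)   # [1, 3, 5, 7]
print(second_pass)  # []
```

If the sliced values are needed more than once, store them in a list as `first_pass` does here.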

----

## Strings

----

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Strings, same as 1-D lists, no sweat</h4>
<p style="color: #cccccc;">Strings allow for the same extended indexing for slices that we saw with a single list. As long as you remember zero indexing, exclusive stopping index, and interval sequencing, you are ready for string slicing.</p>
</div>

In [22]:
my_string = "My name is Cody"

In [23]:
my_string[-4:]

'Cody'

In [24]:
my_string[3:7]

'name'

In [25]:
my_string[::2]

'M aei oy'
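Two details are worth seeing in code: a negative step reverses a string, and because strings are immutable a slice can be read but never assigned to. This is a minimal sketch; the variable names are illustrative.

```python
sentence = "My name is Cody"

last_word = sentence[-4:]    # 'Cody'
backwards = sentence[::-1]   # 'ydoC si eman yM'
past_the_end = sentence[100:]  # out-of-range slices give an empty string

# Lists allow slice assignment, strings do not.
try:
    sentence[0:2] = "Hi"
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(last_word, backwards, assignment_failed)
```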

----

## NumPy Arrays

----

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Numerical Python and the NumPy array</h4>
<p style="color: #cccccc;">Oh NumPy! You have cleaner array outputs, less verbose syntax, vectorized operations, and lay the foundation for pandas and many other packages. The following code generates a NumPy array, <em>num_2d_array</em>, from our list of lists.</p>
</div>

In [17]:
import numpy as np

num_2d_array = np.array(l_of_l)
num_2d_array

array([[-1.45945852, -0.11161977,  0.72864732, -0.45019634, -1.48028468],
       [-1.46235907,  0.49269906, -0.28071411,  0.04426059,  0.14945162],
       [-0.50493368,  0.24960882,  0.06343745,  0.25335674,  0.69160565],
       [-1.70756361, -0.91455271, -0.40916462, -0.8439119 , -0.37736187],
       [ 0.29497576,  1.57497398,  1.0040921 , -2.17052205,  0.64625396]])

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Simple NumPy array slicing along the first dimension</h4>
<p style="color: #cccccc;">No changes here from the list syntax. On a 2-D array, a single slice selects whole rows.</p>
</div>

In [100]:
num_2d_array[0:3:2]

array([[-1.45945852, -0.11161977,  0.72864732, -0.45019634, -1.48028468],
       [-0.50493368,  0.24960882,  0.06343745,  0.25335674,  0.69160565]])

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">2 dimensional NumPy array slicing</h4>
<p style="color: #cccccc;">No more for loops for slicing items from rows within the array. We now have a new syntax that can be thought of as <em>[rows, columns]</em> and each of the slices supports the extended indexing syntax of <em>start:stop:step</em>.</p>
</div>

In [101]:
num_2d_array[0:3, ::2]

array([[-1.45945852,  0.72864732, -1.48028468],
       [-1.46235907, -0.28071411,  0.14945162],
       [-0.50493368,  0.06343745,  0.69160565]])
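To make the contrast with lists of lists concrete, the sketch below performs the same selection both ways on a small deterministic array. The array `grid` is made up for illustration so the result does not depend on random values.

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)   # values 0..24 laid out in 5 rows

# NumPy: first three rows, every second column, in one expression.
from_array = grid[0:3, ::2]

# Plain lists: slice the rows first, then slice inside each row.
nested = grid.tolist()
from_lists = [row[::2] for row in nested[0:3]]

print(from_array)
# [[ 0  2  4]
#  [ 5  7  9]
#  [10 12 14]]
```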

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Easily slice a column</h4>
<p style="color: #cccccc;">The new 2-D NumPy array syntax makes column slices efficient and far less verbose than with lists of lists. The following example returns everything from the first column.</p>
</div>

In [102]:
num_2d_array[:,0]

array([-1.45945852, -1.46235907, -0.50493368, -1.70756361,  0.29497576])

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Another column slice example</h4>
<p style="color: #cccccc;">This example returns, for the first three rows, the fourth column item (index 3).</p>
</div>

In [20]:
num_2d_array[:3,3]

array([-0.45019634,  0.04426059,  0.25335674])

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Boolean indexing, supported by a NumPy array near you!</h4>
<p style="color: #cccccc;">By using a threshold, equality, or other comparison operator, you can easily create a boolean mask of a given array. Then you can use this boolean mask within the square bracket notation to return the items where the mask is <em>True</em>; the result is a flattened 1-D array. In the following example, the code returns every value in <em>num_2d_array</em> that is greater than zero.</p>
</div>

In [21]:
# NumPy also supports boolean indexing
num_2d_array[num_2d_array > 0]

array([0.72864732, 0.49269906, 0.04426059, 0.14945162, 0.24960882,
       0.06343745, 0.25335674, 0.69160565, 0.29497576, 1.57497398,
       1.0040921 , 0.64625396])
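The mask itself is an ordinary array of booleans, which is easier to see on a tiny example. In this sketch the `values` array is made up, and `np.where` is shown as one possible way to keep the original shape when that matters; the notebook itself does not use it.

```python
import numpy as np

values = np.array([[-2, 5, 0],
                   [7, -1, 3]])

mask = values > 0                 # same shape as values, dtype bool
positives = values[mask]          # matching items, flattened row by row

# Combine conditions with & or |, wrapping each comparison in parentheses.
in_range = values[(values > 0) & (values < 6)]

# np.where keeps the shape by substituting a fill value where the mask is False.
zero_filled = np.where(mask, values, 0)

print(positives)   # [5 7 3]
print(in_range)    # [5 3]
```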

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">What is the Ellipsis object?</h4>
<p style="color: #cccccc;">When you have a 1-D or 2-D array, the utility of the Ellipsis object is hard to grasp. However, once your ndarray, <em>x</em>, has <em>x.ndim >= 3</em>, the Ellipsis object becomes useful for slicing NumPy arrays. This is because the Ellipsis object expands to as many full slices as are needed to cover the dimensions not given explicitly in your slice notation. The following example returns, within the second item of the first dimension, the second item of the fourth dimension across all of the dimensions in between.</p>
</div>

In [116]:
multi_dim_array = np.arange(16).reshape(2,2,2,2)

In [119]:
multi_dim_array

array([[[[ 0,  1],
         [ 2,  3]],

        [[ 4,  5],
         [ 6,  7]]],


       [[[ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15]]]])

In [118]:
multi_dim_array[1,...,1]

array([[ 9, 11],
       [13, 15]])
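One way to read the Ellipsis is as shorthand for "as many `:` as needed". The sketch below rebuilds the same 4-D array and compares the Ellipsis form with the fully spelled-out form; the variable names are illustrative.

```python
import numpy as np

x = np.arange(16).reshape(2, 2, 2, 2)

with_ellipsis = x[1, ..., 1]
spelled_out = x[1, :, :, 1]      # the Ellipsis stands in for the two middle slices

# A leading Ellipsis is a convenient way to index only the last dimension.
last_axis_first_item = x[..., 0]

print(np.array_equal(with_ellipsis, spelled_out))  # True
print(last_axis_first_item.shape)                  # (2, 2, 2)
```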

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Simple slicing with indices</h4>
<p style="color: #cccccc;">The following example returns rows in the sequence of indices given. Read literally, the code says: return the first row, then the second, the first again, the third, the first again, and then the fourth, all in a new NumPy array.</p>
</div>

In [113]:
num_2d_array[[0,1,0,2,0,3]]

array([[-1.45945852, -0.11161977,  0.72864732, -0.45019634, -1.48028468],
       [-1.46235907,  0.49269906, -0.28071411,  0.04426059,  0.14945162],
       [-1.45945852, -0.11161977,  0.72864732, -0.45019634, -1.48028468],
       [-0.50493368,  0.24960882,  0.06343745,  0.25335674,  0.69160565],
       [-1.45945852, -0.11161977,  0.72864732, -0.45019634, -1.48028468],
       [-1.70756361, -0.91455271, -0.40916462, -0.8439119 , -0.37736187]])

<div style="background-color: #000054; padding: 25px;">
<h4 style="color: #ffffff;">Advanced slicing notation with indices</h4>
<p style="color: #cccccc;">The following example also takes sequences of indices, <strong>but</strong> when one sequence is given per dimension, the sequences are paired up element by element (and broadcast against each other if their shapes differ). Read literally, the code says: for the second, third, and fourth rows, return the second, third, and fourth column items, <strong>respectively</strong>, in a new NumPy array.</p>
</div>

In [125]:
num_2d_array[(1,2,3),(1,2,3)]

array([ 0.49269906,  0.06343745, -0.8439119 ])
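The pairing behaviour surprises people who expect a 3x3 block. The sketch below contrasts the paired form with `np.ix_`, which builds index arrays that select every combination of the given rows and columns. The `grid` array is made up for illustration.

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)

# Paired: picks the elements at (1, 1), (2, 2) and (3, 3).
diagonal_items = grid[[1, 2, 3], [1, 2, 3]]

# np.ix_ selects rows 1-3 crossed with columns 1-3.
block = grid[np.ix_([1, 2, 3], [1, 2, 3])]

print(diagonal_items)  # [ 6 12 18]
print(block)
# [[ 6  7  8]
#  [11 12 13]
#  [16 17 18]]
```

For a contiguous block like this one, the plain slice `grid[1:4, 1:4]` gives the same values; `np.ix_` is mainly useful when the rows or columns are not evenly spaced.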

Remember that NumPy arrays are homogeneous with regard to data type and that simple 1-D slicing of NumPy arrays uses the same syntax as slicing 1-D lists.
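The shared syntax hides one practical difference: slicing a list produces a new list, while basic slicing of a NumPy array produces a view onto the same data. The following sketch, with made-up variable names, shows the consequence when a slice is modified.

```python
import numpy as np

plain = [0, 1, 2, 3, 4]
plain_slice = plain[1:4]
plain_slice[0] = 99              # changes only the new list

array = np.arange(5)
array_slice = array[1:4]
array_slice[0] = 99              # writes through to the original array

# Call .copy() when an independent array is wanted.
independent = array[1:4].copy()
independent[1] = -1              # the original array is not affected

print(plain)   # [0, 1, 2, 3, 4]
print(array)   # [ 0 99  2  3  4]
```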

----

## Dictionaries

----

In [26]:
key_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
value_list = [[1, 2, 3, 4, 5], 'My name is Cody', 77, 0.1234, (123, 456), {"a dict": "within a dict"}, {1, 2, 3, 4}]

mapping = {}

for key, value in zip(key_list, value_list):
    mapping[key] = value

In [27]:
mapping

{'a': [1, 2, 3, 4, 5],
 'b': 'My name is Cody',
 'c': 77,
 'd': 0.1234,
 'e': (123, 456),
 'f': {'a dict': 'within a dict'},
 'g': {1, 2, 3, 4}}

In [28]:
comp_mapping = {key : value for key, value in zip(key_list, value_list)}

comp_mapping

{'a': [1, 2, 3, 4, 5],
 'b': 'My name is Cody',
 'c': 77,
 'd': 0.1234,
 'e': (123, 456),
 'f': {'a dict': 'within a dict'},
 'g': {1, 2, 3, 4}}

In [29]:
mapping['c']

77

In [30]:
key_slice = ['a', 'c', 'g']

values_slice = [mapping[key] for key in key_slice]
values_slice

[[1, 2, 3, 4, 5], 77, {1, 2, 3, 4}]

In [31]:
(123,456) in mapping.values()

True

In [32]:
def dict_search(dictionary, search_term):

    ''' This function takes two inputs. First is the
    dictionary you want to search, second is the value
    you want to search for. The function will return the
    FIRST key that matches that value.
    '''

    my_keys_indexed = list(dictionary.keys())
    my_values_indexed = list(dictionary.values())

    ans = my_keys_indexed[my_values_indexed.index(search_term)]
    
    return ans

In [33]:
dict_search(mapping, 77)

'c'

In [34]:
def dict_search_all(dictionary, search_term):

    '''This function loops over the dictionary and
    returns all keys that match the search term. The
    function takes two inputs:

    dictionary: a dict() object
    search_term: a value within the supplied dict()
    '''

    # Collect every key whose value equals the search term,
    # in the dictionary's insertion order
    return [key for key, value in dictionary.items() if value == search_term]

In [35]:
dict_search_all(mapping, 77)

['c']

In [36]:
for i, v in mapping.items():
    print(i, v)

a [1, 2, 3, 4, 5]
b My name is Cody
c 77
d 0.1234
e (123, 456)
f {'a dict': 'within a dict'}
g {1, 2, 3, 4}


In [37]:
search_term = 77

for i, v in mapping.items():
    if v == search_term:
        print(i, v)

c 77


----

## `collections`

----

When thinking about slicing, the standard library's `collections` module provides many useful container objects.

#### `Counter`, with useful methods like `most_common`

In [38]:
from collections import Counter

In [39]:
c = Counter({'red': 4, 'blue': 2})
c['red']

4

In [40]:
c = Counter('My name is Cody, and I love to program with Python')
c['y']

3

In [41]:
c.most_common(3)

[(' ', 10), ('o', 5), ('y', 3)]

#### `deque`, with useful methods for appending and popping values

It is worth the effort to work through the examples provided in the Python documentation for [deque](https://docs.python.org/3.7/library/collections.html?highlight=collections#collections.deque). There are also excellent recipes for common use cases.

In [42]:
my_queue = [x for x in np.arange(4, 25, 0.5)]

In [43]:
from collections import deque

d = deque(my_queue)

In [44]:
d.pop()

24.5

In [45]:
d.popleft()

4.0

In [46]:
d.append(24.5)
d[-1]

24.5

In [47]:
d.appendleft(4.0)
d[0]

4.0
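One caveat in a slicing overview: a `deque` supports indexing by a single integer, as shown above, but not slice notation. The sketch below, using a small made-up queue, shows the error and one workaround with `itertools.islice`; the `maxlen` argument is included as a related way to keep only the most recent items.

```python
from collections import deque
from itertools import islice

queue = deque([4.0, 4.5, 5.0, 5.5, 6.0])

# Slice notation is not supported on a deque and raises TypeError.
try:
    queue[1:3]
    slicing_supported = True
except TypeError:
    slicing_supported = False

# islice accepts any iterable, so it can stand in for the slice.
window = list(islice(queue, 1, 3))

# With maxlen set, older items fall off the left as new ones arrive.
recent = deque(range(10), maxlen=3)

print(slicing_supported)  # False
print(window)             # [4.5, 5.0]
print(list(recent))       # [7, 8, 9]
```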

----

## Pandas

----

In [48]:
import pandas as pd

In [49]:
my_series = pd.Series(my_queue)

In [50]:
my_series[0:5]

0    4.0
1    4.5
2    5.0
3    5.5
4    6.0
dtype: float64

In [51]:
new_series = my_series[0:26]
len(new_series)

26

In [52]:
index_list = [x for x in "abcdefghijklmnopqrstuvwxyz"]
new_series.index = index_list

In [53]:
new_series[0:5]

a    4.0
b    4.5
c    5.0
d    5.5
e    6.0
dtype: float64

In [54]:
new_series['g']

7.0
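Once a Series has labels, there are two ways to slice it, and they treat the stop value differently. The sketch below uses a small made-up Series: `iloc` slices by position with an exclusive stop, like a list, while `loc` slices by label and includes the stop label.

```python
import pandas as pd

series = pd.Series([4.0, 4.5, 5.0, 5.5, 6.0], index=list('abcde'))

by_position = series.iloc[1:3]    # positions 1 and 2; position 3 is excluded
by_label = series.loc['b':'d']    # labels 'b', 'c' and 'd'; the stop is included

print(list(by_position.index))  # ['b', 'c']
print(list(by_label.index))     # ['b', 'c', 'd']
```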

#### pandas Series and DataFrames with datetime indexing

In [55]:
from datetime import datetime
from datetime import timedelta

In [56]:
base = datetime.today()
date_index = [base - timedelta(days=x) for x in range(0, 26)]

In [57]:
new_series.index = sorted(date_index)

In [58]:
new_series[0:5]

2019-03-30 15:17:34.902935    4.0
2019-03-31 15:17:34.902935    4.5
2019-04-01 15:17:34.902935    5.0
2019-04-02 15:17:34.902935    5.5
2019-04-03 15:17:34.902935    6.0
dtype: float64

In [59]:
new_series['April, 6 2019']

2019-04-06 15:17:34.902935    7.5
dtype: float64

In [60]:
new_series['2019-04-04':'2019-04-14']

2019-04-04 15:17:34.902935     6.5
2019-04-05 15:17:34.902935     7.0
2019-04-06 15:17:34.902935     7.5
2019-04-07 15:17:34.902935     8.0
2019-04-08 15:17:34.902935     8.5
2019-04-09 15:17:34.902935     9.0
2019-04-10 15:17:34.902935     9.5
2019-04-11 15:17:34.902935    10.0
2019-04-12 15:17:34.902935    10.5
2019-04-13 15:17:34.902935    11.0
2019-04-14 15:17:34.902935    11.5
dtype: float64

In [61]:
new_df = pd.DataFrame(new_series)

In [62]:
new_df['2019-04-02':'2019-04-04']

| | 0 |
|---|---|
| 2019-04-02 15:17:34.902935 | 5.5 |
| 2019-04-03 15:17:34.902935 | 6.0 |
| 2019-04-04 15:17:34.902935 | 6.5 |


In [63]:
new_df[new_df.index.day == 2]

| | 0 |
|---|---|
| 2019-04-02 15:17:34.902935 | 5.5 |


In [64]:
new_df[new_df.index.month == 4]

| | 0 |
|---|---|
| 2019-04-01 15:17:34.902935 | 5.0 |
| 2019-04-02 15:17:34.902935 | 5.5 |
| 2019-04-03 15:17:34.902935 | 6.0 |
| 2019-04-04 15:17:34.902935 | 6.5 |
| 2019-04-05 15:17:34.902935 | 7.0 |
| 2019-04-06 15:17:34.902935 | 7.5 |
| 2019-04-07 15:17:34.902935 | 8.0 |
| 2019-04-08 15:17:34.902935 | 8.5 |
| 2019-04-09 15:17:34.902935 | 9.0 |
| 2019-04-10 15:17:34.902935 | 9.5 |


In [65]:
another_df = pd.DataFrame({'a':np.arange(5, 10, 0.25), 'b':np.arange(10, 15, 0.25)})

In [66]:
another_df.head()

| | a | b |
|---|---|---|
| 0 | 5.0 | 10.0 |
| 1 | 5.25 | 10.25 |
| 2 | 5.5 | 10.5 |
| 3 | 5.75 | 10.75 |
| 4 | 6.0 | 11.0 |


In [67]:
another_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 2 columns):
a    20 non-null float64
b    20 non-null float64
dtypes: float64(2)
memory usage: 400.0 bytes


In [68]:
another_df.describe()

| | a | b |
|---|---|---|
| count | 20.0 | 20.0 |
| mean | 7.375 | 12.375 |
| std | 1.47902 | 1.47902 |
| min | 5.0 | 10.0 |
| 25% | 6.1875 | 11.1875 |
| 50% | 7.375 | 12.375 |
| 75% | 8.5625 | 13.5625 |
| max | 9.75 | 14.75 |


In [69]:
another_df[4:14]

| | a | b |
|---|---|---|
| 4 | 6.0 | 11.0 |
| 5 | 6.25 | 11.25 |
| 6 | 6.5 | 11.5 |
| 7 | 6.75 | 11.75 |
| 8 | 7.0 | 12.0 |
| 9 | 7.25 | 12.25 |
| 10 | 7.5 | 12.5 |
| 11 | 7.75 | 12.75 |
| 12 | 8.0 | 13.0 |
| 13 | 8.25 | 13.25 |


In [70]:
another_df.iloc[4:14]

| | a | b |
|---|---|---|
| 4 | 6.0 | 11.0 |
| 5 | 6.25 | 11.25 |
| 6 | 6.5 | 11.5 |
| 7 | 6.75 | 11.75 |
| 8 | 7.0 | 12.0 |
| 9 | 7.25 | 12.25 |
| 10 | 7.5 | 12.5 |
| 11 | 7.75 | 12.75 |
| 12 | 8.0 | 13.0 |
| 13 | 8.25 | 13.25 |


In [71]:
another_df.loc[:,'b']

0     10.00
1     10.25
2     10.50
3     10.75
4     11.00
5     11.25
6     11.50
7     11.75
8     12.00
9     12.25
10    12.50
11    12.75
12    13.00
13    13.25
14    13.50
15    13.75
16    14.00
17    14.25
18    14.50
19    14.75
Name: b, dtype: float64

In [72]:
short_df = another_df[0:13]

In [73]:
# Build the row labels 'row0' through 'row12'
rows_list = [f'row{i}' for i in range(13)]
rows_list

['row0',
 'row1',
 'row2',
 'row3',
 'row4',
 'row5',
 'row6',
 'row7',
 'row8',
 'row9',
 'row10',
 'row11',
 'row12']

In [74]:
short_df.index = rows_list
short_df.columns = ['column1', 'column2']

In [75]:
short_df.loc[['row4', 'row7', 'row1'],'column2']

row4    11.00
row7    11.75
row1    10.25
Name: column2, dtype: float64

In [76]:
short_df['column1']

row0     5.00
row1     5.25
row2     5.50
row3     5.75
row4     6.00
row5     6.25
row6     6.50
row7     6.75
row8     7.00
row9     7.25
row10    7.50
row11    7.75
row12    8.00
Name: column1, dtype: float64

In [77]:
short_df[['column1', 'column2']]

| | column1 | column2 |
|---|---|---|
| row0 | 5.0 | 10.0 |
| row1 | 5.25 | 10.25 |
| row2 | 5.5 | 10.5 |
| row3 | 5.75 | 10.75 |
| row4 | 6.0 | 11.0 |
| row5 | 6.25 | 11.25 |
| row6 | 6.5 | 11.5 |
| row7 | 6.75 | 11.75 |
| row8 | 7.0 | 12.0 |
| row9 | 7.25 | 12.25 |


In [78]:
short_df[3:5]

| | column1 | column2 |
|---|---|---|
| row3 | 5.75 | 10.75 |
| row4 | 6.0 | 11.0 |


In [79]:
short_df[4:10]['column1']

row4    6.00
row5    6.25
row6    6.50
row7    6.75
row8    7.00
row9    7.25
Name: column1, dtype: float64

In [80]:
short_df['column1']['row11']

7.75

In [81]:
masked_df = short_df[short_df['column1'].between(7,9)]

In [82]:
masked_df

| | column1 | column2 |
|---|---|---|
| row8 | 7.0 | 12.0 |
| row9 | 7.25 | 12.25 |
| row10 | 7.5 | 12.5 |
| row11 | 7.75 | 12.75 |
| row12 | 8.0 | 13.0 |
