# Chapter 2 - Data Preparation Basics
## Segment 1 - Filtering and selecting data

In [1]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

### Selecting and retrieving data
You can refer to an element by its index in two ways:
- a label index (the name given to a row or column), or
- an integer index (the element's position, counting from 0)
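
As a minimal sketch of the two forms, the example below looks up the same element once by label and once by position. The `temps` Series and its values are hypothetical and not part of the notebook; `.loc` and `.iloc` are the explicit pandas accessors for label and position lookups.

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook
temps = pd.Series([21.5, 23.0, 19.8], index=["mon", "tue", "wed"])

by_label = temps.loc["tue"]    # label index: look up by the index label
by_position = temps.iloc[1]    # integer index: look up by 0-based position

print(by_label, by_position)   # both refer to the same element
```

Both lookups return the same element because "tue" is the label stored at position 1.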

##### Create a Series object with index labels

In [5]:

# np.array(5) holds a single value, so it is repeated for every index label
series_obj = Series(np.array(5), index=["row 1", "r2", "r3", "r4", "r5"])
series_obj


row 1    5
r2       5
r3       5
r4       5
r5       5
dtype: int32

In [7]:
series_obj2 = Series(np.arange(5), index = ["r1", "r2", "r3", "r4", "r5"])
series_obj2

r1    0
r2    1
r3    2
r4    3
r5    4
dtype: int32

In [6]:
series_obj['r2']

5

In [8]:
# integer index (a position): pass a list of positions to select two rows
# (series_obj2.iloc[[0, 4]] is the explicit positional form)
series_obj2[[0, 4]]


r1    0
r5    4
dtype: int32

In [14]:
series_obj2[1]
# to select several elements, pass a list of positions: series_obj2[[1, 2]]
# series_obj2[1, 2] (without the inner brackets) is not valid for a Series

1

In [18]:
np.random.seed(25)
# create a DataFrame with row and column labels, filled with random numbers
DF_obj = DataFrame(np.random.rand(36).reshape(6,6), 
                   index = ["r1", "r2","r3","r4","r5","r6"], 
                   columns = ["c1","c2","c3", "c4", "c5", "c6"])
DF_obj

|    | c1 | c2 | c3 | c4 | c5 | c6 |
|----|----|----|----|----|----|----|
| r1 | 0.870124 | 0.582277 | 0.278839 | 0.185911 | 0.4111 | 0.117376 |
| r2 | 0.684969 | 0.437611 | 0.556229 | 0.36708 | 0.402366 | 0.113041 |
| r3 | 0.447031 | 0.585445 | 0.161985 | 0.520719 | 0.326051 | 0.699186 |
| r4 | 0.366395 | 0.836375 | 0.481343 | 0.516502 | 0.383048 | 0.997541 |
| r5 | 0.514244 | 0.559053 | 0.03445 | 0.71993 | 0.421004 | 0.436935 |
| r6 | 0.281701 | 0.900274 | 0.669612 | 0.456069 | 0.289804 | 0.525819 |


In [20]:
# retrieve the values in rows r1, r3 and r5 and columns c2 and c4 using .loc
# pattern: obj.loc[[row labels], [column labels]]
DF_obj.loc[["r1", "r3", "r5"], ["c2", "c4"]]
# the next cell makes the same selection by integer position with .iloc

|    | c2 | c4 |
|----|----|----|
| r1 | 0.582277 | 0.185911 |
| r3 | 0.585445 | 0.520719 |
| r5 | 0.559053 | 0.71993 |


In [26]:
DF_obj.iloc[[0,2,4], [1,3]]

|    | c2 | c4 |
|----|----|----|
| r1 | 0.582277 | 0.185911 |
| r3 | 0.585445 | 0.520719 |
| r5 | 0.559053 | 0.71993 |
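
The two cells above return the same records because each label corresponds to a fixed position. The sketch below checks that correspondence on a small hypothetical frame (`df` and its values are illustrative, not the notebook's `DF_obj`); `Index.get_loc` is used here only to show how a label maps to a position.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame with predictable values 0..8
df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=["r1", "r2", "r3"],
                  columns=["c1", "c2", "c3"])

by_label = df.loc[["r1", "r3"], ["c2"]]    # select by row/column labels
by_position = df.iloc[[0, 2], [1]]         # select by 0-based positions

same = by_label.equals(by_position)        # True when both selections match
r3_position = df.index.get_loc("r3")       # position of the label "r3"

print(same, r3_position)
```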


### Data slicing
Slicing selects and returns a range of consecutive values from a data set. It uses index values, so you write a slice inside the same square brackets you use for indexing.

How slicing differs, however, is that you pass in two index values separated by a colon. The index value on the left side of the colon is the first value you want to select. On the right side of the colon, you write the index value where the selection ends. With labels (for example `"r1":"r4"` in `.loc`), the end label is included in the result. With integer positions (for example `1:4`), the end position is excluded, so `1:4` returns positions 1, 2 and 3. When you execute the code, the indexer finds the first record and the end record and returns every record in between them.
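
One detail worth checking in practice is how the end of a slice is treated. The sketch below rebuilds a Series like `series_obj2` (the name `s` is only illustrative) and compares a label slice with a position slice.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=["r1", "r2", "r3", "r4", "r5"])

label_slice = s.loc["r2":"r4"]    # label slice: the end label r4 is included
position_slice = s.iloc[1:4]      # position slice: position 4 is excluded
shorter_slice = s.iloc[1:3]       # positions 1 and 2 only

print(label_slice.tolist(), position_slice.tolist(), shorter_slice.tolist())
```

Here `"r2":"r4"` and `1:4` select the same three records, because r4 sits at position 3.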

In [33]:
# inside the brackets you can pass either a list or a range such as 1:4
# the label-based equivalent here is series_obj2["r2":"r4"]
series_obj2[1:4]

r2    1
r3    2
r4    3
dtype: int32

In [30]:
# retrieve the values in rows r1 through r4 and columns c2 through c3
# (with labels, both ends of the slice are included)
DF_obj.loc["r1":"r4", "c2":"c3"]

|    | c2 | c3 |
|----|----|----|
| r1 | 0.582277 | 0.278839 |
| r2 | 0.437611 | 0.556229 |
| r3 | 0.585445 | 0.161985 |
| r4 | 0.836375 | 0.481343 |


### Comparing with scalars
Now we're going to talk about comparison operators and scalar values. In case you don't know what a scalar value is, it's just a single numerical value. You can use comparison operators such as greater than (`>`) or less than (`<`) to return a True/False value for every record, indicating how each element compares to the scalar.

In [34]:
# return a DataFrame of boolean values that indicate
# which values in DF_obj are greater than 0.5
DF_obj > 0.5

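|    | c1 | c2 | c3 | c4 | c5 | c6 |
|----|----|----|----|----|----|----|
| r1 | True | True | False | False | False | False |
| r2 | True | False | True | False | False | False |
| r3 | False | True | False | True | False | True |
| r4 | False | True | False | True | False | True |
| r5 | True | True | False | True | False | False |
| r6 | False | True | True | False | False | True |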
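
A boolean result like this can also be summarized, since True counts as 1 and False as 0. The following is a small sketch on a hypothetical Series (`s` and its values are illustrative only).

```python
import pandas as pd

# Hypothetical values chosen so the result is easy to verify by eye
s = pd.Series([0.2, 0.7, 0.5, 0.9], index=["r1", "r2", "r3", "r4"])

mask = s > 0.5                      # element-wise comparison with a scalar
count_above = int(mask.sum())       # number of elements greater than 0.5
share_above = float(mask.mean())    # fraction of elements greater than 0.5

print(mask.tolist(), count_above, share_above)
```

Note that 0.5 itself is not greater than 0.5, so r3 is False.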


### Filtering with scalars

In [35]:
# retrieve only the values from the Series that are greater than a given value
# pattern: series_obj[series_obj > value]
series_obj2[series_obj2 > 3]

r5    4
dtype: int32

In [36]:
DF_obj[DF_obj>0.5]

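|    | c1 | c2 | c3 | c4 | c5 | c6 |
|----|----|----|----|----|----|----|
| r1 | 0.870124 | 0.582277 | NaN | NaN | NaN | NaN |
| r2 | 0.684969 | NaN | 0.556229 | NaN | NaN | NaN |
| r3 | NaN | 0.585445 | NaN | 0.520719 | NaN | 0.699186 |
| r4 | NaN | 0.836375 | NaN | 0.516502 | NaN | 0.997541 |
| r5 | 0.514244 | 0.559053 | NaN | 0.71993 | NaN | NaN |
| r6 | NaN | 0.900274 | 0.669612 | NaN | NaN | 0.525819 |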
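
Filtering a whole DataFrame this way keeps its shape and fills non-matching cells with NaN. A common variation, sketched below on a hypothetical frame (`df` and its values are not from the notebook), is to filter whole rows by a condition on one column, or on several columns combined with `&`.

```python
import pandas as pd

# Hypothetical frame with values that are easy to check by eye
df = pd.DataFrame({"c1": [0.9, 0.3, 0.6],
                   "c2": [0.1, 0.8, 0.7]},
                  index=["r1", "r2", "r3"])

# keep the rows where c1 is greater than 0.5
rows_c1 = df[df["c1"] > 0.5]

# keep the rows where both c1 and c2 are greater than 0.5;
# each condition needs its own parentheses when combined with &
rows_both = df[(df["c1"] > 0.5) & (df["c2"] > 0.5)]

print(rows_c1.index.tolist(), rows_both.index.tolist())
```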


### Setting values with scalars

In [37]:
# select values by label and assign a new value to them
# pattern: series_obj[["r1", ....]] = 8

In [39]:
series_obj2[["r1", "r2"]] = 8
series_obj2

r1    8
r2    8
r3    2
r4    3
r5    4
dtype: int32
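
Setting values also works with a boolean condition instead of a list of labels. The sketch below rebuilds a Series like `series_obj2` under the illustrative name `s` and uses `.loc` for both kinds of assignment; the replacement value 0 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=["r1", "r2", "r3", "r4", "r5"])

s.loc[["r1", "r2"]] = 8    # set by label, as in the cell above
s.loc[s < 4] = 0           # set every element that is less than 4

print(s.tolist())
```

After the first assignment only r3 and r4 are still below 4, so only those two are set to 0.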

Filtering and selecting with pandas are among the most fundamental things you'll do in data analysis. Make sure you know how to use indexing to select and retrieve records.