#  What Is Data Selection

* Data selection is the process of selecting or accessing the desired columns, rows, or observations of a dataset.

* There are several ways to select data: some rely on plain indexing, others on dedicated methods.

# Load Data

In [78]:
# load data
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame = True)

df = diabetes.data

df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


In [83]:
df.iloc[2,2]    #0.044451 (2, bmi)

np.float64(0.04445121333659049)

# General Selection - List / Matrix Based

## Print columns of dataframe

In [None]:
# print columns
print(df.columns)

## Selecting a single column and observation of a dataframe

SYNTAX: `df[col_name][row_index]`

NOTE: Index starts at 0

In [None]:
# select a dataframe column using indexing
df['age']

All 442 values of the `age` column, each with its associated index.

In [None]:
# select a specific value of a dataframe column using indexing
# value of the first column (age) at index / row 1

df['age'][1]

In [None]:
# getting consecutive values for a particular column
# the 5 values of column bmi in rows 437 to 441 (the last 5 rows)
df['bmi'][437:442]

# indexing starts at 0

SYNTAX: `df[col_name][start_index:end_index]`

`end_index` is excluded
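
The slice rule can be checked on a small hand-made Series. This is only a sketch: the name `bmi` and the five values (copied from the first rows of the table shown earlier) are for illustration and stand in for the full column.

```python
import pandas as pd

# First five bmi values from the table above, on the default 0-based index
bmi = pd.Series([0.061696, -0.051474, 0.044451, -0.011595, -0.036385], name="bmi")

# Slice with [] on a Series: start is included, end is excluded
first_three = bmi[0:3]
print(first_three.index.tolist())   # [0, 1, 2]

# A negative start counts from the end
last_two = bmi[-2:]
print(last_two.index.tolist())      # [3, 4]
```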

In [None]:
# getting consecutive values for a particular column
# the last 10 values of column bmi
df['bmi'][-10:]

In [None]:
# selecting a specific range of values
# rows 150 to 249 of column 's1'
df['s1'][150:250]

## Selecting multiple columns & observation of dataframe

In [None]:
# select multiple columns
# select age, sex and s1

df[['age', 'sex', 's1']]

SYNTAX: `df[[column_names]]`

* 442 rows and 3 columns (features)
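
The difference between single and double brackets shows up in the type of the result. The sketch below uses a hypothetical three-row frame named `toy`, built from the first rows shown earlier, rather than the full dataset.

```python
import pandas as pd

# Toy frame built from the first three rows shown earlier (illustrative only)
toy = pd.DataFrame({
    "age": [0.038076, -0.001882, 0.085299],
    "sex": [0.050680, -0.044642, 0.050680],
    "s1": [-0.044223, -0.008449, -0.045599],
})

single = toy["age"]            # one label -> pandas Series
multi = toy[["age", "s1"]]     # list of labels -> pandas DataFrame

print(type(single).__name__, single.shape)   # Series (3,)
print(type(multi).__name__, multi.shape)     # DataFrame (3, 2)
```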


In [None]:
# accessing specific values / observations (both lines return the same rows)
df[['age', 'sex', 's1']][:10]

df[['age', 'sex', 's1']].head(10) #- alternative

The first 10 rows of the 3 selected columns.

In [None]:
# get a specific observation for multiple columns
# get the value at index 334 for the age and s3 columns

#df[['age', 's3']][334]

Surprisingly, a single row cannot be indexed this way: on a DataFrame, `[]` with a single value looks for a column with that label, so `[334]` raises a `KeyError`. We need a dedicated method for this.
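
A minimal sketch of this behaviour, using a hypothetical three-row frame `toy` in place of the full dataset. The `loc` call at the end is one way around the problem and is covered in the next section.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [0.038076, -0.001882, 0.085299],
    "s3": [-0.043401, 0.074412, -0.032356],
})

# [] on a DataFrame treats a scalar as a column label, so this fails
try:
    toy[["age", "s3"]][1]
    failed = False
except KeyError:
    failed = True
print(failed)   # True

# Label-based alternative: row label first, then the column labels
row = toy.loc[1, ["age", "s3"]]
print(row["age"], row["s3"])   # -0.001882 0.074412
```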

# Loc & Iloc Based Selection (imp)

Methods

* `loc` (dictionary-style indexing)
    - Stands for label-based selection.
    - Use it to select rows and columns by their labels (index labels and column names).
    - Rows are addressed by their index label, not by their position.
    - Includes both the start and the end label in a range.

* `iloc` (array-style indexing)
    - Stands for integer-based selection.
    - Use it to select rows and columns by their integer positions (like list slicing).
    - Rows are addressed by their position in the dataset.
    - Includes the start position but excludes the end position in a range.

* DIFFERENCE
    - `loc` is label-based, so use column names and row index labels.
    - `iloc` is position-based, so use integer positions for rows and columns.

* Example

    - loc:
    ```
    # Select the row where the index label is 2
    row = df.loc[2]

    # Select the 'age' and 'bmi' columns for index labels 1 and 3
    subset = df.loc[[1, 3], ['age', 'bmi']]

    ```

    - iloc:
    ```
    # Select the first 2 rows (0 and 1) and all columns
    rows = df.iloc[0:2, :]

    # Select the first 3 rows and the first 2 columns (integer position based)
    subset = df.iloc[0:3, 0:2]
    ```


In simple terms, if the index labels start at 1, then for the first row:

* iloc: position 0.
* loc: label 1 (the actual index value).
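
The case of an index that starts at 1 can be sketched as follows. The frame `toy` and its index `[1, 2, 3]` are hypothetical; the diabetes dataframe itself uses the default 0-based index.

```python
import pandas as pd

# Hypothetical frame whose index labels start at 1 instead of 0
toy = pd.DataFrame(
    {"age": [0.038076, -0.001882, 0.085299]},
    index=[1, 2, 3],
)

first_by_position = toy.iloc[0]["age"]   # position 0 -> first row
first_by_label = toy.loc[1]["age"]       # label 1    -> the same first row
print(first_by_position == first_by_label)   # True

# Label 0 does not exist here, so loc raises KeyError
try:
    toy.loc[0]
    has_label_zero = True
except KeyError:
    has_label_zero = False
print(has_label_zero)   # False
```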

## Selecting single observation based on row & index (loc, iloc)

In [34]:
df.head(11)

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
5,-0.092695,-0.044642,-0.040696,-0.019442,-0.068991,-0.079288,0.041277,-0.076395,-0.041176,-0.096346
6,-0.045472,0.05068,-0.047163,-0.015999,-0.040096,-0.0248,0.000779,-0.039493,-0.062917,-0.038357
7,0.063504,0.05068,-0.001895,0.066629,0.09062,0.108914,0.022869,0.017703,-0.035816,0.003064
8,0.041708,0.05068,0.061696,-0.040099,-0.013953,0.006202,-0.028674,-0.002592,-0.01496,0.011349
9,-0.0709,-0.044642,0.039062,-0.033213,-0.012577,-0.034508,-0.024993,-0.002592,0.067737,-0.013504
10,-0.096328,-0.044642,-0.083808,0.008101,-0.103389,-0.090561,-0.013948,-0.076395,-0.062917,-0.034215


In [37]:
# use loc to access the row with index label 10 (the 11th observation) across all columns
df.loc[10]

age   -0.096328
sex   -0.044642
bmi   -0.083808
bp     0.008101
s1    -0.103389
s2    -0.090561
s3    -0.013948
s4    -0.076395
s5    -0.062917
s6    -0.034215
Name: 10, dtype: float64

SYNTAX: `df.loc[row_index_label]`

In [42]:
# iloc - access the values of all columns for a particular row position
df.iloc[10] # row position

#df.iloc['age'] #- can't use: iloc accepts integer positions only

age   -0.096328
sex   -0.044642
bmi   -0.083808
bp     0.008101
s1    -0.103389
s2    -0.090561
s3    -0.013948
s4    -0.076395
s5    -0.062917
s6    -0.034215
Name: 10, dtype: float64

## Selecting multiple observations using loc and iloc

In [43]:
# access a specific value
# get the age value of the row with index label 2 (the 3rd row)
df.loc[2, 'age'] # exact row index label

np.float64(0.08529890629667548)

In [46]:
# same with iloc - row position 2 (the 3rd row) in the 1st column (position 0)
df.iloc[2, 0]

np.float64(0.08529890629667548)

SYNTAX: `df.loc[row_index_label, column_name]`

In [None]:
df['age'].head(10)

In [51]:
# access the same row's values across multiple columns using loc
df.loc[2, ['age', 'sex', 'bmi', 'bp'] ] # tedious: every column name has to be typed out

age    0.085299
sex    0.050680
bmi    0.044451
bp    -0.005670
Name: 2, dtype: float64

SYNTAX: `df.loc[row_index_start:row_index_end, [col_names_list]]` (a single row label also works in place of the range)

In [50]:
# same using iloc
df.iloc[2, 0:4]

age    0.085299
sex    0.050680
bmi    0.044451
bp    -0.005670
Name: 2, dtype: float64

In [52]:
# access multiple row values across multiple columns

# with loc - rows labeled 2 to 5 across the first 4 columns
df.loc[2:5, ['age', 'sex', 'bmi', 'bp']]

Unnamed: 0,age,sex,bmi,bp
2,0.085299,0.05068,0.044451,-0.00567
3,-0.089063,-0.044642,-0.011595,-0.036656
4,0.005383,-0.044642,-0.036385,0.021872
5,-0.092695,-0.044642,-0.040696,-0.019442


In [54]:
# with iloc - row positions 2 to 5 across column positions 0 to 3
df.iloc[2:6, 0:4]

Unnamed: 0,age,sex,bmi,bp
2,0.085299,0.05068,0.044451,-0.00567
3,-0.089063,-0.044642,-0.011595,-0.036656
4,0.005383,-0.044642,-0.036385,0.021872
5,-0.092695,-0.044642,-0.040696,-0.019442


In [55]:
# access specific rows for a specific set of columns - iloc and loc
# rows 2, 3, 4 of the columns at positions 1, 3, 5 (sex, bp, s2)

# using iloc
print(df.iloc[2:5, [1,3,5]]) # end index is excluded (array based)

print("\n")

# using loc
print(df.loc[2:4, ['sex', 'bp', 's2']]) # end label is included

        sex        bp        s2
2  0.050680 -0.005670 -0.034194
3 -0.044642 -0.036656  0.024991
4 -0.044642  0.021872  0.015596


        sex        bp        s2
2  0.050680 -0.005670 -0.034194
3 -0.044642 -0.036656  0.024991
4 -0.044642  0.021872  0.015596


In [57]:
df.head(5) # sanity check!

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641


* iloc - accepts only **integer** positions, and the end index is excluded
* loc - accepts **labels** (integer index labels here, strings for column names), and the end label is included.
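
The two rules can be checked against each other on a small frame. In this sketch `toy` is a hypothetical five-row excerpt of the dataset, and `Index.get_indexer` is used only as one convenient way to turn column labels into positions; it is not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [0.038076, -0.001882, 0.085299, -0.089063, 0.005383],
    "sex": [0.050680, -0.044642, 0.050680, -0.044642, -0.044642],
    "bmi": [0.061696, -0.051474, 0.044451, -0.011595, -0.036385],
})

by_label = toy.loc[2:4, ["age", "bmi"]]   # labels 2, 3, 4 (end included)

# Translate column labels to integer positions for iloc
positions = toy.columns.get_indexer(["age", "bmi"])   # positions 0 and 2
by_position = toy.iloc[2:5, positions]                # positions 2, 3, 4 (end excluded)

print(by_label.equals(by_position))   # True
```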

## Conditional Selection

* Select / access rows based on conditions
* `loc`, `iloc`, plain `[]` indexing, and comparison and arithmetic operators can all be used
* This is where `loc` and `iloc` see most of their use
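
Behind a conditional selection there is a boolean mask: a comparison on a column yields one True/False per row, and that mask picks the rows. The sketch below uses a hypothetical five-row frame `toy` and a threshold of 0 chosen only for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "bp": [0.021872, -0.026328, -0.005670, -0.036656, 0.021872],
    "s1": [-0.044223, -0.008449, -0.045599, 0.012191, 0.003935],
})

mask = toy["bp"] > 0                  # boolean Series, one True/False per row
print(mask.tolist())                  # [True, False, False, False, True]

print(toy[mask].index.tolist())       # [0, 4]
print(toy.loc[mask, "s1"].tolist())   # [-0.044223, 0.003935]

# Conditions combine with & (and) / | (or); each one needs its own parentheses
both = toy[(toy["bp"] > 0) & (toy["s1"] > 0)]
print(both.index.tolist())            # [4]
```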

## optional

In [58]:
df.describe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.3814990000000001e-17,3.9184340000000004e-17,-5.777179e-18,-9.04254e-18,9.293722000000001e-17,1.130318e-17
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260971,-0.1377672
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665608,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324559,-0.03317903
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947171,-0.001077698
75%,0.03807591,0.05068012,0.03124802,0.03564379,0.02835801,0.02984439,0.0293115,0.03430886,0.03243232,0.02791705
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.198788,0.1811791,0.1852344,0.1335973,0.1356118


## Single conditional selection

In [62]:
# select all rows where the person's bp > -5.670422e-03 (the median bp from describe())
df[df['bp'] > -5.670422e-03]

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
7,0.063504,0.050680,-0.001895,0.066629,0.090620,0.108914,0.022869,0.017703,-0.035816,0.003064
10,-0.096328,-0.044642,-0.083808,0.008101,-0.103389,-0.090561,-0.013948,-0.076395,-0.062917,-0.034215
13,0.005383,0.050680,-0.001895,0.008101,-0.004321,-0.015719,-0.002903,-0.002592,0.038394,-0.013504
...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491


In [63]:
# using loc
df.loc[df['bp']> -5.670422e-03] # passed condition instead of row index

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
7,0.063504,0.050680,-0.001895,0.066629,0.090620,0.108914,0.022869,0.017703,-0.035816,0.003064
10,-0.096328,-0.044642,-0.083808,0.008101,-0.103389,-0.090561,-0.013948,-0.076395,-0.062917,-0.034215
13,0.005383,0.050680,-0.001895,0.008101,-0.004321,-0.015719,-0.002903,-0.002592,0.038394,-0.013504
...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491


For now the result is not stored, so let's create a new temporary dataframe to hold it.

In [73]:
bp_df_filtered = df.loc[df['bp']> -5.670422e-03]
bp_df_filtered

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
7,0.063504,0.050680,-0.001895,0.066629,0.090620,0.108914,0.022869,0.017703,-0.035816,0.003064
10,-0.096328,-0.044642,-0.083808,0.008101,-0.103389,-0.090561,-0.013948,-0.076395,-0.062917,-0.034215
13,0.005383,0.050680,-0.001895,0.008101,-0.004321,-0.015719,-0.002903,-0.002592,0.038394,-0.013504
...,...,...,...,...,...,...,...,...,...,...
431,0.070769,0.050680,-0.030996,0.021872,-0.037344,-0.047034,0.033914,-0.039493,-0.014960,-0.001078
434,0.016281,-0.044642,0.001339,0.008101,0.005311,0.010899,0.030232,-0.039493,-0.045424,0.032059
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491


NOTE: The filtered dataframe keeps the row index of the original dataframe; the rows are not renumbered.

Based on this, multiple operations can be performed.

# Accessing new dataframe values using loc and iloc

In [76]:
# check the entry with index label 3 - loc
bp_df_filtered.loc[3] # no index label 3 is present in the new df

KeyError: 3

In [77]:
# check the entry at position 3 (the 4th row) - iloc
bp_df_filtered.iloc[3]

# row positions are counted in the new dataframe

age   -0.096328
sex   -0.044642
bmi   -0.083808
bp     0.008101
s1    -0.103389
s2    -0.090561
s3    -0.013948
s4    -0.076395
s5    -0.062917
s6    -0.034215
Name: 10, dtype: float64
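
The same effect can be reproduced on a small frame. In this sketch `toy` and the threshold 0 are hypothetical, and `reset_index(drop=True)` is not used in the notebook; it is shown only as one optional way to renumber the rows if matching labels and positions are wanted.

```python
import pandas as pd

toy = pd.DataFrame({"bp": [0.021872, -0.026328, -0.005670, -0.036656, 0.021872]})

filtered = toy.loc[toy["bp"] > 0]
print(filtered.index.tolist())     # [0, 4]  (original labels survive)

# Position 1 of the filtered frame is the row that carries label 4
print(filtered.iloc[1].name)       # 4

# Optional: renumber the rows so labels and positions line up again
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1]
print(renumbered.loc[1, "bp"])     # 0.021872
```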

---

# Homework


* How many columns are present in the diabetes dataset?

* What is the value of the **20th element of the s1 and age** columns?

* What are the values of the **23rd to 26th elements** (rows) of the 1st to 5th columns? Answer using both loc and iloc.

* Find all the rows / row indices where **s1 > -3.424784e-02** and return the **s1** value of the **3rd** observation. Hint: filter with loc, then index with iloc.

* Find all rows where **s3 < 2.931150e-02**, store them in `s3_filtered`, and find the 3rd row's values (use loc and iloc). Are they the same?

* OPTIONAL
    - Combine `s1 > -3.424784e-02` and `s3 < 2.931150e-02` and find all matching rows


