# What Is Data Selection

* Data selection is the process of selecting (accessing) the desired columns, rows, or observations.

* There are multiple ways to select data: some are based on simple indexing and some are based on methods.
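As a rough sketch of the two styles, the snippet below selects the same column once with plain bracket indexing and once with the `.loc` accessor. The frame `toy` and its values are made up for illustration and are not part of the diabetes data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two styles
toy = pd.DataFrame({"age": [25, 32, 47], "bmi": [21.5, 27.1, 30.2]})

# Indexing-based selection: square brackets with a column name
ages_by_indexing = toy["age"]

# Method-based selection: the .loc accessor (all rows, one column)
ages_by_method = toy.loc[:, "age"]

# Both approaches return the same Series
print(ages_by_indexing.equals(ages_by_method))
```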

# Load Data

In [2]:
# load data
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame = True)

df = diabetes.data

# df.head()

# General Selection - List / Matrix Based

## Print columns of dataframe

In [None]:
# print columns
df.columns

## Selecting a single column and observation of a dataframe

In [None]:
# select a dataframe column using indexing
df['age']

All 442 values in the `age` column, each with its associated index.

In [None]:
# select a specific value of a dataframe column using indexing
# value of the age column at index 0 (the first row)

df['age'][0]

syntax: `df[col_name][row index]`

NOTE: Index starts with 0

In [None]:
# getting consecutive values for a particular column
# first 10 values of column bmi
df['bmi'][:10]

# indexing starts at 0

In [None]:
# getting consecutive values for a particular column
# last 10 values of column bmi
df['bmi'][-10:]

In [None]:
# selecting a specific range of values
# values at positions 150 to 249 of column 's1'
df['s1'][150:250]

SYNTAX: `df[col_name][start_index:end_index]`

`end_index` is excluded
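A small sketch of these slice rules, using a made-up Series `s` (not the diabetes data) with the default integer index:

```python
import pandas as pd

# Hypothetical Series with the default integer index 0..9
s = pd.Series(range(100, 110))

first_three = s[:3]    # positions 0, 1, 2
last_three = s[-3:]    # positions 7, 8, 9
middle = s[4:6]        # positions 4 and 5; position 6 is excluded

print(middle.tolist())
```

The slice `4:6` returns two values, not three, because the end position is left out.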

## Selecting multiple columns & observations of a dataframe

In [None]:
# select multiple columns
# select age, sex and s1

df[['age', 'sex', 's1']]

SYNTAX: `df[[column_names]]`

* 442 rows and 3 columns (features)


In [None]:
# accessing specific observations (the same rows for every selected column)
df[['age', 'sex', 's1']][:10]

#df[['age', 'sex', 's1']].head(10) - alternative

first 10 values of 3 columns

In [None]:
# get a specific observation for multiple columns
# try to get the value at index 338 for the age and s3 columns

df[['age', 's3']][338]  # raises KeyError: brackets look for a column labeled 338

Surprisingly, it is not possible to index a single row directly this way, because `df[...]` with a single label selects a column, not a row. We need a method such as `loc` or `iloc` for that.
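The failure and one possible workaround can be sketched on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with the default integer index
toy = pd.DataFrame({"age": [25, 32, 47], "s3": [0.1, 0.2, 0.3]})

raised = False
try:
    toy[["age", "s3"]][1]   # brackets look for a column labeled 1
except KeyError:
    raised = True
print("KeyError raised:", raised)

# Selecting a row needs .loc (by label) or .iloc (by position)
row = toy[["age", "s3"]].loc[1]
print(row["age"], row["s3"])
```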

# Loc & Iloc Based Selection (important)

Methods

* `loc` (dictionary-style indexing)
    - Selects by row label (the value in the row index).
    - Stands for label-based selection.
    - Use it to select rows and columns by their labels (index names and column names).
    - Includes both the start and end labels in a range.

* `iloc` (array-style indexing)
    - Selects by the position of the row in the dataset.
    - Stands for integer-based selection.
    - Use it to select rows and columns by their integer positions (like index-based slicing).
    - Includes the start position but excludes the end in a range.

* DIFFERENCE
    - `loc` is label-based, so use column names and row indices.
    - `iloc` is position-based, so use integer indices for rows and columns.

* Example

    - loc:
    ```
    # Select the row where the index label is 2
    row = df.loc[2]

    # Select the 'Name' and 'City' columns for index labels 1 and 3
    subset = df.loc[[1, 3], ['Name', 'City']]

    ```

    - iloc:
    ```
    # Select the first 2 rows (0 and 1) and all columns
    rows = df.iloc[0:2, :]

    # Select the first 3 rows and the first 2 columns (integer position based)
    subset = df.iloc[0:3, 0:2]
    ```
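The snippets above refer to `Name` and `City` columns, which the diabetes dataframe does not have. A runnable version, assuming a small made-up frame called `people`, might look like this:

```python
import pandas as pd

# Hypothetical data with the default integer index 0..3
people = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dara"],
    "City": ["Pune", "Leeds", "Lima", "Cork"],
    "Age": [31, 45, 28, 39],
})

row = people.loc[2]                                 # row labeled 2
subset_loc = people.loc[[1, 3], ["Name", "City"]]   # labels 1 and 3
subset_iloc = people.iloc[0:3, 0:2]                 # first 3 rows, first 2 columns

print(row["Name"])
print(subset_loc.shape, subset_iloc.shape)
```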


In simple terms, if we assume the index labels start at 1, then the first row is:

* iloc: position 0.
* loc: label 1 (the actual index value).
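A quick check of that case, with a made-up frame whose index labels start at 1:

```python
import pandas as pd

# Hypothetical frame whose index labels start at 1 instead of 0
toy = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 2, 3])

first_by_position = toy.iloc[0]   # position 0 is the first row
first_by_label = toy.loc[1]       # label 1 is also the first row

print(first_by_position.equals(first_by_label))
```

With the default index (labels 0, 1, 2, ...) labels and positions coincide, which is why the two accessors often appear to behave the same.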

## Selecting single observation based on row & index (loc, iloc)

In [None]:
df.head(11)

In [None]:
# use loc to access the observation labeled 10 (the 11th row) across all columns
df.loc[10]

SYNTAX: `df.loc[row_label]`

In [None]:
# iloc - access all column values for the row at a particular position
df.iloc[2] # row position

# df.iloc['age'] - can't use: iloc accepts integer positions only

## Selecting multiple observations using loc and iloc

In [None]:
# access a specific value
# get the age value for the row labeled 2 (the third row)
df.loc[2, 'age'] # row label - exact

In [None]:
# same with iloc - row position 2 (the third row) in the first column
df.iloc[2, 0]

syntax : `df.loc[row_index (actual), column_name]`

In [None]:
df['age'].head(10)

In [None]:
# access same row values across multiple cols using loc
df.loc[2, ['age', 'sex', 'bmi', 'bp']]  # tedious: each column name must be listed

SYNTAX: `df.loc[row_index_start:row_index_end, [col_names_list]]`

In [None]:
# same using iloc
df.iloc[2, 0:4]

In [None]:
# access multiple row values across multiple columns

# with loc - rows labeled 2 to 5 (end included) across the first 4 columns
df.loc[2:5, ['age', 'sex', 'bmi', 'bp']]

In [None]:
# with iloc - rows at positions 2 to 4 (end excluded) across the first 4 columns
df.iloc[2:5, 0:4]

In [None]:
# access specific rows for a specific set of columns - iloc and loc
# rows 2 to 4 of the columns at positions 1, 3 and 5

# using iloc
print(df.iloc[2:5, [1,3,5]]) # end index is excluded (array based)

print("\n")

# using loc
print(df.loc[2:4, ['sex', 'bp', 's2']]) # end index is included (normal)

In [None]:
df.head(4) # sanity check!

* iloc - accepts only **integer** positions, and the end index is excluded
* loc - accepts labels, here **integers for rows and strings for columns** (col names), and the end index is included.
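The difference in end handling is easy to confirm by counting rows. This sketch uses a made-up frame `toy` with the default integer index:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..9
toy = pd.DataFrame({"a": range(10), "b": range(10, 20)})

by_position = toy.iloc[2:5]   # positions 2, 3, 4
by_label = toy.loc[2:5]       # labels 2, 3, 4, 5

print(len(by_position), len(by_label))
```

The same `2:5` slice gives three rows with `iloc` and four rows with `loc`.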

## Conditional Selection

* Select / access rows based on conditions
* loc and iloc, plain indexing, and conditional and arithmetic operators can all be used
* The major usage of loc and iloc can be seen here
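Conditional selection works through a boolean mask: a comparison on a column yields one True/False value per row, and that mask picks the rows. A minimal sketch on made-up data (the frame `toy` and the threshold 0 are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"bp": [-0.02, 0.01, 0.03, -0.01], "age": [30, 41, 52, 63]})

mask = toy["bp"] > 0          # boolean Series: one True/False per row
print(mask.tolist())

# Rows where the mask is True, restricted to one column
ages = toy.loc[mask, "age"]
print(ages.tolist())
```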

## optional

In [None]:
df.describe()

## Single conditional selection

In [None]:
# select all rows where person bp > -5.670422e-03
df[df['bp'] > -5.670422e-03]

In [None]:
# using loc
df.loc[df['bp']> -5.670422e-03] # passed condition instead of row index

For now the result is not stored, so let's create a new temporary df to hold it.

In [None]:
bp_df_filtered = df.loc[df['bp']> -5.670422e-03]
bp_df_filtered

NOTE: The filtered dataframe keeps the index labels of the original dataframe; they are not renumbered from 0.

Based on this, multiple operations can be performed.

# Accessing new dataframe values using loc and iloc

In [None]:
# check the entry labeled 8 - loc
bp_df_filtered.loc[8] # raises KeyError: no row labeled 8 is present in the new df

In [None]:
# check the entry at position 8 (the ninth row of the filtered df) - iloc
bp_df_filtered.iloc[8]

# row positions are based on the new dataframe
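The same behaviour can be reproduced on a tiny made-up frame, where it is easier to see which labels survive the filter (`toy`, `filtered`, and the threshold 0 are illustrative assumptions):

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4
toy = pd.DataFrame({"bp": [-0.02, 0.01, -0.03, 0.04, 0.05]})

filtered = toy.loc[toy["bp"] > 0]
print(filtered.index.tolist())          # original labels are kept

second_by_position = filtered.iloc[1]   # second row of the filtered frame
second_by_label = filtered.loc[3]       # the same row, by its original label

missing = False
try:
    filtered.loc[0]                     # label 0 was filtered out
except KeyError:
    missing = True
print("label 0 missing:", missing)
```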

---

# Homework


* How many columns are present in the diabetes dataset?

* What is the value of the **20th element of the s1 and age** columns?

* What are the values of the **23rd to 26th elements** (rows) of the 1st to 5th columns? Answer using both loc and iloc.

* Find all the rows / row_index where **s1 > -3.424784e-02** and return the **s1** value of the **3rd** observation. Hint (filter with loc, then index with iloc)

* Find all rows where **s3 < 2.931150e-02**, store them in `s3_filtered`, and find the 3rd row value (use loc and iloc) -> are they the same?

* OPTIONAL
    - Combine : `s1 > -3.424784e-02` , `s3 < 2.931150e-02` and find all rows


