# Intermediate Data Selection With Pandas

Today we create a new DataFrame filled with dummy values, because the earlier one was not suited to the current analysis.

That is why this is a new notebook.

# Load Libs & Modules

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Creating A New DataFrame

In [None]:
# create a dummy dataframe with some random values
df = pd.DataFrame(
    np.random.rand(10,5),
    index = list("abcdefghif"),  # note: "f" appears twice, so the row labels are not unique
    columns = list("ABCDE"))

Understand the code:
- `np.random.rand(10,5)` - creates an array of 10 rows and 5 columns filled with random values between 0 and 1.

- `index` - the row labels; the number of labels must equal the number of rows.

- `columns` - the column names, which form the column index; the number of names must equal the number of columns.

- `list` - turns a string into a list of its characters, so `list("ABCDE")` gives one label per letter.

- `pd.DataFrame()` - constructor that creates a new DataFrame
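The points above can be checked with a small sketch. The seed, the smaller shape, and the name `demo` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# list() splits a string into one label per character
row_labels = list("abcde")
col_labels = list("ABC")

# seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((5, 3)), index=row_labels, columns=col_labels)

print(row_labels)   # ['a', 'b', 'c', 'd', 'e']
print(demo.shape)   # (5, 3)
```

Five labels for five rows and three names for three columns, so the constructor accepts them.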

In [None]:
# typing out every label is cumbersome, so a better approach:
# use range() for the row index (this produces integer labels)
another_df = pd.DataFrame(
    np.random.rand(100,5),
    index = range(100), # length of the range must equal the number of rows
    columns = list("ABCDE"))

another_df.head()

In [None]:
# benefit - you can choose the start and end of the range

another_df = pd.DataFrame(
    np.random.rand(100,5),
    index = range(10, 110), # end - start must equal the number of rows
    columns = list("ABCDE"))

another_df

How to Fix a Length Mismatch

If the length of the index does not match the number of rows in the data, pandas raises a `ValueError`. There are two ways to fix it:

1. Match the data to the index: change the number of rows in your data so that it equals the length of the row index.

OR

2. Match the index to the data: change the start and end values of the range so that their difference equals the number of rows in your data.
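A minimal sketch of the mismatch and of the second fix. The names `data`, `start`, and `fixed_df` are assumptions for illustration; deriving the range from `len(data)` is one possible way to keep the two lengths in sync.

```python
import numpy as np
import pandas as pd

data = np.random.rand(100, 5)

# a 10-label index cannot describe 100 rows, so pandas raises ValueError
try:
    pd.DataFrame(data, index=range(10), columns=list("ABCDE"))
    mismatch_error = None
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)

# option 2: derive the end of the range from the data itself
start = 10
fixed_df = pd.DataFrame(
    data,
    index=range(start, start + len(data)),
    columns=list("ABCDE"),
)
print(fixed_df.shape)
```

Computing the end as `start + len(data)` means the index cannot drift out of step if the number of rows changes later.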

# Revision on Exploration of DataFrame Using Pandas

In [None]:
# check shape
df.shape

In [None]:
# get column names
df.columns

In [None]:
# get information on dataframe cols
df.info()

OBSERVATION:
- 5 columns, each with 10 values (rows).
- All are type float64
- No null values are present

In [None]:
# examine the first few rows
df.head()

- row index - lowercase letters
- column index - uppercase letters

# Revision on Data Selection Using Pandas

In [None]:
# get the row index
df.index

In [None]:
# access the values of all columns in the row labelled 'd'
df.loc['d'] # pass in the row label

In `df.loc[row_index]`, make sure `row_index` is a label that exists in the index; otherwise pandas raises a `KeyError`.
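One way to guard a label lookup is to test membership first. This is only a sketch: `demo` and the label `"z"` are hypothetical and chosen so that the label is missing.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.random.rand(3, 2), index=list("abc"), columns=list("AB"))

label = "z"  # hypothetical label that is not in the index

# check membership before looking the row up
if label in demo.index:
    row = demo.loc[label]
else:
    row = None
    print(f"{label!r} is not a row label")

# without the check, .loc raises KeyError for a missing label
try:
    demo.loc[label]
    raised = False
except KeyError:
    raised = True
```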

In [None]:
# access the same row by position using iloc
df.iloc[3] # positions start at 0, so 3 is the 4th row

In [None]:
# return the values of the 3rd to 5th columns (C to E) of the 3rd row ('c')
# use loc - label slices include the end label
df.loc['c', 'C':'E']

---

In [None]:
# return the values of the 3rd to 5th columns (C to E) of the 3rd row ('c')
# use iloc - position slices exclude the end position
df.iloc[2, 2:5]

---

In [None]:
# return all columns from the 3rd row to the last row
# use loc
df.loc['c':, :]

In [None]:
# return all columns from the 3rd row to the last row
# use iloc
df.iloc[2:, :] # row indexing starts at 0

---

In [None]:
# return 3rd element of 4th column
# use loc
df.loc["c", "D"]

In [None]:
# return 3rd element of 4th column
# use iloc
df.iloc[2, 3]

* row index - [a:0, b:1, c:2]
* column index - [A:0, B:1, C:2, D:3]

In `iloc`, both the row and the column positions start at 0.
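The label-to-position mapping above can be sketched as follows. The frame `demo` uses unique labels and integer values, both assumptions made so the result is easy to check; `get_loc` returns a single integer only when the label is unique.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(20).reshape(4, 5),
    index=list("abcd"),
    columns=list("ABCDE"),
)

# translate labels to integer positions (assumes unique labels)
row_pos = demo.index.get_loc("c")     # 2
col_pos = demo.columns.get_loc("D")   # 3

# both lookups hit the same cell
same_cell = demo.loc["c", "D"] == demo.iloc[row_pos, col_pos]

# label slices include the end label; position slices exclude the end position
by_label = demo.loc["c", "C":"E"]
by_position = demo.iloc[2, 2:5]
print(by_label.equals(by_position))
```

Note that `'C':'E'` and `2:5` select the same three columns, because the `loc` slice includes `E` while the `iloc` slice stops before position 5.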

# Intermediate Analysis - Complex Conditions

## Use Lambda Functions

- pass a function to the `loc` selector
- the function is anonymous, i.e. a lambda
- keeps the code compact, at some cost to readability

Lambda function syntax:

`lambda variable: expression`

Example in this case:
- `lambda` : keyword
- `df` : variable
- `df['C'] > 0.3` : conditional expression.

The expression can be of any type:
- conditional
- arithmetic
- boolean
- simple
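The four kinds of expression can be illustrated with plain Python values. The function names here are hypothetical, and "conditional" is read as a comparison, in line with the `df['C'] > 0.3` example.

```python
# arithmetic expression
square = lambda x: x * x

# conditional (comparison) expression, returns True or False
is_large = lambda x: x > 0.3

# boolean expression combining two conditions
in_range = lambda x: (x > 0.3) and (x < 0.6)

# simple expression that just returns its argument
identity = lambda x: x

print(square(4), is_large(0.5), in_range(0.7), identity("a"))
# prints: 16 True False a
```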

In [36]:
# create a function that tests whether column C is greater than 0.3
# it acts as a selector later on
selector1 = lambda df: df['C']>0.3
selector1

<function __main__.<lambda>(df)>

In [37]:
# use the selector to keep only the rows that meet the condition (C > 0.3)
# use loc
df.loc[selector1]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| a | 0.310262 | 0.230804 | 0.722192 | 0.793374 | 0.631912 |
| b | 0.463573 | 0.038228 | 0.312213 | 0.157013 | 0.334907 |
| d | 0.8701 | 0.700852 | 0.757054 | 0.86837 | 0.035922 |
| e | 0.897963 | 0.786304 | 0.955026 | 0.36401 | 0.171953 |
| f | 0.874728 | 0.112129 | 0.892224 | 0.325969 | 0.393496 |
| g | 0.030989 | 0.572395 | 0.621266 | 0.546421 | 0.844981 |
| h | 0.511749 | 0.472104 | 0.618091 | 0.691562 | 0.424876 |


In [38]:
# chaining conditions on multiple columns
selector2 = lambda df: (df['B'] > 0.3) & (df['C'] > 0.3) & (df['E'] < 0.6)
selector2

<function __main__.<lambda>(df)>

In [39]:
# use the selector to filter df
# only return rows where col B and col C > 0.3 and col E < 0.6
# use loc
df.loc[selector2]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| d | 0.8701 | 0.700852 | 0.757054 | 0.86837 | 0.035922 |
| e | 0.897963 | 0.786304 | 0.955026 | 0.36401 | 0.171953 |
| h | 0.511749 | 0.472104 | 0.618091 | 0.691562 | 0.424876 |


In [40]:
# the same filter written more compactly with the all() method
# only return rows where col B and col C > 0.3 and col E < 0.6
selector2 = lambda df: (df[['B', 'C']] > 0.3).all(axis=1) & (df['E'] < 0.6)

df.loc[selector2]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| d | 0.8701 | 0.700852 | 0.757054 | 0.86837 | 0.035922 |
| e | 0.897963 | 0.786304 | 0.955026 | 0.36401 | 0.171953 |
| h | 0.511749 | 0.472104 | 0.618091 | 0.691562 | 0.424876 |
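To see why the `all(axis=1)` form matches the chained `&` form, here is a small sketch on a separate frame. The seed and the name `demo` are assumptions, so the values differ from the notebook output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed
demo = pd.DataFrame(rng.random((10, 5)), columns=list("ABCDE"))

# element-wise comparison on two columns gives a boolean DataFrame
both = demo[["B", "C"]] > 0.3

# all(axis=1) collapses each row to True only if every column is True
row_ok = both.all(axis=1)

# the chained form builds the same mask column by column
chained = (demo["B"] > 0.3) & (demo["C"] > 0.3)
print(row_ok.equals(chained))
```

The `all` form scales better when the same threshold applies to many columns, since only the column list changes.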


## Use Selector Masks

- A selector mask is an expression that evaluates to a Series of True or False values, one per row

- Comparison operators are often used to build them

In [41]:
# only return rows which have col B and C > 0.3 and col E < 0.6
selection_condition = (df['B']>0.3) & (df['C']>0.3) & (df['E']<0.6)
selection_condition

a    False
b    False
c    False
d     True
e     True
f    False
g    False
h     True
i    False
f    False
dtype: bool

This boolean Series is called a mask, by the way!

In [42]:
df.loc[selection_condition]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| d | 0.8701 | 0.700852 | 0.757054 | 0.86837 | 0.035922 |
| e | 0.897963 | 0.786304 | 0.955026 | 0.36401 | 0.171953 |
| h | 0.511749 | 0.472104 | 0.618091 | 0.691562 | 0.424876 |
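Because a mask is an ordinary boolean Series, it can be inspected before it is used for selection. A minimal sketch with hand-picked values follows; the frame `demo` and its labels are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame(
    {"B": [0.1, 0.7, 0.5], "C": [0.9, 0.8, 0.2], "E": [0.5, 0.1, 0.3]},
    index=list("xyz"),
)

mask = (demo["B"] > 0.3) & (demo["C"] > 0.3) & (demo["E"] < 0.6)

print(mask.dtype)             # bool
print(int(mask.sum()))        # 1, because True counts as 1
selected = demo.loc[mask]
print(list(selected.index))   # ['y']
```

Only row `y` passes all three conditions: `x` fails on B and `z` fails on C.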


## Using Operators

- `|` - OR operator
- `~` - NOT operator

In [43]:
# return rows where A > 0.5 or B is not less than 0.2 (i.e. B >= 0.2)
condition_for_selection = (df['A'] > 0.5) | ~(df['B'] < 0.2)

In [44]:
df[condition_for_selection]

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| a | 0.310262 | 0.230804 | 0.722192 | 0.793374 | 0.631912 |
| d | 0.8701 | 0.700852 | 0.757054 | 0.86837 | 0.035922 |
| e | 0.897963 | 0.786304 | 0.955026 | 0.36401 | 0.171953 |
| f | 0.874728 | 0.112129 | 0.892224 | 0.325969 | 0.393496 |
| g | 0.030989 | 0.572395 | 0.621266 | 0.546421 | 0.844981 |
| h | 0.511749 | 0.472104 | 0.618091 | 0.691562 | 0.424876 |
| i | 0.562309 | 0.667947 | 0.118795 | 0.2491 | 0.170761 |
| f | 0.209425 | 0.671576 | 0.099938 | 0.130003 | 0.455119 |
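The `~` in the condition above can be rewritten as a plain comparison. The sketch below uses a hypothetical three-row frame and assumes column B has no missing values; with `NaN` present the two forms would differ. The parentheses around each comparison matter because `|`, `&`, and `~` bind more tightly than `>` and `<` in Python.

```python
import pandas as pd

demo = pd.DataFrame({"A": [0.9, 0.1, 0.4], "B": [0.1, 0.1, 0.6]})

# the form used above: A > 0.5, or NOT (B < 0.2)
with_not = (demo["A"] > 0.5) | ~(demo["B"] < 0.2)

# the same mask without ~, valid only when B has no missing values
without_not = (demo["A"] > 0.5) | (demo["B"] >= 0.2)

print(with_not.tolist())              # [True, False, True]
print(with_not.equals(without_not))   # True
```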


# Homework

- No homework. Just practice everything in the notebook (including the revision sections). If anything changes later, it will be added to the GitHub repo.

# Resources

* Reference Notebook: [GitHub](https://github.com/devloperhs14/practical_ml) - notebook name

* Pandas Indexing & Selecting Documentation: [Explore methods](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#different-choices-for-indexing)

* Series Playlist: [YouTube](https://youtube.com/playlist?list=PLDfna1ApN44oZsHW1AAxoMkREFWOse7sV&feature=shared)

Thanks