# Basic Pandas Selection and Filtering

- Filtering picks rows, based on criteria applied to the values in the DataFrame.
- Selection picks columns by name, not by value.

## 1. Setup

In [None]:
import pandas as pd

# Load our dataset:
url = 'https://bit.ly/eds217-studentdata'

df = pd.read_csv(url)

In [None]:
df.head()

## 2. Basic Selection

In [None]:
# Select a single column from a dataframe and assign it to a Series:
majors = df['major']
type(majors)

# Select multiple columns from a dataframe and assign them to a new dataframe:
# Pass a list of column names inside the selector/filter brackets:
id_majors = df[['student_id','major']]
type(id_majors)

# To get a dataframe with just one column, pass a one-element list: df[['major']]
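The single-bracket versus double-bracket behavior can be checked on a small table. This is a minimal sketch: `sample` and its rows are made up for illustration and stand in for the student dataset; only the column names come from the cells above.

```python
import pandas as pd

# Hypothetical stand-in data, not the real student dataset
sample = pd.DataFrame({
    'student_id': [1, 2, 3],
    'major': ['Physics', 'Mathematics', 'Chemistry'],
})

as_series = sample['major']     # single label -> Series
as_frame = sample[['major']]    # one-element list -> single-column DataFrame

print(type(as_series).__name__)
print(type(as_frame).__name__)
print(as_frame.shape)
```

The one-column DataFrame keeps its column name and two-dimensional shape, which matters when later steps expect a DataFrame rather than a Series.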

## 3. Filtering Based on Column Values

### 3a. Single Condition Filtering


In [None]:
# Filtering on the value of a single condition (usually a single column's values)
# Select only rows with gpa > 3.7
high_achievers = df[ df['gpa'] > 3.7 ]
type(high_achievers)

# returns a dataframe
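It can help to look at the condition on its own before using it as a filter. The sketch below assumes a small made-up table (`sample` is hypothetical); the comparison produces a boolean Series with one value per row, and the brackets keep the rows where that value is `True`.

```python
import pandas as pd

# Hypothetical stand-in data, not the real student dataset
sample = pd.DataFrame({
    'student_id': [1, 2, 3, 4],
    'gpa': [3.9, 3.2, 3.7, 3.8],
})

mask = sample['gpa'] > 3.7      # boolean Series, one value per row
print(mask.tolist())

high = sample[mask]             # keeps only the rows where mask is True
print(high['student_id'].tolist())
```

Note that the row with a GPA of exactly 3.7 is dropped, because `>` is a strict comparison.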

### 3b. Multiple Conditions with Logical Operators

In [None]:
# Filtering on the values of multiple conditions (usually multiple column values, but not always)
# Find students less than 20 years old, majoring in Mathematics
young_math = df[ (df['age'] < 20) & (df['major'] == "Mathematics") ]
print(young_math)

# returns a dataframe

In [None]:
# Find students who are either 22 years old or have a GPA of 3.5 exactly:
specific_students = df[ (df['age'] == 22) | (df['gpa'] == 3.5)]
print(specific_students)
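The three element-wise logical operators are `&` (and), `|` (or), and `~` (not). The sketch below uses a made-up table (`sample` is hypothetical) and names each condition first, which avoids the need for parentheses around every comparison.

```python
import pandas as pd

# Hypothetical stand-in data, not the real student dataset
sample = pd.DataFrame({
    'age': [19, 22, 19, 25],
    'major': ['Mathematics', 'Physics', 'Physics', 'Mathematics'],
})

is_young = sample['age'] < 20
is_math = sample['major'] == 'Mathematics'

both = sample[is_young & is_math]      # AND: both conditions hold
either = sample[is_young | is_math]    # OR: at least one condition holds
not_math = sample[~is_math]            # NOT: condition is inverted

print(both.index.tolist())
print(either.index.tolist())
print(not_math.index.tolist())
```

When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `<` and `==` in Python.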

### 3c. Using the filter command

Use the `.filter()` method to select columns or rows by their labels (column names or index labels). Unlike boolean filtering, it never looks at the values in the table.

Use the `like` argument to match labels that contain a substring (especially useful for large DataFrames with many columns).

In [None]:
# Filter all the column names that contain the substring 'id':
id_columns = df.filter(like='id')

# Filter all the rows where the index contains a `5`:
rows_with_5 = df.filter(like='5',axis=0)
print(rows_with_5)

# axis=0 means rows, axis=1 means columns
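A small sketch of label matching with `like`, assuming a made-up table with a custom index (`sample` and its index values are hypothetical). The match is against the labels only, so the cell values play no role.

```python
import pandas as pd

# Hypothetical stand-in data with a non-default index
sample = pd.DataFrame(
    {
        'student_id': [101, 102, 103],
        'major': ['Physics', 'Chemistry', 'Mathematics'],
        'gpa': [3.9, 3.2, 3.7],
    },
    index=[5, 15, 20],
)

cols = sample.filter(like='id')            # column names containing 'id'
rows = sample.filter(like='5', axis=0)     # index labels containing '5'

print(cols.columns.tolist())
print(rows.index.tolist())
```

For a DataFrame, `.filter()` works on columns unless `axis=0` is given.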

The `.filter()` method can also take a `regex` argument, which matches labels against a regular expression.

In [None]:
# Filter column names using a `regex` instead of `like`
# Find all the columns that end in the letter `e`:
e_ending_cols = df.filter(regex='e$')
print(e_ending_cols.head())
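Two regex patterns on a made-up table, as a minimal sketch (`sample` is hypothetical; the column names follow the ones used in this notebook). `$` anchors the end of a label and `^` anchors the start.

```python
import pandas as pd

# Hypothetical stand-in data, not the real student dataset
sample = pd.DataFrame({
    'student_id': [1, 2],
    'major': ['Physics', 'Chemistry'],
    'age': [19, 22],
    'gpa': [3.9, 3.2],
})

ends_in_e = sample.filter(regex='e$')              # labels ending in 'e'
age_or_gpa = sample.filter(regex='^(age|gpa)$')    # labels that are exactly 'age' or 'gpa'

print(ends_in_e.columns.tolist())
print(age_or_gpa.columns.tolist())
```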

To learn more about regular expressions, see RegexLearn: https://regexlearn.com/learn/regex101

## 4. Combining Selection and Filtering

Chain a column selection onto the result of a filter, then assign the final result to a new variable.


In [None]:
# Get the majors (as a Series) for students under 21 years old:
young_majors = df[ df['age'] < 21 ]['major']
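The same result can be written as a single `.loc` call that takes the row condition and the column label together. This is a sketch on made-up data (`sample` is hypothetical), shown as an alternative spelling rather than a replacement for the chained form above.

```python
import pandas as pd

# Hypothetical stand-in data, not the real student dataset
sample = pd.DataFrame({
    'age': [19, 22, 20],
    'major': ['Physics', 'Chemistry', 'Mathematics'],
})

chained = sample[sample['age'] < 21]['major']        # filter, then select
via_loc = sample.loc[sample['age'] < 21, 'major']    # rows and column in one step

print(chained.tolist())
print(chained.equals(via_loc))
```

The `.loc` form is generally the safer choice when the goal is to assign new values to the selected cells, since assigning through two sets of brackets may act on a temporary copy.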

## 5. Using .isin() for Multiple Values

`.isin()` is useful for keeping rows whose value in a column matches any item in a list, for example a subset of majors.

It is most useful for filtering categorical data.

In [None]:
stem_majors = df[ df['major'].isin(['Engineering','Chemistry','Physics']) ]
print(stem_majors.head(10))
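`.isin()` is shorthand for a chain of `==` comparisons joined with `|`, and `~` inverts it. The sketch below checks that on a made-up table (`sample` and `wanted` are hypothetical names).

```python
import pandas as pd

# Hypothetical stand-in data, not the real student dataset
sample = pd.DataFrame({
    'major': ['Engineering', 'History', 'Physics', 'Chemistry', 'Art'],
})
wanted = ['Engineering', 'Chemistry', 'Physics']

stem = sample[sample['major'].isin(wanted)]

# Equivalent long form using == and |
m = sample['major']
long_form = sample[(m == 'Engineering') | (m == 'Chemistry') | (m == 'Physics')]

# Everything that is NOT in the list
non_stem = sample[~sample['major'].isin(wanted)]

print(stem.equals(long_form))
print(non_stem['major'].tolist())
```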

## 6. Filtering with String Methods

Pandas provides string methods that can be used to filter text data.

In [None]:
# dir(df['major']) lists 'str' among its attributes; string methods are reached through df['major'].str

In [None]:
# Filter majors that contain the word 'Science':
science_majors = df[ df['major'].str.contains('Science') ]
print(science_majors)

# could be really helpful for selecting certain countries/locations
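Two optional arguments to `.str.contains()` are often useful: `case=False` ignores capitalization, and `na=False` treats missing values as non-matches so the result can be used directly as a filter. The sketch below assumes a made-up column that includes a missing value (`sample` is hypothetical).

```python
import pandas as pd

# Hypothetical stand-in data with one missing major
sample = pd.DataFrame({
    'major': ['Computer Science', 'environmental science', None, 'Physics'],
})

# Case-sensitive match; missing values count as "no match"
exact = sample[sample['major'].str.contains('Science', na=False)]

# Case-insensitive match
any_case = sample[sample['major'].str.contains('science', case=False, na=False)]

print(exact['major'].tolist())
print(any_case['major'].tolist())
```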

## 7. Advanced Selection: .loc vs .iloc

In [None]:
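# Plain brackets accept a column label, a list of column labels, or a boolean mask with one value per row:
# df['String'], df[['list', 'of', 'columns']], df[[True, False, True, ...]]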
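As a minimal sketch of the difference named in this section's title: `.loc` selects by label and `.iloc` selects by integer position. The table below is made up (`sample` and its index values are hypothetical) and uses a non-default index so that the two do not coincide.

```python
import pandas as pd

# Hypothetical stand-in data with index labels 10, 20, 30
sample = pd.DataFrame(
    {
        'major': ['Physics', 'Chemistry', 'Mathematics'],
        'gpa': [3.9, 3.2, 3.7],
    },
    index=[10, 20, 30],
)

by_label = sample.loc[20, 'major']     # row labeled 20, column labeled 'major'
by_position = sample.iloc[1, 0]        # second row, first column

label_slice = sample.loc[10:20]        # label slices include both endpoints
position_slice = sample.iloc[0:1]      # position slices exclude the end

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Both single-cell lookups return the same value here, while the two slices differ in length because label slicing is inclusive and positional slicing is not.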

## Conclusion