[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%202%20Notebooks/GDAN%205400%20-%20Week%202%20Notebooks%20%28VII%29%20-%20Filtering%20Records.ipynb)

This notebook provides recipes for filtering and cleaning records in PANDAS.

In [None]:
%%time
import datetime
print ("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import pandas as pd
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'

# Download the file (stop with an error if the request fails)
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

df.head()

# Filtering Records in PANDAS
   -  [This would be the ``Filter`` tool in Alteryx]

**[ChatGPT prompt]** `How do you filter records in PANDAS?`

Filtering records in PANDAS involves selecting rows from a DataFrame that meet specific conditions. Here are common ways to filter records in pandas:

### Using Boolean Indexing
You can filter rows based on a condition:

- Examples: 
  - Filter (Keep) Rows Where Wind Speed > 50
  - Filter (Exclude) Rows Where `Adjuster` is 'Dudley'

In [None]:
filtered_df = df[df['Wind Speed'] > 50]
print(len(filtered_df))
filtered_df[:1]

In [None]:
filtered_df = df[~(df['Adjuster'] == 'Dudley')]
print(len(filtered_df))
filtered_df[:1]

Alternative code using `!=` instead of `~( ... == ... )`:

In [None]:
filtered_df = df[(df['Adjuster'] != 'Dudley')]
print(len(filtered_df))
filtered_df[:1]

### Using `.query()`
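The `.query()` method filters rows using a string expression, similar to a SQL `WHERE` clause. Column names that contain spaces must be wrapped in backticks inside the expression.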

- Examples: 
  - Filter Rows Where Wind Speed > 50
  - Filter Rows Where City == "Sycamore" AND Age of roof > 10

In [None]:
filtered_df = df.query('`Wind Speed` > 50')
print(len(filtered_df))
filtered_df[:1]

In [None]:
filtered_df = df.query('City == "Sycamore" & `Age of roof` > 10')
print(len(filtered_df))
filtered_df[:1]
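As a small sketch of two further `.query()` conveniences, the example below uses a hypothetical toy DataFrame (not the insurance dataset; only the column names are borrowed from it). It assumes you want to keep a threshold in a regular Python variable, which `.query()` can reference with the `@` prefix, and it shows that the keyword `and` can stand in for `&` inside the query string.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's dataset
toy = pd.DataFrame({
    'City': ['Sycamore', 'Sycamore', 'DeKalb'],
    'Wind Speed': [45, 60, 70],
    'Age of roof': [12, 8, 15],
})

min_wind = 50  # threshold kept in an ordinary Python variable

# Prefix a Python variable with @ to reference it inside the query string
windy = toy.query('`Wind Speed` > @min_wind')

# Inside .query(), the keyword `and` can be used in place of `&`
old_sycamore = toy.query('City == "Sycamore" and `Age of roof` > 10')

print(len(windy), len(old_sycamore))
```

Keeping the threshold in a variable makes it easier to rerun the same filter with different cutoffs.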

### Filtering for Null or Non-Null Values
- Example: Filter rows where `Estimated cost to repair` is NOT `NaN` (i.e., not missing)

In [None]:
filtered_df = df[df['Estimated cost to repair'].notnull()]
print(len(filtered_df))
filtered_df[:1]

Note: There are instances where you do *not* want to drop cases with missing data. Please refer to the "Data Cleaning" section in the second part of this notebook.
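The reverse filter, keeping only the rows that *are* missing a value, can be sketched with `.isnull()`. The toy DataFrame and the `Claim` labels below are hypothetical and serve only to illustrate the pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing repair estimates
toy = pd.DataFrame({
    'Claim': ['A', 'B', 'C', 'D'],
    'Estimated cost to repair': [1200.0, np.nan, 850.0, np.nan],
})

# Boolean mask: True where the value is missing
missing_mask = toy['Estimated cost to repair'].isnull()

missing_df = toy[missing_mask]    # rows with a missing estimate
present_df = toy[~missing_mask]   # same result as filtering with .notnull()

print(missing_mask.sum(), 'rows are missing an estimate')
```

Inspecting the missing rows before dropping them is one way to check whether "missing" might really mean something else, such as zero.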

### Using `.isin()`
Filter rows where a column's value is in a list of values.
- Example: Filter Rows Where Zip Code is in a Specific List

In [None]:
zip_codes = [60178, 10001]
filtered_df = df[df['Zip Code'].isin(zip_codes)]
print(len(filtered_df))
filtered_df[:1]
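To exclude rather than keep the listed values, one option is to negate the `.isin()` mask with `~`. This is a minimal sketch on a hypothetical toy DataFrame; the extra zip code is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Zip Code': [60178, 10001, 60115, 60178]})

zip_codes = [60178, 10001]

# ~ flips the boolean mask, keeping rows whose zip code is NOT in the list
excluded_df = toy[~toy['Zip Code'].isin(zip_codes)]

print(len(excluded_df))
```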

### Using `.str` for String Filtering
- Example: Filter Rows Where `Loss or damage details` Contains the Word "Hail"

In [None]:
filtered_df = df[df['Loss or damage details'].str.contains('Hail', na=False)]
print(len(filtered_df))
filtered_df[:1]
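Note that `.str.contains()` is case-sensitive by default, so searching for 'Hail' would miss 'hail'. The sketch below, using hypothetical text values, compares the default behavior with `case=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, including one missing description
toy = pd.DataFrame({
    'Loss or damage details': [
        'Hail dented the gutters',
        'Large hail broke a window',
        'Wind tore off shingles',
        np.nan,
    ]
})

col = toy['Loss or damage details']

# Default: case-sensitive match on 'Hail'
exact = toy[col.str.contains('Hail', na=False)]

# case=False matches 'Hail', 'hail', 'HAIL', and so on
any_case = toy[col.str.contains('hail', case=False, na=False)]

print(len(exact), len(any_case))
```

The `na=False` argument treats missing descriptions as non-matches, so they are dropped from the result instead of producing missing values in the mask.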

### Filter Rows by Index Range
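Use `.loc[]` to select rows by index label; it also supports more complex row filtering and column selection. Unlike ordinary Python slicing, a `.loc[]` slice includes both endpoints.
- Example: Filter Rows from Index 0 to 5 (inclusive, so 6 rows with the default index)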

In [None]:
filtered_df = df.loc[0:5]
print(len(filtered_df))
filtered_df
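The difference between label-based and position-based slicing is easy to trip over, so here is a short sketch on a hypothetical toy DataFrame with the default integer index.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..7
toy = pd.DataFrame({'Claim': list('ABCDEFGH')})

by_label = toy.loc[0:5]      # label-based: includes both 0 and 5
by_position = toy.iloc[0:5]  # position-based: excludes the end position

print(len(by_label), len(by_position))
```

With `.loc[]` the slice returns six rows, while `.iloc[]` follows standard Python slicing and returns five.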

### Using Multiple Conditions
Combine conditions using `&` (AND), `|` (OR), and `~` (NOT). Enclose each condition in parentheses, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`.

- Examples: 
  - Filter Rows Where `Rainfall` > 1.5 AND `Hail Diameter` > 1.0
  - Filter Rows Where Wind Speed > 50 AND City == "Sycamore"

In [None]:
filtered_df = df[(df['Rainfall'] > 1.5) & (df['Hail Diameter'] > 1.0)]
print(len(filtered_df))
filtered_df[:1]

In [None]:
filtered_df = df[(df['Wind Speed'] > 50) & (df['City'] == 'Sycamore')]
print(len(filtered_df))
filtered_df[:1]
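The examples above use only `&`. As a minimal sketch of `|` and `~`, the following uses a hypothetical toy DataFrame (the city names other than Sycamore are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'City': ['Sycamore', 'DeKalb', 'Sycamore', 'Genoa'],
    'Wind Speed': [40, 65, 55, 30],
})

# OR: keep rows that satisfy at least one condition
either = toy[(toy['Wind Speed'] > 50) | (toy['City'] == 'Sycamore')]

# NOT: wrap the combined condition in parentheses and negate it with ~
neither = toy[~((toy['Wind Speed'] > 50) | (toy['City'] == 'Sycamore'))]

print(len(either), len(neither))
```

Every row lands in exactly one of the two results, so their lengths add up to the length of the original DataFrame.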

# Data Cleaning

In an earlier example, we filtered the dataset to only keep rows where `Estimated cost to repair` is NOT `NaN` (i.e., not missing). In some instances, though, we will not want to drop cases missing data. A key example is when 'missing' actually means '$0.'


### Using `fillna()`

**[ChatGPT prompt]** `What is fillna() in pandas?`

The `fillna()` method in pandas is used to replace missing (`NaN`) values in a DataFrame or Series with a specified value or method.

Example: Replace Missing Values with a Constant Value:
```python
df['Column_Name'] = df['Column_Name'].fillna(0)
```

Example: Replace Missing Values with a Summary Statistic:
```python
df['Column_Name'] = df['Column_Name'].fillna(df['Column_Name'].mean())
```
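To make the two patterns concrete, here is a small worked sketch on hypothetical values. It assumes a column where a missing estimate could plausibly be treated either as $0 or as the average of the known estimates; which choice is appropriate depends on what "missing" means in your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing estimate
toy = pd.DataFrame({'Estimated cost to repair': [1000.0, np.nan, 3000.0]})

col = toy['Estimated cost to repair']

# Treat missing as $0
as_zero = col.fillna(0)

# Treat missing as the mean of the non-missing values (.mean() skips NaN)
as_mean = col.fillna(col.mean())

print(as_zero.tolist())  # [1000.0, 0.0, 3000.0]
print(as_mean.tolist())  # [1000.0, 2000.0, 3000.0]
```

Both calls return a new Series and leave `toy` unchanged until the result is assigned back to the column.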

### Using `replace()`
The `replace()` method can replace specific values, including `NaN`.

Example:
```python
df['Column_Name'] = df['Column_Name'].replace(np.nan, 0)
```
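`replace()` also accepts a dictionary, which can be handy for recoding several values in one call. The sketch below uses hypothetical placeholder strings and an invented 'Unassigned' label; none of these values come from the insurance dataset.

```python
import pandas as pd

# Hypothetical toy data with inconsistent placeholder values
toy = pd.DataFrame({'Adjuster': ['Dudley', 'N/A', 'Smith', 'unknown']})

# Map each old value to its replacement; unlisted values are left untouched
cleaned = toy['Adjuster'].replace({'N/A': 'Unassigned', 'unknown': 'Unassigned'})

print(cleaned.tolist())
```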

In [None]:
df['Age of roof'].describe()

In [None]:
# Recode a specific value: replace every roof age of 35 with 34
df['Age of roof'] = df['Age of roof'].replace(35, 34)
df['Age of roof'].describe()