[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%204%20Notebooks/GDAN%205400%20-%20Week%204%20Notebooks%20%28III%29%20-%20Creating%20Binary%20Variables.ipynb)

This notebook provides recipes for creating binary variables in Python.

In [None]:
%%time
import datetime
print("Current date and time:", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns, so I typically include these lines at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import pandas as pd
import requests

# NOTE: to get the raw file, replace `https://github.com/` with `https://raw.githubusercontent.com/` and drop `/blob` from the path
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'

# Download the file and stop with an error if the request failed
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

df.head()

In [None]:
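#APPLY DATA CLEANING OPERATIONS FROM CODING ASSIGNMENT 1
# .copy() makes the filtered rows an independent DataFrame, so the column
# assignments below do not raise a SettingWithCopyWarning
df = df[df['Policy Number'].notnull()].copy()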
df['Estimated cost to repair'] = df['Estimated cost to repair'].fillna(0)
df['Estimated cost to replace'] = df['Estimated cost to replace'].fillna(0)

# Creating Binary Variables
- Create a binary variable from the `Stories` column that indicates multiple stories (1 if `Stories` > 1, else 0)

In [None]:
df['Stories'].value_counts()

In [None]:
#Option 1: lambda function
df['multiple_stories'] = df['Stories'].apply(lambda x: 1 if x > 1 else 0)
df['multiple_stories'].value_counts()

In [None]:
#Option 2: Custom function
def classify_stories(stories, threshold=1):
    return 1 if stories > threshold else 0

In [None]:
df['multiple_stories'] = df['Stories'].apply(classify_stories)
df['multiple_stories'].value_counts()
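
One advantage of the custom function is its `threshold` argument: any extra keyword arguments given to `Series.apply()` are forwarded to the function. The sketch below illustrates this on a small made-up Series that stands in for `df['Stories']`; the values and the names `stories`, `multiple`, and `three_plus` are assumptions for illustration, not part of the insurance data.

```python
import pandas as pd

def classify_stories(stories, threshold=1):
    return 1 if stories > threshold else 0

# Toy values standing in for df['Stories'] (made up for illustration)
stories = pd.Series([1, 2, 3, 1, 2])

# Default threshold of 1: flags anything with more than one story
multiple = stories.apply(classify_stories)

# Keyword arguments passed to apply() are forwarded to classify_stories,
# so the same function can flag "more than two stories" instead
three_plus = stories.apply(classify_stories, threshold=2)

print(multiple.tolist())
print(three_plus.tolist())
```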

In [None]:
#Option 3: Apply a Boolean comparison directly and convert the result to `int`
df['multiple_stories'] = (df['Stories'] > 1).apply(int)
df['multiple_stories'].value_counts()

In [None]:
#Option 3 (alternative): using `astype()` instead of `apply()`
df['multiple_stories'] = (df['Stories'] > 1).astype(int)
df['multiple_stories'].value_counts()

In [None]:
#Showing what the Boolean operation looks like without applying `int()`
(df['Stories'] > 1)
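
The Boolean Series is often useful on its own, because pandas treats `True` as 1 and `False` as 0 in arithmetic. As a minimal sketch on made-up values (the Series `stories` is an assumption standing in for `df['Stories']`), `sum()` counts the flagged rows and `mean()` gives their share:

```python
import pandas as pd

# Toy values standing in for df['Stories'] (made up for illustration)
stories = pd.Series([1, 2, 3, 1, 2])

# Element-wise comparison returns a Boolean Series
is_multi = stories > 1

# True counts as 1 and False as 0
n_multi = is_multi.sum()       # number of rows with more than one story
share_multi = is_multi.mean()  # proportion of rows with more than one story

print(is_multi.dtype, n_multi, share_multi)
```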

In [None]:
#Option 4: Using `np.where()` from NumPy
df['multiple_stories'] = np.where(df['Stories'] > 1, 1, 0)
df['multiple_stories'].value_counts()
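
As a quick sanity check, the sketch below runs the lambda, `astype()`, and `np.where()` approaches on the same made-up Series and confirms that they agree. The toy values (including one missing value) are assumptions for illustration. Note that all three code a missing value as 0, because any comparison with `NaN` evaluates to `False`.

```python
import numpy as np
import pandas as pd

# Toy values standing in for df['Stories'], with one missing value
stories = pd.Series([1, 2, np.nan, 3, 1])

opt_lambda = stories.apply(lambda x: 1 if x > 1 else 0)  # lambda function
opt_astype = (stories > 1).astype(int)                   # Boolean comparison + astype()
opt_where = np.where(stories > 1, 1, 0)                  # np.where() returns a NumPy array

# Compare the values produced by each approach
same = opt_lambda.tolist() == opt_astype.tolist() == opt_where.tolist()

print(opt_lambda.tolist())
print(same)
```

On large DataFrames the vectorized versions (`astype()` and `np.where()`) are generally faster than `apply()`, which calls the Python function once per row.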

#### Additional example

In [None]:
#Purpose: Flag hail diameter above 1.0 as high (1) and everything else as low (0).
#Missing values are also coded 0, because `NaN > 1.0` evaluates to False.
df['High_Hail_Flag'] = df['Hail Diameter'].apply(lambda x: 1 if x > 1.0 else 0)
df[['Hail Diameter', 'High_Hail_Flag']].head()
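
If missing diameters should stay missing instead of being coded as 0, one possible approach is to blank out the flag wherever the input is missing. The sketch below uses a made-up Series in place of `df['Hail Diameter']`; the values and the names `hail`, `flag`, and `flag_keep_na` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy values standing in for df['Hail Diameter'], with one missing value
hail = pd.Series([0.75, 1.0, 1.75, np.nan, 2.5])

# Vectorized equivalent of the lambda: the missing diameter becomes 0
flag = (hail > 1.0).astype(int)

# Keep the flag only where a diameter was observed; other rows become NaN
flag_float = flag.where(hail.notna())

# Optional: pandas' nullable integer dtype stores 0/1 alongside <NA>
flag_keep_na = flag_float.astype('Int64')

print(flag.tolist())
print(flag_keep_na)
```

A diameter of exactly 1.0 is coded 0 here because the comparison is strict (`>`), which matches the lambda version.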