[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%204%20Notebooks/GDAN%205400%20-%20Week%204%20Notebooks%20%28II%29%20-%20Converting%20Data%20Types.ipynb)

This notebook provides recipes for converting variables' data types in Python.

In [None]:
%%time
import datetime
print("Current date and time : ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"), '\n')

# Load Packages and Set Working Directory
Import several necessary Python packages. We will be using the <a href="http://pandas.pydata.org/">Python Data Analysis Library,</a> or <i>PANDAS</i>, extensively for our data manipulations in this and future tutorials.

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame
from pandas import Series

<br>
PANDAS allows you to set various options for, among other things, inspecting the data. I like to be able to see all of the columns. Therefore, I typically include this line at the top of all my notebooks.

In [None]:
#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 250)
pd.set_option('display.max_info_columns', 500)

# Read in Data

In [None]:
import pandas as pd
import requests

# NOTE: replace `https://github.com/` with `https://raw.githubusercontent.com`
# https://github.com/gdsaxton/GDAN5400/blob/main/Coding%20Assignment%201/final_insurance_fraud.xlsx
url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/main/Coding%20Assignment%201/final_insurance_fraud.xlsx'

# Download the file (stop with an error if the request fails)
response = requests.get(url)
response.raise_for_status()
with open('final_insurance_fraud.xlsx', 'wb') as f:
    f.write(response.content)

# Load the Excel file
df = pd.read_excel('final_insurance_fraud.xlsx', engine='openpyxl')

df.head()

In [None]:
#APPLY DATA CLEANING OPERATIONS FROM CODING ASSIGNMENT 1
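# .copy() makes the filtered frame independent, avoiding SettingWithCopyWarning on the assignments below
df = df[df['Policy Number'].notnull()].copy()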
df['Estimated cost to repair'] = df['Estimated cost to repair'].fillna(0)
df['Estimated cost to replace'] = df['Estimated cost to replace'].fillna(0)

# Main Data Types in PANDAS

**[ChatGPT prompt]** `What are the main data types in Python?`

# Check Our Current Data Types 
First, let us check the data types for all of our variables.

In [None]:
df.dtypes

<br>Recall that if we want the data types plus additional information (non-null counts and memory usage), we can use `.info()`

In [None]:
df.info()
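
As a small illustration of what these checks report, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its column names are hypothetical and are not part of the insurance dataset.

```python
import pandas as pd

# Hypothetical toy data (not the insurance dataset)
toy = pd.DataFrame({
    'policy': [101, 102, 103],
    'cost': [250.0, 0.0, 1200.5],
    'city': ['Toronto', 'Ottawa', 'Hamilton'],
    'when': pd.to_datetime(['2019-09-12 02:45:00',
                            '2019-09-13 14:10:00',
                            '2019-09-14 08:00:00']),
})

# One dtype per column
print(toy.dtypes)

# select_dtypes() filters columns by type; 'number' covers ints and floats
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
```

`select_dtypes()` is handy when you only want to inspect, say, the numeric columns before deciding which ones need converting.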

# Examples
### Convert `Date&Time` column to `datetime` format

In [None]:
df['Date&Time'].sample(5)

Note that we need to know the precise format for our data. 

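We add the `format` option with a format string that matches the specific layout of our data. For values such as `2019-09-12 02:45:00`, that string is `'%Y-%m-%d %H:%M:%S'`; a layout such as `9/12/19 2:45` would instead need `'%m/%d/%y %H:%M'`.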

To get help:

In [None]:
pd.to_datetime?

Or, via **ChatGPT**: `My Date&Time data are in this format. How can I convert it to datetime? 2019-09-12 02:45:00`

**Response:** `Your Date&Time data is already in a standard ISO 8601 datetime format (YYYY-MM-DD HH:MM:SS), which Pandas can easily recognize. Here's how you can convert it to a datetime format:`

In [None]:
df['Date&Time'] = pd.to_datetime(df['Date&Time'])
df['Date&Time'][:5]

**Response:** ``If you want to use the **`format`** option in `pd.to_datetime()`, you can specify the exact datetime format of your data. This is useful when you know the exact structure of your datetime strings and want faster or more precise parsing. For your data (`2019-09-12 02:45:00`), the format is `'%Y-%m-%d %H:%M:%S'`.``

### Explanation of the Format String:
- **`%Y`**: Four-digit year (e.g., `2019`).
- **`%m`**: Two-digit month (e.g., `09` for September).
- **`%d`**: Two-digit day of the month (e.g., `12`).
- **`%H`**: Two-digit hour in 24-hour format (e.g., `02` for 2 AM).
- **`%M`**: Two-digit minute (e.g., `45`).
- **`%S`**: Two-digit second (e.g., `00`).

---

### Why Use `format`?
- **Performance**: Specifying the format can be faster than automatic detection when the dataset is large.
- **Precision**: Ensures the data is parsed exactly as intended, avoiding incorrect conversions.

---

### Handling Errors with `format`
If some rows do not match the specified format, you can handle them gracefully:
```python
df['Date&Time'] = pd.to_datetime(df['Date&Time'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
```

`errors='coerce'`: Converts invalid formats to NaT (Not a Time), allowing you to identify and handle problematic rows.
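
To make the "identify problematic rows" step concrete, here is a minimal sketch. The `raw` series is hypothetical sample data, with one entry that deliberately does not match the format.

```python
import pandas as pd

# Hypothetical raw strings; the second entry deliberately breaks the format
raw = pd.Series(['2019-09-12 02:45:00', 'not a date', '2019-09-14 08:00:00'])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

# isna() flags the NaT rows, so we can look up the original offending values
bad_rows = raw[parsed.isna()]
print(bad_rows)
```

Keeping the original strings around (here in `raw`) lets you see exactly which values failed before deciding whether to fix or drop them.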


In [None]:
# The format string must match the layout of the data (e.g., 2019-09-12 02:45:00)
df['Date&Time'] = pd.to_datetime(df['Date&Time'], format='%Y-%m-%d %H:%M:%S')
df['Date&Time'][:5]

In [None]:
df['Date&Time'].dt.time

In [None]:
df['Date&Time'].dt.second
df['Date&Time'].dt.minute
df['Date&Time'].dt.hour
df['Date&Time'].dt.weekday # Extracts the day of the week (Monday = 0, Sunday = 6)
df['Date&Time'].dt.day
df['Date&Time'].dt.month
df['Date&Time'].dt.year

In [None]:
df['Time'] = df['Date&Time'].dt.time

In [None]:
df[['Date&Time', 'Time']].sample(5)
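
The `.dt` accessor can also be used to build several derived columns at once. The following is a minimal sketch on two hypothetical timestamps; the `stamps` and `parts` names are made up for illustration.

```python
import pandas as pd

# Two hypothetical timestamps (not taken from the insurance data)
stamps = pd.Series(pd.to_datetime(['2019-09-12 02:45:00',
                                   '2019-09-15 14:10:30']))

# Collect a few datetime components side by side
parts = pd.DataFrame({
    'hour': stamps.dt.hour,
    'weekday': stamps.dt.weekday,      # Monday = 0, Sunday = 6
    'day_name': stamps.dt.day_name(),  # weekday as text
    'time': stamps.dt.time,            # Python datetime.time objects
})
print(parts)
```

Note that `.dt.time` returns plain Python `time` objects, so the resulting column is no longer a `datetime` column.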

### Convert `Estimated cost to repair` column from `float` to `integer` format
First, take a look at the values in the first 5 rows:

In [None]:
df['Estimated cost to repair'][:5]

Two alternatives (note that both drop the fractional part rather than rounding to the nearest integer):

In [None]:
df['Rounded_Cost'] = df['Estimated cost to repair'].astype(int)

In [None]:
df['Rounded_Cost'] = df['Estimated cost to repair'].apply(int)

In [None]:
df[['Estimated cost to repair', 'Rounded_Cost']].sample(5)
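
Because casting to `int` truncates, it can help to compare it with an explicit rounding step. This is a minimal sketch on hypothetical values; the `costs` series is made up for illustration.

```python
import pandas as pd

# Hypothetical cost values, including a negative one to show the direction of truncation
costs = pd.Series([199.99, 250.6, -3.7, 1000.0])

# Casting alone drops the fractional part (truncates toward zero)
truncated = costs.astype(int)

# Rounding first, then casting, gives the nearest integer
rounded = costs.round().astype(int)

print(pd.DataFrame({'cost': costs, 'truncated': truncated, 'rounded': rounded}))
```

If the column should hold the nearest whole number, add `.round()` before the cast; if dropping the cents is intended, the cast alone is enough.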

### Convert `Policy Number` to `int` and `string` formats:

In [None]:
#SHOW FIRST FIVE ROWS
df['Policy Number'][:5]

In [None]:
df['Policy Number'] = df['Policy Number'].apply(int)
df['Policy Number'][:5]

In [None]:
df['Policy Number'] = df['Policy Number'].apply(str)
df['Policy Number'][:5]

In [None]:
df['Policy Number'][:5].apply(str)
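
One reason to convert to `int` before converting to `str` is that a float converted directly keeps its decimal point. The sketch below uses hypothetical policy numbers, and the zero-padding width of 6 is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical policy numbers stored as floats, as can happen after reading Excel
policy = pd.Series([10234.0, 987.0, 55001.0])

# Converting the floats directly keeps the trailing '.0'
direct = policy.astype(str)

# Converting to int first gives clean identifier strings
via_int = policy.astype(int).astype(str)

# Optional: left-pad with zeros to a fixed width (6 is an arbitrary example width)
padded = via_int.str.zfill(6)

print(direct.tolist(), via_int.tolist(), padded.tolist())
```

Here `.astype(int)` and `.astype(str)` do the same job as `.apply(int)` and `.apply(str)` in the cells above.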

### Examples of using `bool()` conversion

In [None]:
# Convert a column to boolean using `bool()`: 0 becomes False, any nonzero value becomes True
df['Has_Replacement_Cost'] = df['Estimated cost to replace'].apply(bool)
# Display examples to verify the results
df[['Estimated cost to replace', 'Has_Replacement_Cost']].head()
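
One caveat with `bool()` is that missing values (`NaN`) count as `True`, which is why filling them with 0 in the cleaning step earlier matters. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values: zero, a real cost, and a missing value
values = pd.Series([0.0, 1500.0, np.nan])

# bool(NaN) is True, so the missing value is reported as "has a cost"
naive = values.apply(bool)

# Filling missing values first and using an explicit comparison avoids that
explicit = values.fillna(0) > 0

print(pd.DataFrame({'value': values, 'naive': naive, 'explicit': explicit}))
```

An explicit comparison such as `> 0` also states the intended rule directly, which can be easier to read than relying on truthiness.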

#### Inspect data types again

In [None]:
df.info()