# Part 2: Data Summary

## Name : Anuj Kumar Shah

### 2. Data Cleaning and Summarization

In this section, we clean and summarize our dataset to prepare it for further analysis. To streamline the preprocessing tasks, we use a custom Python module, DataLoader, which provides reusable methods for the individual data cleaning steps.

- First, we invoke the DataLoader to load and inspect a sample of our data, enabling us to understand its structure and content.

- Next, we clean the column names, removing any square brackets and leading or trailing spaces, ensuring consistency and improving readability.

- We further refine our data by converting specific columns to appropriate data types, such as datetime and numeric, facilitating easier analysis.

- After meticulously cleaning and transforming the data, we conclude this part by saving the cleaned dataset into a CSV file, ensuring that our refined data is preserved for subsequent parts of our analysis and use cases.


Throughout this process, our DataLoader module plays a crucial role, providing reusable methods that encapsulate various data cleaning functionalities, embodying the principles of modularity and code reusability.



### 2.1 Loading and Inspecting Data
#### 2.1.1 Use the DataLoader to load a sample of the data.

In this step, we read our dataset, which is stored in text files, into pandas DataFrames, one DataFrame per file. A DataFrame works much like a database table and helps us understand the structure of our data. It is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
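
To make the description of a DataFrame concrete, here is a minimal sketch built by hand rather than through DataLoader. The three column names come from the dataset, while the two rows are only illustrative values copied from the sample output.

```python
import pandas as pd

# A tiny hand-built DataFrame mixing integer, string, and float columns
df = pd.DataFrame(
    {
        "QUOTE_UNIXTIME": [1672779600, 1672779600],
        "QUOTE_DATE": ["2023-01-03", "2023-01-03"],
        "UNDERLYING_LAST": [380.82, 380.82],
    }
)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # one dtype per column, so columns can differ in type

# Size-mutable: a new column can be added after creation
df["DTE"] = [0.0, 0.0]
print(list(df.columns))
```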

#### Process:

- We initialize the DataLoader with the file path.
- We load and display sample data to check whether our module works correctly.

In [1]:
from py1 import DataLoader  

# Initialize DataLoader with the file path
data_loader = DataLoader("data/spy_eod_202301.txt")


# Load and display sample data
sample_data = data_loader.load_sample_data()
sample_data


Unnamed: 0,"[QUOTE_UNIXTIME], [QUOTE_READTIME], [QUOTE_DATE], [QUOTE_TIME_HOURS], [UNDERLYING_LAST], [EXPIRE_DATE], [EXPIRE_UNIX], [DTE], [C_DELTA], [C_GAMMA], [C_VEGA], [C_THETA], [C_RHO], [C_IV], [C_VOLUME], [C_LAST], [C_SIZE], [C_BID], [C_ASK], [STRIKE], [P_BID], [P_ASK], [P_SIZE], [P_LAST], [P_DELTA], [P_GAMMA], [P_VEGA], [P_THETA], [P_RHO], [P_IV], [P_VOLUME], [STRIKE_DISTANCE], [STRIKE_DISTANCE_PCT]"
0,"1672779600, 2023-01-03 16:00, 2023-01-03, 16.0..."
1,"1672779600, 2023-01-03 16:00, 2023-01-03, 16.0..."
2,"1672779600, 2023-01-03 16:00, 2023-01-03, 16.0..."
3,"1672779600, 2023-01-03 16:00, 2023-01-03, 16.0..."
4,"1672779600, 2023-01-03 16:00, 2023-01-03, 16.0..."


The module loads the file, but every row arrives as a single comma-separated string in one column. We therefore reinitialize the DataLoader with the file path and specify the delimiter explicitly.

#### 2.1.2 Inspect the basic structure and attributes of the data.
-   The `load_sample_data` function in our Python module loads the data. Although our files are plain text files, the function can read them when given the appropriate delimiter.
-   We specify the delimiter, here a comma, that separates the values in the text files.
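
The effect of the delimiter can be sketched with plain pandas. This assumes, without confirmation from the module source, that `load_sample_data` forwards its `delimiter` argument to something like `pandas.read_csv`. The text below is a made-up three-column excerpt in the same layout as the source file.

```python
import io
import pandas as pd

# Made-up excerpt in the same comma-separated layout as the source file
raw = (
    "[QUOTE_UNIXTIME], [QUOTE_DATE], [DTE]\n"
    "1672779600, 2023-01-03, 0.0\n"
    "1672779600, 2023-01-03, 0.0\n"
)

# With a separator that never occurs in the text, each line stays in one column
one_column = pd.read_csv(io.StringIO(raw), sep="\t")

# With the comma as separator, each attribute gets its own column
split = pd.read_csv(io.StringIO(raw), sep=",")

print(one_column.shape)
print(split.shape)
print(list(split.columns))  # the space after each comma stays in the label
```

`read_csv` also accepts `skipinitialspace=True`, which drops the space that follows each comma. Whether DataLoader exposes that option is not known from this section.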

In [2]:
# Initializing DataLoader with the file path
data_loader = DataLoader("data/spy_eod_202301.txt")

# Using DataLoader to load and display data with comma as the delimiter
sample_data_comma_delimited = data_loader.load_sample_data(delimiter=",")

sample_data_comma_delimited.head()

Unnamed: 0,[QUOTE_UNIXTIME],[QUOTE_READTIME],[QUOTE_DATE],[QUOTE_TIME_HOURS],[UNDERLYING_LAST],[EXPIRE_DATE],[EXPIRE_UNIX],[DTE],[C_DELTA],[C_GAMMA],...,[P_LAST],[P_DELTA],[P_GAMMA],[P_VEGA],[P_THETA],[P_RHO],[P_IV],[P_VOLUME],[STRIKE_DISTANCE],[STRIKE_DISTANCE_PCT]
0,1672779600,2023-01-03 16:00,2023-01-03,16.0,380.82,2023-01-03,1672779600,0.0,0.96551,0.00562,...,0.01,-0.00075,0.00015,0.00072,-0.00483,-0.00015,1.21005,0.0,70.8,0.186
1,1672779600,2023-01-03 16:00,2023-01-03,16.0,380.82,2023-01-03,1672779600,0.0,0.96015,0.00703,...,0.02,-0.00093,0.00025,0.00104,-0.00487,0.0,0.99616,0.0,60.8,0.16
2,1672779600,2023-01-03 16:00,2023-01-03,16.0,380.82,2023-01-03,1672779600,0.0,0.95788,0.00778,...,0.0,-0.0014,0.0002,0.00105,-0.00538,-7e-05,0.91199,,56.8,0.149
3,1672779600,2023-01-03 16:00,2023-01-03,16.0,380.82,2023-01-03,1672779600,0.0,0.96337,0.0081,...,0.02,-0.00081,0.00026,0.00115,-0.005,-0.00048,0.89065,0.0,55.8,0.147
4,1672779600,2023-01-03 16:00,2023-01-03,16.0,380.82,2023-01-03,1672779600,0.0,0.956,0.00817,...,0.01,-0.00119,0.00022,0.00043,-0.00533,-0.00044,0.87004,0.0,54.8,0.144


The data has been successfully loaded with each attribute recognized as a separate column. However, the column names are still wrapped in square brackets (`[]`), which we clean up next for easier access and analysis.

##### Observations:

-   The dataset includes various attributes related to options trading, such as option prices, Greeks, implied volatility, and expiry details, among others.
-   Each row seems to represent a unique option based on different attributes like the expiry date and strike price.

### 2.2 Cleaning and Concatenating

#### 2.2.1 Removing the square brackets from the column names.

In [3]:
# Using DataLoader to clean the column names
cleaned_data = DataLoader.clean_column_names(sample_data_comma_delimited.copy())

# Displaying the cleaned column names and the first few rows of the data
cleaned_data.columns, cleaned_data.head()


  data.columns = data.columns.str.replace('[', '').str.replace(']', '')


(Index(['QUOTE_UNIXTIME', ' QUOTE_READTIME', ' QUOTE_DATE', ' QUOTE_TIME_HOURS',
        ' UNDERLYING_LAST', ' EXPIRE_DATE', ' EXPIRE_UNIX', ' DTE', ' C_DELTA',
        ' C_GAMMA', ' C_VEGA', ' C_THETA', ' C_RHO', ' C_IV', ' C_VOLUME',
        ' C_LAST', ' C_SIZE', ' C_BID', ' C_ASK', ' STRIKE', ' P_BID', ' P_ASK',
        ' P_SIZE', ' P_LAST', ' P_DELTA', ' P_GAMMA', ' P_VEGA', ' P_THETA',
        ' P_RHO', ' P_IV', ' P_VOLUME', ' STRIKE_DISTANCE',
        ' STRIKE_DISTANCE_PCT'],
       dtype='object'),
    QUOTE_UNIXTIME     QUOTE_READTIME   QUOTE_DATE   QUOTE_TIME_HOURS  \
 0      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 1      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 2      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 3      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 4      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 
     UNDERLYING_LAST  EXPIRE_DATE   EXPIRE_UNIX  

The square brackets in the column names have been successfully removed. Here's a brief explanation of what was done:

#### Explanation:

-   Removing Square Brackets:
    -   The column names originally had square brackets around them, like `[QUOTE_UNIXTIME]`. To clean this, the square brackets were removed by replacing them with empty strings.
    -   The module uses the pandas string accessor, `data.columns.str.replace()`, once for `[` and once for `]`, which applies the replacement to every column name at once.
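
The bracket removal can be sketched as follows. The function `strip_brackets` is a hypothetical stand-in for `DataLoader.clean_column_names`, whose actual source is not shown here. Passing `regex=False` is an addition in this sketch and makes pandas treat `[` and `]` as literal characters.

```python
import pandas as pd

def strip_brackets(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for DataLoader.clean_column_names."""
    data = data.copy()
    # regex=False: '[' and ']' are matched literally, not as regex syntax
    data.columns = (
        data.columns.str.replace("[", "", regex=False)
        .str.replace("]", "", regex=False)
    )
    return data

# Empty frame with three of the original labels, including the leading spaces
df = pd.DataFrame(columns=["[QUOTE_UNIXTIME]", " [QUOTE_DATE]", " [DTE]"])

cleaned = strip_brackets(df)
print(list(cleaned.columns))
```

Only the brackets are removed in this step. Any leading spaces in the labels are left untouched.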

#### Current State of the Data:

-   Columns Cleaned:

    -   The column names are now free of square brackets, although most still carry a leading space (visible in the output above), which is removed in section 2.3.1.
-   Data Preview:

    -   The data consists of various attributes like `QUOTE_UNIXTIME`, `QUOTE_READTIME`, Greeks such as `C_DELTA`, `C_GAMMA`, and others, and option prices such as `C_BID`, `C_ASK`.

#### 2.2.2 Load and Concatenate All Files

- Load each text file and concatenate them into a single DataFrame for a comprehensive analysis.




In [3]:
# Defining the file paths
file_paths = ["data/spy_eod_202301.txt", 
              "data/spy_eod_202302.txt", 
              "data/spy_eod_202303.txt"]

# Assuming DataLoader has a method to load and concatenate files
# Using DataLoader to load and concatenate all files into a single DataFrame
all_data = DataLoader.load_and_concatenate_files(file_paths)

# Using DataLoader to clean the column names (calling the method on the class, not the instance)
all_data_cleaned = DataLoader.clean_column_names(all_data.copy())

# Displaying the shape and the first few rows of the concatenated DataFrame
all_data_cleaned.shape, all_data_cleaned.head()





((245695, 33),
    QUOTE_UNIXTIME     QUOTE_READTIME   QUOTE_DATE   QUOTE_TIME_HOURS  \
 0      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 1      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 2      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 3      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 4      1672779600   2023-01-03 16:00   2023-01-03               16.0   
 
     UNDERLYING_LAST  EXPIRE_DATE   EXPIRE_UNIX   DTE   C_DELTA   C_GAMMA  ...  \
 0            380.82   2023-01-03    1672779600   0.0   0.96551   0.00562  ...   
 1            380.82   2023-01-03    1672779600   0.0   0.96015   0.00703  ...   
 2            380.82   2023-01-03    1672779600   0.0   0.95788   0.00778  ...   
 3            380.82   2023-01-03    1672779600   0.0   0.96337   0.00810  ...   
 4            380.82   2023-01-03    1672779600   0.0   0.95600   0.00817  ...   
 
     P_LAST   P_DELTA   P_GAMMA   P_VEGA  P_THETA  

The data from all files has been successfully loaded and concatenated into a single DataFrame. Here's a summary of the above:

#### i. Loading and Concatenating

-   All text files have been read into individual DataFrames.
-   These DataFrames have been concatenated, meaning they have been stacked on top of each other to form a single DataFrame. This will make the analysis easier as we have all the data in one place.
-   The dataset now has 245,695 rows and 33 columns.
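
The stacking step can be sketched with `pandas.concat`. This is an assumption about what `load_and_concatenate_files` does internally. The two in-memory "files" below are made-up stand-ins for the monthly text files.

```python
import io
import pandas as pd

# Two made-up "files" with identical headers, standing in for monthly text files
january = "[QUOTE_DATE], [DTE]\n2023-01-03, 0.0\n2023-01-03, 1.0\n"
february = "[QUOTE_DATE], [DTE]\n2023-02-01, 0.0\n"

# Read each source into its own DataFrame
frames = [pd.read_csv(io.StringIO(text), sep=",") for text in (january, february)]

# Stack the frames vertically; ignore_index=True renumbers rows from 0
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
print(list(combined.index))
```

Because every file shares the same header, the row counts add up while the column count stays the same.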

#### ii. Cleaning Column Names

-   The square brackets in the column names have been removed to simplify accessing the columns later.

### 2.3 Removing Unnecessary Columns

- Unnecessary columns include those that are redundant, have too many missing values, or are not useful for the analysis.
- To simplify the dataset, we keep only the essential columns relevant to our research questions and remove the rest.
- This makes the dataset more manageable and easier to understand.



#### 2.3.1 Checking Existing Columns
- Here we check the existing column names and clean them if required.

In [4]:
print(all_data_cleaned.columns)


Index(['QUOTE_UNIXTIME', ' QUOTE_READTIME', ' QUOTE_DATE', ' QUOTE_TIME_HOURS',
       ' UNDERLYING_LAST', ' EXPIRE_DATE', ' EXPIRE_UNIX', ' DTE', ' C_DELTA',
       ' C_GAMMA', ' C_VEGA', ' C_THETA', ' C_RHO', ' C_IV', ' C_VOLUME',
       ' C_LAST', ' C_SIZE', ' C_BID', ' C_ASK', ' STRIKE', ' P_BID', ' P_ASK',
       ' P_SIZE', ' P_LAST', ' P_DELTA', ' P_GAMMA', ' P_VEGA', ' P_THETA',
       ' P_RHO', ' P_IV', ' P_VOLUME', ' STRIKE_DISTANCE',
       ' STRIKE_DISTANCE_PCT'],
      dtype='object')


#### Correcting Column Names:

The output shows an extra space at the beginning of most column names. We remove these leading spaces so that the columns can be referenced correctly, and then proceed to keep only the necessary columns.
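
Stripping the spaces can be sketched with the pandas string accessor. This assumes `clean_leading_spaces_in_columns` does something equivalent, since its source is not shown in this section.

```python
import pandas as pd

# Empty frame whose labels mimic the bracket-free names with leading spaces
df = pd.DataFrame(columns=["QUOTE_UNIXTIME", " QUOTE_DATE", " DTE"])

# Before stripping, a plain lookup by name does not match
print("DTE" in df.columns)

# str.strip() removes whitespace on both sides of every column label
df.columns = df.columns.str.strip()

print(list(df.columns))
print("DTE" in df.columns)
```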

In [5]:
# Using DataLoader to clean leading spaces in column names
all_data_cleaned = DataLoader.clean_leading_spaces_in_columns(all_data_cleaned)

# Using DataLoader to clean the square brackets in column names
all_data_cleaned = DataLoader.clean_column_names(all_data_cleaned)

# Displaying the cleaned column names
print(all_data_cleaned.columns)


Index(['QUOTE_UNIXTIME', 'QUOTE_READTIME', 'QUOTE_DATE', 'QUOTE_TIME_HOURS',
       'UNDERLYING_LAST', 'EXPIRE_DATE', 'EXPIRE_UNIX', 'DTE', 'C_DELTA',
       'C_GAMMA', 'C_VEGA', 'C_THETA', 'C_RHO', 'C_IV', 'C_VOLUME', 'C_LAST',
       'C_SIZE', 'C_BID', 'C_ASK', 'STRIKE', 'P_BID', 'P_ASK', 'P_SIZE',
       'P_LAST', 'P_DELTA', 'P_GAMMA', 'P_VEGA', 'P_THETA', 'P_RHO', 'P_IV',
       'P_VOLUME', 'STRIKE_DISTANCE', 'STRIKE_DISTANCE_PCT'],
      dtype='object')




#### 2.3.2 Selecting Relevant Columns 
We'll select columns that are directly relevant to the research questions. Below are the columns which we are keeping:

1.  `QUOTE_DATE`: Date of the quote - important for time series analysis.
2.  `EXPIRE_DATE`: Expiry date of the option - crucial for analyzing time left until expiry.
3.  `DTE`: Days till Expiry - useful for analyzing how option characteristics change over time.
4.  `C_DELTA`, `C_GAMMA`, `C_VEGA`, `C_THETA`, `C_RHO`: Greeks for Call options - essential for analyzing option risks and exposures.
5.  `C_IV`: Implied Volatility of Call options - key metric in option pricing.
6.  `P_DELTA`, `P_GAMMA`, `P_VEGA`, `P_THETA`, `P_RHO`: Greeks for Put options.
7.  `P_IV`: Implied Volatility of Put options.
8.  `STRIKE_DISTANCE`: Distance between strike price and current underlying price - important for analyzing moneyness of options.

We exclude columns related to bid, ask, last traded price, and volume, as our focus is more on the Greeks, implied volatility, and their relationship with days to expiry and strike distance.
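
Column selection can be sketched as below. The function is a hypothetical stand-in for `DataLoader.select_columns`. The explicit check for missing names is an addition in this sketch, not a known feature of the module. The single row uses illustrative values, and the `C_BID` value is made up.

```python
import pandas as pd

def select_columns(data: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical stand-in for DataLoader.select_columns."""
    # Fail early with a readable message if a requested column is absent
    missing = [name for name in columns if name not in data.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    # Return a copy so later edits do not touch the original frame
    return data[columns].copy()

df = pd.DataFrame(
    {
        "QUOTE_DATE": ["2023-01-03"],
        "C_BID": [70.5],  # made-up value, only here to be dropped
        "C_IV": [4.34673],
        "DTE": [0.0],
    }
)

subset = select_columns(df, ["QUOTE_DATE", "DTE", "C_IV"])
print(list(subset.columns))
```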

In [9]:
# Columns to keep based on their relevance to the research questions
columns_to_keep = [
    'QUOTE_DATE', 'EXPIRE_DATE', 'DTE', 'C_DELTA', 'C_GAMMA', 'C_VEGA', 
    'C_THETA', 'C_RHO', 'C_IV', 'P_DELTA', 'P_GAMMA', 'P_VEGA', 
    'P_THETA', 'P_RHO', 'P_IV', 'STRIKE_DISTANCE'
]

# Using DataLoader to clean leading spaces in column names
# (applied to all_data_cleaned, whose square brackets were already removed)
all_data_cleaned = DataLoader.clean_leading_spaces_in_columns(all_data_cleaned)

# Using DataLoader to select specific columns
selected_data = DataLoader.select_columns(all_data_cleaned, columns_to_keep)

# Displaying the first few rows of the selected DataFrame
selected_data.head()


Unnamed: 0,QUOTE_DATE,EXPIRE_DATE,DTE,C_DELTA,C_GAMMA,C_VEGA,C_THETA,C_RHO,C_IV,P_DELTA,P_GAMMA,P_VEGA,P_THETA,P_RHO,P_IV,STRIKE_DISTANCE
0,2023-01-03,2023-01-03,0.0,0.96551,0.00562,0.00913,-0.10519,0.00095,4.34673,-0.00075,0.00015,0.00072,-0.00483,-0.00015,1.21005,70.8
1,2023-01-03,2023-01-03,0.0,0.96015,0.00703,0.00997,-0.10512,0.00032,3.87219,-0.00093,0.00025,0.00104,-0.00487,0.0,0.99616,60.8
2,2023-01-03,2023-01-03,0.0,0.95788,0.00778,0.01014,-0.10536,0.00025,3.68261,-0.0014,0.0002,0.00105,-0.00538,-7e-05,0.91199,56.8
3,2023-01-03,2023-01-03,0.0,0.96337,0.0081,0.00901,-0.06979,0.00059,3.59852,-0.00081,0.00026,0.00115,-0.005,-0.00048,0.89065,55.8
4,2023-01-03,2023-01-03,0.0,0.956,0.00817,0.01064,-0.10918,0.00021,3.59029,-0.00119,0.00022,0.00043,-0.00533,-0.00044,0.87004,54.8


The dataset now contains only the columns that are directly relevant to the research questions: the quote and expiry dates, days to expiry, the Greeks and implied volatilities for both call and put options, and the strike distance.

####  Next Steps:
We can proceed to handle missing values and perform other necessary data preprocessing steps as part of the data summary and cleaning process.

### 2.4 Further Cleaning
Next steps in data cleaning and summarization are:

#### 2.4.1 Handling Missing Values

-   Identify any columns or rows with missing values.
-   Decide on a strategy to handle them, such as removing or imputing missing values.
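
A missing-value count can be sketched with `isnull().sum()`, assuming `check_missing_values` wraps a call like this. The rows are made up. The sketch also shows a general pandas behavior worth keeping in mind: strings that contain only whitespace count as ordinary values, so they do not appear in the missing-value count.

```python
import pandas as pd

# Made-up rows: one real missing value in DTE, one whitespace-only string in C_IV
df = pd.DataFrame(
    {
        "C_IV": [" 4.346730", " ", " 0.443730"],
        "DTE": [0.0, None, 1.0],
    }
)

# Number of missing (NaN/None) entries per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Whitespace-only strings are not missing values as far as pandas is concerned
blank_strings = df["C_IV"].str.strip().eq("").sum()
print(blank_strings)
```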


In [10]:
# Using DataLoader to check for missing values in the DataFrame
missing_values = DataLoader.check_missing_values(selected_data)

# Displaying the number of missing values in each column
missing_values


QUOTE_DATE         0
EXPIRE_DATE        0
DTE                0
C_DELTA            0
C_GAMMA            0
C_VEGA             0
C_THETA            0
C_RHO              0
C_IV               0
P_DELTA            0
P_GAMMA            0
P_VEGA             0
P_THETA            0
P_RHO              0
P_IV               0
STRIKE_DISTANCE    0
dtype: int64

There are no missing values in our dataset. Every column has complete data, which simplifies the cleaning process.



#### 2.4.2 Data Type Conversion

-   Ensure that each column is of the appropriate data type for analysis. For example, date columns should have a date data type.

Specifically, we convert the `QUOTE_DATE` and `EXPIRE_DATE` columns to a datetime data type, which facilitates any time-based analysis we might want to perform later on.
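
The conversion can be sketched with `pandas.to_datetime`, assuming `convert_to_datetime` applies it column by column. The dates are made up. Stripping whitespace first and passing an explicit `format` are choices made for this sketch.

```python
import pandas as pd

# Made-up date strings with the leading space seen in the raw values
df = pd.DataFrame(
    {
        "QUOTE_DATE": [" 2023-01-03", " 2023-01-04"],
        "EXPIRE_DATE": [" 2023-01-03", " 2023-01-06"],
    }
)

for column in ["QUOTE_DATE", "EXPIRE_DATE"]:
    # Strip whitespace, then parse with an explicit year-month-day format
    df[column] = pd.to_datetime(df[column].str.strip(), format="%Y-%m-%d")

print(df.dtypes)

# Datetime columns support arithmetic, e.g. days between quote and expiry
days_to_expiry = (df["EXPIRE_DATE"] - df["QUOTE_DATE"]).dt.days
print(days_to_expiry.tolist())
```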

In [11]:
# Columns to convert to datetime data type
columns_to_convert = ['QUOTE_DATE', 'EXPIRE_DATE']

# Using DataLoader to convert specified columns to datetime data type
converted_data = DataLoader.convert_to_datetime(selected_data.copy(), columns_to_convert)

# Displaying data types of each column to verify the changes
converted_data.dtypes


QUOTE_DATE         datetime64[ns]
EXPIRE_DATE        datetime64[ns]
DTE                       float64
C_DELTA                   float64
C_GAMMA                   float64
C_VEGA                    float64
C_THETA                   float64
C_RHO                     float64
C_IV                       object
P_DELTA                   float64
P_GAMMA                   float64
P_VEGA                    float64
P_THETA                   float64
P_RHO                     float64
P_IV                       object
STRIKE_DISTANCE           float64
dtype: object

The `QUOTE_DATE` and `EXPIRE_DATE` columns have been successfully converted to datetime data types, which will be helpful for any time-based analyses we might want to conduct later.

#### Observations:

-   Datetime Conversion:
    -   The `QUOTE_DATE` and `EXPIRE_DATE` columns are now in datetime format.
-   Other Data Types:
    -   Most other columns, such as the Greeks and `STRIKE_DISTANCE`, are in a numeric format (float64), which is suitable for mathematical operations and analyses.
    -   However, the `C_IV` and `P_IV` (implied volatilities) columns are object types, likely due to some non-numeric values or inconsistencies in the data.

#### Next Steps:

-   We might want to further investigate and clean the `C_IV` and `P_IV` columns to ensure that they are in a numeric format suitable for analysis.

#### 2.4.3 Descriptive Statistics

-   Calculate basic descriptive statistics for the numeric columns to understand the distribution of values.

Before calculating statistics, let's handle the `C_IV` and `P_IV` columns by doing the following:

1.  Investigation:
    -   Find out why these columns are not being recognized as numeric.
    -   Check for any non-numeric or inconsistent values.
2.  Conversion:
    -   Convert these columns to a numeric data type, handling or removing any non-numeric values if necessary.

In [12]:
# Columns to check for unique values
columns_to_check = ['C_IV', 'P_IV']

# Using DataLoader to get unique values from specified columns
unique_values = DataLoader.get_unique_values(converted_data, columns_to_check)

# Displaying unique values to identify any inconsistencies
unique_values


{'C_IV': array([' 4.346730', ' 3.872190', ' 3.682610', ..., ' 0.443730',
        ' 0.430870', ' 0.287580'], dtype=object),
 'P_IV': array([' 1.210050', ' 0.996160', ' 0.911990', ..., ' 0.144610',
        ' 0.339160', ' 0.149600'], dtype=object)}

It appears that the `C_IV` and `P_IV` columns are recognized as object data types because the values are being read as strings (text) rather than numeric values. This is likely due to the presence of spaces or other non-numeric characters.

#### Solution:

-   We can convert these columns to a numeric data type, which will make them more suitable for mathematical computations and analyses.
-   Any non-numeric values or inconsistencies will be converted to NaN (Not a Number).
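
A minimal sketch of such a conversion uses `pandas.to_numeric` with `errors="coerce"`. It assumes `convert_to_numeric` behaves similarly. The first two strings mirror the unique values shown above, and the last two are made-up examples of entries that cannot be parsed.

```python
import pandas as pd

# Two valid strings with leading spaces, one blank entry, one made-up text entry
values = pd.Series([" 4.346730", " 0.996160", " ", "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN
converted = pd.to_numeric(values.str.strip(), errors="coerce")

print(converted.dtype)
print(converted.isna().sum())
```

Counting the NaN values right after the conversion shows how many entries could not be parsed.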

In [13]:
# Columns to convert to numeric data type
columns_to_convert_numeric = ['C_IV', 'P_IV']

# Using DataLoader to convert specified columns to numeric data type
numeric_converted_data = DataLoader.convert_to_numeric(converted_data.copy(), columns_to_convert_numeric)

# Displaying data types of each column to verify the changes, and checking for any new missing values
numeric_converted_data.dtypes, numeric_converted_data.isnull().sum()


(QUOTE_DATE         datetime64[ns]
 EXPIRE_DATE        datetime64[ns]
 DTE                       float64
 C_DELTA                   float64
 C_GAMMA                   float64
 C_VEGA                    float64
 C_THETA                   float64
 C_RHO                     float64
 C_IV                      float64
 P_DELTA                   float64
 P_GAMMA                   float64
 P_VEGA                    float64
 P_THETA                   float64
 P_RHO                     float64
 P_IV                      float64
 STRIKE_DISTANCE           float64
 dtype: object,
 QUOTE_DATE             0
 EXPIRE_DATE            0
 DTE                    0
 C_DELTA                0
 C_GAMMA                0
 C_VEGA                 0
 C_THETA                0
 C_RHO                  0
 C_IV                7006
 P_DELTA                0
 P_GAMMA                0
 P_VEGA                 0
 P_THETA                0
 P_RHO                  0
 P_IV               29226
 STRIKE_DISTANCE        0
 dtype: 

The `C_IV` and `P_IV` columns have been successfully converted to numeric data types. However, in the process, some values that couldn't be converted to numbers have become NaN (Not a Number). Here's a summary:

##### i. Conversion to Numeric:
-   The `C_IV` and `P_IV` columns are now of type float64, suitable for mathematical operations.

##### ii. Handling NaN Values:
-   There are 7,006 NaN values in the `C_IV` column and 29,226 in the `P_IV` column.
-   These NaN values need to be addressed. We could either remove the rows with NaN values or fill them with a specific value, like the mean or median of the respective columns.

##### Next Steps:
-   We remove the rows with NaN values in the `C_IV` and `P_IV` columns rather than imputing them.
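
The two options mentioned above, dropping rows or filling them, can be sketched on made-up values as follows. `dropna(subset=...)` is assumed to be roughly what `remove_na_rows` does. The median fill is shown only for comparison.

```python
import pandas as pd

# Made-up values with one NaN in each implied-volatility column
df = pd.DataFrame(
    {
        "C_IV": [4.35, None, 0.44, 0.43],
        "P_IV": [1.21, 0.99, None, 0.15],
        "DTE": [0.0, 0.0, 1.0, 1.0],
    }
)

# Option 1: drop every row that has NaN in either column
dropped = df.dropna(subset=["C_IV", "P_IV"])
rows_removed = len(df) - len(dropped)

# Option 2: fill each column's NaN with that column's median
imputed = df.fillna({"C_IV": df["C_IV"].median(), "P_IV": df["P_IV"].median()})

print(dropped.shape, rows_removed)
print(list(dropped.index))  # original row labels are kept after dropping
```

Dropping keeps only observed values but leaves gaps in the index. Calling `reset_index(drop=True)` renumbers the rows if needed.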

In [14]:
# Columns to check for NaN values
columns_to_check_na = ['C_IV', 'P_IV']

# Using DataLoader to remove rows where specified columns have NaN values
cleaned_data = DataLoader.remove_na_rows(numeric_converted_data.copy(), columns_to_check_na)

# Displaying the shape of the DataFrame and checking for any remaining missing values
cleaned_data.shape, cleaned_data.isnull().sum()


((209463, 16),
 QUOTE_DATE         0
 EXPIRE_DATE        0
 DTE                0
 C_DELTA            0
 C_GAMMA            0
 C_VEGA             0
 C_THETA            0
 C_RHO              0
 C_IV               0
 P_DELTA            0
 P_GAMMA            0
 P_VEGA             0
 P_THETA            0
 P_RHO              0
 P_IV               0
 STRIKE_DISTANCE    0
 dtype: int64)

The rows containing NaN values in the `C_IV` and `P_IV` columns have been successfully removed. Now, our dataset is cleaner and more suitable for analysis.

#### Summary of Changes:

-   Rows Removed:
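    -   Rows with NaN values in the `C_IV` or `P_IV` column were removed so that every remaining row has valid implied volatility values. This drops 36,232 rows (245,695 - 209,463), which equals 7,006 + 29,226, so no row was missing both values.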
-   Current Dataset Shape:
    -   The dataset now contains 209,463 rows and 16 columns, and there are no missing values.

In [15]:
# Using DataLoader to display basic info of the DataFrame
DataLoader.display_basic_info(cleaned_data)

# Using DataLoader to display statistical summary of the DataFrame
DataLoader.display_statistical_summary(cleaned_data)

# Columns to display unique values
columns_to_display_unique = ['QUOTE_DATE', 'EXPIRE_DATE']

# Using DataLoader to display number of unique values in specified columns
DataLoader.display_unique_values(cleaned_data, columns_to_display_unique)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 209463 entries, 0 to 245668
Data columns (total 16 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   QUOTE_DATE       209463 non-null  datetime64[ns]
 1   EXPIRE_DATE      209463 non-null  datetime64[ns]
 2   DTE              209463 non-null  float64       
 3   C_DELTA          209463 non-null  float64       
 4   C_GAMMA          209463 non-null  float64       
 5   C_VEGA           209463 non-null  float64       
 6   C_THETA          209463 non-null  float64       
 7   C_RHO            209463 non-null  float64       
 8   C_IV             209463 non-null  float64       
 9   P_DELTA          209463 non-null  float64       
 10  P_GAMMA          209463 non-null  float64       
 11  P_VEGA           209463 non-null  float64       
 12  P_THETA          209463 non-null  float64       
 13  P_RHO            209463 non-null  float64       
 14  P_IV             209

{'QUOTE_DATE': 62, 'EXPIRE_DATE': 92}
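
The source of the three display helpers is not shown in this section. A minimal sketch of comparable pandas calls follows, assuming the helpers wrap `DataFrame.info()`, `DataFrame.describe()`, and `Series.nunique()`. The values are made up.

```python
import pandas as pd

# Made-up rows: two distinct quote dates, three implied-volatility pairs
df = pd.DataFrame(
    {
        "QUOTE_DATE": pd.to_datetime(["2023-01-03", "2023-01-03", "2023-01-04"]),
        "C_IV": [4.35, 3.87, 0.44],
        "P_IV": [1.21, 0.99, 0.15],
    }
)

# Column names, non-null counts, and dtypes
df.info()

# Count, mean, std, min, quartiles, and max for the numeric columns
summary = df[["C_IV", "P_IV"]].describe()
print(summary)

# Number of distinct values per selected column
unique_counts = {column: df[column].nunique() for column in ["QUOTE_DATE"]}
print(unique_counts)
```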

### 2.5 Data Summary

#### Data Acquisition:

-   The dataset consists of options data for SPY (S&P 500 ETF), obtained as text files. Since I am very interested in trading options, I researched sources of options data for Project 1 - EDA. We acquired this data from https://www.optionsdx.com/, which provides free historical options data. This particular options data belongs to the SPY ETF. SPY options are American-style, meaning option holders can exercise their options at any time before expiration. SPY is one of the most heavily traded ETFs and is the oldest ETF still trading.

#### Use Cases and Attributes:

-   The cleaned dataset provides 209,463 use cases (rows), each representing an option data point with attributes such as Greeks, implied volatilities, and date-related information.
-   Each use case has 16 attributes: the Greeks ('C_DELTA', 'C_GAMMA', 'C_VEGA', 'C_THETA', 'C_RHO', 'P_DELTA', 'P_GAMMA', 'P_VEGA', 'P_THETA', 'P_RHO'), implied volatilities ('C_IV', 'P_IV'), date-related information ('QUOTE_DATE', 'EXPIRE_DATE', 'DTE'), and the strike distance ('STRIKE_DISTANCE').

#### Data Types:

-   Columns like 'QUOTE_DATE' and 'EXPIRE_DATE' are datetime types, providing precise date information.
-   Greeks and implied volatilities are float types, which are suitable for mathematical analysis and model building.

### 2.6 Saving Cleaned Data
We save the cleaned data as 'cleaned_data.csv' so that it can be reused in subsequent parts of the analysis.

In [16]:
# File path to save the cleaned data
file_path_to_save = 'cleaned_data.csv'

# Using DataLoader to save the cleaned DataFrame to a CSV file
DataLoader.save_to_csv(cleaned_data, file_path_to_save)
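
A save-and-reload round trip can be sketched as below, assuming `save_to_csv` wraps `DataFrame.to_csv`. The two rows are illustrative, and a temporary directory is used so the sketch leaves no file behind. A CSV file stores dates as text, so the datetime columns need to be parsed again when the file is read back.

```python
import os
import tempfile

import pandas as pd

# Illustrative rows with one datetime column and one float column
df = pd.DataFrame(
    {
        "QUOTE_DATE": pd.to_datetime(["2023-01-03", "2023-01-04"]),
        "C_IV": [4.34673, 0.44373],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_data.csv")

    # index=False keeps the row index out of the file
    df.to_csv(path, index=False)

    # Without parse_dates the date column comes back as text
    plain = pd.read_csv(path)

    # parse_dates restores the datetime dtype on load
    restored = pd.read_csv(path, parse_dates=["QUOTE_DATE"])

print(plain.dtypes)
print(restored.dtypes)
```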
