# Date Cleaning Loop for Data Pulled from Yahoo Finance API


### Background

When working with historical data, formats often don't map directly from one dataset to another. In scenarios like this, we need to clean and transform the data so that the datasets can be joined. Dates are a common example: the same value can be written in many ways (MM-DD-YY, MM-DD-YYYY, MM/DD/YY, MM/DD/YYYY, YY-MM-DD, etc.), which causes all kinds of problems when combining datasets during the initial steps of data exploration and data cleaning.

In this specific script, I initially tried to transform the Date columns within the dataframes and merge them together, but received error messages about incorrect formatting. Instead, I found it worked better to write the dataframes back out to CSV files and then read the same information back in from those files.
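As a rough illustration of why the CSV round trip helps, the sketch below writes a small dataframe with a timezone-aware Date column to an in-memory CSV and reads it back. The sample values, the `America/New_York` timezone, and the `Close` column are assumptions for illustration, not data taken from the original files.

```python
import io

import pandas as pd

# Hypothetical sample mimicking a timezone-aware Date column.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2007-11-15", "2007-11-16"]).tz_localize("America/New_York"),
    "Close": [10.0, 10.5],
})

# Write to an in-memory CSV and read it straight back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)

# Without parse_dates, the Date values come back as plain text.
first_value = round_tripped["Date"].iloc[0]
print(type(first_value).__name__, first_value)
```

Once the column is plain text, ordinary string operations can be applied to it, which is what the rest of this script relies on.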


### Summary

This Python script transforms the values in the "Date" column, which come through as <i>"2007-11-15 00:00:00-05:00"</i>. It removes the time component and keeps only the date in YYYY-MM-DD format, in this example <i>"2007-11-15"</i>.

After doing so, we save the updated dataframe to a new CSV file in a new folder. Because we are working with multiple CSV files covering multiple financial instruments, all from the same data source and with the same structure and formatting, we use a loop to apply the transformation to every CSV file and write the results into a new, separate folder.
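A minimal before-and-after sketch of that transformation on a small pandas Series; the two sample values are assumed for illustration.

```python
import pandas as pd

# Two sample values in the format described above (illustrative only).
raw = pd.Series(
    ["2007-11-15 00:00:00-05:00", "2007-11-16 00:00:00-05:00"],
    name="Date",
)

# Split on the space and keep the first piece, which is the date.
cleaned = raw.str.split(" ").str[0]

print(cleaned.tolist())  # ['2007-11-15', '2007-11-16']
```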


### Process

To address this issue and ensure seamless integration with other time series datasets, we are implementing a data cleaning loop specifically designed to process the files obtained from the Yahoo Finance API. This loop will:

1. Identify and isolate the "Date" column in each dataset.
2. Remove any extraneous text from the "Date" values, leaving only the date information.
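3. Convert the cleaned date strings into the datetime64 format, ensuring uniformity with standard time series data formats. Because CSV files store dates as text, this conversion has to happen when the cleaned files are loaded; the script below covers steps 1 and 2.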

<b>Example:</b>
```python
df['Date'] = df['Date'].str.split(' ').str[0]
```
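The one-liner above covers steps 1 and 2. Step 3 could be sketched as follows, assuming `pd.to_datetime` with an explicit format is an acceptable way to reach a datetime64 column; the sample dataframe and its `Close` column are made up for illustration.

```python
import pandas as pd

# Illustrative data in the raw format.
df = pd.DataFrame({
    "Date": ["2007-11-15 00:00:00-05:00", "2007-11-16 00:00:00-05:00"],
    "Close": [10.0, 10.5],
})

# Steps 1 and 2: keep only the date part of each value.
df["Date"] = df["Date"].str.split(" ").str[0]

# Step 3: parse the cleaned strings into a datetime64 column.
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")

print(df["Date"].dtype)
```

Since writing to CSV turns the dates back into text, this conversion is most useful right after loading a cleaned file (or via the `parse_dates` argument of `pd.read_csv`).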

Standardizing the date format across all datasets makes merges on the date key more compatible and reliable. This, in turn, supports more accurate analysis and visualization of the combined datasets, with less manual intervention.
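As a small sketch of the payoff, two cleaned dataframes with matching YYYY-MM-DD strings can be merged directly on `Date`. The frame names, prices, and suffixes here are hypothetical.

```python
import pandas as pd

# Two hypothetical instruments whose Date columns are already cleaned.
stock_a = pd.DataFrame({"Date": ["2007-11-15", "2007-11-16"], "Close": [10.0, 10.5]})
stock_b = pd.DataFrame({"Date": ["2007-11-15", "2007-11-16"], "Close": [20.0, 19.5]})

# Merge on the shared Date key; suffixes keep the two Close columns apart.
merged = stock_a.merge(stock_b, on="Date", how="inner", suffixes=("_a", "_b"))

print(merged)
```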



```python
import pandas as pd
import os
import glob

# Read the Yahoo Finance CSV files from the "Historical_Data_Prices" folder
# and write cleaned copies to the new "Historical_Data_Prices_Cleaned" folder
input_directory = 'Historical_Data_Prices'
output_directory = 'Historical_Data_Prices_Cleaned'

# Create the output folder if it does not already exist
os.makedirs(output_directory, exist_ok=True)

# Use glob to find all CSV files in the input directory
csv_files = glob.glob(os.path.join(input_directory, '*.csv'))

# Process each CSV file
for file_path in csv_files:
    # Load the CSV file into a DataFrame
    df = pd.read_csv(file_path)

    # Files without a 'Date' column are skipped and not copied
    if 'Date' in df.columns:
        # Keep only the date part (the text before the first space)
        df['Date'] = df['Date'].str.split(' ').str[0]

        # Define the new path for the cleaned CSV file
        file_name = os.path.basename(file_path)
        cleaned_file_path = os.path.join(output_directory, file_name)

        # Save the cleaned DataFrame to a new CSV file in the output directory
        df.to_csv(cleaned_file_path, index=False)

print("Done.")
```

Output:

```
Done.
```
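One way to check the loop without touching the real folders is to wrap it in a function and run it against a temporary directory. This is only a sketch: `clean_folder`, the sample file names, and their contents are hypothetical, and unlike the script above it also reports which files were skipped for lacking a `Date` column.

```python
import glob
import os
import tempfile

import pandas as pd


def clean_folder(input_directory, output_directory):
    """Hypothetical wrapper around the loop; returns names of skipped files."""
    os.makedirs(output_directory, exist_ok=True)
    skipped = []
    for file_path in sorted(glob.glob(os.path.join(input_directory, "*.csv"))):
        df = pd.read_csv(file_path)
        file_name = os.path.basename(file_path)
        if "Date" not in df.columns:
            skipped.append(file_name)
            continue
        # Same cleaning step as the script: keep the text before the first space.
        df["Date"] = df["Date"].str.split(" ").str[0]
        df.to_csv(os.path.join(output_directory, file_name), index=False)
    return skipped


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.path.join(tmp, "raw")
    clean_dir = os.path.join(tmp, "clean")
    os.makedirs(raw_dir)

    # One file in the expected format, one without a Date column.
    with open(os.path.join(raw_dir, "AAA.csv"), "w") as f:
        f.write("Date,Close\n2007-11-15 00:00:00-05:00,10.0\n")
    with open(os.path.join(raw_dir, "notes.csv"), "w") as f:
        f.write("Symbol,Name\nAAA,Example\n")

    skipped = clean_folder(raw_dir, clean_dir)
    cleaned_dates = pd.read_csv(os.path.join(clean_dir, "AAA.csv"))["Date"].tolist()

print(skipped, cleaned_dates)  # ['notes.csv'] ['2007-11-15']
```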
