# Week 3 EDA
### Tasks:
- Code for Data split 
- Code for EDA

### Data Split:
Because the data is structured as panel data, it must be split chronologically (in ascending date order) at a cutoff decided by the team, rather than at random.

The earlier portion of the data (through December 2023) will be used to predict the most recent months, from January 2024 onward.

Cross-validation for time series can be done later after the train/test split is performed. (Link: https://otexts.com/fpp3/tscv.html)
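
As a rough illustration of the time series cross-validation idea linked above, the sketch below generates expanding-window (rolling-origin) folds over a sorted list of months. The function name `expanding_window_splits`, its parameters, and the toy month list are hypothetical and not part of this notebook; it is a minimal sketch, not the team's chosen validation scheme.

```python
from datetime import date

def expanding_window_splits(periods, initial_train_size, horizon=1, step=1):
    """Yield (train_periods, test_periods) pairs for rolling-origin evaluation.

    `periods` must already be sorted in ascending time order.
    """
    origin = initial_train_size
    while origin + horizon <= len(periods):
        # Train on everything before the origin, test on the next `horizon` periods
        yield periods[:origin], periods[origin:origin + horizon]
        origin += step

# Hypothetical monthly index: January 2023 through December 2023
months = [date(2023, m, 1) for m in range(1, 13)]

folds = list(expanding_window_splits(months, initial_train_size=9, horizon=1))
for train, test in folds:
    print(f"train: {train[0]} .. {train[-1]}  ->  test: {test[0]}")
```

Each fold only ever tests on months that come after every training month, which is the property a random K-fold split would violate.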

### Notes to this Step:
Although the data is split at this point, it is not yet ready to be modeled. EDA must still be performed, and strictly on the training data. The data can also be aggregated to look for a more robust way to predict encounters.

### Approaches to take:
- Aggregated Barplots
- Decomposition Methods
- Aggregated Time Series Plot
- Geographic Heat Maps
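
Two of the approaches above (an aggregated time series and a simple decomposition) can be sketched with plain pandas. The data here is a synthetic stand-in, not the real training set; only the column names `Year-Date`, `AOR (Abbv)`, and `Encounter Count` are taken from the notebook. The centered 12-month rolling mean is just a rough trend estimate, assumed here for illustration; `statsmodels.tsa.seasonal.seasonal_decompose` would be a fuller option.

```python
import pandas as pd

# Synthetic stand-in for the training data: two sectors, 24 months each
dates = pd.date_range("2022-01-01", periods=24, freq="MS")
toy_train = pd.DataFrame({
    "Year-Date": list(dates) * 2,
    "AOR (Abbv)": ["A"] * 24 + ["B"] * 24,
    "Encounter Count": range(48),
})

# Aggregated time series: total encounters per month across all sectors
monthly_total = (
    toy_train.groupby("Year-Date")["Encounter Count"].sum().sort_index()
)

# Rough trend: centered 12-month rolling mean (NaN near the series edges)
trend = monthly_total.rolling(window=12, center=True).mean()

# What remains after removing the trend (seasonality plus noise)
detrended = monthly_total - trend

# Aggregated totals per sector, suitable for a bar plot
by_sector = toy_train.groupby("AOR (Abbv)")["Encounter Count"].sum()

print(monthly_total.head())
print(by_sector)
```

With matplotlib installed, `monthly_total.plot()` and `by_sector.plot.bar()` would turn these aggregates into the time series plot and bar plot listed above.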

In [1]:
# Standard Setup Import
from _Setup import *

### Data Cleaning
The dates in the data need a slight rework. The following code creates a date column for each observation's month to make graphing easier.

In [2]:
# Import sector-level data
sector_df = pd.read_csv(sector_data_csv_path)
sector_df.head()


Unnamed: 0,Fiscal Year,Month Grouping,Month (abbv),Component,Land Border Region,Area of Responsibility,AOR (Abbv),Demographic,Citizenship,Title of Authority,Encounter Type,Encounter Count
0,2020,FYTD,OCT,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,FMUA,BRAZIL,Title 8,Inadmissibles,2
1,2020,FYTD,OCT,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,FMUA,OTHER,Title 8,Inadmissibles,29
2,2020,FYTD,OCT,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,Single Adults,BRAZIL,Title 8,Inadmissibles,1
3,2020,FYTD,OCT,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,Single Adults,CANADA,Title 8,Inadmissibles,1031
4,2020,FYTD,OCT,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,Single Adults,"CHINA, PEOPLES REPUBLIC OF",Title 8,Inadmissibles,9


In [3]:
sector_df.replace(to_replace="2025 (FYTD)", value=2025, inplace=True)
sector_df.replace(to_replace='2024', value=2024, inplace=True)
print(sector_df['Fiscal Year'].unique())

[2020 2021 2022 2023 2024 2025]



In [4]:
# Create a dictionary mapping month abbreviations (uppercase) to month numbers
month_abbr_to_num = {
    'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
    'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12
}

# Convert Fiscal Year and Month (abbv) to the first day of the matching calendar month
def convert_to_fiscal_year_date(row):
    month_num = month_abbr_to_num[row['Month (abbv)'].upper()]
    calendar_year = int(row['Fiscal Year'])  # Start from the fiscal year as an integer
    
    # The fiscal year starts in October, so Oct - Dec of fiscal year N
    # fall in calendar year N - 1; Jan - Sep keep the fiscal year as-is
    if month_num >= 10:
        calendar_year -= 1
    
    # Format the calendar year and month into a date string
    return f"{calendar_year}-{month_num:02d}-01"

# Apply the function row by row to create a Year-Date column in sector_df
sector_df['Year-Date'] = sector_df.apply(convert_to_fiscal_year_date, axis=1)

# Convert the new column to datetime format
sector_df['Year-Date'] = pd.to_datetime(sector_df['Year-Date'], format='%Y-%m-%d')

# Display the unique Year-Date values to confirm the conversion
print(sector_df['Year-Date'].unique())

<DatetimeArray>
['2019-10-01 00:00:00', '2019-11-01 00:00:00', '2019-12-01 00:00:00',
 '2020-01-01 00:00:00', '2020-02-01 00:00:00', '2020-03-01 00:00:00',
 '2020-04-01 00:00:00', '2020-05-01 00:00:00', '2020-06-01 00:00:00',
 '2020-07-01 00:00:00', '2020-08-01 00:00:00', '2020-09-01 00:00:00',
 '2020-10-01 00:00:00', '2020-11-01 00:00:00', '2020-12-01 00:00:00',
 '2021-01-01 00:00:00', '2021-02-01 00:00:00', '2021-03-01 00:00:00',
 '2021-04-01 00:00:00', '2021-05-01 00:00:00', '2021-06-01 00:00:00',
 '2021-07-01 00:00:00', '2021-08-01 00:00:00', '2021-09-01 00:00:00',
 '2021-10-01 00:00:00', '2021-11-01 00:00:00', '2021-12-01 00:00:00',
 '2022-01-01 00:00:00', '2022-02-01 00:00:00', '2022-03-01 00:00:00',
 '2022-04-01 00:00:00', '2022-05-01 00:00:00', '2022-06-01 00:00:00',
 '2022-07-01 00:00:00', '2022-08-01 00:00:00', '2022-09-01 00:00:00',
 '2022-10-01 00:00:00', '2022-11-01 00:00:00', '2022-12-01 00:00:00',
 '2023-01-01 00:00:00', '2023-02-01 00:00:00', '2023-03-01 00:00:00',
 '20
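
The row-wise `apply` above can be slow on a large frame. A vectorized alternative is sketched below under the same fiscal-year rule (October through December of fiscal year N belong to calendar year N - 1). The small `toy` frame is made up for illustration; this is one possible rewrite, not the notebook's actual code.

```python
import pandas as pd

month_abbr_to_num = {
    'JAN': 1, 'FEB': 2, 'MAR': 3, 'APR': 4, 'MAY': 5, 'JUN': 6,
    'JUL': 7, 'AUG': 8, 'SEP': 9, 'OCT': 10, 'NOV': 11, 'DEC': 12
}

# Made-up rows that exercise both sides of the fiscal-year boundary
toy = pd.DataFrame({
    "Fiscal Year": [2020, 2020, 2020, 2021],
    "Month (abbv)": ["OCT", "DEC", "JAN", "SEP"],
})

# Map abbreviations to month numbers for the whole column at once
month_num = toy["Month (abbv)"].str.upper().map(month_abbr_to_num)

# Subtract one year only where the month is October, November, or December
calendar_year = toy["Fiscal Year"].astype(int) - (month_num >= 10).astype(int)

# pd.to_datetime assembles dates from year/month/day columns
toy["Year-Date"] = pd.to_datetime(
    pd.DataFrame({"year": calendar_year, "month": month_num, "day": 1})
)

print(toy)
```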

### Train/Test Split
The data will be split so that December 2023 is the last month of the training data. Our goal is to predict January 2024 onward.

In [5]:
train_df = sector_df[sector_df['Year-Date'] < '2024-01-01']
test_df = sector_df[sector_df['Year-Date'] >= '2024-01-01']

test_df.head()

Unnamed: 0,Fiscal Year,Month Grouping,Month (abbv),Component,Land Border Region,Area of Responsibility,AOR (Abbv),Demographic,Citizenship,Title of Authority,Encounter Type,Encounter Count,Year-Date
57644,2024,Remaining,JAN,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,Accompanied Minors,CANADA,Title 8,Inadmissibles,1,2024-01-01
57645,2024,Remaining,JAN,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,Accompanied Minors,OTHER,Title 8,Inadmissibles,1,2024-01-01
57646,2024,Remaining,JAN,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,FMUA,BRAZIL,Title 8,Inadmissibles,42,2024-01-01
57647,2024,Remaining,JAN,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,FMUA,CANADA,Title 8,Inadmissibles,24,2024-01-01
57648,2024,Remaining,JAN,Office of Field Operations,Northern Land Border,Boston Field Office,Boston,FMUA,"CHINA, PEOPLES REPUBLIC OF",Title 8,Inadmissibles,38,2024-01-01
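
A quick sanity check on a chronological split like the one above is to confirm that the two date ranges do not overlap and that no rows were lost. The helper `check_time_split` and the toy frame below are hypothetical, added only to illustrate the check.

```python
import pandas as pd

def check_time_split(train, test, date_col="Year-Date"):
    """Summarize a chronological split; raise if the date ranges overlap."""
    train_end = train[date_col].max()
    test_start = test[date_col].min()
    if train_end >= test_start:
        raise ValueError(
            f"Train ends at {train_end}, which is not before test start {test_start}"
        )
    return {
        "train_end": train_end,
        "test_start": test_start,
        "train_rows": len(train),
        "test_rows": len(test),
    }

# Toy frame standing in for sector_df
toy = pd.DataFrame({
    "Year-Date": pd.to_datetime(
        ["2023-11-01", "2023-12-01", "2024-01-01", "2024-02-01"]
    ),
    "Encounter Count": [5, 7, 3, 9],
})

# Same cutoff logic as the split above
cutoff = pd.Timestamp("2024-01-01")
toy_train = toy[toy["Year-Date"] < cutoff]
toy_test = toy[toy["Year-Date"] >= cutoff]

summary = check_time_split(toy_train, toy_test)
print(summary)
```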


### Save the Train/Test dataframes to avoid redoing this transformation

In [6]:
train_df.to_csv(sector_data_csv_path_train, index=False)
test_df.to_csv(sector_data_csv_path_test, index=False)
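
One thing to keep in mind when reloading these files: CSV does not store dtypes, so `Year-Date` comes back as text unless it is parsed again. The sketch below shows the difference using an in-memory buffer in place of the real paths; the toy frame is made up for illustration.

```python
import io
import pandas as pd

# Toy frame standing in for train_df
toy_train = pd.DataFrame({
    "Year-Date": pd.to_datetime(["2023-11-01", "2023-12-01"]),
    "Encounter Count": [5, 7],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_train.to_csv(buffer, index=False)

# Default read: the date column is loaded as plain text
buffer.seek(0)
naive = pd.read_csv(buffer)

# Asking pandas to parse the column restores the datetime dtype
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Year-Date"])

print(naive["Year-Date"].dtype)
print(parsed["Year-Date"].dtype)
```

Passing `parse_dates=['Year-Date']` when reading the saved train and test files should keep date comparisons and time-based grouping working in later notebooks.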