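# Preprocessing approach
The public validation dataset shows that the task is to predict the number of available bikes at 20-minute intervals from the date, station ID, and time, so the training data must be organized to contain that information. To keep the dataset flexible to work with, the code below runs in stages. The main work is adding latitude and longitude and removing invalid data. Every stage still writes its result to a CSV file, because the evaluation formula needs tot (the number of docks at a station). The final train_data comes in two versions (`train_data.csv` with time in minutes and `train_data_time_convert.csv` with time as HH:MM), mainly to keep flexibility in how time is represented.

After processing, the training set has the features date, station_id, time, latitude, and longitude, and the goal is to predict sbi from these five. The validation dataset therefore has to be converted to the same format. How the model handles these features is decided later.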


## ENV Check

In [None]:
!nvidia-smi

In [None]:
! pwd

## Data Processing and Feature Acquisition
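* Writes CSV files named aggregated_data_YYYYMMDD to the root directory; there is a lot of data, so this takes a while
* Once the export finished, I manually created a folder named "aggreated_data" in the root directory and moved these CSV files into it (I didn't bother doing it in code)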

In [None]:
import os
import json
import pandas as pd
from datetime import datetime, timedelta

def convert_time_to_minutes(time_str):
    hours, minutes = map(int, time_str.split(':'))
    return hours * 60 + minutes

def process_json_file(json_path, site_id):
    with open(json_path, 'r') as file:
        data = json.load(file)

    # Get the sorted list of times from the data
    sorted_times = sorted(data.keys(), key=lambda x: convert_time_to_minutes(x))

    # Find the first non-empty data entry
    for time in sorted_times:
        if data[time]:
            last_valid_data = data[time]
            break
    else:
        last_valid_data = {'tot': 0, 'sbi': 0, 'bemp': 0, 'act': '0'}

    # Process data for every 20 minutes interval
    processed_data = []
    for minutes in range(0, 24 * 60, 20):
        current_time_str = f"{minutes // 60:02d}:{minutes % 60:02d}"
        if current_time_str in data and data[current_time_str]:
            last_valid_data = data[current_time_str]

        processed_data.append({
            'station_id': site_id,
            'Time (minutes)': minutes,
            'tot': last_valid_data.get('tot', 0),
            'sbi': last_valid_data.get('sbi', 0),
            'bemp': last_valid_data.get('bemp', 0),
            'act': last_valid_data.get('act', '0')
        })

    return processed_data

def main():
    base_path = '/data/html.2023.final.data-release/release/'
    start_date = datetime(2023, 10, 2)
    end_date = datetime(2023, 12, 8)

    current_date = start_date
    while current_date <= end_date:
        folder_name = current_date.strftime('%Y%m%d')
        folder_path = os.path.join(base_path, folder_name)
        if os.path.exists(folder_path):
            all_data = []
            for file_name in os.listdir(folder_path):
                if file_name.endswith('.json'):
                    json_path = os.path.join(folder_path, file_name)
                    # Extract site ID from file name
                    site_id = file_name.split('.')[0]
                    all_data.extend(process_json_file(json_path, site_id))

            # Convert the aggregated data into a DataFrame
            if all_data:
                df = pd.DataFrame(all_data)
                output_csv = f'/data/aggregated_data_{folder_name}.csv'
                df.to_csv(output_csv, index=False)
        current_date += timedelta(days=1)

if __name__ == '__main__':
    main()
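The forward-fill in `process_json_file` can also be written with pandas. The following is a minimal sketch, assuming each day's JSON maps zero-padded `HH:MM` keys to either a record or an empty dict, as the cell above implies. `resample_20min` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def resample_20min(day: dict, field: str = "sbi") -> pd.Series:
    """Hypothetical helper: one value per 20-minute slot, forward-filled."""
    slots = [f"{m // 60:02d}:{m % 60:02d}" for m in range(0, 24 * 60, 20)]
    # Keep only non-empty records; empty dicts are treated as missing.
    values = pd.Series(
        {t: rec[field] for t, rec in day.items() if rec}, dtype="float64"
    )
    # Only records that sit exactly on a 20-minute mark update a slot.
    on_grid = values.reindex(slots)
    filled = on_grid.ffill()
    # Slots before the first on-grid record fall back to the earliest
    # non-empty record of the day (or 0 if the whole day is empty),
    # mirroring the initial `last_valid_data` in the loop above.
    first = values.sort_index().iloc[0] if not values.empty else 0
    return filled.fillna(first)

# Made-up sample day
day = {
    "00:00": {},
    "00:03": {"sbi": 5},
    "00:20": {"sbi": 7},
    "00:40": {},
    "01:00": {"sbi": 2},
}
sbi_by_slot = resample_20min(day)
```

As in the original loop, a record between two 20-minute marks (here `00:03`) is only used to fill slots that come before the first on-grid record.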


## Capture latitude and longitude data from demographic
* Extracts the latitude and longitude from the demographic file and writes them to a CSV file

In [None]:
# Path to the demographic file (uses `json` and `pd` imported in the first cell)
demographic_file_path = '/data/html.2023.final.data-release/demographic.json'

# Function to extract station ID, latitude, and longitude and output it as a CSV
def extract_station_info_to_csv(file_path):
    with open(file_path, 'r') as file:
        data = json.load(file)
    
    # Prepare data for DataFrame
    stations_data = [{
        'station_id': sno,
        'latitude': info['lat'],
        'longitude': info['lng']
    } for sno, info in data.items()]

    # Convert to DataFrame
    df = pd.DataFrame(stations_data)
    
    # Define the CSV output path
    output_csv_path = '/data/station_info.csv'
    
    # Save as CSV
    df.to_csv(output_csv_path, index=False)
    
    return output_csv_path

# Call the function and get the path of the created CSV
csv_output_path = extract_station_info_to_csv(demographic_file_path)
csv_output_path
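The same table can be built straight from the dictionary. This is a minimal sketch, assuming `demographic.json` maps each station ID to a record containing `lat` and `lng`, which is what the cell above relies on. The sample dictionary is made up for illustration, including the extra `other_field` key.

```python
import pandas as pd

# Made-up sample in the assumed shape: {station_id: {"lat": ..., "lng": ..., ...}}
sample = {
    "500101001": {"lat": 25.02605, "lng": 121.5436, "other_field": "ignored"},
    "500101002": {"lat": 25.0, "lng": 121.5, "other_field": "ignored"},
}

station_info = (
    pd.DataFrame.from_dict(sample, orient="index")[["lat", "lng"]]
    .rename(columns={"lat": "latitude", "lng": "longitude"})
    .rename_axis("station_id")   # name the index so reset_index() yields 'station_id'
    .reset_index()
)
# JSON keys are strings; cast once here so later merges can use integer IDs.
station_info["station_id"] = station_info["station_id"].astype(int)
```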


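## Batch-process aggregated data and merge the date
* Batch-processes the files in the "aggreated_data" folder and merges them with the latitude/longitude data, then writes a single CSV dataset containing everything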
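The manual step of collecting the daily CSV files into this folder could also be scripted. The following is a minimal sketch using only the standard library. `collect_csvs` is a hypothetical helper, and the folder name keeps the spelling that the cell below expects.

```python
import shutil
from pathlib import Path

def collect_csvs(src_dir, dest_name="aggreated_data"):
    """Move every aggregated_data_*.csv in src_dir into src_dir/dest_name."""
    src = Path(src_dir)
    dest = src / dest_name
    dest.mkdir(exist_ok=True)
    moved = []
    # glob() only matches files directly inside src_dir, so files that
    # were already moved into the destination folder are not touched.
    for csv_path in sorted(src.glob("aggregated_data_*.csv")):
        shutil.move(str(csv_path), str(dest / csv_path.name))
        moved.append(csv_path.name)
    return moved
```

Calling `collect_csvs('/data')` would match the paths used in this notebook, assuming the daily files were written there.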

In [None]:
import os
import pandas as pd

# Define the base directory containing the aggregated data files
base_dir = '/data/aggreated_data/'

# Load station information
station_info_file_path = '/data/station_info.csv'
station_info_df = pd.read_csv(station_info_file_path)
station_info_df['station_id'] = station_info_df['station_id'].astype(int)

# Initialize an empty DataFrame to store the merged data
all_data_merged = pd.DataFrame()

# Iterate through each file in the directory
for file_name in os.listdir(base_dir):
    if file_name.startswith('aggregated_data_') and file_name.endswith('.csv'):
        # Extract the date from the filename
        date_str = file_name[len('aggregated_data_'):-4]
        file_path = os.path.join(base_dir, file_name)

        # Load the aggregated data file
        aggregated_data_df = pd.read_csv(file_path)
        aggregated_data_df['station_id'] = aggregated_data_df['station_id'].astype(int)

        # Add the extracted date to the DataFrame
        # Convert date to the desired format without separators (YYYYMMDD)
        aggregated_data_df['date'] = pd.to_datetime(date_str).strftime('%Y%m%d')

        # Merge with station information
        merged_df = pd.merge(aggregated_data_df, station_info_df, how='left', on='station_id')

        # Append to the overall DataFrame
        all_data_merged = pd.concat([all_data_merged, merged_df])

# Reset the index of the final DataFrame
all_data_merged.reset_index(drop=True, inplace=True)

# Save the final merged data to a new CSV file
output_file_path = '/data/merged_all_data.csv'
all_data_merged.to_csv(output_file_path, index=False)

# Output file path for download
output_file_path
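Because the merge is a left join, a station that is missing from `station_info.csv` keeps its rows but gets empty coordinates. The following sketch shows one way to spot such rows with the `indicator` option of `DataFrame.merge`. The rows are made up, and station `999` is deliberately absent from the station table.

```python
import pandas as pd

# Made-up readings; station 999 has no entry in the station table.
readings = pd.DataFrame(
    {"station_id": [500101001, 500101001, 999], "sbi": [12, 8, 3]}
)
stations = pd.DataFrame(
    {"station_id": [500101001], "latitude": [25.02605], "longitude": [121.5436]}
)

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
merged = readings.merge(stations, how="left", on="station_id", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "station_id"].unique()
clean = merged[merged["_merge"] == "both"].drop(columns="_merge")
```

Whether unmatched rows should be dropped or kept depends on how the model treats missing coordinates; the notebook itself does not filter them at this step.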



## Sort columns
* Purely cosmetic: the column order bothered me, so reorder and sort

In [None]:
# Re-import the necessary libraries
import pandas as pd

# Load the final merged dataset again due to the environment reset
final_merged_dataset_path = '/data/merged_all_data.csv'
final_merged_dataset = pd.read_csv(final_merged_dataset_path)

# Reorder the columns according to the new requirement
final_merged_dataset = final_merged_dataset[['date', 'station_id', 'Time (minutes)', 'latitude', 'longitude', 'tot', 'sbi', 'bemp', 'act']]

# Rename the columns for consistency
final_merged_dataset.rename(columns={'Time (minutes)': 'time'}, inplace=True)

# Sort the DataFrame by 'date' and 'station_id'
final_merged_dataset.sort_values(by=['date', 'station_id'], inplace=True)

# Save the sorted and reordered DataFrame to a new CSV file
sorted_reordered_csv_path = '/data/merged_all_data_sort.csv'
final_merged_dataset.to_csv(sorted_reordered_csv_path, index=False)

sorted_reordered_csv_path


## Remove tot, bemp, act
* Drops these three columns so the training data matches the validation dataset

In [None]:
# Load the dataset
file_path = '/data/merged_all_data_sort.csv'  # Replace this with your file path
data = pd.read_csv(file_path)

# Remove specified columns
data_cleaned = data.drop(['tot', 'bemp', 'act'], axis=1)

# Save the cleaned dataset
data_cleaned.to_csv('train_data.csv', index=False)

## Convert minutes to HH:MM (optional)
* Converts the time from minutes since midnight to 24-hour HH:MM format

In [None]:
# Load the dataset
file_path = 'train_data.csv'  # Same relative path the previous cell wrote to
data = pd.read_csv(file_path)

# Function to convert time format
def convert_time(time):
    hours = time // 60
    minutes = time % 60
    return f"{hours:02d}:{minutes:02d}"

# Apply the conversion to the 'time' column
data['time'] = data['time'].apply(convert_time)

# The 'time' column is now in the desired format
# You can now work with this updated dataframe or save it to a new file
# For example, to save:
data.to_csv('train_data_time_convert.csv', index=False)
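Since this conversion is optional, it is useful to be able to go back to minutes. Below is a small sketch of the round trip. `to_minutes` is a hypothetical inverse of `convert_time` and does the same job as `convert_time_to_minutes` in the first cell.

```python
def convert_time(minutes: int) -> str:
    """Minutes since midnight -> 'HH:MM' (same logic as the cell above)."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

def to_minutes(hhmm: str) -> int:
    """'HH:MM' -> minutes since midnight (hypothetical inverse)."""
    hours, minutes = map(int, hhmm.split(":"))
    return hours * 60 + minutes

# Every 20-minute slot of the day survives the round trip.
assert all(to_minutes(convert_time(m)) == m for m in range(0, 24 * 60, 20))
```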


## Convert the sample submission to test_data format

In [1]:
import pandas as pd

# Load the CSV files
sample_submission_path = 'sample_submission_stage2.csv'
station_info_path = 'station_info.csv'

sample_submission = pd.read_csv(sample_submission_path)
station_info = pd.read_csv(station_info_path)

# Extract 'station_id' from the 'id' column of sample_submission
sample_submission['station_id'] = sample_submission['id'].str.split('_').str[1].astype(int)

# Merge sample_submission with station_info on 'station_id'
merged_data = pd.merge(sample_submission, station_info, on='station_id', how='left')

# Save the merged data to a new CSV file
merged_data.to_csv('eva_marge.csv', index=False)
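The cells in this section split `id` several times to pull out one part at a time. All three parts can be extracted in one pass with `expand=True`. This is a sketch assuming every `id` has the form `date_stationid_HH:MM`, which is what the parsing in these cells implies. The sample ids are made up.

```python
import pandas as pd

# Made-up ids in the assumed 'date_stationid_HH:MM' layout
ids = pd.DataFrame(
    {"id": ["20231021_500101001_00:00", "20231021_500101001_01:20"]}
)

# One split gives three columns: date, station_id, and the HH:MM string.
parts = ids["id"].str.split("_", expand=True)
parts.columns = ["date", "station_id", "hhmm"]

# Split HH:MM into integer hours (column 0) and minutes (column 1).
hm = parts["hhmm"].str.split(":", expand=True).astype(int)

parsed = pd.DataFrame({
    "date": parts["date"].astype(int),
    "station_id": parts["station_id"].astype(int),
    "time": hm[0] * 60 + hm[1],   # minutes since midnight
})
```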


In [3]:
import pandas as pd

# Load the CSV file
eva_marge_path = 'eva_marge.csv'
eva_marge = pd.read_csv(eva_marge_path)

# Extract 'date' and 'time' (minutes since midnight) from the 'id' column
eva_marge['date'] = eva_marge['id'].str.split('_').str[0].astype(int)
eva_marge['time'] = eva_marge['id'].str.split('_').str[2].str.split(':').str[0].astype(int) * 60 + \
                    eva_marge['id'].str.split('_').str[2].str.split(':').str[1].astype(int)

# Reorder the columns to match the format of samples.csv
# Take an explicit copy so that adding 'sbi' below does not raise SettingWithCopyWarning
reordered_eva_marge = eva_marge[['date', 'station_id', 'time', 'latitude', 'longitude']].copy()
# Add an empty 'sbi' column as the rightmost column
reordered_eva_marge['sbi'] = None

# Save the reorganized data to a new CSV file
reordered_eva_marge.to_csv('test_data.csv', index=False)


In [7]:
import pandas as pd

# Path to the CSV file
# Replace the path below with the actual location of your local CSV file
train_data_path = 'train_data.csv'  # e.g. 'C:/Users/YourName/Documents/train_data.csv'

# Read the CSV file
train_data = pd.read_csv(train_data_path)

# Show the first rows of the processed dataset
print(train_data.head())

       date  station_id  time  latitude  longitude  sbi
0  20231002   500101001     0  25.02605   121.5436   12
1  20231002   500101001    20  25.02605   121.5436   12
2  20231002   500101001    40  25.02605   121.5436   12
3  20231002   500101001    60  25.02605   121.5436    8
4  20231002   500101001    80  25.02605   121.5436    8


## Main training model

In [4]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from tqdm import tqdm
import numpy as np

# Select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Path to the CSV file
train_data_path = 'train_data.csv'  # Replace with your file path

# Load the data
data = pd.read_csv(train_data_path)

# Select the features and the target variable
X = data[['date', 'station_id', 'time', 'latitude', 'longitude']].values
y = data['sbi'].values

# Split the dataset (random 80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features: fit on the training split only, then apply to the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).to(device)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).to(device)

# Define the PyTorch dataset
class TrainData(Dataset):
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
    
    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]
    
    def __len__ (self):
        return len(self.X_data)

train_data = TrainData(X_train_tensor, y_train_tensor)
train_loader = DataLoader(dataset=train_data, batch_size=64, shuffle=True)

# Define the model: two hidden layers of 64 units, one regression output
class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        self.layer_1 = nn.Linear(5, 64) 
        self.layer_2 = nn.Linear(64, 64)
        self.layer_out = nn.Linear(64, 1)
        self.relu = nn.ReLU()
        
    def forward(self, inputs):
        x = self.relu(self.layer_1(inputs))
        x = self.relu(self.layer_2(x))
        x = self.layer_out(x)
        return x

model = NeuralNet().to(device)

# Define the loss function and the optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
epochs = 100
for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        y_pred = model(X_batch)
        loss = criterion(y_pred, y_batch.unsqueeze(1))
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    # Print the average batch loss at the end of each epoch
    avg_loss = running_loss / len(train_loader)
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}")

# Evaluate the model on the held-out test split
model.eval()
with torch.no_grad():
    y_pred_tensor = model(X_test_tensor)
    y_pred = y_pred_tensor.detach().cpu().numpy()
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    print(f"Mean squared error (MSE): {mse}")
    print(f"Root mean squared error (RMSE): {rmse}")


Epoch [1/100], Loss: 61.0923
Epoch [2/100], Loss: 58.5671
Epoch [3/100], Loss: 57.3840
Epoch [4/100], Loss: 56.6302
Epoch [5/100], Loss: 56.0591
Epoch [6/100], Loss: 55.5898
Epoch [7/100], Loss: 55.2353
Epoch [8/100], Loss: 54.9222
Epoch [9/100], Loss: 54.6454
Epoch [10/100], Loss: 54.3851
Epoch [11/100], Loss: 54.1126
Epoch [12/100], Loss: 53.8458
Epoch [13/100], Loss: 53.6126
Epoch [14/100], Loss: 53.3977
Epoch [15/100], Loss: 53.2059
Epoch [16/100], Loss: 53.0498
Epoch [17/100], Loss: 52.9287
Epoch [18/100], Loss: 52.8081
Epoch [19/100], Loss: 52.6917
Epoch [20/100], Loss: 52.5783
Epoch [21/100], Loss: 52.4621
Epoch [22/100], Loss: 52.3569
Epoch [23/100], Loss: 52.2388
Epoch [24/100], Loss: 52.1460
Epoch [25/100], Loss: 52.0467
Epoch [26/100], Loss: 51.9702
Epoch [27/100], Loss: 51.8666
Epoch [28/100], Loss: 51.7764
Epoch [29/100], Loss: 51.6906
Epoch [30/100], Loss: 51.6057
Epoch [31/100], Loss: 51.5485
Epoch [32/100], Loss: 51.4921
Epoch [33/100], Loss: 51.4271
Epoch [34/100], Los
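`train_test_split` shuffles rows, so the held-out set contains 20-minute slots that sit between training slots from the same station and day. One alternative, which is not what the training cell does, is to hold out the most recent dates instead. This is a minimal sketch with made-up rows; `split_by_date` is a hypothetical helper and assumes `date` is stored as an integer in YYYYMMDD form, as in `train_data.csv`.

```python
import pandas as pd

def split_by_date(df, holdout_fraction=0.2):
    """Hypothetical helper: hold out the latest dates instead of random rows."""
    dates = sorted(df["date"].unique())
    n_holdout = max(1, int(round(len(dates) * holdout_fraction)))
    cutoff = dates[-n_holdout]          # first date that goes to the hold-out set
    train = df[df["date"] < cutoff]
    test = df[df["date"] >= cutoff]
    return train, test

# Made-up toy data: five days, two slots per day
toy = pd.DataFrame({
    "date": [20231002] * 2 + [20231003] * 2 + [20231004] * 2
            + [20231005] * 2 + [20231006] * 2,
    "time": [0, 20] * 5,
    "sbi": [12, 12, 8, 8, 5, 6, 7, 7, 9, 10],
})
train_df, test_df = split_by_date(toy)
```

With a split like this, the reported error reflects predicting days the model has not seen, which is closer to how the test data is used.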

## Predict test_data

In [10]:
import pandas as pd
import torch
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumes the trained `model`, the fitted `scaler`, and `device` from the training cell are still in memory

# Load the new data
new_data_path = 'test_data.csv'  # Replace with the path to the new data
new_data = pd.read_csv(new_data_path)

# Preprocess the new data
X_new = new_data[['date', 'station_id', 'time', 'latitude', 'longitude']].values
X_new = scaler.transform(X_new)  # Reuse the scaler fitted during training
X_new_tensor = torch.tensor(X_new, dtype=torch.float32).to(device)

# Run the model to get predictions
model.eval()
with torch.no_grad():
    y_new_pred_tensor = model(X_new_tensor)
    y_new_pred = y_new_pred_tensor.detach().cpu().numpy()

# Round to integers and make sure predictions are non-negative
y_new_pred_rounded = np.round(y_new_pred).astype(int)
y_new_pred_rounded = np.clip(y_new_pred_rounded, 0, None)  # Clamp negative predictions to 0

# Create a new DataFrame to hold the predictions
predicted_data = pd.DataFrame(y_new_pred_rounded, columns=['predicted_sbi'])

# Save the predictions to a CSV file
predicted_data.to_csv('predicted_results.csv', index=False)
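`predicted_results.csv` holds only the predictions, so the row keys still have to be attached before the file can be matched against the sample submission. Below is a sketch that rebuilds `id` from the `test_data.csv` columns, assuming the `date_stationid_HH:MM` layout parsed earlier. The output column name `sbi` is a guess and should be checked against `sample_submission_stage2.csv`. All values are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows in the test_data.csv layout
test_rows = pd.DataFrame({
    "date": [20231021, 20231021],
    "station_id": [500101001, 500101001],
    "time": [0, 80],                      # minutes since midnight
})
preds = np.array([[11.6], [-0.4]])        # shape (n, 1), like the model output

# Rebuild the HH:MM part of the id from minutes since midnight.
hhmm = (
    (test_rows["time"] // 60).map("{:02d}".format)
    + ":"
    + (test_rows["time"] % 60).map("{:02d}".format)
)

submission = pd.DataFrame({
    "id": test_rows["date"].astype(str) + "_"
          + test_rows["station_id"].astype(str) + "_" + hhmm,
    # Flatten to 1-D, round, and clamp negatives to 0, as in the cell above.
    "sbi": np.clip(np.round(preds.ravel()), 0, None).astype(int),
})
```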
