# Steps

## 1. [Libraries](#1.-Import-the-Libraries)
## 2. [Analysis](#2.-Preliminary-Analysis)
## [Preprocessing]

## 🚌 Bus Demand Forecasting Hackathon

### 📌 Problem Statement

Accurately forecasting demand for bus journeys is a complex task due to the influence of multiple factors. These include:

- **Holiday calendars**
- **Wedding seasons**
- **Long weekends**
- **School vacations**
- **Exam schedules**
- **Day-of-week effects**
- **Regional holidays** (impact varies by region)

> Not all holidays cause significant changes in demand, making this a non-trivial modeling problem.

---

### 🎯 Objective

Develop a predictive model that forecasts the **total number of seats booked** for a given **route** and **date of journey**, exactly **15 days before the travel date**.

---

### 📂 Provided Data

You will be given historical data from the platform, which includes:

- `seats_booked`: Number of seats booked (actual demand)
- `date_of_journey`: The actual travel date
- `date_of_issue`: The date when the ticket was booked
- `search_data`: Number of user searches for a specific journey on a given booking date

---

### 🧠 Challenge Details

You need to predict the **final demand (total seats booked)** for each route, **15 days in advance** of the travel date.

#### ✅ Example

- **Route**: Source City "A" → Destination City "B"
- **Date of Journey (DOJ)**: `30-Jan-2025`
- **Prediction Date**: `15-Jan-2025` (15 days prior to DOJ)

On the prediction date, your model should predict the **total bookings expected** for the journey date.

---

### 📥 Data Access

Download the training and testing datasets from the links provided at the bottom of the problem statement.

---

### 🔧 Next Steps

1. Exploratory Data Analysis (EDA)
2. Feature Engineering
3. Model Building
4. Evaluation and Validation
5. Final Predictions



## 2. Data Collection

Data: https://www.analyticsvidhya.com/datahack/contest/redbus-data-decode-hackathon-2025/

1. `train.csv`: 67200 rows, 4 columns
   - ['doj', 'srcid', 'destid', 'final_seatcount']
2. `test.csv`: 5900 rows, 4 columns
   - ['route_key', 'doj', 'srcid', 'destid']
3. `transactions.csv`: 2266100 rows, 11 columns
   - ['doj', 'doi', 'srcid', 'destid', 'srcid_region', 'destid_region', 'srcid_tier', 'destid_tier', 'cumsum_seatcount', 'cumsum_searchcount', 'dbd']

### 2.1. Import the Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from xgboost import XGBRegressor

#### Import Data from csv files

In [2]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
transactions_data = pd.read_csv("transactions.csv")

In [3]:
train_df = train_data.copy()
test_df = test_data.copy()
test_df = test_df.drop('route_key', axis = 1)
trans_df = transactions_data.copy()

In [4]:
print("train shape ->", train_df.shape)
print("transactions shape ->", trans_df.shape)
print("test shape ->", test_df.shape)

train shape -> (67200, 4)
transactions shape -> (2266100, 11)
test shape -> (5900, 3)


### 2.2. Merge Transaction Features into the Train and Test Data

In [5]:
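# Keep only the rows with dbd == 15, i.e. the snapshot available on the prediction date
# (assuming dbd counts the days between the booking date and the journey date)
transaction_15 = trans_df[trans_df['dbd'] == 15]

features = transaction_15[['doj', 'srcid', 'destid', 'srcid_region', 'destid_region',
       'srcid_tier', 'destid_tier', 'cumsum_seatcount', 'cumsum_searchcount']]

# Left merge keeps every train/test row, even if no dbd == 15 snapshot exists for it
train_df = train_df.merge(features, on = ['doj', 'srcid', 'destid'], how = 'left')
test_df = test_df.merge(features, on = ['doj', 'srcid', 'destid'], how = 'left')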
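
A left merge silently duplicates rows when the right-hand keys repeat, and leaves NaNs when a key has no match. The following is a minimal sketch of how both cases could be checked, using made-up toy frames in place of `train_df` and `features`; the `_toy` names and values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for train_df and the dbd == 15 feature rows (made-up values)
train_toy = pd.DataFrame({
    "doj": ["2025-01-30", "2025-01-31"],
    "srcid": [1, 1],
    "destid": [2, 2],
    "final_seatcount": [100.0, 80.0],
})
features_toy = pd.DataFrame({
    "doj": ["2025-01-30"],
    "srcid": [1],
    "destid": [2],
    "cumsum_seatcount": [12.0],
})

keys = ["doj", "srcid", "destid"]

merged = train_toy.merge(
    features_toy,
    on=keys,
    how="left",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no matching snapshot in the feature table
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"rows without a dbd == 15 snapshot: {unmatched}")
```

In the real data, the null check further down reports zero missing values, which suggests every train and test row found a match.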

In [6]:
dfs = [train_df, test_df]

for df in dfs:
    df['doj'] = pd.to_datetime(df['doj'], format = '%Y-%m-%d')

    # Derive calendar features from the journey date
    df['doj_year'] = df['doj'].dt.year
    df['doj_month'] = df['doj'].dt.month
    df['doj_day'] = df['doj'].dt.day
    df['doj_dayofweek'] = df['doj'].dt.dayofweek   # Monday = 0 ... Sunday = 6
    df['doj_isweekend'] = df['doj'].dt.dayofweek.isin([5,6]).astype(int)

    # Drop the raw datetime column once its parts are extracted
    df.drop(['doj'], axis = 1, inplace = True)
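
To make the derived columns concrete, here is a small sketch that applies the same pandas accessors to two hand-picked dates. The `toy` frame is a made-up illustration, not the hackathon data.

```python
import pandas as pd

# Two example journey dates: a Thursday and the following Saturday
toy = pd.DataFrame({"doj": ["2025-01-30", "2025-02-01"]})
toy["doj"] = pd.to_datetime(toy["doj"], format="%Y-%m-%d")

toy["doj_dayofweek"] = toy["doj"].dt.dayofweek  # Monday = 0 ... Sunday = 6
toy["doj_isweekend"] = toy["doj"].dt.dayofweek.isin([5, 6]).astype(int)

print(toy)
```

The Thursday gets `doj_dayofweek = 3` and `doj_isweekend = 0`; the Saturday gets `5` and `1`.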

##### info() of dataframes

In [7]:
for i in dfs:
    print(i.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67200 entries, 0 to 67199
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   srcid               67200 non-null  int64  
 1   destid              67200 non-null  int64  
 2   final_seatcount     67200 non-null  float64
 3   srcid_region        67200 non-null  object 
 4   destid_region       67200 non-null  object 
 5   srcid_tier          67200 non-null  object 
 6   destid_tier         67200 non-null  object 
 7   cumsum_seatcount    67200 non-null  float64
 8   cumsum_searchcount  67200 non-null  float64
 9   doj_year            67200 non-null  int32  
 10  doj_month           67200 non-null  int32  
 11  doj_day             67200 non-null  int32  
 12  doj_dayofweek       67200 non-null  int32  
 13  doj_isweekend       67200 non-null  int64  
dtypes: float64(3), int32(4), int64(3), object(4)
memory usage: 6.2+ MB
None
<class 'pandas.core.frame.Data

##### Check for Null values

In [8]:
for i in dfs:   
    print(i.isna().sum(),'\n')

srcid                 0
destid                0
final_seatcount       0
srcid_region          0
destid_region         0
srcid_tier            0
destid_tier           0
cumsum_seatcount      0
cumsum_searchcount    0
doj_year              0
doj_month             0
doj_day               0
doj_dayofweek         0
doj_isweekend         0
dtype: int64 

srcid                 0
destid                0
srcid_region          0
destid_region         0
srcid_tier            0
destid_tier           0
cumsum_seatcount      0
cumsum_searchcount    0
doj_year              0
doj_month             0
doj_day               0
doj_dayofweek         0
doj_isweekend         0
dtype: int64 



###### Insight
- There are no null values

##### Check for duplicates

In [9]:
for i in dfs:
    print(i.duplicated().sum())

0
0


###### Insight
- There are no duplicate rows

#### Converting objects into dates

In [10]:
date_item_list = ['doj', 'doi']

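# Generic version of the date expansion done earlier. For train_df and test_df this
# loop is a no-op: 'doj' was already expanded and dropped, and 'doi' was never merged in.
for df in dfs: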
    for col in date_item_list:
        if col in df.columns and df[col].dtype != 'datetime64[ns]':
            df[col] = pd.to_datetime(df[col], format = '%Y-%m-%d')

            # Extracting new columns with date columns
            df[col + '_year'] = df[col].dt.year
            df[col + '_month'] = df[col].dt.month
            df[col + '_day'] = df[col].dt.day
            df[col + '_dayofweek'] = df[col].dt.dayofweek
            df[col + '_isweekend'] = df[col].dt.dayofweek.isin([5,6]).astype(int)

            # Deleting the datetime datatype columns
            df.drop([col], axis = 1, inplace = True)

## Preprocessing

In [13]:
categorical_columns = [col for col in train_df.columns if train_df[col].dtype == 'O']
# ['srcid_region', 'destid_region', 'srcid_tier', 'destid_tier']

In [14]:
encoder = OrdinalEncoder(dtype = int)

# Fit the encoder on the training data only, then reuse the same mapping for the test data
train_df[categorical_columns] = encoder.fit_transform(train_df[categorical_columns])
test_df[categorical_columns] = encoder.transform(test_df[categorical_columns])
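
Integer codes are only comparable across frames when they come from the same fitted encoder. A minimal sketch with made-up tier labels (the `_toy` frames are illustrative assumptions) shows how codes can drift when a second encoder is fitted on a frame that lacks one category:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up tier labels; the test frame happens to lack "Tier 1"
train_toy = pd.DataFrame({"srcid_tier": ["Tier 1", "Tier 2", "Tier 3"]})
test_toy = pd.DataFrame({"srcid_tier": ["Tier 2", "Tier 3"]})

# One encoder fitted on train and reused for test
shared = OrdinalEncoder(dtype=int).fit(train_toy)
codes_shared = shared.transform(test_toy).ravel().tolist()

# A second encoder fitted on test builds its own mapping
codes_refit = OrdinalEncoder(dtype=int).fit_transform(test_toy).ravel().tolist()

print("fit on train, transform test:", codes_shared)
print("refit on test:", codes_refit)
```

With the shared encoder, "Tier 2" keeps the code it had during training; with the refitted one, it is shifted down by one, so a model trained on the first mapping would misread it.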

## Data Preparation

In [15]:
X = train_df.drop(['final_seatcount'], axis = 1)
y = train_df['final_seatcount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
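
A random split mixes earlier and later journey dates in both parts. For a forecasting task, a chronological hold-out is a common alternative. The sketch below is not what the notebook does; it is one possible variant, assuming the `doj_year`, `doj_month` and `doj_day` columns created earlier and using made-up toy values.

```python
import pandas as pd

# Toy frame with the derived date columns (made-up values)
toy = pd.DataFrame({
    "doj_year":  [2024, 2024, 2024, 2024, 2024],
    "doj_month": [3, 1, 2, 5, 4],
    "doj_day":   [10, 5, 20, 1, 15],
    "cumsum_seatcount": [5.0, 3.0, 8.0, 2.0, 7.0],
    "final_seatcount":  [50.0, 30.0, 80.0, 20.0, 70.0],
})

# Rebuild the journey date from its parts, only to order the rows
journey_date = pd.to_datetime(
    toy[["doj_year", "doj_month", "doj_day"]].rename(
        columns={"doj_year": "year", "doj_month": "month", "doj_day": "day"}
    )
)
ordered = toy.loc[journey_date.sort_values().index]

# Hold out the latest 20% of rows (by journey date) for validation
cut = int(len(ordered) * 0.8)
train_part, valid_part = ordered.iloc[:cut], ordered.iloc[cut:]

X_tr, y_tr = train_part.drop(columns="final_seatcount"), train_part["final_seatcount"]
X_va, y_va = valid_part.drop(columns="final_seatcount"), valid_part["final_seatcount"]
print(len(X_tr), len(X_va))
```

Cutting by row position can place rows from the same journey date on both sides of the boundary; cutting at a fixed date instead would avoid that.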

In [16]:
print(train_df.columns)
print(test_df.columns)

Index(['srcid', 'destid', 'final_seatcount', 'srcid_region', 'destid_region',
       'srcid_tier', 'destid_tier', 'cumsum_seatcount', 'cumsum_searchcount',
       'doj_year', 'doj_month', 'doj_day', 'doj_dayofweek', 'doj_isweekend'],
      dtype='object')
Index(['srcid', 'destid', 'srcid_region', 'destid_region', 'srcid_tier',
       'destid_tier', 'cumsum_seatcount', 'cumsum_searchcount', 'doj_year',
       'doj_month', 'doj_day', 'doj_dayofweek', 'doj_isweekend'],
      dtype='object')


### Model Testing

In [17]:
model = XGBRegressor()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE on the 20% hold-out split
rmse

np.float64(392.09103789169905)
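
An RMSE value is easier to judge next to a naive baseline. The sketch below computes the RMSE of a model that always predicts the training mean, using scikit-learn's `DummyRegressor` on made-up numbers. The `_toy` arrays are illustrative assumptions and say nothing about the actual baseline score on this dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up numbers standing in for the hold-out split
X_train_toy = np.zeros((4, 1))
y_train_toy = np.array([100.0, 200.0, 300.0, 400.0])
X_test_toy = np.zeros((2, 1))
y_test_toy = np.array([150.0, 350.0])

# Baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

baseline_rmse = np.sqrt(mean_squared_error(y_test_toy, baseline_pred))
print(baseline_rmse)
```

Running the same comparison with `X_train`, `y_train`, `X_test` and `y_test` would show how much of the error the XGBoost model removes relative to predicting the mean.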