# Steps

## 1. [Libraries](#2.1.-Import-the-Libraries)
## 2. [Analysis](#2.-Preliminary-Analysis)
## 3. [Preprocessing](#Preprocessing)

## 🚌 Bus Demand Forecasting Hackathon

### 📌 Problem Statement

Accurately forecasting demand for bus journeys is a complex task due to the influence of multiple factors. These include:

- **Holiday calendars**
- **Wedding seasons**
- **Long weekends**
- **School vacations**
- **Exam schedules**
- **Day-of-week effects**
- **Regional holidays** (impact varies by region)

> Not all holidays cause significant changes in demand, making this a non-trivial modeling problem.

---

### 🎯 Objective

Develop a predictive model that forecasts the **total number of seats booked** for a given **route** and **date of journey**, exactly **15 days before the travel date**.

---

### 📂 Provided Data

You will be given historical data from the platform, which includes:

- `seats_booked`: Number of seats booked (actual demand)
- `date_of_journey`: The actual travel date
- `date_of_issue`: The date when the ticket was booked
- `search_data`: Number of user searches for a specific journey on a given booking date

---

### 🧠 Challenge Details

You need to predict the **final demand (total seats booked)** for each route, **15 days in advance** of the travel date.

#### ✅ Example

- **Route**: Source City "A" → Destination City "B"
- **Date of Journey (DOJ)**: `30-Jan-2025`
- **Prediction Date**: `16-Jan-2025` (15 days prior to DOJ)

Your model should predict, as of the prediction date, the **total number of seats that will eventually be booked** for the journey date.

---

### 📥 Data Access

Download the training and testing datasets from the links provided at the bottom of the problem statement.

---

### 🔧 Next Steps

1. Exploratory Data Analysis (EDA)
2. Feature Engineering
3. Model Building
4. Evaluation and Validation
5. Final Predictions



## 2. Data Collection

Data - https://www.analyticsvidhya.com/datahack/contest/redbus-data-decode-hackathon-2025/
1. train.csv => rows - 67200, columns - 4
  - ['doj', 'srcid', 'destid', 'final_seatcount']
2. test.csv  => rows - 5900, columns - 4
  - ['route_key', 'doj', 'srcid', 'destid']
3. transactions.csv => rows - 2266100, columns - 11
  - ['doj', 'doi', 'srcid', 'destid', 'srcid_region', 'destid_region',
       'srcid_tier', 'destid_tier', 'cumsum_seatcount', 'cumsum_searchcount',
       'dbd']
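
The columns of `transactions.csv` suggest one way to respect the 15-day horizon: keep only the snapshot recorded 15 days before the journey and join it onto the training rows. The sketch below runs on a small hand-made frame with invented values. It assumes `dbd` means "days before departure" (journey date minus booking date), which the column list above does not state explicitly, and `HORIZON_DAYS` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for transactions.csv (values are made up for illustration).
toy_trans = pd.DataFrame({
    "doj": ["2025-01-30"] * 3 + ["2025-01-31"] * 3,
    "srcid": [45] * 6,
    "destid": [46] * 6,
    "dbd": [16, 15, 14, 16, 15, 14],
    "cumsum_seatcount": [10, 14, 20, 5, 8, 11],
    "cumsum_searchcount": [100, 130, 170, 60, 75, 90],
})

# Toy stand-in for train.csv.
toy_train = pd.DataFrame({
    "doj": ["2025-01-30", "2025-01-31"],
    "srcid": [45, 45],
    "destid": [46, 46],
    "final_seatcount": [250.0, 180.0],
})

HORIZON_DAYS = 15  # assumption: dbd counts days between booking date and journey date

# Keep only what was known at the prediction horizon.
snapshot = toy_trans.loc[
    toy_trans["dbd"] == HORIZON_DAYS,
    ["doj", "srcid", "destid", "cumsum_seatcount", "cumsum_searchcount"],
]

# One snapshot row per (doj, srcid, destid); validate guards against duplicates.
features = toy_train.merge(
    snapshot, on=["doj", "srcid", "destid"], how="left", validate="one_to_one"
)
print(features)
```

Whether every route and journey date in the real data has a row at `dbd == 15` would need to be checked before relying on this join; a left merge keeps unmatched rows with missing values rather than dropping them.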

### 2.1. Import the Libraries

In [57]:
import numpy  as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OrdinalEncoder

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

from xgboost import XGBRegressor

#### Import Data from csv files

In [2]:
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
transactions_data = pd.read_csv("transactions.csv")

In [3]:
train_df = train_data.copy()
test_df = test_data.copy()
test_df = test_df.drop('route_key', axis = 1)
trans_df = transactions_data.copy()

In [4]:
print("train shape ->", train_df.shape)
print("transactions shape ->", trans_df.shape)
print("test shape ->", test_df.shape)

train shape -> (67200, 4)
transactions shape -> (2266100, 11)
test shape -> (5900, 3)


In [5]:
dfs = [train_df, test_df, trans_df]

##### info() of dataframes

In [6]:
for i in dfs:
    print(i.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67200 entries, 0 to 67199
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   doj              67200 non-null  object 
 1   srcid            67200 non-null  int64  
 2   destid           67200 non-null  int64  
 3   final_seatcount  67200 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 2.1+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5900 entries, 0 to 5899
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   doj     5900 non-null   object
 1   srcid   5900 non-null   int64 
 2   destid  5900 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 138.4+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2266100 entries, 0 to 2266099
Data columns (total 11 columns):
 #   Column              Dtype  
---  ------              -----  
 0   doj                 ob

##### Check for Null values

In [7]:
for i in dfs:   
    print(i.isna().sum(),'\n')

doj                0
srcid              0
destid             0
final_seatcount    0
dtype: int64 

doj       0
srcid     0
destid    0
dtype: int64 

doj                   0
doi                   0
srcid                 0
destid                0
srcid_region          0
destid_region         0
srcid_tier            0
destid_tier           0
cumsum_seatcount      0
cumsum_searchcount    0
dbd                   0
dtype: int64 



###### Insight
- There are no null values in any of the three dataframes

##### Check for duplicates

In [8]:
for i in dfs:
    print(i.duplicated().sum())

0
0
0


###### Insight
- There are no duplicate rows in any of the three dataframes
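
The null and duplicate checks above can also be collected into one small table, which is easier to scan than three printed Series. This is a minimal sketch; `quality_report` is a hypothetical helper, not part of the original notebook, and it is demonstrated on a toy frame.

```python
import pandas as pd

def quality_report(frames):
    """Return one row per dataframe with its size, null count and duplicate count."""
    rows = []
    for name, df in frames.items():
        rows.append({
            "name": name,
            "rows": len(df),
            "columns": df.shape[1],
            "nulls": int(df.isna().sum().sum()),
            "duplicates": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("name")

# Toy frame with one missing value and one repeated row.
toy = pd.DataFrame({"srcid": [1, 1, 2], "destid": [3, 3, None]})
report = quality_report({"toy": toy})
print(report)
```

With the notebook's data the call would look like `quality_report({"train": train_df, "test": test_df, "transactions": trans_df})`.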

#### Converting objects into dates

In [12]:
date_item_list = ['doj', 'doi']

for df in dfs:
    for col in date_item_list:
        if col in df.columns and df[col].dtype != 'datetime64[ns]':
            df[col] = pd.to_datetime(df[col], format = '%Y-%m-%d')

            # Derive calendar features from the date column
            df[col + '_year'] = df[col].dt.year
            df[col + '_month'] = df[col].dt.month
            df[col + '_day'] = df[col].dt.day
            df[col + '_dayofweek'] = df[col].dt.dayofweek
            df[col + '_isweekend'] = df[col].dt.dayofweek.isin([5,6]).astype(int)

            # Drop the original datetime column once its features are extracted
            df.drop([col], axis = 1, inplace = True)
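
The loop above modifies each dataframe in place. The same feature extraction can be written as a function that returns a new frame, which makes it easier to test and to reuse on new data. This is a sketch under that assumption; `add_date_features` is a name introduced here, and the two dates are illustrative.

```python
import pandas as pd

def add_date_features(df, col):
    """Return a copy of df with calendar features derived from df[col]; col itself is dropped."""
    out = df.copy()
    dates = pd.to_datetime(out[col], format="%Y-%m-%d")
    out[col + "_year"] = dates.dt.year
    out[col + "_month"] = dates.dt.month
    out[col + "_day"] = dates.dt.day
    out[col + "_dayofweek"] = dates.dt.dayofweek  # Monday = 0, Sunday = 6
    out[col + "_isweekend"] = dates.dt.dayofweek.isin([5, 6]).astype(int)
    return out.drop(columns=[col])

toy = pd.DataFrame({"doj": ["2025-01-30", "2025-02-01"], "srcid": [45, 45]})
toy_features = add_date_features(toy, "doj")
print(toy_features)
```

30 January 2025 is a Thursday (`dayofweek` 3) and 1 February 2025 is a Saturday (`dayofweek` 5), so only the second row is flagged as a weekend.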

##### Mapping IDs to their respective regions and tiers

In [13]:
'''
dic = {}
# Run once to generate loc_dict below: maps each location ID to [region, tier]
for i, j in transactions_data.iterrows():
    dic.update({j.values[2] : [j.values[4], j.values[6]]})
    dic.update({j.values[3] : [j.values[5], j.values[7]]})

dict(sorted(dic.items()))
'''

######################################################

loc_dict = {1: ['Maharashtra and Goa', 'Tier2'],
 2: ['Maharashtra and Goa', 'Tier 1'],
 3: ['Andhra Pradesh', 'Tier2'],
 4: ['East 1', 'Tier 4'],
 5: ['Andhra Pradesh', 'Tier2'],
 6: ['Andhra Pradesh', 'Tier 3'],
 7: ['Rest of North', 'Tier2'],
 8: ['Andhra Pradesh', 'Tier2'],
 9: ['Tamil Nadu', 'Tier2'],
 10: ['Andhra Pradesh', 'Tier 3'],
 11: ['Rest of North', 'Tier 1'],
 12: ['Rajasthan', 'Tier 1'],
 13: ['East 1', 'Tier2'],
 14: ['Maharashtra and Goa', 'Tier2'],
 15: ['Tamil Nadu', 'Tier 3'],
 16: ['Maharashtra and Goa', 'Tier 1'],
 17: ['East 1', 'Tier2'],
 18: ['Tamil Nadu', 'Tier 1'],
 19: ['Madhya Pradesh', 'Tier 1'],
 20: ['Tamil Nadu', 'Tier2'],
 21: ['Maharashtra and Goa', 'Tier2'],
 22: ['Rest of North', 'Tier2'],
 23: ['East 1', 'Tier 1'],
 24: ['Maharashtra and Goa', 'Tier2'],
 25: ['East 1', 'Tier2'],
 26: ['Tamil Nadu', 'Tier 4'],
 27: ['Rest of North', 'Tier2'],
 28: ['Andhra Pradesh', 'Tier 3'],
 29: ['East 1', 'Tier2'],
 30: ['Maharashtra and Goa', 'Tier 1'],
 31: ['Maharashtra and Goa', 'Tier 4'],
 32: ['Madhya Pradesh', 'Tier 1'],
 33: ['Tamil Nadu', 'Tier 3'],
 34: ['Kerala', 'Tier2'],
 35: ['Tamil Nadu', 'Tier2'],
 36: ['Delhi', 'Tier2'],
 37: ['Rest of North', 'Tier 1'],
 38: ['Rest of North', 'Tier 1'],
 39: ['Tamil Nadu', 'Tier 3'],
 40: ['Maharashtra and Goa', 'Tier 1'],
 41: ['Rest of North', 'Tier 4'],
 42: ['Karnataka', 'Tier2'],
 43: ['Andhra Pradesh', 'Tier 1'],
 44: ['Tamil Nadu', 'Tier2'],
 45: ['Karnataka', 'Tier 1'],
 46: ['Tamil Nadu', 'Tier 1'],
 47: ['Andhra Pradesh', 'Tier 1'],
 48: ['Tamil Nadu', 'Tier2']}

################################## Train #############################################
train_df[['srcid_region', 'srcid_tier']] = train_df['srcid'].apply(lambda x: pd.Series(loc_dict[x]))
train_df[['destid_region', 'destid_tier']] = train_df['destid'].apply(lambda x: pd.Series(loc_dict[x]))

################################## test #############################################
test_df[['srcid_region', 'srcid_tier']] = test_df['srcid'].apply(lambda x: pd.Series(loc_dict[x]))
test_df[['destid_region', 'destid_tier']] = test_df['destid'].apply(lambda x: pd.Series(loc_dict[x]))
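
Both the commented-out `iterrows()` loop and the `apply(lambda x: pd.Series(...))` calls process one row at a time, which is slow on millions of rows. A vectorized alternative is to build a lookup table with `drop_duplicates` and attach the labels with `Series.map`. The sketch below uses a toy frame whose IDs and labels are copied from `loc_dict`; the names `lookup`, `src` and `dest` are introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for transactions.csv (IDs and labels copied from loc_dict above).
toy_trans = pd.DataFrame({
    "srcid": [45, 46, 45],
    "destid": [46, 45, 36],
    "srcid_region": ["Karnataka", "Tamil Nadu", "Karnataka"],
    "destid_region": ["Tamil Nadu", "Karnataka", "Delhi"],
    "srcid_tier": ["Tier 1", "Tier 1", "Tier 1"],
    "destid_tier": ["Tier 1", "Tier 1", "Tier2"],
})

# Stack the source and destination views into one (id, region, tier) table.
names = ["id", "region", "tier"]
src = toy_trans[["srcid", "srcid_region", "srcid_tier"]].set_axis(names, axis=1)
dest = toy_trans[["destid", "destid_region", "destid_tier"]].set_axis(names, axis=1)
lookup = pd.concat([src, dest]).drop_duplicates("id").set_index("id").sort_index()

# Attach region and tier to both endpoints of each route.
toy_train = pd.DataFrame({"srcid": [45, 36], "destid": [46, 45]})
for side in ("srcid", "destid"):
    toy_train[side + "_region"] = toy_train[side].map(lookup["region"])
    toy_train[side + "_tier"] = toy_train[side].map(lookup["tier"])
print(toy_train)
```

Unlike indexing `loc_dict[x]`, `map` yields `NaN` for an ID missing from the lookup instead of raising `KeyError`, so a null check after mapping is worthwhile.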

## Preprocessing

In [16]:
categorical_columns = [col for col in trans_df.columns if trans_df[col].dtype == 'O']
# ['srcid_region', 'destid_region', 'srcid_tier', 'destid_tier']

In [51]:
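# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(dtype = int, handle_unknown = 'use_encoded_value', unknown_value = -1)

# Fit once on the training data and reuse the same mapping everywhere;
# refitting per dataframe can assign different integers to the same category.
train_df[categorical_columns] = encoder.fit_transform(train_df[categorical_columns])
test_df[categorical_columns] = encoder.transform(test_df[categorical_columns])
trans_df[categorical_columns] = encoder.transform(trans_df[categorical_columns])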
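
An `OrdinalEncoder` assigns integers from the sorted categories it saw during `fit`, so the frame it is fitted on determines the codes. The toy example below, with three region names taken from `loc_dict`, shows how the same label gets a different integer when the encoder is refitted on a frame with fewer categories.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"srcid_region": ["Delhi", "Kerala", "Rajasthan"]})
toy_test = pd.DataFrame({"srcid_region": ["Kerala", "Rajasthan"]})

# Refitting on the test frame: codes come from the test categories alone.
refit_codes = OrdinalEncoder(dtype=int).fit_transform(toy_test).ravel().tolist()

# Fitting once on train and reusing the mapping keeps the codes aligned.
encoder = OrdinalEncoder(dtype=int).fit(toy_train)
shared_codes = encoder.transform(toy_test).ravel().tolist()

print("refit on test:", refit_codes)
print("fit on train :", shared_codes)
```

In the refitted case "Kerala" becomes 0, while the model was trained with "Kerala" as 1, so a tree split learned on the training codes would be applied to the wrong regions at prediction time.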

## Data Preparation

In [58]:
X = train_df.drop(['final_seatcount'], axis = 1)
y = train_df['final_seatcount']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
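
A random split mixes past and future journeys, which can make the validation score look better than a forecast on genuinely later dates would be. One alternative is to hold out the most recent journeys instead. The sketch below assumes the `doj_year`, `doj_month` and `doj_day` columns created earlier; `chronological_split` is a hypothetical helper, not something the notebook uses.

```python
import pandas as pd

def chronological_split(X, y, test_size=0.2):
    """Hold out the latest journeys, ordering rows by the doj_* columns."""
    order = X.sort_values(["doj_year", "doj_month", "doj_day"], kind="stable").index
    n_test = int(round(len(order) * test_size))
    cut = len(order) - n_test
    train_idx, test_idx = order[:cut], order[cut:]
    return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]

# Toy data: five journeys in shuffled month order, target chosen to match the month.
toy_X = pd.DataFrame({
    "doj_year": [2024] * 5,
    "doj_month": [3, 1, 2, 5, 4],
    "doj_day": [1] * 5,
})
toy_y = pd.Series([30.0, 10.0, 20.0, 50.0, 40.0])

X_tr, X_te, y_tr, y_te = chronological_split(toy_X, toy_y)
print(y_tr.tolist(), y_te.tolist())
```

Rows that share the date at the cut can land on both sides of the split; cutting at a date boundary instead of a row count would avoid that.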

In [55]:
print(train_df.columns)
print(test_df.columns)

Index(['srcid', 'destid', 'final_seatcount', 'doj_year', 'doj_month',
       'doj_day', 'doj_dayofweek', 'doj_isweekend', 'srcid_region',
       'srcid_tier', 'destid_region', 'destid_tier'],
      dtype='object')
Index(['srcid', 'destid', 'doj_year', 'doj_month', 'doj_day', 'doj_dayofweek',
       'doj_isweekend', 'srcid_region', 'srcid_tier', 'destid_region',
       'destid_tier'],
      dtype='object')


In [61]:
model = XGBRegressor()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred)) # 504.88480654785326
rmse

np.float64(504.88480654785326)
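
An RMSE value is easier to judge next to a naive baseline, such as always predicting the mean of the training target. The sketch below shows the comparison on invented numbers; it does not reproduce the notebook's result, and `rmse` is a small helper defined here.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (seats)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up numbers, only to show the comparison.
y_train_toy = np.array([100.0, 200.0, 300.0])
y_valid_toy = np.array([150.0, 250.0])
model_pred_toy = np.array([160.0, 230.0])

# Baseline: always predict the training mean.
baseline_pred = np.full_like(y_valid_toy, y_train_toy.mean())
baseline_rmse = rmse(y_valid_toy, baseline_pred)
model_rmse = rmse(y_valid_toy, model_pred_toy)

print("baseline RMSE:", baseline_rmse)
print("model RMSE   :", model_rmse)
```

With the notebook's variables the baseline would be `rmse(y_test, np.full(len(y_test), y_train.mean()))`, compared against the model's RMSE on the same split.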

## Submission

In [63]:
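# Refit on the full training set, then predict the competition test set.
model = XGBRegressor()
model.fit(X, y)
# Select columns in training order so the features line up with what the model saw.
y_pred = model.predict(test_df[X.columns])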

In [70]:
final_df = pd.DataFrame(test_data['route_key'])
final_df['final_seatcount'] = pd.Series(y_pred)

In [73]:
final_df.to_csv('submission.csv', index = False)
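
Before uploading, it can help to check the submission for length mismatches, missing values and negative predictions, since a gradient-boosted regressor is not constrained to non-negative outputs. This is a minimal sketch; `build_submission` is a hypothetical helper, and the `route_key` strings below are invented placeholders, not the real key format.

```python
import numpy as np
import pandas as pd

def build_submission(route_keys, predictions):
    """Pair each route_key with its prediction after basic sanity checks."""
    preds = np.asarray(predictions, dtype=float)
    if len(route_keys) != len(preds):
        raise ValueError("number of predictions does not match number of test rows")
    if np.isnan(preds).any():
        raise ValueError("predictions contain NaN")
    # Seat counts cannot be negative, so clip any negative prediction to zero.
    preds = np.clip(preds, 0, None)
    return pd.DataFrame({"route_key": list(route_keys), "final_seatcount": preds})

toy_keys = pd.Series(["key_a", "key_b"])  # hypothetical placeholder keys
toy_submission = build_submission(toy_keys, [123.4, -2.0])
print(toy_submission)
```

With the notebook's variables the call would be `build_submission(test_data['route_key'], y_pred).to_csv('submission.csv', index = False)`.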