<div class="alert alert-block alert-success">
    <h1>Project Portfolio # 1</h1>
    <h2>Part 1: Database Synthetic Data Generation</h2>
    <h3>Coded by: Ariba Khan</h3>
</div>

This notebook focuses on the generation of synthetic data to simulate a public transport system for Karachi. The data was created using a blend of real-world information and generated datasets, aiming to closely mimic the actual routes, buses, and daily operations of Karachi's public transport network.

#### Data Generation Process

- **Real-World Data Reference:** We based our data generation on actual routes and buses used in Karachi, utilizing web scraping to collect reference data. This approach ensures that the generated data closely resembles the real-world scenario.
  
- **Python Scripting:** We used Python to generate trip data, introducing variability and authenticity. The dataset spans from June 1, 2023, to May 31, 2024, with the May 2024 data reserved for testing.

- **ChatGPT-Assisted Data Generation:** 
  - **Passengers, Drivers, and Conductors:** Common names and relevant details were generated using ChatGPT.
  - **Incidents and Complaints:** ChatGPT was also used to create realistic incident and complaint scenarios, reflecting potential issues within the transport system.

- **Database Design:**
  - **Bridge Table (Buses_Drivers):** A bridge table was created to normalize the many-to-many relationship between buses and drivers.
  - **Trip Frequency:** Each bus is assumed to make at least three trips per day.

#### Database Structure

The generated data includes the following tables:

- **Users:** Information about passengers.
- **Drivers:** Information about drivers.
- **Conductors:** Information about conductors.
- **Buses:** Details about the vehicles in use.
- **Routes:** Details about the bus routes.
- **Fares:** Structured fare rates for each route/bus.
- **Schedules:** Scheduled route trips for each bus.
- **Tickets:** Information about each ticket bought.
- **Payments:** Details about the payment amount and method used when buying a ticket.
- **Incidents:** Records of accidents or incidents during trips.
- **Complaints:** Passenger complaints lodged through the official app.
- **Buses_Drivers:** A bridge table linking buses and their respective drivers.
- **Trips:** Details about scheduled and actual trips made by buses.

#### PostgreSQL Database

The generated tables are then loaded into a PostgreSQL database, which is accessed through pgAdmin 4. A separate SQL file runs test queries to confirm that the data is easy to access and that analytical queries return correct results.
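The load step itself is not shown in this notebook. As a minimal sketch of the idea, the block below writes a small stand-in table with `DataFrame.to_sql` and reads it back with a query. It uses an in-memory SQLite connection purely as an assumption so that it runs without a server; with PostgreSQL one would typically pass a SQLAlchemy engine instead. The `buses_demo` frame and the `buses` table name are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical two-row stand-in with the same columns as the 'Buses' table
buses_demo = pd.DataFrame({
    "bus_id": ["B001", "B010"],
    "bus_name": ["D10 Mini Bus", "Gulistan Coach"],
    "bus_type": ["Mini Bus", "Regular Bus"],
    "capacity": [30, 50],
})

# Assumption: in-memory SQLite is used only so the sketch runs without a database server
conn = sqlite3.connect(":memory:")
buses_demo.to_sql("buses", conn, index=False, if_exists="replace")

# A simple analytical query to confirm the rows arrived as expected
result = pd.read_sql_query(
    "SELECT bus_type, COUNT(*) AS n_buses FROM buses GROUP BY bus_type ORDER BY bus_type",
    conn,
)
conn.close()
print(result)
```

Running one small query per table right after loading is a cheap way to catch type or encoding problems before writing the larger analytical queries.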

In [1]:
import pandas as pd
import random
import numpy as np
import warnings
from faker import Faker
from datetime import datetime, timedelta
from collections import Counter
warnings.filterwarnings("ignore", category=Warning)
pd.set_option('display.max_rows', None)

<div class="alert alert-block alert-success">
    <h3>1. Buses</h3>
</div>

In [2]:
# importing buses and route info extracted after web scraping

bus_info = pd.read_csv("bus_info.csv")

In [3]:
bus_info

Unnamed: 0,Buses,Routes,Origin,Destination,Type
0,D10 Mini Bus,Korangi 100 Quarters to Taiser Town,Korangi 100 Quarters,Taiser Town,Mini Bus
1,F16 Mini Bus,Mianwali Colony to Machine Tool Factory,Mianwali Colony,Machine Tool Factory,Mini Bus
2,S2 Mini Bus,Korangi 100 Quarters to Ayub Goth,Korangi 100 Quarters,Ayub Goth,Mini Bus
3,W18 Mini Bus,Surjani Town to Fish Harbour,Surjani Town,Fish Harbour,Mini Bus
4,X8 Mini Bus,Ittihad Town to Qayyumabad,Ittihad Town,Qayyumabad,Mini Bus
5,X10 Mini Bus,Mohajir Camp to DHA Phase 8,Mohajir Camp,DHA Phase 8,Mini Bus
6,W22 Mini Bus,Nai Abadi to Korangi No. 2 1/2,Naiabadi,Korangi No. 2 1/2,Mini Bus
7,F11 Mini Bus,Gul Ahmed Textile Mills to Mustafabad Colony,Gul Ahmed Textile Mills,Mustafabad Colony,Mini Bus
8,G25 Mini Bus,Ittihad Town to Rasheedabad,Ittihad Town,Rasheedabad,Mini Bus
9,Gulistan Coach,Pehlwan Goth to Qayyumabad,Pehlwan Goth,Qayyumabad,Regular Bus


Attributes for 'Buses' table:

- bus_id
- bus_name
- bus_type
- capacity

In [4]:
## initializing 'Buses' df and creating 'bus_id' attribute
# note: total count of buses is 29

buses_df = pd.DataFrame()
buses_df['bus_id'] = [f'B{str(i).zfill(3)}' for i in range(1, 30)]

In [5]:
# creating 'bus_name' attribute

buses_df['bus_name'] = bus_info['Buses']

In [6]:
# creating 'bus_type' attribute

buses_df['bus_type'] = bus_info['Type']

In [7]:
# creating 'capacity' attribute
# note: capacity for mini buses is 30 and for regular buses is 50

buses_df['capacity'] = buses_df['bus_type'].apply(lambda x: 30 if x == 'Mini Bus' else 50 if x == 'Regular Bus' else None)
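The capacity rule above can also be written as a dictionary lookup, which may be easier to extend if more bus types appear. This is only a sketch: `CAPACITY_BY_TYPE` and `demo` are hypothetical names, and the values mirror the rule stated in the cell above.

```python
import pandas as pd

# Assumed lookup mirroring the rule: mini buses seat 30, regular buses seat 50
CAPACITY_BY_TYPE = {"Mini Bus": 30, "Regular Bus": 50}

demo = pd.DataFrame({"bus_type": ["Mini Bus", "Regular Bus", "Mini Bus"]})

# Series.map looks each value up in the dictionary; unknown types become NaN
demo["capacity"] = demo["bus_type"].map(CAPACITY_BY_TYPE)
print(demo)
```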

In [8]:
buses_df

Unnamed: 0,bus_id,bus_name,bus_type,capacity
0,B001,D10 Mini Bus,Mini Bus,30
1,B002,F16 Mini Bus,Mini Bus,30
2,B003,S2 Mini Bus,Mini Bus,30
3,B004,W18 Mini Bus,Mini Bus,30
4,B005,X8 Mini Bus,Mini Bus,30
5,B006,X10 Mini Bus,Mini Bus,30
6,B007,W22 Mini Bus,Mini Bus,30
7,B008,F11 Mini Bus,Mini Bus,30
8,B009,G25 Mini Bus,Mini Bus,30
9,B010,Gulistan Coach,Regular Bus,50


<div class="alert alert-block alert-success">
    <h3>2. Routes</h3>
</div>

Attributes for 'Routes' table:

- route_id
- route_name
- route_origin
- route_dest
- origin_lat
- origin_long
- dest_lat
- dest_long
- bus_id

In [9]:
## initializing 'Routes' df and creating 'route_id' attribute

routes_df = pd.DataFrame()
routes_df['route_id'] = [f'RT{str(i).zfill(3)}' for i in range(1, 30)]

In [10]:
# creating 'route_name' attribute

routes_df['route_name'] = bus_info['Routes']

In [11]:
# creating 'route_origin' attribute

routes_df['route_origin'] = bus_info['Origin']

In [12]:
# creating 'route_dest' attribute

routes_df['route_dest'] = bus_info['Destination']

In [13]:
# creating 'origin_lat' and 'origin_long' attributes

routes_df['route_origin'].unique()

array(['Korangi 100 Quarters', 'Mianwali Colony', 'Surjani Town',
       'Ittihad Town', 'Mohajir Camp', 'Naiabadi',
       'Gul Ahmed Textile Mills', 'Pehlwan Goth', 'Maroora Goth',
       'KDA Site Office', 'Sohrab Goth', 'Islam Nagar',
       'New Karachi Allahwala Masjid', 'Model Colony', 'North Karachi',
       'Nagan Chowrangi', 'Gulshan-e-Bihar Orangi', 'Mosamiyat',
       'Gulshan-e-Hadeed', 'Numaish', 'Shireen Jinnah Colony',
       'Numaish Chowrangi', 'Khokrapar'], dtype=object)

In [14]:
# creating a nested dictionary for lat long coordinates to map against origin locations

origin_locations = { 
    'Korangi 100 Quarters': {'latitude': 24.8277, 'longitude': 67.1209},
    'Mianwali Colony': {'latitude': 24.9456, 'longitude': 66.9683},
    'Surjani Town': {'latitude': 25.0408, 'longitude': 66.9472},
    'Ittihad Town': {'latitude': 24.9296, 'longitude': 66.9738},
    'Mohajir Camp': {'latitude': 24.9643, 'longitude': 67.0536},
    'Naiabadi': {'latitude': 24.8286, 'longitude': 67.0568},
    'Gul Ahmed Textile Mills': {'latitude': 24.8182, 'longitude': 67.0910},
    'Pehlwan Goth': {'latitude': 24.8970, 'longitude': 67.1144},
    'Maroora Goth': {'latitude': 24.8893, 'longitude': 67.1272},
    'KDA Site Office': {'latitude': 24.8943, 'longitude': 67.1136},
    'Sohrab Goth': {'latitude': 24.9855, 'longitude': 67.0730},
    'Islam Nagar': {'latitude': 24.9280, 'longitude': 67.0303},
    'New Karachi Allahwala Masjid': {'latitude': 24.9985, 'longitude': 67.0720},
    'Model Colony': {'latitude': 24.9045, 'longitude': 67.1892},
    'North Karachi': {'latitude': 24.9876, 'longitude': 67.0628},
    'Nagan Chowrangi': {'latitude': 24.9425, 'longitude': 67.0728},
    'Gulshan-e-Bihar Orangi': {'latitude': 24.9718, 'longitude': 67.0085},
    'Mosamiyat': {'latitude': 24.9252, 'longitude': 67.1233},
    'Gulshan-e-Hadeed': {'latitude': 24.8042, 'longitude': 67.1873},
    'Numaish': {'latitude': 24.8732, 'longitude': 67.0347},
    'Shireen Jinnah Colony': {'latitude': 24.8353, 'longitude': 67.0270},
    'Numaish Chowrangi': {'latitude': 24.8718, 'longitude': 67.0336},
    'Khokrapar': {'latitude': 24.9042, 'longitude': 67.1453}
}

In [15]:
# initializing columns
routes_df['origin_lat'] = None
routes_df['origin_long'] = None

# mapping lat/long values
routes_df['origin_lat'] = routes_df['route_origin'].apply(lambda x: origin_locations[x]['latitude'] if x in origin_locations else None)
routes_df['origin_long'] = routes_df['route_origin'].apply(lambda x: origin_locations[x]['longitude'] if x in origin_locations else None)
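An alternative to the two `apply` calls is to turn the nested dictionary into a small coordinate table and join it on the origin name. The sketch below uses a two-entry excerpt of `origin_locations` and a hypothetical `routes_demo` frame; the "Unknown Stop" row is invented to show how unmatched names surface.

```python
import pandas as pd

# Two entries copied from origin_locations above
origin_demo = {
    "Korangi 100 Quarters": {"latitude": 24.8277, "longitude": 67.1209},
    "Surjani Town": {"latitude": 25.0408, "longitude": 66.9472},
}

# One row per location, indexed by location name
coords = pd.DataFrame.from_dict(origin_demo, orient="index").rename(
    columns={"latitude": "origin_lat", "longitude": "origin_long"}
)

# Hypothetical routes; 'Unknown Stop' has no coordinates on purpose
routes_demo = pd.DataFrame(
    {"route_origin": ["Surjani Town", "Korangi 100 Quarters", "Unknown Stop"]}
)
routes_demo = routes_demo.join(coords, on="route_origin")

# Names that did not match any dictionary key
missing = routes_demo.loc[routes_demo["origin_lat"].isna(), "route_origin"].tolist()
print(routes_demo)
print(missing)
```

Listing the unmatched names makes spelling mismatches such as "Nai Abadi" versus "Naiabadi" easy to spot.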

In [16]:
# creating 'dest_lat' and 'dest_long' attributes

routes_df['route_dest'].unique()

array(['Taiser Town', 'Machine Tool Factory', 'Ayub Goth', 'Fish Harbour',
       'Qayyumabad', 'DHA Phase 8', 'Korangi No. 2 1/2',
       'Mustafabad Colony', 'Rasheedabad', 'Mianwali Colony',
       'Abdul Shah Ghazi Mazar', 'KDA', 'Korangi Road', 'Keamari',
       'Merewether Tower', 'Landhi', 'Singer Chowrangi', 'Dockyard',
       'Masroor Base', 'Baldia Town', 'Sea View', 'Meeran Naka, Lyari',
       'Korangi', 'Sea View Beach', 'Lucky Star Saddar'], dtype=object)

In [17]:
# creating a nested dictionary for lat long coordinates to map against destination locations

dest_locations = {
    'Taiser Town': {'latitude': 25.0503, 'longitude': 67.1420},
    'Machine Tool Factory': {'latitude': 24.9194, 'longitude': 67.1065},
    'Ayub Goth': {'latitude': 24.9338, 'longitude': 67.1234},
    'Fish Harbour': {'latitude': 24.8435, 'longitude': 66.9841},
    'Qayyumabad': {'latitude': 24.8425, 'longitude': 67.0653},
    'DHA Phase 8': {'latitude': 24.8011, 'longitude': 67.0325},
    'Korangi No. 2 1/2': {'latitude': 24.8524, 'longitude': 67.1271},
    'Mustafabad Colony': {'latitude': 24.8256, 'longitude': 67.1023},
    'Rasheedabad': {'latitude': 24.9137, 'longitude': 67.0705},
    'Mianwali Colony': {'latitude': 24.9456, 'longitude': 66.9683},
    'Abdul Shah Ghazi Mazar': {'latitude': 24.8006, 'longitude': 67.0281},
    'KDA': {'latitude': 24.8801, 'longitude': 67.0724},
    'Korangi Road': {'latitude': 24.8383, 'longitude': 67.0534},
    'Keamari': {'latitude': 24.8006, 'longitude': 66.9809},
    'Merewether Tower': {'latitude': 24.8485, 'longitude': 67.0106},
    'Landhi': {'latitude': 24.8533, 'longitude': 67.2040},
    'Singer Chowrangi': {'latitude': 24.8317, 'longitude': 67.1250},
    'Dockyard': {'latitude': 24.8314, 'longitude': 66.9760},
    'Masroor Base': {'latitude': 24.9150, 'longitude': 66.9380},
    'Baldia Town': {'latitude': 24.9552, 'longitude': 66.9744},
    'Sea View': {'latitude': 24.8130, 'longitude': 67.0349},
    'Meeran Naka, Lyari': {'latitude': 24.8735, 'longitude': 67.0050},
    'Korangi': {'latitude': 24.8364, 'longitude': 67.1230},
    'Sea View Beach': {'latitude': 24.8048, 'longitude': 67.0312},
    'Lucky Star Saddar': {'latitude': 24.8583, 'longitude': 67.0217}
}

In [18]:
# initializing columns
routes_df['dest_lat'] = None
routes_df['dest_long'] = None

# mapping lat/long values
routes_df['dest_lat'] = routes_df['route_dest'].apply(lambda x: dest_locations[x]['latitude'] if x in dest_locations else None)
routes_df['dest_long'] = routes_df['route_dest'].apply(lambda x: dest_locations[x]['longitude'] if x in dest_locations else None)

In [19]:
routes_df['bus_id'] = buses_df['bus_id']

In [20]:
routes_df

Unnamed: 0,route_id,route_name,route_origin,route_dest,origin_lat,origin_long,dest_lat,dest_long,bus_id
0,RT001,Korangi 100 Quarters to Taiser Town,Korangi 100 Quarters,Taiser Town,24.8277,67.1209,25.0503,67.142,B001
1,RT002,Mianwali Colony to Machine Tool Factory,Mianwali Colony,Machine Tool Factory,24.9456,66.9683,24.9194,67.1065,B002
2,RT003,Korangi 100 Quarters to Ayub Goth,Korangi 100 Quarters,Ayub Goth,24.8277,67.1209,24.9338,67.1234,B003
3,RT004,Surjani Town to Fish Harbour,Surjani Town,Fish Harbour,25.0408,66.9472,24.8435,66.9841,B004
4,RT005,Ittihad Town to Qayyumabad,Ittihad Town,Qayyumabad,24.9296,66.9738,24.8425,67.0653,B005
5,RT006,Mohajir Camp to DHA Phase 8,Mohajir Camp,DHA Phase 8,24.9643,67.0536,24.8011,67.0325,B006
6,RT007,Nai Abadi to Korangi No. 2 1/2,Naiabadi,Korangi No. 2 1/2,24.8286,67.0568,24.8524,67.1271,B007
7,RT008,Gul Ahmed Textile Mills to Mustafabad Colony,Gul Ahmed Textile Mills,Mustafabad Colony,24.8182,67.091,24.8256,67.1023,B008
8,RT009,Ittihad Town to Rasheedabad,Ittihad Town,Rasheedabad,24.9296,66.9738,24.9137,67.0705,B009
9,RT010,Pehlwan Goth to Qayyumabad,Pehlwan Goth,Qayyumabad,24.897,67.1144,24.8425,67.0653,B010


<div class="alert alert-block alert-success">
    <h3>3. Drivers</h3>
</div>

Attributes for 'Drivers' table:

- driver_id
- driver_name
- driver_dob
- driver_phone
- driver_cnic
- driver_address

In [21]:
## initializing 'Drivers' df and creating 'driver_id' attribute

drivers_df = pd.DataFrame()
drivers_df['driver_id'] = [f'DRV{str(i).zfill(3)}' for i in range(1, 30)]

In [22]:
# creating a list of random unique driver names

driver_names = [
    "Ahmed Raza", "Bilal Ahmed", "Hassan Ali", "Muhammad Fahad", "Umar Khan",
    "Arif Hussain", "Saad Farooq", "Salman Aslam", "Junaid Akhtar", "Imran Sheikh",
    "Yasir Qureshi", "Kamran Khan", "Naveed Malik", "Kashif Saeed", "Tariq Mahmood",
    "Waqar Shah", "Faizan Tariq", "Zeeshan Haider", "Waleed Rehman", "Hamza Noor",
    "Anas Abbas", "Shahid Iqbal", "Usman Zafar", "Zain Ahmed", "Danish Mirza",
    "Adnan Rafiq", "Asad Javed", "Rizwan Anwar", "Sohail Hassan"
]

In [23]:
# creating 'driver_name' attribute

drivers_df['driver_name'] = driver_names

In [24]:
# creating a list of random unique driver dates of birth

drivers_dob = [
    '1989-03-14', '1975-06-25', '1980-11-08', '1986-02-19', '1983-09-04',
    '1977-12-22', '1985-05-17', '1981-08-29', '1974-10-10', '1987-04-06',
    '1982-07-23', '1979-01-31', '1976-05-15', '1984-09-11', '1973-11-21',
    '1988-02-27', '1981-06-18', '1980-03-09', '1975-12-14', '1986-10-02',
    '1978-08-19', '1983-11-30', '1977-07-07', '1984-04-12', '1982-01-05',
    '1976-09-27', '1987-03-22', '1980-06-01', '1974-10-16'
]

In [25]:
# creating 'driver_dob' attribute

drivers_df['driver_dob'] = drivers_dob

In [26]:
# creating a list of fictional phone numbers

drivers_numbers = [
    '0300-1234567', '0311-9876543', '0322-3456789', '0333-8765432', '0344-2345678',
    '0355-7654321', '0366-8765432', '0377-1234567', '0388-9876543', '0301-3456789',
    '0312-8765432', '0323-2345678', '0334-7654321', '0345-8765432', '0356-1234567',
    '0367-9876543', '0378-3456789', '0389-8765432', '0302-2345678', '0313-7654321',
    '0324-8765432', '0335-1234567', '0346-9876543', '0357-3456789', '0368-8765432',
    '0379-2345678', '0380-7654321', '0303-8765432', '0314-1234567'
]

In [27]:
# creating 'driver_phone' attribute

drivers_df['driver_phone'] = drivers_numbers

In [28]:
# creating a list of fictional cnic numbers

drivers_cnic = [
    '42201-1234567-0', '42201-2345678-1', '42201-3456789-2', '42201-4567890-3',
    '42201-5678901-4', '42201-6789012-5', '42201-7890123-6', '42201-8901234-7',
    '42201-9012345-8', '42201-0123456-9', '42201-1357924-0', '42201-2468013-1',
    '42201-3579136-2', '42201-4680245-3', '42201-5791358-4', '42201-6802467-5',
    '42201-7913572-6', '42201-8024683-7', '42201-9135794-8', '42201-0246805-9',
    '42201-1357912-0', '42201-2468023-1', '42201-3579134-2', '42201-4680245-3',
    '42201-5791356-4', '42201-6802467-5', '42201-7913578-6', '42201-8024689-7',
    '42201-9135790-8'
]

In [29]:
# creating 'driver_cnic' attribute

drivers_df['driver_cnic'] = drivers_cnic
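Because a CNIC acts as a unique identifier, it may be worth checking the hand-written list for repeats before loading it into the database. For example, `'42201-4680245-3'` appears twice in `drivers_cnic`. A minimal sketch of such a check, using four values copied from that list (`cnic_sample` is a hypothetical name):

```python
from collections import Counter

# Four values copied from drivers_cnic; one of them is repeated
cnic_sample = [
    "42201-3579136-2",
    "42201-4680245-3",
    "42201-3579134-2",
    "42201-4680245-3",
]

# Keep only the values that occur more than once
duplicates = sorted(value for value, n in Counter(cnic_sample).items() if n > 1)
print(duplicates)
```

The same check applies to any column intended to carry a unique constraint, such as phone numbers or the conductor CNICs.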

In [30]:
# creating a list of fictional driver home addresses

drivers_addresses = [
    "House No. 123, Street No. 5, Block A, Gulshan-e-Iqbal",
    "Apartment No. 456, ABC Towers, Shahrah-e-Faisal",
    "Plot No. 789, Phase 6, DHA",
    "Flat No. 101, Block B, Clifton",
    "House No. 321, Street No. 7, North Nazimabad",
    "Plot No. 654, Block C, Gulistan-e-Jauhar",
    "Flat No. 202, XYZ Apartments, PECHS",
    "House No. 987, Street No. 9, Korangi",
    "Apartment No. 303, DEF Residency, Nazimabad",
    "Plot No. 246, Phase 4, DHA",
    "Flat No. 505, Tower A, Bahria Town",
    "House No. 789, Street No. 3, Gulshan-e-Maymar",
    "Apartment No. 404, EFG Towers, Shahrah-e-Quaideen",
    "Plot No. 543, Block D, Gulshan-e-Hadeed",
    "Flat No. 707, Tower B, Sea View Apartments",
    "House No. 321, Street No. 6, Nazimabad",
    "Plot No. 888, Phase 5, DHA",
    "Apartment No. 111, GHI Residency, Clifton",
    "Flat No. 202, Block E, Gulshan-e-Iqbal",
    "House No. 444, Street No. 8, Malir",
    "Plot No. 999, Phase 7, DHA",
    "Apartment No. 222, JKL Towers, Saddar",
    "Flat No. 303, Tower C, Bahria Town",
    "House No. 555, Street No. 4, Gulshan-e-Hadeed",
    "Plot No. 333, Phase 8, DHA",
    "Apartment No. 888, MNO Residency, PECHS",
    "Flat No. 101, Block F, North Karachi",
    "House No. 222, Street No. 2, Gulshan-e-Iqbal",
    "Plot No. 777, Phase 2, DHA"
]

In [31]:
# creating 'driver_address' attribute

drivers_df['driver_address'] = drivers_addresses

In [32]:
drivers_df

Unnamed: 0,driver_id,driver_name,driver_dob,driver_phone,driver_cnic,driver_address
0,DRV001,Ahmed Raza,1989-03-14,0300-1234567,42201-1234567-0,"House No. 123, Street No. 5, Block A, Gulshan-..."
1,DRV002,Bilal Ahmed,1975-06-25,0311-9876543,42201-2345678-1,"Apartment No. 456, ABC Towers, Shahrah-e-Faisal"
2,DRV003,Hassan Ali,1980-11-08,0322-3456789,42201-3456789-2,"Plot No. 789, Phase 6, DHA"
3,DRV004,Muhammad Fahad,1986-02-19,0333-8765432,42201-4567890-3,"Flat No. 101, Block B, Clifton"
4,DRV005,Umar Khan,1983-09-04,0344-2345678,42201-5678901-4,"House No. 321, Street No. 7, North Nazimabad"
5,DRV006,Arif Hussain,1977-12-22,0355-7654321,42201-6789012-5,"Plot No. 654, Block C, Gulistan-e-Jauhar"
6,DRV007,Saad Farooq,1985-05-17,0366-8765432,42201-7890123-6,"Flat No. 202, XYZ Apartments, PECHS"
7,DRV008,Salman Aslam,1981-08-29,0377-1234567,42201-8901234-7,"House No. 987, Street No. 9, Korangi"
8,DRV009,Junaid Akhtar,1974-10-10,0388-9876543,42201-9012345-8,"Apartment No. 303, DEF Residency, Nazimabad"
9,DRV010,Imran Sheikh,1987-04-06,0301-3456789,42201-0123456-9,"Plot No. 246, Phase 4, DHA"


<div class="alert alert-block alert-success">
    <h3>4. Conductors</h3>
</div>

Attributes for 'Conductors' table:

- cond_id
- cond_name
- cond_dob
- cond_phone
- cond_cnic
- cond_address

In [33]:
## initializing 'Conductors' df and creating 'cond_id' attribute

conductors_df = pd.DataFrame()
conductors_df['cond_id'] = [f'CND{str(i).zfill(3)}' for i in range(1, 30)]

In [34]:
# creating a list of random unique conductor names

conductors_names = [
    "Ali Hassan", "Kamran Siddiq", "Imran Ahmed", "Asadullah Khan", "Ahmed Farooq",
    "Usman Shahid", "Faisal Mahmood", "Bilal Rehman", "Zohaib Khan", "Aamir Yousaf",
    "Raza Khan", "Omar Malik", "Naeem Akhtar", "Waqas Aslam", "Sajid Iqbal",
    "Zain ul Abideen", "Tariq Hussain", "Junaid Malik", "Qasim Raza", "Saadat Ali",
    "Fahad Ahmad", "Jawad Ali", "Taimur Riaz", "Shahzad Saleem", "Azhar Abbas",
    "Faisal Naeem", "Mujahid Khan", "Arif Qureshi", "Amir Hamza"
]

In [35]:
# creating 'cond_name' attribute

conductors_df['cond_name'] = conductors_names

In [36]:
# creating a list of random unique conductor dates of birth

conductors_dob = [
    '1999-05-15', '1995-11-03', '1990-07-22', '1992-01-18', '1993-08-25',
    '1996-12-30', '1997-06-05', '1994-09-17', '1991-03-27', '1998-10-12',
    '1990-11-19', '1993-02-24', '1995-06-14', '1997-08-09', '1991-04-06',
    '1994-01-11', '1992-07-29', '1996-03-21', '1999-11-05', '1998-12-18',
    '1991-05-07', '1993-09-26', '1992-02-15', '1995-04-19', '1997-01-04',
    '1990-09-10', '1996-11-23', '1994-05-02', '1999-08-13'
]

In [37]:
# creating 'cond_dob' attribute

conductors_df['cond_dob'] = conductors_dob

In [38]:
# creating a list of fictional conductor phone numbers

conductors_numbers = [
    '0304-1234567', '0315-9876543', '0326-3456789', '0337-8765432', '0348-2345678',
    '0359-7654321', '0361-8765432', '0372-1234567', '0383-9876543', '0305-3456789',
    '0316-8765432', '0327-2345678', '0338-7654321', '0349-8765432', '0350-1234567',
    '0362-9876543', '0373-3456789', '0384-8765432', '0306-2345678', '0317-7654321',
    '0328-8765432', '0339-1234567', '0340-9876543', '0351-3456789', '0363-8765432',
    '0374-2345678', '0385-7654321', '0307-8765432', '0318-1234567'
]

In [39]:
# creating 'cond_phone' attribute

conductors_df['cond_phone'] = conductors_numbers

In [40]:
# creating another list of fictional cnic numbers

conductors_cnic = [
    '42201-2345678-9', '42201-3456789-0', '42201-4567890-1', '42201-5678901-2',
    '42201-6789012-3', '42201-7890123-4', '42201-8901234-5', '42201-9012345-6',
    '42201-0123456-7', '42201-1357924-8', '42201-2468013-9', '42201-3579136-0',
    '42201-4680245-1', '42201-5791358-2', '42201-6802467-3', '42201-7913572-4',
    '42201-8024683-5', '42201-9135794-6', '42201-0246805-7', '42201-1357912-8',
    '42201-2468023-9', '42201-3579134-0', '42201-4680245-1', '42201-5791356-2',
    '42201-6802467-3', '42201-7913578-4', '42201-8024689-5', '42201-9135790-6',
    '42201-0246801-7'
]

In [41]:
# creating 'cond_cnic' attribute

conductors_df['cond_cnic'] = conductors_cnic

In [42]:
# creating a list of fictional conductor home addresses

conductors_address = [
    "House No. 123, Street No. 1, Block F, Federal B Area",
    "Apartment No. 567, XYZ Towers, Tariq Road",
    "Plot No. 789, Phase 2, DHA",
    "Flat No. 104, Block G, Clifton",
    "House No. 321, Street No. 8, Gulshan-e-Maymar",
    "Plot No. 654, Block H, Gulshan-e-Iqbal",
    "Flat No. 202, PQR Apartments, North Karachi",
    "House No. 987, Street No. 10, Korangi",
    "Apartment No. 303, JKL Residency, Saddar",
    "Plot No. 246, Phase 1, DHA",
    "Flat No. 505, Tower D, Bahria Town",
    "House No. 789, Street No. 5, Malir",
    "Apartment No. 404, MNO Towers, Shahrah-e-Faisal",
    "Plot No. 543, Block I, Gulistan-e-Jauhar",
    "Flat No. 707, Tower E, Sea View",
    "House No. 321, Street No. 2, Nazimabad",
    "Plot No. 888, Phase 3, DHA",
    "Apartment No. 111, OPQ Residency, PECHS",
    "Flat No. 202, Block J, North Nazimabad",
    "House No. 444, Street No. 6, Orangi Town",
    "Plot No. 999, Phase 4, DHA",
    "Apartment No. 222, RST Towers, Clifton",
    "Flat No. 303, Tower F, Bahria Town",
    "House No. 555, Street No. 3, Gulshan-e-Hadeed",
    "Plot No. 333, Phase 5, DHA",
    "Apartment No. 888, UVW Residency, Gulshan-e-Iqbal",
    "Flat No. 101, Block K, Gulistan-e-Jauhar",
    "House No. 222, Street No. 4, Federal B Area",
    "Plot No. 777, Phase 6, DHA"
]

In [43]:
# creating 'cond_address' attribute

conductors_df['cond_address'] = conductors_address

In [44]:
conductors_df

Unnamed: 0,cond_id,cond_name,cond_dob,cond_phone,cond_cnic,cond_address
0,CND001,Ali Hassan,1999-05-15,0304-1234567,42201-2345678-9,"House No. 123, Street No. 1, Block F, Federal ..."
1,CND002,Kamran Siddiq,1995-11-03,0315-9876543,42201-3456789-0,"Apartment No. 567, XYZ Towers, Tariq Road"
2,CND003,Imran Ahmed,1990-07-22,0326-3456789,42201-4567890-1,"Plot No. 789, Phase 2, DHA"
3,CND004,Asadullah Khan,1992-01-18,0337-8765432,42201-5678901-2,"Flat No. 104, Block G, Clifton"
4,CND005,Ahmed Farooq,1993-08-25,0348-2345678,42201-6789012-3,"House No. 321, Street No. 8, Gulshan-e-Maymar"
5,CND006,Usman Shahid,1996-12-30,0359-7654321,42201-7890123-4,"Plot No. 654, Block H, Gulshan-e-Iqbal"
6,CND007,Faisal Mahmood,1997-06-05,0361-8765432,42201-8901234-5,"Flat No. 202, PQR Apartments, North Karachi"
7,CND008,Bilal Rehman,1994-09-17,0372-1234567,42201-9012345-6,"House No. 987, Street No. 10, Korangi"
8,CND009,Zohaib Khan,1991-03-27,0383-9876543,42201-0123456-7,"Apartment No. 303, JKL Residency, Saddar"
9,CND010,Aamir Yousaf,1998-10-12,0305-3456789,42201-1357924-8,"Plot No. 246, Phase 1, DHA"


<div class="alert alert-block alert-success">
    <h3>5. Buses_Drivers</h3>
</div>

Attributes for 'Buses_Drivers' table:

- bus_id
- driver_id
- cond_id
- route_id

In [45]:
buses_drivers_df = pd.DataFrame()

In [46]:
buses_drivers_df['bus_id'] = buses_df['bus_id']

In [47]:
buses_drivers_df['driver_id'] = drivers_df['driver_id']

In [48]:
buses_drivers_df['cond_id'] = conductors_df['cond_id']

In [49]:
buses_drivers_df['route_id'] = routes_df['route_id']

In [50]:
buses_drivers_df

Unnamed: 0,bus_id,driver_id,cond_id,route_id
0,B001,DRV001,CND001,RT001
1,B002,DRV002,CND002,RT002
2,B003,DRV003,CND003,RT003
3,B004,DRV004,CND004,RT004
4,B005,DRV005,CND005,RT005
5,B006,DRV006,CND006,RT006
6,B007,DRV007,CND007,RT007
7,B008,DRV008,CND008,RT008
8,B009,DRV009,CND009,RT009
9,B010,DRV010,CND010,RT010


<div class="alert alert-block alert-success">
    <h3>6. Schedules</h3>
</div>

Attributes for 'Schedules' table:

- sched_id
- route_id
- bus_id
- date
- departure_time
- arrival_time

Although real trips occur irregularly, for the purpose of data generation we assume each bus makes at least 3 trips per day. The data spans June 1, 2023, to May 31, 2024, with the May 2024 data held out as a test set for the demonstration at the end of the project.

In [51]:
## initializing 'Schedules' df and creating 'sched_id' attribute

sched_df = pd.DataFrame()
sched_df['sched_id'] = [f'SCH{str(i).zfill(3)}' for i in range(1, 63511)]

In [52]:
sched_df.tail() # viewing changes

Unnamed: 0,sched_id
63505,SCH63506
63506,SCH63507
63507,SCH63508
63508,SCH63509
63509,SCH63510


In [53]:
# creating date attribute

# defining the date range
start_date = datetime(2023, 6, 1)
end_date = datetime(2024, 5, 31)

all_dates = pd.date_range(start_date, end_date)

# dates when public transport is unavailable
unavailable_dates = [
    '2023-06-28', '2023-06-29', '2023-06-30',  # Eid-ul-Adha
    '2023-07-27', '2023-07-28',  # Ashura
    '2023-08-14',  # Independence Day
    '2024-04-10', '2024-04-11', '2024-04-12',  # Eid-ul-Fitr
    '2024-03-23'  # Pakistan Day
]


unavailable_dates = pd.to_datetime(unavailable_dates)
available_dates = all_dates.difference(unavailable_dates) # removing unavailable dates from range

In [54]:
# generating dates for the 'date' column with more trips on weekends and on days adjacent to holidays

weights = []
for date in available_dates:
    if date.weekday() in [5, 6]:  # weekends (Saturday, Sunday)
        weight = 3
    elif (date - pd.Timedelta(days=1)) in unavailable_dates or (date + pd.Timedelta(days=1)) in unavailable_dates:
        weight = 2  # days before/after holidays
    else:
        weight = 1
    weights.append(weight)
    
# converting weights to a probability distribution
weights = np.array(weights)
probabilities = weights / weights.sum()


num_trips_per_day = 3 # on schedule: at least 3 trips per day per bus
num_buses = 29
total_trips = 63510  # 29 buses x 365 days x 6 trips: an average of 6 trips per day per bus (to account for excess trips during holiday seasons and weekends)


date_choices = np.random.choice(available_dates, size=total_trips, p=probabilities)
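To see what the weighting does, the sketch below applies the same idea to a single week: each weekend day gets weight 3 and each weekday weight 1, so weekends should receive about 6/11 of all sampled trips. The seeded `default_rng` generator and the one-week range are assumptions made only to keep the example small and repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator, assumed here for repeatability

# One week of dates, Monday 2023-06-05 through Sunday 2023-06-11
week = np.arange("2023-06-05", "2023-06-12", dtype="datetime64[D]")

# Weekdays get weight 1, Saturday and Sunday get weight 3
weights = np.array([1, 1, 1, 1, 1, 3, 3], dtype=float)
probabilities = weights / weights.sum()

# Draw many dates and measure the share that lands on the weekend
sample = rng.choice(week, size=110_000, p=probabilities)
days, counts = np.unique(sample, return_counts=True)
weekend_share = counts[5:].sum() / counts.sum()
print(weekend_share)
```

Because the dates are drawn independently, the number of trips on any single day varies around its expected value rather than hitting it exactly.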

In [55]:
date_choices

array(['2024-04-28T00:00:00.000000000', '2023-07-14T00:00:00.000000000',
       '2023-12-02T00:00:00.000000000', ...,
       '2024-04-05T00:00:00.000000000', '2024-03-28T00:00:00.000000000',
       '2023-09-15T00:00:00.000000000'], dtype='datetime64[ns]')

In [56]:
# the sampled values also carry a time component, so we truncate them to plain dates

date_choices = np.array([np.datetime64(date, 'D') for date in date_choices])
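The same truncation can be done in one vectorized step with `astype`, which avoids the Python-level loop over 63,510 values. A small sketch with two of the sampled values shown above (`stamps` and `dates_only` are hypothetical names):

```python
import numpy as np

# Two of the sampled values, stored with nanosecond precision
stamps = np.array(
    ["2024-04-28T00:00:00", "2023-07-14T00:00:00"], dtype="datetime64[ns]"
)

# Casting to day precision drops the time component
dates_only = stamps.astype("datetime64[D]")
print(dates_only)
```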

In [57]:
date_choices

array(['2024-04-28', '2023-07-14', '2023-12-02', ..., '2024-04-05',
       '2024-03-28', '2023-09-15'], dtype='datetime64[D]')

In [58]:
print(len(date_choices))

63510


In [59]:
date_choices.sort()

In [60]:
sched_df['date'] = date_choices

In [61]:
sched_df.sample(5)

Unnamed: 0,sched_id,date
55867,SCH55868,2024-04-18
7401,SCH7402,2023-07-15
46852,SCH46853,2024-02-24
50897,SCH50898,2024-03-17
25920,SCH25921,2023-10-28


In [62]:
# creating the 'route_id' attribute
# drawing a uniform random sample of routes (one per scheduled trip) and counting how often each route appears

route_distribution = np.random.choice(routes_df['route_id'], size=len(date_choices), replace=True)
route_counts = Counter(route_distribution) # creating a counter

In [63]:
# normalizing the counts to get the probability distribution

total_counts = sum(route_counts.values())
route_probabilities = {route_id: count / total_counts for route_id, count in route_counts.items()}
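The normalization step can be illustrated on a handful of draws. Because the routes above are drawn uniformly, the resulting probabilities will all sit close to 1/29; the sketch below uses a deliberately uneven, made-up list (`draws`) so the effect is visible.

```python
from collections import Counter

# Hypothetical draws: RT001 appears three times out of six
draws = ["RT001", "RT002", "RT001", "RT003", "RT001", "RT002"]

counts = Counter(draws)
total = sum(counts.values())

# Dividing each count by the total turns counts into probabilities that sum to 1
probs = {route: n / total for route, n in counts.items()}
print(probs)
```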

In [64]:
# converting route probabilities to a list for sampling

route_prob_list = [(route_id, probability) for route_id, probability in route_probabilities.items()]

In [65]:
# sampling route IDs based on the probabilities for each date choice

route_choices, route_probs = zip(*route_prob_list)
route_ids = np.random.choice(route_choices, size=len(date_choices), p=route_probs)

In [66]:
sched_df['route_id'] = route_ids

In [67]:
sched_df.head() # viewing changes

Unnamed: 0,sched_id,date,route_id
0,SCH001,2023-06-01,RT019
1,SCH002,2023-06-01,RT013
2,SCH003,2023-06-01,RT010
3,SCH004,2023-06-01,RT002
4,SCH005,2023-06-01,RT003


In [68]:
# creating the 'bus_id' attribute

for index, row in sched_df.iterrows():
    route_id = row['route_id']

    bus_id = routes_df.loc[routes_df['route_id'] == route_id, 'bus_id'].values

    if len(bus_id) > 0:  # testing the array's length avoids an ambiguous truth-value check
        sched_df.at[index, 'bus_id'] = bus_id[0]
    else:
        print(f"No matching bus_id found for route_id {route_id}")
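Since each route has exactly one bus in `routes_df`, the row-by-row loop can be replaced by a lookup with `Series.map`, which is usually much faster on tens of thousands of rows. A minimal sketch with hypothetical demo frames; `RT999` is an invented ID that shows how unmatched routes can still be reported.

```python
import pandas as pd

# Hypothetical stand-ins for routes_df and sched_df
routes_demo = pd.DataFrame({"route_id": ["RT001", "RT002"], "bus_id": ["B001", "B002"]})
sched_demo = pd.DataFrame({"route_id": ["RT002", "RT001", "RT002", "RT999"]})

# Series indexed by route_id, so map() can use it as a lookup table
route_to_bus = routes_demo.set_index("route_id")["bus_id"]
sched_demo["bus_id"] = sched_demo["route_id"].map(route_to_bus)

# Routes without a matching bus end up as NaN and can be listed in one step
unmatched = sched_demo.loc[sched_demo["bus_id"].isna(), "route_id"].unique().tolist()
print(sched_demo)
print(unmatched)
```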

In [69]:
sched_df.head() # viewing changes

Unnamed: 0,sched_id,date,route_id,bus_id
0,SCH001,2023-06-01,RT019,B019
1,SCH002,2023-06-01,RT013,B013
2,SCH003,2023-06-01,RT010,B010
3,SCH004,2023-06-01,RT002,B002
4,SCH005,2023-06-01,RT003,B003


In [70]:
# creating 'departure_time' attribute

def generate_departure_times(group):

    start_time = pd.to_datetime('07:00')
    increment = pd.Timedelta(minutes=50)
    
    departure_times = [start_time + i * increment for i in range(len(group))]
    
    return [time.strftime('%H:%M') for time in departure_times]

In [71]:
departure_times = sched_df.groupby(['date', 'bus_id']).apply(generate_departure_times).explode().reset_index(drop=True)

In [72]:
sched_df['departure_time'] = departure_times
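One thing to watch: the `groupby(...).apply(...).explode().reset_index(drop=True)` chain returns times in group order, which may not match the row order of `sched_df`, so a time can land on a row for a different bus. If the intent is for each bus's trips on a given day to be spaced 50 minutes apart starting at 07:00, one possible alternative is `groupby(...).cumcount()`, which numbers the trips within each group while keeping the original row index. This is a sketch under that assumption, with a hypothetical `sched_demo` frame.

```python
import pandas as pd

# Hypothetical schedule: two buses with interleaved trips on the same day
sched_demo = pd.DataFrame({
    "date": ["2023-06-01"] * 5,
    "bus_id": ["B001", "B002", "B001", "B002", "B001"],
})

# 0, 1, 2, ... within each (date, bus_id) group, aligned with the original rows
trip_no = sched_demo.groupby(["date", "bus_id"]).cumcount()

# Arbitrary base date; only the time of day is kept in the end
base = pd.Timestamp("2023-06-01 07:00")
departures = pd.to_timedelta(trip_no * 50, unit="min") + base
sched_demo["departure_time"] = departures.dt.strftime("%H:%M")
print(sched_demo)
```

With this approach every (date, bus, departure time) combination is unique by construction, so a later de-duplication step would have nothing to remove.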

In [73]:
sorted_df = sched_df.sort_values(by=['date', 'departure_time']) # sorting values by date and then by departure time

In [74]:
unique_rows = sorted_df.drop_duplicates(subset=['date', 'bus_id', 'departure_time']) # dropping duplicates

In [75]:
len(unique_rows)

44779

In [76]:
unique_rows.head() # viewing changes

Unnamed: 0,sched_id,date,route_id,bus_id,departure_time
0,SCH001,2023-06-01,RT019,B019,07:00
4,SCH005,2023-06-01,RT003,B003,07:00
7,SCH008,2023-06-01,RT028,B028,07:00
10,SCH011,2023-06-01,RT007,B007,07:00
13,SCH014,2023-06-01,RT004,B004,07:00


In [77]:
schedules_df = unique_rows.copy() # creating a copy of the unique rows

In [78]:
# creating 'arrival_time' attribute

schedules_df['arrival_time'] = (pd.to_datetime(schedules_df['departure_time'], format='%H:%M') + pd.Timedelta(minutes=45)).dt.strftime('%H:%M')

In [79]:
# rearranging columns

new_order_sched = ['sched_id', 'route_id', 'bus_id', 'date', 'departure_time', 'arrival_time']
schedules_df = schedules_df[new_order_sched]

In [80]:
schedules_df.head() # viewing changes

Unnamed: 0,sched_id,route_id,bus_id,date,departure_time,arrival_time
0,SCH001,RT019,B019,2023-06-01,07:00,07:45
4,SCH005,RT003,B003,2023-06-01,07:00,07:45
7,SCH008,RT028,B028,2023-06-01,07:00,07:45
10,SCH011,RT007,B007,2023-06-01,07:00,07:45
13,SCH014,RT004,B004,2023-06-01,07:00,07:45


In [81]:
print(len(schedules_df))

44779


We now have data for 44,779 scheduled bus trips for one year.

In [82]:
schedules_df = schedules_df.reset_index(drop=True) # resetting index after removing duplicates

In [83]:
schedules_df['sched_id'] = [f'SCH{str(i).zfill(3)}' for i in range(1, len(schedules_df) + 1)] # resetting sched_id column as well
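Note that `zfill(3)` pads only to three digits, so the IDs have mixed widths (`SCH001` up to `SCH44779`) and sort differently as strings than as numbers. If fixed-width IDs are preferred, one option is to derive the padding width from the row count. The sketch below assumes the row count of 44779 reported above; `width` and `ids` are hypothetical names.

```python
n_rows = 44779  # row count reported for schedules_df above

# Pad to the number of digits in the largest ID, with a minimum of 3
width = max(3, len(str(n_rows)))

# Nested format spec: zero-pad each integer to `width` digits
ids = [f"SCH{i:0{width}d}" for i in range(1, n_rows + 1)]
print(ids[0], ids[-1])
```

Fixed-width IDs keep `ORDER BY sched_id` in SQL consistent with the order in which the trips were generated.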

In [84]:
schedules_df.head()

Unnamed: 0,sched_id,route_id,bus_id,date,departure_time,arrival_time
0,SCH001,RT019,B019,2023-06-01,07:00,07:45
1,SCH002,RT003,B003,2023-06-01,07:00,07:45
2,SCH003,RT028,B028,2023-06-01,07:00,07:45
3,SCH004,RT007,B007,2023-06-01,07:00,07:45
4,SCH005,RT004,B004,2023-06-01,07:00,07:45


In [85]:
schedules_df.tail()

Unnamed: 0,sched_id,route_id,bus_id,date,departure_time,arrival_time
44774,SCH44775,RT007,B007,2024-05-31,12:00,12:45
44775,SCH44776,RT006,B006,2024-05-31,12:00,12:45
44776,SCH44777,RT029,B029,2024-05-31,12:00,12:45
44777,SCH44778,RT004,B004,2024-05-31,12:50,13:35
44778,SCH44779,RT027,B027,2024-05-31,12:50,13:35


<div class="alert alert-block alert-success">
    <h3>7. Fares</h3>
</div>

Attributes for 'Fares' table:

- fare_id
- route_id
- bus_id
- fare_amount
- fare_type

For mini buses:
- Student: Rs. 25
- Regular: Rs. 30

For regular buses:
- Student: Rs. 50
- Regular: Rs. 75
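
The fare rules above can also be expressed as a lookup table keyed on bus type and fare type. The following is a minimal sketch, assuming the `bus_type` values are `'Mini Bus'` and `'Regular Bus'`; `FARE_TABLE` and `lookup_fare` are hypothetical names that are not part of the notebook.

```python
# Hypothetical lookup table for the fare rules listed above (amounts in Rs.).
FARE_TABLE = {
    ("Mini Bus", "Student"): 25,
    ("Mini Bus", "Regular"): 30,
    ("Regular Bus", "Student"): 50,
    ("Regular Bus", "Regular"): 75,
}


def lookup_fare(bus_type, fare_type):
    """Return the fare for a (bus_type, fare_type) pair; raises KeyError if unknown."""
    return FARE_TABLE[(bus_type, fare_type)]


print(lookup_fare("Regular Bus", "Student"))  # 50
```

Unlike a chain of `if`/`elif` branches, an unknown combination raises an error here instead of silently producing a missing fare.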

In [86]:
## initializing 'Fares' df and creating 'fare_id' attribute
# note: 2 (ticket types/bus) x 29 (no. of buses) = 58

fares_df = pd.DataFrame()
fares_df['fare_id'] = [f'F{str(i).zfill(3)}' for i in range(1, 59)]

In [87]:
# creating 'route_id' attribute

route_ids = routes_df['route_id'].values
duplicated_route_ids = [route_id for route_id in route_ids for _ in range(2)]

fares_df['route_id'] = duplicated_route_ids

In [88]:
# creating 'bus_id' attribute

bus_ids = buses_df['bus_id'].values
duplicated_bus_ids = [bus_id for bus_id in bus_ids for _ in range(2)]

fares_df['bus_id'] = duplicated_bus_ids

In [89]:
# creating 'fare_type' attribute

num_rows = len(fares_df)
fares_df['fare_type'] = ['Student' if i % 2 == 0 else 'Regular' for i in range(num_rows)]

In [90]:
# creating 'fare_amount' attribute

merged_df = pd.merge(fares_df, buses_df, on='bus_id', how='left') # joining tables for reference

In [91]:
# defining function to determine fare amount

def determine_fare(row):
    if row['bus_type'] == 'Mini Bus':
        if row['fare_type'] == 'Student':
            return 25
        elif row['fare_type'] == 'Regular':
            return 30
    elif row['bus_type'] == 'Regular Bus':
        if row['fare_type'] == 'Student':
            return 50
        elif row['fare_type'] == 'Regular':
            return 75

In [92]:
merged_df['fare_amount'] = merged_df.apply(determine_fare, axis=1)

In [93]:
fares_df['fare_amount'] = merged_df['fare_amount']

In [94]:
# fixing order of columns

new_order_fares = ['fare_id', 'route_id', 'bus_id', 'fare_amount', 'fare_type']
fares_df = fares_df[new_order_fares]

In [95]:
fares_df

Unnamed: 0,fare_id,route_id,bus_id,fare_amount,fare_type
0,F001,RT001,B001,25,Student
1,F002,RT001,B001,30,Regular
2,F003,RT002,B002,25,Student
3,F004,RT002,B002,30,Regular
4,F005,RT003,B003,25,Student
5,F006,RT003,B003,30,Regular
6,F007,RT004,B004,25,Student
7,F008,RT004,B004,30,Regular
8,F009,RT005,B005,25,Student
9,F010,RT005,B005,30,Regular


<div class="alert alert-block alert-success">
    <h3>8. Payments</h3>
</div>

Attributes for 'Payments' table:

- payment_id
- payment_amount
- payment_method
- payment_date
- payment_time

In [96]:
# first, calculate the total number of tickets sold from the schedule of trips
# and the fares for each respective bus/route
# assumption: every trip ran at full capacity

temp_df = pd.DataFrame()
temp_df['bus_id'] = schedules_df['bus_id']
temp_df = pd.merge(temp_df, buses_df, on='bus_id', how='left')

In [97]:
temp_df.tail()

Unnamed: 0,bus_id,bus_name,bus_type,capacity
44774,B007,W22 Mini Bus,Mini Bus,30
44775,B006,X10 Mini Bus,Mini Bus,30
44776,B029,K9 Mini Bus,Mini Bus,30
44777,B004,W18 Mini Bus,Mini Bus,30
44778,B027,Pink Bus P-2,Regular Bus,50


In [98]:
temp_df['capacity'].sum()

1771610

In [99]:
## initializing 'Payments' df and creating 'payment_id' attribute

payments_df = pd.DataFrame()
payments_df['payment_id'] = [f'P{str(i).zfill(3)}' for i in range(1, temp_df['capacity'].sum() + 1)]

In [100]:
# creating 'payment_date' attribute

temp_df['sched_date'] = schedules_df['date']

In [101]:
payment_dates = []
bus_ids = []

for _, row in temp_df.iterrows():
    payment_dates.extend([row['sched_date']] * row['capacity'])
    bus_ids.extend([row['bus_id']] * row['capacity'])
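
The row-by-row loop above repeats each trip's date and bus once per seat. As a possible alternative, the same expansion can be sketched with `Index.repeat`, shown here on a toy stand-in for `temp_df` (the `toy_trips` frame and its values are made up for illustration).

```python
import pandas as pd

# Toy stand-in for temp_df: one row per scheduled trip.
toy_trips = pd.DataFrame({
    "bus_id": ["B019", "B003"],
    "sched_date": ["2023-06-01", "2023-06-01"],
    "capacity": [3, 2],
})

# Repeat each row's index label `capacity` times, then select those rows.
expanded = toy_trips.loc[toy_trips.index.repeat(toy_trips["capacity"])]
expanded = expanded.reset_index(drop=True)

payment_dates = expanded["sched_date"].tolist()
bus_ids = expanded["bus_id"].tolist()
print(len(expanded))  # 5
```

This avoids `iterrows()`, which tends to be slow on tens of thousands of rows.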

In [102]:
payments_df['payment_date'] = payment_dates
payments_df['bus_id'] = bus_ids

In [103]:
payments_df.tail() # checking changes

Unnamed: 0,payment_id,payment_date,bus_id
1771605,P1771606,2024-05-31,B027
1771606,P1771607,2024-05-31,B027
1771607,P1771608,2024-05-31,B027
1771608,P1771609,2024-05-31,B027
1771609,P1771610,2024-05-31,B027


In [104]:
# creating 'payment_amount' attribute

ptemp_df = payments_df.copy() # creating a temporary df for payment calculations

In [105]:
ptemp_df.head()

Unnamed: 0,payment_id,payment_date,bus_id
0,P001,2023-06-01,B019
1,P002,2023-06-01,B019
2,P003,2023-06-01,B019
3,P004,2023-06-01,B019
4,P005,2023-06-01,B019


In [106]:
np.random.seed(42)
fare_types = np.random.choice(['Regular', 'Student'], size=len(ptemp_df), p=[0.7, 0.3])

In [107]:
ptemp_df['fare_type'] = fare_types

In [108]:
ptemp_df.head()

Unnamed: 0,payment_id,payment_date,bus_id,fare_type
0,P001,2023-06-01,B019,Regular
1,P002,2023-06-01,B019,Student
2,P003,2023-06-01,B019,Student
3,P004,2023-06-01,B019,Regular
4,P005,2023-06-01,B019,Regular


In [109]:
ptemp_df = pd.merge(ptemp_df, buses_df, on='bus_id', how='left')

In [110]:
ptemp_df.columns

Index(['payment_id', 'payment_date', 'bus_id', 'fare_type', 'bus_name',
       'bus_type', 'capacity'],
      dtype='object')

In [111]:
ptemp_df = ptemp_df.drop(columns = ['bus_name', 'capacity'])

In [112]:
ptemp_df.head()

Unnamed: 0,payment_id,payment_date,bus_id,fare_type,bus_type
0,P001,2023-06-01,B019,Regular,Regular Bus
1,P002,2023-06-01,B019,Student,Regular Bus
2,P003,2023-06-01,B019,Student,Regular Bus
3,P004,2023-06-01,B019,Regular,Regular Bus
4,P005,2023-06-01,B019,Regular,Regular Bus


In [113]:
ptemp_df = pd.merge(ptemp_df, fares_df, on=['bus_id', 'fare_type'], how='left')

In [114]:
ptemp_df.columns

Index(['payment_id', 'payment_date', 'bus_id', 'fare_type', 'bus_type',
       'fare_id', 'route_id', 'fare_amount'],
      dtype='object')

In [115]:
ptemp_df.head()

Unnamed: 0,payment_id,payment_date,bus_id,fare_type,bus_type,fare_id,route_id,fare_amount
0,P001,2023-06-01,B019,Regular,Regular Bus,F038,RT019,75
1,P002,2023-06-01,B019,Student,Regular Bus,F037,RT019,50
2,P003,2023-06-01,B019,Student,Regular Bus,F037,RT019,50
3,P004,2023-06-01,B019,Regular,Regular Bus,F038,RT019,75
4,P005,2023-06-01,B019,Regular,Regular Bus,F038,RT019,75


In [116]:
payments_df['payment_amount'] = ptemp_df['fare_amount']

In [117]:
payments_df = payments_df.drop(columns = 'bus_id')

In [118]:
payments_df.head()

Unnamed: 0,payment_id,payment_date,payment_amount
0,P001,2023-06-01,75
1,P002,2023-06-01,50
2,P003,2023-06-01,50
3,P004,2023-06-01,75
4,P005,2023-06-01,75


In [119]:
# creating 'payment_method' attribute

payment_methods = ['Cash', 'Credit Card', 'Debit Card', 'Digital Wallet']
probabilities = [0.4, 0.15, 0.15, 0.30]

In [120]:
payments_df['payment_method'] = np.random.choice(payment_methods, size=len(payments_df), p=probabilities)

In [121]:
payments_df.tail()

Unnamed: 0,payment_id,payment_date,payment_amount,payment_method
1771605,P1771606,2024-05-31,75,Digital Wallet
1771606,P1771607,2024-05-31,50,Digital Wallet
1771607,P1771608,2024-05-31,75,Debit Card
1771608,P1771609,2024-05-31,50,Cash
1771609,P1771610,2024-05-31,75,Credit Card


In [122]:
# creating 'payment_time' attribute

# defining time range (open hours of the bus service)
start_time = pd.to_datetime('07:00:00')
end_time = pd.to_datetime('21:00:00')

time_range = pd.date_range(start=start_time, end=end_time, freq='min') # one entry per minute ('T' is a deprecated alias)

payment_times = np.random.choice(time_range, size=len(payments_df))

In [123]:
payments_df['payment_time'] = payment_times

In [124]:
payments_df.head()

Unnamed: 0,payment_id,payment_date,payment_amount,payment_method,payment_time
0,P001,2023-06-01,75,Cash,2024-05-31 08:22:00
1,P002,2023-06-01,50,Credit Card,2024-05-31 14:18:00
2,P003,2023-06-01,50,Cash,2024-05-31 08:55:00
3,P004,2023-06-01,75,Debit Card,2024-05-31 11:30:00
4,P005,2023-06-01,75,Digital Wallet,2024-05-31 09:42:00


In [125]:
payments_df['payment_time'] = pd.to_datetime(payments_df['payment_time']).dt.time

In [126]:
payments_df.head()

Unnamed: 0,payment_id,payment_date,payment_amount,payment_method,payment_time
0,P001,2023-06-01,75,Cash,08:22:00
1,P002,2023-06-01,50,Credit Card,14:18:00
2,P003,2023-06-01,50,Cash,08:55:00
3,P004,2023-06-01,75,Debit Card,11:30:00
4,P005,2023-06-01,75,Digital Wallet,09:42:00


In [127]:
# fixing column order

payments_order = ['payment_id', 'payment_amount', 'payment_method', 'payment_date', 'payment_time']

payments_df = payments_df[payments_order]

In [128]:
payments_df.head()

Unnamed: 0,payment_id,payment_amount,payment_method,payment_date,payment_time
0,P001,75,Cash,2023-06-01,08:22:00
1,P002,50,Credit Card,2023-06-01,14:18:00
2,P003,50,Cash,2023-06-01,08:55:00
3,P004,75,Debit Card,2023-06-01,11:30:00
4,P005,75,Digital Wallet,2023-06-01,09:42:00


<div class="alert alert-block alert-success">
    <h3>9. Trips</h3>
</div>

Attributes for 'Trips' table:

- trip_id
- bus_id
- cond_id
- route_id
- driver_id
- sched_id
- trip_date 
- departure_time
- arrival_time
- actual_departure_time
- actual_arrival_time

In [129]:
len(schedules_df)

44779

In [130]:
## initializing 'Trips' df and creating 'trip_id' attribute

trips_df = pd.DataFrame()
trips_df['trip_id'] = [f'TRP{str(i).zfill(3)}' for i in range(1, len(schedules_df) + 1)]

In [131]:
schedules_df.tail()

Unnamed: 0,sched_id,route_id,bus_id,date,departure_time,arrival_time
44774,SCH44775,RT007,B007,2024-05-31,12:00,12:45
44775,SCH44776,RT006,B006,2024-05-31,12:00,12:45
44776,SCH44777,RT029,B029,2024-05-31,12:00,12:45
44777,SCH44778,RT004,B004,2024-05-31,12:50,13:35
44778,SCH44779,RT027,B027,2024-05-31,12:50,13:35


In [132]:
# creating the attributes 'bus_id', 'route_id', 'sched_id', 'trip_date', 'departure_time', 'arrival_time'

trips_df[['bus_id', 'route_id', 'sched_id', 'trip_date', 'departure_time', 'arrival_time']] = schedules_df[['bus_id', 'route_id', 'sched_id', 'date', 'departure_time', 'arrival_time']]

In [133]:
trips_df.tail()

Unnamed: 0,trip_id,bus_id,route_id,sched_id,trip_date,departure_time,arrival_time
44774,TRP44775,B007,RT007,SCH44775,2024-05-31,12:00,12:45
44775,TRP44776,B006,RT006,SCH44776,2024-05-31,12:00,12:45
44776,TRP44777,B029,RT029,SCH44777,2024-05-31,12:00,12:45
44777,TRP44778,B004,RT004,SCH44778,2024-05-31,12:50,13:35
44778,TRP44779,B027,RT027,SCH44779,2024-05-31,12:50,13:35


In [134]:
# creating the 'driver_id' and 'cond_id' attributes

buses_drivers_df.head()

Unnamed: 0,bus_id,driver_id,cond_id,route_id
0,B001,DRV001,CND001,RT001
1,B002,DRV002,CND002,RT002
2,B003,DRV003,CND003,RT003
3,B004,DRV004,CND004,RT004
4,B005,DRV005,CND005,RT005


In [135]:
trips_df = pd.merge(trips_df, buses_drivers_df, on=['bus_id', 'route_id'], how='left')

In [136]:
trips_df.head()

Unnamed: 0,trip_id,bus_id,route_id,sched_id,trip_date,departure_time,arrival_time,driver_id,cond_id
0,TRP001,B019,RT019,SCH001,2023-06-01,07:00,07:45,DRV019,CND019
1,TRP002,B003,RT003,SCH002,2023-06-01,07:00,07:45,DRV003,CND003
2,TRP003,B028,RT028,SCH003,2023-06-01,07:00,07:45,DRV028,CND028
3,TRP004,B007,RT007,SCH004,2023-06-01,07:00,07:45,DRV007,CND007
4,TRP005,B004,RT004,SCH005,2023-06-01,07:00,07:45,DRV004,CND004


In [137]:
# creating 'actual_departure_time' attribute

trips_df['departure_time'] = pd.to_datetime(trips_df['departure_time'], format='%H:%M') # explicit format avoids format inference

In [138]:
# defining function for creating delays

def add_random_delta(time):
    
    # probability distribution: 40% chance of 0 minutes, 60% chance of 1-10 minutes
    probabilities = [0.4] + [0.6 / 10] * 10  # 40% for 0, 60% evenly distributed for 1-10
    minutes = np.random.choice(range(11), p=probabilities)  # random integer between 0 and 10

    datetime_time = pd.to_datetime(time.strftime('%H:%M'))
    new_time = datetime_time + pd.Timedelta(minutes=minutes)

    return new_time.strftime('%H:%M')
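
The delay logic above draws one random value per row through `apply`. A vectorized variant is sketched below under the same distribution (40% no delay, 60% spread evenly over 1-10 minutes). The seeded `rng` and the toy `scheduled` series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded generator (assumption, for reproducibility)

scheduled = pd.Series(["07:00", "07:00", "12:50"])  # toy stand-in for departure_time

# Same distribution as add_random_delta: 40% for 0 minutes, 60% spread over 1-10.
probabilities = [0.4] + [0.6 / 10] * 10
delays = rng.choice(np.arange(11), size=len(scheduled), p=probabilities)

# Add the sampled delays to the scheduled times in one vectorized step.
delay_deltas = pd.to_timedelta(pd.Series(delays, index=scheduled.index), unit="min")
actual = (pd.to_datetime(scheduled, format="%H:%M") + delay_deltas).dt.strftime("%H:%M")
print(actual.tolist())
```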

In [139]:
trips_df['actual_departure_time'] = trips_df['departure_time'].apply(add_random_delta)

In [140]:
trips_df['departure_time'] = trips_df['departure_time'].dt.strftime('%H:%M')

In [141]:
trips_df.head()

Unnamed: 0,trip_id,bus_id,route_id,sched_id,trip_date,departure_time,arrival_time,driver_id,cond_id,actual_departure_time
0,TRP001,B019,RT019,SCH001,2023-06-01,07:00,07:45,DRV019,CND019,07:00
1,TRP002,B003,RT003,SCH002,2023-06-01,07:00,07:45,DRV003,CND003,07:00
2,TRP003,B028,RT028,SCH003,2023-06-01,07:00,07:45,DRV028,CND028,07:04
3,TRP004,B007,RT007,SCH004,2023-06-01,07:00,07:45,DRV007,CND007,07:09
4,TRP005,B004,RT004,SCH005,2023-06-01,07:00,07:45,DRV004,CND004,07:10


In [142]:
# creating 'actual_arrival_time' attribute

trips_df['arrival_time'] = pd.to_datetime(trips_df['arrival_time'], format='%H:%M') # explicit format avoids format inference

In [143]:
# defining function for creating delays

def add_random_delta_arrival(time):
    
    # probability distribution: 30% chance of 0 minutes, 70% chance of 1-30 minutes
    probabilities = [0.3] + [0.7 / 30] * 30  # 30% for 0, 70% evenly distributed for 1-30
    minutes = np.random.choice(range(31), p=probabilities)  # random integer between 0 and 30

    datetime_time = pd.to_datetime(time.strftime('%H:%M'))
    new_time = datetime_time + pd.Timedelta(minutes=minutes)

    return new_time.strftime('%H:%M')

In [144]:
trips_df['actual_arrival_time'] = trips_df['arrival_time'].apply(add_random_delta_arrival)

In [145]:
trips_df['arrival_time'] = trips_df['arrival_time'].dt.strftime('%H:%M')

In [146]:
trips_df.head()

Unnamed: 0,trip_id,bus_id,route_id,sched_id,trip_date,departure_time,arrival_time,driver_id,cond_id,actual_departure_time,actual_arrival_time
0,TRP001,B019,RT019,SCH001,2023-06-01,07:00,07:45,DRV019,CND019,07:00,07:51
1,TRP002,B003,RT003,SCH002,2023-06-01,07:00,07:45,DRV003,CND003,07:00,07:45
2,TRP003,B028,RT028,SCH003,2023-06-01,07:00,07:45,DRV028,CND028,07:04,08:12
3,TRP004,B007,RT007,SCH004,2023-06-01,07:00,07:45,DRV007,CND007,07:09,08:04
4,TRP005,B004,RT004,SCH005,2023-06-01,07:00,07:45,DRV004,CND004,07:10,07:56


In [147]:
trips_order = ['trip_id', 'bus_id', 'cond_id', 'route_id', 'driver_id', 'sched_id', 'trip_date',
               'departure_time', 'arrival_time', 'actual_departure_time', 'actual_arrival_time']

trips_df = trips_df[trips_order]

In [148]:
trips_df.head()

Unnamed: 0,trip_id,bus_id,cond_id,route_id,driver_id,sched_id,trip_date,departure_time,arrival_time,actual_departure_time,actual_arrival_time
0,TRP001,B019,CND019,RT019,DRV019,SCH001,2023-06-01,07:00,07:45,07:00,07:51
1,TRP002,B003,CND003,RT003,DRV003,SCH002,2023-06-01,07:00,07:45,07:00,07:45
2,TRP003,B028,CND028,RT028,DRV028,SCH003,2023-06-01,07:00,07:45,07:04,08:12
3,TRP004,B007,CND007,RT007,DRV007,SCH004,2023-06-01,07:00,07:45,07:09,08:04
4,TRP005,B004,CND004,RT004,DRV004,SCH005,2023-06-01,07:00,07:45,07:10,07:56


<div class="alert alert-block alert-success">
    <h3>10. Incidents</h3>
</div>

Attributes for 'Incidents' table:

- incident_id
- trip_id
- incident_type
- description
- incident_time

We'll record one incident for each of 20,000 randomly selected trips.

In [149]:
## initializing 'Incidents' table and creating 'trip_id' attribute

incidents_df = pd.DataFrame({'trip_id': trips_df['trip_id'].sample(n=20000, replace=False).sort_values()})

In [150]:
# creating 'incident_id' attribute

incidents_df['incident_id'] = [f'INC{str(i).zfill(3)}' for i in range(1, 20001)]

In [151]:
# creating 'incident_type' and 'description' attribute

incident_types = {
    "Mechanical Failure": "A breakdown or malfunction of the vehicle's mechanical components.",
    "Passenger Disruption": "Disruptive behavior exhibited by passengers, causing disturbances.",
    "Accidental Damage": "Unintentional damage to the vehicle, such as minor collisions.",
    "Weather Related Delay": "Delays caused by adverse weather conditions impacting travel.",
    "Route Obstruction": "Blockage or obstruction along the planned route, hindering travel progress."
}

In [152]:
incident_list = list(incident_types.keys())
probabilities = [0.1, 0.1, 0.1, 0.3, 0.4]

sampled_incident_types = np.random.choice(incident_list, size=len(incidents_df), p=probabilities)

incidents_df['incident_type'] = sampled_incident_types
incidents_df['description'] = incidents_df['incident_type'].map(incident_types)

In [153]:
incidents_df.head()

Unnamed: 0,trip_id,incident_id,incident_type,description
0,TRP001,INC001,Accidental Damage,"Unintentional damage to the vehicle, such as m..."
1,TRP002,INC002,Route Obstruction,Blockage or obstruction along the planned rout...
5,TRP006,INC003,Route Obstruction,Blockage or obstruction along the planned rout...
7,TRP008,INC004,Route Obstruction,Blockage or obstruction along the planned rout...
11,TRP012,INC005,Passenger Disruption,"Disruptive behavior exhibited by passengers, c..."


In [154]:
# creating 'incident_time' attribute

def generate_random_time():
    start_time = pd.Timestamp('07:00')
    end_time = pd.Timestamp('22:00')
    random_minutes = random.randint(0, int((end_time - start_time).total_seconds() // 60))
    random_time = start_time + pd.Timedelta(minutes=random_minutes)
    return random_time.strftime('%H:%M')

In [155]:
incidents_df['incident_time'] = [generate_random_time() for _ in range(len(incidents_df))]
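
Note that `generate_random_time` draws from the whole 07:00-22:00 window, independent of when the trip actually ran. If incident times should fall inside the trip's own window, one option is sketched below. `random_time_between` is a hypothetical helper that is not part of the notebook, and the example times are taken from the `trips_df` preview above.

```python
import random

import pandas as pd


def random_time_between(start_hhmm, end_hhmm, rng=random):
    """Return a random 'HH:MM' time between two same-day 'HH:MM' strings (inclusive)."""
    start = pd.to_datetime(start_hhmm, format="%H:%M")
    end = pd.to_datetime(end_hhmm, format="%H:%M")
    span_minutes = int((end - start).total_seconds() // 60)
    offset = rng.randint(0, max(span_minutes, 0))
    return (start + pd.Timedelta(minutes=offset)).strftime("%H:%M")


# Example: an incident during a trip that left at 07:04 and arrived at 08:12.
print(random_time_between("07:04", "08:12"))
```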

In [156]:
incidents_order = ['incident_id', 'trip_id', 'incident_type', 'description', 'incident_time']

incidents_df = incidents_df[incidents_order]

In [157]:
incidents_df.head()

Unnamed: 0,incident_id,trip_id,incident_type,description,incident_time
0,INC001,TRP001,Accidental Damage,"Unintentional damage to the vehicle, such as m...",19:43
1,INC002,TRP002,Route Obstruction,Blockage or obstruction along the planned rout...,21:51
5,INC003,TRP006,Route Obstruction,Blockage or obstruction along the planned rout...,21:03
7,INC004,TRP008,Route Obstruction,Blockage or obstruction along the planned rout...,18:17
11,INC005,TRP012,Passenger Disruption,"Disruptive behavior exhibited by passengers, c...",17:34


<div class="alert alert-block alert-success">
    <h3>11. Users</h3>
</div>

Attributes for 'Users' table:

- user_id
- user_name
- user_dob
- user_gender
- user_phone
- user_email

We'll create a table of 22,000 users.

In [158]:
## initializing 'Users' df and creating 'user_id' attribute

users_df = pd.DataFrame()
users_df['user_id'] = [f'U{str(i).zfill(3)}' for i in range(1, 22001)]

In [159]:
# creating 'user_name' attribute

male_names = ["Ahmed", "Ali", "Bilal", "Hassan", "Kashif", "Sohail", "Usman"]
female_names = ["Aisha", "Fatima", "Khadija", "Mariam", "Sana", "Zara", "Zainab"]

surnames = ["Khan", "Malik", "Shaikh", "Butt", "Chaudhry", "Qureshi", "Abbasi", "Ansari"]

first_names = male_names + female_names

In [160]:
# defining function to generate full name combos

def generate_full_name():
    first_name = random.choice(first_names)
    surname = random.choice(surnames)
    return f"{first_name} {surname}"

In [161]:
random_full_names = [generate_full_name() for _ in range(22000)]
users_df['user_name'] = random_full_names

In [162]:
# creating 'user_dob' attribute
# age range: 18-75

# defining a function to generate a random dob

def generate_random_dob():

    age_min = 18
    age_max = 75
    
    current_date = datetime.now()
    
    random_age = random.randint(age_min, age_max)
    
    birth_year = current_date.year - random_age
    
    random_date_of_birth = datetime(birth_year, 1, 1) + timedelta(days=random.randint(0, 364))
    
    return random_date_of_birth

In [163]:
random_birth_dates = [generate_random_dob() for _ in range(22000)]

users_df['user_dob'] = random_birth_dates

In [164]:
# creating 'user_gender' attribute

# defining function to determine gender through user's name

def determine_gender(full_name):
    first_name = full_name.split()[0]
    if first_name in male_names:
        return 'Male'
    elif first_name in female_names:
        return 'Female'
    else:
        return 'Unknown'

In [165]:
users_df['user_gender'] = users_df['user_name'].apply(determine_gender)

In [166]:
# creating 'user_phone' attribute

operator_codes = ['300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '311', '312', '313', '314', '315', '316', '317', '318', '319']

phone_numbers = set()
while len(phone_numbers) < 22000:
    operator_code = random.choice(operator_codes)
    subscriber_number = ''.join([str(random.randint(0, 9)) for _ in range(7)])
    phone_number = f'+92{operator_code}{subscriber_number}'
    phone_numbers.add(phone_number)

In [167]:
users_df['user_phone'] = list(phone_numbers)

In [168]:
# creating 'user_email' attribute

# defining function to generate synthetic emails

def generate_email(name):
    first_name = name.split()[0].lower()
    return f'{first_name}@example.com'
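
Because `generate_email` uses only the first name, and there are just 14 distinct first names, many users end up sharing an address. If unique addresses are wanted, one option is to include the `user_id`. The sketch below uses a hypothetical helper, `generate_unique_email`, which is not part of the notebook.

```python
def generate_unique_email(name, user_id):
    """Build an address from the first name plus the user_id (hypothetical helper)."""
    first_name = name.split()[0].lower()
    return f"{first_name}.{user_id.lower()}@example.com"


print(generate_unique_email("Bilal Chaudhry", "U21996"))  # bilal.u21996@example.com
```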

In [169]:
users_df['user_email'] = users_df['user_name'].apply(generate_email)

In [170]:
users_df.tail()

Unnamed: 0,user_id,user_name,user_dob,user_gender,user_phone,user_email
21995,U21996,Bilal Chaudhry,1950-09-06,Male,923116738008,bilal@example.com
21996,U21997,Usman Shaikh,1981-04-02,Male,923155066792,usman@example.com
21997,U21998,Zainab Khan,1992-02-06,Female,923072539675,zainab@example.com
21998,U21999,Usman Qureshi,1949-04-13,Male,923082510005,usman@example.com
21999,U22000,Khadija Chaudhry,1953-03-15,Female,923083929690,khadija@example.com


<div class="alert alert-block alert-success">
    <h3>12. Tickets</h3>
</div>

Attributes for 'Tickets' table:

- ticket_id
- trip_id
- user_id
- payment_id
- fare_id
- issue_time

In [171]:
len(payments_df)

1771610

In [172]:
## initializing 'Tickets' df and creating 'ticket_id' attribute

tickets_df = pd.DataFrame()
tickets_df['ticket_id'] = [f'TCK{str(i).zfill(3)}' for i in range(1, len(payments_df) + 1)]

In [173]:
# creating 'payment_id' attribute

tickets_df['payment_id'] = payments_df['payment_id']

In [174]:
ptemp_df.head()

Unnamed: 0,payment_id,payment_date,bus_id,fare_type,bus_type,fare_id,route_id,fare_amount
0,P001,2023-06-01,B019,Regular,Regular Bus,F038,RT019,75
1,P002,2023-06-01,B019,Student,Regular Bus,F037,RT019,50
2,P003,2023-06-01,B019,Student,Regular Bus,F037,RT019,50
3,P004,2023-06-01,B019,Regular,Regular Bus,F038,RT019,75
4,P005,2023-06-01,B019,Regular,Regular Bus,F038,RT019,75


In [175]:
# creating 'fare_id' attribute

tickets_df['fare_id'] = ptemp_df['fare_id']

In [176]:
# creating 'issue_time'

tickets_df['issue_time'] = payments_df['payment_time']

In [177]:
trips_df.head()

Unnamed: 0,trip_id,bus_id,cond_id,route_id,driver_id,sched_id,trip_date,departure_time,arrival_time,actual_departure_time,actual_arrival_time
0,TRP001,B019,CND019,RT019,DRV019,SCH001,2023-06-01,07:00,07:45,07:00,07:51
1,TRP002,B003,CND003,RT003,DRV003,SCH002,2023-06-01,07:00,07:45,07:00,07:45
2,TRP003,B028,CND028,RT028,DRV028,SCH003,2023-06-01,07:00,07:45,07:04,08:12
3,TRP004,B007,CND007,RT007,DRV007,SCH004,2023-06-01,07:00,07:45,07:09,08:04
4,TRP005,B004,CND004,RT004,DRV004,SCH005,2023-06-01,07:00,07:45,07:10,07:56


In [194]:
# creating 'trip_id' attribute

ticktemp_df = pd.DataFrame()
ticktemp_df[['trip_id', 'bus_id']] = trips_df[['trip_id', 'bus_id']]

In [190]:
len(payments_df)

1771610

In [195]:
ticktemp_df.tail()

Unnamed: 0,trip_id,bus_id
44774,TRP44775,B007
44775,TRP44776,B006
44776,TRP44777,B029
44777,TRP44778,B004
44778,TRP44779,B027


In [187]:
len(schedules_df)

44779

In [196]:
temp2 = pd.DataFrame()
temp2[['bus_id', 'capacity']] = buses_df[['bus_id', 'capacity']]
temp2.tail()

Unnamed: 0,bus_id,capacity
24,B025,30
25,B026,50
26,B027,50
27,B028,50
28,B029,30


In [197]:
ticktemp_df = pd.merge(ticktemp_df, temp2, on='bus_id', how='left')

In [198]:
ticktemp_df.tail()

Unnamed: 0,trip_id,bus_id,capacity
44774,TRP44775,B007,30
44775,TRP44776,B006,30
44776,TRP44777,B029,30
44777,TRP44778,B004,30
44778,TRP44779,B027,50


In [199]:
ticktemp_df.head()

Unnamed: 0,trip_id,bus_id,capacity
0,TRP001,B019,50
1,TRP002,B003,30
2,TRP003,B028,50
3,TRP004,B007,30
4,TRP005,B004,30


In [200]:
trip_ids = []

for _, row in ticktemp_df.iterrows():
    trip_ids.extend([row['trip_id']] * row['capacity'])

In [201]:
len(trip_ids)

1771610

In [202]:
tickets_df['trip_id'] = trip_ids

In [204]:
# creating 'user_id' attribute

unique_user_ids = users_df['user_id']
num_unique_users = len(users_df)
num_tickets = len(tickets_df)

In [205]:
tickets_df.loc[:num_unique_users-1, 'user_id'] = unique_user_ids
remaining_user_ids = np.random.choice(unique_user_ids, num_tickets - num_unique_users, replace=True)
tickets_df.loc[num_unique_users:, 'user_id'] = remaining_user_ids

In [207]:
tickets_order = ['ticket_id', 'trip_id', 'user_id', 'payment_id', 'fare_id', 'issue_time']
tickets_df = tickets_df[tickets_order]

In [208]:
tickets_df.head()

Unnamed: 0,ticket_id,trip_id,user_id,payment_id,fare_id,issue_time
0,TCK001,TRP001,U001,P001,F038,08:22:00
1,TCK002,TRP001,U002,P002,F037,14:18:00
2,TCK003,TRP001,U003,P003,F037,08:55:00
3,TCK004,TRP001,U004,P004,F038,11:30:00
4,TCK005,TRP001,U005,P005,F038,09:42:00


In [209]:
tickets_df.tail()

Unnamed: 0,ticket_id,trip_id,user_id,payment_id,fare_id,issue_time
1771605,TCK1771606,TRP44779,U7130,P1771606,F054,09:07:00
1771606,TCK1771607,TRP44779,U885,P1771607,F053,14:28:00
1771607,TCK1771608,TRP44779,U14958,P1771608,F054,18:36:00
1771608,TCK1771609,TRP44779,U20453,P1771609,F053,12:20:00
1771609,TCK1771610,TRP44779,U21941,P1771610,F054,19:06:00


<div class="alert alert-block alert-success">
    <h3>13. Complaints</h3>
</div>

Attributes for 'Complaints' table:

- complaint_id
- user_id
- trip_id
- comp_type
- description
- comp_time

We'll generate 15,000 complaints.

In [211]:
## initializing 'Complaints' df and creating 'complaint_id' attribute

complaints_df = pd.DataFrame()
complaints_df['complaint_id'] = [f'CMP{str(i).zfill(3)}' for i in range(1, 15000 + 1)]

In [212]:
# creating 'user_id' attribute

user_ids = users_df['user_id'].values
num_complaints = len(complaints_df)
random_user_ids = np.random.choice(user_ids, num_complaints, replace=True)
complaints_df['user_id'] = random_user_ids

In [213]:
# creating 'trip_id' attribute

trip_ids = trips_df['trip_id'].values
random_trip_ids = np.random.choice(trip_ids, num_complaints, replace=True)
complaints_df['trip_id'] = random_trip_ids

In [216]:
# creating the 'comp_type' and 'description' attributes

complaint_types = {
    "Delay": "The vehicle did not arrive or depart at the scheduled time, causing inconvenience.",
    "Cleanliness": "The vehicle was not clean, impacting passenger comfort and hygiene.",
    "Rude Staff": "Staff members exhibited rude or unprofessional behavior towards passengers.",
    "Overcrowding": "The vehicle was too crowded, causing discomfort for passengers.",
    "Noisy Environment": "Excessive noise inside the vehicle, disturbing passengers.",
    "Uncomfortable Seats": "Seats were uncomfortable or in poor condition.",
    "Route Change": "Unexpected changes to the route without prior notice.",
    "Poor Air Conditioning": "Inadequate or malfunctioning air conditioning system.",
    "Safety Concerns": "Issues related to passenger safety, such as reckless driving.",
    "Accessibility Issues": "Problems related to accessibility for disabled or elderly passengers."
}

In [218]:
complaint_list = list(complaint_types.keys())
probabilities = [0.3, 0.05, 0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.05]

sampled_complaint_types = np.random.choice(complaint_list, size=len(complaints_df), p=probabilities)

complaints_df['comp_type'] = sampled_complaint_types
complaints_df['description'] = complaints_df['comp_type'].map(complaint_types)

In [219]:
complaints_df['comp_time'] = [generate_random_time() for _ in range(len(complaints_df))]

In [220]:
complaints_df.head()

Unnamed: 0,complaint_id,user_id,trip_id,comp_type,description,comp_time
0,CMP001,U13525,TRP11005,Uncomfortable Seats,Seats were uncomfortable or in poor condition.,18:48
1,CMP002,U2805,TRP31165,Delay,The vehicle did not arrive or depart at the sc...,10:52
2,CMP003,U10299,TRP36829,Route Change,Unexpected changes to the route without prior ...,21:41
3,CMP004,U11845,TRP20387,Noisy Environment,"Excessive noise inside the vehicle, disturbing...",15:27
4,CMP005,U2705,TRP6323,Poor Air Conditioning,Inadequate or malfunctioning air conditioning ...,17:13


<div class="alert alert-block alert-success">
    <h2>Separating Test Datasets</h2>
</div>

In [231]:
# step 1: filtering trips_df and schedules_df for May 2024
trips_df_test = trips_df[(trips_df['trip_date'] >= '2024-05-01') & (trips_df['trip_date'] <= '2024-05-31')]
schedules_df_test = schedules_df[(schedules_df['date'] >= '2024-05-01') & (schedules_df['date'] <= '2024-05-31')]

# step 2: extracting trip_ids from trips_df_test
trip_ids_test = trips_df_test['trip_id'].unique()

# step 3: filtering tickets_df, payments_df, complaints_df, and incidents_df using the trip_ids
tickets_df_test = tickets_df[tickets_df['trip_id'].isin(trip_ids_test)]
payments_df_test = payments_df[payments_df['payment_id'].isin(tickets_df_test['payment_id'].unique())]
complaints_df_test = complaints_df[complaints_df['trip_id'].isin(trip_ids_test)]
incidents_df_test = incidents_df[incidents_df['trip_id'].isin(trip_ids_test)]

# step 4: removing test data from the original DataFrames
trips_df = trips_df[~trips_df['trip_id'].isin(trip_ids_test)]
schedules_df = schedules_df[~schedules_df['date'].isin(schedules_df_test['date'])]
tickets_df = tickets_df[~tickets_df['trip_id'].isin(trip_ids_test)]
payments_df = payments_df[~payments_df['payment_id'].isin(tickets_df_test['payment_id'].unique())]
complaints_df = complaints_df[~complaints_df['trip_id'].isin(trip_ids_test)]
incidents_df = incidents_df[~incidents_df['trip_id'].isin(trip_ids_test)]
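
The split above follows one pattern throughout: pick the May 2024 trips, then send every dependent row to the test set or the remaining set depending on its `trip_id`. The sketch below shows that pattern on toy data with a single boolean mask per table. The `toy_trips` and `toy_tickets` frames are made up for illustration.

```python
import pandas as pd

toy_trips = pd.DataFrame({
    "trip_id": ["TRP001", "TRP002", "TRP003"],
    "trip_date": pd.to_datetime(["2024-04-30", "2024-05-01", "2024-05-31"]),
})
toy_tickets = pd.DataFrame({
    "ticket_id": ["TCK001", "TCK002", "TCK003", "TCK004"],
    "trip_id": ["TRP001", "TRP002", "TRP002", "TRP003"],
})

# One mask selects the test month; its complement is the remaining data.
is_test_trip = toy_trips["trip_date"].between("2024-05-01", "2024-05-31")
test_trip_ids = toy_trips.loc[is_test_trip, "trip_id"]

is_test_ticket = toy_tickets["trip_id"].isin(test_trip_ids)
tickets_test = toy_tickets[is_test_ticket]
tickets_train = toy_tickets[~is_test_ticket]

# Every row should land in exactly one of the two parts.
print(len(tickets_test), len(tickets_train))  # 3 1
```

Checking that the two parts add up to the original row count is a cheap way to confirm that no rows were lost or duplicated by the split.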

In [232]:
# verifying changes and test datasets created

print("Trips DataFrame Test:")
print(trips_df_test.head())
print("\nSchedules DataFrame Test:")
print(schedules_df_test.head())
print("\nTickets DataFrame Test:")
print(tickets_df_test.head())
print("\nPayments DataFrame Test:")
print(payments_df_test.head())
print("\nComplaints DataFrame Test:")
print(complaints_df_test.head())
print("\nIncidents DataFrame Test:")
print(incidents_df_test.head())

Trips DataFrame Test:
        trip_id bus_id cond_id route_id driver_id  sched_id  trip_date  \
40986  TRP40987   B006  CND006    RT006    DRV006  SCH40987 2024-05-01   
40987  TRP40988   B018  CND018    RT018    DRV018  SCH40988 2024-05-01   
40988  TRP40989   B028  CND028    RT028    DRV028  SCH40989 2024-05-01   
40989  TRP40990   B003  CND003    RT003    DRV003  SCH40990 2024-05-01   
40990  TRP40991   B007  CND007    RT007    DRV007  SCH40991 2024-05-01   

      departure_time arrival_time actual_departure_time actual_arrival_time  
40986          07:00        07:45                 07:00               07:56  
40987          07:00        07:45                 07:01               08:03  
40988          07:00        07:45                 07:00               08:10  
40989          07:00        07:45                 07:00               08:02  
40990          07:00        07:45                 07:02               07:45  

Schedules DataFrame Test:
       sched_id route_id bus_id       

<div class="alert alert-block alert-success">
    <h2>Loading All Tables to a PostgreSQL Database</h2>
</div>

In [253]:
all_dataframes = [users_df, buses_df, drivers_df, conductors_df, routes_df, fares_df,
                 schedules_df, tickets_df, payments_df, incidents_df, complaints_df, buses_drivers_df, trips_df]

test_dataframes = [trips_df_test, schedules_df_test, tickets_df_test, payments_df_test, complaints_df_test, incidents_df_test]

dataframe_names = ['Users', 'Buses', 'Drivers', 'Conductors', 'Routes', 'Fares',
                   'Schedules', 'Tickets', 'Payments', 'Incidents', 'Complaints', 'Buses_Drivers', 'Trips']

test_df_names = ['Trips_test', 'Schedules_Test', 'Tickets_test', 'Payments_test', 'Complaints_test', 'Incidents_test']

In [238]:
print(len(all_dataframes))
print(len(dataframe_names))

13
13


In [236]:
pip install sqlalchemy psycopg2


In [240]:
from sqlalchemy import create_engine

connection_string = 'postgresql://aribaandsumbal:DAWproject@localhost:5432/Public Transport (Karachi)'
engine = create_engine(connection_string)

for df, name in zip(all_dataframes, dataframe_names):
    df.to_sql(name, engine, if_exists='replace', index=False)

print("DataFrames have been stored in the PostgreSQL database")

DataFrames have been stored in the PostgreSQL database
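The connection string above embeds the username and password directly in the notebook. A minimal sketch of one alternative, assuming the credentials are supplied through environment variables: the variable names (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`), the helper `build_connection_string`, and the example values are illustrative assumptions, not part of the original project. The password is URL-encoded so that characters such as `@` or `/` do not break the URL.

```python
import os
from urllib.parse import quote_plus

def build_connection_string(env=os.environ):
    """Assemble a PostgreSQL URL from environment-style settings (hypothetical helper)."""
    user = env["PGUSER"]
    # URL-encode the password so special characters survive inside the URL
    password = quote_plus(env["PGPASSWORD"])
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env["PGDATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

# Example values are made up for illustration only
example_env = {"PGUSER": "transit_user", "PGPASSWORD": "p@ss/word", "PGDATABASE": "transit"}
print(build_connection_string(example_env))
```

The resulting string can be passed to `create_engine` in the same way as the hardcoded one.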


In [259]:
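second_export_names = []
# An Excel worksheet holds at most 1,048,576 rows, and one of them is the header row
excel_row_limit = 1048576
max_data_rows = excel_row_limit - 1

for name, df in zip(dataframe_names, all_dataframes):
    if len(df) > max_data_rows:
        second_export_names.append(name)

print(second_export_names)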

['Tickets', 'Payments']


In [255]:
first_export = [users_df, buses_df, drivers_df, conductors_df, routes_df, fares_df,
                schedules_df, incidents_df, complaints_df, buses_drivers_df, trips_df, 
                trips_df_test, schedules_df_test, tickets_df_test, payments_df_test, complaints_df_test, incidents_df_test]

first_export_names = ['Users', 'Buses', 'Drivers', 'Conductors', 'Routes', 'Fares',
                      'Schedules', 'Incidents', 'Complaints', 'Buses_Drivers', 'Trips',
                      'Trips_test', 'Schedules_Test', 'Tickets_test', 'Payments_test', 'Complaints_test', 'Incidents_test']

In [256]:
## saving first batch of dataframes as Excel files for submission
import os

def save_to_excel(dataframes, names, folder_path='.'):
    # Write each DataFrame to <folder_path>/<name>.xlsx without the index column
    for df, name in zip(dataframes, names):
        df.to_excel(os.path.join(folder_path, f"{name}.xlsx"), index=False)
        print(f"{name}.xlsx saved successfully.")

In [257]:
save_to_excel(first_export, first_export_names)

Users.xlsx saved successfully.
Buses.xlsx saved successfully.
Drivers.xlsx saved successfully.
Conductors.xlsx saved successfully.
Routes.xlsx saved successfully.
Fares.xlsx saved successfully.
Schedules.xlsx saved successfully.
Incidents.xlsx saved successfully.
Complaints.xlsx saved successfully.
Buses_Drivers.xlsx saved successfully.
Trips.xlsx saved successfully.
Trips_test.xlsx saved successfully.
Schedules_Test.xlsx saved successfully.
Tickets_test.xlsx saved successfully.
Payments_test.xlsx saved successfully.
Complaints_test.xlsx saved successfully.
Incidents_test.xlsx saved successfully.


In [260]:
## saving second batch of dataframes as Excel files for submission
# note: these DataFrames exceed Excel's row limit, so each one is split across several files

second_export = [tickets_df, payments_df]

In [265]:
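import math
import os

def save_large_dataframes(dataframes, names, folder_path='.'):
    # Excel allows 1,048,576 rows per sheet; the header uses one of them
    max_data_rows = 1048576 - 1
    for df, name in zip(dataframes, names):
        # Ceiling division so that no chunk exceeds the per-sheet limit
        num_chunks = max(1, math.ceil(len(df) / max_data_rows))
        for i in range(num_chunks):
            # Slice consecutive blocks of rows instead of relying on np.array_split
            chunk = df.iloc[i * max_data_rows:(i + 1) * max_data_rows]
            file_name = f"{name}_part{i+1}.xlsx"
            file_path = os.path.join(folder_path, file_name)
            chunk.to_excel(file_path, index=False)
            print(f"{file_name} saved successfully.")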

In [266]:
save_large_dataframes(second_export, second_export_names)

Tickets_part1.xlsx saved successfully.
Tickets_part2.xlsx saved successfully.
Payments_part1.xlsx saved successfully.
Payments_part2.xlsx saved successfully.
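The arithmetic behind the split can be checked without writing any files. The sketch below is a hypothetical helper (`chunk_bounds` is not part of the notebook) that returns the row ranges each part would cover. It assumes one row per worksheet is reserved for the header, so at most 1,048,575 data rows fit in a file.

```python
import math

EXCEL_MAX_ROWS = 1_048_576  # rows per worksheet, header included

def chunk_bounds(n_rows, max_data_rows=EXCEL_MAX_ROWS - 1):
    """Return (start, stop) row ranges, each small enough for one worksheet."""
    n_chunks = max(1, math.ceil(n_rows / max_data_rows))
    return [
        (i * max_data_rows, min((i + 1) * max_data_rows, n_rows))
        for i in range(n_chunks)
    ]

# A made-up row count, used only to illustrate the calculation
print(chunk_bounds(2_500_000))
```

Each `(start, stop)` pair could be passed to `df.iloc[start:stop]` to obtain the rows for the corresponding part file.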
