# Define a problem or research question you will aim to address throughout the course

--> Something personally interesting to you

--> Tell us why it is interesting/important!

--> Something that you can perform analysis on and specifically, make a prediction about

--> The more focused, measurable, and specific you can be, the better

### Problem:
What factors influence taxi fares (particularly in DC), and can we build a model to predict taxi fares based on the distance, time, and duration of a journey?



*   I am from India, and I have used various cab services such as traditional cabs, Uber, and Ola. The fares that cab companies show are often unpredictable and seem unreasonable. They also change with the time of day, the trip distance requested, traffic on the route, and so on. One study even reports that, in India, fares for the same trip differ between mobile phones with different operating systems; that is, the app might ask for more money on an iPhone than on an Android phone. Now, after moving to the US, I am interested in studying the taxi systems here. Beyond my personal interest, the study might help customers with transparency and predictability, and it might help companies understand revenue patterns.
*   I can perform an analysis to find patterns such as seasonal taxi usage, the most-serviced areas, and high-traffic areas, and predict taxi fares using trip distance (mileage), duration, and time of day. I can also cluster customers to find groups with specific traits who tend to tip more.
*   It is generally said that airport trips cost more than ordinary ones. We can test this as a hypothesis.
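
As a rough illustration of the prediction task described above, here is a minimal baseline sketch that fits a linear model of fare on mileage and duration using ordinary least squares. The trips are synthetic and the rates (base fare, per-mile and per-minute charges) are made-up values for illustration only; they are not taken from the DC dataset.

```python
import numpy as np

# Synthetic trips for illustration only; real values would come from the dataset.
rng = np.random.default_rng(0)
n_trips = 500
mileage = rng.uniform(0.5, 15.0, n_trips)    # trip distance in miles
duration = rng.uniform(3.0, 45.0, n_trips)   # trip duration in minutes

# Assumed (hypothetical) fare rule: base fare + per-mile + per-minute + noise.
fare = 3.5 + 2.2 * mileage + 0.3 * duration + rng.normal(0.0, 0.5, n_trips)

# Design matrix with an intercept column, then an ordinary least squares fit.
X = np.column_stack([np.ones(n_trips), mileage, duration])
coef, *_ = np.linalg.lstsq(X, fare, rcond=None)
intercept, per_mile, per_minute = coef

# Root mean squared error of the fitted fares on the same data.
predicted = X @ coef
rmse = float(np.sqrt(np.mean((fare - predicted) ** 2)))

print(f"intercept={intercept:.2f}, per_mile={per_mile:.2f}, per_minute={per_minute:.2f}")
print(f"RMSE={rmse:.2f}")
```

On the real data, categorical predictors such as payment type or trip type would need to be encoded before they could enter a model like this.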

---



# Determine the population you wish to study

--> Describe this population.

*   The population I wish to study is the set of taxi trips served in the Washington, DC area in 2024.
*   The trip records differ by trip distance, time of day, and mode of payment.
*   Each record contains fare details, tip amount, mileage, pickup and drop-off location details, and timestamps.


---




# Identify variable(s) in the population sample that you will study

--> What are the independent variable(s)?

--> What is the dependent variable? (Ideally you have a single DV.)

--> Confounding variable

--> Explain what a confounding variable is.

--> Identify any potential confounding variables in your study.

--> How will you deal with them?

--> Or, if not something that can be dealt with here, why not and how
might you address it if you had the resources to do so?

**Independent variables:** Mileage, Duration, Airport (categorical: whether or not the trip involves an airport), Payment Type, Trip Type, Origin and Destination Block Name, Provider Name (cab company)

**Dependent variable:** Total Fare Amount

--> A confounding variable is an external factor that influences both the independent and the dependent variables, and so can distort the apparent relationship between them.

**Confounding Variables:**
*   Day of week: trips on weekdays can be costlier than on weekends, while trips to vacation spots can be costlier on weekends or holidays.
*   Month of trip: seasonality affects trip fares.
*   Hour of the day: peak hours, such as office opening and closing times, might have higher fares.
*   Traffic/congestion along the route: traffic may increase trip time and, in turn, the fare amount.
*   Weather conditions: bad weather can increase the time required to reach the destination and may add a surcharge.
*   Diversions: construction work or road accidents may cause diversions, which in turn increase trip duration.
*   Tolls: toll booths can be time-consuming, and the toll itself adds to the total fare.

We can create some of these variables from existing data in the dataset by deriving new features such as day of week, month of trip, and hour of the day from the trip timestamps.
We can then build separate models for the peak and non-peak hour categories.
We do not have any data on weather or traffic conditions, so we can add those sources later when data becomes available. If we had weather and traffic data, we would categorise the population accordingly and build models to check for differences and analyse patterns.
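
The time-based features mentioned above could be derived roughly as follows. This is a minimal sketch on a toy DataFrame: the column name `ORIGINDATETIME_TR` and the `MM/DD/YYYY HH:MM` timestamp format are taken from the data preview later in this notebook, while the peak-hour windows (weekdays, 7 to 9 and 16 to 18) and the `IS_PEAK` column name are my own assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration; the timestamp format matches the data preview.
trips = pd.DataFrame({
    "ORIGINDATETIME_TR": [
        "01/01/2024 00:00",
        "01/03/2024 08:00",
        "06/15/2024 17:00",
        "11/29/2024 13:00",
    ],
})

# Parse the pickup timestamp with an explicit format.
pickup = pd.to_datetime(trips["ORIGINDATETIME_TR"], format="%m/%d/%Y %H:%M")

# Derive calendar features from the parsed timestamp.
trips["DAY_OF_WEEK"] = pickup.dt.day_name()
trips["MONTH"] = pickup.dt.month
trips["HOUR"] = pickup.dt.hour

# Assumed definition of peak: weekday trips starting in hours 7-9 or 16-18.
is_weekday = pickup.dt.dayofweek < 5  # Monday=0 ... Sunday=6
in_peak_hour = trips["HOUR"].isin([7, 8, 9, 16, 17, 18])
trips["IS_PEAK"] = is_weekday & in_peak_hour

print(trips)
```

The `IS_PEAK` flag can then be used to split the data before fitting separate peak and non-peak models.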



---





# Formulate a hypothesis

--> What do you suspect analysis of the data will find?

--> Use the format shown in lecture.

If a trip's origin or destination is an airport (pickup or drop-off), then the fare amount will be higher than for non-airport trips.
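
One possible way to test this hypothesis is a one-sided permutation test on the difference in mean fare between airport and non-airport trips. The sketch below uses synthetic fares with made-up means and spreads, so the numbers say nothing about the real data; the permutation test itself is my choice of method for illustration, not one prescribed by the course.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic fares for illustration only (hypothetical means and spreads).
airport_fares = rng.normal(32.0, 8.0, 200)
other_fares = rng.normal(18.0, 8.0, 800)

# Observed difference in mean fare (airport minus non-airport).
observed_diff = airport_fares.mean() - other_fares.mean()

# Under the null hypothesis the airport label is irrelevant, so we shuffle
# the pooled fares and recompute the difference many times.
pooled = np.concatenate([airport_fares, other_fares])
n_airport = len(airport_fares)
n_permutations = 2000
perm_diffs = np.empty(n_permutations)
for i in range(n_permutations):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:n_airport].mean() - shuffled[n_airport:].mean()

# One-sided p-value: share of shuffles with a difference at least as large
# as the observed one (+1 in numerator and denominator avoids a p-value of 0).
p_value = (np.sum(perm_diffs >= observed_diff) + 1) / (n_permutations + 1)

print(f"Observed difference: {observed_diff:.2f}")
print(f"One-sided p-value: {p_value:.4f}")
```

A small p-value would support the claim that airport trips have a higher mean fare, although confounders such as trip distance would still need to be accounted for.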


---



# Develop a detailed plan for data collection

--> How will you get the data?

--> How will you ensure representativeness when using a sample?

--> What method will you use to collect the data?

*   The data is publicly available on the DC Gov website; I can download it and upload it to Google Drive.
*   In Colab, I can access the data on Google Drive and perform further analysis on it.
*   The data is available in CSV format as 12 files, one per month, which I will merge into one file (the master dataset).
*   I can use stratified sampling by month and hour to make sure the sample is evenly distributed with minimal bias, drawing a random sample that preserves proportions across strata.
*   The data is well represented across the 12 months of the year and across categories such as mode of payment, origin and destination locations, etc.
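
The stratified sampling step above could be sketched as follows. This is a minimal example on synthetic data: the `MONTH`, `HOUR`, and `TOTALAMOUNT` values are randomly generated, and the 10% sampling fraction is an arbitrary choice. Sampling the same fraction from every (month, hour) stratum keeps the strata proportions of the sample close to those of the full data.

```python
import numpy as np
import pandas as pd

# Synthetic trips for illustration only.
rng = np.random.default_rng(7)
n = 20000
trips = pd.DataFrame({
    "MONTH": rng.integers(1, 13, n),   # months 1..12
    "HOUR": rng.integers(0, 24, n),    # hours 0..23
    "TOTALAMOUNT": rng.gamma(2.0, 10.0, n).round(2),
})

# Draw the same fraction from every (MONTH, HOUR) stratum.
sample = trips.groupby(["MONTH", "HOUR"], group_keys=False).sample(
    frac=0.10, random_state=0
)

# Compare the monthly shares in the full data and in the sample.
full_share = trips["MONTH"].value_counts(normalize=True).sort_index()
sample_share = sample["MONTH"].value_counts(normalize=True).sort_index()
max_gap = (full_share - sample_share).abs().max()

print(f"Sample size: {len(sample)} of {len(trips)}")
print(f"Largest gap in monthly share: {max_gap:.4f}")
```

Fixing `random_state` makes the sample reproducible across notebook runs.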


---



# Choose a data set

--> Explain why this dataset is interesting to you

--> What is in the dataset?

--> Where is the dataset from?
    
    --> Include the link for where you obtained the dataset from, or if you
    created it yourself please give details of how you generated it.

--> When is the dataset from?

The dataset has various features that can help us predict prices, perform analysis, and find patterns that will be useful for customers, drivers, and cab companies. In my opinion, this is an excellent example of data science. That is why the dataset interests me.

The dataset consists of records of taxi trips across Washington, DC in 2024.
It has 12 CSV files, one per month, and a text file with dataset info. Each file has about 200k records, and in total there are more than 2.5 million records.

The dataset is publicly available on the DC Gov website: https://dcgov.app.box.com/v/TaxiTrips2024

Link to the dataset stored in Google Drive (View Only): https://drive.google.com/drive/folders/1Zpdt9uEL-T5bqkGt85rw77IM8lgbjehx?usp=sharing

The dataset has the following features:
1.   TRIPTYPE
2.   PROVIDERNAME
3.   TOTALFARE
4.   GRATUITYAMOUNT
5.   SURCHARGEAMOUNT
6.   EXTRAFAREAMOUNT
7.   TOLLAMOUNT
8.   TOTALAMOUNT
9.   PAYMENTTYPE
10.  MILAGE
11.  DURATION
12.  ORIGIN
13.  DESTINATION
14.  AIRPORT
15.  ORIGINTIME
16.  DESTINATIONTIME


Categories explained:

Payment Type

1.	Credit

2.	Cash

3.	EHail

4.	Other (not sure how common this is)

5.	Uber (not sure how common this is)

Trip Type

1.	Ordinal (normal rate)

2.	VoD

3.	TransportDC (grant program)

4.	TransportDCShared (grant program)

5.	MOVA (grant program)

6.	CFSA (grant program)

7.	NRS (grant program)

8.	NEMT (grant program)
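
The numeric category codes can be turned into readable labels with a simple lookup. The sketch below assumes that the list positions above (1 to 5) are the codes stored in the `PAYMENTTYPE` column, which the dataset info file should confirm; the toy values and the `PAYMENT_LABEL` column name are for illustration only.

```python
import pandas as pd

# Assumed mapping: list position above = code stored in PAYMENTTYPE.
payment_labels = {1: "Credit", 2: "Cash", 3: "EHail", 4: "Other", 5: "Uber"}

# Toy codes for illustration only.
trips = pd.DataFrame({"PAYMENTTYPE": [2, 4, 4, 2, 1]})

# Map each code to its label; unknown codes would become NaN.
trips["PAYMENT_LABEL"] = trips["PAYMENTTYPE"].map(payment_labels)

# Count how often each payment type occurs.
counts = trips["PAYMENT_LABEL"].value_counts()
print(counts)
```

The same approach would work for the trip type codes, and the resulting counts would show how common the "Other" and "Uber" categories actually are.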




---









# Import the data in Colab

In [None]:
# The data is stored on Google Drive and imported from there.
import os
import glob
import numpy as np
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
folder_path = '/content/drive/MyDrive/OpenDataDC_Taxi_2024/'
files = glob.glob(os.path.join(folder_path, '*.csv'))
print(f"Found {len(files)} files.")

Found 12 files.


In [None]:
# Merge the monthly files into one master dataset
df_list = []
# Sort the paths so the months are concatenated in a deterministic order
for individual_file in sorted(files):
    # Extract the month from the filename.
    # File naming convention is: taxi_2024_01.csv
    month = os.path.basename(individual_file).split("_")[-1].split(".")[0]
    temp_df = pd.read_csv(individual_file)
    temp_df["MONTH"] = month
    df_list.append(temp_df)

In [None]:
#concat the dataframe
masterDF = pd.concat(df_list, ignore_index=True)

In [None]:
#save the file in same directory
output = os.path.join(folder_path, "master_data.csv")
masterDF.to_csv(output, index=False)

In [None]:
#Quick check
print(f"Master dataset created with shape: {masterDF.shape}")
print(f"Saved to: {output}")

Master dataset created with shape: (2670564, 28)
Saved to: /content/drive/MyDrive/OpenDataDC_Taxi_2024/master_data.csv


In [None]:
# Note: the code cells above only need to be run once, because the master file is created once and then reused as needed.

# Print the first 5 data entries from the file

In [None]:
# From here on, run only the code cells below.

In [None]:
import os
import glob
import numpy as np
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
folder_path = "/content/drive/MyDrive/OpenDataDC_Taxi_2024/"
file_path = folder_path + "master_data.csv"
df = pd.read_csv(file_path)



In [None]:
print("First five entries:")
df.head()

First five entries:


Unnamed: 0,OBJECTID,TRIPTYPE,PROVIDERNAME,FAREAMOUNT,GRATUITYAMOUNT,SURCHARGEAMOUNT,EXTRAFAREAMOUNT,TOLLAMOUNT,TOTALAMOUNT,PAYMENTTYPE,...,ORIGIN_BLOCK_LATITUDE,ORIGIN_BLOCK_LONGITUDE,ORIGIN_BLOCKNAME,DESTINATION_BLOCK_LAT,DESTINATION_BLOCK_LONG,DESTINATION_BLOCKNAME,AIRPORT,ORIGINDATETIME_TR,DESTINATIONDATETIME_TR,MONTH
0,1,,,5.68,0.0,0.25,0.0,,5.93,2,...,38.952536,-77.003107,100 BLOCK GALLOWAY STREET NE,38.953713,-76.988006,5100 BLOCK SARGENT ROAD NE,,01/01/2024 00:00,01/01/2024 00:00,1
1,2,,,,,,,,51.84,4,...,,,,,,,,01/01/2024 00:00,01/01/2024 00:00,1
2,3,,,44.13,0.0,0.25,2.25,0.0,46.38,4,...,38.917557,-77.034531,2000 BLOCK 15TH STREET NW,,,,,01/01/2024 00:00,01/01/2024 00:00,1
3,4,,,14.32,0.0,0.25,0.0,,14.57,2,...,38.899817,-77.026514,1000 BLOCK H STREET NW,38.94092,-77.021225,600 BLOCK TAYLOR STREET NW,,01/01/2024 00:00,01/01/2024 00:00,1
4,5,,,14.05,2.57,0.25,0.0,0.0,16.87,1,...,38.896881,-77.006479,UNIT BLOCK COLUMBUS CIRCLE NE,38.909637,-77.047716,2100 BLOCK P STREET NW,,01/01/2024 00:00,01/01/2024 00:00,1


In [None]:
print('Shape of data', df.shape)

Shape of data (2670564, 28)
