VasilyevEvgeny/taxi_analysis

What is it?

TaxiAnalysis is a utility for quickly obtaining statistics on how the average cost of a Yandex taxi ride changes over time.

It is often interesting to understand how the price of taxi services has changed over the past few years. In the absence of open statistics, we can try to estimate this change from our own trips. This utility compares the average relative taxi ride cost across time periods using statistical tests.

More details below.

Requirements

  • Python 3

Usage

  • Clone repository
@: git clone https://github.com/VasilyevEvgeny/taxi_analysis.git
@: cd taxi_analysis
  • Create virtual environment

Windows:

@: python -m venv .venv
@: .\.venv\Scripts\activate

Linux:

@: python -m venv .venv
@: source .venv/bin/activate
  • Install essential packages
@: pip install -r requirements.txt
  • Run taxi_analysis.py
@: python taxi_analysis.py -h
TaxiAnalyzer - utility for quickly obtaining statistics of the average taxi ride cost changes over time
                                                                                                       
optional arguments:                                                                                    
  -h, --help            show this help message and exit                                                
  -p PATH, --path PATH  Path to data                                                                   
  -l LOCATION [LOCATION ...], --location LOCATION [LOCATION ...]                                       
                        Location(s) to analyze: Moscow, Vladimir or both                               
  -i AVERAGING_INTERVAL, --averaging_interval AVERAGING_INTERVAL                                       
                        Averaging time interval: year or quarter                                       

With default arguments (-p data -l Moscow -i year):

python taxi_analysis.py

With all arguments:

python taxi_analysis.py --path <path_to_data> --location <Moscow_or_Vladimir_or_both> --averaging_interval <year_or_quarter>
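
For readers who want to see how such a command line could be declared, here is a minimal sketch reconstructed only from the help output and the default values listed above. It is not the repository's actual code: the function name `build_parser` is hypothetical, and the real script may validate arguments differently.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the flags shown in the help output."""
    parser = argparse.ArgumentParser(
        description="TaxiAnalyzer - utility for quickly obtaining statistics "
                    "of the average taxi ride cost changes over time"
    )
    # Defaults follow the documented ones: -p data -l Moscow -i year
    parser.add_argument("-p", "--path", default="data", help="Path to data")
    parser.add_argument(
        "-l", "--location", nargs="+", default=["Moscow"],
        help="Location(s) to analyze: Moscow, Vladimir or both",
    )
    parser.add_argument(
        "-i", "--averaging_interval", default="year",
        help="Averaging time interval: year or quarter",
    )
    return parser


# Parsing an empty argument list yields the defaults
default_args = build_parser().parse_args([])

# Several locations can follow a single -l flag because of nargs="+"
custom_args = build_parser().parse_args(
    ["-l", "Moscow", "Vladimir", "-i", "quarter"]
)
print(default_args)
print(custom_args)
```

With `nargs="+"` the location is always a list, even when a single city is passed.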

Preparing data

The script parses Yandex taxi ride reports in .pdf format. To prepare the data correctly:

  • Attach an email address to your personal account
  • For each ride in the ride history, click "Send report to email"
  • Save every email report as .pdf and move it to a dedicated folder (the -p (--path) argument of the script)

Example of ride report: Moscow, Vladimir

Description

In my sample of trips, Moscow and Vladimir happen to be the most frequent cities, so the analysis was carried out for them. TaxiAnalysis can analyze both cities at once, but I do not recommend it, since such a combined analysis is unlikely to lead to constructive conclusions.

For the selected cities, we analyze the normalized dependence of the total ride cost on the route distance and approximate it by polynomials of the 1st, 2nd and 3rd degree.

  • Moscow: regression_Moscow
  • Vladimir: regression_Vladimir

It can be seen that the RMSE between the fitted curve and the original sample is approximately the same in the linear case as in the other two. So we can assume a linear relationship between the total ride cost and the route distance. The ratio of these quantities is called RRC (relative ride cost) and is used throughout the further analysis.
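
The comparison of polynomial degrees can be sketched as follows. This is an illustration on synthetic data, not the repository's code: the function name `rmse_by_degree`, the variable names and the generated sample are all assumptions, and `numpy.polyfit` is simply one convenient way to fit the polynomials.

```python
import numpy as np


def rmse_by_degree(distance, cost, degrees=(1, 2, 3)):
    """Fit a polynomial of each degree and return {degree: RMSE}."""
    distance = np.asarray(distance, dtype=float)
    cost = np.asarray(cost, dtype=float)
    result = {}
    for degree in degrees:
        coeffs = np.polyfit(distance, cost, degree)
        predicted = np.polyval(coeffs, distance)
        result[degree] = float(np.sqrt(np.mean((cost - predicted) ** 2)))
    return result


# Synthetic rides for illustration only: a linear trend plus noise
rng = np.random.default_rng(0)
distance_km = rng.uniform(1.0, 30.0, size=200)
cost_rub = 60.0 + 25.0 * distance_km + rng.normal(0.0, 40.0, size=200)

errors = rmse_by_degree(distance_km, cost_rub)
for degree, rmse in errors.items():
    print(f"degree {degree}: RMSE = {rmse:.2f}")

# RRC as defined in the text: total ride cost divided by route distance
rrc = cost_rub / distance_km
```

If the higher degrees barely reduce the RMSE, as happens for this synthetic linear sample, the linear model is the simplest adequate choice.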

Since Economy class rides significantly outnumber the other classes (Comfort, Comfort+ and Business) in the trip history, only Economy rides were considered.

It is important to note that both the relative and the absolute cost of a trip change over time due to inflation. In TaxiAnalysis, inflation is taken into account as a piecewise linear function, with the inflation rate held constant within each year. For example, inflation was taken to be 5% in 2018, 6% in 2019, etc. The information about inflation changes was taken from here.

RRC for the entire period was recalculated in April 2023 money and named RRC_i (RRC inflated). That is, a trip that cost 100 rubles in April 2023 stays at 100 rubles, while a trip that cost 100 rubles in April 2018 becomes more expensive after the adjustment according to the specified piecewise linear function. As a result, the farther a trip is from April 2023, the more its cost grows when inflation is taken into account. These adjustments make it possible to exclude the inflation factor when interpreting the results.

Note that this approximation is rough. In particular, inflation could be taken into account by month rather than by year, and specifically for taxi services rather than as an average across industries. I think refining the inflation model will become appropriate as the sample grows and the analysis becomes more complex.
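
One possible reading of this adjustment is sketched below. It is an assumption, not the repository's implementation: the price level is taken to grow linearly within each year at that year's rate and to compound across years. Only the 2018 and 2019 rates come from the text; the other rates, and the names `ANNUAL_INFLATION`, `price_index` and `to_reference_money`, are placeholders for illustration.

```python
from datetime import date

# Annual inflation rates. The 2018 and 2019 values are the ones quoted in the
# text; the remaining values are placeholders for illustration only.
ANNUAL_INFLATION = {
    2018: 0.05, 2019: 0.06, 2020: 0.05, 2021: 0.05, 2022: 0.05, 2023: 0.05,
}
REFERENCE_DATE = date(2023, 4, 1)


def price_index(day, rates=ANNUAL_INFLATION):
    """Price level on `day` relative to 1 January of the first year in `rates`.

    Within a year the level grows linearly at that year's rate (an assumed
    model); completed years are compounded.
    """
    index = 1.0
    for year in range(min(rates), day.year):
        index *= 1.0 + rates[year]
    year_start = date(day.year, 1, 1)
    year_end = date(day.year + 1, 1, 1)
    fraction = (day - year_start).days / (year_end - year_start).days
    return index * (1.0 + rates[day.year] * fraction)


def to_reference_money(value, day, reference=REFERENCE_DATE):
    """Express `value` paid on `day` in money of the reference date."""
    return value * price_index(reference) / price_index(day)


rrc_i_2018 = to_reference_money(100.0, date(2018, 4, 1))
rrc_i_2019 = to_reference_money(100.0, date(2019, 4, 1))
rrc_i_2023 = to_reference_money(100.0, date(2023, 4, 1))
print(rrc_i_2018, rrc_i_2019, rrc_i_2023)
```

A value dated at the reference date is left unchanged, and older values grow more, which matches the behavior described above. Dates before the first year in the table raise a `KeyError` in this sketch.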

TaxiAnalysis generates 4 files:

  • The results of the regression analysis described above
  • RRC_i as a function of time for the original dataset, the mean for each averaging period, and a curve smoothed by quadratic interpolation through these mean values.
    • Moscow: smoothed_Moscow
    • Vladimir: smoothed_Vladimir
  • Violin plot (only for the year averaging interval), which shows the median, the interquartile range and a kernel density estimate of the trip distribution.
    • Moscow: violines_Moscow
    • Vladimir: violines_Vladimir
  • Text file with the calculated statistics (only for the year averaging interval). It presents statistics for the entire sample and for each year separately. The Shapiro-Wilk test is used to check whether the distribution for each year is normal. If the distribution of the target value is normal for every year, the yearly means are compared using the t-test. If the distribution is not normal for at least one year, the medians are compared using the Mann-Whitney test. The significance level for all tests is 0.05.
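
The test selection described in the last item can be sketched with SciPy as follows. This is a minimal illustration on synthetic data, not the repository's code: the function name `compare_years`, the sample values and the use of two-sided alternatives are assumptions.

```python
from itertools import combinations

import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level used for all tests


def compare_years(samples, alpha=ALPHA):
    """Compare yearly samples pairwise.

    `samples` maps a year to a 1-D array of RRC_i values. Returns the name of
    the test that was used and a {(year_a, year_b): p_value} dictionary.
    """
    # Shapiro-Wilk: a p-value above alpha means normality is not rejected
    all_normal = all(
        stats.shapiro(values)[1] > alpha for values in samples.values()
    )
    p_values = {}
    for year_a, year_b in combinations(sorted(samples), 2):
        a, b = samples[year_a], samples[year_b]
        if all_normal:
            _, p = stats.ttest_ind(a, b)
        else:
            _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
        p_values[(year_a, year_b)] = float(p)
    test_name = "t-test" if all_normal else "Mann-Whitney"
    return test_name, p_values


# Synthetic, strongly skewed samples for illustration only
rng = np.random.default_rng(1)
synthetic = {
    2018: rng.lognormal(mean=3.0, sigma=0.8, size=150),
    2022: rng.lognormal(mean=3.6, sigma=0.8, size=150),
}

test_name, p_values = compare_years(synthetic)
print(test_name)
for pair, p in p_values.items():
    verdict = "significant" if p < ALPHA else "not significant"
    print(pair, f"p = {p:.3g}", verdict)
```

A two-sided test only reports that two years differ; the direction of the difference can then be read from the yearly medians or checked with a one-sided alternative.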

As a result, none of the RRC_i distributions turned out to be normal, so the Mann-Whitney test was used. Statistically significant differences in RRC_i values:

  • Moscow:
    • RRC_i(2018) < RRC_i(2021)
    • RRC_i(2018) < RRC_i(2022)
  • Vladimir:
    • RRC_i(2018) < RRC_i(2020)
    • RRC_i(2018) < RRC_i(2021)
    • RRC_i(2018) < RRC_i(2022)
    • RRC_i(2019) < RRC_i(2021)
