
In [1]:
import pandas as pd
import plotly.express as px
import numpy as np

In [2]:
issues_df = pd.read_csv('clean_data_r1.csv', parse_dates=['created', 'updated', 'resolutiondate', 'duedate'])
issues_df.head()


In [None]:
issues_df.columns

Index(['repo_name', 'id', 'timespent', 'timeestimate', 'created', 'updated',
       'resolutiondate', 'duedate', 'labels', 'comments', 'issue_link_count',
       'project_name', 'issue_name', 'priority', 'votes', 'watches',
       'log_change_count', 'assignee_id', 'assignee_active_status',
       'creator_id', 'creator_active_status', 'reporter_id',
       'reporter_active_status'],
      dtype='object')

### What is the unit of measure of timespent and timeestimate?

We check this by comparing both fields with the elapsed time between the creation and resolution dates.


In [None]:
issues_df['delta_time_s'] = (issues_df['resolutiondate'] - issues_df['created']).dt.total_seconds()
issues_df['delta_time_m'] = issues_df['delta_time_s'] / 60
issues_df[['timespent', 'timeestimate', 'created', 'updated', 'resolutiondate', 'duedate', 'delta_time_s', 'delta_time_m']]

| | timespent | timeestimate | created | updated | resolutiondate | duedate | delta_time_s | delta_time_m |
|---|---|---|---|---|---|---|---|---|
| 0 | 864000.0 | 288000.0 | 2021-01-08 16:38:15+00:00 | 2021-03-16 14:04:18+00:00 | 2021-02-23 16:49:41+00:00 | 2021-02-27 | 3975086.0 | 6.625143e+04 |
| 1 | 3660.0 | 3600.0 | 2008-12-30 20:15:24+00:00 | 2013-06-10 18:21:50+00:00 | 2013-02-18 08:29:58+00:00 | NaT | 130508074.0 | 2.175135e+06 |
| 2 | 43200.0 | 14400.0 | 2017-01-20 06:23:09+00:00 | 2017-12-06 11:52:42+00:00 | 2017-02-24 10:15:35+00:00 | NaT | 3037946.0 | 5.063243e+04 |
| 3 | 600.0 | 3000.0 | 2016-06-01 07:26:47+00:00 | 2021-10-24 06:35:10+00:00 | 2016-07-22 11:27:49+00:00 | NaT | 4420862.0 | 7.368103e+04 |
| 4 | 7200.0 | 50400.0 | 2012-03-08 08:17:09+00:00 | 2016-02-10 05:34:24+00:00 | 2016-02-10 05:34:24+00:00 | NaT | 123887835.0 | 2.064797e+06 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2475 | 10800.0 | 21600.0 | 2010-09-09 07:22:12-05:00 | 2010-09-14 12:14:37-05:00 | 2010-09-14 12:07:28-05:00 | NaT | 449116.0 | 7.485267e+03 |
| 2476 | 7200.0 | 14400.0 | 2010-09-08 13:56:45-05:00 | 2012-04-17 12:08:48-05:00 | 2010-11-24 12:43:07-06:00 | NaT | 6651982.0 | 1.108664e+05 |
| 2477 | 288000.0 | 201600.0 | 2010-09-08 13:31:10-05:00 | 2011-10-21 16:37:56-05:00 | 2011-10-21 16:37:56-05:00 | NaT | 35262406.0 | 5.877068e+05 |
| 2478 | 97200.0 | 252000.0 | 2011-07-18 09:47:26-05:00 | 2013-09-05 13:45:25-05:00 | 2013-09-05 13:45:25-05:00 | NaT | 67406279.0 | 1.123438e+06 |


#### Conclusion

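`timespent` does not match the elapsed time between `created` and `resolutiondate`: in every row shown it is smaller than `delta_time_s`, often by orders of magnitude. We therefore think that the time spent is a field entered by the user rather than one derived from the timestamps. The values shown are all whole multiples of 60 (e.g. 3600, 7200, 43200), which is consistent with the unit being seconds.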
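The comparison behind this conclusion can be sketched on a small sample. The frame below is a hypothetical stand-in built from the first five rows of the table above; the names `sample`, `whole_minutes` and `ratio` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample: first five rows of the table above.
sample = pd.DataFrame({
    "timespent": [864000.0, 3660.0, 43200.0, 600.0, 7200.0],
    "delta_time_s": [3975086.0, 130508074.0, 3037946.0, 4420862.0, 123887835.0],
})

# Share of logged values that are whole minutes (multiples of 60).
# A high share suggests hand-entered durations expressed in seconds.
whole_minutes = (sample["timespent"] % 60 == 0).mean()

# Ratio of logged time to elapsed time. Values well below 1 mean the
# field is not simply resolutiondate - created.
ratio = sample["timespent"] / sample["delta_time_s"]

print(f"whole-minute share: {whole_minutes:.0%}")
print(ratio.round(4))
```

On the full dataset the same two checks would run on `issues_df` directly; a handful of rows is only suggestive, not proof of the unit.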

### Checking distributions

In [None]:
# Compute the difference between estimated and actual time
# A negative value means the issue was underestimated
# A positive value means the issue was overestimated
issues_df['estimation_error'] = issues_df['timeestimate'] - issues_df['timespent']

#### Using raw values

In [None]:
px.box(issues_df[['timeestimate', 'timespent']]).show()

In [None]:
px.box(issues_df['estimation_error']).show()

In [None]:
px.histogram(issues_df, x="timeestimate").show()

In [None]:
px.histogram(issues_df, x="timespent").show()

In [None]:
px.histogram(issues_df, x="estimation_error").show()

#### Transforming to days

Assuming the raw values are in seconds, we convert them to days (1 day = 24 hours).

In [None]:
# Convert seconds to calendar days (60 s * 60 min * 24 h)
issues_df['timeestimate_d'] = issues_df['timeestimate'] / 60 / 60 / 24
issues_df['timespent_d'] = issues_df['timespent'] / 60 / 60 / 24
issues_df['estimation_error_d'] = issues_df['timeestimate_d'] - issues_df['timespent_d']
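As a hedged alternative to dividing by hand, the conversion can go through `pd.to_timedelta`, which makes the assumed unit explicit. The sketch below uses a made-up `seconds` series. It also shows an 8-hour working day for contrast; that working-day length is an assumption, not something the dataset states.

```python
import pandas as pd

# Hypothetical values in seconds (not taken from the full dataset).
seconds = pd.Series([864000.0, 3600.0, 28800.0])

# Calendar days: same result as seconds / 60 / 60 / 24.
days_calendar = pd.to_timedelta(seconds, unit="s") / pd.Timedelta(days=1)

# Working days, ASSUMING an 8-hour working day (illustrative only).
WORKDAY_SECONDS = 8 * 60 * 60
days_working = seconds / WORKDAY_SECONDS

print(pd.DataFrame({"calendar_d": days_calendar, "working_d": days_working}))
```

The choice matters when reading the plots: 864000 seconds is 10 calendar days but 30 working days under the 8-hour assumption.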

In [None]:
px.box(issues_df[['timeestimate_d', 'timespent_d']]).show()

In [None]:
px.box(issues_df['estimation_error_d']).show()

In [None]:
px.histogram(issues_df, x="timeestimate_d").show()

In [None]:
px.histogram(issues_df, x="timespent_d").show()

In [None]:
px.histogram(issues_df, x="estimation_error_d").show()

### Checking how accurate the estimates are

In [None]:
issues_df['estimate_class'] = np.where(issues_df['estimation_error'] == 0,
                                       'accurate',
                                       np.where(issues_df['estimation_error'] < 0,
                                                'underestimated',
                                                'overestimated'
                                                )
                                       )
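One caveat with the nested `np.where` above: a missing `estimation_error` (NaN) fails both comparisons and falls through to `'overestimated'`. A minimal sketch of an alternative using `np.select` follows, with a hypothetical `'unknown'` label for missing values; the `error` series is toy data.

```python
import numpy as np
import pandas as pd

# Toy errors in seconds, including one missing value.
error = pd.Series([0.0, -3600.0, 7200.0, np.nan])

# Each condition is paired with the label at the same position.
conditions = [error == 0, error < 0, error > 0]
labels = ["accurate", "underestimated", "overestimated"]

# NaN matches none of the conditions, so it gets the default label.
estimate_class = pd.Series(
    np.select(conditions, labels, default="unknown"),
    index=error.index,
)

print(estimate_class.tolist())
```

If the cleaned data contains no missing values, both approaches give the same classes.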

In [None]:
px.histogram(issues_df, x="estimate_class", histnorm='percent').show()

overestimated = 46% \
underestimated = 38% \
accurate = 15%
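The shares can also be computed as numbers rather than read off the chart. This sketch uses `value_counts(normalize=True)` on a toy `estimate_class` series; the counts are made up and do not reproduce the percentages reported above.

```python
import pandas as pd

# Toy labels (hypothetical counts, for illustration only).
estimate_class = pd.Series(
    ["overestimated"] * 3 + ["underestimated"] * 2 + ["accurate"]
)

# Relative frequency of each class, expressed as a percentage.
shares = estimate_class.value_counts(normalize=True).mul(100).round(1)

print(shares)
```

On the notebook's data the equivalent call would be `issues_df['estimate_class'].value_counts(normalize=True)`.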

#### Distribution of overestimated issues

In [None]:
px.box(issues_df[issues_df['estimate_class'] == 'overestimated']['estimation_error']).show()

25% of the overestimated issues are overestimated by 2 hours or less; \
50% of the overestimated issues are overestimated by 6.58 hours or less; \
75% of the overestimated issues are overestimated by 1 day (24 hours) or less.

The remaining 25% of the overestimated issues are overestimated by **more than 1 day**.

#### Distribution of underestimated issues

In [None]:
px.box(issues_df[issues_df['estimate_class'] == 'underestimated']['estimation_error']).show()

25% of the underestimated issues are underestimated by 1 hour or less; \
50% of the underestimated issues are underestimated by 2.17 hours or less; \
75% of the underestimated issues are underestimated by 8 hours or less.

The remaining 25% of the underestimated issues are underestimated by **more than 8 hours**.
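The quartiles for both classes can be computed directly instead of being read from the box plots. The sketch below uses a toy frame with made-up errors in seconds; `abs_error_h` and `quartiles` are illustrative names. pandas interpolates linearly by default, so its quartiles may differ slightly from the ones Plotly draws.

```python
import pandas as pd

# Toy data: errors in seconds (hypothetical, not the real dataset).
toy = pd.DataFrame({
    "estimate_class": ["overestimated"] * 4 + ["underestimated"] * 4,
    "estimation_error": [3600.0, 7200.0, 14400.0, 86400.0,
                         -1800.0, -3600.0, -7200.0, -28800.0],
})

# Magnitude of the error in hours, so both classes read the same way.
toy["abs_error_h"] = toy["estimation_error"].abs() / 3600

# One row per class, one column per quartile (0.25, 0.5, 0.75).
quartiles = (
    toy.groupby("estimate_class")["abs_error_h"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```

Each value is an upper bound: the 0.75 column gives the error that 75% of issues in that class do not exceed.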