# ROAD COST ANALYSIS 

## PART I - DATA PREPARATION


### Introduction

In this data analysis project we will look at the cost of road construction in Poland. It is said that road construction in Poland is much more expensive than in neighboring countries and the quality of the new routes does not meet the requirements of users.

We will go through the itemized elements of newly built or rebuilt roads, check which of them contribute the most to the high cost of roads in Poland, and see what that cost looks like from the inside.

The analysis will be carried out on the basis of real data for which the names of the roads covered by the analysis have been changed.

The input material consists of PDF files obtained from a reputable Polish construction company; their columns are explained below.

The original data contains the following columns:

* 'Lp.': Ordinal number
* 'CPV': Common Procurement Vocabulary code
* 'Numer Specyfikacji Technicznej': Technical Specification code
* 'Elementy rozliczeniowe': Billing elements
* 'Jednostka': Measure unit
* 'Ilosc': Quantity
* 'Cena jedn': Unit price
* 'Wartosc calkowita': Total value
* 'Droga': Road number
* 'Rok': Year of construction
* 'Kategoria': Category of construction works
 


**Import Libraries**

In [1]:
import pandas as pd
import numpy as np

from functions.pdf_tools import pdf_reader, pdf_cleaner, match_category

**Read pdf files**

In [None]:
# Raw string, so the backslashes in the Windows path are not treated as escape sequences
road_data = pdf_reader(r'..\Projekt_Analiza_Danych\DATA\*.pdf')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  report_cl['Dlugosc_drogi'] = report_cl[report_cl['Elementy_rozliczeniowe'].str.contains("Odtworzenie trasy i punktów wysokościowych")]['Ilosc']
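
The warning above is raised when a value is assigned to a DataFrame that pandas treats as a slice of another one. Below is a minimal sketch of one common way to avoid it: take an explicit copy and assign through `.loc` with a boolean mask. The sample rows and the filter that produces `report_cl` are hypothetical; they are not taken from `pdf_reader`.

```python
import pandas as pd

# Hypothetical sample standing in for one parsed PDF report.
report = pd.DataFrame({
    'Elementy_rozliczeniowe': ['Odtworzenie trasy i punktów wysokościowych',
                               'Roboty ziemne'],
    'Ilosc': [1.2, 300.0],
})

# An explicit copy makes it clear that report_cl is an independent DataFrame.
report_cl = report[report['Ilosc'] > 0].copy()

# Boolean mask of the rows that carry the road length.
mask = report_cl['Elementy_rozliczeniowe'].str.contains(
    'Odtworzenie trasy i punktów wysokościowych', regex=False)

# Assign through .loc; rows outside the mask get NaN in the new column.
report_cl.loc[mask, 'Dlugosc_drogi'] = report_cl.loc[mask, 'Ilosc']
```

With this pattern the original `report` frame is left unchanged and no warning is raised.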

**Initial data cleaning**

In [None]:
road_data = pdf_cleaner(road_data)

**Checking the basic information of the DataFrame**

In [None]:
road_data.head()

In [None]:
road_data.info()

In [None]:
road_data.describe()

In [None]:
set(road_data['Cena_jedn'].map(type))

In [None]:
set(road_data['Ilosc'].map(type))

In [None]:
set(road_data['Wartosc_calkowita'].map(type))

In [None]:
set(road_data['Dlugosc_drogi'].map(type))

In [None]:
set(road_data['Rok'].map(type))

In [None]:
road_data['Kategoria'].value_counts()

**Note**

As you can see above, the "Ilosc", "Cena_jedn", "Wartosc_calkowita", "Dlugosc_drogi" and "Rok" columns have the wrong data types: we will convert the first four to float and "Rok" to datetime. In addition, the "Kategoria" column contains incomplete category names. We will fix these errors in the next few steps.
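
The per-column type checks above can also be collected in one pass. The sketch below uses a small hypothetical frame (not the real `road_data`) to show how mixed Python types appear in a column that should be numeric.

```python
import pandas as pd

# Hypothetical sample: 'Ilosc' mixes strings, integers and missing values.
sample = pd.DataFrame({
    'Ilosc': ['12.5', 3, None],
    'Rok': ['2019', '2020', '2021'],
})

# Map every value to its Python type and keep the distinct types per column.
type_report = {col: set(sample[col].map(type)) for col in sample.columns}

for col, types in type_report.items():
    print(col, sorted(t.__name__ for t in types))
```

A column whose set contains `str` next to numeric types is a candidate for `pd.to_numeric`.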

**Cleaning the "Kategoria" column**

In [None]:
road_data['Kategoria_robot'] = road_data['Kategoria'].apply(match_category)

In [None]:
road_data.drop('Kategoria', inplace=True, axis=1)
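
The implementation of `match_category` lives in `functions.pdf_tools` and is not shown here. As a rough illustration of the idea only, the sketch below completes a truncated name by prefix matching. The function name `match_category_sketch` and the list of full category names are hypothetical.

```python
# Hypothetical full category names; the real list is not part of this notebook.
FULL_CATEGORIES = ['Roboty przygotowawcze', 'Roboty ziemne', 'Nawierzchnie']


def match_category_sketch(fragment):
    """Return the first full name that starts with the (possibly truncated) fragment."""
    if not isinstance(fragment, str) or not fragment.strip():
        return None  # missing or empty input stays missing
    cleaned = fragment.strip().lower()
    for full_name in FULL_CATEGORIES:
        if full_name.lower().startswith(cleaned):
            return full_name
    return fragment.strip()  # no match: keep the cleaned input unchanged


print(match_category_sketch('Roboty zie'))
```

Prefix matching is only one possible strategy; an ambiguous fragment such as "Roboty" resolves to the first entry in the list.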

In [None]:
road_data

**Cleaning the "Ilosc", "Cena_jedn", "Wartosc_calkowita" and "Dlugosc_drogi" columns**

In [None]:
road_data['Ilosc'] = pd.to_numeric(road_data['Ilosc'],errors='coerce')

In [None]:
road_data['Cena_jedn'] = pd.to_numeric(road_data['Cena_jedn'],errors='coerce')

In [None]:
road_data['Wartosc_calkowita'] = pd.to_numeric(road_data['Wartosc_calkowita'],errors='coerce')

In [None]:
road_data['Dlugosc_drogi'] = pd.to_numeric(road_data['Dlugosc_drogi'],errors='coerce')
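
Because `errors='coerce'` silently turns unparseable values into NaN, it can be useful to count how many values are lost in the conversion. A minimal sketch on hypothetical sample data, converting several columns in one loop:

```python
import pandas as pd

# Hypothetical sample with one unparseable value and one missing value.
sample = pd.DataFrame({
    'Ilosc': ['10', '2.5', 'x'],
    'Cena_jedn': ['100.0', None, '7'],
})

numeric_cols = ['Ilosc', 'Cena_jedn']
lost = {}
for col in numeric_cols:
    converted = pd.to_numeric(sample[col], errors='coerce')
    # Count values that were present before but became NaN after coercion.
    lost[col] = int((converted.isna() & sample[col].notna()).sum())
    sample[col] = converted

print(lost)
```

A non-zero count points to rows worth inspecting before they are dropped or imputed.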

**Cleaning the "Rok" column**

In [None]:
road_data['Rok'] = pd.to_numeric(road_data['Rok'],errors='coerce')

In [None]:
road_data['Rok'] = pd.to_datetime(road_data['Rok'],format='%Y')
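
If `pd.to_numeric` leaves missing values in "Rok", the column becomes float, and values such as 2019.0 may not parse cleanly with `format='%Y'`. Whether this happens depends on the data and the pandas version, so treat it as an assumption. The sketch below, on hypothetical sample data, sidesteps the issue by parsing only the valid years as strings and restoring the original index afterwards.

```python
import pandas as pd

# Hypothetical sample: two valid years, one missing value, one bad entry.
rok = pd.Series(['2018', '2020', None, 'n/a'])

rok_num = pd.to_numeric(rok, errors='coerce')      # float64 with NaN
valid = rok_num.dropna().astype(int).astype(str)   # '2018', '2020'

# Parse the valid years, then reindex so missing rows become NaT.
rok_dt = pd.to_datetime(valid, format='%Y').reindex(rok_num.index)
print(rok_dt)
```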

In [None]:
road_data

In [None]:
road_data.info()

In [None]:
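# Road category from the two-letter prefix of the road name:
# DP = county (powiatowa), DK = national (krajowa), anything else = voivodeship (wojewodzka)
road_data['Kategoria_drogi'] = road_data['Droga'].str[0:2].apply(
    lambda x: 'Powiatowa' if x == 'DP' else 'Krajowa' if x == 'DK' else 'Wojewodzka'
)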
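
The same prefix rule can be written as a dictionary lookup, which may be easier to extend. This is an alternative sketch, not the code used in the notebook, and the road identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical road identifiers.
roads = pd.Series(['DK-01', 'DP-17', 'DW-05', 'DK-02'], name='Droga')

prefix_to_category = {'DP': 'Powiatowa', 'DK': 'Krajowa'}

# Unknown prefixes map to NaN first and then fall back to 'Wojewodzka'.
kategoria_drogi = roads.str[:2].map(prefix_to_category).fillna('Wojewodzka')
print(kategoria_drogi.tolist())
```

As in the cell above, every prefix other than "DP" or "DK" ends up as "Wojewodzka".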

In [None]:
road_data

In [None]:
road_data['Kategoria_drogi'].value_counts()

**Note**

The product of the "Ilosc" and "Cena_jedn" columns does not always equal the value in the "Wartosc_calkowita" column. Let's fix that by recomputing the total.

In [None]:
road_data['Wartosc_calkowita'] = road_data['Ilosc'] * road_data['Cena_jedn']
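
Before overwriting the column, the mismatching rows can be listed for inspection. A minimal sketch on hypothetical values; the tolerance of 0.01 is an assumption meant to absorb rounding to two decimal places.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the last row has an inconsistent total.
sample = pd.DataFrame({
    'Ilosc': [2.0, 10.0, 3.0],
    'Cena_jedn': [5.0, 1.5, 4.0],
    'Wartosc_calkowita': [10.0, 15.0, 11.0],
})

expected = sample['Ilosc'] * sample['Cena_jedn']

# Flag rows where the stored total differs from quantity * unit price.
mismatch = ~np.isclose(sample['Wartosc_calkowita'], expected, rtol=0, atol=0.01)
print(sample[mismatch])
```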

In [None]:
road_data

**Note**

So far we have not taken the length of the analyzed roads into account, so their costs cannot be compared reliably. Let's bring all costs down to the cost of 1 km of road.

The rows whose "Elementy_rozliczeniowe" column contains the phrase "Odtworzenie trasy i punktów wysokościowych..." (setting out the route and elevation points) hold the length of the analyzed road. We extracted it at the beginning (in the pdf_reader function) into the "Dlugosc_drogi" column, which we now use to bring all costs down to the cost of 1 km of road.

In [None]:
road_data['Cena_jedn_per_km'] = road_data['Cena_jedn'] / road_data['Dlugosc_drogi']

In [None]:
road_data['Wartosc_calkowita_per_km'] = road_data['Wartosc_calkowita'] / road_data['Dlugosc_drogi']
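
A small worked example of the normalization, assuming "Dlugosc_drogi" is expressed in km and is the same on every row of a given road. The road identifiers and values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: two items on a 2 km road, one item on a 0.5 km road.
sample = pd.DataFrame({
    'Droga': ['DK-A', 'DK-A', 'DP-B'],
    'Wartosc_calkowita': [1000.0, 500.0, 300.0],
    'Dlugosc_drogi': [2.0, 2.0, 0.5],
})

# Cost of each billing item per kilometre of its road.
sample['Wartosc_calkowita_per_km'] = (
    sample['Wartosc_calkowita'] / sample['Dlugosc_drogi']
)

# Total cost per kilometre for each road.
cost_per_km = sample.groupby('Droga')['Wartosc_calkowita_per_km'].sum()
print(cost_per_km)
```

Here the 2 km road costs (1000 + 500) / 2 = 750 per km and the 0.5 km road costs 300 / 0.5 = 600 per km, which makes the two roads directly comparable.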

In [None]:
road_data

**Note**

The data preparation is finished. Let's save the result to an Excel file for further analysis in the second part of the project.

In [None]:
# Raw string, so the backslashes in the Windows path are not treated as escape sequences
road_data.to_excel(r'..\Projekt_Analiza_Danych\DATA\Road_cost_analysis.xlsx',
                   sheet_name='Road_cost_analysis',
                   index=False)
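
The input and output paths above use Windows-style backslashes. As a hedged alternative, `pathlib` can build the same locations in a way that also works on other operating systems. The variable names below are hypothetical, and the sketch only constructs the paths without reading or writing any files.

```python
from pathlib import Path

# Same relative locations as in the notebook, built from path components.
data_dir = Path('..') / 'Projekt_Analiza_Danych' / 'DATA'

pdf_pattern = str(data_dir / '*.pdf')            # glob pattern for the input PDFs
output_file = data_dir / 'Road_cost_analysis.xlsx'

print(pdf_pattern)
print(output_file.name)
```

`output_file` can be passed directly to `DataFrame.to_excel`, which accepts path-like objects.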