# Dataframe preparation

Once we have downloaded all the datasets we will be using in our project, the first thing we should do is prepare the dataframe we will be working on.

In [1]:
# Importing modules
import pandas as pd
import numpy as np

The downloaded files are in txt format, and each file name combines the state abbreviation with the year of the dataset (for example, `PA18.txt`).

In [2]:
# Let's create dataframes from those txt files
pa18 = pd.read_csv('./NBIDATA/PA18.txt')



In [3]:
pa18.shape

(22737, 137)

In [None]:
# CHECKED that the number of bridges is the same as in the table provided in the FHWA website

In [14]:
states = ['PA','OH']

years = ['1992','1993','1994','1995','1996','1997','1998','1999',
        '2000','2001','2002','2003','2004','2005','2006','2007','2008','2009',
        '10','11','2012','13','14','15','16','17','18']

In [21]:
link_data = []

# Build the relative path of every state/year file
for state in states:
    for year in years:
        link = './NBIDATA/' + state + year + '.txt'
        link_data.append(link)
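
The nested loop above can also be written as a single comprehension, and it may be worth checking which of the expected files are actually on disk before reading them. The following is a minimal sketch, assuming the same `./NBIDATA/` folder and file naming; `build_paths` and `missing_files` are hypothetical helper names that do not appear in the original notebook, and the year list is shortened for illustration.

```python
import os
from itertools import product


def build_paths(states, years, folder='./NBIDATA/'):
    """Return one path per (state, year) pair, in the same order as the nested loop."""
    return [folder + state + year + '.txt' for state, year in product(states, years)]


def missing_files(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


states = ['PA', 'OH']
years = ['16', '17', '18']  # shortened list, for illustration only
paths = build_paths(states, years)
print(paths)
```

Calling `missing_files(paths)` before the read step would list any file whose name does not follow the expected pattern.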


In [22]:
link_data

['./NBIDATA/PA1992.txt',
 './NBIDATA/PA1993.txt',
 './NBIDATA/PA1994.txt',
 './NBIDATA/PA1995.txt',
 './NBIDATA/PA1996.txt',
 './NBIDATA/PA1997.txt',
 './NBIDATA/PA1998.txt',
 './NBIDATA/PA1999.txt',
 './NBIDATA/PA2000.txt',
 './NBIDATA/PA2001.txt',
 './NBIDATA/PA2002.txt',
 './NBIDATA/PA2003.txt',
 './NBIDATA/PA2004.txt',
 './NBIDATA/PA2005.txt',
 './NBIDATA/PA2006.txt',
 './NBIDATA/PA2007.txt',
 './NBIDATA/PA2008.txt',
 './NBIDATA/PA2009.txt',
 './NBIDATA/PA10.txt',
 './NBIDATA/PA11.txt',
 './NBIDATA/PA2012.txt',
 './NBIDATA/PA13.txt',
 './NBIDATA/PA14.txt',
 './NBIDATA/PA15.txt',
 './NBIDATA/PA16.txt',
 './NBIDATA/PA17.txt',
 './NBIDATA/PA18.txt',
 './NBIDATA/OH1992.txt',
 './NBIDATA/OH1993.txt',
 './NBIDATA/OH1994.txt',
 './NBIDATA/OH1995.txt',
 './NBIDATA/OH1996.txt',
 './NBIDATA/OH1997.txt',
 './NBIDATA/OH1998.txt',
 './NBIDATA/OH1999.txt',
 './NBIDATA/OH2000.txt',
 './NBIDATA/OH2001.txt',
 './NBIDATA/OH2002.txt',
 './NBIDATA/OH2003.txt',
 './NBIDATA/OH2004.txt',
 './NBIDATA/OH20

In [25]:
# We are only interested in extracting the columns from the older datasets that contain the condition ratings and the structure number
list(pa18.columns)

['STATE_CODE_001',
 'STRUCTURE_NUMBER_008',
 'RECORD_TYPE_005A',
 'ROUTE_PREFIX_005B',
 'SERVICE_LEVEL_005C',
 'ROUTE_NUMBER_005D',
 'DIRECTION_005E',
 'HIGHWAY_DISTRICT_002',
 'COUNTY_CODE_003',
 'PLACE_CODE_004',
 'FEATURES_DESC_006A',
 'CRITICAL_FACILITY_006B',
 'FACILITY_CARRIED_007',
 'LOCATION_009',
 'MIN_VERT_CLR_010',
 'KILOPOINT_011',
 'BASE_HWY_NETWORK_012',
 'LRS_INV_ROUTE_013A',
 'SUBROUTE_NO_013B',
 'LAT_016',
 'LONG_017',
 'DETOUR_KILOS_019',
 'TOLL_020',
 'MAINTENANCE_021',
 'OWNER_022',
 'FUNCTIONAL_CLASS_026',
 'YEAR_BUILT_027',
 'TRAFFIC_LANES_ON_028A',
 'TRAFFIC_LANES_UND_028B',
 'ADT_029',
 'YEAR_ADT_030',
 'DESIGN_LOAD_031',
 'APPR_WIDTH_MT_032',
 'MEDIAN_CODE_033',
 'DEGREES_SKEW_034',
 'STRUCTURE_FLARED_035',
 'RAILINGS_036A',
 'TRANSITIONS_036B',
 'APPR_RAIL_036C',
 'APPR_RAIL_END_036D',
 'HISTORY_037',
 'NAVIGATION_038',
 'NAV_VERT_CLR_MT_039',
 'NAV_HORR_CLR_MT_040',
 'OPEN_CLOSED_POSTED_041',
 'SERVICE_ON_042A',
 'SERVICE_UND_042B',
 'STRUCTURE_KIND_043A',
 '

In [27]:
rating_cols = ['STRUCTURE_NUMBER_008','DECK_COND_058','SUPERSTRUCTURE_COND_059','SUBSTRUCTURE_COND_060']

We will have to do the same data preparation for all the datasets we have downloaded, from 1992 to 2018, for both PA and OH.
The steps are: 
    - Read only the rating condition and structure number columns 
    - Check for missing values and decide how to deal with them
    - Create a new variable "Total condition rating" that considers the three rating conditions
    
What we want to study is the evolution of this new variable per year on each bridge:
    - Create the ratio per year of this new variable
    - Calculate the mean of this ratio per bridge
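
Since the same steps have to be repeated for every file, one option is to wrap the reading and cleaning steps in a function. The sketch below is an assumption about how that could look, not the notebook's own code: `load_ratings` and `cond_cols` are hypothetical names, and it assumes that, once missing values and `'N'` codes are removed, the remaining ratings are plain digits.

```python
import pandas as pd

rating_cols = ['STRUCTURE_NUMBER_008', 'DECK_COND_058',
               'SUPERSTRUCTURE_COND_059', 'SUBSTRUCTURE_COND_060']
cond_cols = rating_cols[1:]  # the three condition rating columns


def load_ratings(path):
    """Read one NBI file and keep only the rows with three numeric ratings."""
    # dtype=str reads every value as text, so 'N' codes and digits share one type
    df = pd.read_csv(path, usecols=rating_cols, dtype=str)
    # Drop rows where any of the three ratings is missing
    df = df.dropna(subset=cond_cols)
    # Drop rows where any rating is coded 'N' (Not Applicable)
    df = df[(df[cond_cols] != 'N').all(axis=1)].copy()
    # Assumption: whatever remains can be converted to integers
    df[cond_cols] = df[cond_cols].astype(int)
    return df
```

With the path list built earlier, something like `{path: load_ratings(path) for path in link_data}` would then give one cleaned dataframe per file.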

#### Read only the rating condition and structure number columns

In [38]:
pa92 = pd.read_csv(link_data[0], header=0, usecols = rating_cols)



In [39]:
pa92.shape

(29767, 4)

In [40]:
pa92.head()

Unnamed: 0,STRUCTURE_NUMBER_008,DECK_COND_058,SUPERSTRUCTURE_COND_059,SUBSTRUCTURE_COND_060
0,10015001019940,7,8,7
1,10015003000000,8,8,7
2,10015003100000,8,7,7
3,10015005000000,8,8,8
4,10015005100000,7,8,7


In [59]:
pa92.dtypes

STRUCTURE_NUMBER_008       object
DECK_COND_058              object
SUPERSTRUCTURE_COND_059    object
SUBSTRUCTURE_COND_060      object
dtype: object

In [66]:
# We also drop all the rows with 'N' values (which means Not Applicable)
pa92 = pa92[pa92['DECK_COND_058']!='N']
pa92 = pa92[pa92['SUPERSTRUCTURE_COND_059']!='N']
pa92 = pa92[pa92['SUBSTRUCTURE_COND_060']!='N']

In [67]:
pa92.shape

(20491, 4)

In [79]:
# We need to change the object types to numeric
pa92['DECK_COND_058'] = pd.to_numeric(pa92['DECK_COND_058'])
pa92['SUPERSTRUCTURE_COND_059'] = pd.to_numeric(pa92['SUPERSTRUCTURE_COND_059'])
pa92['SUBSTRUCTURE_COND_060'] = pd.to_numeric(pa92['SUBSTRUCTURE_COND_060'])

In [80]:
pa92.dtypes

STRUCTURE_NUMBER_008       object
DECK_COND_058               int64
SUPERSTRUCTURE_COND_059     int64
SUBSTRUCTURE_COND_060       int64
dtype: object
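
As an alternative to filtering `'N'` and converting in separate steps, `pd.to_numeric` accepts `errors='coerce'`, which turns any value that cannot be parsed as a number into `NaN`, so a single `dropna` can remove both the `'N'` rows and the missing ones. This is a sketch on a toy frame with invented values, not the notebook's data; note that coercion also discards any other unexpected code without warning, so it is worth inspecting the distinct values first.

```python
import pandas as pd

cond_cols = ['DECK_COND_058', 'SUPERSTRUCTURE_COND_059', 'SUBSTRUCTURE_COND_060']

# Toy frame standing in for pa92 (values invented for illustration)
toy = pd.DataFrame({
    'STRUCTURE_NUMBER_008': ['A1', 'A2', 'A3'],
    'DECK_COND_058': ['7', 'N', '5'],
    'SUPERSTRUCTURE_COND_059': ['8', '6', None],
    'SUBSTRUCTURE_COND_060': ['7', '6', '5'],
})

# errors='coerce' replaces anything that is not a number (such as 'N') with NaN
toy[cond_cols] = toy[cond_cols].apply(pd.to_numeric, errors='coerce')

# One dropna now removes the 'N' row and the row with a missing rating,
# after which the ratings can be stored as integers
toy = toy.dropna(subset=cond_cols).astype({col: 'int64' for col in cond_cols})
print(toy.dtypes)
```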

#### Check missing values

In [47]:
# We check how many items are missing in the dataset
# (going by the cell numbers, this ran before the 'N' rows were dropped above)
pa92.isnull().sum()

STRUCTURE_NUMBER_008          0
DECK_COND_058              6010
SUPERSTRUCTURE_COND_059    6016
SUBSTRUCTURE_COND_060      6011
dtype: int64

In [48]:
# We drop the rows with null values
pa92.dropna(inplace=True)

In [49]:
pa92.shape

(23747, 4)

#### Create a new variable "Total condition rating" that considers the three rating conditions

The dataset contains the ratings of three fundamental parts of each bridge:
    - Deck (Item 58): describes the overall condition rating of the deck (deck slab, parapets and barriers, bearings, ...)
    
    - Superstructure (Item 59): describes the physical condition rating of all the structural members of the superstructure (girders/beams, cross-frames, stiffeners,...)
    
    - Substructure (Item 60): describes the physical condition of piers, abutments, piles, fenders, footings or other components of the bridge substructure.

The new variable, Total Rating, which combines the three ratings defined in the dataset, was established after reviewing common practices in bridge inspection and rehabilitation. 

Three different cases that the data could present have been defined:
    - If any of the ratings is in "Poor condition", that is, with a value less than or equal to 4, rehabilitation should be imminent, so we take the Total Rating as the lowest of the three ratings.
    
    - If, on the other hand, all the ratings are 8 or higher, that is, in "Very good condition" or "Excellent condition", we define the Total Rating as the arithmetic mean of the ratings.
    
    - As a third option, if all the ratings are 5 or higher but not all of them are in the best conditions, we define the Total Rating as a weighted mean. The weighting coefficients are:
            - 0.5 for the lowest rating, to give more weight to the element in the worst condition
            - 0.3 for the middle rating (practically the same weight as in option 2)
            - 0.2 for the highest rating
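
As a worked example of the weighted case, take illustrative ratings of 7 (deck), 6 (superstructure) and 8 (substructure), so that the lowest, middle and highest ratings are 6, 7 and 8:

$$
\text{Total Rating} = 0.5 \cdot 6 + 0.3 \cdot 7 + 0.2 \cdot 8 = 3.0 + 2.1 + 1.6 = 6.7
$$

The plain arithmetic mean of the same ratings would be 7.0, so the weighting pulls the result towards the element in the worst condition.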

In [81]:
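def TotalRating(row):
    """Combine the deck, superstructure and substructure ratings of one bridge."""
    dr = row['DECK_COND_058']
    supr = row['SUPERSTRUCTURE_COND_059']
    subr = row['SUBSTRUCTURE_COND_060']
    minval = min(dr, supr, subr)
    maxval = max(dr, supr, subr)

    if minval <= 4:
        # Case 1: at least one element in poor condition, take the lowest rating
        rating = minval

    elif minval >= 8:
        # Case 2: all elements rated 8 or higher, take the arithmetic mean
        rating = (dr + supr + subr) / 3

    else:
        # Case 3: weighted mean of the lowest, highest and middle ratings
        medval = dr + subr + supr - minval - maxval
        rating = 0.5 * minval + 0.2 * maxval + 0.3 * medval

    return rating


# Apply the function row by row (no lambda wrapper needed)
pa92['TotalRating'] = pa92.apply(TotalRating, axis=1)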
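
Row-wise `apply` can become slow when repeated over every state and year. The sketch below shows one possible vectorized equivalent using NumPy; `total_rating_vectorized` is a hypothetical name, and the function is offered as an alternative under the same three rules, not as the notebook's own implementation. The toy frame uses a few rating combinations for illustration.

```python
import numpy as np
import pandas as pd

cond_cols = ['DECK_COND_058', 'SUPERSTRUCTURE_COND_059', 'SUBSTRUCTURE_COND_060']


def total_rating_vectorized(df):
    """Compute the Total Rating for all rows at once."""
    # Sort the three ratings within each row: lowest, middle, highest
    vals = np.sort(df[cond_cols].to_numpy(dtype=float), axis=1)
    lo, mid, hi = vals[:, 0], vals[:, 1], vals[:, 2]
    weighted = 0.5 * lo + 0.3 * mid + 0.2 * hi
    mean = (lo + mid + hi) / 3
    # Lowest rating if it is <= 4, mean if all are >= 8, weighted mean otherwise
    rating = np.where(lo <= 4, lo, np.where(lo >= 8, mean, weighted))
    return pd.Series(rating, index=df.index)


toy = pd.DataFrame({
    'DECK_COND_058':           [3, 8, 8, 5, 7],
    'SUPERSTRUCTURE_COND_059': [5, 8, 8, 7, 6],
    'SUBSTRUCTURE_COND_060':   [5, 8, 6, 5, 8],
})
result = total_rating_vectorized(toy)
print(result)
```

Comparing `result` against the output of the row-wise function on the same rows is a quick way to confirm that both versions agree.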

In [84]:
pa92.sample(10)
# CHECKED: The equation has worked properly 

Unnamed: 0,STRUCTURE_NUMBER_008,DECK_COND_058,SUPERSTRUCTURE_COND_059,SUBSTRUCTURE_COND_060,TotalRating
8028,207225069330330,3,5,5,3.0
18575,530443059014390,8,8,8,8.0
2184,50096069004660,8,8,6,7.0
6697,164031001000000,4,2,4,2.0
10434,280997058000000,5,4,6,4.0
14689,410180044111520,7,7,7,7.0
19652,570167012006770,5,7,5,5.4
11543,321056001034300,6,5,5,5.2
256,17216049330550,7,6,8,6.7
21100,620079044505090,4,4,5,4.0
