# **2-Laboratory-15-10-2020**

| Credits to the authors of the exercises: Andrea Pasini, Giuseppe Attanasio, Flavio Giobergia <br />
| Master of Science in Data Science and Engineering, Politecnico di Torino, A.A. 2020-21

## Global Land Temperature
The Global Land Temperature (GLT) dataset is a large collection of measurements actively maintained by Berkeley Earth. It contains the raw source data measured by stations all around the globe, plus an intermediate format and several formatted output files. Data span from ∼1750 up to the present, with monthly and daily availability.
<br />
Measurements are provided by hemisphere, state, country, city, and more. You can read more about the dataset at the Berkeley Earth website. For the purpose of this laboratory you will work on a modified, smaller but dirtier, version of the original GLT dataset, to stress the importance of data preprocessing. More specifically, this didactic version contains the formatted output files of the major cities of the globe with monthly granularity. For the sake of simplicity, the analysis will cover almost two centuries (i.e. the years between 1817 and 2012). The dataset is composed of ∼200k rows, each corresponding to the measurement taken on the first day of the month in a given city. Each measurement is described by 7 values:
- Date, when the measurement was taken
- AverageTemperature
- AverageTemperatureUncertainty
- City, from which the measurement was taken
- Country
- Latitude
- Longitude

The main goal of this exercise is to learn how to clean a real-world dataset by searching its data for anomalies, such as missing values or outliers.

### Questions
1. Load the Global Land Temperature dataset as a list of lists. Before starting, take a moment to better inspect the attributes you are going to work on. How many of them are nominal, how many continuous or discrete?

In [101]:
import csv

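# one list per attribute (column-wise layout)
GLT_dataset = [[] for _ in range(7)]

# the file is UTF-8 encoded; newline='' is what the csv module recommends
with open('../Datasets/GLT_filtered.csv', encoding='utf-8', newline='') as f:
    for row in csv.reader(f):
        for i, value in enumerate(row):
            GLT_dataset[i].append(value)

# the first element of each column is its header
head = [column.pop(0) for column in GLT_dataset]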

Let's take a closer look at what we obtained.

In [105]:
print(f" *** Head  \n{head}")
print("\n *** First 8 elements for each list")

for i in range(len(GLT_dataset)):
    print(GLT_dataset[i][:8])

 *** Head  
['Date', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City', 'Country', 'Latitude', 'Longitude']

 *** First 8 elements for each list
['1849-01-01', '1849-02-01', '1849-03-01', '1849-04-01', '1849-05-01', '1849-06-01', '1849-07-01', '1849-08-01']
[26.704, 27.434, 26.787, 26.14, 25.427, 24.844, 24.058000000000003, 23.576]
['1.435', '1.3619999999999999', '', '1.3869999999999998', '1.2', '1.402', '1.254', '1.265']
['Abidjan', 'Abidjan', 'Abidjan', 'Abidjan', 'Abidjan', 'Abidjan', 'Abidjan', 'Abidjan']
["Côte D'Ivoire", "Côte D'Ivoire", "Côte D'Ivoire", "Côte D'Ivoire", "Côte D'Ivoire", "Côte D'Ivoire", "Côte D'Ivoire", "Côte D'Ivoire"]
['5.63N', '5.63N', '5.63N', '5.63N', '5.63N', '5.63N', '5.63N', '5.63N']
['3.23W', '3.23W', '3.23W', '3.23W', '3.23W', '3.23W', '3.23W', '3.23W']


So, we're dealing with two nominal attributes (City, Country), three discrete ones (Date, Latitude, Longitude), and two continuous attributes (AverageTemperature, AverageTemperatureUncertainty). <br />
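
One way to support this kind of classification is to look at how many distinct values each column holds and whether they parse as numbers. The sketch below is only an illustration: `describe_column` is a hypothetical helper that is not part of the original exercise, and the sample values are copied from the output printed above.

```python
def describe_column(values):
    """Return (number of distinct non-empty values, all of them parse as float)."""
    non_empty = [v for v in values if v != '']

    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    return len(set(non_empty)), all(is_number(v) for v in non_empty)


# Sample values taken from the printed output above (hypothetical subset).
sample = {
    'City': ['Abidjan', 'Abidjan', 'Abidjan'],
    'Latitude': ['5.63N', '5.63N', '5.63N'],
    'AverageTemperatureUncertainty': ['1.435', '1.3619999999999999', ''],
}

for name, values in sample.items():
    print(name, describe_column(values))
```

Note that Latitude and Longitude do not parse as floats as they are, because of the trailing hemisphere letter (`N`, `W`, ...).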

2. Analyze the attribute AverageTemperature, which contains missing values. Fill any gap with the arithmetic mean of the closest antecedent and the closest successive measurements in time, taken in the same city. Assume the following rules for edge cases:

original_list = ['', 5, 6,'']
step_1        = [ 2.5, 5, 6,''] # (0 + 5) / 2
step_2        = [ 2.5, 5, 6,  3 ] # (6 + 0) / 2

original_list   = ['','', 24, 28.9 ]
step_1          = [ 12,'', 24, 28.9 ] # (0 + 24) / 2
step_2          = [ 12, 18, 24, 28.9 ] # (12 + 24) / 2
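
Read left to right, these rules amount to: replace each gap with the mean of the previous value (already filled, or 0 at the start of the list) and the next non-empty value (or 0 if there is none). A minimal sketch of that reading follows; `fill_gaps` is a hypothetical name, and the function works on a single list without any notion of city.

```python
def fill_gaps(values):
    """Return a copy of `values` where every '' is replaced by the mean of
    the previous value (already filled, or 0 at the start) and the next
    non-empty value (or 0 if there is none)."""
    # next_value[i] = closest non-empty value strictly after position i (0 if none)
    next_value = [0] * len(values)
    following = 0
    for i in range(len(values) - 1, -1, -1):
        next_value[i] = following
        if values[i] != '':
            following = values[i]

    filled = list(values)
    for i, value in enumerate(values):
        if value == '':
            previous = filled[i - 1] if i > 0 else 0
            filled[i] = (previous + next_value[i]) / 2
    return filled


print(fill_gaps(['', 5, 6, '']))       # [2.5, 5, 6, 3.0]
print(fill_gaps(['', '', 24, 28.9]))   # [12.0, 18.0, 24, 28.9]
```

The backward pass precomputes the next non-empty value for every position, so the whole fill runs in linear time even on long columns.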

In [93]:
# let's start with a toy vector to better see what happens
toy =  ['',5,4,'',3,'','']

def next_non_negative(i):
    for k in range(i+1,len(toy)):
        if toy[k] != '':
            return k
    return -1

# The first element is handled outside the loop,
# so that this check runs only once.
print(toy)

if toy[0] == '':
    j = next_non_negative(0)
    # no previous value: use 0 in its place (and 0 if no later value exists)
    toy[0] = toy[j]/2 if j != -1 else 0.0

print(toy)

# from the second element to the last one
for i in range(1,len(toy)):
    
    if toy[i] == '':
        j = next_non_negative(i)
        
        if j == -1:
            # no non-empty value after it: average the previous value with 0
            toy[i] = toy[i-1]/2
        else:
            # average the previous (possibly already filled) value
            # with the next non-empty one, however far away it is
            toy[i] = (toy[i-1] + toy[j])/2
    print(toy)
    
    

['', 5, 4, '', 3, '', '']
[2.5, 5, 4, '', 3, '', '']
[2.5, 5, 4, '', 3, '', '']
[2.5, 5, 4, '', 3, '', '']
[2.5, 5, 4, 3.5, 3, '', '']
[2.5, 5, 4, 3.5, 3, '', '']
[2.5, 5, 4, 3.5, 3, 1.5, '']
[2.5, 5, 4, 3.5, 3, 1.5, 0.75]


It works. Now let's try it on our list, but first we need to convert these values to float, otherwise we cannot calculate the mean.


In [104]:
# convert from string to float
GLT_dataset[1] = [float(i) if i != '' else '' for i in GLT_dataset[1] ]

def next_non_negativeV2(i):
    for k in range(i+1,len(GLT_dataset[1])):
        if GLT_dataset[1][k] != '':
            return k
    return -1

if GLT_dataset[1][0] == '':
    j = next_non_negativeV2(0)
    GLT_dataset[1][0] = GLT_dataset[1][j]/2 if j != -1 else 0.0

# index 0 has already been handled above
# note: city boundaries are not checked here
for i in range(1, len(GLT_dataset[1])):
    
    if GLT_dataset[1][i] == '':
        
        j = next_non_negativeV2(i)
        
        if j == -1:
            # no later measurement: average the previous value with 0
            GLT_dataset[1][i] = GLT_dataset[1][i-1]/2
        else:
            # average the previous value with the next non-empty one
            GLT_dataset[1][i] = (GLT_dataset[1][i-1] + GLT_dataset[1][j])/2
    
    
GLT_dataset[1][:10]

[26.704,
 27.434,
 26.787,
 26.14,
 25.427,
 24.844,
 24.058000000000003,
 23.576,
 24.4195,
 25.263]
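
The question asks for neighbours taken in the same city, while the loop above treats the whole column as a single sequence. A possible per-city variant is sketched below. It assumes that the rows of each city are contiguous and sorted by date (as the Abidjan rows printed earlier suggest); `fill_gaps_by_city` is a hypothetical name, and the toy data are made up for illustration.

```python
def fill_gaps_by_city(temps, cities):
    """Fill '' entries of `temps` using only neighbours from the same city.

    Assumes rows of the same city are contiguous and time-ordered.
    A missing neighbour (start or end of a city) counts as 0."""
    n = len(temps)
    # next_value[i] = closest later non-empty temperature in the same city (0 if none)
    next_value = [0] * n
    following = 0
    for i in range(n - 1, -1, -1):
        if i == n - 1 or cities[i + 1] != cities[i]:
            following = 0  # row i is the last row of its city
        next_value[i] = following
        if temps[i] != '':
            following = temps[i]

    filled = list(temps)
    for i in range(n):
        if temps[i] == '':
            same_city = i > 0 and cities[i - 1] == cities[i]
            previous = filled[i - 1] if same_city else 0
            filled[i] = (previous + next_value[i]) / 2
    return filled


# Made-up toy data: the gap at index 3 is the first row of city 'B'.
temps = [10.0, '', 20.0, '', 8.0]
cities = ['A', 'A', 'A', 'B', 'B']
print(fill_gaps_by_city(temps, cities))  # [10.0, 15.0, 20.0, 4.0, 8.0]
```

Without the city check, the gap at index 3 would be filled with (20.0 + 8.0) / 2 = 14.0, mixing measurements from two different cities.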

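For comparison, the following variant only halves the next non-empty value and ignores the previous measurement, so it does not follow the rule above: index 2 becomes 26.14 / 2 = 13.07 instead of (27.434 + 26.14) / 2 = 26.787.

In [64]: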
for i,elem in enumerate(GLT_dataset[1]):
    
    if elem == '' and i != len(GLT_dataset[1])-1:
        # look for the first non-null value
        for j in range(i+1,len(GLT_dataset[1])):
            if GLT_dataset[1][j] != '':
                GLT_dataset[1][i] = GLT_dataset[1][j]/2
                break
                
    elif elem == '' and i == len(GLT_dataset[1])-1:
        # use the previous element
        GLT_dataset[1][i] = GLT_dataset[1][i-1]/2 

GLT_dataset[1][:10]

[26.704,
 27.434,
 13.07,
 26.14,
 25.427,
 24.844,
 24.058000000000003,
 23.576,
 12.6315,
 25.263]