## Processing data with pandas II

from https://geo-python-site.readthedocs.io/en/latest/notebooks/L6/advanced-data-processing-with-pandas.html

### Downloading the data
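The data can be downloaded from the Finnish Meteorological Institute’s open data portal. The file used here contains air temperature observations at 10-minute intervals from the Helsinki Kumpula station for summer 2024.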

### Import libraries

In [3]:
import pandas as pd

In [5]:
fp = r"data/Helsinki.csv"
data = pd.read_csv(fp, na_values=["-"])
data.head()

|   | Observation station | Year | Month | Day | Time [Local time] | Air temperature [°C] |
|---|---|---|---|---|---|---|
| 0 | Helsinki Kumpula | 2024 | 6 | 1 | 02:00 | 19.5 |
| 1 | Helsinki Kumpula | 2024 | 6 | 1 | 02:10 | 19.3 |
| 2 | Helsinki Kumpula | 2024 | 6 | 1 | 02:20 | 19.3 |
| 3 | Helsinki Kumpula | 2024 | 6 | 1 | 02:30 | 19.7 |
| 4 | Helsinki Kumpula | 2024 | 6 | 1 | 02:40 | 19.3 |
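
The `na_values=["-"]` argument used above tells `read_csv` to treat a lone hyphen as a missing value. The following is a minimal sketch of that effect on a small in-memory CSV; the column names and values are made up for illustration and are not taken from the FMI file.

```python
import io

import pandas as pd

# Hypothetical sample data; the third row uses "-" for a missing reading
csv_text = """Time,Temp
02:00,19.5
02:10,19.3
02:20,-
"""

# Without na_values, the "-" prevents the column from being parsed as numbers
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["Temp"].dtype)

# With na_values, "-" becomes NaN and the column is parsed as float
clean = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
print(clean["Temp"].dtype)
print(clean["Temp"].isna().sum())
```

Reading the marker as NaN keeps the column numeric, so statistics such as the mean can be computed directly.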


In [11]:
data.columns

Index(['Observation station', 'Year', 'Month', 'Day', 'Time [Local time]',
       'Air temperature [°C]'],
      dtype='object')

In [17]:
# Read in only selected columns; keep treating "-" as a missing value
data = pd.read_csv(
    fp,
    usecols=["Observation station", "Year", "Month", "Day",
             "Time [Local time]", "Air temperature [°C]"],
    na_values=["-"],
)

# Check the dataframe
data.head()

|   | Observation station | Year | Month | Day | Time [Local time] | Air temperature [°C] |
|---|---|---|---|---|---|---|
| 0 | Helsinki Kumpula | 2024 | 6 | 1 | 02:00 | 19.5 |
| 1 | Helsinki Kumpula | 2024 | 6 | 1 | 02:10 | 19.3 |
| 2 | Helsinki Kumpula | 2024 | 6 | 1 | 02:20 | 19.3 |
| 3 | Helsinki Kumpula | 2024 | 6 | 1 | 02:30 | 19.7 |
| 4 | Helsinki Kumpula | 2024 | 6 | 1 | 02:40 | 19.3 |


In [18]:
data.columns

Index(['Observation station', 'Year', 'Month', 'Day', 'Time [Local time]',
       'Air temperature [°C]'],
      dtype='object')
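
Because the `usecols` list above names every column in the file, the result looks the same as the full read. A small sketch with a hypothetical in-memory CSV (the `Extra` column and its values are invented) shows how `usecols` actually drops columns. It also shows that the order of names in the list does not change the column order, which follows the file.

```python
import io

import pandas as pd

# Hypothetical sample with one column we do not want
csv_text = """Station,Year,Month,Extra
Kumpula,2024,6,a
Kumpula,2024,6,b
"""

# Only the listed columns are parsed; "Month" and "Extra" are skipped
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Year", "Station"])
print(subset.columns.tolist())
```

If a specific column order is needed, the columns can be reordered after reading, for example with `subset[["Year", "Station"]]`.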

In [22]:
# Create the dictionary with old and new names
new_names = {"Observation station": "STATION", "Time [Local time]": "TIME", "Air temperature [°C]": "TEMP"}

# Let's see what the variable new_names looks like
new_names

{'Observation station': 'STATION',
 'Time [Local time]': 'TIME',
 'Air temperature [°C]': 'TEMP'}

In [23]:
type(new_names)

dict

In [26]:
data = data.rename(columns=new_names)
print(data.columns)

Index(['STATION', 'Year', 'Month', 'Day', 'TIME', 'TEMP'], dtype='object')


In [33]:
data2 = data.rename(columns={"TEMP":"TEMP_C"})
print(data2.columns)

Index(['STATION', 'Year', 'Month', 'Day', 'TIME', 'TEMP_C'], dtype='object')
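
Note that `rename` returns a new DataFrame, which is why the result is assigned to a variable in the cells above. A minimal sketch on a toy DataFrame (the values and the `NOT_A_COLUMN` key are illustrative only) shows that the original is left unchanged and that, by default, keys that match no column are silently ignored.

```python
import pandas as pd

# Toy DataFrame standing in for the weather data
df = pd.DataFrame({"TEMP": [19.5, 19.3]})

# "NOT_A_COLUMN" matches nothing and is ignored by default
renamed = df.rename(columns={"TEMP": "TEMP_C", "NOT_A_COLUMN": "X"})

print(df.columns.tolist())       # original is unchanged
print(renamed.columns.tolist())  # new DataFrame carries the new name
```

Passing `errors="raise"` to `rename` makes pandas raise a `KeyError` for unmatched keys, which can help catch typos in column names.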


In [34]:
data.shape

(13248, 6)

In [35]:
data.dtypes

STATION     object
Year         int64
Month        int64
Day          int64
TIME        object
TEMP       float64
dtype: object
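
The dtypes determine which columns pandas treats as numeric. As a sketch, assuming a small DataFrame built from the first two rows shown earlier, `select_dtypes` can pick out the numeric columns and leave out text columns such as `STATION` and `TIME`.

```python
import pandas as pd

# Small sample built from the first two rows of the output above
sample = pd.DataFrame({
    "STATION": ["Helsinki Kumpula", "Helsinki Kumpula"],
    "Year": [2024, 2024],
    "TIME": ["02:00", "02:10"],
    "TEMP": [19.5, 19.3],
})

# Keep only integer and float columns
numeric = sample.select_dtypes(include="number")
print(numeric.columns.tolist())
```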

In [38]:
data.describe()

|   | Year | Month | Day | TEMP |
|---|---|---|---|---|
| count | 13248.0 | 13248.0 | 13248.0 | 13248.0 |
| mean | 2024.0 | 7.013587 | 15.836957 | 18.417671 |
| std | 0.0 | 0.815859 | 8.854561 | 3.479045 |
| min | 2024.0 | 6.0 | 1.0 | 7.7 |
| 25% | 2024.0 | 6.0 | 8.0 | 16.0 |
| 50% | 2024.0 | 7.0 | 16.0 | 18.4 |
| 75% | 2024.0 | 8.0 | 23.25 | 20.9 |
| max | 2024.0 | 9.0 | 31.0 | 27.7 |
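
By default, `describe()` summarizes only the numeric columns, and its `50%` row is the median. A minimal sketch using the five temperature values from the `head()` output above shows how to summarize a single column and read individual statistics from the result.

```python
import pandas as pd

# The first five temperature values shown in the head() output
temps = pd.Series([19.5, 19.3, 19.3, 19.7, 19.3], name="TEMP")

# describe() on a Series returns a Series indexed by statistic name
summary = temps.describe()
print(summary)

# Individual statistics can be read by label
print(summary["50%"], temps.median())
```

The same label-based access works on the full DataFrame result, for example `data.describe().loc["mean", "TEMP"]`.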
