# Dictionaries and Pandas

## References

- [DataCamp - Intermediate Python: Dictionaries & Pandas](https://campus.datacamp.com/courses/intermediate-python/dictionaries-pandas?ex=1)


## Overview

A dictionary is a collection of key-value pairs in which each key is unique. It is written with curly braces `{}`; a colon `:` separates each key from its value, and commas separate the pairs. Since Python 3.7, dictionaries preserve insertion order, but values are looked up by key rather than by position. Dictionaries are useful when we need to store data and retrieve it quickly by key.


## Creating a Dictionary

To create a dictionary in Python, write the key-value pairs inside curly braces `{}`, with a colon between each key and its value and a comma between pairs. Here's an example:

In [1]:
hot_data = {
    # keys        :   values 
    'dataset_name': 'Hawaii Ocean Time-series data',
    'dataset_description': 'HOT dataset',
    'dataset_source': 'BCO-DMO',
    'dataset_variables': ['temperature', 'salinity', 'pressure'], # not including everything
    'dataset_years': (1988, 2019),
    'dataset_ctd':'https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_3937.csv',
    'dataset_bottle':'https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_3773.csv'
}

In this example, we created a dictionary called `hot_data` with several key-value pairs. The keys are strings (e.g. `'dataset_name'`) and the values can be of any data type (here strings, a list, and a tuple).
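Two rules about keys are easy to miss, so here is a minimal sketch of them. The dictionary names (`source`, `years_by_range`) are made up for illustration and are not part of the HOT example.

```python
# Repeating a key in a literal is legal, but only the last value is kept
source = {'dataset_source': 'unknown', 'dataset_source': 'BCO-DMO'}
print(source)

# Keys must be hashable: strings, numbers, and tuples all work
years_by_range = {(1988, 2019): 'dataset_years'}
print(years_by_range[(1988, 2019)])

# A list is mutable and therefore unhashable, so it cannot be a key
try:
    bad = {['temperature', 'salinity']: 'dataset_variables'}
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)
```

Values have no such restriction, which is why `'dataset_variables'` can hold a list.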

## Accessing Dictionary Values

You can access the value of a specific key in a dictionary by using the key inside square brackets `[]`. For example, to access the value for 'dataset_name' in `hot_data`, we would do the following:

In [2]:
print(hot_data['dataset_name'])

Hawaii Ocean Time-series data
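Square brackets raise a `KeyError` when the key is missing. The sketch below shows two common ways to guard against that, using a small stand-in dictionary called `metadata` and a hypothetical key `'dataset_license'` that does not exist in `hot_data`.

```python
# A small stand-in for hot_data so this block runs on its own
metadata = {
    'dataset_name': 'Hawaii Ocean Time-series data',
    'dataset_source': 'BCO-DMO',
}

# Square brackets raise KeyError when the key does not exist
try:
    metadata['dataset_license']
except KeyError as err:
    missing_key = err.args[0]
    print('Missing key:', missing_key)

# .get() returns a default instead of raising (None if no default is given)
license_name = metadata.get('dataset_license', 'unknown')
print(license_name)

# The `in` operator tests whether a key is present
has_source = 'dataset_source' in metadata
print(has_source)
```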


## Updating a Dictionary

You can add new key-value pairs to a dictionary, or update existing ones, by assigning a value to a key. Here's an example of adding a new key-value pair to `hot_data`:

In [3]:
hot_data['dataset_processor'] = 'Fernando C. Pacheco'

In [4]:
print(hot_data)

{'dataset_name': 'Hawaii Ocean Time-series data', 'dataset_description': 'HOT dataset', 'dataset_source': 'BCO-DMO', 'dataset_variables': ['temperature', 'salinity', 'pressure'], 'dataset_years': (1988, 2019), 'dataset_ctd': 'https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_3937.csv', 'dataset_bottle': 'https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_3773.csv', 'dataset_processor': 'Fernando C. Pacheco'}
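Assignment also overwrites a key that already exists, and keys can be removed again. A minimal sketch on a stand-in dictionary follows; `metadata` and the key `'dataset_status'` are illustrative assumptions, not part of `hot_data`.

```python
metadata = {
    'dataset_description': 'HOT dataset',
    'dataset_years': (1988, 2019),
}

# Assigning to an existing key overwrites its value
metadata['dataset_description'] = 'Hawaii Ocean Time-series CTD data'

# update() sets several keys in one call
metadata.update({'dataset_source': 'BCO-DMO', 'dataset_status': 'draft'})

# pop() removes a key and returns its value; del removes it without returning
status = metadata.pop('dataset_status')
del metadata['dataset_years']

print(status)
print(metadata)
```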


## Iterating over a Dictionary

You can iterate over a dictionary using a for loop. Here's an example of iterating over the `hot_data` dictionary we created earlier:



In [5]:
for key, value in hot_data.items():
    print(key, ':', value)

dataset_name : Hawaii Ocean Time-series data
dataset_description : HOT dataset
dataset_source : BCO-DMO
dataset_variables : ['temperature', 'salinity', 'pressure']
dataset_years : (1988, 2019)
dataset_ctd : https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_3937.csv
dataset_bottle : https://erddap.bco-dmo.org/erddap/tabledap/bcodmo_dataset_3773.csv
dataset_processor : Fernando C. Pacheco
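Besides `.items()`, a dictionary can be iterated by keys only or by values only. The sketch below uses a shortened stand-in for `hot_data` (named `metadata` here) and also shows a dict comprehension, which builds a new dictionary while iterating.

```python
metadata = {
    'dataset_name': 'Hawaii Ocean Time-series data',
    'dataset_source': 'BCO-DMO',
    'dataset_years': (1988, 2019),
}

# Looping over the dictionary itself yields its keys, in insertion order
keys = [key for key in metadata]

# .values() yields only the values
values = list(metadata.values())

# A dict comprehension filters or transforms entries while iterating;
# here we keep only the entries whose value is a string
text_only = {k: v for k, v in metadata.items() if isinstance(v, str)}

print(keys)
print(values)
print(text_only)
```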


# Pandas

Pandas is a powerful Python library used for data manipulation and analysis. It provides a data structure called DataFrame, which allows you to organize and manipulate data in a tabular form.
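Dictionaries and DataFrames connect directly: a dictionary of equal-length lists can be turned into a DataFrame, with each key becoming a column. The sketch below assumes Pandas is already installed; the names `ctd_sample` and `sample_df` are illustrative, and the numbers are the first three rows of pressure, temperature, and salinity from the HOT CTD output shown later in this notebook.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values
ctd_sample = {
    'CTDPRS': [0.0, 2.0, 4.0],              # pressure, decibars
    'CTDTMP': [26.2412, 26.2412, 26.2554],  # temperature, degrees Celsius
    'CTDSAL': [35.2615, 35.2615, 35.2530],  # salinity, PSU
}

sample_df = pd.DataFrame(ctd_sample)

print(sample_df)
print(sample_df.shape)  # (rows, columns)
```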

## Importing Pandas and Loading Data

Before using Pandas, we need to install it and then import it into our Python environment. Install it with

`conda install pandas`

or

`pip install pandas`

Next, we will load our data into a Pandas DataFrame. We will use the CTD (conductivity, temperature, depth) dataset from HOT. This dataset contains measurements of water temperature, salinity, and pressure at various depths in the ocean.

In [8]:
import pandas as pd

# Remote source: url = hot_data['dataset_ctd']
# Here we read a local CSV file instead
url = "dataset_ctd.csv"

In [134]:
# The file has two header rows: column names, then units
df = pd.read_csv(url,
                 header=[0, 1],
                 dtype=None,
                 )


In [135]:
df

Unnamed: 0_level_0,cruise_name,station,cast,time,Year,Month,Day,timeutc,longitude,latitude,...,CTDPRS,CTDTMP,CTDSAL,CTDOXY,XMISS,CHLPIG,NUMBER,NITRATE,FLUOR,QUALT1
Unnamed: 0_level_1,unitless,unitless,unitless,UTC,unitless,unitless,unitless,unitless,degrees_east,degrees_north,...,Decibars (db),Degrees Celsius,PSU,micromoles per kilogram (umol/kg),percent transmission (%),microgram per liter (ug/L),count,micromoles per kilogram (umol/kg),mVOLTS,unitless
0,1,2,1,1988-10-30T21:34:00Z,1988,10,30,2134,-157.9967,22.7483,...,0.0,26.2412,35.2615,183.2,4.99,-0.0126,0.0,,,666666
1,1,2,1,1988-10-30T21:34:00Z,1988,10,30,2134,-157.9967,22.7483,...,2.0,26.2412,35.2615,183.2,4.99,-0.0126,36.0,,,222322
2,1,2,1,1988-10-30T21:34:00Z,1988,10,30,2134,-157.9967,22.7483,...,4.0,26.2554,35.2530,185.5,4.08,0.0026,72.0,,,223322
3,1,2,1,1988-10-30T21:34:00Z,1988,10,30,2134,-157.9967,22.7483,...,6.0,26.2377,35.2455,204.8,3.05,0.0167,108.0,,,222122
4,1,2,1,1988-10-30T21:34:00Z,1988,10,30,2134,-157.9967,22.7483,...,8.0,26.2257,35.2419,205.1,2.63,0.0043,96.0,,,222122
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3503261,288,50,1,2016-11-28T17:51:00Z,2016,11,28,1751,-157.9373,22.7703,...,194.0,18.2746,34.9098,203.9,,0.0254,144.0,,,222192
3503262,288,50,1,2016-11-28T17:51:00Z,2016,11,28,1751,-157.9373,22.7703,...,196.0,18.1590,34.9027,203.8,,0.0257,108.0,,,222192
3503263,288,50,1,2016-11-28T17:51:00Z,2016,11,28,1751,-157.9373,22.7703,...,198.0,17.9686,34.8727,203.8,,0.0257,156.0,,,222192
3503264,288,50,1,2016-11-28T17:51:00Z,2016,11,28,1751,-157.9373,22.7703,...,200.0,17.8751,34.8506,201.8,,0.0248,108.0,,,222192
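Because the file was read with `header=[0, 1]`, every column label is a `(name, units)` pair, which changes how columns are selected. The sketch below reproduces this on a tiny in-memory CSV shaped like the HOT file. The variable names (`csv_text`, `sample`, `units`, `flat`) are illustrative, and dropping the units row is only one possible way to simplify the labels.

```python
import io
import pandas as pd

# Three-column excerpt shaped like the CTD file: names row, then units row
csv_text = """time,CTDPRS,CTDTMP
UTC,Decibars (db),Degrees Celsius
1988-10-30T21:34:00Z,0.0,26.2412
1988-10-30T21:34:00Z,2.0,26.2412
"""

sample = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# Column labels are now (name, units) tuples
print(sample.columns.tolist())

# Select a single column with its full tuple key
temperature = sample[('CTDTMP', 'Degrees Celsius')]
print(temperature.tolist())

# Keep the units in a dictionary, then drop that header level
units = dict(sample.columns.tolist())
flat = sample.copy()
flat.columns = flat.columns.droplevel(1)
print(flat.columns.tolist())
print(units['CTDTMP'])
```

With flat labels, `flat['CTDTMP']` returns a Series directly, while the units remain available through the `units` dictionary.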


In [130]:
df.dtypes

cruise_name            unitless                               int64
station                unitless                               int64
cast                   unitless                               int64
time                   UTC                                   object
Year                   unitless                               int64
Month                  unitless                               int64
Day                    unitless                               int64
timeutc                unitless                               int64
longitude              degrees_east                         float64
latitude               degrees_north                        float64
depth_max              meters (m)                             int64
pres_max               decibars (db)                          int64
Date                   untiless                               int64
timecode               unitless                              object
HOT_summary_file_name  unitless                 

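- Note that the `time` column has dtype `object`, meaning the timestamps were read as plain strings.
- It should be converted to a `datetime` dtype so it can be used for time-based selection and plotting.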

In [None]:
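# The 'time' column already holds ISO 8601 strings, so parse it directly;
# with header=[0, 1] the column key is the tuple ('time', 'UTC').
df[('time', 'UTC')] = pd.to_datetime(df[('time', 'UTC')])
df.Month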
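The conversion to `datetime` can be sketched on a two-row sample with flat column names. Two routes are shown: parsing the ISO 8601 strings, and assembling a timestamp from the date parts. The second route assumes that `timeutc` encodes the time of day as HHMM (for example 2134 for 21:34), which matches the `time` strings in the output above but is not stated by the source. The names `sample`, `parts`, `parsed`, and `assembled` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'time': ['1988-10-30T21:34:00Z', '2016-11-28T17:51:00Z'],
    'Year': [1988, 2016],
    'Month': [10, 11],
    'Day': [30, 28],
    'timeutc': [2134, 1751],
})

# Option 1: parse the ISO 8601 strings; utc=True gives a timezone-aware result
parsed = pd.to_datetime(sample['time'], utc=True)

# Option 2: assemble from parts. pd.to_datetime accepts a DataFrame whose
# columns are named year, month, day, hour, minute, ...
parts = pd.DataFrame({
    'year': sample['Year'],
    'month': sample['Month'],
    'day': sample['Day'],
    'hour': sample['timeutc'] // 100,   # assumed HHMM: 2134 -> 21
    'minute': sample['timeutc'] % 100,  # assumed HHMM: 2134 -> 34
})
assembled = pd.to_datetime(parts)

print(parsed.dtype)
print(assembled)
```

Option 1 is usually the simpler choice when a complete timestamp string is available; option 2 is useful when only the separate date parts can be trusted.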