# Data Wrangling with Pandas

**Author**: Jeremy Maurer - Missouri University of Science and Technology

This notebook gives an overview of data manipulation using Pandas, a Python package that offers functionality similar to spreadsheet programs such as Excel or Google Sheets.

You can read more details about Pandas __[here](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)__.

In this notebook we will briefly demonstrate the following capabilities of pandas:
- Reading data from comma- and space-delimited files into pandas dataframes
- Manipulating data in a dataframe
- Writing dataframes to files

<div class="alert alert-info">
    <b>Terminology:</b>    

- *dataframe*: The equivalent of a spreadsheet in Python.
    
- *Series*: A single column of a Pandas dataframe; equivalent to a column in a spreadsheet  

- *tropospheric zenith delay*: The delay that satellite signals experience when propagating through the troposphere, expressed for a signal path in the zenith (vertical) direction.  
</div>

Estimated time to run notebook: 15 minutes

## Table of Contents:
<a id='example_TOC'></a>

[**Overview of the pandas package**](#overview)  
[1. Reading data from files](#reading-data)  
[2. Manipulating data in dataframes](#manip-data)  
[3. Writing data to files](#write-data)  

## Prep: Initial setup of the notebook

Below we set up the directory structure for this notebook exercise. We also load the required modules into our Python environment using the **`import`** statement.

<div class="alert alert-info">
    You can customize the location of your home and working directory when running this notebook by modifying the cell below. 
</div>
    

In [None]:
import numpy as np
import os
import matplotlib.pyplot as plt
import pandas as pd

## Defining the home and data directories
tutorial_home_dir = os.path.abspath(os.getcwd())
work_dir = os.path.abspath(os.getcwd())
print("Tutorial directory: ", tutorial_home_dir)
print("Work directory: ", work_dir)

## Overview of the Pandas Package
<a id='overview'></a>

### Reading data from files
<a id='reading-data'></a>

In [None]:
# Let's start by loading a simple .csv dataset into a pandas dataframe
df = pd.read_csv('data/sample_data.csv')
df.head()

In [None]:
# It's also possible to read space-delimited and Excel files using pandas
# df = pd.read_csv('space_delimited_file.txt', sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas versions
# df = pd.read_excel('excel_file.xlsx') # Reading .xlsx files requires openpyxl; legacy .xls files require xlrd
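
As a minimal sketch of reading whitespace-delimited data, the example below parses a small in-memory table instead of a file on disk. The table contents and the station name `ABCD` are made up for illustration; only the column names `ID`, `ZTD`, and `Datetime` follow the sample dataset used in this notebook.

```python
import io
import pandas as pd

# Hypothetical whitespace-delimited text standing in for a file on disk
text = """ID      ZTD    Datetime
JME2    2.31   2020-01-01T00:00:00
JME2    2.35   2020-01-02T00:00:00
ABCD    2.28   2020-01-01T00:00:00
"""

# sep=r"\s+" splits each line on any run of spaces or tabs
demo = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(demo.shape)
print(list(demo.columns))
```

Passing a path such as `'space_delimited_file.txt'` in place of the `io.StringIO` object should behave the same way for a real file.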

### Manipulating data in pandas
<a id='manip-data'></a>

In [None]:
# Pandas uses an "index" to keep track of rows. By default it uses integers
print(df.index)

In [None]:
# You can change the index to a column in the dataframe, for example a datetime
df = df.set_index('Datetime')
df.head()

In [None]:
# You can reset the index as well
df = df.reset_index()
df.head()

In [None]:
# By default Pandas reads datetimes from files as strings.
# We can convert them to pandas datetime (Timestamp) values
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.set_index('Datetime')
df.head()
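
One benefit of a datetime index is that rows can be selected by date. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the sample dataset) to show the conversion followed by partial-string selection with `.loc`.

```python
import pandas as pd

# Toy data with datetimes stored as strings, as they would be after reading a file
demo = pd.DataFrame({
    "Datetime": ["2020-01-01", "2020-01-15", "2020-02-01"],
    "ZTD": [2.31, 2.35, 2.28],
})

# Convert the strings to datetimes and use them as the index
demo["Datetime"] = pd.to_datetime(demo["Datetime"])
demo = demo.set_index("Datetime")

# Select every row in January 2020 using a partial date string
january = demo.loc["2020-01"]
print(len(january))
```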

In [None]:
# We can select a subset of rows with a boolean mask built from a column
index = df['ID'] == 'JME2'
df_jme2 = df[index]
print(np.sum(index))  # number of rows belonging to station JME2
df_jme2.head()
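
Boolean masks can also be combined. The following is a small sketch on made-up data (the station `ABCD` and the threshold of 2.3 are assumptions chosen only for illustration) showing element-wise "and" (`&`) and "or" (`|`).

```python
import pandas as pd

demo = pd.DataFrame({
    "ID": ["JME2", "ABCD", "JME2", "ABCD"],
    "ZTD": [2.31, 2.28, 2.40, 2.45],
})

is_jme2 = demo["ID"] == "JME2"   # True where the station is JME2
high_ztd = demo["ZTD"] > 2.3     # True where ZTD exceeds the threshold

subset = demo[is_jme2 & high_ztd]   # rows satisfying both conditions
either = demo[is_jme2 | high_ztd]   # rows satisfying at least one condition
n_jme2 = int(is_jme2.sum())         # True counts as 1, so the sum is a row count
print(len(subset), len(either), n_jme2)
```

Each condition needs its own parentheses when written inline, for example `demo[(demo["ID"] == "JME2") & (demo["ZTD"] > 2.3)]`, because `&` binds more tightly than `==` and `>`.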

In [None]:
# It's possible to plot data directly using Pandas
df_jme2['ZTD'].plot()

In [None]:
# We can perform operations on columns:
'Station_' + df['ID'] 

In [None]:
# Or mathematical operations:
noisy = np.nanmean(df['ZTD']) + np.nanstd(df['ZTD'])*np.random.randn(len(df))
print(noisy)

In [None]:
# We can assign the output of an operation to a new column
df['ZTD_noisy'] = noisy

In [None]:
# And we can combine several columns in a single operation
df['ZTD_diff'] = df['ZTD'] - df['ZTD_noisy']

In [None]:
# We can define functions and then apply them to a dataframe column or index
def dt2fracYear(date):
    import datetime as dt
    import time

    def sinceEpoch(date): # returns seconds since epoch
        return time.mktime(date.timetuple())
    s = sinceEpoch

    # check that the object is a datetime
    try:
        year = date.year
    except AttributeError:
        date = numpyDT64ToDatetime(date)
        year = date.year

    startOfThisYear = dt.datetime(year=year, month=1, day=1)
    startOfNextYear = dt.datetime(year=year+1, month=1, day=1)

    yearElapsed = s(date) - s(startOfThisYear)
    yearDuration = s(startOfNextYear) - s(startOfThisYear)
    fraction = yearElapsed/yearDuration
    date_frac = date.year + fraction

    return date_frac

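def numpyDT64ToDatetime(dt64):
    '''
    Convert a numpy datetime64 object to a (naive, UTC) python datetime object
    '''
    import datetime
    import numpy as np

    unix_epoch = np.datetime64(0, 's')
    one_second = np.timedelta64(1, 's')
    seconds_since_epoch = float((dt64 - unix_epoch) / one_second)
    # datetime.utcfromtimestamp is deprecated since Python 3.12, so build a
    # timezone-aware UTC datetime and then drop the timezone information
    dt = datetime.datetime.fromtimestamp(seconds_since_epoch, tz=datetime.timezone.utc)
    return dt.replace(tzinfo=None)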

In [None]:
# We can assign the index to be a column, operate on it, and then drop the added column
df['dt'] = df.index
df['fracYear'] = df['dt'].apply(dt2fracYear)
df = df.drop('dt', axis=1)  # drop() returns a new dataframe, so assign the result
df.head()
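
Applying a Python function row by row can be slow on large dataframes. As an alternative sketch, the fractional year can be computed directly from the attributes of a `DatetimeIndex`. The helper name `frac_year_vectorized` is hypothetical, and it treats timestamps as plain calendar times, so its results may differ slightly from `dt2fracYear`, which goes through the local-time conversion in `time.mktime`.

```python
import numpy as np
import pandas as pd

def frac_year_vectorized(index):
    """Fractional year for every timestamp in a DatetimeIndex (illustrative helper)."""
    days_in_year = np.where(index.is_leap_year, 366, 365)
    # Days elapsed since 1 January, including the fraction of the current day
    seconds_today = index.hour * 3600 + index.minute * 60 + index.second
    elapsed_days = (index.dayofyear - 1) + seconds_today / 86400
    return np.asarray(index.year + elapsed_days / days_in_year, dtype=float)

idx = pd.DatetimeIndex(["2020-01-01", "2020-07-02", "2021-07-02 12:00"])
result = frac_year_vectorized(idx)
print(result)
```

With a dataframe indexed by datetime, this could be used as `df['fracYear'] = frac_year_vectorized(df.index)`.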

In [None]:
# We can look at summary statistics
df.describe()

In [None]:
# We can count the number of rows for each value of a column, here per station ID
station_stats = df.value_counts('ID')
station_stats.head()
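
For per-group statistics beyond simple counts, `groupby` is one common option. The sketch below uses made-up values (the station `ABCD` and all numbers are illustrative) to compute the count and mean of `ZTD` for each station.

```python
import pandas as pd

demo = pd.DataFrame({
    "ID": ["JME2", "ABCD", "JME2", "ABCD", "JME2"],
    "ZTD": [2.30, 2.20, 2.40, 2.40, 2.50],
})

# Group rows by station ID, then aggregate the ZTD column within each group
per_station = demo.groupby("ID")["ZTD"].agg(["count", "mean"])
print(per_station)
```

The result is a dataframe indexed by `ID`, with one column per aggregation.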

In [None]:
# We can create different plots, depending on the type of variable we are interested in
df['ZTD'].plot.hist(bins=100)
plt.xlabel('ZTD (m)')

In [None]:
# See the API documentation for keyword arguments, etc.
df.plot.scatter(x='ZTD', y='ZTD_noisy', s=1, c='k')

### Writing dataframes to a file
<a id='write-data'></a>

Pandas can write to various file formats, including Excel, JSON, HTML, HDF5, Stata, SQL, and pickle formats. 

Using the __[scipy.io](https://docs.scipy.org/doc/scipy/reference/io.html)__ module, you can also export data from Python to a .mat file that can be read in MATLAB. 

You can find the Pandas I/O documentation __[here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)__. 

In [None]:
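# We can export a dataframe to a .csv file
# Note: index=False leaves the index (here the Datetime values) out of the file; omit it to keep the timestamps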
df_jme2.to_csv(os.path.join(work_dir, 'Station_JME2_ZTD.csv'), index = False)
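
Whether the index is written is controlled by the `index` argument of `to_csv`. As a sketch of a round trip that preserves a datetime index, the example below writes a small made-up dataframe to an in-memory buffer (standing in for a file) and reads it back.

```python
import io
import pandas as pd

demo = pd.DataFrame(
    {"ID": ["JME2", "JME2"], "ZTD": [2.31, 2.35]},
    index=pd.DatetimeIndex(["2020-01-01", "2020-01-02"], name="Datetime"),
)

# Write to an in-memory buffer; the index is included by default (index=True)
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)

# Read it back, restoring the Datetime column as a parsed datetime index
restored = pd.read_csv(buffer, index_col="Datetime", parse_dates=True)
print(restored)
```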

In [None]:
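# Export to a .mat file by first converting the dataframe to a dictionary of columns.
# to_dict('list') maps each column name to a list of values; the Datetime index is not included.
# (The default to_dict() nests values under the index labels, which are not valid MATLAB field names.)
import scipy.io as sio
sio.savemat('Station_JME2_ZTD.mat', {'data': df_jme2.to_dict('list')})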