# Seed notebook to read and manipulate data with pandas and numpy

A seed Jupyter Python notebook to read, manipulate and visualize data. This notebook is a skeleton for those steps, with simple sample code that can be expanded later.

It covers:

* Read data from an external file
* Ma

## Read 

Read from a .csv file, parsing dates

In [None]:
import pandas as pd
df = pd.read_csv("employees.csv", parse_dates=["hire date"])
df
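
To see what `parse_dates` changes without needing the file on disk, here is a minimal sketch that reads the same kind of data from an in-memory string. The rows are made up for illustration; only the column names follow the ones used in this notebook.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for employees.csv (rows are made up)
sample_csv = io.StringIO(
    "first name,salary,hire date\n"
    "Ana,8200,2001-03-15\n"
    "Bob,5400,2012-07-01\n"
    "Carla,7000,2019-11-20\n"
)

sample_df = pd.read_csv(sample_csv, parse_dates=["hire date"])

# With parse_dates the column holds datetime values instead of plain strings
print(sample_df.dtypes)
```

Without `parse_dates`, the `hire date` column would be read as text, and date arithmetic on it would not work.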

## Add columns

Add a new column, based on values of another column

In [None]:
import numpy as np
df["high salary"] = np.where(df["salary"] >= 7000, "yes", "no")
df

Add the column again, this time using list comprehension

In [None]:
# First drop the column we just added (test first, in case this cell
# is executed more than once)
if "high salary" in df.columns:
    df.drop("high salary", axis=1, inplace=True)

# Add it again using list comprehension
df["high salary"] = ["yes" if x >= 7000 else "no" for x in df["salary"]]
df

Which method to use? According to [this Stack Overflow question](https://stackoverflow.com/q/50375985/336802), the `numpy` method is faster for large datasets.
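
One way to check this on your own machine is to time both approaches with `timeit`. The sketch below uses synthetic salaries (random numbers, not real data); the size and the number of repetitions are arbitrary choices, and the timings will vary between machines.

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic salaries (made-up data), large enough to make a difference visible
rng = np.random.default_rng(0)
big = pd.DataFrame({"salary": rng.integers(3000, 12000, size=100_000)})

def with_numpy():
    return np.where(big["salary"] >= 7000, "yes", "no")

def with_comprehension():
    return ["yes" if x >= 7000 else "no" for x in big["salary"]]

# Both approaches must produce the same labels
assert with_numpy().tolist() == with_comprehension()

# Total time in seconds for 10 runs of each approach
print("numpy.where:       ", timeit.timeit(with_numpy, number=10))
print("list comprehension:", timeit.timeit(with_comprehension, number=10))
```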

Use a function to add a new column with a calculated value. A function is useful when the calculation is too complex for a list comprehension or a lambda expression.

In [None]:
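def vacation_days(hire_date):
    """Return the vacation days earned, based on years of service."""
    today = pd.to_datetime("today")
    # Approximate years of service. Dividing by np.timedelta64(1, "Y") is
    # avoided because recent pandas versions reject the "Y" unit.
    service_years = (today - hire_date).days / 365.25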
    if service_years >= 20:
        return 25
    if service_years >= 15:
        return 20
    return 10

df["vacation days"] = df["hire date"].apply(vacation_days)
df
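
For larger datasets, the same tiers can also be computed without `apply`. The sketch below is an alternative, not the method used above: it uses `np.select` on made-up hire dates, with a fixed reference date (an assumption, chosen so the result is reproducible) instead of today's date.

```python
import numpy as np
import pandas as pd

# Made-up hire dates; a fixed reference date keeps the result reproducible
hires = pd.DataFrame({
    "hire date": pd.to_datetime(["2000-01-01", "2008-01-01", "2020-01-01"])
})
reference = pd.Timestamp("2024-01-01")

# Approximate years of service (365.25 days per year)
service_years = (reference - hires["hire date"]).dt.days / 365.25

# Conditions are checked in order and the first match wins,
# mirroring the chain of if statements in the function
hires["vacation days"] = np.select(
    [service_years >= 20, service_years >= 15],
    [25, 20],
    default=10,
)
print(hires)
```

The function-based version is usually easier to read and extend; the vectorized one trades some readability for speed.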

## Filter

Filter data based on column value

In [None]:
df_high = df.loc[df["high salary"] == "yes"]
df_high
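
Filters can also combine several conditions. The sketch below uses made-up rows with this notebook's column names; the extra salary condition is only there to illustrate the syntax.

```python
import pandas as pd

# Made-up rows using this notebook's column names
staff = pd.DataFrame({
    "first name": ["Ana", "Bob", "Carla"],
    "salary": [8200, 5400, 7000],
    "high salary": ["yes", "no", "yes"],
})

# Each condition needs its own parentheses; & means "and", | means "or"
well_paid = staff.loc[(staff["high salary"] == "yes") & (staff["salary"] > 7000)]
print(well_paid)
```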

## Graphs

**NOTE**: pandas uses [Matplotlib](https://matplotlib.org/) under the hood for graphs. Pandas' `plot` is a wrapper around the Matplotlib APIs that simplifies their usage. You can also call the Matplotlib APIs directly for more advanced graph creation and customization. See details in [this pandas document](http://pandas.pydata.org/pandas-docs/version/0.13/visualization.html).
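
As a minimal sketch of mixing the two, `plot` returns a Matplotlib `Axes` object that can be customized with regular Matplotlib calls. The data, the labels and the threshold line below are made up for illustration.

```python
import pandas as pd

# Made-up rows using this notebook's column names
staff = pd.DataFrame({
    "first name": ["Ana", "Bob", "Carla"],
    "salary": [8200, 5400, 7000],
})

# DataFrame.plot returns a Matplotlib Axes object
ax = staff.plot(x="first name", y="salary", kind="barh", legend=False)

# From here on these are plain Matplotlib calls on that Axes
ax.set_xlabel("Salary")
ax.set_title("Salary by employee")
ax.axvline(7000, color="gray", linestyle="--")  # mark the "high salary" threshold
```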

Simple graphs, choosing specific columns for each axis.

In [None]:
df.plot(x="first name", y="salary", kind="barh", legend=False)

Choosing colors for the bars, based on the value of another column.

In [None]:
df.plot(x="first name", y="salary", kind="barh", legend=False, 
        color=np.where(df["high salary"] == "yes", "r", "b"))


Same as the above, using list comprehension for the colors.

In [None]:
df.plot(x="first name", y="salary", kind="barh", legend=False, 
        color=["r" if x == "yes" else "b" for x in df["high salary"]])


There are other data visualization libraries for Python. See [this article](https://blog.modeanalytics.com/python-data-visualization-libraries/), [this other article](http://pbpython.com/visualization-tools-1.html) and [this one](https://dsaber.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/), which includes code in the comparison. Note that libraries move fast, especially the new ones, so always check for changes with a general web search for "python data visualization".

If you also work with R: `ggplot2` is available in Python, where it is called `ggplot` (see [here](http://ggplot.yhathq.com/)).
