# Data science recipes

Data scientists all do the same thing: they look for insights in data. In this course, you'll learn and practice a set of recipes, much as you would in a basic cooking class. The idea is the same: prepare your ingredients (data or food, and this step can be complicated for both), then follow a set of steps that produces what you want.

* For food recipes, this is a dish, or a set of dishes, ready to serve.
* For data science recipes, it's the answer to a question, or a set of answers, ready to examine and interpret further.

People new to data science need to learn the fundamental data science recipes as they apply to their fields of study, and that is the purpose of this course. By the end of term, you'll be well on your way to customizing the recipes to your data science context (defined by the set of questions you want to answer) and even creating recipes to match new contexts.

Many of the examples use this [faithful.csv](faithful.csv) file.

# Recipes

+ [Read a CSV file](#read_csv)
+ [Calculate a statistic about a single column](#calc_column_stat)
+ [Subset a `DataFrame` by extracting columns](#subset_by_col)
+ [Rename the columns in a `DataFrame`](#rename_columns)
+ [Subset a `DataFrame` by filtering rows that have a particular property](#filter_rows)

# Types of ingredients

The recipes each examine data and produce data. Here are common datatypes.

+ built-in types
    - `int`: values are integers
    - `float`: values are floating-point numbers, which are approximations to real numbers
    - `str`: values are strings of characters, used for sentences and for formatting data for display
    - `bool`: values are Boolean, `True` and `False`
    - `list`: a list of any values
    - `dict`: a dictionary, where each key has an associated value
+ `pandas` types
    - `DataFrame`
        * a 2-dimensional table containing data
        * the columns have labels
        * each column typically contains one type of information, usually numbers or strings
        * the rows have integer labels by default but can also have other labels
    - `Series`
        * a 1-dimensional series of information, usually extracted from a column in a `DataFrame`
        * can be passed as an argument to built-in functions like `len`, `max`, and `sum`
        * often used as a Boolean mask to filter rows from a `DataFrame`
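
As a rough illustration of the types listed above, the sketch below creates one value of each kind. The variable names and the three-row table are made up for this example; only the column labels come from `faithful.csv`.

```python
import pandas as pd

# Built-in types (all values here are illustrative)
count = 3                                        # int
length = 3.6                                     # float
label = 'Eruption wait (mins)'                   # str
is_long = length > 3.0                           # bool
waits = [79, 54, 74]                             # list
row = {'Index': 1, 'Eruption wait (mins)': 79}   # dict

# pandas types, built from a small hand-written table
small_df = pd.DataFrame({
    'Index': [1, 2, 3],
    'Eruption wait (mins)': waits,
})
wait_series = small_df[label]   # extracting one column gives a Series
mask = wait_series > 60         # comparing a Series gives a Boolean Series

print(type(small_df).__name__)     # DataFrame
print(type(wait_series).__name__)  # Series

# A Series works with built-in functions such as len, max, and sum.
print(len(wait_series), max(wait_series), sum(wait_series))  # 3 79 207
print(mask.tolist())               # [True, False, True]
```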

<a id='read_csv'></a>
# Recipe: Read a CSV file

## Example use cases

* You want to examine a dataset to discover its structure and content.
* To prepare for another recipe, you want to examine a dataset to verify that it has the structure and content required by that other recipe.

## Ingredients

* Data to examine: a CSV file.
* Data produced: a `DataFrame`

## Instructions

1. Upload the CSV file to JupyterHub in the same directory as your notebook.
1. Substitute the name of the file in the example code below, and change `faithful_df` to a more appropriate name.
1. `read_csv` reads the file and returns a `DataFrame` containing the data. Give a name to the `DataFrame` using an assignment statement so you can use it later.

## Example

Here is an example for a file named `faithful.csv`:

In [9]:
import pandas as pd
faithful_df = pd.read_csv('faithful.csv')

4. It's usually a good idea to display the first few lines of the `DataFrame` to make sure the data is as you expect.

In [10]:
faithful_df.head()

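|   | Index | Eruption length (mins) | Eruption wait (mins) |
|---|-------|------------------------|----------------------|
| 0 | 1     | 3.600                  | 79                   |
| 1 | 2     | 1.800                  | 54                   |
| 2 | 3     | 3.333                  | 74                   |
| 3 | 4     | 2.283                  | 62                   |
| 4 | 5     | 4.533                  | 85                   |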
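
Beyond `head`, a few attributes can help confirm that a dataset has the structure another recipe needs. The sketch below is one possible check, not the only one. It reads a stand-in for `faithful.csv` from a string so that it runs on its own; the stand-in assumes the file's header matches the three columns displayed above, which may not be exactly how the real file is laid out.

```python
import io
import pandas as pd

# Stand-in for faithful.csv: only the five rows displayed above (assumed layout).
csv_text = """Index,Eruption length (mins),Eruption wait (mins)
1,3.6,79
2,1.8,54
3,3.333,74
4,2.283,62
5,4.533,85
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (number of rows, number of columns)
print(list(sample_df.columns))  # the column labels
print(sample_df.dtypes)         # the type of data stored in each column
```

With a real file, you would pass the file name to `read_csv` as in the example above instead of using `io.StringIO`.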


<a id='calc_column_stat'></a>
# Recipe: Calculate a statistic about a single column

## Example use cases

+ You want to know the average of the values in a column so you can compare it to the average of another column.

## Ingredients

+ A `DataFrame` containing the column you are interested in.
+ The column label.

## Instructions

1. Make sure your `DataFrame` looks like you expect, using `head`.
1. Make a string for the name of the column.
1. Figure out your algorithm for calculating the statistic you want.
1. Extract the column from the `DataFrame`. This returns a `Series` containing the values from the column. Give a name to the data using an assignment statement.
1. Write your algorithm in Python, applying it to the column.

## Example

This code calculates the average of the waiting time between eruptions.

In [11]:
wait_col = faithful_df['Eruption wait (mins)']
wait_col.head()  # Notice how the output doesn't look like a `DataFrame`.

0    79
1    54
2    74
3    62
4    85
Name: Eruption wait (mins), dtype: int64

In [12]:
# Here's the algorithm.
total = sum(wait_col)
count = len(wait_col)
avg = total / count
avg

70.8970588235294
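
Once the hand-written algorithm makes sense, it can be checked against the methods that a `Series` already provides. The sketch below uses only the five waiting times displayed above, so its average (70.8) differs from the full-file result; the variable names are illustrative.

```python
import pandas as pd

# The five waiting times shown in the output above, not the whole file.
wait_col = pd.Series([79, 54, 74, 62, 85], name='Eruption wait (mins)')

# The algorithm from the recipe: total divided by count.
manual_avg = sum(wait_col) / len(wait_col)

# The same statistic using the Series method.
builtin_avg = wait_col.mean()

print(manual_avg, builtin_avg)  # 70.8 70.8

# Other common statistics follow the same pattern.
print(wait_col.min(), wait_col.max(), wait_col.median())  # 54 85 74.0
```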

<a id='subset_by_col'></a>
# Recipe: Subset a `DataFrame` by extracting columns

## Example use cases

+ The `DataFrame` has a lot of irrelevant columns and you want to keep only the relevant ones.

## Ingredients

+ A `DataFrame` containing the columns you are interested in.
+ The column labels for the columns you want to keep.

## Instructions

1. Make sure your `DataFrame` looks like you expect, using `head`.
1. Make a list of strings for the names of the columns you want to keep.
1. Extract the columns from the `DataFrame`. This returns a `DataFrame`. Give a name to the data using an assignment statement.

## Example

In [5]:
faithful_columns = ['Index', 'Eruption wait (mins)']
# The result has two columns, so it is a `DataFrame`, not a single column.
faithful_subset_df = faithful_df[faithful_columns]
faithful_subset_df.head()

|   | Index | Eruption wait (mins) |
|---|-------|----------------------|
| 0 | 1     | 79                   |
| 1 | 2     | 54                   |
| 2 | 3     | 74                   |
| 3 | 4     | 62                   |
| 4 | 5     | 85                   |
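
A common point of confusion is what comes back from the square brackets: a single string gives a `Series`, while a list of strings gives a `DataFrame`, even if the list has only one entry. The sketch below shows the difference on a small hand-built table containing the five rows displayed earlier; the variable names are illustrative.

```python
import pandas as pd

# The five rows shown earlier, typed in by hand so the example runs on its own.
small_df = pd.DataFrame({
    'Index': [1, 2, 3, 4, 5],
    'Eruption length (mins)': [3.6, 1.8, 3.333, 2.283, 4.533],
    'Eruption wait (mins)': [79, 54, 74, 62, 85],
})

one_col = small_df['Eruption wait (mins)']                  # string -> Series
one_col_df = small_df[['Eruption wait (mins)']]             # list   -> DataFrame
two_cols_df = small_df[['Index', 'Eruption wait (mins)']]   # list   -> DataFrame

print(type(one_col).__name__, one_col.shape)          # Series (5,)
print(type(one_col_df).__name__, one_col_df.shape)    # DataFrame (5, 1)
print(type(two_cols_df).__name__, two_cols_df.shape)  # DataFrame (5, 2)
```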


<a id='rename_columns'></a>
# Recipe: Rename the columns in a `DataFrame`

## Example use cases

+ The names are not easy to interpret
+ The names are too long

## Ingredients

+ A `DataFrame` containing the columns you want to rename.
+ The new column labels you want

## Instructions

1. Make sure your `DataFrame` looks like you expect, using `head`.
1. Make a dictionary mapping old column names to new column names.
1. Use the `rename` method to rename the columns. It returns a new `DataFrame`, so give the result a name using an assignment statement.

## Example

In [6]:
column_names = {
    # Notice we're only listing the name we want to change.
    'Eruption wait (mins)' : 'Wait time (mins)'
}

faithful_renamed_df = faithful_df.rename(columns=column_names)
faithful_renamed_df.head()

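|   | Index | Eruption length (mins) | Wait time (mins) |
|---|-------|------------------------|------------------|
| 0 | 1     | 3.600                  | 79               |
| 1 | 2     | 1.800                  | 54               |
| 2 | 3     | 3.333                  | 74               |
| 3 | 4     | 2.283                  | 62               |
| 4 | 5     | 4.533                  | 85               |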
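
Two details of `rename` are easy to miss: the original `DataFrame` is left unchanged, and by default a dictionary key that matches no column is silently ignored. The sketch below demonstrates both on a small hand-built table; the misspelled key `'Eruption wait'` is deliberate, and the variable names are illustrative.

```python
import pandas as pd

small_df = pd.DataFrame({
    'Index': [1, 2, 3],
    'Eruption length (mins)': [3.6, 1.8, 3.333],
    'Eruption wait (mins)': [79, 54, 74],
})

# rename returns a new DataFrame; small_df keeps its original labels.
renamed_df = small_df.rename(columns={'Eruption wait (mins)': 'Wait time (mins)'})
print(list(small_df.columns))
print(list(renamed_df.columns))

# A key that matches no column (note the missing "(mins)") changes nothing.
typo_names = {'Eruption wait': 'Wait time (mins)'}
unchanged_df = small_df.rename(columns=typo_names)
print(list(unchanged_df.columns) == list(small_df.columns))  # True

# Passing errors='raise' turns that silent mistake into a KeyError.
try:
    small_df.rename(columns=typo_names, errors='raise')
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)  # True
```

Displaying the result with `head`, as in the example above, is a simple way to catch a misspelled column name.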


<a id='filter_rows'></a>
# Recipe: Subset a `DataFrame` by filtering rows

## Example use cases

+ You want to find rows where a column has a particular value.

## Ingredients

+ A `DataFrame`.
+ The column label you are interested in.

## Instructions

1. Make sure your `DataFrame` looks like you expect it to.
1. Figure out which column you want to look at.
1. Extract your column as a `Series` and give it a name.
1. Write a Boolean expression comparing your column to the values you want. This produces a new `Series` where all the entries are `True` or `False`, based on your comparison. Give it a name.
1. Use your Boolean `Series` as a "mask" to filter rows from the `DataFrame`. In the same step, you can also extract a subset of the columns by providing a range of column labels. Both ends of the range are included.

## Example

This extracts rows where the eruption length was greater than 4 minutes.

In [7]:
eruption_length_col = faithful_df['Eruption length (mins)']
long_eruptions = eruption_length_col > 4.0
faithful_long_df = faithful_df.loc[long_eruptions, 'Index': 'Eruption length (mins)']
faithful_long_df.head()

|    | Index | Eruption length (mins) |
|----|-------|------------------------|
| 4  | 5     | 4.533                  |
| 6  | 7     | 4.700                  |
| 9  | 10    | 4.350                  |
| 12 | 13    | 4.200                  |
| 14 | 15    | 4.700                  |
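
The same recipe extends to more than one condition by combining Boolean `Series` with `&` (and) or `|` (or), with each comparison wrapped in parentheses. The sketch below is one way to do this on the first five rows of the data; the thresholds 3.0 and 80 are arbitrary values chosen for illustration, as are the variable names.

```python
import pandas as pd

# The first five rows of the data, typed in by hand.
small_df = pd.DataFrame({
    'Index': [1, 2, 3, 4, 5],
    'Eruption length (mins)': [3.6, 1.8, 3.333, 2.283, 4.533],
    'Eruption wait (mins)': [79, 54, 74, 62, 85],
})

length_col = small_df['Eruption length (mins)']
wait_col = small_df['Eruption wait (mins)']

# Each comparison produces a Boolean Series; & keeps rows where both are True.
long_and_quick = (length_col > 3.0) & (wait_col < 80)
print(long_and_quick.tolist())  # [True, False, True, False, False]

# The label range 'Index':'Eruption length (mins)' includes both end columns.
result_df = small_df.loc[long_and_quick, 'Index':'Eruption length (mins)']
print(result_df)
```

The filtered rows keep their original row labels (here 0 and 2), just as the rows in the example above keep labels 4, 6, 9, and so on.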
