# Lesson 3: Data Types and Formats

These are some notes for [Data Carpentry](http://www.datacarpentry.org)'s tutorial [*Data Analysis and Visualization in Python*](http://www.datacarpentry.org/python-ecology-lesson/).  The web page for this lesson can be found [here](http://www.datacarpentry.org/python-ecology-lesson/03-data-types-and-format).

## Goal

> Learn how to deal with different data types

## Numerical Data Types

There are two main types of numerical data in Python:

* **Floating Point** number (or **`float`**): a number which **can** have a **fractional part**, i.e. something **after the decimal** point:
    * `1.0, 3.25, -7536.3`;
* **Integer** number (or **`int`**): a number which **cannot** have a **fractional part**, i.e. **nothing** after the decimal point:
    * `1, -2, 1177325`.

Note:

* Sometimes you'll see **32** or **64** in the name of the numeric data type.
    * That number says how many **bits** of **memory** the computer allocates for each value, which limits the **range** and **precision** it can hold.
* If Pandas sees **one** column entry is floating point, it'll assign **all entries** to the floating point data type to **avoid losing precision**.
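
A minimal sketch of both notes, using small made-up Series (the names `ints` and `mixed` are only for illustration, not part of the survey data):

```python
import numpy as np
import pandas as pd

# The number in the type name is a size in bits; itemsize reports bytes (8 bits each).
bytes_32 = np.dtype('int32').itemsize   # 4
bytes_64 = np.dtype('int64').itemsize   # 8

# All whole numbers: pandas picks an integer dtype.
ints = pd.Series([1, 2, 3])

# One fractional entry is enough to make the whole Series floating point.
mixed = pd.Series([1, 2, 3.5])

print(bytes_32, bytes_64)
print(ints.dtype, mixed.dtype)
```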

## Character Data Types

We also have the following important Python data type:

* **`string`** (written **`str`** in Python code): a data type which can hold **characters** of **any sort**.
    * By "characters" we mean **letters**, **punctuation**, and **spaces**, like
        * `'w', 'NASA', 'a ska-style drum solo'`.
    * But we also mean **digits** when they're used as **text**:
        * `'Dec. 7, 1941', 'my 12th birthday', '3.14'`.
        * You **cannot do math** with strings, even when they look like numbers. Convert them to `int` or `float` first.
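
To see what "digits used as text" means in practice, here is a small sketch in plain Python. The variable names are made up for illustration:

```python
text_number = '2.5'

# '+' between two strings glues them together; it does not add them.
joined = text_number + '1'        # '2.51'

# Mixing a string and a number is an error.
try:
    text_number + 1
except TypeError as err:
    message = str(err)

# Convert to a numeric type first, then do the math.
total = float(text_number) + 1    # 3.5

print(joined, total)
print(message)
```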

Pandas and base Python use **slightly different names** for data types, as the table below shows:

| **Pandas Type** | **Native Python Type** | **Description** |
| :-- | :-- | :-- |
| `object` | `string` | The most general dtype. Assigned to a column that holds text or mixed types (numbers and strings). |
| `int64` | `int` | Whole numbers. 64 refers to the number of bits of memory allocated to hold each value. |
| `float64` | `float` | Numbers with a decimal part. If a column contains numbers and NaNs (see below), pandas will default to float64, because `NaN` is itself a floating point value. |
| `datetime64[ns]`, `timedelta64[ns]` | N/A (but see the [datetime](http://doc.python.org/2/library/datetime.html) module in Python's standard library) | Values meant to hold time data. Look into these for time series experiments. |
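
The rows of the table can be checked on a tiny, made-up DataFrame (`example_df` and its column names are assumptions for illustration only). The exact time resolution shown for the datetime column may depend on your pandas version:

```python
import pandas as pd

example_df = pd.DataFrame({
    'whole': [1, 2, 3],                      # only whole numbers
    'mixed': [1, 'two', 3.0],                # numbers and text together
    'when': pd.to_datetime(['2001-01-01',    # text parsed into timestamps
                            '2001-01-02',
                            '2001-01-03']),
})

# One dtype per column, as pandas inferred it.
print(example_df.dtypes)
```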

## Checking Data Formats

Let's check our data...

In [1]:
import pandas as pd

In [2]:
# note that pd.read_csv is used because we imported pandas as pd
#surveys_df = pd.read_csv("https://ndownloader.figshare.com/files/2292172")
surveys_df = pd.read_csv('data/surveys.csv')

In [3]:
type(surveys_df)

pandas.core.frame.DataFrame

What data type is a single column?

In [4]:
surveys_df['sex'].dtype

dtype('O')

`'O'` stands for the **`object`** dtype, which here means the column holds **`string`s**.

Try another column...

In [5]:
surveys_df['record_id'].dtype

dtype('int64')

Or simply **all columns** at once...

In [6]:
surveys_df.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

## Working with `int`s & `float`s

In [7]:
5 + 5

10

In [8]:
24 - 7

17

Basic math operations work as usual...

But be careful with division: in Python 3, `/` always returns a `float`, while `//` rounds **down** to a whole number...

In [9]:
# division of integers in Python 3
5/9

0.5555555555555556

In [10]:
# so-called "integer division" in Python 3
5//9

0

In [11]:
13/6

2.1666666666666665

In [12]:
13//6

2
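
A short sketch of how `//` behaves in a few more cases. It rounds towards negative infinity rather than simply chopping off the fraction, which only shows up with negative numbers:

```python
quotient = 13 // 6           # 2
remainder = 13 % 6           # 1, what is left over after integer division

# divmod gives both at once.
both = divmod(13, 6)         # (2, 1)

# With a negative number, // rounds down (towards negative infinity).
negative = -13 // 6          # -3, not -2

# If either operand is a float, the result is a float that holds a whole number.
float_floor = 13.0 // 6      # 2.0

print(quotient, remainder, both, negative, float_floor)
```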

Convert between types...

In [13]:
# convert a to an integer (int() drops the fractional part; it does not round)
a = 7.83
int(a)

7

In [14]:
# convert to float
b = 7
float(b)

7.0
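
Since `int()` and rounding are easy to mix up, here is a small comparison using only built-ins and the standard `math` module:

```python
import math

truncated_pos = int(7.83)         # 7, fraction dropped
truncated_neg = int(-7.83)        # -7, truncates towards zero
floored_neg = math.floor(-7.83)   # -8, rounds down

rounded = round(7.83)             # 8, nearest whole number
# Exact halves go to the nearest even number in Python 3.
rounded_half = round(2.5)         # 2

print(truncated_pos, truncated_neg, floored_neg, rounded, rounded_half)
```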

## Working with Survey Data

### Converting Types

Let's try converting an **entire column**...

In [15]:
# convert the record_id field from an integer to a float
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
surveys_df['record_id'].dtype

dtype('float64')

Try the weight values...

In [16]:
surveys_df['weight'].astype('int')

ValueError: Cannot convert NA to integer

**Problem:** the converter encountered `NaN` (**N**ot **a** **N**umber) and doesn't know how to convert that to an **integer**.
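
Two possible ways around this, sketched on a made-up Series rather than the survey data. Which one is appropriate depends on what the missing values mean. The second option assumes a pandas version that supports the nullable `'Int64'` dtype (note the capital "I"):

```python
import numpy as np
import pandas as pd

weights = pd.Series([40.0, np.nan, 48.0])

# Option 1: drop the missing entries, then convert what is left.
as_int = weights.dropna().astype('int64')

# Option 2: use pandas' nullable integer dtype, which keeps the gap as <NA>.
nullable = weights.astype('Int64')

print(as_int)
print(nullable)
```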

`NaN`s can arise from

* data that was **uninterpretable**, or simply
* **empty cells** in a spreadsheet.
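
The empty-cell case is easy to reproduce. This sketch reads a tiny CSV from a string (made-up data; `io.StringIO` just stands in for a file on disk):

```python
import io
import pandas as pd

# The second data row has nothing after the comma: an empty cell.
csv_text = "record_id,weight\n1,40\n2,\n3,48\n"
tiny_df = pd.read_csv(io.StringIO(csv_text))

# The empty cell becomes NaN, and the whole weight column becomes float64.
print(tiny_df)
print(tiny_df.dtypes)
```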

**NB:** if we **average without replacing** `NaN`s, Pandas skips those values.

In [17]:
surveys_df['weight'].mean()

42.672428212991356
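
Skipping is only the default. A sketch on a made-up Series shows what happens when we ask pandas not to skip, via the `skipna` parameter of `mean`:

```python
import numpy as np
import pandas as pd

weights = pd.Series([40.0, np.nan, 48.0])

# Default: NaN is ignored, so this is the mean of 40 and 48.
default_mean = weights.mean()               # 44.0

# With skipna=False a single NaN makes the result NaN.
strict_mean = weights.mean(skipna=False)    # nan

print(default_mean, strict_mean)
```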

### Missing Values

The **big issue:**

> what does **missing data** mean?

Does it mean

* **unreadable** by the computer?
* **not entered** by the person collecting the data?

When data is missing, should we

* leave the cell **blank**?
* insert a **`0`**?
* insert **`-9999`**?  (This is a common practice in remote sensing.)
    * This could make for some **odd averaging**...
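
To make the "odd averaging" concrete, here is a sketch with made-up readings where `-9999` marks a missing value. Turning the sentinel into `NaN` with `replace` is one option among several:

```python
import numpy as np
import pandas as pd

readings = pd.Series([40.0, -9999, 48.0])

# Averaged as-is, the sentinel is treated as a real measurement.
naive_mean = readings.mean()

# Turn the sentinel into NaN so pandas skips it.
cleaned = readings.replace(-9999, np.nan)
clean_mean = cleaned.mean()                 # 44.0

print(naive_mean, clean_mean)
```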

**Good practice:**

> Put in something that clearly **screams "Data is missing!"**

### Finding `NaN`s

Find out **how many `NaN`s** are in the **weight** column...

In [18]:
len(surveys_df[pd.isnull(surveys_df.weight)])

3266

So how many rows **do have** weight values?

In [20]:
len(surveys_df[surveys_df.weight > 0])

32283
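
The `weight > 0` test counts the rows with values only when every recorded weight is positive, because comparing `NaN` with anything gives `False`. A sketch of more direct counts, on a made-up Series:

```python
import numpy as np
import pandas as pd

weights = pd.Series([40.0, np.nan, 48.0, np.nan])

n_missing = weights.isnull().sum()   # number of NaN entries
n_present = weights.count()          # count() ignores NaN
n_total = len(weights)               # every row, missing or not

print(n_missing, n_present, n_total)
```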

**Substitute** `NaN`s with `0`.

In [21]:
df1 = surveys_df.copy()
# fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

**NB:** Note the **effect on the average**.

As `NaN`s these entries were skipped.  Now they're **counted**.

In [22]:
df1['weight'].mean()

38.751976145601844

We can, in principle, **substitute anything**.

Like the average over all values, as below...

In [23]:
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
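
A side-by-side sketch of the two substitutions on a made-up Series. Filling with the mean keeps the average where it was, but it also shrinks the spread of the data, so neither choice is neutral:

```python
import numpy as np
import pandas as pd

weights = pd.Series([40.0, np.nan, 48.0])

zero_filled = weights.fillna(0)
mean_filled = weights.fillna(weights.mean())

# Filling with 0 drags the average down.
zero_mean = zero_filled.mean()

# Filling with the mean leaves the average unchanged.
same_mean = mean_filled.mean()              # 44.0

# But the standard deviation gets smaller.
original_std = weights.std()
filled_std = mean_filled.std()

print(zero_mean, same_mean)
print(original_std, filled_std)
```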

**Upshot:**

> be careful **how you handle missing data**, since that can **affect analysis**.