## Data Types and Formats

The format of individual columns and rows will impact analysis performed on a dataset read into a pandas DataFrame. For example, you can’t perform mathematical calculations on a string (text formatted data). This might seem obvious; however, numeric values are sometimes read into pandas as strings. When you then try to perform calculations on that string-formatted numeric data, you get an error or an unexpected result.

### Types of Data
How information is stored in a DataFrame or a Python object affects what we can do with it and the outputs of calculations as well. There are two main types of data that we will explore in this lesson: numeric and text data types.

#### Numeric Data Types
Numeric data types include integers and floats. A floating-point number (known as a float) has a decimal point even if the fractional part is 0. For example: 1.13, 2.0, 1234.345. If we have a column that contains both integers and floating-point numbers, pandas will assign the entire column to the float data type so the decimal values are not lost.
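As a small sketch of this behavior, using made-up values rather than the survey data, we can compare a column of integers with a column that mixes integers and floats:

```python
import pandas as pd

# Made-up values for illustration only
only_ints = pd.Series([1, 2, 3])
mixed = pd.Series([1, 2.5, 3])  # one float among integers

print(only_ints.dtype)  # int64
print(mixed.dtype)      # float64
print(mixed.tolist())   # [1.0, 2.5, 3.0]
```

Every value in the mixed column is stored as a float, so the 2.5 keeps its fractional part.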

An integer never has a decimal point. If we converted 1.13 to an integer, it would be stored as 1. Similarly, 1234.345 would be stored as 1234. You will often see the data type int64 in pandas, which stands for 64-bit integer. The 64 refers to the memory (64 bits) allocated to store each value, which determines how large a number each “cell” can hold. Allocating space ahead of time allows computers to optimize storage and processing efficiency.
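To make the 64-bit limit concrete, the sketch below asks NumPy (the library pandas builds on) for the smallest and largest values an int64 can hold, and shows the fractional part being dropped when a float is converted to an integer:

```python
import numpy as np

# Range of values that fit in a 64-bit integer
limits = np.iinfo(np.int64)
print(limits.min)  # -9223372036854775808
print(limits.max)  # 9223372036854775807

# Converting a float to an integer drops the fractional part
small = int(1.13)
large = int(1234.345)
print(small)  # 1
print(large)  # 1234
```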

#### Text Data Type
The text data type is known as a string in Python, or object in pandas. Strings can contain letters, numbers, and other characters. For example, a string might be a word, a sentence, or several sentences. A string might also be a plot name like 'plot1'. A string can also consist entirely of numbers: '1234' could be stored as a string, as could '10.23'. However, strings that contain numbers cannot be used for mathematical operations!
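A minimal sketch of why this matters, reusing the two example strings from above. The variable names are only for illustration, and pd.to_numeric is one of several ways to convert a column of numeric strings:

```python
import pandas as pd

text_a = '1234'
text_b = '10.23'

# "+" on two strings joins them instead of adding them
joined = text_a + text_b
print(joined)  # 123410.23

# Convert to numbers first if we want arithmetic
total = int(text_a) + float(text_b)
print(total)

# The same idea for a whole column of numeric strings
numbers = pd.to_numeric(pd.Series([text_a, text_b]))
print(numbers.dtype)  # float64
```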

pandas and base Python use slightly different names for data types. The table below summarizes the differences:


| Pandas Type                         | Native Python Type          | Description                                                                                   |
|-------------------------------------|-----------------------------|-----------------------------------------------------------------------------------------------|
| `object`                            | `str`                       | The most general dtype. Assigned to columns of strings or mixed types (numbers and strings).  |
| `int64`                             | `int`                       | Integer numbers. 64 refers to the number of bits allocated to store each value.               |
| `float64`                           | `float`                     | Numbers with decimals. Also used when a column has numbers and NaNs.                          |
| `datetime64[ns]`, `timedelta64[ns]` | N/A (use `datetime` module) | Time-related values. Useful for time series analysis.                                         |


## Checking the Format of Our Data
Now that we’re armed with a basic understanding of numeric and text data types, let’s explore the format of our survey data. We’ll be working with the same surveys.csv dataset that we’ve used in previous lessons.

In [2]:
# Make sure pandas is loaded
import pandas as pd

# Note that pd.read_csv is used because we imported pandas as pd
surveys_df = pd.read_csv("../Files/surveys.csv")

In [3]:
type(surveys_df)

pandas.core.frame.DataFrame

In [4]:
surveys_df['sex'].dtype

dtype('O')

The dtype ‘O’ stands for “object”, which is the type pandas uses here for a column of strings (text).

In [5]:
surveys_df['record_id'].dtype

dtype('int64')

In [6]:
surveys_df.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

Note that most of the columns in our surveys_df data are of type int64. This means that they are 64-bit integers. The hindfoot_length and weight columns, however, are of type float64, which means they contain floating-point values (numbers with decimals). The species_id and sex columns are objects, which means they contain strings.
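Once we know the dtypes, we can also select columns by type. The sketch below uses a small made-up DataFrame (not the survey data) with the same kinds of columns, and the select_dtypes method to keep only the numeric ones:

```python
import pandas as pd

# Made-up table with an integer, a text, and a float column
example_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'sex': ['M', 'F', 'M'],
    'weight': [40.0, 48.5, 29.0],
})

# Keep only the columns whose dtype is numeric
numeric_cols = example_df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['record_id', 'weight']
```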

### Working With Integers and Floats


So we’ve learned that computers store numbers in one of two ways: as integers or as floating-point numbers (or floats). Integers are the numbers we usually count with. Floats have fractional parts (decimal places). Let’s next consider how the data type can impact mathematical operations on our data. Addition, subtraction, division and multiplication work on floats and integers as we’d expect.
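A few quick checks of the arithmetic described above, in plain Python. The numbers are arbitrary:

```python
total = 5 + 5       # int + int gives an int
difference = 24 - 4
product = 7 * 1.5   # int * float gives a float
quotient = 10 / 2   # "/" always gives a float in Python 3

print(total, type(total).__name__)        # 10 int
print(difference)                         # 20
print(product, type(product).__name__)    # 10.5 float
print(quotient, type(quotient).__name__)  # 5.0 float
```

Note that division returns a float even when the result is a whole number.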

In [7]:
# Convert a to an integer (the fractional part is dropped, not rounded)
a = 7.83
int(a)

7

In [8]:
# Convert b to a float
b = 7
float(b)

7.0

## Working With Our Survey Data

Getting back to our data, we can modify the format of values within our data, if we want. For instance, we could convert the record_id field to floating point values.

In [9]:
# Convert the record_id field from an integer to a float
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
surveys_df['record_id'].dtype

dtype('float64')
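Note that astype returns a new Series rather than changing the original in place, which is why the result is assigned back to the column above. A small sketch with made-up record IDs:

```python
import pandas as pd

ids = pd.Series([1, 2, 3], name='record_id')  # made-up values
ids_float = ids.astype('float64')

print(ids.dtype)        # int64, the original is unchanged
print(ids_float.dtype)  # float64

# Converting back works because no value has a fractional part
ids_back = ids_float.astype('int64')
print(ids_back.tolist())  # [1, 2, 3]
```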

### Missing Data Values - NaN

What happened in the last challenge activity? Converting a column that contains missing values to an integer type raises a casting error: pandas.errors.IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer (in older versions of pandas, this may be reported as a ValueError instead). If we look at the weight column in the surveys data, we notice that there are NaN (Not a Number) values. NaN values are undefined values that cannot be represented mathematically. pandas, for example, will read an empty cell in a CSV or Excel sheet as NaN. NaNs have some desirable properties: if we average the weight column without replacing our NaNs, pandas skips over those cells.
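Both behaviors can be reproduced on a tiny made-up Series. This is only a sketch: the values are invented, and ValueError is caught because IntCastingNaNError is a subclass of it in recent pandas versions:

```python
import numpy as np
import pandas as pd

weights = pd.Series([10.0, np.nan, 20.0])  # made-up values

# Casting a column with NaN to an integer type fails
try:
    weights.astype('int64')
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print(type(err).__name__)

# mean() skips NaN by default
mean_skipping = weights.mean()
print(mean_skipping)  # 15.0

# With skipna=False the NaN propagates into the result
mean_strict = weights.mean(skipna=False)
print(mean_strict)  # nan
```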

In [10]:
surveys_df['weight'].mean()

np.float64(42.672428212991356)

Dealing with missing data values is always a challenge. It’s sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.

For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. pandas will, by default, replace those missing values with NaN. However, it is good practice to get in the habit of intentionally marking cells that have no data with a no-data value! That way there are no questions in the future when you (or someone else) explore your data.
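If we know that a dataset uses a sentinel such as -9999, we can tell pandas about it when reading the file. The sketch below uses the na_values parameter of pd.read_csv on a small made-up CSV (held in a string instead of a file) that has one empty cell and one -9999:

```python
import io
import pandas as pd

# Made-up CSV text: one empty cell and one -9999 sentinel
csv_text = "plot_id,weight\n1,40\n2,\n3,-9999\n"

raw = pd.read_csv(io.StringIO(csv_text))
clean = pd.read_csv(io.StringIO(csv_text), na_values=[-9999])

n_raw = raw['weight'].isna().sum()
n_clean = clean['weight'].isna().sum()
print(n_raw)    # 1, only the empty cell is NaN
print(n_clean)  # 2, the -9999 is now NaN as well

print(raw['weight'].mean())    # -4979.5, distorted by the sentinel
print(clean['weight'].mean())  # 40.0
```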

### Where Are the NaNs?
Let’s explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight > 0 (i.e., select meaningful weight values):

In [12]:
len(surveys_df[surveys_df['weight'].isna()])

3266

In [None]:

# How many rows have weight values?
len(surveys_df[surveys_df['weight'] > 0])

32283
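The same counts can be obtained without building a subset first, by summing the boolean masks directly. A sketch on made-up values:

```python
import numpy as np
import pandas as pd

weights = pd.Series([12.0, np.nan, 7.0, np.nan, 30.0])  # made-up values

n_missing = weights.isna().sum()   # True counts as 1
n_present = weights.notna().sum()
n_positive = (weights > 0).sum()   # NaN > 0 evaluates to False

print(n_missing, n_present, n_positive)  # 2 3 3
```

Because any comparison with NaN is False, selecting weight > 0 also leaves out the rows where weight is missing.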

We can replace all NaN values with zeroes using the .fillna() method (after making a copy of the data so we don’t lose our work):

In [13]:
df1 = surveys_df.copy()
# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

However, NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.

In [14]:
df1['weight'].mean()

np.float64(38.751976145601844)

We can fill NaN values with any value that we choose. The code below fills all NaN values with the mean of all weight values.

In [16]:
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
df1['weight'].mean()

np.float64(42.672428212991356)
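The effect of each choice is easy to verify by hand on a tiny made-up Series with two known values and two NaNs:

```python
import numpy as np
import pandas as pd

weights = pd.Series([10.0, np.nan, 20.0, np.nan])  # made-up values

mean_skipped = weights.mean()                       # NaN ignored: (10 + 20) / 2
mean_zero = weights.fillna(0).mean()                # zeros counted: (10 + 0 + 20 + 0) / 4
mean_filled = weights.fillna(weights.mean()).mean() # NaN replaced by 15.0

print(mean_skipped)  # 15.0
print(mean_zero)     # 7.5
print(mean_filled)   # 15.0
```

Filling with zeros pulls the mean down, while filling with the mean leaves it unchanged.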

### Writing Out Data to CSV

We’ve learned about manipulating data to get desired outputs. But we’ve also discussed keeping data that has been manipulated separate from our raw data. Something we might be interested in doing is working with only the rows that have full data. First, let’s reload the data so we’re not mixing up all of our previous manipulations.

In [18]:
surveys_df = pd.read_csv("../Files/surveys.csv")

Next, let’s drop all the rows that contain missing values. We will use the dropna method. By default, dropna removes any row that has missing data in even one column.

In [19]:
df_na = surveys_df.dropna()

If you now type df_na, you should observe that the resulting DataFrame has 30676 rows and 9 columns, much smaller than the 35549 row original.
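If dropping every incomplete row is too aggressive, dropna also accepts a subset parameter that limits the check to particular columns. A sketch on a made-up three-row DataFrame:

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one missing species_id, one missing weight
df = pd.DataFrame({
    'species_id': ['NL', None, 'DM'],
    'weight': [40.0, 35.0, np.nan],
})

complete_rows = df.dropna()                  # drop rows with any missing value
with_weight = df.dropna(subset=['weight'])   # only require weight to be present

print(len(df), len(complete_rows), len(with_weight))  # 3 1 2
```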

We can now use the to_csv method to export a DataFrame in CSV format. The path we pass is interpreted relative to the current working directory, so a bare filename such as 'out.csv' is saved in the current working directory, and adding a folder name and a slash before the filename saves it in that folder: df.to_csv('foldername/out.csv'). The code below writes the file to the same ../Files folder that we read surveys.csv from. We use index=False so that pandas doesn’t include the index number for each line.

In [None]:
# Write DataFrame to CSV
df_na.to_csv('../Files/surveys_complete.csv', index=False)
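To check what index=False does, the sketch below writes a small made-up DataFrame to a temporary folder and reads it back. The file name out.csv and the temporary folder are placeholders for illustration only:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'record_id': [1, 2], 'weight': [40.0, 48.0]})  # made-up values

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'out.csv')  # placeholder file name
    df.to_csv(out_path, index=False)

    # Look at the header line and reload the file
    with open(out_path) as handle:
        header = handle.readline().strip()
    reloaded = pd.read_csv(out_path)

print(header)               # record_id,weight
print(reloaded.equals(df))  # True
```

With index=False the header contains only our column names, so reading the file back gives the same table without an extra index column.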