# Data Analysis and Visualization in Python
## Data Formatting
Questions
* What are the different data types in Pandas?
* How do data types affect descriptive statistics?
* How can I manage undefined (null) values?
* How can I save a dataframe to a file?

Objectives
* Manipulate the data types.
* Create a copy of a DataFrame.
* Transform or remove null values.
* Write modified data to a CSV file.

## Loading our data

In [None]:
# First make sure pandas is loaded
import pandas as pd

# Read in the survey csv
surveys_df = pd.read_csv("../data/surveys.csv")

## Types of Data
### Checking the format of our data

In [None]:
# Getting the data types of all columns
surveys_df.dtypes

In [None]:
# Getting the data type of a single column
surveys_df['month'].dtype

Native Python Type | Pandas Type | Description
:-----------------:|:-----------:|:-----------
`str`              | `object`    | The most general dtype. It is assigned to a column that contains mixed types (numbers and strings).
`int`              | `int64`     | 64-bit integer.
`float`            | `float64`   | Numbers with a decimal part. If a column contains numbers and NaNs (see below), pandas will default to `float64`.
N/A                | `datetime64`| Values meant to hold time data.
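
The table above can be checked on a small example. This is a minimal sketch on a made-up DataFrame (`demo_df` and its column names are hypothetical, not part of `surveys.csv`) showing which dtype pandas infers for each kind of column.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not taken from surveys.csv
demo_df = pd.DataFrame({
    "whole": [1, 2, 3],              # integers only
    "with_nan": [1, np.nan, 3],      # integers plus a missing value
    "mixed": [1, "two", 3.0],        # numbers and strings together
    "when": pd.to_datetime(["2001-01-15", "2001-02-15", "2001-03-15"]),
})

# One dtype per column
print(demo_df.dtypes)
```

The `with_nan` column holds only whole numbers, yet the missing value is enough to make pandas fall back to `float64`.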

### Working With Our Survey Data

In [None]:
# Summary of descriptive statistics
surveys_df.describe()

In [None]:
# Convert month numbers to nominal values
surveys_df['month'] = surveys_df['month'].astype('str')
surveys_df['month'].dtype

In [None]:
# Descriptive statistics on a categorical variable
surveys_df['month'].describe()

In [None]:
# Listing all different months
surveys_df['month'].unique()

In [None]:
# Listing all different years
surveys_df['year'].unique()

### Demo - Calculating Statistics

`1`. What happens if we try to convert `weight` values to `int64` integers?

In [None]:
try:
    surveys_df['weight'].astype('int64')
except ValueError as error:
    # pandas raises a ValueError when NaN cannot be cast to an integer
    print(f'The problem: {error}')
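
The same failure can be reproduced without the survey file. The sketch below uses a hypothetical `weights` Series with one missing value. The nullable `'Int64'` dtype at the end is not used in this lesson; it is shown only as one possible alternative, assuming a pandas version that provides it.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing measurement
weights = pd.Series([40.0, np.nan, 48.0], name="weight")

try:
    weights.astype("int64")
except ValueError as error:
    # NaN has no integer representation in a plain int64 column
    print(f"The problem: {error}")

# Assumption: the nullable 'Int64' dtype (capital I) is available
nullable_weights = weights.astype("Int64")
print(nullable_weights.dtype, nullable_weights.isna().sum())
```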

`2`. Try converting the `plot_id` column to the native Python `float` data type.

In [None]:
surveys_df['plot_id'] = surveys_df['plot_id'].astype('float')
surveys_df['plot_id'].dtype

## Selecting and cleaning undefined values

In [None]:
# For each value, is it undefined?
surveys_df.isnull()

In [None]:
# Select rows with at least one undefined value
nan_mask = surveys_df.isnull().any(axis='columns')
surveys_df[nan_mask]
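
To see what the mask contains, here is a small sketch on a hypothetical three-row DataFrame (`demo_df` is made up for illustration). `isnull()` gives one boolean per cell, and `any(axis='columns')` reduces each row to a single boolean.

```python
import numpy as np
import pandas as pd

# Hypothetical records with a missing sex and a missing weight
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "sex": ["F", None, "M"],
    "weight": [40.0, 48.0, np.nan],
})

# Same shape as demo_df, True where a cell is undefined
cell_is_null = demo_df.isnull()

# One boolean per row: True if any column in that row is undefined
row_has_null = cell_is_null.any(axis="columns")

print(demo_df[row_has_null])
```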

In [None]:
# What does this do?
nan_mask = surveys_df['weight'].isnull()
one_selection = surveys_df[nan_mask]
one_selection.groupby('species_id')['record_id'].count()
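
One way to reason about the cell above is to run the same three steps on a tiny table. The sketch below uses hypothetical records (`demo_df` is not the survey data) and counts, per species, how many records have no weight.

```python
import numpy as np
import pandas as pd

# Hypothetical records that mimic the survey columns
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species_id": ["DM", "DM", "PF", "PF", "DM"],
    "weight": [40.0, np.nan, np.nan, 7.0, np.nan],
})

# True where the weight is missing
nan_mask = demo_df["weight"].isnull()

# Keep the rows without a weight, then count them per species
missing_per_species = demo_df[nan_mask].groupby("species_id")["record_id"].count()
print(missing_per_species)
```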

### Getting Rid of the NaNs

In [None]:
# Before the cleanup
print(surveys_df['weight'].count(), surveys_df['weight'].mean())

In [None]:
# Create a copy to avoid modifying the original object
copy_surveys_df = surveys_df.copy()
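
The difference between a copy and a plain assignment can be sketched on a hypothetical two-row DataFrame (the names `original`, `same_object` and `independent` are made up for this illustration).

```python
import pandas as pd

original = pd.DataFrame({"weight": [40.0, 48.0]})

same_object = original          # a second name for the same DataFrame
independent = original.copy()   # a new DataFrame with its own data

independent.loc[0, "weight"] = 0.0    # does not touch `original`
same_object.loc[1, "weight"] = 99.0   # changes `original` as well

print(original)
print(independent)
```

Only the change made through `same_object` shows up in `original`, which is why the lesson works on `copy_surveys_df` from here on.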

In [None]:
# Fill missing weights with the mean so that the overall mean stays the same
mean_weight = copy_surveys_df['weight'].mean()
copy_surveys_df['weight'] = copy_surveys_df['weight'].fillna(mean_weight)

In [None]:
# After the cleanup
print(copy_surveys_df['weight'].count(), copy_surveys_df['weight'].mean())

In [None]:
# Can we now convert weight values to integers?
copy_surveys_df['weight'] = copy_surveys_df['weight'].astype('int64')
copy_surveys_df['weight'].mean()
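
The mean changes slightly after the conversion because `astype('int64')` drops the decimal part instead of rounding. A minimal sketch on hypothetical values (not the survey weights), with `round()` shown as one possible alternative:

```python
import numpy as np
import pandas as pd

# Hypothetical weights with one missing value
weights = pd.Series([40.0, np.nan, 47.4])

mean_weight = weights.mean()           # NaN is skipped when computing the mean
filled = weights.fillna(mean_weight)   # the gap now holds the mean, about 43.7

truncated = filled.astype("int64")         # decimals are dropped
rounded = filled.round().astype("int64")   # nearest integer instead

print(truncated.tolist(), truncated.mean())
print(rounded.tolist(), rounded.mean())
```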

### Exercise - Data Cleanup
In the `sex` column of `copy_surveys_df`:
* Replace undefined values with `'F|M'`
* Any value not equal to `'F'`, `'M'` or `'F|M'` is
  considered invalid and must be replaced with `'F|M'`

In [None]:
# Create invalid data
copy_surveys_df.loc[::123, 'sex'] = 'NA'

# Replace undefined values
copy_surveys_df['sex'] = copy_surveys_df['sex'].fillna('F|M')

# Replace invalid values
invalid_rows = ~copy_surveys_df['sex'].isin(['F', 'F|M', 'M'])
copy_surveys_df.loc[invalid_rows, 'sex'] = 'F|M'

copy_surveys_df['sex'].unique()

### Writing Out Data to CSV

In [None]:
# Only keep (complete) records that have no NA
df_no_na = copy_surveys_df.dropna()
df_no_na

In [None]:
# Save the cleaned DataFrame to a CSV file
df_no_na.to_csv('surveys_complete.csv', index=False)
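
A quick way to check a saved file is to read it back. The sketch below uses a hypothetical DataFrame and a temporary directory (the file name `demo_complete.csv` is made up) so that it does not overwrite `surveys_complete.csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical records: only the first one is complete
demo_df = pd.DataFrame({
    "record_id": [1, 2, 3],
    "sex": ["F", None, "M"],
    "weight": [40.0, 48.0, np.nan],
})

complete = demo_df.dropna()

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo_complete.csv")
    # index=False keeps the row index out of the file
    complete.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```

With `index=False`, the reloaded table has the same columns as the one that was saved, with no extra unnamed index column.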

## Technical Summary
* **Managing data types**
    * For a **DataFrame**:
        * Attribute: `dtypes`
    * For a **Series** (column):
        * Attribute: `dtype`
        * Method: `astype()`
* **Cleaning data**
    * `df.copy()`
    * `isna()`, `isnull()` (the second is an alias of the first)
    * `notna()`, `notnull()`  (the second is an alias of the first)
    * `df[name] = df[name].fillna(value)`
    * `df.dropna()`
* **Saving a DataFrame**
    * `df.to_csv(csv_filename, index=False)`