# Data Analysis with Python
## Data Formatting
Questions
* How can I manage undefined (null) values?
* How can I save a dataframe to a file?

Objectives
* Create a copy of a DataFrame.
* Transform or remove null values.
* Write modified data to a CSV file.

## Loading our data

In [None]:
# First make sure pandas is loaded
import pandas as pd

# Read in the survey csv
surveys_df = pd.read_csv('../data/surveys.csv')

## Selecting and cleaning undefined values

In [None]:
# For each value: is it undefined (NaN)?
surveys_df.isna()

In [None]:
# Select rows with at least one undefined value
nan_mask = surveys_df.isna().any(axis='columns')
surveys_df[nan_mask]
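
The same boolean mask can also be summed to count undefined values. The sketch below uses a small made-up DataFrame (`toy_df` is hypothetical and is not taken from `surveys.csv`), so it runs without the data file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from surveys.csv
toy_df = pd.DataFrame({
    'species_id': ['DM', 'DM', 'NL', None],
    'weight': [40.0, np.nan, 150.0, np.nan],
})

# Number of undefined values in each column
# (True counts as 1 when summed)
na_per_column = toy_df.isna().sum()
print(na_per_column)

# Number of rows with at least one undefined value
rows_with_na = toy_df.isna().any(axis='columns').sum()
print(rows_with_na)
```

Applied to `surveys_df`, the same two expressions should show which columns contain most of the undefined values.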

In [None]:
# What does this do?
one_selection = surveys_df[surveys_df['weight'].isna()]
one_selection.groupby('species_id')['record_id'].count()

### Getting Rid of the NaNs

In [None]:
def state_by_species(df, column:str):
    '''
    Prints the count, the mean value and the standard deviation of a
    given column for each species ID from DM to NL.
    - df:     DataFrame object
    - column: name of one column of df
    '''
    print(
        df.groupby('species_id')[column].aggregate(
            ['count', 'mean', 'std']
        ).loc['DM':'NL'],
        '\n\nTotal count:', df[column].count()
    )

# Before the cleanup
state_by_species(surveys_df, 'weight')

In [None]:
# Create a copy to avoid modifying the original object
copy_surveys = surveys_df.copy()
copy_surveys.head()

In [None]:
# Mean weight per species, repeated for every row of that species
# (the result has the same index as copy_surveys)
copy_surveys.groupby('species_id')['weight'].transform('mean')

In [None]:
# Replace undefined weights with the mean weight of their species
copy_surveys['weight'] = copy_surveys['weight'].fillna(
    copy_surveys.groupby('species_id')['weight'].transform('mean')
)
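
Why `transform('mean')` rather than `aggregate('mean')`? A minimal sketch on made-up data (`toy_df` is hypothetical): the aggregate returns one value per species, whereas the transform returns one value per row, which is what `fillna()` needs in order to align on the index.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from surveys.csv
toy_df = pd.DataFrame({
    'species_id': ['DM', 'DM', 'DM', 'NL', 'NL'],
    'weight': [40.0, np.nan, 44.0, 150.0, np.nan],
})

grouped = toy_df.groupby('species_id')['weight']

# One mean per species: indexed by species_id
group_means = grouped.aggregate('mean')

# One mean per row: same index as toy_df
row_means = grouped.transform('mean')

# fillna() aligns on the index, so each NaN gets its own species mean
filled = toy_df['weight'].fillna(row_means)
print(filled.tolist())
```

Note that a species with no known weight at all has an undefined mean, so its rows stay undefined after `fillna()`.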

In [None]:
# Before and after the cleanup
state_by_species(surveys_df, 'weight')
print()  # Print an empty line
state_by_species(copy_surveys, 'weight')

### Exercise - Data Cleanup
Repeat the same steps to fill in the undefined
values, but for the `'hindfoot_length'` column.
However, this time we want to calculate the averages
according to `'species_id'` and `'sex'`.

The `state_by_species_and_sex()` function is provided
to display statistics before and after cleaning.

(5 min.)

In [None]:
def state_by_species_and_sex(df, column:str):
    '''
    Prints the count, the mean value and the standard deviation of a
    given column for the first 5 species IDs and for each sex.
    - df:     DataFrame object
    - column: name of one column of df
    '''
    print(
        df.groupby(
            ['species_id', 'sex']
        )[column].aggregate(
            ['count', 'mean', 'std']
        ).unstack().head(),
        '\n\nTotal count:', df[column].count()
    )

In [None]:
column = ###
state_by_species_and_sex(copy_surveys, column)
print()  # Print an empty line

copy_surveys[column] = copy_surveys[column].###(
    copy_surveys.groupby(
        ###
    )[column].###('mean')
)

state_by_species_and_sex(copy_surveys, column)

### Writing Out Data to CSV

In [None]:
# Only keep (complete) records that have no NA
df_no_na = copy_surveys.dropna()
df_no_na
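
By default, `dropna()` removes a row as soon as any column is undefined. When only some columns matter, the `subset` parameter restricts the check to those columns. A small sketch on made-up data (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from surveys.csv
toy_df = pd.DataFrame({
    'species_id': ['DM', 'NL', 'DM'],
    'sex': ['M', None, 'F'],
    'weight': [40.0, 150.0, np.nan],
})

# Keep only rows with no undefined value in any column
complete = toy_df.dropna()

# Keep rows where 'weight' is defined, whatever the other columns hold
weight_known = toy_df.dropna(subset=['weight'])

print(len(toy_df), len(complete), len(weight_known))
```

Both calls return a new DataFrame and leave `toy_df` unchanged.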

In [None]:
# Save the cleaned DataFrame to a CSV file
df_no_na.to_csv('surveys_complete.csv', index=False)
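
A quick way to check a saved file is to read it back. The sketch below assumes a made-up DataFrame and a temporary directory (`toy_df` and `toy_complete.csv` are hypothetical names), so it does not touch `surveys_complete.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data, NOT taken from surveys.csv
toy_df = pd.DataFrame({
    'record_id': [1, 2, 3],
    'species_id': ['DM', 'NL', 'DM'],
    'weight': [40.0, 150.0, 44.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy_complete.csv')

    # index=False: do not write the row index as an extra column
    toy_df.to_csv(csv_path, index=False)

    # Read the file back while the temporary directory still exists
    reloaded = pd.read_csv(csv_path)

# Compare column names and values of the original and reloaded data
same_columns = list(reloaded.columns) == list(toy_df.columns)
same_values = reloaded.to_dict('list') == toy_df.to_dict('list')
print(same_columns, same_values)
```

Without `index=False`, the index is written as an additional unnamed column, which `read_csv()` then loads as `Unnamed: 0`.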

## Technical Summary
* **Group statistics aligned with the index of** `df`
    * `df.groupby(keys)[column].transform(function)`
* **Cleaning data**
    * `df.copy()`
    * `isna()`, `notna()`
    * `df[column] = df[column].fillna(value)`
* **Saving a DataFrame**
    * `df.to_csv(csv_filename, index=False)`