# Data Types and Formats

## Before we start

Want to email me? My address is: [goodbody@usc.edu](mailto:goodbody@usc.edu). And I'm on Twitter [@doctornerdis](https://twitter.com/doctornerdis).

## Review: Subsetting using a mask

In [1]:
import pandas as pd

In [2]:
surveys_df = pd.read_csv('data/surveys.csv')

In [3]:
surveys_df.head()

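| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
| 1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
| 2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
| 3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
| 4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |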


**Challenge:** Create a mask to get every observation from **1978** that is species **DM**.

In [4]:
mask = (surveys_df.year == 1978) & (surveys_df.species_id == 'DM')
surveys_df[mask].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 389 entries, 504 to 1550
Data columns (total 9 columns):
record_id          389 non-null int64
month              389 non-null int64
day                389 non-null int64
year               389 non-null int64
plot_id            389 non-null int64
species_id         389 non-null object
sex                381 non-null object
hindfoot_length    360 non-null float64
weight             352 non-null float64
dtypes: float64(2), int64(5), object(2)
memory usage: 30.4+ KB
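
The mask in the challenge combines two conditions. As a minimal sketch of the same pattern, here it is on a tiny made-up table (`toy_df` and its values are hypothetical, not rows from `surveys.csv`):

```python
import pandas as pd

# Hypothetical toy data, not taken from surveys.csv
toy_df = pd.DataFrame({
    "year": [1977, 1978, 1978, 1978],
    "species_id": ["DM", "DM", "NL", "DM"],
})

# Each comparison produces a boolean Series; & combines them row by row.
# The parentheses matter because & binds more tightly than ==.
toy_mask = (toy_df["year"] == 1978) & (toy_df["species_id"] == "DM")

# Indexing with the mask keeps only the rows where it is True.
selected = toy_df[toy_mask]
print(selected)
```

Only the rows where both conditions hold survive, and they keep their original index labels, which is why the challenge output above reports index values such as 504 rather than starting again from 0.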


## Back to Data Types

In [5]:
type(surveys_df)

pandas.core.frame.DataFrame

In [6]:
type(surveys_df.record_id)

pandas.core.series.Series

In [7]:
surveys_df.record_id.dtype

dtype('int64')

In [9]:
surveys_df.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object
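
One likely reason `hindfoot_length` and `weight` show up as `float64` is that they contain missing values, which pandas stores as the float `NaN` by default. A small sketch with made-up values (the Series names are hypothetical):

```python
import pandas as pd

# Hypothetical toy values, not taken from surveys.csv
whole_numbers = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])   # None is stored as NaN

print(whole_numbers.dtype)   # integers only
print(with_missing.dtype)    # a missing value forces a float dtype
```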

## Integers vs. Floats

In [10]:
type(10)

int

In [11]:
type(10.5)

float

In [12]:
int(10.6)

10

In [13]:
float(10)

10.0
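
Note that `int(10.6)` returned `10`, not `11`: converting a float to an int drops the fractional part rather than rounding. A short sketch of the difference, using plain Python built-ins:

```python
truncated = int(10.6)     # drops the fractional part
rounded = round(10.6)     # rounds to the nearest integer
negative = int(-10.6)     # truncation goes toward zero, not downward

print(truncated, rounded, negative)
```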

In [14]:
surveys_df.head()

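| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
| 1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
| 2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
| 3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
| 4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |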


In [15]:
surveys_df.record_id = surveys_df.record_id.astype('float64')

In [20]:
surveys_df.head()

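| | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 7 | 16 | 1977 | 2.0 | NL | M | 32.0 | NaN |
| 1 | 2.0 | 7 | 16 | 1977 | 3.0 | NL | M | 33.0 | NaN |
| 2 | 3.0 | 7 | 16 | 1977 | 2.0 | DM | F | 37.0 | NaN |
| 3 | 4.0 | 7 | 16 | 1977 | 7.0 | DM | M | 36.0 | NaN |
| 4 | 5.0 | 7 | 16 | 1977 | 3.0 | DM | M | 35.0 | NaN |

Note that `plot_id` also shows as a float in this output: the cells were run out of order, and the `plot_id` challenge cell below (`In [19]`) had already been executed.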


In [21]:
surveys_df.record_id.dtype

dtype('float64')

**Challenge:** Convert the `plot_id` column from **int64** to **float64**.

In [19]:
surveys_df.plot_id = surveys_df.plot_id.astype('float64')
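
The assignment in the solution matters because `astype` returns a new Series and leaves the original untouched. A minimal sketch on a made-up column (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, not taken from surveys.csv
toy_df = pd.DataFrame({"plot_id": pd.Series([2, 3, 7], dtype="int64")})

# astype returns a converted copy; toy_df itself is not modified yet.
converted = toy_df["plot_id"].astype("float64")
dtype_before = str(toy_df["plot_id"].dtype)

# Assigning the result back is what actually changes the column.
toy_df["plot_id"] = converted
dtype_after = str(toy_df["plot_id"].dtype)

print(dtype_before, "->", dtype_after)
```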

## Missing Values: `NaN`

In [22]:
surveys_df.weight.mean()

42.672428212991356

In [23]:
surveys_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
record_id          35549 non-null float64
month              35549 non-null int64
day                35549 non-null int64
year               35549 non-null int64
plot_id            35549 non-null float64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
dtypes: float64(4), int64(3), object(2)
memory usage: 2.4+ MB
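
The non-null counts above imply how many values are missing in each column (total entries minus non-null entries). A more direct count could be sketched with `isna().sum()`, shown here on made-up data (`toy_df` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from surveys.csv
toy_df = pd.DataFrame({
    "sex": ["M", "F", None, "M"],
    "weight": [40.0, np.nan, 48.0, np.nan],
})

# isna() marks each missing cell as True; summing counts the Trues per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```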


In [24]:
df_test = surveys_df.copy()

In [26]:
df_test.weight = df_test.weight.fillna(0)

In [27]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35549 entries, 0 to 35548
Data columns (total 9 columns):
record_id          35549 non-null float64
month              35549 non-null int64
day                35549 non-null int64
year               35549 non-null int64
plot_id            35549 non-null float64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             35549 non-null float64
dtypes: float64(4), int64(3), object(2)
memory usage: 2.4+ MB


In [28]:
df_test.weight.mean()

38.751976145601844
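
The mean dropped from about 42.67 to about 38.75 because `mean()` skips `NaN` by default, while the filled-in zeros count as real measurements. A small sketch with made-up weights shows the effect, along with one possible alternative (filling with the column mean, which is an illustration here rather than a recommendation from the lesson):

```python
import numpy as np
import pandas as pd

# Hypothetical toy weights, not taken from surveys.csv
weights = pd.Series([40.0, np.nan, 50.0, np.nan])

# NaN values are skipped: (40 + 50) / 2
mean_skipping_nan = weights.mean()

# Zeros are counted as data: (40 + 0 + 50 + 0) / 4
mean_zero_filled = weights.fillna(0).mean()

# Filling with the mean leaves the mean unchanged.
mean_mean_filled = weights.fillna(weights.mean()).mean()

print(mean_skipping_nan, mean_zero_filled, mean_mean_filled)
```

Which fill value is appropriate depends on what a missing weight means in the data; zero is only safe if a zero would be a plausible measurement.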

## Write to CSV

In [30]:
df_test.to_csv('data/surveys_complete.csv')
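
By default `to_csv` also writes the row index as an unlabeled first column, which comes back as `Unnamed: 0` when the file is read again. If that is not wanted, passing `index=False` is one option. A sketch using in-memory buffers instead of real files (`toy_df` and the buffer names are hypothetical):

```python
import io

import pandas as pd

# Hypothetical toy data, not taken from surveys.csv
toy_df = pd.DataFrame({"record_id": [1, 2], "weight": [40.0, 48.0]})

# Default behaviour: the index is written as an extra, unlabeled column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```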