### Archive exploration
Brief Description 
- data is about: Snowshoe hare physical data in Bonanza Creek Experimental Forest
- collection time frame: 1999-06-01 to 2012-09-14
- sensitive data
- associated publication
- citation
- date of access
- link to archive
![Snowshoe hare](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg/1452px-SNOWSHOE_HARE_%28Lepus_americanus%29_%285-28-2015%29_quoddy_head%2C_washington_co%2C_maine_-01_%2818988734889%29.jpg?20170313021652)

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read in data 
hare = pd.read_csv('https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-bnz.55.22&entityid=f01f5d71be949b8c700b6ecd1c42c701')

In [3]:
hare.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3380 entries, 0 to 3379
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        3380 non-null   object 
 1   time        264 non-null    object 
 2   grid        3380 non-null   object 
 3   trap        3368 non-null   object 
 4   l_ear       3332 non-null   object 
 5   r_ear       3211 non-null   object 
 6   sex         3028 non-null   object 
 7   age         1269 non-null   object 
 8   weight      2845 non-null   float64
 9   hindft      1633 non-null   float64
 10  notes       243 non-null    object 
 11  b_key       3333 non-null   float64
 12  session_id  3380 non-null   int64  
 13  study       3217 non-null   object 
dtypes: float64(3), int64(1), object(10)
memory usage: 369.8+ KB


In [4]:
hare.columns

Index(['date', 'time', 'grid', 'trap', 'l_ear', 'r_ear', 'sex', 'age',
       'weight', 'hindft', 'notes', 'b_key', 'session_id', 'study'],
      dtype='object')

In [5]:
hare.head()

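| | date | time | grid | trap | l_ear | r_ear | sex | age | weight | hindft | notes | b_key | session_id | study |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11/26/1998 | NaN | bonrip | 1A | 414D096A08 | NaN | NaN | NaN | 1370.0 | 160.0 | NaN | 917.0 | 51 | Population |
| 1 | 11/26/1998 | NaN | bonrip | 2C | 414D320671 | NaN | M | NaN | 1430.0 | NaN | NaN | 936.0 | 51 | Population |
| 2 | 11/26/1998 | NaN | bonrip | 2D | 414D103E3A | NaN | M | NaN | 1430.0 | NaN | NaN | 921.0 | 51 | Population |
| 3 | 11/26/1998 | NaN | bonrip | 2E | 414D262D43 | NaN | NaN | NaN | 1490.0 | 135.0 | NaN | 931.0 | 51 | Population |
| 4 | 11/26/1998 | NaN | bonrip | 3B | 414D2B4B58 | NaN | NaN | NaN | 1710.0 | 150.0 | NaN | 933.0 | 51 | Population |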


In [6]:
# What are the unique values for some of the categorical columns?
hare['trap'].unique()

array(['1A', '2C', '2D', '2E', '3B', '3D', '4A', '4B', '4C', '4E', '5A',
       '5C', '5D', '5E', '10C', '1C', '1E', '2A', '2B', '3C', '3E', '5B',
       '6A', '6B', '6C', '7B', '7C', '7E', '8A', '8B', '8E', '9A', '9D',
       '1D', '6E', '7D', '8C', '8D', '9B', '3A', '10B', '1B', '7A', '9E',
       '4D', '10A', '6D', '9C', '10D', '10E', '10b', '2a', '2b', '2d',
       '3b', '4a', '4c', '4e', '5b', '6c', '7a', '7b', '7d', '7e', '8e',
       '9a', '1b', '2c', '2e', '3c', '1e', '3e', '5d', '3d', '4d', '7c',
       '8c', '10c', '1c', '1d', '9d', '5e', '6a', '8a', '8b', '6b', '10e',
       '6e', nan, '4b', '5c', '9c', '10a', '5a', '9b', '9e', '6d', '1a',
       '3a', '10d', '8d', '4f', '5f', '3f', '2f', '2g', '5g', '4g', '1g',
       '7f', '6f', '6g', '3g', '4c ', '4e ', '1e ', '1b ', '2b ', '6b ',
       '2c ', '5c ', '4b '], dtype=object)

- What are the dimensions of the dataframe and what are the data types of the columns? Do the data types match what you would expect from each column?
- Are there any columns that have a significant number of NA values?
- What are the minimum and maximum values for the weight and hind feet measurements?
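
The three questions above can each be answered with one or two pandas calls. The following is a minimal sketch on a small, made-up data frame (`sample` and its values are hypothetical and only borrow a few column names from the hare data); the same calls should apply to `hare`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hare data: a few rows with some of the same columns.
sample = pd.DataFrame({
    "date": ["11/26/1998", "11/26/1998", "11/27/1998", "11/27/1998"],
    "sex": ["M", np.nan, "f", "F"],
    "weight": [1430.0, 1370.0, np.nan, 1490.0],
    "hindft": [np.nan, 160.0, 135.0, 150.0],
})

n_rows, n_cols = sample.shape        # dimensions of the data frame
dtypes = sample.dtypes               # data type of each column
na_counts = sample.isna().sum()      # number of NA values per column

# Minimum and maximum of both measurements in one table.
ranges = sample[["weight", "hindft"]].agg(["min", "max"])

# Dates read from a CSV arrive as strings; parse them explicitly if needed.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y")

print(n_rows, n_cols)
print(dtypes)
print(na_counts)
print(ranges)
```

Comparing `na_counts` against the number of rows gives a quick sense of which columns are mostly empty.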


### 4. Detecting messy values
a. In the metadata section of the EDI repository, find the allowed values for the hares’ sex. Create a small table in a markdown cell showing the values and their definitions.

| Allowed Value | Description |
| --- | --- |
| m | male |
| f | female |
| m? | male not confirmed |

b. Get the number of times each unique sex non-NA value appears.

In [21]:
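# Count how often each non-NA value occurs (value_counts() drops NaN by default).
# Note: unique() returns a NumPy array, which has no value_counts() method.
hare['sex'].value_counts()

F     1161
M      730
f      556
m      515
?       40
F?      10
f        4
m        4
f?       3
M?       2
m?       2
pf       1
Name: sex, dtype: int64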

c. Check the documentation of value_counts(). What is the purpose of the dropna parameter and what is its default value? Repeat step (b), this time adding the dropna=False parameter to value_counts().

In [19]:
hare['sex'].value_counts(dropna=False)

F      1161
M       730
f       556
m       515
NaN     352
?        40
F?       10
f         4
m         4
f?        3
M?        2
m?        2
pf        1
Name: sex, dtype: int64

d. Discuss with your team the output of the unique value counts. In particular:

- Do the values in the sex column correspond to the values declared in the metadata?
- What could have been potential causes for multiple codes?
- Are there seemingly repeated values? If so, what could be the cause?
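
One way to investigate seemingly repeated values is to print each distinct code with `repr()`, which makes leading or trailing whitespace visible. The sketch below uses a hypothetical handful of codes (`codes` is made up for illustration, not read from the hare data).

```python
import pandas as pd

# Hypothetical sample of raw codes, including trailing-space variants.
codes = pd.Series(["F", "f", "f ", "M", "m ", "m?", None])

# repr() wraps each code in quotes, so 'f' and 'f ' are distinguishable.
distinct = [repr(c) for c in codes.dropna().unique()]
print(distinct)

# Stripping whitespace and lowercasing shows how many codes remain
# once pure formatting differences are removed.
normalized = codes.str.strip().str.lower()
print(normalized.value_counts())
```

If two entries look identical in the `value_counts()` output but differ under `repr()`, the difference is most likely stray whitespace from data entry.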

### 5. Brainstorm

Individually, write step-by-step instructions on how you would wrangle the `hare` data frame to clean the values in the sex column so that it has only two classes: female and male. Which codes would you assign to each new class? Remember: It’s ok if you don’t know how to code each step - it’s more important to have an idea of what you would like to do.

Discuss your step-by-step instructions with your team.

The next exercise will guide you through cleaning the sex codes. There are many ways of doing this. The one presented here might not be the same way you thought about doing it - that’s ok! This one was designed to practice using the numpy.select() function.
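
For readers who have not used `numpy.select()` before, here is a minimal sketch on made-up numbers that are unrelated to the hare data (`weights`, the thresholds, and the labels are all hypothetical). It takes a list of boolean conditions and a matching list of choices, and falls back to `default` where no condition holds.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to illustrate the mechanics.
weights = pd.Series([900.0, 1400.0, 1900.0])

conditions = [weights < 1000, weights < 1500]
choices = ["light", "medium"]

# Conditions are checked in order; the first one that is True picks the label.
# Elements matching no condition receive the default.
size_class = np.select(conditions, choices, default="heavy")
print(size_class)
```

Because the first matching condition wins, the order of the conditions matters whenever they overlap, as they do here for values below 1000.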

### 6. Clean values

In [25]:
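# One condition per target class; the trailing-space variants are listed explicitly.
conditions = [hare['sex'].isin(['F', 'f', 'f ']),
              hare['sex'].isin(['M', 'm', 'm '])]

gender = ['female', 'male']

# Note: because the choices are strings, the np.nan default shows up as the
# string 'nan' in the output below, not as a true missing value.
hare['sex_simple'] = np.select(conditions, gender, default=np.nan)

print(hare['sex_simple'])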


0        nan
1       male
2       male
3        nan
4        nan
        ... 
3375     nan
3376     nan
3377     nan
3378     nan
3379    male
Name: sex_simple, Length: 3380, dtype: object


In [26]:
# Calculate mean weight per class; the 'nan' group holds all rows that matched neither condition
hare.groupby('sex_simple')['weight'].mean()

sex_simple
female    1365.164792
male      1349.935542
nan       1193.364055
Name: weight, dtype: float64
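
In the result above, the unmatched rows form a third group labelled `nan`, because the column holds the string `'nan'` and not a missing value. If only the two classes are wanted, one option is to map the codes with `Series.map()`, which leaves unlisted codes as real `NaN`. This is an alternative to the `numpy.select()` approach the exercise practices, shown here as a sketch on hypothetical rows (`sample` and `mapping` are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the hare data.
sample = pd.DataFrame({
    "sex": ["F", "f ", "M", "m", "?", np.nan],
    "weight": [1400.0, 1300.0, 1500.0, 1300.0, 1100.0, 1200.0],
})

# Codes not listed in the mapping, and NaN itself, become NaN.
mapping = {"F": "female", "f": "female", "f ": "female",
           "M": "male", "m": "male", "m ": "male"}
sample["sex_simple"] = sample["sex"].map(mapping)

# groupby() drops NaN keys by default, so only the two classes remain.
mean_weight = sample.groupby("sex_simple")["weight"].mean()
print(mean_weight)
```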

