# Research Data Management for Data Science 

* Data Evaluation
    * Provenance
* Data Cleaning
    * Reshaping and joining
    * Variable names and types
    * Missing values, nulls, and zeros
    * Anonymizing?
* Archiving Standards
* Metadata



# Data Evaluation


![](./images/quality.png)



You will probably evaluate your data intuitively using the categories above, but from a data management perspective it is valuable to pay particular attention to **Usability and Reliability**. However you acquire a dataset, whether from a secondary data repository such as ICPSR or Dataverse or by collecting it yourself through web scraping, you should be asking yourself:






![](./images/qa.png)

# Data Cleaning

> “Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham 

A much-repeated saying in data science holds that 80% of data analysis is spent cleaning the dataset to get it into a usable form. Because transforming a raw dataset is time-consuming and usually takes several iterations, it is important to think about the lifecycle of a dataset: how it changes, and how you will document those changes for others.

We will look at some of the more common techniques for producing [Tidy Data](http://r4ds.had.co.nz/tidy-data.html).

Tidy Data has the following attributes:

* Each variable forms a column and contains values
* Each observation forms a row
* Each type of observational unit forms a table

  
 ![](./images/tidy.png)


## Reshaping

### Pew Research Center Dataset

This dataset explores the relationship between income and religion.

How many variables are in this dataset?

In [51]:
import pandas as pd

df = pd.read_csv("./data/pew-raw.csv")
df

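|   | religion | $0-10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k |
|---|---|---|---|---|---|---|---|
| 0 | Agnostic | 27 | 34 | 60 | 81 | 76 | 137 |
| 1 | Atheist | 12 | 27 | 37 | 52 | 35 | 70 |
| 2 | Buddhist | 27 | 21 | 30 | 34 | 33 | 58 |
| 3 | Catholic | 418 | 617 | 732 | 670 | 638 | 1116 |
| 4 | Dont know/refused | 15 | 14 | 15 | 11 | 10 | 35 |
| 5 | Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 |
| 6 | Hindu | 1 | 9 | 7 | 9 | 11 | 34 |
| 7 | Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 |
| 8 | Jehovahs Witness | 20 | 27 | 24 | 24 | 21 | 30 |
| 9 | Jewish | 19 | 19 | 25 | 25 | 30 | 95 |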


`df` is a typical-looking data table designed to be easy to read in print. You will often find data like this when it has been digitized from government documents or supplied by survey providers (like Pew or Gallup) in Excel spreadsheets. Though it is human-readable, it is awkward to analyze: the income brackets are spread across the column headers rather than stored as values of a single variable.

We want to [melt](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) this data from *wide* to *long* so that income is no longer in the column headers and the counts are no longer spread across the table. Instead, income and count each become a variable contained in its own column.

In [55]:
import pandas as pd

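# Keep "religion" as the identifier; every other column becomes an (income, freq) pair
tidy_df = pd.melt(df,
                  id_vars=["religion"],
                  var_name="income",
                  value_name="freq")
# Group each religion's rows together for display
tidy_df = tidy_df.sort_values(by=["religion"])
tidy_df.head(10)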



|   | religion | income | freq |
|---|---|---|---|
| 0 | Agnostic | $0-10k | 27 |
| 30 | Agnostic | $30-40k | 81 |
| 40 | Agnostic | $40-50k | 76 |
| 50 | Agnostic | $50-75k | 137 |
| 10 | Agnostic | $10-20k | 34 |
| 20 | Agnostic | $20-30k | 60 |
| 41 | Atheist | $40-50k | 35 |
| 21 | Atheist | $20-30k | 37 |
| 11 | Atheist | $10-20k | 27 |
| 31 | Atheist | $30-40k | 52 |


To do the reverse and return to the "pivot table"-like layout we saw before, use the [pivot](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html) function.

In [75]:
wide_df = tidy_df.pivot(index="religion",
                        columns="income",
                        values="freq")
wide_df.head()

| religion / income | $0-10k | $10-20k | $40-50k | $20-30k | $30-40k | $50-75k |
|---|---|---|---|---|---|---|
| Agnostic | 27 | 34 | 76 | 60 | 81 | 137 |
| Atheist | 12 | 27 | 35 | 37 | 52 | 70 |
| Buddhist | 27 | 21 | 33 | 30 | 34 | 58 |
| Catholic | 418 | 617 | 638 | 732 | 670 | 1116 |
| Dont know/refused | 15 | 14 | 10 | 15 | 11 | 35 |

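The melt and pivot steps can be checked as a round trip. The sketch below is a minimal, self-contained version that uses three rows and three income columns copied from the table above rather than the CSV file; the variable names (`wide`, `long_df`, `restored`, `income_order`) are illustrative only. Because `pivot` sorts the column labels, the original bracket order is restored with `reindex`.

```python
import pandas as pd

# Small excerpt of the Pew table in its original wide layout
wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist", "Buddhist"],
    "$0-10k": [27, 12, 27],
    "$10-20k": [34, 27, 21],
    "$20-30k": [60, 37, 30],
})

# Remember the original order of the income brackets
income_order = [c for c in wide.columns if c != "religion"]

# Wide -> long: one row per (religion, income) pair
long_df = wide.melt(id_vars=["religion"], var_name="income", value_name="freq")

# Long -> wide again, then put the brackets back in their original order
restored = (
    long_df.pivot(index="religion", columns="income", values="freq")
           .reindex(columns=income_order)
           .reset_index()
)
restored.columns.name = None  # drop the leftover "income" axis label
```

Comparing `restored` with `wide` is a quick way to confirm that reshaping did not lose or duplicate any counts.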

## Merging

Commonly, we need to combine several datasets. There are many types of merging operations, or "joins", but we'll cover the more common kinds: left (or right) joins, inner joins, and appends.

We'll use two very simple datasets to demonstrate. The members dataset has a variable for first name and a variable for the name of the band the person belongs to. The instruments dataset has the same first name variable and a variable for the instrument the person plays.

When merging, the datasets must share a common variable on which rows can be matched. In this example, the common variable is *name*.

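Before merging, it can help to check which columns the two tables share and whether the key values line up. The sketch below is one way to do that, assuming small hypothetical stand-ins for the two tables; in particular, the `Keith` row is invented for illustration and is not taken from the workshop data.

```python
import pandas as pd

# Hypothetical stand-ins for the members and instruments tables
members = pd.DataFrame({
    "name": ["Mick", "John", "Paul"],
    "band": ["Stones", "Beatles", "Beatles"],
})
instruments = pd.DataFrame({
    "name": ["John", "Paul", "Keith"],   # "Keith" is an invented row
    "plays": ["guitar", "bass", "guitar"],
})

# Columns present in both tables are candidate merge keys
shared_columns = members.columns.intersection(instruments.columns).tolist()

# Key values that would be left unmatched on either side
only_in_members = sorted(set(members["name"]) - set(instruments["name"]))
only_in_instruments = sorted(set(instruments["name"]) - set(members["name"]))

# Duplicate keys would multiply rows in the merged result
keys_unique = members["name"].is_unique and instruments["name"].is_unique
```

Names that appear in only one table are the rows that the different join types treat differently.
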
In [84]:
members = pd.read_csv("./data/members.csv")
instruments = pd.read_csv("./data/instruments.csv")

members.head()
instruments.head()

|   | name | band |
|---|---|---|
| 0 | Mick | Stones |
| 1 | John | Beatles |
| 2 | Paul | Beatles |


In [88]:
pd.merge(members, instruments, how = 'left', on = 'name')

|   | name | band | plays |
|---|---|---|---|
| 0 | Mick | Stones | NaN |
| 1 | John | Beatles | guitar |
| 2 | Paul | Beatles | bass |


In [93]:
pd.merge(members, instruments, how = 'inner', on = 'name')

|   | name | band | plays |
|---|---|---|---|
| 0 | John | Beatles | guitar |
| 1 | Paul | Beatles | bass |


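Inner joins retain only the rows whose *name* appears in both datasets. That is why Mick, who has no entry in the instruments dataset, drops out here, whereas the left join above keeps him and leaves *plays* empty. Try changing the *how* argument to run an outer join. What happens?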

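The introduction to this section also lists appends among the common operations. An append stacks the rows of one table under another rather than matching on a key. A minimal sketch using `pd.concat`, assuming two hypothetical tables (`members_a`, `members_b`) that have the same columns:

```python
import pandas as pd

# Two hypothetical batches of member records with identical columns
members_a = pd.DataFrame({"name": ["Mick", "John"], "band": ["Stones", "Beatles"]})
members_b = pd.DataFrame({"name": ["Paul"], "band": ["Beatles"]})

# Stack the rows; ignore_index=True renumbers the index 0..n-1
all_members = pd.concat([members_a, members_b], ignore_index=True)
```

Without `ignore_index=True`, each table keeps its own index labels, so the combined table would contain duplicate index values.
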
## Variable Management

In [91]:
df.dtypes

religion    object
 $0-10k      int64
 $10-20k     int64
$20-30k      int64
$30-40k      int64
 $40-50k     int64
$50-75k      int64
dtype: object

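The `dtypes` listing above appears to show a leading space in some column names (for example ` $0-10k`), which is easy to miss and makes columns awkward to select. Below is a minimal sketch of tidying names and types, assuming a small hypothetical frame; the count column stored as text is invented for illustration and does not come from the Pew file.

```python
import pandas as pd

# Hypothetical frame: one name has stray whitespace, one count column is text
df_demo = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    " $0-10k": [27, 12],
    "$20-30k": ["60", "37"],
})

# Remove leading and trailing whitespace from every column name
df_demo.columns = df_demo.columns.str.strip()

# Convert text that holds numbers into a numeric type
df_demo["$20-30k"] = pd.to_numeric(df_demo["$20-30k"])

# Store a repeated label as a categorical variable
df_demo["religion"] = df_demo["religion"].astype("category")
```

Running `df_demo.dtypes` before and after these steps shows whether each conversion took effect.
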
## Missing Values, Nulls, and Zeros

![](images/nulls.png)

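In [None]:
# Drop every row that contains at least one missing value (returns a new DataFrame)
df.dropna()
# Or replace missing values with a fill value of your choosing, e.g. 0
df.fillna(0)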

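Whether to drop or fill depends on what a missing value means. The sketch below assumes a small hypothetical table in which `NaN` stands for "not recorded" and `0` for "recorded, none observed"; the religion labels and counts are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: NaN = not recorded, 0 = recorded but none observed
counts = pd.DataFrame({
    "religion": ["A", "B", "C"],
    "freq": [27.0, np.nan, 0.0],
})

# Count missing values per column before deciding what to do
missing_per_column = counts.isna().sum()

# Option 1: drop rows where the count is missing
dropped = counts.dropna(subset=["freq"])

# Option 2: treat missing counts as zero
filled = counts.fillna({"freq": 0})

# The choice changes the results: mean() skips NaN by default
mean_ignoring_missing = counts["freq"].mean()   # (27 + 0) / 2 = 13.5
mean_after_fill = filled["freq"].mean()         # (27 + 0 + 0) / 3 = 9.0
```

Filling with zero asserts that nothing was observed, which is a stronger claim than "unknown", so the decision is worth documenting alongside the cleaned data.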


## Anonymization of Sensitive Data