## Research Data Management for Data Science 
#### UC Berkeley Library, Research IT

* Data Evaluation
* Data Cleaning
    * Reshaping and joining
    * Variable names and types
    * Missing values, nulls, and zeros
* Reproducibility/Metadata 




# Data Evaluation
|                                                                                                                   |
|-------------------------------------------------------------------------------------------------------------------|
| What makes this dataset findable?                                                                                 |
| Can you tell if this is original/raw data or if it's been manipulated? Who created this?                          |
| Can you tell what the unit of observation is?                                                                     |
| If you had to merge or add to this, could you?                                                                    |
| Do you think this data can answer our question _What is the proportion of women to men in the Technology Industry?_ |
|                                                                                                                   |

|         |                                                                       |
|---------|-----------------------------------------------------------------------|
| Group 1 | https://www.eeoc.gov/eeoc/statistics/reports/hightech/                |
| Group 2 | https://github.com/alison985/women-in-tech-datasets                   |
| Group 3 | https://www.dol.gov/wb/stats/Computer_information_technology_2014.htm |
| Group 4 | https://www.bls.gov/cps/cpsaat11.htm                                  |
| Group 5 | https://berkeley.box.com/s/ik0obara8hj8k212logii0sh9m5vwj63           |

# Data Evaluation


![](./images/quality.png)

[Quartz Guide to Bad Data](https://qz.com/572338/the-quartz-guide-to-bad-data/)




# Data Cleaning

> “Happy families are all alike; every unhappy family is unhappy in its own way.” 

– Leo Tolstoy


> “Tidy datasets are all alike, but every messy dataset is messy in its own way.”

– Hadley Wickham 



A much-repeated saying in data science holds that 80% of data analysis is cleaning: getting the dataset into a form that is usable for analysis. Because transforming a raw dataset is time-consuming and takes several iterations, it is important to think about the lifecycle of a dataset, how it changes, and how you will document those changes for others.

We will look at some of the more common techniques for producing [Tidy Data](http://r4ds.had.co.nz/tidy-data.html).



Tidy Data has the following attributes:

* Each variable forms a column and contains values
* Each observation forms a row

  
 ![](./images/tidy.png)





### Reshaping


In [None]:
import pandas as pd

df = pd.read_csv("./data/pew-raw.csv")
df

`df` is a typical data table designed to be easily readable in print. You will often find data like this when it has been digitized from government documents or supplied by survey providers (like Pew or Gallup) as Excel spreadsheets. Though it is human-readable, it is not in a convenient form for analysis.

We want to [melt](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html) this data from *wide* to *long* so that the income classes are no longer column headers and the counts are no longer spread across the table but held together in a single column.

In [None]:
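# Melt from wide to long: "religion" stays as the identifier column,
# the income headers become values of "income", and the counts go in "n".
tidy_df = pd.melt(df,
                  id_vars=["religion"],
                  var_name="income",
                  value_name="n")
tidy_df = tidy_df.sort_values(by=["religion"])
tidy_df.head(10)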
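
For readers without the Pew file at hand, the same melt can be tried on a small stand-in table. This is only a sketch: the religions, income labels, and counts below are made up for illustration and are not taken from `pew-raw.csv`.

```python
import pandas as pd

# Hypothetical miniature of a wide survey table (all values are invented).
wide = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [27, 12],
    "$10-20k": [34, 27],
})

# "religion" is kept as an identifier; every other header becomes a value of "income".
long = pd.melt(wide, id_vars=["religion"], var_name="income", value_name="n")

print(long.shape)  # 2 religions x 2 income columns give 4 rows and 3 columns
print(long)
```

Each row of `long` is now one observation: a religion, an income class, and a count.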




To do the reverse and return to the pivot-table-like layout we saw before, use the [pivot](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html) function.

In [None]:
wide_df = tidy_df.pivot(index="religion",
                        columns="income",
                        values="n")
wide_df.head()
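
Two details of `pivot` are easy to trip over: the result keeps `religion` as the index rather than as a column, and `pivot` raises an error when an index/column pair occurs more than once. The sketch below uses a hypothetical toy table (not the workshop data) to show one way of flattening the result, and `pivot_table` as an aggregating alternative.

```python
import pandas as pd

# Hypothetical long-format table (values invented for illustration).
long = pd.DataFrame({
    "religion": ["A", "A", "B", "B"],
    "income": ["<$10k", "$10-20k", "<$10k", "$10-20k"],
    "n": [27, 34, 12, 27],
})

wide = long.pivot(index="religion", columns="income", values="n")

# Flatten: move "religion" back to a column and drop the "income" axis label.
flat = wide.reset_index()
flat.columns.name = None
print(flat)

# pivot() refuses duplicate index/column pairs; pivot_table() aggregates them instead.
dupes = pd.concat([long, long.iloc[[0]]], ignore_index=True)
summed = dupes.pivot_table(index="religion", columns="income", values="n", aggfunc="sum")
print(summed)
```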

In [None]:
# Split the income range label on the hyphen into lower and upper bounds.
# Labels without a hyphen get NaN for upper_inc.
income_parts = tidy_df["income"].str.split("-")
tidy_df["lower_inc"] = income_parts.str.get(0)
tidy_df["upper_inc"] = income_parts.str.get(1)

tidy_df.head()
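
Splitting on the hyphen leaves symbols such as `$` and `k` attached to the pieces. An alternative, sketched here under the assumption that range labels look like `$10-20k` (the labels below are invented, not read from the Pew file), is to capture both numbers with a regular expression and convert them in one step.

```python
import pandas as pd

# Assumed label formats; the real column headers may differ.
income = pd.Series(["<$10k", "$10-20k", "$20-30k", ">150k"], name="income")

# Capture "lower-upper" where both parts are digits; labels without a range yield NaN.
bounds = income.str.extract(r"\$?(?P<lower_inc>\d+)-(?P<upper_inc>\d+)k")
bounds = bounds.apply(pd.to_numeric)

print(pd.concat([income, bounds], axis=1))
```

Open-ended classes such as `<$10k` deliberately stay missing here, so a decision about how to code them has to be made (and documented) explicitly.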



In [None]:
print(df.describe())


## Merging



 ![](./images/sql-joins.png)


In [None]:
members = pd.read_csv("./data/members.csv")
instruments = pd.read_csv("./data/instruments.csv")

In [None]:
from IPython.display import display

display(members.head())
display(instruments.head())


In [None]:
pd.merge(members, instruments, how = 'right', on = 'name')

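A left merge keeps every key from the left dataframe (`members`); rows with no match in the right dataframe are filled with missing values.

A right merge, as above, keeps every key from the right dataframe (`instruments`).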


In [None]:
pd.merge(members, instruments, how = 'outer', on = 'name')

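An outer join, as above, keeps every row from both datasets and fills the gaps with missing values. An inner join retains only the rows whose keys appear in both datasets. Try changing the *how* argument to `'inner'`. What happens?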
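
To compare the join types side by side, the `indicator=True` argument of `pd.merge` adds a `_merge` column recording whether each row came from the left table, the right table, or both. The two small tables below are hypothetical stand-ins for `members.csv` and `instruments.csv`, whose actual contents are not shown here.

```python
import pandas as pd

# Hypothetical stand-ins for the workshop files.
members = pd.DataFrame({"name": ["John", "Paul", "Mick"],
                        "band": ["Beatles", "Beatles", "Stones"]})
instruments = pd.DataFrame({"name": ["John", "Paul", "Keith"],
                            "plays": ["guitar", "bass", "guitar"]})

counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(members, instruments, how=how, on="name", indicator=True)
    counts[how] = len(merged)
    print(f"--- how={how!r}: {len(merged)} rows")
    print(merged)
```

With these toy tables the inner join keeps only the two shared names, while the outer join keeps all four and marks the unmatched ones as `left_only` or `right_only`.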

## Text Data

In [None]:
tidy_df['lower_inc'] = tidy_df.lower_inc.str.strip("$")

tidy_df.head()

## Text Data

Common steps for cleaning text data:

* Converting the entire document to lowercase
* Removing punctuation marks (periods, commas, hyphens, etc.)
* Removing stopwords (extremely common words such as “and”, “or”, and “not”)
* Removing numbers
* Filtering out unwanted terms
* Removing extra whitespace
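
The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a recommended pipeline: the function name `clean_text`, the tiny stopword set, and the sample sentence are all assumptions made for the example.

```python
import re
import string

# A tiny illustrative stopword list; real projects usually take one from a library.
STOPWORDS = {"and", "or", "not", "the", "a", "of"}

def clean_text(text, unwanted=()):
    """Lowercase, strip punctuation and digits, drop stopwords and unwanted terms."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    # split() with no argument also collapses runs of whitespace.
    tokens = [t for t in text.split() if t not in STOPWORDS and t not in unwanted]
    return " ".join(tokens)

print(clean_text("The  Price of Tea, and 3 cups -- NOT free!", unwanted={"cups"}))
# prints: price tea free
```

Note that deleting hyphens joins hyphenated words together, and dropping "not" can flip the meaning of a sentence, so the right set of steps depends on the analysis.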





## Text Data

[Regular Expressions in Python](https://github.com/dlab-berkeley/regular-expressions-in-python)

[Regexone](https://regexone.com/references/python)

[Learn Regex the Easy Way](https://github.com/zeeshanu/learn-regex)

## Variable Management

In [None]:
# Check the Python type of the first value in each column
for col in tidy_df:
    print(col, type(tidy_df[col].iloc[0]))
    
tidy_df.head()
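
Once a column turns out to hold numbers stored as strings, it can be converted. The sketch below uses a hypothetical three-column frame (not the workshop's `tidy_df`) to show `DataFrame.dtypes` as a quicker overview and `pd.to_numeric` for the conversion.

```python
import pandas as pd

# Hypothetical frame in which lower_inc was read as text.
df = pd.DataFrame({"religion": ["A", "B"], "lower_inc": ["10", "20"], "n": [27, 12]})
print(df.dtypes)

# errors="coerce" turns unparseable strings into NaN instead of raising an error.
df["lower_inc"] = pd.to_numeric(df["lower_inc"], errors="coerce")
print(df.dtypes)
```

Coercion is convenient but silent, so it is worth counting the resulting missing values afterwards.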

### Dates



In [None]:
dates = pd.read_csv("./data/dates.csv")
dates

In [None]:
# Check data types: date_time is still a plain string at this point
for col in dates:
    print(col, type(dates[col].iloc[0]))



In [None]:
# Convert the date_time column from str to datetime
dates["date_time"] = pd.to_datetime(dates["date_time"])

dates
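
When the date format is known, passing it explicitly makes the parsing stricter, and `errors="coerce"` turns anything unparseable into `NaT` rather than stopping with an error. The values and the format string below are assumptions for illustration; the layout of `dates.csv` is not shown here.

```python
import pandas as pd

# Hypothetical date strings, including one bad entry.
dates = pd.DataFrame({"date_time": ["2017-04-20 09:30", "2017-04-21 14:05", "not a date"]})

dates["date_time"] = pd.to_datetime(dates["date_time"],
                                    format="%Y-%m-%d %H:%M",
                                    errors="coerce")

# The .dt accessor exposes date components once the column is a datetime type.
dates["year"] = dates["date_time"].dt.year
dates["weekday"] = dates["date_time"].dt.day_name()

print(dates.dtypes)
print(dates)
```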

## Variable Naming Best Practices




|Good Example   | Bad Example   | Description  |
|---|---|---|
|gnp2010   |gnp-2002; gnp#2002    |   |
|real_int    |real interest rate    |   |
| score1; gnp2003   | 1st_score; 2003gnp  |  |
|reg_out; glm1    | REG; glm; ttest   |   |
| invest; interest  | xxx; yyy; zmdje;   |    |
|male; asian    | gender; race   |   |
| citizen   | Are_you_a_US_citizen?   |   |
| income; intUS03   | INCOME; Int_us2003;   |   |
| 2017-04-20   |April 20, 2017   |   |


## Variable Naming Best Practices



|Good Example   | Bad Example   | Description  |
|---|---|---|
|gnp2010   |gnp-2002; gnp#2002    | Avoid special characters  |
|real_int    |real interest rate    | Use underscores instead of spaces   |
| score1; gnp2003   | 1st_score; 2003gnp  | Begin with a letter   |
|reg_out; glm1    | REG; glm; ttest   | Avoid reserved words  |
| invest; interest  | xxx; yyy; zmdje;   | Use meaningful names    |
|male; asian    | gender; race   | Name dummy variables after the value coded 1   |
| fav_color   | Whats_Your_Favorite_Color?   | The shorter, the better   |
| income; intUS03   | INCOME; Int_us2003;   | Use lowercase   |
| 2017-04-20   |April 20, 2017   | Use the ISO 8601 date format (YYYY-MM-DD)  |
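
Some of these conventions can be applied mechanically to column names. The helper below is a hypothetical sketch (the name `clean_name` and the `x` prefix for names starting with a digit are choices made for this example); it cannot make a name meaningful or shorter, which still takes human judgment.

```python
import re
import pandas as pd

def clean_name(name):
    """Lowercase a column name and replace special characters with underscores."""
    name = name.strip().lower()
    name = re.sub(r"[^0-9a-z]+", "_", name)  # spaces and special characters -> "_"
    name = name.strip("_")
    if name and name[0].isdigit():
        name = "x" + name  # hypothetical prefix so the name begins with a letter
    return name

df = pd.DataFrame(columns=["gnp-2002", "real interest rate", "2003gnp", "INCOME"])
df = df.rename(columns=clean_name)
print(list(df.columns))
# prints: ['gnp_2002', 'real_interest_rate', 'x2003gnp', 'income']
```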


## Missing Values, Nulls, and Zeros

![](images/nulls.png)

In [None]:
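# Drop every row that contains a missing value (returns a new dataframe)
tidy_df.dropna()
# Alternatively, replace missing values with a chosen value:
# tidy_df.fillna(value)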
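
Before dropping or filling anything, it helps to count how many values are missing and how many are explicit zeros, since the two mean different things. A minimal sketch on a hypothetical three-row table:

```python
import numpy as np
import pandas as pd

# Hypothetical counts: one real zero and one missing value.
df = pd.DataFrame({"religion": ["A", "B", "C"], "n": [27, 0, np.nan]})

print(df.isna().sum())        # missing values per column
print((df["n"] == 0).sum())   # explicit zeros, which are not the same as missing

dropped = df.dropna(subset=["n"])  # returns a new frame; df itself is unchanged
filled = df.fillna({"n": 0})       # only defensible if missing truly means zero
```

Whichever option is chosen, the choice changes the results of later calculations such as means, so it belongs in the documentation of the dataset.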



Tidy Data has the following attributes:

* Each variable forms a column and contains values
* Each observation forms a row

  
 ![](./images/tidy.png)


Tidy _and useful_ Data has the following attributes:

* Each variable forms a column and contains values
* Each observation forms a row
* Data types are what you expect them to be
* [Appropriate observations and variables to address a question](https://www.theatlantic.com/business/archive/2013/04/forget-excel-this-was-reinhart-and-rogoffs-biggest-mistake/275088/)
* Missing values are accounted for, and zeros are really zeros
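
Several of these properties can be checked in code before analysis begins. The function below is a hypothetical sketch (its name, arguments, and messages are assumptions for this example); it reports absent columns, columns that should be numeric but are not, and columns with missing values.

```python
import pandas as pd

def check_frame(df, required_cols, numeric_cols):
    """Return a list of human-readable problems found in df."""
    problems = []
    for col in required_cols:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
    for col in numeric_cols:
        if col in df.columns and not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"not numeric: {col}")
    for col, n_missing in df.isna().sum().items():
        if n_missing:
            problems.append(f"{n_missing} missing values in {col}")
    return problems

# Hypothetical messy input: counts stored as text, one of them missing.
messy = pd.DataFrame({"religion": ["A", "B"], "n": ["27", None]})
problems = check_frame(messy, required_cols=["religion", "income", "n"], numeric_cols=["n"])
print(problems)
```

Whether a zero is "really zero" cannot be checked this way; that requires knowing how the data were collected.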

### Well-Described Data...

* Precisely identifies the dataset
* Provides details of its origin and history
* Eliminates confusion (and error)
* Includes “intrinsic metadata”
* Justifies every decision made re: the handling of the data
* Allows discovery and reuse
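
One lightweight way to start describing a dataset is a small metadata record kept alongside the data file. The sketch below builds such a record as JSON; the field names and placeholder values are assumptions for illustration, not a formal metadata standard.

```python
import json
import pandas as pd

# Hypothetical cleaned dataset.
df = pd.DataFrame({"religion": ["A", "B"], "n": [27, 12]})

metadata = {
    "title": "Example dataset",          # placeholder values throughout
    "source": "describe the origin of the raw data here",
    "processing_steps": ["melted wide income columns to long format"],
    "variables": {
        col: {"dtype": str(df[col].dtype), "n_missing": int(df[col].isna().sum())}
        for col in df.columns
    },
}

text = json.dumps(metadata, indent=2)
print(text)
```

The resulting text could be saved next to the data file so that the description travels with the dataset.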




["Science is show me, not trust me"](http://www.bitss.org/2015/12/31/science-is-show-me-not-trust-me)

[Findability, Accessibility, Interoperability, Reusability (FAIR) Principles](https://www.force11.org/group/fairgroup/fairprinciples)

## Reaching out

Rick Jaffe, rjaffe@berkeley.edu

Josh Quan, joshua.quan@berkeley.edu

