# Python for data analysis workshop
## A quick and dirty exploratory analysis of the Chronic Disease Indicators dataset

Disclaimer: The pandas, numpy, and scipy documentation describes the data types these libraries make available to you and how to work with them. In this workshop, we'll focus on the parts of data analysis that may be more difficult to internalize by reading alone.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import the Chronic Disease Indicators dataset

The Chronic Disease Indicators dataset was downloaded from the Centers for Disease Control website. 
A number of readme files in the 'data' directory provide more information on the variables. These are important, so take a look!

In [2]:
# Import the dataset
data = pd.read_csv('../data/U.S._Chronic_Disease_Indicators__CDI_.csv',
                   sep=',',
                   # The keys must match the column names in the file exactly
                   dtype={'YearStart': np.int64,
                          'YearEnd': np.int64,
                          'LocationAbbr': str,
                          'LocationDesc': str,
                          'DataSource': str,
                          'Topic': str,
                          'Question': str,
                          'DataValueUnit': str,
                          'DataValueType': str,
                          'DataValue': str,
                          'DataValueAlt': np.float64,
                          'DataValueFootnoteSymbol': str,
                          'DatavalueFootnote': str,
                          'LowConfidenceLimit': np.float64,
                          'HighConfidenceLimit': np.float64,
                          'StratificationCategory1': str,
                          'Stratification1': str,
                          'GeoLocation': str,
                          'LocationID': np.int64,
                          'TopicID': str,
                          'QuestionID': str,
                          'DataValueTypeID': str,
                          'StratificationCategoryID1': str,
                          'StratificationID1': str
                          })

### What variables do we have? 

The first, coarsest look that you can have at the data is to list all the variables that are present.

In Jupyter notebooks, tab completion for both the names of the columns and the public attributes is enabled by default.

**Question:** Is this information enough to know what was measured? Why, or why not? 

In [3]:
data.dtypes

YearStart                      int64
YearEnd                        int64
LocationAbbr                  object
LocationDesc                  object
DataSource                    object
Topic                         object
Question                      object
DataValueUnit                 object
DataValueType                 object
DataValue                     object
DataValueAlt                 float64
DataValueFootnoteSymbol       object
DatavalueFootnote             object
LowConfidenceLimit           float64
HighConfidenceLimit          float64
StratificationCategory1       object
Stratification1               object
GeoLocation                   object
LocationID                     int64
TopicID                       object
QuestionID                    object
DataValueTypeID               object
StratificationCategoryID1     object
StratificationID1             object
dtype: object

### Take a first peek

In [4]:
data.head()

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,DataValueUnit,DataValueType,DataValue,...,HighConfidenceLimit,StratificationCategory1,Stratification1,GeoLocation,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1
0,2013,2013,AK,Alaska,YRBSS,Alcohol,Alcohol use among youth,%,Crude Prevalence,22.5,...,26.1,Overall,Overall,"(64.84507995700051, -147.72205903599973)",2,ALC,ALC1_1,CrdPrev,OVERALL,OVR
1,2013,2013,AL,Alabama,YRBSS,Alcohol,Alcohol use among youth,%,Crude Prevalence,35.0,...,40.3,Overall,Overall,"(32.84057112200048, -86.63186076199969)",1,ALC,ALC1_1,CrdPrev,OVERALL,OVR
2,2013,2013,AR,Arkansas,YRBSS,Alcohol,Alcohol use among youth,%,Crude Prevalence,36.3,...,40.4,Overall,Overall,"(34.74865012400045, -92.27449074299966)",5,ALC,ALC1_1,CrdPrev,OVERALL,OVR
3,2013,2013,AZ,Arizona,YRBSS,Alcohol,Alcohol use among youth,%,Crude Prevalence,36.0,...,40.9,Overall,Overall,"(34.865970280000454, -111.76381127699972)",4,ALC,ALC1_1,CrdPrev,OVERALL,OVR
4,2013,2013,CA,California,YRBSS,Alcohol,Alcohol use among youth,%,Crude Prevalence,,...,,Overall,Overall,"(37.63864012300047, -120.99999953799971)",6,ALC,ALC1_1,CrdPrev,OVERALL,OVR


## numpy and pandas provide some great functionality for data analysis

**Question:** What are the primitive data types? Why are they called primitive? 

In addition to primitive data types, there are a lot of custom ones, built to make your life easier with respect to a specific problem. 

**Question:** What is the key data type that numpy offers beyond the capabilities of the Python Standard Library? Why was this useful?  

**Question:** What are the key data types that pandas offers? Why are they useful? 

**Question:** What happened when we imported the CDIs .csv dataset? 

**Question:** Use the pandas documentation (and/or the method itself!) to figure out what the .describe() method below does. What type of object does it take? 

In [57]:
data.describe()

| | YearStart | YearEnd | DataValueAlt | LowConfidenceLimit | HighConfidenceLimit | LocationID |
|---|---|---|---|---|---|---|
| count | 404155.0 | 404155.0 | 271797.0 | 244930.0 | 244930.0 | 404155.0 |
| mean | 2012.720899 | 2012.747735 | 735.5094 | 49.968239 | 62.905365 | 31.004137 |
| std | 1.552934 | 1.522028 | 18073.38 | 83.909972 | 96.295604 | 17.700753 |
| min | 2001.0 | 2001.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2011.0 | 2011.0 | 19.0 | 13.0 | 20.2 | 17.0 |
| 50% | 2013.0 | 2013.0 | 41.9 | 31.4 | 45.9 | 30.0 |
| 75% | 2014.0 | 2014.0 | 71.2 | 57.6 | 72.3 | 45.0 |
| max | 2016.0 | 2016.0 | 3967333.0 | 1293.9 | 2088.0 | 78.0 |


In [58]:
# Accessing some of the attributes can provide further information about the dataset
data.shape

(404155, 24)

In [59]:
data.index

RangeIndex(start=0, stop=404155, step=1)

## Which years are included in these observations? 

One of the first questions you may ask when you take a peek at the file is "how many years does this dataset span?".

In fact, you can ask "what are the unique values?" of any variable, although the answer is rarely useful for continuous variables. For categorical variables, you get even more information by asking "how many observations do I have per unique value?", which also answers the first question.
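
As a rough sketch of these two questions, here is how they might look on a tiny made-up DataFrame. The `toy` table and its values are invented for illustration; only the column names are borrowed from the CDI dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of the CDI dataset
toy = pd.DataFrame({
    "YearStart": [2013, 2013, 2014, 2015, 2015, 2015],
    "LocationAbbr": ["AK", "AL", "AK", "FL", "FL", "AK"],
})

# "What are the unique values?"
years = toy["YearStart"].unique()

# "How many observations do I have per unique value?"
per_year = toy["YearStart"].value_counts()

print(sorted(years))
print(per_year.sort_index())
```

The index of `per_year` holds the unique values, which is why the second question includes the answer to the first.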

In order to answer any of these, you need to be able to take one variable at a time, or in some cases one observation at a time. Let's play with some subsetting operations to get you warmed up.

Subsetting is the act of selecting lower-dimensional slices of a multi-dimensional object. 

### Subsetting

In [60]:
data['YearStart'].head()

0    2013
1    2013
2    2013
3    2013
4    2013
Name: YearStart, dtype: int64

**Question:** What is the type of the object returned by the above?

### Counting starts at 0

Python, like the languages in the C family (C, C++, Java, Perl), starts counting at 0, not 1. The index represents the number of steps it takes to get from the start of the sequence to the element we are interested in.

This is in contrast to MATLAB, R, and Fortran where you will notice that the first element in e.g. a vector is element 1, not 0. 
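
A minimal illustration of zero-based counting, using a plain Python list and a numpy array with made-up values:

```python
import numpy as np

years = [2013, 2014, 2015]        # a plain Python list (made-up values)
first = years[0]                  # index 0 is zero steps from the start
last = years[len(years) - 1]      # the last valid index is length - 1

arr = np.array(years)             # numpy arrays follow the same convention
first_from_array = arr[0]

print(first, last, first_from_array)
```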

This means that in order to get the first element of the YearStart Series, I would need to write:

In [61]:
data['YearStart'][0]

2013

**Question:** What does the : do? 

In [62]:
data[0:1]

Unnamed: 0,YearStart,YearEnd,LocationAbbr,LocationDesc,DataSource,Topic,Question,DataValueUnit,DataValueType,DataValue,...,HighConfidenceLimit,StratificationCategory1,Stratification1,GeoLocation,LocationID,TopicID,QuestionID,DataValueTypeID,StratificationCategoryID1,StratificationID1
0,2013,2013,AK,Alaska,YRBSS,Alcohol,Alcohol use among youth,%,Crude Prevalence,22.5,...,26.1,Overall,Overall,"(64.84507995700051, -147.72205903599973)",2,ALC,ALC1_1,CrdPrev,OVERALL,OVR


**Question:** What does the expression below do? 

There are lots of ways to subset this pandas DataFrame. One way is to use the isin() method, which allows us to filter by value of a variable. 

Let's try that for filtering only the records that represent data on Florida, abbreviated as FL, and ask to return how many of them there are.

In [63]:
data[data['LocationAbbr'].isin(['FL'])].shape

(7750, 24)
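
To see what happens under the hood, it can help to split the expression into two steps. The sketch below uses a small invented DataFrame (`toy`), not the CDI data: `isin()` first produces a boolean Series with one entry per row, and indexing with that Series keeps only the rows marked True.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the CDI dataset
toy = pd.DataFrame({
    "LocationAbbr": ["FL", "GA", "FL", "AL", "GA"],
    "DataValueAlt": [30.1, 28.4, 31.0, 35.0, 27.9],
})

mask = toy["LocationAbbr"].isin(["FL", "GA"])   # boolean Series, one entry per row
subset = toy[mask]                               # keeps only rows where mask is True

print(mask.tolist())
print(subset.shape)
```

Because `isin()` takes a list, the same pattern works for one value or for many.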

**Exercise:** In the cell below, write an expression that will return the Crude Prevalence of "Alcohol use among youth" for the state of Florida in the year 2013.

Two other key tools to know from pandas are the .loc[] and .iloc[] indexers. Note that they take square brackets, not parentheses.

**Question:** Look up what they do and how to use them in the pandas documentation.

**Exercise:** Write an expression that subsets anything you want out of the data using .loc[] or .iloc[].

# Quality control, preprocessing and exploratory analyses

Cleaning your data, preprocessing, quality control (QC) and exploratory analyses are all words that we use to describe inter-related processes.

This stage is where you'll spend the majority of your time. Once you have data that you can trust, you can do all kinds of analyses on it. But it is crucial that you first gain trust in the data. 

In our example, most of the QC and preprocessing has been done for us by the analysts at the CDC. It is generally a good idea for those doing the QC to be as close to the data collection as possible - their knowledge of the data generation process is an advantage in both identifying and deciding how to deal with QC issues. 

## Preprocessing

The term "preprocessing" refers to manipulations such as:
* converting all measurements of e.g. distance to the same unit
* consolidating how missing data is represented into a common value
* consolidating values into a unique set (most often a problem with survey data, particularly with free text where you may have a variety of responses that all mean the same thing)
* merging your data with another dataset 
* normalizing the measurements to a common standard 
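
Two of the manipulations above (unit conversion and consolidating missing values and spellings) might look roughly like this. Everything in the sketch is an assumption made for illustration: the `raw` table, its column names, and the codes "N/A" and "-999" are invented.

```python
import pandas as pd

# Hypothetical raw data with mixed units and inconsistent coding
raw = pd.DataFrame({
    "distance": [1.0, 2500.0, 3.2, 800.0],
    "unit": ["km", "m", "km", "m"],
    "response": ["yes", "N/A", "Yes", "-999"],
})

# Convert every distance to kilometres
to_km = {"km": 1.0, "m": 0.001}
raw["distance_km"] = raw["distance"] * raw["unit"].map(to_km)

# Represent every missing-value code as NaN, then consolidate spellings
missing_codes = ["N/A", "-999"]
response = raw["response"].mask(raw["response"].isin(missing_codes)).str.lower()

print(raw["distance_km"].tolist())
print(response.tolist())
```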

Often, you will find that you'll also need to reshape your data so that libraries that were designed to work with data in a specific format can be used in your analysis. 

Lastly, some may argue that variable transformations are also part of the pre-processing routine. Transformations may be necessary for a number of reasons. For example, if a variable v is strongly right-skewed (approximately log-normal, say), you may want to use log(v) in your analysis instead.
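
Here is a small sketch of a log transformation on simulated data. The variable `v` is drawn from a log-normal distribution purely as an example of a right-skewed, strictly positive variable; it is not taken from the CDI dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.lognormal(mean=3.0, sigma=1.0, size=1000)  # simulated right-skewed variable

log_v = np.log(v)  # defined here because every simulated value is > 0

# A mean well above the median signals a long right tail;
# after the log transform the two should sit much closer together
print(np.mean(v), np.median(v))
print(np.mean(log_v), np.median(log_v))
```

If a variable can be zero, log(v) is undefined, and you would need a different transformation or an offset chosen with care.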

## Quality control and data cleaning

The terms "cleaning" and "quality control" refer to manipulations such as:
* removing untrusted observations
* removing entire columns or rows of data depending on a set of filtering criteria
* identifying any structure in the data that doesn't have to do with the variables that you are interested in but rather with how the data was collected
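
The first two kinds of manipulation could be sketched as below. The `toy` table, the `empty_col` column, and the plausibility threshold of 100 are all assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: percentages, a footnote column, and an empty column
toy = pd.DataFrame({
    "value": [22.5, 35.0, np.nan, 36.0, 250.0],
    "footnote": [np.nan, np.nan, "No data available", np.nan, np.nan],
    "empty_col": [np.nan] * 5,
})

# Remove columns that are entirely missing
no_empty_cols = toy.dropna(axis="columns", how="all")

# Remove rows without a value, then rows that fail a plausibility check
# (a percentage cannot exceed 100; the threshold is an assumption for this toy data)
kept = no_empty_cols.dropna(subset=["value"])
kept = kept[kept["value"] <= 100]

print(list(no_empty_cols.columns))
print(kept["value"].tolist())
```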

**Question:** Under what conditions would you want to remove a column rather than simply ignore it in your analyses? 

**Question:** If we were performing an fMRI study and had data from multiple scanners in our dataset, there would be structure in the data that reflects the difference in scanners rather than a variable of interest. Under what conditions would this be a problem?

**Question:** In the example above with the multiple MRI scanners, why is this under QC? 

The goal is to end up with a dataset that you can confidently analyze, a dataset that you can trust actually represents the reality of what was measured. 

Many fields have figured out what the standard QC metrics are for their type of data. For example, for genetic data generated using microarray technology, the standard set of QC procedures includes checking that the genotypic sex of the sample matches the manifest, applying filters that exclude samples with >5% of genotypes missing, and applying filters that exclude genotypes missing in >5% of the samples. Failures of these checks often indicate a sample mix-up, a low-quality DNA sample, or a failed genotyping assay, respectively.
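
As an illustration only, the two missingness filters could be written along these lines. The genotype matrix is simulated, the names `calls`, `sample_*`, and `variant_*` are invented, and applying the sample filter before the genotype filter is one possible choice rather than a field standard.

```python
import numpy as np
import pandas as pd

# Hypothetical genotype calls: rows are samples, columns are variants, NaN = missing call
rng = np.random.default_rng(1)
calls = pd.DataFrame(
    rng.integers(0, 3, size=(20, 50)).astype(float),
    index=[f"sample_{i}" for i in range(20)],
    columns=[f"variant_{j}" for j in range(50)],
)
calls.iloc[0, :10] = np.nan    # sample_0 loses 10 of its 50 calls
calls.iloc[:4, 49] = np.nan    # variant_49 is missing in 4 of the 20 samples

threshold = 0.05

# Exclude samples with more than 5% of genotypes missing
sample_missing = calls.isna().mean(axis=1)
filtered = calls.loc[sample_missing <= threshold]

# Exclude genotypes missing in more than 5% of the remaining samples
variant_missing = filtered.isna().mean(axis=0)
filtered = filtered.loc[:, variant_missing <= threshold]

print(calls.shape, "->", filtered.shape)
```

Note that the order of the two filters can change which rows and columns survive, which is one more reason to think through each filter rather than apply it blindly.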

You should think critically through each of the QC filters that you apply.  Depending on the question you are asking and on other properties of the data, the standard filters may not be appropriate for your analysis! It is really, really important to let the data and your questions guide your QC approach, rather than follow a protocol blindly. Granted, you will most likely end up not deviating from the protocol, especially if it's been around for long enough to be well refined, but it will be your job as a data scientist to make sure that you know why you are applying a QC filter and why you are choosing a specific filtering threshold over another. 

## Exploratory analyses

Exploratory analyses often feed back into pre-processing and QC because they identify issues that can be addressed by e.g. transforming a variable or e.g. applying a QC filter. 

In addition, exploratory analyses calibrate your intuition for the data and give you a feel for the information that is available, and for the information that is not available given how the data was collected.

Some things that you can do to explore your data are:
* plot the distributions of variables and get some descriptive statistics
* plot scatterplots to get a sense for the relationship between variables

There are three main categories of descriptive statistics:
* frequencies
* measures of central tendency
* measures of variation. 

Which of these will be meaningful depends on the type of variable that you have. 
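
A quick sketch of the three categories on invented data: frequencies for a categorical variable, and central tendency and variation for a numeric one. The `toy` values are made up; only the column names mirror the CDI dataset.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric variable
toy = pd.DataFrame({
    "Topic": ["Alcohol", "Asthma", "Alcohol", "Diabetes", "Alcohol", "Asthma"],
    "DataValueAlt": [22.5, 9.1, 35.0, 10.4, 36.3, 8.7],
})

# Frequencies: meaningful for a categorical variable
freq = toy["Topic"].value_counts()

# Central tendency and variation: meaningful for a numeric variable
center = toy["DataValueAlt"].agg(["mean", "median"])
spread = toy["DataValueAlt"].agg(["std", "min", "max"])

print(freq)
print(center)
print(spread)
```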

**Question:** What are the various types of variables? How are they related to each other?  

### Frequency tables

In [64]:
data['LocationAbbr'].value_counts()

KY    7782
NC    7782
WI    7782
SC    7782
NV    7782
NJ    7782
NE    7782
NY    7781
FL    7750
AZ    7750
IA    7749
VT    7735
HI    7691
NM    7683
MI    7683
OR    7683
SD    7683
WV    7683
WA    7683
AR    7651
CO    7650
RI    7578
MD    7578
UT    7578
MA    7577
CA    7545
MS    7536
LA    7389
MT    7389
ND    7389
VA    7389
MO    7389
OH    7389
OK    7389
PA    7389
WY    7389
TX    7389
KS    7389
IN    7388
MN    7388
TN    7387
ME    7387
NH    7386
ID    7357
AL    7357
AK    7357
GA    7357
IL    7357
CT    7356
DC    7355
DE    7354
PR    5638
GU    5600
VI    5554
US    2577
Name: LocationAbbr, dtype: int64

**Question:** What is the type of the object returned by the .value_counts() method? 

**Exercise:** In the cell below, write an expression that will return the value counts just for the state of FL. 

**Exercise:** In the cells below, write expressions that will return:
1. The value counts for the topics covered by the CDI instruments.
2. The value counts for the questions asked for the topic "Alcohol". 

### 2D frequency tables

**Question:** Why would you want to generate a 2D frequency table? 

**Question:** Why am I using pd.crosstab() instead of data.crosstab() in the example below?

In [5]:
pd.crosstab(data['LocationAbbr'], data['Topic'])

LocationAbbr,Alcohol,Arthritis,Asthma,Cancer,Cardiovascular Disease,Chronic Kidney Disease,Chronic Obstructive Pulmonary Disease,Diabetes,Disability,Immunization,Mental Health,"Nutrition, Physical Activity, and Weight Status",Older Adults,Oral Health,Overarching Conditions,Reproductive Health,Tobacco
AK,517,582,574,216,1152,220,922,1216,1,48,103,591,74,141,577,32,391
AL,517,582,574,216,1152,220,922,1216,1,48,103,591,74,141,577,32,391
AR,517,582,616,216,1236,220,1006,1300,1,48,103,591,74,141,577,32,391
AZ,517,582,631,216,1236,220,1090,1300,1,48,103,591,74,141,577,32,391
CA,517,582,595,216,1194,220,1006,1258,1,48,103,590,74,141,577,32,391
CO,517,582,616,216,1236,220,1006,1300,1,48,103,590,74,141,577,32,391
CT,517,582,574,216,1152,220,922,1216,1,48,103,591,74,141,577,32,390
DC,517,582,574,216,1152,220,922,1216,1,48,103,590,74,141,577,32,390
DE,515,582,574,216,1152,220,922,1216,1,48,103,590,74,141,577,32,391
FL,517,582,631,216,1237,220,1090,1300,1,48,103,591,74,141,577,32,390


###  Histograms and barplots

**Question:** What is the difference between histograms and barplots?

Look through the documentation for the Visualization functionality provided in pandas. What are some plotting functions that are available to you? 

In [85]:
plt.style.use('ggplot') # Look up Hadley Wickham. Do it!

# First let's select a meaningful subset of the data
# Filter for the topic of Alcohol, and for the question Alcohol use among youth
alcohol_youth = data[data['Topic'].isin(['Alcohol']) &
                  data['Question'].isin(['Alcohol use among youth'])
                 ]

# Let's try to plot the DataValues for the questions in the topic 'Alcohol'
alcohol_youth.DataValue.plot(kind = 'bar')
plt.show()


TypeError: Empty 'DataFrame': no numeric data to plot

**Question:** Why doesn't the above work? How can we get a clue? 

**Exercise:** Come up with an expression that would help you get on your way to fixing it. 

In [86]:
# If the CDC analysts hadn't already preprocessed the data,
# we could add a column in which we coerce the DataValue elements to floats.
# Work on a copy so the new column is not assigned to a slice of `data`.
alcohol_youth = alcohol_youth.copy()
alcohol_youth['NumDataValue'] = pd.to_numeric(alcohol_youth['DataValue'], errors='coerce')

alcohol_youth['NumDataValue'].head()

0    22.5
1    35.0
2    36.3
3    36.0
4     NaN
Name: NumDataValue, dtype: float64

Okay, now let's try the plot again. 

**Exercise:** Write the code that should produce the plot in the box below. Use the documentation to tidy it up. Hint: look for a way to make it horizontal, name the axes, and add the error bars from the relevant columns in the alcohol_youth dataframe.  

### More descriptive plots 

Boxplots are useful for exploring relationships between a continuous variable and a categorical variable. 

Scatterplots are useful for exploring relationships between two continuous variables.  
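
For orientation, here is one way the two plot types could be drawn with matplotlib. The data is simulated and the group names "A" and "B" are placeholders, so treat this as a sketch of the plot types rather than an analysis of the CDI data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)

# Simulated data: a continuous variable in two categories,
# and a pair of related continuous variables
groups = {"A": rng.normal(30, 5, 50), "B": rng.normal(40, 8, 50)}
x = rng.normal(50, 10, 100)
y = 0.5 * x + rng.normal(0, 5, 100)

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Boxplot: continuous variable (y axis) against a categorical variable (x axis)
ax_box.boxplot(list(groups.values()))
ax_box.set_xticklabels(list(groups.keys()))
ax_box.set_ylabel("continuous variable")

# Scatterplot: two continuous variables
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")

fig.tight_layout()
```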

### Missing data

In your exploratory analysis phase, you'll also want to know how much missing data you have, which variables have the most missing data, and if there are any other patterns to the missingness. 
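
One possible way to look for patterns in missingness is to compare the fraction of missing values across groups. The sketch below uses an invented `toy` table, and grouping by location is just one of many groupings you might try.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the CDI dataset
toy = pd.DataFrame({
    "LocationAbbr": ["AK", "AK", "CA", "CA", "CA", "FL"],
    "DataValueAlt": [22.5, 20.1, np.nan, np.nan, 18.0, 30.2],
})

# Fraction of missing values within each location
missing_by_location = toy["DataValueAlt"].isna().groupby(toy["LocationAbbr"]).mean()

print(missing_by_location)
```

If missingness is concentrated in a few groups, as it is for "CA" in this toy example, that is a pattern worth understanding before you filter anything out.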

**Exercise:** For each of the variables in the original data, write the expressions that will return a count of how many values are missing. 

## Sources of bias in the data

Hopefully by now you're starting to get a sense of how exploratory analyses can feed into creating your QC pipeline.

It is very, very important to be careful when applying QC measures so that you don't generate bias in your dataset! 

**Question:** What is bias? Can you think of a scenario under which you have introduced bias through a QC filter?

**Question:** What may be other sources of bias?

## Organizing your pipelines

Once you have streamlined your quality control process, and you feel like you understand what is in the data, you can exit the loop and put together a pipeline that takes the dataset through quality control linearly. 

At that point it would make sense to split that into a different notebook than your lab/development notebook. You should still be using Markdown or comments or some other way to describe what you are doing at each step of your QC pipeline, but it won't necessarily be clear in that report alone how you came across a problem in the data which you're addressing with each of the quality control steps. You can give a high-level description of the kind "In the exploratory analysis I noticed X, which led to the development of the QC filter Y using the value M as the filtering threshold."
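
A linear QC pipeline could be organized as a chain of small, documented functions, for example like this. The function names, the toy data, and the threshold of 100 are hypothetical; the point is only that each step is named, commented, and applied in a fixed order.

```python
import numpy as np
import pandas as pd


def drop_missing_values(df, column):
    """QC step 1: remove observations with no recorded value."""
    return df.dropna(subset=[column])


def keep_plausible_percentages(df, column, upper=100.0):
    """QC step 2: remove values above a threshold chosen during exploration."""
    return df[df[column] <= upper]


# Hypothetical toy data standing in for the original dataset
toy = pd.DataFrame({"DataValueAlt": [22.5, np.nan, 35.0, 250.0]})

# Each .pipe() call passes the DataFrame through one QC step, in order
clean = (toy
         .pipe(drop_missing_values, column="DataValueAlt")
         .pipe(keep_plausible_percentages, column="DataValueAlt", upper=100.0))

print(clean)
```

Keeping each filter in its own function also gives you a natural place to record why the filter exists and how its threshold was chosen.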

## When should I export a manipulated dataset as a new file?

First of all, let's be clear that your manipulated dataset should be saved separately from your original data! Separate directories, labelled descriptively.

At any point in the process, you can export your dataset, for example by writing a manipulated DataFrame back into a new .csv. It is up to you to choose when it is appropriate to do that. Try to minimize the number of "slightly different" copies that you have of the same dataset. You'll have the scripts that manipulate the dataset to put it into its final form, and which make your analysis reproducible, so it's less important to have the intermediate datasets saved than it is to save your scripts, complete with plenty of comments and designed modularly.  
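
Writing a manipulated DataFrame to its own directory might look like the sketch below. The directory name `data_processed` and the file name `cdi_clean.csv` are made up, and a temporary directory stands in for your project root so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned data
clean = pd.DataFrame({"LocationAbbr": ["AK", "AL"], "DataValueAlt": [22.5, 35.0]})

# A temporary directory stands in for the project root in this sketch
project = Path(tempfile.mkdtemp())
processed_dir = project / "data_processed"     # hypothetical directory name
processed_dir.mkdir(parents=True, exist_ok=True)

out_path = processed_dir / "cdi_clean.csv"     # hypothetical file name
clean.to_csv(out_path, index=False)            # index=False avoids an extra unnamed column

# Read the file back to check that it round-trips
reloaded = pd.read_csv(out_path)
print(reloaded)
```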

Weigh the storage space needed to save multiple forms of a dataset, and the confusion of not knowing offhand how those forms differ, against the benefit of not having to wait for the dataset to be processed again.

It's already becoming evident from this discussion that I am having with myself that for most pipelines you'll want the final form of the manipulated, pre-processed, clean data to be the only copy that you have stored, other than the original data. Never overwrite the original data.

## Next steps