# Lesson 3: Introduction to Data Analysis


## 0.1: What is data analysis?

At this point, we've gone through data manipulation using `numpy` and `pandas`, and basic visualization using `matplotlib`. Whenever using data in research, it's always a great idea to start the same way: explore the data and visualize it in simple ways to get a feel for the dataset you are working with. Once you have a good understanding of the data itself, often the next step is to _analyze_ the data. When we refer to data analysis, we are usually talking about using a computational model or tool to investigate whether the data fits a certain hypothesis. 

For example, you might want to look at the data in fewer dimensions using a **dimensionality reduction** technique. This is an _unsupervised_ way to see patterns in your dataset, and is really useful when the data you are working with has tons of features (which biological data often does, such as expression or activity of several genes/proteins).

You also might want to learn whether your data naturally **clusters** into multiple groups with distinct features. This is another _unsupervised_ approach to finding patterns in the data that might have distinct groups. This can be really useful if you are interested in distinct categories in the data; for example, do the cases and controls cluster separately from each other? Do your samples fall into different phenotypes?

Finally, you can try to **predict** a particular outcome (the dependent variable) from several features (the independent variables), for example with a **regression** model, or **classify** samples into known groups. When each sample is labeled (e.g. case vs. control, or some other categorical label), we can use a _supervised_ approach: fit a model to the data and see how well the input features predict that outcome or label.

_Note: technically these unsupervised and supervised approaches are types of machine learning, which is about coding programs that automatically adjust their performance from exposure to information encoded in the data. This is a subfield of Artificial Intelligence. More complicated ML methods, like neural networks, will not be covered in this JumpStart, but they are built on the same foundations as the methods described here._

<img src="http://1.bp.blogspot.com/-ME24ePzpzIM/UQLWTwurfXI/AAAAAAAAANw/W3EETIroA80/s1600/drop_shadows_background.png" width="800px"/>


## 0.2: Representing data in `scikit-learn`

We will use a package called `scikit-learn` to implement our data analysis. This package includes functions for all of the techniques described above, as well as methods for data wrangling such as normalization. It also includes some built-in datasets that are useful for testing any models you build.
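As a rough map from the three kinds of analysis described above to `scikit-learn`, here is a minimal sketch. The specific estimators (`PCA`, `KMeans`, `LogisticRegression`) and their settings are illustrative choices, not ones this lesson prescribes, and the sketch borrows the built-in iris dataset that is loaded later in this section.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Built-in example data: X holds the features, y holds the known labels
iris = load_iris()
X, y = iris.data, iris.target

# Dimensionality reduction (unsupervised): project 4 features down to 2
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering (unsupervised): group samples without looking at y
# (3 clusters is an assumption made for this illustration)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Classification (supervised): learn to predict y from X
clf = LogisticRegression(max_iter=1000).fit(X, y)
train_accuracy = clf.score(X, y)

print(X_2d.shape, cluster_labels.shape, train_accuracy)
```

Note that the unsupervised steps only ever see `X`, while the supervised step needs both `X` and `y`. The accuracy printed here is measured on the same data the model was fit to, so it is an optimistic estimate.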

We will import `scikit-learn` below:


```python
import sklearn
```

Most algorithms implemented in scikit-learn expect data to be stored in a
**two-dimensional array or matrix**.  The arrays can be
either `numpy` arrays, or in some cases `scipy.sparse` matrices.
The shape of the array is expected to be `[n_samples, n_features]`:

- **n_samples:**   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in a database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- **n_features:**  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.
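To make the `[n_samples, n_features]` layout concrete, here is a small sketch with made-up numbers. The column names `gene_a` and `gene_b` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical measurements: each row is one sample, each column one feature
df = pd.DataFrame({
    "gene_a": [1.2, 0.7, 3.1],
    "gene_b": [0.0, 2.4, 1.8],
})

# Convert the DataFrame to a 2-D numpy array
X = df.to_numpy()

# Rows are samples, columns are features
n_samples, n_features = X.shape
print(n_samples, n_features)
```

Here `X` has three samples and two features, so its shape is `(3, 2)`.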

The number of features must be fixed in advance. However, the data can be very high-dimensional
(e.g. millions of features), with most of the values being zero for a given sample. This is a case
where `scipy.sparse` matrices can be useful, in that they are
much more memory-efficient than numpy arrays.
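The memory difference can be seen in a small sketch. The matrix size and the number of non-zero entries below are arbitrary values picked for illustration, and CSR is just one of several sparse formats that `scipy.sparse` offers.

```python
import numpy as np
from scipy import sparse

# Made-up data: 1,000 samples x 2,000 features, almost all zeros
n_samples, n_features = 1000, 2000
rng = np.random.default_rng(0)
dense = np.zeros((n_samples, n_features))

# Set at most 5,000 randomly chosen entries to 1 (positions may repeat)
rows = rng.integers(0, n_samples, size=5000)
cols = rng.integers(0, n_features, size=5000)
dense[rows, cols] = 1.0

# Store the same data in compressed sparse row (CSR) format
X_sparse = sparse.csr_matrix(dense)

# Compare memory use: the dense array stores every zero explicitly,
# while CSR stores only the non-zero values plus their index arrays
dense_bytes = dense.nbytes
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print(X_sparse.shape, dense_bytes, sparse_bytes)
```

Both objects have the same `[n_samples, n_features]` shape, but the sparse version only pays for the entries that are not zero.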

In the code below, we will use a mix of `numpy` arrays (which you learned about on Day 1) and `pandas` DataFrames (which you learned about earlier today).

We'll start with a basic dataset that is preloaded in sklearn: the iris dataset. 

```python
from sklearn.datasets import load_iris
iris = load_iris()
```
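A quick way to check that the loaded data follows the `[n_samples, n_features]` layout is to inspect a few attributes of the returned object. This is a minimal sketch; wrapping the array in a DataFrame is a convenience added here for viewing, and the variable names `X`, `y`, and `iris_df` are our own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
X = iris.data
# Target vector: one integer species label per flower
y = iris.target

print(X.shape)             # (n_samples, n_features)
print(iris.feature_names)  # names of the columns of X
print(iris.target_names)   # species names that the integers in y refer to

# Optional: view the features as a labeled pandas DataFrame
iris_df = pd.DataFrame(X, columns=iris.feature_names)
print(iris_df.head())
```

The iris data has 150 samples and 4 features, so `X.shape` is `(150, 4)` and `y` has 150 entries.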

## Section 1: Linear Regression

We'll start by talking about a model you are probably familiar with from math class: linear regression, also known as finding a line of best fit. In the example below, we will work through a regression problem with one independent variable and one dependent variable. This data will be read in as a pandas DataFrame.