> **Jupyter slideshow:** This notebook can be displayed as slides. To view it as a slideshow in your browser, type the following in the console:
 
 
> `> jupyter nbconvert [this_notebook.ipynb] --to slides --post serve`
 
 
> To toggle off the slideshow cell formatting, click the `CellToolbar` button, then `View > Cell Toolbar > None`.

For more help, check out [this tutorial](https://drive.google.com/open?id=17q01buf7YFuB4yF8cFjmnc_ZARQWOljGhlHs99i9X0c).

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
 
# What is Data Science?
 
_Authors: Alexander Egorenkov (DC), Amy Roberts (NYC)_
 
---


<a id="learning-objectives"></a>
### Learning Objectives
*After this lesson, you will be able to:*

- Describe the roles and components of a successful development environment.
- Define data science and the data science workflow.
- Apply the data science workflow to solve a task.
- Discuss common data science terminology and processes.

### Lesson Guide
- [Learning Objectives](#learning-objectives)
- [Welcome to GA](#welcome)
- [Intro: What Is Data Science?](#wids)
- [Intro: The Data Science Workflow](#DSWF)
	- [Asking a Good Question](#good_q)
	- [Example: A Data Science Project](#futurama)
	- [Practice: The Data Science Workflow](#practice1)
- [Data and Data Set Types](#data-types)
- [Common Data Terminology](#data-terms)
- [Intro: Machine Learning](#ML)
	- [Supervised Learning](#supervised)
        - [Practice: Regression or Classification](#practice2) 
    - [Unsupervised Learning](#unsupervised)
- [Algorithms](#algorithm)    
- [Intro: scikit-learn](#sklearn)
- [Conclusion](#conclusion)
- [Additional Resources](#add_res)

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 35px">

## Welcome to GA!
<a id="welcome"></a>

### GA offers a special learning environment.

- Introduce the instructors, instructional assistants, course producers.
- GA is a global community of individuals empowered to pursue the work we love.
- GA Resources — discounts, community events, office hours.
- GA feedback loop — exit tickets, mid-course feedback, final feedback.

### Road to Success

- Emotional cycle of change.
- Student learning responsibility.
- GA graduation requirements
- After GA — build network, find opportunities, community, perks
- Q/A.


<a id="wids"></a>
## Introduction: What Is Data Science?

---

- A set of tools and techniques used to extract useful information from data.
- An interdisciplinary, problem-solving-oriented framework.
- An application of scientific techniques to practical problems.

![Data Science venn diagram](./assets/images/datascience-vd.png)

### What Is Data? 

- "Factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation" — _Merriam-Webster._

Data is typically referred to as being structured or unstructured.

You may typically think about data as "structured" — i.e., contained within organized rows and columns in spreadsheets.

"Unstructured" refers to most other data, such as reviews on Yelp.


### Why Use Data Science?

Welcome to the age of data. It's no secret that, over the past decade, there has been an exponential increase in the amount of data being collected. But what good is this information without anyone to extract insight from it? 

The number of data analysts and use of data analysis have risen vastly in recent years, and from it the field of data science has emerged. So how do we distinguish data science from traditional analytics? A helpful comparison is that analytics is useful for extracting explicit insights from data — i.e., summary statistics, trends, correlations — while data science is a means of extracting implicit insights such as multivariable correlations, natural language processing, and machine learning models.  

The additional insights we can gain with data science have increased the functionality of data and the value of the analytical process. Consequently, part of data science is developing new methods for extracting insights from existing data sets to increase their value.  

These are just some of the reasons the data science field has grown and continues to grow rapidly.

### Who Uses Data Science?

You've probably seen data science in action more than you realize. It's used for:

- Providing movie recommendations on Netflix.
- Making product suggestions on Amazon.
- Offering election and sports coverage on the stats site FiveThirtyEight.
- Calculating daily bet predictions on the fantasy sports site DraftKings.
- Returning auto translate and search results on Google.
- And much more.

Take a moment to think about other examples you might have encountered. Have you ever thought, "How did it know that?" or "Hey, that was easy!" about something that could have been based on models or predictive analytics?

### Who Are Data Scientists?

Data scientists don't always have the same title. Here are a few common roles that may use data science techniques:

* Developer
* Researcher
* Data visualization expert
* Data analyst
* Business intelligence analyst

![Data Science Roles](./assets/images/datascienceroles.jpg)

To understand the role of a data scientist, it's important to understand the skills involved. Data scientists commonly leverage the following:

- Business intelligence
- Machine learning modeling
- Database design/"big data"
- Applied math and statistics
- Programming
- Study design

![Data Science Skills](./assets/images/datasci-skills.jpg)

**Data Science Skills, Broken Down by Role**

So how are these capabilities applied? They may not be evenly distributed in each role. Let's look at how some of the positions we mentioned earlier might use each of these skills:

![Data Science Skills by Role](./assets/images/datasci-skills-by-role.jpg)


Keep in mind that these are just generalizations, and skill combinations can vary by industry, location, or company.  It is also important to note that no one data scientist is an expert in everything.

### How Do Data Scientists Solve Problems?

How do data scientists solve problems? Most practitioners apply a version of the scientific method in order to logically deconstruct and analyze an issue. At General Assembly, we call this the data science workflow, which we've broken down into a series of steps.

This problem-solving framework will help you produce results that are reliable (so your findings will be more accurate) and reproducible (so others can follow your steps and achieve the same results).

Note that, depending on the problem, this process is not always linear. You may require lots of iteration and repetition before any conclusions can be made!

<a id="DSWF"></a>
## Introduction: The Data Science Workflow

---

Throughout the course and for our projects we will be following a general workflow. This workflow will help you produce *reliable* and *reproducible* results.

- **Reliable**: Accurate findings.
- **Reproducible**: Others can follow your steps and achieve the same results.

### Steps in the Data Science Workflow

- **Frame**: Develop a hypothesis-driven approach to your analysis.
- **Prepare**: Select, import, explore, and clean your data.
- **Analyze**: Structure, visualize, and complete your analysis.
- **Interpret**: Derive recommendations and business decisions from your data.
- **Communicate**: Present (edited) insights from your data to different audiences.

#### Notes on the Data Science Workflow

_These are not hard-set rules that must be followed; use them as guidelines._


- The process is cyclical, so repetition of steps is completely natural.
- Not all steps will be necessary in every project.

<a id="good_q"></a>
### Asking a Good Question

Even though data science projects follow different general flows, they all start in the same place: with a problem. From this problem statement arise questions that we will ask of the data in order to gain more information and attempt to find a solution.

**Why do we need a good question?**

_“A problem well stated is half solved.”_ — Charles Kettering

A good question: 

- Sets you up for success as you begin analysis.
- Establishes the basis for reproducibility.
- Enables collaboration through clear goals.
    - It's hard to collaborate without a vision.

One way to approach formulating a question is through goal-setting via the SMART Goals Framework.

- **Specific**: The data set and key variables are clearly defined.
- **Measurable**: The type of analysis and major assumptions are articulated.
- **Attainable**: The question you are asking is feasible for your data set and is not likely to be biased.
- **Reproducible**: Another person (or future you) can read and understand exactly how your analysis is performed.
- **Time-bound**: You clearly state the time period and population to which this analysis pertains.

#### What Are Some Common Questions Asked in Data Science?

**Machine learning more or less asks the following questions:**

- Does X predict Y? (Where X is a set of data and Y is an outcome.)
- Are there any distinct groups in our data?
- What are the key components of our data?
- Is one of our observations “weird”?

**From a business perspective, we can ask:**

- What is the likelihood that a customer will buy this product?
- Is this a good or bad review?
- How much demand will there be for my service tomorrow?
- Is this the cheapest way to deliver my goods?
- Is there a better way to segment my marketing strategies?
- What groups of products are customers purchasing together?
- Can we automate this simple yes/no decision?

_This list may seem limited, but we rewrite most questions to fit this form._

Here is an example of a [SMART question](assets/SMART_example.md).

<a id="futurama"></a>
### Example: A Data Science Project
<img src="assets/images/futurama.png" width="500">

#### Project 1: Futurama

For those of you not familiar with _Futurama_, here are some quick notes about the show:

- It's an animated comedy series set in the year 3000.
- It was created by Matt Groening.
- It focuses on the adventures of the space delivery company Planet Express.
- You can learn more on the [Comedy Central Website](http://www.cc.com/shows/futurama) and on this [Wikipedia](https://en.wikipedia.org/wiki/Futurama) page.


Image credit to [KT-245](http://kt-245.deviantart.com/).

#### FRAME: Understand the Problem

Using Planet Express' customer data from January 3001-3005, determine how likely previous customers are to request a repeat delivery using demographic information (e.g., profession, company size, location) and previous delivery data (e.g., days since last delivery, number of total deliveries, etc.).

- Identify the business/product objectives.
	- How likely are previous customers to request a repeat delivery?
- Identify and hypothesize goals and criteria for success.
	- What factors are likely to influence a customer's decision to reuse Planet Express for delivery?
- Create a set of questions to help you identify the correct data set.

#### PREPARE: Obtain, Understand, Structure, and Clean the Data

**Ideal Data vs. Data That is Available**  

Oftentimes we'll start by identifying the *ideal data* we would want for a project.

Then, during the data acquisition phase, we'll learn about the limitations on the types of data actually available. We have to decide if these limitations will inhibit our ability to answer our question of interest or if we can work with what we have to find a reasonable and reliable answer.

For this example, our data includes:

- Demographic information (e.g., profession, company size, location).
- Previous delivery data (e.g., days since last delivery, number of total deliveries, etc.).

This is possibly the hardest step in the data science workflow. At this stage, it is common to realize that the problem you are trying to solve may not be solvable with the information available. The data could be incomplete, nonexistent, or unable to meet the criteria necessary to answer your question.  

That said, you now have a better feel for the data available and the information they could contain. You can identify a new, answerable question that ultimately helps you solve or better understand your problem.

**Questions We May Ask About Our Data**

- Is there enough data?
- Does our data appropriately align with the question/problem statement?
- Can the data set be trusted? How was it collected?
- Is this data set aggregated? Can we use the aggregation, or do we need to obtain it pre-aggregation?
- What are necessary resources, requirements, assumptions, and constraints?
- Can we import data from the web (Google Analytics, HTML, XML)?
- Can we import data from a file (CSV, XML, TXT, JSON)?
- Can we import data from a preexisting database (SQL)?
- Can we set up local or remote data structures?
- What are the most appropriate tools for working with the data?
- Do these tools align with the format and size of the data set?
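
As a rough illustration of the file-import questions above, here is a minimal sketch using only Python's standard library. The CSV text, column names, and values are hypothetical stand-ins for a real export, not actual Planet Express data.

```python
import csv
import io
import json

# Hypothetical CSV export; in practice this text would come from a file on disk.
csv_text = """profession,company_size,location,days_since_last_delivery,num_deliveries
Captain,1,Earth,12,40
Bureaucrat,3,Amphibios 9,3,65
"""

# csv.DictReader yields one dict per row, keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Everything read from a CSV is a string, so numeric columns need converting.
for row in rows:
    row["days_since_last_delivery"] = int(row["days_since_last_delivery"])
    row["num_deliveries"] = int(row["num_deliveries"])

# The same records could arrive as JSON, which keeps numbers as numbers.
json_text = json.dumps(rows)
rows_from_json = json.loads(json_text)

print(rows[0]["location"], rows[0]["num_deliveries"])
```

Note that `company_size` is left as text here on purpose: it is a category code, not a quantity.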

####  PREPARE: Understand the Data

Oftentimes, we are given *secondary data*, or data that was collected previously. In these cases, we have to learn as much as possible about our data using tools like data dictionaries and source documentation to determine how the set was gathered.

Here's an example of a data dictionary:

Variable | Description | Type of Variable
---| ---| ---
Profession | Title of the Account Owner | Categorical
Company Size | 1- small, 2- medium, 3- large| Categorical
Location | Planet of the Company | Categorical
Days Since Last Delivery | Integer | Continuous
Number of Deliveries | Integer | Continuous

**Common steps include:**  

- Reading any documentation provided with the data (e.g., the data dictionary above).
- Performing exploratory surface analysis via filtering, sorting, and simple visualizations.
- Describing the data structure and the information being collected.
- Exploring variables and data types via SELECT statements.
- Assessing preliminary outliers and trends.
- Verifying the quality of the data (feedback loop -> 1).

#### PREPARE: Structure and Clean the Data  

Additionally, we'll often need to clean our data prior to performing an analysis.  

Common steps include:

- Sampling the data and determining sampling methodology.
- Iterating and exploring outliers and null values via SELECT statements.
- Assessing qualitative versus quantitative data.
- Formatting and cleaning data in Python (e.g., dates, number signs, formatting).
- Defining how to appropriately address missing values (cleaning).
- Categorizing, manipulating, slicing, formatting, and integrating data.
- Formatting and combining different data points, separate columns, etc.
- Determining most appropriate methods for aggregating, cleaning, etc.
- Creating the necessary derived columns from the data (new data).
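
A few of the cleaning steps above might look like the following sketch. The records, field names, and the two missing-value policies (drop the row, or keep it and mark the value as missing) are illustrative assumptions; the right policy depends on your question.

```python
from datetime import date

# Hypothetical raw records; None and "" stand in for missing values.
raw = [
    {"location": " earth ", "num_deliveries": "40", "last_delivery": "3004-01-15"},
    {"location": "Bogad", "num_deliveries": "", "last_delivery": "3004-02-01"},
    {"location": None, "num_deliveries": "55", "last_delivery": "3004-03-09"},
]

def clean(record):
    """Return a cleaned copy of one record, or None if it cannot be used."""
    if not record["location"]:
        return None  # one possible policy: drop rows with no location
    deliveries = record["num_deliveries"]
    return {
        # Normalize whitespace and capitalization.
        "location": record["location"].strip().title(),
        # Another possible policy: keep the row but mark the value as missing.
        "num_deliveries": int(deliveries) if deliveries else None,
        # Parse the date string into a real date object.
        "last_delivery": date.fromisoformat(record["last_delivery"]),
    }

cleaned = [c for c in map(clean, raw) if c is not None]

# Derived column: month of the most recent delivery.
for c in cleaned:
    c["last_delivery_month"] = c["last_delivery"].month
```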



---

**As you can see, the "Prepare" phase of the data science workflow encompasses several steps. The act of identifying, acquiring, and especially cleaning your data will consume a great deal of time (60%–70%), no matter your position in the data field.**

#### ANALYZE: Perform Exploratory Data Analysis

As an example of basic statistics, you might check the mean, standard deviation, or specific frequency counts of your data.

Variable | Mean (STD) or Frequency (%)
---| ---
Number of Deliveries | 50.0 (10)
Earth | 50 (10%)
Amphibios 9 | 100 (20%)
Bogad | 100 (20%)
Colgate 8| 100 (20%)
Other| 150 (30%)

**These descriptive stats allow us to:**

- Identify trends and outliers.
- Decide how to deal with outliers — excluding, filtering, and communication.
- Apply descriptive and inferential statistics.
- Determine initial visualization techniques.
- Document and capture knowledge.
- Choose visualization techniques for different data types.
- Transform data.
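
Summary statistics like those in the table above can be computed with Python's standard library. The sketch below uses a small made-up sample (not the data behind the table) to show a mean, a standard deviation, and frequency counts.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical sample values for illustration only.
num_deliveries = [38, 45, 50, 52, 65]
locations = ["Earth", "Bogad", "Bogad", "Amphibios 9", "Other"]

avg = mean(num_deliveries)      # arithmetic mean of a numerical variable
spread = stdev(num_deliveries)  # sample standard deviation

# Frequency counts and percentages for a categorical variable.
counts = Counter(locations)
total = len(locations)
for place, n in counts.most_common():
    print(f"{place}: {n} ({n / total:.0%})")
```

The mean and standard deviation only make sense for the numerical variable, while counts and percentages are the natural summary for the categorical one.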

#### ANALYZE: Create a Data Model

We select data models based on the outcome we are interested in or the assumptions of the model we are using. An example of a model statement might look like this:

_"We completed a logistic regression. We calculated the probability of a customer placing another order with Planet Express."_  

Here, we're using a logistic model because we want to determine the probability that a customer may place a return order, which at its heart is a *classification problem*.

**The steps for model building are:**

- Selecting the appropriate model.
- Building the model.
- Evaluating and refining the model.
- Predicting outcomes and action items.
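
To make the idea of a logistic model concrete, here is a minimal sketch of how such a model turns customer features into a probability. The function name and the coefficient values are invented for illustration; a real model would learn its coefficients from data.

```python
import math

def repeat_probability(days_since_last, total_deliveries,
                       intercept=-1.0, w_days=-0.02, w_total=0.05):
    """Probability of a repeat order under a logistic model.

    The default coefficients are made up for this example, not fitted.
    """
    # Linear combination of the features.
    z = intercept + w_days * days_since_last + w_total * total_deliveries
    # The logistic (sigmoid) function squeezes z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

p = repeat_probability(days_since_last=10, total_deliveries=40)

# Turning the probability into a class label is what makes this classification.
label = "repeat" if p >= 0.5 else "no repeat"
print(f"{p:.2f} -> {label}")
```

The 0.5 cutoff is a common default rather than a rule; a business might choose a different threshold depending on the cost of each kind of mistake.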

#### INTERPRET:  Develop Recommendations and Decisions

What good is an analysis that comes to no conclusion? As a data professional, you may not perform this step alone and will instead consult with an expert in the field you're studying.  

With or without a subject matter expert, you will need to draw conclusions about your project/experiment.

- Did you reject or fail to reject your hypothesis?
    - What does this mean for your project?
    - What does this mean for your client?

- Were your questions answered?
    - Which ones?
    - What do you need to do to answer the ones that weren't answered?
    
- Do your findings support any business recommendations, actions, or decisions?
    - Is there further supportive analysis?
    - How does your data support these recommendations?
    
You should be able to translate the results of your findings into a single sentence that leads to a recommendation.

- Conclusion: "Customers from large companies were twice as likely to place another order with Planet Express as customers from small companies."
- Recommendation: "We should target more large companies to use our delivery service."

- Conclusion: "Other than size of company, I found no significant evidence that any other feature affected the odds of customers reusing our delivery service."

#### COMMUNICATE: Share the Results of Your Analysis  

Presentations are a critical part of your analysis. It doesn't matter how brilliant your model is or how illuminating your findings are; if you are not able to effectively communicate your results, they will not be used.

The most basic form of a data science presentation should include a simple sentence that describes your results:

_"Customers from large companies had twice (CI 1.9, 2.1) the odds for placing another order with Planet Express compared to customers from small companies."_
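
A statement like this one rests on an odds ratio and a confidence interval (CI). The sketch below shows one common way to compute both from a 2x2 table of counts, using a Wald interval on the log scale. The counts are hypothetical and chosen only so the odds ratio comes out to 2; they are not the data behind the sentence above, and the resulting interval is wider than the one quoted.

```python
import math

# Hypothetical 2x2 counts (not from the lesson): repeat vs. no repeat order.
large_repeat, large_none = 200, 100
small_repeat, small_none = 100, 100

odds_large = large_repeat / large_none  # odds of a repeat order, large companies
odds_small = small_repeat / small_none  # odds of a repeat order, small companies
odds_ratio = odds_large / odds_small

# Wald interval: log(OR) +/- z * SE, then exponentiate back.
se_log_or = math.sqrt(
    1 / large_repeat + 1 / large_none + 1 / small_repeat + 1 / small_none
)
z = 1.96  # approximate normal quantile for a 95% interval
ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"Odds ratio {odds_ratio:.1f} (CI {ci_low:.2f}, {ci_high:.2f})")
```

A narrow interval such as (1.9, 2.1) generally requires far more observations than the few hundred used in this sketch.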

Data science presentations can also be far more complex and exciting, like some of the [research presented by Nate Silver's FiveThirtyEight blog](http://fivethirtyeight.com/burrito/#brackets-view).

When crafting a presentation, always consider your audience and make sure to practice your presentation beforehand. Consider the types of questions people might ask or — better yet — test your presentation on a few people and pay attention to their response. Clarify and refine your presentation accordingly.

Make sure to consider your needs and goals as well as those of your audience. A presentation created for your fellow data scientists will be vastly different than a presentation intended for executives trying to make a business decision.

**Keys for Crafting a Good Presentation**

- Summarize findings with narrative and storytelling techniques.
- Refine your visualizations for broader comprehension.
- Present both limitations and assumptions.
- Determine the integrity of your analysis.
- Consider the degree of disclosure for various stakeholders.
- Test and evaluate the effectiveness of your presentation beforehand.
- Review additional actions that could offer further insight.

**A Note About Iteration**

Iteration is an important part of *every step* in the data science workflow. At any given point in the process, you may find yourself repeating or going back and redoing steps in order to better understand your data, clarify your model, and refine your presentation.

For example, after presenting your findings, you may want to:

- Identify follow-up problems and questions for future analysis.
- Create a visually effective summary or report.
- Consider the needs of different stakeholders and how your report might be changed for them.
- Identify the limitations of your analysis.
- Identify relationships between visualizations.


<a id="practice1"></a>
### Practice: The Data Science Workflow

---

Use four of the steps from the Data Science Workflow (Frame, Prepare, Analyze, Communicate) to get to know your classmates!

> Students should get into 4-6 groups, spaced at the whiteboards around the room.

#### FRAME: Understand the Problem

Have each group develop one research question it would like to know about the class and form a hypothesis.
> Note: Don't share these questions with the class just yet!

Examples:

- What is your current favorite tool for working with data?
- What are you most excited about learning?
- What can you help your classmates with when it comes to data analysis?

#### PREPARE & ANALYZE: Obtain the Data and Examine It
Rotate through the groups to "collect the data" and record the raw data on the whiteboard.

> Optional: Create an easy, visual way for other students to write their answers, or come up with a quick option to save time.

#### PRESENT: Communicate the Results of Your Analysis  

- Summarize your findings in a narrative.
- Provide a basic visualization for broader comprehension on the whiteboard.
- Have one student present for the group.


<a id="data-types"></a>
## Data Types and Dataset Types

In our previous exercise, you may have noticed that you and your classmates recorded your data in different ways.

### Data Types

**Numerical**

Simply enough, numerical data is typically data represented by numbers.
- Examples: sales, height, passenger count, etc.

**Categorical**

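Categorical data, on the other hand, represents membership in a group or category rather than a measured quantity. It is usually _not_ represented by numbers, although categories are sometimes stored as numeric codes (e.g., company size recorded as 1, 2, or 3).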
- Examples: color, name, organism type, etc.




###  Dataset Types

Datasets can comprise both categorical and numerical data, and data set types can offer additional information about the data as a whole.

**Cross-sectional**

All information is determined at the same time; all data come from the same time period.

**Time series**

The information is collected over a period of time for a single group.

**Longitudinal/Panel**

The information is collected over a period of time for several groups.

(Check out the data structures available in Pandas [here](http://pandas.pydata.org/pandas-docs/stable/overview.html).)

### Why Do Data Types Matter?

- Different data types have different limitations and strengths.
- Certain types of analyses aren’t possible with certain data types.

<a id="data-terms"></a>
## Common Data Terminology
![](./assets/images/feature_matrix.png)

**Observations** are also known as: samples, examples, instances, and records.

**Features** are also known as: predictors, independent variables, inputs, regressors, covariates, and attributes.

**Response** is also known as: outcome, label, target, and dependent variable. Responses are often used in data science and machine learning.
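
In code, these terms usually map onto a table of features (often named `X`) and a list of responses (often named `y`). The sketch below uses plain Python lists with made-up values; the feature names and labels are assumptions for illustration.

```python
# Each inner list is one observation (row); each position is one feature (column).
feature_names = ["company_size", "days_since_last_delivery", "num_deliveries"]
X = [
    [1, 12, 40],
    [3, 3, 65],
    [2, 30, 18],
]

# One response value per observation (hypothetical: 1 = placed a repeat order).
y = [1, 1, 0]

n_observations = len(X)
n_features = len(X[0])
print(n_observations, "observations,", n_features, "features")
```

The number of rows in `X` must match the length of `y`, since each observation has exactly one response.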


<a id="ML"></a>

## Introduction: Machine Learning

_That buzzword everyone is talking about._

#### What is Machine Learning?

_"A field of study that gives computers the ability to learn without being explicitly programmed."_
— Arthur Samuel, AI pioneer

One definition states that, “Machine learning is the semi-automatic extraction of knowledge from data.”

**Knowledge from data:** The process starts with a question that might be answerable using data.

**Automatic extraction:** A computer provides the insight.

**Semi-automatic:** It still requires many smart decisions by a human.

### Types of Machine Learning

There are two main categories of machine learning: supervised learning and unsupervised learning.

**Supervised learning (a.k.a., “predictive modeling”):**  
_Classification and regression_
- Predicts an outcome based on input data.
    - Example: Predict whether an email is Spam or Ham.
- Its goal is “generalization.”

**Unsupervised learning:**  
_Clustering and dimensionality reduction_
- Extracts structure from data.
    - Example: Segmenting grocery store shoppers into “clusters” that exhibit similar behaviors.
- Its goal is “representation.”

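It's common to combine both types of machine learning in a project: learning a better representation from unlabeled data can reduce the cost of collecting labeled data. This approach is related to semi-supervised learning; reusing a representation learned on one task for a different task is referred to as transfer learning.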

Unsupervised learning tends to present more difficult problems because its goals are amorphous. Supervised learning has goals that are almost too clear and can lead people into the trap of optimizing metrics without considering business value.

<a id="supervised"></a>
### Supervised Learning

So, how does supervised learning “work”?

1. We train a **machine learning model** using **labeled data** (the "response" label from earlier).
    - “Labeled data” is data with a response variable.
    - The “machine learning model” learns the relationship between the features and the response.

2. We make predictions on **new data** for which the response is unknown.

The primary goal of supervised learning is to build a model that “generalizes” — i.e., accurately predicts the **future** rather than the **past**!

![](./assets/images/supervised-learning.png)

### Practice: Classification vs. Regression

There are two categories of supervised learning:

**Regression**
- The outcome we are trying to predict is continuous.
    - Examples: price, blood pressure, etc.

**Classification**
- The outcome we are trying to predict is categorical (i.e., its values are in a finite set).
    - Examples: Spam/Ham, cancer class of tissue sample, etc.

The type of supervised learning problem has nothing to do with the features; only the response matters!

#### Supervised Learning Example: Coin Classifier

- **Observations:** Coins.
- **Features:** Size and mass.
- **Response:** Hand-labeled coin type.

- Train a machine learning model using labeled data.
    - The model learns the relationship between the features and the coin type.

- Make predictions on new data for which the response is unknown.
    - Give the model a new coin, and it will predict the coin type automatically.
    
![](./assets/images/supervised-coins.png)
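
The coin example can be sketched in a few lines of Python. This version uses a nearest-centroid classifier, chosen here only because it is short; the lesson does not prescribe a particular algorithm. The measurements (diameter in mm, mass in g) and the `train` and `predict` helper names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical hand-labeled training data: (diameter_mm, mass_g) -> coin type.
training = [
    ((19.0, 2.5), "penny"), ((19.1, 2.6), "penny"),
    ((21.2, 5.0), "nickel"), ((21.3, 4.9), "nickel"),
    ((17.9, 2.3), "dime"), ((18.0, 2.2), "dime"),
    ((24.3, 5.7), "quarter"), ((24.2, 5.6), "quarter"),
]

def train(examples):
    """Learn one centroid (mean feature vector) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(model, features):
    """Predict the label whose centroid is closest to the new observation."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Step 1: train on labeled data. Step 2: predict for a new, unlabeled coin.
model = train(training)
prediction = predict(model, (24.0, 5.5))
print(prediction)
```

In practice, features measured on different scales are usually rescaled before computing distances; that step is omitted here to keep the sketch short.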

<a id="practice2"></a>
#### Practice: Regression or Classification?

With a nearby partner (or two), decide if the problems below are classification, regression, or both.

#### 1. Predict salary using demographic data.
![](./assets/images/salary-regression.png)

#### 2. Identify the numbers in a handwritten zip code.
![](./assets/images/ocr-classification.png)

#### 3. Consider the following:

**Problem:** Children born prematurely are at high risk of developing infections, many of which are not detected until after the baby is sick.

**Goal:** Detect subtle patterns in the data that predict infection before it occurs.

**Data:** 16 vital signs such as heart rate, respiration rate, blood pressure, etc.

**Impact:** The model is able to predict the onset of infection 24 hours before the traditional symptoms of infection appear.

**Case Study:** http://www.amazon.com/Big-Data-Revolution-Transform-Think/dp/0544002695

<a id="unsupervised"></a>
### Unsupervised Learning

Unsupervised learning has some clear differences from supervised learning. With unsupervised learning:

- There is no clear objective.
- There is no “right answer” (it's hard to tell how well you are doing).
- There is no response variable, only observations with features.
- Labeled data is not required.

**Unsupervised learning:**

- Extracts structure from data.
    - Example: Segmenting grocery store shoppers into “clusters” that exhibit similar behaviors.
- Its goal is “representation.”



#### Common Types of Unsupervised Learning

- **Clustering:** Groups “similar” data points together.
- **Dimensionality Reduction:** Reduces the dimensionality of a data set by extracting features that capture most of the variance in the data.

**Steps for Clustering**

Considering our coin example,

1. Perform unsupervised learning.
2. Cluster the coins based on “similarity.”
3. Inspect the grouping that the algorithm found.
4. You’re done!

Hopefully this would put the coins into four separate groups.
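
The clustering steps above could be sketched with k-means, one common clustering algorithm (the lesson does not name a specific one). The coin measurements are hypothetical, the helper names are invented, and the farthest-first starting points are a simplification used here to keep the result repeatable.

```python
import math

# Hypothetical unlabeled coins: (diameter_mm, mass_g). No coin types are given.
coins = [
    (19.0, 2.5), (21.2, 5.0), (17.9, 2.3), (24.3, 5.7),
    (19.1, 2.6), (21.3, 4.9), (18.0, 2.2), (24.2, 5.6),
]

def farthest_first(points, k):
    """Pick k starting centroids that are spread far apart (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly reassigning and re-averaging."""
    centroids = farthest_first(points, k)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(col) / len(col) for col in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

clusters = k_means(coins, k=4)
for cluster in clusters:
    print(cluster)
```

Notice that the algorithm has to be told how many groups to look for (`k=4`), and it returns groups without names; deciding that one group is "quarters" is still up to a person.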

**Steps for Dimensionality Reduction**

Considering our coin example,

1. Perform unsupervised learning.
2. Reduce the coins' features to a smaller set that captures most of the variation.
3. Inspect the features produced by the algorithm.
4. You’re done!

Hopefully the algorithm would recognize something like: 

$$\dfrac {mass} {size} = density$$
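
One widely used dimensionality reduction technique is principal component analysis (PCA); it is used here as an assumption, since the lesson does not name an algorithm. PCA finds linear combinations of features rather than ratios like the one above, but it illustrates the same idea of replacing two features with one. The sketch below handles only the two-feature case, with hypothetical coin measurements and an invented function name.

```python
import math
from statistics import mean

def first_principal_component(points):
    """Reduce 2-D points to 1-D scores along the direction of greatest variance."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Angle of the main axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    direction = (math.cos(theta), math.sin(theta))
    # Project each centered point onto that direction: two features become one.
    scores = [(x - mx) * direction[0] + (y - my) * direction[1] for x, y in points]
    # Share of the total variance kept by the single new feature.
    explained = (sum(s ** 2 for s in scores) / (n - 1)) / (var_x + var_y)
    return direction, scores, explained

# Hypothetical coins: (size, mass).
coins = [(17.9, 2.3), (19.0, 2.5), (21.2, 5.0), (24.3, 5.7)]
direction, scores, explained = first_principal_component(coins)
print(f"One new feature keeps {explained:.0%} of the variance")
```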

Sometimes unsupervised learning is used as a “preprocessing” step for supervised learning. (Can you guess how?)

#### Unsupervised Learning Example: Coin Clustering and Dimensionality Reduction

- **Observations:** Coins.
- **Features:** Size and mass.
- **Response:** There isn’t one (no hand-labeling required!).

![](./assets/images/unsupervised-coins.png)

<a id = 'algorithm'></a>

## Algorithms

Regardless of whether it's supervised or unsupervised, the underlying engine driving a machine learning model is an algorithm. These algorithms are used to help identify trends, represent said trends, and explain the overall variance of the data.   

Let's say we are a real estate agent looking to price a house using only its square feet. We know there are other features that can highly influence this outcome, but we are just focusing on the square footage for now. We know that, as square footage increases, so does price. At this point you may be thinking that a simple algebra equation could be useful; one that helps us price the house by its square footage.  

Recently we sold a house whose square footage was 2,500 for about \$285,000. If we apply this information to a normal linear equation — $ Y = mx + b$ — we can create a simple _algorithm_ to help us price a house.

$$285{,}000 = m \cdot 2{,}500 + b$$

$$ m = 114, b = 0 $$ 

_The Y intercept is assumed to be 0 for this example._

#### Final Algorithm

$$ Price = 114x $$

This is an example of a model built with the intent of predicting price. The algorithm is simple and is built off of limited information. Typically, our models will be more complex, and we'll consider a greater amount of prior data to help us develop a final algorithm.  
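
Written as code, the final algorithm is a one-line function. The names `PRICE_PER_SQFT` and `predict_price` are chosen for this sketch.

```python
# Slope from the single known sale: 285,000 / 2,500 = 114 dollars per square foot.
PRICE_PER_SQFT = 114

def predict_price(square_feet):
    """Estimate a house price from its square footage alone."""
    return PRICE_PER_SQFT * square_feet

print(predict_price(2500))  # reproduces the known sale: 285000
print(predict_price(1000))
```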

#### Algorithm Training 

In our example, we used previously known information to find our coefficients. This action is also known as "Training." But let's make something clear:

- Model building would be the task of constructing an actual algorithm.
    - This is the linear model of $ Y = mx + b $.
- Training involves figuring out the coefficient and the Y intercept the model uses for _our intended purpose_.  
    - The coefficients uncovered via training were $m= 114$ and $b=0$.
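
The distinction between building and training can be shown in a short sketch. The model is fixed as a line through the origin, and training finds the slope that best fits the known sales using least squares. The function name and the two extra sales are hypothetical.

```python
def train_through_origin(square_feet, prices):
    """Least-squares slope m for the model price = m * square_feet (b fixed at 0)."""
    numerator = sum(x * y for x, y in zip(square_feet, prices))
    denominator = sum(x * x for x in square_feet)
    return numerator / denominator

# Training on the single sale from the example recovers the same coefficient.
m_single = train_through_origin([2500], [285000])
print(m_single)  # 114.0

# With more (hypothetical) sales, the same model gets a different coefficient.
m_many = train_through_origin([2500, 1800, 3200], [285000, 210000, 350000])
print(round(m_many, 2))
```

The model (a straight line) stays the same in both cases; only the trained coefficient changes as more data is supplied.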



<a id="sklearn"></a>

## Introduction: scikit-learn

- Typically, we won't be implementing machine learning algorithms from scratch.
- Scikit-learn, referred to as "sklearn," is a popular machine learning library in Python.
- Its benefits include ease of use and great documentation.
- It's possible to find other libraries and tools with better performing algorithms, but sklearn is a great place to start.

Find it [here](http://scikit-learn.org/stable/).

### Why Python?

When it comes to the debate of Python versus R, there is no clear winner; both have their pros and cons. While R was the original language for data science and processing, Python's popularity has surged in recent years, to the point where it's used equally throughout the field. In addition to the friendly nature of the language, Python's origin as an open-source programming language has greatly contributed to its appeal. Because it is a general-purpose programming language, those who pick it up for data science can more easily pick up front-end and back-end programming skills with Python. Additionally, this allows data science deliverables built in Python to be immediately implemented within a Python-based website or program.  


**Reasons for Choosing Python**

- It was created for simplicity and readability.
- It allows for rapid prototyping and ease of production. 
- It features open-source, importable libraries.
- It has a broad range of applications.
- It boasts a fast-growing community of users.


 

<a id="conclusion"></a>
## Conclusion

---

By now, you should be able to answer the following questions easily:

- What is data science?
- What is the data science workflow?
- How can you have a successful learning experience at GA?

<a id="add_res"></a>
### Additional Resources

---

* For a useful look at the different types of data scientists, read [Analyzing the Analyzers](http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf) (32 pages).
* For thoughts on what it's like to be a data scientist, read these short posts from [Win-Vector](http://www.win-vector.com/blog/2012/09/on-being-a-data-scientist/) and [Datascope Analytics](http://datascopeanalytics.com/what-we-think/2014/07/31/six-qualities-of-a-great-data-scientist).
* Read an [Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) for more information on machine learning. (Note that the applications in its examples are in R.)