# Flex lesson: Data Science Tools

# WHEN YOU COME TO CLASS
- Open Jupyter Notebook
- Download and open DS-DC-14_05-data-science-tools
  - Located in lesson 05
  - If you have git working feel free to use git pull
- Go to `File > Make a copy` and save a new copy of the notebook with your initials in the name

## LEARNING OBJECTIVES
- Identify the data science toolkit
- Navigate Git and the Command Line
- Get your own GitHub repo
- Learn a simple framework for choosing visualization types

## STUDENT PRE-WORK
Before this lesson, you should already be able to:

- Explain the difference between variance and bias
- Use descriptive stats to understand your data

### Intro: Tools of the Trade

Today, we are going to review some of the tools we use in data science and see how they fit into the wider programming environment.

## The Data Processing Pipeline
Objective: Summarize how data is processed, who does the processing, and what tools they frequently use
1. Collection and Archiving
  - Get data sources and make lasting back-ups
2. Data Profiling
  - Look at summary statistics to identify major errors in the source data
3. Data Staging and Conformance (Extract, Transform, Load)
  - Combine datasets, check compatibility, prepare them for useful storage
4. Data Storage/Data Warehousing/a pile of spreadsheets
  - Make sure data is stored in a manner that is easy to access for decision makers and analytical teams
5. History and Auditing
  - Track the history of the database and double check that information is accurate
6. Data Marts and Services
  - If need be, create easy data access tools/layers. This can be an API, a Data Access Layer to use with a programming language, a simplified table in a database, or a spreadsheet that collects useful information
7. Analysis and Modeling
  - Summarize the data as a working model or analytical piece
8. Presentation
  - Summarize results to decision makers and clients
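
As a rough illustration of the Data Profiling step, the sketch below computes a few summary statistics with Python's standard library. The `raw_ages` values and the `profile` helper are made up for this example; the point is only that a missing value and an implausible maximum show up immediately in the summary.

```python
import statistics

# Hypothetical source data: None marks a missing value, 9999 is a suspicious entry.
raw_ages = [34, 29, None, 41, 9999, 38, 27]

def profile(values):
    """Return simple summary statistics for a list that may contain None."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
    }

summary = profile(raw_ages)
print(summary)
```

A maximum of 9999 next to a median in the thirties is the kind of major error this step is meant to catch before the data moves on to staging.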

#### Who does what?
1. Collection and Archiving 
  - System Administrators
2. Data Profiling 
  - Data Engineer, ETL team, individual researcher
3. Data Staging and Conformance/Extract, Transform, Load 
  - Data Engineer, ETL team, individual researcher
4. Data Storage/Data Warehousing/a pile of spreadsheets 
  - Database Administrator, Data Engineer, individual researcher, or anyone who deals with spreadsheets
5. History and Auditing 
  - Database administrator, individual researcher
6. Data Marts and Services 
  - Database Administrator, Back-end programmer, individual researcher
7. Analysis and Modeling 
  - Data Analyst, Back-end programmer, individual researcher, many office workers
8. Presentation 
  - Front-end developer, Data Analyst, individual researcher, many office workers

#### Tools of the trade?
1. Collection and Archiving 
  - Unix scripts and utilities, Perl, Python, specialized software
2. Data Profiling 
  - Unix scripts and utilities, Perl, Python, specialized data profiling software
3. Data Staging and Conformance/Extract, Transform, Load 
  - Unix scripts and utilities, Perl, Python, SQL, specialized ETL software, Big Data
4. Data Storage/Data Warehousing/a pile of spreadsheets 
  - SQL, Perl, Python, Java, Excel, Big Data
5. History and Auditing 
  - SQL, specialized software
6. Data Marts and Services 
  - SQL and just about any programming language (C/C++, C#/Java, Python, R, MatLab, Ruby, PHP, Go ...), Excel
7. Analysis and Modeling 
  - Visualization tools (Excel, Tableau), specialized modeling tools, programming languages (C/C++, Java, Python, Clojure, F#, R, MatLab, Scala)
8. Presentation 
  - Microsoft Suite, HTML/JavaScript, Python, R, MatLab, Visualization tools (Excel, Tableau)

#### KNOWLEDGE CHECK: PAIR AND SHARE (5 min)
- Think about your own projects and how their data processing fits into this workflow
- Turn to someone at your table and briefly explain the workflow


- Do you think the pipeline is missing some steps?
- Who did what section of the project?
- Are there any specialized tools that you think are common, but we didn't mention?


#### What do data scientists use?
- SQL
  - Essentially a universal vocabulary for data access
- A modeling-friendly language such as Python, R, MatLab, Scala, Java, Clojure, or F#
  - Depends on industry
  - Access to data, modeling tools, and automating grunt work
- Some kind of version control: Git, SVN, Mercurial
  - Organize your own work
  - Make your work easy to share on various teams
  - Provides a subtle excuse to talk to the programmers while they fix your merge conflicts
- Command line/Terminal
  - Better control of your computer, automating grunt work
- Text editor or Integrated Development Environment
  - So you can edit code
- Additional helper software for visualization, machine learning, industry specific modeling, or something that helps bridge the communication gap with another team that works with the data.
  - This is a very long list. The most common requests tend to be industry-specific libraries, databases and data platforms (PostgreSQL, MySQL, Cassandra, Spark, Hadoop), and data extraction and summarization tools such as Splunk
- Most importantly: Your knowledge and intuition of everything that will go wrong that your teammates haven't even considered
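
To give a feel for SQL as a shared vocabulary for data access, here is a minimal sketch using Python's built-in `sqlite3` module. The `visits` table and its rows are invented for illustration; the same `SELECT ... GROUP BY` query should work, with minor dialect differences, in most SQL databases.

```python
import sqlite3

# In-memory database so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (city TEXT, visitors INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("DC", 120), ("NYC", 300), ("DC", 80)],  # hypothetical rows
)

# Total visitors per city, sorted by city name.
rows = conn.execute(
    "SELECT city, SUM(visitors) FROM visits GROUP BY city ORDER BY city"
).fetchall()
print(rows)
conn.close()
```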

#### KNOWLEDGE CHECK (1 min)

Think to yourself

How do these tools fit with the idea that data science is the intersection of programming, traditional research, and domain knowledge?

### Local machine

On your local machine you have a variety of tools at your disposal, including:

    Text editor
    Programs/Packages/Tools
    Your files

All of these can be accessed through the terminal, and many can also be accessed through a GUI, or Graphical User Interface.

### Command line
Objective: Become comfortable with navigating through file directories using a Command Line Interface

This is your portal to your computer and the outside world.

#### What is a command?
Commands are small programs, a lot like the Python functions we have been using.

They have a name followed by arguments that we can use to tell the program what we want.

![command anatomy](assets/images/cmd-anatomy.jpg)
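
One way to see the "name followed by arguments" structure is to split a command line into its parts with Python's `shlex` module. The `split_command` helper is hypothetical, and treating every token that starts with `-` as an option is a simplification of how real programs parse their arguments.

```python
import shlex

def split_command(line):
    """Split a command line into (name, options, arguments)."""
    tokens = shlex.split(line)
    name, rest = tokens[0], tokens[1:]
    # Simplification: anything starting with "-" counts as an option.
    options = [t for t in rest if t.startswith("-")]
    arguments = [t for t in rest if not t.startswith("-")]
    return name, options, arguments

print(split_command("ls -l /home/alex"))
```

Here `ls` is the program name, `-l` is an option that changes its behavior, and `/home/alex` is the argument it acts on.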

Demo a few commands:

    pwd - print working directory
    cd - change directory
    ls - list files

Identifiers:

    ./ - refers to the current directory
    ../ - refers to the directory above

Getting help:

    --help as an option for most commands will print a help message
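
Since commands behave a lot like Python functions, it may help to see rough Python counterparts of `pwd`, `cd`, and `ls`, along with the `./` and `../` identifiers. This is only a sketch: it works inside a temporary directory, and the `lesson-05` folder name is used purely as an example.

```python
import os
import tempfile

# Work inside a throwaway directory so the example does not touch real files.
with tempfile.TemporaryDirectory() as scratch:
    start = os.getcwd()                       # like `pwd`
    os.mkdir(os.path.join(scratch, "lesson-05"))
    os.chdir(scratch)                         # like `cd <scratch>`
    listing = os.listdir(".")                 # like `ls ./`
    os.chdir("lesson-05")                     # like `cd lesson-05`
    inside = os.path.basename(os.getcwd())    # name of the directory we are in
    os.chdir("..")                            # like `cd ../`
    back = os.path.samefile(os.getcwd(), scratch)
    os.chdir(start)                           # return to where we began

print(listing, inside, back)
```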
   

#### KNOWLEDGE CHECK

I use pwd and see that I'm in /home/alex/DS-DC-14/
1. How do I see what files are listed?
2. How do I move to /home/alex?

As we mentioned, we can access many tools with terminal. Let's walk through a few that are important for data science.

So far we've been using IPython Notebook in place of a text editor. For a standalone text editor, this class uses Sublime Text. However, there are many options available:

- Emacs
- Vim
- Sublime Text
- Atom
- PyCharm

#### IPython Notebook

##### Where does IPython Notebook fit in?

**From the IPython Notebook docs:**

    "The notebook extends the console-based approach to interactive computing in a qualitatively new direction, providing a web-based application suitable for capturing the whole computation process: developing, documenting, and executing code, as well as communicating the results."

**IPython notebooks combine two components:**

- A web application: a browser-based tool for interactive authoring of documents which combine explanatory text, mathematics, computations and their rich media output.

- Notebook documents: a representation of all content visible in the web application, including inputs and outputs of the computations, explanatory text, mathematics, images, and rich media representations of objects.
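
To make "notebook document" less abstract: a notebook file is stored as JSON text. The sketch below builds a minimal notebook-shaped dictionary and reads the cell types back. The field names follow version 4 of the notebook format as I understand it; treat the exact set of required fields as an assumption and check the format documentation before relying on it.

```python
import json

# A minimal notebook-like structure with one markdown cell and one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# My analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,   # the cell has not been run yet
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# Serialize to text (what would be saved on disk) and read it back.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Because the file is plain text, it can be tracked with version control like any other source file.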

#### Outside World

The CL can connect us to the outside world. In data this is particularly important.

Let's say we have HIPAA-protected data. (Note: HIPAA is a US law that protects people's health data. It requires extra security, so you can't leave data lying around on your local computer.) Often the data will live on a remote computer that we need to communicate with. We can do this through the CL.

In a simpler case, the command line provides an easy way to install libraries and software.

When we do this with Python we often use a tool called pip or, in the case of the Anaconda distribution, conda.

#### KNOWLEDGE CHECK: SHARE OUT LOUD

Does anyone want to share their experience with command line tools or tools that resemble command line from their day to day work?

#### Let's conda install a library

Here we will check out a popular Python library for parsing HTML/XML called Beautiful Soup:

    conda install beautifulsoup4
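
After an install, one quick way to confirm that Python can find the library is to look it up without importing it. The `is_installed` helper below is a hypothetical convenience function built on the standard library's `importlib.util.find_spec`. Note that Beautiful Soup is installed under the name `beautifulsoup4` but imported as `bs4`.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

# The result for bs4 depends on whether the install above succeeded.
print("bs4 available:", is_installed("bs4"))
# json ships with Python, so this should always be True.
print("json available:", is_installed("json"))
```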

#### PYTHON PACKAGES

Python has a very rich set of libraries and strong community support. This is related to its reputation as a glue language, meaning that it is great at gluing several pieces of software together, and as having B+ performance across the board.

Some of the most well known:

- Requests
- Scrapy
- wxPython
- Pillow
- SQLAlchemy
- BeautifulSoup
- Twisted
- NumPy
- SciPy
- matplotlib

#### REAL WORLD APPLICATION: HOW DO WE PICK A LIBRARY?

This question doesn't have a simple answer, but we can list out some factors to look for:

##### How do I find a library that I need?
- Look at PyPI or another source that lists libraries and their uses
- Look on Stack Overflow for the issue you are trying to solve and see what libraries are used in solutions
- Ask a friend

##### How do I choose an appropriate library?
- Decide whether you need the library in the first place
    - If you only use a small portion of the library, the instability of library upgrades may far outweigh the work of writing some code yourself
- Does the library do what I need?
  - Seriously, read the documentation first; a first impression isn't enough
- Is there community support?
  - See how many questions are asked about the library on Stack Overflow
  - When was the last update made?
  - Libraries can die without community support, and a dead library is not a good thing to rely on
- What common issues do people complain about?
  - Do those issues affect my project?

## BREAK

#### Re-Intro to Git
Objective: Get enough comfort with git to use it for class and create a GitHub repo
- Git is a way of tracking the changes we've made to our programs so we can go back in time to fix errors.
- It is also a powerful tool for collaborating with colleagues, allowing you to work on different aspects of the project simultaneously and merge all the changes together seamlessly.
- There are lots of ways to use git; one common tool is GitHub.

#### Let's learn how to get copies of our repository
(No need to follow along if you've already done this and get it. Help out your classmates!)

Our repo is stored at https://github.com/ga-students/DS-DC-14

- We need to get a link from that website that tells us how to access the repo
  - https://github.com/ga-students/DS-DC-14.git
- Navigate to a directory where you want the DS-DC-14 folder to be copied. The default directory in terminal is probably home or My Documents
- Run the program git with the argument clone followed by the repo
  - git clone https://github.com/ga-students/DS-DC-14.git
- Enter your username and password if prompted (the password is invisible when you type, so it's OK if you type and nothing shows up). Note that GitHub now expects a personal access token in place of your account password for HTTPS access
- You should now have a copy of DS-DC-14/ in the working directory you started in

Note:
- Clone sets up a lot of options for you in the background, such as telling git where to find updates in the future

#### How do we update the repository?

At the start of every class, run ```git pull``` in the DS-DC-14 directory.

We've already configured all the correct options for the pull, so it's simple.

#### Troubleshooting

You may end up with what is called a merge conflict (git will tell you). This happens when your local copy of a file has been updated and the remote (GitHub) copy has also been updated. Which copy should git use? How would it know?

To avoid this: 
- Don't edit files within the DS-DC-14 repo unless you know what you're doing. 
- This is also why I have you make a copy of every notebook with your initials; no conflicts if it's a new file. 
- Alternatively, if you don't have any work to lose, play around as much as you want. If the repo is in a bad state, you can just delete it and clone it again.
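
To see why git sometimes cannot decide on its own, here is a deliberately simplified model of the decision. It compares whole files against a common starting point, whereas git works on regions within files, so treat this as an illustration of the idea rather than of git's actual algorithm. The `merge_file` function and the version strings are made up.

```python
def merge_file(base, local, remote):
    """Decide which version of a file to keep.

    base   -- the content both sides started from
    local  -- the content on your machine
    remote -- the content on the server
    """
    if local == remote:
        return local          # both sides agree
    if local == base:
        return remote         # only the remote changed
    if remote == base:
        return local          # only you changed it
    raise ValueError("merge conflict: both sides changed the file differently")

# Only the remote changed, so the remote version wins.
print(merge_file("v1", "v1", "v2"))

# Both sides changed, so there is no safe automatic choice.
try:
    merge_file("v1", "v1-mine", "v2")
except ValueError as err:
    print(err)
```

Working only in your own copy of a notebook keeps you in the "only the remote changed" case, which is why the advice above avoids conflicts.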

##### EXERCISE
1. I will make an update to the course repo in lesson-05 (this serves as a demo)
2. Once I announce that the update is ready, do a git pull

#### Let's also make our own repos for homework and final project submissions
We will do this together on GitHub

1. Go to GitHub (make sure you are signed in)
2. Click on the plus sign in the upper right-hand corner and select new repository
3. Fill out the repository name (something like DS-DC-14 coursework or DS14)
4. Check the box that says "Initialize this repository with a README"
5. Click "Create repository"
6. Clone your new repository

Note: Do not clone this into the DS-DC-14 directory; it's confusing and doesn't work well.

#### Let's get our own file onto GitHub

1. In your new repo directory on your local machine run the command:
  - ```touch newfile.md```
  - This will create a new, empty Markdown file called newfile.md
2. Use git with the argument status
  - ```git status```
  - This will show you some summary information about new and changed files in your repo
3. Use git with the argument add followed by the filename.
  - ```git add newfile.md```
  - This tells git to put newfile.md on the index and keep track of it
4. Use git with the argument commit, the option -m, followed by a string as an argument
  - ```git commit -m "Add newfile"```
  - This commits your changes to history
5. View your commit history using
  - ```git log```
6. Finally, to update GitHub you need to send a copy of your history to GitHub:
  - ```git push```
  - This will require a username and password (or personal access token)
  - This copies your local commit history to the remote
  - git will automatically flag potential issues, such as conflicting changes from multiple users
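
Once the steps above feel routine, they are a natural candidate for the "automating grunt work" mentioned earlier. The sketch below collects the same commands as argument lists and prints them. The `publish_commands` and `run_all` helpers and the repository path are hypothetical, and the call that would actually run git is left commented out.

```python
import shutil
import subprocess

def publish_commands(filename, message):
    """Return the git commands from the steps above as argument lists."""
    return [
        ["git", "status"],
        ["git", "add", filename],
        ["git", "commit", "-m", message],
        ["git", "log"],
        ["git", "push"],
    ]

def run_all(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    if shutil.which("git") is None:
        raise RuntimeError("git is not installed or not on PATH")
    for argv in commands:
        subprocess.run(argv, cwd=repo_dir, check=True)

commands = publish_commands("newfile.md", "Add newfile")
for argv in commands:
    print(" ".join(argv))

# run_all(commands, "path/to/your/repo")  # hypothetical path; uncomment to execute
```

Passing each command as a list avoids shell quoting problems with commit messages that contain spaces.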

#### KNOWLEDGE CHECK
##### What are the big advantages of using CL?

##### What's a GUI?

##### Will I destroy my computer if I use terminal?

## BREAK

# DATA VISUALIZATION

- Important for data science
- Exploratory data analysis
- Visualization inspires new questions
- Communication of findings
- Fundamental principle: communicate information to your audience
- Don’t just choose a chart at random; take into account the purpose, how viewers perceive different visual features, and the type of information you are trying to convey
- There is no perfect answer, so don’t stress


##### What visual features can viewers perceive? (in order from easiest to most difficult)

- Position along a common scale e.g. scatter plot
- Position on identical but nonaligned scales e.g. multiple scatter plots
- Length e.g. bar chart
- Angle & Slope (tie) e.g. pie chart
- Area e.g. bubbles
- Volume, density, and color saturation (tie) e.g. heatmap
- Color hue e.g. newsmap

Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods
https://www.cs.ubc.ca/~tmm/courses/cpsc533c-04-spr/readings/cleveland.pdf

#### Can you rank order A-E from largest share to smallest share?
![pie](assets/images/pie_chart.jpg)

#### What about now?
![bar](assets/images/bar_plot.jpg)
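
The difference between the two charts comes from judging length on a common scale instead of angle. The sketch below shows the same idea in plain text. The shares are hypothetical and are not the values in the images; `text_bars` is a made-up helper.

```python
# Hypothetical shares (percent); not the values shown in the lesson's images.
shares = {"A": 21, "B": 19, "C": 22, "D": 18, "E": 20}

def text_bars(data):
    """Return one line per item, longest bar first, using '#' for each unit."""
    ordered = sorted(data.items(), key=lambda item: item[1], reverse=True)
    return [f"{label} {'#' * value} {value}" for label, value in ordered]

for line in text_bars(shares):
    print(line)
```

Even with values only one point apart, the aligned bar ends make the ranking easy to read, which would be much harder to judge from pie slices.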

![Visualization Flow Chart](assets/images/visualization_flow_chart.jpg)

#### Composition: What is X made up of?
![composition](assets/images/composition1.jpg)

#### KNOWLEDGE CHECK
I am following three different countries during the Olympics. I want to see how many gold medals each country has won every day. Which chart should I use?

#### Distribution: How is my data distributed?
![distribution](assets/images/distribution.jpg)

#### KNOWLEDGE CHECK
I want to see where each Starbucks in DC is located. What kind of chart should I use?

#### Relationship: How does y (and z) depend on x?
![relationship](assets/images/relationship1.jpg)

#### KNOWLEDGE CHECK
My mobile application tracks user location. I want to see the travel history of a particular user. Which chart should I use?

#### Comparison: Greater or less than? Order? Difference of distribution?
![comparison](assets/images/comparison.jpg)

#### KNOWLEDGE CHECK
I have sampled two groups randomly out of a population. I need to compare their demographic characteristics, psychographic characteristics, responses to questionnaires, and various other data. What kind of chart can I use to get a sense of how similar the two groups are?
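
The four question types above can be turned into a small lookup as a starting point. The chart suggestions in this sketch are common conventions chosen for illustration, not a transcription of the flow chart, and `suggest_charts` is a hypothetical helper. As the lesson says, there is no perfect answer.

```python
# Candidate charts per question type (illustrative assumptions, not the flow chart itself).
CHART_CANDIDATES = {
    "composition": ["stacked bar chart", "pie chart"],
    "distribution": ["histogram", "box plot"],
    "relationship": ["scatter plot", "bubble chart"],
    "comparison": ["bar chart", "line chart"],
}

def suggest_charts(question_type):
    """Return candidate chart types for one of the four question types."""
    key = question_type.strip().lower()
    if key not in CHART_CANDIDATES:
        raise KeyError(f"unknown question type: {question_type!r}")
    return CHART_CANDIDATES[key]

print(suggest_charts("Distribution"))
```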

#### Wall of shame
http://flowingdata.com/category/visualization/ugly-visualization/

# Seaborn
- Python package
- Nice plots by default
- Understands the pandas DataFrame

Have to install from the terminal:

    conda install seaborn

Seaborn calls matplotlib, so in theory I think anything seaborn does could also be done in matplotlib directly. But we have plenty of other work to be getting on with.


## This is an intro to seaborn in Python
You must have already installed seaborn:  
`conda install seaborn`  
should work if you are using Anaconda  

Many ideas for this came from http://twiecki.github.io/blog/2014/11/18/python-for-data-science/ and https://stanford.edu/~mwaskom/software/seaborn/tutorial.html


In [None]:
%matplotlib inline
# IPython magic command: render plots inside the notebook cells

In [None]:
import seaborn as sns
# sns is the accepted short name for seaborn (don't ask me why)

import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
# get pandas (of course) and matplotlib for comparisons

In [None]:
# Load the tips data set that comes with seaborn
tips = sns.load_dataset("tips")

In [None]:
tips.describe()

In [None]:
tips.head()
# data includes bill, tip size, gender, smoker, day of the week, time, and party size

In [None]:
plt.scatter(tips.total_bill, tips.tip)
# we can do a scatter plot in matplotlib

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg");
# or, in one line, we could get this! Thanks, seaborn
# jointplot shows a linear model plot (lm plot) as well as histograms
# note that it also adds a bit of transparency so you can better see overlapping points

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips, col="time");
# in seaborn, a single command will suffice
# do people tip differently at lunch vs. dinner?

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips);
# or we can plot on one plot, separating by color
# let's compare smokers and non-smokers

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips, col="day");
# how about different days of the week?

In [None]:
sns.pairplot(tips)
# pairplot compares multiple variables
# the diagonal is the distribution of each
# the other plots are scatters comparing the two
# note this ignores the categorical variables

In [None]:
sns.boxplot(x="day", y="total_bill", hue="time", data=tips);
# but boxplots can be split by categoricals in lots of fun ways

For more information:
- A tutorial by the authors: https://stanford.edu/~mwaskom/software/seaborn/tutorial.html (tons of color and display options and many other plots not shown here)
- A very cool example gallery: https://stanford.edu/~mwaskom/software/seaborn/examples/index.html

# TOPIC REVIEW
Today we looked at:
- The data science toolkit
  - Why are there so many programming related tools?
- Navigate Git and the Command Line
  - When you come to class what command should you run?
  - How would you upload a file to YOUR GitHub repository?
- Learn a simple framework to choose visualization types

**Any further questions?**

# FURTHER READING
- [Automate the Boring stuff with Python](https://automatetheboringstuff.com/#toc)
- [Data Science at the Command Line](http://shop.oreilly.com/product/0636920032823.do)

## Pre-work for next class
- Keep thinking about project questions and datasets. We will be sharing ideas next week
- Next class is on linear regression if you want a head start (Pick, choose, and skim):
  - https://www.khanacademy.org/math/probability/regression
  - http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression
  - A very gentle introduction to machine learning: https://www.youtube.com/watch?v=elojMnjn4kk
  - For the more advanced who are interested in machine learning variations:
  http://scikit-learn.org/stable/modules/linear_model.html

# EXIT TICKETS
http://goo.gl/forms/gG5qAw9QljgkHC2q1