# Let's Get Started

Learning Assumptions and Workshop Objective: 

The intent of these workshops is *not* to make you a competent R data analyst or expert data scientist. 

The objective *is* to help you gain a conceptual understanding of the process and technologies that are used in data science and "big data" via hands-on activities. Many of these activities are common tasks for data scientists. This high-level perspective will help you understand how data science tools fit into the larger context of business strategies that seek to leverage data science methods and tools. It will also hopefully motivate you to move beyond the legacy analytic tools of business like Microsoft Excel to see how powerful tools like R and Python are more accessible than ever before to non-engineers and non-scientists.

All of us are responsible for creating a productive learning environment for the entire class, so please contribute to making the environment positive for everyone. We will follow the [Python code of conduct](https://www.python.org/psf/codeofconduct/).

You can work independently or in groups. If your technical and analytical expertise is beyond the basic level of these exercises, take on the role of adviser and help bring others up to speed and extend your work to add more sophistication or functionality.

**This workshop covers these topics:**

* Data Science and Analytics Tools
* What is a Desktop Environment?
* How to install R
* How to install RStudio
* How to install Anaconda
* How to install Jupyter for R
* Using Jupyter Notebooks
* Where to get help
* How do you load data from a CSV?
* What is a dataframe?
* What is dplyr and how can it be used to focus and analyze data?
* Go through a simple version of a complete analysis lifecycle using weather data

# Data Science and Analytics Tools

Data scientist has been called ["the sexiest job of the 21st century"](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century), [second in growth among the top 20 emerging jobs in 2017](https://economicgraph.linkedin.com/research/LinkedIns-2017-US-Emerging-Jobs-Report), and [in the top 10 most promising jobs of 2018](https://blog.linkedin.com/2018/january/11/linkedin-data-reveals-the-most-promising-jobs-and-in-demand-skills-2018). While there is still some debate about exactly what a data scientist is and what skills they should possess, we can draw some conclusions based on our experience.

While there are literally hundreds of tools a data scientist might use, perhaps the most foundational are the ones they use for desktop and exploratory data analysis. Until somewhat recently, anyone doing statistical analysis would typically learn one of the commercial statistical analysis packages like [SPSS](https://www.ibm.com/analytics/data-science/predictive-analytics/spss-statistical-software), [SAS](https://www.sas.com/en_us/software/stat.html), or [Stata](https://www.stata.com/). Today, while commercial products are still available and used by some, open source languages like Python and R are the top tools mentioned in new job postings, especially by technology companies. Given that many data scientists and analytics teams come from computer science rather than statistics, it is not surprising to see more general-purpose programming languages like Python become popular data science tools. The open source ecosystem also means thousands of developers worldwide contribute to making these tools better, and they are free to use, which is a compelling value proposition.

IMO, there is no one best tool. However, having skills in the popular or newer tools is generally a good thing for your career. 

Technically, programming languages like Python and R are more flexible than proprietary or specialized software like [Tableau](https://www.tableau.com) and [Excel](https://products.office.com/en-us/excel). Both Python and R can handle almost any programming task because programmers have complete control over the code. Tableau, Excel, and other more specialized analytical software have their place too, given they aim to make analysis and data exploration easier for those who don't want to learn to code.

Most data scientists today use either Python or R as their primary desktop analysis tool. Beyond being free to use and having thousands of developers contributing to their success, statistical programming languages also allow data scientists to document every step of an analysis to share with others and to use to create analytical products. To understand an analysis from a peer, rather than just looking at the output of an analysis, you can go through the complete process they used by reading their code. 
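
As a minimal sketch of what documenting every step can look like, the short R script below records each stage of a tiny analysis as code and comments. It uses the `cars` dataset that ships with R; the variable names and the choice of a straight-line model are illustrative assumptions, not part of the workshop material.

```r
# Step 1: load the data. `cars` ships with base R
# (speed in mph, stopping distance in ft).
speed_data <- cars

# Step 2: inspect the structure so a reader knows what each column holds.
str(speed_data)

# Step 3: compute a simple summary statistic.
avg_dist <- mean(speed_data$dist)
print(avg_dist)

# Step 4: fit a straight line relating stopping distance to speed.
fit <- lm(dist ~ speed, data = speed_data)
print(coef(fit))
```

Anyone who receives this script can rerun it line by line and see exactly how each number was produced.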


For this workshop, we are going to focus on the R programming language. While both Python and R are popular, some believe R is easier for non-programmers to learn. Additionally, R has a superior plotting library called ggplot2.

# What is an Environment?

Data scientists use a variety of tools in their day-to-day work. A desktop data science environment includes all the software tools needed to explore data and conduct analyses on a personal computer. These tools can also be used to collect data from "big data" or cloud services like [Hadoop](http://hadoop.apache.org/), [Hive](http://hive.apache.org/), and [Spark](http://spark.apache.org/). Examples of some of these desktop environment tools include:

* **A statistical programming language**- examples include [Python](https://www.python.org/), [R](https://www.r-project.org/), or commercial packages like [SPSS](https://www.ibm.com/products/spss-statistics) or [SAS](https://www.sas.com/en_us/software/stat.html).
* **Data science specific libraries**- Most statistical programming languages have add-on libraries useful for data science. For Python there is [SciPy](https://www.scipy.org/) and R has the [tidyverse](https://www.tidyverse.org/). There are also a variety of machine learning libraries like [TensorFlow](https://www.tensorflow.org/api_docs/), [scikit-learn for Python](http://scikit-learn.org/stable/) and [caret for R](http://topepo.github.io/caret/index.html).
* **Data notebooks**- When sharing results, interactive notebooks are a popular tool. The most common one for Python and increasingly for other languages like R is [Jupyter](http://jupyter.org/). There are several others including [Zeppelin](https://zeppelin.apache.org/) from the Apache Software Foundation.
* **Visualization tools or libraries**- Python and R have their own libraries, [Matplotlib](https://matplotlib.org/) and [ggplot2](http://ggplot2.org/), which are part of the SciPy ecosystem and the tidyverse, respectively. There are many commercial tools for visualization like [Tableau](https://www.tableau.com/). There are also hybrid open-source/commercial products like [Plotly](https://plot.ly/).
* **Software Integrated Development Environments (IDEs)**- IDEs help organize and streamline your software. I like [PyCharm](https://www.jetbrains.com/pycharm/) for Python. I use [RStudio](https://www.rstudio.com/products/rstudio/) for R.
* **Version control software**- As you work on more sophisticated projects and collaborate with others, version control becomes important. [Git](https://git-scm.com/) is the most widely used version control software, and [GitHub](https://github.com/) is a hosting service for Git repositories that is home to some of the most important data science tools like https://github.com/apache/spark and https://github.com/tensorflow/tensorflow.
* **Database clients**- Connecting to traditional databases is a common task despite all the newer big data technologies. There are free clients like [SQL Workbench](http://www.sql-workbench.net/) and commercial products like [DataGrip](https://www.jetbrains.com/datagrip/). While most of the tools I use are open-source, in this case I really like DataGrip and don't mind paying for it.


# Installing Anaconda for R and Jupyter Notebooks

While there are many great things about open source software, one challenge is getting different packages to work together. For this workshop, we are going to use R, Jupyter, and RStudio. While we won't use Python, we are using a distribution of Python called Anaconda to try to make the installation easier. Getting all of these things installed and working is sometimes a chore, so if you can do this before the workshop we will have more time for analysis.

This web page was built using [Jupyter](http://jupyter.org/). Jupyter Notebooks are a way to share documents that have both live code in them as well as descriptions (like this text block), plots, or other images. Jupyter is a good way to surround your working code with descriptions and contextual information all while having real, running code you can change and experiment with.

Jupyter started as a Python project, but given its popularity in data science, it now supports many other languages including R. To install Jupyter on your computer, you first need to install Python. The easiest way to do this is to use the Anaconda Python 3.6 distribution (a distribution is a curated package of all the things you need) from this link: https://www.anaconda.com/download/. Once you download the package, you can launch the installer. Depending on your operating system, you should typically choose to install for your current user rather than the whole system.

You can check your installation by opening a console window in your operating system and typing:
```bash
conda info
```  

If you are not sure how to open a console window, here is a link for Mac OSX (which is also quite similar for Linux):

Mac OSX- http://blog.teamtreehouse.com/introduction-to-the-mac-os-x-command-line

For Windows 10, you will have to use the Anaconda Prompt which is a version of a Windows console that loads all of the correct libraries.  It should be a program option like below that will open a console:

<img src="images/conda_prompt.png">

If you successfully installed Anaconda, you should see a screen like the one below (left is OSX and right is Windows). For Windows, save the "active environment location" as you will need it later.

<img src="images/conda_infos.png">




Next, we need to install R. While you can always install R by going to https://cloud.r-project.org/ and downloading the latest version, we are going to use Anaconda so all our environments work together. To do this, go to your console (in Windows it is the Anaconda Prompt) and type the following:

```bash
conda install r-essentials
``` 

You should see lines that say:

```bash
Fetching package metadata....
Solving package specifications.
```

Below this will be a summary of what conda is going to install and a prompt asking you to confirm the install. Press ```y``` to continue.


Now, in the same console, type the following:

```bash
r
```
You should see something similar to the following which details what version of R you have installed.

<img src="images/r_console.png">


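Entering the command ```r``` in your console starts an interactive R session. If ```r``` is not recognized, try the uppercase ```R```, which is the name of the executable on case-sensitive systems such as Linux. To exit, type ```q()```.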
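
While the R prompt is open, a few base R commands can confirm that the installation works. The sketch below prints the running R version and checks whether two packages can be loaded. The package names `dplyr` and `ggplot2` are only examples, chosen on the assumption that `r-essentials` bundles them; substitute any package you expect to have.

```r
# Print the version of R that is currently running.
print(R.version.string)

# Check whether some packages are available without attaching them.
# The names below are illustrative; a FALSE simply means "not installed".
pkgs <- c("dplyr", "ggplot2")
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)
```

The result is a named vector of `TRUE`/`FALSE` values, one per package.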

# Install RStudio

While you can just use the base R console you saw above, RStudio provides a much easier to use IDE. RStudio (the IDE) is sponsored by RStudio (the commercial entity). They offer an open source version of RStudio Desktop, which is what you should download next at this link: https://www.rstudio.com/products/rstudio/download/

After starting the installer from your download folder, it may ask you to locate your local R environment. Use the information from the ```conda info``` command above to find the default R environment or you can start your R console with the ```r``` command and type ```R.home()``` to return the home directory. 
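
For reference, these are the base R calls that report where R is installed; run them at the R prompt. The output depends on your machine, so none is shown here. The variable names are arbitrary.

```r
# Directory where this R installation lives.
r_home <- R.home()
print(r_home)

# Directory holding the R executables, which an installer may ask for.
r_bin <- R.home("bin")
print(r_bin)
```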

On Macs, after you install RStudio, you can just click on the application link to launch it. On Windows, this is more complicated due to how it works with Anaconda. If you just launch the RStudio app from its icon, it may fail. The way to get around this is to use your Anaconda Prompt console and change into the directory where RStudio was installed. This is typically something like ```C:\Program Files\RStudio\bin\```. Once you are in that directory, enter ```rstudio``` in the console and the application should launch. The RStudio executable is "rstudio.exe".

If you launch the app and see a console like the one below, you have successfully installed RStudio. 

<img src="images/r_studio.png">

Note for Windows users: the other option to get around the Anaconda-Windows RStudio issue is to install R directly from https://cloud.r-project.org/. That should install a version of R you can use directly with RStudio, but it will be a second version from the one installed with Anaconda which can lead to confusion when dealing with R libraries.



# Starting an R Notebook In Your Desktop Environment

If you are looking at this page on GitHub at the URL https://github.com/azbones/cis503_data_science_tools/blob/master/Intro_to_data_science_tools.ipynb you are seeing a static version of a Jupyter notebook. To get a live version running on your computer, there are two steps.

## Step 1- Download the R Notebook Repository from GitHub


While you can use Git to manage your files, it can be confusing for beginners. An easier way to get the files is to go to the "Clone or download" dropdown at the main page for this workshop's files at https://github.com/azbones/cis503_data_science_tools. Select the "Download ZIP" option.

<img src="images/zip.png">

After downloading the files, unzip them on your hard drive.

Open a console window (Mac) or Anaconda Prompt (Win) in your operating system and change the working directory to the folder where you unzipped the GitHub files.

Mac OSX- https://www.macworld.com/article/2042378/master-the-command-line-navigating-files-and-folders.html

Windows 10- Using the Anaconda Prompt: https://www.wikihow.com/Change-Directories-in-Command-Prompt

## Step 2- Start Your Local Jupyter Notebook

Once you have navigated to the right directory, you can start Jupyter Notebook locally. Jupyter works by starting a small private web server on your desktop. To start it, enter the following command in your console (using Anaconda Prompt in Windows):

```bash
jupyter notebook
```

After it starts, a new tab should open in your default web browser that looks like:

<img src="images/jupyter.png">

The files with the extension .ipynb are Jupyter notebooks. Try launching one from this main page. Another tab should open which is now a live notebook where you can run code. To test this, put your cursor in the cell below this block and select "Run Cells" from the "Cell" menu at the top of the page. You can also use the keyboard shortcut "control-return" to run code in a cell block.

You can recognize code blocks because they have the following text prepended to them:

```bash
In []:
```

If your environment is properly set up, you should see a basic plot.


```r
plot(cars)
```
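
If the plot appears, a natural next experiment is to look at the data behind it. The sketch below inspects the built-in `cars` dataset and redraws the plot with labels. The title and axis labels are my own wording, added as an illustration.

```r
# Show the first six rows of the built-in dataset.
head(cars)

# Show the column names, types, and number of observations.
str(cars)

# Redraw the scatter plot with a title and axis labels.
plot(cars,
     main = "Stopping distance vs. speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)")
```

Try changing the labels and rerunning the cell to see how a live notebook responds to edits.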

## [Please continue in the next notebook page by clicking here](Intro_to_data_science_tools-2.ipynb)