# Let's Get Started

Learning Assumptions and Workshop Objective: 

The intent of these workshops is *not* to make you a competent developer or expert data scientist. The workshop objective *is* to help you gain a conceptual understanding of the processes and technologies used in data science and "big data" via hands-on activities. Many of these activities are common tasks for data scientists. This high-level perspective will help you understand how data science tools fit into the larger context of business strategies that seek to leverage data science methods and tools. It will also, hopefully, motivate you to move beyond legacy business analytics tools like Microsoft Excel and see how powerful tools like R and Python are more accessible than ever before to non-engineers and non-scientists.

All of us are responsible for creating a productive learning environment for the entire class, so please contribute to making the environment positive for everyone. We will follow the Python code of conduct: https://www.python.org/psf/codeofconduct/.

You can work independently or in groups. If your technical and analytical expertise is beyond the basic level of these exercises, take on the role of adviser: help bring others up to speed, and extend your own work to add more sophistication or functionality.

**This workshop covers these topics:**

* Data Science and Analytics Tools
* Learning to Code
* What is a Desktop Environment?
* How to install R
* How to install RStudio
* What is a dataframe?
* How do you load data from a csv? 
* What is dplyr and how can it be used to focus and analyze data?
* Where to get help
* Go through a simple version of a complete analysis lifecycle using weather data

# Data Science and Analytics Tools

Data scientist has been called ["the sexiest job of the 21st century"](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century), ranked [second in growth among the top 20 emerging jobs in 2017](https://economicgraph.linkedin.com/research/LinkedIns-2017-US-Emerging-Jobs-Report), and listed [in the top 10 most promising jobs of 2018](https://blog.linkedin.com/2018/january/11/linkedin-data-reveals-the-most-promising-jobs-and-in-demand-skills-2018). While there is still some debate about exactly what a data scientist is and what skills they should possess, we can draw some conclusions based on our experience.

While there are literally hundreds of tools a data scientist might use, perhaps the most foundational are the ones they use for desktop and exploratory data analysis. Until somewhat recently, anyone doing statistical analysis would typically learn one of the proprietary statistical analysis packages like [SPSS](https://www.ibm.com/analytics/data-science/predictive-analytics/spss-statistical-software), [SAS](https://www.sas.com/en_us/software/stat.html), or [Stata](https://www.stata.com/). Today, while commercial products are still available and used by some, open source languages like Python and R are the top tools mentioned in new job postings, especially by technology companies. Given that many data scientists and analytics teams come from computer science rather than statistics, it is not surprising to see more general-purpose programming languages like Python become popular data science tools. The open source ecosystem also means that thousands of developers worldwide contribute to improving these tools, which remain free to use. That is a compelling value proposition.

In my opinion, there is no one best tool. However, having skills in the popular or newer tools is generally a good thing for your career.

Technically, general-purpose programming languages like Python and R are more flexible than proprietary or specialized software like [Tableau](https://www.tableau.com) and [Excel](https://products.office.com/en-us/excel). Both Python and R can handle virtually any programming task because programmers have complete control over all the code. Tableau, Excel, and other more specialized analytical software have their place too, given that they aim to make analysis and data exploration easier for those who don't want to learn to code.

Most data scientists today use either Python or R as their primary desktop analysis tool. Beyond being free to use and having thousands of developers contributing to their success, statistical programming languages also allow data scientists to document every step of an analysis, both to share with others and to build analytical products. To understand a peer's analysis, rather than just looking at its output, you can follow the complete process they used by reading their code.
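As a small illustration of how a script documents each step of an analysis, here is a minimal sketch in R. It uses `mtcars`, a dataset that ships with base R; that dataset and the variable names are chosen only for illustration and are not part of this workshop's data.

```r
# Step 1: load a dataset that ships with base R (illustrative choice)
cars <- mtcars

# Step 2: keep only the cars with 4 cylinders
four_cyl <- cars[cars$cyl == 4, ]

# Step 3: compute the average miles per gallon for that group
avg_mpg <- mean(four_cyl$mpg)

# Step 4: report the result
print(avg_mpg)
```

Anyone reading this script can see exactly which rows were kept and how the summary was computed, which is much harder to recover from a finished spreadsheet.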


For this workshop, we are going to focus on the R programming language. While both Python and R are popular, some believe R is easier for non-programmers to learn. Additionally, R has a superior plotting library called ggplot2.

# What is an Environment?

Data scientists use a variety of tools in their day-to-day work. A desktop data science environment includes all the software tools needed to explore data and conduct analyses on a personal computer. These tools can also be used to collect data from "big data" or cloud services like [Hadoop](http://hadoop.apache.org/), [Hive](http://hive.apache.org/), and [Spark](https://spark.apache.org/). Examples of some of these desktop environment tools include:

* **A statistical programming language**- examples include [Python](https://www.python.org/), [R](https://www.r-project.org/), or commercial packages like [SPSS](https://www.ibm.com/products/spss-statistics) or [SAS](https://www.sas.com/en_us/software/stat.html).
* **Data science specific libraries**- Most statistical programming languages have add-on libraries useful for data science. For Python there is [SciPy](https://www.scipy.org/) and R has the [tidyverse](https://www.tidyverse.org/). There are also a variety of machine learning libraries like [TensorFlow](https://www.tensorflow.org/api_docs/), [scikit-learn for Python](http://scikit-learn.org/stable/) and [caret for R](http://topepo.github.io/caret/index.html).
* **Data notebooks**- When sharing results, interactive notebooks are a popular tool. The most common one for Python and increasingly for other languages like R is [Jupyter](http://jupyter.org/). There are several others including [Zeppelin](https://zeppelin.apache.org/) from the Apache Software Foundation.
* **Visualization tools or libraries**- Python and R have their specific libraries, [Matplotlib](https://matplotlib.org/) and [ggplot2](http://ggplot2.org/), which are part of the SciPy ecosystem and the tidyverse respectively. There are many commercial tools for visualization like [Tableau](https://www.tableau.com/). There are also hybrid open-source/commercial products like [Plotly](https://plot.ly/).
* **Software Integrated Development Environments (IDEs)**- IDEs help organize and streamline your software. I like [PyCharm](https://www.jetbrains.com/pycharm/) for Python.
* **Version control software**- As you work on more sophisticated projects and collaborate with others, version control becomes important. [Git](https://git-scm.com/) is the most widely used underlying version control software, and [GitHub](https://github.com/) is a hosting service for Git repositories that is home to some of the most important data science tools, like https://github.com/apache/spark and https://github.com/tensorflow/tensorflow.
* **Database clients**- Connecting to traditional databases is a common task despite all the newer big data technologies. There are free clients like [SQL Workbench](http://www.sql-workbench.net/) and commercial products like [DataGrip](https://www.jetbrains.com/datagrip/). While most of the tools I use are open-source, in this case I really like DataGrip and don't mind paying for it.
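To see which of the R add-on libraries named in the list above are already part of your own environment, a check along these lines may help. This is only a sketch: the package names are taken from the list, the variable names are illustrative, and the result depends entirely on what is installed on your machine.

```r
# R packages named in the list above; whether they are installed depends on your machine
pkgs <- c("tidyverse", "ggplot2", "caret")

# requireNamespace() returns TRUE if a package's namespace can be loaded and
# FALSE otherwise, without attaching the package to the search path
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: one TRUE/FALSE per package
print(available)
```

A `FALSE` entry simply means that package has not been installed yet; it is not an error.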


# Installing R

R is a full-featured statistical computing language. Unlike Python, which started as a general-purpose language, R has always been focused on data analysis. R is free to use under the [GNU General Public License](https://www.r-project.org/COPYING) and is available for Windows, macOS, and Linux/FreeBSD.

We will be using other software that works with R. To get started, we need to download and install the base R package. You can download base R from https://cloud.r-project.org/ by selecting the operating system you are using.

If you have any issues, try searching for a solution using the error description you get. 

If you have successfully installed R, you should see an application called "R" in your applications folder. If you start that application and get a console like the one below, you have successfully installed it.

<img src="https://raw.githubusercontent.com/azbones/cis503_data_science_tools/master/images/r_console.png">

In [None]:
print("Hello, World!")
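Once the console is running, one quick way to confirm which version of R was installed is to print the built-in version string. This is a minimal sketch; the exact text you see depends on the release you downloaded.

```r
# R.version.string is a built-in character string describing the running R release
installed_version <- R.version.string
print(installed_version)
```

If this prints a version line rather than an error, the base R installation is working.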

# Installing RStudio

In [None]:
# List each installed package with its version and priority (columns 1, 3, and 4)
as.data.frame(installed.packages()[,c(1,3:4)])