---
layout: page
title: Introduction to Data Science using R
subtitle: Programming with R
minutes: 5
---

## Brief history of R

-   Until the mid-1970s, most statistical computing was done in Fortran.
-   In 1975-76, the S programming language was developed at Bell Labs to offer an alternative, more interactive approach to statistical computing.
-   R was developed as an open source reimplementation of S, with its first stable release (version 1.0.0) in 2000. R reached its fourth major version in 2020.

## Why R

-   R is written by statisticians, for statisticians. 
-   R has been widely adopted by the statistics community.
-   R offers a wide range of statistical techniques, largely thanks to enthusiastic contributions from the user community. These include linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. It also has excellent tools for data acquisition, extraction, manipulation, and visualization.
  

## What will we learn today?

There are three approaches to storing data for processing:

-   Data are stored on a single computer's local hard drive and can be processed by programs running on 
    the same computer. This is how most people manage their data in normal daily tasks.
-   Data are stored on remote storage systems. These systems often consist of multiple hard drives 
    to support reading and writing large amounts of data. Software programs accessing these data are located 
    on a different set of computers, and the data must be copied from the storage systems to these computers 
    over the network. This is the storage model of the Palmetto Supercomputer.
-   Data are stored on a set of computers, and the software programs accessing these data also run on 
    the same set of computers. This is the storage model of the Hadoop Distributed File System (HDFS).

<img src="fig/StorageSimplified.png"
     alt="Simple Presentation of Storage Models"
     style="height:500px">

HDFS uses the following design assumptions:

-   Hardware failure is the norm rather than the exception. Companies like Yahoo or Google run data 
    warehouses that have thousands of machines. Rather than trying to prevent failure, it is more realistic 
    to focus on failure recovery.
-   Data are accessed as a stream. HDFS is specifically designed for programs that process massive amounts of data in 
    batches, favoring high throughput over low-latency interactive access.
-   The amount of data to be processed is very large.
-   Data are written once but read many times (no data modification).
-   It is cheaper to move the computation (e.g., copy the programs) than to move the data.
-   The set of computers contains different types of hardware and software.
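
The idea that moving the computation is cheaper than moving the data can be illustrated with a toy sketch. This is not HDFS code: the `NODES` dictionary, the node names, and both functions are hypothetical stand-ins for machines that each hold part of a dataset. Each "node" reduces its own partition to a small summary, and only those summaries are combined.

```python
# Toy illustration only: each "node" holds one partition of the ratings.
# NODES, local_sum_and_count, and global_mean are hypothetical names.
NODES = {
    "node-a": [4.0, 3.5, 5.0],
    "node-b": [2.0, 4.5],
    "node-c": [3.0, 3.0, 4.0, 1.0],
}


def local_sum_and_count(ratings):
    """Runs 'on the node': reduce a whole partition to two numbers."""
    return sum(ratings), len(ratings)


def global_mean(nodes):
    """Combine the per-node summaries into one overall mean."""
    partials = [local_sum_and_count(r) for r in nodes.values()]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


mean_rating = global_mean(NODES)

# Two numbers travel per node, no matter how large its partition is.
values_moved = 2 * len(NODES)
values_stored = sum(len(r) for r in NODES.values())

print(mean_rating, values_moved, values_stored)
```

With realistic partition sizes, the number of values stored grows with the data while the number of values moved stays fixed per node, which is the point of this assumption.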


## Movie Ratings and Recommendation

An independent movie company is looking to invest in a new movie project. With limited financing, the company wants to 
analyze audience reactions, particularly toward various movie genres, in order to identify the most promising 
movie project to focus on. The company relies on data collected from a publicly available recommendation service 
by [MovieLens](http://dl.acm.org/citation.cfm?id=2827872). This 
[dataset](http://files.grouplens.org/datasets/movielens/ml-10m-README.html) 
contains "**10,000,054** ratings and **95,580** tag applications across **10,681** movies. These data were created 
by **71,567** users between January 09, 1995 and January 29, 2016." 

From this dataset, several analyses are possible, including the following:

1.   Find movies which have the highest ratings over the years and identify the corresponding genre.
2.   Find genres which have the highest ratings over the years.
3.   Find users who rate movies most frequently in order to contact them for in-depth marketing analysis.
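
On a small scale, the three analyses above can be sketched in a few lines of Python. The sample below is made up (it is not taken from the dataset) and only mimics the `::`-separated layout of the MovieLens 10M `movies.dat` and `ratings.dat` files. The helper names are hypothetical, and the sketch ignores timestamps, so it does not address the "over the years" part of the questions.

```python
from collections import Counter, defaultdict

# Made-up sample rows: MovieID::Title::Genres (genres separated by "|").
MOVIES = """1::Movie A (1995)::Adventure|Comedy
2::Movie B (1995)::Adventure|Fantasy
3::Movie C (1995)::Crime|Thriller"""

# Made-up sample rows: UserID::MovieID::Rating::Timestamp.
RATINGS = """1::1::5::838985046
1::3::4::838983525
2::1::4::868245777
2::2::3::868244562
3::3::4.5::1136075500
1::2::2.5::838984474"""


def parse_movies(text):
    """Map each movie id to a (title, list of genres) pair."""
    movies = {}
    for line in text.splitlines():
        movie_id, title, genres = line.split("::")
        movies[int(movie_id)] = (title, genres.split("|"))
    return movies


def parse_ratings(text):
    """Return (user id, movie id, rating) tuples; the timestamp is dropped."""
    rows = []
    for line in text.splitlines():
        user_id, movie_id, rating, _timestamp = line.split("::")
        rows.append((int(user_id), int(movie_id), float(rating)))
    return rows


def mean_by_key(pairs):
    """Average the values that share a key."""
    sums, counts = defaultdict(float), Counter()
    for key, value in pairs:
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


movies = parse_movies(MOVIES)
ratings = parse_ratings(RATINGS)

# 1. Highest-rated movie and its genres.
movie_means = mean_by_key((m, r) for _, m, r in ratings)
best_movie = max(movie_means, key=movie_means.get)

# 2. Highest-rated genre (a rating counts once for each genre of its movie).
genre_means = mean_by_key((g, r) for _, m, r in ratings for g in movies[m][1])
best_genre = max(genre_means, key=genre_means.get)

# 3. User with the most ratings.
most_active_user, n_ratings = Counter(u for u, _, _ in ratings).most_common(1)[0]

print(movies[best_movie], best_genre, most_active_user, n_ratings)
```

This approach holds every rating in memory on one machine, which is fine for a handful of rows but is exactly what stops working as the dataset grows.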

These types of analyses, which are somewhat ambiguous, demand the ability to process large amounts of data in a 
relatively short amount of time for decision support purposes. In these situations, the size of the data typically 
makes analysis on a single machine impossible and analysis using a remote storage system impractical. For the 
remainder of the lessons, we will learn how HDFS provides the basis for storing massive amounts of data and enables 
a programming approach to analyzing these data.
