<div style="text-align: right">INFO 6105 Data Sci Engineering Tools and Methods, Lecture 1, Day 2</div>
<div style="text-align: right">Prof. Dino Konstantopoulos, 9 January 2019</div>

## Lecture 1: Basic Operations - R Equivalence ##

We will import some libraries that we will introduce much later in our lecture series. Don't worry about them for now. Just focus on what you learned in R, and on how similar the equivalent Python code can be.

### Vectors, Data, Matrices and Subsetting ###

In [1]:
from __future__ import division
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
x = np.array([2,7,5])    # explicit vector creation
x

In [None]:
y = np.arange(4, 13, 3)  # vector creation from a sequence (start, stop, step); 3 elements, so it lines up with x
y

In [None]:
z = [i for i in range(13) if i > 4]
z

In [None]:
x + y    # vectors can be added

In [None]:
x / y    # divided

In [None]:
x ** y    # exponentiated

In [None]:
x[1]    # vector elements can be selected by position

In [None]:
x[1:3]  # multiple elements can be selected using slices

In [None]:
x[-2]  # elements can be specified as offset from end

In [None]:
x[np.array([0,1])]  # elements can be specified as an array

In [1]:
Z = np.matrix(np.arange(1,17)).reshape((4, 4))
Z                 # note: NumPy fills row-wise by default; R's matrix() fills column-wise

In [None]:
Z[2:4, 1:3]    # R is 1-based and includes the ending index; Python is 0-based and excludes it

In [None]:
Z[:, 1:3]    # column slice

In [None]:
Z.shape

Matrices can also be added, and multiplied with each other or by scalars. Try it!
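
A minimal sketch of these operations, assuming two small illustrative arrays `A` and `B` (hypothetical names, not defined elsewhere in this notebook). It uses plain NumPy arrays, where `*` is element-wise and `dot` is the matrix product; for `np.matrix` objects such as `Z`, `*` is already the matrix product.

```python
import numpy as np

# A and B are illustrative 2x2 arrays, not part of the lecture data
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

total = A + B         # element-wise addition
scaled = 2 * A        # multiplication by a scalar
elementwise = A * B   # element-wise product (for plain arrays)
product = A.dot(B)    # matrix product, rows of A times columns of B

print(total)
print(scaled)
print(elementwise)
print(product)
```

In R, the corresponding operators would be `+`, `*` (element-wise) and `%*%` (matrix product).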

### Generating Random Data, Graphics ###

In [None]:
x = np.random.uniform(0.0, 1.0, 50)
x

In [None]:
y = np.random.normal(0.0, 1.0, 50)
y

In [None]:
fig, ax = plt.subplots()
plt.scatter(x, y)

In [None]:
fig, ax = plt.subplots()
plt.xlabel("Random Uniform")
plt.ylabel("Random Normal")
plt.scatter(x, y, marker='o', color='red')  # plot customizations

A histogram is a *very important concept in statistics*. It is a graphical display in which the data is grouped into ranges (such as "100 to 149", "150 to 199", etc.) and then plotted as bars. It looks similar to a bar graph, but in a histogram each bar stands for a range of data.

A histogram is an *approximate* representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable. It differs from a bar graph in that a bar graph relates two variables, whereas a histogram describes only one!

To construct a histogram, the first step is to "bin" (or "bucket") the range of values: divide the entire range into a series of intervals, then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.
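
The binning step can be sketched with `np.histogram`, which returns the counts and the bin edges without drawing anything. The sample data, the fixed seed, and the choice of 5 bins are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)       # fixed seed so the sketch is repeatable
samples = rng.normal(0.0, 1.0, 50)   # illustrative data: 50 draws from a standard normal

# Split the range of the data into 5 equal-width, adjacent bins and count the values in each
counts, edges = np.histogram(samples, bins=5)

# Each bin is half-open [left, right), except the last one, which also includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print("%.2f to %.2f: %d values" % (left, right, n))
```

`plt.hist` performs the same counting internally and then draws one bar per bin.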

In [None]:
plt.subplot(121)    # 121 means 1 row, 2 columns, first subplot
plt.scatter(x, y)
plt.subplot(122)
plt.hist(y)

### Reading in data ###

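Relative file paths are resolved against the notebook's working directory. For a notebook saved in your C:\Users\<username> folder, that is where all file references start. So if you're reading data/Auto.csv, you need to create a data folder in C:\Users\<username> and copy the Auto.csv file into it.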
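
If you are unsure where the notebook is looking, a quick check along these lines can help. It is only a sketch; `csv_path` is a name introduced here for illustration.

```python
import os

print(os.getcwd())                 # folder that relative paths are resolved against

csv_path = os.path.join("data", "Auto.csv")
print(os.path.abspath(csv_path))   # full path that pd.read_csv would try to open
print(os.path.exists(csv_path))    # True only if the file is already in place
```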

In [None]:
auto_df = pd.read_csv("data/Auto.csv")
auto_df.columns     # column names

In [None]:
auto_df.shape    # number of rows, number of columns

In [None]:
type(auto_df)

In [None]:
auto_df.describe()  # similar to R's summary(auto_df)

In [None]:
ax = auto_df.plot(x="cylinders", y="mpg", style='o')  # DataFrame.plot creates its own figure
ax.set_ylabel("MPG")                                  # so set the label on the axes it returns

In [None]:
auto_df.boxplot(column="mpg", by="cylinders") # MPG distribution by number of cylinders

In [None]:
# Similar to R's pairs(): scatter plots for every pair of numeric columns, with the
# distribution of each column along the diagonal.
# The R version that uses formulas does not seem to have a Python equivalent (and doesn't seem
# to be very useful for exploratory analysis IMO).
# Note: scatter_matrix lives in pd.plotting; the old pd.tools.plotting path was removed from pandas.
axes = pd.plotting.scatter_matrix(auto_df, color="brown")