# Why Python for data analysis?
There are many reasons to use Python for data science. It is one of the newer languages to be adopted in the data science ecosystem (compared to, say, R and SAS), yet it is used just as frequently for analysis as either of them. A good foundation in Python, R, and SAS should be a *must* for **every data scientist**.

In this course, Python gives us an open source way to perform machine learning that runs on just about any machine. Let's start by looking at the NumPy and pandas packages for analyzing data.

With that in mind, let's go over the following:
- NumPy matrices
- Simple operations on arrays and matrices
- Indexing with NumPy
- Pandas for tabular data
- Representing categorical data (discussion point)

In [2]:
import numpy as np

x = np.random.rand(5,3)
x

array([[ 0.04607703,  0.36076589,  0.95251868],
       [ 0.02818899,  0.59528139,  0.99669465],
       [ 0.20042058,  0.48024933,  0.32683431],
       [ 0.69086873,  0.74432887,  0.24808531],
       [ 0.54275188,  0.11294145,  0.67194907]])

In [3]:
x.shape

(5, 3)

In [4]:
x.dtype

dtype('float64')

In [5]:
y = np.random.rand(3,4)
z = x*y
z

ValueError: operands could not be broadcast together with shapes (5,3) (3,4) 

In [6]:
z = np.dot(x,y)
z

array([[ 0.49794687,  1.13201149,  0.75607934,  0.19584876],
       [ 0.59040447,  1.27295694,  0.86749169,  0.20515853],
       [ 0.36253437,  0.64648446,  0.41805351,  0.22002309],
       [ 0.60235927,  0.94393743,  0.53929437,  0.56970748],
       [ 0.48492605,  1.00022987,  0.56073971,  0.48671654]])
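The two cells above show the difference between `*` and `np.dot` on arrays: `*` multiplies entry by entry, so the shapes must match (or be broadcastable), while `np.dot` is the matrix product and only needs the inner dimensions to agree. A minimal sketch with fixed values, where the names `a`, `b`, and `c` are illustrative and not part of the notebook:

```python
import numpy as np

a = np.ones((5, 3))
b = np.full((5, 3), 2.0)
c = np.ones((3, 4))

# Same shape: * multiplies entry by entry.
elementwise = a * b
print(elementwise.shape)

# Shapes (5, 3) and (3, 4) cannot be broadcast, so * raises ValueError.
try:
    a * c
    failed = False
except ValueError as err:
    failed = True
    print("elementwise product failed:", err)

# Matrix product: inner dimensions (3 and 3) agree, result is (5, 4).
product = np.dot(a, c)
print(product.shape)
```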

In [7]:
# np.mat was removed in NumPy 2.0; np.asmatrix is the equivalent call
x = np.asmatrix(x)
y = np.asmatrix(y)
z = x*y
z

matrix([[ 0.49794687,  1.13201149,  0.75607934,  0.19584876],
        [ 0.59040447,  1.27295694,  0.86749169,  0.20515853],
        [ 0.36253437,  0.64648446,  0.41805351,  0.22002309],
        [ 0.60235927,  0.94393743,  0.53929437,  0.56970748],
        [ 0.48492605,  1.00022987,  0.56073971,  0.48671654]])
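The NumPy documentation no longer recommends the `matrix` class, even for linear algebra. As a hedged alternative, the same product can be written on plain arrays with the `@` operator (Python 3.5+). The seeded generator below is an assumption added only to make the sketch repeatable; its values differ from the output shown above.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so repeated runs give the same numbers
x = rng.random((5, 3))
y = rng.random((3, 4))

# @ is the matrix-product operator for ordinary ndarrays.
z = x @ y
print(z.shape)

# It agrees with np.dot for 2-D inputs.
same = np.allclose(z, np.dot(x, y))
print(same)
```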

# Indexing

In [8]:
x1 = np.array([[1,2,3],[4,5,6],[7,8,9]])
x1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [12]:
for row in range(x1.shape[0]):
    print(x1[row,1])

2
5
8


In [10]:
x1[0,:]

array([1, 2, 3])

In [11]:
x1[ x1[:,1]>3 ]

array([[4, 5, 6],
       [7, 8, 9]])

In [17]:
x2 = np.array(range(10))
x2

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [18]:
x2.shape

(10,)

In [19]:
idx = x2>5
idx

array([False, False, False, False, False, False,  True,  True,  True,  True], dtype=bool)

In [20]:
x2[idx]

array([6, 7, 8, 9])

In [21]:
x2[x2>5]

array([6, 7, 8, 9])
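Boolean masks like `x2>5` can also be combined. A short sketch, assuming the same `x2` as above: use `&`, `|`, and `~` (not `and`, `or`, `not`), and wrap each comparison in parentheses because these operators bind more tightly than `>` and `<`.

```python
import numpy as np

x2 = np.arange(10)

# Both conditions must hold.
between = x2[(x2 > 2) & (x2 < 7)]
print(between)

# Either condition may hold.
tails = x2[(x2 < 2) | (x2 > 7)]
print(tails)

# Negate a mask.
not_large = x2[~(x2 > 5)]
print(not_large)
```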

# Named columns
So what if we have a matrix of data where each row is one observation and each column holds the values of one feature?

In [12]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],[50,2200,1],[48,2300,1],[34,0,2],[30,100,2]])
data

array([[  64, 2100,    1],
       [  50, 2200,    1],
       [  48, 2300,    1],
       [  34,    0,    2],
       [  30,  100,    2]])

In [23]:
data2 = data[data[:,1]>1500]
data2

array([[  64, 2100,    1],
       [  50, 2200,    1],
       [  48, 2300,    1]])
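The filter above relies on remembering that `time` lives in column 1. One way to avoid the magic number while staying in NumPy is to look the position up in `col_names`. This is only a sketch of the bookkeeping involved; the variable names `time_col` and `late` are illustrative.

```python
import numpy as np

col_names = ['temperature', 'time', 'day']
data = np.array([[64, 2100, 1], [50, 2200, 1], [48, 2300, 1], [34, 0, 2], [30, 100, 2]])

# Translate the column name into its integer position.
time_col = col_names.index('time')

# Keep the rows whose time value exceeds 1500.
late = data[data[:, time_col] > 1500]
print(late)
```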

In [13]:
# pandas to the rescue
import pandas as pd

df = pd.DataFrame(data,columns=col_names)
df

   temperature  time  day
0           64  2100    1
1           50  2200    1
2           48  2300    1
3           34     0    2
4           30   100    2


In [25]:
df[df.time>1500]

   temperature  time  day
0           64  2100    1
1           50  2200    1
2           48  2300    1
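Row filters on a DataFrame can be combined and paired with a column selection through `.loc`. A minimal sketch, rebuilding the same table so the block runs on its own; the name `late_and_warm` and the threshold of 50 are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame(
    [[64, 2100, 1], [50, 2200, 1], [48, 2300, 1], [34, 0, 2], [30, 100, 2]],
    columns=['temperature', 'time', 'day'],
)

# Rows: time after 1500 AND temperature of at least 50.
# Columns: only the two named ones.
late_and_warm = df.loc[
    (df['time'] > 1500) & (df['temperature'] >= 50),
    ['temperature', 'time'],
]
print(late_and_warm)
```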


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int64
time           5 non-null int64
day            5 non-null int64
dtypes: int64(3)
memory usage: 160.0 bytes


In [27]:
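df['day'] = df['day'].astype(object)  # let the column hold both ints and strings
df.loc[df.day==1, 'day'] = 'Mon'      # .loc avoids chained assignment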

In [28]:
df

   temperature  time  day
0           64  2100  Mon
1           50  2200  Mon
2           48  2300  Mon
3           34     0    2
4           30   100    2


In [30]:
# assign the result back instead of using inplace=True on a column accessed from df
df['day'] = df['day'].replace(to_replace=list(range(7)),value=['Su','Mon','Tues','Wed','Th','Fri','Sat'])
df

   temperature  time   day
0           64  2100   Mon
1           50  2200   Mon
2           48  2300   Mon
3           34     0  Tues
4           30   100  Tues


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int64
time           5 non-null int64
day            5 non-null object
dtypes: int64(2), object(1)
memory usage: 160.0+ bytes
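As a starting point for the discussion on representing categorical data: after the replacement, `day` is a generic `object` column of strings. One possible alternative is the pandas `category` dtype, which stores small integer codes alongside a list of labels. The sketch below assumes the same 0 to 6 coding used above (0 is `'Su'`); the names `day_codes`, `labels`, `days`, and `one_hot` are illustrative.

```python
import pandas as pd

day_codes = [1, 1, 1, 2, 2]
labels = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']

# Build an ordered categorical directly from the integer codes.
days = pd.Series(pd.Categorical.from_codes(day_codes, categories=labels, ordered=True))
print(days.dtype)
print(days.tolist())

# The integer codes are still available for models that need numbers.
codes = days.cat.codes.tolist()
print(codes)

# One-hot encoding is another common representation.
one_hot = pd.get_dummies(days)
print(one_hot)
```

Which representation is appropriate (ordered codes, unordered labels, or one-hot columns) depends on the model being used, which is the point of the discussion.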
