# Why Python for data analysis and machine learning?
There are many reasons to use Python for data science. It is one of the more recent arrivals in the data science ecosystem (compared to, say, R and SAS), but it is now used for analysis just as frequently as either of them. A good foundation in Python, R, and SAS should be a *must* for **every data scientist** and machine learning enthusiast.

In this course, Python gives us an open source way of performing machine learning that runs on just about any machine. So let's start by looking at the NumPy and pandas packages for analyzing data.

With that in mind, let's go over the following:
- NumPy matrices
- Simple operations on arrays and matrices
- Indexing with NumPy
- pandas for tabular data
- Representing categorical data (discussion point)

In [2]:
import sys
import numpy as np

print(sys.version)
print(np.__version__)

3.6.5 |Anaconda custom (64-bit)| (default, Apr 26 2018, 08:42:37) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
1.14.3


In [3]:
x = np.random.rand(5,3)
x

array([[0.80690697, 0.21845636, 0.51384176],
       [0.68794289, 0.49666161, 0.94782103],
       [0.28103213, 0.690014  , 0.92560308],
       [0.70250515, 0.87416963, 0.8918224 ],
       [0.13771647, 0.19171082, 0.61052283]])

In [4]:
x.shape

(5, 3)

In [5]:
x.dtype

dtype('float64')

In [6]:
y = np.random.rand(3,4)
# * multiplies element by element, so shapes (5,3) and (3,4) cannot be combined
z = x*y
z

ValueError: operands could not be broadcast together with shapes (5,3) (3,4) 
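
The error above comes from the fact that `*` works element by element, so the two shapes have to match or be broadcastable. Here is a minimal sketch of when `*` does work; the arrays `a`, `b`, and `row` are made up for illustration and are not part of the original notebook.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
b = np.full((2, 3), 2)           # same shape as a, so * pairs up the elements
row = np.array([10, 20, 30])     # shape (3,), broadcast across each row of a

elementwise = a * b              # shape (2, 3)
broadcasted = a * row            # shape (2, 3)

print(elementwise)
print(broadcasted)
```

Broadcasting lines shapes up from the last dimension, which is why a `(3,)` array combines with a `(2, 3)` array while `(5,3)` and `(3,4)` do not.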

In [7]:
# we can ask for matrix multiplication explicitly with np.dot
z = np.dot(x,y)
z

array([[0.86977183, 0.85459189, 1.04006705, 0.86849717],
       [1.30939212, 1.05446332, 1.56098127, 1.10943015],
       [1.28637098, 0.87647506, 1.43977599, 0.89886784],
       [1.64603496, 1.29164312, 1.72678753, 1.22299497],
       [0.57471885, 0.37556688, 0.77649964, 0.46910494]])

In [8]:
# or we can use the overloaded matrix multiplication operator
z = x @ y
z

array([[0.86977183, 0.85459189, 1.04006705, 0.86849717],
       [1.30939212, 1.05446332, 1.56098127, 1.10943015],
       [1.28637098, 0.87647506, 1.43977599, 0.89886784],
       [1.64603496, 1.29164312, 1.72678753, 1.22299497],
       [0.57471885, 0.37556688, 0.77649964, 0.46910494]])
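
As a quick check of the two approaches, the sketch below confirms that `np.dot` and `@` give the same result and that the output shape follows the usual rule: `(5,3)` times `(3,4)` gives `(5,4)`. The fixed seed is an assumption added so that reruns produce the same arrays.

```python
import numpy as np

np.random.seed(0)                 # fixed seed (an assumption) for repeatable arrays
x = np.random.rand(5, 3)
y = np.random.rand(3, 4)

z_dot = np.dot(x, y)
z_at = x @ y

# the inner dimensions (3 and 3) must agree; the outer ones give the result shape
inner_dims_match = x.shape[1] == y.shape[0]

print(inner_dims_match)
print(z_dot.shape)
print(np.allclose(z_dot, z_at))
```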

# Indexing

In [9]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
x1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [10]:
for row in range(x1.shape[0]):
    print(x1[row,1])

2
5
8


In [11]:
x1[:,1]

array([2, 5, 8])

In [12]:
x1[:,1]>3

array([False,  True,  True])

In [13]:
# boolean mask indexing: keep the rows whose second column is greater than 3
x1[ x1[:,1]>3 ]

array([[4, 5, 6],
       [7, 8, 9]])
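
Masks can also be combined. The sketch below reuses `x1` and adds a second condition on the third column (the second condition is an illustration, not something from the original notebook).

```python
import numpy as np

x1 = np.array([[1, 2, 3],
               [4, 5, 6],
               [7, 8, 9]])

# each comparison needs its own parentheses because & binds tighter than > and <
mask = (x1[:, 1] > 3) & (x1[:, 2] < 9)
selected = x1[mask]

print(mask)
print(selected)
```

Use `&` and `|` here rather than `and` and `or`, since the Python keywords do not operate element by element on arrays.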

In [14]:
x2 = np.arange(10)
x2

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [15]:
x2.shape

(10,)

In [16]:
idx = x2>5
idx

array([False, False, False, False, False, False,  True,  True,  True,
        True])

In [17]:
x2[idx]

array([6, 7, 8, 9])

In [18]:
x2[x2>5]

array([6, 7, 8, 9])
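
A boolean mask is useful for more than selecting values. As a small sketch (the names `count`, `positions`, and `clipped` are ours), we can count matches, recover their integer positions, and assign through the mask.

```python
import numpy as np

x2 = np.arange(10)
idx = x2 > 5

count = idx.sum()               # True counts as 1, so this is the number of matches
positions = np.where(idx)[0]    # integer positions of the True entries

clipped = x2.copy()             # copy so the original array is left alone
clipped[idx] = 5                # assign to every element selected by the mask

print(count)
print(positions)
print(clipped)
```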

# Named columns
So what if we have a matrix of data where each row is an observation and each column holds the values of one feature?

In [19]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,4],
                 [48,2300,3],
                 [34,0,   2],
                 [30,100, 5]])
data

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  34,    0,    2],
       [  30,  100,    5]])

In [20]:
data2 = data[data[:,1]>1500]
data2

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3]])
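
Notice that `data[:,1]` only makes sense if we remember that column 1 is the time. One workaround with plain NumPy, sketched here under the assumption that `col_names` is kept in the same order as the columns, is to look the position up by name.

```python
import numpy as np

col_names = ['temperature', 'time', 'day']
data = np.array([[64, 2100, 1],
                 [50, 2200, 4],
                 [48, 2300, 3],
                 [34, 0,    2],
                 [30, 100,  5]])

time_col = col_names.index('time')        # position of the 'time' column
late = data[data[:, time_col] > 1500]     # same filter as above, but by name

print(time_col)
print(late)
```

This works, but the names and the array are still two separate objects that have to be kept in sync by hand.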

In [21]:
# pandas to the rescue
import pandas as pd

df = pd.DataFrame(data,columns=col_names)
df

   temperature  time  day
0           64  2100    1
1           50  2200    4
2           48  2300    3
3           34     0    2
4           30   100    5


In [22]:
df[df.time>1500]

   temperature  time  day
0           64  2100    1
1           50  2200    4
2           48  2300    3
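
Row filters can be combined with column selection in one step using `.loc`. This is a minimal sketch that rebuilds the same table from a dictionary; the second condition on temperature is added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'time': [2100, 2200, 2300, 0, 100],
                   'day': [1, 4, 3, 2, 5]})

# rows: late readings that are also warm; columns: only the two we care about
rows = (df['time'] > 1500) & (df['temperature'] >= 50)
late_and_warm = df.loc[rows, ['temperature', 'day']]

print(late_and_warm)
```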


In [23]:
# let's get a description of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int64
time           5 non-null int64
day            5 non-null int64
dtypes: int64(3)
memory usage: 200.0 bytes


In [24]:
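# use .loc instead of chained indexing (df.day[...] = ...), which may write to a copy
df['day'] = df['day'].astype(object)  # let the column hold strings next to the integers
df.loc[df.day==1, 'day'] = 'Mon'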
df

   temperature  time  day
0           64  2100  Mon
1           50  2200    4
2           48  2300    3
3           34     0    2
4           30   100    5


In [25]:
# there is almost always a more efficient built-in pandas function
# assign the result back rather than using inplace=True on a column accessed from df
df['day'] = df['day'].replace(to_replace=list(range(7)),
                              value=['Su','Mon','Tues','Wed','Th','Fri','Sat'])
df

   temperature  time   day
0           64  2100   Mon
1           50  2200    Th
2           48  2300   Wed
3           34     0  Tues
4           30   100   Fri


In [26]:
# notice that the dtype of the day column is now object (Python strings), not a pandas categorical
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
temperature    5 non-null int64
time           5 non-null int64
day            5 non-null object
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
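
An `object` column just stores Python strings. If we want pandas to treat the days as a fixed set of categories, one option is the `category` dtype. The sketch below assumes the week order used in the replacement list above; the names `days`, `order`, and `day_cat` are ours.

```python
import pandas as pd

days = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'])
order = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']

# an ordered categorical stores one small integer code per row plus the category list
day_cat = pd.Series(pd.Categorical(days, categories=order, ordered=True))
codes = day_cat.cat.codes.tolist()

print(day_cat.dtype)
print(codes)
```

With this ordering the codes recover the integers we started with, and comparisons such as `day_cat > 'Tues'` respect the order of the week.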


In [27]:
# one-hot encoding example: one indicator column per distinct value of day
pd.get_dummies(df.day)

   Fri  Mon  Th  Tues  Wed
0    0    1   0     0    0
1    0    0   1     0    0
2    0    0   0     0    1
3    0    0   0     1    0
4    1    0   0     0    0
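
To use the encoded columns in a model, we usually want them back alongside the numeric features. Here is a minimal sketch, assuming a small frame with the same `temperature` and `day` values; the `day_` prefix is our choice. Depending on the pandas version, the indicator columns may be displayed as 0/1 or as True/False.

```python
import pandas as pd

df = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                   'day': ['Mon', 'Th', 'Wed', 'Tues', 'Fri']})

# one indicator column per day, named day_<value>
dummies = pd.get_dummies(df['day'], prefix='day')

# drop the original string column and attach the indicators in its place
encoded = pd.concat([df.drop(columns='day'), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded.shape)
```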
