# Why Python for data analysis and machine learning?
There are many reasons to use Python for data science. Its data science ecosystem is younger than that of R or SAS, but it is now used just as frequently for analysis as either of them. A good foundation in Python and R (and SAS or SPSS) should be a *must* for **every data scientist** and machine learning enthusiast.

Here is a LaTeX equation: $\lambda=5$

In this course, Python gives us an open-source way of performing machine learning that runs on just about any machine. Let's start by looking at the NumPy and pandas packages for analyzing data.

With that in mind, let's go over the following:
- NumPy matrices
- Simple operations on arrays and matrices
- Indexing with NumPy
- pandas for tabular data
- Representing categorical data (discussion point)

In [2]:
import sys
import numpy as np

print(sys.version)
print(np.__version__)

3.8.16 (default, Mar  1 2023, 21:18:45) 
[Clang 14.0.6 ]
1.23.5


In [3]:
x = np.random.rand(5,3)
x

array([[0.76339766, 0.62632236, 0.25960689],
       [0.40765455, 0.35943807, 0.38501316],
       [0.70459107, 0.44395245, 0.99896915],
       [0.34238641, 0.02821981, 0.15188864],
       [0.55646648, 0.47633573, 0.45770129]])

In [4]:
x.shape

(5, 3)

In [5]:
x.dtype

dtype('float64')

In [6]:
# will this work?
y = np.random.rand(3,4)
z = x*y
z

ValueError: operands could not be broadcast together with shapes (5,3) (3,4) 
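
The error above comes from the fact that `*` is an elementwise product, so NumPy tries to broadcast the two shapes against each other. A minimal sketch of when that succeeds and when it fails, assuming illustrative arrays `a`, `b`, `c` and `row` that are not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable
a = rng.random((5, 3))
b = rng.random((3, 4))
c = rng.random((5, 3))

# Elementwise product: identical shapes always work
elementwise = a * c
print(elementwise.shape)

# A length-3 row is broadcast across all 5 rows of a
row = np.array([1.0, 10.0, 100.0])
scaled = a * row
print(scaled.shape)

# (5, 3) and (3, 4) cannot be broadcast together, so NumPy raises ValueError
try:
    a * b
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("ValueError:", err)
```

Broadcasting compares shapes from the trailing dimension backwards; each pair of dimensions must be equal or one of them must be 1.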

In [7]:
# matrix multiplication has to be requested explicitly, for example with np.dot
z = np.dot(x,y)
z

array([[0.43803297, 0.98036877, 0.41344914, 0.86729313],
       [0.26619768, 0.65550723, 0.39960669, 0.50750554],
       [0.37985583, 1.25958372, 0.90692381, 0.85585519],
       [0.050642  , 0.39549161, 0.17982217, 0.29657844],
       [0.35011516, 0.85661079, 0.49531994, 0.67768181]])

In [7]:
# or we can use the overloaded matrix multiplication operator (equivalent to np.dot for 2-D arrays)
z = x @ y
z

array([[1.62519216, 1.61165567, 2.0642909 , 1.47188586],
       [0.81208096, 1.00549686, 0.82198292, 0.44786103],
       [1.52738965, 1.53879288, 1.44356732, 0.99309143],
       [0.84452074, 1.08814595, 1.31620868, 0.78637785],
       [0.45454204, 0.50838927, 0.98475922, 0.68175702]])
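
For 2-D arrays, `np.dot`, `@` and `np.matmul` compute the same matrix product. A minimal check, assuming freshly generated arrays `a` and `b` (illustrative names) from a seeded generator so that the comparison is repeatable:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.random((5, 3))
b = rng.random((3, 4))

z_dot = np.dot(a, b)        # function form
z_at = a @ b                # operator form
z_matmul = np.matmul(a, b)  # what @ calls under the hood for ndarrays

print(z_dot.shape)                  # inner dimensions (3 and 3) are contracted
print(np.allclose(z_dot, z_at))     # compare with a floating point tolerance
```

Comparing with `np.allclose` rather than `==` is the safer habit for floating point results.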

# Indexing

In [8]:
x1 = np.array([[1,2,3],
               [4,5,6],
               [7,8,9]])
x1

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [9]:
for row in range(x1.shape[0]):
    print(x1[row,1])

2
5
8


In [10]:
print(x1[:,1])
print(x1[:,1]>3)
# boolean mask indexing: keep only the rows whose second column is > 3
print(x1[ x1[:,1]>3 ])

[2 5 8]
[False  True  True]
[[4 5 6]
 [7 8 9]]


In [11]:
x2 = np.array(range(10))
print(x2)
x2.shape

[0 1 2 3 4 5 6 7 8 9]


(10,)

In [14]:
x_tmp = x2[:,np.newaxis]
x_tmp

array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
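
`np.newaxis` adds a dimension of length 1, which is mostly useful because it changes how the array broadcasts. A short sketch, assuming an illustrative vector `v` that is not part of the notebook:

```python
import numpy as np

v = np.arange(4)            # shape (4,)
col = v[:, np.newaxis]      # shape (4, 1): a column
row = v[np.newaxis, :]      # shape (1, 4): a row

# Broadcasting a column against a row fills in a full (4, 4) table
table = col * row
print(col.shape, row.shape, table.shape)

# reshape(-1, 1) is an equivalent way to build the column
same = np.array_equal(col, v.reshape(-1, 1))
print(same)
```

Here `table[i, j]` equals `v[i] * v[j]`, so the one-dimensional vector has been turned into a multiplication table without any explicit loop.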

In [15]:
idx = x2>5
print(idx)
print(x2[idx])

[False False False False False False  True  True  True  True]
[6 7 8 9]


In [16]:
x2[x2>5] # elements of x2 that are greater than 5

array([6, 7, 8, 9])
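
Boolean masks can also be combined. A minimal sketch, assuming an illustrative array `v` (not from the notebook):

```python
import numpy as np

v = np.arange(10)

# Combine conditions with & (and), | (or), ~ (not). The parentheses are
# required because these operators bind more tightly than comparisons.
middle = v[(v > 2) & (v < 7)]
ends = v[(v < 2) | (v > 7)]
odd = v[~(v % 2 == 0)]

# np.where returns the integer positions at which a mask is True
positions = np.where(v > 5)[0]

print(middle, ends, odd, positions)
```

Python's `and` / `or` keywords do not work elementwise on arrays, which is why the bitwise operators are used here.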

# Named columns
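What if we have a matrix of data where each row is an observation and each column holds the values of one feature? A plain NumPy array has no column names, so we have to keep track of which column index corresponds to which feature ourselves.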

In [17]:
col_names = ['temperature','time','day']
data = np.array([[64,2100,1],
                 [50,2200,4],
                 [48,2300,3],
                 [34,0,   2],
                 [30,100, 5]])
data

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  34,    0,    2],
       [  30,  100,    5]])

In [18]:
data2 = data[data[:,1]>1500]
data2

array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3]])
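
With plain NumPy, the link between a column name and its position has to be maintained by hand. One possible sketch uses a dictionary, here called `col_idx` (a hypothetical helper, not something NumPy provides):

```python
import numpy as np

col_names = ['temperature', 'time', 'day']
data = np.array([[64, 2100, 1],
                 [50, 2200, 4],
                 [48, 2300, 3],
                 [34,    0, 2],
                 [30,  100, 5]])

# Hypothetical helper: map each column name to its integer position
col_idx = {name: i for i, name in enumerate(col_names)}

# Rows recorded after 1500, then only their temperature column
late = data[data[:, col_idx['time']] > 1500]
late_temps = late[:, col_idx['temperature']]
print(late_temps)
```

This works, but every filter needs the lookup spelled out, and nothing stops the name list and the array from drifting out of sync.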

In [19]:
# pandas to the rescue
import pandas as pd
print(pd.__version__)

df = pd.DataFrame(data,columns=col_names)
df

2.0.1


   temperature  time  day
0           64  2100    1
1           50  2200    4
2           48  2300    3
3           34     0    2
4           30   100    5


In [20]:
# the data can always be retrieved as a NumPy array with .to_numpy()
print(type(df.to_numpy()))
df.to_numpy()

<class 'numpy.ndarray'>


array([[  64, 2100,    1],
       [  50, 2200,    4],
       [  48, 2300,    3],
       [  34,    0,    2],
       [  30,  100,    5]])

In [21]:
df[df.time>1500]

   temperature  time  day
0           64  2100    1
1           50  2200    4
2           48  2300    3


In [22]:
# let's get a description of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   temperature  5 non-null      int64
 1   time         5 non-null      int64
 2   day          5 non-null      int64
dtypes: int64(3)
memory usage: 248.0 bytes


In [23]:
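# use .loc for the assignment; chained indexing such as df.day[mask] = ... may act on a copy
df.loc[df.day==1, 'day'] = 'Mon'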
df

   temperature  time  day
0           64  2100  Mon
1           50  2200    4
2           48  2300    3
3           34     0    2
4           30   100    5


In [24]:
# there is almost always a more efficient built-in pandas function;
# assigning the result back avoids calling inplace=True on a column selection
df['day'] = df.day.replace(to_replace=list(range(7)),
                           value=['Su','Mon','Tues','Wed','Th','Fri','Sat'])
df

   temperature  time   day
0           64  2100   Mon
1           50  2200    Th
2           48  2300   Wed
3           34     0  Tues
4           30   100   Fri
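
Another common way to recode integer day numbers is `Series.map` with a dictionary. A minimal sketch on a fresh copy of the small table, where the names `df_days` and `day_names` are illustrative and not part of the notebook:

```python
import pandas as pd

# Fresh copy of the small example table, with the original integer day codes
df_days = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                        'time': [2100, 2200, 2300, 0, 100],
                        'day': [1, 4, 3, 2, 5]})

day_names = {0: 'Su', 1: 'Mon', 2: 'Tues', 3: 'Wed', 4: 'Th', 5: 'Fri', 6: 'Sat'}

# map returns a new Series; any code missing from the dict would become NaN
df_days['day_name'] = df_days['day'].map(day_names)
print(df_days['day_name'].tolist())
```

Writing the result to a new column keeps the original integer codes available, which can be handy when both representations are needed.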


In [25]:
# notice how the dtype of the column has changed to object (Python strings); this is not the pandas "category" dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   temperature  5 non-null      int64 
 1   time         5 non-null      int64 
 2   day          5 non-null      object
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes


In [26]:
# one hot encoding example
pd.get_dummies(df.day)

     Fri    Mon     Th   Tues    Wed
0  False   True  False  False  False
1  False  False   True  False  False
2  False  False  False  False   True
3  False  False  False   True  False
4   True  False  False  False  False
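
As a starting point for the discussion on representing categorical data, pandas also has a dedicated `category` dtype. The sketch below assumes an illustrative Series `days` and a hand-written category order `order`; neither is part of the notebook:

```python
import pandas as pd

days = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'])
order = ['Su', 'Mon', 'Tues', 'Wed', 'Th', 'Fri', 'Sat']

# An ordered categorical stores small integer codes plus the list of categories
cat = pd.Series(pd.Categorical(days, categories=order, ordered=True))
print(cat.dtype)
print(cat.cat.codes.tolist())  # position of each value within `order`

# Because the categories are ordered, comparisons are meaningful
after_monday = (cat > 'Mon').tolist()

# One-hot encoding a categorical produces a column for every declared
# category, including days that never occur in the data
dummies = pd.get_dummies(cat)
print(dummies.shape)
```

Compared with the object column above, the one-hot table now has seven columns instead of five, which keeps the encoding stable if new days appear later.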


# Some Pandas Syntax

In [27]:
# selecting columns from a pandas dataframe
print(df.day)
print(df['day'])
df[['day','temperature']]

0     Mon
1      Th
2     Wed
3    Tues
4     Fri
Name: day, dtype: object
0     Mon
1      Th
2     Wed
3    Tues
4     Fri
Name: day, dtype: object


    day  temperature
0   Mon           64
1    Th           50
2   Wed           48
3  Tues           34
4   Fri           30


In [28]:
print(df.day[2]) # print the value
print(df.day[2:]) # print as pandas series

Wed
2     Wed
3    Tues
4     Fri
Name: day, dtype: object


In [29]:
# iloc selects rows by integer position
df.iloc[3:]

   temperature  time   day
3           34     0  Tues
4           30   100   Fri


In [30]:
df.iloc[3:][['day','temperature']]

    day  temperature
3  Tues           34
4   Fri           30


In [31]:
df[['day','temperature']].iloc[3:]

    day  temperature
3  Tues           34
4   Fri           30
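
With the default `RangeIndex`, positions and labels coincide, so the difference between `iloc` (position) and `loc` (label) is easy to miss. A small sketch with a non-default index, assuming an illustrative Series `s`:

```python
import pandas as pd

s = pd.Series(['Mon', 'Th', 'Wed', 'Tues', 'Fri'], index=[10, 20, 30, 40, 50])

by_position = s.iloc[3:]     # positions 3 and 4
by_label = s.loc[30:]        # labels 30, 40 and 50

# Label slices include the end label; position slices exclude the end position
label_slice = s.loc[10:30]   # three elements
position_slice = s.iloc[0:2] # two elements

print(by_position.tolist(), by_label.tolist())
print(label_slice.tolist(), position_slice.tolist())
```

Being explicit with `loc` or `iloc` avoids surprises once a dataframe has been filtered or sorted and its index no longer starts at 0.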


In [32]:
df.mean(numeric_only=True)

temperature      45.2
time           1340.0
dtype: float64

In [33]:
df.std(numeric_only=True)

temperature      13.608821
time           1180.254210
dtype: float64

In [34]:
df.mean(numeric_only=True)/df.std(numeric_only=True)

temperature    3.321375
time           1.135349
dtype: float64
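
The column-wise `mean` and `std` used above are also the ingredients for standardizing features (z-scoring). A minimal sketch on the numeric columns of the small table, where `df_num` is an illustrative name; note that pandas and NumPy use different default degrees of freedom for the standard deviation:

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame({'temperature': [64, 50, 48, 34, 30],
                       'time': [2100, 2200, 2300, 0, 100]})

# pandas std defaults to the sample estimate (ddof=1); NumPy defaults to ddof=0
sample_std = df_num['temperature'].std()
population_std = np.std(df_num['temperature'].to_numpy())
print(sample_std, population_std)

# z-score each column: subtract its mean, then divide by its standard deviation
z = (df_num - df_num.mean()) / df_num.std()
print(z.round(3))
```

The subtraction and division are aligned on column names, so each column is scaled by its own statistics without an explicit loop.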

In [35]:
df.time.unique()

array([2100, 2200, 2300,    0,  100])

# Pandas Block Manager
Let's take a look at some important points from the following post:
 - https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

Internally, pandas stores a DataFrame in a BlockManager, which groups columns of the same dtype together into blocks to make things fast:
<img src="https://uwekorn.com/images/pd-df-perception.002.png" width=200 height=200 />

In [36]:
df

   temperature  time   day
0           64  2100   Mon
1           50  2200    Th
2           48  2300   Wed
3           34     0  Tues
4           30   100   Fri


In [37]:
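# _data is a private attribute that exposes the BlockManager; internals like this can change between pandas versions
print(df._data.nblocks)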
df._data

2


BlockManager
Items: Index(['temperature', 'time', 'day'], dtype='object')
Axis 1: RangeIndex(start=0, stop=5, step=1)
NumericBlock: slice(0, 2, 1), 2 x 5, dtype: int64
ObjectBlock: slice(2, 3, 1), 1 x 5, dtype: object
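
To see how dtypes drive the number of blocks, the sketch below compares a frame with a single dtype against a mixed one. It relies on the private `_mgr` attribute, so treat it as exploration only: the attribute and the exact block counts are internal details and are assumptions that may not hold in every pandas version.

```python
import numpy as np
import pandas as pd

n = 1000
# Two columns with the same dtype can share a block
same_dtype = pd.DataFrame({'a': np.arange(n), 'b': np.arange(n)})
# An integer column and a float column cannot share a block
mixed = pd.DataFrame({'a': np.arange(n), 'b': np.arange(n, dtype=np.float64)})

# _mgr is private API; used here only to peek at the internal layout
blocks_same = same_dtype._mgr.nblocks
blocks_mixed = mixed._mgr.nblocks
print(blocks_same, blocks_mixed)
```

A frame with two dtypes needs at least two blocks, because a block is a single array with a single dtype.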

## Advantages and disadvantages:
This can speed up operations because pandas can inherently apply an operation along columns (like sums, etc.) in a single pass over a block of data, and therefore compiled code does much of the heavy lifting.

But **it might be bad** when you are adding columns to the data, because it can trigger consolidation of columns, which means copying data over into a new NumPy matrix. The slowdown also doesn't show up until a needed column is accessed (lazy data copying). Let's work through an example from: https://uwekorn.com/2020/05/24/the-one-pandas-internal.html

**Block consolidation is triggered after 100 blocks of data are reached.**

In [38]:
# we will start with a 2 column dataframe
# one column is an int and the other a float
# because there are two datatypes this has two blocks
df_example = pd.DataFrame({
    'int64': np.arange(1024 * 1024, dtype=np.int64),
    'float64': np.arange(1024 * 1024, dtype=np.float64),
})
df_example

           int64    float64
0              0        0.0
1              1        1.0
2              2        2.0
3              3        3.0
4              4        4.0
...          ...        ...
1048571  1048571  1048571.0
1048572  1048572  1048572.0
1048573  1048573  1048573.0
1048574  1048574  1048574.0


In [39]:
%%time 

# but now let's start to add columns one by one
# to be fast, pandas adds each as a new block 
# so we will have 98 blocks (2 + 96 new ones)
for i in range(96):
    df_example[f'new_{i}'] = df_example['int64'].to_numpy()
    
print(df_example._data.nblocks)
df_example

98
CPU times: user 51.7 ms, sys: 69.1 ms, total: 121 ms
Wall time: 170 ms


,int64,float64,new_0,new_1,new_2,new_3,new_4,new_5,new_6,new_7,...,new_86,new_87,new_88,new_89,new_90,new_91,new_92,new_93,new_94,new_95
0,0,0.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1.0,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,2,2.0,2,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
3,3,3.0,3,3,3,3,3,3,3,3,...,3,3,3,3,3,3,3,3,3,3
4,4,4.0,4,4,4,4,4,4,4,4,...,4,4,4,4,4,4,4,4,4,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1048571,1048571,1048571.0,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571,...,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571,1048571
1048572,1048572,1048572.0,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572,...,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572,1048572
1048573,1048573,1048573.0,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573,...,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573,1048573
1048574,1048574,1048574.0,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574,...,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574,1048574


In [40]:
%time df_example['dummy_name5'] = df_example['int64'].to_numpy() # add one more column (one more block)
print('Number of blocks in data:',df_example._data.nblocks)

# force consolidation 
%time df_example = df_example.reindex()

%time df_example['dummy_name6'] = df_example['int64'].to_numpy() # add another column after consolidation
print('Number of blocks in data:',df_example._data.nblocks)


CPU times: user 2.15 ms, sys: 13.2 ms, total: 15.3 ms
Wall time: 18.9 ms
Number of blocks in data: 99
CPU times: user 123 ms, sys: 678 ms, total: 800 ms
Wall time: 1.1 s
CPU times: user 890 µs, sys: 3.37 ms, total: 4.26 ms
Wall time: 4.74 ms
Number of blocks in data: 3
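
One way to sidestep the column-by-column pattern is to build all the new columns first and attach them in a single step with `pd.concat(axis=1)`. The sketch below is an illustration under assumptions, not a benchmark: the names `base`, `new_cols` and `wide` are made up, and the row count is reduced from the notebook's `1024 * 1024` so that it runs quickly.

```python
import numpy as np
import pandas as pd

n_rows = 10_000  # smaller than the notebook's 1024 * 1024 to keep the sketch quick
base = pd.DataFrame({
    'int64': np.arange(n_rows, dtype=np.int64),
    'float64': np.arange(n_rows, dtype=np.float64),
})

# Collect the new columns in one frame instead of inserting them one by one
new_cols = pd.DataFrame(
    {f'new_{i}': base['int64'].to_numpy() for i in range(96)},
    index=base.index,
)

# Attach everything in a single concat along the column axis
wide = pd.concat([base, new_cols], axis=1)
print(wide.shape)
```

Building the columns up front means the frame is assembled once, rather than growing block by block and paying for consolidation later.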


In [41]:
df_example.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048576 entries, 0 to 1048575
Data columns (total 100 columns):
 #   Column       Non-Null Count    Dtype  
---  ------       --------------    -----  
 0   int64        1048576 non-null  int64  
 1   float64      1048576 non-null  float64
 2   new_0        1048576 non-null  int64  
 3   new_1        1048576 non-null  int64  
 4   new_2        1048576 non-null  int64  
 5   new_3        1048576 non-null  int64  
 6   new_4        1048576 non-null  int64  
 7   new_5        1048576 non-null  int64  
 8   new_6        1048576 non-null  int64  
 9   new_7        1048576 non-null  int64  
 10  new_8        1048576 non-null  int64  
 11  new_9        1048576 non-null  int64  
 12  new_10       1048576 non-null  int64  
 13  new_11       1048576 non-null  int64  
 14  new_12       1048576 non-null  int64  
 15  new_13       1048576 non-null  int64  
 16  new_14       1048576 non-null  int64  
 17  new_15       1048576 non-null  int64  
 18  n