This section reviews some pandas features that may be helpful for data wrangling, model fitting, and scoring.
It also gives a short introduction to two popular modeling toolkits:
- statsmodels
- scikit-learn

### Interfacing between pandas and Model Code
A common workflow for model development:
- Use pandas for data loading and cleaning.
- Switch to a modeling library to build the model itself.

Feature engineering is any data transformation or analysis that extracts information from a raw dataset that may be useful in a modeling context. It is an important part of model development in machine learning. Data aggregation with groupby is often used in a feature engineering context.
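
As a minimal sketch of groupby-based feature engineering, the example below uses a made-up `sales` table (the column names `store` and `amount` are assumptions for illustration, not part of these notes). It broadcasts a group-level mean back to each row and derives a second feature from it.

```python
import pandas as pd

# Hypothetical transaction data; names and values are illustrative only.
sales = pd.DataFrame({
    "store": ["a", "a", "b", "b", "b"],
    "amount": [10.0, 20.0, 5.0, 15.0, 10.0],
})

# Group-level aggregate broadcast back to every row of its group.
sales["store_mean"] = sales.groupby("store")["amount"].transform("mean")

# Row-level feature expressed relative to the group aggregate.
sales["amount_vs_store"] = sales["amount"] - sales["store_mean"]
print(sales)
```

Using `transform` rather than `agg` keeps the result aligned with the original rows, so the new columns can be passed to a model alongside the raw ones.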

The point of contact between pandas and other analysis libraries is usually NumPy arrays.
To hand data to such a library, convert the DataFrame into a NumPy array with to_numpy.

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.DataFrame({
    "x0":np.arange(5)+1,
    "x1":[0.01,-0.01,0.25,-4.1,0.],
    "y":[-1.5,0.,3.6,1.3,-2.]})
data

Unnamed: 0,x0,x1,y
0,1,0.01,-1.5
1,2,-0.01,0.0
2,3,0.25,3.6
3,4,-4.1,1.3
4,5,0.0,-2.0


In [3]:
data.columns

Index(['x0', 'x1', 'y'], dtype='object')

In [4]:
#Passing the DataFrame to numpy
np_array = data.to_numpy()
np_array

array([[ 1.  ,  0.01, -1.5 ],
       [ 2.  , -0.01,  0.  ],
       [ 3.  ,  0.25,  3.6 ],
       [ 4.  , -4.1 ,  1.3 ],
       [ 5.  ,  0.  , -2.  ]])

In [5]:
#Going back to DataFrame
df2 = pd.DataFrame(np_array,columns=['x0','x1','y'])
df2


Unnamed: 0,x0,x1,y
0,1.0,0.01,-1.5
1,2.0,-0.01,0.0
2,3.0,0.25,3.6
3,4.0,-4.1,1.3
4,5.0,0.0,-2.0


to_numpy is intended to be used when your data is homogeneous (for example, all numeric types).
If you have heterogeneous data, the result will be an ndarray of Python objects.

In [6]:
df3 = data.copy()
df3['strings'] =['a','b','c','d','e']
df3

Unnamed: 0,x0,x1,y,strings
0,1,0.01,-1.5,a
1,2,-0.01,0.0,b
2,3,0.25,3.6,c
3,4,-4.1,1.3,d
4,5,0.0,-2.0,e


In [7]:
df3.to_numpy()

array([[1, 0.01, -1.5, 'a'],
       [2, -0.01, 0.0, 'b'],
       [3, 0.25, 3.6, 'c'],
       [4, -4.1, 1.3, 'd'],
       [5, 0.0, -2.0, 'e']], dtype=object)
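
One possible way to avoid the object array, sketched here under the assumption that only the numeric columns are needed, is to filter by dtype with `select_dtypes` before converting. The frame below rebuilds `df3` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same shape of data as df3 above: three numeric columns plus one string column.
df3 = pd.DataFrame({
    "x0": np.arange(5) + 1,
    "x1": [0.01, -0.01, 0.25, -4.1, 0.0],
    "y": [-1.5, 0.0, 3.6, 1.3, -2.0],
    "strings": ["a", "b", "c", "d", "e"],
})

mixed = df3.to_numpy()                                     # heterogeneous, so dtype is object
numeric = df3.select_dtypes(include="number").to_numpy()   # numeric columns only

print(mixed.dtype)
print(numeric.dtype, numeric.shape)
```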

In [8]:
# In some models we may wish to only use some columns
# We can use loc with to_numpy
model_cols=['x0','x1']
data.loc[:,model_cols].to_numpy()

array([[ 1.  ,  0.01],
       [ 2.  , -0.01],
       [ 3.  ,  0.25],
       [ 4.  , -4.1 ],
       [ 5.  ,  0.  ]])

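Some libraries have native support for pandas: they convert from DataFrame to NumPy automatically and attach model parameter names to the columns of output tables or Series.
In other cases the conversion is done by hand, for example turning a nonnumeric column into dummy variables.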

In [9]:
# Add a nonnumeric (categorical) column
data['category'] = pd.Categorical(['a','b','a','a','b'],categories=['a','b'])
data

Unnamed: 0,x0,x1,y,category
0,1,0.01,-1.5,a
1,2,-0.01,0.0,b
2,3,0.25,3.6,a
3,4,-4.1,1.3,a
4,5,0.0,-2.0,b


In [10]:
# Replace the category column with dummy (indicator) variables
dummies = pd.get_dummies(data.category, prefix="cat")
data_with_dummies = data.drop("category", axis=1).join(dummies)
data_with_dummies

Unnamed: 0,x0,x1,y,cat_a,cat_b
0,1,0.01,-1.5,True,False
1,2,-0.01,0.0,False,True
2,3,0.25,3.6,True,False
3,4,-4.1,1.3,True,False
4,5,0.0,-2.0,False,True


### Creating Model Descriptions with Patsy
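Patsy is a Python library for describing statistical models (especially linear models).
It uses a string-based formula syntax inspired by the one in R.
A Patsy formula looks like y ~ x0 + x1, where the terms on the right are the model's inputs and the plus sign separates terms rather than adding them.
The patsy.dmatrices function takes a formula string along with a dataset and produces the design matrices for a linear model.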

In [11]:
data = pd.DataFrame({
    "x0":np.arange(1,6),
    "x1":[0.01,-0.01,0.25,-4.10,0.00],
    "y":[-1.5,0.0,3.6,1.3,-2.0]
})
data

Unnamed: 0,x0,x1,y
0,1,0.01,-1.5
1,2,-0.01,0.0
2,3,0.25,3.6
3,4,-4.1,1.3
4,5,0.0,-2.0


In [12]:
import patsy
y, X = patsy.dmatrices("y ~ x0 + x1",data)

In [13]:
y

DesignMatrix with shape (5, 1)
     y
  -1.5
   0.0
   3.6
   1.3
  -2.0
  Terms:
    'y' (column 0)

In [14]:
X

DesignMatrix with shape (5, 3)
  Intercept  x0     x1
          1   1   0.01
          1   2  -0.01
          1   3   0.25
          1   4  -4.10
          1   5   0.00
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'x1' (column 2)

These Patsy DesignMatrix instances are NumPy ndarrays with additional metadata.

In [15]:
np.asarray(y)

array([[-1.5],
       [ 0. ],
       [ 3.6],
       [ 1.3],
       [-2. ]])

In [16]:
np.asarray(X)

array([[ 1.  ,  1.  ,  0.01],
       [ 1.  ,  2.  , -0.01],
       [ 1.  ,  3.  ,  0.25],
       [ 1.  ,  4.  , -4.1 ],
       [ 1.  ,  5.  ,  0.  ]])

In [17]:
# Patsy objects can be passed directly into algorithms like numpy.linalg.lstsq (least squares regression)
# rcond=None selects the current default cutoff and avoids the FutureWarning that older NumPy versions emit
coef, resid, _, _ = np.linalg.lstsq(X, y, rcond=None)


In [18]:
coef 

array([[ 0.31290976],
       [-0.07910564],
       [-0.26546384]])

In [19]:
# Model metadata is retained in the design_info attribute. We can reattach the model column names to the fitted coefficients to obtain a Series
coef = pd.Series(coef.squeeze(),index= X.design_info.column_names)

In [20]:
coef

Intercept    0.312910
x0          -0.079106
x1          -0.265464
dtype: float64
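
As a cross-check on the fit above, the same coefficients can be obtained from the normal equations, $X^\top X \beta = X^\top y$. This is a sketch using plain NumPy arrays that rebuild the design matrix by hand (an intercept column of ones followed by x0 and x1); it does not use Patsy.

```python
import numpy as np

# Design matrix rebuilt by hand: intercept, x0, x1.
X = np.column_stack([
    np.ones(5),
    np.arange(1, 6),
    [0.01, -0.01, 0.25, -4.10, 0.00],
])
y = np.array([-1.5, 0.0, 3.6, 1.3, -2.0])

# Least squares solution, as in the cell above.
coef_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same solution from the normal equations (fine for a small, well-conditioned X).
coef_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(coef_lstsq)
print(np.allclose(coef_lstsq, coef_normal))
```

For ill-conditioned problems `lstsq` is generally the safer choice, because forming $X^\top X$ squares the condition number.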

### Data Transformations in Patsy Formulas
We can mix Python code into Patsy formulas. When evaluating the formula, Patsy looks up the functions used in the enclosing scope.

In [21]:
y,X = patsy.dmatrices('y ~ x0 + np.log(np.abs(x1) + 1)', data) # Patsy evaluates the Python function calls inside the formula

Commonly used variable transformations:
- standardizing (rescaling to mean 0 and variance 1)
- centering (subtracting the mean)

In [22]:
y,X = patsy.dmatrices('y ~ standardize(x0) + center(x1)',data)
data

Unnamed: 0,x0,x1,y
0,1,0.01,-1.5
1,2,-0.01,0.0
2,3,0.25,3.6
3,4,-4.1,1.3
4,5,0.0,-2.0


In [23]:
X

DesignMatrix with shape (5, 3)
  Intercept  standardize(x0)  center(x1)
          1         -1.41421        0.78
          1         -0.70711        0.76
          1          0.00000        1.02
          1          0.70711       -3.33
          1          1.41421        0.77
  Terms:
    'Intercept' (column 0)
    'standardize(x0)' (column 1)
    'center(x1)' (column 2)

When applying transformations like center and standardize, be careful when using the model to form predictions on new data.
These are called stateful transformations: the new dataset must be transformed with statistics, such as the mean or standard deviation, computed from the original dataset.
The patsy.build_design_matrices function applies the transformations to new out-of-sample data, using the saved information from the original in-sample dataset.

In [24]:
new_data = pd.DataFrame({
    'x0':np.arange(6,10),
    'x1':[3.1,-0.5,0,2.3],
    'y': np.arange(1,5)
})
new_data

Unnamed: 0,x0,x1,y
0,6,3.1,1
1,7,-0.5,2
2,8,0.0,3
3,9,2.3,4


In [25]:
new_X = patsy.build_design_matrices([X.design_info],new_data)
new_X

[DesignMatrix with shape (4, 3)
   Intercept  standardize(x0)  center(x1)
           1          2.12132        3.87
           1          2.82843        0.27
           1          3.53553        0.77
           1          4.24264        3.07
   Terms:
     'Intercept' (column 0)
     'standardize(x0)' (column 1)
     'center(x1)' (column 2)]
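
The values above can be reproduced by hand, which makes the "stateful" part explicit. This is a sketch in plain pandas, not Patsy's implementation: the mean and standard deviation come from the original data, and the population standard deviation (`ddof=0`) is assumed because it matches the output shown.

```python
import numpy as np
import pandas as pd

# Original (in-sample) and new (out-of-sample) inputs from the cells above.
train = pd.DataFrame({"x0": np.arange(1, 6), "x1": [0.01, -0.01, 0.25, -4.10, 0.00]})
new = pd.DataFrame({"x0": np.arange(6, 10), "x1": [3.1, -0.5, 0, 2.3]})

# State learned from the original data only.
x0_mean = train["x0"].mean()
x0_std = train["x0"].std(ddof=0)   # population standard deviation
x1_mean = train["x1"].mean()

# Apply the saved statistics to the new data.
new_features = pd.DataFrame({
    "standardize(x0)": (new["x0"] - x0_mean) / x0_std,
    "center(x1)": new["x1"] - x1_mean,
})
print(new_features.round(5))
```

Recomputing the mean and standard deviation on `new` instead would give different features than the ones the model was fitted on.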

### Categorical Data and Patsy
Nonnumeric data can be transformed for a model design matrix in different ways.
When you use nonnumeric terms in a Patsy formula, they get converted into dummy variables by default. If the model has an intercept, one of the levels is left out to avoid collinearity.

In [26]:
data = pd.DataFrame({
        'key1':['a','a','b','b','a','b','a','b'],
        'key2': [0,1,0,1,0,1,0,0],
        'v1':np.arange(1,9),
        'v2': [-1,0,2.5,-0.5,4.0,-1.2,0.2,-1.7]
    })
data

Unnamed: 0,key1,key2,v1,v2
0,a,0,1,-1.0
1,a,1,2,0.0
2,b,0,3,2.5
3,b,1,4,-0.5
4,a,0,5,4.0
5,b,1,6,-1.2
6,a,0,7,0.2
7,b,0,8,-1.7


In [27]:
y,X = patsy.dmatrices('v2 ~ key1', data)
X

DesignMatrix with shape (8, 2)
  Intercept  key1[T.b]
          1          0
          1          0
          1          1
          1          1
          1          0
          1          1
          1          0
          1          1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)

### Introduction to statsmodels
statsmodels is a Python library for:
- Fitting many kinds of statistical models
- Performing statistical tests
- Data exploration and visualization

It focuses on more "classical" frequentist statistical methods rather than machine learning models.

### Some models
- Linear models, generalized linear models and robust linear models
- Linear mixed effects models
- Analysis of variance (ANOVA) methods
- Time series processes and state space models
- Generalized method of moments

### Estimating linear models
statsmodels offers two main interfaces for linear models:
- Array based
- Formula based
  

In [28]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [29]:
# Generating random data for the linear model
rng = np.random.default_rng(seed=12345)
rng

Generator(PCG64) at 0x14FC0DF4D60

In [30]:
def dnorm(mean,variance,size=1):
    if isinstance(size,int):
        size = size,
    return mean + np.sqrt(variance) * rng.standard_normal(*size)

In [31]:
# dnorm is a helper function for generating normally distributed data with a particular mean and variance
N = 100
X = np.c_[dnorm(0,0.4,size=N),
          dnorm(0,0.6,size=N),
          dnorm(0,0.2,size=N)
          ]
eps = dnorm(0,0.1,size=N)
beta = [0.1,0.3,0.5]
y = np.dot(X,beta) + eps


In [32]:
X[:5]

array([[-0.90050602, -0.18942958, -1.0278702 ],
       [ 0.79925205, -1.54598388, -0.32739708],
       [-0.55065483, -0.12025429,  0.32935899],
       [-0.16391555,  0.82403985,  0.20827485],
       [-0.04765129, -0.21314698, -0.04824364]])

In [33]:
y[:5]

array([-0.59952668, -0.58845445,  0.18563386, -0.00747657, -0.01537445])

In [34]:
#sm.add_constant function -> adds an intercept column to an existing matrix
X_model = sm.add_constant(X)
X_model

array([[ 1.00000000e+00, -9.00506021e-01, -1.89429577e-01,
        -1.02787020e+00],
       [ 1.00000000e+00,  7.99252054e-01, -1.54598388e+00,
        -3.27397080e-01],
       [ 1.00000000e+00, -5.50654833e-01, -1.20254287e-01,
         3.29358994e-01],
       [ 1.00000000e+00, -1.63915546e-01,  8.24039852e-01,
         2.08274848e-01],
       [ 1.00000000e+00, -4.76512913e-02, -2.13146980e-01,
        -4.82436357e-02],
       [ 1.00000000e+00, -4.68576597e-01, -1.43558784e+00,
        -1.52694953e-01],
       [ 1.00000000e+00, -8.65068061e-01, -9.63148432e-02,
         7.08625055e-01],
       [ 1.00000000e+00,  4.10395842e-01,  6.08038650e-01,
         1.26222105e-01],
       [ 1.00000000e+00,  2.28353201e-01,  1.56467440e-01,
         4.06761512e-01],
       [ 1.00000000e+00, -1.23509905e+00, -3.31585038e-01,
         1.76681376e-01],
       [ 1.00000000e+00,  1.48463222e+00,  1.43167842e+00,
        -2.99354280e-01],
       [ 1.00000000e+00,  6.12531226e-01,  1.47169718e+00,
      

In [35]:
# The sm.OLS class fits an ordinary least squares linear regression.
# Note that X (not X_model) is passed here, so the model is fit without an intercept.
model = sm.OLS(y,X)

In [36]:
# The model's fit method returns a regression results object containing the estimated model parameters
results = model.fit()

In [37]:
results.params

array([0.06681503, 0.26803235, 0.45052319])

In [38]:
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.469
Model:                            OLS   Adj. R-squared (uncentered):              0.452
Method:                 Least Squares   F-statistic:                              28.51
Date:                Fri, 08 Mar 2024   Prob (F-statistic):                    2.66e-13
Time:                        10:06:40   Log-Likelihood:                         -25.611
No. Observations:                 100   AIC:                                      57.22
Df Residuals:                      97   BIC:                                      65.04
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
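
The cells above use the array-based interface. For comparison, here is a minimal sketch of the formula-based interface with `smf.ols`. The column names `col0`, `col1`, and `col2` are assumptions chosen for illustration, and the random draws are generated differently from the `dnorm` helper, so the estimates are not expected to match the output above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=12345)
N = 100

# Hypothetical column names; scale is a standard deviation, hence sqrt(variance).
df = pd.DataFrame({
    "col0": rng.normal(0, np.sqrt(0.4), size=N),
    "col1": rng.normal(0, np.sqrt(0.6), size=N),
    "col2": rng.normal(0, np.sqrt(0.2), size=N),
})
eps = rng.normal(0, np.sqrt(0.1), size=N)
df["y"] = 0.1 * df["col0"] + 0.3 * df["col1"] + 0.5 * df["col2"] + eps

# The formula interface accepts a DataFrame and adds the intercept term itself.
results = smf.ols("y ~ col0 + col1 + col2", data=df).fit()
print(results.params)
```

Because the input is a DataFrame, `results.params` comes back as a Series labeled with the term names, so no call to `sm.add_constant` is needed.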

### Introduction to scikit-learn
scikit-learn is one of the most widely used and trusted general-purpose Python machine learning toolkits.
It contains a broad selection of standard supervised and unsupervised machine learning methods.
Tools included:
- Model selection
- Evaluation
- Data transformation
- Data loading
- Model persistence

It can be used for:
- Classification
- Clustering
- Prediction
