# NUSA Demo of Fama-French Factor Model

In [67]:
import pandas as pd
# use any other libraries you may need
import numpy as np
from scipy.stats.mstats import winsorize
# from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

Read the contents of **cleaned_factset_data.csv** into a pandas DataFrame called **df** and drop any rows with NaN values. The **CAP** column values are strings that use commas as thousands separators, so convert every value in the column to a float.

In [68]:
df = pd.read_csv('cleaned_factset_data.csv')
df.dropna(inplace=True)
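# Strip thousands separators before casting; astype(float) alone fails on strings such as '1,523.99'
df['CAP'] = df['CAP'].astype(str).str.replace(',', '', regex=False).astype(float)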
df.head()

Unnamed: 0,Ticker,Company Name,monthly_return,capm_beta,book_price,CAP,GPM
0,DDD,3D Systems Corporation,-6.02,1.555648,0.436308,1523.9963,48.936516
1,MMM,3M Company,4.5,1.079156,0.074971,125018.13,49.73928
2,EGHT,"8x8, Inc.",-1.81,0.366954,0.236263,1241.359,75.4866
3,AOS,A. O. Smith Corporation,-1.52,1.536893,0.147579,11333.753,41.665737
4,SHLM,"A. Schulman, Inc.",9.18,1.600787,0.033661,1006.2639,16.560259
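
The comma handling can be tried on a small inline sample before touching the real file. The sketch below is illustrative only: the two-row CSV text is a hypothetical stand-in for **cleaned_factset_data.csv**, and it shows two ways to get floats out of comma-formatted strings, either with the `thousands` argument of `pd.read_csv` or by cleaning the column after loading.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real file
raw = 'Ticker,CAP\nMMM,"125,018.13"\nDDD,"1,523.9963"\n'

# Option 1: let read_csv strip the separators while parsing
parsed = pd.read_csv(io.StringIO(raw), thousands=',')

# Option 2: read the column as text, then clean and cast explicitly
text = pd.read_csv(io.StringIO(raw), dtype={'CAP': str})
text['CAP'] = text['CAP'].str.replace(',', '', regex=False).astype(float)

print(parsed['CAP'].tolist())
print(text['CAP'].tolist())
```

Option 1 is shorter; option 2 makes the conversion explicit and works on a DataFrame that is already loaded.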


To reduce the influence of the small number of very large-cap companies, add a new column to **df** called **log_mktcap** and populate it with the natural log of each value in **CAP**.

In [69]:
df['log_mktcap'] = np.log(df['CAP'])
df.head()

Unnamed: 0,Ticker,Company Name,monthly_return,capm_beta,book_price,CAP,GPM,log_mktcap
0,DDD,3D Systems Corporation,-6.02,1.555648,0.436308,1523.9963,48.936516,7.329091
1,MMM,3M Company,4.5,1.079156,0.074971,125018.13,49.73928,11.736214
2,EGHT,"8x8, Inc.",-1.81,0.366954,0.236263,1241.359,75.4866,7.123962
3,AOS,A. O. Smith Corporation,-1.52,1.536893,0.147579,11333.753,41.665737,9.335541
4,SHLM,"A. Schulman, Inc.",9.18,1.600787,0.033661,1006.2639,16.560259,6.914
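
To see why the log helps, the sketch below takes the five **CAP** values displayed above and compares how far the largest one sits from the median before and after the transform. Five rows are far too few to characterize the full sample, so treat this as an illustration of the compression effect rather than a statistic about the dataset.

```python
import numpy as np
import pandas as pd

# CAP values for the five rows displayed above
cap = pd.Series([1523.9963, 125018.13, 1241.359, 11333.753, 1006.2639],
                index=['DDD', 'MMM', 'EGHT', 'AOS', 'SHLM'])
log_cap = np.log(cap)

# Ratio of the largest value to the median, before and after the transform
raw_ratio = cap.max() / cap.median()
log_ratio = log_cap.max() / log_cap.median()
print(f'raw: {raw_ratio:.1f}x the median, log: {log_ratio:.2f}x the median')
```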


Then calculate the z-score of each of the numeric columns and put the results into new columns with **'zscore_'** prepended to each original column name. 


The z-score formula is:

$Z = \frac{x - \mu}{\sigma}$

where $\mu$ is the column mean, $\sigma$ is the column standard deviation, and $x$ is the observed value.


In [70]:
for col in df.columns:
    if df[col].dtype == 'float64':
        df['zscore_'+col] = (df[col] - df[col].mean())/df[col].std(ddof=0)
        
df.head()



Unnamed: 0,Ticker,Company Name,monthly_return,capm_beta,book_price,CAP,GPM,log_mktcap,zscore_monthly_return,zscore_capm_beta,zscore_book_price,zscore_CAP,zscore_GPM,zscore_log_mktcap
0,DDD,3D Systems Corporation,-6.02,1.555648,0.436308,1523.9963,48.936516,7.329091,-0.927695,0.650256,0.011779,-0.295365,0.563248,-0.680589
1,MMM,3M Company,4.5,1.079156,0.074971,125018.13,49.73928,11.736214,0.938817,-0.065171,-0.731026,2.005445,0.599912,2.181047
2,EGHT,"8x8, Inc.",-1.81,0.366954,0.236263,1241.359,75.4866,7.123962,-0.180735,-1.134507,-0.399456,-0.300631,1.775854,-0.813783
3,AOS,A. O. Smith Corporation,-1.52,1.536893,0.147579,11333.753,41.665737,9.335541,-0.129282,0.622097,-0.581765,-0.1126,0.231174,0.62224
4,SHLM,"A. Schulman, Inc.",9.18,1.600787,0.033661,1006.2639,16.560259,6.914,1.769167,0.71803,-0.815947,-0.305011,-0.915453,-0.950116
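
One detail worth checking is the `ddof` argument: pandas' `Series.std` defaults to the sample standard deviation (`ddof=1`), while the cell above passes `ddof=0` to get the population version. The sketch below uses the five `capm_beta` values displayed above and compares the manual formula with `scipy.stats.zscore`, whose default is also `ddof=0`. Because only five rows are used, the resulting z-scores will not match the table, which was computed over the full sample.

```python
import numpy as np
import pandas as pd
from scipy import stats

# capm_beta for the five rows displayed above (not the full sample)
beta = pd.Series([1.555648, 1.079156, 0.366954, 1.536893, 1.600787])

z_manual = (beta - beta.mean()) / beta.std(ddof=0)   # population std, as in the cell above
z_scipy = stats.zscore(beta.to_numpy())              # scipy's default is also ddof=0
z_sample = (beta - beta.mean()) / beta.std()         # pandas default ddof=1, slightly smaller

print(np.allclose(z_manual, z_scipy))
print(z_manual.std(ddof=0), z_sample.std(ddof=0))
```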


Winsorize the data in the **'zscore'** columns at the 1st and 99th percentiles: set any value below the 1st percentile to the 1st-percentile value, and any value above the 99th percentile to the 99th-percentile value.

In [71]:
for col in df.columns:
    if col.startswith('zscore_'):
        df[col] = winsorize(df[col], limits=[0.01, 0.01])
        
df.head()

Unnamed: 0,Ticker,Company Name,monthly_return,capm_beta,book_price,CAP,GPM,log_mktcap,zscore_monthly_return,zscore_capm_beta,zscore_book_price,zscore_CAP,zscore_GPM,zscore_log_mktcap
0,DDD,3D Systems Corporation,-6.02,1.555648,0.436308,1523.9963,48.936516,7.329091,-0.927695,0.650256,0.011779,-0.295365,0.563248,-0.680589
1,MMM,3M Company,4.5,1.079156,0.074971,125018.13,49.73928,11.736214,0.938817,-0.065171,-0.731026,2.005445,0.599912,2.181047
2,EGHT,"8x8, Inc.",-1.81,0.366954,0.236263,1241.359,75.4866,7.123962,-0.180735,-1.134507,-0.399456,-0.300631,1.775854,-0.813783
3,AOS,A. O. Smith Corporation,-1.52,1.536893,0.147579,11333.753,41.665737,9.335541,-0.129282,0.622097,-0.581765,-0.1126,0.231174,0.62224
4,SHLM,"A. Schulman, Inc.",9.18,1.600787,0.033661,1006.2639,16.560259,6.914,1.769167,0.71803,-0.815947,-0.305011,-0.915453,-0.950116
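
The first five rows are unchanged by this step, which is expected when none of them fall in the tails. To make the effect visible, here is a minimal sketch on synthetic data (not the FactSet sample) that follows the wording of the instruction literally, using `Series.quantile` and `Series.clip`. This is an alternative to `scipy.stats.mstats.winsorize`, which selects its cutoffs by counting observations, so the two approaches may differ slightly at the boundary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)           # synthetic data for illustration
z = pd.Series(rng.standard_normal(1000))
z.iloc[0] = 15.0                         # plant one extreme value in each tail
z.iloc[1] = -12.0

# 1st and 99th percentile values of the column
lo, hi = z.quantile([0.01, 0.99])

# Values outside [lo, hi] are pulled in to the nearest bound; nothing is dropped
clipped = z.clip(lower=lo, upper=hi)

print(z.min(), z.max())
print(clipped.min(), clipped.max())
print((clipped != z).sum(), 'values changed')
```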


Run a **weighted least squares regression** using the standardized, winsorized data as explanatory variables and the monthly returns as the dependent variable.

In [72]:
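y = df['monthly_return']
# NOTE: filter(like='zscore') selects every zscore_ column, including
# zscore_monthly_return, which is a rescaled copy of y itself.
X = df.filter(like='zscore')
X = sm.add_constant(X)

# First pass: plain OLS, used only to estimate the error variance.
ols_model = sm.OLS(y, X).fit()
squared_res = ols_model.resid ** 2
err_var = np.mean(squared_res)

# NOTE: err_var is a single number, so every observation receives the same
# weight and the WLS coefficients below are identical to the OLS ones.
# Weights that vary by observation would be needed for WLS to differ from OLS.
weights = 1.0 / err_var

model = sm.WLS(y, X, weights=weights).fit()

print(model.summary())
# print(model.params)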

                            WLS Regression Results                            
Dep. Variable:         monthly_return   R-squared:                       0.990
Model:                            WLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                 2.073e+04
Date:                Sat, 23 Sep 2023   Prob (F-statistic):               0.00
Time:                        14:31:07   Log-Likelihood:                -1130.0
No. Observations:                1309   AIC:                             2274.
Df Residuals:                    1302   BIC:                             2310.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                    -0.78

Write a sentence or two interpreting the results of the regression. What do the coefficients mean, and are they statistically significant?

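- An R-squared of 0.990 means the model explains 99% of the variation in the dependent variable. This is not evidence of a good factor model: zscore_monthly_return, a rescaled and winsorized copy of the dependent variable, is one of the regressors, so the fit is nearly mechanical.
- The coefficient on zscore_monthly_return is highly statistically significant (p-value < 0.01): a 1 SD increase in monthly return is associated with a 5.85 percentage-point increase in monthly return. Because this regressor is derived from the dependent variable, the estimate roughly recovers the standard deviation of monthly returns rather than a factor premium.
- The coefficient on zscore_capm_beta is negative and statistically significant at the 5% level (p-value = 0.039). A 1 SD increase in CAPM beta is associated with a 0.038 percentage-point decrease in monthly return, holding the other regressors fixed. This is the opposite of the CAPM prediction that higher-beta (riskier) stocks earn higher expected returns, and the estimate is conditional on zscore_monthly_return being in the model, so it should be read with caution.
- The other coefficients are not statistically significant.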
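
A quick way to sanity-check an R-squared this high is to see what happens when a transformed copy of the dependent variable sits among the regressors. The sketch below uses purely synthetic noise (no relation to the FactSet data) and a small hypothetical helper, `r_squared`, written with NumPy only. The two factor columns are unrelated to `y` by construction, so any large fit comes from the z-scored copy of `y`.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
y = rng.standard_normal(n)               # synthetic "returns": pure noise
factors = rng.standard_normal((n, 2))    # two regressors unrelated to y

# z-score of the dependent variable, clipped at its 1st/99th percentiles
z_y = (y - y.mean()) / y.std()
z_y = np.clip(z_y, *np.percentile(z_y, [1, 99]))

def r_squared(X, y):
    """R-squared of an OLS fit of y on X plus an intercept (hypothetical helper)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_clean = r_squared(factors, y)
r2_leaky = r_squared(np.column_stack([factors, z_y]), y)
print(f'without z-scored y: {r2_clean:.3f}, with z-scored y: {r2_leaky:.3f}')
```

In the notebook, restricting the regressors to the factor z-scores (for example, dropping zscore_monthly_return from `X`) would give an R-squared that reflects the factors alone.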