## Stats in Python
Some brief notes and examples on how to do statistical analysis in Python.

We saw a bit of this in the machine learning module, comparing logistic regression to random forests. Some of the statistical functions are built into `scikit-learn`, but `statsmodels` has more features for statistical inference, such as standard errors, p-values, and regression summary tables.

Why use `statsmodels`?
- it integrates nicely with `pandas`, so if you have data wrangling code and are joining together lots of datasets, `statsmodels` can simplify your analysis workflow
- if you are familiar with R, the formula syntax is almost identical. [See examples here](https://www.statsmodels.org/stable/index.html).
- the geospatial functionality is much better in Python than in R, and so any GIS+regression analysis might be better off in Python.

Why use R or Stata instead?
- maybe you or your collaborators are more proficient in R (or Stata)
- you are using more advanced statistical models that are not supported in Python (or would require a lot of custom matrix algebra). This is especially true for causal inference, clustered standard errors, and Bayesian analysis, but that is changing fast
- R and Stata have more advanced postestimation commands (e.g. marginal effects)
- there are more online examples for R and Stata

In these cases, you can still use Python for data wrangling, web scraping, etc. You then save a `.csv` file that you load into R (or Stata).
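
The handoff itself is short. Here is a minimal sketch, assuming a small made-up dataframe (`clean`) and a hypothetical file name (`clean_parcels.csv`) in place of your real wrangled data:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned dataset standing in for the output of your wrangling code
clean = pd.DataFrame({
    "hasADU": [0, 1, 0, 1],
    "YearBuilt1": [1950, 1962, 1987, 2001],
    "Bedrooms1": [2, 3, 3, 4],
})

# Write to a temporary folder here; in a real project, use a path in your repo
out_path = Path(tempfile.mkdtemp()) / "clean_parcels.csv"

# index=False avoids writing the pandas index as an extra unnamed column
clean.to_csv(out_path, index=False)

# Reading the file back is a quick check that nothing was lost on the way out
roundtrip = pd.read_csv(out_path)
print(roundtrip.equals(clean))
```

On the R side, `read.csv("clean_parcels.csv")` would load the same file.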

You can also call R functions from Python! Check out `rpy2`: https://rpy2.github.io

TL;DR: for most projects, it won't make a difference - do whatever is easier.

Let's redo the example from Class 6 (ADUs). We've seen this first part of the code before.

In [None]:
import pandas as pd
import geopandas as gpd

parcels = pd.read_pickle('../Lectures/scratch/joined_permits.pandas')

# convert to a geodataframe. Same code as in the video lecture
parcels = gpd.GeoDataFrame(parcels, 
                    geometry = gpd.points_from_xy(
                        parcels.CENTER_LON, 
                        parcels.CENTER_LAT, crs='EPSG:4326'))
enviroscreen = gpd.read_file('../Lectures/data/CalEnviroScreen/CES4 Final Shapefile.shp')
gdf = gpd.sjoin(parcels, enviroscreen.to_crs('EPSG:4326'), how='left', predicate='intersects')

Now let's run a regression.

In [None]:
import statsmodels.formula.api as smf
gdf['hasADU'] = gdf.hasADU.astype(int)  # statsmodels can't deal with boolean variables
xvars = ['YearBuilt1', 'Units1', 'Bedrooms1','SQFTmain1', 'Roll_LandValue', 'Roll_ImpValue']
yvar = 'hasADU'

# create a dataframe with no NaNs
df_to_fit = gdf[xvars+[yvar]].dropna()

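# follow the example in the docs. Looks like R!
# the left-hand side must be a column in df_to_fit, so we use hasADU (there is no column called y)
formula = 'hasADU ~ YearBuilt1 + Units1 + Bedrooms1 + SQFTmain1 + Roll_LandValue'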
results = smf.ols(formula, data=df_to_fit).fit()

print(results.summary())
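
If you don't have the parcel files handy, the same formula pattern can be tried on synthetic data. The sketch below is not the ADU dataset: the dataframe `toy`, the `tract` grouping column, and the simulated relationship are all made up for illustration. It also shows that `statsmodels` offers cluster-robust standard errors for OLS through the `cov_type` argument of `fit()`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (NOT the parcel dataset): 200 parcels in 20 hypothetical tracts
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tract": np.repeat(np.arange(20), 10),
    "Bedrooms1": rng.integers(1, 6, size=n),
    "SQFTmain1": rng.normal(1500, 400, size=n),
})
# Binary outcome whose probability rises with the number of bedrooms
p = 0.1 + 0.1 * toy["Bedrooms1"]
toy["hasADU"] = (rng.random(n) < p).astype(int)

model = smf.ols("hasADU ~ Bedrooms1 + SQFTmain1", data=toy)

# Classical (non-robust) standard errors
default_fit = model.fit()
# Standard errors clustered by the hypothetical tract identifier
clustered_fit = model.fit(cov_type="cluster",
                          cov_kwds={"groups": toy["tract"].to_numpy()})

# The coefficients are the same; only the standard errors differ
print(pd.DataFrame({
    "coef": default_fit.params,
    "se_default": default_fit.bse,
    "se_clustered": clustered_fit.bse,
}))
```

Whether clustering is appropriate, and on which variable, depends on your data; the tract grouping here is only a placeholder.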

In [None]:
# we can also do a logistic regression
# which is more appropriate for our variable
# as above, the outcome column is hasADU, not y
formula = 'hasADU ~ YearBuilt1 + Units1 + Bedrooms1 + SQFTmain1 + Roll_LandValue'
results = smf.logit(formula, data=df_to_fit).fit()

print(results.summary())
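
Logit coefficients are on the log-odds scale, which is hard to read directly. Two common ways to interpret them are odds ratios (exponentiated coefficients) and average marginal effects, which `statsmodels` provides through `get_margeff()` on fitted discrete models. A minimal sketch on synthetic data follows; the dataframe `toy` and the simulated coefficients are assumptions for illustration, not the ADU results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (NOT the parcel dataset)
rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({"Bedrooms1": rng.integers(1, 6, size=n)})

# Simulate from a known model: log-odds = -2 + 0.5 * Bedrooms1
log_odds = -2 + 0.5 * toy["Bedrooms1"]
prob = 1 / (1 + np.exp(-log_odds))
toy["hasADU"] = (rng.random(n) < prob).astype(int)

# disp=False suppresses the optimizer's convergence message
logit_fit = smf.logit("hasADU ~ Bedrooms1", data=toy).fit(disp=False)

# Odds ratios: multiplicative change in the odds for a one-unit increase
odds_ratios = np.exp(logit_fit.params)
print(odds_ratios)

# Average marginal effects: change in predicted probability,
# averaged over the observations (the default for get_margeff)
margeff = logit_fit.get_margeff()
print(margeff.summary())
```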

What do I do?
- I do most simple models in Python, but that's because I find R syntax annoying
- Descriptive plots and analysis are always in Python
- For most statistical analysis, I use Stata, but I normally prepare a clean dataset in Python so there isn't much Stata code
- For some specialist models (e.g. Bayesian hierarchical regression), I use Python (e.g. `cmdstanpy`)
- I use R if a collaborator prefers that, but then I make them do all the coding...
- Sometimes, a collaborator and I do the work independently in Python and R, as a way to catch errors