# Python Crash Course: Statistical Analysis of a CSV

This notebook walks through:

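1. Loading a CSV with `pandas`
2. A quick Python syntax refresher
3. Descriptive statistics
4. Plotting with `matplotlib`
5. OLS regression with `statsmodels`
6. Fixed effects via dummy variables
7. Fixed effects with `linearmodels` (optional)
8. A mini-exercise

Dataset: `python_intro_stats_dataset.csv`, a panel of firms observed over several years. The code below uses the columns `firm_id`, `year`, `y`, `x1`, `x2`, and `size`.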

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# linearmodels is only needed for the optional Section 7
try:
    from linearmodels.panel import PanelOLS
except ImportError:
    PanelOLS = None

## 1. Load the CSV

`pd.read_csv` resolves a relative path against the notebook's working directory, so keep `python_intro_stats_dataset.csv` in the same folder as this notebook or pass the full path instead.

In [None]:
df = pd.read_csv('python_intro_stats_dataset.csv')
df.head()
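
After loading, it can help to confirm the shape and column types before doing any analysis. The sketch below is a minimal illustration that reads a small CSV held in memory so it runs without the real file. The values are invented; only the column names `firm_id`, `year`, `y`, `x1`, and `x2` come from this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: invented values, real column names.
csv_text = """firm_id,year,y,x1,x2
1,2020,1.5,0.2,3.0
1,2021,1.7,0.3,3.1
2,2020,2.1,0.5,
2,2021,2.4,0.6,2.8
"""

demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)   # (rows, columns)
print(demo.dtypes)  # column types inferred by pandas
demo.info()         # non-null counts per column; the empty x2 field is read as NaN
```

The same three calls work unchanged on `df` once the real file has been read.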

## 2. Quick Python Syntax Refresher

We'll briefly review variables, lists, dictionaries, and functions.

In [None]:
# Basic types
x = 10
name = "Alice"
values = [1, 2, 3]
record = {"a": 10, "b": 20}

print(type(x), x)
print(type(name), name)
print(type(values), values)
print(type(record), record)

# Indexing
print("First value in list:", values[0])
print("Value for key 'a' in dict:", record["a"])

In [None]:
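# Functions, loops, and indentation
def twice(z):
    """Return double the input."""
    return 2 * z

print(twice(5))
li = []
di = {}
for i in range(5):
    if i % 2:  # nonzero remainder is truthy, so this keeps odd i (1 and 3)
        li.append(twice(i))
        di[f'item{i}'] = f'{np.log(i+1)}'  # the log is stored as a string here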

print(li, '\n', di)
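
The loop above can also be written more compactly with comprehensions. This is a sketch of one equivalent form, not a required style. It uses `math.log` from the standard library in place of `np.log`, and it stores the logs as floats rather than formatted strings, which is a deliberate difference from the cell above.

```python
import math

def twice(z):
    """Return double the input."""
    return 2 * z

# Keep only odd i, exactly as the `if i % 2:` test in the loop does.
li = [twice(i) for i in range(5) if i % 2]

# Dict comprehension: key built with an f-string, value kept as a float.
di = {f"item{i}": math.log(i + 1) for i in range(5) if i % 2}

print(li)
print(di)
```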

## 3. Descriptive Statistics

We'll compute some summary statistics and look at missing values.

In [None]:
df.describe()

In [None]:
# Missing values per column
df.isna().sum()

In [None]:
# Example: mean and standard deviation of y
# (pandas skips NaN by default; .std() is the sample standard deviation, ddof=1)
print("Mean of y:", df['y'].mean())
print("Std of y:", df['y'].std())
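
If `df.isna().sum()` reports missing values, two things are worth knowing: pandas statistics skip `NaN` by default, and `dropna` can restrict the data to complete rows. The sketch below uses a hypothetical toy frame with invented values; the real dataset may have no missing values at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "y": [1.0, 2.0, np.nan, 4.0],
    "x1": [0.5, np.nan, 1.5, 2.0],
})

# mean() ignores NaN by default (skipna=True), so it averages 1, 2, and 4.
mean_y = toy["y"].mean()

# Keep only rows where both y and x1 are observed.
complete = toy.dropna(subset=["y", "x1"])

print(mean_y)
print(len(toy), len(complete))
```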

### Grouped statistics

For example, we can compute the mean of `y` separately for each `firm_id`.

In [None]:
df.groupby('firm_id')['y'].mean().head()
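
Several statistics can be computed per group in one call with `agg`. The sketch below uses a hypothetical toy frame with invented values; `firm_id` and `y` are the column names from this notebook.

```python
import pandas as pd

# Hypothetical toy panel: two firms with two and three observations.
toy = pd.DataFrame({
    "firm_id": [1, 1, 2, 2, 2],
    "y": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per firm, one column per statistic.
summary = toy.groupby("firm_id")["y"].agg(["mean", "std", "count"])
print(summary)
```

On the real data, the `count` column is a quick way to see whether every firm has the same number of observations.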

## 4. Plotting

We'll create a histogram of `y` and a scatter plot of `x1` vs `y`.

In [None]:
# Histogram of y
plt.hist(df['y'], bins=30)
plt.xlabel('y')
plt.ylabel('Count')
plt.title('Histogram of y')
plt.show()

In [None]:
# Scatter plot: x1 vs y
plt.scatter(df['x1'], df['y'])
plt.xlabel('x1')
plt.ylabel('y')
plt.title('Scatter plot of x1 vs y')
plt.show()

## 5. OLS Regression with statsmodels

We'll estimate the model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon $$

In [None]:
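# Define X and y
X = df[['x1', 'x2']]
X = sm.add_constant(X)  # adds an intercept column named 'const'
y = df['y']

# missing='drop' excludes rows with NaN in y or X instead of raising an error
ols_model = sm.OLS(y, X, missing='drop').fit()
ols_model.summary()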
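
To make explicit what the coefficient column of the summary contains, the sketch below solves the same least-squares problem directly with NumPy. The data are simulated with known coefficients (an assumption made purely for illustration; they are not from the dataset), so the estimates can be compared against the truth.

```python
import numpy as np

# Simulated data with known coefficients: beta0 = 1, beta1 = 2, beta2 = -0.5.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares solution: the beta that minimizes ||y - X @ beta||^2.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

`statsmodels` solves the same minimization and additionally reports standard errors, t-statistics, and fit measures.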

### Residuals and fitted values

In [None]:
df['fitted_ols'] = ols_model.fittedvalues
df['resid_ols'] = ols_model.resid

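# Residuals should scatter around zero with no visible pattern
plt.scatter(df['fitted_ols'], df['resid_ols'])
plt.axhline(0)  # reference line at zero
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('OLS residual plot')
plt.savefig('./regression_plot.png')  # save before show(), which can clear the figure
plt.show()
plt.close()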

## 6. Fixed Effects via Dummy Variables

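We'll estimate a model with firm fixed effects by adding one dummy variable per firm. One firm is dropped (`drop_first=True`) and serves as the reference category, because a full set of dummies would be perfectly collinear with the intercept.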

In [None]:
# Create firm dummies
df_fe = pd.get_dummies(df, columns=['firm_id'], drop_first=True)
df_fe.head()

In [None]:
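# Choose regressors: x1, x2, and firm dummies
fe_cols = ['x1', 'x2'] + [c for c in df_fe.columns if c.startswith('firm_id_')]
X_fe = df_fe[fe_cols].astype(float)  # get_dummies may return bool columns; cast to float
X_fe = sm.add_constant(X_fe)
y_fe = df_fe['y']

# missing='drop' excludes rows with NaN in y or X instead of raising an error
fe_model_dummies = sm.OLS(y_fe, X_fe, missing='drop').fit()
fe_model_dummies.summary()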
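
The dummy-variable regression gives the same slope as the "within" estimator, which subtracts each firm's mean from `y` and the regressors before running OLS. The sketch below checks this on simulated data with a single regressor. The data-generating process and the true slope of 2 are assumptions made for illustration; only the column names mirror this notebook.

```python
import numpy as np
import pandas as pd

# Simulated panel: 5 firms, 6 years, firm effects correlated with x1.
rng = np.random.default_rng(1)
n_firms, n_years = 5, 6
firm = np.repeat(np.arange(n_firms), n_years)
alpha = rng.normal(size=n_firms)[firm]          # firm-specific intercepts
x1 = rng.normal(size=firm.size) + alpha
y = alpha + 2.0 * x1 + rng.normal(scale=0.1, size=firm.size)
toy = pd.DataFrame({"firm_id": firm, "x1": x1, "y": y})

# (a) Dummy-variable estimate: intercept, x1, and firm dummies (first firm dropped).
dummies = pd.get_dummies(toy["firm_id"], drop_first=True).to_numpy(dtype=float)
X_lsdv = np.column_stack([np.ones(len(toy)), toy["x1"].to_numpy(), dummies])
coefs = np.linalg.lstsq(X_lsdv, toy["y"].to_numpy(), rcond=None)[0]
b_lsdv = coefs[1]  # slope on x1

# (b) Within estimate: demean by firm, then regress demeaned y on demeaned x1.
x_dm = toy["x1"] - toy.groupby("firm_id")["x1"].transform("mean")
y_dm = toy["y"] - toy.groupby("firm_id")["y"].transform("mean")
b_within = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

print(b_lsdv, b_within)
```

The two slopes agree up to floating-point error. The within form is what makes fixed effects practical when there are too many firms to carry explicit dummy columns.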

## 7. Fixed Effects with `linearmodels` (Optional)

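If `linearmodels` is installed, we can treat the data as a panel and estimate entity fixed effects directly. `PanelOLS` expects a two-level index of (entity, time), here `firm_id` and `year`. Estimated on the same rows, the coefficients on `x1` and `x2` should match the dummy-variable estimates from Section 6.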
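
Because this section is optional, one way to check for the package before running the cell is sketched below. It uses only the standard library; the variable name `has_linearmodels` is hypothetical.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path.
has_linearmodels = importlib.util.find_spec("linearmodels") is not None

if has_linearmodels:
    print("linearmodels is available; the PanelOLS cell should run.")
else:
    print("linearmodels not found; install it (e.g. pip install linearmodels) or skip this section.")
```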

In [None]:
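# Index by (entity, time), as PanelOLS expects
df_panel = df.set_index(['firm_id', 'year'])
fe_model_panel = PanelOLS.from_formula('y ~ x1 + x2 + EntityEffects', data=df_panel).fit()
print(fe_model_panel.summary)  # in linearmodels, summary is a property, not a method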

## 8. Mini-Exercise

Try the following on your own:

1. Add `size` as an additional regressor in the OLS model.
2. Estimate a model `y ~ x1 + x2 + size` with and without firm fixed effects.
3. Compare the coefficients on `x1` and `x2` across the models.
4. Plot the predicted vs actual `y` for one of your models.