# Problem Set 5

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn as skl
from sklearn import linear_model
import statsmodels.api as sm
import patsy
import scipy.optimize as opt

## Questions 1-4

Suppose that you own a mobile app that has a "freemium" pricing policy, and you want to model how the behavior of your users depends on the monthly subscription price that you charge.

The population of your potential app users is distributed among three states: Non-users, Free users, and Premium users. **All users begin as Non-users**; a Non-user never becomes a Premium user directly, without first trying the app as a Free user for at least one month. In each month, 5\% of the Non-users decide to try the app in the next month.

In every time period, 0.5\% of the Free users leave the app and become Non-users again. After those users switch, some of the remaining Free users choose to subscribe to the app in a given month (and move from the Free user category to the Premium user category). The probability that a Free user who has not already left the app chooses to subscribe for the next month depends on the monthly price that you charge. If the monthly price is represented by a positive number $p$, then the probability of subscribing is given by $$\mathbb{P}\left[ \text{Free user chooses to subscribe to Premium}\right] = \frac{1-\mathbb{P}[\text{Free user chooses to leave the app}]}{(1+p)^2}.$$

Finally, every month 5\% of Premium users will cancel their subscriptions, and 1\% of Premium users will delete the app and become Non-users again.


### Question 1

Define a function `transition_matrix` over prices $p$, whose output is a row-stochastic version of the matrix that describes the transition probabilities in this economy.

Write another function called `iterate` that takes as arguments $p$ and $t$, where $t$ is the number of months and takes a default value of 6 months, and outputs the distribution of users into categories after $t$ months.
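As a reminder of the mechanics, here is a minimal sketch of iterating a row-stochastic matrix. It uses a hypothetical two-state chain that is unrelated to the app model above; the names `P_example` and `distribution_after` and all of the numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np

# Hypothetical two-state chain (NOT the app model): rows index the current
# state, columns index next period's state, so every row sums to 1.
P_example = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
])

def distribution_after(P, x0, t=6):
    """Return the distribution x0 @ P^t after t periods (x0 is a row vector)."""
    return x0 @ np.linalg.matrix_power(P, t)

x0 = np.array([1.0, 0.0])                        # everyone starts in state 0
x1 = distribution_after(P_example, x0, t=1)      # one period ahead
x6 = distribution_after(P_example, x0)           # default of six periods
```

With a row-stochastic matrix the distribution is a row vector that multiplies the matrix from the left, which is why the sketch computes `x0 @ P^t` rather than `P^t @ x0`.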

In [None]:
# your code here

### Question 2

Create a figure with three subplots arranged horizontally. On each subplot, show the evolution of your distribution of users over the first year of your new app. Each subplot should have time $t$ on the x-axis and the proportion of users in each category on the y-axis, for a specific price. Use monthly prices of $p=0.10$, $p=2$, and $p=20$ for the three subplots, respectively. Make sure to clearly label your lines, axes, and subplots.
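One possible layout for such a figure is sketched below. The plotted series are placeholders (the dictionary `placeholder` and the panel titles are hypothetical), so only the subplot, label, and legend calls carry over to the real model.

```python
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(13)  # t = 0, 1, ..., 12

# Placeholder series (hypothetical): swap in the model's output per price.
placeholder = {
    "Series A": 1 - months / 24,
    "Series B": months / 24,
}

# One row, three columns, with a shared y-axis so panels are comparable.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, title in zip(axes, ["panel 1", "panel 2", "panel 3"]):
    for name, values in placeholder.items():
        ax.plot(months, values, label=name)
    ax.set_title(title)
    ax.set_xlabel("Month $t$")
    ax.legend()
axes[0].set_ylabel("Share of users")
fig.tight_layout()
```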

In [None]:
# your code here

Which of the above prices yields the most premium users after one year? Is this surprising? Why or why not?

Use this markdown cell to enter your answer

### Question 3

Suppose that you pay a per-user cost of $c$ for each user (premium or free) in your app. You also earn revenue equal to the price $p$ for each premium user. Define a function `profit` that takes as arguments $p$, $c$, $T$, and a discount factor $\beta$, and outputs the total discounted profit that you will earn after $T$ months. Assume that you will get paid at the end of each month (so the first month is discounted). Calculate your profits at a price of $p=2$, a cost of $c=1$, and a discount factor of $\beta=0.95$ over 12 months.
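The discounting convention can be illustrated on its own. The sketch below assumes a generic list of monthly cash flows, where the helper name `discounted_sum` and the numbers are hypothetical. Because payment arrives at the end of each month, the flow from month $k$ is weighted by $\beta^k$, starting at $k=1$.

```python
import numpy as np

def discounted_sum(flows, beta):
    """Present value of monthly flows paid at the END of each month.

    flows[0] belongs to month 1 and is weighted by beta**1,
    flows[1] belongs to month 2 and is weighted by beta**2, and so on.
    """
    flows = np.asarray(flows, dtype=float)
    weights = beta ** np.arange(1, len(flows) + 1)
    return float(weights @ flows)

# Hypothetical cash flows, unrelated to the app model:
# 100 * 0.5 + 100 * 0.25 = 75
value = discounted_sum([100.0, 100.0], beta=0.5)
```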

In [None]:
# your code here

### Question 4

Using the same parameters as in Question 3, calculate the price that maximizes your profits over 12 months. Use `scipy.optimize` to solve this problem.
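Since `scipy.optimize` provides minimizers, a common approach is to minimize the negative of the function you want to maximize. The sketch below shows this on a toy objective; `toy_objective` and the bounds are assumptions chosen only for illustration.

```python
import scipy.optimize as opt

def toy_objective(p):
    """Hypothetical concave function whose maximum is at p = 3."""
    return 5.0 - (p - 3.0) ** 2

# Maximize toy_objective by minimizing its negative on a bounded interval.
result = opt.minimize_scalar(
    lambda p: -toy_objective(p),
    bounds=(0.0, 10.0),
    method="bounded",
)
best_p = result.x          # maximizer
best_value = -result.fun   # flip the sign back to get the maximum
```

The bounded method is one reasonable choice when the decision variable, like a price, must stay in a known interval.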

In [None]:
# your code here

## Questions 5-7

This problem set uses data on insurance characteristics and medical costs. This is a public domain dataset downloaded from [kaggle](https://www.kaggle.com/mirichoi0218/insurance). 

The variables in the data are:
- age: age of primary beneficiary
- sex: gender of the insurance contractor (female or male)
- bmi: Body mass index of primary beneficiary
- children: Number of children covered by health insurance / Number of dependents
- smoker: whether primary beneficiary smokes
- region: the beneficiary's residential area in the US (northeast, southeast, southwest, or northwest)
- charges: medical costs billed by health insurance

You will build a model to predict charges given the other variables in the data.


In [None]:
insure = pd.read_csv("https://raw.githubusercontent.com/doctor-phil/ECON323_2024_Spring/main/problem_sets/insurance.csv")
insure.head()

In these questions you will build and evaluate a model to predict medical costs. 

First, we divide the data into training and testing sets. 

In [None]:
train = insure.sample(frac=0.8, random_state=42)
test = insure.drop(train.index)

Now we create a numeric matrix of features from our dataframe. The formula interface from the patsy package is one convenient method for doing this.

In [None]:
y, X = patsy.dmatrices("charges ~ C(sex)*(age + children + C(smoker) + C(region)) + age:C(smoker)", insure, return_type='matrix')
y = y.flatten()
y_train = y[train.index]
X_train = X[train.index]
y_test = y[test.index]
X_test = X[test.index]

### Question 5

Fit a linear regression model to the training data. You can use any of the methods we discussed in class. Estimate the model using the training set and print the MSE on the training and testing data.
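For reference, the fit-then-score pattern looks like this on synthetic data. The arrays `X_demo` and `y_demo` and the 160/40 split are made up for the sketch; scikit-learn is one of several methods that would work here.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical): three features, known coefficients, small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# Estimate on the training rows only, then score both samples.
ols = linear_model.LinearRegression().fit(X_tr, y_tr)
mse_train = mean_squared_error(y_tr, ols.predict(X_tr))
mse_test = mean_squared_error(y_te, ols.predict(X_te))
print(f"train MSE: {mse_train:.4f}, test MSE: {mse_test:.4f}")
```

A patsy design matrix includes an `Intercept` column by default, so with the real data passing `fit_intercept=False` is one reasonable option.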

In [None]:
# your code here

### Question 6

Fit a LASSO model to the training data. Follow along with the notes on regression to visualize the lasso path as the regularization parameter `alpha` varies. Print the MSE on the training and testing data.
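A lasso path can be computed and plotted roughly as follows. This is a sketch on synthetic data, where `X_demo`, `y_demo`, and the value `alpha=0.1` are assumptions; the course notes may organize the plot differently.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

# Synthetic data (hypothetical): only features 0 and 2 matter.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 5))
true_coef = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=150)

# lasso_path returns a decreasing grid of alphas and one column of
# coefficients per alpha, so coefs has shape (n_features, n_alphas).
alphas, coefs, _ = linear_model.lasso_path(X_demo, y_demo)

fig, ax = plt.subplots()
ax.plot(alphas, coefs.T)          # one line per feature
ax.set_xscale("log")
ax.set_xlabel("alpha")
ax.set_ylabel("coefficient")

# Fit at a single, arbitrarily chosen alpha to inspect the coefficients.
lasso = linear_model.Lasso(alpha=0.1).fit(X_demo, y_demo)
n_nonzero = int(np.sum(lasso.coef_ != 0))
```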

In [None]:
# your code here

### Question 7

Estimate a regression tree using the training data, with a maximum depth of 3 layers. Set the `random_state` keyword argument to `123`. Report the mean squared error on the training and testing samples as a formatted string.

Then estimate a regression tree with a maximum depth of 10 and the same value of `random_state`, and report its MSE on the test data in the same way.
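The calls involved can be sketched on synthetic data as below. The data-generating process and the 240/60 split are hypothetical, and the numbers it prints say nothing about the insurance data.

```python
import numpy as np
from sklearn import tree
from sklearn.metrics import mean_squared_error

# Synthetic data (hypothetical): a noisy sine curve in one feature.
rng = np.random.default_rng(2)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)
X_tr, X_te = X_demo[:240], X_demo[240:]
y_tr, y_te = y_demo[:240], y_demo[240:]

results = {}
for depth in (3, 10):
    model = tree.DecisionTreeRegressor(max_depth=depth, random_state=123)
    model.fit(X_tr, y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(X_tr))
    mse_te = mean_squared_error(y_te, model.predict(X_te))
    results[depth] = (mse_tr, mse_te)
    print(f"depth {depth:>2}: train MSE = {mse_tr:.3f}, test MSE = {mse_te:.3f}")
```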

Which model had a higher MSE on the testing data? Explain why you think this might be.

(Use this markdown cell to write your answer)