# Week 5 Exercise 2: Patient Cost Forecast
## Approach: Time-series forecast, one-step forward

In this exercise, we forecast each patient's insurance claim costs one year ahead.

The forecast combines two models, a random forest classifier and a random forest regressor, both of which are ensembles of decision trees.


In [156]:
# Import libraries. It is good practice to import everything at the top of the notebook.
import pandas as pd
import sqlite3 as sql
import os
from datetime import date
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

Next, we connect to the database that holds the costs incurred by each patient.

In [157]:

con = sql.connect(os.path.join(os.getcwd(), 'synthea_and_county_ga.db')) 

In [158]:
df = pd.read_excel('w4e2_patient_risk_morb_scores.xlsx',index_col='id')


The data in our database was generated in March 2023, so 2022 is the most recent complete year and serves as year 0 in our analysis.

We manually set the start date five years before that, at the beginning of 2017.




In [159]:

startdate = date(2017, 1, 1) # First day of the earliest year in the window (year minus 5)
enddate = date(2022, 12, 31) # Last day of year 0

# Note: date() takes its arguments in the order year, month, day.

In [160]:
# Create the cost columns with zero values.
# Assigning a scalar to a new column name is a common way to add a column in pandas.
df['cost_year_minus5'] = 0
df['cost_year_minus4'] = 0
df['cost_year_minus3'] = 0
df['cost_year_minus2'] = 0
df['cost_year_minus1'] = 0
df['cost_year0'] = 0

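The query below sums the claim costs per patient and year between the start and end dates, excluding wellness encounters. The dates are taken from the Python variables defined above.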

In [161]:

# The query string is named 'query' so that it does not shadow the sqlite3 module imported as 'sql'
query = f"""
    select patient as id, strftime('%Y', start) as enc_year, sum(TOTAL_CLAIM_COST) as enc_year_cost
    from encounters 
    where start > '{startdate}' and start < '{enddate}' and encounterclass not in ('wellness')
    group by patient, enc_year
    order by patient, enc_year
"""



In [162]:
df_temp = pd.read_sql_query(query, con)
df_temp = df_temp.rename(columns=str.lower)
df_temp = df_temp.set_index('id')
df_temp = df_temp.round(2)
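
The query above interpolates the dates into the SQL text with an f-string. A minimal sketch of the same query using bound parameters instead is shown here; the in-memory table and its rows are made up for illustration and are not taken from the course database.

```python
import sqlite3
from datetime import date

# In-memory stand-in for the encounters table (toy rows, not real data)
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "create table encounters (patient text, start text, "
    "encounterclass text, total_claim_cost real)"
)
con_demo.executemany(
    "insert into encounters values (?, ?, ?, ?)",
    [
        ("p1", "2021-03-01", "ambulatory", 100.0),
        ("p1", "2021-07-15", "emergency", 250.0),
        ("p1", "2022-02-10", "wellness", 80.0),    # excluded by the class filter
        ("p2", "2016-05-05", "outpatient", 40.0),  # before the start date
        ("p2", "2022-09-09", "outpatient", 60.0),
    ],
)

# The '?' placeholders are filled in by sqlite3, not by string formatting
query_demo = """
    select patient as id, strftime('%Y', start) as enc_year,
           sum(total_claim_cost) as enc_year_cost
    from encounters
    where start > ? and start < ? and encounterclass not in ('wellness')
    group by patient, enc_year
    order by patient, enc_year
"""
params = (date(2017, 1, 1).isoformat(), date(2022, 12, 31).isoformat())
rows = con_demo.execute(query_demo, params).fetchall()
print(rows)
```

The same placeholders can be passed to pd.read_sql_query through its params argument.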

We now have the yearly costs per patient from 2017 to 2022 in long format, with one row per patient per year. Next, we pivot the data so that each year becomes its own column.

In [163]:

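# DataFrame.pivot reshapes long data to wide: each distinct value of 'enc_year' becomes a column
# holding the matching 'enc_year_cost'. It is similar to a pivot table in Excel.
df_pivot = df_temp.pivot(columns='enc_year', values='enc_year_cost')
# Caution: apart from 'cost_year0', these names omit the underscore used in the df columns
# created earlier ('cost_year_minus5', ...), so df.update() will only match 'cost_year0'.
# Rename to 'cost_year_minus5' etc. to carry the prior-year costs over to df.
df_pivot = df_pivot.rename(columns = {'2017':'cost_yearminus5',
                                     '2018':'cost_yearminus4',
                                     '2019':'cost_yearminus3',
                                     '2020':'cost_yearminus2',
                                     '2021':'cost_yearminus1',
                                     '2022':'cost_year0'})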
df_pivot.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19923 entries, 00054016-4e0f-ca1b-1c9d-ef729b4015e5 to fffec807-6aca-3300-b87a-032c70a488d1
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   cost_yearminus5  6314 non-null   float64
 1   cost_yearminus4  11274 non-null  float64
 2   cost_yearminus3  11954 non-null  float64
 3   cost_yearminus2  12544 non-null  float64
 4   cost_yearminus1  17052 non-null  float64
 5   cost_year0       12468 non-null  float64
dtypes: float64(6)
memory usage: 1.1+ MB
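
To make the long-to-wide reshaping concrete, here is a small sketch on a toy frame. The patient ids and costs are invented, and the index is passed explicitly to pivot rather than taken from the existing index as in the cell above.

```python
import pandas as pd

# Long format: one row per patient per year (toy values)
long_df = pd.DataFrame(
    {
        "id": ["a", "a", "b"],
        "enc_year": ["2021", "2022", "2022"],
        "enc_year_cost": [10.0, 20.0, 5.0],
    }
)

# Wide format: one row per patient, one column per year
wide_df = long_df.pivot(index="id", columns="enc_year", values="enc_year_cost")
print(wide_df)
```

Patient "b" has no row for 2021 in the long data, so the pivoted frame holds NaN in that cell. This is where the null values counted below come from.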



### We need to be aware of null values, as they play an important role in the analysis.

In [164]:
#number of null values in each column

df_pivot.isnull().sum()


enc_year
cost_yearminus5    13609
cost_yearminus4     8649
cost_yearminus3     7969
cost_yearminus2     7379
cost_yearminus1     2871
cost_year0          7455
dtype: int64

### The pivot table has null values. A null indicates that the patient had no encounters in that year, so we fill them with zero.

In [165]:


df_pivot = df_pivot.fillna(0)

# Update df with the pivot table.
# df.update() modifies df in place, overwriting values with the non-null values
# from the other DataFrame wherever both the index and the column name match.

df.update(df_pivot)
df.head()

id,first,last,city,state,county,fips,lat,lon,birthdate,marital,...,comm_health_needs_score,risk_score_morbidity,risk_score_diab,risk_score_hyp,cost_year_minus5,cost_year_minus4,cost_year_minus3,cost_year_minus2,cost_year_minus1,cost_year0
53e40e98-c764-53a4-aaf6-6318a3c3c95d,Aaron,Abernathy,Augusta,Georgia,Richmond County,13245,33.403418,-82.00898,2017-11-19,unknown,...,2.42,0.0,0.0,0.0,0,0,0,0,0,116.08
c87850c0-6c2e-7312-b050-cfcc2a74be53,Aaron,Bashirian,Acworth,Georgia,Cobb County,13067,34.070996,-84.698541,1949-02-19,M,...,-6.31,99.46,99.0,93.0,0,0,0,0,0,28981.8
0c99afa9-03c2-f695-efe2-7943262bbbc7,Aaron,Bashirian,Newnan,Georgia,Coweta County,13077,33.39007,-84.665856,1992-01-21,S,...,-5.43,0.2,68.0,5.0,0,0,0,0,0,0.0
5e8c9aa2-90da-8e27-7d6f-c6b0faf50169,Aaron,Nienow,Alto,Georgia,Habersham County,13137,34.490158,-83.528305,1980-12-14,D,...,-2.25,7.9,99.0,76.0,0,0,0,0,0,0.0
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4,Aaron,Will,St. Simons,Georgia,Glynn County,13127,31.191608,-81.367092,2013-05-07,unknown,...,-0.82,0.0,0.0,0.0,0,0,0,0,0,116.08
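
In the output above, only cost_year0 carries non-zero values. The following sketch, on made-up data, illustrates how DataFrame.update aligns on column names: a column whose name does not exist in the target frame is ignored rather than added.

```python
import pandas as pd

# Target frame with float placeholders (toy data)
target = pd.DataFrame(
    {"cost_year_minus1": [0.0, 0.0], "cost_year0": [0.0, 0.0]},
    index=["a", "b"],
)

# Source frame: 'cost_yearminus1' lacks the underscore, 'cost_year0' matches
source = pd.DataFrame(
    {"cost_yearminus1": [3.0, 4.0], "cost_year0": [5.0, 7.0]},
    index=["a", "b"],
)

target.update(source)  # in place; aligned on index and column name
print(target)
```

After the update, cost_year0 holds the source values while cost_year_minus1 is still zero, because no source column has exactly that name.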


## Create the prediction model for predicting year 0 (i.e., 2022)

As observed earlier, the data contains a significant number of null values, meaning many patients have no encounters in a given year. For better predictions, we use two steps: first we predict whether a patient will have any encounters in the current year, and then, only for patients predicted to have encounters, we predict the cost incurred in that year.

### For classification, we use a random forest classifier
We turn this into a binary classification problem by deriving a target that is 1 if the patient has encounters (non-zero cost) in the current year and 0 otherwise.

In [166]:


y = df['cost_year0']  # cost_year0 is the column we are trying to predict

# Set y = 1 if the cost is non-zero and 0 if the cost is 0
y = y.apply(lambda x: 0 if x == 0 else 1)

# Count the number of ones and zeros in y
y.value_counts()


1    12468
0    10164
Name: cost_year0, dtype: int64
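
The same binary target can also be built without a row-wise lambda. This is a small sketch on invented costs, shown only as an alternative to the apply call above.

```python
import pandas as pd

# Toy costs: two patients with claims, two without
cost = pd.Series([116.08, 0.0, 28981.8, 0.0], name="cost_year0")

# Vectorized comparison: True where the cost is non-zero, then cast to 0/1
has_cost = (cost != 0).astype(int)
counts = has_cost.value_counts()
print(counts)
```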

Next, we select the features and the target variable for the model.
As features we use all columns whose names contain 'numof' (utilization counts) together with the prior-year cost columns.

In [167]:
# Set up the predictors to include all 'numof' (utilization) columns and prior year costs
cols_numof = [col for col in df.columns if 'numof' in col]
cols_cost = [col for col in df.columns if 'cost_year' in col]
X = df[cols_numof + cols_cost]
# Drop last (target)
X = X.drop(columns=['cost_year0']) # don't need the current year as that is what we are trying to predict
X.head()
y.head()

id
53e40e98-c764-53a4-aaf6-6318a3c3c95d    1
c87850c0-6c2e-7312-b050-cfcc2a74be53    1
0c99afa9-03c2-f695-efe2-7943262bbbc7    0
5e8c9aa2-90da-8e27-7d6f-c6b0faf50169    0
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4    1
Name: cost_year0, dtype: int64

Now we split the data into training and test sets.

In [168]:

# random_state seeds the random number generator so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [169]:
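# Train the model and get its accuracy.
# Note: scoring on all of X and y includes the training rows, so this is higher
# than a held-out estimate; model_class.score(X_test, y_test) gives the latter.
model_class = RandomForestClassifier(random_state=42)
model_class.fit(X_train, y_train)
model_class.score(X, y)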

0.9424708377518558
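
The score above is computed on all rows, including those used for training. As a rough illustration of how that differs from a held-out score, here is a sketch on synthetic data; the features, noise level, and forest size are arbitrary assumptions, not the patient data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy binary problem (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

acc_all = clf.score(X_demo, y_demo)  # mixes training and test rows
acc_test = clf.score(X_te, y_te)     # rows the model has not seen
print(f"all rows: {acc_all:.3f}, held-out rows: {acc_test:.3f}")
```

A random forest typically fits its training rows almost perfectly, so the all-rows figure tends to overstate how well the model generalizes.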

In [170]:
#predict the values of y for the data
y_pred = model_class.predict(X)
df['y_pred'] = y_pred

In [171]:
#create a new prediction cost column

df['prediction_cost'] = 0

#assign 0 to the prediction cost column if the  prediction is 0

df.loc[df['y_pred'] == 0, 'prediction_cost'] = 0

#create a new DataFrame with only the patients that have a prediction of 1

df1 = df[df['y_pred'] == 1]
df1.head()

id,first,last,city,state,county,fips,lat,lon,birthdate,marital,...,risk_score_diab,risk_score_hyp,cost_year_minus5,cost_year_minus4,cost_year_minus3,cost_year_minus2,cost_year_minus1,cost_year0,y_pred,prediction_cost
53e40e98-c764-53a4-aaf6-6318a3c3c95d,Aaron,Abernathy,Augusta,Georgia,Richmond County,13245,33.403418,-82.00898,2017-11-19,unknown,...,0.0,0.0,0,0,0,0,0,116.08,1,0
c87850c0-6c2e-7312-b050-cfcc2a74be53,Aaron,Bashirian,Acworth,Georgia,Cobb County,13067,34.070996,-84.698541,1949-02-19,M,...,99.0,93.0,0,0,0,0,0,28981.8,1,0
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4,Aaron,Will,St. Simons,Georgia,Glynn County,13127,31.191608,-81.367092,2013-05-07,unknown,...,0.0,0.0,0,0,0,0,0,116.08,1,0
d369812d-7ba7-a03a-cd5c-ccc2989d0b15,Abbey,Grady,Ball Ground,Georgia,Cherokee County,13057,34.36073,-84.344374,1978-10-29,S,...,89.0,19.0,0,0,0,0,0,28105.99,1,0
6e384968-cd64-1335-002e-113fecfe0f63,Abbey,Purdy,Kennesaw,Georgia,Cobb County,13067,34.018574,-84.617446,1984-06-28,M,...,29.0,27.0,0,0,0,0,0,23098.47,1,0


As discussed earlier, we now fit a regression model for the cost in the current year, using only the patients predicted to have encounters. That is why we split the data into two parts above: patients with predicted encounters and patients without.

As with the classification model, we select the features and the target variable.


In [172]:
cols_numof = [col for col in df1.columns if 'numof' in col]
cols_cost = [col for col in df1.columns if 'cost_year' in col]
X = df1[cols_numof + cols_cost]
# Drop last (target)
X = X.drop(columns=['cost_year0'])
y = df1['cost_year0']
X.head()

id,numof_allergies,numof_careplans,numof_devices,numof_medications,numof_procedures_2yr,numof_med_conds,numof_soc_challs,numof_enc_ambulatory_2yr,numof_enc_emergency_2yr,numof_enc_home_2yr,...,numof_enc_outpatient_2yr,numof_enc_snf_2yr,numof_enc_urgentcare_2yr,numof_enc_virtual_2yr,numof_enc_wellness_2yr,cost_year_minus5,cost_year_minus4,cost_year_minus3,cost_year_minus2,cost_year_minus1
53e40e98-c764-53a4-aaf6-6318a3c3c95d,0,1,0,2,2,1,0,2,3,0,...,0,0,0,0,4,0,0,0,0,0
c87850c0-6c2e-7312-b050-cfcc2a74be53,0,3,1,7,253,20,3,0,6,0,...,22,0,34,0,2,0,0,0,0,0
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4,0,0,0,0,1,0,0,0,1,0,...,0,0,0,0,2,0,0,0,0,0
d369812d-7ba7-a03a-cd5c-ccc2989d0b15,5,3,4,2,16,9,2,2,0,0,...,2,0,0,0,2,0,0,0,0,0
6e384968-cd64-1335-002e-113fecfe0f63,0,0,0,1,74,3,1,9,2,0,...,2,1,0,0,2,0,0,0,0,0


In [173]:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [174]:

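# For a regressor, score() returns the R² value. As with the classifier, it is
# computed here on all rows, including the rows used for training.
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
model.score(X,y)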

0.8581127640317022

In [175]:
# Update the prediction_cost column of df1 with the predicted cost
y_pred = model.predict(X)
df1['prediction_cost'] = y_pred
df1.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['prediction_cost'] = y_pred
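
The warning above appears because df1 was created by filtering df, so pandas cannot tell whether the assignment is meant for df1 or for df. One common way to avoid it is to take an explicit copy when filtering. The sketch below uses a toy frame with invented values.

```python
import pandas as pd

# Toy frame standing in for df (values are invented)
df_demo = pd.DataFrame(
    {"y_pred": [1, 0, 1], "cost_year0": [10.0, 0.0, 30.0]},
    index=["a", "b", "c"],
)

# .copy() makes the subset an independent DataFrame, so assigning to it is unambiguous
df1_demo = df_demo[df_demo["y_pred"] == 1].copy()
df1_demo["prediction_cost"] = [12.0, 28.0]
print(df1_demo)
```

The original frame is left untouched, which is why the notebook writes the predictions back in a separate step.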


id,first,last,city,state,county,fips,lat,lon,birthdate,marital,...,risk_score_diab,risk_score_hyp,cost_year_minus5,cost_year_minus4,cost_year_minus3,cost_year_minus2,cost_year_minus1,cost_year0,y_pred,prediction_cost
53e40e98-c764-53a4-aaf6-6318a3c3c95d,Aaron,Abernathy,Augusta,Georgia,Richmond County,13245,33.403418,-82.00898,2017-11-19,unknown,...,0.0,0.0,0,0,0,0,0,116.08,1,7729.74026
c87850c0-6c2e-7312-b050-cfcc2a74be53,Aaron,Bashirian,Acworth,Georgia,Cobb County,13067,34.070996,-84.698541,1949-02-19,M,...,99.0,93.0,0,0,0,0,0,28981.8,1,28286.9317
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4,Aaron,Will,St. Simons,Georgia,Glynn County,13127,31.191608,-81.367092,2013-05-07,unknown,...,0.0,0.0,0,0,0,0,0,116.08,1,80.426103
d369812d-7ba7-a03a-cd5c-ccc2989d0b15,Abbey,Grady,Ball Ground,Georgia,Cherokee County,13057,34.36073,-84.344374,1978-10-29,S,...,89.0,19.0,0,0,0,0,0,28105.99,1,5751.0202
6e384968-cd64-1335-002e-113fecfe0f63,Abbey,Purdy,Kennesaw,Georgia,Cobb County,13067,34.018574,-84.617446,1984-06-28,M,...,29.0,27.0,0,0,0,0,0,23098.47,1,33704.4619



## We write the predicted values back to the original data

In [176]:

df.update(df1)

Next, we assess the combined model by computing the R² score between the actual and the predicted costs.

In [177]:

r2_score(df['cost_year0'], df['prediction_cost'])

0.8666541883857961
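
For reference, the R² score compares the squared prediction errors with the squared deviations of the actual values from their mean: R² = 1 - SS_res / SS_tot. A minimal sketch of that formula in plain Python follows; the four value pairs are arbitrary illustration values.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error of the predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
score = r_squared(actual, predicted)
print(round(score, 4))  # 0.9486
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean.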

Since the model fits the data reasonably well, we can use it to forecast the cost for the following year. We start by rebuilding the feature matrix for all patients.

In [178]:
# Set up the predictors to include all 'numof' (utilization) columns and prior year costs
cols_numof = [col for col in df.columns if 'numof' in col]
cols_cost = [col for col in df.columns if 'cost_year' in col]
X = df[cols_numof + cols_cost]
# Drop last (target)
X = X.drop(columns=['cost_year0']) # don't need the current year as that is what we are trying to predict
X

id,numof_allergies,numof_careplans,numof_devices,numof_medications,numof_procedures_2yr,numof_med_conds,numof_soc_challs,numof_enc_ambulatory_2yr,numof_enc_emergency_2yr,numof_enc_home_2yr,...,numof_enc_outpatient_2yr,numof_enc_snf_2yr,numof_enc_urgentcare_2yr,numof_enc_virtual_2yr,numof_enc_wellness_2yr,cost_year_minus5,cost_year_minus4,cost_year_minus3,cost_year_minus2,cost_year_minus1
53e40e98-c764-53a4-aaf6-6318a3c3c95d,0.0,1.0,0.0,2.0,2.0,1.0,0.0,2.0,3.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
c87850c0-6c2e-7312-b050-cfcc2a74be53,0.0,3.0,1.0,7.0,253.0,20.0,3.0,0.0,6.0,0.0,...,22.0,0.0,34.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
0c99afa9-03c2-f695-efe2-7943262bbbc7,0.0,1.0,1.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5e8c9aa2-90da-8e27-7d6f-c6b0faf50169,6.0,5.0,2.0,10.0,11.0,9.0,2.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9e386110-9e48-346c-7363-86669d6cfa8d,0.0,1.0,1.0,1.0,9.0,4.0,2.0,1.0,0.0,0.0,...,2.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
f2ad9924-0e5e-4ed4-a600-bc4341c440b6,0.0,1.0,2.0,1.0,14.0,6.0,3.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0
b3882e05-5ad3-36d9-8822-d5ff0fe187bb,3.0,4.0,4.0,7.0,22.0,13.0,4.0,1.0,0.0,0.0,...,2.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
69426d6b-5c07-b1b7-41fc-060a788819c8,0.0,3.0,1.0,7.0,16.0,9.0,1.0,0.0,0.0,0.0,...,2.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0


Using the process above, we define a function that runs both prediction steps, so that forecasting a subsequent year takes a single line of code.

In [179]:
def Reg_fun(X):
    # Step 1: predict with model_class whether each patient has encounters.
    # Note: this adds a 'y_pred' column to the DataFrame that is passed in.
    y_pred_class = model_class.predict(X)
    X['y_pred'] = y_pred_class
    # Keep only the patients with a prediction of 1

    X1 = X[X['y_pred'] == 1]
    X1.drop(columns=['y_pred'], inplace=True)
    print('done1')
    # Step 2: predict the cost for those patients with the regression model
    y_pred = model.predict(X1)
    print('done2')
    X1['y_pred']= y_pred
    # Overwrite the 0/1 flags in X with the predicted costs; rows predicted 0 stay 0
    X.update(X1)
    return X['y_pred']
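
As a sketch of the same two-step idea that leaves its input untouched, the hypothetical helper below takes the two models as arguments. The function name and the two stub model classes are made up for illustration; they only mimic the predict method of the fitted random forests.

```python
import pandas as pd


def two_stage_predict(features, classifier, regressor):
    """Return 0 where the classifier predicts no encounter, else the regressor's estimate."""
    flags = pd.Series(classifier.predict(features), index=features.index)
    cost = pd.Series(0.0, index=features.index, name="predicted_cost")
    positive = flags == 1
    if positive.any():
        # Only rows flagged 1 are passed to the regressor
        cost.loc[positive] = regressor.predict(features.loc[positive])
    return cost


class StubClassifier:
    """Hypothetical stand-in: flags patients with any cost last year."""
    def predict(self, X):
        return (X["cost_year_minus1"] > 0).astype(int).to_numpy()


class StubRegressor:
    """Hypothetical stand-in: predicts twice last year's cost."""
    def predict(self, X):
        return X["cost_year_minus1"].to_numpy() * 2.0


features = pd.DataFrame({"cost_year_minus1": [100.0, 0.0, 50.0]}, index=["a", "b", "c"])
result = two_stage_predict(features, StubClassifier(), StubRegressor())
print(result)
```

Because the predictions are collected in a separate Series, the feature frame keeps exactly the columns the models were trained on and can be reused.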

   
    


In [186]:

# For the next-year prediction, shift each cost column back one year:
# cost_year0 becomes cost_year_minus1, cost_year_minus1 becomes cost_year_minus2, and so on

X['cost_year_minus1'] = df['cost_year0']
X['cost_year_minus2'] = df['cost_year_minus1']
X['cost_year_minus3'] = df['cost_year_minus2']
X['cost_year_minus4'] = df['cost_year_minus3']
X['cost_year_minus5'] = df['cost_year_minus4']
X

id,numof_allergies,numof_careplans,numof_devices,numof_medications,numof_procedures_2yr,numof_med_conds,numof_soc_challs,numof_enc_ambulatory_2yr,numof_enc_emergency_2yr,numof_enc_home_2yr,...,numof_enc_snf_2yr,numof_enc_urgentcare_2yr,numof_enc_virtual_2yr,numof_enc_wellness_2yr,cost_year_minus5,cost_year_minus4,cost_year_minus3,cost_year_minus2,cost_year_minus1,y_pred
53e40e98-c764-53a4-aaf6-6318a3c3c95d,0.0,1.0,0.0,2.0,2.0,1.0,0.0,2.0,3.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,116.08,7729.740260
c87850c0-6c2e-7312-b050-cfcc2a74be53,0.0,3.0,1.0,7.0,253.0,20.0,3.0,0.0,6.0,0.0,...,0.0,34.0,0.0,2.0,0.0,0.0,0.0,0.0,28981.80,28286.931700
0c99afa9-03c2-f695-efe2-7943262bbbc7,0.0,1.0,1.0,0.0,0.0,2.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00,0.000000
5e8c9aa2-90da-8e27-7d6f-c6b0faf50169,6.0,5.0,2.0,10.0,11.0,9.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.00,0.000000
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,116.08,80.426103
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9e386110-9e48-346c-7363-86669d6cfa8d,0.0,1.0,1.0,1.0,9.0,4.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,67.77,0.000000
f2ad9924-0e5e-4ed4-a600-bc4341c440b6,0.0,1.0,2.0,1.0,14.0,6.0,3.0,3.0,0.0,0.0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,135.54,1960.508000
b3882e05-5ad3-36d9-8822-d5ff0fe187bb,3.0,4.0,4.0,7.0,22.0,13.0,4.0,1.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,6101.05,5004.330200
69426d6b-5c07-b1b7-41fc-060a788819c8,0.0,3.0,1.0,7.0,16.0,9.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.00,0.000000
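
The column shift performed in the cell above can also be written as a loop. The helper below is a hypothetical sketch that assumes the same column naming scheme (cost_year0, cost_year_minus1, ...) and is demonstrated on a toy frame with two lag columns.

```python
import pandas as pd


def shift_cost_window(costs, n_lags=5):
    """Build lag columns for the next forecast year by moving each cost column back one year."""
    shifted = pd.DataFrame(index=costs.index)
    for k in range(n_lags, 0, -1):
        # The new lag k is the old lag k-1; the new lag 1 is the old year 0
        source = "cost_year0" if k == 1 else f"cost_year_minus{k - 1}"
        shifted[f"cost_year_minus{k}"] = costs[source]
    return shifted


# Toy costs for two patients (invented values)
costs = pd.DataFrame(
    {
        "cost_year_minus2": [1.0, 2.0],
        "cost_year_minus1": [3.0, 4.0],
        "cost_year0": [5.0, 6.0],
    },
    index=["a", "b"],
)
shifted = shift_cost_window(costs, n_lags=2)
print(shifted)
```

The oldest year drops out of the window, and the most recent observed year becomes the first lag.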


In [181]:
df.drop(columns = ['y_pred', 'prediction_cost'], inplace = True)

In [182]:
out = Reg_fun(X)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X1.drop(columns=['y_pred'], inplace=True)


done1
done2


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X1['y_pred']= y_pred


In [183]:
out




id
53e40e98-c764-53a4-aaf6-6318a3c3c95d     7729.740260
c87850c0-6c2e-7312-b050-cfcc2a74be53    28286.931700
0c99afa9-03c2-f695-efe2-7943262bbbc7        0.000000
5e8c9aa2-90da-8e27-7d6f-c6b0faf50169        0.000000
ce8b6d28-f7a3-8427-6ca6-0f01fc5422d4       80.426103
                                            ...     
9e386110-9e48-346c-7363-86669d6cfa8d        0.000000
f2ad9924-0e5e-4ed4-a600-bc4341c440b6     1960.508000
b3882e05-5ad3-36d9-8822-d5ff0fe187bb     5004.330200
69426d6b-5c07-b1b7-41fc-060a788819c8        0.000000
9fdfb702-0f46-8899-fe8c-363733532bb6    23260.838100
Name: y_pred, Length: 22632, dtype: float64

In [184]:
df['cost_yearplus1'] = out.round(2)

In [185]:
df.to_excel('w5e2_patient_cost_forecast_vinay.xlsx')