# Predicting the yearly number of doctor visits using linear regression

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#importing data on which we will train the model
dt=pd.read_csv('../input/learning-lab-mini-challenge-doctors-visits/train.csv')

# Describe the data

In [None]:
dt.info()

The dataframe contains 7 columns:  <br>
**target variable:** 'visits', numerical, dtype: integer <br>
**numerical features:** age, weight, height, dtype: float64 <br>
**categorical features:** gender, ethnicity, dtype: object (i.e. string/text) <br>
**id** column is not informative but is required in the submission <br>
**number of observations:** 1000 entries <br>
**memory size:** 54.8+ KB

### **Explore more data types**

In [None]:
dt['visits'].describe()

### **Target variable: visits**
<br> In this sample, people visit the doctor at least 3 times and at most 30 times a year. <br>The average number of doctor visits is 11.

In [None]:
dt['visits'].head(5)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)
plt.hist(dt['visits'])

In [None]:
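#distplot is deprecated in recent seaborn versions; histplot with kde=True draws a comparable histogram with a density curve
sns.histplot(dt['visits'], kde=True)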

The sample distribution of the target variable visits is right-skewed. The histogram shows no obvious outliers.
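One way to put numbers on the skewness and outlier observations is to compute the sample skewness and count points outside the 1.5 × IQR fences. The sketch below runs on a synthetic right-skewed sample, not on the competition data; `summarize_distribution` is a hypothetical helper, and the IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

def summarize_distribution(values):
    """Return (skewness, number of points outside the 1.5*IQR fences)."""
    s = pd.Series(values, dtype=float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((s < lower) | (s > upper)).sum())
    return float(s.skew()), n_outliers

# synthetic stand-in for dt['visits'] (assumption: NOT the real data)
rng = np.random.default_rng(0)
sample = rng.gamma(shape=4.0, scale=3.0, size=1000)

skewness, n_outliers = summarize_distribution(sample)
print(f"skewness: {skewness:.2f}, points outside the IQR fences: {n_outliers}")
```

In the notebook itself, the same helper could be called as `summarize_distribution(dt['visits'])`; a positive skewness corresponds to a right-skewed distribution.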

**Check for null and missing values**

In [None]:
dt.isna().sum()

In [None]:
#dropna() returns a new dataframe, so assign the result back to dt; a bare dt.dropna() would leave the missing values in dt
dt=dt.dropna()

In [None]:
dt.isna().sum()

In [None]:
dt.info()

62 rows were deleted because they contained missing values, which removes 6.2% of the dataset.
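The share of removed rows follows directly from the row counts before and after `dropna()`. A minimal sketch using the counts reported in this notebook (1000 rows before, 62 dropped) instead of re-reading the file; the variable names are illustrative.

```python
n_before = 1000   # rows reported by the first dt.info()
n_dropped = 62    # rows removed by dropna()

n_after = n_before - n_dropped
share_dropped = n_dropped / n_before

print(f"{n_after} rows remain, {share_dropped:.1%} of the dataset was dropped")
```

With the dataframe at hand, `n_before` and `n_after` could instead be taken from `len(dt)` before and after the call to `dropna()`.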

**Randomly mix the data**

In [None]:
from sklearn.utils import shuffle
#pass the seed to shuffle directly: seeding Python's random module has no effect on random_state=np.random
dt=shuffle(dt,random_state=40)

Check that the rows were really shuffled

In [None]:
dt.head(10)

### Encoding categorical features

In [None]:
#dt_enc is the dataframe that will contain the data including the encoded features
dt_enc=pd.get_dummies(dt, columns=['gender', 'ethnicity'], dummy_na=False,drop_first=True)
dt_enc.head(10)
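With `drop_first=True`, `get_dummies` drops the first category of each encoded column (in sorted order), and that category becomes the baseline against which the remaining dummy columns are read. A small sketch on a made-up frame (the values are illustrative, not taken from the dataset) shows which columns survive:

```python
import pandas as pd

# tiny made-up frame with the same kind of categorical columns (illustrative values)
toy = pd.DataFrame({
    'visits': [7, 20, 14],
    'gender': ['female', 'male', 'female'],
    'ethnicity': ['group A', 'group B', 'group C'],
})

# 'female' and 'group A' are the first categories, so they are dropped
toy_enc = pd.get_dummies(toy, columns=['gender', 'ethnicity'], drop_first=True)
print(toy_enc.columns.tolist())
```

A row with zeros in all remaining ethnicity dummies therefore belongs to the dropped baseline group.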



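**Scale the numerical features** (optional for ordinary least squares; the models below are fitted on the unscaled data)

In [None]:
from sklearn.preprocessing import StandardScaler
#scale only the numerical features, not the id, the target or the dummy columns
scaler=StandardScaler()
scaled = scaler.fit_transform(dt_enc[['age', 'weight', 'height']])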

In [None]:
dt_enc.plot(x='age', y='visits', style='o')  
plt.title('Age vs visits')  
plt.xlabel('age')  
plt.ylabel('visits')  
plt.show()

In [None]:
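#gender is categorical, so a box plot per category is easier to read than points plotted against row position
sns.boxplot(x='gender', y='visits', data=dt)
plt.title('Gender vs visits')
plt.xlabel('gender')
plt.ylabel('visits')
plt.show()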

In [None]:
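#ethnicity is categorical, so a box plot per category is easier to read than points plotted against row position
sns.boxplot(x='ethnicity', y='visits', data=dt)
plt.title('Ethnicity vs visits')
plt.xlabel('ethnicity')
plt.ylabel('visits')
plt.show()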

In [None]:
dt_enc.corr()

In this sample, age is the feature with the highest positive correlation with the number of visits. The relationship between visits and age is moderate and positive.
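Rather than reading the full correlation matrix, the column for the target can be extracted and sorted. The sketch below does this on synthetic data in which the target is built to depend on age only; the frame `toy` and its coefficients are assumptions made for illustration, not the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
toy = pd.DataFrame({
    'age': rng.uniform(18, 90, n),
    'weight': rng.uniform(100, 280, n),
})
# synthetic target that depends on age only, plus noise (assumption for illustration)
toy['visits'] = 4 + 0.14 * toy['age'] + rng.normal(0, 4, n)

# correlation of every feature with the target, strongest positive first
corr_with_target = toy.corr()['visits'].drop('visits').sort_values(ascending=False)
print(corr_with_target)
```

Applied to the notebook's data, the equivalent call would be `dt_enc.corr()['visits'].drop('visits').sort_values(ascending=False)`.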

### Model the data with a linear regression of visits on age and ethnicity group D

In [None]:
X = dt_enc[['age', 'ethnicity_group D']]
Y = dt_enc['visits']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test=train_test_split(X,Y, test_size=0.3)

In [None]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, Y_train)
pred = regr.predict(X_test)

In [None]:
regr.intercept_

In [None]:
regr.coef_

Each additional year of age is associated with about 0.137 more visits per year, holding ethnicity constant. <br>
Belonging to ethnicity group D is associated with about 0.744 more visits per year than the other groups, at the same age.

In [None]:
from sklearn import metrics
import numpy as np
print(np.sqrt(metrics.mean_squared_error(Y_test,pred)))
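An RMSE is easier to judge next to a naive baseline that always predicts the mean. The sketch below computes the RMSE by hand and compares a set of predictions with such a baseline; the numbers are made up for illustration and `rmse` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up values, not taken from the dataset
y_true = [7, 20, 14, 18]
y_pred = [9, 16, 13, 15]

model_rmse = rmse(y_true, y_pred)
# baseline: predict the mean of the observed values for every row
baseline_rmse = rmse(y_true, np.full(len(y_true), np.mean(y_true)))

print(f"model RMSE: {model_rmse:.3f}, mean-baseline RMSE: {baseline_rmse:.3f}")
```

In the notebook, the baseline would more properly use the mean of `Y_train` and be evaluated on `Y_test`, so that it is compared with the model on equal terms.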

### Apply a different linear regression model 

In [48]:
features=dt_enc.columns
features

Index(['id', 'age', 'weight', 'height', 'visits', 'gender_male',
       'ethnicity_group B', 'ethnicity_group C', 'ethnicity_group D',
       'ethnicity_group E'],
      dtype='object')

In [60]:
#shorten the dummy column names so that they contain no spaces and can be used directly in statsmodels formulas
dt_enc = dt_enc.rename(columns = {'ethnicity_group B': 'ethnicB', 'ethnicity_group C': 'ethnicC','ethnicity_group D': 'ethnicD', 'ethnicity_group E': 'ethnicE'}, errors = 'raise')

In [None]:
import seaborn as sns
#x_vars and y_vars expect column names; use the renamed dummy columns
sns.pairplot(dt_enc, x_vars=['ethnicB', 'ethnicC', 'ethnicD'], y_vars=['visits'], height=7, aspect=0.7)


In [51]:
import statsmodels.formula.api as smf
model2 = smf.ols(formula='visits ~ age + weight + gender_male', data=dt_enc).fit()


In [52]:
model2.params

Intercept      4.537979
age            0.141099
weight         0.003195
gender_male   -0.066068
dtype: float64

In [61]:
dt_enc.head(4)

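| | id | age | weight | height | visits | gender_male | ethnicB | ethnicC | ethnicD | ethnicE |
|---|---|---|---|---|---|---|---|---|---|---|
| 691 | 692 | 32.0 | 244.0 | 62.0 | 7 | 0 | 1 | 0 | 0 | 0 |
| 106 | 107 | 18.0 | 223.0 | 63.0 | 20 | 0 | 1 | 0 | 0 | 0 |
| 50 | 51 | 80.0 | 264.0 | 62.0 | 14 | 1 | 1 | 0 | 0 | 0 |
| 850 | 851 | 37.0 | 108.0 | 68.0 | 18 | 0 | 0 | 0 | 0 | 1 |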


In [62]:
model2 = smf.ols(formula='visits ~ age + weight + ethnicD + ethnicE', data=dt_enc).fit()

In [63]:
model2.params

Intercept    4.474973
age          0.139602
weight       0.002419
ethnicD      0.817810
ethnicE      0.355255
dtype: float64
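The fitted parameters define the prediction equation visits = Intercept + 0.139602·age + 0.002419·weight + 0.817810·ethnicD + 0.355255·ethnicE. As a worked example, the sketch below evaluates it by hand for a hypothetical person (age 40, weight 180 in the dataset's units, ethnicity group D); `predict_visits` is an illustrative helper, not a statsmodels function.

```python
# coefficients copied from the model2.params output above
params = {
    'Intercept': 4.474973,
    'age': 0.139602,
    'weight': 0.002419,
    'ethnicD': 0.817810,
    'ethnicE': 0.355255,
}

def predict_visits(age, weight, ethnicD=0, ethnicE=0):
    """Evaluate the fitted linear equation for one person."""
    return (params['Intercept']
            + params['age'] * age
            + params['weight'] * weight
            + params['ethnicD'] * ethnicD
            + params['ethnicE'] * ethnicE)

# hypothetical person: 40 years old, weight 180, ethnicity group D
example = predict_visits(age=40, weight=180, ethnicD=1)
print(round(example, 2))  # about 11.31 visits per year
```

Setting `ethnicD=0` for the same person lowers the prediction by exactly the `ethnicD` coefficient, which is how a dummy coefficient is read.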

In [64]:
model2.rsquared

0.2849309704963229

In [65]:
model2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | visits | R-squared: | 0.285 |
| Model: | OLS | Adj. R-squared: | 0.282 |
| Method: | Least Squares | F-statistic: | 92.94 |
| Date: | Thu, 28 May 2020 | Prob (F-statistic): | 1.51e-66 |
| Time: | 00:03:16 | Log-Likelihood: | -2455.8 |
| No. Observations: | 938 | AIC: | 4922.0 |
| Df Residuals: | 933 | BIC: | 4946.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 4.4750 | 0.570 | 7.854 | 0.000 | 3.357 | 5.593 |
| age | 0.1396 | 0.007 | 18.779 | 0.000 | 0.125 | 0.154 |
| weight | 0.0024 | 0.002 | 1.144 | 0.253 | -0.002 | 0.007 |
| ethnicD | 0.8178 | 0.258 | 3.175 | 0.002 | 0.312 | 1.323 |
| ethnicE | 0.3553 | 0.334 | 1.062 | 0.288 | -0.301 | 1.012 |

| | | | |
|---|---|---|---|
| Omnibus: | 208.832 | Durbin-Watson: | 1.997 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 374.927 |
| Skew: | 1.362 | Prob(JB): | 3.85e-82 |
| Kurtosis: | 4.475 | Cond. No. | 1130.0 |
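The adjusted R-squared in the summary can be reproduced from the plain R-squared with the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of regressors. A short check using the values reported above (the variable names are illustrative):

```python
r_squared = 0.2849309704963229  # model2.rsquared
n_obs = 938                     # No. Observations
n_regressors = 4                # Df Model: age, weight, ethnicD, ethnicE

# penalize R-squared for the number of regressors
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_regressors - 1)
print(round(adj_r_squared, 3))  # 0.282, matching the summary
```

The penalty is small here because 938 observations are many compared with four regressors.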


In [69]:
model3 = smf.ols(formula='visits ~ age', data=dt_enc).fit()

In [70]:
model3.rsquared

0.27546663255740467

The linear regression models fit this dataset poorly: R squared remains low (around 0.28) even when we use the significantly related features (those with small p-values: age and ethnicD).
<br>
Insight: encode the numeric 'visits' variable into class intervals (resulting in three classes or binary variables), then try classification models on the data.
<br>
Try imputing the missing values and reapplying regression and classification to see whether the 6% of deleted observations affect the quality of the predictions.
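Both follow-up ideas can be sketched on a small made-up frame. The bin edges (0, 10, 20, 30), the class labels and the median strategy are arbitrary assumptions chosen for illustration; the notebook does not specify them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# made-up rows with a few missing values (not the competition data)
toy = pd.DataFrame({
    'age': [32.0, np.nan, 80.0, 37.0, 55.0, 61.0],
    'weight': [244.0, 223.0, np.nan, 108.0, 190.0, 150.0],
    'visits': [7, 20, 14, 18, 3, 30],
})

# 1) turn the numeric target into three classes: (0, 10], (10, 20], (20, 30]
toy['visits_class'] = pd.cut(toy['visits'], bins=[0, 10, 20, 30],
                             labels=['low', 'medium', 'high'])

# 2) fill missing numerical values with the column median instead of dropping rows
imputer = SimpleImputer(strategy='median')
toy[['age', 'weight']] = imputer.fit_transform(toy[['age', 'weight']])

print(toy)
```

In a real pipeline the imputer would be fitted on the training split only and then applied to the test split, so that no information leaks from the test data.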