# MQB7046 MODELLING PUBLIC HEALTH DATA - Linear regression

Multiple linear regression is a statistical technique used to model the relationship between multiple independent variables (predictors) and a single dependent variable. It extends the concept of simple linear regression, where there is only one independent variable.
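In symbols, the model for observation $i$ with $p$ predictors can be written as below. The notation ($y_i$, $x_{ij}$, $\beta_j$, $\varepsilon_i$) is a conventional choice assumed here for illustration, not taken from the course material.

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n
$$

Here $\beta_0$ is the intercept, each $\beta_j$ is the expected change in $y$ for a one-unit increase in $x_j$ with the other predictors held constant, and $\varepsilon_i$ is the error term. Setting $p = 1$ recovers simple linear regression.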

It is important to ensure that the assumptions of multiple linear regression, such as linearity, independence of errors, constant variance (homoscedasticity), and normality of residuals, are met for the validity of the model.
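As a rough sketch of how some of these assumptions might be checked numerically, the block below fits a model to synthetic data and runs one common test per assumption. The data frame `demo`, its columns `x1`, `x2`, `y`, and the choice of tests are assumptions made for illustration. Linearity is usually judged from a plot of residuals against fitted values, which is not shown here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Synthetic illustration data (an assumption; not the course dataset)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["y"] = 1.0 + 2.0 * demo["x1"] - 0.5 * demo["x2"] + rng.normal(size=n)

fit = smf.ols("y ~ x1 + x2", data=demo).fit()
resid = fit.resid
exog = fit.model.exog  # design matrix, including the intercept column

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
shapiro_stat, shapiro_p = stats.shapiro(resid)

# Constant variance: Breusch-Pagan test (H0: homoscedasticity)
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(resid, exog)

# Independence of errors: Durbin-Watson statistic
# (ranges from 0 to 4; values near 2 suggest no first-order autocorrelation)
dw = durbin_watson(resid)

print(f"Shapiro-Wilk p-value:  {shapiro_p:.3f}")
print(f"Breusch-Pagan p-value: {bp_lm_p:.3f}")
print(f"Durbin-Watson:         {dw:.2f}")
```

A small p-value in the Shapiro-Wilk or Breusch-Pagan test is evidence against the corresponding assumption. With real data these tests are best read alongside diagnostic plots.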

Steps involved in conducting multiple linear regression:
1) Import libraries
2) Prepare the data
3) Add constant (intercept): add a constant term to the independent variables
4) Fit the model: use the OLS (Ordinary Least Squares) method
5) View the summary of the regression results, which includes coefficients, standard errors, t-values, p-values, R-squared, and other statistics
6) Interpret the results: coefficients (slopes) of the independent variables, their significance (p-values), and the overall goodness of fit (R-squared)
7) Assess assumptions: use the diagnostic plots and statistical tests provided by Statsmodels

Statsmodels is a Python library that provides classes and functions for the estimation of various statistical models, hypothesis testing, and statistical data exploration. It is particularly useful for regression analysis, time series analysis, and generalized linear models, among other statistical techniques.

Multiple linear regression in Statsmodels can be performed through any of the following modules:<br>
1) statsmodels.api <br>
2) statsmodels.formula.api <br>
3) statsmodels.regression.linear_model <br>

#### Practical 2

The researchers want to examine whether the upper body strength and lower body strength of older people are associated with the number and severity of falls. An injury index was calculated to indicate the number and severity of accidents that each older person suffered.

Variable / Definition: <br>
1) injury : Overall injury index based on medical records
2) gluts  : A measure of strength of the lower body
3) arms   : A measure of strength of the upper body
4) age    : Age of participants (in years)
5) gender : Gender of participants (0: Male, 1: Female)

Which of the above variables relate to injury among older people?


#### Analysis Instructions

Please analyze the dataset provided to identify variables associated with injuries among older people. Utilize the Statsmodels library to perform any necessary analyses and report your findings.

1. **Data Preparation**: Begin by cleaning and preprocessing the dataset, handling missing values and outliers as needed.

2. **Exploratory Data Analysis**: Explore the distribution of variables and assess relationships between variables and the occurrence of injuries.

3. **Multiple Linear Regression**: Use the Statsmodels library to perform multiple linear regression analysis, identifying any significant variables related to injuries among older people.

4. **Interpretation and Reporting**: Interpret the results of the regression analysis and summarize your findings.


In [9]:
# Import Libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf  # formula interface; the array interface is statsmodels.api (conventionally imported as sm)
import plotly.express as px

In [29]:
# Load data into a DataFrame 
file_location = r"injuries.csv"
df = pd.read_csv(file_location, sep = ",")

In [30]:
# check dataframe
df.head()

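|   | id | age | gender | gluts | arms | injury |
|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 0 | 27 | 11 | 159 |
| 1 | 2 | 65 | 0 | 26 | 36 | 238 |
| 2 | 3 | 65 | 0 | 27 | 28 | 195 |
| 3 | 4 | 64 | 0 | 27 | 24 | 212 |
| 4 | 5 | 67 | 0 | 34 | 25 | 199 |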


In [31]:
# Display basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      100 non-null    int64
 1   age     100 non-null    int64
 2   gender  100 non-null    int64
 3   gluts   100 non-null    int64
 4   arms    100 non-null    int64
 5   injury  100 non-null    int64
dtypes: int64(6)
memory usage: 4.8 KB


In [7]:
# Check any missing data
df.isnull().sum()

id        0
age       0
gender    0
gluts     0
arms      0
injury    0
dtype: int64

In [8]:
# Display descriptive statistics
df.describe()

|   | id | age | gender | gluts | arms | injury |
|---|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 50.5 | 67.09 | 0.49 | 31.08 | 30.4 | 145.8 |
| std | 29.011492 | 3.621192 | 0.502418 | 5.783432 | 8.539865 | 52.19563 |
| min | 1.0 | 60.0 | 0.0 | 15.0 | 5.0 | 6.0 |
| 25% | 25.75 | 64.75 | 0.0 | 27.0 | 24.75 | 112.25 |
| 50% | 50.5 | 67.0 | 0.0 | 31.0 | 30.0 | 148.0 |
| 75% | 75.25 | 70.0 | 1.0 | 34.25 | 36.25 | 184.5 |
| max | 100.0 | 75.0 | 1.0 | 47.0 | 48.0 | 279.0 |


In [18]:
# Check data distribution
# Box plots for the continuous variables; a Shapiro-Wilk test can check normality formally
continous_data_columns = ["age", "gluts", "arms", "injury"]
px.box(df, x = continous_data_columns)

In [27]:
# Proportion of participants by gender (0: Male, 1: Female)
px.pie(df, names="gender")

In [28]:
# Perform any inferential statistics that is/are deemed necessary.
df.loc[:,continous_data_columns].corr()

|   | age | gluts | arms | injury |
|---|---|---|---|---|
| age | 1.0 | 0.177143 | -0.002482 | 0.281626 |
| gluts | 0.177143 | 1.0 | 0.337616 | -0.392851 |
| arms | -0.002482 | 0.337616 | 1.0 | -0.242609 |
| injury | 0.281626 | -0.392851 | -0.242609 | 1.0 |
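A correlation matrix reports coefficients only. To judge whether a given correlation differs from zero, one option is `scipy.stats.pearsonr`, which also returns a p-value. The sketch below uses synthetic data; the column names `strength` and `outcome` are hypothetical stand-ins, not columns of the practical's dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic illustration data (an assumption; not the course dataset)
rng = np.random.default_rng(7)
n = 100
demo = pd.DataFrame({"strength": rng.normal(30, 6, size=n)})
demo["outcome"] = 200 - 2.0 * demo["strength"] + rng.normal(0, 20, size=n)

# Pearson correlation coefficient and two-sided p-value (H0: no linear association)
r, p_value = stats.pearsonr(demo["strength"], demo["outcome"])
print(f"r = {r:.3f}, p = {p_value:.4f}")

# The coefficient matches the corresponding entry of DataFrame.corr()
print(demo.corr().loc["strength", "outcome"])
```

Pearson's test assumes an approximately linear relationship. For clearly skewed or ordinal variables, `scipy.stats.spearmanr` is a rank-based alternative.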


In [None]:
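from scipy import stats
# Shapiro-Wilk normality test for each continuous variable (H0: the data are normally distributed)
for column in continous_data_columns:
    print(column, stats.shapiro(df[column]))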

In [None]:
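# Run linear regression (one plausible specification: injury on all four candidate predictors)
model = smf.ols("injury ~ gluts + arms + age + gender", data=df).fit()
model.summary()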
