# Lower Back Pain

[Lower back pain](https://www.healthline.com/health/back-pain), also called __lumbago__, is not a disorder. It's a symptom of several different types of medical problems. It usually results from a problem with one or more parts of the lower back, such as:
* ligaments
* muscles
* nerves
* the bony structures that make up the spine, called vertebral bodies or vertebrae

Lower back pain can also be due to a problem with nearby organs, such as the kidneys.

According to the American Association of Neurological Surgeons, 75 to 85 percent of Americans will experience back pain in their lifetime. Of those, 50 percent will have more than one episode within a year. In 90 percent of all cases, the pain gets better without surgery. Talk to your doctor if you're experiencing back pain.

In this [Exploratory Data Analysis (EDA)](https://en.wikipedia.org/wiki/Exploratory_data_analysis), I use the Lower Back Pain Symptoms Dataset and try to find interesting insights in it.


## Dataset Description
This dataset contains:
* 310 Observations
* 12 Features 
* 1 Label

| __Col. no.__ | __Attribute name__ | __Type__ |
|-------------|---------------------|-----------|
| Col1 | pelvic_incidence | numeric, float64 |
| Col2 | pelvic_tilt | numeric, float64 |
| Col3 | lumbar_lordosis_angle | numeric, float64 |
| Col4 | sacral_slope | numeric, float64 |
| Col5 | pelvic_radius | numeric, float64 |
| Col6 | degree_spondylolisthesis | numeric, float64 |
| Col7 | pelvic_slope | numeric, float64 |
| Col8 | Direct_tilt | numeric, float64 |
| Col9 | thoracic_slope | numeric, float64 |
| Col10 | cervical_tilt | numeric, float64 |
| Col11 | sacrum_angle | numeric, float64 |
| Col12 | scoliosis_slope | numeric, float64 |
| Class_att | Attribute Class | categorical, object |


## EDA on Lower Back Pain Symptoms Dataset

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier, plot_importance
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix

In [None]:
dataset = pd.read_csv("../input/Dataset_spine.csv")

In [None]:
dataset.head()

In [None]:
# Unnecessary column
dataset.iloc[:,-1:].head()

In [None]:
# removing Unnecessary column
del dataset["Unnamed: 13"]

## Full Dataset Summary  
The [DataFrame.describe()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values. This method tells us a lot about a dataset. One important thing to note is that, by default, `describe()` summarizes only the numeric columns. Categorical columns are left out unless we ask for them, for example with `include="all"`.

Now, let's understand the statistics that are generated by the `describe()` method:

* `count` tells us the number of non-null (`NaN`-free) rows in a feature.

* `mean` tells us the mean value of that feature.

* `std` tells us the Standard Deviation Value of that feature.

* `min` tells us the minimum value of that feature.

* `25%`, `50%`, and `75%` are the percentiles (quartiles) of each feature. This quartile information helps us to detect [Outliers](https://machinelearningmastery.com/how-to-identify-outliers-in-your-data/).

* `max` tells us the maximum value of that feature.


In [None]:
dataset.describe()
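
To make the points above concrete, here is a minimal sketch on a small, made-up frame (the `toy` data is hypothetical and is not taken from the spine dataset). It shows that `count` skips `NaN` values and that a categorical column gets its own kind of summary when `describe()` is called on it directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column with a missing value
# and one categorical column.
toy = pd.DataFrame({
    "angle": [10.0, 20.0, np.nan, 40.0],
    "label": ["Abnormal", "Normal", "Abnormal", "Abnormal"],
})

# Default behaviour on a mixed frame: only the numeric column is summarized.
numeric_summary = toy.describe()

# Calling describe() on the categorical column gives count/unique/top/freq.
label_summary = toy["label"].describe()

print(numeric_summary)
print(label_summary)
```

In the numeric summary, `count` for `angle` is 3 rather than 4 because the `NaN` row is excluded.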

In [None]:
# Change the Column names
dataset.rename(columns = {
    "Col1" : "pelvic_incidence", 
    "Col2" : "pelvic_tilt",
    "Col3" : "lumbar_lordosis_angle",
    "Col4" : "sacral_slope", 
    "Col5" : "pelvic_radius",
    "Col6" : "degree_spondylolisthesis", 
    "Col7" : "pelvic_slope",
    "Col8" : "direct_tilt",
    "Col9" : "thoracic_slope", 
    "Col10" :"cervical_tilt", 
    "Col11" : "sacrum_angle",
    "Col12" : "scoliosis_slope", 
    "Class_att" : "class"}, inplace=True)

In [None]:
dataset.head()

In [None]:
dataset.shape

[DataFrame.info()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html) prints information about a DataFrame, including the `index` dtype and `column` dtypes, `non-null` counts and memory usage. We can use `info()` to check whether a dataset contains any missing values.

In [None]:
dataset.info()
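
As a complement to `info()`, missing values can also be counted explicitly. The sketch below uses a hypothetical two-column frame (the values are made up for illustration) and counts the `NaN` entries per column with `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a single missing value.
toy = pd.DataFrame({
    "pelvic_tilt": [12.5, np.nan, 30.1],
    "sacral_slope": [40.0, 35.2, 28.9],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# True if at least one column contains a missing value.
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```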

### Visualize the number of abnormal and normal cases 
There are about twice as many `abnormal` cases as `normal` cases.

In [None]:
dataset["class"].value_counts().sort_index().plot.bar()
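
Besides the bar chart, the class balance can be read off as numbers. This is a small sketch on a made-up label column (the six labels are hypothetical, chosen only so that the ratio is easy to follow). `value_counts(normalize=True)` returns shares instead of raw counts.

```python
import pandas as pd

# Hypothetical labels: four abnormal cases and two normal cases.
labels = pd.Series(["Abnormal"] * 4 + ["Normal"] * 2, name="class")

counts = labels.value_counts().sort_index()                # raw counts per class
shares = labels.value_counts(normalize=True).sort_index()  # fraction per class
ratio = counts["Abnormal"] / counts["Normal"]              # abnormal-to-normal ratio

print(counts)
print(shares)
print("ratio:", ratio)
```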

## Correlation between features
A [correlation coefficient](https://en.wikipedia.org/wiki/Correlation_coefficient) is a numerical measure of some type of correlation, meaning a statistical relationship between two variables.

In [None]:
plt.subplots(figsize=(12,8))
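# Correlation is only defined for numeric columns, so leave out the "class" label
sns.heatmap(dataset.select_dtypes("number").corr())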
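
`DataFrame.corr()` computes the Pearson correlation coefficient by default. The following sketch works it out by hand on two short, made-up arrays and compares the result with NumPy's `np.corrcoef`. The helper name `pearson_r` is my own and not part of any library.

```python
import numpy as np

# Hypothetical measurements, chosen only for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return numerator / denominator

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.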

## Custom correlogram
A [pair plot](https://seaborn.pydata.org/generated/seaborn.pairplot.html) allows us to see both the distribution of single variables and the relationships between two variables.

Lots of things are going on in the pair plot below. Let's try to understand the __pair plot__. In a __pair plot__, there are mainly two things that we need to understand. One is the __distribution of a feature__ and the other is the __relationship between one feature and all the others__. If we look at the diagonal, we can see the distribution of each feature. Let's consider the `first row X first column`: this diagonal cell shows us the distribution of `pelvic_incidence`. Similarly, if we look at the `second row X second column` diagonal cell, we can see the distribution of `pelvic_tilt`. All the cells except the diagonal ones show the relationship between one feature and another. Let's consider the `first row X second column`: here we can see the relationship between `pelvic_incidence` and `pelvic_tilt`.



In [None]:
sns.pairplot(dataset, hue="class")

## Histogram of Each Feature
A [Histogram](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html) is the most commonly used graph to show frequency distributions.

In [None]:
dataset.hist(figsize=(15,12), bins=20, color="#007959AA")
# suptitle puts the title on the whole figure; plt.title would only label the last subplot
plt.suptitle("Features Distribution")
plt.show()

## Detecting and Removing Outliers

In [None]:
plt.subplots(figsize=(15,6))
dataset.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)

### Detect and Remove Outliers by hand

In [None]:
# detecting outliers
# The Inter Quartile Range (IQR) is the distance between the third quartile and the first quartile

def detect_outlier(feature):
    first_q = np.percentile(feature, 25)
    third_q = np.percentile(feature, 75)
    IQR = third_q - first_q
    IQR *= 1.5
    minimum = first_q - IQR  # lower fence
    maximum = third_q + IQR  # upper fence

    # the feature contains an outlier if any value falls outside the fences
    return bool(minimum > np.min(feature) or maximum < np.max(feature))
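
To see the fences at work, here is a small worked example on a made-up array (the numbers are hypothetical and not from the dataset). It computes the quartiles, the IQR and the two fences, then lists the values that fall outside them.

```python
import numpy as np

# Hypothetical values: eight ordinary points and one extreme point.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

first_q, third_q = np.percentile(values, [25, 75])
iqr = third_q - first_q

lower_fence = first_q - 1.5 * iqr  # anything below this is an outlier
upper_fence = third_q + 1.5 * iqr  # anything above this is an outlier

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(first_q, third_q, iqr)     # 3.0 7.0 4.0
print(lower_fence, upper_fence)  # -3.0 13.0
print(outliers)                  # [100.]
```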

In [None]:
# We use the Tukey method to handle outliers.
# Whiskers are set at 1.5 times the Interquartile Range (IQR).

def remove_outlier(feature):
    first_q = np.percentile(X[feature], 25)
    third_q = np.percentile(X[feature], 75)
    IQR = third_q - first_q
    IQR *= 1.5

    minimum = first_q - IQR # the acceptable minimum value
    maximum = third_q + IQR # the acceptable maximum value

    median = X[feature].median()

    # Any value beyond the acceptance range is considered an outlier.
    # We replace the outliers with the median value of that feature.
    X.loc[X[feature] < minimum, feature] = median
    X.loc[X[feature] > maximum, feature] = median

# taking all the columns except the last one
# last column is the label
# .copy() so that the replacement does not write into a slice of `dataset`
X = dataset.iloc[:, :-1].copy()
for column in X.columns:
    remove_outlier(column)

In [None]:
# start again from the original features; .copy() keeps `dataset` itself unchanged
X = dataset.iloc[:, :-1].copy()

In [None]:
for column in X.columns:
    if detect_outlier(X[column]):
        print(column, "Contains Outlier")

In [None]:
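# Replacing outliers with the median changes the quartiles, so new values
# can fall outside the fences; we therefore repeat the pass three times.
for _ in range(3):
    for column in X.columns:
        remove_outlier(column)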
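
The same replacement idea can be written without relying on a global `X`. The sketch below is one possible variant, not the function used in this notebook: the helper name `replace_outliers_with_median` and the toy values are my own assumptions. It returns a new Series and leaves the input untouched.

```python
import pandas as pd

def replace_outliers_with_median(series):
    """Return a copy of `series` where values outside the Tukey fences
    are replaced by the median of the original series."""
    first_q = series.quantile(0.25)
    third_q = series.quantile(0.75)
    margin = 1.5 * (third_q - first_q)
    # True for values that lie within [lower fence, upper fence]
    inside = series.between(first_q - margin, third_q + margin)
    # keep the values inside the fences, use the median everywhere else
    return series.where(inside, series.median())

# Hypothetical values with one extreme point.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
cleaned = replace_outliers_with_median(toy)

print(cleaned.tolist())
```

Here the median of `toy` is 5.0, so the extreme value 100.0 becomes 5.0 while all other values stay as they are.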

### After removing Outliers

In [None]:
plt.subplots(figsize=(15,6))
X.boxplot(patch_artist=True, sym="k.")
plt.xticks(rotation=90)

## Feature Scaling
[Feature scaling](http://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html) through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. Our dataset contains features that vary widely in magnitude, unit and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, features with larger magnitudes would dominate the result. To avoid this effect, we need to bring all features to the same level of magnitude. This can be achieved by [feature scaling](https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e). Here we use `MinMaxScaler`, which rescales each feature to the range 0 to 1.

In [None]:
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(X)
scaled_df = pd.DataFrame(data = scaled_data, columns = X.columns)
scaled_df.head()
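
With its default settings, `MinMaxScaler` rescales each column as `(x - min) / (max - min)`. The sketch below applies that formula by hand with NumPy on a made-up feature matrix (the numbers are hypothetical), which may help when reading the scaled values above.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales.
features = np.array([
    [10.0, 200.0],
    [20.0, 400.0],
    [40.0, 1000.0],
])

col_min = features.min(axis=0)  # per-column minimum
col_max = features.max(axis=0)  # per-column maximum

# Min-max scaling: every column now runs from 0 to 1.
scaled = (features - col_min) / (col_max - col_min)

print(scaled)
```

After scaling, the smallest value in each column maps to 0 and the largest to 1, so both columns contribute on the same scale.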

## Label Encoding
Certain algorithms, like XGBoost, work only with numerical values, so we need to encode the categorical `class` column.
[LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) from the `sklearn.preprocessing` package encodes labels with values between 0 and n_classes-1.

In [None]:
label = dataset["class"]

In [None]:
encoder = LabelEncoder()
label = encoder.fit_transform(label)
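
To check which number stands for which class, we can look at `classes_` after fitting. This is a small sketch on made-up labels (the list `toy_labels` and the name `toy_encoder` are hypothetical). `LabelEncoder` sorts the classes, so the position of a class in `classes_` is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, spelled like typical class names.
toy_labels = ["Normal", "Abnormal", "Abnormal", "Normal"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

print(list(toy_encoder.classes_))  # ['Abnormal', 'Normal']
print(encoded.tolist())            # [1, 0, 0, 1]

# inverse_transform maps the codes back to the original labels.
decoded = toy_encoder.inverse_transform(encoded)
print(decoded.tolist())
```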

In [None]:
X = scaled_df
y = label 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

In [None]:
clf_gnb = GaussianNB()
pred_gnb = clf_gnb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, pred_gnb)  # scikit-learn expects (y_true, y_pred)

In [None]:
clf_svc = SVC(kernel="linear")
pred_svc = clf_svc.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, pred_svc)  # scikit-learn expects (y_true, y_pred)

In [None]:
clf_xgb =  XGBClassifier()
pred_xgb = clf_xgb.fit(X_train, y_train).predict(X_test)
accuracy_score(y_test, pred_xgb)  # scikit-learn expects (y_true, y_pred)

In [None]:
# true labels first, so that rows are the true classes and columns the predicted ones
confusion_matrix(y_test, pred_xgb)
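
Reading a confusion matrix is easier with a tiny example. In the sketch below the labels are made up (`y_true` and `y_pred` are hypothetical, not model output from this notebook). With scikit-learn's `confusion_matrix(y_true, y_pred)`, rows correspond to the true classes and columns to the predicted classes, so the diagonal holds the correct predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for five samples.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]   true class 0: two predicted as 0, one predicted as 1
#  [1 1]]  true class 1: one predicted as 0, one predicted as 1

# Accuracy is the share of samples on the diagonal.
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm, accuracy_score(y_true, y_pred))  # 0.6 0.6
```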

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
plot_importance(clf_xgb, ax=ax)

## Marginal plot
A [marginal plot](https://python-graph-gallery.com/82-marginal-plot-with-seaborn/) allows us to study the relationship between two numeric variables. The central chart displays their joint distribution, and the margins show the distribution of each variable.

Let's visualize the relationship between `degree_spondylolisthesis` and `class`.

In [None]:
sns.set(style="white", color_codes=True)
sns.jointplot(x=X["degree_spondylolisthesis"], y=label, kind='kde', color="skyblue")

__That's all. If you find this kernel useful, please give it an upvote. Cheers :)__