# Library Imports

**NumPy** is a scientific computing and linear algebra library for Python, and many data science libraries use it as a 
building block. [NumPy](https://numpy.org/) is also fast, because its core array operations run in compiled C code rather than in the Python interpreter.  
  
  
**Pandas** is the primary [data analysis library](https://pandas.pydata.org/) in Python and uses dataframes (similar to tables) as the main workhorses for data storage and manipulation.  

**Matplotlib** and **Seaborn** are two visualization libraries in Python. [Matplotlib](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html) is the foundational Python plotting library and provides the core plotting functions. [Seaborn](https://seaborn.pydata.org/) is built on top of Matplotlib and is intended for generating high-level statistical plots for exploratory data analysis.

**Scikit-learn** (imported as `sklearn`) is a general-purpose machine learning [library](https://scikit-learn.org/stable/) that offers implementations of many common learning models and their tuning parameters, along with tools for model evaluation.


You can install these libraries by using:  
  
`pip install [library name here]`
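
To confirm that the libraries are installed before running the notebook, something like the following sketch can help. The helper name `installed_versions` is hypothetical, and the package names passed to it are the names assumed to be used on PyPI (note `scikit-learn`, not `sklearn`).

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed PyPI distribution names for the libraries described above
report = installed_versions(
    ["numpy", "pandas", "matplotlib", "seaborn", "scikit-learn"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed can then be added with the `pip install` command shown above.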

In [None]:
# standard imports for data analysis and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# reading in a csv file to a dataframe
df = pd.read_csv(r'diabetes_data_upload.csv')

# Data Cleaning & Exploration

In [None]:
df.head()

In [None]:
df.tail(10)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
sns.histplot(df['Age'], kde=True, bins=35)

In [None]:
sns.boxplot(x = 'class', y='Age', data=df)

In [None]:
df = df[df['Age'] <= 80]
# now plot it again
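
The filter above keeps only the rows whose `Age` is at most 80 by indexing the dataframe with a boolean mask. A minimal sketch on made-up ages (the `toy` frame and its values are illustrative, not taken from the dataset) shows what the mask does:

```python
import pandas as pd

# Hypothetical ages for illustration only; not from diabetes_data_upload.csv
toy = pd.DataFrame({"Age": [25, 40, 90, 16, 85, 80]})

# Comparing a column to a scalar yields a boolean Series, one entry per row
mask = toy["Age"] <= 80

# Indexing with the mask keeps only the rows where it is True
filtered = toy[mask]

print(mask.tolist())
print(filtered["Age"].tolist())
print(len(toy) - len(filtered), "rows dropped")
```

The surviving rows keep their original index labels; `filtered.reset_index(drop=True)` renumbers them if a contiguous index is needed.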

In [None]:
sns.displot(x='Age', data = df , hue='class', bins=35)

In [None]:
df = pd.get_dummies(data = df, columns=['Gender', 'Polyuria', 'Polydipsia', 'sudden weight loss',
       'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring',
       'Itching', 'Irritability', 'delayed healing', 'partial paresis',
       'muscle stiffness', 'Alopecia', 'Obesity', 'class'], drop_first=True)
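
To see what `pd.get_dummies` with `drop_first=True` produces, here is a small sketch on a made-up three-row frame (the `toy` data is an assumption for illustration and only mimics two of the dataset's columns):

```python
import pandas as pd

# Hypothetical three-row frame for illustration; values are made up
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "class": ["Positive", "Negative", "Positive"],
})

# Each listed column is replaced by indicator columns; drop_first=True
# drops the first category of each column in sorted order
encoded = pd.get_dummies(data=toy, columns=["Gender", "class"], drop_first=True)

print(encoded.columns.tolist())
```

Because the first category in sorted order (`Female`, `Negative`) is dropped, each two-valued column becomes a single indicator column. This is why the target column is later referred to as `class_Positive`.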

In [None]:
df.head()

In [None]:
plt.figure(figsize=(16,12))
sns.heatmap(df.corr(), cmap='coolwarm', annot=True)
plt.tight_layout()
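
A large heatmap can be hard to read, so one option is to pull out only the correlations with the target column and sort them. The sketch below uses a made-up frame; the column names `Polyuria_Yes` and `Itching_Yes` are assumed to follow the naming that `get_dummies` would produce for Yes/No columns.

```python
import pandas as pd

# Hypothetical indicator columns for illustration; values are made up
toy = pd.DataFrame({
    "Polyuria_Yes":   [1, 1, 0, 0, 1, 0],
    "Itching_Yes":    [1, 0, 1, 0, 1, 0],
    "class_Positive": [1, 1, 0, 0, 1, 0],
})

# Take the target's column of the correlation matrix, drop its
# self-correlation, and sort from strongest positive to weakest
target_corr = (
    toy.corr()["class_Positive"]
    .drop("class_Positive")
    .sort_values(ascending=False)
)

print(target_corr)
```

On the real dataframe the same expression would be `df.corr()['class_Positive']`, assuming the encoded target column has that name.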

# Training an ML Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
X = df.drop('class_Positive', axis=1)
y = df['class_Positive']

In [None]:
X.columns

In [None]:
df.columns

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
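
With `test_size=0.25`, one quarter of the rows is held out for testing, and `random_state` makes the shuffle reproducible. The sketch below illustrates this on synthetic stand-in data. It also passes `stratify`, which is not used in this notebook and is shown only as an optional assumption for keeping the class ratio similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 40 positive labels
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo is an optional extra, not part of the original call
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # (75, 3) (25, 3)
print(y_tr.mean(), y_te.mean())  # share of positives in each split
```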

In [None]:
reg = LogisticRegression(max_iter=150)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

In [None]:
print(pred)

# Testing

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
print(classification_report(y_test, pred))
print()
print(confusion_matrix(y_test, pred))
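
The precision and recall values in the classification report can be derived directly from the confusion matrix. The following sketch uses made-up labels (`y_true` and `y_hat` are illustrative, not the notebook's predictions) to show the relationship:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes and columns are predicted classes, so for
# binary labels ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(tn, fp, fn, tp)  # 3 1 2 4
print(precision, precision_score(y_true, y_hat))
print(recall, recall_score(y_true, y_hat))
```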

# Decision Tree & Random Forest

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
pred1 = dtc.predict(X_test)

rfc = RandomForestClassifier(n_estimators=128)
rfc.fit(X_train, y_train)
pred2 = rfc.predict(X_test)

In [None]:
# metrics for decision tree
print(classification_report(y_test, pred1))
print()
print(confusion_matrix(y_test, pred1))

In [None]:
# metrics for random forest classifier
print(classification_report(y_test, pred2))
print()
print(confusion_matrix(y_test, pred2))

# K-Nearest Neighbors

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

In [None]:
sc = StandardScaler()
x_scaled = sc.fit_transform(X_train)

In [None]:
# heuristic: k is roughly the square root of the number of training samples
knn = KNeighborsClassifier(n_neighbors=int(np.sqrt(len(y_train))))
knn.fit(x_scaled, y_train)
pred = knn.predict(sc.transform(X_test))
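
The cells above fit the scaler on the training data only and then reuse it to transform the test data, which avoids leaking test-set statistics into the model. One way to bundle those two steps is a scikit-learn pipeline. This is a sketch on synthetic data from `make_classification`, not the notebook's dataset, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 rows, 5 features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

# Square-root heuristic for k, based on the 150 training rows
k = int(np.sqrt(len(y_tr)))

# fit() scales with training statistics only; predict() reuses them
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
knn_pipe.fit(X_tr, y_tr)
demo_pred = knn_pipe.predict(X_te)

print(k, demo_pred.shape)  # 12 (50,)
```

The pipeline behaves like a single estimator, so it can be passed to the same `classification_report` and `confusion_matrix` calls used elsewhere in the notebook.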

In [None]:
print(classification_report(y_test, pred))
print()
print(confusion_matrix(y_test, pred))