In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# II. Data Analysis
Once we have our data, one of the first and most important steps is to understand what it looks like: how each feature is distributed and whether any values look wrong. This is especially important if you're collecting your own data and have to determine which data is useful.

In [None]:
df = pd.read_csv('diabetes.csv')

In [None]:
df.head()

Calling `df.describe()` shows a summary of information about each column in our table. We can then visualize each column using a histogram and boxplot.

## Questions to discuss
- Do the distributions of the features look [normal](https://en.wikipedia.org/wiki/Normal_distribution)? If not, how are they skewed?
- Do you think this could present a problem for the machine learning techniques?
- Look at the maximum and minimum values for each feature. Find those values on the box plots and histograms. Do they look right to you? Do you see anything we should be concerned about? If so, what would you do to fix those issues?

In [None]:
df.describe()

In [None]:
_ = df.hist(figsize=(10,8), grid=False)
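To put a number on the skew you see in the histograms, pandas offers `DataFrame.skew()`. The sketch below uses a small made-up table (the column names and values are hypothetical, not taken from `diabetes.csv`); a value near 0 suggests a roughly symmetric column, while a positive value suggests a long right tail.

```python
import pandas as pd

# Made-up columns for illustration only; they are not from diabetes.csv.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_skewed": [1, 1, 1, 2, 2, 3, 10],
})

# skew() returns one sample-skewness value per numeric column.
skewness = toy.skew()
print(skewness)
```

On the real data, the equivalent call would be `df.skew()`.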

In [None]:
# How many people had a skin thickness of 0?
print(df.SkinThickness.min())
print(df.SkinThickness.value_counts()[0])
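The zero check can also be run across every column at once. This is a minimal sketch on a made-up table (the values are hypothetical); comparing the whole table to 0 and summing the resulting booleans counts the zeros in each column.

```python
import pandas as pd

# Toy stand-in for the diabetes table; these values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "SkinThickness": [35, 0, 0, 23],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# (toy == 0) is a boolean table; sum() counts the True entries per column.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

On the real data, `(df == 0).sum()` would do the same thing. Whether a zero is a plausible measurement or a stand-in for a missing value is a judgment call for each feature.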

In [None]:
# How many people had 17 pregnancies?
print(df.Pregnancies.max())
print(df.Pregnancies.value_counts()[17])

In [None]:
# Note: the sort_columns argument was removed in pandas 2.0, so it is omitted here.
_ = df.plot(kind='box', subplots=True, layout=(3, 3), sharex=False, sharey=False, figsize=(10, 8))
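The points drawn beyond the whiskers of a box plot are, by matplotlib's default, values more than 1.5 times the interquartile range (IQR) away from the box. The sketch below applies that rule by hand to a made-up series (the values are hypothetical) so you can list the flagged values instead of reading them off the plot.

```python
import pandas as pd

# Made-up values for illustration; 17 is deliberately far from the rest.
values = pd.Series([0, 1, 2, 3, 4, 17])

q1 = values.quantile(0.25)   # lower edge of the box
q3 = values.quantile(0.75)   # upper edge of the box
iqr = q3 - q1

# Anything outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] is drawn as an individual point.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

A flagged value is not necessarily an error; it only marks a point worth checking against what is clinically plausible.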

# Which features are the most useful?
Our machine learning classifier is going to use these features to predict whether or not a patient has diabetes. As the designers of the classifier, we need to have a general idea of how the classifier works. One way to do this is to look at the **usefulness** of each feature and decide whether this makes sense from a clinical perspective.

We'll use a [Chi-Squared Test](https://en.wikipedia.org/wiki/Chi-squared_test) to see which features are most strongly associated with one class or another (positive or negative).

First, we'll separate our dataset into two separate parts:
- **X** - these are the features that we'll use to look for patterns. These are also called *independent variables*.

- **y** - these are the labels that tell us whether or not a patient has diabetes. This is also called the *dependent variable*

In [None]:
def prepare_dataset(df):
    """
    Separates the dataset into X and y
    """
    X = df.loc[:, df.columns != 'Outcome']
    y = df.Outcome
    return X, y

In [None]:
X, y = prepare_dataset(df)

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Run the test on our data
chi2_scores, p_values = chi2(X, y)

In [None]:
feature_names = X.columns

In [None]:
for (feat_name, score) in zip(feature_names, chi2_scores):
    print(feat_name, score)
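To see where these scores come from, here is a sketch that recomputes them by hand on a tiny made-up dataset (`X_toy` and `y_toy` are hypothetical). For each feature, scikit-learn's `chi2` sums the feature's values within each class, compares those sums with what would be expected if the feature were spread evenly according to class frequency, and adds up the squared differences divided by the expected values. It expects non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up data: 4 samples, 2 non-negative features, binary labels.
X_toy = np.array([[1.0, 5.0], [2.0, 5.0], [9.0, 5.0], [8.0, 5.0]])
y_toy = np.array([0, 0, 1, 1])

# One-hot encode the labels: one column per class.
Y = np.column_stack([y_toy == 0, y_toy == 1]).astype(float)

observed = Y.T @ X_toy                 # sum of each feature within each class
class_prob = Y.mean(axis=0)            # fraction of samples in each class
feature_total = X_toy.sum(axis=0)      # sum of each feature over all samples
expected = np.outer(class_prob, feature_total)  # sums expected if class made no difference

manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)
sklearn_scores, _ = chi2(X_toy, y_toy)
print(manual_scores, sklearn_scores)
```

The first toy feature is much larger in one class, so it gets a high score; the second is identical for every sample, so its score is 0.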

Now we'll sort our features by their chi2 scores and then visualize them to compare how useful each feature is. A higher chi2 score means that a feature is more strongly associated with the outcome, which suggests it is more predictive.

In [None]:
sorted_feature_name_scores = sorted(zip(feature_names, chi2_scores), key=lambda x:x[1], reverse=True)
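An alternative way to get the same ranking is to put the scores in a pandas `Series` indexed by feature name and sort it. This is only a sketch; the names and scores below are made up and merely stand in for `feature_names` and `chi2_scores`.

```python
import pandas as pd

# Hypothetical names and scores, standing in for feature_names and chi2_scores.
toy_names = ["Glucose", "BMI", "Age"]
toy_scores = [1400.0, 130.0, 180.0]

# Index the scores by feature name, then sort from highest to lowest.
ranked = pd.Series(toy_scores, index=toy_names).sort_values(ascending=False)
print(ranked)
```

A `Series` like this can also be plotted directly with `ranked.plot(kind='bar')`.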

In [None]:
fig, ax = plt.subplots()
feat_names, feat_scores = zip(*sorted_feature_name_scores)
n = len(feat_names)
x_plot = range(n)
ax.bar(x_plot, feat_scores)

ax.set_xticks(x_plot)
_ = ax.set_xticklabels(feat_names, rotation=45)  # rotation must be a number, not the string '45'

## Questions to discuss
- What is this chart showing us?

- Does this make sense clinically?

- For those that are not predictive, why do you think that is?

# Up Next
Now that we understand what our dataset looks like and have a better understanding of our task, we can clean up our data, train the machine learning algorithms, and evaluate their performance.

[III. Machine Learning Classification](III_MachineLearningClassification.ipynb)