# Cervical Cancer Risk Factors

## Introduction

This is an example notebook in which we explore a dataset of cervical cancer risk factors from the [UCI (University of California, Irvine) Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). Note that the "solution" below is not the only way to approach this task! 

The first thing we'll want to do is import any libraries/packages that we'll need. Recall that we can give a library or package an alias, so we don't have to type the longer name every time. 

### Import libraries

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import numpy as np

Next we need to import our data. We can do this by reading in the CSV using the `pandas` library. 

In [None]:
raw_data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00383/risk_factors_cervical_cancer.csv")
raw_data

Looking through the data, there seem to be a lot of question marks! These likely indicate missing data. `pandas` will read any column that contains question marks as text rather than numbers, and that will definitely mess up our math later! Let's re-import the data, letting `pandas` know that the question marks should be coded as `NaN` (a special floating-point value meaning "not a number").

In [None]:
cervical_cancer_data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00383/risk_factors_cervical_cancer.csv", na_values="?")
cervical_cancer_data.head()
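
To see what `na_values` does in isolation, here is a minimal sketch on a tiny in-memory CSV. The rows are made up for illustration and are not taken from the dataset; `sample_csv`, `as_text`, and `as_numbers` are hypothetical names.

```python
import io
import pandas as pd

# Made-up sample in the same style as the real file: "?" marks a missing value.
sample_csv = io.StringIO(
    "Age,Smokes,Number of sexual partners\n"
    "18,0.0,4.0\n"
    "34,?,1.0\n"
    "52,1.0,?\n"
)

# Without na_values, a column containing "?" is read as text, not numbers.
as_text = pd.read_csv(sample_csv)
print(as_text.dtypes)

# Rewind the buffer and read again, this time treating "?" as missing.
sample_csv.seek(0)
as_numbers = pd.read_csv(sample_csv, na_values="?")
print(as_numbers.dtypes)
print(as_numbers.isna().sum())
```

Comparing the two `dtypes` printouts shows why the second import is the one we want to keep working with.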

Since there are more columns than can be displayed above, let's list them all so we know what data we have.

In [None]:
cervical_cancer_data.columns

### Data exploration

Next, let's dig into the data itself. First, we'll create a statistical summary table using the `.describe()` method.

In [None]:
cervical_cancer_data.describe()
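
The `count` row of the summary is worth a second look, because it only counts entries that are not missing. Here is a small sketch on a made-up table (the column names mirror the dataset, but the values and the name `toy` are invented) showing how that row relates to the number of missing entries per column.

```python
import numpy as np
import pandas as pd

# Made-up toy table; the values are for illustration only.
toy = pd.DataFrame({
    "Age": [18, 34, 52, 41],
    "Smokes (years)": [0.0, np.nan, 12.0, np.nan],
})

# The "count" row of describe() only counts non-missing entries...
summary = toy.describe()
print(summary.loc["count"])

# ...so the number of missing entries per column is the complement.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running `cervical_cancer_data.isna().sum()` in the same way would show which columns of the real dataset have the most gaps.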

If we look closely, we can see that there are a few columns that are likely True/False (or "boolean") variables (like `Smokes` for example) that are being treated like numerical variables. We should look in the data dictionary of the dataset (found here: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29). It's sparse, but we can find out which columns are supposed to be True/False and change those column types. Usually, when data are coded this way, 1 means "True" and 0 means "False". 

In [None]:
boolean_cols = ['Smokes', 'Hormonal Contraceptives', 'IUD', 'STDs', 
                'STDs:condylomatosis', 'STDs:cervical condylomatosis', 
                'STDs:vaginal condylomatosis', 'STDs:vulvo-perineal condylomatosis', 
                'STDs:syphilis','STDs:pelvic inflammatory disease', 'STDs:genital herpes',
                'STDs:molluscum contagiosum', 'STDs:AIDS', 'STDs:HIV', 'STDs:Hepatitis B', 
                'STDs:HPV', 'Dx:Cancer', 'Dx:CIN', 'Dx:HPV', 'Dx', 'Hinselmann', 'Schiller',
                'Citology', 'Biopsy']
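for x in boolean_cols:
    # Use the nullable "boolean" dtype: plain "bool" would turn every missing value (NaN) into True.
    cervical_cancer_data[x] = cervical_cancer_data[x].astype("boolean")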
cervical_cancer_data
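
One caveat when casting 0/1 columns that still contain missing values: the plain `bool` type has no way to represent "missing". The sketch below uses a made-up three-value column (the name `smokes` is just for illustration) to compare it with pandas' nullable `boolean` type.

```python
import numpy as np
import pandas as pd

# Made-up 0/1 column with one missing entry.
smokes = pd.Series([0.0, 1.0, np.nan])

# Plain "bool" cannot represent missing data, so NaN silently becomes True.
plain = smokes.astype("bool")
print(plain.tolist())

# The nullable "boolean" dtype keeps the gap as <NA> instead.
nullable = smokes.astype("boolean")
print(nullable.tolist())
```

With plain `bool`, every missing answer would be counted as a "True", which can badly inflate the number of positives in columns with many gaps.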

Now, let's look at a statistical summary of just the entries where `Dx:Cancer` is True. 

In [None]:
cancer_pos = cervical_cancer_data.loc[cervical_cancer_data["Dx:Cancer"]].copy()
cancer_pos.describe()

### Exploratory data visualization 

Now, let's do some data visualization to explore the data. We may be able to find patterns that aren't easily seen by looking at numbers alone. 

First, let's look at some histograms.

In [None]:
sns.displot(cervical_cancer_data['Age'], binwidth=5)
plt.title("Histogram of Age")
plt.show()

In [None]:
sns.displot(cervical_cancer_data['Smokes (packs/year)'], binwidth=1)
plt.title('Histogram of packs of cigarettes smoked per year')
plt.show()
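
To make the `binwidth` argument concrete, here is a rough sketch of fixed-width binning done by hand with NumPy. The ages are made up, and the edge placement (starting at the minimum) is an assumption for illustration; seaborn picks its own exact edges.

```python
import numpy as np

# Made-up ages, for illustration only.
ages = np.array([13, 18, 21, 22, 27, 29, 33, 34, 41, 52])

binwidth = 5
# Edges start at the minimum and advance in steps of binwidth
# until they cover the maximum.
edges = np.arange(ages.min(), ages.max() + binwidth, binwidth)
counts, edges = np.histogram(ages, bins=edges)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {n}")
```

A smaller `binwidth` gives more, narrower bars and shows finer detail, at the cost of a noisier picture.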

How about some scatterplots? 

In [None]:
sns.relplot(data=cervical_cancer_data, x='Number of sexual partners', y='STDs (number)')
plt.show()

In [None]:
sns.relplot(data=cervical_cancer_data, x='Age', y='STDs (number)')
plt.show()

Let's look at a boxplot next.

In [None]:
sns.boxplot(y=cervical_cancer_data['Number of sexual partners'].dropna())
plt.show()
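
The numbers behind a boxplot can also be computed directly. This is a minimal sketch on made-up values (the name `partners` and the data are invented), assuming the common convention that points more than 1.5 times the interquartile range away from the box are drawn as outliers.

```python
import numpy as np
import pandas as pd

# Made-up values with one missing entry and one unusually large value.
partners = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, np.nan, 28])

clean = partners.dropna()
q1, median, q3 = clean.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # height of the box

# Points beyond 1.5 * IQR from the box are conventionally shown as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = clean[(clean < lower_fence) | (clean > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(outliers.tolist())
```

The box spans Q1 to Q3, the line inside it is the median, and the isolated points on the plot correspond to `outliers`.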

### Heat maps and correlations

Now, let's take a look at the correlations between all of the variables. We can do this by creating a correlation table with the `.corr` method. Then, we can visualize those correlations in a heat map. 

In [None]:
corr = cervical_cancer_data.corr()
corr.head()
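
To see what a single entry of this table means, the sketch below recomputes one coefficient by hand on made-up data (the values and the name `toy` are invented). By default `.corr()` computes the Pearson coefficient and, for each pair of columns, uses only the rows where both values are present.

```python
import numpy as np
import pandas as pd

# Made-up toy table with one missing value.
toy = pd.DataFrame({
    "Age": [18, 25, 31, 40, 52],
    "Num of pregnancies": [1.0, 2.0, np.nan, 3.0, 4.0],
})

# Entry of the correlation table for this pair of columns.
r_pandas = toy.corr().loc["Age", "Num of pregnancies"]

# The same coefficient computed on the rows where both values are present.
complete = toy.dropna()
r_manual = np.corrcoef(complete["Age"], complete["Num of pregnancies"])[0, 1]

print(r_pandas, r_manual)
```

Because missing rows are dropped pair by pair, different entries of the table can be based on different numbers of observations.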

Let's make this table easier to look at by using a color gradient to shade the correlations: highly-correlated variables will be darker. 

In [None]:
corr.style.background_gradient()

Now we can make a heat map using the correlations that we've calculated. 

In [None]:
sns.heatmap(corr)
plt.show()

In the heat map above, which uses seaborn's default color palette, very light tiles indicate a strongly positive correlation, while very dark tiles mark the lowest values in the table, meaning little or no correlation (or a negative one). Notice anything interesting? Perhaps this can help you develop a hypothesis that you can test in a later module!
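
If picking out tiles by eye is difficult, one possible approach is to list the variable pairs ranked by the strength of their correlation. The sketch below runs on made-up random data standing in for the real table; `toy`, `upper`, and `pairs` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the real table.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # strongly related to "a"
    "c": rng.normal(size=200),                 # unrelated noise
})
corr = toy.corr()

# Keep each pair once by blanking out the diagonal and the lower triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to one row per pair and sort by absolute correlation.
pairs = upper.unstack().dropna().sort_values(key=abs, ascending=False)
print(pairs.head())
```

Sorting by absolute value puts strong negative correlations near the top as well, which are easy to miss in a heat map.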