<a href="https://colab.research.google.com/github/YBilodeau/Metabolic-Syndrome-Prediction-Project/blob/main/Project_2_Part_1_(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project 2 - Part 1 (Core)**
- Yvon Bilodeau
- March 2022

Your second project is going to have a lot more freedom than your first project. This is because we want you to have a project in your portfolio that interests you or relates to the industry you would like to work in.

Your task for this week is to propose two possible datasets you would like to work with for Project 2.  

You will choose your first choice data set, and a backup data set in case the first proposed data set is not approved.  

This data can be from any source and can be on any topic with these limitations:

- the data must be available for use (it is your responsibility to ensure that the license states that you are able to use it)
- the data must be appropriate for a professional environment
- the data must NOT contain personal information
- the data must NOT be a dataset used for any assignment, lecture, or task from the course

Make sure you select a dataset that will be reasonable to work with in the amount of time we have left. Think about what questions you could reasonably answer with the dataset you select. 

You must propose two datasets that each have a supervised learning component. You may choose a regression or classification problem for each proposed data set.  

For this task:

Create a Colab notebook where you have uploaded and shown the .head() of each of your data sets.  For each of the proposed datasets, answer the following questions:

1) Source of data

2) Brief description of data

3) What is the target?

4) Is this a classification or regression problem?

5) How many features?

6) How many rows of data?

7) What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?

### **Mount Google Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Imports
import pandas as pd

## **First choice: dataset 1**
- **Metabolic Syndrome Prediction**

### **1) Source of data**

- [Data World](https://data.world/informatics-edu/metabolic-syndrome-prediction)


### **2) Brief description of data**


The goal is to predict whether a patient has metabolic syndrome (yes or no) based on common risk factors.

The dataset for analysis came from the [NHANES](https://www.cdc.gov/nchs/nhanes/index.htm) initiative, where the following variables were combined from multiple tables with SQL: abnormal waist circumference, triglycerides above 150, HDL cholesterol below 50 in women or 40 in men, history of hypertension, and mildly elevated fasting blood sugar (100-125). Numerous other variables, such as uric acid, race, and income, were added because they might contribute to the model, but we will not be sure until we test it.
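
The thresholds named above can be written as boolean flags. The following is a minimal sketch, not the dataset authors' method: it uses three rows copied from the `.head()` output later in this notebook, and the flag names (`high_trig`, `low_hdl`, `elevated_glucose`) are hypothetical. Waist circumference and hypertension are left out because the description gives no cut-off for the former and the file has no blood pressure column.

```python
import pandas as pd

# Three rows (index 0, 1, 3) copied from the .head() output of the file.
sample = pd.DataFrame({
    "Sex": ["Male", "Female", "Female"],
    "BloodGlucose": [92, 82, 104],
    "HDL": [41, 28, 73],
    "Triglycerides": [84, 56, 141],
})

# Hypothetical flag names; thresholds are the ones quoted in the description.
sample["high_trig"] = sample["Triglycerides"] > 150

# HDL cut-off depends on sex: below 50 for women, below 40 for men.
hdl_cutoff = sample["Sex"].map({"Female": 50, "Male": 40})
sample["low_hdl"] = sample["HDL"] < hdl_cutoff

# "Mildly elevated" fasting blood sugar: 100-125, both ends included.
sample["elevated_glucose"] = sample["BloodGlucose"].between(100, 125)

print(sample)
```

Flags like these are only an exploratory aid; the `MetabolicSyndrome` column remains the target.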

### **3) What is the target?**

- 'MetabolicSyndrome':
> - MetSyn
> - No MetSyn

### **4) Is this a classification or regression problem?**

- Classification
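
A binary target like this is often mapped to 0/1 before modeling. A minimal sketch, assuming the two label strings listed under the target; the three sample values and the name `encoded` are illustrative, and using 1 for `MetSyn` is an assumption rather than something the notebook specifies:

```python
import pandas as pd

# Illustrative labels only; the real values come from the 'MetabolicSyndrome' column.
labels = pd.Series(["No MetSyn", "MetSyn", "No MetSyn"], name="MetabolicSyndrome")

# Assumed encoding: 1 = MetSyn (positive class), 0 = No MetSyn.
encoded = labels.map({"No MetSyn": 0, "MetSyn": 1})

print(encoded.tolist())
```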

### **5) How many features?**

- 14 Features

### **6) How many rows of data?**

- 2401

### **7) What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?**

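- Four columns contain missing values that will need to be imputed or dropped: `Marital` (208 missing), `Income` (117), `WaistCirc` (85), and `BMI` (26), out of 2401 rows.
- No other challenges noted.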

### **Load the Dataset**

- Data downloaded from [link](https://data.world/informatics-edu/metabolic-syndrome-prediction/file/Metabolic%20%20Syndrome.csv).

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/CodingDojo/000 Data Files/Metabolic  Syndrome.csv"
df1 = pd.read_csv(filename)

### **Inspect the Data**

#### Display Rows and Column Count

In [None]:
# The .shape attribute is a (rows, columns) tuple giving the dimensions of the DataFrame.
n_rows, n_cols = df1.shape
print(f'There are {n_rows} rows, and {n_cols} columns.')
print(f'The rows represent {n_rows} observations, and the columns represent {n_cols - 1} features and 1 target variable.')

There are 2401 rows, and 15 columns.
The rows represent 2401 observations, and the columns represent 14 features and 1 target variable.
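
The feature count printed above treats every column except the target as a feature. A minimal sketch of that split, assuming the target is the `MetabolicSyndrome` column; the small frame is a stand-in for `df1` built from two rows of the `.head()` output, and the names `X` and `y` are conventions rather than names used in this notebook:

```python
import pandas as pd

# Two rows and a subset of columns copied from the .head() output; stand-in for df1.
df = pd.DataFrame({
    "seqn": [62161, 62164],
    "Age": [22, 44],
    "HDL": [41, 28],
    "MetabolicSyndrome": ["No MetSyn", "No MetSyn"],
})

target = "MetabolicSyndrome"
X = df.drop(columns=[target])  # feature matrix: every column except the target
y = df[target]                 # target vector

print(f"{X.shape[1]} features, target: {y.name}")
```

`seqn` is the NHANES respondent sequence number. If it is dropped as an identifier, the full dataset would have 13 features rather than 14.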


#### Display Data Types

In [None]:
df1.dtypes

seqn                   int64
Age                    int64
Sex                   object
Marital               object
Income               float64
Race                  object
WaistCirc            float64
BMI                  float64
Albuminuria            int64
UrAlbCr              float64
UricAcid             float64
BloodGlucose           int64
HDL                    int64
Triglycerides          int64
MetabolicSyndrome     object
dtype: object
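
The `object` columns above hold strings and would need encoding before most models can use them. A minimal sketch of listing them programmatically, using two rows from the `.head()` output as a stand-in for `df1` (the name `categorical_cols` is illustrative):

```python
import pandas as pd

# Two rows and a subset of columns copied from the .head() output; stand-in for df1.
df = pd.DataFrame({
    "Age": [22, 44],
    "Sex": ["Male", "Female"],
    "Marital": ["Single", "Married"],
    "Race": ["White", "White"],
    "BMI": [23.3, 23.2],
    "MetabolicSyndrome": ["No MetSyn", "No MetSyn"],
})

# Everything that is not numeric is treated as categorical here.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
```

The list includes the target column, which would be handled separately from the features.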

#### Display Column Names, Count of Non-Null Values, and Data Types

In [None]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2401 entries, 0 to 2400
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   seqn               2401 non-null   int64  
 1   Age                2401 non-null   int64  
 2   Sex                2401 non-null   object 
 3   Marital            2193 non-null   object 
 4   Income             2284 non-null   float64
 5   Race               2401 non-null   object 
 6   WaistCirc          2316 non-null   float64
 7   BMI                2375 non-null   float64
 8   Albuminuria        2401 non-null   int64  
 9   UrAlbCr            2401 non-null   float64
 10  UricAcid           2401 non-null   float64
 11  BloodGlucose       2401 non-null   int64  
 12  HDL                2401 non-null   int64  
 13  Triglycerides      2401 non-null   int64  
 14  MetabolicSyndrome  2401 non-null   object 
dtypes: float64(5), int64(6), object(4)
memory usage: 281.5+ KB
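
The non-null counts above fall short of 2401 for `Marital`, `Income`, `WaistCirc`, and `BMI`, so those columns contain missing values. A minimal sketch of counting them directly, using rows 4, 8, and 9 of the `.head()` output as a stand-in for `df1` (the names `df` and `missing` are illustrative):

```python
import pandas as pd

# Rows 4, 8 and 9 of the .head() output, which contain blanks; stand-in for df1.
# None marks a missing value; pandas stores it as NaN in the numeric columns.
df = pd.DataFrame({
    "Marital": ["Married", "Divorced", None],
    "Income": [None, 1000.0, 2500.0],
    "WaistCirc": [81.1, None, 99.0],
    "BMI": [20.1, None, 28.2],
})

# Count missing values per column and keep only columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])
```

Run against the full `df1`, the same call should report 208, 117, 85, and 26 missing values for these columns, which is 2401 minus each non-null count.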


#### Display First (15) Rows

In [None]:
df1.head(15)

| | seqn | Age | Sex | Marital | Income | Race | WaistCirc | BMI | Albuminuria | UrAlbCr | UricAcid | BloodGlucose | HDL | Triglycerides | MetabolicSyndrome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62161 | 22 | Male | Single | 8200.0 | White | 81.0 | 23.3 | 0 | 3.88 | 4.9 | 92 | 41 | 84 | No MetSyn |
| 1 | 62164 | 44 | Female | Married | 4500.0 | White | 80.1 | 23.2 | 0 | 8.55 | 4.5 | 82 | 28 | 56 | No MetSyn |
| 2 | 62169 | 21 | Male | Single | 800.0 | Asian | 69.6 | 20.1 | 0 | 5.07 | 5.4 | 107 | 43 | 78 | No MetSyn |
| 3 | 62172 | 43 | Female | Single | 2000.0 | Black | 120.4 | 33.3 | 0 | 5.22 | 5.0 | 104 | 73 | 141 | No MetSyn |
| 4 | 62177 | 51 | Male | Married | NaN | Asian | 81.1 | 20.1 | 0 | 8.13 | 5.0 | 95 | 43 | 126 | No MetSyn |
| 5 | 62178 | 80 | Male | Widowed | 300.0 | White | 112.5 | 28.5 | 0 | 9.79 | 4.8 | 105 | 47 | 100 | No MetSyn |
| 6 | 62184 | 26 | Male | Single | 9000.0 | Black | 78.6 | 22.1 | 0 | 9.21 | 5.4 | 87 | 61 | 40 | No MetSyn |
| 7 | 62189 | 30 | Female | Married | 6200.0 | Asian | 80.2 | 22.4 | 0 | 8.78 | 6.7 | 83 | 48 | 91 | No MetSyn |
| 8 | 62191 | 70 | Male | Divorced | 1000.0 | Black | NaN | NaN | 1 | 45.67 | 5.4 | 96 | 35 | 75 | No MetSyn |
| 9 | 62195 | 35 | Male | NaN | 2500.0 | Black | 99.0 | 28.2 | 0 | 2.21 | 6.7 | 94 | 46 | 86 | No MetSyn |


- Data appears to have loaded correctly.

## **Second choice: dataset 2**
- **Heart Disease Prediction**

### **1) Source of data**

- [Data World](https://data.world/informatics-edu/heart-disease-prediction)

### **2) Brief description of data**


This data set came from the University of California Irvine (UCI) data repository and is used to predict heart disease.

Patients were classified as having or not having heart disease based on cardiac catheterization, the gold standard. Patients with more than 50% narrowing of a coronary artery were labeled as having heart disease.

In this cohort, there are 270 patients and 13 independent predictive variables (column attributes). The attributes are explained on the website: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

After this dataset became available, the UCI data repository made another cohort available with 303 patients and shared it with Kaggle, a data competition platform. That newer cohort has two drawbacks: its file format is .data, which is uncommon, and its outcome was reversed by accident. This is why we are still using the older cohort of patients.

### **3) What is the target?**

- Heart Disease 
> - Absence
> - Presence

### **4) Is this a classification or regression problem?**

- Classification

### **5) How many features?**

- 13

### **6) How many rows of data?**

- 270

### **7) What, if any, challenges do you foresee in cleaning, exploring, or modeling with this dataset?**

- No challenges noted.

### **Load the Dataset**

- Data downloaded from [link](https://data.world/informatics-edu/heart-disease-prediction/file/%20Heart_Disease_Prediction.csv).

In [None]:
filename = "/content/drive/MyDrive/Colab Notebooks/CodingDojo/000 Data Files/Heart_Disease_Prediction.csv"
df2 = pd.read_csv(filename)

### **Inspect the Data**

#### Display Rows and Column Count

In [None]:
# The .shape attribute is a (rows, columns) tuple giving the dimensions of the DataFrame.
n_rows, n_cols = df2.shape
print(f'There are {n_rows} rows, and {n_cols} columns.')
print(f'The rows represent {n_rows} observations, and the columns represent {n_cols - 1} features and 1 target variable.')

There are 270 rows, and 14 columns.
The rows represent 270 observations, and the columns represent 13 features and 1 target variable.


#### Display Data Types

In [None]:
df2.dtypes

Age                          int64
Sex                          int64
Chest pain type              int64
BP                           int64
Cholesterol                  int64
FBS over 120                 int64
EKG results                  int64
Max HR                       int64
Exercise angina              int64
ST depression              float64
Slope of ST                  int64
Number of vessels fluro      int64
Thallium                     int64
Heart Disease               object
dtype: object

#### Display Column Names, Count of Non-Null Values, and Data Types

In [None]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Age                      270 non-null    int64  
 1   Sex                      270 non-null    int64  
 2   Chest pain type          270 non-null    int64  
 3   BP                       270 non-null    int64  
 4   Cholesterol              270 non-null    int64  
 5   FBS over 120             270 non-null    int64  
 6   EKG results              270 non-null    int64  
 7   Max HR                   270 non-null    int64  
 8   Exercise angina          270 non-null    int64  
 9   ST depression            270 non-null    float64
 10  Slope of ST              270 non-null    int64  
 11  Number of vessels fluro  270 non-null    int64  
 12  Thallium                 270 non-null    int64  
 13  Heart Disease            270 non-null    object 
dtypes: float64(1), int64(12), object(1)

#### Display First (15) Rows

In [None]:
df2.head(15)

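| | Age | Sex | Chest pain type | BP | Cholesterol | FBS over 120 | EKG results | Max HR | Exercise angina | ST depression | Slope of ST | Number of vessels fluro | Thallium | Heart Disease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 1 | 4 | 130 | 322 | 0 | 2 | 109 | 0 | 2.4 | 2 | 3 | 3 | Presence |
| 1 | 67 | 0 | 3 | 115 | 564 | 0 | 2 | 160 | 0 | 1.6 | 2 | 0 | 7 | Absence |
| 2 | 57 | 1 | 2 | 124 | 261 | 0 | 0 | 141 | 0 | 0.3 | 1 | 0 | 7 | Presence |
| 3 | 64 | 1 | 4 | 128 | 263 | 0 | 0 | 105 | 1 | 0.2 | 2 | 1 | 7 | Absence |
| 4 | 74 | 0 | 2 | 120 | 269 | 0 | 2 | 121 | 1 | 0.2 | 1 | 1 | 3 | Absence |
| 5 | 65 | 1 | 4 | 120 | 177 | 0 | 0 | 140 | 0 | 0.4 | 1 | 0 | 7 | Absence |
| 6 | 56 | 1 | 3 | 130 | 256 | 1 | 2 | 142 | 1 | 0.6 | 2 | 1 | 6 | Presence |
| 7 | 59 | 1 | 4 | 110 | 239 | 0 | 2 | 142 | 1 | 1.2 | 2 | 1 | 7 | Presence |
| 8 | 60 | 1 | 4 | 140 | 293 | 0 | 2 | 170 | 0 | 1.2 | 2 | 2 | 7 | Presence |
| 9 | 63 | 0 | 4 | 150 | 407 | 0 | 2 | 154 | 0 | 4.0 | 2 | 3 | 7 | Presence |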


- Data appears to have loaded correctly.
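
A natural follow-up check for a classification target is its class balance. A minimal sketch using only the `Heart Disease` values of the ten rows displayed above; the names `heart`, `counts`, and `shares` are illustrative, and the proportions in the full 270-row file may differ:

```python
import pandas as pd

# 'Heart Disease' values of the ten displayed rows (index 0 to 9).
heart = pd.Series(
    ["Presence", "Absence", "Presence", "Absence", "Absence",
     "Absence", "Presence", "Presence", "Presence", "Presence"],
    name="Heart Disease",
)

counts = heart.value_counts()                # absolute counts per class
shares = heart.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

On the real data, the same two calls would be made on `df2['Heart Disease']`.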