Lambda School Data Science

*Unit 2, Sprint 3, Module 1*

---


# Define ML problems

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Is your problem regression or classification?
- [ ] How is your target distributed?
    - Classification: How many classes? Are the classes imbalanced?
    - Regression: Is the target right-skewed? If so, you may want to log transform the target.
- [ ] Choose your evaluation metric(s).
    - Classification: Is your majority class frequency >= 50% and < 70%? If so, you can just use accuracy if you want. Outside that range, accuracy could be misleading. What evaluation metric will you choose, in addition to or instead of accuracy?
    - Regression: Will you use mean absolute error, root mean squared error, R^2, or other regression metrics?
- [ ] Choose which observations you will use to train, validate, and test your model.
    - Are some observations outliers? Will you exclude them?
    - Will you do a random split or a time-based split?
- [ ] Begin to clean and explore your data.
- [ ] Begin to choose which features, if any, to exclude. Would some features "leak" future information?
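
The majority-class rule of thumb from the checklist can be sketched as a small helper. The function name `accuracy_is_reasonable` and the toy targets are made up for illustration; only the 50% and 70% thresholds come from the checklist.

```python
import pandas as pd


def accuracy_is_reasonable(target: pd.Series, low: float = 0.50, high: float = 0.70) -> bool:
    """Return True when the majority class share falls in [low, high)."""
    majority_share = target.value_counts(normalize=True).max()
    return bool(low <= majority_share < high)


# Hypothetical toy targets, not taken from any real dataset.
balanced = pd.Series([0, 1, 0, 1, 0, 1, 0, 0, 1, 0])    # majority share 0.6
imbalanced = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])  # majority share 0.9

print(accuracy_is_reasonable(balanced))
print(accuracy_is_reasonable(imbalanced))
```

If the helper returns `False`, that is a hint to look at metrics beyond accuracy, not a hard rule.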

If you haven't found a dataset yet, do that today. [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2) and choose your dataset.

Some students worry, ***what if my model isn't “good”?*** Then, [produce a detailed tribute to your wrongness. That is science!](https://twitter.com/nathanwpyle/status/1176860147223867393)

In [2]:
import pandas as pd

In [14]:
df = pd.read_csv('https://raw.githubusercontent.com/calebmckay1/Unit_2_Build_Week/master/framingham.csv')

In [5]:
target = 'TenYearCHD'

In [6]:
### Is your problem classification or regression? ###


# Classification: the target is a yes/no label, not a count or an amount.

In [7]:
### How is your target distributed? ###

# The target is binary: either yes, the patient is at risk of developing
# coronary heart disease in the next 10 years, or no, currently not.

In [8]:
df[target].value_counts(normalize=True)

0    0.848042
1    0.151958
Name: TenYearCHD, dtype: float64

In [9]:
# It looks like about 85% of people don't have TenYearCHD, so if we were to just guess,
# we'd guess that the patient isn't at risk and be right about 85% of the time.


### Accuracy could be misleading here, so I might have to use a different metric. ###
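
To see why accuracy alone can mislead at this level of imbalance, here is a minimal sketch that scores an "always predict 0" baseline. The labels are synthetic, built only to mimic the roughly 85/15 split shown above. Recall and ROC AUC are two possible alternative metrics, not a choice the notebook has made yet.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Synthetic labels with roughly the same 85/15 imbalance as TenYearCHD.
y_true = np.array([0] * 85 + [1] * 15)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)  # constant predicted probability

baseline_accuracy = accuracy_score(y_true, y_pred)  # looks good: 0.85
baseline_recall = recall_score(y_true, y_pred)      # finds no at-risk patients: 0.0
baseline_auc = roc_auc_score(y_true, y_score)       # no ranking ability: 0.5

print(baseline_accuracy, baseline_recall, baseline_auc)
```

A real model would need to beat these baseline numbers, especially on recall or ROC AUC, to be worth anything.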

In [10]:
from sklearn.model_selection import train_test_split

In [15]:
df.head()

Unnamed: 0,male,age,education,currentSmoker,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,TenYearCHD
0,1,39,4.0,0,0.0,0.0,0,0,0,195.0,106.0,70.0,26.97,80.0,77.0,0
1,0,46,2.0,0,0.0,0.0,0,0,0,250.0,121.0,81.0,28.73,95.0,76.0,0
2,1,48,1.0,1,20.0,0.0,0,0,0,245.0,127.5,80.0,25.34,75.0,70.0,0
3,0,61,3.0,1,30.0,0.0,0,1,0,225.0,150.0,95.0,28.58,65.0,103.0,1
4,0,46,3.0,1,23.0,0.0,0,0,0,285.0,130.0,84.0,23.1,85.0,85.0,0


In [4]:
df.isnull().sum()

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64
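
Raw counts are easier to judge as a share of rows. For example, 388 missing `glucose` values out of 4238 rows is roughly 9%. A minimal sketch on a made-up toy frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    'glucose': [77.0, np.nan, 70.0, np.nan],
    'age': [39, 46, 48, 61],
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```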

In [30]:
# education: fill missing with mode
# cigsPerDay: fill missing with mode
# BPMeds: fill missing with mode
# totChol: fill missing with mean
# BMI: fill missing (or 0) with mean
# heartRate: fill missing (or 0) with mode
# glucose: fill missing (or 0) with mean

df['education'].head(20)

0     4.0
1     2.0
2     1.0
3     3.0
4     3.0
5     2.0
6     1.0
7     2.0
8     1.0
9     1.0
10    1.0
11    2.0
12    1.0
13    3.0
14    2.0
15    2.0
16    3.0
17    2.0
18    2.0
19    2.0
Name: education, dtype: float64

In [35]:
df['education'] = df['education'].fillna(value=df['education'].mode()[0])
df['cigsPerDay'] = df['cigsPerDay'].fillna(value=df['cigsPerDay'].mode()[0])
df['BPMeds'] = df['BPMeds'].fillna(value=df['BPMeds'].mode()[0])
df['totChol'] = df['totChol'].fillna(value=df['totChol'].mean())
df['BMI'] = df['BMI'].fillna(value=df['BMI'].mean())
df['heartRate'] = df['heartRate'].fillna(value=df['heartRate'].mode()[0])
df['glucose'] = df['glucose'].fillna(value=df['glucose'].mean())
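
One thing to consider: filling with statistics computed on the full dataframe lets test rows influence the fill values. A hedged alternative, not what this notebook does, is to learn the fill values from the training rows only and reuse them everywhere. The helper name `fit_fill_values` and the toy frames below are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def fit_fill_values(train: pd.DataFrame, mode_cols, mean_cols) -> dict:
    """Learn per-column fill values from the training rows only."""
    fill = {col: train[col].mode()[0] for col in mode_cols}
    fill.update({col: train[col].mean() for col in mean_cols})
    return fill


# Hypothetical toy frames standing in for a real train/test split.
toy_train = pd.DataFrame({'education': [1.0, 1.0, 2.0, np.nan],
                          'BMI': [20.0, 30.0, np.nan, 25.0]})
toy_test = pd.DataFrame({'education': [np.nan, 3.0],
                         'BMI': [np.nan, 40.0]})

fill_values = fit_fill_values(toy_train, mode_cols=['education'], mean_cols=['BMI'])

# The same training-derived values fill both frames.
train_filled = toy_train.fillna(fill_values)
test_filled = toy_test.fillna(fill_values)
print(fill_values)
```

With only a handful of missing values per column the difference is likely small, but the pattern keeps the test set untouched by training decisions.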

In [36]:
df.shape

(4238, 16)

In [37]:
df.isnull().sum()

male               0
age                0
education          0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64

In [None]:
# from pandas_profiling import ProfileReport
# profile = ProfileReport(df, minimal=True)
# profile.to_notebook_iframe()


## Left commented out: I can't save the notebook if I keep this cell's output.

In [41]:
train, test = train_test_split(df, test_size=0.2, random_state=42)

In [42]:
train.shape, test.shape

((3390, 16), (848, 16))
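
The assignment asks for train, validate, and test sets, and the target is imbalanced. One possible variation is a stratified three-way split, sketched here on synthetic data with the same 85/15 class ratio. The `stratify` argument and the 60/20/20 proportions are assumptions, not choices made in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

target = 'TenYearCHD'

# Synthetic stand-in for the real dataframe: 85 negatives, 15 positives.
toy = pd.DataFrame({'age': range(100), target: [0] * 85 + [1] * 15})

# First carve off the test set, keeping the class ratio the same.
train_val, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy[target])

# Then split the remainder into train and validation (0.25 of 80% = 20%).
train, val = train_test_split(
    train_val, test_size=0.25, random_state=42, stratify=train_val[target])

print(train.shape, val.shape, test.shape)
print(train[target].mean(), val[target].mean(), test[target].mean())
```

Stratifying keeps the share of positive cases about the same in every subset, which makes validation and test scores easier to compare.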

In [43]:
### Are some observations outliers? Will you exclude them?
### Will you do a random split or a time-based split?


# There are a few outliers in different features, but I'll leave them for now.
# I plan on doing a random split since I don't have dates in my dataset.
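
To make "a few outliers" concrete, one common heuristic is the 1.5 × IQR rule. The sketch below is only an illustration: the helper name `iqr_outlier_mask` and the toy values are made up, and the rule is a convention rather than the method used in this notebook.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical values with one obviously extreme entry.
values = pd.Series([70, 72, 75, 78, 80, 82, 85, 300])
mask = iqr_outlier_mask(values)
print(values[mask])
```

Flagged rows can be inspected before deciding whether to keep, cap, or drop them. Leaving them in, as planned above, is a reasonable starting point.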