# DSCI 100/Section 009/Group 169 Group Project Proposal

### [data_set_name] Analysis Proposal
By: Shady Abo El Kasim, Nalan Goosen, Labella Li, Yusen Wu

#### Introduction


* *Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal*
* *Clearly state the question you will try to answer with your project*
    * predictive question: asks what a measurement or label will be, not why (e.g., "What will an individual choose based on prior data?")
* *Identify and describe the dataset that will be used to answer the question*
    * identify the dataset's origin, briefly describe the organization behind it, and note when the dataset was pulled

In [1]:
### Please run this cell before continuing.

library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
set.seed(2022) 

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

#### Preliminary Exploratory Data Analysis

* *Demonstrate that the dataset can be read from the web into R*
* *Clean and wrangle your data into a tidy format*
* *Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.*
    * initial_split 
    * group_by, summarize
    * averages
    * map_df?
    * more than one table is allowed, e.g. one table per function listed above
* *Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.*
    * discuss which variables to use in analysis

In [2]:
# reading the dataset from the web, then cleaning it

heart_disease_data <- read_csv("https://raw.githubusercontent.com/labellali/dsci-100-2022w1-group-169/main/data/heart_disease_dataset.csv")
heart_disease_data <- heart_disease_data |>
    # replace the generic ColumnN headers with descriptive names
    rename(age = Column1,
           sex = Column2,
           chest_pain = Column3,
           resting_blood_pressure = Column4,
           cholesterol = Column5,
           fasting_blood_sugar = Column6,
           rest_ecg = Column7,
           max_heart_rate = Column8,
           exercised_ind_angina = Column9,
           oldpeak = Column10,
           slope = Column11,
           ca = Column12,
           thal = Column13,
           num = Column14) |>
    # categorical codes become factors; '?' marks a missing value in thal and ca,
    # so it is turned into NA before those columns are converted to numeric
    mutate(sex = as.factor(sex),
           chest_pain = as.factor(chest_pain),
           fasting_blood_sugar = as.factor(fasting_blood_sugar),
           rest_ecg = as.factor(rest_ecg),
           exercised_ind_angina = as.factor(exercised_ind_angina),
           num = as.factor(num),
           thal = na_if(thal, '?'),
           ca = na_if(ca, '?'),
           thal = as.numeric(thal),
           ca = as.numeric(ca))

# NOTE: consider converting exercised_ind_angina to a logical vector

heart_disease_data

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Column12, Column13
dbl (12): Column1, Column2, Column3, Column4, Column5, Column6, Column7, Col...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


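| age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate | exercised_ind_angina | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1 | 7 | 3 |
| 57 | 0 | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1 | 3 | 1 |
| 38 | 1 | 3 | 138 | 175 | 0 | 0 | 173 | 0 | 0.0 | 1 | NA | 3 | 0 |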


In [3]:
# splitting data into training and testing datasets

heart_disease_split <- initial_split(heart_disease_data, prop = 0.75, strata = num)
training_data <- training(heart_disease_split)
testing_data <- testing(heart_disease_split)

training_data
testing_data

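| age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate | exercised_ind_angina | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 52 | 1 | 4 | 125 | 212 | 0 | 0 | 168 | 0 | 1 | 1 | 2 | 7 | 3 |
| 58 | 0 | 2 | 136 | 319 | 1 | 2 | 152 | 0 | 0 | 1 | 2 | 3 | 3 |
| 55 | 0 | 4 | 128 | 205 | 0 | 1 | 130 | 1 | 2 | 2 | 1 | 7 | 3 |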


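| age | sex | chest_pain | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate | exercised_ind_angina | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 56 | 1 | 3 | 130 | 256 | 1 | 2 | 142 | 1 | 0.6 | 2 | 1 | 6 | 2 |
| 44 | 1 | 2 | 120 | 263 | 0 | 0 | 173 | 0 | 0.0 | 1 | 0 | 7 | 0 |
| 52 | 1 | 3 | 172 | 199 | 1 | 0 | 162 | 0 | 0.5 | 1 | 0 | 7 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 68 | 1 | 4 | 144 | 193 | 1 | 0 | 141 | 0 | 3.4 | 2 | 2 | 7 | 2 |
| 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1 | 7 | 3 |
| 38 | 1 | 3 | 138 | 175 | 0 | 0 | 173 | 0 | 0.0 | 1 | NA | 3 | 0 |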


In [4]:
# exploratory data analysis

# looking at the number of observations in each class to predict

training_data_count <- training_data |>
    group_by(num) |>
    summarize(n = n())
training_data_count

| num | n |
|---|---|
| `<fct>` | `<int>` |
| 0 | 122 |
| 1 | 41 |
| 2 | 25 |
| 3 | 25 |
| 4 | 12 |
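
The class counts are uneven, so it may help to look at them as percentages too. The sketch below re-enters the counts from the table by hand so that it runs on its own; `class_counts` and `class_proportions` are illustrative names that do not appear in the notebook. In the notebook, the same `mutate()` step could presumably be applied to `training_data_count` directly.

```r
library(dplyr)
library(tibble)

# Counts copied by hand from the table above (assumption: in the notebook,
# training_data_count would be used instead of this hand-built tibble)
class_counts <- tibble(
    num = factor(0:4),
    n = c(122L, 41L, 25L, 25L, 12L)
)

# Express each class as a percentage of all training observations
class_proportions <- class_counts |>
    mutate(percent = 100 * n / sum(n))

class_proportions
```

A strongly dominant class is worth keeping in mind when judging a classifier's accuracy later on.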


In [7]:
# calculating the means of every numeric predictor

training_data_means <- training_data |>
    select(age, resting_blood_pressure, cholesterol, max_heart_rate, oldpeak, slope, thal, ca) |>
    map_df(mean, na.rm = TRUE)
training_data_means

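| age | resting_blood_pressure | cholesterol | max_heart_rate | oldpeak | slope | thal | ca |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 54.60889 | 132.7644 | 249.4756 | 151.4133 | 1.052444 | 1.617778 | 4.654709 | 0.6891892 |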
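
The exploratory prompts also ask how many rows have missing data, which the tables so far do not report. The following is a minimal sketch of one way to count them. The helper `count_missing()` and the toy tibble are hypothetical and exist only to make the example runnable on its own; in the notebook the helper would be called on `training_data`.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: NA count per column, plus the number of rows
# that contain at least one NA
count_missing <- function(df) {
    per_column <- df |>
        summarize(across(everything(), ~ sum(is.na(.x))))
    list(per_column = per_column,
         rows_with_missing = sum(!complete.cases(df)))
}

# Toy stand-in for training_data (made-up values, not from the dataset)
toy <- tibble(
    age = c(63, 67, 41, 38),
    ca = c(0, NA, 2, 1),
    thal = c(3, 7, NA, NA)
)

missing_summary <- count_missing(toy)
missing_summary$per_column
missing_summary$rows_with_missing
```

In the notebook this would be `count_missing(training_data)`. Only `ca` and `thal` used `?` as a missing-value marker in the raw file, so those are the columns where NAs are expected.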


#### Method

* *Explain how you will conduct your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?*
    * filter for certain country/time range
    * select(-removal) any unnecessary columns
* *Describe at least one way that you will visualize the results*
    * consider facet_grid?
    * different kinds of geom visualization functions
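
As a starting point for the `facet_grid` idea, here is a rough sketch of a histogram of one predictor, faceted by class. The toy tibble and the name `age_by_class_plot` are made up for illustration; in the notebook the data argument would be `training_data`, and the choice of `age` and a bin width of 5 is only an assumption.

```r
library(ggplot2)
library(tibble)

# Toy stand-in for training_data (made-up values, not from the dataset)
toy <- tibble(
    age = c(63, 37, 41, 56, 57, 67, 62, 58),
    num = factor(c(0, 0, 0, 0, 1, 2, 3, 1))
)

# One histogram panel per class of num, stacked in rows for easy comparison
age_by_class_plot <- ggplot(toy, aes(x = age)) +
    geom_histogram(binwidth = 5) +
    facet_grid(rows = vars(num)) +
    labs(x = "Age (years)", y = "Number of patients") +
    theme(text = element_text(size = 12))

age_by_class_plot
```

Stacking the panels in rows keeps the x-axis shared, which makes shifts in the distribution between classes easier to see.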

- **Note**: while cleaning the dataset and looking at the variables, I thought that `ca`, `oldpeak`, and `slope` could probably be removed if we end up going with this dataset.
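
If those three columns do get dropped, the `select(-...)` step could look roughly like this. The toy tibble and the name `reduced` are illustrative only; in the notebook the same `select()` call would be applied to `heart_disease_data` or `training_data`.

```r
library(dplyr)
library(tibble)

# Toy stand-in with a few of the renamed columns (made-up rows)
toy <- tibble(
    age = c(63, 67),
    ca = c(0, 3),
    oldpeak = c(2.3, 1.5),
    slope = c(3, 2),
    num = factor(c(0, 2))
)

# Drop the candidate columns and keep everything else
reduced <- toy |>
    select(-ca, -oldpeak, -slope)

colnames(reduced)
```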

#### Expected outcomes and significance

* *What do you expect to find?*
* *What impact could such findings have?*
* *What future questions could this lead to?*