# GGR274H1 Data Science Conference

The final project for this course applies data science methods, including programming and statistical analysis, to data from [Toronto Community Health Profiles](http://www.torontohealthprofiles.ca) on one of the topics below.  You and your team will present your findings in a short oral presentation, in the style of a talk at an academic Data Science Conference (DSC), and should be prepared to answer questions about your work.

## Deliverables

1. The Jupyter notebook (.ipynb) that produced the slides for the presentation, to be submitted by April 5, 9:30 AM. Submission details to follow.

2. A 5-minute oral presentation summarizing your work, which you and your team will give at the Data Science Conference (DSC).

# The Data Science Conference

## When

- __Date:__ April 5, 2022

- __Time:__ 10:00 - 12:00

## Location 

-  SS2118 (lecture room)


# Conference Slides

- You should produce your conference slides using the Jupyter notebook slideshow extension RISE, which is available on the [UofT Jupyterhub](https://jupyter.utoronto.ca). A template Jupyter notebook for producing your presentation slides is [here](GGR274DSconfslides.ipynb).  You will be allowed to present five content slides (this count does not include slides used to break up the sections described below).

- Your presentation slides should include reproducible Python output (i.e., graphs and tables should be produced by Python code, not hard-coded or inserted as images), but should not show Python code unless it is directly related to one of the sections below.  See the [template](GGR274DSconfslides.ipynb) for an example of how to write code chunks to do this.

- Your title slide must include the names of your team members, your tutorial section (e.g., TUTXXXX), and your group number as assigned by your TA.

- Your conference slides should include the following sections: 

   + Introduction 
   + Data 
   + Methods 
   + Results 
   + Conclusion  


A few guidelines for an effective conference presentation:

- Your slides should be clear, concise, and easy to read quickly.

- Do not use small fonts, as your slides will need to be read at a distance of about 6 feet.

- Figures often display information more efficiently than text.  

- Numbered or bulleted lists convey points in slides more effectively than blocks of text. 

# Oral Presentation

Your group will be asked to give a 5-minute presentation summarizing your work.  This time limit is firm, and you will be asked to stop when time is up.  Each team member must speak during this presentation.

# Teamwork

- Preparatory work will be carried out in tutorials and will form part of your tutorial grade.

- You will work in a group of either 3 or 4 students from the same tutorial.

- All team members are expected to contribute equally to the completion of the project.  All team members must present part of the oral presentation at the DSC.

- Your group may not work with members of another group.  You may not discuss the project with anyone except for your team, professors, and the course TAs.

- If you are concerned about any issues with your team, contact a member of the teaching team as soon as possible.


# Evaluation

**Students will be evaluated as a team.**

Grade component                  |  Value 
---------------------------------|--------
Content of slides                |  50%
Reproducibility of slide content |  10%
Oral Presentation                |  40%




## Reproducibility of Conference Slides

- The rubric for evaluating the content of the conference slides is below.

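```python
import pandas as pd

# Load the slide-content rubric; keep_default_na=False keeps blank cells as empty strings.
slidesrubric = pd.read_csv('confrenceslidesrubric.csv', keep_default_na=False)
# Display the table without its index (Styler.hide_index() was removed in pandas 2.0).
slidesrubric.style.hide(axis="index")
```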

| Criteria | Sub-criterion | Excellent | Good | Adequate | Poor |
|----------|---------------|-----------|------|----------|------|
| Content | Reasonable scope | The scope of the analysis is clear and questions can be fully addressed using the available data. | The scope of the analysis is clear and questions can be reasonably addressed using the available data. | The scope of the analysis is less clear, the questions can somewhat be addressed using the available data with slight modifications. | The questions are beyond the scope, cannot be reasonably addressed with the available data; need to resort to additional data or complete modification. |
| | Data wrangling | Creative use of data wrangling to produce informative variables. | Appropriate use of data wrangling to create sensible variables. | Some use of data wrangling to create new variables. | No evidence of data wrangling to create any variables. |
| | Graphical display | The choice of graphs is appropriate and creative; graphs reveal useful information and tell a story. Meaningful captions and titles. | The choice of graphs is appropriate; graphs reveal useful information, but are not self-sufficient. Might require some explaining. | The choice of graphs is appropriate; graphs reveal some useful information. Might require some explaining and minor changes to titles/axes/labels, etc. | A lack of visual aid; graphs are inappropriate, reveal no information. |
| | Statistical methods | The choice of method is appropriate; analyses are complete; diverse and creative use of more than one approach. | The choice of method is appropriate; some non-essential analyses are missing. | The choice of method is somewhat appropriate; some analyses are missing. | The choice of method is inappropriate; essential analyses are missing. |
| | Appropriate conclusion | Results are clearly and completely summarized. Appropriate limitations and concerns are clearly stated. | Results are completely summarized. Some limitations and concerns are stated. | Some results are summarized. The conclusion is not appropriate and no limitations are mentioned. | Results are not summarized and conclusion is missing. |
| Writing | Organization | Contents are very well organized under the appropriate section and subsection headings. | Contents are organized under the appropriate section and subsection headings. | Contents are somewhat organized under section and subsection headings. | Contents are poorly organized under section and subsection headings. |
| | Overall Writing | Very polished and well written. | Few errors in spelling, punctuation, and/or grammar. Mostly clear and understandable. | Partly unclear, but mostly understandable. Several errors in spelling, punctuation, and/or grammar. | Too many errors in spelling, punctuation, and/or grammar, which make it unclear and difficult to follow. |


- Before the conference presentation, you must send your TA the files to reproduce your slides by TBD.

- Your TA will attempt to reproduce your conference slides using the Jupyter notebook (.ipynb) and data files you submit.  

- If your TA cannot run the .ipynb files you submit to reproduce your conference slides content then your group will receive 0; if the TA has to make minor changes to get it to run then your group will receive 1; and if it runs with no changes then your group will receive 2. 

- If the Jupyter notebook and other files are submitted after the deadline but within 24 hours of it, every member of your group will lose 10% of their overall final project mark. Project files will not be accepted more than 24 hours after the deadline.

## Oral Presentation

- Your group will give a 5-minute presentation about your work.  The rubric for the oral presentation is below.

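```python
# Load the oral-presentation rubric (pandas was imported as pd in the earlier cell).
oralrubric = pd.read_csv('Oralpresrubric.csv', keep_default_na=False)
# Display the table without its index (Styler.hide_index() was removed in pandas 2.0).
oralrubric.style.hide(axis="index")
```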

| Criteria | Excellent (Rare) | Good (Common) | Adequate (Common) | Poor (Very rare) |
|----------|------------------|---------------|-------------------|------------------|
| Preparedness | Extremely prepared and rehearsed. | Primarily prepared but with some dependence on written notes. | The presenter was not well prepared and sometimes reading off notes. | Evident lack of preparation/rehearsal. Complete dependence on notes. |
| Speech clarity | Words were articulated clearly and distinctly, and very easy to understand. | Words were articulated clearly and distinctly most of the time, easy to understand. | Clear attempts to enunciate, with some occasional mumbling, but still understandable. | A lot of word slurring or mumbling, barely understandable. |
| Content Clarity | Just the right amount of explanation and details were given, the presentation effectively achieved its points. | Sufficient explanation and details were given, the presentation achieved most points. | Some explanation, too little or too much details were given, the presentation achieved some points. | No explanation, insufficient or too many unnecessary details, the presentation was confusing and had no clear objectives. |
| Transitional Phrases | Effective use of words and phrases to enhance the flow and signal transitions. | Good use of words and phrases to control the flow and signal transitions. | Some use of transitional words and phrases to signal transitions. | Lack of transitions and a poor progression of flow. |
| Vocabulary | Accurate use of statistical terms and phrases, and the presentation was professional and polished. | Good use of statistical terms and phrases whenever necessary. | Demonstrated efforts to incorporate statistical terms and phrases, but some were used inaccurately. | Completely inaccurate and wrong use of statistical terms and phrases and signals a lack of understanding. |
| Delivery | Well-paced, good volume, there was eye contact and the presenter was confident. | Good pace and volume, there was some eye contact and the presenter seemed confident. | Pacing could be improved, volume or eye contact was not consistent. | Poor pacing, barely audible, a lack of eye contact. |
| The wow factor | Overall an excellent and impressive presentation. | | | |


- Every member of the team is expected to speak as part of the oral presentation. One way to do this is to have each team member present at least one slide.

- If a student in a group isn’t present at their group’s presentation then they will receive 50% of the group mark.

- If a student doesn’t speak at all during the presentation and is unable to answer a direct question then they will receive 50% of the group mark. If a student neither speaks nor responds to any questions they will receive 0. 


# Data Analysis Expectations

You will carry out a data analysis on data from [Toronto Community Health Profiles](http://www.torontohealthprofiles.ca) using Python to address the topic below.  

We expect that your analysis will require data wrangling, exploratory data analysis (plots and summary statistics), statistical tests and modeling.  Your project does not need to include all of these statistical methods nor does it need to include all of the variables in the data set.  You might also choose not to include all observations, or to make new variables from the data that may be more suitable for answering your questions of interest.
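
As a minimal sketch of making a new variable, the example below classifies neighbourhoods into income quintiles with `pandas.qcut` and then summarizes another variable within each group. The DataFrame, its column names (`neighbourhood`, `median_income`, `pct_lone_parent`), and all values are made up for illustration; the real spreadsheets use their own column names, which you will need to look up.

```python
import pandas as pd

# Toy data: ten hypothetical neighbourhoods (values are invented, not from the real files).
toy = pd.DataFrame({
    "neighbourhood": [f"N{i}" for i in range(1, 11)],
    "median_income": [21000, 25000, 30000, 34000, 41000,
                      48000, 55000, 63000, 80000, 120000],
    "pct_lone_parent": [31.0, 28.5, 27.0, 24.0, 22.5,
                        20.0, 18.5, 15.0, 12.0, 9.5],
})

# New variable: income quintile, 1 = lowest-income 20%, 5 = highest-income 20%.
toy["income_quintile"] = pd.qcut(toy["median_income"], q=5, labels=[1, 2, 3, 4, 5])

# Summary statistic within each quintile.
quintile_summary = (
    toy.groupby("income_quintile", observed=True)["pct_lone_parent"]
       .mean()
)
print(quintile_summary)
```

The same pattern works for any numeric variable you choose to group by; only the column passed to `pd.qcut` changes.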

The goal is not to carry out an exhaustive analysis, nor to apply everything you have learned in the course.  The goal is to demonstrate that you have learned how to use Python, that you can appropriately apply the methods we have covered in class to address a question, and that you can effectively interpret and present the results.

## The Data 


Information about the data, from the [About page of the Toronto Community Health Profiles website](http://www.torontohealthprofiles.ca/a_aboutUs.php?varTab=HPDtbl): "The Toronto Community Health Profiles Partnership has been providing community-level demographic, socioeconomic, and population health information to community organizations and health and social services providers throughout Toronto." While somewhat out of date, many of these data are still very relevant today. There is a wealth of information about health outcomes and socioeconomic characteristics, all summarized at the neighbourhood level. These data come from a number of sources: for example, socioeconomic data come from the Census, hospital admission data come from the Canadian Institute for Health Information, and chronic disease data come from the Institute for Clinical Evaluative Sciences. 

Your group will explore the 'All Socio-demographic (Census) - 2006' (`1_socdem_neighb_2006-2.xls`) and 'Adult Health and Disease - 2007' (`1_ahd_neighb_db_ast_hbp_mhv_copd_2007.xls`) datasets at the Toronto Neighbourhood level. These datasets are available in the `data` directory. To complete your group project, you only _need_ to 'wrangle' and analyze these two datasets, but you are also welcome to explore spatial patterns using the tools we learned in the last few weeks of class. A spatial file of Toronto Neighbourhoods that can be joined to the previously mentioned data is available at the [City of Toronto's Open Data Portal](https://open.toronto.ca/dataset/neighbourhoods/).
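
A rough sketch of the wrangling step: each spreadsheet can be read with `pandas.read_excel`, and the two tables joined on a shared neighbourhood identifier. The example uses small in-memory tables so that it runs without the data files. The column names (`neighb_id`, `median_income`, `pct_diabetes`) and values are hypothetical, and the commented-out `read_excel` line assumes a file layout (sheet names, header rows) that you will need to check yourself.

```python
import pandas as pd

# With the real files you would start from something like the line below
# (sheet and header layout are not shown here and must be checked by opening the file):
# socdem = pd.read_excel("data/1_socdem_neighb_2006-2.xls")

# Hypothetical stand-ins for the two datasets (invented columns and values).
socdem = pd.DataFrame({
    "neighb_id": [1, 2, 3, 4],
    "median_income": [30000, 45000, 60000, 52000],
})
health = pd.DataFrame({
    "neighb_id": [1, 2, 3, 5],
    "pct_diabetes": [9.1, 7.4, 5.2, 6.8],
})

# An outer join with indicator=True keeps every row and records where it came from,
# which makes unmatched neighbourhoods easy to spot before analysis.
merged = socdem.merge(health, on="neighb_id", how="outer", indicator=True)
unmatched = merged[merged["_merge"] != "both"]

# Keep only neighbourhoods present in both tables for the analysis itself.
analysis = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(unmatched[["neighb_id", "_merge"]])
print(analysis)
```

Checking `unmatched` before dropping rows is a cheap way to catch identifier mismatches (for example, differently formatted neighbourhood codes) that would otherwise silently shrink your dataset.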

We have proposed a few questions for your group to address in your project.  There are many ways you can address these questions with the data.  Your group will need to narrow these down to its own research question and make choices about which variables are important to _your_ question. You do not need to consider every variable in the dataset.
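
For groups that take up the optional spatial exploration mentioned in this section, the sketch below computes global Moran's I by hand with NumPy, using the standard formula I = (n / S0) * (z'Wz) / (z'z), where z holds deviations from the mean and S0 is the sum of all weights. It is meant only to show what the statistic measures: the function name `morans_i`, the four-neighbourhood layout, and all values are hypothetical, and in your project you would normally use the spatial tools covered in class rather than this hand-rolled version.

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I for 1-D `values` and a square spatial weights matrix."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                 # deviations from the mean
    s0 = w.sum()                     # sum of all weights
    return (len(x) / s0) * (z @ w @ z) / (z @ z)

# Hypothetical example: four neighbourhoods arranged in a line (1-2-3-4),
# where each is a neighbour of the ones directly beside it.
w_line = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

clustered = [1.0, 2.0, 3.0, 4.0]     # similar values sit next to each other
alternating = [1.0, 4.0, 1.0, 4.0]   # high and low values alternate

print(morans_i(clustered, w_line))    # positive: neighbours tend to be alike
print(morans_i(alternating, w_line))  # negative: neighbours tend to differ
```

The value on its own does not say whether a pattern is statistically significant; that needs a separate inference step, such as a permutation test.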




# Final Project Questions

Below are a few example questions that can inspire your group projects. These questions are general, so you and your group will need to decide which data to use to answer them.

1. Sociodemographic and income variables are associated with a wide range of health outcomes. Using the Adult Health and Disease dataset, explore the following:
    - Classify neighbourhoods based on their income quintiles (5 groups, 20% each), and explore the demographics (e.g., age, lone parents, education, immigration, etc.) within each group.
    - Repeat the above by classifying neighbourhoods as quintiles by a sociodemographic variable of your choice, and summarize income and other demographic variables.
    - Analyze how three chronic diseases (e.g., High Blood Pressure, Diabetes, Asthma, etc. - you pick, there are a bunch!) are associated with income and the sociodemographic variable you chose. Are some diseases more or less related to the income/demographic variables? 
    - Discuss why these relationships do or do not make sense (for example, if a neighbourhood is predominantly young parents we may see less cardiovascular disease, which presents later in life). You may need to do a little research to figure this part out. Cite your sources.
    
2. Do certain diseases have similar spatial patterns? (NB: some of the methods mentioned in this question will be covered later in the course)
    - Explore the geography of 4 diseases from the Adult Health and Disease dataset by mapping them using the libpysal and geopandas libraries. 
    - Use a measure of global spatial autocorrelation (for example Moran's I) to determine if the diseases are actually spatially clustered.
    - Use a spatial clustering tool (e.g., the Local G or LISA statistics we covered in class) to map where hot spots and cold spots of the diseases are.
    - Compare the patterns and state whether your group thinks there is significant overlap. If you can do this using code, all the better :). 
    - To go one step further, use the sociodemographic data to explore whether the disease maps correspond to patterns in sociodemographic variables (of your choosing). 
    
3. Are differences in disease outcomes by sex (as classified in the Adult Health and Disease dataset) related to different sociodemographic variables? 
    - The various diseases in the dataset are broken down by Male and Female sex (note, this is biological sex assigned at birth, and not gender). Pick 3 different diseases, and explore if men and women experience the diseases at the same rates.
    - Calculate the differences in percent (e.g., % with mental health visits) for each neighbourhood. How different are these percentages on average for the diseases you picked? 
    - For each disease, identify the neighbourhoods with the biggest absolute differences (use the top 10th percentile). How do sociodemographic variables in these neighbourhoods differ from neighbourhoods with smaller differences (i.e., the remaining 90%)? 
    - Given what you've learned, discuss why you think these differences might exist. Feel free to create a map to help justify your explanation. 
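
The steps in question 3 can be sketched as follows for a single disease. The column names (`pct_female`, `pct_male`, `median_income`) and every number are invented for illustration. "Top 10th percentile" is interpreted here as neighbourhoods at or above the 90th percentile of the absolute difference, which is one reasonable reading rather than the only one.

```python
import pandas as pd

# Hypothetical rates for one disease in ten neighbourhoods (invented values).
rates = pd.DataFrame({
    "neighbourhood": [f"N{i}" for i in range(1, 11)],
    "pct_female": [8.0, 9.5, 7.0, 10.0, 12.0, 6.5, 9.0, 11.0, 8.5, 15.0],
    "pct_male":   [7.5, 9.0, 7.2, 9.0, 10.5, 6.0, 9.4, 10.0, 8.0, 9.0],
    "median_income": [52000, 48000, 61000, 45000, 39000,
                      70000, 58000, 41000, 55000, 28000],
})

# Female-minus-male difference in percentage points, and its absolute size.
rates["gap"] = rates["pct_female"] - rates["pct_male"]
rates["abs_gap"] = rates["gap"].abs()
mean_abs_gap = rates["abs_gap"].mean()

# Flag neighbourhoods at or above the 90th percentile of the absolute difference.
cutoff = rates["abs_gap"].quantile(0.90)
rates["large_gap"] = rates["abs_gap"] >= cutoff

# Compare a sociodemographic variable between the two groups.
income_by_group = rates.groupby("large_gap")["median_income"].mean()
print(mean_abs_gap, cutoff)
print(income_by_group)
```

With the real data you would repeat this for each of the three diseases you picked and compare several sociodemographic variables, not just one.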