# Visualizing Predictive Models for Users

Welcome to `Heart AI`, the latest and greatest dating app start-up. We're building a machine learning model to predict which user to match with which other user in speed dating events. We've gathered some existing data from a few in-person speed dating events, in which participants record their hobbies, demographic details, and level of interest in other participants they've met. Before we engineer the back-end, we'd like you, our resident Human-AI Interaction expert, to implement a few sample visualizations so we know what features to support.

## The Data 
`dates.csv` contains 8,378 entries from our pilot test data, collected at heterosexual speed dating events at Columbia University from 2002 to 2004. In these events, each participant met every opposite-gender participant for four minutes. The number of speed dates varied by event: on average there were 15, but there could be as few as 5 or as many as 22. Afterward, each participant was asked if they would like to meet any of their speed dating partners again. They also provided ratings on six **attributes** about each speed date:

- Attractiveness
- Sincerity
- Intelligence
- Fun
- Ambition
- Shared Interests

The dataset also includes participants' varying perspectives on those attributes, along with other demographic information and hobbies as described below.

Each row of the dataset is a speed date, and since participants have multiple dates, they appear in the dataset multiple times. Each column is described below:

| Column Header       | Description     |
| :------------- |  ----------: | 
|  iid | Numerical ID unique to this person   |
| gender   | This participant's self-reported gender (f = female)|
| age | Age in years of this participant |
| race | This person's race |
| field | This person's field of study |
| income | The median household income of the zipcode where this person grew up |
| from | Where this person is originally from |
| tot_rounds | The total number of speed dating rounds (i.e., num speed dates) |
| round_num | Index of which speed date of the event (first, second, third...)|
| pid | The partner's unique numerical ID |
| age_partner | The partner's age |
| race_partner | The partner's race |
| same_race | Whether this participant and the partner are the same race (y = yes)|
| request | This participant would like to meet this partner in a follow-up date |
| request_partner | The partner would like to meet this participant in a follow-up date |
| match | Both participants would like a follow-up meeting |
| like | How much this person liked this partner |
| prob_yes | This person's self-reported probability the partner will say yes to a 2nd date |
| like_partner | How much the partner liked this person |
| prob_yes_partner | The partner's probability this person will say yes to a 2nd date |

The next 17 columns all relate to the six attribute ratings listed above: how the participant rated each partner, themself, and how the partner rated the participant:

| Attribute Header       | Description     |
| :------------- |  ----------: | 
| attractive | Rating of Attractiveness this person gave to their partner |
| sincere | Rating of Sincerity this person gave to their partner |
| intelligence | Rating of Intelligence this person gave to their partner |
| fun | Rating of Fun this person gave to their partner |
| ambitious | Rating of Ambition this person gave to their partner |
| shared_interests | Rating of Shared Interests this person gave to their partner |
| attractive_partner | Rating of Attractiveness the partner gave to this person |
| sincere_partner | Rating of Sincerity the partner gave to this person |
| intelligence_partner | Rating of Intelligence the partner gave to this person |
| fun_partner | Rating of Fun the partner gave to this person |
| ambitious_partner | Rating of Ambition the partner gave to this person |
| shared_interests_partner | Rating of Shared Interests the partner gave to this person |
| attractive_self | Rating of Attractiveness this person gave themself |
| sincere_self | Rating of Sincerity this person gave themself |
| intelligence_self | Rating of Intelligence this person gave themself |
| fun_self | Rating of Fun this person gave themself |
| ambitious_self | Rating of Ambition this person gave themself |

The next 17 columns are the participant's answer to the question _"How **interest**ed are you in the following activities, on a scale of 1-10?"_: sports (Playing sports/athletics), tvsports (Watching sports), excercise, dining (Dining out), museums (Museums/galleries), art, hiking (Hiking/camping), gaming, clubbing (Dancing/clubbing), reading, tv (Watching TV), theater, movies, concerts (Going to concerts), music, shopping, and yoga (yoga/meditation). These 17 columns hold the answers binned into the categories `low`, `moderate`, and `high`. The numerical answers are recorded in the last 17 columns, whose headers are the same names with `_num` appended. The categories were determined from quartiles: less than quartile 1 = `low`, less than quartile 3 = `moderate`, otherwise `high`.
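As a rough illustration of the quartile rule above, here is a minimal sketch on made-up scores. The series values and the helper `bin_interest` are hypothetical, and the strict `<` comparisons are an assumption about how boundary values were treated in `dates.csv`.

```python
import pandas as pd

# Hypothetical 1-10 interest scores (made up, not taken from dates.csv)
scores = pd.Series([1, 3, 4, 5, 6, 7, 8, 9, 10, 10], name="yoga_num")

# Quartile 1 and quartile 3 of the numerical answers
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)

def bin_interest(value):
    """Map a numerical answer to low / moderate / high using the quartiles."""
    if value < q1:
        return "low"
    if value < q3:
        return "moderate"
    return "high"

# Binned version of the column, named like the category columns described above
binned = scores.apply(bin_interest).rename("yoga")
print(pd.concat([scores, binned], axis=1))
```

Comparing a `_num` column against its category column in the real data is a quick way to check whether this reading of the rule matches.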



---

---

---

## Part 1: Explore & Visualize 
_50% of the total effort on this assignment._

The goal of these tasks is to experiment with different ways of visualizing data. Show your work in code as well as your final visualizations in this notebook. Include answers to all questions.

If you’ve never visualized data in Python, here are some helpful resources. **Review** these first!
- The `pandas` module has some visualization support: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html 
- You can also use Matplotlib, which is incredibly powerful: https://matplotlib.org/gallery/index.html 

_Remember_, if you're going to borrow and adapt code from resources like these, make sure you know what the code is doing before you adapt it. That way you get meaningful outcomes, rather than illegible graphs and data!


In [None]:
import pandas as pd # import pandas library
df = pd.read_csv('data/dates.csv', low_memory=False) # read the csv file into a pandas dataframe object

Recall from Assignment 1 the data exploration methods from the `pandas` module we used previously:

- `df.head()` will show us the first 5 rows of our dataset. You can also specify the first N rows: for example, `df.head(18)` will show us the first 18 rows.
- `df.sample(10)` will show us 10 randomly sampled rows of our dataset
- `df.shape` will tell us how many rows and how many columns are in the dataset
- `df.columns` will list the names of all columns in the dataset
- `df.describe()` will give you summary statistics about all numerical columns in the dataset

_Actually_, you'll need to recall many things from Assignment 1. Maybe take a few minutes to review it!

In [None]:
# Complete data exploration here (optional)

### Task 1a. Create a histogram of speed date participant counts by age. 
_5% effort._

A [histogram](https://en.wikipedia.org/wiki/Histogram) shows _counts_ of values in a bar chart form. This histogram will let us see if most of the participants are younger or older. One axis should be a sorted continuous range from the age of the youngest person in the dataset to the age of the oldest person in the dataset. The other axis should be the counts for each age. Remember your axis labels!

_Hint_: Recall each row of the data is a speed date, not one person (`iid`). Each person will appear in the dataset 5 to 22 times. You may want to look into [`pandas drop_duplicates()`](https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/) so you don't double/quintuple count anyone!

_Hint_: You may need to `import matplotlib.pyplot as plt` and use `plt.xlabel` and `plt.title` methods to add labels to your histogram. This works even for the `pandas DataFrame.hist` method!
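To make the two hints above concrete, here is a small sketch on a toy table. The data frame `toy` and its values are invented for illustration; only the column names `iid` and `age` come from the dataset description.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for dates.csv: one row per speed date (values are made up)
toy = pd.DataFrame({
    "iid": [1, 1, 1, 2, 2, 3],
    "age": [24, 24, 24, 27, 27, 24],
})

# Keep one row per participant so each person is counted once
people = toy.drop_duplicates(subset="iid")

# One bin per year of age, from the youngest to the oldest participant
bins = list(range(int(people["age"].min()), int(people["age"].max()) + 2))
people.hist(column="age", bins=bins)

# pyplot labels apply to the axes that pandas just drew
plt.xlabel("Age (years)")
plt.ylabel("Number of participants")
plt.title("Participants by age (toy data)")
```

Without the `drop_duplicates` step, participant 1 would be counted three times here, which is exactly the over-counting the hint warns about.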

In [None]:
# Python code that creates the described histogram:

### Task 1b. Create a histogram of participant counts by age, split by gender. 
_8% effort._

This histogram will show us the distribution of speed dating participants by age, one histogram for each gender in the dataset.

_Hint_: It may make sense to reuse the dataframe plotted from Task 1a, but perhaps add an informative column. When using the `pandas DataFrame.hist()` method, these two histograms can be generated with ~1 line of code + [a few lines extra to add axis labels](https://stackoverflow.com/questions/42832675/setting-axis-labels-for-histogram-pandas) + 1 line such as `ax.set_ylim((0,45))` per plot so that both histograms share the same y-axis range.
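One possible shape for this, sketched on invented data: the `by` argument of `DataFrame.hist` draws one histogram per group. The values below are made up, and the code `"m"` for the second gender is an assumption (the column description only documents `f = female`), so check `df["gender"].unique()` on the real data.

```python
import numpy as np
import pandas as pd

# Toy table with one row per participant (values are made up)
people = pd.DataFrame({
    "iid": [1, 2, 3, 4, 5, 6],
    "gender": ["f", "f", "f", "m", "m", "m"],  # "m" is an assumed code
    "age": [22, 24, 24, 23, 27, 27],
})

# One histogram per gender, with identical bins for both
axes = people.hist(column="age", by="gender", bins=list(range(22, 29)))

# Label every subplot and give them the same y-axis range so they are comparable
for ax in np.ravel(axes):
    ax.set_xlabel("Age (years)")
    ax.set_ylabel("Number of participants")
    ax.set_ylim((0, 3))
```

The shared `set_ylim` matters because each subplot otherwise scales to its own tallest bar, which can make a small group look as large as a big one.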

In [None]:
# Python code that creates the described histogram:

When shown data or an ML model, humans tend to have _confirmation bias_, meaning that they tend to believe that whatever the data or model says is what they really thought all along. Ever broken up with a significant other and had your friends tell you "I told you so"? This is confirmation bias. With Bayesian reasoning, we can take into account a viewer’s reasonable prior guess before they see data. This is a good technique to help users reflect on how the data might conflict with "what they thought all along."

### Task 1c. Record prior: female ages --> successful matches.
_2% effort._

Write down what you believe (before looking at the data. Just guess!) is the relationship between female participant age and number of successful matches. Do you expect successful matches to be the same across all ages or higher in certain age ranges? Why?

_Double click this text to write your answer to the question here._

### Task 1d. Record prior: male ages --> successful matches.
_2% effort._

Write down what you believe (before looking at the data. Just guess!) is the relationship between male participant age and number of successful matches. Do you expect successful matches to be the same across all ages or higher in certain age ranges? Why?

_Double click this text to write your answer to the question here._

### Task 1e. Generate histograms of priors.
_5% effort._

1. Create a histogram of successful matches female participants had by age. 
2. Create a histogram of successful matches male participants had by age. 

We're now interested in data at the match level, rather than the individual level. If you previously dropped duplicate `iid`s, you may want to update your data to include them so you can count _every_ match, not just the first one!

_Hint_: You might find it helpful to figure out how to [filter out values by a conditional using pandas](https://www.geeksforgeeks.org/drop-rows-from-the-dataframe-based-on-certain-condition-applied-on-a-column/).
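As a sketch of conditional filtering, the snippet below keeps only the rows that satisfy two conditions. The table is invented, and encoding `match` as `1`/`0` and gender as `"f"`/`"m"` is an assumption; inspect `df["match"].unique()` first, since the real file may use a different encoding such as `y`/`n`.

```python
import pandas as pd

# Toy stand-in for dates.csv: one row per speed date (values are made up)
toy = pd.DataFrame({
    "iid": [1, 1, 2, 2, 3],
    "gender": ["f", "f", "m", "m", "f"],
    "age": [24, 24, 27, 27, 30],
    "match": [1, 0, 1, 0, 1],  # assumed encoding: 1 = match, 0 = no match
})

# Boolean masks combine with & (and) or | (or); each condition needs parentheses
female_matches = toy[(toy["gender"] == "f") & (toy["match"] == 1)]
print(female_matches)
```

Note that no duplicates are dropped here: every matching row is one successful speed date, which is the unit this task counts.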

In [None]:
# Code that generates histograms 1 & 2:

### Task 1f. Compare priors with histograms.
_8% effort._

Compare your prior guess in 1c and 1d to the histograms in 1e. What did you learn from the histograms? Are there parts of your prior guess that were confirmed by the histograms? Are there parts of your prior guess that were wrong or different than you expected?

_Double click this text to write your answer to the question here._

When users see different possibilities separately in a data or ML system, there’s a bias towards thinking _all possibilities are equally likely_, when really some options are more or less probable in real life. For example, while a headache could be caused by autumn allergies or by brain cancer, allergies are a far more likely cause in real life than brain cancer.

### Task 1g.
_20% effort._

Create a visualization of _your choice_, allowing users to examine how different personal features correlate with successful matches with increasing age. Overlay 4 different participant demographics in the same plot, with age as the x-axis and including only data from successful matches. Design this visualization however you wish. Justify your design by writing a few sentences about how your visualization will help users compare the 4 different attributes by age. Talk about encoding choices such as: plot type, use of size, color, and axes labels. Are there any flaws in your visualization?

_Examples_: This could be a stacked histogram showing the number of successful matches by the 4+ participant races in the dataset, or it could be the mean of 4 different interests overlaid in the same line chart, or you could look at the self-ratings along the attributes, or you could consider adding columns to the dataset that might provide additional insights! Explore the data and generate something interesting to you!

_Hint_: The [`pandas DataFrame.groupby`](https://www.geeksforgeeks.org/pandas-groupby/) method can be very useful here!
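For instance, `groupby` can collapse many rows into one summary value per age, which then plots directly as overlaid lines. This is only a sketch on invented numbers: it uses two columns instead of four to stay short, and the column names `yoga_num` and `reading_num` assume the `_num` naming described in the data section.

```python
import pandas as pd

# Toy rows standing in for successful matches (values are made up)
toy = pd.DataFrame({
    "age": [22, 22, 25, 25, 25],
    "yoga_num": [2, 4, 6, 8, 10],
    "reading_num": [9, 7, 5, 5, 2],
})

# Mean of each interest for every age: one row per age, one column per interest
mean_by_age = toy.groupby("age")[["yoga_num", "reading_num"]].mean()
print(mean_by_age)

# Each column becomes its own line, overlaid on a shared age axis
ax = mean_by_age.plot(kind="line", marker="o")
ax.set_xlabel("Age (years)")
ax.set_ylabel("Mean interest (1-10)")
```

A mean hides how many rows stand behind each point, so a design built this way may want to show group sizes as well.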

_Double click this text to write your answer to the question here._

In [None]:
# Code that generates your visualization:

---

---

---

## Part 2: Designing Personal Predictions
_50% of the total effort on this assignment._

The goal of Part 2 is to start designing an interactive interface, where a user that comes to the `Heart AI` visualization can put in their own information (like age, gender, interests, self-perceived attributes, and so on), and see how their information relates to the possibility of a successful speed date match. 

Show your work in code as well as your final visualizations in this notebook. Include answers to all questions.

To add some minimal interactivity with minimal effort, consider using Jupyter Notebook Widgets: 
- https://ipywidgets.readthedocs.io/en/latest/examples/Using%20Interact.html 
- https://towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6
- https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20List.html
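A minimal sketch of what `interact` from `ipywidgets` can do: it wraps an ordinary function and builds a control for each argument. The table, the `1`/`0` encoding of `match`, and the function `match_rate` are all hypothetical, and the fallback branch only exists so the snippet also runs where `ipywidgets` is not installed.

```python
import pandas as pd

# Toy stand-in for dates.csv (values and the 1/0 match encoding are made up)
toy = pd.DataFrame({
    "age": [22, 22, 25, 25, 30, 30],
    "match": [1, 0, 1, 1, 0, 0],
})

def match_rate(age=25):
    """Print and return the share of toy dates at this age that ended in a match."""
    at_age = toy[toy["age"] == age]
    rate = at_age["match"].mean() if len(at_age) else float("nan")
    print(f"Age {age}: match rate {rate:.0%} across {len(at_age)} dates")
    return rate

try:
    from ipywidgets import interact
    # A (min, max) tuple of integers is turned into an integer slider
    interact(match_rate, age=(22, 30))
except ImportError:
    # Without ipywidgets, just call the function once
    match_rate(25)
```

In a notebook, moving the slider re-runs `match_rate` with the new age, so the same pattern works if the function draws a plot instead of printing.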


### Task 2a. Design for personas. 
_20% effort._

For each of the fictional users given, create a single visualization that provides insight into that user's likely experience at a speed dating event. To experiment with design choices, make each user/visualization pair a different visual encoding that represents different design choices (e.g. you could try a different plot type for some users). 

- Morgan is a young black male college student studying Law. 
- Taylor is a 27-year-old woman. She greatly enjoys reading and yoga.
- Cal is a white man in his late 20s, who believes he's moderately attractive.
- Reilly is a Psychologist of non-binary gender, who realllly doesn't like going to art museums.

_Hint_: If the demographic detail for the user is missing from the description, then it could be a good candidate for displaying "What-if" scenarios along that dimension. If it is provided, you could consider showing how slight changes in that information might impact success.

In [None]:
# Code that generates a visualization for Morgan:

In [None]:
# Code that generates a visualization for Taylor:

In [None]:
# Code that generates a visualization for Cal:

In [None]:
# Code that generates a visualization for Reilly:

### Task 2b. Which visualization from 2a do you think is the most successful design? What visualization techniques did you use? 
_3% effort._

_Double click this text to write your answer to the question here._

### Task 2c. Limitations.
_7% effort._

Given your visualizations in 2a, what would be good questions for a user to ask a personalized visualization from this dataset? What would be some questions that a personalized visualization (with this dataset alone) cannot answer?

_Double click this text to write your answer to the question here._

### Task 2d. Best information.
_7% effort._

If users like those in 2a visit the `Heart AI` interactive tool, what information would you have them put in to show the most relevant match success visualization and why?

_Double click this text to write your answer to the question here._

### Task 2e. Data processing.
_3% effort._

In the given data, interest information is pre-binned as low, moderate, high. Is there any other data in this dataset that would be helpful to bin? How would you bin it?

_Double click this text to write your answer to the question here._

### Task 2f. Data use.
_10% effort._

Is the data used in this activity a good choice for `Heart AI`'s pilot testing? Why/Why not? How might it be improved? 

_Double click this text to write your answer to the question here._

---

---

---

## Submit your Assignment
Once you've completed all of the above, you're done with Assignment 3! You might want to double check that your code works like you expect. You can do this by choosing "Restart & Run All" in the Kernel menu. If it outputs errors, you should go back and check what you've done. Iris needs to be able to run your notebook on her computer!

Once you think everything is set, please upload your final notebook (with all of your code run and output showing), to Glow with filename `[yourunixID]_haii20[assignmentnumber].ipynb`, e.g., `ikh1_haii20a3.ipynb`

If you've modified your data file outside of this notebook, please zip it up with your Jupyter Notebook and submit together as one `.zip` file so that I can run it!