# Speed Dating 

![image.png](attachment:image.png)

As technology develops, people are less and less likely to meet potential partners through organic means. Dating is approaching a new era where online dating applications are quickly rising in popularity. This project utilizes data from a speed dating experiment to examine trends in what people value when looking for partners through brief interactions. 

## 1. Data

[Speed Dating Experiment by Anna Montoya](https://www.kaggle.com/annavictoria/speed-dating-experiment)

This dataset is sourced from Kaggle where it was uploaded by Anna Montoya. It was originally gathered for a study conducted by Columbia Business School professors Ray Fisman and Sheena Iyengar. The experiment consisted of 22 waves of speed dating events where each "date" was 4 minutes long. Participants would rate each other on six attributes: attractiveness, sincerity, intelligence, fun, ambition, and shared interests. They would then choose whether or not they would like to match with their date. Participants were also asked to share their demographic information and to answer other questions about dating. 

## 2. Data Cleaning 

[Data Wrangling Notebook](https://github.com/annaaful/Springboard/blob/main/Capstone%203%20Speed%20Dating/Data%20Wrangling%20.ipynb)

This dataset contains 8378 rows and 195 columns. Here are the main steps I took to prepare the data for EDA.

- Reviewed all the columns and dropped irrelevant features.
- Removed rows that used an inconsistent point allocation system when rating partners.
- Filled null values in the demographic and attribute data with each feature's respective mean or median.
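As a rough illustration of these three steps (not the notebook's actual code), here is a minimal pandas sketch. The column names, the tiny stand-in frame, and the rule "keep only rows whose allocation sums to 100" are all assumptions made for the example; the wrangling notebook may define an inconsistent allocation differently.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; every column name here is hypothetical.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 28.0],
    "attr": [7.0, 8.0, 5.0, np.nan],
    "notes": ["a", "b", "c", "d"],        # stands in for an irrelevant column
    "attr_alloc": [20.0, 30.0, 9.0, 25.0],
    "sinc_alloc": [80.0, 70.0, 8.0, 75.0],
})


def clean(frame, drop_cols, alloc_cols, median_cols, mean_cols):
    """Drop columns, filter inconsistent allocations, then impute nulls."""
    out = frame.drop(columns=drop_cols)
    # Assumed rule: a consistent row distributes exactly 100 points.
    totals = out[alloc_cols].sum(axis=1)
    out = out[np.isclose(totals, 100)].copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    return out


cleaned = clean(
    df,
    drop_cols=["notes"],
    alloc_cols=["attr_alloc", "sinc_alloc"],
    median_cols=["age"],
    mean_cols=["attr"],
)
print(cleaned)
```

Whether a feature is better filled with its mean or its median depends on how skewed it is; the median is less sensitive to outliers.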

## 3. Exploratory Data Analysis

[EDA Notebook](https://github.com/annaaful/Springboard/blob/main/Capstone%203%20Speed%20Dating/Exploratory%20Data%20Analysis.ipynb)

**Feature Distributions**

![image.png](attachment:image.png)

There were a total of 6,330 interactions between males and females in the speed dating experiment, a majority of which did not end in a match.
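A class balance like this can be checked with `value_counts`. The sketch below uses a ten-row stand-in frame with a hypothetical `match` column, so the numbers it prints are illustrative and not the experiment's.

```python
import pandas as pd

# Hypothetical stand-in: one row per interaction, 1 = match, 0 = no match.
interactions = pd.DataFrame({"match": [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]})

counts = interactions["match"].value_counts()                # absolute counts
shares = interactions["match"].value_counts(normalize=True)  # proportions

print(counts)
print(shares)
```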

![image.png](attachment:image.png)

Most of the participants are between 20 and 35 years old. There are a few outliers in the 50-55 range.

![raceimp%20distribution.PNG](attachment:raceimp%20distribution.PNG)

Most participants do not consider it important whether their partner has the same racial background.

![image.png](attachment:image.png)

A majority of the participants came into the speed dating event with low expectations.

**Attributes** 

The following heatmaps show the correlation between attribute questions answered by participants before and after the dating event. <br>
- (1=awful, 10=great) <br>
- attr=Attractive, sinc=Sincere, intel=Intelligent, fun=Fun, amb=Ambitious, shar=Shared Interests/Hobbies <br>
- attribute_1 = before event, attribute_2 = after event
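For reference, a correlation matrix of this kind can be computed with pandas before it is drawn as a heatmap. This is only a sketch: the column names follow the `attribute_1` / `attribute_2` convention above but are assumed, and the five rows of ratings are made up.

```python
import pandas as pd

# Made-up ratings; columns follow the before (_1) / after (_2) convention.
ratings = pd.DataFrame({
    "attr_1": [40, 20, 30, 15, 25],
    "sinc_1": [10, 20, 15, 25, 20],
    "attr_2": [35, 20, 30, 20, 25],
    "sinc_2": [15, 20, 15, 20, 20],
})

# Pairwise Pearson correlation between every pair of columns.
corr = ratings.corr()
print(corr.round(2))
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would render it as a grid of colored cells.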

What you look for in the opposite sex? (100 point allocation)
![image.png](attachment:image.png)

Attractiveness has a weak to medium negative correlation with the rest of the attributes. This indicates that as attractiveness increased in importance, the rest of the attributes decreased and vice versa.

There's a high but not perfect positive correlation between ratings on what participants look for in their partners before and after the dating event. It appears that their opinions changed slightly during this time period. This may not be a direct result of the dating event itself, but it's probable that there was some influence.

What you think MOST of your fellow men/women look for in the opposite sex? (100 point allocation)
![image.png](attachment:image.png)

Attractiveness has a much weaker negative correlation with the rest of the attributes when participants are asked to rate what others of the same gender look for in their partners. The rest of the attributes generally have a weak to medium positive correlation with each other. This suggests that participants were probably more neutral and less particular about what they thought others of the same gender valued.

What do you think the opposite sex looks for in a date? (100 point allocation)
![image.png](attachment:image.png)

Here is the correlation between attributes that participants think the other gender looks for in their partners. Attractiveness has a weak to strong negative correlation with the rest of the attributes. This indicates that as attractiveness increased in importance, the rest of the attributes decreased and vice versa. It's more extreme than in the case of what participants said they looked for themselves.

How do you think you measure up? (Scale of 1-10)
![image.png](attachment:image.png)

Participants generally have the same opinions of their own attributes before versus after the event.

How do you think others perceive you? (Scale of 1-10)
![image.png](attachment:image.png)

Aside from attractiveness, the attributes show a noticeably weaker positive correlation between the before and after ratings. Perhaps participants' confidence in how others perceive them weakened or strengthened as a result of the speed dating.

**How Attributes Affect Likability and Matches**

![image.png](attachment:image.png)

This graph plots the attribute ratings participants gave their dates against how much they liked them. Orange represents a decision of yes from the participant, and blue represents no. How much a participant likes their date is generally positively associated with a decision of yes. However, there are many cases where participants rated their dates high in attractiveness and intelligence, didn't like them at all, and still said yes. There was also an exceptionally high number of cases where dates with high likability but low attractiveness still received a decision of no. Of the remaining attributes, fun seems to be the most important; the others are not as influential.

![image.png](attachment:image.png)

Generally it’s more likely to be a match when participants really like the other person and think the person will say yes for them.

![image.png](attachment:image.png)

There are many cases where participants don’t think the other person will say yes, but they do and they match! This suggests that people either aren’t that confident in themselves or cannot express themselves clearly.
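One way to count these surprise matches is a cross-tabulation of the participant's guess against the actual outcome. The sketch below assumes a hypothetical `prob` column on a 1-10 scale (the participant's guess that the partner will say yes) and an arbitrary cutoff of 6 for "expected a yes"; both are illustrative choices, and the rows are made up.

```python
import pandas as pd

# Made-up rows: prob = guessed chance (1-10) that the partner says yes.
dates = pd.DataFrame({
    "prob": [2, 3, 8, 9, 1, 7, 4, 2],
    "match": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Hypothetical threshold for "expected the other person to say yes".
dates["expected_yes"] = dates["prob"] >= 6

table = pd.crosstab(dates["expected_yes"], dates["match"])
surprise_matches = table.loc[False, 1]  # did not expect a yes, matched anyway

print(table)
print("Surprise matches:", surprise_matches)
```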

## 4. Preprocessing and Training

[Preprocessing and Training Notebook](https://github.com/annaaful/Springboard/blob/main/Capstone%203%20Speed%20Dating/Preprocessing%2C%20Training%20and%20Modeling.ipynb)

**Preprocessing**

- One-hot encoding for categorical data 
- Upsampling for class imbalance 
- Feature selection with SelectKBest 
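A minimal sketch of these three steps with pandas and scikit-learn follows. The data is synthetic, the column names are hypothetical, and `f_classif` with `k=3` is an assumed scoring choice; the notebook may use a different score function and a different number of features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.utils import resample

# Synthetic, imbalanced stand-in data (160 non-matches, 40 matches).
rng = np.random.default_rng(0)
n_no, n_yes = 160, 40
n = n_no + n_yes
df = pd.DataFrame({
    "like": rng.integers(1, 11, n),
    "attr": rng.integers(1, 11, n),
    "field": rng.choice(["law", "science", "arts"], n),
    "match": np.repeat([0, 1], [n_no, n_yes]),
})

# 1. One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["field"], dtype=int)

# 2. Upsample the minority class (with replacement) to the majority size.
majority = encoded[encoded["match"] == 0]
minority = encoded[encoded["match"] == 1]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_up])

# 3. Keep the k features with the highest ANOVA F-scores.
features = balanced.drop(columns="match")
selector = SelectKBest(score_func=f_classif, k=3)
X_sel = selector.fit_transform(features, balanced["match"])
selected = features.columns[selector.get_support()]

print(balanced["match"].value_counts())
print(list(selected))
```

If rows are duplicated before the train-test split, copies of the same interaction can land in both sets, so resampling only the training portion is a common precaution.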

**Training and Scaling**

- Train-test split with 80/20.
- Standardization.
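The split and scaling can be sketched as below, on synthetic data. Stratifying on the target and the specific `random_state` are assumptions added for the example. The scaler is fitted on the training set only and then applied to the test set, so no test statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and an 80/20 imbalanced target.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# 80/20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize: fit on the training data, reuse the same transform on test.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```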

## 5. Machine Learning Algorithms and Modeling 

[Modeling Notebook](https://github.com/annaaful/Springboard/blob/main/Capstone%203%20Speed%20Dating/Preprocessing%2C%20Training%20and%20Modeling.ipynb)

I constructed four different types of machine learning models: logistic regression, decision tree, random forest, and support vector machine.

**Logistic Regression**

- The logistic regression model had an accuracy score of 84.7% on the training set and 84.2% on the test set.
![image.png](attachment:image.png)
- With hyperparameter tuning, the model had an accuracy score of 84.76% on the training set and 84.2% on the test set. The precision was 0.77 and the recall was 0.68. Both models had extremely similar results.
![image-2.png](attachment:image-2.png)
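For readers who want to reproduce this kind of report, here is a small sketch of fitting a logistic regression and computing accuracy, precision, and recall with scikit-learn. It runs on synthetic data from `make_classification`, so its scores are unrelated to the figures above, and the settings shown are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for match / no match.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
test_precision = precision_score(y_test, test_pred, zero_division=0)
test_recall = recall_score(y_test, test_pred, zero_division=0)

print(f"train acc={train_acc:.3f} test acc={test_acc:.3f}")
print(f"precision={test_precision:.3f} recall={test_recall:.3f}")
```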

**Decision Tree**

- The decision tree model with gini index had an accuracy score of 84.0% on the test set. 
![image-3.png](attachment:image-3.png)
- The decision tree model with entropy had an accuracy score of 84.6% on the test set.
![image-4.png](attachment:image-4.png)
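The two variants differ only in the `criterion` argument of scikit-learn's `DecisionTreeClassifier`. The sketch below compares them on synthetic data; `max_depth=4` and the random seeds are assumptions for the example, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

scores = {}
for criterion in ("gini", "entropy"):
    # Same tree settings; only the split-quality measure changes.
    tree = DecisionTreeClassifier(
        criterion=criterion, max_depth=4, random_state=1
    )
    tree.fit(X_train, y_train)
    scores[criterion] = accuracy_score(y_test, tree.predict(X_test))

print(scores)
```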

**Random Forest**

For the random forest model, I went straight into hyperparameter tuning with RandomizedSearchCV. The accuracy score was 98.0% on the training set and 92.1% on the test set. 
![image.png](attachment:image.png)
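A randomized search of this kind can be set up roughly as follows. The parameter ranges, `n_iter=5`, and `cv=3` are placeholder assumptions chosen to keep the example fast, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Hypothetical search space; each iteration samples one combination.
param_distributions = {
    "n_estimators": [10, 25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=2),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=2,
)
search.fit(X_train, y_train)

train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(search.best_params_, round(train_acc, 3), round(test_acc, 3))
```

Comparing the training and test scores side by side, as above, helps show how much of the training accuracy carries over to unseen data.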

**Support Vector Machine**

- The support vector machine model had an accuracy score of 84.7% on the training set and 84.6% on the test set. 
![image.png](attachment:image.png)
- With hyperparameter tuning via GridSearchCV, the accuracy score was 84.7% on the training set and 94.3% on the test set.
![image-2.png](attachment:image-2.png)
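A grid search over SVM settings can be sketched like this. The grid values are assumptions for illustration, the data is synthetic, and wrapping the scaler and classifier in a pipeline is one possible design that keeps standardization inside each cross-validation fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=3
)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes.
pipeline = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

train_acc = search.score(X_train, y_train)
test_acc = search.score(X_test, y_test)
print(search.best_params_, round(train_acc, 3), round(test_acc, 3))
```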

**The Best Model**

A high accuracy score or F1-score is usually a good indication that a model is doing well, but in many cases like this one we need to dive deeper. These findings are intended to be read from the perspective of a business that profits from successful matches. When predicting whether or not two people will match, it is more important not to miss the opportunity to pair two people who would match than to avoid pairing two people who would not. In other words, the cost of false negatives is higher than the cost of false positives. Thus, we are looking for the highest recall.
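To make the trade-off concrete, here is a small plain-Python sketch that counts the four confusion-matrix cells and derives recall, precision, and a weighted error cost. The labels are made up, and the cost weights (a false negative costing three times a false positive) are purely hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means match."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of real matches that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted matches that were real."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def total_cost(fp, fn, fp_cost=1.0, fn_cost=3.0):
    """Weighted error cost; the weights are hypothetical."""
    return fp * fp_cost + fn * fn_cost


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

print("recall:", recall(tp, fn))
print("precision:", precision(tp, fp))
print("cost:", total_cost(fp, fn))
```

Because a model that predicts a match for every pair also reaches a recall of 1.0, precision is worth reading next to recall when comparing models.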

Both the decision tree using the entropy method and the support vector machine without hyperparameter tuning yielded perfect recall rates of 1.00 for match=1. This means that there were no false negatives predicted by the model. Both also had the same accuracy score. 

## 6. Findings and Conclusion

In terms of the importance of the six attributes, attractiveness and intelligence are definitely the most important. There were many cases in both attributes where a participant would rate their date high in those attributes, not like them at all, but still say yes to a match. In terms of attractiveness, there were many cases where a participant would say they really liked their date, rate them low in attractiveness and still say no to a match. If someone is perceived as low in these categories, they are not likely to receive a yes. A lot of people are superficial, which is not surprising considering the extremely brief amount of time they had during interactions.

Another notable finding is that most people appear extremely confident in themselves. In the survey questions where participants were asked to rate themselves, most of the answers were quite high. However, when it came to the dating event, they had a hard time determining whether or not their date would say yes to them. Perhaps this is because people are not as confident in themselves as they appear or, on the other hand, cannot express themselves well. Timing also plays a large role in this.

The best models were a tie between the decision tree using the entropy method and the support vector machine without hyperparameter tuning. They both had no false negatives with a recall rate of 1.00 in the case where matches were yes. This is very important in the scenario of speed dating because we do not want to miss out on any cases of potentially successful matches. 

**Future Steps**

It would be really interesting to gather data from mobile dating applications and create machine learning models with those features as a next project. Mobile dating differs from speed dating in that users choose whether or not they want to match with a person before any type of interaction. The choice is based purely on what the person chooses to display on their profile. I want to see how these differences influence what people value when looking for potential partners.