# 🛠️ Project Setup

In this first step, we'll prepare the workspace for the **OKCupid Date-A-Scientist** project.

## 📦 Download the Project Materials

1. Download the ZIP file for the **Date-A-Scientist** project from the following link:  
   👉 [Download Starter Files](https://content.codecademy.com/PRO/paths/data-science/OKCupid-Date-A-Scientist-Starter.zip)
2. Unzip the folder. You should see two items:
   - `date-a-scientist.ipynb` (a blank Jupyter Notebook)
   - `profiles.csv` (the dataset containing user profiles)

## 🚀 Launch Jupyter Notebook

1. Open your terminal or command line interface and navigate to the unzipped project folder.
2. Type the following command to start Jupyter Notebook:

   ```bash
   jupyter notebook
   ```

3. A browser tab will open automatically.
4. Click on `date-a-scientist.ipynb` to open the notebook.
5. Build your project inside this file.

## 🧠 How to Use Jupyter Notebook

Jupyter Notebook is an interactive tool that lets you combine code, visualizations, and explanatory text in one document. It's perfect for data science projects because it supports exploration, analysis, and storytelling.

If you need help setting up or using Jupyter Notebook, check out these resources:

- [Command Line Interface Setup](#)
- [Introducing Jupyter Notebook](#)
- [Setting up Jupyter Notebook](#)
- [Getting Started with Jupyter](#)
- [Getting More out of Jupyter Notebook](#)

---


# 🔐 Git Repository Setup with SSH and LFS

To manage version control and large files efficiently, this project uses **Git**, **SSH authentication**, and **Git LFS**.

## 📁 Local Project Path

```bash
cd code/DataScientistML_Codecademy/OKCupid-Date-A-Scientist-Starter
```

## 🧱 Initialize and Push to Remote Repository

```bash
git init
git lfs install
git add .
git commit -m "Initial commit for OKCupid Date-A-Scientist project"
git remote add origin git@personal-github:gabrielarcangelbol/OKCupid-Date-A-Scientist-Starter.git
git branch -M main
git push -u origin main
```

The host `personal-github` in the remote URL is presumably an SSH host alias defined in `~/.ssh/config`, not a real hostname.

## ⚙️ Configure `.gitattributes` for Large Files

To track CSV files with Git LFS, the following line was added to `.gitattributes`:

```
*.csv filter=lfs diff=lfs merge=lfs -text
```

The rule applies when a file is staged, so `.gitattributes` needs to exist before `profiles.csv` is added with `git add`.

The change was then committed:

```bash
git add .gitattributes
git commit -m "Configure Git LFS for CSV files"
```

## 🔐 SSH Key Setup for Personal GitHub Account

To ensure Git uses your **personal SSH key**, follow these steps:

1. Start the SSH agent:

   ```bash
   eval "$(ssh-agent -s)"
   ```

2. Add your personal SSH key:

   ```bash
   ssh-add ~/.ssh/id_rsa_personal
   ```

## 🔍 Verify Remote Configuration

To confirm the correct remote is set:

```bash
cd code/DataScientistML_Codecademy/OKCupid-Date-A-Scientist-Starter
git remote -v
```

You should see:

```
origin  git@personal-github:gabrielarcangelbol/OKCupid-Date-A-Scientist-Starter.git (fetch)
origin  git@personal-github:gabrielarcangelbol/OKCupid-Date-A-Scientist-Starter.git (push)
```

---

✅ With Git, SSH, and LFS configured, your project is ready for version-controlled development and collaboration.

# 🧭 Project Scoping

Properly scoping your project creates structure and helps you think through your entire workflow before diving into the analysis. This section outlines the key components of the project scope, inspired by the [University of Chicago’s Data Science Project Scoping Guide](https://datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/).

---

## 🎯 1. Project Goals

- **Primary Objective**: Explore the OKCupid dataset to uncover patterns in dating preferences and behaviors using machine learning and NLP techniques.
- **Secondary Goals**:
  - Identify which user attributes are most predictive of compatibility.
  - Apply clustering to discover latent user segments.
  - Build a supervised model to predict user traits or preferences based on profile text.

---

## 📊 2. Data Overview

- **Source**: `profiles.csv` from OKCupid (provided in starter files)
- **Structure**: Each row represents a user profile with multiple features including:
  - Demographics (age, sex, orientation, location)
  - Lifestyle (diet, smoking, drinking, drugs)
  - Essay responses (10 free-text fields)
- **Challenges**:
  - Missing values
  - Unstructured text (essays)
  - Potential class imbalance

---

## 🧪 3. Analytical Approach

### Exploratory Data Analysis (EDA)
- Distribution of key demographics
- Missing data patterns
- Word frequency and sentiment in essay fields

### Feature Engineering
- Text vectorization (TF-IDF, embeddings)
- Aggregated lifestyle indicators
- Derived compatibility scores
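
As a rough illustration of what TF-IDF weighting does, here is a from-scratch sketch in plain Python. The function name `tfidf` and the three sample sentences are hypothetical. The formula is the basic term frequency times `log(N / document frequency)`, which differs from the smoothed variants that libraries typically implement.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using plain TF * IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus standing in for combined essay text
sample_docs = [
    "i love hiking and cooking",
    "i love music and movies",
    "hiking every weekend",
]
sample_weights = tfidf(sample_docs)
print(sample_weights[0])
```

Terms that occur in fewer documents (such as "cooking" here) receive a higher weight than terms shared across documents (such as "love").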

### Modeling
- **Unsupervised**: Clustering (e.g., KMeans, DBSCAN) to identify user segments
- **Supervised**: Classification (e.g., logistic regression, random forest) to predict:
  - Smoking habits
  - Orientation
  - Personality traits (inferred from essays)

---

## 🧱 4. Constraints & Assumptions

- **Assumptions**:
  - Users are honest in their profiles
  - Essay content reflects personality and preferences
- **Constraints**:
  - No access to actual match outcomes
  - Limited metadata on user interactions

---

## 🔄 5. Risks & Adjustments

- NLP models may underperform due to short or noisy text
- Clustering may not yield interpretable segments
- Some hypotheses may not be supported by the data

---

## 📌 Next Steps

- Clean and preprocess the dataset
- Perform EDA and visualize key trends
- Define target variables for modeling
- Iterate on feature engineering and model evaluation

---

📚 For more guidance, refer to the full [Data Science Project Scoping Guide](https://datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/) and [Scoping Worksheet (PDF)](https://worldbank.github.io/Data-in-Action/downloads/UChicago_DSSG_Project_Scoping.pdf).



# 🤖 Select ML-Solvable Problem

In this section, we define a machine learning problem that is feasible given the dataset and project scope. The goal is to identify a task that:

- Can be solved using available data
- Is achievable within a reasonable timeframe
- Has a clear input-output structure suitable for ML
- Aligns with the broader goals of the project

---

## 🧩 Problem Statement

Many users on OKCupid consider **zodiac signs** important when evaluating compatibility. However, not all users provide their zodiac sign: about 18% of profiles in this dataset leave the `sign` field blank. This creates a gap in the matching process.

**Problem**: Can we predict a user's zodiac sign based on other profile attributes such as lifestyle habits (drinking, smoking, drug use) and free-text essay responses?

---

## 🧠 ML Framing

This is a **multi-class classification** problem with 12 possible zodiac labels.

### **Input Features**:
- Categorical: `drinks`, `smokes`, `drugs`, `job`, `education`
- Textual: `essay0` to `essay9` (to be vectorized using NLP techniques)

### **Target Variable**:
- `sign` (Zodiac sign)
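
To make the categorical inputs concrete, the following sketch one-hot encodes two of the listed columns with `pandas.get_dummies`. The three-row frame is hypothetical; it only reuses value labels such as "socially" and "sometimes" that appear in the dataset preview.

```python
import pandas as pd

# Hypothetical rows; the labels mirror values seen in the dataset preview
toy_features = pd.DataFrame({
    "drinks": ["socially", "often", "socially"],
    "smokes": ["no", "sometimes", "no"],
})

# Each distinct category becomes its own indicator column
encoded = pd.get_dummies(toy_features, columns=["drinks", "smokes"])
print(sorted(encoded.columns))
```

One-hot encoding treats categories as unordered. An ordinal encoding could be considered for columns with a natural order, such as drinking frequency.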

---

## 🔍 Why This Problem?

- The `sign` column has missing values, making it a good candidate for imputation via ML.
- Lifestyle and personality traits (expressed in essays) may correlate with astrological archetypes.
- Predicting zodiac signs could enhance match recommendations for users who omit this field.

---

## 🧪 ML Techniques to Be Used

- **Text preprocessing**: cleaning, tokenization, TF-IDF or embeddings
- **Feature encoding**: one-hot or ordinal encoding for categorical variables
- **Modeling**: 
  - Baseline: Logistic Regression or Naive Bayes
  - Advanced: Random Forest, XGBoost, or fine-tuned transformer (if time permits)
- **Evaluation**: Accuracy, F1-score, confusion matrix
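
The evaluation metrics listed above can be sketched in plain Python to show what they measure. The helper names and the four toy labels are hypothetical. In this sketch, macro F1 averages only over labels present in `y_true`, which may differ from library defaults.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels present in y_true."""
    pairs = confusion_counts(y_true, y_pred)
    scores = []
    for label in sorted(set(y_true)):
        tp = pairs[(label, label)]
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels and predictions, only to exercise the functions
y_true = ["leo", "leo", "gemini", "virgo"]
y_pred = ["leo", "gemini", "gemini", "leo"]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

With 12 roughly balanced classes, macro F1 is a useful companion to accuracy because it penalizes a model that ignores some signs entirely.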

---

## 📌 Next Steps

- Analyze class distribution of `sign`
- Explore correlations between zodiac and lifestyle features
- Preprocess essays and engineer features
- Train and evaluate classification models

---

📚 Reference: [Google’s Guide to Identifying ML-Suitable Problems](https://developers.google.com/machine-learning/problem-framing)

# 📂 Load and Check Data

Before applying any machine learning techniques, we need to load the dataset and verify that it contains a suitable **label or response variable** for supervised learning.

The dataset is stored in `profiles.csv` and includes:

### 🧾 Multiple-choice columns:
- `body_type`, `diet`, `drinks`, `drugs`, `education`, `ethnicity`, `height`, `income`, `job`, `offspring`, `orientation`, `pets`, `religion`, `sex`, `sign`, `smokes`, `speaks`, `status`

### 📝 Essay fields:
- `essay0` to `essay9`, covering topics like self-summary, lifestyle, preferences, and personality

---

## 🎯 Target Variable Check

Since our selected ML task is to **predict the user's zodiac sign**, we will check the `sign` column for:

- Presence of values
- Class distribution
- Missing data

If the `sign` column is too sparse or unreliable, we may need to revisit our problem definition.

---

## 📦 Load Dataset
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("profiles.csv")

# Display basic info
print("Shape of dataset:", df.shape)
df.info()

# Preview the first few rows
df.head()
```

---

## ✅ Next Steps

- Check for missing values in the `sign` column
- Explore distribution of zodiac signs
- Assess completeness of other relevant features (`drinks`, `smokes`, `drugs`, `essay0` to `essay9`)
- Decide whether to proceed with this target or redefine the ML problem

```python
# 📦 Load and Inspect OKCupid Dataset

import pandas as pd

# Load the dataset
df = pd.read_csv("profiles.csv")

# Basic shape and structure
print("✅ Dataset loaded successfully.")
print("🔢 Number of rows and columns:", df.shape)

# Overview of column types and missing values
print("\n📋 Dataset Info:")
df.info()

# Preview the first few rows
print("\n🔍 First 5 rows:")
df.head()
```

```
✅ Dataset loaded successfully.
🔢 Number of rows and columns: (59946, 31)

📋 Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   age          59946 non-null  int64
 1   body_type    54650 non-null  object
 2   diet         35551 non-null  object
 3   drinks       56961 non-null  object
 4   drugs        45866 non-null  object
 5   education    53318 non-null  object
 6   essay0       54458 non-null  object
 7   essay1       52374 non-null  object
 8   essay2       50308 non-null  object
 9   essay3       48470 non-null  object
 10  essay4       49409 non-null  object
 11  essay5       49096 non-null  object
 12  essay6       46175 non-null  object
 13  essay7       47495 non-null  object
 14  essay8       40721 non-null  object
 15  essay9       47343 non-null  object
 16  ethnicity    54266 non-null  obj...
```

Output of `df.head()`, shown as CSV:

```
Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\n<br />\ni would love to think...,currently working as an international agent fo...,making people laugh.<br />\nranting about a go...,"the way i look. i am a six foot half asian, ha...",...,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single
1,35,average,mostly other,often,sometimes,working on space camp,i am a chef: this is what that means.<br />\n1...,dedicating everyday to being an unbelievable b...,being silly. having ridiculous amonts of fun w...,,...,"oakland, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism but not too serious about it,m,cancer,no,"english (fluently), spanish (poorly), french (...",single
2,38,thin,anything,socially,,graduated from masters program,"i'm not ashamed of much, but writing public te...","i make nerdy software for musicians, artists, ...",improvising in different contexts. alternating...,my large jaw and large glasses are the physica...,...,"san francisco, california",,straight,has cats,,m,pisces but it doesn&rsquo;t matter,no,"english, french, c++",available
3,23,thin,vegetarian,socially,,working on college/university,i work in a library and go to school. . .,reading things written by old dead people,playing synthesizers and organizing books acco...,socially awkward but i do my best,...,"berkeley, california",doesn&rsquo;t want kids,straight,likes cats,,m,pisces,no,"english, german (poorly)",single
4,29,athletic,,socially,never,graduated from college/university,hey how's it going? currently vague on the pro...,work work work work + play,creating imagery to look at:<br />\nhttp://bag...,i smile a lot and my inquisitive nature,...,"san francisco, california",,straight,likes dogs and likes cats,,m,aquarius,no,english,single
```


```python
# 📊 Check Target Variable and Feature Completeness

# Check how many missing values are in the 'sign' column
missing_sign = df['sign'].isnull().sum()
total_rows = len(df)
print(f"❓ Missing 'sign' values: {missing_sign} out of {total_rows} ({missing_sign/total_rows:.2%})")

# View distribution of zodiac signs
print("\n🔮 Zodiac Sign Distribution:")
print(df['sign'].value_counts(dropna=False))

# Check missing values in lifestyle features
print("\n🚦 Missing values in lifestyle columns:")
print(df[['drinks', 'smokes', 'drugs']].isnull().sum())

# Check completeness of essay fields
print("\n📝 Missing values in essay fields:")
essay_cols = [f"essay{i}" for i in range(10)]
print(df[essay_cols].isnull().sum())
```

```
❓ Missing 'sign' values: 11056 out of 59946 (18.44%)

🔮 Zodiac Sign Distribution:
sign
NaN                                              11056
gemini and it&rsquo;s fun to think about          1782
scorpio and it&rsquo;s fun to think about         1772
leo and it&rsquo;s fun to think about             1692
libra and it&rsquo;s fun to think about           1649
taurus and it&rsquo;s fun to think about          1640
cancer and it&rsquo;s fun to think about          1597
pisces and it&rsquo;s fun to think about          1592
sagittarius and it&rsquo;s fun to think about     1583
virgo and it&rsquo;s fun to think about           1574
aries and it&rsquo;s fun to think about           1573
aquarius and it&rsquo;s fun to think about        1503
virgo but it doesn&rsquo;t matter                 1497
leo but it doesn&rsquo;t matter                   1457
cancer but it doesn&rsquo;t matter                1454
gemini but it doesn&rsquo;t matter                1453
taurus but it doesn&rsquo;t ...
```

# 🔍 Preliminary Data Assessment

After loading and inspecting the dataset, we identified several key insights and challenges that will guide our next steps.

## 🧠 Target Variable: `sign`

- ✅ Present in 48,890 out of 59,946 rows (~81.56%)
- ❌ Missing in ~18.44% of entries
- ✅ Contains 12 zodiac signs, but with many variations (e.g., "gemini and it’s fun to think about", "gemini but it doesn’t matter", "gemini")
- 🔧 **Action**: Normalize zodiac labels by extracting the base sign (e.g., "gemini") and discarding modifiers

## 🚦 Lifestyle Features

| Feature   | Missing Values | % Missing |
|-----------|----------------|-----------|
| `drinks`  | 2,985          | ~5.0%     |
| `smokes`  | 5,512          | ~9.2%     |
| `drugs`   | 14,080         | ~23.5%    |

- 🔧 **Action**: Consider imputing missing values or treating them as a separate category ("unknown")
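
One way the "unknown" option could look in pandas is sketched below. The three-row frame is hypothetical and only stands in for the real lifestyle columns.

```python
import pandas as pd

# Hypothetical rows standing in for the real lifestyle columns
toy_lifestyle = pd.DataFrame({
    "drinks": ["socially", None, "often"],
    "smokes": ["no", "no", None],
    "drugs": [None, "never", None],
})
lifestyle_cols = ["drinks", "smokes", "drugs"]

# Share of missing answers per column, comparable to the table above
missing_share = toy_lifestyle[lifestyle_cols].isnull().mean()
print(missing_share)

# Keep every row by turning "no answer" into an explicit category
filled = toy_lifestyle[lifestyle_cols].fillna("unknown")
print(filled["drugs"].value_counts())
```

Keeping an explicit "unknown" level avoids dropping rows and lets a model learn whether skipping a question is itself informative.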

## 📝 Essay Fields

- Missing values range from ~9% (`essay0`) to ~32% (`essay8`)
- Text fields are rich but sparse and noisy
- 🔧 **Action**:
  - Combine essays into a single text field for NLP preprocessing
  - Apply cleaning (lowercasing, punctuation removal, etc.)
  - Use TF-IDF or embeddings for feature extraction
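
The cleaning step might look roughly like the sketch below, which uses only the standard library. The function name `clean_essay` is hypothetical, and the sample string is a composite of fragments seen in the data preview (`<br />` tags and the `&rsquo;` entity). Removing apostrophes and replacing other punctuation with spaces is one possible choice among several.

```python
import html
import re

def clean_essay(text):
    """Lowercase essay text and strip HTML tags, entities, and punctuation."""
    text = html.unescape(text)                # "&rsquo;" becomes a real apostrophe
    text = re.sub(r"<[^>]+>", " ", text)      # drop tags such as <br />
    text = text.lower()
    text = re.sub(r"[’']", "", text)          # "doesn’t" becomes "doesnt"
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # other punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

# Composite sample built from fragments visible in the data preview
sample = "about me:<br />\n<br />\ni would love to think that it doesn&rsquo;t matter"
print(clean_essay(sample))
```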

## ⚠️ Other Observations

- `offspring`, `pets`, and `religion` have high missingness (>30%)
- `income` is present but may be skewed or zero-filled
- `height` and `age` are complete and usable

---

## ✅ Next Steps

1. **Normalize the `sign` column** to extract base zodiac labels
2. **Clean and combine essay fields** for NLP feature engineering
3. **Handle missing values** in lifestyle and categorical features
4. **Visualize distributions** to guide feature selection
5. **Prepare training data** for supervised classification

---

📌 These steps will help us build a robust pipeline for predicting zodiac signs based on lifestyle and personality traits.


# 🧹 Data Cleaning: Zodiac Signs and Essay Text

Before we begin exploratory analysis, we need to clean and standardize key features in the dataset. This includes:

1. Normalizing the `sign` column to extract the base zodiac label
2. Combining all essay fields into a single text column for NLP processing

These steps will ensure consistency and simplify downstream modeling.


### ✅ Step 1: Normalize the `sign` column

```python
import re
import pandas as pd

# Define a function to extract the base zodiac sign from messy strings
def extract_zodiac(sign):
    if pd.isnull(sign):
        return None
    # Use regex to find the zodiac keyword in the string
    match = re.search(r'\b(aries|taurus|gemini|cancer|leo|virgo|libra|scorpio|sagittarius|capricorn|aquarius|pisces)\b', sign.lower())
    return match.group(0) if match else None

# Apply the function to the 'sign' column
df['zodiac_clean'] = df['sign'].apply(extract_zodiac)

# Display the cleaned distribution
print("🔮 Cleaned Zodiac Sign Distribution:")
print(df['zodiac_clean'].value_counts(dropna=False))
```

```
🔮 Cleaned Zodiac Sign Distribution:
zodiac_clean
None           11056
leo             4374
gemini          4310
libra           4207
cancer          4206
virgo           4141
taurus          4140
scorpio         4134
aries           3989
pisces          3946
sagittarius     3942
aquarius        3928
capricorn       3573
Name: count, dtype: int64
```


**Explanation:**
- Many entries in the `sign` column contain modifiers like “but it doesn’t matter” or “and it’s fun to think about”.
- This function extracts only the core zodiac sign using regular expressions.
- The result is stored in a new column `zodiac_clean`.
---
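
Before modeling, it may help to note what a trivial classifier would score on the cleaned labels. The sketch below copies the counts from the cleaned distribution printed in Step 1 and computes two reference points. The variable names are illustrative.

```python
# Label counts copied from the cleaned distribution printed in Step 1
zodiac_counts = {
    "leo": 4374, "gemini": 4310, "libra": 4207, "cancer": 4206,
    "virgo": 4141, "taurus": 4140, "scorpio": 4134, "aries": 3989,
    "pisces": 3946, "sagittarius": 3942, "aquarius": 3928, "capricorn": 3573,
}

labeled_total = sum(zodiac_counts.values())
majority_label = max(zodiac_counts, key=zodiac_counts.get)

# Accuracy of always predicting the most common sign
majority_baseline = zodiac_counts[majority_label] / labeled_total
# Expected accuracy of guessing uniformly among the 12 signs
uniform_baseline = 1 / len(zodiac_counts)

print(f"Labeled profiles: {labeled_total}")
print(f"Always predicting '{majority_label}': {majority_baseline:.2%}")
print(f"Uniform random guess: {uniform_baseline:.2%}")
```

Because the classes are close to balanced, both reference points sit near 8 to 9 percent, so any model accuracy should be read against that floor.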


### ✅ Step 2: Combine all essay fields into one column

```python
# List of essay column names
essay_cols = [f"essay{i}" for i in range(10)]

# Fill missing essay values with empty strings and concatenate them
df['essays_combined'] = df[essay_cols].fillna('').agg(' '.join, axis=1)

# Preview the combined essay text for the first profile
print("📝 Sample combined essay text:")
print(df['essays_combined'].iloc[0][:500])  # Show first 500 characters
```

```
📝 Sample combined essay text:
about me:<br />
<br />
i would love to think that i was some some kind of intellectual:
either the dumbest smart guy, or the smartest dumb guy. can't say i
can tell the difference. i love to talk about ideas and concepts. i
forge odd metaphors instead of reciting cliches. like the
simularities between a friend of mine's house and an underwater
salt mine. my favorite word is salt by the way (weird choice i
know). to me most things in life are better as metaphors. i seek to
make myself a little be
```


**Explanation:**
- Essay fields are spread across `essay0` to `essay9`, each representing a different prompt.
- We fill missing values with empty strings to avoid `NaN` issues.
- Then we concatenate all essays into a single column `essays_combined` for easier NLP processing.
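
A possible follow-up is to keep only rows that can be used for supervised training: those with a cleaned label and some essay text. The sketch below uses a hypothetical four-row frame with the same column names. Note that a profile with no essays ends up as a string of spaces after the join, so the text is stripped before checking for emptiness.

```python
import pandas as pd

# Hypothetical rows mimicking the two columns created in Steps 1 and 2
toy_profiles = pd.DataFrame({
    "zodiac_clean": ["leo", None, "gemini", "virgo"],
    "essays_combined": ["i love hiking", "some text", "         ", "books and music"],
})

has_label = toy_profiles["zodiac_clean"].notna()
# Joined empty essays are whitespace only, so strip before comparing
has_text = toy_profiles["essays_combined"].str.strip().ne("")

model_df = toy_profiles[has_label & has_text].reset_index(drop=True)
print(model_df)
```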

---