# 🛠️ Project Setup

In this first step, we'll prepare the workspace for the **OKCupid Date-A-Scientist** project.

## 📦 Download the Project Materials

1. Download the ZIP file for the **Date-A-Scientist** project from the following link:  
   👉 [Download Starter Files](https://content.codecademy.com/PRO/paths/data-science/OKCupid-Date-A-Scientist-Starter.zip)
2. Unzip the folder. You should see two items:
   - `date-a-scientist.ipynb` (a blank Jupyter Notebook)
   - `profiles.csv` (the dataset containing user profiles)
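
If you would rather script this step, the following is a minimal sketch that extracts the archive and checks that both files are present. The helper names `extract_starter` and `missing_starter_files` are hypothetical, and the search through subfolders assumes the archive may unpack into a nested directory.

```python
from pathlib import Path
import zipfile

# File names listed in the setup step above.
EXPECTED = ("date-a-scientist.ipynb", "profiles.csv")

def extract_starter(zip_path, dest):
    """Extract the starter archive into dest and return dest as a Path."""
    dest = Path(dest)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest

def missing_starter_files(folder):
    """Return the expected file names not found anywhere under folder."""
    found = {p.name for p in Path(folder).rglob("*") if p.is_file()}
    return [name for name in EXPECTED if name not in found]

# Hypothetical usage; adjust both paths to wherever you saved the download:
# folder = extract_starter("OKCupid-Date-A-Scientist-Starter.zip", "okcupid")
# print(missing_starter_files(folder))
```

An empty list from `missing_starter_files` means both starter files were found.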

## 🚀 Launch Jupyter Notebook

1. Open your terminal or command-line interface and navigate to the unzipped project folder.
2. Type the following command to start Jupyter Notebook:

   ```bash
   jupyter notebook
   ```

3. A browser tab will open automatically.
4. Click on `date-a-scientist.ipynb` to open the notebook.
5. Build your project inside this file.

## 🧠 How to Use Jupyter Notebook

Jupyter Notebook is an interactive tool that lets you combine code, visualizations, and explanatory text in one document. It's perfect for data science projects because it supports exploration, analysis, and storytelling.

If you need help setting up or using Jupyter Notebook, check out these resources:

- [Command Line Interface Setup](#)
- [Introducing Jupyter Notebook](#)
- [Setting up Jupyter Notebook](#)
- [Getting Started with Jupyter](#)
- [Getting More out of Jupyter Notebook](#)

---


# 🔐 Git Repository Setup with SSH and LFS

To manage version control and large files efficiently, this project uses **Git**, **SSH authentication**, and **Git LFS**.

## 📁 Local Project Path

```bash
cd code/DataScientistML_Codecademy/OKCupid-Date-A-Scientist-Starter
```

## 🧱 Initialize and Push to Remote Repository

```bash
git init
git lfs install
git add .
git commit -m "Initial commit for OKCupid Date-A-Scientist project"
git remote add origin git@personal-github:gabrielarcangelbol/OKCupid-Date-A-Scientist-Starter.git
git branch -M main
git push -u origin main
```

## ⚙️ Configure `.gitattributes` for Large Files

To track CSV files with Git LFS, add the following line to `.gitattributes`. Set this up before the first `git add .`; a CSV committed earlier stays in history as a regular Git object.

```
*.csv filter=lfs diff=lfs merge=lfs -text
```

Then commit the change:

```bash
git add .gitattributes
git commit -m "Configure Git LFS for CSV files"
```

## 🔐 SSH Key Setup for Personal GitHub Account

To ensure Git uses your **personal SSH key**, follow these steps. The remote URL uses the host alias `personal-github`, which assumes a matching `Host personal-github` entry in `~/.ssh/config` that points to this key.

1. Start the SSH agent:

   ```bash
   eval "$(ssh-agent -s)"
   ```

2. Add your personal SSH key:

   ```bash
   ssh-add ~/.ssh/id_rsa_personal
   ```

## 🔍 Verify Remote Configuration

To confirm the correct remote is set:

```bash
cd code/DataScientistML_Codecademy/OKCupid-Date-A-Scientist-Starter
git remote -v
```

You should see:

```
origin  git@personal-github:gabrielarcangelbol/OKCupid-Date-A-Scientist-Starter.git (fetch)
origin  git@personal-github:gabrielarcangelbol/OKCupid-Date-A-Scientist-Starter.git (push)
```

---

✅ With Git, SSH, and LFS configured, your project is ready for version-controlled development and collaboration.

# 🧭 Project Scoping

Properly scoping your project creates structure and helps you think through your entire workflow before diving into the analysis. This section outlines the key components of the project scope, inspired by the [University of Chicago’s Data Science Project Scoping Guide](https://datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/).

---

## 🎯 1. Project Goals

- **Primary Objective**: Explore the OKCupid dataset to uncover patterns in dating preferences and behaviors using machine learning and NLP techniques.
- **Secondary Goals**:
  - Identify which user attributes are most predictive of compatibility.
  - Apply clustering to discover latent user segments.
  - Build a supervised model to predict user traits or preferences based on profile text.

---

## 📊 2. Data Overview

- **Source**: `profiles.csv` from OKCupid (provided in starter files)
- **Structure**: Each row represents a user profile with multiple features including:
  - Demographics (age, sex, orientation, location)
  - Lifestyle (diet, smoking, drinking, drugs)
  - Essay responses (10 free-text fields)
- **Challenges**:
  - Missing values
  - Unstructured text (essays)
  - Potential class imbalance
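
A quick way to size up the missing-value challenge is to compute the share of missing entries per column. The sketch below assumes pandas is installed and uses a tiny invented sample in place of `profiles.csv` so it runs on its own; `missing_report` is a hypothetical helper name.

```python
import io

import pandas as pd

# Tiny stand-in for profiles.csv so the sketch runs anywhere; values are invented.
SAMPLE_CSV = """drinks,smokes,drugs,sign,essay0
socially,no,never,leo,i like hiking
often,,never,,
socially,no,,gemini,coffee and books
"""

def missing_report(df):
    """Return the fraction of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)

# In the notebook this would be pd.read_csv("profiles.csv").
profiles = pd.read_csv(io.StringIO(SAMPLE_CSV))
report = missing_report(profiles)
print(report)
```

Columns near the top of the report are the ones that will need imputation, a dedicated "missing" category, or exclusion.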

---

## 🧪 3. Analytical Approach

### Exploratory Data Analysis (EDA)
- Distribution of key demographics
- Missing data patterns
- Word frequency and sentiment in essay fields
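
For the word-frequency part of EDA, a dependency-free sketch could look like the following. The stop-word list is a small hand-picked set for illustration only, and `top_words` is a hypothetical helper; a real analysis would likely use a fuller stop-word list.

```python
import re
from collections import Counter

# A small, hand-picked stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "i", "in", "is", "my", "of", "the", "to"}

def top_words(texts, n=5):
    """Count lowercase word tokens across texts, skipping stop words and missing entries."""
    counts = Counter()
    for text in texts:
        if not isinstance(text, str):  # missing essays arrive as None or NaN
            continue
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Invented essays; in the notebook you might pass profiles["essay0"] instead.
essays = ["I love hiking and coffee", None, "Coffee, books, and more coffee"]
print(top_words(essays, n=3))
```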

### Feature Engineering
- Text vectorization (TF-IDF, embeddings)
- Aggregated lifestyle indicators
- Derived compatibility scores
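
The TF-IDF option could be sketched as follows, assuming pandas and scikit-learn are available. The choice to merge all essay columns into one string per profile, the `combine_essays` helper, and the `max_features` value are illustrative assumptions rather than decisions made in this project.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

ESSAY_COLS = [f"essay{i}" for i in range(10)]

def combine_essays(df, cols=ESSAY_COLS):
    """Join the available essay columns into one string per profile."""
    present = [c for c in cols if c in df.columns]
    return df[present].fillna("").agg(" ".join, axis=1).str.strip()

# Invented rows with only two essay columns, to keep the example small.
profiles = pd.DataFrame({
    "essay0": ["i love hiking", None, "coffee and books"],
    "essay1": ["and coffee", "music all day", None],
})
text = combine_essays(profiles)

# Each row of the sparse matrix is one profile; each column is one term.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
tfidf = vectorizer.fit_transform(text)
print(tfidf.shape)
```

The resulting sparse matrix can be passed directly to most scikit-learn estimators.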

### Modeling
- **Unsupervised**: Clustering (e.g., KMeans, DBSCAN) to identify user segments
- **Supervised**: Classification (e.g., logistic regression, random forest) to predict:
  - Smoking habits
  - Orientation
  - Personality traits (inferred from essays)
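
As a rough illustration of the unsupervised branch, the sketch below one-hot encodes three lifestyle columns and runs KMeans with scikit-learn. The rows are invented, and `n_clusters=2` is an arbitrary choice for the toy data, not a recommendation for the real dataset.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented lifestyle answers; the real values come from profiles.csv.
profiles = pd.DataFrame({
    "drinks": ["socially", "often", "socially", "often"],
    "smokes": ["no", "yes", "no", "yes"],
    "drugs": ["never", "sometimes", "never", "sometimes"],
})

# One-hot encode the categorical answers so distances are defined.
features = pd.get_dummies(profiles, dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(features)
print(segments)
```

Profiles with identical answers land in the same segment; on real data, the number of clusters would need to be chosen with a diagnostic such as the silhouette score.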

---

## 🧱 4. Constraints & Assumptions

- **Assumptions**:
  - Users are honest in their profiles
  - Essay content reflects personality and preferences
- **Constraints**:
  - No access to actual match outcomes
  - Limited metadata on user interactions

---

## 🔄 5. Risks & Adjustments

- NLP models may underperform due to short or noisy text
- Clustering may not yield interpretable segments
- Some hypotheses may not be supported by the data

---

## 📌 Next Steps

- Clean and preprocess the dataset
- Perform EDA and visualize key trends
- Define target variables for modeling
- Iterate on feature engineering and model evaluation

---

📚 For more guidance, refer to the full [Data Science Project Scoping Guide](https://datasciencepublicpolicy.org/our-work/tools-guides/data-science-project-scoping-guide/) and [Scoping Worksheet (PDF)](https://worldbank.github.io/Data-in-Action/downloads/UChicago_DSSG_Project_Scoping.pdf).



# 🤖 Select ML-Solvable Problem

In this section, we define a machine learning problem that is feasible given the dataset and project scope. The goal is to identify a task that:

- Can be solved using available data
- Is achievable within a reasonable timeframe
- Has a clear input-output structure suitable for ML
- Aligns with the broader goals of the project

---

## 🧩 Problem Statement

Many users on OKCupid consider **zodiac signs** important when evaluating compatibility. However, not all users provide their zodiac sign in their profile, which leaves a gap in the matching process.

**Problem**: Can we predict a user's zodiac sign based on other profile attributes such as lifestyle habits (drinking, smoking, drug use) and free-text essay responses?

---

## 🧠 ML Framing

This is a **multi-class classification** problem with 12 possible zodiac labels.

### **Input Features**:
- Categorical: `drinks`, `smokes`, `drugs`, `job`, `education`
- Textual: `essay0` to `essay9` (to be vectorized using NLP techniques)

### **Target Variable**:
- `sign` (Zodiac sign)
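
The framing above can be turned into a feature table `X` and a target `y` along these lines. This is a sketch under two assumptions that are not confirmed by the text: that rows without a `sign` are dropped from training, and that the sign name is the first word of the `sign` value. `build_dataset` is a hypothetical helper.

```python
import pandas as pd

CATEGORICAL = ["drinks", "smokes", "drugs", "job", "education"]
ESSAYS = [f"essay{i}" for i in range(10)]

def build_dataset(df):
    """Split profiles into features X and target y, keeping only rows with a sign."""
    labeled = df.dropna(subset=["sign"]).copy()
    # Assumption: the sign name is the first word; anything after it is ignored.
    y = labeled["sign"].str.lower().str.split().str[0]
    essays = [c for c in ESSAYS if c in labeled.columns]
    X = labeled[[c for c in CATEGORICAL if c in labeled.columns]].copy()
    # Merge all available essays into a single text feature.
    X["essays"] = labeled[essays].fillna("").agg(" ".join, axis=1)
    return X, y

# Invented rows; the second one has no sign and is excluded.
profiles = pd.DataFrame({
    "drinks": ["socially", "often", None],
    "smokes": ["no", "yes", "no"],
    "sign": ["Leo", None, "gemini and proud"],
    "essay0": ["i love hiking", "music", None],
})
X, y = build_dataset(profiles)
print(y.tolist())
```

The rows that were dropped here are exactly the ones a trained model would later be asked to fill in.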

---

## 🔍 Why This Problem?

- The `sign` column has missing values, making it a good candidate for imputation via ML.
- Lifestyle and personality traits (expressed in essays) may correlate with astrological archetypes.
- Predicting zodiac signs could enhance match recommendations for users who omit this field.

---

## 🧪 ML Techniques to Be Used

- **Text preprocessing**: cleaning, tokenization, TF-IDF or embeddings
- **Feature encoding**: one-hot or ordinal encoding for categorical variables
- **Modeling**: 
  - Baseline: Logistic Regression or Naive Bayes
  - Advanced: Random Forest, XGBoost, or fine-tuned transformer (if time permits)
- **Evaluation**: Accuracy, F1-score, confusion matrix
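
One way to wire the baseline together is a scikit-learn pipeline that one-hot encodes the categorical columns, applies TF-IDF to the essay text, and fits logistic regression. The sketch below uses invented rows with only two signs, and a merged `essays` column that is an assumption of this example. It scores on the training rows purely to show the metric calls; a real evaluation needs a held-out split, and missing categorical values would need filling first.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def make_model(categorical, text_col):
    """Baseline: one-hot categoricals plus TF-IDF text feeding logistic regression."""
    features = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        # A string (not a list) selects a 1-D column, which TfidfVectorizer expects.
        ("txt", TfidfVectorizer(), text_col),
    ])
    return Pipeline([("features", features),
                     ("clf", LogisticRegression(max_iter=1000))])

# Invented toy rows with two signs only, so the sketch runs quickly.
train = pd.DataFrame({
    "drinks": ["socially", "often", "socially", "often"],
    "smokes": ["no", "yes", "no", "yes"],
    "essays": ["hiking and coffee", "late night music",
               "coffee and books", "music festivals"],
    "sign": ["leo", "gemini", "leo", "gemini"],
})
feature_cols = ["drinks", "smokes", "essays"]

model = make_model(["drinks", "smokes"], "essays")
model.fit(train[feature_cols], train["sign"])
# Scored on the training rows only to demonstrate the metric calls.
pred = model.predict(train[feature_cols])

labels = sorted(train["sign"].unique())
print("accuracy:", accuracy_score(train["sign"], pred))
print("macro F1:", f1_score(train["sign"], pred, average="macro"))
print(confusion_matrix(train["sign"], pred, labels=labels))
```

Macro-averaged F1 weights every sign equally, which makes it a useful companion to accuracy when some signs are rarer than others.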

---

## 📌 Next Steps

- Analyze class distribution of `sign`
- Explore correlations between zodiac and lifestyle features
- Preprocess essays and engineer features
- Train and evaluate classification models
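
Checking the class distribution also gives a reference point for later results: the accuracy of always predicting the most common sign. If all 12 signs were equally common, that would be 1/12, or about 8.3%. The sketch below uses invented labels, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Invented labels; in the notebook these would be the cleaned `sign` values.
signs = ["leo", "gemini", "leo", "virgo", "leo", "gemini"]
print(Counter(signs))
print(majority_baseline(signs))  # ('leo', 0.5)
```

A trained model is only informative if it beats this baseline on held-out data.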

---

📚 Reference: [Google’s Guide to Identifying ML-Suitable Problems](https://developers.google.com/machine-learning/problem-framing)