# **Notebook 01: Data Collection**

## Objectives
- Download the leads dataset from Kaggle
- Load the data into a pandas DataFrame
- Perform initial inspection of the dataset
- Save raw data to outputs folder for version control

## Inputs
- Kaggle dataset: [Leads Dataset](https://www.kaggle.com/datasets/ashydv/leads-dataset)

## Outputs
- `outputs/datasets/collection/leads_raw.csv`

---

## Change working directory

Change the working directory from the jupyter_notebooks folder to the project root.

In [1]:
import os

# Get current directory
current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

# Change to parent directory (project root)
os.chdir(os.path.dirname(current_dir))
print(f"New directory: {os.getcwd()}")

Current directory: /Users/anthony/Downloads/Project-5-main/jupyter_notebooks
New directory: /Users/anthony/Downloads/Project-5-main
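
The cell above moves up one level every time it runs, so re-running it would leave the project root. Below is a minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the output above). The helper name `project_root_from` is hypothetical and not part of the original notebook.

```python
import os


def project_root_from(current_dir: str, notebooks_folder: str = "jupyter_notebooks") -> str:
    """Return the parent directory only when current_dir is the notebooks folder."""
    normalised = os.path.normpath(current_dir)
    if os.path.basename(normalised) == notebooks_folder:
        return os.path.dirname(normalised)
    # Not inside the notebooks folder: leave the path unchanged
    return current_dir


# Usage inside the notebook (safe to re-run):
# os.chdir(project_root_from(os.getcwd()))
```

With this guard, a second execution of the cell finds that the current directory is no longer `jupyter_notebooks` and does nothing.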


---

## Import Libraries

In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

---

## Load Dataset

### Option 1: Load from Kaggle API

If you have the Kaggle API configured, you can download directly:

```python
# Install kaggle if needed: pip install kaggle
# Ensure ~/.kaggle/kaggle.json is configured
!kaggle datasets download -d ashydv/leads-dataset
!unzip leads-dataset.zip -d inputs/datasets/raw/
```
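
The `unzip` command is not installed on every system (a default Windows setup, for example). As a sketch, the same extraction step can be done with Python's standard `zipfile` module, assuming the archive was downloaded to the working directory as `leads-dataset.zip`. The helper name `extract_archive` is hypothetical.

```python
import os
import zipfile


def extract_archive(zip_path: str, target_dir: str) -> list:
    """Extract every member of zip_path into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Usage, assuming the download step above has already run:
# extract_archive("leads-dataset.zip", "inputs/datasets/raw/")
```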

### Option 2: Load from local file (Manual Download)

1. Download from: https://www.kaggle.com/datasets/ashydv/leads-dataset
2. Place the CSV file in `inputs/datasets/raw/`

In [3]:
# Load the dataset
# Adjust the path based on your download method
df = pd.read_csv('inputs/datasets/raw/Leads.csv')

print("Dataset loaded successfully!")
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")

Dataset loaded successfully!
Shape: 9240 rows x 37 columns
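
If the CSV has not been placed in `inputs/datasets/raw/`, the load cell fails with a `FileNotFoundError` that does not say how to fix the problem. The sketch below shows one possible pre-check that raises a more helpful message. The helper name `require_file` is hypothetical.

```python
from pathlib import Path


def require_file(path: str) -> Path:
    """Return path as a Path object, or raise with a hint if the file is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(
            f"{file_path} not found. Download the dataset (Option 1 or 2) "
            "and place the CSV in inputs/datasets/raw/."
        )
    return file_path


# Usage before loading:
# df = pd.read_csv(require_file("inputs/datasets/raw/Leads.csv"))
```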


---

## Initial Data Inspection

### Dataset Shape

In [4]:
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

Number of rows: 9240
Number of columns: 37


### First Few Rows

In [5]:
df.head()

| | Prospect ID | Lead Number | Lead Origin | Lead Source | Do Not Email | Do Not Call | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | ... | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | Asymmetrique Activity Score | Asymmetrique Profile Score | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | 660737 | API | Olark Chat | No | No | 0 | 0.0 | 0 | 0.0 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Modified |
| 1 | 2a272436-5132-4136-86fa-dcc88c88f482 | 660728 | API | Organic Search | No | No | 0 | 5.0 | 674 | 2.5 | ... | No | Select | Select | 02.Medium | 02.Medium | 15.0 | 15.0 | No | No | Email Opened |
| 2 | 8cc8c611-a219-4f35-ad23-fdfd2656bd8a | 660727 | Landing Page Submission | Direct Traffic | No | No | 1 | 2.0 | 1532 | 2.0 | ... | No | Potential Lead | Mumbai | 02.Medium | 01.High | 14.0 | 20.0 | No | Yes | Email Opened |
| 3 | 0cc2df48-7cf4-4e39-9de9-19797f9b38cc | 660719 | Landing Page Submission | Direct Traffic | No | No | 0 | 1.0 | 305 | 1.0 | ... | No | Select | Mumbai | 02.Medium | 01.High | 13.0 | 17.0 | No | No | Modified |
| 4 | 3256f628-e534-4826-9d63-4a8b88782852 | 660681 | Landing Page Submission | Google | No | No | 1 | 2.0 | 1428 | 1.0 | ... | No | Select | Mumbai | 02.Medium | 01.High | 15.0 | 18.0 | No | No | Modified |


### Data Types

In [6]:
df.dtypes

Prospect ID                                       object
Lead Number                                        int64
Lead Origin                                       object
Lead Source                                       object
Do Not Email                                      object
Do Not Call                                       object
Converted                                          int64
TotalVisits                                      float64
Total Time Spent on Website                        int64
Page Views Per Visit                             float64
Last Activity                                     object
Country                                           object
Specialization                                    object
How did you hear about X Education                object
What is your current occupation                   object
What matters most to you in choosing a course     object
Search                                            object
Magazine                       

### Column Names

In [7]:
print("All columns in the dataset:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2}. {col}")

All columns in the dataset:
 1. Prospect ID
 2. Lead Number
 3. Lead Origin
 4. Lead Source
 5. Do Not Email
 6. Do Not Call
 7. Converted
 8. TotalVisits
 9. Total Time Spent on Website
10. Page Views Per Visit
11. Last Activity
12. Country
13. Specialization
14. How did you hear about X Education
15. What is your current occupation
16. What matters most to you in choosing a course
17. Search
18. Magazine
19. Newspaper Article
20. X Education Forums
21. Newspaper
22. Digital Advertisement
23. Through Recommendations
24. Receive More Updates About Our Courses
25. Tags
26. Lead Quality
27. Update me on Supply Chain Content
28. Get updates on DM Content
29. Lead Profile
30. City
31. Asymmetrique Activity Index
32. Asymmetrique Profile Index
33. Asymmetrique Activity Score
34. Asymmetrique Profile Score
35. I agree to pay the amount through cheque
36. A free copy of Mastering The Interview
37. Last Notable Activity


### Statistical Summary

In [8]:
# Numerical columns
df.describe()

| | Lead Number | Converted | TotalVisits | Total Time Spent on Website | Page Views Per Visit | Asymmetrique Activity Score | Asymmetrique Profile Score |
|---|---|---|---|---|---|---|---|
| count | 9240.0 | 9240.0 | 9103.0 | 9240.0 | 9103.0 | 5022.0 | 5022.0 |
| mean | 617188.435606 | 0.38539 | 3.445238 | 487.698268 | 2.36282 | 14.306252 | 16.344883 |
| std | 23405.995698 | 0.486714 | 4.854853 | 548.021466 | 2.161418 | 1.386694 | 1.811395 |
| min | 579533.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 11.0 |
| 25% | 596484.5 | 0.0 | 1.0 | 12.0 | 1.0 | 14.0 | 15.0 |
| 50% | 615479.0 | 0.0 | 3.0 | 248.0 | 2.0 | 14.0 | 16.0 |
| 75% | 637387.25 | 1.0 | 5.0 | 936.0 | 3.0 | 15.0 | 18.0 |
| max | 660737.0 | 1.0 | 251.0 | 2272.0 | 55.0 | 18.0 | 20.0 |


In [9]:
# Categorical columns
df.describe(include='object')

| | Prospect ID | Lead Origin | Lead Source | Do Not Email | Do Not Call | Last Activity | Country | Specialization | How did you hear about X Education | What is your current occupation | ... | Lead Quality | Update me on Supply Chain Content | Get updates on DM Content | Lead Profile | City | Asymmetrique Activity Index | Asymmetrique Profile Index | I agree to pay the amount through cheque | A free copy of Mastering The Interview | Last Notable Activity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9240 | 9240 | 9204 | 9240 | 9240 | 9137 | 6779 | 7802 | 7033 | 6550 | ... | 4473 | 9240 | 9240 | 6531 | 7820 | 5022 | 5022 | 9240 | 9240 | 9240 |
| unique | 9240 | 5 | 21 | 2 | 2 | 17 | 38 | 19 | 10 | 6 | ... | 5 | 1 | 1 | 6 | 7 | 3 | 3 | 1 | 2 | 16 |
| top | 7927b2df-8bba-4d29-b9a2-b6e0beafe620 | Landing Page Submission | Google | No | No | Email Opened | India | Select | Select | Unemployed | ... | Might be | No | No | Select | Mumbai | 02.Medium | 02.Medium | No | No | Modified |
| freq | 1 | 4886 | 2868 | 8506 | 9238 | 3437 | 6492 | 1942 | 5043 | 5600 | ... | 1560 | 9240 | 9240 | 4146 | 3222 | 3839 | 2788 | 9240 | 6352 | 3407 |


### Target Variable Distribution

In [10]:
print("Target Variable (Converted) Distribution:")
print(df['Converted'].value_counts())
print(f"\nConversion Rate: {df['Converted'].mean():.1%}")

Target Variable (Converted) Distribution:
0    5679
1    3561
Name: Converted, dtype: int64

Conversion Rate: 38.5%
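
The printed rate follows directly from the counts: 3,561 converted leads out of 9,240. The short check below reproduces that arithmetic and also gives the ratio of non-converted to converted leads, which is one common way to express class balance. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above
not_converted = 5679
converted = 3561

total = not_converted + converted
conversion_rate = converted / total
imbalance_ratio = not_converted / converted

print(f"Total leads: {total}")                               # 9240
print(f"Conversion rate: {conversion_rate:.1%}")             # 38.5%
print(f"Negative:positive ratio: {imbalance_ratio:.2f}:1")   # 1.59:1
```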


### Missing Values Overview

In [11]:
# Check for missing values
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
}).sort_values('Missing %', ascending=False)

print("Missing Values Summary:")
print(missing_df[missing_df['Missing Count'] > 0])

Missing Values Summary:
                                               Missing Count  Missing %
Lead Quality                                            4767      51.59
Asymmetrique Activity Index                             4218      45.65
Asymmetrique Profile Score                              4218      45.65
Asymmetrique Activity Score                             4218      45.65
Asymmetrique Profile Index                              4218      45.65
Tags                                                    3353      36.29
Lead Profile                                            2709      29.32
What matters most to you in choosing a course           2709      29.32
What is your current occupation                         2690      29.11
Country                                                 2461      26.63
How did you hear about X Education                      2207      23.89
Specialization                                          1438      15.56
City                                    

### Check for 'Select' Values (Proxy for Null)

This dataset uses 'Select' as a placeholder when the user did not choose an option in a form field. These values carry no information and should be treated as missing data.

In [12]:
# Check for 'Select' values in categorical columns
print("Columns with 'Select' values:")
for col in df.select_dtypes(include='object').columns:
    if 'Select' in df[col].values:
        count = (df[col] == 'Select').sum()
        pct = (count / len(df) * 100)
        print(f"  {col}: {count} ({pct:.1f}%)")

Columns with 'Select' values:
  Specialization: 1942 (21.0%)
  How did you hear about X Education: 5043 (54.6%)
  Lead Profile: 4146 (44.9%)
  City: 2249 (24.3%)
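
One way to act on this is to convert 'Select' to `NaN` so that it is counted together with the genuinely missing values. The sketch below uses a small made-up frame rather than the leads data, so the values and resulting percentages are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the leads dataset (values are made up)
toy = pd.DataFrame({
    "City": ["Mumbai", "Select", None, "Select"],
    "Lead Profile": ["Select", "Potential Lead", "Select", None],
})

# Treat the 'Select' placeholder as a missing value
cleaned = toy.replace("Select", np.nan)

# Percentage of missing values per column after the replacement
missing_pct = (cleaned.isnull().mean() * 100).round(2)
print(missing_pct)
```

Applied to the real data, the same `replace` call on `df` would raise the missing percentages of the four columns listed above by the 'Select' shares shown.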


---

## Save Raw Dataset

Save the raw dataset to the outputs folder for version control and reproducibility.

In [13]:
import os

# Create output directory if it doesn't exist
output_path = 'outputs/datasets/collection'
os.makedirs(output_path, exist_ok=True)

# Save raw dataset
df.to_csv(f'{output_path}/leads_raw.csv', index=False)
print(f"Raw dataset saved to: {output_path}/leads_raw.csv")

Raw dataset saved to: outputs/datasets/collection/leads_raw.csv
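
Because the saved file is meant to support reproducibility, it can help to record a checksum so later notebooks can confirm they are reading the same bytes. This is a minimal sketch using only the standard library. The helper name `file_sha256` is hypothetical and not part of the original notebook.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage after saving:
# print(file_sha256("outputs/datasets/collection/leads_raw.csv"))
```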


---

## Conclusions

### Dataset Summary
- **Records:** 9,240 leads
- **Features:** 37 columns
- **Target Variable:** Converted (binary: 0/1)
- **Baseline Conversion Rate:** ~38.5%

### Key Observations
1. The dataset contains a mix of numerical and categorical features
2. Several columns have substantial missing values that need to be addressed (for example, Lead Quality at 51.59% and the four Asymmetrique columns at 45.65% each)
3. 'Select' values are used as placeholders and should be treated as missing
4. Moderate class imbalance exists (38.5% positive class); this should be taken into account during modelling

### Next Steps
- Proceed to Notebook 02 for data cleaning
- Handle missing values and 'Select' placeholders
- Investigate each feature in more detail during EDA