<a href="https://www.kaggle.com/code/abbas829/pandas-01b?scriptVersionId=292637312" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🔍 The Data Detective: Mastering Exploratory Data Analysis with Pandas

## Uncovering Hidden Insights in User Behavior using Python

**Mission Briefing:** In the fast-paced world of tech startups, data is the ultimate evidence. You've just been hired as the **Lead Data Investigator**. Your first assignment? A mysterious dataset containing raw user profiles. Your boss needs to know *who* these users are before the marketing team can launch their next big campaign.

As the **Data Detective**, you will use **Pandas**, the de facto standard data-analysis library in the Python ecosystem, to forensically examine this data, identify patterns, and solve the mystery of your user base.

---

## 🛠️ Step 1: Gathering the Forensic Toolkit

Every investigator needs a reliable set of tools. For Python-based data forensics, we rely on the industry standards: `Pandas` for manipulation and `NumPy` for numerical operations.

> [!TIP]
> Standard convention is to import pandas as `pd` and numpy as `np`. This keeps our code clean and matches universal documentation styles.

In [None]:
import numpy as np
import pandas as pd

## 📍 Step 2: Locating the Evidence (Data Acquisition)

Our intelligence report points to a remote server where the raw logs are stored. We need to define the source URL before we can begin the extraction.

*Action: Define the source URI for the user dataset.*

In [None]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'

## 📁 Step 3: Opening the Case File

Now, we ingest the data into a Pandas **DataFrame**. We've noticed the file uses a pipe `|` delimiter rather than the standard comma, and we want to use the `user_id` as our primary key (index).

```mermaid
graph LR
A[Raw Data Link] --> B{Pandas Read}
B --> C[Structured DataFrame]
C --> D[Data Detective Dashboard]
```

> [!IMPORTANT]
> Using `index_col` during the loading phase sets the index in a single step, so we don't need a separate `set_index` call afterwards.

In [None]:
users = pd.read_csv(url, sep='|', index_col='user_id')
users.head()
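
To see what `sep` and `index_col` do without touching the network, here is a minimal sketch that parses two made-up rows from an in-memory string. The rows in `sample_text` are hypothetical and only mimic the column layout described above. The `dtype` argument is an optional extra, not part of the original cell: it keeps ZIP codes as text so leading zeros survive.

```python
import io

import pandas as pd

# Hypothetical rows that mimic the pipe-delimited layout; not real user data.
sample_text = """user_id|age|gender|occupation|zip_code
101|34|F|engineer|02139
102|27|M|student|94110
"""

# sep='|' splits fields on pipes; index_col='user_id' promotes that column to the index.
# dtype={'zip_code': str} (an optional addition) stops '02139' from becoming 2139.
sample = pd.read_csv(
    io.StringIO(sample_text),
    sep='|',
    index_col='user_id',
    dtype={'zip_code': str},
)

print(sample.index.name)            # user_id
print(list(sample.columns))         # ['age', 'gender', 'occupation', 'zip_code']
print(sample.loc[101, 'zip_code'])  # 02139
```

Rows are then looked up by their `user_id` label with `.loc`, as in the last line.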

## 👀 Step 4: First Glance at the Suspects

A detective never dives in without a quick overview. We'll examine the first 25 entries to identify visible patterns or potential anomalies in the data entry.

| Feature | Description |
| :--- | :--- |
| `age` | User's age in years |
| `gender` | M/F designation |
| `occupation` | Self-reported job title |
| `zip_code` | Geographic location marker |

In [None]:
users.head(25)

## 🔍 Step 5: Checking the Trail's End

Does the data remain consistent until the very last row? We check the final 10 entries to verify there are no trailing errors or mismatched formats.

*Action: Inspect the tail of the dataset.*

In [None]:
users.tail(10)

## 📊 Step 6: Measuring the Population Scale

How large is our investigation? The number of observations determines how much confidence we can place in any statistic we compute from them.

**Metric:** Total Number of Users

In [None]:
print(f'Total Users Investigated: {len(users)}')

## 🏷️ Step 7: Inventorying User Features

What specific data points do we have on each suspect? Counting the columns allows us to plan our multidimensional analysis.

*Action: Count and list the feature set.*

In [None]:
print(f'Number of Features: {len(users.columns)}')
print(f'Feature Names: {list(users.columns)}')
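
Row and column counts can also be read together from the `shape` attribute. The sketch below uses a tiny hypothetical stand-in for `users` (the name `sample` and its values are invented for illustration) to show that `shape` agrees with the two `len` calls and that the index is not counted as a column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `users` DataFrame.
sample = pd.DataFrame(
    {
        'age': [34, 27, 45],
        'gender': ['F', 'M', 'F'],
        'occupation': ['engineer', 'student', 'writer'],
    },
    index=pd.Index([101, 102, 103], name='user_id'),
)

# shape is a (rows, columns) tuple; the user_id index is not counted as a column.
n_rows, n_cols = sample.shape

print(n_rows, n_cols)                    # 3 3
print(n_rows == len(sample))             # True
print(n_cols == len(sample.columns))     # True
```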

## 🧬 Step 8: Analyzing Data DNA (Types)

Understanding data types (`dtypes`) is like checking the DNA of your evidence. Are we looking at integers, text (objects), or categories?

> [!NOTE]
> In Pandas, text columns are typically listed as `object`. A text column with only a few distinct values, such as `occupation`, can be converted to `category` to save memory.

In [None]:
users.dtypes
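
The memory effect of the `category` dtype is easy to check on a small scale. This is a rough sketch on invented data (3,000 rows drawn from three hypothetical job titles), not a measurement of the real dataset; actual savings depend on how many distinct values a column holds.

```python
import pandas as pd

# Hypothetical data: many rows, only three distinct occupations.
sample = pd.DataFrame({'occupation': ['student', 'engineer', 'writer'] * 1000})

as_text = sample['occupation']
as_category = sample['occupation'].astype('category')

# deep=True includes the memory held by the strings themselves, not just references.
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(as_category.dtype)  # category
print(text_bytes, category_bytes)
print(category_bytes < text_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it pays off mainly for repetitive columns.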

## 👮 Step 9: Deep Dive into Occupations

The boss is curious about the professional background of our users. Let's isolate the `occupation` column for specialized analysis.

*Action: Extract and view the occupation profile.*

In [None]:
users.occupation
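
Attribute access is one of several ways to isolate a column. The sketch below, on a two-row hypothetical DataFrame, compares it with bracket access: single brackets return a `Series`, while a list of names returns a one-column `DataFrame`. Bracket access also works for column names that contain spaces or clash with DataFrame methods.

```python
import pandas as pd

# Hypothetical two-row frame for illustration.
sample = pd.DataFrame({'age': [34, 27], 'occupation': ['engineer', 'student']})

as_series = sample['occupation']   # single brackets -> Series
as_frame = sample[['occupation']]  # list of names -> one-column DataFrame

print(type(as_series).__name__)             # Series
print(type(as_frame).__name__)              # DataFrame
print(sample.occupation.equals(as_series))  # True
```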

## 🎓 Step 10: Identifying Diversity (Unique Jobs)

Is our user base concentrated in one field, or spread across various industries? Counting the unique entries tells us the level of diversity.

*Action: Count distinct occupations.*

In [None]:
unique_jobs = users['occupation'].nunique()
print(f'Distinct Careers Identified: {unique_jobs}')
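
One detail worth knowing about `nunique` is how it treats missing values. The following sketch uses an invented `jobs` Series with one gap to show the default behaviour and the `dropna=False` alternative; nothing here implies the real dataset contains missing occupations.

```python
import pandas as pd

# Hypothetical occupations, including one missing value.
jobs = pd.Series(['student', 'engineer', 'student', None, 'writer'])

default_count = jobs.nunique()                  # missing values are ignored by default
with_missing = jobs.nunique(dropna=False)       # the missing value counts as its own entry
labels = sorted(jobs.dropna().unique())         # the distinct labels themselves

print(default_count)  # 3
print(with_missing)   # 4
print(labels)         # ['engineer', 'student', 'writer']
```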

## 🏆 Step 11: The Most Frequent Actor

Which occupation dominates the ecosystem? Identifying the mode (most common value) helps tailor marketing strategies.

*Action: Find the top occupation.*

In [None]:
most_common = users['occupation'].value_counts().idxmax()
print(f'Primary User Segment: {most_common}')
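
The same `value_counts` call can answer follow-up questions, such as what share of users the top segment represents. A minimal sketch on an invented `jobs` Series: `normalize=True` returns fractions instead of counts, and `mode()` is an alternative route to the most common value that also reports ties.

```python
import pandas as pd

# Hypothetical occupations for illustration.
jobs = pd.Series(['student', 'engineer', 'student', 'writer', 'student', 'engineer'])

counts = jobs.value_counts()                # sorted from most to least frequent
shares = jobs.value_counts(normalize=True)  # same ranking, as fractions of the total

top_job = counts.idxmax()
top_share = float(shares[top_job])
modes = jobs.mode().tolist()                # every value tied for most common

print(top_job)    # student
print(top_share)  # 0.5
print(modes)      # ['student']
```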

## 📉 Step 12: The Statistical Summary

Let's run a full statistical profile on the numeric data. This gives us the mean, standard deviation, and quartiles of our user ages.

*Action: Generate descriptive statistics.*

In [None]:
users.describe()

## 📋 Step 13: The Holistic Investigation Report

By passing `include='all'`, the summary also covers text columns, adding metrics such as `unique` (number of distinct values), `top` (most common value), and `freq` (how often `top` occurs).

*Action: View the full summary for all features.*

In [None]:
users.describe(include='all')
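
To make the difference between the two summaries concrete, here is a small sketch on a hypothetical four-row frame. By default `describe()` keeps only numeric columns; with `include='all'` every column appears, and statistics that do not apply to a column are filled with `NaN`.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
sample = pd.DataFrame({
    'age': [20, 30, 40, 50],
    'occupation': ['student', 'student', 'writer', 'engineer'],
})

numeric_only = sample.describe()
everything = sample.describe(include='all')

print(list(numeric_only.columns))            # ['age']
print(everything.loc['top', 'occupation'])   # student
print(everything.loc['freq', 'occupation'])  # 2
print(everything.loc['mean', 'age'])         # 35.0
print(pd.isna(everything.loc['unique', 'age']))  # True: 'unique' is not computed for numbers
```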

## ⏳ Step 14: Analyzing the Demographics (Average Age)

Understanding the age distribution is vital for user experience design, and the mean is the quickest single-number summary of it.

*Action: Calculate the mean age.*

In [None]:
mean_age = users['age'].mean()
print(f'Average User Age: {mean_age:.1f} years')
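
A mean on its own can mislead when a few values sit far from the rest, so it is often read alongside the median. The sketch below uses five invented ages, one of them much higher than the others, to show how the two summaries diverge.

```python
import pandas as pd

# Hypothetical ages with one much older user.
ages = pd.Series([22, 25, 27, 30, 73])

mean_age = ages.mean()      # pulled upward by the single high value
median_age = ages.median()  # the middle value, unaffected by the extreme

print(f'{mean_age:.1f}')    # 35.4
print(median_age)           # 27.0
```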

## 🔍 Step 15: Identifying Outliers

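Which ages are the rarest in our population? Because `value_counts()` sorts from most to least frequent, the tail of its result holds the least common ages, which helps identify niche groups. Strictly speaking, a rare value is not automatically a statistical outlier, but rare ages are a natural place to start looking.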

*Action: Examine the frequency of rare age segments.*

In [None]:
users['age'].value_counts().tail(5)
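
Rarity and extremeness are different questions. As a hedged illustration, the sketch below contrasts the rarest values with values flagged by the common 1.5 × IQR rule of thumb. The IQR rule is swapped in here as an assumption and is not the method used in the cell above; the `ages` values are invented.

```python
import pandas as pd

# Hypothetical ages for illustration.
ages = pd.Series([22, 22, 25, 25, 25, 30, 30, 31, 73])

# Rarest values: value_counts() sorts by frequency, so the tail holds the least common.
rarest = sorted(ages.value_counts().tail(2).index.tolist())

# Extreme values by the 1.5 * IQR rule of thumb (an assumed, commonly used threshold).
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)].tolist()

print(rarest)    # [31, 73]
print(outliers)  # [73]
```

Here 31 is rare but unremarkable, while 73 is both rare and far from the bulk of the data.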

--- 

# 🏁 Case Closed: Final Intelligence Briefing

**Mission Accomplished.** You've successfully navigated the complexities of raw data and transformed it into actionable intelligence.

### Key Findings Summary:
- 👥 **Population Size:** 943 Users
- 💼 **Job Diversity:** 21 Distinct Occupations
- 🎓 **Dominant Group:** Students
- 🎂 **Average Age:** ~34.1 years

This initial exploration forms the bedrock of any successful Machine Learning project. Without knowing your data, you cannot model it. 

**Well done, Detective.**

---

## 👤 Lead Investigator Details

**Author:** Tassawar Abbas  
**Email:** abbas829@gmail.com  
**License:** MIT License