<a href="https://www.kaggle.com/code/abbas829/pandas-data-story?scriptVersionId=294665172" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<a href="https://www.kaggle.com/code/abbas829/pandas-01b?scriptVersionId=292637312" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🔍 The Data Detective: Mastering Exploratory Data Analysis with Pandas

## Uncovering Hidden Insights in User Behavior using Python

**Mission Briefing:** In the fast-paced world of tech startups, data is the ultimate evidence. You've just been hired as the **Lead Data Investigator**. Your first assignment? A mysterious dataset containing raw user profiles. Your boss needs to know *who* these users are before the marketing team can launch their next big campaign.

As the **Data Detective**, you will use **Pandas**, one of the most widely used data analysis libraries in the Python ecosystem, to forensically examine this data, identify patterns, and solve the mystery of your user base.

---

## 🛠️ Step 1: Gathering the Forensic Toolkit

Every investigator needs a reliable set of tools. For Python-based data forensics, we rely on the industry standards: `Pandas` for manipulation and `NumPy` for numerical operations.

> [!TIP]
> Standard convention is to import pandas as `pd` and numpy as `np`. This keeps our code concise and matches the style used in the official documentation.

In [1]:
import numpy as np
import pandas as pd

## 📍 Step 2: Locating the Evidence (Data Acquisition)

Our intelligence report points to a remote server where the raw logs are stored. We need to define the source URL before we can begin the extraction.

*Action: Define the source URI for the user dataset.*

In [2]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'

## 📁 Step 3: Opening the Case File

Now, we ingest the data into a Pandas **DataFrame**. We've noticed the file uses a pipe `|` delimiter rather than the standard comma, and we want to use the `user_id` as our primary key (index).

```mermaid
graph LR
A[Raw Data Link] --> B{Pandas Read}
B --> C[Structured DataFrame]
C --> D[Data Detective Dashboard]
```

> [!IMPORTANT]
> Using `index_col` during the loading phase sets the index in the same step as the read, so we do not need a separate `set_index` call afterwards.

In [3]:
users = pd.read_csv(url, sep='|', index_col='user_id')
users.head()

| user_id | age | gender | occupation | zip_code |
| :--- | :--- | :--- | :--- | :--- |
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
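
To see what `sep` and `index_col` do without touching the network, here is a minimal sketch that parses a few illustrative rows in the same pipe-delimited layout. The rows and the `sample` name are assumptions for demonstration, not the real file. The `dtype` argument is an optional extra: if a zip column happened to be purely numeric, pandas would infer integers and drop leading zeros, so reading it as text is one way to guard against that.

```python
import io

import pandas as pd

# Illustrative rows in the same pipe-delimited layout as u.user (not the real file).
raw = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|29|M|student|01002\n"
)

# sep='|' sets the delimiter; index_col promotes user_id to the row index.
# dtype keeps zip_code as text so a leading zero is not dropped.
sample = pd.read_csv(raw, sep="|", index_col="user_id", dtype={"zip_code": str})

print(sample.index.name)
print(sample.loc[3, "zip_code"])
```

Because `user_id` is now the index, `sample.loc[3]` looks a user up by ID rather than by row position.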


## 👀 Step 4: First Glance at the Suspects

A detective never dives in without a quick overview. We'll examine the first 25 entries to identify visible patterns or potential anomalies in the data entry.

| Feature | Description |
| :--- | :--- |
| `age` | User's age in years |
| `gender` | M/F designation |
| `occupation` | Self-reported job title |
| `zip_code` | Geographic location marker |

In [4]:
users.head(25)

| user_id | age | gender | occupation | zip_code |
| :--- | :--- | :--- | :--- | :--- |
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| 6 | 42 | M | executive | 98101 |
| 7 | 57 | M | administrator | 91344 |
| 8 | 36 | M | administrator | 5201 |
| 9 | 29 | M | student | 1002 |
| 10 | 53 | M | lawyer | 90703 |


## 🔍 Step 5: Checking the Trail's End

Does the data remain consistent until the very last row? We check the final 10 entries to verify there are no trailing errors or mismatched formats.

*Action: Inspect the tail of the dataset.*

In [5]:
users.tail(10)

| user_id | age | gender | occupation | zip_code |
| :--- | :--- | :--- | :--- | :--- |
| 934 | 61 | M | engineer | 22902 |
| 935 | 42 | M | doctor | 66221 |
| 936 | 24 | M | other | 32789 |
| 937 | 48 | M | educator | 98072 |
| 938 | 38 | F | technician | 55038 |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 2215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |
| 943 | 22 | M | student | 77841 |


## 📊 Step 6: Measuring the Population Scale

How large is our investigation? The number of observations tells us how much weight the statistics in later steps can carry: a mean computed over a few rows is far less reliable than one computed over hundreds.

**Metric:** Total Number of Users

In [6]:
print(f'Total Users Investigated: {len(users)}')

Total Users Investigated: 943


## 🏷️ Step 7: Inventorying User Features

What specific data points do we have on each suspect? Counting the columns allows us to plan our multidimensional analysis.

*Action: Count and list the feature set.*

In [7]:
print(f'Number of Features: {len(users.columns)}')
print(f'Feature Names: {list(users.columns)}')

Number of Features: 4
Feature Names: ['age', 'gender', 'occupation', 'zip_code']
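
The row and column counts from Steps 6 and 7 can also be read in one go from the `shape` attribute. A minimal sketch on a small illustrative frame (the `sample` frame below is an assumption built for demonstration; the real `users` frame has 943 rows and 4 columns):

```python
import pandas as pd

# Small illustrative frame with the same columns as the users dataset.
sample = pd.DataFrame(
    {
        "age": [24, 53, 23],
        "gender": ["M", "F", "M"],
        "occupation": ["technician", "other", "writer"],
        "zip_code": ["85711", "94043", "32067"],
    },
    index=pd.Index([1, 2, 3], name="user_id"),
)

# shape is a (rows, columns) tuple; the index is not counted as a column.
n_rows, n_cols = sample.shape
print(n_rows, n_cols)
print(sample.columns.tolist())
```

Note that `user_id` does not appear among the columns because it was promoted to the index.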


## 🧬 Step 8: Analyzing Data DNA (Types)

Understanding data types (`dtypes`) is like checking the DNA of your evidence. Are we looking at integers, text (objects), or categories?

> [!NOTE]
> In Pandas, text columns are often listed as `object`. Columns with few distinct values, such as `gender` and `occupation`, can be converted to `category` to save memory.

In [8]:
users.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object
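
The memory saving from `category` can be checked directly. The sketch below uses a made-up series of repeated occupation labels (an assumption for illustration, not the real column) and compares `memory_usage(deep=True)` before and after the conversion.

```python
import pandas as pd

# Illustrative text column: 1,000 values drawn from only three distinct labels.
occupations = pd.Series(["student", "other", "student", "writer"] * 250, name="occupation")

# Converting to category stores each label once plus a small integer code per row.
as_category = occupations.astype("category")

# deep=True counts the string contents themselves, not just references to them.
before = occupations.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)

print(as_category.dtype)
print(before, after)
```

The benefit shrinks as the number of distinct values approaches the number of rows, so a column like `zip_code` is a weaker candidate than `gender`.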

## 👮 Step 9: Deep Dive into Occupations

The boss is curious about the professional background of our users. Let's isolate the `occupation` column for specialized analysis.

*Action: Extract and view the occupation profile.*

In [9]:
users.occupation

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

## 🎓 Step 10: Identifying Diversity (Unique Jobs)

Is our user base concentrated in one field, or spread across various industries? Counting the unique entries tells us the level of diversity.

*Action: Count distinct occupations.*

In [10]:
unique_jobs = users['occupation'].nunique()
print(f'Distinct Careers Identified: {unique_jobs}')

Distinct Careers Identified: 21
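
One detail worth knowing about `nunique()`: by default it ignores missing values. A minimal sketch on a hypothetical series with one missing entry (the real `occupation` column has no gaps, as the count of 943 in Step 13 shows):

```python
import pandas as pd

# Hypothetical occupation values, including one missing entry.
jobs = pd.Series(["student", "other", "student", None, "writer"], name="occupation")

print(jobs.nunique())              # counts distinct non-missing values
print(jobs.nunique(dropna=False))  # also counts the missing value as one entry
print(jobs.dropna().unique())      # the distinct labels themselves
```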


## 🏆 Step 11: The Most Frequent Actor

Which occupation dominates the ecosystem? Identifying the mode (most common value) helps tailor marketing strategies.

*Action: Find the top occupation.*

In [11]:
most_common = users['occupation'].value_counts().idxmax()
print(f'Primary User Segment: {most_common}')

Primary User Segment: student
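
Knowing the top label is more useful alongside its count and share. The following sketch, on a made-up list of occupations (an assumption for illustration), shows `value_counts()` with and without `normalize=True`, and `mode()` as an alternative that reports ties.

```python
import pandas as pd

# Made-up occupation values for illustration.
jobs = pd.Series(
    ["student", "other", "student", "writer", "other", "student"], name="occupation"
)

counts = jobs.value_counts()                # sorted from most to least frequent
shares = jobs.value_counts(normalize=True)  # same order, as fractions of the total

top_job = counts.idxmax()
print(top_job, counts.loc[top_job], round(shares.loc[top_job], 2))

# mode() returns every value tied for first place, not just one of them.
print(jobs.mode().tolist())
```

`idxmax()` returns a single label even when two occupations are tied, so `mode()` is the safer choice when ties matter.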


## 📉 Step 12: The Statistical Summary

Let's run a full statistical profile on the numeric data. This gives us the mean, standard deviation, and quartiles of our user ages.

*Action: Generate descriptive statistics.*

In [12]:
users.describe()

| statistic | age |
| :--- | :--- |
| count | 943.0 |
| mean | 34.051962 |
| std | 12.19274 |
| min | 7.0 |
| 25% | 25.0 |
| 50% | 31.0 |
| 75% | 43.0 |
| max | 73.0 |


## 📋 Step 13: The Holistic Investigation Report

By including all columns in our summary, we also get metrics for the text columns: `unique` (number of distinct values), `top` (the most common value), and `freq` (how often that value occurs).

*Action: View the full summary for all features.*

In [13]:
users.describe(include='all')

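| statistic | age | gender | occupation | zip_code |
| :--- | :--- | :--- | :--- | :--- |
| count | 943.0 | 943 | 943 | 943 |
| unique | | 2 | 21 | 795 |
| top | | M | student | 55414 |
| freq | | 670 | 196 | 9 |
| mean | 34.051962 | | | |
| std | 12.19274 | | | |
| min | 7.0 | | | |
| 25% | 25.0 | | | |
| 50% | 31.0 | | | |
| 75% | 43.0 | | | |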
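
The blank cells in the combined report appear because numeric and text columns produce different statistics. A small sketch on an illustrative frame (the `sample` data is an assumption for demonstration) shows the three views side by side: numeric only, text only, and everything.

```python
import pandas as pd

# Illustrative frame mixing one numeric and two text columns.
sample = pd.DataFrame(
    {
        "age": [24, 53, 23, 24, 33],
        "gender": ["M", "F", "M", "M", "F"],
        "occupation": ["technician", "other", "writer", "technician", "other"],
    }
)

# Default: only numeric columns are summarized.
numeric_summary = sample.describe()

# A frame with no numeric columns falls back to count/unique/top/freq.
text_summary = sample[["gender", "occupation"]].describe()

# include='all' stacks both sets of rows and fills the gaps with NaN.
all_summary = sample.describe(include="all")

print(numeric_summary.loc["mean", "age"])
print(text_summary.loc[["top", "freq"], "gender"].tolist())
print(all_summary.index.tolist())
```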


## ⏳ Step 14: Analyzing the Demographics (Average Age)

Understanding the age distribution is vital for user experience design. 

*Action: Calculate the mean age.*

In [14]:
mean_age = users['age'].mean()
print(f'Average User Age: {mean_age:.1f} years')

Average User Age: 34.1 years


## 🔍 Step 15: Identifying Outliers

Which ages are the rarest in our population? `value_counts()` sorts from most to least frequent, so its tail shows the least common ages, which helps identify niche groups.

*Action: Examine the frequency of rare age segments.*

In [15]:
users['age'].value_counts().tail(5)

age
7     1
66    1
11    1
10    1
73    1
Name: count, dtype: int64
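
Rare values and statistical outliers are related but not identical: an age can occur only once and still sit in the middle of the distribution. As a hedged sketch on made-up ages (an assumption for illustration), the code below contrasts the frequency-based view used above with a different technique, the common 1.5 × IQR rule.

```python
import pandas as pd

# Made-up ages for illustration, with one very young and one very old user.
ages = pd.Series([7, 22, 22, 24, 24, 24, 30, 30, 31, 73], name="age")

# value_counts() sorts by frequency (descending), so tail() lists the rarest ages.
rarest = ages.value_counts().tail(3)

# A different technique: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
outliers = ages[(ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)]

print(sorted(rarest.index))
print(outliers.tolist())
```

Here the age 31 is rare (it appears once) but is not flagged by the IQR rule, while 7 and 73 are both rare and far from the bulk of the data.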

--- 

# 🏁 Case Closed: Final Intelligence Briefing

**Mission Accomplished.** You've successfully navigated the complexities of raw data and transformed it into actionable intelligence.

### Key Findings Summary:
- 👥 **Population Size:** 943 Users
- 💼 **Job Diversity:** 21 Distinct Occupations
- 🎓 **Dominant Group:** Students
- 🎂 **Average Age:** ~34.1 years

This initial exploration forms the bedrock of any successful Machine Learning project. Without knowing your data, you cannot model it. 

**Well done, Detective.**

---

## 👤 Lead Investigator Details

**Author:** Tassawar Abbas  
**Email:** abbas829@gmail.com  