# Session 12: Workflow in Practice (Initial Data Exploration)

**Unit 1: Introduction to Data Science**
**Hour: 12**
**Mode: Practical Lab**

---

### 1. Objective

In this lab, we will perform the first hands-on steps of our case study. We'll load the Telco Churn dataset and conduct an initial inspection to understand its structure and identify any immediate issues.

This covers the **Obtain** phase and the very beginning of the **Scrub** and **Explore** phases of the OSEMN workflow.

### 2. Setup

We only need the Pandas library for this session.

In [None]:
import pandas as pd

### 3. Obtain: Loading the Data

We'll load the dataset directly from a URL. This is a common practice when working with public datasets.

In [None]:
url = 'https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv'
df = pd.read_csv(url)
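Reading from a URL means every re-run of the notebook depends on the network. As a minimal sketch (the helper name `load_with_cache` and the cache filename are illustrative assumptions, not part of the lab), one could keep a local copy after the first download:

```python
from pathlib import Path

import pandas as pd


def load_with_cache(source, cache_path):
    """Read a CSV from `source` (URL or path), keeping a local copy at `cache_path`.

    Hypothetical helper for illustration only.
    """
    cache = Path(cache_path)
    if cache.exists():
        # A local copy already exists, so skip the download.
        return pd.read_csv(cache)
    data = pd.read_csv(source)
    # index=False avoids writing the row index as an extra column.
    data.to_csv(cache, index=False)
    return data
```

In this session it could be called as `df = load_with_cache(url, 'telco_churn.csv')`, where the filename is an arbitrary choice.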

### 4. Initial Inspection & Exploration

Let's apply the basic inspection techniques we learned in Session 3 to this new, more complex dataset.

#### 4.1. `.head()` and `.shape`

Let's look at the first few rows and check the size of our dataset.

In [None]:
df.head()

In [None]:
df.shape

**Finding:** We have 7043 customers (rows) and 21 columns of information.

#### 4.2. `.info()` - The Technical Summary

This is a critical step to check for missing values and incorrect data types.

In [None]:
df.info()

**Findings & Potential Issues:**
1.  **No Missing Values?** Every column shows 7043 non-null entries, which seems great. But is this true? `.info()` only counts real nulls (`NaN`/`None`), so blank strings or placeholder text would still be reported as non-null. We need to investigate further.
2.  **Incorrect Data Type:** The `TotalCharges` column, which should be a number (float), is stored as `object` (text). This is a major red flag! It means at least one value in that column could not be read as a number, and we'll need to fix it in the **Scrub** phase.
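One way to follow up on both findings is sketched below. It uses a small made-up frame named `sample` (an assumption, so the cell runs without a download) rather than the real `df`; the same calls should apply to `df['TotalCharges']`. The idea is to count text entries that are blank and to see which rows fail a numeric conversion.

```python
import pandas as pd

# Hypothetical four-row stand-in for df, so the sketch runs without a download.
sample = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "TotalCharges": ["29.85", " ", "108.15", "1889.5"],
})

# isna() only sees real NaN/None, so a blank string still counts as "non-null".
print("True nulls:", sample["TotalCharges"].isna().sum())

# Count entries that are empty once surrounding whitespace is stripped.
blank_count = (sample["TotalCharges"].str.strip() == "").sum()
print("Blank strings:", blank_count)

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_number = pd.to_numeric(sample["TotalCharges"], errors="coerce")

# Rows that were non-null as text but became NaN are the problem entries.
bad_rows = sample[as_number.isna() & sample["TotalCharges"].notna()]
print(bad_rows)
```

This only inspects the data. Deciding what to do with the problem rows (drop them, fill them, and so on) belongs to the **Scrub** phase.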

#### 4.3. Exploring the Target Variable: `Churn`

It's always a good idea to understand the distribution of what you're trying to predict.

In [None]:
df['Churn'].value_counts()

We can see how many customers churned versus how many did not. It's often more useful to see this as a percentage.

In [None]:
# normalize=True returns proportions (0 to 1); multiplying by 100 turns them into percentages
churn_rate = df['Churn'].value_counts(normalize=True) * 100
print(churn_rate)

**Finding:** Approximately **26.5%** of the customers in this dataset have churned, so the "No" class outnumbers the "Yes" class by nearly 3 to 1. A dataset like this is called an **imbalanced dataset** because one class ("No") is much more common than the other ("Yes"). This is an important finding that will affect our modeling choices later.
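To see why imbalance matters, it can help to compute what a "model" that always predicts the most common class would score. The sketch below uses a tiny made-up `churn` series (an assumption for illustration) in place of `df['Churn']`; the variable names are also illustrative.

```python
import pandas as pd

# Hypothetical stand-in for df['Churn']: three "No" customers and one "Yes".
churn = pd.Series(["No", "No", "No", "Yes"], name="Churn")

# Share of each class as a proportion between 0 and 1.
shares = churn.value_counts(normalize=True)

# Always guessing the most common class is right this often.
majority_label = shares.idxmax()
baseline_accuracy = shares.max()

# How many majority-class rows there are per minority-class row.
imbalance_ratio = shares.max() / shares.min()

print(f"Always predicting '{majority_label}' is correct {baseline_accuracy:.0%} of the time")
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f}")
```

Any model we build later has to beat this majority-class baseline to be worth using, which is why accuracy alone can be misleading on imbalanced data.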

### 5. Conclusion

In just one hour, we have:
1.  Loaded a real-world dataset into Pandas.
2.  Performed an initial inspection to understand its size and columns.
3.  Identified a critical data quality issue (`TotalCharges` having the wrong data type).
4.  Analyzed the distribution of our target variable (`Churn`) and discovered that we are dealing with an imbalanced class problem.

These initial findings are crucial and will guide our next steps in the **Scrub** (cleaning) and **Explore** (visualization) phases. This concludes our work for Unit 1!