### Data Profiling vs. Raw Inspection with ydata_profiling (Colab)
This lab contrasts raw data inspection with automated data profiling, using the ydata_profiling library on the UCI Adult (census income) dataset in a Google Colab environment.

Objectives
- Examine demographic factors and income categories.

- Use profiling to detect missing or invalid entries.



In [1]:
# Step 1: Set Up the Environment
!pip install ydata-profiling pandas -q


In [7]:
# Step 2: Load Sample Data
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income"
]
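# skipinitialspace=True strips the blank after each comma, so the
# missing-value placeholder arrives as "?" (not " ?").
df = pd.read_csv(url, header=None, names=column_names, na_values="?", skipinitialspace=True)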
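The interaction between `skipinitialspace` and `na_values` is easy to get wrong, so here is a minimal sketch on two made-up rows (illustrative values, not taken from the dataset) in the same `, `-separated layout. Because the blank after each comma is dropped before values are compared, the placeholder is matched as `"?"`. The names `sample`, `cols`, and `demo` are assumptions for this example only.

```python
import io
import pandas as pd

# Two made-up rows in the same ", "-separated layout as adult.data (illustrative only).
sample = io.StringIO(
    "39, State-gov, 77516, <=50K\n"
    "54, ?, 180211, >50K\n"
)
cols = ["age", "workclass", "fnlwgt", "income"]

# skipinitialspace=True drops the blank after each comma, so the placeholder
# is read as "?" and na_values="?" turns it into NaN.
demo = pd.read_csv(sample, header=None, names=cols,
                   na_values="?", skipinitialspace=True)

print(demo["workclass"].isna().sum())  # number of missing workclass entries
print(demo["income"].tolist())         # labels come back without leading blanks
```

Running the same check on the full `df` with `df.isnull().sum()` is a quick way to confirm that the placeholders were actually converted.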


In [8]:
# Step 3: Raw Data Inspection
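print(df.head())          # first five rows
df.info()                 # prints dtypes and non-null counts itself; returns None
print(df.describe())      # summary statistics for the numeric columns
print(df.isnull().sum())  # missing values per column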



   age         workclass  fnlwgt  education  education_num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital_status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital_gain  capital_loss  hours_per_week native_country income  
0          2174             0              40  United-States  <=50K  
1             0             0             
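Raw inspection can be pushed a little further for categorical columns. The sketch below, using a made-up `workclass` column rather than the real data, tallies every label including NaN, then repeats the tally after normalising whitespace and case. Labels that merge in the second tally point to inconsistent spellings.

```python
import numpy as np
import pandas as pd

# Made-up column with the kinds of problems raw inspection can surface.
workclass = pd.Series(
    ["Private", " Private", "private", "?", np.nan, "State-gov"],
    name="workclass",
)

# dropna=False keeps NaN in the tally so missing entries are visible too.
raw_counts = workclass.value_counts(dropna=False)
print(raw_counts)

# Normalising whitespace and case shows how many distinct labels really exist.
cleaned = workclass.str.strip().str.lower()
clean_counts = cleaned.value_counts(dropna=False)
print(clean_counts)
```

On the lab's DataFrame the same two calls could be applied to any object column, for example `df["workclass"]`.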

In [9]:
# Step 4: Data Profiling with ydata_profiling
from ydata_profiling import ProfileReport
profile = ProfileReport(df, title='Adult Income Data Profile', explorative=True)
profile.to_notebook_iframe()
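For the written comparison it can help to see roughly what a per-variable overview contains. The following is a hand-rolled approximation in plain pandas, not the library's implementation; `column_overview` and the `toy` frame are hypothetical names introduced for this sketch.

```python
import numpy as np
import pandas as pd

def column_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Hand-rolled per-column summary (hypothetical helper, not part of ydata_profiling)."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": (frame.isna().mean() * 100).round(1),
        "n_distinct": frame.nunique(dropna=True),
    })

# Small made-up frame so the sketch runs without downloading anything.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", np.nan, "Private", "Private"],
})

overview = column_overview(toy)
print(overview)
```

Calling `column_overview(df)` on the lab's DataFrame gives a table that can be set side by side with the report's variable sections.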



### Step 5: Analysis & Comparison
Answer these questions in your notebook:

- What inconsistencies or data type issues are found manually vs. via profiling?

- How many missing or 'unknown' values exist per variable?
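The missing-value question can be tallied by hand as a cross-check on the report. This is a minimal sketch on a made-up frame; the set of tokens treated as "unknown" (`PLACEHOLDERS`) and the helper `count_unknown` are assumptions to adapt to the dataset.

```python
import numpy as np
import pandas as pd

# Tokens treated as "unknown" here are an assumption; adjust to the dataset.
PLACEHOLDERS = {"?", "unknown", "n/a"}

toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", np.nan],
    "occupation": ["Unknown", "Sales", "?"],
})

def count_unknown(col: pd.Series) -> int:
    """Count placeholder tokens in one column, ignoring case and padding."""
    text = col.dropna().astype(str).str.strip().str.lower()
    return int(text.isin(PLACEHOLDERS).sum())

# One row per variable: true NaN values next to placeholder strings.
summary = pd.DataFrame({
    "n_nan": toy.isna().sum(),
    "n_placeholder": toy.apply(count_unknown),
})
print(summary)
```

Keeping the two counts in separate columns shows whether placeholders were converted at load time or are still present as ordinary strings.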

In [10]:
# Step 6: Optional – Export the Profiling Report
# Save report as HTML
profile.to_file("adult_data_profile.html")



Deliverables
- Snippets/output for steps above.

- Written comparison: Raw Inspection vs. Data Profiling using ydata_profiling.

- (Optional) Attach the exported HTML report.