# 01 – Data Overview

This notebook introduces each data source in the Taobao CTR dataset. We load the first 100,000 rows of each raw CSV file, inspect their structure, and generate simple summaries to understand their contents.

## Load raw data

Place the raw CSV files in the `data/raw` directory, which the code below resolves relative to the notebook's parent directory (`../data/raw`). The project expects these file names:

- `user_profile.csv`
- `ad_feature.csv`
- `raw_sample.csv`
- `behavior_log.csv`
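
Before loading, it can help to confirm that all four files are actually present. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_files` is hypothetical, and the path in the usage block assumes the notebook sits one level below the project root.

```python
from pathlib import Path

# File names listed above.
EXPECTED_FILES = [
    "user_profile.csv",
    "ad_feature.csv",
    "raw_sample.csv",
    "behavior_log.csv",
]


def find_missing_files(raw_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    # Assumption: the notebook lives one level below the project root.
    missing = find_missing_files(Path("..") / "data" / "raw")
    if missing:
        print("Missing raw files:", ", ".join(missing))
    else:
        print("All raw files found.")
```

An empty list means every expected file was found, so the loading cell can run without a `FileNotFoundError`.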

In [1]:
import os
import pandas as pd

# Define paths for raw data
raw_dir = os.path.join(os.path.pardir, 'data', 'raw')
user_path = os.path.join(raw_dir, 'user_profile.csv')
ad_path = os.path.join(raw_dir, 'ad_feature.csv')
click_path = os.path.join(raw_dir, 'raw_sample.csv')
behaviour_path = os.path.join(raw_dir, 'behavior_log.csv')

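# Load only the first 100,000 rows of each file (nrows) to keep memory use manageable.
# Note: engine='python' is pandas' slower, more feature-complete parser;
# the default C engine is usually the faster choice for large files.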
user_df = pd.read_csv(user_path, engine='python', nrows=100_000)
ad_df = pd.read_csv(ad_path, engine='python', nrows=100_000)
click_df = pd.read_csv(click_path, engine='python', nrows=100_000)
behaviour_df = pd.read_csv(behaviour_path, engine='python', nrows=100_000)

# Display basic information
print('User profile shape:', user_df.shape)
print('Ad feature shape:', ad_df.shape)
print('Click log shape:', click_df.shape)
print('Behaviour log shape:', behaviour_df.shape)


User profile shape: (100000, 9)
Ad feature shape: (100000, 6)
Click log shape: (100000, 6)
Behaviour log shape: (100000, 5)
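
These shapes reflect the `nrows=100_000` cap, not the full size of each file. To get the true row count without loading a whole file into memory, one option is to stream it in chunks. This is a rough sketch: `count_rows` is a hypothetical helper, the chunk size is an arbitrary choice, and the in-memory CSV only stands in for a real file.

```python
import io

import pandas as pd


def count_rows(csv_source, chunksize=1_000_000):
    """Count data rows in a CSV by streaming it chunk by chunk."""
    total = 0
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += len(chunk)
    return total


# Tiny in-memory stand-in for a real file such as behavior_log.csv.
demo_csv = io.StringIO("user,time_stamp\n1,10\n2,20\n3,30\n")
print(count_rows(demo_csv, chunksize=2))  # 3
```

In the notebook, passing `behaviour_path` instead of the in-memory buffer would count the rows of the full behaviour log.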


In [3]:
# Inspect first few rows of each dataset
for name, df in [('user_profile', user_df), ('ad_feature', ad_df), ('raw_sample', click_df), ('behavior_log', behaviour_df)]:
    print(f"{name} head:")
    print(df.head())

user_profile head:
   userid  cms_segid  cms_group_id  final_gender_code  age_level  \
0     234          0             5                  2          5   
1     523          5             2                  2          2   
2     612          0             8                  1          2   
3    1670          0             4                  2          4   
4    2545          0            10                  1          4   

   pvalue_level  shopping_level  occupation  new_user_class_level   
0           NaN               3           0                    3.0  
1           1.0               3           1                    2.0  
2           2.0               3           0                    NaN  
3           NaN               1           0                    NaN  
4           NaN               3           0                    NaN  
ad_feature head:
   adgroup_id  cate_id  campaign_id  customer     brand   price
0       63133     6406        83237         1   95471.0  170.00
1      313401
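
The trailing `\` in the `user_profile` output above marks where pandas wrapped a wide frame to fit the display width. If a single unwrapped table is preferred, `DataFrame.to_string()` renders every column on one line per row. A small sketch with made-up values (the column names mirror `user_profile`, the data does not):

```python
import pandas as pd

# Illustrative frame only; values are invented for the example.
demo_df = pd.DataFrame(
    {
        "userid": [1, 2],
        "cms_segid": [0, 5],
        "cms_group_id": [5, 2],
        "final_gender_code": [2, 2],
        "age_level": [5, 2],
        "pvalue_level": [None, 1.0],
        "shopping_level": [3, 3],
        "occupation": [0, 1],
    }
)

# to_string() does not wrap columns unless line_width is given.
head_text = demo_df.head().to_string()
print(head_text)
```

The trade-off is that very wide frames produce long lines, which may need horizontal scrolling in the notebook.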

In [4]:
# Examine data types and missing values
for name, df in [('user_profile', user_df), ('ad_feature', ad_df), ('raw_sample', click_df), ('behavior_log', behaviour_df)]:
    print(f"{name} info:")
    df.info()  # info() prints its report itself and returns None
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        print('Missing values:')
        print(missing)

user_profile info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   userid                 100000 non-null  int64  
 1   cms_segid              100000 non-null  int64  
 2   cms_group_id           100000 non-null  int64  
 3   final_gender_code      100000 non-null  int64  
 4   age_level              100000 non-null  int64  
 5   pvalue_level           45978 non-null   float64
 6   shopping_level         100000 non-null  int64  
 7   occupation             100000 non-null  int64  
 8   new_user_class_level   67579 non-null   float64
dtypes: float64(2), int64(7)
memory usage: 6.9 MB
Missing values:
pvalue_level             54022
new_user_class_level     32421
dtype: int64
ad_feature info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   adgroup_id   100000 non-null  int64  
 1   cate_id      100000 non-null  int64  
 2   campaign_id  100000 non-null  int64  
 3   customer     100000 non-null  int64  
 4   brand        71820 non-null   float64
 5   price        100000 non-null  float64
dtypes: float64(2), int64(4)
memory usage: 4.6 MB
Missing values:
brand    28180
dtype: int64
raw_sample info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user        100000 non-null  int64 
 1   time_stamp  100000 non-null  int64 
 2   adgroup_id  100000 non-null  int64 
 3   pid         100000 non-null  object
 4   nonclk      100000 non-null  int64 
 5   clk         100000 non-null  int64 
dtypes: int64(5), object(1)
memory usage: 4.6+ MB
behavior_log info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   user        100000 non-null  int64 
 1   time_stamp  100000 non-null  int64 
 2   btag        100000 non-null  object
 3   cate        100000 non-null  int64 
 4   brand       100000 non-null  int64 
dtypes: int64(4), object(1)
memory usage: 3.8+ MB
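
The raw counts are easier to compare as shares of the sample. In the 100,000-row samples above, the 54,022 missing `pvalue_level` values amount to 54.0%, the 32,421 missing `new_user_class_level` values to 32.4%, and the 28,180 missing `brand` values in `ad_feature` to 28.2%. The sketch below shows one way to produce such a table for any frame; `missing_summary` is a hypothetical helper and the demo data is invented.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns with any NaN."""
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(df) * 100).round(1),
        }
    )
    # Keep only affected columns, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Invented data for illustration.
demo_df = pd.DataFrame(
    {
        "userid": [1, 2, 3, 4],
        "pvalue_level": [np.nan, 1.0, np.nan, np.nan],
        "age_level": [5, 2, np.nan, 4],
    }
)
print(missing_summary(demo_df))
```

In the notebook, calling `missing_summary(user_df)` or `missing_summary(ad_df)` would give the same counts as the loop above, with the percentage alongside.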