# Beginning steps

In this exercise, you will take a quick look at sample data using some basic DataFrame operations and compute a first estimate of the click-through rate (CTR). The data comes from Avazu, a leading global advertising platform, and captures user interactions on various device types for different websites and apps.

The target variable is the click column. The hour column is in YYMMDDHH format, and there are a few integer columns: device_type for the type of device, banner_pos for the position of a banner ad (also known as a display ad), etc. Other variables are discussed in later chapters.
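
The YYMMDDHH encoding can be converted into a proper timestamp with pandas. The following is a minimal sketch, assuming hour is stored as an integer as in the sample output; the toy frame and the hour_ts column name are illustrative and not part of the exercise.

```python
import pandas as pd

# Toy frame with hour stored as an integer in YYMMDDHH form (illustrative values)
toy = pd.DataFrame({"hour": [14102100, 14102113]})

# Cast to string first so the fixed-width format can be parsed
toy["hour_ts"] = pd.to_datetime(toy["hour"].astype(str), format="%y%m%d%H")

# Hour of day is now available through the .dt accessor
print(toy["hour_ts"].dt.hour.tolist())
```

Once parsed, components such as the hour of day or the day of week can be read off the .dt accessor instead of slicing the integer by hand.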

Sample data in DataFrame form is loaded as df. pandas is available in your workspace as pd.

In [5]:
# Import data
from pandas import read_pickle
df = read_pickle("data/data.pkl")

# Look at the basics of the DataFrame
print(df.head(5))
print(df.columns)

# Define X and y
X = df.loc[:, ~df.columns.isin(['click'])]
y = df.click

# Sample CTR
print("Sample CTR :\n", 
      y.sum()/len(y))

     C1    C14  C15  C16   C17  C18  C19     C20  C21  banner_pos  click  \
0  1005  20596  320   50  2161    0   35      -1  157           0      0   
1  1005  15701  320   50  1722    0   35  100084   79           0      1   
2  1005  20596  320   50  2161    0   35      -1  157           0      0   
3  1005  20362  320   50  2333    0   39      -1  157           0      0   
4  1005  17212  320   50  1887    3   39      -1   23           0      1   

   device_conn_type     device_model_int  device_type      hour  
0                 0 -6892224247118359062            1  14102100  
1                 0   137884114573136964            1  14102100  
2                 0 -2512938341375609741            1  14102100  
3                 0  7741261153921945767            1  14102100  
4                 0  7090997833464111984            1  14102100  
Index(['C1', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21',
       'banner_pos', 'click', 'device_conn_type', 'device_model_int',
       '
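
Because click is a 0/1 indicator, the ratio of clicks to rows computed in the cell above is the same as the mean of the column. A small sketch on made-up values (y_toy is a hypothetical stand-in for y, not the Avazu sample):

```python
import pandas as pd

# Toy click column (illustrative values, not the Avazu sample)
y_toy = pd.Series([0, 1, 0, 0, 1], name="click")

ctr_ratio = y_toy.sum() / len(y_toy)  # clicks divided by number of rows
ctr_mean = y_toy.mean()               # same quantity for a 0/1 column

print(ctr_ratio, ctr_mean)
```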

# Feature exploration
Using the same Avazu dataset, you will explore how the values of device_type and banner_pos are distributed, as well as how CTR varies based on them.

Sample data in DataFrame form is loaded as df. The X and y variables that you created in the last exercise are available in your workspace, and pandas is available as pd.

In [7]:
# Distribution of values for device type
print("Distribution of device type: ")
print(X.device_type.value_counts()/len(X))

# Clicks per device type divided by all rows (each type's contribution to the sample CTR)
print("CTR by device type: ")
print(df.groupby('device_type')['click'].sum()/len(y))

# Distribution of values for banner position
print("Distribution of banner position: ")
print(X.banner_pos.value_counts()/len(X))

# Clicks per banner position divided by all rows (each position's contribution to the sample CTR)
print("CTR by banner position: ")
print(df.groupby('banner_pos')['click'].sum()/len(y))

Distribution of device type: 
1    0.966667
0    0.033333
Name: device_type, dtype: float64
CTR by device type: 
device_type
0    0.033333
1    0.333333
Name: click, dtype: float64


For both device type and banner position, notice that most values fall into a single category (device_type 1 accounts for about 97% of rows in the output above), and that one category appears to have a markedly higher CTR than the others.
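
One point worth checking: the cell above divides each group's click count by len(y), i.e. by the total number of rows, so the printed values add up to the overall sample CTR rather than giving the click rate within each category. The sketch below, on made-up data, puts both quantities side by side; the toy frame and the column names in summary are illustrative assumptions.

```python
import pandas as pd

# Toy data (illustrative values, not the Avazu sample)
toy = pd.DataFrame({
    "device_type": [1, 1, 1, 1, 0],
    "click":       [1, 0, 0, 1, 1],
})

summary = toy.groupby("device_type")["click"].agg(
    impressions="count",  # rows in the group
    clicks="sum",         # clicked rows in the group
    ctr="mean",           # clicks / rows within the group
)

# Share of all rows that fall in each group
summary["share_of_rows"] = summary["impressions"] / len(toy)

# Clicks in the group divided by all rows, as computed in the cell above
summary["clicks_over_all_rows"] = summary["clicks"] / len(toy)

print(summary)
```

The ctr column answers "how often is an ad clicked on this device type?", while clicks_over_all_rows answers "how much of the overall CTR comes from this device type?". A small category can rank low on the second measure even when its within-group rate is high.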