In [110]:
# IMPORTS
import numpy as np
import scipy as sp
import pandas as pd

from google.cloud import storage
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# CONSTANTS
BUCKET_NAME = 'bu-ds561-dcmag-hw6'
BLOB_NAME = 'request.csv'

# Retrieving the Data From the Storage Bucket

First things first, we need to get `request.csv` from the storage bucket. We will save it locally to our `data/` directory and then load it into a pandas DataFrame.

In [111]:
# Get the Blob from the Bucket
storage_client = storage.Client()
bucket = storage_client.bucket(BUCKET_NAME)
blob = bucket.blob(BLOB_NAME)

# Open the Blob and read the data into a Pandas DataFrame
with blob.open('r') as f:
    df = pd.read_csv(f)
    
# Store the DataFrame as a CSV file (for future use)
df.to_csv('data/request.csv', index=False)

# Evaluating the Data

Now that we have the data, we need to clean it and extract the features we want to use for our model. 

We're going to filter out unnecessary columns like `request_id` and `time_of_request`, as those are just generated at the time of the request and give no indication of income.

We're going to clean the data by removing the `files/` prefix and the `.html` extension from the filename, encoding the gender as a 0/1 flag, and mapping the age and income brackets from their enumerated lists to ordinal integers.

In [112]:
# Load the DataFrame from the CSV file
df = pd.read_csv('data/request.csv', on_bad_lines='skip')

# Add Headers to the DataFrame
df.columns = ['request_id', 'country', 'client_ip', 'gender', 'age', 'income', 'is_banned', 'time_of_request', 'requested_file']

# Drop Unnecessary Columns & Duplicates
df = df.drop(['request_id', 'time_of_request'], axis=1)
df.drop_duplicates(inplace=True)

# Map the Gender Column to Boolean
# (Male = 1, Female = 0)
df['gender'] = df['gender'].map({'Male': 1, 'Female': 0})

# Map the Age/Income Columns to Integer Values
ages_list = ['0-16', '17-25', '26-35', '36-45', '46-55', '56-65', '66-75', '76+']
incomes_list = ['0-10k', '10k-20k', '20k-40k', '40k-60k', '60k-100k', '100k-150k', '150k-250k', '250k+']

df['age'] = df['age'].map({age: i for i, age in enumerate(ages_list)})
df['income'] = df['income'].map({income: i for i, income in enumerate(incomes_list)})

# Convert the Country column to integer category codes (label encoding, not one-hot)
df['country'] = pd.Categorical(df['country']).codes

# Convert the IP addresses to integers by stripping the dots
# (note: not one-to-one, e.g. 1.13.98.85 and 11.39.88.5 both become 1139885)
df['client_ip'] = df['client_ip'].apply(lambda x: int(x.replace('.', '')))

# Clean the Requested File Column
# files/1.html -> 1
df['requested_file'] = df['requested_file'].apply(lambda x: x.split('/')[1].split('.')[0])

# Save the DataFrame as a CSV file (for future use)
df.to_csv('data/cleaned_request.csv', index=False)

print(df.head())

   country     client_ip  gender  age  income  is_banned requested_file
0       21    1139885159       0    6       7          0           3475
1       62    9522297201       1    5       7          0           4678
3      155    2442221370       1    5       3          0           2116
5      163  210162186132       0    6       2          0           5762
7       45    1932491643       0    4       0          0           3813
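
The dot-stripping conversion in the cleaning cell is not one-to-one: different addresses can collapse to the same integer. As a minimal sketch (not what the cell above does), the standard-library `ipaddress` module maps each IPv4 address to a unique integer. The helper names `strip_dots` and `ip_to_int` are hypothetical, and the two addresses are made up for illustration.

```python
import ipaddress


def strip_dots(ip):
    """Conversion used in the cleaning cell: drop the dots, parse as int."""
    return int(ip.replace('.', ''))


def ip_to_int(ip):
    """Assumed alternative: the unique 32-bit integer of an IPv4 address."""
    return int(ipaddress.IPv4Address(ip))


# Two made-up addresses that differ only in where the dots fall
a, b = '1.13.98.85', '11.39.88.5'

print(strip_dots(a) == strip_dots(b))  # True: both become 1139885
print(ip_to_int(a) == ip_to_int(b))    # False: the encodings stay distinct
```

Swapping this in via `df['client_ip'].apply(ip_to_int)` would change the values printed above, so it is an option to evaluate rather than a drop-in replacement.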


# Train/Test & Selecting the Model

Now that we have our cleaned dataframe, we need to split it into a training and testing set. We will use the training set to train our model and the testing set to evaluate the accuracy of our model. We will be experimenting with a few different models and features to see which one performs the best.

## Country Model

In this model, we predict the country from the client IP. Judging by the HTTP client code, this should be relatively simple: addresses that belong to the same country should share a consistent octet pattern. As long as we convert the IP addresses to a numerical feature with the shape the model expects, that is enough.

I chose a DecisionTreeClassifier because it provides feature importance scores. With one column per octet, these scores would show which octets matter most in determining the country; the cell below passes the IP as a single numeric feature. Additionally, tree-based models are robust to outliers and work with integer-encoded categorical data.

In [113]:
# Load the cleaned DataFrame from the CSV file
df = pd.read_csv('data/cleaned_request.csv', on_bad_lines='skip')

# Split the data for the IP model (predicting country)
X_ip = df['client_ip']
Y_ip = df['country']

# Reshape the data
X_ip = X_ip.values.reshape(-1, 1)

# Split the data into training and testing sets
X_ip_train, X_ip_test, Y_ip_train, Y_ip_test = train_test_split(X_ip, Y_ip, test_size=0.2)

# Build the IP model
country_model = DecisionTreeClassifier()

# Fit the IP model
country_model.fit(X_ip_train, Y_ip_train)

# Evaluate the IP model
Y_ip_pred = country_model.predict(X_ip_test)
ip_accuracy = country_model.score(X_ip_test, Y_ip_test)
print("Accuracy for Country Prediction:", ip_accuracy)

Accuracy for Country Prediction: 0.9911


We can see that the model achieves an accuracy of over 99% on the test set.
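
Because the cell above feeds the IP to the tree as a single number, `feature_importances_` cannot separate the octets. Below is a minimal sketch of a per-octet variant, run on made-up data in which the label depends only on the first octet. The helper `ip_to_octets`, the `octet_*` names, and the toy labels are all assumptions for illustration and are not taken from the assignment data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def ip_to_octets(ip):
    """Split a dotted IPv4 string into four integer features."""
    return [int(part) for part in ip.split('.')]


# Hypothetical toy data: 500 random addresses, four made-up "countries"
# whose label is decided by the first octet alone.
rng = np.random.default_rng(0)
raw_octets = rng.integers(0, 256, size=(500, 4))
ips = ['.'.join(str(o) for o in row) for row in raw_octets]
labels = raw_octets[:, 0] // 64

# One column per octet instead of one concatenated number
X = np.array([ip_to_octets(ip) for ip in ips])
octet_model = DecisionTreeClassifier(random_state=0).fit(X, labels)

octet_names = ['octet_1', 'octet_2', 'octet_3', 'octet_4']
importances = dict(zip(octet_names, octet_model.feature_importances_))
for name, score in importances.items():
    print(f"{name}: {score:.3f}")
```

On the real data, the same idea would mean building the octet columns from the original dotted addresses before they are converted to a single integer.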

## Income Model

In this model, we predict the income from whichever features we choose. Some features, such as country and gender, would plausibly relate to income. However, the HTTP script generates the income pseudorandomly, so we expect the model to do no better than chance under a uniform distribution of incomes.

I chose a DecisionTreeClassifier because it provides feature importance scores. In this context that means we can see which features are the most important in determining the income. Additionally, tree-based models are robust to outliers and work with integer-encoded categorical data.

In [114]:
# Load the cleaned DataFrame
df = pd.read_csv('data/cleaned_request.csv', on_bad_lines='skip')

# Select relevant features
X_income = df[['country', 'client_ip', 'gender', 'age', 'is_banned', 'requested_file']]
Y_income = df['income']

# Split the data into training and testing sets
X_income_train, X_income_test, Y_income_train, Y_income_test = train_test_split(X_income, Y_income, test_size=0.2)

# Build the income model (the 8 income brackets are treated as classes)
income_model = DecisionTreeClassifier()

# Train the model
income_model.fit(X_income_train, Y_income_train)

# Evaluate the income model
Y_income_pred = income_model.predict(X_income_test)
income_accuracy = income_model.score(X_income_test, Y_income_test)
print("Accuracy for Income Prediction:", income_accuracy)

Accuracy for Income Prediction: 0.1284


Since the 8 income brackets are uniformly distributed, we expect the model to achieve an accuracy of about 12.5% (1/8). Because the income is generated pseudorandomly, there is no correlation between the income and the other features, so this model is performing as expected.
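
To make the 1/8 reference point concrete, here is a small sketch of a majority-class baseline on simulated labels. The helper `majority_baseline_accuracy` is hypothetical, and the labels are drawn uniformly from 8 classes as a stand-in for the income column, so the printed number comes from this simulation and not from the request data.

```python
import random
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


# Simulated stand-in for the income column: 8 brackets, uniform at random
random.seed(0)
labels = [random.randrange(8) for _ in range(20000)]
train_labels, test_labels = labels[:16000], labels[16000:]

baseline = majority_baseline_accuracy(train_labels, test_labels)
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

A model whose accuracy sits close to this baseline is extracting essentially no signal from its features, which matches what we see for the income model.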