# Problem
Using a dataset of past Internet advertisements, can we accurately predict whether an image is an advertisement based on its attributes?

# Project
The features encode the geometry of the image (if available) as well as phrases occurring in the URL, 
the image's URL and alt text, the anchor text, and words occurring near the anchor text.

Number of Instances: 3,279 (2,821 non ads, 458 ads)
Number of Attributes: 1,558 (3 continuous; others binary)

28% of instances are missing some of the continuous attributes.
Missing values should be interpreted as "unknown".
Class distribution (number of instances per class): 2,821 non ads, 458 ads.

The task is to predict whether an image is an advertisement ("ad") or not ("non ad").
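Because the two classes are far from balanced, it helps to know what accuracy a trivial classifier would reach before judging any model. The sketch below works only from the counts stated above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts as stated in the dataset description above.
n_non_ads = 2821
n_ads = 458
n_total = n_non_ads + n_ads

# Accuracy of a trivial classifier that always predicts "non ad".
majority_baseline = n_non_ads / n_total
# Share of instances that are ads (the positive class).
ad_rate = n_ads / n_total

print(f"total instances: {n_total}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"share of ads: {ad_rate:.3f}")
```

Always predicting "non ad" is already right about 86% of the time, so plain accuracy should be read against that figure, and per-class metrics such as precision and recall on the "ad" class are likely to be more informative.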

# Deliverables
Please send us the following:
- Code, and associated files, used for the project. You can send us a zipfile, or upload the project to a public GitHub repo.
- The algorithm you developed to make your predictions.
- How we can run the algorithm on a test data set.
- The process you used to analyze the data and come to your conclusions.

In [180]:
import os
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

In [2]:
pth = './datafiles/'
inputfile = 'data'
colfile = 'column.names.txt'

In [167]:
df1 = pd.read_csv(os.path.join(pth, inputfile), header=None)
df2 = pd.read_csv(os.path.join(pth, colfile), sep=":", skiprows=0)
df2 = df2.reset_index()
df2.columns = ['variable', 'type']

In [152]:
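def ValidateNumeric(df, col):
    # Map each distinct value to an int if it is a digit string, otherwise to NaN.
    coldic = dict((x, int(x)) if str(x).isdigit() else (x, np.nan) for x in df[col].unique())
    # Apply the mapping in one pass (the .ix indexer used previously has been
    # removed from pandas), then cast to float so NaN can be represented.
    df[col] = df[col].map(coldic).astype(float)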
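As a possible alternative to the per-value mapping above, pandas can coerce a whole column at once with `pd.to_numeric`. This is only a sketch: `coerce_numeric` is a hypothetical helper, not part of the original notebook, and the `"?"` missing-value marker in the toy data is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def coerce_numeric(df, col):
    """Hypothetical alternative to ValidateNumeric: convert a column to float,
    turning anything that cannot be parsed as a number into NaN."""
    # Strip whitespace so padded values such as " 60" still parse.
    cleaned = df[col].astype(str).str.strip()
    df[col] = pd.to_numeric(cleaned, errors="coerce").astype(float)

# Toy frame; the "?" marker for missing values is assumed, not taken from the source.
demo = pd.DataFrame({"height": ["125", " 60", "?", "33"]})
coerce_numeric(demo, "height")
print(demo["height"].tolist())
```

Note that this is not a drop-in replacement: unlike the `isdigit` check, `pd.to_numeric` also accepts decimals, negative numbers, and whitespace-padded values, so it keeps entries that `ValidateNumeric` would turn into NaN.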

In [174]:
# df2['variable'] = df2['variable'].str.replace("[^A-Za-z0-9]+", "_")
df1.columns = df2['variable'].tolist() + ['ad']
continousvars = df2.loc[df2['type'].str.strip() == 'continuous.', 'variable'].tolist() + ['ad']

In [175]:
for col in df1.columns:
    if col not in continousvars:
        ValidateNumeric(df1, col)

In [176]:
xcols = [x for x in df1.columns if x!='ad']
xs = df1[xcols]
ys = df1['ad']=='ad.'

In [181]:
X_train, X_test, y_train, y_test = train_test_split(
    xs, ys, test_size=0.33, random_state=420)
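With only 458 ads among 3,279 instances, a purely random split can leave the train and test sets with slightly different ad rates. One optional variation, not used in the notebook above, is to pass `stratify` to `train_test_split`. The sketch below uses synthetic stand-in data (`xs_demo` and `ys_demo` are hypothetical names) built from the stated class counts, so it runs without the data files.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class counts stated in the description above.
rng = np.random.default_rng(0)
ys_demo = pd.Series([True] * 458 + [False] * 2821)
xs_demo = pd.DataFrame(rng.integers(0, 2, size=(len(ys_demo), 5)))

# stratify keeps the ad / non-ad proportions nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    xs_demo, ys_demo, test_size=0.33, random_state=420, stratify=ys_demo)

print(f"ad rate overall: {ys_demo.mean():.3f}")
print(f"ad rate train:   {y_tr.mean():.3f}")
print(f"ad rate test:    {y_te.mean():.3f}")
```

On the real data the same idea would amount to adding `stratify=ys` to the existing call; whether that is worthwhile depends on how sensitive the downstream evaluation is to small shifts in class balance.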