# Introduction to Data Science
## Homework 2

Student Name: Joyce Wu

Student Netid: jmw784
***

### Part 1: Case study
- Read [this article](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html) in the New York Times.
- Use what we've learned in class and from the book to describe how one could set Target's problem up as a predictive modeling problem, such that they could have gotten the results that they did.  Formulate your solution as a proposed plan using our data science terminology.  Include all the aspects of the formulation that you see as relevant to solving the problem.  Be precise but concise.

1. Understand the business motivation. The goal is to predict whether a woman is pregnant before competing retailers can, so that Target can send her targeted ads first. This way she will build brand loyalty to Target's baby products.
2. Invest in the data. Purchase the data about customers from outside companies and/or set up a system to collect this data when customers make purchases. Prepare the data so it is in a format that you can develop a model with.
3. Perform feature selection. Decide whether each feature should be treated as continuous or discretized. Then reduce the number of irrelevant features in your model. You can use domain knowledge by consulting a psychologist, surveying pregnant women, visualizing the data, etc. You can also start with many candidate features and calculate the information gain of each with respect to predicting whether a woman is pregnant. For example, one feature could indicate whether a woman increased her purchases of hand lotion. Features could be related to purchase frequency, purchase categories, and purchase volume. The target variable is whether the woman turns out to be pregnant within a certain period of time, and it is a binary variable (yes/no). This is a classification problem.
4. Develop a model using a supervised segmentation technique. One such technique is a tree-structured model, but other algorithms may also be employed. Decide whether you want the probability that a woman is pregnant or just a yes/no output, and build that into your model.
5. Train your model on the training set, which includes data from women who did turn out to be pregnant and women who did not. Check whether your model is overfitting. Introduce regularization or set limits on your algorithm if overfitting is a problem.
6. Test your model with new data. Repeat the previous steps as necessary if the model does not perform well on new data.
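
The information gain mentioned in step 3 can be illustrated with a minimal sketch. The data below is a hypothetical toy sample (not from the article), and the names `entropy`, `information_gain`, `lotion_increase`, and `pregnant` are illustrative assumptions rather than anything Target used.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy data: 1 = shopper increased hand lotion purchases, 0 = did not.
lotion_increase = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical target: 1 = turned out to be pregnant, 0 = did not.
pregnant = [1, 1, 1, 0, 0, 0, 1, 0]

gain = information_gain(lotion_increase, pregnant)
print(round(gain, 3))
```

A feature with higher information gain separates pregnant from non-pregnant shoppers more cleanly, so ranking candidate features by this value is one way to prune irrelevant ones.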

### Part 2: Exploring data in the command line
For this part we will be using the data file located in `"data/advertising_events.csv"`. This file consists of records that pertain to some online advertising events on a given day. There are 4 comma separated columns in this order: `userid`, `timestamp`, `domain`, and `action`. These fields are of type `int`, `int`, `string`, and `int` respectively. Answer the following questions using Linux/Unix bash commands. All questions can be answered in one line (sometimes, with pipes)! Some questions will have many possible solutions. Don't forget that in IPython notebooks you must prefix all bash commands with an exclamation point, i.e. `"!command arguments"`.

[Hints: You can experiment with whatever you want in the notebook and then delete things to construct your answer later.  You can also use ssh to use the actual bash shell on EC2 (see original directions) and then just paste your answers here. Recall that once you enter the "!" then filename completion should work.]

1\. How many records (lines) are in this file?

In [1]:
# Place your code here
!wc -l <advertising_events.csv

   10341


2\. How many unique users are in this file? (hint: consider the 'cut' command and use pipe operator '|')

In [2]:
# Place your code here
!cat advertising_events.csv | cut -d, -f1 | sort | uniq | wc -l

     732


3\. Rank all domains by the number of visits they received in descending order. (hint: consider the 'cut', 'uniq' and 'sort' commands and the pipe operator).

In [3]:
# Place your code here
!cat advertising_events.csv | cut -d, -f3 | sort | uniq -c | sort -k1,1nr

3114 google.com
2092 facebook.com
1036 youtube.com
1034 yahoo.com
1022 baidu.com
 513 wikipedia.org
 511 amazon.com
 382 qq.com
 321 twitter.com
 316 taobao.com
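
As a cross-check, the three pipelines above can be sketched in Python with the standard library. The sample below is an assumption: the first two rows are copied from the user 37 records in this notebook, and the remaining rows are hypothetical, so the counts do not match the real file.

```python
import csv
import io
from collections import Counter

# Sample in the four-column layout: userid, timestamp, domain, action.
# Rows 1-2 appear in this notebook's output; rows 3-5 are hypothetical.
sample = io.StringIO(
    "37,648061658,google.com,0\n"
    "37,644493341,facebook.com,2\n"
    "12,650000001,google.com,1\n"
    "137,650000002,google.com,0\n"
    "12,650000003,yahoo.com,1\n"
)
rows = list(csv.reader(sample))

# Equivalent of: wc -l
num_records = len(rows)
# Equivalent of: cut -d, -f1 | sort | uniq | wc -l
num_users = len({row[0] for row in rows})
# Equivalent of: cut -d, -f3 | sort | uniq -c | sort -k1,1nr
domain_rank = Counter(row[2] for row in rows).most_common()

print(num_records, num_users)
for domain, visits in domain_rank:
    print(visits, domain)
```

`Counter.most_common()` compares the counts as integers, which is the same reason a numeric sort flag is the safer choice in the shell pipeline.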


4\. List all records for the user with user id 37. (hint: this can be done using 'grep')

In [4]:
# Place your code here
!cat advertising_events.csv | grep "^37,"

37,648061658,google.com,0
37,642479972,google.com,2
37,644493341,facebook.com,2
37,654941318,facebook.com,1
37,649979874,baidu.com,1
37,653061949,yahoo.com,1
37,655020469,google.com,3
37,640878012,amazon.com,0
37,659864136,youtube.com,1
37,640361378,yahoo.com,1
37,653862134,facebook.com,0
37,648828970,youtube.com,0
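
The `^37,` pattern matters because an unanchored `37` would also match other user ids and timestamps. A small sketch of the difference, where the first line is taken from the output above and the other two are hypothetical:

```python
import re

lines = [
    "37,648061658,google.com,0",   # real record for user 37
    "137,650000002,google.com,0",  # hypothetical: user id only ends in 37
    "12,650000037,yahoo.com,1",    # hypothetical: timestamp contains 37
]

# Anchored to the start of the line and followed by the field separator.
anchored = [line for line in lines if re.search(r"^37,", line)]
# Unanchored: matches "37" anywhere in the line.
unanchored = [line for line in lines if re.search(r"37", line)]

print(len(anchored), len(unanchored))
```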


### Part 3: Dealing with data Pythonically

In [5]:
# You might find these packages useful. You may import any others you want!
import pandas as pd
import numpy as np
from IPython.display import display

1\. Load the data set `"data/ads_dataset.tsv"` into a Python Pandas data frame called `ads`.

In [12]:
# Place your code here
ads = pd.read_csv('ads_dataset.tsv', sep='\t')

ads.head()

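| | isbuyer | buy_freq | visit_freq | buy_interval | sv_interval | expected_time_buy | expected_time_visit | last_buy | last_visit | multiple_buy | multiple_visit | uniq_urls | num_checkins | y_buy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 106 | 106 | 0 | 0 | 169 | 2130 | 0 |
| | 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 72 | 72 | 0 | 0 | 154 | 1100 | 0 |
| | 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | 5 | 0 | 0 | 4 | 12 | 0 |
| | 0 | NaN | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 6 | 6 | 0 | 0 | 150 | 539 | 0 |
| | 0 | NaN | 2 | 0.0 | 0.5 | 0.0 | -101.1493 | 101 | 101 | 0 | 1 | 103 | 362 | 0 |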


2\. Write a Python function called `getDfSummary()` that does the following:
- Takes as input a data frame
- For each variable in the data frame calculates the following features:
  - `number_nan` to count the number of missing not-a-number values
  - Ignoring missing, NA, and Null values:
    - `number_distinct` to count the number of distinct values a variable can take on
    - `mean`, `max`, `min`, `std` (standard deviation), and `25%`, `50%`, `75%` to correspond to the appropriate percentiles
- All of these new features should be loaded in a new data frame. Each row of the data frame should be a variable from the input data frame, and the columns should be the new summary features.
- Returns this new data frame containing all of the summary information

Hint: The pandas `describe()` [(manual page)](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method returns a useful series of values that can be used here.

In [7]:
def getDfSummary(input_data):
    """Return one row per column of input_data with NaN counts and summary statistics."""
    output_data = pd.DataFrame()
    output_data['variable'] = pd.Series([column for column in input_data])
    # Number of missing (NaN) values in each column
    output_data['number_nan'] = pd.Series(np.sum(input_data[column].isnull()) for column in input_data)
    # Number of distinct values, ignoring NaN
    output_data['number_distinct'] = pd.Series(len(input_data[column].dropna().unique()) for column in input_data)
    # pandas skips NaN by default when computing mean, max, min, and std
    output_data['mean'] = pd.Series(input_data[column].mean() for column in input_data)
    output_data['max'] = pd.Series(input_data[column].max() for column in input_data)
    output_data['min'] = pd.Series(input_data[column].min() for column in input_data)
    output_data['std'] = pd.Series(input_data[column].std() for column in input_data)
    # Percentiles computed on the non-missing values only
    output_data['25%'] = pd.Series(input_data[column].dropna().quantile(q=0.25) for column in input_data)
    output_data['50%'] = pd.Series(input_data[column].dropna().quantile(q=0.5) for column in input_data)
    output_data['75%'] = pd.Series(input_data[column].dropna().quantile(q=0.75) for column in input_data)
    return output_data

getDfSummary(ads)

| | variable | number_nan | number_distinct | mean | max | min | std | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | isbuyer | 0 | 2 | 0.042632 | 1.0 | 0.0 | 0.202027 | 0.0 | 0.0 | 0.0 |
| 1 | buy_freq | 52257 | 10 | 1.240653 | 15.0 | 1.0 | 0.782228 | 1.0 | 1.0 | 1.0 |
| 2 | visit_freq | 0 | 64 | 1.852777 | 84.0 | 0.0 | 2.92182 | 1.0 | 1.0 | 2.0 |
| 3 | buy_interval | 0 | 295 | 0.210008 | 174.625 | 0.0 | 3.922016 | 0.0 | 0.0 | 0.0 |
| 4 | sv_interval | 0 | 5886 | 5.82561 | 184.9167 | 0.0 | 17.595442 | 0.0 | 0.0 | 0.104167 |
| 5 | expected_time_buy | 0 | 348 | -0.19804 | 84.28571 | -181.9238 | 4.997792 | 0.0 | 0.0 | 0.0 |
| 6 | expected_time_visit | 0 | 15135 | -10.210786 | 91.40192 | -187.6156 | 31.879722 | 0.0 | 0.0 | 0.0 |
| 7 | last_buy | 0 | 189 | 64.729335 | 188.0 | 0.0 | 53.476658 | 18.0 | 51.0 | 105.0 |
| 8 | last_visit | 0 | 189 | 64.729335 | 188.0 | 0.0 | 53.476658 | 18.0 | 51.0 | 105.0 |
| 9 | multiple_buy | 0 | 2 | 0.006357 | 1.0 | 0.0 | 0.079479 | 0.0 | 0.0 | 0.0 |
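
For comparison, here is a sketch of how the same summary might be built on top of `describe()`, as the hint suggests. The function name `get_df_summary_describe` and the `toy` frame are hypothetical; unlike `getDfSummary()`, this version uses the variable names as the index instead of a `variable` column, and it assumes all columns are numeric.

```python
import numpy as np
import pandas as pd

def get_df_summary_describe(input_data):
    """Hypothetical describe()-based variant; one row per numeric input column."""
    # NaN counts and distinct counts (nunique() ignores NaN by default).
    counts = pd.DataFrame({
        'number_nan': input_data.isnull().sum(),
        'number_distinct': input_data.nunique(),
    })
    # describe() ignores NaN; transposing puts one input column on each row.
    stats = input_data.describe().T[['mean', 'max', 'min', 'std', '25%', '50%', '75%']]
    return counts.join(stats)

# Hypothetical toy frame, not the real ads data.
toy = pd.DataFrame({
    'isbuyer': [0, 0, 1, 0],
    'buy_freq': [np.nan, np.nan, 2.0, np.nan],
})
toy_summary = get_df_summary_describe(toy)
print(toy_summary)
```

Because the statistics are computed in one vectorized pass rather than column by column, a variant like this may run faster on wide frames, though that is not measured here.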


3\. How long does it take for your `getDfSummary()` function to work on your `ads` data frame? Show us the results below.

Hint: `%timeit getDfSummary(ads)`

In [8]:
# Place your code here
%timeit getDfSummary(ads)

10 loops, best of 3: 76.8 ms per loop
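
Outside IPython, a similar measurement can be sketched with the standard `timeit` module. The `workload` function below is a hypothetical stand-in for `getDfSummary(ads)`, so the timing it prints is unrelated to the result above.

```python
import timeit

def workload():
    """Hypothetical stand-in for getDfSummary(ads)."""
    return sum(i * i for i in range(1000))

# Three repeats of ten calls each, mirroring "10 loops, best of 3".
loops = 10
totals = timeit.repeat(workload, number=loops, repeat=3)

# Each total covers `loops` calls, so divide to get the per-call time.
best_per_loop_ms = min(totals) / loops * 1000
print("%d loops, best of %d: %.3g ms per loop" % (loops, len(totals), best_per_loop_ms))
```

Reporting the minimum rather than the mean follows the `%timeit` convention: slower runs are usually caused by other activity on the machine, not by the code being timed.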


4\. Using the results returned from `getDfSummary()`, which fields, if any, contain missing `NaN` values?

In [9]:
# Place your code here
summary = getDfSummary(ads)
missingVariables = summary[summary['number_nan']>0]['variable']
missingVariables

1    buy_freq
Name: variable, dtype: object

5\. For the fields with missing values, does it look like the data is missing at random? Are there any other fields that correlate perfectly, or predict that the data is missing? If missing, what should the data value be?

Hint: create another data frame that has just the records with a missing value. Get a summary of this data frame using `getDfSummary()` and compare the differences. Do some feature distributions change dramatically?

In [10]:
# Place your code here

for variable in missingVariables:
    missingRows = ads[ads[variable].isnull()]
    display(getDfSummary(missingRows))

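| | variable | number_nan | number_distinct | mean | max | min | std | 25% | 50% | 75% |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | isbuyer | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | buy_freq | 52257 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | visit_freq | 0 | 48 | 1.651549 | 84.0 | 1.0 | 2.147955 | 1.0 | 1.0 | 2.0 |
| 3 | buy_interval | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | sv_interval | 0 | 5112 | 5.686388 | 184.9167 | 0.0 | 17.623555 | 0.0 | 0.0 | 0.041667 |
| 5 | expected_time_buy | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | expected_time_visit | 0 | 13351 | -9.669298 | 91.40192 | -187.6156 | 31.23903 | 0.0 | 0.0 | 0.0 |
| 7 | last_buy | 0 | 189 | 65.741317 | 188.0 | 0.0 | 53.484622 | 19.0 | 52.0 | 106.0 |
| 8 | last_visit | 0 | 189 | 65.741317 | 188.0 | 0.0 | 53.484622 | 19.0 | 52.0 | 106.0 |
| 9 | multiple_buy | 0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |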


No, it does not look like the values are missing at random. `isbuyer`, `buy_interval`, `expected_time_buy`, and `multiple_buy` each take the single value 0 in every record where `buy_freq` is missing. These records appear to come from people who visit but never buy anything, so a missing `buy_freq` should be 0.
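
A minimal sketch of how this conclusion could be checked and applied, using a hypothetical toy frame rather than the real `ads` data (the names `toy`, `missing_matches_nonbuyer`, and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same two columns as the ads data.
toy = pd.DataFrame({
    'isbuyer': [0, 1, 0, 1],
    'buy_freq': [np.nan, 2.0, np.nan, 1.0],
})

# True only if buy_freq is missing exactly for the rows where isbuyer == 0.
missing_matches_nonbuyer = bool((toy['buy_freq'].isnull() == (toy['isbuyer'] == 0)).all())

# Replace missing purchase frequencies with 0, leaving the original frame untouched.
filled = toy.assign(buy_freq=toy['buy_freq'].fillna(0))

print(missing_matches_nonbuyer)
print(filled)
```

Running the first check on the real data before imputing would confirm that missingness is fully explained by `isbuyer`, rather than relying on the summary table alone.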

6\. Which variables are binary?

In [11]:
# Place your code here
binaryVariables = []

for column in ads:
    if len(ads[column].dropna().unique()) == 2:
        binaryVariables.append(column)
        
print(binaryVariables)

['isbuyer', 'multiple_buy', 'multiple_visit', 'y_buy']
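
Counting two distinct values finds two-valued columns, which is not quite the same as 0/1 indicator columns. The sketch below shows the difference on a hypothetical toy frame; the stricter `zero_one` check is an added assumption about what "binary" should mean, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real ads data.
toy = pd.DataFrame({
    'isbuyer': [0, 1, 0, 1],
    'buy_freq': [np.nan, 2.0, np.nan, 1.0],  # two distinct non-missing values, but not 0/1
    'visit_freq': [1, 2, 3, 1],
})

# Same criterion as above: exactly two distinct non-missing values.
two_valued = [column for column in toy.columns if toy[column].nunique() == 2]
# Stricter criterion: the non-missing values are exactly 0 and 1.
zero_one = [column for column in toy.columns if set(toy[column].dropna().unique()) == {0, 1}]

print(two_valued)
print(zero_one)
```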
