# Conversion Details Script

### Overview : 
 
This documentation walks through the script one snippet at a time and explains what each step is meant to do.





### Reminder : 
Before you run this script, make sure that Python (the programming language the script is written in) and pip (Python's package installer) are both installed on your computer; otherwise the script will not run. The snippets use Python 2 syntax (`print` statements and `raw_input`), so they need a Python 2 interpreter. If you have not installed these programs yet, see the links below for installation guidance.

* [How to install Python](http://docs.python-guide.org/en/latest/starting/install/win/#install-windows)
* [How to install pip](https://pip.pypa.io/en/stable/installing/#do-i-need-to-install-pip)

---

### Let's get started:

Many open source packages have been built and shared in the Python community, and you can use any of them once it is installed on your computer. Our script needs four: pandas, datetime, calendar, and os. The last three ship with Python, while pandas has to be installed separately (for example with `pip install pandas`). This snippet imports all of them up front for later use.


```python

import pandas as pd
from pandas import Series
import datetime
import calendar
import os

```


In an IPython Notebook, pandas displays at most 20 columns by default. The code below raises that limit to 50 and then reads the setting back to confirm the change.


```python

pd.set_option("display.max_columns", 50)
pd.get_option("display.max_columns")

```



This chunk of code asks you to enter the path of each file you would like to combine into one file for data cleaning. The number of input files is limited to 10, because `range(10)` produces the numbers 0 through 9. **If you have more than 10 files, you can easily change the range number** in the line `file_size = list(range(10))`. At the end, all of your files are combined into a variable named `raw`.


```python

file_size = list(range(10))  # allows up to 10 input files; raise the number if you have more

for i in file_size:
    file_name = raw_input("Please type in path of the file > ")
    input_file = pd.read_csv(file_name)
    print "input_file shape : ", input_file.shape

    if i == 0:
        # The first file becomes the starting point of the combined data.
        raw = input_file
    else:
        # Later files are stacked underneath; ignore_index renumbers the rows.
        raw = raw.append(input_file, ignore_index=True)

    answer = raw_input("If there are other files, please type \"y\", otherwise, press \"n\" > ")
    if answer != "y":
        print "The raw file comprises", i + 1, "file(s).", "The shape of the raw file : ", raw.shape
        break

```
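Note that `DataFrame.append` was removed in pandas 2.0, so the loop above only runs on older pandas releases. Below is a minimal sketch of the same combining step with `pd.concat`, assuming a recent pandas on Python 3. The helper name `combine_files` and the in-memory sample CSVs are made up for illustration and are not part of the original script.

```python
import io

import pandas as pd


def combine_files(sources):
    """Read each CSV source and stack them into one DataFrame."""
    frames = [pd.read_csv(source) for source in sources]
    # ignore_index=True renumbers the rows 0..n-1, like append(..., ignore_index=True)
    return pd.concat(frames, ignore_index=True)


# Hypothetical sample data standing in for two report files.
file_a = io.StringIO("TDID,# Impressions\na,7\nb,0\n")
file_b = io.StringIO("TDID,# Impressions\nc,11\n")

raw = combine_files([file_a, file_b])
print(raw.shape)
```

In real use, the list passed to `combine_files` would hold the file paths typed in by the user instead of `StringIO` objects.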



The original HD conversion details report consists of 99 columns, but not all of them are useful for our analysis. Let's get rid of the unnecessary ones by listing the useful columns in a variable called `columns` and keeping only those.


```python

columns = ["Conversion Time","TDID","Conversion Id","# Impressions","# Display Clicks",
                     "Tracking Tag Name","First Impression Time","First Impression Campaign Name",
                     "First Impression Ad Group Name","Last Impression Time","Last Impression Campaign Name",
                     "Last Impression Ad Group Name","Last Impression Site","Last Impression Country","Last Impression Metro",
                     "Attribution Model","XDIDs","First Impression Device Type","Last Impression Device Type",
                     "Conversion Device Type"]


raw = raw[columns]

print raw.shape

```


This step removes all rows with 0 impressions and then resets the row index so that the rows are numbered consecutively from 0 again, which the later steps rely on.



```python

print len(raw["# Impressions"])

raw = raw[raw["# Impressions"] != 0]
print len(raw["# Impressions"])

raw = raw.reset_index(drop=True)
print len(raw["# Impressions"])

```
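To see what the index reset protects against, here is a small sketch with made-up data. The later steps build new columns from plain lists wrapped in `Series`, which are numbered from 0, and pandas matches those values to rows by index label rather than by position. The `label` column and the sample values are hypothetical.

```python
import pandas as pd
from pandas import Series

# Made-up data: the first row has 0 impressions and will be filtered out.
df = pd.DataFrame({"# Impressions": [0, 7, 11]})
filtered = df[df["# Impressions"] != 0]  # remaining index labels: 1, 2

# Without the reset, the Series (labels 0, 1) only partly lines up with labels 1, 2.
no_reset = filtered.copy()
no_reset["label"] = Series(["first", "second"])
print(no_reset["label"].tolist())

# With the reset, the labels are 0, 1 and every value lands on the right row.
with_reset = filtered.reset_index(drop=True)
with_reset["label"] = Series(["first", "second"])
print(with_reset["label"].tolist())
```

Without the reset, the first remaining row receives "second" and the last row is left empty (NaN). With the reset, both values land where they belong.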



Let's take a sneak peek at the current data.

```python

raw.head()

```



Arithmetic cannot be applied to timestamps that are stored as strings (the object data type in pandas). This step converts the three timestamp columns from strings to datetime values. To make the following calculations easier, each timestamp is also truncated to the hour by setting its minutes, seconds, and microseconds to zero.



```python

conversion_time_list = []

for i in raw["Conversion Time"]:
    t = datetime.datetime.strptime(i,"%Y-%m-%d %H:%M:%S.%f")
    t = datetime.datetime.replace(t,minute = 0, second = 0, microsecond = 0)
    conversion_time_list.append(t)

    
last_impression_list = []    

for i in raw["Last Impression Time"]:
    t = datetime.datetime.strptime(i,"%Y-%m-%d %H:%M:%S.%f")
    t = datetime.datetime.replace(t,minute = 0, second = 0, microsecond = 0)
    last_impression_list.append(t)
    
first_impression_list = []

for i in raw["First Impression Time"]:
    t = datetime.datetime.strptime(i,"%Y-%m-%d %H:%M:%S.%f")
    t = datetime.datetime.replace(t,minute = 0, second = 0, microsecond = 0)
    first_impression_list.append(t)

    
raw["cal_conversion_time"] = Series(conversion_time_list)
raw["cal_last_impression_time"] = Series(last_impression_list)
raw["cal_first_impression_time"] = Series(first_impression_list)

```
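As a small worked example, here is the same truncation applied to a single timestamp taken from the sample output at the end of this document. The helper name `truncate_to_hour` is introduced here for illustration only.

```python
import datetime


def truncate_to_hour(text):
    """Parse a report timestamp and zero out minutes, seconds and microseconds."""
    t = datetime.datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    return t.replace(minute=0, second=0, microsecond=0)


print(truncate_to_hour("2017-01-25 03:07:41.2950"))
```

This prints `2017-01-25 03:00:00`. A helper like this could also be applied to a whole column, for example with `raw["Conversion Time"].apply(truncate_to_hour)`, instead of writing one loop per column.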


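Now that all the timestamps have the correct data type, it's time to calculate the intervals between "First Impression", "Last Impression", and "Conversion". Each interval is split into whole days and a remaining time of day; the remaining hours are divided by 24, rounded to one decimal, and added back to the day count. At the end of this snippet, the impression lag times are expressed in days with one decimal.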



```python


raw["Last To Convert"] = pd.to_datetime(raw["cal_conversion_time"]) - pd.to_datetime(raw["cal_last_impression_time"])
raw["First To Last"] = pd.to_datetime(raw["cal_last_impression_time"]) - pd.to_datetime(raw["cal_first_impression_time"])
raw["First To Convert"] = pd.to_datetime(raw["cal_conversion_time"]) - pd.to_datetime(raw["cal_first_impression_time"])

list_day = []
list_time = []

for row in raw["Last To Convert"]:
    date = str(row).split(" days ")
    list_day.append(date[0])
    list_time.append(date[1])
    
raw["Last_To_Convert_Day"] = Series(list_day)
raw["Last_To_Convert_Time"] = Series(list_time)


list_day = []
list_time = []

for row in raw["First To Last"]:
    date = str(row).split(" days ")
    list_day.append(date[0])
    list_time.append(date[1])
    
raw["First_To_Last_Day"] = Series(list_day)
raw["First_To_Last_Time"] = Series(list_time)


list_day = []
list_time = []


for row in raw["First To Convert"]:
    date = str(row).split(" days ")
    list_day.append(date[0])
    list_time.append(date[1])
    
raw["First_To_Convert_Day"] = Series(list_day)
raw["First_To_Convert_Time"] = Series(list_time)   



day_ratio_list = []

for i in raw["Last_To_Convert_Time"]:
    t = datetime.datetime.strptime(i, "%H:%M:%S")
    day_ratio = round(float(t.hour) / 24, 1)
    day_ratio_list.append(day_ratio)

raw["Last_To_Convert_ratio"] = Series(day_ratio_list)


day_ratio_list = []

for i in raw["First_To_Last_Time"]:
    t = datetime.datetime.strptime(i, "%H:%M:%S")
    day_ratio = round(float(t.hour) / 24, 1)
    day_ratio_list.append(day_ratio)

raw["First_To_Last_ratio"] = Series(day_ratio_list)


day_ratio_list = []

for i in raw["First_To_Convert_Time"]:
    t = datetime.datetime.strptime(i, "%H:%M:%S")
    day_ratio = round(float(t.hour) / 24, 1)
    day_ratio_list.append(day_ratio)

raw["First_To_Convert_ratio"] = Series(day_ratio_list)

raw["Last_To_Convert_Day"] = pd.to_numeric(raw["Last_To_Convert_Day"])
raw["Last_To_Convert_ratio"] = pd.to_numeric(raw["Last_To_Convert_ratio"])
raw["Last_To_Convert"] = raw["Last_To_Convert_Day"] + raw["Last_To_Convert_ratio"]


raw["First_To_Last_Day"] = pd.to_numeric(raw["First_To_Last_Day"])
raw["First_To_Last_ratio"] = pd.to_numeric(raw["First_To_Last_ratio"])
raw["First_To_Last"] = raw["First_To_Last_Day"] + raw["First_To_Last_ratio"]


raw["First_To_Convert_Day"] = pd.to_numeric(raw["First_To_Convert_Day"])
raw["First_To_Convert_ratio"] = pd.to_numeric(raw["First_To_Convert_ratio"])
raw["First_To_Convert"] = raw["First_To_Convert_Day"] + raw["First_To_Convert_ratio"]


```
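The arithmetic can be checked by hand on one row. The sketch below is an alternative way to get the same kind of number, not the script's own method: it divides the total seconds of each interval by the seconds in a day instead of splitting the interval's text. The helper name `lag_in_days` is an assumption; the three timestamps are the hour-truncated values of the first row in the sample output at the end of this document.

```python
import datetime


def lag_in_days(start, end):
    """Return the time between two datetimes in days, rounded to one decimal."""
    delta = end - start
    return round(delta.total_seconds() / 86400.0, 1)


first_impression = datetime.datetime(2017, 1, 6, 3)
last_impression = datetime.datetime(2017, 1, 19, 15)
conversion = datetime.datetime(2017, 1, 25, 3)

print(lag_in_days(first_impression, conversion))       # First_To_Convert
print(lag_in_days(first_impression, last_impression))  # First_To_Last
print(lag_in_days(last_impression, conversion))        # Last_To_Convert
```

This prints `19.0`, `13.5`, and `5.5`, matching the first row of the sample output. Rounding of exact ties (for example 6 hours, which is 0.25 days) depends on the Python version, so the last decimal may differ from the script's output in those cases.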


After the impression lag times, let's create a few more columns to make our data more insightful and easier to use.
* Conversion Device Path : Replaces the "Other" device type with "PC" and joins the three device type columns with "->".

```python

raw[["First Impression Device Type","Last Impression Device Type","Conversion Device Type"]] = raw[["First Impression Device Type","Last Impression Device Type","Conversion Device Type"]].replace("Other","PC")


raw["Conversion Device Path"] = raw["First Impression Device Type"] + "->" + raw["Last Impression Device Type"] + "->" + raw["Conversion Device Type"]
```


* DOW : The day of the week (for example "Wednesday") on which the conversion happened.

```python

day_list = []

for row in raw["Conversion Time"]:
    test = datetime.datetime.strptime(str(row),"%Y-%m-%d %X.%f").date()
    day = datetime.datetime.weekday(test)
    dow = calendar.day_name[day]
    day_list.append(dow)
    
raw["DOW"] = Series(day_list)

```



* Ad Group Path : Joins the first and last impression ad group names with "->".

```python

raw["Ad Group First to Last Imps Path"] = raw["First Impression Ad Group Name"] + "->" + raw["Last Impression Ad Group Name"]


```
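To make the derived columns concrete, the sketch below applies the same rules to single values in plain Python instead of whole pandas columns. The function names `device_path` and `day_of_week` and the device inputs are illustrative assumptions; the timestamps come from the sample output at the end of this document. The weekday helper spells out `%H:%M:%S` rather than the locale-dependent `%X` used above.

```python
import calendar
import datetime


def device_path(first, last, conversion):
    """Join three device types with "->", treating "Other" as "PC"."""
    devices = ["PC" if d == "Other" else d for d in (first, last, conversion)]
    return "->".join(devices)


def day_of_week(text):
    """Return the weekday name for a report timestamp string."""
    day = datetime.datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f").date()
    # weekday() returns 0 for Monday through 6 for Sunday
    return calendar.day_name[day.weekday()]


print(device_path("Other", "PC", "Mobile"))     # hypothetical device types
print(day_of_week("2017-01-25 03:07:41.2950"))
print(day_of_week("2017-01-26 02:44:09.9164"))
```

This should print `PC->PC->Mobile`, `Wednesday`, and `Thursday`, which agrees with the DOW values in the sample output.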



Last but not least, we keep only the columns needed in the final file and leave out the helper columns created during the cleaning process.




```python

raw_final = raw[["Conversion Time","TDID","Conversion Id","# Impressions","# Display Clicks",
                     "Tracking Tag Name","First Impression Time","First Impression Campaign Name",
                     "First Impression Ad Group Name","Last Impression Time","Last Impression Campaign Name",
                     "Last Impression Ad Group Name","Last Impression Site","Last Impression Country","Last Impression Metro",
                     "Attribution Model","XDIDs","First Impression Device Type","Last Impression Device Type",
                     "Conversion Device Type", "First_To_Convert","First_To_Last","Last_To_Convert", "Conversion Device Path", "DOW"]]

```

```python

print raw_final.shape
raw_final.head(5)

```

The shape is `(7542, 25)`, and the first five rows look like this:

| | Conversion Time | TDID | Conversion Id | # Impressions | # Display Clicks | Tracking Tag Name | First Impression Time | First Impression Campaign Name | First Impression Ad Group Name | Last Impression Time | Last Impression Campaign Name | Last Impression Ad Group Name | Last Impression Site | Last Impression Country | Last Impression Metro | Attribution Model | XDIDs | First Impression Device Type | Last Impression Device Type | Conversion Device Type | First_To_Convert | First_To_Last | Last_To_Convert | Conversion Device Path | DOW |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2017-01-25 03:07:41.2950 | cb308cac-8eba-4a1e-995a-5ac8782416d0 | 331c96d5-3b8c-4b27-8d4c-c75f98c02e54 | 7 | 0 | Dealer Locator | 2017-01-06 03:52:46.4076 | #8735_2017_SCHDA_HX Programmatic_Winter Non Event | Honda Focus Models: Civic In-Market - Mobile | 2017-01-19 15:54:30.4399 | #8735_2017_SCHDA_HX Programmatic_Winter Non Event | Honda Focus Models: Civic In-Market - Mobile | ace.mu.nu | United States | Los Angeles CA | Standard | cb308cac-8eba-4a1e-995a-5ac8782416d0 | Mobile | Mobile | Mobile | 19.0 | 13.5 | 5.5 | Mobile->Mobile->Mobile | Wednesday |
| 1 | 2017-01-25 05:17:56.5853 | 28c7af36-9e67-4f48-a930-e4a30d0792d1 | ff3630e7-06bf-4b4b-b77b-d0245abe9c55 | 1 | 0 | Dealer Locator | 2016-12-27 02:22:42.5883 | #6023_SCHDA_Digital Programmatic 2016 - Happy ... | HX_Display_Behavioral_Hispanic_LA DMA - Mobile... | 2016-12-27 02:22:42.5883 | #6023_SCHDA_Digital Programmatic 2016 - Happy ... | HX_Display_Behavioral_Hispanic_LA DMA - Mobile... | holadoctor.com | United States | Los Angeles CA | Standard | 28c7af36-9e67-4f48-a930-e4a30d0792d1 | Mobile | Mobile | Mobile | 29.1 | 0.0 | 29.1 | Mobile->Mobile->Mobile | Wednesday |
| 2 | 2017-01-25 19:16:28.3129 | 97a9df31-8948-44e3-9f2f-35df45a86c51 | c89578a4-4377-4f90-95fc-3b7dbd56b552 | 11 | 0 | Dealer Locator | 2016-12-26 19:40:16.4137 | 6023_SCHDA_Digital Programmatic_2016 Increment... | Hx_Display_Mobile_Behaviorial Targeting_Hispan... | 2016-12-30 13:00:42.0601 | 6023_SCHDA_Digital Programmatic_2016 Increment... | Hx_Display_Mobile_Behaviorial Targeting_Hispan... | laopinion.com | United States | Los Angeles CA | Drawbridge | 97a9df31-8948-44e3-9f2f-35df45a86c51 | Mobile | Mobile | Mobile | 30.0 | 3.8 | 26.3 | Mobile->Mobile->Mobile | Wednesday |
| 3 | 2017-01-25 21:22:37.2311 | 2c62111f-0226-4410-84f8-bafb76d4ed92 | 0796d1bb-2ed9-4f53-b7c7-88c5eb37e060 | 5 | 0 | Offer Page | 2016-12-29 04:01:00.5833 | 6023_SCHDA_Digital Programmatic_2016 Increment... | Hx_Display_Mobile_Behaviorial Targeting_Hispan... | 2016-12-29 05:13:00.1732 | 6023_SCHDA_Digital Programmatic_2016 - hx Happ... | HX_Display_BEHAVIORAL_HONDA CIVIC DEMO_LA DMA ... | www.cookingclassy.com | United States | Los Angeles CA | Standard | 2c62111f-0226-4410-84f8-bafb76d4ed92 | Mobile | Mobile | Mobile | 27.7 | 0.0 | 27.7 | Mobile->Mobile->Mobile | Wednesday |
| 4 | 2017-01-26 02:44:09.9164 | 0c89ae16-f5ec-46cb-8b58-d1a917672d9a | 712349be-6bc4-4009-bc96-6e2a09fffa5d | 1 | 0 | Dealer Locator | 2017-01-19 06:21:44.9499 | #8735_2017_SCHDA_HX Programmatic_Winter Non Event | Honda Intenders | 2017-01-19 06:21:44.9499 | #8735_2017_SCHDA_HX Programmatic_Winter Non Event | Honda Intenders | www.realtor.com | United States | Los Angeles CA | Drawbridge | 0c89ae16-f5ec-46cb-8b58-d1a917672d9a,bae0d786-... | PC | PC | Mobile | 6.8 | 0.0 | 6.8 | PC->PC->Mobile | Thursday |


```python

# Filtering to the tracking tag that you are interested in

raw_final["Tracking Tag Name"].unique()

```

Output: `array(['Dealer Locator', 'Offer Page', 'Homepage', 'Dealer Locator Results'], dtype=object)`

```python

print len(raw.index)

final = raw_final[raw_final["Tracking Tag Name"] == "Homepage"]
print len(final.index)

```

Output: `7542` rows before filtering and `144` rows for the "Homepage" tag.


```python

print "The file is generating..."

writer = pd.ExcelWriter("socal_Conversion_details.xlsx")
final.to_excel(writer, index= False)
writer.save()

print "The file is saved under path", os.getcwd()

```