In [None]:
import json
import urllib
import pandas as pd
import numpy as np
import datetime as dt
import time

# Goals: 

This notebook works through the steps required to get data from a URL and clean it so that it is ready to use. 

* #### Learn how to query using GA Query
    * Build a working query to get pageview data for Liftopia.com
    * Use this query to
* #### Data Cleaning Tricks
    * String manipulation using replace and regex
    * Use this query to
* #### Use the pd.merge function like a VLOOKUP
    * Use a separate source of data to turn product IDs into product names
* #### Convert a string into a datetime object
    * Dates are often complicated to work with; learn some quick tips for manipulating them
    
________________________________________________


### Learn How to Query Using GA Query


- Go to [https://ga-dev-tools.appspot.com/query-explorer/](https://ga-dev-tools.appspot.com/query-explorer/)
- You'll need to authorize using your Liftopia.com account
- Select from the drop-downs to make sure you have the following: 
    <img src="assets/gaquery1.png" width="800" />

- The options there let you select which data source you want (Cloud Store and Liftopia have separate data stores)
- Next, work through the remaining options to build a query with the following parameters:
    - start-date = 2014-12-01
    - end-date = 2014-12-31
    - metrics = ga:pageviews
        - This says "show me pageviews" (it's like the "what" in GoodData)
    - dimensions = ga:pagePath and ga:date
        - This is the "how" from GoodData
    - sort = ga:pageviews
    - filters = ga:pagePath=@product_id
        - This filter uses text matching. The "=@" operator means "contains substring". [Here's more info on filtering](https://developers.google.com/analytics/devguides/reporting/core/v3/reference#filters)

- Next, you should be able to run your query! You should see some results like this: 
    <img src="assets/gaquery2.png" width="800" />
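
The parameters listed above end up as the query string of the URL that GA Query builds for you. As a rough sketch of how that encoding works, the snippet below uses the standard library's `urlencode`. It assumes a December 2014 date range and leaves out the view ID and access token, which the Query Explorer adds for you.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Only the parameters listed above; the date range is an assumption.
params = [
    ('start-date', '2014-12-01'),
    ('end-date', '2014-12-31'),  # must not be earlier than start-date
    ('metrics', 'ga:pageviews'),
    ('dimensions', 'ga:pagePath,ga:date'),
    ('sort', 'ga:pageviews'),
    ('filters', 'ga:pagePath=@product_id'),
]

# urlencode joins the pairs with "&" and escapes characters such as ":" and "@".
query_string = urlencode(params)
print(query_string)
```

Seeing the escaped form makes it easier to recognize the same pieces in the link you copy from the Query Explorer.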



________________________


### Ok - great so you managed to get your query working... How do I see the data?!


- At the very bottom of the query results, there are some options. One of them contains a link that will give you the results in a JSON file: 
    <img src="assets/gaquery3.png" width="700" />

- To see what it looks like, copy the link from the box and paste it into a new tab. If you did it right, you should see something like this: 
    - Note: if the JSON looks ugly, install the Chrome extension JSONView

  <img src="assets/gaquery4.png" width="700" />
  
  
- What you've just done is build a URL that can be read directly from Python using the urllib and json libraries! 
- Note: the access token appended to the end of the URL expires after 60 minutes, so keep that in mind.

________________________



### Accessing that JSON data from Pandas

- Now we will build out a method to get the data out of that JSON file. 

In [None]:
# Fill in the code below to build a working URL that gives the same result as the one you pasted into your browser window
access_token = ''
start_date = ''
end_date = ''
start_index = 1


# The backslash at the end of a line continues the statement on the next line,
# and the + concatenates the pieces. This makes the long string easier to read.
url = 'first part of the URL' + \
      'second part of the url %s' % start_date + \
      'third part of the url %s' % end_date + \
      'more url here...%s' % start_index + \
      'access token here...%s' % access_token

print("===========Does this match the one you copied from GA Query?=========================")
print("URL: %s" % url)
print("=====================================================================================")



In [None]:
# To get the data from the URL into a variable, we combine two libraries: json and urllib.
# Here's the code (urllib.urlopen is Python 2; on Python 3 use urllib.request.urlopen instead):

result = json.load(urllib.urlopen(url))

#run this cell to see what the JSON looks like
result

### Ok that's pretty ugly...
- At this point we have all the data from the JSON file saved in the variable "result" 
- The parsed JSON is a Python dictionary, so you access portions of it by key, like this:

In [None]:
result['query']

- You can go deeper into a level by adding another key: 

In [None]:
result['query']['dimensions']

- Use your web browser to look at the JSON result and find the key for the data that we want to analyze (it should have the page name, the date of the pageview and the number of pageviews). 
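
As a small illustration of both ideas (drilling into nested keys and loading a list of rows into a DataFrame), here is a sketch on a made-up payload. The `rows` key and the sample values are assumptions for the example; check your own response in the browser to confirm the key names.

```python
import json
import pandas as pd

# Made-up payload for illustration only; the real response may differ.
sample_text = '''
{
  "query": {"dimensions": "ga:pagePath,ga:date", "metrics": ["ga:pageviews"]},
  "rows": [
    ["/tickets?product_id=101", "20141201", "12"],
    ["/tickets?product_id=202", "20141202", "7"]
  ]
}
'''
sample = json.loads(sample_text)

# Each extra key in square brackets goes one level deeper.
print(sample['query']['dimensions'])

# A list of equal-length rows maps directly onto named DataFrame columns.
sample_df = pd.DataFrame(sample['rows'],
                         columns=['page_name', 'view_date', 'pageviews'])
print(sample_df)
```

Note that every value in this sample is a string, so numeric columns would still need converting before doing arithmetic on them.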

In [None]:
# OK, now turn that into a DataFrame called df using the column names 'page_name', 'view_date', 'pageviews'.
# Fill in the code here:
df = pd.DataFrame()





### Awesome - now we have our GA Query data in pandas... let's clean it up a bit
- When you have a string, you can use the .split() method to split it into a list of the pieces before and after the character(s) you want to split on.
- Like this: 

In [None]:
print(df['page_name'].iloc[1])
print(df['page_name'].iloc[1].split("c"))
# You can access the items in this list by referencing their index, [0] or [1]
print(df['page_name'].iloc[1].split("c")[0])

- Now we are going to use a similar method on the DataFrame column:
    - Instead of splitting an individual item, we can split a whole column using df['column name'].str.split()

- You're going to want to use .split and maybe even some .replace

In [None]:
splits = df.page_name.str.split('c')
splits.str[0].head()

- Use this method to create a column in your DataFrame called "product_id" that contains the text between "product_id=" and "&"
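
One way this could look, sketched on a few hypothetical page paths (the real paths from GA may be laid out differently), is to split twice: once on "product_id=" and once on "&".

```python
import pandas as pd

# Hypothetical page paths, made up for this example.
pages = pd.Series([
    '/tickets?product_id=12345&trip_date=20141220',
    '/tickets?product_id=678',
    '/home',
])

# Keep what follows "product_id=", then keep what precedes the next "&".
after_key = pages.str.split('product_id=').str[1]
product_id = after_key.str.split('&').str[0]
print(product_id)
```

Rows without "product_id=" come out as missing values rather than raising an error, which is why a later cleanup step is still needed.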

In [None]:
# Keep only the rows whose product_id consists entirely of digits
df = df[df.product_id.str.contains(r'^[0-9]+$', regex=True, na=False)]
df.info()

- Next, let's drop NA values and convert the column to type "int" (hint: use .astype())
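
A minimal sketch of the pattern on a standalone Series with made-up values; applying it to your own column is left as the exercise.

```python
import pandas as pd

# Made-up IDs stored as strings, with one missing value.
ids = pd.Series(['12345', '678', None])

# dropna() removes the missing entry; astype(int) converts the rest to integers.
clean_ids = ids.dropna().astype(int)
print(clean_ids)
```

The order matters here: calling astype(int) while a missing value is still present raises an error.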

In [None]:
#Your code here




### Merging with another Dataframe

- OK, now we have a clean column of numbers that correspond to product IDs. We can use these numbers to bring in the product name. I've included a file called 'produc_id_to_product_name.csv'
- Use this file along with the pandas "merge" functionality to make a new column in the DataFrame that shows the product name
- Include another column that shows whether or not the product is dateless.
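
To show how a merge behaves like a VLOOKUP, here is a sketch on two toy DataFrames. The lookup table's column names (`product_name`, `dateless`) and its values are assumptions; the real CSV may use different ones.

```python
import pandas as pd

views = pd.DataFrame({'product_id': [101, 202, 303],
                      'pageviews': [12, 7, 4]})

# Hypothetical lookup table standing in for the CSV file.
lookup = pd.DataFrame({
    'product_id': [101, 202],
    'product_name': ['Lift Ticket', 'Season Pass'],
    'dateless': [False, True],
})

# how='left' keeps every row of `views`; IDs with no match get NaN.
merged = pd.merge(views, lookup, on='product_id', how='left')
print(merged)
```

A left merge is the closest match to VLOOKUP because it never drops rows from the table you are enriching.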

In [None]:
#Your code here





In [None]:
#your code here


_____________________________
### Converting Strings to Dates
- The last step is to get the trip date out of the page name. Use a technique similar to the one we used for the product ID
- Create a column called df['trip_date'] to hold these values
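
Splitting works here too, but as an alternative sketch the snippet below uses a regular expression with `.str.extract`, which is a different technique from the split approach. The `trip_date=` parameter name and the eight-digit format are assumptions about the page names.

```python
import pandas as pd

# Hypothetical page paths, made up for this example.
pages = pd.Series([
    '/tickets?product_id=12345&trip_date=20141220',
    '/tickets?product_id=678',
])

# The parentheses mark the part of the match to keep: eight digits.
trip_date = pages.str.extract(r'trip_date=(\d{8})', expand=False)
print(trip_date)
```

Page names without a trip date produce a missing value instead of an error.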

#### Inspecting results
- Some of the values you got are empty... Why? Maybe they are dateless products. 
- Use some filtering on the DataFrame to inspect the rows that have a null trip_date and are dateless
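
A sketch of that kind of filtering on a toy DataFrame, assuming `dateless` is a True/False column (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'product_id': [101, 202, 303],
    'trip_date': ['20141220', None, None],
    'dateless': [False, True, False],
})

missing_date = toy['trip_date'].isnull()

# Combine boolean masks with & (and); ~ negates a mask.
dateless_and_missing = toy[missing_date & toy['dateless']]
unexplained = toy[missing_date & ~toy['dateless']]

print(dateless_and_missing)
print(unexplained)
```

If the second result is not empty, some products are missing a date for a reason other than being dateless and deserve a closer look.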

In [None]:
#your code here




In [None]:
# If everything without a date is dateless, we can drop those products.
# Note: dropna() with no arguments drops rows that have a missing value in any column.
df = df.dropna()

- Now that we have a column with a string containing our date, we can use the pandas .str methods to get portions of it out. 
- To convert a year, month, and day into a datetime object, we do the following:
    - Year * 10000 + Month * 100 + Day (for example, 2014 * 10000 + 12 * 100 + 20 = 20141220)
    - Then we can use pd.to_datetime() with format='%Y%m%d' to convert; without a format, an integer is treated as a count of nanoseconds rather than a calendar date.
- To get specific portions from a string we use string indexing: 
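
Here is a sketch of the full round trip on made-up values, assuming the trip date strings are in YYYYMMDD form. Your own strings may need different slice positions.

```python
import pandas as pd

# Hypothetical trip-date strings in YYYYMMDD form.
trip_date = pd.Series(['20141220', '20150103'])

# String indexing through .str pulls out fixed positions from every row.
year = trip_date.str[0:4].astype(int)
month = trip_date.str[4:6].astype(int)
day = trip_date.str[6:8].astype(int)

# Combine into a single number such as 20141220.
as_number = year * 10000 + month * 100 + day

# Convert via a string with an explicit format so it is read as a calendar date.
trip_datetime = pd.to_datetime(as_number.astype(str), format='%Y%m%d')
print(trip_datetime)
```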

In [None]:
#your code here


# Use this to see how many values you have
df.info()

In [None]:
#check out the unique values:
df['product_id'].unique()

- There are some values here we will want to remove. 
    - Take a look at the page names for the rows that aren't numbers. They're weird URLs... I think we can drop them. 
    - Let's use a little regex to clean it up.
    - Regex is a way to do advanced string manipulation
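
As a small sketch of the idea on made-up values, the pattern below keeps only entries made up entirely of digits:

```python
import pandas as pd

# Made-up product_id values, including one odd entry and one missing value.
ids = pd.Series(['12345', '678', 'abc%20def', None])

# ^ and $ anchor the pattern to the whole string; [0-9]+ means one or more digits.
# na=False treats missing values as non-matches.
is_numeric = ids.str.contains(r'^[0-9]+$', na=False)
numeric_ids = ids[is_numeric]
print(numeric_ids)
```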