# Problem Set 3: Data Scraping

> In general, this needs to be a bit clearer about what the output should look like.

In this problem set we will develop our own custom data scrapers, use existing libraries to scrape data, and finally use **Amazon Web Services** (**AWS**) to save the data in the cloud. We will also create a virtual machine on **AWS** that scrapes data continuously, so we don’t have to keep our own computer running the whole time.

First, let’s start by setting up an **AWS** account. **AWS** provides a number of cloud services that make it easy to set up remote processes and take advantage of cloud computing. Its built-in console lets you create instances of services such as remote servers and web-based storage. We will be using **EC2** and **S3**. **EC2** lets us create virtual servers, so we can run scheduled tasks without having to run them continuously on our own computer. **S3** provides storage space in the cloud. We can easily save any scraped data into a **bucket**, or data container, for future use.

AWS offers a [free tier](https://aws.amazon.com/free) of services. Most of what we will be using in class should be covered by the free tier. Once we create an account, we can set up an **EC2** server or an **S3** bucket.


We will build a scraper for Foursquare and a scraper for Instagram. Let’s install the libraries from the command line:

```bash
# Python library for accessing Foursquare's API
pip install foursquare

# Python library for accessing Instagram's API
pip install python-instagram

# Python library for interfacing with AWS
pip install boto
```

> Add some more for setup, such as when and where to create your AWS account, create an EC2 server, and create an S3 bucket.

## Part 1

Following the instructions from the in-class tutorial and the Python foursquare [**API**](https://github.com/mLewisLogic/foursquare) wrapper, create a scraper that continuously gets the trending venues in Riyadh.


* First, make sure to [create your developer credentials](https://foursquare.com/developers/apps) (a client ID and a client secret) for Foursquare. You will need them to authenticate.

> Explain this just a bit more... what do you put for URLs and redirect URIs?

* Second, look at the [trending](https://developer.foursquare.com/docs/venues/trending) endpoint of Foursquare’s **API**. Endpoints give access to specific resources of the API. More documentation can be accessed [here](https://developer.foursquare.com/docs/). The Python library provides an easy-to-use wrapper around the Foursquare **API**, so we don’t need to build the **REST** requests ourselves.

* Third, create a Python function that uses the library, authenticates with your key, and calls the API to return a number of trending venues.
    1. Inputs:
      1. A `string` that represents a location in Riyadh. You might want to use a point close to the city center.
      2. Limit the search to 1,000 results and a 5,000 m radius.
    2. Outputs:
      1. The response `dictionary` returned by the API call.
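
One possible shape for this function is sketched below. It is a minimal sketch rather than a reference solution: it assumes the `foursquare` package exposes `Foursquare(client_id=..., client_secret=...)` and `client.venues.trending(params=...)` as described in the library's README, and the helper names `make_client` and `get_trending_venues` are hypothetical. The limit and radius defaults come from the task description above; the API may cap them at lower values.

```python
def make_client(client_id, client_secret):
    """Build an authenticated client (assumes `pip install foursquare`)."""
    import foursquare  # imported lazily so the helper below has no hard dependency
    return foursquare.Foursquare(client_id=client_id, client_secret=client_secret)


def get_trending_venues(client, ll, limit=1000, radius=5000):
    """Return the response dictionary for venues trending near `ll`.

    `ll` is a "lat,lng" string, for example a point near the center of Riyadh.
    """
    params = {"ll": ll, "limit": limit, "radius": radius}
    return client.venues.trending(params=params)
```

With real credentials, a call could look like `get_trending_venues(make_client(CLIENT_ID, CLIENT_SECRET), "24.7136,46.6753")`, where the coordinates are an assumed point near central Riyadh.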

In [1]:
# Your code here


* Fourth, using [boto](https://github.com/boto/boto), a Python wrapper for **AWS**, create a Python function that uploads a string to **AWS** [**S3**](https://aws.amazon.com/s3/). S3 provides cloud buckets where you can store information such as `JSON` documents or `strings`. First, you will need to [sign up](https://portal.aws.amazon.com/gp/aws/developer/registration/index.html?nc2=h_ct) for an account (you get some free services and 5 GB of free storage). Then, you will have to get [access keys](http://docs.aws.amazon.com/general/latest/gr/managing-aws-access-keys.html).

**The function should:**
* Create an [S3 connection](http://docs.pythonboto.org/en/latest/ref/s3.html).
* Select a bucket from your S3 account (you should [create](http://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) the S3 bucket beforehand).
* Upload a string to a target path that is **specific** to that string. For example, you can use the **date and time** as the path, to make sure no two files share the same path.

    1. Inputs:
      1. A `string` that represents the target path to upload the string to.
      2. A `string` containing the `json` response from the endpoint.
    2. Outputs:
      1. A `printed` success message if the upload completed correctly.
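
The three steps above can be sketched as follows. This is one possible layout under stated assumptions, not the expected answer: it assumes boto 2 (`boto.connect_s3`, `get_bucket`, `new_key`, `set_contents_from_string`), and the function names `timestamped_path`, `connect_bucket`, and `upload_string` are hypothetical. Passing the bucket in as an argument is a design choice that lets the upload logic be tried with a stand-in object before touching AWS.

```python
from datetime import datetime


def timestamped_path(prefix, now=None):
    """Build a unique path such as 'foursquare/2016-01-31T12-00-00.json'."""
    now = now or datetime.now()
    return "{}/{}.json".format(prefix, now.strftime("%Y-%m-%dT%H-%M-%S"))


def connect_bucket(access_key, secret_key, bucket_name):
    """Open an S3 connection and return an existing bucket (assumes boto 2)."""
    import boto  # lazy import: only needed when actually talking to AWS
    connection = boto.connect_s3(access_key, secret_key)
    return connection.get_bucket(bucket_name)


def upload_string(bucket, path, data):
    """Upload `data` (a string) to `path` in `bucket` and print a success message."""
    key = bucket.new_key(path)
    key.set_contents_from_string(data)
    print("Uploaded {} characters to {}".format(len(data), path))
```

Two requests made within the same second would collide on the same path, so a finer timestamp or a counter may be needed for short intervals.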


In [2]:
# Your code here


* Fifth, write a function that repeatedly calls Foursquare’s **API** using the first function you built. The function should call the API at a fixed interval `N`, for a given period of time. You can use a library like [threading](https://docs.python.org/2/library/threading.html) to repeat the operation for a given period of time. The function should also parse the response into a dictionary, and keep the following keys:

```Python
checkin['name']
checkin['hereNow']['count']
checkin['categories'][0]['name']
checkin['location']['lat']
checkin['location']['lng']
checkin['stats']['checkinsCount']
checkin['stats']['usersCount']
datetime
```
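
A minimal sketch of the parsing step, assuming the trending response holds its venues in a list under a `'venues'` key. The flat output layout and the names `parse_venue` and `parse_response` are hypothetical; only the source keys come from the list above.

```python
from datetime import datetime


def parse_venue(checkin, timestamp=None):
    """Keep only the fields listed above, plus the time of the request."""
    categories = checkin.get("categories") or []
    return {
        "name": checkin["name"],
        "hereNow": checkin["hereNow"]["count"],
        # Some venues may have no category, so guard the [0] lookup.
        "category": categories[0]["name"] if categories else None,
        "lat": checkin["location"]["lat"],
        "lng": checkin["location"]["lng"],
        "checkinsCount": checkin["stats"]["checkinsCount"],
        "usersCount": checkin["stats"]["usersCount"],
        "datetime": (timestamp or datetime.now()).isoformat(),
    }


def parse_response(response, timestamp=None):
    """Parse every venue in a response (assumed to live under 'venues')."""
    return [parse_venue(venue, timestamp) for venue in response.get("venues", [])]
```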

1. Inputs: 
    1. A `number` that defines the total time to run the function for.
2. Outputs:
    1. A `json` of the parsed response.
    2. Use your S3 function to upload it to your bucket. Add a screenshot of your bucket with the files to the ipython notebook.
    
    >removed second checkins count, are we using an ipython notebook?
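
The repeat-for-a-period behavior can be sketched with a plain loop and `time.sleep` instead of `threading`; this is a simpler substitute, not the approach the text suggests. The function name `scrape_for` and its `fetch`, `upload`, `clock`, and `sleep` parameters are hypothetical; `fetch` and `upload` stand in for your API function and your S3 function.

```python
import json
import time


def scrape_for(total_seconds, interval_seconds, fetch, upload,
               clock=time.time, sleep=time.sleep):
    """Call `fetch` every `interval_seconds` until `total_seconds` have elapsed.

    Each result is serialized to JSON and handed to `upload`.
    Returns the number of uploads performed.
    """
    deadline = clock() + total_seconds
    uploads = 0
    while clock() < deadline:
        upload(json.dumps(fetch()))
        uploads += 1
        sleep(interval_seconds)
    return uploads
```

Because the loop blocks while it sleeps, it suits a script running on the EC2 server; inside a notebook, a background thread would keep the session responsive.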

In [3]:
# Your code here


## Part 2

Following the instructions from the in-class tutorial and the Python Instagram [**API**](https://github.com/Instagram/python-instagram) wrapper, create a scraper that continuously gets the recent media in Riyadh.


* First, make sure to [create your developer credentials](https://www.instagram.com/developer/) for Instagram. You will need them to authenticate.

* Second, look at the [location](https://www.instagram.com/developer/endpoints/locations/) endpoints of Instagram’s **API**.

* Third, authenticate with the Instagram API. Then, create a Python function that uses the python-instagram library and calls the API to return recent media by location.
    1. Inputs:
      1. A `list` that contains a number of Instagram place IDs in Riyadh.
    2. Outputs:
      1. A response [**media**](https://github.com/Instagram/python-instagram/blob/master/instagram/models.py#L46) object.
      
      > What exactly are you looking for here? What is a media object? Do we have to use the Python Instagram library?
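
A hedged sketch of this step is shown below. It assumes the python-instagram library exposes `InstagramAPI(access_token=..., client_secret=...)` and `location_recent_media(count=..., location_id=...)` as listed in its README, and that the call returns a `(media_list, next_page)` pair; verify both against the installed version. The names `make_api` and `recent_media_by_location` are hypothetical.

```python
def make_api(access_token, client_secret):
    """Build an authenticated client (assumes `pip install python-instagram`)."""
    from instagram.client import InstagramAPI  # lazy import
    return InstagramAPI(access_token=access_token, client_secret=client_secret)


def recent_media_by_location(api, location_ids, count=20):
    """Return {location_id: [media, ...]} for each id in `location_ids`."""
    media_by_location = {}
    for location_id in location_ids:
        # Assumption: the call returns a (media_list, next_page) pair.
        media, _next_page = api.location_recent_media(
            count=count, location_id=location_id)
        media_by_location[location_id] = media
    return media_by_location
```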


* Fourth, create a Python function that uses the python-instagram library and calls the API to search for location IDs. You can use the library’s `location_search` function. It should return a list of places that you can then use to search for recent media.
    1. Inputs:
      1. Two `float` numbers representing the latitude and longitude of Riyadh.
      2. Limit the search to 5,000 results and a 5,000 m radius.
    2. Outputs:
      1. A `list` of Instagram locations.


* Fifth, write a function that repeatedly calls Instagram’s **API** using the first function you built. The function should call the API at a fixed interval `N`, for a given period of time. You can use a library like [threading](https://docs.python.org/2/library/threading.html) to repeat the operation for a given period of time. The function should also parse the response object into a dictionary, and keep the keys that are relevant to you:

1. Inputs: 
    1. A `number` that defines the total time to run the function for.
    2. A `list` of locations that will be used to run your previous function.

2. Outputs:
    1. A `json` of the parsed response.
    2. Use your S3 function to upload it to your bucket. Add a screenshot of your bucket with the files to the ipython notebook.


## Part 3

Now, let’s read some of the JSON files we have been collecting, and create a quick plot based on their geolocation and some of their basic information.


* Download the foursquare files from your S3 bucket. You can use a UI like [S3 Browser](http://s3browser.com/), or write a Boto function.

* Create a function that looks for all the files within a given local directory, and opens them. Here is some documentation: [list directory](https://docs.python.org/2/library/os.html#os.listdir), [open files](https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files) 
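
A minimal sketch of such a function, using only `os.listdir` and the `json` module. The name `load_json_files` and the choice to skip files without a `.json` extension are assumptions made for illustration.

```python
import json
import os


def load_json_files(directory):
    """Return the parsed contents of every .json file in `directory`."""
    contents = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".json"):
            continue  # ignore anything that is not a JSON file
        with open(os.path.join(directory, filename)) as handle:
            contents.append(json.load(handle))
    return contents
```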

* Finally, plot the latitude and longitude of each **unique** venue. The radius of the circle should be proportional to the number of check-ins per venue.
    1. Inputs:
      1. A `string` that represents a file path containing all the JSON files from S3. Alternatively, you can write a function that downloads the files from S3 in real time, given your credentials and a bucket name.
    2. Outputs:
      1. A scatter plot of the unique venues. The radius of each circle should be proportional to the number of check-ins per venue.
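
The plotting step could be sketched as below, assuming each record is a flat dictionary with `name`, `lat`, `lng`, and `checkinsCount` keys (a hypothetical layout; adapt it to however you stored the parsed responses) and that `matplotlib` is installed. The function names are also hypothetical. Note that `scatter` takes the marker *area* in its `s` argument, so the sketch squares the scaled count to keep the *radius* proportional to the number of check-ins.

```python
def unique_venues(records):
    """Keep one record per (name, lat, lng), preferring the highest check-in count."""
    best = {}
    for record in records:
        key = (record["name"], record["lat"], record["lng"])
        if key not in best or record["checkinsCount"] > best[key]["checkinsCount"]:
            best[key] = record
    return list(best.values())


def marker_areas(counts, max_radius=20.0):
    """Return marker areas (points squared) whose radii are proportional to `counts`."""
    largest = max(counts) if counts else 0
    if largest <= 0:
        return [0.0 for _ in counts]
    return [(max_radius * count / largest) ** 2 for count in counts]


def plot_venues(records):
    """Scatter-plot the unique venues, sized by check-ins."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting
    venues = unique_venues(records)
    areas = marker_areas([venue["checkinsCount"] for venue in venues])
    plt.scatter([venue["lng"] for venue in venues],
                [venue["lat"] for venue in venues],
                s=areas, alpha=0.5)
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.title("Trending venues")
    plt.show()
```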
