# Data Science Project: Web Service Deployment In EC2
### Francisco Xavier Flores and Pam Needle

**Warning** Acquiring and cleaning data is a messy process, but your approach shouldn't be.  Approach this lab with a rigorous problem solving mindset.  Design and implement a solution that is robust to unexpected inputs and handles these anomalies gracefully.

If you make changes to your code and rerun a Python notebook, your changes may not be detected because Python is lazy about reloading modules.  The following two lines will force reloads.

In [1]:
%load_ext autoreload
%autoreload 2

# I. Introduction

**TODO: Set the context (introduce the dataset and questions), provide motivation for why these are interesting questions to explore. The introduction should end with a brief summary of your findings.**

- Background: 
- What is EC2?
- What is the cloud?
- all those networky terms explained

**Infrastructure as a Service (IaaS)**  
Web services are increasingly deployed in infrastructure-as-a-service (IaaS) clouds such as Amazon EC2, Windows Azure, and Rackspace. IaaS is a form of cloud computing that provides virtualized computing resources over the Internet. A third-party provider hosts hardware, software, servers, storage, and other infrastructure components on behalf of its users. IaaS providers also host users' applications and handle tasks including system maintenance, backup, and resiliency planning. <br><br>    
  
  
<font color = 'red'>**WHAT IS THE CLOUD**</font>  
**TODO**    <br><br>


**What is EC2?**  
Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides secure, resizable compute capacity in the cloud. It is widely regarded as the most popular IaaS cloud available. 
<br><br>



Industry sources claim that over 1% of Internet traffic goes to EC2, and outages in EC2 are reputed to hamper a huge variety of services. Our goal is to determine who is using the EC2 IaaS cloud and how these services use it. To find out who, we use the UNIX dig command to query DNS servers for information on the subdomains of the top 1 million domains. To find out how, we compare the results against Amazon’s publicly available IP ranges to determine how many regions are used.




<font color = 'red'> FINISH THIS!!! </font>

# II. Methodology (2 pages)

**TODO: Describe, at a high-level, the methods you employed. Focus more attention on the more challenging/interesting/novel aspects. Provide references to your code as appropriate.**
- describes your data and the methodologies used to acquire, clean and prepare your data
- analysis with references to code


<font color = 'red'> **TODO** </font> 






1. Get all possible subdomains of the domains listed in Alexa's top 1m
    -  <font color = 'red'>EXPLAIN: why do we need all subdomains and not just domains? </font>
2. Perform DNS lookups using dig for all subdomains



## Data Acquisition

Amazon previously published the "Alexa Top 1m Sites" list, which ranked the top 1 million website domains by Alexa Traffic Rank. This data used to be publicly available; however, Amazon now charges a fee, so we work with the top 1m domains published in 2013. We extracted our list of subdomains from a dataset, derived from Alexa's 2013 top 1m domains, that contains all subdomains for each domain in the top 1m [http://pages.cs.wisc.edu/~keqhe/imc2013/Alexa_subdomain_dns_records.tar.gz]. Note: these subdomains date from 2013, so we are assuming that they have remained the same. 


**GOAL: Our goal was to create a list of all cloud-using subdomains associated with domains in Alexa's top 1m sites. A subdomain is considered "cloud-using" if it has an IP address that falls within Amazon's public IP address ranges. We start with a dataset of subdomains that already fit this description, but because it was generated in 2013, we must look up all IP addresses again and cross-reference them against Amazon's IP ranges to find the subdomains that are currently cloud-using.**

We created a file called *uniquewithrank.csv*, a list of unique subdomain names and their associated rankings, using the following command: 

    $ awk -F'#' '!seen[$2]++ {print $1, $2}' ALL_subdomains_Alexa_top1m.csv > uniquewithrank.csv
    
The file *uniquewithrank.csv* contains a total of 34,277,354 subdomains.  
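
For readers less familiar with awk, the deduplication step above can be sketched in Python. This is only an illustration, not the command we ran: the function name `unique_with_rank` is hypothetical, and it assumes each input line has the form `rank#subdomain#...`, as the awk field references `$1` and `$2` suggest. Unlike the awk one-liner, this sketch skips lines with fewer than two fields.

```python
def unique_with_rank(lines, sep="#"):
    """Yield 'rank subdomain' for the first occurrence of each subdomain.

    Assumption: fields are separated by '#', with the rank in field 1
    and the subdomain in field 2 (mirroring awk's $1 and $2).
    """
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # skip malformed lines instead of crashing
        rank, subdomain = fields[0], fields[1]
        if subdomain not in seen:
            seen.add(subdomain)
            yield f"{rank} {subdomain}"
```

Because the function takes any iterable of lines, it can be fed an open file handle and processes the input one line at a time. The `seen` set still has to hold every unique subdomain in memory, just as the awk array does.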

    
### DNS Queries with 'dig'
To get the IP addresses associated with each subdomain, we performed DNS queries on each subdomain using the UNIX tool 'dig'.


<font color = 'red'>**MORE BACKGROUND/EXPLANATION ON DIG**</font>  
<br><br>


To perform the DNS queries, we needed a file containing one subdomain per line. We extracted the subdomain names from the uniquewithrank.csv file with a Python function, 'extract_subdomains()'. This function produced a new file called 'extractedsubdomains.txt', which was then passed to the dig command with the -f option. 

    $ dig -f extractedsubdomains.txt +noall +answer | awk '$4=="A" {print $1, $5}' > results.txt
    $ dig -f uniqsortednameonly.txt +noall +answer | awk '$4=="A" {print $1, $5}' | tee redoqueries.txt
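
The input file for `dig -f` holds one name per line. A minimal sketch of that extraction step follows. It is not the actual `extract_subdomains()` implementation: the name `extract_subdomain_names` is hypothetical, and the sketch assumes each input line is `rank subdomain` separated by whitespace, which is what the awk `print $1, $2` above produces.

```python
def extract_subdomain_names(lines):
    """Return the subdomain column from whitespace-separated 'rank subdomain' lines."""
    names = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # ignore blank or truncated lines
            names.append(parts[1])
    return names
```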


The dig queries were the most time-consuming part of the data acquisition. After running dig for about two weeks, we stopped it and retrieved our results file, 'results.txt'. The total number of subdomains queried during this time was 2,496,336. (This is only approximately _ percent of the total number of subdomains.)  
<br>
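
The awk filter in the dig pipeline above keeps only A records and prints the name and address. The same logic can be sketched in Python, assuming the usual `+noall +answer` line layout of `name TTL class type rdata`. The function name `parse_a_records` is hypothetical.

```python
def parse_a_records(lines):
    """Yield (name, ip) for every A record in dig '+noall +answer' output."""
    for line in lines:
        fields = line.split()
        # Answer lines have the form: name TTL class type rdata
        if len(fields) >= 5 and fields[3] == "A":
            yield fields[0], fields[4]
```

Two details are worth keeping in mind when using output like this. dig prints fully qualified names with a trailing dot, so names may need normalizing before they are matched against the subdomain list. Also, when a queried name is a CNAME, the A record line carries the canonical name rather than the name that was queried.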

We cleaned the resulting file by sorting on subdomain and eliminating repeated lines (instances where the same subdomain and IP pair already appeared in the file), and wrote the output to a new file, 'dnsresults1.txt', with the following command: 

    $ sort -k1 results.txt | uniq > dnsresults1.txt
    
*dnsresults1.txt* contains a total of 2,496,336 subdomains.  <br><br>





### Manipulating Data 


#### Cross-referencing resulting IPs with Amazon's public IP ranges to get the list of cloud-using subdomains
**TODO: NEED TO GET ONLY THOSE THAT USE SERVICE=EC2**  
Using the DNS query results file dnsresults1.txt, we ran our Python function 'crossref_subdomainip()' to cross-reference each subdomain IP address with Amazon's public IP ranges and see which subdomains use Amazon EC2:

In [2]:
import parser  # project module that defines crossref_subdomainip()

# Cross-reference the IPs in dnsresults1.txt with Amazon's public IP ranges.
# This writes a new file called 'subdomains.csv'.
parser.crossref_subdomainip()

This function produced a new file *subdomains.csv* with the columns rank, subdomain, subdomainip, and region.  
<br><br>
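
The cross-referencing idea can be sketched with the standard-library `ipaddress` module. This is not the code in `crossref_subdomainip()`; the helper names `build_ec2_networks` and `find_region` are hypothetical. The sketch assumes the ranges have been loaded from Amazon's published `ip-ranges.json`, whose `prefixes` entries carry `ip_prefix`, `region`, and `service` fields. Filtering on `service == "EC2"` is one way to address the TODO above.

```python
import ipaddress


def build_ec2_networks(ip_ranges):
    """Return [(network, region)] for the prefixes whose service is EC2."""
    return [
        (ipaddress.ip_network(p["ip_prefix"]), p["region"])
        for p in ip_ranges["prefixes"]
        if p["service"] == "EC2"
    ]


def find_region(ip, networks):
    """Return the region of the first network containing ip, else None."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return None  # not a valid IP address
    for network, region in networks:
        if addr in network:
            return region
    return None


# Illustrative data only: documentation address ranges, not real Amazon prefixes.
sample_ranges = {
    "prefixes": [
        {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "EC2"},
        {"ip_prefix": "198.51.100.0/24", "region": "eu-west-1", "service": "AMAZON"},
    ]
}
networks = build_ec2_networks(sample_ranges)
print(find_region("192.0.2.10", networks))
```

A linear scan over the network list is simple but does work proportional to the number of prefixes for every address. With millions of addresses, a sorted list of range boundaries or a prefix tree would likely be faster.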

#### Populate Database
**TOFIX**  
We used a PostgreSQL database to organize our subdomain data. Using this file and the previously created *uniquewithrank.csv* file, we populated the database. We created two tables, "top1msubdomains" (derived from the top 1m subdomain CSV file from Aaron's dataset) and "dnssubdomains" (all subdomains in Alexa's top 1m with an IP address that falls within one of Amazon's IP ranges), by running the following commands in a terminal:

    $ createdb alexadb
    $ psql alexadb -f create.sql
   
  <br><br>
    
#### How many subdomains from initial list did we query?
The function queried_subs() creates a new file called 'allqueriedsubdomains.txt'. To eliminate duplicates, we ran:

    $ sort -u allqueriedsubdomains.txt
    
Now we need to make sure that each subdomain in this file is a valid subdomain (that is, present in our initial subdomain list). We do this by querying our database tables.  
From the DNS query result file, filter out subdomain names not present in our original list.
**TODO**  
<br><br>
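
The filtering step described above amounts to a set-membership test. A minimal sketch follows. The function name `filter_known` is hypothetical, and the normalization (lowercasing and stripping the trailing dot that dig prints) is an assumption about how the two lists should be compared.

```python
def normalize(name):
    """Lowercase a DNS name and drop the trailing dot, if any."""
    return name.strip().rstrip(".").lower()


def filter_known(queried, known):
    """Return the sorted, de-duplicated queried names that appear in known."""
    known_set = {normalize(name) for name in known}
    return sorted({normalize(q) for q in queried} & known_set)
```

The set intersection also removes duplicates, so the result matches what `sort -u` followed by a membership check would give.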


#### Determine Rank       
The resulting entries in the *subdomains.csv* file all have a rank of 0, because the rank column defaults to 0. We retrieved the correct rank for each subdomain with an SQL query (included at the bottom of our create.sql file):
```sql
UPDATE dnssubdomains
SET rank = top1msubdomains.alexa_rank
FROM top1msubdomains
WHERE top1msubdomains.subdomain=dnssubdomains.subdomain;
```
  <br><br>

#### Create a list of domains in Alexa's top 1m that have cloud-using subdomains
Link each subdomain with its domain to determine which domains in Alexa's top1m list have cloud-using subdomains
**TODO**  
<br><br>
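
One possible way to link a subdomain to its domain is to strip leading labels until the remainder matches a known domain. The sketch below assumes the top 1m domains are available as a Python set. The function name `find_domain` is hypothetical, and this is a suggestion for the TODO above rather than something we have run.

```python
def find_domain(subdomain, domains):
    """Return the longest suffix of subdomain found in domains, else None."""
    labels = subdomain.rstrip(".").lower().split(".")
    # Try the full name first, then drop one leading label at a time.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in domains:
            return candidate
    return None
```

Trying the longest suffix first matters when both a site and its parent are ranked, for example when the list contains both `foo.blogspot.com` and `blogspot.com`.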

#### Calculate number of subdomains for each domain (SEE AARONS TABLE 4)
**TODO**  
<br><br>
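
Once each subdomain is paired with its domain, the per-domain counts can be computed with `collections.Counter`. The sketch assumes the input is an iterable of `(domain, subdomain)` pairs. The function name `subdomains_per_domain` is hypothetical.

```python
from collections import Counter


def subdomains_per_domain(pairs):
    """Count distinct subdomains for each domain from (domain, subdomain) pairs."""
    # set() drops repeated pairs so each subdomain is counted once per domain.
    return Counter(domain for domain, _ in set(pairs))
```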





We needed to perform additional queries to organize the data further.  
First, we exported our results to a CSV file with headers for rank, subdomain, subdomainip, and region:

```sql
\copy (SELECT * FROM dnssubdomains WHERE rank != 0) TO '/vagrant/projectfrankiepam/finalresults.csv' WITH CSV HEADER DELIMITER ',';
```  
<br><br>







# III. Results (2 pages)

**TODO: Present your findings through:**
- statistics
- tables
- visualizations

### What percentage of Alexa's top 1m list has a cloud-using subdomain?
**TODO** <font color = 'red'>(Aaron's results reported 40,333 domains (or >4%) and a total of 713,910 cloud-using subdomains, but we did not run DNS queries for all the subdomains, so we have to figure this out somehow.)</font>



# IV. Conclusions (1/2 page)

**TODO: Summarize the conclusions of your study. This might include a discussion of future work.**

# V. Related Work (1/2 page)

**TODO: Briefly describe related work on this topic (if applicable).**