# EDA Project 2
_______

### Data Description
These data provide a window into how people are interacting with the government online. The data come from a unified Google Analytics account for U.S. federal government agencies known as the Digital Analytics Program. This program helps government agencies understand how people find, access, and use government services online. The program does not track individuals, and anonymizes the IP addresses of visitors.

Not every government website is represented in these data. Currently, the Digital Analytics Program collects web traffic from around 400 executive branch government domains, across about 5,700 total websites, including every cabinet department. We continue to pursue and add more sites frequently; to add your site, email the Digital Analytics Program.

## Question and Problem Definition

## Workflow Goals

### Import Libraries

In [1]:
from collections.abc import Sequence
import datetime
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt

import pyarrow.parquet as pq
from tqdm import tqdm

%matplotlib inline

### Data Acquisition
**Background**

One of the challenges of this project was acquiring data. The original dataset we considered turned out to be corrupted.
After inspecting it, we found so many errors that completing the project as planned would have been nearly impossible.

**Error Source** 

The corruption was caused by human error. The developers who maintain the dataset allowed synthetic data to contaminate the downloadable `.csv` files they make available to the public. Over half of the observations contained website domains such as www.fakesite.com.
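
A check like the following can quantify this kind of contamination. It is a minimal sketch, not the check we actually ran: the helper name `synthetic_share`, the column name `domain`, and the pattern `fakesite` are assumptions for illustration, and the sample rows are made up.

```python
import pandas as pd


def synthetic_share(df: pd.DataFrame, column: str = "domain", pattern: str = r"fakesite") -> float:
    """Return the fraction of rows whose `column` matches `pattern` (case-insensitive)."""
    flagged = df[column].astype(str).str.contains(pattern, case=False, regex=True)
    return float(flagged.mean())


# Made-up rows for illustration only; the real file has many more columns.
sample = pd.DataFrame(
    {"domain": ["www.fakesite.com", "usaid.gov", "www.fakesite.com", "irs.gov"]}
)
share = synthetic_share(sample)
print(f"{share:.0%} of rows look synthetic")
```

A single pattern only catches the placeholder domains we happened to notice, so a high share is a reason to distrust the file, while a low share does not prove the file is clean.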

**Solution**

Instead of waiting for the data errors to be fixed and risking incomplete data for our project, we decided to acquire the data through the API. We had to adjust our initial data questions because the API is still in beta and does not expose the same fields as the `.csv` files. Although we lost some variables, we gained the ability to acquire time-series data, which was not available in the original dataset.

In [4]:
def api_to_parquet(agencies: Sequence[str], reports: Sequence[str], api_key: str, response_limit=1000) -> None:
    """Download DAP reports in 30-day windows and save one parquet file per report.

    Args:
        agencies (Sequence[str]): Agency names used in the API path, e.g. 'energy'.
        reports (Sequence[str]): Report names used in the API path, e.g. 'language'.
        api_key (str): API key sent with every request.
        response_limit (int, optional): Maximum number of rows per request. Defaults to 1000.
    """
    first_start_date = datetime.datetime.strptime('2020-03-01', '%Y-%m-%d')
    increment_days = 30
    window = datetime.timedelta(days=increment_days)

    # Approximate number of windows, used only to size the progress bar.
    num_periods = ((datetime.datetime.now() - first_start_date).days + 1) // increment_days

    # Loop through the reports.
    for report in reports:
        frames = []
        # Loop through the agencies.
        for agency in agencies:
            print(f"I'm on {report}/{agency}")
            # Base URL for the API call; the key and date range go in the query parameters.
            url = f"https://api.gsa.gov/analytics/dap/v1.1/agencies/{agency}/reports/{report}/data"
            # Restart the date range for every agency so each one covers the same period.
            start_date = first_start_date
            end_date = start_date + window
            with tqdm(total=num_periods) as pbar:
                # Loop through the date range one window at a time.
                while start_date < datetime.datetime.now():
                    params = {
                        'api_key': api_key,
                        'after': start_date.strftime('%Y-%m-%d'),
                        'before': end_date.strftime('%Y-%m-%d'),
                        'limit': response_limit,
                    }
                    response = requests.get(url, params=params)  # api call
                    response.raise_for_status()  # stop on HTTP errors instead of storing an error payload
                    frames.append(pd.DataFrame(response.json()))  # make the json response a dataframe.

                    # Move the window forward by the specified number of days.
                    start_date += window
                    end_date += window

                    # Update progress bar.
                    pbar.update(1)

        # Combine every window and agency for this report into one dataframe.
        new_dataframe = pd.concat(frames)
        new_dataframe['date'] = pd.to_datetime(new_dataframe['date'])

        # Write the parquet file to the data folder.
        new_dataframe.to_parquet(f"./data/{report}_report_all_agencies.parquet")
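
The request loop above steps through the date range in fixed 30-day windows. The sketch below isolates that stepping logic so it can be inspected without calling the API. The helper name `date_windows` is hypothetical and is not part of the function above.

```python
import datetime


def date_windows(start: datetime.date, stop: datetime.date, days: int = 30):
    """Yield (after, before) date pairs that step from `start` until `stop` is reached."""
    step = datetime.timedelta(days=days)
    current = start
    while current < stop:
        yield current, current + step
        current += step


# Example: windows covering March through May 2020.
windows = list(date_windows(datetime.date(2020, 3, 1), datetime.date(2020, 6, 1)))
for after, before in windows:
    print(after.isoformat(), before.isoformat())
```

Consecutive windows share a boundary date. Whether the API treats `after` and `before` as inclusive is not something we verified, so it is worth checking the combined data for duplicate `id` values before analysis.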


In [None]:

list_of_agencies = ['energy', 'health-human-services', 
                    'office-personnel-management', 'postal-service', 
                    'small-business-administration', 'social-security-administration',
                    'state', 'transportation',
                    'treasury', 'commerce', 'agency-international-development']
list_of_reports = ['second-level-domain','language']

# api key 
api_key = 'replace_with_your_api_key'

# call the function
api_to_parquet(list_of_agencies, list_of_reports, api_key, response_limit=1000)
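
The cell above assigns the key as a string literal, which is easy to commit by accident when the notebook is shared. One alternative is to read it from the environment. This is a minimal sketch: the variable name `DAP_API_KEY` and the helper `get_api_key` are assumptions, not names defined by the API.

```python
import os


def get_api_key(env_var: str = "DAP_API_KEY") -> str:
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


# Usage, once the variable is set in the shell that launches the notebook:
# api_key = get_api_key()
# api_to_parquet(list_of_agencies, list_of_reports, api_key, response_limit=1000)
```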


### Load Dataset
(observations & notes)

In [5]:
# Note: this file is only written when 'site' is included in the reports passed to api_to_parquet.
pdf1 = pd.read_parquet('./data/site_report_all_agencies.parquet')
pdf1.info()


<class 'pandas.core.frame.DataFrame'>
Index: 392525 entries, 0 to 5
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   id             392525 non-null  int64         
 1   date           392525 non-null  datetime64[ns]
 2   report_name    392525 non-null  object        
 3   report_agency  392525 non-null  object        
 4   domain         392525 non-null  object        
 5   visits         392525 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 21.0+ MB
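
The index summary in the output above (`392525 entries, 0 to 5`) shows that the row labels repeat, because each API response kept its own index when the frames were concatenated. The sketch below shows one way to tidy that up and take a first look at monthly visits. The `demo` frame uses made-up values that only mirror the column names from `info()`, and dropping duplicates on `id` is an assumption about what counts as a repeated row.

```python
import pandas as pd

# Made-up rows for illustration; the index repeats as it does after concatenation.
demo = pd.DataFrame(
    {
        "id": [1, 2, 3, 3],
        "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-04-01", "2020-04-01"]),
        "report_agency": ["energy", "energy", "state", "state"],
        "domain": ["energy.gov", "energy.gov", "state.gov", "state.gov"],
        "visits": [10, 20, 5, 5],
    },
    index=[0, 1, 0, 0],
)

# Drop rows that repeat an id, then give every row a unique label.
clean = demo.drop_duplicates(subset="id").reset_index(drop=True)

# Total visits per agency per calendar month.
monthly = clean.groupby(["report_agency", clean["date"].dt.to_period("M")])["visits"].sum()
print(monthly)
```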


### View and Describe Data
(observations & notes)

{
"id": 123561668,
"date": "2023-04-12",
"report_name": "download",
"report_agency": null,
"page": "irs.gov/forms-pubs/about-form-4868",
"page_title": "About Form 4868, Application for Automatic Extension of Time to File U.S. Individual Income Tax Return | Internal Revenue Service",
"event_label": "https://www.irs.gov/pub/irs-pdf/f4868.pdf",
"total_events": 48266
},
{
"id": 123591100,
"date": "2023-04-12",
"report_name": "site",
"report_agency": "agency-international-development",
"domain": "usaid.gov",
"visits": 42902
},
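
The two sample records above come from different reports and carry different fields. The sketch below shows what happens when such records are loaded into one dataframe: fields missing from a record become `NaN`, so filtering on `report_name` and dropping the empty columns recovers the shape of a single report. The records are copied from the sample above, with `page_title` omitted for brevity.

```python
import pandas as pd

records = [
    {
        "id": 123561668,
        "date": "2023-04-12",
        "report_name": "download",
        "report_agency": None,
        "page": "irs.gov/forms-pubs/about-form-4868",
        "event_label": "https://www.irs.gov/pub/irs-pdf/f4868.pdf",
        "total_events": 48266,
    },
    {
        "id": 123591100,
        "date": "2023-04-12",
        "report_name": "site",
        "report_agency": "agency-international-development",
        "domain": "usaid.gov",
        "visits": 42902,
    },
]

df = pd.DataFrame(records)
df["date"] = pd.to_datetime(df["date"])

# Keep only the 'site' rows and drop the columns that belong to other reports.
site = df[df["report_name"] == "site"].dropna(axis="columns", how="all")
print(site[["date", "report_agency", "domain", "visits"]])
```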
