# Machine Project 6: EDGAR Web Logs

#### <span style="color:red">Make sure to read the [README](README.md) before starting this project</span>

## Required Information

Please fill out the following details.  
- Enter your **full name (as it appears on Canvas)** and **NetID**.  
- If you are working in a group (maximum of 4 members), include the **full names and NetIDs** of all your partners.  
- If you're working alone, enter `None` for the partner fields.

> **Important:** Each student must submit the project individually.  
Failure to complete this section may result in your submission being flagged for plagiarism.

In [None]:
# Project: MP6
# Student 1: vardaan kapoor, vkapoor5

## <span style="color:red">Important:</span>

* **Before you begin**, make sure to `pull` any changes from GitLab. From the terminal, run:
```
git checkout main
git pull
git checkout MP6
git merge main
```
* Follow all instructions carefully. If anything is unclear, attend office hours or post on Piazza.
* You may add additional code cells as needed. However, **only cells with `#Q_` in the code will be graded**.
* To test, **Restart and Run All Cells**, then **save the notebook** and run `python3 tester.py` from the terminal.

> ⚠️ **Reminders:**
>
> - Make sure you are on the `MP6` branch by running `git branch` and checking the output.
> - Frequently `add`, `commit`, and `push` your code to avoid losing progress.


In [1]:
# Add additional imports used throughout the project here

# these lines automatically reload modules when their code changes
%load_ext autoreload
%autoreload 2




In [72]:
import pandas as pd
import pickle # used for grading graphs
from zipfile import ZipFile
from io import TextIOWrapper
import numpy as np
from edgar_utils import lookup_region

# Group Part (75%)

For this portion of the machine project, you may collaborate with your group members in any way (including looking at group members' code). You may also seek help from CS 320 course staff (peer mentors, TAs, and the instructor). You **may not** seek or receive help from other CS 320 students (outside of your group) or anybody else outside of the course.

## Part 1: `server_log.zip` analysis
> 📄 **Work in:** [`mp6.ipynb`](mp6.ipynb)

In [6]:
# Use pandas to read "rows.csv" from "server_log.zip" as a CSV file
with ZipFile("server_log.zip") as zf:
    with zf.open("rows.csv") as f:
        df = pd.read_csv(f)
        print(df.head(n=5))

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: zone, dtype: float64


### Q1: What's the total size in bytes of the files requested?

Look at the `size` column of the CSV in `server_log.zip`.  We want to include duplicates here; this gives us an estimate of the amount of network traffic handled by EDGAR (since this data is only a sample, the true value will be even larger). Answer with an integer. 

**Note:** If you use `numpy` make sure to cast the final answer to an `int`.

In [13]:
#Q1
# Sum the size column and cast the numpy integer to a plain Python int
int(df["size"].sum())

24801002666

### Q2: How many filings have been accessed by the 10 IPs with the most accesses?

Answer with a dictionary, with the (anonymized) IP as key and the number of requests seen in the logs as the values. Each row in the logs corresponds to one request. Note that the anonymized IP addresses are consistent between requests.

**Hint:** for this question and most of the others expecting dictionary output, it might be easiest to use Pandas operations to process the data into a `Series` and to use the `to_dict()` method. Consider using tools like `groupby`, `apply`, and aggregation methods like `size()`. In Q30-32 from [MP1](../mp1/README.md), there is an example of `apply`.


In [25]:
#Q2
top10ip=df.groupby("ip").size().sort_values(ascending=False).head(10)
top10ip.to_dict()

{'54.152.17.ccg': 12562,
 '183.195.251.hah': 6524,
 '52.45.218.ihf': 5562,
 '68.180.231.abf': 5493,
 '204.212.175.bch': 4708,
 '103.238.106.gif': 4428,
 '208.77.215.jeh': 3903,
 '208.77.214.jeh': 3806,
 '217.174.255.dgd': 3551,
 '82.13.163.caf': 3527}

In [34]:
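def findErrors():
    # Count requests with an error status; row[8] is used here as the status code
    c = 0
    for row in df.itertuples():
        # Q3 defines an error as a status code greater than or equal to 400
        if row[8] >= 400:
            c += 1
    return c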

### Q3: What fraction of the requests had errors?

Any request with a status code greater than or equal to 400 has an error. Answer with a floating point number. 

**Note:** If you use `numpy` make sure to cast the final answer to a `float`.
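
As a rough sketch of the rule above, a boolean comparison on a status-code `Series` can be averaged to get the error fraction in one step. The `codes` values below are made up for illustration and are not taken from `rows.csv`.

```python
import pandas as pd

# Hypothetical status codes, for illustration only (not from rows.csv)
codes = pd.Series([200, 200, 404, 400, 301, 500, 200, 200])

# A request counts as an error when its status code is >= 400.
# The mean of a boolean Series is the fraction of True values.
error_fraction = float((codes >= 400).mean())
print(error_fraction)
```

Note that a code of exactly 400 counts as an error here, which matters when choosing between `>` and `>=`.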

In [36]:
#Q3

findErrors()/len(df)

    

0.03466852724527611

In [69]:
def createFileFormatting(row):
    # Build "cik/accession/extention"; cik is cast to int to drop any decimal part
    return "{}/{}/{}".format(int(row["cik"]), row["accession"], row["extention"])
    

### Q4: What is the second most frequently accessed file?

Answer with a string formatted like so: "cik/accession/extention" (these are the names of columns in "rows.csv").

In [70]:

df["concatSt"]=df.apply(createFileFormatting,axis=1)


In [71]:
#Q4
file_counts = df.groupby("concatSt").size().reset_index(name="count")
second_file = file_counts.sort_values("count", ascending=False).iloc[1]
second_file["concatSt"]

'1584509/0001584509-16-000514/armk-20160930_def.xml'

## Part 2: Creating `edgar_utils.py` module
> 📄 **Work in:** [`edgar_utils.py`](edgar_utils.py)

This part is to be started during [Lab 9](../../labs/Lab9/README.md). Finish the `edgar_utils.py` module now if you didn't have enough time
during the scheduled lab.

In [114]:
df2=lookup_region("101.1.1.abc")
df2

'China'

## Part 3: Using `edgar_utils.py` module
> 📄 **Work in:** [`mp6.ipynb`](mp6.ipynb)

### Q5: Which region accesses resources most heavily in `server_log.zip`?

Use your `lookup_region` function and answer with a string.
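
One possible approach, sketched on toy data: look up the region of each distinct IP once, map that back onto every request, and take the most frequent region. `fake_lookup_region` and the IPs below are hypothetical stand-ins; the real `lookup_region` comes from your `edgar_utils.py`.

```python
import pandas as pd

# Hypothetical stand-in for lookup_region (the real one lives in edgar_utils.py)
def fake_lookup_region(ip):
    table = {"1.1.1.abc": "Region A", "2.2.2.def": "Region B", "3.3.3.ghi": "Region A"}
    return table.get(ip, "-")

# Toy request log: one row per request
log = pd.DataFrame({"ip": ["1.1.1.abc", "2.2.2.def", "1.1.1.abc", "3.3.3.ghi",
                           "2.2.2.def", "3.3.3.ghi", "1.1.1.abc"]})

# Look up each distinct IP once, then map the result onto every row
region_by_ip = {ip: fake_lookup_region(ip) for ip in log["ip"].unique()}
regions = log["ip"].map(region_by_ip)

# The region with the most rows (requests) is the heaviest user
top_region = regions.value_counts().idxmax()
print(top_region)
```

Looking up each distinct IP only once avoids repeating the same lookup for IPs that appear in many rows.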

In [None]:
#Q5

### Q6: What fraction of IPs in each region are high-volume users?

Consider IPs which accessed more than 300 EDGAR resources to be
high-volume. This might indicate machines running automated scraping
and analysis tasks.

Note that given the sampling done in the data, the true EDGAR usage of
these machines is likely to be even heavier.

Answer with a dictionary, where the keys are the regions and the
values are the fraction (in floating point form) of IPs from that
region classified as high-volume.

**Example:**

Say "United States of America" has four IPs:
* 1.1.1.1 appears 1200 times in the logs
* 2.2.2.2 appears 900 times in the logs
* 3.3.3.3 appears 5 times in the logs
* 4.4.4.4 appears 234 times in the logs

This means that 1/2 of the IPs in the US are high volume, so there should be an entry like this:

```
{
    "United States of America": 0.5,
    ...
}
```

**Note:** Some of the IPs are listed as having a region of '-'. Please include this region in your final
answer.

**Note:** If you use `numpy` make sure to cast dictionary entries to `float`.
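
The worked example above can be reproduced with a small sketch. The request counts come from the example; the variable names and the use of a boolean mean are one possible approach, not a required one.

```python
import pandas as pd

# Request counts per IP, taken from the worked example
requests_per_ip = pd.Series({"1.1.1.1": 1200, "2.2.2.2": 900, "3.3.3.3": 5, "4.4.4.4": 234})

# Region of each IP (all four belong to one region in the example)
region_per_ip = pd.Series("United States of America", index=requests_per_ip.index)

# True where an IP made more than 300 requests
is_high_volume = requests_per_ip > 300

# The mean of the boolean flags within a region is the high-volume fraction
fractions = is_high_volume.groupby(region_per_ip).mean()
result = {region: float(frac) for region, frac in fractions.items()}
print(result)
```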

In [None]:
#Q6

### Requirement: `filings` dictionary

Read every file ending with .htm or .html in `docs.zip`, and create a `Filing`
object based on that file. Then, save that `Filing` object to a dictionary as follows:
- **Key:** The filepath for this filing object (ex. `850693/0000850693-07-000159/-index.htm`)
- **Value:** The `Filing` object created from this filepath.

Creating this dictionary once now will save us from needing to loop over all values
in future questions.
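
A minimal sketch of the loop described above, run against a tiny in-memory zip so it works without `docs.zip`. `FakeFiling` is a hypothetical stand-in for your `Filing` class, and decoding as UTF-8 is an assumption about the files.

```python
import io
from zipfile import ZipFile

# Hypothetical stand-in for the Filing class from edgar_utils.py
class FakeFiling:
    def __init__(self, html):
        self.html = html

# Build a tiny in-memory zip so the sketch runs without docs.zip
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("1/a/-index.htm", "<html>one</html>")
    zf.writestr("2/b/doc.html", "<html>two</html>")
    zf.writestr("2/b/notes.txt", "not a filing")

filings = {}
with ZipFile(buffer) as zf:
    for info in zf.infolist():
        # Keep only members ending with .htm or .html
        if info.filename.endswith((".htm", ".html")):
            with zf.open(info) as f:
                html = f.read().decode("utf-8")  # encoding is an assumption
            filings[info.filename] = FakeFiling(html)

print(sorted(filings))
```

Passing a tuple to `str.endswith` checks both extensions in one call.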

In [None]:
# Create `filings` dictionary

### Q7: What dates appear in the `886982/0000769993-16-001958/-index.htm` file of `docs.zip`?

Read the HTML from this file and use it to create a `Filing` object,
from which you can access the `.dates` attribute.

In [None]:
#Q7

### Q8: What is the distribution of states for the filings in `docs.zip`?

Answer with a dict, like the following:

```
{'CA': 92,
 'NY': 83,
 'TX': 67,
 'None': 56,
 'MA': 30,
 'IL': 25,
 'PA': 25,
 'CO': 25,
 ...
}
```

The order of the key-value pairs doesn't matter. Please include `None` in the
dictionary.

**Hint:** We created the `filings` dictionary above, which means we don't have to
iterate through `docs.zip` here again!

In [None]:
#Q8

### Q9: What is the distribution for the ten most common addresses for the filings in `docs.zip`?

Answer in the same format as the previous question.

Expected output:
```
{'2000 AVENUE OF THE STARS, 12TH FLOOR\nLOS ANGELES CA 90067': 134,
 '2000 AVENUE OF THE STARS, 12TH FLOOR\nLOS ANGELES CA 90067\n3102014100': 113,
 '3 LANDMARK SQUARE\nSUITE 500\nSTAMFORD CT 06901\n2033564400': 60,
 'C/O KKR ASSET MANAGEMENT LLC\n555 CALIFORNIA STREET, 50TH FLOOR\nSAN FRANCISCO CA 94104': 36,
 'C/O ARES MANAGEMENT LLC\n2000 AVENUE OF THE STARS, 12TH FLOOR\nLOS ANGELES CA 90067': 35,
 '4740 AGAR DRIVE\nRICHMOND A1 V7B 1A3': 25,
 'CENTRALIS S.A., 8-10 AVENUE DE LA GARE\nLUXEMBOURG N4 L-1610': 25,
 'CENTRALIS S.A., 8-10 AVENUE DE LA GARE\nLUXEMBOURG N4 L-1610\n352-26-186-1': 25,
 '3 LANDMARK SQUARE\nSUITE 500\nSTAMFORD CT 06901': 24,
 '801 CHERRY STREET\nSUITE 2100\nFORT WORTH TX 76102': 22}
```
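
One way to get a "top N" distribution is `collections.Counter` with `most_common`. The sketch below uses made-up addresses and assumes each filing exposes a list of address strings; that attribute layout is an assumption, not something stated here.

```python
from collections import Counter

# Hypothetical per-filing address lists, for illustration only
addresses_per_filing = [
    ["1 MAIN ST\nMADISON WI 53703", "2 OAK AVE\nAUSTIN TX 73301"],
    ["1 MAIN ST\nMADISON WI 53703"],
    ["3 ELM RD\nDENVER CO 80014", "1 MAIN ST\nMADISON WI 53703"],
]

counts = Counter()
for addresses in addresses_per_filing:
    counts.update(addresses)  # adds one count per address in the list

# most_common(n) returns the n highest counts as (key, count) pairs
top = dict(counts.most_common(2))
print(top)
```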

In [None]:
#Q9

# Individual Part (25%)

For this portion of the machine project, you are only allowed to seek help from CS 320 course staff (peer mentors, TAs, and the instructor). You **may not** receive help from anyone else.

## Part 4: Combining logs with documents
> 📄 **Work in:** [`mp6.ipynb`](mp6.ipynb)

### Q10: What is the distribution of requests across industries?

For each request in the logs that has a corresponding filing in
`docs.zip`, look up the SIC (ignore rows in the logs which refer to
pages not in `docs.zip`).

Answer with a dictionary, where the keys are the SIC and the values
are the number of times the resources of that industry were accessed.

If you're curious, consider looking up the industry names for the top
couple categories:
https://www.sec.gov/corpfin/division-of-corporation-finance-standard-industrial-classification-sic-code-list

Expected output:

```
{2834: 984,
 1389: 656,
 1311: 550,
 2836: 429,
 6022: 379,
 1000: 273,
 ...
 }
 ```
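
A minimal sketch of the matching step, on toy data. `sic_by_path` is a hypothetical mapping standing in for your `filings` dictionary, and skipping filings without a SIC is an assumption made for this sketch.

```python
from collections import Counter

# Hypothetical stand-in: filing path -> SIC code (None if the filing has no SIC)
sic_by_path = {
    "1/a/-index.htm": 2834,
    "2/b/-index.htm": 1311,
    "3/c/-index.htm": None,
}

# Hypothetical requested paths in "cik/accession/extention" form
requested = ["1/a/-index.htm", "2/b/-index.htm", "1/a/-index.htm",
             "9/z/-index.htm", "3/c/-index.htm"]

sic_counts = Counter()
for path in requested:
    # .get returns None both for unknown paths and for filings without a SIC
    sic = sic_by_path.get(path)
    if sic is not None:
        sic_counts[sic] += 1

result = dict(sic_counts)
print(result)
```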

In [None]:
#Q10


### Q11: How many requests were made in each hour?

Use `pd.to_datetime` (the `hour` attribute of the converted
timestamps may be useful) or string manipulation to process the `time`
column. Answer with a dictionary, where the keys are integers from 0
to 23 representing the hour of the day, and the values are the number
of requests made in that hour.
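
As a small illustration of the `pd.to_datetime` route, the sketch below counts requests per hour on made-up times. It assumes the values look like `HH:MM:SS`; check the actual format of the `time` column before reusing the `format` string.

```python
import pandas as pd

# Hypothetical time strings, assumed to be in HH:MM:SS form
times = pd.Series(["00:00:05", "00:59:59", "13:05:10", "23:00:00"])

# Parse the strings, then pull the hour out through the .dt accessor
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour

# Count rows per hour and return them in hour order
requests_per_hour = hours.value_counts().sort_index().to_dict()
print(requests_per_hour)
```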

In [None]:
#Q11

### Q12: What is the geographic overlap in interest between Australia, France, Indonesia, and Viet Nam?

Answer with a Digraph like the following:

<img src="img/digraph.png" width=400>

In addition to a node for each of these four countries, there should
be a node for each state having a filing accessed by somebody in one
of these countries.

An edge from a country to a state means somebody in that country
looked at at least one filing for a company in that state.

**Important:** Make sure not to hardcode these values. It might be helpful to
define a list like `countries = ["Australia", "France", "Indonesia", "Viet Nam"]` and then loop over the filings for these countries only.
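
The node and edge sets can be worked out in plain Python before any drawing. The `(region, state)` pairs below are hypothetical, and ignoring filings without a state is an assumption made for this sketch.

```python
countries = ["Australia", "France", "Indonesia", "Viet Nam"]

# Hypothetical (region, state) pairs, one per matched request
requests = [
    ("Australia", "CA"),
    ("France", "NY"),
    ("Australia", "CA"),   # duplicate pair collapses to a single edge
    ("Germany", "TX"),     # region not in countries, ignored
    ("Viet Nam", None),    # filing with no state, ignored in this sketch
    ("Indonesia", "NY"),
]

# A set keeps each country -> state edge only once
edges = {(c, s) for c, s in requests if c in countries and s is not None}

# Every listed country gets a node, plus each state that appears in an edge
nodes = set(countries) | {s for _, s in edges}

print(sorted(edges))
print(sorted(nodes))
```

Each node could then be added with `d.node(name)` and each pair with `d.edge(country, state)` on the `Digraph`.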

In [None]:
#Q12
import graphviz  # required by graphviz.Digraph() below

d = graphviz.Digraph()

# ADD CODE HERE

# IMPORTANT -- Do not remove -- 
with open("Q12.pkl", "wb") as f:
    pickle.dump(d.source, f)

d

### Q13: Geographic Plotting of Postal Codes

In this question, you will plot geographic data from `locations.geojson` over a background map from the shapefile `"shapes/cb_2018_us_state_20m.shp"`. Each point represents an address and should be colored by its **postal code**.

Follow the instructions **carefully**, as this question involves geospatial data manipulation that is sensitive to the order of operations.

**Required Steps:**

1. Extract ZIP codes from the address column:
   - Use a regular expression to extract the **5-digit** ZIP code from the address string.
   - Ignore entries with missing or malformed ZIP codes.
   - If the ZIP code follows a state abbreviation, take only the 5-digit code. For example, use only `93821` from `CA 93821`.

2. Filter valid ZIP codes:
   - Only keep rows where the ZIP code is a number between **10000** and **60000**.

3. Crop the data to a specific bounding box (do this **before** projecting):
   - Define the bounding box:  
     `west = -90`, `east = -65`, `south = 25`, `north = 50`
   - Use `shapely.geometry.box(west, south, east, north)` to create a bounding box and either `.intersection()` to crop the background or `.intersects()` to filter the data points.

4. Project to Mercator ("epsg:2022"):
   - After filtering, apply `.to_crs("epsg:2022")` **to both** the background and the locations.

5. Plot:
   - Use `.plot(ax=ax)` and be sure to **pass the same `ax` object** to both the background and the points.
   - Set:
     - Background color to `"lightgray"`
     - Point colors using the `"viridis"` colormap
     - Color of each point should represent the ZIP code using `column="zipcode"` in `.plot(...)`.
     - Add a colorbar with `legend=True`
     - Remove axis labels with `ax.set_axis_off()`

The result should look similar to this:

<img src="img/geo.png" width="400px">

**Hints:**

- **Do not project before filtering or intersecting with the bounding box.** Always crop in latitude/longitude (the original coordinate system) first.
- Use `re.findall(regex, address)` to extract ZIP codes.
- Clean the data **before** plotting.
- If your map looks distorted or blank, double-check that you projected **after** cropping and used the correct EPSG code.
- Double-check the ZIP code range: it must be between 10000 and 60000.
- Use `legend=True` for the colorbar.
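
Steps 1 and 2 can be sketched with the standard `re` module alone. The addresses are made up, and the pattern is one possible choice: it assumes the ZIP code always follows a two-letter state abbreviation, and it treats the 10000 to 60000 range as inclusive.

```python
import re

# Hypothetical addresses, for illustration only
addresses = [
    "123 MAIN ST\nSOMEWHERE CA 93821",
    "1 STATE ST\nNEW YORK NY 10004-1234",
    "4740 AGAR DRIVE\nRICHMOND A1 V7B 1A3",
    "PO BOX 12\nSPRINGFIELD IL 62701",
]

# Two capital letters, whitespace, then five digits not followed by another digit
zip_pattern = r"[A-Z]{2}\s+(\d{5})(?!\d)"

def extract_zip(address):
    """Return the first 5-digit ZIP code as an int, or None if none is found."""
    matches = re.findall(zip_pattern, address)
    return int(matches[0]) if matches else None

zips = [extract_zip(a) for a in addresses]

# Drop missing values, then keep only codes in the required range
valid = [z for z in zips if z is not None and 10000 <= z <= 60000]

print(zips)
print(valid)
```

The negative lookahead `(?!\d)` keeps the pattern from taking the first five digits of a longer number.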


In [None]:
west = -90
east = -65
north = 50
south = 25

In [None]:
#Q13
import matplotlib.pyplot as plt  # required by plt.subplots() below

# ADD CODE HERE

# Create the plot
fig, ax = plt.subplots()

# PLOT HERE

# IMPORTANT -- Do not remove -- 
with open("Q13.pkl", "wb") as f:
    pickle.dump(fig, f)

plt.show()

## <span style="color:red">Important:</span>
Make sure to follow these steps to submit the project:
1. **Kernel > Restart Kernel and Run All Cells** and then save the notebook
2. Run `tester.py` to check your answers
3. Run the following commands from the terminal:
```
git status # make sure you are in the correct branch
git add <required files>
git commit -m "Some message"
git push
```
4. Once you've pushed your project to GitLab, **verify that the pipeline ran successfully**.
    * Build > Jobs > Select the latest commit hash > Check tester output
5. Create a **_merge request_** to submit the project
    * Code > Merge requests > New merge request