In [1]:
# imports

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import math  # left unaliased: the alias "m" is used for functions_m below

import functions as f
import functions_m as m
import prepare_m as p


## Tasked with identifying anomalies in the access logs for the Codeup curriculum, we answered the following questions:

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts, one that other cohorts seemed to gloss over?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?


### We first ran a MySQL query to obtain the access logs, then saved the result as a .csv file so that later runs can load it without querying the database again.
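A minimal sketch of that caching step. The function name `load_logs` and the `fetch` callable are hypothetical stand-ins; the notebook's actual query code is not shown here.

```python
import os

import pandas as pd


def load_logs(csv_path, fetch):
    """Return the access logs, reading the cached CSV when it exists.

    `fetch` is a hypothetical zero-argument callable that runs the
    database query and returns a DataFrame.
    """
    if os.path.exists(csv_path):
        # Cached copy found: skip the database round trip.
        return pd.read_csv(csv_path, index_col=0)
    df = fetch()
    # Write the cache so later runs can read it directly.
    df.to_csv(csv_path)
    return df
```

With this shape, only the first call pays for the query; every later call reads the file.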

In [2]:
# read the .csv

df = pd.read_csv('curriculum_logs.csv', index_col=0)

In [3]:
# convert the 'date' column to datetime

df = p.to_datetime(df, 'date')
df.head(1)

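| Unnamed: 0 | date | time | url_path | user_id | cohort_id | ip | id | name | start_date | end_date | program_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-26 | 09:55:03 | / | 1 | 8.0 | 97.105.19.61 | 8 | Hampton | 2015-09-22 | 2016-02-06 | 1 |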
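The helper `p.to_datetime` comes from the project's own `prepare_m` module, and its body is not shown. One plausible sketch, assuming it simply parses the named column with `pd.to_datetime` (the name `to_datetime_sketch` is hypothetical):

```python
import pandas as pd


def to_datetime_sketch(df, column):
    """Return a copy of df with `column` parsed as a datetime column."""
    out = df.copy()
    out[column] = pd.to_datetime(out[column])
    return out


# Illustrative one-row frame shaped like the log data.
logs = pd.DataFrame({"date": ["2018-01-26"], "url_path": ["/"]})
logs = to_datetime_sketch(logs, "date")
```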


### After cleaning, we started to address the questions at hand.

### 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?

In [4]:
# find the most accessed lesson per program

m.lesson_most_accessed(df)

The most accessed PHP cohort lesson page is javascript-i    736
Name: url_path, dtype: int64.
The most accessed full-stack java cohort lesson page is javascript-i    17457
Name: url_path, dtype: int64.
The most accessed data science cohort lesson page is classification/overview    1785
Name: url_path, dtype: int64.
The most accessed front-end cohort lesson page is content/html-css    2
Name: url_path, dtype: int64.


**TAKEAWAY, QUESTION 1:**  

- The most accessed PHP lesson page is javascript-i, with 736 hits.

- The most accessed full-stack Java lesson page is javascript-i, with 17457 hits.

- The most accessed data science lesson page is classification/overview, with 1785 hits.

- The most accessed front-end lesson page is content/html-css, with 2 hits.
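The printed messages above include the `Name: url_path, dtype: int64` footer, which suggests that a whole Series is being formatted into each string. As a hedged sketch (not the body of `m.lesson_most_accessed`, which is not shown), the page and its hit count can be pulled out separately. It assumes the `program_id` and `url_path` columns listed by `df.info()`; the function name is hypothetical.

```python
import pandas as pd


def top_lesson_per_program(df):
    """Map each program_id to (most accessed url_path, hit count)."""
    counts = df.groupby(["program_id", "url_path"]).size()
    result = {}
    for program_id, group in counts.groupby(level=0):
        # idxmax returns the (program_id, url_path) label of the largest count.
        _, page = group.idxmax()
        result[int(program_id)] = (page, int(group.max()))
    return result


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "program_id": [1, 1, 1, 3, 3],
    "url_path": ["javascript-i", "javascript-i", "spring",
                 "classification/overview", "classification/overview"],
})
top = top_lesson_per_program(sample)
for program_id, (page, hits) in top.items():
    print(f"Program {program_id}: {page} with {hits} hits")
```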

### 2. Is there a cohort that referred to a lesson significantly more than other cohorts, one that other cohorts seemed to gloss over?

In [5]:
# most-accessed url per cohort

#m.url_most_accessed(df)

**TAKEAWAY, QUESTION 2:**  

The Jupiter cohort accessed the ‘toc’ page 1866 times. Many cohorts accessed this page, but Jupiter’s count was notably higher.   

The Staff cohort accessed the javascript-i page 1817 times; the next closest was the Ceres cohort, with 1003 accesses.
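The comparison behind this takeaway can be sketched as follows: for a given page, count hits per cohort and compare the leader with the runner-up. For javascript-i, 1817 against 1003 is a ratio of roughly 1.8. The function name `top_two_cohorts` is hypothetical, and the sketch assumes the `name` column holds the cohort name, as in the `df.head(1)` output.

```python
import pandas as pd


def top_two_cohorts(df, page):
    """Return hit counts for the two cohorts that accessed `page` most."""
    hits = df.loc[df["url_path"] == page, "name"].value_counts()
    return hits.head(2)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "name": ["Staff"] * 3 + ["Ceres"] * 2 + ["Teddy"],
    "url_path": ["javascript-i"] * 6,
})
leaders = top_two_cohorts(sample, "javascript-i")
# How many times more often the leading cohort hit the page than the runner-up.
ratio = leaders.iloc[0] / leaders.iloc[1]
```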

### 3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

### 4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?

### 5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 847330 entries, 0 to 847329
Data columns (total 15 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   date              847330 non-null  datetime64[ns]
 1   time              847330 non-null  object        
 2   url_path          847329 non-null  object        
 3   user_id           847330 non-null  int64         
 4   cohort_id         847330 non-null  float64       
 5   ip                847330 non-null  object        
 6   id                847330 non-null  int64         
 7   name              847330 non-null  object        
 8   start_date        847330 non-null  object        
 9   end_date          847330 non-null  object        
 10  program_id        847330 non-null  int64         
 11  php_cohort        847330 non-null  int64         
 12  fs_java_cohort    847330 non-null  int64         
 13  ds_cohort         847330 non-null  int64         
 14  fron

In [7]:
f.check_permissions(df, 'url_path')

             date      time                      url_path  user_id  cohort_id  \
328347 2019-09-23  11:45:09                      java-iii      476       34.0   
330826 2019-09-25  19:30:44                  javascript-i      476       34.0   
330828 2019-09-25  19:31:07                  javascript-i      476       34.0   
330829 2019-09-25  19:31:11                        java-i      476       34.0   
330830 2019-09-25  19:31:12                  javascript-i      476       34.0   
330831 2019-09-25  19:31:14                        java-i      476       34.0   
330832 2019-09-25  19:31:19                       java-ii      476       34.0   
330837 2019-09-25  19:32:23                  javascript-i      476       34.0   
330838 2019-09-25  19:32:34                        java-i      476       34.0   
330839 2019-09-25  19:32:38                        java-i      476       34.0   
330840 2019-09-25  19:32:44             java-i/console-io      476       34.0   
377469 2019-11-25  14:26:14 
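One way to look for cross-curriculum access is sketched below. It is not the body of `f.check_permissions`, which is not shown. It assumes that `program_id` 3 is the data science program and that the listed top-level path names belong to the web development curriculum; both are assumptions, and `cross_access` is a hypothetical name.

```python
import pandas as pd

# Assumption: top-level path segments that belong to the web development curriculum.
WEB_DEV_ROOTS = {"java-i", "java-ii", "java-iii", "javascript-i", "spring", "html-css"}


def cross_access(df, program_id, foreign_roots):
    """Rows where a user in `program_id` opened a page under `foreign_roots`."""
    # First path segment, e.g. "java-i/console-io" -> "java-i".
    root = df["url_path"].fillna("").str.split("/").str[0]
    mask = (df["program_id"] == program_id) & root.isin(foreign_roots)
    return df[mask]


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2019-09-23", "2019-09-25", "2019-11-25"]),
    "url_path": ["java-iii", "classification/overview", "java-i/console-io"],
    "program_id": [3, 3, 3],
})
flagged = cross_access(sample, 3, WEB_DEV_ROOTS)
# Monthly counts make a cutoff visible as a drop to zero.
per_month = flagged.groupby(flagged["date"].dt.to_period("M")).size()
```

If access was shut off at some point, the monthly counts should fall to zero from that month onward.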

### 6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?

In [8]:
f.post_grad(df, 1, 'url_path')

javascript-i    736
html-css        542
spring          501
java-iii        479
java-ii         454
Name: url_path, dtype: int64



In [9]:
f.post_grad(df, 2, 'url_path')

javascript-i    4229
spring          3760
html-css        3136
java-iii        3058
java-ii         2985
Name: url_path, dtype: int64



In [10]:
f.post_grad(df, 3, 'url_path')

sql/mysql-overview                                275
classification/overview                           266
classification/scale_features_or_not.svg          219
anomaly-detection/AnomalyDetectionCartoon.jpeg    193
anomaly-detection/overview                        191
Name: url_path, dtype: int64



In [11]:
f.post_grad(df, 4, 'url_path')

content/html-css                               2
content/html-css/gitbook/images/favicon.ico    1
content/html-css/introduction.html             1
Name: url_path, dtype: int64
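The helper `f.post_grad` is defined in the project's `functions` module and is not shown. A sketch of what such a filter might do, assuming a hit counts as post-graduation when its `date` falls after the cohort's `end_date` (the name `post_grad_sketch` is hypothetical):

```python
import pandas as pd


def post_grad_sketch(df, program_id, column, n=5):
    """Top `n` values of `column` among hits logged after the cohort's end_date."""
    end = pd.to_datetime(df["end_date"])
    after_grad = (df["program_id"] == program_id) & (df["date"] > end)
    return df.loc[after_grad, column].value_counts().head(n)


# Made-up rows for illustration only; the last row predates graduation.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-26", "2018-01-27", "2016-01-05"]),
    "end_date": ["2016-02-06", "2016-02-06", "2016-02-06"],
    "program_id": [1, 1, 1],
    "url_path": ["javascript-i", "javascript-i", "spring"],
})
top = post_grad_sketch(sample, 1, "url_path")
```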



### 7. Which lessons are least accessed?

In [12]:
f.least_total(df, 'url_path', 3, 5)

content/appendix/further-reading/gitbook/images/favicon.ico    6
2.03.06_CorrelationTests                                       6
servlets                                                       6
Name: url_path, dtype: int64
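The arguments to `f.least_total` are not documented here. Since every page listed has 6 hits and the last argument is 5, one guess is that pages with 5 or fewer hits are dropped as noise before the smallest counts are taken. A sketch under that assumption (the name `least_accessed` and its parameters are hypothetical):

```python
import pandas as pd


def least_accessed(df, column, min_hits, n=3):
    """The `n` least-hit values of `column` that have more than `min_hits` hits."""
    counts = df[column].value_counts()
    # Drop pages at or below the noise threshold, then take the smallest counts.
    return counts[counts > min_hits].nsmallest(n)


# Made-up rows for illustration only.
sample = pd.DataFrame({"url_path": ["a"] * 3 + ["b"] + ["c"] * 2 + ["d"] * 5})
result = least_accessed(sample, "url_path", min_hits=1, n=2)
```

A threshold like this keeps one-off hits on assets such as favicons from dominating the list, although asset paths that clear the threshold would still need to be filtered separately.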



In [13]:
f.least_cohort(df, 'url_path', 1, 5)

url_path               name     url_path             
mysql/sample-database  Hampton  mysql/sample-database    6
Name: url_path, dtype: int64

url_path                   name   url_path                 
prework/cli/03-file-paths  Teddy  prework/cli/03-file-paths    6
Name: url_path, dtype: int64

url_path                               name     url_path                             
examples/javascript/dom-query-js.html  Sequoia  examples/javascript/dom-query-js.html    6
Name: url_path, dtype: int64

url_path                                  name    url_path                                
prework/cli/04-navigating-the-filesystem  Arches  prework/cli/04-navigating-the-filesystem    6
Name: url_path, dtype: int64

url_path                name     url_path              
appendix/documentation  Niagara  appendix/documentation    6
Name: url_path, dtype: int64

url_path                             name       url_path                           
javascript-i/bom-and-dom/dom-events  Pinnacles 