

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?

In [1]:
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
import seaborn as sns

from env import host, user, password

# DBSCAN import
from sklearn.cluster import DBSCAN

import acquire
import prepare

In [2]:
#acquire sql query
df = acquire.acquire()

In [3]:
# see prepare for cleaning
df = prepare.clean_cohort(df)

In [4]:
# split df into web dev and data science dataframes
df_wd, df_ds = prepare.program_split(df)

## Prepare
   #### Prepare Summary:
   - Acquired data from the SQL server and cached it locally as a CSV
   - Dropped unnecessary columns
   - Removed staff from the data
   - Removed nulls
   - Grouped course material into a lesson column
   - Created url and subpath columns by splitting path on /
   - Created a counter column to help count IP hits
   - Split data into df_wd (web dev) and df_ds (data science)
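
The `prepare` module is not shown in this notebook. As a minimal sketch of the path-derived columns in the list above, assuming `subpath` is the list of path segments, `url` is the first segment, and `counter` is a constant 1 per row (the function name `add_path_columns` is hypothetical, not the author's code):

```python
import pandas as pd


def add_path_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive subpath, url and counter from `path`."""
    out = df.copy()
    # "javascript-i/conditionals" -> ["javascript-i", "conditionals"]
    out["subpath"] = out["path"].str.strip("/").str.split("/")
    # the first segment identifies the top-level section of the curriculum
    out["url"] = out["subpath"].str[0]
    # one row is one hit, so summing `counter` gives hit counts
    out["counter"] = 1
    return out


demo = pd.DataFrame(
    {"path": ["javascript-i", "javascript-i/conditionals", "4-python/project"]}
)
demo = add_path_columns(demo)
```

The resulting `url` and `subpath` values have the same shape as the ones visible in the cell outputs further down.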

## 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?

## Data Science Lessons (14, taught to students):
- fundamentals
- sql
- capstones
- python
- regression
- advanced-topics
- classification
- distributed-ml
- status
- clustering
- time_series
- anomaly
- nlp
- storytelling

## Data Science Non-Lessons (paths that are not curriculum)
- pending
- appendix

## Web Dev Lessons:
- java
- javascript
- toc
- html-css
- spring
- jquery
- mysql
- structure
- php
- laravel
- fundamentals

## Web Dev Non-Lessons:
- search
- index

In [5]:
# DS Lessons count from top
df_ds.groupby('lesson')[['user_id']].agg('count').sort_values(by='user_id', ascending=False).head(6)

lesson,user_id
fundamentals,16691
sql,13685
classification,11841
python,10492
pending,8940
regression,7471


In [6]:
# see counts for web dev lessons from top
df_wd.groupby('lesson')[['user_id']].agg('count').sort_values(by='user_id', ascending=False).head(6)

lesson,user_id
javascript,149227
java,138797
html-css,92725
mysql,71913
pending,58516
spring,52344


## 5 Most Popular Lessons for Data Science
- Fundamentals
- SQL
- Classification
- Python
- Regression



## 5 Most Popular Lessons for Web Dev
- Javascript
- Java
- HTML-CSS
- MySQL
- Spring
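
The totals above rank lessons by overall traffic, which a single large cohort could dominate. To check consistency across cohorts, one option is to rank lessons inside each cohort and count how many cohorts place a lesson near the top. This is a sketch under assumptions: it uses the `name` (cohort) and `lesson` columns visible in the outputs, the function name and the `top_n` cutoff are hypothetical, and the demo rows are toy data.

```python
import pandas as pd


def top_lesson_consistency(
    df: pd.DataFrame,
    non_lessons=("pending", "appendix", "search", "index"),
    top_n: int = 3,
) -> pd.Series:
    """Count, per lesson, how many cohorts rank it among their top_n lessons."""
    lessons = df[~df["lesson"].isin(non_lessons)]
    # hits per (cohort, lesson)
    hits = lessons.groupby(["name", "lesson"]).size().rename("hits").reset_index()
    # rank lessons inside each cohort, 1 = most visited
    hits["rank"] = hits.groupby("name")["hits"].rank(method="min", ascending=False)
    top = hits[hits["rank"] <= top_n]
    return top["lesson"].value_counts()


# toy data: two made-up cohorts
demo = pd.DataFrame(
    {
        "name": ["cohort_a"] * 4 + ["cohort_b"] * 3,
        "lesson": ["sql", "sql", "python", "pending", "sql", "sql", "fundamentals"],
    }
)
consistency = top_lesson_consistency(demo, top_n=1)
```

A lesson whose count is close to the number of cohorts in the program is popular consistently, not just in aggregate. It would be applied as `top_lesson_consistency(df_ds)` and `top_lesson_consistency(df_wd)`.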


## 2. Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
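
No analysis for this question survives in the notebook. One possible starting point, offered as a sketch and not as the author's method, is to compare each cohort's share of hits per lesson with the median share across cohorts. The function name is hypothetical and the demo rows are toy data.

```python
import pandas as pd


def lesson_share_gaps(df: pd.DataFrame) -> pd.DataFrame:
    """Gap between a cohort's share of hits on a lesson and the median cohort's share."""
    # rows: cohorts, columns: lessons, values: fraction of that cohort's hits
    share = pd.crosstab(df["name"], df["lesson"], normalize="index")
    # positive gap: the cohort leaned on the lesson more than a typical cohort
    gaps = share - share.median(axis=0)
    long = gaps.reset_index().melt(id_vars="name", var_name="lesson", value_name="gap")
    return long.sort_values("gap", ascending=False, ignore_index=True)


# toy data: cohort_a spends 75% of its hits on sql, the others 50%
demo = pd.DataFrame(
    {
        "name": ["cohort_a"] * 4 + ["cohort_b"] * 2 + ["cohort_c"] * 2,
        "lesson": ["sql", "sql", "sql", "python", "sql", "python", "sql", "python"],
    }
)
gaps = lesson_share_gaps(demo)
```

Using shares instead of raw counts keeps large cohorts from standing out only because they generated more traffic overall.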

## 3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [7]:
# most recent date in the dataset
df.date.max()

'2021-04-21'

These 5 cohorts are not fully represented in the dataset because their end dates fall after its last entry (2021-04-21):


2021-05-04 ---- Kalypso   ----    23691

2021-06-08 ---- Luna     ----     16623

2021-07-19 ---- Marco    ----     16397

2021-09-03 ---- Neptune    ----    7276

2021-10-01 ---- Oberon   ----      1672
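
To find students who were enrolled but rarely opened the curriculum, a sketch along these lines could count hits made between each student's cohort start and end dates. It assumes `date`, `start_date` and `end_date` are still regular columns at this point in the notebook; the function name and the `max_hits` threshold are hypothetical, and the demo rows are toy data.

```python
import pandas as pd


def low_activity_students(df: pd.DataFrame, max_hits: int = 10) -> pd.DataFrame:
    """Users with at most `max_hits` hits between their cohort start and end dates."""
    out = df.copy()
    for col in ("date", "start_date", "end_date"):
        out[col] = pd.to_datetime(out[col])
    # keep only hits made while the student was enrolled
    enrolled = out[(out["date"] >= out["start_date"]) & (out["date"] <= out["end_date"])]
    hits = enrolled.groupby(["user_id", "name"]).size().rename("hits").reset_index()
    return hits[hits["hits"] <= max_hits].sort_values("hits", ignore_index=True)


# toy data: user 1 has three hits while enrolled, user 2 only one
demo = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 2],
        "name": ["cohort_a"] * 6,
        "date": ["2019-09-01", "2019-09-02", "2019-09-03",
                 "2019-09-01", "2020-03-01", "2020-03-02"],
        "start_date": ["2019-08-19"] * 6,
        "end_date": ["2020-01-30"] * 6,
    }
)
quiet = low_activity_students(demo, max_hits=2)
```

Users with no hits at all during enrollment never appear in the log, so they cannot be found this way. The five cohorts listed above would also look artificially quiet because their enrollment window extends past the data.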

## 4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
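
`DBSCAN` is imported at the top of the notebook, but no detection code survives for this question. As a simpler starting point (an interquartile-range rule, not DBSCAN), the sketch below flags user-days with unusually many hits. The function name and the multiplier `k` are hypothetical, and the demo rows are toy data.

```python
import pandas as pd


def flag_heavy_days(df: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Flag (user, day) pairs whose hit count exceeds Q3 + k * IQR of all daily counts."""
    daily = df.groupby(["user_id", "date"]).size().rename("hits").reset_index()
    q1, q3 = daily["hits"].quantile([0.25, 0.75])
    upper = q3 + k * (q3 - q1)
    flagged = daily[daily["hits"] > upper]
    return flagged.sort_values("hits", ascending=False, ignore_index=True)


# toy data: eight users with 2 hits each, one user with 50 hits on the same day
rows = [(u, "2019-11-26") for u in range(1, 9) for _ in range(2)]
rows += [(99, "2019-11-26")] * 50
demo = pd.DataFrame(rows, columns=["user_id", "date"])
flagged = flag_heavy_days(demo)
```

Flagged rows are only candidates for review. Grouping by `ip` instead of `user_id` would give the same view per IP address.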

## 5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?

In [8]:
# convert date to datetime
df_ds.date = pd.to_datetime(df_ds.date)
# use date as the index so rows can be sliced by year
df_ds = df_ds.set_index('date')
# keep only 2019
ds_2019 = df_ds.loc['2019']

# last 2019 hits by data science users on web dev material
ds_2019_url = ds_2019[ds_2019.url == 'javascript-i']
ds_2019_url.tail()

date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson
2019-11-26,10:55:05,javascript-i,476,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending
2019-11-26,15:32:05,javascript-i/conditionals,472,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,"[javascript-i, conditionals]",1,pending
2019-12-03,10:04:40,javascript-i,467,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending
2019-12-03,11:49:29,javascript-i,467,34.0,3,97.105.19.58,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending
2019-12-14,16:46:24,javascript-i,476,34.0,3,136.50.49.145,Bayes,2019-08-19,2020-01-30,2019-08-20 14:38:55,data_science,javascript-i,[javascript-i],1,pending


In [9]:
# last 2019 hits by web dev users on data science material
df_wd.date = pd.to_datetime(df_wd.date)
# use date as the index so rows can be sliced by year
df_wd = df_wd.set_index('date')
wd_2019 = df_wd.loc['2019']
wd_2019_url = wd_2019[wd_2019.url == '4-python']
wd_2019_url.tail()

date,time,path,user_id,cohort_id,program_id,ip,name,start_date,end_date,created_at,program,url,subpath,counter,lesson
2019-11-12,14:25:49,4-python/project,458,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,4-python,"[4-python, project]",1,pending
2019-11-12,14:25:52,4-python/3-data-types-and-variables,458,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,4-python,"[4-python, 3-data-types-and-variables]",1,pending
2019-11-12,14:26:01,4-python/4-control-structures,458,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,4-python,"[4-python, 4-control-structures]",1,pending
2019-11-12,14:26:30,4-python/5-functions,458,33.0,2,97.105.19.58,Ceres,2019-07-15,2019-12-11,2019-07-15 16:57:21,web_dev,4-python,"[4-python, 5-functions]",1,pending
2019-12-22,19:45:47,4-python/intro-to-sklearn,18,22.0,2,45.20.117.182,Teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,web_dev,4-python,"[4-python, intro-to-sklearn]",1,pending



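- The last 2019 access to DS material (`4-python`) by a WD user was on 2019-12-22 (user 18, Teddy cohort, graduated 2018-05-17).
- The last 2019 access to WD material (`javascript-i`) by a DS user was on 2019-12-14 (user 476, Bayes cohort).

For the two URLs checked, cross-program access continued into the second half of December 2019, so no cutoff is visible before the last two weeks of the year. A few users also show rapid page-to-page access that could indicate scraping: user 458, for example, requested four `4-python` pages within 41 seconds on 2019-11-12.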
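
The checks above look at one URL per program. To see when cross-program access stops, a monthly count over a set of URLs may be easier to read. This sketch assumes `date` is a regular column (for example after `reset_index()`) and that the caller supplies the set of URLs belonging to the other program; `cross_access_by_month` and `other_urls` are hypothetical names, and the demo rows are toy data.

```python
import pandas as pd


def cross_access_by_month(df: pd.DataFrame, other_urls: set) -> pd.Series:
    """Monthly count of hits on URLs that belong to the other program."""
    hits = df[df["url"].isin(other_urls)]
    months = pd.to_datetime(hits["date"]).dt.to_period("M")
    return hits.groupby(months).size()


# toy data: three hits on web dev material and one on data science material
demo = pd.DataFrame(
    {
        "date": ["2019-11-26", "2019-12-03", "2019-12-14", "2019-12-15"],
        "url": ["javascript-i", "javascript-i", "javascript-i", "sql"],
    }
)
monthly = cross_access_by_month(demo, {"javascript-i"})
```

Months without any cross-program hits do not appear in the result, so a cutoff shows up as the point where the series ends or drops sharply. Running it without the 2019 filter would also show whether access was already blocked in earlier periods.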

## 6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?

In [None]:
# cells 8 and 9 moved date into the index, so df_wd.date and df_ds.date no longer exist;
# restore date as a regular column (it is already datetime)
df_wd = df_wd.reset_index()
df_ds = df_ds.reset_index()

In [None]:
# convert end_date to datetime from each frame's own column
# (assigning from df.end_date misaligns the indexes and yields NaT)
df_wd['end_date'] = pd.to_datetime(df_wd['end_date'])
df_ds['end_date'] = pd.to_datetime(df_ds['end_date'])

In [None]:
# checking timeframe for web_dev
df_wd.end_date.min(), df_wd.end_date.max()

In [None]:
# add to prepare.py
df_wd['datetime'] = pd.to_datetime(df_wd['date'])
df_ds['datetime'] = pd.to_datetime(df_ds['date'])

In [None]:
# one dataframe per program with only the hits made after graduation
cols = ['path', 'user_id', 'end_date', 'datetime', 'url', 'lesson']
wd_diff_time = df_wd[df_wd.date > df_wd.end_date][cols]
ds_diff_time = df_ds[df_ds.date > df_ds.end_date][cols]

In [None]:
# top lessons Data Science alumni reference
pd.DataFrame(ds_diff_time['lesson'].value_counts(ascending = False)).head(4)


In [None]:
# top lessons Web Dev alumni reference
pd.DataFrame(wd_diff_time['lesson'].value_counts(ascending = False)).head(4)

Web Dev:

- java: 20304
- javascript: 19794
- pending: 15290
- html-css: 13914

DS:

- sql: 1555
- fundamentals: 1514
- pending: 1472
- classification: 1312

## 7. Which lessons are least accessed?

In [None]:
# DS Lessons count from bottom
df_ds.groupby('lesson')[['user_id']].agg('count').sort_values(by='user_id', ascending=False).tail(7)

In [None]:
# see counts for web dev lessons from bottom
df_wd.groupby('lesson')[['user_id']].agg('count').sort_values(by='user_id', ascending=False).tail(8)

## 5 Least Popular Lessons for Data Science
- time_series
- storytelling
- nlp
- distributed-ml
- advanced-topics

## 5 Least Popular Lessons for Web Dev
- jquery
- toc
- fundamentals
- structure
- php
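
The `tail(7)` and `tail(8)` calls above include non-lesson labels, which then have to be skipped by eye. A small helper could filter them out first. This is a sketch: `least_accessed` and `NON_LESSONS` are hypothetical names, the label set is taken from the non-lesson lists earlier in the notebook, and the demo rows are toy data.

```python
import pandas as pd

# non-curriculum labels listed earlier in the notebook
NON_LESSONS = {"pending", "appendix", "search", "index"}


def least_accessed(df: pd.DataFrame, n: int = 5) -> pd.Series:
    """Hit counts for the n least-visited lessons, ignoring non-lesson labels."""
    counts = df.loc[~df["lesson"].isin(NON_LESSONS), "lesson"].value_counts()
    return counts.nsmallest(n)


# toy data
demo = pd.DataFrame(
    {"lesson": ["sql", "sql", "sql", "nlp", "storytelling", "storytelling", "pending"]}
)
bottom = least_accessed(demo, n=2)
```

It would be called as `least_accessed(df_ds)` and `least_accessed(df_wd)`.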